
International University of Japan

Public Management & Policy Analysis Program

Practical Guides To Panel Data Modeling: A Step-by-step Analysis Using Stata*

Hun Myoung Park, Ph.D.
kucc625@iuj.ac.jp

  1. Introduction
  2. Preparing Panel Data
  3. Basics of Panel Data Models
  4. Pooled OLS and LSDV
  5. Fixed Effect Model
  6. Random Effect Model
  7. Hausman Test and Chow Test
  8. Presenting Panel Data Models
  9. Conclusion
References
© 2011

Last modified on October 2011

Public Management and Policy Analysis Program
Graduate School of International Relations
International University of Japan

777 Kokusai-cho Minami Uonuma-shi, Niigata 949-7277, Japan
(025) 779-1424
http://www.iuj.ac.jp/faculty/kucc625

* The citation of this document should read: “Park, Hun Myoung. 2011. Practical Guides To Panel Data Modeling: A Step-by-step Analysis Using Stata. Tutorial Working Paper. Graduate School of International Relations, International University of Japan.” This document is based on Park, Hun Myoung. 2005-2009. Linear Regression Models for Panel Data Using SAS, Stata, LIMDEP, and SPSS. The University Information Technology Services (UITS) Center for Statistical and Mathematical Computing, Indiana University.

1. Introduction

Panel data are also called longitudinal data or cross-sectional time-series data. These longitudinal data have “observations on the same units in several different time periods” (Kennedy, 2008: 281); a panel data set has multiple entities, each of which has repeated measurements at different time periods. Panel data may have individual (group) effects, time effects, or both, which are analyzed by fixed effect and/or random effect models.

The U.S. Census Bureau’s Census 2000 data at the state or county level are cross-sectional but not time-series, while annual sales figures of Apple Computer Inc. for the past 20 years are time-series but not cross-sectional. The cumulative Census data at the state level for the past 20 years are longitudinal. If annual sales data of Apple, IBM, LG, Siemens, Microsoft, Sony, and AT&T for the past 10 years are available, they are panel data. The National Longitudinal Survey of Labor Market Experience (NLS) and the Michigan Panel Study of Income Dynamics (PSID) data are cross-sectional and time-series, while the cumulative General Social Survey (GSS) and American National Election Studies (ANES) data are not, in the sense that individual respondents vary across survey years.

As more and more panel data become available, many scholars, practitioners, and students have become interested in panel data modeling because these longitudinal data have more variability and allow us to explore more issues than do cross-sectional or time-series data alone (Kennedy, 2008: 282). As Baltagi (2001) puts it, “Panel data give more informative data, more variability, less collinearity among the variables, more degrees of freedom and more efficiency” (p.). Given well-organized panel data, panel data models are definitely attractive and appealing since they provide ways of dealing with heterogeneity and of examining fixed and/or random effects in longitudinal data.

However, panel data modeling is not as easy as it sounds. A common misunderstanding is that fixed and/or random effect models should always be employed whenever your data are arranged in the panel data format. The problems of panel data modeling, by and large, come from 1) the panel data themselves, 2) the modeling process, and 3) the interpretation and presentation of the results. Some studies analyze poorly organized panel data (in fact, they are not longitudinal in a strong econometric sense) and others mechanically apply fixed and/or random effect models in haste without considering the relevance of such models. Careless researchers often fail to interpret the results correctly and to present them appropriately.

The motivation of this document is several IUJ master’s theses that, I think, applied panel data models inappropriately and failed to interpret the results correctly. This document is intended to provide practical guides to panel data modeling, in particular for writing a master’s thesis. Students can learn how to 1) organize panel data, 2) recognize and handle ill-organized data, 3) choose a proper panel data model, 4) read and report Stata output correctly, 5) interpret the results substantively, and 6) present the results in a professional manner.

In order to avoid unnecessary complication, this document mainly focuses on linear regression models rather than nonlinear models (e.g., binary response and event count data models) and on balanced data rather than unbalanced data. Hopefully this document will be a good companion for those who want to analyze panel data for their master’s theses at IUJ. Let us begin with preparing and evaluating panel data.

time variable: year, 1 to 15
delta: 1 unit

Let us first explore descriptive statistics of panel data. Run .xtsum to obtain summary statistics. The total number of observations is 90 because there are 6 units (entities) and 15 time periods. The overall mean (13) and standard deviation (1) of total cost below are the same as those in the .sum output above.

. xtsum cost output fuel load

Variable         |      Mean   Std. Dev.        Min        Max |    Observations
-----------------+----------------------------------------------+----------------
cost     overall |        13           1         11         15 |    N =       90
         between |              .9978636         12         14 |    n =        6
         within  |              .6650252         12         14 |    T =       15
                 |                                              |
output   overall |        -1           1         -3   .6608616 |    N =       90
         between |                     1         -2   .3192696 |    n =        6
         within  |              .4208405         -1   .1339861 |    T =       15
                 |                                              |
fuel     overall |        12    .8123749         11         13 |    N =       90
         between |              .0237151         12         12 |    n =        6
         within  |              .8120832         11         13 |    T =       15
                 |                                              |
load     overall |  .5604602    .0527934    .432066    .676287 |    N =       90
         between |              .0281511   .5197756   .5971917 |    n =        6
         within  |              .0460361   .4368492   .6581019 |    T =       15

Note that Stata lists three different types of statistics: overall, between, and within. Overall statistics are ordinary statistics that are based on all 90 observations. “Between” statistics are calculated on the basis of summary statistics of the six airlines (entities) regardless of time period, while “within” statistics are calculated on the basis of the 15 time periods within each airline, regardless of airline.
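To see where the “between” and “within” numbers come from, the group means and deviations can be computed by hand. This is a minimal sketch, assuming the airline data above are loaded; up to degrees-of-freedom conventions, the results match the .xtsum rows.

. egen mean_cost = mean(cost), by(airline)   // airline (group) means
. egen tag = tag(airline)                    // flag one observation per airline
. summarize mean_cost if tag                 // "between" statistics across the 6 group means
. generate dev_cost = cost - mean_cost       // deviations from group means
. summarize dev_cost                         // "within" variation (15 periods per airline)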

2.2 Type of Panel Data

A panel data set contains n entities or subjects, each of which includes T observations measured at time periods 1 through T. Thus, the total number of observations in the panel data is nT. Ideally, panel data are measured at regular time intervals (e.g., year, quarter, and month). Otherwise, panel data should be analyzed with caution. A panel may be long or short, balanced or unbalanced, and fixed or rotating.

2.2.1 Long versus Short Panel Data

A short panel has many entities (large n) but few time periods (small T), while a long panel has many time periods (large T) but few entities (Cameron and Trivedi, 2009: 230). Accordingly, a short panel data set is wide in width (cross-sectional) and short in length (time-series), whereas a long panel is narrow in width and long in length. Both too small N (Type I error) and too large N (Type II error) problems matter. Researchers should be very careful especially when examining either a short or a long panel.

2.2.2 Balanced versus Unbalanced Panel Data

In a balanced panel, all entities have measurements in all time periods. In a contingency table (or cross-table) of cross-sectional and time-series variables, each cell should have only one frequency. Therefore, the total number of observations is nT. This tutorial document assumes that we have a well-organized balanced panel data set.

When each entity in a data set has different numbers of observations, the panel data are not balanced. Some cells in the contingency table have zero frequency. Accordingly, the total number of observations is not nT in an unbalanced panel. Unbalanced panel data entail some computation and estimation issues although most software packages are able to handle both balanced and unbalanced data.

2.2.3 Fixed versus Rotating Panel Data

If the same individuals (or entities) are observed for each period, the panel data set is called a fixed panel (Greene 2008: 184). If a set of individuals changes from one period to the next, the data set is a rotating panel. This document assumes a fixed panel.

2.3 Data Arrangement: Long versus Wide Form in Stata

A typical panel data set has a cross-section (entity or subject) variable and a time-series variable. In Stata, this arrangement is called the long form (as opposed to the wide form). While the long form has both individual (e.g., entity or group) and time variables, the wide form includes either an individual variable or a time variable. Most statistical software packages assume that panel data are arranged in the long form.

The following data set shows a typical panel data arrangement. Yes, this is a long form. There are 6 entities (airline) and 15 time periods (year). 2

. list airline year load cost output fuel in 1/20, sep(20)

+------------------------------------------------------------+

airline year load cost output fuel
  1. | 1 1 .534487 13 - 11 |
  2. | 1 2 .532328 14 - 11 |
  3. | 1 3 .547736 14 .0879925 11 |
  4. | 1 4 .540846 14 .1619318 11 |
  5. | 1 5 .591167 14 .1485665 12 |
  6. | 1 6 .575417 14 .1602123 12 |
  7. | 1 7 .594495 14 .2550375 12 |
  8. | 1 8 .597409 14 .3297856 12 |
  9. | 1 9 .638522 14 .4779284 12 |
  10. | 1 10 .676287 14 .6018211 13 |
  11. | 1 11 .605735 15 .4356969 13 |
  12. | 1 12 .61436 15 .4238942 13 |
  13. | 1 13 .633366 15 .5069381 13 |
  14. | 1 14 .650117 15 .6001049 13 |
  15. | 1 15 .625603 15 .6608616 13 |
  16. | 2 1 .490851 13 - 11 |
  17. | 2 2 .473449 13 - 11 |
  18. | 2 3 .503013 13 - 11 |
  19. | 2 4 .512501 13 - 11 |
  20. | 2 5 .566782 14 - 12 |
+------------------------------------------------------------+

If data are structured in the wide form, you need to rearrange the data first. Stata has the .reshape command to rearrange a data set back and forth between long and wide forms. The following .reshape with wide changes the long form to the wide one, so that the resulting data set in the wide form has only six observations but in turn includes an identification (entity) variable and one variable per time period.

2 The .list command lists data items of individual observations. The in 1/20 of this command displays

data of the first 20 observations, and the sep(20) option inserts a horizontal separator line every 20 observations rather than the default of every 5 lines.
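As a minimal sketch (variable names taken from the airline data above), the following commands switch between the two forms:

. reshape wide cost output fuel load, i(airline) j(year)   // long -> wide: 6 obs, cost1-cost15, etc.
. reshape long cost output fuel load, i(airline) j(year)   // wide -> long: back to 90 obs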

  • Check whether the measurement methods employed are consistent. Measurements are not commensurable if 1) some entities were measured by method A and other entities by method B, 2) some time periods were measured by method C and other periods by method D, and/or 3) both 1) and 2) are mixed. 3
  • Be careful when you “darn” your data set by combining data sets measured and built by different institutions that employed different methods. This circumstance is quite understandable because a perfect data set is rarely ready for you; in many cases, you need to combine several sources of information to build a new data set for your research.

Another issue is whether the number of entities and/or time periods is too small or too large. It is less valuable to contrast one group (or time period) with just one or two others in the panel data framework (e.g., T=3). By contrast, comparing millions of individuals or time periods is almost useless because of the high likelihood of Type II error. The latter task amounts to arguing that at least one company out of 1 million firms in the world has a different productivity. Is this argument interesting to you? We already know that! In case of too large N (specifically n or T), you might try to reclassify individuals or time periods into several meaningful categories; for example, classify millions of individuals by their citizenship or ethnic group (e.g., white, black, Asian, and Spanish).

Finally, many missing values are likely to lower the quality of panel data. So-called listwise deletion (an entire record is excluded from analysis if any single value of a variable is missing) tends to reduce the number of observations used in a model and thus weakens the statistical power of a test. This issue is also related to the discussion of balanced versus unbalanced panel data.
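As a quick sketch, assuming the panel has been declared with .xtset (or .tsset) and uses the airline variable names, balance and missingness can be inspected before modeling:

. xtdescribe                                   // participation patterns: balanced vs unbalanced
. count if missing(cost, output, fuel, load)   // observations with any missing value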

Once a well-organized panel data set is prepared, we can move on to discuss the panel data models that are used to analyze fixed and/or random effects embedded in longitudinal data.

3 Assume that methods A and B, and methods C and D, are not comparable to each other in terms of scale and unit of measurement.

3. Basics of Panel Data Models

Panel data models examine group (individual-specific) effects, time effects, or both in order to deal with heterogeneity or individual effects that may or may not be observed. 4 These effects are either fixed or random. A fixed effect model examines whether intercepts vary across groups or time periods, whereas a random effect model explores differences in error variance components across individuals or time periods. A one-way model includes only one set of dummy variables (e.g., firm1, firm2, ...), while a two-way model considers two sets of dummy variables (e.g., city1, city2, ... and year1, year2, ...).

This section follows Greene’s (2008) notation with some modifications, such as lower-case k (the number of regressors excluding the intercept term; he uses K instead), wit (the composite error term), and vit (the traditional error term; he uses εit).

3.1 Pooled OLS

If individual effect ui (cross-sectional or time specific effect) does not exist (ui =0), ordinary least squares (OLS) produces efficient and consistent parameter estimates.

yit = α + Xit'β + εit (ui = 0)

OLS consists of five core assumptions (Greene, 2008: 11-19; Kennedy, 2008: 41-42).

  1. Linearity says that the dependent variable is formulated as a linear function of a set of independent variables and the error (disturbance) term.
  2. Exogeneity says that the expected value of disturbances is zero, or that disturbances are not correlated with any regressors.
  3. Disturbances have the same variance (homoskedasticity) and are not related to one another (nonautocorrelation).
  4. The observations on the independent variables are not stochastic but fixed in repeated samples without measurement errors.
  5. The full rank assumption says that there is no exact linear relationship among independent variables (no multicollinearity).

If the individual effect ui is not zero in longitudinal data, heterogeneity (individual-specific characteristics like intelligence and personality that are not captured by regressors) may violate assumptions 2 and 3. In particular, disturbances may not have the same variance but vary across individuals (heteroskedasticity, violation of assumption 3) and/or may be related to one another (autocorrelation, also a violation of assumption 3). This is an issue of a nonspherical variance-covariance matrix of disturbances. The violation of assumption 2 renders random effect estimators biased. Hence, the OLS estimator is no longer the best linear unbiased estimator. Panel data models provide a way to deal with these problems.

3.2 Fixed versus Random Effects

Panel data models examine fixed and/or random effects of individual or time. The core difference between fixed and random effect models lies in the role of dummy variables

4 Country, state, agency, firm, respondent, employee, and student are examples of a unit (individual or entity),

whereas year, quarter, month, week, day, and hour can be examples of a time period.

test, the pooled OLS regression is favored. The Hausman specification test (Hausman, 1978) compares a random effect model to its fixed counterpart. If the null hypothesis that the individual effects are uncorrelated with the other regressors is not rejected, a random effect model is favored over its fixed counterpart.

If one cross-sectional or time-series variable is considered (e.g., country, firm, or race), this is called a one-way fixed or random effect model. Two-way effect models have two sets of dummy variables for individual and/or time variables (e.g., state and year) and thus entail some issues in estimation and interpretation.

3.3 Estimating Fixed Effect Models

There are several strategies for estimating a fixed effect model. The least squares dummy variable (LSDV) model uses dummy variables, whereas the “within” estimation does not. These strategies, of course, produce identical parameter estimates of the regressors (non-dummy independent variables). The “between” estimation fits a model using individual or time means of the dependent and independent variables without dummies.

LSDV with one dummy dropped out of the set of dummies is widely used because it is relatively easy to estimate and to interpret substantively. This LSDV, however, becomes problematic when there are many individuals (or groups) in the panel data. If T is fixed and n→∞ (n is the number of groups or firms and T is the number of time periods), parameter estimates of the regressors are consistent but the coefficients of the individual effects, α + ui, are not (Baltagi, 2001: 14). In this short panel, LSDV includes a large number of dummy variables; the number of these parameters to be estimated increases as n increases (the incidental parameter problem); therefore, LSDV loses n degrees of freedom and returns less efficient estimators (p.). Under this circumstance, LSDV is useless and thus calls for another strategy, the within effect estimation.

Unlike LSDV, the “within” estimation does not need dummy variables, but it uses deviations from group (or time period) means. That is, “within” estimation uses variation within each individual or entity instead of a large number of dummies. The “within” estimation is, 6

(yit − yi•) = (xit − xi•)'β + (εit − εi•),

where yi• is the mean of the dependent variable (DV) of individual (group) i, xi• represents the means of the independent variables (IVs) of group i, and εi• is the mean of the errors of group i.

In this “within” estimation, the incidental parameter problem is no longer an issue. The parameter estimates of the regressors in the “within” estimation are identical to those of LSDV. The “within” estimation reports the correct sum of squared errors (SSE). The “within” estimation, however, has several disadvantages.

First, the data transformation for “within” estimation wipes out all time-invariant variables (e.g., gender, citizenship, and ethnic group) that do not vary within an entity (Kennedy, 2008: 284). Since deviations of time-invariant variables from their averages are all zero, it is not possible

6 This “within” estimation needs three steps: 1) compute group means of the dependent and independent

variables; 2) transform dependent and independent variables to get deviations from their group means; 3) run OLS on the transformed variables without the intercept term.

to estimate coefficients of such variables in “within” estimation. As a consequence, we have to fit LSDV when a model has time-invariant independent variables.

Second, the “within” estimation produces incorrect statistics. Since no dummies are used, the within effect model has larger degrees of freedom for errors, accordingly reporting a smaller mean squared error (MSE), standard error of the estimate (SEE, or square root of mean squared error, SRMSE), and incorrect (smaller) standard errors of parameter estimates. Hence, we have to adjust the incorrect standard errors using the following formula. 7

$$ se_k^* = se_k \sqrt{\frac{df_{error}^{within}}{df_{error}^{LSDV}}} = se_k \sqrt{\frac{nT-k}{nT-n-k}} $$

Third, the R² of the “within” estimation is not correct because the intercept term is suppressed. Finally, the “within” estimation does not report dummy coefficients. We have to compute them, if really needed, using the formula di* = yi• − xi•'β.
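As a minimal sketch of the three steps in footnote 6 (variable names from the airline data; the standard errors would still need the adjustment above):

. egen m_cost = mean(cost), by(airline)          // step 1: group means
. egen m_output = mean(output), by(airline)
. egen m_fuel = mean(fuel), by(airline)
. egen m_load = mean(load), by(airline)
. generate w_cost = cost - m_cost                // step 2: deviations from group means
. generate w_output = output - m_output
. generate w_fuel = fuel - m_fuel
. generate w_load = load - m_load
. regress w_cost w_output w_fuel w_load, noconstant   // step 3: OLS without intercept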

Table 3 Comparison of Three Estimation Methods

                          LSDV                  Within Estimation                          Between Estimation
Functional form           yi = iαi + Xiβ + εi   yit − yi• = (xit − xi•)'β + (εit − εi•)    yi• = α + xi•'β + εi
Time-invariant variables  Yes                   No                                         No
Dummy variables           Yes                   No                                         No
Dummy coefficients        Presented             Need to be computed                        N/A
Transformation            No                    Deviation from the group means             Group means
Intercept estimated       Yes                   No                                         Yes
R²                        Correct               Incorrect
SSE                       Correct               Correct
MSE/SEE (SRMSE)           Correct               Incorrect (smaller)
Standard errors           Correct               Incorrect (smaller)
DF for errors             nT−n−k*               nT−k (n larger)                            n−k−1
Observations              nT                    nT                                         n

* The LSDV estimation loses n degrees of freedom because of the dummy variables included.

The “between group” estimation, the so-called group mean regression, uses variation between individual entities (groups). Specifically, this estimation calculates group means of the dependent and independent variables and thus reduces the number of observations down to n. Then, OLS is run on these transformed, aggregated data: yi• = α + xi•'β + εi. Table 3 contrasts LSDV, “within group” estimation, and “between group” estimation.
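A hedged sketch of the “between” estimation done by hand on the airline data; Stata’s .xtreg with the be option does the same thing directly:

. preserve
. collapse (mean) cost output fuel load, by(airline)   // n = 6 group means
. regress cost output fuel load                        // group mean regression
. restore
. xtreg cost output fuel load, be                      // equivalent built-in estimator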

3.4 Estimating Random Effect Models

The one-way random effect model incorporates a composite error term, wit = ui + vit. The ui are assumed independent of the traditional error term vit and of the regressors Xit, which are also independent of each other for all i and t. Remember that this assumption is not necessary in a fixed effect model. This model is,

7 Fortunately, Stata and other software packages report adjusted standard errors for us.

How do we know whether fixed and/or random effects exist in the panel data at hand? A fixed effect is tested by an F-test, while a random effect is examined by Breusch and Pagan’s (1980) Lagrange multiplier (LM) test. The former compares a fixed effect model and OLS to see how much the fixed effect model improves the goodness-of-fit, whereas the latter contrasts a random effect model with OLS. The similarity between random and fixed effect estimators is tested by a Hausman test.

3.5.1 F-test for Fixed Effects

In a regression of yit = α + μi + Xit'β + εit, the null hypothesis is that all dummy parameters except the one for the dropped group are zero: H0: μ1 = ... = μn−1 = 0. The alternative hypothesis is that at least one dummy parameter is not zero. This hypothesis is tested by an F-test, which is based on the loss of goodness-of-fit. The test contrasts LSDV (the robust model) with the pooled OLS (the efficient model) and examines the extent to which the goodness-of-fit measures (SSE or R²) change.

$$ F(n-1,\; nT-n-k) = \frac{\left(e'e_{pooled} - e'e_{LSDV}\right)/(n-1)}{e'e_{LSDV}/(nT-n-k)} = \frac{\left(R^2_{LSDV} - R^2_{pooled}\right)/(n-1)}{\left(1 - R^2_{LSDV}\right)/(nT-n-k)} $$

If the null hypothesis is rejected (at least one group/time specific intercept ui is not zero), you may conclude that there is a significant fixed effect or significant increase in goodness-of-fit in the fixed effect model; therefore, the fixed effect model is better than the pooled OLS.
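A minimal sketch of this F-test in Stata, using the .xi prefix to create the airline dummies (variable names from the airline example; .xtreg with the fe option reports the same F statistic automatically):

. xi: regress cost output fuel load i.airline   // LSDV with _Iairline_* dummies
. testparm _Iairline*                           // H0: all dummy coefficients are zero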

3.5.2 Breusch-Pagan LM Test for Random Effects

Breusch and Pagan’s (1980) Lagrange multiplier (LM) test examines whether individual (or time) specific variance components are zero: H0: σu² = 0. The LM statistic follows the chi-squared distribution with one degree of freedom.

$$ LM_u = \frac{nT}{2(T-1)} \left[ \frac{T^2\, \bar{e}'\bar{e}}{e'e} - 1 \right]^2 \sim \chi^2(1), $$

where ē is the n × 1 vector of the group means of the pooled regression residuals, and e'e is the SSE of the pooled OLS regression.

Baltagi (2001) presents the same LM test in a different way.

$$ LM_u = \frac{nT}{2(T-1)} \left[ \frac{\sum_i \left(\sum_t e_{it}\right)^2}{\sum_i \sum_t e_{it}^2} - 1 \right]^2 = \frac{nT}{2(T-1)} \left[ \frac{\sum_i \left(T \bar{e}_{i\bullet}\right)^2}{\sum_i \sum_t e_{it}^2} - 1 \right]^2 \sim \chi^2(1). $$

If the null hypothesis is rejected, you can conclude that there is a significant random effect in the panel data, and that the random effect model is able to deal with heterogeneity better than does the pooled OLS.
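In Stata this test is run right after fitting the random effect model (a minimal sketch with the airline variable names):

. xtreg cost output fuel load, re   // GLS random effect model
. xttest0                           // Breusch-Pagan LM test: H0: Var(u) = 0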

3.5.3 Hausman Test for Comparing Fixed and Random Effects

How do we know which effect (fixed or random) is more relevant and significant in the panel data? The Hausman specification test compares fixed and random effect models under the null hypothesis that the individual effects are uncorrelated with any regressor in the model (Hausman, 1978). If the null hypothesis of no correlation holds, both LSDV and GLS are consistent, but LSDV is inefficient; otherwise, LSDV is consistent but GLS is inconsistent and biased (Greene, 2008: 208). Therefore, the estimates of LSDV and GLS should not differ systematically under the null hypothesis. The Hausman test uses the fact that “the covariance of an efficient estimator with its difference from an inefficient estimator is zero” (Greene, 2008: 208).

$$ LM = \left(b_{LSDV} - b_{random}\right)' \hat{W}^{-1} \left(b_{LSDV} - b_{random}\right) \sim \chi^2(k), $$

where Ŵ = Var[bLSDV − brandom] = Var(bLSDV) − Var(brandom) is the difference in the estimated covariance matrices of LSDV (the robust model) and GLS (the efficient model). Keep in mind that the intercept and dummy variables SHOULD be excluded in the computation. This test statistic follows the chi-squared distribution with k degrees of freedom.

The formula says that a Hausman test examines whether “the random effects estimate is insignificantly different from the unbiased fixed effect estimate” (Kennedy, 2008: 286). If the null hypothesis of no correlation is rejected, you may conclude that the individual effects ui are significantly correlated with at least one regressor in the model, and thus the random effect model is problematic. Therefore, you need to go for a fixed effect model rather than its random effect counterpart. A drawback of this Hausman test, however, is that the difference of the covariance matrices W may not be positive definite; then we may conclude that the null is not rejected, on the assumption that similarity of the covariance matrices renders such a problem (Greene, 2008: 209).
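A minimal sketch of the Hausman test in Stata, storing both estimators first (variable names from the airline example):

. xtreg cost output fuel load, fe   // fixed effect ("within") estimator
. estimates store fixed
. xtreg cost output fuel load, re   // random effect (GLS) estimator
. estimates store random
. hausman fixed random              // H0: difference in coefficients not systematic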

3.5.4 Chow Test for Poolability

What is poolability? Poolability asks whether slopes are the same across groups or over time (Baltagi, 2001: 51-57). One simple version of a poolability test is an extension of the Chow test (Chow, 1960). The null hypothesis of this Chow test is that the slope of a regressor is the same regardless of individual for all k regressors: H0: βik = βk. Remember that slopes remain constant in fixed and random effect models; only intercepts and error variances matter.

$$ F\left((n-1)(k+1),\; n(T-k-1)\right) = \frac{\left(e'e - \sum_i e_i'e_i\right) / \left((n-1)(k+1)\right)}{\sum_i e_i'e_i \,/\, \left(n(T-k-1)\right)}, $$

where e'e is the SSE of the pooled OLS and e_i'e_i is the SSE of the pooled OLS for group i. If

the null hypothesis is rejected, the panel data are not poolable; each individual has its own slopes for all regressors. Under this circumstance, you may try the random coefficient model or hierarchical regression model.

The Chow test assumes that individual error variance components follow the normal distribution, μ ~ N(0, σ²InT). If this assumption does not hold, the Chow test may not

properly examine the null hypothesis (Baltagi, 2001: 53). Kennedy (2008) notes, “if there is reason to believe that errors in different equations have different variances, or that there is contemporaneous correlation between the equations’ errors, such testing should be undertaken by using the SURE estimator, not OLS; ... inference with OLS is unreliable if the variance-covariance matrix of the error is nonspherical” (p).
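Stata has no single built-in command for this poolability test; a rough sketch of collecting its ingredients with the airline data (the SSEs are then plugged into the F formula above):

. regress cost output fuel load            // pooled OLS: e'e
. scalar sse_pooled = e(rss)
. scalar sse_groups = 0
. forvalues i = 1/6 {
      quietly regress cost output fuel load if airline == `i'
      scalar sse_groups = sse_groups + e(rss)   // accumulate e_i'e_i
  }
. display sse_pooled "  " sse_groups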

overall intercept (p. 284). He argues that the key difference between fixed and random effects is not whether unobserved heterogeneity is attributed to the intercept or variance components, but whether the individual specific error component is related to regressors.

Figure 3 Scatter Plots of Total Cost versus Output Index and Loading Factor

[Two scatter plots, each showing airlines 1-4 with their own fitted lines and an overall regression line: total cost versus output index (left) and total cost versus loading factor (right). Source: pages.stern.nyu/~wgreene/Text/tables/tablelist5]

It is good practice to draw plots of the dependent and independent variables before modeling panel data. For instance, Figure 3 illustrates two scatter plots with linear regression lines for four airlines only. The left plot is of total cost versus output index, and the right one is of total cost versus loading factor (compare them with Kennedy’s Figure 18 and 18). Assume that the thick black lines represent the linear regression lines of the entire set of observations. The key difference is that the slopes of individual airlines are very similar to the overall regression line in the left plot, but differ from it in the right plot.

As Kennedy (2008: 286) explains, the OLS, fixed effect, and random effect estimators in the left plot are all unbiased, but the random effect estimator is the most efficient, so a random effect model is preferred. In the right plot, however, the OLS and random effect estimators are biased because the composite error term appears to be correlated with a regressor, loading factor, whereas the fixed effect estimator is not biased; accordingly, a fixed effect model might be better.

3.6.1 Two Recommendations for Panel Data Modeling

The first recommendation, as in other data analysis processes, is to describe the data of interest carefully before analysis. Although often ignored in many data analyses, this data description is very important and useful for researchers to get ideas about the data and analysis strategies. In panel data analysis, the properties and quality of the panel data influence model selection significantly.
  • Clean the data by examining whether they were measured in reliable and consistent manners. If different time periods were used in a long panel, for example, try to rearrange (aggregate) the data to improve consistency. If there are many missing values, decide whether to go for a balanced panel by throwing away some pieces of usable information or to keep all usable observations in an unbalanced panel at the expense of methodological and computational complication.
  • Examine the properties of the panel data, including the number of entities (individuals), the number of time periods, balanced versus unbalanced panel, and fixed versus rotating panel. Then try to find models appropriate for those properties.
  • Be careful if you have “long” or “short” panel data. Imagine a long panel that has 10 thousand time periods but 3 individuals, or a short panel of 2 (years) × 9,000 (firms).

  • If n and/or T are too large, try to reclassify individuals and/or time periods to get some manageable n’ and T’. The null hypothesis of u1 = u2 = ... = u999,999 = 0 in a fixed effect model, for instance, is almost useless. It is as if you were seriously arguing that at least one citizen looks different from the other 999,999 people. Didn’t you know that before? Try to use yearly data rather than weekly data, or monthly data rather than daily data.

The second recommendation is to begin with a simpler model. Try a pooled OLS rather than a fixed or random effect model; a one-way effect model rather than a two-way model; a fixed or random effect model rather than a hierarchical linear model; and so on. Do not try a fancy, and of course complicated, model that your panel data do not support well (e.g., a poorly organized panel or a long/short panel).

3.6.2 Guidelines of Model Selection

At the modeling stage, let us begin with pooled OLS and then think critically about its potential problems if observed and unobserved heterogeneity (a set of missing relevant variables) is not taken into account. Also think about the source of heterogeneity (i.e., cross-sectional or time-series variables) to determine individual (entity or group) effects or time effects. 9 Figure 3 provides a big picture of the panel data modeling process.

Figure 3 Panel Data Modeling Process

If you think that the individual heterogeneity is captured in the disturbance term and the individual (group or time) effect is not correlated with any regressors, try a random effect model. If the heterogeneity can be dealt with by individual-specific intercepts and the individual effect may possibly be correlated with some regressors, try a fixed effect model. If each individual (group) has its own initial capacity and shares the same disturbance variance with

9 Kennedy (2008: 286) suggests first examining whether individual-specific intercepts are equal; if yes, the panel data are poolable and OLS will do; if not, conduct the Hausman test; use random effect estimators if the group effect is not correlated with the error term; otherwise, use the fixed effect estimator.

You can also use .regress with the .xi prefix command to fit LSDV1 without creating dummy variables (see 4.4). The .cnsreg command is used for LSDV3 with restrictions defined in .constraint (see 4.4). The .areg command with the absorb option, equivalent to the .xtreg with the fe option below, supports the one-way “within” estimation that involves a large number of individuals or time periods.
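As a minimal sketch, the following three commands fit the same one-way fixed effect model for the airline data; the coefficients of the regressors are identical across the three:

. xi: regress cost output fuel load i.airline    // LSDV1 via the .xi prefix
. areg cost output fuel load, absorb(airline)    // "within" estimation via .areg
. xtreg cost output fuel load, fe                // "within" estimation via .xtreg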

Stata has more convenient commands and options for panel data analysis. First, .xtreg estimates a fixed effect model with the fe option (“within” estimation), “between” estimators with be, and a random effect model with re. This command, however, does not directly fit two-way fixed and random effect models. 10 Table 3 summarizes related Stata commands.

A random effect model can also be estimated using .xtmixed and .xtgls. The .xtgls command fits panel data models with heteroskedasticity across groups (times) and/or autocorrelation within a group (time). .xtmixed and .xtrc are used to fit hierarchical linear models and random coefficient models; in fact, a random effect model is a simple hierarchical linear model with a random intercept. .xtlogit and .xtprobit fit nonlinear panel models and examine fixed and/or random effects in logit and probit models.
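A hedged sketch of these alternatives for the airline data; the random-intercept .xtmixed specification is assumed to be equivalent to the one-way random effect model:

. xtreg cost output fuel load, re                        // GLS random effect model
. xtmixed cost output fuel load || airline:              // random-intercept mixed model
. xtgls cost output fuel load, panels(heteroskedastic)   // GLS with groupwise heteroskedasticity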

.xtreg with the fe option by default conducts the F-test for fixed effects. Of course, you can also use .test to conduct a classical Wald test of the fixed effects. Since .xtreg does not report the Breusch-Pagan LM statistic for a random effect model, you need to run .xttest0 after fitting the random effect model. Use .hausman to conduct a Hausman test to compare fixed and random effect models.

10 You may fit a two-way fixed effect model by including a set of dummies and using the fe option. For the

two-way random effect model, you need to use the .xtmixed command instead of .xtreg.

4. Pooled OLS and LSDV

This section begins with the classical least squares method called ordinary least squares (OLS) and explains how OLS can deal with unobserved heterogeneity using dummy variables. A dummy variable is a binary variable that is coded as either one or zero. OLS using dummy variables is called a least squares dummy variable (LSDV) model. The sample model used here regresses the total cost of airline companies on output in revenue passenger miles (output index), fuel price, and loading factor (the average capacity utilization of the fleet). 11

4.1 Pooled OLS

The (pooled) OLS is a pooled linear regression without fixed and/or random effects. It assumes a constant intercept and constant slopes regardless of group and time period. In the sample panel data with six airlines and 15 time periods, the basic scheme is that total cost is determined by output, fuel price, and loading factor. The pooled OLS posits no difference in intercept and slopes across airlines and time periods.

OLS: costi = β0 + β1 outputi + β2 fueli + β3 loadingi + εi

Note that β0 is the intercept; β1 is the slope (coefficient or parameter estimate) of output; β2 is the slope of fuel price; β3 is the slope of loading factor; and εi is the error term.

Now, let us load the data and fit the pooled regression model.

. use indiana/~statmath/stat/all/panel/airline, clear
(Cost of U.S. Airlines (Greene 2003))

. regress cost output fuel load

      Source |       SS       df       MS              Number of obs =      90
-------------+------------------------------           F(  3,    86) =   2419.
       Model |         112      3         37           Prob > F      =  0.0000
    Residual |           1     86  .01552839           R-squared     =  0.9883
-------------+------------------------------           Adj R-squared =  0.
       Total |         114     89          1           Root MSE      =  .

------------------------------------------------------------------------------
        cost |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      output |   .8827385   .0132545     66    0.000      .8563895           .
        fuel |    .453977   .0203042     22    0.000      .4136136           .
        load |         -1    .345302     -4    0.000            -2          -.
       _cons |          9   .2292445     41    0.000             9          9.
------------------------------------------------------------------------------

This pooled OLS model fits the data well at the .05 significance level (F=2419, p<.0001). The R² of .9883 says that this model accounts for about 99 percent of the total variance in the total cost of airline companies. The regression equation is,

cost = 9 + .8827*output + .4540*fuel − 1*load

You may interpret these slopes in several ways. The ceteris paribus assumption, “holding all other variables constant,” is important but often skipped in presentation. The p-values in parentheses below are the results of t-tests for individual parameters.

11 For details on the data, see pages.stern.nyu/~wgreene/Text/tables/tablelist5
