Panel Data Examples using R

Here is an example of three different methods (fixed effects, random effects, and latent growth modeling) for analyzing longitudinal data using artificial data sets. First, we will demonstrate how to create a longitudinal data set. We create an artificial data set with two variables a1 and a2. Let us assume that a1 and a2 are scores on some measurement across time and we need to create a longitudinal data set that stacks a1 on top of a2 into a single variable with a time variable that identifies the time point for each observation. To do this we can use the reshape package in R.

There are five components that we need to complete to create the longitudinal data set. First, we identify the dataset that we are using, dataLong in this case. Then select the variables that we want to stack using the varying command and identifying the a1 and a2 variables. Then we create a name for the new stacked variable, which we called dependent. Then we select the time points that will be associated with each of the variables in a variable called time, which is one and two in this example. Finally, we select the direction, which is long, because we want a longitudinal data set.

library(reshape)

dataLong = data.frame(a1 = c(rnorm(10)), a2 = c(rnorm(10)))
dataLongLong = reshape(dataLong, varying = list(c("a1", "a2")), v.names = "dependent", times = c(1,2), direction = "long")

head(dataLongLong)

##     time   dependent id
## 1.1    1  0.89796912  1
## 2.1    1  0.84687613  2
## 3.1    1  0.28676476  3
## 4.1    1 -0.79183975  4
## 5.1    1  0.01235569  5
## 6.1    1 -1.94155535  6

The fixed model includes time variant variables that suck out variation associated with steady fixtures such as schools or a time variable that captures variation in the dependent variable associated with these fixtures. In this example, below we look a data set of countries over time in years across an artificial dependent variable y and artificial x variable x1. To create the fixed effects model, we use the plm function in R. To run the PLM function in R, we create the model, which is y regressed on x1, create the index variables and set the model. The index variables are the fixed effects that we want to create, so in this case we have two sets of fixed effects countries and years. The model function lets us tell R whether to use fixed or random effects, where in this case within means fixed effects. Then we can get the parameter estimate for x1, which indicates that average amount of change per country over time given a unit increase in x.

library(foreign)
library(plm)
Panel <-read.dta("http://dss.princeton.edu/training/Panel101.dta")
head(Panel)

##   country year           y y_bin        x1         x2          x3
## 1       A 1990  1342787840     1 0.2779036 -1.1079559  0.28255358
## 2       A 1991 -1899660544     0 0.3206847 -0.9487200  0.49253848
## 3       A 1992   -11234363     0 0.3634657 -0.7894840  0.70252335
## 4       A 1993  2645775360     1 0.2461440 -0.8855330 -0.09439092
## 5       A 1994  3008334848     1 0.4246230 -0.7297683  0.94613063
## 6       A 1995  3229574144     1 0.4772141 -0.7232460  1.02968037
##     opinion
## 1 Str agree
## 2     Disag
## 3     Disag
## 4     Disag
## 5     Disag
## 6 Str agree

fixed = plm(y ~ x1, data = Panel, index = c("country", "year"), model = "within")
summary(fixed)

## Oneway (individual) effect Within Model
## 
## Call:
## plm(formula = y ~ x1, data = Panel, model = "within", index = c("country", 
##     "year"))
## 
## Balanced Panel: n=7, T=10, N=70
## 
## Residuals :
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -8.63e+09 -9.70e+08  5.40e+08  0.00e+00  1.39e+09  5.61e+09 
## 
## Coefficients :
##      Estimate Std. Error t-value Pr(>|t|)  
## x1 2475617827 1106675594   2.237  0.02889 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Total Sum of Squares:    5.2364e+20
## Residual Sum of Squares: 4.8454e+20
## R-Squared:      0.074684
## Adj. R-Squared: -0.029788
## F-statistic: 5.00411 on 1 and 62 DF, p-value: 0.028892

In a random effects model, we do not assume that the effects of the time invariant variables, such as school and country, are the same and allow them to have their own starting values (i.e. intercepts). In this model, the R code is almost identical, except that in the model section we change within to random. Then we get the value of the parameter estimate, which is the average change in y over time between countries.

random = plm(y ~ x1, data = Panel, index = c("country", "year"), model = "random")
summary(random)

## Oneway (individual) effect Random Effect Model 
##    (Swamy-Arora's transformation)
## 
## Call:
## plm(formula = y ~ x1, data = Panel, model = "random", index = c("country", 
##     "year"))
## 
## Balanced Panel: n=7, T=10, N=70
## 
## Effects:
##                     var   std.dev share
## idiosyncratic 7.815e+18 2.796e+09 0.873
## individual    1.133e+18 1.065e+09 0.127
## theta:  0.3611  
## 
## Residuals :
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -8.94e+09 -1.51e+09  2.82e+08  0.00e+00  1.56e+09  6.63e+09 
## 
## Coefficients :
##               Estimate Std. Error t-value Pr(>|t|)
## (Intercept) 1037014284  790626206  1.3116   0.1941
## x1          1247001782  902145601  1.3823   0.1714
## 
## Total Sum of Squares:    5.6595e+20
## Residual Sum of Squares: 5.5048e+20
## R-Squared:      0.02733
## Adj. R-Squared: 0.013026
## F-statistic: 1.91065 on 1 and 68 DF, p-value: 0.17141

How do we know when to use fixed or random effects? We can use the Hausman Test, where the null hypothesis is that the model is random effects and the alternative is that fixed effects are better fit. In this example, the p-value is above .05 so a random effects model is the better fit for this data.

phtest(random, fixed)

## 
##  Hausman Test
## 
## data:  y ~ x1
## chisq = 3.674, df = 1, p-value = 0.05527
## alternative hypothesis: one model is inconsistent

Another way to analyze longitudinal data is through Latent Growth Modeling (LGM). LGM models time as a latent variable similar to a multi-level modeling approach. To analyze LGM in R we use the lavaan package. To analyze LGM in lavaan we use an artificial data set provided by lavaan. The first step is to define the model, which is enclosed in quotations and sent as a string. In this case, we are fixing parameter estimates for time to be one to ensure the model is identified. Then we set each of the slope parameter estimates to their respective time point with one change. Since we cannot estimate the slope of time point one, we set that to zero and use n-1 time points, in this case three to identify the slopes. Variances for the time points, intercept, slope, and covariance between the slope and variance are freely estimated.

Then we can use the growth function in lavaan to correctly analyze the data as a LGM. In this example I am focusing on the interpretation of parameters not the model fit, so I will skip the first section regarding model fit. The first section is the parameter estimates, we fixed in the model. The next section is the covariance between the intercept and the slope demonstrating the relationship between the starting values and the slope or change over time. To fully interpret this value, we need to analyze the slope value found in the intercept section of the output. If the slope is positive and the covariance is positive, as is the case in our example, then we can say higher starting values (i.e. intercepts) leads to larger slopes or increases in y relative to lower starting values. Then we can move the intercepts, which is what we are interested in. “i” is the mean of the starting value at time one and “s” is the average change in y over time. We can see that the slope is statistically significantly different from zero meaning that there is a significant change in y over time.

library(lavaan)
head(Demo.growth)

##           t1         t2         t3          t4         x1         x2
## 1  1.7256454  2.1424005  2.7731717  2.51595586 -1.1641026  0.1742293
## 2 -1.9841595 -4.4006027 -6.0165563 -7.02961801 -1.7454025 -1.5768602
## 3  0.3195183 -1.2691171  1.5600160  2.86852958  0.9202112 -0.1418180
## 4  0.7769485  3.5313707  3.1382114  5.36374139  2.3595236  0.7079681
## 5  0.4489440 -0.7727747 -1.5035150  0.07846742 -1.0887077 -1.0099977
## 6 -1.7469951 -0.9963400 -0.8242174  0.56692480 -0.5135169 -0.1440428
##            c1         c2         c3          c4
## 1 -0.02767765  0.5549234  0.2544784 -1.00639541
## 2 -2.03196724  0.1253348 -1.5642323  1.22926875
## 3  0.05237496 -1.2577408 -1.8033909 -0.32725761
## 4  0.01911429  0.6473830 -0.4323795 -1.03239779
## 5  0.65243274  0.7309148 -0.7537816 -0.02745598
## 6 -0.04452906 -0.4168022 -1.2411670  0.51719763

model = "i =~ 1*t1 + 1*t2 + 1*t3 + 1*t4
s =~ 0*t1 + 1*t2 + 2*t3 + 3*t4"

fit = growth(model, data = Demo.growth)
summary(fit)

## lavaan (0.5-23.1097) converged normally after  29 iterations
## 
##   Number of observations                           400
## 
##   Estimator                                         ML
##   Minimum Function Test Statistic                8.069
##   Degrees of freedom                                 5
##   P-value (Chi-square)                           0.152
## 
## Parameter Estimates:
## 
##   Information                                 Expected
##   Standard Errors                             Standard
## 
## Latent Variables:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   i =~                                                
##     t1                1.000                           
##     t2                1.000                           
##     t3                1.000                           
##     t4                1.000                           
##   s =~                                                
##     t1                0.000                           
##     t2                1.000                           
##     t3                2.000                           
##     t4                3.000                           
## 
## Covariances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   i ~~                                                
##     s                 0.618    0.071    8.686    0.000
## 
## Intercepts:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .t1                0.000                           
##    .t2                0.000                           
##    .t3                0.000                           
##    .t4                0.000                           
##     i                 0.615    0.077    8.007    0.000
##     s                 1.006    0.042   24.076    0.000
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .t1                0.595    0.086    6.944    0.000
##    .t2                0.676    0.061   11.061    0.000
##    .t3                0.635    0.072    8.761    0.000
##    .t4                0.508    0.124    4.090    0.000
##     i                 1.932    0.173   11.194    0.000
##     s                 0.587    0.052   11.336    0.000