Here is an example of three different methods (fixed effects, random effects, and latent growth modeling) for analyzing longitudinal data using artificial data sets. First, we will demonstrate how to create a longitudinal data set. We create an artificial data set with two variables a1 and a2. Let us assume that a1 and a2 are scores on some measurement across time and we need to create a longitudinal data set that stacks a1 on top of a2 into a single variable with a time variable that identifies the time point for each observation. To do this we can use the reshape package in R.
There are five components that we need to complete to create the longitudinal data set. First, we identify the dataset that we are using, dataLong in this case. Then select the variables that we want to stack using the varying command and identifying the a1 and a2 variables. Then we create a name for the new stacked variable, which we called dependent. Then we select the time points that will be associated with each of the variables in a variable called time, which is one and two in this example. Finally, we select the direction, which is long, because we want a longitudinal data set.
library(reshape)
dataLong = data.frame(a1 = c(rnorm(10)), a2 = c(rnorm(10)))
dataLongLong = reshape(dataLong, varying = list(c("a1", "a2")), v.names = "dependent", times = c(1,2), direction = "long")
head(dataLongLong)
## time dependent id
## 1.1 1 0.89796912 1
## 2.1 1 0.84687613 2
## 3.1 1 0.28676476 3
## 4.1 1 -0.79183975 4
## 5.1 1 0.01235569 5
## 6.1 1 -1.94155535 6
The fixed model includes time variant variables that suck out variation associated with steady fixtures such as schools or a time variable that captures variation in the dependent variable associated with these fixtures. In this example, below we look a data set of countries over time in years across an artificial dependent variable y and artificial x variable x1. To create the fixed effects model, we use the plm function in R. To run the PLM function in R, we create the model, which is y regressed on x1, create the index variables and set the model. The index variables are the fixed effects that we want to create, so in this case we have two sets of fixed effects countries and years. The model function lets us tell R whether to use fixed or random effects, where in this case within means fixed effects. Then we can get the parameter estimate for x1, which indicates that average amount of change per country over time given a unit increase in x.
library(foreign)
library(plm)
Panel <-read.dta("http://dss.princeton.edu/training/Panel101.dta")
head(Panel)
## country year y y_bin x1 x2 x3
## 1 A 1990 1342787840 1 0.2779036 -1.1079559 0.28255358
## 2 A 1991 -1899660544 0 0.3206847 -0.9487200 0.49253848
## 3 A 1992 -11234363 0 0.3634657 -0.7894840 0.70252335
## 4 A 1993 2645775360 1 0.2461440 -0.8855330 -0.09439092
## 5 A 1994 3008334848 1 0.4246230 -0.7297683 0.94613063
## 6 A 1995 3229574144 1 0.4772141 -0.7232460 1.02968037
## opinion
## 1 Str agree
## 2 Disag
## 3 Disag
## 4 Disag
## 5 Disag
## 6 Str agree
fixed = plm(y ~ x1, data = Panel, index = c("country", "year"), model = "within")
summary(fixed)
## Oneway (individual) effect Within Model
##
## Call:
## plm(formula = y ~ x1, data = Panel, model = "within", index = c("country",
## "year"))
##
## Balanced Panel: n=7, T=10, N=70
##
## Residuals :
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -8.63e+09 -9.70e+08 5.40e+08 0.00e+00 1.39e+09 5.61e+09
##
## Coefficients :
## Estimate Std. Error t-value Pr(>|t|)
## x1 2475617827 1106675594 2.237 0.02889 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Total Sum of Squares: 5.2364e+20
## Residual Sum of Squares: 4.8454e+20
## R-Squared: 0.074684
## Adj. R-Squared: -0.029788
## F-statistic: 5.00411 on 1 and 62 DF, p-value: 0.028892
In a random effects model, we do not assume that the effects of the time invariant variables, such as school and country, are the same and allow them to have their own starting values (i.e. intercepts). In this model, the R code is almost identical, except that in the model section we change within to random. Then we get the value of the parameter estimate, which is the average change in y over time between countries.
random = plm(y ~ x1, data = Panel, index = c("country", "year"), model = "random")
summary(random)
## Oneway (individual) effect Random Effect Model
## (Swamy-Arora's transformation)
##
## Call:
## plm(formula = y ~ x1, data = Panel, model = "random", index = c("country",
## "year"))
##
## Balanced Panel: n=7, T=10, N=70
##
## Effects:
## var std.dev share
## idiosyncratic 7.815e+18 2.796e+09 0.873
## individual 1.133e+18 1.065e+09 0.127
## theta: 0.3611
##
## Residuals :
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -8.94e+09 -1.51e+09 2.82e+08 0.00e+00 1.56e+09 6.63e+09
##
## Coefficients :
## Estimate Std. Error t-value Pr(>|t|)
## (Intercept) 1037014284 790626206 1.3116 0.1941
## x1 1247001782 902145601 1.3823 0.1714
##
## Total Sum of Squares: 5.6595e+20
## Residual Sum of Squares: 5.5048e+20
## R-Squared: 0.02733
## Adj. R-Squared: 0.013026
## F-statistic: 1.91065 on 1 and 68 DF, p-value: 0.17141
How do we know when to use fixed or random effects? We can use the Hausman Test, where the null hypothesis is that the model is random effects and the alternative is that fixed effects are better fit. In this example, the p-value is above .05 so a random effects model is the better fit for this data.
phtest(random, fixed)
##
## Hausman Test
##
## data: y ~ x1
## chisq = 3.674, df = 1, p-value = 0.05527
## alternative hypothesis: one model is inconsistent
Another way to analyze longitudinal data is through Latent Growth Modeling (LGM). LGM models time as a latent variable similar to a multi-level modeling approach. To analyze LGM in R we use the lavaan package. To analyze LGM in lavaan we use an artificial data set provided by lavaan. The first step is to define the model, which is enclosed in quotations and sent as a string. In this case, we are fixing parameter estimates for time to be one to ensure the model is identified. Then we set each of the slope parameter estimates to their respective time point with one change. Since we cannot estimate the slope of time point one, we set that to zero and use n-1 time points, in this case three to identify the slopes. Variances for the time points, intercept, slope, and covariance between the slope and variance are freely estimated.
Then we can use the growth function in lavaan to correctly analyze the data as a LGM. In this example I am focusing on the interpretation of parameters not the model fit, so I will skip the first section regarding model fit. The first section is the parameter estimates, we fixed in the model. The next section is the covariance between the intercept and the slope demonstrating the relationship between the starting values and the slope or change over time. To fully interpret this value, we need to analyze the slope value found in the intercept section of the output. If the slope is positive and the covariance is positive, as is the case in our example, then we can say higher starting values (i.e. intercepts) leads to larger slopes or increases in y relative to lower starting values. Then we can move the intercepts, which is what we are interested in. “i” is the mean of the starting value at time one and “s” is the average change in y over time. We can see that the slope is statistically significantly different from zero meaning that there is a significant change in y over time.
library(lavaan)
head(Demo.growth)
## t1 t2 t3 t4 x1 x2
## 1 1.7256454 2.1424005 2.7731717 2.51595586 -1.1641026 0.1742293
## 2 -1.9841595 -4.4006027 -6.0165563 -7.02961801 -1.7454025 -1.5768602
## 3 0.3195183 -1.2691171 1.5600160 2.86852958 0.9202112 -0.1418180
## 4 0.7769485 3.5313707 3.1382114 5.36374139 2.3595236 0.7079681
## 5 0.4489440 -0.7727747 -1.5035150 0.07846742 -1.0887077 -1.0099977
## 6 -1.7469951 -0.9963400 -0.8242174 0.56692480 -0.5135169 -0.1440428
## c1 c2 c3 c4
## 1 -0.02767765 0.5549234 0.2544784 -1.00639541
## 2 -2.03196724 0.1253348 -1.5642323 1.22926875
## 3 0.05237496 -1.2577408 -1.8033909 -0.32725761
## 4 0.01911429 0.6473830 -0.4323795 -1.03239779
## 5 0.65243274 0.7309148 -0.7537816 -0.02745598
## 6 -0.04452906 -0.4168022 -1.2411670 0.51719763
model = "i =~ 1*t1 + 1*t2 + 1*t3 + 1*t4
s =~ 0*t1 + 1*t2 + 2*t3 + 3*t4"
fit = growth(model, data = Demo.growth)
summary(fit)
## lavaan (0.5-23.1097) converged normally after 29 iterations
##
## Number of observations 400
##
## Estimator ML
## Minimum Function Test Statistic 8.069
## Degrees of freedom 5
## P-value (Chi-square) 0.152
##
## Parameter Estimates:
##
## Information Expected
## Standard Errors Standard
##
## Latent Variables:
## Estimate Std.Err z-value P(>|z|)
## i =~
## t1 1.000
## t2 1.000
## t3 1.000
## t4 1.000
## s =~
## t1 0.000
## t2 1.000
## t3 2.000
## t4 3.000
##
## Covariances:
## Estimate Std.Err z-value P(>|z|)
## i ~~
## s 0.618 0.071 8.686 0.000
##
## Intercepts:
## Estimate Std.Err z-value P(>|z|)
## .t1 0.000
## .t2 0.000
## .t3 0.000
## .t4 0.000
## i 0.615 0.077 8.007 0.000
## s 1.006 0.042 24.076 0.000
##
## Variances:
## Estimate Std.Err z-value P(>|z|)
## .t1 0.595 0.086 6.944 0.000
## .t2 0.676 0.061 11.061 0.000
## .t3 0.635 0.072 8.761 0.000
## .t4 0.508 0.124 4.090 0.000
## i 1.932 0.173 11.194 0.000
## s 0.587 0.052 11.336 0.000