Using R for Value Added Modeling (VAM)

This post explains how to use nmle and lme4 package for value added modeling. First, we created an artificial data set.

dep = rnorm(50000)
time = rep(1:10, 5000)
female = as.factor(c(rep(1,25000), rep(0,25000)))
size = abs(rnorm(50000)*10000)
childID = as.factor(rep(1:5000, each = 10))
childID = as.factor(childID)
schoolID = as.factor(rep(1:500, each = 100))
vamData = cbind(dep, time, female, size, childID, schoolID)
names(vamData) = tolower(names(vamData))
head(vamData)

##             dep time female       size childID schoolID
## [1,] -0.1895273    1      2  8281.4900       1        1
## [2,] -0.1794148    2      2   919.9785       1        1
## [3,]  0.4303515    3      2  8835.7647       1        1
## [4,]  0.8136552    4      2 18013.1269       1        1
## [5,] -1.2259365    5      2 16783.9323       1        1
## [6,] -0.7389885    6      2  5680.4226       1        1

library(nlme)
vamData = data.frame(vamData)
vamData$female = factor(vamData$female)
vamData$childID = factor(vamData$childID)
vamData$schoolID = factor(vamData$schoolID)
vamDataInd = groupedData(dep ~ time | schoolID/childID, data = vamData)
vamDataInd$female = factor(vamDataInd$female)
vamDataInd$childID = factor(vamDataInd$childID)
vamDataInd$schoolID = factor(vamDataInd$schoolID)
head(vamDataInd)

## Grouped Data: dep ~ time | schoolID/childID
##          dep time female       size childID schoolID
## 1 -0.1895273    1      2  8281.4900       1        1
## 2 -0.1794148    2      2   919.9785       1        1
## 3  0.4303515    3      2  8835.7647       1        1
## 4  0.8136552    4      2 18013.1269       1        1
## 5 -1.2259365    5      2 16783.9323       1        1
## 6 -0.7389885    6      2  5680.4226       1        1

We then fit an unconditional model meaning that both the intercepts and slopes can vary by schools and for children nested within schools. For the first model, we only include one variable time and evaluate the average effect of time on the dependent variable. After the comma in the first model, we create the random part which says that we want to evaluate the differences in the schools and children within schools (random intercepts) on the dependent variable and how the differences in schools and children within schools varies over time (random slopes) which we evaluate across schools and children in schools. We then select the method ML to use full maximum likelihood estimation, which is one way of handling missing data: http://www.statisticalhorizons.com/wp-content/uploads/MissingDataByML.pdf

The first part of the output focuses on the intercept, which is the average standard deviation difference between the different schools starting values on the dependent variable. The slope is the average standard deviation difference between the slopes (i.e. time’s effect on the dependent variable per school). The Corr is the correlation between the intercept and slopes. The next provides the same estimates, but for the student levels, which are nested within the schools. Finally, we have the fixed part, which is where we can evaluate the effect of time on the dependent variable.

unconVamData = lme(dep ~ time, random =~ time | schoolID/childID, data = vamDataInd, method = "ML")
summary(unconVamData)

## Linear mixed-effects model fit by maximum likelihood
##  Data: vamDataInd 
##        AIC      BIC    logLik
##   141582.8 141662.2 -70782.41
## 
## Random effects:
##  Formula: ~time | schoolID
##  Structure: General positive-definite, Log-Cholesky parametrization
##             StdDev       Corr  
## (Intercept) 2.366050e-06 (Intr)
## time        4.637429e-03 -0.479
## 
##  Formula: ~time | childID %in% schoolID
##  Structure: General positive-definite, Log-Cholesky parametrization
##             StdDev       Corr  
## (Intercept) 8.006738e-19 (Intr)
## time        1.704209e-12 0     
## Residual    9.963160e-01       
## 
## Fixed effects: dep ~ time 
##                     Value   Std.Error    DF     t-value p-value
## (Intercept)  0.0008514379 0.009625520 44999  0.08845631  0.9295
## time        -0.0002809295 0.001565095 44999 -0.17949675  0.8575
##  Correlation: 
##      (Intr)
## time -0.879
## 
## Standardized Within-Group Residuals:
##           Min            Q1           Med            Q3           Max 
## -4.4439095618 -0.6706176457 -0.0004606482  0.6752230503  4.0432536184 
## 
## Number of Observations: 50000
## Number of Groups: 
##              schoolID childID %in% schoolID 
##                   500                  5000

The next model is the same as before except with two control variables (school size and female status of student). Although, a VAM model will include more covariates, for this example two covariates will suffice. To answer the question we are interested in, what is the effect of schools on student’s dependent variable score, we need to look at the individual coefficients for schools. Using the second model (unconVamDataCovs) we use the coef function to produce the random intercepts and slopes for the first five schools across time. As we can see schools 2 and 3 experiencing positive growth while schools 1,4, and 5 are experiencing negative growth.

unconVamDataCovs = lme(fixed = dep ~ time + size+female, random =~ time | schoolID/childID, data = vamDataInd)
summary(unconVamDataCovs)

## Linear mixed-effects model fit by REML
##  Data: vamDataInd 
##        AIC      BIC    logLik
##   141640.5 141737.5 -70809.25
## 
## Random effects:
##  Formula: ~time | schoolID
##  Structure: General positive-definite, Log-Cholesky parametrization
##             StdDev       Corr  
## (Intercept) 1.257443e-05 (Intr)
## time        4.751247e-03 -0.482
## 
##  Formula: ~time | childID %in% schoolID
##  Structure: General positive-definite, Log-Cholesky parametrization
##             StdDev       Corr  
## (Intercept) 5.570011e-26 (Intr)
## time        5.719875e-19 0     
## Residual    9.963342e-01       
## 
## Fixed effects: dep ~ time + size + female 
##                     Value   Std.Error    DF    t-value p-value
## (Intercept) -0.0027688193 0.012203773 44998 -0.2268822  0.8205
## time        -0.0002782806 0.001565783 44998 -0.1777262  0.8589
## size         0.0000003799 0.000000739 44998  0.5138215  0.6074
## female2      0.0011504945 0.009207306   498  0.1249545  0.9006
##  Correlation: 
##         (Intr) time   size  
## time    -0.694              
## size    -0.485  0.003       
## female2 -0.377  0.000 -0.001
## 
## Standardized Within-Group Residuals:
##           Min            Q1           Med            Q3           Max 
## -4.4427411129 -0.6702002167 -0.0003959044  0.6755367864  4.0405433681 
## 
## Number of Observations: 50000
## Number of Groups: 
##              schoolID childID %in% schoolID 
##                   500                  5000

ranef(unconVamDataCovs, level = 1)[1:5,]

##     (Intercept)          time
## 1 -1.604440e-06  1.257651e-03
## 2  9.649399e-09 -8.023279e-06
## 3  6.711580e-08 -5.208853e-05
## 4 -2.991712e-06  2.344618e-03
## 5  1.869538e-06 -1.465186e-03

There are two other modifications that I would like to talk about. First is that sometimes data over time are autocorrelated, which is not accounted for the in current model. Therefore, we may need to explicitly specify the autocorrelation in the model. The auto1Vam does that by using the update function to update the unconVamDataCovs with an ar1 model (i.e. each time point is correlated with its predecessor). The model is interpreted in the same way as before.

auto1Vam = update(unconVamDataCovs, correlation=corCAR1(form = ~ time | schoolID/childID))
summary(auto1Vam)

## Linear mixed-effects model fit by REML
##  Data: vamDataInd 
##        AIC      BIC   logLik
##   141642.6 141748.4 -70809.3
## 
## Random effects:
##  Formula: ~time | schoolID
##  Structure: General positive-definite, Log-Cholesky parametrization
##             StdDev     Corr  
## (Intercept) 0.02747114 (Intr)
## time        0.00448768 -0.468
## 
##  Formula: ~time | childID %in% schoolID
##  Structure: General positive-definite, Log-Cholesky parametrization
##             StdDev       Corr  
## (Intercept) 1.059747e-04 (Intr)
## time        3.385686e-05 -0.001
## Residual    9.963184e-01       
## 
## Correlation Structure: Continuous AR(1)
##  Formula: ~time | schoolID/childID 
##  Parameter estimate(s):
##          Phi 
## 2.563321e-05 
## Fixed effects: dep ~ time + size + female 
##                     Value   Std.Error    DF    t-value p-value
## (Intercept) -0.0029264026 0.012270054 44998 -0.2384996  0.8115
## time        -0.0002783258 0.001564230 44998 -0.1779314  0.8588
## size         0.0000003791 0.000000739 44998  0.5127330  0.6081
## female2      0.0014789602 0.009232278   498  0.1601945  0.8728
##  Correlation: 
##         (Intr) time   size  
## time    -0.697              
## size    -0.483  0.003       
## female2 -0.376  0.000 -0.001
## 
## Standardized Within-Group Residuals:
##           Min            Q1           Med            Q3           Max 
## -4.4464196559 -0.6705851867 -0.0003015507  0.6761172856  4.0424564721 
## 
## Number of Observations: 50000
## Number of Groups: 
##              schoolID childID %in% schoolID 
##                   500                  5000

The final model uses a different package to handle non or cross nested levels. Unfortunately, the nlme package has difficultly handling non or cross nested data, so we will switch to the lme4 package, which more easily handles non or cross nested data. In this example, we are using the InstEval data. Y is the dependent variable, s is the student indicator and d is the instructor. Sometimes students have more than one instructor therefore, we cannot evaluate the data as students nested within instructor. Instead we model them as separate levels using the code below. Here we have the intercepts for both the s students and teachers (d) demonstrating how students and teachers deviate from each other on the dependent variable y. To get the individual intercepts for students and instructors we use the coef function and grab the students and instructors individually.

Although, this example does not include a time variable or covariates, they can be added and interpreted in the same way as the previous examples in this post.

library(lme4)
data("InstEval")
head(InstEval)

##   s    d studage lectage service dept y
## 1 1 1002       2       2       0    2 5
## 2 1 1050       2       1       1    6 2
## 3 1 1582       2       2       0    2 5
## 4 1 2050       2       2       1    3 3
## 5 2  115       2       1       0    5 2
## 6 2  756       2       1       0    5 4

fmCross = lmer(y~ 1 + (1|s) + (1|d), InstEval, REML = FALSE)
fmCross

## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: y ~ 1 + (1 | s) + (1 | d)
##    Data: InstEval
##       AIC       BIC    logLik  deviance  df.resid 
##  237785.7  237822.5 -118888.9  237777.7     73417 
## Random effects:
##  Groups   Name        Std.Dev.
##  s        (Intercept) 0.3259  
##  d        (Intercept) 0.5230  
##  Residual             1.1778  
## Number of obs: 73421, groups:  s, 2972; d, 1128
## Fixed Effects:
## (Intercept)  
##       3.254

fmCrossCoef =coef(fmCross)
head(fmCrossCoef$s)

##   (Intercept)
## 1    3.412885
## 2    3.208094
## 3    3.550832
## 4    3.504570
## 5    3.295704
## 6    3.357045

head(fmCrossCoef$d)

##    (Intercept)
## 1     3.666954
## 6     2.814612
## 7     3.863743
## 8     2.691897
## 12    3.449230
## 13    3.558906