In this example, I will fit a hierarchical linear model to the ECLS-K 2011 data. The outcome of interest here is the child’s kindergarten math score. I will illustrate how to fit the basic multilevel model with random intercepts, a model with random intercepts and random slopes, and compare models with a likelihood ratio test. I also show how to extract the variance components of the models and form the intra class correlation coefficient. Lastly, I will visualize the implied regression lines for the random intercepts and slopes model, highlighting the variation in the effect of household poverty status on children’s math scores between schools in the data.

Data and recodes

First we load our data

load("~/Google Drive/dem7903_App_Hier/data/eclsk_11.Rdata")
names(eclsk11)<-tolower(names(eclsk11))
library (car)
library(lmtest)
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## 
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
library(arm)
## Loading required package: MASS
## Loading required package: Matrix
## Loading required package: lme4
## 
## arm (Version 1.8-6, built: 2015-7-7)
## 
## Working directory is /Users/ozd504/Google Drive/dem7903_App_Hier/code/code14
## 
## 
## Attaching package: 'arm'
## 
## The following object is masked from 'package:car':
## 
##     logit
#get out only the variables I'm going to use for this example

myvars<-c("x_chsex_r", "x1locale", "x_raceth_r", "x2povty", "x12par1ed_i","p1curmar", "x1htotal", "x1mscalk1",  "s1_id", "p2safepl","x2krceth" )
#subset the data
eclsk.sub<-eclsk11[,myvars]
rm(eclsk11); gc()
##           used (Mb) gc trigger   (Mb)  max used   (Mb)
## Ncells 1542682 82.4    2637877  140.9   1880661  100.5
## Vcells 2244961 17.2  221668588 1691.2 182895767 1395.4

Next, I do some recoding of variables using a mixture of the ifelse() function and the recode () function.

#recode our outcomes, the  first is the child's math standardized test score  in Kindergarten
eclsk.sub$math<-ifelse(eclsk.sub$x1mscalk1<0, NA, eclsk.sub$x1mscalk1)

#First we recode some Child characteristics
#Child's sex: recode as male =1
eclsk.sub$male<-recode(eclsk.sub$x_chsex_r, recodes="1=1; 2=0; -9=NA")

#Recode race with white, non Hispanic as reference using dummy vars
eclsk.sub$hisp<-recode (eclsk.sub$x_raceth_r, recodes="3:4=1;-9=NA; else=0")
eclsk.sub$black<-recode (eclsk.sub$x_raceth_r, recodes="2=1;-9=NA; else=0")
eclsk.sub$asian<-recode (eclsk.sub$x_raceth_r, recodes="5=1;-9=NA; else=0")
eclsk.sub$nahn<-recode (eclsk.sub$x_raceth_r, recodes="6:7=1;-9=NA; else=0")
eclsk.sub$other<-recode (eclsk.sub$x_raceth_r, recodes="8=1;-9=NA; else=0")


#Then we recode some parent/mother characteristics
#Mother's education, recode as 2 dummys with HS = reference
eclsk.sub$lths<-recode(eclsk.sub$x12par1ed_i, recodes = "0:2=1; 3:8=0; else = NA")
eclsk.sub$gths<-recode(eclsk.sub$x12par1ed_i, recodes = "1:3=0; 4:8=1; else =NA") 

#marital status, recode as 2 dummys, ref= married
eclsk.sub$single<-recode(eclsk.sub$p1curmar, recodes="4=1; -7:-9=NA; else=0")
eclsk.sub$notmar<-recode(eclsk.sub$p1curmar, recodes="2:3=1; -7:-9=NA; else=0")


#Then we do some household level variables
#Urban school location = 1
eclsk.sub$urban<-recode(eclsk.sub$x1locale, recodes = "1:3=1; 4=0; -1:-9=NA")

#poverty level in poverty = 1
eclsk.sub$pov<-recode(eclsk.sub$x2povty , recodes ="1:2=1; 3=0; -9=NA")

#Household size
eclsk.sub$hhsize<-eclsk.sub$x1htotal

#school % minority student body
eclsk.sub$minorsch<-ifelse(eclsk.sub$x2krceth <0, NA, eclsk.sub$x2krceth/10)

#Unsafe neighborhood
eclsk.sub$unsafe<-recode(eclsk.sub$p2safepl , recodes = "1:2='unsafe'; 3='safe'; else=NA", as.factor.result = T)
#Show the first few lines of the data
head(eclsk.sub)
##   x_chsex_r x1locale x_raceth_r x2povty x12par1ed_i p1curmar x1htotal
## 1         1        4          1       2           5        1        6
## 2         2        2          1       3           5        1        4
## 3         1       NA          3       3           8       NA       NA
## 4         1        4          1       3           5        1        4
## 5         2        2          3       2           4        1        5
## 6         2        4          1       2           3        3        4
##   x1mscalk1 s1_id p2safepl x2krceth    math male hisp black asian nahn
## 1   44.9695  1861        2     8.00 44.9695    1    0     0     0    0
## 2   30.1857  2113        3    30.00 30.1857    0    0     0     0    0
## 3        NA              3    36.00      NA    1    1     0     0    0
## 4   13.8410  2070        2    18.00 13.8410    1    0     0     0    0
## 5   27.6587  1079        3    90.48 27.6587    0    1     0     0    0
## 6   22.2918  1210        3     2.00 22.2918    0    0     0     0    0
##   other lths gths single notmar urban pov hhsize minorsch unsafe
## 1     0    0    1      0      0     0   1      6    0.800 unsafe
## 2     0    0    1      0      0     1   0      4    3.000   safe
## 3     0    0    1      0      0    NA   0     NA    3.600   safe
## 4     0    0    1      0      0     0   0      4    1.800 unsafe
## 5     0    0    1      0      0     1   1      5    9.048   safe
## 6     0    0    0      0      1     0   1      4    0.200   safe

Modeling

Model 1

The First model will fit a model that considers the individual and family level variables and a random intercept only. One may state a reserach question such as:

“Do children living in households in poverty have lower math test scores than children living in households that are not in poverty, net of other demographic and socioeconomic factors?”

So the hypothesis of interest here involves us testing whether the effec of household poverty shows a \(\beta\) coefficient different from zero.

If we were to write the symbolic form of this model it would be: \[y_{ij} = \beta_{0j} + \sum {\beta_k x_{ik}} + e_{ij}\] \[\beta_{0j} = \beta_0 + u_j\] \[u_j \sim N(0, \sigma^2_u)\]

fit1<-lmer(math~male+hisp+black+asian+nahn+other+lths+gths+single+notmar+urban+pov+hhsize+(1|s1_id), data=eclsk.sub, REML=T)

display(fit1)
## lmer(formula = math ~ male + hisp + black + asian + nahn + other + 
##     lths + gths + single + notmar + urban + pov + hhsize + (1 | 
##     s1_id), data = eclsk.sub, REML = T)
##             coef.est coef.se
## (Intercept) 33.53     0.46  
## male         0.56     0.19  
## hisp        -3.21     0.30  
## black       -3.08     0.37  
## asian        2.88     0.42  
## nahn        -1.62     0.94  
## other        0.56     0.44  
## lths        -2.13     0.37  
## gths         3.08     0.26  
## single      -2.21     0.31  
## notmar      -2.23     0.32  
## urban       -0.09     0.11  
## pov         -3.03     0.24  
## hhsize      -0.33     0.08  
## 
## Error terms:
##  Groups   Name        Std.Dev.
##  s1_id    (Intercept) 3.29    
##  Residual             9.46    
## ---
## number of obs: 10633, groups: s1_id, 852
## AIC = 78737.6, DIC = 78681.8
## deviance = 78693.7

The coefficient for pov in the model is -3.03, versus the standard error which is 0.2380874, which gives a test of -12.7178194, which, if we use a z-statisic, yields a p-value of : < 1e-05. This tells us that, yes, children in households below the poverty line have significantly lower math scores than children not living in households in poverty.

Model 2

The second model considers both random intercepts and random slopes, in this case, i’m only considering a random slope for poverty status. This would imply the proposition:

“Do children living in poverty face a disadvantage in terms of their math test score equally in every school?”

If we were to write the symbolic form of this model it would be: \[y_{ij} = \beta_{0j} + \sum {\beta_k x_{ik}} + \beta_{j \ pov} \ pov_i + e_{ij}\] \[\beta_{0j} = \beta_0 + u_{1j}\] \[\beta_{j \ pov} = \beta_{pov} + u_{2j}\] \[\mathbf{u} \text{~} \mathbf{MVN (0, \Sigma )}\] \[\mathbf{\Sigma = \begin{bmatrix} \sigma_1&\sigma_{12} \\ \sigma_{21}&\sigma_{2} \end{bmatrix}}\]

the term \(\beta_{j \ pov} \ pov_i\) is the effect of poverty for the \(j^{th}\) school. Since we have two random effects in the model at the school level, it’s easier to model them as a multivariate normal distribution, which is how the lmer() function does it.

fit2<-lmer(math~male+hisp+black+asian+nahn+other+lths+gths+single+notmar+urban+pov+hhsize+(1+pov|s1_id), data = eclsk.sub, REML = T)
display(fit2)
## lmer(formula = math ~ male + hisp + black + asian + nahn + other + 
##     lths + gths + single + notmar + urban + pov + hhsize + (1 + 
##     pov | s1_id), data = eclsk.sub, REML = T)
##             coef.est coef.se
## (Intercept) 33.47     0.47  
## male         0.55     0.19  
## hisp        -3.20     0.29  
## black       -3.04     0.37  
## asian        2.87     0.43  
## nahn        -1.63     0.94  
## other        0.55     0.44  
## lths        -2.16     0.37  
## gths         3.06     0.26  
## single      -2.20     0.31  
## notmar      -2.22     0.32  
## urban       -0.10     0.11  
## pov         -3.07     0.25  
## hhsize      -0.33     0.08  
## 
## Error terms:
##  Groups   Name        Std.Dev. Corr  
##  s1_id    (Intercept) 3.76           
##           pov         2.50     -0.60 
##  Residual             9.40           
## ---
## number of obs: 10633, groups: s1_id, 852
## AIC = 78719.4, DIC = 78659.7
## deviance = 78671.6
anova (fit1, fit2)
## refitting model(s) with ML (instead of REML)
## Data: eclsk.sub
## Models:
## fit1: math ~ male + hisp + black + asian + nahn + other + lths + gths + 
## fit1:     single + notmar + urban + pov + hhsize + (1 | s1_id)
## fit2: math ~ male + hisp + black + asian + nahn + other + lths + gths + 
## fit2:     single + notmar + urban + pov + hhsize + (1 + pov | s1_id)
##      Df   AIC   BIC logLik deviance  Chisq Chi Df Pr(>Chisq)    
## fit1 16 78726 78842 -39347    78694                             
## fit2 18 78708 78838 -39336    78672 22.133      2  1.563e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

In this case, fitting the random slope for poverty adds some information to the model, because when the two models are compared, the second model with the random effect for poverty fits the data better than the first model without this term in the model.

I interpret this as kids living in housholds below the poverty line face a systematic disadvantage, because of the fixed effect coefficient of -3.07, but the disadvantage varies by school.

Intra-class correlations

Here is the ICC from fits 1 & 2

VarCorr(fit1)
##  Groups   Name        Std.Dev.
##  s1_id    (Intercept) 3.2851  
##  Residual             9.4551
VarCorr(fit1)$s1_id[1]/((attr(VarCorr(fit1), "sc")^2)+VarCorr(fit1)$s1_id[1])
## [1] 0.1077105
#the "sc" attribute is the residual variance, who knew?

VarCorr(fit2)
##  Groups   Name        Std.Dev. Corr  
##  s1_id    (Intercept) 3.7591         
##           pov         2.5032   -0.603
##  Residual             9.3989
(VarCorr(fit2)$s1_id[1]+VarCorr(fit2)$s1_id[4])/((attr(VarCorr(fit2),"sc")^2)+VarCorr(fit2)$s1_id[1]+VarCorr(fit2)$s1_id[4])
## [1] 0.187581
#the "sc" attribute is the residual variance

These results suggest that by including the random slope in fit2, we reduce the correlation within schools.

Visualizations of effects

Now, let’s plot the regression lines for each school! This way we can visualize how the mean of each school varies, as well as the effect of poverty varying between schools.

#get the random effects
rancoefs2<-ranef(fit2)

#here are the first 10 schools random effects
head(rancoefs2$s1_id, n=10)
##      (Intercept)         pov
## 1002   2.5141769 -0.23852188
## 1003   2.3645059 -0.74565350
## 1006  -0.1954302  0.66898840
## 1007   0.8748295  0.16796915
## 1014   1.5140040 -1.61051473
## 1015  -2.2733724  0.91271747
## 1016  -2.2856026  1.11051161
## 1017   0.4706594  0.07508403
## 1019   1.0361880  0.45158129
## 1021  -1.5981529  1.06612814
#Histograms of the random effects
par(mfrow=c(1,2))
hist(rancoefs2$s1_id[,"(Intercept)"],xlab= "Intercept Random Effect",  main = "Distribution of Random Intercepts")
hist(rancoefs2$s1_id[,"pov"], xlab= "Poverty Slope Random Effect", main = "Distribution of Random Slopes")

Here is the bivariate plot of each school’s random effect for the intercept and the slope for poverty. This should show a negative trend, since in the output for the fit2 model, the correlation in the random effects was -0.6029184.

par(mfrow=c(1,1))
plot(rancoefs2$s1_id[,"(Intercept)"],rancoefs2$s1_id[,"pov"] )

Here, I plot the random slopes, to do this, I iterate over all schools, drawing lines for each one. The intercept of each line is the global intercept from fit2 + the random intecept component for each school. The slope for each line is the average slope, the fixed effect for poverty from fit2, plus the random slope component for each school.

par(mfrow=c(1,1))
fixef(fit2)
## (Intercept)        male        hisp       black       asian        nahn 
## 33.46774868  0.55159138 -3.20139787 -3.03705414  2.87056232 -1.62991440 
##       other        lths        gths      single      notmar       urban 
##  0.54625369 -2.15657246  3.06143653 -2.19560089 -2.22080172 -0.09537131 
##         pov      hhsize 
## -3.06829612 -0.32574249
summary(fixef(fit2)[1]+rancoefs2$s1_id) # I do this to get values for the range for the y axis.
##   (Intercept)         pov       
##  Min.   :25.94   Min.   :28.18  
##  1st Qu.:31.56   1st Qu.:32.82  
##  Median :33.29   Median :33.50  
##  Mean   :33.47   Mean   :33.47  
##  3rd Qu.:35.19   3rd Qu.:34.16  
##  Max.   :43.34   Max.   :37.01
plot(NULL, ylim=c(25,45), xlim=c(0,1),ylab="Math Score", xlab="HH Poverty Status") #You may have to change the ylim() for your outcome.
title (main="Regression Lines for each school from Random Slope and Intercept Model")
cols=sample(rainbow(n=25), size = dim(rancoefs2$s1_id)[1], replace = T)
for (i in 1: dim(rancoefs2$s1_id)[1]){
  
  abline(a=fixef(fit2)[1]+rancoefs2$s1_id[[1]][i], b=fixef(fit2)[13]+rancoefs2$s1_id[[2]][i], col=cols[i],lwd=.5 )
}
#this line shows the "average" effect of poverty.
abline(a=fixef(fit2)[1], b=fixef(fit2)[13], col=1, lwd=3)
legend("topright", col=1, lwd=3, legend="Average Effect of Poverty")