Mixed Effects Models

Data: The data we will use here come from a path-breaking article by Bryk and Raudenbush, published in 1992. The data consist of 7185 individual students. For each student, we have their scores on a Mathematics Achievement test, and we would like to determine the important determinants of variation in math achievements. For each student, we have measured factors such as the school they go to, their race and sex, a measure of the student's socio-economic status (SES), etc. It is reasonable that the studen't success is partially determined by the school they go to, so for each school, we also know things such as it's size, whether it is private or public, what the average SES of students is, etc.

library(nlme)
data(MathAchieve)
head(MathAchieve)

## Grouped Data: MathAch ~ SES | School
##   School Minority    Sex    SES MathAch MEANSES
## 1   1224       No Female -1.528   5.876  -0.428
## 2   1224       No Female -0.588  19.708  -0.428
## 3   1224       No   Male -0.528  20.349  -0.428
## 4   1224       No   Male -0.668   8.781  -0.428
## 5   1224       No   Male -0.158  17.898  -0.428
## 6   1224       No   Male  0.022   4.583  -0.428

data(MathAchSchool)
head(MathAchSchool)

##      School Size   Sector PRACAD DISCLIM HIMINTY MEANSES
## 1224   1224  842   Public   0.35   1.597       0  -0.428
## 1288   1288 1855   Public   0.27   0.174       0   0.128
## 1296   1296 1719   Public   0.32  -0.137       1  -0.420
## 1308   1308  716 Catholic   0.96  -0.622       0   0.534
## 1317   1317  455 Catholic   0.95  -1.694       1   0.351
## 1358   1358 1430   Public   0.25   1.535       0  -0.014

# School data cleverly have the school ID as a row.name.  This means we
# can use school ID to easily select data Bind the School data onto the
# individuals (this will repeat the school data for each student in the
# same school)
bryk <- cbind(MathAchieve, MathAchSchool[as.character(MathAchieve$School), ])
head(bryk)  # Note that there are two MEANSES columns.  This might be a problem, or might not.  Only time will tell.

##   School Minority    Sex    SES MathAch MEANSES School Size Sector PRACAD
## 1   1224       No Female -1.528   5.876  -0.428   1224  842 Public   0.35
## 2   1224       No Female -0.588  19.708  -0.428   1224  842 Public   0.35
## 3   1224       No   Male -0.528  20.349  -0.428   1224  842 Public   0.35
## 4   1224       No   Male -0.668   8.781  -0.428   1224  842 Public   0.35
## 5   1224       No   Male -0.158  17.898  -0.428   1224  842 Public   0.35
## 6   1224       No   Male  0.022   4.583  -0.428   1224  842 Public   0.35
##   DISCLIM HIMINTY MEANSES
## 1   1.597       0  -0.428
## 2   1.597       0  -0.428
## 3   1.597       0  -0.428
## 4   1.597       0  -0.428
## 5   1.597       0  -0.428
## 6   1.597       0  -0.428

# Select 20 catholic schools and 20 private schools, just for easier
# visualization Note the use of subset to select on sector, unique to
# remove duplicates, sample to randomly reorder the schools, and then
# [1:20] to take the top 20 (after reordering) I randomized the order so
# that the 20 we look at would be 'representative'
cat_20 <- sample(unique(subset(bryk, Sector == "Catholic")$School))[1:20]
pub_20 <- sample(unique(subset(bryk, Sector == "Public")$School))[1:20]


# Now, create a subsetted dataset:
bryk_cat_20 <- subset(bryk, School %in% cat_20)
bryk_pub_20 <- subset(bryk, School %in% pub_20)

# Now use ggplot to plot the relation between math and ses for each school
library(ggplot2)
ggplot(aes(x = SES, y = MathAch), data = bryk_cat_20) + geom_point(aes(color = Minority)) + 
    geom_smooth(method = "loess", span = 1) + facet_wrap(~School, nrow = 5, 
    ncol = 4) + ggtitle("Catholic Schools")

plot of chunk unnamed-chunk-2

# In geom_smooth, I spedified the method to avoid a message coming up, and
# I played with the span to get a smoother curve.
ggplot(aes(x = SES, y = MathAch), data = bryk_pub_20) + geom_point(aes(color = Minority)) + 
    geom_smooth(method = "loess", span = 1) + facet_wrap(~School, nrow = 5, 
    ncol = 4) + ggtitle("Public Schools")

plot of chunk unnamed-chunk-2

It is admittedly difficult to see an overall pattern emerging. There is a wide variety of relationsihps between ses and math achievement, from positive relationship, through no relationship, to negative relationship; and from linear, to quadratic, to other shapes. But we also see that public and catholic schools are decidedly different. There is much more segregation in public schools, both in terms of race and SES. In other words, the Catholic schools appear to be more similar to one another in terms of race and ses, but that there is a big difference between Public Schools.

library(plyr)
# Estimate all of the regression coefficients:
school_coef <- ddply(bryk, .(School), function(x) lm(MathAch ~ SES, data = x)$coef)
head(school_coef)  # That was pretty cool, huh?

##   School (Intercept)     SES
## 1   8367       4.546  0.2504
## 2   8854       5.707  1.9388
## 3   4458       6.999  1.1318
## 4   5762       3.114 -1.0141
## 5   6990       6.441  0.9477
## 6   5815       9.324  3.0180

# Add a column with the Sector
school_coef$Sector <- MathAchSchool[school_coef$School, "Sector"]
names(school_coef) <- c("School", "Intercept", "SES", "Sector")  # Remove the parenthesis from the (intercept) name
ggplot(data = school_coef) + geom_boxplot(aes(x = Sector, y = SES)) + ggtitle("Distribution of slopes for Math ~ SES")

plot of chunk unnamed-chunk-3

ggplot(data = school_coef) + geom_boxplot(aes(x = Sector, y = Intercept)) + 
    ggtitle("Distribution of intercepts for Math ~ SES")

plot of chunk unnamed-chunk-3

# Perform an anova to see of the intercepts and slopes are different
# between Public and Catholic Schools
anova(lm(Intercept ~ Sector, data = school_coef))

## Analysis of Variance Table
## 
## Response: Intercept
##            Df Sum Sq Mean Sq F value Pr(>F)
## Sector      1      0    0.12    0.02    0.9
## Residuals 158   1231    7.79

anova(lm(SES ~ Sector, data = school_coef))

## Analysis of Variance Table
## 
## Response: SES
##            Df Sum Sq Mean Sq F value Pr(>F)
## Sector      1      1    1.44    0.54   0.46
## Residuals 158    422    2.67

So we haven't controlled for any of the school characteristics, but it appears that, even though there is a wide variety in relations between ses and math achievement. This variability is so large, that it is difficule to tell if there are any real differences between Public and Catholic schools.

IMPORTANT NOTE: When working with mixed effects models, it is usually advised to start with an full (everything in the kitchen sink) model. The reason for this is that while counfounding is always a problem, it is even more of a problem in random effects models. So, it is best to try and get the model right without any confounding. If you need to do some model exploration, it is better to start with a full modol, and evaluate whether to drop variables, than it is to start with a simple model and evaluate whether to add variables.

# Calculate a individual SES - school mean SES
bryk$CSES <- bryk$SES - bryk$MEANSES
bryk.lme.1 <- lme(MathAch ~ MEANSES * CSES + Sector * CSES, random = ~1 + CSES | 
    School, data = bryk)
summary(bryk.lme.1)

## Linear mixed-effects model fit by REML
##  Data: bryk 
##     AIC   BIC logLik
##   46524 46592 -23252
## 
## Random effects:
##  Formula: ~1 + CSES | School
##  Structure: General positive-definite, Log-Cholesky parametrization
##             StdDev Corr  
## (Intercept) 1.5433 (Intr)
## CSES        0.3185 0.392 
## Residual    6.0598       
## 
## Fixed effects: MathAch ~ MEANSES * CSES + Sector * CSES 
##                      Value Std.Error   DF t-value p-value
## (Intercept)         12.114    0.1988 7022   60.93   0e+00
## MEANSES              5.339    0.3693  157   14.46   0e+00
## CSES                 2.939    0.1551 7022   18.95   0e+00
## SectorCatholic       1.217    0.3064  157    3.97   1e-04
## MEANSES:CSES         1.039    0.2989 7022    3.48   5e-04
## CSES:SectorCatholic -1.643    0.2398 7022   -6.85   0e+00
##  Correlation: 
##                     (Intr) MEANSES CSES   SctrCt MEANSES:
## MEANSES              0.245                               
## CSES                 0.080  0.020                        
## SectorCatholic      -0.697 -0.356  -0.056                
## MEANSES:CSES         0.019  0.079   0.282 -0.028         
## CSES:SectorCatholic -0.056 -0.029  -0.694  0.082 -0.351  
## 
## Standardized Within-Group Residuals:
##      Min       Q1      Med       Q3      Max 
## -3.15921 -0.72319  0.01707  0.75440  2.95821 
## 
## Number of Observations: 7185
## Number of Groups: 160

intervals(bryk.lme.1)

## Approximate 95% confidence intervals
## 
##  Fixed effects:
##                       lower   est.  upper
## (Intercept)         11.7239 12.114 12.503
## MEANSES              4.6097  5.339  6.069
## CSES                 2.6347  2.939  3.243
## SectorCatholic       0.6115  1.217  1.822
## MEANSES:CSES         0.4529  1.039  1.625
## CSES:SectorCatholic -2.1127 -1.643 -1.173
## attr(,"label")
## [1] "Fixed effects:"
## 
##  Random Effects:
##   Level: School 
##                          lower   est.  upper
## sd((Intercept))        1.32521 1.5433 1.7974
## sd(CSES)               0.06303 0.3185 1.6090
## cor((Intercept),CSES) -0.42063 0.3919 0.8556
## 
##  Within-group standard error:
## lower  est. upper 
## 5.960 6.060 6.161

The ~CSES | School asks for a random effects model, with a random intercept (which is implied), and a random slope on CSES. The | School is the “grouping” variable. It must be a factor, and identifies how the data are to be grouped. In this case, we want each school to be it's own group.

Determining an acceptable random effects structure

bryk.lme.2 <- update(bryk.lme.1, random = ~1 | School)  # omitting random effect of cses
anova(bryk.lme.1, bryk.lme.2)

##            Model df   AIC   BIC logLik   Test L.Ratio p-value
## bryk.lme.1     1 10 46524 46592 -23252                       
## bryk.lme.2     2  8 46521 46576 -23252 1 vs 2   1.126  0.5694


bryk.lme.3 <- update(bryk.lme.1, random = ~CSES - 1 | School)  # omitting random intercept
anova(bryk.lme.1, bryk.lme.3)

##            Model df   AIC   BIC logLik   Test L.Ratio p-value
## bryk.lme.1     1 10 46524 46592 -23252                       
## bryk.lme.3     2  8 46740 46795 -23362 1 vs 2   220.6  <.0001


summary(bryk.lme.2)

## Linear mixed-effects model fit by REML
##  Data: bryk 
##     AIC   BIC logLik
##   46521 46576 -23252
## 
## Random effects:
##  Formula: ~1 | School
##         (Intercept) Residual
## StdDev:       1.541    6.064
## 
## Fixed effects: MathAch ~ MEANSES * CSES + Sector * CSES 
##                      Value Std.Error   DF t-value p-value
## (Intercept)         12.114    0.1986 7022   60.98   0e+00
## MEANSES              5.343    0.3690  157   14.48   0e+00
## CSES                 2.936    0.1507 7022   19.48   0e+00
## SectorCatholic       1.215    0.3061  157    3.97   1e-04
## MEANSES:CSES         1.044    0.2910 7022    3.59   3e-04
## CSES:SectorCatholic -1.642    0.2331 7022   -7.04   0e+00
##  Correlation: 
##                     (Intr) MEANSES CSES   SctrCt MEANSES:
## MEANSES              0.245                               
## CSES                 0.005  0.001                        
## SectorCatholic      -0.697 -0.356  -0.003                
## MEANSES:CSES         0.001  0.005   0.284 -0.002         
## CSES:SectorCatholic -0.003 -0.002  -0.694  0.005 -0.351  
## 
## Standardized Within-Group Residuals:
##      Min       Q1      Med       Q3      Max 
## -3.17006 -0.72487  0.01488  0.75425  2.96552 
## 
## Number of Observations: 7185
## Number of Groups: 160

In summary, we can not reject the hypothesis that models 1 and 2 are equivalent. But we can be certain that models 1 and 3 are different. So there seems to definitely be a random intercept, but not necessarily a random slope.

Determining an acceptable fixed effects structure

bryk.lme.2a <- update(bryk.lme.2, fixed. = MathAch ~ MEANSES * CSES + Sector)

# Plotting the estimates
bryk.pred <- as.data.frame(fitted(bryk.lme.2, level = 0:1))
names(bryk.pred) <- c("pred_fixed", "pred_random")
bryk.pred <- cbind(bryk, bryk.pred)
ggplot(data = bryk.pred) + geom_line(aes(x = SES, y = pred_fixed, group = School, 
    order = SES, color = Sector))

plot of chunk unnamed-chunk-7

ggplot(data = bryk.pred) + geom_line(aes(x = SES, y = pred_random, group = School, 
    order = SES, color = Sector))

plot of chunk unnamed-chunk-7