Data: The data we will use here come from a path-breaking article by Bryk and Raudenbush, published in 1992. The data consist of 7185 individual students. For each student, we have their scores on a Mathematics Achievement test, and we would like to determine the important determinants of variation in math achievements. For each student, we have measured factors such as the school they go to, their race and sex, a measure of the student's socio-economic status (SES), etc. It is reasonable that the studen't success is partially determined by the school they go to, so for each school, we also know things such as it's size, whether it is private or public, what the average SES of students is, etc.
library(nlme)
data(MathAchieve)
head(MathAchieve)
## Grouped Data: MathAch ~ SES | School
## School Minority Sex SES MathAch MEANSES
## 1 1224 No Female -1.528 5.876 -0.428
## 2 1224 No Female -0.588 19.708 -0.428
## 3 1224 No Male -0.528 20.349 -0.428
## 4 1224 No Male -0.668 8.781 -0.428
## 5 1224 No Male -0.158 17.898 -0.428
## 6 1224 No Male 0.022 4.583 -0.428
data(MathAchSchool)
head(MathAchSchool)
## School Size Sector PRACAD DISCLIM HIMINTY MEANSES
## 1224 1224 842 Public 0.35 1.597 0 -0.428
## 1288 1288 1855 Public 0.27 0.174 0 0.128
## 1296 1296 1719 Public 0.32 -0.137 1 -0.420
## 1308 1308 716 Catholic 0.96 -0.622 0 0.534
## 1317 1317 455 Catholic 0.95 -1.694 1 0.351
## 1358 1358 1430 Public 0.25 1.535 0 -0.014
# School data cleverly have the school ID as a row.name. This means we
# can use school ID to easily select data Bind the School data onto the
# individuals (this will repeat the school data for each student in the
# same school)
bryk <- cbind(MathAchieve, MathAchSchool[as.character(MathAchieve$School), ])
head(bryk) # Note that there are two MEANSES columns. This might be a problem, or might not. Only time will tell.
## School Minority Sex SES MathAch MEANSES School Size Sector PRACAD
## 1 1224 No Female -1.528 5.876 -0.428 1224 842 Public 0.35
## 2 1224 No Female -0.588 19.708 -0.428 1224 842 Public 0.35
## 3 1224 No Male -0.528 20.349 -0.428 1224 842 Public 0.35
## 4 1224 No Male -0.668 8.781 -0.428 1224 842 Public 0.35
## 5 1224 No Male -0.158 17.898 -0.428 1224 842 Public 0.35
## 6 1224 No Male 0.022 4.583 -0.428 1224 842 Public 0.35
## DISCLIM HIMINTY MEANSES
## 1 1.597 0 -0.428
## 2 1.597 0 -0.428
## 3 1.597 0 -0.428
## 4 1.597 0 -0.428
## 5 1.597 0 -0.428
## 6 1.597 0 -0.428
# Select 20 catholic schools and 20 private schools, just for easier
# visualization Note the use of subset to select on sector, unique to
# remove duplicates, sample to randomly reorder the schools, and then
# [1:20] to take the top 20 (after reordering) I randomized the order so
# that the 20 we look at would be 'representative'
cat_20 <- sample(unique(subset(bryk, Sector == "Catholic")$School))[1:20]
pub_20 <- sample(unique(subset(bryk, Sector == "Public")$School))[1:20]
# Now, create a subsetted dataset:
bryk_cat_20 <- subset(bryk, School %in% cat_20)
bryk_pub_20 <- subset(bryk, School %in% pub_20)
# Now use ggplot to plot the relation between math and ses for each school
library(ggplot2)
ggplot(aes(x = SES, y = MathAch), data = bryk_cat_20) + geom_point(aes(color = Minority)) +
geom_smooth(method = "loess", span = 1) + facet_wrap(~School, nrow = 5,
ncol = 4) + ggtitle("Catholic Schools")
# In geom_smooth, I spedified the method to avoid a message coming up, and
# I played with the span to get a smoother curve.
ggplot(aes(x = SES, y = MathAch), data = bryk_pub_20) + geom_point(aes(color = Minority)) +
geom_smooth(method = "loess", span = 1) + facet_wrap(~School, nrow = 5,
ncol = 4) + ggtitle("Public Schools")
It is admittedly difficult to see an overall pattern emerging. There is a wide variety of relationsihps between ses and math achievement, from positive relationship, through no relationship, to negative relationship; and from linear, to quadratic, to other shapes. But we also see that public and catholic schools are decidedly different. There is much more segregation in public schools, both in terms of race and SES. In other words, the Catholic schools appear to be more similar to one another in terms of race and ses, but that there is a big difference between Public Schools.
library(plyr)
# Estimate all of the regression coefficients:
school_coef <- ddply(bryk, .(School), function(x) lm(MathAch ~ SES, data = x)$coef)
head(school_coef) # That was pretty cool, huh?
## School (Intercept) SES
## 1 8367 4.546 0.2504
## 2 8854 5.707 1.9388
## 3 4458 6.999 1.1318
## 4 5762 3.114 -1.0141
## 5 6990 6.441 0.9477
## 6 5815 9.324 3.0180
# Add a column with the Sector
school_coef$Sector <- MathAchSchool[school_coef$School, "Sector"]
names(school_coef) <- c("School", "Intercept", "SES", "Sector") # Remove the parenthesis from the (intercept) name
ggplot(data = school_coef) + geom_boxplot(aes(x = Sector, y = SES)) + ggtitle("Distribution of slopes for Math ~ SES")
ggplot(data = school_coef) + geom_boxplot(aes(x = Sector, y = Intercept)) +
ggtitle("Distribution of intercepts for Math ~ SES")
# Perform an anova to see of the intercepts and slopes are different
# between Public and Catholic Schools
anova(lm(Intercept ~ Sector, data = school_coef))
## Analysis of Variance Table
##
## Response: Intercept
## Df Sum Sq Mean Sq F value Pr(>F)
## Sector 1 0 0.12 0.02 0.9
## Residuals 158 1231 7.79
anova(lm(SES ~ Sector, data = school_coef))
## Analysis of Variance Table
##
## Response: SES
## Df Sum Sq Mean Sq F value Pr(>F)
## Sector 1 1 1.44 0.54 0.46
## Residuals 158 422 2.67
So we haven't controlled for any of the school characteristics, but it appears that, even though there is a wide variety in relations between ses and math achievement. This variability is so large, that it is difficule to tell if there are any real differences between Public and Catholic schools.
IMPORTANT NOTE: When working with mixed effects models, it is usually advised to start with an full (everything in the kitchen sink) model. The reason for this is that while counfounding is always a problem, it is even more of a problem in random effects models. So, it is best to try and get the model right without any confounding. If you need to do some model exploration, it is better to start with a full modol, and evaluate whether to drop variables, than it is to start with a simple model and evaluate whether to add variables.
# Calculate a individual SES - school mean SES
bryk$CSES <- bryk$SES - bryk$MEANSES
bryk.lme.1 <- lme(MathAch ~ MEANSES * CSES + Sector * CSES, random = ~1 + CSES |
School, data = bryk)
summary(bryk.lme.1)
## Linear mixed-effects model fit by REML
## Data: bryk
## AIC BIC logLik
## 46524 46592 -23252
##
## Random effects:
## Formula: ~1 + CSES | School
## Structure: General positive-definite, Log-Cholesky parametrization
## StdDev Corr
## (Intercept) 1.5433 (Intr)
## CSES 0.3185 0.392
## Residual 6.0598
##
## Fixed effects: MathAch ~ MEANSES * CSES + Sector * CSES
## Value Std.Error DF t-value p-value
## (Intercept) 12.114 0.1988 7022 60.93 0e+00
## MEANSES 5.339 0.3693 157 14.46 0e+00
## CSES 2.939 0.1551 7022 18.95 0e+00
## SectorCatholic 1.217 0.3064 157 3.97 1e-04
## MEANSES:CSES 1.039 0.2989 7022 3.48 5e-04
## CSES:SectorCatholic -1.643 0.2398 7022 -6.85 0e+00
## Correlation:
## (Intr) MEANSES CSES SctrCt MEANSES:
## MEANSES 0.245
## CSES 0.080 0.020
## SectorCatholic -0.697 -0.356 -0.056
## MEANSES:CSES 0.019 0.079 0.282 -0.028
## CSES:SectorCatholic -0.056 -0.029 -0.694 0.082 -0.351
##
## Standardized Within-Group Residuals:
## Min Q1 Med Q3 Max
## -3.15921 -0.72319 0.01707 0.75440 2.95821
##
## Number of Observations: 7185
## Number of Groups: 160
intervals(bryk.lme.1)
## Approximate 95% confidence intervals
##
## Fixed effects:
## lower est. upper
## (Intercept) 11.7239 12.114 12.503
## MEANSES 4.6097 5.339 6.069
## CSES 2.6347 2.939 3.243
## SectorCatholic 0.6115 1.217 1.822
## MEANSES:CSES 0.4529 1.039 1.625
## CSES:SectorCatholic -2.1127 -1.643 -1.173
## attr(,"label")
## [1] "Fixed effects:"
##
## Random Effects:
## Level: School
## lower est. upper
## sd((Intercept)) 1.32521 1.5433 1.7974
## sd(CSES) 0.06303 0.3185 1.6090
## cor((Intercept),CSES) -0.42063 0.3919 0.8556
##
## Within-group standard error:
## lower est. upper
## 5.960 6.060 6.161
The ~CSES | School asks for a random effects model, with a random intercept (which is implied), and a random slope on CSES. The | School is the “grouping” variable. It must be a factor, and identifies how the data are to be grouped. In this case, we want each school to be it's own group.
Determining an acceptable random effects structure
bryk.lme.2 <- update(bryk.lme.1, random = ~1 | School) # omitting random effect of cses
anova(bryk.lme.1, bryk.lme.2)
## Model df AIC BIC logLik Test L.Ratio p-value
## bryk.lme.1 1 10 46524 46592 -23252
## bryk.lme.2 2 8 46521 46576 -23252 1 vs 2 1.126 0.5694
bryk.lme.3 <- update(bryk.lme.1, random = ~CSES - 1 | School) # omitting random intercept
anova(bryk.lme.1, bryk.lme.3)
## Model df AIC BIC logLik Test L.Ratio p-value
## bryk.lme.1 1 10 46524 46592 -23252
## bryk.lme.3 2 8 46740 46795 -23362 1 vs 2 220.6 <.0001
summary(bryk.lme.2)
## Linear mixed-effects model fit by REML
## Data: bryk
## AIC BIC logLik
## 46521 46576 -23252
##
## Random effects:
## Formula: ~1 | School
## (Intercept) Residual
## StdDev: 1.541 6.064
##
## Fixed effects: MathAch ~ MEANSES * CSES + Sector * CSES
## Value Std.Error DF t-value p-value
## (Intercept) 12.114 0.1986 7022 60.98 0e+00
## MEANSES 5.343 0.3690 157 14.48 0e+00
## CSES 2.936 0.1507 7022 19.48 0e+00
## SectorCatholic 1.215 0.3061 157 3.97 1e-04
## MEANSES:CSES 1.044 0.2910 7022 3.59 3e-04
## CSES:SectorCatholic -1.642 0.2331 7022 -7.04 0e+00
## Correlation:
## (Intr) MEANSES CSES SctrCt MEANSES:
## MEANSES 0.245
## CSES 0.005 0.001
## SectorCatholic -0.697 -0.356 -0.003
## MEANSES:CSES 0.001 0.005 0.284 -0.002
## CSES:SectorCatholic -0.003 -0.002 -0.694 0.005 -0.351
##
## Standardized Within-Group Residuals:
## Min Q1 Med Q3 Max
## -3.17006 -0.72487 0.01488 0.75425 2.96552
##
## Number of Observations: 7185
## Number of Groups: 160
In summary, we can not reject the hypothesis that models 1 and 2 are equivalent. But we can be certain that models 1 and 3 are different. So there seems to definitely be a random intercept, but not necessarily a random slope.
Determining an acceptable fixed effects structure
bryk.lme.2a <- update(bryk.lme.2, fixed. = MathAch ~ MEANSES * CSES + Sector)
# Plotting the estimates
bryk.pred <- as.data.frame(fitted(bryk.lme.2, level = 0:1))
names(bryk.pred) <- c("pred_fixed", "pred_random")
bryk.pred <- cbind(bryk, bryk.pred)
ggplot(data = bryk.pred) + geom_line(aes(x = SES, y = pred_fixed, group = School,
order = SES, color = Sector))
ggplot(data = bryk.pred) + geom_line(aes(x = SES, y = pred_random, group = School,
order = SES, color = Sector))