1 Introduction

We use data on 1721 students nested within 60 urban public primary schools. The outcome of interest is mathematics achievement. The data were collected at the end of first grade and annually thereafter up to sixth grade, but not all students have six observations.

# read spss format and covert it to data frame 
dta <- foreign::read.spss("eg_hlm.sav") %>% as.data.frame

# check data structure
str(dta)
'data.frame':   7230 obs. of  12 variables:
 $ size    : num  380 380 380 380 380 380 380 380 380 380 ...
 $ lowinc  : num  40.3 40.3 40.3 40.3 40.3 40.3 40.3 40.3 40.3 40.3 ...
 $ mobility: num  12.5 12.5 12.5 12.5 12.5 12.5 12.5 12.5 12.5 12.5 ...
 $ female  : num  0 0 0 0 0 0 0 0 0 0 ...
 $ black   : num  0 0 0 0 0 0 0 0 0 0 ...
 $ hispanic: num  1 1 1 0 0 0 0 0 1 1 ...
 $ year    : num  0.5 1.5 2.5 -1.5 -0.5 0.5 1.5 2.5 -1.5 -0.5 ...
 $ grade   : num  2 3 4 0 1 2 3 4 0 1 ...
 $ math    : num  1.146 1.134 2.3 -1.303 0.439 ...
 $ retained: num  0 0 0 0 0 0 0 0 0 0 ...
 $ school  : Factor w/ 60 levels "        2020",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ cid     : Factor w/ 1721 levels "   101480302",..: 244 244 244 248 248 248 248 248 253 253 ...
# Create the summary table at the school level and calculate the quantiles
with(dta, table(school)) |> quantile()
    0%    25%    50%    75%   100% 
 18.00  66.75 103.00 162.00 387.00 
# Create the summary table at the student level and calculate the range
with(dta, table(cid)) |> range()
[1] 2 6

We have 60 schools with a median of 103 students for a total of 1721 children, who have between 2 and 6 observations each for a total of 7230 observations.

2 Variance components

We have specified two random part, one at the school level (level 3), and one at the child-level (level 2). They are both intercept-only, adding an error term at each level.

# add new column(name: nc, number of observations per school)
dta <- dta %>% group_by(school) %>% mutate(nc=n()) %>% ungroup()
ggplot(subset(dta, nc > 200), aes(reorder(cid, math, mean), math))+
  stat_summary(fun.data="mean_cl_boot", size = rel(.2))+
  facet_wrap(.~ school)+
  coord_flip()+
  labs(x="Subject ID",
       y="Mean math score")+
  theme_minimal()+
  theme(axis.text.y=element_blank(),
        axis.ticks.y=element_blank())

ggplot(dta, aes(reorder(school, math, mean), math))+
  stat_summary(fun.data="mean_cl_boot", size = rel(.3))+
  coord_flip()+
  labs(x="School ID",
       y="Mean math score")+
  theme_minimal()

m0 <- lme4::lmer(math ~ 1 + (1 | cid) + (1 | school), data=dta)
summary(m0, corr=FALSE)
Linear mixed model fit by REML ['lmerMod']
Formula: math ~ 1 + (1 | cid) + (1 | school)
   Data: dta

REML criterion at convergence: 25309.2

Scaled residuals: 
   Min     1Q Median     3Q    Max 
-2.674 -0.658  0.026  0.674  3.334 

Random effects:
 Groups   Name        Variance Std.Dev.
 cid      (Intercept) 0.570    0.755   
 school   (Intercept) 0.324    0.569   
 Residual             1.524    1.234   
Number of obs: 7230, groups:  cid, 1721; school, 60

Fixed effects:
            Estimate Std. Error t value
(Intercept)  -0.5100     0.0787   -6.48

The estimated mean math score is -0.51 and the variances are 0.32, 0.57 and 1.52 at the school, child and observation levels.

Most of the variance comes from the measurements (within the same child), whereas the variance in the child level is greater than that at the school level. These results indicate fairly significant variation in math achievement among children and in mean math among schools.

# by hand
v <- c(sigma(m0)^2, unlist(VarCorr(m0)))
v[3]/(v[1] + v[2] + v[3])       
  school 
0.134074 
(v[2] + v[3])/(v[1] + v[2] + v[3])
     cid 
0.369935 
# use package
performance::icc(m0, by_group=TRUE)
# ICC by Group

Group  |   ICC
--------------
cid    | 0.236
school | 0.134

At the school level, the intra-class correlation is 0.13; 0.37, at the child level. These values indicate the correlation in math achievement between two children in the same school, and between two measurements on the same child (in the same school). It means that 13% of the variation in math achievement can be attributed to attending the same schools and 37% to the children themselves (which includes being in the same school).

3 Effect of time (measurement level)

The two variables, year and grade, are the same but coded differently. Year zero is near the start of the third grade, and can be referred to as ‘third grade’ for simplicity.

with(dta, table(grade, year))
     year
grade -2.5 -1.5 -0.5  0.5  1.5  2.5
    0  131 1346  104    1    0    0
    1    0    0 1412  194    5    1
    2    0    0    4 1470  174    5
    3    0    0    0    7 1200  149
    4    0    0    0    0    8 1010
    5    0    0    0    0    0    9
ggplot(subset(dta, nc < 67), aes(year, math, group=cid))+
  facet_wrap(. ~ school)+
  stat_smooth(method='lm', formula=y~x, se=F, size=rel(.5), alpha=.5)+
  geom_point(pch=1, size=rel(.5)) +
  labs(x="Year (centered)",
       y="Math score")+
  theme_minimal()

ggplot(dta, aes(year, math, group=school))+
  stat_smooth(method='lm', formula=y~x, se=F, col=8,
              size=rel(.5), alpha=.5)+
  geom_point(pch=1, size=rel(.5)) +
  labs(x="Year (centered)",
       y="Math score",
       subtitle='School level')+
  theme_minimal()

3.1 Problem

  • Plot the relationship between math score and year of age for students in those schools with number of students in the upper quartile.
set <- aggregate(dta$size, list(dta$school),FUN=mean)
 ggplot(subset(dta, size == quantile(set$x, prob=c(.75), type=1)), aes(year, math, , group=cid))+
  facet_wrap(. ~ school)+
  stat_smooth(method='lm', formula=y~x, se=F, size=rel(.5), alpha=.5)+
  geom_point(pch=1, size=rel(.5)) +
  labs(x="Year (centered)",
       y="Math score")+
  theme_minimal()

 # 上圖結果顯示,在學生人數為Q3(the upper quartile)的學校中,每位學生的數學成績皆有逐年增加的趨勢。

We fit a model with random intercepts and slopes, with no predictors by child or school. The variable year is at the observation level.

m1 <- lme4::lmer(math ~ year + (1 + year | cid) + (1 + year | school), data=dta)
summary(m1, corr=FALSE)
Linear mixed model fit by REML ['lmerMod']
Formula: math ~ year + (1 + year | cid) + (1 + year | school)
   Data: dta

REML criterion at convergence: 16336.7

Scaled residuals: 
   Min     1Q Median     3Q    Max 
-3.205 -0.564 -0.037  0.531  5.260 

Random effects:
 Groups   Name        Variance Std.Dev. Corr
 cid      (Intercept) 0.6405   0.800        
          year        0.0113   0.106    0.55
 school   (Intercept) 0.1686   0.411        
          year        0.0113   0.106    0.40
 Residual             0.3014   0.549        
Number of obs: 7230, groups:  cid, 1721; school, 60

Fixed effects:
            Estimate Std. Error t value
(Intercept)  -0.7792     0.0583   -13.4
year          0.7631     0.0154    49.6

The fixed-effect estimates yield an intercept of -0.78 and a slope of 0.76. These values can be interpreted as the average achievement in third grade, and the average rate of change per year. Average math achievement clearly increases significantly over time.

The intercepts and slopes have correlations of 0.40 at the school level and 0.55 at the child level. The latter can be interpreted as the correlation between a child’s achievement around third grade and her rate of growth; the former, the correlation between a school’s math achievement around third grade and the rate of change per year.

Most of the variance in the intercept comes from the children, whereas the variance in the slope is about the same at each level. These results indicate fairly significant variation among schools in mean achievement around third grade, as well as in the rates of change over time.

4 Effects of ethnicity (child) and income (school)

One way to look for school or individual characteristics that may be associated with levels and trends in mathematics achievement is to plot or regress the estimates of intercepts and slopes on potential predictors at each level.

# recode ethnicity from indicator variables into one factor variable with
# four levels with white as the reference 
dta <- mutate(dta, 
              eth = factor(1 + black + 2 * hispanic,
                           labels=c("White", "Black", "Hispanic")))
with(dta, table(eth))
eth
   White    Black Hispanic 
    1229     4977     1024 
ggplot(subset(dta, nc > 167), aes(year, math, group=cid))+
  facet_wrap(eth ~ .)+
  stat_smooth(method='lm', formula=y~x, se=F, size=rel(.5), alpha=.5, col=8)+
  geom_point(pch=1, size=rel(.5)) +
  labs(x="Year (centered)",
       y="Math score")+
  theme_minimal()

4.1 Problem

  • Explore the effect of income (school level) on average math achievement with plots
# 計算平均數學分數
dta <- dta %>% group_by(school, year) %>% mutate(math_mean=mean(math)) %>% ungroup()

dta %>%
    group_by(school) %>%
    summarise(math_mean = mean(math), lowinc_mean = mean(lowinc)) ->dta2
reg<-lm(math_mean ~ lowinc_mean,data=dta2)

# 繪圖
ggplot(dta, aes(x = lowinc, y = math_mean, color = year, group = year))+
  geom_point(size = 1) +
  stat_smooth(method = 'lm', formula = y ~ x, se = F) +
  labs(x="The percent of students with low income",
       y="Average math achievement",
       subtitle='School level, Year zero is near the start of the third grade')+
  geom_abline(intercept = reg$coefficients[1], slope = reg$coefficients[2], col=5)

 # 上圖結果顯示,學校的低社經地位學生比例越高,學校的平均數學表現越差。

Here we consider a model in which a child’s math achievement depends on to which ethnic group he or she belongs (whites, blacks or hispanics); and where the school average math trend depends on the percent of students with low income.

m2 <- lme4::lmer(math ~ year * (eth + lowinc) + (year | cid) + 
                (year | school),  data=dta)
summary(m2, corr=FALSE)
Linear mixed model fit by REML ['lmerMod']
Formula: math ~ year * (eth + lowinc) + (year | cid) + (year | school)
   Data: dta

REML criterion at convergence: 16293.2

Scaled residuals: 
   Min     1Q Median     3Q    Max 
-3.247 -0.565 -0.033  0.534  5.268 

Random effects:
 Groups   Name        Variance Std.Dev. Corr
 cid      (Intercept) 0.62304  0.789        
          year        0.01115  0.106    0.56
 school   (Intercept) 0.08236  0.287        
          year        0.00846  0.092    0.04
 Residual             0.30157  0.549        
Number of obs: 7230, groups:  cid, 1721; school, 60

Fixed effects:
                  Estimate Std. Error t value
(Intercept)       0.140234   0.129912    1.08
year              0.874019   0.040017   21.84
ethBlack         -0.503498   0.078154   -6.44
ethHispanic      -0.319758   0.086258   -3.71
lowinc           -0.007560   0.001721   -4.39
year:ethBlack    -0.030514   0.022550   -1.35
year:ethHispanic  0.043159   0.024710    1.75
year:lowinc      -0.001364   0.000534   -2.56

Starting with a line (along the year) that has 0.1402 points of math achievement by the middle of third grade and increases by 0.8740 points per year, provided the school has no poor students. Each percentage point of low income students is associated with a third grade mean that’s 0.0076 lower and increases 0.0014 points less per year, so a school with 100% poor students would have a third-grade mean of -0.6168 (0.140227-0.756) and an increase of 0.7377 (0.874017-0.1363) points per year. The intercept and slope in any particular school will vary around the average line for its percent low income children.

Blacks have a third-grade mean 0.5035 points below whites in the same school, and the increase per year is 0.0305 points less than for whites in the same school. The results for hispanics indicate a third-grade mean 0.3198 points below whites, but a growth rate per year 0.0432 points higher than whites in the same school. Each child’s growth curve, however, will vary around the average line for kids of the same ethnicity in the same school.

The random effects indicate that the average third-grade score varies by school, with a variance of 0.0824, while the rate of change has a variance of 0.0085. The correlation between intercept and slope is 0.04, suggesting that it can be attributed to the school’s low-income composition.

Variation at the child level is even more sustantial; expected third-grade scores vary with a standard deviation of 0.789, and the rates of change vary with a standard deviation of 0.106. These two random effects remain highly correlated, so children who have high levels of math proficiency in third grade also tend to improve faster over time.

5 References

Raudenbush, S., Byrk, A., et al. (2004). HLM 6: Hierarchical Linear and Nonlinear Modeling. Chapter 4. Scientific Software International.

Rodríguez, G. Multilevel Models. 3-Level Model