A course has two exams: a midterm grade (Grade1) and a
final grade (Grade2). A teacher has four different section
of the same class (Class 1, Class 2, Class 3, Class 4). Are midterm exam
grades associated with final exam grades?
tibble(grades) |>
slice_sample(n = 10)
## # A tibble: 10 × 3
## Class Grade1 Grade2
## <dbl> <dbl> <dbl>
## 1 4 9.5 7.5
## 2 1 6 9
## 3 4 7 8.75
## 4 1 4.5 6
## 5 2 9 8
## 6 1 6.25 7.5
## 7 4 5 8
## 8 3 6.75 8.5
## 9 4 6.8 9.5
## 10 2 8 9
Need to change Class to a factor since it’s dummy
coded:
grades <-
grades |>
mutate(
Class = factor(Class, labels=paste("Class", levels(factor(grades$Class))))
)
tibble(grades) |>
slice_sample(n = 10)
## # A tibble: 10 × 3
## Class Grade1 Grade2
## <fct> <dbl> <dbl>
## 1 Class 1 8.5 8.6
## 2 Class 4 8 9
## 3 Class 2 6.25 6
## 4 Class 2 6.25 7
## 5 Class 1 7.5 8
## 6 Class 2 5.5 7
## 7 Class 1 3 3.5
## 8 Class 4 5 4.15
## 9 Class 2 7.75 7.5
## 10 Class 3 7.8 9
Let’s compare Grade 1 and Grade 2 by class. Let’s first start with a box plot of the difference in scores:
ggplot(
data = grades,
mapping = aes(
x = Grade2 - Grade1,
y = Class
)
) +
geom_boxplot(
fill = 'steelblue',
outlier.shape = NA
) +
# Adding a point for each student to the box plots
geom_jitter(
width = 0,
height = 0.15
) +
plot_theme +
labs(
y = NULL,
title = 'Difference in Exam 2 and Exam 1 by Class'
)
There does appear to be a little difference between the classes.
How strongly are Grade 1 and Grade 2 related?
And a scatter plot of Grade 2 by Grade 1 and Class
ggplot(
data = grades,
mapping = aes(
x = Grade1,
y = Grade2,
color = Class
)
) +
geom_point() +
# Average line
geom_smooth(
color = 'black',
formula = y ~ x,
se = F,
method = 'lm'
) +
# Line by class
geom_smooth(
formula = y ~ x,
se = F,
method = 'lm',
fullrange = T,
linetype = 'dashed',
linewidth = 0.25
) +
plot_theme
We can build an Ordinary Linear Regression (OLR) model
\[y_{ij} = \beta_0 + \beta_1 x_{ij} + \varepsilon_{ij}\]
where
We can fit the model to the data using
lm(formula, data):
grade_olr <- lm(Grade2 ~ Grade1, data = grades)
broom::tidy(grade_olr)
## # A tibble: 2 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 4.18 0.446 9.37 5.79e-17
## 2 Grade1 0.492 0.0585 8.42 1.85e-14
###GG Plot with fitted line
grades |>
# Adding predicted grade:
mutate(grade2_hat = predict(grade_olr)) |>
# Removing the low performing students from the graph
filter(Grade1 >=5, Grade2 >=5) |>
ggplot(
mapping = aes(
x = Grade1,
y = Grade2,
color = Class)
) + theme_classic() +
geom_point() +
geom_line(
mapping = aes(
y = grade2_hat
)
) +
facet_wrap(
facets = vars(Class)
) +
labs(
x = "Midterm Grade",
y = "Final Exam Grade",
title = "Final Exam by Midterm Exam",
subtitle = "Fixed Effects Model"
) +
plot_theme
There is a statistically significant association between the midterm grade and final exam grade. But we do have the assumption that the random errors are all independent. That is to say, all the students are independent of one another.
But students in the same class might not be independent! So what do we do?
We could include a fixed effect based on the class: \(\beta_{i}\). The drawback is the interpretation would only be applicable to the students in the 4 different sections, not all students who would take the class! How can we make the results more generalizable?
We think of the 4 sections as a sample of all possible sections that the teacher could have for that course. That is, the classes are also a sample from a population! How do we account for a second source of randomness?
By including a random effect for class:
\[y_{ij} = \beta_0 + \beta_1 x_{ij} + u_i + \varepsilon_{ij}\]
where \(u_i\) is the random effect of selecting the \(i\)th class in the sample.
We assume a probability distribution for the random effects, often:
\[u_i \sim N(0, \sigma^2_u)\]
So now our model has two random effects (class and student) and 1
fixed effect (Grade1).
How can we build a mixed effects model in R? Using the
lmer() function in the lme4 package.
It works very similarly to lm(), but we need to specify
how a variable is a random effect instead of a fixed effect by using
(1|Var), where Var is the variable that forms
the random effect (Class in our example). The (1) part
tells lmer() to form random intercepts. 1 is
the symbol for intercepts in formulas in R.
Building a model with random intercepts
grade_me <-
lmer(
Grade2 ~ Grade1 + (1|Class),
data = grades,
REML = F # Perform ML estimation, not restricted ML
)
# General summary:
summary(grade_me, cor = F)
## Linear mixed model fit by maximum likelihood . t-tests use Satterthwaite's
## method [lmerModLmerTest]
## Formula: Grade2 ~ Grade1 + (1 | Class)
## Data: grades
##
## AIC BIC logLik deviance df.resid
## 545.7 558.1 -268.8 537.7 161
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.4803 -0.6257 -0.0140 0.6933 1.9729
##
## Random effects:
## Groups Name Variance Std.Dev.
## Class (Intercept) 0.08044 0.2836
## Residual 1.48033 1.2167
## Number of obs: 165, groups: Class, 4
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 4.29037 0.45811 102.85056 9.365 1.93e-15 ***
## Grade1 0.47904 0.05707 163.58678 8.393 2.13e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Fixed effects only
fixef(grade_me)
## (Intercept) Grade1
## 4.2903707 0.4790414
If we want to compare the fixed effect and random effect of class, we need to tell R to fit the model with a sum to zero constraint instead of the baseline zero constraint, since the mixed effect model has the random effects sum to 0
grades$class <- grades$Class
contrasts(grades$class) <- contr.sum(4, contrasts=TRUE)
# Fitting the linear model with class as a fixed effect
grade_fe2 <- lm(Grade2 ~ Grade1 + class, data=grades);
# Comparing the ME vs FE model including Class
coef(grade_fe2);fixef(grade_me)
## (Intercept) Grade1 class1 class2 class3
## 4.3373439 0.4730795 0.4448264 0.1366185 -0.1268532
## (Intercept) Grade1
## 4.2903707 0.4790414
Getting the loglikelihoods, differences, and dfs
ll_fe <- logLik(grade_olr)
ll_me <- logLik(grade_me)
ll_diff = -2*as.numeric(ll_fe - ll_me)
df_diff <- attr(ll_me,"df") - attr(ll_fe,"df")
cat(
'Chi-square = ', ll_diff, # Test stat
'df =', df_diff, # DF
'p-value =', pchisq(ll_diff, df = df_diff, lower.tail = F) # p-value
)
## Chi-square = 4.610485 df = 1 p-value = 0.03177704
Plot with fitted line
gg_rand_int <-
grades |>
mutate(
pred = predict(grade_me)
) |>
filter(Grade1 >=5, Grade2 >=5) |>
ggplot(
mapping = aes(
x = Grade1,
y = Grade2,
color = Class
)
) +
theme_classic() +
geom_abline(
slope = fixef(grade_me)["Grade1"],
intercept = fixef(grade_me)["(Intercept)"],
linewidth = 1
) +
geom_point(size=1.75) +
geom_line(
mapping = aes(y = pred),
linewidth = 1,
linetype = "dashed",
show.legend = F
) +
labs(
x = "Midterm Grade",
y = "Final Exam Grade",
title = "Final Exam by Midterm Exam",
subtitle = "Mixed Effects Model: Random Intercept"
) +
plot_theme
gg_rand_int
Plot by class with fitted line
gg_rand_int +
facet_wrap(
facets = vars(Class)
) +
theme_bw() +
theme(legend.position = 'none')
Random Effects Model with intercept and slope:
grade_me2 <-
lmer(
formula = Grade2 ~ Grade1 + (1 + Grade1|Class),
data = grades,
REML = F
)
## boundary (singular) fit: see help('isSingular')
summary(grade_me2, cor=F)
## Linear mixed model fit by maximum likelihood . t-tests use Satterthwaite's
## method [lmerModLmerTest]
## Formula: Grade2 ~ Grade1 + (1 + Grade1 | Class)
## Data: grades
##
## AIC BIC logLik deviance df.resid
## 549.5 568.1 -268.8 537.5 159
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.4791 -0.6540 -0.0106 0.7690 1.9454
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## Class (Intercept) 0.1867842 0.43219
## Grade1 0.0003964 0.01991 -1.00
## Residual 1.4786232 1.21599
## Number of obs: 165, groups: Class, 4
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 4.29944 0.48728 11.50128 8.823 1.85e-06 ***
## Grade1 0.47807 0.05799 49.70237 8.243 7.32e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## optimizer (nloptwrap) convergence code: 0 (OK)
## boundary (singular) fit: see help('isSingular')
Testing if the random slope is needed
anova(grade_me, grade_me2, refit=F)
## Data: grades
## Models:
## grade_me: Grade2 ~ Grade1 + (1 | Class)
## grade_me2: Grade2 ~ Grade1 + (1 + Grade1 | Class)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## grade_me 4 545.66 558.08 -268.83 537.66
## grade_me2 6 549.51 568.15 -268.76 537.51 0.1464 2 0.9294
Plot with fitted lines for random intercept and slope
gg_rand_slope <-
grades |>
mutate(
pred2 = predict(grade_me2)
) |>
filter(Grade1 >=5, Grade2 >=5) |>
ggplot(
mapping = aes(
x = Grade1,
y = Grade2,
color = Class
)
) +
theme_classic() +
geom_abline(
slope = fixef(grade_me)["Grade1"],
intercept = fixef(grade_me)["(Intercept)"],
linewidth = 1
) +
geom_point() +
geom_line(
mapping = aes(y = pred2),
linewidth = 1,
linetype="dashed",
show.legend = F
) +
labs(
x = "Midterm Grade",
y = "Final Exam Grade",
title = "Final Exam by Midterm Exam",
subtitle = "MEM: Random Intercept and Slope"
) +
plot_theme
gg_rand_slope
gg_rand_slope +
facet_wrap(
facets = vars(Class)
) +
theme_bw() +
plot_theme