DESCRIPTION OF THE DATASETS
During the exercise sessions, we will use 2 datasets: the hip fracture dataset and the SII dataset. A description of both datasets is given here:
SII Dataset
The data come from the SII (Study of Instructional Improvement),
which was carried out to assess the math achievement scores of first-
and third-grade pupils in randomly selected classrooms from a national
US sample of elementary schools. The dataset includes results for 1190
first-grade pupils sampled from 312 classrooms in 107 schools.
From the sample of elementary schools, all first-grade classrooms were
considered (1 to 9). Within a classroom, 1 to 10 children were sampled
to take a test on math. The following variables were measured: *
schoolid: school’s ID number
* classid: classroom’s ID number
* childid: pupil’s ID number
* sex: an indicator variable for sex (0 for males, 1 for females)
* mathkind: pupil’s math score in the spring of the kindergarten
year
* mathgain: pupil’s gain in the math achievement score from the spring
of kindergarten to the spring of first grade
* ses: pupil’s socioeconomic status
* housepov: % of households in the neighborhood of the school below the
poverty level
* yearstea: years of teacher’s experience in teaching in the first
grade
Hip fracture dataset
Hip fractures are a very common and important health problem among
elderly people. They are often the result of an unfortunate, seemingly
innocent fall in the patient’s home environment. Less fluent movements
combined with bad vision and impaired reaction skills make older people
more at risk for such falls. The weaker bones and especially the
presence of osteoporosis also causes the hip to break more easily. The
hip surgery needed is a serious operation and can involve a lot of blood
loss. A relatively long recovery period (8 to 12 days) in the hospital
is therefore a real necessity.
The combination of these overwhelming events can also cause the mental
status of the patients to worsen. When this deterioration is very
severe, typically causing a loss of orientation in time and space, this
phenomenon is called a delirium. In daily practice, the ‘Mini-Mental
State Examination score’ (MMSE) is sometimes being used to measure the
cognitive status of patients. It is the sum of 30 binary outcomes which
result from correct (1) or incorrect (0) answer to specific questions
(e.g. ‘Where are you now?’). Obviously, higher MMSE values indicate
better cognitive functioning.
It is of interest to study how this MMSE evolution depends on certain characteristics of the patients. The data set used for this purpose contains the MMSE score for 60 patients, at 1, 3, 5, 8 and 12 days after surgery. For each patient the following variables are measured:
Describe in detail why in both datasets, one cannot make use of basic statistical models, like linear regression or ANOVA. What assumptions are violated? What special kind of data is the hip fracture dataset?
Classic ANOVA and regression assume all observations are independent. Here, we have related data - observations are not independent:
The purpose of multilevel (or mixed) models is to account for the relatedness between these observations
#Loading the required packages
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(ggplot2)
#> Warning: package 'ggplot2' was built under R version 4.5.3
# Read in dataset
siiData<-read.table("SII.txt",header=T)
# Coerce schoolid, classid, childid, sex and ses to a factor
siiData$schoolid<-as.factor(siiData$schoolid)
siiData$classid<-as.factor(siiData$classid)
siiData$childid<-as.factor(siiData$childid)
siiData$sex<-as.factor(siiData$sex)
# Rename the factor levels of 'sex' for easier interpretation
levels(siiData$sex)<-c("male","female")
# Generating the Individual-vs.-Aggregate Dot Plot
group_means <- siiData %>%
group_by(schoolid,sex) %>%
summarize(Mean_mathgain = mean(mathgain), .groups = "drop")
ggplot(group_means, aes(x = schoolid, y = Mean_mathgain, col = sex)) +
geom_point(alpha = 0.7, size = 2.5) +
labs(
title = "Distribution of mathgain for males and females across schools",
x = "School ID (Cluster)",
y = "Mathgain",
color = "Sex of student"
) +
theme_minimal()
# Read in dataset
hipData<-read.table("hipfracture2.txt",header = T)
# coercion of categorical variables
hipData$IDNR<-as.factor(hipData$IDNR)
hipData$NEURO<-as.factor(hipData$NEURO)
# Rename the factor levels of 'NEURO' for easier interpretation
levels(hipData$NEURO)<-c("not neuro-psy","neuro-psy")
# one plot with all the patients, using a different color for neuro and non-Neuro patients
ggplot(hipData, aes(x = TIME, y = MMSE, group = IDNR, col=NEURO)) +
geom_line(alpha = 0.6) +
geom_point(alpha = 0.6) +
labs(title = "Individual Trajectory Plot", y = "MMSE")
#> Warning: Removed 30 rows containing missing values or values outside the scale range
#> (`geom_line()`).
#> Warning: Removed 33 rows containing missing values or values outside the scale range
#> (`geom_point()`).
# two separate plots for the neuro and non-neuro persons
ggplot(hipData, aes(x = TIME, y = MMSE, group = IDNR, col = NEURO)) +
geom_line(alpha = 0.6) +
geom_point(alpha = 0.6) +
# Split the plot into separate panels based on the NEURO variable
facet_wrap(~NEURO) +
labs(
title = "Individual Trajectory Plots by Neuro Status",
y = "MMSE",
x = "Time"
)
#> Warning: Removed 30 rows containing missing values or values outside the scale range
#> (`geom_line()`).
#> Removed 33 rows containing missing values or values outside the scale range
#> (`geom_point()`).
Calculate the ICC and the Design Effect for both datasets. Is there an indication that a multilevel model might be necessary?
library(lme4)
#> Loading required package: Matrix
library(performance)
#> Warning: package 'performance' was built under R version 4.5.3
fit_null_sii <- lme4::lmer(mathgain ~ 1 + (1 | schoolid), data = siiData)
summary(fit_null_sii)
#> Linear mixed model fit by REML ['lmerMod']
#> Formula: mathgain ~ 1 + (1 | schoolid)
#> Data: siiData
#>
#> REML criterion at convergence: 11776.7
#>
#> Scaled residuals:
#> Min 1Q Median 3Q Max
#> -4.8127 -0.5903 -0.0186 0.5449 5.6632
#>
#> Random effects:
#> Groups Name Variance Std.Dev.
#> schoolid (Intercept) 109.2 10.45
#> Residual 1094.0 33.08
#> Number of obs: 1190, groups: schoolid, 107
#>
#> Fixed effects:
#> Estimate Std. Error t value
#> (Intercept) 57.583 1.442 39.95
# Calculate ICC
ICC_sii <- 109.2/(109.2+1094); ICC_sii
#> [1] 0.09075798
icc(fit_null_sii)
#> # Intraclass Correlation Coefficient
#>
#> Adjusted ICC: 0.091
#> Unadjusted ICC: 0.091
# Calculate Average Cluster Size
avg_n_sii <- mean(table(siiData$schoolid))
# Calculate Design Effect
design_sii <- 1+(avg_n_sii-1)*ICC_sii ; design_sii
#> [1] 1.918606
The ICC is quite low (<0.1) and the design effect is below 2. So there is very few indication that a large proportion of the variance in the data is due to the variance between schools.
fit_null_hip <- lme4::lmer(MMSE ~ 1 + (1 | IDNR), data = hipData)
summary(fit_null_hip)
#> Linear mixed model fit by REML ['lmerMod']
#> Formula: MMSE ~ 1 + (1 | IDNR)
#> Data: hipData
#>
#> REML criterion at convergence: 1483.8
#>
#> Scaled residuals:
#> Min 1Q Median 3Q Max
#> -4.2710 -0.3834 0.0528 0.4960 3.0246
#>
#> Random effects:
#> Groups Name Variance Std.Dev.
#> IDNR (Intercept) 56.879 7.542
#> Residual 6.782 2.604
#> Number of obs: 267, groups: IDNR, 60
#>
#> Fixed effects:
#> Estimate Std. Error t value
#> (Intercept) 20.2470 0.9871 20.51
ICC_hip <- 56.879/(56.879+6.782); ICC_hip
#> [1] 0.893467
icc(fit_null_hip)
#> # Intraclass Correlation Coefficient
#>
#> Adjusted ICC: 0.893
#> Unadjusted ICC: 0.893
# Calculate Average Cluster Size
avg_n_hip <- mean(table(hipData$IDNR))
# Calculate Design Effect
design_hip <- 1+(avg_n_hip-1)*ICC_hip ; design_hip
#> [1] 4.573868
The ICC is very high (>0.8) and the design effect is higher than 2. This indicates that a significant proportion of the variance in the data is due to the variance between patients. A multilevel model is necessary.
The research question we want to answer is:
Is there a difference of math achievement score between males
and females?
We will now look at a random intercept model. We still only consider the clustering on school-level and ignore the clustering on the class-level (this will be included in a later stage).
Write down the full model with assumptions
\(mathgain_{ij} = \alpha + a_i + \beta \cdot sex_j + \epsilon_{ij}\)
\(a_i \sim N(0, \sigma^2_{school})\), and \(\epsilon_{ij} \sim N(0, \sigma^2_{res})\) independent
Fit the model in R, discuss the within-cluster variability and between-cluster variability
model_sii_1 <- lmer(mathgain ~ sex + (1 | schoolid), data = siiData)
summary(model_sii_1)
#> Linear mixed model fit by REML ['lmerMod']
#> Formula: mathgain ~ sex + (1 | schoolid)
#> Data: siiData
#>
#> REML criterion at convergence: 11772.3
#>
#> Scaled residuals:
#> Min 1Q Median 3Q Max
#> -4.8455 -0.6000 -0.0199 0.5417 5.6377
#>
#> Random effects:
#> Groups Name Variance Std.Dev.
#> schoolid (Intercept) 109.3 10.45
#> Residual 1093.8 33.07
#> Number of obs: 1190, groups: schoolid, 107
#>
#> Fixed effects:
#> Estimate Std. Error t value
#> (Intercept) 58.658 1.749 33.539
#> sexfemale -2.133 1.966 -1.085
#>
#> Correlation of Fixed Effects:
#> (Intr)
#> sexfemale -0.566
What is the average mathgain for female students (across all schools)?
According to the model, the average mathgain for female students across all school is: \[ 58.658 - 2.133 = 56.525 \]
We want to study how MMSE develops over time, depending on certain characteristics of a patient. Fit a random intercept model to investigate this research question, taking into account the pre-operative neuro-status of the patient.
Write down the model with assumptions.
\(MMSE_{ij} = \alpha + a_i + \beta_1 \cdot NEURO_j + \beta_2 \cdot TIME_j + \beta_3 \cdot NEURO_jTIME_j + \epsilon_{ij}\)
\(a_i \sim N(0, \sigma^2_{patient})\), and \(\epsilon_{ij} \sim N(0, \sigma^2_{res})\) independent
Discuss the variability between and within patients.
model_hip_1 <- lmer(MMSE ~ TIME + NEURO + TIME:NEURO + (1 | IDNR), data = hipData)
summary(model_hip_1)
#> Linear mixed model fit by REML ['lmerMod']
#> Formula: MMSE ~ TIME + NEURO + TIME:NEURO + (1 | IDNR)
#> Data: hipData
#>
#> REML criterion at convergence: 1452.8
#>
#> Scaled residuals:
#> Min 1Q Median 3Q Max
#> -4.1447 -0.4416 0.0153 0.4882 2.8102
#>
#> Random effects:
#> Groups Name Variance Std.Dev.
#> IDNR (Intercept) 41.970 6.478
#> Residual 6.316 2.513
#> Number of obs: 267, groups: IDNR, 60
#>
#> Fixed effects:
#> Estimate Std. Error t value
#> (Intercept) 22.01172 1.07702 20.438
#> TIME 0.19753 0.05240 3.770
#> NEUROneuro-psy -8.03853 1.86904 -4.301
#> TIME:NEUROneuro-psy -0.05981 0.09642 -0.620
#>
#> Correlation of Fixed Effects:
#> (Intr) TIME NEURO-
#> TIME -0.254
#> NEUROnr-psy -0.576 0.146
#> TIME:NEURO- 0.138 -0.543 -0.258
What is the marginal model (i.e. average model) that describes the MMSE over time for not neuro-psychiatric patients?
\(MMSE_{NotNeuro} = 22.01172 + 0.19753 \cdot TIME\)
Based on the random intercept model that you fit in the previous part, give an answer to the question: Is there a difference of math achievement score between males and females?
library(lmerTest)
#> Warning: package 'lmerTest' was built under R version 4.5.3
#>
#> Attaching package: 'lmerTest'
#> The following object is masked from 'package:lme4':
#>
#> lmer
#> The following object is masked from 'package:stats':
#>
#> step
library(pbkrtest)
#> Warning: package 'pbkrtest' was built under R version 4.5.3
model_sii_full <- lme4::lmer(mathgain ~ sex + (1 | schoolid), data = siiData, REML=FALSE)
model_sii_red <- lme4::lmer(mathgain ~ (1 | schoolid), data = siiData, REML=FALSE)
# LRT Test
anova(model_sii_red,model_sii_full)
#> Data: siiData
#> Models:
#> model_sii_red: mathgain ~ (1 | schoolid)
#> model_sii_full: mathgain ~ sex + (1 | schoolid)
#> npar AIC BIC logLik -2*log(L) Chisq Df Pr(>Chisq)
#> model_sii_red 3 11785 11800 -5889.6 11779
#> model_sii_full 4 11786 11806 -5889.0 11778 1.1784 1 0.2777
# Wald test
model_sii_wald <- lmerTest::lmer(mathgain ~ sex + (1 | schoolid), data = siiData)
anova(model_sii_wald, ddf = "Kenward-Roger")
#> Type III Analysis of Variance Table with Kenward-Roger's method
#> Sum Sq Mean Sq NumDF DenDF F value Pr(>F)
#> sex 1285.8 1285.8 1 1166.3 1.1755 0.2785
# Bootstrap
bootstrap_test <- PBmodcomp(model_sii_full, model_sii_red, nsim = 500)
bootstrap_test
#> Bootstrap test; time: 27.60 sec; samples: 500; extremes: 131;
#> large : mathgain ~ sex + (1 | schoolid)
#> stat df p.value
#> LRT 1.1784 1 0.2777
#> PBtest 1.1784 0.2635
All the tests show a p-value larger than 0.05, indicating that there is no significant difference in math achievement scores between males and females.
Based on the random intercept model that you fit in the previous part, is there a different evolution of MMSE score in the two neuro-status groups?
model_hip_full <- lme4::lmer(MMSE ~ TIME + NEURO + TIME:NEURO + (1 | IDNR), data = hipData, REML=FALSE)
model_hip_red <- lme4::lmer(MMSE ~ TIME + NEURO + (1 | IDNR), data = hipData, REML=FALSE)
# LRT Test
anova(model_hip_red,model_hip_full)
#> Data: hipData
#> Models:
#> model_hip_red: MMSE ~ TIME + NEURO + (1 | IDNR)
#> model_hip_full: MMSE ~ TIME + NEURO + TIME:NEURO + (1 | IDNR)
#> npar AIC BIC logLik -2*log(L) Chisq Df Pr(>Chisq)
#> model_hip_red 5 1460.4 1478.4 -725.21 1450.4
#> model_hip_full 6 1462.0 1483.5 -725.01 1450.0 0.3915 1 0.5315
# Wald test
model_hip_wald <- lmerTest::lmer(MMSE ~ TIME + NEURO + TIME:NEURO + (1 | IDNR), data = hipData)
anova(model_hip_wald, ddf = "Kenward-Roger")
#> Type III Analysis of Variance Table with Kenward-Roger's method
#> Sum Sq Mean Sq NumDF DenDF F value Pr(>F)
#> TIME 76.323 76.323 1 206.509 12.0849 0.0006196 ***
#> NEURO 116.823 116.823 1 66.554 18.4975 5.693e-05 ***
#> TIME:NEURO 2.429 2.429 1 206.509 0.3846 0.5358332
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Bootstrap
#bootstrap_test <- PBmodcomp(model_hip_full, model_hip_red, nsim = 500)
#bootstrap_test
Both the LRT test and the Wald test show a p-value above 0.05, indicating that there is no significantly different evolution of MMSE score in the two neuro-status groups. This means that we can remove the interaction term between TIME and NEURO from the model.
Note that if you perform parametric bootstrapping here, you get an error (that’s why the code was not run here). This is due to the fact that there are missing data in the dataset. If you want to perform parametric bootstrapping here, you have to limit your dataset to only include the complete cases.
Based on your conclusion from the previous question, fit a new appropriate model. Is there a significant MMSE evolution over time?
model_hip_full_2 <- lme4::lmer(MMSE ~ TIME + NEURO + (1 | IDNR), data = hipData, REML=FALSE)
model_hip_red_2 <- lme4::lmer(MMSE ~ NEURO + (1 | IDNR), data = hipData, REML=FALSE)
# LRT Test
anova(model_hip_red_2,model_hip_full_2)
#> Data: hipData
#> Models:
#> model_hip_red_2: MMSE ~ NEURO + (1 | IDNR)
#> model_hip_full_2: MMSE ~ TIME + NEURO + (1 | IDNR)
#> npar AIC BIC logLik -2*log(L) Chisq Df Pr(>Chisq)
#> model_hip_red_2 4 1474.6 1488.9 -733.29 1466.6
#> model_hip_full_2 5 1460.4 1478.4 -725.21 1450.4 16.161 1 5.819e-05 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Wald test
model_hip_wald_2 <- lmerTest::lmer(MMSE ~ TIME + NEURO + (1 | IDNR), data = hipData)
anova(model_hip_wald_2, ddf = "Kenward-Roger")
#> Type III Analysis of Variance Table with Kenward-Roger's method
#> Sum Sq Mean Sq NumDF DenDF F value Pr(>F)
#> TIME 105.60 105.60 1 207.139 16.781 6.018e-05 ***
#> NEURO 133.87 133.87 1 58.068 21.274 2.238e-05 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Bootstrap
#bootstrap_test <- PBmodcomp(model_hip_full_2, model_hip_red_2, nsim = 500)
#bootstrap_test
Both the LRT test and the Wald test show a p-value below 0.05, indicating that there is a significant MMSE evolution over time.
Looking at the profiles from the hipfracture dataset (see figure produced in PART ONE of the exercises), it might be of interest to include a quadratic time effect.
hipData$TIME2 <- hipData$TIME* hipData$TIME
Fit the model. Is the quadratic time effect needed here?
model_hip_full_3 <- lme4::lmer(MMSE ~ TIME + NEURO + TIME2 + (1 | IDNR), data = hipData, REML=FALSE)
model_hip_red_3 <- lme4::lmer(MMSE ~ NEURO + TIME + (1 | IDNR), data = hipData, REML=FALSE)
# LRT Test
anova(model_hip_red_3,model_hip_full_3)
#> Data: hipData
#> Models:
#> model_hip_red_3: MMSE ~ NEURO + TIME + (1 | IDNR)
#> model_hip_full_3: MMSE ~ TIME + NEURO + TIME2 + (1 | IDNR)
#> npar AIC BIC logLik -2*log(L) Chisq Df Pr(>Chisq)
#> model_hip_red_3 5 1460.4 1478.4 -725.21 1450.4
#> model_hip_full_3 6 1460.3 1481.8 -724.16 1448.3 2.0937 1 0.1479
# Wald test
model_hip_wald_3 <- lmerTest::lmer(MMSE ~ TIME + NEURO + TIME2 + (1 | IDNR), data = hipData)
anova(model_hip_wald_3, ddf = "Kenward-Roger")
#> Type III Analysis of Variance Table with Kenward-Roger's method
#> Sum Sq Mean Sq NumDF DenDF F value Pr(>F)
#> TIME 38.825 38.825 1 205.200 6.2006 0.01356 *
#> NEURO 133.636 133.636 1 58.068 21.3422 2.181e-05 ***
#> TIME2 13.036 13.036 1 205.338 2.0819 0.15058
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Both the LRT test and the Wald test show a p-value above 0.05, indicating that there is no need for a quadratic time effect.
Include the age of the patient as an extra covariate in the model. Is there a different effect of age in both neuro groups?
model_hip_full_4 <- lme4::lmer(MMSE ~ TIME + NEURO + AGE + AGE:NEURO + (1 | IDNR), data = hipData, REML=FALSE)
model_hip_red_4 <- lme4::lmer(MMSE ~ NEURO + TIME + AGE + (1 | IDNR), data = hipData, REML=FALSE)
# LRT Test
anova(model_hip_red_4,model_hip_full_4)
#> Data: hipData
#> Models:
#> model_hip_red_4: MMSE ~ NEURO + TIME + AGE + (1 | IDNR)
#> model_hip_full_4: MMSE ~ TIME + NEURO + AGE + AGE:NEURO + (1 | IDNR)
#> npar AIC BIC logLik -2*log(L) Chisq Df Pr(>Chisq)
#> model_hip_red_4 6 1440.6 1462.1 -714.29 1428.6
#> model_hip_full_4 7 1442.5 1467.7 -714.27 1428.5 0.0285 1 0.866
# Wald test
model_hip_wald_4 <- lmerTest::lmer(MMSE ~ TIME + NEURO + AGE + AGE:NEURO + (1 | IDNR), data = hipData)
anova(model_hip_wald_4, ddf = "Kenward-Roger")
#> Type III Analysis of Variance Table with Kenward-Roger's method
#> Sum Sq Mean Sq NumDF DenDF F value Pr(>F)
#> TIME 101.546 101.546 1 207.506 16.1315 8.259e-05 ***
#> NEURO 0.479 0.479 1 55.947 0.0761 0.7837217
#> AGE 106.330 106.330 1 55.997 16.8915 0.0001305 ***
#> NEURO:AGE 0.166 0.166 1 55.984 0.0263 0.8716731
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Both the LRT test and the Wald test show a p-value above 0.05, indicating that there is not a significantly different effect of age in both neuro groups. This means that we can drop the interaction term from the model.
Is the age effect needed here?
model_hip_full_5 <- lme4::lmer(MMSE ~ TIME + NEURO + AGE + (1 | IDNR), data = hipData, REML=FALSE)
model_hip_red_5 <- lme4::lmer(MMSE ~ NEURO + TIME + (1 | IDNR), data = hipData, REML=FALSE)
# LRT Test
anova(model_hip_red_5,model_hip_full_5)
#> Data: hipData
#> Models:
#> model_hip_red_5: MMSE ~ NEURO + TIME + (1 | IDNR)
#> model_hip_full_5: MMSE ~ TIME + NEURO + AGE + (1 | IDNR)
#> npar AIC BIC logLik -2*log(L) Chisq Df Pr(>Chisq)
#> model_hip_red_5 5 1460.4 1478.4 -725.21 1450.4
#> model_hip_full_5 6 1440.6 1462.1 -714.29 1428.6 21.844 1 2.957e-06 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Wald test
model_hip_wald_5 <- lmerTest::lmer(MMSE ~ TIME + NEURO + AGE + (1 | IDNR), data = hipData)
anova(model_hip_wald_5, ddf = "Kenward-Roger")
#> Type III Analysis of Variance Table with Kenward-Roger's method
#> Sum Sq Mean Sq NumDF DenDF F value Pr(>F)
#> TIME 101.49 101.49 1 207.554 16.123 8.295e-05 ***
#> NEURO 162.88 162.88 1 57.117 25.875 4.223e-06 ***
#> AGE 157.65 157.65 1 57.191 25.043 5.674e-06 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Both the LRT test and the Wald test show a p-value below 0.05, indicating that there is significant effect of AGE on the MMSE value. The variable AGE should therefore be kept in the model.
Give and fit the model, testing if there is a difference in mathgain between boys and girls, taking into account clustering on both school and class levels.
Write down this model (no other covariates have to be considered yet).
\(mathgain_{ijk}=\alpha +a_{1i} + a_{2i(j)} + \beta \cdot sex_k + \epsilon_{ijk}\)
Is there a difference in mathgain between boys and girls?
model_nested <- lme4::lmer(mathgain~ sex + (1|schoolid/classid), data=siiData, REML=F)
model_nested_red <- lme4::lmer(mathgain~ 1+ (1|schoolid/classid), data=siiData, REML=F)
anova(model_nested_red,model_nested )
#> Data: siiData
#> Models:
#> model_nested_red: mathgain ~ 1 + (1 | schoolid/classid)
#> model_nested: mathgain ~ sex + (1 | schoolid/classid)
#> npar AIC BIC logLik -2*log(L) Chisq Df Pr(>Chisq)
#> model_nested_red 4 11779 11800 -5885.7 11771
#> model_nested 5 11780 11805 -5884.9 11770 1.4574 1 0.2273
The p-value for the LRT is greater than 0.05, so there is no significant difference in mathgain between boys and girls.
Discuss the different variance components from the model in the previous exercise. * What fraction of the total variance is due to the variability between classes? * What fraction of the total variance is due to the variability between schools?
model_nested <- lme4::lmer(mathgain~ sex + (1|schoolid/classid), data=siiData)
summary(model_nested)
#> Linear mixed model fit by REML ['lmerMod']
#> Formula: mathgain ~ sex + (1 | schoolid/classid)
#> Data: siiData
#>
#> REML criterion at convergence: 11764.1
#>
#> Scaled residuals:
#> Min 1Q Median 3Q Max
#> -4.6789 -0.5889 -0.0324 0.5396 5.6126
#>
#> Random effects:
#> Groups Name Variance Std.Dev.
#> classid:schoolid (Intercept) 101.04 10.052
#> schoolid (Intercept) 76.96 8.773
#> Residual 1026.86 32.045
#> Number of obs: 1190, groups: classid:schoolid, 312; schoolid, 107
#>
#> Fixed effects:
#> Estimate Std. Error t value
#> (Intercept) 58.615 1.747 33.547
#> sexfemale -2.353 1.948 -1.208
#>
#> Correlation of Fixed Effects:
#> (Intr)
#> sexfemale -0.564
Fit the same model, but consider the random effects for school and class levels to be crossed, rather than nested. Look at the variance components now. Do you notice a difference? Why is that?
model_crossed <- lme4::lmer(mathgain~ sex + (1|schoolid) + (1|classid), data=siiData)
summary(model_crossed)
#> Linear mixed model fit by REML ['lmerMod']
#> Formula: mathgain ~ sex + (1 | schoolid) + (1 | classid)
#> Data: siiData
#>
#> REML criterion at convergence: 11764.1
#>
#> Scaled residuals:
#> Min 1Q Median 3Q Max
#> -4.6789 -0.5889 -0.0324 0.5396 5.6126
#>
#> Random effects:
#> Groups Name Variance Std.Dev.
#> classid (Intercept) 101.04 10.052
#> schoolid (Intercept) 76.96 8.773
#> Residual 1026.86 32.045
#> Number of obs: 1190, groups: classid, 312; schoolid, 107
#>
#> Fixed effects:
#> Estimate Std. Error t value
#> (Intercept) 58.615 1.747 33.547
#> sexfemale -2.353 1.948 -1.208
#>
#> Correlation of Fixed Effects:
#> (Intr)
#> sexfemale -0.564
There is no difference in the estimated variance components compared to the previous model specification. This is because the variable classid is coded with globally unique identifiers for each class across the entire dataset, rather than with identifiers that repeat within schools (e.g., class 1, 2, 3 in school 1 and again class 1, 2, 3 in school 2). As a result, each class is uniquely associated with a single school, making the specification with nested random effects mathematically equivalent to the specification with crossed random effects for this dataset.
Given the model from the previous part, with sex as a fixed effect and a random effect for school and class. Is it necessary to take into account the nested structure of classes within schools? Test whether the nested effects model performs better than a model that only includes a random effect for school, and a model that only includes a random effect for class.
model_nested <- lme4::lmer(mathgain~ sex + (1|schoolid/classid), data=siiData)
model_school <- lme4::lmer(mathgain~ sex + (1|schoolid), data=siiData)
model_class <- lme4::lmer(mathgain~ sex + (1|classid), data=siiData)
anova(model_school,model_nested)
#> refitting model(s) with ML (instead of REML)
#> Data: siiData
#> Models:
#> model_school: mathgain ~ sex + (1 | schoolid)
#> model_nested: mathgain ~ sex + (1 | schoolid/classid)
#> npar AIC BIC logLik -2*log(L) Chisq Df Pr(>Chisq)
#> model_school 4 11786 11806 -5889.0 11778
#> model_nested 5 11780 11805 -5884.9 11770 8.1813 1 0.004232 **
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(model_class,model_nested)
#> refitting model(s) with ML (instead of REML)
#> Data: siiData
#> Models:
#> model_class: mathgain ~ sex + (1 | classid)
#> model_nested: mathgain ~ sex + (1 | schoolid/classid)
#> npar AIC BIC logLik -2*log(L) Chisq Df Pr(>Chisq)
#> model_class 4 11785 11806 -5888.7 11777
#> model_nested 5 11780 11805 -5884.9 11770 7.4614 1 0.006303 **
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
AIC(model_school,model_class,model_nested)
#> df AIC
#> model_school 4 11780.30
#> model_class 4 11779.89
#> model_nested 5 11774.14
BIC(model_school,model_class,model_nested)
#> df BIC
#> model_school 4 11800.63
#> model_class 4 11800.22
#> model_nested 5 11799.55
The LRT, AIC and BIC all indicate that the fully nested model, with random effects for school and class performs the best.
Fit one model including all covariates as main effects and as an interaction with sex, and taking into account all levels of clustering. Discuss significance of the covariates. Perform stepwise backward model selection on the fixed effects to select the best model
model_fullynested <- lmer(mathgain~ sex*(ses+housepov+yearstea+mathkind) + (1|schoolid/classid), data=siiData)
#> Warning: Some predictor variables are on very different scales: consider
#> rescaling
#> Warning: Some predictor variables are on very different scales: consider
#> rescaling
anova(model_fullynested)
#> Type III Analysis of Variance Table with Satterthwaite's method
#> Sum Sq Mean Sq NumDF DenDF F value Pr(>F)
#> sex 232 232 1 1134.74 0.3128 0.57608
#> ses 15603 15603 1 1166.89 21.0357 4.995e-06 ***
#> housepov 2293 2293 1 116.20 3.0914 0.08134 .
#> yearstea 196 196 1 245.22 0.2638 0.60801
#> mathkind 316941 316941 1 1164.19 427.3025 < 2.2e-16 ***
#> sex:ses 0 0 1 1127.84 0.0005 0.98150
#> sex:housepov 2645 2645 1 1116.88 3.5664 0.05922 .
#> sex:yearstea 2850 2850 1 1113.87 3.8417 0.05024 .
#> sex:mathkind 666 666 1 1134.65 0.8981 0.34349
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
model_step2 <- lmer(mathgain~ sex*(housepov+yearstea+mathkind) + ses + (1|schoolid/classid), data=siiData)
#> Warning: Some predictor variables are on very different scales: consider
#> rescaling
#> Warning: Some predictor variables are on very different scales: consider
#> rescaling
anova(model_step2)
#> Type III Analysis of Variance Table with Satterthwaite's method
#> Sum Sq Mean Sq NumDF DenDF F value Pr(>F)
#> sex 239 239 1 1133.03 0.3227 0.57013
#> housepov 2292 2292 1 116.08 3.0933 0.08125 .
#> yearstea 196 196 1 244.97 0.2649 0.60725
#> mathkind 316920 316920 1 1165.14 427.6762 < 2.2e-16 ***
#> ses 15603 15603 1 1167.85 21.0556 4.943e-06 ***
#> sex:housepov 2713 2713 1 1120.73 3.6614 0.05594 .
#> sex:yearstea 2852 2852 1 1114.95 3.8483 0.05005 .
#> sex:mathkind 699 699 1 1132.67 0.9427 0.33180
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
model_step3 <- lmer(mathgain~ sex*(housepov+yearstea)+ ses + mathkind + (1|schoolid/classid), data=siiData)
anova(model_step3, ddf = "Kenward-Roger")
#> Type III Analysis of Variance Table with Kenward-Roger's method
#> Sum Sq Mean Sq NumDF DenDF F value Pr(>F)
#> sex 3023 3023 1 1137.64 4.0761 0.04373 *
#> housepov 2299 2299 1 110.12 3.1004 0.08105 .
#> yearstea 206 206 1 255.28 0.2781 0.59841
#> ses 15367 15367 1 1168.28 20.7233 5.861e-06 ***
#> mathkind 315980 315980 1 1164.45 426.1224 < 2.2e-16 ***
#> sex:housepov 3171 3171 1 1124.02 4.2759 0.03888 *
#> sex:yearstea 2828 2828 1 1119.17 3.8138 0.05108 .
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
model_step4 <- lmer(mathgain~ sex*(housepov)+ ses + mathkind + yearstea + (1|schoolid/classid), data=siiData)
anova(model_step4, ddf = "Kenward-Roger")
#> Type III Analysis of Variance Table with Kenward-Roger's method
#> Sum Sq Mean Sq NumDF DenDF F value Pr(>F)
#> sex 665 665 1 1111.80 0.8952 0.34428
#> housepov 2164 2164 1 110.04 2.9141 0.09063 .
#> ses 15824 15824 1 1168.19 21.3127 4.333e-06 ***
#> mathkind 317988 317988 1 1164.52 428.2735 < 2.2e-16 ***
#> yearstea 220 220 1 256.34 0.2962 0.58672
#> sex:housepov 2336 2336 1 1116.47 3.1457 0.07640 .
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
model_step5 <- lmer(mathgain~ sex*housepov + ses + mathkind + (1|schoolid/classid), data=siiData)
anova(model_step5, ddf = "Kenward-Roger")
#> Type III Analysis of Variance Table with Kenward-Roger's method
#> Sum Sq Mean Sq NumDF DenDF F value Pr(>F)
#> sex 670 670 1 1112.34 0.9028 0.34224
#> housepov 2336 2336 1 108.92 3.1476 0.07883 .
#> ses 15898 15898 1 1169.28 21.4205 4.1e-06 ***
#> mathkind 318218 318218 1 1165.81 428.7662 < 2e-16 ***
#> sex:housepov 2369 2369 1 1117.25 3.1921 0.07426 .
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
model_step6 <- lmer(mathgain~ sex + housepov + ses + mathkind + (1|schoolid/classid), data=siiData)
anova(model_step6, ddf = "Kenward-Roger")
#> Type III Analysis of Variance Table with Kenward-Roger's method
#> Sum Sq Mean Sq NumDF DenDF F value Pr(>F)
#> sex 458 458 1 1131.05 0.6165 0.43251
#> housepov 2346 2346 1 108.99 3.1585 0.07832 .
#> ses 16381 16381 1 1169.41 22.0556 2.962e-06 ***
#> mathkind 319087 319087 1 1165.47 429.6178 < 2.2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
model_step7 <- lmer(mathgain~ housepov + ses + mathkind + (1|schoolid/classid), data=siiData)
anova(model_step7, ddf = "Kenward-Roger")
#> Type III Analysis of Variance Table with Kenward-Roger's method
#> Sum Sq Mean Sq NumDF DenDF F value Pr(>F)
#> housepov 2374 2374 1 109.05 3.1965 0.07658 .
#> ses 16514 16514 1 1170.91 22.2305 2.709e-06 ***
#> mathkind 320206 320206 1 1166.13 431.0595 < 2.2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
model_step8 <- lmer(mathgain~ ses + mathkind + (1|schoolid/classid), data=siiData)
anova(model_step8, ddf = "Kenward-Roger")
#> Type III Analysis of Variance Table with Kenward-Roger's method
#> Sum Sq Mean Sq NumDF DenDF F value Pr(>F)
#> ses 18065 18065 1 1163.2 24.297 9.452e-07 ***
#> mathkind 317617 317617 1 1163.8 427.188 < 2.2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The least significant variables that are excluded in each step
are:
1) sex:ses
2) sex:mathkind
3) sex:yearsta
4) yearsta
5) sex:housepov
6) sex
7) housepov
The final model includes fixed effects for ses and mathkind and random effects for school and class.
Check the model assumptions of the final model you have chosen in the previous exercise.
model_final <- lmer(mathgain~ ses + mathkind + (1|schoolid/classid), data=siiData)
# Residuals vs. Fitted Values Plot (Tests Equal Variance)
plot(model_final,
main = "Residuals vs. Fitted Values",
xlab = "Fitted Values", ylab = "Residuals")
# Q-Q Plot of Residuals (Tests Normality of Residuals)
qqnorm(resid(model_final), main = "Normal Q-Q Plot: Residuals")
qqline(resid(model_final), col = "red")
# Extract Empirical Bayes estimates (random effects)
eb_estimates <- ranef(model_final)
# Histogram to check for Class-level outliers
ggplot(mapping = aes(x = eb_estimates$classid[[1]])) +
geom_histogram(bins = 30, fill = "lightblue", color = "black") +
labs(title = "Distribution of Class Random Effects", x = "Class Intercept Deviation")
# Histogram to check for School-level outliers
ggplot(mapping = aes(x = eb_estimates$schoolid[[1]])) +
geom_histogram(bins = 15, fill = "lightgreen", color = "black") +
labs(title = "Distribution of School Random Effects", x = "School Intercept Deviation")
# Scatterplot
mapping_df <- unique(siiData[, c("classid", "schoolid")])
mapping_df <- na.omit(mapping_df)
mapping_df$class_eb <- eb_estimates$classid[as.character(mapping_df$classid), 1]
mapping_df$school_eb <- eb_estimates$schoolid[as.character(mapping_df$schoolid), 1]
# Plot the relationship between the levels
ggplot(mapping_df, aes(x = school_eb, y = class_eb)) +
geom_point(alpha = 0.5, color = "darkblue") +
geom_smooth(method = "lm", color = "red", se = FALSE) +
labs(title = "Class Effects vs. School Effects", x = "School Random Effect", y = "Class Random Effect")
#> `geom_smooth()` using formula = 'y ~ x'
#> Warning: Removed 31 rows containing non-finite outside the scale range
#> (`stat_smooth()`).
#> Warning: Removed 31 rows containing missing values or values outside the scale range
#> (`geom_point()`).