This document demonstrates multilevel regression analysis using a simulated dataset. We’ll create nested data representing students within schools, fit several multilevel models, calculate ICC values, and visualize the results.
First, let’s create our simulated dataset with a nested structure:
# Load the packages used throughout this analysis
library(lme4)        # multilevel models via lmer()
library(performance) # icc()
library(gt)          # formatted tables
library(sjPlot)      # tab_model()
library(dplyr)       # data manipulation
library(ggplot2)     # visualization
set.seed(123)
# Create school-level variables
n_schools <- 30
school_ids <- 1:n_schools
school_ses <- rnorm(n_schools, mean = 0, sd = 1) # School SES
school_size <- round(rnorm(n_schools, mean = 300, sd = 50)) # School size
# Create student-level data
n_students_per_school <- round(rnorm(n_schools, mean = 50, sd = 10))
total_students <- sum(n_students_per_school)
# Create student data frame
student_data <- data.frame(
  student_id = 1:total_students,
  school_id = rep(school_ids, n_students_per_school),
  student_ses = rnorm(total_students, mean = 0, sd = 1),
  study_hours = round(runif(total_students, 1, 10)),
  prior_achievement = rnorm(total_students, mean = 70, sd = 10)
)
# Add school-level variables to student data
student_data$school_ses <- school_ses[student_data$school_id]
student_data$school_size <- school_size[student_data$school_id]
# Generate outcome variable (math scores) with multilevel structure
random_intercepts <- rnorm(n_schools, mean = 0, sd = 5)
# as.numeric() strips the matrix attributes that scale() returns,
# so math_score is stored as a plain numeric column
student_data$math_score <- 70 +
  3 * student_data$student_ses +
  2 * student_data$school_ses +
  1.5 * as.numeric(scale(student_data$study_hours)) +
  0.5 * as.numeric(scale(student_data$prior_achievement)) +
  random_intercepts[student_data$school_id] +
  rnorm(total_students, mean = 0, sd = 10)
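Before fitting any models, a quick optional sanity check confirms the nesting looks right: one row per student, school-level values repeated within each school, and roughly 50 students per school.
# Quick sanity check of the nested structure
head(student_data)                    # one row per student, school values repeated
range(table(student_data$school_id))  # smallest and largest school in the sample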
We’ll fit three different multilevel models with increasing complexity:
1. Null model (random intercept only): \(math\_score \sim 1 + (1|school\_id)\)
2. Student-level predictors: \(math\_score \sim student\_ses + study\_hours + prior\_achievement + (1|school\_id)\)
3. Full model with school-level predictors: \(math\_score \sim student\_ses + study\_hours + prior\_achievement + school\_ses + scale(school\_size) + (1|school\_id)\)
# Model 1: Null model (random intercept only)
model1 <- lmer(math_score ~ 1 + (1|school_id), data = student_data)
# Model 2: Random intercept with student-level predictors
model2 <- lmer(math_score ~ student_ses + study_hours + prior_achievement +
                 (1|school_id), data = student_data)
# Model 3: Full model with student and school-level predictors
model3 <- lmer(math_score ~ student_ses + study_hours + prior_achievement +
                 school_ses + scale(school_size) +
                 (1|school_id), data = student_data)
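Because the three models are nested, a likelihood-ratio comparison is a quick optional check that each block of predictors improves fit; lme4's anova() method refits the models with ML before comparing them.
# Likelihood-ratio tests comparing the nested models
# (lmer fits are automatically refit with ML for this comparison)
anova(model1, model2, model3)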
Let’s calculate the Intraclass Correlation Coefficient (ICC) for each model:
# Calculate ICC for all models
icc1 <- icc(model1)
icc2 <- icc(model2)
icc3 <- icc(model3)
# Create ICC comparison table
icc_table <- data.frame(
  Model = c("Null Model", "Student-Level", "Full Model"),
  ICC = c(icc1$ICC_adjusted, icc2$ICC_adjusted, icc3$ICC_adjusted)
)
# Create and print ICC table using gt
icc_table %>%
  gt() %>%
  fmt_number(columns = "ICC", decimals = 3) %>%
  tab_header(title = "ICC Values by Model") %>%
  tab_options(heading.background.color = "lightgray")
ICC Values by Model

| Model | ICC |
|---|---|
| Null Model | 0.145 |
| Student-Level | 0.162 |
| Full Model | 0.167 |
The ICC quantifies the proportion of variance in math scores attributable to differences between schools. A common rule of thumb holds that an ICC above 0.10 (10%) signals enough between-school variance to warrant multilevel modeling; all three models exceed this threshold, supporting the multilevel approach.
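For intuition, the null model's ICC can also be computed by hand from its variance components as \(ICC = \tau_{00} / (\tau_{00} + \sigma^2)\), where \(\tau_{00}\) is the between-school intercept variance and \(\sigma^2\) is the residual variance. A minimal sketch using lme4's VarCorr():
# Manual ICC for the null model: between-school variance over total variance
vc <- as.data.frame(VarCorr(model1))
tau00 <- vc$vcov[vc$grp == "school_id"]  # between-school intercept variance
sigma2 <- vc$vcov[vc$grp == "Residual"]  # within-school (residual) variance
tau00 / (tau00 + sigma2)                 # should reproduce icc(model1)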
Next, we summarize all three models side by side using tab_model() from sjPlot:
tab_model(model1, model2, model3,
          show.ci = FALSE,
          show.se = TRUE,
          collapse.se = TRUE, # Put SE below estimates in parentheses
          show.p = FALSE,
          p.style = "stars",
          emph.p = TRUE,
          string.pred = "Predictor",
          string.est = "Estimate",
          show.stat = FALSE,
          show.re.var = FALSE,
          show.intercept = TRUE,
          title = "Table 1: Regression Results for Math Scores",
          show.icc = TRUE,
          dv.labels = c("Model 1", "Model 2", "Model 3"),
          pred.labels = c("(Intercept)",
                          "Student SES",
                          "Study Hours",
                          "Prior Achievement",
                          "School SES",
                          "School Size (scaled)"))
Table 1: Regression Results for Math Scores

| Predictor | Model 1 | Model 2 | Model 3 |
|---|---|---|---|
| (Intercept) | 68.70 *** (0.81) | 61.84 *** (2.09) | 61.88 *** (2.09) |
| Student SES | | 2.65 *** (0.26) | 2.65 *** (0.26) |
| Study Hours | | 0.52 *** (0.10) | 0.52 *** (0.10) |
| Prior Achievement | | 0.06 * (0.03) | 0.06 * (0.03) |
| School SES | | | 0.03 (0.88) |
| School Size (scaled) | | | -0.91 (0.86) |
| ICC | 0.14 | 0.16 | 0.17 |
| N (schools) | 30 | 30 | 30 |
| Observations | 1507 | 1507 | 1507 |
| Marginal R² / Conditional R² | 0.000 / 0.145 | 0.073 / 0.223 | 0.078 / 0.232 |

Estimates with standard errors in parentheses; * p < 0.05, *** p < 0.001.
The regression results reveal several important findings. The null model (Model 1) estimates a baseline average math score of 68.70, with an ICC of 0.14 indicating that 14% of the variance in scores lies between schools. In Model 2, which adds student-level predictors, student SES emerges as a strong positive predictor: a one-unit increase in SES is associated with a 2.65-point gain (p < 0.001). Study hours also matter, with each additional hour associated with a 0.52-point increase (p < 0.001), and prior achievement has a smaller but still significant effect of 0.06 points (p < 0.05). Together these fixed effects account for about 7.3% of the total variance in math scores (marginal R² = 0.073).
The full model (Model 3) retains essentially identical student-level coefficients while adding school-level factors. Interestingly, neither is significant: school SES has a negligible effect (0.03 points) and school size a negative but non-significant one (-0.91 points per standard deviation of size). The adjusted ICC rises slightly to 0.17, meaning that about 17% of the variance left unexplained by the fixed effects sits between schools. The fixed effects explain 7.8% of the total variance (marginal R² = 0.078), while fixed and random effects combined explain 23.2% (conditional R² = 0.232). The stability of the student-level coefficients across Models 2 and 3 suggests these effects are robust, whereas the school-level variables contribute little beyond them.
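To see the school-level deviations behind these ICC values, the estimated random intercepts (BLUPs) can be extracted from the full model with lme4's ranef(); a minimal sketch:
# Estimated school intercept deviations (BLUPs) from the full model
school_effects <- ranef(model3)$school_id
summary(school_effects$`(Intercept)`)  # spread of school deviations around the fixed intercept
# Schools ordered from highest to lowest estimated deviation
head(school_effects[order(-school_effects$`(Intercept)`), , drop = FALSE])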
Let’s create a visualization showing the distribution of math scores by school with connected mean values:
# Calculate mean scores for each school
school_means <- student_data %>%
  group_by(school_id) %>%
  summarize(mean_score = mean(math_score))
# Create the plot
ggplot(student_data, aes(x = factor(school_id), y = math_score)) +
  # Add individual student points
  geom_point(color = "gray70", alpha = 0.5, size = 1) +
  # Add school mean points
  geom_point(data = school_means,
             aes(x = factor(school_id), y = mean_score),
             color = "red", size = 3) +
  # Connect the means across schools
  geom_line(data = school_means,
            aes(x = factor(school_id), y = mean_score, group = 1),
            color = "red", linetype = "solid") +
  # Customize labels and theme
  labs(title = "Math Scores by School",
       subtitle = "Red points show school means",
       x = "School ID",
       y = "Math Score") +
  theme_minimal() +
  theme(
    panel.grid.minor = element_blank(),
    axis.text.x = element_text(angle = 0),
    plot.title = element_text(size = 14, face = "bold"),
    plot.subtitle = element_text(size = 12)
  ) +
  scale_y_continuous(limits = c(40, 100), breaks = seq(40, 100, by = 10))
The visualization shows substantial variation in mean math scores across schools (red points) relative to the spread of individual students within each school (gray points), illustrating why school-level effects must be modeled explicitly.
Our multilevel analysis revealed meaningful variation in math scores both within and between schools, and ICC values above 10% confirmed that multilevel modeling was appropriate. In the final model, student-level factors (SES, study hours, prior achievement) were significant predictors of math performance, while the school-level characteristics (school SES, school size) added little once student-level predictors were included.