This document demonstrates multilevel regression analysis using a simulated dataset. We’ll create nested data representing students within schools, fit several multilevel models, calculate ICC values, and visualize the results.
First, let’s create our simulated dataset with a nested structure:
# Load the packages used throughout this analysis
library(lme4)        # multilevel models via lmer()
library(performance) # icc()
library(gt)          # formatted tables
library(sjPlot)      # tab_model()
library(dplyr)       # data manipulation
library(ggplot2)     # visualization
set.seed(123)
# Create school-level variables
n_schools <- 30
school_ids <- 1:n_schools
school_ses <- rnorm(n_schools, mean = 0, sd = 1) # School SES
school_size <- round(rnorm(n_schools, mean = 300, sd = 50)) # School size
# Create student-level data
n_students_per_school <- round(rnorm(n_schools, mean = 50, sd = 10))
total_students <- sum(n_students_per_school)
# Create student data frame
student_data <- data.frame(
  student_id = 1:total_students,
  school_id = rep(school_ids, n_students_per_school),
  student_ses = rnorm(total_students, mean = 0, sd = 1),
  study_hours = round(runif(total_students, 1, 10)),
  prior_achievement = rnorm(total_students, mean = 70, sd = 10)
)
# Add school-level variables to student data
student_data$school_ses <- school_ses[student_data$school_id]
student_data$school_size <- school_size[student_data$school_id]
# Generate outcome variable (math scores) with multilevel structure
random_intercepts <- rnorm(n_schools, mean = 0, sd = 5)
# as.numeric() strips the matrix attributes that scale() returns,
# so math_score is stored as a plain numeric column
student_data$math_score <- 70 +
  3 * student_data$student_ses +
  2 * student_data$school_ses +
  1.5 * as.numeric(scale(student_data$study_hours)) +
  0.5 * as.numeric(scale(student_data$prior_achievement)) +
  random_intercepts[student_data$school_id] +
  rnorm(total_students, mean = 0, sd = 10)
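Before fitting any models, a quick optional sanity check confirms the nesting looks right: one row per student, school-level values repeated within each school, and roughly 50 students per school.
# Quick sanity check of the nested structure
head(student_data)                    # one row per student, school values repeated
range(table(student_data$school_id))  # smallest and largest school in the sample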
We’ll fit three different multilevel models with increasing complexity:
1. Null model (random intercept only): \(math\_score \sim 1 + (1|school\_id)\)
2. Student-level predictors: \(math\_score \sim student\_ses + study\_hours + prior\_achievement + (1|school\_id)\)
3. Full model with school-level predictors: \(math\_score \sim student\_ses + study\_hours + prior\_achievement + school\_ses + scale(school\_size) + (1|school\_id)\)
# Model 1: Null model (random intercept only)
model1 <- lmer(math_score ~ 1 + (1|school_id), data = student_data)
# Model 2: Random intercept with student-level predictors
model2 <- lmer(math_score ~ student_ses + study_hours + prior_achievement +
                 (1|school_id), data = student_data)
# Model 3: Full model with student and school-level predictors
model3 <- lmer(math_score ~ student_ses + study_hours + prior_achievement +
                 school_ses + scale(school_size) +
                 (1|school_id), data = student_data)
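Because the three models are nested, a likelihood-ratio comparison is a quick optional check that each block of predictors improves fit; lme4's anova() method refits the models with ML before comparing them.
# Likelihood-ratio tests comparing the nested models
# (lmer fits are automatically refit with ML for this comparison)
anova(model1, model2, model3)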
Let’s calculate the Intraclass Correlation Coefficient (ICC) for each model:
# Calculate ICC for all models
icc1 <- icc(model1)
icc2 <- icc(model2)
icc3 <- icc(model3)
# Create ICC comparison table
icc_table <- data.frame(
  Model = c("Null Model", "Student-Level", "Full Model"),
  ICC = c(icc1$ICC_adjusted, icc2$ICC_adjusted, icc3$ICC_adjusted)
)
# Create and print ICC table using gt
icc_table %>%
  gt() %>%
  fmt_number(columns = "ICC", decimals = 3) %>%
  tab_header(title = "ICC Values by Model") %>%
  tab_options(heading.background.color = "lightgray")
ICC Values by Model

| Model | ICC |
|---|---|
| Null Model | 0.145 |
| Student-Level | 0.162 |
| Full Model | 0.167 |
The ICC quantifies the proportion of variance in math scores attributable to differences between schools. A common rule of thumb holds that an ICC above 0.10 (10%) signals enough between-school variance to warrant multilevel modeling; all three models exceed this threshold, supporting the multilevel approach.
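For intuition, the null model's ICC can also be computed by hand from its variance components as \(ICC = \tau_{00} / (\tau_{00} + \sigma^2)\), where \(\tau_{00}\) is the between-school intercept variance and \(\sigma^2\) is the residual variance. A minimal sketch using lme4's VarCorr():
# Manual ICC for the null model: between-school variance over total variance
vc <- as.data.frame(VarCorr(model1))
tau00 <- vc$vcov[vc$grp == "school_id"]  # between-school intercept variance
sigma2 <- vc$vcov[vc$grp == "Residual"]  # within-school (residual) variance
tau00 / (tau00 + sigma2)                 # should reproduce icc(model1)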
Next, we summarize all three models side by side using tab_model() from sjPlot:
tab_model(model1, model2, model3,
          show.ci = FALSE,
          show.se = TRUE,
          collapse.se = TRUE, # Put SE below estimates in parentheses
          show.p = FALSE,
          p.style = "stars",
          emph.p = TRUE,
          string.pred = "Predictor",
          string.est = "Estimate",
          show.stat = FALSE,
          show.re.var = FALSE,
          show.intercept = TRUE,
          title = "Table 1: Regression Results for Math Scores",
          show.icc = TRUE,
          dv.labels = c("Model 1", "Model 2", "Model 3"),
          pred.labels = c("(Intercept)",
                          "Student SES",
                          "Study Hours",
                          "Prior Achievement",
                          "School SES",
                          "School Size (scaled)"))
Table 1: Regression Results for Math Scores

| Predictor | Model 1 | Model 2 | Model 3 |
|---|---|---|---|
| (Intercept) | 68.70 *** (0.81) | 61.84 *** (2.09) | 61.88 *** (2.09) |
| Student SES | | 2.65 *** (0.26) | 2.65 *** (0.26) |
| Study Hours | | 0.52 *** (0.10) | 0.52 *** (0.10) |
| Prior Achievement | | 0.06 * (0.03) | 0.06 * (0.03) |
| School SES | | | 0.03 (0.88) |
| School Size (scaled) | | | -0.91 (0.86) |
| ICC | 0.14 | 0.16 | 0.17 |
| N (schools) | 30 | 30 | 30 |
| Observations | 1507 | 1507 | 1507 |
| Marginal R² / Conditional R² | 0.000 / 0.145 | 0.073 / 0.223 | 0.078 / 0.232 |

Estimates with standard errors in parentheses; * p < 0.05, *** p < 0.001.
The regression results reveal several important findings. The null model (Model 1) estimates a baseline average math score of 68.70, with an ICC of 0.14 indicating that 14% of the variance in scores lies between schools. In Model 2, which adds student-level predictors, student SES emerges as a strong positive predictor: a one-unit increase in SES is associated with a 2.65-point gain (p < 0.001). Study hours also matter, with each additional hour associated with a 0.52-point increase (p < 0.001), and prior achievement has a smaller but still significant effect of 0.06 points (p < 0.05). Together these fixed effects account for about 7.3% of the total variance in math scores (marginal R² = 0.073).
The full model (Model 3) retains essentially identical student-level coefficients while adding school-level factors. Interestingly, neither is significant: school SES has a negligible effect (0.03 points) and school size a negative but non-significant one (-0.91 points per standard deviation of size). The adjusted ICC rises slightly to 0.17, meaning that about 17% of the variance left unexplained by the fixed effects sits between schools. The fixed effects explain 7.8% of the total variance (marginal R² = 0.078), while fixed and random effects combined explain 23.2% (conditional R² = 0.232). The stability of the student-level coefficients across Models 2 and 3 suggests these effects are robust, whereas the school-level variables contribute little beyond them.
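To see the school-level deviations behind these ICC values, the estimated random intercepts (BLUPs) can be extracted from the full model with lme4's ranef(); a minimal sketch:
# Estimated school intercept deviations (BLUPs) from the full model
school_effects <- ranef(model3)$school_id
summary(school_effects$`(Intercept)`)  # spread of school deviations around the fixed intercept
# Schools ordered from highest to lowest estimated deviation
head(school_effects[order(-school_effects$`(Intercept)`), , drop = FALSE])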
Let’s create a visualization showing the distribution of math scores by school with connected mean values:
# Calculate mean scores for each school
school_means <- student_data %>%
  group_by(school_id) %>%
  summarize(mean_score = mean(math_score))
# Create the plot
ggplot(student_data, aes(x = factor(school_id), y = math_score)) +
  # Add individual student points
  geom_point(color = "gray70", alpha = 0.5, size = 1) +
  # Add school mean points
  geom_point(data = school_means,
             aes(x = factor(school_id), y = mean_score),
             color = "red", size = 3) +
  # Connect the means across schools
  geom_line(data = school_means,
            aes(x = factor(school_id), y = mean_score, group = 1),
            color = "red", linetype = "solid") +
  # Customize labels and theme
  labs(title = "Math Scores by School",
       subtitle = "Red points show school means",
       x = "School ID",
       y = "Math Score") +
  theme_minimal() +
  theme(
    panel.grid.minor = element_blank(),
    axis.text.x = element_text(angle = 0),
    plot.title = element_text(size = 14, face = "bold"),
    plot.subtitle = element_text(size = 12)
  ) +
  scale_y_continuous(limits = c(40, 100), breaks = seq(40, 100, by = 10))
The visualization shows substantial variation in mean math scores across schools (red points) relative to the spread of individual students within each school (gray points), illustrating why school-level effects must be modeled explicitly.
Our multilevel analysis revealed meaningful variation in math scores both within and between schools, and ICC values above 10% confirmed that multilevel modeling was appropriate. In the final model, student-level factors (SES, study hours, prior achievement) were significant predictors of math performance, while the school-level characteristics (school SES, school size) added little once student-level predictors were included.