Advanced Statistics: Hierarchical Modeling

Author

Nils Mevenkamp | nils.mevenkamp@mci.edu

Published

April 30, 2025

About this script

In many fields, data are naturally organized into groups or clusters. For example, students are grouped within schools, patients are treated within hospitals, and repeated measurements are taken from the same individuals. Traditional statistical methods like ordinary linear regression assume that all observations are independent of each other. However, when data are grouped like this, that assumption is no longer valid as observations within the same group tend to be more similar to each other than observations from different groups.

Multilevel modeling (also called hierarchical or mixed-effects modeling) provides a way to correctly analyze such data. It allows us to account for both the overall relationships that apply to everyone (fixed effects) and the group-specific differences (random effects) that arise naturally. This approach improves the accuracy of our estimates and gives a deeper understanding of how relationships can vary across groups.

In this script, we will introduce multilevel models step-by-step, using clear examples and keeping mathematical formulas to a minimum. The focus will be on understanding when and why multilevel models are needed, how they differ from standard regression models, and how to interpret their results. No strong mathematical background is required, only a willingness to think carefully about the structure of the data.

Detailed R code is provided for each step of the discussion. Simply click to unfold it, take your time to understand it, and feel free to enhance your coding skills if you wish. If not, you can simply leave it closed. Please note that you do not need to replicate all of the code yourself, as much of it is included for didactic purposes to illustrate key concepts step by step. In practice, building multilevel models in R is quite straightforward and typically requires only a few essential functions.

Data

In the following, we will work with example data collected from 3 schools, specifically Hillcrest High School, Oakwood High School, and St Mary’s Church of England School. For each student, we have recorded the average test scores (ranging from 0% to 100%) and the average number of hours they studied shortly before their exams.
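
If you would like to run the code without the original data file, a structurally similar toy dataset can be simulated as follows. This is only a sketch, not the original data: the coefficients are loosely based on the school-level estimates reported later, and plotting helpers such as alpha_dots, alpha_bars, alpha_lines and school_colors are assumed to be defined in the script’s setup.

Code
# a structurally similar toy dataset (a sketch; NOT the original data)
set.seed(1)
school_shortnames = c("Hillcrest", "Oakwood", "St Mary’s")
n_per_school = c(60, 69, 54)                               # 183 students in total
df = data.frame(
  school = factor(rep(school_shortnames, n_per_school)),
  study_time = round(runif(sum(n_per_school), 2, 16), 1)   # average study hours
)
# school-specific intercepts/slopes (loosely based on the estimates reported later) plus noise
b0 = c(38.6, 21.1, 22.5)[as.integer(df$school)]
b1 = c(1.11, 2.09, 2.81)[as.integer(df$school)]
df$test_score = pmin(100, pmax(0, b0 + b1 * df$study_time + rnorm(nrow(df), 0, 3)))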

Here is the dataset:

Student scores are distributed more or less symmetrically across all schools:

Code
# show how student scores are distributed across schools
ggplot(df, aes(x = test_score)) +
  geom_histogram(binwidth = 5, fill = "#678", color = "white", alpha=alpha_dots) +
  theme_minimal() + theme(panel.grid = element_blank()) + 
  labs(title = "Histogram of Scores across all Schools", x = "Score", y = "Count") 

General question

The central question here is how student performance is influenced by both study time and school affiliation. Specifically, we want to know whether the variation in scores arises from between-school factors (e.g., different teaching methods or social norms) or from individual student traits (e.g., diligence or personality).

Hierarchical data

Hierarchical data, also called multilevel or nested data, consists of observations organized at two or more levels, where units at a lower level are grouped within higher-level clusters. For example, test scores (level 1) nested within students (level 2) nested within schools (level 3). Such structure induces intra-cluster correlation and calls for models (e.g. mixed-effects or hierarchical models) that account for both within- and between-group variation.

The dataset above has a two-level structure: average study time and test scores of 183 students (level 1) nested within 3 schools (level 2). By the end of this introduction, you’ll be ready to extend hierarchical regression models beyond two levels.
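
A quick way to confirm this two-level structure in R is to look at the number of schools and their cluster sizes:

Code
# level-2 units (schools) and their level-1 cluster sizes (students)
length(unique(df$school)) # number of schools
table(df$school)          # number of students per school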

Trivial (non-hierarchical) approaches

We’ll begin by exploring two straightforward methods and then show how hierarchical modeling addresses their shortcomings.

Simple linear regression (“constant intercept, constant slope”)

Simple linear regression fits a linear trend line through a point cloud of observed data to capture the association between a predictor and an outcome. In essence, it finds the straight line that best summarizes how changes in the predictor relate, on average, to changes in the outcome.

The scatter plot below shows the relationship between students’ average study time and their test scores, ignoring school membership for now.

Code
basic_ggplot = ggplot(df, aes(x = study_time, y = test_score)) +
  geom_point(alpha = alpha_dots, color="#678") +
  theme_minimal() + theme(panel.grid = element_blank()) + 
  labs(x = "Study Time", y = "Score")
basic_ggplot +
    labs(title = "Association of student's study time and scoring")

The scatter plot reveals a strong positive linear relationship between study time and performance, which is well captured by a simple linear regression model:

Code
# Fit a simple linear regression model (fixed intercept, fixed slope model)
model = lm(test_score ~ study_time, data = df)
# show model summary
summary(model)

Call:
lm(formula = test_score ~ study_time, data = df)

Residuals:
     Min       1Q   Median       3Q      Max 
-13.1068  -4.5792   0.5577   4.1368  11.4546 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  26.7993     1.2427   21.57   <2e-16 ***
study_time    2.0192     0.1326   15.23   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.266 on 181 degrees of freedom
Multiple R-squared:  0.5616,    Adjusted R-squared:  0.5592 
F-statistic: 231.9 on 1 and 181 DF,  p-value: < 2.2e-16

Knowing that both parameters are significantly different from zero, the model description can be reduced to basic regression parameters:

Code
# Collect "classic" regression parameter (call them fixed effects)
coef(model)
(Intercept)  study_time 
  26.799339    2.019198 

The estimated test score \(Score_\text{est}\) is modeled by the regression equation:

\(Score_\text{est} = 26.799 + 2.019\,Study\,\,Time\,(\text{hours})\)

According to this model, each additional hour of study increases the estimated test score by 2.019 points. If a student does not study at all, the model predicts a test score of 26.799 on average.
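
As a quick sanity check, the fitted equation can be evaluated for a hypothetical student who studies 10 hours (a small illustration using the model object fitted above):

Code
# predicted score for a (hypothetical) student studying 10 hours
predict(model, newdata = data.frame(study_time = 10))
# by hand: 26.799 + 2.019 * 10 = 46.99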

This modeled association can be visualized by the regression line as the closest approximation of the point cloud. In the terminology of hierarchical modeling we could call this simple model a “fixed intercept, fixed slope” model (see below).

Code
# Assign intercept and slope to variables
intercept = coef(model)[1]
slope = coef(model)[2]

basic_ggplot +
  labs(title = "Association of student's study time and scoring with regression line") + 
  geom_abline(intercept = intercept, slope = slope, color = "#345", linetype = "longdash")

Dummy-variable model (“varying intercepts, constant slopes”)

It is clear that there is a strong association between study time and performance across all schools. This general description, however, does not take into account that test score distributions vary between schools:

Code
# show how student scores are distributed across schools
ggplot(df, aes(x = test_score, fill=school)) +
  geom_histogram(binwidth = 5, color = "white", alpha = alpha_bars) +
  facet_wrap(~ school) +
  theme_minimal() + theme(panel.grid = element_blank(), legend.position="none") + 
  labs(title = "Histogram of Scores by School", x = "Score", y = "Count") +
  scale_fill_manual(values = school_colors) 

A more compact description of these differences is provided if scores are averaged across schools:

Code
school_means = data.frame(
  Mean=tapply(df$test_score, df$school, mean, na.rm=T), 
  SD=tapply(df$test_score, df$school, sd, na.rm=T)
  )
school_means
              Mean       SD
Hillcrest 48.84623 4.446091
Oakwood   39.32777 6.527925
St Mary’s 47.19057 8.718307

Hillcrest students perform consistently high. St Mary’s students come close on average but show the highest variation, while Oakwood students perform weakest, with moderate variation.

The differences between schools are not reflected in the basic regression model.

Code
# redefine scatter plot to identify school membership
basic_ggplot = ggplot(df, aes(x = study_time, y = test_score, color = school)) +
  geom_point(alpha = alpha_dots) +
  theme_minimal() + theme(panel.grid = element_blank()) + 
  labs(x = "Study Time", y = "Score") + 
  scale_color_manual(values = school_colors, name = "School")
basic_ggplot +
  labs(title = "Association of student's study time and scoring with regression line across three different scholls") + 
  geom_abline(intercept = intercept, slope = slope, 
              color = "#345", alpha = alpha_lines, linetype = "longdash") 

In order to adapt the regression model to these differences, school membership must be included as an additional explanatory variable. Since school membership is categorical, however, we must not interpret differences between schools as interval information. In other words, the common interpretation “increasing school by one unit leads to an increase of the test score by b units” is meaningless, as there is no continuum between schools.

In order to still allow this interpretation we have to code schools as dummy variables. Dummy variables are binary (0,1) variables indicating group membership. In general terms, coding of n groups requires n-1 dummy variables. In the example given, 3 different schools must be coded with 2 dummy variables as follows:

Code
dummy_names = paste0("school",school_shortnames[2:3])
dummy_coding = data.frame(school_shortnames, c(0,1,0), c(0,0,1))
names(dummy_coding) = c("School", dummy_names)
dummy_coding
     School schoolOakwood schoolSt Mary’s
1 Hillcrest             0               0
2   Oakwood             1               0
3 St Mary’s             0               1

The dummy variable schoolOakwood is equal to 1 if a student is member of Oakwood High School and zero otherwise. schoolSt Mary’s is equal to 1 if a student is member of St Mary’s Church of England School and zero otherwise. The first group can be regarded as reference group. Students of Hillcrest High School are identified if both schoolOakwood and schoolSt Mary’s are zero.

In R this coding is done automatically: factors are split into dummy variables when included as explanatory variables in a regression model.
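
You can inspect the automatically generated dummy variables with model.matrix(); the first column is the intercept, the remaining columns are the school dummies:

Code
# how R expands the factor school into dummy variables internally
head(model.matrix(~ school, data = df))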

Code
# Fit a dummy-variable regression model (varying intercepts, constant slope)
model = lm(test_score ~ study_time + school, data = df)
# show model summary
summary(model)

Call:
lm(formula = test_score ~ study_time + school, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-8.9832 -2.2652 -0.0306  2.0228  8.8218 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)     30.85270    0.92193  33.465   <2e-16 ***
study_time       1.94884    0.08736  22.308   <2e-16 ***
schoolOakwood   -8.51883    0.61211 -13.917   <2e-16 ***
schoolSt Mary’s -0.72933    0.65003  -1.122    0.263    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.458 on 179 degrees of freedom
Multiple R-squared:  0.813, Adjusted R-squared:  0.8099 
F-statistic: 259.4 on 3 and 179 DF,  p-value: < 2.2e-16

With the extended model four parameters are estimated in total, namely the intercept and 3 regression parameters for study time, Oakwood High School and St Mary’s Church of England School (Hillcrest High School serves as reference):

Code
# Collect regression parameter
coef(model)
    (Intercept)      study_time   schoolOakwood schoolSt Mary’s 
     30.8526985       1.9488407      -8.5188314      -0.7293296 

The association between study time and performance is modeled very similarly to the simple model: each additional hour of studying results in approximately 1.949 additional score points on average. In addition, the new model estimates differences in performance between schools. Compared to Hillcrest High School as reference, Oakwood High School scores 8.519 points lower and St Mary’s Church of England School scores 0.729 points lower.

These differences are approximately equivalent to the differences of group averages reported above:

Code
# school means (reported above)
school_means[,"Mean", drop=FALSE]
              Mean
Hillcrest 48.84623
Oakwood   39.32777
St Mary’s 47.19057

In the regression models, part of the test score variation is already captured by the slope of the regression line. This is why the estimated dummy regression coefficients (Coefficient) are slightly smaller in magnitude than the raw differences between school means (Difference).

Code
# collect intercept and coefficients of the dummy-variable model
intercept = coef(model)[1]
slope = coef(model)[2:4] # study_time, schoolOakwood, schoolSt Mary’s
# differences between score means by school compared to school-dummy regression coefficients
print(data.frame(Comparison=paste(school_shortnames[2:3], "-", school_shortnames[1]),
                 Difference=round(school_means[,"Mean"][2:3]-school_means[,"Mean"][1], 3), 
                 Coefficient=round(slope[2:3], 3)),
      row.names = FALSE)
            Comparison Difference Coefficient
   Oakwood - Hillcrest     -9.518      -8.519
 St Mary’s - Hillcrest     -1.656      -0.729

The estimated test score \(Score_\text{est}\) is now modeled by the regression equation:

\(Score_\text{est} = 30.853 + 1.949\,Study\,\,Time -8.519\,schoolOakwood -0.729\,schoolSt Mary’s\)

Due to the mutually exclusive definition of the dummy variables, the regression model can be rewritten as three separate models.

For students of Hillcrest High School (schoolOakwood = 0, schoolSt Mary’s = 0)

\(Score_\text{est} = 30.853 + 1.949\,Study\,\,Time\)

For students of Oakwood High School (schoolOakwood = 1, schoolSt Mary’s = 0)

\(Score_\text{est} = 30.853 + 1.949\,Study\,\,Time -8.519 = 22.334 + 1.949\,Study\,\,Time\)

For students of St Mary’s Church of England School (schoolOakwood = 0, schoolSt Mary’s = 1)

\(Score_\text{est} = 30.853 + 1.949\,Study\,\,Time -0.729 = 30.123 + 1.949\,Study\,\,Time\)

It clearly follows that dummy variables assign different intercepts to the regression model. Each school has a unique “general level” while the slopes remain constant. This is visualized by parallel regression lines:

Code
basic_ggplot + 
  labs(title = "Regression model including school membership as dummy variables") + 
  geom_abline(aes(intercept = intercept + c(0,slope[2:3]), 
                  slope = slope[1], color = school), 
              data = data.frame(school = school_shortnames), alpha = alpha_lines, linetype = "dashed")

In summary, the enhanced model still shows that each additional hour of study raises student performance by about two points on average (assuming the same slope across schools). It also reveals that schools differ in their overall performance levels (reflected in different intercepts).

Separate models for each school (“varying intercept, varying slope”)

As an alternative to including school membership as a dummy variable in a multivariate model, we can estimate independent models for each school separately. This allows us to capture the individual characteristics of each school.

Code
# Fit a linear regression model for each school separately
model = lapply(school_shortnames, function(x) lm(test_score ~ study_time, data = df[df$school==x,]))
# show model summaries
lapply(model, summary)
[[1]]

Call:
lm(formula = test_score ~ study_time, data = df[df$school == 
    x, ])

Residuals:
    Min      1Q  Median      3Q     Max 
-6.4931 -2.3417  0.1974  1.8648  6.5145 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  38.5791     1.1562   33.37  < 2e-16 ***
study_time    1.1120     0.1188    9.36  3.4e-13 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.83 on 58 degrees of freedom
Multiple R-squared:  0.6017,    Adjusted R-squared:  0.5948 
F-statistic: 87.61 on 1 and 58 DF,  p-value: 3.395e-13


[[2]]

Call:
lm(formula = test_score ~ study_time, data = df[df$school == 
    x, ])

Residuals:
    Min      1Q  Median      3Q     Max 
-8.1658 -1.8990 -0.0618  1.6902  6.9972 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  21.1459     1.1579   18.26   <2e-16 ***
study_time    2.0851     0.1265   16.48   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.925 on 67 degrees of freedom
Multiple R-squared:  0.8022,    Adjusted R-squared:  0.7992 
F-statistic: 271.7 on 1 and 67 DF,  p-value: < 2.2e-16


[[3]]

Call:
lm(formula = test_score ~ study_time, data = df[df$school == 
    x, ])

Residuals:
    Min      1Q  Median      3Q     Max 
-6.1570 -1.6024 -0.1778  1.6068  6.7457 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   22.549      1.108   20.35   <2e-16 ***
study_time     2.814      0.120   23.45   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.587 on 52 degrees of freedom
Multiple R-squared:  0.9136,    Adjusted R-squared:  0.9119 
F-statistic: 549.9 on 1 and 52 DF,  p-value: < 2.2e-16

With independent models we estimate six parameters in total, namely one intercept and one regression coefficient for each school.

Code
# Collect regression parameter
# sapply(model, coef) #works, but hard to read
independent_models = as.data.frame(sapply(model, coef)) # if formatted as data frame ...
names(independent_models) = school_shortnames# ... we can assign column names for better readability
independent_models # voila
            Hillcrest   Oakwood St Mary’s
(Intercept) 38.579145 21.145910 22.549534
study_time   1.112006  2.085074  2.813669

With this information at hand, we can directly compare learning efficiency across schools: steeper slopes mean that the same number of study hours yields larger gains in performance. Students at Hillcrest High School show the highest average scores but exhibit the weakest improvement per additional hour of study (i.e., the lowest slope). By contrast, students at Oakwood High School, although being at a lower baseline, display a noticeably steeper learning curve. Most strikingly, St Mary’s Church of England School ranks highest in learning efficiency. It has the steepest slope, and its intercept is correspondingly lower to reflect that accelerated rate of score gain.

Code
independent_intercepts = unlist(independent_models[1, ])
independent_slopes = unlist(independent_models[2, ])

basic_ggplot + 
  labs(title = "Independent regression models by school membership") + 
  geom_abline(aes(intercept = independent_intercepts, 
                  slope = independent_slopes, 
                  color = school), 
              data = data.frame(school = school_shortnames), alpha = alpha_lines, linetype = "dashed")

In summary, fitting separate models for each school clearly shows how learning efficiency varies (through differing slopes), but it makes comparing their average performance difficult, since the intercepts are now entangled with those slopes. In other words, while these individual models underscore inter-school differences, they overlook the overall effects that aren’t specific to a particular school, such as the common relationship between study time and scores.

Multi-level (mixed-effects, hierarchical) models

We will now restart the model-building process from the ground up, beginning with general differences between schools and gradually developing an integrated model that captures both student-level and school-level effects, thereby accounting for variation within and between schools.

Fixed and random effects

When using multilevel models we must clearly distinguish between so-called fixed and random effects, which can be defined as follows:

  1. Fixed effects are regression coefficients that apply to the entire population (like in standard regression), e.g., the overall intercept and slope that describe how study time relates to test score across all students and schools. The regression parameters calculated in the introductory section on simple linear regression can be regarded as fixed effects (fixed intercept and fixed slope), as both are constant across schools.

  2. Random effects allow the coefficients to vary by group to account for clustering, repeated measures, etc. For example, letting each school have its own intercept and/or slope captures between-school variability around the overall (fixed) effects. The varying intercepts introduced above via school dummy variables correspond conceptually to random intercepts.

But why do we call these effects “fixed” vs. “random” rather than “constant” vs. “varying”?

The answer is rooted in statistical theory.

The terms “constant” vs. “varying” describe how something behaves numerically in your sample. In contrast, “fixed” vs. “random” refers to how the model treats the effect in terms of inference and estimation. In other words, it’s not just about whether the effect differs across groups; it’s about how you conceptualize those differences and what your modeling goals are.

In fixed effects, the factor levels (e.g., treatments, groups) are fixed and exhaust all possibilities. In random effects, the factor levels are random draws from a larger population.

A fixed effect is treated as a parameter to estimate directly. You care about the specific group levels or coefficients (e.g., the effect for each student, each school, each drug, each year, each political party, …). The levels are not random samples; they are the entire set you’re interested in. In our example, when your aim is to compare student scores across three specific schools, you include the schools as dummy variables. You do not generalize to all schools, just those three.

A random effect is treated as a random variable drawn from a distribution (often normal). You don’t estimate each group’s effect directly; instead, you model the variation across groups. You’re interested in generalizing beyond your sample. In our example, you have students nested in three schools, and you think school has an influence. You treat the schools as a random sample from a larger population of schools. You’re interested in the overall variation in school effects, not in each school specifically.

Model fitting with R

For fitting a multilevel model with R we employ the function lmer() (Linear Mixed-Effects Regression) from the lme4 library. It fits linear models that include both:

  • Fixed effects (like in standard regression)
  • Random effects (to account for clustering, repeated measures, etc.)

In order to read the R formulas below it is important to understand the general notation. For two-level data it reads

outcome ~ 1 + predictor + (1 + predictor | grouping factor)

Note that the fixed-effect part 1 + ... is separate from the random-effect part ( ... | grouping factor). The random-effect part must be set in parentheses.

The value 1 in the fixed part means that a fixed intercept is included in the model, representing the baseline level of the outcome variable when all predictors are zero. A value of 0 in the fixed part removes the intercept, forcing the regression through the origin; this is not recommended in most cases. Thus, 1 simply denotes a fixed intercept. If no number is specified in the fixed-effect part, 1 is assumed by default. The predictor in the fixed part denotes the modeling of fixed slopes.

With the random part we declare the model to allow certain effects to vary across the levels of a grouping factor (here: schools). A 1 inside the random part defines random intercepts while 0 inside the random part defines no random intercepts, respectively. A predictor inside the random part defines random slopes. The grouping factor is declared after the |. The full specification of each random effect is enclosed in parentheses.

Null model

Hierarchical modeling starts with estimating the so-called null model, also called the intercept-only model. Its purpose is to decompose the overall variance of student scores into two components, i.e. to determine how much is attributable to each level of the hierarchy:

  • Between-school variance: The part of the variance that comes from differences in average scores between schools.
  • Within-school variance: The part of the variance that comes from differences among students within the same school.
Code
model = lmer(test_score ~ 1 + (1 | school), data = df) 
summary(model)
Linear mixed model fit by REML ['lmerMod']
Formula: test_score ~ 1 + (1 | school)
   Data: df

REML criterion at convergence: 1221.5

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-3.4496 -0.5801 -0.0409  0.5877  3.7460 

Random effects:
 Groups   Name        Variance Std.Dev.
 school   (Intercept) 25.22    5.022   
 Residual             44.96    6.705   
Number of obs: 183, groups:  school, 3

Fixed effects:
            Estimate Std. Error t value
(Intercept)   45.112      2.942   15.33

The fixed effect of the null model 1 + ... estimates the overall mean test score (fixed intercept).

Code
fixef(model)
(Intercept) 
   45.11202 

The random effect of the null model (1 | school) refers to each school’s deviation from that mean (random intercept by school).

Code
ranef(model)
$school
          (Intercept)
Hillcrest    3.626461
Oakwood     -5.638583
St Mary’s    2.012122

with conditional variances for "school" 

The null model includes no predictors (here: the variation of study time is ignored). As graphical representations of its fixed and random intercepts we see parallel horizontal lines.

Code
# redefine scatter plot to identify school membership + all schools
basic_ggplot = basic_ggplot +
  scale_color_manual(values = school_colors4, name = "School") +
  scale_linetype_manual(values = school_linetypes4, name = "School")

fixed_intercept = fixef(model)[1]
random_intercept = unlist(ranef(model)[1])

basic_ggplot + 
  labs(title = "The Null Model") + 
  geom_abline(aes(intercept = fixed_intercept + random_intercept, 
                  slope = 0, color = school), 
              data = data.frame(school = school_shortnames), alpha = alpha_lines, linetype = "dashed") + 
  geom_abline(aes(intercept = fixed_intercept, 
                  slope = 0, color = school),
              data = data.frame(school = "All"), alpha = alpha_lines, linetype = "longdash") 

As already outlined at the beginning of this primer, the schools show different average student scores (column Mean in the table below). The fixed_intercept of the null model estimates a single grand mean that is constant across all schools. The random_intercept describes each school’s estimated deviation from this grand mean. The school means as estimated by the null model are the sum of the fixed and random effects (last column):

null_model_estimate = fixed_intercept + random_intercept.

It is evident that these estimates come close to the observed means (first column).

Code
#compare observed means with null model estimates
school_means_hierarchical = school_means[,"Mean", drop=FALSE]# school means (reported above)
school_means_hierarchical$fixed_intercept=fixed_intercept#mean(df$test_score, na.rm=T)
school_means_hierarchical$random_intercept=unlist(random_intercept)
school_means_hierarchical$null_model_estimate=fixed_intercept+unlist(random_intercept)
school_means_hierarchical
              Mean fixed_intercept random_intercept null_model_estimate
Hillcrest 48.84623        45.11202         3.626461            48.73848
Oakwood   39.32777        45.11202        -5.638583            39.47344
St Mary’s 47.19057        45.11202         2.012122            47.12414

The null model helps determine whether multilevel modeling is necessary or if a single‐level analysis could suffice. It asks: “Do we need random intercepts to estimate group means accurately, or can fixed intercepts alone yield sufficiently precise estimates?”. To answer this question the intraclass correlation coefficient (ICC) is computed from the null model. The ICC quantifies the share of total test score variance that lies between schools versus within schools. It is calculated as

\(\text{ICC}=\frac{\text{variance between groups}}{\text{variance between groups} + \text{residual variance}}\)

Code
vc = as.data.frame(VarCorr(model))$vcov
icc1 <- vc[1] / (vc[1] + vc[2])
#n   <- mean(table(df$school))
#icc2 <- vc[1] / (vc[1] + vc[2]/n)

According to the model summary above we calculate

\(\text{ICC}=\frac{25.22}{25.22+44.957} = 0.359\)

or even shorter by using icc() as implemented with R:

Code
# Is it worth to estimate a hierarchical model?
performance::icc(model)
# Intraclass Correlation Coefficient

    Adjusted ICC: 0.359
  Unadjusted ICC: 0.359

In short, 35.9% of the total variance can be attributed to differences between schools. This is a high proportion that clearly calls for multilevel modeling.

There is, however, no hard-and-fast ICC cutoff that universally “forces” you into a multilevel model, but there are commonly used rules of thumb:

  • ICC approx. 0.10 or higher is often taken as a signal that clustering is substantial enough to warrant multilevel modeling. In practice, many applied researchers will switch to a multilevel specification once they observe an ICC₁ of about 0.10 or above.
  • ICC below 0.05 is frequently described as “negligible” (i.e. small clustering), suggesting that single-level methods may suffice.

As a side note, computing the ICC is closely related to the variance decomposition of a simple one-way ANOVA.
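
To see the connection, the same between/within decomposition can be obtained from a one-way ANOVA of test scores on schools (a sketch; its between- and within-group mean squares underlie the classical ANOVA-based ICC estimate):

Code
# one-way ANOVA of test scores by school: between- vs. within-school variation
summary(aov(test_score ~ school, data = df))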

Model types

Below we introduce four model specifications, differing only in whether intercepts and slopes are treated as fixed or random:

  • Fixed intercept, fixed slope: Both the intercept and slope are constant across schools. This is equivalent to a standard (single‐level) OLS regression where every school shares the same baseline score (intercept) and learning efficiency (slope). Because of its non-hierarchical nature we’ll exclude this model from further consideration.
  • Random intercept, fixed slope: Schools have different baseline scores (random intercepts) but identical learning efficiency (fixed slope). Analogous to including school dummy variables in an OLS model, producing parallel regression lines.
  • Fixed intercept, random slope: All schools share a common baseline (fixed intercept: the expected score at zero study hours is the same for every school) but learning efficiency varies (random slopes). This model has no equivalent above.
  • Random intercept, random slope: Both the starting point and the slope vary by school. This is the most flexible specification, akin to fitting separate regressions per school while still borrowing strength across groups. It is quite close to separate linear regressions for subgroups as outlined above.

Your model choice should reflect the structure you expect in the data. A hierarchical (multilevel) model should be chosen if you expect at least one random effect.

Let’s go through the basic modeling process one by one.
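
For orientation, here are the model calls corresponding to these specifications as they will be fitted in the following subsections (note that the fixed intercept, random slope variant is specified without a fixed slope, exactly as used below):

Code
# the four specifications as lm()/lmer() calls
lm(test_score ~ 1 + study_time, data = df)                                # fixed intercept, fixed slope (OLS)
lmer(test_score ~ 1 + study_time + (1 | school), data = df)               # random intercept, fixed slope
lmer(test_score ~ 1 + (0 + study_time | school), data = df)               # fixed intercept, random slope
lmer(test_score ~ 1 + study_time + (1 + study_time | school), data = df)  # random intercept, random slope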

Fit a random intercept, fixed slope model

With this model type we expect schools to have different baseline scores (random intercepts) but identical learning efficiency (fixed slope).

Model output is as follows

Code
# Fit random intercept, fixed slope model
# (1 | school) = random intercept for each level of School
model = lmer(test_score ~ 1 + study_time + (1 | school), data = df)
summary(model)
Linear mixed model fit by REML ['lmerMod']
Formula: test_score ~ 1 + study_time + (1 | school)
   Data: df

REML criterion at convergence: 984.8

Scaled residuals: 
     Min       1Q   Median       3Q      Max 
-2.59014 -0.65776 -0.01149  0.58413  2.55627 

Random effects:
 Groups   Name        Variance Std.Dev.
 school   (Intercept) 22.12    4.704   
 Residual             11.96    3.458   
Number of obs: 183, groups:  school, 3

Fixed effects:
            Estimate Std. Error t value
(Intercept) 27.76200    2.83651   9.787
study_time   1.94941    0.08736  22.315

Correlation of Fixed Effects:
           (Intr)
study_time -0.274

Let’s break the model results down for our empirically sampled schools to create a first visual impression. In addition to the fixed effects …

Code
# Collect fixed effects
fixef(model)
(Intercept)  study_time 
  27.762001    1.949412 

… the model output comprises detailed school-specific (random) intercepts, measured as differences between school-intercepts and the fixed intercept:

Code
# Collect random effects
ranef(model)
$school
          (Intercept)
Hillcrest    3.057872
Oakwood     -5.390882
St Mary’s    2.333009

with conditional variances for "school" 

By analogy with the dummy regression above the random intercepts define a separate regression equation for each school. In contrast to the simple dummy model, however, we now know the overall average, too.

Code
fixed_intercept = fixef(model)[1]
fixed_slope = fixef(model)[2]
random_intercept = unlist(ranef(model)$school[,1])

basic_ggplot + 
  labs(title = "Random intercept, fixed slope model") + 
  geom_abline(aes(intercept = fixed_intercept + random_intercept, 
                  slope = fixed_slope, color = school), 
              data = data.frame(school = school_shortnames), alpha = alpha_lines, linetype = "dashed") + 
  geom_abline(aes(intercept = fixed_intercept, 
                  slope = fixed_slope, color = school),
              data = data.frame(school = "All"), alpha = alpha_lines, linetype = "longdash") 

Code
# basic_ggplot + facet_wrap(~ school) # optional: one panel per school

With the visualization in mind we can now look at the model output in more detail. Here’s how to interpret each section of the output.

Formula

Code
formula(model)
test_score ~ 1 + study_time + (1 | school)

You’re predicting the students’ scores based on study time. The term 1 + study_time defines a fixed intercept (1) and a fixed slope. You could also write just study_time, since a fixed intercept is assumed by default. The random effect is declared in parentheses (1 | school) and includes a random intercept (1 ...) for each school (... | school), meaning that each school has its own baseline score.

REML criterion

Code
# REML criterion
summary(model)$AICtab["REML"]
    REML 
984.8395 

This is the model’s value of the REML (REstricted Maximum Likelihood) objective function. It is a quality indicator but not useful by itself. Lower is better only when comparing models fit by REML.

Scaled Residuals

Code
# Scaled residuals
raw_residuals = resid(model) # Raw residuals
residual_sd = sigma(model) # Residual standard deviation (sigma)
scaled_residuals = raw_residuals / residual_sd # Scaled residuals
df_coeff = as.data.frame(t(round(quantile(scaled_residuals, c(0, .25, .5, .75, 1)), 5))) # format output
names(df_coeff) = c("Min", "1Q","Median", "3Q", "Max")
df_coeff # scaled residual quantiles as table
       Min       1Q   Median      3Q     Max
1 -2.59014 -0.65776 -0.01149 0.58413 2.55627

These are standard diagnostic stats showing how the residuals (differences between predicted and observed values) are spread. A roughly symmetric distribution around 0 is good — we have that here.

Random effects

Code
# Random effects (variance components)
df_coeff = as.data.frame(summary(model)$varcor)[,c(1,4,5)]
df_coeff[,2:3] = round(df_coeff[,2:3], 3)
names(df_coeff) = c("Group", "Variance", "Std.Dev.")
df_coeff
     Group Variance Std.Dev.
1   school   22.124    4.704
2 Residual   11.960    3.458

School (Intercept) variance (22.124) indicates the amount of variation in scores between schools. Residual variance (11.96) measures the amount of variation in scores within schools (i.e., among students in the same school). If we compute the ICC = 22.124 / (22.124 + 11.96) = 0.649 we see that 64.9% of the total variation in student scores is due to differences between schools. This is a strong clustering effect, indicating a substantial between-school variation in baseline scores.
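
The same ratio can be obtained with performance::icc(), as for the null model above (here as the adjusted ICC, since the model now contains study_time as a predictor):

Code
# ICC of the random intercept, fixed slope model
performance::icc(model)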

Fixed effects

Code
# Fixed effects
summary(model)$coefficients
             Estimate Std. Error   t value
(Intercept) 27.762001  2.8365148  9.787363
study_time   1.949412  0.0873593 22.314878

The average baseline test score is 27.762 when study time = 0 (adjusted for random intercepts). Each additional hour of studying increases the predicted score by about 1.949 points on average, across schools. t-values are large, meaning that both effects are statistically strong.

Correlation of fixed effects

Code
# Correlation of fixed effects
vcov_matrix <- vcov(model) # Variance-covariance matrix of fixed effects
correlation_matrix = cov2cor(vcov_matrix) # Correlation matrix of fixed effects
round(correlation_matrix[1,2], 3)
[1] -0.274

There’s a slight negative correlation between the intercept and the slope. In this dataset, schools with higher average baseline scores tend to have a slightly smaller effect of study time, and vice versa. This is a reasonable finding which corresponds to the above observation that steeper slopes imply smaller intercepts.

To sum up, study time has a strong, positive effect on student scores. There are substantial differences between schools in their average performance (random intercept). The model fits the data well and accounts for both fixed and group-level variation.

Fit a fixed intercept, random slope model

With this model type we assume that schools share the same baseline (fixed intercept) while the impact of study time differs between schools (random slopes).

Code
# Fit fixed intercept, random slope model
model = lmer(test_score ~ 1 + (0 + study_time | school), data = df)
summary(model)
Linear mixed model fit by REML ['lmerMod']
Formula: test_score ~ 1 + (0 + study_time | school)
   Data: df

REML criterion at convergence: 1022.2

Scaled residuals: 
     Min       1Q   Median       3Q      Max 
-2.25026 -0.65588 -0.06263  0.65737  2.74581 

Random effects:
 Groups   Name       Variance Std.Dev.
 school   study_time  4.034   2.008   
 Residual            14.037   3.747   
Number of obs: 183, groups:  school, 3

Fixed effects:
            Estimate Std. Error t value
(Intercept)  27.5526     0.8848   31.14

In order to illustrate the model results for our empirically sampled schools we have to collect the fixed intercept …

Code
fixef(model)
(Intercept) 
   27.55259 

… and the random slopes …

Code
ranef(model)
$school
          study_time
Hillcrest   2.185660
Oakwood     1.417459
St Mary’s   2.298307

with conditional variances for "school" 

… to compile three different regression equations:

Code
random_slope = unlist(ranef(model)$school)
fixed_intercept = fixef(model)

basic_ggplot + 
  labs(title = "Fixed intercept, random slope model") + 
  geom_abline(aes(intercept = fixed_intercept, 
                  slope = random_slope, color = school), 
              data = data.frame(school = school_shortnames), alpha = alpha_lines, linetype = "dashed") +
    scale_color_manual(values = school_colors) 

In more detail, the formula

Code
formula(model)
test_score ~ 1 + (0 + study_time | school)

declares

  • a fixed intercept 1 + ...: The general baseline, independent of school membership.
  • no random intercept ... (0 + ...): All schools start at the same baseline.
  • random slopes for study time (... study_time | school): Each school has its own effect (slope) of study time on test score, so that some schools might benefit more from study time, others less.

A brief summary of the model output reads as follows:

Residuals are roughly symmetrically distributed around 0. The baseline test score across all schools (when study time = 0) is 27.553, which is significantly greater than 0 (t > 1.96). Study time slopes vary substantially across schools (random slope variance = 4.034), but the residual variance is much larger (14.037). This is reflected in the REML criterion of 1022.172, which is worse (higher) than that of the previous model.

In summary, a fixed intercept, random slope model does not provide an improvement over the previous model. This is no surprise, as we have already established substantial differences in baseline scores between schools (random intercepts).

Fit a random intercept, random slope model

With this model type we assume that schools have both different baselines (random intercepts) and different study efficiencies (random slopes).

Code
# Fit random intercept, random slope model
model = lmer(test_score ~ 1 + study_time + (1 + study_time | school), data = df)
summary(model)
Linear mixed model fit by REML ['lmerMod']
Formula: test_score ~ 1 + study_time + (1 + study_time | school)
   Data: df

REML criterion at convergence: 916.8

Scaled residuals: 
     Min       1Q   Median       3Q      Max 
-2.92998 -0.71009 -0.00639  0.59938  2.48571 

Random effects:
 Groups   Name        Variance Std.Dev. Corr 
 school   (Intercept) 92.4861  9.6170        
          study_time   0.7127  0.8442   -0.87
 Residual              7.8295  2.7981        
Number of obs: 183, groups:  school, 3

Fixed effects:
            Estimate Std. Error t value
(Intercept)  27.4288     5.5919   4.905
study_time    2.0026     0.4925   4.066

Correlation of Fixed Effects:
           (Intr)
study_time -0.871

Let’s again start with the visualization of the model results for our empirically sampled schools. We now have both a fixed intercept and a fixed slope …

Code
fixef(model)
(Intercept)  study_time 
  27.428832    2.002611 

… together with three pairs of random intercepts and random slopes:

Code
ranef(model)
$school
          (Intercept)  study_time
Hillcrest   10.989442 -0.87464142
Oakwood     -6.277098  0.08494417
St Mary’s   -4.712344  0.78969725

with conditional variances for "school" 

This adds up to four different regression equations, one for each school as defined by the combinations of school-specific intercepts and slopes, and one for the fixed effects only.
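
The school-specific intercepts and slopes (fixed plus random part) can also be read off directly with coef(), which returns one row per school:

Code
# combined (fixed + random) coefficients per school
coef(model)$school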

Code
fixed_intercept = fixef(model)[1]
fixed_slope = fixef(model)[2]
random_intercept = unlist(ranef(model)$school[1])
random_slope = unlist(ranef(model)$school[2])

basic_ggplot + 
  labs(title = "Random intercept, random slope model") + 
  geom_abline(aes(intercept = fixed_intercept + random_intercept, 
                  slope = fixed_slope + random_slope, color = school), 
              data = data.frame(school = school_shortnames), alpha = alpha_lines, linetype = "dashed") + 
  geom_abline(aes(intercept = fixed_intercept, 
                  slope = fixed_slope, color = school),
              data = data.frame(school = "All"), alpha = 1, linetype = "longdash") 

This model clearly shows the best performance with a REML criterion of 916.784 which is much better (lower) compared to the two previous multilevel models. Residuals are still roughly symmetrically distributed. Both the baseline test score (27.429) and the general slope across all schools (2.003) are significantly greater than 0 (t > 1.96), indicating that students start with at least some significant knowledge and generally perform better with increasing study time, independent of the school they’re in. In addition, intercepts and study time slopes vary substantially across schools (random intercept variance = 92.486, random slope variance = 0.713, with residual variance as small as 7.829), proving differences in both baseline scores and study performance between schools.
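
As a rough side-by-side check of this comparison, the REML criteria of the three multilevel models can be collected in one place (a sketch that refits the models under their specifications shown above):

Code
# collect the REML criteria of the three multilevel models
m_ri   = lmer(test_score ~ 1 + study_time + (1 | school), data = df)               # random intercept, fixed slope
m_rs   = lmer(test_score ~ 1 + (0 + study_time | school), data = df)               # fixed intercept, random slope
m_rirs = lmer(test_score ~ 1 + study_time + (1 + study_time | school), data = df)  # random intercept, random slope
sapply(list(random_intercept = m_ri, random_slope = m_rs, both_random = m_rirs),
       function(m) summary(m)$AICtab["REML"])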

Summary

When handling individuals (here: students) nested within clusters (here: schools), three major modeling strategies can be compared.

The first approach is the dummy-variable model, where schools are included as fixed effects using dummy variables in a standard OLS regression, such as lm(test_score ~ study_time + school_name, data = …). This method controls for any constant, unobserved differences between schools and is easy to implement without special software. It allows direct tests of differences between schools through contrasts. However, it scales poorly if many schools are involved, as a separate dummy is needed for each one. Furthermore, because it estimates each school’s intercept independently, small schools tend to have very noisy estimates. The method also struggles to incorporate school-level predictors easily and cannot accommodate varying slopes across schools without greatly complicating the model with numerous interaction terms.

The second approach involves fitting separate models for each school. Using a structure like lapply(school_name, function(x) lm(test_score ~ study_time, data = df[df$school_name==x,])), each school receives its own regression. This strategy is very simple to understand and offers complete flexibility by allowing each school its own intercept and slope. However, it has significant drawbacks: it does not pool information across schools, meaning small schools produce highly variable and unreliable estimates. It is also difficult to compare schools systematically, as you end up with a separate set of coefficients for each rather than a unified summary. Moreover, this method makes it impossible to estimate an overall population effect, such as an average slope, and it quickly becomes tedious if summaries or plots across schools are needed.

The third approach is the multilevel or mixed-effects model, exemplified by fitting lmer(test_score ~ study_time + (1 + study_time | school), data = …). This method introduces partial pooling, which stabilizes estimates for small schools by borrowing strength from the entire sample. It allows simultaneous estimation of both overall (fixed) effects and school-specific (random) deviations in intercepts and slopes. Multilevel models are especially advantageous when dealing with many subgroups at higher level because they efficiently manage the complexity that arises from large numbers of groups. They also naturally accommodate group-level predictors and cross-level interactions, making more efficient use of the available data and often yielding tighter confidence intervals and better predictive performance. The main challenges are that the method is more complex, requires specialized software like lme4, and demands careful checking of assumptions such as the normality of random effects. Additionally, interpretation can be less straightforward because instead of having a single coefficient per school, you work with distributions of deviations.

In conclusion, dummy-variable models are reasonable if you have a small number of large schools and are primarily interested in differences in intercepts. Separate models offer maximum flexibility but sacrifice precision and comparability. Multilevel models generally offer the best balance when dealing with nested data structures, providing group-specific estimates while borrowing strength across groups and enabling quantification of both within- and between-group effects.