Multilevel (Hierarchical) Models: An Overview

Multilevel models (also called hierarchical linear models or mixed-effects models) are statistical frameworks for analyzing nested or clustered data, where observations are nested under higher-level units (e.g., students within classrooms, counties within states, repeated (annual) measures within countries). Unlike ordinary least squares (OLS) regression, which assumes independence of errors, multilevel models account for dependence within groups by partitioning variance into components at each level.

They combine fixed effects (individual-level slopes and intercepts) with random effects (group-specific deviations).

Let’s walk through a simple example below.

Two-Level Model Example

Consider a two-level structure:

Level 1: Observations (e.g., units) Level 2: Groups (e.g., counties)

Let

\(Y_{ij}\): Outcome for unit \(i\) in group \(j\)
\(X_{ij}\): Predictor at level 1
\(Z_j\): Predictor at level 2 (group-level)

Level 1 (Within-Group) Equation: \[\begin{equation} Y_{ij} = \beta_{0j} + \beta_1 X_{ij} + e_{ij}, \end{equation}\]

where
\(\beta_{0j}\): Intercept for group \(j\) (varies across groups)
\(\beta_1\): Fixed slope for \(X\)
\(e_{ij} \sim \mathcal{N}(0, \sigma^2)\): Within-group error (residual at individual level)

Level 2 (Between-Group) Equations: Model the group-specific intercept (and possibly slope) as random: \[\begin{align} \beta_{0j} &= \gamma_{00} + \gamma_{01} Z_j + u_{0j} \nonumber\\ \beta_{1j} &= \gamma_{10} + u_{1j} \quad \text{(if slope is random)}, \end{align}\]

where
\(\gamma_{00}\): Overall intercept
\(\gamma_{01}\): Effect of group-level predictor \(Z_j\)
\(u_{0j} \sim \mathcal{N}(0, \tau_{00})\): Between-group random intercept
\(u_{1j} \sim \mathcal{N}(0, \tau_{11})\): Between-group random slope (if included)

Combined (Single) Equation Substitute level 2 into level 1: \[\begin{equation} Y_{ij} = \gamma_{00} + \gamma_{01} Z_j + \gamma_{10} X_{ij} + u_{0j} + u_{1j} + X_{ij} + e_{ij} \end{equation}\]

This is a mixed model or, more specifically, a random intercept model, consists the following components.

\(\textbf{Fixed effects}\): \(\gamma_{00}, \gamma_{01}, \gamma_{10}\)
\(\textbf{Random effects}\): \(u_{0j}, u_{1j} X_{ij}\) (group-specific deviations)
\(\textbf{Error term structure}\)
Level 1 (within-cluster): \(e_{ij} \sim \mathcal{N}(0, \sigma^2)\)
Level 2 (between-cluster): \(\mathbf{u}_j = (u_{0j}, u_{1j})^\top \sim \mathcal{N}(\mathbf{0}, \mathbf{T})\), where \(\mathbf{T}\) is the variance-covariance matrix of random effects.

Together, they form the total residual of \(Y_{ij}\)

\[\begin{equation} \varepsilon_{ij} = u_{0j} + u_{1j} X_{ij} + e_{ij} \end{equation}\]

This error has two sources:

\(\textbf{Within-group variance}\) (\(e_{ij}\)): differences between units in the same group
\(\textbf{Between-group variance}\) (\(u_{0j} + u_{1j} X_{ij}\)): differences due to group (county) membership

This implies two things. First, group-level effects (group means) vary across groups: some groups exhibit positive effects, some negative, some significantly different from zero, some insignificant Second, the variability in \(Y_{ij}\) comes not only from the correlation \(\textit{between}\) groups but also the correlation between units \(\textit{within}\) the same group (which persists even after controlling for group-level predictors).

How would these different levels of correlation affect the estimate of \(Y_{ij}\)? What if one group has too few observations, how would its slope (i.e., \(u_{0j}\)) behave?

Let’s walk through an example using Gelman (yes, that “Gelman” from our reading) and Hill’s (2007) radon data. Radon is a naturally occurring radioactive gas that can cause lung cancer, often accumulating in basements because it seeps up from the soil through cracks and openings in the foundation.

We will first estimate a random intercept model by county to decompose the variance in radon level into within and between parts. We then examine how group estimator (slope) would behave if a group contains too few observations.

### Required packages
library(lme4)       # For fitting mixed models
library(merTools)   # For visualization & simulation tools
library(ggplot2)    # For custom plotting
library(dplyr)      # For data manipulation
library(readr)      # Optional: for reading data


# Load radon data
# library(HLMdiag)  # Uncomment if needed
library(HLMdiag)
data(radon)

# Inspect data structure
dim(radon)
## [1] 919   5
str(radon)
## 'data.frame':    919 obs. of  5 variables:
##  $ log.radon  : num  0.788 0.788 1.065 0 1.131 ...
##  $ basement   : int  1 0 0 0 0 0 0 0 0 0 ...
##  $ uranium    : num  -0.689 -0.689 -0.689 -0.689 -0.847 ...
##  $ county     : int  1 1 1 1 2 2 2 2 2 2 ...
##  $ county.name: Factor w/ 85 levels "AITKIN","ANOKA",..: 1 1 1 1 2 2 2 2 2 2 ...
head(radon)
##   log.radon basement    uranium county county.name
## 1 0.7884574        1 -0.6890476      1      AITKIN
## 2 0.7884574        0 -0.6890476      1      AITKIN
## 3 1.0647107        0 -0.6890476      1      AITKIN
## 4 0.0000000        0 -0.6890476      1      AITKIN
## 5 1.1314021        0 -0.8473129      2       ANOKA
## 6 0.9162907        0 -0.8473129      2       ANOKA
# Key variables:
# log.radon: log(radon level)
# basement: 0 = basement, 1 = first floor
# county: county ID (factor with 85 levels (counties) in the state of Minnesota, it's also our group variable)
# uranium: county-level soil uranium (optional)


# 3. Fit Two-Level Random Intercept Model: note how group level variable is specified
radon.fit <- lmer(log.radon ~ basement + (1 | county), data = radon)

# Quick summary via merTools' fastdisp() function
fastdisp(radon.fit)
## lmer(formula = log.radon ~ basement + (1 | county), data = radon)
##             coef.est coef.se
## (Intercept)  1.46     0.05  
## basement    -0.69     0.07  
## 
## Error terms:
##  Groups   Name        Std.Dev.
##  county   (Intercept) 0.33    
##  Residual             0.76    
## ---
## number of obs: 919, groups: county, 85
## AIC = 2179.3
# Shows fixed effects and variance components


# Extract & Simulate Random Effects using merTools
set.seed(123)
re_sim <- REsim(radon.fit, n.sims = 1000)  # Simulate 1000 draws from posterior of BLUPs

head(re_sim)
##   groupFctr groupID        term         mean       median        sd
## 1    county       1 (Intercept) -0.267719734 -0.265107750 0.2412969
## 2    county       2 (Intercept) -0.532914280 -0.535319802 0.1069028
## 3    county       3 (Intercept)  0.018564385  0.022359278 0.2601276
## 4    county       4 (Intercept)  0.052474342  0.041860853 0.2250821
## 5    county       5 (Intercept) -0.026675709 -0.025365793 0.2536800
## 6    county       6 (Intercept)  0.008309301  0.001500993 0.2656055
# groupFctr: "county"
# groupID: county name/ID
# term: "(Intercept)"
# mean, median, sd: posterior summaries of random intercept


# Visualize County Random Effects (aka., Caterpillar Plot)
# Using merTools built-in plot
library(ggplot2)  # For theme_minimal() setting

p_caterpillar <- plotREsim(re_sim, level = 0.9) +  # 90% credible interval
  theme_minimal() +
  labs(
    title = "County-Specific Intercepts for Log(Radon) Levels",
    subtitle = "Random intercept model: log(radon) ~ floor + (1 | county)",
    x = "County",
    y = "Estimated Random Intercept (on log scale)"
  ) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Estimated county-specific radon level significantly different from 0 are marked in darker color.
print(p_caterpillar)

# Sort by effect size and highlight extremes
re_sim <- re_sim %>%
  arrange(mean)

p_sorted <- plotREsim(re_sim, level = 0.9, stat = "median") +
  labs(title = "Sorted County Effects on Log(Radon)",
       y = "Median Posterior Random Intercept",
       x = "County (ordered by effect)") +
  theme_minimal()

print(p_sorted)

Now that we know \(u_{0j}\) (group means) vary across counties because radon levels are potentially clustered (correlated) at county level albeit with different degree.

that pull them away from the grand mean (\(\gamma_{00}\), sample mean). This makes a lot of sense because by Tobler’s First Law of Geography, “everything is related to everything else, but near things are more related than distant things.” But what about those counties having too few observations to compute reliable estimate for \(u_{0j}\) (due to low variability in \(X_{ij}\))? How would their group means behave?

Well, let’s look at how group means and grand mean are related to group-level estimate:

\[\begin{equation} u_{0j} = w_{j}\bar{Y}_{j} + (1 - w_{j})\gamma_{00}, \end{equation}\]

where \(\bar{X}_{j}\) is the mean of all county \(j\)’s observations, \(\gamma_{00}\) is the sample mean (of the data), and the weight factor \(w_j\) is determined by

\[\begin{equation} w_j = \frac{n_j}{n_j + \frac{Var_{within}}{Var_{between}}}, \end{equation}\]

such that

\(Var_{within}\) \(>\) \(Var_{between}\): \(u_{0j}\) will receive less weight from {X}_{j} and therefore being pulled toward \(\gamma_{00}\)
\(Var_{within}\) \(<\) \(Var_{between}\): \(u_{0j}\) will receive more weight from {X}_{j} and therefore deviate from \(\gamma_{00}\)

In words, high variability at group level due to either unit heterogeneity or low observations (\(n_j\)) causes estimated group mean to tilt toward grand mean \(\gamma_{00}\). By contrast, low variability at group level due to clustering causes estimated group mean to deviate from grand mean \(\gamma_{00}\). So, to answer the question “What if one group has too few observations to make reliable estimate of \(Y_{j}\), how would its slope (i.e., \(u_{0j}\)) behave?” Well, the \(Y_j\) of groups having more observations are more reliable estimators for their own \(u_{0j}\). Yet for groups of the same size, groups exhibiting higher within-group variance will have their \(u_{0j}\) tilt toward the grand mean, meaning more pooling of information from \(\gamma_{00}\).

A relevant indicator, often reported in commercial statistical software, is the so-called intra-class correlation (ICC) bounded between 0 and 1:

And we can re-write the weight factor as

\[\begin{equation} w_j = \frac{n_j}{n_j + \frac{1 - \mbox{ICC}}{\mbox{ICC}}}, \end{equation}\]

ICC \(\rightarrow\) 1 means less pooling from the grand mean, whereas ICC \(\rightarrow\) 0 means more pooling from the grand mean.

Let’s use another example to illustrate the relationship between county-level sample size and the amount of pooling using the same data we analyzed earlier.

# Compute sample size per county
county_n <- radon %>%
  count(county, name = "n_obs") %>%
  mutate(county = as.character(county))

# Extract raw (unpooled) means and model-based (shrunken) means
raw_means <- radon %>%
  group_by(county) %>%
  summarise(raw_mean = mean(log.radon), .groups = "drop") %>%
  mutate(county = as.character(county))

# Model-based means = fixed effect (for basement = 0) + random intercept
fixed_int <- fixef(radon.fit)["(Intercept)"]  # Intercept for basement = 0
county_effects <- ranef(radon.fit)$county %>%
  tibble::rownames_to_column("county") %>%
  rename(shrunken_int = `(Intercept)`) %>%
  mutate(county = as.character(county),
         shrunken_mean = fixed_int + shrunken_int)

# Grand mean (population average for basement = 0)
grand_mean <- fixed_int


# Merge all

plot_data <- re_sim %>%
  filter(term == "(Intercept)") %>%
  select(county = groupID, median_shrunken = median) %>%
  left_join(raw_means, by = "county") %>%
  left_join(county_effects, by = "county") %>%
  left_join(county_n, by = "county") %>%
  mutate(
    # Categorize sample size
    size_cat = cut(n_obs,
                   breaks = c(0, 5, 10, 20, Inf),
                   labels = c("1–5", "6–10", "11–20", "21+"),
                   include.lowest = TRUE),
    # Distance from grand mean
    raw_dist = raw_mean - grand_mean,
    shrunk_dist = median_shrunken - grand_mean
  )


# Visualization: pooling by Sample Size
ggplot(plot_data, aes(x = raw_mean, y = median_shrunken)) +
  geom_point(aes(color = size_cat, size = n_obs), alpha = 0.8) +
  geom_abline(slope = 1, intercept = 0, linetype = "solid", color = "gray") +  # y = x
  geom_hline(yintercept = grand_mean, linetype = "dashed", color = "red") +
  geom_vline(xintercept = grand_mean, linetype = "dashed", color = "red") +
  scale_color_brewer(palette = "Set1", name = "Observations\nper County") +
  scale_size_continuous(name = "n") +
  labs(
    title = "Pooling in Multilevel Model: County Radon Levels",
    subtitle = "Small counties pulled toward grand mean (red lines)",
    x = "Raw County Mean (log radon)",
    y = "Pooled (Model-Based) Mean (log radon)"
  ) +
  theme_minimal(base_size = 13) +
  theme(legend.position = "right") +
  annotate("text", x = max(plot_data$raw_mean), y = grand_mean + 0.1,
           label = "Grand Mean", color = "red", hjust = 1)

We see that counties having fewer (more) observations tend to have their group means pulled toward (deviate from) the grand mean!