Introduction

In this tutorial, I will introduce hierarchical linear modeling (HLM) and walk through some simple examples of how to model data using this technique.

What is hierarchical linear modeling? First, it’s important to understand that this is a technique with many names, so it’s possible you’ve heard of this technique being used in other contexts without even realizing it. Other common names include multilevel modeling, mixed modeling, mixed effect modeling, nested modeling, random effects modeling, random parameter modeling, and many more. Sometimes these terms are used in slightly different contexts, but suffice it to say that all these terms refer to essentially the same technique.

The primary goal of HLM is to account for situations in which data exist at multiple levels within a dataset. A prototypical example would be a dataset of students in various classes, where each class contains multiple students. We could say students are “nested” within class. Let’s say we wanted to compute some regression investigating learning achievement at the student level. It is likely that the class a student is in will have some effect on the outcome. If we fail to account for the class variable, we are not modeling our data as well as we could, and, more problematically, we are violating the assumption of independence of observations. Therefore, as responsible statisticians, we should find some way of accounting for class in our analysis.

How, then, should we account for class in our analysis? We’ll explore a few possible techniques.

Dealing with nested data: Aggregation

One possible solution for dealing with nested data is to aggregate within our level-2 variable. In our students-within-classes example, that would mean averaging the scores of all students who are in the same class.
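In code, aggregation is just a grouped mean. Here is a minimal sketch with dplyr, assuming a hypothetical data frame called students with columns class and score:

library(dplyr)

# Collapse student-level rows to one mean score per class
# (students, class, and score are hypothetical names):
class_means <- students %>%
  group_by(class) %>%
  summarise(mean_score = mean(score))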

This technique is simple and straightforward, but it has multiple (potentially large) limitations. The first problem is that we’re throwing away a lot of meaningful information and variance.

The second problem is that we’re cutting down on our sample size, perhaps dramatically. Imagine you had a dataset of 15 classes, each averaging 25 students. In this example, we’d have a sample size of 375 observations to work with. If we average within classes, we cut our sample down to just 15 observations (one per class). This is obviously problematic.

Lastly, there’s no guarantee that the relationship at level 2 is the same as at level 1. In fact, the relationships can be the opposite of one another, a phenomenon called Simpson’s Paradox. Let me illustrate with a super simple simulated example:

Simpson’s paradox example

Let’s say we have a dataset with 3 groups. Each group contains 3 observations of a variable x and a variable y. It looks like this:

group1 <- data.frame(x = 1:3, y = 4:6)
group2 <- data.frame(x = 2:4, y = 3:5)
group3 <- data.frame(x = 3:5, y = 2:4)

Now let’s compute the correlation between x and y within each group:

cor(group1$x, group1$y)
## [1] 1
cor(group2$x, group2$y)
## [1] 1
cor(group3$x, group3$y)
## [1] 1

The correlation in each group is a perfect correlation of +1.

Now let’s aggregate within each group:

x <- c(mean(group1$x), mean(group2$x), mean(group3$x))
y <- c(mean(group1$y), mean(group2$y), mean(group3$y))

And now let’s see how x and y correlate.

cor(x, y)
## [1] -1

Now they’re perfectly negatively correlated! So relying on a level 2 association to infer something about a level 1 relationship is obviously problematic.
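To see the paradox visually, we can plot all nine points with their within-group trend lines alongside a line through the group means. This sketch reuses the objects defined above, plus ggplot2 (which we’ll load again later as part of the tidyverse):

library(ggplot2)

all_groups <- rbind(cbind(group1, group = "1"),
                    cbind(group2, group = "2"),
                    cbind(group3, group = "3"))
means <- data.frame(x = x, y = y)  # the aggregated means from above

ggplot(all_groups, aes(x = x, y = y, color = group)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +      # within-group lines slope up
  geom_line(data = means, aes(x = x, y = y),    # line through the means slopes down
            color = "black", inherit.aes = FALSE)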

In summary, in most cases, it’s best to avoid aggregating within higher levels.

Dealing with nested data: Fixed effects

A preferable solution is to include our nesting variable as a fixed effect in our model. If you have ever modeled a factor variable in a linear regression, you’ve used this technique, likely without even realizing it. To include a grouping variable as a fixed effect, we dummy code the variable, and each level of the variable gets its own coefficient, representing an inflation or deflation of the intercept relative to a reference level.
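If you want to see the dummy coding itself, model.matrix() exposes the 0/1 indicator columns that R builds behind the scenes (using a toy factor here):

# The first level ("a") is absorbed into the intercept as the reference;
# "b" and "c" each get their own 0/1 indicator column.
f <- factor(c("a", "b", "b", "c"))
model.matrix(~ f)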

Let’s look at an example in R.

Fixed effect example in R

Let’s load the lme4 package into R. (This package will be very important when we get into random effects modeling, but for now we’re just using it for a sample dataset.) As always, be sure to install it first with install.packages("lme4") if it’s not already installed. We’ll also bring in the tidyverse for good measure.

library(lme4)
library(tidyverse)

We will be using an included dataset called “sleepstudy” for this simple example. This study recorded participants’ reaction times after varying numbers of days of sleep deprivation. So, in this case, observations (level 1) are nested within participants (level 2). (Note that in our earlier example, the participants, i.e. the students, were the level 1 variable, while here they’re level 2. That’s OK! Nesting is a domain-general concept and depends on the structure of the data.)

Let’s take a look at the structure of the data:

str(sleepstudy)
## 'data.frame':    180 obs. of  3 variables:
##  $ Reaction: num  250 259 251 321 357 ...
##  $ Days    : num  0 1 2 3 4 5 6 7 8 9 ...
##  $ Subject : Factor w/ 18 levels "308","309","310",..: 1 1 1 1 1 1 1 1 1 1 ...

So we see that this is a very simple dataset, in which the data are already structured how we want them. We have a numeric measure of reaction time, a numeric measure of days of sleep deprivation, and a factor variable of participants’ unique ID numbers.

We want to predict participants’ reaction time by the number of days of sleep deprivation. Let’s first model this without accounting for the nested structure of our data (IMPORTANT: This is not a good way of modeling the data, but we’re doing it for demonstration purposes):

model_no_nest <- lm(Reaction ~ Days, data = sleepstudy)

And let’s look at the output:

summary(model_no_nest)
## 
## Call:
## lm(formula = Reaction ~ Days, data = sleepstudy)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -110.848  -27.483    1.546   26.142  139.953 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  251.405      6.610  38.033  < 2e-16 ***
## Days          10.467      1.238   8.454 9.89e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 47.71 on 178 degrees of freedom
## Multiple R-squared:  0.2865, Adjusted R-squared:  0.2825 
## F-statistic: 71.46 on 1 and 178 DF,  p-value: 9.894e-15

So, as we’d expect, we see that reaction time significantly slows with increased sleep deprivation. However, our R-squared isn’t that great, at about .29. I bet we can improve that by accounting for the nested structure of our data.

Fortunately, we can account for nesting with a fixed effect quite easily in R, by just including our factorized Subject ID term:

model_fixed <- lm(Reaction ~ Days + Subject, data = sleepstudy)

And looking at the summary:

summary(model_fixed)
## 
## Call:
## lm(formula = Reaction ~ Days + Subject, data = sleepstudy)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -100.540  -16.389   -0.341   15.215  131.159 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  295.0310    10.4471  28.240  < 2e-16 ***
## Days          10.4673     0.8042  13.015  < 2e-16 ***
## Subject309  -126.9008    13.8597  -9.156 2.35e-16 ***
## Subject310  -111.1326    13.8597  -8.018 2.07e-13 ***
## Subject330   -38.9124    13.8597  -2.808 0.005609 ** 
## Subject331   -32.6978    13.8597  -2.359 0.019514 *  
## Subject332   -34.8318    13.8597  -2.513 0.012949 *  
## Subject333   -25.9755    13.8597  -1.874 0.062718 .  
## Subject334   -46.8318    13.8597  -3.379 0.000913 ***
## Subject335   -92.0638    13.8597  -6.643 4.51e-10 ***
## Subject337    33.5872    13.8597   2.423 0.016486 *  
## Subject349   -66.2994    13.8597  -4.784 3.87e-06 ***
## Subject350   -28.5311    13.8597  -2.059 0.041147 *  
## Subject351   -52.0361    13.8597  -3.754 0.000242 ***
## Subject352    -4.7123    13.8597  -0.340 0.734300    
## Subject369   -36.0992    13.8597  -2.605 0.010059 *  
## Subject370   -50.4321    13.8597  -3.639 0.000369 ***
## Subject371   -47.1498    13.8597  -3.402 0.000844 ***
## Subject372   -24.2477    13.8597  -1.750 0.082108 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 30.99 on 161 degrees of freedom
## Multiple R-squared:  0.7277, Adjusted R-squared:  0.6973 
## F-statistic: 23.91 on 18 and 161 DF,  p-value: < 2.2e-16

First, let’s note the R-squared. This model fits our data much, much better than did our model that did not account for nesting. When looking at the coefficients, we need to be careful about our interpretation. The intercept reflects the reaction time at 0 days of sleep deprivation for our reference subject. R automatically picked Subject 308 as our reference, but we could have manually specified a reference. The coefficient for each other subject represents an inflation or deflation of the intercept. So, for Subject 350, for example, we can estimate a predicted reaction time of roughly 295 - 28.5 + 10.5*(days of sleep deprivation). Note also that the slope (i.e. the effect of sleep deprivation) is specified to be constant for each participant. If we wanted to allow the slope to vary, we would need to specify interaction terms.
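As a quick sanity check on that arithmetic, we can ask predict() for the same quantity (picking an arbitrary 3 days here):

# Predicted reaction time for Subject 350 after 3 days of deprivation;
# this should match 295.0 - 28.5 + 10.5 * 3, or roughly 298 ms.
predict(model_fixed, newdata = data.frame(Days = 3, Subject = "350"))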

We can plot this model to see what this prediction actually looks like:

sleepstudy <- sleepstudy %>% 
  mutate(fixedprediction = predict(model_fixed)) #generate predicted outcome (y-hat)

ggplot(data = sleepstudy, aes(x = Days, y = Reaction, color = Subject)) +
  geom_point() +
  geom_line(aes(y = fixedprediction)) +
  ylim(c(150,500)) #Plot our predictions vs actual y

Here, we can clearly see that each participant has their own, independently estimated intercept, while all participants share the same slope.

Dealing with nested data: Random effects

Another good solution is to allow for random effects. This is the technique typically being referred to when one discusses “Hierarchical Linear Modeling” or “Multilevel Modeling.” In many ways, random effects modeling is similar to fixed effects modeling, but with some important differences. With this technique, rather than estimating a separate parameter for every level of the grouping variable, we assume the parameter varies across groups according to a distribution, and we estimate the parameters of that distribution.

To make this difference clearer, let’s return to our earlier example in R, but this time fit a model that specifies random effects.

Random effects in R

Before we get deep into this example, let’s load some additional packages we will use. lmerTest adds significance tests on top of base lme4. MuMIn provides, among other things, an R-squared metric for mixed models. And sjPlot builds a nice table visualization of model summary statistics.

library(lmerTest)
## 
## Attaching package: 'lmerTest'
## The following object is masked from 'package:lme4':
## 
##     lmer
## The following object is masked from 'package:stats':
## 
##     step
library(MuMIn)
library(sjPlot)
## Learn more about sjPlot with 'browseVignettes("sjPlot")'.

We will specify this model using the lmer() function from lme4. The notation is quite similar to lm() notation, but we need to include random effects syntax. This syntax takes the form (<effects> | <grouping variable>): the left side specifies which effects are allowed to vary (a 1 denotes the intercept; a variable name denotes a random slope), and the right side specifies the grouping variable across which they vary. For now, we will specify only a random intercept, like so:

model_rand_int <- lmer(Reaction ~ Days + (1 | Subject), data = sleepstudy)

To go through this command in detail, we are specifying a model where reaction time is predicted by days of sleep deprivation. We are specifying that the intercept will vary on a distribution by participant. However, we specify that slope will be constant across participants (much like we did with the fixed effects example).

Now let’s look at the output:

summary(model_rand_int)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: Reaction ~ Days + (1 | Subject)
##    Data: sleepstudy
## 
## REML criterion at convergence: 1786.5
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.2257 -0.5529  0.0109  0.5188  4.2506 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  Subject  (Intercept) 1378.2   37.12   
##  Residual              960.5   30.99   
## Number of obs: 180, groups:  Subject, 18
## 
## Fixed effects:
##             Estimate Std. Error       df t value Pr(>|t|)    
## (Intercept) 251.4051     9.7467  22.8102   25.79   <2e-16 ***
## Days         10.4673     0.8042 161.0000   13.02   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##      (Intr)
## Days -0.371

There’s a lot to take in here. Let’s start by looking at the section marked “Fixed effects”. This is akin to the “Coefficients” section of the lm() models. The first thing we can notice is that the intercept is very different here from our fixed effect model. Why is this? Recall that in the previous example, the intercept value represented the intercept for an arbitrary reference group. Here, the intercept is the sample-wide intercept. You can think of it as the average intercept. So our interpretation here is very different.

You should also notice that we do not have coefficients for each group here like we did in the fixed effects model. This is because we are not estimating a parameter for each group. Instead, we are estimating a variance parameter for the distribution. To see this variance parameter, we can look at the “Random effects” section. The variance of the intercept is 1378.2 and the standard deviation is 37.12 (the standard deviation is simply the square root of the variance). So our model estimates that the mean intercept is 251.4, with a standard deviation of 37.12.
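If you want the variance components on their own, without the rest of the summary, lme4’s VarCorr() extracts just that section:

# Just the random effects variances and standard deviations:
VarCorr(model_rand_int)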

While we don’t get fixed effect estimates for each group, we can see the random effects estimate for each group, like so:

ranef(model_rand_int)
## $Subject
##     (Intercept)
## 308   40.783710
## 309  -77.849554
## 310  -63.108567
## 330    4.406442
## 331   10.216189
## 332    8.221238
## 333   16.500494
## 334   -2.996981
## 335  -45.282127
## 337   72.182686
## 349  -21.196249
## 350   14.111363
## 351   -7.862221
## 352   36.378425
## 369    7.036381
## 370   -6.362703
## 371   -3.294273
## 372   18.115747
## 
## with conditional variances for "Subject"

To interpret this, we should compare each intercept coefficient to the sample-wide intercept (251.4). So Subject 308’s intercept, for example, is 251.4 + 40.8. Note, however, that this ranef() function does not give us any standard errors or p-values. That’s because we’re not estimating group-level parameters in the same way we were in fixed effects. If our primary interest is in estimating these parameters for a particular group or groups, we’re probably best served by a fixed effects model.
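Relatedly, if you’d rather not do that addition yourself, coef() returns each subject’s combined estimate (the fixed effect plus the random deviation):

# Each row is the fixed intercept plus that subject's random deviation;
# Subject 308's intercept shows up as roughly 251.4 + 40.8 = 292.2.
head(coef(model_rand_int)$Subject)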

Let’s look back at our earlier summary. One thing to note is that there is no R-squared value, which makes it difficult to assess the model fit. I will refrain from going too far into the weeds of discussing the math here, but computing an R-squared is not straightforward for HLM in the way it is for standard regression. However, we can get a metric of R-squared proposed by Nakagawa and colleagues (2017) by calling r.squaredGLMM() from the MuMIn package:

r.squaredGLMM(model_rand_int)
## Warning: 'r.squaredGLMM' now calculates a revised statistic. See the help page.
##            R2m       R2c
## [1,] 0.2798856 0.7042555

This returns two values. The first represents the variance explained only by the fixed effects (called the marginal r-squared). The second represents the variance explained by both the fixed and random effects (called the conditional r-squared).

We can also get this information and more in a pretty table by calling tab_model() from the sjPlot package.

tab_model(model_rand_int)
## Argument 'df_method' is deprecated. Please use 'ci_method' instead.
Dependent variable: Reaction

Predictors                     Estimates   CI                p
(Intercept)                    251.41      232.17 – 270.64   <0.001
Days                           10.47       8.88 – 12.05      <0.001

Random Effects
σ2                             960.46
τ00 Subject                    1378.18
ICC                            0.59
N Subject                      18

Observations                   180
Marginal R2 / Conditional R2   0.280 / 0.704

An additional metric provided here, which may be of some use, is the intraclass correlation coefficient (ICC). This value is simply the ratio of the variance of the random effect to the total variance (the variance of the random effect + the residual variance). If this value is high, this suggests that much of the variance is being explained by your grouping variable. If it is low, little of the variance is being explained by the grouping variable. Some will argue that if the ICC is sufficiently low, you need not account for the grouping at all, but be aware that this is controversial.
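As a quick check, we can reproduce the ICC by hand from the variance components in the earlier summary:

# ICC = intercept variance / (intercept variance + residual variance)
1378.18 / (1378.18 + 960.46)  # ~0.59, matching the table above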

Let’s now plot our model results.

sleepstudy <- sleepstudy %>% 
  mutate(randintpred = predict(model_rand_int)) #Get y-hats

ggplot(data = sleepstudy, aes(x = Days, y = Reaction, color = Subject)) +
  geom_point() +
  geom_line(aes(y = randintpred)) +
  ylim(c(150,500))

You should note that this plot does not look terribly different from the plot for our fixed effect model. This is typical. As long as each group is well-represented and assumptions are not violated, fixed effects models and random effects models tend to yield similar results. One slight difference here, however, is that the extreme intercept values are pulled slightly toward the mean in the random effects model. This is due to a property of random effects models called “partial pooling”: information about the other groups is partially used to inform the estimate for a given group. In a fixed effects model, by contrast, there is no pooling, so each intercept estimate is independent. A random effects model can thus suggest that extreme values may be overestimates or underestimates by bringing in information from the other groups.
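Here is a rough sketch of how you could see that shrinkage numerically, lining up the per-subject intercepts from the two models we fit above:

# Per-subject intercepts under each model. The reference subject (308)
# contributes a 0 dummy coefficient in the fixed effects model.
cf <- coef(model_fixed)
fixed_ints <- unname(cf["(Intercept)"] + c(0, cf[grep("^Subject", names(cf))]))
rand_ints  <- coef(model_rand_int)$Subject[["(Intercept)"]]

data.frame(Subject = rownames(coef(model_rand_int)$Subject),
           fixed   = fixed_ints,
           random  = rand_ints)
# Extreme subjects (e.g. 309 and 337) sit a bit closer to the grand mean
# (251.4) in the random column: partial pooling in action.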

When to use random effects versus fixed effects

At this stage, one may be wondering in what contexts to favor random effects models versus fixed effects models. This is a pretty big debate among statisticians, and I do not aim to resolve it here, but I can provide some basic guidance.

You should consider a fixed effects model if:

  • You are interested in a coefficient estimate for a specific group or groups, rather than aiming to know something about a wider population of groups
  • You have the entire possible population of groups represented in your sample. It doesn’t make sense to estimate a population-wide distribution (as you would with a random effects model) if you have the entire population.
  • Your groups do not represent a random sample of the entire possible population of groups.
  • Your group-level effect is correlated with other predictors in your model.
    • It is an assumption of random-effects models that the random effects are uncorrelated with specified fixed effects
    • There is a debate about how stringent this assumption needs to be
    • The “Hausman Test” is sometimes suggested as a way of testing whether the random or fixed effect model should be preferred (but this is somewhat controversial and beyond the scope of this tutorial)
  • There are too few groups to reliably estimate the variance at that level

You should consider a random effects model if:

  • You want to estimate group-level variance for the entire population of groups
  • The sample size within some groups is very small
    • Random effects modeling is useful here, thanks to its partial pooling property
  • Most other times, when the conditions for favoring a fixed effects model aren’t met
    • Random effects models are more efficient

Other Random Effects Models

Now that you understand random effects models and when to use them, let’s see what else we can do with them.

Random slope

In our previous random effects model we only specified a random intercept. However, we can easily specify a random slope as well:

model_rand_slope <- lmer(Reaction ~ Days + (Days | Subject), data = sleepstudy)

This syntax is very similar to our earlier syntax. We simply replace the 1 with Days on the left side of the | to specify that we want to allow the slope of Days to vary across subjects.

Let’s take a look at the output.

summary(model_rand_slope)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: Reaction ~ Days + (Days | Subject)
##    Data: sleepstudy
## 
## REML criterion at convergence: 1743.6
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.9536 -0.4634  0.0231  0.4634  5.1793 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr
##  Subject  (Intercept) 612.10   24.741       
##           Days         35.07    5.922   0.07
##  Residual             654.94   25.592       
## Number of obs: 180, groups:  Subject, 18
## 
## Fixed effects:
##             Estimate Std. Error      df t value Pr(>|t|)    
## (Intercept)  251.405      6.825  17.000  36.838  < 2e-16 ***
## Days          10.467      1.546  17.000   6.771 3.26e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##      (Intr)
## Days -0.138

Here, just like before, the “(Intercept)” term should be interpreted as the average intercept across the sample. Similarly, “Days” should be interpreted as the average slope of Days across the sample.

We can also see that we have a new random effect. The variance of the slope of Days is 35.07 (SD = 5.92).

Just like before, we can also see our random effects estimates, but now with an additional column.

ranef(model_rand_slope)
## $Subject
##     (Intercept)        Days
## 308   2.2585509   9.1989758
## 309 -40.3987381  -8.6196806
## 310 -38.9604090  -5.4488565
## 330  23.6906196  -4.8143503
## 331  22.2603126  -3.0699116
## 332   9.0395679  -0.2721770
## 333  16.8405086  -0.2236361
## 334  -7.2326151   1.0745816
## 335  -0.3336684 -10.7521652
## 337  34.8904868   8.6282652
## 349 -25.2102286   1.1734322
## 350 -13.0700342   6.6142178
## 351   4.5778642  -3.0152621
## 352  20.8636782   3.5360011
## 369   3.2754656   0.8722149
## 370 -25.6129993   4.8224850
## 371   0.8070461  -0.9881562
## 372  12.3145921   1.2840221
## 
## with conditional variances for "Subject"

And we can compute the R-squared:

r.squaredGLMM(model_rand_slope)
##            R2m       R2c
## [1,] 0.2786511 0.7992199

And finally plot the data:

sleepstudy <- sleepstudy %>% 
  mutate(randintslopepred = predict(model_rand_slope)) #New y-hats


ggplot(data = sleepstudy, aes(x = Days, y = Reaction, color = Subject)) +
  geom_point() +
  geom_line(aes(y = randintslopepred)) +
  ylim(c(150,500))

This model looks quite different from our previous ones because we allow the effect of sleep deprivation to vary across participants. By plotting the data, we can see that sleep deprivation had very little effect for some participants (and strangely, perhaps a positive effect for one), while it had very large effects for others.
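We can confirm this by pulling each subject’s overall slope (the fixed slope plus that subject’s random deviation) with coef():

# Per-subject slopes for Days; Subject 335's works out to roughly
# 10.47 - 10.75, i.e. slightly negative: the odd participant noted above.
coef(model_rand_slope)$Subject["Days"]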

More than two levels

In all our models so far, we’ve been looking at only two levels of the data (observations nested within participant). Let’s say, hypothetically, that each participant also attended one of several schools. We could then model our data across three levels (observations nested within participant, nested within school). Let’s simulate some data to look at this.

school <- factor(c(rep(1,60), rep(2,60), rep(3,60)))
sleepstudy <- sleepstudy %>% 
  mutate(School = school) %>% 
  select(School, Subject, Reaction, Days)
glimpse(sleepstudy)
## Rows: 180
## Columns: 4
## $ School   <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~
## $ Subject  <fct> 308, 308, 308, 308, 308, 308, 308, 308, 308, 308, 309, 309, 3~
## $ Reaction <dbl> 249.5600, 258.7047, 250.8006, 321.4398, 356.8519, 414.6901, 3~
## $ Days     <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0~

So now each participant is in one of three schools. (IMPORTANT NOTE: in a real-world example, 3 schools would not be enough to estimate variance at this level, but this is only for demonstration purposes.)

There are a few ways of modeling a nested relationship, but I prefer the explicit syntax, like so:

model_three_level <- lmer(Reaction ~ Days + (1 | School) + (1 | Subject:School), data = sleepstudy)
## boundary (singular) fit: see ?isSingular

So what does this notation mean? It means, first, that we are predicting reaction time from days of sleep deprivation. Second, we are allowing the intercept to vary across schools (but not the slope). And third, we are allowing the intercept to vary across subjects, which we are specifying to be nested within school.

We get a singularity warning here, suggesting the random effects structure is too complex for these data (unsurprising, given that we assigned schools arbitrarily, so there is essentially no school-level variance to estimate), but we can still see the syntax we would use.
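As an aside, lme4 also offers a / shorthand for nesting, which expands to the explicit syntax above:

# (1 | School/Subject) is shorthand for (1 | School) + (1 | School:Subject)
model_three_level_alt <- lmer(Reaction ~ Days + (1 | School/Subject), data = sleepstudy)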

We can also specify a random slope here. If we wanted a random slope of Days that varies by School, but not by Subject within School, we could specify it as follows:

model_three_level <- lmer(Reaction ~ Days + (Days | School) + (1 | Subject:School), data = sleepstudy)
## boundary (singular) fit: see ?isSingular

Conclusion

This tutorial covered a lot of ground. By this point, you should understand the basic problem of hierarchical data and why ignoring that hierarchy is problematic. You should further understand the fixed effects and random effects methods of dealing with this hierarchy, as well as when you might prefer one method over the other. Finally, you should have a practical understanding of how to model various random effects structures.

Good luck with your own modeling!