Specifying 2- and 3-level models

Author

Ty Partridge, Ph.D.

Introduction

Do I need to conduct an MLM analysis?

To decide this you need to look at two things: The sampling structure and the ICC.

Data Structure

Remember from the first lecture that MLM data structures will include variables that have different sample sizes because they represent 2 or more structural sampling units.

Lets say we have a study examining students sense of community and school connectedness, and our study includes students from 50 different high schools across the State. From each school we were able to survey a random sample of 100 students.

Our data file includes variables on adolescent attributes such as gender, age, race/ethnicity, perceived sense of connectedness to their community and to their school (e.g., “I feel like I am a valued member of my school”) but also has a number of variables that are attributes of the school: % of students receiving free and reduced lunch, student-to-teacher ratio, level of parent engagement, etc. With this data structure we will have different sample sizes for each set of variables:

  • Student Variables: 100 students from 50 high schools –> n=5,000

  • School Variables: n = 50

Thus, it looks like we will need a MLM analysis. But this raises a practical issue: How do we set the data up in the data file?

Traditionally we analyse data in a spreadsheet format like so:

Wide format data frame
Student ID Age Gender Connectedness Free/Reduced
001 x x x x
002 x x x x
003 x x x x

But this doesn’t work because it doesn’t account for school level effects and mis-attributes the percentage of students with free and reduced lunch to a student level variable. So the first thing we need to do is restructure the data from wide to long format:

Long format data frame
School ID Student ID Age Gender Feeling Connected to School Free Reduced Lunch
01 001 x x x x
01 002 x x x x
02 001 x x x x

Lets take a real example

Start with loading the packages we are going to need

options(repos = c(CRAN = "https://cran.rstudio.com/"))

install.packages("lme4")
Installing package into 'C:/Users/partr/AppData/Local/R/win-library/4.4'
(as 'lib' is unspecified)
package 'lme4' successfully unpacked and MD5 sums checked
Warning: cannot remove prior installation of package 'lme4'
Warning in file.copy(savedcopy, lib, recursive = TRUE): problem copying
C:\Users\partr\AppData\Local\R\win-library\4.4\00LOCK\lme4\libs\x64\lme4.dll to
C:\Users\partr\AppData\Local\R\win-library\4.4\lme4\libs\x64\lme4.dll:
Permission denied
Warning: restored 'lme4'

The downloaded binary packages are in
    C:\Users\partr\AppData\Local\Temp\Rtmpczpjwu\downloaded_packages
install.packages("lmerTest")
Installing package into 'C:/Users/partr/AppData/Local/R/win-library/4.4'
(as 'lib' is unspecified)
package 'lmerTest' successfully unpacked and MD5 sums checked

The downloaded binary packages are in
    C:\Users\partr\AppData\Local\Temp\Rtmpczpjwu\downloaded_packages
install.packages("dataMaid")
Installing package into 'C:/Users/partr/AppData/Local/R/win-library/4.4'
(as 'lib' is unspecified)
package 'dataMaid' successfully unpacked and MD5 sums checked

The downloaded binary packages are in
    C:\Users\partr\AppData\Local\Temp\Rtmpczpjwu\downloaded_packages
library(dataMaid)
library(lme4)
Loading required package: Matrix

Attaching package: 'lme4'
The following object is masked from 'package:dataMaid':

    isSingular
library(lmerTest)

Attaching package: 'lmerTest'
The following object is masked from 'package:lme4':

    lmer
The following object is masked from 'package:stats':

    step
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ tidyr::expand()    masks Matrix::expand()
✖ dplyr::filter()    masks stats::filter()
✖ dplyr::lag()       masks stats::lag()
✖ tidyr::pack()      masks Matrix::pack()
✖ dplyr::summarize() masks dataMaid::summarize()
✖ tidyr::unpack()    masks Matrix::unpack()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(haven)

Now we need to load a data file

MLM_2LV_Data <-read.csv("C:/Work Files/8190 MLM/Fall 2024/R Code and Examples/Data/mgrDistrust.csv")

Now we can take a look at the data to see how it is organized

makeCodebook(MLM_2LV_Data)
Data report generation is finished. Please wait while your output file is being rendered.

 Is codebook_MLM_2LV_Data.docx open on your computer? Please close it as fast as possible to avoid problems! 
head(MLM_2LV_Data, 10)
   Organization Participant age sex RelSalary Orgsize Distrust
1             1           1  52   0 0.7438186       1        2
2             1           2  20   0 0.7438186       1        1
3             1           3  44   1 0.7438186       1        1
4             1           4  43   0 0.7438186       1        2
5             1           5  68   0 0.7438186       1        3
6             1           6  50   0 0.7438186       1        1
7             1           7  47   0 0.7438186       1        2
8             1           8  25   0 0.7438186       1        1
9             1           9  36   0 0.7438186       1        1
10            1          10  24   0 0.7438186       1        1

In this data set we can see that there are 100 organizations with 10 respondents per organization. Thus,

Organization Variables: n=100

Employee Variables: n = 1,000

Clustering v. Nesting

This is a bit pedantic, but one can think of nesting as the data structure and clustering as the statistical violation of independence. Looking at our data we have established that it is nested, but is it clustered? To figure this out we need to take a look at the ICC.

The first step in calculating the ICC is to run an unconditional random intercept model:

Model Equations:

Level 1 (Within-Group Model)

\[ Y_{ij} = \beta_{0j} + \epsilon_{ij} \]

Level 2 (Between-Group Model)

\[ \beta_{0j} = \gamma_{00} + U_{0j} \]

Combined Model

\[ Y_{ij} = \gamma_{00} + U_{0j} + \epsilon_{ij} \]

Translating these equations to r:

Model_1 <- lmer(Distrust ~ + (1 | Organization), data = MLM_2LV_Data)
summary(Model_1)
Linear mixed model fit by REML. t-tests use Satterthwaite's method [
lmerModLmerTest]
Formula: Distrust ~ +(1 | Organization)
   Data: MLM_2LV_Data

REML criterion at convergence: 2239.7

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-1.7861 -0.7980 -0.3039  0.6345  2.3634 

Random effects:
 Groups       Name        Variance Std.Dev.
 Organization (Intercept) 0.1084   0.3292  
 Residual                 0.4873   0.6981  
Number of obs: 1000, groups:  Organization, 100

Fixed effects:
            Estimate Std. Error       df t value Pr(>|t|)    
(Intercept)  1.68400    0.03964 99.00000   42.48   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The ICC is the ratio of within-group variance to total variance and we can take those numbers directly from the output:

\[ ICC_{y} = \epsilon_{ij} / (\epsilon_{ij} + U_{0j}) \]

and in our output:

\[ \epsilon_{ij} = .4873 \] \[ U_{0j} = .1084 \]

ICC <-.4873 / (.4873 + .1084)
ICC
[1] 0.8180292

with an ICC = .82 we definitely need to use MLM analyses with this data.

Now, we can look at a model with a level-1 predictor.

Level-1 Model:

\[ Y_{ij} = \beta_{0j} + \beta_{1j}(X_{ij}) + \epsilon_{ij} \]

Level-2 Model:

\[ \beta_{0j} = \gamma_{00} + U_{0j} \]

\[ \beta_{1j} = \gamma_{10} + U_{1j} \]

Combined Model:

\[ Y_{ij} = \gamma_{00} + \gamma_{10}(X_{ij}) + U_{0j} + U_{1j} + \epsilon_{ij} \]

Model_2 <-lmer(Distrust ~ age + (1 + age |Organization), data = MLM_2LV_Data)
Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
Model failed to converge with max|grad| = 0.0601635 (tol = 0.002, component 1)
Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, : Model is nearly unidentifiable: very large eigenvalue
 - Rescale variables?
summary(Model_2)
Linear mixed model fit by REML. t-tests use Satterthwaite's method [
lmerModLmerTest]
Formula: Distrust ~ age + (1 + age | Organization)
   Data: MLM_2LV_Data

REML criterion at convergence: 2159.8

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-2.1766 -0.7007 -0.1910  0.6793  2.5072 

Random effects:
 Groups       Name        Variance  Std.Dev. Corr 
 Organization (Intercept) 0.3614853 0.60124       
              age         0.0001546 0.01243  -0.85
 Residual                 0.4179503 0.64649       
Number of obs: 1000, groups:  Organization, 100

Fixed effects:
             Estimate Std. Error        df t value Pr(>|t|)    
(Intercept)  1.083290   0.093047 94.467521  11.642  < 2e-16 ***
age          0.012798   0.001903 98.079178   6.727 1.16e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation of Fixed Effects:
    (Intr)
age -0.911
optimizer (nloptwrap) convergence code: 0 (OK)
Model failed to converge with max|grad| = 0.0601635 (tol = 0.002, component 1)
Model is nearly unidentifiable: very large eigenvalue
 - Rescale variables?

Now we can plot this model to examine the organization level influences. First we need to extract the fitted values and the random effects from the model

MLM_2LV_Data$fitted <- fitted(Model_2)
random_effects <- ranef(Model_2)$Organization

Then we can plot them with each org having its own graph.

ggplot(MLM_2LV_Data, aes(x = age, y = Distrust)) +
  geom_point(color = "blue", alpha = 0.5) +  # Observed data
  geom_line(aes(y = fitted), color = "red") +  # Fitted values
  facet_wrap(~ Organization) +  # Separate plots for each group
  labs(title = "Random Intercept and Slope Model",
       x = "Employee Age",
       y = "Distrust") +
  theme_minimal()

We can also look at org on one graph.

ggplot(MLM_2LV_Data, aes(x = age, y = Distrust, group = Organization, color = Organization)) +
  geom_point(alpha = 0.4) +  # Observed data
  geom_line(aes(y = fitted), linewidth = .5) +  # Fitted values
  labs(title = "Random Intercept and Slope Model",
       x = "Employee Age",
       y = "Distrust",
       color = "Organization") +
  theme_minimal()

This begs another question. Are there characteristics of the organization that can account for the variability in either the overall level of distrust in an organization or the relationship of age to distrust within an organization.

To answer this question we can include a level 2 predictor

Level-1 Model:

\[ Y_{ij} = \beta_{0j} + \beta_{1j}(X_{ij}) + \epsilon_{ij} \]

Level-2 Model:

\[ \beta_{0j} = \gamma_{00} + \gamma_{01}(W_{j}) + U_{0j} \]

\[ \beta_{1j} = \gamma_{10} + \gamma_{11}(W_{j}) + U_{1j} \]

Combined Model:

\[ Y_{ij} = \gamma_{00} + \gamma_{10}(X_{ij}) +\gamma_{01}(W_{j}) + \gamma_{11}(W_{j})+ U_{0j} + U_{1j} + \epsilon_{ij} \]

Model_3 <-lmer(Distrust ~ age + RelSalary + age:RelSalary + (1 + age | Organization),data = MLM_2LV_Data)
Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
Model failed to converge with max|grad| = 0.579786 (tol = 0.002, component 1)
Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, : Model is nearly unidentifiable: very large eigenvalue
 - Rescale variables?
summary(Model_3)
Linear mixed model fit by REML. t-tests use Satterthwaite's method [
lmerModLmerTest]
Formula: Distrust ~ age + RelSalary + age:RelSalary + (1 + age | Organization)
   Data: MLM_2LV_Data

REML criterion at convergence: 2149.5

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-2.1466 -0.6965 -0.1570  0.6727  2.5922 

Random effects:
 Groups       Name        Variance  Std.Dev. Corr 
 Organization (Intercept) 0.4056374 0.63690       
              age         0.0001979 0.01407  -0.91
 Residual                 0.4119980 0.64187       
Number of obs: 1000, groups:  Organization, 100

Fixed effects:
               Estimate Std. Error        df t value Pr(>|t|)    
(Intercept)    1.080976   0.094934 80.467811  11.387  < 2e-16 ***
age            0.012854   0.002004 84.400915   6.413 7.85e-09 ***
RelSalary     -0.268217   0.094417 79.899713  -2.841  0.00571 ** 
age:RelSalary  0.001979   0.002005 84.724569   0.987  0.32647    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation of Fixed Effects:
            (Intr) age    RlSlry
age         -0.933              
RelSalary    0.010 -0.012       
age:RelSlry -0.012  0.013 -0.933
optimizer (nloptwrap) convergence code: 0 (OK)
Model failed to converge with max|grad| = 0.579786 (tol = 0.002, component 1)
Model is nearly unidentifiable: very large eigenvalue
 - Rescale variables?

Now we can graph this model to aid in interpretation. Again first we need to extract the fitted values from the model.

MLM_2LV_Data$fitted <- fitted(Model_3)

Then we can generate a plot in ggplot.

ggplot(MLM_2LV_Data, aes(x = age, y = fitted, group = RelSalary, color = RelSalary)) +
  geom_point(alpha = 0.5) +  # Observed data
  geom_line(aes(y = fitted), linewidth = .5) +  # Fitted values
  labs(title = "Random Intercept and Slope Model with Level 2 Predictor",
       x = "Age",
       y = "Distrust",
       color = "RelSalary") +
  theme_minimal()

3-level models

To specify a 3-level model we just expand on the nested structure. In equation form it looks like the following:

Level-1 model

\[ Y_{ijk} = \beta_{0jk} + \beta_{1jk} + \epsilon_{ijk} \]

Level-2 model

\[ \beta_{0jk} = \gamma_{00k} +\gamma_{01k}(W_{jk}) + U_{0jk} \]

\[ \beta_{1jk} = \gamma_{10k} + \gamma_{11k}(W_{jk}) + U_{1jk} \]

Level-3 model

\[ \gamma_{00k} = \pi_{000} + \pi_{001}(Z_{k}) + R_{00k} \]

\[ \gamma_{01k} = \pi_{010} + \pi_{011}(Z_{k}) + R_{01k} \]

\[ \gamma_{10k} = \pi_{100} + \pi_{101}(Z_{k}) + R_{10k} \]

\[ \gamma_{11k} = \pi_{110} + \pi_{111}(Z_{k}) + R_{11k} \]

Combined model

\[ Y_{ijk} = \pi_{000} + \pi_{010} + \pi_{100} + \pi_{110} + \pi_{001} + \pi_{011} + \pi_{101} + \pi_{111} + R_{00k} + R_{01k} + R_{10k} + R_{11k} + U_{0jk} + U_{1jk} + epsilon_{ijk} \]

Data Example

Now we can look at an empirical example using an education data file that has students nested within classrooms, nested within schools:

Dataset Description

  • Level 1 (Students): Individual student data, including exam scores and other characteristics.

  • Level 2 (Classes): Class-level data, such as class size and teacher information.

  • Level 3 (Schools): School-level data, including school type and location.

MLM_3LV_Data <-read.csv("nested_education_data.csv")
Model_4 <- lmer(test_score ~ gender + (1 | school_id/class_id), data = MLM_3LV_Data)
boundary (singular) fit: see help('isSingular')
# Display model summary
summary(Model_4)
Linear mixed model fit by REML. t-tests use Satterthwaite's method [
lmerModLmerTest]
Formula: test_score ~ gender + (1 | school_id/class_id)
   Data: MLM_3LV_Data

REML criterion at convergence: 74296.4

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-4.0569 -0.6846 -0.0034  0.6846  3.8748 

Random effects:
 Groups             Name        Variance Std.Dev.
 class_id:school_id (Intercept)  0.00    0.000   
 school_id          (Intercept)  0.00    0.000   
 Residual                       98.65    9.932   
Number of obs: 10000, groups:  class_id:school_id, 400; school_id, 100

Fixed effects:
             Estimate Std. Error        df t value Pr(>|t|)    
(Intercept)   75.0328     0.1389 9998.0000 540.298   <2e-16 ***
genderMale    -0.2907     0.1987 9998.0000  -1.463    0.143    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation of Fixed Effects:
           (Intr)
genderMale -0.699
optimizer (nloptwrap) convergence code: 0 (OK)
boundary (singular) fit: see help('isSingular')