This document serves as a practical resource for conducting regression analysis with clustered data when the outcome variable is continuous. The objective is to demonstrate how to avoid fitting regression models that produce variance estimates (standard errors) that are unduly small, which can lead to false positives: incorrectly inferring that an observed effect is real.

The example data come from the “EdSurvey” package (Bailey et al., 2020) and represent artificial National Assessment of Educational Progress (NAEP) data. The outcome variable is a student-level measure of academic achievement (“score”). The predictor variables include three categorical student-level variables (“college”, “white”, “el”) and one continuous school-level variable (“prop_college”). The categorical student-level predictors are all dummy variables, respectively indicating whether the student’s parent(s) completed college, whether the student identifies as White, and whether the student has limited English proficiency. The school-level continuous predictor represents the proportion of students from each school (cluster) whose parent(s) completed college.

Load packages

In the next chunk of code, we load the set of R packages that will be used for analysis.

#By loading "tidyverse," a bunch of popular R packages are loaded-- including dplyr and ggplot2
if(!require(tidyverse)) install.packages("tidyverse", 
                                         repos = "http://cran.us.r-project.org")

#"clubSandwich" will be used for computing cluster-robust standard errors
if(!require(clubSandwich)) install.packages("clubSandwich", 
                                         repos = "http://cran.us.r-project.org")

#"survey" will be used for computing cluster-robust standard errors
if(!require(survey)) install.packages("survey", 
                                         repos = "http://cran.us.r-project.org")

#"lmerTest" will be used for fitting mixed-models (HLM/MLM). This package adds 
#   NHST functionality to the "lme4" package.
if(!require(lmerTest)) install.packages("lmerTest", 
                                         repos = "http://cran.us.r-project.org")

#"geepack" will be used for fitting Generalized Estimation Equations
if(!require(geepack)) install.packages("geepack", 
                                         repos = "http://cran.us.r-project.org")

#"MLmetrics" will be used comparing the the Root Mean Squared Error of models
if(!require(MLmetrics)) install.packages("MLmetrics", 
                                         repos = "http://cran.us.r-project.org")

Import Data

data_url <- "https://raw.githubusercontent.com/DavidBamat/EdSurvey_demo/master/artificial-NAEP-data.csv"

df <- read_csv(data_url)
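
Before modeling, a quick optional check (output omitted here) confirms the structure of the imported data and the number of clusters:

#optional: preview variable types and example values
glimpse(df)

#optional: count the number of schools (clusters)
n_distinct(df$school)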

Understanding the clustering effect

First, let’s get a sense of the “clustering effect” – the degree to which variation in the outcome exists between versus within schools (clusters).

#fit the unconditional model
unconditional <- lmer(score ~ 1 + (1|school), data = df)

#measure the proportion of between-cluster variance
unconditional_icc <- sjstats::icc(unconditional)
unconditional_icc[1] #displays the unconditional intraclass correlation
## $ICC_adjusted
## [1] 0.2463429

This means that about 25% of the overall variance in scores lies between schools. More importantly, it offers evidence that the observations are not independent, violating an important assumption of linear regression: students from the same school are, on average, more similar to one another on the outcome measure than they are to students from other schools.
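
For transparency, the same quantity can be computed by hand from the variance components of the unconditional model, using lme4’s VarCorr():

#extract the variance components of the unconditional model
vc <- as.data.frame(VarCorr(unconditional))
between_var <- vc$vcov[vc$grp == "school"]   #between-school variance
within_var  <- vc$vcov[vc$grp == "Residual"] #within-school (residual) variance

#ICC: proportion of total variance lying between schools
between_var / (between_var + within_var)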

Ignoring clustering

In this first model, for the sake of instruction, we pretend that clustering is not an issue, even though we know the between-school variance in the outcome is non-negligible.

ignore_fit <- lm(score ~ college + white + el + prop_college, data = df)
summary(ignore_fit)
## 
## Call:
## lm(formula = score ~ college + white + el + prop_college, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -143.724  -20.334    1.121   21.180  119.962 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  243.9290     0.6787  359.43   <2e-16 ***
## college       10.4032     0.5385   19.32   <2e-16 ***
## white         19.0216     0.5441   34.96   <2e-16 ***
## el           -14.6972     1.1438  -12.85   <2e-16 ***
## prop_college  37.6791     1.4801   25.46   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 31.58 on 16346 degrees of freedom
## Multiple R-squared:  0.2249, Adjusted R-squared:  0.2247 
## F-statistic:  1186 on 4 and 16346 DF,  p-value: < 2.2e-16

In subsequent models, in which clustering is accounted for, the standard errors can be expected to be larger.

Model school effects through dummy variables

In some situations, it may be appealing to account for clustering simply through fixed-effect dummy variables. In this example, such an approach is not feasible because of the sheer number of dummy variables that would need to be created.

length(table(df$school))
## [1] 650

We’d have to create 649 dummy variables, one fewer than the number of clusters (schools, in this example).
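
For illustration, the model would look like the following, letting R’s factor() generate the dummies automatically. Note that prop_college would have to be dropped: it is constant within each school and therefore perfectly collinear with the school dummies.

#for illustration only: one fixed-effect dummy per school via factor()
#prop_college is omitted because it is collinear with the school dummies
fe_fit <- lm(score ~ college + white + el + factor(school), data = df)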

Cluster-robust standard errors

In this section, I demonstrate two different techniques for calculating “cluster-robust” standard errors. The results will be equivalent (the same standard error estimates), but the approaches are distinct in nature. In the first approach, I pass the previously fit OLS model to the coef_test() function from the “clubSandwich” package (Pustejovsky, 2020), indicating how the observations (students) are clustered.

crse_fit <- coef_test(ignore_fit, vcov = "CR1", cluster = df$school)
crse_fit
##          Coef. Estimate   SE t-stat d.f. p-val (Satt) Sig.
## 1  (Intercept)    243.9 1.63 149.53  225       <0.001  ***
## 2      college     10.4 0.59  17.64  433       <0.001  ***
## 3        white     19.0 1.06  17.88  377       <0.001  ***
## 4           el    -14.7 1.76  -8.35  111       <0.001  ***
## 5 prop_college     37.7 2.83  13.30  269       <0.001  ***
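
If confidence intervals are wanted as well, the clubSandwich package provides a conf_int() function that accepts the same arguments (output omitted here):

#cluster-robust 95% confidence intervals for the OLS coefficients
conf_int(ignore_fit, vcov = "CR1", cluster = df$school)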

In the second approach, through the svydesign() function from the “survey” package (Lumley, 2020), I supply information about the survey design: how the survey was administered and the data were collected. For the example data, I specify that the observations come from sampled schools. In effect, the observations (students) are nested (clustered) in the sampled units (schools).

design <- svydesign(ids=~school, data=df) #specify design 
svy_mod <- svyglm(score ~ college + white + el + prop_college, data=df, design=design) #specify model
summary(svy_mod) #output results
## 
## Call:
## svyglm(formula = score ~ college + white + el + prop_college, 
##     design = design, data = df)
## 
## Survey design:
## svydesign(ids = ~school, data = df)
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  243.9290     1.6313 149.527  < 2e-16 ***
## college       10.4032     0.5898  17.637  < 2e-16 ***
## white         19.0216     1.0637  17.883  < 2e-16 ***
## el           -14.6972     1.7593  -8.354 4.04e-16 ***
## prop_college  37.6791     2.8328  13.301  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 997.0949)
## 
## Number of Fisher Scoring iterations: 2

The two approaches yield identical results. Relative to the OLS model that ignored clustering, the standard error estimates are appropriately larger, as tallied below; a programmatic side-by-side sketch follows the list.

  • college: 0.5 increased to 0.6.
  • white: 0.5 increased to 1.1.
  • el: 1.1 increased to 1.8.
  • prop_college: 1.5 increased to 2.8.
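
Here is the side-by-side mentioned above, a small sketch that assumes the coef_test() result exposes its standard errors in an SE column:

#collect OLS and cluster-robust standard errors in one table
se_compare <- data.frame(term   = names(coef(ignore_fit)),
                         ols_se = round(sqrt(diag(vcov(ignore_fit))), 2),
                         cr_se  = round(crse_fit$SE, 2))
se_compare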

A cluster-robust standard error approach is an appealing option if the clustering of the data is more of a nuisance to accommodate than of substantive research interest (McNeish, Stapleton and Silverman, 2017).

Generalized Estimating Equations

Using Generalized Estimating Equations (GEE), a commonly used technique in biostatistics, is another appealing option if the clustered data are more of a nuisance to accommodate than of substantive interest to the researcher (McNeish, Stapleton and Silverman, 2017). In the following chunk of code, the clusters (schools, in this example) are specified within the geeglm() function from the “geepack” package (Højsgaard, Halekoh & Yan, 2006).

gee_fit <- geeglm(score ~ college + white + el + prop_college, data = df, id = school, corstr = "exchangeable") #specify model and nature of clustering
gee_results <- summary(gee_fit) #save results
gee_results$coefficients
##               Estimate   Std.err       Wald Pr(>|W|)
## (Intercept)  246.00275 1.3767103 31929.6989        0
## college       10.41863 0.5890123   312.8760        0
## white         16.74044 0.7966856   441.5296        0
## el           -19.24168 1.5896574   146.5139        0
## prop_college  36.71788 2.7029768   184.5315        0

It should be noted that the standard error estimates from the GEE approach differ slightly from those calculated in the cluster-robust approaches (the coefficient estimates shift a bit as well, since GEE weights observations according to the working correlation structure). More importantly, the standard errors are substantially, and appropriately, larger than those from the OLS model that ignores clustering.
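
It can also be informative to inspect the estimated working correlation. The following is a sketch that assumes, as in recent geepack versions, that the summary object stores the estimate in its corr element:

#estimated exchangeable working correlation parameter (alpha)
gee_results$corr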

Multilevel model

The last option for handling clustered data outlined in this resource is the Multilevel Model (MLM). It should be noted that multilevel models go by many different names in the research literature; another commonly used name for this type of model is the “Hierarchical Linear Model” (HLM), a name largely made popular by the work of Raudenbush and Bryk (2002). These techniques (MLM/HLM, by whatever name) are the approach of choice for handling clustered data when cluster effects and between-cluster variance in the outcome are of substantive interest to the researcher. That said, MLMs/HLMs also duly account for the clustering of data, and they remain a fine option even if the researcher is principally interested in an analytic approach that simply adjusts standard error estimates upwards to account for clustering.

Unlike with the other approaches in this resource, for the MLM I demonstrate a model-building process. Its purpose is to determine which level-1 regression slopes should be permitted to vary across clusters (schools, in this example). Note that permitting slopes to vary by cluster is not a feature of either the cluster-robust approaches or GEE.

To begin the model-building process, I first examine whether the slope for “college” should be allowed to vary across schools. I fit the models with the lmer() function from the “lmerTest” package (Kuznetsova, Brockhoff & Christensen, 2017).

mlm_fit_ri <- lmer(score ~ college + white + el + (1 | school) + prop_college, data = df, REML = FALSE)


#Should the college slope be allowed to vary across schools?
#Note: anova() refits the models with ML before computing the test, so the
#   REML/ML mismatch between the two fits below is handled automatically.

mlm_fit_rs_v1 <- lmer(score ~ college + white + el + (1 | school) + (college | school) + prop_college, data = df)

anova(mlm_fit_ri, mlm_fit_rs_v1)
## Data: df
## Models:
## mlm_fit_ri: score ~ college + white + el + (1 | school) + prop_college
## mlm_fit_rs_v1: score ~ college + white + el + (1 | school) + (college | school) + 
## mlm_fit_rs_v1:     prop_college
##               npar    AIC    BIC logLik deviance  Chisq Df Pr(>Chisq)    
## mlm_fit_ri       7 158359 158413 -79172   158345                         
## mlm_fit_rs_v1   10 158334 158411 -79157   158314 31.147  3  7.916e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Yes, the slope for “college” should be allowed to vary. How about the slope for “white”?

mlm_fit_rs_v2 <- lmer(score ~ college + white + el + (1 | school) + (college | school) + (white | school) + prop_college, data = df)

anova(mlm_fit_rs_v1, mlm_fit_rs_v2)
## Data: df
## Models:
## mlm_fit_rs_v1: score ~ college + white + el + (1 | school) + (college | school) + 
## mlm_fit_rs_v1:     prop_college
## mlm_fit_rs_v2: score ~ college + white + el + (1 | school) + (college | school) + 
## mlm_fit_rs_v2:     (white | school) + prop_college
##               npar    AIC    BIC logLik deviance Chisq Df Pr(>Chisq)    
## mlm_fit_rs_v1   10 158334 158411 -79157   158314                        
## mlm_fit_rs_v2   13 158268 158368 -79121   158242 71.88  3  1.689e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Yes, the slope for “white” should be allowed to vary. How about the slope for “el”?

mlm_fit_rs_v3 <- lmer(score ~ college + white + el + (1 | school) + (college | school) + (white | school) + (el | school) + prop_college, data = df)

anova(mlm_fit_rs_v2, mlm_fit_rs_v3)
## Data: df
## Models:
## mlm_fit_rs_v2: score ~ college + white + el + (1 | school) + (college | school) + 
## mlm_fit_rs_v2:     (white | school) + prop_college
## mlm_fit_rs_v3: score ~ college + white + el + (1 | school) + (college | school) + 
## mlm_fit_rs_v3:     (white | school) + (el | school) + prop_college
##               npar    AIC    BIC logLik deviance  Chisq Df Pr(>Chisq)    
## mlm_fit_rs_v2   13 158268 158368 -79121   158242                         
## mlm_fit_rs_v3   16 158249 158373 -79109   158217 24.517  3  1.948e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Yes, the slope for “el” should be allowed to vary. Here, then, is the final model and its estimates.

summary(mlm_fit_rs_v3)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: score ~ college + white + el + (1 | school) + (college | school) +  
##     (white | school) + (el | school) + prop_college
##    Data: df
## 
## REML criterion at convergence: 158208.7
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -4.8693 -0.6234  0.0347  0.6596  3.6255 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  school   (Intercept)  34.793   5.899        
##  school.1 (Intercept)   6.068   2.463        
##           college      36.451   6.037   0.64 
##  school.2 (Intercept) 115.043  10.726        
##           white        94.533   9.723   -1.00
##  school.3 (Intercept)  18.447   4.295        
##           el          144.421  12.018   -0.84
##  Residual             869.553  29.488        
## Number of obs: 16351, groups:  school, 650
## 
## Fixed effects:
##              Estimate Std. Error       df t value Pr(>|t|)    
## (Intercept)  247.8121     1.2830 731.3508  193.15   <2e-16 ***
## college       10.4871     0.5749 560.4287   18.24   <2e-16 ***
## white         16.4132     0.7887 458.5805   20.81   <2e-16 ***
## el           -19.6015     1.5389 179.2833  -12.74   <2e-16 ***
## prop_college  33.3501     2.5287 673.2561   13.19   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr) colleg white  el    
## college      0.031                     
## white       -0.358 -0.037              
## el          -0.183  0.034  0.150       
## prop_colleg -0.792 -0.180 -0.137  0.054
## convergence code: 0
## unable to evaluate scaled gradient
## Model failed to converge: degenerate  Hessian with 2 negative eigenvalues

Before moving on, note the convergence warnings at the bottom of the output above. Including (1 | school) alongside (college | school), (white | school), and (el | school) duplicates the random intercept (each bar term contributes its own, as the four (Intercept) rows in the random-effects table show), which over-parameterizes the model; the estimates should therefore be interpreted with some caution.

It is also common, and sometimes appropriate, to grand- or group-mean center continuous predictor variables when using an MLM to analyze clustered data (see Raudenbush & Bryk (2002) for details). In the following section, I demonstrate how to grand-mean center continuous predictor variables. In this example, there is just one: the continuous school-level predictor representing the proportion of students whose parent(s) completed college.

#First, grand-mean center the prop_college variable

df <- df %>%
  mutate(prop_college_grd = prop_college - mean(prop_college))

mlm_fit_rs_v4 <- lmer(score ~ college + white + el + (1 | school) + (college | school) + (white | school) + (el | school) + prop_college_grd, data = df)

summary(mlm_fit_rs_v4)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: score ~ college + white + el + (1 | school) + (college | school) +  
##     (white | school) + (el | school) + prop_college_grd
##    Data: df
## 
## REML criterion at convergence: 158208.7
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -4.8693 -0.6234  0.0347  0.6596  3.6254 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  school   (Intercept)   6.367   2.523        
##  school.1 (Intercept)  37.495   6.123        
##           college      36.462   6.038   0.26 
##  school.2 (Intercept) 114.441  10.698        
##           white        94.525   9.722   -1.00
##  school.3 (Intercept)  16.038   4.005        
##           el          144.436  12.018   -0.90
##  Residual             869.553  29.488        
## Number of obs: 16351, groups:  school, 650
## 
## Fixed effects:
##                  Estimate Std. Error       df t value Pr(>|t|)    
## (Intercept)      262.6300     0.7904 544.3582  332.28   <2e-16 ***
## college           10.4871     0.5749 560.2588   18.24   <2e-16 ***
## white             16.4133     0.7887 458.5696   20.81   <2e-16 ***
## el               -19.6016     1.5390 179.2763  -12.74   <2e-16 ***
## prop_college_grd  33.3504     2.5287 673.2904   13.19   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr) colleg white  el    
## college     -0.205                     
## white       -0.776 -0.037              
## el          -0.220  0.034  0.150       
## prp_cllg_gr  0.136 -0.180 -0.137  0.054
## convergence code: 0
## unable to evaluate scaled gradient
## Model failed to converge: degenerate  Hessian with 1 negative eigenvalues
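
As an aside, group-mean centering (mentioned above) subtracts each cluster’s own mean rather than the overall mean. It is not demonstrated here because the only continuous predictor is measured at the school level and is constant within clusters. For a hypothetical student-level continuous predictor, say “ses” (which does not exist in these data), the code would look like this:

#hypothetical sketch only: "ses" is not a variable in df
#group-mean center a student-level predictor within schools
#df <- df %>%
#  group_by(school) %>%
#  mutate(ses_grp = ses - mean(ses)) %>%
#  ungroup()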

Compare fit of various models with Root Mean Squared Error

Ignoring clustering

RMSE(ignore_fit$fitted.values, df$score)
## [1] 31.57584
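
For reference, MLmetrics::RMSE() simply computes the square root of the mean squared difference between the fitted and observed values. Computed by hand for the OLS model, the result matches:

#RMSE by hand: square root of the mean squared residual
sqrt(mean((ignore_fit$fitted.values - df$score)^2))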

Cluster-robust standard errors

RMSE(svy_mod$fitted.values, df$score)
## [1] 31.57584

Generalized Estimating Equations

RMSE(gee_fit$fitted.values, df$score)
## [1] 31.60416

MLM without centering

RMSE(fitted(mlm_fit_rs_v3), df$score)
## [1] 28.84681

MLM with grand-mean centering

RMSE(fitted(mlm_fit_rs_v4), df$score)
## [1] 28.84678

It should be noted that, with this particular example data, the multilevel models outperform the other models with respect to model fit: how close the fitted (predicted) values come to the observed values. This result lends credence to the idea that modeling school-level random effects helps explain variation in the outcome. Keep in mind that the MLM fitted values incorporate the estimated random effects, which is precisely why they track the observed scores more closely.

References

Bailey, P., Emad, A., Huo, H., Lee, M., Liao, Y., Lishinski, A., Nguyen, T., Xie, Q., Yu, J., Zhang, T., Bundsgaard, J., & C’deBaca, R. (2020). EdSurvey: Analysis of NCES Education Survey and Assessment Data. R package version 2.5.0. https://CRAN.R-project.org/package=EdSurvey

Raudenbush, S. W., & Bryk, A. S. (2002). Hierarchical Linear Models: Applications and Data Analysis Methods (2nd ed.). Thousand Oaks, CA: Sage.

Højsgaard, S., Halekoh, U., & Yan, J. (2006). The R Package geepack for Generalized Estimating Equations. Journal of Statistical Software, 15(2), 1-11.

Kuznetsova, A., Brockhoff, P., & Christensen, R. (2017). lmerTest Package: Tests in Linear Mixed Effects Models. Journal of Statistical Software, 82(13), 1-26.

Lumley, T. (2020). survey: Analysis of Complex Survey Samples. R package. https://CRAN.R-project.org/package=survey

McNeish, D., Stapleton, L. M., & Silverman, R. D. (2017). On the unnecessary ubiquity of hierarchical linear modeling. Psychological Methods, 22(1), 114–140.

Pustejovsky, J. (2020). clubSandwich: Cluster-Robust (Sandwich) Variance Estimators with Small-Sample Corrections. R package version 0.5.0. https://CRAN.R-project.org/package=clubSandwich