This document serves as a practical resource for conducting regression analysis with clustered data when the outcome variable is continuous. The objective is to demonstrate how to avoid fitting regression models that produce variance estimates (standard errors) that are unduly small, which can lead to false positives: incorrectly inferring that an observed effect is real.
The example data come from the “EdSurvey” package (Bailey et al., 2020) and represent artificial National Assessment of Educational Progress (NAEP) data. The outcome variable is a student-level measure of academic achievement (“score”). The predictors include three categorical student-level variables (“college”, “white”, “el”) and one continuous school-level variable (“prop_college”). The categorical student-level predictors are all dummy variables, indicating whether the student’s parent(s) completed college, whether the student identifies as White, and whether the student has limited English proficiency, respectively. The continuous school-level predictor represents the proportion of students in each school (cluster) whose parent(s) completed college.
In the next chunk of code, we load the set of R packages that will be used for analysis.
#Loading "tidyverse" attaches a collection of popular R packages, including dplyr and ggplot2
if(!require(tidyverse)) install.packages("tidyverse",
repos = "http://cran.us.r-project.org")
#"clubSandwich" will be used for computing cluster-robust standard errors
if(!require(clubSandwich)) install.packages("clubSandwich",
repos = "http://cran.us.r-project.org")
#"survey" will be used for computing cluster-robust standard errors
if(!require(survey)) install.packages("survey",
repos = "http://cran.us.r-project.org")
#"lmerTest" will be used for fitting mixed-models (HLM/MLM). This package adds
# NHST functionality to the "lme4" package.
if(!require(lmerTest)) install.packages("lmertest",
repos = "http://cran.us.r-project.org")
#"geepack" will be used for fitting Generalized Estimation Equations
if(!require(geepack)) install.packages("geepack",
repos = "http://cran.us.r-project.org")
#"MLmetrics" will be used comparing the the Root Mean Squared Error of models
if(!require(MLmetrics)) install.packages("MLmetrics",
repos = "http://cran.us.r-project.org")
data_url <- "https://raw.githubusercontent.com/DavidBamat/EdSurvey_demo/master/artificial-NAEP-data.csv"
df <- read_csv(data_url)
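Before modeling, it helps to glance at the structure of the data (output not shown). The columns used throughout this resource are school, score, college, white, el, and prop_college.
#quick look at the variables and their types
glimpse(df)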
First, let’s get a sense of the “clustering effect” – the degree to which variation in the outcome exists between versus within schools (clusters).
#fit the unconditional model
unconditional <- lmer(score ~ 1 + (1|school), data = df)
#measure the proportion of total variance that lies between clusters (requires the "sjstats" package)
unconditional_icc <- sjstats::icc(unconditional)
unconditional_icc[1] #displays the unconditional intraclass correlation
## $ICC_adjusted
## [1] 0.2463429
This means that about 25% of the overall variance in scores is between schools. More importantly, this offers evidence that observations are not independent (an important assumption in linear regression). Students from the same school are, on average, more similar to one another on the outcome measure than they are to students from other schools.
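As a check on that figure, the ICC can also be computed by hand from the variance components of the unconditional model. A minimal sketch using lme4’s VarCorr():
#hand-compute the ICC: between-school variance over total variance
vc <- as.data.frame(VarCorr(unconditional))
tau00 <- vc$vcov[vc$grp == "school"] #between-school (intercept) variance
sigma2 <- vc$vcov[vc$grp == "Residual"] #within-school (residual) variance
tau00 / (tau00 + sigma2) #should match the adjusted ICC above (~0.246)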
For the sake of instruction, this first model pretends that clustering is not an issue, even though we now know that the between-school variance in the outcome is non-negligible.
ignore_fit <- lm(score ~ college + white + el + prop_college, data = df)
summary(ignore_fit)
##
## Call:
## lm(formula = score ~ college + white + el + prop_college, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -143.724 -20.334 1.121 21.180 119.962
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 243.9290 0.6787 359.43 <2e-16 ***
## college 10.4032 0.5385 19.32 <2e-16 ***
## white 19.0216 0.5441 34.96 <2e-16 ***
## el -14.6972 1.1438 -12.85 <2e-16 ***
## prop_college 37.6791 1.4801 25.46 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 31.58 on 16346 degrees of freedom
## Multiple R-squared: 0.2249, Adjusted R-squared: 0.2247
## F-statistic: 1186 on 4 and 16346 DF, p-value: < 2.2e-16
In subsequent models, in which clustering is accounted for, the standard errors can be expected to be larger.
In some situations, it may be appealing to account for clustering simply by including fixed-effect dummy variables for the clusters. In this example, that approach is impractical because of the sheer number of dummy variables that would need to be created.
length(table(df$school))
## [1] 650
We’d have to create 649 dummy variables, one fewer than the number of clusters (schools, in this example).
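For illustration only, the fixed-effects specification would look like the sketch below (fe_fit is a hypothetical object name). Note that prop_college would have to be dropped: it is constant within each school, making it perfectly collinear with the school dummies.
#illustrative only: factor(school) makes R generate the 649 school dummies;
#prop_college is omitted because it is collinear with those dummies
fe_fit <- lm(score ~ college + white + el + factor(school), data = df)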
In this section, I demonstrate two different techniques for calculating “cluster-robust” standard errors. The results will be equivalent (the same standard error estimates), but the nature of the two approaches is distinct. In the first approach, I take the previously fit OLS model and pass it to the coef_test() function from the “clubSandwich” package (Pustejovsky, 2020), indicating how the observations (students) were clustered.
crse_fit <- coef_test(ignore_fit, vcov = "CR1", cluster = df$school)
crse_fit
## Coef. Estimate SE t-stat d.f. p-val (Satt) Sig.
## 1 (Intercept) 243.9 1.63 149.53 225 <0.001 ***
## 2 college 10.4 0.59 17.64 433 <0.001 ***
## 3 white 19.0 1.06 17.88 377 <0.001 ***
## 4 el -14.7 1.76 -8.35 111 <0.001 ***
## 5 prop_college 37.7 2.83 13.30 269 <0.001 ***
In the second approach, through the svydesign() function from the “survey” package (Lumley, 2020), I supply information about the survey design: how the survey was administered and the data were collected. For the example data, I specify that the observations come from sampled schools. In effect, the observations (students) are nested (clustered) in the sampled units (schools).
design <- svydesign(ids = ~school, data = df) #specify the design: students clustered in schools
svy_mod <- svyglm(score ~ college + white + el + prop_college, data = df, design = design) #specify model
summary(svy_mod) #output results
##
## Call:
## svyglm(formula = score ~ college + white + el + prop_college,
## design = design, data = df)
##
## Survey design:
## svydesign(ids = ~school, data = df)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 243.9290 1.6313 149.527 < 2e-16 ***
## college 10.4032 0.5898 17.637 < 2e-16 ***
## white 19.0216 1.0637 17.883 < 2e-16 ***
## el -14.6972 1.7593 -8.354 4.04e-16 ***
## prop_college 37.6791 2.8328 13.301 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 997.0949)
##
## Number of Fisher Scoring iterations: 2
The two approaches produce identical results. Relative to the OLS model that ignored clustering, the coefficient estimates are unchanged, but the standard error estimates from the two cluster-robust approaches are appropriately larger.
A cluster-robust standard error approach is an appealing option if the clustering of the data is more of a nuisance to accommodate than of substantive research interest (McNeish, Stapleton, & Silverman, 2017).
A Generalized Estimating Equation (GEE), a technique commonly used in biostatistics, is another appealing option under the same circumstances (McNeish, Stapleton, & Silverman, 2017). In the following chunk of code, the clusters (schools, in this example) are specified through the id argument of the geeglm() function from the “geepack” package (Højsgaard, Halekoh, & Yan, 2006).
#specify the model and the clustering; corstr = "exchangeable" sets the working
#correlation structure. Note that geeglm() assumes rows are sorted by cluster id.
gee_fit <- geeglm(score ~ college + white + el + prop_college, data = df,
                  id = school, corstr = "exchangeable")
gee_results <- summary(gee_fit) #save results
gee_results$coefficients
## Estimate Std.err Wald Pr(>|W|)
## (Intercept) 246.00275 1.3767103 31929.6989 0
## college 10.41863 0.5890123 312.8760 0
## white 16.74044 0.7966856 441.5296 0
## el -19.24168 1.5896574 146.5139 0
## prop_college 36.71788 2.7029768 184.5315 0
It should be noted that the standard error estimates from the GEE approach are slightly different from those calculated in the cluster-robust approaches, and the coefficient estimates shift somewhat as well, because the exchangeable working correlation structure reweights observations within clusters. More importantly, the standard errors are substantially larger (and appropriately so) than those calculated in the OLS model that ignores clustering.
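To make the comparison concrete, here is a quick sketch that tabulates the standard errors from the three approaches fit so far:
#side-by-side standard errors: OLS (ignoring clustering), cluster-robust, GEE
data.frame(term = names(coef(ignore_fit)),
           se_ols = coef(summary(ignore_fit))[, "Std. Error"],
           se_crse = crse_fit$SE,
           se_gee = gee_results$coefficients[, "Std.err"])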
The last option for handling clustered data outlined in this resource is multilevel modeling (MLM). “Multilevel models” go by many different names in the research literature; another commonly used name is the “hierarchical linear model” (HLM), largely popularized by the work of Raudenbush and Bryk (2002). These techniques (MLM/HLM and their other names) are the approach of choice for handling clustered data when cluster effects and between-cluster variance in the outcome are of substantive interest to the researcher. That said, MLMs/HLMs also duly account for the clustering of the data, so they remain a fine option even for a researcher principally interested in an analytic approach that simply adjusts standard error estimates upward to account for clustering.
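For instance, when the school effects themselves are of interest, their estimates can be pulled directly from a fitted model. A minimal sketch using the unconditional model fit earlier:
#estimated school-specific intercept deviations from the overall mean
head(ranef(unconditional)$school)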
Unlike with the other approaches in this resource, I demonstrate a model-building process with the MLM approach. The purpose of this process is to determine which level-1 predictors should have regression slopes that are permitted to vary across clusters (schools, in this example). Note that permitting slopes to vary by cluster is not a feature of either the cluster-robust approaches or GEE.
To begin the model-building process, I first examine whether the slope for “college” should be allowed to vary across schools. I fit the models with the lmer() function from the “lmerTest” package (Kuznetsova, Brockhoff & Christensen, 2017).
mlm_fit_ri <- lmer(score ~ college + white + el + (1 | school) + prop_college, data = df, REML = FALSE)
#Should college slope be allowed to vary across schools?
mlm_fit_rs_v1 <- lmer(score ~ college + white + el + (1 | school) + (college | school) + prop_college, data = df)
anova(mlm_fit_ri, mlm_fit_rs_v1)
## Data: df
## Models:
## mlm_fit_ri: score ~ college + white + el + (1 | school) + prop_college
## mlm_fit_rs_v1: score ~ college + white + el + (1 | school) + (college | school) +
## mlm_fit_rs_v1: prop_college
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## mlm_fit_ri 7 158359 158413 -79172 158345
## mlm_fit_rs_v1 10 158334 158411 -79157 158314 31.147 3 7.916e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Yes, the slope for “college” should be allowed to vary. How about the slope for “white”?
mlm_fit_rs_v2 <- lmer(score ~ college + white + el + (1 | school) + (college | school) + (white | school) + prop_college, data = df)
anova(mlm_fit_rs_v1, mlm_fit_rs_v2)
## Data: df
## Models:
## mlm_fit_rs_v1: score ~ college + white + el + (1 | school) + (college | school) +
## mlm_fit_rs_v1: prop_college
## mlm_fit_rs_v2: score ~ college + white + el + (1 | school) + (college | school) +
## mlm_fit_rs_v2: (white | school) + prop_college
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## mlm_fit_rs_v1 10 158334 158411 -79157 158314
## mlm_fit_rs_v2 13 158268 158368 -79121 158242 71.88 3 1.689e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Yes, the slope for “white” should be allowed to vary. How about the slope for “el”?
mlm_fit_rs_v3 <- lmer(score ~ college + white + el + (1 | school) + (college | school) + (white | school) + (el | school) + prop_college, data = df)
anova(mlm_fit_rs_v2, mlm_fit_rs_v3)
## Data: df
## Models:
## mlm_fit_rs_v2: score ~ college + white + el + (1 | school) + (college | school) +
## mlm_fit_rs_v2: (white | school) + prop_college
## mlm_fit_rs_v3: score ~ college + white + el + (1 | school) + (college | school) +
## mlm_fit_rs_v3: (white | school) + (el | school) + prop_college
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## mlm_fit_rs_v2 13 158268 158368 -79121 158242
## mlm_fit_rs_v3 16 158249 158373 -79109 158217 24.517 3 1.948e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Yes, the slope for “el” should be allowed to vary. Thus, the final model and its estimates follow.
summary(mlm_fit_rs_v3)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: score ~ college + white + el + (1 | school) + (college | school) +
## (white | school) + (el | school) + prop_college
## Data: df
##
## REML criterion at convergence: 158208.7
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -4.8693 -0.6234 0.0347 0.6596 3.6255
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## school (Intercept) 34.793 5.899
## school.1 (Intercept) 6.068 2.463
## college 36.451 6.037 0.64
## school.2 (Intercept) 115.043 10.726
## white 94.533 9.723 -1.00
## school.3 (Intercept) 18.447 4.295
## el 144.421 12.018 -0.84
## Residual 869.553 29.488
## Number of obs: 16351, groups: school, 650
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 247.8121 1.2830 731.3508 193.15 <2e-16 ***
## college 10.4871 0.5749 560.4287 18.24 <2e-16 ***
## white 16.4132 0.7887 458.5805 20.81 <2e-16 ***
## el -19.6015 1.5389 179.2833 -12.74 <2e-16 ***
## prop_college 33.3501 2.5287 673.2561 13.19 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) colleg white el
## college 0.031
## white -0.358 -0.037
## el -0.183 0.034 0.150
## prop_colleg -0.792 -0.180 -0.137 0.054
## convergence code: 0
## unable to evaluate scaled gradient
## Model failed to converge: degenerate Hessian with 2 negative eigenvalues
Before moving on, it is worth flagging the convergence warnings in the output above. Random-effect correlations estimated at or near -1.00 and a degenerate Hessian suggest the random-effects structure may be overparameterized (the separate (1 | school) term duplicates the intercepts already included in each random-slope term), so the variance component estimates should be interpreted with caution.
It is also common and sometimes appropriate to grand- or group-mean center continuous predictor variables when using an MLM to analyze clustered data (see Raudenbush & Bryk (2002) for details). In the following section, I demonstrate how to grand-mean center continuous predictor variables. In this example, there is just one: the continuous school-level predictor representing the proportion of students whose parent(s) completed college.
#First, grand-mean center the prop_college variable
df <- df %>%
mutate(prop_college_grd = prop_college - mean(prop_college))
mlm_fit_rs_v4 <- lmer(score ~ college + white + el + (1 | school) + (college | school) + (white | school) + (el | school) + prop_college_grd, data = df)
summary(mlm_fit_rs_v4)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: score ~ college + white + el + (1 | school) + (college | school) +
## (white | school) + (el | school) + prop_college_grd
## Data: df
##
## REML criterion at convergence: 158208.7
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -4.8693 -0.6234 0.0347 0.6596 3.6254
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## school (Intercept) 6.367 2.523
## school.1 (Intercept) 37.495 6.123
## college 36.462 6.038 0.26
## school.2 (Intercept) 114.441 10.698
## white 94.525 9.722 -1.00
## school.3 (Intercept) 16.038 4.005
## el 144.436 12.018 -0.90
## Residual 869.553 29.488
## Number of obs: 16351, groups: school, 650
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 262.6300 0.7904 544.3582 332.28 <2e-16 ***
## college 10.4871 0.5749 560.2588 18.24 <2e-16 ***
## white 16.4133 0.7887 458.5696 20.81 <2e-16 ***
## el -19.6016 1.5390 179.2763 -12.74 <2e-16 ***
## prop_college_grd 33.3504 2.5287 673.2904 13.19 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) colleg white el
## college -0.205
## white -0.776 -0.037
## el -0.220 0.034 0.150
## prp_cllg_gr 0.136 -0.180 -0.137 0.054
## convergence code: 0
## unable to evaluate scaled gradient
## Model failed to converge: degenerate Hessian with 1 negative eigenvalues
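Group-mean centering, by contrast, centers each student’s value on their own school’s mean; it applies to level-1 predictors rather than to a school-level variable like prop_college. A minimal sketch (illustrative only, since the level-1 predictors in these data are dummy variables; college_grp is a hypothetical name):
#group-mean center a level-1 predictor within schools
df <- df %>%
  group_by(school) %>%
  mutate(college_grp = college - mean(college)) %>% #deviation from school mean
  ungroup()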
Finally, let’s compare how well each model fits the data, using the root mean squared error (RMSE) of the fitted values, computed with the RMSE() function from the “MLmetrics” package.
Ignoring clustering
RMSE(ignore_fit$fitted.values, df$score)
## [1] 31.57584
Cluster-robust standard errors
RMSE(svy_mod$fitted.values, df$score)
## [1] 31.57584
Generalized Estimating Equation
RMSE(gee_fit$fitted.values, df$score)
## [1] 31.60416
MLM without centering
RMSE(fitted(mlm_fit_rs_v3), df$score)
## [1] 28.84681
MLM with grand-mean centering
RMSE(fitted(mlm_fit_rs_v4), df$score)
## [1] 28.84678
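For reference, MLmetrics’s RMSE() amounts to the square root of the mean squared difference between fitted and observed values; computing it by hand for the final MLM returns the same figure:
#RMSE by hand for the grand-mean-centered MLM; matches the value above
sqrt(mean((fitted(mlm_fit_rs_v4) - df$score)^2))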
It should be noted that with this particular example data, the multilevel models outperform the other models with respect to model fit: how close the fitted (predicted) values come to the observed values. This result lends credence to the idea that modeling school-level random effects is helpful for explaining variation in the outcome.
Bailey, P., Emad, A., Huo, H., Lee, M., Liao, Y., Lishinski, A., Nguyen, T., Xie, Q., Yu, J., Zhang, T., Bundsgaard, J., & C’deBaca, R. (2020). EdSurvey: Analysis of NCES Education Survey and Assessment Data. R package version 2.5.0. https://CRAN.R-project.org/package=EdSurvey
Bryk, A. & Raudenbush, S. (2002). Hierarchical Linear Models. Sage: Thousand Oaks, CA.
Højsgaard, S., Halekoh, U. & Yan J. (2006) The R Package geepack for Generalized Estimating Equations. Journal of Statistical Software, 15(2), 1-11
Kuznetsova, A., Brockhoff, P., & Christensen, R. (2017). lmerTest Package: Tests in Linear Mixed Effects Models. Journal of Statistical Software, 82(13), 1-26.
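Lumley, T. (2020). survey: Analysis of Complex Survey Samples. R package. https://CRAN.R-project.org/package=survey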
McNeish, D., Stapleton, L. M., & Silverman, R. D. (2017). On the unnecessary ubiquity of hierarchical linear modeling. Psychological Methods, 22(1), 114–140.
Pustejovsky, J. (2020). clubSandwich: Cluster-Robust (Sandwich) Variance Estimators with Small-Sample Corrections. R package version 0.5.0. https://CRAN.R-project.org/package=clubSandwich