Clustered data are quite common in epidemiological studies. For example, in repeated measures, individuals are clustered with themselves. In school surveys, students are clustered within schools and classes. In crossover designs, individuals are clustered with themselves, and in group-based randomized trials, individuals are clustered within the group.
In this note, I will show different regression methods that can be used to analyze the cluster data by R, SAS and STATA.
library(haven)
my_data<-read_dta("https://dss.princeton.edu/training/schools.dta")
my_data
## # A tibble: 4,059 × 6
## school student y x1 x2 x3
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl+lbl>
## 1 1 1 2.61 6.19 1 1 [Mixed]
## 2 1 2 1.34 2.06 1 1 [Mixed]
## 3 1 3 -17.2 -13.6 0 1 [Mixed]
## 4 1 4 9.68 2.06 1 1 [Mixed]
## 5 1 5 5.44 3.71 1 1 [Mixed]
## 6 1 6 17.3 21.9 0 1 [Mixed]
## 7 1 7 10.4 -11.2 0 1 [Mixed]
## 8 1 8 -1.29 -10.3 0 1 [Mixed]
## 9 1 9 -9.39 -5.38 1 1 [Mixed]
## 10 1 10 -12.2 -14.5 0 1 [Mixed]
## # ℹ 4,049 more rows
This dataset comprises 4059 records and 6 variables. The “school” variable represents school IDs, the “student” variable represents student IDs, and the “y” variable represents a test score (the specific test type is unspecified). Additionally, “x1” corresponds to reading test scores, “x2” indicates gender where female=1 and male=0, and “x3” denotes types of schools, including “Mixed”, “Boys only,” and “Girls only.”
For this example, we might consider that students within a school could be correlated. Therefore, the assumption of independence and identically distributed (i.i.d.) errors in the regression model might be violated.
Several methods can be used to analyze the data:
Ignore clustering and assume that clustering is not an issue.
Create dummy variables for the schools; in total, there are 65 schools, so we would have to create 64 dummy variables.
Use cluster-robust standard errors.
Use the GEE models.
Use multilevel models or mixed-effects models.
We just ignore the cluster effect and regress x1,x2 and x3 on y
fit_no_cluster<-lm(y~x1+x2+as.factor(x3),data=my_data)
summary(fit_no_cluster)
##
## Call:
## lm(formula = y ~ x1 + x2 + as.factor(x3), data = my_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -26.1433 -5.1184 0.1606 5.3982 28.3118
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.60861 0.23952 -6.716 2.13e-11 ***
## x1 0.59105 0.01263 46.812 < 2e-16 ***
## x2 1.32641 0.34314 3.865 0.000113 ***
## as.factor(x3)2 1.82554 0.42568 4.289 1.84e-05 ***
## as.factor(x3)3 1.68011 0.32591 5.155 2.66e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.97 on 4054 degrees of freedom
## Multiple R-squared: 0.364, Adjusted R-squared: 0.3634
## F-statistic: 580.1 on 4 and 4054 DF, p-value: < 2.2e-16
proc import out= work.my_data
datafile= "C:\schools.dta"
dbms=stata replace;
run;
proc glm data = my_data;
class x3;
model y = x1 x2 x3/solution;
run;
Results
use "https://dss.princeton.edu/training/schools.dta", replace
glm y x1 x2 ib1.x3,family(gaussian)link(identity)
Results
We have the exactly the same results from the three software.
To create 64 dummy variables for schools, we will not discuss this method.
The cluster-robust standard method uses the “Huber Sandwich Estimator” or “Robust standard errors” to estimate the variance of the coefficients in regression models when the i.i.d assumption does not hold. When using this method in the regression model, the coefficients will be the same as in the i.i.d model; however, standard errors, confidence intervals, and p-values might change. This paperhas a very clear explanation of the Sandwich estimator.
library(clubSandwich)
fit_with_cluster <- coef_test(fit_no_cluster, vcov = "CR1", cluster = my_data$school)
fit_with_cluster
## Coef. Estimate SE t-stat d.f. (Satt) p-val (Satt) Sig.
## (Intercept) -1.609 0.4845 -3.32 25.6 0.0027 **
## x1 0.591 0.0229 25.86 51.9 <0.001 ***
## x2 1.326 0.3064 4.33 27.2 <0.001 ***
## as.factor(x3)2 1.826 0.8974 2.03 17.1 0.0578 .
## as.factor(x3)3 1.680 0.8793 1.91 39.9 0.0632 .
We can see the estimate are the exactly the same as the regression without consider the cluster effect, however the stand errors are changed (increased)
proc surveyreg data=my_data;
cluster school;
class x3;
model y = x1 x2 x3/solution;
run;
Results
We have the sampe results as in R.
use "https://dss.princeton.edu/training/schools.dta", replace
glm y x1 x2 ib1.x3,family(gaussian)link(identity) vce(cluster school)
Results
library(geepack)
gee_cluster <- geeglm(y~x1+x2+as.factor(x3), data = my_data, id = school, corstr = "exchangeable")
summary(gee_cluster)
##
## Call:
## geeglm(formula = y ~ x1 + x2 + as.factor(x3), data = my_data,
## id = school, corstr = "exchangeable")
##
## Coefficients:
## Estimate Std.err Wald Pr(>|W|)
## (Intercept) -1.66756 0.53783 9.613 0.00193 **
## x1 0.56110 0.01913 859.960 < 2e-16 ***
## x2 1.66333 0.28786 33.389 7.55e-09 ***
## as.factor(x3)2 1.77364 0.90668 3.827 0.05044 .
## as.factor(x3)3 1.58757 0.89649 3.136 0.07658 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation structure = exchangeable
## Estimated Scale Parameters:
##
## Estimate Std.err
## (Intercept) 63.56 2.484
## Link = identity
##
## Estimated Correlation Parameters:
## Estimate Std.err
## alpha 0.09927 0.02207
## Number of clusters: 65 Maximum cluster size: 198
We see both the coefficients and standard errors changed.
proc genmod;
class school x3;
model y = x1 x2 x3;
repeated subject=school / type=exch;
run;
/*The REPEATED statement specifies the covariance structure of multivariate responses for GEE model fitting in the GENMOD procedure. here we use Exchangeable in order to compare with R*/
/*SUBJECT: Identifies a different subject, or cluster*/
Results
use "https://dss.princeton.edu/training/schools.dta", replace
xtset school
xtgee y x1 x2 ib1.x3, family(gaussian) link(identity) corr(exchangeable)
Results
Multilevel models, also known as hierarchical linear models, linear mixed-effect models, mixed models, nested data models, random coefficient models, random-effects models, random parameter models, split-plot designs, and latent growth models, are all mathematically equivalent.
library(lmerTest)
mlm_fit<- lmer(y~x1+x2+as.factor(x3) + (1 | school) , data = my_data, REML = FALSE)
summary(mlm_fit)
## Linear mixed model fit by maximum likelihood . t-tests use Satterthwaite's
## method [lmerModLmerTest]
## Formula: y ~ x1 + x2 + as.factor(x3) + (1 | school)
## Data: my_data
##
## AIC BIC logLik deviance df.resid
## 28032 28076 -14009 28018 4052
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.721 -0.637 0.019 0.685 3.259
##
## Random effects:
## Groups Name Variance Std.Dev.
## school (Intercept) 8.11 2.85
## Residual 56.23 7.50
## Number of obs: 4059, groups: school, 65
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) -1.6815 0.5400 74.9491 -3.11 0.0026 **
## x1 0.5600 0.0124 4054.5066 45.00 < 2e-16 ***
## x2 1.6723 0.3408 4054.8418 4.91 9.6e-07 ***
## as.factor(x3)2 1.7762 1.1075 65.9656 1.60 0.1135
## as.factor(x3)3 1.5896 0.8725 65.6792 1.82 0.0730 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) x1 x2 a.(3)2
## x1 0.024
## x2 -0.306 -0.061
## as.fctr(3)2 -0.487 -0.003 0.149
## as.fctr(3)3 -0.499 0.009 -0.201 0.244
proc mixed data=my_data method = ML ;
class school x3;
model y = x1 x2 x3 / solution;
random school / subject=school type=cs;
run;
Results
use "https://dss.princeton.edu/training/schools.dta", replace
mixed y x1 x2 ib1.x3 || school:
We observed that all three software packages produced the same results.