Clustered data are quite common in epidemiological studies. For example, in repeated measures, individuals are clustered with themselves. In school surveys, students are clustered within schools and classes. In crossover designs, individuals are clustered with themselves, and in group-based randomized trials, individuals are clustered within the group.

In this note, I will show different regression methods that can be used to analyze the cluster data by R, SAS and STATA.

library(haven)
my_data<-read_dta("https://dss.princeton.edu/training/schools.dta")
my_data
## # A tibble: 4,059 × 6
##    school student      y     x1    x2 x3       
##     <dbl>   <dbl>  <dbl>  <dbl> <dbl> <dbl+lbl>
##  1      1       1   2.61   6.19     1 1 [Mixed]
##  2      1       2   1.34   2.06     1 1 [Mixed]
##  3      1       3 -17.2  -13.6      0 1 [Mixed]
##  4      1       4   9.68   2.06     1 1 [Mixed]
##  5      1       5   5.44   3.71     1 1 [Mixed]
##  6      1       6  17.3   21.9      0 1 [Mixed]
##  7      1       7  10.4  -11.2      0 1 [Mixed]
##  8      1       8  -1.29 -10.3      0 1 [Mixed]
##  9      1       9  -9.39  -5.38     1 1 [Mixed]
## 10      1      10 -12.2  -14.5      0 1 [Mixed]
## # ℹ 4,049 more rows

This dataset comprises 4059 records and 6 variables. The “school” variable represents school IDs, the “student” variable represents student IDs, and the “y” variable represents a test score (the specific test type is unspecified). Additionally, “x1” corresponds to reading test scores, “x2” indicates gender where female=1 and male=0, and “x3” denotes types of schools, including “Mixed”, “Boys only,” and “Girls only.”

For this example, we might consider that students within a school could be correlated. Therefore, the assumption of independence and identically distributed (i.i.d.) errors in the regression model might be violated.

Several methods can be used to analyze the data:

  1. Ignore clustering and assume that clustering is not an issue.

  2. Create dummy variables for the schools; in total, there are 65 schools, so we would have to create 64 dummy variables.

  3. Use cluster-robust standard errors.

  4. Use the GEE models.

  5. Use multilevel models or mixed-effects models.

Method 1 ignore the cluster

We just ignore the cluster effect and regress x1,x2 and x3 on y

use R:

fit_no_cluster<-lm(y~x1+x2+as.factor(x3),data=my_data)
summary(fit_no_cluster)
## 
## Call:
## lm(formula = y ~ x1 + x2 + as.factor(x3), data = my_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -26.1433  -5.1184   0.1606   5.3982  28.3118 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -1.60861    0.23952  -6.716 2.13e-11 ***
## x1              0.59105    0.01263  46.812  < 2e-16 ***
## x2              1.32641    0.34314   3.865 0.000113 ***
## as.factor(x3)2  1.82554    0.42568   4.289 1.84e-05 ***
## as.factor(x3)3  1.68011    0.32591   5.155 2.66e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.97 on 4054 degrees of freedom
## Multiple R-squared:  0.364,  Adjusted R-squared:  0.3634 
## F-statistic: 580.1 on 4 and 4054 DF,  p-value: < 2.2e-16

use SAS:

proc import out= work.my_data 
            datafile= "C:\schools.dta"   
            dbms=stata replace;
run;
proc glm data = my_data;
class x3;
  model y = x1 x2 x3/solution;
run;

Results

use STATA:

use "https://dss.princeton.edu/training/schools.dta", replace
glm y x1 x2 ib1.x3,family(gaussian)link(identity)

Results

We have the exactly the same results from the three software.

Method 2

To create 64 dummy variables for schools, we will not discuss this method.

Method 3 use cluster-robust standard errors.

The cluster-robust standard method uses the “Huber Sandwich Estimator” or “Robust standard errors” to estimate the variance of the coefficients in regression models when the i.i.d assumption does not hold. When using this method in the regression model, the coefficients will be the same as in the i.i.d model; however, standard errors, confidence intervals, and p-values might change. This paperhas a very clear explanation of the Sandwich estimator.

Use R:

library(clubSandwich)
fit_with_cluster <- coef_test(fit_no_cluster, vcov = "CR1", cluster = my_data$school)
fit_with_cluster
##           Coef. Estimate     SE t-stat d.f. (Satt) p-val (Satt) Sig.
##     (Intercept)   -1.609 0.4845  -3.32        25.6       0.0027   **
##              x1    0.591 0.0229  25.86        51.9       <0.001  ***
##              x2    1.326 0.3064   4.33        27.2       <0.001  ***
##  as.factor(x3)2    1.826 0.8974   2.03        17.1       0.0578    .
##  as.factor(x3)3    1.680 0.8793   1.91        39.9       0.0632    .

We can see the estimate are the exactly the same as the regression without consider the cluster effect, however the stand errors are changed (increased)

Use SAS:

 proc surveyreg data=my_data;
 cluster school;
 class x3;
 model y = x1 x2 x3/solution;
 run;

Results

We have the sampe results as in R.

Use STATA

use "https://dss.princeton.edu/training/schools.dta", replace
glm y x1 x2 ib1.x3,family(gaussian)link(identity) vce(cluster school)

Results

Method 4 use GEE model

use R

library(geepack)
gee_cluster <- geeglm(y~x1+x2+as.factor(x3), data = my_data, id = school, corstr = "exchangeable")
summary(gee_cluster) 
## 
## Call:
## geeglm(formula = y ~ x1 + x2 + as.factor(x3), data = my_data, 
##     id = school, corstr = "exchangeable")
## 
##  Coefficients:
##                Estimate  Std.err    Wald Pr(>|W|)    
## (Intercept)    -1.66756  0.53783   9.613  0.00193 ** 
## x1              0.56110  0.01913 859.960  < 2e-16 ***
## x2              1.66333  0.28786  33.389 7.55e-09 ***
## as.factor(x3)2  1.77364  0.90668   3.827  0.05044 .  
## as.factor(x3)3  1.58757  0.89649   3.136  0.07658 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation structure = exchangeable 
## Estimated Scale Parameters:
## 
##             Estimate Std.err
## (Intercept)    63.56   2.484
##   Link = identity 
## 
## Estimated Correlation Parameters:
##       Estimate Std.err
## alpha  0.09927 0.02207
## Number of clusters:   65  Maximum cluster size: 198

We see both the coefficients and standard errors changed.

use SAS

proc genmod;
class school x3;
model y = x1 x2 x3;
repeated subject=school / type=exch; 
run;

/*The REPEATED statement specifies the covariance structure of multivariate responses for GEE model fitting in the GENMOD procedure. here we use Exchangeable in order to compare with R*/
/*SUBJECT: Identifies a different subject, or cluster*/

Results

use STATA

use "https://dss.princeton.edu/training/schools.dta", replace
xtset school
xtgee y x1 x2 ib1.x3, family(gaussian) link(identity) corr(exchangeable)

Results

Method 5 Multilevel models

Multilevel models, also known as hierarchical linear models, linear mixed-effect models, mixed models, nested data models, random coefficient models, random-effects models, random parameter models, split-plot designs, and latent growth models, are all mathematically equivalent.

Use R

library(lmerTest)
mlm_fit<- lmer(y~x1+x2+as.factor(x3) + (1 | school) , data = my_data, REML = FALSE)
summary(mlm_fit)
## Linear mixed model fit by maximum likelihood . t-tests use Satterthwaite's
##   method [lmerModLmerTest]
## Formula: y ~ x1 + x2 + as.factor(x3) + (1 | school)
##    Data: my_data
## 
##      AIC      BIC   logLik deviance df.resid 
##    28032    28076   -14009    28018     4052 
## 
## Scaled residuals: 
##    Min     1Q Median     3Q    Max 
## -3.721 -0.637  0.019  0.685  3.259 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  school   (Intercept)  8.11    2.85    
##  Residual             56.23    7.50    
## Number of obs: 4059, groups:  school, 65
## 
## Fixed effects:
##                 Estimate Std. Error        df t value Pr(>|t|)    
## (Intercept)      -1.6815     0.5400   74.9491   -3.11   0.0026 ** 
## x1                0.5600     0.0124 4054.5066   45.00  < 2e-16 ***
## x2                1.6723     0.3408 4054.8418    4.91  9.6e-07 ***
## as.factor(x3)2    1.7762     1.1075   65.9656    1.60   0.1135    
## as.factor(x3)3    1.5896     0.8725   65.6792    1.82   0.0730 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr) x1     x2     a.(3)2
## x1           0.024                     
## x2          -0.306 -0.061              
## as.fctr(3)2 -0.487 -0.003  0.149       
## as.fctr(3)3 -0.499  0.009 -0.201  0.244

Use SAS

 proc mixed data=my_data method = ML ;
 class school x3;
 model y = x1 x2 x3 / solution;
 random school / subject=school  type=cs;
 run;

Results

Use STATA

use "https://dss.princeton.edu/training/schools.dta", replace
mixed y x1 x2 ib1.x3 || school:

We observed that all three software packages produced the same results.