In this example, we will use Generalized Estimating Equations to do some longitudinal modeling of data from the ECLS-K 2011. Specifically, we will model changes in a student’s standardized math score as a continuous outcome and self-rated health as a binomial outcome, from fall of kindergarten to spring of 1st grade.

Introduction to GEE’s

Up until now, we have used (G)LMM’s to analyze data that were “clustered”:

Persons within neighborhoods
Survey data in general - stratified sampling

The next topic will introduce a modeling strategy that allows us to consider clustered data, but in a different fashion.

GLMM’s

GLMM’s are commonly referred to as conditional models, because the model coefficients “\(\beta\)’s” are conditional on the random effects in the model.

Likewise, the mean is conditional on the random effects. This is another way of saying that the mean for a given covariate pattern is conditional on the group that the particular person is in.

\(\mu_{ij}^c = E(Y_{ij} | u_j) = X_{ij}\beta + u_j\)

GLMM’s and GEE’s

In contrast, Generalized Estimating Equations are referred to as marginal models because they only estimate the overall mean.

\(\mu_{ij} = X_{ij}\beta\)

Lee and Nelder, 2004 provide a very good description of how these two methods compare to one another
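The difference between the conditional and marginal mean matters most with a non-identity link. The sketch below uses made-up values (the intercept `b0`, slope `b1` and random intercept standard deviation `sigma_u` are assumptions for illustration, not estimates from the ECLS-K). It averages a conditional logistic model over a normal random intercept and compares the resulting change in marginal log-odds between x = 0 and x = 1 with the conditional slope.

```r
# Hypothetical conditional (random intercept) logistic model:
# logit P(Y = 1 | u) = b0 + b1 * x + u,  u ~ N(0, sigma_u^2)
b0 <- -1
b1 <- 1
sigma_u <- 2

# Marginal probability at a given x: integrate the conditional
# probability over the distribution of the random intercept
marginal_p <- function(x) {
  integrate(function(u) plogis(b0 + b1 * x + u) * dnorm(u, sd = sigma_u),
            lower = -Inf, upper = Inf)$value
}

# Change in the marginal log-odds between x = 0 and x = 1
marginal_slope <- qlogis(marginal_p(1)) - qlogis(marginal_p(0))
conditional_slope <- b1

c(conditional = conditional_slope, marginal = marginal_slope)
```

In this setup the marginal effect is closer to zero than the conditional one, which is one reason coefficients from a GLMM and a GEE fit to the same binary outcome need not agree.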

Generalized Estimating Equations

Typically first attributed to Liang and Zeger, 1986
GEE’s are regression models
Interested in modeling the mean response, while treating correlation within person/cluster as a nuisance
NOT based on maximum likelihood
Does not need a fully specified joint distribution, only the marginal distribution (mean) of the outcome
Models can be for any distribution for the outcome

GEE’s

For longitudinal data, we assume we have \(y_{ij}\) as our outcome on person i at time j. This could just as easily be persons within other types of clusters, like counties or sampling units.
Also have \(X_{ij}\), the matrix of predictors
Specify the relationship between the mean of \(y_{ij}\) and \(X_{ij}\) via a link function, as in a GLM
Focus is on the linear predictor of the link function - the mean
NOT INTERESTED in variance components ONLY regression coefficients

GEE’s

Covariance structure
- We also may wish to model how observations are related to one another via some type of correlation structure between waves
- This directly implies that observations are NOT INDEPENDENT, and that’s fine
- Observations between clusters are independent
- Errors are correlated
- No assumption of common variance (homoskedasticity)

GEE’s - Model form

A basic form of the model would be:

\(Y_{ij} = \beta_0 + \sum_k X_{ijk} \beta_k + CORR + error\)

Ordinary models will tend to overestimate the standard errors of the \(\beta\)’s for time-varying predictors in a model with repeated observations, because these models do not account for the correlation among a cluster’s observations over time.

Likewise, the standard errors of time-invariant predictors will be underestimated.

GEE’s - Model estimation

Given the mean function for the model and a specified correlation function, the model parameters may be estimated by finding the solution for:

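\[U(\beta) = \sum_{i=1}^n \left(\frac{\partial \mu_i}{\partial \beta}\right)^T V_i^{-1} (Y_i - \mu_i(\beta)) = 0\]

where \(Y_i\) and \(\mu_i\) are the vectors of outcomes and means for cluster i, and \(V_i\) is the working covariance matrix for that cluster. Solving this equation gives estimates of the \(\beta\)’s for the linear mean function.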
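For the special case of an identity link, where \(\mu_i = X_i\beta\), the estimating equation can be solved in closed form for a given working covariance. Writing \(Y_i\) and \(X_i\) for the stacked outcomes and predictors of cluster i (a vector notation assumed here for compactness), a short derivation is:

\[\sum_{i=1}^n X_i^T V_i^{-1} (Y_i - X_i\beta) = 0 \quad \Rightarrow \quad \hat{\beta} = \left(\sum_{i=1}^n X_i^T V_i^{-1} X_i\right)^{-1} \sum_{i=1}^n X_i^T V_i^{-1} Y_i\]

With \(V_i = \sigma^2 I\) this reduces to the ordinary least squares estimator. For other link functions there is no closed form, and the equation is solved iteratively.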

GEE’s - Model estimation

First, a naive linear regression analysis is carried out, assuming the observations within subjects are independent.
Then, residuals are calculated from the naive model (observed-predicted) and a working correlation matrix is estimated from these residuals.
Then the regression coefficients are refit, correcting for the correlation. (Iterative process)
The within-subject correlation structure is treated as a nuisance parameter: it is estimated in order to improve the regression estimates, but it is not itself of interest
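The steps above can be sketched in base R for a Gaussian outcome with an exchangeable working correlation. This is a simplified illustration on simulated data, not the algorithm as implemented in any particular package. The sample sizes, true coefficients and the moment estimator for `rho` are assumptions made for this example.

```r
set.seed(1)
n   <- 200   # persons (hypothetical)
n_t <- 3     # repeated measures per person

# Simulated data: a shared person effect induces within-person correlation
x <- rnorm(n * n_t)
u <- rep(rnorm(n), each = n_t)
y <- 1 + 0.5 * x + u + rnorm(n * n_t)
X <- cbind(1, x)

# Step 1: naive fit assuming independent observations
beta <- solve(crossprod(X), crossprod(X, y))

for (iter in 1:25) {
  # Step 2: residuals and a moment estimate of the common correlation
  r   <- as.vector(y - X %*% beta)
  phi <- mean(r^2)
  R   <- matrix(r, nrow = n_t)                  # one column per person
  pair_sum <- sum((colSums(R)^2 - colSums(R^2)) / 2)
  rho <- pair_sum / (n * n_t * (n_t - 1) / 2) / phi

  # Step 3: refit the coefficients using the working correlation
  V <- matrix(rho, n_t, n_t)
  diag(V) <- 1
  Vinv <- solve(V)
  A <- matrix(0, 2, 2)
  b <- matrix(0, 2, 1)
  for (i in seq_len(n)) {
    rows <- ((i - 1) * n_t + 1):(i * n_t)
    Xi <- X[rows, , drop = FALSE]
    A <- A + t(Xi) %*% Vinv %*% Xi
    b <- b + t(Xi) %*% Vinv %*% y[rows]
  }
  beta_new <- solve(A, b)
  converged <- max(abs(beta_new - beta)) < 1e-8
  beta <- beta_new
  if (converged) break
}

beta   # estimated intercept and slope
rho    # estimated exchangeable correlation
```

The simulated data have a true within-person correlation of 0.5 and a true slope of 0.5, so the estimates can be checked against those values.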

GEE’s - Correlation Structure

For three time points per person, the ordinary regression model treats the covariance of the residuals within clusters/persons over time as the matrix:

\[\begin{bmatrix} \sigma^2 & 0 & 0 \\ 0 & \sigma^2 &0 \\ 0 & 0 & \sigma^2 \end{bmatrix}\]

which assumes the variances are constant and the residuals are independent over time.

GEE’s - Correlation Structure

But in a GEE, the model includes the actual correlation between measurements over time:

\[\begin{bmatrix} \sigma_1 ^2 & a & c \\ a & \sigma_2 ^2 &b \\ c & b & \sigma_3 ^2 \end{bmatrix}\]

Which allows the variances over time to be different, as well as correlations between times to be present.

GEE’s - Correlation Structure

Several types of correlation/covariance are commonly used in GEE’s
When we fit a GEE, we have to assume a certain type of correlation for the repeated measures. These are typically:
- Independence - same as OLS
- Exchangeable/compound symmetry (simplest)
- Autoregressive
- Unstructured (most complicated)

GEE’s - Correlation Structure - Independent

\[\begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 &0 \\ 0 & 0 & 1 \end{bmatrix}\]

GEE’s - Correlation Structure - Exchangeable

\[\begin{bmatrix} 1 & \rho & \rho \\ \rho & 1 &\rho \\ \rho &\rho & 1 \end{bmatrix}\]

GEE’s - Correlation Structure - AR(1)

\[\begin{bmatrix} 1 & \rho & \rho^2 \\ \rho & 1 &\rho\\ \rho^2 & \rho & 1 \end{bmatrix}\]

GEE’s - Correlation Structure - Unstructured

\[\begin{bmatrix} 1 & \rho_1 & \rho_2 \\ \rho_1 & 1 &\rho_3 \\ \rho_2 & \rho_3& 1 \end{bmatrix}\]
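To make these structures concrete, the following base R sketch builds the working correlation matrices for three time points. The value `rho = 0.8` is arbitrary and chosen only for illustration.

```r
# Working correlation matrices for n_t repeated measures
n_t <- 3
rho <- 0.8   # arbitrary illustrative value

# Independence: identity matrix
independence <- diag(n_t)

# Exchangeable: the same correlation for every pair of time points
exchangeable <- matrix(rho, n_t, n_t)
diag(exchangeable) <- 1

# AR(1): correlation decays as rho^|j - k|
lag <- abs(outer(seq_len(n_t), seq_len(n_t), "-"))
ar1 <- rho^lag

# Unstructured: one free correlation per pair of time points
n_unstructured_params <- n_t * (n_t - 1) / 2

exchangeable
ar1
n_unstructured_params
```

The exchangeable and AR(1) structures each use a single parameter regardless of the number of waves, while the number of unstructured parameters grows quickly as waves are added.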

library (car)

## Loading required package: carData

library(geepack)
library(MuMIn)  #need to install

## 
## Attaching package: 'MuMIn'

## The following object is masked from 'package:geepack':
## 
##     QIC

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following object is masked from 'package:car':
## 
##     recode

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Data and recodes

First we load our data

load("/media/corey/0E45-D54F/classes/dem7283/class_20_7283//data/eclsk_k5.Rdata")
names(eclskk5)<-tolower(names(eclskk5))
#get out only the variables I'm going to use for this example

#subset the data

eclsk.sub<-eclskk5%>%
  select(childid, x_chsex_r, x1locale, x_raceth_r, x2povty, x12par1ed_i, p1curmar, x1htotal, x1mscalk5, x2mscalk5, x3mscalk5, x4mscalk5, x5mscalk5, p1hscale, p2hscale, p4hscale, x2fsstat2, x4fsstat2, x4fsstat2, x12sesl, x4sesl_i, p2parct1, p2parct2, s1_id, p2safepl, x2krceth, p1o2near, x_distpov, w1c0, w1p0, w2p0, w1c0str, w1p0str, w4c4p_40, w4c4p_4str,w4c4p_4psu, w1c0psu, w1p0psu, x1height, x2height, x4height, x4height, x5height, x1kage_r, x2kage_r, x3age, x4age, x5age)

gc()

##             used   (Mb) gc trigger   (Mb)  max used   (Mb)
## Ncells   2041009  109.1    3432847  183.4   2652507  141.7
## Vcells 477046503 3639.6  590718114 4506.9 478520186 3650.9

#rm(eclsk11); gc()

Time constant variables

Next, I do some recoding of variables. First, we code time-invariant variables, meaning their values do not change across waves.

#Non time varying variables
#First we recode some Child characteristics
#Child's sex: recode as male =1
eclsk.sub$male<-Recode(eclsk.sub$x_chsex_r, recodes="1=1; 2=0; -9=NA")

#Recode race with white, non Hispanic as reference using dummy vars
eclsk.sub$hisp<-Recode (eclsk.sub$x_raceth_r, recodes="3:4=1;-9=NA; else=0")
eclsk.sub$black<-Recode (eclsk.sub$x_raceth_r, recodes="2=1;-9=NA; else=0")
eclsk.sub$asian<-Recode (eclsk.sub$x_raceth_r, recodes="5=1;-9=NA; else=0")
eclsk.sub$nahn<-Recode (eclsk.sub$x_raceth_r, recodes="6:7=1;-9=NA; else=0")
eclsk.sub$other<-Recode (eclsk.sub$x_raceth_r, recodes="8=1;-9=NA; else=0")


#Then we recode some parent/mother characteristics
#Mother's education, recode as 2 dummys with HS = reference
eclsk.sub$lths<-Recode(eclsk.sub$x12par1ed_i, recodes = "0:2=1; 3:8=0; else = NA")
eclsk.sub$gths<-Recode(eclsk.sub$x12par1ed_i, recodes = "1:3=0; 4:8=1; else =NA") 

#marital status, recode as 2 dummys, ref= married
#car::Recode ranges run from low to high, so the missing codes are written -9:-7
eclsk.sub$single<-Recode(eclsk.sub$p1curmar, recodes="4=1; -9:-7=NA; else=0")
eclsk.sub$notmar<-Recode(eclsk.sub$p1curmar, recodes="2:3=1; -9:-7=NA; else=0")


#Then we do some household level variables

#Urban school location = 1
eclsk.sub$urban<-Recode(eclsk.sub$x1locale, recodes = "1:3=1; 4=0; -9:-1=NA")

#poverty level in poverty = 1
eclsk.sub$pov<-Recode(eclsk.sub$x2povty , recodes ="1:2=1; 3=0; -9=NA")

#Household size
eclsk.sub$hhsize<-eclsk.sub$x1htotal

#school % minority student body
eclsk.sub$minorsch<-ifelse(eclsk.sub$x2krceth <0, NA, eclsk.sub$x2krceth/10)

#Unsafe neighborhood
eclsk.sub$unsafe<-Recode(eclsk.sub$p2safepl , recodes = "1:2='unsafe'; 3='safe'; else=NA",as.factor = T)

#school district poverty
eclsk.sub$dist_pov<-ifelse(eclsk.sub$x_distpov==-9, NA, scale(eclsk.sub$x_distpov))

Time varying variables

I have to make the repeated measures of each of my longitudinal variables. These are referred to as time varying variables, meaning their values change at each wave.

#Longitudinal variables
#recode our outcomes, the  first is the child's math standardized test score  in Kindergarten
eclsk.sub$math1<-ifelse(eclsk.sub$x1mscalk5<0, NA, eclsk.sub$x1mscalk5)
eclsk.sub$math2<-ifelse(eclsk.sub$x2mscalk5<0, NA, eclsk.sub$x2mscalk5)
#eclsk.sub$math3<-ifelse(eclsk.sub$x3mscalk1<0, NA, eclsk.sub$x3mscalk1)
eclsk.sub$math4<-ifelse(eclsk.sub$x4mscalk5<0, NA, eclsk.sub$x4mscalk5)

#Second outcome is child's height for age, continuous outcome
eclsk.sub$height1<-ifelse(eclsk.sub$x1height<=-7, NA, eclsk.sub$x1height)
eclsk.sub$height2<-ifelse(eclsk.sub$x2height<=-7, NA, eclsk.sub$x2height)
#eclsk.sub$height3<-ifelse(eclsk.sub$x3height<=-7, NA, eclsk.sub$x3height)
eclsk.sub$height4<-ifelse(eclsk.sub$x4height<=-7, NA, eclsk.sub$x4height)

#Age at each wave
eclsk.sub$age_yrs1<-ifelse(eclsk.sub$x1kage_r<0, NA, eclsk.sub$x1kage_r/12)
eclsk.sub$age_yrs2<-ifelse(eclsk.sub$x2kage_r<0, NA, eclsk.sub$x2kage_r/12)
#eclsk.sub$age_yrs3<-ifelse(eclsk.sub$x3age<0, NA, eclsk.sub$x3age/12)
eclsk.sub$age_yrs4<-ifelse(eclsk.sub$x4age<0, NA, eclsk.sub$x4age/12)

eclsk.sub<- eclsk.sub[is.na(eclsk.sub$age_yrs1)==F, ]

#Height for age z score standardized by sex and age
eclsk.sub$height_z1<-ave(eclsk.sub$height1, as.factor(paste(round(eclsk.sub$age_yrs1, 1.5), eclsk.sub$male)), FUN=scale)
eclsk.sub$height_z2<-ave(eclsk.sub$height2, as.factor(paste(round(eclsk.sub$age_yrs2, 1.5), eclsk.sub$male)), FUN=scale)
#eclsk.sub$height_z3<-ave(eclsk.sub$height3, as.factor(paste(round(eclsk.sub$age_yrs3, 1.5), eclsk.sub$male)), FUN=scale)
eclsk.sub$height_z4<-ave(eclsk.sub$height4, as.factor(paste(round(eclsk.sub$age_yrs4, 1.5), eclsk.sub$male)), FUN=scale)


#Household food insecurity, dichotomous outcome
#This outcome is only present at two waves
eclsk.sub$foodinsec1<-Recode(eclsk.sub$x2fsstat2, recodes="2:3=1; 1=0; else=NA")
eclsk.sub$foodinsec2<-Recode(eclsk.sub$x4fsstat2, recodes="2:3=1; 1=0; else=NA")


#Child health assessment Excellent to poor , ordinal outcome
eclsk.sub$chhealth1<-ifelse(eclsk.sub$p1hscale<0, NA, eclsk.sub$p1hscale)
eclsk.sub$chhealth2<-ifelse(eclsk.sub$p2hscale<0, NA, eclsk.sub$p2hscale)
eclsk.sub$chhealth4<-ifelse(eclsk.sub$p4hscale<0, NA, eclsk.sub$p4hscale)

#SES
eclsk.sub$hhses1<-ifelse(eclsk.sub$x12sesl==-9, NA, scale(eclsk.sub$x12sesl))
eclsk.sub$hhses2<-ifelse(eclsk.sub$x12sesl==-9, NA, scale(eclsk.sub$x12sesl))

eclsk.sub$hhses4<-ifelse(eclsk.sub$x4sesl_i==-9, NA, scale(eclsk.sub$x4sesl_i))

Reshaping data into longitudinal format

To analyze data longitudinally, we must reshape the data from its current “wide” format, where each repeated measure is a column, into the “long” format, where there is a single column for a particular variable, and we account for the repeated measurements of each person. In this case, I’m going to use three waves of data, so each child can contribute up to three lines to the data.
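As a small illustration of the two layouts, here is a toy example using the base R `reshape()` function. The data frame `wide` and its values are made up for this sketch and are not taken from the ECLS-K.

```r
# Toy wide data: one row per child, one column per wave (made-up values)
wide <- data.frame(childid = c("a", "b"),
                   math1 = c(10, 12),
                   math2 = c(15, 18),
                   math4 = c(30, 33))

# Long format: one row per child per wave
long <- reshape(wide, direction = "long", idvar = "childid",
                varying = c("math1", "math2", "math4"),
                v.names = "math", timevar = "wave", times = c(1, 2, 4))

# Sort so each child's waves are together
long <- long[order(long$childid, long$wave), ]
long
```

Each child contributes three rows to `long`, one per wave, with the wave number stored in its own column.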

The reshape() function will do this for us, but below I use a tidy method, using a combination of the data.table and dplyr packages. I first make a long data set of the height, age, math, child health and household SES measures, then I left join it to the time invariant variables I’ll use in my models below.

library(data.table)

## 
## Attaching package: 'data.table'

## The following objects are masked from 'package:dplyr':
## 
##     between, first, last

library(magrittr)
out<-melt(setDT(eclsk.sub), id = "childid",
          measure.vars = list(ht=c("height_z1","height_z2","height_z4"),
                              age=c("age_yrs1", "age_yrs2", "age_yrs4"), 
                              math=c("math1", "math2", "math4"),
                              hhses=c("hhses1", "hhses2", "hhses4"),
                              health=c("chhealth1", "chhealth2", "chhealth4")))%>%
  setorder(childid)
head(out, n=20)

e.long<-eclsk.sub%>%
  select(childid, hisp, black,asian, nahn, other,male, unsafe, s1_id, pov, hhsize, urban, w4c4p_40, w4c4p_4str, w4c4p_4psu)%>%
  left_join(., out, "childid")
e.long$wave<-rep(c(1,2,4), length(unique(e.long$childid)))
head(e.long)

e.long.comp<-e.long%>%
  filter(complete.cases(.))

Visualization of longitudinal data

library(ggplot2)

first10<-unique(e.long.comp$childid)[1:10]

sub<-e.long.comp%>%
  filter(childid%in%first10)

ggplot(sub, aes(x=age, y=math))+geom_point()+ geom_smooth(method='lm',formula=y~x)+facet_wrap(~childid,nrow = 3)+ggtitle(label = "Change in Math score across age", subtitle = "First 10 children in ECLS-K 2011")

## Warning in qt((1 - level)/2, df): NaNs produced

## Warning in qt((1 - level)/2, df): NaNs produced

## Warning in max(ids, na.rm = TRUE): no non-missing arguments to max; returning -
## Inf

## Warning in max(ids, na.rm = TRUE): no non-missing arguments to max; returning -
## Inf

Modeling

Longitudinal Models using GEE’s

Again, the GEE is used here instead of the (G)LMM.

#basic linear model
fit.1<-glm(scale(math)~scale(age)+male+black+hisp+asian+nahn+other+hhses, data=e.long.comp, weights=w4c4p_40/mean(w4c4p_40))
summary(fit.1)

## Warning in summary.glm(fit.1): observations with zero weight not used for
## calculating dispersion

## 
## Call:
## glm(formula = scale(math) ~ scale(age) + male + black + hisp + 
##     asian + nahn + other + hhses, data = e.long.comp, weights = w4c4p_40/mean(w4c4p_40))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -5.8553  -0.3705  -0.0056   0.3652   4.0851  
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.006315   0.007095  -0.890   0.3734    
## scale(age)   0.703235   0.004072 172.715   <2e-16 ***
## male        -0.002082   0.008179  -0.255   0.7991    
## black       -0.247903   0.013128 -18.883   <2e-16 ***
## hisp        -0.125094   0.010855 -11.524   <2e-16 ***
## asian        0.189993   0.022570   8.418   <2e-16 ***
## nahn        -0.073914   0.030768  -2.402   0.0163 *  
## other        0.010582   0.020473   0.517   0.6052    
## hhses        0.278591   0.005207  53.502   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.4117086)
## 
##     Null deviance: 24295  on 24663  degrees of freedom
## Residual deviance: 10151  on 24655  degrees of freedom
## AIC: Inf
## 
## Number of Fisher Scoring iterations: 2

#Get residuals and put them in a data frame
e.long.comp$resid<- residuals(fit.1)

test<-reshape(e.long.comp, idvar = "childid", v.names = "resid", timevar = "wave", direction = "wide", drop = names(e.long.comp)[2:21])
head(test[order(test$childid),], n=12)

Here is our actual correlation matrix in the residuals between waves:

cor(test[,-1], use="pairwise.complete")

##           resid.1   resid.2   resid.4
## resid.1 1.0000000 0.8133513 0.7202477
## resid.2 0.8133513 1.0000000 0.7994813
## resid.4 0.7202477 0.7994813 1.0000000

This is certainly not independence, and looks more like an AR(1), because the correlation decreases as the difference between wave number increases.

Now we fit the GEE:

Model with independent correlation

Meaning ZERO correlation between waves

fit.1<-geeglm(scale(math)~scale(age)+male+black+hisp+asian+nahn+other+hhses,id=childid , waves = wave, corstr ="independence",   data=e.long.comp, weights=w4c4p_40/mean(w4c4p_40))
summary(fit.1)

## Warning in summary.glm(object): observations with zero weight not used for
## calculating dispersion

## 
## Call:
## geeglm(formula = scale(math) ~ scale(age) + male + black + hisp + 
##     asian + nahn + other + hhses, data = e.long.comp, weights = w4c4p_40/mean(w4c4p_40), 
##     id = childid, waves = wave, corstr = "independence")
## 
##  Coefficients:
##              Estimate   Std.err      Wald Pr(>|W|)    
## (Intercept) -0.006315  0.012630     0.250    0.617    
## scale(age)   0.703235  0.005598 15780.424  < 2e-16 ***
## male        -0.002082  0.014976     0.019    0.889    
## black       -0.247903  0.024780   100.086  < 2e-16 ***
## hisp        -0.125094  0.020876    35.908 2.07e-09 ***
## asian        0.189993  0.031600    36.148 1.83e-09 ***
## nahn        -0.073914  0.072818     1.030    0.310    
## other        0.010582  0.037836     0.078    0.780    
## hhses        0.278591  0.009536   853.570  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation structure = independence 
## Estimated Scale Parameters:
## 
##             Estimate  Std.err
## (Intercept)   0.4108 0.007586
## Number of clusters:   8432  Maximum cluster size: 3

Model with Exchangeable correlation

Meaning correlation between waves, but the correlation is the same for each pair of waves

fit.2<-geeglm(scale(math)~scale(age)+male+black+hisp+asian+nahn+other+hhses,id=childid , waves = wave, corstr ="exchangeable",  data=e.long.comp, weights=w4c4p_40/mean(w4c4p_40))
  
summary(fit.2)

## Warning in summary.glm(object): observations with zero weight not used for
## calculating dispersion

## 
## Call:
## geeglm(formula = scale(math) ~ scale(age) + male + black + hisp + 
##     asian + nahn + other + hhses, data = e.long.comp, weights = w4c4p_40/mean(w4c4p_40), 
##     id = childid, waves = wave, corstr = "exchangeable")
## 
##  Coefficients:
##              Estimate   Std.err     Wald Pr(>|W|)    
## (Intercept)  0.000944  0.013311     0.01     0.94    
## scale(age)   0.838304  0.003065 74810.82  < 2e-16 ***
## male        -0.013915  0.016067     0.75     0.39    
## black       -0.265379  0.026528   100.08  < 2e-16 ***
## hisp        -0.137488  0.021914    39.36  3.5e-10 ***
## asian        0.209346  0.038652    29.34  6.1e-08 ***
## nahn        -0.106499  0.081135     1.72     0.19    
## other        0.016580  0.040126     0.17     0.68    
## hhses        0.230921  0.010227   509.81  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation structure = exchangeable 
## Estimated Scale Parameters:
## 
##             Estimate Std.err
## (Intercept)    0.431 0.00841
##   Link = identity 
## 
## Estimated Correlation Parameters:
##       Estimate Std.err
## alpha    0.803  0.0214
## Number of clusters:   8432  Maximum cluster size: 3

The second model shows the exchangeable correlation to be 0.803, which is not very different from our measured correlations above

	resid.1	resid.2	resid.4
resid.1	1.000	0.813	0.720
resid.2	0.813	1.000	0.799
resid.4	0.720	0.799	1.000

Now we examine the AR(1) correlation structure:

fit.3<-geeglm(scale(math)~scale(age)+male+black+hisp+asian+nahn+other+hhses, id=childid , waves = wave, corstr ="ar1",  data=e.long.comp, weights=w4c4p_40/mean(w4c4p_40))
  
summary(fit.3)

## Warning in summary.glm(object): observations with zero weight not used for
## calculating dispersion

## 
## Call:
## geeglm(formula = scale(math) ~ scale(age) + male + black + hisp + 
##     asian + nahn + other + hhses, data = e.long.comp, weights = w4c4p_40/mean(w4c4p_40), 
##     id = childid, waves = wave, corstr = "ar1")
## 
##  Coefficients:
##             Estimate  Std.err     Wald Pr(>|W|)    
## (Intercept) -0.02135  0.01318     2.63     0.11    
## scale(age)   0.81941  0.00317 66897.09  < 2e-16 ***
## male        -0.00263  0.01598     0.03     0.87    
## black       -0.26657  0.02612   104.18  < 2e-16 ***
## hisp        -0.14765  0.02188    45.54  1.5e-11 ***
## asian        0.21710  0.03793    32.77  1.0e-08 ***
## nahn        -0.11135  0.07956     1.96     0.16    
## other        0.01462  0.04035     0.13     0.72    
## hhses        0.22996  0.00990   539.59  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation structure = ar1 
## Estimated Scale Parameters:
## 
##             Estimate Std.err
## (Intercept)    0.427 0.00826
##   Link = identity 
## 
## Estimated Correlation Parameters:
##       Estimate Std.err
## alpha    0.845  0.0163
## Number of clusters:   8432  Maximum cluster size: 3

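The estimated correlation parameter in the AR(1) model is 0.845. Under this structure, the correlation between two measurements is this parameter raised to the power of their separation in time, so measurements that are further apart are less correlated.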
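One rough way to judge these estimates is to set the working correlation matrices they imply next to the residual correlations computed earlier. The sketch below copies the numbers from the output above. It treats the three measurements as equally spaced occasions 1, 2 and 3, which is an assumption of this sketch; how the fitting software handles the unequal spacing of waves 1, 2 and 4 is not checked here.

```r
# Residual correlations between waves, copied from the cor() output above
wave_names <- c("w1", "w2", "w4")
observed <- matrix(c(1.000, 0.813, 0.720,
                     0.813, 1.000, 0.799,
                     0.720, 0.799, 1.000),
                   nrow = 3, byrow = TRUE,
                   dimnames = list(wave_names, wave_names))

alpha_exch <- 0.803   # from summary(fit.2)
alpha_ar1  <- 0.845   # from summary(fit.3)

# Implied working correlations (equal spacing assumed)
implied_ar1 <- alpha_ar1^abs(outer(1:3, 1:3, "-"))
implied_exch <- matrix(alpha_exch, 3, 3)
diag(implied_exch) <- 1

round(implied_ar1, 3)
round(observed - implied_ar1, 3)    # discrepancies under AR(1)
round(observed - implied_exch, 3)   # discrepancies under exchangeable
```

Small discrepancies in either matrix suggest that the corresponding working structure is a reasonable description of the residual correlation.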

The other type of correlation allowed in geeglm is the unstructured correlation model. This is how to fit the unstructured model, but it crashes on this data, so beware

fit.4<-geeglm(scale(math)~scale(age)+male+black+hisp+asian+nahn+other+hhses, id=childid , waves = wave, corstr ="unstructured",  data=e.long.comp, weights=w4c4p_40/mean(w4c4p_40))
summary(fit.4)

Since GEE’s aren’t fit via maximum likelihood, they aren’t comparable in terms of AIC or likelihood ratio tests. However, Pan (2001) describes an information criterion based on a quasi-likelihood formulation, the QIC. It can be used to compare models with alternative correlation structures, with the lowest QIC indicating the best fitting model. Another criterion is the Correlation Information Criterion (Hin and Wang, 2008, https://onlinelibrary.wiley.com/doi/abs/10.1002/sim.3489), which is proposed to be better for choosing among models with the same mean function but different correlation structures, which is what we’re doing here.

library(MESS) #need to install

## 
## Attaching package: 'MESS'

## The following object is masked from 'package:MuMIn':
## 
##     QIC

## The following object is masked from 'package:geepack':
## 
##     QIC

QIC(fit.1)

##       QIC      QICu Quasi Lik       CIC    params      QICC 
##   10212.1   10170.2   -5076.1      29.9       9.0   10212.1

QIC(fit.2)

##       QIC      QICu Quasi Lik       CIC    params      QICC 
##   10594.7   10582.3   -5282.2      15.2       9.0   10594.7

QIC(fit.3)

##       QIC      QICu Quasi Lik       CIC    params      QICC 
##   10507.5   10496.0   -5239.0      14.7       9.0   10507.5

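So, using the CIC, it looks like the AR(1) correlation structure (CIC = 14.7) is slightly better than the exchangeable structure (CIC = 15.2), but there is not much difference between these two models on this criterion. Both have a much lower CIC than the independence model (CIC = 29.9).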
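The columns of the QIC() output are related by simple arithmetic. The sketch below copies the reported values for the three math score models and checks them against the definitions QIC = -2Q + 2 CIC and QICu = -2Q + 2p, where Q is the quasi-likelihood and p the number of parameters. These definitions are assumed here, following Pan's formulation. The data frame `qic_out` is a name introduced only for this illustration.

```r
# Values copied from the QIC() output above for the math score models
qic_out <- data.frame(
  model    = c("independence", "exchangeable", "ar1"),
  QIC      = c(10212.1, 10594.7, 10507.5),
  QICu     = c(10170.2, 10582.3, 10496.0),
  quasilik = c(-5076.1, -5282.2, -5239.0),
  CIC      = c(29.9, 15.2, 14.7),
  params   = c(9, 9, 9)
)

# Assumed definitions: QIC = -2*Q + 2*CIC and QICu = -2*Q + 2*p
qic_out$QIC_check  <- -2 * qic_out$quasilik + 2 * qic_out$CIC
qic_out$QICu_check <- -2 * qic_out$quasilik + 2 * qic_out$params

qic_out

# Correlation structure with the smallest CIC
qic_out$model[which.min(qic_out$CIC)]
```

The recomputed columns agree with the reported ones up to rounding, which shows that the CIC is the penalty term inside the QIC. Because the mean model is the same in all three fits, comparing CIC values isolates the part of the criterion that depends on the correlation structure.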

Binary response longitudinal model

Here we use the GEE for a binomial outcome.

Here are what the data look like:

binomial_smooth <- function(...) {
  geom_smooth(method = "glm", method.args = list(family = "binomial"), ...)
}

#match the outcome definition used in the models below (health > 2)
sub$poorhealth<-Recode(sub$health, recodes="3:5=1; else=0")

ggplot(sub, aes(x=age, y=poorhealth))+geom_point()+ binomial_smooth()+ggtitle(label = "Change in poor health across age", subtitle = "First 10 children in ECLS-K 2011 - All children")

## `geom_smooth()` using formula 'y ~ x'

ggplot(sub, aes(x=age, y=poorhealth))+geom_point()+ binomial_smooth()+facet_wrap(~childid,nrow=3)+ggtitle(label = "Change in poor health across age", subtitle = "First 10 children in ECLS-K 2011 - Individual Children")

## `geom_smooth()` using formula 'y ~ x'

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

btest<-glm(I(health>2)~age+male+black+hisp+asian+nahn+other+hhses+factor(wave) , family=binomial, data=e.long.comp, weights=w4c4p_40/mean(w4c4p_40))

## Warning in eval(family$initialize): non-integer #successes in a binomial glm!

e.long.comp$binomresid<- residuals(btest)
test2<-reshape(e.long.comp, idvar = "childid", v.names = "binomresid", timevar = "wave", direction = "wide", drop = names(e.long.comp)[c(2:21, 23)])

head(test2[order(test2$childid),], n=12)

#empirical correlations between waves in the residuals
cor(test2[, -1], use="pairwise.complete", method = "spearman")

##              binomresid.1 binomresid.2 binomresid.4
## binomresid.1         1.00        0.670        0.610
## binomresid.2         0.67        1.000        0.636
## binomresid.4         0.61        0.636        1.000

These look like a constant correlation, or AR(1)

Logistic GEE with independent correlation

fitb.1<-geeglm(I(health>2)~age+male+black+hisp+asian+nahn+other+hhses, waves = wave,id=childid ,corstr ="independence", family=binomial, data=e.long.comp, weights=w4c4p_40/mean(w4c4p_40))

## Warning in eval(family$initialize): non-integer #successes in a binomial glm!

summary(fitb.1)

## 
## Call:
## geeglm(formula = I(health > 2) ~ age + male + black + hisp + 
##     asian + nahn + other + hhses, family = binomial, data = e.long.comp, 
##     weights = w4c4p_40/mean(w4c4p_40), id = childid, waves = wave, 
##     corstr = "independence")
## 
##  Coefficients:
##             Estimate  Std.err   Wald Pr(>|W|)    
## (Intercept) -2.25908  0.22264 102.96  < 2e-16 ***
## age         -0.00567  0.03339   0.03    0.865    
## male         0.14350  0.06057   5.61    0.018 *  
## black        0.39741  0.09555  17.30  3.2e-05 ***
## hisp         0.57698  0.07707  56.05  7.1e-14 ***
## asian        0.73853  0.11482  41.37  1.3e-10 ***
## nahn         0.18157  0.22476   0.65    0.419    
## other        0.14622  0.14427   1.03    0.311    
## hhses       -0.57710  0.03993 208.89  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation structure = independence 
## Estimated Scale Parameters:
## 
##             Estimate Std.err
## (Intercept)     1.01  0.0776
## Number of clusters:   8432  Maximum cluster size: 3

Logistic GEE with exchangeable correlations

fitb.2<-geeglm(I(health>2)~age+male+black+hisp+asian+nahn+other+hhses, waves = wave,id=childid, family=binomial, data=e.long.comp, corstr="exchangeable", weights=w4c4p_40/mean(w4c4p_40))

## Warning in eval(family$initialize): non-integer #successes in a binomial glm!

summary(fitb.2)

## 
## Call:
## geeglm(formula = I(health > 2) ~ age + male + black + hisp + 
##     asian + nahn + other + hhses, family = binomial, data = e.long.comp, 
##     weights = w4c4p_40/mean(w4c4p_40), id = childid, waves = wave, 
##     corstr = "exchangeable")
## 
##  Coefficients:
##             Estimate Std.err   Wald Pr(>|W|)    
## (Intercept)  -2.0865  0.2062 102.35  < 2e-16 ***
## age          -0.0337  0.0307   1.21    0.272    
## male          0.1482  0.0605   6.00    0.014 *  
## black         0.4167  0.0955  19.04  1.3e-05 ***
## hisp          0.5944  0.0766  60.15  8.8e-15 ***
## asian         0.7236  0.1144  40.02  2.5e-10 ***
## nahn          0.1812  0.2240   0.65    0.418    
## other         0.1609  0.1438   1.25    0.263    
## hhses        -0.5417  0.0391 192.29  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation structure = exchangeable 
## Estimated Scale Parameters:
## 
##             Estimate Std.err
## (Intercept)        1  0.0733
##   Link = identity 
## 
## Estimated Correlation Parameters:
##       Estimate Std.err
## alpha    0.337  0.0367
## Number of clusters:   8432  Maximum cluster size: 3

Logistic GEE with AR(1) correlation

fitb.3<-geeglm(I(health>2)~age+male+black+hisp+asian+nahn+other+hhses, waves = wave,id=childid, family=binomial, data=e.long.comp, corstr="ar1", weights=w4c4p_40/mean(w4c4p_40))

## Warning in eval(family$initialize): non-integer #successes in a binomial glm!

summary(fitb.3)

## 
## Call:
## geeglm(formula = I(health > 2) ~ age + male + black + hisp + 
##     asian + nahn + other + hhses, family = binomial, data = e.long.comp, 
##     weights = w4c4p_40/mean(w4c4p_40), id = childid, waves = wave, 
##     corstr = "ar1")
## 
##  Coefficients:
##             Estimate Std.err   Wald Pr(>|W|)    
## (Intercept)  -2.1091  0.2091 101.77  < 2e-16 ***
## age          -0.0341  0.0311   1.20   0.2728    
## male          0.1620  0.0607   7.13   0.0076 ** 
## black         0.4422  0.0954  21.48  3.6e-06 ***
## hisp          0.6113  0.0768  63.38  1.7e-15 ***
## asian         0.7814  0.1154  45.85  1.3e-11 ***
## nahn          0.2319  0.2294   1.02   0.3120    
## other         0.1981  0.1466   1.83   0.1767    
## hhses        -0.5583  0.0396 199.02  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation structure = ar1 
## Estimated Scale Parameters:
## 
##             Estimate Std.err
## (Intercept)     1.01  0.0783
##   Link = identity 
## 
## Estimated Correlation Parameters:
##       Estimate Std.err
## alpha    0.412  0.0405
## Number of clusters:   8432  Maximum cluster size: 3

Compare the three models:

QIC(fitb.1)

##       QIC      QICu Quasi Lik       CIC    params      QICC 
##   17428.6   17408.5   -8695.2      19.1       9.0   17428.6

QIC(fitb.2)

##       QIC      QICu Quasi Lik       CIC    params      QICC 
##   17418.2   17411.5   -8696.7      12.4       9.0   17418.3

QIC(fitb.3)

##       QIC      QICu Quasi Lik       CIC    params      QICC 
##   17414.4   17407.3   -8694.7      12.6       9.0   17414.4

In the binomial case, it looks like the exchangeable correlation structure is good, as it has the lowest CIC, although the AR(1) model is very similar.
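Because the logistic GEE coefficients are on the log-odds scale, it can help to exponentiate them. The sketch below copies a few estimates and standard errors from summary(fitb.2) and computes odds ratios with Wald-type 95% intervals based on a normal approximation. The object names are introduced only for this illustration.

```r
# Estimates and standard errors copied from summary(fitb.2)
est <- c(male = 0.1482, black = 0.4167, hisp = 0.5944,
         asian = 0.7236, hhses = -0.5417)
se  <- c(male = 0.0605, black = 0.0955, hisp = 0.0766,
         asian = 0.1144, hhses = 0.0391)

# Odds ratios with Wald-type 95% intervals (normal approximation)
or_table <- data.frame(
  OR    = exp(est),
  lower = exp(est - qnorm(0.975) * se),
  upper = exp(est + qnorm(0.975) * se)
)

round(or_table, 2)
```

An odds ratio below 1, as for hhses, means the odds of the outcome (health > 2) decrease as that predictor increases, averaged over the population rather than conditional on a particular child.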

DEM 7283: Longitudinal Models for Change using Generalized Estimating Equations

Corey Sparks, PhD

April 27, 2020
