In this example, we will use Generalized Estimating Equations to do some longitudinal modeling of data from the ECLS-K 2011. Specifically, we will model changes in a student’s standardized math score as a continuous outcome and self rated health as a binomial outcome, from fall kindergarten to spring, 1st grade.
Up until now, we have used (G)LMM’s to analyze data that were “clustered”
The next topic will introduce a modeling strategy that allows us to consider clustered data, but in a different fashion
GLMM’s are commonly referred to as conditional models, because the model coefficients “\(\beta\)’s” are condition on the random effects in the model.
Likewise, the mean if conditional on the random effects. This is another way of saying that the mean for a given covariate pattern is conditional on the group that the particular person is in.
\(\mu_{ij}^c = E(Y_{ij} | u_j) = X_{ij}\beta + u_j\)
In contrast, Generalzed Estimating Equations are referred to as marginal models because they only estimate the overall mean.
\(\mu_{ij} = X_{ij}\beta\)
Lee and Nelder, 2004 provide a very good description of how these two methods compare to one another
A basic form of the model would be:
\(Y_{ij} = \beta_0 + \sum_k X_{ijk} \beta_k + CORR + error\)
Ordinary models will tend to over estimate the standard errors for the \(\beta\)’s for time varying predictors in a model with repeated observations, because these models do not account for the correlation within clusters observations over time.
Likewise, the standard errors of time invariant predictors will be under estimated
Given the mean function for the model and a specified correlation function, the model parameters may be estimated by finding the solution for:
\[U(\beta) = \sum_i ^n \frac{\delta \mu_{ij}}{ \delta \beta_k} V_i^{-1} (Y_{ij} - \mu(\beta))\]
Which gives estimates of the \(\beta\)’s for the linear mean function.
For three time points per person, the ordinary regression model correlation in residuals within clusters/persons over time can be thought of as the matrix:
\[\begin{bmatrix} \sigma^2 & 0 & 0 \\ 0 & \sigma^2 &0 \\ 0 & 0 & \sigma^2 \end{bmatrix}\]
which assumed the variances are constant and the residuals are independent over time
But in a GEE, the model include the actual correlation between measurements over time:
\[\begin{bmatrix} \sigma_1 ^2 & a & c \\ a & \sigma_2 ^2 &b \\ b & c & \sigma_3 ^2 \end{bmatrix}\]
Which allows the variances over time to be different, as well as correlations between times to be present.
\[\begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 &0 \\ 0 & 0 & 1 \end{bmatrix}\]
\[\begin{bmatrix} 1 & \rho & \rho \\ \rho & 1 &\rho \\ \rho &\rho & 1 \end{bmatrix}\]
\[\begin{bmatrix} 1 & \rho & \rho^2 \\ \rho & 1 &\rho\\ \rho^2 & \rho & 1 \end{bmatrix}\]
\[\begin{bmatrix} 1 & \rho_1 & \rho_2 \\ \rho_1 & 1 &\rho_3 \\ \rho_2 & \rho_3& 1 \end{bmatrix}\]
First we load our data
load("~/Google Drive/classes/dem7283/class18/data/eclsk11.Rdata")
names(eclsk11)<-tolower(names(eclsk11))
library (car)
library(geepack)
library(MuMIn) #need to install
#get out only the variables I'm going to use for this example
myvars<-c("childid", "x_chsex_r", "x1locale", "x_raceth_r", "x2povty", "x12par1ed_i","p1curmar", "x1htotal", "x1mscalk2", "x2mscalk2", "x3mscalk2", "x4mscalk2", "p1hscale", "p2hscale", "p4hscale","x2fsstat2","x4fsstat2", "x12sesl","x4sesl_i", "p2parct1", "p2parct2", "s1_id", "p2safepl","x2krceth","p1o2near", "x_distpov" , "w1c0", "w1p0", "w2p0", "w1c0str", "w1p0str",
"w4c4p_40","w4c4p_4str", "w4c4p_4psu", "w1c0psu", "w1p0psu", "x1height","x2height","x3height","x4height", "x1kage_r","x2kage_r","x3age","x4age")
#subset the data
eclsk.sub<-data.frame(eclsk11[,myvars])
rm(eclsk11); gc()
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 1807752 96.6 2637877 140.9 2164898 115.7
## Vcells 3647876 27.9 296749561 2264.1 255374000 1948.4
First, I do some recoding of variables. First, we code time invariant variables, meaning their values do not change at each wave.
#Non time varying variables
#First we recode some Child characteristics
#Child's sex: recode as male =1
eclsk.sub$male<-Recode(eclsk.sub$x_chsex_r, recodes="1=1; 2=0; -9=NA")
#Recode race with white, non Hispanic as reference using dummy vars
eclsk.sub$hisp<-Recode (eclsk.sub$x_raceth_r, recodes="3:4=1;-9=NA; else=0")
eclsk.sub$black<-Recode (eclsk.sub$x_raceth_r, recodes="2=1;-9=NA; else=0")
eclsk.sub$asian<-Recode (eclsk.sub$x_raceth_r, recodes="5=1;-9=NA; else=0")
eclsk.sub$nahn<-Recode (eclsk.sub$x_raceth_r, recodes="6:7=1;-9=NA; else=0")
eclsk.sub$other<-Recode (eclsk.sub$x_raceth_r, recodes="8=1;-9=NA; else=0")
#Then we recode some parent/mother characteristics
#Mother's education, recode as 2 dummys with HS = reference
eclsk.sub$lths<-Recode(eclsk.sub$x12par1ed_i, recodes = "0:2=1; 3:8=0; else = NA")
eclsk.sub$gths<-Recode(eclsk.sub$x12par1ed_i, recodes = "1:3=0; 4:8=1; else =NA")
#marital status, recode as 2 dummys, ref= married
eclsk.sub$single<-Recode(eclsk.sub$p1curmar, recodes="4=1; -7:-9=NA; else=0")
eclsk.sub$notmar<-Recode(eclsk.sub$p1curmar, recodes="2:3=1; -7:-9=NA; else=0")
#Then we do some household level variables
#Urban school location = 1
eclsk.sub$urban<-Recode(eclsk.sub$x1locale, recodes = "1:3=1; 4=0; -1:-9=NA")
#poverty level in poverty = 1
eclsk.sub$pov<-Recode(eclsk.sub$x2povty , recodes ="1:2=1; 3=0; -9=NA")
#Household size
eclsk.sub$hhsize<-eclsk.sub$x1htotal
#school % minority student body
eclsk.sub$minorsch<-ifelse(eclsk.sub$x2krceth <0, NA, eclsk.sub$x2krceth/10)
#Unsafe neighborhood
eclsk.sub$unsafe<-Recode(eclsk.sub$p2safepl , recodes = "1:2='unsafe'; 3='safe'; else=NA",as.factor = T)
#school district poverty
eclsk.sub$dist_pov<-ifelse(eclsk.sub$x_distpov==-9, NA, scale(eclsk.sub$x_distpov))
I have to make the repeated measures of each of my longitudinal variables. These are referred to as time varying variables, meaning their values change at each wave.
#Longitudinal variables
#recode our outcomes, the first is the child's math standardized test score in Kindergarten
eclsk.sub$math1<-ifelse(eclsk.sub$x1mscalk2<0, NA, eclsk.sub$x1mscalk2)
eclsk.sub$math2<-ifelse(eclsk.sub$x2mscalk2<0, NA, eclsk.sub$x2mscalk2)
#eclsk.sub$math3<-ifelse(eclsk.sub$x3mscalk1<0, NA, eclsk.sub$x3mscalk1)
eclsk.sub$math4<-ifelse(eclsk.sub$x4mscalk2<0, NA, eclsk.sub$x4mscalk2)
#Second outcome is child's height for age, continuous outcome
eclsk.sub$height1<-ifelse(eclsk.sub$x1height<=-7, NA, eclsk.sub$x1height)
eclsk.sub$height2<-ifelse(eclsk.sub$x2height<=-7, NA, eclsk.sub$x2height)
#eclsk.sub$height3<-ifelse(eclsk.sub$x3height<=-7, NA, eclsk.sub$x3height)
eclsk.sub$height4<-ifelse(eclsk.sub$x4height<=-7, NA, eclsk.sub$x4height)
#Age at each wave
eclsk.sub$age_yrs1<-ifelse(eclsk.sub$x1kage_r<0, NA, eclsk.sub$x1kage_r/12)
eclsk.sub$age_yrs2<-ifelse(eclsk.sub$x2kage_r<0, NA, eclsk.sub$x2kage_r/12)
#eclsk.sub$age_yrs3<-ifelse(eclsk.sub$x3age<0, NA, eclsk.sub$x3age/12)
eclsk.sub$age_yrs4<-ifelse(eclsk.sub$x4age<0, NA, eclsk.sub$x4age/12)
eclsk.sub<- eclsk.sub[is.na(eclsk.sub$age_yrs1)==F, ]
#Height for age z score standardized by sex and age
eclsk.sub$height_z1<-ave(eclsk.sub$height1, as.factor(paste(round(eclsk.sub$age_yrs1, 1.5), eclsk.sub$male)), FUN=scale)
eclsk.sub$height_z2<-ave(eclsk.sub$height2, as.factor(paste(round(eclsk.sub$age_yrs2, 1.5), eclsk.sub$male)), FUN=scale)
#eclsk.sub$height_z3<-ave(eclsk.sub$height3, as.factor(paste(round(eclsk.sub$age_yrs3, 1.5), eclsk.sub$male)), FUN=scale)
eclsk.sub$height_z4<-ave(eclsk.sub$height4, as.factor(paste(round(eclsk.sub$age_yrs4, 1.5), eclsk.sub$male)), FUN=scale)
#Household food insecurity, dichotomous outcome
#This outcome is only present at two waves
eclsk.sub$foodinsec1<-Recode(eclsk.sub$x2fsstat2, recodes="2:3=1; 1=0; else=NA")
eclsk.sub$foodinsec2<-Recode(eclsk.sub$x4fsstat2, recodes="2:3=1; 1=0; else=NA")
#Child health assessment Excellent to poor , ordinal outcome
eclsk.sub$chhealth1<-ifelse(eclsk.sub$p1hscale<0, NA, eclsk.sub$p1hscale)
eclsk.sub$chhealth2<-ifelse(eclsk.sub$p2hscale<0, NA, eclsk.sub$p2hscale)
eclsk.sub$chhealth4<-ifelse(eclsk.sub$p4hscale<0, NA, eclsk.sub$p4hscale)
#SES
eclsk.sub$hhses1<-ifelse(eclsk.sub$x12sesl==-9, NA, scale(eclsk.sub$x12sesl))
eclsk.sub$hhses4<-ifelse(eclsk.sub$x4sesl_i==-9, NA, scale(eclsk.sub$x4sesl_i))
To analyze data longitudinally, we must reshape the data from its current “wide” format, where each repeated measure is a column, into the “long” format, where there is a single column for a particular variable, and we account for the repeated measurements of each person. In this case, I’m going to use three waves of data, so each child can contribute up to three lines to the data.
The reshape()
function will do this for us. It has a few arguments we will use. First, the data, then the idvar
argument, which identifies a variable that identifies each individual. The varying
argument, specifies a list of variable names in the wide format that you want to vary with time in the long format. The v.names
argument specifies what you want the time varying variables to be called in the long format. times
specifies what time values you want. direction
specifies wither wide or long form for the data. drop
tells R what variables you want to get rid of, since commonly you don’t want to keep everything.
e.long<-reshape(eclsk.sub, idvar="childid", varying=list(math = c("math1", "math2", "math4"),
age = c("age_yrs1", "age_yrs2", "age_yrs4"),
height= c("height_z1", "height_z2", "height_z4"),
health=c("chhealth1", "chhealth2", "chhealth4"),
ses = c("hhses1", "hhses1", "hhses4")),
v.names = c("math", "age", "height", "health", "ses"),
timevar="wave",times=c(1,2,4),direction="long",
drop = names(eclsk.sub)[c(2:20, 22:25)])
e.long<-e.long[order(e.long$childid, e.long$wave),]
head(e.long, n=10)
## childid p2parct2 x_distpov w1c0 w1p0 w2p0 w1c0str
## 10000001.1 10000001 1 45 188.09105 211.4320 215.1901 70
## 10000001.2 10000001 1 45 188.09105 211.4320 215.1901 70
## 10000001.4 10000001 1 45 188.09105 211.4320 215.1901 70
## 10000002.1 10000002 1 7 237.88083 274.5976 327.8694 45
## 10000002.2 10000002 1 7 237.88083 274.5976 327.8694 45
## 10000002.4 10000002 1 7 237.88083 274.5976 327.8694 45
## 10000004.1 10000004 1 17 96.36698 103.5068 111.8929 51
## 10000004.2 10000004 1 17 96.36698 103.5068 111.8929 51
## 10000004.4 10000004 1 17 96.36698 103.5068 111.8929 51
## 10000005.1 10000005 3 -1 178.54929 237.3439 267.5328 30
## w1p0str w4c4p_40 w4c4p_4str w4c4p_4psu w1c0psu w1p0psu x1height
## 10000001.1 70 322.0055 69 3 3 3 44.50
## 10000001.2 70 322.0055 69 3 3 3 44.50
## 10000001.4 70 322.0055 69 3 3 3 44.50
## 10000002.1 45 474.8417 43 2 2 2 45.63
## 10000002.2 45 474.8417 43 2 2 2 45.63
## 10000002.4 45 474.8417 43 2 2 2 45.63
## 10000004.1 51 171.4766 49 2 2 2 46.00
## 10000004.2 51 171.4766 49 2 2 2 46.00
## 10000004.4 51 171.4766 49 2 2 2 46.00
## 10000005.1 30 0.0000 NA NA 3 3 40.80
## x2height x3height x4height x1kage_r x2kage_r x3age x4age male
## 10000001.1 45.75 NA 48.0 72.39 79.76 NA 91.66 1
## 10000001.2 45.75 NA 48.0 72.39 79.76 NA 91.66 1
## 10000001.4 45.75 NA 48.0 72.39 79.76 NA 91.66 1
## 10000002.1 47.00 NA 49.5 71.41 77.52 NA 89.56 0
## 10000002.2 47.00 NA 49.5 71.41 77.52 NA 89.56 0
## 10000002.4 47.00 NA 49.5 71.41 77.52 NA 89.56 0
## 10000004.1 48.00 NA 50.0 72.39 78.84 NA 89.23 1
## 10000004.2 48.00 NA 50.0 72.39 78.84 NA 89.23 1
## 10000004.4 48.00 NA 50.0 72.39 78.84 NA 89.23 1
## 10000005.1 42.00 NA NA 59.51 65.06 NA NA 0
## hisp black asian nahn other lths gths single notmar urban pov
## 10000001.1 0 0 0 0 0 0 1 0 0 0 1
## 10000001.2 0 0 0 0 0 0 1 0 0 0 1
## 10000001.4 0 0 0 0 0 0 1 0 0 0 1
## 10000002.1 0 0 0 0 0 0 1 0 0 1 0
## 10000002.2 0 0 0 0 0 0 1 0 0 1 0
## 10000002.4 0 0 0 0 0 0 1 0 0 1 0
## 10000004.1 0 0 0 0 0 0 1 0 0 0 0
## 10000004.2 0 0 0 0 0 0 1 0 0 0 0
## 10000004.4 0 0 0 0 0 0 1 0 0 0 0
## 10000005.1 1 0 0 0 0 0 1 0 0 1 1
## hhsize minorsch unsafe dist_pov height1 height2 height4
## 10000001.1 6 0.8 unsafe 2.2423492 44.50 45.75 48.0
## 10000001.2 6 0.8 unsafe 2.2423492 44.50 45.75 48.0
## 10000001.4 6 0.8 unsafe 2.2423492 44.50 45.75 48.0
## 10000002.1 4 3.0 safe -0.6986581 45.63 47.00 49.5
## 10000002.2 4 3.0 safe -0.6986581 45.63 47.00 49.5
## 10000002.4 4 3.0 safe -0.6986581 45.63 47.00 49.5
## 10000004.1 4 1.8 unsafe 0.0752912 46.00 48.00 50.0
## 10000004.2 4 1.8 unsafe 0.0752912 46.00 48.00 50.0
## 10000004.4 4 1.8 unsafe 0.0752912 46.00 48.00 50.0
## 10000005.1 5 9.0 safe -1.3178175 40.80 42.00 NA
## foodinsec1 foodinsec2 wave math age height health
## 10000001.1 0 0 1 46.7477 6.032500 -0.971266566 1
## 10000001.2 0 0 2 64.7527 6.646667 -0.986988853 2
## 10000001.4 0 0 4 86.4892 7.638333 -0.657646933 3
## 10000002.1 0 0 1 31.3889 5.950833 0.000987976 1
## 10000002.2 0 0 2 52.1856 6.460000 0.196427207 1
## 10000002.4 0 0 4 75.7123 7.463333 0.109034629 1
## 10000004.1 0 NA 1 14.8004 6.032500 -0.121874843 1
## 10000004.2 0 NA 2 32.2408 6.570000 0.478487031 1
## 10000004.4 0 NA 4 56.4345 7.435833 0.282849029 NA
## 10000005.1 0 NA 1 28.7995 4.959167 -1.341094704 1
## ses
## 10000001.1 -0.05343635
## 10000001.2 -0.05343635
## 10000001.4 -0.04820708
## 10000002.1 0.36092641
## 10000002.2 0.36092641
## 10000002.4 0.32296225
## 10000004.1 -0.18810425
## 10000004.2 -0.18810425
## 10000004.4 -0.32039792
## 10000005.1 0.37128548
#get non missing cases
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:car':
##
## recode
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
e.long.comp<-e.long%>%
filter(complete.cases(.))
library(ggplot2)
first10<-unique(e.long.comp$childid)[1:10]
sub<-e.long.comp%>%
filter(childid%in%first10)
ggplot(sub, aes(x=age, y=math))+geom_point()+ geom_smooth(method='lm',formula=y~x)+facet_wrap(~childid,nrow = 3)+ggtitle(label = "Change in Math score across age", subtitle = "First 10 children in ECLS-K 2011")
Again, the GEE is used here instead of the (G)LMM.
#basic linear model
fit.1<-glm(scale(math)~scale(age)+male+black+hisp+asian+nahn+other+ses, data=e.long.comp, weights=w4c4p_40/mean(w4c4p_40))
summary(fit.1)
##
## Call:
## glm(formula = scale(math) ~ scale(age) + male + black + hisp +
## asian + nahn + other + ses, data = e.long.comp, weights = w4c4p_40/mean(w4c4p_40))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -5.6146 -0.3687 0.0125 0.4009 2.7203
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.033821 0.013540 2.498 0.01251 *
## scale(age) 0.708104 0.007274 97.354 < 2e-16 ***
## male -0.040321 0.014662 -2.750 0.00597 **
## black -0.241158 0.026697 -9.033 < 2e-16 ***
## hisp -0.155223 0.018360 -8.454 < 2e-16 ***
## asian 0.155219 0.039092 3.971 7.23e-05 ***
## nahn -0.041262 0.043909 -0.940 0.34739
## other -0.065116 0.036520 -1.783 0.07462 .
## ses 0.272183 0.009459 28.775 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.4052642)
##
## Null deviance: 7561.0 on 7633 degrees of freedom
## Residual deviance: 3090.1 on 7625 degrees of freedom
## AIC: 15883
##
## Number of Fisher Scoring iterations: 2
#Get residuals and put them in a data frame
e.long.comp$resid<- residuals(fit.1)
test<-reshape(e.long.comp, idvar = "childid", v.names = "resid", timevar = "wave", direction = "wide", drop = names(e.long.comp)[2:42])
## Warning in reshapeWide(data, idvar = idvar, timevar = timevar, varying =
## varying, : some constant variables (math,age,height,health,ses) are really
## varying
head(test[order(test$childid),c(1,7:9)], n=12)
## childid resid.1 resid.2 resid.4
## 1 10000014 0.49015757 0.12341524 0.47768497
## 4 10000022 -0.13817715 -0.24589874 0.12726966
## 7 10000029 0.07092600 0.08246443 0.55065476
## 10 10000040 -0.48314760 -0.23072683 -0.12163885
## 13 10000046 -0.12058163 -0.33635230 -0.04425801
## 16 10000047 -0.21447969 -0.06906534 -0.37117233
## 19 10000049 -0.03500921 0.16252397 0.27149411
## 22 10000052 -0.57207549 -0.46297265 -0.52158781
## 25 10000053 -0.25300275 0.18153378 0.45553753
## 28 10000060 -0.32728274 0.07202470 0.70566733
## 31 10000062 -0.37207925 0.09236525 -0.02059601
## 34 10000066 -0.64238292 -0.03610193 0.24400231
Here is our actual correlation matrix in the residuals between waves:
cor(test[, 7:9], use="pairwise.complete")
## resid.1 resid.2 resid.4
## resid.1 1.0000000 0.8092426 0.7208978
## resid.2 0.8092426 1.0000000 0.8032736
## resid.4 0.7208978 0.8032736 1.0000000
This is certainly not independence, and looks more like an AR(1), because the correlation decreases as the difference between wave number increases.
Now we fit the GEE: ###Model with independent correlation Meaning ZERO correlation between waves
fit.1<-geeglm(scale(math)~scale(age)+male+black+hisp+asian+nahn+other+ses,id=childid , wave = wave, corstr ="independence", data=e.long.comp, weights=w4c4p_40/mean(w4c4p_40))
summary(fit.1)
##
## Call:
## geeglm(formula = scale(math) ~ scale(age) + male + black + hisp +
## asian + nahn + other + ses, data = e.long.comp, weights = w4c4p_40/mean(w4c4p_40),
## id = childid, waves = wave, corstr = "independence")
##
## Coefficients:
## Estimate Std.err Wald Pr(>|W|)
## (Intercept) 0.03382 0.02342 2.085 0.14879
## scale(age) 0.70810 0.01096 4172.359 < 2e-16 ***
## male -0.04032 0.02602 2.401 0.12129
## black -0.24116 0.04817 25.061 5.56e-07 ***
## hisp -0.15522 0.03721 17.398 3.03e-05 ***
## asian 0.15522 0.05259 8.710 0.00316 **
## nahn -0.04126 0.10024 0.169 0.68060
## other -0.06512 0.06078 1.148 0.28405
## ses 0.27218 0.01782 233.260 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Estimated Scale Parameters:
## Estimate Std.err
## (Intercept) 0.4048 0.01443
##
## Correlation: Structure = independenceNumber of clusters: 2557 Maximum cluster size: 3
Meaning correlation between waves, but the correlation is the same for each pair waves
fit.2<-geeglm(scale(math)~scale(age)+male+black+hisp+asian+nahn+other+ses,id=childid , wave = wave, corstr ="exchangeable", data=e.long.comp, weights=w4c4p_40/mean(w4c4p_40))
summary(fit.2)
##
## Call:
## geeglm(formula = scale(math) ~ scale(age) + male + black + hisp +
## asian + nahn + other + ses, data = e.long.comp, weights = w4c4p_40/mean(w4c4p_40),
## id = childid, waves = wave, corstr = "exchangeable")
##
## Coefficients:
## Estimate Std.err Wald Pr(>|W|)
## (Intercept) 0.04101 0.02501 2.69 0.1011
## scale(age) 0.84392 0.00559 22802.83 < 2e-16 ***
## male -0.05368 0.02805 3.66 0.0557 .
## black -0.24504 0.05023 23.80 1.1e-06 ***
## hisp -0.16548 0.03924 17.79 2.5e-05 ***
## asian 0.18474 0.05775 10.23 0.0014 **
## nahn -0.06576 0.10532 0.39 0.5324
## other -0.05355 0.06424 0.69 0.4045
## ses 0.23627 0.01902 154.32 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Estimated Scale Parameters:
## Estimate Std.err
## (Intercept) 0.424 0.0164
##
## Correlation: Structure = exchangeable Link = identity
##
## Estimated Correlation Parameters:
## Estimate Std.err
## alpha 0.803 0.049
## Number of clusters: 2557 Maximum cluster size: 3
The second model shows the exchangeable correlation to be 0.803, which is not very different from our measured correlations above
resid.1 | resid.2 | resid.4 | |
---|---|---|---|
resid.1 | 1.000 | 0.809 | 0.721 |
resid.2 | 0.809 | 1.000 | 0.803 |
resid.4 | 0.721 | 0.803 | 1.000 |
Now we examine the AR1 correlation types:
fit.3<-geeglm(scale(math)~scale(age)+male+black+hisp+asian+nahn+other+ses, id=childid , wave = wave, corstr ="ar1", data=e.long.comp, weights=w4c4p_40/mean(w4c4p_40))
summary(fit.3)
##
## Call:
## geeglm(formula = scale(math) ~ scale(age) + male + black + hisp +
## asian + nahn + other + ses, data = e.long.comp, weights = w4c4p_40/mean(w4c4p_40),
## id = childid, waves = wave, corstr = "ar1")
##
## Coefficients:
## Estimate Std.err Wald Pr(>|W|)
## (Intercept) 0.02343 0.02530 0.86 0.35436
## scale(age) 0.82675 0.00582 20202.14 < 2e-16 ***
## male -0.04360 0.02850 2.34 0.12601
## black -0.24447 0.05085 23.11 1.5e-06 ***
## hisp -0.18588 0.04006 21.53 3.5e-06 ***
## asian 0.18863 0.05724 10.86 0.00098 ***
## nahn -0.05959 0.10995 0.29 0.58786
## other -0.05487 0.06376 0.74 0.38941
## ses 0.24304 0.01867 169.45 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Estimated Scale Parameters:
## Estimate Std.err
## (Intercept) 0.42 0.0161
##
## Correlation: Structure = ar1 Link = identity
##
## Estimated Correlation Parameters:
## Estimate Std.err
## alpha 0.852 0.0371
## Number of clusters: 2557 Maximum cluster size: 3
The implied correlation in the AR(1) model is : 0.852
The other type of correlation allowed in geeglm
is the unstructured correlation model. This is how to fit the unstructured model, but it crashes on this data, so beware
#fit.4<-geeglm(scale(math)~scale(age)+male+black+hisp+asian+nahn+other+ses, id=childid , wave = wave, corstr ="unstructured", data=e.long.comp, weights=w4c4p_40/mean(w4c4p_40))
#summary(fit.4)
Since GEE’s aren’t fit via maximum likelihood, they aren’t comparable in terms of AIC or likelihood ratio tests. However, Pan, 2001 describe an information criterion using a Quasi-likelihood formulation. This can be used to compare models with alternative correlation structures, with the lowest QIC representing the best fitting model. Another criterion is the Correlation Information Criterion (Hin and Wang, 2008)[https://onlinelibrary.wiley.com/doi/abs/10.1002/sim.3489], which is proposed to be better for choosing among models with the same mean function, but different correlation structures, which is what we’re doing here.
library(MESS) #need to install
## Loading required package: geeM
## Loading required package: Matrix
##
## Attaching package: 'MESS'
## The following object is masked from 'package:MuMIn':
##
## QIC
QIC(fit.1)
## QIC QICu Quasi Lik CIC params QICC
## 3149.5 3109.9 -1546.0 28.8 9.0 3149.5
QIC(fit.2)
## QIC QICu Quasi Lik CIC params QICC
## 3247.1 3237.2 -1609.6 13.9 9.0 3247.1
QIC(fit.3)
## QIC QICu Quasi Lik CIC params QICC
## 3225.3 3215.1 -1598.6 14.1 9.0 3225.4
So, it looks like the exhangeable correlation structure is slightly better than the AR(1), using the CIC but there is not much difference between models using this criteria.
Here we use the GEE for a binomial outcome.
Here are what the data look like:
binomial_smooth <- function(...) {
geom_smooth(method = "glm", method.args = list(family = "binomial"), ...)
}
sub$poorhealth<-Recode(sub$health, recodes="2:3=1; else=0")
ggplot(sub, aes(x=age, y=poorhealth))+geom_point()+ binomial_smooth()+ggtitle(label = "Change in Math score across age", subtitle = "First 10 children in ECLS-K 2011 - All children")
ggplot(sub, aes(x=age, y=poorhealth))+geom_point()+ binomial_smooth()+facet_wrap(~childid,nrow=3)+ggtitle(label = "Change in Math score across age", subtitle = "First 10 children in ECLS-K 2011 - Invidivual Children")
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
btest<-glm(I(health>2)~age+male+black+hisp+asian+nahn+other+ses+factor(wave) , family=binomial, data=e.long.comp, weights=w4c4p_40/mean(w4c4p_40))
## Warning in eval(family$initialize): non-integer #successes in a binomial
## glm!
e.long.comp$binomresid<- residuals(btest)
test2<-reshape(e.long.comp, idvar = "childid", v.names = "binomresid", timevar = "wave", direction = "wide", drop = names(e.long.comp)[2:42])
## Warning in reshapeWide(data, idvar = idvar, timevar = timevar, varying =
## varying, : some constant variables (math,age,height,health,ses,resid) are
## really varying
head(test2[order(test$childid),c(1, (8:10))], n=12)
## childid binomresid.1 binomresid.2 binomresid.4
## 1 10000014 -0.442 -0.431 -0.460
## 4 10000022 -0.231 -0.226 1.542
## 7 10000029 -0.382 -0.373 -0.402
## 10 10000040 -0.323 -0.315 -0.325
## 13 10000046 -0.287 -0.280 -0.285
## 16 10000047 -0.283 -0.276 -0.285
## 19 10000049 1.341 -0.529 -0.600
## 22 10000052 -0.468 -0.457 -0.498
## 25 10000053 -0.344 -0.336 -0.349
## 28 10000060 1.794 1.810 -0.316
## 31 10000062 -0.797 -0.779 -0.793
## 34 10000066 -0.535 -0.522 -0.538
#empirical correlations between waves in the residuals
cor(test2[, 8:10], use="pairwise.complete", method = "spearman")
## binomresid.1 binomresid.2 binomresid.4
## binomresid.1 1.000 0.636 0.574
## binomresid.2 0.636 1.000 0.580
## binomresid.4 0.574 0.580 1.000
These look like a constant correlation, or AR(1)
fitb.1<-geeglm(I(health>2)~age+male+black+hisp+asian+nahn+other+ses, waves = wave,id=childid ,corstr ="independence", family=binomial, data=e.long.comp, weights=w4c4p_40/mean(w4c4p_40))
## Warning in eval(family$initialize): non-integer #successes in a binomial
## glm!
summary(fitb.1)
##
## Call:
## geeglm(formula = I(health > 2) ~ age + male + black + hisp +
## asian + nahn + other + ses, family = binomial, data = e.long.comp,
## weights = w4c4p_40/mean(w4c4p_40), id = childid, waves = wave,
## corstr = "independence")
##
## Coefficients:
## Estimate Std.err Wald Pr(>|W|)
## (Intercept) -2.5253 0.3658 47.66 5.1e-12 ***
## age 0.0322 0.0539 0.36 0.5495
## male 0.1410 0.1005 1.97 0.1609
## black 0.5200 0.1888 7.59 0.0059 **
## hisp 0.6767 0.1330 25.87 3.6e-07 ***
## asian 0.8204 0.2014 16.59 4.6e-05 ***
## nahn 0.3154 0.2866 1.21 0.2712
## other 0.3289 0.2269 2.10 0.1471
## ses -0.5045 0.0664 57.68 3.1e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Estimated Scale Parameters:
## Estimate Std.err
## (Intercept) 1 0.116
##
## Correlation: Structure = independenceNumber of clusters: 2557 Maximum cluster size: 3
fitb.2<-geeglm(I(health>2)~age+male+black+hisp+asian+nahn+other+ses, waves = wave,id=childid, family=binomial, data=e.long.comp, corstr="exchangeable", weights=w4c4p_40/mean(w4c4p_40))
## Warning in eval(family$initialize): non-integer #successes in a binomial
## glm!
summary(fitb.2)
##
## Call:
## geeglm(formula = I(health > 2) ~ age + male + black + hisp +
## asian + nahn + other + ses, family = binomial, data = e.long.comp,
## weights = w4c4p_40/mean(w4c4p_40), id = childid, waves = wave,
## corstr = "exchangeable")
##
## Coefficients:
## Estimate Std.err Wald Pr(>|W|)
## (Intercept) -2.5327 0.3490 52.66 4.0e-13 ***
## age 0.0329 0.0514 0.41 0.5225
## male 0.1397 0.1006 1.93 0.1650
## black 0.5391 0.1892 8.12 0.0044 **
## hisp 0.6886 0.1320 27.21 1.8e-07 ***
## asian 0.8100 0.1993 16.52 4.8e-05 ***
## nahn 0.3310 0.2861 1.34 0.2473
## other 0.3264 0.2272 2.06 0.1508
## ses -0.4882 0.0660 54.77 1.4e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Estimated Scale Parameters:
## Estimate Std.err
## (Intercept) 0.999 0.114
##
## Correlation: Structure = exchangeable Link = identity
##
## Estimated Correlation Parameters:
## Estimate Std.err
## alpha 0.294 0.0566
## Number of clusters: 2557 Maximum cluster size: 3
fitb.3<-geeglm(I(health>2)~age+male+black+hisp+asian+nahn+other+ses, waves = wave,id=childid, family=binomial, data=e.long.comp, corstr="ar1", weights=w4c4p_40/mean(w4c4p_40))
## Warning in eval(family$initialize): non-integer #successes in a binomial
## glm!
summary(fitb.3)
##
## Call:
## geeglm(formula = I(health > 2) ~ age + male + black + hisp +
## asian + nahn + other + ses, family = binomial, data = e.long.comp,
## weights = w4c4p_40/mean(w4c4p_40), id = childid, waves = wave,
## corstr = "ar1")
##
## Coefficients:
## Estimate Std.err Wald Pr(>|W|)
## (Intercept) -2.5113 0.3521 50.86 9.9e-13 ***
## age 0.0278 0.0519 0.29 0.5921
## male 0.1546 0.1010 2.34 0.1260
## black 0.5640 0.1911 8.71 0.0032 **
## hisp 0.6798 0.1317 26.66 2.4e-07 ***
## asian 0.8589 0.2003 18.39 1.8e-05 ***
## nahn 0.3344 0.2888 1.34 0.2469
## other 0.3573 0.2318 2.38 0.1233
## ses -0.5100 0.0672 57.54 3.3e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Estimated Scale Parameters:
## Estimate Std.err
## (Intercept) 1 0.119
##
## Correlation: Structure = ar1 Link = identity
##
## Estimated Correlation Parameters:
## Estimate Std.err
## alpha 0.359 0.0635
## Number of clusters: 2557 Maximum cluster size: 3
Compare the three models:
QIC(fitb.1)
## QIC QICu Quasi Lik CIC params QICC
## 5803.5 5787.1 -2884.6 17.2 9.0 5803.5
QIC(fitb.2)
## QIC QICu Quasi Lik CIC params QICC
## 5793.1 5787.9 -2885.0 11.6 9.0 5793.2
QIC(fitb.3)
## QIC QICu Quasi Lik CIC params QICC
## 5792 5786 -2884 12 9 5792
In the binomial case, it looks like the echangeable correlation structure is good, as it has the lowest CIC, although the AR(1) model is very similar.