In this example, we will use Generalized Estimating Equations (GEE) to model longitudinal data from the ECLS-K 2011. Specifically, we will model changes in a student’s standardized math score as a continuous outcome and self-rated health as a binomial outcome, from the fall of kindergarten to the spring of 1st grade.
Up until now, we have used (G)LMM’s to analyze data that were “clustered”.
The next topic introduces a modeling strategy that also lets us consider clustered data, but in a different fashion.
GLMM’s are commonly referred to as conditional models, because the model coefficients “\(\beta\)’s” are conditional on the random effects in the model.
Likewise, the mean is conditional on the random effects. This is another way of saying that the mean for a given covariate pattern is conditional on the group that the particular person is in.
\(\mu_{ij}^c = E(Y_{ij} | u_j) = X_{ij}\beta + u_j\)
In contrast, Generalized Estimating Equations are referred to as marginal models because they only estimate the overall (population-averaged) mean.
\(\mu_{ij} = X_{ij}\beta\)
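For the linear (identity link) case, the two means are connected by averaging over the random effect. A short derivation, assuming \(E(u_j) = 0\) as is usual for a random intercept:

\[\mu_{ij} = E(Y_{ij}) = E\left[ E(Y_{ij} | u_j) \right] = X_{ij}\beta + E(u_j) = X_{ij}\beta\]

So with an identity link the conditional and marginal \(\beta\)’s coincide. With a nonlinear link such as the logit, the average over \(u_j\) does not pass through the link function, and the two sets of coefficients generally differ.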
Lee and Nelder (2004) provide a very good description of how these two methods compare to one another.
A basic form of the model would be:
\(Y_{ij} = \beta_0 + \sum_k X_{ijk} \beta_k + CORR + error\)
Ordinary regression models will tend to overestimate the standard errors of the \(\beta\)’s for time varying predictors in a model with repeated observations, because these models do not account for the correlation among a cluster's observations over time.
Conversely, the standard errors of time invariant predictors will tend to be underestimated.
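A small simulation can make this concrete. The sketch below uses made-up data with a person-level random intercept, fits an ordinary `lm()` that ignores the clustering, and compares the average reported standard error with the actual spread of the estimates across simulated data sets. All names and parameter values are illustrative assumptions, not part of the ECLS-K analysis.

```r
# Sketch: naive (independence) standard errors vs. the actual sampling
# variability of OLS estimates when observations are clustered within persons.
set.seed(1)
n_id   <- 200   # number of persons (clusters)
n_time <- 3     # repeated measures per person
n_sim  <- 200   # number of simulated data sets

one_sim <- function() {
  time <- rep(0:(n_time - 1), times = n_id)   # time varying predictor
  z    <- rep(rnorm(n_id), each = n_time)     # time invariant predictor
  u    <- rep(rnorm(n_id), each = n_time)     # person-level random intercept
  y    <- 1 + 0.5 * z + 0.5 * time + u + rnorm(n_id * n_time)
  cf   <- summary(lm(y ~ z + time))$coefficients   # ignores the clustering
  c(est_z    = cf["z", "Estimate"],    se_z    = cf["z", "Std. Error"],
    est_time = cf["time", "Estimate"], se_time = cf["time", "Std. Error"])
}

sims <- t(replicate(n_sim, one_sim()))

# Average reported SE vs. standard deviation of the estimates over simulations
naive_vs_actual <- rbind(
  time_invariant = c(naive_se     = mean(sims[, "se_z"]),
                     empirical_sd = sd(sims[, "est_z"])),
  time_varying   = c(naive_se     = mean(sims[, "se_time"]),
                     empirical_sd = sd(sims[, "est_time"]))
)
print(round(naive_vs_actual, 3))
```

In this setup the naive standard error should come out smaller than the empirical spread for the time invariant predictor and larger for the time varying one.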
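Given the mean function for the model and a specified correlation function, the model parameters may be estimated by solving the estimating equation:
\[U(\beta) = \sum_{j=1}^n \left( \frac{\partial \mu_j}{\partial \beta} \right)^T V_j^{-1} (Y_j - \mu_j(\beta)) = 0\]
where \(j\) indexes the \(n\) clusters (here, children), as in \(u_j\) above, \(Y_j\) and \(\mu_j\) are the vectors of a cluster's outcomes and means, and \(V_j\) is the working covariance matrix implied by the correlation function. The solution gives estimates of the \(\beta\)’s for the linear mean function.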
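To see what solving the estimating equation involves, here is a minimal sketch for a linear (identity link) mean, where the derivative of the mean with respect to \(\beta\) is just the cluster's design matrix. It assumes a fixed, known exchangeable working correlation (`rho = 0.5`), which is a simplification: GEE software re-estimates the correlation parameter between updates of \(\beta\). The data are simulated and all object names are hypothetical.

```r
# Sketch: solve U(beta) = 0 for an identity-link mean with a fixed
# exchangeable working correlation (rho treated as known, an assumption).
set.seed(2)
n_id <- 100; n_time <- 3
id <- rep(seq_len(n_id), each = n_time)
x  <- rnorm(n_id * n_time)
y  <- 1 + 2 * x + rep(rnorm(n_id), each = n_time) + rnorm(n_id * n_time)
X  <- cbind(Intercept = 1, x = x)

rho   <- 0.5
V     <- matrix(rho, n_time, n_time); diag(V) <- 1  # working covariance, up to scale
V_inv <- solve(V)

# Accumulate sum X' V^-1 X and sum X' V^-1 Y over clusters
A <- matrix(0, ncol(X), ncol(X)); b <- matrix(0, ncol(X), 1)
for (cl in unique(id)) {
  Xc <- X[id == cl, , drop = FALSE]
  A  <- A + t(Xc) %*% V_inv %*% Xc
  b  <- b + t(Xc) %*% V_inv %*% y[id == cl]
}
beta_hat <- solve(A, b)

# The estimating function evaluated at beta_hat should be numerically zero
U <- matrix(0, ncol(X), 1)
for (cl in unique(id)) {
  Xc <- X[id == cl, , drop = FALSE]
  U  <- U + t(Xc) %*% V_inv %*% (y[id == cl] - Xc %*% beta_hat)
}
print(drop(beta_hat))
print(drop(U))
```

For non-identity links the equation has no closed-form solution and is solved iteratively, but each iteration has this same weighted least squares form.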
For three time points per person, the ordinary regression model treats the residual covariance within clusters/persons over time as the matrix:
\[\begin{bmatrix} \sigma^2 & 0 & 0 \\ 0 & \sigma^2 &0 \\ 0 & 0 & \sigma^2 \end{bmatrix}\]
which assumes the variances are constant and the residuals are independent over time.
But in a GEE, the model includes the actual correlation between measurements over time:
\[\begin{bmatrix} \sigma_1 ^2 & a & c \\ a & \sigma_2 ^2 &b \\ c & b & \sigma_3 ^2 \end{bmatrix}\]
which allows the variances to differ over time and allows correlations between time points to be present.
In practice, one of a few working correlation structures is specified. For three time points these are:
Independent:
\[\begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 &0 \\ 0 & 0 & 1 \end{bmatrix}\]
Exchangeable:
\[\begin{bmatrix} 1 & \rho & \rho \\ \rho & 1 &\rho \\ \rho &\rho & 1 \end{bmatrix}\]
Autoregressive, AR(1):
\[\begin{bmatrix} 1 & \rho & \rho^2 \\ \rho & 1 &\rho\\ \rho^2 & \rho & 1 \end{bmatrix}\]
Unstructured:
\[\begin{bmatrix} 1 & \rho_1 & \rho_2 \\ \rho_1 & 1 &\rho_3 \\ \rho_2 & \rho_3& 1 \end{bmatrix}\]
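The four matrices above (independence, exchangeable, AR(1), and unstructured) can be generated for any number of waves. The helper below is a hypothetical illustration in base R; the function name `working_corr` and the example correlation values are assumptions, not part of any package.

```r
# Sketch: build common working correlation matrices for n_time waves.
working_corr <- function(type, n_time = 3, rho = 0.5, rhos = NULL) {
  # lag[i, j] = number of waves separating occasions i and j
  lag <- abs(outer(seq_len(n_time), seq_len(n_time), "-"))
  switch(type,
    independence = diag(n_time),
    exchangeable = ifelse(lag == 0, 1, rho),
    ar1          = rho^lag,
    unstructured = {
      # rhos: one correlation per pair of waves, filled column-wise
      # into the lower triangle and then mirrored
      M <- diag(n_time)
      M[lower.tri(M)] <- rhos
      M[upper.tri(M)] <- t(M)[upper.tri(M)]
      M
    },
    stop("unknown structure: ", type)
  )
}

print(working_corr("exchangeable", rho = 0.8))
print(working_corr("ar1", rho = 0.8))
print(working_corr("unstructured", rhos = c(0.8, 0.7, 0.6)))
```

Note how the AR(1) matrix is fully determined by a single parameter, just like the exchangeable one, while the unstructured matrix needs one parameter per pair of waves.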
library(car)
## Loading required package: carData
library(geepack)
library(MuMIn) #may need to install
## Warning: package 'MuMIn' was built under R version 4.0.5
##
## Attaching package: 'MuMIn'
## The following object is masked from 'package:geepack':
##
## QIC
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:car':
##
## recode
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library("rio")
#install_formats()
First, we load our data.
eclskk5 <- import(choose.files())
names(eclskk5)<-tolower(names(eclskk5))
#get out only the variables I'm going to use for this example
#subset the data
eclsk.sub<-eclskk5%>%
select(childid, x_chsex_r, x1locale, x_raceth_r, x2povty, x12par1ed_i, p1curmar, x1htotal, x1mscalk5, x2mscalk5, x3mscalk5, x4mscalk5, x5mscalk5, p1hscale, p2hscale, p4hscale, x2fsstat2, x4fsstat2, x12sesl, x4sesl_i, p2parct1, p2parct2, s1_id, p2safepl, x2krceth, p1o2near, x_distpov, w1c0, w1p0, w2p0, w1c0str, w1p0str, w4c4p_40, w4c4p_4str,w4c4p_4psu, w1c0psu, w1p0psu, x1height, x2height, x4height, x5height, x1kage_r, x2kage_r, x3age, x4age, x5age)
#rm(eclsk11); gc()
Next, I do some recoding of variables. First, we code time invariant variables, meaning their values do not change across waves.
#Non time varying variables
#First we recode some Child characteristics
#Child's sex: recode as male =1
eclsk.sub$male<-Recode(eclsk.sub$x_chsex_r, recodes="1=1; 2=0; -9=NA")
#Recode race with white, non Hispanic as reference using dummy vars
eclsk.sub$hisp<-Recode (eclsk.sub$x_raceth_r, recodes="3:4=1;-9=NA; else=0")
eclsk.sub$black<-Recode (eclsk.sub$x_raceth_r, recodes="2=1;-9=NA; else=0")
eclsk.sub$asian<-Recode (eclsk.sub$x_raceth_r, recodes="5=1;-9=NA; else=0")
eclsk.sub$nahn<-Recode (eclsk.sub$x_raceth_r, recodes="6:7=1;-9=NA; else=0")
eclsk.sub$other<-Recode (eclsk.sub$x_raceth_r, recodes="8=1;-9=NA; else=0")
#Then we recode some parent/mother characteristics
#Mother's education, recode as 2 dummys with HS = reference
eclsk.sub$lths<-Recode(eclsk.sub$x12par1ed_i, recodes = "0:2=1; 3:8=0; else = NA")
eclsk.sub$gths<-Recode(eclsk.sub$x12par1ed_i, recodes = "1:3=0; 4:8=1; else =NA")
#marital status, recode as 2 dummys, ref= married
#ranges in Recode() are written low:high, so the missing codes are given as -9:-7
eclsk.sub$single<-Recode(eclsk.sub$p1curmar, recodes="4=1; -9:-7=NA; else=0")
eclsk.sub$notmar<-Recode(eclsk.sub$p1curmar, recodes="2:3=1; -9:-7=NA; else=0")
#Then we do some household level variables
#Urban school location = 1
eclsk.sub$urban<-Recode(eclsk.sub$x1locale, recodes = "1:3=1; 4=0; -9:-1=NA")
#poverty level in poverty = 1
eclsk.sub$pov<-Recode(eclsk.sub$x2povty , recodes ="1:2=1; 3=0; -9=NA")
#Household size
eclsk.sub$hhsize<-eclsk.sub$x1htotal
#school % minority student body
eclsk.sub$minorsch<-ifelse(eclsk.sub$x2krceth <0, NA, eclsk.sub$x2krceth/10)
#Unsafe neighborhood
eclsk.sub$unsafe<-Recode(eclsk.sub$p2safepl , recodes = "1:2='unsafe'; 3='safe'; else=NA",as.factor = T)
#school district poverty
eclsk.sub$dist_pov<-ifelse(eclsk.sub$x_distpov==-9, NA, scale(eclsk.sub$x_distpov))
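One thing to watch in the last recode: `scale()` is applied to the full vector before the -9 codes are set to missing, so the missing-data codes enter the mean and standard deviation used for standardizing. The sketch below, using made-up values, contrasts that pattern with setting the codes to `NA` first. Whether this matters in the real data depends on how many -9 values are present.

```r
# Made-up values; -9 stands in for a missing-data code
x <- c(10, 12, 14, 16, -9)

# Pattern used above: standardize first, then blank out the -9 entries
z_sentinel_in <- ifelse(x == -9, NA, as.numeric(scale(x)))

# Alternative: set -9 to NA first, then standardize the valid values only
x_clean <- ifelse(x == -9, NA, x)
z_clean <- as.numeric(scale(x_clean))

print(round(cbind(x, z_sentinel_in, z_clean), 2))

# Only the second version is centered at zero among the valid observations
print(mean(z_sentinel_in, na.rm = TRUE))
print(mean(z_clean, na.rm = TRUE))
```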
I have to make the repeated measures of each of my longitudinal variables. These are referred to as time varying variables, meaning their values change at each wave.
#Longitudinal variables
#recode our outcomes, the first is the child's math standardized test score in Kindergarten
eclsk.sub$math_1<-ifelse(eclsk.sub$x1mscalk5<0, NA, eclsk.sub$x1mscalk5)
eclsk.sub$math_2<-ifelse(eclsk.sub$x2mscalk5<0, NA, eclsk.sub$x2mscalk5)
#eclsk.sub$math3<-ifelse(eclsk.sub$x3mscalk1<0, NA, eclsk.sub$x3mscalk1)
eclsk.sub$math_4<-ifelse(eclsk.sub$x4mscalk5<0, NA, eclsk.sub$x4mscalk5)
#Second outcome is child's height for age, continuous outcome
eclsk.sub$height_1<-ifelse(eclsk.sub$x1height<=-7, NA, eclsk.sub$x1height)
eclsk.sub$height_2<-ifelse(eclsk.sub$x2height<=-7, NA, eclsk.sub$x2height)
#eclsk.sub$height3<-ifelse(eclsk.sub$x3height<=-7, NA, eclsk.sub$x3height)
eclsk.sub$height_4<-ifelse(eclsk.sub$x4height<=-7, NA, eclsk.sub$x4height)
#Age at each wave
eclsk.sub$ageyrs_1<-ifelse(eclsk.sub$x1kage_r<0, NA, eclsk.sub$x1kage_r/12)
eclsk.sub$ageyrs_2<-ifelse(eclsk.sub$x2kage_r<0, NA, eclsk.sub$x2kage_r/12)
#eclsk.sub$age_yrs3<-ifelse(eclsk.sub$x3age<0, NA, eclsk.sub$x3age/12)
eclsk.sub$ageyrs_4<-ifelse(eclsk.sub$x4age<0, NA, eclsk.sub$x4age/12)
eclsk.sub<- eclsk.sub[is.na(eclsk.sub$ageyrs_1)==F, ]
#Height for age z score standardized by sex and age
eclsk.sub$heightz_1<-ave(eclsk.sub$height_1, as.factor(paste(round(eclsk.sub$ageyrs_1, 1.5), eclsk.sub$male)), FUN=scale)
eclsk.sub$heightz_2<-ave(eclsk.sub$height_2, as.factor(paste(round(eclsk.sub$ageyrs_2, 1.5), eclsk.sub$male)), FUN=scale)
#eclsk.sub$height_z3<-ave(eclsk.sub$height3, as.factor(paste(round(eclsk.sub$age_yrs3, 1.5), eclsk.sub$male)), FUN=scale)
eclsk.sub$heightz_4<-ave(eclsk.sub$height_4, as.factor(paste(round(eclsk.sub$ageyrs_4, 1.5), eclsk.sub$male)), FUN=scale)
#Household food insecurity, dichotomous outcome
#This outcome is only present at two waves
eclsk.sub$foodinsec_1<-Recode(eclsk.sub$x2fsstat2, recodes="2:3=1; 1=0; else=NA")
eclsk.sub$foodinsec_2<-Recode(eclsk.sub$x2fsstat2, recodes="2:3=1; 1=0; else=NA")
eclsk.sub$foodinsec_4<-Recode(eclsk.sub$x4fsstat2, recodes="2:3=1; 1=0; else=NA")
#Child health assessment Excellent to poor , ordinal outcome
eclsk.sub$chhealth_1<-ifelse(eclsk.sub$p1hscale<0, NA, eclsk.sub$p1hscale)
eclsk.sub$chhealth_2<-ifelse(eclsk.sub$p2hscale<0, NA, eclsk.sub$p2hscale)
eclsk.sub$chhealth_4<-ifelse(eclsk.sub$p4hscale<0, NA, eclsk.sub$p4hscale)
#SES
eclsk.sub$hhses_1<-ifelse(eclsk.sub$x12sesl==-9, NA, scale(eclsk.sub$x12sesl))
eclsk.sub$hhses_2<-ifelse(eclsk.sub$x12sesl==-9, NA, scale(eclsk.sub$x12sesl))
eclsk.sub$hhses_4<-ifelse(eclsk.sub$x4sesl_i==-9, NA, scale(eclsk.sub$x4sesl_i))
To analyze data longitudinally, we must reshape the data from its current “wide” format, where each repeated measure is a column, into the “long” format, where there is a single column for a particular variable, and we account for the repeated measurements of each person. In this case, I’m going to use three waves of data, so each child can contribute up to three lines to the data.
The reshape() function will do this for us, but below I use a tidy method based on the tidyr and dplyr packages; a data.table version is also shown for comparison. I make a long data set of the height, age, math, child health, food insecurity and household SES measures, keeping alongside them the time invariant variables I’ll use in my models below.
library(tidyr)
e.long.comp<-eclsk.sub%>%
rename(wt = w4c4p_40,strata= w4c4p_4str, psu = w4c4p_4psu)%>%
select(childid,male, hisp, black, asian, nahn, other,wt, strata, psu, #time constant
height_1, height_2, height_4, #t-varying variables
ageyrs_1, ageyrs_2, ageyrs_4,
chhealth_1, chhealth_2, chhealth_4,
foodinsec_1, foodinsec_2, foodinsec_4,
hhses_1, hhses_2, hhses_4,
math_1,math_2, math_4)%>%
pivot_longer(cols = c(-childid, -male, -hisp, -black, -asian,-nahn, -other, -wt, -strata, -psu), #time constant variables go here
names_to = c(".value", "wave"), #make wave variable and put t-v vars into columns
names_sep = "_")%>% #all t-v variables have _ between name and time, like age_1, age_2
filter(complete.cases(.))%>%
arrange(childid, wave)
head(e.long.comp)
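For comparison, the base `reshape()` function mentioned above can do the same wide-to-long conversion. The sketch below uses an invented two-child data set; only the `name_wave` column naming pattern mirrors the variables created earlier, and the values are not from the ECLS-K.

```r
# Sketch: wide-to-long with base R's reshape() on a made-up data set
toy_wide <- data.frame(
  childid = c("A", "B"),
  male    = c(1, 0),
  math_1  = c(30.1, 28.4),
  math_2  = c(41.7, 39.9),
  math_4  = c(60.2, 57.3)
)

toy_long <- reshape(
  toy_wide,
  direction = "long",
  idvar     = "childid",                        # one id per child
  varying   = c("math_1", "math_2", "math_4"),  # the repeated measure columns
  v.names   = "math",                           # name of the stacked column
  timevar   = "wave",                           # name of the occasion column
  times     = c(1, 2, 4)                        # wave labels
)

# Sort so each child's rows are together, in wave order
toy_long <- toy_long[order(toy_long$childid, toy_long$wave), ]
print(toy_long)
```

Each child now contributes one row per wave, and the time invariant column `male` is repeated on every row, which is the layout the models below expect.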
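#Alternative reshape with data.table::melt(), shown for comparison only;
#the analyses below use the e.long.comp object created with pivot_longer() above
library(data.table)
library(magrittr)
out<-melt(as.data.table(eclsk.sub), id = "childid",
measure.vars = list(ht=c("heightz_1","heightz_2","heightz_4"),
age=c("ageyrs_1", "ageyrs_2", "ageyrs_4"),
math=c("math_1", "math_2", "math_4"),
hhses=c("hhses_1", "hhses_2", "hhses_4"),
health=c("chhealth_1", "chhealth_2", "chhealth_4")))%>%
setorder(childid)
head(out, n=20)
#merge back to other data
e.long<-eclsk.sub%>%
select(childid, hisp, black,asian, nahn, other,male, unsafe, s1_id, pov, hhsize, urban, w4c4p_40, w4c4p_4str, w4c4p_4psu)%>%
left_join(., out, "childid")
#variable indexes the measurement occasion (1, 2, 3 correspond to waves 1, 2, 4)
e.long$wave<-e.long$variable
head(e.long)
#stored under a different name so the pivot_longer() result is not overwritten
e.long.dt<-e.long%>%
filter(complete.cases(.), w4c4p_40>0)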
library(ggplot2)
first10<-unique(e.long.comp$childid)[1:10]
sub<-e.long.comp%>%
filter(childid%in%first10)
ggplot(sub, aes(x=ageyrs, y=math))+
geom_point()+
geom_smooth(method='lm',formula=y~x)+
facet_wrap(~childid,nrow = 3)+
ggtitle(label = "Change in Math score across age",
subtitle = "First 10 children in ECLS-K 2011")
## Warning in qt((1 - level)/2, df): NaNs produced
## Warning in qt((1 - level)/2, df): NaNs produced
## Warning in max(ids, na.rm = TRUE): no non-missing arguments to max; returning -
## Inf
## Warning in max(ids, na.rm = TRUE): no non-missing arguments to max; returning -
## Inf
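Before fitting the GEE, we fit an ordinary linear model that ignores the clustering, and examine how its residuals are correlated across waves.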
#basic linear model
fit.1<-glm(scale(math)~scale(ageyrs)+male+black+hisp+asian+nahn+other+hhses, data=e.long.comp, weights=wt/mean(wt))
summary(fit.1)
## Warning in summary.glm(fit.1): observations with zero weight not used for
## calculating dispersion
#Get residuals and put them in a data frame
e.long.comp$resid<- residuals(fit.1)
e.res<-e.long.comp%>%
select(childid, wave,resid)%>%
pivot_wider(id_cols=c(childid),
names_from = wave,
values_from=resid )
head(e.res)
Here is our actual correlation matrix in the residuals between waves:
cor(e.res[,-1], use="pairwise.complete")
## 1 2 4
## 1 1.0000000 0.8149430 0.7218168
## 2 0.8149430 1.0000000 0.7992329
## 4 0.7218168 0.7992329 1.0000000
This is certainly not independence. The correlation decreases as the gap between waves increases, which is the pattern an AR(1) structure describes, although the decline is slower than a strict AR(1) would imply.
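A quick arithmetic check, using the residual correlations printed above, shows what each structure would predict for the pair of waves that are furthest apart. This is a rough sketch: it treats waves 1, 2 and 4 as equally spaced occasions and uses the average of the two adjacent-wave correlations as the single parameter, both of which are simplifying assumptions.

```r
# Residual correlations copied from the matrix printed above
r12 <- 0.8149430; r24 <- 0.7992329; r14 <- 0.7218168

rho_adjacent <- mean(c(r12, r24))  # average correlation between adjacent waves
implied_ar1  <- rho_adjacent^2     # AR(1) prediction for waves 1 and 4
implied_exch <- rho_adjacent       # exchangeable prediction for waves 1 and 4

print(round(c(observed = r14, ar1 = implied_ar1, exchangeable = implied_exch), 3))
```

The observed correlation between waves 1 and 4 falls between the two predictions, so neither structure describes these residuals exactly.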
Now we fit the GEE:
### Model with independent correlation
Meaning ZERO correlation between waves
fit.1<-geeglm(scale(math)~scale(ageyrs)+male+black+hisp+asian+nahn+other+hhses,
id=childid ,
waves = wave,
corstr ="independence",
data=e.long.comp,
weights=wt/mean(wt))
summary(fit.1)
## Warning in summary.glm(object): observations with zero weight not used for
## calculating dispersion
##
## Call:
## geeglm(formula = scale(math) ~ scale(ageyrs) + male + black +
## hisp + asian + nahn + other + hhses, data = e.long.comp,
## weights = wt/mean(wt), id = childid, waves = wave, corstr = "independence")
##
## Coefficients:
## Estimate Std.err Wald Pr(>|W|)
## (Intercept) -0.001055 0.012256 0.007 0.931
## scale(ageyrs) 0.701513 0.005727 15006.485 < 2e-16 ***
## male 0.001107 0.014345 0.006 0.938
## black -0.231107 0.023467 96.986 < 2e-16 ***
## hisp -0.116031 0.019612 35.003 3.29e-09 ***
## asian 0.196819 0.033163 35.223 2.94e-09 ***
## nahn -0.099816 0.071867 1.929 0.165
## other 0.009371 0.036362 0.066 0.797
## hhses 0.278024 0.009075 938.672 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation structure = independence
## Estimated Scale Parameters:
##
## Estimate Std.err
## (Intercept) 0.4263 0.007448
## Number of clusters: 10179 Maximum cluster size: 3
### Model with exchangeable correlation
Meaning there is correlation between waves, but the correlation is the same for each pair of waves
fit.2<-geeglm(scale(math)~scale(ageyrs)+male+black+hisp+asian+nahn+other+hhses ,
id = childid,
waves = wave,
corstr ="exchangeable",
data=e.long.comp,
weights=wt/mean(wt))
summary(fit.2)
## Warning in summary.glm(object): observations with zero weight not used for
## calculating dispersion
##
## Call:
## geeglm(formula = scale(math) ~ scale(ageyrs) + male + black +
## hisp + asian + nahn + other + hhses, data = e.long.comp,
## weights = wt/mean(wt), id = childid, waves = wave, corstr = "exchangeable")
##
## Coefficients:
## Estimate Std.err Wald Pr(>|W|)
## (Intercept) -0.000195 0.012921 0.00 0.988
## scale(ageyrs) 0.847408 0.003044 77484.22 < 2e-16 ***
## male -0.009904 0.015199 0.42 0.515
## black -0.243471 0.024648 97.57 < 2e-16 ***
## hisp -0.128358 0.020268 40.11 2.4e-10 ***
## asian 0.229998 0.034954 43.30 4.7e-11 ***
## nahn -0.140067 0.077768 3.24 0.072 .
## other 0.005155 0.038890 0.02 0.895
## hhses 0.234856 0.009501 611.02 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation structure = exchangeable
## Estimated Scale Parameters:
##
## Estimate Std.err
## (Intercept) 0.45 0.00842
## Link = identity
##
## Estimated Correlation Parameters:
## Estimate Std.err
## alpha 0.814 0.021
## Number of clusters: 10179 Maximum cluster size: 3
The second model shows the exchangeable correlation to be 0.814, which is close to the adjacent-wave correlations we measured above, but higher than the correlation between waves 1 and 4:
| | 1 | 2 | 4 |
|---|---|---|---|
| 1 | 1.000 | 0.815 | 0.722 |
| 2 | 0.815 | 1.000 | 0.799 |
| 4 | 0.722 | 0.799 | 1.000 |
Now we examine the AR(1) correlation structure:
fit.3<-geeglm(scale(math)~scale(ageyrs)+male+black+hisp+asian+nahn+other+hhses ,
id = childid,
waves = wave,
corstr ="ar1", #geepack's name for AR(1); the string "ar(1)" is not a recognized value
data=e.long.comp,
weights=wt/mean(wt))
## Warning in if (corstrv == -1) stop("invalid corstr."): the condition has length
## > 1 and only the first element will be used
## Warning in if (corstrv == 5) stop("need zcor matrix for userdefined corstr.")
## else zcor <- genZcor(clusz, : the condition has length > 1 and only the first
## element will be used
## Warning in if (corstrv == 1) return(matrix(0, 0, 0)): the condition has length >
## 1 and only the first element will be used
## Warning in if (corstrv == 6) alpha <- 1 else alpha <- rep(0, q): the condition
## has length > 1 and only the first element will be used
summary(fit.3)
## Warning in summary.glm(object): observations with zero weight not used for
## calculating dispersion
##
## Call:
## geeglm(formula = scale(math) ~ scale(ageyrs) + male + black +
## hisp + asian + nahn + other + hhses, data = e.long.comp,
## weights = wt/mean(wt), id = childid, waves = wave, corstr = "ar(1)")
##
## Coefficients:
## Estimate Std.err Wald Pr(>|W|)
## (Intercept) -0.000195 0.012921 0.00 0.988
## scale(ageyrs) 0.847408 0.003044 77484.22 < 2e-16 ***
## male -0.009904 0.015199 0.42 0.515
## black -0.243471 0.024648 97.57 < 2e-16 ***
## hisp -0.128358 0.020268 40.11 2.4e-10 ***
## asian 0.229998 0.034954 43.30 4.7e-11 ***
## nahn -0.140067 0.077768 3.24 0.072 .
## other 0.005155 0.038890 0.02 0.895
## hhses 0.234856 0.009501 611.02 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation structure = exchangeable ar1 unstructured userdefined fixed
## Estimated Scale Parameters:
##
## Estimate Std.err
## (Intercept) 0.45 0.00842
## Warning in if (pmatch(x$corstr, "independence", 0) == 0) {: the condition has
## length > 1 and only the first element will be used
## Link = identity
##
## Estimated Correlation Parameters:
## Estimate Std.err
## alpha 0.814 0.021
## Number of clusters: 10179 Maximum cluster size: 3
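The output shown above reports alpha = 0.814, but it was produced with corstr = "ar(1)" (see the Call), which geepack did not match to a single structure: note the warnings and the line “Correlation structure = exchangeable ar1 unstructured userdefined fixed”. The coefficients, standard errors and alpha are identical to those of the exchangeable model, so this is not an AR(1) estimate. geepack's name for the structure is "ar1", and the model has to be refit with that value to obtain the AR(1) correlation.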
Since GEE’s aren’t fit via maximum likelihood, they aren’t comparable in terms of AIC or likelihood ratio tests. However, Pan (2001) describes an information criterion based on a quasi-likelihood formulation. This QIC can be used to compare models with alternative correlation structures, with the lowest QIC representing the best fitting model. Another criterion is the Correlation Information Criterion, or CIC, [(Hin and Wang, 2008)](https://onlinelibrary.wiley.com/doi/abs/10.1002/sim.3489), which is proposed to be better for choosing among models with the same mean function but different correlation structures, which is what we’re doing here.
library(MESS) #need to install
## Warning: package 'MESS' was built under R version 4.0.5
##
## Attaching package: 'MESS'
## The following object is masked from 'package:MuMIn':
##
## QIC
## The following object is masked from 'package:geepack':
##
## QIC
QIC(fit.1)
## QIC QICu Quasi Lik CIC params QICC
## 11541 11499 -5740 30 9 11541
QIC(fit.2)
## QIC QICu Quasi Lik CIC params QICC
## 12029.0 12016.2 -5999.1 15.4 9.0 12029.0
QIC(fit.3)
## QIC QICu Quasi Lik CIC params QICC
## 12029.0 12016.2 -5999.1 15.4 9.0 12029.0
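The values for fit.2 and fit.3 are identical, which is expected given that fit.3 reproduced the exchangeable fit, so these results cannot separate the AR(1) and exchangeable structures. Comparing independence with exchangeable, the CIC prefers the exchangeable structure (15.4 versus 30), while the QIC is lower for the independence model (11541 versus 12029). Since the CIC is the criterion proposed for choosing among correlation structures with the same mean function, the exchangeable structure is preferred here.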
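The columns printed by QIC() above are related by simple arithmetic, which helps when reading the table. The sketch below assumes QIC = -2 × quasi-likelihood + 2 × CIC and QICu = -2 × quasi-likelihood + 2 × (number of parameters); these relationships reproduce the printed values up to rounding, but they are inferred from the output here rather than taken from the package documentation.

```r
# Values copied from the QIC() output printed above
qic_tab <- data.frame(
  model    = c("independence", "exchangeable"),
  quasilik = c(-5740, -5999.1),
  cic      = c(30, 15.4),
  params   = c(9, 9)
)

# Assumed relationships between the printed columns
qic_tab$QIC  <- -2 * qic_tab$quasilik + 2 * qic_tab$cic
qic_tab$QICu <- -2 * qic_tab$quasilik + 2 * qic_tab$params
print(qic_tab)
```

Seen this way, the CIC acts as the penalty term of the QIC: the exchangeable model has the smaller penalty, while the independence model has the larger quasi-likelihood.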
Here we use the GEE for a binomial outcome.
Here are what the data look like:
binomial_smooth <- function(...) {
geom_smooth(method = "glm", method.args = list(family = "binomial"), ...)
}
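#note: this codes chhealth values 2 and 3 as 1 and all other values (including 4 and 5) as 0,
#whereas the glm() used for the residuals below defines the outcome as chhealth>2
e.long.comp$poorhealth<-Recode(e.long.comp$chhealth, recodes="2:3=1; else=0")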
ggplot(e.long.comp, aes(x=ageyrs, y=poorhealth))+
geom_point()+
binomial_smooth()+
ggtitle(label = "Change in poor health across age",
subtitle = "ECLS-K 2011 - All children")
## `geom_smooth()` using formula 'y ~ x'
ids<-unique(e.long.comp$childid)[1:10]
e.long.comp%>%
filter(childid %in% ids)%>%
ggplot( aes(x=ageyrs, y=poorhealth))+
geom_point()+ binomial_smooth()+
facet_wrap(~childid,nrow=3)+
ggtitle(label = "Change in poor health across age",
subtitle = "First 10 children in ECLS-K 2011 - Individual children")
## `geom_smooth()` using formula 'y ~ x'
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
btest<-glm(I(chhealth>2)~scale(ageyrs)+male+black+hisp+asian+nahn+other+hhses+factor(wave) , family=binomial, data=e.long.comp, weights=wt/mean(wt))
## Warning in eval(family$initialize): non-integer #successes in a binomial glm!
e.long.comp$residb<- residuals(btest)
e.res3<-e.long.comp%>%
select(childid, wave,residb)%>%
pivot_wider(id_cols=c(childid),
names_from = wave,
values_from=residb )
head(e.res3)
cor(e.res3[, -1], use = "pairwise")
## 1 2 4
## 1 1.000 0.397 0.297
## 2 0.397 1.000 0.333
## 4 0.297 0.333 1.000
These could be described by a constant (exchangeable) correlation, or perhaps by an AR(1): the correlation is lower between waves 1 and 4 (0.297) than between adjacent waves (0.397 and 0.333), but the differences are modest.
fitb.1<-geeglm(poorhealth~scale(ageyrs)+male+black+hisp+asian+nahn+other+hhses,
waves = wave,
id=childid ,
corstr ="independence",
family=binomial,
data=e.long.comp,
weights=wt/mean(wt))
## Warning in eval(family$initialize): non-integer #successes in a binomial glm!
summary(fitb.1)
##
## Call:
## geeglm(formula = poorhealth ~ scale(ageyrs) + male + black +
## hisp + asian + nahn + other + hhses, family = binomial, data = e.long.comp,
## weights = wt/mean(wt), id = childid, waves = wave, corstr = "independence")
##
## Coefficients:
## Estimate Std.err Wald Pr(>|W|)
## (Intercept) -0.6742 0.0347 378.20 < 2e-16 ***
## scale(ageyrs) 0.0110 0.0150 0.54 0.46441
## male 0.1687 0.0389 18.81 1.4e-05 ***
## black 0.2414 0.0658 13.48 0.00024 ***
## hisp 0.2694 0.0508 28.11 1.1e-07 ***
## asian 0.5571 0.0770 52.29 4.8e-13 ***
## nahn 0.4698 0.1775 7.01 0.00812 **
## other 0.0697 0.0974 0.51 0.47424
## hhses -0.3198 0.0235 185.65 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation structure = independence
## Estimated Scale Parameters:
##
## Estimate Std.err
## (Intercept) 1 0.00618
## Number of clusters: 10179 Maximum cluster size: 3
fitb.2<-geeglm(poorhealth~scale(ageyrs)+male+black+hisp+asian+nahn+other+hhses,
waves = wave,
id=childid ,
corstr ="exch",
family=binomial,
data=e.long.comp,
weights=wt/mean(wt))
## Warning in eval(family$initialize): non-integer #successes in a binomial glm!
summary(fitb.2)
##
## Call:
## geeglm(formula = poorhealth ~ scale(ageyrs) + male + black +
## hisp + asian + nahn + other + hhses, family = binomial, data = e.long.comp,
## weights = wt/mean(wt), id = childid, waves = wave, corstr = "exch")
##
## Coefficients:
## Estimate Std.err Wald Pr(>|W|)
## (Intercept) -0.67362 0.03436 384.34 < 2e-16 ***
## scale(ageyrs) 0.00178 0.01374 0.02 0.89716
## male 0.17266 0.03853 20.08 7.4e-06 ***
## black 0.23871 0.06511 13.44 0.00025 ***
## hisp 0.27063 0.05007 29.22 6.5e-08 ***
## asian 0.54350 0.07542 51.93 5.7e-13 ***
## nahn 0.48061 0.17362 7.66 0.00564 **
## other 0.05751 0.09592 0.36 0.54878
## hhses -0.29754 0.02308 166.14 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation structure = exchangeable
## Estimated Scale Parameters:
##
## Estimate Std.err
## (Intercept) 0.997 0.00586
## Link = identity
##
## Estimated Correlation Parameters:
## Estimate Std.err
## alpha 0.35 0.0147
## Number of clusters: 10179 Maximum cluster size: 3
fitb.3<-geeglm(poorhealth~scale(ageyrs)+male+black+hisp+asian+nahn+other+hhses,
waves = wave,
id=childid ,
corstr ="ar1", #geepack's name for AR(1); the string "ar(1)" is not a recognized value
family=binomial,
data=e.long.comp, weights=wt/mean(wt))
## Warning in eval(family$initialize): non-integer #successes in a binomial glm!
## Warning in if (corstrv == -1) stop("invalid corstr."): the condition has length
## > 1 and only the first element will be used
## Warning in if (corstrv == 5) stop("need zcor matrix for userdefined corstr.")
## else zcor <- genZcor(clusz, : the condition has length > 1 and only the first
## element will be used
## Warning in if (corstrv == 1) return(matrix(0, 0, 0)): the condition has length >
## 1 and only the first element will be used
## Warning in if (corstrv == 6) alpha <- 1 else alpha <- rep(0, q): the condition
## has length > 1 and only the first element will be used
summary(fitb.3)
##
## Call:
## geeglm(formula = poorhealth ~ scale(ageyrs) + male + black +
## hisp + asian + nahn + other + hhses, family = binomial, data = e.long.comp,
## weights = wt/mean(wt), id = childid, waves = wave, corstr = "ar(1)")
##
## Coefficients:
## Estimate Std.err Wald Pr(>|W|)
## (Intercept) -0.67362 0.03436 384.34 < 2e-16 ***
## scale(ageyrs) 0.00178 0.01374 0.02 0.89716
## male 0.17266 0.03853 20.08 7.4e-06 ***
## black 0.23871 0.06511 13.44 0.00025 ***
## hisp 0.27063 0.05007 29.22 6.5e-08 ***
## asian 0.54350 0.07542 51.93 5.7e-13 ***
## nahn 0.48061 0.17362 7.66 0.00564 **
## other 0.05751 0.09592 0.36 0.54878
## hhses -0.29754 0.02308 166.14 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation structure = exchangeable ar1 unstructured userdefined fixed
## Estimated Scale Parameters:
##
## Estimate Std.err
## (Intercept) 0.997 0.00586
## Warning in if (pmatch(x$corstr, "independence", 0) == 0) {: the condition has
## length > 1 and only the first element will be used
## Link = identity
##
## Estimated Correlation Parameters:
## Estimate Std.err
## alpha 0.35 0.0147
## Number of clusters: 10179 Maximum cluster size: 3
Compare the three models:
QIC(fitb.1)
## QIC QICu Quasi Lik CIC params QICC
## 35075.1 35052.1 -17517.0 20.5 9.0 35075.1
QIC(fitb.2)
## QIC QICu Quasi Lik CIC params QICC
## 35064.3 35055.9 -17518.9 13.2 9.0 35064.3
QIC(fitb.3)
## QIC QICu Quasi Lik CIC params QICC
## 35064.3 35055.9 -17518.9 13.2 9.0 35064.3
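In the binomial case, fitb.2 and fitb.3 give identical QIC and CIC values because the output for fitb.3 reproduces the exchangeable fit (the same coefficients and alpha = 0.35), so these results do not tell us how an AR(1) structure would compare. Between the independence and exchangeable structures, both the QIC (35064.3 versus 35075.1) and the CIC (13.2 versus 20.5) favor the exchangeable structure.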