This example will illustrate how to fit the discrete time hazard model with group-level frailty to continuous duration data (i.e. person-level data) and a discrete-time (longitudinal) data set. In this example, I will use the event of a child dying before age 5 in Haiti. The data for this example come from the Haitian Demographic and Health Survey for 2012 birth recode file. This file contains information for all live births to women sampled in the survey.

The longitudinal data example uses data from the ECLS-K. Specifically, we will examine the transition into poverty between kindergarten and 8th grade.

#Load required libraries
library(foreign)
library(survival)
library(car)
library(survey)
library(lme4)
library(arm)
#load the data
haiti<-read.dta("/Users/ozd504/Google Drive/dem7223/data/HTBR61FL.DTA", convert.factors = F)
#We form a subset of variables
sub<-data.frame(CASEID=haiti$caseid,kidid=paste(haiti$caseid, haiti$bidx, sep="-"),v008=haiti$v008,bord=haiti$bidx,csex=haiti$b4,b2=haiti$b2, b3=haiti$b3, b5=haiti$b5, b7=haiti$b7, ibint=haiti$b11, rural=haiti$v025, educ=haiti$v106,age=haiti$v012,partneredu=haiti$v701,partnerage=haiti$v730, hhses=haiti$v190, weight=haiti$v005/1000000, psu=haiti$v021, strata=haiti$v022, region=haiti$v023)

sub$death.age<-ifelse(sub$b5==1,
                          ((((sub$v008))+1900)-(((sub$b3))+1900)) 
                          ,sub$b7)

#censoring indicator for death by age 5, in months (<=60 months)
sub$d.event<-ifelse(is.na(sub$b7)==T|sub$b7>60,0,1)
sub$d.eventfac<-factor(sub$d.event); levels(sub$d.eventfac)<-c("Alive at Age 5", "Dead by Age 5")
table(sub$d.eventfac)
## 
## Alive at Age 5  Dead by Age 5 
##          25981           3032
#recodes
sub$male<-ifelse(sub$csex==1,1,0)
sub$educ.high<-ifelse(sub$educ %in% c(2,3), 1, 0)
sub$age2<-sub$age^2
sub$partnerhiedu<-ifelse(sub$partneredu<3,0,ifelse(sub$partneredu%in%c(8,9),NA,1 ))
sub$hises<-ifelse(sub$hhses>3, 1,0)

Create the person-period file

The distinction between the way we have been doing things and the discrete time model, is that we treat time discretely, versus continuously. This means that we transform the data from the case-duration data format to the person-period format. For this example, a natural choice would be year, since we have 5 intervals of equal length (12 months each). R provides a useful function called survSplit() in the survival library that will split a continuous duration into discrete periods.

#make person period file
pp<-survSplit(sub, cut=seq(0,60,12), start="start", end="death.age", event="d.event", episode="year")
pp<-pp[order(pp$kidid, pp$year),]
head(pp[, c("kidid", "death.age", "d.event", "start", "year", "male", "hises")], n=20)
##                    kidid death.age d.event start year male hises
## 29014           1 2  4-1        12       0     0    1    0     1
## 58027           1 2  4-1        24       0    12    2    0     1
## 87040           1 2  4-1        36       0    24    3    0     1
## 116053          1 2  4-1        48       0    36    4    0     1
## 145066          1 2  4-1        60       0    48    5    0     1
## 174079          1 2  4-1       102       0    60    6    0     1
## 29015           1 2  4-2        12       0     0    1    1     1
## 58028           1 2  4-2        24       0    12    2    1     1
## 87041           1 2  4-2        36       0    24    3    1     1
## 116054          1 2  4-2        48       0    36    4    1     1
## 145067          1 2  4-2        60       0    48    5    1     1
## 174080          1 2  4-2       132       0    60    6    1     1
## 29016           1 2  4-3        12       0     0    1    1     1
## 58029           1 2  4-3        24       0    12    2    1     1
## 87042           1 2  4-3        36       0    24    3    1     1
## 116055          1 2  4-3        48       0    36    4    1     1
## 145068          1 2  4-3        60       0    48    5    1     1
## 174081          1 2  4-3       188       0    60    6    1     1
## 29017           1 3  1-1        12       0     0    1    0     1
## 58030           1 3  1-1        24       0    12    2    0     1
library(maptools)
## Loading required package: sp
## Checking rgeos availability: TRUE
htshp<-readShapePoly("~/Google Drive/dem7223/data/sdr_subnational_data_2015-03-18/shps/sdr_subnational_data_dhs_2012.shp")
plot(htshp)
text(getSpPPolygonsLabptSlots(htshp), labels=as.character(htshp$DHSREGFR), cex=0.6)
## Warning: use coordinates method

#Since the region variable in the shapefile doesn't include two of the areas, I need to make them an ID number:
htshp$reg_merge<-ifelse(htshp$DHSREGFR=="Aire Métropolitaine", 1, ifelse(htshp$DHSREGFR=="Reste-Ouest",2, htshp$REGCODE+1))
htshp$reg_merge
##  [1]  3  4  5  6  7  8  9 10 11  2  1
htshp$z_net<-scale(htshp$iC6D4E13)
htshp$z_edu<-scale(htshp$iA999FA)
pp2<-merge(pp, htshp@data, by.x="region", by.y="reg_merge")

We see that each child is not in the data for multiple “risk periods”, until they experience the event (death) or age out of the risk set (year 6 in this case).

Discrete time model with shared frailty

For the discrete time model, if the logit link is used, then we are effectively fitting a multilevel model for our outcome. The model with only a group frailty component will have the exact same form as the multilevel logit model with a random intecept at the group level: Fit the basic random intercept model : \(logit\left ( \frac{\pi_{ij}}{1-\pi_{ij}} \right ) = \beta_{0j} +x'\beta +Z\gamma'\)

with

\(\beta_{0j} = \beta_0 + Z\gamma'+ u_j\)

and

\(u_j\sim N(0, \sigma^2)\)

Where the intercepts (\(u_j\)) for each group vary randomly around the overall mean (\(\beta_0\)).
The individual level predictors are incorporated into \(x\), while the group level predictors (if any are measured) are included in \(Z\). If only a random intercept is specified, then \(Z\) is a vector of 1’s.

Shared frailty at the regional level

Here, we use the DHS to fit a few different multilevel models. Our outcome is child mortality before age 5. We consider the general random intecept model, a model with a higher level predictor, and a cross-level interaction model, where an individual level predictor interacts with a higher level predictor. Our higher level data come from the Haitian shapefile we got from the DHS, which contains estimates of numerous population characteristics at the same regional level as the v023 variable provides in the DHS recode files.

#how many kids in each region?
table(sub$region)
## 
##    1    2    3    4    5    6    7    8    9   10   11   12 
## 3282 2399 2091 2532 2374 2398 2856 2196 2293 2288 1926 2378
#how many total regions?
length(table(sub$region))
## [1] 12
#generate survey design
des<-svydesign(ids=~psu, strata = ~strata , weights=~weight, data=pp)

#now we fit the regular discrete time model without frailty, purely for comparison
fit.1<-svyglm(d.event~year+bord+I(hhses>3)+I(educ>=2)+male+rural,design=des , family="binomial")
## Warning: non-integer #successes in a binomial glm!
display(fit.1, detail=T)
## svyglm(formula = d.event ~ year + bord + I(hhses > 3) + I(educ >= 
##     2) + male + rural, design = des, family = "binomial")
## 
## Stratified 1 - level Cluster Sampling design (with replacement)
## With (445) clusters.
## svydesign(ids = ~psu, strata = ~strata, weights = ~weight, data = pp)
## 
##                  coef.est coef.se t value Pr(>|t|)
## (Intercept)       -2.27     0.21  -10.68    0.00  
## year              -0.87     0.03  -27.39    0.00  
## bord               0.18     0.01   13.30    0.00  
## I(hhses > 3)TRUE  -0.16     0.10   -1.63    0.10  
## I(educ >= 2)TRUE  -0.44     0.11   -3.92    0.00  
## male               0.15     0.05    3.04    0.00  
## rural             -0.27     0.11   -2.53    0.01  
## ---
##   n = 418, k = 7
##   residual deviance = 19470.6, null deviance = 22501.4 (difference = 3030.8)
##   overdispersion parameter = 1.0
#basic random intercept model
fit.region<-glmer(d.event~year+bord+I(hhses>3)+I(educ>=2)+male+rural+(1|region),pp2, family="binomial", weights = weight)
## Warning: non-integer #successes in a binomial glm!
display(fit.region, detail=T)
## glmer(formula = d.event ~ year + bord + I(hhses > 3) + I(educ >= 
##     2) + male + rural + (1 | region), data = pp2, family = "binomial", 
##     weights = weight)
##                  coef.est coef.se z value Pr(>|z|)
## (Intercept)       -2.63     0.19  -14.19    0.00  
## year              -0.86     0.02  -36.19    0.00  
## bord               0.19     0.01   18.83    0.00  
## I(hhses > 3)TRUE  -0.23     0.08   -2.96    0.00  
## I(educ >= 2)TRUE  -0.44     0.08   -5.73    0.00  
## male               0.17     0.05    3.47    0.00  
## rural             -0.12     0.09   -1.35    0.18  
## 
## Error terms:
##  Groups   Name        Std.Dev.
##  region   (Intercept) 0.20    
##  Residual             1.00    
## ---
## number of obs: 127154, groups: region, 11
## AIC = 16372.3, DIC = 16356.3
## deviance = 16356.3
#plot the region-level predictor, z-score of % of Women with secondary or higher education
spplot(htshp, c("z_net", "z_edu"))

#model with a higher level predictor - z-score of % of Women with secondary or higher education
fit.region2<-glmer(d.event~year+bord+I(hhses>3)+I(educ>=2)+male+rural+z_edu+(1|region),pp2, family="binomial", weights = weight)
## Warning: non-integer #successes in a binomial glm!
display(fit.region2, detail=T)
## glmer(formula = d.event ~ year + bord + I(hhses > 3) + I(educ >= 
##     2) + male + rural + z_edu + (1 | region), data = pp2, family = "binomial", 
##     weights = weight)
##                  coef.est coef.se z value Pr(>|z|)
## (Intercept)       -2.65     0.18  -14.57    0.00  
## year              -0.86     0.02  -36.18    0.00  
## bord               0.19     0.01   18.86    0.00  
## I(hhses > 3)TRUE  -0.23     0.08   -3.06    0.00  
## I(educ >= 2)TRUE  -0.44     0.08   -5.74    0.00  
## male               0.17     0.05    3.47    0.00  
## rural             -0.10     0.09   -1.17    0.24  
## z_edu              0.10     0.06    1.73    0.08  
## 
## Error terms:
##  Groups   Name        Std.Dev.
##  region   (Intercept) 0.17    
##  Residual             1.00    
## ---
## number of obs: 127154, groups: region, 11
## AIC = 16371.6, DIC = 16353.6
## deviance = 16353.6
#relgrad <- with(fit.region@optinfo$derivs,solve(Hessian,gradient))
#max(abs(relgrad))
anova(fit.region, fit.region2)
## Data: pp2
## Models:
## fit.region: d.event ~ year + bord + I(hhses > 3) + I(educ >= 2) + male + 
## fit.region:     rural + (1 | region)
## fit.region2: d.event ~ year + bord + I(hhses > 3) + I(educ >= 2) + male + 
## fit.region2:     rural + z_edu + (1 | region)
##             Df   AIC   BIC  logLik deviance  Chisq Chi Df Pr(>Chisq)
## fit.region   8 16372 16450 -8178.1    16356                         
## fit.region2  9 16372 16459 -8176.8    16354 2.6157      1     0.1058
#cross level interaction model, interaction between mom's education and regional women's education
fit.region3<-glmer(d.event~year+bord+I(hhses>3)+I(educ>=2)+male+rural+z_edu*I(educ>=2)+(1|region),pp2, family="binomial")
display(fit.region3, detail=T)
## glmer(formula = d.event ~ year + bord + I(hhses > 3) + I(educ >= 
##     2) + male + rural + z_edu * I(educ >= 2) + (1 | region), 
##     data = pp2, family = "binomial")
##                        coef.est coef.se z value Pr(>|z|)
## (Intercept)             -2.58     0.17  -15.15    0.00  
## year                    -0.82     0.02  -36.12    0.00  
## bord                     0.18     0.01   18.63    0.00  
## I(hhses > 3)TRUE        -0.29     0.08   -3.41    0.00  
## I(educ >= 2)TRUE        -0.37     0.09   -4.22    0.00  
## male                     0.16     0.05    3.38    0.00  
## rural                   -0.18     0.08   -2.26    0.02  
## z_edu                    0.08     0.06    1.21    0.23  
## I(educ >= 2)TRUE:z_edu  -0.04     0.06   -0.64    0.52  
## 
## Error terms:
##  Groups   Name        Std.Dev.
##  region   (Intercept) 0.18    
##  Residual             1.00    
## ---
## number of obs: 127154, groups: region, 11
## AIC = 17123, DIC = 17103
## deviance = 17103.0
anova(fit.region2, fit.region3)
## Data: pp2
## Models:
## fit.region2: d.event ~ year + bord + I(hhses > 3) + I(educ >= 2) + male + 
## fit.region2:     rural + z_edu + (1 | region)
## fit.region3: d.event ~ year + bord + I(hhses > 3) + I(educ >= 2) + male + 
## fit.region3:     rural + z_edu * I(educ >= 2) + (1 | region)
##             Df   AIC   BIC  logLik deviance Chisq Chi Df Pr(>Chisq)
## fit.region2  9 16372 16459 -8176.8    16354                        
## fit.region3 10 17123 17221 -8551.5    17103     0      1          1

The only model that shows anything is the random intercept model, the model with the higher level predictor shows a marginally significant (p=.08) effect of education level of women at the regional level on under 5 mortality, and the cross level interaction model shows nothing in terms of the interaction between maternal education at the individual level and region-level women’s education.

Using Longitudinal Data

As in the other examples, I illustrate fitting these models to data that are longitudinal, instead of person-duration. In this example, we will examine how to fit the discrete time hazard model with frailty to a longitudinally collected data set.

First we load our data and do some recodes and variable construction

load("~/Google Drive/dem7903_App_Hier/data/eclsk.Rdata")
names(eclsk)<-tolower(names(eclsk))
#get out only the variables I'm going to use for this example
myvars<-c( "childid","gender", "race", "r1_kage","r4age", "r5age", "r6age", "r7age","c1r4mtsc", "c4r4mtsc", "c5r4mtsc", "c6r4mtsc", "c7r4mtsc", "w1povrty","w1povrty","w3povrty", "w5povrty", "w8povrty","wkmomed", "s2_id", "c1_5fp0", "c15fpstr", "c15fppsu")
eclsk<-eclsk[,myvars]


eclsk$age1<-ifelse(eclsk$r1_kage==-9, NA, eclsk$r1_kage/12)
eclsk$age2<-ifelse(eclsk$r4age==-9, NA, eclsk$r4age/12)
#for the later waves, the NCES group the ages into ranges of months, so 1= <105 months, 2=105 to 108 months. So, I fix the age at the midpoint of the interval they give, and make it into years by dividing by 12
eclsk$age3<-recode(eclsk$r5age,recodes="1=105; 2=107; 3=109; 4=112; 5=115; 6=117; -9=NA")/12

eclsk$pov1<-ifelse(eclsk$w1povrty==1,1,0)
eclsk$pov2<-ifelse(eclsk$w3povrty==1,1,0)
eclsk$pov3<-ifelse(eclsk$w5povrty==1,1,0)

#Recode race with white, non Hispanic as reference using dummy vars
eclsk$hisp<-recode (eclsk$race, recodes="3:4=1;-9=NA; else=0")
eclsk$black<-recode (eclsk$race, recodes="2=1;-9=NA; else=0")
eclsk$asian<-recode (eclsk$race, recodes="5=1;-9=NA; else=0")
eclsk$nahn<-recode (eclsk$race, recodes="6:7=1;-9=NA; else=0")
eclsk$other<-recode (eclsk$race, recodes="8=1;-9=NA; else=0")
eclsk$male<-recode(eclsk$gender, recodes="1=1; 2=0; -9=NA")
eclsk$mlths<-recode(eclsk$wkmomed, recodes = "1:2=1; 3:9=0; else = NA")
eclsk$mgths<-recode(eclsk$wkmomed, recodes = "1:3=0; 4:9=1; else =NA") 

Now, I need to form the transition variable, this is my event variable, and in this case it will be 1 if a child enters poverty between the first wave of the data and the third grade wave, and 0 otherwise. NOTE I need to remove any children who are already in poverty age wave 1, because they are not at risk of experiencing this particular transition.

eclsk<-subset(eclsk, is.na(pov1)==F&is.na(pov2)==F&is.na(pov3)==F&is.na(age1)==F&is.na(age2)==F&is.na(age3)==F&pov1!=1&is.na(eclsk$c15fpstr)==F)
eclsk$povtran1<-ifelse(eclsk$pov1==0&eclsk$pov2==0, 0,1)
eclsk$povtran2<-ifelse(eclsk$povtran1==1, NA,ifelse(eclsk$pov2==0&eclsk$pov3==0,0,1))

We can rescale the survey weights for use in multilevel models follwing the presentation in Carle, 2009.

#make an id that is the combination of strata and psu
eclsk$sampleid<-paste(eclsk$c15fpstr, eclsk$c15fppsu)
#within each sampling unit, sum the weights
wts<-tapply(eclsk$c1_5fp0,eclsk$sampleid,sum)
#make a data frame from this
wts<-data.frame(id=names(unlist(wts)), wt=unlist(wts))
#get the unique sampling location ids'
t1<-as.data.frame(table(eclsk$sampleid))
#put all of this into a data set
wts2<-data.frame(ids=wts$id, sumwt=wts$wt, jn=t1$Freq)
#merge all of this back to the original data file
eclsk2<-merge(eclsk, wts2, by.x="sampleid", by.y="ids", all.x=T)
#In the new data set, multiply the original weight by the fraction of the
#sampling unit total population each person represents
eclsk2$swts<-eclsk2$c1_5fp0*(eclsk2$jn/eclsk2$sumwt)

Now we do the entire data set. To analyze data longitudinally, we need to reshape the data from the current “wide” format (repeated measures in columns) to a “long” format (repeated observations in rows). The reshape() function allows us to do this easily. It allows us to specify our repeated measures, time varying covariates as well as time-constant covariates.

e.long<-reshape(eclsk2, idvar="childid", varying=list(age=c("age1","age2"), age2=c("age2", "age3"), povtran=c("povtran1", "povtran2")), times=1:2, direction="long" , drop = names(eclsk)[4:19])
e.long<-e.long[order(e.long$childid, e.long$time),]

#find which kids failed in the first time period and remove them from the second risk period risk set
failed1<-which(is.na(e.long$povtran1)==T)
e.long<-e.long[-failed1,]
e.long$age1r<-round(e.long$age1, 0)
e.long$age2r<-round(e.long$age2, 0)
head(e.long, n=10)
##            sampleid  childid gender race s2_id c1_5fp0 c15fpstr c15fppsu
## 0001002C.1     63 1 0001002C      2    1  0001  352.35       63        1
## 0001002C.2     63 1 0001002C      2    1  0001  352.35       63        1
## 0001007C.1     63 1 0001007C      1    1  0001  358.99       63        1
## 0001007C.2     63 1 0001007C      1    1  0001  358.99       63        1
## 0001010C.1     63 1 0001010C      2    1  0001  352.35       63        1
## 0001010C.2     63 1 0001010C      2    1  0001  352.35       63        1
## 0002004C.1     63 1 0002004C      2    1  0002  228.05       63        1
## 0002004C.2     63 1 0002004C      2    1  0002  228.05       63        1
## 0002006C.1     63 1 0002006C      2    1  0002  287.25       63        1
## 0002008C.1     63 1 0002008C      2    1  0002  259.19       63        1
##            pov1 pov2 pov3 hisp black asian nahn other male mlths mgths
## 0001002C.1    0    0    0    0     0     0    0     0    0     0     0
## 0001002C.2    0    0    0    0     0     0    0     0    0     0     0
## 0001007C.1    0    0    0    0     0     0    0     0    1     0     1
## 0001007C.2    0    0    0    0     0     0    0     0    1     0     1
## 0001010C.1    0    0    0    0     0     0    0     0    0     0     1
## 0001010C.2    0    0    0    0     0     0    0     0    0     0     1
## 0002004C.1    0    0    0    0     0     0    0     0    0     0     1
## 0002004C.2    0    0    0    0     0     0    0     0    0     0     1
## 0002006C.1    0    1    1    0     0     0    0     0    0    NA    NA
## 0002008C.1    0    0    0    0     0     0    0     0    0     0     1
##               sumwt jn      swts time     age1     age2 povtran1 age1r
## 0001002C.1 21844.73 72 1.1613419    1 6.433333 7.894167        0     6
## 0001002C.2 21844.73 72 1.1613419    2 7.894167 9.750000        0     8
## 0001007C.1 21844.73 72 1.1832273    1 5.300000 6.825000        0     5
## 0001007C.2 21844.73 72 1.1832273    2 6.825000 8.916667        0     7
## 0001010C.1 21844.73 72 1.1613419    1 5.885833 7.385833        0     6
## 0001010C.2 21844.73 72 1.1613419    2 7.385833 9.333333        0     7
## 0002004C.1 21844.73 72 0.7516504    1 5.450000 6.977500        0     5
## 0002004C.2 21844.73 72 0.7516504    2 6.977500 9.083333        0     7
## 0002006C.1 21844.73 72 0.9467730    1 5.158333 6.685833        1     5
## 0002008C.1 21844.73 72 0.8542875    1 5.905833 7.433333        0     6
##            age2r
## 0001002C.1     8
## 0001002C.2    10
## 0001007C.1     7
## 0001007C.2     9
## 0001010C.1     7
## 0001010C.2     9
## 0002004C.1     7
## 0002004C.2     9
## 0002006C.1     7
## 0002008C.1     7

Now we fit the discrete time model and doing fraily by the school identifier. I use the weights calculated above, standardized to the within cluster sample size.

#Fit the model
fitl1<-glmer(povtran1~time+mlths+mgths+black+hisp+other+nahn+(1|s2_id),family=binomial,weights=swts, e.long)
## Warning: non-integer #successes in a binomial glm!
## Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl =
## control$checkConv, : Model failed to converge with max|grad| = 0.00240573
## (tol = 0.001, component 4)
display(fitl1, detail=T) 
## glmer(formula = povtran1 ~ time + mlths + mgths + black + hisp + 
##     other + nahn + (1 | s2_id), data = e.long, family = binomial, 
##     weights = swts)
##             coef.est coef.se z value Pr(>|z|)
## (Intercept)  -2.78     0.17  -16.63    0.00  
## time         -0.36     0.09   -4.13    0.00  
## mlths         0.95     0.13    7.19    0.00  
## mgths        -0.88     0.10   -8.64    0.00  
## black         1.43     0.15    9.55    0.00  
## hisp          0.89     0.13    7.03    0.00  
## other         0.68     0.29    2.34    0.02  
## nahn          1.35     0.24    5.56    0.00  
## 
## Error terms:
##  Groups   Name        Std.Dev.
##  s2_id    (Intercept) 1.04    
##  Residual             1.00    
## ---
## number of obs: 12976, groups: s2_id, 898
## AIC = 4640.1, DIC = 4622.1
## deviance = 4622.1