This example will illustrate how to fit the discrete time hazard model to person-period. Specifically, this example illustrates various parameterizartions of time in the discrete time model. In this example, I will use the event of a child dying before age 5. The data for this example come from the model.data Demographic and Health Survey for 2012 children’s recode file. This file contains information for all births in the last 5 years prior to the survey.
#Load required libraries
library(foreign)
library(survival)
library(car)
library(survey)
model.dat<-read.dta("/Users/ozd504/Google Drive/dem7223/class2016/data/zzkr62dt/ZZKR62FL.DTA", convert.factors = F)
In the DHS, they record if a child is dead or alive and the age at death if the child is dead. This can be understood using a series of variables about each child.
If the child is alive at the time of interview, then the variable B5==1, and the age at death is censored.
If the age at death is censored, then the age at the date of interview (censored age at death) is the date of the interview - date of birth (in months).
If the child is dead at the time of interview,then the variable B5!=1, then the age at death in months is the variable B7. Here we code this:
#We form a subset of variables
sub<-data.frame(CASEID=model.dat$caseid,kidid=paste(model.dat$caseid, model.dat$bidx, sep="-"), v008=model.dat$v008,bord=model.dat$bidx,csex=model.dat$b4,b2=model.dat$b2, b3=model.dat$b3, b5=model.dat$b5, b7=model.dat$b7, ibint=model.dat$b11, rural=model.dat$v025, educ=model.dat$v106,age=model.dat$v012,partneredu=model.dat$v701,partnerage=model.dat$v730, hhses=model.dat$v190, weight=model.dat$v005/1000000, psu=model.dat$v021, strata=model.dat$v022)
sub$death.age<-ifelse(sub$b5==1,
((((sub$v008))+1900)-(((sub$b3))+1900))
,sub$b7)+.0001
#censoring indicator for death by age 5, in months (<=60 months)
sub$d.event<-ifelse(is.na(sub$b7)==T|sub$b7>60,0,1)
sub$d.eventfac<-factor(sub$d.event); levels(sub$d.eventfac)<-c("Alive at Age 5", "Dead by Age 5")
table(sub$d.eventfac)
##
## Alive at Age 5 Dead by Age 5
## 5335 633
#recodes
sub$male<-ifelse(sub$csex==1,1,0)
sub$educ.high<-ifelse(sub$educ %in% c(2,3), 1, 0)
sub$age2<-sub$age^2
sub$partnerhiedu<-ifelse(sub$partneredu<3,0,ifelse(sub$partneredu%in%c(8,9),NA,1 ))
sub$hises<-ifelse(sub$hhses>3, 1,0)
sub$rural<-ifelse(sub$rural==2,1,0 )
The distinction between the way we have been doing things and the discrete time model, is that we treat time discretely, versus continuously. This means that we transform the data from the case-duration data format to the person-period format. For this example, a natural choice would be year_age, since we have 5 intervals of equal length (12 months each). R provides a useful function called survSplit() in the survival library that will split a continuous duration into discrete periods.
#make person period file
pp<-survSplit(Surv(death.age, d.event)~. , data = sub, cut=seq(0,60,12), episode="year_age")
pp<-pp[order(pp$kidid, pp$year_age),]
head(pp[, c("kidid", "death.age", "d.event", "year_age", "male", "hises")], n=20)
## kidid death.age d.event year_age male hises
## 1 1 1 2-1 5.0001 0 2 0 0
## 2 1 1 2-2 12.0000 0 2 1 0
## 3 1 1 2-2 24.0000 0 3 1 0
## 4 1 1 2-2 36.0000 0 4 1 0
## 5 1 1 2-2 37.0001 0 5 1 0
## 6 1 3 2-1 12.0000 0 2 0 0
## 7 1 3 2-1 24.0000 0 3 0 0
## 8 1 3 2-1 30.0001 0 4 0 0
## 9 1 4 3-1 10.0001 0 2 0 0
## 10 1 4 3-2 12.0000 0 2 1 0
## 11 1 4 3-2 24.0000 0 3 1 0
## 12 1 4 3-2 30.0001 0 4 1 0
## 13 1 5 1-1 0.0001 1 2 1 0
## 14 1 5 1-2 12.0000 0 2 0 0
## 15 1 5 1-2 24.0000 0 3 0 0
## 16 1 5 1-2 34.0001 0 4 0 0
## 17 1 15 1-1 1.0001 1 2 0 0
## 18 1 15 1-2 12.0000 0 2 1 0
## 19 1 15 1-2 18.0001 0 3 1 0
## 20 1 15 1-3 3.0001 1 2 1 0
We see that each child is not in the data for multiple “risk periods”, until they experience the event (death) or age out of the risk set.
So, the best thing about the discrete time model, is that it’s just logistic regression. Each risk period is treated as a single Bernoulli trial, and the child can either fail (y=1) or not (y=0) in the period. This is how we get the hazard of the event, as the estimated probability of failure in each discrete time period. So, any method you would like to use to model this probability would probably work (logit, probit models), but I will show two standard approaches. First, we will use the traditional logit link to the binomial distribution, then we will use the complementary log-log link. The latter is used because it preserves the proportional hazards property of the model, as in the Cox model.
#generate survey design
options(survey.lonely.psu = "adjust")
des<-svydesign(ids=~psu, strata = ~strata , weights=~weight, data=pp)
#Fit the basic logistic model with ONLY time in the model
#I do -1 so that no intercept is fit in the model, and we get a hazard estimate for each time period
fit.0<-svyglm(d.event~as.factor(year_age)-1,design=des , family="binomial")
## Warning: non-integer #successes in a binomial glm!
summary(fit.0)
##
## Call:
## svyglm(formula = d.event ~ as.factor(year_age) - 1, design = des,
## family = "binomial")
##
## Survey design:
## svydesign(ids = ~psu, strata = ~strata, weights = ~weight, data = pp)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## as.factor(year_age)2 -2.41351 0.05901 -40.90 <2e-16 ***
## as.factor(year_age)3 -3.72379 0.13236 -28.13 <2e-16 ***
## as.factor(year_age)4 -4.28401 0.19830 -21.60 <2e-16 ***
## as.factor(year_age)5 -5.16879 0.33117 -15.61 <2e-16 ***
## as.factor(year_age)6 -4.63002 0.34620 -13.37 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1.00006)
##
## Number of Fisher Scoring iterations: 7
#Plot the hazard function on the probability scale
haz<-1/(1+exp(-coef(fit.0)))
time<-seq(1,5,1)
plot(haz~time, type="l", ylab="h(t)")
title(main="Discrete Time Hazard Function for Child Mortality")
Here, we can specify how we want time in our model. The general model fits a hazard at every time point. If you have a lot of time points, and especially if you have low numbers of events at some time points, this can be computationally expensive.
We can, however, specify time as a linear or quadratic term, and the model will not fit separate hazards at all times, instead, the baseline hazard will be a linear or curvilinear function of time.
#Linear term for time
fit.l<-svyglm(d.event~year_age,design=des , family="binomial")
## Warning: non-integer #successes in a binomial glm!
summary(fit.l)
##
## Call:
## svyglm(formula = d.event ~ year_age, design = des, family = "binomial")
##
## Survey design:
## svydesign(ids = ~psu, strata = ~strata, weights = ~weight, data = pp)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.6168 0.1908 -3.233 0.00145 **
## year_age -0.9285 0.0779 -11.920 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1.194649)
##
## Number of Fisher Scoring iterations: 7
Which shows the hazard decreases over time, which makes a lot of sense in the context of this outcome. Now we can consider quadratic terms for time and a natural spline function of time as well:
fit.s<-svyglm(d.event~year_age+I(year_age^2),design=des , family="binomial")
## Warning: non-integer #successes in a binomial glm!
summary(fit.s)
##
## Call:
## svyglm(formula = d.event ~ year_age + I(year_age^2), design = des,
## family = "binomial")
##
## Survey design:
## svydesign(ids = ~psu, strata = ~strata, weights = ~weight, data = pp)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.43527 0.52940 2.711 0.00733 **
## year_age -2.37351 0.34549 -6.870 9.14e-11 ***
## I(year_age^2) 0.22378 0.04858 4.607 7.53e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1.000276)
##
## Number of Fisher Scoring iterations: 7
fit.c<-svyglm(d.event~year_age+I(year_age^2)+I(year_age^3 ),design=des , family="binomial")
## Warning: non-integer #successes in a binomial glm!
summary(fit.c)
##
## Call:
## svyglm(formula = d.event ~ year_age + I(year_age^2) + I(year_age^3),
## design = des, family = "binomial")
##
## Survey design:
## svydesign(ids = ~psu, strata = ~strata, weights = ~weight, data = pp)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.594296 2.264664 0.704 0.482
## year_age -2.526323 2.142680 -1.179 0.240
## I(year_age^2) 0.268623 0.621515 0.432 0.666
## I(year_age^3) -0.004039 0.055857 -0.072 0.942
##
## (Dispersion parameter for binomial family taken to be 0.9997974)
##
## Number of Fisher Scoring iterations: 7
fit.q<-svyglm(d.event~year_age+I(year_age^2)+I(year_age^3 )+I(year_age^4),design=des , family="binomial")
## Warning: non-integer #successes in a binomial glm!
summary(fit.q)
##
## Call:
## svyglm(formula = d.event ~ year_age + I(year_age^2) + I(year_age^3) +
## I(year_age^4), design = des, family = "binomial")
##
## Survey design:
## svydesign(ids = ~psu, strata = ~strata, weights = ~weight, data = pp)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 20.86941 12.91257 1.616 0.1077
## year_age -25.95467 15.56394 -1.668 0.0971 .
## I(year_age^2) 10.33754 6.63976 1.557 0.1212
## I(year_age^3) -1.82570 1.19460 -1.528 0.1281
## I(year_age^4) 0.11761 0.07684 1.531 0.1276
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1.00006)
##
## Number of Fisher Scoring iterations: 7
#Spline
library(splines)
fit.sp<-svyglm(d.event~ns(year_age),design=des , family="binomial")
## Warning: non-integer #successes in a binomial glm!
summary(fit.sp)
##
## Call:
## svyglm(formula = d.event ~ ns(year_age), design = des, family = "binomial")
##
## Survey design:
## svydesign(ids = ~psu, strata = ~strata, weights = ~weight, data = pp)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.47382 0.05836 -42.39 <2e-16 ***
## ns(year_age) -4.63228 0.38861 -11.92 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1.194649)
##
## Number of Fisher Scoring iterations: 7
Now, let’s look at the hazards:
dat<-expand.grid(year_age=seq(2,6,1))
dat$genmod<-predict(fit.0, newdata=data.frame(year_age=as.factor(2:6 )), type="response")
dat$lin<-predict(fit.l, newdata=dat, type="response")
dat$sq<-predict(fit.s, newdata=dat, type="response")
dat$cub<-predict(fit.c, newdata=dat, type="response")
dat$quart<-predict(fit.q, newdata=dat, type="response")
dat$spline<-predict(fit.sp, newdata=dat, type="response")
dat
## year_age genmod lin sq cub quart
## 1 2 0.082148569 0.077713924 0.081912609 0.081947902 0.082148569
## 2 3 0.023573110 0.032222344 0.024812934 0.024695701 0.023573110
## 3 4 0.013599720 0.012985345 0.011224855 0.011300687 0.013599720
## 4 5 0.005659243 0.005171618 0.007861727 0.007949827 0.005659243
## 5 6 0.009660337 0.002049913 0.008578986 0.008445461 0.009660337
## spline
## 1 0.077713924
## 2 0.032222344
## 3 0.012985345
## 4 0.005171618
## 5 0.002049913
plot(genmod~year_age, dat, type="l", ylab="h(t)", xlab="Time", ylim=c(0,.1))
title(main="Hazard function from different time parameterizations")
lines(lin~year_age, dat, col=2, lwd=2)
lines(sq~year_age, dat, col=3, lwd=2)
lines(cub~year_age, dat, col=4, lwd=2)
lines(quart~year_age, dat, col=5, lwd=2)
lines(spline~year_age, dat, col=6, lwd=2)
legend("topright", legend=c("General Model", "Linear", "Square", "Cubic", "Quartic", "Natural spline"), col=1:6, lwd=1.5)
#AIC table
aic<-round(c(
fit.l$deviance+2*length(fit.l$coefficients),
fit.s$deviance+2*length(fit.s$coefficients),
fit.c$deviance+2*length(fit.c$coefficients),
fit.q$deviance+2*length(fit.q$coefficients),
fit.sp$deviance+2*length(fit.sp$coefficients),
fit.0$deviance+2*length(fit.0$coefficients)),2)
#compare all aics to the one from the general model
dif.aic<-round(aic-aic[6],2)
data.frame(model =c( "linear", "square", "cubic", "quartic","spline", "general"), aic=aic, aic_dif=dif.aic)
## model aic aic_dif
## 1 linear 5119.29 21.86
## 2 square 5096.79 -0.64
## 3 cubic 5098.78 1.35
## 4 quartic 5097.43 0.00
## 5 spline 5119.29 21.86
## 6 general 5097.43 0.00
So the general model and the quartic give the same AIC points, but the square and cubic models also fit equally as well.