This example will illustrate how to fit the discrete time hazard model to person-period. Specifically, this example illustrates various parameterizartions of time in the discrete time model. In this example, I will use the event of a child dying before age 5 in Haiti. The data for this example come from the Haitian Demographic and Health Survey for 2012 birth recode file. This file contains information for all live births to women sampled in the survey.

#Load required libraries
library(foreign)
library(survival)
## Loading required package: splines
library(car)
library(survey)
## Loading required package: grid
## 
## Attaching package: 'survey'
## 
## The following object is masked from 'package:graphics':
## 
##     dotchart
#load the data
haiti<-read.dta("/Users/ozd504/Google Drive/dem7223/data/HTBR61FL.DTA", convert.factors = F)
#We form a subset of variables
sub<-data.frame(CASEID=haiti$caseid,kidid=paste(haiti$caseid, haiti$bidx, sep="-"), v008=haiti$v008,bord=haiti$bidx,csex=haiti$b4,b2=haiti$b2, b3=haiti$b3, b5=haiti$b5, b7=haiti$b7, ibint=haiti$b11, rural=haiti$v025, educ=haiti$v106,age=haiti$v012,partneredu=haiti$v701,partnerage=haiti$v730, hhses=haiti$v190, weight=haiti$v005/1000000, psu=haiti$v021, strata=haiti$v022)

sub$death.age<-ifelse(sub$b5==1,
                          ((((sub$v008))+1900)-(((sub$b3))+1900)) 
                          ,sub$b7)

#censoring indicator for death by age 5, in months (<=60 months)
sub$d.event<-ifelse(is.na(sub$b7)==T|sub$b7>60,0,1)
sub$d.eventfac<-factor(sub$d.event); levels(sub$d.eventfac)<-c("Alive at Age 5", "Dead by Age 5")
table(sub$d.eventfac)
## 
## Alive at Age 5  Dead by Age 5 
##          25981           3032
#recodes
sub$male<-ifelse(sub$csex==1,1,0)
sub$educ.high<-ifelse(sub$educ %in% c(2,3), 1, 0)
sub$age2<-sub$age^2
sub$partnerhiedu<-ifelse(sub$partneredu<3,0,ifelse(sub$partneredu%in%c(8,9),NA,1 ))
sub$hises<-ifelse(sub$hhses>3, 1,0)

Create the person-period file

The distinction between the way we have been doing things and the discrete time model, is that we treat time discretely, versus continuously. This means that we transform the data from the case-duration data format to the person-period format. For this example, a natural choice would be year, since we have 5 intervals of equal length (12 months each).

Instead of using yearly time intervals, however, I use 6 month time intervals, simply to generate more periods of risk to illustrate the alternative time specifications. R provides a useful function called survSplit() in the survival library that will split a continuous duration into discrete periods.

#make person period file, 11 episodes of length 6 months each
pp<-survSplit(sub, cut=seq(0,60,6), start="start", end="death.age", event="d.event", episode="year")
pp<-pp[order(pp$kidid, pp$year),]
head(pp[, c("kidid", "death.age", "d.event", "start", "year", "male", "hises")], n=20)
##                    kidid death.age d.event start year male hises
## 29014           1 2  4-1         6       0     0    1    0     1
## 58027           1 2  4-1        12       0     6    2    0     1
## 87040           1 2  4-1        18       0    12    3    0     1
## 116053          1 2  4-1        24       0    18    4    0     1
## 145066          1 2  4-1        30       0    24    5    0     1
## 174079          1 2  4-1        36       0    30    6    0     1
## 203092          1 2  4-1        42       0    36    7    0     1
## 232105          1 2  4-1        48       0    42    8    0     1
## 261118          1 2  4-1        54       0    48    9    0     1
## 290131          1 2  4-1        60       0    54   10    0     1
## 319144          1 2  4-1       102       0    60   11    0     1
## 29015           1 2  4-2         6       0     0    1    1     1
## 58028           1 2  4-2        12       0     6    2    1     1
## 87041           1 2  4-2        18       0    12    3    1     1
## 116054          1 2  4-2        24       0    18    4    1     1
## 145067          1 2  4-2        30       0    24    5    1     1
## 174080          1 2  4-2        36       0    30    6    1     1
## 203093          1 2  4-2        42       0    36    7    1     1
## 232106          1 2  4-2        48       0    42    8    1     1
## 261119          1 2  4-2        54       0    48    9    1     1

We see that each child is not in the data for multiple “risk periods”, until they experience the event (death) or age out of the risk set.

Discrete time model

So, the best thing about the discrete time model, is that it’s just logistic regression. Each risk period is treated as a single Bernoulli trial, and the child can either fail (y=1) or not (y=0) in the period. This is how we get the hazard of the event, as the estimated probability of failure in each discrete time period. So, any method you would like to use to model this probability would probably work (logit, probit models), but I will show two standard approaches. First, we will use the traditional logit link to the binomial distribution, then we will use the complementary log-log link. The latter is used because it preserves the proportional hazards property of the model, as in the Cox model.

#generate survey design

des<-svydesign(ids=~psu, strata = ~strata , weights=~weight, data=pp)

#Fit the basic logistic model with ONLY time in the model
#I do -1 so that no intercept is fit in the model, and we get a hazard estimate for each time period
fit.0<-svyglm(d.event~as.factor(year)-1,design=des , family="binomial")
## Warning: non-integer #successes in a binomial glm!
summary(fit.0)
## 
## Call:
## svyglm(formula = d.event ~ as.factor(year) - 1, design = des, 
##     family = "binomial")
## 
## Survey design:
## svydesign(ids = ~psu, strata = ~strata, weights = ~weight, data = pp)
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## as.factor(year)1   -3.56668    0.06464  -55.18   <2e-16 ***
## as.factor(year)2   -3.83430    0.07953  -48.21   <2e-16 ***
## as.factor(year)3   -4.62671    0.09237  -50.09   <2e-16 ***
## as.factor(year)4   -4.36101    0.08024  -54.35   <2e-16 ***
## as.factor(year)5  -21.64043    0.02386 -906.83   <2e-16 ***
## as.factor(year)6   -5.04661    0.10825  -46.62   <2e-16 ***
## as.factor(year)7  -21.63983    0.02415 -895.93   <2e-16 ***
## as.factor(year)8   -5.50268    0.13118  -41.95   <2e-16 ***
## as.factor(year)9  -21.64050    0.02457 -880.65   <2e-16 ***
## as.factor(year)10  -5.53539    0.14354  -38.56   <2e-16 ***
## as.factor(year)11 -21.63971    0.02498 -866.11   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 0.6644426)
## 
## Number of Fisher Scoring iterations: 20

Alternative time specifications

Here, we can specify how we want time in our model. The general model fits a hazard at every time point. If you have a lot of time points, and especially if you have low numbers of events at some time points, this can be computationally expensive. We can, however, specify time as a linear or quadratic term, and the model will not fit separate hazards at all times, instead, the baseline hazard will be a linear or curvilinear function of time.

#Linear term for time
fit.l<-svyglm(d.event~year,design=des , family="binomial")
## Warning: non-integer #successes in a binomial glm!
summary(fit.l)
## 
## Call:
## svyglm(formula = d.event ~ year, design = des, family = "binomial")
## 
## Survey design:
## svydesign(ids = ~psu, strata = ~strata, weights = ~weight, data = pp)
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.20342    0.06301  -50.84   <2e-16 ***
## year        -0.37899    0.01627  -23.30   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1.091886)
## 
## Number of Fisher Scoring iterations: 8

Which shows the hazard decreases over time, which makes a lot of sense in the context of this outcome. Now we can consider quadratic terms for time:

fit.s<-svyglm(d.event~year+I(year^2),design=des , family="binomial")
## Warning: non-integer #successes in a binomial glm!
summary(fit.s)
## 
## Call:
## svyglm(formula = d.event ~ year + I(year^2), design = des, family = "binomial")
## 
## Survey design:
## svydesign(ids = ~psu, strata = ~strata, weights = ~weight, data = pp)
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.996543   0.079616 -37.637  < 2e-16 ***
## year        -0.526385   0.039976 -13.168  < 2e-16 ***
## I(year^2)    0.016448   0.003872   4.248 2.67e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1.003553)
## 
## Number of Fisher Scoring iterations: 9
fit.c<-svyglm(d.event~year+I(year^2)+I(year^3 ),design=des , family="binomial")
## Warning: non-integer #successes in a binomial glm!
summary(fit.c)
## 
## Call:
## svyglm(formula = d.event ~ year + I(year^2) + I(year^3), design = des, 
##     family = "binomial")
## 
## Survey design:
## svydesign(ids = ~psu, strata = ~strata, weights = ~weight, data = pp)
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.132382   0.138032 -22.693  < 2e-16 ***
## year        -0.379590   0.122216  -3.106  0.00203 ** 
## I(year^2)   -0.019986   0.026617  -0.751  0.45316    
## I(year^3)    0.002350   0.001584   1.483  0.13877    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1.011792)
## 
## Number of Fisher Scoring iterations: 9
fit.q<-svyglm(d.event~year+I(year^2)+I(year^3 )+I(year^4),design=des , family="binomial")
## Warning: non-integer #successes in a binomial glm!
summary(fit.q)
## 
## Call:
## svyglm(formula = d.event ~ year + I(year^2) + I(year^3) + I(year^4), 
##     design = des, family = "binomial")
## 
## Survey design:
## svydesign(ids = ~psu, strata = ~strata, weights = ~weight, data = pp)
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.813251   0.247520 -15.406  < 2e-16 ***
## year         0.591236   0.292424   2.022   0.0438 *  
## I(year^2)   -0.408964   0.100570  -4.066 5.72e-05 ***
## I(year^3)    0.058250   0.013019   4.474 9.93e-06 ***
## I(year^4)   -0.002586   0.000559  -4.627 4.98e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 0.973821)
## 
## Number of Fisher Scoring iterations: 9
#Spline
fit.sp<-svyglm(d.event~bs(year),design=des , family="binomial")
## Warning: non-integer #successes in a binomial glm!
summary(fit.sp)
## 
## Call:
## svyglm(formula = d.event ~ bs(year), design = des, family = "binomial")
## 
## Survey design:
## svydesign(ids = ~psu, strata = ~strata, weights = ~weight, data = pp)
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -3.5296     0.0590 -59.821  < 2e-16 ***
## bs(year)1    -1.3750     0.2500  -5.501 6.64e-08 ***
## bs(year)2    -3.1813     0.2877 -11.057  < 2e-16 ***
## bs(year)3    -3.0690     0.1861 -16.491  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1.011792)
## 
## Number of Fisher Scoring iterations: 9

Now, let’s look at the hazards:

dat<-expand.grid(year=seq(1, 11, 1))
dat$genmod<-predict(fit.0, newdata=dat, type="response")
dat$lin<-predict(fit.l, newdata=dat, type="response")
dat$sq<-predict(fit.s, newdata=dat, type="response")
dat$cub<-predict(fit.c, newdata=dat, type="response")
dat$quart<-predict(fit.q, newdata=dat, type="response")
dat$spline<-predict(fit.sp, newdata=dat, type="response")
dat
##    year       genmod          lin          sq         cub        quart
## 1     1 2.747342e-02 0.0270562811 0.029128445 0.028481407 0.0272435856
## 2     2 2.115910e-02 0.0186809552 0.018279648 0.018840822 0.0209998500
## 3     3 9.692061e-03 0.0128639497 0.011801459 0.012278105 0.0126550388
## 4     4 1.260456e-02 0.0088419631 0.007853544 0.008001165 0.0072043236
## 5     5 3.996510e-10 0.0060697412 0.005392955 0.005291943 0.0044233901
## 6     6 6.390031e-03 0.0041630437 0.003823700 0.003604820 0.0031458623
## 7     7 3.998903e-10 0.0028535805 0.002800200 0.002565939 0.0026145195
## 8     8 4.059278e-03 0.0019551933 0.002118518 0.001936117 0.0024068564
## 9     9 3.996216e-10 0.0013392637 0.001656024 0.001570815 0.0021860311
## 10   10 3.929178e-03 0.0009171874 0.001337597 0.001389904 0.0016397672
## 11   11 3.999405e-10 0.0006280471 0.001116422 0.001360343 0.0007988978
##         spline
## 1  0.028481407
## 2  0.018840822
## 3  0.012278105
## 4  0.008001165
## 5  0.005291943
## 6  0.003604820
## 7  0.002565939
## 8  0.001936117
## 9  0.001570815
## 10 0.001389904
## 11 0.001360343
plot(genmod~year, dat, type="l", ylab="h(t)", xlab="Time")
title(main="Hazard function from different time parameterizations")
lines(lin~year, dat, col=2, lwd=2)
lines(sq~year, dat, col=3, lwd=2)
lines(cub~year, dat, col=4, lwd=2)
lines(quart~year, dat, col=5, lwd=2)
lines(spline~year, dat, col=6, lwd=2)

legend("topright", legend=c("General Mod.", "Linear", "Square", "Cubic", "Quartic", "B-spline"), col=1:6, lwd=1.5)

#AIC table
aic<-round(c(
  fit.l$deviance+2*length(fit.l$coefficients),
  fit.s$deviance+2*length(fit.s$coefficients),
  fit.c$deviance+2*length(fit.c$coefficients),
  fit.q$deviance+2*length(fit.q$coefficients),
  fit.sp$deviance+2*length(fit.sp$coefficients),
  fit.0$deviance+2*length(fit.0$coefficients)),2)
  
dif.aic<-round(aic-aic[6],2)
data.frame(model =c( "linear", "square", "cubic", "quartic","spline", "general"), aic=aic, aic_dif=dif.aic)
##     model      aic aic_dif
## 1  linear 23154.22  700.63
## 2  square 23132.97  679.38
## 3   cubic 23131.33  677.74
## 4 quartic 23105.36  651.77
## 5  spline 23131.33  677.74
## 6 general 22453.59    0.00

So the general model is definately best, with the quartic being the closest in AIC points, but still there is no evidence that it’s fitting the data better than the general model.