This example will illustrate how to fit parametric hazard models to continuous duration data (i.e. person-level data). In this example, I use the time between the first and second birth for women in Haiti. The data for this example come from the Haitian Demographic and Health Survey for 2012 individual recode file. This file contains information for all women sampled in the survey.
#Load required libraries
library(foreign)
library(survival)
## Loading required package: splines
library(car)
library(survey)
## Loading required package: grid
##
## Attaching package: 'survey'
##
## The following object is masked from 'package:graphics':
##
## dotchart
library(muhaz)
library(eha)
#load the data
haiti<-read.dta("/Users/ozd504/Google Drive/dem7223/data//HTIR61FL.DTA", convert.factors = F)
head(test, n=6) In the DHS individual recode file, information on every live birth is collected using a retrospective birth history survey mechanism. Since our outcome is time between first and second birth, we must select as our risk set, only women who have had a first birth. The bidx variable indexes the birth history and if bidx_01 is not missing, then the woman should be at risk of having a second birth (i.e. she has had a first birth, i.e. bidx_01==1). I also select only non-twin births (b0 == 0). The DHS provides the dates of when each child was born in Century Month Codes. To get the interval for women who acutally had a second birth, that is the difference between the CMC for the first birth b3_01 and the second birth b3_02, but for women who had not had a second birth by the time of the interview, the censored time between births is the difference between b3_01 and v008, the date of the interview.
We have 8671 women who are at risk of a second birth.
table(is.na(haiti$bidx_01))
##
## FALSE TRUE
## 8671 5616
#now we extract those women
sub<-subset(haiti, haiti$bidx_01==1&haiti$b0_01==0)
#Here I keep only a few of the variables for the dates, and some characteristics of the women, and details of the survey
sub2<-data.frame(CASEID=sub$caseid,
int.cmc=sub$v008,
fbir.cmc=sub$b3_01,
sbir.cmc=sub$b3_02,
marr.cmc=sub$v509,
rural=sub$v025,
educ=sub$v106,
age=sub$v012,
partneredu=sub$v701,
partnerage=sub$v730,
weight=sub$v005/1000000,
psu=sub$v021, strata=sub$v022)
Now I need to calculate the birth intervals, both observed and censored, and the event indicator (i.e. did the women have the second birth?)
sub2$secbi<-ifelse(is.na(sub2$sbir.cmc)==T, ((sub2$int.cmc))-((sub2$fbir.cmc)), (sub2$fbir.cmc-sub2$sbir.cmc))
sub2$b2event<-ifelse(is.na(sub2$sbir.cmc)==T,0,1)
plot(survfit(Surv(secbi, b2event)~1, sub2), conf.int=T, ylab="S(t)", xlab="Months")
title(main="Survival Function for Second Birth Interval, Haiti", sub="All Women")
While parametric models are not so common in demographic research, fundamental understanding of what they are and how they are constructed is of importance. Some outcomes lend themselves very readily to the parametric approach, but as many demographic duration times are non-unique (tied), the parametric models are not statistically efficient for estimating the survival/hazard functions, as they assume the survival times are continuous random variables. In this section, we first estimate the empirical hazard function and then fit a variety of parametric models to it (Exponential, Weibull, Log-normal and Piecewise exponential). Ideally, a parametric model’s hazard function should approximate the observed empirical hazard function, if the model fits the data.
#since these functions don't work with durations of 0, we add a very small amount to the intervals
fit.haz.km<-kphaz.fit(sub2$secbi[sub2$secbi>0], sub2$b2event[sub2$secbi>0] , method = "product-limit")
#this is a version of the hazard that is smoothed using a kernel-density method
fit.haz.sm<-muhaz(sub2$secbi[sub2$secbi>0], sub2$b2event[sub2$secbi>0] )
#Empirical hazard function (product-limit estimate) plot
kphaz.plot(fit.haz.km)
#overlay the smoothed version
lines(fit.haz.sm, col=2, lwd=3)
So now we see what the empirical hazard function looks like.
#Create some predictor variables: Woman's education, Woman's age^2, Partner's education (> secondary school)
sub2$educ.high<-ifelse(sub2$educ %in% c(2,3), 1, 0)
sub2$age2<-sub2$age^2
sub2$partnerhiedu<-ifelse(sub2$partneredu<3,0,ifelse(sub2$partneredu%in%c(8,9),NA,1 ))
Now we fit the models. I use the eha package to do this, since it fits parametric proportional hazard models, not accellerated failure time models. I prefer the interpretation of regression models on the hazard scale vs. the survival time scale.
Exponential Model
#exponential distribution for hazard, here we hard code it to be
#a weibull dist with shape ==1
fit.1<-phreg(Surv(secbi, b2event)~educ.high+partnerhiedu+I(age/5), data=sub2[sub2$secbi>0,], dist="weibull", shape=1)
summary(fit.1)
## Call:
## phreg(formula = Surv(secbi, b2event) ~ educ.high + partnerhiedu +
## I(age/5), data = sub2[sub2$secbi > 0, ], dist = "weibull",
## shape = 1)
##
## Covariate W.mean Coef Exp(Coef) se(Coef) Wald p
## educ.high 0.332 -0.391 0.676 0.032 0.000
## partnerhiedu 0.072 -0.397 0.673 0.070 0.000
## I(age/5) 7.053 0.004 1.004 0.008 0.651
##
## log(scale) 4.067 0.062 0.000
##
## Shape is fixed at 1
##
## Events 5856
## Total time at risk 380253
## Max. log. likelihood -30158
## LR test statistic 10.35
## Degrees of freedom 3
## Overall p-value 0.0158268
plot(fit.1)
Which shows us what the constant hazard model looks like, it assumed the hazard is constant with respect to time, which after seeing the plots above, we know is false. We see the effects of both woman’s and partner’s education are negative, which makes sense. More educated people have lower risks of having more children. We also see the age effect is insignificant, which doesn’t make sense.
Weibull Model
#weibull distribution for hazard
fit.2<-phreg(Surv(secbi, b2event)~educ.high+partnerhiedu+I(age/5), data=sub2[sub2$secbi>0,], dist="weibull")
summary(fit.2)
## Call:
## phreg(formula = Surv(secbi, b2event) ~ educ.high + partnerhiedu +
## I(age/5), data = sub2[sub2$secbi > 0, ], dist = "weibull")
##
## Covariate W.mean Coef Exp(Coef) se(Coef) Wald p
## educ.high 0.332 -0.459 0.632 0.032 0.000
## partnerhiedu 0.072 -0.492 0.612 0.070 0.000
## I(age/5) 7.053 -0.070 0.932 0.009 0.000
##
## log(scale) 3.747 0.045 0.000
## log(shape) 0.367 0.009 0.000
##
## Events 5856
## Total time at risk 380253
## Max. log. likelihood -29547
## LR test statistic 399.01
## Degrees of freedom 3
## Overall p-value 0
plot(fit.2)
plot(fit.2, fn="haz")
lines(fit.haz.sm, col=2)
Here, we see a more realistic situation, where the hazard fucntion changes over time (Weibull allows this), but compared to the empirical hazard, the model is a very poor fit, as empirically, the hazard goes up, but then goes down. The Weibull hazard just goes up, as the model does not allow the hazard to change direction, only rate of increase (i.e. it can incraese at a slower or faster rate, but not change direction). We also see the effect of mom’s age is significant and negative (older women have lower risk of having a second birth)
The Log-normal distribution is more flexible and allows the hazard to change direction. Log-Normal Model
#log-normal distribution for hazard
fit.3<-phreg(Surv(secbi, b2event)~educ.high+partnerhiedu+I(age/5), data=sub2[sub2$secbi>0,], dist="lognormal", center=T)
summary(fit.3)
## Call:
## phreg(formula = Surv(secbi, b2event) ~ educ.high + partnerhiedu +
## I(age/5), data = sub2[sub2$secbi > 0, ], dist = "lognormal",
## center = T)
##
## Covariate W.mean Coef Exp(Coef) se(Coef) Wald p
## (Intercept) -2.734 0.186 0.000
## educ.high 0.332 -0.442 0.643 0.033 0.000
## partnerhiedu 0.072 -0.408 0.665 0.070 0.000
## I(age/5) 7.053 -0.034 0.967 0.009 0.000
##
## log(scale) 2.691 0.032 0.000
## log(shape) 1.540 0.072 0.000
##
## Events 5856
## Total time at risk 380253
## Max. log. likelihood -28412
## LR test statistic 323.21
## Degrees of freedom 3
## Overall p-value 0
plot(fit.3)
#plot the hazard from the log normal vs the empirical hazard
plot(fit.3, fn="haz")
lines(fit.haz.sm, col=2)
So, the log-normal model fits the empirical hazard pretty well up to ~150 months, where the empirical rate drops off faster. The eha package allows one other parametric distribution, the log-logistic, so we will consider that one too:
Log-logistic Model
#log-normal distribution for hazard
fit.4<-phreg(Surv(secbi, b2event)~educ.high+partnerhiedu+I(age/5), data=sub2[sub2$secbi>0,], dist="loglogistic", center=T)
summary(fit.4)
## Call:
## phreg(formula = Surv(secbi, b2event) ~ educ.high + partnerhiedu +
## I(age/5), data = sub2[sub2$secbi > 0, ], dist = "loglogistic",
## center = T)
##
## Covariate W.mean Coef Exp(Coef) se(Coef) Wald p
## (Intercept) -0.807 0.081 0.000
## educ.high 0.332 -0.429 0.651 0.033 0.000
## partnerhiedu 0.072 -0.378 0.685 0.070 0.000
## I(age/5) 7.053 -0.018 0.982 0.009 0.034
##
## log(scale) 3.270 0.018 0.000
## log(shape) 1.411 0.027 0.000
##
## Events 5856
## Total time at risk 380253
## Max. log. likelihood -28416
## LR test statistic 297.60
## Degrees of freedom 3
## Overall p-value 0
plot(fit.4)
#plot the hazard from the log normal vs the empirical hazard
plot(fit.4, fn="haz")
lines(fit.haz.sm, col=2)
Whose hazard function drops off faster than the log-normal.
We may want to compare the models to one another based off AIC values. the eha package doesn’t give this to you, so we must calculate it:
AIC1<--2*fit.1$loglik[2]+2*length(fit.1$coefficients); AIC1
## [1] 60324.71
AIC2<--2*fit.2$loglik[2]+2*length(fit.2$coefficients); AIC2
## [1] 59103.5
AIC3<--2*fit.3$loglik[2]+2*length(fit.3$coefficients); AIC3
## [1] 56835.32
AIC4<--2*fit.4$loglik[2]+2*length(fit.4$coefficients); AIC4
## [1] 56843.1
And we see the log-normal model best fits the data, although it and the log-logistic model are not different by much in terms of AIC. Only 7.775422 AIC points separate the two models
The final model we consider is the Piecewise constant exponential model. This model breaks the data into pieces, where we may fit constant hazards within these pieces. For instance, given the observed hazard function above, we may break the data into an early piece, say < 30 months, a high piece,30-80 months and maybe two low pieces (80-150 and >150), so to mimic the form of the hazard function.
Piecewise constant exponential model
# here I must supply the times for the "pieces" where I expect the hazard to be constant
fit.5<-phreg(Surv(secbi, b2event)~educ.high+partnerhiedu+I(age/5), data=sub2[sub2$secbi>0,], dist="pch", cuts=c(30, 80, 150,250))
summary(fit.5)
## Call:
## phreg(formula = Surv(secbi, b2event) ~ educ.high + partnerhiedu +
## I(age/5), data = sub2[sub2$secbi > 0, ], dist = "pch", cuts = c(30,
## 80, 150, 250))
##
## Covariate W.mean Coef Exp(Coef) se(Coef) Wald p
## educ.high 0.332 -0.435 0.647 0.033 0.000
## partnerhiedu 0.072 -0.396 0.673 0.070 0.000
## I(age/5) 7.053 -0.012 0.988 0.009 0.151
##
##
## Events 5856
## Total time at risk 380253
## Max. log. likelihood -29680
## LR test statistic 313.81
## Degrees of freedom 3
## Overall p-value 0
plot(fit.5)
plot(fit.5, fn="haz", ylim=c(0, .03))
lines(fit.haz.sm, col=2)
Which looks like it actually fits the data pretty good. The AIC’s show:
AIC5<--2*fit.5$loglik[2]+2*length(fit.5$coefficients); AIC5
## [1] 59366.4
AIC4
## [1] 56843.1
The eha package also provides a graphical method for the Cumulative hazard function, which allows us to visualize these models even better. It uses the empirical hazard, as fit in the Cox model (more on this next week), and compares the parametric models to the empirical patter:
emp<-coxreg(Surv(secbi, b2event)~educ.high+partnerhiedu+I(age/5), data=sub2[sub2$secbi>0,])
check.dist(sp=emp,pp=fit.1, main = "Empirical vs. Exponential")
check.dist(sp=emp,pp=fit.2, main = "Empirical vs. Weibull")
check.dist(sp=emp,pp=fit.3, main = "Empirical vs. Log-Normal")
check.dist(sp=emp,pp=fit.4, main = "Empirical vs. Log-Logistic")
check.dist(sp=emp,pp=fit.5, main = "Empirical vs. PCH")
Again, we see that the PCH model fits the empirical hazard function better than the other parametric models.
As in the other examples, I illustrate fitting these models to data that are longitudinal, instead of person-duration.
In this example, we will examine how to fit the parametric model to a longitudinally collected data set. Here I use data from the ECLS-K. Specifically, we will examine the transition into poverty between kindergarten and third grade.
First we load our data
load("~/Google Drive/dem7903_App_Hier/data/eclsk.Rdata")
names(eclsk)<-tolower(names(eclsk))
#get out only the variables I'm going to use for this example
myvars<-c( "childid","gender", "race", "r1_kage","r4age", "r5age", "r6age", "r7age","c1r4mtsc", "c4r4mtsc", "c5r4mtsc", "c6r4mtsc", "c7r4mtsc", "w1povrty","w1povrty","w3povrty", "w5povrty", "w8povrty","wkmomed", "s2_id")
eclsk<-eclsk[,myvars]
eclsk$age1<-ifelse(eclsk$r1_kage==-9, NA, eclsk$r1_kage/12)
eclsk$age2<-ifelse(eclsk$r4age==-9, NA, eclsk$r4age/12)
#for the later waves, the NCES group the ages into ranges of months, so 1= <105 months, 2=105 to 108 months. So, I fix the age at the midpoint of the interval they give, and make it into years by dividing by 12
eclsk$age3<-recode(eclsk$r5age,recodes="1=105; 2=107; 3=109; 4=112; 5=115; 6=117; -9=NA")/12
eclsk$pov1<-ifelse(eclsk$w1povrty==1,1,0)
eclsk$pov2<-ifelse(eclsk$w3povrty==1,1,0)
eclsk$pov3<-ifelse(eclsk$w5povrty==1,1,0)
#Recode race with white, non Hispanic as reference using dummy vars
eclsk$hisp<-recode (eclsk$race, recodes="3:4=1;-9=NA; else=0")
eclsk$black<-recode (eclsk$race, recodes="2=1;-9=NA; else=0")
eclsk$asian<-recode (eclsk$race, recodes="5=1;-9=NA; else=0")
eclsk$nahn<-recode (eclsk$race, recodes="6:7=1;-9=NA; else=0")
eclsk$other<-recode (eclsk$race, recodes="8=1;-9=NA; else=0")
eclsk$male<-recode(eclsk$gender, recodes="1=1; 2=0; -9=NA")
eclsk$mlths<-recode(eclsk$wkmomed, recodes = "1:2=1; 3:9=0; else = NA")
eclsk$mgths<-recode(eclsk$wkmomed, recodes = "1:3=0; 4:9=1; else =NA")
Now, I need to form the transition variable, this is my event variable, and in this case it will be 1 if a child enters poverty between the first wave of the data and the third grade wave, and 0 otherwise. NOTE I need to remove any children who are already in poverty age wave 1, because they are not at risk of experiencing this particular transition.
eclsk<-subset(eclsk, is.na(pov1)==F&is.na(pov2)==F&is.na(pov3)==F&is.na(age1)==F&is.na(age2)==F&is.na(age3)==F&pov1!=1)
eclsk$povtran1<-ifelse(eclsk$pov1==0&eclsk$pov2==0, 0,1)
eclsk$povtran2<-ifelse(eclsk$povtran1==1, NA,ifelse(eclsk$pov2==0&eclsk$pov3==0,0,1))
Now we do the entire data set. To analyze data longitudinally, we need to reshape the data from the current “wide” format (repeated measures in columns) to a “long” format (repeated observations in rows). The reshape() function allows us to do this easily. It allows us to specify our repeated measures, time varying covariates as well as time-constant covariates.
e.long<-reshape(eclsk, idvar="childid", varying=list(age=c("age1","age2"), age2=c("age2", "age3"), povtran=c("povtran1", "povtran2")), times=1:2, direction="long" , drop = names(eclsk)[4:20])
e.long<-e.long[order(e.long$childid, e.long$time),]
#find which kids failed in the first time period and remove them from the second risk period risk set
failed1<-which(is.na(e.long$povtran1)==T)
e.long<-e.long[-failed1,]
e.long$age1r<-round(e.long$age1, 0)
e.long$age2r<-round(e.long$age2, 0)
head(e.long, n=10)
## childid gender race pov1 pov2 pov3 hisp black asian nahn other
## 0001002C.1 0001002C 2 1 0 0 0 0 0 0 0 0
## 0001002C.2 0001002C 2 1 0 0 0 0 0 0 0 0
## 0001007C.1 0001007C 1 1 0 0 0 0 0 0 0 0
## 0001007C.2 0001007C 1 1 0 0 0 0 0 0 0 0
## 0001010C.1 0001010C 2 1 0 0 0 0 0 0 0 0
## 0001010C.2 0001010C 2 1 0 0 0 0 0 0 0 0
## 0002004C.1 0002004C 2 1 0 0 0 0 0 0 0 0
## 0002004C.2 0002004C 2 1 0 0 0 0 0 0 0 0
## 0002006C.1 0002006C 2 1 0 1 1 0 0 0 0 0
## 0002008C.1 0002008C 2 1 0 0 0 0 0 0 0 0
## male mlths mgths time age1 age2 povtran1 age1r age2r
## 0001002C.1 0 0 0 1 6.433333 7.894167 0 6 8
## 0001002C.2 0 0 0 2 7.894167 9.750000 0 8 10
## 0001007C.1 1 0 1 1 5.300000 6.825000 0 5 7
## 0001007C.2 1 0 1 2 6.825000 8.916667 0 7 9
## 0001010C.1 0 0 1 1 5.885833 7.385833 0 6 7
## 0001010C.2 0 0 1 2 7.385833 9.333333 0 7 9
## 0002004C.1 0 0 1 1 5.450000 6.977500 0 5 7
## 0002004C.2 0 0 1 2 6.977500 9.083333 0 7 9
## 0002006C.1 0 NA NA 1 5.158333 6.685833 1 5 7
## 0002008C.1 0 0 1 1 5.905833 7.433333 0 6 7
Now we fit the models, I only show the Weibull and PCH model fit here, but the others follow the example from above:
#Weibull
fitl1<-phreg(Surv(time = age2r, event = povtran1)~mlths+mgths+black+hisp+other+nahn, data=e.long, dist = "weibull")
summary(fitl1)
## Call:
## phreg(formula = Surv(time = age2r, event = povtran1) ~ mlths +
## mgths + black + hisp + other + nahn, data = e.long, dist = "weibull")
##
## Covariate W.mean Coef Exp(Coef) se(Coef) Wald p
## mlths 0.057 0.779 2.179 0.105 0.000
## mgths 0.683 -0.988 0.372 0.089 0.000
## black 0.063 1.346 3.841 0.113 0.000
## hisp 0.130 0.928 2.529 0.097 0.000
## other 0.027 0.377 1.459 0.256 0.141
## nahn 0.023 1.335 3.799 0.167 0.000
##
## log(scale) 2.553 0.014 0.000
## log(shape) 2.017 0.031 0.000
##
## Events 666
## Total time at risk 110255
## Max. log. likelihood -2850.2
## LR test statistic 578.13
## Degrees of freedom 6
## Overall p-value 0
fitl2<-phreg(Surv(time = age2r, event = povtran1)~mlths+mgths+black+hisp+other+nahn, data=e.long, dist = "pch", cuts=c(6, 7,8, 9))
summary(fitl2)
## Call:
## phreg(formula = Surv(time = age2r, event = povtran1) ~ mlths +
## mgths + black + hisp + other + nahn, data = e.long, dist = "pch",
## cuts = c(6, 7, 8, 9))
##
## Covariate W.mean Coef Exp(Coef) se(Coef) Wald p
## mlths 0.057 0.777 2.175 0.105 0.000
## mgths 0.683 -0.977 0.376 0.089 0.000
## black 0.063 1.290 3.634 0.113 0.000
## hisp 0.130 0.883 2.418 0.098 0.000
## other 0.027 0.341 1.407 0.256 0.183
## nahn 0.023 1.313 3.716 0.167 0.000
##
##
## Events 666
## Total time at risk 110255
## Max. log. likelihood -2910.4
## LR test statistic 556.96
## Degrees of freedom 6
## Overall p-value 0
#AIC for weibull
-2*fitl1$loglik[2]+2*length(fitl1$coefficients)
## [1] 5716.314
#AIC for pch
-2*fitl2$loglik[2]+2*length(fitl2$coefficients)
## [1] 5832.823
#Empirical (Cox)
fitle<-coxreg(Surv(time = age2r, event = povtran1)~mlths+mgths+black+hisp+other+nahn, data=e.long)
check.dist(fitle, fitl1)
check.dist(fitle, fitl2)
Which, actually shows the weibull fitting better here. Most likely, the PCH model is over parameterized compared to the Weibull.