This example will illustrate how to fit parametric hazard models to continuous duration data (i.e. person-level data). In this example, I use the time between the first and second birth for women in the data as the outcome variable.

The data for this example come from the DHS Model data file Demographic and Health Survey for 2012 individual recode file. This file contains information for all women sampled in the survey between the ages of 15 and 49.

This is an important data file, because for each woman, it gives information on all of her births, arrayed in columns.

#Load required libraries
library(haven)
library(survival)
library(car)
library(survey)

## Loading required package: grid

## Loading required package: Matrix

## 
## Attaching package: 'survey'

## The following object is masked from 'package:graphics':
## 
##     dotchart

library(muhaz)
library(eha)

#load the data
model.dat<-read_dta("C:/Users/ozd504//Google Drive/zzir62dt/ZZIR62FL.DTA")

In the DHS individual recode file, information on every live birth is collected using a retrospective birth history survey mechanism.

Since our outcome is time between first and second birth, we must select as our risk set, only women who have had a first birth.

The bidx variable indexes the birth history and if bidx_01 is not missing, then the woman should be at risk of having a second birth (i.e. she has had a first birth, i.e. bidx_01==1).

I also select only non-twin births (b0 == 0).

The DHS provides the dates of when each child was born in Century Month Codes.

To get the interval for women who acutally had a second birth, that is the difference between the CMC for the first birth b3_01 and the second birth b3_02, but for women who had not had a second birth by the time of the interview, the censored time between births is the difference between b3_01 and v008, the date of the interview.

We have 6161 women who are at risk of a second birth.

table(is.na(model.dat$bidx_01))

## 
## FALSE  TRUE 
##  6161  2187

#now we extract those women
sub<-subset(model.dat, model.dat$bidx_01==1&model.dat$b0_01==0)

#Here I keep only a few of the variables for the dates, and some characteristics of the women, and details of the survey
sub2<-data.frame(CASEID=sub$caseid, 
                 int.cmc=sub$v008,
                 fbir.cmc=sub$b3_01,
                 sbir.cmc=sub$b3_02,
                 marr.cmc=sub$v509,
                 rural=sub$v025,
                 educ=sub$v106,
                 age=sub$v012,
                 partneredu=sub$v701,
                 partnerage=sub$v730,
                 weight=sub$v005/1000000,
                 psu=sub$v021, strata=sub$v022)

Now I need to calculate the birth intervals, both observed and censored, and the event indicator (i.e. did the women have the second birth?)

sub2$secbi<-ifelse(is.na(sub2$sbir.cmc)==T, ((sub2$int.cmc))-((sub2$fbir.cmc)), (sub2$fbir.cmc-sub2$sbir.cmc))
sub2$b2event<-ifelse(is.na(sub2$sbir.cmc)==T,0,1) 
fit<-survfit(Surv(secbi, b2event)~1, sub2)
fit

## Call: survfit(formula = Surv(secbi, b2event) ~ 1, data = sub2)
## 
##       n  events  median 0.95LCL 0.95UCL 
##    6026    4789      39      38      40

plot(fit, conf.int=T, ylab="S(t)", xlab="Months")
title(main="Survival Function for Second Birth Interval, DHS Model Data")

Estimating Parametric Hazard Models

While parametric models are not so common in demographic research, fundamental understanding of what they are and how they are constructed is of importance.

Some outcomes lend themselves very readily to the parametric approach, but as many demographic duration times are non-unique (tied), the parametric models are not statistically efficient for estimating the survival/hazard functions, as they assume the survival times are continuous random variables.

In this section, we first estimate the empirical hazard function and then fit a variety of parametric models to it (Exponential, Weibull, Log-normal and Piecewise exponential). Ideally, a parametric model’s hazard function should approximate the observed empirical hazard function, if the model fits the data.

#since these functions don't work with durations of 0, we add a very small amount to the intervals
fit.haz.km<-kphaz.fit(sub2$secbi[sub2$secbi>0], sub2$b2event[sub2$secbi>0] , method = "product-limit")

#this is a version of the hazard that is smoothed using a kernel-density method
fit.haz.sm<-muhaz(sub2$secbi[sub2$secbi>0], sub2$b2event[sub2$secbi>0] )

#Empirical hazard function (product-limit estimate) plot
kphaz.plot(fit.haz.km,main="Plot of the hazard of having a second birth")
#overlay the smoothed version
lines(fit.haz.sm, col=2, lwd=3)
legend("topleft", legend = c("KM hazard", "Smoothed Hazard"), col=c(1,2), lty=c(1,1))

So now we see what the empirical hazard function looks like, in both the observed and smoothed estimate of it.

Create covariates

Here, we create some predictor variables: Woman’s education (secondary +, vs < secondary), Woman’s age^2, Partner’s education (> secondary school)

sub2$educ.high<-ifelse(sub2$educ %in% c(2,3), 1, 0)
sub2$age2<-(sub2$age/5)^2
sub2$partnerhiedu<-ifelse(sub2$partneredu<3,0,ifelse(sub2$partneredu%in%c(8,9),NA,1 ))

options(survey.lonely.psu = "adjust")
des<-svydesign(ids=~psu, strata=~strata, data=sub2[sub2$secbi>0,], weight=~weight )

rep.des<-as.svrepdesign(des, type="bootstrap" )

Fit the models

Now we fit the models.

I use the eha package to do this, since it fits parametric proportional hazard models, not accellerated failure time models.

I prefer the interpretation of regression models on the hazard scale vs. the survival time scale. EHA is not the only package that will fit parameteric survival models, be sure you read the documentation for the procedure you use!! Different functions fit different parameterizations of the distributions. For example, the survreg() function in the survival library fits accelerated failure time models only.

Exponential Model

Often the exponential model isn’t directly available in packages, so we can fit a weibull model with a fixed shape parameter. This is 100% legal.

The exponential distribution has a constant hazard rate, \(\lambda (t) = \lambda\). The survival function is \(S(t) = \exp (-\lambda t)\)

To specify the model in terms of covariates, you can write the hazard as a log-linear model : \(\text{log} \lambda = x`\beta\)

#exponential distribution for hazard, here we hard code it to be
#a weibull dist with shape ==1 
fit.1<-phreg(Surv(secbi, b2event)~educ.high+partnerhiedu+I(age/5)+age2 + age2, data=sub2[sub2$secbi>0,], dist="weibull", shape = 1)
summary(fit.1)

## Call:
## phreg(formula = Surv(secbi, b2event) ~ educ.high + partnerhiedu + 
##     I(age/5) + age2 + age2, data = sub2[sub2$secbi > 0, ], dist = "weibull", 
##     shape = 1)
## 
## Covariate          W.mean      Coef Exp(Coef)  se(Coef)    Wald p
## educ.high           0.167    -0.345     0.708     0.048     0.000 
## partnerhiedu        0.071    -0.160     0.852     0.069     0.020 
## I(age/5)            6.859     0.277     1.320     0.078     0.000 
## age2               49.668    -0.022     0.978     0.006     0.000 
## 
## log(scale)                    4.701               0.259     0.000 
## 
##  Shape is fixed at  1 
## 
## Events                    4527 
## Total time at risk        237141 
## Max. log. likelihood      -22395 
## LR test statistic         36.14 
## Degrees of freedom        4 
## Overall p-value           2.70792e-07

plot(fit.1)

Which shows us what the constant hazard model looks like, it assumed the hazard is constant with respect to time, which after seeing the plots above, we know is false. We see the effects of both woman’s and partner’s education are negative, which makes sense. Women with more education, and who have partners with more education lower risks of having a second birth. We also see the age effect is significant, meaning older women in this sample are more likely to have a second birthm but the hazard doesn’t go up forever, as the curvilinear term shows a negative slope.

Interpreting the model coefficients

To interpret the effects specifically, you can use the Exp(Coef) column. So, for example for women who have secondary or higher education, their hazard of having a second child is 29.211 lower than a woman with less than a secondary education. To get that number I do : \(100* 1 - \exp(\beta_{\text{educ.high }})\)

Likewise, for the effect of age, we can compare the hazards for a women who is age 35 to a woman who is age 20. To do this comparison for a continuous covariate, you have to form the ratio of the hazards at two different plausible values. For this comparison, we see that women who are age 35 are 1.104 times more likely to have a second birth than women who are 20. To get this, I find:

\(\text{Hazard Ratio} = \frac{\exp \left( \beta_{\text{I(age/5)}} * 7 + \beta_{\text{age2}}*7 \right )}{\exp \left( \beta_{\text{I(age/5)}} * 4 + \beta_{\text{age2}}*4 \right )}\)

I choose 7 because 7 * 5 = 35, and 4 because 4*5 = 20. Remember, I divided Age by 5 when I created my variables.

AFT model specification

If you wanted to do the AFT model, you can either aftreg() in the eha package or survreg() in the survival package. Generally AFT models are written as:

\(\text{log} T = -x ` \beta + \sigma W\) Where W is an error (residual) term, which is assumed to follow some distribution.

fit.1.aft<-survreg(Surv(secbi, b2event)~educ.high+partnerhiedu+I(age/5)+age2 + age2, data=sub2[sub2$secbi>0,],dist = "exponential")

summary(fit.1.aft)

## 
## Call:
## survreg(formula = Surv(secbi, b2event) ~ educ.high + partnerhiedu + 
##     I(age/5) + age2 + age2, data = sub2[sub2$secbi > 0, ], dist = "exponential")
##                Value Std. Error     z        p
## (Intercept)   4.7013    0.25934 18.13 1.91e-73
## educ.high     0.3455    0.04762  7.26 4.00e-13
## partnerhiedu  0.1601    0.06884  2.33 2.00e-02
## I(age/5)     -0.2774    0.07768 -3.57 3.56e-04
## age2          0.0222    0.00562  3.95 7.72e-05
## 
## Scale fixed at 1 
## 
## Exponential distribution
## Loglik(model)= -22395.3   Loglik(intercept only)= -22447.6
##  Chisq= 104.47 on 4 degrees of freedom, p= 0 
## Number of Newton-Raphson Iterations: 4 
## n=5279 (728 observations deleted due to missingness)

Which shows, compared to the PH model, that the coefficients are all backwards. That’s because if a predictor lowers the hazard, then, by default it extends survival.

Lower risk == longer survival times!

Weibull Model

The Weibull model is more flexible than the Exponential, because it’s distribution function has two parameters, scale and shape.

The Weibull distribution has hazard rate, \(\lambda(t)=\lambda^p p t^{p-1}\). Where \(\lambda\) is the scale and p is the shape. The survival function is \(S(t) = exp ( -(\lambda t)^p)\)

#weibull distribution for hazard
fit.2<-phreg(Surv(secbi, b2event)~educ.high+partnerhiedu+I(age/5)+age2, data=sub2[sub2$secbi>0,], dist="weibull")
summary(fit.2)

## Call:
## phreg(formula = Surv(secbi, b2event) ~ educ.high + partnerhiedu + 
##     I(age/5) + age2, data = sub2[sub2$secbi > 0, ], dist = "weibull")
## 
## Covariate          W.mean      Coef Exp(Coef)  se(Coef)    Wald p
## educ.high           0.167    -0.471     0.625     0.047     0.000 
## partnerhiedu        0.071    -0.242     0.785     0.069     0.000 
## I(age/5)            6.859     0.079     1.083     0.078     0.310 
## age2               49.668    -0.015     0.985     0.006     0.007 
## 
## log(scale)                    3.848               0.165     0.000 
## log(shape)                    0.465               0.010     0.000 
## 
## Events                    4527 
## Total time at risk        237141 
## Max. log. likelihood      -21643 
## LR test statistic         320.28 
## Degrees of freedom        4 
## Overall p-value           0

plot(fit.2)

plot(fit.2, fn="haz")
lines(fit.haz.sm, col=2)

Here, we see a more realistic situation, where the hazard fucntion changes over time (Weibull allows this), but compared to the empirical hazard, the model is a very poor fit, as empirically, the hazard goes up, but then goes down. The Weibull hazard just goes up, as the model does not allow the hazard to change direction, only rate of increase (i.e. it can incraese at a slower or faster rate, but not change direction). We see the Age effects begin to go away, because the baseline hazard is accounting for the age effects on fertility.

Note on exponential and Weibull models AFT vs PH parameterization

and, as a nice trick for the exponential and weibull models, you can rescale the AFT beta’s to PH model betas (see here)

#re-scaled beta's
(betaHat <- -coef(fit.1.aft) / fit.1.aft$scale)

##  (Intercept)    educ.high partnerhiedu     I(age/5)         age2 
##  -4.70131932  -0.34547289  -0.16011312   0.27739347  -0.02221717

#beta's from the PH model
coef(fit.1)

##    educ.high partnerhiedu     I(age/5)         age2   log(scale) 
##  -0.34547289  -0.16011312   0.27739347  -0.02221717   4.70131932

So for these two models, you can go back and forth.

Log-Normal Model

The Log-normal distribution is more flexible and allows the hazard to change direction.

The Log-normal distribution has hazard rate, \(h(t) = \frac{\phi \left (\frac{log t}{\sigma} \right ) }{\left [ 1 - \Phi \left ( \frac{log t}{\sigma} \right ) \right ] \sigma t}\)

. Where \(\sigma\) is the shape.

The survival function is \(S(t) = 1 - \Phi \left ( \frac{log t - \mu}{\sigma} \right )\)

#log-normal distribution for hazard
fit.3<-phreg(Surv(secbi, b2event)~educ.high+partnerhiedu+I(age/5)+age2, data=sub2[sub2$secbi>0,], dist="lognormal", center=T)
summary(fit.3)

## Call:
## phreg(formula = Surv(secbi, b2event) ~ educ.high + partnerhiedu + 
##     I(age/5) + age2, data = sub2[sub2$secbi > 0, ], dist = "lognormal", 
##     center = T)
## 
## Covariate          W.mean      Coef Exp(Coef)  se(Coef)    Wald p
## (Intercept)                  -1.438               0.307     0.000 
## educ.high           0.167    -0.416     0.660     0.048     0.000 
## partnerhiedu        0.071    -0.208     0.812     0.069     0.002 
## I(age/5)            6.859    -0.094     0.911     0.080     0.240 
## age2               49.668     0.001     1.001     0.006     0.910 
## 
## log(scale)                    2.953               0.034     0.000 
## log(shape)                    1.318               0.055     0.000 
## 
## Events                    4527 
## Total time at risk        237141 
## Max. log. likelihood      -20653 
## LR test statistic         190.05 
## Degrees of freedom        4 
## Overall p-value           0

plot(fit.3)

#plot the hazard from the log normal vs the empirical hazard
plot(fit.3, fn="haz")
lines(fit.haz.sm, col=2)

We now see the age effect completely gone from the model.

So, the log-normal model fits the empirical hazard pretty well up to ~150 months, where the empirical rate drops off faster. The eha package allows one other parametric distribution, the log-logistic, so we will consider that one too:

Log-logistic Model

#log-normal distribution for hazard
fit.4<-phreg(Surv(secbi, b2event)~educ.high+partnerhiedu+I(age/5)+age2, data=sub2[sub2$secbi>0,], dist="loglogistic", center=T)
summary(fit.4)

## Call:
## phreg(formula = Surv(secbi, b2event) ~ educ.high + partnerhiedu + 
##     I(age/5) + age2, data = sub2[sub2$secbi > 0, ], dist = "loglogistic", 
##     center = T)
## 
## Covariate          W.mean      Coef Exp(Coef)  se(Coef)    Wald p
## (Intercept)                  -0.310               0.274     0.257 
## educ.high           0.167    -0.386     0.680     0.048     0.000 
## partnerhiedu        0.071    -0.194     0.824     0.069     0.005 
## I(age/5)            6.859    -0.086     0.917     0.080     0.280 
## age2               49.668     0.002     1.002     0.006     0.769 
## 
## log(scale)                    3.323               0.017     0.000 
## log(shape)                    1.549               0.028     0.000 
## 
## Events                    4527 
## Total time at risk        237141 
## Max. log. likelihood      -20563 
## LR test statistic         141.47 
## Degrees of freedom        4 
## Overall p-value           0

plot(fit.4)

#plot the hazard from the log normal vs the empirical hazard
plot(fit.4, fn="haz")
lines(fit.haz.sm, col=2)

Whose hazard function drops off faster than the log-normal.

We may want to compare the models to one another based off AIC values. the eha package doesn’t give this to you, so we must calculate it:

AIC1<--2*fit.1$loglik[2]+2*length(fit.1$coefficients); AIC1

## [1] 44800.65

AIC2<--2*fit.2$loglik[2]+2*length(fit.2$coefficients); AIC2

## [1] 43297.33

AIC3<--2*fit.3$loglik[2]+2*length(fit.3$coefficients); AIC3

## [1] 41320.29

AIC4<--2*fit.4$loglik[2]+2*length(fit.4$coefficients); AIC4

## [1] 41139.73

And we see the log-logistic model best fits the data, based on the minimum AIC criteria

Piecewise constant exponential model

The final model we consider is the Piecewise constant exponential model. This model breaks the data into pieces, where we may fit constant hazards within these pieces.

For instance, given the observed hazard function above, we may break the data into an early piece, say < 30 months, a high piece,30-80 months and maybe two low pieces (80-150 and >150), so to mimic the form of the hazard function.

# here I must supply the times for the "pieces" where I expect the  hazard to be constant
fit.5<-phreg(Surv(secbi, b2event)~educ.high+partnerhiedu+I(age/5)+age2, data=sub2[sub2$secbi>0,], dist="pch", cuts=c(30, 80, 150,250))
summary(fit.5)

## Call:
## phreg(formula = Surv(secbi, b2event) ~ educ.high + partnerhiedu + 
##     I(age/5) + age2, data = sub2[sub2$secbi > 0, ], dist = "pch", 
##     cuts = c(30, 80, 150, 250))
## 
## Covariate          W.mean      Coef Exp(Coef)  se(Coef)    Wald p
## educ.high           0.167    -0.391     0.677     0.048     0.000 
## partnerhiedu        0.071    -0.191     0.826     0.069     0.005 
## I(age/5)            6.859     0.011     1.011     0.079     0.887 
## age2               49.668    -0.005     0.995     0.006     0.391 
## 
## 
## Events                    4527 
## Total time at risk        237141 
## Max. log. likelihood      -21728 
## LR test statistic         134.44 
## Degrees of freedom        4 
## Overall p-value           0

plot(fit.5)

plot(fit.5, fn="haz", ylim=c(0, .05))
lines(fit.haz.sm, col=2)

Which looks like it actually fits the data pretty good. The AIC’s show the log-logistic model still fitting better.

AIC5<--2*fit.5$loglik[2]+2*length(fit.5$coefficients); AIC5

## [1] 43463.75

AIC4

## [1] 41139.73

Graphical checks on the model fit

The eha package also provides a graphical method for the Cumulative hazard function, which allows us to visualize these models even better. It uses the empirical hazard, as fit in the Cox model (more on this next week), and compares the parametric models to the empirical pattern:

emp<-coxreg(Surv(secbi, b2event)~educ.high+partnerhiedu+I(age/5)+age2, data=sub2[sub2$secbi>0,])

check.dist(sp=emp,pp=fit.1, main = "Empirical vs. Exponential")

check.dist(sp=emp,pp=fit.2, main = "Empirical vs. Weibull")

check.dist(sp=emp,pp=fit.3, main = "Empirical vs. Log-Normal")

check.dist(sp=emp,pp=fit.4, main = "Empirical vs. Log-Logistic")

check.dist(sp=emp,pp=fit.5, main = "Empirical vs. PCH")

We see that the PCH model and the log-logistic models both appear to fit the empirical hazard function better than the other parametric models.

Using Survey design

There are no survey analysis functions to fit parametric hazard models, so we must roll our own using advice from Thomas Lumely in his book Appendix E You can get this on campus through the library.

survey.fit <- withReplicates(rep.des, 
                             quote(coef(survreg(Surv(secbi, b2event)~educ.high+partnerhiedu+I(age/5)+age2,dist="lognormal", weights = .weights+.0001))))
survey.est<-as.data.frame(survey.fit)
survey.test<-data.frame(beta = rownames(survey.est), estimate=survey.est$theta, se.est= survey.est$SE)
survey.test$t<-survey.test$estimate/survey.test$se.est
survey.test$pval<-2*pnorm(survey.test$t,lower.tail = F )
survey.test

##           beta     estimate      se.est         t         pval
## 1  (Intercept)  3.194050837 0.175058613 18.245608 2.242280e-74
## 2    educ.high  0.241962761 0.036898053  6.557602 5.467976e-11
## 3 partnerhiedu  0.139942579 0.056140391  2.492725 1.267668e-02
## 4     I(age/5)  0.116588267 0.052618261  2.215738 2.670947e-02
## 5         age2 -0.006114498 0.003798158 -1.609859 1.892571e+00

fit.2.aft<-survreg(Surv(secbi, b2event)~educ.high+partnerhiedu+I(age/5)+age2 + age2, data=sub2[sub2$secbi>0,],dist = "lognormal")

fit.2.aft.sum<-summary(fit.2.aft)

#Compare the se's of the parameters
survey.test$se.est/sqrt(diag(fit.2.aft.sum$var[-6, -6]))

##  (Intercept)    educ.high partnerhiedu     I(age/5)         age2 
##     1.249377     1.443863     1.491965     1.244614     1.235913

#survey based errors are larger, as they should be.

Using Longitudinal Data

As in the other examples, I illustrate fitting these models to data that are longitudinal, instead of person-duration.

In this example, we will examine how to fit the parametric model to a longitudinally collected data set. Here I use data from the ECLS-K. Specifically, we will examine the transition into poverty between kindergarten and third grade.

First we load our data

load("C:/Users/ozd504//Google Drive/classes/dem7223/class17/data/eclsk.Rdata")
names(eclsk)<-tolower(names(eclsk))
#get out only the variables I'm going to use for this example
myvars<-c( "childid","gender", "race", "r1_kage","r4age", "r5age", "r6age", "r7age","c1r4mtsc", "c4r4mtsc", "c5r4mtsc", "c6r4mtsc", "c7r4mtsc", "w1povrty","w1povrty","w3povrty", "w5povrty", "w8povrty","wkmomed", "s2_id")
eclsk<-eclsk[,myvars]


eclsk$age1<-ifelse(eclsk$r1_kage==-9, NA, eclsk$r1_kage/12)
eclsk$age2<-ifelse(eclsk$r4age==-9, NA, eclsk$r4age/12)
#for the later waves, the NCES group the ages into ranges of months, so 1= <105 months, 2=105 to 108 months. So, I fix the age at the midpoint of the interval they give, and make it into years by dividing by 12
eclsk$age3<-recode(eclsk$r5age,recodes="1=105; 2=107; 3=109; 4=112; 5=115; 6=117; -9=NA")/12

eclsk$pov1<-ifelse(eclsk$w1povrty==1,1,0)
eclsk$pov2<-ifelse(eclsk$w3povrty==1,1,0)
eclsk$pov3<-ifelse(eclsk$w5povrty==1,1,0)

#Recode race with white, non Hispanic as reference using dummy vars
eclsk$hisp<-recode (eclsk$race, recodes="3:4=1;-9=NA; else=0")
eclsk$race_rec<-recode (eclsk$race, recodes="1 = 'nhwhite';2='nhblack';3:4='hispanic';5='nhasian'; 6:8='other';-9=NA", as.factor.result = T)
eclsk$male<-recode(eclsk$gender, recodes="1=1; 2=0; -9=NA")
eclsk$mlths<-recode(eclsk$wkmomed, recodes = "1:2=1; 3:9=0; else = NA")
eclsk$mgths<-recode(eclsk$wkmomed, recodes = "1:3=0; 4:9=1; else =NA")

Now, I need to form the transition variable, this is my event variable, and in this case it will be 1 if a child enters poverty between the first wave of the data and the third grade wave, and 0 otherwise. NOTE I need to remove any children who are already in poverty age wave 1, because they are not at risk of experiencing this particular transition.

eclsk<-subset(eclsk, is.na(pov1)==F&is.na(pov2)==F&is.na(pov3)==F&is.na(age1)==F&is.na(age2)==F&is.na(age3)==F&pov1!=1)

Now we do the entire data set. To analyze data longitudinally, we need to reshape the data from the current “wide” format (repeated measures in columns) to a “long” format (repeated observations in rows). The reshape() function allows us to do this easily. It allows us to specify our repeated measures, time varying covariates as well as time-constant covariates.

e.long<-reshape(eclsk, idvar="childid", varying=list(c("age1","age2"),
                                                     c("age2", "age3")),
                v.names=c("age_enter", "age_exit"),
                times=1:2, direction="long" )
e.long<-e.long[order(e.long$childid, e.long$time),]

e.long$povtran<-NA

e.long$povtran[e.long$pov1==0&e.long$pov2==1&e.long$time==1]<-1
e.long$povtran[e.long$pov2==0&e.long$pov3==1&e.long$time==2]<-1

e.long$povtran[e.long$pov1==0&e.long$pov2==0&e.long$time==1]<-0
e.long$povtran[e.long$pov2==0&e.long$pov3==0&e.long$time==2]<-0

#find which kids failed in earlier time periods and remove them from the second & third period risk set
failed1<-which(is.na(e.long$povtran)==T)
e.long<-e.long[-failed1,]


e.long$age1r<-round(e.long$age_enter, 0)
e.long$age2r<-round(e.long$age_exit, 0)
head(e.long, n=10)

##             childid gender race r1_kage r4age r5age r6age r7age c1r4mtsc
## 0001002C.1 0001002C      2    1   77.20 94.73     6     4     4   68.324
## 0001002C.2 0001002C      2    1   77.20 94.73     6     4     4   68.324
## 0001007C.1 0001007C      1    1   63.60 81.90     2     2     2   43.836
## 0001007C.2 0001007C      1    1   63.60 81.90     2     2     2   43.836
## 0001010C.1 0001010C      2    1   70.63 88.63     4     3     3   53.324
## 0001010C.2 0001010C      2    1   70.63 88.63     4     3     3   53.324
## 0002004C.1 0002004C      2    1   65.40 83.73     3     2     2   59.598
## 0002004C.2 0002004C      2    1   65.40 83.73     3     2     2   59.598
## 0002006C.1 0002006C      2    1   61.90 80.23     2     2     2   44.711
## 0002008C.1 0002008C      2    1   70.87 89.20     5     3     3   42.235
##            c4r4mtsc c5r4mtsc c6r4mtsc c7r4mtsc w1povrty w1povrty.1
## 0001002C.1   58.738   60.390   62.585   66.481        2          2
## 0001002C.2   58.738   60.390   62.585   66.481        2          2
## 0001007C.1   37.160   40.946   44.830   44.122        2          2
## 0001007C.2   37.160   40.946   44.830   44.122        2          2
## 0001010C.1   52.256   54.986   54.759   50.096        2          2
## 0001010C.2   52.256   54.986   54.759   50.096        2          2
## 0002004C.1   54.965   53.793   57.428   62.880        2          2
## 0002004C.2   54.965   53.793   57.428   62.880        2          2
## 0002006C.1   49.979   51.026   47.371   49.941        2          2
## 0002008C.1   41.269   40.119   35.635   46.600        2          2
##            w3povrty w5povrty w8povrty wkmomed s2_id pov1 pov2 pov3 hisp
## 0001002C.1        2        2        2       3  0001    0    0    0    0
## 0001002C.2        2        2        2       3  0001    0    0    0    0
## 0001007C.1        2        2        2       6  0001    0    0    0    0
## 0001007C.2        2        2        2       6  0001    0    0    0    0
## 0001010C.1        2        2        2       7  0001    0    0    0    0
## 0001010C.2        2        2        2       7  0001    0    0    0    0
## 0002004C.1        2        2        2       6  0002    0    0    0    0
## 0002004C.2        2        2        2       6  0002    0    0    0    0
## 0002006C.1        1        1        2      -1  0002    0    1    1    0
## 0002008C.1        2        2        2       4  0002    0    0    0    0
##            race_rec male mlths mgths time age_enter age_exit povtran age1r
## 0001002C.1  nhwhite    0     0     0    1  6.433333 7.894167       0     6
## 0001002C.2  nhwhite    0     0     0    2  7.894167 9.750000       0     8
## 0001007C.1  nhwhite    1     0     1    1  5.300000 6.825000       0     5
## 0001007C.2  nhwhite    1     0     1    2  6.825000 8.916667       0     7
## 0001010C.1  nhwhite    0     0     1    1  5.885833 7.385833       0     6
## 0001010C.2  nhwhite    0     0     1    2  7.385833 9.333333       0     7
## 0002004C.1  nhwhite    0     0     1    1  5.450000 6.977500       0     5
## 0002004C.2  nhwhite    0     0     1    2  6.977500 9.083333       0     7
## 0002006C.1  nhwhite    0    NA    NA    1  5.158333 6.685833       1     5
## 0002008C.1  nhwhite    0     0     1    1  5.905833 7.433333       0     6
##            age2r
## 0001002C.1     8
## 0001002C.2    10
## 0001007C.1     7
## 0001007C.2     9
## 0001010C.1     7
## 0001010C.2     9
## 0002004C.1     7
## 0002004C.2     9
## 0002006C.1     7
## 0002008C.1     7

Now we fit the models, I only show the Exponential, Weibull and PCH model fit here, but the others follow the example from above. I specify the age of the transition using an interval-censored notation to show when a child began and ended each risk period.

#Exponential
#interval censored
fitl1<-phreg(Surv(time = time, event = povtran)~mlths+mgths+race_rec, data=e.long, dist = "weibull", shape=1)
summary(fitl1)

## Call:
## phreg(formula = Surv(time = time, event = povtran) ~ mlths + 
##     mgths + race_rec, data = e.long, dist = "weibull", shape = 1)
## 
## Covariate          W.mean      Coef Exp(Coef)  se(Coef)    Wald p
## mlths               0.055     0.763     2.145     0.105     0.000 
## mgths               0.685    -0.979     0.376     0.089     0.000 
## race_rec 
##         hispanic    0.130     0         1           (reference)
##          nhasian    0.054    -0.432     0.649     0.181     0.017 
##          nhblack    0.063     0.395     1.485     0.124     0.001 
##          nhwhite    0.703    -0.885     0.413     0.101     0.000 
##            other    0.050     0.058     1.060     0.154     0.705 
## 
## log(scale)                    2.499               0.092     0.000 
## 
##  Shape is fixed at  1 
## 
## Events                    666 
## Total time at risk         19908 
## Max. log. likelihood      -2662.7 
## LR test statistic         0.70 
## Degrees of freedom        6 
## Overall p-value           0.994389

#Weibull
fitl2<-phreg(Surv(time = time, event = povtran)~mlths+mgths+race_rec, data=e.long, dist = "weibull")
summary(fitl2)

## Call:
## phreg(formula = Surv(time = time, event = povtran) ~ mlths + 
##     mgths + race_rec, data = e.long, dist = "weibull")
## 
## Covariate          W.mean      Coef Exp(Coef)  se(Coef)    Wald p
## mlths               0.055     0.802     2.230     0.105     0.000 
## mgths               0.685    -0.995     0.370     0.089     0.000 
## race_rec 
##         hispanic    0.130     0         1           (reference)
##          nhasian    0.054    -0.442     0.643     0.181     0.014 
##          nhblack    0.063     0.412     1.510     0.124     0.001 
##          nhwhite    0.703    -0.900     0.406     0.100     0.000 
##            other    0.050     0.050     1.051     0.154     0.744 
## 
## log(scale)                    1.201               0.037     0.000 
## log(shape)                    1.077               0.032     0.000 
## 
## Events                    666 
## Total time at risk         19908 
## Max. log. likelihood      -2292.1 
## LR test statistic         555.05 
## Degrees of freedom        6 
## Overall p-value           0

#Piecewise constant
fitl3<-phreg(Surv(time = time, event = povtran)~mlths+mgths+race_rec,data=e.long, dist = "pch", cuts=c(1))
summary(fitl3)

## Call:
## phreg(formula = Surv(time = time, event = povtran) ~ mlths + 
##     mgths + race_rec, data = e.long, dist = "pch", cuts = c(1))
## 
## Covariate          W.mean      Coef Exp(Coef)  se(Coef)    Wald p
## mlths               0.055     0.768     2.156     0.105     0.000 
## mgths               0.685    -0.981     0.375     0.089     0.000 
## race_rec 
##         hispanic    0.130     0         1           (reference)
##          nhasian    0.054    -0.433     0.648     0.181     0.017 
##          nhblack    0.063     0.398     1.488     0.124     0.001 
##          nhwhite    0.703    -0.887     0.412     0.101     0.000 
##            other    0.050     0.057     1.059     0.154     0.711 
## 
## 
## Events                    666 
## Total time at risk         19908 
## Max. log. likelihood      -2657.2 
## LR test statistic         535.12 
## Degrees of freedom        6 
## Overall p-value           0

#AIC for exponential
-2*fitl1$loglik[2]+2*length(fitl1$coefficients)

## [1] 5339.368

#AIC for weibull
-2*fitl2$loglik[2]+2*length(fitl2$coefficients)

## [1] 4600.18

#AIC for weibull
-2*fitl3$loglik[2]+2*length(fitl3$coefficients)

## [1] 5326.482

#Empirical (Cox)
fitle<-coxreg(Surv(time = time, event = povtran)~mlths+mgths+race_rec, data=e.long)

check.dist(fitle, fitl1, main = "Exponential")

check.dist(fitle, fitl2, main = "Weibull")

check.dist(fitle, fitl3, main = "Piecewise Exponential")

According to the AIC, the Weibull model is fitting better here.

Event History Analysis - DEM 7223 - Example 3 Parametric Hazard Models

Corey S. Sparks, Ph.D.

September 25, 2017