This example will illustrate how to construct a basic survival function from individual-level data. The example will use as its outcome variable, the event of a child dying before age 1. The data for this example come from the Haitian Demographic and Health Survey for 2012 children’s recode file. This file contains information for all births in the last 5 years prior to the survey.
#Example 1
library(foreign)
library(survival)
## Loading required package: splines
library(lattice)
haiti<-read.dta("/Users/ozd504/Google Drive/dem7223/data//HTKR61FL.DTA", convert.factors = F)
In the DHS, they record if a child is dead or alive and the age at death if the child is dead. If the child is alive at the time of interview, B5==1, then the age at death is censored. If the age at death is censored, then the age at the date of interview (censored age at death) is the date of the interview - date of birth (in months). If the child is dead at the time of interview, B5!=1, then the age at death in months is the B7. Here we code this:
haiti$death.age<-ifelse(haiti$b5==1,
((((haiti$v008))+1900)-(((haiti$b3))+1900))
,haiti$b7)
#censoring indicator for death by age 1, in months (12 months)
haiti$d.event<-ifelse(is.na(haiti$b7)==T|haiti$b7>12,0,1)
haiti$d.eventfac<-factor(haiti$d.event); levels(haiti$d.eventfac)<-c("Alive at 1", "Dead by 1")
table(haiti$d.eventfac)
##
## Alive at 1 Dead by 1
## 6819 428
We see 428 infant deaths among the 7247 births in the last 5 years.
To generate a basic life table, we use the survfit() procedure in the survival library. The data for this is a Surv() object, which typically has 2 arguments, the duration, and the censoring indicator. This uses death by age 1 as the outcome
#Here we see the data
head(Surv(haiti$death.age, haiti$d.event), n=20)
## [1] 15+ 43+ 24+ 8+ 0 25+ 20+ 23+ 47+ 37+ 6+ 34+ 20+ 20+ 3+ 15+ 31+
## [18] 16+ 51+ 53+
In the first 20 cases from the data, the 5th child died basically when it was less than a month old, while all the other children had not experienced the event (they were still alive at age 12 months), these have a + after their censored age at death.
mort<-survfit(Surv(death.age, d.event)~1, data=haiti,conf.type="none")
plot(mort, ylim=c(.9,1), xlim=c(0,14), main="Survival Function for Infant Mortality")
summary(mort)
## Call: survfit(formula = Surv(death.age, d.event) ~ 1, data = haiti,
## conf.type = "none")
##
## time n.risk n.event survival std.err
## 0 7247 232 0.968 0.00207
## 1 6955 22 0.965 0.00216
## 2 6829 19 0.962 0.00224
## 3 6693 23 0.959 0.00234
## 4 6527 16 0.957 0.00240
## 5 6358 15 0.954 0.00247
## 6 6183 26 0.950 0.00258
## 7 6027 20 0.947 0.00267
## 8 5868 10 0.946 0.00271
## 9 5718 16 0.943 0.00278
## 10 5561 3 0.942 0.00280
## 11 5447 8 0.941 0.00283
## 12 5334 18 0.938 0.00292
This is the so-called Kaplan-Meier estimate of the survival function. At each month, we see the number of children at risk and the number dying. We see the highest number of deaths occurred between 0 and 1 month, which is not surprising. The estimate is that just over 6% of children in the sample died before reaching their first birthday. The Worldbank lists the 2012 IMR for Haiti as 56/1000, which is remarkably close to our estimate of 62 .
We can likewise get an estimate of the hazard function using the Kaplan-Meier method as well, using the muhaz library.
library(muhaz)
haz<-kphaz.fit(time=haiti$death.age, status=haiti$d.event, method = "product-limit")
kphaz.plot(haz, main="Hazard function plot")
Now we have our S(t) and h(t) functions. We can derive the other functions of survival time from these but integrating (summing) and differentiating these functions.
#cumulative hazard
plot(cumsum(haz$haz)~haz$time, main = "Cumulative Hazard function", ylab="H(t)",xlab="Time in Months", type="l",xlim=c(0,14), lwd=2,col=3)
#Survival function, I just store this in an object so I can use it
surv<-mort
#here is a cheap version of the pdf
ft<--diff(mort$surv)
plot(ft, xlim=c(0,12), type="s",ylab="f(t)",xlab="Time in Months", main="Probability Density Function")
#here is the cumulative distribution function
Ft<-cumsum(ft)
plot(Ft, xlim=c(0,12), type="s", ylab="F(t)",xlab="Time in Months", main="Cumulative Distribution Function")
So in this example, we calculated the censored ages at death for children under age 1, we estimated the survival function, hazard and Cumulative hazard functions, and the associated pdf and cdf’s.