This example will illustrate how to construct a basic survival function from individual-level data. The example will use as its outcome variable, the event of a child dying before age 1. The data for this example come from the Demographic and Health Survey Model Data Files children’s recode file.
The DHS Program has created example datasets for users to practice with. These datasets have been created strictly for practice and do not represent any actual country’s data. See more here.
This file contains information for all births to the sample of women between the ages of 15 and 49 in the last 5 years prior to the survey.
#Example 1
library(foreign)
library(survival)
library(lattice)
model.dat<-read.dta("/Users/ozd504/Google Drive/dem7223/class2016/data/zzkr62dt/ZZKR62FL.DTA", convert.factors = F)
In the DHS, they record if a child is dead or alive and the age at death if the child is dead. This can be understood using a series of variables about each child.
If the child is alive at the time of interview, then the variable B5==1, and the age at death is censored.
If the age at death is censored, then the age at the date of interview (censored age at death) is the date of the interview - date of birth (in months).
If the child is dead at the time of interview,then the variable B5!=1, then the age at death in months is the variable B7. Here we code this:
model.dat$death.age<-ifelse(model.dat$b5==1,
((((model.dat$v008))+1900)-(((model.dat$b3))+1900))
,model.dat$b7)
#censoring indicator for death by age 1, in months (12 months)
model.dat$d.event<-ifelse(is.na(model.dat$b7)==T|model.dat$b7>12,0,1)
model.dat$d.eventfac<-factor(model.dat$d.event); levels(model.dat$d.eventfac)<-c("Alive at 1", "Dead by 1")
table(model.dat$d.eventfac)
##
## Alive at 1 Dead by 1
## 5434 534
We see 534 infant deaths among the 5968 births in the last 5 years.
To generate a basic life table, we use the survfit() procedure in the survival library. The data for this is a Surv() object, which typically has 2 arguments, the duration, and the censoring indicator. This uses age at death (the death.age variable from above) for children dying before age 1 as the outcome, and the d.event variable from above as the censoring indicator.
#Here we see the data
head(model.dat[,c("death.age","d.event")], n=20)
## death.age d.event
## 1 5 0
## 2 37 0
## 3 30 0
## 4 10 0
## 5 30 0
## 6 0 1
## 7 34 0
## 8 1 1
## 9 18 0
## 10 3 1
## 11 27 0
## 12 24 0
## 13 12 0
## 14 9 0
## 15 5 1
## 16 54 0
## 17 16 0
## 18 37 0
## 19 30 0
## 20 0 0
#The Surv() object
head(Surv(model.dat$death.age, model.dat$d.event), n=20)
## [1] 5+ 37+ 30+ 10+ 30+ 0 34+ 1 18+ 3 27+ 24+ 12+ 9+ 5 54+ 16+
## [18] 37+ 30+ 0+
In the first 20 cases from the data, several children died (no + after the time), while all the other children had not experienced the event (they were still alive at age 12 months), these have a + after their censored age at death.
mort<-survfit(Surv(death.age, d.event)~1, data=model.dat,conf.type="none")
plot(mort, ylim=c(.9,1), xlim=c(0,12), main="Survival Function for Infant Mortality")
summary(mort)
## Call: survfit(formula = Surv(death.age, d.event) ~ 1, data = model.dat,
## conf.type = "none")
##
## time n.risk n.event survival std.err
## 0 5968 209 0.965 0.00238
## 1 5690 26 0.961 0.00252
## 2 5573 37 0.954 0.00271
## 3 5423 38 0.948 0.00290
## 4 5282 25 0.943 0.00302
## 5 5161 22 0.939 0.00313
## 6 5021 26 0.934 0.00326
## 7 4880 18 0.931 0.00334
## 8 4755 26 0.926 0.00347
## 9 4621 23 0.921 0.00359
## 10 4507 5 0.920 0.00361
## 11 4405 13 0.917 0.00368
## 12 4303 66 0.903 0.00401
This is the so-called Kaplan-Meier estimate of the survival function. At each month, we see the number of children at risk and the number dying. We see the highest number of deaths occurred between 0 and 1 month, which is not surprising.
The estimate is that the infant morality rate is 82.7382814, I get this by doing 1000*(1-summary(mort)$surv[12]).
We can likewise get an estimate of the hazard function using the Kaplan-Meier method as well, using the muhaz library.
library(muhaz)
haz<-kphaz.fit(time=model.dat$death.age, status=model.dat$d.event, method = "product-limit")
kphaz.plot(haz, main="Hazard function plot")
data.frame(haz)
## time haz var
## 1 0.5 0.004617002 8.198994e-07
## 2 1.5 0.006722906 1.221621e-06
## 3 2.5 0.007109508 1.330224e-06
## 4 3.5 0.004784710 9.157877e-07
## 5 4.5 0.004326655 8.509621e-07
## 6 5.5 0.005244306 1.057883e-06
## 7 6.5 0.003732925 7.741945e-07
## 8 7.5 0.005545073 1.182716e-06
## 9 8.5 0.005037790 1.103514e-06
## 10 9.5 0.001118111 2.500496e-07
## 11 10.5 0.002993934 6.895473e-07
## 12 11.5 0.015646925 3.709910e-06
This illustrates, that while the largest drop in survivorship occurred between 0 and 1, the hazard is actually higher in the 1-3 month range, illustrating the conditionality of that probability. There is also a large jump in risk at age 1, which may indicate something about age-heaping in the data.
Now we have our S(t) and h(t) functions. We can derive the other functions of survival time from these but integrating (summing) and differentiating these functions.
#cumulative hazard
plot(cumsum(haz$haz)~haz$time,
main = "Cumulative Hazard function",
ylab="H(t)",xlab="Time in Months",
type="l",xlim=c(0,12), lwd=2,col=3)
#Survival function, I just store this in an object so I can use it
surv<-mort
#here is a cheap version of the pdf
ft<- -diff(mort$surv)
plot(ft, xlim=c(.5,11.5),
type="s",
ylab="f(t)",xlab="Time in Months",
main="Probability Density Function")
#here is the cumulative distribution function
Ft<-cumsum(ft)
plot(Ft, xlim=c(0.5,12), type="s", ylab="F(t)",xlab="Time in Months", main="Cumulative Distribution Function")
So in this example, we calculated the censored ages at death for children under age 1, we estimated the survival function, hazard and Cumulative hazard functions, and the associated pdf and cdf’s.