Survival or event history data have special functions of the duration aspect. Remember, survival or event history data are generally represented by a dyad of information, so the outcome has two parts. The first part being the actual time duration, and the second being the censoring, or event indicator, which indicates whether the event of interest was observed or not at that time point.
In life tables, we had lots of functions of the death process. Some of these were more interesting than others, with two being of special interest to use here. These are the \(l(x)\) and \(q(x, n)\) functions. If you recall, \(l(x)\) represents the population size of the stationary population that is alive at age \(x\), and the risk of dying between age \(x, x+n\) is \(q(x, n)\).
These are genearlized more in the event history analysis literature, but we can still describe the distrubion of survival time using three functions. These are the Survival Function, \(S(t)\), the probability density function, \(f(t)\), and the hazard function, \(h(t)\). These three are related and we can derive one from the others.
If we take statistical notation for a while, then we can say that our duration variable can be a discrete or continuous random variable, \(T\), where we observe \(t_i\).
The probability density, \(f(t)\) is the probability that an individual observation falls within a short interval (if we are using a discrete function), such that \(f(t) = Pr(T = t_i)\), and the cumulative distribution function, or CDF of \(t\) is \(F(t) = \sum_t^{t + \delta_t} Pr(T \le t) = \sum_t^{t + \delta_t} f(t)\). This tells us the probability of seeing a time before a given value of \(T\).
From the CDF, we can get the pdf by differentiating it.
\(f(t) = \frac {dF(t)}{d(t)} = F(t)'\) or \(f(t) =\lim_{\Delta_t \to 0} \frac {F(t+ \Delta_t) - F(t)}{d(t)}\)
This give us the unconditional probablity of failing in a very small interval of time.
The Survival function, \(S(t)\) is the completment of the CDF, \(S(t) = 1- F(t) = Pr(T \geq t)\), and is the probabilty of surviving longer than a given time point. At time 0, \(S(t)= 1\) and at time = \(\infty\), \(S(t)= 0\), so \(S(t)\) is a strictly decreasing function of time. Take-away - everything dies.
The hazard function, \(h(t)\) is the conditional probability of experienceing the event at time \(t\). It is related to the other functions as: \(h(t) = \frac{f(t)}{S(t)}\). So, it’s the probability of dying at time t, conditional on surviving to that time.
It can also be written: \(h(t) = \frac{Pr(t \leq T \leq t+\Delta_t | T\geq t)}{\Delta_t}\)
It is also called the Instantaneous failure rate, and is of interest because it tells you how likely it is something is going to happen at a given time. If we have people with different values of a predictor, \(x\), then we can rewrite this as:
\(h(t) = \frac{Pr(t \leq T \leq t+\Delta_t | T\geq t, x)}{\Delta_t}\)
We can also integrate the hazard function to get the summed risk of failure up to a particular time, this is called the Cumulative Hazard Function, \(H(t)\), \(H(t) = \int_0^t h(t) dt\)
This example will illustrate how to construct a basic survival function from individual-level data. The example will use as its outcome variable, the event of a child dying before age 1. The data for this example come from the Demographic and Health Survey Model Data Files children’s recode file.
The DHS Program has created example datasets for users to practice with. These datasets have been created strictly for practice and do not represent any actual country’s data. See more here.
This file contains information for all births to the sample of women between the ages of 15 and 49 in the last 5 years prior to the survey.
#Example 1
library(haven)
library(survival)
library(car)
## Loading required package: carData
library(muhaz)
model.dat<-read_dta("~/Google Drive/classes/dem7223/dem7223_19/data/zzkr62dt/ZZKR62FL.DTA")
##Event - Infant Mortality In the DHS, they record if a child is dead or alive and the age at death if the child is dead. This can be understood using a series of variables about each child.
If the child is alive at the time of interview, then the variable B5==1, and the age at death is censored.
If the age at death is censored, then the age at the date of interview (censored age at death) is the date of the interview - date of birth (in months).
If the child is dead at the time of interview,then the variable B5!=1, then the age at death in months is the variable B7. Here we code this:
model.dat$death.age<-ifelse(model.dat$b5==1,
((((model.dat$v008))+1900)-(((model.dat$b3))+1900))
,model.dat$b7)
#censoring indicator for death by age 1, in months (12 months)
model.dat$d.event<-ifelse(is.na(model.dat$b7)==T|model.dat$b7>12,0,1)
model.dat$d.eventfac<-factor(model.dat$d.event)
levels(model.dat$d.eventfac)<-c("Alive at 1", "Dead by 1")
table(model.dat$d.eventfac)
##
## Alive at 1 Dead by 1
## 5434 534
We see 534 infant deaths among the 5968 births in the last 5 years.
##Example of Estimating Survival Time Functions from data## To generate a basic life table, we use the survfit() procedure in the survival library. The data for this is a Surv() object, which typically has 2 arguments, the duration, and the censoring indicator. This uses age at death (the death.age variable from above) for children dying before age 1 as the outcome, and the d.event variable from above as the censoring indicator.
#Here we see the data
head(model.dat[,c("death.age","d.event")], n=20)
#The Surv() object
head(Surv(model.dat$death.age, model.dat$d.event), n=20)
## [1] 5+ 37+ 30+ 10+ 30+ 0 34+ 1 18+ 3 27+ 24+ 12+ 9+ 5 54+ 16+
## [18] 37+ 30+ 0+
In the first 20 cases from the data, several children died (no + after the time), while all the other children had not experienced the event (they were still alive at age 12 months), these have a + after their censored age at death.
mort<-survfit(Surv(time=death.age, event = d.event)~1, data=model.dat)
summary(mort)
## Call: survfit(formula = Surv(time = death.age, event = d.event) ~ 1,
## data = model.dat)
##
## time n.risk n.event survival std.err lower 95% CI upper 95% CI
## 0 5968 209 0.965 0.00238 0.960 0.970
## 1 5690 26 0.961 0.00252 0.956 0.966
## 2 5573 37 0.954 0.00271 0.949 0.960
## 3 5423 38 0.948 0.00290 0.942 0.953
## 4 5282 25 0.943 0.00302 0.937 0.949
## 5 5161 22 0.939 0.00313 0.933 0.945
## 6 5021 26 0.934 0.00326 0.928 0.941
## 7 4880 18 0.931 0.00334 0.924 0.937
## 8 4755 26 0.926 0.00347 0.919 0.932
## 9 4621 23 0.921 0.00359 0.914 0.928
## 10 4507 5 0.920 0.00361 0.913 0.927
## 11 4405 13 0.917 0.00368 0.910 0.924
## 12 4303 66 0.903 0.00401 0.895 0.911
library(survminer)
## Loading required package: ggplot2
## Loading required package: ggpubr
## Loading required package: magrittr
library(ggplot2)
ggsurvplot(mort, conf.int = T, risk.table = T
, title="Survival to Age 1",
xlim=c(0, 12), ylim=c(.9,1))
This is the so-called Kaplan-Meier estimate of the survival function. At each month, we see the number of children at risk and the number dying. We see the highest number of deaths occurred between 0 and 1 month, which is not surprising.
The estimate is that the infant morality rate is 82.7382814, I get this by doing 1000*(1-summary(mort)$surv[12]).
We can likewise get an estimate of the hazard function using the Kaplan-Meier method as well, using the muhaz library.
haz<-kphaz.fit(time=model.dat$death.age, status=model.dat$d.event, method = "product-limit")
kphaz.plot(haz, main="Hazard function plot")
data.frame(haz)
This illustrates, that while the largest drop in survivorship occurred between 0 and 1, the hazard is actually higher in the 1-3 month range, illustrating the conditionality of that probability. There is also a large jump in risk at age 1, which may indicate something about age-heaping in the data.
Now we have our S(t) and h(t) functions. We can derive the other functions of survival time from these but integrating (summing) and differentiating these functions.
#cumulative hazard
plot(cumsum(haz$haz)~haz$time,
main = "Cumulative Hazard function",
ylab="H(t)",xlab="Time in Months",
type="l",xlim=c(0,12), lwd=2,col=3)
#Survival function, I just store this in an object so I can use it
surv<-mort
#here is a cheap version of the pdf
ft<- -diff(mort$surv)
plot(ft, xlim=c(.5,11.5),
type="s",
ylab="f(t)",xlab="Time in Months",
main="Probability Density Function")
#here is the cumulative distribution function
Ft<-cumsum(ft)
plot(Ft, xlim=c(0.5,12), type="s", ylab="F(t)",xlab="Time in Months", main="Cumulative Distribution Function")
So in this example, we calculated the censored ages at death for children under age 1, we estimated the survival function, hazard and Cumulative hazard functions, and the associated pdf and cdf’s.
#Using Longitudinal Data In this example, we will examine how to calculate the survival function for a longitudinally collected data set. Here I use data from the ECLS-K. Specifically, we will examine the transition into poverty between kindergarten and third grade.
First we load our data
load("~/Google Drive/classes/dem7223/dem7223_19/data/eclsk_k5.Rdata")
names(eclskk5)<-tolower(names(eclskk5))
#get out only the variables I'm going to use for this example
myvars<-c( "childid","x_chsex_r", "x_raceth_r", "x1kage_r","x4age", "x5age", "x6age", "x7age", "x2povty","x4povty_i", "x6povty_i", "x8povty_i","x12par1ed_i", "s2_id")
eclskk5<-eclskk5[,myvars]
eclskk5$age1<-ifelse(eclskk5$x1kage_r==-9, NA, eclskk5$x1kage_r/12)
eclskk5$age2<-ifelse(eclskk5$x4age==-9, NA, eclskk5$x4age/12)
#for the later waves, the NCES group the ages into ranges of months, so 1= <105 months, 2=105 to 108 months. So, I fix the age at the midpoint of the interval they give, and make it into years by dividing by 12
eclskk5$age3<-ifelse(eclskk5$x5age==-9, NA, eclskk5$x5age/12)
eclskk5$pov1<-ifelse(eclskk5$x2povty==1,1,0)
eclskk5$pov2<-ifelse(eclskk5$x4povty_i==1,1,0)
eclskk5$pov3<-ifelse(eclskk5$x6povty_i==1,1,0)
#Recode race with white, non Hispanic as reference using dummy vars
eclskk5$race_rec<-recode (eclskk5$x_raceth_r, recodes="1 = 'nhwhite';2='nhblack';3:4='hispanic';5='nhasian'; 6:8='other';-9=NA", as.factor = T)
eclskk5$male<-recode(eclskk5$x_chsex_r, recodes="1=1; 2=0; -9=NA")
eclskk5$mlths<-recode(eclskk5$x12par1ed_i, recodes = "1:2=1; 3:9=0; else = NA")
eclskk5$mgths<-recode(eclskk5$x12par1ed_i, recodes = "1:3=0; 4:9=1; else =NA")
Now, I need to form the transition variable, this is my event variable, and in this case it will be 1 if a child enters poverty between the first wave of the data and the third grade wave, and 0 otherwise.
NOTE I need to remove any children who are already in poverty age wave 1, because they are not at risk of experiencing this particular transition. Again, this is called forming the risk set
eclskk5<-subset(eclskk5, is.na(pov1)==F&is.na(pov2)==F&is.na(pov3)==F&is.na(age1)==F&is.na(age2)==F&is.na(age3)==F&pov1!=1)
eclskk5$povtran1<-ifelse(eclskk5$pov1==0&eclskk5$pov2==0, 0,1)
eclskk5$povtran2<-ifelse(eclskk5$povtran1==1, NA,
ifelse(eclskk5$pov2==0&eclskk5$pov3==0,0,1)
)
Now we do the entire data set. To analyze data longitudinally, we need to reshape the data from the current “wide” format (repeated measures in columns) to a “long” format (repeated observations in rows). The reshape() function allows us to do this easily. It allows us to specify our repeated measures, time varying covariates as well as time-constant covariates.
e.long<-reshape(data.frame(eclskk5), idvar="childid", varying=list(c("age1","age2"),
c("age2", "age3"),
c("povtran1", "povtran2")),
v.names=c("age_enter", "age_exit","povtrans"),
times=1:2, direction="long" )
e.long<-e.long[order(e.long$childid, e.long$time),]
#find which kids failed in the first time period and remove them from the second risk period risk set
failed1<-which(is.na(e.long$povtrans)==T)
e.long<-e.long[-failed1,]
e.long$age1r<-round(e.long$age_enter, 0)
e.long$age2r<-round(e.long$age_exit, 0)
head(e.long, n=10)
So, this shows us the repeated measures nature of the longitudinal data set.
#poverty transition .
fit<-survfit(Surv(time = time, event = povtrans)~1, e.long)
fit
## Call: survfit(formula = Surv(time = time, event = povtrans) ~ 1, data = e.long)
##
## n events median 0.95LCL 0.95UCL
## 3995 225 NA NA NA
summary(fit)
## Call: survfit(formula = Surv(time = time, event = povtrans) ~ 1, data = e.long)
##
## time n.risk n.event survival std.err lower 95% CI upper 95% CI
## 1 3995 149 0.963 0.00300 0.957 0.969
## 2 1923 76 0.925 0.00516 0.915 0.935
This displays the number of poverty transitions between each of the waves.
library(survminer)
library(ggplot2)
ggsurvplot(fit, data=e.long, risk.table = T, title="Survival function for poverty transition, K-5th Grade", ylim=c(.9,1))
ggsurvplot(fit, data=e.long, risk.table = T, fun="cumhaz", title="Cumulative Hazard function for poverty transition, K-5th Grade")
Which, again shows us that at least two of these groups are different from one another.
library(muhaz)
haz<-kphaz.fit(time=e.long$time, status=e.long$povtrans, method = "product-limit")
haz
## $time
## [1] 1.5
##
## $haz
## [1] 0.3642165
##
## $var
## [1] NaN
kphaz.plot(haz, main="Hazard function plot")
data.frame(haz)
This illustrates, that while the largest drop in survivorship occurred between 0 and 1, the hazard is actually higher in the 1-3 month range, illustrating the conditionality of that probability. There is also a large jump in risk at age 1, which may indicate something about age-heaping in the data.
Now we have our S(t) and h(t) functions. We can derive the other functions of survival time from these but integrating (summing) and differentiating these functions.
#cumulative hazard
plot(cumsum(haz$haz)~haz$time,
main = "Cumulative Hazard function",
ylab="H(t)",xlab="Time in Months",
type="l", lwd=2,col=3)
#Survival function, I just store this in an object so I can use it
surv<-fit
#here is a cheap version of the pdf
ft<- -diff(fit$surv)
plot(ft,
type="s",
ylab="f(t)",xlab="Time in Months",
main="Probability Density Function")
#here is the cumulative distribution function
Ft<-cumsum(ft)
plot(Ft, type="s", ylab="F(t)",xlab="Time ", main="Cumulative Distribution Function")