MSBA: Data Analytics III
Time-to-event data that consists of a distinct start time and end time. These data are common in the medical field.
Examples from cancer:
Time-to-event data is common in many fields including, but not limited to:
Because survival analysis is common in many other fields, it also goes by other names:
The two most common questions I encounter related to survival analysis are:
In survival analysis it is common for the exact event time to be unknown, or unobserved, which is called censoring. A subject may be censored due to:
Specifically these are examples of right censoring. Other common types of censoring include:
When the exact event time is unknown then some patients are censored, and survival analysis methods are required.
In this example, how would we compute the proportion who are event-free at 10 years?
Toy example of a Kaplan-Meier curve for this simple data (details to follow):
Case study: musicians and mortality
Conclusion: Musical genre is associated with early death among musicians.
Problem: this graph does not account for the right-censored nature of the data.
For subject \(i\):
Event indicator \(\delta_i\):
Observed time \(Y_i = \min(T_i, C_i)\)
Organ Transplantation is limited by the supply of cadaveric and living donors.
Unfortunately, many people die while waiting on the list.
What types of characteristics lead to an increase or decrease likelihood of dying on the waiting list?
data("transplant")
transplant %>%
head
## age sex abo year futime event ## 1 47 m B 1994 1197 death ## 2 55 m A 1991 28 ltx ## 3 52 m B 1996 85 ltx ## 4 40 f O 1995 231 ltx ## 5 70 m O 1996 1271 censored ## 6 66 f O 1996 58 ltx
Most functions used in survival analysis will also require a binary indicator of event that is:
Currently our data example contains a factor variable indicating whehter the patient has died or if they have received a liver transpant, been withdrawn, or are censored.
transplant$delta<-0 transplant$delta[transplant$event=="death"]<-1
Recall the questions of interest:
The Kaplan-Meier method is the most common way to estimate survival times and probabilities. It is a non-parametric approach that results in a step function, where there is a step down each time an event occurs.
Surv function from the survival package creates a survival object for use as the response in a model formula. There will be one entry for each subject that is the survival time, which is followed by a + if the subject was censored. Let's look at the first 10 observations:survival::Surv(transplant$futime, transplant$delta)[1:10]
## [1] 1197 28+ 85+ 231+ 1271+ 58+ 392+ 30+ 12+ 139+
survival::survfit function creates survival curves based on a formula. Let's generate the overall survival curve for the entire cohort, assign it to object f1, and look at the names of that object:f1 <- survfit(Surv(transplant$futime, transplant$delta) ~ 1, data = transplant) names(f1)
## [1] "n" "time" "n.risk" "n.event" "n.censor" ## [6] "surv" "type" "std.err" "lower" "upper" ## [11] "conf.type" "conf.int" "call"
Some key components of this survfit object that will be used to create survival curves include:
time, which contains the start and endpoints of each time intervalsurv, which contains the survival probability corresponding to each timeNow we plot the survfit object in base R to get the Kaplan-Meier plot:
plot(f1,
xlab = "Days",
ylab = "Overall survival probability")
R shows the step function (solid line) with associated confidence intervals (dotted lines). Note that the tick marks for censored patients are not shown by default, but could be added using mark.time = TRUEOne quantity often of interest in a survival analysis is the probability of surviving a certain number (\(x\)) of years.
For example, to estimate the probability of survivng to 5 years, use summary with the times argument:
summary(f1, times = 5*365)
## Call: survfit(formula = Surv(transplant$futime, transplant$delta) ~ ## 1, data = transplant) ## ## time n.risk n.event survival std.err lower 95% CI upper 95% CI ## 1825 2 66 0.49 0.206 0.215 1
We find that the 5-year probability of survival in this study is 49%. The associated lower and upper bounds of the 95% confidence interval are also displayed.
What happens if you use a "naive" estimate?
66 of the 815 patients died by 5 years so:
\[\Big(1 - \frac{66}{815}\Big) \times 100 = 92\%\]
You get an incorrect estimate of the 5-year probability of survival when you ignore the fact that 747 patients were censored before 5 years.
Recall the correct estimate of the 5-year probability of survival was 49%.
Another quantity often of interest in a survival analysis is the average survival time, which we quantify using the median (survival times are not expected to be normally distributed so the mean is not an appropriate summary).
We can obtain this directly from our survfit object:
survival::survfit(survival::Surv(futime, delta) ~ 1, data = transplant)
## Call: survfit(formula = survival::Surv(futime, delta) ~ 1, data = transplant) ## ## n events median 0.95LCL 0.95UCL ## 815 66 1422 1422 NA
We see the median survival time is 3.9 years. The lower and upper bounds of the 95% confidence interval are also displayed.
What happens if you use a "naive" estimate?
Summarize the median survival time among the 66 patients who died:
transplant$futime[transplant$delta == 1] %>%
median
## [1] 65.5
You get an incorrect estimate of median survival time of 0.2 years when you ignore the fact that censored patients also contribute follow-up time.
Recall the correct estimate of median survival time is 3.9 years.
Is there a difference in survival probability between groups?
From our example: does the probability of survival differ according to gender among liver transplant patients?
We can add a covariate to the right-hand side of the survival::survfit object to obtain a stratified Kaplan-Meier plot.
Let's also look at some other customization we can do with survminer::ggsurvplot.
survminer::ggsurvplot(
fit = survival::survfit(survival::Surv(futime, delta) ~ abo, data = transplant),
xlab = "Days",
ylab = "Overall survival probability",
legend.title = "Blood Type",
legend.labs = c("A", "B", "AB","O"),
break.x.by = 100,
censor = FALSE,
risk.table = TRUE,
risk.table.y.text = FALSE)
As before, we can get an estimate of, for example, 5-year survival by using summary with the times argument in our survival::survfit object:
summary(survfit(Surv(futime, delta) ~ sex, data = transplant),
times = 5*365)
## Call: survfit(formula = Surv(futime, delta) ~ sex, data = transplant) ## ## sex=m ## time n.risk n.event survival std.err ## 1825.000 2.000 42.000 0.672 0.106 ## lower 95% CI upper 95% CI ## 0.493 0.915 ## ## sex=f ## time n.risk n.event survival std.err lower 95% CI upper 95% CI
We get the log-rank p-value using the survival::survdiff function:
survival::survdiff(Surv(futime, delta) ~ sex, data = transplant)
## Call: ## survival::survdiff(formula = Surv(futime, delta) ~ sex, data = transplant) ## ## N Observed Expected (O-E)^2/E (O-E)^2/V ## sex=m 447 42 36.4 0.861 1.93 ## sex=f 368 24 29.6 1.059 1.93 ## ## Chisq= 1.9 on 1 degrees of freedom, p= 0.2
And we see that the p-value is .2, indicating no significant difference in overall survival according to gender.
We may want to quantify an effect size for a single variable, or include more than one variable into a regression model to account for the effects of multiple variables.
The Cox regression model is a semi-parametric model that can be used to fit univariable and multivariable regression models that have survival outcomes.
Some key assumptions of the model:
We can fit regression models for survival data using the survival::coxph function, which takes a survival::Surv object on the left hand side and has standard syntax for regression formulas in R on the right hand side.
survival::coxph(survival::Surv(futime, delta) ~ factor(abo), data = transplant)
We can see a tidy version of the output using the tidy function from the broom package:
broom::tidy(survival::coxph(survival::Surv(futime, delta) ~ factor(abo),
data = transplant))
## # A tibble: 3 x 7 ## term estimate std.error statistic p.value conf.low conf.high ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 factor(abo)B 0.183 0.386 0.474 0.636 -0.573 0.938 ## 2 factor(abo)AB 0.195 0.618 0.316 0.752 -1.02 1.41 ## 3 factor(abo)O -0.0110 0.284 -0.0388 0.969 -0.568 0.546
The quantity of interest from a Cox regression model is a hazard ratio (HR).
If you have a regression parameter \(\beta\) (from column estimate in our survival::coxph) then HR = \(\exp(\beta)\).
For example, from our example we obtain the regression parameter \(\beta_1=0.195\) for AB vs B blood type, so we have HR = \(\exp(\beta_1)=1.21\).
A HR < 1 indicates reduced hazard of death whereas a HR > 1 indicates an increased hazard of death.
So we would say that a person with AB blood type has 1.21 times increased hazard of death as compared to a person with A blood type.
survival package provides tools for survival analysis, including the Surv and survfit functionssurvminer package allows for customization of Kaplan-Meier plots based on ggplot2survival::survdiffsurvival::coxph# First let's import the data
q = read.csv('https://3jd8gl2iires146kaw2hgqy9-wpengine.netdna-ssl.com/wp-content/uploads/2018/01/turnover.csv', header = TRUE, sep = ",", na.strings = c("",NA))
Our case uses data of 1785 employees.
$ exp - length of employment in the company
$ event - event (1 - terminated, 0 - currently employed)
$ branch - branch
$ pipeline - source of recruitment
Please note that the data is already prepared for survival analysis. Moreover, length of employment is counted in months up to two decimal places, according to the following formula: (date fire - date hire) / (365.25 / 12).
w = coxph(Surv(exp, event) ~ 1 , data = q) w1 = coxph(Surv(exp, event) ~ . , data = q) summary(w1)
## Call: ## coxph(formula = Surv(exp, event) ~ ., data = q) ## ## n= 1785, number of events= 1004 ## ## coef exp(coef) se(coef) z Pr(>|z|) ## branchfirst -0.08927 0.91460 0.16419 -0.544 0.58664 ## branchfourth -0.52913 0.58912 0.17576 -3.010 0.00261 ** ## branchsecond -0.42416 0.65432 0.16378 -2.590 0.00960 ** ## branchthird -0.46644 0.62723 0.21000 -2.221 0.02634 * ## pipelineea -0.21381 0.80750 0.26715 -0.800 0.42351 ## pipelinejs 0.50318 1.65398 0.20076 2.506 0.01220 * ## pipelineref 0.32156 1.37927 0.21409 1.502 0.13310 ## pipelinesm 0.36136 1.43528 0.25807 1.400 0.16144 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## exp(coef) exp(-coef) lower .95 upper .95 ## branchfirst 0.9146 1.0934 0.6629 1.2618 ## branchfourth 0.5891 1.6975 0.4174 0.8314 ## branchsecond 0.6543 1.5283 0.4747 0.9020 ## branchthird 0.6272 1.5943 0.4156 0.9466 ## pipelineea 0.8075 1.2384 0.4784 1.3631 ## pipelinejs 1.6540 0.6046 1.1160 2.4514 ## pipelineref 1.3793 0.7250 0.9066 2.0984 ## pipelinesm 1.4353 0.6967 0.8655 2.3802 ## ## Concordance= 0.579 (se = 0.01 ) ## Rsquare= 0.037 (max possible= 1 ) ## Likelihood ratio test= 67.28 on 8 df, p=2e-11 ## Wald test = 63.26 on 8 df, p=1e-10 ## Score (logrank) test = 64.85 on 8 df, p=5e-11
An accelerated failure-time (AFT) model is a parametric model with covariates and failure times following the survival function of the form \(S(x|Z) = S_0 (x exp[\theta Z])\), where \(S_0\) is a function for the baseline survival rate. The term \(exp[\theta Z]\) is called the acceleration factor.
The AFT model uses covariates to place individuals on different time scales { note the scaling by the covariates in \(S(t|Z)\) via \(exp[\theta Z]\). The AFT model can be rewritten in a log-linear form, where the log of failure time (call this logX) is linearly related to the mean \(\mu\), the acceleration factor, and an error term \(\sigma W\): \[log X = \theta'Z+\sigma W\]
There are serveral AFT models depending on the assumption about the distribution of the error \(W\).
| Distribution | df | Included in Survival |
|---|---|---|
| Exponential | 1 | yes |
| Weibull | 2 | yes |
| lognormal | 2 | yes |
| log logistic | 2 | yes |
| generalized gamma | 3 | no |
The Weibull distribution has a survival function equal to \[S(t)=exp(-(\lambda t)^\rho)\]
and a hazard function equal to \[\Lambda(t)=\rho \lambda(\lambda t)^{\rho-1}\]
where \(\lambda>0\) and \(\rho>0\). A special case of the Weibul functions is the exponential distribution when \(\rho=1\).
If \(\rho > 1\), then the risk increases over time.
If \(\rho < 1\), then the risk decreases over time.
The survial package allows us to estimate all of these models. We will use the same model with a small change to estimate a Weibul AFT model.
w2 = survreg(Surv(exp, event) ~ . , data = q, dist = "weibul") w3 = survreg(Surv(exp, event) ~ . , data = q, dist = "exponential") stargazer::stargazer(w2,w3,type="text",style = "qje")
## ## ====================================================== ## exp ## Weibull exponential ## (1) (2) ## ------------------------------------------------------ ## branchfirst 0.142 (0.147) 0.119 (0.164) ## branchfourth 0.494*** (0.157) 0.524*** (0.176) ## branchsecond 0.433*** (0.146) 0.441*** (0.163) ## branchthird 0.434** (0.189) 0.469** (0.210) ## pipelineea 0.228 (0.239) 0.239 (0.267) ## pipelinejs -0.529*** (0.180) -0.537*** (0.200) ## pipelineref -0.383** (0.192) -0.372* (0.213) ## pipelinesm -0.376 (0.232) -0.386 (0.258) ## Constant 1.831*** (0.220) 1.860*** (0.245) ## N 1,785 1,785 ## Log Likelihood -2,712.426 -2,721.241 ## chi2 (df = 8) 74.713*** 69.134*** ## ====================================================== ## Notes: ***Significant at the 1 percent level. ## **Significant at the 5 percent level. ## *Significant at the 10 percent level.
The exponential model assumes a constant effect of time.
We can perform a likelihood ratio test between the exponential and Weibul model to see if AFT is neccessary.
library(lmtest) lrtest(w2,w3)
## Likelihood ratio test ## ## Model 1: Surv(exp, event) ~ branch + pipeline ## Model 2: Surv(exp, event) ~ branch + pipeline ## #Df LogLik Df Chisq Pr(>Chisq) ## 1 10 -2712.4 ## 2 9 -2721.2 -1 17.63 2.683e-05 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Although there is little difference in the parameter estimates, the likelihood ratio test suggest we should use the Weibul model over the exponential.