library(survival)
library(survminer)
library(broom)Basics of Survival Analysis
1 Survival analysis
Survival analysis models how much time elapses before an event occurs.
One of the goals of survival analysis is to estimate the probability that a subject survives without experiencing the event past some time t.
Median survival time is defined as the time t at which 50% of the population is expected to be still surviving.
2 Data for survival analysis
The simplest data structure for a typical survival analysis is:
single row per subject
a status variable coding whether the subject experienced the event or not (censored)
single time variable measuring T time to event (or censoring time, time of last observation)
variables for covariates, assumed to be time-constant in this structure
We will work with the aml dataset in the survival package. These data come from a study looking at time to death for patients with acute myelogenous leukemia, comparing “maintained” chemotherapy treatment to “nonmaintained”.
Variables:
time: survival or censoring time
status: 0 - censored, 1 - death
x: chemotherapy “maintained” or “non-maintained”
# ~ 1 indicates KM survival estimtes for whole sample
head(aml) time status x
1 9 1 Maintained
2 13 1 Maintained
3 13 0 Maintained
4 18 1 Maintained
5 23 1 Maintained
6 28 0 Maintained
KM <- survfit(Surv(time, status) ~ 1, data=aml)
print(KM)Call: survfit(formula = Surv(time, status) ~ 1, data = aml)
n events median 0.95LCL 0.95UCL
[1,] 23 18 27 18 45
KM.tab <- tidy(KM)
KM.tab# A tibble: 18 × 8
time n.risk n.event n.censor estimate std.error conf.high conf.low
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 5 23 2 0 0.913 0.0643 1 0.805
2 8 21 2 0 0.826 0.0957 0.996 0.685
3 9 19 1 0 0.783 0.110 0.971 0.631
4 12 18 1 0 0.739 0.124 0.942 0.580
5 13 17 1 1 0.696 0.138 0.912 0.531
6 16 15 0 1 0.696 0.138 0.912 0.531
7 18 14 1 0 0.646 0.157 0.878 0.475
8 23 13 2 0 0.547 0.196 0.803 0.372
9 27 11 1 0 0.497 0.218 0.762 0.324
10 28 10 0 1 0.497 0.218 0.762 0.324
11 30 9 1 0 0.442 0.248 0.718 0.272
12 31 8 1 0 0.386 0.282 0.671 0.223
13 33 7 1 0 0.331 0.321 0.622 0.177
14 34 6 1 0 0.276 0.369 0.569 0.134
15 43 5 1 0 0.221 0.432 0.515 0.0947
16 45 4 1 1 0.166 0.519 0.458 0.0598
17 48 2 1 0 0.0828 0.877 0.462 0.0148
18 161 1 0 1 0.0828 0.877 0.462 0.0148
Interpretation: At month 5, there is a probability of 0.913 that a subject is still alive.
2.1 Graphing the survival function
plot(KM,
ylab = "Survival probability",
xlab = "Months")2.2 Comparing survival curves: Stratified Kaplan-Meier estimates
# stratify by x variable
KM.x <- survfit(Surv(time, status) ~ x, data = aml)
KM.xCall: survfit(formula = Surv(time, status) ~ x, data = aml)
n events median 0.95LCL 0.95UCL
x=Maintained 11 7 31 18 NA
x=Nonmaintained 12 11 23 8 NA
tidy(KM.x)# A tibble: 20 × 9
time n.risk n.event n.censor estimate std.error conf.high conf.low strata
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 9 11 1 0 0.909 0.0953 1 0.754 x=Mainta…
2 13 10 1 1 0.818 0.142 1 0.619 x=Mainta…
3 18 8 1 0 0.716 0.195 1 0.488 x=Mainta…
4 23 7 1 0 0.614 0.249 0.999 0.377 x=Mainta…
5 28 6 0 1 0.614 0.249 0.999 0.377 x=Mainta…
6 31 5 1 0 0.491 0.334 0.946 0.255 x=Mainta…
7 34 4 1 0 0.368 0.442 0.875 0.155 x=Mainta…
8 45 3 0 1 0.368 0.442 0.875 0.155 x=Mainta…
9 48 2 1 0 0.184 0.834 0.944 0.0359 x=Mainta…
10 161 1 0 1 0.184 0.834 0.944 0.0359 x=Mainta…
11 5 12 2 0 0.833 0.129 1 0.647 x=Nonmai…
12 8 10 2 0 0.667 0.204 0.995 0.447 x=Nonmai…
13 12 8 1 0 0.583 0.244 0.941 0.362 x=Nonmai…
14 16 7 0 1 0.583 0.244 0.941 0.362 x=Nonmai…
15 23 6 1 0 0.486 0.305 0.883 0.268 x=Nonmai…
16 27 5 1 0 0.389 0.378 0.816 0.185 x=Nonmai…
17 30 4 1 0 0.292 0.476 0.741 0.115 x=Nonmai…
18 33 3 1 0 0.194 0.627 0.664 0.0569 x=Nonmai…
19 43 2 1 0 0.0972 0.945 0.620 0.0153 x=Nonmai…
20 45 1 1 0 0 Inf NA NA x=Nonmai…
plot(KM.x,
ylab = "Survival probability",
xlab = "Months",
conf.int = T, col = c("red","blue")) # red is for maintained, comes first2.2.1 Customizable plot with ggsurvplot
ggsurvplot(KM.x, conf.int = T)ggsurvplot(KM.x, conf.int = T,
risk.table = T)2.2.2 Comparing survival functions with survdiff()
Null: survival curves across 2 or more groups are equivalent
Alternative: survival curve across 2 or more groups are not equivalent
The log-rank statistic is one popular method to evaluate this hypothesis. Under the null, the log-rank statistic is chi-squared distributed with g-1 degrees of freedom.
# log-rank test, default is rho = 0
survdiff(Surv(time, status) ~ x, data = aml)Call:
survdiff(formula = Surv(time, status) ~ x, data = aml)
N Observed Expected (O-E)^2/E (O-E)^2/V
x=Maintained 11 7 10.69 1.27 3.4
x=Nonmaintained 12 11 7.31 1.86 3.4
Chisq= 3.4 on 1 degrees of freedom, p= 0.07
rho = 0 is the default, which is the log-rank test (treats all timepoints equally)
rho = 1, the weights equal the survival estimate itself, equivalent to the Gehan-Wilcoxon test
- earlier timepoints are weighted more heavily (might make sense for death after surgery, for example)
rho values between 0 and 1 are valid, with values closer to 1 putting more weight on earlier time points
survdiff(Surv(time, status) ~ x, data = aml, rho = 1)Call:
survdiff(formula = Surv(time, status) ~ x, data = aml, rho = 1)
N Observed Expected (O-E)^2/E (O-E)^2/V
x=Maintained 11 3.85 6.14 0.859 2.78
x=Nonmaintained 12 7.18 4.88 1.081 2.78
Chisq= 2.8 on 1 degrees of freedom, p= 0.1
2.3 The Cox proportional hazards model
Instead of estimating the survival function S(t) directly, the Cox proportional hazards model estimates changes to the hazard function, h(t).
The Cox model can estimate the effects of multiple predictors on the hazard function. By far, the most popular method for survival analysis.
no distribution is assumed for survival times
naturally accomodates right-censoring and time-varying covariates
can be extended in many ways
time-varying coefficients
random effect frailties for recurrent events or clustering
competing risks modeling
As an example with a single covariate, X1, image that X1 is a treatment variable, with values 1 for treatment and 0 for control.
exp(b1) is the hazard ratio comparing the hazard for treatment to controls.
HR = 0.25 means that that treatment has 1/4 (25%) the hazard of control, or a 75% decrease. With a lower hazard rate, treatment will have fewer expected events and thus better survival.
HR = 2 here means that treatment has twice the hazard of control, or a 100% increase, and thus worse survival.
In general, exp(b1) expresses the hazard ratio for a 1-unit increase in the associated covariate.
b1 itself is the log-hazard ratio.
The standard Cox model assumes proportional hazards, which means that the effects of covariates are constant over time. For example, in our simple Cox model for a treatment effect, proportional hazards means that the effect of treatment does not change over time.
We will be using the lung dataset from the survival package for Cox modeling.
Variables:
time: survival time in days
status: 1 = censored, 2 = dead
age: age in years (assessed at begining)
sex: 1 = male, 2 = female,
wt.loss: weight loss (pounds) in last 6 months
Research question is whether age, sex, and weight loss affect the hazard (and therefore, survival). We estimate the effects of each variables on the hazard.
lung.cox <- coxph(Surv(time, status) ~ age + sex + wt.loss, data = lung)
summary(lung.cox)Call:
coxph(formula = Surv(time, status) ~ age + sex + wt.loss, data = lung)
n= 214, number of events= 152
(14 observations deleted due to missingness)
coef exp(coef) se(coef) z Pr(>|z|)
age 0.0200882 1.0202913 0.0096644 2.079 0.0377 *
sex -0.5210319 0.5939074 0.1743541 -2.988 0.0028 **
wt.loss 0.0007596 1.0007599 0.0061934 0.123 0.9024
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
exp(coef) exp(-coef) lower .95 upper .95
age 1.0203 0.9801 1.0011 1.0398
sex 0.5939 1.6838 0.4220 0.8359
wt.loss 1.0008 0.9992 0.9887 1.0130
Concordance= 0.612 (se = 0.027 )
Likelihood ratio test= 14.67 on 3 df, p=0.002
Wald test = 13.98 on 3 df, p=0.003
Score (logrank) test = 14.24 on 3 df, p=0.003
Interpretation
For each additional year of age at baseline, the hazard increases by 2.03% or by a factor of 1.0203.
Females have 60% the hazard of males, or a 40% decrease.
For each additional pound of weight loss, the hazard increases by 0.08%.
We want concordance closer to 1. Those test are omnibus test. Testing that all coefficients are zero. If > 0.05, not a good model.
# saving summarized results as data.frame
lung.cox.tab <- tidy(lung.cox, exponentiate = T, conf.int = T)
lung.cox.tab# A tibble: 3 × 7
term estimate std.error statistic p.value conf.low conf.high
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 age 1.02 0.00966 2.08 0.0377 1.00 1.04
2 sex 0.594 0.174 -2.99 0.00280 0.422 0.836
3 wt.loss 1.00 0.00619 0.123 0.902 0.989 1.01
# plot of hazard ratios and 95% CIs
ggplot(lung.cox.tab,
aes(y=term, x=estimate, xmin=conf.low, xmax=conf.high)) +
geom_pointrange() + # plots center point (x) and range (xmin, xmax)
geom_vline(xintercept=1, color="red") + # vertical line at HR=1
labs(x="hazard ratio", title="Hazard ratios and 95% CIs") +
theme_classic()2.3.1 Predicting survival with Cox estimates
If no covariate values are supplied to survfit(), then a survival function will be estimated for a subject with mean values on all model covariates.
# predict survival function for subject with means on all covariates
surv.at.means <- survfit(lung.cox)
#table of survival function
tidy(surv.at.means)# A tibble: 179 × 8
time n.risk n.event n.censor estimate std.error conf.high conf.low
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 5 214 1 0 0.996 0.00443 1 0.987
2 11 213 2 0 0.987 0.00772 1 0.972
3 12 211 1 0 0.982 0.00894 1.00 0.965
4 13 210 2 0 0.973 0.0110 0.995 0.953
5 15 208 1 0 0.969 0.0120 0.992 0.946
6 26 207 1 0 0.964 0.0128 0.989 0.940
7 30 206 1 0 0.960 0.0137 0.986 0.935
8 31 205 1 0 0.955 0.0145 0.983 0.929
9 53 204 2 0 0.946 0.0160 0.976 0.917
10 54 202 1 0 0.942 0.0167 0.973 0.912
# ℹ 169 more rows
# plot of predicted survival for subject at means of covariates
plot(surv.at.means, xlab="days", ylab="survival probability")it is recommended to always supply a data.frame of covariate values at which to predict the survival function to the newdata= option of survfit().
# create new data for plotting: 1 row for each sex
# and mean age and wt.loss for both rows
plotdata <- data.frame(age=mean(lung$age),
sex=1:2,
wt.loss=mean(lung$wt.loss, na.rm=T))
# look at new data
plotdata age sex wt.loss
1 62.44737 1 9.831776
2 62.44737 2 9.831776
# get survival function estimates for each sex
surv.by.sex <- survfit(lung.cox, newdata=plotdata) # one function for each sex
# tidy results
tidy(surv.by.sex)# A tibble: 179 × 12
time n.risk n.event n.censor estimate.1 estimate.2 std.error.1 std.error.2
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 5 214 1 0 0.995 0.997 0.00546 0.00327
2 11 213 2 0 0.984 0.990 0.00953 0.00577
3 12 211 1 0 0.978 0.987 0.0111 0.00674
4 13 210 2 0 0.967 0.980 0.0137 0.00844
5 15 208 1 0 0.962 0.977 0.0148 0.00921
6 26 207 1 0 0.956 0.974 0.0159 0.00995
7 30 206 1 0 0.951 0.971 0.0170 0.0107
8 31 205 1 0 0.945 0.967 0.0180 0.0113
9 53 204 2 0 0.934 0.960 0.0199 0.0127
10 54 202 1 0 0.929 0.957 0.0208 0.0133
# ℹ 169 more rows
# ℹ 4 more variables: conf.high.1 <dbl>, conf.high.2 <dbl>, conf.low.1 <dbl>,
# conf.low.2 <dbl>
Note that we now have estimate.1 and estimate.2 for males and females separately.
# plot survival estimates
plot(surv.by.sex, xlab="days", ylab="survival probability",
conf.int=T, col=c("blue", "red"))# data= is the same data used in survfit()
# censor=F removes censoring symbols
ggsurvplot(surv.by.sex, data=plotdata, censor=F,
legend.labs=c("male", "female")) 2.4 Assessing the proportional hazards assumption
Several methods have been developed to assess the proportional hazards assumption of the Cox model. Here we discuss 2 tools developed by Grambsch and Therneau (1994).
A chi-square test based on Schoenfeld residuals is available with cox.zph() to test the hypothesis:
Null: covariate effect is constant (proportional) over time
Alternative: covariate effect changes over time
The null hypothesis of proportional hazards is tested for each covariate individually and jointly as well.
cox.zph(lung.cox) chisq df p
age 0.5077 1 0.48
sex 2.5489 1 0.11
wt.loss 0.0144 1 0.90
GLOBAL 3.0051 3 0.39
No strong evidence of violation of proportional hazards for any covariate, though some suggestion that sex may violate.
Another tool used to assess proportional hazards is a plot of a smoothed curve over the Schoenfeld residuals. Proportional hazards is indicated by flat line. Note that coefficients are being plotted.
plot(cox.zph(lung.cox))The effect of sex seems to be strongest at the beginning of follow-up, but then trends toward zero as time passes.
2.5 Dealing with proportional hazards violations
Many strategies have been proposed to account for violation of PH. We discuss two here:
stratify by the non-PH variable
add an interaction of the non-PH variable with time to the model
Stratified Cox Model
lung.strat.sex <- coxph(Surv(time, status) ~ age + wt.loss + strata(sex), data=lung)
summary(lung.strat.sex)Call:
coxph(formula = Surv(time, status) ~ age + wt.loss + strata(sex),
data = lung)
n= 214, number of events= 152
(14 observations deleted due to missingness)
coef exp(coef) se(coef) z Pr(>|z|)
age 0.0192190 1.0194049 0.0096226 1.997 0.0458 *
wt.loss 0.0001412 1.0001412 0.0062509 0.023 0.9820
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
exp(coef) exp(-coef) lower .95 upper .95
age 1.019 0.9810 1.000 1.039
wt.loss 1.000 0.9999 0.988 1.012
Concordance= 0.561 (se = 0.027 )
Likelihood ratio test= 4.09 on 2 df, p=0.1
Wald test = 3.99 on 2 df, p=0.1
Score (logrank) test = 4 on 2 df, p=0.1
We see no coefficients for sex, the stratification variable. Although the estimates have changed a bit, our inferences regarding wt.loss (and age) are similar to those from the model where we assume PH for sex.
Modeling time-varying coefficients
The Cox model can be extended to allow the effects of a covariate (coefficients) to change over time, by interacting that covariate with some function of time.
- Specify the nonPH covariate both by itself and inside of
tt()in the model formula
Below we specify an interaction of sex with time itself, so that the effect of sex is allowed to change linearly with time. Why linearly? Some guidance from the plot of Schoenfeld residuals.
# notice sex and tt(sex) in model formula
lung.sex.by.time <- coxph(Surv(time, status) ~ age + wt.loss + sex + tt(sex), # sex and tt(sex) in formula
data=lung,
tt=function(x,t,...) x*t) # linear change in effect of sex
summary(lung.sex.by.time)Call:
coxph(formula = Surv(time, status) ~ age + wt.loss + sex + tt(sex),
data = lung, tt = function(x, t, ...) x * t)
n= 214, number of events= 152
(14 observations deleted due to missingness)
coef exp(coef) se(coef) z Pr(>|z|)
age 0.0194343 1.0196244 0.0096522 2.013 0.0441 *
wt.loss 0.0001260 1.0001261 0.0062502 0.020 0.9839
sex -0.9417444 0.3899470 0.3224791 -2.920 0.0035 **
tt(sex) 0.0013778 1.0013787 0.0008581 1.606 0.1084
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
exp(coef) exp(-coef) lower .95 upper .95
age 1.0196 0.9808 1.0005 1.0391
wt.loss 1.0001 0.9999 0.9879 1.0125
sex 0.3899 2.5645 0.2073 0.7337
tt(sex) 1.0014 0.9986 0.9997 1.0031
Concordance= 0.613 (se = 0.027 )
Likelihood ratio test= 17.23 on 4 df, p=0.002
Wald test = 15.86 on 4 df, p=0.003
Score (logrank) test = 16.44 on 4 df, p=0.002
The coef estimated for sex is the log-hazard-ratio at day t = 0 and the value is -0.942, corresponding to a hazard ratio of 0.39. The coef for tt(sex) is the change in the log-hazard-ratio for each additional day that passes.
These estimates match the graph of the smoothed Schoenfeld residuals for sex. At the beginning of follow-up, the coefficient is close to -1, and it increases gradually over time.
Again, our inferences regarding wt.loss are mostly unchanged.
2.6 Time-varying covariates
The Cox PH model easily accommodates time-varying covariates, but the data should be structured in start-stop time format:
each subject can have multiple rows of data
2 time variables are required to record the beginning and end of time intervals
The status variable records the event status as the end at each interval
- for single event data, this can only be the last interval
If no time gaps, the start time of an interval will be the stop time of the previous interval
The time when a covariate changes value should be recorded as the beginning of a new interval
The survival package provides a function tmerge() to help get your data in this format. See vignette(timedep) for guidance on its usage.
In the jasa1 dataset, transplant is a time-varying covariate.
head(jasa1) id start stop event transplant age year surgery
1 1 0 49 1 0 -17.155373 0.1232033 0
2 2 0 5 1 0 3.835729 0.2546201 0
102 3 0 15 1 1 6.297057 0.2655715 0
3 4 0 35 0 0 -7.737166 0.4900753 0
103 4 35 38 1 1 -7.737166 0.4900753 0
4 5 0 17 1 0 -27.214237 0.6078029 0
jasa1.cox <- coxph(Surv(start, stop, event) ~ transplant + age + surgery, data=jasa1)
summary(jasa1.cox)Call:
coxph(formula = Surv(start, stop, event) ~ transplant + age +
surgery, data = jasa1)
n= 170, number of events= 75
coef exp(coef) se(coef) z Pr(>|z|)
transplant 0.01405 1.01415 0.30822 0.046 0.9636
age 0.03055 1.03103 0.01389 2.199 0.0279 *
surgery -0.77326 0.46150 0.35966 -2.150 0.0316 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
exp(coef) exp(-coef) lower .95 upper .95
transplant 1.0142 0.9860 0.5543 1.8555
age 1.0310 0.9699 1.0033 1.0595
surgery 0.4615 2.1668 0.2280 0.9339
Concordance= 0.599 (se = 0.036 )
Likelihood ratio test= 10.72 on 3 df, p=0.01
Wald test = 9.68 on 3 df, p=0.02
Score (logrank) test = 10 on 3 df, p=0.02
We can follow with the same procedures as before, checking PH assumptions and plotting predicted survival curves.
# check PH assumptions
cox.zph(jasa1.cox) chisq df p
transplant 0.118 1 0.73
age 0.897 1 0.34
surgery 0.097 1 0.76
GLOBAL 1.363 3 0.71
# plot predicted survival by transplant group at mean age and surgery=0
plotdata <- data.frame(transplant=0:1, age=-2.48, surgery=0)
surv.by.transplant <- survfit(jasa1.cox, newdata=plotdata)
ggsurvplot(surv.by.transplant, data=plotdata) # remember to supply data to ggsurvplot() for predicted survival after coxph()# vignette("survival")
# vignette("timedep")