Basics of Survival Analysis

Author

Nirajan B

1 Survival analysis

Survival analysis models how much time elapses before an event occurs.

One of the goals of survival analysis is to estimate the probability that a subject survives without experiencing the event past some time t.

Median survival time is defined as the time t at which 50% of the population is expected to be still surviving.

2 Data for survival analysis

The simplest data structure for a typical survival analysis is:

single row per subject
a status variable coding whether the subject experienced the event or not (censored)
single time variable measuring T time to event (or censoring time, time of last observation)
variables for covariates, assumed to be time-constant in this structure

We will work with the aml dataset in the survival package. These data come from a study looking at time to death for patients with acute myelogenous leukemia, comparing “maintained” chemotherapy treatment to “nonmaintained”.

Variables:

time: survival or censoring time
status: 0 - censored, 1 - death
x: chemotherapy “maintained” or “non-maintained”

library(survival)
library(survminer)
library(broom)

# ~ 1 indicates KM survival estimtes for whole sample
head(aml)

  time status          x
1    9      1 Maintained
2   13      1 Maintained
3   13      0 Maintained
4   18      1 Maintained
5   23      1 Maintained
6   28      0 Maintained

KM <- survfit(Surv(time, status) ~ 1, data=aml)
print(KM)

Call: survfit(formula = Surv(time, status) ~ 1, data = aml)

      n events median 0.95LCL 0.95UCL
[1,] 23     18     27      18      45

KM.tab <- tidy(KM)
KM.tab

# A tibble: 18 × 8
    time n.risk n.event n.censor estimate std.error conf.high conf.low
   <dbl>  <dbl>   <dbl>    <dbl>    <dbl>     <dbl>     <dbl>    <dbl>
 1     5     23       2        0   0.913     0.0643     1       0.805 
 2     8     21       2        0   0.826     0.0957     0.996   0.685 
 3     9     19       1        0   0.783     0.110      0.971   0.631 
 4    12     18       1        0   0.739     0.124      0.942   0.580 
 5    13     17       1        1   0.696     0.138      0.912   0.531 
 6    16     15       0        1   0.696     0.138      0.912   0.531 
 7    18     14       1        0   0.646     0.157      0.878   0.475 
 8    23     13       2        0   0.547     0.196      0.803   0.372 
 9    27     11       1        0   0.497     0.218      0.762   0.324 
10    28     10       0        1   0.497     0.218      0.762   0.324 
11    30      9       1        0   0.442     0.248      0.718   0.272 
12    31      8       1        0   0.386     0.282      0.671   0.223 
13    33      7       1        0   0.331     0.321      0.622   0.177 
14    34      6       1        0   0.276     0.369      0.569   0.134 
15    43      5       1        0   0.221     0.432      0.515   0.0947
16    45      4       1        1   0.166     0.519      0.458   0.0598
17    48      2       1        0   0.0828    0.877      0.462   0.0148
18   161      1       0        1   0.0828    0.877      0.462   0.0148

Interpretation: At month 5, there is a probability of 0.913 that a subject is still alive.

2.1 Graphing the survival function

plot(KM,
     ylab = "Survival probability",
     xlab = "Months")

2.2 Comparing survival curves: Stratified Kaplan-Meier estimates

# stratify by x variable
KM.x <- survfit(Surv(time, status) ~ x, data = aml)
KM.x

Call: survfit(formula = Surv(time, status) ~ x, data = aml)

                 n events median 0.95LCL 0.95UCL
x=Maintained    11      7     31      18      NA
x=Nonmaintained 12     11     23       8      NA

tidy(KM.x)

# A tibble: 20 × 9
    time n.risk n.event n.censor estimate std.error conf.high conf.low strata   
   <dbl>  <dbl>   <dbl>    <dbl>    <dbl>     <dbl>     <dbl>    <dbl> <chr>    
 1     9     11       1        0   0.909     0.0953     1       0.754  x=Mainta…
 2    13     10       1        1   0.818     0.142      1       0.619  x=Mainta…
 3    18      8       1        0   0.716     0.195      1       0.488  x=Mainta…
 4    23      7       1        0   0.614     0.249      0.999   0.377  x=Mainta…
 5    28      6       0        1   0.614     0.249      0.999   0.377  x=Mainta…
 6    31      5       1        0   0.491     0.334      0.946   0.255  x=Mainta…
 7    34      4       1        0   0.368     0.442      0.875   0.155  x=Mainta…
 8    45      3       0        1   0.368     0.442      0.875   0.155  x=Mainta…
 9    48      2       1        0   0.184     0.834      0.944   0.0359 x=Mainta…
10   161      1       0        1   0.184     0.834      0.944   0.0359 x=Mainta…
11     5     12       2        0   0.833     0.129      1       0.647  x=Nonmai…
12     8     10       2        0   0.667     0.204      0.995   0.447  x=Nonmai…
13    12      8       1        0   0.583     0.244      0.941   0.362  x=Nonmai…
14    16      7       0        1   0.583     0.244      0.941   0.362  x=Nonmai…
15    23      6       1        0   0.486     0.305      0.883   0.268  x=Nonmai…
16    27      5       1        0   0.389     0.378      0.816   0.185  x=Nonmai…
17    30      4       1        0   0.292     0.476      0.741   0.115  x=Nonmai…
18    33      3       1        0   0.194     0.627      0.664   0.0569 x=Nonmai…
19    43      2       1        0   0.0972    0.945      0.620   0.0153 x=Nonmai…
20    45      1       1        0   0       Inf         NA      NA      x=Nonmai…

plot(KM.x,
     ylab = "Survival probability",
     xlab = "Months",
     conf.int = T, col = c("red","blue")) # red is for maintained, comes first

2.2.1 Customizable plot with ggsurvplot

ggsurvplot(KM.x, conf.int = T)

ggsurvplot(KM.x, conf.int = T,
           risk.table = T)

2.2.2 Comparing survival functions with survdiff()

Null: survival curves across 2 or more groups are equivalent

Alternative: survival curve across 2 or more groups are not equivalent

The log-rank statistic is one popular method to evaluate this hypothesis. Under the null, the log-rank statistic is chi-squared distributed with g-1 degrees of freedom.

# log-rank test, default is rho = 0
survdiff(Surv(time, status) ~ x, data = aml)

Call:
survdiff(formula = Surv(time, status) ~ x, data = aml)

                 N Observed Expected (O-E)^2/E (O-E)^2/V
x=Maintained    11        7    10.69      1.27       3.4
x=Nonmaintained 12       11     7.31      1.86       3.4

 Chisq= 3.4  on 1 degrees of freedom, p= 0.07

rho = 0 is the default, which is the log-rank test (treats all timepoints equally)

rho = 1, the weights equal the survival estimate itself, equivalent to the Gehan-Wilcoxon test

earlier timepoints are weighted more heavily (might make sense for death after surgery, for example)

rho values between 0 and 1 are valid, with values closer to 1 putting more weight on earlier time points

survdiff(Surv(time, status) ~ x, data = aml, rho = 1)

Call:
survdiff(formula = Surv(time, status) ~ x, data = aml, rho = 1)

                 N Observed Expected (O-E)^2/E (O-E)^2/V
x=Maintained    11     3.85     6.14     0.859      2.78
x=Nonmaintained 12     7.18     4.88     1.081      2.78

 Chisq= 2.8  on 1 degrees of freedom, p= 0.1

2.3 The Cox proportional hazards model

Instead of estimating the survival function S(t) directly, the Cox proportional hazards model estimates changes to the hazard function, h(t).

The Cox model can estimate the effects of multiple predictors on the hazard function. By far, the most popular method for survival analysis.

no distribution is assumed for survival times
naturally accomodates right-censoring and time-varying covariates
can be extended in many ways
- time-varying coefficients
- random effect frailties for recurrent events or clustering
- competing risks modeling

As an example with a single covariate, X1, image that X1 is a treatment variable, with values 1 for treatment and 0 for control.

exp(b1) is the hazard ratio comparing the hazard for treatment to controls.

HR = 0.25 means that that treatment has 1/4 (25%) the hazard of control, or a 75% decrease. With a lower hazard rate, treatment will have fewer expected events and thus better survival.
HR = 2 here means that treatment has twice the hazard of control, or a 100% increase, and thus worse survival.

In general, exp(b1) expresses the hazard ratio for a 1-unit increase in the associated covariate.

b1 itself is the log-hazard ratio.

The standard Cox model assumes proportional hazards, which means that the effects of covariates are constant over time. For example, in our simple Cox model for a treatment effect, proportional hazards means that the effect of treatment does not change over time.

We will be using the lung dataset from the survival package for Cox modeling.

Variables:

time: survival time in days
status: 1 = censored, 2 = dead
age: age in years (assessed at begining)
sex: 1 = male, 2 = female,
wt.loss: weight loss (pounds) in last 6 months

Research question is whether age, sex, and weight loss affect the hazard (and therefore, survival). We estimate the effects of each variables on the hazard.

lung.cox <- coxph(Surv(time, status) ~ age + sex + wt.loss, data = lung)
summary(lung.cox)

Call:
coxph(formula = Surv(time, status) ~ age + sex + wt.loss, data = lung)

  n= 214, number of events= 152 
   (14 observations deleted due to missingness)

              coef  exp(coef)   se(coef)      z Pr(>|z|)   
age      0.0200882  1.0202913  0.0096644  2.079   0.0377 * 
sex     -0.5210319  0.5939074  0.1743541 -2.988   0.0028 **
wt.loss  0.0007596  1.0007599  0.0061934  0.123   0.9024   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

        exp(coef) exp(-coef) lower .95 upper .95
age        1.0203     0.9801    1.0011    1.0398
sex        0.5939     1.6838    0.4220    0.8359
wt.loss    1.0008     0.9992    0.9887    1.0130

Concordance= 0.612  (se = 0.027 )
Likelihood ratio test= 14.67  on 3 df,   p=0.002
Wald test            = 13.98  on 3 df,   p=0.003
Score (logrank) test = 14.24  on 3 df,   p=0.003

Interpretation

For each additional year of age at baseline, the hazard increases by 2.03% or by a factor of 1.0203.
Females have 60% the hazard of males, or a 40% decrease.
For each additional pound of weight loss, the hazard increases by 0.08%.

We want concordance closer to 1. Those test are omnibus test. Testing that all coefficients are zero. If > 0.05, not a good model.

# saving summarized results as data.frame
lung.cox.tab <- tidy(lung.cox, exponentiate = T, conf.int = T)
lung.cox.tab

# A tibble: 3 × 7
  term    estimate std.error statistic p.value conf.low conf.high
  <chr>      <dbl>     <dbl>     <dbl>   <dbl>    <dbl>     <dbl>
1 age        1.02    0.00966     2.08  0.0377     1.00      1.04 
2 sex        0.594   0.174      -2.99  0.00280    0.422     0.836
3 wt.loss    1.00    0.00619     0.123 0.902      0.989     1.01

# plot of hazard ratios and 95% CIs
ggplot(lung.cox.tab, 
       aes(y=term, x=estimate, xmin=conf.low, xmax=conf.high)) + 
  geom_pointrange() +  # plots center point (x) and range (xmin, xmax)
  geom_vline(xintercept=1, color="red") + # vertical line at HR=1
  labs(x="hazard ratio", title="Hazard ratios and 95% CIs") +
  theme_classic()

2.3.1 Predicting survival with Cox estimates

If no covariate values are supplied to survfit(), then a survival function will be estimated for a subject with mean values on all model covariates.

# predict survival function for subject with means on all covariates
surv.at.means <- survfit(lung.cox) 

#table of survival function
tidy(surv.at.means)

# A tibble: 179 × 8
    time n.risk n.event n.censor estimate std.error conf.high conf.low
   <dbl>  <dbl>   <dbl>    <dbl>    <dbl>     <dbl>     <dbl>    <dbl>
 1     5    214       1        0    0.996   0.00443     1        0.987
 2    11    213       2        0    0.987   0.00772     1        0.972
 3    12    211       1        0    0.982   0.00894     1.00     0.965
 4    13    210       2        0    0.973   0.0110      0.995    0.953
 5    15    208       1        0    0.969   0.0120      0.992    0.946
 6    26    207       1        0    0.964   0.0128      0.989    0.940
 7    30    206       1        0    0.960   0.0137      0.986    0.935
 8    31    205       1        0    0.955   0.0145      0.983    0.929
 9    53    204       2        0    0.946   0.0160      0.976    0.917
10    54    202       1        0    0.942   0.0167      0.973    0.912
# ℹ 169 more rows

# plot of predicted survival for subject at means of covariates
plot(surv.at.means, xlab="days", ylab="survival probability")

it is recommended to always supply a data.frame of covariate values at which to predict the survival function to the newdata= option of survfit().

# create new data for plotting: 1 row for each sex
#  and mean age and wt.loss for both rows
plotdata <- data.frame(age=mean(lung$age),
                       sex=1:2,
                       wt.loss=mean(lung$wt.loss, na.rm=T))

# look at new data
plotdata

       age sex  wt.loss
1 62.44737   1 9.831776
2 62.44737   2 9.831776

# get survival function estimates for each sex
surv.by.sex <- survfit(lung.cox, newdata=plotdata) # one function for each sex

# tidy results
tidy(surv.by.sex)

# A tibble: 179 × 12
    time n.risk n.event n.censor estimate.1 estimate.2 std.error.1 std.error.2
   <dbl>  <dbl>   <dbl>    <dbl>      <dbl>      <dbl>       <dbl>       <dbl>
 1     5    214       1        0      0.995      0.997     0.00546     0.00327
 2    11    213       2        0      0.984      0.990     0.00953     0.00577
 3    12    211       1        0      0.978      0.987     0.0111      0.00674
 4    13    210       2        0      0.967      0.980     0.0137      0.00844
 5    15    208       1        0      0.962      0.977     0.0148      0.00921
 6    26    207       1        0      0.956      0.974     0.0159      0.00995
 7    30    206       1        0      0.951      0.971     0.0170      0.0107 
 8    31    205       1        0      0.945      0.967     0.0180      0.0113 
 9    53    204       2        0      0.934      0.960     0.0199      0.0127 
10    54    202       1        0      0.929      0.957     0.0208      0.0133 
# ℹ 169 more rows
# ℹ 4 more variables: conf.high.1 <dbl>, conf.high.2 <dbl>, conf.low.1 <dbl>,
#   conf.low.2 <dbl>

Note that we now have estimate.1 and estimate.2 for males and females separately.

# plot survival estimates
plot(surv.by.sex, xlab="days", ylab="survival probability",
     conf.int=T, col=c("blue", "red"))

# data= is the same data used in survfit()
#  censor=F removes censoring symbols
ggsurvplot(surv.by.sex, data=plotdata, censor=F,
           legend.labs=c("male", "female"))

2.4 Assessing the proportional hazards assumption

Several methods have been developed to assess the proportional hazards assumption of the Cox model. Here we discuss 2 tools developed by Grambsch and Therneau (1994).

A chi-square test based on Schoenfeld residuals is available with cox.zph() to test the hypothesis:

Null: covariate effect is constant (proportional) over time

Alternative: covariate effect changes over time

The null hypothesis of proportional hazards is tested for each covariate individually and jointly as well.

cox.zph(lung.cox)

         chisq df    p
age     0.5077  1 0.48
sex     2.5489  1 0.11
wt.loss 0.0144  1 0.90
GLOBAL  3.0051  3 0.39

No strong evidence of violation of proportional hazards for any covariate, though some suggestion that sex may violate.

Another tool used to assess proportional hazards is a plot of a smoothed curve over the Schoenfeld residuals. Proportional hazards is indicated by flat line. Note that coefficients are being plotted.

plot(cox.zph(lung.cox))

The effect of sex seems to be strongest at the beginning of follow-up, but then trends toward zero as time passes.

2.5 Dealing with proportional hazards violations

Many strategies have been proposed to account for violation of PH. We discuss two here:

stratify by the non-PH variable
add an interaction of the non-PH variable with time to the model

Stratified Cox Model

lung.strat.sex <- coxph(Surv(time, status) ~ age + wt.loss + strata(sex), data=lung)
summary(lung.strat.sex)

Call:
coxph(formula = Surv(time, status) ~ age + wt.loss + strata(sex), 
    data = lung)

  n= 214, number of events= 152 
   (14 observations deleted due to missingness)

             coef exp(coef)  se(coef)     z Pr(>|z|)  
age     0.0192190 1.0194049 0.0096226 1.997   0.0458 *
wt.loss 0.0001412 1.0001412 0.0062509 0.023   0.9820  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

        exp(coef) exp(-coef) lower .95 upper .95
age         1.019     0.9810     1.000     1.039
wt.loss     1.000     0.9999     0.988     1.012

Concordance= 0.561  (se = 0.027 )
Likelihood ratio test= 4.09  on 2 df,   p=0.1
Wald test            = 3.99  on 2 df,   p=0.1
Score (logrank) test = 4  on 2 df,   p=0.1

We see no coefficients for sex, the stratification variable. Although the estimates have changed a bit, our inferences regarding wt.loss (and age) are similar to those from the model where we assume PH for sex.

Modeling time-varying coefficients

The Cox model can be extended to allow the effects of a covariate (coefficients) to change over time, by interacting that covariate with some function of time.

Specify the nonPH covariate both by itself and inside of tt() in the model formula

Below we specify an interaction of sex with time itself, so that the effect of sex is allowed to change linearly with time. Why linearly? Some guidance from the plot of Schoenfeld residuals.

# notice sex and tt(sex) in model formula
lung.sex.by.time <- coxph(Surv(time, status) ~ age + wt.loss + sex + tt(sex), # sex and tt(sex) in formula
                          data=lung,
                          tt=function(x,t,...) x*t) # linear change in effect of sex
summary(lung.sex.by.time)

Call:
coxph(formula = Surv(time, status) ~ age + wt.loss + sex + tt(sex), 
    data = lung, tt = function(x, t, ...) x * t)

  n= 214, number of events= 152 
   (14 observations deleted due to missingness)

              coef  exp(coef)   se(coef)      z Pr(>|z|)   
age      0.0194343  1.0196244  0.0096522  2.013   0.0441 * 
wt.loss  0.0001260  1.0001261  0.0062502  0.020   0.9839   
sex     -0.9417444  0.3899470  0.3224791 -2.920   0.0035 **
tt(sex)  0.0013778  1.0013787  0.0008581  1.606   0.1084   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

        exp(coef) exp(-coef) lower .95 upper .95
age        1.0196     0.9808    1.0005    1.0391
wt.loss    1.0001     0.9999    0.9879    1.0125
sex        0.3899     2.5645    0.2073    0.7337
tt(sex)    1.0014     0.9986    0.9997    1.0031

Concordance= 0.613  (se = 0.027 )
Likelihood ratio test= 17.23  on 4 df,   p=0.002
Wald test            = 15.86  on 4 df,   p=0.003
Score (logrank) test = 16.44  on 4 df,   p=0.002

The coef estimated for sex is the log-hazard-ratio at day t = 0 and the value is -0.942, corresponding to a hazard ratio of 0.39. The coef for tt(sex) is the change in the log-hazard-ratio for each additional day that passes.

These estimates match the graph of the smoothed Schoenfeld residuals for sex. At the beginning of follow-up, the coefficient is close to -1, and it increases gradually over time.

Again, our inferences regarding wt.loss are mostly unchanged.

2.6 Time-varying covariates

The Cox PH model easily accommodates time-varying covariates, but the data should be structured in start-stop time format:

each subject can have multiple rows of data
2 time variables are required to record the beginning and end of time intervals
The status variable records the event status as the end at each interval
- for single event data, this can only be the last interval
If no time gaps, the start time of an interval will be the stop time of the previous interval
The time when a covariate changes value should be recorded as the beginning of a new interval

The survival package provides a function tmerge() to help get your data in this format. See vignette(timedep) for guidance on its usage.

In the jasa1 dataset, transplant is a time-varying covariate.

head(jasa1)

    id start stop event transplant        age      year surgery
1    1     0   49     1          0 -17.155373 0.1232033       0
2    2     0    5     1          0   3.835729 0.2546201       0
102  3     0   15     1          1   6.297057 0.2655715       0
3    4     0   35     0          0  -7.737166 0.4900753       0
103  4    35   38     1          1  -7.737166 0.4900753       0
4    5     0   17     1          0 -27.214237 0.6078029       0

jasa1.cox <- coxph(Surv(start, stop, event) ~ transplant + age + surgery, data=jasa1)
summary(jasa1.cox)

Call:
coxph(formula = Surv(start, stop, event) ~ transplant + age + 
    surgery, data = jasa1)

  n= 170, number of events= 75 

               coef exp(coef) se(coef)      z Pr(>|z|)  
transplant  0.01405   1.01415  0.30822  0.046   0.9636  
age         0.03055   1.03103  0.01389  2.199   0.0279 *
surgery    -0.77326   0.46150  0.35966 -2.150   0.0316 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

           exp(coef) exp(-coef) lower .95 upper .95
transplant    1.0142     0.9860    0.5543    1.8555
age           1.0310     0.9699    1.0033    1.0595
surgery       0.4615     2.1668    0.2280    0.9339

Concordance= 0.599  (se = 0.036 )
Likelihood ratio test= 10.72  on 3 df,   p=0.01
Wald test            = 9.68  on 3 df,   p=0.02
Score (logrank) test = 10  on 3 df,   p=0.02

We can follow with the same procedures as before, checking PH assumptions and plotting predicted survival curves.

# check PH assumptions
cox.zph(jasa1.cox)

           chisq df    p
transplant 0.118  1 0.73
age        0.897  1 0.34
surgery    0.097  1 0.76
GLOBAL     1.363  3 0.71

# plot predicted survival by transplant group at mean age and surgery=0
plotdata <- data.frame(transplant=0:1, age=-2.48, surgery=0)
surv.by.transplant <- survfit(jasa1.cox, newdata=plotdata)
ggsurvplot(surv.by.transplant, data=plotdata) # remember to supply data to ggsurvplot() for predicted survival after coxph()

# vignette("survival")
# vignette("timedep")