Survival Analysis in R

Mark Bounthavong

2/7/2022; updated: 02/13/2023

This tutorial is located on RPubs.

The entire R Markdown code is located on my GitHub page.

For this tutorial, you will need the following packages: survival, dplyr, psych, survminer, gmodels, and gtsummary.
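If any of these packages are missing, one way to install and load them all in a single step is a small helper like the following (just a convenience sketch; adjust to your own setup):

### Install any missing packages, then load them all
pkgs <- c("survival", "dplyr", "psych", "survminer", "gmodels", "gtsummary")
missing <- pkgs[!pkgs %in% rownames(installed.packages())]  ## packages not yet installed
if (length(missing) > 0) install.packages(missing)
invisible(lapply(pkgs, library, character.only = TRUE))     ## attach each package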

Introduction

A dichotomous variable has two outcomes (Yes or No, Survives or Dies, Cured or Uncured) and is coded as 0 for \(No\) and 1 for \(Yes\). We can compare the differences in the probability or risk of having the event (outcome = 1) between two groups using a relative risk ratio or odds ratio. This gives us information about the size of the effect, the direction of the effect, and the uncertainty surrounding the effect (e.g., 95% confidence interval). However, we do not have information on how long subjects were alive or at risk, which is critical information from a survival point of view. Hence, a different kind of analysis is needed to handle this issue: survival analysis.

Survival analysis is also known as a time-to-event analysis. This is a special type of analysis that takes into consideration when the event occurred rather than if the event occurred. In other words, we are focused on acquiring the rate, which is the number of events per unit time.

In survival analysis, we are interested in the hazard, which is the instantaneous event (e.g., death) rate at a particular time \(t\).

By combining these two elements (survival and hazard), we will be able to estimate the hazard ratio of an event occurring between two groups.

Survivor and Hazard functions

In survival analysis, the two main features are the survivor and hazard functions.

Survivor function \(S(t)\) is a probability and hazard function \(h(t)\) is a rate.

The survivor function is described as:

\(\large S(t) = Pr(T > t),\)

where \(S(t)\) is the survival probability, \(0 \le S(t) \le 1\), and \(Pr(T > t)\) is the probability that the time of the event (\(T\)) is greater than some time \(t\).

The hazard function \(h(t)\) is the instantaneous potential per unit time for the event to occur, given that the subject has survived up to time \(t\) (a rate):

\(\large h(t) = \lim_{\Delta t \to 0} \frac{P(t \le T \le t + \Delta t | T \ge t)}{\Delta t}\),

where the hazard rate at some instantaneous point in time \(\lim_{\Delta t \to 0}\) is the probability of the event occurring (\(P(t \le T \le t + \Delta t)\)), conditioned on the subject being alive (\(T \ge t\)), divided by the unit of time \(\Delta t\).

The survivor and hazard functions are related by:

\(h(t) = \lambda\) if and only if \(S(t) = e^{-\lambda t}\)

This means that you can derive the hazard function \(h(t)\) from the survivor function \(S(t)\) and vice versa.
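As a quick numeric sketch of this relationship, assume a constant hazard of \(\lambda = 0.1\) events per unit time; the survival probability at \(t = 10\) is then \(e^{-0.1 \times 10} \approx 0.37\):

### Survivor function under a constant hazard (lambda is an assumed value)
lambda <- 0.1            ## constant hazard rate
t <- 0:50                ## time points
s.t <- exp(-lambda * t)  ## S(t) = e^(-lambda * t)
s.t[t == 10]             ## survival probability at t = 10
## [1] 0.3678794
plot(t, s.t, type = "l", xlab = "Time", ylab = "S(t)",
     main = "Survivor function with constant hazard")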

Kaplan-Meier Curve

When comparing the survival between two groups, it’s good practice to plot out the survival curves. The Kaplan-Meier (KM) curve provides a non-parametric visualization of the survivor function. You can determine the survival probability at specific time points. The KM curve has a unique “step-like” appearance that represents when patients either experience the event (e.g., death) or are censored.

The data to plot a KM curve require at minimum a subject identifier, survival time (time spent at risk), status (event), and group indicator. I generated these data to illustrate a basic KM curve dataset. There are 10 subjects in each group. If a subject died, they are coded as event = 1; if they withdrew or were lost to follow-up, they are censored and coded as event = 0. The survivaltime variable is the time the subjects are at risk or contributing survival time.

You can download the data used in this tutorial from my GitHub site.

kmplot <- read.csv("kmcurve.csv")

knitr::kable(
  (kmplot), caption = "Table 1. Variables for constructing KM curves"
)

Table 1. Variables for constructing KM curves

subject   survivaltime   event   group
1         10             1       1
2         20             1       1
3         30             1       1
4         35             0       1
5         40             1       1
6         50             1       1
7         60             0       1
8         70             1       1
9         80             1       1
10        90             1       1
11        5              1       0
12        8              1       0
13        14             1       0
14        20             1       0
15        22             0       0
16        23             0       0
17        24             0       0
18        25             1       0
19        45             1       0
20        60             1       0
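If you would rather not download the CSV file, you can build the same data frame from Table 1 directly:

### Recreate the Table 1 data without the CSV file
kmplot <- data.frame(
  subject      = 1:20,
  survivaltime = c(10, 20, 30, 35, 40, 50, 60, 70, 80, 90,
                   5, 8, 14, 20, 22, 23, 24, 25, 45, 60),
  event        = c(1, 1, 1, 0, 1, 1, 0, 1, 1, 1,
                   1, 1, 1, 1, 0, 0, 0, 1, 1, 1),
  group        = rep(c(1, 0), each = 10)  ## 10 subjects per group
)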

Let’s take a look at a typical KM curve. The KM curve has a noticeable “step-like” plot that decreases based on when the event of interest occurs at some specific time point. For example, at time = 0, there are 10 subjects at risk in each group. But at time = 25, only 3 subjects in Treatment 0 and 8 subjects in Treatment 1 remain at risk. At time = 75, there are 0 subjects in Treatment 0 and 2 subjects in Treatment 1 remaining at risk. The hash marks on the KM curves indicate that a subject was censored, which means that they either were lost to follow-up or withdrew from the study. There are three hash marks in Treatment 0 and two hash marks in Treatment 1.

The Kaplan-Meier curve has a unique “step-like” plot that distinguishes it from other survival plots.

### Load the packages needed to fit and plot the KM curves
library(survival)   ## provides Surv() and survfit()
library(survminer)  ## provides ggsurvplot()

kmplot <- read.csv("kmcurve.csv")

km.plot <- survfit(Surv(survivaltime, event) ~ group, data = kmplot)

ggsurvplot(km.plot,
           risk.table = TRUE,
           legend.labs = c("treatment 0", "treatment 1"),
           title = "Kaplan-Meier curves")

Log rank test

The log rank test is used to compare the survival between groups. It is a test of significance, which means that it doesn’t tell us anything about the size of the effect, its direction, nor its uncertainty. However, it does let us know whether the survival between the groups is significantly different.

Statistical hypotheses for the log rank test:

\(H_{0}\): there is no difference in survival between the two populations.

\(H_{A}\): there is a difference in survival between the two populations.

The log rank test is based on the chi square test. It estimates the expected number of deaths at each time point and compares that to the observed number of deaths. The degrees of freedom are the number of groups minus 1. Fortunately, R can perform this statistical comparison using the survdiff function.

If the p-value is < 0.05, we reject the null that there is no difference in survival between the populations. Otherwise, we fail to reject the null and conclude that the survival between the populations is not significantly different (assuming our statistical threshold is \(\alpha\) = 0.05).

To perform the log rank test on our data, we need to use the survdiff function from the survival package.

### log rank test 
survdiff(Surv(survivaltime, event) ~ group, data = kmplot)
## Call:
## survdiff(formula = Surv(survivaltime, event) ~ group, data = kmplot)
## 
##          N Observed Expected (O-E)^2/E (O-E)^2/V
## group=0 10        7     4.14     1.978      3.27
## group=1 10        8    10.86     0.754      3.27
## 
##  Chisq= 3.3  on 1 degrees of freedom, p= 0.07

The log rank test yields a chi square statistic because its underlying foundation is based on this framework. Like the chi square test, the log rank test estimates the expected number of subjects who died and compares that to the observed number. This yields a p-value = 0.07, which means that we fail to reject the null; the survival between Treatment 0 and Treatment 1 is not statistically significantly different.

Since the log rank test is a test of significance, it doesn’t give us information about the treatment effect size, direction, and uncertainty. For that, we will need to construct a Cox proportional hazard model.

Cox proportional hazards model

The Cox proportional hazards regression model is a type of survival regression. It assumes that the hazards between two groups are proportional (hence, proportional hazards regression). This means that the effects of the predictor variables on the outcome (survival) are constant over time and multiplicative on the hazard scale (additive on the log hazard scale). For example, if something doubles the risk of an end point at time = 1, it will also double the risk at time = 2, time = 3, etc. One way to determine if the proportional hazards assumption is violated is to look and see if the KM curves cross at multiple points. It’s not a rigorous check, but it’s an easy visual.
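A more formal check is the cox.zph function from the survival package, which tests the proportional hazards assumption using scaled Schoenfeld residuals; a small p-value suggests a violation. A minimal sketch, assuming a fitted coxph object such as the coxmodel we fit below:

### Formal test of the proportional hazards assumption
ph.check <- cox.zph(coxmodel)  ## coxmodel is a fitted coxph object (fit below)
ph.check                       ## small p-values suggest non-proportional hazards
plot(ph.check)                 ## scaled Schoenfeld residuals over time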

The Cox proportional hazards model does not have an intercept term; rather, it has a baseline hazard \(h_{0}(t)\).

The hazard rate can be described as:

\(\large h(t, X) = h_{0}(t) e^{\sum_{i=1}^{p} \beta_{i} X_{i}}\)

The hazard ratio is the exponential of the two hazards for the groups and can be described as:

\(\large HR = \frac{h(t, X^{*})}{h(t, X)} = \frac{h_{0}(t) e^{\sum_{i=1}^{p} \beta_{i} X_{i}^{*}}}{h_{0}(t) e^{\sum_{i=1}^{p} \beta_{i} X_{i}}}\)

\(\large HR = e^{\sum_{i=1}^{p} \beta_{i} (X_{i}^{*} - X_{i})}\)

We can estimate the Cox proportional hazards model using the coxph function from the survival package. We can rewrite our \(HR\) formula to reflect the main predictor of interest, which is group (Treatment 1 versus Treatment 0):

\(\large HR = e^{\beta_{1} [(Group = 1) - (Group = 0)]} = e^{\beta_{1}}\)

We are interested in the \(\beta_{1}\) coefficient.

coxmodel <- coxph(Surv(survivaltime, event) ~ group, data = kmplot)
coxmodel  ## Generates the log hazard ratio
## Call:
## coxph(formula = Surv(survivaltime, event) ~ group, data = kmplot)
## 
##          coef exp(coef) se(coef)      z      p
## group -1.0554    0.3481   0.6084 -1.735 0.0828
## 
## Likelihood ratio test=3.06  on 1 df, p=0.08047
## n= 20, number of events= 15

We have to exponentiate the log hazard ratio to get the hazard ratio. We also use the confint function to generate the 95% confidence interval for the hazard ratio.

exp(coef(coxmodel)) ## Generates exp coefficients (or hazard ratio)
##   group 
## 0.34805
exp(confint(coxmodel))
##           2.5 %   97.5 %
## group 0.1056181 1.146952
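Alternatively, the summary function on the coxph object reports the hazard ratio (exp(coef)) and its 95% confidence interval in a single call:

### One-stop output: hazard ratio, 95% CI, and model tests
summary(coxmodel)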

We are interested in the group coefficient, which is the log hazard ratio. When we exponentiate this, we get the hazard ratio, which is 0.35. The 95% confidence interval is 0.11, 1.15.

We conclude that the hazard of having the event is 65% lower among subjects in Treatment 1 compared to Treatment 0, with a 95% confidence interval of 0.11 to 1.15. This result is not statistically significant since the 95% confidence interval crosses the null, or \(HR = 1\).

Motivating example - Veteran Lung Cancer Study

We can apply what we’ve learned using data from the Veterans Lung Cancer Study.

The Veteran Lung Cancer Study is part of the survival package. You can load it by assigning it to an object. For instance, veteran.data <- veteran. Once done, you can use veteran.data for your analyses.

Data

We will use the Veterans Administration Lung Cancer Trial from Prentice, RL. Exponential survivals with censoring and explanatory variables. Biometrika. 1973;60:279–88. In this study, 137 patients with advanced, inoperable lung cancer were randomized into two groups: chemotherapy + new drug treatment (experimental group) and chemotherapy (control treatment). You can download the data for the variable list from my GitHub site.

The variables in the veteran dataset:

### Load the "variables.csv" data from my GitHub site; this is a list of variables in the "veteran" dataset
variables <- read.csv("variables.csv")

knitr::kable(
    (variables), caption = "Table 2. Variables in the Veterans Affairs Lung Cancer Trial"
)

Table 2. Variables in the Veterans Affairs Lung Cancer Trial

variable   description
trt        treatment assignment (1 = control, 2 = experimental)
celltype   lung cancer cell type (1 = squamous, 2 = smallcell, 3 = adeno, 4 = large)
time       survival time in days
status     censoring status (1 = death occurred, 0 = censored)
karno      Karnofsky performance score (100 = good)
diagtime   months from diagnosis to randomization
age        age in years
prior      prior therapy (0 = no, 10 = yes)

Let’s take a look at the first six rows of the data.

### Load "veteran" data from the "survival" package
veteran.data <- veteran

knitr::kable(
    head(veteran.data), caption = "Table 3. First six rows of the Veterans Affairs Lung Cancer Trial"
)

Table 3. First six rows of the Veterans Affairs Lung Cancer Trial

trt   celltype   time   status   karno   diagtime   age   prior
1     squamous   72     1        60      7          69    0
1     squamous   411    1        70      5          64    10
1     squamous   228    1        60      3          38    0
1     squamous   126    1        60      9          63    10
1     squamous   118    1        70      11         65    10
1     squamous   10     1        20      5          49    0

We can look at some summary statistics of the veteran dataset using the describeBy function from the psych package. There are 137 subjects in the study. The average survival time is 121.63 days, and the average age of the sample is 58.3 years. The other variables are categorical, so the means do not provide a good description of them. For those, we will need to use the CrossTable function, which is part of the gmodels package.

knitr::kable(
  describeBy(veteran.data) %>% round(2)
)
           vars   n     mean     sd       median   trimmed   mad     min   max   range   skew    kurtosis   se
trt        1      137   1.50     0.50     1        1.50      0.00    1     2     1       0.01    -2.01      0.04
celltype*  2      137   2.34     1.07     2        2.30      1.48    1     4     3       0.28    -1.17      0.09
time       3      137   121.63   157.82   80       90.33     87.47   1     999   998     3.06    12.33      13.48
status     4      137   0.93     0.25     1        1.00      0.00    0     1     1       -3.47   10.10      0.02
karno      5      137   58.57    20.04    60       59.37     29.65   10    99    89      -0.34   -0.83      1.71
diagtime   6      137   8.77     10.61    5        6.73      4.45    1     87    86      4.06    23.00      0.91
age        7      137   58.31    10.54    62       59.13     8.90    34    81    47      -0.63   -0.49      0.90
prior      8      137   2.92     4.56     0        2.43      0.00    0     10    10      0.91    -1.19      0.39
CrossTable(veteran.data$celltype)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  137 
## 
##  
##           |  squamous | smallcell |     adeno |     large | 
##           |-----------|-----------|-----------|-----------|
##           |        35 |        48 |        27 |        27 | 
##           |     0.255 |     0.350 |     0.197 |     0.197 | 
##           |-----------|-----------|-----------|-----------|

Using the CrossTable function, we report that there were 35 (25.5%) subjects with squamous lung cancer, 48 (35.0%) with small cell lung cancer, 27 (19.7%) with adenocarcinoma, and 27 (19.7%) with large cell lung cancer.
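If you prefer base R, a quick sketch using table and prop.table reproduces the same counts and proportions:

### Base R alternative to CrossTable
counts <- table(veteran.data$celltype)
counts                        ## frequencies by cell type
round(prop.table(counts), 3)  ## proportions by cell type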

We can create a demographics table using the gtsummary package and the tbl_summary function.

veteran.data %>%
    select(trt, celltype, time, status, karno, diagtime, age, prior) %>%
    tbl_summary(by = trt, 
                missing = "no") %>%
    add_p(pvalue_fun = ~style_pvalue(.x, digits = 2)) %>% 
    add_overall() %>%
    modify_header(label = "**Variable**") %>% 
    bold_labels() %>%
    modify_caption("**Table 4. Patient characteristics from the Veterans Lung Cancer Trial**") %>%
    modify_spanning_header(c("stat_1", "stat_2") ~ "**Treatment Received**")
Table 4. Patient characteristics from the Veterans Lung Cancer Trial

                                    Treatment Received
Variable      Overall, N = 137¹     1, N = 69¹      2, N = 68¹      p-value²
celltype                                                            0.071
    squamous  35 (26%)              15 (22%)        20 (29%)
    smallcell 48 (35%)              30 (43%)        18 (26%)
    adeno     27 (20%)              9 (13%)         18 (26%)
    large     27 (20%)              15 (22%)        12 (18%)
time          80 (25, 144)          97 (25, 153)    52 (25, 117)    0.35
status        128 (93%)             64 (93%)        64 (94%)        >0.99
karno         60 (40, 75)           60 (40, 80)     60 (40, 70)     0.75
diagtime      5 (3, 11)             5 (4, 11)       4 (3, 11)       0.35
age           62 (51, 66)           62 (49, 66)     62 (52, 66)     0.46
prior                                                               0.75
    0         97 (71%)              48 (70%)        49 (72%)
    10        40 (29%)              21 (30%)        19 (28%)

¹ n (%); Median (IQR)
² Pearson's Chi-squared test; Wilcoxon rank sum test; Fisher's exact test

Kaplan-Meier Curve & log rank test - Veterans Study

You can plot the Kaplan-Meier (KM) curves for the two groups and estimate the log rank test.

In the basic KM curve plot, notice how there are no hash marks for the censored data. The base plot function does not include these by default. However, ggsurvplot has that ability along with other options to improve your KM curves.

We first use the basic plot function to visualize the KM curve without any color or 95% confidence interval.

kmcurve <- survfit(Surv(time, status) ~ trt, data = veteran.data)
plot(kmcurve)
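If you do want censoring hash marks in the base plot, plot.survfit accepts a mark.time argument (a minimal sketch):

### Base plot with censoring hash marks
plot(kmcurve, mark.time = TRUE, xlab = "Time (days)", ylab = "Survival probability")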

We can also install the survminer package to use the ggsurvplot function to plot the KM curves with additional features such as the 95% confidence interval (CI), log-rank test, and risk tables. We can estimate the log rank test by including the option pval = TRUE into the ggsurvplot function. But we can also estimate the log rank test separately using the survdiff function.

ggsurvplot(kmcurve, 
           conf.int = TRUE,
           pval = TRUE,
           risk.table = TRUE,
           legend.labs = c("treatment 1", "treatment 2"),
           title = "Kaplan-Meier curves")

survdiff(Surv(time, status) ~ trt, data = veteran.data)
## Call:
## survdiff(formula = Surv(time, status) ~ trt, data = veteran.data)
## 
##        N Observed Expected (O-E)^2/E (O-E)^2/V
## trt=1 69       64     64.5   0.00388   0.00823
## trt=2 68       64     63.5   0.00394   0.00823
## 
##  Chisq= 0  on 1 degrees of freedom, p= 0.9

Both the ggsurvplot and survdiff functions yield the same p-value for the log rank test, approximately 0.93 (rounded to 0.9 in the survdiff output). Therefore, we fail to reject the null that there is no difference in survival between Treatment 1 and Treatment 2.

Cox proportional hazard model - Veterans study

We can estimate the hazard ratio between Treatment 1 and Treatment 2 using the Cox proportional hazards model.

\(\large HR = e^{\beta_{1}(Treatment)}\)

coxmodel2 <- coxph(Surv(time, status) ~ trt, data = veteran.data)

exp(coef(coxmodel2))
##      trt 
## 1.017901
exp(confint(coxmodel2))
##         2.5 %   97.5 %
## trt 0.7143755 1.450389
### Create nice regression output
model1 <- tbl_regression(coxmodel2, intercept = TRUE)
as_gt(model1) %>% 
              gt::tab_header("Cox proportional hazards regression model") %>% 
              gt::tab_options(table.align='left')
Cox proportional hazards regression model
Characteristic   log(HR)¹   95% CI¹       p-value
trt              0.02       -0.34, 0.37   >0.9

¹ HR = Hazard Ratio, CI = Confidence Interval

The Cox proportional hazards model yields the log hazard ratio. To get the hazard ratio, exponentiate the log hazard ratio.

The hazard ratio was 1.02 with a 95% confidence interval of 0.71 to 1.45. Since the 95% confidence interval crosses \(HR = 1\) (the null), we fail to reject the null hypothesis that there is no difference in hazards between subjects in Treatment 1 and Treatment 2.

We can control for confounders such as age, prior history, cell type, diagnosis time, and Karnofsky score.

\(\large HR = e^{\beta_{1}(Treatment) + \beta_{2}(age) + \beta_{3}(prior) + \beta_{4}(celltype) + \beta_{5}(diagtime) + \beta_{6}(karno)}\)

coxmodel3 <- coxph(Surv(time, status) ~ trt + celltype + age + prior + diagtime + karno, data = veteran.data)

exp(coef(coxmodel3))
##               trt celltypesmallcell     celltypeadeno     celltypelarge 
##         1.3425930         2.3668512         3.3070825         1.4937529 
##               age             prior          diagtime             karno 
##         0.9913313         1.0071850         1.0000813         0.9677173
exp(confint(coxmodel3))
##                       2.5 %    97.5 %
## trt               0.8938772 2.0165590
## celltypesmallcell 1.3799024 4.0596961
## celltypeadeno     1.8335975 5.9646647
## celltypelarge     0.8583289 2.5995834
## age               0.9734248 1.0095673
## prior             0.9623552 1.0541032
## diagtime          0.9823329 1.0181504
## karno             0.9573269 0.9782204
### Create nice regression output
model2 <- tbl_regression(coxmodel3, intercept = TRUE)
as_gt(model2) %>% 
              gt::tab_header("Cox proportional hazards regression model with confounders") %>% 
              gt::tab_options(table.align='left')
Cox proportional hazards regression model with confounders
Characteristic   log(HR)¹   95% CI¹        p-value
trt              0.29       -0.11, 0.70    0.2
celltype
    squamous
    smallcell    0.86       0.32, 1.4      0.002
    adeno        1.2        0.61, 1.8      <0.001
    large        0.40       -0.15, 0.96    0.2
age              -0.01      -0.03, 0.01    0.3
prior            0.01       -0.04, 0.05    0.8
diagtime         0.00       -0.02, 0.02    >0.9
karno            -0.03      -0.04, -0.02   <0.001

¹ HR = Hazard Ratio, CI = Confidence Interval

In the adjusted Cox proportional hazards model, subjects in Treatment 2 had 1.34 times the hazard of the event compared to subjects in Treatment 1 (a 34% higher hazard); the 95% confidence interval is between 0.89 and 2.02. This is not statistically significant because the 95% confidence interval includes the null, or \(HR = 1\).

We can check to see if the adjusted model reduced confounding by estimating the magnitude of confounding. A common rule of thumb: if the magnitude of confounding is greater than 10%, we should present the adjusted \(HR\).

magnitude of confounding = \(\large \frac{HR_{crude} - HR_{adjusted}}{HR_{adjusted}}\)

confounding <- (1.017901 - 1.3425930) / 1.3425930
confounding
## [1] -0.2418395
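Rather than hard-coding the hazard ratios, you can compute the magnitude of confounding directly from the two fitted model objects (a sketch using coxmodel2 and coxmodel3 from above):

### Magnitude of confounding from the fitted models
hr.crude    <- exp(coef(coxmodel2))["trt"]  ## crude hazard ratio
hr.adjusted <- exp(coef(coxmodel3))["trt"]  ## adjusted hazard ratio
(hr.crude - hr.adjusted) / hr.adjusted      ## magnitude of confounding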

Since the magnitude of confounding is 24% in absolute value, which is greater than 10%, we should present the adjusted \(HR\).

We can also compare the models side by side.

#### Merge the two Cox regression models' outputs
table1 <- tbl_merge(tbls = list(model1, model2),
          tab_spanner = c("**Model 1**", "**Model 2**"))
as_gt(table1) %>% 
              gt::tab_header("Comparison between models [Model 1 (crude) v. Model 2 (adjusted)]") %>% 
              gt::tab_options(table.align = 'left')
Comparison between models [Model 1 (crude) v. Model 2 (adjusted)]

                 Model 1                             Model 2
Characteristic   log(HR)¹  95% CI¹      p-value      log(HR)¹  95% CI¹        p-value
trt              0.02      -0.34, 0.37  >0.9         0.29      -0.11, 0.70    0.2
celltype
    squamous
    smallcell                                        0.86      0.32, 1.4      0.002
    adeno                                            1.2       0.61, 1.8      <0.001
    large                                            0.40      -0.15, 0.96    0.2
age                                                  -0.01     -0.03, 0.01    0.3
prior                                                0.01      -0.04, 0.05    0.8
diagtime                                             0.00      -0.02, 0.02    >0.9
karno                                                -0.03     -0.04, -0.02   <0.001

¹ HR = Hazard Ratio, CI = Confidence Interval

Conclusions

Survival analysis takes into account the amount of time that subjects are at risk; this is known as survival time. Even if subjects do not experience the event during the study period, they are censored because they may experience the event at a later date. Kaplan-Meier curves provide a convenient non-parametric visual of the survival probability across time, and the log rank test allows us to test whether the survival curves differ between groups. However, the log rank test does not provide information about the size of the effect, its direction, nor its uncertainty. The Cox proportional hazards model provides us with the hazard ratio and 95% confidence interval, which describe the relative hazards between two groups.

References

Kaplan EL, Meier P. Nonparametric estimation from incomplete observations. Journal of the American Statistical Association. 1958;53:457–81.

Bland JM, Altman DG. The logrank test. BMJ. 2004;328(7447):1073.

The gtsummary package creates nice tables for demographic summaries. The gtsummary package was developed by Daniel D. Sjoberg.

## 
## To cite gtsummary in publications use:
## 
##   Sjoberg DD, Whiting K, Curry M, Lavery JA, Larmarange J. Reproducible
##   summary tables with the gtsummary package. The R Journal
##   2021;13:570–80. https://doi.org/10.32614/RJ-2021-053.
## 
## A BibTeX entry for LaTeX users is
## 
##   @Article{gtsummary,
##     author = {Daniel D. Sjoberg and Karissa Whiting and Michael Curry and Jessica A. Lavery and Joseph Larmarange},
##     title = {Reproducible Summary Tables with the gtsummary Package},
##     journal = {{The R Journal}},
##     year = {2021},
##     url = {https://doi.org/10.32614/RJ-2021-053},
##     doi = {10.32614/RJ-2021-053},
##     volume = {13},
##     issue = {1},
##     pages = {570-580},
##   }

Work in progress

These tutorials are a work in progress, so they may contain errors. I highly recommend you seek out statistical consult if you plan on running any analysis meant for publication. Any comments and suggestions can be sent to: