Surviva analysis also known as time-to-event analysis or duration analysis, is a branch of statistics that focuses on analyzing the expected duration of time until specific events occur.
These events can include:
Death (commonly studied in medical contexts, such as time to death for patients with a specific disease).
Recurrence (time to relapse of a certain disease).
Other non-medical scenarios, such as:
Time until finding a new job after unemployment.
Time until a mechanical system or machine fails.
Time until a customer buys a new product or cancels a subscription.
Time until a letter is delivered.
Time until an employee leaves a company.
In cancer studies, typical research questions are like:
What is the impact of certain clinical characteristics on patientβs survival
What is the probability that an individual survives 3 years?
Are there differences in survival between groups of patients?
The aim of this chapter is to describe the basic concepts of survival analysis. In cancer studies, most of survival analyses use the following methods:
Kaplan-Meier plots to visualize survival curves
Log-rank test to compare the survival curves of two or more groups
Cox proportional hazards regression to describe the effect of variables on survival The Cox model is discussed in the next chapter: Cox proportional hazards model.
Here, weβll start by explaining the essential concepts of survival analysis, including:
how to generate and interpret survival curves,
and how to quantify and test survival differences between two or more groups of patients.
Then, weβll continue by describing multivariate analysis using Cox proportional hazards model.
Here, we start by defining fundamental terms of survival analysis including:
There are different types of events, including:
The two most important measures in cancer studies include:
As mentioned above, survival analysis focuses on the expected duration of time until occurrence of an event of interest (relapse or death). However, the event may not be observed for some individuals within the study time period, producing the so-called censored observations.
Censoring may arise in the following ways:
This type of censoring, named right censoring, is handled in survival analysis.
Two related probabilities are used to describe survival data: the survival probability and the hazard probability.
The survival probability, also known as the survivor function \(S(t)\), is the probability that an individual survives from the time origin (e.g.Β diagnosis of cancer) to a specified future time t.
The hazard, denoted by \(h(t)\), is the probability that an individual who is under observation at a time t has an event at that time.
Note that, in contrast to the survivor function, which focuses on not having an event, the hazard function focuses on the event occurring.
The Kaplan-Meier (KM) method is a non-parametric method used to estimate the survival probability from observed survival times (Kaplan and Meier, 1958).
The survival probability at time \(t_i\), \(S(t_i)\), is calculated as follow:
[ {$S(t_i)$} = $S(t_{i-1})$ \left(1 - \frac{d_i}{n_I}\right) \]
where,
\(S(t(iβ1))\) = the probability of being alive at \(tiβ1\)
ni = the number of patients alive just before ti
di = the number of events at tβi
t0=0 , S0=1 The estimated probability S(t) is a step function that changes value only at the time of each event. Itβs also possible to compute confidence intervals for the survival probability.
The KM survival curve, a plot of the KM survival probability against time, provides a useful summary of the data that can be used to estimate measures such as median survival time.
Install the required packages
Weβll use two R packages:
Install and load the packages
library(survival)
library(survminer)
## Loading required package: ggplot2
## Loading required package: ggpubr
##
## Attaching package: 'survminer'
## The following object is masked from 'package:survival':
##
## myeloma
Weβll use the lung cancer data available in the survival package.
data("lung")
## Warning in data("lung"): data set 'lung' not found
head(lung,3)
## inst time status age sex ph.ecog ph.karno pat.karno meal.cal wt.loss
## 1 3 306 2 74 1 1 90 100 1175 NA
## 2 3 455 2 68 1 0 90 90 1225 15
## 3 3 1010 1 56 1 0 90 90 NA 15
We want to compute the survival probability by sex.
The function survfit() [in survival package] can be used to compute kaplan-Meier survival estimate. Its main arguments include:
a survival object created using the function Surv() and the data set containing the variables. To compute survival curves, type this:
fit <- survfit(Surv(time, status) ~ sex, data = lung)
print(fit)
## Call: survfit(formula = Surv(time, status) ~ sex, data = lung)
##
## n events median 0.95LCL 0.95UCL
## sex=1 138 112 270 212 310
## sex=2 90 53 426 348 550
By default, the function print() shows a short summary of the survival curves. It prints the number of observations, number of events, the median survival and the confidence limits for the median.
If you want to display a more complete summary of the survival curves, type this:
# Summary of survival curves
summary(fit, 8)
## Call: survfit(formula = Surv(time, status) ~ sex, data = lung)
##
## sex=1
## time n.risk n.event survival std.err lower 95% CI
## 8 138 0 1 0 1
## upper 95% CI
## 1
##
## sex=2
## time n.risk n.event survival std.err lower 95% CI
## 8.000 89.000 1.000 0.989 0.011 0.967
## upper 95% CI
## 1.000
# Access to the sort summary table
summary(fit)$table
## records n.max n.start events rmean se(rmean) median 0.95LCL 0.95UCL
## sex=1 138 138 138 112 326.0841 22.91156 270 212 310
## sex=2 90 90 90 53 460.6473 34.68985 426 348 550
The function survfit() returns a list of variables, including the following components:
The components can be accessed as follow:
d <- data.frame(time = fit$time,
n.risk = fit$n.risk,
n.event = fit$n.event,
n.censor = fit$n.censor,
surv = fit$surv,
upper = fit$upper,
lower = fit$lower
)
head(d,3)
## time n.risk n.event n.censor surv upper lower
## 1 11 138 3 0 0.9782609 1.0000000 0.9542301
## 2 12 135 1 0 0.9710145 0.9994124 0.9434235
## 3 13 134 2 0 0.9565217 0.9911586 0.9230952
Weβll use the function ggsurvplot() [in Survminer R package] to produce the survival curves for the two groups of subjects.
Itβs also possible to show:
the 95% confidence limits of the survivor function using the argument conf.int = TRUE.
the number and/or the percentage of individuals at risk by time using the option risk.table. Allowed values for risk.table include:
TRUE or FALSE specifying whether to show or not the risk table. Default is FALSE.
βabsoluteβ or βpercentageβ: to show the absolute number and the percentage of subjects at risk by time, respectively. Use βabs_pctβ to show both absolute number and percentage.
the p-value of the Log-Rank test comparing the groups using pval = TRUE.
horizontal/vertical line at median survival using the argument surv.median.line. Allowed values include one of c(βnoneβ, βhvβ, βhβ, βvβ). v: vertical, h:horizontal.
# Change color, linetype by strata, risk.table color by strata
ggsurvplot(fit,
pval = TRUE, conf.int = TRUE,
risk.table = TRUE, # Add risk table
risk.table.col = "strata", # Change risk table color by groups
linetype = "strata", # Change line type by groups
surv.median.line = "hv", # Specify median survival
ggtheme = theme_bw(), # Change ggplot2 theme
palette = c("#E7B800", "#2E9FDF"))
The plot can be further customized using the following arguments:
ggsurvplot(fit, pval=TRUE, conf.int = TRUE, xlab="Time in days", break.time.by = 200, ggtheme = theme_light(),risk.table = "abs_pct",risk.table.y.text.col = T, risk.table.y.text = FALSE, ncensor.plot = TRUE, surv.median.line = "hv",legend.labs =
c("Male", "Female"),palette =
c("#E7B800", "#2E9FDF"))
The Kaplan-Meier plot can be interpreted as follow:
The horizontal axis (x-axis) represents time in days, and the vertical axis (y-axis) shows the probability of surviving or the proportion of people surviving. The lines represent survival curves of the two groups. A vertical drop in the curves indicates an event. The vertical tick mark on the curves means that a patient was censored at this time.
The median survival times for each group can be obtained using the code below:
summary(fit)$table
## records n.max n.start events rmean se(rmean) median 0.95LCL 0.95UCL
## sex=1 138 138 138 112 326.0841 22.91156 270 212 310
## sex=2 90 90 90 53 460.6473 34.68985 426 348 550
The median survival times for each group represent the time at which the survival probability, \(S(t)\) , is 0.5.
The median survival time for sex=1 (Male group) is 270 days, as opposed to 426 days for sex=2 (Female). There appears to be a survival advantage for female with lung cancer compare to male. However, to evaluate whether this difference is statistically significant requires a formal statistical test, a subject that is discussed in the next sections.
Note that, the confidence limits are wide at the tail of the curves, making meaningful interpretations difficult. This can be explained by the fact that, in practice, there are usually patients who are lost to follow-up or alive at the end of follow-up. Thus, it may be sensible to shorten plots before the end of follow-up on the x-axis (Pocock et al, 2002).
The survival curves can be shorten using the argument xlim as follow:
ggsurvplot(fit,
conf.int = TRUE,
risk.table.col = "strata", # Change risk table color by groups
ggtheme = theme_bw(), # Change ggplot2 theme
palette = c("#E7B800", "#2E9FDF"),
xlim = c(0, 600))
Note that, three often used transformations can be specified using the
argument fun:
ggsurvplot(fit,
conf.int = TRUE,
risk.table.col = "strata", # Change risk table color by groups
ggtheme = theme_bw(), # Change ggplot2 theme
palette = c("#E7B800", "#2E9FDF"),
fun = "event")
The cummulative hazard is commonly used to estimate the hazard
probability. Itβs defined as H(t)=βlog(survivalfunction)=βlog(S(t)) .
The cumulative hazard (H(t)) can be interpreted as the cumulative force
of mortality. In other words, it corresponds to the number of events
that would be expected for each individual by time t if the event were a
repeatable process.
To plot cumulative hazard, type this:
ggsurvplot(fit,
conf.int = TRUE,
risk.table.col = "strata", # Change risk table color by groups
ggtheme = theme_bw(), # Change ggplot2 theme
palette = c("#E7B800", "#2E9FDF"),
fun = "cumhaz")
#### Kaplan-Meier life table: summary of survival curves
As mentioned above, you can use the function summary() to have a complete summary of survival curves:
summary(fit,10)
## Call: survfit(formula = Surv(time, status) ~ sex, data = lung)
##
## sex=1
## time n.risk n.event survival std.err lower 95% CI
## 10 138 0 1 0 1
## upper 95% CI
## 1
##
## sex=2
## time n.risk n.event survival std.err lower 95% CI
## 10.000 89.000 1.000 0.989 0.011 0.967
## upper 95% CI
## 1.000
Itβs also possible to use the function surv_summary() [in survminer package] to get a summary of survival curves. Compared to the default summary() function, surv_summary() creates a data frame containing a nice summary from survfit results.
res.sum <- surv_summary(fit)
## Warning in .get_data(x, data = data): The `data` argument is not provided. Data
## will be extracted from model fit.
head(res.sum,3)
## time n.risk n.event n.censor surv std.err upper lower strata
## 1 11 138 3 0 0.9782609 0.01268978 1.0000000 0.9542301 sex=1
## 2 12 135 1 0 0.9710145 0.01470747 0.9994124 0.9434235 sex=1
## 3 13 134 2 0 0.9565217 0.01814885 0.9911586 0.9230952 sex=1
## sex
## 1 1
## 2 1
## 3 1
The function surv_summary() returns a data frame with the following columns:
In a situation, where survival curves have been fitted with one or more variables, surv_summary object contains extra columns representing the variables. This makes it possible to facet the output of ggsurvplot by strata or by some combinations of factors.
surv_summary object has also an attribute named βtableβ containing information about the survival curves, including medians of survival with confidence intervals, as well as, the total number of subjects and the number of event in each curve. To get access to the attribute βtableβ, type this:
attr(res.sum, "table")
## records n.max n.start events rmean se(rmean) median 0.95LCL 0.95UCL
## sex=1 138 138 138 112 326.0841 22.91156 270 212 310
## sex=2 90 90 90 53 460.6473 34.68985 426 348 550
The log-rank test is the most widely used method of comparing two or more survival curves. The null hypothesis is that there is no difference in survival between the two groups. The log rank test is a non-parametric test, which makes no assumptions about the survival distributions. Essentially, the log rank test compares the observed number of events in each group to what would be expected if the null hypothesis were true (i.e., if the survival curves were identical). The log rank statistic is approximately distributed as a chi-square test statistic.
The function survdiff() [in survival package] can be used to compute log-rank test comparing two or more survival curves.
survdiff() can be used as follow:
surv_diff <- survdiff(Surv(time, status) ~ sex, data = lung)
surv_diff
## Call:
## survdiff(formula = Surv(time, status) ~ sex, data = lung)
##
## N Observed Expected (O-E)^2/E (O-E)^2/V
## sex=1 138 112 91.6 4.55 10.3
## sex=2 90 53 73.4 5.68 10.3
##
## Chisq= 10.3 on 1 degrees of freedom, p= 0.001
The function returns a list of components, including:
Fit complex survival curves In this section, weβll compute survival curves using the combination of multiple factors. Next, weβll facet the output of ggsurvplot() by a combination of factors
require("survival")
fit2 <- survfit( Surv(time, status) ~ sex + rx + adhere,
data = colon )
# Plot survival curves by sex and facet by rx and adhere
ggsurv <- ggsurvplot(fit2, fun = "event", conf.int = TRUE,
ggtheme = theme_bw())
ggsurv$plot +theme_bw() +
theme (legend.position = "right")+
facet_grid(rx ~ adhere)
Survival analysis is a set of statistical approaches for data analysis where the outcome variable of interest is time until an event occurs.
Survival data are generally described and modeled in terms of two related functions:
the survivor function representing the probability that an individual survives from the time of origin to some time beyond time t. Itβs usually estimated by the Kaplan-Meier method. The logrank test may be used to test for differences between survival curves for groups, such as treatment arms.
The hazard function gives the instantaneous potential of having an event at a time, given survival up to that time. It is used primarily as a diagnostic tool or for specifying a mathematical model for survival analysis.
In this article, we demonstrate how to perform and visualize survival analyses using the combination of two R packages: survival (for the analysis) and survminer (for the visualization
References
Clark TG, Bradburn MJ, Love SB and Altman DG. Survival Analysis Part I: Basic concepts and first analyses. British Journal of Cancer (2003) 89, 232 β 238
Kaplan EL, Meier P (1958) Nonparametric estimation from incomplete observations. J Am Stat Assoc 53: 457β481.
Pocock S, Clayton TC, Altman DG (2002) Survival plots of time-to-event outcomes in clinical trials: good practice and pitfalls. Lancet 359: 1686β 1689.