This review has been produced using Rmarkdown and will no longer be updated. An updated version based on Quarto can be found here
(Last compiled on Thu Dec 08 10:10:10 2022)
The aim of this document is to give a short but yet comprehensive
review on how to conduct survival analysis in R. The literature on the
topic is extensive and only a limited number of (common)
problems/features will be covered.
The amount of R packages available reflects the extent of the research
on the topic. A broad (yet not complete) task view presenting useful R
packages for different aspects of survival analysis can be found on the
dedicated CRAN Task View at https://CRAN.R-project.org/view=Survival.
A variety of R packages can be used to tackle specific problems and
there are alternative functions to address a common question. The
following are the packages used in this review for reading, managing,
analyzing, and displaying data.
Run the following lines for installing and loading the required
packages.
if (!require(pacman)) install.packages("pacman")
pacman::p_load(tidyverse, survival, ggfortify, survminer, plotly, gridExtra,
Epi, KMsurv, gnm, cmprsk, mstate, flexsurv, splines, epitools,
eha, shiny, ctqr, scales)
OBS Data manipulation and graphics will be based on a collection of packages that share common philosophies and are designed to work together (http://tidyverse.org/). Additional useful resources can be found here 1, 2.
The review will be based on the orca dataset which
contains data from a population-based retrospective cohort design. The
dataset can be found in a text file format at http://www.stats4life.se/data/oralca.txt.
It includes a subset of 338 patients diagnosed with an oral squamous
cell carcinoma (OSCC) between January 1, 1985 and December 31, 2005 from
the 2 northernmost provinces of Finland. Follow-up of patients was
started on the date of cancer diagnosis and ended on the date of death,
migration, or the closing date of the follow-up, December 31, 2008.
Cause of death was classified into the 2 categories: (1) deaths from
OSCC; and (2) deaths of other causes.
The dataset contains the following variables:
id = a sequential number,
sex = sex, a factor with categories 1 = “Female”, 2 =
“Male”,
age = age (years) at the date of diagnosing the
cancer,
stage = TNM stage of the tumor (factor): 1 = “I”, …, 4 =
“IV”, 5 = “unkn”
time = follow-up time (in years) since diagnosis until
death or censoring,
event = event ending the follow-up (factor): 1 = censoring
alive, 2 = death from oral cancer, 3 = death from other causes.
Thanks to Esa Läärä for sharing the data.
Load data into R from the URL.
orca <- read.table("http://www.stats4life.se/data/oralca.txt", stringsAsFactors = TRUE)
Have a feeling of the data at hand.
head(orca)
id sex age stage time event
1 1 Male 65.42274 unkn 5.081 Alive
2 2 Female 83.08783 III 0.419 Oral ca. death
3 3 Male 52.59008 II 7.915 Other death
4 4 Male 77.08630 I 2.480 Other death
5 5 Male 80.33622 IV 2.500 Oral ca. death
6 6 Female 82.58132 IV 0.167 Other death
glimpse(orca)
Rows: 338
Columns: 6
$ id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, …
$ sex <fct> Male, Female, Male, Male, Male, Female, Male, Male, Female, Male, Male, Female,…
$ age <dbl> 65.42274, 83.08783, 52.59008, 77.08630, 80.33622, 82.58132, 73.07524, 74.49622,…
$ stage <fct> unkn, III, II, I, IV, IV, II, unkn, IV, II, IV, unkn, III, IV, IV, II, II, IV, …
$ time <dbl> 5.081, 0.419, 7.915, 2.480, 2.500, 0.167, 5.925, 1.503, 13.333, 7.666, 2.834, 0…
$ event <fct> Alive, Oral ca. death, Other death, Other death, Oral ca. death, Other death, A…
summary(orca)
id sex age stage time event
Min. : 1.00 Female:152 Min. :15.15 I :50 Min. : 0.085 Alive :109
1st Qu.: 85.25 Male :186 1st Qu.:53.24 II :77 1st Qu.: 1.333 Oral ca. death:122
Median :169.50 Median :64.86 III :72 Median : 3.869 Other death :107
Mean :169.50 Mean :63.51 IV :68 Mean : 5.662
3rd Qu.:253.75 3rd Qu.:74.29 unkn:71 3rd Qu.: 8.417
Max. :338.00 Max. :92.24 Max. :23.258
Survival analysis focuses on time to event data, usually referred to as failure time \(T\), \(T \ge 0\). In our example, \(T\) is time to death after diagnosis.
In order to define a failure time random variable, we need:
1. a time origin (diagnosis of OSCC),
2. a time scale (years after diagnosis, age),
3. definition of the event. We will first consider total (or all-cause)
mortality, pooling the two causes of death into a single outcome
(Figure 1 (A)).
tm <- matrix(c(NA, NA, 1, NA), ncol = 2)
rownames(tm) <- colnames(tm) <- c("Alive", "Death")
tm2 <- matrix(c(NA, NA, NA, 1, NA, NA, 2, NA, NA), ncol = 3)
rownames(tm2) <- colnames(tm2) <- levels(orca$event)
par(mfrow = c(1, 2))
layout(rbind(c(1, 2, 2)))
boxes.matrix(tm, boxpos = TRUE)
title("A)")
boxes.matrix(tm2, boxpos = TRUE)
title("B)")
Figure 1: Box diagram for transitions.
table(orca$event)
Alive Oral ca. death Other death
109 122 107
orca <- mutate(orca, all = event != "Alive")
table(orca$all)
FALSE TRUE
109 229
A graphically presentation of the observed follow-up times can be of great aid in an analysis of survival data. The illustration of survival data in Figure 2 shows several features which are typically encountered in analysis of survival data.
ggplotly(
orca %>%
mutate(
text = paste("Subject ID = ", id, "<br>", "Time = ", time, "<br>", "Event = ",
event, "<br>", "Age = ", round(age, 2), "<br>", "Stage = ", stage)
) %>%
ggplot(aes(x = id, y = time, text = text)) +
geom_linerange(aes(ymin = 0, ymax = time)) +
geom_point(aes(shape = event, color = event), stroke = 1, cex = 2) +
scale_shape_manual(values = c(1, 3, 4)) +
labs(y = "Time (years)", x = "Subject ID") + coord_flip() + theme_classic(),
tooltip = "text"
)
Figure 2: Possible representations of follow-up time.
Different time scales can give different perspectives.
grid.arrange(
ggplot(orca, aes(x = id, y = time)) +
geom_linerange(aes(ymin = 0, ymax = time)) +
geom_point(aes(shape = event, color = event), stroke = 1, cex = 2) +
scale_shape_manual(values = c(1, 3, 4)) + guides(shape = "none", color = "none") +
labs(y = "Time (years)", x = "Subject ID") + coord_flip() + theme_classic(),
orca %>%
mutate(age_orig = age,
age_end = age + time) %>%
ggplot(aes(x = id, y = age_end)) +
geom_linerange(aes(ymin = age_orig, ymax = age_end)) +
geom_point(aes(shape = event, color = event), stroke = 1, cex = 2) +
scale_shape_manual(values = c(1, 3, 4)) + guides(fill = "none") +
labs(y = "Age (years)", x = "Subject ID") + coord_flip() + theme_classic(),
ncol = 2
)
Death from OSCC is more likely to occur early after diagnosis, as opposed to death from other causes. What about the type of censoring?
A survival object is defined by the pair (\(y, \delta\)), i.e. the time
variable and the failure or status indicator. To
create such an object in R, the Surv() function in the
survival package can be used. See the help page by typing
?Surv
for a description of the different possibilities.
su_obj <- Surv(orca$time, orca$all)
str(su_obj)
'Surv' num [1:338, 1:2] 5.081+ 0.419 7.915 2.480 2.500 0.167 5.925+ 1.503 13.333 7.666+ ...
- attr(*, "dimnames")=List of 2
..$ : NULL
..$ : chr [1:2] "time" "status"
- attr(*, "type")= chr "right"
The created survival object is then used as response variable in other specific functions for survival analysis.
There are several equivalent ways to characterize the probability
distribution of a survival random variable:
1. the density function \(f(t) = \lim_{\Delta
t \to 0} \frac{1}{\Delta t} Pr(t \le T \le t + \Delta t)\);
2. the cumulative distribution function \(F(t)
= P(T \le t) = \int_{0}^{t} f(u) du\), i.e. the probability of
dying within a certain time \(t\);
3. the survivor function \(S(t) = P(T > t)
= \int_{t}^{\infty} f(u) du\), i.e. the probability of surviving
longer than time \(t\);
4. the hazard function \(\lambda(t) =
\lim_{\Delta t \to 0} \frac{1}{\Delta t} Pr(t \le T \le t + \Delta t | T
\ge T) = \frac{f(t)}{S(t)}\), usually referred to as
instantaneous failure rate, the force of mortality;
5. the cumulative hazard function \(\Lambda(t)
= \int_{0}^{t} \lambda(u) du\).
N.B. These distributions are closely related to each other. Hence one may estimate one and derive the remaining.
\[\lambda(t) = \frac{f(t)}{1 - F(t)} = -\frac{S'(t)}{S(t)} = - \frac{d \log[S(t)]}{dt}\] \[\Lambda(t) = -\log[S(t)]\] \[S(t) = \exp(-\Lambda(t)) = \exp \left( - \int_{0}^{t} \lambda(v) dv \right)\] \[f(t) = \lambda(t)S(t)\] \[F(t) = 1 - \exp(-\Lambda(t)) = \int_{0}^{t} \lambda(v)S(v) dv\]
In addition, measures of central tendency can be useful for
summarizing the observed distributions:
1. mean survival \(\mu = \int_{0}^{\infty} u
f(u) du\);
2. median survival \(\tau: \min_{\tau} S(\tau)
\le 0.5\);
3. any other quantiles.
There are two alternatives for estimating the survival or the
hazard:
a. an empirical estimate of the survival function (i.e., non-parametric
estimation);
b. a parametric model for \(\lambda(t)\) based on a particular density
function \(f(t)\).
We will first cover the class of non-parametric estimators (a.), which includes the Kaplan–Meier, Life-table, and Nelson-Aalen estimators.
The Kaplan-Meier estimator is the most common estimator and can be explained using different strategies (product limit estimator, likelihood justification).
\[\hat S(t) = \prod_{j: \tau_j \le t} \left(1 - \frac{d_j}{r_j} \right)\] with \(\tau_j\) = distinct death times observed in the sample, \(d_j\) = number of deaths at \(\tau_j\), and \(r_j\) = number of individuals at risk right before the \(j\)-th death time (\(r_j = r_{j−1}−d_{j−1}−c_{j−1}\), \(c_j\) = number of censored observations between the \(j\)-th and \((j + 1)\)-st death times).
A survival curve is based on a tabulation of the number at risk and
number of events at each unique death times. The survfit()
function of the survival package creates (estimates) the
survival curves using different methods (see ?survfit).
Using the created survival object su_obj as response
variable in the formula, the survfit() function will return
the default calculations for a Kaplan–Meier analysis of the (overall)
survival curve.
fit_km <- survfit(Surv(time, all) ~ 1, data = orca)
print(fit_km, print.rmean = TRUE)
Call: survfit(formula = Surv(time, all) ~ 1, data = orca)
n events rmean* se(rmean) median 0.95LCL 0.95UCL
[1,] 338 229 8.06 0.465 5.42 4.33 6.92
* restricted mean with upper limit = 23.3
The print() function returns just a summary of the
estimated survival curve. The fortify() function of the
ggfortify package is useful to extract the whole survival
table in a data.frame.
dat_km <- fortify(fit_km)
head(dat_km)
time n.risk n.event n.censor surv std.err upper lower
1 0.085 338 2 0 0.9940828 0.004196498 1.0000000 0.9859401
2 0.162 336 2 0 0.9881657 0.005952486 0.9997618 0.9767041
3 0.167 334 4 0 0.9763314 0.008468952 0.9926726 0.9602592
4 0.170 330 2 0 0.9704142 0.009497400 0.9886472 0.9525175
5 0.246 328 1 0 0.9674556 0.009976176 0.9865584 0.9487228
6 0.249 327 1 0 0.9644970 0.010435745 0.9844277 0.9449699
The ggsurvplot() is a dedicated function in the
survminer package to give an informative illustration of
the estimated survival curve(s). See the help page
?ggsurvplot for a description of the different
possibilities (arguments).
ggsurvplot(fit_km, risk.table = TRUE, xlab = "Time (years)", censor = F)
N.B.: see the help page
?survfit.formula
for a description of the different methods for constructing confidence
intervals (argument conf.type).
The default KM plot presents the survival function. Several
alternatives/functions are available (see the help page ?plot.survfit
or ?ggsurvplot).
glist <- list(
ggsurvplot(fit_km, fun = "event", main = "Cumulative proportion"),
ggsurvplot(fit_km, fun = "cumhaz", main = "Cumulative Hazard"),
ggsurvplot(fit_km, fun = "cloglog", main = "Complementary log−log")
)
arrange_ggsurvplots(glist, print = TRUE, ncol = 3, nrow = 1)
The lifetable method is very common in actuary and demography. It is particularly suitable for grouped data.
\[\hat S(t_j) = \prod_{l \le j} \hat
p_l\]
with \(\hat p_j = 1- \hat q_j\)
(conditional probability of surviving), \(\hat
q_j = d_j/r_j'\) (conditional probability of dying), and
\(r_j' = r_j - c_j/2\)
In order to show this method on the actual example, we need first to create aggregated data, i.e. divide the follow-up in groups and calculate in each strata the number of people at risk, events, and censored.
cuts <- seq(0, 23, 1)
lifetab_dat <- orca %>%
mutate(time_cat = cut(time, cuts)) %>%
group_by(time_cat) %>%
summarise(nlost = sum(all == 0),
nevent = sum(all == 1))
Based on the grouped data, we will estimate the survival curve using
the lifetab() in the KMsurv package. See the
help page ?lifetab
for a description of the arguments and example.
dat_lt <- with(lifetab_dat, lifetab(tis = cuts, ninit = nrow(orca),
nlost = nlost, nevent = nevent))
round(dat_lt, 4)
nsubs nlost nrisk nevent surv pdf hazard se.surv se.pdf se.hazard
0-1 338 0 338.0 64 1.0000 0.1893 0.2092 0.0000 0.0213 0.0260
1-2 274 4 272.0 41 0.8107 0.1222 0.1630 0.0213 0.0179 0.0254
2-3 229 9 224.5 21 0.6885 0.0644 0.0981 0.0252 0.0136 0.0214
3-4 199 12 193.0 20 0.6241 0.0647 0.1093 0.0265 0.0140 0.0244
4-5 167 9 162.5 13 0.5594 0.0448 0.0833 0.0274 0.0121 0.0231
5-6 145 14 138.0 13 0.5146 0.0485 0.0989 0.0279 0.0131 0.0274
6-7 118 5 115.5 8 0.4662 0.0323 0.0717 0.0283 0.0112 0.0254
7-8 105 8 101.0 9 0.4339 0.0387 0.0933 0.0286 0.0126 0.0311
8-9 88 7 84.5 1 0.3952 0.0047 0.0119 0.0288 0.0047 0.0119
9-10 80 4 78.0 8 0.3905 0.0401 0.1081 0.0288 0.0137 0.0382
10-11 68 4 66.0 5 0.3505 0.0266 0.0787 0.0291 0.0116 0.0352
11-12 59 3 57.5 5 0.3239 0.0282 0.0909 0.0292 0.0123 0.0406
12-13 51 6 48.0 0 0.2958 0.0000 0.0000 0.0293 NaN NaN
13-14 45 2 44.0 8 0.2958 0.0538 0.2000 0.0293 0.0180 0.0704
14-15 35 6 32.0 3 0.2420 0.0227 0.0984 0.0295 0.0128 0.0567
15-16 26 3 24.5 4 0.2193 0.0358 0.1778 0.0295 0.0171 0.0885
16-17 19 5 16.5 2 0.1835 0.0222 0.1290 0.0296 0.0152 0.0910
17-18 12 2 11.0 1 0.1613 0.0147 0.0952 0.0299 0.0142 0.0951
18-19 9 2 8.0 0 0.1466 0.0000 0.0000 0.0306 NaN NaN
19-20 7 2 6.0 2 0.1466 0.0489 0.4000 0.0306 0.0300 0.2771
20-21 3 1 2.5 0 0.0977 0.0000 0.0000 0.0348 NaN NaN
21-22 2 0 2.0 1 0.0977 0.0489 0.6667 0.0348 0.0387 0.6285
22-23 1 1 0.5 0 0.0489 NA NA 0.0387 NA NA
The focus of the Nelson-Aalen estimator is on the cumulative hazard at time t \(\left( \hat{ \Lambda}_{NA}(t) \right)\).
\[\hat \Lambda_{\textrm{NA}}(t) = \sum_{j: \tau_j \le t} \frac{d_j}{r_j}\] Once we have \(\hat \Lambda(t)_{NA}\), we can derive the Fleming-Harrington estimator of \(S(t)\)
\[\hat S_{\textrm{FH}} = \exp(-\hat \Lambda(t)_{\textrm{NA}})\]
fit_fh <- survfit(Surv(time, all) ~ 1, data = orca, type = "fleming-harrington", conf.type = "log-log")
dat_fh <- fortify(fit_fh)
## for the Nelson-Aalen estimator of the cumulative hazard
#dat_fh <- fortify(fit_fh, fun = "cumhaz")
head(dat_fh)
time n.risk n.event n.censor surv std.err upper lower
1 0.085 338 2 0 0.9941003 0.004184064 0.9985212 0.9766183
2 0.162 336 2 0 0.9882006 0.005934796 0.9955551 0.9688694
3 0.167 334 4 0 0.9764365 0.008430791 0.9881458 0.9534367
4 0.170 330 2 0 0.9705366 0.009457469 0.9840379 0.9459334
5 0.246 328 1 0 0.9675821 0.009936739 0.9819155 0.9422275
6 0.249 327 1 0 0.9646277 0.010396671 0.9797562 0.9385535
It is possible to plot different estimates of the survival functions to evaluate potential differences.
ggplotly(
ggplot() +
geom_step(data = dat_km, aes(x = time, y = surv, colour = "K-M")) +
geom_step(data = dat_fh, aes(x = time, y = surv, colour = "N-A")) +
geom_step(data = dat_lt, aes(x = cuts[-length(cuts)], y = surv, colour = "LT")) +
labs(x = "Time (years)", y = "Survival", colour = "Estimator") +
theme_classic()
)
Measures of central tendency such as quantiles can be derived from the estimated survival curves.
(mc <- data.frame(q = c(.25, .5, .75),
km = quantile(fit_km),
fh = quantile(fit_fh)))
q km.quantile km.lower km.upper fh.quantile fh.lower fh.upper
25 0.25 1.333 1.084 1.834 1.333 1.084 1.747
50 0.50 5.418 4.331 6.916 5.418 4.244 6.913
75 0.75 13.673 11.748 16.580 13.673 11.748 15.833
Half of the individuals is estimated to live longer than 5.4
years.
The first one-fourth of the individuals died within 1.3 year while the
top three-fourths of the individuals lived longer than 1.3 years.
The first three-fourths of the individuals died within 13.7 year while
the top one-fourth of the individuals lived longer than 13.7 years.
A graphical presentation of the estimated quantities (based on the survival curve using K-M).
ggsurvplot(fit_km, xlab = "Time (years)", censor = F)$plot +
geom_segment(data = mc, aes(x = km.quantile, y = 1-q, xend = km.quantile, yend = 0), lty = 2) +
geom_segment(data = mc, aes(x = 0, y = 1-q, xend = km.quantile, yend = 1-q), lty = 2)
As opposed to a non-parametric approach, a parametric one assumes a
distribution for the survival distribution. The family of survival
distributions can be written by introducing location and scale changes
of the form
\[\log(T) = \mu + \sigma W\] where the
parametric assumption is made on \(W\).
The flexsurvreg() function in the flexsurv
package estimates parametric accelerated failure time (AFT) models. See
the help page ?flexsurvreg
for a description of different parametric distributions for \(W\).
We are going to consider three common choices: the exponential, the
Weibull, and the log-logistic models. In addition, the flexible
parametric modelling of time-to-event data using the spline model of
Royston and Parmar (2002) is also considered.
| Model | Hazard | Survival |
|---|---|---|
| Exponential | \(\lambda(t) = \lambda\) | \(S(t) = \exp(-\lambda t)\) |
| Weibull | \(\lambda(t) = \lambda^p p t^{p-1}\) | \(S(t) = \exp((-\lambda t)^p)\) |
| Log logistic | \(\lambda(t) = \frac{\lambda p (\lambda t)^{p-1}}{1 + (\lambda t)^p}\) | \(S(t) = \frac{1}{1 + (\lambda t)^p}\) |
fit_exp <- flexsurvreg(Surv(time, all) ~ 1, data = orca, dist = "exponential")
fit_exp
Call:
flexsurvreg(formula = Surv(time, all) ~ 1, data = orca, dist = "exponential")
Estimates:
est L95% U95% se
rate 0.11967 0.10513 0.13621 0.00791
N = 338, Events: 229, Censored: 109
Total time at risk: 1913.673
Log-likelihood = -715.1802, df = 1
AIC = 1432.36
fit_w <- flexsurvreg(Surv(time, all) ~ 1, data = orca, dist = "weibull")
fit_ll <- flexsurvreg(Surv(time, all) ~ 1, data = orca, dist = "llogis")
fit_sp <- flexsurvspline(Surv(time, all) ~ 1, data = orca, k = 1, scale = "odds")
Again, different approaches can be graphically compared (estimated survival curves or hazard functions) with a non-parametric estimator
ggflexsurvplot(fit_exp, xlab = "Time (years)", censor = F)
or with each other
grid.arrange(
ggplot(data.frame(summary(fit_exp)), aes(x = time)) +
geom_line(aes(y = est, col = "Exponential")) +
geom_line(data = data.frame(summary(fit_w)), aes(y = est, col = "Weibull")) +
geom_line(data = data.frame(summary(fit_ll)), aes(y = est, col = "Log-logistic")) +
geom_line(data = data.frame(summary(fit_sp)), aes(y = est, col = "Flex splines")) +
labs(x = "Time (years)", y = "Survival", col = "Distributions") + theme_classic(),
ggplot(data.frame(summary(fit_exp, type = "hazard")), aes(x = time)) +
geom_line(aes(y = est, col = "Exponential")) +
geom_line(data = data.frame(summary(fit_w, type = "hazard")), aes(y = est, col = "Weibull")) +
geom_line(data = data.frame(summary(fit_ll, type = "hazard")), aes(y = est, col = "Log-logistic")) +
geom_line(data = data.frame(summary(fit_sp, type = "hazard")), aes(y = est, col = "Flex splines")) +
labs(x = "Time (years)", y = "Hazard", col = "Distributions") + theme_classic(),
ncol = 2
)
A common research question is to compare the survival functions between 2 or more groups. Several alternatives (as well as packages) are available. Have a look at the testing section in the survival analysis task view.
Tumor stage, for example, is an important prognostic factor in cancer survival studies. We can estimate and plot separate survival curves for the different groups (stages) with different colors.
#ci.exp(glm(all ~ 0 + stage, data = orca, family = "poisson", offset = log(time)))
group_by(orca, stage) %>%
summarise(
D = sum(all),
Y = sum(time)
) %>%
cbind(
pois.approx(x = .$D, pt = .$Y)
)
stage D Y x pt rate lower upper conf.level
1 I 25 336.776 25 336.776 0.07423332 0.04513439 0.1033322 0.95
2 II 51 556.700 51 556.700 0.09161128 0.06646858 0.1167540 0.95
3 III 51 464.836 51 464.836 0.10971611 0.07960454 0.1398277 0.95
4 IV 57 262.552 57 262.552 0.21709985 0.16073995 0.2734597 0.95
5 unkn 45 292.809 45 292.809 0.15368380 0.10878136 0.1985862 0.95
In general, patients diagnostic with a lower stage tumor has a lower
(mortality) rate as compared to patients with high stage tumor. An
overall comparison of the survival functions can be performed using the
survfit() function.
su_stg <- survfit(Surv(time, all) ~ stage, data = orca)
su_stg
Call: survfit(formula = Surv(time, all) ~ stage, data = orca)
n events median 0.95LCL 0.95UCL
stage=I 50 25 10.56 6.17 NA
stage=II 77 51 7.92 4.92 13.34
stage=III 72 51 7.41 3.92 9.90
stage=IV 68 57 2.00 1.08 4.82
stage=unkn 71 45 3.67 2.83 8.17
As the incidence rates are lower for low tumoral stages, the median survival times also decrease for increasing levels of tumoral stage. The same behavior can be observed plotting the K-M survival curves separately for the different tumoral stages.
ggsurvplot(su_stg, fun = "event", censor = F, xlab = "Time (years)")
It is also possible to construct the whole survival table for each stage level. Here the first 3 lines of the survival table in each tumoral stage.
lifetab_stg <- fortify(su_stg)
lifetab_stg %>%
group_by(strata) %>%
do(head(., n = 3))
# A tibble: 15 × 9
# Groups: strata [5]
time n.risk n.event n.censor surv std.err upper lower strata
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct>
1 0.17 50 1 0 0.98 0.0202 1 0.942 I
2 0.498 49 1 0 0.96 0.0289 1 0.907 I
3 0.665 48 1 0 0.94 0.0357 1 0.876 I
4 0.419 77 1 0 0.987 0.0131 1 0.962 II
5 0.498 76 1 0 0.974 0.0186 1 0.939 II
6 0.665 75 1 0 0.961 0.0229 1 0.919 II
7 0.167 72 1 0 0.986 0.0140 1 0.959 III
8 0.249 71 1 0 0.972 0.0199 1 0.935 III
9 0.413 70 1 0 0.958 0.0246 1 0.913 III
10 0.085 68 2 0 0.971 0.0211 1 0.931 IV
11 0.162 66 1 0 0.956 0.0261 1 0.908 IV
12 0.167 65 1 0 0.941 0.0303 0.999 0.887 IV
13 0.162 71 1 0 0.986 0.0142 1 0.959 unkn
14 0.167 70 2 0 0.958 0.0249 1 0.912 unkn
15 0.17 68 1 0 0.944 0.0290 0.999 0.892 unkn
Alternatively, the cumulative hazards and the log-cumulative hazards for the different stages can be presented.
glist <- list(
ggsurvplot(su_stg, fun = "cumhaz"),
ggsurvplot(su_stg, fun = "cloglog")
)
# plot(su_stg, fun = "cloglog")
arrange_ggsurvplots(glist, print = TRUE, ncol = 2, nrow = 1)
Several methods have been developed for formally testing the overall
equivalence of survival curves. The survdiff() in the
survival package implements the G-rho family of
tests for evaluating differences in survival curves. See the help page
(?survdiff) for the different options .
The default argument rho = 0 implements the log-rank or
Mantel-Haenszel test.
survdiff(Surv(time, all) ~ stage, data = orca)
Call:
survdiff(formula = Surv(time, all) ~ stage, data = orca)
N Observed Expected (O-E)^2/E (O-E)^2/V
stage=I 50 25 39.9 5.573 6.813
stage=II 77 51 63.9 2.606 3.662
stage=III 72 51 54.1 0.174 0.231
stage=IV 68 57 33.2 16.966 20.103
stage=unkn 71 45 37.9 1.346 1.642
Chisq= 27.2 on 4 degrees of freedom, p= 2e-05
With rho = 1 it is equivalent to the Peto & Peto
modification of the Gehan-Wilcoxon test.
survdiff(Surv(time, all) ~ stage, data = orca, rho = 1)
Call:
survdiff(formula = Surv(time, all) ~ stage, data = orca, rho = 1)
N Observed Expected (O-E)^2/E (O-E)^2/V
stage=I 50 14.5 25.2 4.500 7.653
stage=II 77 29.3 39.3 2.549 4.954
stage=III 72 30.7 33.8 0.284 0.521
stage=IV 68 40.3 22.7 13.738 21.887
stage=unkn 71 32.0 25.9 1.438 2.359
Chisq= 30.9 on 4 degrees of freedom, p= 3e-06
Different tests use different weights in comparing the survival functions depending on the failure times. In the actual example, they give comparable results, suggesting that the survival functions for different tumoral stage are different.
Non-parametric tests are particularly feasible when comparing
survival functions across the levels of a factor. They are fairly
robust, efficient, and usually simple/intuitive.
As the number of factors of interest increases, however, non-parametric
tests become difficult to conduct and interpret. Regression models,
instead, are more flexible for exploring the relationship between
survival and predictors.
We will cover two different broad class of models: semi-parametric (i.e. proportional hazard) and parametric (accelerated failure time) models.
A cox proportional hazards model assumes a baseline hazard function \(\lambda_0(t)\), i.e. the hazard for the reference group (\(Z_1, \dots, Z_p = 0\)). Each predictor \(Z_i\) has a multiplicative effect on the hazard.
\[\lambda(t, Z) = \lambda_0(t)e^{Z\beta}\] The semi-parametric nature of the Cox model is that the baseline rate may vary over time and it is not required to be estimated. The major assumption of the Cox model is that the hazard ratio for a predictor \(Z_i\) is constant (\(e^{\beta_i}\)) and does not depend on the time, i.e. the hazards in the two groups are proportional over time.
In our example we will consider modeling time to death as a function
sex, age, and tumoral stage.
A cox proportional hazards model can be fitted using the
coxph() function in the survival package.
m1 <- coxph(Surv(time, all) ~ sex + I((age-65)/10) + stage, data = orca)
summary(m1)
Call:
coxph(formula = Surv(time, all) ~ sex + I((age - 65)/10) + stage,
data = orca)
n= 338, number of events= 229
coef exp(coef) se(coef) z Pr(>|z|)
sexMale 0.35139 1.42104 0.14139 2.485 0.012947
I((age - 65)/10) 0.41603 1.51593 0.05641 7.375 1.65e-13
stageII 0.03492 1.03554 0.24667 0.142 0.887421
stageIII 0.34545 1.41262 0.24568 1.406 0.159708
stageIV 0.88542 2.42399 0.24273 3.648 0.000265
stageunkn 0.58441 1.79393 0.25125 2.326 0.020016
exp(coef) exp(-coef) lower .95 upper .95
sexMale 1.421 0.7037 1.0771 1.875
I((age - 65)/10) 1.516 0.6597 1.3573 1.693
stageII 1.036 0.9657 0.6386 1.679
stageIII 1.413 0.7079 0.8728 2.286
stageIV 2.424 0.4125 1.5063 3.901
stageunkn 1.794 0.5574 1.0963 2.935
Concordance= 0.674 (se = 0.02 )
Likelihood ratio test= 86.76 on 6 df, p=<2e-16
Wald test = 80.5 on 6 df, p=3e-15
Score (logrank) test = 82.86 on 6 df, p=9e-16
There are 3 types of diagonostics for:
1) testing the proportional hazards assumption;
2) examining influential observations (or outliers);
3) detecting non-linearity in association between the log hazard and the
covariates.
To check the previous model assumptions, we are going to consider
following residuals:
1) Schoenfeld residuals to check the proportional hazards
assumption;
2) Deviance residual (symmetric transformation of the
Martinguale residuals), to examine influential observations;
3) Martingale residual to assess non-linearity.
Those residuals can be computed and graphically presented using
fucnctions in the survminer package (ggcoxzph
and ggcoxdiagnostics).
We can first check whether the data are sufficiently consistent with
the assumption of proportional hazards with respect to each of the
variables separately as well as globally, using the
cox.zph() function.
cox.zph.m1 <- cox.zph(m1)
cox.zph.m1
chisq df p
sex 0.275 1 0.60
I((age - 65)/10) 1.706 1 0.19
stage 3.648 4 0.46
GLOBAL 5.144 6 0.53
No evidence against proportionality assumption could apparently be found.
Theggcoxzph function can be used to plot, for each
covariate, the scaled Schoenfeld residuals against the transformed time.
ggcoxzph(cox.zph.m1)
ggcoxdiagnostics provides a convenient
solution for checkind influential observations
ggcoxdiagnostics(m1, type = "dfbeta", linear.predictions = FALSE)
ggcoxdiagnostics(m1, type = "deviance", linear.predictions = FALSE)
Martingale residuals can be useful to detect non-linearity for
continuous covariates. They are in the range (\(-\infty\), +1):
- a value of martinguale residuals near 1 represents individuals that
“died too soon”;
- large negative values correspond to individuals that “lived too
long”.
cox_martingale <- coxph(Surv(time, all) ~ log(age) + sqrt(age), data = orca)
ggcoxfunctional(cox_martingale, data = orca)
Results from the Cox model suggested a significant effect of sex, age, and stage. In particular, every 10 year increase in age the mortality rate increased by 50%. The HR for all-cause mortality comparing men to women was 1.42. Moreover, no differences could be observed between stages I and II in the estimates. On the other hand, the group with stage unknown is a complex mixture of patients from various true stages. Therefore, it may be prudent to exclude these subjects from the data and to pool the first two stage groups into one.
orca2 <- orca %>%
filter(stage != "unkn") %>%
mutate(st3 = Relevel(droplevels(stage), list(1:2, 3, 4)))
m2 <- coxph(Surv(time, all) ~ sex + I((age-65)/10) + st3, data = orca2, ties = "breslow")
round(ci.exp(m2), 4)
exp(Est.) 2.5% 97.5%
sexMale 1.3284 0.9763 1.8074
I((age - 65)/10) 1.4624 1.2947 1.6519
st3III 1.3620 0.9521 1.9482
st3IV 2.3828 1.6789 3.3818
A convenient way to display and graphically compare the results from multivariate Cox models is through forest plots.
grid.arrange(
ggforest(m1, data = orca),
ggforest(m2, data = orca2),
ncol = 2
)
Let’s plot the predicted survival curves by stage, fixing the values
for sex and age (focusing only on 40 and 80 year old patients), based on
the fitted model m2. In order to do that, we first create a
new artificial data frame containing the desired values for the
covariates.
newd <- expand.grid(sex = c("Male", "Female"), age = c(40, 80), st3 = levels(orca2$st3))
newd$id <- 1:12
newd
sex age st3 id
1 Male 40 I+II 1
2 Female 40 I+II 2
3 Male 80 I+II 3
4 Female 80 I+II 4
5 Male 40 III 5
6 Female 40 III 6
7 Male 80 III 7
8 Female 80 III 8
9 Male 40 IV 9
10 Female 40 IV 10
11 Male 80 IV 11
12 Female 80 IV 12
surv_summary(survfit(m2, newdata = newd)) %>%
merge(newd, by.x = "strata", by.y = "id") %>%
ggplot(aes(x = time, y = surv, col = sex, linetype = factor(age))) +
geom_step() + facet_grid(. ~ st3) +
labs(x = "Time (years)", y = "Survival probability") + theme_classic()
A parametric model assumes a distribution for the survival time. The model can be written as
\[\log(T) = \mu + Z\beta + \sigma W\] We will consider a popular strategy, i.e. \(W \sim \textrm{Weibull}(\lambda, \gamma)\)
m2w <- flexsurvreg(Surv(time, all) ~ sex + I((age-65)/10) + st3, data = orca2, dist = "weibull")
m2w
Call:
flexsurvreg(formula = Surv(time, all) ~ sex + I((age - 65)/10) +
st3, data = orca2, dist = "weibull")
Estimates:
data mean est L95% U95% se exp(est) L95%
shape NA 0.93268 0.82957 1.04861 0.05575 NA NA
scale NA 13.53151 9.97582 18.35456 2.10472 NA NA
sexMale 0.53184 -0.33905 -0.66858 -0.00951 0.16813 0.71245 0.51243
I((age - 65)/10) -0.15979 -0.41836 -0.54898 -0.28773 0.06665 0.65813 0.57754
st3III 0.26966 -0.32567 -0.70973 0.05839 0.19595 0.72204 0.49178
st3IV 0.25468 -0.95656 -1.33281 -0.58030 0.19197 0.38421 0.26374
U95%
shape NA
scale NA
sexMale 0.99053
I((age - 65)/10) 0.74996
st3III 1.06012
st3IV 0.55973
N = 267, Events: 184, Censored: 83
Total time at risk: 1620.864
Log-likelihood = -545.858, df = 6
AIC = 1103.716
The interpretation of the coefficients is in terms of \(t\). For example, men have a 30% (\(100\cdot(1-0.71)\)) reduction in the survival time compared to women, adjusting for age and tumoral stage. Every ten years increase in age are associated with a 35% reduction in survival times, adjusting for sex and tumoral stage.
It can be shown that an AFT models assuming an exponential or a
Weibull distribution can be reparametrized as proportional hazards
models (with baseline hazard from the exponential/Weibull family of
distributions).
This can be shown using the weibreg() function in the
eha package.
m2wph <- weibreg(Surv(time, all) ~ sex + I((age-65)/10) + st3, data = orca2)
summary(m2wph)
Call:
weibreg(formula = Surv(time, all) ~ sex + I((age - 65)/10) +
st3, data = orca2)
Covariate Mean Coef Exp(Coef) se(Coef) Wald p
sex
Female 0.490 0 1 (reference)
Male 0.510 0.316 1.372 0.156 0.043
I((age - 65)/10) -0.522 0.390 1.477 0.062 0.000
st3
I+II 0.551 0 1 (reference)
III 0.287 0.304 1.355 0.182 0.095
IV 0.162 0.892 2.440 0.178 0.000
log(scale) 2.605 13.532 0.156 0.000
log(shape) -0.070 0.933 0.060 0.244
Events
Total time at risk 1620.9
Max. log. likelihood -545.86
LR test statistic 68.7
Degrees of freedom 4
Overall p-value 4.30767e-14
The (exponential of the) coefficients have an equivalent interpretation to the coefficients of a Cox proportional model (the estimates are similar as well).
Any function of the parameters of a fitted model can be summarized or
plotted by supplying the argument fn to the
summary or plot methods. For example, median
survival under the Weibull model can be summarized as
median.weibull <- function(shape, scale) qweibull(0.5, shape = shape, scale = scale)
set.seed(2153)
newd <- data.frame(sex = c("Male", "Female"), age = 65, st3 = "I+II")
summary(m2w, newdata = newd, fn = median.weibull, t = 1, B = 10000)
sex=Male,I((age - 65)/10)=0,st3=I+II
time est lcl ucl
1 1 6.507834 4.898889 8.631952
sex=Female,I((age - 65)/10)=0,st3=I+II
time est lcl ucl
1 1 9.134466 6.724806 12.28625
Compare the results with those from the Cox model.
survfit(m2, newdata = newd)
Call: survfit(formula = m2, newdata = newd)
n events median 0.95LCL 0.95UCL
1 267 184 7.00 5.25 10.6
2 267 184 9.92 7.33 13.8
It can be shown that the Cox model is mathematically equivalent to a
Poisson regression model on a particular transformation of the
data.
The idea is to split the follow-up time every time an event is observed
in such a way every time interval contains only one event. In this
augmented dataset subjects may be represented several times (multiple
rows).
We first define the unique time where we observed an event
(all == 1) and use the survSplit() function in
the survival package to create the augmented or splitted
data.
cuts <- sort(unique(orca2$time[orca2$all == 1]))
orca_splitted <- survSplit(Surv(time, all) ~ ., data = orca2, cut = cuts, episode = "tgroup")
head(orca_splitted, 15)
id sex age stage event st3 tstart time all tgroup
1 2 Female 83.08783 III Oral ca. death III 0.000 0.085 0 1
2 2 Female 83.08783 III Oral ca. death III 0.085 0.162 0 2
3 2 Female 83.08783 III Oral ca. death III 0.162 0.167 0 3
4 2 Female 83.08783 III Oral ca. death III 0.167 0.170 0 4
5 2 Female 83.08783 III Oral ca. death III 0.170 0.249 0 5
6 2 Female 83.08783 III Oral ca. death III 0.249 0.252 0 6
7 2 Female 83.08783 III Oral ca. death III 0.252 0.329 0 7
8 2 Female 83.08783 III Oral ca. death III 0.329 0.413 0 8
9 2 Female 83.08783 III Oral ca. death III 0.413 0.419 1 9
10 3 Male 52.59008 II Other death I+II 0.000 0.085 0 1
11 3 Male 52.59008 II Other death I+II 0.085 0.162 0 2
12 3 Male 52.59008 II Other death I+II 0.162 0.167 0 3
13 3 Male 52.59008 II Other death I+II 0.167 0.170 0 4
14 3 Male 52.59008 II Other death I+II 0.170 0.249 0 5
15 3 Male 52.59008 II Other death I+II 0.249 0.252 0 6
The gnm() function in the gnm package fits
a conditional Poisson on splitted data, where the effects of the time
(as a factor variable) can be marginalized (not estimated to improve
computational efficiency).
mod_poi <- gnm(all ~ sex + I((age-65)/10) + st3, data = orca_splitted,
family = poisson, eliminate = factor(time))
summary(mod_poi)
Call:
gnm(formula = all ~ sex + I((age - 65)/10) + st3, eliminate = factor(time),
family = poisson, data = orca_splitted)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.00325 -0.14706 -0.11186 -0.08518 3.45881
Coefficients of interest:
Estimate Std. Error z value Pr(>|z|)
sexMale 0.28394 0.15711 1.807 0.0707
I((age - 65)/10) 0.38009 0.06215 6.116 9.58e-10
st3III 0.30892 0.18265 1.691 0.0908
st3IV 0.86827 0.17864 4.860 1.17e-06
(Dispersion parameter for poisson family taken to be 1)
Residual deviance: 1566.6 on 19426 degrees of freedom
AIC: 2370.6
Number of iterations: 3
Compare the estimates obtained from the conditional Poisson with the cox proportional hazard model.
round(data.frame(cox = ci.exp(m2), poisson = ci.exp(mod_poi)), 4)
cox.exp.Est.. cox.2.5. cox.97.5. poisson.exp.Est.. poisson.2.5. poisson.97.5.
sexMale 1.3284 0.9763 1.8074 1.3284 0.9763 1.8074
I((age - 65)/10) 1.4624 1.2947 1.6519 1.4624 1.2947 1.6519
st3III 1.3620 0.9521 1.9482 1.3620 0.9521 1.9482
st3IV 2.3828 1.6789 3.3818 2.3828 1.6789 3.3818
If we want to estimate the baseline hazard, we need also to estimate the effects of time in the Poisson model (OBS we also need to include the (log) length of the time intervals as offset).
orca_splitted$dur <- with(orca_splitted, time - tstart)
mod_poi2 <- glm(all ~ -1 + factor(time) + sex + I((age-65)/10) + st3,
data = orca_splitted, family = poisson, offset = log(dur))
The baseline hazard consists of a step function, where the rate is constant in each time interval.
newd <- data.frame(time = cuts, dur = 1,
sex = "Female", age = 65, st3 = "I+II")
blhaz <- data.frame(ci.pred(mod_poi2, newdata = newd))
ggplot(blhaz, aes(x = c(0, cuts[-138]), y = Estimate, xend = cuts, yend = Estimate)) + geom_segment() +
scale_y_continuous(trans = "log", limits = c(.05, 5), breaks = pretty_breaks()) +
theme_classic() + labs(x = "Time (years)", y = "Baseline hazard")
A better approach would be to flexibly model the baseline hazard by using, for instance, splines with knots \(k\).
k <- quantile(orca2$time, 1:5/6)
mod_poi2s <- glm(all ~ ns(time, knots = k) + sex + I((age-65)/10) + st3,
data = orca_splitted, family = poisson, offset = log(dur))
round(ci.exp(mod_poi2s), 3)
exp(Est.) 2.5% 97.5%
(Intercept) 0.074 0.040 0.135
ns(time, knots = k)1 0.402 0.177 0.912
ns(time, knots = k)2 1.280 0.477 3.432
ns(time, knots = k)3 0.576 0.220 1.509
ns(time, knots = k)4 1.038 0.321 3.358
ns(time, knots = k)5 4.076 0.854 19.452
ns(time, knots = k)6 1.040 0.171 6.314
sexMale 1.325 0.975 1.801
I((age - 65)/10) 1.469 1.300 1.659
st3III 1.360 0.952 1.942
st3IV 2.361 1.665 3.347
blhazs <- data.frame(ci.pred(mod_poi2s, newdata = newd))
ggplot(blhazs, aes(x = newd$time, y = Estimate)) + geom_line() +
geom_ribbon(aes(ymin = X2.5., ymax = X97.5.), alpha = .2) +
scale_y_continuous(trans = "log", breaks = pretty_breaks()) +
theme_classic() + labs(x = "Time (years)", y = "Baseline hazard")
We can compare the previous strategies based on the predicted survival curves for a specific covariate patters, saying 65 years-old women with tumoral stage I or II.
newd <- data.frame(sex = "Female", age = 65, st3 = "I+II")
surv_cox <- fortify(survfit(m2, newdata = newd))
surv_weibull <- summary(m2w, newdata = newd, tidy = TRUE)
## For the poisson model we need some extra steps
tbmid <- sort(unique(.5*(orca_splitted$tstart + orca_splitted$time)))
mat <- cbind(1, ns(tbmid, knots = k), 0, 0, 0, 0)
Lambda <- ci.cum(mod_poi2s, ctr.mat = mat, intl = diff(c(0, tbmid)))
surv_poisson <- data.frame(exp(-Lambda))
A graphical representation of the survival function facilitates the comparison.
ggplot(surv_cox, aes(time, surv)) + geom_step(aes(col = "Cox")) +
geom_line(data = surv_weibull, aes(y = est, col = "Weibull")) +
geom_line(data = surv_poisson, aes(x = c(0, tbmid[-1]), y = Estimate, col = "Poisson")) +
labs(x = "Time (years)", y = "Survival", col = "Models") + theme_classic()
Get interactive! Shiny tutorial.
library(shiny)
library(tidyverse)
library(splines)
library(Epi)
library(survival)
library(flexsurv)
library(ggfortify)
shinyApp(
ui = fluidPage(
h2("Choose covariate pattern:"),
selectInput("sex", label = h3("Sex"),
choices = list("Female" = "Female" , "Male" = "Male")),
sliderInput("age", label = h3("Age"),
min = 20, max = 80, value = 65),
selectInput("st3", label = h3("Stage (3 levels)"),
choices = list("Stage I and II" = "I+II", "Stage III" = "III",
"Stage IV" = "IV")),
plotOutput("survPlot")
),
server = function(input, output){
orca <- read.table("http://www.stats4life.se/data/oralca.txt", header = T,
stringsAsFactors = TRUE)
orca2 <- orca %>%
filter(stage != "unkn") %>%
mutate(st3 = Relevel(droplevels(stage), list(1:2, 3, 4)),
all = 1*(event != "Alive"))
m2 <- coxph(Surv(time, all) ~ sex + I((age-65)/10) + st3, data = orca2, ties = "breslow")
m2w <- flexsurvreg(Surv(time, all) ~ sex + I((age-65)/10) + st3, data = orca2, dist = "weibull")
cuts <- sort(unique(orca2$time[orca2$all == 1]))
orca_splitted <- survSplit(Surv(time, all) ~ ., data = orca2, cut = cuts, episode = "tgroup")
orca_splitted$dur <- with(orca_splitted, time - tstart)
k <- quantile(orca2$time, 1:5/6)
mod_poi2s <- glm(all ~ ns(time, knots = k) + sex + I((age-65)/10) + st3,
data = orca_splitted, family = poisson, offset = log(dur))
newd <- reactive({
data.frame(sex = input$sex, age = input$age, st3 = input$st3)
})
output$survPlot <- renderPlot({
newd <- newd()
surv_cox <- fortify(survfit(m2, newdata = newd))
surv_weibull <- summary(m2w, newdata = newd, tidy = TRUE)
tbmid <- sort(unique(.5*(orca_splitted$tstart + orca_splitted$time)))
mat <- cbind(1, ns(tbmid, knots = k), 1*(input$sex == "Male"), (input$age - 65)/10,
1*(input$st3 == "III"), 1*(input$st3 == "IV"))
Lambda <- ci.cum(mod_poi2s, ctr.mat = mat, intl = diff(c(0, tbmid)))
surv_poisson <- data.frame(exp(-Lambda))
ggplot(surv_cox, aes(time, surv)) + geom_step(aes(col = "Cox")) +
geom_line(data = surv_weibull, aes(y = est, col = "Weibull")) +
geom_line(data = surv_poisson, aes(x = c(0, tbmid[-1]), y = Estimate, col = "Poisson")) +
labs(x = "Time (years)", y = "Survival", col = "Models") + theme_classic()
})
}
)
We have assumed that the effect of age on the (log) mortality rate is linear. A possible strategy to relax this assumption is fit a Cox model where age is modeled with a quadratic effect.
m3 <- coxph(Surv(time, all) ~ sex + I(age-65) + I((age-65)^2) + st3, data = orca2)
summary(m3)
Call:
coxph(formula = Surv(time, all) ~ sex + I(age - 65) + I((age -
65)^2) + st3, data = orca2)
n= 267, number of events= 184
coef exp(coef) se(coef) z Pr(>|z|)
sexMale 2.903e-01 1.337e+00 1.591e-01 1.825 0.0681
I(age - 65) 3.868e-02 1.039e+00 6.554e-03 5.902 3.59e-09
I((age - 65)^2) 9.443e-05 1.000e+00 3.576e-04 0.264 0.7917
st3III 3.168e-01 1.373e+00 1.838e-01 1.724 0.0847
st3IV 8.691e-01 2.385e+00 1.787e-01 4.863 1.16e-06
exp(coef) exp(-coef) lower .95 upper .95
sexMale 1.337 0.7481 0.9787 1.826
I(age - 65) 1.039 0.9621 1.0262 1.053
I((age - 65)^2) 1.000 0.9999 0.9994 1.001
st3III 1.373 0.7284 0.9576 1.968
st3IV 2.385 0.4193 1.6801 3.385
Concordance= 0.674 (se = 0.022 )
Likelihood ratio test= 64.89 on 5 df, p=1e-12
Wald test = 63.11 on 5 df, p=3e-12
Score (logrank) test = 67.64 on 5 df, p=3e-13
The \(p\)-value for non-linearity (i.e. quadratic term) is high and thus there is no evidence to reject the null hypothesis (i.e. the linearity assumption is appropriate).
If the relation would be non-linear, the coefficients for age are no longer directly interpretable. We can instead present the HR graphically as a function of age. We need to specify a referent value; we chose the median age value of 65 years old.
age <- seq(20, 80, 1) - 65
hrtab <- ci.exp(m3, ctr.mat = cbind(0, age, age^2, 0, 0))
ggplot(data.frame(hrtab), aes(x = age+65, y = exp.Est.., ymin = X2.5., ymax = X97.5.)) +
geom_line() + geom_ribbon(alpha = .1) +
scale_y_continuous(trans = "log", breaks = pretty_breaks()) +
labs(x = "Age (years)", y = "Hazard ratio") + theme_classic() +
geom_vline(xintercept = 65, lty = 2) + geom_hline(yintercept = 1, lty = 2)
Let’s consider the Cox proportional hazard model fitted and saved in
the object m1. As already mentioned, the main assumption is
that the effect of a predictor, e.g. sex, is constant over time.
\[\lambda(t, Z) = \lambda_0(t)e^{Z\beta}\]
The cox.zph() function is useful to plot the effect of
individual predictors over time, and is thus used to diagnose and
understand non-proportional hazards.
plot(cox.zph.m1[1])
abline(h= m1$coef[1], col = 2, lty = 2, lwd = 2)
We can relax the proportional hazard assumption by fitting a step function for \(\beta(t)\), which implies different \(\beta\)s over different time intervals.
\[\lambda(t, Z) = \lambda_0(t)e^{Z\beta(t)}\]
The survSplit() function in the survival
package breaks the data set into time dependent parts. Let’s consider as
time breaks 5 and 15.
orca3 <- survSplit(Surv(time, all) ~ ., data = orca2, cut = c(5, 15), episode = "tgroup")
head(orca3)
id sex age stage event st3 tstart time all tgroup
1 2 Female 83.08783 III Oral ca. death III 0 0.419 1 1
2 3 Male 52.59008 II Other death I+II 0 5.000 0 1
3 3 Male 52.59008 II Other death I+II 5 7.915 1 2
4 4 Male 77.08630 I Other death I+II 0 2.480 1 1
5 5 Male 80.33622 IV Oral ca. death IV 0 2.500 1 1
6 6 Female 82.58132 IV Other death IV 0 0.167 1 1
m3 <- coxph(Surv(tstart, time, all) ~ relevel(sex, 2):strata(tgroup) + I((age-65)/10) + st3, data = orca3)
m3
Call:
coxph(formula = Surv(tstart, time, all) ~ relevel(sex, 2):strata(tgroup) +
I((age - 65)/10) + st3, data = orca3)
coef exp(coef) se(coef) z p
I((age - 65)/10) 0.38184 1.46498 0.06255 6.104 1.03e-09
st3III 0.28857 1.33451 0.18393 1.569 0.1167
st3IV 0.87579 2.40076 0.17963 4.876 1.09e-06
relevel(sex, 2)Male:strata(tgroup)tgroup=1 0.42076 1.52312 0.19052 2.209 0.0272
relevel(sex, 2)Female:strata(tgroup)tgroup=1 NA NA 0.00000 NA NA
relevel(sex, 2)Male:strata(tgroup)tgroup=2 -0.10270 0.90240 0.28120 -0.365 0.7149
relevel(sex, 2)Female:strata(tgroup)tgroup=2 NA NA 0.00000 NA NA
relevel(sex, 2)Male:strata(tgroup)tgroup=3 1.13186 3.10142 1.09435 1.034 0.3010
relevel(sex, 2)Female:strata(tgroup)tgroup=3 NA NA 0.00000 NA NA
Likelihood ratio test=68.06 on 6 df, p=1.023e-12
n= 416, number of events= 184
Although not significant, the hazard ratio comparing male to female
is lower than 1 for the second time period (between 5 and 15 years)
while it’s higher than one for the other two time periods. The
cox.zph() function can be used to check if there are still
departure from the proportionality assumption on the splitted
analysis.
A different yet interesting perspective consists of modelling the percentiles of survival times. A comprehensive description of the main aspect of the methodology and possible advantages can be found here.
For example, we may be interested in comparing the time by which the probability of surviving is 75%.
p <- .25
fit_qs <- ctqr(Surv(time, all) ~ st3, p = p, data = orca2)
fit_qs
Call:
ctqr(formula = Surv(time, all) ~ st3, data = orca2, p = p)
Coefficients:
p = 0.25
(Intercept) 2.678
st3III -1.276
st3IV -1.878
Number of observations: 267
N. of events: 184
Number of free parameters: 30 (CDF) + 3 (ctqr)
\(\beta_0 = 2.665\) is the time by which the probability of dying is equal to 0.25 in the reference group. The other \(\beta\) are interpreted as relative measures. For example, for \(\beta_1\), we can say that the time by which the probability of dying is equal to 0.25 in people diagnosed with tumoral stage III is 1.4 years shorter as compared with people diagnosed with tumoral stage I or II.
This information can be visualize comparing the survival curves estimated separately across the levels of tumoral stage.
fit_kms <- survfit(Surv(time, all) ~ st3, data = orca2)
pc_qs <- tibble(
st3 = levels(orca2$st3),
coef = coef(fit_qs),
time_p = c(NA, coef[-1] + coef[1]),
time_ref = coef[1],
p = c(p, p - .005, p + .005)
)[-1, ]
surv_summary(fit_kms, data = orca2) %>%
ggplot(aes(time, surv, col = st3)) +
geom_step() +
geom_segment(data = pc_qs, aes(x = time_p, y = 1 - p, xend = time_ref,
yend = 1 - p))
A complementary analysis to that one represented in the Cox model m2 evaluates possible differences in the percentiles of the survival time as a function of age at diagnosis, sex, and tumoral stage
fit_q <- ctqr(Surv(time, all) ~ sex + I((age-65)/10) + st3, p = seq(.1, .7, .1),
data = orca2)
fit_q
Call:
ctqr(formula = Surv(time, all) ~ sex + I((age - 65)/10) + st3,
data = orca2, p = seq(0.1, 0.7, 0.1))
Coefficients:
p = 0.1 p = 0.2 p = 0.3 p = 0.4 p = 0.5 p = 0.6 p = 0.7
(Intercept) 1.28574 2.62470 4.99833 7.94709 11.20292 12.19435 15.21107
sexMale 0.02074 -0.39417 -1.06789 -2.76112 -3.65836 -2.79597 -4.01113
I((age - 65)/10) -0.12818 -0.49331 -1.38838 -2.11039 -2.38810 -2.97978 -3.49147
st3III -0.41452 -1.07507 -1.97123 -2.40554 -2.87689 -1.80784 -2.05138
st3IV -1.00645 -1.72545 -3.02944 -3.16852 -4.99132 -4.80402 -5.42381
Number of observations: 267
N. of events: 184
Number of free parameters: 35 (CDF) + 5 (ctqr)
The results consist of differences in the survival time for each covariate at different percentiles. A graphical presentation can facilitate interpretation of the results.
coef_q <- data.frame(coef(fit_q)) %>%
rownames_to_column(var = "coef") %>%
gather(p, beta, -coef) %>%
mutate(
p = as.double(gsub("p...", "", p)),
se = c(sapply(fit_q$covar, function(x) sqrt(diag(x)))),
low = beta - 1.96 * se,
up = beta + 1.96 * se
)
ggplot(coef_q, aes(p, beta)) +
geom_line() +
geom_ribbon(aes(ymin = low, ymax = up), alpha = .1) +
facet_wrap(~ coef, nrow = 2, scales = "free") +
theme_bw()
Alternatively, the percentiles of survival times can be predicted for a set of specific covariance patterns.
expand.grid(sex = c("Male", "Female"), age = c(40, 80), st3 = levels(orca2$st3)) %>%
cbind(round(predict(fit_q, newdata = .), 3))
sex age st3 p0.1 p0.2 p0.3 p0.4 p0.5 p0.6 p0.7
1 Male 40 I+II 1.627 3.464 7.401 10.462 13.515 16.848 19.929
2 Female 40 I+II 1.606 3.858 8.469 13.223 17.173 19.644 23.940
3 Male 80 I+II 1.114 1.491 1.848 2.020 3.962 4.929 5.963
4 Female 80 I+II 1.093 1.885 2.916 4.781 7.621 7.725 9.974
5 Male 40 III 1.212 2.389 5.430 8.056 10.638 15.040 17.877
6 Female 40 III 1.192 2.783 6.498 10.818 14.296 17.836 21.888
7 Male 80 III 0.700 0.415 -0.123 -0.385 1.086 3.121 3.911
8 Female 80 III 0.679 0.810 0.945 2.376 4.744 5.917 7.922
9 Male 40 IV 0.620 1.738 4.372 7.293 8.523 12.044 14.505
10 Female 40 IV 0.600 2.133 5.440 10.055 12.182 14.840 18.516
11 Male 80 IV 0.108 -0.235 -1.182 -1.148 -1.029 0.125 0.539
12 Female 80 IV 0.087 0.159 -0.114 1.613 2.629 2.921 4.550
Oftentimes the main interest if on the risk or hazard of dying from
one specific cause. The cause-specific event may not be observed because
of a competing cause which prevents the subject to develop the event.
Competing events occur not only for cause-specific mortality but more in
general every time an event prevents a concurrent event to happen.
In our example we are interested in modeling the risk of mortality from
oral cancer, and dying from other causes will be considered as competing
event.
In a competing risk scenario, event-specific survival obtained
censoring the other event (aka naive Kaplan–Meier estimates of
cause-specific survival) are generally not appropriate.
We will consider instead the cumulative incidence function (CIF) for the
event \(c\)
\[F_c(t) = P(T \le t \textrm{ and } C =
c)\] From these, it is possible to recover the CDF of event-free
survival time \(T\), i.e. cumulative
risk of any event by t: \(F(t) = \sum_c
F_c(t)\)
and the event-free survival function, i.e. probability of avoiding all
events by \(t\): \(S(t) = 1 - F(t)\)
The CIF (risk of event \(c\) over
risk period \([0, t]\) in the presence
of competing risks) can also be obtained as
\[F_c(t) = \int_{0}^{t} \lambda_c(v)S(v)
dv\] Depends on the hazard of the competing event as well
\[S(t) = \exp\left( - \int_{0}^{t}
[\lambda_1(v) + \lambda_2(v)]dv \right)\]
The hazard of the subdistribution is defined as
\[\gamma_c(t)=f_c(t)/[1-F_c(t)]\] and
is not the same as \(\lambda_c(t) =
f_c(t)/[1-F(t)]\)
The Cuminc() function in the mstate package
calculates non-parametric CIF (aka Aalen–Johansen estimates) and
associated standard errors for the competing events.
cif <- Cuminc(time = "time", status = "event", data = orca2)
head(cif)
time Surv CI.Oral ca. death CI.Other death seSurv seCI.Oral ca. death
1 0.085 0.9925094 0.007490637 0.000000000 0.005276805 0.005276805
2 0.162 0.9887640 0.011235955 0.000000000 0.006450534 0.006450534
3 0.167 0.9812734 0.011235955 0.007490637 0.008296000 0.006450534
4 0.170 0.9775281 0.011235955 0.011235955 0.009070453 0.006450534
5 0.249 0.9737828 0.011235955 0.014981273 0.009778423 0.006450534
6 0.252 0.9662921 0.014981273 0.018726592 0.011044962 0.007434315
seCI.Other death
1 0.000000000
2 0.000000000
3 0.005276805
4 0.006450534
5 0.007434315
6 0.008296000
We can plot the CIF (one stacked of the other) together with the derived event-free survival function.
ggplot(cif, aes(time)) +
geom_step(aes(y = `CI.Oral ca. death`, colour = "Cancer death")) +
geom_step(aes(y = `CI.Oral ca. death`+ `CI.Other death`, colour = "Total")) +
geom_step(aes(y = Surv, colour = "Overall survival")) +
labs(x = "Time (years)", y = "Proportion", colour = "") +
theme_classic()
Extensions have been implemented to estimate the cumulative
incidences functions by the levels of a factor variable, e.g by stage in
3 levels (st3) for both causes of death.
cif_stage <- Cuminc(time = "time", status = "event", group = "st3", data = orca2)
cif_stage %>%
group_by(group) %>%
do(head(., n = 3))
# A tibble: 9 × 8
# Groups: group [3]
group time Surv `CI.Oral ca. death` `CI.Other death` seSurv `seCI.Oral ca. death` seCI.O…¹
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 I+II 0.17 0.992 0 0.00787 0.00784 0 0.00784
2 I+II 0.419 0.984 0 0.0157 0.0110 0 0.0110
3 I+II 0.498 0.969 0.0157 0.0157 0.0155 0.0110 0.0110
4 III 0.167 0.986 0 0.0139 0.0138 0 0.0138
5 III 0.249 0.972 0 0.0278 0.0194 0 0.0194
6 III 0.413 0.958 0.0139 0.0278 0.0235 0.0138 0.0194
7 IV 0.085 0.971 0.0294 0 0.0205 0.0205 0
8 IV 0.162 0.956 0.0441 0 0.0249 0.0249 0
9 IV 0.167 0.941 0.0441 0.0147 0.0285 0.0249 0.0146
# … with abbreviated variable name ¹`seCI.Other death`
grid.arrange(
ggplot(cif_stage, aes(time)) +
geom_step(aes(y = `CI.Oral ca. death`, colour = group)) +
labs(x = "Time (years)", y = "Proportion", title = "Cancer death by stage") +
theme_classic(),
ggplot(cif_stage, aes(time)) +
geom_step(aes(y = `CI.Other death`, colour = group)) +
labs(x = "Time (years)", y = "Proportion", title = "Other deaths by stage") +
theme_classic(),
ncol = 2
)
We can see that the CIF for oral cancer death for stage IV are higher as compared to III, and even more to I+II. For other cause mortality, instead, the curves do not seem to vary according to tumoral stage.
When we want to model survival data in a competing risk setting,
there are two common strategies that address different questions:
- Cox model for event-specific hazards, when e.g. the interest is in the
biological effect of the prognostic factors on the fatality of the very
disease that often leads to the relevant outcome.
- Fine–Gray model for the hazard of the subdistribution when we want to
assess the impact of the factors on the overall cumulative incidence of
event \(c\).
m2haz1 <- coxph(Surv(time, event == "Oral ca. death") ~ sex + I((age-65)/10) + st3, data = orca2)
round(ci.exp(m2haz1), 4)
exp(Est.) 2.5% 97.5%
sexMale 1.0171 0.6644 1.5569
I((age - 65)/10) 1.4261 1.2038 1.6893
st3III 1.5140 0.9012 2.5434
st3IV 3.1813 1.9853 5.0978
m2haz2 <- coxph(Surv(time, event == "Other death") ~ sex + I((age-65)/10) + st3, data = orca2)
round(ci.exp(m2haz2), 4)
exp(Est.) 2.5% 97.5%
sexMale 1.8103 1.1528 2.8431
I((age - 65)/10) 1.4876 1.2491 1.7715
st3III 1.2300 0.7488 2.0206
st3IV 1.6407 0.9522 2.8270
The results of the cause-specific Cox models agree with the graphical presentation of the cause-specific CIFs, i.e. tumoral stage IV to be a significant risk factor only for oral cancer mortality. Increasing levels of age are associated with higher mortality rates for both causes (HR = 1.42 for oral cancer mortality, HR = 1.48 for mortality from other causes). Differences according to gender are observed only for other cause mortality (HR = 1.8).
The crr() function in the cmprsk package
can be used for regression modeling of subdistribution functions in case
of competing risks. We present the results for the Fine–Gray model for
the hazard of the subdistribution for both oral cancer deaths and other
cause deaths with the same covariates as above.
m2fg1 <- with(orca2, crr(time, event, cov1 = model.matrix(m2), failcode = "Oral ca. death"))
summary(m2fg1, Exp = T)
Competing Risks Regression
Call:
crr(ftime = time, fstatus = event, cov1 = model.matrix(m2), failcode = "Oral ca. death")
coef exp(coef) se(coef) z p-value
sexMale -0.0953 0.909 0.213 -0.447 6.5e-01
I((age - 65)/10) 0.2814 1.325 0.093 3.024 2.5e-03
st3III 0.3924 1.481 0.258 1.519 1.3e-01
st3IV 1.0208 2.775 0.233 4.374 1.2e-05
exp(coef) exp(-coef) 2.5% 97.5%
sexMale 0.909 1.100 0.599 1.38
I((age - 65)/10) 1.325 0.755 1.104 1.59
st3III 1.481 0.675 0.892 2.46
st3IV 2.775 0.360 1.757 4.39
Num. cases = 267
Pseudo Log-likelihood = -501
Pseudo likelihood ratio test = 31.4 on 4 df,
m2fg2 <- with(orca2, crr(time, event, cov1 = model.matrix(m2), failcode = "Other death"))
summary(m2fg2, Exp = T)
Competing Risks Regression
Call:
crr(ftime = time, fstatus = event, cov1 = model.matrix(m2), failcode = "Other death")
coef exp(coef) se(coef) z p-value
sexMale 0.544 1.723 0.2342 2.324 0.020
I((age - 65)/10) 0.197 1.218 0.0807 2.444 0.015
st3III 0.130 1.139 0.2502 0.521 0.600
st3IV -0.212 0.809 0.2839 -0.748 0.450
exp(coef) exp(-coef) 2.5% 97.5%
sexMale 1.723 0.580 1.089 2.73
I((age - 65)/10) 1.218 0.821 1.040 1.43
st3III 1.139 0.878 0.698 1.86
st3IV 0.809 1.237 0.464 1.41
Num. cases = 267
Pseudo Log-likelihood = -471
Pseudo likelihood ratio test = 9.43 on 4 df,
Läärä E, Korpi JT, Pitkänen H, Alho OP, Kantola S. Competing risks analysis of cause‐specific mortality in patients with oral squamous cell carcinoma. Head & neck. 2017 Jan 1;39(1):56-62.
Survival data analysis course by Michal Pešta http://www.karlin.mff.cuni.cz/~pesta/NMFM404/survival.html
Kleinbaum DG, Klein M. Survival analysis: a self-learning text. Springer Science & Business Media; 2006 Jan 2.
Jackson CH. flexsurv: a platform for parametric survival modelling in R. Journal of Statistical Software. 2016 May 1;70(8):1-33.
Therneau, T.M. and Grambsch, P.M., 2013. Modeling survival data: extending the Cox model. Springer Science & Business Media.
Carstensen B. Who needs the Cox model anyway. Life. 2004;3:46.
Xu S, Gargiullo P, Mullooly J, McClure D, Hambidge SJ, Glanz J. Fitting parametric and semi-parametric conditional Poisson regression models with Cox’s partial likelihood in self-controlled case series and matched cohort studies. J Data Sci. 2010;8:349-60.
Therneau T, Crowson C, Atkinson E. Using time dependent covariates and time dependent coefficients in the cox model. Survival Vignettes. 2016 Oct 29.
Frumento P, Bottai M. An estimating equation for censored and truncated quantile regression. Computational Statistics & Data Analysis. 2017 Sep 1;113:53-63.