library(dplyr)
library(survival)
library(ggsurvfit)
Tutorial from online YouTube video collection: https://www.youtube.com/watch?v=vX3l36ptrTU&list=PLqzoL9-eJTNDdnKvep_YHIwk2AMqHhuJ0&index=1
For an online version of this rmarkdown file, please check: https://rpubs.com/auraf285/SurvAnalysisR
For the RMarkdown file, see https://github.com/AuraFrizzati/Tutorials/tree/main/Cox%20regression
Survival analysis models time-to-event as an outcome (e.g. time to death, time spent on a waiting list for surgery, etc.).
The outcome \(y\) has two parts:
- Time = T (time to event or follow-up)
- Has the event (E) occurred? Yes/No
\[y = Surv(T,E)\]
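In R, the survival::Surv() function builds this two-part outcome. A minimal sketch with hypothetical follow-up times and event indicators (1 = event observed, 0 = censored):
library(survival)
t_obs <- c(5, 8, 12, 3)   ## follow-up times (hypothetical)
event <- c(1, 0, 1, 0)    ## 1 = event occurred, 0 = censored
Surv(t_obs, event)        ## censored observations are printed with a trailing "+"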
Censoring: whether the event has occurred or not (= the event was censored) within the follow-up period. The concept of censoring is what makes survival analysis unique. Individuals might also be lost to follow-up.
Censored observations are still included in a survival model
Right-censoring (the event has not occurred by the end of follow-up) vs left-censoring (we don’t know what happened to the subject before the study)
All survival models assume that censoring is non-informative: being censored or not is not related to the probability of the event occurring (this would be violated if, e.g., when measuring time-to-death, the patients lost to follow-up were those who had started to feel much better)
Survival function: probability of surviving beyond time \(t\) \[S(t) = P(T > t)\]
Hazard: the probability of the event occurring in the next small interval after \(t\), given that it has not occurred by time \(t\). \[H(t) = P(t < T \leq t + \delta \mid T > t)\]
Hazard ratio (HR): this is the hazard for the exposure group relative to the hazard for the non-exposure group
\[\frac{H_{x = 1}}{H_{x = 0}}\]
If HR = 2, for example, at any given time someone who is exposed has twice the risk of dying compared with someone who is not exposed.
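For example, with hypothetical hazards of 0.10 events per month in the exposed group and 0.05 in the non-exposed group:
\[HR = \frac{H_{x = 1}}{H_{x = 0}} = \frac{0.10}{0.05} = 2\]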
Load the dataset
censor = 1 -> death
censor = 0 -> censored
# install.packages("Bolstad2")
library(Bolstad2)
data(AidsSurvival.df)
head(AidsSurvival.df)
## id enter end time age drug censor
## 1 1 15may90 14oct90 5 46 0 1
## 2 2 19sep89 20mar90 6 35 1 0
## 3 3 21apr91 20dec91 8 30 1 1
## 4 4 03jan91 04apr91 3 30 1 1
## 5 5 18sep89 19jul91 22 36 0 1
## 6 6 18mar91 17apr91 1 32 1 0
Load the survival library (already installed in Base R)
library(survival)
Fit the KM (survival) model to the data
km.model <- survfit(
Surv(time,censor) ~ 1, # ~ 1 is used when we do not have any predictor for survival
type = "kaplan-meier", ## default type
data = AidsSurvival.df)
km.model
## Call: survfit(formula = Surv(time, censor) ~ 1, data = AidsSurvival.df,
## type = "kaplan-meier")
##
## n events median 0.95LCL 0.95UCL
## [1,] 100 80 7 5 10
The median survival time is 7 months
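To read the estimated survival probability at specific time points (6 and 12 months are chosen here purely for illustration), summary() accepts a times argument:
## estimated S(t), with confidence limits, at chosen time points
summary(km.model, times = c(6, 12))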
summary(km.model)
## Call: survfit(formula = Surv(time, censor) ~ 1, data = AidsSurvival.df,
## type = "kaplan-meier")
##
## time n.risk n.event survival std.err lower 95% CI upper 95% CI
## 1 100 15 0.8500 0.0357 0.7828 0.923
## 2 83 5 0.7988 0.0402 0.7237 0.882
## 3 73 10 0.6894 0.0473 0.6026 0.789
## 4 61 4 0.6442 0.0493 0.5544 0.748
## 5 56 7 0.5636 0.0517 0.4709 0.675
## 6 49 2 0.5406 0.0521 0.4476 0.653
## 7 46 6 0.4701 0.0526 0.3775 0.586
## 8 39 4 0.4219 0.0525 0.3306 0.538
## 9 35 3 0.3857 0.0520 0.2962 0.502
## 10 32 3 0.3496 0.0511 0.2625 0.466
## 11 28 3 0.3121 0.0500 0.2280 0.427
## 12 25 2 0.2872 0.0490 0.2055 0.401
## 13 21 1 0.2735 0.0486 0.1931 0.387
## 14 20 1 0.2598 0.0480 0.1809 0.373
## 15 19 2 0.2325 0.0467 0.1568 0.345
## 22 16 1 0.2179 0.0460 0.1441 0.330
## 30 14 1 0.2024 0.0453 0.1305 0.314
## 31 13 1 0.1868 0.0444 0.1173 0.298
## 32 12 1 0.1712 0.0433 0.1043 0.281
## 34 11 1 0.1557 0.0421 0.0916 0.264
## 35 10 1 0.1401 0.0407 0.0793 0.247
## 36 9 1 0.1245 0.0390 0.0674 0.230
## 43 8 1 0.1090 0.0371 0.0559 0.212
## 53 7 1 0.0934 0.0349 0.0449 0.194
## 54 6 1 0.0778 0.0324 0.0344 0.176
## 57 4 1 0.0584 0.0296 0.0216 0.157
## 58 3 1 0.0389 0.0253 0.0109 0.139
Model plot (the red horizontal line at S(t) = 0.5 shows where to read off the median survival time on the curve)
plot(km.model,
xlab ="Time (months)",
ylab = "% Alive = S(t)",
main = "KM-Model",
las = 1, ## rotate y-axis labels
mark.time = TRUE ## mark the times at which observations were censored
)
abline(h = 0.5, col= "red")
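The median (or any other quantile) of the K-M curve can also be extracted directly rather than read off the plot; a minimal sketch using the quantile() method for survfit objects:
## median survival time (0.5 quantile) with 95% confidence limits
quantile(km.model, probs = 0.5)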
Adding a predictor (drug = 1 or 0)
km.model2 <- survfit(
Surv(time,censor) ~ drug, # drug is now included as a predictor of survival
type = "kaplan-meier", ## default type
data = AidsSurvival.df)
km.model2
## Call: survfit(formula = Surv(time, censor) ~ drug, data = AidsSurvival.df,
## type = "kaplan-meier")
##
## n events median 0.95LCL 0.95UCL
## drug=0 51 42 11 8 30
## drug=1 49 38 5 3 7
summary(km.model2)
## Call: survfit(formula = Surv(time, censor) ~ drug, data = AidsSurvival.df,
## type = "kaplan-meier")
##
## drug=0
## time n.risk n.event survival std.err lower 95% CI upper 95% CI
## 1 51 5 0.9020 0.0416 0.8239 0.987
## 2 46 3 0.8431 0.0509 0.7490 0.949
## 3 41 2 0.8020 0.0561 0.6992 0.920
## 4 38 2 0.7598 0.0606 0.6498 0.888
## 5 35 3 0.6947 0.0660 0.5766 0.837
## 6 32 1 0.6730 0.0675 0.5529 0.819
## 7 31 1 0.6513 0.0687 0.5296 0.801
## 8 30 2 0.6078 0.0706 0.4840 0.763
## 9 28 2 0.5644 0.0720 0.4396 0.725
## 10 26 2 0.5210 0.0727 0.3964 0.685
## 11 23 2 0.4757 0.0731 0.3520 0.643
## 12 21 2 0.4304 0.0728 0.3090 0.600
## 13 19 1 0.4077 0.0724 0.2879 0.577
## 14 18 1 0.3851 0.0718 0.2672 0.555
## 15 17 1 0.3624 0.0711 0.2468 0.532
## 22 15 1 0.3383 0.0703 0.2250 0.508
## 30 13 1 0.3123 0.0696 0.2018 0.483
## 31 12 1 0.2862 0.0685 0.1791 0.457
## 32 11 1 0.2602 0.0670 0.1571 0.431
## 34 10 1 0.2342 0.0652 0.1357 0.404
## 35 9 1 0.2082 0.0629 0.1151 0.376
## 36 8 1 0.1821 0.0602 0.0953 0.348
## 43 7 1 0.1561 0.0569 0.0764 0.319
## 53 6 1 0.1301 0.0531 0.0585 0.289
## 54 5 1 0.1041 0.0484 0.0418 0.259
## 57 4 1 0.0781 0.0427 0.0267 0.228
## 58 3 1 0.0520 0.0355 0.0136 0.198
##
## drug=1
## time n.risk n.event survival std.err lower 95% CI upper 95% CI
## 1 49 10 0.7959 0.0576 0.6907 0.917
## 2 37 2 0.7529 0.0620 0.6407 0.885
## 3 32 8 0.5647 0.0740 0.4367 0.730
## 4 23 2 0.5156 0.0753 0.3872 0.686
## 5 21 4 0.4174 0.0753 0.2931 0.594
## 6 17 1 0.3928 0.0748 0.2705 0.570
## 7 15 5 0.2619 0.0691 0.1562 0.439
## 8 9 2 0.2037 0.0648 0.1092 0.380
## 9 7 1 0.1746 0.0618 0.0873 0.349
## 10 6 1 0.1455 0.0579 0.0667 0.317
## 11 5 1 0.1164 0.0531 0.0476 0.285
## 15 2 1 0.0582 0.0490 0.0112 0.303
plot(km.model2,
xlab ="Time (months)",
ylab = "% Alive = S(t)",
main = "KM-Model",
las = 1, ## rotate y-axis labels
lwd = 2,
col = c("black", "blue"),
mark.time = TRUE ## mark the times at which observations were censored
)
abline(h = 0.5, col= "red")
legend(18, 0.95, legend = c("DrugNO", "DrugYES"), lty = 1, lwd = 2, bty = "n", cex = 0.6, col = c("black", "blue"))
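The same two-group comparison can be drawn with ggsurvfit (used later in this tutorial); a sketch assuming AidsSurvival.df is still loaded:
ggsurvfit::survfit2(survival::Surv(time, censor) ~ drug, data = AidsSurvival.df) %>%
  ggsurvfit::ggsurvfit() +
  labs(
    x = "Time (months)",
    y = "% Alive = S(t)",
    title = "KM curves by drug"
  ) +
  add_confidence_interval()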
Is there a statistically significant difference between the two groups?
Let’s use the log-rank test (\(H_0\): the survival functions of the 2 groups are the same). It can also be used for more than 2 groups.
survdiff(
Surv(time,censor) ~ drug,
data = AidsSurvival.df
)
## Call:
## survdiff(formula = Surv(time, censor) ~ drug, data = AidsSurvival.df)
##
## N Observed Expected (O-E)^2/E (O-E)^2/V
## drug=0 51 42 54.9 3.02 11.9
## drug=1 49 38 25.1 6.60 11.9
##
## Chisq= 11.9 on 1 degrees of freedom, p= 6e-04
There is a statistically significant difference between the survival functions of the 2 groups (p = 6e-04).
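If the chi-square statistic and p-value are needed programmatically, they can be pulled out of the survdiff object; a minimal sketch:
lr <- survdiff(Surv(time, censor) ~ drug, data = AidsSurvival.df)
lr$chisq                                     ## log-rank chi-square statistic
1 - pchisq(lr$chisq, df = length(lr$n) - 1)  ## p-value (df = number of groups - 1)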
Exponential and Weibull models are parametric survival models.
The idea of the exponential distribution can be extended to the exponential survival model, and further to the Weibull model and to the (semi-parametric) Cox proportional hazards model.
Survival function: \(S(t) = P(T>t) = 1 - P(T \leq t)\), i.e. the probability of surviving beyond time \(t\). If the event time follows an exponential distribution with rate \(\lambda\) (a constant hazard), then:
\[S(t) = P(T>t) = 1 - P(T \leq t) = 1 - F(t) = 1 - (1 - e^{-\lambda t}) = e^{-\lambda t}\]
Since \(\lambda\) is the hazard, this can be written as:
\[S(t) = P(T>t) = e^{-\lambda t} = e^{-HAZ \times t}\]
Relevant equations so far:
To estimate the hazard rate:
exponential function: \(HAZ = e^{b_0 + b_1X_1 + ... + b_kX_k}\)
or, linear function: \(ln(HAZ) = b_0 + b_1X_1 + ... + b_kX_k\)
\[S(t) = P(T>t) = e^{-\lambda t} = e^{-HAZ \times t}\]
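A small numeric sketch of these equations, assuming a hypothetical constant hazard of 0.1 events per month:
haz <- 0.1                    ## hypothetical constant hazard (events per month)
time_pts <- c(1, 5, 10, 20)   ## time points (months)
exp(-haz * time_pts)          ## S(t) = P(T > t) = e^(-HAZ x t)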
The intercept of the model (\(b_0\)) is what differentiates the three models; it corresponds to the reference (baseline) log-hazard:
- Exponential model: \(b_0 = \ln(\lambda)\), a constant (the hazard does not change over time)
- Weibull model: the intercept is a linear function of log-time, \(b_0(t) = \ln(\alpha)\ln(t) + c\)
- Cox model: \(b_0(t) = \ln(h_0(t))\)
\(h_0(t)\) is the baseline hazard function; it is the hazard for the baseline group, and it is allowed to fluctuate freely over time.
In the Cox model, you can estimate all the predictors’ coefficients (\(b_1, b_2, ..., b_k\)) without having to estimate \(h_0(t)\). However, this means we do not get an estimate of the intercept, so we can estimate neither the hazard rate nor the survival function. Therefore we cannot use this model for predictive purposes (e.g. estimating the probability of surviving beyond a specific time point).
If our goal is only to estimate the hazard ratio (HR) between groups (an effect-size model), we can use the Cox proportional hazards model.
To summarise:
- Kaplan-Meier model can be used for prediction of survival probability but not for estimating hazard ratios
- Cox proportional hazard model can be used for estimating hazard ratios but not for prediction of survival probability
- The exponential and Weibull models can be used both for estimating hazard ratios and for predicting survival probability (although their assumptions are not always realistic); see the survreg() sketch below
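As a sketch of the parametric option, an exponential or Weibull model can be fitted with survival::survreg() on the AIDS data used earlier (note that survreg() reports coefficients on the accelerated-failure-time scale rather than directly as log-hazards):
## Weibull is the default distribution; use dist = "exponential" for the exponential model
weib.model <- survreg(Surv(time, censor) ~ drug, dist = "weibull", data = AidsSurvival.df)
summary(weib.model)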
\[ln(HAZ) = b_0 + b_1X_1 + ... + b_kX_k = ln(h_0(t)) + b_1X_1 + ... + b_kX_k\]
Equivalently, on the hazard scale:
\[HAZ = h_0(t) \times e^{b_1X_1 + ... + b_kX_k}\]
- The model’s coefficients (\(b_1, b_2, ..., b_k\)) can be estimated without having to specify the baseline hazard function \(h_0(t)\) (similar to estimating the slope of a regression line without estimating the intercept; the assumption, however, is that the lines for the two groups remain at the same distance over time)
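To see why the baseline hazard does not need to be estimated, consider the hazard ratio for two individuals who differ by one unit in \(X_1\) and are identical on the other covariates: \(h_0(t)\) cancels out, so the HR is constant over time:
\[HR = \frac{h_0(t)\, e^{b_1(x_1+1) + b_2x_2 + ... + b_kx_k}}{h_0(t)\, e^{b_1x_1 + b_2x_2 + ... + b_kx_k}} = e^{b_1}\]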
If assumption (3) is violated (i.e. HR is not constant over time), we can:
If assumption (4) is violated (i.e. the relationship between \(ln(HAZ)\) and the explanatory variable is not linear), the solution could be to categorise the variable, include polynomial terms, or transform it (these options are revisited below)
If assumption (5) is violated (e.g. \(X\) is the drug’s dose and it is changing over time), we can use a time-dependent covariates’ model
We cannot do anything about a violation of assumption (1), although we can check for it by comparing the censored and non-censored observations to see whether they differ in any systematic way (a study-design issue)
Assumption (2) depends on the study design
# Use the Stanford heart transplant data
data(stanford2, package="survival")
stanford2<-
stanford2 %>%
mutate(
Over40 = if_else(age > 40, "YES", "NO"),
Over40 = as.factor(Over40),
MisMatchLevel = case_when(
t5 < 0.7 ~ 0,
t5 >= 0.7 & t5 < 1.5 ~ 1,
TRUE ~ 2
),
MisMatchLevel = as.factor(MisMatchLevel)
)
head(stanford2)
## id time status age t5 Over40 MisMatchLevel
## 139 139 86 1 12 1.26 NO 1
## 159 159 10 1 13 1.49 NO 1
## 181 181 60 0 13 NA NO 2
## 119 119 1116 0 14 0.54 NO 0
## 74 74 2006 0 15 1.26 NO 1
## 120 120 1107 0 18 0.25 NO 0
summary(stanford2)
## id time status age
## Min. : 1.00 Min. : 0.50 Min. :0.0000 Min. :12.00
## 1st Qu.: 46.75 1st Qu.: 64.75 1st Qu.:0.0000 1st Qu.:35.00
## Median : 92.50 Median : 351.00 Median :1.0000 Median :44.00
## Mean : 92.50 Mean : 696.94 Mean :0.6141 Mean :41.09
## 3rd Qu.:138.25 3rd Qu.:1160.75 3rd Qu.:1.0000 3rd Qu.:49.00
## Max. :184.00 Max. :3695.00 Max. :1.0000 Max. :64.00
##
## t5 Over40 MisMatchLevel
## Min. :0.000 NO : 71 0:40
## 1st Qu.:0.690 YES:113 1:79
## Median :1.040 2:65
## Mean :1.117
## 3rd Qu.:1.460
## Max. :3.050
## NA's :27
First, take a quick look at the data using the K-M model and survival curves
km.model <- survfit(
Surv(time, status) ~ MisMatchLevel, # MisMatchLevel is used as the predictor of survival
type = "kaplan-meier", ## default type
data = stanford2)
km.model
## Call: survfit(formula = Surv(time, status) ~ MisMatchLevel, data = stanford2,
## type = "kaplan-meier")
##
## n events median 0.95LCL 0.95UCL
## MisMatchLevel=0 40 26 364 66 NA
## MisMatchLevel=1 79 49 994 381 1478
## MisMatchLevel=2 65 38 544 169 1534
## base R plot
# plot(km.model,
# xlab ="Time (months)",
# ylab = "% Alive = S(t)",
# main = "KM-Model",
# las = 1, ## rotate y-axis labels
# mark.time = TRUE ## this allows to visualise at what time were the observations censored
# )
# abline(h = 0.5, col= "red")
ggsurvfit::survfit2(survival::Surv(time, status) ~ MisMatchLevel, data = stanford2) %>%
ggsurvfit::ggsurvfit() +
labs(
x = "Days",
y = "Overall survival probability"
) +
add_confidence_interval() +
add_risktable() +
labs(title = "Survival curves by 'MisMatchLevel' variable")
ggsurvfit::survfit2(survival::Surv(time, status) ~ Over40, data = stanford2) %>%
ggsurvfit::ggsurvfit() +
labs(
x = "Days",
y = "Overall survival probability"
) +
add_confidence_interval() +
add_risktable() +
labs(title = "Survival curves by 'Over40' variable")
cox.mod <- coxph(Surv(time, status) ~ Over40 + MisMatchLevel, data = stanford2)
summary(cox.mod)
## Call:
## coxph(formula = Surv(time, status) ~ Over40 + MisMatchLevel,
## data = stanford2)
##
## n= 184, number of events= 113
##
## coef exp(coef) se(coef) z Pr(>|z|)
## Over40YES 0.49350 1.63804 0.20671 2.387 0.017 *
## MisMatchLevel1 -0.24169 0.78530 0.24336 -0.993 0.321
## MisMatchLevel2 0.08197 1.08542 0.25626 0.320 0.749
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## Over40YES 1.6380 0.6105 1.0924 2.456
## MisMatchLevel1 0.7853 1.2734 0.4874 1.265
## MisMatchLevel2 1.0854 0.9213 0.6569 1.794
##
## Concordance= 0.57 (se = 0.029 )
## Likelihood ratio test= 8.2 on 3 df, p=0.04
## Wald test = 7.86 on 3 df, p=0.05
## Score (logrank) test = 7.98 on 3 df, p=0.05
The model summary returns the HRs (exp(coef)) for the variable groups:
- For the Over40 variable, \(HR = 1.64\) means that, at a given instant in time, someone over 40 is \(1.64\) times as likely to die as someone \(\leq 40\) (adjusting for the level of the MisMatchLevel variable). Taking two individuals with the same MisMatchLevel, the one over 40 will be \(64\%\) more likely to die than the one \(\leq 40\).
- The model summary also returns exp(-coef), which is the HR obtained by flipping the groups (i.e. changing the reference group, the denominator of the HR): e.g. someone who is \(\leq 40\) is \(0.61\) times as likely to die as someone who is over 40.
At the bottom there are the tests (likelihood ratio test, Wald test and score (log-rank) test) of the null hypothesis \(H_0: \beta_1 = \beta_2 = ... = \beta_k = 0\) (with \(H_A\): at least one coefficient is not 0). They test the overall significance of the model.
The Concordance (or C-statistic): this is the goodness-of-fit statistic for survival analysis (in logistic regression, this is equivalent to the area under the curve, AUC): it is calculated by looking at pairs of observations and checking the concordance of the model’s prediction of length of survival versus the actual survival retrieved from the data. The concordance is the proportion of pairs of observations that are concordant (model predictions and actual data). If the model is just making random guesses, we would expect a concordance = 0.5.
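The concordance can also be computed directly from the fitted model; a minimal sketch using survival::concordance():
## C-statistic (with standard error) for the Cox model
concordance(cox.mod)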
The likelihood ratio test (LRT) can be used to compare nested models, e.g. to check whether adding/removing variables improves the model.
In this example, we want to understand whether we can drop the MisMatchLevel variable from the model.
cox.mod <- coxph(Surv(time, status) ~ Over40 + MisMatchLevel, data = stanford2)
cox.mod2 <- coxph(Surv(time, status) ~ Over40 , data = stanford2)
Carry out the LRT to compare the two models:
anova(cox.mod2, cox.mod, test = "LRT")
## Analysis of Deviance Table
## Cox model: response is Surv(time, status)
## Model 1: ~ Over40
## Model 2: ~ Over40 + MisMatchLevel
## loglik Chisq Df Pr(>|Chi|)
## 1 -510.21
## 2 -509.01 2.4035 2 0.3007
The LRT is not statistically significant (p = 0.30), so we can drop MisMatchLevel without loss of predictive power.
Now fit a Cox model with the numeric predictors age and t5:
cox.num <- coxph(Surv(time, status) ~ age + t5, data = stanford2)
summary(cox.num)
## Call:
## coxph(formula = Surv(time, status) ~ age + t5, data = stanford2)
##
## n= 157, number of events= 102
## (27 observations deleted due to missingness)
##
## coef exp(coef) se(coef) z Pr(>|z|)
## age 0.02961 1.03006 0.01136 2.608 0.00911 **
## t5 0.17041 1.18579 0.18326 0.930 0.35243
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## age 1.030 0.9708 1.007 1.053
## t5 1.186 0.8433 0.828 1.698
##
## Concordance= 0.59 (se = 0.034 )
## Likelihood ratio test= 8.47 on 2 df, p=0.01
## Wald test = 7.81 on 2 df, p=0.02
## Score (logrank) test = 7.87 on 2 df, p=0.02
Age variable: at a given instant in time, the risk of dying for someone who is 1 year older is \(3\%\) higher (\(HR = 1.03\)) than for someone 1 year younger, adjusting for the t5 score.
We will check two assumptions:
- How to check linearity
- How to check the proportional hazards’ assumption
plot(
x = predict(cox.num),
y = residuals(cox.num, type = "martingale"),
xlab = "fitted values",
ylab = "Martingale residuals",
main = "Residual plot (Martingale)",
las = 1 # this rotates values on y-axis
)
## add a line at y = residual = 0
abline(h=0)
## fit a smoother through the points
lines(
smooth.spline(
x = predict(cox.num),
y = residuals(cox.num, type = "martingale")),
col = "red"
)
The fitted values are the values predicted by the model.
We can observe some non-linearity in the plot (the red smoother deviates from the horizontal zero line).
We can also check the same kind of plot but using the deviance residuals
plot(
x = predict(cox.num),
y = residuals(cox.num, type = "deviance"),
xlab = "fitted values",
ylab = "Deviance residuals",
main = "Residual plot (Deviance)",
las = 1 # this rotates values on y-axis
)
## add a line at y = residual = 0
abline(h=0)
## fit a smoother through the points
lines(
smooth.spline(
x = predict(cox.num),
y = residuals(cox.num, type = "deviance")),
col = "red"
)
Again, some non-linearity is also evident in this plot.
To address non-linearity, we can identify the non-linear variable and categorise it, include polynomial terms, or transform the variable.
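As a sketch of the "polynomial terms" option (an illustration, not a definitive re-specification of the model), a squared age term could be added and compared with the original model via the LRT:
cox.num2 <- coxph(Surv(time, status) ~ age + I(age^2) + t5, data = stanford2)
anova(cox.num, cox.num2, test = "LRT")  ## does the quadratic term improve the fit?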
There are various ways to check the proportional hazards (PH) assumption.
plot(
survfit(Surv(time, status) ~ Over40, data = stanford2) ,
col=c("black", "red"),
xlab = "Time (in days) ",
ylab = "Survival",
main = "Survival curves by Over40"
)
legend(3000, 0.9, legend = c("<= 40", "> 40"), lty = 1, lwd = 2, bty = "n", cex = 0.8, col = c("black", "red"))
plot(
survfit(Surv(time, status) ~ Over40, data = stanford2) ,
col=c("black", "red"),
fun="cloglog",
xlab = "Time (in days) using log",
ylab = "log-log survival",
main = "log-log curves by Over40"
)
legend(1000, -3, legend = c("<= 40", "> 40"), lty = 1, lwd = 2, bty = "n", cex = 0.8, col = c("black", "red"))
cox.zph(cox.num)
## chisq df p
## age 0.83 1 0.36
## t5 2.06 1 0.15
## GLOBAL 2.77 2 0.25
The test returns a result for each individual explanatory variable as well as a GLOBAL test for the overall model. cox.zph() tests the proportional hazards assumption using the scaled Schoenfeld residuals: the null hypothesis is that the assumption holds, so small p-values indicate a violation (here none of the p-values is below 0.05).
We can also look at the plots of the Schoenfeld residuals: if we allowed the variables’ coefficients (\(\beta\)) to change over time (equivalent to allowing the \(HR\)s to change over time), what changes would we see?
# we have two variables in the model, so we have 2 plots returned
plot(cox.zph(cox.num)[1], main = "Age")
abline(h=0, col = 2)
# we have two variables in the model, so we have 2 plots returned
plot(cox.zph(cox.num)[2], main = "t5")
abline(h=0, col = 2)
The solid black line in the middle shows how much the coefficient (\(\beta\)) for the variable would change if we allowed it to vary over time. A change of \(0\) means no change (the red line on the plots).
The dashed lines represent the 95% confidence interval for the change in \(\beta\).
The main thing to check is whether the red line (\(0\) change) lies within the 95% confidence interval most of the time (in this example it looks OK for t5 but not for Age).
Scaled Schoenfeld residuals chart: Schoenfeld residuals are used to test the assumption of proportional hazards. Schoenfeld residuals “can essentially be thought of as the observed minus the expected values of the covariates at each failure time” (Box-Steffensmeier & Jones, 2004: 121). There is a Schoenfeld residual for each subject for each covariate. The plot of Schoenfeld residuals against time for any covariate should not show a pattern of changing residuals for that covariate; if there is a pattern, that covariate is time-dependent. As a rule of thumb, a non-zero slope is an indication of a violation of the proportional hazards assumption. The dotted lines outline the 95% confidence interval.
If the PH assumption is violated, there are different possible solutions: