Survival analysis corresponds to a set of statistical approaches used to investigate the time it takes for an event of interest to occur.
Survival analysis is used in a variety of field such as:
In cancer studies, typical research questions are like:
The aim of this chapter is to describe the basic concepts of survival analysis. In cancer studies, most of survival analyses use the following methods:
Here, we’ll start by explaining the essential concepts of survival analysis, including:
Then, we’ll continue by describing multivariate analysis using Cox proportional hazards model.
Here, we start by defining fundamental terms of survival analysis including:
There are different types of events, including:
The time from ‘response to treatment’ (complete remission) to the occurrence of the event of interest is commonly called survival time (or time to event).
The two most important measures in cancer studies include:
As mentioned above, survival analysis focuses on the expected duration of time until occurrence of an event of interest (relapse or death). However, the event may not be observed for some individuals within the study time period, producing the so-called censored observations.
Censoring may arise in the following ways:
This type of censoring, named right censoring, is handled in survival analysis.
Two related probabilities are used to describe survival data: the survival probability and the hazard probability.
The survival probability, also known as the survivor function $ S(t)$, is the probability that an individual survives from the time origin (e.g. diagnosis of cancer) to a specified future time t.
The hazard, denoted by $ h(t) $, is the probability that an individual who is under observation at a time t has an event at that time.
Note that, in contrast to the survivor function, which focuses on not having an event, the hazard function focuses on the event occurring.
The Kaplan-Meier (KM) method is a non-parametric method used to estimate the survival probability from observed survival times (Kaplan and Meier, 1958).
The survival probability at time \(t_i\), \(S(t_i)\), is calculated as follow: \[ S(t_i)=S(t_{i-1})(1-\frac{d_i}{n_i}) \]
where,
The estimated probability \(S(t)\) is a step function that changes value only at the time of each event. It’s also possible to compute confidence intervals for the survival probability.
The KM survival curve, a plot of the KM survival probability against time, provides a useful summary of the data that can be used to estimate measures such as median survival time.
Install the required packages
We’ll use two R packages:
Install and load the packages
Example Data Sets
We’ll use the lung cancer data available in the survival package.
Compute survival curves: survfit()
We want to compute the survival probability by sex.
The function survfit() [in survival package] can be used to compute kaplan-Meier survival estimate. Its main arguments include:
a survival object created using the function Surv() and the data set containing the variables. To compute survival curves, type this:
## Call: survfit(formula = Surv(time, status) ~ sex, data = lung)
##
## n events median 0.95LCL 0.95UCL
## sex=1 138 112 270 212 310
## sex=2 90 53 426 348 550
By default, the function print() shows a short summary of the survival curves. It prints the number of observations, number of events, the median survival and the confidence limits for the median.
If you want to display a more complete summary of the survival curves, type this:
## Call: survfit(formula = Surv(time, status) ~ sex, data = lung)
##
## sex=1
## time n.risk n.event survival std.err lower 95% CI upper 95% CI
## 11 138 3 0.9783 0.0124 0.9542 1.000
## 12 135 1 0.9710 0.0143 0.9434 0.999
## 13 134 2 0.9565 0.0174 0.9231 0.991
## 15 132 1 0.9493 0.0187 0.9134 0.987
## 26 131 1 0.9420 0.0199 0.9038 0.982
## 30 130 1 0.9348 0.0210 0.8945 0.977
## 31 129 1 0.9275 0.0221 0.8853 0.972
## 53 128 2 0.9130 0.0240 0.8672 0.961
## 54 126 1 0.9058 0.0249 0.8583 0.956
## 59 125 1 0.8986 0.0257 0.8496 0.950
## 60 124 1 0.8913 0.0265 0.8409 0.945
## 65 123 2 0.8768 0.0280 0.8237 0.933
## 71 121 1 0.8696 0.0287 0.8152 0.928
## 81 120 1 0.8623 0.0293 0.8067 0.922
## 88 119 2 0.8478 0.0306 0.7900 0.910
## 92 117 1 0.8406 0.0312 0.7817 0.904
## 93 116 1 0.8333 0.0317 0.7734 0.898
## 95 115 1 0.8261 0.0323 0.7652 0.892
## 105 114 1 0.8188 0.0328 0.7570 0.886
## 107 113 1 0.8116 0.0333 0.7489 0.880
## 110 112 1 0.8043 0.0338 0.7408 0.873
## 116 111 1 0.7971 0.0342 0.7328 0.867
## 118 110 1 0.7899 0.0347 0.7247 0.861
## 131 109 1 0.7826 0.0351 0.7167 0.855
## 132 108 2 0.7681 0.0359 0.7008 0.842
## 135 106 1 0.7609 0.0363 0.6929 0.835
## 142 105 1 0.7536 0.0367 0.6851 0.829
## 144 104 1 0.7464 0.0370 0.6772 0.823
## 147 103 1 0.7391 0.0374 0.6694 0.816
## 156 102 2 0.7246 0.0380 0.6538 0.803
## 163 100 3 0.7029 0.0389 0.6306 0.783
## 166 97 1 0.6957 0.0392 0.6230 0.777
## 170 96 1 0.6884 0.0394 0.6153 0.770
## 175 94 1 0.6811 0.0397 0.6076 0.763
## 176 93 1 0.6738 0.0399 0.5999 0.757
## 177 92 1 0.6664 0.0402 0.5922 0.750
## 179 91 2 0.6518 0.0406 0.5769 0.736
## 180 89 1 0.6445 0.0408 0.5693 0.730
## 181 88 2 0.6298 0.0412 0.5541 0.716
## 183 86 1 0.6225 0.0413 0.5466 0.709
## 189 83 1 0.6150 0.0415 0.5388 0.702
## 197 80 1 0.6073 0.0417 0.5309 0.695
## 202 78 1 0.5995 0.0419 0.5228 0.687
## 207 77 1 0.5917 0.0420 0.5148 0.680
## 210 76 1 0.5839 0.0422 0.5068 0.673
## 212 75 1 0.5762 0.0424 0.4988 0.665
## 218 74 1 0.5684 0.0425 0.4909 0.658
## 222 72 1 0.5605 0.0426 0.4829 0.651
## 223 70 1 0.5525 0.0428 0.4747 0.643
## 229 67 1 0.5442 0.0429 0.4663 0.635
## 230 66 1 0.5360 0.0431 0.4579 0.627
## 239 64 1 0.5276 0.0432 0.4494 0.619
## 246 63 1 0.5192 0.0433 0.4409 0.611
## 267 61 1 0.5107 0.0434 0.4323 0.603
## 269 60 1 0.5022 0.0435 0.4238 0.595
## 270 59 1 0.4937 0.0436 0.4152 0.587
## 283 57 1 0.4850 0.0437 0.4065 0.579
## 284 56 1 0.4764 0.0438 0.3979 0.570
## 285 54 1 0.4676 0.0438 0.3891 0.562
## 286 53 1 0.4587 0.0439 0.3803 0.553
## 288 52 1 0.4499 0.0439 0.3716 0.545
## 291 51 1 0.4411 0.0439 0.3629 0.536
## 301 48 1 0.4319 0.0440 0.3538 0.527
## 303 46 1 0.4225 0.0440 0.3445 0.518
## 306 44 1 0.4129 0.0440 0.3350 0.509
## 310 43 1 0.4033 0.0441 0.3256 0.500
## 320 42 1 0.3937 0.0440 0.3162 0.490
## 329 41 1 0.3841 0.0440 0.3069 0.481
## 337 40 1 0.3745 0.0439 0.2976 0.471
## 353 39 2 0.3553 0.0437 0.2791 0.452
## 363 37 1 0.3457 0.0436 0.2700 0.443
## 364 36 1 0.3361 0.0434 0.2609 0.433
## 371 35 1 0.3265 0.0432 0.2519 0.423
## 387 34 1 0.3169 0.0430 0.2429 0.413
## 390 33 1 0.3073 0.0428 0.2339 0.404
## 394 32 1 0.2977 0.0425 0.2250 0.394
## 428 29 1 0.2874 0.0423 0.2155 0.383
## 429 28 1 0.2771 0.0420 0.2060 0.373
## 442 27 1 0.2669 0.0417 0.1965 0.362
## 455 25 1 0.2562 0.0413 0.1868 0.351
## 457 24 1 0.2455 0.0410 0.1770 0.341
## 460 22 1 0.2344 0.0406 0.1669 0.329
## 477 21 1 0.2232 0.0402 0.1569 0.318
## 519 20 1 0.2121 0.0397 0.1469 0.306
## 524 19 1 0.2009 0.0391 0.1371 0.294
## 533 18 1 0.1897 0.0385 0.1275 0.282
## 558 17 1 0.1786 0.0378 0.1179 0.270
## 567 16 1 0.1674 0.0371 0.1085 0.258
## 574 15 1 0.1562 0.0362 0.0992 0.246
## 583 14 1 0.1451 0.0353 0.0900 0.234
## 613 13 1 0.1339 0.0343 0.0810 0.221
## 624 12 1 0.1228 0.0332 0.0722 0.209
## 643 11 1 0.1116 0.0320 0.0636 0.196
## 655 10 1 0.1004 0.0307 0.0552 0.183
## 689 9 1 0.0893 0.0293 0.0470 0.170
## 707 8 1 0.0781 0.0276 0.0390 0.156
## 791 7 1 0.0670 0.0259 0.0314 0.143
## 814 5 1 0.0536 0.0239 0.0223 0.128
## 883 3 1 0.0357 0.0216 0.0109 0.117
##
## sex=2
## time n.risk n.event survival std.err lower 95% CI upper 95% CI
## 5 90 1 0.9889 0.0110 0.9675 1.000
## 60 89 1 0.9778 0.0155 0.9478 1.000
## 61 88 1 0.9667 0.0189 0.9303 1.000
## 62 87 1 0.9556 0.0217 0.9139 0.999
## 79 86 1 0.9444 0.0241 0.8983 0.993
## 81 85 1 0.9333 0.0263 0.8832 0.986
## 95 83 1 0.9221 0.0283 0.8683 0.979
## 107 81 1 0.9107 0.0301 0.8535 0.972
## 122 80 1 0.8993 0.0318 0.8390 0.964
## 145 79 2 0.8766 0.0349 0.8108 0.948
## 153 77 1 0.8652 0.0362 0.7970 0.939
## 166 76 1 0.8538 0.0375 0.7834 0.931
## 167 75 1 0.8424 0.0387 0.7699 0.922
## 182 71 1 0.8305 0.0399 0.7559 0.913
## 186 70 1 0.8187 0.0411 0.7420 0.903
## 194 68 1 0.8066 0.0422 0.7280 0.894
## 199 67 1 0.7946 0.0432 0.7142 0.884
## 201 66 2 0.7705 0.0452 0.6869 0.864
## 208 62 1 0.7581 0.0461 0.6729 0.854
## 226 59 1 0.7452 0.0471 0.6584 0.843
## 239 57 1 0.7322 0.0480 0.6438 0.833
## 245 54 1 0.7186 0.0490 0.6287 0.821
## 268 51 1 0.7045 0.0501 0.6129 0.810
## 285 47 1 0.6895 0.0512 0.5962 0.798
## 293 45 1 0.6742 0.0523 0.5791 0.785
## 305 43 1 0.6585 0.0534 0.5618 0.772
## 310 42 1 0.6428 0.0544 0.5447 0.759
## 340 39 1 0.6264 0.0554 0.5267 0.745
## 345 38 1 0.6099 0.0563 0.5089 0.731
## 348 37 1 0.5934 0.0572 0.4913 0.717
## 350 36 1 0.5769 0.0579 0.4739 0.702
## 351 35 1 0.5604 0.0586 0.4566 0.688
## 361 33 1 0.5434 0.0592 0.4390 0.673
## 363 32 1 0.5265 0.0597 0.4215 0.658
## 371 30 1 0.5089 0.0603 0.4035 0.642
## 426 26 1 0.4893 0.0610 0.3832 0.625
## 433 25 1 0.4698 0.0617 0.3632 0.608
## 444 24 1 0.4502 0.0621 0.3435 0.590
## 450 23 1 0.4306 0.0624 0.3241 0.572
## 473 22 1 0.4110 0.0626 0.3050 0.554
## 520 19 1 0.3894 0.0629 0.2837 0.534
## 524 18 1 0.3678 0.0630 0.2628 0.515
## 550 15 1 0.3433 0.0634 0.2390 0.493
## 641 11 1 0.3121 0.0649 0.2076 0.469
## 654 10 1 0.2808 0.0655 0.1778 0.443
## 687 9 1 0.2496 0.0652 0.1496 0.417
## 705 8 1 0.2184 0.0641 0.1229 0.388
## 728 7 1 0.1872 0.0621 0.0978 0.359
## 731 6 1 0.1560 0.0590 0.0743 0.328
## 735 5 1 0.1248 0.0549 0.0527 0.295
## 765 3 1 0.0832 0.0499 0.0257 0.270
## records n.max n.start events *rmean *se(rmean) median 0.95LCL 0.95UCL
## sex=1 138 138 138 112 325.0663 22.59845 270 212 310
## sex=2 90 90 90 53 458.2757 33.78530 426 348 550
Access to the value returned by survfit()
The function survfit() returns a list of variables, including the following components:
The components can be accessed as follow:
d <- data.frame(time = fit$time,
n.risk = fit$n.risk,
n.event = fit$n.event,
n.censor = fit$n.censor,
surv = fit$surv,
upper = fit$upper,
lower = fit$lower
)
head(d)
Visualize survival curves
We’ll use the function ggsurvplot() [in Survminer R package] to produce the survival curves for the two groups of subjects.
It’s also possible to show:
# Change color, linetype by strata, risk.table color by strata
ggsurvplot(fit,
pval = TRUE, conf.int = TRUE,
risk.table = TRUE, # Add risk table
risk.table.col = "strata", # Change risk table color by groups
linetype = "strata", # Change line type by groups
surv.median.line = "hv", # Specify median survival
ggtheme = theme_bw(), # Change ggplot2 theme
palette = c("#E7B800", "#2E9FDF"))
The plot can be further customized using the following arguments:
ggsurvplot(fit, pval=TRUE, conf.int = TRUE, xlab="Time in days", break.time.by = 200, ggtheme = theme_light(),risk.table = "abs_pct",risk.table.y.text.col = T, risk.table.y.text = FALSE, ncensor.plot = TRUE, surv.median.line = "hv",legend.labs =
c("Male", "Female"),palette =
c("#E7B800", "#2E9FDF"))
The Kaplan-Meier plot can be interpreted as follow:
The horizontal axis (x-axis) represents time in days, and the vertical axis (y-axis) shows the probability of surviving or the proportion of people surviving. The lines represent survival curves of the two groups. A vertical drop in the curves indicates an event. The vertical tick mark on the curves means that a patient was censored at this time.
The median survival times for each group can be obtained using the code below:
## records n.max n.start events *rmean *se(rmean) median 0.95LCL 0.95UCL
## sex=1 138 138 138 112 325.0663 22.59845 270 212 310
## sex=2 90 90 90 53 458.2757 33.78530 426 348 550
The median survival times for each group represent the time at which the survival probability, \(S(t)\), is 0.5.
The median survival time for sex=1 (Male group) is 270 days, as opposed to 426 days for sex=2 (Female). There appears to be a survival advantage for female with lung cancer compare to male. However, to evaluate whether this difference is statistically significant requires a formal statistical test, a subject that is discussed in the next sections.
Note that, the confidence limits are wide at the tail of the curves, making meaningful interpretations difficult. This can be explained by the fact that, in practice, there are usually patients who are lost to follow-up or alive at the end of follow-up. Thus, it may be sensible to shorten plots before the end of follow-up on the x-axis (Pocock et al, 2002).
The survival curves can be shorten using the argument xlim as follow:
ggsurvplot(fit,
conf.int = TRUE,
risk.table.col = "strata", # Change risk table color by groups
ggtheme = theme_bw(), # Change ggplot2 theme
palette = c("#E7B800", "#2E9FDF"),
xlim = c(0, 600))
Note that, three often used transformations can be specified using the argument fun:
For example, to plot cumulative events, type this:
ggsurvplot(fit,
conf.int = TRUE,
risk.table.col = "strata", # Change risk table color by groups
ggtheme = theme_bw(), # Change ggplot2 theme
palette = c("#E7B800", "#2E9FDF"),
fun = "event")
The cummulative hazard is commonly used to estimate the hazard probability. It’s defined as \[H(t)=−log(survivalfunction)=−log(S(t))\]. The cumulative hazard (H(t)) can be interpreted as the cumulative force of mortality. In other words, it corresponds to the number of events that would be expected for each individual by time t if the event were a repeatable process.
To plot cumulative hazard, type this:
ggsurvplot(fit,
conf.int = TRUE,
risk.table.col = "strata", # Change risk table color by groups
ggtheme = theme_bw(), # Change ggplot2 theme
palette = c("#E7B800", "#2E9FDF"),
fun = "cumhaz")
Kaplan-Meier life table: summary of survival curves
As mentioned above, you can use the function summary() to have a complete summary of survival curves:
## Call: survfit(formula = Surv(time, status) ~ sex, data = lung)
##
## sex=1
## time n.risk n.event survival std.err lower 95% CI upper 95% CI
## 11 138 3 0.9783 0.0124 0.9542 1.000
## 12 135 1 0.9710 0.0143 0.9434 0.999
## 13 134 2 0.9565 0.0174 0.9231 0.991
## 15 132 1 0.9493 0.0187 0.9134 0.987
## 26 131 1 0.9420 0.0199 0.9038 0.982
## 30 130 1 0.9348 0.0210 0.8945 0.977
## 31 129 1 0.9275 0.0221 0.8853 0.972
## 53 128 2 0.9130 0.0240 0.8672 0.961
## 54 126 1 0.9058 0.0249 0.8583 0.956
## 59 125 1 0.8986 0.0257 0.8496 0.950
## 60 124 1 0.8913 0.0265 0.8409 0.945
## 65 123 2 0.8768 0.0280 0.8237 0.933
## 71 121 1 0.8696 0.0287 0.8152 0.928
## 81 120 1 0.8623 0.0293 0.8067 0.922
## 88 119 2 0.8478 0.0306 0.7900 0.910
## 92 117 1 0.8406 0.0312 0.7817 0.904
## 93 116 1 0.8333 0.0317 0.7734 0.898
## 95 115 1 0.8261 0.0323 0.7652 0.892
## 105 114 1 0.8188 0.0328 0.7570 0.886
## 107 113 1 0.8116 0.0333 0.7489 0.880
## 110 112 1 0.8043 0.0338 0.7408 0.873
## 116 111 1 0.7971 0.0342 0.7328 0.867
## 118 110 1 0.7899 0.0347 0.7247 0.861
## 131 109 1 0.7826 0.0351 0.7167 0.855
## 132 108 2 0.7681 0.0359 0.7008 0.842
## 135 106 1 0.7609 0.0363 0.6929 0.835
## 142 105 1 0.7536 0.0367 0.6851 0.829
## 144 104 1 0.7464 0.0370 0.6772 0.823
## 147 103 1 0.7391 0.0374 0.6694 0.816
## 156 102 2 0.7246 0.0380 0.6538 0.803
## 163 100 3 0.7029 0.0389 0.6306 0.783
## 166 97 1 0.6957 0.0392 0.6230 0.777
## 170 96 1 0.6884 0.0394 0.6153 0.770
## 175 94 1 0.6811 0.0397 0.6076 0.763
## 176 93 1 0.6738 0.0399 0.5999 0.757
## 177 92 1 0.6664 0.0402 0.5922 0.750
## 179 91 2 0.6518 0.0406 0.5769 0.736
## 180 89 1 0.6445 0.0408 0.5693 0.730
## 181 88 2 0.6298 0.0412 0.5541 0.716
## 183 86 1 0.6225 0.0413 0.5466 0.709
## 189 83 1 0.6150 0.0415 0.5388 0.702
## 197 80 1 0.6073 0.0417 0.5309 0.695
## 202 78 1 0.5995 0.0419 0.5228 0.687
## 207 77 1 0.5917 0.0420 0.5148 0.680
## 210 76 1 0.5839 0.0422 0.5068 0.673
## 212 75 1 0.5762 0.0424 0.4988 0.665
## 218 74 1 0.5684 0.0425 0.4909 0.658
## 222 72 1 0.5605 0.0426 0.4829 0.651
## 223 70 1 0.5525 0.0428 0.4747 0.643
## 229 67 1 0.5442 0.0429 0.4663 0.635
## 230 66 1 0.5360 0.0431 0.4579 0.627
## 239 64 1 0.5276 0.0432 0.4494 0.619
## 246 63 1 0.5192 0.0433 0.4409 0.611
## 267 61 1 0.5107 0.0434 0.4323 0.603
## 269 60 1 0.5022 0.0435 0.4238 0.595
## 270 59 1 0.4937 0.0436 0.4152 0.587
## 283 57 1 0.4850 0.0437 0.4065 0.579
## 284 56 1 0.4764 0.0438 0.3979 0.570
## 285 54 1 0.4676 0.0438 0.3891 0.562
## 286 53 1 0.4587 0.0439 0.3803 0.553
## 288 52 1 0.4499 0.0439 0.3716 0.545
## 291 51 1 0.4411 0.0439 0.3629 0.536
## 301 48 1 0.4319 0.0440 0.3538 0.527
## 303 46 1 0.4225 0.0440 0.3445 0.518
## 306 44 1 0.4129 0.0440 0.3350 0.509
## 310 43 1 0.4033 0.0441 0.3256 0.500
## 320 42 1 0.3937 0.0440 0.3162 0.490
## 329 41 1 0.3841 0.0440 0.3069 0.481
## 337 40 1 0.3745 0.0439 0.2976 0.471
## 353 39 2 0.3553 0.0437 0.2791 0.452
## 363 37 1 0.3457 0.0436 0.2700 0.443
## 364 36 1 0.3361 0.0434 0.2609 0.433
## 371 35 1 0.3265 0.0432 0.2519 0.423
## 387 34 1 0.3169 0.0430 0.2429 0.413
## 390 33 1 0.3073 0.0428 0.2339 0.404
## 394 32 1 0.2977 0.0425 0.2250 0.394
## 428 29 1 0.2874 0.0423 0.2155 0.383
## 429 28 1 0.2771 0.0420 0.2060 0.373
## 442 27 1 0.2669 0.0417 0.1965 0.362
## 455 25 1 0.2562 0.0413 0.1868 0.351
## 457 24 1 0.2455 0.0410 0.1770 0.341
## 460 22 1 0.2344 0.0406 0.1669 0.329
## 477 21 1 0.2232 0.0402 0.1569 0.318
## 519 20 1 0.2121 0.0397 0.1469 0.306
## 524 19 1 0.2009 0.0391 0.1371 0.294
## 533 18 1 0.1897 0.0385 0.1275 0.282
## 558 17 1 0.1786 0.0378 0.1179 0.270
## 567 16 1 0.1674 0.0371 0.1085 0.258
## 574 15 1 0.1562 0.0362 0.0992 0.246
## 583 14 1 0.1451 0.0353 0.0900 0.234
## 613 13 1 0.1339 0.0343 0.0810 0.221
## 624 12 1 0.1228 0.0332 0.0722 0.209
## 643 11 1 0.1116 0.0320 0.0636 0.196
## 655 10 1 0.1004 0.0307 0.0552 0.183
## 689 9 1 0.0893 0.0293 0.0470 0.170
## 707 8 1 0.0781 0.0276 0.0390 0.156
## 791 7 1 0.0670 0.0259 0.0314 0.143
## 814 5 1 0.0536 0.0239 0.0223 0.128
## 883 3 1 0.0357 0.0216 0.0109 0.117
##
## sex=2
## time n.risk n.event survival std.err lower 95% CI upper 95% CI
## 5 90 1 0.9889 0.0110 0.9675 1.000
## 60 89 1 0.9778 0.0155 0.9478 1.000
## 61 88 1 0.9667 0.0189 0.9303 1.000
## 62 87 1 0.9556 0.0217 0.9139 0.999
## 79 86 1 0.9444 0.0241 0.8983 0.993
## 81 85 1 0.9333 0.0263 0.8832 0.986
## 95 83 1 0.9221 0.0283 0.8683 0.979
## 107 81 1 0.9107 0.0301 0.8535 0.972
## 122 80 1 0.8993 0.0318 0.8390 0.964
## 145 79 2 0.8766 0.0349 0.8108 0.948
## 153 77 1 0.8652 0.0362 0.7970 0.939
## 166 76 1 0.8538 0.0375 0.7834 0.931
## 167 75 1 0.8424 0.0387 0.7699 0.922
## 182 71 1 0.8305 0.0399 0.7559 0.913
## 186 70 1 0.8187 0.0411 0.7420 0.903
## 194 68 1 0.8066 0.0422 0.7280 0.894
## 199 67 1 0.7946 0.0432 0.7142 0.884
## 201 66 2 0.7705 0.0452 0.6869 0.864
## 208 62 1 0.7581 0.0461 0.6729 0.854
## 226 59 1 0.7452 0.0471 0.6584 0.843
## 239 57 1 0.7322 0.0480 0.6438 0.833
## 245 54 1 0.7186 0.0490 0.6287 0.821
## 268 51 1 0.7045 0.0501 0.6129 0.810
## 285 47 1 0.6895 0.0512 0.5962 0.798
## 293 45 1 0.6742 0.0523 0.5791 0.785
## 305 43 1 0.6585 0.0534 0.5618 0.772
## 310 42 1 0.6428 0.0544 0.5447 0.759
## 340 39 1 0.6264 0.0554 0.5267 0.745
## 345 38 1 0.6099 0.0563 0.5089 0.731
## 348 37 1 0.5934 0.0572 0.4913 0.717
## 350 36 1 0.5769 0.0579 0.4739 0.702
## 351 35 1 0.5604 0.0586 0.4566 0.688
## 361 33 1 0.5434 0.0592 0.4390 0.673
## 363 32 1 0.5265 0.0597 0.4215 0.658
## 371 30 1 0.5089 0.0603 0.4035 0.642
## 426 26 1 0.4893 0.0610 0.3832 0.625
## 433 25 1 0.4698 0.0617 0.3632 0.608
## 444 24 1 0.4502 0.0621 0.3435 0.590
## 450 23 1 0.4306 0.0624 0.3241 0.572
## 473 22 1 0.4110 0.0626 0.3050 0.554
## 520 19 1 0.3894 0.0629 0.2837 0.534
## 524 18 1 0.3678 0.0630 0.2628 0.515
## 550 15 1 0.3433 0.0634 0.2390 0.493
## 641 11 1 0.3121 0.0649 0.2076 0.469
## 654 10 1 0.2808 0.0655 0.1778 0.443
## 687 9 1 0.2496 0.0652 0.1496 0.417
## 705 8 1 0.2184 0.0641 0.1229 0.388
## 728 7 1 0.1872 0.0621 0.0978 0.359
## 731 6 1 0.1560 0.0590 0.0743 0.328
## 735 5 1 0.1248 0.0549 0.0527 0.295
## 765 3 1 0.0832 0.0499 0.0257 0.270
It’s also possible to use the function surv_summary() [in survminer package] to get a summary of survival curves. Compared to the default summary() function, surv_summary() creates a data frame containing a nice summary from survfit results.
## Warning in .get_data(x, data = data): The `data` argument is not provided. Data
## will be extracted from model fit.
The function surv_summary() returns a data frame with the following columns:
In a situation, where survival curves have been fitted with one or more variables, surv_summary object contains extra columns representing the variables. This makes it possible to facet the output of ggsurvplot by strata or by some combinations of factors.
surv_summary object has also an attribute named ‘table’ containing information about the survival curves, including medians of survival with confidence intervals, as well as, the total number of subjects and the number of event in each curve. To get access to the attribute ‘table’, type this:
Log-Rank test comparing survival curves: survdiff()
The log-rank test is the most widely used method of comparing two or more survival curves. The null hypothesis is that there is no difference in survival between the two groups. The log rank test is a non-parametric test, which makes no assumptions about the survival distributions. Essentially, the log rank test compares the observed number of events in each group to what would be expected if the null hypothesis were true (i.e., if the survival curves were identical). The log rank statistic is approximately distributed as a chi-square test statistic.
The function survdiff() [in survival package] can be used to compute log-rank test comparing two or more survival curves.
survdiff() can be used as follow:
## Call:
## survdiff(formula = Surv(time, status) ~ sex, data = lung)
##
## N Observed Expected (O-E)^2/E (O-E)^2/V
## sex=1 138 112 91.6 4.55 10.3
## sex=2 90 53 73.4 5.68 10.3
##
## Chisq= 10.3 on 1 degrees of freedom, p= 0.001
The function returns a list of components, including:
The log rank test for difference in survival gives a p-value of p = 0.0013, indicating that the sex groups differ significantly in survival.
Fit complex survival curves In this section, we’ll compute survival curves using the combination of multiple factors. Next, we’ll facet the output of ggsurvplot() by a combination of factors
# Plot survival curves by sex and facet by rx and adhere
ggsurv <- ggsurvplot(fit2, fun = "event", conf.int = TRUE,
ggtheme = theme_bw())
ggsurv$plot +theme_bw() +
theme (legend.position = "right")+
facet_grid(rx ~ adhere)
Summary
Survival analysis is a set of statistical approaches for data analysis where the outcome variable of interest is time until an event occurs.
Survival data are generally described and modeled in terms of two related functions:
the survivor function representing the probability that an individual survives from the time of origin to some time beyond time t. It’s usually estimated by the Kaplan-Meier method. The logrank test may be used to test for differences between survival curves for groups, such as treatment arms.
The hazard function gives the instantaneous potential of having an event at a time, given survival up to that time. It is used primarily as a diagnostic tool or for specifying a mathematical model for survival analysis.
In this article, we demonstrate how to perform and visualize survival analyses using the combination of two R packages: survival (for the analysis) and survminer (for the visualization).
References
The Cox proportional-hazards model (Cox, 1972) is essentially a regression model commonly used statistical in medical research for investigating the association between the survival time of patients and one or more predictor variables.
In the previous chapter (survival analysis basics), we described the basic concepts of survival analyses and methods for analyzing and summarizing survival data, including:
The above mentioned methods - Kaplan-Meier curves and logrank tests - are examples of univariate analysis. They describe the survival according to one factor under investigation, but ignore the impact of any others.
Additionally, Kaplan-Meier curves and logrank tests are useful only when the predictor variable is categorical (e.g.: treatment A vs treatment B; males vs females). They don’t work easily for quantitative predictors such as gene expression, weight, or age.
An alternative method is the Cox proportional hazards regression analysis, which works for both quantitative predictor variables and for categorical variables. Furthermore, the Cox regression model extends survival analysis methods to assess simultaneously the effect of several risk factors on survival time.
In this article, we’ll describe the Cox regression model and provide practical examples using R software.
The need for multivariate statistical modeling
In clinical investigations, there are many situations, where several known quantities (known as covariates), potentially affect patient prognosis.
For instance, suppose two groups of patients are compared: those with and those without a specific genotype. If one of the groups also contains older individuals, any difference in survival may be attributable to genotype or age or indeed both. Hence, when investigating survival in relation to any one factor, it is often desirable to adjust for the impact of others.
Statistical model is a frequently used tool that allows to analyze survival with respect to several factors simultaneously. Additionally, statistical model provides the effect size for each factor.
The cox proportional-hazards model is one of the most important methods used for modelling survival analysis data. The next section introduces the basics of the Cox regression model.
The purpose of the model is to evaluate simultaneously the effect of several factors on survival. In other words, it allows us to examine how specified factors influence the rate of a particular event happening (e.g., infection, death) at a particular point in time. This rate is commonly referred as the hazard rate. Predictor variables (or factors) are usually termed covariates in the survival-analysis literature.
The Cox model is expressed by the hazard function denoted by \(h(t)\). Briefly, the hazard function can be interpreted as the risk of dying at time \(t\). It can be estimated as follow: \[ h(t)= h_0 (t) e^{b_1 x_1+b_2 x_2+...+b_p x_p}\]
where
The Cox model can be written as a multiple linear regression of the logarithm of the hazard on the variables \(x_i\), with the baseline hazard being an ‘intercept’ term that varies with time.
The quantities \(e^{bi}\) are called hazard ratios (HR). A value of \(b_i\) greater than zero, or equivalently a hazard ratio greater than one, indicates that as the value of the \(i^{th}\) covariate increases, the event hazard increases and thus the length of survival decreases.
Put another way, a hazard ratio above 1 indicates a covariate that is positively associated with the event probability, and thus negatively associated with the length of survival.
In summary,
Note that in cancer studies:
A key assumption of the Cox model is that the hazard curves for the groups of observations (or patients) should be proportional and cannot cross.
Consider two patients k and k’ that differ in their x-values. The corresponding hazard function can be simply written as follow
Hazard function for the patient \(k\): \[ h_k (t)= h_0 (t) e^{\sum_{i=1}^{n} \beta x}\]
Hazard function for the patient \(k’\): \[ h_{k'} (t)=h_0(t) e^{\sum_{i=1}^{n} \beta x'}\]
The hazard ratio for these two patients \[ \frac{h_k (t)}{h_{k'} (t)} = \frac{e^{\sum_{i=1}^{n} \beta x}}{e^{\sum_{i=1}^{n} \beta x'}}\]
Consequently, the Cox model is a proportional-hazards model: the hazard of the event in any group is a constant multiple of the hazard in any other. This assumption implies that, as mentioned above, the hazard curves for the groups should be proportional and cannot cross.
In other words, if an individual has a risk of death at some initial time point that is twice as high as that of another individual, then at all later times the risk of death remains twice as high.
This assumption of proportional hazards should be tested. We’ll discuss methods for assessing proportionality in the next article in this series: Cox Model Assumptions.
We’ll use two R packages:
R function to compute the Cox model: coxph()
The function coxph()[in survival package] can be used to compute the Cox proportional hazards regression model in R.
The simplified format is as follow:
coxph(formula, data, method)
Example data sets
We’ll use the lung cancer data in the survival R package.
Compute the Cox model
We’ll fit the Cox regression using the following covariates: age, sex, ph.ecog and wt.loss.
We start by computing univariate Cox analyses for all these variables; then we’ll fit multivariate cox analyses using two variables to describe how the factors jointly impact on survival.
Univariate Cox regression
Univariate Cox analyses can be computed as follow:
## Call:
## coxph(formula = Surv(time, status) ~ sex, data = lung)
##
## coef exp(coef) se(coef) z p
## sex -0.5310 0.5880 0.1672 -3.176 0.00149
##
## Likelihood ratio test=10.63 on 1 df, p=0.001111
## n= 228, number of events= 165
The function summary() for Cox models produces a more complete report:
## Call:
## coxph(formula = Surv(time, status) ~ sex, data = lung)
##
## n= 228, number of events= 165
##
## coef exp(coef) se(coef) z Pr(>|z|)
## sex -0.5310 0.5880 0.1672 -3.176 0.00149 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## sex 0.588 1.701 0.4237 0.816
##
## Concordance= 0.579 (se = 0.021 )
## Likelihood ratio test= 10.63 on 1 df, p=0.001
## Wald test = 10.09 on 1 df, p=0.001
## Score (logrank) test = 10.33 on 1 df, p=0.001
The Cox regression results can be interpreted as follow:
Statistical significance. The column marked “z” gives the Wald statistic value. It corresponds to the ratio of each regression coefficient to its standard error (z = coef/se(coef)). The wald statistic evaluates, whether the beta (β) coefficient of a given variable is statistically significantly different from 0. From the output above, we can conclude that the variable sex have highly statistically significant coefficients.
The regression coefficients. The second feature to note in the Cox model results is the the sign of the regression coefficients (coef). A positive sign means that the hazard (risk of death) is higher, and thus the prognosis worse, for subjects with higher values of that variable. The variable sex is encoded as a numeric vector. 1: male, 2: female. The R summary for the Cox model gives the hazard ratio (HR) for the second group relative to the first group, that is, female versus male. The beta coefficient for sex = -0.53 indicates that females have lower risk of death (lower survival rates) than males, in these data.
Hazard ratios. The exponentiated coefficients (exp(coef) = exp(-0.53) = 0.59), also known as hazard ratios, give the effect size of covariates. For example, being female (sex=2) reduces the hazard by a factor of 0.59, or 41%. Being female is associated with good prognostic.
Confidence intervals of the hazard ratios. The summary output also gives upper and lower 95% confidence intervals for the hazard ratio (exp(coef)), lower 95% bound = 0.4237, upper 95% bound = 0.816.
Global statistical significance of the model. Finally, the output gives p-values for three alternative tests for overall significance of the model: The likelihood-ratio test, Wald test, and score logrank statistics. These three methods are asymptotically equivalent. For large enough N, they will give similar results. For small N, they may differ somewhat. The Likelihood ratio test has better behavior for small sample sizes, so it is generally preferred.
To apply the univariate coxph function to multiple covariates at once, type this:
covariates <- c("age", "sex", "ph.karno", "ph.ecog", "wt.loss")
univ_formulas <- sapply(covariates,
function(x) as.formula(paste('Surv(time, status)~', x)))
univ_models <- lapply( univ_formulas, function(x){coxph(x, data = lung)})
# Extract data
univ_results <- lapply(univ_models,
function(x){
x <- summary(x)
p.value<-signif(x$wald["pvalue"], digits=2)
wald.test<-signif(x$wald["test"], digits=2)
beta<-signif(x$coef[1], digits=2);#coeficient beta
HR <-signif(x$coef[2], digits=2);#exp(beta)
HR.confint.lower <- signif(x$conf.int[,"lower .95"], 2)
HR.confint.upper <- signif(x$conf.int[,"upper .95"],2)
HR <- paste0(HR, " (",
HR.confint.lower, "-", HR.confint.upper, ")")
res<-c(beta, HR, wald.test, p.value)
names(res)<-c("beta", "HR (95% CI for HR)", "wald.test",
"p.value")
return(res)
#return(exp(cbind(coef(x),confint(x))))
})
res <- t(as.data.frame(univ_results, check.names = FALSE))
as.data.frame(res)
The output above shows the regression beta coefficients, the effect sizes (given as hazard ratios) and statistical significance for each of the variables in relation to overall survival. Each factor is assessed through separate univariate Cox regressions.
From the output above,
The variables sex, age and ph.ecog have highly statistically significant coefficients, while the coefficient for ph.karno is not significant.
age and ph.ecog have positive beta coefficients, while sex has a negative coefficient. Thus, older age and higher ph.ecog are associated with poorer survival, whereas being female (sex=2) is associated with better survival.
Now, we want to describe how the factors jointly impact on survival. To answer to this question, we’ll perform a multivariate Cox regression analysis. As the variable ph.karno is not significant in the univariate Cox analysis, we’ll skip it in the multivariate analysis. We’ll include the 3 factors (sex, age and ph.ecog) into the multivariate model.
Multivariate Cox regression analysis
A Cox regression of time to death on the time-constant covariates is specified as follow:
# Data preparation and computing cox model
res.cox <- coxph(Surv(time, status) ~ age + sex + ph.ecog, data = lung)
summary(res.cox)
## Call:
## coxph(formula = Surv(time, status) ~ age + sex + ph.ecog, data = lung)
##
## n= 227, number of events= 164
## (1 observation deleted due to missingness)
##
## coef exp(coef) se(coef) z Pr(>|z|)
## age 0.011067 1.011128 0.009267 1.194 0.232416
## sex -0.552612 0.575445 0.167739 -3.294 0.000986 ***
## ph.ecog 0.463728 1.589991 0.113577 4.083 4.45e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## age 1.0111 0.9890 0.9929 1.0297
## sex 0.5754 1.7378 0.4142 0.7994
## ph.ecog 1.5900 0.6289 1.2727 1.9864
##
## Concordance= 0.637 (se = 0.025 )
## Likelihood ratio test= 30.5 on 3 df, p=1e-06
## Wald test = 29.93 on 3 df, p=1e-06
## Score (logrank) test = 30.5 on 3 df, p=1e-06
The p-value for all three overall tests (likelihood, Wald, and score) are significant, indicating that the model is significant. These tests evaluate the omnibus null hypothesis that all of the betas (β) are 0. In the above example, the test statistics are in close agreement, and the omnibus null hypothesis is soundly rejected.
In the multivariate Cox analysis, the covariates sex and ph.ecog remain significant (p < 0.05). However, the covariate age fails to be significant (p = 0.23, which is grater than 0.05).
The p-value for sex is 0.000986, with a hazard ratio HR = exp(coef) = 0.58, indicating a strong relationship between the patients’ sex and decreased risk of death. The hazard ratios of covariates are interpretable as multiplicative effects on the hazard. For example, holding the other covariates constant, being female (sex=2) reduces the hazard by a factor of 0.58, or 42%. We conclude that, being female is associated with good prognostic.
Similarly, the p-value for ph.ecog is 4.45e-05, with a hazard ratio HR = 1.59, indicating a strong relationship between the ph.ecog value and increased risk of death. Holding the other covariates constant, a higher value of ph.ecog is associated with a poor survival.
By contrast, the p-value for age is now p=0.23. The hazard ratio HR = exp(coef) = 1.01, with a 95% confidence interval of 0.99 to 1.03. Because the confidence interval for HR includes 1, these results indicate that age makes a smaller contribution to the difference in the HR after adjusting for the ph.ecog values and patient’s sex, and only trend toward significance. For example, holding the other covariates constant, an additional year of age induce daily hazard of death by a factor of exp(beta) = 1.01, or 1%, which is not a significant contribution.
Visualizing the estimated distribution of survival times
Having fit a Cox model to the data, it’s possible to visualize the predicted survival proportion at any given point in time for a particular risk group. The function survfit() estimates the survival proportion, by default at the mean values of covariates.
We may wish to display how estimated survival depends upon the value of a covariate of interest.
Consider that, we want to assess the impact of the sex on the estimated survival probability. In this case, we construct a new data frame with two rows, one for each value of sex; the other covariates are fixed to their average values (if they are continuous variables) or to their lowest level (if they are discrete variables). For a dummy covariate, the average value is the proportion coded 1 in the data set. This data frame is passed to survfit() via the newdata argument:
Summary
In this article, we described the Cox regression model for assessing simultaneously the relationship between multiple risk factors and patient’s survival time. We demonstrated how to compute the Cox model using the survival package. Additionally, we described how to visualize the results of the analysis using the survminer package.
References
Previously, we described the basic methods for analyzing survival data, as well as, the Cox proportional hazards methods to deal with the situation where several factors impact on the survival process.
In the current article, we continue the series by describing methods to evaluate the validity of the Cox model assumptions.
Note that, when used inappropriately, statistical models may give rise to misleading conclusions. Therefore, it’s important to check that a given model is an appropriate representation of the data.
Diagnostics for the Cox model
The Cox proportional hazards model makes sevral assumptions. Thus, it is important to assess whether a fitted Cox regression model adequately describes the data.
Here, we’ll disscuss three types of diagonostics for the Cox model:
In order to check these model assumptions, Residuals method are used. The common residuals for the Cox model include:
Computing the Cox Model
We’ll use the lung data sets and the coxph() function in the survival package.
To compute a Cox model, type this:
## Call:
## coxph(formula = Surv(time, status) ~ age + sex + wt.loss, data = lung)
##
## coef exp(coef) se(coef) z p
## age 0.0200882 1.0202913 0.0096644 2.079 0.0377
## sex -0.5210319 0.5939074 0.1743541 -2.988 0.0028
## wt.loss 0.0007596 1.0007599 0.0061934 0.123 0.9024
##
## Likelihood ratio test=14.67 on 3 df, p=0.002122
## n= 214, number of events= 152
## (14 observations deleted due to missingness)
Testing proportional Hazards assumption
The proportional hazards (PH) assumption can be checked using statistical tests and graphical diagnostics based on the scaled Schoenfeld residuals.
In principle, the Schoenfeld residuals are independent of time. A plot that shows a non-random pattern against time is evidence of violation of the PH assumption.
The function cox.zph() [in the survival package] provides a convenient solution to test the proportional hazards assumption for each covariate included in a Cox refression model fit.
For each covariate, the function cox.zph() correlates the corresponding set of scaled Schoenfeld residuals with time, to test for independence between residuals and time. Additionally, it performs a global test for the model as a whole.
The proportional hazard assumption is supported by a non-significant relationship between residuals and time, and refuted by a significant relationship.
To illustrate the test, we start by computing a Cox regression model using the lung data set [in survival package]:
## Call:
## coxph(formula = Surv(time, status) ~ age + sex + wt.loss, data = lung)
##
## coef exp(coef) se(coef) z p
## age 0.0200882 1.0202913 0.0096644 2.079 0.0377
## sex -0.5210319 0.5939074 0.1743541 -2.988 0.0028
## wt.loss 0.0007596 1.0007599 0.0061934 0.123 0.9024
##
## Likelihood ratio test=14.67 on 3 df, p=0.002122
## n= 214, number of events= 152
## (14 observations deleted due to missingness)
To test for the proportional-hazards (PH) assumption, type this:
## chisq df p
## age 0.5077 1 0.48
## sex 2.5489 1 0.11
## wt.loss 0.0144 1 0.90
## GLOBAL 3.0051 3 0.39
From the output above, the test is not statistically significant for each of the covariates, and the global test is also not statistically significant. Therefore, we can assume the proportional hazards.
It’s possible to do a graphical diagnostic using the function ggcoxzph() [in the survminer package], which produces, for each covariate, graphs of the scaled Schoenfeld residuals against the transformed time.
In the figure above, the solid line is a smoothing spline fit to the plot, with the dashed lines representing a +/- 2-standard-error band around the fit.
Note that, systematic departures from a horizontal line are indicative of non-proportional hazards, since proportional hazards assumes that estimates \(β_1,β_2,β_3\) do not vary much over time.
From the graphical inspection, there is no pattern with time. The assumption of proportional hazards appears to be supported for the covariates sex (which is, recall, a two-level factor, accounting for the two bands in the graph), wt.loss and age.
Another graphical methods for checking proportional hazards is to plot log(-log(S(t))) vs. t or log(t) and look for parallelism. This can be done only for categorical covariates.
A violations of proportional hazards assumption can be resolved by:
Adding covariate*time interaction Stratification Stratification is usefull for “nuisance” confounders, where you do not care to estimate the effect. You cannot examine the effects of the stratification variable (John Fox & Sanford Weisberg).
To read more about how to accomodate with non-proportional hazards, read the following articles:
To test influential observations or outliers, we can visualize either:
the deviance residuals or the dfbeta values The function ggcoxdiagnostics()[in survminer package] provides a convenient solution for checkind influential observations. The simplified format is as follow:
ggcoxdiagnostics(fit, type = , linear.predictions = TRUE)
Specifying the argument type = “dfbeta”, plots the estimated changes in the regression coefficients upon deleting each observation in turn; likewise, type=“dfbetas” produces the estimated changes in the coefficients divided by their standard errors.
For example:
## `geom_smooth()` using formula 'y ~ x'
(Index plots of dfbeta for the Cox regression of time to death on age, sex and wt.loss)
The above index plots show that comparing the magnitudes of the largest dfbeta values to the regression coefficients suggests that none of the observations is terribly influential individually, even though some of the dfbeta values for age and wt.loss are large compared with the others.
It’s also possible to check outliers by visualizing the deviance residuals. The deviance residual is a normalized transform of the martingale residual. These residuals should be roughtly symmetrically distributed about zero with a standard deviation of 1.
Example of deviance residuals:
## `geom_smooth()` using formula 'y ~ x'
Often, we assume that continuous covariates have a linear form. However, this assumption should be checked.
Plotting the Martingale residuals against continuous covariates is a common approach used to detect nonlinearity or, in other words, to assess the functional form of a covariate. For a given continuous covariate, patterns in the plot may suggest that the variable is not properly fit.
Nonlinearity is not an issue for categorical variables, so we only examine plots of martingale residuals and partial residuals against a continuous variable.
Martingale residuals may present any value in the range (-INF, +1):
The function ggcoxfunctional() displays graphs of continuous covariates against martingale residuals of null cox proportional hazards model. This might help to properly choose the functional form of continuous variable in the Cox model. Fitted lines with lowess function should be linear to satisfy the Cox proportional hazards model assumptions.
For example, to assess the functional forme of age, type this:
## Warning: arguments formula is deprecated; will be removed in the next version;
## please use fit instead.
It appears that, nonlinearity is slightly here.
Summary
We described how to assess the valididy of the Cox model assumptions using the survival and survminer packages.