Survival Analysis

Basics of Survival Analysis

Survival analysis corresponds to a set of statistical approaches used to investigate the time it takes for an event of interest to occur.

Survival analysis is used in a variety of field such as:

Cancer studies for patients survival time analyses,
Sociology for “event-history analysis”,
and in engineering for “failure-time analysis”.

In cancer studies, typical research questions are like:

What is the impact of certain clinical characteristics on patient’s survival
What is the probability that an individual survives 3 years?
Are there differences in survival between groups of patients?

Objectives

The aim of this chapter is to describe the basic concepts of survival analysis. In cancer studies, most of survival analyses use the following methods:

Kaplan-Meier plots to visualize survival curves
Log-rank test to compare the survival curves of two or more groups
Cox proportional hazards regression to describe the effect of variables on survival. The Cox model is discussed in the next chapter: Cox proportional hazards model.

Here, we’ll start by explaining the essential concepts of survival analysis, including:

how to generate and interpret survival curves,
and how to quantify and test survival differences between two or more groups of patients.

Then, we’ll continue by describing multivariate analysis using Cox proportional hazards model.

Basic Concepts

Here, we start by defining fundamental terms of survival analysis including:

Survival time and event
Censoring
Survival function and hazard function.

Survival time & type of events

There are different types of events, including:

Relapse
Progression
Death

The time from ‘response to treatment’ (complete remission) to the occurrence of the event of interest is commonly called survival time (or time to event).

The two most important measures in cancer studies include:

the time to death; and
the relapse-free survival time, which corresponds to the time between response to treatment and recurrence of the disease. It’s also known as disease-free survival time and event-free survival time.

Censoring

As mentioned above, survival analysis focuses on the expected duration of time until occurrence of an event of interest (relapse or death). However, the event may not be observed for some individuals within the study time period, producing the so-called censored observations.

Censoring may arise in the following ways:

a patient has not (yet) experienced the event of interest, such as relapse or death, within the study time period;
a patient is lost to follow-up during the study period;
a patient experiences a different event that makes further follow-up impossible.

This type of censoring, named right censoring, is handled in survival analysis.

Survival & hazard functions

Two related probabilities are used to describe survival data: the survival probability and the hazard probability.

The survival probability, also known as the survivor function $ S(t)$, is the probability that an individual survives from the time origin (e.g. diagnosis of cancer) to a specified future time t.

The hazard, denoted by $ h(t) $, is the probability that an individual who is under observation at a time t has an event at that time.

Note that, in contrast to the survivor function, which focuses on not having an event, the hazard function focuses on the event occurring.

Kaplan-Meier estimate

The Kaplan-Meier (KM) method is a non-parametric method used to estimate the survival probability from observed survival times (Kaplan and Meier, 1958).

The survival probability at time $t_i$, $S(t_i)$, is calculated as follow: \[ S(t_i)=S(t_{i-1})(1-\frac{d_i}{n_i}) \]

where,

$S(t_{i-1})$ = the probability of being alive at $t_{i-1}$
$n_i$ = the number of patients alive just before $t_i$
$d_i$ = the number of events at $t-i$
$t_0=0$, $S_0=1$

The estimated probability $S(t)$ is a step function that changes value only at the time of each event. It’s also possible to compute confidence intervals for the survival probability.

The KM survival curve, a plot of the KM survival probability against time, provides a useful summary of the data that can be used to estimate measures such as median survival time.

Survival Analysis in R

Install the required packages

We’ll use two R packages:

survival for computing survival analyses
survminer for summarizing and visualizing the results of survival analysis

Install and load the packages

library(survival)
library(survminer)

Example Data Sets

We’ll use the lung cancer data available in the survival package.

data("lung")
head(lung)

inst: Institution code
time: Survival time in days
status: censoring status 1=censored, 2=dead
age: Age in years
sex: Male=1 Female=2
ph.ecog: ECOG performance score (0=good 5=dead)
ph.karno: Karnofsky performance score (bad=0-good=100) rated by physician
pat.karno: Karnofsky performance score as rated by patient
meal.cal: Calories consumed at meals
wt.loss: Weight loss in last six months

Compute survival curves: survfit()

We want to compute the survival probability by sex.

The function survfit() [in survival package] can be used to compute kaplan-Meier survival estimate. Its main arguments include:

a survival object created using the function Surv() and the data set containing the variables. To compute survival curves, type this:

fit <- survfit(Surv(time, status) ~ sex, data = lung)
print(fit)

## Call: survfit(formula = Surv(time, status) ~ sex, data = lung)
## 
##         n events median 0.95LCL 0.95UCL
## sex=1 138    112    270     212     310
## sex=2  90     53    426     348     550

By default, the function print() shows a short summary of the survival curves. It prints the number of observations, number of events, the median survival and the confidence limits for the median.

If you want to display a more complete summary of the survival curves, type this:

# Summary of survival curves
summary(fit)

## Call: survfit(formula = Surv(time, status) ~ sex, data = lung)
## 
##                 sex=1 
##  time n.risk n.event survival std.err lower 95% CI upper 95% CI
##    11    138       3   0.9783  0.0124       0.9542        1.000
##    12    135       1   0.9710  0.0143       0.9434        0.999
##    13    134       2   0.9565  0.0174       0.9231        0.991
##    15    132       1   0.9493  0.0187       0.9134        0.987
##    26    131       1   0.9420  0.0199       0.9038        0.982
##    30    130       1   0.9348  0.0210       0.8945        0.977
##    31    129       1   0.9275  0.0221       0.8853        0.972
##    53    128       2   0.9130  0.0240       0.8672        0.961
##    54    126       1   0.9058  0.0249       0.8583        0.956
##    59    125       1   0.8986  0.0257       0.8496        0.950
##    60    124       1   0.8913  0.0265       0.8409        0.945
##    65    123       2   0.8768  0.0280       0.8237        0.933
##    71    121       1   0.8696  0.0287       0.8152        0.928
##    81    120       1   0.8623  0.0293       0.8067        0.922
##    88    119       2   0.8478  0.0306       0.7900        0.910
##    92    117       1   0.8406  0.0312       0.7817        0.904
##    93    116       1   0.8333  0.0317       0.7734        0.898
##    95    115       1   0.8261  0.0323       0.7652        0.892
##   105    114       1   0.8188  0.0328       0.7570        0.886
##   107    113       1   0.8116  0.0333       0.7489        0.880
##   110    112       1   0.8043  0.0338       0.7408        0.873
##   116    111       1   0.7971  0.0342       0.7328        0.867
##   118    110       1   0.7899  0.0347       0.7247        0.861
##   131    109       1   0.7826  0.0351       0.7167        0.855
##   132    108       2   0.7681  0.0359       0.7008        0.842
##   135    106       1   0.7609  0.0363       0.6929        0.835
##   142    105       1   0.7536  0.0367       0.6851        0.829
##   144    104       1   0.7464  0.0370       0.6772        0.823
##   147    103       1   0.7391  0.0374       0.6694        0.816
##   156    102       2   0.7246  0.0380       0.6538        0.803
##   163    100       3   0.7029  0.0389       0.6306        0.783
##   166     97       1   0.6957  0.0392       0.6230        0.777
##   170     96       1   0.6884  0.0394       0.6153        0.770
##   175     94       1   0.6811  0.0397       0.6076        0.763
##   176     93       1   0.6738  0.0399       0.5999        0.757
##   177     92       1   0.6664  0.0402       0.5922        0.750
##   179     91       2   0.6518  0.0406       0.5769        0.736
##   180     89       1   0.6445  0.0408       0.5693        0.730
##   181     88       2   0.6298  0.0412       0.5541        0.716
##   183     86       1   0.6225  0.0413       0.5466        0.709
##   189     83       1   0.6150  0.0415       0.5388        0.702
##   197     80       1   0.6073  0.0417       0.5309        0.695
##   202     78       1   0.5995  0.0419       0.5228        0.687
##   207     77       1   0.5917  0.0420       0.5148        0.680
##   210     76       1   0.5839  0.0422       0.5068        0.673
##   212     75       1   0.5762  0.0424       0.4988        0.665
##   218     74       1   0.5684  0.0425       0.4909        0.658
##   222     72       1   0.5605  0.0426       0.4829        0.651
##   223     70       1   0.5525  0.0428       0.4747        0.643
##   229     67       1   0.5442  0.0429       0.4663        0.635
##   230     66       1   0.5360  0.0431       0.4579        0.627
##   239     64       1   0.5276  0.0432       0.4494        0.619
##   246     63       1   0.5192  0.0433       0.4409        0.611
##   267     61       1   0.5107  0.0434       0.4323        0.603
##   269     60       1   0.5022  0.0435       0.4238        0.595
##   270     59       1   0.4937  0.0436       0.4152        0.587
##   283     57       1   0.4850  0.0437       0.4065        0.579
##   284     56       1   0.4764  0.0438       0.3979        0.570
##   285     54       1   0.4676  0.0438       0.3891        0.562
##   286     53       1   0.4587  0.0439       0.3803        0.553
##   288     52       1   0.4499  0.0439       0.3716        0.545
##   291     51       1   0.4411  0.0439       0.3629        0.536
##   301     48       1   0.4319  0.0440       0.3538        0.527
##   303     46       1   0.4225  0.0440       0.3445        0.518
##   306     44       1   0.4129  0.0440       0.3350        0.509
##   310     43       1   0.4033  0.0441       0.3256        0.500
##   320     42       1   0.3937  0.0440       0.3162        0.490
##   329     41       1   0.3841  0.0440       0.3069        0.481
##   337     40       1   0.3745  0.0439       0.2976        0.471
##   353     39       2   0.3553  0.0437       0.2791        0.452
##   363     37       1   0.3457  0.0436       0.2700        0.443
##   364     36       1   0.3361  0.0434       0.2609        0.433
##   371     35       1   0.3265  0.0432       0.2519        0.423
##   387     34       1   0.3169  0.0430       0.2429        0.413
##   390     33       1   0.3073  0.0428       0.2339        0.404
##   394     32       1   0.2977  0.0425       0.2250        0.394
##   428     29       1   0.2874  0.0423       0.2155        0.383
##   429     28       1   0.2771  0.0420       0.2060        0.373
##   442     27       1   0.2669  0.0417       0.1965        0.362
##   455     25       1   0.2562  0.0413       0.1868        0.351
##   457     24       1   0.2455  0.0410       0.1770        0.341
##   460     22       1   0.2344  0.0406       0.1669        0.329
##   477     21       1   0.2232  0.0402       0.1569        0.318
##   519     20       1   0.2121  0.0397       0.1469        0.306
##   524     19       1   0.2009  0.0391       0.1371        0.294
##   533     18       1   0.1897  0.0385       0.1275        0.282
##   558     17       1   0.1786  0.0378       0.1179        0.270
##   567     16       1   0.1674  0.0371       0.1085        0.258
##   574     15       1   0.1562  0.0362       0.0992        0.246
##   583     14       1   0.1451  0.0353       0.0900        0.234
##   613     13       1   0.1339  0.0343       0.0810        0.221
##   624     12       1   0.1228  0.0332       0.0722        0.209
##   643     11       1   0.1116  0.0320       0.0636        0.196
##   655     10       1   0.1004  0.0307       0.0552        0.183
##   689      9       1   0.0893  0.0293       0.0470        0.170
##   707      8       1   0.0781  0.0276       0.0390        0.156
##   791      7       1   0.0670  0.0259       0.0314        0.143
##   814      5       1   0.0536  0.0239       0.0223        0.128
##   883      3       1   0.0357  0.0216       0.0109        0.117
## 
##                 sex=2 
##  time n.risk n.event survival std.err lower 95% CI upper 95% CI
##     5     90       1   0.9889  0.0110       0.9675        1.000
##    60     89       1   0.9778  0.0155       0.9478        1.000
##    61     88       1   0.9667  0.0189       0.9303        1.000
##    62     87       1   0.9556  0.0217       0.9139        0.999
##    79     86       1   0.9444  0.0241       0.8983        0.993
##    81     85       1   0.9333  0.0263       0.8832        0.986
##    95     83       1   0.9221  0.0283       0.8683        0.979
##   107     81       1   0.9107  0.0301       0.8535        0.972
##   122     80       1   0.8993  0.0318       0.8390        0.964
##   145     79       2   0.8766  0.0349       0.8108        0.948
##   153     77       1   0.8652  0.0362       0.7970        0.939
##   166     76       1   0.8538  0.0375       0.7834        0.931
##   167     75       1   0.8424  0.0387       0.7699        0.922
##   182     71       1   0.8305  0.0399       0.7559        0.913
##   186     70       1   0.8187  0.0411       0.7420        0.903
##   194     68       1   0.8066  0.0422       0.7280        0.894
##   199     67       1   0.7946  0.0432       0.7142        0.884
##   201     66       2   0.7705  0.0452       0.6869        0.864
##   208     62       1   0.7581  0.0461       0.6729        0.854
##   226     59       1   0.7452  0.0471       0.6584        0.843
##   239     57       1   0.7322  0.0480       0.6438        0.833
##   245     54       1   0.7186  0.0490       0.6287        0.821
##   268     51       1   0.7045  0.0501       0.6129        0.810
##   285     47       1   0.6895  0.0512       0.5962        0.798
##   293     45       1   0.6742  0.0523       0.5791        0.785
##   305     43       1   0.6585  0.0534       0.5618        0.772
##   310     42       1   0.6428  0.0544       0.5447        0.759
##   340     39       1   0.6264  0.0554       0.5267        0.745
##   345     38       1   0.6099  0.0563       0.5089        0.731
##   348     37       1   0.5934  0.0572       0.4913        0.717
##   350     36       1   0.5769  0.0579       0.4739        0.702
##   351     35       1   0.5604  0.0586       0.4566        0.688
##   361     33       1   0.5434  0.0592       0.4390        0.673
##   363     32       1   0.5265  0.0597       0.4215        0.658
##   371     30       1   0.5089  0.0603       0.4035        0.642
##   426     26       1   0.4893  0.0610       0.3832        0.625
##   433     25       1   0.4698  0.0617       0.3632        0.608
##   444     24       1   0.4502  0.0621       0.3435        0.590
##   450     23       1   0.4306  0.0624       0.3241        0.572
##   473     22       1   0.4110  0.0626       0.3050        0.554
##   520     19       1   0.3894  0.0629       0.2837        0.534
##   524     18       1   0.3678  0.0630       0.2628        0.515
##   550     15       1   0.3433  0.0634       0.2390        0.493
##   641     11       1   0.3121  0.0649       0.2076        0.469
##   654     10       1   0.2808  0.0655       0.1778        0.443
##   687      9       1   0.2496  0.0652       0.1496        0.417
##   705      8       1   0.2184  0.0641       0.1229        0.388
##   728      7       1   0.1872  0.0621       0.0978        0.359
##   731      6       1   0.1560  0.0590       0.0743        0.328
##   735      5       1   0.1248  0.0549       0.0527        0.295
##   765      3       1   0.0832  0.0499       0.0257        0.270

# Access to the sort summary table
summary(fit)$table

##       records n.max n.start events   *rmean *se(rmean) median 0.95LCL 0.95UCL
## sex=1     138   138     138    112 325.0663   22.59845    270     212     310
## sex=2      90    90      90     53 458.2757   33.78530    426     348     550

Access to the value returned by survfit()

The function survfit() returns a list of variables, including the following components:

n: total number of subjects in each curve.
time: the time points on the curve.
n.risk: the number of subjects at risk at time t
n.event: the number of events that occurred at time t.
n.censor: the number of censored subjects, who exit the risk set, without an event, at time t.
lower,upper: lower and upper confidence limits for the curve, respectively.
strata: indicates stratification of curve estimation. If strata is not NULL, there are multiple curves in the result. The levels of strata (a factor) are the labels for the curves.

The components can be accessed as follow:

d <- data.frame(time = fit$time,
                  n.risk = fit$n.risk,
                  n.event = fit$n.event,
                  n.censor = fit$n.censor,
                  surv = fit$surv,
                  upper = fit$upper,
                  lower = fit$lower
                  )
head(d)

Visualize survival curves

We’ll use the function ggsurvplot() [in Survminer R package] to produce the survival curves for the two groups of subjects.

It’s also possible to show:

the 95% confidence limits of the survivor function using the argument conf.int = TRUE.
the number and/or the percentage of individuals at risk by time using the option risk.table. Allowed values for risk.table include:
- TRUE or FALSE specifying whether to show or not the risk table. Default is FALSE.
- “absolute” or “percentage”: to show the absolute number and the percentage of subjects at risk by time, respectively. Use “abs_pct” to show both absolute number and percentage.
the p-value of the Log-Rank test comparing the groups using pval = TRUE.
horizontal/vertical line at median survival using the argument surv.median.line. Allowed values include one of c(“none”, “hv”, “h”, “v”). v: vertical, h:horizontal.

# Change color, linetype by strata, risk.table color by strata
ggsurvplot(fit,
          pval = TRUE, conf.int = TRUE,
          risk.table = TRUE, # Add risk table
          risk.table.col = "strata", # Change risk table color by groups
          linetype = "strata", # Change line type by groups
          surv.median.line = "hv", # Specify median survival
          ggtheme = theme_bw(), # Change ggplot2 theme
          palette = c("#E7B800", "#2E9FDF"))

The plot can be further customized using the following arguments:

conf.int.style = “step” to change the style of confidence interval bands.
xlab to change the x axis label.
break.time.by = 200 break x axis in time intervals by 200.
risk.table = “abs_pct”to show both absolute number and percentage of individuals at risk.
risk.table.y.text.col = TRUE and risk.table.y.text = FALSE to provide bars instead of names in text annotations of the legend of risk table.
ncensor.plot = TRUE to plot the number of censored subjects at time t. As suggested by Marcin Kosinski, This is a good additional feedback to survival curves, so that one could realize: how do survival curves look like, what is the number of risk set AND what is the cause that the risk set become smaller: is it caused by events or by censored events?
legend.labs to change the legend labels.

ggsurvplot(fit, pval=TRUE, conf.int = TRUE, xlab="Time in days",  break.time.by = 200, ggtheme = theme_light(),risk.table = "abs_pct",risk.table.y.text.col = T, risk.table.y.text = FALSE,  ncensor.plot = TRUE, surv.median.line = "hv",legend.labs = 
    c("Male", "Female"),palette = 
    c("#E7B800", "#2E9FDF"))

The Kaplan-Meier plot can be interpreted as follow:

The horizontal axis (x-axis) represents time in days, and the vertical axis (y-axis) shows the probability of surviving or the proportion of people surviving. The lines represent survival curves of the two groups. A vertical drop in the curves indicates an event. The vertical tick mark on the curves means that a patient was censored at this time.

At time zero, the survival probability is 1.0 (or 100% of the participants are alive).
At time 250, the probability of survival is approximately 0.55 (or 55%) for sex=1 and 0.75 (or 75%) for sex=2.
The median survival is approximately 270 days for sex=1 and 426 days for sex=2, suggesting a good survival for sex=2 compared to sex=1

The median survival times for each group can be obtained using the code below:

summary(fit)$table

##       records n.max n.start events   *rmean *se(rmean) median 0.95LCL 0.95UCL
## sex=1     138   138     138    112 325.0663   22.59845    270     212     310
## sex=2      90    90      90     53 458.2757   33.78530    426     348     550

The median survival times for each group represent the time at which the survival probability, $S(t)$, is 0.5.

The median survival time for sex=1 (Male group) is 270 days, as opposed to 426 days for sex=2 (Female). There appears to be a survival advantage for female with lung cancer compare to male. However, to evaluate whether this difference is statistically significant requires a formal statistical test, a subject that is discussed in the next sections.

Note that, the confidence limits are wide at the tail of the curves, making meaningful interpretations difficult. This can be explained by the fact that, in practice, there are usually patients who are lost to follow-up or alive at the end of follow-up. Thus, it may be sensible to shorten plots before the end of follow-up on the x-axis (Pocock et al, 2002).

The survival curves can be shorten using the argument xlim as follow:

ggsurvplot(fit,
          conf.int = TRUE,
          risk.table.col = "strata", # Change risk table color by groups
          ggtheme = theme_bw(), # Change ggplot2 theme
          palette = c("#E7B800", "#2E9FDF"),
          xlim = c(0, 600))

Note that, three often used transformations can be specified using the argument fun:

“log”: log transformation of the survivor function,
“event”: plots cumulative events (f(y) = 1-y). It’s also known as the cumulative incidence,
“cumhaz” plots the cumulative hazard function (f(y) = -log(y))

For example, to plot cumulative events, type this:

ggsurvplot(fit,
          conf.int = TRUE,
          risk.table.col = "strata", # Change risk table color by groups
          ggtheme = theme_bw(), # Change ggplot2 theme
          palette = c("#E7B800", "#2E9FDF"),
          fun = "event")

The cummulative hazard is commonly used to estimate the hazard probability. It’s defined as \[H(t)=−log(survivalfunction)=−log(S(t))\]. The cumulative hazard (H(t)) can be interpreted as the cumulative force of mortality. In other words, it corresponds to the number of events that would be expected for each individual by time t if the event were a repeatable process.

To plot cumulative hazard, type this:

ggsurvplot(fit,
          conf.int = TRUE,
          risk.table.col = "strata", # Change risk table color by groups
          ggtheme = theme_bw(), # Change ggplot2 theme
          palette = c("#E7B800", "#2E9FDF"),
          fun = "cumhaz")

Kaplan-Meier life table: summary of survival curves

As mentioned above, you can use the function summary() to have a complete summary of survival curves:

summary(fit)

## Call: survfit(formula = Surv(time, status) ~ sex, data = lung)
## 
##                 sex=1 
##  time n.risk n.event survival std.err lower 95% CI upper 95% CI
##    11    138       3   0.9783  0.0124       0.9542        1.000
##    12    135       1   0.9710  0.0143       0.9434        0.999
##    13    134       2   0.9565  0.0174       0.9231        0.991
##    15    132       1   0.9493  0.0187       0.9134        0.987
##    26    131       1   0.9420  0.0199       0.9038        0.982
##    30    130       1   0.9348  0.0210       0.8945        0.977
##    31    129       1   0.9275  0.0221       0.8853        0.972
##    53    128       2   0.9130  0.0240       0.8672        0.961
##    54    126       1   0.9058  0.0249       0.8583        0.956
##    59    125       1   0.8986  0.0257       0.8496        0.950
##    60    124       1   0.8913  0.0265       0.8409        0.945
##    65    123       2   0.8768  0.0280       0.8237        0.933
##    71    121       1   0.8696  0.0287       0.8152        0.928
##    81    120       1   0.8623  0.0293       0.8067        0.922
##    88    119       2   0.8478  0.0306       0.7900        0.910
##    92    117       1   0.8406  0.0312       0.7817        0.904
##    93    116       1   0.8333  0.0317       0.7734        0.898
##    95    115       1   0.8261  0.0323       0.7652        0.892
##   105    114       1   0.8188  0.0328       0.7570        0.886
##   107    113       1   0.8116  0.0333       0.7489        0.880
##   110    112       1   0.8043  0.0338       0.7408        0.873
##   116    111       1   0.7971  0.0342       0.7328        0.867
##   118    110       1   0.7899  0.0347       0.7247        0.861
##   131    109       1   0.7826  0.0351       0.7167        0.855
##   132    108       2   0.7681  0.0359       0.7008        0.842
##   135    106       1   0.7609  0.0363       0.6929        0.835
##   142    105       1   0.7536  0.0367       0.6851        0.829
##   144    104       1   0.7464  0.0370       0.6772        0.823
##   147    103       1   0.7391  0.0374       0.6694        0.816
##   156    102       2   0.7246  0.0380       0.6538        0.803
##   163    100       3   0.7029  0.0389       0.6306        0.783
##   166     97       1   0.6957  0.0392       0.6230        0.777
##   170     96       1   0.6884  0.0394       0.6153        0.770
##   175     94       1   0.6811  0.0397       0.6076        0.763
##   176     93       1   0.6738  0.0399       0.5999        0.757
##   177     92       1   0.6664  0.0402       0.5922        0.750
##   179     91       2   0.6518  0.0406       0.5769        0.736
##   180     89       1   0.6445  0.0408       0.5693        0.730
##   181     88       2   0.6298  0.0412       0.5541        0.716
##   183     86       1   0.6225  0.0413       0.5466        0.709
##   189     83       1   0.6150  0.0415       0.5388        0.702
##   197     80       1   0.6073  0.0417       0.5309        0.695
##   202     78       1   0.5995  0.0419       0.5228        0.687
##   207     77       1   0.5917  0.0420       0.5148        0.680
##   210     76       1   0.5839  0.0422       0.5068        0.673
##   212     75       1   0.5762  0.0424       0.4988        0.665
##   218     74       1   0.5684  0.0425       0.4909        0.658
##   222     72       1   0.5605  0.0426       0.4829        0.651
##   223     70       1   0.5525  0.0428       0.4747        0.643
##   229     67       1   0.5442  0.0429       0.4663        0.635
##   230     66       1   0.5360  0.0431       0.4579        0.627
##   239     64       1   0.5276  0.0432       0.4494        0.619
##   246     63       1   0.5192  0.0433       0.4409        0.611
##   267     61       1   0.5107  0.0434       0.4323        0.603
##   269     60       1   0.5022  0.0435       0.4238        0.595
##   270     59       1   0.4937  0.0436       0.4152        0.587
##   283     57       1   0.4850  0.0437       0.4065        0.579
##   284     56       1   0.4764  0.0438       0.3979        0.570
##   285     54       1   0.4676  0.0438       0.3891        0.562
##   286     53       1   0.4587  0.0439       0.3803        0.553
##   288     52       1   0.4499  0.0439       0.3716        0.545
##   291     51       1   0.4411  0.0439       0.3629        0.536
##   301     48       1   0.4319  0.0440       0.3538        0.527
##   303     46       1   0.4225  0.0440       0.3445        0.518
##   306     44       1   0.4129  0.0440       0.3350        0.509
##   310     43       1   0.4033  0.0441       0.3256        0.500
##   320     42       1   0.3937  0.0440       0.3162        0.490
##   329     41       1   0.3841  0.0440       0.3069        0.481
##   337     40       1   0.3745  0.0439       0.2976        0.471
##   353     39       2   0.3553  0.0437       0.2791        0.452
##   363     37       1   0.3457  0.0436       0.2700        0.443
##   364     36       1   0.3361  0.0434       0.2609        0.433
##   371     35       1   0.3265  0.0432       0.2519        0.423
##   387     34       1   0.3169  0.0430       0.2429        0.413
##   390     33       1   0.3073  0.0428       0.2339        0.404
##   394     32       1   0.2977  0.0425       0.2250        0.394
##   428     29       1   0.2874  0.0423       0.2155        0.383
##   429     28       1   0.2771  0.0420       0.2060        0.373
##   442     27       1   0.2669  0.0417       0.1965        0.362
##   455     25       1   0.2562  0.0413       0.1868        0.351
##   457     24       1   0.2455  0.0410       0.1770        0.341
##   460     22       1   0.2344  0.0406       0.1669        0.329
##   477     21       1   0.2232  0.0402       0.1569        0.318
##   519     20       1   0.2121  0.0397       0.1469        0.306
##   524     19       1   0.2009  0.0391       0.1371        0.294
##   533     18       1   0.1897  0.0385       0.1275        0.282
##   558     17       1   0.1786  0.0378       0.1179        0.270
##   567     16       1   0.1674  0.0371       0.1085        0.258
##   574     15       1   0.1562  0.0362       0.0992        0.246
##   583     14       1   0.1451  0.0353       0.0900        0.234
##   613     13       1   0.1339  0.0343       0.0810        0.221
##   624     12       1   0.1228  0.0332       0.0722        0.209
##   643     11       1   0.1116  0.0320       0.0636        0.196
##   655     10       1   0.1004  0.0307       0.0552        0.183
##   689      9       1   0.0893  0.0293       0.0470        0.170
##   707      8       1   0.0781  0.0276       0.0390        0.156
##   791      7       1   0.0670  0.0259       0.0314        0.143
##   814      5       1   0.0536  0.0239       0.0223        0.128
##   883      3       1   0.0357  0.0216       0.0109        0.117
## 
##                 sex=2 
##  time n.risk n.event survival std.err lower 95% CI upper 95% CI
##     5     90       1   0.9889  0.0110       0.9675        1.000
##    60     89       1   0.9778  0.0155       0.9478        1.000
##    61     88       1   0.9667  0.0189       0.9303        1.000
##    62     87       1   0.9556  0.0217       0.9139        0.999
##    79     86       1   0.9444  0.0241       0.8983        0.993
##    81     85       1   0.9333  0.0263       0.8832        0.986
##    95     83       1   0.9221  0.0283       0.8683        0.979
##   107     81       1   0.9107  0.0301       0.8535        0.972
##   122     80       1   0.8993  0.0318       0.8390        0.964
##   145     79       2   0.8766  0.0349       0.8108        0.948
##   153     77       1   0.8652  0.0362       0.7970        0.939
##   166     76       1   0.8538  0.0375       0.7834        0.931
##   167     75       1   0.8424  0.0387       0.7699        0.922
##   182     71       1   0.8305  0.0399       0.7559        0.913
##   186     70       1   0.8187  0.0411       0.7420        0.903
##   194     68       1   0.8066  0.0422       0.7280        0.894
##   199     67       1   0.7946  0.0432       0.7142        0.884
##   201     66       2   0.7705  0.0452       0.6869        0.864
##   208     62       1   0.7581  0.0461       0.6729        0.854
##   226     59       1   0.7452  0.0471       0.6584        0.843
##   239     57       1   0.7322  0.0480       0.6438        0.833
##   245     54       1   0.7186  0.0490       0.6287        0.821
##   268     51       1   0.7045  0.0501       0.6129        0.810
##   285     47       1   0.6895  0.0512       0.5962        0.798
##   293     45       1   0.6742  0.0523       0.5791        0.785
##   305     43       1   0.6585  0.0534       0.5618        0.772
##   310     42       1   0.6428  0.0544       0.5447        0.759
##   340     39       1   0.6264  0.0554       0.5267        0.745
##   345     38       1   0.6099  0.0563       0.5089        0.731
##   348     37       1   0.5934  0.0572       0.4913        0.717
##   350     36       1   0.5769  0.0579       0.4739        0.702
##   351     35       1   0.5604  0.0586       0.4566        0.688
##   361     33       1   0.5434  0.0592       0.4390        0.673
##   363     32       1   0.5265  0.0597       0.4215        0.658
##   371     30       1   0.5089  0.0603       0.4035        0.642
##   426     26       1   0.4893  0.0610       0.3832        0.625
##   433     25       1   0.4698  0.0617       0.3632        0.608
##   444     24       1   0.4502  0.0621       0.3435        0.590
##   450     23       1   0.4306  0.0624       0.3241        0.572
##   473     22       1   0.4110  0.0626       0.3050        0.554
##   520     19       1   0.3894  0.0629       0.2837        0.534
##   524     18       1   0.3678  0.0630       0.2628        0.515
##   550     15       1   0.3433  0.0634       0.2390        0.493
##   641     11       1   0.3121  0.0649       0.2076        0.469
##   654     10       1   0.2808  0.0655       0.1778        0.443
##   687      9       1   0.2496  0.0652       0.1496        0.417
##   705      8       1   0.2184  0.0641       0.1229        0.388
##   728      7       1   0.1872  0.0621       0.0978        0.359
##   731      6       1   0.1560  0.0590       0.0743        0.328
##   735      5       1   0.1248  0.0549       0.0527        0.295
##   765      3       1   0.0832  0.0499       0.0257        0.270

It’s also possible to use the function surv_summary() [in survminer package] to get a summary of survival curves. Compared to the default summary() function, surv_summary() creates a data frame containing a nice summary from survfit results.

res.sum <- surv_summary(fit)

## Warning in .get_data(x, data = data): The `data` argument is not provided. Data
## will be extracted from model fit.

head(res.sum)

The function surv_summary() returns a data frame with the following columns:

time: the time points at which the curve has a step.
n.risk: the number of subjects at risk at t.
n.event: the number of events that occur at time t.
n.censor: number of censored events.
surv: estimate of survival probability.
std.err: standard error of survival.
upper: upper end of confidence interval
lower: lower end of confidence interval
strata: indicates stratification of curve estimation. The levels of strata (a factor) are the labels for the curves.

In a situation, where survival curves have been fitted with one or more variables, surv_summary object contains extra columns representing the variables. This makes it possible to facet the output of ggsurvplot by strata or by some combinations of factors.

surv_summary object has also an attribute named ‘table’ containing information about the survival curves, including medians of survival with confidence intervals, as well as, the total number of subjects and the number of event in each curve. To get access to the attribute ‘table’, type this:

attr(res.sum, "table")

Log-Rank test comparing survival curves: survdiff()

The log-rank test is the most widely used method of comparing two or more survival curves. The null hypothesis is that there is no difference in survival between the two groups. The log rank test is a non-parametric test, which makes no assumptions about the survival distributions. Essentially, the log rank test compares the observed number of events in each group to what would be expected if the null hypothesis were true (i.e., if the survival curves were identical). The log rank statistic is approximately distributed as a chi-square test statistic.

The function survdiff() [in survival package] can be used to compute log-rank test comparing two or more survival curves.

survdiff() can be used as follow:

surv_diff <- survdiff(Surv(time, status) ~ sex, data = lung)
surv_diff

## Call:
## survdiff(formula = Surv(time, status) ~ sex, data = lung)
## 
##         N Observed Expected (O-E)^2/E (O-E)^2/V
## sex=1 138      112     91.6      4.55      10.3
## sex=2  90       53     73.4      5.68      10.3
## 
##  Chisq= 10.3  on 1 degrees of freedom, p= 0.001

The function returns a list of components, including:

n: the number of subjects in each group.
obs: the weighted observed number of events in each group.
exp: the weighted expected number of events in each group.
chisq: the chisquare statistic for a test of equality.
strata: optionally, the number of subjects contained in each stratum.

The log rank test for difference in survival gives a p-value of p = 0.0013, indicating that the sex groups differ significantly in survival.

Fit complex survival curves In this section, we’ll compute survival curves using the combination of multiple factors. Next, we’ll facet the output of ggsurvplot() by a combination of factors

Fit (complex) survival curves using colon data sets

require("survival")
fit2 <- survfit( Surv(time, status) ~ sex + rx + adhere,
                data = colon )

Visualize the output using survminer. The plot below shows survival curves by the sex variable faceted according to the values of rx & adhere.

# Plot survival curves by sex and facet by rx and adhere
ggsurv <- ggsurvplot(fit2, fun = "event", conf.int = TRUE,
                     ggtheme = theme_bw())
   
ggsurv$plot +theme_bw() + 
  theme (legend.position = "right")+
  facet_grid(rx ~ adhere)

Summary

Survival analysis is a set of statistical approaches for data analysis where the outcome variable of interest is time until an event occurs.

Survival data are generally described and modeled in terms of two related functions:

the survivor function representing the probability that an individual survives from the time of origin to some time beyond time t. It’s usually estimated by the Kaplan-Meier method. The logrank test may be used to test for differences between survival curves for groups, such as treatment arms.
The hazard function gives the instantaneous potential of having an event at a time, given survival up to that time. It is used primarily as a diagnostic tool or for specifying a mathematical model for survival analysis.

In this article, we demonstrate how to perform and visualize survival analyses using the combination of two R packages: survival (for the analysis) and survminer (for the visualization).

References

Clark TG, Bradburn MJ, Love SB and Altman DG. Survival Analysis Part I: Basic concepts and first analyses. British Journal of Cancer (2003) 89, 232 – 238
Kaplan EL, Meier P (1958) Nonparametric estimation from incomplete observations. J Am Stat Assoc 53: 457–481.
Pocock S, Clayton TC, Altman DG (2002) Survival plots of time-to-event outcomes in clinical trials: good practice and pitfalls. Lancet 359: 1686– 1689.

Cox Proportional Hazard Model

The Cox proportional-hazards model (Cox, 1972) is essentially a regression model commonly used statistical in medical research for investigating the association between the survival time of patients and one or more predictor variables.

In the previous chapter (survival analysis basics), we described the basic concepts of survival analyses and methods for analyzing and summarizing survival data, including:

the definition of hazard and survival functions,
the construction of Kaplan-Meier survival curves for different patient groups
the logrank test for comparing two or more survival curves

The above mentioned methods - Kaplan-Meier curves and logrank tests - are examples of univariate analysis. They describe the survival according to one factor under investigation, but ignore the impact of any others.

Additionally, Kaplan-Meier curves and logrank tests are useful only when the predictor variable is categorical (e.g.: treatment A vs treatment B; males vs females). They don’t work easily for quantitative predictors such as gene expression, weight, or age.

An alternative method is the Cox proportional hazards regression analysis, which works for both quantitative predictor variables and for categorical variables. Furthermore, the Cox regression model extends survival analysis methods to assess simultaneously the effect of several risk factors on survival time.

In this article, we’ll describe the Cox regression model and provide practical examples using R software.

The need for multivariate statistical modeling

In clinical investigations, there are many situations, where several known quantities (known as covariates), potentially affect patient prognosis.

For instance, suppose two groups of patients are compared: those with and those without a specific genotype. If one of the groups also contains older individuals, any difference in survival may be attributable to genotype or age or indeed both. Hence, when investigating survival in relation to any one factor, it is often desirable to adjust for the impact of others.

Statistical model is a frequently used tool that allows to analyze survival with respect to several factors simultaneously. Additionally, statistical model provides the effect size for each factor.

The cox proportional-hazards model is one of the most important methods used for modelling survival analysis data. The next section introduces the basics of the Cox regression model.

Basics Concepts

The purpose of the model is to evaluate simultaneously the effect of several factors on survival. In other words, it allows us to examine how specified factors influence the rate of a particular event happening (e.g., infection, death) at a particular point in time. This rate is commonly referred as the hazard rate. Predictor variables (or factors) are usually termed covariates in the survival-analysis literature.

The Cox model is expressed by the hazard function denoted by $h(t)$. Briefly, the hazard function can be interpreted as the risk of dying at time $t$. It can be estimated as follow: \[ h(t)= h_0 (t) e^{b_1 x_1+b_2 x_2+...+b_p x_p}\]

where

t represents the survival time
$h(t)$ is the hazard function determined by a set of p covariates ($ x_1,x_2,…,x_p$)
the coefficients ($ b_1,b_2,…,b_p $) measure the impact (i.e. effect sizes) of covariates
the term $h_0$ is called the baseline hazard. It corresponds to the value of the hazard if all $x_i$ are equal to zero (the quantity $e^0=1$). The $t$ in $h(t)$ reminds us that the hazard may vary over time.

The Cox model can be written as a multiple linear regression of the logarithm of the hazard on the variables $x_i$, with the baseline hazard being an ‘intercept’ term that varies with time.

The quantities $e^{bi}$ are called hazard ratios (HR). A value of $b_i$ greater than zero, or equivalently a hazard ratio greater than one, indicates that as the value of the $i^{th}$ covariate increases, the event hazard increases and thus the length of survival decreases.

Put another way, a hazard ratio above 1 indicates a covariate that is positively associated with the event probability, and thus negatively associated with the length of survival.

In summary,

HR = 1: No effect
HR < 1: Reduction in the hazard
HR > 1: Increase in Hazard

Note that in cancer studies:

A covariate with hazard ratio > 1 (i.e.: b > 0) is called bad prognostic factor
A covariate with hazard ratio < 1 (i.e.: b < 0) is called good prognostic factor

A key assumption of the Cox model is that the hazard curves for the groups of observations (or patients) should be proportional and cannot cross.

Consider two patients k and k’ that differ in their x-values. The corresponding hazard function can be simply written as follow

Hazard function for the patient $k$: \[ h_k (t)= h_0 (t) e^{\sum_{i=1}^{n} \beta x}\]
Hazard function for the patient $k’$: \[ h_{k'} (t)=h_0(t) e^{\sum_{i=1}^{n} \beta x'}\]
The hazard ratio for these two patients \[ \frac{h_k (t)}{h_{k'} (t)} = \frac{e^{\sum_{i=1}^{n} \beta x}}{e^{\sum_{i=1}^{n} \beta x'}}\]

Consequently, the Cox model is a proportional-hazards model: the hazard of the event in any group is a constant multiple of the hazard in any other. This assumption implies that, as mentioned above, the hazard curves for the groups should be proportional and cannot cross.

In other words, if an individual has a risk of death at some initial time point that is twice as high as that of another individual, then at all later times the risk of death remains twice as high.

This assumption of proportional hazards should be tested. We’ll discuss methods for assessing proportionality in the next article in this series: Cox Model Assumptions.

Compute the Cox model

We’ll use two R packages:

survival for computing survival analyses
survminer for visualizing survival analysis results

R function to compute the Cox model: coxph()

The function coxph()[in survival package] can be used to compute the Cox proportional hazards regression model in R.

The simplified format is as follow:

coxph(formula, data, method)

formula: is linear model with a survival object as the response variable. Survival object is created using the function Surv() as - follow: Surv(time, event).
data: a data frame containing the variables
method: is used to specify how to handle ties. The default is ‘efron’. Other options are ‘breslow’ and ‘exact’. The default ‘efron’ is generally preferred to the once-popular “breslow” method. The “exact” method is much more computationally intensive.

Example data sets

We’ll use the lung cancer data in the survival R package.

data("lung")
head(lung)

inst: Institution code
time: Survival time in days
status: censoring status 1=censored, 2=dead
age: Age in years
sex: Male=1 Female=2
ph.ecog: ECOG performance score (0=good 5=dead)
ph.karno: Karnofsky performance score (bad=0-good=100) rated by physician
pat.karno: Karnofsky performance score as rated by patient
meal.cal: Calories consumed at meals
wt.loss: Weight loss in last six months

Compute the Cox model

We’ll fit the Cox regression using the following covariates: age, sex, ph.ecog and wt.loss.

We start by computing univariate Cox analyses for all these variables; then we’ll fit multivariate cox analyses using two variables to describe how the factors jointly impact on survival.

Univariate Cox regression

Univariate Cox analyses can be computed as follow:

res.cox <- coxph(Surv(time, status) ~ sex, data = lung)
res.cox

## Call:
## coxph(formula = Surv(time, status) ~ sex, data = lung)
## 
##        coef exp(coef) se(coef)      z       p
## sex -0.5310    0.5880   0.1672 -3.176 0.00149
## 
## Likelihood ratio test=10.63  on 1 df, p=0.001111
## n= 228, number of events= 165

The function summary() for Cox models produces a more complete report:

summary(res.cox)

## Call:
## coxph(formula = Surv(time, status) ~ sex, data = lung)
## 
##   n= 228, number of events= 165 
## 
##        coef exp(coef) se(coef)      z Pr(>|z|)   
## sex -0.5310    0.5880   0.1672 -3.176  0.00149 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##     exp(coef) exp(-coef) lower .95 upper .95
## sex     0.588      1.701    0.4237     0.816
## 
## Concordance= 0.579  (se = 0.021 )
## Likelihood ratio test= 10.63  on 1 df,   p=0.001
## Wald test            = 10.09  on 1 df,   p=0.001
## Score (logrank) test = 10.33  on 1 df,   p=0.001

The Cox regression results can be interpreted as follow:

Statistical significance. The column marked “z” gives the Wald statistic value. It corresponds to the ratio of each regression coefficient to its standard error (z = coef/se(coef)). The wald statistic evaluates, whether the beta (β) coefficient of a given variable is statistically significantly different from 0. From the output above, we can conclude that the variable sex have highly statistically significant coefficients.
The regression coefficients. The second feature to note in the Cox model results is the the sign of the regression coefficients (coef). A positive sign means that the hazard (risk of death) is higher, and thus the prognosis worse, for subjects with higher values of that variable. The variable sex is encoded as a numeric vector. 1: male, 2: female. The R summary for the Cox model gives the hazard ratio (HR) for the second group relative to the first group, that is, female versus male. The beta coefficient for sex = -0.53 indicates that females have lower risk of death (lower survival rates) than males, in these data.
Hazard ratios. The exponentiated coefficients (exp(coef) = exp(-0.53) = 0.59), also known as hazard ratios, give the effect size of covariates. For example, being female (sex=2) reduces the hazard by a factor of 0.59, or 41%. Being female is associated with good prognostic.
Confidence intervals of the hazard ratios. The summary output also gives upper and lower 95% confidence intervals for the hazard ratio (exp(coef)), lower 95% bound = 0.4237, upper 95% bound = 0.816.
Global statistical significance of the model. Finally, the output gives p-values for three alternative tests for overall significance of the model: The likelihood-ratio test, Wald test, and score logrank statistics. These three methods are asymptotically equivalent. For large enough N, they will give similar results. For small N, they may differ somewhat. The Likelihood ratio test has better behavior for small sample sizes, so it is generally preferred.

To apply the univariate coxph function to multiple covariates at once, type this:

covariates <- c("age", "sex",  "ph.karno", "ph.ecog", "wt.loss")
univ_formulas <- sapply(covariates,
                        function(x) as.formula(paste('Surv(time, status)~', x)))
                        
univ_models <- lapply( univ_formulas, function(x){coxph(x, data = lung)})
# Extract data 
univ_results <- lapply(univ_models,
                       function(x){ 
                          x <- summary(x)
                          p.value<-signif(x$wald["pvalue"], digits=2)
                          wald.test<-signif(x$wald["test"], digits=2)
                          beta<-signif(x$coef[1], digits=2);#coeficient beta
                          HR <-signif(x$coef[2], digits=2);#exp(beta)
                          HR.confint.lower <- signif(x$conf.int[,"lower .95"], 2)
                          HR.confint.upper <- signif(x$conf.int[,"upper .95"],2)
                          HR <- paste0(HR, " (", 
                                       HR.confint.lower, "-", HR.confint.upper, ")")
                          res<-c(beta, HR, wald.test, p.value)
                          names(res)<-c("beta", "HR (95% CI for HR)", "wald.test", 
                                        "p.value")
                          return(res)
                          #return(exp(cbind(coef(x),confint(x))))
                         })
res <- t(as.data.frame(univ_results, check.names = FALSE))
as.data.frame(res)

The output above shows the regression beta coefficients, the effect sizes (given as hazard ratios) and statistical significance for each of the variables in relation to overall survival. Each factor is assessed through separate univariate Cox regressions.

From the output above,

The variables sex, age and ph.ecog have highly statistically significant coefficients, while the coefficient for ph.karno is not significant.
age and ph.ecog have positive beta coefficients, while sex has a negative coefficient. Thus, older age and higher ph.ecog are associated with poorer survival, whereas being female (sex=2) is associated with better survival.

Now, we want to describe how the factors jointly impact on survival. To answer to this question, we’ll perform a multivariate Cox regression analysis. As the variable ph.karno is not significant in the univariate Cox analysis, we’ll skip it in the multivariate analysis. We’ll include the 3 factors (sex, age and ph.ecog) into the multivariate model.

Multivariate Cox regression analysis

A Cox regression of time to death on the time-constant covariates is specified as follow:

# Data preparation and computing cox model
res.cox <- coxph(Surv(time, status) ~ age + sex + ph.ecog, data =  lung)
summary(res.cox)

## Call:
## coxph(formula = Surv(time, status) ~ age + sex + ph.ecog, data = lung)
## 
##   n= 227, number of events= 164 
##    (1 observation deleted due to missingness)
## 
##              coef exp(coef)  se(coef)      z Pr(>|z|)    
## age      0.011067  1.011128  0.009267  1.194 0.232416    
## sex     -0.552612  0.575445  0.167739 -3.294 0.000986 ***
## ph.ecog  0.463728  1.589991  0.113577  4.083 4.45e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##         exp(coef) exp(-coef) lower .95 upper .95
## age        1.0111     0.9890    0.9929    1.0297
## sex        0.5754     1.7378    0.4142    0.7994
## ph.ecog    1.5900     0.6289    1.2727    1.9864
## 
## Concordance= 0.637  (se = 0.025 )
## Likelihood ratio test= 30.5  on 3 df,   p=1e-06
## Wald test            = 29.93  on 3 df,   p=1e-06
## Score (logrank) test = 30.5  on 3 df,   p=1e-06

The p-value for all three overall tests (likelihood, Wald, and score) are significant, indicating that the model is significant. These tests evaluate the omnibus null hypothesis that all of the betas (β) are 0. In the above example, the test statistics are in close agreement, and the omnibus null hypothesis is soundly rejected.

In the multivariate Cox analysis, the covariates sex and ph.ecog remain significant (p < 0.05). However, the covariate age fails to be significant (p = 0.23, which is grater than 0.05).

The p-value for sex is 0.000986, with a hazard ratio HR = exp(coef) = 0.58, indicating a strong relationship between the patients’ sex and decreased risk of death. The hazard ratios of covariates are interpretable as multiplicative effects on the hazard. For example, holding the other covariates constant, being female (sex=2) reduces the hazard by a factor of 0.58, or 42%. We conclude that, being female is associated with good prognostic.

Similarly, the p-value for ph.ecog is 4.45e-05, with a hazard ratio HR = 1.59, indicating a strong relationship between the ph.ecog value and increased risk of death. Holding the other covariates constant, a higher value of ph.ecog is associated with a poor survival.

By contrast, the p-value for age is now p=0.23. The hazard ratio HR = exp(coef) = 1.01, with a 95% confidence interval of 0.99 to 1.03. Because the confidence interval for HR includes 1, these results indicate that age makes a smaller contribution to the difference in the HR after adjusting for the ph.ecog values and patient’s sex, and only trend toward significance. For example, holding the other covariates constant, an additional year of age induce daily hazard of death by a factor of exp(beta) = 1.01, or 1%, which is not a significant contribution.

Visualizing the estimated distribution of survival times

Having fit a Cox model to the data, it’s possible to visualize the predicted survival proportion at any given point in time for a particular risk group. The function survfit() estimates the survival proportion, by default at the mean values of covariates.

We may wish to display how estimated survival depends upon the value of a covariate of interest.

Consider that, we want to assess the impact of the sex on the estimated survival probability. In this case, we construct a new data frame with two rows, one for each value of sex; the other covariates are fixed to their average values (if they are continuous variables) or to their lowest level (if they are discrete variables). For a dummy covariate, the average value is the proportion coded 1 in the data set. This data frame is passed to survfit() via the newdata argument:

Summary

In this article, we described the Cox regression model for assessing simultaneously the relationship between multiple risk factors and patient’s survival time. We demonstrated how to compute the Cox model using the survival package. Additionally, we described how to visualize the results of the analysis using the survminer package.

References

Cox DR (1972). Regression models and life tables (with discussion). J R Statist Soc B 34: 187–220
MJ Bradburn, TG Clark, SB Love and DG Altman. Survival Analysis Part II: Multivariate data analysis – an introduction to concepts and methods. British Journal of Cancer (2003) 89, 431 – 436.

Cox Model Assumptions

Previously, we described the basic methods for analyzing survival data, as well as, the Cox proportional hazards methods to deal with the situation where several factors impact on the survival process.

In the current article, we continue the series by describing methods to evaluate the validity of the Cox model assumptions.

Note that, when used inappropriately, statistical models may give rise to misleading conclusions. Therefore, it’s important to check that a given model is an appropriate representation of the data.

Diagnostics for the Cox model

The Cox proportional hazards model makes sevral assumptions. Thus, it is important to assess whether a fitted Cox regression model adequately describes the data.

Here, we’ll disscuss three types of diagonostics for the Cox model:

Testing the proportional hazards assumption.
Examining influential observations (or outliers).
Detecting nonlinearity in relationship between the log hazard and the covariates.

In order to check these model assumptions, Residuals method are used. The common residuals for the Cox model include:

Schoenfeld residuals to check the proportional hazards assumption
Martingale residual to assess nonlinearity
Deviance residual (symmetric transformation of the Martinguale residuals), to examine influential observations

Validy of a Cox model

Computing the Cox Model

We’ll use the lung data sets and the coxph() function in the survival package.

To compute a Cox model, type this:

res.cox <- coxph(Surv(time, status) ~ age + sex + wt.loss, data =  lung)
res.cox

## Call:
## coxph(formula = Surv(time, status) ~ age + sex + wt.loss, data = lung)
## 
##               coef  exp(coef)   se(coef)      z      p
## age      0.0200882  1.0202913  0.0096644  2.079 0.0377
## sex     -0.5210319  0.5939074  0.1743541 -2.988 0.0028
## wt.loss  0.0007596  1.0007599  0.0061934  0.123 0.9024
## 
## Likelihood ratio test=14.67  on 3 df, p=0.002122
## n= 214, number of events= 152 
##    (14 observations deleted due to missingness)

Testing proportional Hazards assumption

The proportional hazards (PH) assumption can be checked using statistical tests and graphical diagnostics based on the scaled Schoenfeld residuals.

In principle, the Schoenfeld residuals are independent of time. A plot that shows a non-random pattern against time is evidence of violation of the PH assumption.

The function cox.zph() [in the survival package] provides a convenient solution to test the proportional hazards assumption for each covariate included in a Cox refression model fit.

For each covariate, the function cox.zph() correlates the corresponding set of scaled Schoenfeld residuals with time, to test for independence between residuals and time. Additionally, it performs a global test for the model as a whole.

The proportional hazard assumption is supported by a non-significant relationship between residuals and time, and refuted by a significant relationship.

To illustrate the test, we start by computing a Cox regression model using the lung data set [in survival package]:

res.cox <- coxph(Surv(time, status) ~ age + sex + wt.loss, data =  lung)
res.cox

## Call:
## coxph(formula = Surv(time, status) ~ age + sex + wt.loss, data = lung)
## 
##               coef  exp(coef)   se(coef)      z      p
## age      0.0200882  1.0202913  0.0096644  2.079 0.0377
## sex     -0.5210319  0.5939074  0.1743541 -2.988 0.0028
## wt.loss  0.0007596  1.0007599  0.0061934  0.123 0.9024
## 
## Likelihood ratio test=14.67  on 3 df, p=0.002122
## n= 214, number of events= 152 
##    (14 observations deleted due to missingness)

To test for the proportional-hazards (PH) assumption, type this:

test.ph <- cox.zph(res.cox)
test.ph

##          chisq df    p
## age     0.5077  1 0.48
## sex     2.5489  1 0.11
## wt.loss 0.0144  1 0.90
## GLOBAL  3.0051  3 0.39

From the output above, the test is not statistically significant for each of the covariates, and the global test is also not statistically significant. Therefore, we can assume the proportional hazards.

It’s possible to do a graphical diagnostic using the function ggcoxzph() [in the survminer package], which produces, for each covariate, graphs of the scaled Schoenfeld residuals against the transformed time.

ggcoxzph(test.ph)

In the figure above, the solid line is a smoothing spline fit to the plot, with the dashed lines representing a +/- 2-standard-error band around the fit.

Note that, systematic departures from a horizontal line are indicative of non-proportional hazards, since proportional hazards assumes that estimates $β_1,β_2,β_3$ do not vary much over time.

From the graphical inspection, there is no pattern with time. The assumption of proportional hazards appears to be supported for the covariates sex (which is, recall, a two-level factor, accounting for the two bands in the graph), wt.loss and age.

Another graphical methods for checking proportional hazards is to plot log(-log(S(t))) vs. t or log(t) and look for parallelism. This can be done only for categorical covariates.

A violations of proportional hazards assumption can be resolved by:

Adding covariate*time interaction Stratification Stratification is usefull for “nuisance” confounders, where you do not care to estimate the effect. You cannot examine the effects of the stratification variable (John Fox & Sanford Weisberg).

To read more about how to accomodate with non-proportional hazards, read the following articles:

Jadwiga Borucka, PAREXEL, Warsaw, Poland. Extensions of cox model for non-proportional hazards purpose. 2013.
John Fox & Sanford Weisberg. Cox Proportional-Hazards Regression for Survival Data in R.
Max Gordon. Dealing with non-proportional hazards in R. March 29, 2016.

Testing influential observations

To test influential observations or outliers, we can visualize either:

the deviance residuals or the dfbeta values The function ggcoxdiagnostics()[in survminer package] provides a convenient solution for checkind influential observations. The simplified format is as follow:

ggcoxdiagnostics(fit, type = , linear.predictions = TRUE)

fit: an object of class coxph.object
type: the type of residuals to present on Y axis. Allowed values include one of c(“martingale”, “deviance”, “score”, “schoenfeld”, “dfbeta”, “dfbetas”, “scaledsch”, “partial”).
linear.predictions: a logical value indicating whether to show linear predictions for observations (TRUE) or just indexed of observations (FALSE) on X axis.

Specifying the argument type = “dfbeta”, plots the estimated changes in the regression coefficients upon deleting each observation in turn; likewise, type=“dfbetas” produces the estimated changes in the coefficients divided by their standard errors.

For example:

ggcoxdiagnostics(res.cox, type = "dfbeta",
                 linear.predictions = FALSE, ggtheme = theme_bw())

## `geom_smooth()` using formula 'y ~ x'

(Index plots of dfbeta for the Cox regression of time to death on age, sex and wt.loss)

The above index plots show that comparing the magnitudes of the largest dfbeta values to the regression coefficients suggests that none of the observations is terribly influential individually, even though some of the dfbeta values for age and wt.loss are large compared with the others.

It’s also possible to check outliers by visualizing the deviance residuals. The deviance residual is a normalized transform of the martingale residual. These residuals should be roughtly symmetrically distributed about zero with a standard deviation of 1.

Positive values correspond to individuals that “died too soon” compared to expected survival times.
Negative values correspond to individual that “lived too long”.
Very large or small values are outliers, which are poorly predicted by the model.

Example of deviance residuals:

ggcoxdiagnostics(res.cox, type = "deviance",
                 linear.predictions = FALSE, ggtheme = theme_bw())

## `geom_smooth()` using formula 'y ~ x'

Testing non linearity

Often, we assume that continuous covariates have a linear form. However, this assumption should be checked.

Plotting the Martingale residuals against continuous covariates is a common approach used to detect nonlinearity or, in other words, to assess the functional form of a covariate. For a given continuous covariate, patterns in the plot may suggest that the variable is not properly fit.

Nonlinearity is not an issue for categorical variables, so we only examine plots of martingale residuals and partial residuals against a continuous variable.

Martingale residuals may present any value in the range (-INF, +1):

A value of martinguale residuals near 1 represents individuals that “died too soon”, and large negative values correspond to individuals that “lived too long”.
To assess the functional form of a continuous variable in a Cox proportional hazards model, we’ll use the function ggcoxfunctional() [in the survminer R package].

The function ggcoxfunctional() displays graphs of continuous covariates against martingale residuals of null cox proportional hazards model. This might help to properly choose the functional form of continuous variable in the Cox model. Fitted lines with lowess function should be linear to satisfy the Cox proportional hazards model assumptions.

For example, to assess the functional forme of age, type this:

ggcoxfunctional(Surv(time, status) ~ age + log(age) + sqrt(age), data = lung)

## Warning: arguments formula is deprecated; will be removed in the next version;
## please use fit instead.

It appears that, nonlinearity is slightly here.

Summary

We described how to assess the valididy of the Cox model assumptions using the survival and survminer packages.

Survival Analysis

Roel Ceballos

4/7/2020

Basics of Survival Analysis

Objectives

Basic Concepts

Survival time & type of events

Censoring

Survival & hazard functions

Kaplan-Meier estimate

Survival Analysis in R

Cox Proportional Hazard Model

Basics Concepts

Compute the Cox model

Cox Model Assumptions

Validy of a Cox model

Testing influential observations

Testing non linearity