library(dplyr)
library(survival)
library(ggsurvfit)

Tutorial from online YouTube video collection: https://www.youtube.com/watch?v=vX3l36ptrTU&list=PLqzoL9-eJTNDdnKvep_YHIwk2AMqHhuJ0&index=1

For an online version of this rmarkdown file, please check: https://rpubs.com/auraf285/SurvAnalysisR

For the RMarkdown file, see https://github.com/AuraFrizzati/Tutorials/tree/main/Cox%20regression

1 What is censoring?

Survival analysis models time-to-event as an outcome (e.g. time to death, how long does one wait on a waiting list for surgery, etc).

The outcome \(y\) has two parts:
- Time = T (time to event or follow-up)
- Has the event (E) occurred? Yes/No

\[y = Surv(T,E)\]

  • Censoring: has the event occurred or not occurred (= the event was censored) within the follow-up period. The concept of censoring makes survival analysis unique. Individuals might also be lost at followed-up.

  • Censored observations get included into a survival model

  • Right-censoring vs left-censoring (we don’t know what happened to the subject before the study)

  • All survival models assume that censoring is non-informative: being censored or not is not related to the probability of the event occurring. (e.g., measuring time-to-death, patients lost at follow-up are those that have started to feel much better)

2 Survival function, hazard & hazard ratio

  • Survival function: probability of surviving beyond time \(t\) \[S(t) = P(T > t)\]

  • Hazard: probability of getting the event in the next few seconds \(t + \delta\) given the event is not present now \(t\). \[H(t) = P(T < t + \delta | T > t)\]

  • Hazard ratio (HR): this is the hazard for the exposure group relative to the hazard for the non-exposure group

\[\frac{H_{x = 1}}{H_{x = 0}}\]

If HR = 2, for example, at any given time, someone who is exposed has twice the risk of dying than someone who is not exposed.

3 Survival models

  • Kaplan-Meier (K-M) survival model: non-parametric
    • Pro:
      • Simple to interpret
      • You can estimate the survival beyond a specific time
      • The hazard rate can increase or decrease and not necessarily proportionally with time
    • Con:
      • No functional mathematical form to describe the K-M curve (every single step of the curve needs to be described)
      • You cannot estimate a HR (because there isn’t a constant rate of decrease)
      • It can only include few and categorical independent variables (\(Xs\))


  • Exponential survival model: parametric (the survival function is modelled as a negative exponential curve). The hazard represents the rate of decrease of the curve
    • Pro:
      • We can estimate \(S(t)\) and HR
      • It has a mathematical function to describe survival: \(S(t) = e^{-Haz \times t}\) (the function has one parameter, the hazard)
    • Con:
      • Not always realistic because it assumes a constant hazard (i.e. rate of decrease of survival curve is constant. E.g. compare the human risk of dying while ageing)
      • As an alternative, the Weibull model allows the hazard to increase/decrease proportionally with time. However,this is still not often realistic (e.g. death risk after surgery: high risk immediately after surgery and then gradually decreasing)


  • Cox Proportional-Hazard (P-H) model: semi-parametric (it is a sort of combination of the other two models)
    • Pro:
      • The hazard can fluctuate with time (like K-M survival model)
      • You can estimate the HR
    • Con:
      • You cannot estimate the survival function

4 Kaplan-Meier model (also called Product-limit method)

Load the dataset

  • censor = 1 –> death
  • censor = 0 –> censored
#install.packages("Bolstad2")
library(Bolstad2)
data(AidsSurvival.df)
head(AidsSurvival.df)
##   id   enter     end time age drug censor
## 1  1 15may90 14oct90    5  46    0      1
## 2  2 19sep89 20mar90    6  35    1      0
## 3  3 21apr91 20dec91    8  30    1      1
## 4  4 03jan91 04apr91    3  30    1      1
## 5  5 18sep89 19jul91   22  36    0      1
## 6  6 18mar91 17apr91    1  32    1      0

Load the survival library (already installed in Base R)

library(survival)

Fit the KM (survival) model to the data

km.model <- survfit(
  Surv(time,censor) ~ 1, # ~ 1 is used when we do not have any predictor for survival
  type = "kaplan-meier", ## default type
  data = AidsSurvival.df)

km.model
## Call: survfit(formula = Surv(time, censor) ~ 1, data = AidsSurvival.df, 
##     type = "kaplan-meier")
## 
##        n events median 0.95LCL 0.95UCL
## [1,] 100     80      7       5      10

The median survival time is 7 months

summary(km.model)
## Call: survfit(formula = Surv(time, censor) ~ 1, data = AidsSurvival.df, 
##     type = "kaplan-meier")
## 
##  time n.risk n.event survival std.err lower 95% CI upper 95% CI
##     1    100      15   0.8500  0.0357       0.7828        0.923
##     2     83       5   0.7988  0.0402       0.7237        0.882
##     3     73      10   0.6894  0.0473       0.6026        0.789
##     4     61       4   0.6442  0.0493       0.5544        0.748
##     5     56       7   0.5636  0.0517       0.4709        0.675
##     6     49       2   0.5406  0.0521       0.4476        0.653
##     7     46       6   0.4701  0.0526       0.3775        0.586
##     8     39       4   0.4219  0.0525       0.3306        0.538
##     9     35       3   0.3857  0.0520       0.2962        0.502
##    10     32       3   0.3496  0.0511       0.2625        0.466
##    11     28       3   0.3121  0.0500       0.2280        0.427
##    12     25       2   0.2872  0.0490       0.2055        0.401
##    13     21       1   0.2735  0.0486       0.1931        0.387
##    14     20       1   0.2598  0.0480       0.1809        0.373
##    15     19       2   0.2325  0.0467       0.1568        0.345
##    22     16       1   0.2179  0.0460       0.1441        0.330
##    30     14       1   0.2024  0.0453       0.1305        0.314
##    31     13       1   0.1868  0.0444       0.1173        0.298
##    32     12       1   0.1712  0.0433       0.1043        0.281
##    34     11       1   0.1557  0.0421       0.0916        0.264
##    35     10       1   0.1401  0.0407       0.0793        0.247
##    36      9       1   0.1245  0.0390       0.0674        0.230
##    43      8       1   0.1090  0.0371       0.0559        0.212
##    53      7       1   0.0934  0.0349       0.0449        0.194
##    54      6       1   0.0778  0.0324       0.0344        0.176
##    57      4       1   0.0584  0.0296       0.0216        0.157
##    58      3       1   0.0389  0.0253       0.0109        0.139

Model plot (the red line indicates the median survival time on the curve)

plot(km.model, 
     xlab ="Time (months)", 
     ylab = "% Alive = S(t)", 
     main = "KM-Model", 
     las = 1, ## rotate y-axis labels
     mark.time = TRUE ## this allows to visualise at what time were the observations censored
     )
abline(h = 0.5, col= "red")

Adding a predictor (drug = 1 or 0)

km.model2 <- survfit(
  Surv(time,censor) ~ drug, # ~ 1 is used when we do not have any predictor for survival
  type = "kaplan-meier", ## default type
  data = AidsSurvival.df)

km.model2
## Call: survfit(formula = Surv(time, censor) ~ drug, data = AidsSurvival.df, 
##     type = "kaplan-meier")
## 
##         n events median 0.95LCL 0.95UCL
## drug=0 51     42     11       8      30
## drug=1 49     38      5       3       7
summary(km.model2)
## Call: survfit(formula = Surv(time, censor) ~ drug, data = AidsSurvival.df, 
##     type = "kaplan-meier")
## 
##                 drug=0 
##  time n.risk n.event survival std.err lower 95% CI upper 95% CI
##     1     51       5   0.9020  0.0416       0.8239        0.987
##     2     46       3   0.8431  0.0509       0.7490        0.949
##     3     41       2   0.8020  0.0561       0.6992        0.920
##     4     38       2   0.7598  0.0606       0.6498        0.888
##     5     35       3   0.6947  0.0660       0.5766        0.837
##     6     32       1   0.6730  0.0675       0.5529        0.819
##     7     31       1   0.6513  0.0687       0.5296        0.801
##     8     30       2   0.6078  0.0706       0.4840        0.763
##     9     28       2   0.5644  0.0720       0.4396        0.725
##    10     26       2   0.5210  0.0727       0.3964        0.685
##    11     23       2   0.4757  0.0731       0.3520        0.643
##    12     21       2   0.4304  0.0728       0.3090        0.600
##    13     19       1   0.4077  0.0724       0.2879        0.577
##    14     18       1   0.3851  0.0718       0.2672        0.555
##    15     17       1   0.3624  0.0711       0.2468        0.532
##    22     15       1   0.3383  0.0703       0.2250        0.508
##    30     13       1   0.3123  0.0696       0.2018        0.483
##    31     12       1   0.2862  0.0685       0.1791        0.457
##    32     11       1   0.2602  0.0670       0.1571        0.431
##    34     10       1   0.2342  0.0652       0.1357        0.404
##    35      9       1   0.2082  0.0629       0.1151        0.376
##    36      8       1   0.1821  0.0602       0.0953        0.348
##    43      7       1   0.1561  0.0569       0.0764        0.319
##    53      6       1   0.1301  0.0531       0.0585        0.289
##    54      5       1   0.1041  0.0484       0.0418        0.259
##    57      4       1   0.0781  0.0427       0.0267        0.228
##    58      3       1   0.0520  0.0355       0.0136        0.198
## 
##                 drug=1 
##  time n.risk n.event survival std.err lower 95% CI upper 95% CI
##     1     49      10   0.7959  0.0576       0.6907        0.917
##     2     37       2   0.7529  0.0620       0.6407        0.885
##     3     32       8   0.5647  0.0740       0.4367        0.730
##     4     23       2   0.5156  0.0753       0.3872        0.686
##     5     21       4   0.4174  0.0753       0.2931        0.594
##     6     17       1   0.3928  0.0748       0.2705        0.570
##     7     15       5   0.2619  0.0691       0.1562        0.439
##     8      9       2   0.2037  0.0648       0.1092        0.380
##     9      7       1   0.1746  0.0618       0.0873        0.349
##    10      6       1   0.1455  0.0579       0.0667        0.317
##    11      5       1   0.1164  0.0531       0.0476        0.285
##    15      2       1   0.0582  0.0490       0.0112        0.303
plot(km.model2, 
     xlab ="Time (months)", 
     ylab = "% Alive = S(t)", 
     main = "KM-Model", 
     las = 1, ## rotate y-axis labels
     lwd = 2,
     col = c("black", "blue"),
     mark.time = TRUE ## this allows to visualise at what time were the observations censored
     )
abline(h = 0.5, col= "red")
legend(18,0.95, legend=c("DrugNO", "DrugYES"), lty = 1, lwd = 2, bty = "", cex = 0.6, col = c("black", "blue"))

Is there a statistically significant difference between the two groups?

Let’s use the Log-rank test (H0: the survival function for the 2 groups is the same). It can also be used for more than 2 groups

survdiff(
  Surv(time,censor) ~ drug,
  data = AidsSurvival.df
  )
## Call:
## survdiff(formula = Surv(time, censor) ~ drug, data = AidsSurvival.df)
## 
##         N Observed Expected (O-E)^2/E (O-E)^2/V
## drug=0 51       42     54.9      3.02      11.9
## drug=1 49       38     25.1      6.60      11.9
## 
##  Chisq= 11.9  on 1 degrees of freedom, p= 6e-04

There is a statistically sig difference between the survival functions of the 2 groups.

5 Exponential models (regression models for survival)

These are parametric models.

  • Poisson process: events that occur independently over time (used to estimate how many time has an event occurred in a fixed period of time)
    • Rate = \(\lambda = \frac{y}{t}\) (i.e. the rate is the number of occurences over time)
    • The Poisson process gives rise to two different theoretical probability distributions of random variables
      • Poisson distribution: let \(y\) be the number of occurrences of event in time \(t\) (in this case, number of occurrences is the random variable and time is fixed):
        \(y \sim poisson(\lambda)\)
        \(f(Y) = P(Y=k) = \frac{e^{-\lambda}(\lambda^k)}{k!}\), with \(k\) number of occurrences, \(e\) the Euler’s number and \(!\) the factorial function (this distribution has mean = \(\lambda\) and variance = \(\lambda\))
      • Exponential distribution: let \(t\) be the time until the next event occurs (\(y = 1\)) (in this case, time is the random variable): \(f(t) \approx P(T = t) = \lambda e^{-\lambda t}\)
        \(F(t) = P(T \leq t) = 1 - e^{-\lambda t}\) (mean = \(\frac{1}{\lambda}\)) (the way the probability distribution is presented is the formula is probability of not surviving after time \(t\))

The idea of exponential distribution can then be extended to the exponential survival model and further to the Weibull model or the Cox proportional hazard model.

Survival function = \(S(t) = P(T>t) = 1 - P(T \leq t)\) (here the probability is given as surviving after time \(t\)), therefore:

\(S(t) = P(T>t) = 1 - P(T \leq t) =\)
\(= 1 - F(t) = 1 - (1 - e^{-\lambda t}) = e^{-\lambda t}\)

\[S(t) = P(T>t) = e^{-\lambda t}\]

  • Therefore, the survival function is equivalent to flipping the exponential distribution (around median = \(y = 0.5\), obtaining a negative exponential function).
  • The rate of curve decrease, \(\lambda\), is the hazard in the survival model:

\[S(t) = P(T>t) = e^{-\lambda t} = e^{-HAZ \times t}\]

  • To estimate the survival function we can use a regression model to estimate the hazard, using either:
    • another exponential function: \(HAZ = e^{b_0 + b_1X_1 + ... + b_kX_k}\)
    • or a linear function: \(ln(HAZ) = b_0 + b_1X_1 + ... + b_kX_k\) (for the linear model, the hazard rate for each predictor is obtained as \(e^{b_n}\))
  • Alternative models to the survival exponential model (i.e. Weibull model and Cox proportional hazard model) differ in the intercept value \(b_0\)
  • The hazard rate first gets estimated using the predictors and then it is plugged into the survival function to estimate survival

6 Exponential vs Weibull vs Cox Proportional Hazards (How they differ)

Relevant equations so far:

  • To estimate the hazard rate:

    • exponential function: \(HAZ = e^{b_0 + b_1X_1 + ... + b_kX_k}\)

    • or, linear function: \(ln(HAZ) = b_0 + b_1X_1 + ... + b_kX_k\)


  • To estimate the survival function:

\[S(t) = P(T>t) = e^{-\lambda t} = e^{-HAZ \times t}\]


The intercept of the model (\(b_0\)) is what differentiates the three models and it corresponds to the reference log-hazard (for \(t = 0\))

6.1 Exponential survival model

  • \(b_0\) is a constant and doesn’t change
  • Therefore, the hazard is constant in time (no \(t\) explicitly appears in the formulas to estimate the hazard rate)

6.2 Weibull survival model

  • \(b_0\) increases proportionally with time (and the hazard increases with time as well); or \(b_0\) decreases proportionally with time (and the hazard decreases with time as well)

\[b_0 = ln(\alpha)ln(t) + b_0\]

  • if \(\alpha = 1\) –> the model becomes the same as the exponential (no effect of time)
  • if \(\alpha > 1\) –> the hazard increases with time
  • if \(\alpha < 1\) –> the hazard decreases with time

6.3 Cox proportional hazard model

  • \(b_0\) is a function of time (therefore it can fluctuate with time)

\[b_0 = ln(h_0(t))\]

  • \(h_0(t)\) is the hazard function; it is the hazard for the baseline group that is allowed to fluctuate over time.

  • You can estimate all the model’s predictors’ coefficients (\(b_1, b_2, ...,b_k\)) without having to estimate \(h_0(t)\). However, this means we do not get any estimate of the intercept, therefore we cannot estimate neither the hazard rate nor the survival function. Therefore we cannot use this model for predictive purposes (e.g. estimate the probability of surviving beyond a specific time point)

  • If our goal is to estimate only the hazard ratio (HR) between groups (effect size model), we can use Cox proportional hazard model

To summarise:
- Kaplan-Meier model can be used for prediction of survival probability but not for estimating hazard ratios
- Cox proportional hazard model can be used for estimating hazard ratios but not for prediction of survival probability
- Exponential model and Weibull model can be used for both estimating hazard ratios and for prediction of survival probability (although their assummptions are now always realistic)

7 Cox Proportional Hazard (CPH) Model

  • CPH model allows the hazard to change over time
  • It assumes that hazard between groups are proportional, i.e. the HR is constant over time
  • If the proportional hazard assumption is not met, we could include the interaction between the independent variable x time in the model

\[ln(HAZ) = b_0 + b_1X_1 + ... + b_kX_k = ln(h_0(t)) + b_1X_1 + ... + b_kX_k\]

  • The intercept (\(ln(h_0(t))\)) is a function of time

As alternative:

\[HAZ = h_0(t) + e^{b_1X_1 + ... + b_kX_k}\] - The model’s coefficients (\(b_1,b_2,...,b_k\)) can be estimated without having to specify the value of the baseline hazard function (\(h_0(t)\)) (similar to estimating the slope of a regression line without estimating the intercept. However the assumption is that the two lines of two groups are going to remain at the same distance with time)

8 Model Assumptions for CPH Model

  1. Censoring is non-informative: there is no association between an observation being censored and the likelihood of the event to occur (assumption also for the Kaplan-Meier, exponential and Weibull models)
  2. The survival times (\(t\)) of different individuals are independent (assumption also for the Kaplan-Meier, exponential and Weibull models)
  3. The hazards are proportional (i.e. the HR is constant) over time (assumption also for the exponential and Weibull models). In other words:
  • The relative difference between groups is constant over time
  • Survival curves do not cross
    This assumption can be checked using:
  • C-log-log plot
  • Schoenfeld’s test
  1. The \(ln(HAZ)\) is a linear function of the numeric explanatory variables (\(Xs\)) (assumption also for the exponential and Weibull models). This is not important for categorical explanatory variables. This can be checked using:
  • Residual plots (the should be distributed along a line close to zero)
  1. Values of explanatory variables (\(Xs\)) do not change over time
  2. The baseline hazard (\(h_0(t)\)) is unspecified
  • If assumption (3) is violated (i.e. HR is not constant over time), we can:

    • Stratify by the relevant variable and fit a separate model in each strata
    • Introduce time-dependent coefficients/parameters (the \(\beta s\)), i.e. fitting an interaction term that allows the effect of \(X\) to change over time (\(X \times t\)).
  • If assumption (4) is violated (i.e. the relationship between \(ln(HAZ)\) and the explanatory variable is not linear), the solution could be:

    • \(X\) transformation (e.g. \(ln(X)\), \(\sqrt(X)\), …)
    • Include polynomials (e.g. \(X^2\), \(X^3\), …)
    • Transform the numeric \(X\) into a categorical variable
  • If assumption (5) is violated (e.g. \(X\) is the drug’s dose and it is changing over time), we can use a time-dependent covariates’ model

  • We cannot do anything for violation of assumption (1). Although we can check for it by taking the censored vs non-censored observations and try to compare them to see if they differ in any way (study design)

  • Assumption (2) depends on the study design

9 CPH Model in R

# Use the Stanford heart transplant data
data(stanford2, package="survival")
stanford2<- 
  stanford2 %>% 
  mutate(
    Over40 = if_else(age > 40, "YES", "NO"),
    Over40 = as.factor(Over40),
    MisMatchLevel = case_when(
      t5 < 0.7 ~ 0,
      t5 >= 0.7 & t5 < 1.5 ~ 1,
      TRUE ~ 2
    ),
    MisMatchLevel = as.factor(MisMatchLevel)
    )

head(stanford2)
##      id time status age   t5 Over40 MisMatchLevel
## 139 139   86      1  12 1.26     NO             1
## 159 159   10      1  13 1.49     NO             1
## 181 181   60      0  13   NA     NO             2
## 119 119 1116      0  14 0.54     NO             0
## 74   74 2006      0  15 1.26     NO             1
## 120 120 1107      0  18 0.25     NO             0
summary(stanford2)
##        id              time             status            age       
##  Min.   :  1.00   Min.   :   0.50   Min.   :0.0000   Min.   :12.00  
##  1st Qu.: 46.75   1st Qu.:  64.75   1st Qu.:0.0000   1st Qu.:35.00  
##  Median : 92.50   Median : 351.00   Median :1.0000   Median :44.00  
##  Mean   : 92.50   Mean   : 696.94   Mean   :0.6141   Mean   :41.09  
##  3rd Qu.:138.25   3rd Qu.:1160.75   3rd Qu.:1.0000   3rd Qu.:49.00  
##  Max.   :184.00   Max.   :3695.00   Max.   :1.0000   Max.   :64.00  
##                                                                     
##        t5        Over40    MisMatchLevel
##  Min.   :0.000   NO : 71   0:40         
##  1st Qu.:0.690   YES:113   1:79         
##  Median :1.040             2:65         
##  Mean   :1.117                          
##  3rd Qu.:1.460                          
##  Max.   :3.050                          
##  NA's   :27

First glance the data using K-M model and survival curves

km.model <- survfit(
  Surv(time, status) ~ MisMatchLevel, # ~ 1 is used when we do not have any predictor for survival
  type = "kaplan-meier", ## default type
  data = stanford2)

km.model
## Call: survfit(formula = Surv(time, status) ~ MisMatchLevel, data = stanford2, 
##     type = "kaplan-meier")
## 
##                  n events median 0.95LCL 0.95UCL
## MisMatchLevel=0 40     26    364      66      NA
## MisMatchLevel=1 79     49    994     381    1478
## MisMatchLevel=2 65     38    544     169    1534
## base R plot
# plot(km.model, 
#      xlab ="Time (months)", 
#      ylab = "% Alive = S(t)", 
#      main = "KM-Model", 
#      las = 1, ## rotate y-axis labels
#      mark.time = TRUE ## this allows to visualise at what time were the observations censored
#      )
# abline(h = 0.5, col= "red")
ggsurvfit::survfit2(survival::Surv(time, status) ~ MisMatchLevel, data = stanford2) %>% 
  ggsurvfit::ggsurvfit() +
  labs(
    x = "Days",
    y = "Overall survival probability"
  ) + 
  add_confidence_interval() +
  add_risktable() +
  labs(title = "Survival curves by 'MisMatchLevel' variable")

ggsurvfit::survfit2(survival::Surv(time, status) ~ Over40, data = stanford2) %>% 
  ggsurvfit::ggsurvfit() +
  labs(
    x = "Days",
    y = "Overall survival probability"
  ) + 
  add_confidence_interval() +
  add_risktable() +
  labs(title = "Survival curves by 'Over40' variable")

cox.mod <- coxph(Surv(time, status) ~ Over40 + MisMatchLevel, data = stanford2)
summary(cox.mod)
## Call:
## coxph(formula = Surv(time, status) ~ Over40 + MisMatchLevel, 
##     data = stanford2)
## 
##   n= 184, number of events= 113 
## 
##                    coef exp(coef) se(coef)      z Pr(>|z|)  
## Over40YES       0.49350   1.63804  0.20671  2.387    0.017 *
## MisMatchLevel1 -0.24169   0.78530  0.24336 -0.993    0.321  
## MisMatchLevel2  0.08197   1.08542  0.25626  0.320    0.749  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##                exp(coef) exp(-coef) lower .95 upper .95
## Over40YES         1.6380     0.6105    1.0924     2.456
## MisMatchLevel1    0.7853     1.2734    0.4874     1.265
## MisMatchLevel2    1.0854     0.9213    0.6569     1.794
## 
## Concordance= 0.57  (se = 0.029 )
## Likelihood ratio test= 8.2  on 3 df,   p=0.04
## Wald test            = 7.86  on 3 df,   p=0.05
## Score (logrank) test = 7.98  on 3 df,   p=0.05
  • The model summary returns the HRs (exp(coef)) for the variable groups:

    • for example, looking at the Over40 variable, a \(HR = 1.64\) means that at a given instant in time, someone over 40 has \(1.64\) times as likely to die as someone <= 40 (adjusting for the level of the MisMatchLevel variable)
    • If we subtract \(HR-1\), we can interpret the result as a percentage change: in the example, \(1.64-1 = 0.64\) or \(64%\) –> at a given instant in time, someone > 40 is \(64%\) more likely to die as someone who is <= 40 (adjusting for the level of the MisMatchLevel variable). Taking two individuals with the same MisMatchLevel, the one > 40 will be \(64%\) more likely to die than the one <= 40.
  • The model summary also returns the exp(-coef), which is the HR obtained by flipping the groups (therefore changing the reference group, i.e. the denominator of the HR): e.g. someone who is <=40 is 0.61 as likely to die as someone who is > 40.

  • At the bottom there are the tests (Likelihood ratio test, Wald test and Score (logrank) test) for the null hypothesis : \(H0: \beta_1 = \beta_2 = ... = \beta_k = 0\) (and \(H_A\): at least one coefficient is not 0). They test the overall-model significance.

  • The Concordance (or C-statistic): this is the goodness-of-fit statistic for survival analysis (in logistic regression, this is equivalent to the area under the curve, AUC): it is calculated by looking at pairs of observations and checking the concordance of the model’s prediction of length of survival versus the actual survival retrieved from the data. The concordance is the proportion of pairs of observations that are concordant (model predictions and actual data). If the model is just making random guesses, we would expect a concordance = 0.5.

9.1 Comparing nested models using the LRT

  • The likelihood ratio test (LRT) can be used to compare nested models, if adding/removing variables improves the model, etc

  • In the example, we want to understand whether we can drop from the model the MisMatchLevel variable

cox.mod <- coxph(Surv(time, status) ~ Over40 + MisMatchLevel, data = stanford2)
cox.mod2 <- coxph(Surv(time, status) ~ Over40 , data = stanford2)

Carry out the LRT to compare the two models:

anova(cox.mod2, cox.mod, test = "LRT")
## Analysis of Deviance Table
##  Cox model: response is  Surv(time, status)
##  Model 1: ~ Over40
##  Model 2: ~ Over40 + MisMatchLevel
##    loglik  Chisq Df Pr(>|Chi|)
## 1 -510.21                     
## 2 -509.01 2.4035  2     0.3007
  • The outcome highlights that there isn’t a statistically significant difference between the two models, therefore we can drop MisMatchLevel without loss of predictive power.

9.2 Including numeric variables in the CPH model

cox.num <- coxph(Surv(time, status) ~ age + t5, data = stanford2)
summary(cox.num)
## Call:
## coxph(formula = Surv(time, status) ~ age + t5, data = stanford2)
## 
##   n= 157, number of events= 102 
##    (27 observations deleted due to missingness)
## 
##        coef exp(coef) se(coef)     z Pr(>|z|)   
## age 0.02961   1.03006  0.01136 2.608  0.00911 **
## t5  0.17041   1.18579  0.18326 0.930  0.35243   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##     exp(coef) exp(-coef) lower .95 upper .95
## age     1.030     0.9708     1.007     1.053
## t5      1.186     0.8433     0.828     1.698
## 
## Concordance= 0.59  (se = 0.034 )
## Likelihood ratio test= 8.47  on 2 df,   p=0.01
## Wald test            = 7.81  on 2 df,   p=0.02
## Score (logrank) test = 7.87  on 2 df,   p=0.02
  • Interpretation of results for Age variable: at a given instant in time, the probabilty of dying for someone who is 1-year older is \(3%\) (or \(1.03\)) higher than someone who is 1 year younger, adjusting for the t5 score.

10 Checking CPH model assumptions in R

We will check two assumptions:
- How to check linearity
- How to check the proportional hazards’ assumption

10.1 Check linearity assumption

  • The assumption is that the relationship of any of the numeric predictors (\(X\)) and the \(ln(HAZ)\) is linear.
  • To check this assumption we can use the same technique that is used for other models (e.g. linear or Poisson regression), i.e. using Martingale residuals
plot(
  x = predict(cox.num), 
  y = residuals(cox.num, type = "martingale"),
  xlab = "fitted values", 
  ylab = "Martingale residuals",
  main = "Residual plot (Martingale)", 
  las = 1 # this rotates values on y-axis
  )

## add a line at y = residual = 0
abline(h=0)

## fit a smoother through the points
lines(
  smooth.spline( 
    x = predict(cox.num),
    y = residuals(cox.num, type = "martingale")),
  col = "red"
  )

  • The fitted values are the predicted values by the model

  • We can observe some non-linearity present from the plot

  • We can also check the same kind of plot but using the deviance residuals

plot(
  x = predict(cox.num), 
  y = residuals(cox.num, type = "deviance"),
  xlab = "fitted values", 
  ylab = "Deviance residuals",
  main = "Residual plot (Deviance)", 
  las = 1 # this rotates values on y-axis
  )

## add a line at y = residual = 0
abline(h=0)

## fit a smoother through the points
lines(
  smooth.spline( 
    x = predict(cox.num),
    y = residuals(cox.num, type = "deviance")),
  col = "red"
  )

  • Again, some non-linearity is evidenced also in this plot

  • The approach to non-linearity, we can identify the non-linear variable and categorise it, or include polynomial terms or transform the variable

10.2 Checking the Proportional Hazards assumption

There are various way to fo this

10.2.1 C-log-log plot

  • Use a C-log-log plot: log of survival vs log of time. We look for convergence, divergence or crossing of the hazard functions (it does not take into account confounding)
plot(
  survfit(Surv(time, status) ~ Over40, data = stanford2) , 
  col=c("black", "red"), 
  xlab = "Time (in days) ",
  ylab = "Survival", 
  main = "Survival curves by Over40"
  )

legend(3000,0.9, legend=c("<= 40", "> 40"), lty = 1, lwd = 2, bty = "", cex = 0.8, col = c("black", "red"))

plot(
  survfit(Surv(time, status) ~ Over40, data = stanford2) , 
  col=c("black", "red"), 
  fun="cloglog", 
  xlab = "Time (in days) using log",
  ylab = "log-log survival", 
  main = "log-log curves by Over40"
  )

legend(1000,-3, legend=c("<= 40", "> 40"), lty = 1, lwd = 2, bty = "", cex = 0.8, col = c("black", "red"))

  • The log-log curves cross indicate that the hazards of the two groups are changing over time, so using one HR to summarise the data is incorrect (the HR is not constant!)
  • In this example the log-log curves cross, therefore it is not appropriate to use the CPH model for this data.

10.2.2 Schoenfeld test

  • Use Schoenfeld test for proportional hazards (\(H_0\): the hazards are proportional, equivalent to HR being constant over time).
cox.zph(cox.num)
##        chisq df    p
## age     0.83  1 0.36
## t5      2.06  1 0.15
## GLOBAL  2.77  2 0.25
  • The test returns a result for each individual explanatory variable as well as a GLOBAL test for the overall model

  • We can also look at the plot of the Schoenfeld test: if we allow the variables’ coefficients (\(\beta\)) to change over time (equivalent to allow the \(HRs\) to change over time), what changes would we see?

    • If coefficients do not change over time, we expect no change (\(0\))
# we have two variables in the model, so we have 2 plots returned
plot(cox.zph(cox.num)[1], main = "Age")
abline(h=0, col = 2)

# we have two variables in the model, so we have 2 plots returned
plot(cox.zph(cox.num)[2], main = "t5")
abline(h=0, col = 2)

  • The solid black line in the middle is looking at: if we allow the coefficients (\(\beta s\)) for the variable to change over time, how much of a change we will be seeing? A change of \(0\) means n change (red line on the plots).

  • The dashed lines represent the 95% confidence interval for the \(\beta s\) change

  • The main thing to check is whether the red line (\(0\) change) is in the 95% confidence interval for most of the time (in the example it looks it is ok for t5 but not for Age)

  • Scaled Schoenfeld Residuals Chart: Schoenfeld residuals are used to test the assumption of proportional hazards. Schoenfeld residuals “can essentially be thought of as the observed minus the expected values of the covariates at each failure time” (Steffensmeier & Jones, 2004: 121). There is a Schoenfeld residual for each subject for each covariate. The plot of Schoenfeld residuals against time for any covariate should not show a pattern of changing residuals for that covariate. If there is a pattern, that covariate is time-dependent. As a rule of thumb, a non-zero slope is an indication of a violation of the proportional hazard assumption. The dotted lines outline the 95% confidence interval.

  • If the PHs assumption is violated there are different solutions:

    • Stratify by the variable that doesn’t meet the PH assumption
    • Use time-dependent coefficient models (these allow the effect of the variable to interact with time or to change over time). E.g. time-updated Cox models.