Survival analysis

David Springate
March 2014

Modelling the time taken for events to occur

Survival analysis AKA failure time analysis

Classically, the analysis of time to death.

But can be used anywhere you want to know what factors affect the time for an event to occur:

  • Germination timing
  • Arrival of a migrant or parasite
  • Dispersal of seeds or offspring
  • Failure time in mechanical systems
  • Response to stimulus

Censoring: dealing with missing data

Right censoring:

Where the date of death is unknown but is after some known date


  • Date of death is after the end of the study
  • Subject is removed from the study (patient withdraws, animal escapes, plant gets eaten etc.)

Survival analysis can account for this kind of censoring

Censoring: dealing with missing data

Left censoring:

Occurs when a subject's survival time is incomplete on the left side of the follow-up period

  • e.g. Following up a patient after being tested for an infection, we don't know the exact time of exposure
  • Less common
  • Survival analysis can account for this (see ref 4)

ASSUMPTION: censoring must be independent of the event being looked at!

The Survival Function (Survival curve)

\[ S(t)=Pr(T>t) \]

The Survival function (\( S \)) is the probability that the time of death (\( T \)) is greater than some specified time (\( t \))

It is composed of:

  • The underlying Hazard function (How the risk of death per unit time changes over time at baseline covariates)
  • The effect parameters (How the hazard varies in response to the covariates)

plot of chunk unnamed-chunk-2

The Cox Proportional-Hazards Model

The most common model used to determine the effects of covariates on survival

\[ h_i(t)=h_0(t)exp(\beta_{1}x_{i1} + \beta_{2}x_{ik} + ... + \beta_{2}x_{ik} ) \]

It is a semi-parametric model:

  • The baseline hazard function is unspecified
  • The effects of the covariates are multiplicative
  • Doesn't make arbitrary assumptions about the shape/form of the baseline hazard function

The Proportional Hazards Assumption

  • Covariates multiply the hazard by some constant
  • e.g. a drug may halve a subjects risk of death at any time \( t \)
  • The effect is the same at any time point

Violating the PH assumption can seriously invalidate your model!

Survival analysis in R

Uses the survival package.

The response variable is a Surv object taking in

  • start time (after study start)
  • stop time (after study start)
  • whether or not an event occurred

Otherwise, the model is specified in the same way as for a standard regression, using the coxph function


Surv(): Define a survival object

coxph(): Run a cox PH regression

survfit(): Fit a survival curve to a model or formula

Effect of Beta-blockers on colon cancer survival rates

From the electronic medical records of 13 million UK patients, I identified:

  • Patients aged 40-85 with a first diagnosis of colon cancer between 1997 and 2006
  • Classified according to exposure to \( \beta \)-blockers in the 1 year period prior to diagnosis
  • Patients with heart disease, stroke, asthma, diabetes, kidney disease etc. were excluded
  • controlling for sex, age, year of diagnosis, numer of drugs being taken, smoking status, level of social deprivation

1104 patients not on \( \beta \)-blockers - CONTROL
695 patients on \( \beta \)-blockers - TREATMENT

Effect of Beta-blockers on colon cancer survival rates


Setting up the response variable

  • Start time: time (days) of diagnosis date after the study start date
  • Stop time: time to whichever comes first from
    • Study end
    • patient dies
    • patient leaves practice
  • event: binary does the patient die?

# Create the response variable:
S <- Surv(
  time = colon$start, 
  time2 = colon$stop, 
  event = colon$death)

This includes the right censoring information with the survival time

Building the model

model <- coxph(S ~ sex + treatment + age + 
                 year + deprivation + 
                 num_drugs + smoking_status, 
               data = colon)
  • Use the coxph function
  • The new response variable goes on the LHS
  • Otherwise, it uses the same syntax as lm(), glm() etc.

Model results

Output 1

The exp(coef) variable is the Hazard ratio: The multiplicative effect of the covariate on the Hazard function

Output 2

  • HR = 1: No effect
  • HR < 1: Reduction in the hazard
  • HR > 1: Increase in Hazard

Model results

# Plot the baseline survival function
     xscale = 365.25,
     xlab = "Years after diagnosis",
     ylab = "Proportion survived",
     main = "Baseline Hazard Curve")

plot of chunk unnamed-chunk-6

Plotting the effects

First, build a dataframe for the effect levels you want to look at
Other covariates are held at their lowest level (or means if they are continuous)

treat <- with(colon,
                treatment = levels(treatment),
                age = rep(levels(age)[1], 2),
                sex = rep(levels(sex)[1], 2),
                year = rep(levels(year)[1], 2),
                deprivation = rep(levels(deprivation)[1], 2),
                num_drugs = rep(levels(num_drugs)[1], 2),
                smoking_status = rep(
                  levels(smoking_status)[1], 2)))

Plotting the effects

First, build a dataframe for the effect levels you want to look at
Other covariates are held at their lowest level (or means if they are continuous)

'data.frame':   2 obs. of  7 variables:
 $ treatment     : Factor w/ 2 levels "BBs","No BBs": 2 1
 $ age           : Factor w/ 1 level "55-65": 1 1
 $ sex           : Factor w/ 1 level "M": 1 1
 $ year          : Factor w/ 1 level "1997-1998": 1 1
 $ deprivation   : Factor w/ 1 level "0": 1 1
 $ num_drugs     : Factor w/ 1 level "0-4": 1 1
 $ smoking_status: Factor w/ 1 level "0": 1 1

Plotting the effects

Then add the newdata arg to call to survfit:

plot(survfit(model, newdata = treat), 
     xscale = 365.25, = TRUE,
     xlab = "Years after diagnosis",
     ylab = "Proportion survived",
     col = c("red", "green"))

legend(8, 0.9, 
       legend = c("Beta blockers", 
                  "No beta blockers"), 
       lty = 1, 
       col = c("green", "red"))

plot of chunk unnamed-chunk-10

Testing for Proportional Hazards

Hypothesis test for proportional hazards:

# Note this is very conservative!


Testing for Proportional Hazards

It is often better to check the graphs for systematic trends:

a <- cox.zph(model)
par(mfrow = c(3, 1))
plot(a[1], main = "Sex")
plot(a[6], main = "diagnosis year")
plot(a[18], main = "Smoking status")

plot of chunk unnamed-chunk-13

Accounting for non-proportional hazards

Include an interaction with time for the variables:

  • This factors time out of the main effect
  • Only use if it makes sense to have a linear interaction between the covariate and time (look at the graphs!)
mod2 <- coxph(S ~ sex + treatment + age + 
                 year + deprivation + 
                 num_drugs + year + year:stop + smoking_status, 
               data = colon)

Note you don't need to include stop as a main effect since it is already factored in S

Accounting for non-proportional hazards

Alternatively, Stratify the data by that covariate

mod3 <- coxph(S ~ sex + treatment + age + 
                deprivation + 
                num_drugs + strata(year) + 
              data = colon)
  • Each strata has a different baseline hazard function but the remaining covariates are assumed to be constant
  • Does not assume a linear relationship
  • But you cannot examine the effects of the stratification variable
    • though you can plot them …

Visualising stratified survival curves

survfit plots a survival curve for each strata

plot(survfit(mod3), col = 1:5)
legend(11, 1, 
       legend = levels(colon$year), 
       lty = 1, 
       col = 1:5,
       title = "Diagnosis year")

plot of chunk unnamed-chunk-17

Non-independent observations

e.g. several plants in a family, patients in practices etc.

Deal with this by using cluster(variable):

mod3 <- coxph(S ~ sex + treatment + age + 
                deprivation + 
                cluster(practice) + 
                num_drugs + year + 
              data = colon)
  • Robust standard errors are produced taking into account the clusters
  • Simpler than a full mixed-effects model

Further information