David Springate
March 2014
But can be used anywhere you want to know what factors affect the time for an event to occur:
Where the date of death is unknown but is after some known date
e.g.
Survival analysis can account for this kind of censoring
Occurs when a subject's survival time is incomplete on the left side of the follow-up period
\[ S(t)=Pr(T>t) \]
The Survival function (\( S \)) is the probability that the time of death (\( T \)) is greater than some specified time (\( t \))
It is composed of:
The most common model used to determine the effects of covariates on survival
\[ h_i(t)=h_0(t)exp(\beta_{1}x_{i1} + \beta_{2}x_{ik} + ... + \beta_{2}x_{ik} ) \]
It is a semi-parametric model:
Uses the survival package.
The response variable is a Surv object taking in
Otherwise, the model is specified in the same way as for a standard regression, using the coxph function
Toolkit
Surv(): Define a survival object
coxph(): Run a cox PH regression
survfit(): Fit a survival curve to a model or formula
From the electronic medical records of 13 million UK patients, I identified:
1104 patients not on \( \beta \)-blockers - CONTROL
695 patients on \( \beta \)-blockers - TREATMENT
# Create the response variable:
S <- Surv(
time = colon$start,
time2 = colon$stop,
event = colon$death)
This includes the right censoring information with the survival time
model <- coxph(S ~ sex + treatment + age +
year + deprivation +
num_drugs + smoking_status,
data = colon)
coxph functionlm(), glm() etc.
exp(coef) variable is the Hazard ratio: The multiplicative effect of the covariate on the Hazard function
# Plot the baseline survival function
plot(survfit(model),
xscale = 365.25,
xlab = "Years after diagnosis",
ylab = "Proportion survived",
main = "Baseline Hazard Curve")
First, build a dataframe for the effect levels you want to look at
Other covariates are held at their lowest level (or means if they are continuous)
treat <- with(colon,
data.frame(
treatment = levels(treatment),
age = rep(levels(age)[1], 2),
sex = rep(levels(sex)[1], 2),
year = rep(levels(year)[1], 2),
deprivation = rep(levels(deprivation)[1], 2),
num_drugs = rep(levels(num_drugs)[1], 2),
smoking_status = rep(
levels(smoking_status)[1], 2)))
First, build a dataframe for the effect levels you want to look at
Other covariates are held at their lowest level (or means if they are continuous)
str(treat)
'data.frame': 2 obs. of 7 variables:
$ treatment : Factor w/ 2 levels "BBs","No BBs": 2 1
$ age : Factor w/ 1 level "55-65": 1 1
$ sex : Factor w/ 1 level "M": 1 1
$ year : Factor w/ 1 level "1997-1998": 1 1
$ deprivation : Factor w/ 1 level "0": 1 1
$ num_drugs : Factor w/ 1 level "0-4": 1 1
$ smoking_status: Factor w/ 1 level "0": 1 1
Then add the newdata arg to call to survfit:
plot(survfit(model, newdata = treat),
xscale = 365.25,
conf.int = TRUE,
xlab = "Years after diagnosis",
ylab = "Proportion survived",
col = c("red", "green"))
legend(8, 0.9,
legend = c("Beta blockers",
"No beta blockers"),
lty = 1,
col = c("green", "red"))
Hypothesis test for proportional hazards:
# Note this is very conservative!
cox.zph(model)
It is often better to check the graphs for systematic trends:
a <- cox.zph(model)
par(mfrow = c(3, 1))
plot(a[1], main = "Sex")
plot(a[6], main = "diagnosis year")
plot(a[18], main = "Smoking status")
Include an interaction with time for the variables:
mod2 <- coxph(S ~ sex + treatment + age +
year + deprivation +
num_drugs + year + year:stop + smoking_status,
data = colon)
Note you don't need to include stop as a main effect since it is already factored in S
Alternatively, Stratify the data by that covariate
mod3 <- coxph(S ~ sex + treatment + age +
deprivation +
num_drugs + strata(year) +
smoking_status,
data = colon)
survfit plots a survival curve for each strata
plot(survfit(mod3), col = 1:5)
legend(11, 1,
legend = levels(colon$year),
lty = 1,
col = 1:5,
title = "Diagnosis year")
e.g. several plants in a family, patients in practices etc.
Deal with this by using cluster(variable):
mod3 <- coxph(S ~ sex + treatment + age +
deprivation +
cluster(practice) +
num_drugs + year +
smoking_status,
data = colon)