Censoring in this context means the event does not occur within the time-period.
Right Censoring: Termination not related to treatment OR subject leaves as they do not believe the treatment is working (The design of the study can effect the drop-out rate)
Interval censoring: The event happened (not sure excactly when) but have a start and end time point within which it ocurred.
Left censoring: Event happened sometime before a time point.
### Survival Analysis in R ###
setwd("~/Documents/STATS/Survival Analysis")
prost <- read.table("prostate_cancer.txt", header = F, sep = " ")
# Providing column names
colnames(prost) = c("patient", "treatment", "time", "status", "age", "sh", "size", "index")
head(prost)
## patient treatment time status age sh size index
## 1 1 1 65 0 67 13.4 34 8
## 2 2 2 61 0 60 14.6 4 10
## 3 3 2 60 0 77 15.6 3 8
## 4 4 1 58 0 64 16.2 6 9
## 5 5 2 51 0 65 14.1 21 9
## 6 6 1 51 0 61 13.5 8 8
\(F(t)\): Cumulative distribution function (related to the hazard function h(t) or \(\lambda\)(t))
\(S(t) = P(T>t) = 1 - F(t)\)
Can be understood as the instantanious death rate: Divide the probability (\(P\)) of an individual’s survival time (\(T\)) within an interval by the interval length.
\(S(t)\): The survival function adding the rate of risk per time unit or every time unit passed.
\(H(t) = -log(S(t))\)
A vector with the survival time for each subject. Note those with “+”s represent censored data.
library(survival)
## The Fundamental Survival Function
## The Survival Object
Surv(prost$time, prost$status)
## [1] 65+ 61+ 60+ 58+ 51+ 51+ 14 43+ 16+ 52+ 59 55+ 68+ 51 2+ 67+ 66+ 66 28+
## [20] 50 69 67 65+ 24+ 45+ 64+ 61+ 26 42 57+ 70+ 5+ 54+ 36 70 67+ 23+ 52
## [39] 65+ 24 45+ 64+ 61 21 22 37+ 51 12+ 67+ 61+ 66 65+ 50 35 67 65+ 33+
## [58] 45+ 62+ 61+ 36 42 57+
\(n_i\): The number of individuals known to survive. (Not died or censored)
\(\hat{S}(t) = \prod_{i:t_i \le t}(1-\frac{d_i}{n_i})\)
Declining step plot:
Variables required in the data:
# Simple Kaplan Meier plot - modeling the survival function without a grouping variable
mykm1 <- survfit(Surv(time, status) ~ 1, data = prost)
plot(mykm1, mark.time = TRUE)
Note each vertical slash represents the time point at which a censored subject left the experiment.
# Stratified KM plot - modeling the survival function with treatment groups
mykm2 <- survfit(Surv(time, status) ~ treatment, data = prost)
plot(mykm2, mark.time = TRUE)
# Advanced plot
library(ggfortify)
## Loading required package: ggplot2
autoplot(mykm2)
Compares the estimates of the hazard function of the two groups at each observed time point.
Test Statistic: N(0,1) \(Z = \frac{\sum_{j=1}^J(O_{1j} - E_{1j})}{\sqrt{\sum_{j=1}^JV_j}} \to N(0,1)\)
Null Hypothesis: The hazard rate in each group is similar (rejected if the test statistic \(Z\) is not normally distributed.)
Assumptions:
# Log Rank Test
survival::survdiff(survival::Surv(prost$time, prost$status) ~ prost$treatment)
## Call:
## survival::survdiff(formula = survival::Surv(prost$time, prost$status) ~
## prost$treatment)
##
## N Observed Expected (O-E)^2/E (O-E)^2/V
## prost$treatment=1 23 7 5.13 0.686 0.958
## prost$treatment=2 40 16 17.87 0.197 0.958
##
## Chisq= 1 on 1 degrees of freedom, p= 0.3
# note you can put more weight on later events
# :the p-value is non-significant stick to the null hypothesis
# Calculated the ch0-squared statistic and the variance.
library(SMPracticals)
## Loading required package: ellipse
##
## Attaching package: 'ellipse'
## The following object is masked from 'package:graphics':
##
## pairs
##
## Attaching package: 'SMPracticals'
## The following objects are masked from 'package:survival':
##
## aml, pbc
data("motorette")
#motorette data
#y: Time to failure
#x: Temp
head(motorette)
## x cens y
## 1 150 0 8064
## 2 150 0 8064
## 3 150 0 8064
## 4 150 0 8064
## 5 150 0 8064
## 6 150 0 8064
plot(x = motorette$x, y = motorette$y)
# kaplan meier plot not taking the temperature into account
mykm1_2 <- survfit(Surv(motorette$y , motorette$cens) ~ 1, data = motorette)
plot(mykm1_2, mark.time = TRUE)
# kaplan meier plot taking the temperature into account
temp_binary = ifelse(motorette$x < 180, "low", "high"); temp_binary
## [1] "low" "low" "low" "low" "low" "low" "low" "low" "low" "low"
## [11] "low" "low" "low" "low" "low" "low" "low" "low" "low" "low"
## [21] "high" "high" "high" "high" "high" "high" "high" "high" "high" "high"
## [31] "high" "high" "high" "high" "high" "high" "high" "high" "high" "high"
motorette$temp_binary = temp_binary
survival::survdiff(survival::Surv(motorette$y, motorette$cens) ~ motorette$temp_binary)
## Call:
## survival::survdiff(formula = survival::Surv(motorette$y, motorette$cens) ~
## motorette$temp_binary)
##
## N Observed Expected (O-E)^2/E (O-E)^2/V
## motorette$temp_binary=high 20 10 4.14 8.32 15.7
## motorette$temp_binary=low 20 7 12.86 2.67 15.7
##
## Chisq= 15.7 on 1 degrees of freedom, p= 7e-05
mykm2 <- survfit(Surv(y, cens) ~ temp_binary, data = motorette)
plot(mykm2, mark.time = TRUE)
Hazard Rate: The time it takes for an event to take place
External factors can effect the probability of an event (covariates) which can now be included in the model.
\(\lambda_0(t) or h_0(t)\): The baseline hazard/risk rate at time point \(t\) regardless of the subect or covariates. (ie each subject has the same prob of an event after t time points have past)
\(h_i(t) = h_o(t)e^{X_i\beta}\)
### Cox Proportional Hazards Model
cox <- coxph(Surv(time, status) ~ treatment + age + sh+ size + index, data = prost)
summary(cox)
## Call:
## coxph(formula = Surv(time, status) ~ treatment + age + sh + size +
## index, data = prost)
##
## n= 63, number of events= 23
##
## coef exp(coef) se(coef) z Pr(>|z|)
## treatment -0.69695 0.49810 0.53471 -1.303 0.1924
## age -0.08361 0.91979 0.03692 -2.265 0.0235 *
## sh -0.23664 0.78927 0.18511 -1.278 0.2011
## size 0.06786 1.07021 0.02833 2.395 0.0166 *
## index 0.77410 2.16865 0.18803 4.117 3.84e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## treatment 0.4981 2.0076 0.1747 1.4206
## age 0.9198 1.0872 0.8556 0.9888
## sh 0.7893 1.2670 0.5491 1.1345
## size 1.0702 0.9344 1.0124 1.1313
## index 2.1686 0.4611 1.5002 3.1350
##
## Concordance= 0.866 (se = 0.037 )
## Likelihood ratio test= 38.4 on 5 df, p=3e-07
## Wald test = 26.47 on 5 df, p=7e-05
## Score (logrank) test = 34.47 on 5 df, p=2e-06
The probability of agreement of two randomly chosen observations.
The probability of correctly choosing the the observation with higher risk of an event occuring. Want the concordance to be as close to 1 as possible (Anything below 0.5 is a very bad model)
Fraction of variance in survival rate that is predicted by the covariates.
Do the covariates used predict the hypothesis rate? Reject the null hypothesis if the p-values are significant. df = number of covariates in the model
# Getting a plot of the model
autoplot(survfit(cox))
# alternative
coxfit = survfit(cox)
autoplot(coxfit)
The influence of covariates (On the prob. of an event occurring) can vary depending on the time point in the study. To model the time dependant effect of the covariates we use Aalen’s additive regression model.
\(B(t)\): The time dependent matrix of coeficients.
\(H(t) = a(t) + XB(t)\)
NOTE: Results at the tail of the dataset can get inacurate, as there are few subjects left in the study. This can be accounted for through making use of the arguments taper which smoothes out the tail end of the curve or qrtol and nmin which truncates the estimate before the last death event.
# Aalens additive regression
aalen <- aareg(Surv(time, status) ~ treatment + age + sh+ size + index, data = prost)
autoplot(aalen)
What is the difference between paramtric, non-parametric and semi-parametric models?
The distribution explains how the observations are distributed in their occurrence.
Common distributions used in Survival Analysis * Exponential: The probability of an event is the same at any time point. * Weibull: Allows you to set the time point at which most of the events are most likely to occur (Usually at the end) * Gaussian: Event probabilty is the highest around the center of the survival curve (Looking at the step plot the steepest decrease in survival probablity will occur somewhere in the center of the plot)
NOTE: These distribution refer to the distribution at which the events occur. (ie the y-axis)
NOTE: The classic non-parametric step plot is also included in the plots below.
library(flexsurv)
weib_mod = flexsurvreg(Surv(time, status) ~ treatment + age + sh+ size + index, data = prost, dist = "weibull")
exp_mod = flexsurvreg(Surv(time, status) ~ treatment + age + sh+ size + index, data = prost, dist = "exp")
plot(weib_mod)
lines(exp_mod, col="blue")
# Cox Proportional Hazards
cox <- coxph(Surv(futime, status) ~ trt + stage + bili + riskscore, data = udca1)
summary(cox)
## Call:
## coxph(formula = Surv(futime, status) ~ trt + stage + bili + riskscore,
## data = udca1)
##
## n= 169, number of events= 72
## (1 observation deleted due to missingness)
##
## coef exp(coef) se(coef) z Pr(>|z|)
## trt -1.01636 0.36191 0.25628 -3.966 7.31e-05 ***
## stage 0.03453 1.03514 0.28768 0.120 0.9045
## bili 0.08971 1.09385 0.07827 1.146 0.2518
## riskscore 0.30950 1.36275 0.16142 1.917 0.0552 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## trt 0.3619 2.7631 0.2190 0.5981
## stage 1.0351 0.9661 0.5890 1.8192
## bili 1.0939 0.9142 0.9383 1.2752
## riskscore 1.3628 0.7338 0.9932 1.8699
##
## Concordance= 0.684 (se = 0.031 )
## Likelihood ratio test= 30.32 on 4 df, p=4e-06
## Wald test = 29.86 on 4 df, p=5e-06
## Score (logrank) test = 31.47 on 4 df, p=2e-06
# Getting a plot of the model
autoplot(survfit(cox))
# Standard Kaplan Meier plot to compare the treatments
kmfit <- survfit(Surv(futime, status) ~ trt, data = udca1)
autoplot(kmfit)
# Can we improve the concordance by eliminating non significant covariates?
cox2 <- coxph(Surv(futime, status) ~ trt, data = udca1)
summary(cox2)
## Call:
## coxph(formula = Surv(futime, status) ~ trt, data = udca1)
##
## n= 170, number of events= 72
##
## coef exp(coef) se(coef) z Pr(>|z|)
## trt -0.8624 0.4222 0.2443 -3.53 0.000416 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## trt 0.4222 2.369 0.2615 0.6815
##
## Concordance= 0.614 (se = 0.029 )
## Likelihood ratio test= 12.99 on 1 df, p=3e-04
## Wald test = 12.46 on 1 df, p=4e-04
## Score (logrank) test = 13.23 on 1 df, p=3e-04
A decision tree fitted on survival data, allowing covariates to be incorporated.
NOTE: The outcome of this model is very different form what we saw previously in the Cox-Proportional Hazards model.(The sample size might be a bit small for a survival tree…)
### Survival Trees
library(ranger)
streefit = ranger(Surv(time, status) ~ treatment + age + sh + size + index, data = prost,
importance = "permutation",
splitrule = "extratrees",
seed = 43)
# Average the survival models (in order to create one aerage survival tree)
death_times = streefit$unique.death.times
surv_prob = data.frame(streefit$survival)
avg_prob = sapply(surv_prob,mean) # can also use the median
# Plot the survival tree model - average
plot(death_times, avg_prob,
type = "s",
ylim = c(0,1),
col = "red", lwd = 2, bty = "n",
ylab = "Survival Probability", xlab = "Time in Months",
main = "Survival Tree Model\nAverage Survival Curve")
## Comparing the main models we discussed so far
# Set up for ggplot
kmdf <- data.frame(mykm1$time,
mykm1$surv,
KM = rep("KM", 36))
names(kmdf) <- c("Time in Months","Survival Probability","Model")
coxdf <- data.frame(coxfit$time,
coxfit$surv,
COX = rep("COX", 36))
names(coxdf) <- c("Time in Months","Survival Probability","Model")
rfdf <- data.frame(death_times,
avg_prob,
RF = rep("RF", 36))
names(rfdf) <- c("Time in Months","Survival Probability","Model")
plotdf <- rbind(kmdf,coxdf,rfdf)
p <- ggplot(plotdf, aes(x = plotdf[,1] ,
y = plotdf[,2] ,
color = plotdf[,3]))
p + geom_step(size = 2) +
labs(x = "Time in Months",
y = "Survival Probability",
color = "",
title = "Comparison Plot of\n3 Main Survival Models") +
theme(plot.title = element_text(hjust = 0.5))
How is the 95% confidence interval calculated in a non-parametric setting? (Use the normal distribution method?)