
Censoring of Data

The Data

### Survival Analysis in R ###

setwd("~/Documents/STATS/Survival Analysis")
prost <- read.table("prostate_cancer.txt", header = F, sep = " ")

# Providing column names
colnames(prost) = c("patient", "treatment", "time", "status", "age", "sh", "size", "index")

head(prost)
##   patient treatment time status age   sh size index
## 1       1         1   65      0  67 13.4   34     8
## 2       2         2   61      0  60 14.6    4    10
## 3       3         2   60      0  77 15.6    3     8
## 4       4         1   58      0  64 16.2    6     9
## 5       5         2   51      0  65 14.1   21     9
## 6       6         1   51      0  61 13.5    8     8

Estimating the Survival Function

Non-Parametric Models (no assumptions about the underlying distribution or its parameters)

  • Kaplan-Meier
  • log-rank test

The Survival Function:

Probability of surviving beyond a certain point.

  • \(S(t)\): The survival function estimate
  • \(P\): Probability
  • \(T\): Survival time
  • \(t\): Specified time point
  • \(F(t)\): Cumulative distribution function (related to the hazard function \(h(t)\) or \(\lambda(t)\))

  • \(S(t) = P(T>t) = 1 - F(t)\)
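
A minimal sketch of this relationship, assuming no censoring and a small set of hypothetical event times (with censored data the Kaplan-Meier estimator below is used instead):

# S(t) = 1 - F(t): empirical survival function built from the empirical CDF
t_obs <- c(2, 5, 8, 8, 12, 15)      # hypothetical, fully observed event times
F_hat <- ecdf(t_obs)                # empirical cumulative distribution function
S_hat <- function(t) 1 - F_hat(t)   # empirical survival function
S_hat(c(5, 10))                     # P(T > 5) and P(T > 10)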

The Hazard rate

  • \(h(t)\): The risk of an event at time point \(t\)
  • \(P\): The probability that a subject has an event

Can be understood as the instantaneous death rate: divide the conditional probability that an individual’s survival time (\(T\)) falls within a short interval, given survival to the start of that interval, by the interval length, and let the interval length go to zero.

  • \(h(t) = \lim_{\delta t \to 0}\left\{\frac{P(t \le T < t + \delta t \mid T \ge t)}{\delta t}\right\}\)

The Integrated/Cumulative Hazard

  • \(H(t)\) or \(\Lambda(t)\): The cumulative hazard
  • \(t\): The number of time units that have already passed
  • \(S(t)\): The survival function; the cumulative hazard accumulates the hazard rate over every time unit passed up to \(t\)

  • \(H(t) = -\log(S(t))\)
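
As a quick check of this identity, a sketch using the prost data loaded above: the cumulative hazard can be obtained as minus the log of the Kaplan-Meier survival estimate, and (in recent versions of the survival package) is also stored directly in the cumhaz component of a survfit object (a Nelson-Aalen-type estimate, so the two columns agree only approximately):

library(survival)
km <- survfit(Surv(time, status) ~ 1, data = prost)
head(cbind(time     = km$time,
           H_from_S = -log(km$surv),   # H(t) = -log(S(t))
           H_stored = km$cumhaz))      # Nelson-Aalen-style estimate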

The Survival Object

A vector with the survival time for each subject. Note those with “+”s represent censored data.

Inputs in the Surv function:

  • event: binary status indicator (0 = alive/censored, 1 = dead/event)
  • time: time at which the event/censoring occurs
  • time2: for (start, end] interval data, i.e. you need to provide the start and end times of the interval (see the sketch after the Surv output below)
  • type: specify the type of censoring
library(survival)

## The Fundamental Survival Function
## The Survival Object
Surv(prost$time, prost$status)
##  [1] 65+ 61+ 60+ 58+ 51+ 51+ 14  43+ 16+ 52+ 59  55+ 68+ 51   2+ 67+ 66+ 66  28+
## [20] 50  69  67  65+ 24+ 45+ 64+ 61+ 26  42  57+ 70+  5+ 54+ 36  70  67+ 23+ 52 
## [39] 65+ 24  45+ 64+ 61  21  22  37+ 51  12+ 67+ 61+ 66  65+ 50  35  67  65+ 33+
## [58] 45+ 62+ 61+ 36  42  57+
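
A hedged sketch of the time2 and type arguments, using made-up (start, stop] intervals rather than the prostate data:

# Counting-process (start, stop] notation: each subject is observed over an interval
start <- c(0, 0, 3, 5)
stop  <- c(4, 6, 9, 12)
event <- c(1, 0, 1, 1)   # 0 = censored at the end of its interval
Surv(time = start, time2 = stop, event = event, type = "counting")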

The Kaplan Meier Estimator

The Kaplan-Meier estimate is a declining step plot.

Variables required in the data: the survival/censoring time and the event status (here, time and status).

# Simple Kaplan Meier plot - modeling the survival function without a grouping variable
mykm1 <- survfit(Surv(time, status) ~ 1, data = prost)
plot(mykm1, mark.time = TRUE)

Note each vertical slash represents the time point at which a censored subject left the experiment.
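
To read the estimated survival probabilities (with their confidence intervals) at specific time points rather than off the plot, one option is summary() with the times argument:

# Survival estimates at selected months
summary(mykm1, times = c(12, 24, 36, 48, 60))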

# Stratified KM plot - modeling the survival function with treatment groups
mykm2 <- survfit(Surv(time, status) ~ treatment, data = prost) 
plot(mykm2, mark.time = TRUE)

# Advanced plot
library(ggfortify)
## Loading required package: ggplot2
autoplot(mykm2)

Log Rank Test (non-parametric)

Compares the estimates of the hazard function of the two groups at each observed time point.

Test statistic (asymptotically standard normal): \(Z = \frac{\sum_{j=1}^J(O_{1j} - E_{1j})}{\sqrt{\sum_{j=1}^J V_j}} \to N(0,1)\)

Null Hypothesis: The hazard rates of the two groups are equal (rejected if the observed test statistic \(Z\) is too extreme under the standard normal distribution, i.e. if the p-value is small).

Assumptions: censoring is non-informative (unrelated to group membership); the test has the most power when the hazards are proportional over time.

# Log Rank Test
survival::survdiff(survival::Surv(prost$time, prost$status) ~ prost$treatment)
## Call:
## survival::survdiff(formula = survival::Surv(prost$time, prost$status) ~ 
##     prost$treatment)
## 
##                    N Observed Expected (O-E)^2/E (O-E)^2/V
## prost$treatment=1 23        7     5.13     0.686     0.958
## prost$treatment=2 40       16    17.87     0.197     0.958
## 
##  Chisq= 1  on 1 degrees of freedom, p= 0.3
# Note: the test can also be weighted to emphasise particular parts of the follow-up
# (see the rho sketch below)
# The p-value is non-significant, so we do not reject the null hypothesis
# survdiff reports the chi-squared statistic and the variance
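
The weighting mentioned above is controlled by the rho argument of survdiff: rho = 0 gives the standard log-rank test, while rho = 1 weights each time point by the survival estimate and therefore emphasises earlier events (the Peto & Peto modification). A quick sketch on the same data:

# Weighted log-rank test (rho = 1 puts more weight on earlier events)
survdiff(Surv(time, status) ~ treatment, data = prost, rho = 1)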

Example

library(SMPracticals)
## Loading required package: ellipse
## 
## Attaching package: 'ellipse'
## The following object is masked from 'package:graphics':
## 
##     pairs
## 
## Attaching package: 'SMPracticals'
## The following objects are masked from 'package:survival':
## 
##     aml, pbc
data("motorette")
#motorette data
#y: Time to failure
#x: Temp
head(motorette)
##     x cens    y
## 1 150    0 8064
## 2 150    0 8064
## 3 150    0 8064
## 4 150    0 8064
## 5 150    0 8064
## 6 150    0 8064
plot(x = motorette$x, y = motorette$y)

# kaplan meier plot not taking the temperature into account
mykm1_2 <- survfit(Surv(motorette$y , motorette$cens) ~ 1, data = motorette)
plot(mykm1_2, mark.time = TRUE)

# kaplan meier plot taking the temperature into account
temp_binary = ifelse(motorette$x < 180, "low", "high"); temp_binary
##  [1] "low"  "low"  "low"  "low"  "low"  "low"  "low"  "low"  "low"  "low" 
## [11] "low"  "low"  "low"  "low"  "low"  "low"  "low"  "low"  "low"  "low" 
## [21] "high" "high" "high" "high" "high" "high" "high" "high" "high" "high"
## [31] "high" "high" "high" "high" "high" "high" "high" "high" "high" "high"
motorette$temp_binary = temp_binary

survival::survdiff(survival::Surv(motorette$y, motorette$cens) ~ motorette$temp_binary)
## Call:
## survival::survdiff(formula = survival::Surv(motorette$y, motorette$cens) ~ 
##     motorette$temp_binary)
## 
##                             N Observed Expected (O-E)^2/E (O-E)^2/V
## motorette$temp_binary=high 20       10     4.14      8.32      15.7
## motorette$temp_binary=low  20        7    12.86      2.67      15.7
## 
##  Chisq= 15.7  on 1 degrees of freedom, p= 7e-05
mykm2 <- survfit(Surv(y, cens) ~ temp_binary, data = motorette)

plot(mykm2, mark.time = TRUE)

The Cox Proportional Hazards Model

Hazard Rate: The instantaneous rate at which the event occurs at time \(t\), given survival up to \(t\).

External factors (covariates) can affect the probability of an event; these can now be included in the model.

The Cox Proportional Hazards Model equation

  • \(\lambda\) or \(h\): The hazard
  • \(i\): The subject
  • \(t\): The time point
  • \(\lambda_0(t)\) or \(h_0(t)\): The baseline hazard/risk rate at time point \(t\), regardless of the subject’s covariates (i.e. the hazard every subject would share if all covariate effects were zero)

  • \(\lambda(t|X_i) = \lambda_0(t)\exp(X_i\beta)\)
  • \(h_i(t) = h_0(t)e^{X_i\beta}\)

The Covariates:

  • Can be of any data type
  • The weight/significance of each covariate is controlled by its \(\beta\) coefficient
  • It is possible to estimate \(\beta\) without specifying the baseline hazard function, as long as the covariate effects are constant over time (the proportional hazards assumption)

Implementing the Cox Proportional Hazards Model

### Cox Proportional Hazards Model

cox <- coxph(Surv(time, status) ~ treatment + age + sh+ size + index, data = prost)

summary(cox)
## Call:
## coxph(formula = Surv(time, status) ~ treatment + age + sh + size + 
##     index, data = prost)
## 
##   n= 63, number of events= 23 
## 
##               coef exp(coef) se(coef)      z Pr(>|z|)    
## treatment -0.69695   0.49810  0.53471 -1.303   0.1924    
## age       -0.08361   0.91979  0.03692 -2.265   0.0235 *  
## sh        -0.23664   0.78927  0.18511 -1.278   0.2011    
## size       0.06786   1.07021  0.02833  2.395   0.0166 *  
## index      0.77410   2.16865  0.18803  4.117 3.84e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##           exp(coef) exp(-coef) lower .95 upper .95
## treatment    0.4981     2.0076    0.1747    1.4206
## age          0.9198     1.0872    0.8556    0.9888
## sh           0.7893     1.2670    0.5491    1.1345
## size         1.0702     0.9344    1.0124    1.1313
## index        2.1686     0.4611    1.5002    3.1350
## 
## Concordance= 0.866  (se = 0.037 )
## Likelihood ratio test= 38.4  on 5 df,   p=3e-07
## Wald test            = 26.47  on 5 df,   p=7e-05
## Score (logrank) test = 34.47  on 5 df,   p=2e-06

The Concordance

The probability of agreement of two randomly chosen observations.

The probability of correctly choosing the observation with the higher risk of an event occurring. We want the concordance to be as close to 1 as possible (anything below 0.5 indicates a very poor model).

  • Kendall’s \(\tau\) is used for continuous variables
  • Logistic regression is used for binary/grouped covariates
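
The concordance reported in summary(cox) can also be extracted directly from the fitted model; a small sketch using the concordance() function in the survival package:

# Concordance (and its standard error) computed from the fitted Cox model
concordance(cox)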

R Squared

Fraction of the variation in survival that is explained by the covariates.

Likelihood ratio test / Wald Test / Logrank test

Do the covariates jointly help predict the hazard? The null hypothesis is that all coefficients are zero; reject it if the p-values are significant. df = number of covariates in the model.
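
In addition to the three global tests shown in the summary, a hedged sketch of sequential (per-covariate) likelihood ratio tests via the anova method for coxph fits:

# Analysis of deviance table: each covariate is added to the model in turn
anova(cox)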

# Getting a plot of the model
autoplot(survfit(cox))

# alternative
coxfit = survfit(cox)
autoplot(coxfit)

Aalen’s Additive Regression Model

The influence of covariates (on the probability of an event occurring) can vary depending on the time point in the study. To model this time-dependent effect of the covariates we use Aalen’s additive regression model.

NOTE: Results at the tail of the dataset can become inaccurate, as few subjects are left in the study. This can be accounted for with the taper argument, which smooths out the tail end of the curve, or with qrtol and nmin, which truncate the estimate before the last death events.

# Aalens additive regression

aalen <- aareg(Surv(time, status) ~ treatment + age + sh+ size + index, data = prost)
autoplot(aalen)
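
A sketch of the tail-control arguments mentioned above, with purely illustrative values (per the note, taper smooths the tail of the curve and nmin truncates the estimate once too few subjects remain at risk):

# Aalen model with a smoothed/truncated tail (illustrative argument values)
aalen2 <- aareg(Surv(time, status) ~ treatment + age + sh + size + index,
                data = prost, taper = 10, nmin = 5)
autoplot(aalen2)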

Parametric Theory

What is the difference between parametric, non-parametric and semi-parametric models?

Comparing the different models: the assumed distribution describes how the observations are distributed in time, i.e. when the events occur.

Common distributions used in Survival Analysis:

  • Exponential: The probability of an event is the same at any time point.
  • Weibull: Allows you to set the time point around which the events are most likely to occur (usually towards the end).
  • Gaussian: Event probability is highest around the centre of the survival curve (in the step plot, the steepest decrease in survival probability occurs somewhere near the middle).

NOTE: These distributions describe when the events occur, i.e. where the drops appear along the survival curve.

Parametric Regression Models in Survival Analysis

NOTE: The classic non-parametric step plot is also included in the plots below.

library(flexsurv)

weib_mod = flexsurvreg(Surv(time, status) ~ treatment + age + sh+ size + index, data = prost, dist = "weibull")

exp_mod = flexsurvreg(Surv(time, status) ~ treatment + age + sh+ size + index, data = prost, dist = "exp")

plot(weib_mod)

lines(exp_mod, col="blue")
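
One simple way to compare the two parametric fits is via AIC, which flexsurvreg stores on the fitted objects (lower is better):

# Compare the Weibull and exponential fits by AIC
c(weibull = weib_mod$AIC, exponential = exp_mod$AIC)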

Example: Cox-Proportional Hazards Model

# Cox Proportional Hazards

cox <- coxph(Surv(futime, status) ~ trt + stage + bili + riskscore, data = udca1)
summary(cox)
## Call:
## coxph(formula = Surv(futime, status) ~ trt + stage + bili + riskscore, 
##     data = udca1)
## 
##   n= 169, number of events= 72 
##    (1 observation deleted due to missingness)
## 
##               coef exp(coef) se(coef)      z Pr(>|z|)    
## trt       -1.01636   0.36191  0.25628 -3.966 7.31e-05 ***
## stage      0.03453   1.03514  0.28768  0.120   0.9045    
## bili       0.08971   1.09385  0.07827  1.146   0.2518    
## riskscore  0.30950   1.36275  0.16142  1.917   0.0552 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##           exp(coef) exp(-coef) lower .95 upper .95
## trt          0.3619     2.7631    0.2190    0.5981
## stage        1.0351     0.9661    0.5890    1.8192
## bili         1.0939     0.9142    0.9383    1.2752
## riskscore    1.3628     0.7338    0.9932    1.8699
## 
## Concordance= 0.684  (se = 0.031 )
## Likelihood ratio test= 30.32  on 4 df,   p=4e-06
## Wald test            = 29.86  on 4 df,   p=5e-06
## Score (logrank) test = 31.47  on 4 df,   p=2e-06
# Getting a plot of the model
autoplot(survfit(cox))

# Standard Kaplan Meier plot to compare the treatments
kmfit <- survfit(Surv(futime, status) ~ trt, data = udca1)
autoplot(kmfit)

# Can we improve the concordance by eliminating non significant covariates?
cox2 <- coxph(Surv(futime, status) ~ trt, data = udca1)
summary(cox2)
## Call:
## coxph(formula = Surv(futime, status) ~ trt, data = udca1)
## 
##   n= 170, number of events= 72 
## 
##        coef exp(coef) se(coef)     z Pr(>|z|)    
## trt -0.8624    0.4222   0.2443 -3.53 0.000416 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##     exp(coef) exp(-coef) lower .95 upper .95
## trt    0.4222      2.369    0.2615    0.6815
## 
## Concordance= 0.614  (se = 0.029 )
## Likelihood ratio test= 12.99  on 1 df,   p=3e-04
## Wald test            = 12.46  on 1 df,   p=4e-04
## Score (logrank) test = 13.23  on 1 df,   p=3e-04

Survival Trees

A decision tree fitted on survival data, allowing covariates to be incorporated.

Recap of Regression Trees

  • Prune the tree to prevent overfitting and reduce complexity: remove sections that provide little power to classify instances.
  • Works best on large datasets.
  • Non-linear model (the Cox proportional hazards model is linear, i.e. a single line/curve/surface is sufficient to estimate the quantitative response).

Important arguments

  • formula: the survival object and covariates.
  • num.trees: number of trees generated (their predicted survival curves are averaged below).
  • mtry: number of variables to possibly split on at each node.
  • importance: which variable importance measure to compute, i.e. how informative each variable is for choosing node splits.
  • splitrule: how the split cutoffs are calculated (the default for survival forests is the log-rank test).
  • seed: allows you to reproduce your results.

NOTE: The outcome of this model is very different from what we saw previously with the Cox proportional hazards model. (The sample size might be a bit small for a survival tree…)

### Survival Trees

library(ranger)

streefit = ranger(Surv(time, status) ~ treatment + age + sh + size + index,
                  data = prost,
                  importance = "permutation",
                  splitrule = "extratrees",
                  seed = 43)

# Average the per-observation survival curves (to create one average survival curve)

death_times = streefit$unique.death.times
surv_prob = data.frame(streefit$survival)
avg_prob = sapply(surv_prob,mean) # can also use the median

# Plot the survival tree model - average survival curve
plot(death_times, avg_prob,
     type = "s",
     ylim = c(0, 1),
     col = "red", lwd = 2, bty = "n",
     ylab = "Survival Probability", xlab = "Time in Months",
     main = "Survival Tree Model\nAverage Survival Curve")

Comparison Plot

## Comparing the main models we discussed so far

# Set up for ggplot
kmdf <- data.frame(mykm1$time,
                   mykm1$surv,
                   KM = rep("KM", 36))
names(kmdf) <- c("Time in Months", "Survival Probability", "Model")

coxdf <- data.frame(coxfit$time,
                    coxfit$surv,
                    COX = rep("COX", 36))
names(coxdf) <- c("Time in Months", "Survival Probability", "Model")

rfdf <- data.frame(death_times,
                   avg_prob,
                   RF = rep("RF", 36))
names(rfdf) <- c("Time in Months", "Survival Probability", "Model")

plotdf <- rbind(kmdf, coxdf, rfdf)

p <- ggplot(plotdf, aes(x = plotdf[, 1],
                        y = plotdf[, 2],
                        color = plotdf[, 3]))

p + geom_step(size = 2) +
  labs(x = "Time in Months",
       y = "Survival Probability",
       color = "",
       title = "Comparison Plot of\n3 Main Survival Models") +
  theme(plot.title = element_text(hjust = 0.5))

Managing the Time Variable in a Survival Dataset

Outlier Detection & Missing Value Imputation in Survival Analysis

Questions:

How is the 95% confidence interval calculated in a non-parametric setting? (Use the normal distribution method?)
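
A partial pointer on this question, as a hedged sketch: in the non-parametric setting, survfit's conf.type argument controls how the pointwise 95% interval is constructed. The default "log" interval is built on the log-survival scale (using the variance from Greenwood's formula); "plain" gives the simple normal-approximation interval directly on \(S(t)\), and "log-log" works on the \(\log(-\log S(t))\) scale:

# Compare confidence-interval constructions for the same Kaplan-Meier fit
fit_log   <- survfit(Surv(time, status) ~ 1, data = prost, conf.type = "log")
fit_plain <- survfit(Surv(time, status) ~ 1, data = prost, conf.type = "plain")
summary(fit_log,   times = c(24, 48))
summary(fit_plain, times = c(24, 48))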