
Censoring of Data

The Data

### Survival Analysis in R ###

setwd("~/Documents/STATS/Survival Analysis")
prost <- read.table("prostate_cancer.txt", header = F, sep = " ")

# Providing column names
colnames(prost) = c("patient", "treatment", "time", "status", "age", "sh", "size", "index")

head(prost)
##   patient treatment time status age   sh size index
## 1       1         1   65      0  67 13.4   34     8
## 2       2         2   61      0  60 14.6    4    10
## 3       3         2   60      0  77 15.6    3     8
## 4       4         1   58      0  64 16.2    6     9
## 5       5         2   51      0  65 14.1   21     9
## 6       6         1   51      0  61 13.5    8     8

Estimating the Survival Function

Non-Parametric Models (no assumptions about the underlying distribution or its parameters)

  • Kaplan-Meier
  • log-rank test

The Survival Function:

Probability of surviving beyond a certain point.

  • \(S(t)\): The survival function estimate
  • \(P\): Probability
  • \(T\): Survival time
  • \(t\): Specified time point
  • \(F(t)\): Cumulative distribution function (related to the hazard function \(h(t)\) or \(\lambda(t)\))

  • \(S(t) = P(T>t) = 1 - F(t)\)
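
A minimal sketch of this relationship, assuming no censoring and a small set of hypothetical event times (with censored data the Kaplan-Meier estimator below is used instead):

# S(t) = 1 - F(t): empirical survival function built from the empirical CDF
t_obs <- c(2, 5, 8, 8, 12, 15)      # hypothetical, fully observed event times
F_hat <- ecdf(t_obs)                # empirical cumulative distribution function
S_hat <- function(t) 1 - F_hat(t)   # empirical survival function
S_hat(c(5, 10))                     # P(T > 5) and P(T > 10)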

The Hazard rate

  • \(h(t)\): The risk of an event at time point \(t\)
  • \(P\): The probability that a subject has an event

Can be understood as the instantaneous death rate: divide the conditional probability that an individual’s survival time (\(T\)) falls within a short interval, given survival to the start of that interval, by the interval length, and let the interval length go to zero.

  • \(h(t) = \lim_{\delta t \to 0}\left\{\frac{P(t \le T < t + \delta t \mid T \ge t)}{\delta t}\right\}\)

The Integrated/Cumulative Hazard

  • \(H(t)\) or \(\Lambda(t)\): The cumulative hazard
  • \(t\): The number of time units that have already passed
  • \(S(t)\): The survival function; the cumulative hazard accumulates the hazard rate over every time unit passed up to \(t\)

  • \(H(t) = -\log(S(t))\)
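
As a quick check of this identity, a sketch using the prost data loaded above: the cumulative hazard can be obtained as minus the log of the Kaplan-Meier survival estimate, and (in recent versions of the survival package) is also stored directly in the cumhaz component of a survfit object (a Nelson-Aalen-type estimate, so the two columns agree only approximately):

library(survival)
km <- survfit(Surv(time, status) ~ 1, data = prost)
head(cbind(time     = km$time,
           H_from_S = -log(km$surv),   # H(t) = -log(S(t))
           H_stored = km$cumhaz))      # Nelson-Aalen-style estimate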

The Survival Object

A vector with the survival time for each subject. Note those with “+”s represent censored data.

Inputs in the Surv function:

  • event: binary status indicator (0 = alive/censored, 1 = dead/event)
  • time: time at which the event/censoring occurs
  • time2: for (start, end] interval data, i.e. you need to provide the start and end times of the interval (see the sketch after the Surv output below)
  • type: specify the type of censoring
library(survival)

## The Fundamental Survival Function
## The Survival Object
Surv(prost$time, prost$status)
##  [1] 65+ 61+ 60+ 58+ 51+ 51+ 14  43+ 16+ 52+ 59  55+ 68+ 51   2+ 67+ 66+ 66  28+
## [20] 50  69  67  65+ 24+ 45+ 64+ 61+ 26  42  57+ 70+  5+ 54+ 36  70  67+ 23+ 52 
## [39] 65+ 24  45+ 64+ 61  21  22  37+ 51  12+ 67+ 61+ 66  65+ 50  35  67  65+ 33+
## [58] 45+ 62+ 61+ 36  42  57+
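
A hedged sketch of the time2 and type arguments, using made-up (start, stop] intervals rather than the prostate data:

# Counting-process (start, stop] notation: each subject is observed over an interval
start <- c(0, 0, 3, 5)
stop  <- c(4, 6, 9, 12)
event <- c(1, 0, 1, 1)   # 0 = censored at the end of its interval
Surv(time = start, time2 = stop, event = event, type = "counting")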

The Kaplan Meier Estimator

The Kaplan-Meier estimate is a declining step plot.

Variables required in the data: the survival/censoring time and the event status (here, time and status).

# Simple Kaplan Meier plot - modeling the survival function without a grouping variable
mykm1 <- survfit(Surv(time, status) ~ 1, data = prost)
plot(mykm1, mark.time = TRUE)

Note each vertical slash represents the time point at which a censored subject left the experiment.
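
To read the estimated survival probabilities (with their confidence intervals) at specific time points rather than off the plot, one option is summary() with the times argument:

# Survival estimates at selected months
summary(mykm1, times = c(12, 24, 36, 48, 60))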

# Stratified KM plot - modeling the survival function with treatment groups
mykm2 <- survfit(Surv(time, status) ~ treatment, data = prost) 
plot(mykm2, mark.time = TRUE)

# Advanced plot
library(ggfortify)
## Loading required package: ggplot2
autoplot(mykm2)

Log Rank Test (non-parametric)

Compares the estimates of the hazard function of the two groups at each observed time point.

Test statistic (asymptotically standard normal): \(Z = \frac{\sum_{j=1}^J(O_{1j} - E_{1j})}{\sqrt{\sum_{j=1}^J V_j}} \to N(0,1)\)

Null Hypothesis: The hazard rates of the two groups are equal (rejected if the observed test statistic \(Z\) is too extreme under the standard normal distribution, i.e. if the p-value is small).

Assumptions: censoring is non-informative (unrelated to group membership); the test has the most power when the hazards are proportional over time.

# Log Rank Test
survival::survdiff(survival::Surv(prost$time, prost$status) ~ prost$treatment)
## Call:
## survival::survdiff(formula = survival::Surv(prost$time, prost$status) ~ 
##     prost$treatment)
## 
##                    N Observed Expected (O-E)^2/E (O-E)^2/V
## prost$treatment=1 23        7     5.13     0.686     0.958
## prost$treatment=2 40       16    17.87     0.197     0.958
## 
##  Chisq= 1  on 1 degrees of freedom, p= 0.3
# Note: the test can also be weighted to emphasise particular parts of the follow-up
# (see the rho sketch below)
# The p-value is non-significant, so we do not reject the null hypothesis
# survdiff reports the chi-squared statistic and the variance
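
The weighting mentioned above is controlled by the rho argument of survdiff: rho = 0 gives the standard log-rank test, while rho = 1 weights each time point by the survival estimate and therefore emphasises earlier events (the Peto & Peto modification). A quick sketch on the same data:

# Weighted log-rank test (rho = 1 puts more weight on earlier events)
survdiff(Surv(time, status) ~ treatment, data = prost, rho = 1)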

Example

library(SMPracticals)
## Loading required package: ellipse
## 
## Attaching package: 'ellipse'
## The following object is masked from 'package:graphics':
## 
##     pairs
## 
## Attaching package: 'SMPracticals'
## The following objects are masked from 'package:survival':
## 
##     aml, pbc
data("motorette")
#motorette data
#y: Time to failure
#x: Temp
head(motorette)
##     x cens    y
## 1 150    0 8064
## 2 150    0 8064
## 3 150    0 8064
## 4 150    0 8064
## 5 150    0 8064
## 6 150    0 8064
plot(x = motorette$x, y = motorette$y)

# kaplan meier plot not taking the temperature into account
mykm1_2 <- survfit(Surv(motorette$y , motorette$cens) ~ 1, data = motorette)
plot(mykm1_2, mark.time = TRUE)

# kaplan meier plot taking the temperature into account
temp_binary = ifelse(motorette$x < 180, "low", "high"); temp_binary
##  [1] "low"  "low"  "low"  "low"  "low"  "low"  "low"  "low"  "low"  "low" 
## [11] "low"  "low"  "low"  "low"  "low"  "low"  "low"  "low"  "low"  "low" 
## [21] "high" "high" "high" "high" "high" "high" "high" "high" "high" "high"
## [31] "high" "high" "high" "high" "high" "high" "high" "high" "high" "high"
motorette$temp_binary = temp_binary

survival::survdiff(survival::Surv(motorette$y, motorette$cens) ~ motorette$temp_binary)
## Call:
## survival::survdiff(formula = survival::Surv(motorette$y, motorette$cens) ~ 
##     motorette$temp_binary)
## 
##                             N Observed Expected (O-E)^2/E (O-E)^2/V
## motorette$temp_binary=high 20       10     4.14      8.32      15.7
## motorette$temp_binary=low  20        7    12.86      2.67      15.7
## 
##  Chisq= 15.7  on 1 degrees of freedom, p= 7e-05
mykm2 <- survfit(Surv(y, cens) ~ temp_binary, data = motorette)

plot(mykm2, mark.time = TRUE)

The Cox Proportional Hazards Model

Hazard Rate: The instantaneous rate at which the event occurs at time \(t\), given survival up to \(t\).

External factors (covariates) can affect the probability of an event; these can now be included in the model.

The Cox Proportional Hazards Model equation

  • \(\lambda\) or \(h\): The hazard
  • \(i\): The subject
  • \(t\): The time point
  • \(\lambda_0(t)\) or \(h_0(t)\): The baseline hazard/risk rate at time point \(t\), regardless of the subject’s covariates (i.e. the hazard every subject would share if all covariate effects were zero)

  • \(\lambda(t|X_i) = \lambda_0(t)\exp(X_i\beta)\)
  • \(h_i(t) = h_0(t)e^{X_i\beta}\)

The Covariates:

  • Can be of any data type
  • The weight/significance of each covariate is controlled by its \(\beta\) coefficient
  • It is possible to estimate \(\beta\) without specifying the baseline hazard function, as long as the covariate effects are constant over time (the proportional hazards assumption)

Implementing the Cox Proportional Hazards Model

### Cox Proportional Hazards Model

cox <- coxph(Surv(time, status) ~ treatment + age + sh+ size + index, data = prost)

summary(cox)
## Call:
## coxph(formula = Surv(time, status) ~ treatment + age + sh + size + 
##     index, data = prost)
## 
##   n= 63, number of events= 23 
## 
##               coef exp(coef) se(coef)      z Pr(>|z|)    
## treatment -0.69695   0.49810  0.53471 -1.303   0.1924    
## age       -0.08361   0.91979  0.03692 -2.265   0.0235 *  
## sh        -0.23664   0.78927  0.18511 -1.278   0.2011    
## size       0.06786   1.07021  0.02833  2.395   0.0166 *  
## index      0.77410   2.16865  0.18803  4.117 3.84e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##           exp(coef) exp(-coef) lower .95 upper .95
## treatment    0.4981     2.0076    0.1747    1.4206
## age          0.9198     1.0872    0.8556    0.9888
## sh           0.7893     1.2670    0.5491    1.1345
## size         1.0702     0.9344    1.0124    1.1313
## index        2.1686     0.4611    1.5002    3.1350
## 
## Concordance= 0.866  (se = 0.037 )
## Likelihood ratio test= 38.4  on 5 df,   p=3e-07
## Wald test            = 26.47  on 5 df,   p=7e-05
## Score (logrank) test = 34.47  on 5 df,   p=2e-06

The Concordance

The probability of agreement of two randomly chosen observations.

The probability of correctly choosing the observation with the higher risk of an event occurring. We want the concordance to be as close to 1 as possible (anything below 0.5 indicates a very poor model).

  • Kendall’s \(\tau\) is used for continuous variables
  • Logistic regression is used for binary/grouped covariates
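
The concordance reported in summary(cox) can also be extracted directly from the fitted model; a small sketch using the concordance() function in the survival package:

# Concordance (and its standard error) computed from the fitted Cox model
concordance(cox)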

R Squared

Fraction of the variation in survival that is explained by the covariates.

Likelihood ratio test / Wald Test / Logrank test

Do the covariates jointly help predict the hazard? The null hypothesis is that all coefficients are zero; reject it if the p-values are significant. df = number of covariates in the model.
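
In addition to the three global tests shown in the summary, a hedged sketch of sequential (per-covariate) likelihood ratio tests via the anova method for coxph fits:

# Analysis of deviance table: each covariate is added to the model in turn
anova(cox)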

# Getting a plot of the model
autoplot(survfit(cox))

# alternative
coxfit = survfit(cox)
autoplot(coxfit)

Aalen’s Additive Regression Model

The influence of covariates (on the probability of an event occurring) can vary depending on the time point in the study. To model this time-dependent effect of the covariates we use Aalen’s additive regression model.

NOTE: Results at the tail of the dataset can become inaccurate, as few subjects are left in the study. This can be accounted for with the taper argument, which smooths out the tail end of the curve, or with qrtol and nmin, which truncate the estimate before the last death events.

# Aalens additive regression

aalen <- aareg(Surv(time, status) ~ treatment + age + sh+ size + index, data = prost)
autoplot(aalen)
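
A sketch of the tail-control arguments mentioned above, with purely illustrative values (per the note, taper smooths the tail of the curve and nmin truncates the estimate once too few subjects remain at risk):

# Aalen model with a smoothed/truncated tail (illustrative argument values)
aalen2 <- aareg(Surv(time, status) ~ treatment + age + sh + size + index,
                data = prost, taper = 10, nmin = 5)
autoplot(aalen2)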

Parametric Theory

What is the difference between parametric, non-parametric and semi-parametric models?

Comparing the different models: the assumed distribution describes how the observations are distributed in time, i.e. when the events occur.

Common distributions used in Survival Analysis:

  • Exponential: The probability of an event is the same at any time point.
  • Weibull: Allows you to set the time point around which the events are most likely to occur (usually towards the end).
  • Gaussian: Event probability is highest around the centre of the survival curve (in the step plot, the steepest decrease in survival probability occurs somewhere near the middle).

NOTE: These distributions describe when the events occur, i.e. where the drops appear along the survival curve.

Parametric Regression Models in Survival Analysis

NOTE: The classic non-parametric step plot is also included in the plots below.

library(flexsurv)

weib_mod = flexsurvreg(Surv(time, status) ~ treatment + age + sh+ size + index, data = prost, dist = "weibull")

exp_mod = flexsurvreg(Surv(time, status) ~ treatment + age + sh+ size + index, data = prost, dist = "exp")

plot(weib_mod)

lines(exp_mod, col="blue")
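
One simple way to compare the two parametric fits is via AIC, which flexsurvreg stores on the fitted objects (lower is better):

# Compare the Weibull and exponential fits by AIC
c(weibull = weib_mod$AIC, exponential = exp_mod$AIC)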

Example: Cox-Proportional Hazards Model

# Cox Proportional Hazards

cox <- coxph(Surv(futime, status) ~ trt + stage + bili + riskscore, data = udca1)
summary(cox)
## Call:
## coxph(formula = Surv(futime, status) ~ trt + stage + bili + riskscore, 
##     data = udca1)
## 
##   n= 169, number of events= 72 
##    (1 observation deleted due to missingness)
## 
##               coef exp(coef) se(coef)      z Pr(>|z|)    
## trt       -1.01636   0.36191  0.25628 -3.966 7.31e-05 ***
## stage      0.03453   1.03514  0.28768  0.120   0.9045    
## bili       0.08971   1.09385  0.07827  1.146   0.2518    
## riskscore  0.30950   1.36275  0.16142  1.917   0.0552 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##           exp(coef) exp(-coef) lower .95 upper .95
## trt          0.3619     2.7631    0.2190    0.5981
## stage        1.0351     0.9661    0.5890    1.8192
## bili         1.0939     0.9142    0.9383    1.2752
## riskscore    1.3628     0.7338    0.9932    1.8699
## 
## Concordance= 0.684  (se = 0.031 )
## Likelihood ratio test= 30.32  on 4 df,   p=4e-06
## Wald test            = 29.86  on 4 df,   p=5e-06
## Score (logrank) test = 31.47  on 4 df,   p=2e-06
# Getting a plot of the model
autoplot(survfit(cox))

# Standard Kaplan Meier plot to compare the treatments
kmfit <- survfit(Surv(futime, status) ~ trt, data = udca1)
autoplot(kmfit)

# Can we improve the concordance by eliminating non significant covariates?
cox2 <- coxph(Surv(futime, status) ~ trt, data = udca1)
summary(cox2)
## Call:
## coxph(formula = Surv(futime, status) ~ trt, data = udca1)
## 
##   n= 170, number of events= 72 
## 
##        coef exp(coef) se(coef)     z Pr(>|z|)    
## trt -0.8624    0.4222   0.2443 -3.53 0.000416 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##     exp(coef) exp(-coef) lower .95 upper .95
## trt    0.4222      2.369    0.2615    0.6815
## 
## Concordance= 0.614  (se = 0.029 )
## Likelihood ratio test= 12.99  on 1 df,   p=3e-04
## Wald test            = 12.46  on 1 df,   p=4e-04
## Score (logrank) test = 13.23  on 1 df,   p=3e-04

Survival Trees

A decision tree fitted on survival data, allowing covariates to be incorporated.

Recap of Regression Trees

  • Prune the tree to prevent overfitting and reduce complexity: remove sections that provide little power to classify instances.
  • Works best on large datasets.
  • Non-linear model (the Cox proportional hazards model is linear, i.e. a single line/curve/surface is sufficient to estimate the quantitative response).

Important arguments

  • formula: the survival object and covariates.
  • num.trees: number of trees generated (their predicted survival curves are averaged below).
  • mtry: number of variables to possibly split on at each node.
  • importance: which variable importance measure to compute, i.e. how informative each variable is for choosing node splits.
  • splitrule: how the split cutoffs are calculated (the default for survival forests is the log-rank test).
  • seed: allows you to reproduce your results.

NOTE: The outcome of this model is very different from what we saw previously with the Cox proportional hazards model. (The sample size might be a bit small for a survival tree…)

### Survival Trees

library(ranger)

streefit = ranger(Surv(time, status) ~ treatment + age + sh + size + index,
                  data = prost,
                  importance = "permutation",
                  splitrule = "extratrees",
                  seed = 43)

# Average the per-observation survival curves (to create one average survival curve)

death_times = streefit$unique.death.times
surv_prob = data.frame(streefit$survival)
avg_prob = sapply(surv_prob,mean) # can also use the median

# Plot the survival tree model - average survival curve
plot(death_times, avg_prob,
     type = "s",
     ylim = c(0, 1),
     col = "red", lwd = 2, bty = "n",
     ylab = "Survival Probability", xlab = "Time in Months",
     main = "Survival Tree Model\nAverage Survival Curve")

Comparison Plot

## Comparing the main models we discussed so far

# Set up for ggplot
kmdf <- data.frame(mykm1$time,
                   mykm1$surv,
                   KM = rep("KM", 36))
names(kmdf) <- c("Time in Months", "Survival Probability", "Model")

coxdf <- data.frame(coxfit$time,
                    coxfit$surv,
                    COX = rep("COX", 36))
names(coxdf) <- c("Time in Months", "Survival Probability", "Model")

rfdf <- data.frame(death_times,
                   avg_prob,
                   RF = rep("RF", 36))
names(rfdf) <- c("Time in Months", "Survival Probability", "Model")

plotdf <- rbind(kmdf, coxdf, rfdf)

p <- ggplot(plotdf, aes(x = plotdf[, 1],
                        y = plotdf[, 2],
                        color = plotdf[, 3]))

p + geom_step(size = 2) +
  labs(x = "Time in Months",
       y = "Survival Probability",
       color = "",
       title = "Comparison Plot of\n3 Main Survival Models") +
  theme(plot.title = element_text(hjust = 0.5))

Managing the Time Variable in a Survival Dataset

Outlier Detection & Missing Value Imputation in Survival Analysis

Questions:

How is the 95% confidence interval calculated in a non-parametric setting? (Use the normal distribution method?)
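
A partial pointer on this question, as a hedged sketch: in the non-parametric setting, survfit's conf.type argument controls how the pointwise 95% interval is constructed. The default "log" interval is built on the log-survival scale (using the variance from Greenwood's formula); "plain" gives the simple normal-approximation interval directly on \(S(t)\), and "log-log" works on the \(\log(-\log S(t))\) scale:

# Compare confidence-interval constructions for the same Kaplan-Meier fit
fit_log   <- survfit(Surv(time, status) ~ 1, data = prost, conf.type = "log")
fit_plain <- survfit(Surv(time, status) ~ 1, data = prost, conf.type = "plain")
summary(fit_log,   times = c(24, 48))
summary(fit_plain, times = c(24, 48))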