Survival Analysis with R

Statistical Modelling with R

telco_data <- readr::read_csv("telco.csv")

1 Basic Concepts

In survival analysis we analyse time-to-event data and try to estimate the underlying distribution of times for events to occur.

1.1 Introduction

For the purposes of this workshop, we focus on a single event type, but the topic is wide and broadly applicable.

1.1.1 Workshop Materials

All materials for this workshop are available in my standard GitHub repo:

https://github.com/kaybenleroll/dublin_r_workshops

The content of this workshop is partly based on the book “Applied Survival Analysis Using R” by Dirk F. Moore. The data from this book is available from CRAN via the package ‘asaur’.

1.2 Basic Principles

1.2.1 Data Censoring and Truncation

In the problems we work on, our data observation process does not allow us to fully observe the variable in question. Instead, our observations are often right-censored - that is, we know the value in question is at least the observed value, but it may also be higher.

Expressing this formally, suppose \(T^*\) is the time to event and \(U\) is the time to censoring. Then our observed time \(T\) and censoring indicator \(\delta\) are given by

\[\begin{eqnarray*} T &=& \min(T^*, U) \\ \delta &=& I[T^* \leq U] \end{eqnarray*}\]
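To make the two definitions concrete, here is a minimal simulation sketch. The exponential event times, uniform censoring times and all parameter values are assumptions chosen purely for illustration; they do not come from the workshop data.

```r
# Simulate right-censored observations (all distributions and values are illustrative).
set.seed(42)
n <- 1000

t_star <- rexp(n, rate = 0.10)           # true event times T*, assumed exponential
u      <- runif(n, min = 0, max = 20)    # censoring times U, assumed uniform

obs_time <- pmin(t_star, u)              # T = min(T*, U)
delta    <- as.integer(t_star <= u)      # delta = I[T* <= U]; 1 means the event was observed

head(data.frame(t_star, u, obs_time, delta))
mean(delta)                              # proportion of events actually observed
```

In real data we only ever see `obs_time` and `delta`; the columns `t_star` and `u` are available here only because we simulated them.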

A less common phenomenon is left-censoring - where we observe the event to be at most a given duration. This may happen in medical studies where continual observation of the patients is not possible or feasible.

1.2.2 Hazard and Survival Functions

Our key goal is to find the survival distribution - the distribution of times from some given start point to the time of the event.

Two common ways of specifying this distribution are the survival function, \(S(t)\) and the hazard function, \(h(t)\).

\(S(t)\) is the probability of surviving to time \(t\), so is defined as follows:

\[ S(t) = P(T > t), \, 0 \leq t < \infty \]

Thus, \(S(0) = 1\) and decreases monotonically with time \(t\). As it is a probability it is non-negative.

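We can also characterise the distribution through the hazard function, written \(h(t)\) above and \(\lambda(t)\) from here on. It is the instantaneous failure rate: the probability, per unit time, of failing in a short interval just after \(t\), given survival up to \(t\). It is a rate rather than a probability, so it can exceed 1. More formally,

\[ \lambda(t) = \lim_{\delta \to 0} \frac{P(t < T \leq t + \delta \; | \, T > t)}{\delta} \]

Here \(\delta\) is a small time increment, not the censoring indicator. The hazard function is also called the intensity function or the force of mortality.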

1.2.3 Cumulative Functions

Some analyses are easier to carry out using the cumulative hazard function:

\[ \Lambda(t) = \int^t_0 \lambda(u) \, du \]

As a consequence of this definition we also have

\[ S(t) = \exp(- \Lambda(t)) \]
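We can check this relationship numerically. The sketch below assumes a Weibull distribution with arbitrarily chosen shape and scale, and uses the standard identity that the hazard \(\lambda(t)\) equals the density divided by the survival function. The helper names `hazard` and `cum_hazard` are our own.

```r
# Assumed Weibull parameters, chosen only for illustration.
shape <- 1.5
scale <- 10

# Hazard as density / survival function.
hazard <- function(t) {
  dweibull(t, shape, scale) / pweibull(t, shape, scale, lower.tail = FALSE)
}

# Cumulative hazard: numerical integral of the hazard from 0 to t.
cum_hazard <- function(t) {
  integrate(hazard, lower = 0, upper = t, rel.tol = 1e-8)$value
}

t_eval <- 5
surv_from_hazard <- exp(-cum_hazard(t_eval))                          # exp(-Lambda(t))
surv_direct      <- pweibull(t_eval, shape, scale, lower.tail = FALSE) # S(t) = P(T > t)

all.equal(surv_from_hazard, surv_direct, tolerance = 1e-6)
```

The two routes to \(S(t)\) should agree up to numerical integration error.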

1.2.4 Mean and Median Survival

Finally, two common statistics for survival are the mean survival time and the median survival time:

The mean survival time is the expected value of the distribution, using standard probability definitions:

\[ \mu = E(T) = \int^\infty_0 t f(t) \, dt = \int^\infty_0 S(t) \, dt \]

The median survival time is the value \(t_{\text{med}}\) such that

\[ S(t_{\text{med}}) = 0.5 \]
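Both statistics can be computed directly from a survival function. The sketch below again assumes a Weibull distribution with arbitrary parameters: the mean comes from integrating \(S(t)\), and the median from solving \(S(t) = 0.5\) numerically. The search interval `c(0, 100)` is an assumption that happens to bracket the median for these parameters.

```r
# Assumed Weibull parameters, chosen only for illustration.
shape <- 1.5
scale <- 10

surv <- function(t) pweibull(t, shape, scale, lower.tail = FALSE)

# Mean survival time: area under the survival curve.
mean_surv <- integrate(surv, lower = 0, upper = Inf)$value

# Median survival time: root of S(t) - 0.5 on a bracketing interval.
median_surv <- uniroot(function(t) surv(t) - 0.5,
                       interval = c(0, 100), tol = 1e-9)$root

# Closed-form Weibull values for comparison.
c(mean_numeric = mean_surv,     mean_exact = scale * gamma(1 + 1 / shape))
c(median_numeric = median_surv, median_exact = scale * log(2)^(1 / shape))
```

With censored data the mean is often poorly determined because the tail of \(S(t)\) is not observed, which is one reason the median is reported so frequently.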

1.2.5 Survival Analysis

The survival function, also known as a survivor function or reliability function, is a property of any random variable that maps a set of events, usually associated with mortality or failure of some system, onto time.

It captures the probability that the system will survive beyond a specified time.

1.2.5.1 Reliability function

The term reliability function is common in engineering while the term survival function is used in a broader range of applications, including human mortality.
Another name for the survival function is the complementary cumulative distribution function.

1.2.5.2 Definition

Let \(T\) be a continuous random variable with cumulative distribution function \(F(t)\) on the interval \([0,\infty)\).

Its survival function or reliability function is:

\[R(t) = P(T > t)\]

\[R(t) = \int_t^{\infty} f(u)\,du\]

\[R(t) = 1-F(t).\]

Properties of the Survival Function

Every survival function \(R(t)\) is monotonically non-increasing, i.e. \(R(u) \le R(t)\) for all \(u > t\).
The time \(t = 0\) represents some origin, typically the beginning of a study or the start of operation of some system.
\(R(0)\) is commonly unity but can be less to represent the probability that the system fails immediately upon operation.
The survivor function is right-continuous.

1.2.6 The Cox Proportional-Hazards Model

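The Cox proportional-hazards model is the most common model used to determine the effects of covariates on survival. For subject \(i\) with covariates \(x_{i1}, \dots, x_{ik}\), the hazard is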

\[ h_i(t) = h_0(t) \exp(\beta_{1} x_{i1} + \beta_{2} x_{i2} + \dots + \beta_{k} x_{ik}) \]

It is a semi-parametric model:

The baseline hazard function is unspecified.
The effects of the covariates are multiplicative.
It doesn’t make arbitrary assumptions about the shape or form of the baseline hazard function.

The Proportional Hazards Assumption

Covariates multiply the hazard by some constant, e.g. a drug may halve a subject’s risk of death at any time \(t\).
The effect is the same at any time point.
Violating the PH assumption can seriously invalidate your model!
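As a rough sketch of how such a model might be fitted and checked in R, the example below uses the `survival` package and its bundled `lung` dataset. Both are assumptions made for illustration; they are not the workshop's own data, and the choice of `age` and `sex` as covariates is arbitrary.

```r
library(survival)

# `lung` ships with the survival package; status is coded 1 = censored, 2 = dead.
fit <- coxph(Surv(time, status) ~ age + sex, data = lung)

# Estimated coefficients (the betas) and the corresponding hazard ratios.
coef(fit)
exp(coef(fit))   # multiplicative effect on the hazard per unit change in each covariate

# Test of the proportional hazards assumption based on scaled Schoenfeld residuals.
cox.zph(fit)
```

A hazard ratio below 1 means the covariate is associated with a lower hazard. Small p-values from `cox.zph()` suggest that the effect of a covariate changes over time, i.e. that the PH assumption may not hold for it.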

1.2.7 The Survival Function (Survival curve)

\[ S(t) = \Pr(T > t) \]

The survival function (\(S\)) is the probability that the time of death (\(T\)) is greater than some specified time (\(t\)).

It is composed of:

The underlying hazard function (how the risk of death per unit time changes over time at baseline covariates).
The effect parameters (how the hazard varies in response to the covariates).
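To see how the baseline hazard and the effect parameters combine into a survival curve, here is a minimal sketch that predicts curves for two hypothetical subjects from a fitted Cox model. The `survival` package, the `lung` dataset and the covariate values in `new_subjects` are all assumptions made for illustration.

```r
library(survival)

# Fit a Cox model on the bundled `lung` data (status: 1 = censored, 2 = dead).
fit <- coxph(Surv(time, status) ~ age + sex, data = lung)

# Two hypothetical subjects of the same age, differing only in sex (1 = male, 2 = female).
new_subjects <- data.frame(age = c(60, 60), sex = c(1, 2))

# survfit() combines the estimated baseline hazard with each subject's covariate effect.
curves <- survfit(fit, newdata = new_subjects)

# Predicted survival probabilities at roughly six months and one year (time is in days).
summary(curves, times = c(180, 365))
```

Calling `plot(curves)` draws the two predicted curves. Because the model assumes proportional hazards, the curves can never cross.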

1.2.8 Hazard Probability

In the medical world, doctors often want to understand which treatments help patients survive longer and which have no effect at all (or worse). In the business world, the equivalent concern is when customers stop.
This is particularly true of businesses that have a well-defined beginning and end to the customer relationship, namely subscription-based relationships. These relationships are found in a wide range of industries, such as insurance, communication, cable television, newspaper/magazine subscription, banking, and electricity providers in competitive markets.
The basis of survival data mining is the hazard probability, the chance that someone who has survived for a certain length of time (called the customer tenure) is going to stop, cancel, or expire before the next unit of time.
This definition assumes that time is discrete, and such discrete time intervals, whether days, weeks, or months, often fit business needs. By contrast, traditional survival analysis in statistics usually assumes that time is continuous.
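The hazard probability can be estimated directly from tenure data. The sketch below uses a small, entirely hypothetical `customers` table (tenure in whole months, plus a flag for whether the customer stopped), and estimates the hazard at each tenure as the number of stops divided by the number of customers still active at that tenure.

```r
# Hypothetical toy data: tenure in whole months and whether the customer stopped (1) or
# was still active when last observed (0, i.e. censored).
customers <- data.frame(
  tenure  = c(1, 2, 2, 3, 3, 3, 4, 5, 5, 6),
  stopped = c(1, 1, 0, 1, 1, 0, 1, 0, 1, 0)
)

tenures <- sort(unique(customers$tenure))

# At each tenure t: customers who reached t (at risk) and those who stopped at t.
at_risk <- sapply(tenures, function(t) sum(customers$tenure >= t))
stops   <- sapply(tenures, function(t) sum(customers$tenure == t & customers$stopped == 1))

hazard   <- stops / at_risk        # discrete hazard probability at each tenure
survival <- cumprod(1 - hazard)    # probability of surviving beyond each tenure

data.frame(tenure = tenures, at_risk, stops, hazard, survival)
```

Censored customers count towards the at-risk population up to their last observed tenure but never contribute a stop. Multiplying the terms \(1 - h(t)\) together gives the survival curve, which is the same product form used by the Kaplan-Meier estimator.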