Survival Analysis

Purpose: To equip students with skills required to estimate survival functions and perform calculations on real time-to-event data.

Objectives

At the end of the course you should be able to:

  1. Define the terms such as probability mass/density functions, survivor, hazard and cumulative hazard functions and derive all the others, given one of them.
  2. Identify left, right and interval censoring.
  3. Determine the Kaplan–Meier survival estimator, and sketch its graph for given data.
  4. Use the log likelihood function to determine an unknown parameter for a probability distribution, given a sample from the distribution.
  5. Apply the methods available for comparison of two groups, including comparison of survival curves and the log-rank test.
  6. Apply regression models, including the Cox proportional hazards model.
  7. Explain the accelerated failure time model.

Course Description

Definition: Survivor, hazard and cumulative hazard functions. Left, right and interval censoring. Kaplan–Meier survival curve estimator. Parametric estimation of the survivor function. Parametric and non-parametric comparison of two groups. Statistical methods for censored survival data arising from follow-up studies on human or animal populations, comparison of survival curves, log-rank test, regression models including the Cox proportional hazards model with application, competing risks, and introduction to accelerated failure time models.

References

  1. Elandt-Johnson & Johnson. Survival Models and Data Analysis. John Wiley and Sons, 1999.
  2. Elandt-Johnson, R. C. & Johnson, Norman L. Survival Models and Data Analysis. John Wiley, 1997.
  3. Marubini, E. & Valsecchi, M. Analysis of Data from Clinical Trials and Observational Studies. John Wiley, 1995.
  4. Marubini, E. & Valsecchi, M. G. Analyzing Survival Data from Clinical Trials and Observational Studies. John Wiley, 1995.

Learning Outcomes

1 Introduction to Survival Analysis

In survival analysis, we consider a population or group of individuals. Associated with each individual is a point event called the failure time. The point event can occur once for each individual.

In survival analysis, the variable of interest is the length of time or duration from one well-defined starting event to another well-defined endpoint.

The endpoint may be:

  • death of a patient;
  • recovery from disease;
  • failure of a machine component;
  • completion of a task;
  • age at first marriage;
  • age at weaning.

The variable of interest is therefore a time-to-event variable.

1.1 Examples

  1. Time that machine components take before failing to work in an industrial reliability experiment.
  2. Time that a student takes to complete a task in a psychological experiment.
  3. Time that a patient takes to recover in a medical study.
  4. Age at which a certain event occurs, for example age at weaning or age at first marriage.

1.2 Worked Example 1.1

A medical study follows five patients after treatment.

Patient Time (Months) Event
1 5 Yes
2 8 Yes
3 10 No
4 12 Yes
5 15 No

Patients 1, 2 and 4 experienced the event. Patients 3 and 5 did not experience the event during follow-up and are censored.

1.3 R Example 1.1: Creating a Simple Survival Dataset

patient <- 1:5
time <- c(5, 8, 10, 12, 15)
status <- c(1, 1, 0, 1, 0)

surv_data <- data.frame(patient, time, status)
surv_data
##   patient time status
## 1       1    5      1
## 2       2    8      1
## 3       3   10      0
## 4       4   12      1
## 5       5   15      0

In survival analysis, a common coding convention is:

  • status = 1: event observed;
  • status = 0: censored.

1.4 R Example 1.2: Representing Survival Data in R

Surv(time, status)
## [1]  5   8  10+ 12  15+

The plus sign + in the output indicates a censored observation.

1.5 Exercise 1.1

A researcher follows six patients for time to relapse after treatment.

Patient Time Status
1 4 1
2 6 0
3 9 1
4 12 0
5 15 1
6 18 0
  1. Identify the patients who experienced the event.
  2. Identify the censored observations.
  3. Create the dataset in R.
  4. Represent the data using Surv().

1.6 Solution 1.1

Patients 1, 3 and 5 experienced the event. Patients 2, 4 and 6 are censored.

patient <- 1:6
time <- c(4, 6, 9, 12, 15, 18)
status <- c(1, 0, 1, 0, 1, 0)

data.frame(patient, time, status)
##   patient time status
## 1       1    4      1
## 2       2    6      0
## 3       3    9      1
## 4       4   12      0
## 5       5   15      1
## 6       6   18      0
Surv(time, status)
## [1]  4   6+  9  12+ 15  18+

2 Censoring

2.1 Learning Outcomes

  • Define censoring.
  • Distinguish between right, left and interval censoring.
  • Differentiate between Type I and Type II censoring.
  • Create censored survival data in R.

2.2 Definition

A common feature of survival data is that not all individuals experience the event of interest before the study ends. Such incomplete observations of failure time are called censored observations.

Censoring occurs when the exact survival time is unknown, but partial information is available.

2.3 Right Censoring

Right censoring occurs when the event of interest has not occurred by the end of observation.

For example, in a reliability study, some machine components may not have failed by the end of the experiment. In this case, the failure time is above the total observation time, say \(c\). Therefore, we only know that:

\[T > c.\]

2.3.1 Example 2.1

A light bulb is tested for 100 hours. At the end of the experiment, it is still working.

The exact failure time is not known, but we know that:

\[T > 100.\]

This is right censoring.

2.3.2 R Example 2.1: Right-Censored Observations

time <- c(40, 55, 70, 100, 100)
status <- c(1, 1, 1, 0, 0)

bulb_data <- data.frame(time, status)
bulb_data
##   time status
## 1   40      1
## 2   55      1
## 3   70      1
## 4  100      0
## 5  100      0
Surv(time, status)
## [1]  40   55   70  100+ 100+

The observations with + are right-censored.

2.4 Left Censoring

Left censoring occurs when the event of interest occurred before observation began.

For example, in a medical study, some patients may have recovered before the start of the experiment. The exact recovery time is unknown, but it occurred before the first observation.

If the first observation is at time \(c\), then:

\[T < c.\]

2.5 Interval Censoring

Interval censoring occurs when the exact event time is unknown but is known to lie within an interval.

For example, if a patient tests negative in January and positive in February, the infection occurred between the two visits.

\[L < T < R.\]

2.6 General Notation for Censoring

Suppose that in the absence of censoring, the \(i^{th}\) individual has failure time \(T_i\). Let \(C_i\) be the censoring time.

The observed time is:

\[Y_i = \min(T_i, C_i).\]

Define the event indicator:

\[V_i = \begin{cases} 1, & T_i \le C_i \quad \text{event observed},\\ 0, & T_i > C_i \quad \text{censored}. \end{cases}\]

2.6.1 R Example 2.2: Generating Observed Time and Event Indicator

T <- c(5, 8, 13, 15, 18)
C <- c(10, 10, 10, 20, 20)

Y <- pmin(T, C)
V <- ifelse(T <= C, 1, 0)

data.frame(T, C, Y, V)
##    T  C  Y V
## 1  5 10  5 1
## 2  8 10  8 1
## 3 13 10 10 0
## 4 15 20 15 1
## 5 18 20 18 1
Surv(Y, V)
## [1]  5   8  10+ 15  18

2.7 Type I Censoring

In Type I censoring, all censoring times are equal:

\[C_i = C.\]

The termination time is fixed and is under the control of the investigator.

2.7.1 Example 2.2

A study follows patients for exactly 12 months. Any patient who has not experienced the event by 12 months is censored.

2.7.2 R Example 2.3: Type I Censoring

set.seed(123)

true_time <- rexp(20, rate = 0.1)
C <- 12

observed_time <- pmin(true_time, C)
status <- ifelse(true_time <= C, 1, 0)

head(data.frame(true_time, observed_time, status))
##    true_time observed_time status
## 1  8.4345726     8.4345726      1
## 2  5.7661027     5.7661027      1
## 3 13.2905487    12.0000000      0
## 4  0.3157736     0.3157736      1
## 5  0.5621098     0.5621098      1
## 6  3.1650122     3.1650122      1
Surv(observed_time, status)
##  [1]  8.4345726   5.7661027  12.0000000+  0.3157736   0.5621098   3.1650122   3.1422729   1.4526680 
##  [9] 12.0000000+  0.2915345  10.0483006   4.8021473   2.8101363   3.7711783   1.8828404   8.4978613 
## [17] 12.0000000+  4.7876042   5.9093484  12.0000000+

2.8 Type II Censoring

In Type II censoring, observation ends after a predetermined number of failures.

For example, 100 bulbs may be tested until 10 have burnt out. The remaining 90 are censored at the time the 10th failure occurs.

2.8.1 R Example 2.4: Type II Censoring

set.seed(123)

true_time <- rexp(20, rate = 0.1)
d <- 8

cutoff <- sort(true_time)[d]

observed_time <- pmin(true_time, cutoff)
status <- ifelse(true_time <= cutoff, 1, 0)

data.frame(observed_time, status)
##    observed_time status
## 1      3.1650122      0
## 2      3.1650122      0
## 3      3.1650122      0
## 4      0.3157736      1
## 5      0.5621098      1
## 6      3.1650122      1
## 7      3.1422729      1
## 8      1.4526680      1
## 9      3.1650122      0
## 10     0.2915345      1
## 11     3.1650122      0
## 12     3.1650122      0
## 13     2.8101363      1
## 14     3.1650122      0
## 15     1.8828404      1
## 16     3.1650122      0
## 17     3.1650122      0
## 18     3.1650122      0
## 19     3.1650122      0
## 20     3.1650122      0
sum(status)
## [1] 8

The experiment stops after 8 observed failures.

2.9 Exercise 2.1

Distinguish between Type I censoring and Type II censoring.

2.10 Solution 2.1

In Type I censoring, the study ends at a fixed time. In Type II censoring, the study ends after a fixed number of failures.

2.11 Exercise 2.2

Identify the type of censoring in each case:

  1. A patient is alive at the end of follow-up.
  2. A patient recovered before the study started.
  3. Infection occurred between two clinic visits.
  4. A study ends after 20 machine components have failed.

2.12 Solution 2.2

  1. Right censoring.
  2. Left censoring.
  3. Interval censoring.
  4. Type II censoring.

3 Survivor Function, Hazard Function and Cumulative Hazard

3.1 Learning Outcomes

  • Define the survivor function.
  • Define the probability density function.
  • Define the hazard function.
  • Define the cumulative hazard function.
  • Derive relationships between \(f(t)\), \(F(t)\), \(S(t)\), \(h(t)\), and \(H(t)\).
  • Compute mean time to failure.

3.2 Distribution of Failure Time

We consider a population of individuals each having a failure time \(T\). The random variable \(T\) is the variable of interest. The origin and scale used for measuring time should be clearly defined.

Since time cannot be negative:

\[T \ge 0.\]

3.3 Survivor Function

For a continuous failure time \(T\), the survivor function or reliability function is:

\[S(t) = P(T > t).\]

Since:

\[F(t) = P(T \le t),\]

we have:

\[S(t) = 1 - F(t).\]

The survivor function gives the probability that an individual does not fail within the time interval \((0,t)\).

3.3.1 R Example 3.1: Plotting a Survivor Function

t <- seq(0, 20, by = 0.1)
S <- exp(-0.1 * t)

plot(t, S, type = "l",
     xlab = "Time",
     ylab = "Survival Probability",
     main = "Survivor Function")

3.4 Probability Density Function

The probability density function is:

\[f(t) = F'(t).\]

Since \(S(t)=1-F(t)\), it follows that:

\[f(t) = -S'(t).\]

3.5 Hazard Function

The hazard function is the instantaneous failure rate and is defined as:

\[h(t)=\lim_{\Delta t \to 0} \frac{P(t \le T < t+\Delta t \mid T > t)}{\Delta t}.\]

Using conditional probability:

\[h(t)=\frac{f(t)}{S(t)}.\]

Since \(f(t)=-S'(t)\), we obtain:

\[h(t)=\frac{-S'(t)}{S(t)}.\]

Therefore:

\[h(t)=-\frac{d}{dt}\log S(t).\]

3.5.1 Interpretation

The hazard function answers the question:

Given that an individual has survived up to time \(t\), what is the instantaneous risk of failure immediately after time \(t\)?

3.5.2 R Example 3.2: Constant Hazard

t <- seq(0, 20, by = 0.1)
h <- rep(0.1, length(t))

plot(t, h, type = "l",
     xlab = "Time",
     ylab = "Hazard",
     main = "Constant Hazard Function")

3.6 Cumulative Hazard Function

The cumulative hazard function is:

\[H(t)=\int_0^t h(u)du.\]

Since:

\[h(t)=-\frac{d}{dt}\log S(t),\]

integrating from 0 to \(t\) gives:

\[H(t)=-\log S(t).\]

Therefore:

\[S(t)=\exp[-H(t)].\]

Also:

\[f(t)=h(t)S(t)=h(t)\exp[-H(t)].\]

3.6.1 Worked Example 3.1

Suppose the hazard function is constant:

\[h(t)=0.2.\]

Then:

\[H(t)=\int_0^t 0.2du=0.2t.\]

Hence:

\[S(t)=\exp(-0.2t).\]

3.6.2 R Example 3.3: Survival Function from Hazard

t <- seq(0, 20, by = 0.1)
H <- 0.2 * t
S <- exp(-H)

plot(t, S, type = "l",
     xlab = "Time",
     ylab = "Survival Probability",
     main = "Survival Function from Constant Hazard")

3.7 Mean Time to Failure

An important relationship exists between the survivor function and the mean time to failure.

\[E(T)=\int_0^\infty S(t)dt.\]

This quantity is also called the mean survival time or life expectancy in demographic applications.

3.7.1 Worked Example 3.2

Suppose:

\[S(t)=e^{-0.2t}.\]

Then:

\[E(T)=\int_0^\infty e^{-0.2t}dt=\frac{1}{0.2}=5.\]

3.7.2 R Example 3.4: Computing Mean Time to Failure

lambda <- 0.2
1/lambda
## [1] 5

3.8 Numerical Approximation in R

t <- seq(0, 100, by = 0.01)
S <- exp(-0.2 * t)

mttf <- sum(S) * 0.01
mttf
## [1] 5.005002

3.9 Exercise 3.1

Given:

\[S(t)=e^{-0.25t},\]

find:

  1. \(H(t)\)
  2. \(h(t)\)
  3. \(E(T)\)

3.10 Solution 3.1

\[H(t)=0.25t.\]

\[h(t)=0.25.\]

\[E(T)=\frac{1}{0.25}=4.\]

lambda <- 0.25
1/lambda
## [1] 4

3.11 Exercise 3.2

Show that:

\[f(t)=h(t)S(t).\]

3.12 Solution 3.2

From the definition of the hazard function:

\[h(t)=\frac{f(t)}{S(t)}.\]

Multiplying both sides by \(S(t)\) gives:

\[f(t)=h(t)S(t).\]

4 Exponential Failure Law

4.1 Learning Outcomes

  • Define the exponential failure law.
  • Explain the constant hazard assumption.
  • Derive the survivor function, density function and hazard function.
  • Explain the memoryless property.
  • Estimate the exponential hazard parameter using R.

4.2 Exponential Failure Law

If the hazard function is constant, say:

\[h(t)=\alpha,\]

then the cumulative hazard is:

\[H(t)=\int_0^t \alpha du=\alpha t.\]

Therefore, the survivor function is:

\[S(t)=\exp[-H(t)]=\exp(-\alpha t).\]

The density function is:

\[f(t)=h(t)S(t)=\alpha e^{-\alpha t}, \quad t>0.\]

Thus, \(T\) has an exponential distribution with parameter \(\alpha\).

4.3 Constant Hazard Interpretation

In an industrial reliability setting, the assumption of constant hazard means that after an item has been used, its probability of failure has not changed. In other words, there is no ageing or wearing-out effect.

4.4 Memoryless Property

For \(\Delta t>0\),

\[P(t \le T \le t+\Delta t \mid T>t) = 1-e^{-\alpha \Delta t}.\]

This probability does not depend on \(t\). It only depends on the additional time interval \(\Delta t\). This is called the lack of memory property.

4.5 Worked Example 4.1

Suppose the hazard rate is:

\[\alpha=0.05.\]

Find:

  1. \(S(10)\)
  2. \(H(10)\)
  3. Mean time to failure.

4.5.1 Solution

\[S(10)=e^{-0.05(10)}=e^{-0.5}=0.6065.\]

\[H(10)=0.05(10)=0.5.\]

\[E(T)=\frac{1}{0.05}=20.\]

4.6 R Example 4.1

alpha <- 0.05

S10 <- exp(-alpha * 10)
H10 <- alpha * 10
mean_time <- 1 / alpha

S10
## [1] 0.6065307
H10
## [1] 0.5
mean_time
## [1] 20

4.7 Estimation of the Exponential Parameter

If there is no censoring, the maximum likelihood estimator is:

\[\hat{\alpha}=\frac{n}{\sum_{i=1}^n x_i}.\]

With right censoring, if \(d\) events are observed, then:

\[\hat{\alpha}=\frac{d}{\sum_{i=1}^n x_i}.\]

4.8 R Example 4.2: Estimating Hazard with Censoring

time <- c(9, 13, 13, 18, 23, 28, 31, 34, 45, 48, 161)
status <- c(1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0)

d <- sum(status)
alpha_hat <- d / sum(time)

alpha_hat
## [1] 0.01654846
1 / alpha_hat
## [1] 60.42857

The estimated hazard rate is alpha_hat, and the estimated mean survival time is \(1/\hat{\alpha}\).

4.9 R Example 4.3: Plotting Exponential Survival Curves

t <- seq(0, 100, by = 1)
alpha_values <- c(0.02, 0.05, 0.10)

plot(t, exp(-alpha_values[1] * t), type = "l",
     ylim = c(0, 1),
     xlab = "Time",
     ylab = "Survival Probability",
     main = "Exponential Survival Curves")

lines(t, exp(-alpha_values[2] * t), lty = 2)
lines(t, exp(-alpha_values[3] * t), lty = 3)

legend("topright",
       legend = paste("alpha =", alpha_values),
       lty = 1:3,
       bty = "n")

4.10 Exercise 4.1

For an exponential failure time with \(\alpha=0.2\), compute:

  1. \(S(5)\)
  2. \(H(5)\)
  3. \(E(T)\)

4.11 Solution 4.1

alpha <- 0.2
S5 <- exp(-alpha * 5)
H5 <- alpha * 5
ET <- 1 / alpha

S5
## [1] 0.3678794
H5
## [1] 1
ET
## [1] 5

5 Weibull, Gamma and Log-Logistic Failure Laws

5.1 Learning Outcomes

  • Define the Weibull failure law.
  • Interpret the Weibull shape parameter.
  • Define the gamma failure law.
  • Define the log-logistic survival model.
  • Use R to simulate and plot survival distributions.

6 Weibull Failure Law

The Weibull distribution generalizes the exponential distribution by allowing the hazard function to vary with time.

Assume the hazard function is:

\[h(t)=\alpha\beta(\alpha t)^{\beta-1},\]

where:

  • \(\alpha>0\) is a scale parameter;
  • \(\beta>0\) is a shape parameter.

The cumulative hazard is:

\[H(t)=\int_0^t \alpha\beta(\alpha u)^{\beta-1}du.\]

This gives:

\[H(t)=(\alpha t)^\beta.\]

Therefore:

\[S(t)=\exp[-(\alpha t)^\beta].\]

The density function is:

\[f(t)=\alpha\beta(\alpha t)^{\beta-1}\exp[-(\alpha t)^\beta].\]

6.1 Interpretation of the Shape Parameter

  • If \(\beta=1\), the Weibull model reduces to the exponential model.
  • If \(\beta>1\), the hazard increases with time.
  • If \(\beta<1\), the hazard decreases with time.

6.2 R Example 5.1: Weibull Hazard Shapes

t <- seq(0.1, 20, by = 0.1)
alpha <- 0.1

beta_values <- c(0.5, 1, 2)

plot(t, alpha * beta_values[1] * (alpha * t)^(beta_values[1] - 1),
     type = "l",
     ylim = c(0, 0.3),
     xlab = "Time",
     ylab = "Hazard",
     main = "Weibull Hazard Functions")

lines(t, alpha * beta_values[2] * (alpha * t)^(beta_values[2] - 1), lty = 2)
lines(t, alpha * beta_values[3] * (alpha * t)^(beta_values[3] - 1), lty = 3)

legend("topright",
       legend = paste("beta =", beta_values),
       lty = 1:3,
       bty = "n")

6.3 R Example 5.2: Simulating Weibull Survival Times

set.seed(123)

x <- rweibull(100, shape = 2, scale = 10)

hist(x,
     main = "Simulated Weibull Failure Times",
     xlab = "Failure Time")

7 Gamma Failure Law

The gamma family has density:

\[f(t)=\frac{\alpha(\alpha t)^{\beta-1}}{\Gamma(\beta)}e^{-\alpha t},\]

where \(\alpha>0\) and \(\beta>0\).

The mean is:

\[E(T)=\frac{\beta}{\alpha}.\]

The gamma distribution is flexible but its survivor function involves the incomplete gamma integral, making it less convenient for some survival analysis calculations.

7.1 R Example 5.3: Gamma Failure Times

set.seed(123)

alpha <- 0.5
beta <- 2

x <- rgamma(100, shape = beta, rate = alpha)

hist(x,
     main = "Simulated Gamma Failure Times",
     xlab = "Failure Time")

mean(x)
## [1] 3.442999
beta / alpha
## [1] 4

8 Log-Logistic Distribution

The log-logistic model has survivor function:

\[S(t)=\frac{1}{1+(\alpha t)^\beta}.\]

Its density is:

\[f(t)=\frac{\alpha^\beta\beta t^{\beta-1}}{\{1+(\alpha t)^\beta\}^2}.\]

The hazard function is:

\[h(t)=\frac{\beta\alpha^\beta t^{\beta-1}}{1+(\alpha t)^\beta}.\]

8.1 R Example 5.4: Log-Logistic Survival Function

t <- seq(0.1, 50, by = 0.1)
alpha <- 0.1
beta <- 2

S <- 1 / (1 + (alpha * t)^beta)

plot(t, S, type = "l",
     xlab = "Time",
     ylab = "Survival Probability",
     main = "Log-Logistic Survivor Function")

8.2 Exercise 5.1

For a Weibull distribution with \(\alpha=0.1\) and \(\beta=2\), compute:

  1. \(H(10)\)
  2. \(S(10)\)
  3. Interpret the shape parameter.

8.3 Solution 5.1

alpha <- 0.1
beta <- 2
t <- 10

H <- (alpha * t)^beta
S <- exp(-H)

H
## [1] 1
S
## [1] 0.3678794

Since \(\beta>1\), the hazard increases with time.

9 Discrete Failure Time

9.1 Learning Outcomes

  • Define discrete failure time.
  • Compute the discrete survivor function.
  • Compute the discrete hazard function.
  • Calculate the expected failure time.
  • Use R to compute discrete survival quantities.

9.2 Discrete Failure Time

Let the possible values of \(T\) be ordered as:

\[u_0 \le u_1 \le u_2 \le \cdots.\]

The probability mass function is:

\[f_j=P(T=u_j).\]

The survivor function is:

\[S_j=P(T \ge u_j).\]

The hazard function is:

\[h_j=P(T=u_j \mid T \ge u_j).\]

Therefore:

\[h_j=\frac{f_j}{S_j}.\]

Also:

\[f_j=h_jS_j.\]

9.3 Survivor Function from Hazards

The survivor function can be written as:

\[S_{j+1}=S_j(1-h_j).\]

Therefore:

\[S_{j+1}=\prod_{\eta=0}^{j}(1-h_\eta).\]

9.4 Mean of a Discrete Failure Time

For non-negative discrete survival time:

\[E(T)=\sum_{t=1}^{\infty}S(t).\]

9.5 Worked Example 6.1

Given the probability mass function:

t 1 2 3 4 5
f(t) 0.2 0.2 0.2 0.2 0.2

Find \(S(t)\) and \(E(T)\).

The survivor probabilities are:

t 1 2 3 4 5
S(t) 1.0 0.8 0.6 0.4 0.2

Thus:

\[E(T)=1.0+0.8+0.6+0.4+0.2=3.\]

9.6 R Example 6.1

t <- 1:5
f <- rep(0.2, 5)

S <- rev(cumsum(rev(f)))

data.frame(t, f, S)
##   t   f   S
## 1 1 0.2 1.0
## 2 2 0.2 0.8
## 3 3 0.2 0.6
## 4 4 0.2 0.4
## 5 5 0.2 0.2
sum(S)
## [1] 3
sum(t * f)
## [1] 3

9.7 Worked Example 6.2

Given:

t 0 1 2
f(t) 1/3 1/3 1/3

The survivor function is:

t 0 1 2
S(t) 1 2/3 1/3

\[E(T)=\frac{2}{3}+\frac{1}{3}=1.\]

9.8 R Example 6.2

t <- c(0, 1, 2)
f <- c(1/3, 1/3, 1/3)

S <- sapply(t, function(x) sum(f[t >= x]))

data.frame(t, f, S)
##   t         f         S
## 1 0 0.3333333 1.0000000
## 2 1 0.3333333 0.6666667
## 3 2 0.3333333 0.3333333
sum(t * f)
## [1] 1

9.9 Geometric Distribution

The geometric distribution is the discrete analogue of the exponential distribution. It has the same lack of memory property.

Let \(X\) be the number of attempts including the successful attempt, with constant probability \(s\) of success.

\[P(X=x)=(1-s)^{x-1}s.\]

The survivor function is:

\[S(x)=P(X \ge x)=(1-s)^{x-1}.\]

The hazard is:

\[h(x)=s.\]

The expectation is:

\[E(X)=\frac{1}{s}.\]

9.10 R Example 6.3: Geometric Distribution

s <- 0.25
x <- 1:20

pmf <- (1 - s)^(x - 1) * s
surv <- (1 - s)^(x - 1)
haz <- pmf / surv

data.frame(x, pmf, surv, haz)
##     x         pmf        surv  haz
## 1   1 0.250000000 1.000000000 0.25
## 2   2 0.187500000 0.750000000 0.25
## 3   3 0.140625000 0.562500000 0.25
## 4   4 0.105468750 0.421875000 0.25
## 5   5 0.079101562 0.316406250 0.25
## 6   6 0.059326172 0.237304688 0.25
## 7   7 0.044494629 0.177978516 0.25
## 8   8 0.033370972 0.133483887 0.25
## 9   9 0.025028229 0.100112915 0.25
## 10 10 0.018771172 0.075084686 0.25
## 11 11 0.014078379 0.056313515 0.25
## 12 12 0.010558784 0.042235136 0.25
## 13 13 0.007919088 0.031676352 0.25
## 14 14 0.005939316 0.023757264 0.25
## 15 15 0.004454487 0.017817948 0.25
## 16 16 0.003340865 0.013363461 0.25
## 17 17 0.002505649 0.010022596 0.25
## 18 18 0.001879237 0.007516947 0.25
## 19 19 0.001409428 0.005637710 0.25
## 20 20 0.001057071 0.004228283 0.25
1 / s
## [1] 4

9.11 Exercise 6.1

For the PMF below, determine the survival function, hazard function and mean.

x 0 1 2 3 4
f(x) 0.15 0.35 0.25 0.15 0.10

9.12 Solution 6.1

x <- 0:4
f <- c(0.15, 0.35, 0.25, 0.15, 0.10)

S <- sapply(x, function(a) sum(f[x >= a]))
h <- f / S
mean_T <- sum(x * f)

data.frame(x, f, S, h)
##   x    f    S         h
## 1 0 0.15 1.00 0.1500000
## 2 1 0.35 0.85 0.4117647
## 3 2 0.25 0.50 0.5000000
## 4 3 0.15 0.25 0.6000000
## 5 4 0.10 0.10 1.0000000
mean_T
## [1] 1.7

10 Parametric Modelling of Survival Data

10.1 Learning Outcomes

  • Explain parametric modelling of survival data.
  • Write likelihood contributions for censored and uncensored observations.
  • Derive the likelihood under censoring.
  • Estimate model parameters using R.

10.2 Parametric Modelling

Assume that a specific family of distributions is given, so the distribution is known except for a vector of parameters \(\phi\).

For example, under the exponential model, \(\phi=\alpha\). Under the Weibull model, \(\phi=(\alpha,\beta)\).

10.3 Likelihood Contributions

For continuous survival time:

  • an uncensored observation at time \(t_i\) contributes \(f(t_i;\phi)\);
  • a censored observation at time \(c_i\) contributes \(S(c_i;\phi)\).

Thus, the full likelihood is:

\[L=\prod_{u} f(t_i;\phi)\prod_c S(c_i;\phi),\]

where the first product is over uncensored observations and the second product is over censored observations.

Using:

\[f(t)=h(t)S(t),\]

we can write the log-likelihood as:

\[\ell = \sum_u \log h(x_i;\phi)+\sum_i \log S(x_i;\phi).\]

Since:

\[S(t)=\exp[-H(t)],\]

we obtain:

\[\ell = \sum_u \log h(x_i;\phi)-\sum_i H(x_i;\phi).\]

10.4 Exponential Special Case

For the exponential model:

\[h(t)=\beta,\]

and:

\[H(t)=\beta t.\]

Therefore:

\[\ell=d\log\beta-\beta\sum x_i,\]

where \(d\) is the number of observed failures.

Differentiating gives:

\[\hat{\beta}=\frac{d}{\sum x_i}.\]

10.5 Worked Example 7.1

Consider the following group A data:

Time 9 13 13+ 18 23 28+ 31 34 45+ 48 161+

Here, + indicates censoring.

10.6 R Example 7.1: Exponential MLE with Censoring

time_A <- c(9, 13, 13, 18, 23, 28, 31, 34, 45, 48, 161)
status_A <- c(1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0)

d_A <- sum(status_A)
beta_hat_A <- d_A / sum(time_A)

d_A
## [1] 7
sum(time_A)
## [1] 423
beta_hat_A
## [1] 0.01654846
1 / beta_hat_A
## [1] 60.42857

10.7 R Example 7.2: Parametric Exponential Model Using survreg

surv_A <- data.frame(time = time_A, status = status_A)

fit_exp <- survreg(Surv(time, status) ~ 1,
                   data = surv_A,
                   dist = "exponential")

summary(fit_exp)
## 
## Call:
## survreg(formula = Surv(time, status) ~ 1, data = surv_A, dist = "exponential")
##             Value Std. Error    z      p
## (Intercept) 4.101      0.378 10.8 <2e-16
## 
## Scale fixed at 1 
## 
## Exponential distribution
## Loglik(model)= -35.7   Loglik(intercept only)= -35.7
## Number of Newton-Raphson Iterations: 4 
## n= 11

10.8 Weibull Parametric Model

For the Weibull distribution, the likelihood can be maximized numerically.

10.9 R Example 7.3: Weibull Model

fit_weibull <- survreg(Surv(time, status) ~ 1,
                       data = surv_A,
                       dist = "weibull")

summary(fit_weibull)
## 
## Call:
## survreg(formula = Surv(time, status) ~ 1, data = surv_A, dist = "weibull")
##               Value Std. Error     z      p
## (Intercept)  4.0997     0.3665 11.19 <2e-16
## Log(scale)  -0.0314     0.2771 -0.11   0.91
## 
## Scale= 0.969 
## 
## Weibull distribution
## Loglik(model)= -35.7   Loglik(intercept only)= -35.7
## Number of Newton-Raphson Iterations: 5 
## n= 11

10.10 Exercise 7.1

Using the data below, estimate the exponential hazard rate.

Time 5 5 8 8 12 23 27 30 33 43 45
Status 1 1 1 1 1 1 1 1 1 1 1

10.11 Solution 7.1

time_B <- c(5, 5, 8, 8, 12, 23, 27, 30, 33, 43, 45)
status_B <- rep(1, length(time_B))

beta_hat_B <- sum(status_B) / sum(time_B)

beta_hat_B
## [1] 0.0460251
1 / beta_hat_B
## [1] 21.72727

11 Non-Parametric Modelling and Kaplan-Meier Estimation

11.1 Learning Outcomes

  • Explain non-parametric survival estimation.
  • Derive the Kaplan–Meier estimator.
  • Compute Kaplan–Meier estimates manually.
  • Use R to fit and plot Kaplan–Meier curves.

11.2 Product Limit Estimation

Suppose the distribution is discrete with probability \(f_j\) at failure times:

\[u_1 \le u_2 \le \cdots.\]

The survivor function can be written as:

\[S(t)=\prod_{u_j<t}(1-h_j).\]

A non-parametric estimator is:

\[\hat{S}(t)=\prod_{u_j<t}(1-\hat{h}_j).\]

At failure time \(u_j\), let:

  • \(d_j\) be the number of failures;
  • \(r_j\) be the number at risk.

The maximum likelihood estimator of \(h_j\) is:

\[\hat{h}_j=\frac{d_j}{r_j}.\]

Therefore, the Kaplan–Meier estimator is:

\[\hat{S}(t)=\prod_{u_j<t}\left(1-\frac{d_j}{r_j}\right).\]

11.3 Worked Example 8.1

For group A:

Time 9 13 13+ 18 23 28+ 31 34 45+ 48 161+

The event times are 9, 13, 18, 23, 31, 34 and 48.

11.4 R Example 8.1: Kaplan–Meier Estimator

time_A <- c(9, 13, 13, 18, 23, 28, 31, 34, 45, 48, 161)
status_A <- c(1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0)

km_A <- survfit(Surv(time_A, status_A) ~ 1)
summary(km_A)
## Call: survfit(formula = Surv(time_A, status_A) ~ 1)
## 
##  time n.risk n.event survival std.err lower 95% CI upper 95% CI
##     9     11       1    0.909  0.0867       0.7541        1.000
##    13     10       1    0.818  0.1163       0.6192        1.000
##    18      8       1    0.716  0.1397       0.4884        1.000
##    23      7       1    0.614  0.1526       0.3769        0.999
##    31      5       1    0.491  0.1642       0.2549        0.946
##    34      4       1    0.368  0.1627       0.1549        0.875
##    48      2       1    0.184  0.1535       0.0359        0.944
plot(km_A,
     xlab = "Time",
     ylab = "Survival Probability",
     main = "Kaplan-Meier Curve for Group A")

11.5 Manual Kaplan–Meier Table in R

event_times <- sort(unique(time_A[status_A == 1]))

r_j <- sapply(event_times, function(t) sum(time_A >= t))
d_j <- sapply(event_times, function(t) sum(time_A == t & status_A == 1))

surv_step <- 1 - d_j / r_j
S_hat <- cumprod(surv_step)

km_table <- data.frame(
  time = event_times,
  r_j = r_j,
  d_j = d_j,
  one_minus_d_over_r = surv_step,
  S_hat = S_hat
)

km_table
##   time r_j d_j one_minus_d_over_r     S_hat
## 1    9  11   1          0.9090909 0.9090909
## 2   13  10   1          0.9000000 0.8181818
## 3   18   8   1          0.8750000 0.7159091
## 4   23   7   1          0.8571429 0.6136364
## 5   31   5   1          0.8000000 0.4909091
## 6   34   4   1          0.7500000 0.3681818
## 7   48   2   1          0.5000000 0.1840909

11.6 Interpretation

At each observed event time, the Kaplan–Meier estimate is updated by multiplying the previous survival probability by:

\[1-\frac{d_j}{r_j}.\]

Censored observations reduce the number at risk after their censoring time, but they do not cause a drop in the survival curve.

11.7 Exercise 8.1

Using group B data, compute and plot the Kaplan–Meier curve.

Group B 5 5 8 8 12 23 27 30 33 43 45

11.8 Solution 8.1

time_B <- c(5, 5, 8, 8, 12, 23, 27, 30, 33, 43, 45)
status_B <- rep(1, length(time_B))

km_B <- survfit(Surv(time_B, status_B) ~ 1)

summary(km_B)
## Call: survfit(formula = Surv(time_B, status_B) ~ 1)
## 
##  time n.risk n.event survival std.err lower 95% CI upper 95% CI
##     5     11       2   0.8182  0.1163       0.6192        1.000
##     8      9       2   0.6364  0.1450       0.4071        0.995
##    12      7       1   0.5455  0.1501       0.3180        0.936
##    23      6       1   0.4545  0.1501       0.2379        0.868
##    27      5       1   0.3636  0.1450       0.1664        0.795
##    30      4       1   0.2727  0.1343       0.1039        0.716
##    33      3       1   0.1818  0.1163       0.0519        0.637
##    43      2       1   0.0909  0.0867       0.0140        0.589
##    45      1       1   0.0000     NaN           NA           NA
plot(km_B,
     xlab = "Time",
     ylab = "Survival Probability",
     main = "Kaplan-Meier Curve for Group B")

12 Comparing Survival Curves and the Log-Rank Test

12.1 Learning Outcomes

  • Compare survival curves between groups.
  • Explain the log-rank test.
  • Conduct a log-rank test in R.
  • Interpret Kaplan–Meier curves and test results.

12.2 Comparing Two Groups

Survival analysis often compares time-to-event outcomes between two or more groups.

Examples include:

  • treatment vs control;
  • exposed vs unexposed;
  • male vs female;
  • intervention vs standard care.

12.3 Kaplan–Meier Curves by Group

time_A <- c(9, 13, 13, 18, 23, 28, 31, 34, 45, 48, 161)
status_A <- c(1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0)

time_B <- c(5, 5, 8, 8, 12, 23, 27, 30, 33, 43, 45)
status_B <- rep(1, length(time_B))

time <- c(time_A, time_B)
status <- c(status_A, status_B)
group <- factor(c(rep("A", length(time_A)), rep("B", length(time_B))))

aml_data <- data.frame(time, status, group)

fit_group <- survfit(Surv(time, status) ~ group, data = aml_data)

plot(fit_group,
     xlab = "Time",
     ylab = "Survival Probability",
     main = "Kaplan-Meier Curves by Group",
     lty = 1:2)

legend("topright",
       legend = levels(group),
       lty = 1:2,
       bty = "n")

12.4 Log-Rank Test

The log-rank test compares the survival experience of two or more groups.

The null hypothesis is:

\[H_0: S_1(t)=S_2(t)\]

for all \(t\).

The alternative hypothesis is that the survival functions differ.

12.5 R Example 9.1: Log-Rank Test

logrank <- survdiff(Surv(time, status) ~ group, data = aml_data)
logrank
## Call:
## survdiff(formula = Surv(time, status) ~ group, data = aml_data)
## 
##          N Observed Expected (O-E)^2/E (O-E)^2/V
## group=A 11        7    10.89      1.39      3.79
## group=B 11       11     7.11      2.12      3.79
## 
##  Chisq= 3.8  on 1 degrees of freedom, p= 0.05

12.6 Obtaining the p-value

chisq <- logrank$chisq
df <- length(logrank$n) - 1
p_value <- 1 - pchisq(chisq, df)

p_value
## [1] 0.05149125

12.7 Interpretation

If the p-value is less than 0.05, we reject the null hypothesis and conclude that there is evidence of a difference in survival between the groups.

If the p-value is greater than 0.05, we do not reject the null hypothesis.

12.8 Exercise 9.1

Using the lung dataset in R:

  1. Fit Kaplan–Meier curves by sex.
  2. Plot the survival curves.
  3. Conduct a log-rank test.
  4. Interpret the result.

12.9 Solution 9.1

data(lung, package = "survival")

fit_lung <- survfit(Surv(time, status) ~ sex, data = lung)

plot(fit_lung,
     xlab = "Time",
     ylab = "Survival Probability",
     main = "Kaplan-Meier Curves by Sex",
     lty = 1:2)

legend("topright",
       legend = c("Male", "Female"),
       lty = 1:2,
       bty = "n")

survdiff(Surv(time, status) ~ sex, data = lung)
## Call:
## survdiff(formula = Surv(time, status) ~ sex, data = lung)
## 
##         N Observed Expected (O-E)^2/E (O-E)^2/V
## sex=1 138      112     91.6      4.55      10.3
## sex=2  90       53     73.4      5.68      10.3
## 
##  Chisq= 10.3  on 1 degrees of freedom, p= 0.001

13 Cox Proportional Hazards Model and Accelerated Failure Time Models

13.1 Learning Outcomes

  • Explain the Cox proportional hazards model.
  • Interpret hazard ratios.
  • Fit Cox models in R.
  • Check the proportional hazards assumption.
  • Explain accelerated failure time models.

13.2 Cox Proportional Hazards Model

The Cox proportional hazards model is:

\[h(t|X)=h_0(t)\exp(\beta X).\]

Where:

  • \(h(t|X)\) is the hazard at time \(t\) for covariates \(X\);
  • \(h_0(t)\) is the baseline hazard;
  • \(\beta\) is the regression coefficient.

13.3 Hazard Ratio

For a one-unit increase in \(X\), the hazard ratio is:

\[HR=\exp(\beta).\]

Interpretation:

  • \(HR>1\): increased hazard;
  • \(HR<1\): reduced hazard;
  • \(HR=1\): no association.

13.4 R Example 10.1: Cox Model Using the Lung Dataset

data(lung, package = "survival")

cox_fit <- coxph(Surv(time, status) ~ age + sex, data = lung)

summary(cox_fit)
## Call:
## coxph(formula = Surv(time, status) ~ age + sex, data = lung)
## 
##   n= 228, number of events= 165 
## 
##          coef exp(coef)  se(coef)      z Pr(>|z|)   
## age  0.017045  1.017191  0.009223  1.848  0.06459 . 
## sex -0.513219  0.598566  0.167458 -3.065  0.00218 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##     exp(coef) exp(-coef) lower .95 upper .95
## age    1.0172     0.9831    0.9990    1.0357
## sex    0.5986     1.6707    0.4311    0.8311
## 
## Concordance= 0.603  (se = 0.025 )
## Likelihood ratio test= 14.12  on 2 df,   p=9e-04
## Wald test            = 13.47  on 2 df,   p=0.001
## Score (logrank) test = 13.72  on 2 df,   p=0.001

13.5 Interpreting Hazard Ratios

exp(coef(cox_fit))
##      age      sex 
## 1.017191 0.598566

The exponentiated coefficients are hazard ratios.

13.6 Checking the Proportional Hazards Assumption

ph_test <- cox.zph(cox_fit)
ph_test
##        chisq df    p
## age    0.209  1 0.65
## sex    2.608  1 0.11
## GLOBAL 2.771  2 0.25
plot(ph_test)

A large p-value suggests no strong evidence against the proportional hazards assumption.

14 Accelerated Failure Time Models

Accelerated failure time models directly model survival time rather than hazard.

A general AFT model can be written as:

\[\log(T)=\mu+\beta X+\sigma \epsilon.\]

In an AFT model:

  • positive coefficients increase survival time;
  • negative coefficients shorten survival time.

14.1 R Example 10.2: Weibull AFT Model

aft_fit <- survreg(Surv(time, status) ~ age + sex,
                   data = lung,
                   dist = "weibull")

summary(aft_fit)
## 
## Call:
## survreg(formula = Surv(time, status) ~ age + sex, data = lung, 
##     dist = "weibull")
##                Value Std. Error     z       p
## (Intercept)  6.27485    0.48137 13.04 < 2e-16
## age         -0.01226    0.00696 -1.76  0.0781
## sex          0.38209    0.12748  3.00  0.0027
## Log(scale)  -0.28230    0.06188 -4.56 5.1e-06
## 
## Scale= 0.754 
## 
## Weibull distribution
## Loglik(model)= -1147.1   Loglik(intercept only)= -1153.9
##  Chisq= 13.59 on 2 degrees of freedom, p= 0.0011 
## Number of Newton-Raphson Iterations: 5 
## n= 228

14.2 Exercise 10.1

Using the lung dataset:

  1. Fit a Cox model with age and sex.
  2. Interpret the hazard ratio for age.
  3. Test the proportional hazards assumption.
  4. Fit a Weibull AFT model.

14.3 Solution 10.1

cox_fit <- coxph(Surv(time, status) ~ age + sex, data = lung)
summary(cox_fit)
## Call:
## coxph(formula = Surv(time, status) ~ age + sex, data = lung)
## 
##   n= 228, number of events= 165 
## 
##          coef exp(coef)  se(coef)      z Pr(>|z|)   
## age  0.017045  1.017191  0.009223  1.848  0.06459 . 
## sex -0.513219  0.598566  0.167458 -3.065  0.00218 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##     exp(coef) exp(-coef) lower .95 upper .95
## age    1.0172     0.9831    0.9990    1.0357
## sex    0.5986     1.6707    0.4311    0.8311
## 
## Concordance= 0.603  (se = 0.025 )
## Likelihood ratio test= 14.12  on 2 df,   p=9e-04
## Wald test            = 13.47  on 2 df,   p=0.001
## Score (logrank) test = 13.72  on 2 df,   p=0.001
exp(coef(cox_fit))
##      age      sex 
## 1.017191 0.598566
cox.zph(cox_fit)
##        chisq df    p
## age    0.209  1 0.65
## sex    2.608  1 0.11
## GLOBAL 2.771  2 0.25
aft_fit <- survreg(Surv(time, status) ~ age + sex,
                   data = lung,
                   dist = "weibull")

summary(aft_fit)
## 
## Call:
## survreg(formula = Surv(time, status) ~ age + sex, data = lung, 
##     dist = "weibull")
##                Value Std. Error     z       p
## (Intercept)  6.27485    0.48137 13.04 < 2e-16
## age         -0.01226    0.00696 -1.76  0.0781
## sex          0.38209    0.12748  3.00  0.0027
## Log(scale)  -0.28230    0.06188 -4.56 5.1e-06
## 
## Scale= 0.754 
## 
## Weibull distribution
## Loglik(model)= -1147.1   Loglik(intercept only)= -1153.9
##  Chisq= 13.59 on 2 degrees of freedom, p= 0.0011 
## Number of Newton-Raphson Iterations: 5 
## n= 228