Survival Analysis
Purpose: To equip students with skills required to estimate survival functions and perform calculations on real time-to-event data.
Objectives
At the end of the course you should be able to:
- Define the terms such as probability mass/density functions, survivor, hazard and cumulative hazard functions and derive all the others, given one of them.
- Identify left, right and interval censoring.
- Determine the Kaplan–Meier survival estimator, and sketch its graph for given data.
- Use the log likelihood function to determine an unknown parameter for a probability distribution, given a sample from the distribution.
- Apply the methods available for comparison of two groups, including comparison of survival curves and the log-rank test.
- Apply regression models, including the Cox proportional hazards model.
- Explain the accelerated failure time model.
Course Description
Definition: Survivor, hazard and cumulative hazard functions. Left, right and interval censoring. Kaplan–Meier survival curve estimator. Parametric estimation of the survivor function. Parametric and non-parametric comparison of two groups. Statistical methods for censored survival data arising from follow-up studies on human or animal populations, comparison of survival curves, log-rank test, regression models including the Cox proportional hazards model with application, competing risks, and introduction to accelerated failure time models.
References
- Elandt-Johnson & Johnson. Survival Models and Data Analysis. John Wiley and Sons, 1999.
- Elandt-Johnson, R. C. & Johnson, Norman L. Survival Models and Data Analysis. John Wiley, 1997.
- Marubini, E. & Valsecchi, M. Analysis of Data from Clinical Trials and Observational Studies. John Wiley, 1995.
- Marubini, E. & Valsecchi, M. G. Analyzing Survival Data from Clinical Trials and Observational Studies. John Wiley, 1995.
Learning Outcomes
- Define survival analysis.
- Explain time-to-event data.
- Identify applications of survival analysis.
- Distinguish between complete and censored observations.
1 Introduction to Survival Analysis
In survival analysis, we consider a population or group of individuals. Associated with each individual is a point event called the failure time. The point event can occur once for each individual.
In survival analysis, the variable of interest is the length of time or duration from one well-defined starting event to another well-defined endpoint.
The endpoint may be:
- death of a patient;
- recovery from disease;
- failure of a machine component;
- completion of a task;
- age at first marriage;
- age at weaning.
The variable of interest is therefore a time-to-event variable.
1.1 Examples
- Time that machine components take before failing to work in an industrial reliability experiment.
- Time that a student takes to complete a task in a psychological experiment.
- Time that a patient takes to recover in a medical study.
- Age at which a certain event occurs, for example age at weaning or age at first marriage.
1.2 Worked Example 1.1
A medical study follows five patients after treatment.
| Patient | Time (Months) | Event |
|---|---|---|
| 1 | 5 | Yes |
| 2 | 8 | Yes |
| 3 | 10 | No |
| 4 | 12 | Yes |
| 5 | 15 | No |
Patients 1, 2 and 4 experienced the event. Patients 3 and 5 did not experience the event during follow-up and are censored.
1.3 R Example 1.1: Creating a Simple Survival Dataset
patient <- 1:5
time <- c(5, 8, 10, 12, 15)
status <- c(1, 1, 0, 1, 0)
surv_data <- data.frame(patient, time, status)
surv_data## patient time status
## 1 1 5 1
## 2 2 8 1
## 3 3 10 0
## 4 4 12 1
## 5 5 15 0
In survival analysis, a common coding convention is:
status = 1: event observed;status = 0: censored.
1.4 R Example 1.2: Representing Survival Data in R
## [1] 5 8 10+ 12 15+
The plus sign + in the output indicates a censored observation.
1.5 Exercise 1.1
A researcher follows six patients for time to relapse after treatment.
| Patient | Time | Status |
|---|---|---|
| 1 | 4 | 1 |
| 2 | 6 | 0 |
| 3 | 9 | 1 |
| 4 | 12 | 0 |
| 5 | 15 | 1 |
| 6 | 18 | 0 |
- Identify the patients who experienced the event.
- Identify the censored observations.
- Create the dataset in R.
- Represent the data using
Surv().
1.6 Solution 1.1
Patients 1, 3 and 5 experienced the event. Patients 2, 4 and 6 are censored.
patient <- 1:6
time <- c(4, 6, 9, 12, 15, 18)
status <- c(1, 0, 1, 0, 1, 0)
data.frame(patient, time, status)## patient time status
## 1 1 4 1
## 2 2 6 0
## 3 3 9 1
## 4 4 12 0
## 5 5 15 1
## 6 6 18 0
## [1] 4 6+ 9 12+ 15 18+
2 Censoring
2.1 Learning Outcomes
- Define censoring.
- Distinguish between right, left and interval censoring.
- Differentiate between Type I and Type II censoring.
- Create censored survival data in R.
2.2 Definition
A common feature of survival data is that not all individuals experience the event of interest before the study ends. Such incomplete observations of failure time are called censored observations.
Censoring occurs when the exact survival time is unknown, but partial information is available.
2.3 Right Censoring
Right censoring occurs when the event of interest has not occurred by the end of observation.
For example, in a reliability study, some machine components may not have failed by the end of the experiment. In this case, the failure time is above the total observation time, say \(c\). Therefore, we only know that:
\[T > c.\]
2.3.1 Example 2.1
A light bulb is tested for 100 hours. At the end of the experiment, it is still working.
The exact failure time is not known, but we know that:
\[T > 100.\]
This is right censoring.
2.4 Left Censoring
Left censoring occurs when the event of interest occurred before observation began.
For example, in a medical study, some patients may have recovered before the start of the experiment. The exact recovery time is unknown, but it occurred before the first observation.
If the first observation is at time \(c\), then:
\[T < c.\]
2.5 Interval Censoring
Interval censoring occurs when the exact event time is unknown but is known to lie within an interval.
For example, if a patient tests negative in January and positive in February, the infection occurred between the two visits.
\[L < T < R.\]
2.6 General Notation for Censoring
Suppose that in the absence of censoring, the \(i^{th}\) individual has failure time \(T_i\). Let \(C_i\) be the censoring time.
The observed time is:
\[Y_i = \min(T_i, C_i).\]
Define the event indicator:
\[V_i = \begin{cases} 1, & T_i \le C_i \quad \text{event observed},\\ 0, & T_i > C_i \quad \text{censored}. \end{cases}\]
2.7 Type I Censoring
In Type I censoring, all censoring times are equal:
\[C_i = C.\]
The termination time is fixed and is under the control of the investigator.
2.7.1 Example 2.2
A study follows patients for exactly 12 months. Any patient who has not experienced the event by 12 months is censored.
2.7.2 R Example 2.3: Type I Censoring
set.seed(123)
true_time <- rexp(20, rate = 0.1)
C <- 12
observed_time <- pmin(true_time, C)
status <- ifelse(true_time <= C, 1, 0)
head(data.frame(true_time, observed_time, status))## true_time observed_time status
## 1 8.4345726 8.4345726 1
## 2 5.7661027 5.7661027 1
## 3 13.2905487 12.0000000 0
## 4 0.3157736 0.3157736 1
## 5 0.5621098 0.5621098 1
## 6 3.1650122 3.1650122 1
## [1] 8.4345726 5.7661027 12.0000000+ 0.3157736 0.5621098 3.1650122 3.1422729 1.4526680
## [9] 12.0000000+ 0.2915345 10.0483006 4.8021473 2.8101363 3.7711783 1.8828404 8.4978613
## [17] 12.0000000+ 4.7876042 5.9093484 12.0000000+
2.8 Type II Censoring
In Type II censoring, observation ends after a predetermined number of failures.
For example, 100 bulbs may be tested until 10 have burnt out. The remaining 90 are censored at the time the 10th failure occurs.
2.8.1 R Example 2.4: Type II Censoring
set.seed(123)
true_time <- rexp(20, rate = 0.1)
d <- 8
cutoff <- sort(true_time)[d]
observed_time <- pmin(true_time, cutoff)
status <- ifelse(true_time <= cutoff, 1, 0)
data.frame(observed_time, status)## observed_time status
## 1 3.1650122 0
## 2 3.1650122 0
## 3 3.1650122 0
## 4 0.3157736 1
## 5 0.5621098 1
## 6 3.1650122 1
## 7 3.1422729 1
## 8 1.4526680 1
## 9 3.1650122 0
## 10 0.2915345 1
## 11 3.1650122 0
## 12 3.1650122 0
## 13 2.8101363 1
## 14 3.1650122 0
## 15 1.8828404 1
## 16 3.1650122 0
## 17 3.1650122 0
## 18 3.1650122 0
## 19 3.1650122 0
## 20 3.1650122 0
## [1] 8
The experiment stops after 8 observed failures.
2.9 Exercise 2.1
Distinguish between Type I censoring and Type II censoring.
2.10 Solution 2.1
In Type I censoring, the study ends at a fixed time. In Type II censoring, the study ends after a fixed number of failures.
2.11 Exercise 2.2
Identify the type of censoring in each case:
- A patient is alive at the end of follow-up.
- A patient recovered before the study started.
- Infection occurred between two clinic visits.
- A study ends after 20 machine components have failed.
2.12 Solution 2.2
- Right censoring.
- Left censoring.
- Interval censoring.
- Type II censoring.
3 Survivor Function, Hazard Function and Cumulative Hazard
3.1 Learning Outcomes
- Define the survivor function.
- Define the probability density function.
- Define the hazard function.
- Define the cumulative hazard function.
- Derive relationships between \(f(t)\), \(F(t)\), \(S(t)\), \(h(t)\), and \(H(t)\).
- Compute mean time to failure.
3.2 Distribution of Failure Time
We consider a population of individuals each having a failure time \(T\). The random variable \(T\) is the variable of interest. The origin and scale used for measuring time should be clearly defined.
Since time cannot be negative:
\[T \ge 0.\]
3.3 Survivor Function
For a continuous failure time \(T\), the survivor function or reliability function is:
\[S(t) = P(T > t).\]
Since:
\[F(t) = P(T \le t),\]
we have:
\[S(t) = 1 - F(t).\]
The survivor function gives the probability that an individual does not fail within the time interval \((0,t)\).
3.4 Probability Density Function
The probability density function is:
\[f(t) = F'(t).\]
Since \(S(t)=1-F(t)\), it follows that:
\[f(t) = -S'(t).\]
3.5 Hazard Function
The hazard function is the instantaneous failure rate and is defined as:
\[h(t)=\lim_{\Delta t \to 0} \frac{P(t \le T < t+\Delta t \mid T > t)}{\Delta t}.\]
Using conditional probability:
\[h(t)=\frac{f(t)}{S(t)}.\]
Since \(f(t)=-S'(t)\), we obtain:
\[h(t)=\frac{-S'(t)}{S(t)}.\]
Therefore:
\[h(t)=-\frac{d}{dt}\log S(t).\]
3.5.1 Interpretation
The hazard function answers the question:
Given that an individual has survived up to time \(t\), what is the instantaneous risk of failure immediately after time \(t\)?
3.6 Cumulative Hazard Function
The cumulative hazard function is:
\[H(t)=\int_0^t h(u)du.\]
Since:
\[h(t)=-\frac{d}{dt}\log S(t),\]
integrating from 0 to \(t\) gives:
\[H(t)=-\log S(t).\]
Therefore:
\[S(t)=\exp[-H(t)].\]
Also:
\[f(t)=h(t)S(t)=h(t)\exp[-H(t)].\]
3.6.1 Worked Example 3.1
Suppose the hazard function is constant:
\[h(t)=0.2.\]
Then:
\[H(t)=\int_0^t 0.2du=0.2t.\]
Hence:
\[S(t)=\exp(-0.2t).\]
3.7 Mean Time to Failure
An important relationship exists between the survivor function and the mean time to failure.
\[E(T)=\int_0^\infty S(t)dt.\]
This quantity is also called the mean survival time or life expectancy in demographic applications.
3.7.1 Worked Example 3.2
Suppose:
\[S(t)=e^{-0.2t}.\]
Then:
\[E(T)=\int_0^\infty e^{-0.2t}dt=\frac{1}{0.2}=5.\]
3.8 Numerical Approximation in R
## [1] 5.005002
3.9 Exercise 3.1
Given:
\[S(t)=e^{-0.25t},\]
find:
- \(H(t)\)
- \(h(t)\)
- \(E(T)\)
3.10 Solution 3.1
\[H(t)=0.25t.\]
\[h(t)=0.25.\]
\[E(T)=\frac{1}{0.25}=4.\]
## [1] 4
3.11 Exercise 3.2
Show that:
\[f(t)=h(t)S(t).\]
3.12 Solution 3.2
From the definition of the hazard function:
\[h(t)=\frac{f(t)}{S(t)}.\]
Multiplying both sides by \(S(t)\) gives:
\[f(t)=h(t)S(t).\]
4 Exponential Failure Law
4.1 Learning Outcomes
- Define the exponential failure law.
- Explain the constant hazard assumption.
- Derive the survivor function, density function and hazard function.
- Explain the memoryless property.
- Estimate the exponential hazard parameter using R.
4.2 Exponential Failure Law
If the hazard function is constant, say:
\[h(t)=\alpha,\]
then the cumulative hazard is:
\[H(t)=\int_0^t \alpha du=\alpha t.\]
Therefore, the survivor function is:
\[S(t)=\exp[-H(t)]=\exp(-\alpha t).\]
The density function is:
\[f(t)=h(t)S(t)=\alpha e^{-\alpha t}, \quad t>0.\]
Thus, \(T\) has an exponential distribution with parameter \(\alpha\).
4.3 Constant Hazard Interpretation
In an industrial reliability setting, the assumption of constant hazard means that after an item has been used, its probability of failure has not changed. In other words, there is no ageing or wearing-out effect.
4.4 Memoryless Property
For \(\Delta t>0\),
\[P(t \le T \le t+\Delta t \mid T>t) = 1-e^{-\alpha \Delta t}.\]
This probability does not depend on \(t\). It only depends on the additional time interval \(\Delta t\). This is called the lack of memory property.
4.5 Worked Example 4.1
Suppose the hazard rate is:
\[\alpha=0.05.\]
Find:
- \(S(10)\)
- \(H(10)\)
- Mean time to failure.
4.5.1 Solution
\[S(10)=e^{-0.05(10)}=e^{-0.5}=0.6065.\]
\[H(10)=0.05(10)=0.5.\]
\[E(T)=\frac{1}{0.05}=20.\]
4.6 R Example 4.1
## [1] 0.6065307
## [1] 0.5
## [1] 20
4.7 Estimation of the Exponential Parameter
If there is no censoring, the maximum likelihood estimator is:
\[\hat{\alpha}=\frac{n}{\sum_{i=1}^n x_i}.\]
With right censoring, if \(d\) events are observed, then:
\[\hat{\alpha}=\frac{d}{\sum_{i=1}^n x_i}.\]
4.8 R Example 4.2: Estimating Hazard with Censoring
time <- c(9, 13, 13, 18, 23, 28, 31, 34, 45, 48, 161)
status <- c(1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0)
d <- sum(status)
alpha_hat <- d / sum(time)
alpha_hat## [1] 0.01654846
## [1] 60.42857
The estimated hazard rate is alpha_hat, and the estimated mean survival time is \(1/\hat{\alpha}\).
4.9 R Example 4.3: Plotting Exponential Survival Curves
t <- seq(0, 100, by = 1)
alpha_values <- c(0.02, 0.05, 0.10)
plot(t, exp(-alpha_values[1] * t), type = "l",
ylim = c(0, 1),
xlab = "Time",
ylab = "Survival Probability",
main = "Exponential Survival Curves")
lines(t, exp(-alpha_values[2] * t), lty = 2)
lines(t, exp(-alpha_values[3] * t), lty = 3)
legend("topright",
legend = paste("alpha =", alpha_values),
lty = 1:3,
bty = "n")4.10 Exercise 4.1
For an exponential failure time with \(\alpha=0.2\), compute:
- \(S(5)\)
- \(H(5)\)
- \(E(T)\)
5 Weibull, Gamma and Log-Logistic Failure Laws
5.1 Learning Outcomes
- Define the Weibull failure law.
- Interpret the Weibull shape parameter.
- Define the gamma failure law.
- Define the log-logistic survival model.
- Use R to simulate and plot survival distributions.
6 Weibull Failure Law
The Weibull distribution generalizes the exponential distribution by allowing the hazard function to vary with time.
Assume the hazard function is:
\[h(t)=\alpha\beta(\alpha t)^{\beta-1},\]
where:
- \(\alpha>0\) is a scale parameter;
- \(\beta>0\) is a shape parameter.
The cumulative hazard is:
\[H(t)=\int_0^t \alpha\beta(\alpha u)^{\beta-1}du.\]
This gives:
\[H(t)=(\alpha t)^\beta.\]
Therefore:
\[S(t)=\exp[-(\alpha t)^\beta].\]
The density function is:
\[f(t)=\alpha\beta(\alpha t)^{\beta-1}\exp[-(\alpha t)^\beta].\]
6.1 Interpretation of the Shape Parameter
- If \(\beta=1\), the Weibull model reduces to the exponential model.
- If \(\beta>1\), the hazard increases with time.
- If \(\beta<1\), the hazard decreases with time.
6.2 R Example 5.1: Weibull Hazard Shapes
t <- seq(0.1, 20, by = 0.1)
alpha <- 0.1
beta_values <- c(0.5, 1, 2)
plot(t, alpha * beta_values[1] * (alpha * t)^(beta_values[1] - 1),
type = "l",
ylim = c(0, 0.3),
xlab = "Time",
ylab = "Hazard",
main = "Weibull Hazard Functions")
lines(t, alpha * beta_values[2] * (alpha * t)^(beta_values[2] - 1), lty = 2)
lines(t, alpha * beta_values[3] * (alpha * t)^(beta_values[3] - 1), lty = 3)
legend("topright",
legend = paste("beta =", beta_values),
lty = 1:3,
bty = "n")7 Gamma Failure Law
The gamma family has density:
\[f(t)=\frac{\alpha(\alpha t)^{\beta-1}}{\Gamma(\beta)}e^{-\alpha t},\]
where \(\alpha>0\) and \(\beta>0\).
The mean is:
\[E(T)=\frac{\beta}{\alpha}.\]
The gamma distribution is flexible but its survivor function involves the incomplete gamma integral, making it less convenient for some survival analysis calculations.
8 Log-Logistic Distribution
The log-logistic model has survivor function:
\[S(t)=\frac{1}{1+(\alpha t)^\beta}.\]
Its density is:
\[f(t)=\frac{\alpha^\beta\beta t^{\beta-1}}{\{1+(\alpha t)^\beta\}^2}.\]
The hazard function is:
\[h(t)=\frac{\beta\alpha^\beta t^{\beta-1}}{1+(\alpha t)^\beta}.\]
8.1 R Example 5.4: Log-Logistic Survival Function
t <- seq(0.1, 50, by = 0.1)
alpha <- 0.1
beta <- 2
S <- 1 / (1 + (alpha * t)^beta)
plot(t, S, type = "l",
xlab = "Time",
ylab = "Survival Probability",
main = "Log-Logistic Survivor Function")8.2 Exercise 5.1
For a Weibull distribution with \(\alpha=0.1\) and \(\beta=2\), compute:
- \(H(10)\)
- \(S(10)\)
- Interpret the shape parameter.
9 Discrete Failure Time
9.1 Learning Outcomes
- Define discrete failure time.
- Compute the discrete survivor function.
- Compute the discrete hazard function.
- Calculate the expected failure time.
- Use R to compute discrete survival quantities.
9.2 Discrete Failure Time
Let the possible values of \(T\) be ordered as:
\[u_0 \le u_1 \le u_2 \le \cdots.\]
The probability mass function is:
\[f_j=P(T=u_j).\]
The survivor function is:
\[S_j=P(T \ge u_j).\]
The hazard function is:
\[h_j=P(T=u_j \mid T \ge u_j).\]
Therefore:
\[h_j=\frac{f_j}{S_j}.\]
Also:
\[f_j=h_jS_j.\]
9.3 Survivor Function from Hazards
The survivor function can be written as:
\[S_{j+1}=S_j(1-h_j).\]
Therefore:
\[S_{j+1}=\prod_{\eta=0}^{j}(1-h_\eta).\]
9.4 Mean of a Discrete Failure Time
For non-negative discrete survival time:
\[E(T)=\sum_{t=1}^{\infty}S(t).\]
9.5 Worked Example 6.1
Given the probability mass function:
| t | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| f(t) | 0.2 | 0.2 | 0.2 | 0.2 | 0.2 |
Find \(S(t)\) and \(E(T)\).
The survivor probabilities are:
| t | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| S(t) | 1.0 | 0.8 | 0.6 | 0.4 | 0.2 |
Thus:
\[E(T)=1.0+0.8+0.6+0.4+0.2=3.\]
9.6 R Example 6.1
## t f S
## 1 1 0.2 1.0
## 2 2 0.2 0.8
## 3 3 0.2 0.6
## 4 4 0.2 0.4
## 5 5 0.2 0.2
## [1] 3
## [1] 3
9.7 Worked Example 6.2
Given:
| t | 0 | 1 | 2 |
|---|---|---|---|
| f(t) | 1/3 | 1/3 | 1/3 |
The survivor function is:
| t | 0 | 1 | 2 |
|---|---|---|---|
| S(t) | 1 | 2/3 | 1/3 |
\[E(T)=\frac{2}{3}+\frac{1}{3}=1.\]
9.8 R Example 6.2
t <- c(0, 1, 2)
f <- c(1/3, 1/3, 1/3)
S <- sapply(t, function(x) sum(f[t >= x]))
data.frame(t, f, S)## t f S
## 1 0 0.3333333 1.0000000
## 2 1 0.3333333 0.6666667
## 3 2 0.3333333 0.3333333
## [1] 1
9.9 Geometric Distribution
The geometric distribution is the discrete analogue of the exponential distribution. It has the same lack of memory property.
Let \(X\) be the number of attempts including the successful attempt, with constant probability \(s\) of success.
\[P(X=x)=(1-s)^{x-1}s.\]
The survivor function is:
\[S(x)=P(X \ge x)=(1-s)^{x-1}.\]
The hazard is:
\[h(x)=s.\]
The expectation is:
\[E(X)=\frac{1}{s}.\]
9.10 R Example 6.3: Geometric Distribution
s <- 0.25
x <- 1:20
pmf <- (1 - s)^(x - 1) * s
surv <- (1 - s)^(x - 1)
haz <- pmf / surv
data.frame(x, pmf, surv, haz)## x pmf surv haz
## 1 1 0.250000000 1.000000000 0.25
## 2 2 0.187500000 0.750000000 0.25
## 3 3 0.140625000 0.562500000 0.25
## 4 4 0.105468750 0.421875000 0.25
## 5 5 0.079101562 0.316406250 0.25
## 6 6 0.059326172 0.237304688 0.25
## 7 7 0.044494629 0.177978516 0.25
## 8 8 0.033370972 0.133483887 0.25
## 9 9 0.025028229 0.100112915 0.25
## 10 10 0.018771172 0.075084686 0.25
## 11 11 0.014078379 0.056313515 0.25
## 12 12 0.010558784 0.042235136 0.25
## 13 13 0.007919088 0.031676352 0.25
## 14 14 0.005939316 0.023757264 0.25
## 15 15 0.004454487 0.017817948 0.25
## 16 16 0.003340865 0.013363461 0.25
## 17 17 0.002505649 0.010022596 0.25
## 18 18 0.001879237 0.007516947 0.25
## 19 19 0.001409428 0.005637710 0.25
## 20 20 0.001057071 0.004228283 0.25
## [1] 4
9.11 Exercise 6.1
For the PMF below, determine the survival function, hazard function and mean.
| x | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| f(x) | 0.15 | 0.35 | 0.25 | 0.15 | 0.10 |
9.12 Solution 6.1
x <- 0:4
f <- c(0.15, 0.35, 0.25, 0.15, 0.10)
S <- sapply(x, function(a) sum(f[x >= a]))
h <- f / S
mean_T <- sum(x * f)
data.frame(x, f, S, h)## x f S h
## 1 0 0.15 1.00 0.1500000
## 2 1 0.35 0.85 0.4117647
## 3 2 0.25 0.50 0.5000000
## 4 3 0.15 0.25 0.6000000
## 5 4 0.10 0.10 1.0000000
## [1] 1.7
10 Parametric Modelling of Survival Data
10.1 Learning Outcomes
- Explain parametric modelling of survival data.
- Write likelihood contributions for censored and uncensored observations.
- Derive the likelihood under censoring.
- Estimate model parameters using R.
10.2 Parametric Modelling
Assume that a specific family of distributions is given, so the distribution is known except for a vector of parameters \(\phi\).
For example, under the exponential model, \(\phi=\alpha\). Under the Weibull model, \(\phi=(\alpha,\beta)\).
10.3 Likelihood Contributions
For continuous survival time:
- an uncensored observation at time \(t_i\) contributes \(f(t_i;\phi)\);
- a censored observation at time \(c_i\) contributes \(S(c_i;\phi)\).
Thus, the full likelihood is:
\[L=\prod_{u} f(t_i;\phi)\prod_c S(c_i;\phi),\]
where the first product is over uncensored observations and the second product is over censored observations.
Using:
\[f(t)=h(t)S(t),\]
we can write the log-likelihood as:
\[\ell = \sum_u \log h(x_i;\phi)+\sum_i \log S(x_i;\phi).\]
Since:
\[S(t)=\exp[-H(t)],\]
we obtain:
\[\ell = \sum_u \log h(x_i;\phi)-\sum_i H(x_i;\phi).\]
10.4 Exponential Special Case
For the exponential model:
\[h(t)=\beta,\]
and:
\[H(t)=\beta t.\]
Therefore:
\[\ell=d\log\beta-\beta\sum x_i,\]
where \(d\) is the number of observed failures.
Differentiating gives:
\[\hat{\beta}=\frac{d}{\sum x_i}.\]
10.5 Worked Example 7.1
Consider the following group A data:
| Time | 9 | 13 | 13+ | 18 | 23 | 28+ | 31 | 34 | 45+ | 48 | 161+ |
|---|
Here, + indicates censoring.
10.6 R Example 7.1: Exponential MLE with Censoring
time_A <- c(9, 13, 13, 18, 23, 28, 31, 34, 45, 48, 161)
status_A <- c(1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0)
d_A <- sum(status_A)
beta_hat_A <- d_A / sum(time_A)
d_A## [1] 7
## [1] 423
## [1] 0.01654846
## [1] 60.42857
10.7 R Example 7.2: Parametric Exponential Model Using survreg
surv_A <- data.frame(time = time_A, status = status_A)
fit_exp <- survreg(Surv(time, status) ~ 1,
data = surv_A,
dist = "exponential")
summary(fit_exp)##
## Call:
## survreg(formula = Surv(time, status) ~ 1, data = surv_A, dist = "exponential")
## Value Std. Error z p
## (Intercept) 4.101 0.378 10.8 <2e-16
##
## Scale fixed at 1
##
## Exponential distribution
## Loglik(model)= -35.7 Loglik(intercept only)= -35.7
## Number of Newton-Raphson Iterations: 4
## n= 11
10.8 Weibull Parametric Model
For the Weibull distribution, the likelihood can be maximized numerically.
10.9 R Example 7.3: Weibull Model
fit_weibull <- survreg(Surv(time, status) ~ 1,
data = surv_A,
dist = "weibull")
summary(fit_weibull)##
## Call:
## survreg(formula = Surv(time, status) ~ 1, data = surv_A, dist = "weibull")
## Value Std. Error z p
## (Intercept) 4.0997 0.3665 11.19 <2e-16
## Log(scale) -0.0314 0.2771 -0.11 0.91
##
## Scale= 0.969
##
## Weibull distribution
## Loglik(model)= -35.7 Loglik(intercept only)= -35.7
## Number of Newton-Raphson Iterations: 5
## n= 11
10.10 Exercise 7.1
Using the data below, estimate the exponential hazard rate.
| Time | 5 | 5 | 8 | 8 | 12 | 23 | 27 | 30 | 33 | 43 | 45 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Status | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
11 Non-Parametric Modelling and Kaplan-Meier Estimation
11.1 Learning Outcomes
- Explain non-parametric survival estimation.
- Derive the Kaplan–Meier estimator.
- Compute Kaplan–Meier estimates manually.
- Use R to fit and plot Kaplan–Meier curves.
11.2 Product Limit Estimation
Suppose the distribution is discrete with probability \(f_j\) at failure times:
\[u_1 \le u_2 \le \cdots.\]
The survivor function can be written as:
\[S(t)=\prod_{u_j<t}(1-h_j).\]
A non-parametric estimator is:
\[\hat{S}(t)=\prod_{u_j<t}(1-\hat{h}_j).\]
At failure time \(u_j\), let:
- \(d_j\) be the number of failures;
- \(r_j\) be the number at risk.
The maximum likelihood estimator of \(h_j\) is:
\[\hat{h}_j=\frac{d_j}{r_j}.\]
Therefore, the Kaplan–Meier estimator is:
\[\hat{S}(t)=\prod_{u_j<t}\left(1-\frac{d_j}{r_j}\right).\]
11.3 Worked Example 8.1
For group A:
| Time | 9 | 13 | 13+ | 18 | 23 | 28+ | 31 | 34 | 45+ | 48 | 161+ |
|---|
The event times are 9, 13, 18, 23, 31, 34 and 48.
11.4 R Example 8.1: Kaplan–Meier Estimator
time_A <- c(9, 13, 13, 18, 23, 28, 31, 34, 45, 48, 161)
status_A <- c(1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0)
km_A <- survfit(Surv(time_A, status_A) ~ 1)
summary(km_A)## Call: survfit(formula = Surv(time_A, status_A) ~ 1)
##
## time n.risk n.event survival std.err lower 95% CI upper 95% CI
## 9 11 1 0.909 0.0867 0.7541 1.000
## 13 10 1 0.818 0.1163 0.6192 1.000
## 18 8 1 0.716 0.1397 0.4884 1.000
## 23 7 1 0.614 0.1526 0.3769 0.999
## 31 5 1 0.491 0.1642 0.2549 0.946
## 34 4 1 0.368 0.1627 0.1549 0.875
## 48 2 1 0.184 0.1535 0.0359 0.944
11.5 Manual Kaplan–Meier Table in R
event_times <- sort(unique(time_A[status_A == 1]))
r_j <- sapply(event_times, function(t) sum(time_A >= t))
d_j <- sapply(event_times, function(t) sum(time_A == t & status_A == 1))
surv_step <- 1 - d_j / r_j
S_hat <- cumprod(surv_step)
km_table <- data.frame(
time = event_times,
r_j = r_j,
d_j = d_j,
one_minus_d_over_r = surv_step,
S_hat = S_hat
)
km_table## time r_j d_j one_minus_d_over_r S_hat
## 1 9 11 1 0.9090909 0.9090909
## 2 13 10 1 0.9000000 0.8181818
## 3 18 8 1 0.8750000 0.7159091
## 4 23 7 1 0.8571429 0.6136364
## 5 31 5 1 0.8000000 0.4909091
## 6 34 4 1 0.7500000 0.3681818
## 7 48 2 1 0.5000000 0.1840909
11.6 Interpretation
At each observed event time, the Kaplan–Meier estimate is updated by multiplying the previous survival probability by:
\[1-\frac{d_j}{r_j}.\]
Censored observations reduce the number at risk after their censoring time, but they do not cause a drop in the survival curve.
11.7 Exercise 8.1
Using group B data, compute and plot the Kaplan–Meier curve.
| Group B | 5 | 5 | 8 | 8 | 12 | 23 | 27 | 30 | 33 | 43 | 45 |
|---|
11.8 Solution 8.1
time_B <- c(5, 5, 8, 8, 12, 23, 27, 30, 33, 43, 45)
status_B <- rep(1, length(time_B))
km_B <- survfit(Surv(time_B, status_B) ~ 1)
summary(km_B)## Call: survfit(formula = Surv(time_B, status_B) ~ 1)
##
## time n.risk n.event survival std.err lower 95% CI upper 95% CI
## 5 11 2 0.8182 0.1163 0.6192 1.000
## 8 9 2 0.6364 0.1450 0.4071 0.995
## 12 7 1 0.5455 0.1501 0.3180 0.936
## 23 6 1 0.4545 0.1501 0.2379 0.868
## 27 5 1 0.3636 0.1450 0.1664 0.795
## 30 4 1 0.2727 0.1343 0.1039 0.716
## 33 3 1 0.1818 0.1163 0.0519 0.637
## 43 2 1 0.0909 0.0867 0.0140 0.589
## 45 1 1 0.0000 NaN NA NA
12 Comparing Survival Curves and the Log-Rank Test
12.1 Learning Outcomes
- Compare survival curves between groups.
- Explain the log-rank test.
- Conduct a log-rank test in R.
- Interpret Kaplan–Meier curves and test results.
12.2 Comparing Two Groups
Survival analysis often compares time-to-event outcomes between two or more groups.
Examples include:
- treatment vs control;
- exposed vs unexposed;
- male vs female;
- intervention vs standard care.
12.3 Kaplan–Meier Curves by Group
time_A <- c(9, 13, 13, 18, 23, 28, 31, 34, 45, 48, 161)
status_A <- c(1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0)
time_B <- c(5, 5, 8, 8, 12, 23, 27, 30, 33, 43, 45)
status_B <- rep(1, length(time_B))
time <- c(time_A, time_B)
status <- c(status_A, status_B)
group <- factor(c(rep("A", length(time_A)), rep("B", length(time_B))))
aml_data <- data.frame(time, status, group)
fit_group <- survfit(Surv(time, status) ~ group, data = aml_data)
plot(fit_group,
xlab = "Time",
ylab = "Survival Probability",
main = "Kaplan-Meier Curves by Group",
lty = 1:2)
legend("topright",
legend = levels(group),
lty = 1:2,
bty = "n")12.4 Log-Rank Test
The log-rank test compares the survival experience of two or more groups.
The null hypothesis is:
\[H_0: S_1(t)=S_2(t)\]
for all \(t\).
The alternative hypothesis is that the survival functions differ.
12.5 R Example 9.1: Log-Rank Test
## Call:
## survdiff(formula = Surv(time, status) ~ group, data = aml_data)
##
## N Observed Expected (O-E)^2/E (O-E)^2/V
## group=A 11 7 10.89 1.39 3.79
## group=B 11 11 7.11 2.12 3.79
##
## Chisq= 3.8 on 1 degrees of freedom, p= 0.05
12.6 Obtaining the p-value
## [1] 0.05149125
12.7 Interpretation
If the p-value is less than 0.05, we reject the null hypothesis and conclude that there is evidence of a difference in survival between the groups.
If the p-value is greater than 0.05, we do not reject the null hypothesis.
12.8 Exercise 9.1
Using the lung dataset in R:
- Fit Kaplan–Meier curves by sex.
- Plot the survival curves.
- Conduct a log-rank test.
- Interpret the result.
12.9 Solution 9.1
data(lung, package = "survival")
fit_lung <- survfit(Surv(time, status) ~ sex, data = lung)
plot(fit_lung,
xlab = "Time",
ylab = "Survival Probability",
main = "Kaplan-Meier Curves by Sex",
lty = 1:2)
legend("topright",
legend = c("Male", "Female"),
lty = 1:2,
bty = "n")## Call:
## survdiff(formula = Surv(time, status) ~ sex, data = lung)
##
## N Observed Expected (O-E)^2/E (O-E)^2/V
## sex=1 138 112 91.6 4.55 10.3
## sex=2 90 53 73.4 5.68 10.3
##
## Chisq= 10.3 on 1 degrees of freedom, p= 0.001
13 Cox Proportional Hazards Model and Accelerated Failure Time Models
13.1 Learning Outcomes
- Explain the Cox proportional hazards model.
- Interpret hazard ratios.
- Fit Cox models in R.
- Check the proportional hazards assumption.
- Explain accelerated failure time models.
13.2 Cox Proportional Hazards Model
The Cox proportional hazards model is:
\[h(t|X)=h_0(t)\exp(\beta X).\]
Where:
- \(h(t|X)\) is the hazard at time \(t\) for covariates \(X\);
- \(h_0(t)\) is the baseline hazard;
- \(\beta\) is the regression coefficient.
13.3 Hazard Ratio
For a one-unit increase in \(X\), the hazard ratio is:
\[HR=\exp(\beta).\]
Interpretation:
- \(HR>1\): increased hazard;
- \(HR<1\): reduced hazard;
- \(HR=1\): no association.
13.4 R Example 10.1: Cox Model Using the Lung Dataset
data(lung, package = "survival")
cox_fit <- coxph(Surv(time, status) ~ age + sex, data = lung)
summary(cox_fit)## Call:
## coxph(formula = Surv(time, status) ~ age + sex, data = lung)
##
## n= 228, number of events= 165
##
## coef exp(coef) se(coef) z Pr(>|z|)
## age 0.017045 1.017191 0.009223 1.848 0.06459 .
## sex -0.513219 0.598566 0.167458 -3.065 0.00218 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## age 1.0172 0.9831 0.9990 1.0357
## sex 0.5986 1.6707 0.4311 0.8311
##
## Concordance= 0.603 (se = 0.025 )
## Likelihood ratio test= 14.12 on 2 df, p=9e-04
## Wald test = 13.47 on 2 df, p=0.001
## Score (logrank) test = 13.72 on 2 df, p=0.001
13.5 Interpreting Hazard Ratios
## age sex
## 1.017191 0.598566
The exponentiated coefficients are hazard ratios.
14 Accelerated Failure Time Models
Accelerated failure time models directly model survival time rather than hazard.
A general AFT model can be written as:
\[\log(T)=\mu+\beta X+\sigma \epsilon.\]
In an AFT model:
- positive coefficients increase survival time;
- negative coefficients shorten survival time.
14.1 R Example 10.2: Weibull AFT Model
##
## Call:
## survreg(formula = Surv(time, status) ~ age + sex, data = lung,
## dist = "weibull")
## Value Std. Error z p
## (Intercept) 6.27485 0.48137 13.04 < 2e-16
## age -0.01226 0.00696 -1.76 0.0781
## sex 0.38209 0.12748 3.00 0.0027
## Log(scale) -0.28230 0.06188 -4.56 5.1e-06
##
## Scale= 0.754
##
## Weibull distribution
## Loglik(model)= -1147.1 Loglik(intercept only)= -1153.9
## Chisq= 13.59 on 2 degrees of freedom, p= 0.0011
## Number of Newton-Raphson Iterations: 5
## n= 228
14.2 Exercise 10.1
Using the lung dataset:
- Fit a Cox model with age and sex.
- Interpret the hazard ratio for age.
- Test the proportional hazards assumption.
- Fit a Weibull AFT model.
14.3 Solution 10.1
## Call:
## coxph(formula = Surv(time, status) ~ age + sex, data = lung)
##
## n= 228, number of events= 165
##
## coef exp(coef) se(coef) z Pr(>|z|)
## age 0.017045 1.017191 0.009223 1.848 0.06459 .
## sex -0.513219 0.598566 0.167458 -3.065 0.00218 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## age 1.0172 0.9831 0.9990 1.0357
## sex 0.5986 1.6707 0.4311 0.8311
##
## Concordance= 0.603 (se = 0.025 )
## Likelihood ratio test= 14.12 on 2 df, p=9e-04
## Wald test = 13.47 on 2 df, p=0.001
## Score (logrank) test = 13.72 on 2 df, p=0.001
## age sex
## 1.017191 0.598566
## chisq df p
## age 0.209 1 0.65
## sex 2.608 1 0.11
## GLOBAL 2.771 2 0.25
##
## Call:
## survreg(formula = Surv(time, status) ~ age + sex, data = lung,
## dist = "weibull")
## Value Std. Error z p
## (Intercept) 6.27485 0.48137 13.04 < 2e-16
## age -0.01226 0.00696 -1.76 0.0781
## sex 0.38209 0.12748 3.00 0.0027
## Log(scale) -0.28230 0.06188 -4.56 5.1e-06
##
## Scale= 0.754
##
## Weibull distribution
## Loglik(model)= -1147.1 Loglik(intercept only)= -1153.9
## Chisq= 13.59 on 2 degrees of freedom, p= 0.0011
## Number of Newton-Raphson Iterations: 5
## n= 228