In the Poisson regression lecture, we modeled counts of events. But often in epidemiologic research the outcome of interest is not “how many events” but rather “how long until the event occurs.” For example: time from diagnosis to death, time from treatment start to relapse, or time from enrollment to hospital readmission. These are time-to-event outcomes, and they require special methods because of a phenomenon called censoring: not every subject will experience the event during the study period.
Survival analysis is the family of statistical methods designed for time-to-event data. It is central to clinical trials, cohort studies, and any research involving follow-up periods.
In this lecture we will:
Tutorial reference: Zabor, E. C. (2025). Survival Analysis in R. Available at emilyzabor.com/survival-analysis-in-r.html
library(tidyverse)
library(knitr)
library(kableExtra)
library(broom)
library(gtsummary)
library(survival)
library(ggsurvfit)
library(survminer)
options(gtsummary.use_ftExtra = TRUE)
set_gtsummary_theme(theme_gtsummary_compact(set_theme = TRUE))lung DatasetWe use the classic NCCTG Lung Cancer dataset, built
into the survival package. This dataset contains 228
patients with advanced lung cancer from the North Central Cancer
Treatment Group. The outcome is time from enrollment to death.
data(cancer, package = "survival")
lung <- lung |>
mutate(
status_event = status - 1,
sex = factor(sex, levels = c(1, 2), labels = c("Male", "Female"))
)
glimpse(lung)## Rows: 228
## Columns: 11
## $ inst <dbl> 3, 3, 3, 5, 1, 12, 7, 11, 1, 7, 6, 16, 11, 21, 12, 1, 22,…
## $ time <dbl> 306, 455, 1010, 210, 883, 1022, 310, 361, 218, 166, 170, …
## $ status <dbl> 2, 2, 1, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
## $ age <dbl> 74, 68, 56, 57, 60, 74, 68, 71, 53, 61, 57, 68, 68, 60, 5…
## $ sex <fct> Male, Male, Male, Male, Male, Male, Female, Female, Male,…
## $ ph.ecog <dbl> 1, 0, 0, 1, 0, 1, 2, 2, 1, 2, 1, 2, 1, NA, 1, 1, 1, 2, 2,…
## $ ph.karno <dbl> 90, 90, 90, 90, 100, 50, 70, 60, 70, 70, 80, 70, 90, 60, …
## $ pat.karno <dbl> 100, 90, 90, 60, 90, 80, 60, 80, 80, 70, 80, 70, 90, 70, …
## $ meal.cal <dbl> 1175, 1225, NA, 1150, NA, 513, 384, 538, 825, 271, 1025, …
## $ wt.loss <dbl> NA, 15, 15, 11, 0, 0, 10, 1, 16, 34, 27, 23, 5, 32, 60, 1…
## $ status_event <dbl> 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
Key variables:
| Variable | Description |
|---|---|
time |
Survival time in days |
status |
Censoring status (1 = censored, 2 = dead in the original coding) |
status_event |
Recoded: 1 = dead, 0 = censored |
sex |
1 = Male, 2 = Female |
age |
Age in years |
ph.ecog |
ECOG performance status (0 = good, 4 = bedbound) |
lung |>
tbl_summary(
include = c(time, status_event, sex, age, ph.ecog),
type = list(
c(time, age) ~ "continuous",
ph.ecog ~ "continuous"
),
statistic = list(
all_continuous() ~ "{mean} ({sd}); Median: {median}"
),
label = list(
time ~ "Survival time (days)",
status_event ~ "Death (event)",
sex ~ "Sex",
age ~ "Age (years)",
ph.ecog ~ "ECOG performance status"
)
) |>
bold_labels()| Characteristic | N = 2281 |
|---|---|
| Survival time (days) | 305 (211); Median: 256 |
| Death (event) | 165 (72%) |
| Sex | |
| Male | 138 (61%) |
| Female | 90 (39%) |
| Age (years) | 62 (9); Median: 63 |
| ECOG performance status | 0.95 (0.72); Median: 1.00 |
| Unknown | 1 |
| 1 Mean (SD); Median: Median; n (%) | |
In a survival study, we observe each subject from a starting point (diagnosis, enrollment, treatment start) until either:
The data for each subject are a pair: (time, event indicator). The event indicator is 1 if the event occurred and 0 if the observation was censored.
Why not just use logistic regression? Because logistic regression discards the time information. Two patients who both die have very different prognoses if one died at 6 months and the other at 6 years. Survival analysis uses both the event status and the time to that status.
Right censoring is the most common type: we know the subject was event-free up to time \(t\), but we do not know what happens after that. Survival methods are specifically designed to handle right censoring correctly.
set.seed(1220)
censor_example <- tibble(
patient = factor(paste("Patient", 1:8)),
start = 0,
end = c(5, 8, 3, 10, 6, 10, 4, 7),
event = c(1, 0, 1, 0, 1, 0, 1, 0)
) |>
mutate(status_label = ifelse(event == 1, "Event", "Censored"))
ggplot(censor_example, aes(y = patient)) +
geom_segment(aes(x = start, xend = end, yend = patient),
linewidth = 1.2, color = "steelblue") +
geom_point(aes(x = end, shape = status_label, color = status_label),
size = 4) +
scale_shape_manual(values = c("Event" = 4, "Censored" = 1)) +
scale_color_manual(values = c("Event" = "red", "Censored" = "steelblue")) +
labs(
title = "Visualizing Follow-up and Censoring",
subtitle = "X = event occurred; O = censored (event-free at last observation)",
x = "Follow-up Time (years)",
y = NULL,
shape = "Status",
color = "Status"
) +
theme_minimal(base_size = 13)Interpretation: Each horizontal line represents one patient’s follow-up period. Patients marked with an X experienced the event; patients marked with O were censored (still alive at last follow-up, lost to follow-up, or the study ended). Censored observations contribute information up to the point of censoring: we know these patients survived at least that long.
The survival function \(S(t)\) gives the probability that an individual survives beyond time \(t\):
\[ S(t) = \Pr(T > t) = 1 - F(t) \]
where \(F(t) = \Pr(T \leq t)\) is the cumulative distribution function of the event time.
Properties:
The hazard function \(h(t)\) gives the instantaneous rate of failure at time \(t\), conditional on having survived to that point:
\[ h(t) = \lim_{\Delta t \to 0} \frac{\Pr(t \leq T < t + \Delta t \mid T \geq t)}{\Delta t} \]
Think of the hazard as the “risk per unit time” at any given moment. It is not a probability (it can exceed 1) but rather a rate.
Relationship between S(t) and h(t):
\[ S(t) = \exp\left(-\int_0^t h(u) \, du\right) = \exp\left(-H(t)\right) \]
where \(H(t) = \int_0^t h(u)\,du\) is the cumulative hazard.
Key insight: The survival function and hazard function are two sides of the same coin. If you know one, you can derive the other. Cox regression models the hazard directly; Kaplan-Meier estimates the survival function.
The Kaplan-Meier estimator provides a non-parametric estimate of the survival function from data with censoring. At each event time \(t_j\), with \(d_j\) events out of \(n_j\) subjects still at risk:
\[ \hat{S}(t) = \prod_{t_j \leq t} \left(1 - \frac{d_j}{n_j}\right) \]
Each factor \((1 - d_j / n_j)\) is the conditional probability of surviving past time \(t_j\), given survival up to just before \(t_j\). Censored observations are removed from the risk set at their censoring times but do not produce drops in the survival curve.
The Surv() function creates a survival object that pairs
time with event status. In the output, a + after a time
indicates censoring:
## [1] 306 455 1010+ 210 883 1022+ 310 361 218 166 170 654
## [13] 728 71 567 144 613 707 61 88
km_overall <- survfit(Surv(time, status_event) ~ 1, data = lung)
survfit2(Surv(time, status_event) ~ 1, data = lung) |>
ggsurvfit() +
add_confidence_interval() +
add_risktable() +
labs(
title = "Kaplan-Meier Survival Curve: NCCTG Lung Cancer",
x = "Days from Enrollment",
y = "Overall Survival Probability"
) +
theme_minimal(base_size = 13)Interpretation: The curve starts at 1.0 (all patients alive at enrollment) and drops as events occur. The shaded band is the 95% confidence interval. The risk table below the plot shows the number of patients still being followed at each time point. Vertical drops indicate event times; flat sections indicate periods where no events occurred (though censoring may have occurred).
The median survival is the time at which \(S(t) = 0.5\), meaning half of the subjects have experienced the event. This is the standard summary statistic for survival data because the distribution is usually skewed.
## Call: survfit(formula = Surv(time, status_event) ~ 1, data = lung)
##
## n events median 0.95LCL 0.95UCL
## [1,] 228 165 310 285 363
survfit(Surv(time, status_event) ~ 1, data = lung) |>
tbl_survfit(
probs = 0.5,
label_header = "**Median Survival (95% CI)**"
)| Characteristic | Median Survival (95% CI) |
|---|---|
| Overall | 310 (285, 363) |
Why not use the mean? Survival times are almost never normally distributed. They are right-skewed, and censoring prevents us from observing the full distribution. The median is more robust and interpretable.
## Call: survfit(formula = Surv(time, status_event) ~ 1, data = lung)
##
## time n.risk n.event survival std.err lower 95% CI upper 95% CI
## 365 65 121 0.409 0.0358 0.345 0.486
survfit(Surv(time, status_event) ~ 1, data = lung) |>
tbl_survfit(
times = 365.25,
label_header = "**1-Year Survival (95% CI)**"
)| Characteristic | 1-Year Survival (95% CI) |
|---|---|
| Overall | 41% (34%, 49%) |
Why not just compute (1 - events/N)? This naive calculation ignores censoring and misclassifies censored patients as event-free, inflating the survival estimate. The Kaplan-Meier method correctly accounts for the fact that censored patients contribute information only during their observed follow-up period.
survfit2(Surv(time, status_event) ~ sex, data = lung) |>
ggsurvfit() +
add_confidence_interval() +
add_risktable() +
labs(
title = "Kaplan-Meier Survival Curves by Sex",
x = "Days from Enrollment",
y = "Overall Survival Probability"
) +
theme_minimal(base_size = 13)Interpretation: The two curves separate over time, with females appearing to have better survival than males. But is this difference statistically significant? We need a formal test.
The log-rank test is the standard non-parametric test for comparing survival distributions between groups. It compares the observed number of events in each group with the expected number under the null hypothesis that the survival curves are identical.
## Call:
## survdiff(formula = Surv(time, status_event) ~ sex, data = lung)
##
## N Observed Expected (O-E)^2/E (O-E)^2/V
## sex=Male 138 112 91.6 4.55 10.3
## sex=Female 90 53 73.4 5.68 10.3
##
## Chisq= 10.3 on 1 degrees of freedom, p= 0.001
Interpretation: A small p-value indicates that the survival curves differ significantly between the groups. The log-rank test tells us whether the curves differ, but not how much. It does not provide an effect size. For that, we turn to Cox regression.
Limitations of the log-rank test:
lung_ecog <- lung |>
filter(!is.na(ph.ecog)) |>
mutate(ph.ecog = factor(ph.ecog))
survfit2(Surv(time, status_event) ~ ph.ecog, data = lung_ecog) |>
ggsurvfit() +
add_risktable() +
labs(
title = "Kaplan-Meier Survival by ECOG Performance Status",
subtitle = "0 = Fully active, 1 = Restricted, 2 = Ambulatory, 3 = Limited self-care",
x = "Days from Enrollment",
y = "Overall Survival Probability"
) +
theme_minimal(base_size = 13)## Call:
## survdiff(formula = Surv(time, status_event) ~ ph.ecog, data = lung_ecog)
##
## N Observed Expected (O-E)^2/E (O-E)^2/V
## ph.ecog=0 63 37 54.153 5.4331 8.2119
## ph.ecog=1 113 82 83.528 0.0279 0.0573
## ph.ecog=2 50 44 26.147 12.1893 14.6491
## ph.ecog=3 1 1 0.172 3.9733 4.0040
##
## Chisq= 22 on 3 degrees of freedom, p= 7e-05
There is a clear dose-response relationship: worse performance status is associated with worse survival. This is one of the strongest prognostic factors in oncology.
The Cox proportional hazards model is the workhorse of survival analysis. It models the hazard as:
\[ h(t \mid X) = h_0(t) \exp(\beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p) \]
Exponentiated coefficients are hazard ratios (HRs):
\[ \text{HR} = e^{\beta} \]
| HR | Interpretation |
|---|---|
| HR = 1 | No effect on hazard |
| HR > 1 | Higher hazard (worse survival, faster event occurrence) |
| HR < 1 | Lower hazard (better survival, slower event occurrence) |
Important: A hazard ratio is NOT the same as a risk ratio, even though they are sometimes (incorrectly) used interchangeably. The HR compares instantaneous event rates, not cumulative risks.
mod_cox_sex <- coxph(Surv(time, status_event) ~ sex, data = lung)
mod_cox_sex |>
tbl_regression(exponentiate = TRUE) |>
bold_labels() |>
modify_caption("**Unadjusted Cox Model: Sex and Lung Cancer Survival**")| Characteristic | HR | 95% CI | p-value |
|---|---|---|---|
| sex | |||
| Male | — | — | |
| Female | 0.59 | 0.42, 0.82 | 0.001 |
| Abbreviations: CI = Confidence Interval, HR = Hazard Ratio | |||
tidy_sex <- tidy(mod_cox_sex, exponentiate = TRUE, conf.int = TRUE)
hr_female <- tidy_sex |> filter(term == "sexFemale")Interpretation: Females have a hazard ratio of 0.59 (95% CI: 0.42, 0.82; p = 0.001). This means females experience approximately 41% lower hazard of death at any time point compared to males.
mod_cox <- coxph(
Surv(time, status_event) ~ age + sex + ph.ecog,
data = lung
)
mod_cox |>
tbl_regression(exponentiate = TRUE) |>
bold_labels() |>
modify_caption("**Adjusted Cox Proportional Hazards Model: Lung Cancer Survival**")| Characteristic | HR | 95% CI | p-value |
|---|---|---|---|
| age | 1.01 | 0.99, 1.03 | 0.2 |
| sex | |||
| Male | — | — | |
| Female | 0.58 | 0.41, 0.80 | <0.001 |
| ph.ecog | 1.59 | 1.27, 1.99 | <0.001 |
| Abbreviations: CI = Confidence Interval, HR = Hazard Ratio | |||
tidy_cox <- tidy(mod_cox, exponentiate = TRUE, conf.int = TRUE)
tidy_cox |>
mutate(
HR = round(estimate, 3),
`95% CI` = paste0("(", round(conf.low, 3), ", ", round(conf.high, 3), ")"),
`p-value` = round(p.value, 4)
) |>
dplyr::select(term, HR, `95% CI`, `p-value`) |>
kable(caption = "Cox Model Coefficients") |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| term | HR | 95% CI | p-value |
|---|---|---|---|
| age | 1.011 | (0.993, 1.03) | 0.2324 |
| sexFemale | 0.575 | (0.414, 0.799) | 0.0010 |
| ph.ecog | 1.590 | (1.273, 1.986) | 0.0000 |
Age: Holding sex and ECOG status constant, each additional year of age is associated with a modest increase in hazard.
Sex: After adjusting for age and ECOG status, females still have a significantly lower hazard of death compared to males.
ECOG performance status: Each one-point worsening in ECOG status is associated with a substantial increase in hazard, confirming its strong prognostic value.
ggsurvplot(
survfit(mod_cox, data = lung),
data = lung,
conf.int = TRUE,
ggtheme = theme_minimal(base_size = 13),
palette = "steelblue",
title = "Adjusted Survival Curve from Cox Model"
)new_data <- tibble(
sex = factor(c("Male", "Female"), levels = c("Male", "Female")),
age = median(lung$age, na.rm = TRUE),
ph.ecog = 1
)
fit_new <- survfit(mod_cox, newdata = new_data)
ggsurvplot(
fit_new,
data = new_data,
conf.int = TRUE,
legend.labs = c("Male (age=63, ECOG=1)", "Female (age=63, ECOG=1)"),
ggtheme = theme_minimal(base_size = 13),
palette = c("steelblue", "tomato"),
title = "Adjusted Survival Curves by Sex",
subtitle = "Holding age at median and ECOG performance status at 1"
)These curves represent the predicted survival for a “typical” patient (median age, ECOG = 1) in each sex group, based on the Cox model.
The Cox model assumes that hazard ratios are constant over time. In other words, if females have half the hazard of males at 30 days, they also have half the hazard at 300 days. If this assumption fails, the estimated HR is a kind of “average” across time, which may be misleading.
cox.zph()The cox.zph() function tests the proportional hazards
assumption by evaluating the correlation between scaled Schoenfeld
residuals and time. A significant p-value indicates a violation of the
assumption.
## chisq df p
## age 0.188 1 0.66
## sex 2.305 1 0.13
## ph.ecog 2.054 1 0.15
## GLOBAL 4.464 3 0.22
Interpretation: If all p-values are non-significant (\(p > 0.05\)), the proportional hazards assumption is satisfied for each covariate and globally. If a specific covariate shows a significant result, its effect may change over time.
A flat trend line (centered around zero) indicates the PH assumption holds. A curve or slope suggests the effect of that covariate changes over time.
If the proportional hazards assumption is violated for a covariate, options include:
strata() in the formula)# Example: stratifying by sex if PH fails for sex
coxph(Surv(time, status_event) ~ age + ph.ecog + strata(sex), data = lung)There is a deep connection between these two frameworks. If we split follow-up time into small intervals and count events in each interval, a Poisson regression with an offset for person-time at risk recovers approximately the same hazard ratios as a Cox model. In epidemiology, this person-time Poisson approach is frequently used for cohort studies with aggregated data.
| Feature | Poisson Regression | Cox Proportional Hazards |
|---|---|---|
| Outcome | Count of events | Time to event |
| Effect measure | Rate ratio | Hazard ratio |
| Censoring | Handled via offset for person-time | Built into the likelihood |
| Baseline rate/hazard | Modeled or absorbed in offset | Left unspecified |
| When useful | Aggregated counts, large cohorts | Individual-level time-to-event data |
Takeaway: Rate ratios from Poisson regression and hazard ratios from Cox regression are closely related. When the event rate is constant within time intervals, they are mathematically equivalent. This is why both methods are taught together in epidemiologic methods courses.
| Concept | Key Point |
|---|---|
| Outcome | (time, event status) pair |
| Censoring | Right-censored observations contribute partial information |
| \(S(t)\) | Survival function: \(\Pr(T > t)\) |
| \(h(t)\) | Hazard function: instantaneous event rate at time \(t\) |
| Kaplan-Meier | Non-parametric estimate of \(S(t)\) |
| Log-rank test | Compares survival curves between groups (no effect size) |
| Cox model | Semi-parametric; \(h(t \mid X) = h_0(t) e^{X\beta}\) |
| Hazard ratio | \(\text{HR} = e^\beta\); HR > 1 = worse survival |
| PH assumption | Check with cox.zph() and Schoenfeld residual plots |
| In R | Surv(), survfit(),
survdiff(), coxph() |
| Earlier in the Course | Survival Analogue |
|---|---|
lm() for continuous outcomes |
coxph() for time-to-event outcomes |
glm(..., family = binomial) for binary |
coxph() handles censored binary events over time |
glm(..., family = poisson) for counts |
Cox model is related via person-time Poisson approach |
| F-test comparing nested linear models | Log-rank test comparing survival curves |
| Adjusted ORs from logistic regression | Adjusted HRs from Cox regression |
No lab activity for this lecture.