Introduction

In the Poisson regression lecture, we modeled counts of events. But often in epidemiologic research the outcome of interest is not “how many events” but rather “how long until the event occurs.” For example: time from diagnosis to death, time from treatment start to relapse, or time from enrollment to hospital readmission. These are time-to-event outcomes, and they require special methods because of a phenomenon called censoring: not every subject will experience the event during the study period.

Survival analysis is the family of statistical methods designed for time-to-event data. It is central to clinical trials, cohort studies, and any research involving follow-up periods.

In this lecture we will:

Understand time-to-event data and censoring
Define the survival function \(S(t)\) and the hazard function \(h(t)\)
Estimate and plot survival using the Kaplan-Meier method
Compare survival between groups using the log-rank test
Fit and interpret a Cox proportional hazards regression model
Check the proportional hazards assumption

Tutorial reference: Zabor, E. C. (2025). Survival Analysis in R. Available at emilyzabor.com/survival-analysis-in-r.html

Why/When/Watch-out

Why use survival analysis?

The outcome is time until an event (death, relapse, recovery, first hospitalization)
Some subjects do not experience the event during follow-up (censoring)
We want to estimate survival probabilities over time
We need adjusted hazard ratios that control for confounders

When to use it?

Clinical trials: comparing time to death or progression between treatment arms
Cohort studies: estimating survival or incidence over follow-up
Any study where subjects enter and leave observation at different times

Watch out for:

Informative censoring: if the reason for censoring is related to the outcome, standard methods are biased
Proportional hazards assumption: Cox regression assumes that hazard ratios are constant over time
Small event counts: survival models need adequate events (not just sample size) for stable estimation
Competing risks: when multiple types of events can occur (e.g., death from disease vs. death from other causes), standard Kaplan-Meier overestimates cumulative incidence

Setup and Data

library(tidyverse)
library(knitr)
library(kableExtra)
library(broom)
library(gtsummary)
library(survival)
library(ggsurvfit)
library(survminer)

options(gtsummary.use_ftExtra = TRUE)
set_gtsummary_theme(theme_gtsummary_compact(set_theme = TRUE))

The `lung` Dataset

We use the classic NCCTG Lung Cancer dataset, built into the survival package. This dataset contains 228 patients with advanced lung cancer from the North Central Cancer Treatment Group. The outcome is time from enrollment to death.

data(cancer, package = "survival")

lung <- lung |>
  mutate(
    status_event = status - 1,
    sex = factor(sex, levels = c(1, 2), labels = c("Male", "Female"))
  )

glimpse(lung)

## Rows: 228
## Columns: 11
## $ inst         <dbl> 3, 3, 3, 5, 1, 12, 7, 11, 1, 7, 6, 16, 11, 21, 12, 1, 22,…
## $ time         <dbl> 306, 455, 1010, 210, 883, 1022, 310, 361, 218, 166, 170, …
## $ status       <dbl> 2, 2, 1, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
## $ age          <dbl> 74, 68, 56, 57, 60, 74, 68, 71, 53, 61, 57, 68, 68, 60, 5…
## $ sex          <fct> Male, Male, Male, Male, Male, Male, Female, Female, Male,…
## $ ph.ecog      <dbl> 1, 0, 0, 1, 0, 1, 2, 2, 1, 2, 1, 2, 1, NA, 1, 1, 1, 2, 2,…
## $ ph.karno     <dbl> 90, 90, 90, 90, 100, 50, 70, 60, 70, 70, 80, 70, 90, 60, …
## $ pat.karno    <dbl> 100, 90, 90, 60, 90, 80, 60, 80, 80, 70, 80, 70, 90, 70, …
## $ meal.cal     <dbl> 1175, 1225, NA, 1150, NA, 513, 384, 538, 825, 271, 1025, …
## $ wt.loss      <dbl> NA, 15, 15, 11, 0, 0, 10, 1, 16, 34, 27, 23, 5, 32, 60, 1…
## $ status_event <dbl> 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …

Key variables:

Variable	Description
`time`	Survival time in days
`status`	Censoring status (1 = censored, 2 = dead in the original coding)
`status_event`	Recoded: 1 = dead, 0 = censored
`sex`	1 = Male, 2 = Female
`age`	Age in years
`ph.ecog`	ECOG performance status (0 = good, 4 = bedbound)

lung |>
  tbl_summary(
    include = c(time, status_event, sex, age, ph.ecog),
    type = list(
      c(time, age) ~ "continuous",
      ph.ecog ~ "continuous"
    ),
    statistic = list(
      all_continuous() ~ "{mean} ({sd}); Median: {median}"
    ),
    label = list(
      time         ~ "Survival time (days)",
      status_event ~ "Death (event)",
      sex          ~ "Sex",
      age          ~ "Age (years)",
      ph.ecog      ~ "ECOG performance status"
    )
  ) |>
  bold_labels()

Characteristic	N = 228¹
Survival time (days)	305 (211); Median: 256
Death (event)	165 (72%)
Sex
Male	138 (61%)
Female	90 (39%)
Age (years)	62 (9); Median: 63
ECOG performance status	0.95 (0.72); Median: 1.00
Unknown	1
¹ Mean (SD); Median: Median; n (%)

Part 1: Time-to-Event Data and Censoring

What Makes Survival Data Special?

In a survival study, we observe each subject from a starting point (diagnosis, enrollment, treatment start) until either:

The event occurs (death, relapse, hospitalization, recovery)
The subject is censored: the study ends, the subject is lost to follow-up, or the subject withdraws

The data for each subject are a pair: (time, event indicator). The event indicator is 1 if the event occurred and 0 if the observation was censored.

Why not just use logistic regression? Because logistic regression discards the time information. Two patients who both die have very different prognoses if one died at 6 months and the other at 6 years. Survival analysis uses both the event status and the time to that status.

Censoring

Right censoring is the most common type: we know the subject was event-free up to time \(t\), but we do not know what happens after that. Survival methods are specifically designed to handle right censoring correctly.

set.seed(1220)

censor_example <- tibble(
  patient = factor(paste("Patient", 1:8)),
  start   = 0,
  end     = c(5, 8, 3, 10, 6, 10, 4, 7),
  event   = c(1, 0, 1, 0, 1, 0, 1, 0)
) |>
  mutate(status_label = ifelse(event == 1, "Event", "Censored"))

ggplot(censor_example, aes(y = patient)) +
  geom_segment(aes(x = start, xend = end, yend = patient),
               linewidth = 1.2, color = "steelblue") +
  geom_point(aes(x = end, shape = status_label, color = status_label),
             size = 4) +
  scale_shape_manual(values = c("Event" = 4, "Censored" = 1)) +
  scale_color_manual(values = c("Event" = "red", "Censored" = "steelblue")) +
  labs(
    title = "Visualizing Follow-up and Censoring",
    subtitle = "X = event occurred; O = censored (event-free at last observation)",
    x = "Follow-up Time (years)",
    y = NULL,
    shape = "Status",
    color = "Status"
  ) +
  theme_minimal(base_size = 13)

Interpretation: Each horizontal line represents one patient’s follow-up period. Patients marked with an X experienced the event; patients marked with O were censored (still alive at last follow-up, lost to follow-up, or the study ended). Censored observations contribute information up to the point of censoring: we know these patients survived at least that long.

Part 2: The Survival Function and the Hazard Function

The Survival Function

The survival function \(S(t)\) gives the probability that an individual survives beyond time \(t\):

\[ S(t) = \Pr(T > t) = 1 - F(t) \]

where \(F(t) = \Pr(T \leq t)\) is the cumulative distribution function of the event time.

Properties:

\(S(0) = 1\) (everyone is alive at time 0)
\(S(t)\) is a non-increasing function (survival probability can only decrease or stay the same)
As \(t \to \infty\), \(S(t) \to 0\) (eventually everyone experiences the event)

The Hazard Function

The hazard function \(h(t)\) gives the instantaneous rate of failure at time \(t\), conditional on having survived to that point:

\[ h(t) = \lim_{\Delta t \to 0} \frac{\Pr(t \leq T < t + \Delta t \mid T \geq t)}{\Delta t} \]

Think of the hazard as the “risk per unit time” at any given moment. It is not a probability (it can exceed 1) but rather a rate.

Relationship between S(t) and h(t):

\[ S(t) = \exp\left(-\int_0^t h(u) \, du\right) = \exp\left(-H(t)\right) \]

where \(H(t) = \int_0^t h(u)\,du\) is the cumulative hazard.

Key insight: The survival function and hazard function are two sides of the same coin. If you know one, you can derive the other. Cox regression models the hazard directly; Kaplan-Meier estimates the survival function.

Part 3: The Kaplan-Meier Estimator

Definition

The Kaplan-Meier estimator provides a non-parametric estimate of the survival function from data with censoring. At each event time \(t_j\), with \(d_j\) events out of \(n_j\) subjects still at risk:

\[ \hat{S}(t) = \prod_{t_j \leq t} \left(1 - \frac{d_j}{n_j}\right) \]

Each factor \((1 - d_j / n_j)\) is the conditional probability of surviving past time \(t_j\), given survival up to just before \(t_j\). Censored observations are removed from the risk set at their censoring times but do not produce drops in the survival curve.

Creating a Survival Object

The Surv() function creates a survival object that pairs time with event status. In the output, a + after a time indicates censoring:

Surv(lung$time, lung$status_event)[1:20]

##  [1]  306   455  1010+  210   883  1022+  310   361   218   166   170   654 
## [13]  728    71   567   144   613   707    61    88

Overall Kaplan-Meier Curve

km_overall <- survfit(Surv(time, status_event) ~ 1, data = lung)

survfit2(Surv(time, status_event) ~ 1, data = lung) |>
  ggsurvfit() +
  add_confidence_interval() +
  add_risktable() +
  labs(
    title = "Kaplan-Meier Survival Curve: NCCTG Lung Cancer",
    x = "Days from Enrollment",
    y = "Overall Survival Probability"
  ) +
  theme_minimal(base_size = 13)

Interpretation: The curve starts at 1.0 (all patients alive at enrollment) and drops as events occur. The shaded band is the 95% confidence interval. The risk table below the plot shows the number of patients still being followed at each time point. Vertical drops indicate event times; flat sections indicate periods where no events occurred (though censoring may have occurred).

Estimating Median Survival Time

The median survival is the time at which \(S(t) = 0.5\), meaning half of the subjects have experienced the event. This is the standard summary statistic for survival data because the distribution is usually skewed.

km_overall

## Call: survfit(formula = Surv(time, status_event) ~ 1, data = lung)
## 
##        n events median 0.95LCL 0.95UCL
## [1,] 228    165    310     285     363

survfit(Surv(time, status_event) ~ 1, data = lung) |>
  tbl_survfit(
    probs = 0.5,
    label_header = "**Median Survival (95% CI)**"
  )

Characteristic	Median Survival (95% CI)
Overall	310 (285, 363)

Why not use the mean? Survival times are almost never normally distributed. They are right-skewed, and censoring prevents us from observing the full distribution. The median is more robust and interpretable.

Estimating 1-Year Survival

summary(km_overall, times = 365.25)

## Call: survfit(formula = Surv(time, status_event) ~ 1, data = lung)
## 
##  time n.risk n.event survival std.err lower 95% CI upper 95% CI
##   365     65     121    0.409  0.0358        0.345        0.486

survfit(Surv(time, status_event) ~ 1, data = lung) |>
  tbl_survfit(
    times = 365.25,
    label_header = "**1-Year Survival (95% CI)**"
  )

Characteristic	1-Year Survival (95% CI)
Overall	41% (34%, 49%)

Why not just compute (1 - events/N)? This naive calculation ignores censoring and misclassifies censored patients as event-free, inflating the survival estimate. The Kaplan-Meier method correctly accounts for the fact that censored patients contribute information only during their observed follow-up period.

Part 4: Comparing Survival Between Groups

Kaplan-Meier by Sex

survfit2(Surv(time, status_event) ~ sex, data = lung) |>
  ggsurvfit() +
  add_confidence_interval() +
  add_risktable() +
  labs(
    title = "Kaplan-Meier Survival Curves by Sex",
    x = "Days from Enrollment",
    y = "Overall Survival Probability"
  ) +
  theme_minimal(base_size = 13)

Interpretation: The two curves separate over time, with females appearing to have better survival than males. But is this difference statistically significant? We need a formal test.

The Log-Rank Test

The log-rank test is the standard non-parametric test for comparing survival distributions between groups. It compares the observed number of events in each group with the expected number under the null hypothesis that the survival curves are identical.

survdiff(Surv(time, status_event) ~ sex, data = lung)

## Call:
## survdiff(formula = Surv(time, status_event) ~ sex, data = lung)
## 
##              N Observed Expected (O-E)^2/E (O-E)^2/V
## sex=Male   138      112     91.6      4.55      10.3
## sex=Female  90       53     73.4      5.68      10.3
## 
##  Chisq= 10.3  on 1 degrees of freedom, p= 0.001

Interpretation: A small p-value indicates that the survival curves differ significantly between the groups. The log-rank test tells us whether the curves differ, but not how much. It does not provide an effect size. For that, we turn to Cox regression.

Limitations of the log-rank test:

It gives equal weight to all time points (alternatives exist that weight early or late events differently)
It cannot adjust for confounders
It provides no effect size

Comparing by ECOG Performance Status

lung_ecog <- lung |>
  filter(!is.na(ph.ecog)) |>
  mutate(ph.ecog = factor(ph.ecog))

survfit2(Surv(time, status_event) ~ ph.ecog, data = lung_ecog) |>
  ggsurvfit() +
  add_risktable() +
  labs(
    title = "Kaplan-Meier Survival by ECOG Performance Status",
    subtitle = "0 = Fully active, 1 = Restricted, 2 = Ambulatory, 3 = Limited self-care",
    x = "Days from Enrollment",
    y = "Overall Survival Probability"
  ) +
  theme_minimal(base_size = 13)

survdiff(Surv(time, status_event) ~ ph.ecog, data = lung_ecog)

## Call:
## survdiff(formula = Surv(time, status_event) ~ ph.ecog, data = lung_ecog)
## 
##             N Observed Expected (O-E)^2/E (O-E)^2/V
## ph.ecog=0  63       37   54.153    5.4331    8.2119
## ph.ecog=1 113       82   83.528    0.0279    0.0573
## ph.ecog=2  50       44   26.147   12.1893   14.6491
## ph.ecog=3   1        1    0.172    3.9733    4.0040
## 
##  Chisq= 22  on 3 degrees of freedom, p= 7e-05

There is a clear dose-response relationship: worse performance status is associated with worse survival. This is one of the strongest prognostic factors in oncology.

Part 5: Cox Proportional Hazards Regression

The Model

The Cox proportional hazards model is the workhorse of survival analysis. It models the hazard as:

\[ h(t \mid X) = h_0(t) \exp(\beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p) \]

\(h_0(t)\) is the baseline hazard at time \(t\), which is left completely unspecified (making the model semi-parametric)
The covariate effects are multiplicative on the hazard
The model does not assume any particular distribution for survival times

Exponentiated coefficients are hazard ratios (HRs):

\[ \text{HR} = e^{\beta} \]

Interpreting Hazard Ratios

HR	Interpretation
HR = 1	No effect on hazard
HR > 1	Higher hazard (worse survival, faster event occurrence)
HR < 1	Lower hazard (better survival, slower event occurrence)

Important: A hazard ratio is NOT the same as a risk ratio, even though they are sometimes (incorrectly) used interchangeably. The HR compares instantaneous event rates, not cumulative risks.

Unadjusted Cox Model

mod_cox_sex <- coxph(Surv(time, status_event) ~ sex, data = lung)

mod_cox_sex |>
  tbl_regression(exponentiate = TRUE) |>
  bold_labels() |>
  modify_caption("**Unadjusted Cox Model: Sex and Lung Cancer Survival**")

**Unadjusted Cox Model: Sex and Lung Cancer Survival**
Characteristic	HR	95% CI	p-value
sex
Male	—	—
Female	0.59	0.42, 0.82	0.001
Abbreviations: CI = Confidence Interval, HR = Hazard Ratio

tidy_sex <- tidy(mod_cox_sex, exponentiate = TRUE, conf.int = TRUE)
hr_female <- tidy_sex |> filter(term == "sexFemale")

Interpretation: Females have a hazard ratio of 0.59 (95% CI: 0.42, 0.82; p = 0.001). This means females experience approximately 41% lower hazard of death at any time point compared to males.

Adjusted Cox Model

mod_cox <- coxph(
  Surv(time, status_event) ~ age + sex + ph.ecog,
  data = lung
)

mod_cox |>
  tbl_regression(exponentiate = TRUE) |>
  bold_labels() |>
  modify_caption("**Adjusted Cox Proportional Hazards Model: Lung Cancer Survival**")

**Adjusted Cox Proportional Hazards Model: Lung Cancer Survival**
Characteristic	HR	95% CI	p-value
age	1.01	0.99, 1.03	0.2
sex
Male	—	—
Female	0.58	0.41, 0.80	<0.001
ph.ecog	1.59	1.27, 1.99	<0.001
Abbreviations: CI = Confidence Interval, HR = Hazard Ratio

Interpreting the Adjusted Model

tidy_cox <- tidy(mod_cox, exponentiate = TRUE, conf.int = TRUE)

tidy_cox |>
  mutate(
    HR = round(estimate, 3),
    `95% CI` = paste0("(", round(conf.low, 3), ", ", round(conf.high, 3), ")"),
    `p-value` = round(p.value, 4)
  ) |>
  dplyr::select(term, HR, `95% CI`, `p-value`) |>
  kable(caption = "Cox Model Coefficients") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Cox Model Coefficients
term	HR	95% CI	p-value
age	1.011	(0.993, 1.03)	0.2324
sexFemale	0.575	(0.414, 0.799)	0.0010
ph.ecog	1.590	(1.273, 1.986)	0.0000

Age: Holding sex and ECOG status constant, each additional year of age is associated with a modest increase in hazard.

Sex: After adjusting for age and ECOG status, females still have a significantly lower hazard of death compared to males.

ECOG performance status: Each one-point worsening in ECOG status is associated with a substantial increase in hazard, confirming its strong prognostic value.

Visualizing Adjusted Survival Curves

ggsurvplot(
  survfit(mod_cox, data = lung),
  data = lung,
  conf.int = TRUE,
  ggtheme = theme_minimal(base_size = 13),
  palette = "steelblue",
  title = "Adjusted Survival Curve from Cox Model"
)

new_data <- tibble(
  sex     = factor(c("Male", "Female"), levels = c("Male", "Female")),
  age     = median(lung$age, na.rm = TRUE),
  ph.ecog = 1
)

fit_new <- survfit(mod_cox, newdata = new_data)

ggsurvplot(
  fit_new,
  data = new_data,
  conf.int = TRUE,
  legend.labs = c("Male (age=63, ECOG=1)", "Female (age=63, ECOG=1)"),
  ggtheme = theme_minimal(base_size = 13),
  palette = c("steelblue", "tomato"),
  title = "Adjusted Survival Curves by Sex",
  subtitle = "Holding age at median and ECOG performance status at 1"
)

These curves represent the predicted survival for a “typical” patient (median age, ECOG = 1) in each sex group, based on the Cox model.

Part 6: Checking the Proportional Hazards Assumption

The Assumption

The Cox model assumes that hazard ratios are constant over time. In other words, if females have half the hazard of males at 30 days, they also have half the hazard at 300 days. If this assumption fails, the estimated HR is a kind of “average” across time, which may be misleading.

Testing with `cox.zph()`

The cox.zph() function tests the proportional hazards assumption by evaluating the correlation between scaled Schoenfeld residuals and time. A significant p-value indicates a violation of the assumption.

ph_test <- cox.zph(mod_cox)
ph_test

##         chisq df    p
## age     0.188  1 0.66
## sex     2.305  1 0.13
## ph.ecog 2.054  1 0.15
## GLOBAL  4.464  3 0.22

Interpretation: If all p-values are non-significant (\(p > 0.05\)), the proportional hazards assumption is satisfied for each covariate and globally. If a specific covariate shows a significant result, its effect may change over time.

Visual Inspection: Schoenfeld Residuals

ggcoxzph(ph_test)

A flat trend line (centered around zero) indicates the PH assumption holds. A curve or slope suggests the effect of that covariate changes over time.

What If the Assumption Fails?

If the proportional hazards assumption is violated for a covariate, options include:

Stratification: Fit the model stratified by the offending variable (strata() in the formula)
Time-varying coefficients: Allow the coefficient to change over time
Parametric models: Use an accelerated failure time model instead
Restrict the time window: Limit the analysis to a period where PH holds

# Example: stratifying by sex if PH fails for sex
coxph(Surv(time, status_event) ~ age + ph.ecog + strata(sex), data = lung)

Part 7: Connecting Poisson Regression and Survival Analysis

There is a deep connection between these two frameworks. If we split follow-up time into small intervals and count events in each interval, a Poisson regression with an offset for person-time at risk recovers approximately the same hazard ratios as a Cox model. In epidemiology, this person-time Poisson approach is frequently used for cohort studies with aggregated data.

Feature	Poisson Regression	Cox Proportional Hazards
Outcome	Count of events	Time to event
Effect measure	Rate ratio	Hazard ratio
Censoring	Handled via offset for person-time	Built into the likelihood
Baseline rate/hazard	Modeled or absorbed in offset	Left unspecified
When useful	Aggregated counts, large cohorts	Individual-level time-to-event data

Takeaway: Rate ratios from Poisson regression and hazard ratios from Cox regression are closely related. When the event rate is constant within time intervals, they are mathematically equivalent. This is why both methods are taught together in epidemiologic methods courses.

Summary

Key Concepts

Concept	Key Point
Outcome	(time, event status) pair
Censoring	Right-censored observations contribute partial information
\(S(t)\)	Survival function: \(\Pr(T > t)\)
\(h(t)\)	Hazard function: instantaneous event rate at time \(t\)
Kaplan-Meier	Non-parametric estimate of \(S(t)\)
Log-rank test	Compares survival curves between groups (no effect size)
Cox model	Semi-parametric; \(h(t \mid X) = h_0(t) e^{X\beta}\)
Hazard ratio	\(\text{HR} = e^\beta\); HR > 1 = worse survival
PH assumption	Check with `cox.zph()` and Schoenfeld residual plots
In R	`Surv()`, `survfit()`, `survdiff()`, `coxph()`

Connecting to What We Know

Earlier in the Course	Survival Analogue
`lm()` for continuous outcomes	`coxph()` for time-to-event outcomes
`glm(..., family = binomial)` for binary	`coxph()` handles censored binary events over time
`glm(..., family = poisson)` for counts	Cox model is related via person-time Poisson approach
F-test comparing nested linear models	Log-rank test comparing survival curves
Adjusted ORs from logistic regression	Adjusted HRs from Cox regression

References

Zabor, E. C. (2025). Survival Analysis in R. emilyzabor.com/survival-analysis-in-r.html
Kleinbaum, D. G., & Klein, M. (2012). Survival Analysis: A Self-Learning Text (3rd ed.). Springer.
Therneau, T. M., & Grambsch, P. M. (2000). Modeling Survival Data: Extending the Cox Model. Springer.
Clark, T. G., Bradburn, M. J., Love, S. B., & Altman, D. G. (2003). Survival analysis part I-IV. British Journal of Cancer, 89, 232-238.

No lab activity for this lecture.

Survival Analysis — A Basic Introduction

EPI 553 — Principles of Statistical Inference II (Spring 2026)

Muntasir Masum

April 21, 2026