Code
toy <- tibble(
patient = paste0("P", 1:8),
time = c(3, 5, 5, 7, 8, 10, 12, 12),
status = c(1, 1, 0, 1, 0, 1, 1, 0)
)
toy| patient | time | status |
|---|---|---|
| P1 | 3 | 1 |
| P2 | 5 | 1 |
| P3 | 5 | 0 |
| P4 | 7 | 1 |
| P5 | 8 | 0 |
| P6 | 10 | 1 |
| P7 | 12 | 1 |
| P8 | 12 | 0 |
Non-parametric survival curves, step-by-step derivation, and log-rank tests
In the previous article we examined censoring — the fundamental challenge that makes survival data different from ordinary outcomes. In this article, we build the tool that handles it with elegance: the Kaplan-Meier estimator.
“The Kaplan-Meier curve doesn’t assume a shape for survival. It lets the data draw it, one event at a time.”
Survival data raises a deceptively simple question:
What is the probability that a patient, customer, or device survives beyond a given time \(t\)?
We need a method that:
Makes no distributional assumptions
No Weibull, no exponential — the curve follows the data exactly
Handles censoring correctly
Censored observations contribute to the risk set until they leave
Updates step-by-step
Survival only drops when an actual event occurs — flat otherwise
Works on any event type
Death, relapse, churn, failure — the math is identical
This article is Part 2 of the series:
Part 1
Censoring & Truncation— Understanding incomplete observation
Part 2 — Current
Kaplan-Meier — Non-parametric survival curves and log-rank tests
Part 3
Cox Regression — Hazard ratios, adjusted effects, and model assumptions
The Kaplan-Meier (KM) estimator is a non-parametric, empirical estimate of the survival function \(S(t) = P(T > t)\).
The word non-parametric means it makes no assumption about the shape of \(S(t)\). Compare this to parametric models (exponential, Weibull) that impose a fixed mathematical form. KM is purely data-driven.
A KM curve is a staircase, not a smooth line:
Each drop reflects a real event that occurred in the data.
The key insight is that KM decomposes survival probability into a product of conditional probabilities, one step at a time:
\[\hat{S}(t) = \prod_{t_i \leq t} \left(1 - \frac{d_i}{n_i}\right)\]
where:
The product runs over all event times up to and including \(t\).
Before deriving the formula formally, let’s build the intuition.
Survival is built one step at a time. At each event time, we ask:
Of the people still at risk right now, what fraction survived this moment?
Then we multiply that fraction by everything that came before.
Kaplan-Meier recursion
\[\hat{S}(t) = \left(1 - \frac{n_{\text{event}}}{n_{\text{risk}}}\right) \times \hat{S}(t_{\text{prev}})\]
\(n_{\text{risk}}\) — people still under observation who could have the event at this time
\(n_{\text{event}}\) — events that actually occurred at this time
\(\hat{S}(t_{\text{prev}})\) — survival estimated at the previous event time
The first step uses \(\hat{S}(t_{\text{prev}}) = 1\) because at time zero, everyone is alive.
The recursion above follows from the conditional probability decomposition of \(S(t) = P(T > t)\):
\[ S(t) = P(T > t \mid T \geq t) \cdot P(T \geq t) = \left[1 - P(T = t \mid T \geq t)\right] \cdot S(t_{\text{prev}}) \]
The term \(P(T = t \mid T \geq t)\) is estimated empirically as \(d_i / n_i\) — the observed fraction of the risk set who experienced the event at \(t_i\). This is why the KM estimator is a nonparametric maximum likelihood estimate: it uses the observed data proportions directly, imposing no shape.
The risk set at time \(t\) is the group of individuals who:
The denominator \(n_{\text{risk}}\) in the KM formula is the size of the risk set — not the total sample size. As the study progresses:
This dynamic updating of the risk set is precisely how KM extracts information from censored observations without discarding them.
Let’s compute the KM estimator by hand on a tiny dataset of 8 patients to make every calculation transparent.
toy <- tibble(
patient = paste0("P", 1:8),
time = c(3, 5, 5, 7, 8, 10, 12, 12),
status = c(1, 1, 0, 1, 0, 1, 1, 0)
)
toy| patient | time | status |
|---|---|---|
| P1 | 3 | 1 |
| P2 | 5 | 1 |
| P3 | 5 | 0 |
| P4 | 7 | 1 |
| P5 | 8 | 0 |
| P6 | 10 | 1 |
| P7 | 12 | 1 |
| P8 | 12 | 0 |
Now compute \(n_{\text{risk}}\), \(d_i\), and \(\hat{S}(t)\) at each event time:
km_steps <- tibble(
`Time $t_i$` = c(3, 5, 7, 10, 12),
`$n_{\\text{risk}}$` = c(8, 7, 5, 4, 3),
`$d_i$ (events)` = c(1, 1, 1, 1, 1),
`$1 - d_i/n_i$` = c(7/8, 6/7, 4/5, 3/4, 2/3),
`$\\hat{S}(t)$` = round(cumprod(c(7/8, 6/7, 4/5, 3/4, 2/3)), 4)
)
km_steps| Time \(t_i\) | \(n_{\text{risk}}\) | \(d_i\) (events) | \(1 - d_i/n_i\) | \(\hat{S}(t)\) |
|---|---|---|---|---|
| 3 | 8 | 1 | 0.8750000 | 0.875 |
| 5 | 7 | 1 | 0.8571429 | 0.750 |
| 7 | 5 | 1 | 0.8000000 | 0.600 |
| 10 | 4 | 1 | 0.7500000 | 0.450 |
| 12 | 3 | 1 | 0.6666667 | 0.300 |
Between \(t = 5\) and \(t = 7\), the risk set drops from 7 to 5 — yet only 1 event occurred at \(t = 5\). The extra reduction is due to the censored observation at \(t = 5\) (patient P3, status = 0). That patient contributed to \(n_{\text{risk}}\) at \(t = 5\) but was then removed without triggering a drop in \(\hat{S}(t)\).
survfit()km_toy <- survfit(Surv(time, status) ~ 1, data = toy)
print(summary(km_toy), digits = 4)Call: survfit(formula = Surv(time, status) ~ 1, data = toy)
time n.risk n.event survival std.err lower 95% CI upper 95% CI
3 8 1 0.875 0.1169 0.67338 1
5 7 1 0.750 0.1531 0.50270 1
7 5 1 0.600 0.1817 0.33146 1
10 3 1 0.400 0.2033 0.14771 1
12 2 1 0.200 0.1742 0.03629 1
The survival column matches our manual calculation exactly.
survival::survfit()The standard workflow uses Surv(time, event) as the response and ~ 1 to estimate overall survival (no stratification):
# lung dataset: NSCLC patients (survival package)
# time = days, status = 2 means death
km_overall <- survfit(Surv(time, status) ~ 1, data = lung)
km_overallCall: survfit(formula = Surv(time, status) ~ 1, data = lung)
n events median 0.95LCL 0.95UCL
[1,] 228 165 310 285 363
# Full life table: one row per unique event time
print(summary(km_overall, times = c(30, 60, 90, 180, 365)), digits = 4)Call: survfit(formula = Surv(time, status) ~ 1, data = lung)
time n.risk n.event survival std.err lower 95% CI upper 95% CI
30 219 10 0.9561 0.01356 0.9299 0.9831
60 213 7 0.9254 0.01740 0.8920 0.9602
90 201 10 0.8816 0.02140 0.8406 0.9245
180 160 36 0.7217 0.02981 0.6655 0.7825
365 65 58 0.4092 0.03582 0.3447 0.4858
Use the times argument of summary() to pull \(\hat{S}(t)\) at any time point you need:
# 1-year (365.25-day) survival probability
s1yr <- summary(km_overall, times = 365.25)
cat("1-year survival:", round(s1yr$surv, 3),
" 95% CI: [", round(s1yr$lower, 3), ",", round(s1yr$upper, 3), "]\n")1-year survival: 0.409 95% CI: [ 0.345 , 0.486 ]
plot(km_overall,
xlab = "Days", ylab = "Survival probability",
main = "Overall survival — lung cancer (NCCTG)",
xlim = c(0, max(lung$time, na.rm = TRUE)),
mark.time = TRUE,
col = clr$teal, lwd = 2)ggsurvfit — the modern approachThe {ggsurvfit} package brings the KM curve into the {ggplot2} ecosystem. Use survfit2() in place of survfit() for better label defaults:
survfit2(Surv(time, status) ~ 1, data = lung) |>
ggsurvfit(linewidth = 1) +
add_confidence_interval() +
add_risktable(
risktable_stats = c("n.risk", "cum.event"),
risktable_height = 0.28
) +
scale_color_manual(values = clr$teal) +
scale_fill_manual(values = clr$teal) +
scale_y_continuous(labels = percent_format(accuracy = 1),
limits = c(0, 1)) +
labs(
x = "Days since enrolment",
y = "Survival probability",
title = "Kaplan-Meier: Overall survival",
caption = "Shaded band = 95% CI · Tick marks = censored observations"
) +
theme_survival()The probability \(\hat{S}(t) = P(T > t)\) is read directly off the curve — or extracted with summary(..., times = t).
A common mistake is to compute survival probability by simply counting the proportion of patients who were event-free, ignoring censored patients:
\[\text{(Naive)} \quad \hat{S}_{\text{naive}}(t) = \frac{\text{still event-free at } t}{\text{total enrolled}}\]
This treats censored patients as if they were followed for the entire study period — which they were not. It overestimates survival because it artificially inflates the denominator.
KM is correct precisely because it removes censored patients from the risk set at their censoring time.
# 1-year naive estimate (ignores censoring)
n_total <- sum(!is.na(lung$time) & !is.na(lung$status))
n_alive_1yr <- sum(lung$time > 365.25, na.rm = TRUE)
naive_1yr <- n_alive_1yr / n_total
# KM estimate
km_1yr <- summary(km_overall, times = 365.25)$surv
cat("Naive 1-year survival : ", round(naive_1yr, 3), "\n")Naive 1-year survival : 0.285
cat("KM 1-year survival : ", round(km_1yr, 3), "\n")KM 1-year survival : 0.409
cat("Overestimate from ignoring censoring:",
round(naive_1yr - km_1yr, 3), "\n")Overestimate from ignoring censoring: -0.124
The median survival time is the time at which \(\hat{S}(t) = 0.50\) — the point where half of the population has experienced the event.
If the survival curve never drops below 0.50, the median is undefined (NA). This happens when fewer than 50% of participants experienced the event — common in studies with short follow-up or effective treatments.
# Extract median and 95% CI
summary(km_overall)$table[c("median", "0.95LCL", "0.95UCL")] median 0.95LCL 0.95UCL
310 285 363
# Naive median: median among events only (ignores censored follow-up)
naive_median <- lung |> filter(status == 2) |> pull(time) |> median()
km_median <- summary(km_overall)$table["median"]
cat("Naive median (events only) :", naive_median, "days\n")Naive median (events only) : 226 days
cat("KM median (correct) :", km_median, "days\n")KM median (correct) : 310 days
cat("Underestimate from ignoring censored follow-up:",
km_median - naive_median, "days\n")Underestimate from ignoring censored follow-up: 84 days
The naive estimate uses only patients who experienced the event, discarding all the follow-up time contributed by censored patients — so it underestimates median survival.
To estimate \(P(t_1 \leq T \leq t_2)\) — the probability of experiencing the event between two time points — use the relationship:
\[P(t_1 \leq T \leq t_2) = \hat{S}(t_1^{\text{prev}}) - \hat{S}(t_2)\]
where \(t_1^{\text{prev}}\) is the largest event time strictly before \(t_1\).
# P(180 <= T <= 365): probability of event between 6 months and 1 year
s179 <- summary(km_overall, times = 179)$surv
s365 <- summary(km_overall, times = 365)$surv
cat("P(180 ≤ T ≤ 365) = S(179) - S(365) =",
round(s179, 4), "-", round(s365, 4), "=",
round(s179 - s365, 4), "\n")P(180 ≤ T ≤ 365) = S(179) - S(365) = 0.7262 - 0.4092 = 0.317
The complete set of interval probability formulas, for reference:
| Probability | Formula using \(\hat{S}(t)\) |
|---|---|
| \(P(t_1 \leq T \leq t_2)\) | \(\hat{S}(t_1^{\text{prev}}) - \hat{S}(t_2)\) |
| \(P(t_1 < T \leq t_2)\) | \(\hat{S}(t_1) - \hat{S}(t_2)\) |
| \(P(T > t)\) | \(\hat{S}(t)\) |
| \(P(T \geq t)\) | \(\hat{S}(t^{\text{prev}})\) |
| \(P(T \leq t)\) | \(1 - \hat{S}(t)\) |
| \(P(T = t)\) | \(\hat{S}(t^{\text{prev}}) - \hat{S}(t)\) |
One of the most common uses of the KM estimator is comparing survival between two or more groups. In the survfit() formula, place the grouping variable on the right-hand side:
km_sex <- survfit(Surv(time, status) ~ sex, data = lung)
km_sexCall: survfit(formula = Surv(time, status) ~ sex, data = lung)
n events median 0.95LCL 0.95UCL
sex=1 138 112 270 212 310
sex=2 90 53 426 348 550
survfit2(Surv(time, status) ~ sex, data = lung) |>
ggsurvfit(linewidth = 1) +
add_confidence_interval() +
add_risktable(risktable_height = 0.28) +
scale_color_manual(
values = c("1" = clr$blue, "2" = clr$red),
labels = c("1" = "Male", "2" = "Female")
) +
scale_fill_manual(
values = c("1" = clr$blue, "2" = clr$red),
labels = c("1" = "Male", "2" = "Female")
) +
scale_y_continuous(labels = percent_format(accuracy = 1), limits = c(0, 1)) +
labs(
x = "Days since enrolment",
y = "Survival probability",
color = "Sex", fill = "Sex",
title = "Kaplan-Meier: Survival by sex",
caption = "Shaded bands = 95% CI · Tick marks = censored observations"
) +
theme_survival()The KM curve shows whether groups differ. The log-rank test answers whether that difference is statistically significant under \(H_0: S_1(t) = S_2(t)\) for all \(t\).
At each event time \(t_i\), we compare observed events in each group against the expected number under the null hypothesis of equal survival. The test statistic aggregates these \(O - E\) differences:
\[\chi^2 = \frac{(\sum_i (O_{1i} - E_{1i}))^2}{\sum_i V_{1i}}\]
where \(V_{1i}\) is the hypergeometric variance at time \(t_i\). Under \(H_0\), this follows a \(\chi^2\) distribution with \(k - 1\) degrees of freedom (\(k\) = number of groups).
The log-rank test equally weights all time points. It tests whether the survival curves differ somewhere — but it does not quantify the magnitude of the difference, and it has reduced power when curves cross. For effect size, use the Cox hazard ratio (Part 3).
lr_sex <- survdiff(Surv(time, status) ~ sex, data = lung)
lr_sexCall:
survdiff(formula = Surv(time, status) ~ sex, data = lung)
N Observed Expected (O-E)^2/E (O-E)^2/V
sex=1 138 112 91.6 4.55 10.3
sex=2 90 53 73.4 5.68 10.3
Chisq= 10.3 on 1 degrees of freedom, p= 0.001
# Extract p-value manually (pchisq with df = k - 1)
p_val <- pchisq(lr_sex$chisq,
df = length(levels(factor(lung$sex))) - 1,
lower.tail = FALSE)
cat("Log-rank test: chi-sq =", round(lr_sex$chisq, 2),
", df = 1, p =", round(p_val, 4), "\n")Log-rank test: chi-sq = 10.33 , df = 1, p = 0.0013
The test confirms a statistically significant difference in survival between male and female patients (\(\chi^2 = 10.3\), \(p = 0.001\)). Female patients have meaningfully longer survival.
{gtsummary}# 1-year survival table
tbl1 <- survfit(Surv(time, status) ~ sex, data = lung) |>
tbl_survfit(
times = 365.25,
label_header = "**1-year survival (95% CI)**"
)
# Median survival table
tbl2 <- survfit(Surv(time, status) ~ sex, data = lung) |>
tbl_survfit(
probs = 0.5,
label_header = "**Median survival (95% CI)**"
)
tbl_merge(list(tbl1, tbl2))| Characteristic |
Table 1
|
Table 2
|
|---|---|---|
| 1-year survival (95% CI) | Median survival (95% CI) | |
| sex | ||
| 1 | 34% (26%, 43%) | 270 (212, 310) |
| 2 | 53% (42%, 66%) | 426 (348, 550) |
The log-rank test generalises naturally to \(k > 2\) groups — the degrees of freedom become \(k - 1\):
lung2 <- lung |> filter(!is.na(ph.ecog))
survfit2(Surv(time, status) ~ ph.ecog, data = lung2) |>
ggsurvfit(linewidth = 0.9) +
add_confidence_interval(alpha = 0.08) +
add_risktable(risktable_height = 0.3) +
scale_color_manual(values = c(
"0" = clr$green, "1" = clr$blue,
"2" = clr$amber, "3" = clr$red
)) +
scale_fill_manual(values = c(
"0" = clr$green, "1" = clr$blue,
"2" = clr$amber, "3" = clr$red
)) +
scale_y_continuous(labels = percent_format(accuracy = 1), limits = c(0, 1)) +
labs(
x = "Days since enrolment",
y = "Survival probability",
color = "ECOG score",
fill = "ECOG score",
title = "Kaplan-Meier: Survival by ECOG performance score",
caption = "ECOG 0 = fully active · ECOG 3 = limited self-care · Shaded bands = 95% CI"
) +
theme_survival()lr_ecog <- survdiff(Surv(time, status) ~ ph.ecog, data = lung2)
p_ecog <- pchisq(lr_ecog$chisq,
df = length(unique(lung2$ph.ecog)) - 1,
lower.tail = FALSE)
cat("Log-rank test (ECOG): chi-sq =", round(lr_ecog$chisq, 2),
", df = 3, p =", format(p_ecog, scientific = TRUE, digits = 3), "\n")Log-rank test (ECOG): chi-sq = 21.96 , df = 3, p = 6.64e-05
Mistake 1 — Ignoring censoring
Computing survival probability or median by simply counting patients without accounting for censoring always biases results. KM exists precisely to avoid this. The bias goes in opposite directions depending on what you compute: naive survival probability is too high, naive median is too low.
Mistake 2 — Reading curves after the risk set is tiny
KM confidence intervals widen dramatically at late time points when few patients remain at risk. A survival estimate of \(\hat{S}(t) = 0.30\) with only 3 patients remaining has enormous uncertainty. Always examine the risk table before interpreting the tail of a KM curve.
Mistake 3 — Using the log-rank test when curves cross
The log-rank test equally weights all time points. When survival curves cross — meaning one group has higher early survival but lower late survival — the positive and negative differences cancel out and the test has low power. In this situation, weighted log-rank tests or restricted mean survival time (RMST) comparisons are more appropriate.
Mistake 4 — Treating KM as a model
KM estimates \(\hat{S}(t)\) but cannot adjust for confounders or quantify covariate effects. For that, you need the Cox proportional hazards model (Part 3). KM is purely descriptive.
In Part 3, we introduce the Cox proportional hazards model — estimating the hazard ratio between groups, adjusting for confounders, checking the proportional hazards assumption, and extending to time-varying covariates.
sessionInfo()R version 4.5.2 (2025-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 26200)
Matrix products: default
LAPACK version 3.12.1
locale:
[1] LC_COLLATE=English_India.utf8 LC_CTYPE=English_India.utf8
[3] LC_MONETARY=English_India.utf8 LC_NUMERIC=C
[5] LC_TIME=English_India.utf8
time zone: Asia/Calcutta
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] scales_1.4.0 patchwork_1.3.2 gtsummary_2.4.0 ggsurvfit_1.2.0
[5] survival_3.8-3 tidyr_1.3.1 dplyr_1.1.4 ggplot2_4.0.1
loaded via a namespace (and not attached):
[1] gt_1.1.0 sass_0.4.10 generics_0.1.4 xml2_1.4.1
[5] lattice_0.22-7 digest_0.6.38 magrittr_2.0.4 evaluate_1.0.5
[9] grid_4.5.2 RColorBrewer_1.1-3 cards_0.7.0 fastmap_1.2.0
[13] jsonlite_2.0.0 Matrix_1.7-4 cardx_0.3.0 backports_1.5.0
[17] purrr_1.2.0 cli_3.6.5 rlang_1.1.6 litedown_0.8
[21] commonmark_2.0.0 splines_4.5.2 base64enc_0.1-3 withr_3.0.2
[25] yaml_2.3.10 tools_4.5.2 broom_1.0.10 vctrs_0.6.5
[29] R6_2.6.1 lifecycle_1.0.4 fs_1.6.6 htmlwidgets_1.6.4
[33] pkgconfig_2.0.3 pillar_1.11.1 gtable_0.3.6 glue_1.8.0
[37] xfun_0.54 tibble_3.3.0 tidyselect_1.2.1 rstudioapi_0.17.1
[41] knitr_1.50 farver_2.1.2 htmltools_0.5.8.1 rmarkdown_2.30
[45] labeling_0.4.3 compiler_4.5.2 S7_0.2.1 markdown_2.0
Collett, D. (2015). Modelling Survival Data in Medical Research (3rd ed.). CRC Press.
Kaplan, E. L., & Meier, P. (1958). Nonparametric estimation from incomplete observations. Journal of the American Statistical Association, 53(282), 457–481.
Harrington, D. P., & Fleming, T. R. (1982). A class of rank test procedures for censored survival data. Biometrika, 69(3), 553–566.
Klein, J. P., & Moeschberger, M. L. (2003). Survival Analysis: Techniques for Censored and Truncated Data (2nd ed.). Springer.
Sjoberg, D. D. (2023). ggsurvfit: Flexible Time-to-Event Figures. R package. https://www.danieldsjoberg.com/ggsurvfit/
Therneau, T. M., & Grambsch, P. M. (2000). Modeling Survival Data: Extending the Cox Model. Springer.