Survival Analysis: Understanding and Visualizing Censoring

When the story isn’t finished — and why that’s okay

Introduction

When analyzing real-world events, we are often interested not just in if something happens, but when it happens. That is the core of survival analysis — a powerful statistical framework built for time-to-event data.

“Survival analysis isn’t about death. It’s about time — and what we do when time runs out before our data does.”

Despite its name, survival analysis is not limited to mortality. The “event” can be anything that occurs once in time:

🏥

Healthcare

Time until cancer recurrence, hospital readmission, or first treatment response

💼

HR Analytics

Time until an employee resigns, gets promoted, or changes roles

📱

Product & Growth

Time until a customer churns, upgrades, or becomes inactive

⚙️

Engineering

Time until a machine part fails, a system crashes, or maintenance is required

Why Not Use Standard Regression?

The survival function \(S(t)\) is defined as the probability that the event time \(T\) exceeds time \(t\):

\[S(t) = P(T > t), \quad t \geq 0\]

Standard regression methods cannot estimate this correctly in the presence of censoring:

| Method | Limitation |
|---|---|
| Linear regression | Assumes every outcome is fully observed; censored data directly violates this |
| Logistic regression | Can tell you if an event occurred, but not when; timing is discarded |
| Survival analysis | Handles censoring, unequal follow-up durations, and time-varying effects ✓ |

Survival analysis is the gold standard precisely because it incorporates censored observations into \(\hat{S}(t)\) rather than discarding them.
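
To make the definition of \(S(t)\) concrete, the base R sketch below evaluates it at a single time point on simulated data. The exponential rate, the uniform censoring window, and all variable names are assumptions chosen purely for illustration. It compares the closed-form value with the empirical proportion when every event time is observed, and with a naive proportion that treats censored times as if they were event times.

```r
set.seed(1)
n    <- 1e5
rate <- 0.1                          # assumed event rate (illustrative)
t0   <- 5                            # time point at which S(t) is evaluated

event_time  <- rexp(n, rate)         # true event times T
censor_time <- runif(n, 0, 20)       # independent censoring times C (assumed)
obs_time    <- pmin(event_time, censor_time)   # what we actually get to see

S_true  <- exp(-rate * t0)           # closed form for the exponential: P(T > t0)
S_full  <- mean(event_time > t0)     # only computable if every T is observed
S_naive <- mean(obs_time > t0)       # treats censored times as event times

round(c(true = S_true, fully_observed = S_full, naive = S_naive), 3)
```

Under these assumptions the naive proportion falls well below the true value, which is the gap that censoring-aware estimators are designed to close.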

This article is Part 1 of a series building toward full survival modelling in R:

Part 1 — Current

Censoring & Truncation: Understanding incomplete observation

Part 2

Kaplan-Meier: Non-parametric survival curves and log-rank tests

Part 3

Cox Regression: Hazard ratios, adjusted effects, and model assumptions


What Is Censoring?

The Unfinished Symphony

Imagine you’re conducting a study on how long people take to finish reading War and Peace. You give 100 people the book and check in after 6 months:

  • 40 people finished it — you know exactly when ✓
  • 30 people are still reading — they might finish tomorrow, or never
  • 20 people moved away — you lost contact entirely
  • 10 people admitted they gave up — but won’t say when

Welcome to censoring in its natural habitat.

The Simple Definition

Censoring occurs when we have incomplete information about when — or if — the event of interest occurred for a given individual.

Crucially, a censored observation still contributes information: we know that for this person, the event had not yet occurred up to the point of censoring. This partial information is what survival analysis is designed to use.
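
In R, this partial information is usually stored as a pair: the time observed and a status flag saying whether that time is an event or a censoring point. The following is a minimal sketch using `Surv()` from the survival package; the three readers and their values are made up for illustration.

```r
library(survival)

# Hypothetical readers: months observed and whether they finished the book
readers <- data.frame(
  reader   = c("A", "B", "C"),
  months   = c(2.5, 6, 4),
  finished = c(1, 0, 0)      # 1 = finished (event), 0 = censored
)

# Surv() bundles time and status into a single response object
s <- with(readers, Surv(months, finished))
print(s)                     # censored times are printed with a trailing "+"

# The underlying matrix keeps both pieces of information
status <- as.vector(unclass(s)[, "status"])
n_censored <- sum(status == 0)
```

Reader B contributes the statement "still reading at 6 months" rather than being dropped, which is exactly the partial information described above.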

Visualizing Individual Journeys

The swim-lane plot below shows 20 simulated participants. Each horizontal line represents one person’s observed follow-up. A filled circle marks an observed event; a triangle with a dashed arrow means we stopped watching before the event occurred.
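
The plotting code in this article calls a colour list `clr` and a theme function `theme_survival()` whose definitions are not shown in the text. To run the examples, a stand-in along the following lines should be enough. The hex values and theme settings are placeholders assumed here, not the author's originals, so the figures will look different.

```r
library(ggplot2)

# Hypothetical palette: names match the code in the article, values are placeholders
clr <- list(
  red    = "#C0392B",
  blue   = "#2E86C1",
  amber  = "#D68910",
  purple = "#7D3C98",
  green  = "#1E8449"
)

# Hypothetical theme: a light wrapper around theme_minimal()
theme_survival <- function(base_size = 12) {
  theme_minimal(base_size = base_size) +
    theme(
      plot.title      = element_text(face = "bold"),
      plot.subtitle   = element_text(color = "gray40"),
      legend.position = "bottom"
    )
}
```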

library(dplyr)    # mutate(), case_when(), filter()
library(ggplot2)  # plotting

set.seed(123)
n <- 20

study_data <- data.frame(
  subject  = 1:n,
  category = rep(c("Observed Event", "Still Ongoing",
                   "Lost to Follow-up", "Study Ended"), each = 5)
) |>
  mutate(
    true_event_time = case_when(
      category == "Observed Event"    ~ runif(n(), 2, 10),
      category == "Still Ongoing"     ~ runif(n(), 15, 25),
      category == "Lost to Follow-up" ~ runif(n(), 12, 20),
      category == "Study Ended"       ~ runif(n(), 11, 18)
    ),
    observed_time = case_when(
      category == "Observed Event"    ~ true_event_time,
      category == "Still Ongoing"     ~ 12,
      category == "Lost to Follow-up" ~ runif(n(), 3, 9),
      category == "Study Ended"       ~ 12
    ),
    status = ifelse(category == "Observed Event", "Event", "Censored"),
    label  = paste0("P", sprintf("%02d", subject))
  )

pal <- c(
  "Observed Event"    = clr$red,
  "Still Ongoing"     = clr$blue,
  "Lost to Follow-up" = clr$amber,
  "Study Ended"       = clr$purple
)

ggplot(study_data, aes(y = reorder(label, subject))) +
  geom_segment(
    aes(x = 0, xend = observed_time,
        yend = reorder(label, subject)),
    linewidth = 2.2, color = "gray78", lineend = "round"
  ) +
  geom_point(
    data = filter(study_data, status == "Event"),
    aes(x = observed_time, color = category),
    size = 4, shape = 19
  ) +
  geom_point(
    data = filter(study_data, status == "Censored"),
    aes(x = observed_time, color = category),
    size = 4, shape = 17
  ) +
  geom_segment(
    data = filter(study_data, status == "Censored"),
    aes(x = observed_time, xend = observed_time + 0.7,
        yend = reorder(label, subject), color = category),
    arrow = arrow(length = unit(0.18, "cm"), type = "open"),
    linetype = "dashed", linewidth = 0.7, alpha = 0.65
  ) +
  geom_vline(xintercept = 12, linetype = "dotted",
             color = clr$red, linewidth = 0.8, alpha = 0.7) +
  annotate("text", x = 12.15, y = 1.5,
           label = "Study ends", angle = 90, hjust = 0,
           color = clr$red, size = 3.2, fontface = "italic") +
  scale_color_manual(values = pal) +
  scale_x_continuous(limits = c(0, 14), breaks = seq(0, 14, 2)) +
  labs(
    x     = "Time (months)", y = NULL,
    title = "Individual Journeys Through the Study",
    subtitle = "Each row is one participant — observed, censored, or still at risk",
    color = "Reason for exit"
  ) +
  theme_survival() +
  theme(panel.grid.major.y = element_line(color = "gray93"))

Figure 1: Swim-lane plot of 20 simulated participants. Filled circles (●) indicate observed events; filled triangles (▲) with dashed arrows indicate censored observations. The red dotted vertical line marks study closure at 12 months.

Every censored row carries a message: “the event had not happened up to this moment.” Discarding these rows would throw away real information and bias \(\hat{S}(t)\) downward, because only participants who had the event, typically those with shorter times, would remain.
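
The cost of discarding censored rows can be checked with a short simulation. This is a sketch under assumed settings (exponential event times with rate 0.1, uniform censoring on 0 to 20, hypothetical variable names): it fits a Kaplan-Meier curve to all rows and another to the event rows only, then compares the estimated medians.

```r
library(survival)

set.seed(2)
n <- 2000
event_time  <- rexp(n, rate = 0.1)       # assumed true event times
censor_time <- runif(n, 0, 20)           # assumed independent censoring
time   <- pmin(event_time, censor_time)
status <- as.integer(event_time <= censor_time)

# All rows, with censoring handled by the estimator
km_all  <- survfit(Surv(time, status) ~ 1)
# Censored rows thrown away: only participants with an observed event remain
keep    <- status == 1
km_drop <- survfit(Surv(time[keep], status[keep]) ~ 1)

med_all  <- unname(summary(km_all)$table["median"])
med_drop <- unname(summary(km_drop)$table["median"])
med_true <- log(2) / 0.1                 # median of the assumed exponential

round(c(true = med_true, all_rows = med_all, events_only = med_drop), 2)
```

With these settings the events-only median is noticeably shorter than the true median, while the estimate that keeps censored rows stays close to it.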


The Three Types of Censoring

🕐

Right Censoring

“It hasn’t happened yet…”

The most common type. The event has not occurred by the last observation but may occur later.

  • 🏥 Patient still cancer-free after 5 years of follow-up
  • 💼 Employee still with the company at study close
  • 📱 User still active after 30 days of monitoring

Left Censoring

“It already happened — but when?”

The event occurred before observation began, but the exact time is unknown.

  • 🦷 Cavity present at first dental checkup — onset unknown
  • 🚬 Patient already smoking when enrolled — start date unknown
  • 🏠 Termites found at inspection — duration unknown

🌓

Interval Censoring

“Somewhere between two visits…”

The event occurred between two known check-in times, but the exact moment is unknown.

  • 🔬 Tumour absent in January, present in July
  • 🎓 Child unable to read in September, able by June
  • 🚗 Tire intact at 30,000 miles, flat at 35,000 miles
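
The three types can be written in one common encoding. The sketch below uses `Surv()` with `type = "interval2"`, where each observation is given as a pair of bounds and `NA` marks an open side. The example values loosely echo the schematic cases and are illustrative only.

```r
library(survival)

# left  = last time the subject was known to be event-free
# right = first time the event was known to have occurred
obs <- data.frame(
  kind  = c("right-censored", "left-censored", "interval-censored", "exact"),
  left  = c(8,  NA, 2, 5),
  right = c(NA, 3,  6, 5)
)

s <- Surv(obs$left, obs$right, type = "interval2")
print(s)

# Internal status codes used by Surv() for interval-type data:
# 0 = right censored, 1 = exact event, 2 = left censored, 3 = interval censored
status_code <- as.vector(unclass(s)[, "status"])
data.frame(kind = obs$kind, status_code)
```

A single response object of this kind can mix all four situations, which is convenient when a dataset contains more than one censoring type.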

Visualizing the Three Types

library(patchwork)  # `/` panel stacking and plot_annotation()

mk_panel <- function(title, subtitle, col) {
  function(seg, pts, ann) {
    ggplot() +
      geom_segment(
        data = seg,
        aes(x = x, xend = xend, y = y, yend = y,
            linetype = ltype, linewidth = lw, alpha = al),
        color = col, lineend = "round", show.legend = FALSE
      ) +
      geom_point(data = pts, aes(x = x, y = y, shape = sh),
                 color = col, size = 5.5, show.legend = FALSE) +
      geom_text(data = ann,
                aes(x = x, y = y, label = label, hjust = hjust),
                size = 3.4, color = "gray38", fontface = "italic") +
      scale_linetype_identity() + scale_linewidth_identity() +
      scale_alpha_identity() + scale_shape_identity() +
      xlim(-0.5, 10.5) + ylim(0.4, 1.6) +
      labs(title = title, subtitle = subtitle) +
      theme_void() +
      theme(
        plot.title    = element_text(family = "serif", face = "bold",
                                     size = 12, hjust = 0.5, color = col,
                                     margin = margin(b = 3)),
        plot.subtitle = element_text(family = "serif", size = 9.5,
                                     hjust = 0.5, color = "gray50",
                                     face = "italic", margin = margin(b = 6))
      )
  }
}

# Right censoring
p_right <- mk_panel("Right Censoring", '"Hasn\'t happened yet..."', clr$blue)(
  data.frame(x=c(0,8), xend=c(8,10.3), y=1,
             ltype=c("solid","dashed"), lw=c(2.5,1.5), al=c(1,0.4)),
  data.frame(x=8, y=1, sh=17),
  data.frame(x=c(4,9.2), y=c(1.25,1.25),
             label=c("Observed period","Unknown future"), hjust=c(0.5,0.5))
)

# Left censoring
p_left <- mk_panel("Left Censoring", '"Already happened — but when?"', clr$red)(
  data.frame(x=c(0,3), xend=c(3,8.5), y=1,
             ltype=c("dashed","solid"), lw=c(1.5,2.5), al=c(0.4,1)),
  data.frame(x=c(1.5,3), y=1, sh=c(4,15)),
  data.frame(x=c(1.5,5.8,1.5), y=c(1.25,1.25,0.72),
             label=c("Unknown past","Observed period","?"), hjust=c(0.5,0.5,0.5))
)

# Interval censoring
p_int <- mk_panel("Interval Censoring", '"Somewhere between two check-ups..."', clr$purple)(
  data.frame(x=c(0,2,6), xend=c(2,6,8.5), y=1,
             ltype=c("solid","dashed","solid"), lw=c(2.5,1.5,2.5), al=c(1,0.4,1)),
  data.frame(x=c(2,4,6), y=1, sh=c(1,4,16)),
  data.frame(x=c(1,4,7.2,4), y=c(1.25,1.25,1.25,0.72),
             label=c("Event-free","Somewhere here","Event confirmed","?"),
             hjust=c(0.5,0.5,0.5,0.5))
)

(p_right / p_left / p_int) +
  plot_annotation(
    title   = "The Three Faces of Censoring",
    caption = "Solid bar = observed period · Dashed = uncertain zone · Shapes mark event/censoring points"
  ) & theme(plot.title = element_text(family = "serif", face = "bold",
                                       size = 14, hjust = 0.5))
Figure 2: Schematic of the three censoring mechanisms. Solid bars show the observed period; dashed segments mark the uncertain zone. Shapes distinguish event onset, censoring points, and confirmed detections.

Left Censoring vs Left Truncation

Frequently Confused in Practice — Know the Difference

Left censoring and left truncation are not the same concept, yet they are regularly conflated, even in peer-reviewed literature and real-world evidence (RWE) discussions. Getting this distinction right signals genuine technical maturity.

Left Censoring

Event happened — time unknown

The individual experienced the event, but we don’t know when. They are still in our dataset; we just lack a precise event time.

Classic example: A patient already had a cough when they enrolled in a respiratory study. The cough onset is left-censored — somewhere before day 0 of the study.

Left Truncation (Late Entry)

Only survivors enter — selection bias

An individual only appears in our data because they had not yet experienced the event by the time observation began. Those who had the event earlier are completely absent — they never entered the study.

Classic example: A retirement home study enrolls residents who must be alive and resident at age 65. Those who died before 65 are never seen. This creates a left-truncated sample — systematically biased toward long survivors.

Why Truncation Is a Bigger Problem

Left censoring gives us partial information — we know the event happened, just not when. Left truncation removes people entirely from the sample before we even see them, creating a survivorship selection bias that standard methods cannot correct for without explicit adjustment.

| Feature | Left Censoring | Left Truncation |
|---|---|---|
| Person enters the dataset | ✓ Yes | ✓ Yes (conditionally) |
| Event time known | ✗ No | ✓ Yes (if observed later) |
| Partial information available | ✓ Yes | ✗ No (pre-entry history lost) |
| Biases \(\hat{S}(t)\) if ignored | Mild | Severe (upward bias) |
| Handled in R | Surv(time, event, type = "left") | Surv(entry, time, event) |

Handling Left Truncation in R

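When individuals enter the study late (i.e., they have already been at risk for some time before the study window begins), use the three-argument form of Surv(). In the example below the time scale is age, so the reported median is a median age at event rather than a duration since enrollment: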

# entry = age entered study, time = age at event/censor
# Only subjects who survived to their entry age are observed
# Surv(entry_time, event_time, status) correctly adjusts the risk set

library(survival)

# Example: retirement home study — entry at age 65+
set.seed(42)
n <- 200
entry_age   <- runif(n, 65, 75)           # must have survived to entry
event_age   <- entry_age + rexp(n, 0.08)  # time-to-event after entry
censor_age  <- entry_age + runif(n, 0, 20)
obs_age     <- pmin(event_age, censor_age)
status      <- as.integer(event_age <= censor_age)

# Three-argument Surv() handles left truncation correctly
km_truncated <- survfit(Surv(entry_age, obs_age, status) ~ 1)
cat("Median survival (left-truncation adjusted):",
    round(summary(km_truncated)$table["median"], 1), "years\n")
Median survival (left-truncation adjusted): 75.6 years
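
To see what the adjustment buys, the following sketch simulates a whole population, keeps only those who survive to their entry age, and compares a naive fit that ignores entry age with the three-argument fit. The Weibull lifetime distribution, the entry-age window, and the variable names are assumptions made for this illustration.

```r
library(survival)

set.seed(3)
n_pop    <- 20000
lifetime <- rweibull(n_pop, shape = 5, scale = 80)  # age at death, whole population
entry    <- runif(n_pop, 60, 85)                    # age at which each person would enroll
seen     <- lifetime > entry                        # only survivors to entry are sampled

age_in  <- entry[seen]
age_out <- lifetime[seen]
died    <- rep(1L, sum(seen))    # no right censoring here, to isolate truncation

km_naive    <- survfit(Surv(age_out, died) ~ 1)           # ignores late entry
km_adjusted <- survfit(Surv(age_in, age_out, died) ~ 1)   # risk set adjusted for entry age

med_naive    <- unname(summary(km_naive)$table["median"])
med_adjusted <- unname(summary(km_adjusted)$table["median"])

# Median age at death among people alive at 60, from the assumed Weibull
S60      <- pweibull(60, shape = 5, scale = 80, lower.tail = FALSE)
med_true <- qweibull(1 - 0.5 * S60, shape = 5, scale = 80)

round(c(true = med_true, adjusted = med_adjusted, naive = med_naive), 1)
```

In this setup the naive fit overstates the median age at death by several years, because people with long lifetimes are more likely to be alive at their entry age and so are over-represented in the sample.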

Why Censoring Mechanisms Matter

Not all censoring is created equal. The reason someone exits observation — not just the fact that they did — determines whether your survival estimates remain valid.

The Hazard Function

Alongside \(S(t)\), a second fundamental quantity is the hazard function \(h(t)\) — the instantaneous rate at which the event occurs given survival to time \(t\):

\[h(t) = \lim_{\Delta t \to 0} \frac{P(t \leq T < t + \Delta t \mid T \geq t)}{\Delta t}\]

The hazard and survival functions are related by:

\[S(t) = \exp\!\left(-\int_0^t h(u)\, du\right)\]

Informative censoring distorts our estimate of \(h(t)\), which then propagates bias into \(\hat{S}(t)\).
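
The relation between \(h(t)\) and \(S(t)\) can be checked numerically. The sketch below assumes a Weibull hazard with shape 2 and scale 10 (values chosen only for illustration), integrates it with base R's `integrate()`, and compares the result with the survival probability R computes directly.

```r
shape <- 2     # assumed Weibull shape (illustrative)
scale <- 10    # assumed Weibull scale (illustrative)
t0    <- 7     # time at which S(t) is evaluated

# Weibull hazard: h(u) = (shape / scale) * (u / scale)^(shape - 1)
hazard <- function(u) (shape / scale) * (u / scale)^(shape - 1)

# Cumulative hazard: integral of h(u) from 0 to t0
cum_hazard <- integrate(hazard, lower = 0, upper = t0)$value

S_from_hazard <- exp(-cum_hazard)                                   # S(t) = exp(-integral of h)
S_direct      <- pweibull(t0, shape = shape, scale = scale,
                          lower.tail = FALSE)                       # P(T > t0)

c(from_hazard = S_from_hazard, direct = S_direct)
```

The two values agree up to numerical integration error, so any distortion in the estimated hazard carries straight through to the estimated survival curve.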

Non-Informative vs Informative Censoring

| Type | Definition | Examples | Impact on \(\hat{S}(t)\) |
|---|---|---|---|
| Non-informative | Censoring is independent of the event risk | Study ends on a pre-set calendar date; patient relocates for non-medical reasons; device battery dies | ✓ Unbiased; standard methods valid |
| Informative | Censoring is associated with the risk of the event | Sickest patients drop out because they are too ill to attend; healthiest patients stop follow-up because they feel cured | ⚠ Biased; \(\hat{S}(t)\) can be severely distorted |

The Core Assumption of Standard Survival Methods

Both the Kaplan-Meier estimator and the Cox proportional hazards model assume non-informative (independent) censoring: the censoring time \(C\) is independent of the true event time \(T\), conditional on any covariates in the model.

\[T \perp C \mid \mathbf{X}\]

Violating this without correction leads to biased estimates of \(S(t)\) and \(h(t)\).

Simulating the Bias in \(\hat{S}(t)\)

library(scales)  # percent_format()

set.seed(456)
n_sim <- 500

# True event times: Exp(rate = 0.1) → median survival ≈ 6.9 time units
true_times <- rexp(n_sim, rate = 0.1)

# Scenario 1: Non-informative — random censoring unrelated to event risk
rand_censor      <- runif(n_sim, 0, 20)
obs_random       <- pmin(true_times, rand_censor)
status_random    <- as.integer(true_times <= rand_censor)

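# Scenario 2: Informative. As coded, dropout probability rises with true_times,
# so healthier individuals (longer true_times) are the ones who tend to drop out early
prob_dropout     <- plogis(0.5 * (true_times - 10))  # sigmoid, increasing in true_times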
inform_censor    <- ifelse(runif(n_sim) < prob_dropout,
                           runif(n_sim, 0, 5), 20)
obs_inform       <- pmin(true_times, inform_censor)
status_inform    <- as.integer(true_times <= inform_censor)

km_rand   <- survfit(Surv(obs_random, status_random) ~ 1)
km_inform <- survfit(Surv(obs_inform, status_inform) ~ 1)

plot_df <- bind_rows(
  tibble(time = km_rand$time, surv = km_rand$surv,
         upper = km_rand$upper, lower = km_rand$lower,
         scenario = "Non-informative censoring"),
  tibble(time = km_inform$time, surv = km_inform$surv,
         upper = km_inform$upper, lower = km_inform$lower,
         scenario = "Informative censoring")
)

ggplot(plot_df, aes(x = time, y = surv, color = scenario, fill = scenario)) +
  geom_ribbon(aes(ymin = lower, ymax = upper), alpha = 0.11, color = NA) +
  geom_step(linewidth = 1.3) +
  geom_hline(yintercept = 0.5, linetype = "dotted", color = "gray55") +
  annotate("text", x = 19.2, y = 0.515, label = "S(t) = 0.50",
           size = 3.1, color = "gray45", fontface = "italic", hjust = 1) +
  annotate("segment",
           x = 11.5, xend = 11.5, y = 0.55, yend = 0.72,
           arrow = arrow(length = unit(0.18, "cm"), ends = "both"),
           color = clr$red, linewidth = 0.8) +
  annotate("text", x = 13, y = 0.635,
           label = "Bias in\n\u015c(t)",
           color = clr$red, size = 3.2, fontface = "italic", hjust = 0) +
  scale_color_manual(values = c(
    "Non-informative censoring" = clr$green,
    "Informative censoring"     = clr$red
  )) +
  scale_fill_manual(values = c(
    "Non-informative censoring" = clr$green,
    "Informative censoring"     = clr$red
  )) +
  scale_y_continuous(labels = percent_format(accuracy = 1), limits = c(0, 1)) +
  scale_x_continuous(breaks = seq(0, 20, 5)) +
  labs(
    x        = "Time",
    y        = expression("Estimated survival probability " * hat(S)(t)),
    title    = "The Hidden Danger of Informative Censoring",
    subtitle = "Identical true survival times — very different estimated curves",
    color    = "Censoring mechanism",
    caption  = "Shaded bands = 95% pointwise CI · True distribution: Exp(rate = 0.1) · n = 500"
  ) +
  theme_survival() +
  guides(fill = "none")
Figure 3: Two Kaplan-Meier curves estimated from identical underlying survival distributions, differing only in the censoring mechanism. The gap between them is entirely attributable to informative censoring — not any real difference in survival. Shaded bands show 95% confidence intervals.

The gap between the two curves is entirely artefactual — it reflects bias in \(\hat{S}(t)\) introduced by the censoring mechanism, not any real difference in survival.


Exploring and Reporting Censoring

Characterise Your Data First

Before fitting any model, summarise your censoring. Here we simulate a two-arm clinical dataset and produce a compact descriptive table:

set.seed(789)
n_pts <- 300

sim_data <- tibble(
  patient_id  = 1:n_pts,
  group       = sample(c("Treatment", "Control"), n_pts, replace = TRUE),
  true_time   = rexp(n_pts, rate = ifelse(group == "Treatment", 0.08, 0.12)),
  censor_time = runif(n_pts, 5, 25),
  obs_time    = pmin(true_time, censor_time),
  event       = as.integer(true_time <= censor_time)
)

sim_data |>
  group_by(Group = group) |>
  summarise(
    N                           = n(),
    `Events, n (%)`             = paste0(sum(event), " (",
                                          round(mean(event) * 100, 1), "%)"),
    `Censored, n (%)`           = paste0(sum(1 - event), " (",
                                          round(mean(1 - event) * 100, 1), "%)"),
    `Median follow-up (months)` = round(median(obs_time), 1),
    `IQR`                       = paste0(round(quantile(obs_time, 0.25), 1),
                                          "–",
                                          round(quantile(obs_time, 0.75), 1))
  )

Table 1: Censoring summary by treatment group. Always report this table before any survival model output.

| Group | N | Events, n (%) | Censored, n (%) | Median follow-up (months) | IQR |
|---|---|---|---|---|---|
| Control | 163 | 133 (81.6%) | 30 (18.4%) | 6.0 | 3.1–10.6 |
| Treatment | 137 | 96 (70.1%) | 41 (29.9%) | 7.5 | 3.2–11.5 |

When Are Patients Being Censored?

Heavy early censoring is a red flag for informative dropout:

sim_data |>
  filter(event == 0) |>
  ggplot(aes(x = obs_time, fill = group)) +
  geom_histogram(bins = 25, alpha = 0.72, position = "identity",
                 color = "white", linewidth = 0.3) +
  scale_fill_manual(values = c("Treatment" = clr$blue, "Control" = clr$amber)) +
  labs(
    x       = "Time at censoring (months)",
    y       = "Number of censored patients",
    fill    = "Group",
    title   = "When Are Patients Being Censored?",
    subtitle = "Unexpected early spikes may signal informative censoring",
    caption  = "Only censored observations shown"
  ) +
  theme_survival()
Figure 4: Distribution of censoring times by group. An unexpected spike of early censoring — relative to the event distribution — may indicate informative censoring requiring sensitivity analysis.

Estimating Median Follow-up

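Use the reverse Kaplan-Meier method: treat censored observations as events and events as censored. This estimates potential follow-up, that is, how long patients would have remained under observation had the event not intervened, whereas the raw median of observed times is pulled down by early events: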

km_followup <- survfit(Surv(obs_time, 1 - event) ~ 1, data = sim_data)
cat("Median follow-up (reverse KM):",
    round(summary(km_followup)$table["median"], 1), "months\n")
Median follow-up (reverse KM): 16.3 months
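
The same reversal can be used to compare follow-up between arms. The sketch below is one possible approach, not a method taken from the article: it simulates two arms with deliberately different censoring windows (an assumption for illustration), computes the reverse Kaplan-Meier median per arm, and applies a log-rank test to the censoring process.

```r
library(survival)

set.seed(5)
n     <- 400
group <- rep(c("Control", "Treatment"), each = n / 2)
event_time  <- rexp(n, rate = ifelse(group == "Treatment", 0.08, 0.12))
# Hypothetical imbalance: the control arm is followed over a shorter window
censor_time <- ifelse(group == "Control", runif(n, 2, 15), runif(n, 5, 25))
obs_time <- pmin(event_time, censor_time)
event    <- as.integer(event_time <= censor_time)

# Reverse KM per arm: censoring plays the role of the event
rkm <- survfit(Surv(obs_time, 1 - event) ~ group)
med_followup <- summary(rkm)$table[, "median"]
print(round(med_followup, 1))

# Log-rank test on the censoring process (1 degree of freedom for two arms)
lr      <- survdiff(Surv(obs_time, 1 - event) ~ group)
p_value <- pchisq(lr$chisq, df = 1, lower.tail = FALSE)
```

A clear difference in follow-up between arms does not prove that censoring is informative, but it is a prompt to look at the reasons for censoring in each arm.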

Reporting Checklist

  • Number and percentage of censored observations, overall and by key subgroups
  • Primary reasons for censoring (end of study, withdrawal, loss to follow-up)
  • Median and range of follow-up time (reverse KM method preferred)
  • Assessment of whether censoring patterns differ across groups
  • Explicit statement on whether censoring is assumed non-informative
  • Sensitivity analyses if informative censoring is suspected

Summary

Key Takeaways

  • Censoring is not missing data — it is partially observed data that still carries information about \(S(t)\)
  • Right censoring is the most common form; standard methods are designed for it
  • Left censoring means the event happened before observation began — we lack a precise time
  • Left truncation is fundamentally different: entire individuals are absent from the data because they didn’t survive to enter the study — it creates survivorship selection bias
  • The critical assumption is \(T \perp C \mid \mathbf{X}\): censoring must be independent of the event risk given covariates
  • Always visualize censoring patterns before modelling; early spikes warrant investigation
  • Report transparently: number censored, reasons, timing, median follow-up, and any sensitivity analyses

Up Next in the Series

In Part 2, we build the Kaplan-Meier estimator \(\hat{S}(t)\) from first principles — deriving the step-function formula, interpreting confidence intervals, comparing groups with the log-rank test, and understanding the at-risk table.


Session Info

sessionInfo()
R version 4.5.2 (2025-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 26200)

Matrix products: default
  LAPACK version 3.12.1

locale:
[1] LC_COLLATE=English_India.utf8  LC_CTYPE=English_India.utf8   
[3] LC_MONETARY=English_India.utf8 LC_NUMERIC=C                  
[5] LC_TIME=English_India.utf8    

time zone: Asia/Calcutta
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] scales_1.4.0    patchwork_1.3.2 survminer_0.5.1 ggpubr_0.6.2   
[5] survival_3.8-3  tidyr_1.3.1     dplyr_1.1.4     ggplot2_4.0.1  

loaded via a namespace (and not attached):
 [1] generics_0.1.4     rstatix_0.7.3      lattice_0.22-7     digest_0.6.38     
 [5] magrittr_2.0.4     evaluate_1.0.5     grid_4.5.2         RColorBrewer_1.1-3
 [9] fastmap_1.2.0      jsonlite_2.0.0     Matrix_1.7-4       backports_1.5.0   
[13] Formula_1.2-5      gridExtra_2.3      purrr_1.2.0        abind_1.4-8       
[17] cli_3.6.5          KMsurv_0.1-6       rlang_1.1.6        splines_4.5.2     
[21] withr_3.0.2        yaml_2.3.10        tools_4.5.2        ggsignif_0.6.4    
[25] km.ci_0.5-6        broom_1.0.10       vctrs_0.6.5        R6_2.6.1          
[29] zoo_1.8-14         lifecycle_1.0.4    car_3.1-3          htmlwidgets_1.6.4 
[33] pkgconfig_2.0.3    pillar_1.11.1      gtable_0.3.6       glue_1.8.0        
[37] data.table_1.17.8  xfun_0.54          tibble_3.3.0       tidyselect_1.2.1  
[41] rstudioapi_0.17.1  knitr_1.50         farver_2.1.2       xtable_1.8-4      
[45] survMisc_0.5.6     htmltools_0.5.8.1  labeling_0.4.3     rmarkdown_2.30    
[49] carData_3.0-5      compiler_4.5.2     S7_0.2.1          
