When analyzing real-world events, we are often interested not just in if something happens, but when it happens. That is the core of survival analysis — a powerful statistical framework built for time-to-event data.
“Survival analysis isn’t about death. It’s about time — and what we do when time runs out before our data does.”
Despite its name, survival analysis is not limited to mortality. The “event” can be anything that occurs once in time:
🏥 Healthcare: time until cancer recurrence, hospital readmission, or first treatment response
💼 HR Analytics: time until an employee resigns, gets promoted, or changes roles
📱 Product & Growth: time until a customer churns, upgrades, or becomes inactive
⚙️ Engineering: time until a machine part fails, a system crashes, or maintenance is required
Why Not Use Standard Regression?
The survival function \(S(t)\) is defined as the probability that the event time \(T\) exceeds time \(t\):
\[S(t) = P(T > t), \quad t \geq 0\]
Standard regression methods cannot estimate this correctly in the presence of censoring:
| Method | Limitation |
|---|---|
| Linear regression | Assumes every outcome is fully observed; censored data directly violates this |
| Logistic regression | Can tell you if an event occurred, but not when; timing is discarded |
| Survival analysis | Handles censoring, unequal follow-up durations, and time-varying effects ✓ |
Survival analysis is the gold standard precisely because it incorporates censored observations into \(\hat{S}(t)\) rather than discarding them.
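To make the point concrete, here is a minimal simulation sketch, not part of the original analysis. It assumes exponential event times and independent uniform censoring (both choices are illustrative) and compares two naive estimates of \(S(5)\) with the Kaplan-Meier estimate from the survival package.

```r
library(survival)

set.seed(1)
n <- 2000
true_time   <- rexp(n, rate = 0.1)           # true event times, S(t) = exp(-0.1 t)
censor_time <- runif(n, 0, 15)               # independent censoring times (assumption)
obs_time    <- pmin(true_time, censor_time)  # what we actually get to record
status      <- as.integer(true_time <= censor_time)  # 1 = event, 0 = censored

t0     <- 5
true_S <- exp(-0.1 * t0)

# Naive approach 1: drop censored rows, then take the empirical proportion
naive_drop <- mean(obs_time[status == 1] > t0)

# Naive approach 2: pretend every censoring time is an event time
naive_all <- mean(obs_time > t0)

# Kaplan-Meier: censored rows stay in the risk set until they leave observation
km   <- survfit(Surv(obs_time, status) ~ 1)
km_S <- summary(km, times = t0)$surv

round(c(truth = true_S, drop_censored = naive_drop,
        censored_as_event = naive_all, kaplan_meier = km_S), 3)
```

Under these assumptions both naive figures should land well below the true value, while the Kaplan-Meier estimate should sit close to it. The exact numbers depend on the seed and on the chosen distributions.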
This article is Part 1 of a series building toward full survival modelling in R:
Part 2: Kaplan-Meier (non-parametric survival curves and log-rank tests)
Part 3: Cox Regression (hazard ratios, adjusted effects, and model assumptions)
What Is Censoring?
The Unfinished Symphony
Imagine you’re conducting a study on how long people take to finish reading War and Peace. You give 100 people the book and check in after 6 months:
40 people finished it — you know exactly when ✓
30 people are still reading — they might finish tomorrow, or never
20 people moved away — you lost contact entirely
10 people admitted they gave up — but won’t say when
Welcome to censoring in its natural habitat.
The Simple Definition
Censoring occurs when we have incomplete information about when — or if — the event of interest occurred for a given individual.
Crucially, a censored observation still contributes information: we know that for this person, the event had not yet occurred up to the point of censoring. This partial information is what survival analysis is designed to use.
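In R this partial information is usually stored as a pair: the time observed and a status flag. The sketch below uses four made-up records from the reading study (the values and the `reason` column are hypothetical) and bundles them with `Surv()` from the survival package.

```r
library(survival)

# Hypothetical records: months observed and the reason observation stopped
readers <- data.frame(
  months = c(2.5, 4.0, 6.0, 3.0),
  reason = c("finished", "finished", "still reading", "moved away")
)

# status = 1 when the event (finishing the book) was observed, 0 when censored
readers$status <- as.integer(readers$reason == "finished")

# Surv() keeps time and status together; censored times print with a "+"
reading_surv <- Surv(readers$months, readers$status)
print(reading_surv)
```

The censored readers are not dropped. Their rows say "still event-free at 6.0 and 3.0 months", which is exactly the information a survival model uses.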
Visualizing Individual Journeys
The swim-lane plot below shows 20 simulated participants. Each horizontal line represents one person’s observed follow-up. A filled circle marks an observed event; a triangle with a dashed arrow means we stopped watching before the event occurred.
Code
library(dplyr)
library(ggplot2)
# clr (colour list) and theme_survival() are assumed to come from the article's setup code

set.seed(123)
n <- 20

# Simulate 20 participants, five per exit category
study_data <- data.frame(
  subject  = 1:n,
  category = rep(c("Observed Event", "Still Ongoing",
                   "Lost to Follow-up", "Study Ended"), each = 5)
) |>
  mutate(
    true_event_time = case_when(
      category == "Observed Event"    ~ runif(n(), 2, 10),
      category == "Still Ongoing"     ~ runif(n(), 15, 25),
      category == "Lost to Follow-up" ~ runif(n(), 12, 20),
      category == "Study Ended"       ~ runif(n(), 11, 18)
    ),
    # What we actually observe: the event time, or the time we stopped watching
    observed_time = case_when(
      category == "Observed Event"    ~ true_event_time,
      category == "Still Ongoing"     ~ 12,
      category == "Lost to Follow-up" ~ runif(n(), 3, 9),
      category == "Study Ended"       ~ 12
    ),
    status = ifelse(category == "Observed Event", "Event", "Censored"),
    label  = paste0("P", sprintf("%02d", subject))
  )

pal <- c("Observed Event"    = clr$red,
         "Still Ongoing"     = clr$blue,
         "Lost to Follow-up" = clr$amber,
         "Study Ended"       = clr$purple)

ggplot(study_data, aes(y = reorder(label, subject))) +
  # Grey bar: follow-up from time 0 to the last observation
  geom_segment(
    aes(x = 0, xend = observed_time, yend = reorder(label, subject)),
    linewidth = 2.2, color = "gray78", lineend = "round"
  ) +
  # Filled circle: observed event
  geom_point(
    data = filter(study_data, status == "Event"),
    aes(x = observed_time, color = category),
    size = 4, shape = 19
  ) +
  # Triangle plus dashed arrow: censored observation
  geom_point(
    data = filter(study_data, status == "Censored"),
    aes(x = observed_time, color = category),
    size = 4, shape = 17
  ) +
  geom_segment(
    data = filter(study_data, status == "Censored"),
    aes(x = observed_time, xend = observed_time + 0.7,
        yend = reorder(label, subject), color = category),
    arrow = arrow(length = unit(0.18, "cm"), type = "open"),
    linetype = "dashed", linewidth = 0.7, alpha = 0.65
  ) +
  geom_vline(xintercept = 12, linetype = "dotted",
             color = clr$red, linewidth = 0.8, alpha = 0.7) +
  annotate("text", x = 12.15, y = 1.5,
           label = "Study ends", angle = 90, hjust = 0,
           color = clr$red, size = 3.2, fontface = "italic") +
  scale_color_manual(values = pal) +
  scale_x_continuous(limits = c(0, 14), breaks = seq(0, 14, 2)) +
  labs(
    x = "Time (months)", y = NULL,
    title = "Individual Journeys Through the Study",
    subtitle = "Each row is one participant — observed, censored, or still at risk",
    color = "Reason for exit"
  ) +
  theme_survival() +
  theme(panel.grid.major.y = element_line(color = "gray93"))
Figure 1: Swim-lane plot of 20 simulated participants. Filled circles (●) indicate observed events; triangles with arrows (▷) indicate censored observations. The red dotted vertical line marks study closure.
Every censored row carries a message: “the event had not happened up to this moment.” Discarding these rows would throw away real information and bias \(\hat{S}(t)\) downward, because the rows that remain are disproportionately those whose events came early.
The Three Types of Censoring
🕐
Right Censoring
“It hasn’t happened yet…”
The most common type. The event has not occurred by last observation but may occur later.
🏥 Patient still cancer-free after 5 years of follow-up
💼 Employee still with the company at study close
📱 User still active after 30 days of monitoring
⏳
Left Censoring
“It already happened — but when?”
The event occurred before observation began, but the exact time is unknown.
🦷 Cavity present at first dental checkup — onset unknown
🚬 Patient already smoking when enrolled — start date unknown
🏠 Termites found at inspection — duration unknown
🌓
Interval Censoring
“Somewhere between two visits…”
The event occurred between two known check-in times, but the exact moment is unknown.
🔬 Tumour absent in January, present in July
🎓 Child unable to read in September, able by June
🚗 Tire intact at 30,000 miles, flat at 35,000 miles
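The three types map onto different encodings in the survival package's `Surv()` constructor. The sketch below uses made-up times to show one way each type can be written down. The values are illustrative, and which model functions accept each type varies, so check the documentation before fitting.

```r
library(survival)

# Right censoring: status 0 means "event not yet seen at this time"
right <- Surv(time = c(5, 8, 12), event = c(1, 0, 0))

# Left censoring: status 0 means "event had already happened by this time"
left <- Surv(time = c(3, 6, 9), event = c(0, 1, 1), type = "left")

# Interval censoring: the event lies between lower and upper;
# an NA bound marks an open end
lower <- c(1, 4, NA, 7)   # NA lower bound = left-censored
upper <- c(3, 6, 2, NA)   # NA upper bound = right-censored
interval <- Surv(time = lower, time2 = upper, type = "interval2")

print(right)
print(left)
print(interval)
```

The `"interval2"` form is convenient because a single pair of columns can express exact, left-, right-, and interval-censored rows in the same dataset.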
Visualizing the Three Types
Code
library(ggplot2)
library(patchwork)  # provides `/`, `&`, and plot_annotation()
# clr is assumed to be the colour list from the article's setup code

# mk_panel() fixes title, subtitle and colour, then returns a function that
# draws one schematic from segment, point and annotation data frames
mk_panel <- function(title, subtitle, col) {
  function(seg, pts, ann) {
    ggplot() +
      geom_segment(
        data = seg,
        aes(x = x, xend = xend, y = y, yend = y,
            linetype = ltype, linewidth = lw, alpha = al),
        color = col, lineend = "round", show.legend = FALSE
      ) +
      geom_point(data = pts, aes(x = x, y = y, shape = sh),
                 color = col, size = 5.5, show.legend = FALSE) +
      geom_text(data = ann,
                aes(x = x, y = y, label = label, hjust = hjust),
                size = 3.4, color = "gray38", fontface = "italic") +
      scale_linetype_identity() +
      scale_linewidth_identity() +
      scale_alpha_identity() +
      scale_shape_identity() +
      xlim(-0.5, 10.5) +
      ylim(0.4, 1.6) +
      labs(title = title, subtitle = subtitle) +
      theme_void() +
      theme(
        plot.title = element_text(family = "serif", face = "bold",
                                  size = 12, hjust = 0.5, color = col,
                                  margin = margin(b = 3)),
        plot.subtitle = element_text(family = "serif", size = 9.5,
                                     hjust = 0.5, color = "gray50",
                                     face = "italic", margin = margin(b = 6))
      )
  }
}

# Right censoring
p_right <- mk_panel("Right Censoring", '"Hasn\'t happened yet..."', clr$blue)(
  data.frame(x = c(0, 8), xend = c(8, 10.3), y = 1,
             ltype = c("solid", "dashed"), lw = c(2.5, 1.5), al = c(1, 0.4)),
  data.frame(x = 8, y = 1, sh = 17),
  data.frame(x = c(4, 9.2), y = c(1.25, 1.25),
             label = c("Observed period", "Unknown future"), hjust = c(0.5, 0.5))
)

# Left censoring
p_left <- mk_panel("Left Censoring", '"Already happened — but when?"', clr$red)(
  data.frame(x = c(0, 3), xend = c(3, 8.5), y = 1,
             ltype = c("dashed", "solid"), lw = c(1.5, 2.5), al = c(0.4, 1)),
  data.frame(x = c(1.5, 3), y = 1, sh = c(4, 15)),
  data.frame(x = c(1.5, 5.8, 1.5), y = c(1.25, 1.25, 0.72),
             label = c("Unknown past", "Observed period", "?"),
             hjust = c(0.5, 0.5, 0.5))
)

# Interval censoring
p_int <- mk_panel("Interval Censoring", '"Somewhere between two check-ups..."', clr$purple)(
  data.frame(x = c(0, 2, 6), xend = c(2, 6, 8.5), y = 1,
             ltype = c("solid", "dashed", "solid"),
             lw = c(2.5, 1.5, 2.5), al = c(1, 0.4, 1)),
  data.frame(x = c(2, 4, 6), y = 1, sh = c(1, 4, 16)),
  data.frame(x = c(1, 4, 7.2, 4), y = c(1.25, 1.25, 1.25, 0.72),
             label = c("Event-free", "Somewhere here", "Event confirmed", "?"),
             hjust = c(0.5, 0.5, 0.5, 0.5))
)

# Stack the three panels vertically and add a shared title and caption
(p_right / p_left / p_int) +
  plot_annotation(
    title = "The Three Faces of Censoring",
    caption = "Solid bar = observed period · Dashed = uncertain zone · Shapes mark event/censoring points"
  ) &
  theme(plot.title = element_text(family = "serif", face = "bold",
                                  size = 14, hjust = 0.5))
Figure 2: Schematic of the three censoring mechanisms. Solid bars show the observed period; dashed segments mark the uncertain zone. Shapes distinguish event onset, censoring points, and confirmed detections.
Left Censoring vs Left Truncation
Frequently Confused in Practice — Know the Difference
Left censoring and left truncation are not the same concept, yet they are regularly conflated, even in peer-reviewed literature and real-world evidence (RWE) discussions. Getting the distinction right matters because the two call for different adjustments.
Left Censoring
Event happened — time unknown
The individual experienced the event, but we don’t know when. They are still in our dataset; we just lack a precise event time.
Classic example: A patient already had a cough when they enrolled in a respiratory study. The cough onset is left-censored — somewhere before day 0 of the study.
Left Truncation (Late Entry)
Only survivors enter — selection bias
An individual only appears in our data because they had not yet experienced the event by the time observation began. Those who had the event earlier are completely absent — they never entered the study.
Classic example: A retirement home study enrolls residents who must be alive and resident at age 65. Those who died before 65 are never seen. This creates a left-truncated sample — systematically biased toward long survivors.
Why Truncation Is a Bigger Problem
Left censoring gives us partial information — we know the event happened, just not when. Left truncation removes people entirely from the sample before we even see them, creating a survivorship selection bias that standard methods cannot correct for without explicit adjustment.
| Feature | Left Censoring | Left Truncation |
|---|---|---|
| Person enters the dataset | ✓ Yes | ✓ Yes (conditionally) |
| Event time known | ✗ No | ✓ Yes (if observed later) |
| Partial information available | ✓ Yes | ✗ No (pre-entry history lost) |
| Biases \(\hat{S}(t)\) if ignored | Mild | Severe (upward bias) |
| Handled in R | Surv(time, event, type = "left") | Surv(entry, time, event) |
Handling Left Truncation in R
When individuals enter the study late (i.e., they have already been at risk for some time before the study window begins), use the three-argument form of Surv():
# entry = age entered study, time = age at event/censor
# Only subjects who survived to their entry age are observed
# Surv(entry_time, event_time, status) correctly adjusts the risk set
library(survival)

# Example: retirement home study, entry at age 65+
set.seed(42)
n <- 200
entry_age  <- runif(n, 65, 75)            # must have survived to entry
event_age  <- entry_age + rexp(n, 0.08)   # time-to-event after entry
censor_age <- entry_age + runif(n, 0, 20)
obs_age    <- pmin(event_age, censor_age)
status     <- as.integer(event_age <= censor_age)

# Three-argument Surv() handles left truncation correctly
km_truncated <- survfit(Surv(entry_age, obs_age, status) ~ 1)
cat("Median survival (left-truncation adjusted):",
    round(summary(km_truncated)$table["median"], 1), "years\n")
Median survival (left-truncation adjusted): 75.6 years
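To see how much ignoring late entry can matter, the following sketch simulates a population, keeps only those still event-free at their entry time, and compares the naive and adjusted Kaplan-Meier medians. The exponential lifetimes, uniform entry times, and absence of right censoring are simplifying assumptions chosen to isolate the truncation effect.

```r
library(survival)

set.seed(2024)
n_pop    <- 5000
lifetime <- rexp(n_pop, rate = 0.1)   # true time to event; median = log(2) / 0.1
entry    <- runif(n_pop, 0, 10)       # time at which observation could begin

# Left truncation: only subjects still event-free at entry are ever sampled
seen    <- lifetime > entry
t_event <- lifetime[seen]
t_entry <- entry[seen]
status  <- rep(1L, sum(seen))         # no right censoring in this sketch

km_naive    <- survfit(Surv(t_event, status) ~ 1)           # ignores late entry
km_adjusted <- survfit(Surv(t_entry, t_event, status) ~ 1)  # delayed-entry risk sets

median_naive    <- unname(summary(km_naive)$table["median"])
median_adjusted <- unname(summary(km_adjusted)$table["median"])

c(truth = log(2) / 0.1, naive = median_naive, adjusted = median_adjusted)
```

In this setup the naive median should come out clearly too long, because short-lived subjects never make it into the sample, while the three-argument form should recover a value near the true median. With small samples the adjusted estimate can be unstable early on, when few subjects have entered the risk set.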
Why Censoring Mechanisms Matter
Not all censoring is created equal. The reason someone exits observation — not just the fact that they did — determines whether your survival estimates remain valid.
The Hazard Function
Alongside \(S(t)\), a second fundamental quantity is the hazard function \(h(t)\), the instantaneous rate at which the event occurs given survival up to time \(t\):
\[h(t) = \lim_{\Delta t \to 0} \frac{P(t \leq T < t + \Delta t \mid T \geq t)}{\Delta t}\]
The hazard and survival functions are related by:
\[S(t) = \exp\!\left(-\int_0^t h(u)\, du\right)\]
Informative censoring distorts our estimate of \(h(t)\), which then propagates bias into \(\hat{S}(t)\).
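The relation \(S(t) = \exp(-\int_0^t h(u)\,du)\) can be checked numerically. The sketch below assumes a Weibull hazard with illustrative shape and scale values, integrates it with base R's `integrate()`, and compares the result with the closed-form survival from `pweibull()`.

```r
# Weibull hazard with shape k and scale lambda (illustrative values)
k      <- 1.5
lambda <- 8
hazard <- function(u) (k / lambda) * (u / lambda)^(k - 1)

t_grid <- c(1, 5, 10, 20)

# S(t) = exp(-cumulative hazard), with the integral evaluated numerically
surv_from_hazard <- sapply(t_grid, function(t) {
  exp(-integrate(hazard, lower = 0, upper = t)$value)
})

# Closed-form Weibull survival for comparison
surv_closed_form <- pweibull(t_grid, shape = k, scale = lambda, lower.tail = FALSE)

round(cbind(t = t_grid, surv_from_hazard, surv_closed_form), 4)
```

The two columns should agree up to numerical integration error, which is why any bias in an estimated hazard carries straight through to the estimated survival curve.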
Non-Informative vs Informative Censoring
| Type | Definition | Examples | Impact on \(\hat{S}(t)\) |
|---|---|---|---|
| Non-informative | Censoring is independent of the event risk | Study ends on a pre-set calendar date; patient relocates for non-medical reasons; device battery dies | ✓ Unbiased; standard methods valid |
| Informative | Censoring is associated with the risk of the event | Sickest patients drop out because they are too ill to attend; healthiest patients stop follow-up because they feel cured | ⚠ Biased; \(\hat{S}(t)\) can be severely distorted |
The Core Assumption of Standard Survival Methods
Both the Kaplan-Meier estimator and the Cox proportional hazards model assume non-informative (independent) censoring: the censoring time \(C\) is independent of the true event time \(T\), conditional on any covariates in the model.
\[T \perp C \mid \mathbf{X}\]
Violating this without correction leads to biased estimates of \(S(t)\) and \(h(t)\).
Simulating the Bias in \(\hat{S}(t)\)
Code
library(survival)
library(dplyr)    # bind_rows(), tibble()
library(ggplot2)
library(scales)   # percent_format()
# clr and theme_survival() are assumed to come from the article's setup code

set.seed(456)
n_sim <- 500

# True event times: Exp(rate = 0.1) → median survival ≈ 6.9 time units
true_times <- rexp(n_sim, rate = 0.1)

# Scenario 1 (non-informative): random censoring unrelated to event risk
rand_censor   <- runif(n_sim, 0, 20)
obs_random    <- pmin(true_times, rand_censor)
status_random <- as.integer(true_times <= rand_censor)

# Scenario 2 (informative): dropout probability depends on the true event time.
# As coded, plogis() rises with true_times, so individuals with longer true
# times (lower risk) are the ones most likely to drop out early
prob_dropout  <- plogis(0.5 * (true_times - 10))  # sigmoid
inform_censor <- ifelse(runif(n_sim) < prob_dropout, runif(n_sim, 0, 5), 20)
obs_inform    <- pmin(true_times, inform_censor)
status_inform <- as.integer(true_times <= inform_censor)

km_rand   <- survfit(Surv(obs_random, status_random) ~ 1)
km_inform <- survfit(Surv(obs_inform, status_inform) ~ 1)

# Stack both fitted curves into one data frame for plotting
plot_df <- bind_rows(
  tibble(time = km_rand$time, surv = km_rand$surv,
         upper = km_rand$upper, lower = km_rand$lower,
         scenario = "Non-informative censoring"),
  tibble(time = km_inform$time, surv = km_inform$surv,
         upper = km_inform$upper, lower = km_inform$lower,
         scenario = "Informative censoring")
)

ggplot(plot_df, aes(x = time, y = surv, color = scenario, fill = scenario)) +
  geom_ribbon(aes(ymin = lower, ymax = upper), alpha = 0.11, color = NA) +
  geom_step(linewidth = 1.3) +
  geom_hline(yintercept = 0.5, linetype = "dotted", color = "gray55") +
  annotate("text", x = 19.2, y = 0.515, label = "S(t) = 0.50",
           size = 3.1, color = "gray45", fontface = "italic", hjust = 1) +
  annotate("segment",
           x = 11.5, xend = 11.5, y = 0.55, yend = 0.72,
           arrow = arrow(length = unit(0.18, "cm"), ends = "both"),
           color = clr$red, linewidth = 0.8) +
  # \u015C is S with a circumflex (S-hat); the original escape \u015a renders an acute accent
  annotate("text", x = 13, y = 0.635,
           label = "Bias in\n\u015C(t)",
           color = clr$red, size = 3.2, fontface = "italic", hjust = 0) +
  scale_color_manual(values = c(
    "Non-informative censoring" = clr$green,
    "Informative censoring"     = clr$red
  )) +
  scale_fill_manual(values = c(
    "Non-informative censoring" = clr$green,
    "Informative censoring"     = clr$red
  )) +
  scale_y_continuous(labels = percent_format(accuracy = 1), limits = c(0, 1)) +
  scale_x_continuous(breaks = seq(0, 20, 5)) +
  labs(
    x = "Time",
    y = expression("Estimated survival probability " * hat(S)(t)),
    title = "The Hidden Danger of Informative Censoring",
    subtitle = "Identical true survival times — very different estimated curves",
    color = "Censoring mechanism",
    caption = "Shaded bands = 95% pointwise CI · True distribution: Exp(rate = 0.1) · n = 500"
  ) +
  theme_survival() +
  guides(fill = "none")
Figure 3: Two Kaplan-Meier curves estimated from identical underlying survival distributions, differing only in the censoring mechanism. The gap between them is entirely attributable to informative censoring — not any real difference in survival. Shaded bands show 95% confidence intervals.
The gap between the two curves is entirely artefactual — it reflects bias in \(\hat{S}(t)\) introduced by the censoring mechanism, not any real difference in survival.
Exploring and Reporting Censoring
Characterise Your Data First
Before fitting any model, summarise your censoring. Here we simulate a two-arm clinical dataset and produce a compact descriptive table:
Table 1: Censoring summary by treatment group. Always report this table before any survival model output.
| Group | N | Events, n (%) | Censored, n (%) | Median follow-up (months) | IQR |
|---|---|---|---|---|---|
| Control | 163 | 133 (81.6%) | 30 (18.4%) | 6.0 | 3.1–10.6 |
| Treatment | 137 | 96 (70.1%) | 41 (29.9%) | 7.5 | 3.2–11.5 |
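A table of this kind can be built with a few lines of base R. The sketch below uses a hypothetical stand-in dataset, `toy_data`, with columns `group`, `obs_time`, and `event`. It is not the dataset behind Table 1, so the numbers will differ.

```r
set.seed(10)
n <- 300

# Hypothetical two-arm data (rates and censoring window are assumptions)
toy_data <- data.frame(group = sample(c("Control", "Treatment"), n, replace = TRUE))
rate        <- ifelse(toy_data$group == "Treatment", 0.08, 0.12)
event_time  <- rexp(n, rate)
censor_time <- runif(n, 2, 18)
toy_data$obs_time <- pmin(event_time, censor_time)
toy_data$event    <- as.integer(event_time <= censor_time)

# One row of summary statistics for a single group
summarise_censoring <- function(d) {
  data.frame(
    N            = nrow(d),
    events       = sum(d$event),
    events_pct   = round(100 * mean(d$event), 1),
    censored     = sum(d$event == 0),
    censored_pct = round(100 * mean(d$event == 0), 1),
    median_time  = round(median(d$obs_time), 1)   # simple median of observed times
  )
}

censor_table <- do.call(rbind, lapply(split(toy_data, toy_data$group),
                                      summarise_censoring))
censor_table
```

Here `median_time` is the plain median of observed times, which understates potential follow-up when many rows end in events.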
When Are Patients Being Censored?
Heavy early censoring is a red flag for informative dropout:
Code
# Keep censored rows only (event == 0) and plot when they left observation
sim_data |>
  filter(event == 0) |>
  ggplot(aes(x = obs_time, fill = group)) +
  geom_histogram(bins = 25, alpha = 0.72, position = "identity",
                 color = "white", linewidth = 0.3) +
  scale_fill_manual(values = c("Treatment" = clr$blue, "Control" = clr$amber)) +
  labs(
    x = "Time at censoring (months)",
    y = "Number of censored patients",
    fill = "Group",
    title = "When Are Patients Being Censored?",
    subtitle = "Unexpected early spikes may signal informative censoring",
    caption = "Only censored observations shown"
  ) +
  theme_survival()
Figure 4: Distribution of censoring times by group. An unexpected spike of early censoring — relative to the event distribution — may indicate informative censoring requiring sensitivity analysis.
Estimating Median Follow-up
Use the reverse Kaplan-Meier method: treat censored observations as events and events as censored. The median of the resulting curve estimates the median potential follow-up, that is, how long patients would have remained under observation had no events cut their follow-up short.
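A minimal sketch of the reverse Kaplan-Meier calculation follows. It simulates its own data (exponential event times and a uniform censoring window, both assumptions) so that the true median potential follow-up, 12 months, is known in advance.

```r
library(survival)

set.seed(99)
n <- 500
event_time  <- rexp(n, rate = 0.1)
censor_time <- runif(n, 6, 18)               # potential follow-up; true median = 12
obs_time    <- pmin(event_time, censor_time)
event       <- as.integer(event_time <= censor_time)

# Reverse Kaplan-Meier: flip the status so censoring becomes the "event"
rev_km <- survfit(Surv(obs_time, 1 - event) ~ 1)
median_followup <- unname(summary(rev_km)$table["median"])

c(naive_median_observed = median(obs_time), reverse_km = median_followup)
```

The plain median of observed times should come out noticeably shorter, because every row that ends in an event is cut off before its potential follow-up is complete.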
When reporting a survival analysis, cover the following points on censoring:
Number and percentage of censored observations, overall and by key subgroups
Primary reasons for censoring (end of study, withdrawal, loss to follow-up)
Median and range of follow-up time (reverse KM method preferred)
Assessment of whether censoring patterns differ across groups
Explicit statement on whether censoring is assumed non-informative
Sensitivity analyses if informative censoring is suspected
Summary
Key Takeaways
Censoring is not missing data — it is partially observed data that still carries information about \(S(t)\)
Right censoring is the most common form; standard methods are designed for it
Left censoring means the event happened before observation began — we lack a precise time
Left truncation is fundamentally different: entire individuals are absent from the data because they didn’t survive to enter the study — it creates survivorship selection bias
The critical assumption is \(T \perp C \mid \mathbf{X}\): censoring must be independent of the event risk given covariates
Always visualize censoring patterns before modelling; early spikes warrant investigation
Report transparently: number censored, reasons, timing, median follow-up, and any sensitivity analyses
Up Next in the Series
In Part 2, we build the Kaplan-Meier estimator \(\hat{S}(t)\) from first principles: deriving the step-function formula, interpreting confidence intervals, comparing groups with the log-rank test, and understanding the at-risk table.