Survival Analysis

Censoring & Truncation

Bakti Siregar, M.Sc.

Data Science ~ ITSB

2026-03-12

⌛ Why Survival Analysis?

Many real-world problems study time until something happens.

Examples:

  • 🏥 Patient survival after treatment
  • ⚙️ Time until machine failure
  • 👥 Customer churn in subscription services
  • ⛏️ Time until mine equipment breakdown

Traditional regression cannot handle:

  • ⏱️ incomplete event times
  • 🚧 censored observations
  • ⏳ delayed entry

Important

📈 Survival Analysis is specifically designed for time-to-event data.

🎯 Learning Objectives

After this session, students will be able to:

  • ⏱️ Explain what time-to-event data represents
  • 🔍 Distinguish censoring from truncation
  • 🧩 Identify right, left, and interval censoring
  • 💻 Construct correct Surv() objects in R
  • 📈 Interpret Kaplan–Meier survival curves

Note

🕒 Survival analysis studies when events happen, not just whether they happen.

⚠️ The Core Problem

Suppose we study machine failure time.

Ideally we want:

⏱️ T = exact time until machine fails

But in reality we observe situations like:

  • ⚙️ Machine still running when study ends
  • ⏮️ Machine installed before observation began
  • 🔍 Failure occurred between inspections

Warning

📉 In practice, event times are often incomplete.
This is why survival analysis exists.

🧠 What Do We Really Observe?

In theory we want the true event time:

⏱️ T = time until event

But in real datasets we observe:

  • 📅 Follow-up duration
  • 🔔 Event indicator (event occurred or not)
  • 📏 Sometimes only a time interval
  • ⏳ Sometimes delayed entry

Important

🧩 Survival data contains partial information about event time.
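
For right censoring, the recorded pair can be sketched as the smaller of the true event time and the censoring time, plus a flag saying which of the two was reached first. The values below are hypothetical. In real data the true time behind a censored row is never seen; it is listed here only to show how the observed columns arise.

```r
# Hypothetical true event times and a common study end at day 365
true_time   <- c(120, 400, 250)
censor_time <- c(365, 365, 365)

# What the dataset actually records
obs_time <- pmin(true_time, censor_time)           # follow-up duration
event    <- as.integer(true_time <= censor_time)   # 1 = event seen, 0 = censored

data.frame(obs_time, event)
```

The second subject is recorded as 365 with event = 0: all we know is that its event time exceeds 365.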

⚖️ Censoring vs Truncation

This is the most important distinction in survival analysis.

Aspect | Censoring | Truncation
Definition | Subject is in the dataset, but the event time is not fully observed | Some subjects never enter the dataset
Subject status | Observed, but the exact event time is unknown (e.g. the event has not occurred yet) | Never observed in the data
Information known | A bound on the event time, e.g. T > t for right censoring | None for the excluded subjects
Example | Customer still active when the study ends | Study includes only workers employed ≥ 2 years
Implication | Handled by Kaplan–Meier or Cox models | May introduce selection bias if ignored

Important

Censoring → The subject is in the data, but we don’t know the exact time of the event.

Truncation → Some subjects never appear in the data because they were filtered out from the beginning.

❗ Censoring Is NOT Missing Data

Example: 👤 A customer is observed for 365 days and is still active.

What do we know?

T > 365

This is still valuable information.

Warning

⚠️ Common mistake

Many analysts delete censored observations because they think the data is incomplete.

Do NOT remove them.

Important

  • Missing data → no information about the event
  • Censored data → we have a bound on the event time (for right censoring: the event occurs after the last observed time)

That information is essential for survival analysis.
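
A small sketch of why deleting censored rows is harmful, using the survival package and five hypothetical subjects (the numbers are illustrative and not taken from any real study). It compares the Kaplan–Meier estimate at day 250 with and without the censored observations.

```r
library(survival)

# Hypothetical data: three events, two subjects still event-free at day 365
time  <- c(100, 365, 200, 365, 300)
event <- c(1, 0, 1, 0, 1)

# Kaplan-Meier estimate using all five subjects
fit_all <- survfit(Surv(time, event) ~ 1)

# Same estimate after (wrongly) deleting the censored rows
keep     <- event == 1
fit_drop <- survfit(Surv(time[keep], event[keep]) ~ 1)

# Estimated probability of surviving past day 250
s_all  <- summary(fit_all,  times = 250)$surv   # 0.6
s_drop <- summary(fit_drop, times = 250)$surv   # about 0.33
c(all = s_all, dropped = s_drop)
```

Dropping the two censored subjects removes people who are known to have survived a full year, so the estimate is pulled downward.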

📊 Basic Structure of Survival Data

In survival analysis we convert calendar dates into numeric durations (here, days from start to event or last follow-up).

Example dataset

df_demo <- data.frame(
  id = c("A","B","C","D","E"),
  start_date = as.Date(c("2023-01-01","2023-01-10","2023-02-05","2023-03-15","2023-04-20")),
  event_date = as.Date(c("2023-06-01", NA, "2023-05-01", "2023-09-10", NA)),
  last_followup = as.Date(c(NA,"2024-01-01",NA, NA, "2024-02-15"))
)

df_demo
  id start_date event_date last_followup
1  A 2023-01-01 2023-06-01          <NA>
2  B 2023-01-10       <NA>    2024-01-01
3  C 2023-02-05 2023-05-01          <NA>
4  D 2023-03-15 2023-09-10          <NA>
5  E 2023-04-20       <NA>    2024-02-15

Convert to Survival Time

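# End of observation: event date if the event occurred, otherwise last follow-up.
# ifelse() strips the Date class and returns numeric day counts,
# so end_date is converted back to Date on the next step.
df_demo$end_date <- ifelse(is.na(df_demo$event_date),
                           df_demo$last_followup,
                           df_demo$event_date)

df_demo$end_date <- as.Date(df_demo$end_date, origin="1970-01-01")

# Follow-up duration in days
df_demo$time_days <- as.numeric(df_demo$end_date - df_demo$start_date)

# Event indicator: 1 = event observed, 0 = right-censored
df_demo$event <- ifelse(is.na(df_demo$event_date),0,1)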

df_demo
  id start_date event_date last_followup   end_date time_days event
1  A 2023-01-01 2023-06-01          <NA> 2023-06-01       151     1
2  B 2023-01-10       <NA>    2024-01-01 2024-01-01       356     0
3  C 2023-02-05 2023-05-01          <NA> 2023-05-01        85     1
4  D 2023-03-15 2023-09-10          <NA> 2023-09-10       179     1
5  E 2023-04-20       <NA>    2024-02-15 2024-02-15       301     0

🔵 Right Censoring

This example continues from the previous dataset.

Now we convert the data into a survival object using Surv(), passing the duration first and the event indicator second.

library(survival)
S_demo <- with(df_demo, Surv(time_days, event))
S_demo
[1] 151  356+  85  179  301+

Important

How to Read Surv() Output (5 Observations)

Output | Meaning
151 | Event occurred at day 151 (ID A)
356+ | Right-censored: no event observed through day 356, so T > 356 (ID B)
85 | Event occurred at day 85 (ID C)
179 | Event occurred at day 179 (ID D)
301+ | Right-censored: no event observed through day 301, so T > 301 (ID E)

Note: The “+” symbol indicates that the observation is censored (i.e., the event was not observed during the study period).

Note

  • The Kaplan–Meier curve steps down only at observed event times
  • A censored subject leaves the risk set after its censoring time
  • Censoring itself does NOT cause a drop in the curve
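
The points in this note can be checked with a Kaplan–Meier fit on the five durations from the example. This is a minimal sketch using survfit() from the survival package; the vectors are retyped here so the block runs on its own.

```r
library(survival)

# Durations and event flags from the five-subject example (IDs A to E)
time_days <- c(151, 356, 85, 179, 301)
event     <- c(1, 0, 1, 1, 0)

km_fit <- survfit(Surv(time_days, event) ~ 1)

# One row per event time: number at risk, events, and survival estimate
km_tab <- summary(km_fit)
data.frame(time = km_tab$time, n.risk = km_tab$n.risk,
           n.event = km_tab$n.event, surv = km_tab$surv)
```

The estimate steps down at days 85, 151 and 179 only (to 0.8, 0.6 and 0.4). The censored subjects at days 301 and 356 never trigger a step. Calling plot(km_fit) draws the corresponding curve.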

🟡 Left Censoring

Definition:

Event occurred before observation began.

We only know:

T ≤ t

💻 Left Censoring in R

[1]  60   90   75-  40  120-  55 

Tip

Always specify type = "left". With only a time and a status, Surv() assumes right censoring by default.
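
The call behind the output above is not shown on the slide. One way to obtain output of that form, assuming the six times printed there and a status of 0 for the two left-censored rows, is sketched below; the variable names are hypothetical.

```r
library(survival)

# Times from the printed output; observed = 0 marks a left-censored row,
# meaning the event had already happened by that time
time     <- c(60, 90, 75, 40, 120, 55)
observed <- c(1, 1, 0, 1, 0, 1)

S_left <- Surv(time, observed, type = "left")
S_left
```

Left-censored entries print with a trailing "-", in the same way that right-censored entries print with a trailing "+".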

🟢 Interval Censoring

Definition:

Event occurred within interval:

L < T ≤ R

Example:

Tested negative in January
Tested positive in July

The infection occurred between the two tests, but the exact time is unknown.

💻 Interval Censoring in R

[1] [  0,  50] [ 30,  90] [ 60, 120] [ 40,  80]  90+       [120, 160]

Important

If R = Inf (or NA), the observation is right-censored: the event is only known to occur after L.
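
The code for the output above is not shown. A minimal sketch that gives output of the same form uses Surv() with type = "interval2", where each row is given as a lower and an upper bound. The bounds are copied from the printed output, and the names left and right are hypothetical.

```r
library(survival)

# Lower and upper bounds of each interval; Inf as the upper bound
# marks a right-censored observation
left  <- c(0, 30, 60, 40, 90, 120)
right <- c(50, 90, 120, 80, Inf, 160)

S_int <- Surv(left, right, type = "interval2")
S_int
```

In the resulting object, status 3 denotes an interval-censored row and status 0 a right-censored one, so the fifth row prints as 90+.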

🔴 Truncation

Truncation removes subjects from the dataset entirely.

Example:

Study includes only people who survived at least 2 years.

People who died earlier never appear.

This is a selection-bias mechanism: the sample over-represents long survivors.
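
The effect can be sketched with a few hypothetical survival times in years. The full population is shown only for illustration; in a truncated study the short survivors would leave no record at all.

```r
# Hypothetical survival times (years) for the whole population
all_times <- c(0.5, 1.2, 2.5, 3.0, 4.1, 0.8)

# The study only ever sees subjects who survived at least 2 years
observed_times <- all_times[all_times >= 2]

c(population_mean = mean(all_times),
  sample_mean     = mean(observed_times))
```

Half of the subjects never enter the data, and the mean of those who do is noticeably higher than the population mean.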

🟣 Left Truncation (Delayed Entry)

Subjects enter the study after the time origin.

They are at risk only from their entry time onward.

💻 Left Truncation in R

 [1] (150,384]  ( 67,341]  ( 74,313+] (140,304]  (120,281]  (137,360+]
 [7] (193,405+] (205,295]  (162,300]  ( 60,183+] ( 73,197+] ( 87,362] 
[13] (127,294]  (117,374]  (176,280+] (108,171]  (122,402+] (139,206+]
[19] (142,303]  (167,410+] (211,350+] (175,304]  (155,327]  ( 63,301] 
[25] (130,212+] ( 41,101]  ( 85,161]  ( 77,220]  ( 64,225]  ( 41,201] 
[31] ( 32,204]  ( 71,165]  ( 90,297+] (129,395+] ( 37,273]  ( 95,258+]
[37] (202,282]  (218,330]  (134,357+] (162,306+] (230,462+] (111,267] 
[43] (100,313]  (184,415]  (221,283+] ( 79,168+] (155,427]  (130,336] 
[49] ( 98,330+] (186,342]  ( 97,311]  ( 93,190]  (124,308]  ( 46,219] 
[55] (165,276]  (166,435+] (220,312]  ( 32,111]  ( 90,255]  (186,273] 

Interpretation:

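Each entry reads (entry, exit]; a trailing + marks a right-censored exit. The risk set changes over time because individuals join at their entry times rather than at time 0.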
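
How the output above was generated is not shown. The sketch below reuses its first six rows and builds the object with the three-argument form Surv(entry, exit, event). The helper at_risk() is hypothetical and only illustrates how the risk set grows as subjects enter.

```r
library(survival)

# First six rows of the printed output
entry <- c(150,  67,  74, 140, 120, 137)   # day each subject came under observation
exit  <- c(384, 341, 313, 304, 281, 360)   # day of event or censoring
event <- c(  1,   1,   0,   1,   1,   0)   # 0 = right-censored at exit

S_trunc <- Surv(entry, exit, event)
S_trunc

# Size of the risk set at day t: entered before t and not yet exited
at_risk <- function(t) sum(entry < t & t <= exit)
sapply(c(100, 145, 200), at_risk)   # 2, 5, 6
```

Only two subjects are under observation at day 100, five at day 145, and all six by day 200. Passing S_trunc to survfit(S_trunc ~ 1) or coxph() uses these entry-adjusted risk sets.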

📝 Concept Check

  1. No event_date but has last_followup → ?
  2. Already positive at first test → ?
  3. Only subjects who survived long enough to join → ?

Expected answer format:

Concept + correct Surv() syntax

🎓 Final Takeaways

  • Survival analysis studies time until event
  • Censoring ≠ missing data
  • Truncation affects sample composition
  • Correct Surv() specification is critical
  • Risk set is central to interpretation

Important

If the Surv() object is wrong, all downstream analysis is wrong.