SURVIVAL ANALYSIS

Understanding Survival Curves

Bakti Siregar

Bakti Siregar, M.Sc., CDSS.

He is a certified Data Scientist (BNSP) and Certified Data Science Specialist (CDSS).
His research focuses on Applied Mathematics and Data Science for data-driven decision-making.

Understanding Survival Curves

A survival curve shows how the probability of remaining event-free changes over time.

The survival function is defined as:

\[ S(t) = P(T > t) \]

Where:

  • T = time until event occurs
  • t = specific time point

Interpretation:

  • If S(t) = 0.8, then 80% of subjects survive beyond time t.
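
As a minimal sketch of this definition, assume every event time is fully observed (no censoring). In that case S(t) can be estimated as the proportion of subjects whose event time exceeds t. The vector `event_times` and the helper `surv_prob()` are hypothetical names used only for this illustration.

```r
# Hypothetical event times (days) for five subjects, all fully observed.
event_times <- c(2, 5, 7, 9, 12)

# Empirical survival function: proportion of subjects with T > t.
surv_prob <- function(t, times) mean(times > t)

# Three of the five subjects have an event time greater than 6.
s_at_6 <- surv_prob(6, event_times)
print(s_at_6)
```

With censored observations this simple proportion is no longer valid, which is why the Kaplan–Meier estimator is used later in the lecture.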

Examples of survival analysis applications:

  • Patient survival after treatment
  • Machine operating time before failure
  • Customer retention in subscription services
  • Equipment reliability in mining operations

Using Real Data from R

In this lecture we will use a dataset from the R survival package.

The lung dataset contains survival data from patients with advanced lung cancer.

Key variables include:

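Variable   Description
--------   -----------
time       survival time (days)
status     event indicator (1 = censored, 2 = dead)
age        patient age (years)
sex        sex (1 = male, 2 = female)
ph.ecog    ECOG performance score (0 = asymptomatic; higher is worse)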

Inspecting the Dataset

Before building a survival model we explore the dataset.
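
The output below has the shape produced by `str()`, `head()` and `summary()`. A sketch of the calls that would generate it, assuming the survival package is installed:

```r
library(survival)  # provides the lung dataset

str(lung)      # variable types and the first few values
head(lung)     # first six rows
summary(lung)  # per-variable summaries, including NA counts
```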

'data.frame':   228 obs. of  10 variables:
 $ inst     : num  3 3 3 5 1 12 7 11 1 7 ...
 $ time     : num  306 455 1010 210 883 ...
 $ status   : num  2 2 1 2 2 1 2 2 2 2 ...
 $ age      : num  74 68 56 57 60 74 68 71 53 61 ...
 $ sex      : num  1 1 1 1 1 1 2 2 1 1 ...
 $ ph.ecog  : num  1 0 0 1 0 1 2 2 1 2 ...
 $ ph.karno : num  90 90 90 90 100 50 70 60 70 70 ...
 $ pat.karno: num  100 90 90 60 90 80 60 80 80 70 ...
 $ meal.cal : num  1175 1225 NA 1150 NA ...
 $ wt.loss  : num  NA 15 15 11 0 0 10 1 16 34 ...
  inst time status age sex ph.ecog ph.karno pat.karno meal.cal wt.loss
1    3  306      2  74   1       1       90       100     1175      NA
2    3  455      2  68   1       0       90        90     1225      15
3    3 1010      1  56   1       0       90        90       NA      15
4    5  210      2  57   1       1       90        60     1150      11
5    1  883      2  60   1       0      100        90       NA       0
6   12 1022      1  74   1       1       50        80      513       0
      inst            time            status           age       
 Min.   : 1.00   Min.   :   5.0   Min.   :1.000   Min.   :39.00  
 1st Qu.: 3.00   1st Qu.: 166.8   1st Qu.:1.000   1st Qu.:56.00  
 Median :11.00   Median : 255.5   Median :2.000   Median :63.00  
 Mean   :11.09   Mean   : 305.2   Mean   :1.724   Mean   :62.45  
 3rd Qu.:16.00   3rd Qu.: 396.5   3rd Qu.:2.000   3rd Qu.:69.00  
 Max.   :33.00   Max.   :1022.0   Max.   :2.000   Max.   :82.00  
 NA's   :1                                                       
      sex           ph.ecog          ph.karno        pat.karno     
 Min.   :1.000   Min.   :0.0000   Min.   : 50.00   Min.   : 30.00  
 1st Qu.:1.000   1st Qu.:0.0000   1st Qu.: 75.00   1st Qu.: 70.00  
 Median :1.000   Median :1.0000   Median : 80.00   Median : 80.00  
 Mean   :1.395   Mean   :0.9515   Mean   : 81.94   Mean   : 79.96  
 3rd Qu.:2.000   3rd Qu.:1.0000   3rd Qu.: 90.00   3rd Qu.: 90.00  
 Max.   :2.000   Max.   :3.0000   Max.   :100.00   Max.   :100.00  
                 NA's   :1        NA's   :1        NA's   :3       
    meal.cal         wt.loss       
 Min.   :  96.0   Min.   :-24.000  
 1st Qu.: 635.0   1st Qu.:  0.000  
 Median : 975.0   Median :  7.000  
 Mean   : 928.8   Mean   :  9.832  
 3rd Qu.:1150.0   3rd Qu.: 15.750  
 Max.   :2600.0   Max.   : 68.000  
 NA's   :47       NA's   :14       

Important variables:

  • time → observed survival time
  • status → event indicator (censored or event observed)

However, status is coded 1/2, so we recode it to the conventional 0/1 before modelling.

Preparing the Event Variable

In the dataset:

  • status = 1 → censored
  • status = 2 → event occurred

For survival analysis we convert to:

  • 0 = censored
  • 1 = event
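
One way to do the recoding is sketched here. The column name `event` is an assumption made for this example; the original slides do not show which name was used.

```r
library(survival)

# `event` is a hypothetical column name; the original status column is kept.
# status == 2 (event occurred) becomes 1, status == 1 (censored) becomes 0.
lung$event <- ifelse(lung$status == 2, 1, 0)

# Count censored (0) and event (1) observations.
event_counts <- table(lung$event)
print(event_counts)
```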

  0   1 
 63 165 

Creating the Survival Object

The core structure in survival analysis is the Survival Object.
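
A sketch of how such an object can be built with `Surv()`. The name `S_lung` appears in the model calls later in this lecture; the 0/1 column `event` is an assumed name.

```r
library(survival)

# Hypothetical 0/1 event column (0 = censored, 1 = event).
lung$event <- ifelse(lung$status == 2, 1, 0)

# Pair each follow-up time with its event indicator.
S_lung <- Surv(time = lung$time, event = lung$event)
print(S_lung)
```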

  [1]  306   455  1010+  210   883  1022+  310   361   218   166   170   654 
 [13]  728    71   567   144   613   707    61    88   301    81   624   371 
 [25]  394   520   574   118   390    12   473    26   533   107    53   122 
 [37]  814   965+   93   731   460   153   433   145   583    95   303   519 
 [49]  643   765   735   189    53   246   689    65     5   132   687   345 
 [61]  444   223   175    60   163    65   208   821+  428   230   840+  305 
 [73]   11   132   226   426   705   363    11   176   791    95   196+  167 
 [85]  806+  284   641   147   740+  163   655   239    88   245   588+   30 
 [97]  179   310   477   166   559+  450   364   107   177   156   529+   11 
[109]  429   351    15   181   283   201   524    13   212   524   288   363 
[121]  442   199   550    54   558   207    92    60   551+  543+  293   202 
[133]  353   511+  267   511+  371   387   457   337   201   404+  222    62 
[145]  458+  356+  353   163    31   340   229   444+  315+  182   156   329 
[157]  364+  291   179   376+  384+  268   292+  142   413+  266+  194   320 
[169]  181   285   301+  348   197   382+  303+  296+  180   186   145   269+
[181]  300+  284+  350   272+  292+  332+  285   259+  110   286   270    81 
[193]  131   225+  269   225+  243+  279+  276+  135    79    59   240+  202+
[205]  235+  105   224+  239   237+  173+  252+  221+  185+   92+   13   222+
[217]  192+  183   211+  175+  197+  203+  116   188+  191+  105+  174+  177+

Interpretation of output:

  • 306 → event occurred at time 306
  • 1010+ → observation censored at time 1010

The + symbol indicates right censoring.

Kaplan–Meier Survival Estimation

The Kaplan–Meier estimator calculates survival probability over time.
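
The call shown in the output below can be reproduced roughly as follows. The object name `km_fit` and the column `event` are assumptions made for this sketch.

```r
library(survival)

lung$event <- ifelse(lung$status == 2, 1, 0)  # hypothetical 0/1 column
S_lung <- Surv(lung$time, lung$event)

# Fit a single Kaplan-Meier curve for the whole sample.
km_fit <- survfit(S_lung ~ 1, data = lung)
print(km_fit)
```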

Call: survfit(formula = S_lung ~ 1, data = lung)

       n events median 0.95LCL 0.95UCL
[1,] 228    165    310     285     363

Explanation:

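  • ~1 means we estimate one overall survival curve.
  • n = 228 patients, of whom 165 experienced the event.
  • The estimated median survival is 310 days (95% CI: 285 to 363).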

Plotting the Survival Curve

Interpretation:

  • curve starts at 1
  • drops when events occur
  • tick marks represent censored observations
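
A base-R plot with these features could be drawn as sketched here. The axis labels, title and the name `km_fit` are choices made for this example, not taken from the original slides.

```r
library(survival)

lung$event <- ifelse(lung$status == 2, 1, 0)  # hypothetical 0/1 column
S_lung <- Surv(lung$time, lung$event)
km_fit <- survfit(S_lung ~ 1, data = lung)

# Step curve with confidence band; mark.time = TRUE draws a tick
# at each censored observation.
plot(km_fit,
     conf.int = TRUE,
     mark.time = TRUE,
     xlab = "Days",
     ylab = "Survival probability",
     main = "Kaplan-Meier curve, lung data")
```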

Interpreting Survival Curves

Kaplan–Meier curves are step functions.

Important rules:

  • curve drops when events occur
  • curve remains flat when no events occur
  • censoring does not cause drops

Thus the survival curve describes how quickly events occur over time.
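
These rules follow from the standard product-limit form of the Kaplan–Meier estimator, given here as background. The symbols t_i, d_i and n_i are notation introduced for this note.

\[ \hat{S}(t) = \prod_{i:\, t_i \le t} \left( 1 - \frac{d_i}{n_i} \right) \]

Where:

  • t_i = the i-th distinct event time
  • d_i = number of events at t_i
  • n_i = number of subjects at risk just before t_i

A censored subject contributes no factor to the product. It only reduces n_i at later event times, so the curve stays flat at the moment of censoring. For example, the earliest event in the lung data is a single death at day 5 with all 228 patients at risk, so

\[ \hat{S}(5) = 1 - \frac{1}{228} \approx 0.9956 \]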

Survival Probability at Specific Times

We can estimate survival probability at particular time points.

Example:
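
A call of the following form produces a table like the one below. The `times` argument of `summary()` selects the time points; `km_fit`, `km_at` and `event` are assumed names.

```r
library(survival)

lung$event <- ifelse(lung$status == 2, 1, 0)  # hypothetical 0/1 column
S_lung <- Surv(lung$time, lung$event)
km_fit <- survfit(S_lung ~ 1, data = lung)

# Survival estimates at 100, 200 and 300 days.
km_at <- summary(km_fit, times = c(100, 200, 300))
print(km_at)
```

When `times` is supplied, the n.event column counts the events in the interval since the previous requested time, not only the events at that exact time.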

Call: survfit(formula = S_lung ~ 1, data = lung)

 time n.risk n.event survival std.err lower 95% CI upper 95% CI
  100    196      31    0.864  0.0227        0.821        0.910
  200    144      41    0.680  0.0311        0.622        0.744
  300     92      29    0.531  0.0346        0.467        0.603

This output provides:

  • survival probability
  • confidence interval
  • number at risk

Example interpretation:

From the output above, S(200) = 0.680.

An estimated 68% of patients survive beyond 200 days (95% CI: 62.2% to 74.4%).

Median Survival Time

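Median survival is the time at which the survival curve reaches 0.5:

\[ S(t) = 0.5 \]

Because a Kaplan–Meier curve is a step function, the estimate is taken as the first time at which the curve drops to 0.5 or below.

Meaning:

  • an estimated 50% of subjects have experienced the event by this time
  • an estimated 50% remain event-free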

Extract median survival:
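
The summary table below can be obtained from the `table` component of `summary()`. This is a sketch with assumed object names (`km_fit`, `km_table`).

```r
library(survival)

lung$event <- ifelse(lung$status == 2, 1, 0)  # hypothetical 0/1 column
S_lung <- Surv(lung$time, lung$event)
km_fit <- survfit(S_lung ~ 1, data = lung)

# Named vector with n, events, restricted mean, median and its 95% CI.
km_table <- summary(km_fit)$table
print(km_table)

# Median survival time with its confidence limits.
print(km_table[c("median", "0.95LCL", "0.95UCL")])
```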

  records     n.max   n.start    events     rmean se(rmean)    median   0.95LCL 
228.00000 228.00000 228.00000 165.00000 376.27475  19.70779 310.00000 285.00000 
  0.95UCL 
363.00000 

Median survival is widely reported in clinical studies.

Visualizing with survminer

For better visualization we use survminer.

This graph shows:

  • Kaplan–Meier curve
  • confidence intervals
  • number at risk
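
A plot with these three elements could be produced with `ggsurvplot()` as sketched below, assuming the survminer package is installed. The model is written with the formula inline here, which is the form used in survminer's own examples. The axis labels and object names are choices made for this sketch.

```r
library(survival)
library(survminer)  # assumed to be installed

lung$event <- ifelse(lung$status == 2, 1, 0)  # hypothetical 0/1 column
km_fit <- survfit(Surv(time, event) ~ 1, data = lung)

km_plot <- ggsurvplot(
  km_fit,
  data = lung,
  conf.int = TRUE,    # shaded confidence band
  risk.table = TRUE,  # number-at-risk table under the curve
  xlab = "Days",
  ylab = "Survival probability"
)
print(km_plot)
```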

Understanding the Risk Table

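The risk table shows the number of individuals still under observation (neither censored nor having had the event) at each time point.

Example layout (illustrative counts, not the exact values for the lung data):

Time (days)   At Risk
0             228
200           120
400           45
600           10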

As time increases:

  • the risk set becomes smaller
  • estimates become less stable.
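
The at-risk counts for the lung data can be read directly from the fitted curve. This is a sketch; the time points and the names `km_fit`, `risk_at` and `risk_table` are choices made for this example.

```r
library(survival)

lung$event <- ifelse(lung$status == 2, 1, 0)  # hypothetical 0/1 column
S_lung <- Surv(lung$time, lung$event)
km_fit <- survfit(S_lung ~ 1, data = lung)

# Number at risk and survival estimate at selected times.
risk_at <- summary(km_fit, times = c(0, 200, 400, 600))
risk_table <- data.frame(time   = risk_at$time,
                         n.risk = risk_at$n.risk,
                         surv   = round(risk_at$surv, 3))
print(risk_table)
```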

Comparing Survival by Group

We can estimate survival curves for groups.

Example: survival by sex

This produces two survival curves.
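
A sketch of the grouped fit, using the same `S_lung` object as before. In the lung data, sex is coded 1 = male and 2 = female. The colours, labels and the name `km_sex` are choices made for this example.

```r
library(survival)

lung$event <- ifelse(lung$status == 2, 1, 0)  # hypothetical 0/1 column
S_lung <- Surv(lung$time, lung$event)

# One Kaplan-Meier curve per level of sex.
km_sex <- survfit(S_lung ~ sex, data = lung)
print(km_sex)

plot(km_sex,
     col = c("blue", "red"),
     xlab = "Days",
     ylab = "Survival probability")
legend("topright",
       legend = c("sex = 1 (male)", "sex = 2 (female)"),
       col = c("blue", "red"),
       lty = 1)
```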

Interpreting Group Differences

When comparing curves visually we examine:

  • which curve declines faster
  • how early curves separate
  • overlap of confidence intervals

However, visual comparison alone is not sufficient.

A statistical test is needed to judge whether the observed difference is larger than chance variation would explain.

Next Step: Log-Rank Test

To formally compare survival curves we use:

Log-Rank Test

This test evaluates the null hypothesis that the survival functions are the same in all groups.

This will be discussed in the next lecture.

Key Takeaways

  • Survival curves estimate probability of remaining event-free over time
  • Surv() creates the survival object
  • survfit() estimates Kaplan–Meier curves
  • ggsurvplot() produces clear survival visualizations
  • Risk tables show how many subjects remain at risk, which helps judge how reliable the estimates are at later times

Understanding survival curves is the foundation for:

  • Log-Rank Test
  • Cox Proportional Hazards Model