Estimating the Survival Function

The Kaplan-Meier Estimator

The Kaplan-Meier (KM) estimator is the standard nonparametric method for estimating the survival function from a dataset. Formally, it is defined as

\[ \hat{S}(t) = \prod_{t_i \leq t} (1 - \hat{q}_i) = \prod_{t_i \leq t} \left(1 - \frac{d_i}{n_i}\right) \]

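where the \(t_i\) are the distinct times at which failures are observed, \(n_i\) is the number of subjects at risk just before time \(t_i\), \(d_i\) is the number of subjects who fail at time \(t_i\), and \(\hat{q}_i = d_i / n_i\) is the estimated probability of failing at \(t_i\) given survival up to that time.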
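As a quick illustration with made-up counts (assumed here purely for the arithmetic): suppose 6 subjects are at risk at the first failure time and 1 of them fails, and 5 are at risk at the second failure time and 1 of them fails. For any \(t\) from the second failure time up to, but not including, the third, the estimate is

\[ \hat{S}(t) = \left(1 - \frac{1}{6}\right)\left(1 - \frac{1}{5}\right) = \frac{5}{6} \cdot \frac{4}{5} = \frac{2}{3} \approx 0.667 \]

Censored subjects contribute no factor of their own to the product; they only reduce \(n_i\) at later failure times.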

Using Kaplan-Meier

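In R, we construct KM estimators using the survfit() function from the survival package.

Before we move on to our datasets, we start with a small toy dataset. Note that, despite its name, cens is coded as an event indicator: 1 means the failure was observed and 0 means the observation was censored, which is the coding Surv() expects. Censored times are printed with a trailing +.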

tt   <- c(7, 6, 6, 5, 2, 4)
cens <- c(0, 1, 0, 0, 1, 1)
grp  <- c(0, 0, 0, 1, 1, 1)

Surv(tt, cens)
## [1] 7+ 6  6+ 5+ 2  4
sample_tbl <- tibble(tt = tt, cens = cens, grp = grp)

example_km <- survfit(Surv(tt, cens) ~ 1, data = sample_tbl, conf.type = 'log-log')

plot(example_km)

The base plot() method is fine for a quick look, but the survminer package provides specialised survival plots built on ggplot2.

ggsurvplot(example_km)
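The toy data also carries a grp column that the fit above does not use. As a minimal sketch, and assuming the goal is simply to see one curve per group, the formula can be changed from ~ 1 to ~ grp. The names sample_df, grp_km and grp_plot are introduced here for illustration only, and a base data.frame is used so the sketch does not depend on tibble.

```r
library(survival)

# Same toy data as above, stored in a base data.frame
sample_df <- data.frame(
  tt   = c(7, 6, 6, 5, 2, 4),
  cens = c(0, 1, 0, 0, 1, 1),   # 1 = event observed, 0 = censored
  grp  = c(0, 0, 0, 1, 1, 1)
)

# One Kaplan-Meier curve per level of grp
grp_km <- survfit(Surv(tt, cens) ~ grp, data = sample_df, conf.type = "log-log")
print(grp_km)

# Plot both curves with confidence bands and a risk table,
# but only if survminer is installed
if (requireNamespace("survminer", quietly = TRUE)) {
  library(survminer)
  grp_plot <- ggsurvplot(grp_km, data = sample_df,
                         conf.int = TRUE, risk.table = TRUE)
  print(grp_plot)
}
```

With only three subjects per group the confidence intervals are very wide, so this is useful for learning the interface rather than for drawing conclusions.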

Printing out the ‘fitted’ object gives us some basic statistics:

example_km %>% print()
## Call: survfit(formula = Surv(tt, cens) ~ 1, data = sample_tbl, conf.type = "log-log")
## 
##      n events median 0.95LCL 0.95UCL
## [1,] 6      3      6       2      NA
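The median reported here is the earliest time at which the estimated survival curve drops to 0.5 or below. The sketch below recovers it in two ways for the toy data; the object names km, median_by_hand and median_quantile are illustrative, and the last line shows one way to ask print() for the restricted mean as well.

```r
library(survival)

tt   <- c(7, 6, 6, 5, 2, 4)
cens <- c(0, 1, 0, 0, 1, 1)   # 1 = event observed, 0 = censored
km   <- survfit(Surv(tt, cens) ~ 1, conf.type = "log-log")

# Median by hand: earliest time at which estimated survival is <= 0.5
median_by_hand <- min(km$time[km$surv <= 0.5])

# The same quantity from the quantile() method for survfit objects
median_quantile <- quantile(km, probs = 0.5)$quantile

print(median_by_hand)
print(median_quantile)

# Ask print() to report the restricted mean survival time too
print(km, print.rmean = TRUE)
```

Both approaches should agree with the median of 6 printed above. The mean is "restricted" because the curve never reaches zero in this data, so the area under it is only computed up to the largest observed time.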

We get more details from the summary() function:

example_km %>% summary()
## Call: survfit(formula = Surv(tt, cens) ~ 1, data = sample_tbl, conf.type = "log-log")
## 
##  time n.risk n.event survival std.err lower 95% CI upper 95% CI
##     2      6       1    0.833   0.152       0.2731        0.975
##     4      5       1    0.667   0.192       0.1946        0.904
##     6      3       1    0.444   0.222       0.0662        0.785
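The n.risk, n.event and survival columns above can be reproduced directly from the product formula. The following is a minimal base R sketch for the toy data (the helper names event_times, n_risk, n_event and km_hand are introduced here for illustration); it is not how survfit() is implemented internally, only a check of the arithmetic.

```r
# Toy data from the example above
tt   <- c(7, 6, 6, 5, 2, 4)
cens <- c(0, 1, 0, 0, 1, 1)   # 1 = event observed, 0 = censored

# Distinct times at which at least one failure was observed
event_times <- sort(unique(tt[cens == 1]))

# n_i: subjects still at risk just before each failure time
n_risk <- sapply(event_times, function(t) sum(tt >= t))

# d_i: failures observed at each failure time
n_event <- sapply(event_times, function(t) sum(tt == t & cens == 1))

# Kaplan-Meier estimate: running product of (1 - d_i / n_i)
km_hand <- data.frame(
  time     = event_times,
  n.risk   = n_risk,
  n.event  = n_event,
  survival = cumprod(1 - n_event / n_risk)
)
print(km_hand)
```

The censored observation at time 5 produces no row of its own; it only explains why n.risk falls from 5 to 3 between times 4 and 6.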

Exercises

  1. Construct the KM estimator for the telco churn data.
  2. What is the median survival time for this data?
  3. What is the mean survival time?
  4. Repeat the above for the other two datasets.