Estimating the Survival Function
The Kaplan-Meier Estimator
The Kaplan-Meier (KM) estimator is the standard nonparametric method for estimating the survival function from a given dataset. Formally, it is defined as follows:
\[ \hat{S}(t) = \prod_{t_i \leq t} (1 - \hat{q}_i) = \prod_{t_i \leq t} \left(1 - \frac{d_i}{n_i}\right) \]
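where the \(t_i\) are the distinct observed event times, \(n_i\) is the number of subjects at risk just before time \(t_i\), \(d_i\) is the number of subjects who fail at \(t_i\), and \(\hat{q}_i = d_i / n_i\) is the estimated probability of failing at \(t_i\) given survival up to that time.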
Using Kaplan-Meier
In R, we construct KM estimators using the survfit() function from the survival package.
Before we move on to our datasets, we start with a small set of data, printed here as a Surv object (a trailing + marks a censored time).
## [1] 7+ 6 6+ 5+ 2 4
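The vectors behind this printout are not shown. A minimal reconstruction, assuming the line above is the printed `Surv(tt, cens)` object, might look like the sketch below. The values are read off the printout; the `grp` column used later cannot be recovered from it and is not reconstructed here.

```r
library(survival)  # provides Surv()

# Values read off the printout above (an assumption, not the original code).
tt   <- c(7, 6, 6, 5, 2, 4)   # observed times
cens <- c(0, 1, 0, 0, 1, 1)   # 1 = event observed, 0 = censored (despite the name)

# Printing the Surv object marks censored times with a trailing "+"
toy_surv <- Surv(tt, cens)
toy_surv
```

Note that `Surv()` treats its second argument as an event indicator, so a value of 1 in `cens` means the failure was observed.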
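library(survival)  # Surv(), survfit()
library(tibble)    # tibble()
sample_tbl <- tibble(tt = tt, cens = cens, grp = grp)
# `~ 1` fits a single curve to all subjects; conf.type sets the confidence interval transformation
example_km <- survfit(Surv(tt, cens) ~ 1, data = sample_tbl, conf.type = 'log-log')
plot(example_km)  # base-graphics step plot of the estimated curve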
The basic plotting routines are worth trying, but the survminer package provides specialised plots built with ggplot2.
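As a rough illustration of the survminer route, the sketch below draws the same kind of curve with `ggsurvplot()`. It assumes survminer is installed and rebuilds the toy data from the printed Surv object (values read off that printout, with no `grp` column); the name `toy_df` is hypothetical.

```r
library(survival)
library(survminer)

# Toy data reconstructed from the printed Surv object (an assumption).
toy_df <- data.frame(
  tt   = c(7, 6, 6, 5, 2, 4),
  cens = c(0, 1, 0, 0, 1, 1)   # 1 = event observed, 0 = censored
)

toy_km <- survfit(Surv(tt, cens) ~ 1, data = toy_df, conf.type = "log-log")

# ggsurvplot() builds a ggplot2-based figure; risk.table adds the
# number-at-risk table underneath the curve.
km_plot <- ggsurvplot(toy_km, data = toy_df, conf.int = TRUE, risk.table = TRUE)
km_plot
```

Passing `data` explicitly avoids relying on `ggsurvplot()` locating the data frame from the fitted object.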
Printing out the ‘fitted’ object gives us some basic statistics:
## Call: survfit(formula = Surv(tt, cens) ~ 1, data = sample_tbl, conf.type = "log-log")
##
##       n events median 0.95LCL 0.95UCL
## [1,]  6      3      6       2      NA
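If the median is needed as a number rather than as printed text, one option is `quantile()` on the fitted object; the restricted mean can be shown by passing `print.rmean = TRUE` to `print()`. The sketch below assumes the toy data reconstructed from the printed Surv object, and `toy_df` and `toy_km` are hypothetical names.

```r
library(survival)

# Toy data reconstructed from the printed Surv object (an assumption).
toy_df <- data.frame(
  tt   = c(7, 6, 6, 5, 2, 4),
  cens = c(0, 1, 0, 0, 1, 1)   # 1 = event observed, 0 = censored
)
toy_km <- survfit(Surv(tt, cens) ~ 1, data = toy_df, conf.type = "log-log")

# Median survival time with its confidence limits, as a list
km_median <- quantile(toy_km, probs = 0.5)
km_median$quantile

# Adds the restricted mean survival time (area under the curve) to the printout
print(toy_km, print.rmean = TRUE)
```

Because the last observation is censored, the curve never reaches zero, so the mean reported here is restricted to the observed follow-up period.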
We get more details from the summary() function:
## Call: survfit(formula = Surv(tt, cens) ~ 1, data = sample_tbl, conf.type = "log-log")
##
##  time n.risk n.event survival std.err lower 95% CI upper 95% CI
##     2      6       1    0.833   0.152       0.2731        0.975
##     4      5       1    0.667   0.192       0.1946        0.904
##     6      3       1    0.444   0.222       0.0662        0.785
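To connect this table back to the product formula, the columns can be recomputed by hand in base R. The sketch below assumes the toy data read off the printed Surv object, uses Greenwood's formula for the standard error, and applies the log-log transformation for the interval; the variable names are illustrative.

```r
# Toy data reconstructed from the printed Surv object (an assumption).
tt   <- c(7, 6, 6, 5, 2, 4)
cens <- c(0, 1, 0, 0, 1, 1)   # 1 = event observed, 0 = censored

# Distinct event times t_i, with n_i at risk just before t_i and d_i events at t_i
event_times <- sort(unique(tt[cens == 1]))
n_i <- sapply(event_times, function(t) sum(tt >= t))
d_i <- sapply(event_times, function(t) sum(tt == t & cens == 1))

# Kaplan-Meier estimate: running product of (1 - d_i / n_i)
surv_hat <- cumprod(1 - d_i / n_i)

# Greenwood's formula for the standard error of the estimate
gw_sum  <- cumsum(d_i / (n_i * (n_i - d_i)))
std_err <- surv_hat * sqrt(gw_sum)

# Log-log interval: transform to log(-log(S)), add +/- z * se, transform back.
# Algebraically this equals S^exp(+/- z * se).
z         <- qnorm(0.975)
se_loglog <- sqrt(gw_sum) / abs(log(surv_hat))
lower     <- surv_hat^exp(z * se_loglog)
upper     <- surv_hat^exp(-z * se_loglog)

km_by_hand <- data.frame(time = event_times, n.risk = n_i, n.event = d_i,
                         survival = surv_hat, std.err = std_err,
                         lower = lower, upper = upper)
round(km_by_hand, 4)
```

For example, the first row is \(1 - 1/6 = 0.833\) and the second is \(0.833 \times (1 - 1/5) = 0.667\), matching the survival column above up to rounding.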
Exercises
- Construct the KM estimator for the telco churn data
- What is the median survival time for this data?
- What is the mean survival time?
- Repeat the above for the other two datasets.