Consider the following (hypothetical) data on the time to recovery from surgery (in weeks) for 12 patients after gall-bladder removal:
13, 18, 21+, 29, 29+, 52, 52, 75, 92+, 112+, 181, 230
(a) Using the above data (and assuming non-informative censoring as necessary), calculate the Kaplan-Meier estimate of the survival function, \( \hat{S}(t) \), by hand. Summarize your calculations in a table with columns for events (\( d_j \)), censoring (\( c_j \)), and risk sets (\( r_j \)) at each time \( t_j \).
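For reference, the Kaplan-Meier (product-limit) estimator tabulated below is \( \hat{S}(t) = \prod_{t_j \le t} \left(1 - \frac{d_j}{r_j}\right) \), where \( d_j \) is the number of events and \( r_j \) the number still at risk at time \( t_j \).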
## Time points
t_j <- c(13, 18, 21, 29, 29, 52, 52, 75, 92, 112, 181, 230)
## Number censored at each time point
c_j <- c( 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0)
## Number of events at each time point (1 if not censored, 0 if censored)
d_j <- as.numeric(c_j == 0)
## Construct a data frame
df1 <- data.frame(t_j = t_j, d_j = d_j, c_j = c_j)
df1
t_j d_j c_j
1 13 1 0
2 18 1 0
3 21 0 1
4 29 1 0
5 29 0 1
6 52 1 0
7 52 1 0
8 75 1 0
9 92 0 1
10 112 0 1
11 181 1 0
12 230 1 0
## Aggregate by time (counts at tied times are summed)
df1 <- aggregate(cbind(d_j, c_j) ~ t_j, data = df1, FUN = sum)
df1
t_j d_j c_j
1 13 1 0
2 18 1 0
3 21 0 1
4 29 1 1
5 52 2 0
6 75 1 0
7 92 0 1
8 112 0 1
9 181 1 0
10 230 1 0
## Add time 0
df1 <- rbind(c(0, 0, 0), df1)
df1
t_j d_j c_j
1 0 0 0
2 13 1 0
3 18 1 0
4 21 0 1
5 29 1 1
6 52 2 0
7 75 1 0
8 92 0 1
9 112 0 1
10 181 1 0
11 230 1 0
## Number at risk at time 0
initial.riskset <- length(t_j)
## Subtract cumulative decrease in the riskset (add d_j and c_j and lag by 1)
decrements <- with(df1, d_j + c_j)
decrements <- c(0, head(decrements, -1)) # Drop last one, and add 0 to beginning
df1$r_j <- initial.riskset - cumsum(decrements)
## Add survival in each interval
df1$S_j <- with(df1, 1 - (d_j / r_j))
df1
t_j d_j c_j r_j S_j
1 0 0 0 12 1.0000000
2 13 1 0 12 0.9166667
3 18 1 0 11 0.9090909
4 21 0 1 10 1.0000000
5 29 1 1 9 0.8888889
6 52 2 0 7 0.7142857
7 75 1 0 5 0.8000000
8 92 0 1 4 1.0000000
9 112 0 1 3 1.0000000
10 181 1 0 2 0.5000000
11 230 1 0 1 0.0000000
## Survival function = cumulative product of interval-specific survivals
df1$St <- cumprod(df1$S_j)
df1
t_j d_j c_j r_j S_j St
1 0 0 0 12 1.0000000 1.0000000
2 13 1 0 12 0.9166667 0.9166667
3 18 1 0 11 0.9090909 0.8333333
4 21 0 1 10 1.0000000 0.8333333
5 29 1 1 9 0.8888889 0.7407407
6 52 2 0 7 0.7142857 0.5291005
7 75 1 0 5 0.8000000 0.4232804
8 92 0 1 4 1.0000000 0.4232804
9 112 0 1 3 1.0000000 0.4232804
10 181 1 0 2 0.5000000 0.2116402
11 230 1 0 1 0.0000000 0.0000000
(b) Repeat the above estimation of \( \hat{S}(t) \) using any software you choose. Also calculate (by hand or computer) pointwise 95% confidence intervals for \( \hat{S}(t) \) using at least 2 of the three approaches (linear, log-survival, or log-log survival). Do any of the approaches you used result in lower or upper confidence bounds outside the [0,1] interval?
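One common formulation of these intervals (the convention assumed here; texts differ slightly in notation) is built on Greenwood's variance estimate,
\( \widehat{\mathrm{Var}}[\hat{S}(t)] = \hat{S}(t)^2\,\hat{\sigma}^2(t), \qquad \hat{\sigma}^2(t) = \sum_{t_j \le t} \frac{d_j}{r_j (r_j - d_j)}, \)
giving: linear (plain) \( \hat{S}(t) \pm 1.96\,\hat{S}(t)\,\hat{\sigma}(t) \); log-survival \( \hat{S}(t)\exp\{\pm 1.96\,\hat{\sigma}(t)\} \); and log-log survival \( \hat{S}(t)^{\exp\{\pm 1.96\,\hat{\sigma}(t)/\log\hat{S}(t)\}} \), the last of which always stays inside (0, 1).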
## Create a data frame
df2 <- data.frame(t_j = t_j, c_j = c_j, d_j = d_j)
## Create a survival object
library(survival)
df2$SurvObj <- with(df2, Surv(time = t_j, event = d_j))
df2
t_j c_j d_j SurvObj
1 13 0 1 13
2 18 0 1 18
3 21 1 0 21+
4 29 0 1 29
5 29 1 0 29+
6 52 0 1 52
7 52 0 1 52
8 75 0 1 75
9 92 1 0 92+
10 112 1 0 112+
11 181 0 1 181
12 230 0 1 230
## Performing survival analysis with different confidence interval types
lst.survival <-
sapply(simplify = FALSE,
X = c("plain", "log", "log-log"),
function(CONFTYPE) {
survfit(SurvObj ~ 1, data = df2, conf.type = CONFTYPE)
})
lapply(lst.survival, summary)
$plain
Call: survfit(formula = SurvObj ~ 1, data = df2, conf.type = CONFTYPE)
time n.risk n.event survival std.err lower 95% CI upper 95% CI
13 12 1 0.917 0.0798 0.760 1.000
18 11 1 0.833 0.1076 0.622 1.000
29 9 1 0.741 0.1295 0.487 0.995
52 7 2 0.529 0.1567 0.222 0.836
75 5 1 0.423 0.1571 0.115 0.731
181 2 1 0.212 0.1690 0.000 0.543
230 1 1 0.000 NaN NaN NaN
$log
Call: survfit(formula = SurvObj ~ 1, data = df2, conf.type = CONFTYPE)
time n.risk n.event survival std.err lower 95% CI upper 95% CI
13 12 1 0.917 0.0798 0.7729 1.000
18 11 1 0.833 0.1076 0.6470 1.000
29 9 1 0.741 0.1295 0.5259 1.000
52 7 2 0.529 0.1567 0.2961 0.945
75 5 1 0.423 0.1571 0.2045 0.876
181 2 1 0.212 0.1690 0.0442 1.000
230 1 1 0.000 NaN NA NA
$`log-log`
Call: survfit(formula = SurvObj ~ 1, data = df2, conf.type = CONFTYPE)
time n.risk n.event survival std.err lower 95% CI upper 95% CI
13 12 1 0.917 0.0798 0.5390 0.988
18 11 1 0.833 0.1076 0.4817 0.956
29 9 1 0.741 0.1295 0.3907 0.909
52 7 2 0.529 0.1567 0.2051 0.774
75 5 1 0.423 0.1571 0.1349 0.691
181 2 1 0.212 0.1690 0.0142 0.567
230 1 1 0.000 NaN NA NA
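As a cross-check, here is a minimal by-hand calculation of the plain (linear) intervals from Greenwood's formula, reusing the df1 table built in part (a). The raw bounds are left unclipped, which is what shows this approach escaping the [0,1] interval (the upper bounds at t = 13 and 18 exceed 1; survfit truncates them to 1.000 above).
## Event-time rows only; rows with d_j = 0 contribute nothing to the Greenwood sum,
## and the last time (where r_j = d_j) is dropped to avoid division by zero
evt <- df1[df1$d_j > 0 & df1$r_j > df1$d_j, ]
gw.se <- evt$St * sqrt(cumsum(evt$d_j / (evt$r_j * (evt$r_j - evt$d_j))))
cbind(time = evt$t_j, surv = evt$St, se = gw.se,
      lower = evt$St - 1.96 * gw.se, upper = evt$St + 1.96 * gw.se)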
(c) Plot the estimated survival function \( \hat{S}(t) \) and pointwise 95% confidence intervals, either by hand or using any statistical software package.
## rms for survplot
library(rms)
## 3 plots horizontally lined up
layout(matrix(1:3, ncol = 3))
## plyr for l_ply()
library(plyr)
l_ply(names(lst.survival),
function(CONFTYPE) {
survplot(lst.survival[[CONFTYPE]], xlab = "")
title(paste("Confidence interval by", CONFTYPE, "method"))
})
(d) Provide the estimated median survival, along with the estimated 25th and 75th percentiles. Indicate where these percentiles fall on your KM plot from (c) by drawing horizontal lines. What are the actual KM survival estimates corresponding to each of the estimated percentiles?
## Plot survival curve
survplot(lst.survival$`log-log`)
## 25th percentile: smallest t with S(t) <= 0.75 is t = 29; KM estimate there is 74.1%
abline(h = 0.75, lty = 3)
abline(v = 29, lty = 2)
## Median: smallest t with S(t) <= 0.50 is t = 75; KM estimate there is 42.3%
abline(h = 0.50, lty = 3)
abline(v = 75, lty = 2)
## 75th percentile: smallest t with S(t) <= 0.25 is t = 181; KM estimate there is 21.2%
abline(h = 0.25, lty = 3)
abline(v = 181, lty = 2)
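The same percentiles can also be read directly off the survfit object (a sketch; quantile.survfit reports, for each probability p, the first time at which \( \hat{S}(t) \) falls to 1 - p, together with confidence limits).
## Percentiles of the recovery-time distribution from the KM fit
quantile(lst.survival[["log-log"]], probs = c(0.25, 0.50, 0.75))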
(e) Calculate the estimated cumulative hazard rate, \( \hat{\Lambda}(t) \), at each time t using the Kaplan-Meier survival estimate from part (a) or (b) as the basis.
\( \hat{\Lambda}(t) = -\log\{\hat{S}(t)\} \)
## Calculate the estimated cumulative hazard (Ht) from the KM survival estimate
df1$Ht <- - log(df1$St)
df1
t_j d_j c_j r_j S_j St Ht
1 0 0 0 12 1.0000000 1.0000000 0.00000000
2 13 1 0 12 0.9166667 0.9166667 0.08701138
3 18 1 0 11 0.9090909 0.8333333 0.18232156
4 21 0 1 10 1.0000000 0.8333333 0.18232156
5 29 1 1 9 0.8888889 0.7407407 0.30010459
6 52 2 0 7 0.7142857 0.5291005 0.63657683
7 75 1 0 5 0.8000000 0.4232804 0.85972038
8 92 0 1 4 1.0000000 0.4232804 0.85972038
9 112 0 1 3 1.0000000 0.4232804 0.85972038
10 181 1 0 2 0.5000000 0.2116402 1.55286756
11 230 1 0 1 0.0000000 0.0000000 Inf
(f) Calculate the estimated cumulative hazard, \( \hat{\Lambda}(t) \), using the Nelson-Aalen estimator.
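The Nelson-Aalen estimator sums the discrete hazard increments over the event times up to t: \( \hat{\Lambda}_{NA}(t) = \sum_{t_j \le t} \frac{d_j}{r_j} \).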
## Discrete hazard increment at each time: d_j / r_j
df1$h_j <- with(df1, d_j / r_j)
## Nelson-Aalen estimator
df1$Ht_NA <- cumsum(df1$h_j)
df1
t_j d_j c_j r_j S_j St Ht h_j Ht_NA
1 0 0 0 12 1.0000000 1.0000000 0.00000000 0.00000000 0.00000000
2 13 1 0 12 0.9166667 0.9166667 0.08701138 0.08333333 0.08333333
3 18 1 0 11 0.9090909 0.8333333 0.18232156 0.09090909 0.17424242
4 21 0 1 10 1.0000000 0.8333333 0.18232156 0.00000000 0.17424242
5 29 1 1 9 0.8888889 0.7407407 0.30010459 0.11111111 0.28535354
6 52 2 0 7 0.7142857 0.5291005 0.63657683 0.28571429 0.57106782
7 75 1 0 5 0.8000000 0.4232804 0.85972038 0.20000000 0.77106782
8 92 0 1 4 1.0000000 0.4232804 0.85972038 0.00000000 0.77106782
9 112 0 1 3 1.0000000 0.4232804 0.85972038 0.00000000 0.77106782
10 181 1 0 2 0.5000000 0.2116402 1.55286756 0.50000000 1.27106782
11 230 1 0 1 0.0000000 0.0000000 Inf 1.00000000 2.27106782
(g) Plot (i) the estimated cumulative hazard \( \hat{\Lambda}(t) \) vs t and (ii) the estimated log cumulative hazard \( \log\hat{\Lambda}(t) \) vs \( \log(t) \) and use these plots to comment on the appropriateness of the Exponential and Weibull models for this data.
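Rationale for the two plots (under the usual parameterization): an Exponential model has \( \Lambda(t) = \lambda t \), so \( \hat{\Lambda}(t) \) vs t should look roughly linear through the origin, while a Weibull model has \( \Lambda(t) = (\lambda t)^p \), so \( \log\hat{\Lambda}(t) = p\log t + p\log\lambda \) should look roughly linear in \( \log t \) with slope p (slope 1 recovering the Exponential as a special case).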
## Create a step function. Remove first element from x.
Ht_NA_step <- with(df1, stepfun(x = t_j[-1], y = Ht_NA))
## 1 x 2 layout
layout(matrix(1:2, ncol = 2))
## Plot on natural scales
plot(Ht_NA_step, xlim = c(0, 230), main = "Cumulative hazard (NA method)", pch = "", xlab = "Week", ylab = "H(t)")
## Plot on natural log scales
plot(Ht_NA_step, xlim = c(13, 230), ylim = c(0.08, 2.5), log = "xy", pch = "", main = "log scale", xlab = "Week", ylab = "H(t)")
(h) Using the results from part (f), calculate the alternative Fleming-Harrington estimator of the survival function, \( \hat{S}_{FH}(t) \) and comment on its agreement with the Kaplan-Meier estimate from parts (a)/(b).
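The Fleming-Harrington estimator exponentiates the negative Nelson-Aalen cumulative hazard: \( \hat{S}_{FH}(t) = \exp\{-\hat{\Lambda}_{NA}(t)\} \).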
## Calculate Fleming-Harrington estimator
df1$St_FH <- exp(- df1$Ht_NA)
df1
t_j d_j c_j r_j S_j St Ht h_j Ht_NA St_FH
1 0 0 0 12 1.0000000 1.0000000 0.00000000 0.00000000 0.00000000 1.0000000
2 13 1 0 12 0.9166667 0.9166667 0.08701138 0.08333333 0.08333333 0.9200444
3 18 1 0 11 0.9090909 0.8333333 0.18232156 0.09090909 0.17424242 0.8400932
4 21 0 1 10 1.0000000 0.8333333 0.18232156 0.00000000 0.17424242 0.8400932
5 29 1 1 9 0.8888889 0.7407407 0.30010459 0.11111111 0.28535354 0.7517484
6 52 2 0 7 0.7142857 0.5291005 0.63657683 0.28571429 0.57106782 0.5649219
7 75 1 0 5 0.8000000 0.4232804 0.85972038 0.20000000 0.77106782 0.4625189
8 92 0 1 4 1.0000000 0.4232804 0.85972038 0.00000000 0.77106782 0.4625189
9 112 0 1 3 1.0000000 0.4232804 0.85972038 0.00000000 0.77106782 0.4625189
10 181 1 0 2 0.5000000 0.2116402 1.55286756 0.50000000 1.27106782 0.2805319
11 230 1 0 1 0.0000000 0.0000000 Inf 1.00000000 2.27106782 0.1032019
## 1 x 2 layout
layout(matrix(1:2, ncol = 2))
## Kaplan-Meier plot
survplot(lst.survival[["log-log"]], xlab = "")
title("Kaplan-Meier with log-log 95% CI")
## Estimate FH survival
res.FH <- survfit(Surv(t_j, d_j) ~ 1, data = df2, type = "fleming-harrington", conf.type = "log-log")
## Plot FH
survplot(res.FH, xlab = "")
title("Fleming-Harrington with log-log 95% CI")
Now group the data from Question #1 into six-month intervals. (note that the data above is given in terms of weeks, so 6-month intervals are 0-26, 26-52, 52-78, etc).
(a) Using the grouped data, calculate the actuarial estimate of the survival function for surgery recovery (by hand or by computer).
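The actuarial (life-table) estimator used below treats subjects censored within an interval as at risk for half the interval, via a modified risk set \( r'_j = r_j - \tfrac{1}{2}c_j \) (r2_j in the code), and estimates survival to the start of interval j as \( \hat{S}_j = \prod_{i<j} \left(1 - \frac{d_i}{r'_i}\right) \).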
## Construct table
df3 <- data.frame(t_j = t_j, d_j = d_j, c_j = c_j)
df3
t_j d_j c_j
1 13 1 0
2 18 1 0
3 21 0 1
4 29 1 0
5 29 0 1
6 52 1 0
7 52 1 0
8 75 1 0
9 92 0 1
10 112 0 1
11 181 1 0
12 230 1 0
## Add interval indicator
df3$interval <- cut(df3$t_j, breaks = c(0:9 * 26), right = FALSE)
table(df3$interval)
[0,26) [26,52) [52,78) [78,104) [104,130) [130,156) [156,182) [182,208) [208,234)
3 2 3 1 1 0 1 0 1
df3
t_j d_j c_j interval
1 13 1 0 [0,26)
2 18 1 0 [0,26)
3 21 0 1 [0,26)
4 29 1 0 [26,52)
5 29 0 1 [26,52)
6 52 1 0 [52,78)
7 52 1 0 [52,78)
8 75 1 0 [52,78)
9 92 0 1 [78,104)
10 112 0 1 [104,130)
11 181 1 0 [156,182)
12 230 1 0 [208,234)
## Aggregate by intervals
library(plyr)
df4 <- ddply(.drop = F, df3, "interval", function(DF) {
res <- colSums(DF[,c("d_j", "c_j")])
res
})
df4
interval d_j c_j
1 [0,26) 2 1
2 [26,52) 1 1
3 [52,78) 3 0
4 [78,104) 0 1
5 [104,130) 0 1
6 [130,156) 0 0
7 [156,182) 1 0
8 [182,208) 0 0
9 [208,234) 1 0
## Size of initial riskset
initial.riskset <- length(t_j)
initial.riskset
[1] 12
## Subtract cumulative decrease in the riskset (add d_j and c_j and lag by 1)
decrements <- with(df4, d_j + c_j)
decrements <- c(0, head(decrements, -1)) # Drop last one, and add 0 to beginning
df4$r_j <- initial.riskset - cumsum(decrements)
df4
interval d_j c_j r_j
1 [0,26) 2 1 12
2 [26,52) 1 1 9
3 [52,78) 3 0 7
4 [78,104) 0 1 4
5 [104,130) 0 1 3
6 [130,156) 0 0 2
7 [156,182) 1 0 2
8 [182,208) 0 0 1
9 [208,234) 1 0 1
## Calculate size of the modified riskset by subtracting 0.5 * (number censored)
df4$r2_j <- with(df4, r_j - 0.5 * c_j)
df4
interval d_j c_j r_j r2_j
1 [0,26) 2 1 12 11.5
2 [26,52) 1 1 9 8.5
3 [52,78) 3 0 7 7.0
4 [78,104) 0 1 4 3.5
5 [104,130) 0 1 3 2.5
6 [130,156) 0 0 2 2.0
7 [156,182) 1 0 2 2.0
8 [182,208) 0 0 1 1.0
9 [208,234) 1 0 1 1.0
## Interval-specific survival
df4$S_j <- with(df4, 1 - (d_j / r2_j))
## Cumulative product of survivals
df4$St_LT <- cumprod(df4$S_j)
## The actuarial estimate reports survival to the *start* of each interval, so shift the cumulative product down one row (survival at the start of the first interval is 1)
df4$St_LT <- head(c(1,df4$St_LT), -1)
df4
interval d_j c_j r_j r2_j S_j St_LT
1 [0,26) 2 1 12 11.5 0.8260870 1.0000000
2 [26,52) 1 1 9 8.5 0.8823529 0.8260870
3 [52,78) 3 0 7 7.0 0.5714286 0.7289003
4 [78,104) 0 1 4 3.5 1.0000000 0.4165144
5 [104,130) 0 1 3 2.5 1.0000000 0.4165144
6 [130,156) 0 0 2 2.0 1.0000000 0.4165144
7 [156,182) 1 0 2 2.0 0.5000000 0.4165144
8 [182,208) 0 0 1 1.0 1.0000000 0.2082572
9 [208,234) 1 0 1 1.0 0.0000000 0.2082572
Checking the result with KMsurv::lifetab(): note that the survival reported for the first interval is the survival at its start, which is 100%.
## Calculation by KMsurv::lifetab()
library(KMsurv)
with(df4,
lifetab(tis = 0:9 * 26,
ninit = r_j[1],
nlost = c_j,
nevent = d_j)
)
nsubs nlost nrisk nevent surv pdf hazard se.surv se.pdf se.hazard
0-26 12 1 11.5 2 1.0000000 0.006688963 0.007326007 0.0000000 0.004298894 0.005156723
26-52 9 1 8.5 1 0.8260870 0.003737950 0.004807692 0.1117712 0.003547430 0.004798293
52-78 7 0 7.0 3 0.7289003 0.012014839 0.020979021 0.1343886 0.005692411 0.011653085
78-104 4 1 3.5 0 0.4165144 0.000000000 0.000000000 0.1564763 NaN NaN
104-130 3 1 2.5 0 0.4165144 0.000000000 0.000000000 0.1564763 NaN NaN
130-156 2 0 2.0 0 0.4165144 0.000000000 0.000000000 0.1564763 NaN NaN
156-182 2 0 2.0 1 0.4165144 0.008009893 0.025641026 0.1564763 0.006413598 0.024174591
182-208 1 0 1.0 0 0.2082572 0.000000000 0.000000000 0.1667535 NaN NaN
208-234 1 0 1.0 1 0.2082572 NA NA 0.1667535 NA NA
## Create step function
res.step.survival <- stepfun(x = 1:8 * 26, y = df4$St_LT)
## Plot
plot(res.step.survival, pch = "", main = "Survival function by actuarial method", xlab = "Week")
(b) Calculate the estimated hazard function at the midpoint of each time interval and plot.
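The midpoint hazard estimate assumed here divides the events in each interval by the approximate person-time at risk in that interval: \( \hat{h}(t_{mj}) = \frac{d_j}{b_j\,(r'_j - d_j/2)} \), with interval width \( b_j = 26 \) weeks.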
## Hazard at the interval midpoint: (# of deaths) / ((modified risk set - 0.5 * # of deaths) * interval width)
df4$h_j <- with(df4, d_j / ((r2_j - d_j/2) * 26))
## Create stepfunction omit 0 for X
res.step.hazard <- stepfun(x = 1:8 * 26, y = df4$h_j)
## Plot
plot(res.step.hazard, pch = "", main = "Hazard function by actuarial method", xlab = "Week")
(c) What can you say about the hazard for surgery recovery over time? Does an exponential model seem appropriate for this data?
(a) Among the 12 patients with gall-bladder removal in question #1, 4 were censored. This means that their time to recovery was not observed. What if it were determined that 2 of these subjects (those censored at 29 and 92 weeks) had actually died, soon after they were lost to follow-up. How would this information impact the assumption of non-informative censoring? How do you think it might bias the estimated survival distribution?
(b) Read the articles posted on the class account by Brinkhof et al (PLOS One, 2010) and by Braitstein et al (JAIDS 2011). What was the concern of the authors regarding the assumption of non-informative censoring in their mortality analysis? What steps did they take to attempt to address or correct for non-informative censoring?
(a) Using the data in question #1, calculate the maximum likelihood estimator of the parameter λ from an exponential distribution, \( f(t) = λe^{−λt} \).
\( \hat{\lambda} \) = (# of events) / (total person-time)
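This follows from the censored exponential likelihood: with event indicators \( \delta_i \), \( D = \sum_i \delta_i \) events, and total follow-up \( \sum_i t_i \) (censored times included),
\( L(\lambda) = \prod_i \lambda^{\delta_i} e^{-\lambda t_i} = \lambda^{D} e^{-\lambda \sum_i t_i}, \qquad \frac{\partial \log L}{\partial \lambda} = \frac{D}{\lambda} - \sum_i t_i = 0 \;\Rightarrow\; \hat{\lambda} = \frac{D}{\sum_i t_i}. \)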
## MLE: number of events divided by total follow-up time (censored times count toward the denominator)
lambda <- sum(df2$d_j) / sum(df2$t_j)
lambda
[1] 0.008849558
(b) Using this parameter estimate, estimate the following probabilities [Note: if you don’t get part (a), use \( \hat{\lambda} = 0.01 \)]:
(i) The mean survival time
\( E[T] = \int_0^\infty u f(u)\,du = \int_0^\infty S(u)\,du = \left[-\frac{e^{-\lambda u}}{\lambda}\right]_0^\infty = 0 - \left(-\frac{1}{\lambda}\right) = \frac{1}{\lambda} \)
## Mean survival
1/lambda
[1] 113
(ii) The one- and two-year survival probabilities: S(52) and S(104).
\( S(t) = e^{-H(t)} = e^{-λt} \)
## Define as a function
S <- function(t) exp(-lambda * t)
## Calculate survival
S(52)
[1] 0.6311719
S(104)
[1] 0.398378
(iii) The cumulative probabilities of recovery by one and two years (based on the CDF)
## F(52)
1 - S(52)
[1] 0.3688281
## F(104)
1 - S(104)
[1] 0.601622
(iv) Based on the exponential distribution with \( \hat{λ} \) as calculated in (a), calculate the probability of “surviving” 2 years (eg, in this case, not achieving recovery from surgery) given that one has “survived” for one year. How does this compare with the probability of surviving one year after surgery?
\( P[T \ge 104 \mid T \ge 52] = \frac{P[T \ge 104]}{P[T \ge 52]} = \frac{S(104)}{S(52)} = \frac{e^{-104\lambda}}{e^{-52\lambda}} = e^{-52\lambda} = S(52) \)
## 2-yr survival given 1-yr survival (1 year = 52 weeks, as in part (ii))
S(104) / S(52)
[1] 0.6311719
## 1-yr survival, for comparison
S(52)
[1] 0.6311719
## The two are identical: the exponential distribution is memoryless, so the conditional
## probability of "surviving" a second year equals the one-year survival probability.