BIO 223: Applied Survival Analysis HOMEWORK I in R

1. Kaplan-Meier Survival Estimate

Consider the following (hypothetical) data on the time to recovery from surgery (in weeks) for 12 patients after gall-bladder removal:

13, 18, 21+, 29, 29+, 52, 52, 75, 92+, 112+, 181, 230

(a) Using the above data (and assuming non-informative censoring as necessary), calculate the Kaplan-Meier estimate of the survival function, Sˆ(t), by hand. Summarize your calculations in a table with columns for events (dj), censoring (cj), and risk sets (rj) at each time tj.

## Time points
t_j <- c(13, 18, 21, 29, 29, 52, 52, 75, 92, 112, 181, 230)

## Number of censoring at each points
c_j <- c( 0,  0,  1,  0,  1,  0,  0,  0,  1,   1,   0,   0)

## Number of events at each time points (reverse 0 and 1)
d_j <- as.numeric(c_j == 0)

## Construct a data frame
df1  <- data.frame(t_j = t_j, d_j = d_j, c_j = c_j)
df1

   t_j d_j c_j
1   13   1   0
2   18   1   0
3   21   0   1
4   29   1   0
5   29   0   1
6   52   1   0
7   52   1   0
8   75   1   0
9   92   0   1
10 112   0   1
11 181   1   0
12 230   1   0


## Aggregate by time (ties are added up)
df1 <- aggregate(cbind(d_j, c_j) ~ t_j, data = df1, FUN = sum)
df1

   t_j d_j c_j
1   13   1   0
2   18   1   0
3   21   0   1
4   29   1   1
5   52   2   0
6   75   1   0
7   92   0   1
8  112   0   1
9  181   1   0
10 230   1   0


## Add time 0
df1 <- rbind(c(0, 0, 0), df1)
df1

   t_j d_j c_j
1    0   0   0
2   13   1   0
3   18   1   0
4   21   0   1
5   29   1   1
6   52   2   0
7   75   1   0
8   92   0   1
9  112   0   1
10 181   1   0
11 230   1   0


## Number at risk at time 0
initial.riskset <- length(t_j)

## Subtract cumulative decrease in the riskset (add d_j and c_j and lag by 1)
decrements <- with(df1, d_j + c_j)
decrements <- c(0, head(decrements, -1))  # Drop last one, and add 0 to beginning
df1$r_j <- initial.riskset - cumsum(decrements)

## Add survival in each interval
df1$S_j <- with(df1, 1 - (d_j / r_j))
df1

   t_j d_j c_j r_j       S_j
1    0   0   0  12 1.0000000
2   13   1   0  12 0.9166667
3   18   1   0  11 0.9090909
4   21   0   1  10 1.0000000
5   29   1   1   9 0.8888889
6   52   2   0   7 0.7142857
7   75   1   0   5 0.8000000
8   92   0   1   4 1.0000000
9  112   0   1   3 1.0000000
10 181   1   0   2 0.5000000
11 230   1   0   1 0.0000000


## Survival function = cumulative product of interval-specific survivals
df1$St <- cumprod(df1$S_j)
df1

   t_j d_j c_j r_j       S_j        St
1    0   0   0  12 1.0000000 1.0000000
2   13   1   0  12 0.9166667 0.9166667
3   18   1   0  11 0.9090909 0.8333333
4   21   0   1  10 1.0000000 0.8333333
5   29   1   1   9 0.8888889 0.7407407
6   52   2   0   7 0.7142857 0.5291005
7   75   1   0   5 0.8000000 0.4232804
8   92   0   1   4 1.0000000 0.4232804
9  112   0   1   3 1.0000000 0.4232804
10 181   1   0   2 0.5000000 0.2116402
11 230   1   0   1 0.0000000 0.0000000

(b) Repeat the above estimation of Sˆ(t) using any software you choose. Also calculate (by hand or computer) pointwise 95% confidence intervals for Sˆ(t) using at least 2 of the three approaches (linear, log-survival, or log-log survival). Do any of the approaches you used result in lower or upper confidence bounds outside the [0,1] interval?

## Create a data frame
df2  <- data.frame(t_j = t_j, c_j = c_j, d_j = d_j)

## Create a survival object
library(survival)
df2$SurvObj <- with(df2, Surv(time = t_j, event = d_j))
df2

   t_j c_j d_j SurvObj
1   13   0   1     13 
2   18   0   1     18 
3   21   1   0     21+
4   29   0   1     29 
5   29   1   0     29+
6   52   0   1     52 
7   52   0   1     52 
8   75   0   1     75 
9   92   1   0     92+
10 112   1   0    112+
11 181   0   1    181 
12 230   0   1    230


## Performing survival analysis with different confidence interval types
lst.survival <-
    sapply(simplify = FALSE,
           X = c("plain", "log", "log-log"),
           function(CONFTYPE) {

               survfit(SurvObj ~ 1, data = df2, conf.type = CONFTYPE)
           })

lapply(lst.survival, summary)

$plain
Call: survfit(formula = SurvObj ~ 1, data = df2, conf.type = CONFTYPE)

 time n.risk n.event survival std.err lower 95% CI upper 95% CI
   13     12       1    0.917  0.0798        0.760        1.000
   18     11       1    0.833  0.1076        0.622        1.000
   29      9       1    0.741  0.1295        0.487        0.995
   52      7       2    0.529  0.1567        0.222        0.836
   75      5       1    0.423  0.1571        0.115        0.731
  181      2       1    0.212  0.1690        0.000        0.543
  230      1       1    0.000     NaN          NaN          NaN

$log
Call: survfit(formula = SurvObj ~ 1, data = df2, conf.type = CONFTYPE)

 time n.risk n.event survival std.err lower 95% CI upper 95% CI
   13     12       1    0.917  0.0798       0.7729        1.000
   18     11       1    0.833  0.1076       0.6470        1.000
   29      9       1    0.741  0.1295       0.5259        1.000
   52      7       2    0.529  0.1567       0.2961        0.945
   75      5       1    0.423  0.1571       0.2045        0.876
  181      2       1    0.212  0.1690       0.0442        1.000
  230      1       1    0.000     NaN           NA           NA

$`log-log`
Call: survfit(formula = SurvObj ~ 1, data = df2, conf.type = CONFTYPE)

 time n.risk n.event survival std.err lower 95% CI upper 95% CI
   13     12       1    0.917  0.0798       0.5390        0.988
   18     11       1    0.833  0.1076       0.4817        0.956
   29      9       1    0.741  0.1295       0.3907        0.909
   52      7       2    0.529  0.1567       0.2051        0.774
   75      5       1    0.423  0.1571       0.1349        0.691
  181      2       1    0.212  0.1690       0.0142        0.567
  230      1       1    0.000     NaN           NA           NA

© Plot the estimated survival function Sˆ(t) and pointwise 95% confidence intervals, either by hand or using any statistical software package.

## rms for survplot
library(rms)

## 3 plots horizontally lined up
layout(matrix(1:3, ncol = 3))

## plyr for l_ply()
library(plyr)

l_ply(names(lst.survival),
      function(CONFTYPE) {

    survplot(lst.survival[[CONFTYPE]], xlab = "")

    title(paste("Confidence interval by ", CONFTYPE, "method"))
})

plot of chunk unnamed-chunk-4

(d) Provide the estimated median survival, along with the estimated 25th and 75th per- centiles. Indicate where these percentiles fall on your KM plot from © by drawing horizontal lines. What are the actual KM survival estimates corresponding to each of the estimated percentiles?

## Plot survival curve
survplot(lst.survival$`log-log`)

## 25th percentile (S(25th)<75%): Actual value 74.1%
abline(h = 0.75, lty = 3)
abline(v = 29, lty = 2)

## Median survival (S(50th)<50%): Actual value 42.3%
abline(h = 0.50, lty = 3)
abline(v = 75, lty = 2)

## 75th percentile (S(75th)<25%): Actual value 21.2%
abline(h = 0.25, lty = 3)
abline(v = 181, lty = 2)

plot of chunk unnamed-chunk-5

(e) Calculate the estimated cumulative hazard rate, Λˆ(t) at each time t using the Kaplan- Meier survival estimate from part (a) or (b) as the basis.

\( H(t) = - log\{S(t)\} \)

## Calculate the estimate hazard rate
df1$Ht <- - log(df1$St)
df1

   t_j d_j c_j r_j       S_j        St         Ht
1    0   0   0  12 1.0000000 1.0000000 0.00000000
2   13   1   0  12 0.9166667 0.9166667 0.08701138
3   18   1   0  11 0.9090909 0.8333333 0.18232156
4   21   0   1  10 1.0000000 0.8333333 0.18232156
5   29   1   1   9 0.8888889 0.7407407 0.30010459
6   52   2   0   7 0.7142857 0.5291005 0.63657683
7   75   1   0   5 0.8000000 0.4232804 0.85972038
8   92   0   1   4 1.0000000 0.4232804 0.85972038
9  112   0   1   3 1.0000000 0.4232804 0.85972038
10 181   1   0   2 0.5000000 0.2116402 1.55286756
11 230   1   0   1 0.0000000 0.0000000        Inf

(f) Calculate the estimated cumulative hazard, Λˆ(t), using the Nelson-Aalen estimator.

## Approximate interval-specific hazard by interval-specific risk
df1$h_j <- with(df1, d_j / r_j)

## Nelson-Aalen estimator
df1$Ht_NA <- cumsum(df1$h_j)
df1

   t_j d_j c_j r_j       S_j        St         Ht        h_j      Ht_NA
1    0   0   0  12 1.0000000 1.0000000 0.00000000 0.00000000 0.00000000
2   13   1   0  12 0.9166667 0.9166667 0.08701138 0.08333333 0.08333333
3   18   1   0  11 0.9090909 0.8333333 0.18232156 0.09090909 0.17424242
4   21   0   1  10 1.0000000 0.8333333 0.18232156 0.00000000 0.17424242
5   29   1   1   9 0.8888889 0.7407407 0.30010459 0.11111111 0.28535354
6   52   2   0   7 0.7142857 0.5291005 0.63657683 0.28571429 0.57106782
7   75   1   0   5 0.8000000 0.4232804 0.85972038 0.20000000 0.77106782
8   92   0   1   4 1.0000000 0.4232804 0.85972038 0.00000000 0.77106782
9  112   0   1   3 1.0000000 0.4232804 0.85972038 0.00000000 0.77106782
10 181   1   0   2 0.5000000 0.2116402 1.55286756 0.50000000 1.27106782
11 230   1   0   1 0.0000000 0.0000000        Inf 1.00000000 2.27106782

(g) Plot (i) the estimated cumulative hazard Λˆ(t) vs t and (ii) the estimated log cumulative hazard log Λˆ (t) vs log(t) and use these plots to comment on the appropriateness of the Exponential and Weibull models for this data.

## Create a step function. Remove first element from x.
Ht_NA_step <- with(df1, stepfun(x = t_j[-1], y = Ht_NA))

## 1 x 2 layout
layout(matrix(1:2, ncol = 2))

## Plot on natural scales
plot(Ht_NA_step, xlim = c(0, 230), main = "Cumulative hazard (NA method)", pch = "", xlab = "Week", ylab = "H(t)")

## Plot on natural log scales
plot(Ht_NA_step, xlim = c(13, 230), ylim = c(0.08, 2.5), log = "xy", pch = "", main = "log scale", xlab = "Week", ylab = "H(t)")

plot of chunk unnamed-chunk-8

(h) Using the results from part (f), calculate the alternative Fleming-Harrington estimator of the survival function, \( \hat{S}_{FH}(t) \) and comment on its agreement with the Kaplan-Meier estimate from parts (a)/(b).

## Calculate Fleming-Harrigton estimator
df1$St_FH <- exp(- df1$Ht_NA)
df1

   t_j d_j c_j r_j       S_j        St         Ht        h_j      Ht_NA     St_FH
1    0   0   0  12 1.0000000 1.0000000 0.00000000 0.00000000 0.00000000 1.0000000
2   13   1   0  12 0.9166667 0.9166667 0.08701138 0.08333333 0.08333333 0.9200444
3   18   1   0  11 0.9090909 0.8333333 0.18232156 0.09090909 0.17424242 0.8400932
4   21   0   1  10 1.0000000 0.8333333 0.18232156 0.00000000 0.17424242 0.8400932
5   29   1   1   9 0.8888889 0.7407407 0.30010459 0.11111111 0.28535354 0.7517484
6   52   2   0   7 0.7142857 0.5291005 0.63657683 0.28571429 0.57106782 0.5649219
7   75   1   0   5 0.8000000 0.4232804 0.85972038 0.20000000 0.77106782 0.4625189
8   92   0   1   4 1.0000000 0.4232804 0.85972038 0.00000000 0.77106782 0.4625189
9  112   0   1   3 1.0000000 0.4232804 0.85972038 0.00000000 0.77106782 0.4625189
10 181   1   0   2 0.5000000 0.2116402 1.55286756 0.50000000 1.27106782 0.2805319
11 230   1   0   1 0.0000000 0.0000000        Inf 1.00000000 2.27106782 0.1032019


## 1 x 2 layout
layout(matrix(1:2, ncol = 2))

## Kaplan-Meier plot
survplot(lst.survival[["log-log"]], xlab = "")
title("Kaplan-Meier with log-log 95% CI")

## Estimate FH survival
res.FH <- survfit(Surv(t_j, d_j) ~ 1, data = df2, type = "fleming-harrington", conf.type = "log-log")

## Plot FH
survplot(res.FH, xlab = "")
title("Fleming-Harrington with log-log 95% CI")

plot of chunk unnamed-chunk-9

2. Lifetable (Actuarial) Survival Estimate

Now group the data from Question #1 into six-month intervals. (note that the data above is given in terms of weeks, so 6-month intervals are 0-26, 26-52, 52-78, etc).

(a) Using the grouped data, calculate the actuarial estimate of the survival function for surgery recovery (by hand or by computer).

## Construct table
df3 <- data.frame(t_j = t_j, d_j = d_j, c_j = c_j)
df3

   t_j d_j c_j
1   13   1   0
2   18   1   0
3   21   0   1
4   29   1   0
5   29   0   1
6   52   1   0
7   52   1   0
8   75   1   0
9   92   0   1
10 112   0   1
11 181   1   0
12 230   1   0


## Add interval indicator
df3$interval <- cut(df2$t_j, breaks = c(0:9 * 26), right = F)
table(df3$interval)


   [0,26)   [26,52)   [52,78)  [78,104) [104,130) [130,156) [156,182) [182,208) [208,234) 
        3         2         3         1         1         0         1         0         1

df3

   t_j d_j c_j  interval
1   13   1   0    [0,26)
2   18   1   0    [0,26)
3   21   0   1    [0,26)
4   29   1   0   [26,52)
5   29   0   1   [26,52)
6   52   1   0   [52,78)
7   52   1   0   [52,78)
8   75   1   0   [52,78)
9   92   0   1  [78,104)
10 112   0   1 [104,130)
11 181   1   0 [156,182)
12 230   1   0 [208,234)


## Aggregate by intervals
library(plyr)
df4 <- ddply(.drop = F, df3, "interval", function(DF) {

    res <- colSums(DF[,c("d_j", "c_j")])
    res
})
df4

   interval d_j c_j
1    [0,26)   2   1
2   [26,52)   1   1
3   [52,78)   3   0
4  [78,104)   0   1
5 [104,130)   0   1
6 [130,156)   0   0
7 [156,182)   1   0
8 [182,208)   0   0
9 [208,234)   1   0


## Size of initial riskset
initial.riskset <- length(t_j)
initial.riskset

[1] 12


## Subtract cumulative decrease in the riskset (add d_j and c_j and lag by 1)
decrements <- with(df4, d_j + c_j)
decrements <- c(0, head(decrements, -1))  # Drop last one, and add 0 to beginning
df4$r_j <- initial.riskset - cumsum(decrements)
df4

   interval d_j c_j r_j
1    [0,26)   2   1  12
2   [26,52)   1   1   9
3   [52,78)   3   0   7
4  [78,104)   0   1   4
5 [104,130)   0   1   3
6 [130,156)   0   0   2
7 [156,182)   1   0   2
8 [182,208)   0   0   1
9 [208,234)   1   0   1


## Calculate size of the modified riskset by subtracting 0.5 * (number censored)
df4$r2_j <- with(df4, r_j - 0.5 * c_j)
df4

   interval d_j c_j r_j r2_j
1    [0,26)   2   1  12 11.5
2   [26,52)   1   1   9  8.5
3   [52,78)   3   0   7  7.0
4  [78,104)   0   1   4  3.5
5 [104,130)   0   1   3  2.5
6 [130,156)   0   0   2  2.0
7 [156,182)   1   0   2  2.0
8 [182,208)   0   0   1  1.0
9 [208,234)   1   0   1  1.0


## Interval-specific survival
df4$S_j <- with(df4, 1 - (d_j / r2_j))

## Cumulative product of survivals
df4$St_LT <- cumprod(df4$S_j)

## By definition, survival calculated within the first interval is shown in the second interval, etc
df4$St_LT <- head(c(1,df4$St_LT), -1)
df4

   interval d_j c_j r_j r2_j       S_j     St_LT
1    [0,26)   2   1  12 11.5 0.8260870 1.0000000
2   [26,52)   1   1   9  8.5 0.8823529 0.8260870
3   [52,78)   3   0   7  7.0 0.5714286 0.7289003
4  [78,104)   0   1   4  3.5 1.0000000 0.4165144
5 [104,130)   0   1   3  2.5 1.0000000 0.4165144
6 [130,156)   0   0   2  2.0 1.0000000 0.4165144
7 [156,182)   1   0   2  2.0 0.5000000 0.4165144
8 [182,208)   0   0   1  1.0 1.0000000 0.2082572
9 [208,234)   1   0   1  1.0 0.0000000 0.2082572

Using KMsurv::lifetab() The survival for the first segment has to be 100%

## Calculation by KMsurv::lifetab()
library(KMsurv)
with(df4,
     lifetab(tis    = 0:9 * 26,
             ninit  = r_j[1],
             nlost  = c_j,
             nevent = d_j)
     )

        nsubs nlost nrisk nevent      surv         pdf      hazard   se.surv      se.pdf   se.hazard
0-26       12     1  11.5      2 1.0000000 0.006688963 0.007326007 0.0000000 0.004298894 0.005156723
26-52       9     1   8.5      1 0.8260870 0.003737950 0.004807692 0.1117712 0.003547430 0.004798293
52-78       7     0   7.0      3 0.7289003 0.012014839 0.020979021 0.1343886 0.005692411 0.011653085
78-104      4     1   3.5      0 0.4165144 0.000000000 0.000000000 0.1564763         NaN         NaN
104-130     3     1   2.5      0 0.4165144 0.000000000 0.000000000 0.1564763         NaN         NaN
130-156     2     0   2.0      0 0.4165144 0.000000000 0.000000000 0.1564763         NaN         NaN
156-182     2     0   2.0      1 0.4165144 0.008009893 0.025641026 0.1564763 0.006413598 0.024174591
182-208     1     0   1.0      0 0.2082572 0.000000000 0.000000000 0.1667535         NaN         NaN
208-234     1     0   1.0      1 0.2082572          NA          NA 0.1667535          NA          NA

## Create step function
res.step.survival <- stepfun(x = 1:8 * 26, y = df4$St_LT)

## Plot
plot(res.step.survival, pch = "", main = "Survival function by actuarial method", xlab = "Week")

plot of chunk unnamed-chunk-12

(b) Calculate the estimated hazard function at the midpoint of each time interval and plot.

## Hazard is estimated by (# of death) / ((modified riskset - 0.5 * # of death) * interval width)
df4$h_j <- with(df4, d_j / ((r2_j - d_j/2) * 26))

## Create stepfunction omit 0 for X
res.step.hazard <- stepfun(x = 1:8 * 26, y = df4$h_j)

## Plot
plot(res.step.hazard, pch = "", main = "Hazard function by actuarial method", xlab = "Week")

plot of chunk unnamed-chunk-13

© What can you say about the hazard for surgery recovery over time? Does an exponen- tial model seem appropriate for this data?

3. Non-Informative and Informative Censoring**

(a) Among the 12 patients with gall-bladder removal in question #1, 4 were censored. This means that their time to recovery was not observed. What if it were determined that 2 of these subjects (those censored at 29 and 92 weeks) had actually died, soon after they were lost to follow-up. How would this information impact the assumption of non-informative censoring? How do you think it might bias the estimated survival distribution?

(b) Read the articles posted on the class account by Brinkhof et al (PLOS One, 2010) and by Braitstein et al (JAIDS 2011). What was the concern of the authors regarding the assumption of non-informative censoring in their mortality analysis? What steps did they take to attempt to address or correct for non-informative censoring?

4. Working with an Exponential Density

(a) Using the data in question #1, calculate the maximum likelihood estimator of the parameter λ from an exponential distribution, \( f(t) = λe^{−λt} \).

\( λ \) = (# of events) / (total person-time)

##
lambda <- sum(df2$d_j) / sum(df2$t_j)
lambda

[1] 0.008849558

(b) Using this parameter estimate, estimate the following probabilities [Note: if you don’t get part (a), use \( \hat(λ) \) = 0.01]:

(i) The mean survival time

mean(\( S(t) \)) = \( \int_0^\infty u f(u) du = \int_0^\infty S(u) du = [-e^{-λu}/λ]_0^\infty = (-e^{-λ\infty}/λ) - (-e^{-λ0}/λ) = 1/λ \)

## Mean survival
1/lambda

[1] 113

(ii) The one- and two-year survival probabilities: S(52) and S(104).

\( S(t) = e^{-H(t)} = e^{-λt} \)

## Define as a function
S <- function(t) exp(-lambda * t)

## Calculate survival
S(52)

[1] 0.6311719

S(104)

[1] 0.398378

(iii) The cumulative probabilities of recovery by one and two years (based on the CDF)

## F(52)
1 - S(52)

[1] 0.3688281

## F(105)
1 - S(104)

[1] 0.601622

(iv) Based on the exponential distribution with \( \hat{λ} \) as calculated in (a), calculate the probability of “surviving” 2 years (eg, in this case, not achieving recovery from surgery) given that one has “survived” for one year. How does this compare with the probability of surviving one year after surgery?

\( P[T≥2 | T≥1] = P[T≥2, T≥1] / P[T≥1] = P[T≥2] / P[T≥1] = S(2yr) / S(1yr) \)

## 2yr survival given 1yr survival
S(104) / S(54)

[1] 0.6424426


## 1yr survival
S(54)

[1] 0.620099