# Installing some useful packages
library(ggplot2)
library(tidyverse)
library(kableExtra)
library(ggpubr)   # Combine ggplot

Problem 1.

(a).Given the hazard function \(\lambda(t)=c\ , where\ c>0\),derive the survival function and the density function. Derive the median failure time for c = 2; 5 and 11.

\[ Given\ \lambda(t)=c\ , where\ c>0 \\ \Rightarrow\left\{ \begin{aligned} S(t) &= exp(-\Lambda(t))=exp(- \int_{0}^{t}cdu)=exp(-ct) \\ f(t) &= \lambda(t)S(t) = ce^{-ct}\ , where\ t,c>0 \\ \end{aligned} \right. \]
Let \(\eta_T\) be the median failure time,then \[ \frac{1}{2} = S(\eta_T) = exp(-c*\eta_T) \Rightarrow \eta_T=\frac{-1}{c}*ln \left( \frac{1}{2} \right) \] So, \[ \eta_T= \left\{ \begin{aligned} \frac{-1}{2}*ln \left( \frac{1}{2} \right) \approx 0.3466\quad &,c=2 \\ \frac{-1}{5}*ln \left( \frac{1}{2} \right) \approx 0.1386\quad &,c=5 \\ \frac{-1}{11}*ln \left( \frac{1}{2} \right) \approx 0.0630\quad &,c=11 \end{aligned} \right. \]

(b). Given the survival function \(S(t)=exp(\theta t^{\beta})\ ,where \beta>0\), derive the density function and the hazard function.

\[ Given\ S(t)=exp(-\theta t^{\beta})\ ,where \beta>0 \\ \Rightarrow\left\{ \begin{aligned} \lambda(t) &= -\frac{d\ lnS(t)}{dt} = -\frac{d\ (-\theta t^{\beta})}{dt}=\theta \beta t^{\beta-1} \\ f(t) &= \lambda(t)S(t) = \theta \beta t^{\beta-1}exp(-\theta t^{\beta})\ ,\ t,\beta >0 \\ \end{aligned} \right. \]

Note:
(a),(b)兩小題之 Survival function 如下所示(Assume,c = 2, \(\theta\) = 2, \(\beta\) = 2 )

S1 <- function(t){
  c <- 2
  exp(-c*t)
}

S2 <- function(t){
  theta <- 2
  b <- 2
  exp(-theta*t^(b))
}
g1 <- ggplot()+
  stat_function(aes(x = 0:5), fun = S1) +
  labs(title = expression(paste("(a),  ",
                              S(t) == e^{-2*t}  )),
       x = "t",
       y = "survival function")
g2 <- ggplot()+
  stat_function(aes(x = 0:5), fun = S2) +
  labs(title = expression(paste("(b),  ",
                              S(t) == e^{-2*t^{2}}  )),
       x = "t",
       y = "survival function")
ggarrange(g1, g2)

Problem 2.

The data below are the ordered survival times(in days) from data of diagnosis of 43 patients suffering from leukemia(?զ??f): 7, 47, 58, 74, 177, 232, 273, 285, 317, 429, 440, 445, 455, 468, 495, 497, 532, 571, 579, 581, 650, 702, 715, 779, 881, 900, 903, 968, 1077, 1109, 1314, 1334, 1367, 1534, 1712, 1784, 1877, 1886, 2045, 2056, 2260, 2429, 2509.

# ???פJ????
leukemia <- c(7, 47, 58, 74,  177, 232, 273, 285, 317, 429, 440, 445, 455, 468, 495, 497, 532, 571, 579, 581, 650, 702, 715, 779, 881, 900, 903, 968, 1077, 1109, 1314, 1334, 1367, 1534, 1712, 1784, 1877, 1886, 2045, 2056, 2260, 2429, 2509)

(a). Calculate the empirical survival function and graph it.
we know empirical survival function is \[\hat{S(t)}=\frac{ \sum_{i=1}^{n} I(t_i>t)}{n}\quad ,t >0\] ?h
\[ \left\{ \begin{aligned} & \hat{S}(6) = \frac{ \sum_{i=1}^{n} I(t_i>6)}{n} = \frac{43}{43} = 1 \\ & \hat{S}(7) = \frac{ \sum_{i=1}^{n} I(t_i>7)}{n} = \frac{42}{43} \approx 0.977 \\ & \quad \quad\vdots \\ & \hat{S}(2429)= \frac{ \sum_{i=1}^{n} I(t_i>2429)}{n} = \frac{1}{43} \approx 0.023 \\ & \hat{S}(2509)= \frac{ \sum_{i=1}^{n} I(t_i>2509)}{n} = \frac{0}{43} = 0 \end{aligned} \right. \] 其中,Survival function 如下所示

# 1.???Nempirical survival function ?g?U
esf.leukemia <- function(t){
  sf <- numeric(length(t))
  for(i in 1:length(t)){
    sf[i] <- sum ( leukemia > t[i] ) / 43
  }
  return(sf)
}
#  2.Graph 
xs <- seq(0, 2500)
ggplot(data = NULL, aes(x = xs, y = esf.leukemia(xs))) + 
  geom_step() +
  labs(x = "t", y = "", title = "Empirical Survival function")

Above this figure, the empirical survival functio is a step function.

(b). Estimate the 1-year, 2-year, 3-year, 4-year and 5-year survival rates by nonparametric methods.
According to the statement
\[ \left\{ \begin{aligned} & \hat{S}(1-year) = \frac{ \sum_{i=1}^{n} I(t_i>365)}{n} \approx 0.791 \\ & \hat{S}(2-year) = \frac{ \sum_{i=1}^{n} I(t_i>730)}{n} \approx 0.465 \\ & \hat{S}(3-year) = \frac{ \sum_{i=1}^{n} I(t_i>1095)}{n} \approx 0.326 \\ & \hat{S}(4-year)= \frac{ \sum_{i=1}^{n} I(t_i>1460)}{n} \approx 0.23 \\ & \hat{S}(5-year)= \frac{ \sum_{i=1}^{n} I(t_i>1825)}{n} \approx 0.163 \end{aligned} \right. \] 可整理成下述表格

	1-year	2-year	3-year	4-year	5-year
S(t)	0.791	0.465	0.326	0.233	0.163

其中,上述表格意思為(以t=1為例):
S(1)=0.791表存活時間超過一年的機率為0.791.

(c). Under nonparametric model, construct 95% C.I for the 1-year, 2-year, 3-year, 4-year and 5-year survival rates
由(a)小題知道,\(\hat{S(t)}=\frac{ \sum_{i=1}^{n} I(t_i>t)}{n}\quad ,t >0\)
\(Let\ B(t)= \sum_{i=1}^{n} I(t_i>t) \sim\ Bin(n,p=S(t)),then\)

\(E \left[ \hat{S(t)} \right]=\frac{1}{n}E\left[ B(t) \right]=\frac{1}{n}np=p=S(t)\)

\(Var\left[ \hat{S(t)}\right]= \frac{1}{n^2}Var\left[ B(t) \right]=\frac{1}{n^2}np(1-p)=\frac{p(1-p)}{n}=\frac{S(t)(1-S(t))}{n}\)
By, C.L.T
\[ \hat{S(t)}\xrightarrow{a}N \left( S(t),\frac{S(t)(1-S(t))}{n} \right) \] So, the 95% C.I for \(S(t)\) is \[ \left(\hat{S(t)} \pm 1.96 \sqrt{\frac{\hat{S(t)}(1-\hat{S(t)})}{n}}\ \right) \] 由上述推導可知,在 nonparametric model, 且資料為complete下,可逐點做信賴區間.

\[ The\ 95\%\ C.I\ for\ S(t)\ is \Rightarrow \left\{ \begin{aligned} & (0.669,\ 0.912) \ &, t = 1-year \\ & (0.316,\ 0.614) \ &, t = 2-year \\ & (0.186,\ 0.466) \ &, t = 3-year \\ & (0.106,\ 0.359) \ &, t = 4-year \\ & (0.052,\ 0.273) \ &, t = 5-year \\ \end{aligned} \right. \] 其中,上述表格意思為(以t=1為例):
( 0.669, 0.912 )表我們有95%信心說存活時間超過一年的機率會落在此區間中.

	1-year	2-year	3-year	4-year	5-year
95% lower bound	0.669	0.316	0.186	0.106	0.052
95% upper bound	0.912	0.614	0.466	0.359	0.273

(d). Estimate the median survival time.
Let \(\eta_T\) be the median failure time,then
\[ \frac{1}{2} = S(\eta_T) = \frac{\sum_{i=1}^{43} I(t_i>\eta_T)}{43} \Rightarrow \sum_{i=1}^{43} I(t_i>\eta_T) = \frac{43}{2}=21.5 \] 由上述式子知 \(\eta_T=X_{(\frac{43+1}{2}-1)}=X_{(21)}=650\)
其中, \(X_{(i)}\)表leukemia中第 \(i\) 筆資料(已由小到大排序)
\(\because S(X_{(21)})=\frac{\sum_{i=1}^{43} I(t_i>650)}{43} \approx 0.511\)
\(\therefore \eta_T=X_{(21)}=650\)

(e). Under the exponential model with the constant hazard \(\lambda\). calculate the MLE of \(\lambda\) and construct a 95% C.I for \(\lambda\).
Let \(T\) be the survival time, \(T\sim Exp(\lambda)\)

\[ \begin{align} L(\lambda) &=\prod_{i=1}^{n}f(t_i;\lambda)=\lambda^n*exp \left( -\lambda\sum_{i=1}^{n}t_i \right) \\ lnL(\lambda) &= n*ln(\lambda)-\lambda* \sum_{i=1}^{n}t_i \\ Let\ U(\lambda) = \frac{dlnL(\lambda)}{d\lambda} &= \frac{n}{\lambda}-\sum_{i=1}^{n}t_i = 0\quad,\hat{\lambda} = \frac{n}{\sum_{i=1}^{n}t_i} \\ \because \frac{d^2lnL(\lambda)}{d\lambda^2} <0 &\qquad \therefore \hat{\lambda}_{mle} = \frac{n}{\sum_{i=1}^{n}T_i} \end{align} \] \(and\) \[ CRLB(\lambda) = \frac{1}{nI_1(\lambda)} = \frac{1}{I_n(\lambda)} = \frac{\lambda^2}{n} \] \(Where\) \[ I_n(\lambda) = -E \bigg\{ \frac{d^2lnL(\lambda)}{d\lambda^2} \bigg\} = \frac{n}{\lambda^2} \] \(Thus\) \[ \hat{\lambda}_{mle} \xrightarrow{a} N \left(\lambda,CRLB(\lambda) \right) = N \left(\lambda,\frac{\lambda^2}{n} \right) \] \(Let\) \[ Z = \frac{\sqrt{n}*(\hat{\lambda}_{mle}-\lambda)}{{\lambda}} \xrightarrow{d} N(0,1) \quad, \hat{\lambda}_{mle}=\frac{1}{\bar{T}} \] \(By\ Weak\ Law\ of\ Large\ Number\ (WLLN)\) \[ \begin{align} &\bar{T} \xrightarrow{p} E(T)=\frac{1}{\lambda} \\ \Rightarrow\ &\hat{\lambda}_{mle}=\frac{1}{\bar{T}} \xrightarrow{p} \lambda \\ \Rightarrow\ &\frac{\hat{\lambda}_{mle}}{\lambda} \xrightarrow{p} 1 \end{align} \] \(So,\ by\ Slutsky\ theorem\) \[ Q = \frac{\sqrt{n}*(\hat{\lambda}_{mle}-\lambda)}{\hat{\lambda}_{mle}} =\frac{\sqrt{n}*(\hat{\lambda}_{mle}-\lambda)/\lambda}{\hat{\lambda}_{mle}/\lambda} \xrightarrow{d} N(0,1) \] \(Then\ the\ 95\%\ C.I\ for\ \lambda\ is\) \[ \begin{align} 0.95 &= P ( -1.96 \leq Q \leq 1.96 ) \\ &= P \left( -1.96 \leq \frac{\sqrt{n}*(\hat{\lambda}_{mle}-\lambda)}{\hat{\lambda}_{mle}} \leq 1.96 \right) \\ &= P \left( \lambda\in \left( \hat{\lambda}_{mle} \pm1.96*\frac{\hat{\lambda}_{mle}}{\sqrt{n}}\right) \right) \end{align} \]

(f). Do you think the exponential model is appropriate for the data? Explain.
計算各點之hazard function \[ \left\{ \begin{aligned} & \hat{\lambda}(7) = \frac{P(T=7)}{P(T \geq 7)} = \frac{1}{43} \\ & \hat{\lambda}(47) = \frac{P(T=47)}{P(T \geq 47)} = \frac{1}{42}\\ & \quad \quad \vdots \\ & \hat{\lambda}(2429) = \frac{P(T=2429)}{P(T \geq 2429)} = \frac{1}{2} \\ & \hat{\lambda}(2509) = \frac{P(T=2509)}{P(T \geq 2509)} = 1 \\ \end{aligned} \right. \] 由上述hazard function 知道,因為hazard function會隨著t而改變,不為一常數,故此資料不適合用exponential model.

Problem 3.

Show the inversion formula of survival function \(S(t)=P(T>t)\). Let \(m(t)=E(T-t|T>t)\) for a continuous nonnegative r.v T with finite mean. Show that \[ \begin{align} S(t) = \frac{m(0)}{m(t)}exp \left[ -\int_{0}^{t}\frac{1}{m(u)}du \right] \end{align} \] < Proof >
I. \[ \begin{align} m(t) &= E(T-t|T>t) = \int_{t}^{\infty}(w-t)d \left(P(T \leq w|T>t) \right) \\ &= \int_{t}^{\infty}(w-t)d \left( \frac{F(w)-F(t)}{S(t)} \right) \\ &= \int_{t}^{\infty}(w-t)d \left( \frac{-S(w)+S(t)}{S(t)} \right)\quad Where,\ S(t) = 1-F(t) \\ &= \frac{1}{S(t)} \int_{t}^{\infty}(w-t)d \left( -S(w) \right)\quad By\ Integration\ by\ parts \\ &= \frac{1}{S(t)} \int_{t}^{\infty} S(w)dw \\ \end{align} \] II.
\(Let\quad k(t) = \int_{t}^{\infty} S(w)dw = m(t)S(t) \ \Rightarrow k'(t) =-S(t)\)
III.
故由I,II知
\[ \begin{align} \int_{0}^{t}\frac{1}{m(u)}du &= {\color{Red}-} \int_{0}^{t} \frac{{\color{Red} {-S(u)} }}{ {\color{Red} {S(u)} }} \frac{1}{m(u)} du \\ &= - \int_{0}^{t} \frac{k'(u)}{k(u)} du \quad by\ (II) \\ &= - \int_{0}^{t} \frac{1}{k(u)} d \left( k(u) \right) \\ &= -ln \left( \frac{k(t)}{k(0)} \right) \\ &= -ln \left( \frac{m(t)S(t)}{m(0)S(0)} \right) \quad Where,S(0)=1 \\ \Rightarrow exp \left[ -\int_{0}^{t}\frac{1}{m(u)}du \right] &= \frac{m(t)S(t)}{m(0)} \\ \Rightarrow S(t) &= \frac{m(0)}{m(t)} exp \left[ -\int_{0}^{t}\frac{1}{m(u)}du \right]\ Q.E.D \end{align} \] Note:
\(m(t)=E(T-t|T>t)\) 表在時間點t下,平均剩餘時間.

Problem 4.

(Use of computing packages is not allowed)
The following data are the re-mission times of 42 patients with acute leukemia: 6MP vs placebo. Reference: Freireichet al.(1963) Blood, p699.
Placebo: 1, 1, 2, 2, 3, 4, 4, 5, 5, 8, 8, 8, 8, 11, 11, 12, 12, 15, 17, 22, 23
6MP:6, 6, 6, 7, 10, 13, 16, 22, 23, 6+, 9+, 10+, 11+, 17+, 19+, 20+, 25+, 32+, 32+, 34+, 35+.

# ???פJ????
Placebo <- c(1, 1, 2, 2, 3, 4, 4, 5, 5, 8, 8, 8, 8, 11, 11, 12, 12, 15, 17, 22, 23)

y <- c(6, 6, 6, 7, 10, 13, 16, 22, 23, 6, 9, 10, 11, 17, 19, 20, 25, 32, 32, 34, 35)
d <- c(rep(1, 9), rep(0, 12))   # d = 1 be uncensored 
MP.6 <- data.frame(y, d)

(a). Calculate both the Kaplan-Meier estimate and the empirical survival distribution on the basis of the placebo data. What is the relationship between the Kaplan-Meier estimate and the emoirical survival distribution from your calculation?
已知 Kaplan-Meier estimate 為
定義
\(y_{(j)}\)表在 uncensored data 中的第j個順序統計量
\(R(y_{(j)})=R_{(j)}=\{k\ ;y_k\geq y_{(j)} \}\)
\(d_{(j)}=\#\ of\ failure\ at\ y_{(j)}\) 表有幾個 \(_{(j)}\)
\(N_{(j)}=\#\ R(y_{(j)})\) 表Risk set 中之個數
則,Kaplan-Meier estimate 為
\[ \hat{S}(t)=\prod_{y_{(j)} \leq t} \left( 1-\frac{d_{(j)}}{N_{(j)}} \right) \] 可由上述定義得如下表格.且已知empirical survival function 為\(\hat{S(t)}=\frac{ \sum_{i=1}^{n} I(t_i>t)}{n}\quad ,t >0\)


y_(j)	1	2	3	4	5	8	11	12	15	17	22	23
d_(j)	2	2	1	2	2	4	2	2	1	1	1	1
N_(j)	21	19	17	16	14	12	8	6	4	3	2	1

經計算得 \[ \left\{ \begin{aligned} & \hat{S}_{KM}(1) = \left( 1-\frac{2}{21} \right) = \frac{19}{21} \approx 0.9048 \\ & \hat{S}_{KM}(2) = \frac{19}{21}\left( 1-\frac{2}{19} \right) = \frac{17}{21} \approx 0.8095 \\ & \quad \quad\vdots \\ & \hat{S}_{KM}(23) = 0 \\ \end{aligned} \right. \ , \left\{ \begin{aligned} & \hat{S}_{emp}(1) = \frac{ \sum_{i=1}^{n} I(t_i>1)}{21} = \frac{19}{21} \approx 0.9048 \\ & \hat{S}_{emp}(2) = \frac{ \sum_{i=1}^{n} I(t_i>2)}{21} = \frac{17}{21} \approx 0.8095 \\ & \quad \quad \vdots \\ & \hat{S}_{emp}(23) = \frac{ \sum_{i=1}^{n} I(t_i>23)}{21}=0 \\ \end{aligned} \right. \] 由上述計算可知,當資料全都沒有censored時,Kaplan-Meier estimator 會和用 empirical survival function估計的方法相同.

Note:
可將兩方法之Survival function 畫出

# 1. ?g?U Kaplan-Meier estimate ?????? 
KM.estimate.uncensored <- function(data, t){
  y <- unique(data)
  d <- numeric(length(y))       # Calculate d_(j)
  for(i in 1: length(y) ){
    d[i] <- sum( data == sort(y)[i] )
  }

  N <- numeric(length(y))       # Calculate N_(j)
  for(i in 1: length(y) ){
    N[i] <- sum(data >= sort(y)[i] ) 
  }
  
  y_i <- sort(y)[ sort(y) <= t]    # Calculate  S(t)
  z <- 1
  for(i in 1:length(y_i) ){
    z <- z*(1-d[i]/N[i])
  }
  return(z)
} 
# 2. Graph
# 2.1 Kaplan-Meier estimate
xs <- seq(1, 25)
y <- numeric( length(xs) )
for(i in 1: length(xs) ){
  y[i] <- KM.estimate.uncensored(Placebo, xs[i])
}

# 2.2 Empirical survival distribution
esf.Placebo <- function(t){
  sf <- numeric(length(t))
  for(i in 1:length(t)){
    sf[i] <- sum ( Placebo > t[i] ) / 21
  }
  return(sf)
}

g1 <- ggplot() +
  geom_step(aes(x = seq(0, 25), y = c(1, y))) + 
  labs(title = "Kaplan-Meier estimate", 
       x = "t", y = "Survival function")
g2 <- ggplot() + 
  geom_step(aes(x = seq(0, 25), esf.Placebo(seq(0 ,25)))) + 
  labs(title = "Empirical survival distribution", 
       x = "t", y = "Survival function")
ggarrange(g1, g2)

由上圖可看出,當資料全都沒有censored時,Kaplan-Meier estimator 會簡化成 empirical survival function. (\(S(t)\) 相同).

(b). Do you observe a similar phenomenon which also holds for other uncensored survival data? Explain.
有,像是在實驗室中觀測小白鼠吃藥的存活時間,因為此實驗均在我們可控制的情況下,故可確認資料不會有censored的情況產生,則在此情形下,用Kaplan-Meier estimator 估計 S(t) 會等於用 empirical survival function 去估計結果相同..

(c). Calculate the Kaplan-Meier estimate on the basis of the 6MP data.

可由Kaplan-Meier estimate 定義得如下表格


y_(j)	6	7	10	13	16	22	23
d_(j)	3	1	1	1	1	1	1
N_(j)	21	17	15	12	11	7	6

經計算得
\[ \left\{ \begin{aligned} & \hat{S}_{KM}(6) = \left( 1-\frac{3}{21} \right) = \frac{6}{7} \approx 0.8571 \\ & \hat{S}_{KM}(7) = \frac{6}{7}\left( 1-\frac{3}{21} \right) = \frac{96}{119} \approx 0.8067 \\ & \quad \quad\vdots \\ & \hat{S}_{KM}(23) \approx 0.4482 \\ \end{aligned} \right. \] 其中,Survival function 如下

# ?g?U Kaplan-Meier estimate ??????
K.M.estimate <- function(data, t){
  val <- which(data[,2]==0)
  data.uncensored <- data[-(val),]      # ?d?U?Ӫ????O uncensored
  y <- unique(data.uncensored[,1])
  
  d <- numeric(length(y))         # Calculate d_(j)
  for(i in 1: length(y) ){
    d[i] <- sum( data.uncensored[,1] == sort(y)[i] )
  }

  N <- numeric(length(y))         # Calculate N_(j)
  for(i in 1: length(y) ){
    N[i] <- sum(data[,1] >= sort(y)[i] ) 
  }
  
  y_i <- sort(y)[ sort(y) <= t]    # Calculate  S(t)
  z <- 1
  for(i in 1:length(y_i) ){
    z <- z*(1-d[i]/N[i])
  }
  return(z)
} 

# 2. Graph
xs <- seq(6, 23)
y <- numeric( length(xs) )
for(i in 1: length(xs) ){
  y[i] <- K.M.estimate(MP.6, xs[i])
}
ggplot() + 
  geom_step(aes(x = seq(1, 23), y = c(rep(1, 5), y))) +
  labs(title = "Survival function for 6MP", 
       x = "t", y = "")

因為最大的資料是censored,所以Kaplan-Meier estimate不會到0.即當 t > 23時,我們說\(S(t)\)是未確定的(undetermined).

(d). Use the variance estimates that you learn from class to produce a 95% C.I at each observed failure time for the empirical survival distribution on the basis of the placebo data. Comment on your results.
若用 empirical survival distribution 估計 survival function,則其95% C.I 為 (推導部分在Problem2-(c)) \[ \left(\hat{S(t)} \pm 1.96 \sqrt{\frac{\hat{S(t)}(1-\hat{S(t)})}{n}}\ \right) \] 可逐點做C.I,即 \[ The\ 95\%\ C.I\ for\ S(t)\ is\ \Rightarrow \left\{ \begin{aligned} & (0.779,\ 1) \ &, t = 1 \\ & (0.642,\ 0.977) \ &, t = 2 \\ & \quad \quad\vdots \\ & (0,\ 0.139) \ &, t = 22 \\ & (0,\ 0) \ &, t = 23 \\ \end{aligned} \right. \] 因為Placebo 的資料為一complete data,所以估計出來的\(\hat{S}(23)=0\)故最後一點的C.I為零是正常的. 其中上述解釋為(以t = 1為例)
( 0.7991, 1 )表我們有95%信心說存活時間超過一個單位的機率會落在此區間中 .

(e). Use the variance estimates that you learn form class to produce a 95% C.I, at each uncensored time, for the Kaplan-Meier estimate on the basis of the 6MP data. Comment on you results.
\(By\ Greenwood's\ formula,\ we\ can\ obtain\\) \[ The\ variance\ of\ the\ Kaplan-Meier\ estimate\ is\ \hat{Var}(\hat{S}(t)) \approx \left[ \hat{S}(t) \right]^2 \sum_{y_{(j) \leq t}} \frac{d_{(j)}}{N_{(j)}(N_{(j)}-d_{(j)})} \] \[ \Rightarrow \left\{ \begin{aligned} & \hat{Var}(\hat{S}(6)) = \left[ \hat{S}(6) \right]^2* \sum_{y_{(j) \leq 6}} \frac{d_{(j)}}{N_{(j)}(N_{(j)}-d_{(j)})} = \left( \frac{6}{7} \right)^2* \left(\frac{3}{21(21-3)} \right) = \frac{2}{343} \approx 0.0058 \\ & \hat{Var}(\hat{S}(7)) = \left[ \hat{S}(7) \right]^2* \sum_{y_{(j) \leq 7}} \frac{d_{(j)}}{N_{(j)}(N_{(j)}-d_{(j)})} = \left( \frac{96}{119} \right)^2* \left( \frac{3}{21(21-3)} + \frac{1}{17(17-1)} \right) \approx 0.0076 \\ & \hat{Var}(\hat{S}(10)) = (0.753)^2 * \left( \frac{3}{21(21-3)} + \frac{1}{17(17-1)}+ \frac{1}{15(15-1)} \right) \approx 0.0093 \\ & \quad \quad\vdots \\ & \hat{Var}(\hat{S}(23)) = (0.4482)^2* \left( \frac{3}{21(21-3)} + \frac{1}{17(17-1)}+\cdots+ \frac{1}{6(6-1)} \right) \approx 0.0181 \\ \end{aligned} \right. \] 得上述survival function 各點之變異數.
接著建構其95%C.I,公式如下..
\[ \left(\hat{S(t)} \pm 1.96 \sqrt{\frac{\hat{S(t)}(1-\hat{S(t)})}{n}}\ \right)\quad,\forall t\ is\ continuous \] \[ The\ 95\%\ C.I\ for\ S(t)\ is \Rightarrow \left\{ \begin{aligned} (0.8247,\ 0.8896) \ &, t = 6 \\ (0.7709,\ 0.8426) \ &, t = 7 \\ (0.7119,\ 0.7941) \ &, t = 10 \\ \quad \quad \quad \vdots \\ (0.3912,\ 0.5052) \ &, t = 23 \\ \end{aligned} \right. \] 其中,上述表格意思為(以t=6為例):
( 0.8247, 0.8896 )表我們有95%信心說存活時間超過6個單位的機率會落在此區間中.

(f). Compare the survival function estimates from the placebo and 6MP groups. Interpret your results.
將兩個group之survival function 畫出

# 2. Graph survival function for 6MP
xs <- seq(6, 23)
y <- numeric( length(xs) )
for(i in 1: length(xs) ){
  y[i] <- K.M.estimate(MP.6, xs[i])
}
ggplot() +
  geom_step(aes(x = seq(1, 23), 
                y = c(rep(1, 5), y), 
                colour = "6MP")) +
  geom_step(aes(x = seq(0, 25), 
                y = esf.Placebo(seq(0, 25)),
                colour = "Placebo")) +
  labs(title = "Survival function", x = "t", y = "") +
  scale_colour_discrete(name = "Groups")

因為6MP的資料為一right censoring data 且最大的觀測值剛好是censored的,故估計出來的survival function不會逼近0;而Placebo的資料為一complete data,所以估計出來的survival function會是完好的一條曲線.

Problem 5.

Simulate survival data and compute the Nelson-Aalen estimator.

(a). Simulate 100 event times with constant hazard 0.9 (using rexp()). Also simulate 100 censoring times with U[0,5]. Next, compute the observed times and an event indicator. How many observed events/censorings do you have?

# Simulate data n = 100
set.seed(100)
n <- 100
t <- rexp(n, rate = 0.9)   # lambda = 0.9
c <- runif(n, min = 0, max = 5)
min.function <- function(t, c){
  z <- numeric(length(t))
  for(i in 1:length(t)){
    z[i] <- min(t[i], c[i])
  }
  return(z)
}
y <- min.function(t, c)
d <- ifelse(t < c, 1, 0)
data.problem5 <- data.frame(y,d)
sum(data.problem5[,2]==1)

## [1] 78

由上述程式模擬知道,在\(T\sim exp(\lambda=0.9)\ and\ C\sim U(0,5)\)下,生成100筆資料,且令\(Y=min(T,C)\ ,\delta=I(T<C)\)下,根據程式模擬結果,會有78筆資料是complete,22筆資料是censord.
其中,模擬的資料呈現為

(b). Compute Nelson-Aalen (cumulative hazard).
We know, Nelson-Aalen estimate is \[ \begin{align} \hat{\Lambda}_{NA}(t)=\sum_{y_{j} \leq t}\frac{d_{(j)}}{N_{(j)}}\quad ,?䤤,d_{(j)},N_{(j)}?w?q?PKaplan-Meier\ estimate \end{align} \] 故可計算各點之Nelson-Aalen estimate(cumulative hazard) \[ \left\{ \begin{aligned} & \hat{\Lambda}_{NA}(0.5)=\sum_{y_{j} \leq 0.5}\frac{d_{(j)}}{N_{(j)}} \approx 0.4054 \\ & \hat{\Lambda}_{NA}(1)=\sum_{y_{j} \leq 1}\frac{d_{(j)}}{N_{(j)}} \approx 0.864 \\ & \hat{\Lambda}_{NA}(1.5)=\sum_{y_{j} \leq 1.5 }\frac{d_{(j)}}{N_{(j)}} \approx 1.52 \\ & \quad \quad \quad \vdots \\ & \hat{\Lambda}_{NA}(3)=\sum_{y_{j} \leq 3 }\frac{d_{(j)}}{N_{(j)}} \approx 2.27 \\ \end{aligned} \right. \] 可將cumulative hazard function畫出,如下

N.A.estimate <- function(data, t){
  val <- which(data[,2]==0)
  data.uncensored <- data[-(val),]      # ?d?U?Ӫ????O uncensored
  y <- unique(data.uncensored[,1])
  y
  d <- numeric(length(y))         # Calculate d_(j)
  for(i in 1: length(y) ){
    d[i] <- sum( data.uncensored[,1] == sort(y)[i] )
  }

  N <- numeric(length(y))         # Calculate N_(j)
  for(i in 1: length(y) ){
    N[i] <- sum(data[,1] >= sort(y)[i] ) 
  }
  
  y_i <- sort(y)[ sort(y) <= t]    # Calculate  harzar function
  z <- 0
  for(i in 1:length(y_i) ){
    z <- z+(d[i]/N[i])
  }
  return(z)
} 
xs <- seq(0.1, 4, 0.1)
y <- numeric(length(xs))
for(i in 1: length(xs)){
  y[i] <- N.A.estimate(data.problem5, xs[i]) 
}
ggplot() +
  geom_line(aes(x = seq(0, 4, 0.1), 
                y = c(0, y))) +
  labs(title = "Cumulative hazard function for Nelson-Aalen",
       x = "t", y = "")

(c). Plot the Nelson-Aalen estimator. Compare it to the true cumulative hazard using curve. For comparison, you may want to restrict plotting to time regions where there is more data. E.g. restrict to the true median of the survival distribution.
\[ \begin{align} T \sim Exp(\lambda=0.9) \Rightarrow \lambda(t)=0.9 \Rightarrow \Lambda(t)=\int_{0}^{t} 0.9du=0.9t \end{align} \]

故真實的cumulative hazard function ,和用Nelson-Aalon estimate估計的cumulative hazard function如下

g1 <- ggplot() +
  geom_line(aes(x = xs, y = y, colour = "Nelson_Aalon")) +
  stat_function(aes(x = xs, 
                    colour = "True"), 
                fun = Hazard.function) + 
  labs(title = "n = 100",
       x = "t", y = "Cumulative hazard function") +
  scale_colour_manual(name = "", 
                      values = c(Nelson_Aalon = "black", 
                                 True = "red"))
g2 <- ggplot() +
  geom_line(aes(x = xs.1000, y = y.1000, 
                colour = "Nelson_Aalon")) +
  stat_function(aes(x = xs, 
                    colour = "True"), 
                fun = Hazard.function) + 
  labs(title = "n = 1000",
       x = "t", y = "Cumulative hazard function") +
  scale_colour_manual(name = "", 
                      values = c(Nelson_Aalon = "black", 
                                 True = "red"))
ggarrange(g1, g2)

由上圖比較可看出,當n = 100時,用Nelson-Aalon estimate估計的cumulative hazard function,只有當t <= 1時才會和 true cumulative hazard function 相近;但當n增加(n=1000)時,可看出估計得cumulative hazard function 逐漸和true cumulative hazard function 相近..

Problem 6.

(a). Reproduce simulation study in Updated Survival Notes, Second Chapter, Page30. (Attached Code)
依題意知, \[ T \sim Exp(\theta)\ , C \sim U(0,\beta)\quad ,where\ (\theta,\beta)= (0.1,5),(0.1,10) \]

# 1.Simulation data
# (theta , b) = (0.1, 10), (0.1, 5)
# n = 100
n <- 100
t <- rexp(n, rate = 0.1)   # lambda = 0.1
c.U5 <- runif(n, min = 0, max = 5)  
c.U10 <- runif(n, min = 0, max = 10)

y.U5 <- min.function(t, c.U5)
d.U5 <- ifelse(t < c.U5, 1, 0)
data.p6.U5 <- data.frame(y.U5,d.U5)

y.U10 <- min.function(t, c.U10)
d.U10 <- ifelse(t < c.U10, 1, 0)
data.p6.U10 <- data.frame(y.U10,d.U10)

# Analysis n = 100
y.U5 <- with(data.p6.U5, Surv(y.U5, d.U5))  # use Surv() to build the standard survival object
y.U5.fit <- survfit(Surv(y.U5, d.U5) ~ 1, 
                    data = data.p6.U5, conf.type = "none")
y.U10 <- with(data.p6.U10, Surv(y.U10, d.U10))
y.U10.fit <- survfit(Surv(y.U10, d.U10) ~ 1, 
                     data = data.p6.U10, conf.type = "none")
#########################################################
# n = 10000
n <- 10000
t.1 <- rexp(n, rate = 0.1)   # lambda = 0.1
c.U5.1 <- runif(n, min = 0, max = 5)  
c.U10.1 <- runif(n, min = 0, max = 10)

y.U5.1 <- min.function(t.1, c.U5.1)
d.U5.1 <- ifelse(t.1 < c.U5.1, 1, 0)
data.p6.U5.1 <- data.frame(y.U5.1, d.U5.1)

y.U10.1 <- min.function(t.1, c.U10.1)
d.U10.1 <- ifelse(t.1 < c.U10.1, 1, 0)
data.p6.U10.1 <- data.frame(y.U10.1, d.U10.1)

# Analysis n = 10000
y.U5.1 <- with(data.p6.U5.1, Surv(y.U5.1, d.U5.1))  
y.U5.1.fit <- survfit(Surv(y.U5.1, d.U5.1) ~ 1, 
                      data = data.p6.U5.1, conf.type = "none")
y.U10.1 <- with(data.p6.U10.1, Surv(y.U10.1, d.U10.1))  
y.U10.1.fit <- survfit(Surv(y.U10.1, d.U10.1) ~ 1, 
                       data = data.p6.U10.1, conf.type = "none")

# 2. Graph 
par(mfrow = c(1, 2))
# n = 100
plot(y.U5.fit, 
     xlab = "t", ylab = "Survival function", main = 'n = 100',
     xlim = c(0,12), ylim = c(0,1),
     col = 2, lty = 1)
par(new = T)  
plot(y.U10.fit, 
     xlab = "t", lty = 2, col = 4,
     xlim = c(0,12), ylim = c(0,1))
curve(exp(-0.1*x), 0 ,12, add = TRUE, lty = 5, col = 1)
legend("topright", legend = c("KM1", "KM2", "TRUE"),
       col = c(2,4,1), lty = c(1,2,5))

# n = 10000
plot(y.U5.1.fit, 
     xlab = "t", ylab = "Survival function", main = 'n = 10000',
     xlim = c(0,12), ylim = c(0,1),
     col = 2, lty = 1)
par(new = T)  
plot(y.U10.1.fit, 
     xlab = "t", lty = 2, col = 4,
     xlim = c(0,12), ylim = c(0,1))
curve(exp(-0.1*x), 0 ,12, add = TRUE, lty = 5, col = 1)
legend("topright", legend = c("KM1", "KM2", "TRUE"),
       col = c(2,4,1), lty = c(1,2,5))

且上述定義KM1為\((\ ( \theta,\beta)=(0.1,5) \ )\) ,KM2為\((\ ( \theta,\beta)=(0.1,10) \ )\) 可發現當隨著n增加,在不同情況下(censoring distribution不同),用Kaplan-Meier estimate 估計的survival function 會越來約趨近真實的survival function..

(b). Regenerate 100 simulation studies in (a). Based on these 100 simulation results, plot the average of estimated survival curve and 95% C.I (Attached Code)
不失一般性,討論\((\theta,\beta)=(0.1,10)\)之情況
當n = 100時

# n = 100, (theta , b) = (0.1, 10)
data.surv <- matrix(NA, nrow = 100, ncol = 100)
surv.lower <- matrix(NA, nrow = 100, ncol = 100)
surv.upper <- matrix(NA, nrow = 100, ncol = 100)
for(i in 1:100){
  t <- rexp(100, rate = 0.1)   # lambda = 0.1
  c.U10 <- runif(100, min = 0, max = 10)  
  y.U10 <- min.function(t, c.U10)
  d.U10 <- ifelse(t < c.U10, 1, 0)
  data.p6.U10 <- data.frame(y.U10,d.U10)  # U10 data
  y.U10 <- with(data.p6.U10, Surv(y.U10, d.U10))
  y.U10.fit <- survfit(Surv(y.U10, d.U10) ~ 1, 
                       data = data.p6.U10, conf.type = "plain")
  data.surv[,i] <- y.U10.fit$surv
  surv.lower[,i] <- y.U10.fit$lower
  surv.upper[,i] <- y.U10.fit$upper
}

y.surv <- apply(data.surv, 1 , mean)
y.lower <- apply(surv.lower, 1 , mean)
y.upper <- apply(surv.upper, 1 , mean)

par(mfrow = c(1,2))
# Estimation of Survival function
plot(y.U10.fit$time, y.U10.fit$surv, type = "s",
     xlab = "t", ylab = "Survival function", main = 'n = 100',
     xlim = c(0,12), ylim = c(0,1),
     col = 1, lty = 1)
lines(y.U10.fit$time, y.U10.fit$lower, ,col = 2, lty = 2)
lines(y.U10.fit$time, y.U10.fit$upper, ,col = 2, lty = 2)
legend("topright", legend = c("S(t)","95% C.I"),
       col = c(1,2), lty = c(1,2))
# Average Survival function
plot(y.U10.fit$time, y.surv, type = "s",
     xlab = "t", ylab = "Average Survival curve", main = 'n = 100',
     xlim = c(0,12), ylim = c(0,1),
     col = 1,lty = 1)
lines(y.U10.fit$time, y.lower, ,col = 2, lty = 2)
lines(y.U10.fit$time, y.upper, ,col = 2, lty = 2)
legend("topright", legend = c("average","95% C.I"),
       col = c(1,2), lty = c(1,2))

由上述圖型可看出,average survival curve 和 95% C.I(average survival),皆較原本的survival function,和 95% C.I 平滑.

Survival_Analysis_HW1

林暐翔