Survival Analysis Lab

Summary : 다음을 R을 통하여 실습합니다. (1) Review (2) Survival function $S(t)$를 추정봅해시다. (Kaplan-Meier estimator, $\exp($-Nelson-Aalen estimator$)$) (3) Survival function $S(t)$의 confidence interval도 구해봅시다.

Review

생존분석에서는 survival function $S(t)=P(X >t)$에 대한 통계적 추론을 합니다. Chapter 2에서 Random variable $X$가 연속형인 경우 survival fucntion $S(t) = \exp\{-H(t)\}$, cumulative hazard function $H(t)=\int_{0}^{t}h(u)du$를 학습했습니다. ($F$, $S$, $H$의 관계를 확실히 알아둡시다!)

Chapter 3 에서는 censoring 개념이 소개되었고 counting process를 배웠습니다.(이 후에 더 정확히 배울지도..) Data with right censoring ($\{ (T_i, \delta_i) \}_{i=1, 2, \ldots, n}$)에서 다음과 같이 정의했습니다. \[ Y(t) := Y_i := \sum_{i=1}^{n} I(T_i \geq t) \\ dN(t) := d_i := \sum_{i=1}^{n} I(T_i = t, \delta_i = 1) \] 여기서 $Y(t)$는 the number of risks at time t, $dN(t)$는 the number of deaths at time t로 해석할 수 있겠습니다. 이제 이 counting process의 도구들로 Estimation을 시작해봅시다.

[REMARK] Counting process theory에서 통상적으로 $Y(t)$는 risk process $N(t)$는 event process 라고 하며 $N(t)$는 $\sum_{i=1}^{n} I(T_i \leq t, \delta_i = 1)$으로 정의합니다. 그래서 $dN(t)$를 해석하는 한 방법으로서 Riemann-Stieltjes 적분 $\int \cdots dN(t)$의 적분변수로 생각할 수 있습니다.

$S(t)$의 estimator

$S(t) = \exp\{-H(t)\}$의 관계에서 $S(t)$의 estimator 두 개(KM, NA)를 얻을 수 있습니다. Kaplan-Meier (product-limit) estimator for survival function and its estimated variance (Greenwood formula) : \[ \hat{S}(t) = \prod_{i:t_i \leq t} \left\{ 1 - \frac{d_i}{Y_i} \right\} = \prod_{u : u \leq t} \left\{ 1 - \frac{dN(u)}{Y(u)} \right\} \\ \widehat{\rm Var}[\hat{S}(t)] = \left[\hat{S}(t) \right]^2 \hat{\sigma}_{S}^2(t) := \left[\hat{S}(t) \right]^2 \sum_{i:t_i \leq t}\frac{d_i}{Y_i(Y_i - d_i)} \] Nelson-Aalen estimator for cumulative hazard function and its estimated variance : \[ \hat{H}(t) = \sum_{i:t_i \leq t} \frac{d_i}{Y_i} = \sum_{u : u \leq t} \frac{dN(u)}{Y(u)} \\ \widehat{\rm Var}[\hat{H}(t)] = \hat{\sigma}_H^2(t) := \sum_{i:t_i \leq t} \frac{d_i}{Y_i^2} = \sum_{u : u \leq t} \frac{dN(u)}{Y(u)^2} \]

pointwise confidence interval (CI)

강의노트에서 다음 4가지 방법을 확인할 수 있습니다. (아래에서 $\hat{S}(t)$는 KM estimator를, $\hat{H}(t)$는 NA estimator)

Nelson-Aalen + exponentiating CI for $H(x)$ : $\exp\{- \hat{H}(t) \pm z_{\alpha/2} \hat{\sigma}_{H}(t) \}\\$
Nelson-Aalen + delta-method : $e^{- \hat{H}(t)} \pm z_{\alpha/2} e^{-\hat{H}(t)} \hat{\sigma}_{H}(t) \\$
Kaplan-Meier + exponentiating : $\hat{S}(t) \cdot \exp \left\{ \pm z_{\alpha/2} \hat{\sigma}_{S}(t) \right\} \\$
Kaplan-Meier + Greenwood formula : $\hat{S}(t) \, \pm\ z_{\alpha/2} \hat{S}(t) \hat{\sigma}_{S}(t)$

오늘은 코드에서 (1)과 (4)를 구해봅시다. 참고로 R의 survfit()이 주는 CI의 기본값은 (3)이며 약간의 option 조정으로 (4)도 구할 수 있습니다.

Survival function $S(t)$

이제 생존분석을 위해서 survival 패키지를 설치하고 로드합시다.

#install.packages("survival") ;
#install.packages("KMsurv") ;
library(survival) ;
library(KMsurv) ;

이번 시간 사용할 데이터는 Textbook appendix D의 Bone Marrow Transplant Patients 입니다. 데이터의 변수설명과 데이터를 볼 수 있습니다. data(bmt)를 이용해서 데이터를 불러봅시다. 잘 불러왔는지 확인하기 위해 class(data.raw), names(data.raw) 그리고 head(data.raw) 등을 입력해봅시다.

data(bmt)
data.raw = bmt;
class(data.raw)
names(data.raw)
head(data.raw)

이 예제에서는 All group의 $T_2$(time2)를 시각 자료로, $\delta_3$(ind3)을 censoring indicator로 사용합니다.

data = data.raw[ ,c("group", "t2", "d3")]; 
data = data[data$group==1, ] ;
head(data)

Survival package에서 survfit()을 이용하여 Kaplan-Meier estimator 구해봅시다. 이 함수를 사용하기 위해서는 먼저 Surv()를 이용하여 자료를 ’survival object’형 변수로 변환합시다. http://cran.r-project.org/web/packages/survival/survival.pdf (공식설명서)에 Surv(), survfit.xxx()가 자세히 설명되어 있습니다.

my.surv = Surv(time=data$t2, event=data$d3) ;
my.fit = survfit(formula = my.surv ~ 1, data=my.surv) ;

output 확인을 위해 class(my.fit), names(my.fit)을 시도할 수 있습니다.

my.fit$time ;           ## 모든 timepoints, 중복 제거
my.fit$surv ;           ## 각 timepoint에서의 Kaplan-Meier estimator
plot(my.fit) ;          ## The whole curve of Kaplan-Meier estimator function

위에서 plot() 안에 survival object를 바로 대입해서 plotting을 할 수 있습니다. 그 외에 기존에 plot()에서 쓰던 옵션들(xlab=, main=…)을 사용할 수 있습니다. 더 자세한 정보는 http://cran.r-project.org/web/packages/survival/survival.pdf (공식 설명서)의 plot.survfit()을 참고!!

이 객체의 summary를 확인해봅시다.

my.fit.summ = summary(my.fit) ;
my.fit.summ
my.fit.summ$time     # t_i's : the distinct uncensored timepoints
my.fit.summ$surv     # the Kaplan-Meier estimate at each t_i's
my.fit.summ$n.risk   # Y_i's
my.fit.summ$n.event  # d_i's
my.fit.summ$std.err  # standard error of the K-M estimate at t_i
my.fit.summ$lower    # lower bound of pointwise CI bound

my.fit과 my.fit.summ이 담고 있는 정보에는 다소 차이가 있습니다. my.fit$time, my.fit$surv 등을 입력하여 my.fit.summ과 비교해보세요. (무엇이 차이일까요?)

REMARK std.err에는 Greenwood formula 기반의 계산식 $\hat{S}(t_i) \hat{\sigma}_{S}(t_i)$ 들이 저장되어 있습니다. upper/lower에는 (3) 방식의 계산을 디폴트로 합니다. survfit() 함수의 option 중 일부를 변경하여 (4) 방식의 pointwise CI를 저장하게 할 수도 있습니다. 물론 std.err 계산값들은 모두 똑같습니다. :

my.fit = survfit(formula = my.surv ~ 1, data=my.surv, conf.type="plain") ; # (4)
my.fit = survfit(formula = my.surv ~ 1, data=my.surv, conf.type="log") ; # (3), 기본값

REMARK $\hat{S}(t)$가 저장된 방식을 다시 주목해봅시다. KM의 정의나 위 코드의 여러 결과를 구경하여 알겠지만 uncensored 관측시간이 있는 timepoint $t_i$들 에서만 $\hat{S}(t)$에 jump가 일어납니다. 나머지 부분에서는 flat하므로 $(t_i, \hat{S}(t_i))$만 저장해도 되겠습니다. NA의 정의를 보면 $\hat{H}(t)$도 같은 방식으로 저장해봅시다.

Kaplan-Meier estimator vs. Nelson-Aalen estimator 이제 KM estimator $\hat{S}(t)$와 NA estimator로 추정한 survival function $\exp\{-\hat{H}(t)\}$를 비교해봅시다. 두 estimator가 asymptotically 동일하긴 하나 finite sample에서는 아주 약간의 차이가 있는데 이를 직접 확인해봅시다. NA estimator는 survfit()의 결과물로부터 직접 만들어야 합니다.

obs.time = my.fit.summ$time ; # t_i
n.risk = my.fit.summ$n.risk ; # Y_i
n.event = my.fit.summ$n.event ; # d_i
KM.surv = my.fit.summ$surv ;    #\hat{S}(t_i)

정의대로 NA.cumhzd $\hat{H}(t_i)$를 계산해봅시다.(NA.surv가 $\exp\{-\hat{H}(t_i)\}$들을 저장하고 있습니다.)

incr = n.event / n.risk ; # d_i / y_i
NA.cumhzd = NULL ; # initialize
for (i in 1:length(obs.time)) NA.cumhzd[i] = sum(incr[1:i]) ;
NA.surv = exp(-NA.cumhzd) ;

plot을 그려 비교해봅시다.

windows(width=6, height=6) ;
plot(c(0, max(obs.time)+10), c(0,1), type="n", xlab="Time",
       ylab="Survival probability",
       main="Comparing two estimators for survival function") ;
points(obs.time, KM.surv, type="s", col="red", lty=1) ; 
points(obs.time, NA.surv, type="s", col="blue", lty=1) ; 
legend("bottomright", col=c("red", "blue"), lty=c(1, 1), 
       c("K-M estimator", "exp(-{N-A esetimator})")) ;

Pointwise confidence intervals

먼저 $\hat{\sigma}_{S}(t) = \left( \sum_{i:t_i \leq t}\frac{d_i}{Y_i(Y_i - d_i)} \right)^{1/2}$와 $\hat{\sigma}_{H}(t) = \left( \sum_{i:t_i \leq t} \frac{d_i}{Y_i^2} \right)^{1/2}$를 계산해야합니다. 각각 var.GW, var.NA에 계산하도록 합시다.

incr.GW = n.event / n.risk / (n.risk - n.event) ;
var.GW = NULL ;
for (i in 1:length(obs.time)) var.GW[i] = sum(incr.GW[1:i]) ;
incr.NA = n.event / n.risk^2 ;
var.NA = NULL ;
for (i in 1:length(obs.time)) var.NA[i] = sum(incr.NA[1:i]) ;

REMARK survfit()은 $\hat{S}(t)$의 standard error의 기본값으로 Greenwood formula를 사용합니다. 따라서 다음 둘의 결과가 같습니다.

sqrt(var.GW * KM.surv^2)
my.fit.summ$std.err

이제 (1)을 계산해봅시다. $t_i$ 위에서의 95% upper-bound와 lower-bound에 계산하면 됩니다.(아래 코드 up.NA와 lo.NA)

gamma = qnorm(0.975) ;
std.NA = sqrt(var.NA) ;
up.NA = exp(- NA.cumhzd + gamma*std.NA) ;
lo.NA = exp(- NA.cumhzd - gamma*std.NA) ;

(4)의 계산도 비슷하게 한다.

stderr.GW = KM.surv * sqrt(var.GW) ;
up.GW = KM.surv + gamma*stderr.GW ;
lo.GW = KM.surv - gamma*stderr.GW ;

끝으로, plotting으로 차이를 확인해봅시다.

windows(width=6, height=6) ;
plot(c(0, max(obs.time)+10), c(0,1), type="n", xlab="Time",
       ylab="Survival probability",
       main="Comparing two estimators for survival function") ;
points(obs.time, KM.surv, type="s", col="red", lty=1) ; 
points(obs.time, up.GW, type="s", col="red", lty=2) ; 
points(obs.time, lo.GW, type="s", col="red", lty=2) ; 
points(obs.time, NA.surv, type="s", col="blue", lty=1) ; 
points(obs.time, up.NA, type="s", col="blue", lty=2) ; 
points(obs.time, lo.NA, type="s", col="blue", lty=2) ; 
legend("bottomright", col=c("red", "blue"), lty=c(1, 1), 
       c("K-M estimator", "exp(-{N-A esetimator})")) ;

Disclaimer : 이 자료는 최영근 조교님의 2013년 가을 Survival anlysis 실습 수업자료를 바탕으로 만들어졌습니다.

Survival Analysis Lab_2

yongchan Kwon

Review

Survival function \(S(t)\)

Pointwise confidence intervals