Survival analysis, lab 2

Summary : 다음을 R을 통하여 실습합니다. (1) Review (2) Survival function $S(t)$를 추정봅해시다. (Kaplan-Meier estimator, $\exp($-Nelson-Aalen estimator$)$) (3) Survival function $S(t)$의 confidence interval도 구해봅시다.

Review

생존분석에서는 survival function $S(t)=P(X >t)$에 대한 통계적 추론을 합니다. Chapter 2에서 Random variable $X$가 연속형인 경우 survival fucntion $S(t) = \exp\{-H(t)\}$, cumulative hazard function $H(t)=\int_{0}^{t}h(u)du$를 학습했습니다. ($F$, $S$, $H$의 관계를 확실히 알아둡시다!)

Chapter 3 에서는 censoring 개념이 소개되었고 counting process를 배웠습니다.(이 후에 더 정확히 배울지도..) Data with right censoring ($\{ (T_i, \delta_i) \}_{i=1, 2, \ldots, n}$)에서 다음과 같이 정의했습니다. \[ Y(t) := Y_i := \sum_{i=1}^{n} I(T_i \geq t) \\ dN(t) := d_i := \sum_{i=1}^{n} I(T_i = t, \delta_i = 1) \] 여기서 $Y(t)$는 the number of risks at time t, $dN(t)$는 the number of deaths at time t로 해석할 수 있겠습니다. 이제 이 counting process의 도구들로 Estimation을 시작해봅시다.

[REMARK] Counting process theory에서 통상적으로 $Y(t)$는 risk process $N(t)$는 event process 라고 하며 $N(t)$는 $\sum_{i=1}^{n} I(T_i \leq t, \delta_i = 1)$으로 정의합니다. 그래서 $dN(t)$를 해석하는 한 방법으로서 Riemann-Stieltjes 적분 $\int \cdots dN(t)$의 적분변수로 생각할 수 있습니다.

$S(t)$의 estimator

$S(t) = \exp\{-H(t)\}$의 관계에서 $S(t)$의 estimator 두 개(KM, NA)를 얻을 수 있습니다. Kaplan-Meier (product-limit) estimator for survival function and its estimated variance (Greenwood formula) : \[ \hat{S}(t) = \prod_{i:t_i \leq t} \left\{ 1 - \frac{d_i}{Y_i} \right\} = \prod_{u : u \leq t} \left\{ 1 - \frac{dN(u)}{Y(u)} \right\} \\ \widehat{\rm Var}[\hat{S}(t)] = \left[\hat{S}(t) \right]^2 \hat{\sigma}_{S}^2(t) := \left[\hat{S}(t) \right]^2 \sum_{i:t_i \leq t}\frac{d_i}{Y_i(Y_i - d_i)} \] Nelson-Aalen estimator for cumulative hazard function and its estimated variance : \[ \hat{H}(t) = \sum_{i:t_i \leq t} \frac{d_i}{Y_i} = \sum_{u : u \leq t} \frac{dN(u)}{Y(u)} \\ \widehat{\rm Var}[\hat{H}(t)] = \hat{\sigma}_H^2(t) := \sum_{i:t_i \leq t} \frac{d_i}{Y_i^2} = \sum_{u : u \leq t} \frac{dN(u)}{Y(u)^2} \]

pointwise confidence interval (CI)

강의노트에서 다음 4가지 방법을 확인할 수 있습니다. (아래에서 $\hat{S}(t)$는 KM estimator를, $\hat{H}(t)$는 NA estimator)

Nelson-Aalen + exponentiating CI for $H(x)$ : $\exp\{- \hat{H}(t) \pm z_{\alpha/2} \hat{\sigma}_{H}(t) \}\\$
Nelson-Aalen + delta-method : $e^{- \hat{H}(t)} \pm z_{\alpha/2} e^{-\hat{H}(t)} \hat{\sigma}_{H}(t) \\$
Kaplan-Meier + exponentiating : $\hat{S}(t) \cdot \exp \left\{ \pm z_{\alpha/2} \hat{\sigma}_{S}(t) \right\} \\$
Kaplan-Meier + Greenwood formula : $\hat{S}(t) \, \pm\ z_{\alpha/2} \hat{S}(t) \hat{\sigma}_{S}(t)$

오늘은 코드에서 (1)과 (4)를 구해봅시다. 참고로 R의 survfit()이 주는 CI의 기본값은 (3)이며 약간의 option 조정으로 (4)도 구할 수 있습니다.

Survival function $S(t)$

이제 생존분석을 위해서 survival 패키지를 설치하고 로드합시다.

#install.packages("survival") ;
library(survival) ;

## Loading required package: splines

working directory를 설정하고 eTL에 있는 leukemia dataset을 다운 받아봅시다. Textbook appendix D에서 데이터에 대한 설명과 데이터를 볼 수 있습니다. 잘 불러왔는지 확인하기 위해 class(data.raw), names(data.raw) 그리고 head(data.raw) 등을 입력해봅시다.

setwd("C:/Users/snu/Downloads/"); # working directory setting
data.raw = read.table(file="leukemia.txt", header=T, sep=" ");
class(data.raw)

## [1] "data.frame"

names(data.raw)

##  [1] "group"     "time1"     "time2"     "ind1"      "ind2"     
##  [6] "ind3"      "timeA"     "indA"      "timeC"     "indC"     
## [11] "timeP"     "indP"      "pat.age"   "donor.age" "pat.sex"  
## [16] "donor.sex" "pat.CMV"   "donor.CMV" "waiting"   "FAB"      
## [21] "hospital"  "MTX"

이 예제에서는 All group의 $T_2$(time2)를 시각 자료로, $\delta_3$(ind3)을 censoring indicator로 사용합니다.

data = data.raw[ ,c("group", "time2", "ind3")] ; 
data = data[data$group==1, ] ;
head(data)

##   group time2 ind3
## 1     1  2081    0
## 2     1  1602    0
## 3     1  1496    0
## 4     1  1462    0
## 5     1  1433    0
## 6     1  1377    0

Survival package에서 survfit()을 이용하여 Kaplan-Meier estimator 구해봅시다. 이 함수를 사용하기 위해서는 먼저 Surv()를 이용하여 자료를 ’survival object’형 변수로 변환합시다. http://cran.r-project.org/web/packages/survival/survival.pdf (공식설명서)에 Surv(), survfit.xxx()가 자세히 설명되어 있습니다.

my.surv = Surv(time=data$time2, event=data$ind3) ;
my.fit = survfit(formula = my.surv ~ 1, data=my.surv) ;

output 확인을 위해 class(my.fit), names(my.fit)을 시도할 수 있습니다.

my.fit$time ;           ## 모든 timepoints, 중복 제거

##  [1]    1   55   74   86  104  107  109  110  122  129  172  192  194  226
## [15]  230  276  332  383  418  466  487  526  530  609  662  996 1111 1167
## [29] 1182 1199 1330 1377 1433 1462 1496 1602 2081

my.fit$surv ;           ## 각 timepoint에서의 Kaplan-Meier estimator

##  [1] 0.9737 0.9474 0.9211 0.8947 0.8684 0.8421 0.8158 0.7895 0.7368 0.7105
## [11] 0.6842 0.6579 0.6316 0.6316 0.6041 0.5767 0.5492 0.5217 0.4943 0.4668
## [21] 0.4394 0.4119 0.4119 0.3825 0.3531 0.3531 0.3531 0.3531 0.3531 0.3531
## [31] 0.3531 0.3531 0.3531 0.3531 0.3531 0.3531 0.3531

plot(my.fit) ;          ## The whole curve of Kaplan-Meier estimator function

plot of chunk unnamed-chunk-5 위에서 plot() 안에 survival object를 바로 대입해서 plotting을 할 수 있습니다. 그 외에 기존에 plot()에서 쓰던 옵션들(xlab=, main=…)을 사용할 수 있습니다. 더 자세한 정보는 http://cran.r-project.org/web/packages/survival/survival.pdf (공식 설명서)의 plot.survfit()을 참고!!

이 객체의 summary를 확인해봅시다.

my.fit.summ = summary(my.fit) ;
my.fit.summ

## Call: survfit(formula = my.surv ~ 1, data = my.surv)
## 
##  time n.risk n.event survival std.err lower 95% CI upper 95% CI
##     1     38       1    0.974  0.0260        0.924        1.000
##    55     37       1    0.947  0.0362        0.879        1.000
##    74     36       1    0.921  0.0437        0.839        1.000
##    86     35       1    0.895  0.0498        0.802        0.998
##   104     34       1    0.868  0.0548        0.767        0.983
##   107     33       1    0.842  0.0592        0.734        0.966
##   109     32       1    0.816  0.0629        0.701        0.949
##   110     31       1    0.789  0.0661        0.670        0.930
##   122     30       2    0.737  0.0714        0.609        0.891
##   129     28       1    0.711  0.0736        0.580        0.870
##   172     27       1    0.684  0.0754        0.551        0.849
##   192     26       1    0.658  0.0770        0.523        0.827
##   194     25       1    0.632  0.0783        0.495        0.805
##   230     23       1    0.604  0.0795        0.467        0.782
##   276     22       1    0.577  0.0805        0.439        0.758
##   332     21       1    0.549  0.0812        0.411        0.734
##   383     20       1    0.522  0.0817        0.384        0.709
##   418     19       1    0.494  0.0819        0.357        0.684
##   466     18       1    0.467  0.0818        0.331        0.658
##   487     17       1    0.439  0.0815        0.305        0.632
##   526     16       1    0.412  0.0809        0.280        0.605
##   609     14       1    0.382  0.0803        0.254        0.577
##   662     13       1    0.353  0.0793        0.227        0.548

my.fit.summ$time     # t_i's : the distinct uncensored timepoints

##  [1]   1  55  74  86 104 107 109 110 122 129 172 192 194 230 276 332 383
## [18] 418 466 487 526 609 662

my.fit.summ$surv     # the Kaplan-Meier estimate at each t_i's

##  [1] 0.9737 0.9474 0.9211 0.8947 0.8684 0.8421 0.8158 0.7895 0.7368 0.7105
## [11] 0.6842 0.6579 0.6316 0.6041 0.5767 0.5492 0.5217 0.4943 0.4668 0.4394
## [21] 0.4119 0.3825 0.3531

my.fit.summ$n.risk   # Y_i's

##  [1] 38 37 36 35 34 33 32 31 30 28 27 26 25 23 22 21 20 19 18 17 16 14 13

my.fit.summ$n.event  # d_i's

##  [1] 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1

my.fit.summ$std.err  # standard error of the K-M estimate at t_i

##  [1] 0.02597 0.03622 0.04374 0.04978 0.05484 0.05915 0.06289 0.06613
##  [9] 0.07143 0.07357 0.07541 0.07696 0.07825 0.07952 0.08051 0.08122
## [17] 0.08167 0.08186 0.08179 0.08146 0.08086 0.08026 0.07930

my.fit.summ$lower    # lower bound of pointwise CI bound

##  [1] 0.9241 0.8790 0.8392 0.8023 0.7673 0.7338 0.7014 0.6699 0.6093 0.5800
## [11] 0.5513 0.5231 0.4954 0.4667 0.4386 0.4110 0.3839 0.3573 0.3311 0.3055
## [21] 0.2803 0.2535 0.2273

my.fit과 my.fit.summ이 담고 있는 정보에는 다소 차이가 있습니다. my.fit$time, my.fit$surv 등을 입력하여 my.fit.summ과 비교해보세요. (무엇이 차이일까요?)

REMARK std.err에는 Greenwood formula 기반의 계산식 $\hat{S}(t_i) \hat{\sigma}_{S}(t_i)$ 들이 저장되어 있습니다. upper/lower에는 (3) 방식의 계산을 디폴트로 합니다. survfit() 함수의 option 중 일부를 변경하여 (4) 방식의 pointwise CI를 저장하게 할 수도 있습니다. 물론 std.err 계산값들은 모두 똑같습니다. :

my.fit = survfit(formula = my.surv ~ 1, data=my.surv, conf.type="plain") ; # (4)
my.fit = survfit(formula = my.surv ~ 1, data=my.surv, conf.type="log") ; # (3), 기본값

REMARK $\hat{S}(t)$가 저장된 방식을 다시 주목해봅시다. KM의 정의나 위 코드의 여러 결과를 구경하여 알겠지만 uncensored 관측시간이 있는 timepoint $t_i$들 에서만 $\hat{S}(t)$에 jump가 일어납니다. 나머지 부분에서는 flat하므로 $(t_i, \hat{S}(t_i))$만 저장해도 되겠습니다. NA의 정의를 보면 $\hat{H}(t)$도 같은 방식으로 저장해봅시다.

Kaplan-Meier estimator vs. Nelson-Aalen estimator 이제 KM estimator $\hat{S}(t)$와 NA estimator로 추정한 survival function $\exp\{-\hat{H}(t)\}$를 비교해봅시다. 두 estimator가 asymptotically 동일하긴 하나 finite sample에서는 아주 약간의 차이가 있는데 이를 직접 확인해봅시다. NA estimator는 survfit()의 결과물로부터 직접 만들어야 합니다.

obs.time = my.fit.summ$time ; # t_i
n.risk = my.fit.summ$n.risk ; # Y_i
n.event = my.fit.summ$n.event ; # d_i
KM.surv = my.fit.summ$surv ;    #\hat{S}(t_i)

정의대로 NA.cumhzd $\hat{H}(t_i)$를 계산해봅시다.(NA.surv가 $\exp\{-\hat{H}(t_i)\}$들을 저장하고 있습니다.)

incr = n.event / n.risk ; # d_i / y_i
NA.cumhzd = NULL ; # initialize
for (i in 1:length(obs.time)) NA.cumhzd[i] = sum(incr[1:i]) ;
NA.surv = exp(-NA.cumhzd) ;

plot을 그려 비교해봅시다.

windows(width=6, height=6) ;
plot(c(0, max(obs.time)+10), c(0,1), type="n", xlab="Time",
       ylab="Survival probability",
       main="Comparing two estimators for survival function") ;
points(obs.time, KM.surv, type="s", col="red", lty=1) ; 
points(obs.time, NA.surv, type="s", col="blue", lty=1) ; 
legend("bottomright", col=c("red", "blue"), lty=c(1, 1), 
       c("K-M estimator", "exp(-{N-A esetimator})")) ;

plot of chunk unnamed-chunk-10

Pointwise confidence intervals

먼저 $\hat{\sigma}_{S}(t) = \left( \sum_{i:t_i \leq t}\frac{d_i}{Y_i(Y_i - d_i)} \right)^{1/2}$와 $\hat{\sigma}_{H}(t) = \left( \sum_{i:t_i \leq t} \frac{d_i}{Y_i^2} \right)^{1/2} $를 계산해야합니다. 각각 var.GW, var.NA에 계산하도록 합시다.

incr.GW = n.event / n.risk / (n.risk - n.event) ;
var.GW = NULL ;
for (i in 1:length(obs.time)) var.GW[i] = sum(incr.GW[1:i]) ;
incr.NA = n.event / n.risk^2 ;
var.NA = NULL ;
for (i in 1:length(obs.time)) var.NA[i] = sum(incr.NA[1:i]) ;

REMARK survfit()은 $\hat{S}(t)$의 standard error의 기본값으로 Greenwood formula를 사용합니다. 따라서 다음 둘의 결과가 같습니다.

sqrt(var.GW * KM.surv^2)

##  [1] 0.02597 0.03622 0.04374 0.04978 0.05484 0.05915 0.06289 0.06613
##  [9] 0.07143 0.07357 0.07541 0.07696 0.07825 0.07952 0.08051 0.08122
## [17] 0.08167 0.08186 0.08179 0.08146 0.08086 0.08026 0.07930

my.fit.summ$std.err

##  [1] 0.02597 0.03622 0.04374 0.04978 0.05484 0.05915 0.06289 0.06613
##  [9] 0.07143 0.07357 0.07541 0.07696 0.07825 0.07952 0.08051 0.08122
## [17] 0.08167 0.08186 0.08179 0.08146 0.08086 0.08026 0.07930

이제 (1)을 계산해봅시다. $t_i$ 위에서의 95% upper-bound와 lower-bound에 계산하면 됩니다.(아래 코드 up.NA와 lo.NA)

gamma = qnorm(0.975) ;
std.NA = sqrt(var.NA) ;
up.NA = exp(- NA.cumhzd + gamma*std.NA) ;
lo.NA = exp(- NA.cumhzd - gamma*std.NA) ;

(4)의 계산도 비슷하게 한다.

stderr.GW = KM.surv * sqrt(var.GW) ;
up.GW = KM.surv + gamma*stderr.GW ;
lo.GW = KM.surv - gamma*stderr.GW ;

끝으로, plotting으로 차이를 확인해봅시다.

windows(width=6, height=6) ;
plot(c(0, max(obs.time)+10), c(0,1), type="n", xlab="Time",
       ylab="Survival probability",
       main="Comparing two estimators for survival function") ;
points(obs.time, KM.surv, type="s", col="red", lty=1) ; 
points(obs.time, up.GW, type="s", col="red", lty=2) ; 
points(obs.time, lo.GW, type="s", col="red", lty=2) ; 
points(obs.time, NA.surv, type="s", col="blue", lty=1) ; 
points(obs.time, up.NA, type="s", col="blue", lty=2) ; 
points(obs.time, lo.NA, type="s", col="blue", lty=2) ; 
legend("bottomright", col=c("red", "blue"), lty=c(1, 1), 
       c("K-M estimator", "exp(-{N-A esetimator})")) ;

plot of chunk unnamed-chunk-15

Disclaimer : 이 자료는 최영근 조교님의 2013년 가을 Survival anlysis 실습 수업자료를 바탕으로 만들어졌습니다.

Survival analysis, lab 2

yongchan Kwon, ykwon.0407@gmail.com

Review

Survival function \(S(t)\)

Pointwise confidence intervals