서론

생존분석시 K-M survival curve를 그릴때 연속변수의 optimal cutpoint level을 찾아주었으면 하는 요청이 있었읍니다. 요청하신 분은 maxstat 패키지를 추천해주셨읍니다. 먼저 maxstat 패키지를 설치하고 예를 한번 따라해 보았읍니다.

`maxstat` 패키지

예로써 패키지에 포함되어 있는 DLBCL 데이타를 사용합니다. 이 데이타는 악성임파종의 한 종류인 diffuse large B cell lymphoma 42예에 대한 자료입니다. 자료의 요약은 다음과 같습니다.

#install.packages("maxstat")
require(maxstat)
require(survival)
require(party)
require(rms)
require(ztable)

data(DLBCL)
summary(DLBCL)

    RowNames           DLCLid                 GEG          time        
 Min.   : 7.00   DLCL-0001: 1   Activated B-like:21   Min.   :  1.300  
 1st Qu.:17.25   DLCL-0002: 1   GC B-Like       :21   1st Qu.:  9.162  
 Median :27.50   DLCL-0003: 1                         Median : 36.050  
 Mean   :27.50   DLCL-0004: 1                         Mean   : 43.947  
 3rd Qu.:37.75   DLCL-0005: 1                         3rd Qu.: 71.483  
 Max.   :48.00   DLCL-0006: 1                         Max.   :129.900  
                 (Other)  :36                         NA's   :2        
      cens           IPI             MGE          
 Min.   :0.00   Min.   :0.000   Min.   :-0.15892  
 1st Qu.:0.00   1st Qu.:1.000   1st Qu.:-0.01826  
 Median :1.00   Median :2.000   Median : 0.06584  
 Mean   :0.55   Mean   :1.947   Mean   : 0.07551  
 3rd Qu.:1.00   3rd Qu.:3.000   3rd Qu.: 0.14754  
 Max.   :1.00   Max.   :4.000   Max.   : 0.37129  
 NA's   :2      NA's   :4

이 중 time은 생존기간, cens는 0이 생존, 1은 사망입니다. IPI는 international prognostic index(0이 7명, 1이 7명, 2가 10명, 3이 9명, 4가 5명)이고 MGE는 mean gene expression(연속형변수)입니다.

table(DLBCL$IPI)


 0  1  2  3  4 
 7  7 10  9  5

hist(DLBCL$MGE)

단변량분석

이 중 연속형 변수인 MGE를 영향변수로 검정해보겠읍니다.

mstat <- maxstat.test(Surv(time, cens) ~ MGE, data=DLBCL, 
                      smethod="LogRank", pmethod="exactGauss", 
                      abseps=0.01)
mstat


    Maximally selected LogRank statistics using exactGauss

data:  Surv(time, cens) by MGE
M = 3.1772, p-value = 0.01849
sample estimates:
estimated cutpoint 
         0.1860526

plot(mstat)

검정 결과 0.186을 cutpoint로 삼을 수 있겠읍니다. 과연 그런지 0.186이하와 0.186초과를 기준으로 새로운 범주형 변수를 만들고 생존분석을 해보겠읍니다.

이를 Kaplan-Meier survival curve로 두 군을 비교하여 그려보겠읍니다.

MGEgroup=ifelse(DLBCL$MGE<=0.186,0,1)
MGEgroup=factor(MGEgroup)
levels(MGEgroup)=c("below 0.186","above 0.186")

fit  <- survfit(Surv(time,cens)~MGEgroup,data=DLBCL)
fit

Call: survfit(formula = Surv(time, cens) ~ MGEgroup, data = DLBCL)

   2 observations deleted due to missingness 
                     records n.max n.start events median 0.95LCL 0.95UCL
MGEgroup=below 0.186      32    32      32     21   25.4    12.3      NA
MGEgroup=above 0.186       8     8       8      1     NA      NA      NA

plot(fit,col=1:2,lty=1:2)
legend("bottomleft",title="Mean Gene Expression",legend=levels(MGEgroup),col=1:2,lty=1:2)

log-rank test로 두군 간의 차이를 검정해보겠읍니다.

survdiff(Surv(time,cens)~MGEgroup,data=DLBCL)

Call:
survdiff(formula = Surv(time, cens) ~ MGEgroup, data = DLBCL)

n=40, 2 observations deleted due to missingness.

                      N Observed Expected (O-E)^2/E (O-E)^2/V
MGEgroup=below 0.186 32       21    16.31      1.35      5.36
MGEgroup=above 0.186  8        1     5.69      3.87      5.36

 Chisq= 5.4  on 1 degrees of freedom, p= 0.0206

GGally패키지의 ggsurv함수를 이용하면 더 나은 그래프를 얻을수 있읍니다.

require(GGally)
ggsurv(fit)

Cox’s proprotional hazard model을 사용하여 검정해보겠읍니다.

out=coxph(Surv(time,cens)~MGEgroup,data=DLBCL)
out

Call:
coxph(formula = Surv(time, cens) ~ MGEgroup, data = DLBCL)

                     coef exp(coef) se(coef)     z     p
MGEgroupabove 0.186 -2.03     0.132     1.03 -1.97 0.049

Likelihood ratio test=7.3  on 1 df, p=0.00691  n= 40, number of events= 22 
   (2 observations deleted due to missingness)

Cox의 비례위험모형에서도 p값은 약간 차이 있으나 유의한 결과를 얻었습니다.

다변량분석

이제는 MGE와 IPI 두개의 영향변수로 검정해보겠읍니다.

mstat1 <- maxstat.test(Surv(time, cens) ~ MGE + IPI, data=DLBCL, 
                      smethod="LogRank", pmethod="exactGauss", 
                      abseps=0.01)
mstat1


     Optimally Selected Prognostic Factors 

Call: maxstat.test.data.frame(formula = Surv(time, cens) ~ MGE + IPI, 
    data = DLBCL, smethod = "LogRank", pmethod = "exactGauss", 
    abseps = 0.01)


 Selected: 

    Maximally selected LogRank statistics using exactGauss

data:  Surv(time, cens) by IPI
M = 2.9603, p-value = 0.01116
sample estimates:
estimated cutpoint 
                 1 

Adjusted p.value: 
0.03468229 , error:  0.0012978

plot(mstat1)

두개의 영향변수를 넣어보니 MGE는 차이가 없고 IPI만 유의한 영향변수로 판단할 수 있겠읍니다.

모든 변수 분석

욕심을 내어 보다 편하게 쓸수 있게 모든 변수를 넣어 한번 통계를 돌려보겠읍니다.

mstat2 <- maxstat.test(Surv(time, cens) ~ ., data=DLBCL, 
                      smethod="LogRank", pmethod="exactGauss", 
                      abseps=0.01)

Warning in cmaxstat(scores, x[, i], weights = weights, pmethod = pmethod,
: cannot order in x, returning NA

Warning in mmax[[i]]$data.name <- colnames(x)[i]: Coercing LHS to a list

Error in pvalues[i] <- mmax[[i]]$p.value: replacement has length zero

하지만 에러가 나고 실행되지 않습니다.

`party`패키지

단변량분석

같은 데이타를 party패키지로 분석해보겠읍니다. party 패키지의 ctree함수를 사용할 수 있읍니다. 하지만 party패키지는 단점이 하나 있는데 반응변수가 누락된 값이 있으면 실행이 안됩니다.

require(party)
out=ctree(Surv(time, cens) ~ MGE + IPI, data=DLBCL)

Error in model@dpp(...): missing values in response variable not allowed

따라서 cens에 누락된 값이 있는 행을 제외하고 다시 통계를 돌려보겠읍니다.

data1=DLBCL[!is.na(DLBCL$cens),]
out=ctree(Surv(time, cens) ~ MGE , data=data1)
out


     Conditional inference tree with 2 terminal nodes

Response:  Surv(time, cens) 
Input:  MGE 
Number of observations:  40 

1) MGE <= 0.1860526; criterion = 0.994, statistic = 7.439
  2)*  weights = 33 
1) MGE > 0.1860526
  3)*  weights = 7

plot(out)

이 경우 cutpoint는 maxstat.test와 마찬가지로 0.186이 나옵니다.

다변량분석

IPI와 MGE둘 다 넣고 통계를 돌려보겠읍니다.

out1=ctree(Surv(time, cens) ~ MGE+IPI , data=data1)
out1


     Conditional inference tree with 2 terminal nodes

Response:  Surv(time, cens) 
Inputs:  MGE, IPI 
Number of observations:  40 

1) IPI <= 1; criterion = 0.99, statistic = 7.883
  2)*  weights = 14 
1) IPI > 1
  3)*  weights = 26

plot(out1)

이 경우 결과는 maxstat.test와 마찬가지로 IPI만 유의한 것으로 나옵니다.

모든 변수분석

모든 변수를 영향변수로 넣고 통계를 돌려보겠읍니다.

out2=ctree(Surv(time, cens) ~ . , data=data1)
out2


     Conditional inference tree with 2 terminal nodes

Response:  Surv(time, cens) 
Inputs:  RowNames, DLCLid, GEG, IPI, MGE 
Number of observations:  40 

1) IPI <= 1; criterion = 0.975, statistic = 39
  2)*  weights = 14 
1) IPI > 1
  3)*  weights = 26

plot(out2)

이 경우 에러가 나지 않고 IPI만 유의한 것으로 나옵니다.

이상의 결과를 표로 요약하면 다음과 같습니다.

DLBCL데이타 비교


	단변량분석	다변량분석	모든변수분석
maxstat.test	0.0185	0.0112
ctree	0.0060	0.0100	0.0250
log-rank test	0.0206

다른 데이타 분석: 폐암데이타

대부분 모든 패키지들이 자신의 데이타를 이용하여 통계를 돌릴때는 잘 되지만 다른 데이타의 경우 잘 안될 수도 있읍니다. 따라서 어느 패키지의 함수의 범용성을 평가할 때 다른 데이타를 사용하여 잘 되는지 보는것이 좋겠읍니다. 이번에는 survival 데이타의 폐암 데이타(lung)를 사용하여 비교해보겠읍니다. 데이타 lung은 North CentralCancer Group의 228명의 폐암 환자 데이타입니다.

단변량분석

먼저 나이에 따른 사망을 비교하여 나이의 cutpoint를 구할 수 있는지 보겠읍니다.

data(lung)
summary(lung)

      inst            time            status           age       
 Min.   : 1.00   Min.   :   5.0   Min.   :1.000   Min.   :39.00  
 1st Qu.: 3.00   1st Qu.: 166.8   1st Qu.:1.000   1st Qu.:56.00  
 Median :11.00   Median : 255.5   Median :2.000   Median :63.00  
 Mean   :11.09   Mean   : 305.2   Mean   :1.724   Mean   :62.45  
 3rd Qu.:16.00   3rd Qu.: 396.5   3rd Qu.:2.000   3rd Qu.:69.00  
 Max.   :33.00   Max.   :1022.0   Max.   :2.000   Max.   :82.00  
 NA's   :1                                                       
      sex           ph.ecog          ph.karno        pat.karno     
 Min.   :1.000   Min.   :0.0000   Min.   : 50.00   Min.   : 30.00  
 1st Qu.:1.000   1st Qu.:0.0000   1st Qu.: 75.00   1st Qu.: 70.00  
 Median :1.000   Median :1.0000   Median : 80.00   Median : 80.00  
 Mean   :1.395   Mean   :0.9515   Mean   : 81.94   Mean   : 79.96  
 3rd Qu.:2.000   3rd Qu.:1.0000   3rd Qu.: 90.00   3rd Qu.: 90.00  
 Max.   :2.000   Max.   :3.0000   Max.   :100.00   Max.   :100.00  
                 NA's   :1        NA's   :1        NA's   :3       
    meal.cal         wt.loss       
 Min.   :  96.0   Min.   :-24.000  
 1st Qu.: 635.0   1st Qu.:  0.000  
 Median : 975.0   Median :  7.000  
 Mean   : 928.8   Mean   :  9.832  
 3rd Qu.:1150.0   3rd Qu.: 15.750  
 Max.   :2600.0   Max.   : 68.000  
 NA's   :47       NA's   :14

result=ctree(Surv(time, status==2) ~ age , data=lung)
result


     Conditional inference tree with 2 terminal nodes

Response:  Surv(time, status == 2) 
Input:  age 
Number of observations:  228 

1) age <= 70; criterion = 0.953, statistic = 3.932
  2)*  weights = 182 
1) age > 70
  3)*  weights = 46

plot(result)

mstat3 <- maxstat.test(Surv(time, status==2) ~ age, data=lung, 
                      smethod="LogRank", pmethod="exactGauss", 
                      abseps=0.01)
mstat3


    Maximally selected LogRank statistics using exactGauss

data:  Surv(time, status == 2) by age
M = 2.0136, p-value = 0.2912
sample estimates:
estimated cutpoint 
                70

plot(mstat3)

ctree함수와 maxstat.test함수 모두 cutponit는 70으로 나왔읍니다. 하지만 p값은 ctree함수의 경우 0.047, maxstat.test는 0.2899로 다릅니다. 이를 log-rank test 로 확인하기 위해 oldage라는 범주형 변수를 새로 만들어 확인해보겠읍니다.

oldage=ifelse(lung$age<=70,0,1)
oldage=factor(oldage)
levels(oldage)=c("below 71","above 70" )
survdiff(Surv(time, status==2) ~ oldage , data=lung)

Call:
survdiff(formula = Surv(time, status == 2) ~ oldage, data = lung)

                  N Observed Expected (O-E)^2/E (O-E)^2/V
oldage=below 71 182      127    137.3     0.773      4.64
oldage=above 70  46       38     27.7     3.829      4.64

 Chisq= 4.6  on 1 degrees of freedom, p= 0.0312

log-rank test결과 p값은 0.0312로 의미있읍니다. 이를 ggsurv함수로 plot해보면 다음과 같습니다.

fit2= survfit(Surv(time, status==2) ~ oldage, data=lung)
ggsurv(fit2)

다변량분석

이제 age와 sex 두개의 변수에 대해 검정을 해보겠읍니다.

result2=ctree(Surv(time, status==2) ~ age+sex , data=lung)
result2


     Conditional inference tree with 2 terminal nodes

Response:  Surv(time, status == 2) 
Inputs:  age, sex 
Number of observations:  228 

1) sex <= 1; criterion = 0.998, statistic = 10.745
  2)*  weights = 138 
1) sex > 1
  3)*  weights = 90

plot(result2)

mstat4 <- maxstat.test(Surv(time, status==2) ~ age+sex, data=lung, 
                      smethod="LogRank", pmethod="exactGauss", 
                      abseps=0.01)
mstat4


     Optimally Selected Prognostic Factors 

Call: maxstat.test.data.frame(formula = Surv(time, status == 2) ~ age + 
    sex, data = lung, smethod = "LogRank", pmethod = "exactGauss", 
    abseps = 0.01)


 Selected: 

    Maximally selected LogRank statistics using exactGauss

data:  Surv(time, status == 2) by sex
M = 3.2768, p-value = 0.00105
sample estimates:
estimated cutpoint 
                 1 

Adjusted p.value: 
0.01269026 , error:  0.0006581205

plot(mstat4)

두 가지 방법 모두 성별만 의미있는 것으로 나왔읍니다.

모든 변수를 영향변수로 한 검정

다음은 모든 검정을 영향변수로 한 검정을 해보겠읍니다.

result3=ctree(Surv(time, status==2) ~ . , data=lung)
result3


     Conditional inference tree with 3 terminal nodes

Response:  Surv(time, status == 2) 
Inputs:  inst, age, sex, ph.ecog, ph.karno, pat.karno, meal.cal, wt.loss 
Number of observations:  228 

1) ph.ecog <= 1; criterion = 1, statistic = 16.228
  2) sex <= 1; criterion = 0.985, statistic = 9.684
    3)*  weights = 108 
  2) sex > 1
    4)*  weights = 69 
1) ph.ecog > 1
  5)*  weights = 51

plot(result3)

mstat5 <- maxstat.test(Surv(time, status==2) ~ ., data=lung, 
                      smethod="LogRank", pmethod="exactGauss", 
                      abseps=0.01)
mstat5


     Optimally Selected Prognostic Factors 

Call: maxstat.test.data.frame(formula = Surv(time, status == 2) ~ ., 
    data = lung, smethod = "LogRank", pmethod = "exactGauss", 
    abseps = 0.01)


 Selected: 

    Maximally selected LogRank statistics using exactGauss

data:  Surv(time, status == 2) by ph.ecog
M = 3.1437, p-value = 0.003305
sample estimates:
estimated cutpoint 
                 1 

Adjusted p.value: 
0.0777071 , error:  0.002908686

plot(mstat5)

생존나무분석 결과는 ECOG performance sore와 성별에 따라 생존에 차이가 나는 것을 알 수 있고 performance score 1 이하인 군과 2 이상인 군이 차이가 나는 것을 알 수 있읍니다. 하지만 maxstat.test는 ph.ecog만 의미 있는 것으로 나타납니다. 생존나무 분석에서 나타난 결과를 확인해보기 위해 ph.ecog가 1이하인 환자들만 대상으로 성별에 따른 차이가 있는지 보겠읍니다.

survdiff(Surv(time, status==2) ~ sex , data=lung[lung$ph.ecog<=1,])

Call:
survdiff(formula = Surv(time, status == 2) ~ sex, data = lung[lung$ph.ecog <= 
    1, ])

n=176, 1 observation deleted due to missingness.

        N Observed Expected (O-E)^2/E (O-E)^2/V
sex=1 107       82     65.7      4.06      9.16
sex=2  69       37     53.3      5.00      9.16

 Chisq= 9.2  on 1 degrees of freedom, p= 0.00248

log-rank test 결과 ph.ecog가 1이하인 환자들의 경우 성별에 따른 차이가 있읍니다. 이를 그래프로 나타내 보면 다음과 같습니다.

fit3= survfit(Surv(time, status==2) ~ sex , data=lung[lung$ph.ecog<=1,])
ggsurv(fit3)

폐암 데이타 비교


	단변량분석	다변량분석	ph.ecog	sex
			모든변수분석
maxstat.test	0.2912	0.0010	0.0030
ctree	0.0470	0.0020	0.0010	0.0150
log-rank test	0.0312		0.0312	0.0020

결론

생존분석시 K-M survival curve를 그릴때 연속변수의 optimal cutpoint level을 찾기 위해 maxstat 패키지의 maxstat.test() 함수와 party package의 ctree()함수를 각각 실행한 후 전통적인 log-rank test와 비교하여 보았읍니다. 결론적으로 party package의 ctree함수가 보다 사용하기 편하고 log-rank test 결과와 근접한 결과를 보여줍니다. 따라서 “웹에서 하는 R통계”에서는 party패키지의 ctree함수를 사용하여 생존나무 분석을 시행하겠읍니다.

생존분석에서 연속형 영향변수의 cutpoint 구하기

문건웅

2015년 3월 30일

서론

`maxstat` 패키지

단변량분석

다변량분석

모든 변수 분석

`party`패키지

단변량분석

다변량분석

모든 변수분석

DLBCL데이타 비교

다른 데이타 분석: 폐암데이타

단변량분석

다변량분석

모든 변수를 영향변수로 한 검정

폐암 데이타 비교

결론

생존분석에서 연속형 영향변수의 cutpoint 구하기

문건웅

2015년 3월 30일

서론

maxstat 패키지

단변량분석

다변량분석

모든 변수 분석

party패키지

단변량분석

다변량분석

모든 변수분석

DLBCL데이타 비교

다른 데이타 분석: 폐암데이타

단변량분석

다변량분석

모든 변수를 영향변수로 한 검정

폐암 데이타 비교

결론

`maxstat` 패키지

`party`패키지