coxphにおけるtime-dependent covariatesの使い方

参考資料

Cox Proportional-Hazards Regression for Survival Data: http://cran.r-project.org/doc/contrib/Fox-Companion/appendix-cox-regression.pdf
Script: http://socserv.mcmaster.ca/jfox/Courses/soc761/survival-analysis.R
BIO 223 Applied Survival Analysis note chapter 7.

実際のデータ作成と解析

作成されたデータセットでは一人複数行のデータがあるためclusteringがあります。 + cluster(id) のような項をモデル式に足すことで相関を考慮したrobust sandwich variance estimatorsが使えるようですが、参考資料のPDFでは使っていないし、SASのデフォルトはrobust estimatorでない方の標準誤差とp値に一致するようです。

## 生存分析パッケージを読み込む。標準付属。
library(survival)

## データ読み込み。ネット上にテキスト形式である。
## 432名の釈放された男性囚人のRecidivism(再犯)に関してのデータです。
## Rossi, et al. 1980. Money, Work and Crime: Some Experimental Results. New York: Academic Press.
## http://cran.r-project.org/doc/contrib/Fox-Companion/Rossi.txt
Rossi <- read.table(file="http://cran.r-project.org/doc/contrib/Fox-Companion/Rossi.txt", header=TRUE)

## データ構造を見ます。
## week観察期間(censored or arrested), arrest(逮捕の場合は1でoutcome発生), fin (financial aidの有無)
## age保釈時の年齢, race 1ならblack, wexp服役前の職務経験, mar保釈時に結婚していたか, paro仮保釈?だったか
## prio何回罪に問われたことがあるか, educ教育を受けた期間2ならgrade6以下、6ならpost-secondary
## emp1 - emp52: emp1 1週目に職があったか, emp2 2週目に職があったか... time-dependent covariateになる
names(Rossi)

 [1] "week"   "arrest" "fin"    "age"    "race"   "wexp"   "mar"    "paro"   "prio"   "educ"   "emp1"   "emp2"  
[13] "emp3"   "emp4"   "emp5"   "emp6"   "emp7"   "emp8"   "emp9"   "emp10"  "emp11"  "emp12"  "emp13"  "emp14" 
[25] "emp15"  "emp16"  "emp17"  "emp18"  "emp19"  "emp20"  "emp21"  "emp22"  "emp23"  "emp24"  "emp25"  "emp26" 
[37] "emp27"  "emp28"  "emp29"  "emp30"  "emp31"  "emp32"  "emp33"  "emp34"  "emp35"  "emp36"  "emp37"  "emp38" 
[49] "emp39"  "emp40"  "emp41"  "emp42"  "emp43"  "emp44"  "emp45"  "emp46"  "emp47"  "emp48"  "emp49"  "emp50" 
[61] "emp51"  "emp52"


## 横向きに時系列データをのばしているwide形式で一人一行になっています。
## 1人目を見てみるとweek==20 arres==1なのでemp21以降はNot Availableになっています。
Rossi[1,]

  week arrest fin age race wexp mar paro prio educ emp1 emp2 emp3 emp4 emp5 emp6 emp7 emp8 emp9 emp10 emp11 emp12
1   20      1   0  27    1    0   0    1    3    3    0    0    0    0    0    0    0    0    0     0     0     0
  emp13 emp14 emp15 emp16 emp17 emp18 emp19 emp20 emp21 emp22 emp23 emp24 emp25 emp26 emp27 emp28 emp29 emp30 emp31
1     0     0     0     0     0     0     0     0    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
  emp32 emp33 emp34 emp35 emp36 emp37 emp38 emp39 emp40 emp41 emp42 emp43 emp44 emp45 emp46 emp47 emp48 emp49 emp50
1    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
  emp51 emp52
1    NA    NA

## 一人の対象が観察ごとに複数行のデータをもつlong形式にする必要があります。
## foldは下記URLにあるプログラムからfold()だけ抜き出したものです。
## http://socserv.mcmaster.ca/jfox/Courses/soc761/survival-analysis.R
## 関数名だけ実行すると上の方に説明が書いてあります。
## fold
##  Arguments:
##  data: A data frame or numeric matrix (with column names) to be `unfolded.'
##      For reasons of efficiency, if there are factors in data these will be
##      converted to numeric variables in the output data frame.
##  time: The quoted name of the event/censoring-time variable in data.
##  event: The quoted name of the event/censoring-indicator variable in data.
##  cov: A vector giving the column numbers of the time-dependent covariate
##      in data, or a list of vectors if there is more than one time-varying
##      covariate.
##  cov.names: A character string or character vector giving the name or names
##      to be assigned to the time-dependent covariate(s) in the output data set.
##  suffix: The suffix to be attached to the name of the time-to-event variable
##      in the output data setl defaults to '.time'.
##  cov.times: The observation times for the covariate values, including the start
##      time. This argument can take several forms:
##      *   The default is integers from 0 to the number of covariate values (i.e.,
##              one more than the length of each vector in cov).
##      *   An arbitrary numerical vector with one more entry than the length of each
##      *       vector in cov.
##      *   The columns in the input data set that give the observations times for each
##              individual. There should be one more column than the length of each
##              vector in cov.
##      *   common.times: A logical value indicating whether the times of observation
##              are the same for all individuals; defaults to TRUE.
##  lag: Number of observation periods to lag each value of the time-varying
##              covariate(s); defaults to 0.

## 複数time-dependent covariateの場合も下記の感じで使えるようです。
## http://r.789695.n4.nabble.com/Fold-function-with-several-time-varying-covariates-td897086.html
## Dummy.2 <- fold(Dummy, time="week", event="arrest", cov=list(11:62,63:124), cov.names=c("employed","healthy"))

## 実際使ってみます。
Rossi.2 <- fold(Rossi, time = "week", event = "arrest", cov = 11:62, cov.names = "employed")

## 変換後のデータ構造を見てみます。先頭50行を表示。
## 行番号1.1から1.20が一人目の対象、2.1から2.17が二人目の対象、、というように複数行になります。
## 一人目は20の観察期間に切り分けられ、startその観察期間の開始とstop終了という変数が与えられています。
## stop変数はstart変数を一行上にずらした形になっているかと思います。
## arrest.timeが逮捕されていれば対象ごとの最後の行で1になっています。
## 右端のemployed変数が対象2で0から1になり0に戻っている(time-dependent)なのが分かりますでしょうか。
head(Rossi.2, 50)

     start stop arrest.time week arrest fin age race wexp mar paro prio educ employed
1.1      0    1           0   20      1   0  27    1    0   0    1    3    3        0
1.2      1    2           0   20      1   0  27    1    0   0    1    3    3        0
1.3      2    3           0   20      1   0  27    1    0   0    1    3    3        0
1.4      3    4           0   20      1   0  27    1    0   0    1    3    3        0
1.5      4    5           0   20      1   0  27    1    0   0    1    3    3        0
1.6      5    6           0   20      1   0  27    1    0   0    1    3    3        0
1.7      6    7           0   20      1   0  27    1    0   0    1    3    3        0
1.8      7    8           0   20      1   0  27    1    0   0    1    3    3        0
1.9      8    9           0   20      1   0  27    1    0   0    1    3    3        0
1.10     9   10           0   20      1   0  27    1    0   0    1    3    3        0
1.11    10   11           0   20      1   0  27    1    0   0    1    3    3        0
1.12    11   12           0   20      1   0  27    1    0   0    1    3    3        0
1.13    12   13           0   20      1   0  27    1    0   0    1    3    3        0
1.14    13   14           0   20      1   0  27    1    0   0    1    3    3        0
1.15    14   15           0   20      1   0  27    1    0   0    1    3    3        0
1.16    15   16           0   20      1   0  27    1    0   0    1    3    3        0
1.17    16   17           0   20      1   0  27    1    0   0    1    3    3        0
1.18    17   18           0   20      1   0  27    1    0   0    1    3    3        0
1.19    18   19           0   20      1   0  27    1    0   0    1    3    3        0
1.20    19   20           1   20      1   0  27    1    0   0    1    3    3        0
2.1      0    1           0   17      1   0  18    1    0   0    1    8    4        0
2.2      1    2           0   17      1   0  18    1    0   0    1    8    4        0
2.3      2    3           0   17      1   0  18    1    0   0    1    8    4        0
2.4      3    4           0   17      1   0  18    1    0   0    1    8    4        0
2.5      4    5           0   17      1   0  18    1    0   0    1    8    4        0
2.6      5    6           0   17      1   0  18    1    0   0    1    8    4        0
2.7      6    7           0   17      1   0  18    1    0   0    1    8    4        0
2.8      7    8           0   17      1   0  18    1    0   0    1    8    4        0
2.9      8    9           0   17      1   0  18    1    0   0    1    8    4        0
2.10     9   10           0   17      1   0  18    1    0   0    1    8    4        1
2.11    10   11           0   17      1   0  18    1    0   0    1    8    4        1
2.12    11   12           0   17      1   0  18    1    0   0    1    8    4        1
2.13    12   13           0   17      1   0  18    1    0   0    1    8    4        1
2.14    13   14           0   17      1   0  18    1    0   0    1    8    4        1
2.15    14   15           0   17      1   0  18    1    0   0    1    8    4        0
2.16    15   16           0   17      1   0  18    1    0   0    1    8    4        0
2.17    16   17           1   17      1   0  18    1    0   0    1    8    4        0
3.1      0    1           0   25      1   0  19    0    1   0    1   13    3        0
3.2      1    2           0   25      1   0  19    0    1   0    1   13    3        0
3.3      2    3           0   25      1   0  19    0    1   0    1   13    3        0
3.4      3    4           0   25      1   0  19    0    1   0    1   13    3        0
3.5      4    5           0   25      1   0  19    0    1   0    1   13    3        0
3.6      5    6           0   25      1   0  19    0    1   0    1   13    3        0
3.7      6    7           0   25      1   0  19    0    1   0    1   13    3        0
3.8      7    8           0   25      1   0  19    0    1   0    1   13    3        0
3.9      8    9           0   25      1   0  19    0    1   0    1   13    3        0
3.10     9   10           0   25      1   0  19    0    1   0    1   13    3        0
3.11    10   11           0   25      1   0  19    0    1   0    1   13    3        0
3.12    11   12           0   25      1   0  19    0    1   0    1   13    3        0
3.13    12   13           0   25      1   0  19    0    1   0    1   13    3        0


## モデル式を書きます。
## Surv(time, status)の標準形式ではなくSurv(start, stop, status)という形式です。
## 以降のcovariateはtime-dependentなものも含めてただ並べるだけです。
mod.allison.2 <- coxph(data=Rossi.2,
                       Surv(start, stop, arrest.time) ~
                       fin + age + race + wexp + mar + paro + prio + employed)

## 表示してみます。
## time-dependent covariateのemployedを見てみると***で有意なようです。
## exp(coef)がhazard ratioにあたります。employed 0.265ですので雇用されている
## 状態(週)であれば再逮捕の危険が73.5%低下するようです。
## 他には年齢が高い方が再逮捕されにくい、何度も罪に問われていると再逮捕されやすいようです。
summary(mod.allison.2)

Call:
coxph(formula = Surv(start, stop, arrest.time) ~ fin + age + 
    race + wexp + mar + paro + prio + employed, data = Rossi.2)

  n= 19809, number of events= 114 

            coef exp(coef) se(coef)     z   Pr(>|z|)    
fin      -0.3567    0.7000   0.1911 -1.87     0.0620 .  
age      -0.0463    0.9547   0.0217 -2.13     0.0330 *  
race      0.3387    1.4031   0.3096  1.09     0.2740    
wexp     -0.0256    0.9748   0.2114 -0.12     0.9038    
mar      -0.2937    0.7455   0.3830 -0.77     0.4431    
paro     -0.0642    0.9378   0.1947 -0.33     0.7416    
prio      0.0851    1.0889   0.0290  2.94     0.0033 ** 
employed -1.3283    0.2649   0.2507 -5.30 0.00000012 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

         exp(coef) exp(-coef) lower .95 upper .95
fin          0.700      1.429     0.481     1.018
age          0.955      1.047     0.915     0.996
race         1.403      0.713     0.765     2.574
wexp         0.975      1.026     0.644     1.475
mar          0.745      1.341     0.352     1.579
paro         0.938      1.066     0.640     1.374
prio         1.089      0.918     1.029     1.152
employed     0.265      3.775     0.162     0.433

Concordance= 0.708  (se = 0.027 )
Rsquare= 0.003   (max possible= 0.066 )
Likelihood ratio test= 68.7  on 8 df,   p=9.11e-12
Wald test            = 56.1  on 8 df,   p=0.00000000263
Score (logrank) test = 64.5  on 8 df,   p=6.1e-11



## おまけ
## ただし、この結果には逮捕されていたら仕事できないじゃないか? (Allison 1995)という
## 批判もあったようなので、直前の週の雇用状態と再逮捕の関連をみるモデルもあるようです。
## fold()にlag=1オプションをつけます。
Rossi.3 <- fold(Rossi, time="week", event="arrest", cov=11:62, cov.names="employed", lag=1)

## 2人目の対象だけさっきのデータセットと今作ったlaggedデータセットで比べてみます。
## 行の名前が 2. で始まっているところのみ抜く正規表現を使っています。
## 新しいデータセットでは先頭行がなくなって、employed==1が一個後ろにずれている、
## つまり前の観察期間の雇用状態を示しているのが分かりますでしょうか。
Rossi.2[grep("^2\\..*", rownames(Rossi.2)), ]

     start stop arrest.time week arrest fin age race wexp mar paro prio educ employed
2.1      0    1           0   17      1   0  18    1    0   0    1    8    4        0
2.2      1    2           0   17      1   0  18    1    0   0    1    8    4        0
2.3      2    3           0   17      1   0  18    1    0   0    1    8    4        0
2.4      3    4           0   17      1   0  18    1    0   0    1    8    4        0
2.5      4    5           0   17      1   0  18    1    0   0    1    8    4        0
2.6      5    6           0   17      1   0  18    1    0   0    1    8    4        0
2.7      6    7           0   17      1   0  18    1    0   0    1    8    4        0
2.8      7    8           0   17      1   0  18    1    0   0    1    8    4        0
2.9      8    9           0   17      1   0  18    1    0   0    1    8    4        0
2.10     9   10           0   17      1   0  18    1    0   0    1    8    4        1
2.11    10   11           0   17      1   0  18    1    0   0    1    8    4        1
2.12    11   12           0   17      1   0  18    1    0   0    1    8    4        1
2.13    12   13           0   17      1   0  18    1    0   0    1    8    4        1
2.14    13   14           0   17      1   0  18    1    0   0    1    8    4        1
2.15    14   15           0   17      1   0  18    1    0   0    1    8    4        0
2.16    15   16           0   17      1   0  18    1    0   0    1    8    4        0
2.17    16   17           1   17      1   0  18    1    0   0    1    8    4        0

Rossi.3[grep("^2\\..*", rownames(Rossi.3)), ]

     start stop arrest.time week arrest fin age race wexp mar paro prio educ employed
2.2      1    2           0   17      1   0  18    1    0   0    1    8    4        0
2.3      2    3           0   17      1   0  18    1    0   0    1    8    4        0
2.4      3    4           0   17      1   0  18    1    0   0    1    8    4        0
2.5      4    5           0   17      1   0  18    1    0   0    1    8    4        0
2.6      5    6           0   17      1   0  18    1    0   0    1    8    4        0
2.7      6    7           0   17      1   0  18    1    0   0    1    8    4        0
2.8      7    8           0   17      1   0  18    1    0   0    1    8    4        0
2.9      8    9           0   17      1   0  18    1    0   0    1    8    4        0
2.10     9   10           0   17      1   0  18    1    0   0    1    8    4        0
2.11    10   11           0   17      1   0  18    1    0   0    1    8    4        1
2.12    11   12           0   17      1   0  18    1    0   0    1    8    4        1
2.13    12   13           0   17      1   0  18    1    0   0    1    8    4        1
2.14    13   14           0   17      1   0  18    1    0   0    1    8    4        1
2.15    14   15           0   17      1   0  18    1    0   0    1    8    4        1
2.16    15   16           0   17      1   0  18    1    0   0    1    8    4        0
2.17    16   17           1   17      1   0  18    1    0   0    1    8    4        0


## モデルの作り方は同じです。
mod.allison.3 <- coxph(data=Rossi.3,
                       Surv(start, stop, arrest.time) ~
                       fin + age + race + wexp + mar + paro + prio + employed)

## さっきほどではないですがemployedは有意で残っています。
## 保釈されても前の週に無職の状態だと再逮捕のリスクが高いことと関連するようです。
summary(mod.allison.3)

Call:
coxph(formula = Surv(start, stop, arrest.time) ~ fin + age + 
    race + wexp + mar + paro + prio + employed, data = Rossi.3)

  n= 19377, number of events= 113 

            coef exp(coef) se(coef)     z Pr(>|z|)    
fin      -0.3513    0.7038   0.1918 -1.83  0.06703 .  
age      -0.0498    0.9514   0.0219 -2.27  0.02297 *  
race      0.3215    1.3792   0.3091  1.04  0.29837    
wexp     -0.0476    0.9535   0.2132 -0.22  0.82321    
mar      -0.3448    0.7084   0.3832 -0.90  0.36831    
paro     -0.0471    0.9540   0.1963 -0.24  0.81038    
prio      0.0920    1.0964   0.0288  3.19  0.00140 ** 
employed -0.7869    0.4553   0.2181 -3.61  0.00031 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

         exp(coef) exp(-coef) lower .95 upper .95
fin          0.704      1.421     0.483     1.025
age          0.951      1.051     0.911     0.993
race         1.379      0.725     0.752     2.528
wexp         0.953      1.049     0.628     1.448
mar          0.708      1.412     0.334     1.501
paro         0.954      1.048     0.649     1.402
prio         1.096      0.912     1.036     1.160
employed     0.455      2.197     0.297     0.698

Concordance= 0.67  (se = 0.027 )
Rsquare= 0.002   (max possible= 0.067 )
Likelihood ratio test= 47.2  on 8 df,   p=0.000000143
Wald test            = 43.4  on 8 df,   p=0.000000748
Score (logrank) test = 46.4  on 8 df,   p=0.000000199