Part 1: 摘要:加護病房 (ICU) 留院死亡預測模型改良
ICU死亡預測模型架構與應用
在模型改良上, 嘗試透過在vital sign時間序列上取得新的特徵來達成:
2.1 以Dynamic Time Warp (DTW) 演算法計算不同時間序列間的距離. DTW會考慮兩序列是否在變形 (transformation) 或位移 (shift) 後呈現較高的相似性, 再從此去計算兩者間的距離, 因此可能較一般Kmeans使用的距離演算法更為適合時間序列分群. 本篇報告使用含有DTW演算法的R package: dtwclust來對vital sign的時間序列進行分群.
2.2 除了分群, 還提供每個序列對群心 (centroid) 的距離做為建模特徵, 這提供了與群心有多相似的資訊給建模演算法.
2.3 所有患者最早即最晚記錄點的各項vital sign, 這也是模型表現的baseline所使用的特徵, 因為通常狀況很差的患者, 這時數據也都是較為惡化的狀況 (例如很低的脈搏).
2.4 新的統計型特徵: range (max - min), max_slope及min_slope (找出患者vitalsign出現的最大/最小變化速率)
2.5 其它未實現的vital sign相關新特徵:
2.5.1 初步評估記錄時間差中位數為0.5 hr, 因此如改變時間序列長度, 改以0.5 hr為一單位組成時間序列, 對於有較高記錄頻率的資料, 也許能提供更多特徵. 在本篇報告中已包含該項參數, 但尚未完成特徵工程及模型表現比較.
2.5.2 參考MIMIC-III的疾病分類表 (https://www.nature.com/articles/sdata201635/tables/2), 研究各分類主要的惡化特徵, 或導致死亡前的可能生理變化, 將該些資訊轉為ifelse rule用在特徵工程上.
提高模型的泛化性 (generalization), 則是透過使用所有資料及lightGBM演算法:
3.1 由於原始模型使用的樣本患者死亡率較剩餘患者高了55% (17%對11%), 這代表了兩族群特徵分布可能有明顯差異, 其可能的推測是嚴重的患者, 醫生會需要更多監控或生化檢查的資訊來協助診斷及建立處方. 也因此, 基於該些患者建立的預測模型, 其找出的相關性規則可能不適用於剩餘患者族群.
3.2 因此, 本報告的程式保留使用所有樣本的彈性, 如此也可以讓資料提供最多的資訊.
3.3 建模演算法改用了lightGBM, 如此使用者可以保留無法合理imputation的NA值, 演算法會計算在遇到NA時的最佳分類方向, 簡化了NA的處理流程, 並盡可能保留所有樣本可以提供的資訊. 而其缺點就是black box造成較差的解釋性, 會需要額外方法去分析其找出的規則.
在結果部分會先以原始分析報告使用的icustay ID來建模, 評估新特徵及建模演算法對模型表現的影響, 最後則是報告使用所有icustay ID資料的模型表現.
4.1 使用非常簡單的15個特徵模型, 其效果就已經與原始模型的最佳模型 (44個特徵 + 時間序列), 具有相同的靈敏度與精確度.
Part 2: 載入檔案及資料前處理
在本篇報告的參數設定下, 只使用了原始分析報告中篩選後的icustay ID, 但如果將demoCodeID_only設定為FALSE, 將可使用所有的icustay ID來進行後續的特徵工程及建模作業.
sampleRate參數則與vital sign的時間序列呈現有關, 當設定為1時, 代表以1 hr為單位 (同AI4quant提供的示範分析) 來建構時間序列(0, 1, 2…, 48), 而0.5則代表以0.5 hr為1單位 (0, 0.5, 1…47.5, 48).
由於分析記錄時間間隔時發現中位數為0.5 hr, 因此對於某些vital sign以較高的sampleRate參數也許可以在比較時間序列時提供更多的資訊及差異.
Sys.setlocale("LC_TIME", "English")
#caret亦用於建模過程
suppressMessages(library(readr))
suppressMessages(library(dplyr))
suppressMessages(library(dtwclust))
suppressMessages(library(plotly))
suppressMessages(library(caret))
#以下用於建模和評估模型表現
suppressMessages(library(Matrix))
suppressMessages(library(lightgbm))
suppressMessages(library(car))
suppressMessages(library(pROC))
#自訂函數, 用於計算一組向量的最大值與最小值的差值
RANGE <- function(x){
y <- diff(range(x, na.rm=TRUE))
if(is.infinite(y)){
return(NA)
} else{
return(y)
}
}
suppressMessages(dfr <- read_csv("raw/mimic3_mortality.csv", guess_max = 300000))
colnames(dfr) <- gsub(x = colnames(dfr), pattern = " ", replacement = "_")
#資料選擇參數
#如果demoCodeID_only == FALSE, 會產生demoCode使用的ID (後續可用於其他觀察), 但特徵工程使用所有ID的資料
#如果demoCodeID_only == TRUE, 特徵工程資料只限於demoCode使用的ID
demoCodeID_only <- TRUE
#時間序列頻率參數 (1代表以1hr為1單位, 序列長度為49: 0-48 hr)
sampleRate <- 1資料概述
icustay: 病患ID
hours: 進入加護病房後的累計時間 (day則是hour的轉換)
diastolic_BP, glasgow_coma_scale, glucose, heart_rate, mean_BP, oxygen_saturation, respiratory_rate, systolic_BP, temperature, pH: 時間序列資料, 記錄病患的生理數據
age, genderm height, weight: 年齡性別特徵
motality: 是否於院內死亡
| icustay | hours | diastolic_BP | glasgow_coma_scale | glucose | heart_rate | mean_BP | oxygen_saturation | respiratory_rate | systolic_BP | temperature | age | gender | height | pH | weight | day | mortality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 282372 | 0.0667 | 60 | NA | NA | 139 | 84.6667 | 100 | 20 | 134 | NA | 48.6824 | 2 | NA | NA | 59 | 1 | 1 |
| 282372 | 0.1500 | 73 | NA | NA | 128 | 93.0000 | 100 | 25 | 133 | NA | 48.6824 | 2 | NA | NA | 59 | 1 | 1 |
| 282372 | 0.2333 | 81 | NA | NA | 127 | 88.6667 | 100 | 22 | 104 | NA | 48.6824 | 2 | NA | NA | 59 | 1 | 1 |
| 282372 | 0.3167 | 86 | NA | NA | 132 | 100.0000 | 100 | 19 | 128 | NA | 48.6824 | 2 | NA | NA | 59 | 1 | 1 |
| 282372 | 0.4000 | 86 | NA | NA | 138 | 100.3330 | 100 | 21 | 129 | NA | 48.6824 | 2 | NA | NA | 59 | 1 | 1 |
| 282372 | 0.4833 | 82 | NA | NA | 145 | 93.3333 | 100 | 27 | 116 | NA | 48.6824 | 2 | NA | NA | 59 | 1 | 1 |
| 282372 | 0.5167 | 84 | NA | NA | 150 | 97.0000 | 100 | 29 | 123 | NA | 48.6824 | 2 | NA | NA | 59 | 1 | 1 |
| 282372 | 0.5667 | 71 | NA | NA | 149 | 90.3333 | 100 | 30 | 129 | NA | 48.6824 | 2 | NA | NA | 59 | 1 | 1 |
| 282372 | 0.6500 | 92 | NA | NA | 139 | 106.6670 | 100 | 40 | 136 | NA | 48.6824 | 2 | NA | NA | 59 | 1 | 1 |
| 282372 | 0.7333 | 98 | NA | NA | 139 | 110.6670 | 100 | 37 | 136 | NA | 48.6824 | 2 | NA | NA | 59 | 1 | 1 |
| 282372 | 0.8167 | 82 | NA | NA | 129 | 104.3330 | 100 | 30 | 149 | NA | 48.6824 | 2 | NA | NA | 59 | 1 | 1 |
| 282372 | 0.9000 | 79 | NA | NA | 133 | 98.6667 | 100 | 35 | 138 | NA | 48.6824 | 2 | NA | NA | 59 | 1 | 1 |
| 282372 | 0.9833 | 85 | NA | NA | 133 | 99.6667 | 100 | 27 | 129 | NA | 48.6824 | 2 | NA | NA | 59 | 1 | 1 |
| 282372 | 1.0667 | 96 | NA | NA | 133 | 108.3330 | 100 | 19 | 133 | NA | 48.6824 | 2 | NA | NA | 59 | 1 | 1 |
| 282372 | 1.1500 | 86 | NA | NA | 129 | 105.3330 | 100 | 22 | 144 | NA | 48.6824 | 2 | NA | NA | 59 | 1 | 1 |
| 282372 | 1.2333 | 89 | 15 | NA | 126 | 111.0000 | 100 | 20 | 155 | NA | 48.6824 | 2 | NA | NA | 59 | 1 | 1 |
| 282372 | 1.3167 | 86 | NA | NA | 123 | 107.3330 | 100 | 22 | 150 | NA | 48.6824 | 2 | NA | NA | 59 | 1 | 1 |
| 282372 | 1.4000 | 79 | NA | NA | 118 | 97.3333 | 100 | 17 | 134 | NA | 48.6824 | 2 | NA | NA | 59 | 1 | 1 |
| 282372 | 2.2333 | 83 | NA | NA | 133 | 95.3333 | 100 | 24 | 120 | NA | 48.6824 | 2 | NA | NA | 59 | 1 | 1 |
| 282372 | 3.2333 | 81 | NA | NA | 146 | 89.0000 | 100 | 23 | 105 | 34.7222 | 48.6824 | 2 | NA | NA | 59 | 1 | 1 |
| 282372 | 3.7333 | 84 | NA | NA | 138 | 94.0000 | 100 | 16 | 111 | NA | 48.6824 | 2 | NA | NA | 59 | 1 | 1 |
| 282372 | 3.9833 | 70 | NA | NA | 103 | 85.0000 | NA | 19 | 122 | NA | 48.6824 | 2 | NA | NA | 59 | 1 | 1 |
| 282372 | 4.2333 | 73 | NA | NA | 113 | 84.0000 | NA | 20 | 106 | NA | 48.6824 | 2 | NA | NA | 59 | 1 | 1 |
| 282372 | 4.8167 | 39 | NA | NA | 141 | 50.6667 | NA | 21 | 86 | NA | 48.6824 | 2 | NA | NA | 59 | 1 | 1 |
| 282372 | 4.9000 | 34 | NA | NA | 136 | 43.3333 | NA | 20 | 92 | NA | 48.6824 | 2 | NA | NA | 59 | 1 | 1 |
| 282372 | 5.0667 | 57 | NA | NA | 119 | 66.3333 | 64 | NA | 101 | NA | 48.6824 | 2 | NA | NA | 59 | 1 | 1 |
| 282372 | 5.1500 | 63 | NA | NA | 120 | 71.0000 | NA | 24 | 93 | NA | 48.6824 | 2 | NA | NA | 59 | 1 | 1 |
| 282372 | 5.2333 | 55 | 15 | NA | 123 | 71.0000 | 66 | 23 | 77 | 34.1667 | 48.6824 | 2 | NA | NA | 59 | 1 | 1 |
| 282372 | 5.2667 | NA | NA | NA | NA | NA | NA | NA | NA | NA | 48.6824 | 2 | NA | 7.44 | 59 | 1 | 1 |
| 282372 | 5.3667 | NA | NA | 445 | NA | NA | 99 | NA | NA | NA | 48.6824 | 2 | NA | 7.34 | 59 | 1 | 1 |
| 282372 | 5.6000 | NA | NA | 343 | NA | NA | NA | NA | NA | NA | 48.6824 | 2 | NA | NA | 59 | 1 | 1 |
| 282372 | 6.8333 | NA | NA | 430 | NA | NA | NA | NA | NA | NA | 48.6824 | 2 | NA | 7.10 | 59 | 1 | 1 |
| 282372 | 7.7167 | NA | NA | 335 | NA | NA | 98 | NA | NA | NA | 48.6824 | 2 | NA | 7.20 | 59 | 1 | 1 |
| 282372 | 8.4000 | NA | NA | 257 | NA | NA | NA | NA | NA | NA | 48.6824 | 2 | NA | 7.11 | 59 | 1 | 1 |
| 282372 | 9.0833 | NA | NA | 367 | NA | NA | NA | NA | NA | NA | 48.6824 | 2 | NA | 7.18 | 59 | 1 | 1 |
| 282372 | 9.9500 | NA | NA | 428 | NA | NA | NA | NA | NA | NA | 48.6824 | 2 | NA | 7.21 | 59 | 1 | 1 |
| 282372 | 10.4833 | NA | NA | 312 | NA | NA | NA | NA | NA | NA | 48.6824 | 2 | NA | NA | 59 | 1 | 1 |
| 282372 | 10.7000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | 48.6824 | 2 | NA | 7.26 | 59 | 1 | 1 |
| 282372 | 11.8167 | 82 | NA | NA | 65 | 117.0000 | 100 | 10 | 156 | 33.8333 | 48.6824 | 2 | NA | NA | 59 | 1 | 1 |
| 282372 | 11.9000 | 91 | NA | NA | 65 | 120.0000 | NA | 10 | 164 | NA | 48.6824 | 2 | NA | NA | 59 | 1 | 1 |
| 282372 | 11.9833 | 88 | NA | NA | 69 | 122.0000 | NA | 11 | 174 | NA | 48.6824 | 2 | NA | NA | 59 | 1 | 1 |
| 282372 | 12.0667 | 100 | NA | NA | 70 | 105.0000 | NA | 19 | 170 | NA | 48.6824 | 2 | NA | NA | 59 | 1 | 1 |
| 282372 | 12.1500 | 0 | NA | NA | 67 | NA | NA | 19 | 0 | NA | 48.6824 | 2 | NA | NA | 59 | 1 | 1 |
| 282372 | 12.2333 | 106 | NA | NA | 66 | 143.0000 | NA | 21 | 202 | 34.4444 | 48.6824 | 2 | NA | NA | 59 | 1 | 1 |
| 282372 | 12.2667 | NA | NA | NA | NA | NA | 99 | NA | NA | NA | 48.6824 | 2 | NA | 7.32 | 59 | 1 | 1 |
| 282372 | 12.3167 | 106 | NA | NA | 69 | 139.0000 | NA | 21 | 194 | NA | 48.6824 | 2 | NA | NA | 59 | 1 | 1 |
| 282372 | 12.3833 | NA | NA | 185 | NA | NA | NA | NA | NA | NA | 48.6824 | 2 | NA | NA | 59 | 1 | 1 |
| 282372 | 12.4833 | 76 | NA | NA | 86 | 92.0000 | NA | 21 | 123 | NA | 48.6824 | 2 | NA | NA | 59 | 1 | 1 |
| 282372 | 12.7333 | 71 | NA | NA | 86 | 88.0000 | NA | 22 | 120 | NA | 48.6824 | 2 | NA | NA | 59 | 1 | 1 |
| 282372 | 12.9833 | 85 | NA | 100 | 89 | 86.6667 | NA | 22 | 154 | 34.7778 | 48.6824 | 2 | NA | 7.50 | 59 | 1 | 1 |
| 282372 | 13.2333 | 86 | 3 | NA | 87 | 119.0000 | NA | 23 | 186 | NA | 48.6824 | 2 | NA | NA | 59 | 1 | 1 |
| 282372 | 13.4833 | 85 | NA | NA | 98 | 116.0000 | NA | 23 | 184 | NA | 48.6824 | 2 | NA | NA | 59 | 1 | 1 |
| 282372 | 13.5667 | 87 | NA | NA | 99 | 118.0000 | NA | 19 | 182 | NA | 48.6824 | 2 | NA | NA | 59 | 1 | 1 |
| 282372 | 13.6500 | 87 | NA | NA | 100 | 118.0000 | NA | 21 | 187 | NA | 48.6824 | 2 | NA | NA | 59 | 1 | 1 |
| 282372 | 13.7333 | 87 | NA | NA | 100 | 117.0000 | NA | 23 | 186 | NA | 48.6824 | 2 | NA | NA | 59 | 1 | 1 |
| 282372 | 13.8167 | 85 | NA | NA | 100 | 113.0000 | NA | 23 | 178 | NA | 48.6824 | 2 | NA | NA | 59 | 1 | 1 |
| 282372 | 13.9000 | 82 | NA | NA | 99 | 110.0000 | NA | 18 | 170 | NA | 48.6824 | 2 | NA | NA | 59 | 1 | 1 |
| 282372 | 13.9833 | 85 | NA | NA | 100 | 113.0000 | NA | 23 | 178 | NA | 48.6824 | 2 | NA | NA | 59 | 1 | 1 |
| 282372 | 14.0667 | 82 | NA | NA | 99 | 110.0000 | NA | 18 | 170 | NA | 48.6824 | 2 | NA | NA | 59 | 1 | 1 |
| 282372 | 14.2333 | 84 | 3 | NA | 86 | 109.0000 | NA | 22 | 162 | 36.2000 | 48.6824 | 2 | NA | NA | 59 | 1 | 1 |
| 282372 | 14.7333 | 79 | NA | NA | 92 | 102.0000 | NA | 13 | 149 | 36.7000 | 48.6824 | 2 | NA | NA | 59 | 1 | 1 |
| 282372 | 15.2333 | 86 | NA | NA | 94 | 129.0000 | NA | 0 | 177 | 36.8000 | 48.6824 | 2 | NA | NA | 59 | 1 | 1 |
| 282372 | 15.2500 | NA | NA | 120 | NA | NA | NA | NA | NA | NA | 48.6824 | 2 | NA | NA | 59 | 1 | 1 |
| 282372 | 15.2833 | NA | NA | 119 | NA | NA | NA | NA | NA | NA | 48.6824 | 2 | NA | 7.47 | 59 | 1 | 1 |
| 282372 | 15.7333 | 69 | NA | NA | 94 | 95.0000 | NA | 24 | 148 | 36.8000 | 48.6824 | 2 | NA | NA | 59 | 1 | 1 |
| 282372 | 16.2333 | 74 | NA | NA | 96 | 100.0000 | NA | 24 | 156 | 36.9000 | 48.6824 | 2 | NA | NA | 59 | 1 | 1 |
| 282372 | 16.5333 | NA | NA | 103 | NA | NA | NA | NA | NA | NA | 48.6824 | 2 | NA | 7.47 | 59 | 1 | 1 |
| 282372 | 17.2333 | 63 | 3 | NA | 96 | 81.0000 | NA | 25 | 121 | 37.1000 | 48.6824 | 2 | NA | NA | 59 | 1 | 1 |
| 282372 | 18.2333 | 74 | 3 | NA | 96 | 95.0000 | NA | 17 | 140 | 37.3000 | 48.6824 | 2 | NA | NA | 59 | 1 | 1 |
| 282372 | 19.2333 | 73 | NA | NA | 97 | 97.0000 | NA | 4 | 147 | 37.3000 | 48.6824 | 2 | NA | NA | 59 | 1 | 1 |
| 282372 | 19.5167 | NA | NA | 47 | NA | NA | NA | NA | NA | NA | 48.6824 | 2 | NA | 7.53 | 59 | 1 | 1 |
| 282372 | 19.5333 | NA | NA | 46 | NA | NA | NA | NA | NA | NA | 48.6824 | 2 | NA | NA | 59 | 1 | 1 |
| 282372 | 19.7333 | 65 | NA | NA | 95 | 84.0000 | NA | 5 | 126 | 37.4000 | 48.6824 | 2 | NA | NA | 59 | 1 | 1 |
| 282372 | 20.2333 | 64 | 3 | 118 | 112 | 78.0000 | 94 | 11 | 107 | 37.4000 | 48.6824 | 2 | NA | NA | 59 | 1 | 1 |
| 282372 | 20.7333 | 61 | NA | NA | 101 | 75.0000 | 97 | 11 | 102 | 37.5000 | 48.6824 | 2 | NA | NA | 59 | 1 | 1 |
| 282372 | 20.8833 | NA | NA | 77 | NA | NA | 96 | NA | NA | NA | 48.6824 | 2 | NA | 7.37 | 59 | 1 | 1 |
| 282372 | 21.2333 | 58 | 3 | NA | 105 | 72.0000 | 98 | 12 | 98 | 37.6000 | 48.6824 | 2 | NA | NA | 59 | 1 | 1 |
| 282372 | 22.2333 | 34 | NA | NA | 94 | 34.0000 | 92 | 23 | 39 | 37.6000 | 48.6824 | 2 | NA | NA | 59 | 1 | 1 |
| 282372 | 23.2167 | NA | NA | 85 | NA | NA | NA | NA | NA | NA | 48.6824 | 2 | NA | NA | 59 | 1 | 1 |
| 282372 | 23.2333 | 56 | NA | NA | 103 | 64.0000 | NA | 16 | 89 | 37.5000 | 48.6824 | 2 | NA | NA | 59 | 1 | 1 |
| 282372 | 23.6500 | NA | NA | 86 | NA | NA | 86 | NA | NA | NA | 48.6824 | 2 | NA | 7.36 | 59 | 1 | 1 |
| 282372 | 23.7333 | 57 | NA | NA | 99 | 69.0000 | NA | 12 | 105 | 37.4000 | 48.6824 | 2 | NA | NA | 59 | 1 | 1 |
| 282372 | 24.2333 | 64 | 8 | NA | 100 | 76.0000 | 83 | 0 | 108 | 37.4000 | 48.6824 | 2 | NA | NA | 59 | 2 | 1 |
| 282372 | 24.2833 | NA | NA | NA | NA | NA | NA | NA | NA | NA | 48.6824 | 2 | NA | 5.00 | 59 | 2 | 1 |
| 282372 | 24.4833 | 62 | NA | NA | 112 | 71.3333 | 99 | 10 | 106 | 37.4000 | 48.6824 | 2 | NA | NA | 59 | 2 | 1 |
| 282372 | 24.7333 | 55 | NA | NA | 104 | 62.3333 | 100 | 16 | 85 | 37.4000 | 48.6824 | 2 | NA | NA | 59 | 2 | 1 |
| 282372 | 24.9833 | 62 | NA | NA | 112 | 72.0000 | 95 | 15 | 90 | 37.4000 | 48.6824 | 2 | NA | NA | 59 | 2 | 1 |
| 282372 | 25.2333 | 73 | 6 | NA | 113 | 90.0000 | 94 | 6 | 124 | 37.4000 | 48.6824 | 2 | NA | NA | 59 | 2 | 1 |
| 282372 | 25.6500 | NA | NA | 79 | NA | NA | NA | NA | NA | NA | 48.6824 | 2 | NA | 7.31 | 59 | 2 | 1 |
| 282372 | 25.9833 | NA | NA | 73 | NA | NA | NA | NA | NA | NA | 48.6824 | 2 | NA | NA | 59 | 2 | 1 |
| 282372 | 26.2333 | 61 | NA | NA | 111 | 74.0000 | 94 | 16 | 114 | 37.6000 | 48.6824 | 2 | NA | NA | 59 | 2 | 1 |
| 282372 | 27.2333 | 62 | NA | NA | 110 | 74.3333 | 96 | 6 | 101 | 37.7000 | 48.6824 | 2 | NA | NA | 59 | 2 | 1 |
| 282372 | 27.7333 | 73 | NA | NA | 118 | 84.0000 | 94 | 2 | 103 | 37.7000 | 48.6824 | 2 | NA | NA | 59 | 2 | 1 |
| 282372 | 28.2333 | 56 | NA | NA | 96 | 63.0000 | 97 | 14 | 85 | 37.6000 | 48.6824 | 2 | NA | NA | 59 | 2 | 1 |
| 282372 | 29.2333 | 57 | 7 | NA | 107 | 66.0000 | 96 | 2 | 90 | 37.7000 | 48.6824 | 2 | NA | NA | 59 | 2 | 1 |
| 282372 | 29.4833 | 55 | NA | NA | 106 | 60.0000 | 97 | 0 | 75 | 37.8000 | 48.6824 | 2 | NA | NA | 59 | 2 | 1 |
| 282372 | 29.6000 | NA | NA | NA | NA | NA | 97 | NA | NA | NA | 48.6824 | 2 | NA | 7.30 | 59 | 2 | 1 |
| 282372 | 29.7333 | 59 | NA | NA | 109 | 66.0000 | 96 | 0 | 78 | 37.7000 | 48.6824 | 2 | NA | NA | 59 | 2 | 1 |
| 282372 | 29.8167 | 62 | NA | NA | 113 | 69.0000 | 95 | 17 | 81 | 37.6000 | 48.6824 | 2 | NA | NA | 59 | 2 | 1 |
| 282372 | 30.2333 | 66 | NA | NA | 120 | 75.0000 | 96 | 16 | 88 | 37.6000 | 48.6824 | 2 | NA | NA | 59 | 2 | 1 |
- 處理異常樣本, 規則設定同原始模型.
dfr <- dfr %>%
arrange(icustay, hours) %>%
mutate(diastolic_BP = ifelse(diastolic_BP > 300, NA, diastolic_BP),
glucose = ifelse(glucose > 2000, NA, glucose),
heart_rate = ifelse(heart_rate > 400, NA, heart_rate),
mean_BP = ifelse(mean_BP > 300 | mean_BP < 0, NA, mean_BP),
systolic_BP = ifelse(systolic_BP > 10000, NA, systolic_BP),
temperature = ifelse(temperature > 50 | temperature < 20, NA, temperature),
pH = ifelse(pH > 7.8 | pH < 6.8, NA, pH),
respiratory_rate = ifelse(respiratory_rate > 300, NA, respiratory_rate),
oxygen_saturation = ifelse(oxygen_saturation > 100 | oxygen_saturation < 0, NA, oxygen_saturation)
)依據前述demoCodeID_only設定, 依照原始分析報告設定規則, 每個vital sign至少須有兩個記錄值.
如果demoCodeID_only設定為FALSE, 則只會產生demoCodeID供額外的測試使用, 而不進行樣本篩選.
demoCodeID <- dfr %>%
select(-age, -gender, -height, -weight, -day) %>%
mutate(hours = floor(hours)) %>%
group_by(icustay, hours) %>%
summarise_all(list(~mean(., na.rm = T))) %>%
group_by(icustay) %>%
summarise_all(list(~sum(!is.na(.)))) %>%
filter_all(all_vars(. >= 2)) %>%
ungroup() %>%
select(icustay)
if(demoCodeID_only == TRUE){
#進一步只使用1000個ID
demoCodeID <- demoCodeID[sample(1:nrow(demoCodeID), 1000), ]
dfr <- dfr %>%
mutate(match = match(icustay, demoCodeID$icustay, nomatch = NA)) %>%
filter(!(is.na(match))) %>%
select(-match)
}Part 3: 特徵工程
- 每位患者最早及最晚記錄時間點的特徵值, 包含除了day以外的所有特徵.
#最晚時間點的特徵, 也是模型baseline使用的特徵. 若最後記錄點是NA, 則使用最靠近該時間點的前一筆記錄值.
#df是最後建模使用的資料集
df <- dfr %>%
group_by(icustay) %>%
tidyr::fill(.data$diastolic_BP:.data$weight, .direction = "down") %>%
filter(row_number() == max(row_number())) %>%
ungroup() %>%
select(-day) %>%
rename_at(vars(-icustay, -hours, -age, -gender, -height, -weight, -mortality), ~ paste0(., "_last"))
baseFeatures <- colnames(df)
#最早時間點的特徵. 若最早記錄點是NA, 則使用最靠近該時間點的下一筆記錄值.
#vitalSign包含各個可能隨時間變化的特徵欄位名稱
vitalSign <- c("diastolic_BP", "glasgow_coma_scale", "glucose", "heart_rate", "mean_BP",
"oxygen_saturation", "respiratory_rate", "systolic_BP", "temperature", "pH")
df <- dfr %>%
select_at(c("icustay", vitalSign)) %>%
group_by(icustay) %>%
tidyr::fill(.data$diastolic_BP:.data$pH, .direction = "up") %>%
filter(row_number() == 1) %>%
ungroup() %>%
rename_at(vars(-icustay), ~ paste0(., "_1st")) %>%
right_join(df, by = "icustay")- 建立統計型特徵mean, sd, min, max, range, 以及vital sign變化速率的特徵slope.
#mean
df <- dfr %>%
select_at(c("icustay", vitalSign)) %>%
group_by(icustay) %>%
summarise_all(mean, na.rm = TRUE) %>%
ungroup() %>%
rename_at(vars(-icustay), ~ paste0(., "_median")) %>%
right_join(df, by = "icustay")
#sd
df <- dfr %>%
select_at(c("icustay", vitalSign)) %>%
group_by(icustay) %>%
summarise_all(sd, na.rm = TRUE) %>%
ungroup() %>%
rename_at(vars(-icustay), ~ paste0(., "_sd")) %>%
right_join(df, by = "icustay")
#min
df <- dfr %>%
select_at(c("icustay", vitalSign)) %>%
group_by(icustay) %>%
summarise_all(min, na.rm = TRUE) %>%
ungroup() %>%
rename_at(vars(-icustay), ~ paste0(., "_min")) %>%
mutate_all(list(~na_if(., Inf))) %>%
right_join(df, by = "icustay")
#max
df <- dfr %>%
select_at(c("icustay", vitalSign)) %>%
group_by(icustay) %>%
summarise_all(max, na.rm = TRUE) %>%
ungroup() %>%
rename_at(vars(-icustay), ~ paste0(., "_max")) %>%
mutate_all(list(~na_if(., -Inf))) %>%
right_join(df, by = "icustay")
#range
df <- dfr %>%
select_at(c("icustay", vitalSign)) %>%
group_by(icustay) %>%
summarise_all(RANGE) %>%
ungroup() %>%
rename_at(vars(-icustay), ~ paste0(., "_range")) %>%
mutate_all(list(~na_if(., -Inf))) %>%
right_join(df, by = "icustay")
#計算所有slope = (v1-v0)/(t1-t0)
for(i in vitalSign){
df <- dfr %>%
select_at(c("icustay", "hours", i)) %>%
rename_at(vars(-icustay, -hours), ~ "vsign") %>%
filter(!is.na(vsign)) %>%
group_by(icustay) %>%
mutate(int = hours - lag(hours),
index_slope = (vsign - lag(vsign))/int,
slope_max = max(index_slope, na.rm = TRUE),
slope_min = min(index_slope, na.rm = TRUE)
) %>%
filter(row_number()==1) %>%
ungroup() %>%
mutate_all(list(~na_if(., Inf))) %>%
mutate_all(list(~na_if(., -Inf))) %>%
select(icustay, slope_max, slope_min) %>%
rename_at(vars(-icustay), ~ paste0(., "_", i)) %>%
right_join(df, by = "icustay")
}3.1 時間序列特徵: kmeans分群. 參考原始分析報告, 各患者的各項vital sign獨立進行minmax normalization, 並以建模會用的train group icustay ID完成kmeans訓練後, 再去predict test group的分群中心.
3.2 k為評估時的群心數, 計算k = 2~15時的betweenss/totss分數, 當分數(k+1) - 分數(k) < 0.01時, K = k, K為最後設定的群心數.
3.3 在此處只展示了一個vital sign的分群結果, 但最後建模和比較會使用所有vital sign的分群結果.
#1 產生時間序列資料, 可用於kmeans及dtw分群
#1.1 先產生id x 48 hr的data frame: dft
icustay <- rep(unique(dfr$icustay), (48/sampleRate + 1))
dft <- data.frame(icustay) %>%
arrange(icustay) %>%
group_by(icustay) %>%
mutate(hours = (row_number()-1)*sampleRate) %>%
ungroup()
rm(icustay)
gc()## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 2461465 131.5 6279357 335.4 6279357 335.4
## Vcells 7900161 60.3 70661148 539.2 88126515 672.4
#1.2 對vital sign產生by hour (或單位時間, 依sampleRate參數變化) 的中位數
#1.2 如果某id某欄全無資料, 最後分群該值顯示為NA
#1.2 將資料併到dft, 並以前一筆非NA數值fill NA, 若無則用下一筆出現的非NA值fill NA
dft <- dfr %>%
select_at(c("icustay", "hours", vitalSign)) %>%
mutate(hours = floor(hours/sampleRate)*sampleRate) %>%
group_by(icustay, hours) %>%
summarise_all(median, na.rm = TRUE) %>%
ungroup() %>%
right_join(dft, by = c("icustay", "hours")) %>%
mutate_all(list(~na_if(., "NaN"))) %>%
group_by(icustay) %>%
tidyr::fill(.data$diastolic_BP:.data$pH, .direction = "down") %>%
tidyr::fill(.data$diastolic_BP:.data$pH, .direction = "up")
#1.3 產生建模時的訓練組icustayID
set.seed(777)
trainidx <- createDataPartition(df$mortality, p=0.8, list = FALSE)
trainID <- df[trainidx, "icustay", drop=TRUE]
#2 以vital sign名稱為變數, 進行kmeans分群loop, 並將分群結果併到df
#2 將vitalSign[1]改成vitalSign或者vitalSign[1:10]就可以計算所有結果
for(i in vitalSign[1]){
#2.1 minmax normalization
dfk <- dft %>%
select_at(c("icustay", "hours", i)) %>%
rename_at(vars(-icustay, -hours), ~ "vsign") %>%
#刪除連一個觀察值都沒有的樣本
filter(!(is.na(vsign))) %>%
group_by(icustay) %>%
#只有一個觀察值的在下式會變成NaN, 亦刪除
mutate(vsign = (vsign - min(vsign))/(max(vsign) - min(vsign))) %>%
filter(!(is.na(vsign))) %>%
ungroup()
#2.2 dcast to id with 48H columns
dfk <- reshape2::dcast(dfk, formula = icustay ~ hours, value.var = c("vsign")) %>%
rename_at(vars(-icustay), ~ paste0("H",.))
#如果dfk內有任何Inf, 停止loop
if(checkmate::anyInfinite(dfk)) {
print("Inf in dfk for clustering")
break
}
#2.3 分成train與test組別
dfk_train <- subset(dfk, icustay %in% trainID)
dfk_test <- subset(dfk, !(icustay %in% trainID))
#2.4 Kmeans clustering of time series data
#2.4.1 計算k = 2~15時的betweenss/totss分數, 當分數(k+1) - 分數(k) < 0.01時, 設定K = k
cluster <- NA
avg_score <- NA
sd_score <- NA
avg_minsize <- NA
avg_meansize <- NA
avg_maxsize <- NA
scoreBoard <- data.frame(cluster, avg_score, sd_score, avg_minsize, avg_meansize, avg_maxsize)
for(j in 2:15){
score_j <- NULL
min_size_j <- NULL
mean_size_j <- NULL
max_size_j <- NULL
for(k in 1:30){
ns <- 1
while(ns <= j){
eval_K <- kmeans(dfk_train[,2:50], j, iter.max = 80, nstart = ns,
algorithm = c("Hartigan-Wong"), trace=FALSE)
if(eval_K$ifault == 0) {break}
ns <- ns + 1
}
score_j <- c(score_j, eval_K[["betweenss"]]/eval_K[["totss"]])
min_size_j <- c(min_size_j, min(eval_K$size))
mean_size_j <- c(mean_size_j, mean(eval_K$size))
max_size_j <- c(max_size_j, max(eval_K$size))
}
scoreBoard[(j-1),1:6] <- c(j, round(mean(score_j), digits = 3), round(sd(score_j),digits =3),
ceiling(mean(min_size_j)), ceiling(mean(mean_size_j)), ceiling(mean(max_size_j))
)
}
scoreBoard$inc <- scoreBoard$avg_score - lag(scoreBoard$avg_score)
#產生最後的K及kmeans結果final_K
K <- scoreBoard$cluster[first(which(scoreBoard$inc < 0.01)) - 1]
ns <- 1
while(ns <= K){
final_K <- kmeans(dfk_train[,2:50], K, iter.max = 30, nstart = ns,
algorithm = c("Hartigan-Wong"), trace=FALSE)
if(final_K$ifault == 0) {break}
ns <- ns + 1
}
dfk_train$CL <- final_K$cluster
#使用flexclust轉換final_K成為kcca物件再進行test組的fit (因為stats kmean的物件不能直接prediction)
final_K <- flexclust::as.kcca(final_K, data=dfk_train[,2:50])
dfk_test$CL <- flexclust::predict(final_K, newdata=dfk_test[,2:50])
cat(paste0("kmeans to ", i, " is done...\n"))
#將分類結果merge到df
df <- dfk_train %>%
bind_rows(dfk_test) %>%
select(icustay, CL) %>%
rename_at(vars(-icustay), ~ paste0(i, "_KMCL")) %>%
right_join(df, by = "icustay")
rm(dfk, dfk_train, dfk_test, cluster, avg_score, sd_score, avg_minsize, avg_meansize, avg_maxsize,
scoreBoard, score_j, min_size_j, mean_size_j, max_size_j, ns, eval_K, K, final_K)
gc()
}## kmeans to diastolic_BP is done...
4.1 時間序列特徵: 基於DTW距離演算法的分群. 參考原作者說明, 各患者的各項vital sign獨立進行zscore normalization, 並以建模會用的train group icustay ID完成kmeans訓練後, 再去predict test group的分群中心.
4.2 k為評估時的群心數, 以不同指標計算k = 2~10時的得分, 並以投票法決定最後設定的群心數. 如果有同分的情況則選擇k較大者.
4.3 取得各序列與群心的距離做為額外特徵.
4.4 在此處只展示了一個以dtw演算法分群的vital sign結果, 但在最後建模時, 會使用所有vital sign的centroids和sbd演算法的分群結果.
#將vitalSign[1]改成vitalSign或者vitalSign[1:10]就可以計算所有結果
for(i in vitalSign[1]){
#3.1 產生data, 並刪除全NA的樣本
dfk <- dft %>%
select_at(c("icustay", "hours", i)) %>%
rename_at(vars(-icustay, -hours), ~ "vsign") %>%
filter(!(is.na(vsign))) %>%
#以下用於刪除僅有一個唯一值的樣本 (無法zscore)
group_by(icustay) %>%
mutate(len = length(unique(vsign))) %>%
ungroup() %>%
#刪除vsign唯一值 <= 1
filter(!(len == 1)) %>%
select(-len)
#3.2 dcast to id with 48H columns
dfk <- reshape2::dcast(dfk, formula = icustay ~ hours, value.var = c("vsign")) %>%
rename_at(vars(-icustay), ~ paste0("H",.))
#如果dfk內有任何Inf, 停止loop
if(checkmate::anyInfinite(dfk)) {
print("Inf in dfk for clustering")
break
}
#3.3 分成train與test組別
dfk_train <- subset(dfk, icustay %in% trainID)
dfk_test <- subset(dfk, !(icustay %in% trainID))
#3.4 設定DTW參數並計算結果
#主要確認分群演算法, 距離計算演算法及群心產生的演算法
#window size以10%計算 (smaller might be better), SBD不用設定
#distance, centrodi = ("dtw_basic", "dba", zcore/null), ("gak", "dba", zscore),
#( k-Shape algorithm: "sbd", "shpae", zscore, type = "p")
#如果type要改用"h", 會需要額外的control參數(hierarchical_control)
#type = "p" or "h". "h"時會忽略centroid, if it wasn't a function
#type = "h" and preporc = NULL的分群會較為分佈不均, 但Sil, DB得分較好, 計算較快
#"p": control = partitional_control(iter.max = 200L)
#"h": control = hierarchical_control(method = "average")
K_eval <- c(3:10)
type_use <- "p"
dis_use <- "dtw_basic"
cen_use <- "dba"
Zsc <- "zscore"
TSCL <- tsclust(series = dfk_train[ ,2:50], type = type_use, k = K_eval, preproc = zscore,
distance = dis_use, centroid = cen_use, window.size = floor((48/sampleRate + 1)*0.1)-1,
seed = 777, trace = FALSE,
control = partitional_control(iter.max = 200L)
)
#type = "p"限定: 將沒有收斂的list刪除, 此外由於後面會命名, 因此也將K_eval中代表未收斂的K移除
if(type_use == "p"){
del_k <- NULL
for(j in 1:length(K_eval)){
if(!(TSCL[[j]]@converged)){
del_k <- c(del_k, j)
}
if(j == length(K_eval) && !(is.null(del_k))){
TSCL <- TSCL[-del_k]
K_eval <- K_eval[! K_eval %in% (del_k + min(K_eval) - 1)]
}
}
gc()
}
#3.5 CVI: clustering evaluation
#將type從internal改為去掉SF (出現兩次0分, 又是global, 分數計算較久)
#CH, SF可能不適合centroid = DBA or shape (centroid有random性質)
#DB, DBstar非常快速
#附註: CVI指數縮寫說明及其指數目標 (極大化或極小化)
#Crisp partitions
#"Sil" : Silhouette max
#"SF" : Score Function max (global)
#"CH" : Calinski-Harabasz max (global)
#"DB" : Davies-Bouldin min
#"DBstar" : Modified Davies-Bouldin min
#"D" : Dunn max
#"COP" : COP min
names(TSCL) <- paste0("K", K_eval)
if(cen_use == "dba" || cen_use == "shape"){
eval_method <- c("Sil", "DB", "DBstar", "D", "COP")
} else {
eval_method <- c("Sil", "CH", "DB", "DBstar", "D", "COP")
}
evalMtx <- sapply(TSCL, cvi, type = eval_method)
#顯示原始分數
print(evalMtx)
#以cvi投票最多數者決定K值
#同分者取K值較大者 (初步判斷: 不同cluster有類似性質, 對最後ML不會有太大影響)
#或是看跑票的次佳解在哪個K最高分
cviList <- rownames(evalMtx)
vote_k <- NULL
for(k in 1:length(cviList)){
if(cviList[k] %in% c("Sil", "SF", "CH", "D")){
vote_k <- c(vote_k, which(evalMtx[k,] == max(evalMtx[k,])))
} else {
vote_k <- c(vote_k, which(evalMtx[k,] == min(evalMtx[k,])))
}
}
#各欄的投票結果, 依順序呈現
vote_tbl <- sort(table(vote_k),decreasing=TRUE)
#maxVote = 最高票的票數
maxVote <- max(vote_tbl)
#選出獲得最高票票數的最後一個欄位, 再從vote_k產生屬於該欄位的K代碼
K <- last(names(which(vote_tbl == maxVote)))
K_names <- first(names(which(vote_k == as.integer(K))))
#顯示投票結果
print(vote_tbl)
cat(paste0("the most voted is ", K_names, "\n"))
#顯示該K群下, 各群總ID數
print(table(TSCL[[K_names]]@cluster))
#存檔圖形
jpeg(filename = paste0("fig/centroids/",paste(type_use, dis_use, cen_use, Zsc, i, sep = "_"), ".jpg"),
height = 700, width = 1400)
plot(TSCL[[K_names]], type = "centroids")
dev.off()
#將vote最高的cluster/distance定義merge到dfk_train和dfk_test
dfk_train$TSCL <- TSCL[[K_names]]@cluster
dfk_train$TSCL_DIS <- TSCL[[K_names]]@cldist
dfk_test$TSCL <- predict(TSCL[[K_names]], newdata = dfk_test[,2:50])
#計算test組別的distance(下式dtw_basic限定)
dfk_test$TSCL_DIS <- NA
dfk_test_z <- zscore(dfk_test[,2:50])
for(j in 1:nrow(dfk_test)){
cen_at_j <- dfk_test$TSCL[j]
#因x,y都已zscore, 故normalize = NULL
dfk_test$TSCL_DIS[j] <- dtw_basic(as.numeric(dfk_test_z[j,]),
TSCL[[K_names]]@centroids[[cen_at_j]],
window.size = floor((48/sampleRate + 1)*0.1)-1, norm = "L1",
step.pattern = dtw::symmetric2, backtrack = FALSE,
normalize = NULL, error.check = TRUE)
}
#將分類結果merge到df
df <- dfk_train %>%
bind_rows(dfk_test) %>%
select(icustay, TSCL, TSCL_DIS) %>%
rename_at(vars(-c(1,3)), ~ paste0(i, "_DTWCL")) %>%
rename_at(vars(-c(1,2)), ~ paste0(i, "_DTWCL_DIS")) %>%
right_join(df, by = "icustay")
#完成時提醒並產生圖形
cat(paste0("DTW to ", i, " is done...\n"))
plot(TSCL[[K_names]], type = "centroids")
rm(dfk, dfk_train, dfk_test, type_use, dis_use, cen_use, Zsc, TSCL, del_k, evalMtx,
cviList, vote_k, vote_tbl, K, K_eval, K_names, maxVote, dfk_test_z, cen_at_j)
gc()
}## K3 K4 K5 K6 K7 K8
## Sil 0.1072738 0.07938733 0.06676769 0.04938117 0.05046659 0.0438226
## DB 2.1372444 2.43942787 2.40986400 2.52331901 2.45823721 2.5321695
## DBstar 2.1544960 2.46799401 2.45521071 2.60192397 2.54492938 2.6126641
## D 0.1891252 0.25371383 0.17226610 0.17344802 0.26798591 0.1939582
## COP 0.5429106 0.52067542 0.50830281 0.49333997 0.47356019 0.4770591
## K9 K10
## Sil 0.05077695 0.04001242
## DB 2.41475733 2.52882863
## DBstar 2.46604557 2.61002962
## D 0.17647530 0.24128281
## COP 0.46198171 0.46553136
## vote_k
## 1 5 7
## 3 1 1
## the most voted is K3
##
## 1 2 3
## 260 230 310
## DTW to diastolic_BP is done...
Part 4: 建模步驟說明
藉由特徵篩選, 我們可以比較baseline, 基礎特徵, KMCL, DTWCL/DIS, SBDCL/DIS五種不同特徵組合下, 對於mortality預測模型的影響.
1.1 baseline: 只使用病人所有原始特徵最後的記錄值.
1.2 基礎特徵: 包含始末記錄值, 統計型特徵, 最大最小斜率變化, 不含時間序列產生的特徵.
1.3 KMCL: 除基礎特徵外, 還包含了kmeans對時間序列分群產生的特徵.
1.4 DTWCL/DIS: 除基礎特徵外, 還包含了DTW為基礎的時間序列分群特徵及距離群心距離的特徵.
1.5 SBDCL/DIS: 除基礎特徵外, 還包含了SBD為基礎的時間序列分群特徵及距離群心距離的特徵.
在建模參數部分, 我們多數使用default, 但限定max_depth = 2.
藉由調整threshold, 使cv final round的prediction的sensitivity與specificity盡量接近, 最後test set則使用該threshold來評估表現.
#1 自訂函數: 取得cv所有prediction的結果, 用於評估threshold
get_lgbm_cv_preds <- function(cv){
rows <- length(cv$boosters[[1]]$booster$.__enclos_env__$private$valid_sets[[1]]$.__enclos_env__$private$used_indices)+length(cv$boosters[[1]]$booster$.__enclos_env__$private$train_set$.__enclos_env__$private$used_indices)
preds <- numeric(rows)
for(i in 1:length(cv$boosters)){
preds[
cv$boosters[[i]]$booster$.__enclos_env__$private$valid_sets[[1]]$.__enclos_env__$private$used_indices] <-
cv$boosters[[i]]$booster$.__enclos_env__$private$inner_predict(2)
}
return(preds)
}
#2 載入完整特徵資料 (可在前端加#disable並使用之前步驟重新產生的df)
suppressMessages(df <- read_csv("modeling_final3_df.csv", guess_max = 300000))
#3 前處理
#3.1 將名字不同,但各row數值相同的column, 刪除到只剩一組column
df <- df[!duplicated(unclass(df))]
#3.2 刪除NZV, 設定每個特徵最少需要出現的unique值
#3.2 當前設定為刪除unique值 = 1的特徵
nzv_r <- nearZeroVar(df, freqCut = (nrow(df)-1)/1, uniqueCut = 2/nrow(df), saveMetrics= TRUE)
nzv <- nearZeroVar(df, freqCut = (nrow(df)-1)/1, uniqueCut = 2/nrow(df))
if (as.numeric(sum(nzv_r$nzv=="TRUE")) > 0) df <- df[, -nzv]
#3.3 篩選資料 (若demoCodeID_only == TRUE, 則都只使用demoCodeID來進行建模)
demoCodeID_only_ML <- FALSE
if(demoCodeID_only_ML) {
dft <- df %>%
right_join(demoCodeID, by = "icustay")
} else {
dft <- df
}
#3.4 篩選特徵: 依據cluster種類
featureIDX <- "baseline"
if(featureIDX == "baseline"){
dft <- dft %>% select_at(baseFeatures)
} else if(featureIDX == "baseFE") {
dft <- dft[ , !grepl("CL", colnames(dft))]
} else if(featureIDX == "KMCL") {
dft <- dft[ , !grepl("DTWCL|SBDCL", colnames(dft))]
} else if(featureIDX == "DTWCL") {
dft <- dft[ , !grepl("KMCL|SBDCL", colnames(dft))]
} else if(featureIDX == "SBDCL") {
dft <- dft[ , !grepl("KMCL|DTWCL", colnames(dft))]
}
#4 建模準備
#4.1 使用lightGBM的categorical_feature功能, 但類別值須先轉換為非負整數 (在MIMIC-III此處不需要)
#4.1 將clustering結果轉為類別型特徵, 但移除帶有DIS的欄位名稱
categoricals.vec = colnames(dft)[grepl("CL|gender", colnames(dft))]
categoricals.vec <- categoricals.vec[-grep("_DIS", categoricals.vec)]
#4.2 產生train & test set
set.seed(777)
trainidx <- createDataPartition(dft$mortality, p=0.8, list = FALSE)
train <- dft[trainidx, ]
test <- dft[-trainidx, ]
#4.3 以options讓train, test的NA可以進入轉換後的sparse_matrix
previous_na_action <- options('na.action')
options(na.action='na.pass')
sparse_matrix <- sparse.model.matrix(mortality + icustay~.-1, data = train)
label_train <- train[,c('mortality'), drop=T]
lgb.test <- sparse.model.matrix(mortality + icustay~.-1, data = test)
label_test <- test[,c('mortality'), drop=T]
options(na.action=previous_na_action$na.action)
lgb.train = lgb.Dataset(data = as.matrix(sparse_matrix), label = label_train)
#4.4 設定lgb參數
lgb.grid = list(objective = "binary",
metric = "auc",
min_sum_hessian_in_leaf = 1,
feature_fraction = 0.6,
bagging_fraction = 0.6,
bagging_freq = 1,
boosting = "gbdt",
max_bin = 255,
lambda_l1 = 0,
lambda_l2 = 0,
min_data_in_bin = 3,
min_gain_to_split = 0,
min_data_in_leaf = 3,
is_unbalance = TRUE
)
#5 cross-validation with train set
#5.1 參數設定與cv run
lgb.model.cv = lgb.cv(params = lgb.grid, data = lgb.train, learning_rate = 0.01, num_leaves = 31,
max_depth = 2, num_threads = 6 , nrounds = 1500, early_stopping_rounds = 30,
eval = "auc", nfold = 5, stratified = TRUE, categorical_feature = categoricals.vec)
#5.2 取得最佳iteration
best.iter = lgb.model.cv$best_iter
#5.3 依據設定的threshold (pos_THR), 以cv所有的prediction產生混淆矩陣與f1 score
pred_cv <- get_lgbm_cv_preds(lgb.model.cv)
pos_THR <- 0.5
CFmatrix_cv <- confusionMatrix(as.factor(ifelse(pred_cv <= pos_THR, 0, 1)), as.factor(label_train),
positive = levels(as.factor(label_train))[2])
print(CFmatrix_cv)
F1_in_Train <- MLmetrics::F1_Score(as.factor(label_train), as.factor(ifelse(pred_cv <= pos_THR, 0, 1)), positive = "1")
cat(paste0("F1 score for train set is ", F1_in_Train, "\n"))
#5.4 作圖ROC及計算auc
print("roc for train set")
#plot(pROC::roc(response = label_train,
#predictor = pred_cv,
#levels=c(0, 1)),lwd=1.5)
roc_obj_train <- roc(label_train, pred_cv)
print("auc for train set")
auc(roc_obj_train)
#6以train set建模評估在test set的表現
#6.1 建立訓練模型
lgb.model = lgb.train(params = lgb.grid, data = lgb.train, learning_rate = 0.01,
max_depth = 2, num_leaves = 31, num_threads = 6 , nrounds = best.iter,
eval_freq = 5, eval = "auc", categorical_feature = categoricals.vec)
#6.2 對test gp預測
pred <- predict(lgb.model, lgb.test)
#6.3 confusionMatrix & f1 score
CFmatrix <- confusionMatrix(as.factor(ifelse(pred <= pos_THR, 0, 1)), as.factor(label_test),
positive = levels(as.factor(label_test))[2])
print(CFmatrix)
F1_in_Test <- MLmetrics::F1_Score(as.factor(label_test), as.factor(ifelse(pred <= pos_THR, 0, 1)), positive = "1")
cat(paste0("F1 score for test set is ", F1_in_Test, "\n"))
#6.4 作圖ROC及計算auc
print("roc for test set")
#plot(pROC::roc(response = label_test,
#predictor = pred,
#levels=c(0, 1)),lwd=1.5)
roc_obj <- roc(label_test, pred)
print("auc for train set")
auc(roc_obj)
#feature重要性
lgb_imp <- lgb.importance(lgb.model, percentage = TRUE)
head(lgb_imp, n = 30L)Part 5: 結果報告
- 以baseline特徵組合展示建模結果.
## [1] "Results from train group"
## Threshold = 0.5
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 3523 212
## 1 1086 724
##
## Accuracy : 0.7659
## 95% CI : (0.7545, 0.777)
## No Information Rate : 0.8312
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.392
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.7735
## Specificity : 0.7644
## Pos Pred Value : 0.4000
## Neg Pred Value : 0.9432
## Prevalence : 0.1688
## Detection Rate : 0.1306
## Detection Prevalence : 0.3264
## Balanced Accuracy : 0.7689
##
## 'Positive' Class : 1
##
## F1 score for train group is 0.527312454479243
## [1] "roc for train group"
## [1] "auc for train group"
## Area under the curve: 0.8496
## [1] "Results from test group"
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 864 56
## 1 271 195
##
## Accuracy : 0.7641
## 95% CI : (0.7408, 0.7862)
## No Information Rate : 0.8189
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.4035
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.7769
## Specificity : 0.7612
## Pos Pred Value : 0.4185
## Neg Pred Value : 0.9391
## Prevalence : 0.1811
## Detection Rate : 0.1407
## Detection Prevalence : 0.3362
## Balanced Accuracy : 0.7691
##
## 'Positive' Class : 1
##
## F1 score for test group is 0.543933054393305
## [1] "roc for test group"
## [1] "auc for train group"
## Area under the curve: 0.8393
## [1] "top 10 important features of baseline set"
## Feature Gain Cover Frequency
## 1: glasgow_coma_scale_last 0.52636080 0.23991757 0.19284603
## 2: age 0.10872943 0.14101205 0.13141524
## 3: temperature_last 0.08632123 0.10866395 0.11314152
## 4: pH_last 0.05988178 0.07723369 0.07737170
## 5: systolic_BP_last 0.03853192 0.06940831 0.07620529
## 6: heart_rate_last 0.03771744 0.07182339 0.07309487
## 7: respiratory_rate_last 0.03682040 0.06840482 0.06570762
## 8: weight 0.03430303 0.06413463 0.07776050
## 9: oxygen_saturation_last 0.01937140 0.04244301 0.04043546
## 10: glucose_last 0.01931424 0.04869721 0.05482115
2.1 比較不同特徵組合的模型表現.
2.2 只使用baseline (所有病患最後時間點的數據) 建模, 其於測試組的表現已與原始分析報告中的RF small相當.
2.3 baseFE (基礎特徵工程) 於測試組表現較baseline有稍微提升.
2.4 再加入分群特徵, 並無法改善模型, 不論是KMCL, DTWCL, SBDCL.
2.5 lightGBM訓練的模型在train和test的表現上不會有明顯差異 (overfitting on train group).
| set | Threshold | AUC | Sensitivity | Specificity | F1score | features_used | group |
|---|---|---|---|---|---|---|---|
| RF small | 0.43 | 0.95 | 0.88 | 0.88 | NA | 44 | train |
| baseline | 0.50 | 0.85 | 0.78 | 0.77 | 0.55 | 15 | train |
| baseFE | 0.50 | 0.87 | 0.78 | 0.78 | 0.56 | 95 | train |
| KMCL | 0.50 | 0.87 | 0.79 | 0.79 | 0.57 | 105 | train |
| DTWCL | 0.50 | 0.87 | 0.79 | 0.79 | 0.57 | 115 | train |
| SBDCL | 0.50 | 0.87 | 0.78 | 0.79 | 0.56 | 115 | train |
| set | Threshold | AUC | Sensitivity | Specificity | F1score | features_used | group |
|---|---|---|---|---|---|---|---|
| RF small | 0.43 | 0.84 | 0.74 | 0.77 | NA | 44 | test |
| baseline | 0.50 | 0.83 | 0.74 | 0.77 | 0.50 | 15 | test |
| baseFE | 0.50 | 0.86 | 0.76 | 0.79 | 0.52 | 95 | test |
| KMCL | 0.50 | 0.86 | 0.76 | 0.79 | 0.52 | 105 | test |
| DTWCL | 0.50 | 0.86 | 0.76 | 0.78 | 0.52 | 115 | test |
| SBDCL | 0.50 | 0.86 | 0.75 | 0.78 | 0.51 | 115 | test |
| Feature | Gain | Cover | Frequency | set |
|---|---|---|---|---|
| glasgow_coma_scale_last | 0.5431716 | 0.2371647 | 0.1959836 | baseline |
| age | 0.1146249 | 0.1513042 | 0.1413165 | baseline |
| temperature_last | 0.0734940 | 0.1054313 | 0.1056155 | baseline |
| pH_last | 0.0609291 | 0.0845748 | 0.0855337 | baseline |
| systolic_BP_last | 0.0492277 | 0.0853214 | 0.0933433 | baseline |
| heart_rate_last | 0.0391879 | 0.0895158 | 0.0859055 | baseline |
| respiratory_rate_last | 0.0329390 | 0.0677535 | 0.0635924 | baseline |
| weight | 0.0265045 | 0.0509721 | 0.0710301 | baseline |
| mean_BP_last | 0.0174225 | 0.0319913 | 0.0416512 | baseline |
| glucose_last | 0.0153973 | 0.0369419 | 0.0449981 | baseline |
| Feature | Gain | Cover | Frequency | set |
|---|---|---|---|---|
| glasgow_coma_scale_last | 0.2848392 | 0.1272857 | 0.0965909 | baseFE |
| age | 0.0830286 | 0.0953069 | 0.0817551 | baseFE |
| glasgow_coma_scale_median | 0.0792020 | 0.0322875 | 0.0312500 | baseFE |
| slope_max_glasgow_coma_scale | 0.0506695 | 0.0306575 | 0.0366162 | baseFE |
| glasgow_coma_scale_max | 0.0349185 | 0.0272442 | 0.0239899 | baseFE |
| temperature_median | 0.0332980 | 0.0389177 | 0.0388258 | baseFE |
| respiratory_rate_median | 0.0282064 | 0.0333139 | 0.0274621 | baseFE |
| glasgow_coma_scale_1st | 0.0269001 | 0.0115327 | 0.0123106 | baseFE |
| pH_last | 0.0244493 | 0.0338241 | 0.0284091 | baseFE |
| temperature_last | 0.0213747 | 0.0309420 | 0.0284091 | baseFE |
| Feature | Gain | Cover | Frequency | set |
|---|---|---|---|---|
| glasgow_coma_scale_last | 0.3113490 | 0.1234277 | 0.0940171 | KMCL |
| age | 0.0815234 | 0.0989535 | 0.0870529 | KMCL |
| glasgow_coma_scale_median | 0.0652352 | 0.0322202 | 0.0322887 | KMCL |
| slope_max_glasgow_coma_scale | 0.0462544 | 0.0282271 | 0.0319721 | KMCL |
| glasgow_coma_scale_max | 0.0313525 | 0.0270694 | 0.0243748 | KMCL |
| temperature_median | 0.0290112 | 0.0383516 | 0.0335549 | KMCL |
| respiratory_rate_median | 0.0258169 | 0.0336710 | 0.0259576 | KMCL |
| glasgow_coma_scale_1st | 0.0242027 | 0.0115569 | 0.0117126 | KMCL |
| temperature_last | 0.0236133 | 0.0323883 | 0.0291231 | KMCL |
| pH_last | 0.0224627 | 0.0307223 | 0.0253245 | KMCL |
| Feature | Gain | Cover | Frequency | set |
|---|---|---|---|---|
| glasgow_coma_scale_last | 0.3087401 | 0.1169473 | 0.0918301 | DTWCL |
| age | 0.0804741 | 0.0988403 | 0.0813725 | DTWCL |
| glasgow_coma_scale_median | 0.0495890 | 0.0308881 | 0.0281046 | DTWCL |
| glasgow_coma_scale_max | 0.0444637 | 0.0308031 | 0.0274510 | DTWCL |
| slope_max_glasgow_coma_scale | 0.0401288 | 0.0247345 | 0.0316993 | DTWCL |
| temperature_median | 0.0290053 | 0.0368359 | 0.0336601 | DTWCL |
| glasgow_coma_scale_DTWCL | 0.0274952 | 0.0264849 | 0.0196078 | DTWCL |
| respiratory_rate_median | 0.0260797 | 0.0325386 | 0.0267974 | DTWCL |
| glasgow_coma_scale_sd | 0.0229429 | 0.0175897 | 0.0189542 | DTWCL |
| temperature_last | 0.0212302 | 0.0307315 | 0.0274510 | DTWCL |
| Feature | Gain | Cover | Frequency | set |
|---|---|---|---|---|
| glasgow_coma_scale_last | 0.3122091 | 0.1211137 | 0.0952381 | SBDCL |
| age | 0.0810678 | 0.0987597 | 0.0818656 | SBDCL |
| glasgow_coma_scale_median | 0.0505690 | 0.0314461 | 0.0290280 | SBDCL |
| glasgow_coma_scale_max | 0.0441047 | 0.0307950 | 0.0273973 | SBDCL |
| slope_max_glasgow_coma_scale | 0.0418178 | 0.0286484 | 0.0335943 | SBDCL |
| temperature_median | 0.0311178 | 0.0408468 | 0.0362035 | SBDCL |
| respiratory_rate_median | 0.0264716 | 0.0332447 | 0.0277234 | SBDCL |
| glasgow_coma_scale_sd | 0.0219748 | 0.0163417 | 0.0185910 | SBDCL |
| pH_last | 0.0211295 | 0.0314835 | 0.0277234 | SBDCL |
| glasgow_coma_scale_SBDCL | 0.0206814 | 0.0107535 | 0.0088063 | SBDCL |
3.1 雖然新的分群特徵並無法改善模型, 但在採用所有樣本的資料集下進行分群, 也許會產生更具代表性的結果, 進而改善模型.
3.2 此外在初步的SBD分群結果分析中, 發現diastolic BP的centroid, 與mean BP的centroid有很高的相似性, 也許這代表DTW為基礎的分群方式, 是值得再深入探討與測試.
diastolic BP與mean BP的SBD分析產生之主要時間序列趨勢
- 最後呈現以所有樣本, 加上baseFE特徵組合, 進行mortality預測的模型表現.
| set | Threshold | AUC | Sensitivity | Specificity | F1score | features_used | group |
|---|---|---|---|---|---|---|---|
| baseFE | 0.47 | 0.82 | 0.74 | 0.74 | 0.43 | 95 | train |
| baseFE | 0.47 | 0.81 | 0.70 | 0.75 | 0.42 | 95 | test |
email: chtsai0108@gmail.com