Part 1: 摘要:加護病房 (ICU) 留院死亡預測模型改良

ICU死亡預測模型架構與應用

ICU死亡預測模型架構與應用

  1. 原始模型連結

  2. 在模型改良上, 嘗試透過在vital sign時間序列上取得新的特徵來達成:

    2.1 以Dynamic Time Warp (DTW) 演算法計算不同時間序列間的距離. DTW會考慮兩序列是否在變形 (transformation) 或位移 (shift) 後呈現較高的相似性, 再從此去計算兩者間的距離, 因此可能較一般Kmeans使用的距離演算法更為適合時間序列分群. 本篇報告使用含有DTW演算法的R package: dtwclust來對vital sign的時間序列進行分群.

    2.2 除了分群, 還提供每個序列對群心 (centroid) 的距離做為建模特徵, 這提供了與群心有多相似的資訊給建模演算法.

    2.3 所有患者最早即最晚記錄點的各項vital sign, 這也是模型表現的baseline所使用的特徵, 因為通常狀況很差的患者, 這時數據也都是較為惡化的狀況 (例如很低的脈搏).

    2.4 新的統計型特徵: range (max - min), max_slope及min_slope (找出患者vitalsign出現的最大/最小變化速率)

    2.5 其它未實現的vital sign相關新特徵:

    2.5.1 初步評估記錄時間差中位數為0.5 hr, 因此如改變時間序列長度, 改以0.5 hr為一單位組成時間序列, 對於有較高記錄頻率的資料, 也許能提供更多特徵. 在本篇報告中已包含該項參數, 但尚未完成特徵工程及模型表現比較.

    2.5.2 參考MIMIC-III的疾病分類表 (https://www.nature.com/articles/sdata201635/tables/2), 研究各分類主要的惡化特徵, 或導致死亡前的可能生理變化, 將該些資訊轉為ifelse rule用在特徵工程上.

  3. 提高模型的泛化性 (generalization), 則是透過使用所有資料及lightGBM演算法:

    3.1 由於原始模型使用的樣本患者死亡率較剩餘患者高了55% (17%對11%), 這代表了兩族群特徵分布可能有明顯差異, 其可能的推測是嚴重的患者, 醫生會需要更多監控或生化檢查的資訊來協助診斷及建立處方. 也因此, 基於該些患者建立的預測模型, 其找出的相關性規則可能不適用於剩餘患者族群.

    3.2 因此, 本報告的程式保留使用所有樣本的彈性, 如此也可以讓資料提供最多的資訊.

    3.3 建模演算法改用了lightGBM, 如此使用者可以保留無法合理imputation的NA值, 演算法會計算在遇到NA時的最佳分類方向, 簡化了NA的處理流程, 並盡可能保留所有樣本可以提供的資訊. 而其缺點就是black box造成較差的解釋性, 會需要額外方法去分析其找出的規則.

  4. 在結果部分會先以原始分析報告使用的icustay ID來建模, 評估新特徵及建模演算法對模型表現的影響, 最後則是報告使用所有icustay ID資料的模型表現.

    4.1 使用非常簡單的15個特徵模型, 其效果就已經與原始模型的最佳模型 (44個特徵 + 時間序列), 具有相同的靈敏度與精確度.

Part 2: 載入檔案及資料前處理

  1. 在本篇報告的參數設定下, 只使用了原始分析報告中篩選後的icustay ID, 但如果將demoCodeID_only設定為FALSE, 將可使用所有的icustay ID來進行後續的特徵工程及建模作業.

  2. sampleRate參數則與vital sign的時間序列呈現有關, 當設定為1時, 代表以1 hr為單位 (同AI4quant提供的示範分析) 來建構時間序列(0, 1, 2…, 48), 而0.5則代表以0.5 hr為1單位 (0, 0.5, 1…47.5, 48).

  3. 由於分析記錄時間間隔時發現中位數為0.5 hr, 因此對於某些vital sign以較高的sampleRate參數也許可以在比較時間序列時提供更多的資訊及差異.

Sys.setlocale("LC_TIME", "English")

#caret亦用於建模過程
suppressMessages(library(readr))
suppressMessages(library(dplyr))
suppressMessages(library(dtwclust))
suppressMessages(library(plotly))
suppressMessages(library(caret))
#以下用於建模和評估模型表現
suppressMessages(library(Matrix))
suppressMessages(library(lightgbm))
suppressMessages(library(car))
suppressMessages(library(pROC))

#自訂函數, 用於計算一組向量的最大值與最小值的差值
RANGE <- function(x){
  y <- diff(range(x, na.rm=TRUE))
  if(is.infinite(y)){
    return(NA)
  } else{
    return(y)
  }
}

suppressMessages(dfr <- read_csv("raw/mimic3_mortality.csv", guess_max = 300000))
colnames(dfr) <- gsub(x = colnames(dfr), pattern = " ", replacement = "_")

#資料選擇參數
#如果demoCodeID_only == FALSE, 會產生demoCode使用的ID (後續可用於其他觀察), 但特徵工程使用所有ID的資料
#如果demoCodeID_only == TRUE, 特徵工程資料只限於demoCode使用的ID
demoCodeID_only <- TRUE
#時間序列頻率參數 (1代表以1hr為1單位, 序列長度為49: 0-48 hr)
sampleRate <- 1
  1. 資料概述

    1. icustay: 病患ID

    2. hours: 進入加護病房後的累計時間 (day則是hour的轉換)

    3. diastolic_BP, glasgow_coma_scale, glucose, heart_rate, mean_BP, oxygen_saturation, respiratory_rate, systolic_BP, temperature, pH: 時間序列資料, 記錄病患的生理數據

    4. age, genderm height, weight: 年齡性別特徵

    5. motality: 是否於院內死亡

icustay hours diastolic_BP glasgow_coma_scale glucose heart_rate mean_BP oxygen_saturation respiratory_rate systolic_BP temperature age gender height pH weight day mortality
282372 0.0667 60 NA NA 139 84.6667 100 20 134 NA 48.6824 2 NA NA 59 1 1
282372 0.1500 73 NA NA 128 93.0000 100 25 133 NA 48.6824 2 NA NA 59 1 1
282372 0.2333 81 NA NA 127 88.6667 100 22 104 NA 48.6824 2 NA NA 59 1 1
282372 0.3167 86 NA NA 132 100.0000 100 19 128 NA 48.6824 2 NA NA 59 1 1
282372 0.4000 86 NA NA 138 100.3330 100 21 129 NA 48.6824 2 NA NA 59 1 1
282372 0.4833 82 NA NA 145 93.3333 100 27 116 NA 48.6824 2 NA NA 59 1 1
282372 0.5167 84 NA NA 150 97.0000 100 29 123 NA 48.6824 2 NA NA 59 1 1
282372 0.5667 71 NA NA 149 90.3333 100 30 129 NA 48.6824 2 NA NA 59 1 1
282372 0.6500 92 NA NA 139 106.6670 100 40 136 NA 48.6824 2 NA NA 59 1 1
282372 0.7333 98 NA NA 139 110.6670 100 37 136 NA 48.6824 2 NA NA 59 1 1
282372 0.8167 82 NA NA 129 104.3330 100 30 149 NA 48.6824 2 NA NA 59 1 1
282372 0.9000 79 NA NA 133 98.6667 100 35 138 NA 48.6824 2 NA NA 59 1 1
282372 0.9833 85 NA NA 133 99.6667 100 27 129 NA 48.6824 2 NA NA 59 1 1
282372 1.0667 96 NA NA 133 108.3330 100 19 133 NA 48.6824 2 NA NA 59 1 1
282372 1.1500 86 NA NA 129 105.3330 100 22 144 NA 48.6824 2 NA NA 59 1 1
282372 1.2333 89 15 NA 126 111.0000 100 20 155 NA 48.6824 2 NA NA 59 1 1
282372 1.3167 86 NA NA 123 107.3330 100 22 150 NA 48.6824 2 NA NA 59 1 1
282372 1.4000 79 NA NA 118 97.3333 100 17 134 NA 48.6824 2 NA NA 59 1 1
282372 2.2333 83 NA NA 133 95.3333 100 24 120 NA 48.6824 2 NA NA 59 1 1
282372 3.2333 81 NA NA 146 89.0000 100 23 105 34.7222 48.6824 2 NA NA 59 1 1
282372 3.7333 84 NA NA 138 94.0000 100 16 111 NA 48.6824 2 NA NA 59 1 1
282372 3.9833 70 NA NA 103 85.0000 NA 19 122 NA 48.6824 2 NA NA 59 1 1
282372 4.2333 73 NA NA 113 84.0000 NA 20 106 NA 48.6824 2 NA NA 59 1 1
282372 4.8167 39 NA NA 141 50.6667 NA 21 86 NA 48.6824 2 NA NA 59 1 1
282372 4.9000 34 NA NA 136 43.3333 NA 20 92 NA 48.6824 2 NA NA 59 1 1
282372 5.0667 57 NA NA 119 66.3333 64 NA 101 NA 48.6824 2 NA NA 59 1 1
282372 5.1500 63 NA NA 120 71.0000 NA 24 93 NA 48.6824 2 NA NA 59 1 1
282372 5.2333 55 15 NA 123 71.0000 66 23 77 34.1667 48.6824 2 NA NA 59 1 1
282372 5.2667 NA NA NA NA NA NA NA NA NA 48.6824 2 NA 7.44 59 1 1
282372 5.3667 NA NA 445 NA NA 99 NA NA NA 48.6824 2 NA 7.34 59 1 1
282372 5.6000 NA NA 343 NA NA NA NA NA NA 48.6824 2 NA NA 59 1 1
282372 6.8333 NA NA 430 NA NA NA NA NA NA 48.6824 2 NA 7.10 59 1 1
282372 7.7167 NA NA 335 NA NA 98 NA NA NA 48.6824 2 NA 7.20 59 1 1
282372 8.4000 NA NA 257 NA NA NA NA NA NA 48.6824 2 NA 7.11 59 1 1
282372 9.0833 NA NA 367 NA NA NA NA NA NA 48.6824 2 NA 7.18 59 1 1
282372 9.9500 NA NA 428 NA NA NA NA NA NA 48.6824 2 NA 7.21 59 1 1
282372 10.4833 NA NA 312 NA NA NA NA NA NA 48.6824 2 NA NA 59 1 1
282372 10.7000 NA NA NA NA NA NA NA NA NA 48.6824 2 NA 7.26 59 1 1
282372 11.8167 82 NA NA 65 117.0000 100 10 156 33.8333 48.6824 2 NA NA 59 1 1
282372 11.9000 91 NA NA 65 120.0000 NA 10 164 NA 48.6824 2 NA NA 59 1 1
282372 11.9833 88 NA NA 69 122.0000 NA 11 174 NA 48.6824 2 NA NA 59 1 1
282372 12.0667 100 NA NA 70 105.0000 NA 19 170 NA 48.6824 2 NA NA 59 1 1
282372 12.1500 0 NA NA 67 NA NA 19 0 NA 48.6824 2 NA NA 59 1 1
282372 12.2333 106 NA NA 66 143.0000 NA 21 202 34.4444 48.6824 2 NA NA 59 1 1
282372 12.2667 NA NA NA NA NA 99 NA NA NA 48.6824 2 NA 7.32 59 1 1
282372 12.3167 106 NA NA 69 139.0000 NA 21 194 NA 48.6824 2 NA NA 59 1 1
282372 12.3833 NA NA 185 NA NA NA NA NA NA 48.6824 2 NA NA 59 1 1
282372 12.4833 76 NA NA 86 92.0000 NA 21 123 NA 48.6824 2 NA NA 59 1 1
282372 12.7333 71 NA NA 86 88.0000 NA 22 120 NA 48.6824 2 NA NA 59 1 1
282372 12.9833 85 NA 100 89 86.6667 NA 22 154 34.7778 48.6824 2 NA 7.50 59 1 1
282372 13.2333 86 3 NA 87 119.0000 NA 23 186 NA 48.6824 2 NA NA 59 1 1
282372 13.4833 85 NA NA 98 116.0000 NA 23 184 NA 48.6824 2 NA NA 59 1 1
282372 13.5667 87 NA NA 99 118.0000 NA 19 182 NA 48.6824 2 NA NA 59 1 1
282372 13.6500 87 NA NA 100 118.0000 NA 21 187 NA 48.6824 2 NA NA 59 1 1
282372 13.7333 87 NA NA 100 117.0000 NA 23 186 NA 48.6824 2 NA NA 59 1 1
282372 13.8167 85 NA NA 100 113.0000 NA 23 178 NA 48.6824 2 NA NA 59 1 1
282372 13.9000 82 NA NA 99 110.0000 NA 18 170 NA 48.6824 2 NA NA 59 1 1
282372 13.9833 85 NA NA 100 113.0000 NA 23 178 NA 48.6824 2 NA NA 59 1 1
282372 14.0667 82 NA NA 99 110.0000 NA 18 170 NA 48.6824 2 NA NA 59 1 1
282372 14.2333 84 3 NA 86 109.0000 NA 22 162 36.2000 48.6824 2 NA NA 59 1 1
282372 14.7333 79 NA NA 92 102.0000 NA 13 149 36.7000 48.6824 2 NA NA 59 1 1
282372 15.2333 86 NA NA 94 129.0000 NA 0 177 36.8000 48.6824 2 NA NA 59 1 1
282372 15.2500 NA NA 120 NA NA NA NA NA NA 48.6824 2 NA NA 59 1 1
282372 15.2833 NA NA 119 NA NA NA NA NA NA 48.6824 2 NA 7.47 59 1 1
282372 15.7333 69 NA NA 94 95.0000 NA 24 148 36.8000 48.6824 2 NA NA 59 1 1
282372 16.2333 74 NA NA 96 100.0000 NA 24 156 36.9000 48.6824 2 NA NA 59 1 1
282372 16.5333 NA NA 103 NA NA NA NA NA NA 48.6824 2 NA 7.47 59 1 1
282372 17.2333 63 3 NA 96 81.0000 NA 25 121 37.1000 48.6824 2 NA NA 59 1 1
282372 18.2333 74 3 NA 96 95.0000 NA 17 140 37.3000 48.6824 2 NA NA 59 1 1
282372 19.2333 73 NA NA 97 97.0000 NA 4 147 37.3000 48.6824 2 NA NA 59 1 1
282372 19.5167 NA NA 47 NA NA NA NA NA NA 48.6824 2 NA 7.53 59 1 1
282372 19.5333 NA NA 46 NA NA NA NA NA NA 48.6824 2 NA NA 59 1 1
282372 19.7333 65 NA NA 95 84.0000 NA 5 126 37.4000 48.6824 2 NA NA 59 1 1
282372 20.2333 64 3 118 112 78.0000 94 11 107 37.4000 48.6824 2 NA NA 59 1 1
282372 20.7333 61 NA NA 101 75.0000 97 11 102 37.5000 48.6824 2 NA NA 59 1 1
282372 20.8833 NA NA 77 NA NA 96 NA NA NA 48.6824 2 NA 7.37 59 1 1
282372 21.2333 58 3 NA 105 72.0000 98 12 98 37.6000 48.6824 2 NA NA 59 1 1
282372 22.2333 34 NA NA 94 34.0000 92 23 39 37.6000 48.6824 2 NA NA 59 1 1
282372 23.2167 NA NA 85 NA NA NA NA NA NA 48.6824 2 NA NA 59 1 1
282372 23.2333 56 NA NA 103 64.0000 NA 16 89 37.5000 48.6824 2 NA NA 59 1 1
282372 23.6500 NA NA 86 NA NA 86 NA NA NA 48.6824 2 NA 7.36 59 1 1
282372 23.7333 57 NA NA 99 69.0000 NA 12 105 37.4000 48.6824 2 NA NA 59 1 1
282372 24.2333 64 8 NA 100 76.0000 83 0 108 37.4000 48.6824 2 NA NA 59 2 1
282372 24.2833 NA NA NA NA NA NA NA NA NA 48.6824 2 NA 5.00 59 2 1
282372 24.4833 62 NA NA 112 71.3333 99 10 106 37.4000 48.6824 2 NA NA 59 2 1
282372 24.7333 55 NA NA 104 62.3333 100 16 85 37.4000 48.6824 2 NA NA 59 2 1
282372 24.9833 62 NA NA 112 72.0000 95 15 90 37.4000 48.6824 2 NA NA 59 2 1
282372 25.2333 73 6 NA 113 90.0000 94 6 124 37.4000 48.6824 2 NA NA 59 2 1
282372 25.6500 NA NA 79 NA NA NA NA NA NA 48.6824 2 NA 7.31 59 2 1
282372 25.9833 NA NA 73 NA NA NA NA NA NA 48.6824 2 NA NA 59 2 1
282372 26.2333 61 NA NA 111 74.0000 94 16 114 37.6000 48.6824 2 NA NA 59 2 1
282372 27.2333 62 NA NA 110 74.3333 96 6 101 37.7000 48.6824 2 NA NA 59 2 1
282372 27.7333 73 NA NA 118 84.0000 94 2 103 37.7000 48.6824 2 NA NA 59 2 1
282372 28.2333 56 NA NA 96 63.0000 97 14 85 37.6000 48.6824 2 NA NA 59 2 1
282372 29.2333 57 7 NA 107 66.0000 96 2 90 37.7000 48.6824 2 NA NA 59 2 1
282372 29.4833 55 NA NA 106 60.0000 97 0 75 37.8000 48.6824 2 NA NA 59 2 1
282372 29.6000 NA NA NA NA NA 97 NA NA NA 48.6824 2 NA 7.30 59 2 1
282372 29.7333 59 NA NA 109 66.0000 96 0 78 37.7000 48.6824 2 NA NA 59 2 1
282372 29.8167 62 NA NA 113 69.0000 95 17 81 37.6000 48.6824 2 NA NA 59 2 1
282372 30.2333 66 NA NA 120 75.0000 96 16 88 37.6000 48.6824 2 NA NA 59 2 1
  1. 處理異常樣本, 規則設定同原始模型.
dfr <- dfr %>%
  arrange(icustay, hours) %>%
  mutate(diastolic_BP = ifelse(diastolic_BP > 300, NA, diastolic_BP),
         glucose = ifelse(glucose > 2000, NA, glucose),
         heart_rate = ifelse(heart_rate > 400, NA, heart_rate),
         mean_BP = ifelse(mean_BP > 300 | mean_BP < 0, NA, mean_BP),
         systolic_BP = ifelse(systolic_BP > 10000, NA, systolic_BP),
         temperature = ifelse(temperature > 50 | temperature < 20, NA, temperature),
         pH = ifelse(pH > 7.8 | pH < 6.8, NA, pH),
         respiratory_rate = ifelse(respiratory_rate > 300, NA, respiratory_rate),
         oxygen_saturation = ifelse(oxygen_saturation > 100 | oxygen_saturation < 0, NA, oxygen_saturation)
         )
  1. 依據前述demoCodeID_only設定, 依照原始分析報告設定規則, 每個vital sign至少須有兩個記錄值.

  2. 如果demoCodeID_only設定為FALSE, 則只會產生demoCodeID供額外的測試使用, 而不進行樣本篩選.

demoCodeID <- dfr %>%
  select(-age, -gender, -height, -weight, -day) %>%
  mutate(hours = floor(hours)) %>%
  group_by(icustay, hours) %>%
  summarise_all(list(~mean(., na.rm = T))) %>%
  group_by(icustay) %>%
  summarise_all(list(~sum(!is.na(.)))) %>%
  filter_all(all_vars(. >= 2)) %>%
  ungroup() %>%
  select(icustay)

if(demoCodeID_only == TRUE){
  #進一步只使用1000個ID
  demoCodeID <- demoCodeID[sample(1:nrow(demoCodeID), 1000), ]
  
  dfr <- dfr %>%
    mutate(match = match(icustay, demoCodeID$icustay, nomatch = NA)) %>%
    filter(!(is.na(match))) %>%
    select(-match)
}

Part 3: 特徵工程

  1. 每位患者最早及最晚記錄時間點的特徵值, 包含除了day以外的所有特徵.
#最晚時間點的特徵, 也是模型baseline使用的特徵. 若最後記錄點是NA, 則使用最靠近該時間點的前一筆記錄值.
#df是最後建模使用的資料集
df <- dfr %>%
  group_by(icustay) %>%
  tidyr::fill(.data$diastolic_BP:.data$weight, .direction = "down") %>%
  filter(row_number() == max(row_number())) %>%
  ungroup() %>%
  select(-day) %>%
  rename_at(vars(-icustay, -hours, -age, -gender, -height, -weight, -mortality), ~ paste0(., "_last"))

baseFeatures <- colnames(df)

#最早時間點的特徵. 若最早記錄點是NA, 則使用最靠近該時間點的下一筆記錄值.
#vitalSign包含各個可能隨時間變化的特徵欄位名稱
vitalSign <- c("diastolic_BP", "glasgow_coma_scale", "glucose", "heart_rate", "mean_BP", 
               "oxygen_saturation", "respiratory_rate", "systolic_BP", "temperature", "pH")
df <- dfr %>%
  select_at(c("icustay", vitalSign)) %>%
  group_by(icustay) %>%
  tidyr::fill(.data$diastolic_BP:.data$pH, .direction = "up") %>%
  filter(row_number() == 1) %>%
  ungroup() %>%
  rename_at(vars(-icustay), ~ paste0(., "_1st")) %>%
  right_join(df, by = "icustay")
  1. 建立統計型特徵mean, sd, min, max, range, 以及vital sign變化速率的特徵slope.
#mean
df <- dfr %>%
  select_at(c("icustay", vitalSign)) %>%
  group_by(icustay) %>%
  summarise_all(mean, na.rm = TRUE) %>%
  ungroup() %>%
  rename_at(vars(-icustay), ~ paste0(., "_median")) %>%
  right_join(df, by = "icustay")

#sd
df <- dfr %>%
  select_at(c("icustay", vitalSign)) %>%
  group_by(icustay) %>%
  summarise_all(sd, na.rm = TRUE) %>%
  ungroup() %>%
  rename_at(vars(-icustay), ~ paste0(., "_sd")) %>%
  right_join(df, by = "icustay")

#min
df <- dfr %>%
  select_at(c("icustay", vitalSign)) %>%
  group_by(icustay) %>%
  summarise_all(min, na.rm = TRUE) %>%
  ungroup() %>%
  rename_at(vars(-icustay), ~ paste0(., "_min")) %>%
  mutate_all(list(~na_if(., Inf))) %>%
  right_join(df, by = "icustay")

#max
df <- dfr %>%
  select_at(c("icustay", vitalSign)) %>%
  group_by(icustay) %>%
  summarise_all(max, na.rm = TRUE) %>%
  ungroup() %>%
  rename_at(vars(-icustay), ~ paste0(., "_max")) %>%
  mutate_all(list(~na_if(., -Inf))) %>%
  right_join(df, by = "icustay")

#range
df <- dfr %>%
  select_at(c("icustay", vitalSign)) %>%
  group_by(icustay) %>%
  summarise_all(RANGE) %>%
  ungroup() %>%
  rename_at(vars(-icustay), ~ paste0(., "_range")) %>%
  mutate_all(list(~na_if(., -Inf))) %>%
  right_join(df, by = "icustay")

#計算所有slope = (v1-v0)/(t1-t0)
for(i in vitalSign){
  df <- dfr %>%
    select_at(c("icustay", "hours", i)) %>%
    rename_at(vars(-icustay, -hours), ~ "vsign") %>%
    filter(!is.na(vsign)) %>%
    group_by(icustay) %>%
    mutate(int = hours - lag(hours),
           index_slope = (vsign - lag(vsign))/int,
           slope_max = max(index_slope, na.rm = TRUE),
           slope_min = min(index_slope, na.rm = TRUE)
    ) %>%
    filter(row_number()==1) %>%
    ungroup() %>%
    mutate_all(list(~na_if(., Inf))) %>%
    mutate_all(list(~na_if(., -Inf))) %>%
    select(icustay, slope_max, slope_min) %>%
    rename_at(vars(-icustay), ~ paste0(., "_", i)) %>%
    right_join(df, by = "icustay")
}

3.1 時間序列特徵: kmeans分群. 參考原始分析報告, 各患者的各項vital sign獨立進行minmax normalization, 並以建模會用的train group icustay ID完成kmeans訓練後, 再去predict test group的分群中心.

3.2 k為評估時的群心數, 計算k = 2~15時的betweenss/totss分數, 當分數(k+1) - 分數(k) < 0.01時, K = k, K為最後設定的群心數.

3.3 在此處只展示了一個vital sign的分群結果, 但最後建模和比較會使用所有vital sign的分群結果.

#1 產生時間序列資料, 可用於kmeans及dtw分群
#1.1 先產生id x 48 hr的data frame: dft
icustay <- rep(unique(dfr$icustay), (48/sampleRate + 1))
dft <- data.frame(icustay) %>%
  arrange(icustay) %>%
  group_by(icustay) %>%
  mutate(hours = (row_number()-1)*sampleRate) %>%
  ungroup()

rm(icustay)
gc()
##           used  (Mb) gc trigger  (Mb) max used  (Mb)
## Ncells 2461465 131.5    6279357 335.4  6279357 335.4
## Vcells 7900161  60.3   70661148 539.2 88126515 672.4
#1.2  對vital sign產生by hour (或單位時間, 依sampleRate參數變化) 的中位數
#1.2  如果某id某欄全無資料, 最後分群該值顯示為NA
#1.2  將資料併到dft, 並以前一筆非NA數值fill NA, 若無則用下一筆出現的非NA值fill NA
dft <- dfr %>%
  select_at(c("icustay", "hours", vitalSign)) %>%
  mutate(hours = floor(hours/sampleRate)*sampleRate) %>%
  group_by(icustay, hours) %>%
  summarise_all(median, na.rm = TRUE) %>%
  ungroup() %>%
  right_join(dft, by = c("icustay", "hours")) %>%
  mutate_all(list(~na_if(., "NaN"))) %>%
  group_by(icustay) %>%
  tidyr::fill(.data$diastolic_BP:.data$pH, .direction = "down") %>%
  tidyr::fill(.data$diastolic_BP:.data$pH, .direction = "up")


#1.3 產生建模時的訓練組icustayID
set.seed(777)
trainidx <- createDataPartition(df$mortality, p=0.8, list = FALSE)
trainID <- df[trainidx, "icustay", drop=TRUE]

#2 以vital sign名稱為變數, 進行kmeans分群loop, 並將分群結果併到df
#2 將vitalSign[1]改成vitalSign或者vitalSign[1:10]就可以計算所有結果
for(i in vitalSign[1]){
  #2.1 minmax normalization
  dfk <- dft %>%
    select_at(c("icustay", "hours", i)) %>%
    rename_at(vars(-icustay, -hours), ~ "vsign") %>%
    #刪除連一個觀察值都沒有的樣本
    filter(!(is.na(vsign))) %>%
    group_by(icustay) %>%
    #只有一個觀察值的在下式會變成NaN, 亦刪除
    mutate(vsign = (vsign - min(vsign))/(max(vsign) - min(vsign))) %>%
    filter(!(is.na(vsign))) %>%
    ungroup()
  
  #2.2 dcast to id with 48H columns
  dfk <- reshape2::dcast(dfk, formula = icustay ~ hours, value.var = c("vsign")) %>%
    rename_at(vars(-icustay), ~ paste0("H",.))
  
  #如果dfk內有任何Inf, 停止loop
  if(checkmate::anyInfinite(dfk)) {
    print("Inf in dfk for clustering")
    break
  }
  
  #2.3 分成train與test組別
  dfk_train <- subset(dfk, icustay %in% trainID)
  dfk_test <- subset(dfk, !(icustay %in% trainID))
  
  #2.4 Kmeans clustering of time series data
  #2.4.1 計算k = 2~15時的betweenss/totss分數, 當分數(k+1) - 分數(k) < 0.01時, 設定K = k
  cluster <- NA
  avg_score <- NA
  sd_score <- NA
  avg_minsize <- NA
  avg_meansize <- NA
  avg_maxsize <- NA
  scoreBoard <- data.frame(cluster, avg_score, sd_score, avg_minsize, avg_meansize, avg_maxsize)
  for(j in 2:15){
    score_j <- NULL
    min_size_j <- NULL
    mean_size_j <- NULL
    max_size_j <- NULL
    for(k in 1:30){
      ns <- 1
      while(ns <= j){
        eval_K <- kmeans(dfk_train[,2:50], j, iter.max = 80, nstart = ns, 
                         algorithm = c("Hartigan-Wong"), trace=FALSE)
        if(eval_K$ifault == 0) {break}
        ns <- ns + 1
      }
      score_j <- c(score_j, eval_K[["betweenss"]]/eval_K[["totss"]])
      min_size_j <- c(min_size_j, min(eval_K$size))
      mean_size_j <- c(mean_size_j, mean(eval_K$size))
      max_size_j <- c(max_size_j, max(eval_K$size))
    }
    scoreBoard[(j-1),1:6] <- c(j, round(mean(score_j), digits = 3), round(sd(score_j),digits =3),
                               ceiling(mean(min_size_j)), ceiling(mean(mean_size_j)), ceiling(mean(max_size_j))
                               )
  }
  
  scoreBoard$inc <- scoreBoard$avg_score - lag(scoreBoard$avg_score)
  
  #產生最後的K及kmeans結果final_K
  K <- scoreBoard$cluster[first(which(scoreBoard$inc < 0.01)) - 1]
  ns <- 1
  while(ns <= K){
    final_K <- kmeans(dfk_train[,2:50], K, iter.max = 30, nstart = ns, 
                     algorithm = c("Hartigan-Wong"), trace=FALSE)
    if(final_K$ifault == 0) {break}
    ns <- ns + 1
  }

  dfk_train$CL <- final_K$cluster
  #使用flexclust轉換final_K成為kcca物件再進行test組的fit (因為stats kmean的物件不能直接prediction)
  final_K <- flexclust::as.kcca(final_K, data=dfk_train[,2:50])
  dfk_test$CL <- flexclust::predict(final_K, newdata=dfk_test[,2:50])
  
  cat(paste0("kmeans to ", i, " is done...\n")) 
  
  #將分類結果merge到df
  df <- dfk_train %>%
    bind_rows(dfk_test) %>%
    select(icustay, CL) %>%
    rename_at(vars(-icustay), ~ paste0(i, "_KMCL")) %>%
    right_join(df, by = "icustay")
  
  rm(dfk, dfk_train, dfk_test, cluster, avg_score, sd_score, avg_minsize, avg_meansize, avg_maxsize,
     scoreBoard, score_j, min_size_j, mean_size_j, max_size_j, ns, eval_K, K, final_K)
  gc()
}
## kmeans to diastolic_BP is done...

4.1 時間序列特徵: 基於DTW距離演算法的分群. 參考原作者說明, 各患者的各項vital sign獨立進行zscore normalization, 並以建模會用的train group icustay ID完成kmeans訓練後, 再去predict test group的分群中心.

4.2 k為評估時的群心數, 以不同指標計算k = 2~10時的得分, 並以投票法決定最後設定的群心數. 如果有同分的情況則選擇k較大者.

4.3 取得各序列與群心的距離做為額外特徵.

4.4 在此處只展示了一個以dtw演算法分群的vital sign結果, 但在最後建模時, 會使用所有vital sign的centroids和sbd演算法的分群結果.

#將vitalSign[1]改成vitalSign或者vitalSign[1:10]就可以計算所有結果
for(i in vitalSign[1]){
  #3.1 產生data, 並刪除全NA的樣本
  dfk <- dft %>%
    select_at(c("icustay", "hours", i)) %>%
    rename_at(vars(-icustay, -hours), ~ "vsign") %>%
    filter(!(is.na(vsign))) %>%
    #以下用於刪除僅有一個唯一值的樣本 (無法zscore)
    group_by(icustay) %>%
    mutate(len = length(unique(vsign))) %>%
    ungroup() %>%
    #刪除vsign唯一值 <= 1
    filter(!(len == 1)) %>%
    select(-len)
  
  #3.2 dcast to id with 48H columns
  dfk <- reshape2::dcast(dfk, formula = icustay ~ hours, value.var = c("vsign")) %>%
    rename_at(vars(-icustay), ~ paste0("H",.))
  
  #如果dfk內有任何Inf, 停止loop
  if(checkmate::anyInfinite(dfk)) {
    print("Inf in dfk for clustering")
    break
    }
  
  #3.3 分成train與test組別
  dfk_train <- subset(dfk, icustay %in% trainID)
  dfk_test <- subset(dfk, !(icustay %in% trainID))
  
  #3.4 設定DTW參數並計算結果
  #主要確認分群演算法, 距離計算演算法及群心產生的演算法
  #window size以10%計算 (smaller might be better), SBD不用設定
  #distance, centrodi = ("dtw_basic", "dba", zcore/null), ("gak", "dba", zscore), 
  #( k-Shape algorithm: "sbd", "shpae", zscore, type = "p")
  #如果type要改用"h", 會需要額外的control參數(hierarchical_control)
  #type = "p" or "h". "h"時會忽略centroid, if it wasn't a function
  #type = "h" and preporc = NULL的分群會較為分佈不均, 但Sil, DB得分較好, 計算較快
  #"p": control = partitional_control(iter.max = 200L)
  #"h": control = hierarchical_control(method = "average")
  K_eval <- c(3:10)
  type_use <- "p"
  dis_use <- "dtw_basic"
  cen_use <- "dba"
  Zsc <- "zscore"
  
  TSCL <- tsclust(series = dfk_train[ ,2:50], type = type_use, k = K_eval, preproc = zscore, 
                  distance = dis_use,  centroid = cen_use, window.size = floor((48/sampleRate + 1)*0.1)-1,
                  seed = 777, trace = FALSE,
                  control = partitional_control(iter.max = 200L)
  )
  
  
  #type = "p"限定: 將沒有收斂的list刪除, 此外由於後面會命名, 因此也將K_eval中代表未收斂的K移除
  if(type_use == "p"){
    del_k <- NULL
    for(j in 1:length(K_eval)){
      if(!(TSCL[[j]]@converged)){
        del_k <- c(del_k, j)
      }
      if(j == length(K_eval) && !(is.null(del_k))){
        TSCL <- TSCL[-del_k]
        K_eval <- K_eval[! K_eval %in% (del_k + min(K_eval) - 1)]
      }
    }
    gc()
  }
  
  #3.5 CVI: clustering evaluation
  #將type從internal改為去掉SF (出現兩次0分, 又是global, 分數計算較久)
  #CH, SF可能不適合centroid = DBA or shape (centroid有random性質)
  #DB, DBstar非常快速
  #附註: CVI指數縮寫說明及其指數目標 (極大化或極小化)
  #Crisp partitions
  #"Sil" :     Silhouette                  max
  #"SF"  :     Score Function              max (global)
  #"CH"  :     Calinski-Harabasz           max (global)
  #"DB"  :     Davies-Bouldin              min
  #"DBstar"  : Modified Davies-Bouldin     min
  #"D"   :     Dunn                        max
  #"COP" :     COP                         min
  names(TSCL) <- paste0("K", K_eval)
  if(cen_use == "dba" || cen_use == "shape"){
    eval_method <- c("Sil", "DB", "DBstar", "D", "COP")
  } else {
    eval_method <- c("Sil", "CH", "DB", "DBstar", "D", "COP")
  }
  evalMtx <- sapply(TSCL, cvi, type = eval_method)
  #顯示原始分數
  print(evalMtx)
  
  #以cvi投票最多數者決定K值
  #同分者取K值較大者 (初步判斷: 不同cluster有類似性質, 對最後ML不會有太大影響)
  #或是看跑票的次佳解在哪個K最高分
  cviList <- rownames(evalMtx)
  vote_k <- NULL
  for(k in 1:length(cviList)){
    if(cviList[k] %in% c("Sil", "SF", "CH", "D")){
      vote_k <- c(vote_k, which(evalMtx[k,] == max(evalMtx[k,])))
    } else {
      vote_k <- c(vote_k, which(evalMtx[k,] == min(evalMtx[k,])))
    }
  }
  
  #各欄的投票結果, 依順序呈現
  vote_tbl <- sort(table(vote_k),decreasing=TRUE)
  #maxVote = 最高票的票數
  maxVote <- max(vote_tbl)
  #選出獲得最高票票數的最後一個欄位, 再從vote_k產生屬於該欄位的K代碼
  K <- last(names(which(vote_tbl == maxVote)))
  K_names <- first(names(which(vote_k == as.integer(K))))
  #顯示投票結果
  print(vote_tbl)
  cat(paste0("the most voted is ", K_names, "\n"))
  #顯示該K群下, 各群總ID數
  print(table(TSCL[[K_names]]@cluster))
  
  #存檔圖形
  jpeg(filename = paste0("fig/centroids/",paste(type_use, dis_use, cen_use, Zsc, i, sep = "_"), ".jpg"),
       height = 700, width = 1400)
  plot(TSCL[[K_names]], type = "centroids")
  dev.off()
  
  #將vote最高的cluster/distance定義merge到dfk_train和dfk_test
  dfk_train$TSCL <- TSCL[[K_names]]@cluster
  dfk_train$TSCL_DIS <- TSCL[[K_names]]@cldist
  
  dfk_test$TSCL <- predict(TSCL[[K_names]], newdata = dfk_test[,2:50])
  #計算test組別的distance(下式dtw_basic限定)
  dfk_test$TSCL_DIS <- NA
  dfk_test_z <- zscore(dfk_test[,2:50])
  for(j in 1:nrow(dfk_test)){
    cen_at_j <- dfk_test$TSCL[j]
    #因x,y都已zscore, 故normalize = NULL
    dfk_test$TSCL_DIS[j] <- dtw_basic(as.numeric(dfk_test_z[j,]), 
                                      TSCL[[K_names]]@centroids[[cen_at_j]], 
                                      window.size = floor((48/sampleRate + 1)*0.1)-1, norm = "L1",
                                      step.pattern = dtw::symmetric2, backtrack = FALSE,
                                      normalize = NULL, error.check = TRUE)
  }
  
  #將分類結果merge到df
  df <- dfk_train %>%
    bind_rows(dfk_test) %>%
    select(icustay, TSCL, TSCL_DIS) %>%
    rename_at(vars(-c(1,3)), ~ paste0(i, "_DTWCL")) %>%
    rename_at(vars(-c(1,2)), ~ paste0(i, "_DTWCL_DIS")) %>%
    right_join(df, by = "icustay")
  
  #完成時提醒並產生圖形
  cat(paste0("DTW to ", i, " is done...\n"))  
  plot(TSCL[[K_names]], type = "centroids")
  
  rm(dfk, dfk_train, dfk_test, type_use, dis_use, cen_use, Zsc, TSCL, del_k, evalMtx, 
     cviList, vote_k, vote_tbl, K, K_eval, K_names, maxVote, dfk_test_z, cen_at_j)
  gc()
}
##               K3         K4         K5         K6         K7        K8
## Sil    0.1072738 0.07938733 0.06676769 0.04938117 0.05046659 0.0438226
## DB     2.1372444 2.43942787 2.40986400 2.52331901 2.45823721 2.5321695
## DBstar 2.1544960 2.46799401 2.45521071 2.60192397 2.54492938 2.6126641
## D      0.1891252 0.25371383 0.17226610 0.17344802 0.26798591 0.1939582
## COP    0.5429106 0.52067542 0.50830281 0.49333997 0.47356019 0.4770591
##                K9        K10
## Sil    0.05077695 0.04001242
## DB     2.41475733 2.52882863
## DBstar 2.46604557 2.61002962
## D      0.17647530 0.24128281
## COP    0.46198171 0.46553136
## vote_k
## 1 5 7 
## 3 1 1 
## the most voted is K3
## 
##   1   2   3 
## 260 230 310
## DTW to diastolic_BP is done...

Part 4: 建模步驟說明

  1. 藉由特徵篩選, 我們可以比較baseline, 基礎特徵, KMCL, DTWCL/DIS, SBDCL/DIS五種不同特徵組合下, 對於mortality預測模型的影響.

    1.1 baseline: 只使用病人所有原始特徵最後的記錄值.

    1.2 基礎特徵: 包含始末記錄值, 統計型特徵, 最大最小斜率變化, 不含時間序列產生的特徵.

    1.3 KMCL: 除基礎特徵外, 還包含了kmeans對時間序列分群產生的特徵.

    1.4 DTWCL/DIS: 除基礎特徵外, 還包含了DTW為基礎的時間序列分群特徵及距離群心距離的特徵.

    1.5 SBDCL/DIS: 除基礎特徵外, 還包含了SBD為基礎的時間序列分群特徵及距離群心距離的特徵.

  2. 在建模參數部分, 我們多數使用default, 但限定max_depth = 2.

  3. 藉由調整threshold, 使cv final round的prediction的sensitivity與specificity盡量接近, 最後test set則使用該threshold來評估表現.

#1 自訂函數: 取得cv所有prediction的結果, 用於評估threshold
get_lgbm_cv_preds <- function(cv){
  rows <- length(cv$boosters[[1]]$booster$.__enclos_env__$private$valid_sets[[1]]$.__enclos_env__$private$used_indices)+length(cv$boosters[[1]]$booster$.__enclos_env__$private$train_set$.__enclos_env__$private$used_indices)
  preds <- numeric(rows)
  for(i in 1:length(cv$boosters)){
    preds[
      cv$boosters[[i]]$booster$.__enclos_env__$private$valid_sets[[1]]$.__enclos_env__$private$used_indices] <-
      cv$boosters[[i]]$booster$.__enclos_env__$private$inner_predict(2)
  }
  return(preds)
}

#2 載入完整特徵資料 (可在前端加#disable並使用之前步驟重新產生的df)
suppressMessages(df <- read_csv("modeling_final3_df.csv", guess_max = 300000))

#3 前處理
#3.1 將名字不同,但各row數值相同的column, 刪除到只剩一組column
df <- df[!duplicated(unclass(df))]

#3.2 刪除NZV, 設定每個特徵最少需要出現的unique值
#3.2 當前設定為刪除unique值 = 1的特徵
nzv_r <- nearZeroVar(df, freqCut = (nrow(df)-1)/1, uniqueCut = 2/nrow(df), saveMetrics= TRUE)
nzv <- nearZeroVar(df, freqCut = (nrow(df)-1)/1, uniqueCut = 2/nrow(df))
if (as.numeric(sum(nzv_r$nzv=="TRUE")) > 0) df <- df[, -nzv]

#3.3 篩選資料 (若demoCodeID_only == TRUE, 則都只使用demoCodeID來進行建模)
demoCodeID_only_ML <- FALSE
  if(demoCodeID_only_ML) {
  dft <- df %>%
    right_join(demoCodeID, by = "icustay")
  } else {
  dft <- df
  }

#3.4 篩選特徵: 依據cluster種類
featureIDX <- "baseline"
  if(featureIDX == "baseline"){
    dft <- dft %>% select_at(baseFeatures)
  } else if(featureIDX == "baseFE") {
    dft <- dft[ , !grepl("CL", colnames(dft))]
  } else if(featureIDX == "KMCL") {
    dft <- dft[ , !grepl("DTWCL|SBDCL", colnames(dft))]
  } else if(featureIDX == "DTWCL") {
    dft <- dft[ , !grepl("KMCL|SBDCL", colnames(dft))]
  } else if(featureIDX == "SBDCL") {
    dft <- dft[ , !grepl("KMCL|DTWCL", colnames(dft))]
  }
  
#4 建模準備
#4.1 使用lightGBM的categorical_feature功能, 但類別值須先轉換為非負整數 (在MIMIC-III此處不需要)
#4.1 將clustering結果轉為類別型特徵, 但移除帶有DIS的欄位名稱
categoricals.vec = colnames(dft)[grepl("CL|gender", colnames(dft))]
categoricals.vec <- categoricals.vec[-grep("_DIS", categoricals.vec)]


#4.2 產生train & test set
set.seed(777)
trainidx <- createDataPartition(dft$mortality, p=0.8, list = FALSE)
train <- dft[trainidx, ]
test <- dft[-trainidx, ]


#4.3 以options讓train, test的NA可以進入轉換後的sparse_matrix
previous_na_action <- options('na.action')
options(na.action='na.pass')
sparse_matrix <- sparse.model.matrix(mortality + icustay~.-1, data = train)
label_train <- train[,c('mortality'), drop=T]

lgb.test <- sparse.model.matrix(mortality + icustay~.-1, data = test)
label_test <- test[,c('mortality'), drop=T]
options(na.action=previous_na_action$na.action)

lgb.train = lgb.Dataset(data = as.matrix(sparse_matrix), label = label_train)

#4.4 設定lgb參數
lgb.grid = list(objective = "binary",
                metric = "auc",
                min_sum_hessian_in_leaf = 1,
                feature_fraction = 0.6,
                bagging_fraction = 0.6,
                bagging_freq = 1,
                boosting = "gbdt",
                max_bin = 255,
                lambda_l1 = 0,
                lambda_l2 = 0,
                min_data_in_bin = 3,
                min_gain_to_split = 0,
                min_data_in_leaf = 3,
                is_unbalance = TRUE
                )

#5 cross-validation with train set
#5.1 參數設定與cv run
lgb.model.cv = lgb.cv(params = lgb.grid, data = lgb.train, learning_rate = 0.01, num_leaves = 31,
                      max_depth = 2, num_threads = 6 , nrounds = 1500, early_stopping_rounds = 30,
                      eval = "auc", nfold = 5, stratified = TRUE, categorical_feature = categoricals.vec)

#5.2 取得最佳iteration
best.iter = lgb.model.cv$best_iter

#5.3 依據設定的threshold (pos_THR), 以cv所有的prediction產生混淆矩陣與f1 score
pred_cv <- get_lgbm_cv_preds(lgb.model.cv)
pos_THR <- 0.5
CFmatrix_cv <- confusionMatrix(as.factor(ifelse(pred_cv <= pos_THR, 0, 1)), as.factor(label_train),
                               positive = levels(as.factor(label_train))[2])
print(CFmatrix_cv)
F1_in_Train <- MLmetrics::F1_Score(as.factor(label_train), as.factor(ifelse(pred_cv <= pos_THR, 0, 1)), positive = "1")
cat(paste0("F1 score for train set is ", F1_in_Train, "\n"))


#5.4 作圖ROC及計算auc
print("roc for train set")
#plot(pROC::roc(response = label_train,
               #predictor = pred_cv,
               #levels=c(0, 1)),lwd=1.5)

roc_obj_train <- roc(label_train, pred_cv)
print("auc for train set")
auc(roc_obj_train)


#6以train set建模評估在test set的表現
#6.1 建立訓練模型
lgb.model = lgb.train(params = lgb.grid, data = lgb.train, learning_rate = 0.01,
                      max_depth = 2, num_leaves = 31, num_threads = 6 , nrounds = best.iter,
                      eval_freq = 5, eval = "auc", categorical_feature = categoricals.vec)

#6.2 對test gp預測
pred <- predict(lgb.model, lgb.test)


#6.3 confusionMatrix & f1 score
CFmatrix <- confusionMatrix(as.factor(ifelse(pred <= pos_THR, 0, 1)), as.factor(label_test),
                            positive = levels(as.factor(label_test))[2])
print(CFmatrix)
F1_in_Test <- MLmetrics::F1_Score(as.factor(label_test), as.factor(ifelse(pred <= pos_THR, 0, 1)), positive = "1")
cat(paste0("F1 score for test set is ", F1_in_Test, "\n"))

#6.4 作圖ROC及計算auc
print("roc for test set")
#plot(pROC::roc(response = label_test,
               #predictor = pred,
               #levels=c(0, 1)),lwd=1.5)

roc_obj <- roc(label_test, pred)
print("auc for train set")
auc(roc_obj)


#feature重要性
lgb_imp <- lgb.importance(lgb.model, percentage = TRUE)
head(lgb_imp, n = 30L)

Part 5: 結果報告

  1. 以baseline特徵組合展示建模結果.
## [1] "Results from train group"
## Threshold =  0.5
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 3523  212
##          1 1086  724
##                                          
##                Accuracy : 0.7659         
##                  95% CI : (0.7545, 0.777)
##     No Information Rate : 0.8312         
##     P-Value [Acc > NIR] : 1              
##                                          
##                   Kappa : 0.392          
##                                          
##  Mcnemar's Test P-Value : <2e-16         
##                                          
##             Sensitivity : 0.7735         
##             Specificity : 0.7644         
##          Pos Pred Value : 0.4000         
##          Neg Pred Value : 0.9432         
##              Prevalence : 0.1688         
##          Detection Rate : 0.1306         
##    Detection Prevalence : 0.3264         
##       Balanced Accuracy : 0.7689         
##                                          
##        'Positive' Class : 1              
## 
## F1 score for train group is 0.527312454479243
## [1] "roc for train group"

## [1] "auc for train group"
## Area under the curve: 0.8496
## [1] "Results from test group"
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 864  56
##          1 271 195
##                                           
##                Accuracy : 0.7641          
##                  95% CI : (0.7408, 0.7862)
##     No Information Rate : 0.8189          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.4035          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.7769          
##             Specificity : 0.7612          
##          Pos Pred Value : 0.4185          
##          Neg Pred Value : 0.9391          
##              Prevalence : 0.1811          
##          Detection Rate : 0.1407          
##    Detection Prevalence : 0.3362          
##       Balanced Accuracy : 0.7691          
##                                           
##        'Positive' Class : 1               
## 
## F1 score for test group is 0.543933054393305
## [1] "roc for test group"

## [1] "auc for train group"
## Area under the curve: 0.8393
## [1] "top 10 important features of baseline set"
##                     Feature       Gain      Cover  Frequency
##  1: glasgow_coma_scale_last 0.52636080 0.23991757 0.19284603
##  2:                     age 0.10872943 0.14101205 0.13141524
##  3:        temperature_last 0.08632123 0.10866395 0.11314152
##  4:                 pH_last 0.05988178 0.07723369 0.07737170
##  5:        systolic_BP_last 0.03853192 0.06940831 0.07620529
##  6:         heart_rate_last 0.03771744 0.07182339 0.07309487
##  7:   respiratory_rate_last 0.03682040 0.06840482 0.06570762
##  8:                  weight 0.03430303 0.06413463 0.07776050
##  9:  oxygen_saturation_last 0.01937140 0.04244301 0.04043546
## 10:            glucose_last 0.01931424 0.04869721 0.05482115

2.1 比較不同特徵組合的模型表現.

2.2 只使用baseline (所有病患最後時間點的數據) 建模, 其於測試組的表現已與原始分析報告中的RF small相當.

2.3 baseFE (基礎特徵工程) 於測試組表現較baseline有稍微提升.

2.4 再加入分群特徵, 並無法改善模型, 不論是KMCL, DTWCL, SBDCL.

2.5 lightGBM訓練的模型在train和test的表現上不會有明顯差異 (overfitting on train group).

模型表現比較表格: train
set Threshold AUC Sensitivity Specificity F1score features_used group
RF small 0.43 0.95 0.88 0.88 NA 44 train
baseline 0.50 0.85 0.78 0.77 0.55 15 train
baseFE 0.50 0.87 0.78 0.78 0.56 95 train
KMCL 0.50 0.87 0.79 0.79 0.57 105 train
DTWCL 0.50 0.87 0.79 0.79 0.57 115 train
SBDCL 0.50 0.87 0.78 0.79 0.56 115 train
模型表現比較表格: test
set Threshold AUC Sensitivity Specificity F1score features_used group
RF small 0.43 0.84 0.74 0.77 NA 44 test
baseline 0.50 0.83 0.74 0.77 0.50 15 test
baseFE 0.50 0.86 0.76 0.79 0.52 95 test
KMCL 0.50 0.86 0.76 0.79 0.52 105 test
DTWCL 0.50 0.86 0.76 0.78 0.52 115 test
SBDCL 0.50 0.86 0.75 0.78 0.51 115 test
特徵重要性 top 10: baseline
Feature Gain Cover Frequency set
glasgow_coma_scale_last 0.5431716 0.2371647 0.1959836 baseline
age 0.1146249 0.1513042 0.1413165 baseline
temperature_last 0.0734940 0.1054313 0.1056155 baseline
pH_last 0.0609291 0.0845748 0.0855337 baseline
systolic_BP_last 0.0492277 0.0853214 0.0933433 baseline
heart_rate_last 0.0391879 0.0895158 0.0859055 baseline
respiratory_rate_last 0.0329390 0.0677535 0.0635924 baseline
weight 0.0265045 0.0509721 0.0710301 baseline
mean_BP_last 0.0174225 0.0319913 0.0416512 baseline
glucose_last 0.0153973 0.0369419 0.0449981 baseline
特徵重要性 top 10: baseFE
Feature Gain Cover Frequency set
glasgow_coma_scale_last 0.2848392 0.1272857 0.0965909 baseFE
age 0.0830286 0.0953069 0.0817551 baseFE
glasgow_coma_scale_median 0.0792020 0.0322875 0.0312500 baseFE
slope_max_glasgow_coma_scale 0.0506695 0.0306575 0.0366162 baseFE
glasgow_coma_scale_max 0.0349185 0.0272442 0.0239899 baseFE
temperature_median 0.0332980 0.0389177 0.0388258 baseFE
respiratory_rate_median 0.0282064 0.0333139 0.0274621 baseFE
glasgow_coma_scale_1st 0.0269001 0.0115327 0.0123106 baseFE
pH_last 0.0244493 0.0338241 0.0284091 baseFE
temperature_last 0.0213747 0.0309420 0.0284091 baseFE
特徵重要性 top 10: KMCL
Feature Gain Cover Frequency set
glasgow_coma_scale_last 0.3113490 0.1234277 0.0940171 KMCL
age 0.0815234 0.0989535 0.0870529 KMCL
glasgow_coma_scale_median 0.0652352 0.0322202 0.0322887 KMCL
slope_max_glasgow_coma_scale 0.0462544 0.0282271 0.0319721 KMCL
glasgow_coma_scale_max 0.0313525 0.0270694 0.0243748 KMCL
temperature_median 0.0290112 0.0383516 0.0335549 KMCL
respiratory_rate_median 0.0258169 0.0336710 0.0259576 KMCL
glasgow_coma_scale_1st 0.0242027 0.0115569 0.0117126 KMCL
temperature_last 0.0236133 0.0323883 0.0291231 KMCL
pH_last 0.0224627 0.0307223 0.0253245 KMCL
特徵重要性 top 10: DTWCL
Feature Gain Cover Frequency set
glasgow_coma_scale_last 0.3087401 0.1169473 0.0918301 DTWCL
age 0.0804741 0.0988403 0.0813725 DTWCL
glasgow_coma_scale_median 0.0495890 0.0308881 0.0281046 DTWCL
glasgow_coma_scale_max 0.0444637 0.0308031 0.0274510 DTWCL
slope_max_glasgow_coma_scale 0.0401288 0.0247345 0.0316993 DTWCL
temperature_median 0.0290053 0.0368359 0.0336601 DTWCL
glasgow_coma_scale_DTWCL 0.0274952 0.0264849 0.0196078 DTWCL
respiratory_rate_median 0.0260797 0.0325386 0.0267974 DTWCL
glasgow_coma_scale_sd 0.0229429 0.0175897 0.0189542 DTWCL
temperature_last 0.0212302 0.0307315 0.0274510 DTWCL
特徵重要性 top 10: SBDCL
Feature Gain Cover Frequency set
glasgow_coma_scale_last 0.3122091 0.1211137 0.0952381 SBDCL
age 0.0810678 0.0987597 0.0818656 SBDCL
glasgow_coma_scale_median 0.0505690 0.0314461 0.0290280 SBDCL
glasgow_coma_scale_max 0.0441047 0.0307950 0.0273973 SBDCL
slope_max_glasgow_coma_scale 0.0418178 0.0286484 0.0335943 SBDCL
temperature_median 0.0311178 0.0408468 0.0362035 SBDCL
respiratory_rate_median 0.0264716 0.0332447 0.0277234 SBDCL
glasgow_coma_scale_sd 0.0219748 0.0163417 0.0185910 SBDCL
pH_last 0.0211295 0.0314835 0.0277234 SBDCL
glasgow_coma_scale_SBDCL 0.0206814 0.0107535 0.0088063 SBDCL

3.1 雖然新的分群特徵並無法改善模型, 但在採用所有樣本的資料集下進行分群, 也許會產生更具代表性的結果, 進而改善模型.

3.2 此外在初步的SBD分群結果分析中, 發現diastolic BP的centroid, 與mean BP的centroid有很高的相似性, 也許這代表DTW為基礎的分群方式, 是值得再深入探討與測試.

diastolic BP與mean BP的SBD分析產生之主要時間序列趨勢

diastolic BP與mean BP的SBD分析產生之主要時間序列趨勢

  1. 最後呈現以所有樣本, 加上baseFE特徵組合, 進行mortality預測的模型表現.
使用所有icustay ID資料的模型表現
set Threshold AUC Sensitivity Specificity F1score features_used group
baseFE 0.47 0.82 0.74 0.74 0.43 95 train
baseFE 0.47 0.81 0.70 0.75 0.42 95 test

email: chtsai0108@gmail.com