For the last two years, the world has witnessed for the first time in a century an unprecedentedly catastrophic COVID-19 pandemic that indescribably crippled the world in virtually of its vital sectors. Undeniably, the most severely hit sector was the medical field as the pandemic has put humongous strains on hospitals, which suffered an insurmountable sudden influx in patients. Many of these patients were admitted in critical condition and needed to urgently access an Intensive Care Unit (ICU). Nonetheless, a massive percentage of patients prematurely accessed ICUs or needed one after a few hours of being hospitalized, which led to many hospitals suffering from an inefficient and disastrous ICU admission planning and an incapability to accurately allocate ICUs to patients in need. The aim of the project was to implement robust machine learning algorithms to design an efficient prediction model that would accurately predict a patients' need for an ICU within the very first hours after they have been admitted to a hospital. The data studied in the paper was obtained from Kaggle and consists of a window-based dataset organized by the Sirio-Libanes hospital team in Sao Paolo, Brazil. Two separate preprocessing methods were implemented on the data, one which consisted of imputation, propagation, cleaning as well as feature selection while the other technique consisted of omitting the missing data, followed by cleaning and feature selection. The cleaning process was achieved by removing zero-variance, redundant, and/or highly correlated features using the Pearson correlation coefficient. After obtaining the final datasets, we implemented a series of specialized binary classification models on various stages starting from the features in the first Window 1, which represents the first two hours of admission, and then incrementally adding the remaining features in Window 2, Window 3 etc. The latter is done since ideally the aim is to be able to predict ICU admission as early as possible. The model that consistently yielded the highest accuracy (65% for W1, 81% for W2, 88% for W3, 91% for W4, 93% for W5) as well as the highest AUC (0.93 for W4) was the Extreme Gradient Boosting (XGBoost) implemented on a limited subset of the most significant features such as age, respiratory rate, lactate levels, & blood pressure.
### Import Excel Data
library(readxl)
### Data Manipulation
library(dplyr)
### Deep Learning
library(keras)
library(neuralnet)
### General Utility
library(tidyverse)
### Feature Selection
library(leaps)
### SVM Classification
library(e1071)
### XGBoost Classification
library(xgboost)
### Cross Validation & Modeling
library(caret)
### KNN Classification
library(class)
### Random Forest Classificaiton
library(randomForest)
### Convert Dataframe to Matrix
library(Matrix)
### ROC Curve Plots
library(pROC)
### Visualization
library(ggplot2)
library(ggrepel)
The DataSet was observed and manipulated with Excel according to few statements made by the hospital, and logical reasoning for feature selection.
As indicated by the Hospital, blood measurements are not taken as often as Vital Signs measurements. We observe that the MIN,MAX,MEAN,MEDIAN are all equal within the first 36 Features that represent the blood measurements.
As for the Vital Signs Measurements, we keep their MIN, MAX, MEDIAN values as they are not equal in all cases. This will be dealt with later on in the feature selection phase.
data = read_excel("Kaggle_Sirio_Libanes_ICU_Prediction.xlsx")
As indicated by the Hospital, if a patient is stable, no new measurements are taken in the next window. To deal with NA Values, we do NA Imputation.
The technique we use is as follows:
1. If window1 [0-2h] has anl NA vaue, take the value from the next non-NA window.
2. If windowX (X>1) [2h+] has an NA value, take the value from the previous non-NA window.
propagate_NA <- function(data){
# Indices of Window 0-2
indices_02<- which(data[,c("WINDOW")]=="0-2")
for(i in indices_02){
for(j in 1:ncol(data)){
if(is.na(data[i,j])){
for(k in seq(i+1:i+4)){
if(!is.na(data[k,j])){
data[i,j] = data[k,j]
break
}
}
}
}
}
## Propagate for all other windows
for(i in 1:nrow(data)){
for(j in 1:ncol(data)){
if(is.na(data[i,j])){
data[i,j] = data[i-1,j]
}
}
}
return(data)
}
data <- propagate_NA(data)
nrow(data)
## [1] 1925
Explain Reasoning
build_targets <- function(data){
## Extract the Response
## If at any time in the future the patient gets admitted to the ICU
## Then, reponse = 1, otherwise reponse = 0
result = c()
for(i in seq(1:385)){
result[i] = max(data[which(data[c("PATIENT_VISIT_IDENTIFIER")]==i-1),c("ICU")])
}
## ICU Admitted
length(which(result==1))
## No admission to ICU
length(which(result==0))
## The Dataset is Balanced
results <- c()
for(i in result){
vec <- rep(i, times = 5)
results <- c(results, vec)
}
data <- mutate(data, response = results)
return(data)
}
data <- build_targets(data)
data <- data[complete.cases(data),]
nrow(data)
## [1] 1925
vars <- apply(data, 2, function(x){
variance_col <- var(x)
})
head(vars[order(vars)])
## TGP PC02_ARTERIAL TGO BILLIRUBIN INR
## 0.002328126 0.002356368 0.002554769 0.003040343 0.004298024
## BIC_ARTERIAL
## 0.004470547
### Set Threshold of Variance to 0.0002 [Adjustable]
### datavars <- data[,-which(vars < 0.0002)]
### vars <- apply(datavars, 2, function(x){
### variance_col <- var(x)
### })
### head(vars[order(vars)])
### datavars <- data
make_factors <- function(data){
data$AGE_ABOVE65 <- as.factor(data$AGE_ABOVE65)
data$GENDER <- as.factor(data$GENDER)
data$IMMUNOCOMPROMISED <- as.factor(data$IMMUNOCOMPROMISED)
data$HTN <- as.factor(data$HTN)
data$response <- as.factor(data$response)
data$`DISEASE GROUPING 1` <- as.factor(data$`DISEASE GROUPING 1`)
data$`DISEASE GROUPING 2` <- as.factor(data$`DISEASE GROUPING 2`)
data$`DISEASE GROUPING 3` <- as.factor(data$`DISEASE GROUPING 3`)
data$`DISEASE GROUPING 4` <- as.factor(data$`DISEASE GROUPING 4`)
data$`DISEASE GROUPING 5` <- as.factor(data$`DISEASE GROUPING 5`)
data$`DISEASE GROUPING 6` <- as.factor(data$`DISEASE GROUPING 6`)
return(data)
}
data <- make_factors(data)
summary(data)
## PATIENT_VISIT_IDENTIFIER AGE_ABOVE65 GENDER DISEASE GROUPING 1
## Min. : 0 0:1025 0:1215 0:1717
## 1st Qu.: 96 1: 900 1: 710 1: 208
## Median :192
## Mean :192
## 3rd Qu.:288
## Max. :384
## DISEASE GROUPING 2 DISEASE GROUPING 3 DISEASE GROUPING 4 DISEASE GROUPING 5
## 0:1871 0:1737 0:1887 0:1674
## 1: 54 1: 188 1: 38 1: 251
##
##
##
##
## DISEASE GROUPING 6 HTN IMMUNOCOMPROMISED ALBUMIN
## 0:1830 0:1516 0:1621 Min. :-1.0000
## 1: 95 1: 409 1: 304 1st Qu.: 0.6053
## Median : 0.6053
## Mean : 0.5630
## 3rd Qu.: 0.6053
## Max. : 1.0000
## BE_ARTERIAL BE_VENOUS BIC_ARTERIAL BIC_VENOUS
## Min. :-1.0000 Min. :-1.0000 Min. :-1.0000 Min. :-1.0000
## 1st Qu.:-1.0000 1st Qu.:-1.0000 1st Qu.:-0.3171 1st Qu.:-0.3171
## Median :-1.0000 Median :-1.0000 Median :-0.3171 Median :-0.3171
## Mean :-0.9839 Mean :-0.9542 Mean :-0.3151 Mean :-0.3191
## 3rd Qu.:-1.0000 3rd Qu.:-1.0000 3rd Qu.:-0.3171 3rd Qu.:-0.3171
## Max. : 1.0000 Max. : 1.0000 Max. : 1.0000 Max. : 1.0000
## BILLIRUBIN BLAST CALCIUM CREATININ
## Min. :-1.0000 Min. :-1.0000 Min. :-1.0000 Min. :-1.0000
## 1st Qu.:-0.9613 1st Qu.:-1.0000 1st Qu.: 0.2041 1st Qu.:-0.9250
## Median :-0.9389 Median :-1.0000 Median : 0.3571 Median :-0.9038
## Mean :-0.9451 Mean :-0.9951 Mean : 0.3034 Mean :-0.8842
## 3rd Qu.:-0.9389 3rd Qu.:-1.0000 3rd Qu.: 0.3571 3rd Qu.:-0.8684
## Max. : 1.0000 Max. : 1.0000 Max. : 1.0000 Max. : 1.0000
## FFA GGT GLUCOSE HEMATOCRITE
## Min. :-1.0000 Min. :-1.0000 Min. :-1.0000 Min. :-1.00000
## 1st Qu.:-0.7420 1st Qu.:-0.9585 1st Qu.:-0.8920 1st Qu.:-0.22432
## Median :-0.7420 Median :-0.9585 Median :-0.8920 Median :-0.06499
## Mean :-0.7273 Mean :-0.9311 Mean :-0.8614 Mean :-0.09683
## 3rd Qu.:-0.7420 3rd Qu.:-0.9451 3rd Qu.:-0.8920 3rd Qu.: 0.09015
## Max. : 1.0000 Max. : 1.0000 Max. : 1.0000 Max. : 1.00000
## HEMOGLOBIN INR LACTATE LEUKOCYTES
## Min. :-1.00000 Min. :-1.0000 Min. :-1.0000 Min. :-1.0000
## 1st Qu.:-0.26829 1st Qu.:-0.9598 1st Qu.:-0.8053 1st Qu.:-0.8358
## Median :-0.09756 Median :-0.9598 Median : 1.0000 Median :-0.8038
## Mean :-0.12564 Mean :-0.9381 Mean : 0.4652 Mean :-0.7647
## 3rd Qu.: 0.08537 3rd Qu.:-0.9322 3rd Qu.: 1.0000 3rd Qu.:-0.7350
## Max. : 1.00000 Max. : 1.0000 Max. : 1.0000 Max. : 1.0000
## LINFOCITOS NEUTROPHILES P02_ARTERIAL P02_VENOUS
## Min. :-1.0000 Min. :-1.0000 Min. :-1.0000 Min. :-1.0000
## 1st Qu.:-0.8859 1st Qu.:-0.8904 1st Qu.:-0.1707 1st Qu.:-0.7041
## Median :-0.8029 Median :-0.8687 Median :-0.1707 Median :-0.7041
## Mean :-0.7646 Mean :-0.8243 Mean :-0.1728 Mean :-0.6879
## 3rd Qu.:-0.6846 3rd Qu.:-0.8047 3rd Qu.:-0.1707 3rd Qu.:-0.7041
## Max. : 1.0000 Max. : 1.0000 Max. : 1.0000 Max. : 1.0000
## PC02_ARTERIAL PC02_VENOUS PCR PH_ARTERIAL
## Min. :-1.0000 Min. :-1.0000 Min. :-1.0000 Min. :-1.0000
## 1st Qu.:-0.7793 1st Qu.:-0.7546 1st Qu.:-0.9773 1st Qu.: 0.2340
## Median :-0.7793 Median :-0.7546 Median :-0.8752 Median : 0.2340
## Mean :-0.7787 Mean :-0.7607 Mean :-0.8293 Mean : 0.2346
## 3rd Qu.:-0.7793 3rd Qu.:-0.7546 3rd Qu.:-0.8106 3rd Qu.: 0.2340
## Max. : 1.0000 Max. : 1.0000 Max. : 1.0000 Max. : 1.0000
## PH_VENOUS PLATELETS POTASSIUM SAT02_ARTERIAL
## Min. :-1.0000 Min. :-1.0000 Min. :-1.0000 Min. :-1.0000
## 1st Qu.: 0.3636 1st Qu.:-0.6075 1st Qu.:-0.6296 1st Qu.: 0.9394
## Median : 0.3636 Median :-0.5407 Median :-0.5185 Median : 0.9394
## Mean : 0.3729 Mean :-0.4805 Mean :-0.5408 Mean : 0.9264
## 3rd Qu.: 0.3636 3rd Qu.:-0.3858 3rd Qu.:-0.4815 3rd Qu.: 0.9394
## Max. : 1.0000 Max. : 1.0000 Max. : 1.0000 Max. : 1.0000
## SAT02_VENOUS SODIUM TGO TGP
## Min. :-1.0000 Min. :-1.00000 Min. :-1.0000 Min. :-1.0000
## 1st Qu.: 0.3457 1st Qu.:-0.14286 1st Qu.:-0.9972 1st Qu.:-0.9931
## Median : 0.3457 Median :-0.02857 Median :-0.9957 Median :-0.9901
## Mean : 0.3163 Mean :-0.07198 Mean :-0.9930 Mean :-0.9855
## 3rd Qu.: 0.3457 3rd Qu.: 0.02857 3rd Qu.:-0.9954 3rd Qu.:-0.9867
## Max. : 1.0000 Max. : 1.00000 Max. : 1.0000 Max. : 1.0000
## TTPA UREA DIMER
## Min. :-1.0000 Min. :-1.0000 Min. :-1.0000
## 1st Qu.:-0.8466 1st Qu.:-0.8892 1st Qu.:-0.9938
## Median :-0.8466 Median :-0.8554 Median :-0.9780
## Mean :-0.8290 Mean :-0.8365 Mean :-0.9608
## 3rd Qu.:-0.8256 3rd Qu.:-0.8265 3rd Qu.:-0.9743
## Max. : 1.0000 Max. : 1.0000 Max. : 1.0000
## BLOODPRESSURE_DIASTOLIC_MEDIAN BLOODPRESSURE_SISTOLIC_MEDIAN HEART_RATE_MEDIAN
## Min. :-1.00000 Min. :-1.0000 Min. :-1.0000
## 1st Qu.:-0.18519 1st Qu.:-0.4769 1st Qu.:-0.3774
## Median : 0.03704 Median :-0.2615 Median :-0.2830
## Mean :-0.04978 Mean :-0.3149 Mean :-0.2689
## 3rd Qu.: 0.08642 3rd Qu.:-0.2308 3rd Qu.:-0.1887
## Max. : 1.00000 Max. : 1.0000 Max. : 1.0000
## RESPIRATORY_RATE_MEDIAN TEMPERATURE_MEDIAN OXYGEN_SATURATION_MEDIAN
## Min. :-1.0000 Min. :-1.00000 Min. :-1.0000
## 1st Qu.:-0.5862 1st Qu.:-0.28571 1st Qu.: 0.7368
## Median :-0.5172 Median :-0.07143 Median : 0.7368
## Mean :-0.4792 Mean :-0.02404 Mean : 0.7462
## 3rd Qu.:-0.4483 3rd Qu.: 0.14286 3rd Qu.: 0.7895
## Max. : 1.0000 Max. : 1.00000 Max. : 1.0000
## BLOODPRESSURE_DIASTOLIC_MIN BLOODPRESSURE_SISTOLIC_MIN HEART_RATE_MIN
## Min. :-1.00000 Min. :-1.000 Min. :-1.0000
## 1st Qu.:-0.17526 1st Qu.:-0.325 1st Qu.:-0.3675
## Median : 0.07216 Median :-0.125 Median :-0.1624
## Mean : 0.04133 Mean :-0.147 Mean :-0.2263
## 3rd Qu.: 0.23711 3rd Qu.: 0.000 3rd Qu.:-0.1624
## Max. : 1.00000 Max. : 1.000 Max. : 1.0000
## RESPIRATORY_RATE_MIN TEMPERATURE_MIN OXYGEN_SATURATION_MIN
## Min. :-1.0000 Min. :-1.0000 Min. :-1.0000
## 1st Qu.:-0.5714 1st Qu.: 0.2088 1st Qu.: 0.8586
## Median :-0.5000 Median : 0.2527 Median : 0.8990
## Mean :-0.4758 Mean : 0.3069 Mean : 0.8465
## 3rd Qu.:-0.4286 3rd Qu.: 0.4286 3rd Qu.: 0.9192
## Max. : 1.0000 Max. : 1.0000 Max. : 1.0000
## BLOODPRESSURE_DIASTOLIC_MAX BLOODPRESSURE_SISTOLIC_MAX HEART_RATE_MAX
## Min. :-1.0000 Min. :-1.0000 Min. :-1.0000
## 1st Qu.:-0.3846 1st Qu.:-0.5676 1st Qu.:-0.4478
## Median :-0.2479 Median :-0.4595 Median :-0.4328
## Mean :-0.2496 Mean :-0.4298 Mean :-0.3305
## 3rd Qu.:-0.1795 3rd Qu.:-0.3514 3rd Qu.:-0.2239
## Max. : 1.0000 Max. : 1.0000 Max. : 1.0000
## RESPIRATORY_RATE_MAX TEMPERATURE_MAX OXYGEN_SATURATION_MAX
## Min. :-1.0000 Min. :-1.0000 Min. :-1.0000
## 1st Qu.:-0.6364 1st Qu.:-0.4203 1st Qu.: 0.7368
## Median :-0.5758 Median :-0.1594 Median : 0.7895
## Mean :-0.4268 Mean :-0.1093 Mean : 0.7925
## 3rd Qu.:-0.3939 3rd Qu.: 0.1014 3rd Qu.: 0.8947
## Max. : 1.0000 Max. : 1.0000 Max. : 1.0000
## BLOODPRESSURE_DIASTOLIC_DIFF BLOODPRESSURE_SISTOLIC_DIFF HEART_RATE_DIFF
## Min. :-1.0000 Min. :-1.0000 Min. :-1.0000
## 1st Qu.:-1.0000 1st Qu.:-1.0000 1st Qu.:-1.0000
## Median :-1.0000 Median :-1.0000 Median :-1.0000
## Mean :-0.8366 Mean :-0.8221 Mean :-0.8382
## 3rd Qu.:-0.8261 3rd Qu.:-0.8037 3rd Qu.:-0.8473
## Max. : 1.0000 Max. : 1.0000 Max. : 1.0000
## RESPIRATORY_RATE_DIFF TEMPERATURE_DIFF OXYGEN_SATURATION_DIFF
## Min. :-1.0000 Min. :-1.0000 Min. :-1.0000
## 1st Qu.:-1.0000 1st Qu.:-1.0000 1st Qu.:-1.0000
## Median :-1.0000 Median :-1.0000 Median :-1.0000
## Mean :-0.8166 Mean :-0.8508 Mean :-0.9261
## 3rd Qu.:-0.8824 3rd Qu.:-0.8333 3rd Qu.:-0.9394
## Max. : 1.0000 Max. : 1.0000 Max. : 1.0000
## BLOODPRESSURE_DIASTOLIC_DIFF_REL BLOODPRESSURE_SISTOLIC_DIFF_REL
## Min. :-1.0000 Min. :-1.0000
## 1st Qu.:-1.0000 1st Qu.:-1.0000
## Median :-1.0000 Median :-1.0000
## Mean :-0.8593 Mean :-0.8141
## 3rd Qu.:-0.8509 3rd Qu.:-0.8037
## Max. : 1.0000 Max. : 1.0000
## HEART_RATE_DIFF_REL RESPIRATORY_RATE_DIFF_REL TEMPERATURE_DIFF_REL
## Min. :-1.0000 Min. :-1.0000 Min. :-1.0000
## 1st Qu.:-1.0000 1st Qu.:-1.0000 1st Qu.:-1.0000
## Median :-1.0000 Median :-1.0000 Median :-1.0000
## Mean :-0.8803 Mean :-0.8261 Mean :-0.8514
## 3rd Qu.:-0.8847 3rd Qu.:-0.8629 3rd Qu.:-0.8347
## Max. : 1.0000 Max. : 1.0000 Max. : 1.0000
## OXYGEN_SATURATION_DIFF_REL WINDOW ICU response
## Min. :-1.0000 Length:1925 Min. :0.0000 0:950
## 1st Qu.:-1.0000 Class :character 1st Qu.:0.0000 1:975
## Median :-1.0000 Mode :character Median :0.0000
## Mean :-0.9260 Mean :0.2675
## 3rd Qu.:-0.9401 3rd Qu.:1.0000
## Max. : 1.0000 Max. :1.0000
make_factors <- function(data){
data$AGE_ABOVE65 <- as.factor(data$AGE_ABOVE65)
data$GENDER <- as.factor(data$GENDER)
data$IMMUNOCOMPROMISED <- as.factor(data$IMMUNOCOMPROMISED)
data$HTN <- as.factor(data$HTN)
data$response <- as.factor(data$response)
data$`DISEASE GROUPING 1` <- as.factor(data$`DISEASE GROUPING 1`)
data$`DISEASE GROUPING 2` <- as.factor(data$`DISEASE GROUPING 2`)
data$`DISEASE GROUPING 3` <- as.factor(data$`DISEASE GROUPING 3`)
data$`DISEASE GROUPING 4` <- as.factor(data$`DISEASE GROUPING 4`)
data$`DISEASE GROUPING 5` <- as.factor(data$`DISEASE GROUPING 5`)
data$`DISEASE GROUPING 6` <- as.factor(data$`DISEASE GROUPING 6`)
return(data)
}
corr_data <- make_factors(corr_data)
summary(corr_data)
## PATIENT_VISIT_IDENTIFIER AGE_ABOVE65 GENDER DISEASE GROUPING 1
## Min. : 0 0:1025 0:1215 0:1717
## 1st Qu.: 96 1: 900 1: 710 1: 208
## Median :192
## Mean :192
## 3rd Qu.:288
## Max. :384
## DISEASE GROUPING 2 DISEASE GROUPING 3 DISEASE GROUPING 4 DISEASE GROUPING 5
## 0:1871 0:1737 0:1887 0:1674
## 1: 54 1: 188 1: 38 1: 251
##
##
##
##
## DISEASE GROUPING 6 HTN IMMUNOCOMPROMISED ALBUMIN
## 0:1830 0:1516 0:1621 Min. :-1.0000
## 1: 95 1: 409 1: 304 1st Qu.: 0.6053
## Median : 0.6053
## Mean : 0.5630
## 3rd Qu.: 0.6053
## Max. : 1.0000
## BE_ARTERIAL BE_VENOUS BIC_VENOUS BILLIRUBIN
## Min. :-1.0000 Min. :-1.0000 Min. :-1.0000 Min. :-1.0000
## 1st Qu.:-1.0000 1st Qu.:-1.0000 1st Qu.:-0.3171 1st Qu.:-0.9613
## Median :-1.0000 Median :-1.0000 Median :-0.3171 Median :-0.9389
## Mean :-0.9839 Mean :-0.9542 Mean :-0.3191 Mean :-0.9451
## 3rd Qu.:-1.0000 3rd Qu.:-1.0000 3rd Qu.:-0.3171 3rd Qu.:-0.9389
## Max. : 1.0000 Max. : 1.0000 Max. : 1.0000 Max. : 1.0000
## BLAST CALCIUM CREATININ FFA
## Min. :-1.0000 Min. :-1.0000 Min. :-1.0000 Min. :-1.0000
## 1st Qu.:-1.0000 1st Qu.: 0.2041 1st Qu.:-0.9250 1st Qu.:-0.7420
## Median :-1.0000 Median : 0.3571 Median :-0.9038 Median :-0.7420
## Mean :-0.9951 Mean : 0.3034 Mean :-0.8842 Mean :-0.7273
## 3rd Qu.:-1.0000 3rd Qu.: 0.3571 3rd Qu.:-0.8684 3rd Qu.:-0.7420
## Max. : 1.0000 Max. : 1.0000 Max. : 1.0000 Max. : 1.0000
## GLUCOSE HEMATOCRITE INR LACTATE
## Min. :-1.0000 Min. :-1.00000 Min. :-1.0000 Min. :-1.0000
## 1st Qu.:-0.8920 1st Qu.:-0.22432 1st Qu.:-0.9598 1st Qu.:-0.8053
## Median :-0.8920 Median :-0.06499 Median :-0.9598 Median : 1.0000
## Mean :-0.8614 Mean :-0.09683 Mean :-0.9381 Mean : 0.4652
## 3rd Qu.:-0.8920 3rd Qu.: 0.09015 3rd Qu.:-0.9322 3rd Qu.: 1.0000
## Max. : 1.0000 Max. : 1.00000 Max. : 1.0000 Max. : 1.0000
## LEUKOCYTES LINFOCITOS P02_ARTERIAL P02_VENOUS
## Min. :-1.0000 Min. :-1.0000 Min. :-1.0000 Min. :-1.0000
## 1st Qu.:-0.8358 1st Qu.:-0.8859 1st Qu.:-0.1707 1st Qu.:-0.7041
## Median :-0.8038 Median :-0.8029 Median :-0.1707 Median :-0.7041
## Mean :-0.7647 Mean :-0.7646 Mean :-0.1728 Mean :-0.6879
## 3rd Qu.:-0.7350 3rd Qu.:-0.6846 3rd Qu.:-0.1707 3rd Qu.:-0.7041
## Max. : 1.0000 Max. : 1.0000 Max. : 1.0000 Max. : 1.0000
## PC02_ARTERIAL PC02_VENOUS PCR PH_ARTERIAL
## Min. :-1.0000 Min. :-1.0000 Min. :-1.0000 Min. :-1.0000
## 1st Qu.:-0.7793 1st Qu.:-0.7546 1st Qu.:-0.9773 1st Qu.: 0.2340
## Median :-0.7793 Median :-0.7546 Median :-0.8752 Median : 0.2340
## Mean :-0.7787 Mean :-0.7607 Mean :-0.8293 Mean : 0.2346
## 3rd Qu.:-0.7793 3rd Qu.:-0.7546 3rd Qu.:-0.8106 3rd Qu.: 0.2340
## Max. : 1.0000 Max. : 1.0000 Max. : 1.0000 Max. : 1.0000
## PH_VENOUS PLATELETS POTASSIUM SAT02_ARTERIAL
## Min. :-1.0000 Min. :-1.0000 Min. :-1.0000 Min. :-1.0000
## 1st Qu.: 0.3636 1st Qu.:-0.6075 1st Qu.:-0.6296 1st Qu.: 0.9394
## Median : 0.3636 Median :-0.5407 Median :-0.5185 Median : 0.9394
## Mean : 0.3729 Mean :-0.4805 Mean :-0.5408 Mean : 0.9264
## 3rd Qu.: 0.3636 3rd Qu.:-0.3858 3rd Qu.:-0.4815 3rd Qu.: 0.9394
## Max. : 1.0000 Max. : 1.0000 Max. : 1.0000 Max. : 1.0000
## SODIUM TGO TTPA UREA
## Min. :-1.00000 Min. :-1.0000 Min. :-1.0000 Min. :-1.0000
## 1st Qu.:-0.14286 1st Qu.:-0.9972 1st Qu.:-0.8466 1st Qu.:-0.8892
## Median :-0.02857 Median :-0.9957 Median :-0.8466 Median :-0.8554
## Mean :-0.07198 Mean :-0.9930 Mean :-0.8290 Mean :-0.8365
## 3rd Qu.: 0.02857 3rd Qu.:-0.9954 3rd Qu.:-0.8256 3rd Qu.:-0.8265
## Max. : 1.00000 Max. : 1.0000 Max. : 1.0000 Max. : 1.0000
## DIMER BLOODPRESSURE_DIASTOLIC_MEDIAN BLOODPRESSURE_SISTOLIC_MEDIAN
## Min. :-1.0000 Min. :-1.00000 Min. :-1.0000
## 1st Qu.:-0.9938 1st Qu.:-0.18519 1st Qu.:-0.4769
## Median :-0.9780 Median : 0.03704 Median :-0.2615
## Mean :-0.9608 Mean :-0.04978 Mean :-0.3149
## 3rd Qu.:-0.9743 3rd Qu.: 0.08642 3rd Qu.:-0.2308
## Max. : 1.0000 Max. : 1.00000 Max. : 1.0000
## HEART_RATE_MEDIAN RESPIRATORY_RATE_MEDIAN TEMPERATURE_MEDIAN
## Min. :-1.0000 Min. :-1.0000 Min. :-1.00000
## 1st Qu.:-0.3774 1st Qu.:-0.5862 1st Qu.:-0.28571
## Median :-0.2830 Median :-0.5172 Median :-0.07143
## Mean :-0.2689 Mean :-0.4792 Mean :-0.02404
## 3rd Qu.:-0.1887 3rd Qu.:-0.4483 3rd Qu.: 0.14286
## Max. : 1.0000 Max. : 1.0000 Max. : 1.00000
## OXYGEN_SATURATION_MEDIAN RESPIRATORY_RATE_MIN OXYGEN_SATURATION_MIN
## Min. :-1.0000 Min. :-1.0000 Min. :-1.0000
## 1st Qu.: 0.7368 1st Qu.:-0.5714 1st Qu.: 0.8586
## Median : 0.7368 Median :-0.5000 Median : 0.8990
## Mean : 0.7462 Mean :-0.4758 Mean : 0.8465
## 3rd Qu.: 0.7895 3rd Qu.:-0.4286 3rd Qu.: 0.9192
## Max. : 1.0000 Max. : 1.0000 Max. : 1.0000
## BLOODPRESSURE_DIASTOLIC_MAX BLOODPRESSURE_SISTOLIC_MAX HEART_RATE_MAX
## Min. :-1.0000 Min. :-1.0000 Min. :-1.0000
## 1st Qu.:-0.3846 1st Qu.:-0.5676 1st Qu.:-0.4478
## Median :-0.2479 Median :-0.4595 Median :-0.4328
## Mean :-0.2496 Mean :-0.4298 Mean :-0.3305
## 3rd Qu.:-0.1795 3rd Qu.:-0.3514 3rd Qu.:-0.2239
## Max. : 1.0000 Max. : 1.0000 Max. : 1.0000
## RESPIRATORY_RATE_MAX WINDOW ICU response
## Min. :-1.0000 Length:1925 Min. :0.0000 0:950
## 1st Qu.:-0.6364 Class :character 1st Qu.:0.0000 1:975
## Median :-0.5758 Mode :character Median :0.0000
## Mean :-0.4268 Mean :0.2675
## 3rd Qu.:-0.3939 3rd Qu.:1.0000
## Max. : 1.0000 Max. :1.0000
getWindows <- function(data, windows){
new_data <- subset(data, WINDOW %in% windows)
return(new_data)
}
Get Windows
new_data <- getWindows(corr_data, unique(data$WINDOW))
nrow(new_data)
## [1] 1925
new_data <- new_data[,-(which(colnames(new_data) %in%
c("PATIENT_VISIT_IDENTIFIER", "OTHER",
"ICU","WINDOW", "AGE_PERCENTIL")))]
summary(new_data)
## AGE_ABOVE65 GENDER DISEASE GROUPING 1 DISEASE GROUPING 2 DISEASE GROUPING 3
## 0:1025 0:1215 0:1717 0:1871 0:1737
## 1: 900 1: 710 1: 208 1: 54 1: 188
##
##
##
##
## DISEASE GROUPING 4 DISEASE GROUPING 5 DISEASE GROUPING 6 HTN
## 0:1887 0:1674 0:1830 0:1516
## 1: 38 1: 251 1: 95 1: 409
##
##
##
##
## IMMUNOCOMPROMISED ALBUMIN BE_ARTERIAL BE_VENOUS
## 0:1621 Min. :-1.0000 Min. :-1.0000 Min. :-1.0000
## 1: 304 1st Qu.: 0.6053 1st Qu.:-1.0000 1st Qu.:-1.0000
## Median : 0.6053 Median :-1.0000 Median :-1.0000
## Mean : 0.5630 Mean :-0.9839 Mean :-0.9542
## 3rd Qu.: 0.6053 3rd Qu.:-1.0000 3rd Qu.:-1.0000
## Max. : 1.0000 Max. : 1.0000 Max. : 1.0000
## BIC_VENOUS BILLIRUBIN BLAST CALCIUM
## Min. :-1.0000 Min. :-1.0000 Min. :-1.0000 Min. :-1.0000
## 1st Qu.:-0.3171 1st Qu.:-0.9613 1st Qu.:-1.0000 1st Qu.: 0.2041
## Median :-0.3171 Median :-0.9389 Median :-1.0000 Median : 0.3571
## Mean :-0.3191 Mean :-0.9451 Mean :-0.9951 Mean : 0.3034
## 3rd Qu.:-0.3171 3rd Qu.:-0.9389 3rd Qu.:-1.0000 3rd Qu.: 0.3571
## Max. : 1.0000 Max. : 1.0000 Max. : 1.0000 Max. : 1.0000
## CREATININ FFA GLUCOSE HEMATOCRITE
## Min. :-1.0000 Min. :-1.0000 Min. :-1.0000 Min. :-1.00000
## 1st Qu.:-0.9250 1st Qu.:-0.7420 1st Qu.:-0.8920 1st Qu.:-0.22432
## Median :-0.9038 Median :-0.7420 Median :-0.8920 Median :-0.06499
## Mean :-0.8842 Mean :-0.7273 Mean :-0.8614 Mean :-0.09683
## 3rd Qu.:-0.8684 3rd Qu.:-0.7420 3rd Qu.:-0.8920 3rd Qu.: 0.09015
## Max. : 1.0000 Max. : 1.0000 Max. : 1.0000 Max. : 1.00000
## INR LACTATE LEUKOCYTES LINFOCITOS
## Min. :-1.0000 Min. :-1.0000 Min. :-1.0000 Min. :-1.0000
## 1st Qu.:-0.9598 1st Qu.:-0.8053 1st Qu.:-0.8358 1st Qu.:-0.8859
## Median :-0.9598 Median : 1.0000 Median :-0.8038 Median :-0.8029
## Mean :-0.9381 Mean : 0.4652 Mean :-0.7647 Mean :-0.7646
## 3rd Qu.:-0.9322 3rd Qu.: 1.0000 3rd Qu.:-0.7350 3rd Qu.:-0.6846
## Max. : 1.0000 Max. : 1.0000 Max. : 1.0000 Max. : 1.0000
## P02_ARTERIAL P02_VENOUS PC02_ARTERIAL PC02_VENOUS
## Min. :-1.0000 Min. :-1.0000 Min. :-1.0000 Min. :-1.0000
## 1st Qu.:-0.1707 1st Qu.:-0.7041 1st Qu.:-0.7793 1st Qu.:-0.7546
## Median :-0.1707 Median :-0.7041 Median :-0.7793 Median :-0.7546
## Mean :-0.1728 Mean :-0.6879 Mean :-0.7787 Mean :-0.7607
## 3rd Qu.:-0.1707 3rd Qu.:-0.7041 3rd Qu.:-0.7793 3rd Qu.:-0.7546
## Max. : 1.0000 Max. : 1.0000 Max. : 1.0000 Max. : 1.0000
## PCR PH_ARTERIAL PH_VENOUS PLATELETS
## Min. :-1.0000 Min. :-1.0000 Min. :-1.0000 Min. :-1.0000
## 1st Qu.:-0.9773 1st Qu.: 0.2340 1st Qu.: 0.3636 1st Qu.:-0.6075
## Median :-0.8752 Median : 0.2340 Median : 0.3636 Median :-0.5407
## Mean :-0.8293 Mean : 0.2346 Mean : 0.3729 Mean :-0.4805
## 3rd Qu.:-0.8106 3rd Qu.: 0.2340 3rd Qu.: 0.3636 3rd Qu.:-0.3858
## Max. : 1.0000 Max. : 1.0000 Max. : 1.0000 Max. : 1.0000
## POTASSIUM SAT02_ARTERIAL SODIUM TGO
## Min. :-1.0000 Min. :-1.0000 Min. :-1.00000 Min. :-1.0000
## 1st Qu.:-0.6296 1st Qu.: 0.9394 1st Qu.:-0.14286 1st Qu.:-0.9972
## Median :-0.5185 Median : 0.9394 Median :-0.02857 Median :-0.9957
## Mean :-0.5408 Mean : 0.9264 Mean :-0.07198 Mean :-0.9930
## 3rd Qu.:-0.4815 3rd Qu.: 0.9394 3rd Qu.: 0.02857 3rd Qu.:-0.9954
## Max. : 1.0000 Max. : 1.0000 Max. : 1.00000 Max. : 1.0000
## TTPA UREA DIMER
## Min. :-1.0000 Min. :-1.0000 Min. :-1.0000
## 1st Qu.:-0.8466 1st Qu.:-0.8892 1st Qu.:-0.9938
## Median :-0.8466 Median :-0.8554 Median :-0.9780
## Mean :-0.8290 Mean :-0.8365 Mean :-0.9608
## 3rd Qu.:-0.8256 3rd Qu.:-0.8265 3rd Qu.:-0.9743
## Max. : 1.0000 Max. : 1.0000 Max. : 1.0000
## BLOODPRESSURE_DIASTOLIC_MEDIAN BLOODPRESSURE_SISTOLIC_MEDIAN HEART_RATE_MEDIAN
## Min. :-1.00000 Min. :-1.0000 Min. :-1.0000
## 1st Qu.:-0.18519 1st Qu.:-0.4769 1st Qu.:-0.3774
## Median : 0.03704 Median :-0.2615 Median :-0.2830
## Mean :-0.04978 Mean :-0.3149 Mean :-0.2689
## 3rd Qu.: 0.08642 3rd Qu.:-0.2308 3rd Qu.:-0.1887
## Max. : 1.00000 Max. : 1.0000 Max. : 1.0000
## RESPIRATORY_RATE_MEDIAN TEMPERATURE_MEDIAN OXYGEN_SATURATION_MEDIAN
## Min. :-1.0000 Min. :-1.00000 Min. :-1.0000
## 1st Qu.:-0.5862 1st Qu.:-0.28571 1st Qu.: 0.7368
## Median :-0.5172 Median :-0.07143 Median : 0.7368
## Mean :-0.4792 Mean :-0.02404 Mean : 0.7462
## 3rd Qu.:-0.4483 3rd Qu.: 0.14286 3rd Qu.: 0.7895
## Max. : 1.0000 Max. : 1.00000 Max. : 1.0000
## RESPIRATORY_RATE_MIN OXYGEN_SATURATION_MIN BLOODPRESSURE_DIASTOLIC_MAX
## Min. :-1.0000 Min. :-1.0000 Min. :-1.0000
## 1st Qu.:-0.5714 1st Qu.: 0.8586 1st Qu.:-0.3846
## Median :-0.5000 Median : 0.8990 Median :-0.2479
## Mean :-0.4758 Mean : 0.8465 Mean :-0.2496
## 3rd Qu.:-0.4286 3rd Qu.: 0.9192 3rd Qu.:-0.1795
## Max. : 1.0000 Max. : 1.0000 Max. : 1.0000
## BLOODPRESSURE_SISTOLIC_MAX HEART_RATE_MAX RESPIRATORY_RATE_MAX response
## Min. :-1.0000 Min. :-1.0000 Min. :-1.0000 0:950
## 1st Qu.:-0.5676 1st Qu.:-0.4478 1st Qu.:-0.6364 1:975
## Median :-0.4595 Median :-0.4328 Median :-0.5758
## Mean :-0.4298 Mean :-0.3305 Mean :-0.4268
## 3rd Qu.:-0.3514 3rd Qu.:-0.2239 3rd Qu.:-0.3939
## Max. : 1.0000 Max. : 1.0000 Max. : 1.0000
colnames(new_data)
## [1] "AGE_ABOVE65" "GENDER"
## [3] "DISEASE GROUPING 1" "DISEASE GROUPING 2"
## [5] "DISEASE GROUPING 3" "DISEASE GROUPING 4"
## [7] "DISEASE GROUPING 5" "DISEASE GROUPING 6"
## [9] "HTN" "IMMUNOCOMPROMISED"
## [11] "ALBUMIN" "BE_ARTERIAL"
## [13] "BE_VENOUS" "BIC_VENOUS"
## [15] "BILLIRUBIN" "BLAST"
## [17] "CALCIUM" "CREATININ"
## [19] "FFA" "GLUCOSE"
## [21] "HEMATOCRITE" "INR"
## [23] "LACTATE" "LEUKOCYTES"
## [25] "LINFOCITOS" "P02_ARTERIAL"
## [27] "P02_VENOUS" "PC02_ARTERIAL"
## [29] "PC02_VENOUS" "PCR"
## [31] "PH_ARTERIAL" "PH_VENOUS"
## [33] "PLATELETS" "POTASSIUM"
## [35] "SAT02_ARTERIAL" "SODIUM"
## [37] "TGO" "TTPA"
## [39] "UREA" "DIMER"
## [41] "BLOODPRESSURE_DIASTOLIC_MEDIAN" "BLOODPRESSURE_SISTOLIC_MEDIAN"
## [43] "HEART_RATE_MEDIAN" "RESPIRATORY_RATE_MEDIAN"
## [45] "TEMPERATURE_MEDIAN" "OXYGEN_SATURATION_MEDIAN"
## [47] "RESPIRATORY_RATE_MIN" "OXYGEN_SATURATION_MIN"
## [49] "BLOODPRESSURE_DIASTOLIC_MAX" "BLOODPRESSURE_SISTOLIC_MAX"
## [51] "HEART_RATE_MAX" "RESPIRATORY_RATE_MAX"
## [53] "response"
paper%in%colnames(new_data)
## [1] TRUE TRUE TRUE TRUE TRUE TRUE
set.seed(200)
train_indices <- sample(nrow(new_data), size = nrow(new_data)*0.7, replace = FALSE)
train_new = new_data[train_indices,]
test_new = new_data[-train_indices,]
Finally, we obtain from the dataset:
new_data: train_new and test_new with 80% and 20% of new_data respectively.
#### Get Info from Matrix
getInfo = function(matrix,y_pred){
return(list(
accuracy =((matrix[1,1] + matrix[2,2])/sum(matrix))*100,
sensitivity = sensitivity(matrix)*100,
specificity = specificity(matrix)*100,
auc = auc(roc(response=test_new$response, predictor= as.numeric(y_pred))),
roc = roc(response=test_new$response, predictor= as.numeric(y_pred))
))
}
#### Predict & Get Accuracy
getAccuracy = function(model, test_data){
y_pred = predict(model, newdata = test_data)
tab <- table(y_pred,test_data$response)
return(getInfo(tab,y_pred))
}
mod_xgb <- function(data, k){
control = trainControl(method="cv", number=k)
cv.xgb = train(
response~.,
data=data,
tuneLength = 3,
method="xgbTree",
trControl = control,
verbose = 0
)
return(cv.xgb)
}
xgmods <- mod_xgb(train_new, 10)
results_xg_tune_noEFS <- getAccuracy(xgmods, test_new)
results_xg_tune_noEFS
## $accuracy
## [1] 89.79239
##
## $sensitivity
## [1] 91.81495
##
## $specificity
## [1] 87.87879
##
## $auc
## Area under the curve: 0.8985
##
## $roc
##
## Call:
## roc.default(response = test_new$response, predictor = as.numeric(y_pred))
##
## Data: as.numeric(y_pred) in 281 controls (test_new$response 0) < 297 cases (test_new$response 1).
## Area under the curve: 0.8985
Speicifc Parameters to Try on new_data
dtrain_new <- Matrix(data.matrix(train_new[,-ncol(train_new)]), sparse = TRUE)
dtest_new <- Matrix(data.matrix(test_new[,-ncol(test_new)]), sparse = TRUE)
xg_new <- xgboost(dtrain_new, label = as.numeric(train_new$response)-1,
nrounds = 150, subsample = 0.8, colsample_bytree = 0.8,
early_stopping_rounds = 100, max_depth =5,
objective = "binary:logistic",
validate_parameters = TRUE,
eval_metric = "auc")
## [1] train-auc:0.863898
## Will train until train_auc hasn't improved in 100 rounds.
##
## [2] train-auc:0.920337
## [3] train-auc:0.937123
## [4] train-auc:0.948901
## [5] train-auc:0.957677
## [6] train-auc:0.963513
## [7] train-auc:0.967526
## [8] train-auc:0.977318
## [9] train-auc:0.978905
## [10] train-auc:0.980889
## [11] train-auc:0.983076
## [12] train-auc:0.983937
## [13] train-auc:0.984609
## [14] train-auc:0.985707
## [15] train-auc:0.986775
## [16] train-auc:0.988629
## [17] train-auc:0.990142
## [18] train-auc:0.990840
## [19] train-auc:0.991627
## [20] train-auc:0.992292
## [21] train-auc:0.992455
## [22] train-auc:0.993475
## [23] train-auc:0.993773
## [24] train-auc:0.994168
## [25] train-auc:0.994322
## [26] train-auc:0.994666
## [27] train-auc:0.994990
## [28] train-auc:0.995134
## [29] train-auc:0.995372
## [30] train-auc:0.995726
## [31] train-auc:0.996024
## [32] train-auc:0.996156
## [33] train-auc:0.996328
## [34] train-auc:0.996423
## [35] train-auc:0.996676
## [36] train-auc:0.996727
## [37] train-auc:0.996950
## [38] train-auc:0.997012
## [39] train-auc:0.997017
## [40] train-auc:0.997273
## [41] train-auc:0.997299
## [42] train-auc:0.997330
## [43] train-auc:0.997469
## [44] train-auc:0.997491
## [45] train-auc:0.997725
## [46] train-auc:0.997769
## [47] train-auc:0.997921
## [48] train-auc:0.997943
## [49] train-auc:0.998100
## [50] train-auc:0.998155
## .......................
## [125] train-auc:0.999055
## [126] train-auc:0.999062
## [127] train-auc:0.998998
## [128] train-auc:0.999068
## [129] train-auc:0.999068
## [130] train-auc:0.999062
## [131] train-auc:0.999062
## [132] train-auc:0.999047
## [133] train-auc:0.999024
## [134] train-auc:0.999024
## [135] train-auc:0.999038
## [136] train-auc:0.999038
## [137] train-auc:0.999038
## [138] train-auc:0.999060
## [139] train-auc:0.999060
## [140] train-auc:0.999053
## [141] train-auc:0.999049
## [142] train-auc:0.999047
## [143] train-auc:0.999031
## [144] train-auc:0.999062
## [145] train-auc:0.999047
## [146] train-auc:0.999047
## [147] train-auc:0.999068
## [148] train-auc:0.999068
## [149] train-auc:0.999060
## [150] train-auc:0.999062
pred_new <- ifelse(predict(xg_new, dtest_new) >= 0.5, 1, 0)
tab_new <- table(pred_new, test_new$response)
results_xg_noEFS <- getInfo(tab_new,pred_new)
results_xg_noEFS
## $accuracy
## [1] 90.31142
##
## $sensitivity
## [1] 92.17082
##
## $specificity
## [1] 88.55219
##
## $auc
## Area under the curve: 0.9036
##
## $roc
##
## Call:
## roc.default(response = test_new$response, predictor = as.numeric(y_pred))
##
## Data: as.numeric(y_pred) in 281 controls (test_new$response 0) < 297 cases (test_new$response 1).
## Area under the curve: 0.9036
ps <- c(seq(1,ncol(new_data)-1))
for(i in seq(1,ncol(new_data)-1)){
pred <- unlist(new_data[,i])
resp <- as.numeric(new_data$response)-1
mat <- cbind(pred,resp)
df <- as.data.frame(mat)
lm.fit <- lm(pred~resp,df)
ps[i] <- summary(lm.fit)$coefficients[2,4]
}
head(ps)
## [1] 1.470736e-38 1.585433e-08 9.532685e-03 1.502497e-05 1.968126e-07
## [6] 2.835204e-05
tbl <- cbind(ps, colnames(new_data)[-53])
or_tbl <- tbl[order(as.numeric(tbl[,1]), decreasing = FALSE),]
or_tbl
## ps
## [1,] "7.30664720971241e-54" "RESPIRATORY_RATE_MAX"
## [2,] "3.59399335367418e-43" "LACTATE"
## [3,] "1.47073621906567e-38" "AGE_ABOVE65"
## [4,] "3.93091529250932e-38" "RESPIRATORY_RATE_MEDIAN"
## [5,] "5.41684548795936e-26" "BLOODPRESSURE_DIASTOLIC_MEDIAN"
## [6,] "9.63744419847796e-23" "PCR"
## [7,] "1.10214095142365e-22" "UREA"
## [8,] "4.74978568068286e-18" "BLOODPRESSURE_SISTOLIC_MAX"
## [9,] "4.42347960244578e-17" "LEUKOCYTES"
## [10,] "1.0573123364535e-16" "HTN"
## [11,] "1.1603519937353e-14" "BE_VENOUS"
## [12,] "3.14637681067657e-13" "HEMATOCRITE"
## [13,] "4.56572431937058e-13" "ALBUMIN"
## [14,] "1.03361919175912e-10" "SODIUM"
## [15,] "1.3366523324479e-10" "BE_ARTERIAL"
## [16,] "3.64481126280959e-09" "CALCIUM"
## [17,] "5.09606702381456e-09" "CREATININ"
## [18,] "6.56298811366707e-09" "POTASSIUM"
## [19,] "1.30021346208415e-08" "OXYGEN_SATURATION_MIN"
## [20,] "1.5854333465344e-08" "GENDER"
## [21,] "1.86057910779632e-07" "INR"
## [22,] "1.96812614583134e-07" "DISEASE GROUPING 3"
## [23,] "6.62676033141478e-07" "SAT02_ARTERIAL"
## [24,] "1.1270376447761e-06" "DISEASE GROUPING 5"
## [25,] "3.93672404228426e-06" "GLUCOSE"
## [26,] "1.09116409465114e-05" "LINFOCITOS"
## [27,] "1.50249657586716e-05" "DISEASE GROUPING 2"
## [28,] "2.83520446807765e-05" "DISEASE GROUPING 4"
## [29,] "3.78963234259058e-05" "BLOODPRESSURE_SISTOLIC_MEDIAN"
## [30,] "5.03590852186578e-05" "HEART_RATE_MAX"
## [31,] "8.79895014865381e-05" "TTPA"
## [32,] "0.00037547055998641" "TEMPERATURE_MEDIAN"
## [33,] "0.000450905257115843" "OXYGEN_SATURATION_MEDIAN"
## [34,] "0.000539850767380292" "FFA"
## [35,] "0.00178756049605462" "RESPIRATORY_RATE_MIN"
## [36,] "0.0024429217273631" "P02_VENOUS"
## [37,] "0.00587907694336656" "IMMUNOCOMPROMISED"
## [38,] "0.00638249879452864" "DIMER"
## [39,] "0.00953268487264496" "DISEASE GROUPING 1"
## [40,] "0.0211360984147818" "BLAST"
## [41,] "0.0307575750566481" "BIC_VENOUS"
## [42,] "0.0371482634464712" "BLOODPRESSURE_DIASTOLIC_MAX"
## [43,] "0.0612777277402924" "PC02_VENOUS"
## [44,] "0.0868592820636745" "TGO"
## [45,] "0.13429720702812" "DISEASE GROUPING 6"
## [46,] "0.411141206943931" "PC02_ARTERIAL"
## [47,] "0.447215429415833" "BILLIRUBIN"
## [48,] "0.528085932554305" "P02_ARTERIAL"
## [49,] "0.663166334469405" "PLATELETS"
## [50,] "0.677035907847009" "PH_VENOUS"
## [51,] "0.867707785173232" "PH_ARTERIAL"
## [52,] "0.980672756719852" "HEART_RATE_MEDIAN"
last <- which.max(which(as.numeric(or_tbl[,1])<0.01))
top_acc = 0
results_top_features_xg = NA
acc <- seq(1,last)
for(i in seq(1,last)){
dd <- new_data[,which(colnames(new_data)
%in% c(or_tbl[seq(1,i),2],"response"))]
set.seed(200)
train_indices <- sample(nrow(dd), size = nrow(dd)*0.7, replace = FALSE)
train_new = dd[train_indices,]
test_new = dd[-train_indices,]
dtrain_new <- Matrix(data.matrix(train_new[,-ncol(train_new)]), sparse = TRUE)
dtest_new <- Matrix(data.matrix(test_new[,-ncol(test_new)]), sparse = TRUE)
xg_new <- xgboost(dtrain_new, label = as.numeric(train_new$response)-1,
nrounds = 150, subsample = 0.8, colsample_bytree = 0.8,
early_stopping_rounds = 100, max_depth =5,
objective = "binary:logistic",
validate_parameters = TRUE,
eval_metric = "auc",
verbose=0)
pred_new <- ifelse(predict(xg_new, dtest_new) >= 0.5, 1, 0)
tab_new <- table(pred_new, test_new$response)
acc[i] <- getInfo(tab_new, pred_new)$accuracy
if(acc[i]>top_acc){
top_acc = acc[i]
results_top_features_xg = getInfo(tab_new, pred_new)
}
}
c(or_tbl[seq(1,which.max(acc)),2])
## [1] "RESPIRATORY_RATE_MAX" "LACTATE"
## [3] "AGE_ABOVE65" "RESPIRATORY_RATE_MEDIAN"
## [5] "BLOODPRESSURE_DIASTOLIC_MEDIAN" "PCR"
## [7] "UREA" "BLOODPRESSURE_SISTOLIC_MAX"
## [9] "LEUKOCYTES" "HTN"
## [11] "BE_VENOUS" "HEMATOCRITE"
## [13] "ALBUMIN" "SODIUM"
## [15] "BE_ARTERIAL" "CALCIUM"
## [17] "CREATININ" "POTASSIUM"
## [19] "OXYGEN_SATURATION_MIN" "GENDER"
## [21] "INR" "DISEASE GROUPING 3"
## [23] "SAT02_ARTERIAL" "DISEASE GROUPING 5"
## [25] "GLUCOSE" "LINFOCITOS"
## [27] "DISEASE GROUPING 2" "DISEASE GROUPING 4"
results_top_features_xg
## $accuracy
## [1] 92.90657
##
## $sensitivity
## [1] 94.30605
##
## $specificity
## [1] 91.58249
##
## $auc
## Area under the curve: 0.9294
##
## $roc
##
## Call:
## roc.default(response = test_new$response, predictor = as.numeric(y_pred))
##
## Data: as.numeric(y_pred) in 281 controls (test_new$response 0) < 297 cases (test_new$response 1).
## Area under the curve: 0.9294
mod_da <- function(data, k, method){
control = trainControl(method="cv", number=k)
cv.da = train(
response~.,
data=data,
method=method,
trControl = control,
)
return(cv.da)
}
### Training the Model
set.seed(100)
lda.fit_new <- mod_da(train_new, 10, "lda")
lda.fit_new
## Linear Discriminant Analysis
##
## 1347 samples
## 39 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1213, 1212, 1212, 1213, 1212, 1213, ...
## Resampling results:
##
## Accuracy Kappa
## 0.7445882 0.4895812
### Results of LDA
set.seed(100)
results_lda_noEFS <- getAccuracy(lda.fit_new, test_new)
results_lda_noEFS
## $accuracy
## [1] 76.12457
##
## $sensitivity
## [1] 85.40925
##
## $specificity
## [1] 67.34007
##
## $auc
## Area under the curve: 0.7637
##
## $roc
##
## Call:
## roc.default(response = test_new$response, predictor = as.numeric(y_pred))
##
## Data: as.numeric(y_pred) in 281 controls (test_new$response 0) < 297 cases (test_new$response 1).
## Area under the curve: 0.7637
### Training the Model
### set.seed(100)
### qda.fit_new <- mod_da(train_new, 10, "qda")
### qda.fit_new
### Results of LDA without previous Exhaustive Feature Selection (EFS)
### set.seed(100)
### results_qda_noEFS <- getAccuracy(qda.fit_new, test_new)
### results_qda_noEFS
### Rank Dificiency in Group 0 Error
### This error arises when there are low variance features
### QDA doesn't work on this subset of features, we try after
### exhaustive feature selection.
mod_knn <- function(data, k){
control = trainControl(method="cv", number=k)
cv.knn = train(
response~.,
data = data,
method = "knn",
metric = "Accuracy",
trControl = control,
)
return(cv.knn)
}
Using new_data (data without exhaustive feature selection)
knn.fit_new <- mod_knn(train_new, 10)
knn.fit_new
## k-Nearest Neighbors
##
## 1347 samples
## 39 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1213, 1212, 1212, 1213, 1212, 1213, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 5 0.7497512 0.5002436
## 7 0.7371476 0.4751500
## 9 0.7222941 0.4455067
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 5.
results_knn_noEFS <- getAccuracy(knn.fit_new, test_new)
results_knn_noEFS
## $accuracy
## [1] 76.47059
##
## $sensitivity
## [1] 86.121
##
## $specificity
## [1] 67.34007
##
## $auc
## Area under the curve: 0.7673
##
## $roc
##
## Call:
## roc.default(response = test_new$response, predictor = as.numeric(y_pred))
##
## Data: as.numeric(y_pred) in 281 controls (test_new$response 0) < 297 cases (test_new$response 1).
## Area under the curve: 0.7673
mod_svm_poly <- function(data, degree){
tuned.poly = tune(svm,
response~.,
kernel = "polynomial",
data=data,
ranges = list(degree))
return(tuned.poly)
}
svm_poly_new <- mod_svm_poly(train_new, degree = seq(1:10))
results_svm_polynoEFS <- getAccuracy(svm_poly_new$best.model, test_new)
results_svm_polynoEFS
## $accuracy
## [1] 72.14533
##
## $sensitivity
## [1] 99.28826
##
## $specificity
## [1] 46.46465
##
## $auc
## Area under the curve: 0.7288
##
## $roc
##
## Call:
## roc.default(response = test_new$response, predictor = as.numeric(y_pred))
##
## Data: as.numeric(y_pred) in 281 controls (test_new$response 0) < 297 cases (test_new$response 1).
## Area under the curve: 0.7288
train_control <- trainControl(method="cv", number=10)
svm_poly2_new <- train(response ~ .,
method = 'svmPoly',
tuneLength = 2,
data = train_new,
trControl = train_control)
results_svm_polynoEFS2 = getAccuracy(svm_poly2_new, test_new)
results_svm_polynoEFS2
## $accuracy
## [1] 77.16263
##
## $sensitivity
## [1] 88.96797
##
## $specificity
## [1] 65.99327
##
## $auc
## Area under the curve: 0.7748
##
## $roc
##
## Call:
## roc.default(response = test_new$response, predictor = as.numeric(y_pred))
##
## Data: as.numeric(y_pred) in 281 controls (test_new$response 0) < 297 cases (test_new$response 1).
## Area under the curve: 0.7748
Using new_data (data without exhaustive feature selection)
svm_radial_new <- train(response ~ .,
method = 'svmRadial',
tuneLength = 4,
data = train_new,
trControl = train_control)
results_svm_radial_noEFS = getAccuracy(svm_radial_new, test_new)
results_svm_radial_noEFS
## $accuracy
## [1] 83.73702
##
## $sensitivity
## [1] 88.6121
##
## $specificity
## [1] 79.12458
##
## $auc
## Area under the curve: 0.8387
##
## $roc
##
## Call:
## roc.default(response = test_new$response, predictor = as.numeric(y_pred))
##
## Data: as.numeric(y_pred) in 281 controls (test_new$response 0) < 297 cases (test_new$response 1).
## Area under the curve: 0.8387
svm_LW_new <- train(response ~ .,
method = 'svmLinearWeights2',
tuneLength = 4,
data = train_new,
trControl = train_control)
results_svm_LW_new_noEFS = getAccuracy(svm_LW_new, test_new)
results_svm_LW_new_noEFS
## $accuracy
## [1] 77.33564
##
## $sensitivity
## [1] 86.47687
##
## $specificity
## [1] 68.68687
##
## $auc
## Area under the curve: 0.7758
##
## $roc
##
## Call:
## roc.default(response = test_new$response, predictor = as.numeric(y_pred))
##
## Data: as.numeric(y_pred) in 281 controls (test_new$response 0) < 297 cases (test_new$response 1).
## Area under the curve: 0.7758
Using new_data (data without exhaustive feature selection)
set.seed(2645)
## Perform Bagging: mtry = number of predictors
## Perform Random Forests: mtry = sqrt(p) for classification
oob.values <- matrix(data=NA, nrow=ncol(train_new)-1, ncol=4)
accuracies <- matrix(data=NA, nrow=ncol(train_new)-1, ncol=4)
attach(train_new)
for(i in 1:(ncol(train_new)-1)) {
for(n in c(250,500,750,1000)){
temp.model <- randomForest(response~ ., data=train_new,
importance=TRUE, mtry=i, ntree=n)
#Get the mse values
accuracies[i,n/250] <- getAccuracy(temp.model, test_new)$accuracy
#Get the oob values
oob.values[i,n/250] <- temp.model$err.rate[nrow(temp.model$err.rate),1]
}
}
accuracies
## [,1] [,2] [,3] [,4]
## [1,] 84.60208 84.25606 84.77509 83.91003
## [2,] 89.61938 89.96540 89.79239 90.13841
## [3,] 90.83045 89.27336 91.00346 90.65744
## [4,] 90.83045 91.52249 90.83045 90.65744
## [5,] 90.83045 90.65744 91.17647 90.65744
## [6,] 90.31142 91.17647 90.83045 90.83045
## [7,] 90.48443 90.48443 91.17647 90.83045
## [8,] 90.31142 91.00346 90.83045 91.00346
## [9,] 91.34948 91.00346 90.65744 90.48443
## [10,] 91.00346 91.00346 90.48443 91.17647
## [11,] 90.31142 90.65744 90.83045 91.17647
## [12,] 90.48443 91.17647 91.17647 90.31142
## [13,] 91.00346 90.48443 90.83045 91.17647
## [14,] 91.17647 91.34948 91.34948 91.34948
## [15,] 91.00346 91.00346 91.34948 91.17647
## [16,] 90.83045 91.17647 91.00346 90.83045
## [17,] 90.31142 91.17647 91.17647 90.48443
## [18,] 91.34948 91.17647 91.17647 90.65744
## [19,] 91.69550 91.17647 90.65744 90.48443
## [20,] 91.34948 90.83045 90.83045 90.65744
## [21,] 90.83045 90.65744 90.83045 91.00346
## [22,] 90.83045 90.83045 91.00346 91.17647
## [23,] 90.48443 91.34948 91.17647 91.00346
## [24,] 91.00346 91.52249 90.65744 91.17647
## [25,] 91.17647 91.17647 91.00346 91.17647
## [26,] 90.83045 91.17647 90.83045 91.17647
## [27,] 90.65744 91.00346 91.00346 91.17647
## [28,] 90.83045 91.00346 91.17647 91.17647
## [29,] 91.00346 91.00346 91.34948 91.17647
## [30,] 90.48443 91.17647 90.31142 90.83045
## [31,] 91.34948 91.17647 91.00346 90.83045
## [32,] 90.65744 91.34948 90.83045 91.34948
## [33,] 91.34948 91.00346 90.83045 91.34948
## [34,] 91.17647 91.34948 91.17647 91.17647
## [35,] 90.48443 90.83045 91.00346 91.00346
## [36,] 90.65744 90.83045 91.17647 90.65744
## [37,] 90.65744 91.00346 90.65744 90.83045
## [38,] 90.65744 90.83045 90.48443 91.34948
## [39,] 91.00346 91.17647 91.17647 91.00346
oob.values
## [,1] [,2] [,3] [,4]
## [1,] 0.16778025 0.1714922 0.17371938 0.17371938
## [2,] 0.11581292 0.1106162 0.11284336 0.11952487
## [3,] 0.11284336 0.1039347 0.09576837 0.10319228
## [4,] 0.09502598 0.1002227 0.09799555 0.09799555
## [5,] 0.10393467 0.1054195 0.10022272 0.09502598
## [6,] 0.10913140 0.1002227 0.10690423 0.10170750
## [7,] 0.10319228 0.1091314 0.10616184 0.11061618
## [8,] 0.10838901 0.1098738 0.10764662 0.10541945
## [9,] 0.11135857 0.1083890 0.10838901 0.11135857
## [10,] 0.11061618 0.1054195 0.11358575 0.11284336
## [11,] 0.11061618 0.1143281 0.11135857 0.10913140
## [12,] 0.11655531 0.1143281 0.10838901 0.11358575
## [13,] 0.10764662 0.1121010 0.10764662 0.11135857
## [14,] 0.11358575 0.1158129 0.10616184 0.10393467
## [15,] 0.10913140 0.1039347 0.10764662 0.11210097
## [16,] 0.11729770 0.1128434 0.10616184 0.10838901
## [17,] 0.11581292 0.1121010 0.10913140 0.11061618
## [18,] 0.11135857 0.1143281 0.10987379 0.10244989
## [19,] 0.10690423 0.1113586 0.11210097 0.10467706
## [20,] 0.11655531 0.1083890 0.10913140 0.10838901
## [21,] 0.11804009 0.1121010 0.11210097 0.10616184
## [22,] 0.10690423 0.1083890 0.10764662 0.10690423
## [23,] 0.11507053 0.1143281 0.10690423 0.10838901
## [24,] 0.11952487 0.1143281 0.10913140 0.11507053
## [25,] 0.11507053 0.1158129 0.11135857 0.11135857
## [26,] 0.11135857 0.1113586 0.11061618 0.10690423
## [27,] 0.11061618 0.1113586 0.11210097 0.11210097
## [28,] 0.11878248 0.1128434 0.11135857 0.10987379
## [29,] 0.11655531 0.1165553 0.11284336 0.11135857
## [30,] 0.11878248 0.1143281 0.11358575 0.11210097
## [31,] 0.11284336 0.1121010 0.11358575 0.11210097
## [32,] 0.11655531 0.1121010 0.11878248 0.11284336
## [33,] 0.11655531 0.1172977 0.11210097 0.11507053
## [34,] 0.11878248 0.1158129 0.11358575 0.11507053
## [35,] 0.12100965 0.1150705 0.11061618 0.11655531
## [36,] 0.11952487 0.1128434 0.11210097 0.11135857
## [37,] 0.11655531 0.1113586 0.11135857 0.11432814
## [38,] 0.11581292 0.1150705 0.11581292 0.10913140
## [39,] 0.11284336 0.1143281 0.10987379 0.11358575
#ntrees <- as.integer(which.max(accuracies)/(ncol(train_new)-1))
#accuracies[which.max(accuracies)]
which.max(accuracies)
## [1] 19
ind_min = which.min(oob.values)
ind_min
## [1] 4
min_oob = oob.values[which.min(oob.values)]
min_oob
## [1] 0.09502598
rnum = ind_min%%(ncol(train_new)-1)
if(rnum == 0){
rnum = nrow(train_new)
}
rnum
## [1] 4
randForest_new <- randomForest(response~ ., data=train_new,
importance=TRUE, mtry=rnum, ntree=1000)
randForest_new$err.rate[nrow(randForest_new$err.rate),1]
## OOB
## 0.1017075
result_randomForest_noEFS <- getAccuracy(randForest_new, test_new)
result_randomForest_noEFS
## $accuracy
## [1] 90.83045
##
## $sensitivity
## [1] 91.1032
##
## $specificity
## [1] 90.57239
##
## $auc
## Area under the curve: 0.9084
##
## $roc
##
## Call:
## roc.default(response = test_new$response, predictor = as.numeric(y_pred))
##
## Data: as.numeric(y_pred) in 281 controls (test_new$response 0) < 297 cases (test_new$response 1).
## Area under the curve: 0.9084
model_results <- cbind(unlist(result_randomForest_noEFS), unlist(results_knn_noEFS),
unlist(results_svm_polynoEFS), unlist(results_svm_polynoEFS2),
unlist(results_svm_LW_new_noEFS), unlist(results_svm_radial_noEFS),
unlist(results_lda_noEFS), unlist(results_xg_noEFS),
unlist(results_xg_tune_noEFS), unlist(results_top_features_xg))[seq(1,4),]
colnames(model_results) <- c("Random Forest", "KNN", "SVM Poly", "SVM Poly2",
"SVM Linear L2", "SVM Radial", "LDA", "XGBoost",
"XGB Tuned", "Top Features Only XGB")
model_results <- model_results[,order(unlist(model_results[1,]))]
model_results
## SVM Poly LDA KNN SVM Poly2 SVM Linear L2 SVM Radial
## accuracy 72.14533 76.12457 76.47059 77.16263 77.33564 83.73702
## sensitivity 99.28826 85.40925 86.121 88.96797 86.47687 88.6121
## specificity 46.46465 67.34007 67.34007 65.99327 68.68687 79.12458
## auc 0.7287645 0.7637466 0.7673053 0.7748062 0.7758187 0.8386834
## XGB Tuned XGBoost Random Forest Top Features Only XGB
## accuracy 89.79239 90.31142 90.83045 92.90657
## sensitivity 91.81495 92.17082 91.1032 94.30605
## specificity 87.87879 88.55219 90.57239 91.58249
## auc 0.8984687 0.903615 0.908378 0.9294427
#### PLOTTING THE ACCURACIES
accs <- unlist(model_results[1,])
ggplot(mapping=aes(x=reorder(colnames(model_results),accs), y=as.numeric(accs))) +
geom_point(col="blue", stroke=1, fill="white", size=2, alpha=0.9) +
xlab("Model Name") +
ylab("Accuracy") +
ggtitle("Graph Representing the Obtained Accuracy of every Model") +
geom_label_repel(aes(label = format(round(as.numeric(accs), 2), nsmall = 2)),
fontface = 'bold',
box.padding = unit(0.35, "lines"),
point.padding = unit(0.3, "lines"),
segment.color = 'black') +
expand_limits(y=55) +
scale_x_discrete (expand = expansion(add=0.2))+
theme(axis.text.x = element_text(size = 10, angle = 45, hjust = 1))
#### PLOTTING THE AUCs
aucs <- unlist(model_results[4,])*100
ggplot(mapping=aes(x=reorder(colnames(model_results),aucs), y=as.numeric(aucs))) +
geom_point(col="blue", stroke=1, fill="white", size=2, alpha=0.9) +
xlab("Model Name") +
ylab("Area Under the Curve AUC") +
ggtitle("Graph Representing the Obtained AUC of every Model") +
geom_label_repel(aes(label = format(round(as.numeric(aucs), 2), nsmall = 2)),
fontface = 'bold',
box.padding = unit(0.35, "lines"),
point.padding = unit(0.3, "lines"),
segment.color = 'black') +
scale_x_discrete (expand = expansion(add=0.2))+
scale_y_continuous(limits = c(55, max(aucs)+5)) +
theme(axis.text.x = element_text(size = 10, angle = 45, hjust = 1))
rocs <- list(lda=results_lda_noEFS$roc, knn=results_knn_noEFS$roc,
svm_poly=results_svm_polynoEFS$roc,
svm_poly2=results_svm_polynoEFS2$roc,
svm_rdial = results_svm_radial_noEFS$roc,
svm_linear = results_svm_LW_new_noEFS$roc,
rf = result_randomForest_noEFS$roc, xgb=results_xg_noEFS$roc,
xgbTuned = results_top_features_xg$roc,
xgbFeatureSelection = results_top_features_xg$roc)
ggroc(rocs)