Introduction

For the last two years, the world has witnessed for the first time in a century an unprecedentedly catastrophic COVID-19 pandemic that indescribably crippled the world in virtually of its vital sectors. Undeniably, the most severely hit sector was the medical field as the pandemic has put humongous strains on hospitals, which suffered an insurmountable sudden influx in patients. Many of these patients were admitted in critical condition and needed to urgently access an Intensive Care Unit (ICU). Nonetheless, a massive percentage of patients prematurely accessed ICUs or needed one after a few hours of being hospitalized, which led to many hospitals suffering from an inefficient and disastrous ICU admission planning and an incapability to accurately allocate ICUs to patients in need. The aim of the project was to implement robust machine learning algorithms to design an efficient prediction model that would accurately predict a patients' need for an ICU within the very first hours after they have been admitted to a hospital. The data studied in the paper was obtained from Kaggle and consists of a window-based dataset organized by the Sirio-Libanes hospital team in Sao Paolo, Brazil. Two separate preprocessing methods were implemented on the data, one which consisted of imputation, propagation, cleaning as well as feature selection while the other technique consisted of omitting the missing data, followed by cleaning and feature selection. The cleaning process was achieved by removing zero-variance, redundant, and/or highly correlated features using the Pearson correlation coefficient. After obtaining the final datasets, we implemented a series of specialized binary classification models on various stages starting from the features in the first Window 1, which represents the first two hours of admission, and then incrementally adding the remaining features in Window 2, Window 3 etc. The latter is done since ideally the aim is to be able to predict ICU admission as early as possible. The model that consistently yielded the highest accuracy (65% for W1, 81% for W2, 88% for W3, 91% for W4, 93% for W5) as well as the highest AUC (0.93 for W4) was the Extreme Gradient Boosting (XGBoost) implemented on a limited subset of the most significant features such as age, respiratory rate, lactate levels, & blood pressure.

Loading Libraries

### Import Excel Data
library(readxl)

### Data Manipulation
library(dplyr)

### Deep Learning
library(keras)
library(neuralnet)

### General Utility
library(tidyverse)

### Feature Selection
library(leaps)

### SVM Classification
library(e1071)

### XGBoost Classification
library(xgboost)

### Cross Validation & Modeling 
library(caret)

### KNN Classification
library(class)

### Random Forest Classificaiton
library(randomForest)

### Convert Dataframe to Matrix
library(Matrix)

### ROC Curve Plots
library(pROC)

### Visualization
library(ggplot2)
library(ggrepel)

Importing Dataset

The DataSet was observed and manipulated with Excel according to few statements made by the hospital, and logical reasoning for feature selection.

As indicated by the Hospital, blood measurements are not taken as often as Vital Signs measurements. We observe that the MIN,MAX,MEAN,MEDIAN are all equal within the first 36 Features that represent the blood measurements.

As for the Vital Signs Measurements, we keep their MIN, MAX, MEDIAN values as they are not equal in all cases. This will be dealt with later on in the feature selection phase.

data = read_excel("Kaggle_Sirio_Libanes_ICU_Prediction.xlsx")

Data Wrangling - Cleaning

Dealing with NA Values

As indicated by the Hospital, if a patient is stable, no new measurements are taken in the next window. To deal with NA Values, we do NA Imputation.

The technique we use is as follows:
1. If window1 [0-2h] has anl NA vaue, take the value from the next non-NA window.
2. If windowX (X>1) [2h+] has an NA value, take the value from the previous non-NA window.

propagate_NA <- function(data){
  # Indices of Window 0-2
  indices_02<- which(data[,c("WINDOW")]=="0-2")
  for(i in indices_02){
    for(j in 1:ncol(data)){
      if(is.na(data[i,j])){
        for(k in seq(i+1:i+4)){
          if(!is.na(data[k,j])){
            data[i,j] = data[k,j]
            break
          }
        }
      }
    }
  }

  ## Propagate for all other windows
  for(i in 1:nrow(data)){
    for(j in 1:ncol(data)){
      if(is.na(data[i,j])){
        data[i,j] = data[i-1,j]
      }
    }
  }
  return(data)
}

data <- propagate_NA(data)
nrow(data)
## [1] 1925

Build Response

Explain Reasoning

build_targets <- function(data){
  ## Extract the Response
  ## If at any time in the future the patient gets admitted to the ICU
  ## Then, reponse = 1, otherwise reponse = 0
  result = c()
  for(i in seq(1:385)){
    result[i] = max(data[which(data[c("PATIENT_VISIT_IDENTIFIER")]==i-1),c("ICU")])
  }
  ## ICU Admitted
  length(which(result==1))
  ## No admission to ICU
  length(which(result==0))
  ## The Dataset is Balanced
  
  results <- c()
  for(i in result){
    vec <- rep(i, times = 5)
    results <- c(results, vec)
  }
  
  data <- mutate(data, response = results)
  return(data)
}

data <- build_targets(data)

Remove NA Values

data <- data[complete.cases(data),]
nrow(data)
## [1] 1925

Feature Analysis

Check the Variance of the Variables

vars <- apply(data, 2, function(x){
  variance_col <- var(x)
})
head(vars[order(vars)])
##           TGP PC02_ARTERIAL           TGO    BILLIRUBIN           INR 
##   0.002328126   0.002356368   0.002554769   0.003040343   0.004298024 
##  BIC_ARTERIAL 
##   0.004470547
### Set Threshold of Variance to 0.0002 [Adjustable]
### datavars <- data[,-which(vars < 0.0002)]
### vars <- apply(datavars, 2, function(x){
###  variance_col <- var(x)
### })
### head(vars[order(vars)])

### datavars <- data

Convert Categorical Variables to Factor

make_factors <- function(data){
  data$AGE_ABOVE65 <- as.factor(data$AGE_ABOVE65)
  data$GENDER <- as.factor(data$GENDER)
  data$IMMUNOCOMPROMISED <- as.factor(data$IMMUNOCOMPROMISED)
  data$HTN <- as.factor(data$HTN)
  data$response <- as.factor(data$response)
  data$`DISEASE GROUPING 1` <- as.factor(data$`DISEASE GROUPING 1`)
  data$`DISEASE GROUPING 2` <- as.factor(data$`DISEASE GROUPING 2`)
  data$`DISEASE GROUPING 3` <- as.factor(data$`DISEASE GROUPING 3`)
  data$`DISEASE GROUPING 4` <- as.factor(data$`DISEASE GROUPING 4`)
  data$`DISEASE GROUPING 5` <- as.factor(data$`DISEASE GROUPING 5`)
  data$`DISEASE GROUPING 6` <- as.factor(data$`DISEASE GROUPING 6`)
  return(data)
}
data <- make_factors(data)
summary(data)
##  PATIENT_VISIT_IDENTIFIER AGE_ABOVE65 GENDER   DISEASE GROUPING 1
##  Min.   :  0              0:1025      0:1215   0:1717            
##  1st Qu.: 96              1: 900      1: 710   1: 208            
##  Median :192                                                     
##  Mean   :192                                                     
##  3rd Qu.:288                                                     
##  Max.   :384                                                     
##  DISEASE GROUPING 2 DISEASE GROUPING 3 DISEASE GROUPING 4 DISEASE GROUPING 5
##  0:1871             0:1737             0:1887             0:1674            
##  1:  54             1: 188             1:  38             1: 251            
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  DISEASE GROUPING 6 HTN      IMMUNOCOMPROMISED    ALBUMIN       
##  0:1830             0:1516   0:1621            Min.   :-1.0000  
##  1:  95             1: 409   1: 304            1st Qu.: 0.6053  
##                                                Median : 0.6053  
##                                                Mean   : 0.5630  
##                                                3rd Qu.: 0.6053  
##                                                Max.   : 1.0000  
##   BE_ARTERIAL        BE_VENOUS        BIC_ARTERIAL       BIC_VENOUS     
##  Min.   :-1.0000   Min.   :-1.0000   Min.   :-1.0000   Min.   :-1.0000  
##  1st Qu.:-1.0000   1st Qu.:-1.0000   1st Qu.:-0.3171   1st Qu.:-0.3171  
##  Median :-1.0000   Median :-1.0000   Median :-0.3171   Median :-0.3171  
##  Mean   :-0.9839   Mean   :-0.9542   Mean   :-0.3151   Mean   :-0.3191  
##  3rd Qu.:-1.0000   3rd Qu.:-1.0000   3rd Qu.:-0.3171   3rd Qu.:-0.3171  
##  Max.   : 1.0000   Max.   : 1.0000   Max.   : 1.0000   Max.   : 1.0000  
##    BILLIRUBIN          BLAST            CALCIUM          CREATININ      
##  Min.   :-1.0000   Min.   :-1.0000   Min.   :-1.0000   Min.   :-1.0000  
##  1st Qu.:-0.9613   1st Qu.:-1.0000   1st Qu.: 0.2041   1st Qu.:-0.9250  
##  Median :-0.9389   Median :-1.0000   Median : 0.3571   Median :-0.9038  
##  Mean   :-0.9451   Mean   :-0.9951   Mean   : 0.3034   Mean   :-0.8842  
##  3rd Qu.:-0.9389   3rd Qu.:-1.0000   3rd Qu.: 0.3571   3rd Qu.:-0.8684  
##  Max.   : 1.0000   Max.   : 1.0000   Max.   : 1.0000   Max.   : 1.0000  
##       FFA               GGT             GLUCOSE         HEMATOCRITE      
##  Min.   :-1.0000   Min.   :-1.0000   Min.   :-1.0000   Min.   :-1.00000  
##  1st Qu.:-0.7420   1st Qu.:-0.9585   1st Qu.:-0.8920   1st Qu.:-0.22432  
##  Median :-0.7420   Median :-0.9585   Median :-0.8920   Median :-0.06499  
##  Mean   :-0.7273   Mean   :-0.9311   Mean   :-0.8614   Mean   :-0.09683  
##  3rd Qu.:-0.7420   3rd Qu.:-0.9451   3rd Qu.:-0.8920   3rd Qu.: 0.09015  
##  Max.   : 1.0000   Max.   : 1.0000   Max.   : 1.0000   Max.   : 1.00000  
##    HEMOGLOBIN            INR             LACTATE          LEUKOCYTES     
##  Min.   :-1.00000   Min.   :-1.0000   Min.   :-1.0000   Min.   :-1.0000  
##  1st Qu.:-0.26829   1st Qu.:-0.9598   1st Qu.:-0.8053   1st Qu.:-0.8358  
##  Median :-0.09756   Median :-0.9598   Median : 1.0000   Median :-0.8038  
##  Mean   :-0.12564   Mean   :-0.9381   Mean   : 0.4652   Mean   :-0.7647  
##  3rd Qu.: 0.08537   3rd Qu.:-0.9322   3rd Qu.: 1.0000   3rd Qu.:-0.7350  
##  Max.   : 1.00000   Max.   : 1.0000   Max.   : 1.0000   Max.   : 1.0000  
##    LINFOCITOS       NEUTROPHILES      P02_ARTERIAL       P02_VENOUS     
##  Min.   :-1.0000   Min.   :-1.0000   Min.   :-1.0000   Min.   :-1.0000  
##  1st Qu.:-0.8859   1st Qu.:-0.8904   1st Qu.:-0.1707   1st Qu.:-0.7041  
##  Median :-0.8029   Median :-0.8687   Median :-0.1707   Median :-0.7041  
##  Mean   :-0.7646   Mean   :-0.8243   Mean   :-0.1728   Mean   :-0.6879  
##  3rd Qu.:-0.6846   3rd Qu.:-0.8047   3rd Qu.:-0.1707   3rd Qu.:-0.7041  
##  Max.   : 1.0000   Max.   : 1.0000   Max.   : 1.0000   Max.   : 1.0000  
##  PC02_ARTERIAL      PC02_VENOUS           PCR           PH_ARTERIAL     
##  Min.   :-1.0000   Min.   :-1.0000   Min.   :-1.0000   Min.   :-1.0000  
##  1st Qu.:-0.7793   1st Qu.:-0.7546   1st Qu.:-0.9773   1st Qu.: 0.2340  
##  Median :-0.7793   Median :-0.7546   Median :-0.8752   Median : 0.2340  
##  Mean   :-0.7787   Mean   :-0.7607   Mean   :-0.8293   Mean   : 0.2346  
##  3rd Qu.:-0.7793   3rd Qu.:-0.7546   3rd Qu.:-0.8106   3rd Qu.: 0.2340  
##  Max.   : 1.0000   Max.   : 1.0000   Max.   : 1.0000   Max.   : 1.0000  
##    PH_VENOUS         PLATELETS         POTASSIUM       SAT02_ARTERIAL   
##  Min.   :-1.0000   Min.   :-1.0000   Min.   :-1.0000   Min.   :-1.0000  
##  1st Qu.: 0.3636   1st Qu.:-0.6075   1st Qu.:-0.6296   1st Qu.: 0.9394  
##  Median : 0.3636   Median :-0.5407   Median :-0.5185   Median : 0.9394  
##  Mean   : 0.3729   Mean   :-0.4805   Mean   :-0.5408   Mean   : 0.9264  
##  3rd Qu.: 0.3636   3rd Qu.:-0.3858   3rd Qu.:-0.4815   3rd Qu.: 0.9394  
##  Max.   : 1.0000   Max.   : 1.0000   Max.   : 1.0000   Max.   : 1.0000  
##   SAT02_VENOUS         SODIUM              TGO               TGP         
##  Min.   :-1.0000   Min.   :-1.00000   Min.   :-1.0000   Min.   :-1.0000  
##  1st Qu.: 0.3457   1st Qu.:-0.14286   1st Qu.:-0.9972   1st Qu.:-0.9931  
##  Median : 0.3457   Median :-0.02857   Median :-0.9957   Median :-0.9901  
##  Mean   : 0.3163   Mean   :-0.07198   Mean   :-0.9930   Mean   :-0.9855  
##  3rd Qu.: 0.3457   3rd Qu.: 0.02857   3rd Qu.:-0.9954   3rd Qu.:-0.9867  
##  Max.   : 1.0000   Max.   : 1.00000   Max.   : 1.0000   Max.   : 1.0000  
##       TTPA              UREA             DIMER        
##  Min.   :-1.0000   Min.   :-1.0000   Min.   :-1.0000  
##  1st Qu.:-0.8466   1st Qu.:-0.8892   1st Qu.:-0.9938  
##  Median :-0.8466   Median :-0.8554   Median :-0.9780  
##  Mean   :-0.8290   Mean   :-0.8365   Mean   :-0.9608  
##  3rd Qu.:-0.8256   3rd Qu.:-0.8265   3rd Qu.:-0.9743  
##  Max.   : 1.0000   Max.   : 1.0000   Max.   : 1.0000  
##  BLOODPRESSURE_DIASTOLIC_MEDIAN BLOODPRESSURE_SISTOLIC_MEDIAN HEART_RATE_MEDIAN
##  Min.   :-1.00000               Min.   :-1.0000               Min.   :-1.0000  
##  1st Qu.:-0.18519               1st Qu.:-0.4769               1st Qu.:-0.3774  
##  Median : 0.03704               Median :-0.2615               Median :-0.2830  
##  Mean   :-0.04978               Mean   :-0.3149               Mean   :-0.2689  
##  3rd Qu.: 0.08642               3rd Qu.:-0.2308               3rd Qu.:-0.1887  
##  Max.   : 1.00000               Max.   : 1.0000               Max.   : 1.0000  
##  RESPIRATORY_RATE_MEDIAN TEMPERATURE_MEDIAN OXYGEN_SATURATION_MEDIAN
##  Min.   :-1.0000         Min.   :-1.00000   Min.   :-1.0000         
##  1st Qu.:-0.5862         1st Qu.:-0.28571   1st Qu.: 0.7368         
##  Median :-0.5172         Median :-0.07143   Median : 0.7368         
##  Mean   :-0.4792         Mean   :-0.02404   Mean   : 0.7462         
##  3rd Qu.:-0.4483         3rd Qu.: 0.14286   3rd Qu.: 0.7895         
##  Max.   : 1.0000         Max.   : 1.00000   Max.   : 1.0000         
##  BLOODPRESSURE_DIASTOLIC_MIN BLOODPRESSURE_SISTOLIC_MIN HEART_RATE_MIN   
##  Min.   :-1.00000            Min.   :-1.000             Min.   :-1.0000  
##  1st Qu.:-0.17526            1st Qu.:-0.325             1st Qu.:-0.3675  
##  Median : 0.07216            Median :-0.125             Median :-0.1624  
##  Mean   : 0.04133            Mean   :-0.147             Mean   :-0.2263  
##  3rd Qu.: 0.23711            3rd Qu.: 0.000             3rd Qu.:-0.1624  
##  Max.   : 1.00000            Max.   : 1.000             Max.   : 1.0000  
##  RESPIRATORY_RATE_MIN TEMPERATURE_MIN   OXYGEN_SATURATION_MIN
##  Min.   :-1.0000      Min.   :-1.0000   Min.   :-1.0000      
##  1st Qu.:-0.5714      1st Qu.: 0.2088   1st Qu.: 0.8586      
##  Median :-0.5000      Median : 0.2527   Median : 0.8990      
##  Mean   :-0.4758      Mean   : 0.3069   Mean   : 0.8465      
##  3rd Qu.:-0.4286      3rd Qu.: 0.4286   3rd Qu.: 0.9192      
##  Max.   : 1.0000      Max.   : 1.0000   Max.   : 1.0000      
##  BLOODPRESSURE_DIASTOLIC_MAX BLOODPRESSURE_SISTOLIC_MAX HEART_RATE_MAX   
##  Min.   :-1.0000             Min.   :-1.0000            Min.   :-1.0000  
##  1st Qu.:-0.3846             1st Qu.:-0.5676            1st Qu.:-0.4478  
##  Median :-0.2479             Median :-0.4595            Median :-0.4328  
##  Mean   :-0.2496             Mean   :-0.4298            Mean   :-0.3305  
##  3rd Qu.:-0.1795             3rd Qu.:-0.3514            3rd Qu.:-0.2239  
##  Max.   : 1.0000             Max.   : 1.0000            Max.   : 1.0000  
##  RESPIRATORY_RATE_MAX TEMPERATURE_MAX   OXYGEN_SATURATION_MAX
##  Min.   :-1.0000      Min.   :-1.0000   Min.   :-1.0000      
##  1st Qu.:-0.6364      1st Qu.:-0.4203   1st Qu.: 0.7368      
##  Median :-0.5758      Median :-0.1594   Median : 0.7895      
##  Mean   :-0.4268      Mean   :-0.1093   Mean   : 0.7925      
##  3rd Qu.:-0.3939      3rd Qu.: 0.1014   3rd Qu.: 0.8947      
##  Max.   : 1.0000      Max.   : 1.0000   Max.   : 1.0000      
##  BLOODPRESSURE_DIASTOLIC_DIFF BLOODPRESSURE_SISTOLIC_DIFF HEART_RATE_DIFF  
##  Min.   :-1.0000              Min.   :-1.0000             Min.   :-1.0000  
##  1st Qu.:-1.0000              1st Qu.:-1.0000             1st Qu.:-1.0000  
##  Median :-1.0000              Median :-1.0000             Median :-1.0000  
##  Mean   :-0.8366              Mean   :-0.8221             Mean   :-0.8382  
##  3rd Qu.:-0.8261              3rd Qu.:-0.8037             3rd Qu.:-0.8473  
##  Max.   : 1.0000              Max.   : 1.0000             Max.   : 1.0000  
##  RESPIRATORY_RATE_DIFF TEMPERATURE_DIFF  OXYGEN_SATURATION_DIFF
##  Min.   :-1.0000       Min.   :-1.0000   Min.   :-1.0000       
##  1st Qu.:-1.0000       1st Qu.:-1.0000   1st Qu.:-1.0000       
##  Median :-1.0000       Median :-1.0000   Median :-1.0000       
##  Mean   :-0.8166       Mean   :-0.8508   Mean   :-0.9261       
##  3rd Qu.:-0.8824       3rd Qu.:-0.8333   3rd Qu.:-0.9394       
##  Max.   : 1.0000       Max.   : 1.0000   Max.   : 1.0000       
##  BLOODPRESSURE_DIASTOLIC_DIFF_REL BLOODPRESSURE_SISTOLIC_DIFF_REL
##  Min.   :-1.0000                  Min.   :-1.0000                
##  1st Qu.:-1.0000                  1st Qu.:-1.0000                
##  Median :-1.0000                  Median :-1.0000                
##  Mean   :-0.8593                  Mean   :-0.8141                
##  3rd Qu.:-0.8509                  3rd Qu.:-0.8037                
##  Max.   : 1.0000                  Max.   : 1.0000                
##  HEART_RATE_DIFF_REL RESPIRATORY_RATE_DIFF_REL TEMPERATURE_DIFF_REL
##  Min.   :-1.0000     Min.   :-1.0000           Min.   :-1.0000     
##  1st Qu.:-1.0000     1st Qu.:-1.0000           1st Qu.:-1.0000     
##  Median :-1.0000     Median :-1.0000           Median :-1.0000     
##  Mean   :-0.8803     Mean   :-0.8261           Mean   :-0.8514     
##  3rd Qu.:-0.8847     3rd Qu.:-0.8629           3rd Qu.:-0.8347     
##  Max.   : 1.0000     Max.   : 1.0000           Max.   : 1.0000     
##  OXYGEN_SATURATION_DIFF_REL    WINDOW               ICU         response
##  Min.   :-1.0000            Length:1925        Min.   :0.0000   0:950   
##  1st Qu.:-1.0000            Class :character   1st Qu.:0.0000   1:975   
##  Median :-1.0000            Mode  :character   Median :0.0000           
##  Mean   :-0.9260                               Mean   :0.2675           
##  3rd Qu.:-0.9401                               3rd Qu.:1.0000           
##  Max.   : 1.0000                               Max.   :1.0000

Check & Remove the Highly Correlated Variables

num_data <- data %>% dplyr::select(where(is.numeric))
corr <- cor(num_data, method="pearson")
diag(corr) = NA
melted = reshape2::melt(corr, na.rm = TRUE, value.name = "corr")
format = melted[order(abs(melted$corr), decreasing=TRUE),]
format = format[!duplicated(format$corr),]
head(format)
##                                  Var1                         Var2      corr
## 4078             TEMPERATURE_DIFF_REL             TEMPERATURE_DIFF 0.9999123
## 4147       OXYGEN_SATURATION_DIFF_REL       OXYGEN_SATURATION_DIFF 0.9998387
## 3871  BLOODPRESSURE_SISTOLIC_DIFF_REL  BLOODPRESSURE_SISTOLIC_DIFF 0.9923891
## 4009        RESPIRATORY_RATE_DIFF_REL        RESPIRATORY_RATE_DIFF 0.9920986
## 3802 BLOODPRESSURE_DIASTOLIC_DIFF_REL BLOODPRESSURE_DIASTOLIC_DIFF 0.9892807
## 3940              HEART_RATE_DIFF_REL              HEART_RATE_DIFF 0.9852209
ggplot(data = melted, aes(x=Var1, y=Var2, fill=corr)) + 
    geom_tile()+
    theme(axis.text.x = element_text(angle = 80,
                                     size = 5, hjust = 1),
          axis.text.y = element_text(angle = 0,
                                     size = 5, hjust = 1))

#### Set Threshold of Corr to 0.7
highly_corr <- format[abs(format$corr)>0.7,]
highly_corr_features = c(as.character(highly_corr$Var1))
unique(highly_corr_features)
##  [1] "TEMPERATURE_DIFF_REL"             "OXYGEN_SATURATION_DIFF_REL"      
##  [3] "BLOODPRESSURE_SISTOLIC_DIFF_REL"  "RESPIRATORY_RATE_DIFF_REL"       
##  [5] "BLOODPRESSURE_DIASTOLIC_DIFF_REL" "HEART_RATE_DIFF_REL"             
##  [7] "HEMOGLOBIN"                       "OXYGEN_SATURATION_DIFF"          
##  [9] "NEUTROPHILES"                     "TGP"                             
## [11] "BLOODPRESSURE_SISTOLIC_DIFF"      "HEART_RATE_DIFF"                 
## [13] "RESPIRATORY_RATE_DIFF"            "TEMPERATURE_DIFF"                
## [15] "BLOODPRESSURE_DIASTOLIC_MIN"      "OXYGEN_SATURATION_MAX"           
## [17] "HEART_RATE_MIN"                   "BLOODPRESSURE_SISTOLIC_MIN"      
## [19] "TEMPERATURE_MIN"                  "SAT02_VENOUS"                    
## [21] "TEMPERATURE_MAX"                  "BIC_ARTERIAL"                    
## [23] "GGT"                              "BLOODPRESSURE_DIASTOLIC_DIFF"
corr_data <- data[,-which(colnames(data) %in% highly_corr_features)]
paper <- c("LACTATE", "OXYGEN_SATURATION_MIN", "PLATELETS", "BE_VENOUS", "PCR", "UREA")
colnames(corr_data)
##  [1] "PATIENT_VISIT_IDENTIFIER"       "AGE_ABOVE65"                   
##  [3] "GENDER"                         "DISEASE GROUPING 1"            
##  [5] "DISEASE GROUPING 2"             "DISEASE GROUPING 3"            
##  [7] "DISEASE GROUPING 4"             "DISEASE GROUPING 5"            
##  [9] "DISEASE GROUPING 6"             "HTN"                           
## [11] "IMMUNOCOMPROMISED"              "ALBUMIN"                       
## [13] "BE_ARTERIAL"                    "BE_VENOUS"                     
## [15] "BIC_VENOUS"                     "BILLIRUBIN"                    
## [17] "BLAST"                          "CALCIUM"                       
## [19] "CREATININ"                      "FFA"                           
## [21] "GLUCOSE"                        "HEMATOCRITE"                   
## [23] "INR"                            "LACTATE"                       
## [25] "LEUKOCYTES"                     "LINFOCITOS"                    
## [27] "P02_ARTERIAL"                   "P02_VENOUS"                    
## [29] "PC02_ARTERIAL"                  "PC02_VENOUS"                   
## [31] "PCR"                            "PH_ARTERIAL"                   
## [33] "PH_VENOUS"                      "PLATELETS"                     
## [35] "POTASSIUM"                      "SAT02_ARTERIAL"                
## [37] "SODIUM"                         "TGO"                           
## [39] "TTPA"                           "UREA"                          
## [41] "DIMER"                          "BLOODPRESSURE_DIASTOLIC_MEDIAN"
## [43] "BLOODPRESSURE_SISTOLIC_MEDIAN"  "HEART_RATE_MEDIAN"             
## [45] "RESPIRATORY_RATE_MEDIAN"        "TEMPERATURE_MEDIAN"            
## [47] "OXYGEN_SATURATION_MEDIAN"       "RESPIRATORY_RATE_MIN"          
## [49] "OXYGEN_SATURATION_MIN"          "BLOODPRESSURE_DIASTOLIC_MAX"   
## [51] "BLOODPRESSURE_SISTOLIC_MAX"     "HEART_RATE_MAX"                
## [53] "RESPIRATORY_RATE_MAX"           "WINDOW"                        
## [55] "ICU"                            "response"
paper %in% colnames(corr_data)
## [1] TRUE TRUE TRUE TRUE TRUE TRUE

Convert Categorical Variables to Factor

make_factors <- function(data){
  data$AGE_ABOVE65 <- as.factor(data$AGE_ABOVE65)
  data$GENDER <- as.factor(data$GENDER)
  data$IMMUNOCOMPROMISED <- as.factor(data$IMMUNOCOMPROMISED)
  data$HTN <- as.factor(data$HTN)
  data$response <- as.factor(data$response)
  data$`DISEASE GROUPING 1` <- as.factor(data$`DISEASE GROUPING 1`)
  data$`DISEASE GROUPING 2` <- as.factor(data$`DISEASE GROUPING 2`)
  data$`DISEASE GROUPING 3` <- as.factor(data$`DISEASE GROUPING 3`)
  data$`DISEASE GROUPING 4` <- as.factor(data$`DISEASE GROUPING 4`)
  data$`DISEASE GROUPING 5` <- as.factor(data$`DISEASE GROUPING 5`)
  data$`DISEASE GROUPING 6` <- as.factor(data$`DISEASE GROUPING 6`)
  return(data)
}
corr_data <- make_factors(corr_data)
summary(corr_data)
##  PATIENT_VISIT_IDENTIFIER AGE_ABOVE65 GENDER   DISEASE GROUPING 1
##  Min.   :  0              0:1025      0:1215   0:1717            
##  1st Qu.: 96              1: 900      1: 710   1: 208            
##  Median :192                                                     
##  Mean   :192                                                     
##  3rd Qu.:288                                                     
##  Max.   :384                                                     
##  DISEASE GROUPING 2 DISEASE GROUPING 3 DISEASE GROUPING 4 DISEASE GROUPING 5
##  0:1871             0:1737             0:1887             0:1674            
##  1:  54             1: 188             1:  38             1: 251            
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  DISEASE GROUPING 6 HTN      IMMUNOCOMPROMISED    ALBUMIN       
##  0:1830             0:1516   0:1621            Min.   :-1.0000  
##  1:  95             1: 409   1: 304            1st Qu.: 0.6053  
##                                                Median : 0.6053  
##                                                Mean   : 0.5630  
##                                                3rd Qu.: 0.6053  
##                                                Max.   : 1.0000  
##   BE_ARTERIAL        BE_VENOUS         BIC_VENOUS        BILLIRUBIN     
##  Min.   :-1.0000   Min.   :-1.0000   Min.   :-1.0000   Min.   :-1.0000  
##  1st Qu.:-1.0000   1st Qu.:-1.0000   1st Qu.:-0.3171   1st Qu.:-0.9613  
##  Median :-1.0000   Median :-1.0000   Median :-0.3171   Median :-0.9389  
##  Mean   :-0.9839   Mean   :-0.9542   Mean   :-0.3191   Mean   :-0.9451  
##  3rd Qu.:-1.0000   3rd Qu.:-1.0000   3rd Qu.:-0.3171   3rd Qu.:-0.9389  
##  Max.   : 1.0000   Max.   : 1.0000   Max.   : 1.0000   Max.   : 1.0000  
##      BLAST            CALCIUM          CREATININ            FFA         
##  Min.   :-1.0000   Min.   :-1.0000   Min.   :-1.0000   Min.   :-1.0000  
##  1st Qu.:-1.0000   1st Qu.: 0.2041   1st Qu.:-0.9250   1st Qu.:-0.7420  
##  Median :-1.0000   Median : 0.3571   Median :-0.9038   Median :-0.7420  
##  Mean   :-0.9951   Mean   : 0.3034   Mean   :-0.8842   Mean   :-0.7273  
##  3rd Qu.:-1.0000   3rd Qu.: 0.3571   3rd Qu.:-0.8684   3rd Qu.:-0.7420  
##  Max.   : 1.0000   Max.   : 1.0000   Max.   : 1.0000   Max.   : 1.0000  
##     GLUCOSE         HEMATOCRITE            INR             LACTATE       
##  Min.   :-1.0000   Min.   :-1.00000   Min.   :-1.0000   Min.   :-1.0000  
##  1st Qu.:-0.8920   1st Qu.:-0.22432   1st Qu.:-0.9598   1st Qu.:-0.8053  
##  Median :-0.8920   Median :-0.06499   Median :-0.9598   Median : 1.0000  
##  Mean   :-0.8614   Mean   :-0.09683   Mean   :-0.9381   Mean   : 0.4652  
##  3rd Qu.:-0.8920   3rd Qu.: 0.09015   3rd Qu.:-0.9322   3rd Qu.: 1.0000  
##  Max.   : 1.0000   Max.   : 1.00000   Max.   : 1.0000   Max.   : 1.0000  
##    LEUKOCYTES        LINFOCITOS       P02_ARTERIAL       P02_VENOUS     
##  Min.   :-1.0000   Min.   :-1.0000   Min.   :-1.0000   Min.   :-1.0000  
##  1st Qu.:-0.8358   1st Qu.:-0.8859   1st Qu.:-0.1707   1st Qu.:-0.7041  
##  Median :-0.8038   Median :-0.8029   Median :-0.1707   Median :-0.7041  
##  Mean   :-0.7647   Mean   :-0.7646   Mean   :-0.1728   Mean   :-0.6879  
##  3rd Qu.:-0.7350   3rd Qu.:-0.6846   3rd Qu.:-0.1707   3rd Qu.:-0.7041  
##  Max.   : 1.0000   Max.   : 1.0000   Max.   : 1.0000   Max.   : 1.0000  
##  PC02_ARTERIAL      PC02_VENOUS           PCR           PH_ARTERIAL     
##  Min.   :-1.0000   Min.   :-1.0000   Min.   :-1.0000   Min.   :-1.0000  
##  1st Qu.:-0.7793   1st Qu.:-0.7546   1st Qu.:-0.9773   1st Qu.: 0.2340  
##  Median :-0.7793   Median :-0.7546   Median :-0.8752   Median : 0.2340  
##  Mean   :-0.7787   Mean   :-0.7607   Mean   :-0.8293   Mean   : 0.2346  
##  3rd Qu.:-0.7793   3rd Qu.:-0.7546   3rd Qu.:-0.8106   3rd Qu.: 0.2340  
##  Max.   : 1.0000   Max.   : 1.0000   Max.   : 1.0000   Max.   : 1.0000  
##    PH_VENOUS         PLATELETS         POTASSIUM       SAT02_ARTERIAL   
##  Min.   :-1.0000   Min.   :-1.0000   Min.   :-1.0000   Min.   :-1.0000  
##  1st Qu.: 0.3636   1st Qu.:-0.6075   1st Qu.:-0.6296   1st Qu.: 0.9394  
##  Median : 0.3636   Median :-0.5407   Median :-0.5185   Median : 0.9394  
##  Mean   : 0.3729   Mean   :-0.4805   Mean   :-0.5408   Mean   : 0.9264  
##  3rd Qu.: 0.3636   3rd Qu.:-0.3858   3rd Qu.:-0.4815   3rd Qu.: 0.9394  
##  Max.   : 1.0000   Max.   : 1.0000   Max.   : 1.0000   Max.   : 1.0000  
##      SODIUM              TGO               TTPA              UREA        
##  Min.   :-1.00000   Min.   :-1.0000   Min.   :-1.0000   Min.   :-1.0000  
##  1st Qu.:-0.14286   1st Qu.:-0.9972   1st Qu.:-0.8466   1st Qu.:-0.8892  
##  Median :-0.02857   Median :-0.9957   Median :-0.8466   Median :-0.8554  
##  Mean   :-0.07198   Mean   :-0.9930   Mean   :-0.8290   Mean   :-0.8365  
##  3rd Qu.: 0.02857   3rd Qu.:-0.9954   3rd Qu.:-0.8256   3rd Qu.:-0.8265  
##  Max.   : 1.00000   Max.   : 1.0000   Max.   : 1.0000   Max.   : 1.0000  
##      DIMER         BLOODPRESSURE_DIASTOLIC_MEDIAN BLOODPRESSURE_SISTOLIC_MEDIAN
##  Min.   :-1.0000   Min.   :-1.00000               Min.   :-1.0000              
##  1st Qu.:-0.9938   1st Qu.:-0.18519               1st Qu.:-0.4769              
##  Median :-0.9780   Median : 0.03704               Median :-0.2615              
##  Mean   :-0.9608   Mean   :-0.04978               Mean   :-0.3149              
##  3rd Qu.:-0.9743   3rd Qu.: 0.08642               3rd Qu.:-0.2308              
##  Max.   : 1.0000   Max.   : 1.00000               Max.   : 1.0000              
##  HEART_RATE_MEDIAN RESPIRATORY_RATE_MEDIAN TEMPERATURE_MEDIAN
##  Min.   :-1.0000   Min.   :-1.0000         Min.   :-1.00000  
##  1st Qu.:-0.3774   1st Qu.:-0.5862         1st Qu.:-0.28571  
##  Median :-0.2830   Median :-0.5172         Median :-0.07143  
##  Mean   :-0.2689   Mean   :-0.4792         Mean   :-0.02404  
##  3rd Qu.:-0.1887   3rd Qu.:-0.4483         3rd Qu.: 0.14286  
##  Max.   : 1.0000   Max.   : 1.0000         Max.   : 1.00000  
##  OXYGEN_SATURATION_MEDIAN RESPIRATORY_RATE_MIN OXYGEN_SATURATION_MIN
##  Min.   :-1.0000          Min.   :-1.0000      Min.   :-1.0000      
##  1st Qu.: 0.7368          1st Qu.:-0.5714      1st Qu.: 0.8586      
##  Median : 0.7368          Median :-0.5000      Median : 0.8990      
##  Mean   : 0.7462          Mean   :-0.4758      Mean   : 0.8465      
##  3rd Qu.: 0.7895          3rd Qu.:-0.4286      3rd Qu.: 0.9192      
##  Max.   : 1.0000          Max.   : 1.0000      Max.   : 1.0000      
##  BLOODPRESSURE_DIASTOLIC_MAX BLOODPRESSURE_SISTOLIC_MAX HEART_RATE_MAX   
##  Min.   :-1.0000             Min.   :-1.0000            Min.   :-1.0000  
##  1st Qu.:-0.3846             1st Qu.:-0.5676            1st Qu.:-0.4478  
##  Median :-0.2479             Median :-0.4595            Median :-0.4328  
##  Mean   :-0.2496             Mean   :-0.4298            Mean   :-0.3305  
##  3rd Qu.:-0.1795             3rd Qu.:-0.3514            3rd Qu.:-0.2239  
##  Max.   : 1.0000             Max.   : 1.0000            Max.   : 1.0000  
##  RESPIRATORY_RATE_MAX    WINDOW               ICU         response
##  Min.   :-1.0000      Length:1925        Min.   :0.0000   0:950   
##  1st Qu.:-0.6364      Class :character   1st Qu.:0.0000   1:975   
##  Median :-0.5758      Mode  :character   Median :0.0000           
##  Mean   :-0.4268                         Mean   :0.2675           
##  3rd Qu.:-0.3939                         3rd Qu.:1.0000           
##  Max.   : 1.0000                         Max.   :1.0000

Modeling

Extract Windows

getWindows <- function(data, windows){
  new_data <- subset(data, WINDOW %in% windows)
  return(new_data)
}

Modeling on Window X (X=c(“…”))

Get Windows

new_data <- getWindows(corr_data, unique(data$WINDOW))
nrow(new_data)
## [1] 1925
new_data <- new_data[,-(which(colnames(new_data) %in% 
                                  c("PATIENT_VISIT_IDENTIFIER", "OTHER",
                                    "ICU","WINDOW", "AGE_PERCENTIL")))]
summary(new_data)
##  AGE_ABOVE65 GENDER   DISEASE GROUPING 1 DISEASE GROUPING 2 DISEASE GROUPING 3
##  0:1025      0:1215   0:1717             0:1871             0:1737            
##  1: 900      1: 710   1: 208             1:  54             1: 188            
##                                                                               
##                                                                               
##                                                                               
##                                                                               
##  DISEASE GROUPING 4 DISEASE GROUPING 5 DISEASE GROUPING 6 HTN     
##  0:1887             0:1674             0:1830             0:1516  
##  1:  38             1: 251             1:  95             1: 409  
##                                                                   
##                                                                   
##                                                                   
##                                                                   
##  IMMUNOCOMPROMISED    ALBUMIN         BE_ARTERIAL        BE_VENOUS      
##  0:1621            Min.   :-1.0000   Min.   :-1.0000   Min.   :-1.0000  
##  1: 304            1st Qu.: 0.6053   1st Qu.:-1.0000   1st Qu.:-1.0000  
##                    Median : 0.6053   Median :-1.0000   Median :-1.0000  
##                    Mean   : 0.5630   Mean   :-0.9839   Mean   :-0.9542  
##                    3rd Qu.: 0.6053   3rd Qu.:-1.0000   3rd Qu.:-1.0000  
##                    Max.   : 1.0000   Max.   : 1.0000   Max.   : 1.0000  
##    BIC_VENOUS        BILLIRUBIN          BLAST            CALCIUM       
##  Min.   :-1.0000   Min.   :-1.0000   Min.   :-1.0000   Min.   :-1.0000  
##  1st Qu.:-0.3171   1st Qu.:-0.9613   1st Qu.:-1.0000   1st Qu.: 0.2041  
##  Median :-0.3171   Median :-0.9389   Median :-1.0000   Median : 0.3571  
##  Mean   :-0.3191   Mean   :-0.9451   Mean   :-0.9951   Mean   : 0.3034  
##  3rd Qu.:-0.3171   3rd Qu.:-0.9389   3rd Qu.:-1.0000   3rd Qu.: 0.3571  
##  Max.   : 1.0000   Max.   : 1.0000   Max.   : 1.0000   Max.   : 1.0000  
##    CREATININ            FFA             GLUCOSE         HEMATOCRITE      
##  Min.   :-1.0000   Min.   :-1.0000   Min.   :-1.0000   Min.   :-1.00000  
##  1st Qu.:-0.9250   1st Qu.:-0.7420   1st Qu.:-0.8920   1st Qu.:-0.22432  
##  Median :-0.9038   Median :-0.7420   Median :-0.8920   Median :-0.06499  
##  Mean   :-0.8842   Mean   :-0.7273   Mean   :-0.8614   Mean   :-0.09683  
##  3rd Qu.:-0.8684   3rd Qu.:-0.7420   3rd Qu.:-0.8920   3rd Qu.: 0.09015  
##  Max.   : 1.0000   Max.   : 1.0000   Max.   : 1.0000   Max.   : 1.00000  
##       INR             LACTATE          LEUKOCYTES        LINFOCITOS     
##  Min.   :-1.0000   Min.   :-1.0000   Min.   :-1.0000   Min.   :-1.0000  
##  1st Qu.:-0.9598   1st Qu.:-0.8053   1st Qu.:-0.8358   1st Qu.:-0.8859  
##  Median :-0.9598   Median : 1.0000   Median :-0.8038   Median :-0.8029  
##  Mean   :-0.9381   Mean   : 0.4652   Mean   :-0.7647   Mean   :-0.7646  
##  3rd Qu.:-0.9322   3rd Qu.: 1.0000   3rd Qu.:-0.7350   3rd Qu.:-0.6846  
##  Max.   : 1.0000   Max.   : 1.0000   Max.   : 1.0000   Max.   : 1.0000  
##   P02_ARTERIAL       P02_VENOUS      PC02_ARTERIAL      PC02_VENOUS     
##  Min.   :-1.0000   Min.   :-1.0000   Min.   :-1.0000   Min.   :-1.0000  
##  1st Qu.:-0.1707   1st Qu.:-0.7041   1st Qu.:-0.7793   1st Qu.:-0.7546  
##  Median :-0.1707   Median :-0.7041   Median :-0.7793   Median :-0.7546  
##  Mean   :-0.1728   Mean   :-0.6879   Mean   :-0.7787   Mean   :-0.7607  
##  3rd Qu.:-0.1707   3rd Qu.:-0.7041   3rd Qu.:-0.7793   3rd Qu.:-0.7546  
##  Max.   : 1.0000   Max.   : 1.0000   Max.   : 1.0000   Max.   : 1.0000  
##       PCR           PH_ARTERIAL        PH_VENOUS         PLATELETS      
##  Min.   :-1.0000   Min.   :-1.0000   Min.   :-1.0000   Min.   :-1.0000  
##  1st Qu.:-0.9773   1st Qu.: 0.2340   1st Qu.: 0.3636   1st Qu.:-0.6075  
##  Median :-0.8752   Median : 0.2340   Median : 0.3636   Median :-0.5407  
##  Mean   :-0.8293   Mean   : 0.2346   Mean   : 0.3729   Mean   :-0.4805  
##  3rd Qu.:-0.8106   3rd Qu.: 0.2340   3rd Qu.: 0.3636   3rd Qu.:-0.3858  
##  Max.   : 1.0000   Max.   : 1.0000   Max.   : 1.0000   Max.   : 1.0000  
##    POTASSIUM       SAT02_ARTERIAL        SODIUM              TGO         
##  Min.   :-1.0000   Min.   :-1.0000   Min.   :-1.00000   Min.   :-1.0000  
##  1st Qu.:-0.6296   1st Qu.: 0.9394   1st Qu.:-0.14286   1st Qu.:-0.9972  
##  Median :-0.5185   Median : 0.9394   Median :-0.02857   Median :-0.9957  
##  Mean   :-0.5408   Mean   : 0.9264   Mean   :-0.07198   Mean   :-0.9930  
##  3rd Qu.:-0.4815   3rd Qu.: 0.9394   3rd Qu.: 0.02857   3rd Qu.:-0.9954  
##  Max.   : 1.0000   Max.   : 1.0000   Max.   : 1.00000   Max.   : 1.0000  
##       TTPA              UREA             DIMER        
##  Min.   :-1.0000   Min.   :-1.0000   Min.   :-1.0000  
##  1st Qu.:-0.8466   1st Qu.:-0.8892   1st Qu.:-0.9938  
##  Median :-0.8466   Median :-0.8554   Median :-0.9780  
##  Mean   :-0.8290   Mean   :-0.8365   Mean   :-0.9608  
##  3rd Qu.:-0.8256   3rd Qu.:-0.8265   3rd Qu.:-0.9743  
##  Max.   : 1.0000   Max.   : 1.0000   Max.   : 1.0000  
##  BLOODPRESSURE_DIASTOLIC_MEDIAN BLOODPRESSURE_SISTOLIC_MEDIAN HEART_RATE_MEDIAN
##  Min.   :-1.00000               Min.   :-1.0000               Min.   :-1.0000  
##  1st Qu.:-0.18519               1st Qu.:-0.4769               1st Qu.:-0.3774  
##  Median : 0.03704               Median :-0.2615               Median :-0.2830  
##  Mean   :-0.04978               Mean   :-0.3149               Mean   :-0.2689  
##  3rd Qu.: 0.08642               3rd Qu.:-0.2308               3rd Qu.:-0.1887  
##  Max.   : 1.00000               Max.   : 1.0000               Max.   : 1.0000  
##  RESPIRATORY_RATE_MEDIAN TEMPERATURE_MEDIAN OXYGEN_SATURATION_MEDIAN
##  Min.   :-1.0000         Min.   :-1.00000   Min.   :-1.0000         
##  1st Qu.:-0.5862         1st Qu.:-0.28571   1st Qu.: 0.7368         
##  Median :-0.5172         Median :-0.07143   Median : 0.7368         
##  Mean   :-0.4792         Mean   :-0.02404   Mean   : 0.7462         
##  3rd Qu.:-0.4483         3rd Qu.: 0.14286   3rd Qu.: 0.7895         
##  Max.   : 1.0000         Max.   : 1.00000   Max.   : 1.0000         
##  RESPIRATORY_RATE_MIN OXYGEN_SATURATION_MIN BLOODPRESSURE_DIASTOLIC_MAX
##  Min.   :-1.0000      Min.   :-1.0000       Min.   :-1.0000            
##  1st Qu.:-0.5714      1st Qu.: 0.8586       1st Qu.:-0.3846            
##  Median :-0.5000      Median : 0.8990       Median :-0.2479            
##  Mean   :-0.4758      Mean   : 0.8465       Mean   :-0.2496            
##  3rd Qu.:-0.4286      3rd Qu.: 0.9192       3rd Qu.:-0.1795            
##  Max.   : 1.0000      Max.   : 1.0000       Max.   : 1.0000            
##  BLOODPRESSURE_SISTOLIC_MAX HEART_RATE_MAX    RESPIRATORY_RATE_MAX response
##  Min.   :-1.0000            Min.   :-1.0000   Min.   :-1.0000      0:950   
##  1st Qu.:-0.5676            1st Qu.:-0.4478   1st Qu.:-0.6364      1:975   
##  Median :-0.4595            Median :-0.4328   Median :-0.5758              
##  Mean   :-0.4298            Mean   :-0.3305   Mean   :-0.4268              
##  3rd Qu.:-0.3514            3rd Qu.:-0.2239   3rd Qu.:-0.3939              
##  Max.   : 1.0000            Max.   : 1.0000   Max.   : 1.0000
colnames(new_data)
##  [1] "AGE_ABOVE65"                    "GENDER"                        
##  [3] "DISEASE GROUPING 1"             "DISEASE GROUPING 2"            
##  [5] "DISEASE GROUPING 3"             "DISEASE GROUPING 4"            
##  [7] "DISEASE GROUPING 5"             "DISEASE GROUPING 6"            
##  [9] "HTN"                            "IMMUNOCOMPROMISED"             
## [11] "ALBUMIN"                        "BE_ARTERIAL"                   
## [13] "BE_VENOUS"                      "BIC_VENOUS"                    
## [15] "BILLIRUBIN"                     "BLAST"                         
## [17] "CALCIUM"                        "CREATININ"                     
## [19] "FFA"                            "GLUCOSE"                       
## [21] "HEMATOCRITE"                    "INR"                           
## [23] "LACTATE"                        "LEUKOCYTES"                    
## [25] "LINFOCITOS"                     "P02_ARTERIAL"                  
## [27] "P02_VENOUS"                     "PC02_ARTERIAL"                 
## [29] "PC02_VENOUS"                    "PCR"                           
## [31] "PH_ARTERIAL"                    "PH_VENOUS"                     
## [33] "PLATELETS"                      "POTASSIUM"                     
## [35] "SAT02_ARTERIAL"                 "SODIUM"                        
## [37] "TGO"                            "TTPA"                          
## [39] "UREA"                           "DIMER"                         
## [41] "BLOODPRESSURE_DIASTOLIC_MEDIAN" "BLOODPRESSURE_SISTOLIC_MEDIAN" 
## [43] "HEART_RATE_MEDIAN"              "RESPIRATORY_RATE_MEDIAN"       
## [45] "TEMPERATURE_MEDIAN"             "OXYGEN_SATURATION_MEDIAN"      
## [47] "RESPIRATORY_RATE_MIN"           "OXYGEN_SATURATION_MIN"         
## [49] "BLOODPRESSURE_DIASTOLIC_MAX"    "BLOODPRESSURE_SISTOLIC_MAX"    
## [51] "HEART_RATE_MAX"                 "RESPIRATORY_RATE_MAX"          
## [53] "response"
paper%in%colnames(new_data)
## [1] TRUE TRUE TRUE TRUE TRUE TRUE

Data Split: Train - Test

set.seed(200)
train_indices <- sample(nrow(new_data), size = nrow(new_data)*0.7, replace = FALSE)

train_new = new_data[train_indices,]
test_new = new_data[-train_indices,]

Finally, we obtain from the dataset:

new_data: train_new and test_new with 80% and 20% of new_data respectively.

Functions to Measure Accuracy

#### Get Info from Matrix
getInfo = function(matrix,y_pred){
  return(list(
    accuracy =((matrix[1,1] + matrix[2,2])/sum(matrix))*100,
    sensitivity = sensitivity(matrix)*100,
    specificity = specificity(matrix)*100,
    auc = auc(roc(response=test_new$response, predictor= as.numeric(y_pred))),
    roc = roc(response=test_new$response, predictor= as.numeric(y_pred))
  ))
}
#### Predict & Get Accuracy
getAccuracy = function(model, test_data){
  y_pred = predict(model, newdata = test_data)
  tab <- table(y_pred,test_data$response)
  return(getInfo(tab,y_pred))
}

XGBOOST TREE

XGBOOST TREE Training & Tuning Function

mod_xgb <- function(data, k){
  control = trainControl(method="cv", number=k)
  cv.xgb = train(
    response~.,
    data=data,
    tuneLength = 3,
    method="xgbTree",
    trControl = control,
    verbose = 0
  )
  return(cv.xgb)
}

Training and Testing the Model

xgmods <- mod_xgb(train_new, 10)
results_xg_tune_noEFS <- getAccuracy(xgmods, test_new)
results_xg_tune_noEFS
## $accuracy
## [1] 89.79239
## 
## $sensitivity
## [1] 91.81495
## 
## $specificity
## [1] 87.87879
## 
## $auc
## Area under the curve: 0.8985
## 
## $roc
## 
## Call:
## roc.default(response = test_new$response, predictor = as.numeric(y_pred))
## 
## Data: as.numeric(y_pred) in 281 controls (test_new$response 0) < 297 cases (test_new$response 1).
## Area under the curve: 0.8985

Speicifc Parameters to Try on new_data

dtrain_new <- Matrix(data.matrix(train_new[,-ncol(train_new)]), sparse = TRUE)
dtest_new <- Matrix(data.matrix(test_new[,-ncol(test_new)]), sparse = TRUE)

xg_new <- xgboost(dtrain_new, label = as.numeric(train_new$response)-1,
                  nrounds = 150, subsample = 0.8, colsample_bytree = 0.8,
                  early_stopping_rounds = 100, max_depth =5,
                  objective = "binary:logistic",
                  validate_parameters = TRUE,
                  eval_metric = "auc")
## [1]  train-auc:0.863898 
## Will train until train_auc hasn't improved in 100 rounds.
## 
## [2]  train-auc:0.920337 
## [3]  train-auc:0.937123 
## [4]  train-auc:0.948901 
## [5]  train-auc:0.957677 
## [6]  train-auc:0.963513 
## [7]  train-auc:0.967526 
## [8]  train-auc:0.977318 
## [9]  train-auc:0.978905 
## [10] train-auc:0.980889 
## [11] train-auc:0.983076 
## [12] train-auc:0.983937 
## [13] train-auc:0.984609 
## [14] train-auc:0.985707 
## [15] train-auc:0.986775 
## [16] train-auc:0.988629 
## [17] train-auc:0.990142 
## [18] train-auc:0.990840 
## [19] train-auc:0.991627 
## [20] train-auc:0.992292 
## [21] train-auc:0.992455 
## [22] train-auc:0.993475 
## [23] train-auc:0.993773 
## [24] train-auc:0.994168 
## [25] train-auc:0.994322 
## [26] train-auc:0.994666 
## [27] train-auc:0.994990 
## [28] train-auc:0.995134 
## [29] train-auc:0.995372 
## [30] train-auc:0.995726 
## [31] train-auc:0.996024 
## [32] train-auc:0.996156 
## [33] train-auc:0.996328 
## [34] train-auc:0.996423 
## [35] train-auc:0.996676 
## [36] train-auc:0.996727 
## [37] train-auc:0.996950 
## [38] train-auc:0.997012 
## [39] train-auc:0.997017 
## [40] train-auc:0.997273 
## [41] train-auc:0.997299 
## [42] train-auc:0.997330 
## [43] train-auc:0.997469 
## [44] train-auc:0.997491 
## [45] train-auc:0.997725 
## [46] train-auc:0.997769 
## [47] train-auc:0.997921 
## [48] train-auc:0.997943 
## [49] train-auc:0.998100 
## [50] train-auc:0.998155 
## .......................
## [125]    train-auc:0.999055 
## [126]    train-auc:0.999062 
## [127]    train-auc:0.998998 
## [128]    train-auc:0.999068 
## [129]    train-auc:0.999068 
## [130]    train-auc:0.999062 
## [131]    train-auc:0.999062 
## [132]    train-auc:0.999047 
## [133]    train-auc:0.999024 
## [134]    train-auc:0.999024 
## [135]    train-auc:0.999038 
## [136]    train-auc:0.999038 
## [137]    train-auc:0.999038 
## [138]    train-auc:0.999060 
## [139]    train-auc:0.999060 
## [140]    train-auc:0.999053 
## [141]    train-auc:0.999049 
## [142]    train-auc:0.999047 
## [143]    train-auc:0.999031 
## [144]    train-auc:0.999062 
## [145]    train-auc:0.999047 
## [146]    train-auc:0.999047 
## [147]    train-auc:0.999068 
## [148]    train-auc:0.999068 
## [149]    train-auc:0.999060 
## [150]    train-auc:0.999062
pred_new <- ifelse(predict(xg_new, dtest_new) >= 0.5, 1, 0)
tab_new <- table(pred_new, test_new$response)
results_xg_noEFS <- getInfo(tab_new,pred_new)
results_xg_noEFS
## $accuracy
## [1] 90.31142
## 
## $sensitivity
## [1] 92.17082
## 
## $specificity
## [1] 88.55219
## 
## $auc
## Area under the curve: 0.9036
## 
## $roc
## 
## Call:
## roc.default(response = test_new$response, predictor = as.numeric(y_pred))
## 
## Data: as.numeric(y_pred) in 281 controls (test_new$response 0) < 297 cases (test_new$response 1).
## Area under the curve: 0.9036

BRUTE FORCE TOP PREDICTORS

ps <- c(seq(1,ncol(new_data)-1))
for(i in seq(1,ncol(new_data)-1)){
  pred <- unlist(new_data[,i])
  resp <- as.numeric(new_data$response)-1
  mat <- cbind(pred,resp)
  df <- as.data.frame(mat)
  lm.fit <- lm(pred~resp,df)
  ps[i] <- summary(lm.fit)$coefficients[2,4]
}
head(ps)
## [1] 1.470736e-38 1.585433e-08 9.532685e-03 1.502497e-05 1.968126e-07
## [6] 2.835204e-05
tbl <- cbind(ps, colnames(new_data)[-53])
or_tbl <- tbl[order(as.numeric(tbl[,1]), decreasing = FALSE),]
or_tbl
##       ps                                                     
##  [1,] "7.30664720971241e-54" "RESPIRATORY_RATE_MAX"          
##  [2,] "3.59399335367418e-43" "LACTATE"                       
##  [3,] "1.47073621906567e-38" "AGE_ABOVE65"                   
##  [4,] "3.93091529250932e-38" "RESPIRATORY_RATE_MEDIAN"       
##  [5,] "5.41684548795936e-26" "BLOODPRESSURE_DIASTOLIC_MEDIAN"
##  [6,] "9.63744419847796e-23" "PCR"                           
##  [7,] "1.10214095142365e-22" "UREA"                          
##  [8,] "4.74978568068286e-18" "BLOODPRESSURE_SISTOLIC_MAX"    
##  [9,] "4.42347960244578e-17" "LEUKOCYTES"                    
## [10,] "1.0573123364535e-16"  "HTN"                           
## [11,] "1.1603519937353e-14"  "BE_VENOUS"                     
## [12,] "3.14637681067657e-13" "HEMATOCRITE"                   
## [13,] "4.56572431937058e-13" "ALBUMIN"                       
## [14,] "1.03361919175912e-10" "SODIUM"                        
## [15,] "1.3366523324479e-10"  "BE_ARTERIAL"                   
## [16,] "3.64481126280959e-09" "CALCIUM"                       
## [17,] "5.09606702381456e-09" "CREATININ"                     
## [18,] "6.56298811366707e-09" "POTASSIUM"                     
## [19,] "1.30021346208415e-08" "OXYGEN_SATURATION_MIN"         
## [20,] "1.5854333465344e-08"  "GENDER"                        
## [21,] "1.86057910779632e-07" "INR"                           
## [22,] "1.96812614583134e-07" "DISEASE GROUPING 3"            
## [23,] "6.62676033141478e-07" "SAT02_ARTERIAL"                
## [24,] "1.1270376447761e-06"  "DISEASE GROUPING 5"            
## [25,] "3.93672404228426e-06" "GLUCOSE"                       
## [26,] "1.09116409465114e-05" "LINFOCITOS"                    
## [27,] "1.50249657586716e-05" "DISEASE GROUPING 2"            
## [28,] "2.83520446807765e-05" "DISEASE GROUPING 4"            
## [29,] "3.78963234259058e-05" "BLOODPRESSURE_SISTOLIC_MEDIAN" 
## [30,] "5.03590852186578e-05" "HEART_RATE_MAX"                
## [31,] "8.79895014865381e-05" "TTPA"                          
## [32,] "0.00037547055998641"  "TEMPERATURE_MEDIAN"            
## [33,] "0.000450905257115843" "OXYGEN_SATURATION_MEDIAN"      
## [34,] "0.000539850767380292" "FFA"                           
## [35,] "0.00178756049605462"  "RESPIRATORY_RATE_MIN"          
## [36,] "0.0024429217273631"   "P02_VENOUS"                    
## [37,] "0.00587907694336656"  "IMMUNOCOMPROMISED"             
## [38,] "0.00638249879452864"  "DIMER"                         
## [39,] "0.00953268487264496"  "DISEASE GROUPING 1"            
## [40,] "0.0211360984147818"   "BLAST"                         
## [41,] "0.0307575750566481"   "BIC_VENOUS"                    
## [42,] "0.0371482634464712"   "BLOODPRESSURE_DIASTOLIC_MAX"   
## [43,] "0.0612777277402924"   "PC02_VENOUS"                   
## [44,] "0.0868592820636745"   "TGO"                           
## [45,] "0.13429720702812"     "DISEASE GROUPING 6"            
## [46,] "0.411141206943931"    "PC02_ARTERIAL"                 
## [47,] "0.447215429415833"    "BILLIRUBIN"                    
## [48,] "0.528085932554305"    "P02_ARTERIAL"                  
## [49,] "0.663166334469405"    "PLATELETS"                     
## [50,] "0.677035907847009"    "PH_VENOUS"                     
## [51,] "0.867707785173232"    "PH_ARTERIAL"                   
## [52,] "0.980672756719852"    "HEART_RATE_MEDIAN"
last <- which.max(which(as.numeric(or_tbl[,1])<0.01))

top_acc = 0
results_top_features_xg = NA
acc <- seq(1,last)
for(i in seq(1,last)){
  dd <- new_data[,which(colnames(new_data) 
                        %in% c(or_tbl[seq(1,i),2],"response"))]
  
  set.seed(200)
  train_indices <- sample(nrow(dd), size = nrow(dd)*0.7, replace = FALSE)

  train_new = dd[train_indices,]
  test_new = dd[-train_indices,]
  
  dtrain_new <- Matrix(data.matrix(train_new[,-ncol(train_new)]), sparse = TRUE)
  dtest_new <- Matrix(data.matrix(test_new[,-ncol(test_new)]), sparse = TRUE)
  
  xg_new <- xgboost(dtrain_new, label = as.numeric(train_new$response)-1,
                    nrounds = 150, subsample = 0.8, colsample_bytree = 0.8,
                    early_stopping_rounds = 100, max_depth =5,
                    objective = "binary:logistic",
                    validate_parameters = TRUE,
                    eval_metric = "auc",
                    verbose=0)
  pred_new <- ifelse(predict(xg_new, dtest_new) >= 0.5, 1, 0)
  tab_new <- table(pred_new, test_new$response)
  acc[i] <- getInfo(tab_new, pred_new)$accuracy
  if(acc[i]>top_acc){
    top_acc = acc[i]
    results_top_features_xg = getInfo(tab_new, pred_new)
  }
}
c(or_tbl[seq(1,which.max(acc)),2])
##  [1] "RESPIRATORY_RATE_MAX"           "LACTATE"                       
##  [3] "AGE_ABOVE65"                    "RESPIRATORY_RATE_MEDIAN"       
##  [5] "BLOODPRESSURE_DIASTOLIC_MEDIAN" "PCR"                           
##  [7] "UREA"                           "BLOODPRESSURE_SISTOLIC_MAX"    
##  [9] "LEUKOCYTES"                     "HTN"                           
## [11] "BE_VENOUS"                      "HEMATOCRITE"                   
## [13] "ALBUMIN"                        "SODIUM"                        
## [15] "BE_ARTERIAL"                    "CALCIUM"                       
## [17] "CREATININ"                      "POTASSIUM"                     
## [19] "OXYGEN_SATURATION_MIN"          "GENDER"                        
## [21] "INR"                            "DISEASE GROUPING 3"            
## [23] "SAT02_ARTERIAL"                 "DISEASE GROUPING 5"            
## [25] "GLUCOSE"                        "LINFOCITOS"                    
## [27] "DISEASE GROUPING 2"             "DISEASE GROUPING 4"
results_top_features_xg
## $accuracy
## [1] 92.90657
## 
## $sensitivity
## [1] 94.30605
## 
## $specificity
## [1] 91.58249
## 
## $auc
## Area under the curve: 0.9294
## 
## $roc
## 
## Call:
## roc.default(response = test_new$response, predictor = as.numeric(y_pred))
## 
## Data: as.numeric(y_pred) in 281 controls (test_new$response 0) < 297 cases (test_new$response 1).
## Area under the curve: 0.9294

Discriminant Analysis: LDA & QDA

Discriminant Analysis Function

mod_da <- function(data, k, method){
  control = trainControl(method="cv", number=k)
  cv.da = train(
  response~.,
  data=data,
  method=method,
  trControl = control,
  )
  return(cv.da)
}

LDA

### Training the Model
set.seed(100)
lda.fit_new <- mod_da(train_new, 10, "lda")
lda.fit_new
## Linear Discriminant Analysis 
## 
## 1347 samples
##   39 predictor
##    2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 1213, 1212, 1212, 1213, 1212, 1213, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.7445882  0.4895812
### Results of LDA
set.seed(100)
results_lda_noEFS <- getAccuracy(lda.fit_new, test_new)
results_lda_noEFS
## $accuracy
## [1] 76.12457
## 
## $sensitivity
## [1] 85.40925
## 
## $specificity
## [1] 67.34007
## 
## $auc
## Area under the curve: 0.7637
## 
## $roc
## 
## Call:
## roc.default(response = test_new$response, predictor = as.numeric(y_pred))
## 
## Data: as.numeric(y_pred) in 281 controls (test_new$response 0) < 297 cases (test_new$response 1).
## Area under the curve: 0.7637

QDA

### Training the Model
### set.seed(100)
### qda.fit_new <- mod_da(train_new, 10, "qda")
### qda.fit_new


### Results of LDA without previous Exhaustive Feature Selection (EFS)
### set.seed(100)
### results_qda_noEFS <- getAccuracy(qda.fit_new, test_new)
### results_qda_noEFS

### Rank Dificiency in Group 0 Error
### This error arises when there are low variance features

### QDA doesn't work on this subset of features, we try after
### exhaustive feature selection.

KNN

Function to Train Model

mod_knn <- function(data, k){
  control = trainControl(method="cv", number=k)
  cv.knn = train(
  response~.,
  data = data,
  method = "knn",
  metric = "Accuracy",
  trControl = control,
  )
  return(cv.knn)
}

Training & Testing the Model

Using new_data (data without exhaustive feature selection)

knn.fit_new <- mod_knn(train_new, 10)
knn.fit_new
## k-Nearest Neighbors 
## 
## 1347 samples
##   39 predictor
##    2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 1213, 1212, 1212, 1213, 1212, 1213, ... 
## Resampling results across tuning parameters:
## 
##   k  Accuracy   Kappa    
##   5  0.7497512  0.5002436
##   7  0.7371476  0.4751500
##   9  0.7222941  0.4455067
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 5.
results_knn_noEFS <- getAccuracy(knn.fit_new, test_new)
results_knn_noEFS
## $accuracy
## [1] 76.47059
## 
## $sensitivity
## [1] 86.121
## 
## $specificity
## [1] 67.34007
## 
## $auc
## Area under the curve: 0.7673
## 
## $roc
## 
## Call:
## roc.default(response = test_new$response, predictor = as.numeric(y_pred))
## 
## Data: as.numeric(y_pred) in 281 controls (test_new$response 0) < 297 cases (test_new$response 1).
## Area under the curve: 0.7673

Support Vector Machine SVM Classifier

SVM Polynomial Function

mod_svm_poly <- function(data, degree){
  tuned.poly = tune(svm,
             response~.,
             kernel = "polynomial",
             data=data,
             ranges = list(degree))
  return(tuned.poly)
}

Training & Testing the Model

svm_poly_new <- mod_svm_poly(train_new, degree = seq(1:10))

results_svm_polynoEFS <- getAccuracy(svm_poly_new$best.model, test_new)
results_svm_polynoEFS
## $accuracy
## [1] 72.14533
## 
## $sensitivity
## [1] 99.28826
## 
## $specificity
## [1] 46.46465
## 
## $auc
## Area under the curve: 0.7288
## 
## $roc
## 
## Call:
## roc.default(response = test_new$response, predictor = as.numeric(y_pred))
## 
## Data: as.numeric(y_pred) in 281 controls (test_new$response 0) < 297 cases (test_new$response 1).
## Area under the curve: 0.7288

SVM Polynomial Function - Different Library

train_control <- trainControl(method="cv", number=10)
svm_poly2_new <- train(response ~ .,
                         method = 'svmPoly',
                         tuneLength = 2,
                         data       = train_new,
                         trControl = train_control)

results_svm_polynoEFS2 = getAccuracy(svm_poly2_new, test_new)
results_svm_polynoEFS2
## $accuracy
## [1] 77.16263
## 
## $sensitivity
## [1] 88.96797
## 
## $specificity
## [1] 65.99327
## 
## $auc
## Area under the curve: 0.7748
## 
## $roc
## 
## Call:
## roc.default(response = test_new$response, predictor = as.numeric(y_pred))
## 
## Data: as.numeric(y_pred) in 281 controls (test_new$response 0) < 297 cases (test_new$response 1).
## Area under the curve: 0.7748

SVM Radial Kernel

Using new_data (data without exhaustive feature selection)

svm_radial_new <- train(response ~ .,
                         method = 'svmRadial',
                         tuneLength = 4,
                         data       = train_new,
                         trControl = train_control)

results_svm_radial_noEFS = getAccuracy(svm_radial_new, test_new)
results_svm_radial_noEFS
## $accuracy
## [1] 83.73702
## 
## $sensitivity
## [1] 88.6121
## 
## $specificity
## [1] 79.12458
## 
## $auc
## Area under the curve: 0.8387
## 
## $roc
## 
## Call:
## roc.default(response = test_new$response, predictor = as.numeric(y_pred))
## 
## Data: as.numeric(y_pred) in 281 controls (test_new$response 0) < 297 cases (test_new$response 1).
## Area under the curve: 0.8387

Linear SVM L2 Regularized with Class Weights

svm_LW_new <- train(response ~ .,
                         method = 'svmLinearWeights2',
                         tuneLength = 4,
                         data       = train_new,
                         trControl = train_control)

results_svm_LW_new_noEFS = getAccuracy(svm_LW_new, test_new)
results_svm_LW_new_noEFS
## $accuracy
## [1] 77.33564
## 
## $sensitivity
## [1] 86.47687
## 
## $specificity
## [1] 68.68687
## 
## $auc
## Area under the curve: 0.7758
## 
## $roc
## 
## Call:
## roc.default(response = test_new$response, predictor = as.numeric(y_pred))
## 
## Data: as.numeric(y_pred) in 281 controls (test_new$response 0) < 297 cases (test_new$response 1).
## Area under the curve: 0.7758

Random Forest

Using new_data (data without exhaustive feature selection)

set.seed(2645)
## Perform Bagging: mtry = number of predictors
## Perform Random Forests: mtry = sqrt(p) for classification
oob.values <- matrix(data=NA, nrow=ncol(train_new)-1, ncol=4)
accuracies <- matrix(data=NA, nrow=ncol(train_new)-1, ncol=4)

attach(train_new)
for(i in 1:(ncol(train_new)-1)) {
  for(n in c(250,500,750,1000)){
    temp.model <- randomForest(response~ ., data=train_new,
                             importance=TRUE, mtry=i, ntree=n)
  #Get the mse values
    accuracies[i,n/250] <- getAccuracy(temp.model, test_new)$accuracy
  #Get the oob values
    oob.values[i,n/250] <- temp.model$err.rate[nrow(temp.model$err.rate),1]
  }
}
accuracies
##           [,1]     [,2]     [,3]     [,4]
##  [1,] 84.60208 84.25606 84.77509 83.91003
##  [2,] 89.61938 89.96540 89.79239 90.13841
##  [3,] 90.83045 89.27336 91.00346 90.65744
##  [4,] 90.83045 91.52249 90.83045 90.65744
##  [5,] 90.83045 90.65744 91.17647 90.65744
##  [6,] 90.31142 91.17647 90.83045 90.83045
##  [7,] 90.48443 90.48443 91.17647 90.83045
##  [8,] 90.31142 91.00346 90.83045 91.00346
##  [9,] 91.34948 91.00346 90.65744 90.48443
## [10,] 91.00346 91.00346 90.48443 91.17647
## [11,] 90.31142 90.65744 90.83045 91.17647
## [12,] 90.48443 91.17647 91.17647 90.31142
## [13,] 91.00346 90.48443 90.83045 91.17647
## [14,] 91.17647 91.34948 91.34948 91.34948
## [15,] 91.00346 91.00346 91.34948 91.17647
## [16,] 90.83045 91.17647 91.00346 90.83045
## [17,] 90.31142 91.17647 91.17647 90.48443
## [18,] 91.34948 91.17647 91.17647 90.65744
## [19,] 91.69550 91.17647 90.65744 90.48443
## [20,] 91.34948 90.83045 90.83045 90.65744
## [21,] 90.83045 90.65744 90.83045 91.00346
## [22,] 90.83045 90.83045 91.00346 91.17647
## [23,] 90.48443 91.34948 91.17647 91.00346
## [24,] 91.00346 91.52249 90.65744 91.17647
## [25,] 91.17647 91.17647 91.00346 91.17647
## [26,] 90.83045 91.17647 90.83045 91.17647
## [27,] 90.65744 91.00346 91.00346 91.17647
## [28,] 90.83045 91.00346 91.17647 91.17647
## [29,] 91.00346 91.00346 91.34948 91.17647
## [30,] 90.48443 91.17647 90.31142 90.83045
## [31,] 91.34948 91.17647 91.00346 90.83045
## [32,] 90.65744 91.34948 90.83045 91.34948
## [33,] 91.34948 91.00346 90.83045 91.34948
## [34,] 91.17647 91.34948 91.17647 91.17647
## [35,] 90.48443 90.83045 91.00346 91.00346
## [36,] 90.65744 90.83045 91.17647 90.65744
## [37,] 90.65744 91.00346 90.65744 90.83045
## [38,] 90.65744 90.83045 90.48443 91.34948
## [39,] 91.00346 91.17647 91.17647 91.00346
oob.values
##             [,1]      [,2]       [,3]       [,4]
##  [1,] 0.16778025 0.1714922 0.17371938 0.17371938
##  [2,] 0.11581292 0.1106162 0.11284336 0.11952487
##  [3,] 0.11284336 0.1039347 0.09576837 0.10319228
##  [4,] 0.09502598 0.1002227 0.09799555 0.09799555
##  [5,] 0.10393467 0.1054195 0.10022272 0.09502598
##  [6,] 0.10913140 0.1002227 0.10690423 0.10170750
##  [7,] 0.10319228 0.1091314 0.10616184 0.11061618
##  [8,] 0.10838901 0.1098738 0.10764662 0.10541945
##  [9,] 0.11135857 0.1083890 0.10838901 0.11135857
## [10,] 0.11061618 0.1054195 0.11358575 0.11284336
## [11,] 0.11061618 0.1143281 0.11135857 0.10913140
## [12,] 0.11655531 0.1143281 0.10838901 0.11358575
## [13,] 0.10764662 0.1121010 0.10764662 0.11135857
## [14,] 0.11358575 0.1158129 0.10616184 0.10393467
## [15,] 0.10913140 0.1039347 0.10764662 0.11210097
## [16,] 0.11729770 0.1128434 0.10616184 0.10838901
## [17,] 0.11581292 0.1121010 0.10913140 0.11061618
## [18,] 0.11135857 0.1143281 0.10987379 0.10244989
## [19,] 0.10690423 0.1113586 0.11210097 0.10467706
## [20,] 0.11655531 0.1083890 0.10913140 0.10838901
## [21,] 0.11804009 0.1121010 0.11210097 0.10616184
## [22,] 0.10690423 0.1083890 0.10764662 0.10690423
## [23,] 0.11507053 0.1143281 0.10690423 0.10838901
## [24,] 0.11952487 0.1143281 0.10913140 0.11507053
## [25,] 0.11507053 0.1158129 0.11135857 0.11135857
## [26,] 0.11135857 0.1113586 0.11061618 0.10690423
## [27,] 0.11061618 0.1113586 0.11210097 0.11210097
## [28,] 0.11878248 0.1128434 0.11135857 0.10987379
## [29,] 0.11655531 0.1165553 0.11284336 0.11135857
## [30,] 0.11878248 0.1143281 0.11358575 0.11210097
## [31,] 0.11284336 0.1121010 0.11358575 0.11210097
## [32,] 0.11655531 0.1121010 0.11878248 0.11284336
## [33,] 0.11655531 0.1172977 0.11210097 0.11507053
## [34,] 0.11878248 0.1158129 0.11358575 0.11507053
## [35,] 0.12100965 0.1150705 0.11061618 0.11655531
## [36,] 0.11952487 0.1128434 0.11210097 0.11135857
## [37,] 0.11655531 0.1113586 0.11135857 0.11432814
## [38,] 0.11581292 0.1150705 0.11581292 0.10913140
## [39,] 0.11284336 0.1143281 0.10987379 0.11358575
#ntrees <- as.integer(which.max(accuracies)/(ncol(train_new)-1))
#accuracies[which.max(accuracies)]
which.max(accuracies)
## [1] 19
ind_min = which.min(oob.values)
ind_min
## [1] 4
min_oob = oob.values[which.min(oob.values)]
min_oob
## [1] 0.09502598
rnum = ind_min%%(ncol(train_new)-1)
if(rnum == 0){
  rnum = nrow(train_new)
}
rnum
## [1] 4
randForest_new <- randomForest(response~ ., data=train_new,
                             importance=TRUE, mtry=rnum, ntree=1000)
randForest_new$err.rate[nrow(randForest_new$err.rate),1]
##       OOB 
## 0.1017075
result_randomForest_noEFS <- getAccuracy(randForest_new, test_new)
result_randomForest_noEFS
## $accuracy
## [1] 90.83045
## 
## $sensitivity
## [1] 91.1032
## 
## $specificity
## [1] 90.57239
## 
## $auc
## Area under the curve: 0.9084
## 
## $roc
## 
## Call:
## roc.default(response = test_new$response, predictor = as.numeric(y_pred))
## 
## Data: as.numeric(y_pred) in 281 controls (test_new$response 0) < 297 cases (test_new$response 1).
## Area under the curve: 0.9084

Results

model_results <- cbind(unlist(result_randomForest_noEFS), unlist(results_knn_noEFS), 
            unlist(results_svm_polynoEFS), unlist(results_svm_polynoEFS2),
            unlist(results_svm_LW_new_noEFS), unlist(results_svm_radial_noEFS),
            unlist(results_lda_noEFS), unlist(results_xg_noEFS),
            unlist(results_xg_tune_noEFS), unlist(results_top_features_xg))[seq(1,4),]

colnames(model_results) <- c("Random Forest", "KNN", "SVM Poly", "SVM Poly2",
                             "SVM Linear L2", "SVM Radial", "LDA", "XGBoost",
                             "XGB Tuned", "Top Features Only XGB")


model_results <- model_results[,order(unlist(model_results[1,]))]
model_results
##             SVM Poly  LDA       KNN       SVM Poly2 SVM Linear L2 SVM Radial
## accuracy    72.14533  76.12457  76.47059  77.16263  77.33564      83.73702  
## sensitivity 99.28826  85.40925  86.121    88.96797  86.47687      88.6121   
## specificity 46.46465  67.34007  67.34007  65.99327  68.68687      79.12458  
## auc         0.7287645 0.7637466 0.7673053 0.7748062 0.7758187     0.8386834 
##             XGB Tuned XGBoost  Random Forest Top Features Only XGB
## accuracy    89.79239  90.31142 90.83045      92.90657             
## sensitivity 91.81495  92.17082 91.1032       94.30605             
## specificity 87.87879  88.55219 90.57239      91.58249             
## auc         0.8984687 0.903615 0.908378      0.9294427
#### PLOTTING THE ACCURACIES 
accs <- unlist(model_results[1,])
ggplot(mapping=aes(x=reorder(colnames(model_results),accs), y=as.numeric(accs))) +
  geom_point(col="blue", stroke=1, fill="white", size=2, alpha=0.9) +
  xlab("Model Name") +
  ylab("Accuracy") + 
  ggtitle("Graph Representing the Obtained Accuracy of every Model") +
  geom_label_repel(aes(label = format(round(as.numeric(accs), 2), nsmall = 2)),
                  fontface = 'bold', 
                  box.padding   = unit(0.35, "lines"), 
                  point.padding = unit(0.3, "lines"),
                  segment.color = 'black') +
  expand_limits(y=55) +
  scale_x_discrete (expand = expansion(add=0.2))+
  theme(axis.text.x = element_text(size = 10, angle = 45, hjust = 1))

#### PLOTTING THE AUCs
aucs <- unlist(model_results[4,])*100
ggplot(mapping=aes(x=reorder(colnames(model_results),aucs), y=as.numeric(aucs))) +
  geom_point(col="blue", stroke=1, fill="white", size=2, alpha=0.9) +
  xlab("Model Name") +
  ylab("Area Under the Curve AUC") + 
  ggtitle("Graph Representing the Obtained AUC of every Model") +
  geom_label_repel(aes(label = format(round(as.numeric(aucs), 2), nsmall = 2)),
                  fontface = 'bold', 
                  box.padding   = unit(0.35, "lines"), 
                  point.padding = unit(0.3, "lines"),
                  segment.color = 'black') +
  scale_x_discrete (expand = expansion(add=0.2))+
  scale_y_continuous(limits = c(55, max(aucs)+5)) +
  theme(axis.text.x = element_text(size = 10, angle = 45, hjust = 1))

rocs <- list(lda=results_lda_noEFS$roc, knn=results_knn_noEFS$roc,
           svm_poly=results_svm_polynoEFS$roc, 
           svm_poly2=results_svm_polynoEFS2$roc,
           svm_rdial = results_svm_radial_noEFS$roc,
           svm_linear = results_svm_LW_new_noEFS$roc,
           rf = result_randomForest_noEFS$roc, xgb=results_xg_noEFS$roc,
           xgbTuned = results_top_features_xg$roc, 
           xgbFeatureSelection = results_top_features_xg$roc)

ggroc(rocs)