Predicting Intensive Care Unit (ICU) Patient Admission using R programming

Project Background

During the early phase of the pandemic, hospital intensive care units (ICUs) experienced a surge of critically ill COVID-19 patients. With limited resources, management wants us to develop a model that can: (a) identify independent clinical features associated with ICU admission and (b) predict ICU admission.

Project Significance

1. Urgent need to help frontline clinicians triage patients effectively
2. Limited healthcare resources and increased demand for care

Step 1: Install and Load the R packages

#install.packages("readxl")
#install.packages("caret")
#install.packages("pscl")
#install.packages("Hmisc")

library(readxl)
library(caret) #helps us to split our data into training and testing sets
## Loading required package: ggplot2
## Loading required package: lattice
library(pscl) #gets the pseudo R-square for logistic regression
## Classes and Methods for R developed in the
## Political Science Computational Laboratory
## Department of Political Science
## Stanford University
## Simon Jackman
## hurdle and zeroinfl functions by Achim Zeileis
library(Hmisc) #used to find the correlation and its p-values
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
## 
##     format.pval, units

Step 2: Import & summarize the data

hospital_df <- read_excel("Hospital Data.xlsx")
head(hospital_df)
## # A tibble: 6 × 21
##   ICU_Admit   ESI   Age Sex      SBP    HR  Temp    RR  Spo2 qSOFA   BMI    MI
##       <dbl> <dbl> <dbl> <chr>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1         0     3    27 Female   125   115 102.     18    99     0  18.9     0
## 2         0     2    44 Male      92    97 103.     20    97     1  29.6     0
## 3         0     2    39 Male     109   100 103.     44    97     1  35.9     0
## 4         0     2    46 Female   113    98  98.7    18    96     0  41.4     0
## 5         0     2    34 Female   109   101  99.4    24    98     1  22.6     0
## 6         0     3    69 Female   152   113  99.4    16    97     0  29.3     0
## # ℹ 9 more variables: CHF <dbl>, Stroke <dbl>, DM <dbl>, CKD <dbl>,
## #   Cancer <dbl>, Asthma <dbl>, HTN <dbl>, LowIncome <dbl>, obese <dbl>
summary(hospital_df)
##    ICU_Admit           ESI             Age             Sex           
##  Min.   :0.0000   Min.   :1.000   Min.   : 17.00   Length:1175       
##  1st Qu.:0.0000   1st Qu.:2.000   1st Qu.: 47.00   Class :character  
##  Median :0.0000   Median :2.000   Median : 60.00   Mode  :character  
##  Mean   :0.1047   Mean   :2.205   Mean   : 57.85                     
##  3rd Qu.:0.0000   3rd Qu.:3.000   3rd Qu.: 70.00                     
##  Max.   :1.0000   Max.   :4.000   Max.   :103.00                     
##                   NA's   :10                                         
##       SBP            HR              Temp              RR        
##  Min.   : 65   Min.   : 12.00   Min.   : 93.90   Min.   : 10.00  
##  1st Qu.:115   1st Qu.: 86.00   1st Qu.: 98.20   1st Qu.: 18.00  
##  Median :130   Median : 99.00   Median : 99.00   Median : 20.00  
##  Mean   :132   Mean   : 98.68   Mean   : 99.34   Mean   : 22.11  
##  3rd Qu.:146   3rd Qu.:111.00   3rd Qu.:100.40   3rd Qu.: 24.00  
##  Max.   :251   Max.   :176.00   Max.   :103.80   Max.   :135.00  
##  NA's   :1                      NA's   :7        NA's   :1       
##       Spo2            qSOFA             BMI              MI         
##  Min.   :  2.00   Min.   :0.0000   Min.   :13.68   Min.   :0.00000  
##  1st Qu.: 94.00   1st Qu.:0.0000   1st Qu.:25.84   1st Qu.:0.00000  
##  Median : 96.00   Median :0.0000   Median :31.01   Median :0.00000  
##  Mean   : 95.17   Mean   :0.5685   Mean   :32.27   Mean   :0.07745  
##  3rd Qu.: 98.00   3rd Qu.:1.0000   3rd Qu.:36.81   3rd Qu.:0.00000  
##  Max.   :100.00   Max.   :3.0000   Max.   :78.38   Max.   :1.00000  
##  NA's   :2                         NA's   :2                        
##       CHF             Stroke              DM              CKD        
##  Min.   :0.0000   Min.   :0.00000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :0.0000   Median :0.00000   Median :0.0000   Median :0.0000  
##  Mean   :0.1753   Mean   :0.08766   Mean   :0.4085   Mean   :0.2477  
##  3rd Qu.:0.0000   3rd Qu.:0.00000   3rd Qu.:1.0000   3rd Qu.:0.0000  
##  Max.   :1.0000   Max.   :1.00000   Max.   :1.0000   Max.   :1.0000  
##                                                                      
##      Cancer            Asthma            HTN           LowIncome     
##  Min.   :0.00000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:1.0000  
##  Median :0.00000   Median :0.0000   Median :1.0000   Median :1.0000  
##  Mean   :0.08255   Mean   :0.1072   Mean   :0.6153   Mean   :0.8817  
##  3rd Qu.:0.00000   3rd Qu.:0.0000   3rd Qu.:1.0000   3rd Qu.:1.0000  
##  Max.   :1.00000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##                                                                      
##      obese       
##  Min.   :0.0000  
##  1st Qu.:0.0000  
##  Median :1.0000  
##  Mean   :0.5434  
##  3rd Qu.:1.0000  
##  Max.   :1.0000  
##  NA's   :1
#Note: Make sure your Excel file and R Markdown file are in the same folder

Data Description: A description of some of the features is presented in the table below.
Variable      | Definition
------------- | -------------
1. ICU Admit  | patient admitted to the ICU or not
2. ESI        | emergency severity index
3. SBP        | systolic blood pressure
4. HR         | heart rate
5. Temp       | temperature

Step 3: Data visualization

# Visualizing the target variable (i.e., ICU Admit) using a column chart
counts <- table(hospital_df$ICU_Admit)
barplot(counts)

# Interpretation: Based on the column chart, the majority of the patients were not admitted to the ICU
# Meaning of the target variable values
# 0: Do not admit
# 1: Admit
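
The imbalance between the two classes can also be expressed as proportions. The following is a minimal sketch, assuming the counts implied by the summary in Step 2 (1,175 rows with a mean ICU_Admit of 0.1047, i.e. about 123 admissions). The vector `icu_admit` is reconstructed for illustration and is not read from the Excel file.

```r
# Reconstructed target vector (assumption: 123 of 1,175 patients admitted,
# derived from the mean of 0.1047 reported by summary())
icu_admit <- rep(c(0, 1), times = c(1052, 123))

counts <- table(icu_admit)     # number of patients in each class
props  <- prop.table(counts)   # share of each class
round(props, 3)
```

On the real data, the same two calls would be applied to `hospital_df$ICU_Admit`.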

Step 4: Feature engineering - pre-processing the data

# We count the number of missing values
colSums(is.na(hospital_df)) 
## ICU_Admit       ESI       Age       Sex       SBP        HR      Temp        RR 
##         0        10         0         1         1         0         7         1 
##      Spo2     qSOFA       BMI        MI       CHF    Stroke        DM       CKD 
##         2         0         2         0         0         0         0         0 
##    Cancer    Asthma       HTN LowIncome     obese 
##         0         0         0         0         1
# Interpretation: Eight of the variables have missing values.

# We use the na.omit function to drop all the rows with missing values
hosp_df <- na.omit(hospital_df) 
summary(hosp_df)
##    ICU_Admit           ESI             Age             Sex           
##  Min.   :0.0000   Min.   :1.000   Min.   : 17.00   Length:1152       
##  1st Qu.:0.0000   1st Qu.:2.000   1st Qu.: 47.00   Class :character  
##  Median :0.0000   Median :2.000   Median : 60.00   Mode  :character  
##  Mean   :0.1033   Mean   :2.205   Mean   : 57.86                     
##  3rd Qu.:0.0000   3rd Qu.:3.000   3rd Qu.: 70.00                     
##  Max.   :1.0000   Max.   :4.000   Max.   :103.00                     
##       SBP              HR              Temp              RR        
##  Min.   : 65.0   Min.   : 12.00   Min.   : 93.90   Min.   : 10.00  
##  1st Qu.:115.0   1st Qu.: 86.00   1st Qu.: 98.20   1st Qu.: 18.00  
##  Median :129.5   Median : 99.00   Median : 99.00   Median : 20.00  
##  Mean   :132.0   Mean   : 98.76   Mean   : 99.35   Mean   : 22.06  
##  3rd Qu.:146.2   3rd Qu.:111.00   3rd Qu.:100.40   3rd Qu.: 24.00  
##  Max.   :251.0   Max.   :176.00   Max.   :103.80   Max.   :135.00  
##       Spo2            qSOFA             BMI              MI         
##  Min.   :  2.00   Min.   :0.0000   Min.   :13.68   Min.   :0.00000  
##  1st Qu.: 94.00   1st Qu.:0.0000   1st Qu.:25.84   1st Qu.:0.00000  
##  Median : 96.00   Median :0.0000   Median :31.00   Median :0.00000  
##  Mean   : 95.18   Mean   :0.5668   Mean   :32.25   Mean   :0.07899  
##  3rd Qu.: 98.00   3rd Qu.:1.0000   3rd Qu.:36.81   3rd Qu.:0.00000  
##  Max.   :100.00   Max.   :3.0000   Max.   :78.38   Max.   :1.00000  
##       CHF             Stroke              DM              CKD      
##  Min.   :0.0000   Min.   :0.00000   Min.   :0.0000   Min.   :0.00  
##  1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:0.00  
##  Median :0.0000   Median :0.00000   Median :0.0000   Median :0.00  
##  Mean   :0.1762   Mean   :0.08681   Mean   :0.4106   Mean   :0.25  
##  3rd Qu.:0.0000   3rd Qu.:0.00000   3rd Qu.:1.0000   3rd Qu.:0.25  
##  Max.   :1.0000   Max.   :1.00000   Max.   :1.0000   Max.   :1.00  
##      Cancer            Asthma            HTN           LowIncome     
##  Min.   :0.00000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:1.0000  
##  Median :0.00000   Median :0.0000   Median :1.0000   Median :1.0000  
##  Mean   :0.08333   Mean   :0.1094   Mean   :0.6155   Mean   :0.8819  
##  3rd Qu.:0.00000   3rd Qu.:0.0000   3rd Qu.:1.0000   3rd Qu.:1.0000  
##  Max.   :1.00000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##      obese       
##  Min.   :0.0000  
##  1st Qu.:0.0000  
##  Median :1.0000  
##  Mean   :0.5417  
##  3rd Qu.:1.0000  
##  Max.   :1.0000
# Interpretation: The median age for patients in our dataset is 60 years
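
Before dropping rows it can help to check how many will be lost. Here the data set shrinks from 1,175 to 1,152 rows, so 23 rows (about 2%) are removed. The following is a minimal sketch on a small made-up data frame; `toy_df` and its values are hypothetical and only illustrate the calls.

```r
# Hypothetical data frame with two incomplete rows
toy_df <- data.frame(
  ICU_Admit = c(0, 1, 0, 0, 1),
  ESI       = c(3, NA, 2, 2, 1),
  Temp      = c(98.6, 101.2, NA, 99.4, 100.1)
)

n_before <- nrow(toy_df)
n_after  <- sum(complete.cases(toy_df))  # rows that na.omit() would keep
n_before - n_after                       # number of rows that would be dropped

clean_df <- na.omit(toy_df)
nrow(clean_df) == n_after
```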

# Create dummy or indicator variables for the patient sex
hosp_df$Sex <- ifelse(hosp_df$Sex == 'Male', 1, 0) 
head(hosp_df) #Outputs a snapshot of our new columns. Compare the "Sex" column with the one shown in Step 2 to see the difference.
## # A tibble: 6 × 21
##   ICU_Admit   ESI   Age   Sex   SBP    HR  Temp    RR  Spo2 qSOFA   BMI    MI
##       <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1         0     3    27     0   125   115 102.     18    99     0  18.9     0
## 2         0     2    44     1    92    97 103.     20    97     1  29.6     0
## 3         0     2    39     1   109   100 103.     44    97     1  35.9     0
## 4         0     2    46     0   113    98  98.7    18    96     0  41.4     0
## 5         0     2    34     0   109   101  99.4    24    98     1  22.6     0
## 6         0     3    69     0   152   113  99.4    16    97     0  29.3     0
## # ℹ 9 more variables: CHF <dbl>, Stroke <dbl>, DM <dbl>, CKD <dbl>,
## #   Cancer <dbl>, Asthma <dbl>, HTN <dbl>, LowIncome <dbl>, obese <dbl>

Step 5: Feature selection - identifying the contributing variables using correlation

# Correlation analysis - shows the relationship between the target and independent variables
corr <- rcorr(as.matrix(hosp_df)) # We use the rcorr function for correlation analysis
corr #Outputs the correlation results
##           ICU_Admit   ESI   Age   Sex   SBP    HR  Temp    RR  Spo2 qSOFA   BMI
## ICU_Admit      1.00 -0.31  0.12  0.09 -0.02  0.06 -0.07  0.34 -0.32  0.36 -0.03
## ESI           -0.31  1.00 -0.25 -0.09 -0.04 -0.13  0.00 -0.31  0.23 -0.40  0.04
## Age            0.12 -0.25  1.00  0.09 -0.03 -0.19 -0.11  0.16 -0.17  0.33 -0.26
## Sex            0.09 -0.09  0.09  1.00 -0.04 -0.07 -0.03  0.04 -0.05  0.07 -0.16
## SBP           -0.02 -0.04 -0.03 -0.04  1.00  0.09  0.05 -0.01  0.04 -0.23  0.14
## HR             0.06 -0.13 -0.19 -0.07  0.09  1.00  0.22  0.13 -0.12  0.09  0.14
## Temp          -0.07  0.00 -0.11 -0.03  0.05  0.22  1.00  0.00 -0.03 -0.07  0.19
## RR             0.34 -0.31  0.16  0.04 -0.01  0.13  0.00  1.00 -0.31  0.55  0.01
## Spo2          -0.32  0.23 -0.17 -0.05  0.04 -0.12 -0.03 -0.31  1.00 -0.27 -0.04
## qSOFA          0.36 -0.40  0.33  0.07 -0.23  0.09 -0.07  0.55 -0.27  1.00 -0.14
## BMI           -0.03  0.04 -0.26 -0.16  0.14  0.14  0.19  0.01 -0.04 -0.14  1.00
## MI             0.11 -0.10  0.15  0.08 -0.01 -0.13 -0.07  0.01  0.02  0.09 -0.04
## CHF            0.03 -0.10  0.26  0.04 -0.01 -0.15  0.01  0.00 -0.05  0.12  0.02
## Stroke         0.11 -0.16  0.21  0.04  0.06 -0.04  0.00  0.07 -0.02  0.20 -0.14
## DM             0.15 -0.13  0.23  0.05  0.02 -0.02  0.01  0.05 -0.08  0.11  0.08
## CKD            0.10 -0.14  0.32  0.12  0.02 -0.12 -0.03  0.01 -0.08  0.10 -0.03
## Cancer         0.03 -0.08  0.17  0.08 -0.08 -0.03 -0.06 -0.01 -0.06  0.04 -0.11
## Asthma        -0.06  0.01 -0.09 -0.13 -0.02  0.10  0.04 -0.02 -0.01 -0.05  0.18
## HTN            0.10 -0.18  0.48  0.04  0.10 -0.02 -0.03  0.07 -0.12  0.17  0.03
## LowIncome      0.02 -0.08  0.04 -0.04  0.00 -0.02  0.00  0.01 -0.01  0.01  0.08
## obese          0.00  0.05 -0.20 -0.17  0.07  0.15  0.13  0.01 -0.04 -0.13  0.74
##              MI   CHF Stroke    DM   CKD Cancer Asthma   HTN LowIncome obese
## ICU_Admit  0.11  0.03   0.11  0.15  0.10   0.03  -0.06  0.10      0.02  0.00
## ESI       -0.10 -0.10  -0.16 -0.13 -0.14  -0.08   0.01 -0.18     -0.08  0.05
## Age        0.15  0.26   0.21  0.23  0.32   0.17  -0.09  0.48      0.04 -0.20
## Sex        0.08  0.04   0.04  0.05  0.12   0.08  -0.13  0.04     -0.04 -0.17
## SBP       -0.01 -0.01   0.06  0.02  0.02  -0.08  -0.02  0.10      0.00  0.07
## HR        -0.13 -0.15  -0.04 -0.02 -0.12  -0.03   0.10 -0.02     -0.02  0.15
## Temp      -0.07  0.01   0.00  0.01 -0.03  -0.06   0.04 -0.03      0.00  0.13
## RR         0.01  0.00   0.07  0.05  0.01  -0.01  -0.02  0.07      0.01  0.01
## Spo2       0.02 -0.05  -0.02 -0.08 -0.08  -0.06  -0.01 -0.12     -0.01 -0.04
## qSOFA      0.09  0.12   0.20  0.11  0.10   0.04  -0.05  0.17      0.01 -0.13
## BMI       -0.04  0.02  -0.14  0.08 -0.03  -0.11   0.18  0.03      0.08  0.74
## MI         1.00  0.26   0.22  0.12  0.22   0.10  -0.02  0.15      0.06  0.00
## CHF        0.26  1.00   0.16  0.21  0.35   0.12   0.04  0.28      0.06 -0.03
## Stroke     0.22  0.16   1.00  0.12  0.18   0.01  -0.01  0.20      0.06 -0.14
## DM         0.12  0.21   0.12  1.00  0.30   0.10   0.00  0.36      0.11  0.05
## CKD        0.22  0.35   0.18  0.30  1.00   0.15   0.00  0.37      0.05 -0.04
## Cancer     0.10  0.12   0.01  0.10  0.15   1.00   0.01  0.12     -0.02 -0.10
## Asthma    -0.02  0.04  -0.01  0.00  0.00   0.01   1.00  0.05      0.05  0.14
## HTN        0.15  0.28   0.20  0.36  0.37   0.12   0.05  1.00      0.09  0.03
## LowIncome  0.06  0.06   0.06  0.11  0.05  -0.02   0.05  0.09      1.00  0.07
## obese      0.00 -0.03  -0.14  0.05 -0.04  -0.10   0.14  0.03      0.07  1.00
## 
## n= 1152 
## 
## 
## P
##           ICU_Admit ESI    Age    Sex    SBP    HR     Temp   RR     Spo2  
## ICU_Admit           0.0000 0.0000 0.0020 0.4564 0.0502 0.0161 0.0000 0.0000
## ESI       0.0000           0.0000 0.0016 0.1995 0.0000 0.9360 0.0000 0.0000
## Age       0.0000    0.0000        0.0024 0.2500 0.0000 0.0002 0.0000 0.0000
## Sex       0.0020    0.0016 0.0024        0.1693 0.0138 0.2836 0.1300 0.0717
## SBP       0.4564    0.1995 0.2500 0.1693        0.0038 0.1035 0.7642 0.1526
## HR        0.0502    0.0000 0.0000 0.0138 0.0038        0.0000 0.0000 0.0000
## Temp      0.0161    0.9360 0.0002 0.2836 0.1035 0.0000        0.9415 0.2548
## RR        0.0000    0.0000 0.0000 0.1300 0.7642 0.0000 0.9415        0.0000
## Spo2      0.0000    0.0000 0.0000 0.0717 0.1526 0.0000 0.2548 0.0000       
## qSOFA     0.0000    0.0000 0.0000 0.0109 0.0000 0.0020 0.0270 0.0000 0.0000
## BMI       0.3904    0.1818 0.0000 0.0000 0.0000 0.0000 0.0000 0.7369 0.2343
## MI        0.0001    0.0012 0.0000 0.0058 0.6330 0.0000 0.0126 0.7453 0.4941
## CHF       0.3062    0.0012 0.0000 0.2181 0.7854 0.0000 0.7370 0.8740 0.1146
## Stroke    0.0002    0.0000 0.0000 0.1311 0.0310 0.1300 0.9121 0.0175 0.4142
## DM        0.0000    0.0000 0.0000 0.0678 0.5954 0.4831 0.6349 0.0697 0.0052
## CKD       0.0006    0.0000 0.0000 0.0000 0.5186 0.0000 0.3616 0.6653 0.0040
## Cancer    0.2806    0.0054 0.0000 0.0050 0.0100 0.2609 0.0277 0.7825 0.0411
## Asthma    0.0296    0.8420 0.0032 0.0000 0.4546 0.0009 0.1422 0.4299 0.7748
## HTN       0.0004    0.0000 0.0000 0.1423 0.0009 0.4165 0.2991 0.0223 0.0000
## LowIncome 0.5392    0.0087 0.1657 0.1624 0.8919 0.4223 0.9653 0.7010 0.6577
## obese     0.9291    0.0890 0.0000 0.0000 0.0195 0.0000 0.0000 0.6878 0.1858
##           qSOFA  BMI    MI     CHF    Stroke DM     CKD    Cancer Asthma HTN   
## ICU_Admit 0.0000 0.3904 0.0001 0.3062 0.0002 0.0000 0.0006 0.2806 0.0296 0.0004
## ESI       0.0000 0.1818 0.0012 0.0012 0.0000 0.0000 0.0000 0.0054 0.8420 0.0000
## Age       0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0032 0.0000
## Sex       0.0109 0.0000 0.0058 0.2181 0.1311 0.0678 0.0000 0.0050 0.0000 0.1423
## SBP       0.0000 0.0000 0.6330 0.7854 0.0310 0.5954 0.5186 0.0100 0.4546 0.0009
## HR        0.0020 0.0000 0.0000 0.0000 0.1300 0.4831 0.0000 0.2609 0.0009 0.4165
## Temp      0.0270 0.0000 0.0126 0.7370 0.9121 0.6349 0.3616 0.0277 0.1422 0.2991
## RR        0.0000 0.7369 0.7453 0.8740 0.0175 0.0697 0.6653 0.7825 0.4299 0.0223
## Spo2      0.0000 0.2343 0.4941 0.1146 0.4142 0.0052 0.0040 0.0411 0.7748 0.0000
## qSOFA            0.0000 0.0015 0.0000 0.0000 0.0002 0.0011 0.1477 0.0725 0.0000
## BMI       0.0000        0.1375 0.5799 0.0000 0.0076 0.3161 0.0001 0.0000 0.2951
## MI        0.0015 0.1375        0.0000 0.0000 0.0000 0.0000 0.0009 0.4947 0.0000
## CHF       0.0000 0.5799 0.0000        0.0000 0.0000 0.0000 0.0000 0.2350 0.0000
## Stroke    0.0000 0.0000 0.0000 0.0000        0.0000 0.0000 0.8009 0.7535 0.0000
## DM        0.0002 0.0076 0.0000 0.0000 0.0000        0.0000 0.0007 0.8881 0.0000
## CKD       0.0011 0.3161 0.0000 0.0000 0.0000 0.0000        0.0000 0.9133 0.0000
## Cancer    0.1477 0.0001 0.0009 0.0000 0.8009 0.0007 0.0000        0.8645 0.0000
## Asthma    0.0725 0.0000 0.4947 0.2350 0.7535 0.8881 0.9133 0.8645        0.1011
## HTN       0.0000 0.2951 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.1011       
## LowIncome 0.6894 0.0061 0.0519 0.0563 0.0598 0.0002 0.0918 0.5823 0.0858 0.0032
## obese     0.0000 0.0000 0.9491 0.3555 0.0000 0.0755 0.1333 0.0006 0.0000 0.2766
##           LowIncome obese 
## ICU_Admit 0.5392    0.9291
## ESI       0.0087    0.0890
## Age       0.1657    0.0000
## Sex       0.1624    0.0000
## SBP       0.8919    0.0195
## HR        0.4223    0.0000
## Temp      0.9653    0.0000
## RR        0.7010    0.6878
## Spo2      0.6577    0.1858
## qSOFA     0.6894    0.0000
## BMI       0.0061    0.0000
## MI        0.0519    0.9491
## CHF       0.0563    0.3555
## Stroke    0.0598    0.0000
## DM        0.0002    0.0755
## CKD       0.0918    0.1333
## Cancer    0.5823    0.0006
## Asthma    0.0858    0.0000
## HTN       0.0032    0.2766
## LowIncome           0.0203
## obese     0.0203
# Drop all the columns whose correlation with ICU_Admit has a p-value > 0.05 and store the data in a new data frame
hosp_df2 <- subset(hosp_df, select = -c(SBP, HR, BMI, CHF, Cancer, LowIncome, obese))
hosp_df2
## # A tibble: 1,152 × 14
##    ICU_Admit   ESI   Age   Sex  Temp    RR  Spo2 qSOFA    MI Stroke    DM   CKD
##        <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl>
##  1         0     3    27     0 102.     18    99     0     0      0     0     0
##  2         0     2    44     1 103.     20    97     1     0      0     0     0
##  3         0     2    39     1 103.     44    97     1     0      0     0     0
##  4         0     2    46     0  98.7    18    96     0     0      0     0     0
##  5         0     2    34     0  99.4    24    98     1     0      0     0     0
##  6         0     3    69     0  99.4    16    97     0     0      0     0     0
##  7         0     2    51     0 101.     18   100     0     0      0     0     0
##  8         0     2    59     1  98.5    16    95     1     0      0     0     0
##  9         0     3    34     1  98.8    18   100     0     0      0     0     0
## 10         0     2    78     1 103.     24    96     1     1      0     1     1
## # ℹ 1,142 more rows
## # ℹ 2 more variables: Asthma <dbl>, HTN <dbl>
# Interpretation: The data set is reduced from 21 columns to 14 columns (the target plus 13 candidate predictors)
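
The columns to drop were read off the P matrix by hand. The same filter can be sketched programmatically. With the `rcorr` result, the relevant p-values sit in `corr$P[, "ICU_Admit"]`. To stay self-contained, the example below uses base R's `cor.test` on simulated data instead; the data frame `sim_df` and its columns are made up for illustration.

```r
set.seed(1)
n <- 200
x_signal <- rnorm(n)

# Simulated data: one predictor related to the target, one pure noise
sim_df <- data.frame(
  target   = rbinom(n, 1, plogis(1.5 * x_signal)),
  x_signal = x_signal,
  x_noise  = rnorm(n)
)

# p-value of the correlation between each predictor and the target
predictors <- setdiff(names(sim_df), "target")
p_values <- sapply(predictors, function(v) {
  cor.test(sim_df[[v]], sim_df$target)$p.value
})

# Keep only predictors whose correlation with the target has p < 0.05
keep <- names(p_values)[p_values < 0.05]
reduced_df <- sim_df[, c("target", keep), drop = FALSE]
names(reduced_df)
```

Filtering on pairwise correlations is a screening step only; a variable that looks unrelated on its own can still matter once other predictors are in the model.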

Step 6: Logistic Regression Model building

# Splitting the data into training and testing sets
set.seed(3456)
trainIndex <- createDataPartition(hosp_df2$ICU_Admit, p = .70, list = FALSE, times = 1)
Train <- hosp_df2[ trainIndex,] #We use 70% of the data to train the model
Test  <- hosp_df2[-trainIndex,] #The remaining 30% is used to test the model's performance
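
With a rare outcome it is worth confirming that the training and test sets contain a similar share of admissions. The idea of a stratified 70/30 split can be sketched in base R. This is an illustration of the principle, not a description of what `createDataPartition` does internally, and the outcome vector `y` is made up with roughly the same 10% event rate.

```r
set.seed(3456)
y <- rep(c(0, 1), times = c(900, 100))  # hypothetical outcome, 10% positives

# Sample 70% of the row indices separately within each class
train_idx <- unlist(lapply(split(seq_along(y), y), function(idx) {
  idx[sample.int(length(idx), size = round(0.70 * length(idx)))]
}))

train_y <- y[train_idx]
test_y  <- y[-train_idx]

mean(train_y)  # event rate in the training set
mean(test_y)   # event rate in the test set
```

On the real split, `mean(Train$ICU_Admit)` and `mean(Test$ICU_Admit)` give the same check.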

# First model - all the significant variables after correlation analysis.
model <- glm(ICU_Admit ~ ESI + Age + Sex + Temp + RR + Spo2 + qSOFA + MI + Stroke + DM + CKD + Asthma + HTN, data = Train, family = binomial)
summary(model)
## 
## Call:
## glm(formula = ICU_Admit ~ ESI + Age + Sex + Temp + RR + Spo2 + 
##     qSOFA + MI + Stroke + DM + CKD + Asthma + HTN, family = binomial, 
##     data = Train)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) 23.77633    9.27732   2.563 0.010382 *  
## ESI         -0.87551    0.29013  -3.018 0.002547 ** 
## Age         -0.01519    0.01049  -1.448 0.147561    
## Sex          0.11622    0.28752   0.404 0.686063    
## Temp        -0.18126    0.08610  -2.105 0.035260 *  
## RR           0.04752    0.01494   3.182 0.001463 ** 
## Spo2        -0.08767    0.02351  -3.729 0.000192 ***
## qSOFA        0.68665    0.22189   3.095 0.001971 ** 
## MI           0.39902    0.43297   0.922 0.356753    
## Stroke       0.42227    0.40106   1.053 0.292393    
## DM           0.73881    0.30939   2.388 0.016943 *  
## CKD          0.18152    0.31826   0.570 0.568435    
## Asthma      -0.68316    0.57291  -1.192 0.233095    
## HTN          0.73034    0.41132   1.776 0.075802 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 508.23  on 806  degrees of freedom
## Residual deviance: 362.39  on 793  degrees of freedom
## AIC: 390.39
## 
## Number of Fisher Scoring iterations: 6
# First model results: Age, Sex, Myocardial infarction (MI), Stroke, Chronic kidney disease (CKD), Asthma, and Hypertension (HTN) are insignificant variables (p > 0.05).

# Second model - we drop all the insignificant variables and rebuild the model
model2 <- glm(ICU_Admit ~ ESI + Temp + RR + Spo2 + qSOFA + DM, data = Train, family = binomial)
summary(model2)
## 
## Call:
## glm(formula = ICU_Admit ~ ESI + Temp + RR + Spo2 + qSOFA + DM, 
##     family = binomial, data = Train)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) 22.49099    8.74746   2.571 0.010136 *  
## ESI         -0.96206    0.28494  -3.376 0.000735 ***
## Temp        -0.17718    0.08359  -2.120 0.034030 *  
## RR           0.04398    0.01469   2.994 0.002755 ** 
## Spo2        -0.07904    0.02195  -3.601 0.000317 ***
## qSOFA        0.68941    0.20579   3.350 0.000808 ***
## DM           0.95479    0.28853   3.309 0.000936 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 508.23  on 806  degrees of freedom
## Residual deviance: 372.62  on 800  degrees of freedom
## AIC: 386.62
## 
## Number of Fisher Scoring iterations: 6
# Interpretation: All the variables are significant (p < 0.05)

Result Interpretation

1. Emergency severity index (ESI): each one-level increase in ESI (i.e., a less urgent triage level) decreases the log odds of ICU admission by 0.96. ESI enters the model as a numeric variable, so the coefficient applies to every one-level step, not only to level 2 versus level 1.
2. Temperature (Temp): for every one-degree increase in patient temperature, the log odds of ICU admission decreases by 0.18.
3. Respiratory rate (RR): for every one-unit increase in patient RR, the log odds of ICU admission (versus non-admission) increases by 0.04.
4. Percent oxygen (Spo2): for a one-unit increase in Spo2, the log odds of being admitted to the ICU decreases by 0.08.
5. Quick sepsis related organ failure assessment (qSOFA): each one-point increase in the qSOFA score (which ranges from 0 to 3 in this data) increases the log odds of ICU admission by 0.69.
6. Diabetes Mellitus (DM): a patient diagnosed with diabetes (1), versus no diabetes (0), has log odds of ICU admission that are higher by 0.95.
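
Log odds are hard to read directly; exponentiating a coefficient turns it into an odds ratio. The following is a small sketch using the estimates printed in the `model2` summary. The vector `coefs` is typed in by hand from that output; with the fitted object available, `exp(coef(model2))` would give the same quantities.

```r
# Coefficient estimates copied from summary(model2)
coefs <- c(ESI = -0.96206, Temp = -0.17718, RR = 0.04398,
           Spo2 = -0.07904, qSOFA = 0.68941, DM = 0.95479)

odds_ratios <- exp(coefs)  # multiplicative change in the odds per one-unit increase
round(odds_ratios, 2)
```

An odds ratio above 1 means higher odds of ICU admission per one-unit increase in the predictor, and a value below 1 means lower odds. For DM the ratio is roughly 2.6, holding the other predictors fixed.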

Step 7: Evaluating the model’s performance

# McFadden's R-square (i.e., Pseudo R-square)
pR2(model2) 
## fitting null model for pseudo-r2
##          llh      llhNull           G2     McFadden         r2ML         r2CU 
## -186.3078199 -254.1166729  135.6177060    0.2668414    0.1546899    0.3310350
Interpretation: A McFadden's R-squared of 0.27, computed on the training data, indicates that the model fits substantially better than the intercept-only (null) model. In logistic regression, values of 0.2 to 0.4 for McFadden's R-squared are often considered indicative of an excellent fit.
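
McFadden's R-squared is one minus the ratio of the fitted model's log-likelihood to that of the null model. The sketch below recomputes it from the `llh` and `llhNull` values printed by `pR2`, as a check on the arithmetic. The variable names are chosen here for illustration.

```r
llh      <- -186.3078199  # log-likelihood of model2 (from the pR2 output)
llh_null <- -254.1166729  # log-likelihood of the intercept-only model

mcfadden <- 1 - llh / llh_null      # McFadden's pseudo R-squared
g2       <- -2 * (llh_null - llh)   # likelihood-ratio statistic (G2)

round(mcfadden, 4)
round(g2, 2)
```
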
# Predicting on the test data
fitted.results <- predict(model2, Test, type='response')

# We need to show the probability of patients getting admitted or not
probability <- data.frame(prob = fitted.results, actual = Test$ICU_Admit) # predicted probability next to the actual outcome

# Creating a threshold for our probability values. Default threshold is 0.5
fitted.results <- ifelse(fitted.results > 0.5,1,0)

#### Accuracy as a performance measure ####
misClasificError <- mean(fitted.results != Test$ICU_Admit) #misclassification error
print(paste('Accuracy',1-misClasificError)) 
## [1] "Accuracy 0.898550724637681"
Interpretation: The accuracy of 0.898 means the logistic regression model classifies 89.8% of the test patients correctly at the 0.5 threshold. Because ICU admissions are rare in this data, accuracy should be read alongside the class counts and confusion matrix below: a rule that never predicts admission would already be correct for 303 of the 345 test patients (87.8%).
# Count the number of admitted vs not admitted patients in the test data
counts1 <- table(Test$ICU_Admit)
counts1
## 
##   0   1 
## 303  42
# Confusion matrix
table(Test$ICU_Admit, fitted.results)
##    fitted.results
##       0   1
##   0 299   4
##   1  31  11
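
Because only 42 of the 345 test patients were admitted, accuracy alone hides how the model treats each class. The sketch below derives per-class metrics from the confusion matrix printed above. The matrix is typed in by hand and the object names are chosen for illustration.

```r
# Rows are actual classes, columns are predicted classes
cm <- matrix(c(299, 31, 4, 11), nrow = 2,
             dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))

tn <- cm["0", "0"]; fp <- cm["0", "1"]
fn <- cm["1", "0"]; tp <- cm["1", "1"]

accuracy    <- (tp + tn) / sum(cm)
sensitivity <- tp / (tp + fn)       # share of admitted patients the model catches
specificity <- tn / (tn + fp)       # share of non-admitted patients correctly cleared
precision   <- tp / (tp + fp)       # share of predicted admissions that were real
baseline    <- (tn + fp) / sum(cm)  # accuracy of always predicting "not admitted"

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, precision = precision,
        baseline = baseline), 3)
```

For a triage setting, sensitivity is usually the metric to watch, and lowering the 0.5 threshold trades more false positives for fewer missed admissions.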

Project Conclusion

1. The significant variables that impact patient admission to the ICU include the following: patient emergency severity index (ESI), patient's temperature, respiratory rate (RR), percent oxygen (Spo2), quick sepsis related organ failure assessment (qSOFA), and diabetes mellitus (DM).

2. The developed model achieved a McFadden R-square of 0.27 (computed on the training data) and an accuracy of 89.8% on the test data.

3. The confusion matrix indicates that the logistic regression model is good at identifying patients who do not need ICU care (299 of 303 correctly classified) but struggles to identify those who do: it flags only 11 of the 42 admitted patients and misses the other 31 (False Negatives = 31). The low number of False Positives (4) means that when the model predicts ICU admission it is usually right (11 of 15 predictions), but at the 0.5 threshold the model is too conservative, and there is considerable room for improvement in catching the true ICU admission cases.