library(ggthemes)
col34="dodgerblue4"
col2="blue"
col1="slateblue"
col="#E64141" #Covid Red #E64141

# caption used for certain charts
the_caption <- "Covid-19"
theme_2 <-
    theme_gray() +
    theme(
      axis.title.y = element_blank(),
      legend.title = element_blank(),
      legend.position = "top")


theme_1 <- function (base_size = 11, base_family = "") {
  ggplot2::theme_set(theme_classic(base_size = 16))
  theme_update(
    axis.text.x = element_text(size = 16)
  , axis.text.y = element_text(size = 16)
  , axis.title.x = element_text(size = 16)
  , axis.title.y = element_text(size = 16)
  , legend.title = element_text(size = 16)
  , legend.text  = element_text(size = 14)
  , panel.grid.major = element_blank()
  , panel.grid.minor = element_blank()
  , strip.background = element_blank()
  , panel.spacing = unit(0, "lines")      # panel.margin is deprecated
  , legend.key.size = unit(.55, "cm")
  , legend.key = element_rect(fill = "white")
  , panel.spacing.y = unit(0.5, "lines")  # panel.margin.y is deprecated
  , panel.border = element_rect(
      colour = "black", fill = NA, size = 1)
  , strip.text.x = element_text(
      size = 16, colour = "blue", face = "bold")
  )
}

#set global ggtheme
gbl_ggtheme <- theme_1()
theme_set(gbl_ggtheme)


#https://www.rapidtables.com/web/color/RGB_Color.html

Overview

The sudden increase in COVID-19 cases has placed high pressure on health-care services worldwide. Fast, accurate, and early clinical assessments are vital to reducing disease severity. To support decision making and logistical planning in health-care systems, this study leverages the data of a prior study.

In the prior study, blood samples from 485 infected patients in the region of Wuhan, China were used to study the biomarkers of Covid-19 patients. The new study here includes 375 of the aforementioned 485 patients. Machine learning tools will be used to identify biomarkers that can predict the mortality of the 375 sampled patients. The outcomes of all the patients are known and indicated in the data.

The analysis to follow will attempt to improve upon the accuracy of the fatality predictions in the original study by using enhanced machine learning and deep learning tools available in R. As part of the analysis, the R machine learning models will first be explored comprehensively to find the most parsimonious model; then a deep learning approach will be deployed to determine whether the accuracy results can be improved.

Motivation

Patients can potentially survive Covid-19 if the biomarkers are measured early and analyzed throughout the patients’ stay. Machine learning models may be the best way to detect signals within key biomarkers and thereby act as an early warning mechanism. Patients can have their care augmented with specific treatments informed by changes in the biomarkers that predict their mortality. This analysis seeks to find which R modeling solution yields the highest accuracy, in an attempt to demonstrate a way to help save more lives when faced with epidemics such as Covid-19.

#functions

printVecInfo <- function(inputVector, param = "all") {
  v <- inputVector
  print(paste("mean: ", mean(v, na.rm = TRUE)))
  print(paste("median: ", median(v, na.rm = TRUE)))
  print(paste("min: ", min(v, na.rm = TRUE)))
  print(paste("max: ", max(v, na.rm = TRUE)))
  print(paste("range: ", max(v, na.rm = TRUE) - min(v, na.rm = TRUE)))
  print(paste("sd: ", sd(v, na.rm = TRUE)))
  print(paste("quantile: ", quantile(v, na.rm = TRUE)))
  # Interquartile range
  print(paste("IQR: ", quantile(v, na.rm = TRUE)[['75%']] - quantile(v, na.rm = TRUE)[['25%']]))
}
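
As a quick illustration (the vector below is a toy, not study data), every statistic drops the NA because the helper uses na.rm = TRUE throughout:

printVecInfo(c(2, 4, 6, 8, NA)) # mean 5, median 5, range 6, IQR 3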
# gets the percentages for each column that has NA's
NASummaryInfoPct <- function(indf){
  a<- gather(indf %>% summarise_all(~(sum(is.na(.))/n())), key = "columns", value = "percent_missing")

 return(a)
}

# gets the counts for each column that has NA's
NASummaryInfoCnt <- function(indf){
  b<- gather(indf %>% summarise_all(~(sum(is.na(.)))), key = "columns", value = "count_missing")

 return(b)
}
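
Both helpers return a two-column data frame, which makes the results easy to sort or plot. A quick hedged illustration on a throwaway data frame (not the study data):

require(dplyr)
require(tidyr)
toy <- data.frame(a = c(1, NA, 3), b = c(NA, NA, 6))
NASummaryInfoPct(toy) # a = 0.33 missing, b = 0.67 missing
NASummaryInfoCnt(toy) # a = 1 missing,  b = 2 missing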


#Return multiple data frames in one object using the S4 class system.
setClass(
    "mdl",
    representation(
        trn = "vector",
        tst = "vector",
        hld = "vector"
    )
)

require(caret)
modelDataC <- function (df1 = df, trnpct = .8) {
  # take a 10% holdout first
  z <- df1
  data <- createDataPartition(y = z$label, p = .90, list = FALSE)
  df1 <- z[data,]   # keep 90% of the original rows
  tst2 <- z[-data,] # the 10% holdout

  # split the remaining 90% into train and test
  z <- df1
  data <- createDataPartition(y = z$label, p = trnpct, list = FALSE)
  trn <- z[data,]
  tst <- z[-data,] # the balance left over after the training split

  mdl <- new("mdl",
             trn = trn,
             tst = tst,
             hld = tst2)
  return(mdl)
}
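
A hedged usage sketch of modelDataC: the three partitions come back as S4 slots, read with the @ operator. The toy frame below is illustrative only; the real call elsewhere in the analysis would use the cleaned Covid data.

set.seed(42) # illustrative two-class frame, not the study data
toy <- data.frame(x = rnorm(200), label = factor(rep(c("died", "survived"), 100)))
splits <- modelDataC(toy, trnpct = .8)
c(train = nrow(splits@trn), test = nrow(splits@tst), holdout = nrow(splits@hld))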


modelData2 <- function (df1 = df_ww) {
  z <- df1
  # number of rows in the data frame
  len <- as.numeric(dim(z)[1])
  randIndex <- sample(as.numeric(rownames(z)[1:len]))
  # cut points at 80% and 20% of the input data frame
  cutPoint80 <- floor(.8 * len)
  cutPoint20 <- floor(.2 * len)
  trn2 <- z[randIndex[1:cutPoint80],]
  tst2 <- z[randIndex[cutPoint80 + 1:(len - cutPoint80)],]
  # note: the holdout is drawn from the same leftover 20% as the test set
  hld2 <- z[-randIndex[1:cutPoint80],]

  mdl <- new("mdl",
             trn = trn2,
             tst = tst2,
             hld = hld2)
  return(mdl)
}

About Model Data

The data tracks several patients over their hospital stay and keeps records on each patient’s 74 biomarkers. The combination of admission time, discharge time, and outcome uniquely identifies a patient.

  • The original data source had already been anonymized to comply with HIPAA, PCI and PII data sensitivity laws.

  • The dependent variable is outcome, coded 0 for death and 1 for survived

  • Data set dimensions are 6120 rows by 81 columns

  • There are 74 biomarker features; some of the most important ones are listed below.

Biomarker definitions:

The most important biomarkers for the analysis are listed below with brief definitions. The biomarkers fall into three primary categories: cholesterol, inflammation, and clotting.

  • percent_lymphocyte: Lymphocytes are a type of white blood cell, consisting of both B cells and T cells. B cells are responsible for the creation of antibodies, while T cells are responsible for controlling immune response to foreign substances. This marker indicates the percent of lymphocytes present relative to a patient’s white blood cell count.

  • lactate_dehydrogenase: LDH is an enzyme released when there is tissue damage present in a patient’s body tissues. This marker measures the amount of LDH in blood or other body fluids and is vital in monitoring viral conditions or cancer.

  • lymphocyte_count: Lymphocytes are a type of white blood cell, consisting of both B cells and T cells. B cells are responsible for the creation of antibodies, while T cells are responsible for controlling immune response to foreign substances. This marker indicates the count of lymphocytes present in a patient’s blood count.

  • albumin: Albumin is a protein created by your liver. This protein is primarily responsible for maintaining and keeping fluid levels in your bloodstream so that it does not leak into other tissues. This protein also helps carry hormones, vitamins, and enzymes throughout the body. Higher levels can indicate a patient fighting infections. Low levels can indicate problems with liver or inflammatory diseases.

  • neutrophils_percent: Neutrophils are another type of white blood cell that makes up 55 to 70 percent of the white blood cell count. They consist of the majority of cells that make up the immune system response. Neutrophils help fight antigens and are not limited to a specific area of circulation. This marker shows the percent of neutrophils present relative to a patient’s white blood cell count.

  • eosinophils_percent: Eosinophils are another disease-fighting white blood cell. The cell is primarily responsible for destroying antigens, particularly viruses. They also have a role in the inflammatory response. This marker shows the percent of eosinophils present relative to a patient’s white blood cell count.

  • eosinophil_count: Eosinophils are another disease-fighting white blood cell. The cell is primarily responsible for destroying antigens, particularly viruses. They also have a role in the inflammatory response. This marker shows the count of eosinophils present in a patient’s blood count.

  • mean_corpuscular_hemoglobin_concentration: MCHC is a measure of the concentration of hemoglobin present in a given volume of packed red blood cell.

  • x2019_n_co_v_nucleic_acid_detection: This marker is a specific test result from a detection test responsible for identifying the presence of nucleic acid from SARS-CoV-2 in humans. This test shows the presence or absence of viral infection by SARS-CoV-2.

  • basophil_percent: Basophils are a type of white blood cell. They contain heparin, which naturally prevents blood clotting, and release histamine during allergic reactions. They also help the body produce an antibody called immunoglobulin E, which helps mediate inflammatory responses in the body. This marker shows the percent of basophils present relative to a patient’s white blood cell count.

  • total_cholesterol: Cholesterol is a waxy substance found in blood that assists in building healthy cells. High levels of cholesterol can coat arteries and restrict blood flow. Total cholesterol is measured as a count in a patient’s blood panel.

  • procalcitonin: Procalcitonin is a protein produced by many cells in the body in response to both bacterial infections and tissue injury. The count of procalcitonin can increase based on systemic infections and sepsis.

  • thrombin_time: Thrombin time, or thrombin clotting time, is a blood test that measures the time it takes for a clot to form in the plasma of a blood sample. This helps doctors investigate excessive bleeding or abnormal blood clot formation.

  • red_blood_cell_count: Red blood cell count is a test conducted to find out how many red blood cells are present in your blood. It is also known as an erythrocyte count. RBC’s contain hemoglobin, which is responsible for carrying oxygen to body tissues. This count is vital in understanding how tissues are functioning and healing in response to viral infections.

  • indirect_bilirubin: Bilirubin tests measure the amount of bilirubin in blood, an orange-yellow pigment that forms when red blood cells break down. Indirect bilirubin is specifically bound to albumin. High levels of bilirubin indicate that the body is destroying too many red blood cells, which can also indicate liver or tissue damage.

  • h_bs_ag: HBsAg is the antigen of the hepatitis B virus. Presence of this indicates current HBV infection.

  • hco3: Bicarbonate is a byproduct of the body’s metabolic function. Blood contains bicarbonate which is brought to the lungs and exhaled as carbon dioxide. Kidneys excrete and reabsorb bicarbonate and maintain appropriate acid balance in the body.

  • platelet_count: Platelets are small fragments of cells present for creating normal and protective blood clots. The platelet count test determines the number of platelets present in a sample of blood.

  • serum_sodium: Sodium tests are conducted to see how much sodium is present in the blood. Sodium is an essential mineral in the body that is important for healthy nerve and muscle function.

  • hemoglobin: Hemoglobin is a red blood cell protein responsible for transporting oxygen in the body’s blood. A low count of hemoglobin can indicate deficiencies and anemia, which can hinder healing and recovery in the body from virus or infection.

  • monocytes_percent: Monocytes are another type of white blood cell. They assist in fighting bacteria, viruses, and other infections. They are a key component of white blood cell immune response. High levels indicate the body fighting infection.

  • international_standard_ratio: The International Normalized Ratio (INR) is a preferred test for assessing a patient’s risk of bleeding and coagulation status. This ratio shows how well blood clots and whether any abnormalities are present.

  • amino_terminal_brain_natriuretic_peptide_precursor_nt_pro_bnp: Natriuretic peptide tests measure the brain natriuretic peptide (BNP) and N-terminal pro B-type natriuretic peptide (NT-proBNP) present. These are substances created by the heart, and small levels are normally found in the bloodstream. High levels of BNP and NT-proBNP indicate that the heart is failing to pump the appropriate amount of blood needed.

  • thrombocytocrit: Thrombocytocrit measures the platelet content of the blood as a volume fraction. Platelets are small fragments of cells present for creating normal and protective blood clots.

  • hypersensitive_cardiac_troponin_i: This is the preferred biomarker for acute myocardial infarction. This helps diagnose acute coronary syndromes in those who are critically ill. 

  • fibrin_degradation_products: Fibrin degradation products are components of the blood produced by clot degeneration. This substance remains in your bloodstream after your body dissolves a blood clot, which in return regulates and manages clot dissolving. This specific test determines the amount of FDPs present in blood to ensure that the body is healing correctly, particularly in response to clots and blood illnesses.

  • prothrombin_activity: Prothrombin time is a test used to help detect and diagnose a bleeding or excessive clotting disorder. This helps doctors evaluate a patient’s ability to appropriately form blood clots.

  • d_d_dimer: A D-dimer test is used to help rule out the presence of serious life-threatening blood clots. High levels of D-dimer, the protein leftover from clotting, can indicate deep vein thrombosis and other clotting disorders.

  • aspartate_aminotransferase: Aspartate aminotransferase is an enzyme created by the liver but found in many cells throughout the body. High levels of AST are found in the body when liver damage is present, as the liver releases AST in response to distress of the liver.

  • neutrophils_count: Neutrophils are another type of white blood cell that makes up 55 to 70 percent of the white blood cell count. They make up the majority of cells in the immune system response. Neutrophils help fight antigens and are not limited to a specific area of circulation. This marker shows the count of neutrophils present in a patient’s blood cell count.

  • direct_bilirubin: Bilirubin tests measure the amount of bilirubin in blood, an orange-yellow pigment that forms when red blood cells break down. Direct bilirubin specifically travels from the liver to the small intestines. High levels of bilirubin indicate that the body is destroying too many red blood cells, which can also indicate liver or tissue damage.

  • thrombin_time: Thrombin time or thrombin clotting time is a blood test that measures the time it takes for a clot to form in the plasma of a blood sample. This assists in helping doctors investigate excessive bleeding or abnormal blood clot formation.

  • high_sensitivity_c_reactive_protein: High sensitivity C-reactive protein is a biomarker used in determining coronary heart disease. The protein itself is produced when there is damage or inflammation in blood vessels, particularly in the heart.

Data Loading/Cleaning

The original data dimensions and format can be viewed below. There are a lot of NAs within the patient columns, which was found to be due to the patient id not being repeated for each day of the patient’s stay. The biomarker features also have plenty of NAs, since not all markers were captured for every patient during every time period. Additionally, the biomarker feature names can be quite long, and they are largely non-conforming R column names.

#Use janitor to clean the column names
require(janitor)
#https://medium.com/@verajosemanuel/janitor-a-good-r-package-for-data-cleaning-f3c733632ad9


raw <- read_csv('./Data/covid19_bio_mdl.csv')
## Parsed with column specification:
## cols(
##   .default = col_double(),
##   RE_DATE = col_datetime(format = ""),
##   `Admission time` = col_datetime(format = ""),
##   `Discharge time` = col_datetime(format = "")
## )
## See spec(...) for full column specifications.
dim(raw)
## [1] 6120   81
#glimpse(raw)
head(raw[,1:7]) %>% kable() %>% kable_styling()
PATIENT_ID  RE_DATE              age  gender  Admission time       Discharge time       outcome
1           2020-01-31 01:09:00  73   1       2020-01-30 22:12:47  2020-02-17 12:40:09  0
NA          2020-01-31 01:25:00  73   1       2020-01-30 22:12:47  2020-02-17 12:40:09  0
NA          2020-01-31 01:44:00  73   1       2020-01-30 22:12:47  2020-02-17 12:40:09  0
NA          2020-01-31 01:45:00  73   1       2020-01-30 22:12:47  2020-02-17 12:40:09  0
NA          2020-01-31 01:56:00  73   1       2020-01-30 22:12:47  2020-02-17 12:40:09  0
NA          2020-01-31 01:59:00  73   1       2020-01-30 22:12:47  2020-02-17 12:40:09  0
head(raw[,8:15]) %>% kable() %>% kable_styling()
Hypersensitive cardiac troponinI  hemoglobin  Serum chloride  Prothrombin time  procalcitonin  eosinophils(%)  Interleukin 2 receptor  Alkaline phosphatase
NA                                NA          NA              NA                NA             NA              NA                      NA
NA                                136         NA              NA                NA             0.6             NA                      NA
NA                                NA          103.1           NA                NA             NA              NA                      46
NA                                NA          NA              13.9              NA             NA              NA                      NA
19.9                              NA          NA              NA                0.09           NA              NA                      NA
NA                                NA          NA              NA                NA             NA              NA                      NA
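
Note the sparse PATIENT_ID column above: the id is only recorded on the first row of each stay. One hedged way to restore it is a downward fill with tidyr (a sketch only; the analysis below instead repairs the id with a join on admission/discharge time):

require(dplyr)
require(tidyr)
raw_filled <- raw %>% fill(PATIENT_ID, .direction = "down")
sum(is.na(raw_filled$PATIENT_ID)) # 0 once every stay row inherits its id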
#The names are dirty, so we're using janitor to clean them. The original names are stored for reference. 

raw_names <- names(raw)
#raw_names # we need these original columns for reference

df_raw <- raw
df <- df_raw %>% clean_names() 
df_clean_names <- names(df)
#df_clean_names
#run make.names for good measure, to ensure the janitor names are legal R column names
colnames(df) <- make.names(colnames(df))
#df_clean_names  
#dim(df)
#glimpse(raw)
#glimpse(df)

Fixed Biomarker Names

The data dimensions and a sample of the fixed data are shown below.

#dim(df)
head(df[,8:15]) %>% kable() %>% kable_styling()
hypersensitive_cardiac_troponin_i  hemoglobin  serum_chloride  prothrombin_time  procalcitonin  eosinophils_percent  interleukin_2_receptor  alkaline_phosphatase
NA                                 NA          NA              NA                NA             NA                   NA                      NA
NA                                 136         NA              NA                NA             0.6                  NA                      NA
NA                                 NA          103.1           NA                NA             NA                   NA                      46
NA                                 NA          NA              13.9              NA             NA                   NA                      NA
19.9                               NA          NA              NA                0.09           NA                   NA                      NA
NA                                 NA          NA              NA                NA             NA                   NA                      NA
#tail(df) %>% kable() %>% kable_styling()

Create Patient Master

The patient master contains three columns, patient id plus the admission and discharge times, revealing 375 patients in the sample.

df_patient_master_raw <- df %>% select(patient_id,admission_time,discharge_time) 
#dim(df_patient_master_raw)

data <- df_patient_master_raw
data_subset <- data[ , c("patient_id")] 
df_patient_master <- data[complete.cases(data_subset), ] # drop rows where patient_id is NA
head(df_patient_master)
## # A tibble: 6 x 3
##   patient_id admission_time      discharge_time     
##        <dbl> <dttm>              <dttm>             
## 1          1 2020-01-30 22:12:47 2020-02-17 12:40:09
## 2          2 2020-02-04 21:39:03 2020-02-19 12:59:01
## 3          3 2020-01-23 10:59:36 2020-02-08 17:52:31
## 4          4 2020-01-31 23:03:59 2020-02-18 12:59:12
## 5          5 2020-02-01 20:59:54 2020-02-18 10:33:06
## 6          6 2020-01-24 10:47:10 2020-02-07 09:06:58
dim(df_patient_master)
## [1] 375   3
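
For reference, an equivalent dplyr route to the same 375-row master (a sketch, not the original code): the id appears once per stay, so dropping rows with an NA id leaves exactly one row per patient.

require(dplyr)
df %>%
  filter(!is.na(patient_id)) %>%
  select(patient_id, admission_time, discharge_time) %>%
  dim() # 375 3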
#write.csv(data,file="covid_19_patient_master.csv")
#Create df with the corrected patient_id label. The data is still a sparse data frame at this point, with the sparse entries being NA (as opposed to 0).

#Dimensions are still 6120,81 here. 
library(dplyr)

df_raw2 <- inner_join(df_patient_master, df, by = c("admission_time" = "admission_time","discharge_time" ="discharge_time"))
#drop bad patient_id
df_raw2$patient_id.y <- NULL
#rename new patient_id
names(df_raw2)[1] <- c("patient_id")

#head(df_raw2) #%>% regulartable() %>% theme_zebra() %>%  autofit()   
#dim(df_raw2)
#head(df_raw2[1,c(1:5)])

Aggregate Data

Next we aggregate the data to the patient-master level of granularity. As part of the aggregation we take the “max” of all biomarkers while grouping by the patient master columns. This approach keeps the highest biomarker measurement for each patient. Additionally, the NA and infinite values are set to zero.

Note: This reduces the dataset to the number of patients, which is 375. The dimensions are now 375 by 80, as we drop the time the biomarker was taken. We aren’t interested in the time component for this analysis; the objective here is only to determine whether we can predict mortality with the biomarkers. The time component does become important later, as time is of the essence once one is admitted having tested positive for Covid-19.

New/Changed Features (a sketch of these derivations follows the aggregation output below):

  • duration_days: the difference in days between the patient’s admission and discharge

  • factorized: patient_id, gender, age, outcome

#install.packages("hablar")
library(hablar)
require(hablar)
require(lubridate)

#summarise numerics by patient
by_patient <- df_raw2 %>%
  group_by(patient_id, age, gender, outcome,admission_time,discharge_time)

#create df_clean
by_patient %>%
  summarise_if(is.numeric, max, na.rm = TRUE) %>% ## Clean infinite and NA
  mutate_if(is.numeric, ~replace(., is.infinite(.), 0)) %>% 
  mutate_if(is.numeric, ~replace(., is.na(.), 0)) -> df_clean

#write.csv(df_clean,file="biomarkers_max.csv")
#write.csv(df_clean,file="covid19_biomarkers_max.csv")
#write.csv(df_clean,file="df_clean.csv")
#as_datetime(x, tz = "UTC")

head(df_clean[,c(2:4,7:9)])
## # A tibble: 6 x 6
## # Groups:   age, gender, outcome [6]
##     age gender outcome hypersensitive_cardiac_troponi~ hemoglobin serum_chloride
##   <dbl>  <dbl>   <dbl>                           <dbl>      <dbl>          <dbl>
## 1    73      1       0                            19.9        140           103.
## 2    61      1       0                            16.9        151           100.
## 3    70      2       0                             0          126           103.
## 4    74      1       0                             4.8        110           103.
## 5    29      2       0                             5.6        134           102.
## 6    81      2       0                            19.7        108           106.
#dim(df_clean)
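
The New/Changed Features listed before the aggregation are not derived in the chunk above; a minimal hedged sketch of how they might look (df_features and this exact mutate chain are assumptions, not original code):

require(dplyr)
df_features <- df_clean %>%
  ungroup() %>%
  mutate(duration_days = as.numeric(difftime(discharge_time, admission_time,
                                             units = "days"))) %>%
  mutate(across(c(patient_id, gender, age, outcome), as.factor))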

Biomarker Stats

The biomarkers with relatively high standard deviations and/or wide ranges are candidates for a closer look. For example, hypersensitive_cardiac_troponin_i falls into this category and should be studied further.
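
Before walking the full per-column printout below, a compact hedged alternative (a sketch, assuming df_clean from above): rank the biomarkers by standard deviation in one pass.

require(dplyr)
sds <- df_clean %>%
  ungroup() %>%
  select(7:80) %>%
  summarise(across(everything(), sd)) %>%
  t()
head(sds[order(-sds[, 1]), , drop = FALSE]) # top-6 highest-SD biomarkers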

#library(tigerstats)
#inspect(df_clean[,c(2:4,7:80)]) #%>% kable() %>% kable_styling()
#densityplot(~outcome|gender,data=df_clean[,c(2:4,7:80)])
df_stat <- df_clean[,c(7:80)]
dim(df_stat)
## [1] 375  74
dfIntCols <- df_stat %>% select_if(., is.numeric)
#names(dfIntCols) %>% kable() %>% kable_styling()

# prints a header and the printVecInfo summary for each numeric column
statInfo <- function(a) {
  for (i in colnames(a)) {
    x <- colnames(df_stat[i])
    print('---------------------------------------------------------------------------------')
    print(x)
    print('=================================================================================')
    printVecInfo(df_stat[[i]])
  }
}

statInfo(dfIntCols)
## [1] "---------------------------------------------------------------------------------"
## [1] "hypersensitive_cardiac_troponin_i"
## [1] "================================================================================="
## [1] "mean:  887.5552"
## [1] "median:  4.8"
## [1] "min:  0"
## [1] "max:  50000"
## [1] "range:  50000"
## [1] "sd:  4933.42370112608"
## [1] "quantile:  0"     "quantile:  1.9"   "quantile:  4.8"   "quantile:  47.2" 
## [5] "quantile:  50000"
## [1] "IQR:  45.3"
## [1] "---------------------------------------------------------------------------------"
## [1] "hemoglobin"
## [1] "================================================================================="
## [1] "mean:  124.914666666667"
## [1] "median:  130"
## [1] "min:  0"
## [1] "max:  178"
## [1] "range:  178"
## [1] "sd:  33.6386614429292"
## [1] "quantile:  0"   "quantile:  118" "quantile:  130" "quantile:  143"
## [5] "quantile:  178"
## [1] "IQR:  25"
## [1] "---------------------------------------------------------------------------------"
## [1] "serum_chloride"
## [1] "================================================================================="
## [1] "mean:  98.8357333333333"
## [1] "median:  103.1"
## [1] "min:  0"
## [1] "max:  140.4"
## [1] "range:  140.4"
## [1] "sd:  25.2361054677626"
## [1] "quantile:  0"      "quantile:  100.25" "quantile:  103.1" 
## [4] "quantile:  106.2"  "quantile:  140.4" 
## [1] "IQR:  5.95"
## [1] "---------------------------------------------------------------------------------"
## [1] "prothrombin_time"
## [1] "================================================================================="
## [1] "mean:  15.6466666666667"
## [1] "median:  14.4"
## [1] "min:  0"
## [1] "max:  120"
## [1] "range:  120"
## [1] "sd:  9.1200714964205"
## [1] "quantile:  0"     "quantile:  13.45" "quantile:  14.4"  "quantile:  16.3" 
## [5] "quantile:  120"  
## [1] "IQR:  2.85"
## [1] "---------------------------------------------------------------------------------"
## [1] "procalcitonin"
## [1] "================================================================================="
## [1] "mean:  1.05938666666667"
## [1] "median:  0.07"
## [1] "min:  0"
## [1] "max:  57.17"
## [1] "range:  57.17"
## [1] "sd:  4.66756505327561"
## [1] "quantile:  0"     "quantile:  0.02"  "quantile:  0.07"  "quantile:  0.33" 
## [5] "quantile:  57.17"
## [1] "IQR:  0.31"
## [1] "---------------------------------------------------------------------------------"
## [1] "eosinophils_percent"
## [1] "================================================================================="
## [1] "mean:  1.0112"
## [1] "median:  0.5"
## [1] "min:  0"
## [1] "max:  8.6"
## [1] "range:  8.6"
## [1] "sd:  1.3383214005324"
## [1] "quantile:  0"   "quantile:  0"   "quantile:  0.5" "quantile:  1.6"
## [5] "quantile:  8.6"
## [1] "IQR:  1.6"
## [1] "---------------------------------------------------------------------------------"
## [1] "interleukin_2_receptor"
## [1] "================================================================================="
## [1] "mean:  559.413333333333"
## [1] "median:  381"
## [1] "min:  0"
## [1] "max:  7500"
## [1] "range:  7500"
## [1] "sd:  802.589480465136"
## [1] "quantile:  0"    "quantile:  0"    "quantile:  381"  "quantile:  853" 
## [5] "quantile:  7500"
## [1] "IQR:  853"
## [1] "---------------------------------------------------------------------------------"
## [1] "alkaline_phosphatase"
## [1] "================================================================================="
## [1] "mean:  88.1013333333333"
## [1] "median:  74"
## [1] "min:  0"
## [1] "max:  620"
## [1] "range:  620"
## [1] "sd:  59.0000940208925"
## [1] "quantile:  0"     "quantile:  56"    "quantile:  74"    "quantile:  106.5"
## [5] "quantile:  620"  
## [1] "IQR:  50.5"
## [1] "---------------------------------------------------------------------------------"
## [1] "albumin"
## [1] "================================================================================="
## [1] "mean:  33.28"
## [1] "median:  34.9"
## [1] "min:  0"
## [1] "max:  48.6"
## [1] "range:  48.6"
## [1] "sd:  9.32318138995196"
## [1] "quantile:  0"    "quantile:  30.1" "quantile:  34.9" "quantile:  38.8"
## [5] "quantile:  48.6"
## [1] "IQR:  8.7"
## [1] "---------------------------------------------------------------------------------"
## [1] "basophil_percent"
## [1] "================================================================================="
## [1] "mean:  0.303733333333333"
## [1] "median:  0.2"
## [1] "min:  0"
## [1] "max:  1.7"
## [1] "range:  1.7"
## [1] "sd:  0.268660637839674"
## [1] "quantile:  0"   "quantile:  0.1" "quantile:  0.2" "quantile:  0.4"
## [5] "quantile:  1.7"
## [1] "IQR:  0.3"
## [1] "---------------------------------------------------------------------------------"
## [1] "interleukin_10"
## [1] "================================================================================="
## [1] "mean:  9.06533333333333"
## [1] "median:  5"
## [1] "min:  0"
## [1] "max:  1000"
## [1] "range:  1000"
## [1] "sd:  52.4802083736788"
## [1] "quantile:  0"    "quantile:  0"    "quantile:  5"    "quantile:  8"   
## [5] "quantile:  1000"
## [1] "IQR:  8"
## [1] "---------------------------------------------------------------------------------"
## [1] "total_bilirubin"
## [1] "================================================================================="
## [1] "mean:  18.8773333333333"
## [1] "median:  12"
## [1] "min:  0"
## [1] "max:  505.7"
## [1] "range:  505.7"
## [1] "sd:  35.0329774867403"
## [1] "quantile:  0"     "quantile:  8.05"  "quantile:  12"    "quantile:  18.4" 
## [5] "quantile:  505.7"
## [1] "IQR:  10.35"
## [1] "---------------------------------------------------------------------------------"
## [1] "platelet_count"
## [1] "================================================================================="
## [1] "mean:  220.685333333333"
## [1] "median:  223"
## [1] "min:  -1"
## [1] "max:  558"
## [1] "range:  559"
## [1] "sd:  116.340695393463"
## [1] "quantile:  -1"    "quantile:  144.5" "quantile:  223"   "quantile:  287"  
## [5] "quantile:  558"  
## [1] "IQR:  142.5"
## [1] "---------------------------------------------------------------------------------"
## [1] "monocytes_percent"
## [1] "================================================================================="
## [1] "mean:  7.636"
## [1] "median:  7.8"
## [1] "min:  0"
## [1] "max:  53"
## [1] "range:  53"
## [1] "sd:  4.98565428636018"
## [1] "quantile:  0"     "quantile:  4.2"   "quantile:  7.8"   "quantile:  10.15"
## [5] "quantile:  53"   
## [1] "IQR:  5.95"
## [1] "---------------------------------------------------------------------------------"
## [1] "antithrombin"
## [1] "================================================================================="
## [1] "mean:  49.0186666666667"
## [1] "median:  68"
## [1] "min:  0"
## [1] "max:  136"
## [1] "range:  136"
## [1] "sd:  46.8639762605004"
## [1] "quantile:  0"   "quantile:  0"   "quantile:  68"  "quantile:  91" 
## [5] "quantile:  136"
## [1] "IQR:  91"
## [1] "---------------------------------------------------------------------------------"
## [1] "interleukin_8"
## [1] "================================================================================="
## [1] "mean:  57.0885333333333"
## [1] "median:  6.8"
## [1] "min:  0"
## [1] "max:  6795"
## [1] "range:  6795"
## [1] "sd:  469.091495985732"
## [1] "quantile:  0"     "quantile:  0"     "quantile:  6.8"   "quantile:  21.75"
## [5] "quantile:  6795" 
## [1] "IQR:  21.75"
## [1] "---------------------------------------------------------------------------------"
## [1] "indirect_bilirubin"
## [1] "================================================================================="
## [1] "mean:  8.27093333333333"
## [1] "median:  6.6"
## [1] "min:  0"
## [1] "max:  145.1"
## [1] "range:  145.1"
## [1] "sd:  9.56582502896893"
## [1] "quantile:  0"     "quantile:  4.45"  "quantile:  6.6"   "quantile:  9.95" 
## [5] "quantile:  145.1"
## [1] "IQR:  5.5"
## [1] "---------------------------------------------------------------------------------"
## [1] "red_blood_cell_distribution_width"
## [1] "================================================================================="
## [1] "mean:  12.4170666666667"
## [1] "median:  12.7"
## [1] "min:  0"
## [1] "max:  27.1"
## [1] "range:  27.1"
## [1] "sd:  3.76392628848777"
## [1] "quantile:  0"     "quantile:  12"    "quantile:  12.7"  "quantile:  13.75"
## [5] "quantile:  27.1" 
## [1] "IQR:  1.75"
## [1] "---------------------------------------------------------------------------------"
## [1] "neutrophils_percent"
## [1] "================================================================================="
## [1] "mean:  75.9504"
## [1] "median:  83.6"
## [1] "min:  0"
## [1] "max:  98.9"
## [1] "range:  98.9"
## [1] "sd:  23.1538055129739"
## [1] "quantile:  0"     "quantile:  66.1"  "quantile:  83.6"  "quantile:  93.05"
## [5] "quantile:  98.9" 
## [1] "IQR:  26.95"
## [1] "---------------------------------------------------------------------------------"
## [1] "total_protein"
## [1] "================================================================================="
## [1] "mean:  65.6357333333333"
## [1] "median:  68.9"
## [1] "min:  0"
## [1] "max:  88.7"
## [1] "range:  88.7"
## [1] "sd:  16.3493839663283"
## [1] "quantile:  0"     "quantile:  64.15" "quantile:  68.9"  "quantile:  73.1" 
## [5] "quantile:  88.7" 
## [1] "IQR:  8.94999999999999"
## [1] "---------------------------------------------------------------------------------"
## [1] "quantification_of_treponema_pallidum_antibodies"
## [1] "================================================================================="
## [1] "mean:  0.0977066666666667"
## [1] "median:  0.04"
## [1] "min:  0"
## [1] "max:  11.95"
## [1] "range:  11.95"
## [1] "sd:  0.666373030163863"
## [1] "quantile:  0"     "quantile:  0"     "quantile:  0.04"  "quantile:  0.06" 
## [5] "quantile:  11.95"
## [1] "IQR:  0.06"
## [1] "---------------------------------------------------------------------------------"
## [1] "prothrombin_activity"
## [1] "================================================================================="
## [1] "mean:  81.7146666666667"
## [1] "median:  89"
## [1] "min:  0"
## [1] "max:  142"
## [1] "range:  142"
## [1] "sd:  27.8699457341204"
## [1] "quantile:  0"    "quantile:  71.5" "quantile:  89"   "quantile:  98"  
## [5] "quantile:  142" 
## [1] "IQR:  26.5"
## [1] "---------------------------------------------------------------------------------"
## [1] "h_bs_ag"
## [1] "================================================================================="
## [1] "mean:  6.17981333333333"
## [1] "median:  0"
## [1] "min:  0"
## [1] "max:  250"
## [1] "range:  250"
## [1] "sd:  37.1898486338532"
## [1] "quantile:  0"    "quantile:  0"    "quantile:  0"    "quantile:  0.01"
## [5] "quantile:  250" 
## [1] "IQR:  0.01"
## [1] "---------------------------------------------------------------------------------"
## [1] "mean_corpuscular_volume"
## [1] "================================================================================="
## [1] "mean:  86.4445333333333"
## [1] "median:  90.3"
## [1] "min:  0"
## [1] "max:  118.9"
## [1] "range:  118.9"
## [1] "sd:  21.0188320381879"
## [1] "quantile:  0"     "quantile:  87"    "quantile:  90.3"  "quantile:  94.3" 
## [5] "quantile:  118.9"
## [1] "IQR:  7.3"
## [1] "---------------------------------------------------------------------------------"
## [1] "hematocrit"
## [1] "================================================================================="
## [1] "mean:  36.4373333333333"
## [1] "median:  38.1"
## [1] "min:  0"
## [1] "max:  52.3"
## [1] "range:  52.3"
## [1] "sd:  9.81007848738728"
## [1] "quantile:  0"    "quantile:  34.6" "quantile:  38.1" "quantile:  41.3"
## [5] "quantile:  52.3"
## [1] "IQR:  6.7"
## [1] "---------------------------------------------------------------------------------"
## [1] "white_blood_cell_count"
## [1] "================================================================================="
## [1] "mean:  25.94064"
## [1] "median:  10.33"
## [1] "min:  0"
## [1] "max:  1726.6"
## [1] "range:  1726.6"
## [1] "sd:  106.41926472905"
## [1] "quantile:  0"      "quantile:  5.935"  "quantile:  10.33" 
## [4] "quantile:  16.815" "quantile:  1726.6"
## [1] "IQR:  10.88"
## [1] "---------------------------------------------------------------------------------"
## [1] "tumor_necrosis_factor"
## [1] "================================================================================="
## [1] "mean:  7.09466666666667"
## [1] "median:  5.9"
## [1] "min:  0"
## [1] "max:  168"
## [1] "range:  168"
## [1] "sd:  12.4087933860502"
## [1] "quantile:  0"    "quantile:  0"    "quantile:  5.9"  "quantile:  9.45"
## [5] "quantile:  168" 
## [1] "IQR:  9.45"
## [1] "---------------------------------------------------------------------------------"
## [1] "mean_corpuscular_hemoglobin_concentration"
## [1] "================================================================================="
## [1] "mean:  331.650666666667"
## [1] "median:  347"
## [1] "min:  0"
## [1] "max:  514"
## [1] "range:  514"
## [1] "sd:  79.1725800136951"
## [1] "quantile:  0"   "quantile:  338" "quantile:  347" "quantile:  357"
## [5] "quantile:  514"
## [1] "IQR:  19"
## [1] "---------------------------------------------------------------------------------"
## [1] "fibrinogen"
## [1] "================================================================================="
## [1] "mean:  3.93029333333333"
## [1] "median:  4.4"
## [1] "min:  0"
## [1] "max:  10.78"
## [1] "range:  10.78"
## [1] "sd:  2.51909969685788"
## [1] "quantile:  0"     "quantile:  2.28"  "quantile:  4.4"   "quantile:  5.72" 
## [5] "quantile:  10.78"
## [1] "IQR:  3.44"
## [1] "---------------------------------------------------------------------------------"
## [1] "interleukin_1"
## [1] "================================================================================="
## [1] "mean:  3.90773333333333"
## [1] "median:  5"
## [1] "min:  0"
## [1] "max:  88.5"
## [1] "range:  88.5"
## [1] "sd:  6.75831189039416"
## [1] "quantile:  0"    "quantile:  0"    "quantile:  5"    "quantile:  5"   
## [5] "quantile:  88.5"
## [1] "IQR:  5"
## [1] "---------------------------------------------------------------------------------"
## [1] "urea"
## [1] "================================================================================="
## [1] "mean:  9.93402666666667"
## [1] "median:  5.6"
## [1] "min:  0"
## [1] "max:  68.4"
## [1] "range:  68.4"
## [1] "sd:  10.7461847998858"
## [1] "quantile:  0"     "quantile:  4"     "quantile:  5.6"   "quantile:  11.45"
## [5] "quantile:  68.4" 
## [1] "IQR:  7.45"
## [1] "---------------------------------------------------------------------------------"
## [1] "lymphocyte_count"
## [1] "================================================================================="
## [1] "mean:  1.28085333333333"
## [1] "median:  1.07"
## [1] "min:  0"
## [1] "max:  52.42"
## [1] "range:  52.42"
## [1] "sd:  2.73636157533041"
## [1] "quantile:  0"     "quantile:  0.64"  "quantile:  1.07"  "quantile:  1.585"
## [5] "quantile:  52.42"
## [1] "IQR:  0.945"
## [1] "---------------------------------------------------------------------------------"
## [1] "ph_value"
## [1] "================================================================================="
## [1] "mean:  4.109904"
## [1] "median:  6"
## [1] "min:  0"
## [1] "max:  7.565"
## [1] "range:  7.565"
## [1] "sd:  3.28009748312161"
## [1] "quantile:  0"     "quantile:  0"     "quantile:  6"     "quantile:  7"    
## [5] "quantile:  7.565"
## [1] "IQR:  7"
## [1] "---------------------------------------------------------------------------------"
## [1] "red_blood_cell_count"
## [1] "================================================================================="
## [1] "mean:  19.2459466666667"
## [1] "median:  4.61"
## [1] "min:  0"
## [1] "max:  749.5"
## [1] "range:  749.5"
## [1] "sd:  61.626490493824"
## [1] "quantile:  0"     "quantile:  3.98"  "quantile:  4.61"  "quantile:  6.54" 
## [5] "quantile:  749.5"
## [1] "IQR:  2.56"
## [1] "---------------------------------------------------------------------------------"
## [1] "eosinophil_count"
## [1] "================================================================================="
## [1] "mean:  0.0617333333333333"
## [1] "median:  0.04"
## [1] "min:  0"
## [1] "max:  0.49"
## [1] "range:  0.49"
## [1] "sd:  0.0754199177463744"
## [1] "quantile:  0"    "quantile:  0"    "quantile:  0.04" "quantile:  0.1" 
## [5] "quantile:  0.49"
## [1] "IQR:  0.1"
## [1] "---------------------------------------------------------------------------------"
## [1] "corrected_calcium"
## [1] "================================================================================="
## [1] "mean:  2.25064"
## [1] "median:  2.4"
## [1] "min:  0"
## [1] "max:  2.79"
## [1] "range:  2.79"
## [1] "sd:  0.577511826857622"
## [1] "quantile:  0"    "quantile:  2.27" "quantile:  2.4"  "quantile:  2.47"
## [5] "quantile:  2.79"
## [1] "IQR:  0.2"
## [1] "---------------------------------------------------------------------------------"
## [1] "serum_potassium"
## [1] "================================================================================="
## [1] "mean:  4.50314666666667"
## [1] "median:  4.58"
## [1] "min:  0"
## [1] "max:  12.8"
## [1] "range:  12.8"
## [1] "sd:  1.44827657761779"
## [1] "quantile:  0"     "quantile:  4.125" "quantile:  4.58"  "quantile:  5.04" 
## [5] "quantile:  12.8" 
## [1] "IQR:  0.915"
## [1] "---------------------------------------------------------------------------------"
## [1] "glucose"
## [1] "================================================================================="
## [1] "mean:  9.37077333333333"
## [1] "median:  7.58"
## [1] "min:  0"
## [1] "max:  43.01"
## [1] "range:  43.01"
## [1] "sd:  6.40805264531818"
## [1] "quantile:  0"      "quantile:  5.685"  "quantile:  7.58"  
## [4] "quantile:  11.135" "quantile:  43.01" 
## [1] "IQR:  5.45"
## [1] "---------------------------------------------------------------------------------"
## [1] "neutrophils_count"
## [1] "================================================================================="
## [1] "mean:  8.80610666666667"
## [1] "median:  6.38"
## [1] "min:  0"
## [1] "max:  33.88"
## [1] "range:  33.88"
## [1] "sd:  7.21423001767688"
## [1] "quantile:  0"     "quantile:  3.42"  "quantile:  6.38"  "quantile:  12.33"
## [5] "quantile:  33.88"
## [1] "IQR:  8.91"
## [1] "---------------------------------------------------------------------------------"
## [1] "direct_bilirubin"
## [1] "================================================================================="
## [1] "mean:  11.2413333333333"
## [1] "median:  5.3"
## [1] "min:  0"
## [1] "max:  360.6"
## [1] "range:  360.6"
## [1] "sd:  27.7644778118371"
## [1] "quantile:  0"     "quantile:  3.5"   "quantile:  5.3"   "quantile:  8.65" 
## [5] "quantile:  360.6"
## [1] "IQR:  5.15"
## [1] "---------------------------------------------------------------------------------"
## [1] "mean_platelet_volume"
## [1] "================================================================================="
## [1] "mean:  10.3717333333333"
## [1] "median:  11"
## [1] "min:  0"
## [1] "max:  15"
## [1] "range:  15"
## [1] "sd:  3.20872822580761"
## [1] "quantile:  0"    "quantile:  10.2" "quantile:  11"   "quantile:  11.9"
## [5] "quantile:  15"  
## [1] "IQR:  1.7"
## [1] "---------------------------------------------------------------------------------"
## [1] "ferritin"
## [1] "================================================================================="
## [1] "mean:  883.942933333333"
## [1] "median:  147.3"
## [1] "min:  0"
## [1] "max:  50000"
## [1] "range:  50000"
## [1] "sd:  3052.68673704692"
## [1] "quantile:  0"      "quantile:  0"      "quantile:  147.3" 
## [4] "quantile:  881.95" "quantile:  50000" 
## [1] "IQR:  881.95"
## [1] "---------------------------------------------------------------------------------"
## [1] "rbc_distribution_width_sd"
## [1] "================================================================================="
## [1] "mean:  40.3882666666667"
## [1] "median:  41.1"
## [1] "min:  0"
## [1] "max:  113.3"
## [1] "range:  113.3"
## [1] "sd:  12.772592270079"
## [1] "quantile:  0"     "quantile:  38.55" "quantile:  41.1"  "quantile:  45.15"
## [5] "quantile:  113.3"
## [1] "IQR:  6.60000000000001"
## [1] "---------------------------------------------------------------------------------"
## [1] "thrombin_time"
## [1] "================================================================================="
## [1] "mean:  15.1773333333333"
## [1] "median:  16.5"
## [1] "min:  0"
## [1] "max:  161.9"
## [1] "range:  161.9"
## [1] "sd:  13.0846705153605"
## [1] "quantile:  0"     "quantile:  15"    "quantile:  16.5"  "quantile:  18.3" 
## [5] "quantile:  161.9"
## [1] "IQR:  3.3"
## [1] "---------------------------------------------------------------------------------"
## [1] "percent_lymphocyte"
## [1] "================================================================================="
## [1] "mean:  19.236"
## [1] "median:  17.7"
## [1] "min:  0"
## [1] "max:  60"
## [1] "range:  60"
## [1] "sd:  14.3091874304513"
## [1] "quantile:  0"     "quantile:  6.1"   "quantile:  17.7"  "quantile:  31.05"
## [5] "quantile:  60"   
## [1] "IQR:  24.95"
## [1] "---------------------------------------------------------------------------------"
## [1] "hcv_antibody_quantification"
## [1] "================================================================================="
## [1] "mean:  0.0822666666666667"
## [1] "median:  0.05"
## [1] "min:  0"
## [1] "max:  2.09"
## [1] "range:  2.09"
## [1] "sd:  0.192442597409416"
## [1] "quantile:  0"    "quantile:  0"    "quantile:  0.05" "quantile:  0.08"
## [5] "quantile:  2.09"
## [1] "IQR:  0.08"
## [1] "---------------------------------------------------------------------------------"
## [1] "d_d_dimer"
## [1] "================================================================================="
## [1] "mean:  6.78184"
## [1] "median:  1.27"
## [1] "min:  0"
## [1] "max:  60"
## [1] "range:  60"
## [1] "sd:  9.27520148827659"
## [1] "quantile:  0"     "quantile:  0.425" "quantile:  1.27"  "quantile:  19.78"
## [5] "quantile:  60"   
## [1] "IQR:  19.355"
## [1] "---------------------------------------------------------------------------------"
## [1] "total_cholesterol"
## [1] "================================================================================="
## [1] "mean:  3.79714666666667"
## [1] "median:  3.88"
## [1] "min:  0"
## [1] "max:  7.3"
## [1] "range:  7.3"
## [1] "sd:  1.32915602562809"
## [1] "quantile:  0"     "quantile:  3.155" "quantile:  3.88"  "quantile:  4.515"
## [5] "quantile:  7.3"  
## [1] "IQR:  1.36"
## [1] "---------------------------------------------------------------------------------"
## [1] "aspartate_aminotransferase"
## [1] "================================================================================="
## [1] "mean:  62.416"
## [1] "median:  33"
## [1] "min:  0"
## [1] "max:  1858"
## [1] "range:  1858"
## [1] "sd:  150.569143993697"
## [1] "quantile:  0"    "quantile:  22"   "quantile:  33"   "quantile:  53.5"
## [5] "quantile:  1858"
## [1] "IQR:  31.5"
## [1] "---------------------------------------------------------------------------------"
## [1] "uric_acid"
## [1] "================================================================================="
## [1] "mean:  308.073066666667"
## [1] "median:  286"
## [1] "min:  0"
## [1] "max:  1176"
## [1] "range:  1176"
## [1] "sd:  172.536536764377"
## [1] "quantile:  0"      "quantile:  212.2"  "quantile:  286"   
## [4] "quantile:  372.55" "quantile:  1176"  
## [1] "IQR:  160.35"
## [1] "---------------------------------------------------------------------------------"
## [1] "hco3"
## [1] "================================================================================="
## [1] "mean:  23.2208"
## [1] "median:  24.6"
## [1] "min:  0"
## [1] "max:  36.3"
## [1] "range:  36.3"
## [1] "sd:  6.75612175559844"
## [1] "quantile:  0"     "quantile:  21.8"  "quantile:  24.6"  "quantile:  27.15"
## [5] "quantile:  36.3" 
## [1] "IQR:  5.35"
## [1] "---------------------------------------------------------------------------------"
## [1] "calcium"
## [1] "================================================================================="
## [1] "mean:  2.03805333333333"
## [1] "median:  2.15"
## [1] "min:  0"
## [1] "max:  2.62"
## [1] "range:  2.62"
## [1] "sd:  0.514584300669525"
## [1] "quantile:  0"    "quantile:  2.04" "quantile:  2.15" "quantile:  2.26"
## [5] "quantile:  2.62"
## [1] "IQR:  0.22"
## [1] "---------------------------------------------------------------------------------"
## [1] "amino_terminal_brain_natriuretic_peptide_precursor_nt_pro_bnp"
## [1] "================================================================================="
## [1] "mean:  2811.68533333333"
## [1] "median:  99"
## [1] "min:  0"
## [1] "max:  70000"
## [1] "range:  70000"
## [1] "sd:  9387.95662943333"
## [1] "quantile:  0"     "quantile:  0"     "quantile:  99"    "quantile:  941"  
## [5] "quantile:  70000"
## [1] "IQR:  941"
## [1] "---------------------------------------------------------------------------------"
## [1] "lactate_dehydrogenase"
## [1] "================================================================================="
## [1] "mean:  519.592"
## [1] "median:  355"
## [1] "min:  0"
## [1] "max:  1867"
## [1] "range:  1867"
## [1] "sd:  436.03199416537"
## [1] "quantile:  0"     "quantile:  239.5" "quantile:  355"   "quantile:  661"  
## [5] "quantile:  1867" 
## [1] "IQR:  421.5"
## [1] "---------------------------------------------------------------------------------"
## [1] "platelet_large_cell_ratio"
## [1] "================================================================================="
## [1] "mean:  31.664"
## [1] "median:  32.5"
## [1] "min:  0"
## [1] "max:  62.2"
## [1] "range:  62.2"
## [1] "sd:  12.6608563463322"
## [1] "quantile:  0"     "quantile:  26"    "quantile:  32.5"  "quantile:  39.35"
## [5] "quantile:  62.2" 
## [1] "IQR:  13.35"
## [1] "---------------------------------------------------------------------------------"
## [1] "interleukin_6"
## [1] "================================================================================="
## [1] "mean:  77.4030666666667"
## [1] "median:  2.58"
## [1] "min:  0"
## [1] "max:  5000"
## [1] "range:  5000"
## [1] "sd:  460.193926173624"
## [1] "quantile:  0"    "quantile:  0"    "quantile:  2.58" "quantile:  33.8"
## [5] "quantile:  5000"
## [1] "IQR:  33.8"
## [1] "---------------------------------------------------------------------------------"
## [1] "fibrin_degradation_products"
## [1] "================================================================================="
## [1] "mean:  29.6808"
## [1] "median:  4"
## [1] "min:  0"
## [1] "max:  190.8"
## [1] "range:  190.8"
## [1] "sd:  55.7152472377118"
## [1] "quantile:  0"     "quantile:  0"     "quantile:  4"     "quantile:  7.4"  
## [5] "quantile:  190.8"
## [1] "IQR:  7.4"
## [1] "---------------------------------------------------------------------------------"
## [1] "monocytes_count"
## [1] "================================================================================="
## [1] "mean:  0.648213333333333"
## [1] "median:  0.48"
## [1] "min:  0"
## [1] "max:  39.92"
## [1] "range:  39.92"
## [1] "sd:  2.05815873720233"
## [1] "quantile:  0"     "quantile:  0.355" "quantile:  0.48"  "quantile:  0.7"  
## [5] "quantile:  39.92"
## [1] "IQR:  0.345"
## [1] "---------------------------------------------------------------------------------"
## [1] "plt_distribution_width"
## [1] "================================================================================="
## [1] "mean:  12.816"
## [1] "median:  13"
## [1] "min:  0"
## [1] "max:  25.3"
## [1] "range:  25.3"
## [1] "sd:  4.80697576889886"
## [1] "quantile:  0"     "quantile:  11.2"  "quantile:  13"    "quantile:  15.15"
## [5] "quantile:  25.3" 
## [1] "IQR:  3.95"
## [1] "---------------------------------------------------------------------------------"
## [1] "globulin"
## [1] "================================================================================="
## [1] "mean:  33.4546666666667"
## [1] "median:  34.6"
## [1] "min:  0"
## [1] "max:  50.6"
## [1] "range:  50.6"
## [1] "sd:  9.43016966028432"
## [1] "quantile:  0"    "quantile:  31.1" "quantile:  34.6" "quantile:  38.7"
## [5] "quantile:  50.6"
## [1] "IQR:  7.6"
## [1] "---------------------------------------------------------------------------------"
## [1] "glutamyl_transpeptidase"
## [1] "================================================================================="
## [1] "mean:  57.6213333333333"
## [1] "median:  37"
## [1] "min:  0"
## [1] "max:  732"
## [1] "range:  732"
## [1] "sd:  78.5797430274511"
## [1] "quantile:  0"    "quantile:  20.5" "quantile:  37"   "quantile:  61"  
## [5] "quantile:  732" 
## [1] "IQR:  40.5"
## [1] "---------------------------------------------------------------------------------"
## [1] "international_standard_ratio"
## [1] "================================================================================="
## [1] "mean:  1.28170666666667"
## [1] "median:  1.11"
## [1] "min:  0"
## [1] "max:  13.48"
## [1] "range:  13.48"
## [1] "sd:  1.03490022218339"
## [1] "quantile:  0"     "quantile:  1.02"  "quantile:  1.11"  "quantile:  1.31" 
## [5] "quantile:  13.48"
## [1] "IQR:  0.29"
## [1] "---------------------------------------------------------------------------------"
## [1] "basophil_count_number"
## [1] "================================================================================="
## [1] "mean:  0.0241333333333333"
## [1] "median:  0.02"
## [1] "min:  0"
## [1] "max:  0.12"
## [1] "range:  0.12"
## [1] "sd:  0.0205988768016656"
## [1] "quantile:  0"    "quantile:  0.01" "quantile:  0.02" "quantile:  0.03"
## [5] "quantile:  0.12"
## [1] "IQR:  0.02"
## [1] "---------------------------------------------------------------------------------"
## [1] "x2019_n_co_v_nucleic_acid_detection"
## [1] "================================================================================="
## [1] "mean:  -0.581333333333333"
## [1] "median:  -1"
## [1] "min:  -1"
## [1] "max:  0"
## [1] "range:  1"
## [1] "sd:  0.493999646380325"
## [1] "quantile:  -1" "quantile:  -1" "quantile:  -1" "quantile:  0" 
## [5] "quantile:  0" 
## [1] "IQR:  1"
## [1] "---------------------------------------------------------------------------------"
## [1] "mean_corpuscular_hemoglobin"
## [1] "================================================================================="
## [1] "mean:  29.7725333333333"
## [1] "median:  31"
## [1] "min:  0"
## [1] "max:  50.8"
## [1] "range:  50.8"
## [1] "sd:  7.56932789939007"
## [1] "quantile:  0"     "quantile:  29.75" "quantile:  31"    "quantile:  32.35"
## [5] "quantile:  50.8" 
## [1] "IQR:  2.59999999999999"
## [1] "---------------------------------------------------------------------------------"
## [1] "activation_of_partial_thromboplastin_time"
## [1] "================================================================================="
## [1] "mean:  34.4565333333333"
## [1] "median:  39.1"
## [1] "min:  0"
## [1] "max:  144"
## [1] "range:  144"
## [1] "sd:  20.189724546286"
## [1] "quantile:  0"    "quantile:  33.4" "quantile:  39.1" "quantile:  44.6"
## [5] "quantile:  144" 
## [1] "IQR:  11.2"
## [1] "---------------------------------------------------------------------------------"
## [1] "high_sensitivity_c_reactive_protein"
## [1] "================================================================================="
## [1] "mean:  86.6608"
## [1] "median:  54.1"
## [1] "min:  0"
## [1] "max:  320"
## [1] "range:  320"
## [1] "sd:  87.9859750272583"
## [1] "quantile:  0"     "quantile:  10.1"  "quantile:  54.1"  "quantile:  144.4"
## [5] "quantile:  320"  
## [1] "IQR:  134.3"
## [1] "---------------------------------------------------------------------------------"
## [1] "hiv_antibody_quantification"
## [1] "================================================================================="
## [1] "mean:  0.0732"
## [1] "median:  0.08"
## [1] "min:  0"
## [1] "max:  0.27"
## [1] "range:  0.27"
## [1] "sd:  0.0561405798498943"
## [1] "quantile:  0"    "quantile:  0"    "quantile:  0.08" "quantile:  0.1" 
## [5] "quantile:  0.27"
## [1] "IQR:  0.1"
## [1] "---------------------------------------------------------------------------------"
## [1] "serum_sodium"
## [1] "================================================================================="
## [1] "mean:  134.9216"
## [1] "median:  141.3"
## [1] "min:  0"
## [1] "max:  179.7"
## [1] "range:  179.7"
## [1] "sd:  33.7104973828152"
## [1] "quantile:  0"      "quantile:  138.7"  "quantile:  141.3" 
## [4] "quantile:  143.85" "quantile:  179.7" 
## [1] "IQR:  5.15000000000003"
## [1] "---------------------------------------------------------------------------------"
## [1] "thrombocytocrit"
## [1] "================================================================================="
## [1] "mean:  0.22928"
## [1] "median:  0.24"
## [1] "min:  0"
## [1] "max:  0.51"
## [1] "range:  0.51"
## [1] "sd:  0.116031823036696"
## [1] "quantile:  0"    "quantile:  0.16" "quantile:  0.24" "quantile:  0.3" 
## [5] "quantile:  0.51"
## [1] "IQR:  0.14"
## [1] "---------------------------------------------------------------------------------"
## [1] "esr"
## [1] "================================================================================="
## [1] "mean:  28.4506666666667"
## [1] "median:  22"
## [1] "min:  0"
## [1] "max:  110"
## [1] "range:  110"
## [1] "sd:  27.6133946674868"
## [1] "quantile:  0"   "quantile:  2"   "quantile:  22"  "quantile:  43" 
## [5] "quantile:  110"
## [1] "IQR:  41"
## [1] "---------------------------------------------------------------------------------"
## [1] "glutamic_pyruvic_transaminase"
## [1] "================================================================================="
## [1] "mean:  47.9946666666667"
## [1] "median:  29"
## [1] "min:  0"
## [1] "max:  1600"
## [1] "range:  1600"
## [1] "sd:  102.51852564206"
## [1] "quantile:  0"    "quantile:  18"   "quantile:  29"   "quantile:  48.5"
## [5] "quantile:  1600"
## [1] "IQR:  30.5"
## [1] "---------------------------------------------------------------------------------"
## [1] "e_gfr"
## [1] "================================================================================="
## [1] "mean:  85.4397333333333"
## [1] "median:  92.9"
## [1] "min:  0"
## [1] "max:  224"
## [1] "range:  224"
## [1] "sd:  34.2058778595522"
## [1] "quantile:  0"     "quantile:  72.1"  "quantile:  92.9"  "quantile:  108.4"
## [5] "quantile:  224"  
## [1] "IQR:  36.3"
## [1] "---------------------------------------------------------------------------------"
## [1] "creatinine"
## [1] "================================================================================="
## [1] "mean:  120.322666666667"
## [1] "median:  81"
## [1] "min:  0"
## [1] "max:  1497"
## [1] "range:  1497"
## [1] "sd:  160.171732347307"
## [1] "quantile:  0"    "quantile:  58"   "quantile:  81"   "quantile:  105" 
## [5] "quantile:  1497"
## [1] "IQR:  47"

EDA

Age and Gender Analysis

The data is in cleaner form at this point, so analysis can be conducted. We’ll start by looking at the age and gender of the Covid-19 patients.

#set temp df for analysis
df_analysis <- df_clean

df_analysis$age_factor <- cut(df_analysis$age, breaks = c(0,30,55,65,Inf),labels=c("young","middle","older", "old"))
#head(df_analysis[,c(1:10)]) %>% regulartable() %>% theme_zebra() %>%  autofit()

df_analysis$gender_factor <- cut(df_analysis$gender, breaks = c(0,1,2),labels=c("Female","Male"))
#head(df_analysis[,c(1:10)]) %>% regulartable() %>% theme_zebra() %>%  autofit()

#head(df_analysis[,c(1:4,81:82)],8)

Age Histogram

The histogram showing the distribution of age across the 375 patients looks close to normal.

Histogram line color legend: Mean (blue), Median (gray), Mode (green)

require(plotly)
#require(cowplot)

getmode <- function(v) {
   uniqv <- unique(v)
   uniqv[which.max(tabulate(match(v, uniqv)))]
   }
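# usage example (assumption): getmode(c(70, 65, 70, 34)) returns 70;
# ties resolve to whichever unique value appears first in the vector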

df_analysis %>% 
  ggplot(aes(age)) +
  geom_histogram(bins = 10, boundary = 0, fill = "#FF6666", col = "black", aes(y = ..density..)) + 
  geom_density(alpha = 0.2, fill = "gray") +
  ggtitle("Age Histogram",
          subtitle = "Mean (blue), Median (gray), Mode (green)") +
  theme(axis.title = element_text(), axis.title.x = element_text()) +
  
  geom_vline(aes(xintercept = mean(df_analysis$age, na.rm = TRUE)), color = "blue", linetype = 3, size = 1) +
  
  geom_vline(aes(xintercept = median(df_analysis$age, na.rm = TRUE)), color = "gray", linetype = "dashed", size = 1) +
  
  geom_vline(aes(xintercept = getmode(df_analysis$age)), color = "green", linetype = "dashed", size = 1) +
  
  gbl_ggtheme -> ph1

#ph1 + theme_bw() 
ph1

ph1ly <- ggplotly(ph1) 
#ph1ly
#plot_grid(ph1,ph2, nrow = 1, ncol = 2) 

Age Discretized

Age is better studied in discrete buckets: young, middle, older, and old.

#quantile(df_clean$age)
#data.table(df_clean$age)
age_df <- as.data.frame(prop.table(table(df_analysis$age_factor))) 
names(age_df)[1:2] <- c("Age","Percentage")
age_df %>% kable() %>% kable_styling()
Age Percentage
young 0.0586667
middle 0.3040000
older 0.2346667
old 0.4026667

Age outliers

How many patients are 90 years old or older?

age_outliers <- which(df_analysis$age >= 90)

paste("There are", length(age_outliers), "patients age 90 or older.")
## [1] "There are 5 patients age 90 or older."
df_analysis[age_outliers, "age"] %>% arrange(age)
## # A tibble: 5 x 1
## # Groups:   age [5]
##     age
##   <dbl>
## 1    90
## 2    91
## 3    92
## 4    94
## 5    95

Gender Analysis

Gender is not significantly correlated with patient outcome, whereas age shows a significant positive correlation. There are more females than males in this particular patient sample.

gender_df <- as.data.frame(prop.table(table(df_analysis$gender_factor))) 
names(gender_df)[1:2] <- c("Gender","Percentage")
gender_df %>% kable() %>% kable_styling()
Gender Percentage
Female 0.5973333
Male 0.4026667
#ggpairs(df_analysis, columns = c(3,4,81:82), ggplot2::aes(colour=df_analysis$gender_factor)) +gbl_ggtheme
#chart.Correlation(cor_df1[,c(1,2:5)])

 ggcorr(df_analysis[c(2:3,4,81:82)], method = c("everything", "pearson")) + gbl_ggtheme

Correlation Analysis

Visual analysis of the outcome (dependent feature) with all others (independent features).

During the analysis, the following features were found to be important. They are listed in order of relative importance.

  • lactate_dehydrogenase 100.00
  • monocytes_percent 83.50
  • international_standard_ratio 78.47
  • percent_lymphocyte 64.12
  • eosinophil_count 59.75
  • eosinophils_percent 50.99
  • neutrophils_percent 49.15
  • amino_terminal_brain_natriuretic_peptide_precursor_nt_pro_bnp 48.62
  • thrombocytocrit 46.40
  • hypersensitive_cardiac_troponin_i 45.83
  • fibrin_degradation_products 43.42
  • procalcitonin 41.40
  • albumin 40.76
  • prothrombin_activity 37.53
  • d_d_dimer 33.37
  • aspartate_aminotransferase 32.88
  • neutrophils_count 31.01
  • direct_bilirubin 30.13
  • thrombin_time 29.82
  • high_sensitivity_c_reactive_protein 28.38

Figure 1 - “outcome”,“percent_lymphocyte”,“lactate_dehydrogenase”,“lymphocyte_count”,“albumin”, “neutrophils_percent”,“eosinophils_percent”,“eosinophil_count”, “x2019_n_co_v_nucleic_acid_detection”,“direct_bilirubin”,“thrombin_time”, “monocytes_percent”,“international_standard_ratio”

Figure 2 - “outcome”,“amino_terminal_brain_natriuretic_peptide_precursor_nt_pro_bnp”, “thrombocytocrit”,“hypersensitive_cardiac_troponin_i”,“fibrin_degradation_products”,“procalcitonin”, “prothrombin_activity”,“d_d_dimer”,“aspartate_aminotransferase”,“neutrophils_count”, “high_sensitivity_c_reactive_protein”

#head(df_clean)
#ensures we are starting with the clean data set
df <- df_clean 
cor_df <- df[,c(2:4,7:80)]
#glimpse(cor_df)
#head(cor_df)


require(GGally)
require(plotly)
vars1 <- c("outcome","percent_lymphocyte","lactate_dehydrogenase","lymphocyte_count","albumin",
           "neutrophils_percent","eosinophils_percent","eosinophil_count","x2019_n_co_v_nucleic_acid_detection",
           "direct_bilirubin","thrombin_time","monocytes_percent","international_standard_ratio")
cor_df_vars1 <- cor_df[,c(vars1)]
ggcorr(cor_df_vars1,label = TRUE, label_size = 3,hjust = 0.75, size = 1.5,  method = c("everything", "pearson")) + gbl_ggtheme

vars1b <- c("outcome","percent_lymphocyte","lactate_dehydrogenase","x2019_n_co_v_nucleic_acid_detection",
           "direct_bilirubin","thrombin_time","monocytes_percent")
# cor_df_vars1b <- cor_df[,c(vars1b)]
# ggp2 <- ggpairs(cor_df_vars1b)  + gbl_ggtheme
# ggp2 <- ggpairs(cor_df_vars1b, ggplot2::aes(colour="#E64141")) +gbl_ggtheme
# ggp2
#ggplotly(ggp2)

vars2 <- c("outcome","amino_terminal_brain_natriuretic_peptide_precursor_nt_pro_bnp",
           "thrombocytocrit","hypersensitive_cardiac_troponin_i","fibrin_degradation_products","procalcitonin",
           "prothrombin_activity","d_d_dimer","aspartate_aminotransferase","neutrophils_count",
           "high_sensitivity_c_reactive_protein")   

#cor_df[,c(vars1)]
cor_df_vars2 <- cor_df[,c(vars2)]
#cor_df_vars2
ggcorr(cor_df_vars2, label = TRUE, label_size = 3,hjust = 0.75, size = 1.5, method = c("everything", "pearson")) + gbl_ggtheme

vars2b <-  c("outcome","thrombocytocrit","hypersensitive_cardiac_troponin_i","fibrin_degradation_products",
           "prothrombin_activity", "high_sensitivity_c_reactive_protein")
# cor_df_vars2b <- cor_df[,c(vars2b)]
# ggp2 <- ggpairs(cor_df_vars2b, ggplot2::aes(colour="#E64141")) +gbl_ggtheme
# ggp2
#set value to 3 since the dependent is in the 3rd position
#dependent_var <- c(1)
#paste("variable number ",dependent_var," is = ",names(cor_df)[3],sep="")

#cor_df1 <- cor_df_vars2[,c(dependent_var,6:17)]
#corrplot(cor(cor_df_vars2), method = "square")
#ggcorr(cor_df_vars2, method = c("everything", "pearson")) + gbl_ggtheme
#chart.Correlation(cor_df_vars2[,c(1,2:5)])
#ggpairs(cor_df_vars2[,c(1,2:5)], title="correlogram with ggpairs()") #+ gbl_ggtheme
#
# cor_df2 <- df[,c(dependent_var,18:28)]
# #corrplot(cor(cor_df2), method = "square")
# ggcorr(cor_df2, method = c("everything", "pearson")) + gbl_ggtheme
# 
# cor_df3 <- df[,c(dependent_var,29:39)]
# #corrplot(cor(cor_df3), method = "square")
# ggcorr(cor_df3, method = c("everything", "pearson")) + gbl_ggtheme
# 
# cor_df4 <- df[,c(dependent_var,40:50)]
# #corrplot(cor(cor_df4), method = "square")
# ggcorr(cor_df4, method = c("everything", "pearson")) + gbl_ggtheme
# 
# cor_df5 <- df[,c(dependent_var,51:61)]
# #corrplot(cor(cor_df5), method = "square")
# ggcorr(cor_df5, method = c("everything", "pearson")) + gbl_ggtheme
# 
# cor_df6 <- df[,c(dependent_var,62:72)]
# #corrplot(cor(cor_df6), method = "square")
# ggcorr(cor_df6, method = c("everything", "pearson")) + gbl_ggtheme
# 
# cor_df7 <- df[,c(dependent_var,73:80)]
# #corrplot(cor(cor_df7), method = "square")
# ggcorr(cor_df7, method = c("everything", "pearson")) + gbl_ggtheme


paste("")
## [1] ""

Kmeans Clustering

The kmeans() function returns a list of components, including:

  • cluster: A vector of integers (from 1:k) indicating the cluster to which each point is allocated

  • centers: A matrix of cluster centers (cluster means)

  • totss: The total sum of squares (TSS), i.e. ∑(xᵢ − x̄)², which measures the total variance in the data

  • withinss: Vector of within-cluster sums of squares, one component per cluster

  • tot.withinss: Total within-cluster sum of squares, i.e. sum(withinss)

  • betweenss: The between-cluster sum of squares, i.e. totss − tot.withinss

  • size: The number of observations in each cluster
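
A minimal sketch of these components in action, using a built-in toy data set rather than the patient data:

set.seed(42)
# cluster two scaled columns of the built-in mtcars data into k = 3 groups
km_demo <- kmeans(scale(mtcars[, c("mpg", "wt")]), centers = 3, nstart = 25)

km_demo$size          # observations per cluster
km_demo$centers       # matrix of cluster means
km_demo$tot.withinss  # total within-cluster sum of squares, sum(km_demo$withinss)
km_demo$betweenss     # equals km_demo$totss - km_demo$tot.withinss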

require(NbClust)


df <- df_clean 
#df
df <- df[,c(7:80)] # 
dim(df)
## [1] 375  74
#df
#df %>% select(-patient_id,-admission_time,-discharge_time)
df_subset1 <- df #%>% select(-gender,-outcome)
#df_subset1
df <- df_subset1
head(df[,c(1:5)]) %>% regulartable() %>% theme_zebra() %>%  autofit()

hypersensitive_cardiac_troponin_i  hemoglobin  serum_chloride  prothrombin_time  procalcitonin
                             19.9         140           103.1              14.1           0.09
                             16.9         151           100.5              14.3           0.09
                              0.0         126           102.9              13.6           0.06
                              4.8         110           103.1              16.3           0.38
                              5.6         134           102.2              14.6           0.02
                             19.7         108           105.8              12.4           0.10

Cluster Build

X<- df
X_norm <- apply(X, 1, function(i) round(i/sum(i),3))
#X_norm
#str(X)

#kmeans function
kmeansCluster <- function (mtrx,vcenters,v_nstart = 50,v_iter.max = 150) {  
km.res <- kmeans(mtrx, centers = vcenters, nstart = v_nstart, iter.max = v_iter.max)


z<-fviz_cluster(km.res, data = X, geom = "point", stand = FALSE, ellipse.type = "norm",main = paste("Kmeans cluster plot vcenters=",vcenters)) + theme_minimal()

z2 <- fviz_cluster(km.res, data = mtrx, geom = "point", stand = FALSE, ellipse.type = "norm",   outlier.color = "black",main = paste("Kmeans cluster plot vcenters=",vcenters), show.clust.cent = TRUE) + theme_minimal()

s <- summary(km.res)
return(list(z,s))


s <- summary(km.res) 
return(list(z2,s))
#return(c(x))
}
 k2 <-kmeansCluster(X,2)
 k2
## [[1]]

## 
## [[2]]
##              Length Class  Mode   
## cluster      375    -none- numeric
## centers      148    -none- numeric
## totss          1    -none- numeric
## withinss       2    -none- numeric
## tot.withinss   1    -none- numeric
## betweenss      1    -none- numeric
## size           2    -none- numeric
## iter           1    -none- numeric
## ifault         1    -none- numeric
 k3 <-kmeansCluster(X,3)
 k3
## [[1]]
## Too few points to calculate an ellipse

## 
## [[2]]
##              Length Class  Mode   
## cluster      375    -none- numeric
## centers      222    -none- numeric
## totss          1    -none- numeric
## withinss       3    -none- numeric
## tot.withinss   1    -none- numeric
## betweenss      1    -none- numeric
## size           3    -none- numeric
## iter           1    -none- numeric
## ifault         1    -none- numeric
 k4 <-kmeansCluster(X,4)
 k4
## [[1]]
## Too few points to calculate an ellipse

## 
## [[2]]
##              Length Class  Mode   
## cluster      375    -none- numeric
## centers      296    -none- numeric
## totss          1    -none- numeric
## withinss       4    -none- numeric
## tot.withinss   1    -none- numeric
## betweenss      1    -none- numeric
## size           4    -none- numeric
## iter           1    -none- numeric
## ifault         1    -none- numeric
 k5 <-kmeansCluster(X,5)
 k5
## [[1]]
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse

## 
## [[2]]
##              Length Class  Mode   
## cluster      375    -none- numeric
## centers      370    -none- numeric
## totss          1    -none- numeric
## withinss       5    -none- numeric
## tot.withinss   1    -none- numeric
## betweenss      1    -none- numeric
## size           5    -none- numeric
## iter           1    -none- numeric
## ifault         1    -none- numeric
 # k6 <-kmeansCluster(X,6)
 # k6

Elbow Plot Result

The plot represents the variance within the clusters. It decreases as k increases, and a bend (or “elbow”) can be seen at k = 4. This bend indicates that additional clusters beyond the fourth have little value. In the next section, we’ll classify the observations into 4 clusters.

k<-2:7 #K from 2 to 7
tries <-10 #Run the K Means algorithm n times
avg.totw.ss <-integer(length(k)) #Set up an empty vector to hold all of points
for(v in k){ # For each value of the range variable
 v.totw.ss <-integer(tries) #Set up an empty vector to hold the 100 tries
 for(i in 1:tries){
 k.temp <-kmeans(X,centers=v) #Run kmeans
 v.totw.ss[i] <-k.temp$tot.withinss#Store the total withinss
 }
 avg.totw.ss[v-1] <-mean(v.totw.ss) #Average the 100 total withinss

}

df <- as.data.frame(cbind(k,avg.totw.ss)) 
ggplot(df,aes(x=k, y=avg.totw.ss)) + geom_point() + 
  geom_line(aes(color="Elbow"))  + 
  ggtitle("Total Within SS by Various K",subtitle = "Elbow") +  theme_bw()

#plot(rng,avg.totw.ss,type="b", main="Total Within SS by Various K", ylab="Average Total Within Sum of Squares", xlab="Value of K")

Kmeans Results

Results using k=4 clustering with patient age and 74 biomarkers

set.seed(seed)

km.res <- NULL
km.res <- kmeans(X, 4, nstart = 50,iter.max = 150)
#km.res4 <- kmeans(X, 4)
#summary(km.res)
#glimpse(km.res)

z2 <- fviz_cluster(km.res, data = X, geom = "point", stand = FALSE, ellipse.type = "norm",   outlier.color = "black",main = "Kmeans cluster plot vcenters=4", show.clust.cent = TRUE) + theme_minimal()
z2

Samples from the 4 clusters are displayed below in cluster order. Only 5 of the 74 biomarkers are shown. Clusters 4 and 2 hold most of the observations, and we know only a handful of the 74 biomarkers proved important. This suggests the small clusters 1 and 3 were defined by how those important features set them apart from the rest.

  • hypersensitive_cardiac_troponin_i: This is the preferred biomarker for acute myocardial infarction. It helps diagnose acute coronary syndromes in those who are critically ill. 

  • amino_terminal_brain_natriuretic_peptide_precursor_nt_pro_bnp: Natriuretic peptide tests measure the levels of brain natriuretic peptide (BNP) and N-terminal pro B-type natriuretic peptide (NT-proBNP) in the blood. These are substances produced by the heart, and small amounts are normally found in the bloodstream. High levels of BNP or NT-proBNP indicate that the heart is failing to pump the appropriate amount of blood.

  • lactate_dehydrogenase: LDH is an enzyme released when there is tissue damage in a patient’s body. This marker measures the amount of LDH in blood or other body fluids and is vital in monitoring viral conditions or cancer.

  • ferritin: Ferritin is the major intracellular iron storage protein in all organisms. It binds free iron ions, neutralising their toxic properties and increasing their solubility. In the soluble form, the body can draw on iron as needed, in particular for regulation of cellular oxygen metabolism. Elevated levels of ferritin, or hyperferritinemia, indicate the presence of viruses or bacteria in the body.

Impvars <- c("amino_terminal_brain_natriuretic_peptide_precursor_nt_pro_bnp",
           "thrombocytocrit","hypersensitive_cardiac_troponin_i","fibrin_degradation_products","procalcitonin",
           "prothrombin_activity","d_d_dimer","aspartate_aminotransferase","neutrophils_count",
           "high_sensitivity_c_reactive_protein","thrombocytocrit","hypersensitive_cardiac_troponin_i","fibrin_degradation_products",
           "prothrombin_activity", "high_sensitivity_c_reactive_protein","ferritin")

#head(X[,Impvars])
dd <- cbind(X, cluster = km.res$cluster)
dd <- dd %>% select("cluster","lactate_dehydrogenase","amino_terminal_brain_natriuretic_peptide_precursor_nt_pro_bnp","hypersensitive_cardiac_troponin_i", "high_sensitivity_c_reactive_protein","ferritin")

dd.names <-names(dd) #back up original names
dfname <-dd.names
names2 <- base::abbreviate( dfname,25)
#names2 #checking abbreviations
names(dd) <- names2
#print(km.res) 
#str(km.res)
#km.res$cluster %>% regulartable() %>% theme_zebra() %>%  autofit()
#cluster centers
#km.res$centers  %>% regulartable() %>% theme_zebra() %>%  autofit()
#head(dd)
#dim(dd)

k=4 cluster sizes:

  • Cluster 1: 3 observations - the smallest

  • Cluster 2: 42 observations - the second largest

  • Cluster 3: 6 observations - very small as well

  • Cluster 4: 324 observations - the largest

The smaller clusters have elevated markers, while the larger clusters do not. Note: long biomarker names have been abbreviated to 25 characters.

#Create biomarker aggregate with means
agg_clus <- aggregate(X, by=list(cluster=km.res$cluster), mean)

#abbreviate long biomarker names to 25 chars make analysis easier
agg_clus.names <-names(agg_clus) #back up original names
agg_clus.dfname <- agg_clus.names
agg_clus.names2 <- base::abbreviate( agg_clus.dfname,25)
#names2 #checking abbreviations
names(agg_clus) <- agg_clus.names2

Impvars_short <- base::abbreviate( Impvars,25)
agg_clus[,Impvars_short] %>% regulartable() %>% theme_zebra() %>%  autofit()

cluster  amn_trmnl_brn_ntrrtc_____  thrombocytocrit  hypersensitiv_crdc_trpnn_  fibrin_degradation_prdcts  procalcitonin  prothrombin_activity  d_d_dimer  aspartate_aminotransferas  neutrophils_count  high_senstvty_c_rctv_prtn  ferritin
      1                 35716.3333       0.13666667                 50000.0000                  150.00000      0.3700000              74.33333  21.000000                   228.6667          19.076667                  130.90000  6594.5333
      2                 10561.6429       0.17000000                  2423.6452                   84.70000      5.4209524              71.42857  15.254762                   128.8571          15.698333                  183.77619  2145.2833
      3                 62566.0000       0.09166667                  4629.9000                   53.90000      4.4066667              70.66667  12.835000                   377.3333          14.578333                  100.15000  5100.3000
      4                   395.8272       0.24037037                   164.3849                   20.98611      0.4383951              83.32099   5.439753                    46.4321           7.710679                   73.41235   589.4793

clus_id1 <- dd %>% filter(cluster == 1) # %>% head(3) #%>% regulartable() %>% theme_zebra() %>%  autofit()
head(clus_id1[,c(1:6)]) %>% kable() %>% kable_styling()
cluster lactate_dehydrogenase amn_trmnl_brn_ntrrtc_____ hypersensitiv_crdc_trpnn_ high_senstvty_c_rctv_prtn ferritin
1 1867 20906 50000 113.1 1935.6
1 1500 70000 50000 87.3 0.0
1 785 16243 50000 192.3 17848.0
clus_id2 <- dd %>% filter(cluster == 2) #%>% head(3) #%>% regulartable() %>% theme_zebra() %>%  autofit()
head(clus_id2[,c(1:6)]) %>% kable() %>% kable_styling()
cluster lactate_dehydrogenase amn_trmnl_brn_ntrrtc_____ hypersensitiv_crdc_trpnn_ high_senstvty_c_rctv_prtn ferritin
2 208 11017 0.0 24.3 0.0
2 187 13006 124.0 2.1 0.0
2 310 22525 617.0 160.2 545.7
2 1772 9310 46.3 274.1 2578.7
2 890 7447 304.2 86.1 2661.8
2 492 11534 1317.8 196.9 0.0
clus_id3 <-dd %>% filter(cluster == 3) #%>% head(3) #%>% regulartable() %>% theme_zebra() %>%  autofit()
head(clus_id3[,c(1:6)]) %>% kable() %>% kable_styling()
cluster lactate_dehydrogenase amn_trmnl_brn_ntrrtc_____ hypersensitiv_crdc_trpnn_ high_senstvty_c_rctv_prtn ferritin
3 284 70000 85.5 81.7 0.0
3 1752 40096 2.6 84.6 15718.0
3 1867 70000 523.1 36.7 0.0
3 1126 55300 3404.7 138.8 4995.6
3 1867 70000 21953.4 259.1 9888.2
3 702 70000 1810.1 0.0 0.0
clus_id4 <-dd %>% filter(cluster == 4) # %>% head(3) #%>% regulartable() %>% theme_zebra() %>%  autofit()
head(clus_id4[,c(1:6)]) %>% kable() %>% kable_styling()
cluster lactate_dehydrogenase amn_trmnl_brn_ntrrtc_____ hypersensitiv_crdc_trpnn_ high_senstvty_c_rctv_prtn ferritin
4 306 60 19.9 43.1 634.9
4 738 173 16.9 27.4 1512.9
4 328 0 0.0 42.3 567.2
4 338 152 4.8 108.2 0.0
4 195 5 5.6 7.0 121.1
4 268 645 19.7 29.3 1124.6

Boxplot summaries of the most important biomarkers, by cluster.

# kmeansplot(dd,"lactate_dehydrogenase")
ggplot(data=dd) +
geom_boxplot(aes(x=cluster,y=lactate_dehydrogenase,group=cluster))+
facet_grid(cluster ~ ., scales = "free", space = "free")+ gbl_ggtheme + coord_flip()

# kmeansplot(dd,"amn_trmnl_brn_ntrrtc_____")
ggplot(data=dd) +
geom_boxplot(aes(x=cluster,y=amn_trmnl_brn_ntrrtc_____,group=cluster))+
facet_grid(cluster ~ ., scales = "free", space = "free")+ gbl_ggtheme + coord_flip()

# kmeansplot(dd,"hypersensitiv_crdc_trpnn_")
ggplot(data=dd) +
geom_boxplot(aes(x=cluster,y=hypersensitiv_crdc_trpnn_,group=cluster))+
facet_grid(cluster ~ ., scales = "free", space = "free")+ gbl_ggtheme + coord_flip() 

# kmeansplot(dd,"high_senstvty_c_rctv_prtn")
ggplot(data=dd) +
geom_boxplot(aes(x=cluster,y=high_senstvty_c_rctv_prtn,group=cluster))+
facet_grid(cluster ~ ., scales = "free", space = "free")+ gbl_ggtheme + coord_flip() 

# kmeansplot(dd,"ferritin")
ggplot(data=dd) +
geom_boxplot(aes(x=cluster,y=ferritin,group=cluster))+
facet_grid(cluster ~ ., scales = "free", space = "free")+ gbl_ggtheme + coord_flip() 

Quantile statistics for the most important biomarkers.

df_stat <- dd
dim(df_stat)
## [1] 375   6
dfIntCols <- df_stat %>% select_if(., is.numeric)
names(dfIntCols)
## [1] "cluster"                   "lactate_dehydrogenase"    
## [3] "amn_trmnl_brn_ntrrtc_____" "hypersensitiv_crdc_trpnn_"
## [5] "high_senstvty_c_rctv_prtn" "ferritin"
print('===============================================================')
## [1] "==============================================================="
print('===============================================================')
## [1] "==============================================================="
statInfo <- function(a) {
  for (i in colnames(a)){
  paste('Column: ', colnames(df_stat[i]), sep=',')
  print('---------------------------------------------------------------')
  print(colnames(df_stat[i]))
  print('---------------------------------------------------------------')
  printVecInfo(df_stat[[i]])
   }
}   

statInfo(dfIntCols)
## [1] "---------------------------------------------------------------"
## [1] "cluster"
## [1] "---------------------------------------------------------------"
## [1] "mean:  3.736"
## [1] "median:  4"
## [1] "min:  1"
## [1] "max:  4"
## [1] "range:  3"
## [1] "sd:  0.68377686575303"
## [1] "quantile:  1" "quantile:  4" "quantile:  4" "quantile:  4" "quantile:  4"
## [1] "IQR:  0"
## [1] "---------------------------------------------------------------"
## [1] "lactate_dehydrogenase"
## [1] "---------------------------------------------------------------"
## [1] "mean:  519.592"
## [1] "median:  355"
## [1] "min:  0"
## [1] "max:  1867"
## [1] "range:  1867"
## [1] "sd:  436.03199416537"
## [1] "quantile:  0"     "quantile:  239.5" "quantile:  355"   "quantile:  661"  
## [5] "quantile:  1867" 
## [1] "IQR:  421.5"
## [1] "---------------------------------------------------------------"
## [1] "amn_trmnl_brn_ntrrtc_____"
## [1] "---------------------------------------------------------------"
## [1] "mean:  2811.68533333333"
## [1] "median:  99"
## [1] "min:  0"
## [1] "max:  70000"
## [1] "range:  70000"
## [1] "sd:  9387.95662943333"
## [1] "quantile:  0"     "quantile:  0"     "quantile:  99"    "quantile:  941"  
## [5] "quantile:  70000"
## [1] "IQR:  941"
## [1] "---------------------------------------------------------------"
## [1] "hypersensitiv_crdc_trpnn_"
## [1] "---------------------------------------------------------------"
## [1] "mean:  887.5552"
## [1] "median:  4.8"
## [1] "min:  0"
## [1] "max:  50000"
## [1] "range:  50000"
## [1] "sd:  4933.42370112608"
## [1] "quantile:  0"     "quantile:  1.9"   "quantile:  4.8"   "quantile:  47.2" 
## [5] "quantile:  50000"
## [1] "IQR:  45.3"
## [1] "---------------------------------------------------------------"
## [1] "high_senstvty_c_rctv_prtn"
## [1] "---------------------------------------------------------------"
## [1] "mean:  86.6608"
## [1] "median:  54.1"
## [1] "min:  0"
## [1] "max:  320"
## [1] "range:  320"
## [1] "sd:  87.9859750272583"
## [1] "quantile:  0"     "quantile:  10.1"  "quantile:  54.1"  "quantile:  144.4"
## [5] "quantile:  320"  
## [1] "IQR:  134.3"
## [1] "---------------------------------------------------------------"
## [1] "ferritin"
## [1] "---------------------------------------------------------------"
## [1] "mean:  883.942933333333"
## [1] "median:  147.3"
## [1] "min:  0"
## [1] "max:  50000"
## [1] "range:  50000"
## [1] "sd:  3052.68673704692"
## [1] "quantile:  0"      "quantile:  0"      "quantile:  147.3" 
## [4] "quantile:  881.95" "quantile:  50000" 
## [1] "IQR:  881.95"
#dd[,c(76,1:10)] %>% arrange(cluster) %>% head() %>% kable() %>% kable_styling()
# dd[,c(76,11:20)] %>% arrange(cluster) %>% head() %>% kable() %>% kable_styling()
# dd[,c(76,21:30)] %>% arrange(cluster) %>% head() %>% kable() %>% kable_styling()
# dd[,c(76,31:40)] %>% arrange(cluster) %>% head() %>% kable() %>% kable_styling()
# dd[,c(76,41:50)] %>% arrange(cluster) %>% head() %>% kable() %>% kable_styling()
# dd[,c(76,51:60)] %>% arrange(cluster) %>% head() %>% kable() %>% kable_styling()
# dd[,c(76,61:70)] %>% arrange(cluster) %>% head() %>% kable() %>% kable_styling()
# dd[,c(76,71:74)] %>% arrange(cluster) %>% head() %>% kable() %>% kable_styling()

Principal Component Analysis

Principal Component Analysis (PCA) is a multivariate technique that allows us to summarize the systematic patterns of variations in the data.

From a data analysis standpoint, PCA is used for studying one table of observations and variables, with the main idea of transforming the observed variables into a set of new variables, the principal components, which are uncorrelated and explain the variation in the data. For this reason, PCA allows us to reduce a “complex” data set to a lower dimension in order to reveal the structures or the dominant types of variation in both the observations and the variables. Note: the analysis starts by resetting to the clean data set.

Data Prep

df <- df_clean 

df <- df[,c(4,7:80)] # %>% select(-patient_id,-admission_time,-discharge_time)

#rename outcome to Survived
names(df)[1] <- c("Survived")

df$Survived <- factor(df$Survived)

#Discretize Survived
df$Survived=dplyr::recode(df$Survived, `0`="Deceased", `1`="Survived")
colnames(df) <- make.names(colnames(df))

head(df[,1:5]) %>% kable %>% kable_styling()
Survived hypersensitive_cardiac_troponin_i hemoglobin serum_chloride prothrombin_time
Deceased 19.9 140 103.1 14.1
Deceased 16.9 151 100.5 14.3
Deceased 0.0 126 102.9 13.6
Deceased 4.8 110 103.1 16.3
Deceased 5.6 134 102.2 14.6
Deceased 19.7 108 105.8 12.4

There are several methods for doing PCA in R, but we’re going to use the princomp() method, since it’s in base R’s stats package.

Typical PCA results should consist of the following:

  • A set of eigenvalues

  • A table with the scores or Principal Components (PCs)

  • A table of loadings (or correlations between variables and PCs).

The eigenvalues provide information about the variability in the data. The scores provide information about the structure of the observations. The loadings (or correlations) give a sense of the relationships between variables, as well as their associations with the extracted PCs.
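
As a quick orientation to where these three pieces live in a princomp object, here is a small sketch on the built-in USArrests data (not the biomarker set):

pca_demo <- princomp(USArrests)

pca_demo$sdev^2             # eigenvalues: variances of the principal components
head(pca_demo$scores)       # scores: the observations expressed in PC space
unclass(pca_demo$loadings)  # loadings: weights relating variables to the PCs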

features<-df[,-1] #remove the label
pca<-princomp(features)
std_dev <- pca[1:74]$sdev
pr_var <- std_dev^2
prop_varex <- pr_var/sum(pr_var)
res.pca <- pca

#pca

Plots

As seen in the first plot below, the first 15 components explain most of the variation in the data. The second plot indicates 15-20 components should be adequate for the modeling data sets.

require(factoextra)
require(plotly)
plot(cumsum(prop_varex[1:74]), xlab = "Principal Component",
     ylab = "Cumulative Proportion of Variance Explained",
     type = "b",
     col = 4)

plot(cumsum(prop_varex[1:20]), xlab = "Principal Component",
     ylab = "Cumulative Proportion of Variance Explained",
     type = "b",
     col = 4)

The scree plot shows the first component explaining 76.3% of the variance. The line drops off quickly from there, which bolsters the case that we can use far fewer than 74 features to get a good prediction.

require(factoextra)
require(plotly)
res.pca <- pca
#fplt <- fviz_eig(res.pca, addlabels = TRUE)
fplt <- fviz_screeplot(res.pca, addlabels = TRUE)
#ggplotly(fplt)
fplt

Contribution Analysis

Since the majority of the variance is explained by the first few components, a contribution analysis is performed on the first three here.

Top contributors: hypersensitive_cardiac_troponin_i, amino_terminal_brain_natriuretic_peptide_precursor_nt_pro_bnp, lactate_dehydrogenase, and ferritin. These are the same four biomarkers described in the k-means results above.

fviz_pca_var(res.pca, col.var="contrib",
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = FALSE # set TRUE to avoid text overlapping
             )

# Contributions of variables to PC1
fviz_contrib(res.pca, choice = "var", axes = 1, top = 3)

# Contributions of variables to PC2
fviz_contrib(res.pca, choice = "var", axes = 2, top = 3)

# Contributions of variables to PC3
fviz_contrib(res.pca, choice = "var", axes = 3, top = 3)

Caret Machine Learning

Classification Models

Several classification models will be leveraged and compared using the Covid-19 patient biomarker data. Classification in this case predicts a binary outcome: case fatality or survival. The models will be trained in supervised mode using a dependent variable we will call label, which holds 0 for fatal and 1 for survived outcomes. The models will then be tested to determine whether they can classify case fatality and survival without the benefit of the labels.

The classification models to be used are listed below:

  • Decision Tree(rpart)

  • Random Forest(rf)

  • Naive Bayes(nb)

  • K Nearest Neighbors(knn)

  • Support Vector Machines(svm)

  • Gradient Boosting Machines(gbm)

  • Classification Based On Association Rules (cba)

The models will each be described further as they are used.

Checking for Imbalance

Checking the dependent variable “Survived” to ensure imbalance will not be a problem. The output indicates there is a good balance of fatal versus survived outcomes in our data set.

#summary(df$Survived)

survived_df <- as.data.frame(prop.table(table(df$Survived))) 
names(survived_df)[1:2] <- c("Survived","Percentage")
survived_df
##   Survived Percentage
## 1 Deceased      0.536
## 2 Survived      0.464

Data Splitting

The first 20 principal components are used for the modeling data sets, after determining that this yields the best predictions without losing the predictive quality of using all 74 features.

nDim <- 21

set.seed(seed)
new_df<-data.frame(label = df[, "Survived"], pca$scores)
df2 <- new_df
dim(df2)
## [1] 375  75
pca_df<- new_df[,1:nDim]
samp_size <-floor(0.80* nrow(pca_df))
train_ind <-sample(seq_len(nrow(pca_df)), size = samp_size)
train <- pca_df[train_ind,]
dim(train)
## [1] 300  21
test  <-pca_df[-train_ind,]
dim(test)
## [1] 75 21
names(pca_df)[1] <- c("label")
names(df)[1] <- c("label")

The data will be split into 80/20 for training and test sets.

##   dimension_of_training_dataset dimension_of_testing_dataset
## 1                           300                           75
## 2                            21                           21
## [1] 300  21
## [1] 75 21
## [1] 75 21
# #view the data sets
# head(train)
# head(test)
# head(hold)

Model Controls

Model controls will consist of 5-fold cross validation, with the fit metric set to “Accuracy.”

## [conflicted] Removing existing preference
## [conflicted] Will prefer caret::confusionMatrix over any other package
## Loading required package: foreach
## 
## Attaching package: 'foreach'
## The following objects are masked from 'package:purrr':
## 
##     accumulate, when
## Loading required package: iterators
## Loading required package: parallel
## Loading required package: xts
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## 
## Attaching package: 'xts'
## The following object is masked from 'package:leaflet':
## 
##     addLegend
## The following objects are masked from 'package:data.table':
## 
##     first, last
## The following objects are masked from 'package:dplyr':
## 
##     first, last
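
The control chunk itself is not echoed above. A plausible definition, consistent with the 5-fold resampling and Accuracy metric reported by the models below (and with the saved class probabilities MLeval needs later), would be a sketch along these lines; maincontrol, fitmetric, and seed are the names the later chunks assume, and the seed value is assumed:

library(caret)

seed <- 123  # assumed value; any fixed seed makes the folds reproducible
set.seed(seed)

#5-fold cross validation, saving predictions and class probabilities for MLeval
maincontrol <- trainControl(method = "cv", number = 5,
                            classProbs = TRUE, savePredictions = TRUE)
fitmetric <- "Accuracy"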

Model Tuning

Reviewing the models and the different parameters that can be tuned. Caret will be used with model training controls to run the algorithms under 5-fold cross validation and various parameter tuning. The most important features are printed after each model is trained for additional information.

x1<- modelLookup(model='rpart')
x2<-modelLookup(model='rf')
x3<-modelLookup(model='nb')
x4<-modelLookup(model='knn')
x5<-modelLookup(model='svmRadial')
x6<-modelLookup(model='gbm')

mlist <-rbind(x1,x2,x3,x4,x5,x6)
mlist
##        model         parameter                         label forReg forClass
## 1      rpart                cp          Complexity Parameter   TRUE     TRUE
## 2         rf              mtry #Randomly Selected Predictors   TRUE     TRUE
## 3         nb                fL            Laplace Correction  FALSE     TRUE
## 4         nb         usekernel             Distribution Type  FALSE     TRUE
## 5         nb            adjust          Bandwidth Adjustment  FALSE     TRUE
## 6        knn                 k                    #Neighbors   TRUE     TRUE
## 7  svmRadial             sigma                         Sigma   TRUE     TRUE
## 8  svmRadial                 C                          Cost   TRUE     TRUE
## 9        gbm           n.trees         # Boosting Iterations   TRUE     TRUE
## 10       gbm interaction.depth                Max Tree Depth   TRUE     TRUE
## 11       gbm         shrinkage                     Shrinkage   TRUE     TRUE
## 12       gbm    n.minobsinnode       Min. Terminal Node Size   TRUE     TRUE
##    probModel
## 1       TRUE
## 2       TRUE
## 3       TRUE
## 4       TRUE
## 5       TRUE
## 6       TRUE
## 7       TRUE
## 8       TRUE
## 9       TRUE
## 10      TRUE
## 11      TRUE
## 12      TRUE

Model Build

We are now ready to train and test the classifiers, running a comparable cross validation experiment for each model so the results can be compared. For the decision tree below, different parameters and prunings are tried to get varied results.

Note: The most important features are printed after each model is trained for additional information.

Decision Tree

  • Decision Tree - A tree in which each internal (non-leaf) node is labeled with an input feature. The arcs coming from a node labeled with an input feature are labeled with each of the possible values of the target feature or the arc leads to a subordinate decision node on a different input feature.
library(rpart)

#5-fold CV via the shared maincontrol object
control <- maincontrol
#Train Tree Model 2 CV
ptm <- proc.time()
fit.tree <- train(label~ ., data = train, method="rpart", metric=fitmetric, trControl=control,control=rpart.control(cp = 0.4))
fit.tree
## CART 
## 
## 300 samples
##  20 predictor
##   2 classes: 'Deceased', 'Survived' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 240, 240, 241, 239, 240 
## Resampling results across tuning parameters:
## 
##   cp          Accuracy   Kappa    
##   0.04347826  0.8397610  0.6815205
##   0.13768116  0.7701436  0.5319054
##   0.62318841  0.7159063  0.4094905
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.04347826.
proc.time() - ptm
##    user  system elapsed 
##    1.68    0.03    1.89
#fancy tree plots
suppressMessages(library(rattle))
fancyRpartPlot(fit.tree$finalModel)

#summary(fit.tree2$finalModel)

#important variables
treeImp <- varImp(fit.tree)
#treeImp
ggplot(treeImp) + gbl_ggtheme

Gradient Boosting Machine

  • Gradient Boosting Machines - Gradient boosting is a machine learning technique for regression and classification problems, which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees.
require(gbm)
## Loading required package: gbm
## Loaded gbm 2.1.5
#Train Random Forest CV
control <- maincontrol

gbmGrid <-  expand.grid(interaction.depth = c(1, 5, 9), 
                        n.trees = (1:30)*50, 
                        shrinkage = 0.1,
                        n.minobsinnode = 20)
                        
nrow(gbmGrid)
## [1] 90
ptm <- proc.time()
fit.gbm <- train(label ~ ., data = train, 
                 method = 'gbm', 
                 tuneGrid = gbmGrid,
                 trControl = control,
                 ## This last option is actually one
                 ## for gbm() that passes through
                 verbose = FALSE)
fit.gbm
## Stochastic Gradient Boosting 
## 
## 300 samples
##  20 predictor
##   2 classes: 'Deceased', 'Survived' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 241, 239, 241, 240, 239 
## Resampling results across tuning parameters:
## 
##   interaction.depth  n.trees  Accuracy   Kappa    
##   1                    50     0.8869760  0.7722846
##   1                   100     0.8934204  0.7850006
##   1                   150     0.9133130  0.8257010
##   1                   200     0.9000871  0.7982149
##   1                   250     0.9065351  0.8118068
##   1                   300     0.9132583  0.8254766
##   1                   350     0.9068667  0.8121216
##   1                   400     0.9232602  0.8457358
##   1                   450     0.9034787  0.8054518
##   1                   500     0.9133148  0.8258101
##   1                   550     0.9099250  0.8189050
##   1                   600     0.9165916  0.8323539
##   1                   650     0.9232602  0.8456811
##   1                   700     0.9232055  0.8455723
##   1                   750     0.9297629  0.8588103
##   1                   800     0.9133695  0.8253909
##   1                   850     0.9232055  0.8455723
##   1                   900     0.9133695  0.8253909
##   1                   950     0.9199268  0.8389179
##   1                  1000     0.9068121  0.8118238
##   1                  1050     0.9068121  0.8118238
##   1                  1100     0.9166481  0.8322628
##   1                  1150     0.9098685  0.8186061
##   1                  1200     0.8966426  0.7911592
##   1                  1250     0.9067009  0.8116982
##   1                  1300     0.9164259  0.8316007
##   1                  1350     0.9034222  0.8048151
##   1                  1400     0.9165370  0.8319092
##   1                  1450     0.9067009  0.8116171
##   1                  1500     0.9165370  0.8319092
##   5                    50     0.9103130  0.8201424
##   5                   100     0.9004770  0.7997087
##   5                   150     0.9136464  0.8264013
##   5                   200     0.9234824  0.8462389
##   5                   250     0.9202056  0.8399207
##   5                   300     0.9103695  0.8189685
##   5                   350     0.9168723  0.8331420
##   5                   400     0.9167611  0.8331009
##   5                   450     0.9069251  0.8125138
##   5                   500     0.9167611  0.8332795
##   5                   550     0.9069251  0.8128074
##   5                   600     0.9036464  0.8059687
##   5                   650     0.9133713  0.8261328
##   5                   700     0.9036464  0.8056365
##   5                   750     0.9167611  0.8331009
##   5                   800     0.9100926  0.8194051
##   5                   850     0.9002566  0.7986684
##   5                   900     0.9002566  0.7986684
##   5                   950     0.9100926  0.8194051
##   5                  1000     0.9035352  0.8055457
##   5                  1050     0.9002566  0.7986684
##   5                  1100     0.9100926  0.8194051
##   5                  1150     0.9035352  0.8055457
##   5                  1200     0.9133713  0.8261328
##   5                  1250     0.9035352  0.8055457
##   5                  1300     0.9035352  0.8055457
##   5                  1350     0.9133713  0.8261328
##   5                  1400     0.9001454  0.7986578
##   5                  1450     0.9099815  0.8192450
##   5                  1500     0.9133148  0.8259852
##   9                    50     0.9002001  0.7997006
##   9                   100     0.9102019  0.8192672
##   9                   150     0.9333750  0.8665816
##   9                   200     0.9202603  0.8396702
##   9                   250     0.9235389  0.8460798
##   9                   300     0.9268176  0.8530253
##   9                   350     0.9137029  0.8260286
##   9                   400     0.9268176  0.8530253
##   9                   450     0.9234843  0.8464329
##   9                   500     0.9136482  0.8258457
##   9                   550     0.9103695  0.8189685
##   9                   600     0.9069797  0.8121997
##   9                   650     0.9168158  0.8332545
##   9                   700     0.9069797  0.8121997
##   9                   750     0.9102584  0.8190770
##   9                   800     0.9167611  0.8331009
##   9                   850     0.9167611  0.8331009
##   9                   900     0.9167611  0.8331009
##   9                   950     0.9134824  0.8263732
##   9                  1000     0.9133713  0.8258490
##   9                  1050     0.9035352  0.8051123
##   9                  1100     0.9002566  0.7984271
##   9                  1150     0.8968667  0.7910900
##   9                  1200     0.9133713  0.8256077
##   9                  1250     0.9067028  0.8123286
##   9                  1300     0.9036464  0.8053364
##   9                  1350     0.9069251  0.8118805
##   9                  1400     0.9001454  0.7981035
##   9                  1450     0.9100926  0.8190961
##   9                  1500     0.9100926  0.8190961
## 
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 20
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 150, interaction.depth =
##  9, shrinkage = 0.1 and n.minobsinnode = 20.
proc.time() - ptm
##    user  system elapsed 
##   16.97    0.09   18.26
#fancy tree plots
suppressMessages(library(rattle))
plot(fit.gbm$finalModel)

#summary(fit.gbm$finalModel)

gbmImp <- varImp(fit.gbm)
#rfImp
#plot(rfImp)
ggplot(gbmImp) + gbl_ggtheme

#summary(fit.rf$finalModel)

Random Forest

  • Random Forest - An ensemble classifier that consists of many decision trees and outputs the class that is the mode of the classes output by individual trees.
#Train Random Forest CV
control <- maincontrol
ptm <- proc.time()
fit.rf <- train(label~ ., data = train, method="rf", metric=fitmetric, ntree = 35, trControl=control)
fit.rf
## Random Forest 
## 
## 300 samples
##  20 predictor
##   2 classes: 'Deceased', 'Survived' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 240, 240, 240, 240, 240 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.8966667  0.7939077
##   11    0.9166667  0.8328082
##   20    0.9000000  0.7994747
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 11.
proc.time() - ptm
##    user  system elapsed 
##    1.59    0.02    1.75
#fancy tree plots
suppressMessages(library(rattle))
plot(fit.rf$finalModel)

#summary(fit.rf$finalModel)

rfImp <- varImp(fit.rf)
#rfImp
#plot(rfImp)
ggplot(rfImp) + gbl_ggtheme

#summary(fit.rf$finalModel)

Support Vector Machines

  • Support Vector Machines (radial) - A discriminative classifier formally defined by a separating hyperplane. The algorithm creates a line or a hyperplane, which separates the data into classes.
#Training Support Vector Machine As a Challenger Model to Decision Tree.
control <- maincontrol
#control <- trainControl(method = "repeatedcv", number = 5,repeats=2,verbose = FALSE)
#methods "svmRadial" "svmLinear"
#
#
SVMgrid <- expand.grid(C= c(.5,1,2,5), sigma = c(0.01)) #C = 0.5, sigma =0.0001253863
ptm <- proc.time()
#fit.svm <- train(Survived~ ., data = train, method="svmLinear", metric=fitmetric, trControl=control,preProcess = #c("center","scale"),tuneLength = 10)
fit.svm <- train(label~ ., data = train, method="svmRadial", metric=fitmetric, trControl=control,preProcess = c("center", "scale"),tuneLength = 10)
fit.svm
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 300 samples
##  20 predictor
##   2 classes: 'Deceased', 'Survived' 
## 
## Pre-processing: centered (20), scaled (20) 
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 240, 241, 240, 239, 240 
## Resampling results across tuning parameters:
## 
##   C       Accuracy   Kappa    
##     0.25  0.8764351  0.7533862
##     0.50  0.8763786  0.7532587
##     1.00  0.8730453  0.7462505
##     2.00  0.8697666  0.7398198
##     4.00  0.8665444  0.7335186
##     8.00  0.8598759  0.7202493
##    16.00  0.8631546  0.7274778
##    32.00  0.8664879  0.7336732
##    64.00  0.8664879  0.7339260
##   128.00  0.8664879  0.7339260
## 
## Tuning parameter 'sigma' was held constant at a value of 0.1594806
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 0.1594806 and C = 0.25.
proc.time() - ptm
##    user  system elapsed 
##    7.87    0.08    8.39
#print(fit.svm)
#plot(fit.svm$finalModel)
#summary(fit.svm$finalModel)
svmImp <- varImp(fit.svm)
#svmImp
#plot(svmImp)
ggplot(svmImp) + gbl_ggtheme

Naive Bayes

  • Naive Bayes - An algorithm that uses Bayes’ theorem to classify objects. Naive Bayes classifiers assume strong, or naive, independence between attributes of data points.
control <- maincontrol
#install.packages("naivebayes")
library(naivebayes)
library(e1071) #naiveBayes

# # Define tuning grid
nb_grid <- expand.grid(usekernel = c(TRUE, FALSE),laplace = c(0, 0.5, 1), adjust = c(0.75, 1, 1.25, 1.5))

ptm <- proc.time()
fit.nb <- train(label~ ., data = train, method="naive_bayes", tuneGrid = nb_grid, trControl=control)
fit.nb
## Naive Bayes 
## 
## 300 samples
##  20 predictor
##   2 classes: 'Deceased', 'Survived' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 241, 240, 240, 239, 240 
## Resampling results across tuning parameters:
## 
##   usekernel  laplace  adjust  Accuracy   Kappa    
##   FALSE      0.0      0.75    0.7296934  0.4377316
##   FALSE      0.0      1.00    0.7296934  0.4377316
##   FALSE      0.0      1.25    0.7296934  0.4377316
##   FALSE      0.0      1.50    0.7296934  0.4377316
##   FALSE      0.5      0.75    0.7296934  0.4377316
##   FALSE      0.5      1.00    0.7296934  0.4377316
##   FALSE      0.5      1.25    0.7296934  0.4377316
##   FALSE      0.5      1.50    0.7296934  0.4377316
##   FALSE      1.0      0.75    0.7296934  0.4377316
##   FALSE      1.0      1.00    0.7296934  0.4377316
##   FALSE      1.0      1.25    0.7296934  0.4377316
##   FALSE      1.0      1.50    0.7296934  0.4377316
##    TRUE      0.0      0.75    0.8433769  0.6817564
##    TRUE      0.0      1.00    0.8366518  0.6682333
##    TRUE      0.0      1.25    0.8265370  0.6458675
##    TRUE      0.0      1.50    0.8298138  0.6519463
##    TRUE      0.5      0.75    0.8433769  0.6817564
##    TRUE      0.5      1.00    0.8366518  0.6682333
##    TRUE      0.5      1.25    0.8265370  0.6458675
##    TRUE      0.5      1.50    0.8298138  0.6519463
##    TRUE      1.0      0.75    0.8433769  0.6817564
##    TRUE      1.0      1.00    0.8366518  0.6682333
##    TRUE      1.0      1.25    0.8265370  0.6458675
##    TRUE      1.0      1.50    0.8298138  0.6519463
## 
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were laplace = 0, usekernel = TRUE
##  and adjust = 0.75.
proc.time() - ptm
##    user  system elapsed 
##    6.96    0.09    7.26
#fit.nb$bestTune
nbImp <- varImp(fit.nb)
#nbImp
#plot(nbImp)
ggplot(nbImp) + gbl_ggtheme

K Nearest Neighbor

  • K Nearest Neighbors - KNN has been used in statistical estimation and pattern recognition since the early 1970s as a non-parametric technique. It estimates how likely a data point is to be a member of one group or another based on the group membership of the data points nearest to it. K-nearest-neighbors is an example of a “lazy learner” algorithm, meaning it does not build a model from the training set until a query of the data set is performed.
control <- maincontrol
# # Define tuning grid
knn_grid <- expand.grid(k= c(2,3,5,7,9))

ptm <- proc.time()
fit.knn <- train(label~ ., data = train, method="knn", tuneGrid = knn_grid, trControl=control)
fit.knn
## k-Nearest Neighbors 
## 
## 300 samples
##  20 predictor
##   2 classes: 'Deceased', 'Survived' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 241, 240, 240, 240, 239 
## Resampling results across tuning parameters:
## 
##   k  Accuracy   Kappa    
##   2  0.8667611  0.7312582
##   3  0.8700417  0.7370683
##   5  0.8667083  0.7305085
##   7  0.8534278  0.7028591
##   9  0.8568723  0.7097726
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 3.
proc.time() - ptm
##    user  system elapsed 
##    1.52    0.02    1.59
fit.knn$bestTune
##   k
## 2 3
#plot(fit.knn$finalModel)
knnImp <- varImp(fit.knn)
#nbImp
#plot(knnImp)
ggplot(knnImp) +gbl_ggtheme

Model Fit Results

The graph below shows two evaluation metrics for the models.

  • Accuracy is the percentage of correctly classified instances out of all instances. It is more useful on binary classification than on multi-class problems, because with many classes it can be less clear exactly how the accuracy breaks down across them (you need to go deeper with a confusion matrix).

  • Kappa or Cohen’s Kappa is like classification accuracy, except that it is normalized at the baseline of random chance on your data set. It is a more useful measure on problems that have a class imbalance (e.g. a 70-30 split for classes 0 and 1, where you can achieve 70% accuracy by predicting class 0 for every instance). A small worked example of both metrics follows this list.
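
Both metrics can be computed by hand from a confusion matrix; a minimal sketch with made-up counts (not model output):

# hypothetical confusion matrix: rows = predicted, columns = actual
tab <- matrix(c(45,  5,
                 8, 42), nrow = 2, byrow = TRUE,
              dimnames = list(pred   = c("Deceased", "Survived"),
                              actual = c("Deceased", "Survived")))

po <- sum(diag(tab)) / sum(tab)                      # observed accuracy: 0.87
pe <- sum(rowSums(tab) * colSums(tab)) / sum(tab)^2  # chance agreement: 0.50
kappa <- (po - pe) / (1 - pe)                        # kappa: 0.74
round(c(accuracy = po, kappa = kappa), 3)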

The gradient boosting algorithm, denoted gbm, is the better model according to these fit metrics, but all models performed reasonably well on the training data set.

## 
## Call:
## summary.resamples(object = results)
## 
## Models: gbm, rf, tree, nb, svm, knn 
## Number of resamples: 5 
## 
## Accuracy 
##           Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## gbm  0.9016393 0.9166667 0.9322034 0.9333750 0.9491525 0.9672131    0
## rf   0.8666667 0.9166667 0.9166667 0.9166667 0.9333333 0.9500000    0
## tree 0.8000000 0.8135593 0.8500000 0.8397610 0.8500000 0.8852459    0
## nb   0.7666667 0.8166667 0.8500000 0.8433769 0.8852459 0.8983051    0
## svm  0.8000000 0.8666667 0.8813559 0.8764351 0.8833333 0.9508197    0
## knn  0.8166667 0.8500000 0.8688525 0.8700417 0.8813559 0.9333333    0
## 
## Kappa 
##           Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## gbm  0.8051118 0.8329621 0.8626310 0.8665816 0.8978650 0.9343380    0
## rf   0.7306397 0.8337029 0.8337029 0.8328082 0.8666667 0.8993289    0
## tree 0.6017699 0.6233314 0.7045952 0.6815205 0.7058824 0.7720235    0
## nb   0.5291480 0.6232877 0.6966292 0.6817564 0.7657707 0.7939464    0
## svm  0.6017699 0.7342193 0.7630522 0.7533862 0.7671840 0.9007054    0
## knn  0.6283784 0.6952596 0.7359307 0.7370683 0.7603018 0.8654709    0

  • AUC-ROC curve: This is a performance measurement for classification problems at various threshold settings. ROC is a probability curve and AUC represents the degree or measure of separability. It indicates how capable the model is of distinguishing between classes. The higher the AUC, the better the model is at making predictions.

It should be noted that we fitted our models using “accuracy” to aid in finding the best tuning parameters. However, it is good to measure models with multiple metrics, and AUC-ROC is one of the additional metrics to consider when making a decision.

The last graph indicates all the models look very good in terms of AUC-ROC, but gradient boosting and random forest are standouts. This means the gradient boosting machines resulted in a better fit than all other models, judging by the AUC-ROC metric. Naive Bayes, Support Vector Machines, and K Nearest Neighbors were not far behind GBM, but the decision tree trailed a bit more.

Recall (sensitivity) is important here too, since it measures our ability to accurately discern all the fatalities. The gradient boosting machines method proved to be a good model for true positives. Random forests and support vector machines do reasonably well here, but Naive Bayes, the decision tree, and KNN trail significantly on this metric.

## run MLeval to compare all six fitted models
library(MLeval)
#??MLeval #for help
res <- evalm(list(fit.gbm,fit.tree,fit.rf,fit.nb,fit.svm,fit.knn),gnames=c('gbm','dt','rf','nb','svm','knn'))
## ***MLeval: Machine Learning Model Evaluation***
## Input: caret train function object
## Not averaging probs.
## Group 1 type: cv
## Group 2 type: cv
## Group 3 type: cv
## Group 4 type: cv
## Group 5 type: cv
## Group 6 type: cv
## Observations: 1800
## Number of groups: 6
## Observations per group: 300
## Positive: Survived
## Negative: Deceased
## Group: gbm
## Positive: 138
## Negative: 162
## Group: dt
## Positive: 138
## Negative: 162
## Group: rf
## Positive: 138
## Negative: 162
## Group: nb
## Positive: 138
## Negative: 162
## Group: svm
## Positive: 138
## Negative: 162
## Group: knn
## Positive: 138
## Negative: 162
## ***Performance Metrics***

## gbm Optimal Informedness = 0.870101986044015
## dt Optimal Informedness = 0.691894793344069
## rf Optimal Informedness = 0.843263553408481
## nb Optimal Informedness = 0.72812667740204
## svm Optimal Informedness = 0.773483628556092
## knn Optimal Informedness = 0.747987117552335
## gbm AUC-ROC = 0.98
## dt AUC-ROC = 0.88
## rf AUC-ROC = 0.97
## nb AUC-ROC = 0.91
## svm AUC-ROC = 0.92
## knn AUC-ROC = 0.92

Model Prediction Results

Definition of the Terms related to measuring classification models:

  • Positive (P) : Observation is positive (for example: is an apple).
  • Negative (N) : Observation is not positive (for example: is not an apple).
  • True Positive (TP) : Observation is positive, and is predicted to be positive.
  • False Negative (FN) : Observation is positive, but is predicted negative.
  • True Negative (TN) : Observation is negative, and is predicted to be negative.
  • False Positive (FP) : Observation is negative, but is predicted positive.

Accuracy: Classification Rate or Accuracy is given by the relation (TP + TN) / (TP + TN + FP + FN). It assumes equal costs for both kinds of errors. A 99% accuracy can be excellent, good, mediocre, poor or terrible depending upon the problem.

Recall: Recall can be defined as the ratio of the total number of correctly classified positive examples divided by the total number of positive examples. High Recall indicates the class is correctly recognized (a small number of FN). Recall is given by the relation TP / (TP + FN).

AUC-ROC curve: This is a performance measurement for classification problems at various threshold settings. ROC is a probability curve and AUC represents the degree or measure of separability. It indicates how capable the model is of distinguishing between classes. The higher the AUC, the better the model is at binary predictions. The count-based metrics above are illustrated with a small worked example below.
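For example, with hypothetical counts chosen only for illustration, the two count-based metrics reduce to simple arithmetic:

# accuracy and recall from hypothetical confusion-matrix counts
TP <- 36; FN <- 3; TN <- 30; FP <- 6
(TP + TN) / (TP + TN + FP + FN)  # accuracy: 66/75 = 0.88
TP / (TP + FN)                   # recall: 36/39, roughly 0.923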

A summary of the test results for each classifier model follows. The results vary slightly from run to run due to random sampling.

Before PCA:

  • Decision Tree - Accuracy : 0.90

  • Random Forest - Accuracy : 0.94

  • Naive Bayes - Accuracy : 0.90

  • K Nearest Neighbors - Accuracy : 0.92

  • Support Vector Machines (radial)- Accuracy : 0.89

  • Classification Based On Association Rules (CBA) - Accuracy : 0.88 (this was run with no PCA involved)

After PCA:

  • Decision Tree - Accuracy : 0.94

  • Random Forest - Accuracy : 0.92

  • Naive Bayes - Accuracy : 0.88

  • K Nearest Neighbors - Accuracy : 0.96

  • Support Vector Machines (radial)- Accuracy : 0.88

  • Gradient Boosting Machines - Accuracy : 0.96 (this model was added after PCA, so there is no before-PCA run)

test <- test
#head(test)

#https://stackoverflow.com/questions/23891140/r-how-to-visualize-confusion-matrix-using-the-caret-package
draw_confusion_matrix <- function(cm) {

  #   total <- sum(cm$table)
  # res <- as.numeric(cm$table)
  # 
  # # Generate color gradients. Palettes come from RColorBrewer.
  # greenPalette <- c("#F7FCF5","#E5F5E0","#C7E9C0","#A1D99B","#74C476","#41AB5D","#238B45","#006D2C","#00441B")
  # redPalette <- c("#FFF5F0","#FEE0D2","#FCBBA1","#FC9272","#FB6A4A","#EF3B2C","#CB181D","#A50F15","#E64141")
  # getColor <- function (greenOrRed = "green", amount = 0) {
  #   if (amount == 0)
  #     return("#FFFFFF")
  #   palette <- greenPalette
  #   if (greenOrRed == "red")
  #     palette <- redPalette
  #   colorRampPalette(palette)(100)[10 + ceiling(90 * amount / total)]
  # }
  # 
  
  layout(matrix(c(1,1,2)))
  par(mar=c(2,2,2,2))
  plot(c(100, 345), c(300, 450), type = "n", xlab="", ylab="", xaxt='n', yaxt='n')
  title('CONFUSION MATRIX', cex.main=2)

   # create the matrix 
  rect(150, 430, 240, 370, col='#a2aba5')
  text(195, 435, 'Deceased ', cex=1.2)
  rect(250, 430, 340, 370, col='#E64141')
  text(295, 435, 'Survived', cex=1.2)
  text(125, 370, 'Predicted', cex=1.3, srt=90, font=2)
  text(245, 450, 'Actual', cex=1.3, font=2)
  rect(150, 305, 240, 365, col='#E64141')
  rect(250, 305, 340, 365, col='#a2aba5')
  text(140, 400, 'Deceased ', cex=1.2, srt=90)
  text(140, 335, 'Survived', cex=1.2, srt=90)

  # add in the cm results 
  res <- as.numeric(cm$table)
  text(195, 400, res[1], cex=1.6, font=2, col='white')
  text(195, 335, res[2], cex=1.6, font=2, col='white')
  text(295, 400, res[3], cex=1.6, font=2, col='white')
  text(295, 335, res[4], cex=1.6, font=2, col='white')

  # add in the specifics 
  plot(c(100, 0), c(100, 0), type = "n", xlab="", ylab="", main = "DETAILS", xaxt='n', yaxt='n')
  text(10, 85, names(cm$byClass[1]), cex=1.2, font=2)
  text(10, 70, round(as.numeric(cm$byClass[1]), 3), cex=1.2)
  text(30, 85, names(cm$byClass[2]), cex=1.2, font=2)
  text(30, 70, round(as.numeric(cm$byClass[2]), 3), cex=1.2)
  text(50, 85, names(cm$byClass[5]), cex=1.2, font=2)
  text(50, 70, round(as.numeric(cm$byClass[5]), 3), cex=1.2)
  text(70, 85, names(cm$byClass[6]), cex=1.2, font=2)
  text(70, 70, round(as.numeric(cm$byClass[6]), 3), cex=1.2)
  text(90, 85, names(cm$byClass[7]), cex=1.2, font=2)
  text(90, 70, round(as.numeric(cm$byClass[7]), 3), cex=1.2)

  # add in the accuracy information 
  text(30, 35, names(cm$overall[1]), cex=1.5, font=2)
  text(30, 20, round(as.numeric(cm$overall[1]), 3), cex=1.4)
  text(70, 35, names(cm$overall[2]), cex=1.5, font=2)
  text(70, 20, round(as.numeric(cm$overall[2]), 3), cex=1.4)
}  

# Error: [conflicted] `layout` found in 2 packages. 
# Either pick the one you want with `::` * plotly::layout * graphics::layout Or declare a preference with
# `conflict_prefer()` * conflict_prefer("layout", "plotly") * conflict_prefer("layout", "graphics")
conflict_prefer("layout", "plotly")

Gradient Boosting Machines

  • Gradient Boosting Machines - Gradient boosting is a machine learning technique for regression and classification problems, which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees. This model type was added to the analysis to challenge the accuracy of random forest. The results are 96% accuracy, and recall/sensitivity is 97%.
predict.gbm= predict(fit.gbm, test, type="raw")
#predict.tree= predict(fit.tree, test, type="prob")

#summary(predict.gbm)
#caret confusion
cm <- confusionMatrix(factor(predict.gbm), factor(test$label))
#cm

#fourfoldplot(cm$table, color = c("grey", "#E64141"), conf.level = 0, margin = 1, main = "Confusion Matrix")
draw_confusion_matrix(cm)

Decision Tree

  • Decision Tree - A tree in which each internal (non-leaf) node is labeled with an input feature. The arcs coming from a node labeled with an input feature are labeled with each of the possible values of the target feature or the arc leads to a subordinate decision node on a different input feature. Decision tree tested at 94% accuracy, 95% recall/sensitivity, which is a solid test performance.
predict.tree= predict(fit.tree, test, type="raw")
#predict.tree= predict(fit.tree, test, type="prob")

#summary(predict.tree)
#caret confusion
cm <- confusionMatrix(factor(predict.tree), factor(test$label))
#cm

#fourfoldplot(cm$table, color = c("grey", "#E64141"), conf.level = 0, margin = 1, main = "Confusion Matrix")
draw_confusion_matrix(cm)

SVM

  • Support Vector Machines (radial) - A discriminative classifier formally defined by a separating hyperplane. The algorithm creates a line or a hyperplane, which separates the data into classes. Support vector machines turned in an accuracy of 88% and 90% recall/sensitivity here.
#Test Results SVM
predict.svm= predict(fit.svm, test, type="raw")
#predict.svm= predict(fit.svm, test, type="prob")
#summary(predict.svm)

#confusion matrix to find correct and incorrect predictions
#table(labelship=predict.svm, true=test$Survived)
#caret confusion
cm <-confusionMatrix(factor(predict.svm), factor(test$label))
#cm

#plot(cm$table)
#fourfoldplot(cm$table, color = c("grey", "#E64141"), conf.level = 0, margin = 1, main = "Confusion Matrix")
draw_confusion_matrix(cm)

Random Forest

  • Random Forest - An ensemble classifier that consists of many decision trees and outputs the class that is the mode of the classes output by individual trees. Our random forest testing produced accuracy of 92% and 92% recall/sensitivity. In testing, gradient boosting machines did better than random forest.
predict.rf= predict(fit.rf, test, type="raw")
#predicted3= predict(fit.rf, test, type="prob")
#summary(predict.rf)

#confusion matrix to find correct and incorrect predictions
#table(labelship=predict.rf, true=test$Survived)
#caret confusion
cm <- confusionMatrix(factor(predict.rf), factor(test$label))
#cm

#fourfoldplot(cm$table, color = c("grey", "#E64141"),    conf.level = 0, margin = 1, main = "Confusion Matrix")
draw_confusion_matrix(cm)

K Nearest Neighbors

  • K Nearest Neighbors - KNN has been used in statistical estimation and pattern recognition as a non-parametric technique since the early 1970s. It estimates how likely a data point is to be a member of one group or the other depending on the group of the data points nearest to it. The k-nearest-neighbor is an example of a “lazy learner” algorithm, meaning that it does not build a model from the training set until a query of the data set is performed. The knn results were very good with 96% accuracy and a high recall of 97%. It is a good candidate model, but it had 2 misclassified fatalities.
predict.knn= predict(fit.knn, test, type="raw")
#predicted3= predict(fit.knn, test, type="prob")
#summary(predict.knn)

#table(labelship=predict.rf, true=test$Survived)
#caret confusion

cm <- confusionMatrix(factor(predict.knn), factor(test$label))
#cm

#fourfoldplot(cm$table, color = c("grey", "#E64141"),      conf.level = 0, margin = 1, main = "Confusion Matrix")
draw_confusion_matrix(cm)

Naive Bayes

  • Naive Bayes - An algorithm that uses Bayes’ theorem to classify objects. Naive Bayes classifiers assume strong, or naive, independence between attributes of data points. Naive Bayes has a reasonable accuracy at 90%, but it misclassified four fatalities.
predict.nb= predict(fit.nb, test, type="raw")
#summary(predict.nb)

#confusion matrix to find correct and incorrect predictions
#table(labelship=predict.nb, true=test$Survived)
cm <- confusionMatrix(factor(predict.nb), factor(test$label))
#cm


#fourfoldplot(cm$table, color = c("grey", "#E64141"), conf.level = 0, margin = 1, main = "Confusion Matrix")
draw_confusion_matrix(cm)

Keras Deep Learning

The keras library will be used to classify survival or fatality of the Covid-19 patient sample based upon the same biomarker data set.

require(keras)
## Loading required package: keras
#start with the clean data frame again
df <- df_clean 
df <- df[,c(4,7:80)] # or df <- df %>% select(-patient_id,-admission_time,-discharge_time)

#rename outcome to label
names(df)[1] <- c("label")

df$label <- factor(df$label)

#Discretize Survived
#df$Survived=dplyr::recode(df$Survived, `0`="Deceased", `1`="Survived")
colnames(df) <- make.names(colnames(df))

head(df[,1:5])  %>% kable() %>% kable_styling()
label hypersensitive_cardiac_troponin_i hemoglobin serum_chloride prothrombin_time
0 19.9 140 103.1 14.1
0 16.9 151 100.5 14.3
0 0.0 126 102.9 13.6
0 4.8 110 103.1 16.3
0 5.6 134 102.2 14.6
0 19.7 108 105.8 12.4

Data Splits

The keras model will use the hold-out approach for the data splits: 80% of the data for training and 20% for testing, with that same 20% partition also retained as the hold-out set here (an alternative helper instead removes 10% for the hold-out before splitting the remainder 80/20).

require(kableExtra)

dim(df)
## [1] 375  75
paste("-------")
## [1] "-------"
nDim <- dim(df)[2]-1
#nDim

x <- modelData2(df) # 80/20 train/test; holdout = test
#x <- modelDataC(df) # removes 10% for holdout then splits 80/20 for remainder
train <-x@trn
test <- x@tst
hold <- x@hld
#hold <-z[-data,] #balance what is left over after train @.90  

trn <- as.data.frame(dim(train))
names(trn) <- c("dimension_training_train")
trn
##   dimension_training_train
## 1                      300
## 2                       75
tst <-as.data.frame(dim(test))
names(tst) <- c("dimension_testing_test")
hld <-as.data.frame(dim(hold))
names(hld) <- c("dimension_testing_holdout")

train <- train
dim(train)
## [1] 300  75
test <- test
dim(test)
## [1] 75 75
hold <- hold
dim(hold)
## [1] 75 75
split <- cbind(trn,tst,hld)
split %>%kable() %>% kable_styling()
dimension_training_train dimension_testing_test dimension_testing_holdout
300 75 75
75 75 75
#view the data sets
head(train[,1:5]) %>% kable() %>% kable_styling()
label hypersensitive_cardiac_troponin_i hemoglobin serum_chloride prothrombin_time
1 1776.3 124 107.1 19.1
1 62.3 156 90.5 13.2
1 0.0 129 95.0 13.2
0 3.7 151 104.6 13.6
0 1.9 116 103.8 14.7
0 0.0 120 104.4 14.9
head(test[,1:5]) %>% kable() %>% kable_styling()
label hypersensitive_cardiac_troponin_i hemoglobin serum_chloride prothrombin_time
1 8.2 119 100.8 17.1
1 32.6 132 105.6 17.0
1 140.8 159 131.7 32.2
1 4.4 124 109.8 15.4
1 8443.5 124 103.3 20.0
0 8.1 130 101.9 15.1
head(hold[,1:5]) %>% kable() %>% kable_styling()
label hypersensitive_cardiac_troponin_i hemoglobin serum_chloride prothrombin_time
0 19.9 140 103.1 14.1
0 0.0 126 102.9 13.6
0 3.7 151 104.5 13.3
0 10.4 125 107.1 13.1
0 0.0 125 106.3 13.6
0 0.0 133 103.3 12.9

Keras Data Prep

#devtools::install_github("rstudio/keras")
library(keras)

x_train <- train[,2:(nDim+1)] #omit label
x_test <- test[,2:(nDim+1)] #omit label
x_hold <- hold[,2:(nDim+1)] #omit label

Scale the data frame’s numerical columns

library(caret)
# the label column was removed above; the remaining columns are numeric predictors

normalize <- function(x) {
    return ((x - min(x)) / (max(x) - min(x)))
}

x_train <- as.data.frame(lapply(x_train, normalize))
head(x_train[,1:5])
##   hypersensitive_cardiac_troponin_i hemoglobin serum_chloride prothrombin_time
## 1                          0.035526  0.6966292      0.7628205        0.1591667
## 2                          0.001246  0.8764045      0.6445869        0.1100000
## 3                          0.000000  0.7247191      0.6766382        0.1100000
## 4                          0.000074  0.8483146      0.7450142        0.1133333
## 5                          0.000038  0.6516854      0.7393162        0.1225000
## 6                          0.000000  0.6741573      0.7435897        0.1241667
##   procalcitonin
## 1  0.0251880357
## 2  0.0083960119
## 3  0.0034983383
## 4  0.0008745846
## 5  0.0006996677
## 6  0.0006996677
x_test <- as.data.frame(lapply(x_test, normalize))
head(x_test[,1:5])
##   hypersensitive_cardiac_troponin_i hemoglobin serum_chloride prothrombin_time
## 1                          0.000164  0.6685393      0.7601810        0.5310559
## 2                          0.000652  0.7415730      0.7963801        0.5279503
## 3                          0.002816  0.8932584      0.9932127        1.0000000
## 4                          0.000088  0.6966292      0.8280543        0.4782609
## 5                          0.168870  0.6966292      0.7790347        0.6211180
## 6                          0.000162  0.7303371      0.7684766        0.4689441
##   procalcitonin
## 1   0.003907638
## 2   0.006394316
## 3   0.032326821
## 4   0.698046181
## 5   0.043694494
## 6   0.001776199
x_hold <- as.data.frame(lapply(x_hold, normalize))
head(x_hold[,1:5])
##   hypersensitive_cardiac_troponin_i hemoglobin serum_chloride prothrombin_time
## 1                          0.000398  0.7865169      0.7775264        0.4378882
## 2                          0.000000  0.7078652      0.7760181        0.4223602
## 3                          0.000074  0.8483146      0.7880845        0.4130435
## 4                          0.000208  0.7022472      0.8076923        0.4068323
## 5                          0.000000  0.7022472      0.8016591        0.4223602
## 6                          0.000000  0.7471910      0.7790347        0.4006211
##   procalcitonin
## 1  0.0031971581
## 2  0.0021314387
## 3  0.0014209591
## 4  0.0039076377
## 5  0.0014209591
## 6  0.0007104796
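Note that each partition above is normalized with its own column minima and maxima. A common alternative, sketched below assuming hypothetical raw copies train_raw and test_raw (not the normalized objects above), is to reuse the training-set ranges on the other partitions so all three share one scale:

# sketch only: min-max scale test data with ranges learned from training data
rng <- lapply(train_raw, range, na.rm = TRUE)  # per-column min/max
scale_with <- function(df, rng)
  as.data.frame(Map(function(col, r) (col - r[1]) / (r[2] - r[1]), df, rng))
test_scaled <- scale_with(test_raw, rng)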

Make normalized data frame a matrix shape

x_train <- as.matrix(x_train)
x_test <- as.matrix(x_test)
x_hold <- as.matrix(x_hold)

dim(x_train)
## [1] 300  74
head(x_train[,1:5]) %>% kable() %>% kable_styling()
hypersensitive_cardiac_troponin_i hemoglobin serum_chloride prothrombin_time procalcitonin
0.035526 0.6966292 0.7628205 0.1591667 0.0251880
0.001246 0.8764045 0.6445869 0.1100000 0.0083960
0.000000 0.7247191 0.6766382 0.1100000 0.0034983
0.000074 0.8483146 0.7450142 0.1133333 0.0008746
0.000038 0.6516854 0.7393162 0.1225000 0.0006997
0.000000 0.6741573 0.7435897 0.1241667 0.0006997
dim(x_test)
## [1] 75 74
head(x_test[,1:5])  %>% kable() %>% kable_styling()
hypersensitive_cardiac_troponin_i hemoglobin serum_chloride prothrombin_time procalcitonin
0.000164 0.6685393 0.7601810 0.5310559 0.0039076
0.000652 0.7415730 0.7963801 0.5279503 0.0063943
0.002816 0.8932584 0.9932127 1.0000000 0.0323268
0.000088 0.6966292 0.8280543 0.4782609 0.6980462
0.168870 0.6966292 0.7790347 0.6211180 0.0436945
0.000162 0.7303371 0.7684766 0.4689441 0.0017762
dim(x_hold)
## [1] 75 74
head(x_hold[,1:5])  %>% kable() %>% kable_styling()
hypersensitive_cardiac_troponin_i hemoglobin serum_chloride prothrombin_time procalcitonin
0.000398 0.7865169 0.7775264 0.4378882 0.0031972
0.000000 0.7078652 0.7760181 0.4223602 0.0021314
0.000074 0.8483146 0.7880845 0.4130435 0.0014210
0.000208 0.7022472 0.8076923 0.4068323 0.0039076
0.000000 0.7022472 0.8016591 0.4223602 0.0014210
0.000000 0.7471910 0.7790347 0.4006211 0.0007105

Features

x_train <- array_reshape(x_train, c(nrow(x_train), nDim))
x_test <- array_reshape(x_test, c(nrow(x_test), nDim))
x_hold <- array_reshape(x_hold, c(nrow(x_hold), nDim))


#check dimensions
dim(x_train)
## [1] 300  74
dim(x_test)
## [1] 75 74
dim(x_hold)
## [1] 75 74

Labels

Note that we use the array_reshape() function rather than the dim<-() function to reshape the array. This is so that the data is re-interpreted using row-major semantics (as opposed to R’s default column-major semantics), which is in turn compatible with the way that the numerical libraries called by Keras interpret array dimensions.
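A tiny illustration of the two fill orders:

# R's matrix() fills down columns; array_reshape() fills across rows
matrix(1:6, nrow = 2)               # rows: 1 3 5 / 2 4 6
keras::array_reshape(1:6, c(2, 3))  # rows: 1 2 3 / 4 5 6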

The y data is an integer vector with values ranging from 0 to 1. To prepare this data for training we one-hot encode the vectors into binary class matrices using the Keras to_categorical() function:

#y_train <- as.numeric(train$label)
y_train <- train$label #these were already R factor
head(y_train)
## [1] 1 1 1 0 0 0
## Levels: 0 1
#y_test <- as.numeric(test$label)
y_test <- test$label #these were already R factor
head(y_test)
## [1] 1 1 1 1 1 0
## Levels: 0 1
y_hold <- hold$label #these were already R factor
head(y_hold)
## [1] 0 0 0 0 0 0
## Levels: 0 1
# categorical conversion
y_train <- to_categorical(y_train, 2)
y_test <- to_categorical(y_test, 2)
y_hold <- to_categorical(y_hold, 2)
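For instance, a small 0/1 label vector one-hot encodes into a two-column binary matrix:

# column 1 flags class 0, column 2 flags class 1
keras::to_categorical(c(0, 1, 1), 2)
#      [,1] [,2]
# [1,]    1    0
# [2,]    0    1
# [3,]    0    1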

Hypertuning

https://henry090.github.io/kerastuneR/

conflict_prefer("train", "caret")
## [conflicted] Will prefer caret::train over any other package
library(reticulate)
#use_python("C:\Users\knjpo\AppData\Local\r-miniconda")
use_condaenv("r-miniconda")
#install.packages('kerastuneR')
library(keras)
library(tensorflow)
#library(kerastuneR)
#install_kerastuner()

Defining the model

The simplest type of model is the Sequential model, a linear stack of layers. We begin by creating a sequential model and then adding layers using the pipe (%>%) operator:

tf.fit <- keras_model_sequential() 
tf.fit %>% 
  layer_dense(units = 50, activation = 'relu', input_shape = c(nDim)) %>%
  layer_dropout(rate = 0.3) %>% 
  layer_dense(units = 20, activation = 'relu') %>%
  layer_dropout(rate = 0.2) %>%
  layer_dense(units = 10, activation = 'relu') %>%
  layer_dropout(rate = 0.1) %>%
  layer_dense(units = 2, activation = 'softmax')

Fit Summary

The input_shape argument to the first layer specifies the shape of the input data (a numeric vector of length nDim = 74, one value per biomarker feature). The final layer outputs a length-2 numeric vector (class probabilities for Deceased and Survived) using a softmax activation function.

Use the summary() function to print the details of the tf.fit:

summary(tf.fit)
## Model: "sequential"
## ________________________________________________________________________________
## Layer (type)                        Output Shape                    Param #     
## ================================================================================
## dense (Dense)                       (None, 50)                      3750        
## ________________________________________________________________________________
## dropout (Dropout)                   (None, 50)                      0           
## ________________________________________________________________________________
## dense_1 (Dense)                     (None, 20)                      1020        
## ________________________________________________________________________________
## dropout_1 (Dropout)                 (None, 20)                      0           
## ________________________________________________________________________________
## dense_2 (Dense)                     (None, 10)                      210         
## ________________________________________________________________________________
## dropout_2 (Dropout)                 (None, 10)                      0           
## ________________________________________________________________________________
## dense_3 (Dense)                     (None, 2)                       22          
## ================================================================================
## Total params: 5,002
## Trainable params: 5,002
## Non-trainable params: 0
## ________________________________________________________________________________

Fit Compile

Next, compile the tf.fit with appropriate loss function, optimizer, and metrics:

tf.fit %>% keras::compile(
  loss = 'categorical_crossentropy',
  optimizer = optimizer_rmsprop(),
  metrics = c('accuracy')
)

Training and Evaluation

Use the fit() function to train the tf.fit for 33 epochs using batches of 30:

ptm <- proc.time() #timer

tf.train <- tf.fit %>% fit(
  x_train, y_train,
  epochs = 33, batch_size = 30,
  #epochs = 30, batch_size = 25,
  #epochs = 30, batch_size = 128,
  validation_split = 0.20
)
proc.time() - ptm
##    user  system elapsed 
##    4.17    0.15    4.28

Plot

The history object returned by fit() includes loss and accuracy metrics which we can plot:

plot(tf.train)
## `geom_smooth()` using formula 'y ~ x'

# Results

#alluvial plots
plotCM <- function(cm){
  cmdf <- as.data.frame(cm[["table"]])
  cmdf[["color"]] <- ifelse(cmdf[[1]] == cmdf[[2]], "grey", "red")
  
  alluvial::alluvial(cmdf[,1:2]
                     , freq = cmdf$Freq
                     , col = cmdf[["color"]]
                     , alpha = 0.5
                     , hide  = cmdf$Freq == 0
                     )
}
 
require(plotly)
ggPlotCM <- function(cm){
## ggplotly version
cm_d <- as.data.frame(cm$table)
# confusion matrix statistics as data.frame
cm_st <-data.frame(cm$overall)
# round the values
cm_st$cm.overall <- round(cm_st$cm.overall,2)

# here we also have the rounded percentage values
cm_p <- as.data.frame(prop.table(cm$table))
cm_d$Perc <- round(cm_p$Freq*100,2)

# plotting the matrix
cm_d_p <-  ggplot(data = cm_d, aes(x = Prediction , y =  Reference, fill = Freq))+
  geom_tile() +
  geom_text(aes(label = paste("",Freq,",",Perc,"%")), color = 'red', size = 3) +
  theme_light() +
  guides(fill=FALSE) + theme_1()

# plotting the stats
#cm_st_p <-  tableGrob(cm_st)

cm_d_ply <-ggplotly(cm_d_p)
return (cm_d_ply)
}

Testing

Evaluate on test

Evaluate the tf.fit’s performance on the test data:

tf.fit %>% keras::evaluate(x_test, y_test)
##      loss  accuracy 
## 0.1227164 0.9600000

Confusion Matrix Test

#Generate predictions on new data:
pred <-tf.fit %>% predict_classes(x_test) 
#table(Predicted=pred, Actual =test$label)

cm <- confusionMatrix(factor(pred), factor(test$label))
cm
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 35  1
##          1  2 37
##                                           
##                Accuracy : 0.96            
##                  95% CI : (0.8875, 0.9917)
##     No Information Rate : 0.5067          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.92            
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.9459          
##             Specificity : 0.9737          
##          Pos Pred Value : 0.9722          
##          Neg Pred Value : 0.9487          
##              Prevalence : 0.4933          
##          Detection Rate : 0.4667          
##    Detection Prevalence : 0.4800          
##       Balanced Accuracy : 0.9598          
##                                           
##        'Positive' Class : 0               
## 
draw_confusion_matrix(cm)

#plotCM(cm)
#ggPlotCM(cm)

Evaluate on hold out

Evaluate the tf.fit’s performance on the hold out data:

tf.fit %>% keras::evaluate(x_hold, y_hold)
##      loss  accuracy 
## 0.1227164 0.9600000

Confusion Matrix Holdout

#Generate predictions on new data:
pred2 <-tf.fit %>% predict_classes(x_hold) 
#table(Predicted=pred, Actual =test$label)

cm <- confusionMatrix(factor(pred2), factor(hold$label))
cm
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 35  1
##          1  2 37
##                                           
##                Accuracy : 0.96            
##                  95% CI : (0.8875, 0.9917)
##     No Information Rate : 0.5067          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.92            
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.9459          
##             Specificity : 0.9737          
##          Pos Pred Value : 0.9722          
##          Neg Pred Value : 0.9487          
##              Prevalence : 0.4933          
##          Detection Rate : 0.4667          
##    Detection Prevalence : 0.4800          
##       Balanced Accuracy : 0.9598          
##                                           
##        'Positive' Class : 0               
## 
draw_confusion_matrix(cm)

#plotCM(cm)
#ggPlotCM(cm)

ML Classification by Arules

This analysis is designed as a demonstration of using the Arules CBA method to predict Covid-19 patient fatality using a sample of 375 patients and their biomarkers measured during their hospital stay. The input data originated from another, more expansive predictive modeling study on the topic, and the input data used here is simplified for this Apriori and ArulesCBA demonstration.

#Return multiple data frames in one object using the S4 class system.
setClass(
    "mdl",
    representation(
        trn = "vector",
        tst = "vector",
        hld = "vector"
    )
)

modelData2 <- function (df1 = df_ww) {
  z <- df1  
    # get number of rows in the data frame
  len <- as.numeric(dim(z)[1])
  randIndex <- sample(as.numeric(rownames(z)[1:len]))
  # set cut point at 80% of the input data frame
  cutPoint80 <- floor(.8 * len)
  # set cut point at 20% of the input data frame
  cutPoint20 <- floor(.2 * len)
  trn2 <- z[randIndex[1:cutPoint80],]
  tst2 <- z[randIndex[(cutPoint80 + 1):len],] 
  hld2 <- z[-randIndex[1:cutPoint80],]  # same rows as tst2, so hold-out = test
  
  mdl <- new("mdl",
                  trn = trn2,
                  tst = tst2,
                  hld = hld2  )
  return(mdl)
  
}

Data Loading

Anonymized Patient Data

View anonymized Covid-19 patient information

df_CBA <- read.csv("covid19_biomarkers_max.csv") # this data resulted from a write.csv of the df_clean data
head(df_CBA[,2:7]) %>% kable() %>% kable_styling()
patient_id age gender outcome admission_time discharge_time
1 73 1 0 2020-01-30 22:12:47 2020-02-17 12:40:09
2 61 1 0 2020-02-04 21:39:03 2020-02-19 12:59:01
3 70 2 0 2020-01-23 10:59:36 2020-02-08 17:52:31
4 74 1 0 2020-01-31 23:03:59 2020-02-18 12:59:12
5 29 2 0 2020-02-01 20:59:54 2020-02-18 10:33:06
6 81 2 0 2020-01-24 10:47:10 2020-02-07 09:06:58
#remove the first column that was created when the csv was written

Model Data View

View slice of model data input set. The target variable outcome has been discretized.

df_CBA <- df_CBA[,-1]
# omit columns that will not be part of the model data set
#df_CBA <- df_CBA[,c(4,7:80)] 
df_CBA <- df_CBA %>% dplyr::select(-patient_id,-age,-gender,-admission_time,-discharge_time)

#rename the first column to outcome
names(df_CBA)[1] <- c("outcome")
df_CBA$outcome <- factor(df_CBA$outcome)

#Recode outcome: 0 = Deceased, 1 = Survived
df_CBA$outcome=dplyr::recode(df_CBA$outcome, `0`="Deceased", `1`="Survived")
colnames(df_CBA) <- make.names(colnames(df_CBA))
head(df_CBA[,1:5]) %>% kable() %>% kable_styling()
outcome hypersensitive_cardiac_troponin_i hemoglobin serum_chloride prothrombin_time
Deceased 19.9 140 103.1 14.1
Deceased 16.9 151 100.5 14.3
Deceased 0.0 126 102.9 13.6
Deceased 4.8 110 103.1 16.3
Deceased 5.6 134 102.2 14.6
Deceased 19.7 108 105.8 12.4
# set model input
biomarkers <- df_CBA

During the exploratory data analysis, the following 8 features were found to be important out of 74 total. They are listed in order of relative importance.

  • lactate_dehydrogenase 100.00

  • monocytes_percent 83.50

  • international_standard_ratio 78.47

  • percent_lymphocyte 64.12

  • eosinophil_count 59.75

  • eosinophils_percent 50.99

  • neutrophils_percent 49.15

  • amino_terminal_brain_natriuretic_peptide_precursor_nt_pro_bnp 48.62


The CBA analysis will use the top four:

  • lactate_dehydrogenase

  • monocytes_percent

  • international_standard_ratio

  • percent_lymphocyte


Research to inform cutoffs for discretization of the top four features

  • lactate_dehydrogenase (low cutoff 140, high cutoff 280)

https://www.uofmhealth.org/health-library/tv6985#:~:text=Lactic%20acid%20dehydrogenase%20(LDH)%20is%20an%20enzyme%20that%20helps%20produce,L%20to%204.68%20microkatals%2FL

  • monocytes_percent (low cutoff 2 / high cutoff 8)

https://cllsociety.org/toolbox/normal-lab-values/

  • international_standard_ratio (low cutoff 0.8 / high cutoff 1.1)

https://www.mayoclinic.org/tests-procedures/prothrombin-time/about/pac-20384661#:~:text=In%20healthy%20people%20an%20INR,in%20the%20leg%20or%20lung

  • percent_lymphocyte (low cutoff 20 / high cutoff 40)

https://cllsociety.org/toolbox/normal-lab-values/

This information will be used with mutate() and ifelse() to discretize the top four features into low, normal, and high buckets based on the researched reference values.

#lactate_dehydrogenase (low cutoff 140, high cutoff 280)
biomarkers <- biomarkers %>% mutate(lactate_dehydrogenase = ifelse(lactate_dehydrogenase < 140, "low"
        ,ifelse(lactate_dehydrogenase > 280, "high" , "normal")))

#monocytes_percent
biomarkers <- biomarkers %>% mutate(monocytes_percent = ifelse(monocytes_percent < 2, "low"
        ,ifelse(monocytes_percent > 8, "high" , "normal")))

#international_standard_ratio
biomarkers <- biomarkers %>% mutate(international_standard_ratio = ifelse(international_standard_ratio < 0.8, "low" ,ifelse(international_standard_ratio > 1.1, "high" , "normal")))

#percent_lymphocyte
biomarkers <- biomarkers %>% mutate(percent_lymphocyte = ifelse(percent_lymphocyte < 20, "low"
        ,ifelse(percent_lymphocyte > 40, "high" , "normal")))

# Make dataframe from  discretized biomarkers
df.biomarkers <- data.frame(outcome = biomarkers$outcome
              , lactate_dehydrogenase = biomarkers$lactate_dehydrogenase
              , monocytes_percent = biomarkers$monocytes_percent
              , international_standard_ratio = biomarkers$international_standard_ratio
              , percent_lymphocyte = biomarkers$percent_lymphocyte) 

# Make factors of all columns
df.biomarkers$outcome <- factor(df.biomarkers$outcome)
df.biomarkers$lactate_dehydrogenase <- factor(df.biomarkers$lactate_dehydrogenase)
df.biomarkers$monocytes_percent <- factor(df.biomarkers$monocytes_percent)
df.biomarkers$international_standard_ratio <- factor(df.biomarkers$international_standard_ratio)
df.biomarkers$percent_lymphocyte <- factor(df.biomarkers$percent_lymphocyte)
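The same bucketing could be written more compactly with cut(); a sketch, assuming a hypothetical ldh_raw vector still holding the original numeric lactate_dehydrogenase values:

# three buckets with cut() instead of nested ifelse() calls
cut(ldh_raw, breaks = c(-Inf, 140, 280, Inf),
    labels = c("low", "normal", "high"))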

View Discretized Data

head(df.biomarkers) %>% kable() %>% kable_styling()
outcome lactate_dehydrogenase monocytes_percent international_standard_ratio percent_lymphocyte
Deceased high high normal normal
Deceased high high normal normal
Deceased high normal normal normal
Deceased high high high normal
Deceased high normal high normal
Deceased high high normal normal
#head(df.biomarkers)
glimpse(df.biomarkers)
## Rows: 375
## Columns: 5
## $ outcome                      <fct> Deceased, Deceased, Deceased, Deceased...
## $ lactate_dehydrogenase        <fct> high, high, high, high, high, high, hi...
## $ monocytes_percent            <fct> high, high, normal, high, normal, high...
## $ international_standard_ratio <fct> normal, normal, normal, high, high, no...
## $ percent_lymphocyte           <fct> normal, normal, normal, normal, normal...

Apriori Analysis

The Apriori algorithm can now be used on the transformed data to generate association rules. Parameters of support, confidence and lift will be leveraged to study rules formed from Covid-19 patient biomarkers.

  • Support: This measure gives an idea of how frequent an itemset is in all the transactions. A transaction in this case means a patient and their associated biomarkers.

  • Confidence: Confidence is an indication of how often the rule has been found to be true.

  • Lift: Lift is the ratio of the observed support to the support expected if the left and right hand sides were independent. A worked example with hypothetical counts follows this list.
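For example, with hypothetical counts (not taken from this data set), the three measures for a rule {lactate_dehydrogenase=high} => {outcome=Deceased} reduce to:

n      <- 375  # transactions (patients)
n_lhs  <- 250  # hypothetical: patients with lactate_dehydrogenase=high
n_rhs  <- 174  # hypothetical: deceased patients
n_both <- 160  # hypothetical: patients matching both sides
support    <- n_both / n                # ~0.43
confidence <- n_both / n_lhs            # 0.64
lift       <- confidence / (n_rhs / n)  # ~1.38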

Several different combinations of support and confidence were used to filter the universe of strong rules for the data.

  • Set support: .07 to .11
  • Set confidence: .75 to .79
  • Set lift: >= 1

A set of strong rules will be obtained from these parameters. The discretized data set will be used to perform apriori analysis prior to CBA.

Create Arules

Inspect Rules

The initial rules view uses support of .07, confidence of .75, and maxlen = 3.

arule_df <- df.biomarkers
#set top n

n = 20
#n = 20
#n = 10
#n = 5

#rules = apriori(arule_df, parameter = list(supp = 0.09, conf = 0.95, maxlen = 3))
rules = apriori(arule_df, parameter = list(supp = 0.07, conf = 0.75, maxlen = 3))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##        0.75    0.1    1 none FALSE            TRUE       5    0.07      1
##  maxlen target  ext
##       3  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 26 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[13 item(s), 375 transaction(s)] done [0.00s].
## sorting and recoding items ... [11 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3
## Warning in apriori(arule_df, parameter = list(supp = 0.07, conf = 0.75, : Mining
## stopped (maxlen reached). Only patterns up to a length of 3 returned!
##  done [0.00s].
## writing ... [76 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
#arules::inspect(head(rules,n)) 
inspectDT(rules)

View higher quality rules with elevated support of .11 and confidence of .79. The number of rules decreased to 50.

# This combination was used to produce the set of strong rules
subrules <- rules[quality(rules)$support >= 0.11]
subrules <- subrules[quality(subrules)$confidence >= 0.79]
inspectDT(subrules)

View 50 rules but with lift greater than or equal to 1. The result is reduced to 47 rules meeting that criterion.

A few initial Arule observations:

  • monocytes_percent=high commonly appears together with outcome=Deceased

  • percent_lymphocyte=low commonly appears together with outcome=Survived

  • lactate_dehydrogenase=high is a common right hand side across the rules

#Filter to rules with lift >= 1
subrules_lift <- subrules[quality(subrules)$lift >= 1]
inspectDT(subrules_lift)
# set to subrules to the lift =1 result
rules <- subrules_lift

Plot Strong Rules

The 47 strong rules are plotted below with a few different plotting schemes.

#plot (rules, measure=c("confidence","lift"), shading="lift")
plot (rules, measure=c("support","lift"), shading="lift")
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.

plot(rules, method = "matrix", measure = "lift")
## Itemsets in Antecedent (LHS)
##  [1] "{lactate_dehydrogenase=high,percent_lymphocyte=low}"             
##  [2] "{lactate_dehydrogenase=high,percent_lymphocyte=normal}"          
##  [3] "{percent_lymphocyte=low}"                                        
##  [4] "{outcome=Survived,lactate_dehydrogenase=high}"                   
##  [5] "{outcome=Survived}"                                              
##  [6] "{lactate_dehydrogenase=high,international_standard_ratio=normal}"
##  [7] "{outcome=Survived,percent_lymphocyte=low}"                       
##  [8] "{monocytes_percent=normal,international_standard_ratio=high}"    
##  [9] "{monocytes_percent=normal,percent_lymphocyte=low}"               
## [10] "{lactate_dehydrogenase=high,monocytes_percent=high}"             
## [11] "{outcome=Survived,monocytes_percent=normal}"                     
## [12] "{international_standard_ratio=high,percent_lymphocyte=low}"      
## [13] "{international_standard_ratio=normal,percent_lymphocyte=normal}" 
## [14] "{outcome=Survived,international_standard_ratio=high}"            
## [15] "{monocytes_percent=high,percent_lymphocyte=normal}"              
## [16] "{percent_lymphocyte=normal}"                                     
## [17] "{monocytes_percent=high,international_standard_ratio=normal}"    
## [18] "{international_standard_ratio=normal}"                           
## [19] "{monocytes_percent=high}"                                        
## [20] "{monocytes_percent=normal}"                                      
## [21] "{monocytes_percent=normal,international_standard_ratio=normal}"  
## [22] "{outcome=Deceased,monocytes_percent=normal}"                     
## [23] "{monocytes_percent=high,percent_lymphocyte=low}"                 
## [24] "{international_standard_ratio=high}"                             
## [25] "{outcome=Deceased,international_standard_ratio=normal}"          
## [26] "{monocytes_percent=high,international_standard_ratio=high}"      
## [27] "{outcome=Deceased,international_standard_ratio=high}"            
## [28] "{outcome=Deceased,percent_lymphocyte=normal}"                    
## [29] "{outcome=Deceased,monocytes_percent=high}"                       
## [30] "{outcome=Deceased}"                                              
## [31] "{}"                                                              
## Itemsets in Consequent (RHS)
## [1] "{lactate_dehydrogenase=high}"        "{international_standard_ratio=high}"
## [3] "{outcome=Deceased}"                  "{percent_lymphocyte=low}"           
## [5] "{outcome=Survived}"

plot(rules, method = "matrix" , engine="3d", measure = "lift")
## Itemsets in Antecedent (LHS)
##  [1] "{lactate_dehydrogenase=high,percent_lymphocyte=low}"             
##  [2] "{lactate_dehydrogenase=high,percent_lymphocyte=normal}"          
##  [3] "{percent_lymphocyte=low}"                                        
##  [4] "{outcome=Survived,lactate_dehydrogenase=high}"                   
##  [5] "{outcome=Survived}"                                              
##  [6] "{lactate_dehydrogenase=high,international_standard_ratio=normal}"
##  [7] "{outcome=Survived,percent_lymphocyte=low}"                       
##  [8] "{monocytes_percent=normal,international_standard_ratio=high}"    
##  [9] "{monocytes_percent=normal,percent_lymphocyte=low}"               
## [10] "{lactate_dehydrogenase=high,monocytes_percent=high}"             
## [11] "{outcome=Survived,monocytes_percent=normal}"                     
## [12] "{international_standard_ratio=high,percent_lymphocyte=low}"      
## [13] "{international_standard_ratio=normal,percent_lymphocyte=normal}" 
## [14] "{outcome=Survived,international_standard_ratio=high}"            
## [15] "{monocytes_percent=high,percent_lymphocyte=normal}"              
## [16] "{percent_lymphocyte=normal}"                                     
## [17] "{monocytes_percent=high,international_standard_ratio=normal}"    
## [18] "{international_standard_ratio=normal}"                           
## [19] "{monocytes_percent=high}"                                        
## [20] "{monocytes_percent=normal}"                                      
## [21] "{monocytes_percent=normal,international_standard_ratio=normal}"  
## [22] "{outcome=Deceased,monocytes_percent=normal}"                     
## [23] "{monocytes_percent=high,percent_lymphocyte=low}"                 
## [24] "{international_standard_ratio=high}"                             
## [25] "{outcome=Deceased,international_standard_ratio=normal}"          
## [26] "{monocytes_percent=high,international_standard_ratio=high}"      
## [27] "{outcome=Deceased,international_standard_ratio=high}"            
## [28] "{outcome=Deceased,percent_lymphocyte=normal}"                    
## [29] "{outcome=Deceased,monocytes_percent=high}"                       
## [30] "{outcome=Deceased}"                                              
## [31] "{}"                                                              
## Itemsets in Consequent (RHS)
## [1] "{lactate_dehydrogenase=high}"        "{international_standard_ratio=high}"
## [3] "{outcome=Deceased}"                  "{percent_lymphocyte=low}"           
## [5] "{outcome=Survived}"

Rules sorted by support

This sort gives us perspective on what high support means in the context of the Covid-19 dataset. The highest support is .94, but support tapers off very quickly after just ten entries.
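A sketch of how these sorted views can be produced, assuming the rules object created above (swap by = for "confidence" or "lift" in the next two subsections):

arules::inspect(head(sort(rules, by = "support"), 10))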

Rules sorted by confidence

Our strong rule results sorted by confidence. The right hand side again shows lactate_dehydrogenase=high with high frequency. The highest confidence is 1 within the top 47 strong rules.

Rules sorted by lift

The top 47 rules are sorted again, but this time by lift. The right hand side outcome=“Survived” dominates the top 10 rules sorted by lift.

Top Rules

Sticking with the premise that elevated values for most of the important features tend to result in Covid-19 patient fatality, we filter the rules on outcome=Deceased. Recall that “outcome=Deceased” means the patient passed away as a result of contracting Covid-19.

This results in only the top 9 rules, but they are very strong ones. Not only do they have good support, confidence and lift, they also occur frequently. Recall there are only 375 observations in the patient data set, so over 100 occurrences of a rule is high.

Primary rule filter is (supp = 0.10, conf = 0.78, maxlen = 3).
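A sketch of the outcome filter, assuming the rules object above and arules::subset():

# keep rules whose right hand side is outcome=Deceased, then take the top 9
rules_deceased <- subset(rules, subset = rhs %in% "outcome=Deceased")
inspectDT(head(sort(rules_deceased, by = "lift"), 9))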

Top 9 rules with the RHS filtered to Deceased patients.

Top 9 rules with the RHS filtered to Survived patients. This yields 5 of the 9 top rules.

ArulViz Visualizations

Plotting Top Rules

The two plots show five rules with RHS filtered to Survived and Deceased respectively. The size is controlled by support and color is controlled by lift.

Plot Rules

The degree of size and color indicates the level of support, and lactate_dehydrogenase shows high support.

Plot Confidence

The two-key plot shows that many rules have confidence between .08 and 1 and support between .02 and .06.

Arules CBA Model

Classification Based on Association Rules, or CBA, is a classification method that builds a classifier from a set of mined association rules.

Model Data Splitting

The data will be split 80/20 using the data-splitting helper defined earlier. The split results in 300 rows to train CBA upon and 75 for testing. The top four features are used as predictors.

set.seed(seed)
#seed
#Used 80/20 split to be consistent with the rest of the study.
x <- modelData2(df.biomarkers)

train <-x@trn
test <- x@tst
hold <- x@hld
trn <- as.data.frame(dim(train))
names(trn) <- c("dimension_of_training_dataset")
tst <-as.data.frame(dim(test))
names(tst) <- c("dimension_of_testing_dataset")
hld <-as.data.frame(dim(hold))
names(hld) <- c("dimension_of_hold_dataset")


split80_20 <- cbind(trn,tst) 
split80_20 %>% kable() %>% kable_styling()
dimension_of_training_dataset dimension_of_testing_dataset
300 75
5 5
cba_train <- train 
#dim(train)
cba_test <- test 
#dim(test)
cba_hold <- hold
#dim(hold)

Fit CBA Model

Arc’s cba() is called to fit a model on the training data set, and the rules generated by the fit are shown.

set.seed(seed)
require(arulesCBA)
#install.packages("arc", dependencies = TRUE)
require(arc)

#use same support and confidence settings as were used in the strong rules analysis.
v_supp <- .11
v_conf <- .76
# train CBA classifiers
#88% accuracy
#fit.cba.4 <- CBA(outcome ~ ., data = cba_train, supp = v_supp, conf=v_conf, verbose = FALSE)

#88% accuracy
fit.cba.4 <- arc::cba(cba_train, "outcome",rulelearning_options = list(minsupp = v_supp, minconf = v_conf,minlen = 2, maxlen = 3, maxtime = 1000,target_rule_count = 50000,trim=TRUE, find_conf_supp_thresholds = FALSE))

# 84% accuracy
#fit.cba.4 <- arc::cba(cba_train, "outcome",rulelearning_options = list(target_rule_count = 60))

#summary(fit.cba.4)
#fit.cba.4

#arules::inspect(fit.cba.4@rules)
# DT breaks on CBA train but works on arc train.
inspectDT(fit.cba.4@rules)

Test CBA Model

Classification Based On Association Rules - a classifier based on association rules mined from an input data set. CBA demonstrated good accuracy at 88%; it misclassified three fatalities as Survived and six survivors as Deceased. Test results with the caret confusion matrix can be viewed below.

# test CBA

predict.cba <- predict(fit.cba.4,cba_test)
#cm <- table(predicted = predict.cba, true = cba_test$outcome)
cm  <- confusionMatrix(predict.cba, cba_test$outcome)
cm
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Deceased Survived
##   Deceased       36        6
##   Survived        3       30
##                                         
##                Accuracy : 0.88          
##                  95% CI : (0.784, 0.944)
##     No Information Rate : 0.52          
##     P-Value [Acc > NIR] : 3.57e-11      
##                                         
##                   Kappa : 0.759         
##                                         
##  Mcnemar's Test P-Value : 0.505         
##                                         
##             Sensitivity : 0.923         
##             Specificity : 0.833         
##          Pos Pred Value : 0.857         
##          Neg Pred Value : 0.909         
##              Prevalence : 0.520         
##          Detection Rate : 0.480         
##    Detection Prevalence : 0.560         
##       Balanced Accuracy : 0.878         
##                                         
##        'Positive' Class : Deceased      
## 
draw_confusion_matrix(cm)

Conclusions

Fortunately, a very small percentage of people who contract Covid-19 die from it. However, just the possibility of contracting a virus that can be fatal has sent fears throughout the global community. Some citizens are highly susceptible to the virus, and there is data that points to pre-existing health conditions as a major factor. Biomarkers are measured to help indicate phenomena such as disease, infection, or environmental exposure. The biomarkers collected from Covid-19 patients have proven very useful in predicting mortality with very high accuracy.

Several machine learning models were tested against the Covid-19 biomarker data, and all the models did very well. We can conclude the biomarker data lends itself well to the use case of predicting Covid-19 fatalities. This was especially true when combined with principal component analysis. Ultimately, the gradient boosting machines (gbm) and K Nearest Neighbors had the best accuracy at 96%, but the other modeling algorithms were not far behind. In fact, none of the methods tested had accuracy below 88%. Even the classification based on association rules (CBA) method achieved 88% accuracy using only four of seventy-four biomarkers. The four biomarkers were informed by a combination of earlier analyses in the project such as principal component analysis, k-means clustering, important-feature results, and correlation analysis.

To improve upon the traditional machine learning results, deep learning (DL) was employed. The Keras interface to R has opened new possibilities for R data scientists and was used in this study. The method proved very useful for this use case, yielding 97% accuracy with a sequential model. Although deep learning is harder to explain than the traditional machine learning (ML) models, it is still worthy of deployment consideration.

In summary, this study developed several classification models which have been shown to predict mortality with accuracy rates as high as 97%. These results are promising and, if used as early detection mechanisms for patients with Covid-19, could assist in reducing case fatality rates (CFR) now or in the future. It is paramount that models such as these be implemented within a hospital setting and used in conjunction with high-frequency testing of the most predictive biomarkers. As part of a solution, machine learning may give doctors the edge they need to save more lives.