library(ggthemes)
col34="dodgerblue4"
col2="blue"
col1="slateblue"
col="#E64141" #Covid Red #E64141

# caption used for certain the charts 
the_caption <- "Covid-19"
theme_2 <-
    theme_gray() +
    theme(
      axis.title.y = element_blank(),
      legend.title = element_blank(),
      legend.position = "top")


theme_1 <- function (base_size = 11, base_family = "") {
ggplot2::theme_set(theme_classic(base_size = 16))
suppressWarnings(
theme_update(
  axis.text.x = element_text(size = 16)
, axis.text.y = element_text(size = 16)
, axis.title.x = element_text(size = 16)
, axis.title.y = element_text(size = 16)
, legend.title = element_text(size = 16)
, legend.text  = element_text(size = 14)
, panel.grid.major = element_blank()
, panel.grid.minor = element_blank()
, strip.background = element_blank()
, panel.margin = unit(0, "lines")
, legend.key.size = unit(.55, "cm")
, legend.key = element_rect(fill = "white")
, panel.margin.y = unit(0.5, "lines")
, panel.border = element_rect(
    colour = "black", fill = NA, size = 1)
, strip.text.x = element_text(
    size = 16, colour = "blue", face = "bold")
  )
)
}
# 

# #set global ggtheme
# gbl_ggtheme <- gbl_ggtheme 
gbl_ggtheme <- theme_1() #theme_classic(base_size = 16)
theme_set(gbl_ggtheme)
#  theme_set(theme_classic(base_size = 16)) 
# #,panel.background = element_rect(fill = "#E64141") #Covid Red #E64141


#https://www.rapidtables.com/web/color/RGB_Color.html

Overview

The sudden increase of COVID-19 cases has placed high pressure on our health-care services worldwide. It is vital that fast, accurate and early clinical assessments are made in order to reduce the disease severity. To support decision making and logistical planning in health care systems, this study leverages the data of a prior study.

In the prior study, blood samples from 485 infected patients in the region of Wuhan, China were used to study biomarkers of Covid-19 patients. The new study here will include 375 samples of the aforementioned 485 patients. Machine learning tools will be used to identify biomarkers that can be used to predict mortality of the 375 individual patients sampled. The outcomes of all the patients is known, and indicted in the data.

The analysis to follow will attempt to improve upon the accuracy of the fatality predictions in the original study by using enhanced machine learning and deep learning tools associated with the language R. As part of the analysis, a comprehensive look at the R machine learning models will be explored first to find the parsimonious model, then a deep learning approach will be deployed to determine if the accuracy results can be improved.

Motivation

Patients can potentially survive Covid-19 if the biomarkers are measured early and analyzed throughout the patients’ stay. Machine learning models may be the best way to detect signals within key biomarkers, and thereby act as an early warning mechanism. Patients can have their care augmented with specific treatments informed by changes in the biomarkers, which predict their mortality. This analysis seeks to find what R modeling solution yields the highest accuracy in an attempt to demonstrate a way to help save more lives when faced with epidemics such as Covid-19.

#functions

printVecInfo <- function(inputVector,param="all"){
  
  v <- inputVector
 
 print(paste("mean: ", mean(v,na.rm=T) ))
  print(paste("median: ", median(v,na.rm=T) ))
   print(paste("min: ", min(v,na.rm=T) ))
      print(paste("max: ", max(v,na.rm=T) ))
   print(paste("range: ", max(v,na.rm=T)-min(v,na.rm=T) ))
  print(paste("sd: ", sd(v,na.rm=T)))
 print(paste("quantile: ", quantile(v,na.rm=T) ))
 print(paste("IQR: ",quantile(v,na.rm=T)[['75%']]-quantile(v,na.rm=T)[['25%']])) # Interquartile range       
}
# gets the percentages for each column that has NA's
NASummaryInfoPct <- function(indf){
  a<- gather(indf %>% summarise_all(~(sum(is.na(.))/n())), key = "columns", value = "percent_missing")

 return(a)
}

# gets the percentages for each column that has NA's
NASummaryInfoCnt <- function(indf){
  b<- gather(indf %>% summarise_all(~(sum(is.na(.)))), key = "columns", value = "count_missing")

 return(b)
}


#Return multiple data frames in one object using the S4 class system.
setClass(
    "mdl",
    representation(
        trn = "vector",
        tst = "vector",
        hld = "vector"
    )
)

require(caret)
modelDataC <- function (df1 = df,trnpct=.8) {
  #take a 10% holdout
  z <- df1  
  data<- createDataPartition(y=z$label, p=.90, list=FALSE)
    df1 <-z[data,] #set to 90% of its original rows
    tst2 <- z[-data,]
    
  z <- df1  
  data<- createDataPartition(y=z$label, p=trnpct, list=FALSE)
    trn <-z[data,]
    
    tst  <- z[-data,] #balance what is left over after train @.80
  
    
   
  mdl <- new("mdl",
                  trn = trn,
                  tst = tst,
                  hld = tst2  )
  return(mdl)
  
}


#Return multiple data frames in one object using the S4 class system.
setClass(
    "mdl",
    representation(
        trn = "vector",
        tst = "vector",
        hld = "vector"
    )
)

modelData2 <- function (df1 = df_ww) {
  z <- df1  
    # get number of rows in the data frame
  len <- as.numeric(dim(z)[1])
  randIndex <- sample(as.numeric(rownames(z)[1:len]))
  # setting cutpoint to 2/3 of the input data frame
  cutPoint80 <- floor(.8 * len)
  # setting cutpoint to 1/3 of the input data frame
  cutPoint20 <- floor(.2 * len)
  trn2 <- z[randIndex[1:cutPoint80],]
  tst2 <- z[randIndex[cutPoint80+1:c(len-cutPoint80)],] 
  hld2 <- z[-randIndex[1: cutPoint80],]
  
  mdl <- new("mdl",
                  trn = trn2,
                  tst = tst2,
                  hld = hld2  )
  return(mdl)
  
}

About Model Data

The data tracks several patients over their hospital stay, and keeps records on the patients 74 bio markers. The admission time, discharge time and outcome make a unique patient.

The original data source had already been anonymized to comply with HIPAA, PCI and PII data sensitivity laws.
Dependent variable is outcome with 0 death and 1 survived
Data set dimensions are 6120 rows by 81 columns
There are 74 biomarker features, and some of the most important one are listed below.

Biomarker definitions:

The most important bio markers for the analysis are listed with brief definitions. The three primary categories of BIO markers fall into cholesterol, inflammation, and clotting.

percent_lymphocyte: Lymphocytes are a type of white blood cell, consisting of both B cells and T cells. B cells are responsible for the creation of antibodies, while T cells are responsible for controlling immune response to foreign substances. This marker indicates the percent of lymphocytes present relative to a patient’s white blood cell count.
lactate_dehydrogenase: LDH is an enzyme released when there is tissue damage present in a patient’s body tissues. This marker measures the amount of LDH in blood or other body fluids and is vital in monitoring viral conditions or cancer.
lymphocyte_count lymphocytes are a type of white blood cell, consisting of both B cells and T cells. B cells are responsible for the creation of antibodies, while T cells are responsible for controlling immune response to foreign substances. This marker indicates the count of lymphocytes present in a patient’s blood count.
albumin: Albumin is a protein created by your liver. This protein is primarily responsible for maintaining and keeping fluid levels in your bloodstream so that it does not leak into other tissues. This protein also helps carry hormones, vitamins, and enzymes throughout the body. Higher levels can indicate a patient fighting infections. Low levels can indicate problems with liver or inflammatory diseases.
neutrophils_percent: Neutrophils are another type of white blood cell that makes up 55 to 70 percent of the white blood cell count. They consist of the majority of cells that make up the immune system response. Neutrophils help fight antigens and are not limited to a specific area of circulation. This marker shows the percent of neutrophils present relative to a patient’s white blood cell count.
eosinophils_percent: Eosinophils are another disease-fighting white blood cell. The cell is primarily responsible for destroying antigens, particularly viruses. They also have a role in the inflammatory response. This marker shows the percent of eosinophils present relative to a patient’s white blood cell count.
eosinophil_count: Eosinophils are another disease-fighting white blood cell. The cell is primarily responsible for destroying antigens, particularly viruses. They also have a role in the inflammatory response. This marker shows the count of eosinophils present in a patient’s blood count.
mean_corpuscular_hemoglobin_concentration: MCHC is a measure of the concentration of hemoglobin present in a given volume of packed red blood cell.
x2019_n_co_v_nucleic_acid_detection: This marker is a specific test result from a detection test responsible for identifying the presence of nucleic acid from SARS-CoV-2 in humans. This test shows the presence or absence of viral infection by SARS-CoV-2.
basophil_percent: Basophil is naturally produced by several types of white blood cells. Basophil contains heparin which naturally prevents blood clotting, while also releasing histamine during allergic reactions. It is also responsible for helping the body produce an antibody called immunoglobulin E, which help mediate inflammatory responses in the body. This marker shows the percent of basophil present relative to a patient’s white blood cell count.
total_cholesterol: Cholesterol is a waxy substance found in blood that assists in building healthy cells. High levels of cholesterol can coat arteries and restrict blood flow. Total cholesterol is measured as a count in a patient’s blood panel.
procalcitonin: Procalcitonin is a protein produced by many cells in the body in response to both bacterial infections and tissue injury. The count of procalcitonin can increase based on systemic infections and sepsis.
thrombin_time: Thrombin time or thrombin clotting time is a blood test that measures the time It takes for a clot to form in the plasma of a blood sample. This assists in helping doctors investigate excessive bleeding or abnormal blood clot formation.
red_blood_cell_count: Red blood cell count is a test conducted to find out how many red blood cells are present in your blood. It is also known as an erythrocyte count. RBC’s contain hemoglobin, which is responsible for carrying oxygen to body tissues. This count is vital in understanding how tissues are functioning and healing in response to viral infections.
indirect_bilirubin: Bilirubin tests measure the amount of bilirubin in blood, an orange-yellow pigment that forms when red blood cells break down. Indirect bilirubin is specifically bound to albumin. High levels of bilirubin indicate that the body is destroying too many red blood cells, which can also indicate liver or tissue damage.
h_bs_ag: HBsAg is the antigen of the hepatitis B virus. Presence of this indicates current HBV infection.
hco3: Bicarbonate is a byproduct of the body’s metabolic function. Blood contains bicarbonate which is brought to the lungs and exhaled as carbon dioxide. Kidneys excrete and reabsorb bicarbonate and maintain appropriate acid balance in the body.
platelet_count: Platelets are small fragments of cells present for creating normal and protective blood clots. The platelet count test determined the number of platelets present in a sample of blood.
serum_sodium: Sodium tests are conducted to see how much sodium is present in the blood. Sodium is an essential mineral in the body that is important for healthy nerve and muscle function.
hemoglobin: Hemoglobin is a red blood cell protein responsible for transporting oxygen in the body’s blood. A low count of hemoglobin can indicate deficiencies and anemia, which can hinder healing and recovery in the body from virus or infection.
lactate_dehydrogenase: LDH is an enzyme released when there is tissue damage present in a patient’s body tissues. This marker measures the amount of LDH in blood or other body fluids and is vital in monitoring viral conditions or cancer.
monocytes_percent: Monocytes are another type of white blood cell. They assist in fighting bacteria, viruses, and other infections. They are a key component of white blood cell immune response. High levels indicate the body fighting infection.
international_standard_ratio International Normalized Ratio is a preferred test for patients to assess the risk of bleeding or coagulation status in patients. This ratio shows how well blood clots, and if there are any abnormalities present.
percent_lymphocyte: Lymphocytes are a type of white blood cell, consisting of both B cells and T cells. B cells are responsible for the creation of antibodies, while T cells are responsible for controlling immune response to foreign substances. This marker indicates the percent of lymphocytes present relative to a patient’s white blood cell count.
eosinophil_count: Eosinophils are another disease-fighting white blood cell. The cell is primarily responsible for destroying antigens, particularly viruses. They also have a role in inflammatory response. This marker shows the count of eosinophils present in a patient’s blood count.
eosinophils_percent: Eosinophils are another disease-fighting white blood cell. The cell is primarily responsible for destroying antigens, particularly viruses. They also have a role in inflammatory response. This marker shows the percent of eosinophils present relative to a patient’s white blood cell count.
neutrophils_percent: Neutrophils are another type of white blood cell, that makes up 55 to 70 percent of the white blood cell count. They consist of the majority of cells that make up the immune system response. Neutrophils help fight antigens and are not limited to a specific area of circulation. This marker shows the percent of neutrophils present relative to a patient’s white blood cell count.
amino_terminal_brain_natriuretic_peptide_precursor_nt_pro_bnp: Natriuretic peptide tests measure the presence of brain natriuretic peptide and B-terminal pro b-type natriuretic peptides present. These are substances created by the heart. Small levels are normally found in the bloodstream. High levels of BNP, NT-proBNP indicate that the heart is failing to pump the appropriate amount of blood needed.
thrombocytocrit: This device measures the platelet content of the blood. Platelets are small fragments of cells present for creating normal and protective blood clots. The platelet count test determined the number of platelets present in a sample of blood.
hypersensitive_cardiac_troponin_i: This is the preferred biomarker for acute myocardial infarction. This helps diagnose acute coronary syndromes in those who are critically ill.
fibrin_degradation_products: Fibrin degradation products are components of the blood produced by clot degeneration. This substance remains in your bloodstream after your body dissolves a blood clot, which in return regulates and manages clot dissolving. This specific test determines the amount of FDPs present in blood to ensure that the body is healing correctly, particularly in response to clots and blood illnesses.
procalcitonin: Procalcitonin is a protein produced by many cells in the body in response to both bacterial infections and tissue injury. The count of procalcitonin can increase based on systemic infections and sepsis.
albumin: Albumin is a protein created by your liver. This protein is primarily responsible for maintaining and keeping fluid levels in your bloodstream so that it does not leak into other tissues. This protein also helps carry hormones, vitamins, and enzymes throughout the body. Higher levels can indicate a patient fighting infections. Low levels can indicate problems with liver or inflammatory diseases
prothrombin_activity: Prothrombin time is a test used to help detect and diagnose a bleeding or excessivee clotting disorder. This helps doctors evaluate a patient’s ability to appropriately form blood clots.
d_d_dimer: A D-dimer test is used to help rule out the presence of serious life-threatening blood clots. High levels of D-dimer, the protein leftover from clotting, can indicate deep vein thrombosis and other clotting disorders.
aspartate_aminotransferase: Aspartate aminotransferase is an enzyme created by the liver but found in many cells throughout the body. High levels of AST are found in the body when liver damage is present, as the liver releases AST in response to distress of the liver.
neutrophils_count: Neutrophils are another type of white blood cell, that makes up 55 to 70 percent of the white blood cell count. They consist of the majority of cells that make up the immune system response. Neutrophils help fight antigens and are not limited to a specific area of circulation. This marker shows the count of neutrophils present in a patient’s blood cell count.
direct_bilirubin: Bilirubin tests measure the amount of bilirubin in blood, an orange-yellow pigment that forms when red blood cells break down. Direct bilirubin specifically travels from the liver to the small intestines. High levels of bilirubin indicate that the body is destroying too many red blood cells, which can also indicate liver or tissue damage.
thrombin_time: Thrombin time or thrombin clotting time is a blood test that measures the time it takes for a clot to form in the plasma of a blood sample. This assists in helping doctors investigate excessive bleeding or abnormal blood clot formation.
high_sensitivity_c_reactive_protein: High sensitivity C-reactive protein is a biomarker used in determining coronary heart disease. The protein itself is produced when there is damage or inflammation in blood vessels, particularly in the heart.

Data Loading/Cleaning

The original data dimensions and format can be viewed below. There is a lot of NAs seen within the patient columns, but was found to be due to not repeating the patient id for each day of the patients stay. The biomaker festures have plenty of NA since not all markers were captured for every patient during every time period. Additionally, the biomarker features can be quite long and they are largely non-conforming R column names.

#Use janitory to clean the column names
require(janitor)
#https://medium.com/@verajosemanuel/janitor-a-good-r-package-for-data-cleaning-f3c733632ad9


raw <- read_csv('./Data/covid19_bio_mdl.csv')

## Parsed with column specification:
## cols(
##   .default = col_double(),
##   RE_DATE = col_datetime(format = ""),
##   `Admission time` = col_datetime(format = ""),
##   `Discharge time` = col_datetime(format = "")
## )

## See spec(...) for full column specifications.

dim(raw)

## [1] 6120   81

#glimpse(raw)
head(raw[,1:7]) %>% kable() %>% kable_styling()

PATIENT_ID	RE_DATE	age	gender	Admission time	Discharge time
1	2020-01-31 01:09:00	73	1	2020-01-30 22:12:47	2020-02-17 12:40:09
NA	2020-01-31 01:25:00	73	1	2020-01-30 22:12:47	2020-02-17 12:40:09
NA	2020-01-31 01:44:00	73	1	2020-01-30 22:12:47	2020-02-17 12:40:09
NA	2020-01-31 01:45:00	73	1	2020-01-30 22:12:47	2020-02-17 12:40:09
NA	2020-01-31 01:56:00	73	1	2020-01-30 22:12:47	2020-02-17 12:40:09
NA	2020-01-31 01:59:00	73	1	2020-01-30 22:12:47	2020-02-17 12:40:09

head(raw[,8:15]) %>% kable() %>% kable_styling()

Hypersensitive cardiac troponinI	hemoglobin	Serum chloride	Prothrombin time	procalcitonin	eosinophils(%)	Interleukin 2 receptor	Alkaline phosphatase
NA	NA	NA	NA	NA	NA	NA	NA
NA	136	NA	NA	NA	0.6	NA	NA
NA	NA	103.1	NA	NA	NA	NA	46
NA	NA	NA	13.9	NA	NA	NA	NA
19.9	NA	NA	NA	0.09	NA	NA	NA
NA	NA	NA	NA	NA	NA	NA	NA

#The names are dirty, so we're using janitor to clean them. The original names are stored for reference. 

raw_names <- names(raw)
#raw_names # we need these original columns for reference

df_raw <- raw
df <- df_raw %>% clean_names() 
df_clean_names <- names(df)
#df_clean_names
#make names for good measure to ensure janitor names a legal R columns names
colnames(df) <- make.names(colnames(df))
#df_clean_names  
#dim(df)
#glimpse(raw)
#glimpse(df)

Fixed Biomarker Names

The data dimensions and a sample of the fixed data is seen below here.

#dim(df)
head(df[,8:15]) %>% kable() %>% kable_styling()

hypersensitive_cardiac_troponin_i	hemoglobin	serum_chloride	prothrombin_time	procalcitonin	eosinophils_percent	interleukin_2_receptor	alkaline_phosphatase
NA	NA	NA	NA	NA	NA	NA	NA
NA	136	NA	NA	NA	0.6	NA	NA
NA	NA	103.1	NA	NA	NA	NA	46
NA	NA	NA	13.9	NA	NA	NA	NA
19.9	NA	NA	NA	0.09	NA	NA	NA
NA	NA	NA	NA	NA	NA	NA	NA

#tail(df) %>% kable() %>% kable_styling()

Create Patient Master

Each patient is master contains the three columns,patient id plus the admission and discharge time, revealing 375 patients in the sample.

df_patient_master_raw <- df %>% select(patient_id,admission_time,discharge_time) 
#dim(df_patient_master_raw)

# #unique ID before patient compression
# library(data.table)
# df_patient_master_raw2 <- setDT(df_patient_master_raw)[, id := paste0('',.I)]
# head(df_patient_master_raw2)
# dim(df_patient_master_raw2)

data <- df_patient_master_raw
data_subset <- data[ , c("patient_id")] 
df_patient_master <- data[complete.cases(data_subset), ] # Omit NAs by columns
head(df_patient_master)

## # A tibble: 6 x 3
##   patient_id admission_time      discharge_time     
##        <dbl> <dttm>              <dttm>             
## 1          1 2020-01-30 22:12:47 2020-02-17 12:40:09
## 2          2 2020-02-04 21:39:03 2020-02-19 12:59:01
## 3          3 2020-01-23 10:59:36 2020-02-08 17:52:31
## 4          4 2020-01-31 23:03:59 2020-02-18 12:59:12
## 5          5 2020-02-01 20:59:54 2020-02-18 10:33:06
## 6          6 2020-01-24 10:47:10 2020-02-07 09:06:58

dim(df_patient_master)

## [1] 375   3

# Print data_by_column to RStudio console
#write.csv(data,file="covid_19_patient_master.csv")

#Create df with corrected patient_id label. The data is still a sparse data frame at this point with the sparse data being NA (as opposed to 0). 

#Dimensions are still 6120,81 here. 
library(dplyr)

df_raw2 <- inner_join(df_patient_master, df, by = c("admission_time" = "admission_time","discharge_time" ="discharge_time"))
#drop bad patient_id
df_raw2$patient_id.y <- NULL
#rename new patient_id
names(df_raw2)[1] <- c("patient_id")

#head(df_raw2) #%>% regulartable() %>% theme_zebra() %>%  autofit()   
#dim(df_raw2)
#head(df_raw2[1,c(1:5)])

Agregate Data

Next we aggregate the data to the patient master level of granularity. As part of the aggregation we take the “max” of all biomarkers while grouping by patient master data. This approach leaves the highest bio-marker measurements for each patient. Additionally, the NA and infinite numbers are set to zero.

Note: This reduces the dataset to the number of patients, which is 375. Dimension is now 375,80 here as we lthe time the bio marker was taken. We aren’t interested in the time component from this analysis. The the objective here is only determine if we can predict mortality with the biomarkers. The time component does become important later as time is of the essence when once is admitted with Covid-19 testing positive.

New/Changed Features: * duration_days - is the differences in days the patient was admitted to discharged * factorized - patient_id,gender,age,outcome

#install.packages("hablar")
library(hablar)
require(hablar)
require(lubridate)

#summarise numerics by patient
by_patient <- df_raw2 %>%
  group_by(patient_id, age, gender, outcome,admission_time,discharge_time)

#create df_clean
by_patient %>%
  summarise_if(is.numeric, max, na.rm = TRUE) %>% ## Clean infinite and NA
  mutate_if(is.numeric, ~replace(., is.infinite(.), 0)) %>% 
  mutate_if(is.numeric, ~replace(., is.na(.), 0)) -> df_clean

#write.csv(df_clean,file="biomarkers_max.csv")
#write.csv(df_clean,file="covid19_biomarkers_max.csv")
#write.csv(df_clean,file="df_clean.csv")
#as_datetime(x, tz = "UTC")

head(df_clean[,c(2:4,7:9)])

## # A tibble: 6 x 6
## # Groups:   age, gender, outcome [6]
##     age gender outcome hypersensitive_cardiac_troponi~ hemoglobin serum_chloride
##   <dbl>  <dbl>   <dbl>                           <dbl>      <dbl>          <dbl>
## 1    73      1       0                            19.9        140           103.
## 2    61      1       0                            16.9        151           100.
## 3    70      2       0                             0          126           103.
## 4    74      1       0                             4.8        110           103.
## 5    29      2       0                             5.6        134           102.
## 6    81      2       0                            19.7        108           106.

#dim(df_clean)

Biomarker Stats

The biomarkers with relatively high standard deviations and/or wide ranges are candidates for a closer look. For example, “hypersensitive_cardiac_troponin_i” falls into the category, which should be further studied.

#library(tigerstats)
#inspect(df_clean[,c(2:4,7:80)]) #%>% kable() %>% kable_styling()
#densityplot(~outcome|gender,data=df_clean[,c(2:4,7:80)])

df_stat <- df_clean[,c(7:80)]
dim(df_stat)

## [1] 375  74

dfIntCols <- df_stat %>% select_if(., is.numeric)
#names(dfIntCols) %>% kable() %>% kable_styling()
#print('===============================================================')
#print('===============================================================')

statInfo <- function(a) {
  for (i in colnames(a)){
   x <- colnames(df_stat[i])
   print('---------------------------------------------------------------------------------')
   print(x)
  #paste('Column: ', colnames(df_stat[i]), sep=',')
 
  #print(colnames(df_stat[i]))
   print('=================================================================================')
   printVecInfo(df_stat[[i]])
   
  
   }
}   

statInfo(dfIntCols)

## [1] "---------------------------------------------------------------------------------"
## [1] "hypersensitive_cardiac_troponin_i"
## [1] "================================================================================="
## [1] "mean:  887.5552"
## [1] "median:  4.8"
## [1] "min:  0"
## [1] "max:  50000"
## [1] "range:  50000"
## [1] "sd:  4933.42370112608"
## [1] "quantile:  0"     "quantile:  1.9"   "quantile:  4.8"   "quantile:  47.2" 
## [5] "quantile:  50000"
## [1] "IQR:  45.3"
## [1] "---------------------------------------------------------------------------------"
## [1] "hemoglobin"
## [1] "================================================================================="
## [1] "mean:  124.914666666667"
## [1] "median:  130"
## [1] "min:  0"
## [1] "max:  178"
## [1] "range:  178"
## [1] "sd:  33.6386614429292"
## [1] "quantile:  0"   "quantile:  118" "quantile:  130" "quantile:  143"
## [5] "quantile:  178"
## [1] "IQR:  25"
## [1] "---------------------------------------------------------------------------------"
## [1] "serum_chloride"
## [1] "================================================================================="
## [1] "mean:  98.8357333333333"
## [1] "median:  103.1"
## [1] "min:  0"
## [1] "max:  140.4"
## [1] "range:  140.4"
## [1] "sd:  25.2361054677626"
## [1] "quantile:  0"      "quantile:  100.25" "quantile:  103.1" 
## [4] "quantile:  106.2"  "quantile:  140.4" 
## [1] "IQR:  5.95"
## [1] "---------------------------------------------------------------------------------"
## [1] "prothrombin_time"
## [1] "================================================================================="
## [1] "mean:  15.6466666666667"
## [1] "median:  14.4"
## [1] "min:  0"
## [1] "max:  120"
## [1] "range:  120"
## [1] "sd:  9.1200714964205"
## [1] "quantile:  0"     "quantile:  13.45" "quantile:  14.4"  "quantile:  16.3" 
## [5] "quantile:  120"  
## [1] "IQR:  2.85"
## [1] "---------------------------------------------------------------------------------"
## [1] "procalcitonin"
## [1] "================================================================================="
## [1] "mean:  1.05938666666667"
## [1] "median:  0.07"
## [1] "min:  0"
## [1] "max:  57.17"
## [1] "range:  57.17"
## [1] "sd:  4.66756505327561"
## [1] "quantile:  0"     "quantile:  0.02"  "quantile:  0.07"  "quantile:  0.33" 
## [5] "quantile:  57.17"
## [1] "IQR:  0.31"
## [1] "---------------------------------------------------------------------------------"
## [1] "eosinophils_percent"
## [1] "================================================================================="
## [1] "mean:  1.0112"
## [1] "median:  0.5"
## [1] "min:  0"
## [1] "max:  8.6"
## [1] "range:  8.6"
## [1] "sd:  1.3383214005324"
## [1] "quantile:  0"   "quantile:  0"   "quantile:  0.5" "quantile:  1.6"
## [5] "quantile:  8.6"
## [1] "IQR:  1.6"
## [1] "---------------------------------------------------------------------------------"
## [1] "interleukin_2_receptor"
## [1] "================================================================================="
## [1] "mean:  559.413333333333"
## [1] "median:  381"
## [1] "min:  0"
## [1] "max:  7500"
## [1] "range:  7500"
## [1] "sd:  802.589480465136"
## [1] "quantile:  0"    "quantile:  0"    "quantile:  381"  "quantile:  853" 
## [5] "quantile:  7500"
## [1] "IQR:  853"
## [1] "---------------------------------------------------------------------------------"
## [1] "alkaline_phosphatase"
## [1] "================================================================================="
## [1] "mean:  88.1013333333333"
## [1] "median:  74"
## [1] "min:  0"
## [1] "max:  620"
## [1] "range:  620"
## [1] "sd:  59.0000940208925"
## [1] "quantile:  0"     "quantile:  56"    "quantile:  74"    "quantile:  106.5"
## [5] "quantile:  620"  
## [1] "IQR:  50.5"
## [1] "---------------------------------------------------------------------------------"
## [1] "albumin"
## [1] "================================================================================="
## [1] "mean:  33.28"
## [1] "median:  34.9"
## [1] "min:  0"
## [1] "max:  48.6"
## [1] "range:  48.6"
## [1] "sd:  9.32318138995196"
## [1] "quantile:  0"    "quantile:  30.1" "quantile:  34.9" "quantile:  38.8"
## [5] "quantile:  48.6"
## [1] "IQR:  8.7"
## [1] "---------------------------------------------------------------------------------"
## [1] "basophil_percent"
## [1] "================================================================================="
## [1] "mean:  0.303733333333333"
## [1] "median:  0.2"
## [1] "min:  0"
## [1] "max:  1.7"
## [1] "range:  1.7"
## [1] "sd:  0.268660637839674"
## [1] "quantile:  0"   "quantile:  0.1" "quantile:  0.2" "quantile:  0.4"
## [5] "quantile:  1.7"
## [1] "IQR:  0.3"
## [1] "---------------------------------------------------------------------------------"
## [1] "interleukin_10"
## [1] "================================================================================="
## [1] "mean:  9.06533333333333"
## [1] "median:  5"
## [1] "min:  0"
## [1] "max:  1000"
## [1] "range:  1000"
## [1] "sd:  52.4802083736788"
## [1] "quantile:  0"    "quantile:  0"    "quantile:  5"    "quantile:  8"   
## [5] "quantile:  1000"
## [1] "IQR:  8"
## [1] "---------------------------------------------------------------------------------"
## [1] "total_bilirubin"
## [1] "================================================================================="
## [1] "mean:  18.8773333333333"
## [1] "median:  12"
## [1] "min:  0"
## [1] "max:  505.7"
## [1] "range:  505.7"
## [1] "sd:  35.0329774867403"
## [1] "quantile:  0"     "quantile:  8.05"  "quantile:  12"    "quantile:  18.4" 
## [5] "quantile:  505.7"
## [1] "IQR:  10.35"
## [1] "---------------------------------------------------------------------------------"
## [1] "platelet_count"
## [1] "================================================================================="
## [1] "mean:  220.685333333333"
## [1] "median:  223"
## [1] "min:  -1"
## [1] "max:  558"
## [1] "range:  559"
## [1] "sd:  116.340695393463"
## [1] "quantile:  -1"    "quantile:  144.5" "quantile:  223"   "quantile:  287"  
## [5] "quantile:  558"  
## [1] "IQR:  142.5"
## [1] "---------------------------------------------------------------------------------"
## [1] "monocytes_percent"
## [1] "================================================================================="
## [1] "mean:  7.636"
## [1] "median:  7.8"
## [1] "min:  0"
## [1] "max:  53"
## [1] "range:  53"
## [1] "sd:  4.98565428636018"
## [1] "quantile:  0"     "quantile:  4.2"   "quantile:  7.8"   "quantile:  10.15"
## [5] "quantile:  53"   
## [1] "IQR:  5.95"
## [1] "---------------------------------------------------------------------------------"
## [1] "antithrombin"
## [1] "================================================================================="
## [1] "mean:  49.0186666666667"
## [1] "median:  68"
## [1] "min:  0"
## [1] "max:  136"
## [1] "range:  136"
## [1] "sd:  46.8639762605004"
## [1] "quantile:  0"   "quantile:  0"   "quantile:  68"  "quantile:  91" 
## [5] "quantile:  136"
## [1] "IQR:  91"
## [1] "---------------------------------------------------------------------------------"
## [1] "interleukin_8"
## [1] "================================================================================="
## [1] "mean:  57.0885333333333"
## [1] "median:  6.8"
## [1] "min:  0"
## [1] "max:  6795"
## [1] "range:  6795"
## [1] "sd:  469.091495985732"
## [1] "quantile:  0"     "quantile:  0"     "quantile:  6.8"   "quantile:  21.75"
## [5] "quantile:  6795" 
## [1] "IQR:  21.75"
## [1] "---------------------------------------------------------------------------------"
## [1] "indirect_bilirubin"
## [1] "================================================================================="
## [1] "mean:  8.27093333333333"
## [1] "median:  6.6"
## [1] "min:  0"
## [1] "max:  145.1"
## [1] "range:  145.1"
## [1] "sd:  9.56582502896893"
## [1] "quantile:  0"     "quantile:  4.45"  "quantile:  6.6"   "quantile:  9.95" 
## [5] "quantile:  145.1"
## [1] "IQR:  5.5"
## [1] "---------------------------------------------------------------------------------"
## [1] "red_blood_cell_distribution_width"
## [1] "================================================================================="
## [1] "mean:  12.4170666666667"
## [1] "median:  12.7"
## [1] "min:  0"
## [1] "max:  27.1"
## [1] "range:  27.1"
## [1] "sd:  3.76392628848777"
## [1] "quantile:  0"     "quantile:  12"    "quantile:  12.7"  "quantile:  13.75"
## [5] "quantile:  27.1" 
## [1] "IQR:  1.75"
## [1] "---------------------------------------------------------------------------------"
## [1] "neutrophils_percent"
## [1] "================================================================================="
## [1] "mean:  75.9504"
## [1] "median:  83.6"
## [1] "min:  0"
## [1] "max:  98.9"
## [1] "range:  98.9"
## [1] "sd:  23.1538055129739"
## [1] "quantile:  0"     "quantile:  66.1"  "quantile:  83.6"  "quantile:  93.05"
## [5] "quantile:  98.9" 
## [1] "IQR:  26.95"
## [1] "---------------------------------------------------------------------------------"
## [1] "total_protein"
## [1] "================================================================================="
## [1] "mean:  65.6357333333333"
## [1] "median:  68.9"
## [1] "min:  0"
## [1] "max:  88.7"
## [1] "range:  88.7"
## [1] "sd:  16.3493839663283"
## [1] "quantile:  0"     "quantile:  64.15" "quantile:  68.9"  "quantile:  73.1" 
## [5] "quantile:  88.7" 
## [1] "IQR:  8.94999999999999"
## [1] "---------------------------------------------------------------------------------"
## [1] "quantification_of_treponema_pallidum_antibodies"
## [1] "================================================================================="
## [1] "mean:  0.0977066666666667"
## [1] "median:  0.04"
## [1] "min:  0"
## [1] "max:  11.95"
## [1] "range:  11.95"
## [1] "sd:  0.666373030163863"
## [1] "quantile:  0"     "quantile:  0"     "quantile:  0.04"  "quantile:  0.06" 
## [5] "quantile:  11.95"
## [1] "IQR:  0.06"
## [1] "---------------------------------------------------------------------------------"
## [1] "prothrombin_activity"
## [1] "================================================================================="
## [1] "mean:  81.7146666666667"
## [1] "median:  89"
## [1] "min:  0"
## [1] "max:  142"
## [1] "range:  142"
## [1] "sd:  27.8699457341204"
## [1] "quantile:  0"    "quantile:  71.5" "quantile:  89"   "quantile:  98"  
## [5] "quantile:  142" 
## [1] "IQR:  26.5"
## [1] "---------------------------------------------------------------------------------"
## [1] "h_bs_ag"
## [1] "================================================================================="
## [1] "mean:  6.17981333333333"
## [1] "median:  0"
## [1] "min:  0"
## [1] "max:  250"
## [1] "range:  250"
## [1] "sd:  37.1898486338532"
## [1] "quantile:  0"    "quantile:  0"    "quantile:  0"    "quantile:  0.01"
## [5] "quantile:  250" 
## [1] "IQR:  0.01"
## [1] "---------------------------------------------------------------------------------"
## [1] "mean_corpuscular_volume"
## [1] "================================================================================="
## [1] "mean:  86.4445333333333"
## [1] "median:  90.3"
## [1] "min:  0"
## [1] "max:  118.9"
## [1] "range:  118.9"
## [1] "sd:  21.0188320381879"
## [1] "quantile:  0"     "quantile:  87"    "quantile:  90.3"  "quantile:  94.3" 
## [5] "quantile:  118.9"
## [1] "IQR:  7.3"
## [1] "---------------------------------------------------------------------------------"
## [1] "hematocrit"
## [1] "================================================================================="
## [1] "mean:  36.4373333333333"
## [1] "median:  38.1"
## [1] "min:  0"
## [1] "max:  52.3"
## [1] "range:  52.3"
## [1] "sd:  9.81007848738728"
## [1] "quantile:  0"    "quantile:  34.6" "quantile:  38.1" "quantile:  41.3"
## [5] "quantile:  52.3"
## [1] "IQR:  6.7"
## [1] "---------------------------------------------------------------------------------"
## [1] "white_blood_cell_count"
## [1] "================================================================================="
## [1] "mean:  25.94064"
## [1] "median:  10.33"
## [1] "min:  0"
## [1] "max:  1726.6"
## [1] "range:  1726.6"
## [1] "sd:  106.41926472905"
## [1] "quantile:  0"      "quantile:  5.935"  "quantile:  10.33" 
## [4] "quantile:  16.815" "quantile:  1726.6"
## [1] "IQR:  10.88"
## [1] "---------------------------------------------------------------------------------"
## [1] "tumor_necrosis_factor"
## [1] "================================================================================="
## [1] "mean:  7.09466666666667"
## [1] "median:  5.9"
## [1] "min:  0"
## [1] "max:  168"
## [1] "range:  168"
## [1] "sd:  12.4087933860502"
## [1] "quantile:  0"    "quantile:  0"    "quantile:  5.9"  "quantile:  9.45"
## [5] "quantile:  168" 
## [1] "IQR:  9.45"
## [1] "---------------------------------------------------------------------------------"
## [1] "mean_corpuscular_hemoglobin_concentration"
## [1] "================================================================================="
## [1] "mean:  331.650666666667"
## [1] "median:  347"
## [1] "min:  0"
## [1] "max:  514"
## [1] "range:  514"
## [1] "sd:  79.1725800136951"
## [1] "quantile:  0"   "quantile:  338" "quantile:  347" "quantile:  357"
## [5] "quantile:  514"
## [1] "IQR:  19"
## [1] "---------------------------------------------------------------------------------"
## [1] "fibrinogen"
## [1] "================================================================================="
## [1] "mean:  3.93029333333333"
## [1] "median:  4.4"
## [1] "min:  0"
## [1] "max:  10.78"
## [1] "range:  10.78"
## [1] "sd:  2.51909969685788"
## [1] "quantile:  0"     "quantile:  2.28"  "quantile:  4.4"   "quantile:  5.72" 
## [5] "quantile:  10.78"
## [1] "IQR:  3.44"
## [1] "---------------------------------------------------------------------------------"
## [1] "interleukin_1"
## [1] "================================================================================="
## [1] "mean:  3.90773333333333"
## [1] "median:  5"
## [1] "min:  0"
## [1] "max:  88.5"
## [1] "range:  88.5"
## [1] "sd:  6.75831189039416"
## [1] "quantile:  0"    "quantile:  0"    "quantile:  5"    "quantile:  5"   
## [5] "quantile:  88.5"
## [1] "IQR:  5"
## [1] "---------------------------------------------------------------------------------"
## [1] "urea"
## [1] "================================================================================="
## [1] "mean:  9.93402666666667"
## [1] "median:  5.6"
## [1] "min:  0"
## [1] "max:  68.4"
## [1] "range:  68.4"
## [1] "sd:  10.7461847998858"
## [1] "quantile:  0"     "quantile:  4"     "quantile:  5.6"   "quantile:  11.45"
## [5] "quantile:  68.4" 
## [1] "IQR:  7.45"
## [1] "---------------------------------------------------------------------------------"
## [1] "lymphocyte_count"
## [1] "================================================================================="
## [1] "mean:  1.28085333333333"
## [1] "median:  1.07"
## [1] "min:  0"
## [1] "max:  52.42"
## [1] "range:  52.42"
## [1] "sd:  2.73636157533041"
## [1] "quantile:  0"     "quantile:  0.64"  "quantile:  1.07"  "quantile:  1.585"
## [5] "quantile:  52.42"
## [1] "IQR:  0.945"
## [1] "---------------------------------------------------------------------------------"
## [1] "ph_value"
## [1] "================================================================================="
## [1] "mean:  4.109904"
## [1] "median:  6"
## [1] "min:  0"
## [1] "max:  7.565"
## [1] "range:  7.565"
## [1] "sd:  3.28009748312161"
## [1] "quantile:  0"     "quantile:  0"     "quantile:  6"     "quantile:  7"    
## [5] "quantile:  7.565"
## [1] "IQR:  7"
## [1] "---------------------------------------------------------------------------------"
## [1] "red_blood_cell_count"
## [1] "================================================================================="
## [1] "mean:  19.2459466666667"
## [1] "median:  4.61"
## [1] "min:  0"
## [1] "max:  749.5"
## [1] "range:  749.5"
## [1] "sd:  61.626490493824"
## [1] "quantile:  0"     "quantile:  3.98"  "quantile:  4.61"  "quantile:  6.54" 
## [5] "quantile:  749.5"
## [1] "IQR:  2.56"
## [1] "---------------------------------------------------------------------------------"
## [1] "eosinophil_count"
## [1] "================================================================================="
## [1] "mean:  0.0617333333333333"
## [1] "median:  0.04"
## [1] "min:  0"
## [1] "max:  0.49"
## [1] "range:  0.49"
## [1] "sd:  0.0754199177463744"
## [1] "quantile:  0"    "quantile:  0"    "quantile:  0.04" "quantile:  0.1" 
## [5] "quantile:  0.49"
## [1] "IQR:  0.1"
## [1] "---------------------------------------------------------------------------------"
## [1] "corrected_calcium"
## [1] "================================================================================="
## [1] "mean:  2.25064"
## [1] "median:  2.4"
## [1] "min:  0"
## [1] "max:  2.79"
## [1] "range:  2.79"
## [1] "sd:  0.577511826857622"
## [1] "quantile:  0"    "quantile:  2.27" "quantile:  2.4"  "quantile:  2.47"
## [5] "quantile:  2.79"
## [1] "IQR:  0.2"
## [1] "---------------------------------------------------------------------------------"
## [1] "serum_potassium"
## [1] "================================================================================="
## [1] "mean:  4.50314666666667"
## [1] "median:  4.58"
## [1] "min:  0"
## [1] "max:  12.8"
## [1] "range:  12.8"
## [1] "sd:  1.44827657761779"
## [1] "quantile:  0"     "quantile:  4.125" "quantile:  4.58"  "quantile:  5.04" 
## [5] "quantile:  12.8" 
## [1] "IQR:  0.915"
## [1] "---------------------------------------------------------------------------------"
## [1] "glucose"
## [1] "================================================================================="
## [1] "mean:  9.37077333333333"
## [1] "median:  7.58"
## [1] "min:  0"
## [1] "max:  43.01"
## [1] "range:  43.01"
## [1] "sd:  6.40805264531818"
## [1] "quantile:  0"      "quantile:  5.685"  "quantile:  7.58"  
## [4] "quantile:  11.135" "quantile:  43.01" 
## [1] "IQR:  5.45"
## [1] "---------------------------------------------------------------------------------"
## [1] "neutrophils_count"
## [1] "================================================================================="
## [1] "mean:  8.80610666666667"
## [1] "median:  6.38"
## [1] "min:  0"
## [1] "max:  33.88"
## [1] "range:  33.88"
## [1] "sd:  7.21423001767688"
## [1] "quantile:  0"     "quantile:  3.42"  "quantile:  6.38"  "quantile:  12.33"
## [5] "quantile:  33.88"
## [1] "IQR:  8.91"
## [1] "---------------------------------------------------------------------------------"
## [1] "direct_bilirubin"
## [1] "================================================================================="
## [1] "mean:  11.2413333333333"
## [1] "median:  5.3"
## [1] "min:  0"
## [1] "max:  360.6"
## [1] "range:  360.6"
## [1] "sd:  27.7644778118371"
## [1] "quantile:  0"     "quantile:  3.5"   "quantile:  5.3"   "quantile:  8.65" 
## [5] "quantile:  360.6"
## [1] "IQR:  5.15"
## [1] "---------------------------------------------------------------------------------"
## [1] "mean_platelet_volume"
## [1] "================================================================================="
## [1] "mean:  10.3717333333333"
## [1] "median:  11"
## [1] "min:  0"
## [1] "max:  15"
## [1] "range:  15"
## [1] "sd:  3.20872822580761"
## [1] "quantile:  0"    "quantile:  10.2" "quantile:  11"   "quantile:  11.9"
## [5] "quantile:  15"  
## [1] "IQR:  1.7"
## [1] "---------------------------------------------------------------------------------"
## [1] "ferritin"
## [1] "================================================================================="
## [1] "mean:  883.942933333333"
## [1] "median:  147.3"
## [1] "min:  0"
## [1] "max:  50000"
## [1] "range:  50000"
## [1] "sd:  3052.68673704692"
## [1] "quantile:  0"      "quantile:  0"      "quantile:  147.3" 
## [4] "quantile:  881.95" "quantile:  50000" 
## [1] "IQR:  881.95"
## [1] "---------------------------------------------------------------------------------"
## [1] "rbc_distribution_width_sd"
## [1] "================================================================================="
## [1] "mean:  40.3882666666667"
## [1] "median:  41.1"
## [1] "min:  0"
## [1] "max:  113.3"
## [1] "range:  113.3"
## [1] "sd:  12.772592270079"
## [1] "quantile:  0"     "quantile:  38.55" "quantile:  41.1"  "quantile:  45.15"
## [5] "quantile:  113.3"
## [1] "IQR:  6.60000000000001"
## [1] "---------------------------------------------------------------------------------"
## [1] "thrombin_time"
## [1] "================================================================================="
## [1] "mean:  15.1773333333333"
## [1] "median:  16.5"
## [1] "min:  0"
## [1] "max:  161.9"
## [1] "range:  161.9"
## [1] "sd:  13.0846705153605"
## [1] "quantile:  0"     "quantile:  15"    "quantile:  16.5"  "quantile:  18.3" 
## [5] "quantile:  161.9"
## [1] "IQR:  3.3"
## [1] "---------------------------------------------------------------------------------"
## [1] "percent_lymphocyte"
## [1] "================================================================================="
## [1] "mean:  19.236"
## [1] "median:  17.7"
## [1] "min:  0"
## [1] "max:  60"
## [1] "range:  60"
## [1] "sd:  14.3091874304513"
## [1] "quantile:  0"     "quantile:  6.1"   "quantile:  17.7"  "quantile:  31.05"
## [5] "quantile:  60"   
## [1] "IQR:  24.95"
## [1] "---------------------------------------------------------------------------------"
## [1] "hcv_antibody_quantification"
## [1] "================================================================================="
## [1] "mean:  0.0822666666666667"
## [1] "median:  0.05"
## [1] "min:  0"
## [1] "max:  2.09"
## [1] "range:  2.09"
## [1] "sd:  0.192442597409416"
## [1] "quantile:  0"    "quantile:  0"    "quantile:  0.05" "quantile:  0.08"
## [5] "quantile:  2.09"
## [1] "IQR:  0.08"
## [1] "---------------------------------------------------------------------------------"
## [1] "d_d_dimer"
## [1] "================================================================================="
## [1] "mean:  6.78184"
## [1] "median:  1.27"
## [1] "min:  0"
## [1] "max:  60"
## [1] "range:  60"
## [1] "sd:  9.27520148827659"
## [1] "quantile:  0"     "quantile:  0.425" "quantile:  1.27"  "quantile:  19.78"
## [5] "quantile:  60"   
## [1] "IQR:  19.355"
## [1] "---------------------------------------------------------------------------------"
## [1] "total_cholesterol"
## [1] "================================================================================="
## [1] "mean:  3.79714666666667"
## [1] "median:  3.88"
## [1] "min:  0"
## [1] "max:  7.3"
## [1] "range:  7.3"
## [1] "sd:  1.32915602562809"
## [1] "quantile:  0"     "quantile:  3.155" "quantile:  3.88"  "quantile:  4.515"
## [5] "quantile:  7.3"  
## [1] "IQR:  1.36"
## [1] "---------------------------------------------------------------------------------"
## [1] "aspartate_aminotransferase"
## [1] "================================================================================="
## [1] "mean:  62.416"
## [1] "median:  33"
## [1] "min:  0"
## [1] "max:  1858"
## [1] "range:  1858"
## [1] "sd:  150.569143993697"
## [1] "quantile:  0"    "quantile:  22"   "quantile:  33"   "quantile:  53.5"
## [5] "quantile:  1858"
## [1] "IQR:  31.5"
## [1] "---------------------------------------------------------------------------------"
## [1] "uric_acid"
## [1] "================================================================================="
## [1] "mean:  308.073066666667"
## [1] "median:  286"
## [1] "min:  0"
## [1] "max:  1176"
## [1] "range:  1176"
## [1] "sd:  172.536536764377"
## [1] "quantile:  0"      "quantile:  212.2"  "quantile:  286"   
## [4] "quantile:  372.55" "quantile:  1176"  
## [1] "IQR:  160.35"
## [1] "---------------------------------------------------------------------------------"
## [1] "hco3"
## [1] "================================================================================="
## [1] "mean:  23.2208"
## [1] "median:  24.6"
## [1] "min:  0"
## [1] "max:  36.3"
## [1] "range:  36.3"
## [1] "sd:  6.75612175559844"
## [1] "quantile:  0"     "quantile:  21.8"  "quantile:  24.6"  "quantile:  27.15"
## [5] "quantile:  36.3" 
## [1] "IQR:  5.35"
## [1] "---------------------------------------------------------------------------------"
## [1] "calcium"
## [1] "================================================================================="
## [1] "mean:  2.03805333333333"
## [1] "median:  2.15"
## [1] "min:  0"
## [1] "max:  2.62"
## [1] "range:  2.62"
## [1] "sd:  0.514584300669525"
## [1] "quantile:  0"    "quantile:  2.04" "quantile:  2.15" "quantile:  2.26"
## [5] "quantile:  2.62"
## [1] "IQR:  0.22"
## [1] "---------------------------------------------------------------------------------"
## [1] "amino_terminal_brain_natriuretic_peptide_precursor_nt_pro_bnp"
## [1] "================================================================================="
## [1] "mean:  2811.68533333333"
## [1] "median:  99"
## [1] "min:  0"
## [1] "max:  70000"
## [1] "range:  70000"
## [1] "sd:  9387.95662943333"
## [1] "quantile:  0"     "quantile:  0"     "quantile:  99"    "quantile:  941"  
## [5] "quantile:  70000"
## [1] "IQR:  941"
## [1] "---------------------------------------------------------------------------------"
## [1] "lactate_dehydrogenase"
## [1] "================================================================================="
## [1] "mean:  519.592"
## [1] "median:  355"
## [1] "min:  0"
## [1] "max:  1867"
## [1] "range:  1867"
## [1] "sd:  436.03199416537"
## [1] "quantile:  0"     "quantile:  239.5" "quantile:  355"   "quantile:  661"  
## [5] "quantile:  1867" 
## [1] "IQR:  421.5"
## [1] "---------------------------------------------------------------------------------"
## [1] "platelet_large_cell_ratio"
## [1] "================================================================================="
## [1] "mean:  31.664"
## [1] "median:  32.5"
## [1] "min:  0"
## [1] "max:  62.2"
## [1] "range:  62.2"
## [1] "sd:  12.6608563463322"
## [1] "quantile:  0"     "quantile:  26"    "quantile:  32.5"  "quantile:  39.35"
## [5] "quantile:  62.2" 
## [1] "IQR:  13.35"
## [1] "---------------------------------------------------------------------------------"
## [1] "interleukin_6"
## [1] "================================================================================="
## [1] "mean:  77.4030666666667"
## [1] "median:  2.58"
## [1] "min:  0"
## [1] "max:  5000"
## [1] "range:  5000"
## [1] "sd:  460.193926173624"
## [1] "quantile:  0"    "quantile:  0"    "quantile:  2.58" "quantile:  33.8"
## [5] "quantile:  5000"
## [1] "IQR:  33.8"
## [1] "---------------------------------------------------------------------------------"
## [1] "fibrin_degradation_products"
## [1] "================================================================================="
## [1] "mean:  29.6808"
## [1] "median:  4"
## [1] "min:  0"
## [1] "max:  190.8"
## [1] "range:  190.8"
## [1] "sd:  55.7152472377118"
## [1] "quantile:  0"     "quantile:  0"     "quantile:  4"     "quantile:  7.4"  
## [5] "quantile:  190.8"
## [1] "IQR:  7.4"
## [1] "---------------------------------------------------------------------------------"
## [1] "monocytes_count"
## [1] "================================================================================="
## [1] "mean:  0.648213333333333"
## [1] "median:  0.48"
## [1] "min:  0"
## [1] "max:  39.92"
## [1] "range:  39.92"
## [1] "sd:  2.05815873720233"
## [1] "quantile:  0"     "quantile:  0.355" "quantile:  0.48"  "quantile:  0.7"  
## [5] "quantile:  39.92"
## [1] "IQR:  0.345"
## [1] "---------------------------------------------------------------------------------"
## [1] "plt_distribution_width"
## [1] "================================================================================="
## [1] "mean:  12.816"
## [1] "median:  13"
## [1] "min:  0"
## [1] "max:  25.3"
## [1] "range:  25.3"
## [1] "sd:  4.80697576889886"
## [1] "quantile:  0"     "quantile:  11.2"  "quantile:  13"    "quantile:  15.15"
## [5] "quantile:  25.3" 
## [1] "IQR:  3.95"
## [1] "---------------------------------------------------------------------------------"
## [1] "globulin"
## [1] "================================================================================="
## [1] "mean:  33.4546666666667"
## [1] "median:  34.6"
## [1] "min:  0"
## [1] "max:  50.6"
## [1] "range:  50.6"
## [1] "sd:  9.43016966028432"
## [1] "quantile:  0"    "quantile:  31.1" "quantile:  34.6" "quantile:  38.7"
## [5] "quantile:  50.6"
## [1] "IQR:  7.6"
## [1] "---------------------------------------------------------------------------------"
## [1] "glutamyl_transpeptidase"
## [1] "================================================================================="
## [1] "mean:  57.6213333333333"
## [1] "median:  37"
## [1] "min:  0"
## [1] "max:  732"
## [1] "range:  732"
## [1] "sd:  78.5797430274511"
## [1] "quantile:  0"    "quantile:  20.5" "quantile:  37"   "quantile:  61"  
## [5] "quantile:  732" 
## [1] "IQR:  40.5"
## [1] "---------------------------------------------------------------------------------"
## [1] "international_standard_ratio"
## [1] "================================================================================="
## [1] "mean:  1.28170666666667"
## [1] "median:  1.11"
## [1] "min:  0"
## [1] "max:  13.48"
## [1] "range:  13.48"
## [1] "sd:  1.03490022218339"
## [1] "quantile:  0"     "quantile:  1.02"  "quantile:  1.11"  "quantile:  1.31" 
## [5] "quantile:  13.48"
## [1] "IQR:  0.29"
## [1] "---------------------------------------------------------------------------------"
## [1] "basophil_count_number"
## [1] "================================================================================="
## [1] "mean:  0.0241333333333333"
## [1] "median:  0.02"
## [1] "min:  0"
## [1] "max:  0.12"
## [1] "range:  0.12"
## [1] "sd:  0.0205988768016656"
## [1] "quantile:  0"    "quantile:  0.01" "quantile:  0.02" "quantile:  0.03"
## [5] "quantile:  0.12"
## [1] "IQR:  0.02"
## [1] "---------------------------------------------------------------------------------"
## [1] "x2019_n_co_v_nucleic_acid_detection"
## [1] "================================================================================="
## [1] "mean:  -0.581333333333333"
## [1] "median:  -1"
## [1] "min:  -1"
## [1] "max:  0"
## [1] "range:  1"
## [1] "sd:  0.493999646380325"
## [1] "quantile:  -1" "quantile:  -1" "quantile:  -1" "quantile:  0" 
## [5] "quantile:  0" 
## [1] "IQR:  1"
## [1] "---------------------------------------------------------------------------------"
## [1] "mean_corpuscular_hemoglobin"
## [1] "================================================================================="
## [1] "mean:  29.7725333333333"
## [1] "median:  31"
## [1] "min:  0"
## [1] "max:  50.8"
## [1] "range:  50.8"
## [1] "sd:  7.56932789939007"
## [1] "quantile:  0"     "quantile:  29.75" "quantile:  31"    "quantile:  32.35"
## [5] "quantile:  50.8" 
## [1] "IQR:  2.59999999999999"
## [1] "---------------------------------------------------------------------------------"
## [1] "activation_of_partial_thromboplastin_time"
## [1] "================================================================================="
## [1] "mean:  34.4565333333333"
## [1] "median:  39.1"
## [1] "min:  0"
## [1] "max:  144"
## [1] "range:  144"
## [1] "sd:  20.189724546286"
## [1] "quantile:  0"    "quantile:  33.4" "quantile:  39.1" "quantile:  44.6"
## [5] "quantile:  144" 
## [1] "IQR:  11.2"
## [1] "---------------------------------------------------------------------------------"
## [1] "high_sensitivity_c_reactive_protein"
## [1] "================================================================================="
## [1] "mean:  86.6608"
## [1] "median:  54.1"
## [1] "min:  0"
## [1] "max:  320"
## [1] "range:  320"
## [1] "sd:  87.9859750272583"
## [1] "quantile:  0"     "quantile:  10.1"  "quantile:  54.1"  "quantile:  144.4"
## [5] "quantile:  320"  
## [1] "IQR:  134.3"
## [1] "---------------------------------------------------------------------------------"
## [1] "hiv_antibody_quantification"
## [1] "================================================================================="
## [1] "mean:  0.0732"
## [1] "median:  0.08"
## [1] "min:  0"
## [1] "max:  0.27"
## [1] "range:  0.27"
## [1] "sd:  0.0561405798498943"
## [1] "quantile:  0"    "quantile:  0"    "quantile:  0.08" "quantile:  0.1" 
## [5] "quantile:  0.27"
## [1] "IQR:  0.1"
## [1] "---------------------------------------------------------------------------------"
## [1] "serum_sodium"
## [1] "================================================================================="
## [1] "mean:  134.9216"
## [1] "median:  141.3"
## [1] "min:  0"
## [1] "max:  179.7"
## [1] "range:  179.7"
## [1] "sd:  33.7104973828152"
## [1] "quantile:  0"      "quantile:  138.7"  "quantile:  141.3" 
## [4] "quantile:  143.85" "quantile:  179.7" 
## [1] "IQR:  5.15000000000003"
## [1] "---------------------------------------------------------------------------------"
## [1] "thrombocytocrit"
## [1] "================================================================================="
## [1] "mean:  0.22928"
## [1] "median:  0.24"
## [1] "min:  0"
## [1] "max:  0.51"
## [1] "range:  0.51"
## [1] "sd:  0.116031823036696"
## [1] "quantile:  0"    "quantile:  0.16" "quantile:  0.24" "quantile:  0.3" 
## [5] "quantile:  0.51"
## [1] "IQR:  0.14"
## [1] "---------------------------------------------------------------------------------"
## [1] "esr"
## [1] "================================================================================="
## [1] "mean:  28.4506666666667"
## [1] "median:  22"
## [1] "min:  0"
## [1] "max:  110"
## [1] "range:  110"
## [1] "sd:  27.6133946674868"
## [1] "quantile:  0"   "quantile:  2"   "quantile:  22"  "quantile:  43" 
## [5] "quantile:  110"
## [1] "IQR:  41"
## [1] "---------------------------------------------------------------------------------"
## [1] "glutamic_pyruvic_transaminase"
## [1] "================================================================================="
## [1] "mean:  47.9946666666667"
## [1] "median:  29"
## [1] "min:  0"
## [1] "max:  1600"
## [1] "range:  1600"
## [1] "sd:  102.51852564206"
## [1] "quantile:  0"    "quantile:  18"   "quantile:  29"   "quantile:  48.5"
## [5] "quantile:  1600"
## [1] "IQR:  30.5"
## [1] "---------------------------------------------------------------------------------"
## [1] "e_gfr"
## [1] "================================================================================="
## [1] "mean:  85.4397333333333"
## [1] "median:  92.9"
## [1] "min:  0"
## [1] "max:  224"
## [1] "range:  224"
## [1] "sd:  34.2058778595522"
## [1] "quantile:  0"     "quantile:  72.1"  "quantile:  92.9"  "quantile:  108.4"
## [5] "quantile:  224"  
## [1] "IQR:  36.3"
## [1] "---------------------------------------------------------------------------------"
## [1] "creatinine"
## [1] "================================================================================="
## [1] "mean:  120.322666666667"
## [1] "median:  81"
## [1] "min:  0"
## [1] "max:  1497"
## [1] "range:  1497"
## [1] "sd:  160.171732347307"
## [1] "quantile:  0"    "quantile:  58"   "quantile:  81"   "quantile:  105" 
## [5] "quantile:  1497"
## [1] "IQR:  47"

EDA

Age and Gender Analysis

The data in cleaner form at this point, so analysis can be conducted. We’ll start by looking at age and gender of the Covid-19 patients.

#set temp df for analysis
df_analysis <- df_clean

df_analysis$age_factor <- cut(df_analysis$age, breaks = c(0,30,55,65,Inf),labels=c("young","middle","older", "old"))
#head(df_analysis[,c(1:10)]) %>% regulartable() %>% theme_zebra() %>%  autofit()

df_analysis$gender_factor <- cut(df_analysis$gender, breaks = c(0,1,2),labels=c("Female","Male"))
#head(df_analysis[,c(1:10)]) %>% regulartable() %>% theme_zebra() %>%  autofit()

#head(df_analysis[,c(1:4,81:82)],8)

Age Histogram

The histogram showing distribution of age across the 375 patients is very normal looking.

Hist Line Color Legend: * Mean (red) * Median (gray) * Mode (green)

require(plotly)
#require(cowplot)

getmode <- function(v) {
   uniqv <- unique(v)
   uniqv[which.max(tabulate(match(v, uniqv)))]
   }

df_analysis %>% 
  ggplot(aes(age)) +
  geom_histogram(bins = 10, boundary = 0, fill = "#FF6666", col = "black", aes(y = ..density..),  ) + 
  geom_density(alpha = 0.2, fill = "gray") +
  ggtitle("Age Histogram",
          subtitle = "Mean (grey) and Median (blue) and Mode (green)") +
  theme(axis.title = element_text(), axis.title.x = element_text()) +
  
  geom_vline(xintercept = round(mean(df_analysis$age), 2), size = 2, linetype = 3) +
  
  geom_vline(aes(xintercept=mean(df_analysis$age,na.rm='T')),color="blue", linetype=3, size=1) +
  
  geom_vline(aes(xintercept=median(df_analysis$age,na.rm='T')),  color="gray", linetype="dashed", size=1) +
  
  geom_vline(aes(xintercept=getmode(df_analysis$age)), color="green", linetype="dashed", size=1) +
  
  gbl_ggtheme -> ph1

#ph1 + theme_bw() 
ph1

ph1ly <- ggplotly(ph1) 
#ph1ly
#plot_grid(ph1,ph2, nrow = 1, ncol = 2)

Age Discretized

The age is better studied in buckets of young, middle, older, old.

#quantile(df_clean$age)
#data.table(df_clean$age)
age_df <- as.data.frame(prop.table(table(df_analysis$age_factor))) 
names(age_df)[1:2] <- c("Age","Percentage")
age_df %>% kable() %>% kable_styling()

Age	Percentage
young	0.0586667
middle	0.3040000
older	0.2346667
old	0.4026667

Age outliers

How many patients greater than or equal to 90 years old.

age_outliers <- which(df_analysis$age >= 90)

paste("There are", length(age_outliers), "patients age 90 or older.")

## [1] "There are 5 patients age 90 or older."

df_analysis[age_outliers, "age"] %>% arrange(age)

## # A tibble: 5 x 1
## # Groups:   age [5]
##     age
##   <dbl>
## 1    90
## 2    91
## 3    92
## 4    94
## 5    95

Gender Analysis

Gender isn’t significantly correlated whereas age does show significant positive correlation to the patient outcome. We have more females than males in this particular patient sample.

gender_df <- as.data.frame(prop.table(table(df_analysis$gender_factor))) 
names(gender_df)[1:2] <- c("Gender","Percentage")
gender_df %>% kable() %>% kable_styling()

Gender	Percentage
Female	0.5973333
Male	0.4026667

#ggpairs(df_analysis, columns = c(3,4,81:82), ggplot2::aes(colour=df_analysis$gender_factor)) +gbl_ggtheme
#chart.Correlation(cor_df1[,c(1,2:5)])

 ggcorr(df_analysis[c(2:3,4,81:82)], method = c("everything", "pearson")) + gbl_ggtheme

Correlation Analysis

Visual analysis of the outcome (dependent feature) with all others (independent features).

During the analysis, the following features were found to be important. They are list in order of relative importance.

lactate_dehydrogenase 100.00
monocytes_percent 83.50
international_standard_ratio 78.47
percent_lymphocyte 64.12
eosinophil_count 59.75
eosinophils_percent 50.99
neutrophils_percent 49.15
amino_terminal_brain_natriuretic_peptide_precursor_nt_pro_bnp 48.62
thrombocytocrit 46.40
hypersensitive_cardiac_troponin_i 45.83
fibrin_degradation_products 43.42
procalcitonin 41.40
albumin 40.76
prothrombin_activity 37.53
d_d_dimer 33.37
aspartate_aminotransferase 32.88
neutrophils_count 31.01
direct_bilirubin 30.13
thrombin_time 29.82
high_sensitivity_c_reactive_protein 28.38

Figure 1 - “outcome”,“percent_lymphocyte”,“lactate_dehydrogenase”,“lymphocyte_count”,“albumin”, “neutrophils_percent”,“eosinophils_percent”,“eosinophil_count”, “x2019_n_co_v_nucleic_acid_detection”,“direct_bilirubin”,“thrombin_time”, “monocytes_percent”,“international_standard_ratio”

Figure 2 - “outcome”,“amino_terminal_brain_natriuretic_peptide_precursor_nt_pro_bnp”, “thrombocytocrit”,“hypersensitive_cardiac_troponin_i”,“fibrin_degradation_products”,“procalcitonin”, “prothrombin_activity”,“d_d_dimer”,“aspartate_aminotransferase”,“neutrophils_count”, “high_sensitivity_c_reactive_protein”

#head(df_clean)
#ensures we are starting with the clean data set
df <- df_clean 
cor_df <- df[,c(2:4,7:80)]
#glimpse(cor_df)
#head(cor_df)


require(GGally)
require(plotly)
vars1 <- c("outcome","percent_lymphocyte","lactate_dehydrogenase","lymphocyte_count","albumin",
           "neutrophils_percent","eosinophils_percent","eosinophil_count","x2019_n_co_v_nucleic_acid_detection",
           "direct_bilirubin","thrombin_time","monocytes_percent","international_standard_ratio")
cor_df_vars1 <- cor_df[,c(vars1)]
ggcorr(cor_df_vars1,label = TRUE, label_size = 3,hjust = 0.75, size = 1.5,  method = c("everything", "pearson")) + gbl_ggtheme

vars1b <- c("outcome","percent_lymphocyte","lactate_dehydrogenase","x2019_n_co_v_nucleic_acid_detection",
           "direct_bilirubin","thrombin_time","monocytes_percent")
# cor_df_vars1b <- cor_df[,c(vars1b)]
# ggp2 <- ggpairs(cor_df_vars1b)  + gbl_ggtheme
# ggp2 <- ggpairs(cor_df_vars1b, ggplot2::aes(colour="#E64141")) +gbl_ggtheme
# ggp2
#ggplotly(ggp2)

vars2 <- c("outcome","amino_terminal_brain_natriuretic_peptide_precursor_nt_pro_bnp",
           "thrombocytocrit","hypersensitive_cardiac_troponin_i","fibrin_degradation_products","procalcitonin",
           "prothrombin_activity","d_d_dimer","aspartate_aminotransferase","neutrophils_count",
           "high_sensitivity_c_reactive_protein")   

#cor_df[,c(vars1)]
cor_df_vars2 <- cor_df[,c(vars2)]
#cor_df_vars2
ggcorr(cor_df_vars2, label = TRUE, label_size = 3,hjust = 0.75, size = 1.5, method = c("everything", "pearson")) + gbl_ggtheme

vars2b <-  c("outcome","thrombocytocrit","hypersensitive_cardiac_troponin_i","fibrin_degradation_products",
           "prothrombin_activity", "high_sensitivity_c_reactive_protein")
# cor_df_vars2b <- cor_df[,c(vars2b)]
# ggp2 <- ggpairs(cor_df_vars2b, ggplot2::aes(colour="#E64141")) +gbl_ggtheme
# ggp2

#set value to 3 since the dependent is in the 3rd position
#dependent_var <- c(1)
#paste("variable number ",dependent_var," is = ",names(cor_df)[3],sep="")

#cor_df1 <- cor_df_vars2[,c(dependent_var,6:17)]
#corrplot(cor(cor_df_vars2), method = "square")
#ggcorr(cor_df_vars2, method = c("everything", "pearson")) + gbl_ggtheme
#chart.Correlation(cor_df_vars2[,c(1,2:5)])
#ggpairs(cor_df_vars2[,c(1,2:5)], title="correlogram with ggpairs()") #+ gbl_ggtheme
#
# cor_df2 <- df[,c(dependent_var,18:28)]
# #corrplot(cor(cor_df2), method = "square")
# ggcorr(cor_df2, method = c("everything", "pearson")) + gbl_ggtheme
# 
# cor_df3 <- df[,c(dependent_var,29:39)]
# #corrplot(cor(cor_df3), method = "square")
# ggcorr(cor_df3, method = c("everything", "pearson")) + gbl_ggtheme
# 
# cor_df4 <- df[,c(dependent_var,40:50)]
# #corrplot(cor(cor_df4), method = "square")
# ggcorr(cor_df4, method = c("everything", "pearson")) + gbl_ggtheme
# 
# cor_df5 <- df[,c(dependent_var,51:61)]
# #corrplot(cor(cor_df5), method = "square")
# ggcorr(cor_df5, method = c("everything", "pearson")) + gbl_ggtheme
# 
# cor_df6 <- df[,c(dependent_var,62:72)]
# #corrplot(cor(cor_df6), method = "square")
# ggcorr(cor_df6, method = c("everything", "pearson")) + gbl_ggtheme
# 
# cor_df7 <- df[,c(dependent_var,73:80)]
# #corrplot(cor(cor_df7), method = "square")
# ggcorr(cor_df7, method = c("everything", "pearson")) + gbl_ggtheme


paste("")

## [1] ""

Kmeans Clustering

kmeans() function returns a list of components, including:

*cluster: A vector of integers (from 1:k) indicating the cluster to which each point is allocated

*centers: A matrix of cluster centers (cluster means)

*totss: The total sum of squares (TSS), i.e ∑(xi−x¯)2. TSS measures the total variance in the data.

*withinss: Vector of within-cluster sum of squares, one component per cluster

*tot.withinss: Total within-cluster sum of squares, i.e. sum(withinss)

*betweenss: The between-cluster sum of squares, i.e. totss−tot.withinss

*size: The number of observations in each cluster

require(NbClust)


df <- df_clean 
#df
df <- df[,c(7:80)] # 
dim(df)

## [1] 375  74

#df
#df %>% select(-patient_id,-admission_time,-discharge_time)
df_subset1 <- df #%>% select(-gender,-outcome)
#df_subset1
df <- df_subset1
head(df[,c(1:5)]) %>% regulartable() %>% theme_zebra() %>%  autofit()

hypersensitive_cardiac_troponin_i	hemoglobin	serum_chloride	prothrombin_time	procalcitonin
19.9	140	103.1	14.1	0.09
16.9	151	100.5	14.3	0.09
0.0	126	102.9	13.6	0.06
4.8	110	103.1	16.3	0.38
5.6	134	102.2	14.6	0.02
19.7	108	105.8	12.4	0.10

Cluster Build

X<- df
X_norm <- apply(X, 1, function(i) round(i/sum(i),3))
#X_norm
#str(X)

#kmeans function
kmeansCluster <- function (mtrx,vcenters,v_nstart = 50,v_iter.max = 150) {  
km.res <- kmeans(mtrx, centers = vcenters, nstart = v_nstart, iter.max = v_iter.max)


z<-fviz_cluster(km.res, data = X, geom = "point", stand = FALSE, ellipse.type = "norm",main = paste("Kmeans cluster plot vcenters=",vcenters)) + theme_minimal()

z2 <- fviz_cluster(km.res, data = mtrx, geom = "point", stand = FALSE, ellipse.type = "norm",   outlier.color = "black",main = paste("Kmeans cluster plot vcenters=",vcenters), show.clust.cent = TRUE) + theme_minimal()

s <- summary(km.res)
return(list(z,s))


s <- summary(km.res) 
return(list(z2,s))
#return(c(x))
}
 k2 <-kmeansCluster(X,2)
 k2

## [[1]]

## 
## [[2]]
##              Length Class  Mode   
## cluster      375    -none- numeric
## centers      148    -none- numeric
## totss          1    -none- numeric
## withinss       2    -none- numeric
## tot.withinss   1    -none- numeric
## betweenss      1    -none- numeric
## size           2    -none- numeric
## iter           1    -none- numeric
## ifault         1    -none- numeric

 k3 <-kmeansCluster(X,3)
 k3

## [[1]]

## Too few points to calculate an ellipse

## 
## [[2]]
##              Length Class  Mode   
## cluster      375    -none- numeric
## centers      222    -none- numeric
## totss          1    -none- numeric
## withinss       3    -none- numeric
## tot.withinss   1    -none- numeric
## betweenss      1    -none- numeric
## size           3    -none- numeric
## iter           1    -none- numeric
## ifault         1    -none- numeric

 k4 <-kmeansCluster(X,4)
 k4

## [[1]]

## Too few points to calculate an ellipse

## 
## [[2]]
##              Length Class  Mode   
## cluster      375    -none- numeric
## centers      296    -none- numeric
## totss          1    -none- numeric
## withinss       4    -none- numeric
## tot.withinss   1    -none- numeric
## betweenss      1    -none- numeric
## size           4    -none- numeric
## iter           1    -none- numeric
## ifault         1    -none- numeric

 k5 <-kmeansCluster(X,5)
 k5

## [[1]]

## Too few points to calculate an ellipse
## Too few points to calculate an ellipse

## 
## [[2]]
##              Length Class  Mode   
## cluster      375    -none- numeric
## centers      370    -none- numeric
## totss          1    -none- numeric
## withinss       5    -none- numeric
## tot.withinss   1    -none- numeric
## betweenss      1    -none- numeric
## size           5    -none- numeric
## iter           1    -none- numeric
## ifault         1    -none- numeric

 # k6 <-kmeansCluster(X,6)
 # k6

Elbow Plot Result

The plot represents the variance within the clusters. It decreases as k increases, and a bend (or “elbow”) can be seen at k = 4. This bend indicates that additional clusters beyond the fourth have little value. In the next section, we’ll classify the observations into 4 clusters.

k<-2:7 #K from 2 to 7
tries <-10 #Run the K Means algorithm n times
avg.totw.ss <-integer(length(k)) #Set up an empty vector to hold all of points
for(v in k){ # For each value of the range variable
 v.totw.ss <-integer(tries) #Set up an empty vector to hold the 100 tries
 for(i in 1:tries){
 k.temp <-kmeans(X,centers=v) #Run kmeans
 v.totw.ss[i] <-k.temp$tot.withinss#Store the total withinss
 }
 avg.totw.ss[v-1] <-mean(v.totw.ss) #Average the 100 total withinss

}

df <- as.data.frame(cbind(k,avg.totw.ss)) 
ggplot(df,aes(x=k, y=avg.totw.ss)) + geom_point() + 
  geom_line(aes(color="Elbow"))  + 
  ggtitle("Total Within SS by Various K",subtitle = "Elbow") +  theme_bw()

#plot(rng,avg.totw.ss,type="b", main="Total Within SS by Various K", ylab="Average Total Within Sum of Squares", xlab="Value of K")

Kmeans Results

Results using k=4 clustering with patient age and 74 biomarkers

set.seed(seed)

km.res <- NULL
km.res <- kmeans(X, 4, nstart = 50,iter.max = 150)
#km.res4 <- kmeans(X, 4)
#summary(km.res)
#glimpse(km.res)

z2 <- fviz_cluster(km.res, data = X, geom = "point", stand = FALSE, ellipse.type = "norm",   outlier.color = "black",main = "Kmeans cluster plot vcenters=4", show.clust.cent = TRUE) + theme_minimal()
z2

Samples from the 4 clusters are displayed below in descending order. Only 5 of the 74 biomarkers are displayed. Clusters 1 and 4 had the most observations and we know there were not that many important features out of the 74 biomarkers. This may mean clusters 3 and 2 were defined by how the important features set them apart from the rest.

hypersensitive_cardiac_troponin_i: This is the preferred biomarker for acute myocardial infarction. This helps diagnose acute coronary syndromes in those who are critically ill.
amino_terminal_brain_natriuretic_peptide_precursor_nt_pro_bnp: Natriuretic peptide tests measure the presence of brain natriuretic peptide and B-terminal pro b-type natriuretic peptides present. These are substances created by the heart. Small levels are normally found in the bloodstream. High levels of BNP, NT-proBNP indicate that the heart is failing to pump the appropriate amount of blood needed.
lactate_dehydrogenase: LDH is an enzyme released when there is tissue damage present in a patient’s body tissues. This marker measures the amount of LDH in blood or other body fluids and is vital in monitoring viral conditions or cancer.
ferritin: Ferritin is a major intracellular iron storage protein in all organisms. It binds free ions of the trace element, neutralising its toxic properties and increasing its solubility. In the soluble form, the body is able to expend iron as needed, in particular for regulation of cellular oxygen metabolism. Elevated levels of ferritin, or hyperferritinemia, indicate the presence of viruses and bacteria into the body.

Impvars <- c("amino_terminal_brain_natriuretic_peptide_precursor_nt_pro_bnp",
           "thrombocytocrit","hypersensitive_cardiac_troponin_i","fibrin_degradation_products","procalcitonin",
           "prothrombin_activity","d_d_dimer","aspartate_aminotransferase","neutrophils_count",
           "high_sensitivity_c_reactive_protein","thrombocytocrit","hypersensitive_cardiac_troponin_i","fibrin_degradation_products",
           "prothrombin_activity", "high_sensitivity_c_reactive_protein","ferritin")

#head(X[,Impvars])
dd <- cbind(X, cluster = km.res$cluster)
dd <- dd %>% select("cluster","lactate_dehydrogenase","amino_terminal_brain_natriuretic_peptide_precursor_nt_pro_bnp","hypersensitive_cardiac_troponin_i", "high_sensitivity_c_reactive_protein","ferritin")

dd.names <-names(dd) #back up original names
dfname <-dd.names
names2 <- base::abbreviate( dfname,25)
#names2 #checking abbreviations
names(dd) <- names2
#print(km.res) 
#str(km.res)
#km.res$cluster %>% regulartable() %>% theme_zebra() %>%  autofit()
#cluster centers
#km.res$centers  %>% regulartable() %>% theme_zebra() %>%  autofit()
#head(dd)
#dim(dd)

k=4 cluster dimensions:

[1] 324, 12 - Cluster one is the largest
[2] 3, 12 - Cluster two is the smallest
[3] 42, 12 - Cluster three is the second largest
[4] 6, 12 - Cluster four is very small as well

The smaller clusters have elevated markers, while the larger clusters do not. Note: long biomarker names have been abbreviated to 25 characters.

#abbreviate long biomarker names to 25 chars make analysis easier

#Create biomarker aggregate with means
agg_clus <- aggregate(X, by=list(cluster=km.res$cluster), mean)

#abbreviate long biomarker names to 25 chars make analysis easier
agg_clus.names <-names(agg_clus) #back up original names
agg_clus.dfname <- agg_clus.names
agg_clus.names2 <- base::abbreviate( agg_clus.dfname,25)
#names2 #checking abbreviations
names(agg_clus) <- agg_clus.names2

Impvars_short <- base::abbreviate( Impvars,25)
agg_clus[,Impvars_short] %>% regulartable() %>% theme_zebra() %>%  autofit()

amn_trmnl_brn_ntrrtc_____	thrombocytocrit	hypersensitiv_crdc_trpnn_	fibrin_degradation_prdcts	procalcitonin	prothrombin_activity	d_d_dimer	aspartate_aminotransferas	neutrophils_count	high_senstvty_c_rctv_prtn	thrombocytocrit.1	hypersensitiv_crdc_trpnn_.1	fibrin_degradation_prdcts.1	prothrombin_activity.1	high_senstvty_c_rctv_prtn.1	ferritin
35716.3333	0.13666667	50000.0000	150.00000	0.3700000	74.33333	21.000000	228.6667	19.076667	130.90000	0.13666667	50000.0000	150.00000	74.33333	130.90000	6594.5333
10561.6429	0.17000000	2423.6452	84.70000	5.4209524	71.42857	15.254762	128.8571	15.698333	183.77619	0.17000000	2423.6452	84.70000	71.42857	183.77619	2145.2833
62566.0000	0.09166667	4629.9000	53.90000	4.4066667	70.66667	12.835000	377.3333	14.578333	100.15000	0.09166667	4629.9000	53.90000	70.66667	100.15000	5100.3000
395.8272	0.24037037	164.3849	20.98611	0.4383951	83.32099	5.439753	46.4321	7.710679	73.41235	0.24037037	164.3849	20.98611	83.32099	73.41235	589.4793

clus_id1 <- dd %>% filter(cluster == 1) # %>% head(3) #%>% regulartable() %>% theme_zebra() %>%  autofit()
head(clus_id1[,c(1:6)]) %>% kable() %>% kable_styling()

cluster	lactate_dehydrogenase	amn_trmnl_brn_ntrrtc_____	hypersensitiv_crdc_trpnn_	high_senstvty_c_rctv_prtn	ferritin
1	1867	20906	50000	113.1	1935.6
1	1500	70000	50000	87.3	0.0
1	785	16243	50000	192.3	17848.0

clus_id2 <- dd %>% filter(cluster == 2) #%>% head(3) #%>% regulartable() %>% theme_zebra() %>%  autofit()
head(clus_id2[,c(1:6)]) %>% kable() %>% kable_styling()

cluster	lactate_dehydrogenase	amn_trmnl_brn_ntrrtc_____	hypersensitiv_crdc_trpnn_	high_senstvty_c_rctv_prtn	ferritin
2	208	11017	0.0	24.3	0.0
2	187	13006	124.0	2.1	0.0
2	310	22525	617.0	160.2	545.7
2	1772	9310	46.3	274.1	2578.7
2	890	7447	304.2	86.1	2661.8
2	492	11534	1317.8	196.9	0.0

clus_id3 <-dd %>% filter(cluster == 3) #%>% head(3) #%>% regulartable() %>% theme_zebra() %>%  autofit()
head(clus_id3[,c(1:6)]) %>% kable() %>% kable_styling()

cluster	lactate_dehydrogenase	amn_trmnl_brn_ntrrtc_____	hypersensitiv_crdc_trpnn_	high_senstvty_c_rctv_prtn	ferritin
3	284	70000	85.5	81.7	0.0
3	1752	40096	2.6	84.6	15718.0
3	1867	70000	523.1	36.7	0.0
3	1126	55300	3404.7	138.8	4995.6
3	1867	70000	21953.4	259.1	9888.2
3	702	70000	1810.1	0.0	0.0

clus_id4 <-dd %>% filter(cluster == 4) # %>% head(3) #%>% regulartable() %>% theme_zebra() %>%  autofit()
head(clus_id4[,c(1:6)]) %>% kable() %>% kable_styling()

cluster	lactate_dehydrogenase	amn_trmnl_brn_ntrrtc_____	hypersensitiv_crdc_trpnn_	high_senstvty_c_rctv_prtn	ferritin
4	306	60	19.9	43.1	634.9
4	738	173	16.9	27.4	1512.9
4	328	0	0.0	42.3	567.2
4	338	152	4.8	108.2	0.0
4	195	5	5.6	7.0	121.1
4	268	645	19.7	29.3	1124.6

Boxplot summary of most important biomarkers.

# kmeansplot(dd,"lactate_dehydrogenase")
ggplot(data=dd) +
geom_boxplot(aes(x=cluster,y=lactate_dehydrogenase,group=cluster))+
facet_grid(cluster ~ ., scales = "free", space = "free")+ gbl_ggtheme + coord_flip()

# kmeansplot(dd,"amn_trmnl_brn_ntrrtc_____")
ggplot(data=dd) +
geom_boxplot(aes(x=cluster,y=amn_trmnl_brn_ntrrtc_____,group=cluster))+
facet_grid(cluster ~ ., scales = "free", space = "free")+ gbl_ggtheme + coord_flip()

# kmeansplot(dd,"hypersensitiv_crdc_trpnn_")
ggplot(data=dd) +
geom_boxplot(aes(x=cluster,y=hypersensitiv_crdc_trpnn_,group=cluster))+
facet_grid(cluster ~ ., scales = "free", space = "free")+ gbl_ggtheme + coord_flip()

# kmeansplot(dd,"high_senstvty_c_rctv_prtn")
ggplot(data=dd) +
geom_boxplot(aes(x=cluster,y=high_senstvty_c_rctv_prtn,group=cluster))+
facet_grid(cluster ~ ., scales = "free", space = "free")+ gbl_ggtheme + coord_flip()

# kmeansplot(dd,"ferritin")
ggplot(data=dd) +
geom_boxplot(aes(x=cluster,y=ferritin,group=cluster))+
facet_grid(cluster ~ ., scales = "free", space = "free")+ gbl_ggtheme + coord_flip()

Quantiles statistics for most important biomarkers.

df_stat <- dd
dim(df_stat)

## [1] 375   6

dfIntCols <- df_stat %>% select_if(., is.numeric)
names(dfIntCols)

## [1] "cluster"                   "lactate_dehydrogenase"    
## [3] "amn_trmnl_brn_ntrrtc_____" "hypersensitiv_crdc_trpnn_"
## [5] "high_senstvty_c_rctv_prtn" "ferritin"

print('===============================================================')

## [1] "==============================================================="

print('===============================================================')

## [1] "==============================================================="

statInfo <- function(a) {
  for (i in colnames(a)){
  paste('Column: ', colnames(df_stat[i]), sep=',')
  print('---------------------------------------------------------------')
  print(colnames(df_stat[i]))
  print('---------------------------------------------------------------')
  printVecInfo(df_stat[[i]])
   }
}   

statInfo(dfIntCols)

## [1] "---------------------------------------------------------------"
## [1] "cluster"
## [1] "---------------------------------------------------------------"
## [1] "mean:  3.736"
## [1] "median:  4"
## [1] "min:  1"
## [1] "max:  4"
## [1] "range:  3"
## [1] "sd:  0.68377686575303"
## [1] "quantile:  1" "quantile:  4" "quantile:  4" "quantile:  4" "quantile:  4"
## [1] "IQR:  0"
## [1] "---------------------------------------------------------------"
## [1] "lactate_dehydrogenase"
## [1] "---------------------------------------------------------------"
## [1] "mean:  519.592"
## [1] "median:  355"
## [1] "min:  0"
## [1] "max:  1867"
## [1] "range:  1867"
## [1] "sd:  436.03199416537"
## [1] "quantile:  0"     "quantile:  239.5" "quantile:  355"   "quantile:  661"  
## [5] "quantile:  1867" 
## [1] "IQR:  421.5"
## [1] "---------------------------------------------------------------"
## [1] "amn_trmnl_brn_ntrrtc_____"
## [1] "---------------------------------------------------------------"
## [1] "mean:  2811.68533333333"
## [1] "median:  99"
## [1] "min:  0"
## [1] "max:  70000"
## [1] "range:  70000"
## [1] "sd:  9387.95662943333"
## [1] "quantile:  0"     "quantile:  0"     "quantile:  99"    "quantile:  941"  
## [5] "quantile:  70000"
## [1] "IQR:  941"
## [1] "---------------------------------------------------------------"
## [1] "hypersensitiv_crdc_trpnn_"
## [1] "---------------------------------------------------------------"
## [1] "mean:  887.5552"
## [1] "median:  4.8"
## [1] "min:  0"
## [1] "max:  50000"
## [1] "range:  50000"
## [1] "sd:  4933.42370112608"
## [1] "quantile:  0"     "quantile:  1.9"   "quantile:  4.8"   "quantile:  47.2" 
## [5] "quantile:  50000"
## [1] "IQR:  45.3"
## [1] "---------------------------------------------------------------"
## [1] "high_senstvty_c_rctv_prtn"
## [1] "---------------------------------------------------------------"
## [1] "mean:  86.6608"
## [1] "median:  54.1"
## [1] "min:  0"
## [1] "max:  320"
## [1] "range:  320"
## [1] "sd:  87.9859750272583"
## [1] "quantile:  0"     "quantile:  10.1"  "quantile:  54.1"  "quantile:  144.4"
## [5] "quantile:  320"  
## [1] "IQR:  134.3"
## [1] "---------------------------------------------------------------"
## [1] "ferritin"
## [1] "---------------------------------------------------------------"
## [1] "mean:  883.942933333333"
## [1] "median:  147.3"
## [1] "min:  0"
## [1] "max:  50000"
## [1] "range:  50000"
## [1] "sd:  3052.68673704692"
## [1] "quantile:  0"      "quantile:  0"      "quantile:  147.3" 
## [4] "quantile:  881.95" "quantile:  50000" 
## [1] "IQR:  881.95"

#dd[,c(76,1:10)] %>% arrange(cluster) %>% head() %>% kable() %>% kable_styling()
# dd[,c(76,11:20)] %>% arrange(cluster) %>% head() %>% kable() %>% kable_styling()
# dd[,c(76,21:30)] %>% arrange(cluster) %>% head() %>% kable() %>% kable_styling()
# dd[,c(76,31:40)] %>% arrange(cluster) %>% head() %>% kable() %>% kable_styling()
# dd[,c(76,41:50)] %>% arrange(cluster) %>% head() %>% kable() %>% kable_styling()
# dd[,c(76,51:60)] %>% arrange(cluster) %>% head() %>% kable() %>% kable_styling()
# dd[,c(76,61:70)] %>% arrange(cluster) %>% head() %>% kable() %>% kable_styling()
# dd[,c(76,71:74)] %>% arrange(cluster) %>% head() %>% kable() %>% kable_styling()

Principal Component Analysis

Principal Component Analysis (PCA) is a multivariate technique that allows us to summarize the systematic patterns of variations in the data.

From a data analysis standpoint, PCA is used for studying one table of observations and variables with the main idea of transforming the observed variables into a set of new variables, the principal components, which are uncorrelated and explain the variation in the data. For this reason, PCA allows us to reduce a “complex” data set to a lower dimension in order to reveal the structures or the dominant types of variations in both the observations and the variables. Note: starts by setting with clean data set.

Data Prep

df <- df_clean 

df <- df[,c(4,7:80)] # %>% select(-patient_id,-admission_time,-discharge_time)

#rename outcome to Survived
names(df)[1] <- c("Survived")

df$Survived <- factor(df$Survived)

#Discretize Survived
df$Survived=dplyr::recode(df$Survived, `0`="Deceased", `1`="Survived")
colnames(df) <- make.names(colnames(df))

head(df[,1:5]) %>% kable %>% kable_styling()

Survived	hypersensitive_cardiac_troponin_i	hemoglobin	serum_chloride	prothrombin_time
Deceased	19.9	140	103.1	14.1
Deceased	16.9	151	100.5	14.3
Deceased	0.0	126	102.9	13.6
Deceased	4.8	110	103.1	16.3
Deceased	5.6	134	102.2	14.6
Deceased	19.7	108	105.8	12.4

There are several methods for doing PCA using R, but we’re going use princomp method, since its in base R stats.

Typical PCA results should consist of the following:

A set of eigenvalues
A table with the scores or Principal Components (PCs)
A table of loadings (or correlations between variables and PCs).

The eigenvalues provide information of the variability in the data. The scores provide information about the structure of the observations. The loadings (or correlations) allow you to get a sense of the relationships between variables, as well as their associations with the extracted PCs.

features<-df[,-1] #remove the label
pca<-princomp(features)
std_dev <- pca[1:74]$sdev
pr_var <- std_dev^2
prop_varex <- pr_var/sum(pr_var)
res.pca <- pca

#pca

Plots

As seen in the plot below, nearly first 15 components explains most variation in the data. The second plot indicates 15-20 components might be adequate for the modeling data sets.

require(factoextra)
require(plotly)
plot(cumsum(prop_varex[1:74]), xlab = "Principal Component",
     ylab = "Cumulative Proportion of Variance Explained",
     type = "b",
     col = 4)

plot(cumsum(prop_varex[1:20]), xlab = "Principal Component",
     ylab = "Cumulative Proportion of Variance Explained",
     type = "b",
     col = 4)

The scree plot shows the first component is explaining 76.3% of the variances. The line drops of quickly from there, which bolsters the case that we can use far less than 74 features to get a good prediction.

require(factoextra)
require(plotly)
res.pca <- pca
#fplt <- fviz_eig(res.pca, addlabels = TRUE)
fplt <- fviz_screeplot(res.pca, addlabels = TRUE)
#ggplotly(fplt)
fplt

Contribution Analysis

Since the majority of the variance are explained by the first three components a contribution analysis is performed on them here.

Top contributors:

hypersensitive_cardiac_troponin_i: This is the preferred biomarker for acute myocardial infarction. This helps diagnose acute coronary syndromes in those who are critically ill.
amino_terminal_brain_natriuretic_peptide_precursor_nt_pro_bnp: Natriuretic peptide tests measure the presence of brain natriuretic peptide and B-terminal pro b-type natriuretic peptides present. These are substances created by the heart. Small levels are normally found in the bloodstream. High levels of BNP, NT-proBNP indicate that the heart is failing to pump the appropriate amount of blood needed.
lactate_dehydrogenase: LDH is an enzyme released when there is tissue damage present in a patient’s body tissues. This marker measures the amount of LDH in blood or other body fluids and is vital in monitoring viral conditions or cancer.
ferritin: Ferritin is a major intracellular iron storage protein in all organisms. It binds free ions of the trace element, neutralising its toxic properties and increasing its solubility. In the soluble form, the body is able to expend iron as needed, in particular for regulation of cellular oxygen metabolism. Elevated levels of ferritin, or hyperferritinemia, indicate the presence of viruses and bacteria into the body.

fviz_pca_var(res.pca, col.var="contrib",
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = FALSE # Avoid text overlapping
             )

# Contributions of variables to PC1
fviz_contrib(res.pca, choice = "var", axes = 1, top = 3)

# Contributions of variables to PC2
fviz_contrib(res.pca, choice = "var", axes = 2, top = 3)

# Contributions of variables to PC3
fviz_contrib(res.pca, choice = "var", axes = 3, top = 3)

Caret Machine Learning

Classification Models

Several classification models will be leveraged and compared while using the Covid-19, patient, biomarker data. Classification in this case predicts a binary outcome of case fatality or or survival. The models will be trained in supervised mode using the dependent variable we will call label, which holds values 0 for fatal and 1 for survived. The models will then be tested to determine if they can classify case fatality and survival without having the benefit of the labels.

The list of classification models to be used can be viewed below here:

Decision Tree(rpart)
Random Forest(rf)
Naive Bayes(nb)
K Nearest Neighbors(knn)
Support Vector Machines(svm)
Gradient Boosting Machines(gbm)
Classification Based On Association Rules (cba)

The models will each be described further as they are used.

Checking for Imbalance

Checking the dependent variable “Survived” to ensure imbalance will not be a problem. The output indicates there is good balance of fatalities versus survived outcomes in our data set.

#summary(df$Survived)

survived_df <- as.data.frame(prop.table(table(df$Survived))) 
names(survived_df)[1:2] <- c("Survived","Percentage")
survived_df

##   Survived Percentage
## 1 Deceased      0.536
## 2 Survived      0.464

Data Splitting

Using the first 20 components for the model splitting after determining that gets the best predictions without loosing predictive quality using all 74 features.

nDim <- 21

set.seed(seed)
new_df<-data.frame(label = df[, "Survived"], pca$scores)
df2 <- new_df
dim(df2)

## [1] 375  75

pca_df<- new_df[,1:nDim]
samp_size <-floor(0.80* nrow(pca_df))
train_ind <-sample(seq_len(nrow(pca_df)), size = samp_size)
train <- pca_df[train_ind,]
dim(train)

## [1] 300  21

test  <-pca_df[-train_ind,]
dim(test)

## [1] 75 21

names(pca_df)[1] <- c("label")
names(df)[1] <- c("label")

The data will be split into 80/20 for training and test sets.

##   dimension_of_training_dataset dimension_of_testing_dataset
## 1                           300                           75
## 2                            21                           21

## [1] 300  21

## [1] 75 21

## [1] 75 21

# #view the data sets
# head(train)
# head(test)
# head(hold)

Model Controls

Model controls will consist of cross validation set to 5-fold, and fit metric set to “Accuracy.”

## [conflicted] Removing existing preference

## [conflicted] Will prefer caret::confusionMatrix over any other package

## Loading required package: foreach

## 
## Attaching package: 'foreach'

## The following objects are masked from 'package:purrr':
## 
##     accumulate, when

## Loading required package: iterators

## Loading required package: parallel

## Loading required package: xts

## Loading required package: zoo

## 
## Attaching package: 'zoo'

## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

## 
## Attaching package: 'xts'

## The following object is masked from 'package:leaflet':
## 
##     addLegend

## The following objects are masked from 'package:data.table':
## 
##     first, last

## The following objects are masked from 'package:dplyr':
## 
##     first, last

Model Tuning

Reviewing the models and their different parameters which can be tuned. Caret will be used with method model training controls to run algorithms u5-fold cross validation and various parameter tuning. The most important features are printed after each model is trained for additional information.

x1<- modelLookup(model='rpart')
x2<-modelLookup(model='rf')
x3<-modelLookup(model='nb')
x4<-modelLookup(model='knn')
x5<-modelLookup(model='svmRadial')
x6<-modelLookup(model='gbm')

mlist <-rbind(x1,x2,x3,x4,x5,x6)
mlist

##        model         parameter                         label forReg forClass
## 1      rpart                cp          Complexity Parameter   TRUE     TRUE
## 2         rf              mtry #Randomly Selected Predictors   TRUE     TRUE
## 3         nb                fL            Laplace Correction  FALSE     TRUE
## 4         nb         usekernel             Distribution Type  FALSE     TRUE
## 5         nb            adjust          Bandwidth Adjustment  FALSE     TRUE
## 6        knn                 k                    #Neighbors   TRUE     TRUE
## 7  svmRadial             sigma                         Sigma   TRUE     TRUE
## 8  svmRadial                 C                          Cost   TRUE     TRUE
## 9        gbm           n.trees         # Boosting Iterations   TRUE     TRUE
## 10       gbm interaction.depth                Max Tree Depth   TRUE     TRUE
## 11       gbm         shrinkage                     Shrinkage   TRUE     TRUE
## 12       gbm    n.minobsinnode       Min. Terminal Node Size   TRUE     TRUE
##    probModel
## 1       TRUE
## 2       TRUE
## 3       TRUE
## 4       TRUE
## 5       TRUE
## 6       TRUE
## 7       TRUE
## 8       TRUE
## 9       TRUE
## 10      TRUE
## 11      TRUE
## 12      TRUE

Model Build

Running a similar cross validation experiment using decision trees to compare the results We are now ready to train and test using classifiers. Below we use a few different decision tree models with different parameters and prunings to get varied results.

Caret will be used with method used for model training controls to run algorithms 5-fold cross validation and various parameter tuning.

Note: The most important features are printed after each model is trained for additional information.

Decision Tree

Decision Tree - A tree in which each internal (non-leaf) node is labeled with an input feature. The arcs coming from a node labeled with an input feature are labeled with each of the possible values of the target feature or the arc leads to a subordinate decision node on a different input feature.

library(rpart)

#Ten Fold CV used with maincontrol2  
control <- maincontrol
#Train Tree Model 2 CV
ptm <- proc.time()
fit.tree <- train(label~ ., data = train, method="rpart", metric=fitmetric, trControl=control,control=rpart.control(cp = 0.4))
fit.tree

## CART 
## 
## 300 samples
##  20 predictor
##   2 classes: 'Deceased', 'Survived' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 240, 240, 241, 239, 240 
## Resampling results across tuning parameters:
## 
##   cp          Accuracy   Kappa    
##   0.04347826  0.8397610  0.6815205
##   0.13768116  0.7701436  0.5319054
##   0.62318841  0.7159063  0.4094905
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.04347826.

proc.time() - ptm

##    user  system elapsed 
##    1.68    0.03    1.89

#fancy tree plots
suppressMessages(library(rattle))
fancyRpartPlot(fit.tree$finalModel)

#summary(fit.tree2$finalModel)

#important variables
treeImp <- varImp(fit.tree)
#treeImp
ggplot(treeImp) + gbl_ggtheme

Gradient Boosting Machine

Gradient Boosting Machines - Gradient boosting is a machine learning technique for regression and classification problems, which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees.

require(gbm)

## Loading required package: gbm

## Loaded gbm 2.1.5

#Train Random Forest CV
control <- maincontrol

gbmGrid <-  expand.grid(interaction.depth = c(1, 5, 9), 
                        n.trees = (1:30)*50, 
                        shrinkage = 0.1,
                        n.minobsinnode = 20)
                        
nrow(gbmGrid)

## [1] 90

ptm <- proc.time()
fit.gbm <- train(label ~ ., data = train, 
                 method = 'gbm', 
                 tuneGrid = gbmGrid,
                 trControl = control,
                 ## This last option is actually one
                 ## for gbm() that passes through
                 verbose = FALSE)
fit.gbm

## Stochastic Gradient Boosting 
## 
## 300 samples
##  20 predictor
##   2 classes: 'Deceased', 'Survived' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 241, 239, 241, 240, 239 
## Resampling results across tuning parameters:
## 
##   interaction.depth  n.trees  Accuracy   Kappa    
##   1                    50     0.8869760  0.7722846
##   1                   100     0.8934204  0.7850006
##   1                   150     0.9133130  0.8257010
##   1                   200     0.9000871  0.7982149
##   1                   250     0.9065351  0.8118068
##   1                   300     0.9132583  0.8254766
##   1                   350     0.9068667  0.8121216
##   1                   400     0.9232602  0.8457358
##   1                   450     0.9034787  0.8054518
##   1                   500     0.9133148  0.8258101
##   1                   550     0.9099250  0.8189050
##   1                   600     0.9165916  0.8323539
##   1                   650     0.9232602  0.8456811
##   1                   700     0.9232055  0.8455723
##   1                   750     0.9297629  0.8588103
##   1                   800     0.9133695  0.8253909
##   1                   850     0.9232055  0.8455723
##   1                   900     0.9133695  0.8253909
##   1                   950     0.9199268  0.8389179
##   1                  1000     0.9068121  0.8118238
##   1                  1050     0.9068121  0.8118238
##   1                  1100     0.9166481  0.8322628
##   1                  1150     0.9098685  0.8186061
##   1                  1200     0.8966426  0.7911592
##   1                  1250     0.9067009  0.8116982
##   1                  1300     0.9164259  0.8316007
##   1                  1350     0.9034222  0.8048151
##   1                  1400     0.9165370  0.8319092
##   1                  1450     0.9067009  0.8116171
##   1                  1500     0.9165370  0.8319092
##   5                    50     0.9103130  0.8201424
##   5                   100     0.9004770  0.7997087
##   5                   150     0.9136464  0.8264013
##   5                   200     0.9234824  0.8462389
##   5                   250     0.9202056  0.8399207
##   5                   300     0.9103695  0.8189685
##   5                   350     0.9168723  0.8331420
##   5                   400     0.9167611  0.8331009
##   5                   450     0.9069251  0.8125138
##   5                   500     0.9167611  0.8332795
##   5                   550     0.9069251  0.8128074
##   5                   600     0.9036464  0.8059687
##   5                   650     0.9133713  0.8261328
##   5                   700     0.9036464  0.8056365
##   5                   750     0.9167611  0.8331009
##   5                   800     0.9100926  0.8194051
##   5                   850     0.9002566  0.7986684
##   5                   900     0.9002566  0.7986684
##   5                   950     0.9100926  0.8194051
##   5                  1000     0.9035352  0.8055457
##   5                  1050     0.9002566  0.7986684
##   5                  1100     0.9100926  0.8194051
##   5                  1150     0.9035352  0.8055457
##   5                  1200     0.9133713  0.8261328
##   5                  1250     0.9035352  0.8055457
##   5                  1300     0.9035352  0.8055457
##   5                  1350     0.9133713  0.8261328
##   5                  1400     0.9001454  0.7986578
##   5                  1450     0.9099815  0.8192450
##   5                  1500     0.9133148  0.8259852
##   9                    50     0.9002001  0.7997006
##   9                   100     0.9102019  0.8192672
##   9                   150     0.9333750  0.8665816
##   9                   200     0.9202603  0.8396702
##   9                   250     0.9235389  0.8460798
##   9                   300     0.9268176  0.8530253
##   9                   350     0.9137029  0.8260286
##   9                   400     0.9268176  0.8530253
##   9                   450     0.9234843  0.8464329
##   9                   500     0.9136482  0.8258457
##   9                   550     0.9103695  0.8189685
##   9                   600     0.9069797  0.8121997
##   9                   650     0.9168158  0.8332545
##   9                   700     0.9069797  0.8121997
##   9                   750     0.9102584  0.8190770
##   9                   800     0.9167611  0.8331009
##   9                   850     0.9167611  0.8331009
##   9                   900     0.9167611  0.8331009
##   9                   950     0.9134824  0.8263732
##   9                  1000     0.9133713  0.8258490
##   9                  1050     0.9035352  0.8051123
##   9                  1100     0.9002566  0.7984271
##   9                  1150     0.8968667  0.7910900
##   9                  1200     0.9133713  0.8256077
##   9                  1250     0.9067028  0.8123286
##   9                  1300     0.9036464  0.8053364
##   9                  1350     0.9069251  0.8118805
##   9                  1400     0.9001454  0.7981035
##   9                  1450     0.9100926  0.8190961
##   9                  1500     0.9100926  0.8190961
## 
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 20
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 150, interaction.depth =
##  9, shrinkage = 0.1 and n.minobsinnode = 20.

proc.time() - ptm

##    user  system elapsed 
##   16.97    0.09   18.26

#fancy tree plots
suppressMessages(library(rattle))
plot(fit.gbm$finalModel)

#summary(fit.rf$$finalModel)

gbmImp <- varImp(fit.gbm)
#rfImp
#plot(rfImp)
ggplot(gbmImp) + gbl_ggtheme

#summary(fit.rf$finalModel)

Random Forest

Random Forest - An ensemble classifier that consists of many decision trees and outputs the class that is the mode of the classes output by individual trees.

#Train Random Forest CV
control <- maincontrol
ptm <- proc.time()
fit.rf <- train(label~ ., data = train, method="rf", metric=fitmetric, ntree = 35, trControl=control)
fit.rf

## Random Forest 
## 
## 300 samples
##  20 predictor
##   2 classes: 'Deceased', 'Survived' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 240, 240, 240, 240, 240 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.8966667  0.7939077
##   11    0.9166667  0.8328082
##   20    0.9000000  0.7994747
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 11.

proc.time() - ptm

##    user  system elapsed 
##    1.59    0.02    1.75

#fancy tree plots
suppressMessages(library(rattle))
plot(fit.rf$finalModel)

#summary(fit.rf$$finalModel)

rfImp <- varImp(fit.rf)
#rfImp
#plot(rfImp)
ggplot(rfImp) + gbl_ggtheme

#summary(fit.rf$finalModel)

Support Vector Machines

Support Vector Machines (radial) - A discriminative classifier formally defined by a separating hyperplane. The algorithm creates a line or a hyperplane, which separates the data into classes.

#Training Support Vector Machine As a Challenger Model to Decision Tree.
control <- maincontrol
#control <- trainControl(method = "repeatedcv", number = 5,repeats=2,verbose = FALSE)
#methods "svmRadial" "svmLinear"
#
#
SVMgrid <- expand.grid(C= c(.5,1,2,5), sigma = c(0.01)) #C = 0.5, sigma =0.0001253863
ptm <- proc.time()
#fit.svm <- train(Survived~ ., data = train, method="svmLinear", metric=fitmetric, trControl=control,preProcess = #c("center","scale"),tuneLength = 10)
fit.svm <- train(label~ ., data = train, method="svmRadial", metric=fitmetric, trControl=control,preProcess = c("center", "scale"),tuneLength = 10)
fit.svm

## Support Vector Machines with Radial Basis Function Kernel 
## 
## 300 samples
##  20 predictor
##   2 classes: 'Deceased', 'Survived' 
## 
## Pre-processing: centered (20), scaled (20) 
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 240, 241, 240, 239, 240 
## Resampling results across tuning parameters:
## 
##   C       Accuracy   Kappa    
##     0.25  0.8764351  0.7533862
##     0.50  0.8763786  0.7532587
##     1.00  0.8730453  0.7462505
##     2.00  0.8697666  0.7398198
##     4.00  0.8665444  0.7335186
##     8.00  0.8598759  0.7202493
##    16.00  0.8631546  0.7274778
##    32.00  0.8664879  0.7336732
##    64.00  0.8664879  0.7339260
##   128.00  0.8664879  0.7339260
## 
## Tuning parameter 'sigma' was held constant at a value of 0.1594806
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 0.1594806 and C = 0.25.

proc.time() - ptm

##    user  system elapsed 
##    7.87    0.08    8.39

#print(fit.svm)
#plot(fit.svm$finalModel)
#summary(fit.svm$finalModel)
svmImp <- varImp(fit.svm)
#svmImp
#plot(svmImp)
ggplot(svmImp) + gbl_ggtheme

Naive Bayes

Naive Bayes - An algorithm that uses Bayes’ theorem to classify objects. Naive Bayes classifiers assume strong, or naive, independence between attributes of data points.

control <- maincontrol
#install.packages("naivebayes")
library(naivebayes)
library(e1071) #naiveBayes

# # Define tuning grid
nb_grid <- expand.grid(usekernel = c(TRUE, FALSE),laplace = c(0, 0.5, 1), adjust = c(0.75, 1, 1.25, 1.5))

ptm <- proc.time()
fit.nb <- train(label~ ., data = train, method="naive_bayes", tuneGrid = nb_grid, trControl=control)
fit.nb

## Naive Bayes 
## 
## 300 samples
##  20 predictor
##   2 classes: 'Deceased', 'Survived' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 241, 240, 240, 239, 240 
## Resampling results across tuning parameters:
## 
##   usekernel  laplace  adjust  Accuracy   Kappa    
##   FALSE      0.0      0.75    0.7296934  0.4377316
##   FALSE      0.0      1.00    0.7296934  0.4377316
##   FALSE      0.0      1.25    0.7296934  0.4377316
##   FALSE      0.0      1.50    0.7296934  0.4377316
##   FALSE      0.5      0.75    0.7296934  0.4377316
##   FALSE      0.5      1.00    0.7296934  0.4377316
##   FALSE      0.5      1.25    0.7296934  0.4377316
##   FALSE      0.5      1.50    0.7296934  0.4377316
##   FALSE      1.0      0.75    0.7296934  0.4377316
##   FALSE      1.0      1.00    0.7296934  0.4377316
##   FALSE      1.0      1.25    0.7296934  0.4377316
##   FALSE      1.0      1.50    0.7296934  0.4377316
##    TRUE      0.0      0.75    0.8433769  0.6817564
##    TRUE      0.0      1.00    0.8366518  0.6682333
##    TRUE      0.0      1.25    0.8265370  0.6458675
##    TRUE      0.0      1.50    0.8298138  0.6519463
##    TRUE      0.5      0.75    0.8433769  0.6817564
##    TRUE      0.5      1.00    0.8366518  0.6682333
##    TRUE      0.5      1.25    0.8265370  0.6458675
##    TRUE      0.5      1.50    0.8298138  0.6519463
##    TRUE      1.0      0.75    0.8433769  0.6817564
##    TRUE      1.0      1.00    0.8366518  0.6682333
##    TRUE      1.0      1.25    0.8265370  0.6458675
##    TRUE      1.0      1.50    0.8298138  0.6519463
## 
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were laplace = 0, usekernel = TRUE
##  and adjust = 0.75.

proc.time() - ptm

##    user  system elapsed 
##    6.96    0.09    7.26

#fit.nb$bestTune
nbImp <- varImp(fit.nb)
#nbImp
#plot(nbImp)
ggplot(nbImp) + gbl_ggtheme

K Nearest Neighbor

K Nearest Neighbors - KNN has been used in statistical estimation and pattern recognition already in the beginning of 1970’s as a non-parametric technique.It estimates how likely a data point is to be a member of one group or the other depending on what group in which the data points nearest to it.The k-nearest-neighbor is an example of a “lazy learner” algorithm, meaning that it does not build a model using the training set until a query of the data set is performed.

control <- maincontrol
# # Define tuning grid
knn_grid <- expand.grid(k= c(2,3,5,7,9))

ptm <- proc.time()
fit.knn <- train(label~ ., data = train, method="knn", tuneGrid = knn_grid, trControl=control)
fit.knn

## k-Nearest Neighbors 
## 
## 300 samples
##  20 predictor
##   2 classes: 'Deceased', 'Survived' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 241, 240, 240, 240, 239 
## Resampling results across tuning parameters:
## 
##   k  Accuracy   Kappa    
##   2  0.8667611  0.7312582
##   3  0.8700417  0.7370683
##   5  0.8667083  0.7305085
##   7  0.8534278  0.7028591
##   9  0.8568723  0.7097726
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 3.

proc.time() - ptm

##    user  system elapsed 
##    1.52    0.02    1.59

fit.knn$bestTune

##   k
## 2 3

#plot(fit.knn$finalModel)
knnImp <- varImp(fit.knn)
#nbImp
#plot(knnImp)
ggplot(knnImp) +gbl_ggtheme

Model Fit Results

The graph below shows two evaluation metrics for the models.

Accuracy is the percentage of correctly classifies instances out of all instances. It is more useful on a binary classification than multi-class classification problems because it can be less clear exactly how the accuracy breaks down across those classes (e.g. you need to go deeper with a confusion matrix). Learn more about Accuracy here.
Kappa or Cohen’s Kappa is like classification accuracy, except that it is normalized at the baseline of random chance on your data set. It is a more useful measure to use on problems that have an imbalance in the classes (e.g. 70-30 split for classes 0 and 1 and you can achieve 70% accuracy by predicting all instances are for class 0). Learn more about Kappa here.

The gradient algorithm denoted by gbm is the better model according to these fit metrics, but all models performed reasonably well on the training data set.

## 
## Call:
## summary.resamples(object = results)
## 
## Models: gbm, rf, tree, nb, svm, knn 
## Number of resamples: 5 
## 
## Accuracy 
##           Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## gbm  0.9016393 0.9166667 0.9322034 0.9333750 0.9491525 0.9672131    0
## rf   0.8666667 0.9166667 0.9166667 0.9166667 0.9333333 0.9500000    0
## tree 0.8000000 0.8135593 0.8500000 0.8397610 0.8500000 0.8852459    0
## nb   0.7666667 0.8166667 0.8500000 0.8433769 0.8852459 0.8983051    0
## svm  0.8000000 0.8666667 0.8813559 0.8764351 0.8833333 0.9508197    0
## knn  0.8166667 0.8500000 0.8688525 0.8700417 0.8813559 0.9333333    0
## 
## Kappa 
##           Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## gbm  0.8051118 0.8329621 0.8626310 0.8665816 0.8978650 0.9343380    0
## rf   0.7306397 0.8337029 0.8337029 0.8328082 0.8666667 0.8993289    0
## tree 0.6017699 0.6233314 0.7045952 0.6815205 0.7058824 0.7720235    0
## nb   0.5291480 0.6232877 0.6966292 0.6817564 0.7657707 0.7939464    0
## svm  0.6017699 0.7342193 0.7630522 0.7533862 0.7671840 0.9007054    0
## knn  0.6283784 0.6952596 0.7359307 0.7370683 0.7603018 0.8654709    0

AUC-ROC curve: This is a performance measurement for classification problem at various thresholds settings. ROC is a probability curve and AUC represents degree or measure of separability. It tells how much model is capable of distinguishing between classes. The Higher the AUC, better the model is at making predictions.

It should be noted that we fitted our models using “accuracy” to aide with finding the best tuning parameters. However, it is good to measure models with multiple metrics and AUC-ROC is one of the additional metrics to consider when making a decision.

The last graph indicates all the models look very good in terms of AUC-ROC, but gradient boosting, random forest are standouts. This means the gradient boosting machines resulted in a better fit than all other models, judging by the AUC-ROC metric. The Naive Bayes, Support Vector Machines, and K Nearest Neighbors were not far behind GBM, but the decision tree was trailing a bit more.

The recall sensitivity is import here too, since that is our ability to accurately discern all the fatalities. The gradient boosting machines method proving to be a good model for true positives. Random forests and support vector machines do reasonably here, but naive bayes, decision tree, and knn trail significantly on this metric.

## run MLeval comparing tree 2 and 3
library(MLeval)
#??MLeval #for help
res <- evalm(list(fit.gbm,fit.tree,fit.rf,fit.nb,fit.svm,fit.knn),gnames=c('gbm','dt','rf','nb','svm','knn'))

## ***MLeval: Machine Learning Model Evaluation***

## Input: caret train function object

## Not averaging probs.

## Group 1 type: cv

## Group 2 type: cv

## Group 3 type: cv

## Group 4 type: cv

## Group 5 type: cv

## Group 6 type: cv

## Observations: 1800

## Number of groups: 6

## Observations per group: 300

## Positive: Survived

## Negative: Deceased

## Group: gbm

## Positive: 138

## Negative: 162

## Group: dt

## Positive: 138

## Negative: 162

## Group: rf

## Positive: 138

## Negative: 162

## Group: nb

## Positive: 138

## Negative: 162

## Group: svm

## Positive: 138

## Negative: 162

## Group: knn

## Positive: 138

## Negative: 162

## ***Performance Metrics***

## gbm Optimal Informedness = 0.870101986044015

## dt Optimal Informedness = 0.691894793344069

## rf Optimal Informedness = 0.843263553408481

## nb Optimal Informedness = 0.72812667740204

## svm Optimal Informedness = 0.773483628556092

## knn Optimal Informedness = 0.747987117552335

## gbm AUC-ROC = 0.98

## dt AUC-ROC = 0.88

## rf AUC-ROC = 0.97

## nb AUC-ROC = 0.91

## svm AUC-ROC = 0.92

## knn AUC-ROC = 0.92

Model Prediction Results

Definition of the Terms related to measuring classification models:

Positive (P) : Observation is positive (for example: is an apple).
Negative (N) : Observation is not positive (for example: is not an apple).
True Positive (TP) : Observation is positive, and is predicted to be positive.
False Negative (FN) : Observation is positive, but is predicted negative.
True Negative (TN) : Observation is negative, and is predicted to be negative.
False Positive (FP) : Observation is negative, but is predicted positive.

Accuracy: * Classification Rate or Accuracy is given by the relation: (TP + TN) / (TP + TN + FP + FN) * It assumes equal costs for both kinds of errors. A 99% accuracy can be excellent, good, mediocre, poor or terrible depending upon the problem.

Recall: Recall can be defined as the ratio of the total number of correctly classified positive examples divide to the total number of positive examples. High Recall indicates the class is correctly recognized (a small number of FN). Recall is given by the relation: TP / (TP + FN)

AUC-ROC curve: This is a performance measurement for classification problem at various thresholds settings. ROC is a probability curve and AUC represents degree or measure of separability. It tells how much model is capable of distinguishing between classes. Higher the AUC, better the model is at binary predictions.

Summary the test results for each of the following classifier models. The results vary slightly from run to run due to random sampling.

Before PCA:

Decision Tree - Accuracy : 0.90
Random Forest - Accuracy : 0.94
Naive Bayes - Accuracy : 0.90
K Nearest Neighbors - Accuracy : 0.92
Support Vector Machines (radial)- Accuracy : 0.89
Classification Based On Association Rules (CBA) - Accuracy : .88 **this was run with no PCA involved

After PCA:

Decision Tree - Accuracy : 0.94
Random Forest - Accuracy : 0.92
Naive Bayes - Accuracy : 0.88
K Nearest Neighbors - Accuracy : 0.96
Support Vector Machines (radial)- Accuracy : 0.88
Gradient Boosting Machines - Accuracy : .96 **this was added after PCA, so there is no before PCA run

test <- test
#head(test)

#https://stackoverflow.com/questions/23891140/r-how-to-visualize-confusion-matrix-using-the-caret-package
draw_confusion_matrix <- function(cm) {

  #   total <- sum(cm$table)
  # res <- as.numeric(cm$table)
  # 
  # # Generate color gradients. Palettes come from RColorBrewer.
  # greenPalette <- c("#F7FCF5","#E5F5E0","#C7E9C0","#A1D99B","#74C476","#41AB5D","#238B45","#006D2C","#00441B")
  # redPalette <- c("#FFF5F0","#FEE0D2","#FCBBA1","#FC9272","#FB6A4A","#EF3B2C","#CB181D","#A50F15","#E64141")
  # getColor <- function (greenOrRed = "green", amount = 0) {
  #   if (amount == 0)
  #     return("#FFFFFF")
  #   palette <- greenPalette
  #   if (greenOrRed == "red")
  #     palette <- redPalette
  #   colorRampPalette(palette)(100)[10 + ceiling(90 * amount / total)]
  # }
  # 
  
  layout(matrix(c(1,1,2)))
  par(mar=c(2,2,2,2))
  plot(c(100, 345), c(300, 450), type = "n", xlab="", ylab="", xaxt='n', yaxt='n')
  title('CONFUSION MATRIX', cex.main=2)

   # create the matrix 
  rect(150, 430, 240, 370, col='#a2aba5')
  text(195, 435, 'Deceased ', cex=1.2)
  rect(250, 430, 340, 370, col='#E64141')
  text(295, 435, 'Survived', cex=1.2)
  text(125, 370, 'Predicted', cex=1.3, srt=90, font=2)
  text(245, 450, 'Actual', cex=1.3, font=2)
  rect(150, 305, 240, 365, col='#E64141')
  rect(250, 305, 340, 365, col='#a2aba5')
  text(140, 400, 'Deceased ', cex=1.2, srt=90)
  text(140, 335, 'Survived', cex=1.2, srt=90)

  # add in the cm results 
  res <- as.numeric(cm$table)
  text(195, 400, res[1], cex=1.6, font=2, col='white')
  text(195, 335, res[2], cex=1.6, font=2, col='white')
  text(295, 400, res[3], cex=1.6, font=2, col='white')
  text(295, 335, res[4], cex=1.6, font=2, col='white')

  # add in the specifics 
  plot(c(100, 0), c(100, 0), type = "n", xlab="", ylab="", main = "DETAILS", xaxt='n', yaxt='n')
  text(10, 85, names(cm$byClass[1]), cex=1.2, font=2)
  text(10, 70, round(as.numeric(cm$byClass[1]), 3), cex=1.2)
  text(30, 85, names(cm$byClass[2]), cex=1.2, font=2)
  text(30, 70, round(as.numeric(cm$byClass[2]), 3), cex=1.2)
  text(50, 85, names(cm$byClass[5]), cex=1.2, font=2)
  text(50, 70, round(as.numeric(cm$byClass[5]), 3), cex=1.2)
  text(70, 85, names(cm$byClass[6]), cex=1.2, font=2)
  text(70, 70, round(as.numeric(cm$byClass[6]), 3), cex=1.2)
  text(90, 85, names(cm$byClass[7]), cex=1.2, font=2)
  text(90, 70, round(as.numeric(cm$byClass[7]), 3), cex=1.2)

  # add in the accuracy information 
  text(30, 35, names(cm$overall[1]), cex=1.5, font=2)
  text(30, 20, round(as.numeric(cm$overall[1]), 3), cex=1.4)
  text(70, 35, names(cm$overall[2]), cex=1.5, font=2)
  text(70, 20, round(as.numeric(cm$overall[2]), 3), cex=1.4)
}  

# Error: [conflicted] `layout` found in 2 packages. 
# Either pick the one you want with `::` * plotly::layout * graphics::layout Or declare a preference with
# `conflict_prefer()` * conflict_prefer("layout", "plotly") * conflict_prefer("layout", "graphics")
conflict_prefer("layout", "plotly")

Gradient Boosting Machines

Gradient Boosting Machines - Gradient boosting is a machine learning technique for regression and classification problems, which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees. This model type was added to the analysis to challenge the accuracy of random forest. The results are 96% accuracy, and recall/sensitivity is 97%.

predict.gbm= predict(fit.gbm, test, type="raw")
#predict.tree= predict(fit.tree, test, type="prob")

#summary(predict.gbm)
#caret confusion
cm <- confusionMatrix(factor(predict.gbm), factor(test$label))
#cm

#fourfoldplot(cm$table, color = c("grey", "#E64141"), conf.level = 0, margin = 1, main = "Confusion Matrix")
draw_confusion_matrix(cm)

Decision Tree

Decision Tree - A tree in which each internal (non-leaf) node is labeled with an input feature. The arcs coming from a node labeled with an input feature are labeled with each of the possible values of the target feature or the arc leads to a subordinate decision node on a different input feature. Decision tree tested at 94% accuracy, 95% recall/sensitivity, which is a solid test performance.

predict.tree= predict(fit.tree, test, type="raw")
#predict.tree= predict(fit.tree, test, type="prob")

#summary(predict.tree)
#caret confusion
cm <- confusionMatrix(factor(predict.tree), factor(test$label))
#cm

#fourfoldplot(cm$table, color = c("grey", "#E64141"), conf.level = 0, margin = 1, main = "Confusion Matrix")
draw_confusion_matrix(cm)

SVM

Support Vector Machines (radial) - A discriminative classifier formally defined by a separating hyperplane. The algorithm creates a line or a hyperplane, which separates the data into classes. Support vector machines turned in an accuracy of 88% and 90% recall/sensitivity here.

#Test Results SVM
predict.svm= predict(fit.svm, test, type="raw")
#predict.svm= predict(fit.svm, test, type="prob")
#summary(predict.svm)

#confusion matrix to find correct and incorrect predictions
#table(labelship=predict.svm, true=test$Survived)
#caret confusion
cm <-confusionMatrix(factor(predict.svm), factor(test$label))
#cm

#plot(cm$table)
#fourfoldplot(cm$table, color = c("grey", "#E64141"), conf.level = 0, margin = 1, main = "Confusion Matrix")
draw_confusion_matrix(cm)

Random Forest

Random Forest - An ensemble classifier that consists of many decision trees and outputs the class that is the mode of the classes output by individual trees. Our random forest testing produced accuracy of 92% and 92% recall/sensitivity. In comparison gradient boosting machines has done better in the testing compared to random forest.

predict.rf= predict(fit.rf, test, type="raw")
#predicted3= predict(fit.rf, test, type="prob")
#summary(predict.rf)

#confusion matrix to find correct and incorrect predictions
#table(labelship=predict.rf, true=test$Survived)
#caret confusion
cm <- confusionMatrix(factor(predict.rf), factor(test$label))
#cm

#fourfoldplot(cm$table, color = c("grey", "#E64141"),    conf.level = 0, margin = 1, main = "Confusion Matrix")
draw_confusion_matrix(cm)

K Nearest Neighbors

K Nearest Neighbors - KNN has been used in statistical estimation and pattern recognition already in the beginning of 1970’s as a non-parametric technique.It estimates how likely a data point is to be a member of one group or the other depending on what group in which the data points nearest to it.The k-nearest-neighbor is an example of a “lazy learner” algorithm, meaning that it does not build a model using the training set until a query of the data set is performed. The knn results were very good with 96% accuracy and a high recall of 97%. It is a good candidate model, but it had 2 misclassified fatalities.

predict.knn= predict(fit.knn, test, type="raw")
#predicted3= predict(fit.knn, test, type="prob")
#summary(predict.knn)

#table(labelship=predict.rf, true=test$Survived)
#caret confusion

cm <- confusionMatrix(factor(predict.knn), factor(test$label))
#cm

#fourfoldplot(cm$table, color = c("grey", "#E64141"),      conf.level = 0, margin = 1, main = "Confusion Matrix")
draw_confusion_matrix(cm)

Naive Bayes

Naive Bayes - An algorithm that uses Bayes’ theorem to classify objects. Naive Bayes classifiers assume strong, or naive, independence between attributes of data points. Naive Bayes has a reasonable accuracy at 90%, but it misclassified four fatalities.

predict.nb= predict(fit.nb, test, type="raw")
#summary(predict.nb)

#confusion matrix to find correct and incorrect predictions
#table(labelship=predict.nb, true=test$Survived)
cm <- confusionMatrix(factor(predict.nb), factor(test$label))
#cm


#fourfoldplot(cm$table, color = c("grey", "#E64141"), conf.level = 0, margin = 1, main = "Confusion Matrix")
draw_confusion_matrix(cm)

Keras Deep Learning

The keras library will be used to classify survival or fatality of covid-19 patient sample based upon the same biomarker data set.

require(keras)

## Loading required package: keras

#start with the clean data frame again
df <- df_clean 
df <- df[,c(4,7:80)] # or df <- df %>% select(-patient_id,-admission_time,-discharge_time)

#rename outcome to label
names(df)[1] <- c("label")

df$label <- factor(df$label)

#Discretize Survived
#df$Survived=dplyr::recode(df$Survived, `0`="Deceased", `1`="Survived")
colnames(df) <- make.names(colnames(df))

head(df[,1:5])  %>% kable() %>% kable_styling()

hypersensitive_cardiac_troponin_i	hemoglobin	serum_chloride	prothrombin_time
19.9	140	103.1	14.1
16.9	151	100.5	14.3
0.0	126	102.9	13.6
4.8	110	103.1	16.3
5.6	134	102.2	14.6
19.7	108	105.8	12.4

Data Splits

The keras model will use the hold out approach for the data splits. 80%, 20% where the 20%, which normally simply test will be further split 80/20 into test and holdout respectively.

require(kableExtra)

dim(df)

## [1] 375  75

paste("-------")

## [1] "-------"

nDim <- dim(df)[2]-1
#nDim

x <- modelData2(df) # 80/20 train/test; holdout = test
#x <- modelDataC(df) # removes 10% for holdout then splits 80/20 for remainder
train <-x@trn
test <- x@tst
hold <- x@hld
#hold <-z[-data,] #balance what is left over after train @.90  

trn <- as.data.frame(dim(train))
names(trn) <- c("dimension_training_train")
trn

##   dimension_training_train
## 1                      300
## 2                       75

tst <-as.data.frame(dim(test))
names(tst) <- c("dimension_testing_test")
hld <-as.data.frame(dim(hold))
names(hld) <- c("dimension_testing_holdout")

train <- train
dim(train)

## [1] 300  75

test <- test
dim(test)

## [1] 75 75

hold <- hold
dim(hold)

## [1] 75 75

split <- cbind(trn,tst,hld)
split %>%kable() %>% kable_styling()

dimension_training_train	dimension_testing_test	dimension_testing_holdout
300	75	75
75	75	75

#view the data sets
head(train[,1:5]) %>% kable() %>% kable_styling()

label	hypersensitive_cardiac_troponin_i	hemoglobin	serum_chloride	prothrombin_time
1	1776.3	124	107.1	19.1
1	62.3	156	90.5	13.2
1	0.0	129	95.0	13.2
0	3.7	151	104.6	13.6
0	1.9	116	103.8	14.7
0	0.0	120	104.4	14.9

head(test[,1:5]) %>% kable() %>% kable_styling()

label	hypersensitive_cardiac_troponin_i	hemoglobin	serum_chloride	prothrombin_time
1	8.2	119	100.8	17.1
1	32.6	132	105.6	17.0
1	140.8	159	131.7	32.2
1	4.4	124	109.8	15.4
1	8443.5	124	103.3	20.0
0	8.1	130	101.9	15.1

head(hold[,1:5]) %>% kable() %>% kable_styling()

hypersensitive_cardiac_troponin_i	hemoglobin	serum_chloride	prothrombin_time
19.9	140	103.1	14.1
0.0	126	102.9	13.6
3.7	151	104.5	13.3
10.4	125	107.1	13.1
0.0	125	106.3	13.6
0.0	133	103.3	12.9

Keras Data Prep

#devtools::install_github("rstudio/keras")
library(keras)

x_train <- train[,2:(nDim+1)] #omit label
x_test <- test[,2:(nDim+1)] #omit label
x_hold <- hold[,2:(nDim+1)] #omit label

scale data frame’s numerical columns

library(caret)
# Assuming goal class is column 10

normalize <- function(x) {
    return ((x - min(x)) / (max(x) - min(x)))
}

x_train <- as.data.frame(lapply(x_train, normalize))
head(x_train[,1:5])

##   hypersensitive_cardiac_troponin_i hemoglobin serum_chloride prothrombin_time
## 1                          0.035526  0.6966292      0.7628205        0.1591667
## 2                          0.001246  0.8764045      0.6445869        0.1100000
## 3                          0.000000  0.7247191      0.6766382        0.1100000
## 4                          0.000074  0.8483146      0.7450142        0.1133333
## 5                          0.000038  0.6516854      0.7393162        0.1225000
## 6                          0.000000  0.6741573      0.7435897        0.1241667
##   procalcitonin
## 1  0.0251880357
## 2  0.0083960119
## 3  0.0034983383
## 4  0.0008745846
## 5  0.0006996677
## 6  0.0006996677

x_test <- as.data.frame(lapply(x_test, normalize))
head(x_test[,1:5])

##   hypersensitive_cardiac_troponin_i hemoglobin serum_chloride prothrombin_time
## 1                          0.000164  0.6685393      0.7601810        0.5310559
## 2                          0.000652  0.7415730      0.7963801        0.5279503
## 3                          0.002816  0.8932584      0.9932127        1.0000000
## 4                          0.000088  0.6966292      0.8280543        0.4782609
## 5                          0.168870  0.6966292      0.7790347        0.6211180
## 6                          0.000162  0.7303371      0.7684766        0.4689441
##   procalcitonin
## 1   0.003907638
## 2   0.006394316
## 3   0.032326821
## 4   0.698046181
## 5   0.043694494
## 6   0.001776199

x_hold <- as.data.frame(lapply(x_hold, normalize))
head(x_hold[,1:5])

##   hypersensitive_cardiac_troponin_i hemoglobin serum_chloride prothrombin_time
## 1                          0.000398  0.7865169      0.7775264        0.4378882
## 2                          0.000000  0.7078652      0.7760181        0.4223602
## 3                          0.000074  0.8483146      0.7880845        0.4130435
## 4                          0.000208  0.7022472      0.8076923        0.4068323
## 5                          0.000000  0.7022472      0.8016591        0.4223602
## 6                          0.000000  0.7471910      0.7790347        0.4006211
##   procalcitonin
## 1  0.0031971581
## 2  0.0021314387
## 3  0.0014209591
## 4  0.0039076377
## 5  0.0014209591
## 6  0.0007104796

Make normalized data frame a matrix shape

x_train <- as.matrix(x_train)
x_test <- as.matrix(x_test)
x_hold <- as.matrix(x_hold)

dim(x_train)

## [1] 300  74

head(x_train[,1:5]) %>% kable() %>% kable_styling()

hypersensitive_cardiac_troponin_i	hemoglobin	serum_chloride	prothrombin_time	procalcitonin
0.035526	0.6966292	0.7628205	0.1591667	0.0251880
0.001246	0.8764045	0.6445869	0.1100000	0.0083960
0.000000	0.7247191	0.6766382	0.1100000	0.0034983
0.000074	0.8483146	0.7450142	0.1133333	0.0008746
0.000038	0.6516854	0.7393162	0.1225000	0.0006997
0.000000	0.6741573	0.7435897	0.1241667	0.0006997

dim(x_test)

## [1] 75 74

head(x_test[,1:5])  %>% kable() %>% kable_styling()

hypersensitive_cardiac_troponin_i	hemoglobin	serum_chloride	prothrombin_time	procalcitonin
0.000164	0.6685393	0.7601810	0.5310559	0.0039076
0.000652	0.7415730	0.7963801	0.5279503	0.0063943
0.002816	0.8932584	0.9932127	1.0000000	0.0323268
0.000088	0.6966292	0.8280543	0.4782609	0.6980462
0.168870	0.6966292	0.7790347	0.6211180	0.0436945
0.000162	0.7303371	0.7684766	0.4689441	0.0017762

dim(x_hold)

## [1] 75 74

head(x_hold[,1:5])  %>% kable() %>% kable_styling()

hypersensitive_cardiac_troponin_i	hemoglobin	serum_chloride	prothrombin_time	procalcitonin
0.000398	0.7865169	0.7775264	0.4378882	0.0031972
0.000000	0.7078652	0.7760181	0.4223602	0.0021314
0.000074	0.8483146	0.7880845	0.4130435	0.0014210
0.000208	0.7022472	0.8076923	0.4068323	0.0039076
0.000000	0.7022472	0.8016591	0.4223602	0.0014210
0.000000	0.7471910	0.7790347	0.4006211	0.0007105

Features

x_train <- array_reshape(x_train, c(nrow(x_train), nDim))
x_test <- array_reshape(x_test, c(nrow(x_test), nDim))
x_hold <- array_reshape(x_hold, c(nrow(x_hold), nDim))


#check dimensions
dim(x_train)

## [1] 300  74

dim(x_test)

## [1] 75 74

dim(x_hold)

## [1] 75 74

Labels

Note that we use the array_reshape() function rather than the dim<-() function to reshape the array. This is so that the data is re-interpreted using row-major semantics (as opposed to R’s default column-major semantics), which is in turn compatible with the way that the numerical libraries called by Keras interpret array dimensions.

The y data is an integer vector with values ranging from 0 to 1. To prepare this data for training we one-hot encode the vectors into binary class matrices using the Keras to_categorical() function:

#y_train <- as.numeric(train$label)
y_train <- train$label #these were already R factor
head(y_train)

## [1] 1 1 1 0 0 0
## Levels: 0 1

#y_test <- as.numeric(test$label)
y_test <- test$label #these were already R factor
head(y_test)

## [1] 1 1 1 1 1 0
## Levels: 0 1

y_hold <- hold$label #these were already R factor
head(y_hold)

## [1] 0 0 0 0 0 0
## Levels: 0 1

# categorical conversion
y_train <- to_categorical(y_train, 2)
y_test <- to_categorical(y_test, 2)
y_hold <- to_categorical(y_hold, 2)

Hypertuning

https://henry090.github.io/kerastuneR/

conflict_prefer("train", "caret")

## [conflicted] Will prefer caret::train over any other package

library(reticulate)
#use_python("C:\Users\knjpo\AppData\Local\r-miniconda")
use_condaenv("r-miniconda")
#install.packages('kerastuneR')
library(keras)
library(tensorflow)
#library(kerastuneR)
#install_kerastuner()

Defining the model

The simplest type of model is the Sequential model, a linear stack of layers. We begin by creating a sequential model and then adding layers using the pipe (%>%) operator:

# 97.3%
# tf.fit <- keras_model_sequential() 
# tf.fit %>% 
#   layer_dense(units = 50, activation = 'relu', input_shape = c(nDim)) %>%
#   layer_dropout(rate = 0.3) %>% 
#   layer_dense(units = 20, activation = 'relu') %>%
#   layer_dropout(rate = 0.2) %>%
#   layer_dense(units = 10, activation = 'relu') %>%
#   layer_dropout(rate = 0.1) %>%
#   layer_dense(units = 2, activation = 'softmax')


tf.fit <- keras_model_sequential() 
tf.fit %>% 
  layer_dense(units = 50, activation = 'relu', input_shape = c(nDim)) %>%
  layer_dropout(rate = 0.3) %>% 
  layer_dense(units = 20, activation = 'relu') %>%
  layer_dropout(rate = 0.2) %>%
  layer_dense(units = 10, activation = 'relu') %>%
  layer_dropout(rate = 0.1) %>%
  layer_dense(units = 2, activation = 'softmax')

Fit Summary

The input_shape argument to the first layer specifies the shape of the input data (a length 784 numeric vector representing a grayscale image). The final layer outputs a length 10 numeric vector (probabilities for each digit) using a softmax activation function.

Use the summary() function to print the details of the tf.fit:

summary(tf.fit)

## Model: "sequential"
## ________________________________________________________________________________
## Layer (type)                        Output Shape                    Param #     
## ================================================================================
## dense (Dense)                       (None, 50)                      3750        
## ________________________________________________________________________________
## dropout (Dropout)                   (None, 50)                      0           
## ________________________________________________________________________________
## dense_1 (Dense)                     (None, 20)                      1020        
## ________________________________________________________________________________
## dropout_1 (Dropout)                 (None, 20)                      0           
## ________________________________________________________________________________
## dense_2 (Dense)                     (None, 10)                      210         
## ________________________________________________________________________________
## dropout_2 (Dropout)                 (None, 10)                      0           
## ________________________________________________________________________________
## dense_3 (Dense)                     (None, 2)                       22          
## ================================================================================
## Total params: 5,002
## Trainable params: 5,002
## Non-trainable params: 0
## ________________________________________________________________________________

Fit Compile

Next, compile the tf.fit with appropriate loss function, optimizer, and metrics:

tf.fit %>% keras::compile(
  loss = 'categorical_crossentropy',
  optimizer = optimizer_rmsprop(),
  metrics = c('accuracy')
)

Training and Evaluation

Use the fit() function to train the tf.fit for 50 epochs using batches of 5:

ptm <- proc.time() #timer

tf.train <- tf.fit %>% fit(
  x_train, y_train,
  epochs = 33, batch_size = 30,
  #epochs = 30, batch_size = 25,
  #epochs = 30, batch_size = 128,
  validation_split = 0.20
)
proc.time() - ptm

##    user  system elapsed 
##    4.17    0.15    4.28

Plot

The history object returned by fit() includes loss and accuracy metrics which we can plot:

plot(tf.train)

## `geom_smooth()` using formula 'y ~ x'

# Results

#alluvial plots
plotCM <- function(cm){
  cmdf <- as.data.frame(cm[["table"]])
  cmdf[["color"]] <- ifelse(cmdf[[1]] == cmdf[[2]], "grey", "red")
  
  alluvial::alluvial(cmdf[,1:2]
                     , freq = cmdf$Freq
                     , col = cmdf[["color"]]
                     , alpha = 0.5
                     , hide  = cmdf$Freq == 0
                     )
}
 
require(plotly)
ggPlotCM <- function(cm){
## ggplotly version
cm_d <- as.data.frame(cm$table)
# confusion matrix statistics as data.frame
cm_st <-data.frame(cm$overall)
# round the values
cm_st$cm.overall <- round(cm_st$cm.overall,2)

# here we also have the rounded percentage values
cm_p <- as.data.frame(prop.table(cm$table))
cm_d$Perc <- round(cm_p$Freq*100,2)

# plotting the matrix
cm_d_p <-  ggplot(data = cm_d, aes(x = Prediction , y =  Reference, fill = Freq))+
  geom_tile() +
  geom_text(aes(label = paste("",Freq,",",Perc,"%")), color = 'red', size = 3) +
  theme_light() +
  guides(fill=FALSE) + theme_1()

# plotting the stats
#cm_st_p <-  tableGrob(cm_st)

cm_d_ply <-ggplotly(cm_d_p)
return (cm_d_ply)
}

Testing

Evaluate on test

Evaluate the tf.fit’s performance on the test data:

tf.fit %>% keras::evaluate(x_test, y_test)

##      loss  accuracy 
## 0.1227164 0.9600000

Confusion Matrix Test

#Generate predictions on new data:
pred <-tf.fit %>% predict_classes(x_test) 
#table(Predicted=pred, Actual =test$label)

cm <- confusionMatrix(factor(pred), factor(test$label))
cm

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 35  1
##          1  2 37
##                                           
##                Accuracy : 0.96            
##                  95% CI : (0.8875, 0.9917)
##     No Information Rate : 0.5067          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.92            
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.9459          
##             Specificity : 0.9737          
##          Pos Pred Value : 0.9722          
##          Neg Pred Value : 0.9487          
##              Prevalence : 0.4933          
##          Detection Rate : 0.4667          
##    Detection Prevalence : 0.4800          
##       Balanced Accuracy : 0.9598          
##                                           
##        'Positive' Class : 0               
##

draw_confusion_matrix(cm)

#plotCM(cm)
#ggPlotCM(cm)

Evaluate on hold out

Evaluate the tf.fit’s performance on the hold out data:

tf.fit %>% keras::evaluate(x_hold, y_hold)

##      loss  accuracy 
## 0.1227164 0.9600000

Confusion Matrix Holdout

#Generate predictions on new data:
pred2 <-tf.fit %>% predict_classes(x_hold) 
#table(Predicted=pred, Actual =test$label)

cm <- confusionMatrix(factor(pred2), factor(hold$label))
cm

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 35  1
##          1  2 37
##                                           
##                Accuracy : 0.96            
##                  95% CI : (0.8875, 0.9917)
##     No Information Rate : 0.5067          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.92            
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.9459          
##             Specificity : 0.9737          
##          Pos Pred Value : 0.9722          
##          Neg Pred Value : 0.9487          
##              Prevalence : 0.4933          
##          Detection Rate : 0.4667          
##    Detection Prevalence : 0.4800          
##       Balanced Accuracy : 0.9598          
##                                           
##        'Positive' Class : 0               
##

draw_confusion_matrix(cm)

#plotCM(cm)
#ggPlotCM(cm)

ML Classification by Arules

This analysis is designed as a demonstration of using the Arules CBA method to predict covid 19 patient fatality using a sample of 375 patients and their biomarkers measured during hospital stay. The input data originated from another more expansive predictive modeling study on the topic, and the input data used here is simplified for this Apriori and ArulesCBA demonstration.

#Return multiple data frames in one object using the S4 class system.
setClass(
    "mdl",
    representation(
        trn = "vector",
        tst = "vector",
        hld = "vector"
    )
)

modelData2 <- function (df1 = df_ww) {
  z <- df1  
    # get number of rows in the data frame
  len <- as.numeric(dim(z)[1])
  randIndex <- sample(as.numeric(rownames(z)[1:len]))
  # setting cutpoint to 2/3 of the input data frame
  cutPoint80 <- floor(.8 * len)
  # setting cutpoint to 1/3 of the input data frame
  cutPoint20 <- floor(.2 * len)
  trn2 <- z[randIndex[1:cutPoint80],]
  tst2 <- z[randIndex[cutPoint80+1:c(len-cutPoint80)],] 
  hld2 <- z[-randIndex[1: cutPoint80],]
  
  mdl <- new("mdl",
                  trn = trn2,
                  tst = tst2,
                  hld = hld2  )
  return(mdl)
  
}

Data Loading

Anonymized Patient Data

View anonymized Covid-19 patient information

df_CBA <- read.csv("covid19_biomarkers_max.csv") # this data resulted a write.csv of df_clean data 
head(df_CBA[,2:7]) %>% kable() %>% kable_styling()

patient_id	age	gender	admission_time	discharge_time
1	73	1	2020-01-30 22:12:47	2020-02-17 12:40:09
2	61	1	2020-02-04 21:39:03	2020-02-19 12:59:01
3	70	2	2020-01-23 10:59:36	2020-02-08 17:52:31
4	74	1	2020-01-31 23:03:59	2020-02-18 12:59:12
5	29	2	2020-02-01 20:59:54	2020-02-18 10:33:06
6	81	2	2020-01-24 10:47:10	2020-02-07 09:06:58

#remove first column that created when csvwas written

Model Data View

View slice of model data input set. The target variable outcome has been discretized.

df_CBA <- df_CBA[,-1]
# omit columns that will not be part of the model data set
#df_CBA <- df_CBA[,c(4,7:80)] 
df_CBA <- df_CBA %>% dplyr::select(-patient_id,-age,-gender,-admission_time,-discharge_time)

#rename outcome to Survived
names(df_CBA)[1] <- c("outcome")
df_CBA$outcome <- factor(df_CBA$outcome)

#Discretize Survived
df_CBA$outcome=dplyr::recode(df_CBA$outcome, `0`="Deceased", `1`="Survived")
colnames(df_CBA) <- make.names(colnames(df_CBA))
head(df_CBA[,1:5]) %>% kable() %>% kable_styling()

outcome	hypersensitive_cardiac_troponin_i	hemoglobin	serum_chloride	prothrombin_time
Deceased	19.9	140	103.1	14.1
Deceased	16.9	151	100.5	14.3
Deceased	0.0	126	102.9	13.6
Deceased	4.8	110	103.1	16.3
Deceased	5.6	134	102.2	14.6
Deceased	19.7	108	105.8	12.4

# set model input
biomarkers <- df_CBA

During the exploratory data analysis, the following 8 features were found to be important out of 74 total. They are listed in order of relative importance.

lactate_dehydrogenase 100.00
monocytes_percent 83.50
international_standard_ratio 78.47
percent_lymphocyte 64.12
eosinophil_count 59.75
eosinophils_percent 50.99
neutrophils_percent 49.15
amino_terminal_brain_natriuretic_peptide_precursor_nt_pro_bnp 48.62

The CBA analysis will use the top four:

lactate_dehydrogenase
monocytes_percent
international_standard_ratio
percent_lymphocyte

Research to inform cutoffs for descritization of the top four features

lactate_dehydrogenase (low cutoff 140, high cutoff 280)

https://www.uofmhealth.org/health-library/tv6985#:~:text=Lactic%20acid%20dehydrogenase%20(LDH)%20is%20an%20enzyme%20that%20helps%20produce,L%20to%204.68%20microkatals%2FL

monocytes_percent (low cutoff 2 / high cutoff 8)

https://cllsociety.org/toolbox/normal-lab-values/

international_standard_ratio (low cutoff 0.8 / high cutoff 1.1)

https://www.mayoclinic.org/tests-procedures/prothrombin-time/about/pac-20384661#:~:text=In%20healthy%20people%20an%20INR,in%20the%20leg%20or%20lung

percent_lymphocyte (low cutoff 20 / high cutoff 40)

https://cllsociety.org/toolbox/normal-lab-values/

The information will be used with mutate() and ifelse to discretize the top four features in low, normal, and high buckets based on the researched reference values.

#lactate_dehydrogenase
biomarkers <- biomarkers %>% mutate(lactate_dehydrogenase = ifelse(lactate_dehydrogenase < 140, "low"
        ,ifelse(lactate_dehydrogenase > 140, "high" , "normal")))

#monocytes_percent
biomarkers <- biomarkers %>% mutate(monocytes_percent = ifelse(monocytes_percent < 2, "low"
        ,ifelse(monocytes_percent > 8, "high" , "normal")))

#international_standard_ratio
biomarkers <- biomarkers %>% mutate(international_standard_ratio = ifelse(international_standard_ratio < 0.8, "low" ,ifelse(international_standard_ratio > 1.1, "high" , "normal")))

#percent_lymphocyte
biomarkers <- biomarkers %>% mutate(percent_lymphocyte = ifelse(percent_lymphocyte < 20, "low"
        ,ifelse(percent_lymphocyte > 40, "high" , "normal")))

# Make dataframe from  discretized biomarkers
df.biomarkers <- data.frame(outcome = biomarkers$outcome
              , lactate_dehydrogenase = biomarkers$lactate_dehydrogenase
              , monocytes_percent = biomarkers$monocytes_percent
              , international_standard_ratio = biomarkers$international_standard_ratio
              , percent_lymphocyte = biomarkers$percent_lymphocyte) 

# Make factors of all columns
df.biomarkers$outcome <- factor(df.biomarkers$outcome)
df.biomarkers$lactate_dehydrogenase <- factor(df.biomarkers$lactate_dehydrogenase)
df.biomarkers$monocytes_percent <- factor(df.biomarkers$monocytes_percent)
df.biomarkers$percent_lymphocyte <- factor(df.biomarkers$percent_lymphocyte)

View Discretized Data

head(df.biomarkers) %>% kable() %>% kable_styling()

outcome	lactate_dehydrogenase	monocytes_percent	international_standard_ratio	percent_lymphocyte
Deceased	high	high	normal	normal
Deceased	high	high	normal	normal
Deceased	high	normal	normal	normal
Deceased	high	high	high	normal
Deceased	high	normal	high	normal
Deceased	high	high	normal	normal

#head(df.biomarkers)
glimpse(df.biomarkers)

## Rows: 375
## Columns: 5
## $ outcome                      <fct> Deceased, Deceased, Deceased, Deceased...
## $ lactate_dehydrogenase        <fct> high, high, high, high, high, high, hi...
## $ monocytes_percent            <fct> high, high, normal, high, normal, high...
## $ international_standard_ratio <fct> normal, normal, normal, high, high, no...
## $ percent_lymphocyte           <fct> normal, normal, normal, normal, normal...

Apriori Analysis

The Apriori algorithm can now be used on the transformed data to generate association rules. Parameters of support,confidence and lift will be leveraged to study rules formed from Covid-19 patient biomarkers.

Support: This measure gives an idea of how frequent an items is in all the transactions. A transaction in this case means patient and their associated biomarkers.
Confidence: Confidence is an indication of how often the rule has been found to be true.
Lift: Lift is the ratio of the observed support to that expected

Several different combinations of the support, and confidence were use to filter the universe of strong rules for the data.

Set support: .07 to .11
Set confidence: .75 to .79
Set lift: >= 1

A set of strong rules will be obtained from these parameters. The discretized data set will be used to perform apriori analysis prior to CBA.

Create Arules

Inspect Rules

The initial rules view is using support of .07, confidence of .75, and maxlen=3

arule_df <- df.biomarkers
#set top n

n = 20
#n = 20
#n = 10
#n = 5

#rules = apriori(arule_df, parameter = list(supp = 0.09, conf = 0.95, maxlen = 3))
rules = apriori(arule_df, parameter = list(supp = 0.07, conf = 0.75, maxlen = 3))

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##        0.75    0.1    1 none FALSE            TRUE       5    0.07      1
##  maxlen target  ext
##       3  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 26 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[13 item(s), 375 transaction(s)] done [0.00s].
## sorting and recoding items ... [11 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3

## Warning in apriori(arule_df, parameter = list(supp = 0.07, conf = 0.75, : Mining
## stopped (maxlen reached). Only patterns up to a length of 3 returned!

##  done [0.00s].
## writing ... [76 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

#arules::inspect(head(rules,n)) 
inspectDT(rules)

View higher quality rules with elevated support of .11 and confidence of .79. The number rules decreased to 50 rules.

# This combination was used to produced the 53 rules strong rules
subrules <- rules[quality(rules)$support >= 0.11]
subrules <- subrules[quality(subrules)$confidence >= 0.79]
inspectDT(subrules)

View 50 rules but with lift greater than or equal to 1. The result is reduced to 47 rules meeting that criterion.

A few initial Arule observations:

monocytes_percents=high and outcome=deceased
percent_lymphocytes=low and outcomes=surviving.
The right hand side of rule as lactate_dehydrogenas=high is a common

#The lift was 
subrules_lift <- subrules[quality(subrules)$lift >= 1]
inspectDT(subrules_lift)

# set to subrules to the lift =1 result
rules <- subrules_lift

Plot Strong Rules

The 47 strong rules are plotted below here with a few different plotting schemes.

#plot (rules, measure=c("confidence","lift"), shading="lift")
plot (rules, measure=c("support","lift"), shading="lift")

## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.

plot(rules, method = "matrix", measure = "lift")

## Itemsets in Antecedent (LHS)
##  [1] "{lactate_dehydrogenase=high,percent_lymphocyte=low}"             
##  [2] "{lactate_dehydrogenase=high,percent_lymphocyte=normal}"          
##  [3] "{percent_lymphocyte=low}"                                        
##  [4] "{outcome=Survived,lactate_dehydrogenase=high}"                   
##  [5] "{outcome=Survived}"                                              
##  [6] "{lactate_dehydrogenase=high,international_standard_ratio=normal}"
##  [7] "{outcome=Survived,percent_lymphocyte=low}"                       
##  [8] "{monocytes_percent=normal,international_standard_ratio=high}"    
##  [9] "{monocytes_percent=normal,percent_lymphocyte=low}"               
## [10] "{lactate_dehydrogenase=high,monocytes_percent=high}"             
## [11] "{outcome=Survived,monocytes_percent=normal}"                     
## [12] "{international_standard_ratio=high,percent_lymphocyte=low}"      
## [13] "{international_standard_ratio=normal,percent_lymphocyte=normal}" 
## [14] "{outcome=Survived,international_standard_ratio=high}"            
## [15] "{monocytes_percent=high,percent_lymphocyte=normal}"              
## [16] "{percent_lymphocyte=normal}"                                     
## [17] "{monocytes_percent=high,international_standard_ratio=normal}"    
## [18] "{international_standard_ratio=normal}"                           
## [19] "{monocytes_percent=high}"                                        
## [20] "{monocytes_percent=normal}"                                      
## [21] "{monocytes_percent=normal,international_standard_ratio=normal}"  
## [22] "{outcome=Deceased,monocytes_percent=normal}"                     
## [23] "{monocytes_percent=high,percent_lymphocyte=low}"                 
## [24] "{international_standard_ratio=high}"                             
## [25] "{outcome=Deceased,international_standard_ratio=normal}"          
## [26] "{monocytes_percent=high,international_standard_ratio=high}"      
## [27] "{outcome=Deceased,international_standard_ratio=high}"            
## [28] "{outcome=Deceased,percent_lymphocyte=normal}"                    
## [29] "{outcome=Deceased,monocytes_percent=high}"                       
## [30] "{outcome=Deceased}"                                              
## [31] "{}"                                                              
## Itemsets in Consequent (RHS)
## [1] "{lactate_dehydrogenase=high}"        "{international_standard_ratio=high}"
## [3] "{outcome=Deceased}"                  "{percent_lymphocyte=low}"           
## [5] "{outcome=Survived}"

plot(rules, method = "matrix" , engine="3d", measure = "lift")

## Itemsets in Antecedent (LHS)
##  [1] "{lactate_dehydrogenase=high,percent_lymphocyte=low}"             
##  [2] "{lactate_dehydrogenase=high,percent_lymphocyte=normal}"          
##  [3] "{percent_lymphocyte=low}"                                        
##  [4] "{outcome=Survived,lactate_dehydrogenase=high}"                   
##  [5] "{outcome=Survived}"                                              
##  [6] "{lactate_dehydrogenase=high,international_standard_ratio=normal}"
##  [7] "{outcome=Survived,percent_lymphocyte=low}"                       
##  [8] "{monocytes_percent=normal,international_standard_ratio=high}"    
##  [9] "{monocytes_percent=normal,percent_lymphocyte=low}"               
## [10] "{lactate_dehydrogenase=high,monocytes_percent=high}"             
## [11] "{outcome=Survived,monocytes_percent=normal}"                     
## [12] "{international_standard_ratio=high,percent_lymphocyte=low}"      
## [13] "{international_standard_ratio=normal,percent_lymphocyte=normal}" 
## [14] "{outcome=Survived,international_standard_ratio=high}"            
## [15] "{monocytes_percent=high,percent_lymphocyte=normal}"              
## [16] "{percent_lymphocyte=normal}"                                     
## [17] "{monocytes_percent=high,international_standard_ratio=normal}"    
## [18] "{international_standard_ratio=normal}"                           
## [19] "{monocytes_percent=high}"                                        
## [20] "{monocytes_percent=normal}"                                      
## [21] "{monocytes_percent=normal,international_standard_ratio=normal}"  
## [22] "{outcome=Deceased,monocytes_percent=normal}"                     
## [23] "{monocytes_percent=high,percent_lymphocyte=low}"                 
## [24] "{international_standard_ratio=high}"                             
## [25] "{outcome=Deceased,international_standard_ratio=normal}"          
## [26] "{monocytes_percent=high,international_standard_ratio=high}"      
## [27] "{outcome=Deceased,international_standard_ratio=high}"            
## [28] "{outcome=Deceased,percent_lymphocyte=normal}"                    
## [29] "{outcome=Deceased,monocytes_percent=high}"                       
## [30] "{outcome=Deceased}"                                              
## [31] "{}"                                                              
## Itemsets in Consequent (RHS)
## [1] "{lactate_dehydrogenase=high}"        "{international_standard_ratio=high}"
## [3] "{outcome=Deceased}"                  "{percent_lymphocyte=low}"           
## [5] "{outcome=Survived}"

Rules sorted by support

This sort gives us perspective in terms of what high support means in the context of the Covid-19 dataset. The highest support is .94, but that number tapers off very fast after just ten entries.

Rules sorted by confidence

Our strong rule results sorted by confidence. The right hand side again shows lactate_dehydrogenase=high with high frequency. The highest confidence is 1 within the top 47 strong rules.

#### Rules sorted by lift

The top 47 rules are sorted again, but this time by lift. The right hand side outcome=“Survived” dominates the top 10 rules sorted by lift.

#### Top Rules

Sticking with the premise that most important higher values for most features will have a tendency to result in Covid-19 patient fatality, we use the left hand side (LHS) filter to specify outcome=Deceased. Recall that “outcome=Deceased” means the patient passed away as a result of contracting covid-19.

This results in only top 9 rules, but they are very strong ones. Not only do they have good support, confidence and lift, but they also occur a lot. Recall there are only 375 observations in the patient data set, so over 100 occurrences of a rule is high.

Primary rule filter is (supp = 0.10,conf = 0.78, maxlen = 3).

Top 9 Rules for with RHS filtered by Deceased Patients.

Top 9 Rules for with RHS filtered by Survived Patients. This results in 5 of the 9 top rules.

ArulViz Visualizations

Plotting Top Rules

The two plots show five rules with RHS filtered to Survived and Deceased respectively. The size is controlled by support and color is controlled by lift.

Plot Rules

The degree of size and color indicates level of support, and lactate_dehydrogenase is indicated with high support.

Plot Confidence

The two key plot shows many rules have between .08 and 1 confidence and support between .02 and .06.

Arules CBA Model

Classification by Arules or CBA is a classification method.

Model Data Splitting

The data will be split into 80/20 using a function based upon caret partitioning. The split results in 300 rows to train CBA upon and 75 for testing. The top four features are used as predictors.

set.seed(seed)
#seed
#Used 80/20 split to be consistent with the rest of the study.
x <- modelData2(df.biomarkers)

train <-x@trn
test <- x@tst
hold <- x@hld
trn <- as.data.frame(dim(train))
names(trn) <- c("dimension_of_training_dataset")
tst <-as.data.frame(dim(test))
names(tst) <- c("dimension_of_testing_dataset")
hld <-as.data.frame(dim(hold))
names(hld) <- c("dimension_of_hold_dataset")


split80_20 <- cbind(trn,tst) 
split80_20 %>% kable() %>% kable_styling()

dimension_of_training_dataset	dimension_of_testing_dataset
300	75
5	5

cba_train <- train 
#dim(train)
cba_test <- test 
#dim(test)
cba_hold <- hold
#dim(hold)

Fit CBA Model

Arc CBA is called to fit model on the training data set and the rules generated by the fit are shown.

set.seed(seed)
require(arulesCBA)
#install.packages("arc", dependencies = TRUE)
require(arc)

#use same support and confidence settings as were used in the strong rules analysis.
v_supp <- .11
v_conf <- .76
# train CBA classifers
#88% accuracy
#fit.cba.4 <- CBA(outcome ~ ., data = cba_train, supp = v_supp, conf=v_conf, verbose = FALSE)

#88% accuracy
fit.cba.4 <- arc::cba(cba_train, "outcome",rulelearning_options = list(minsupp = v_supp, minconf = v_conf,minlen = 2, maxlen = 3, maxtime = 1000,target_rule_count = 50000,trim=TRUE, find_conf_supp_thresholds = FALSE))

# 84% accuracy
#fit.cba.4 <- arc::cba(cba_train, "outcome",rulelearning_options = list(target_rule_count = 60))

#summary(fit.cba.4)
#fit.cba.4

#arules::inspect(fit.cba.4@rules)
# DT breaks on CBA train but works on arc train.
inspectDT(fit.cba.4@rules)

#https://stackoverflow.com/questions/23891140/r-how-to-visualize-confusion-matrix-using-the-caret-package

draw_confusion_matrix <- function(cm) {

  #   total <- sum(cm$table)
  # res <- as.numeric(cm$table)
  # 
  # # Generate color gradients. Palettes come from RColorBrewer.
  # greenPalette <- c("#F7FCF5","#E5F5E0","#C7E9C0","#A1D99B","#74C476","#41AB5D","#238B45","#006D2C","#00441B")
  # redPalette <- c("#FFF5F0","#FEE0D2","#FCBBA1","#FC9272","#FB6A4A","#EF3B2C","#CB181D","#A50F15","#E64141")
  # getColor <- function (greenOrRed = "green", amount = 0) {
  #   if (amount == 0)
  #     return("#FFFFFF")
  #   palette <- greenPalette
  #   if (greenOrRed == "red")
  #     palette <- redPalette
  #   colorRampPalette(palette)(100)[10 + ceiling(90 * amount / total)]
  # }
  # 
  
  layout(matrix(c(1,1,2)))
  par(mar=c(2,2,2,2))
  plot(c(100, 345), c(300, 450), type = "n", xlab="", ylab="", xaxt='n', yaxt='n')
  title('CONFUSION MATRIX', cex.main=2)

   # create the matrix 
  rect(150, 430, 240, 370, col='#a2aba5')
  text(195, 435, 'Deceased ', cex=1.2)
  rect(250, 430, 340, 370, col='#E64141')
  text(295, 435, 'Survived', cex=1.2)
  text(125, 370, 'Predicted', cex=1.3, srt=90, font=2)
  text(245, 450, 'Actual', cex=1.3, font=2)
  rect(150, 305, 240, 365, col='#E64141')
  rect(250, 305, 340, 365, col='#a2aba5')
  text(140, 400, 'Deceased ', cex=1.2, srt=90)
  text(140, 335, 'Survived', cex=1.2, srt=90)

  # add in the cm results 
  res <- as.numeric(cm$table)
  text(195, 400, res[1], cex=1.6, font=2, col='white')
  text(195, 335, res[2], cex=1.6, font=2, col='white')
  text(295, 400, res[3], cex=1.6, font=2, col='white')
  text(295, 335, res[4], cex=1.6, font=2, col='white')

  # add in the specifics 
  plot(c(100, 0), c(100, 0), type = "n", xlab="", ylab="", main = "DETAILS", xaxt='n', yaxt='n')
  text(10, 85, names(cm$byClass[1]), cex=1.2, font=2)
  text(10, 70, round(as.numeric(cm$byClass[1]), 3), cex=1.2)
  text(30, 85, names(cm$byClass[2]), cex=1.2, font=2)
  text(30, 70, round(as.numeric(cm$byClass[2]), 3), cex=1.2)
  text(50, 85, names(cm$byClass[5]), cex=1.2, font=2)
  text(50, 70, round(as.numeric(cm$byClass[5]), 3), cex=1.2)
  text(70, 85, names(cm$byClass[6]), cex=1.2, font=2)
  text(70, 70, round(as.numeric(cm$byClass[6]), 3), cex=1.2)
  text(90, 85, names(cm$byClass[7]), cex=1.2, font=2)
  text(90, 70, round(as.numeric(cm$byClass[7]), 3), cex=1.2)

  # add in the accuracy information 
  text(30, 35, names(cm$overall[1]), cex=1.5, font=2)
  text(30, 20, round(as.numeric(cm$overall[1]), 3), cex=1.4)
  text(70, 35, names(cm$overall[2]), cex=1.5, font=2)
  text(70, 20, round(as.numeric(cm$overall[2]), 3), cex=1.4)
}  

# Error: [conflicted] `layout` found in 2 packages. 
# Either pick the one you want with `::` * plotly::layout * graphics::layout Or declare a preference with
# `conflict_prefer()` * conflict_prefer("layout", "plotly") * conflict_prefer("layout", "graphics")
conflict_prefer("layout", "plotly")

Test CBA Model

Classification Based On Association Rules - a classifier based on association rules mined for an input data set. CBA demonstrated good accuracy at 88%, and only misclassified six fatalities. Test results with caret confusion matrix can be viewed below.

# test CBA

predict.cba <- predict(fit.cba.4,cba_test)
#cm <- table(predicted = predict.cba, true = cba_test$outcome)
cm  <- confusionMatrix(predict.cba, cba_test$outcome)
cm

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Deceased Survived
##   Deceased       36        6
##   Survived        3       30
##                                         
##                Accuracy : 0.88          
##                  95% CI : (0.784, 0.944)
##     No Information Rate : 0.52          
##     P-Value [Acc > NIR] : 3.57e-11      
##                                         
##                   Kappa : 0.759         
##                                         
##  Mcnemar's Test P-Value : 0.505         
##                                         
##             Sensitivity : 0.923         
##             Specificity : 0.833         
##          Pos Pred Value : 0.857         
##          Neg Pred Value : 0.909         
##              Prevalence : 0.520         
##          Detection Rate : 0.480         
##    Detection Prevalence : 0.560         
##       Balanced Accuracy : 0.878         
##                                         
##        'Positive' Class : Deceased      
##

draw_confusion_matrix(cm)

Conclusions

Fortunately, a very small percentage of people who contract Covid-19 die from it. However, just the possibility of contracting a virus, which can cause fatality, has sent fears throughout the global community. Some citizens are highly susceptible to the virus and there is data that points to pre-existing health conditions as a major factor. Biomarkers are measured to help indicate phenomenons such as disease, infection, or environmental exposures. The biomarkers collected from Covid-19 patients have shown to be very useful in predicting mortality rates with very high accuracy.

Several machine learning models were tested against the Covid-19 biomarker data, and all the models did very well. We can conclude the biomarker data lends itself well to the use case of predicting Covid-19 fatalities. This was especially true when combined with the principal component analysis. Ultimately, the gradient boosting machines(gbm) and K Nearest Neighbors had the best accuracy at 96%, but other modeling algorithms were not far behind. In fact, none of the methods tested had accuracy below 88%. Even the classification by association (CBA) rules method achieved 88% accuracy using only four of seventy four biomarkers. The four biomarkers were informed by a combination of earlier analysis in the project such as principal component analysis, k-means clustering, important feature results, and correlation analysis.

To improve upon the traditional machine learning results, deep learning (DL) was employed. The Keras interface to R has presented new possibilities for R data scientists, and was used in this study. The method was proven to be very useful with this use case as it yielded the 97% accuracy with a sequential model. Although deep learning is harder to explain than the traditional machine learning (ML) models, it is still worthy of deployment consideration.

In summary, this study developed several classification models, which have been shown to predict mortality rates with accuracy rates as high 97%. These results are promising, and if used as early detection mechanisms, for patients with Covid-19, could assist in reducing case fatality rates (CFR) now or in the future. It is paramount models such as these be implementing within a hospital setting and used in conjunction with high frequency testing of the most predictive biomarkers. As part of a solution, machine learning may give doctors the edge they need to save more lives.

References

Yan, L., Zhang, H., Goncalves, J. et al. An interpretable mortality prediction model for COVID-19 patients. Nat Mach Intell 2, 283–288 (2020). https://doi.org/10.1038/s42256-020-0180-7

Covid-19 CFR Predictions: Caret ML, Keras, CBA

Kevin Pollard

5/29/2020

Overview

Motivation

About Model Data

Biomarker definitions:

Data Loading/Cleaning

Fixed Biomarker Names

Create Patient Master

Agregate Data

Biomarker Stats

EDA

Age and Gender Analysis

Age Histogram

Age Discretized

Age outliers

Gender Analysis

Correlation Analysis

Kmeans Clustering

Cluster Build

Elbow Plot Result

Kmeans Results

Principal Component Analysis

Data Prep

Plots

Contribution Analysis

Caret Machine Learning

Classification Models

Checking for Imbalance

Data Splitting

Model Controls

Model Tuning

Model Build

Decision Tree

Gradient Boosting Machine

Random Forest

Support Vector Machines

Naive Bayes

K Nearest Neighbor

Model Fit Results

Model Prediction Results

Gradient Boosting Machines

Decision Tree

SVM

Random Forest

K Nearest Neighbors

Naive Bayes

Keras Deep Learning

Data Splits

Keras Data Prep

Features

Labels

Hypertuning

Defining the model

Fit Summary

Fit Compile

Training and Evaluation

Plot

Testing

Evaluate on test

Confusion Matrix Test

Evaluate on hold out

Confusion Matrix Holdout

ML Classification by Arules

Data Loading

Anonymized Patient Data

Model Data View

View Discretized Data

Apriori Analysis

Create Arules

Inspect Rules

Plot Strong Rules

Rules sorted by support

Rules sorted by confidence

ArulViz Visualizations

Plotting Top Rules

Plot Rules

Plot Confidence

Arules CBA Model