library(ggthemes)
col34="dodgerblue4"
col2="blue"
col1="slateblue"
col="#E64141" #Covid Red #E64141
# caption used for certain the charts
the_caption <- "Covid-19"
theme_2 <-
theme_gray() +
theme(
axis.title.y = element_blank(),
legend.title = element_blank(),
legend.position = "top")
theme_1 <- function (base_size = 11, base_family = "") {
ggplot2::theme_set(theme_classic(base_size = 16))
suppressWarnings(
theme_update(
axis.text.x = element_text(size = 16)
, axis.text.y = element_text(size = 16)
, axis.title.x = element_text(size = 16)
, axis.title.y = element_text(size = 16)
, legend.title = element_text(size = 16)
, legend.text = element_text(size = 14)
, panel.grid.major = element_blank()
, panel.grid.minor = element_blank()
, strip.background = element_blank()
, panel.margin = unit(0, "lines")
, legend.key.size = unit(.55, "cm")
, legend.key = element_rect(fill = "white")
, panel.margin.y = unit(0.5, "lines")
, panel.border = element_rect(
colour = "black", fill = NA, size = 1)
, strip.text.x = element_text(
size = 16, colour = "blue", face = "bold")
)
)
}
#
# #set global ggtheme
# gbl_ggtheme <- gbl_ggtheme
gbl_ggtheme <- theme_1() #theme_classic(base_size = 16)
theme_set(gbl_ggtheme)
# theme_set(theme_classic(base_size = 16))
# #,panel.background = element_rect(fill = "#E64141") #Covid Red #E64141
#https://www.rapidtables.com/web/color/RGB_Color.html
The sudden increase of COVID-19 cases has placed high pressure on our health-care services worldwide. It is vital that fast, accurate and early clinical assessments are made in order to reduce the disease severity. To support decision making and logistical planning in health care systems, this study leverages the data of a prior study.
In the prior study, blood samples from 485 infected patients in the region of Wuhan, China were used to study biomarkers of Covid-19 patients. The new study here will include 375 samples of the aforementioned 485 patients. Machine learning tools will be used to identify biomarkers that can be used to predict mortality of the 375 individual patients sampled. The outcomes of all the patients is known, and indicted in the data.
The analysis to follow will attempt to improve upon the accuracy of the fatality predictions in the original study by using enhanced machine learning and deep learning tools associated with the language R. As part of the analysis, a comprehensive look at the R machine learning models will be explored first to find the parsimonious model, then a deep learning approach will be deployed to determine if the accuracy results can be improved.
Patients can potentially survive Covid-19 if the biomarkers are measured early and analyzed throughout the patients’ stay. Machine learning models may be the best way to detect signals within key biomarkers, and thereby act as an early warning mechanism. Patients can have their care augmented with specific treatments informed by changes in the biomarkers, which predict their mortality. This analysis seeks to find what R modeling solution yields the highest accuracy in an attempt to demonstrate a way to help save more lives when faced with epidemics such as Covid-19.
#functions
printVecInfo <- function(inputVector,param="all"){
v <- inputVector
print(paste("mean: ", mean(v,na.rm=T) ))
print(paste("median: ", median(v,na.rm=T) ))
print(paste("min: ", min(v,na.rm=T) ))
print(paste("max: ", max(v,na.rm=T) ))
print(paste("range: ", max(v,na.rm=T)-min(v,na.rm=T) ))
print(paste("sd: ", sd(v,na.rm=T)))
print(paste("quantile: ", quantile(v,na.rm=T) ))
print(paste("IQR: ",quantile(v,na.rm=T)[['75%']]-quantile(v,na.rm=T)[['25%']])) # Interquartile range
}
# gets the percentages for each column that has NA's
NASummaryInfoPct <- function(indf){
a<- gather(indf %>% summarise_all(~(sum(is.na(.))/n())), key = "columns", value = "percent_missing")
return(a)
}
# gets the percentages for each column that has NA's
NASummaryInfoCnt <- function(indf){
b<- gather(indf %>% summarise_all(~(sum(is.na(.)))), key = "columns", value = "count_missing")
return(b)
}
#Return multiple data frames in one object using the S4 class system.
setClass(
"mdl",
representation(
trn = "vector",
tst = "vector",
hld = "vector"
)
)
require(caret)
modelDataC <- function (df1 = df,trnpct=.8) {
#take a 10% holdout
z <- df1
data<- createDataPartition(y=z$label, p=.90, list=FALSE)
df1 <-z[data,] #set to 90% of its original rows
tst2 <- z[-data,]
z <- df1
data<- createDataPartition(y=z$label, p=trnpct, list=FALSE)
trn <-z[data,]
tst <- z[-data,] #balance what is left over after train @.80
mdl <- new("mdl",
trn = trn,
tst = tst,
hld = tst2 )
return(mdl)
}
#Return multiple data frames in one object using the S4 class system.
setClass(
"mdl",
representation(
trn = "vector",
tst = "vector",
hld = "vector"
)
)
modelData2 <- function (df1 = df_ww) {
z <- df1
# get number of rows in the data frame
len <- as.numeric(dim(z)[1])
randIndex <- sample(as.numeric(rownames(z)[1:len]))
# setting cutpoint to 2/3 of the input data frame
cutPoint80 <- floor(.8 * len)
# setting cutpoint to 1/3 of the input data frame
cutPoint20 <- floor(.2 * len)
trn2 <- z[randIndex[1:cutPoint80],]
tst2 <- z[randIndex[cutPoint80+1:c(len-cutPoint80)],]
hld2 <- z[-randIndex[1: cutPoint80],]
mdl <- new("mdl",
trn = trn2,
tst = tst2,
hld = hld2 )
return(mdl)
}
The data tracks several patients over their hospital stay, and keeps records on the patients 74 bio markers. The admission time, discharge time and outcome make a unique patient.
The original data source had already been anonymized to comply with HIPAA, PCI and PII data sensitivity laws.
Dependent variable is outcome with 0 death and 1 survived
Data set dimensions are 6120 rows by 81 columns
There are 74 biomarker features, and some of the most important one are listed below.
The most important bio markers for the analysis are listed with brief definitions. The three primary categories of BIO markers fall into cholesterol, inflammation, and clotting.
percent_lymphocyte: Lymphocytes are a type of white blood cell, consisting of both B cells and T cells. B cells are responsible for the creation of antibodies, while T cells are responsible for controlling immune response to foreign substances. This marker indicates the percent of lymphocytes present relative to a patient’s white blood cell count.
lactate_dehydrogenase: LDH is an enzyme released when there is tissue damage present in a patient’s body tissues. This marker measures the amount of LDH in blood or other body fluids and is vital in monitoring viral conditions or cancer.
lymphocyte_count lymphocytes are a type of white blood cell, consisting of both B cells and T cells. B cells are responsible for the creation of antibodies, while T cells are responsible for controlling immune response to foreign substances. This marker indicates the count of lymphocytes present in a patient’s blood count.
albumin: Albumin is a protein created by your liver. This protein is primarily responsible for maintaining and keeping fluid levels in your bloodstream so that it does not leak into other tissues. This protein also helps carry hormones, vitamins, and enzymes throughout the body. Higher levels can indicate a patient fighting infections. Low levels can indicate problems with liver or inflammatory diseases.
neutrophils_percent: Neutrophils are another type of white blood cell that makes up 55 to 70 percent of the white blood cell count. They consist of the majority of cells that make up the immune system response. Neutrophils help fight antigens and are not limited to a specific area of circulation. This marker shows the percent of neutrophils present relative to a patient’s white blood cell count.
eosinophils_percent: Eosinophils are another disease-fighting white blood cell. The cell is primarily responsible for destroying antigens, particularly viruses. They also have a role in the inflammatory response. This marker shows the percent of eosinophils present relative to a patient’s white blood cell count.
eosinophil_count: Eosinophils are another disease-fighting white blood cell. The cell is primarily responsible for destroying antigens, particularly viruses. They also have a role in the inflammatory response. This marker shows the count of eosinophils present in a patient’s blood count.
mean_corpuscular_hemoglobin_concentration: MCHC is a measure of the concentration of hemoglobin present in a given volume of packed red blood cell.
x2019_n_co_v_nucleic_acid_detection: This marker is a specific test result from a detection test responsible for identifying the presence of nucleic acid from SARS-CoV-2 in humans. This test shows the presence or absence of viral infection by SARS-CoV-2.
basophil_percent: Basophil is naturally produced by several types of white blood cells. Basophil contains heparin which naturally prevents blood clotting, while also releasing histamine during allergic reactions. It is also responsible for helping the body produce an antibody called immunoglobulin E, which help mediate inflammatory responses in the body. This marker shows the percent of basophil present relative to a patient’s white blood cell count.
total_cholesterol: Cholesterol is a waxy substance found in blood that assists in building healthy cells. High levels of cholesterol can coat arteries and restrict blood flow. Total cholesterol is measured as a count in a patient’s blood panel.
procalcitonin: Procalcitonin is a protein produced by many cells in the body in response to both bacterial infections and tissue injury. The count of procalcitonin can increase based on systemic infections and sepsis.
thrombin_time: Thrombin time or thrombin clotting time is a blood test that measures the time It takes for a clot to form in the plasma of a blood sample. This assists in helping doctors investigate excessive bleeding or abnormal blood clot formation.
red_blood_cell_count: Red blood cell count is a test conducted to find out how many red blood cells are present in your blood. It is also known as an erythrocyte count. RBC’s contain hemoglobin, which is responsible for carrying oxygen to body tissues. This count is vital in understanding how tissues are functioning and healing in response to viral infections.
indirect_bilirubin: Bilirubin tests measure the amount of bilirubin in blood, an orange-yellow pigment that forms when red blood cells break down. Indirect bilirubin is specifically bound to albumin. High levels of bilirubin indicate that the body is destroying too many red blood cells, which can also indicate liver or tissue damage.
h_bs_ag: HBsAg is the antigen of the hepatitis B virus. Presence of this indicates current HBV infection.
hco3: Bicarbonate is a byproduct of the body’s metabolic function. Blood contains bicarbonate which is brought to the lungs and exhaled as carbon dioxide. Kidneys excrete and reabsorb bicarbonate and maintain appropriate acid balance in the body.
platelet_count: Platelets are small fragments of cells present for creating normal and protective blood clots. The platelet count test determined the number of platelets present in a sample of blood.
serum_sodium: Sodium tests are conducted to see how much sodium is present in the blood. Sodium is an essential mineral in the body that is important for healthy nerve and muscle function.
hemoglobin: Hemoglobin is a red blood cell protein responsible for transporting oxygen in the body’s blood. A low count of hemoglobin can indicate deficiencies and anemia, which can hinder healing and recovery in the body from virus or infection.
lactate_dehydrogenase: LDH is an enzyme released when there is tissue damage present in a patient’s body tissues. This marker measures the amount of LDH in blood or other body fluids and is vital in monitoring viral conditions or cancer.
monocytes_percent: Monocytes are another type of white blood cell. They assist in fighting bacteria, viruses, and other infections. They are a key component of white blood cell immune response. High levels indicate the body fighting infection.
international_standard_ratio International Normalized Ratio is a preferred test for patients to assess the risk of bleeding or coagulation status in patients. This ratio shows how well blood clots, and if there are any abnormalities present.
percent_lymphocyte: Lymphocytes are a type of white blood cell, consisting of both B cells and T cells. B cells are responsible for the creation of antibodies, while T cells are responsible for controlling immune response to foreign substances. This marker indicates the percent of lymphocytes present relative to a patient’s white blood cell count.
eosinophil_count: Eosinophils are another disease-fighting white blood cell. The cell is primarily responsible for destroying antigens, particularly viruses. They also have a role in inflammatory response. This marker shows the count of eosinophils present in a patient’s blood count.
eosinophils_percent: Eosinophils are another disease-fighting white blood cell. The cell is primarily responsible for destroying antigens, particularly viruses. They also have a role in inflammatory response. This marker shows the percent of eosinophils present relative to a patient’s white blood cell count.
neutrophils_percent: Neutrophils are another type of white blood cell, that makes up 55 to 70 percent of the white blood cell count. They consist of the majority of cells that make up the immune system response. Neutrophils help fight antigens and are not limited to a specific area of circulation. This marker shows the percent of neutrophils present relative to a patient’s white blood cell count.
amino_terminal_brain_natriuretic_peptide_precursor_nt_pro_bnp: Natriuretic peptide tests measure the presence of brain natriuretic peptide and B-terminal pro b-type natriuretic peptides present. These are substances created by the heart. Small levels are normally found in the bloodstream. High levels of BNP, NT-proBNP indicate that the heart is failing to pump the appropriate amount of blood needed.
thrombocytocrit: This device measures the platelet content of the blood. Platelets are small fragments of cells present for creating normal and protective blood clots. The platelet count test determined the number of platelets present in a sample of blood.
hypersensitive_cardiac_troponin_i: This is the preferred biomarker for acute myocardial infarction. This helps diagnose acute coronary syndromes in those who are critically ill.
fibrin_degradation_products: Fibrin degradation products are components of the blood produced by clot degeneration. This substance remains in your bloodstream after your body dissolves a blood clot, which in return regulates and manages clot dissolving. This specific test determines the amount of FDPs present in blood to ensure that the body is healing correctly, particularly in response to clots and blood illnesses.
procalcitonin: Procalcitonin is a protein produced by many cells in the body in response to both bacterial infections and tissue injury. The count of procalcitonin can increase based on systemic infections and sepsis.
albumin: Albumin is a protein created by your liver. This protein is primarily responsible for maintaining and keeping fluid levels in your bloodstream so that it does not leak into other tissues. This protein also helps carry hormones, vitamins, and enzymes throughout the body. Higher levels can indicate a patient fighting infections. Low levels can indicate problems with liver or inflammatory diseases
prothrombin_activity: Prothrombin time is a test used to help detect and diagnose a bleeding or excessivee clotting disorder. This helps doctors evaluate a patient’s ability to appropriately form blood clots.
d_d_dimer: A D-dimer test is used to help rule out the presence of serious life-threatening blood clots. High levels of D-dimer, the protein leftover from clotting, can indicate deep vein thrombosis and other clotting disorders.
aspartate_aminotransferase: Aspartate aminotransferase is an enzyme created by the liver but found in many cells throughout the body. High levels of AST are found in the body when liver damage is present, as the liver releases AST in response to distress of the liver.
neutrophils_count: Neutrophils are another type of white blood cell, that makes up 55 to 70 percent of the white blood cell count. They consist of the majority of cells that make up the immune system response. Neutrophils help fight antigens and are not limited to a specific area of circulation. This marker shows the count of neutrophils present in a patient’s blood cell count.
direct_bilirubin: Bilirubin tests measure the amount of bilirubin in blood, an orange-yellow pigment that forms when red blood cells break down. Direct bilirubin specifically travels from the liver to the small intestines. High levels of bilirubin indicate that the body is destroying too many red blood cells, which can also indicate liver or tissue damage.
thrombin_time: Thrombin time or thrombin clotting time is a blood test that measures the time it takes for a clot to form in the plasma of a blood sample. This assists in helping doctors investigate excessive bleeding or abnormal blood clot formation.
high_sensitivity_c_reactive_protein: High sensitivity C-reactive protein is a biomarker used in determining coronary heart disease. The protein itself is produced when there is damage or inflammation in blood vessels, particularly in the heart.
The original data dimensions and format can be viewed below. There is a lot of NAs seen within the patient columns, but was found to be due to not repeating the patient id for each day of the patients stay. The biomaker festures have plenty of NA since not all markers were captured for every patient during every time period. Additionally, the biomarker features can be quite long and they are largely non-conforming R column names.
#Use janitory to clean the column names
require(janitor)
#https://medium.com/@verajosemanuel/janitor-a-good-r-package-for-data-cleaning-f3c733632ad9
raw <- read_csv('./Data/covid19_bio_mdl.csv')
## Parsed with column specification:
## cols(
## .default = col_double(),
## RE_DATE = col_datetime(format = ""),
## `Admission time` = col_datetime(format = ""),
## `Discharge time` = col_datetime(format = "")
## )
## See spec(...) for full column specifications.
dim(raw)
## [1] 6120 81
#glimpse(raw)
head(raw[,1:7]) %>% kable() %>% kable_styling()
| PATIENT_ID | RE_DATE | age | gender | Admission time | Discharge time | outcome |
|---|---|---|---|---|---|---|
| 1 | 2020-01-31 01:09:00 | 73 | 1 | 2020-01-30 22:12:47 | 2020-02-17 12:40:09 | 0 |
| NA | 2020-01-31 01:25:00 | 73 | 1 | 2020-01-30 22:12:47 | 2020-02-17 12:40:09 | 0 |
| NA | 2020-01-31 01:44:00 | 73 | 1 | 2020-01-30 22:12:47 | 2020-02-17 12:40:09 | 0 |
| NA | 2020-01-31 01:45:00 | 73 | 1 | 2020-01-30 22:12:47 | 2020-02-17 12:40:09 | 0 |
| NA | 2020-01-31 01:56:00 | 73 | 1 | 2020-01-30 22:12:47 | 2020-02-17 12:40:09 | 0 |
| NA | 2020-01-31 01:59:00 | 73 | 1 | 2020-01-30 22:12:47 | 2020-02-17 12:40:09 | 0 |
head(raw[,8:15]) %>% kable() %>% kable_styling()
| Hypersensitive cardiac troponinI | hemoglobin | Serum chloride | Prothrombin time | procalcitonin | eosinophils(%) | Interleukin 2 receptor | Alkaline phosphatase |
|---|---|---|---|---|---|---|---|
| NA | NA | NA | NA | NA | NA | NA | NA |
| NA | 136 | NA | NA | NA | 0.6 | NA | NA |
| NA | NA | 103.1 | NA | NA | NA | NA | 46 |
| NA | NA | NA | 13.9 | NA | NA | NA | NA |
| 19.9 | NA | NA | NA | 0.09 | NA | NA | NA |
| NA | NA | NA | NA | NA | NA | NA | NA |
#The names are dirty, so we're using janitor to clean them. The original names are stored for reference.
raw_names <- names(raw)
#raw_names # we need these original columns for reference
df_raw <- raw
df <- df_raw %>% clean_names()
df_clean_names <- names(df)
#df_clean_names
#make names for good measure to ensure janitor names a legal R columns names
colnames(df) <- make.names(colnames(df))
#df_clean_names
#dim(df)
#glimpse(raw)
#glimpse(df)
The data dimensions and a sample of the fixed data is seen below here.
#dim(df)
head(df[,8:15]) %>% kable() %>% kable_styling()
| hypersensitive_cardiac_troponin_i | hemoglobin | serum_chloride | prothrombin_time | procalcitonin | eosinophils_percent | interleukin_2_receptor | alkaline_phosphatase |
|---|---|---|---|---|---|---|---|
| NA | NA | NA | NA | NA | NA | NA | NA |
| NA | 136 | NA | NA | NA | 0.6 | NA | NA |
| NA | NA | 103.1 | NA | NA | NA | NA | 46 |
| NA | NA | NA | 13.9 | NA | NA | NA | NA |
| 19.9 | NA | NA | NA | 0.09 | NA | NA | NA |
| NA | NA | NA | NA | NA | NA | NA | NA |
#tail(df) %>% kable() %>% kable_styling()
Each patient is master contains the three columns,patient id plus the admission and discharge time, revealing 375 patients in the sample.
df_patient_master_raw <- df %>% select(patient_id,admission_time,discharge_time)
#dim(df_patient_master_raw)
# #unique ID before patient compression
# library(data.table)
# df_patient_master_raw2 <- setDT(df_patient_master_raw)[, id := paste0('',.I)]
# head(df_patient_master_raw2)
# dim(df_patient_master_raw2)
data <- df_patient_master_raw
data_subset <- data[ , c("patient_id")]
df_patient_master <- data[complete.cases(data_subset), ] # Omit NAs by columns
head(df_patient_master)
## # A tibble: 6 x 3
## patient_id admission_time discharge_time
## <dbl> <dttm> <dttm>
## 1 1 2020-01-30 22:12:47 2020-02-17 12:40:09
## 2 2 2020-02-04 21:39:03 2020-02-19 12:59:01
## 3 3 2020-01-23 10:59:36 2020-02-08 17:52:31
## 4 4 2020-01-31 23:03:59 2020-02-18 12:59:12
## 5 5 2020-02-01 20:59:54 2020-02-18 10:33:06
## 6 6 2020-01-24 10:47:10 2020-02-07 09:06:58
dim(df_patient_master)
## [1] 375 3
# Print data_by_column to RStudio console
#write.csv(data,file="covid_19_patient_master.csv")
#Create df with corrected patient_id label. The data is still a sparse data frame at this point with the sparse data being NA (as opposed to 0).
#Dimensions are still 6120,81 here.
library(dplyr)
df_raw2 <- inner_join(df_patient_master, df, by = c("admission_time" = "admission_time","discharge_time" ="discharge_time"))
#drop bad patient_id
df_raw2$patient_id.y <- NULL
#rename new patient_id
names(df_raw2)[1] <- c("patient_id")
#head(df_raw2) #%>% regulartable() %>% theme_zebra() %>% autofit()
#dim(df_raw2)
#head(df_raw2[1,c(1:5)])
Next we aggregate the data to the patient master level of granularity. As part of the aggregation we take the “max” of all biomarkers while grouping by patient master data. This approach leaves the highest bio-marker measurements for each patient. Additionally, the NA and infinite numbers are set to zero.
Note: This reduces the dataset to the number of patients, which is 375. Dimension is now 375,80 here as we lthe time the bio marker was taken. We aren’t interested in the time component from this analysis. The the objective here is only determine if we can predict mortality with the biomarkers. The time component does become important later as time is of the essence when once is admitted with Covid-19 testing positive.
New/Changed Features: * duration_days - is the differences in days the patient was admitted to discharged * factorized - patient_id,gender,age,outcome
#install.packages("hablar")
library(hablar)
require(hablar)
require(lubridate)
#summarise numerics by patient
by_patient <- df_raw2 %>%
group_by(patient_id, age, gender, outcome,admission_time,discharge_time)
#create df_clean
by_patient %>%
summarise_if(is.numeric, max, na.rm = TRUE) %>% ## Clean infinite and NA
mutate_if(is.numeric, ~replace(., is.infinite(.), 0)) %>%
mutate_if(is.numeric, ~replace(., is.na(.), 0)) -> df_clean
#write.csv(df_clean,file="biomarkers_max.csv")
#write.csv(df_clean,file="covid19_biomarkers_max.csv")
#write.csv(df_clean,file="df_clean.csv")
#as_datetime(x, tz = "UTC")
head(df_clean[,c(2:4,7:9)])
## # A tibble: 6 x 6
## # Groups: age, gender, outcome [6]
## age gender outcome hypersensitive_cardiac_troponi~ hemoglobin serum_chloride
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 73 1 0 19.9 140 103.
## 2 61 1 0 16.9 151 100.
## 3 70 2 0 0 126 103.
## 4 74 1 0 4.8 110 103.
## 5 29 2 0 5.6 134 102.
## 6 81 2 0 19.7 108 106.
#dim(df_clean)
The biomarkers with relatively high standard deviations and/or wide ranges are candidates for a closer look. For example, “hypersensitive_cardiac_troponin_i” falls into the category, which should be further studied.
#library(tigerstats)
#inspect(df_clean[,c(2:4,7:80)]) #%>% kable() %>% kable_styling()
#densityplot(~outcome|gender,data=df_clean[,c(2:4,7:80)])
df_stat <- df_clean[,c(7:80)]
dim(df_stat)
## [1] 375 74
dfIntCols <- df_stat %>% select_if(., is.numeric)
#names(dfIntCols) %>% kable() %>% kable_styling()
#print('===============================================================')
#print('===============================================================')
statInfo <- function(a) {
for (i in colnames(a)){
x <- colnames(df_stat[i])
print('---------------------------------------------------------------------------------')
print(x)
#paste('Column: ', colnames(df_stat[i]), sep=',')
#print(colnames(df_stat[i]))
print('=================================================================================')
printVecInfo(df_stat[[i]])
}
}
statInfo(dfIntCols)
## [1] "---------------------------------------------------------------------------------"
## [1] "hypersensitive_cardiac_troponin_i"
## [1] "================================================================================="
## [1] "mean: 887.5552"
## [1] "median: 4.8"
## [1] "min: 0"
## [1] "max: 50000"
## [1] "range: 50000"
## [1] "sd: 4933.42370112608"
## [1] "quantile: 0" "quantile: 1.9" "quantile: 4.8" "quantile: 47.2"
## [5] "quantile: 50000"
## [1] "IQR: 45.3"
## [1] "---------------------------------------------------------------------------------"
## [1] "hemoglobin"
## [1] "================================================================================="
## [1] "mean: 124.914666666667"
## [1] "median: 130"
## [1] "min: 0"
## [1] "max: 178"
## [1] "range: 178"
## [1] "sd: 33.6386614429292"
## [1] "quantile: 0" "quantile: 118" "quantile: 130" "quantile: 143"
## [5] "quantile: 178"
## [1] "IQR: 25"
## [1] "---------------------------------------------------------------------------------"
## [1] "serum_chloride"
## [1] "================================================================================="
## [1] "mean: 98.8357333333333"
## [1] "median: 103.1"
## [1] "min: 0"
## [1] "max: 140.4"
## [1] "range: 140.4"
## [1] "sd: 25.2361054677626"
## [1] "quantile: 0" "quantile: 100.25" "quantile: 103.1"
## [4] "quantile: 106.2" "quantile: 140.4"
## [1] "IQR: 5.95"
## [1] "---------------------------------------------------------------------------------"
## [1] "prothrombin_time"
## [1] "================================================================================="
## [1] "mean: 15.6466666666667"
## [1] "median: 14.4"
## [1] "min: 0"
## [1] "max: 120"
## [1] "range: 120"
## [1] "sd: 9.1200714964205"
## [1] "quantile: 0" "quantile: 13.45" "quantile: 14.4" "quantile: 16.3"
## [5] "quantile: 120"
## [1] "IQR: 2.85"
## [1] "---------------------------------------------------------------------------------"
## [1] "procalcitonin"
## [1] "================================================================================="
## [1] "mean: 1.05938666666667"
## [1] "median: 0.07"
## [1] "min: 0"
## [1] "max: 57.17"
## [1] "range: 57.17"
## [1] "sd: 4.66756505327561"
## [1] "quantile: 0" "quantile: 0.02" "quantile: 0.07" "quantile: 0.33"
## [5] "quantile: 57.17"
## [1] "IQR: 0.31"
## [1] "---------------------------------------------------------------------------------"
## [1] "eosinophils_percent"
## [1] "================================================================================="
## [1] "mean: 1.0112"
## [1] "median: 0.5"
## [1] "min: 0"
## [1] "max: 8.6"
## [1] "range: 8.6"
## [1] "sd: 1.3383214005324"
## [1] "quantile: 0" "quantile: 0" "quantile: 0.5" "quantile: 1.6"
## [5] "quantile: 8.6"
## [1] "IQR: 1.6"
## [1] "---------------------------------------------------------------------------------"
## [1] "interleukin_2_receptor"
## [1] "================================================================================="
## [1] "mean: 559.413333333333"
## [1] "median: 381"
## [1] "min: 0"
## [1] "max: 7500"
## [1] "range: 7500"
## [1] "sd: 802.589480465136"
## [1] "quantile: 0" "quantile: 0" "quantile: 381" "quantile: 853"
## [5] "quantile: 7500"
## [1] "IQR: 853"
## [1] "---------------------------------------------------------------------------------"
## [1] "alkaline_phosphatase"
## [1] "================================================================================="
## [1] "mean: 88.1013333333333"
## [1] "median: 74"
## [1] "min: 0"
## [1] "max: 620"
## [1] "range: 620"
## [1] "sd: 59.0000940208925"
## [1] "quantile: 0" "quantile: 56" "quantile: 74" "quantile: 106.5"
## [5] "quantile: 620"
## [1] "IQR: 50.5"
## [1] "---------------------------------------------------------------------------------"
## [1] "albumin"
## [1] "================================================================================="
## [1] "mean: 33.28"
## [1] "median: 34.9"
## [1] "min: 0"
## [1] "max: 48.6"
## [1] "range: 48.6"
## [1] "sd: 9.32318138995196"
## [1] "quantile: 0" "quantile: 30.1" "quantile: 34.9" "quantile: 38.8"
## [5] "quantile: 48.6"
## [1] "IQR: 8.7"
## [1] "---------------------------------------------------------------------------------"
## [1] "basophil_percent"
## [1] "================================================================================="
## [1] "mean: 0.303733333333333"
## [1] "median: 0.2"
## [1] "min: 0"
## [1] "max: 1.7"
## [1] "range: 1.7"
## [1] "sd: 0.268660637839674"
## [1] "quantile: 0" "quantile: 0.1" "quantile: 0.2" "quantile: 0.4"
## [5] "quantile: 1.7"
## [1] "IQR: 0.3"
## [1] "---------------------------------------------------------------------------------"
## [1] "interleukin_10"
## [1] "================================================================================="
## [1] "mean: 9.06533333333333"
## [1] "median: 5"
## [1] "min: 0"
## [1] "max: 1000"
## [1] "range: 1000"
## [1] "sd: 52.4802083736788"
## [1] "quantile: 0" "quantile: 0" "quantile: 5" "quantile: 8"
## [5] "quantile: 1000"
## [1] "IQR: 8"
## [1] "---------------------------------------------------------------------------------"
## [1] "total_bilirubin"
## [1] "================================================================================="
## [1] "mean: 18.8773333333333"
## [1] "median: 12"
## [1] "min: 0"
## [1] "max: 505.7"
## [1] "range: 505.7"
## [1] "sd: 35.0329774867403"
## [1] "quantile: 0" "quantile: 8.05" "quantile: 12" "quantile: 18.4"
## [5] "quantile: 505.7"
## [1] "IQR: 10.35"
## [1] "---------------------------------------------------------------------------------"
## [1] "platelet_count"
## [1] "================================================================================="
## [1] "mean: 220.685333333333"
## [1] "median: 223"
## [1] "min: -1"
## [1] "max: 558"
## [1] "range: 559"
## [1] "sd: 116.340695393463"
## [1] "quantile: -1" "quantile: 144.5" "quantile: 223" "quantile: 287"
## [5] "quantile: 558"
## [1] "IQR: 142.5"
## [1] "---------------------------------------------------------------------------------"
## [1] "monocytes_percent"
## [1] "================================================================================="
## [1] "mean: 7.636"
## [1] "median: 7.8"
## [1] "min: 0"
## [1] "max: 53"
## [1] "range: 53"
## [1] "sd: 4.98565428636018"
## [1] "quantile: 0" "quantile: 4.2" "quantile: 7.8" "quantile: 10.15"
## [5] "quantile: 53"
## [1] "IQR: 5.95"
## [1] "---------------------------------------------------------------------------------"
## [1] "antithrombin"
## [1] "================================================================================="
## [1] "mean: 49.0186666666667"
## [1] "median: 68"
## [1] "min: 0"
## [1] "max: 136"
## [1] "range: 136"
## [1] "sd: 46.8639762605004"
## [1] "quantile: 0" "quantile: 0" "quantile: 68" "quantile: 91"
## [5] "quantile: 136"
## [1] "IQR: 91"
## [1] "---------------------------------------------------------------------------------"
## [1] "interleukin_8"
## [1] "================================================================================="
## [1] "mean: 57.0885333333333"
## [1] "median: 6.8"
## [1] "min: 0"
## [1] "max: 6795"
## [1] "range: 6795"
## [1] "sd: 469.091495985732"
## [1] "quantile: 0" "quantile: 0" "quantile: 6.8" "quantile: 21.75"
## [5] "quantile: 6795"
## [1] "IQR: 21.75"
## [1] "---------------------------------------------------------------------------------"
## [1] "indirect_bilirubin"
## [1] "================================================================================="
## [1] "mean: 8.27093333333333"
## [1] "median: 6.6"
## [1] "min: 0"
## [1] "max: 145.1"
## [1] "range: 145.1"
## [1] "sd: 9.56582502896893"
## [1] "quantile: 0" "quantile: 4.45" "quantile: 6.6" "quantile: 9.95"
## [5] "quantile: 145.1"
## [1] "IQR: 5.5"
## [1] "---------------------------------------------------------------------------------"
## [1] "red_blood_cell_distribution_width"
## [1] "================================================================================="
## [1] "mean: 12.4170666666667"
## [1] "median: 12.7"
## [1] "min: 0"
## [1] "max: 27.1"
## [1] "range: 27.1"
## [1] "sd: 3.76392628848777"
## [1] "quantile: 0" "quantile: 12" "quantile: 12.7" "quantile: 13.75"
## [5] "quantile: 27.1"
## [1] "IQR: 1.75"
## [1] "---------------------------------------------------------------------------------"
## [1] "neutrophils_percent"
## [1] "================================================================================="
## [1] "mean: 75.9504"
## [1] "median: 83.6"
## [1] "min: 0"
## [1] "max: 98.9"
## [1] "range: 98.9"
## [1] "sd: 23.1538055129739"
## [1] "quantile: 0" "quantile: 66.1" "quantile: 83.6" "quantile: 93.05"
## [5] "quantile: 98.9"
## [1] "IQR: 26.95"
## [1] "---------------------------------------------------------------------------------"
## [1] "total_protein"
## [1] "================================================================================="
## [1] "mean: 65.6357333333333"
## [1] "median: 68.9"
## [1] "min: 0"
## [1] "max: 88.7"
## [1] "range: 88.7"
## [1] "sd: 16.3493839663283"
## [1] "quantile: 0" "quantile: 64.15" "quantile: 68.9" "quantile: 73.1"
## [5] "quantile: 88.7"
## [1] "IQR: 8.94999999999999"
## [1] "---------------------------------------------------------------------------------"
## [1] "quantification_of_treponema_pallidum_antibodies"
## [1] "================================================================================="
## [1] "mean: 0.0977066666666667"
## [1] "median: 0.04"
## [1] "min: 0"
## [1] "max: 11.95"
## [1] "range: 11.95"
## [1] "sd: 0.666373030163863"
## [1] "quantile: 0" "quantile: 0" "quantile: 0.04" "quantile: 0.06"
## [5] "quantile: 11.95"
## [1] "IQR: 0.06"
## [1] "---------------------------------------------------------------------------------"
## [1] "prothrombin_activity"
## [1] "================================================================================="
## [1] "mean: 81.7146666666667"
## [1] "median: 89"
## [1] "min: 0"
## [1] "max: 142"
## [1] "range: 142"
## [1] "sd: 27.8699457341204"
## [1] "quantile: 0" "quantile: 71.5" "quantile: 89" "quantile: 98"
## [5] "quantile: 142"
## [1] "IQR: 26.5"
## [1] "---------------------------------------------------------------------------------"
## [1] "h_bs_ag"
## [1] "================================================================================="
## [1] "mean: 6.17981333333333"
## [1] "median: 0"
## [1] "min: 0"
## [1] "max: 250"
## [1] "range: 250"
## [1] "sd: 37.1898486338532"
## [1] "quantile: 0" "quantile: 0" "quantile: 0" "quantile: 0.01"
## [5] "quantile: 250"
## [1] "IQR: 0.01"
## [1] "---------------------------------------------------------------------------------"
## [1] "mean_corpuscular_volume"
## [1] "================================================================================="
## [1] "mean: 86.4445333333333"
## [1] "median: 90.3"
## [1] "min: 0"
## [1] "max: 118.9"
## [1] "range: 118.9"
## [1] "sd: 21.0188320381879"
## [1] "quantile: 0" "quantile: 87" "quantile: 90.3" "quantile: 94.3"
## [5] "quantile: 118.9"
## [1] "IQR: 7.3"
## [1] "---------------------------------------------------------------------------------"
## [1] "hematocrit"
## [1] "================================================================================="
## [1] "mean: 36.4373333333333"
## [1] "median: 38.1"
## [1] "min: 0"
## [1] "max: 52.3"
## [1] "range: 52.3"
## [1] "sd: 9.81007848738728"
## [1] "quantile: 0" "quantile: 34.6" "quantile: 38.1" "quantile: 41.3"
## [5] "quantile: 52.3"
## [1] "IQR: 6.7"
## [1] "---------------------------------------------------------------------------------"
## [1] "white_blood_cell_count"
## [1] "================================================================================="
## [1] "mean: 25.94064"
## [1] "median: 10.33"
## [1] "min: 0"
## [1] "max: 1726.6"
## [1] "range: 1726.6"
## [1] "sd: 106.41926472905"
## [1] "quantile: 0" "quantile: 5.935" "quantile: 10.33"
## [4] "quantile: 16.815" "quantile: 1726.6"
## [1] "IQR: 10.88"
## [1] "---------------------------------------------------------------------------------"
## [1] "tumor_necrosis_factor"
## [1] "================================================================================="
## [1] "mean: 7.09466666666667"
## [1] "median: 5.9"
## [1] "min: 0"
## [1] "max: 168"
## [1] "range: 168"
## [1] "sd: 12.4087933860502"
## [1] "quantile: 0" "quantile: 0" "quantile: 5.9" "quantile: 9.45"
## [5] "quantile: 168"
## [1] "IQR: 9.45"
## [1] "---------------------------------------------------------------------------------"
## [1] "mean_corpuscular_hemoglobin_concentration"
## [1] "================================================================================="
## [1] "mean: 331.650666666667"
## [1] "median: 347"
## [1] "min: 0"
## [1] "max: 514"
## [1] "range: 514"
## [1] "sd: 79.1725800136951"
## [1] "quantile: 0" "quantile: 338" "quantile: 347" "quantile: 357"
## [5] "quantile: 514"
## [1] "IQR: 19"
## [1] "---------------------------------------------------------------------------------"
## [1] "fibrinogen"
## [1] "================================================================================="
## [1] "mean: 3.93029333333333"
## [1] "median: 4.4"
## [1] "min: 0"
## [1] "max: 10.78"
## [1] "range: 10.78"
## [1] "sd: 2.51909969685788"
## [1] "quantile: 0" "quantile: 2.28" "quantile: 4.4" "quantile: 5.72"
## [5] "quantile: 10.78"
## [1] "IQR: 3.44"
## [1] "---------------------------------------------------------------------------------"
## [1] "interleukin_1"
## [1] "================================================================================="
## [1] "mean: 3.90773333333333"
## [1] "median: 5"
## [1] "min: 0"
## [1] "max: 88.5"
## [1] "range: 88.5"
## [1] "sd: 6.75831189039416"
## [1] "quantile: 0" "quantile: 0" "quantile: 5" "quantile: 5"
## [5] "quantile: 88.5"
## [1] "IQR: 5"
## [1] "---------------------------------------------------------------------------------"
## [1] "urea"
## [1] "================================================================================="
## [1] "mean: 9.93402666666667"
## [1] "median: 5.6"
## [1] "min: 0"
## [1] "max: 68.4"
## [1] "range: 68.4"
## [1] "sd: 10.7461847998858"
## [1] "quantile: 0" "quantile: 4" "quantile: 5.6" "quantile: 11.45"
## [5] "quantile: 68.4"
## [1] "IQR: 7.45"
## [1] "---------------------------------------------------------------------------------"
## [1] "lymphocyte_count"
## [1] "================================================================================="
## [1] "mean: 1.28085333333333"
## [1] "median: 1.07"
## [1] "min: 0"
## [1] "max: 52.42"
## [1] "range: 52.42"
## [1] "sd: 2.73636157533041"
## [1] "quantile: 0" "quantile: 0.64" "quantile: 1.07" "quantile: 1.585"
## [5] "quantile: 52.42"
## [1] "IQR: 0.945"
## [1] "---------------------------------------------------------------------------------"
## [1] "ph_value"
## [1] "================================================================================="
## [1] "mean: 4.109904"
## [1] "median: 6"
## [1] "min: 0"
## [1] "max: 7.565"
## [1] "range: 7.565"
## [1] "sd: 3.28009748312161"
## [1] "quantile: 0" "quantile: 0" "quantile: 6" "quantile: 7"
## [5] "quantile: 7.565"
## [1] "IQR: 7"
## [1] "---------------------------------------------------------------------------------"
## [1] "red_blood_cell_count"
## [1] "================================================================================="
## [1] "mean: 19.2459466666667"
## [1] "median: 4.61"
## [1] "min: 0"
## [1] "max: 749.5"
## [1] "range: 749.5"
## [1] "sd: 61.626490493824"
## [1] "quantile: 0" "quantile: 3.98" "quantile: 4.61" "quantile: 6.54"
## [5] "quantile: 749.5"
## [1] "IQR: 2.56"
## [1] "---------------------------------------------------------------------------------"
## [1] "eosinophil_count"
## [1] "================================================================================="
## [1] "mean: 0.0617333333333333"
## [1] "median: 0.04"
## [1] "min: 0"
## [1] "max: 0.49"
## [1] "range: 0.49"
## [1] "sd: 0.0754199177463744"
## [1] "quantile: 0" "quantile: 0" "quantile: 0.04" "quantile: 0.1"
## [5] "quantile: 0.49"
## [1] "IQR: 0.1"
## [1] "---------------------------------------------------------------------------------"
## [1] "corrected_calcium"
## [1] "================================================================================="
## [1] "mean: 2.25064"
## [1] "median: 2.4"
## [1] "min: 0"
## [1] "max: 2.79"
## [1] "range: 2.79"
## [1] "sd: 0.577511826857622"
## [1] "quantile: 0" "quantile: 2.27" "quantile: 2.4" "quantile: 2.47"
## [5] "quantile: 2.79"
## [1] "IQR: 0.2"
## [1] "---------------------------------------------------------------------------------"
## [1] "serum_potassium"
## [1] "================================================================================="
## [1] "mean: 4.50314666666667"
## [1] "median: 4.58"
## [1] "min: 0"
## [1] "max: 12.8"
## [1] "range: 12.8"
## [1] "sd: 1.44827657761779"
## [1] "quantile: 0" "quantile: 4.125" "quantile: 4.58" "quantile: 5.04"
## [5] "quantile: 12.8"
## [1] "IQR: 0.915"
## [1] "---------------------------------------------------------------------------------"
## [1] "glucose"
## [1] "================================================================================="
## [1] "mean: 9.37077333333333"
## [1] "median: 7.58"
## [1] "min: 0"
## [1] "max: 43.01"
## [1] "range: 43.01"
## [1] "sd: 6.40805264531818"
## [1] "quantile: 0" "quantile: 5.685" "quantile: 7.58"
## [4] "quantile: 11.135" "quantile: 43.01"
## [1] "IQR: 5.45"
## [1] "---------------------------------------------------------------------------------"
## [1] "neutrophils_count"
## [1] "================================================================================="
## [1] "mean: 8.80610666666667"
## [1] "median: 6.38"
## [1] "min: 0"
## [1] "max: 33.88"
## [1] "range: 33.88"
## [1] "sd: 7.21423001767688"
## [1] "quantile: 0" "quantile: 3.42" "quantile: 6.38" "quantile: 12.33"
## [5] "quantile: 33.88"
## [1] "IQR: 8.91"
## [1] "---------------------------------------------------------------------------------"
## [1] "direct_bilirubin"
## [1] "================================================================================="
## [1] "mean: 11.2413333333333"
## [1] "median: 5.3"
## [1] "min: 0"
## [1] "max: 360.6"
## [1] "range: 360.6"
## [1] "sd: 27.7644778118371"
## [1] "quantile: 0" "quantile: 3.5" "quantile: 5.3" "quantile: 8.65"
## [5] "quantile: 360.6"
## [1] "IQR: 5.15"
## [1] "---------------------------------------------------------------------------------"
## [1] "mean_platelet_volume"
## [1] "================================================================================="
## [1] "mean: 10.3717333333333"
## [1] "median: 11"
## [1] "min: 0"
## [1] "max: 15"
## [1] "range: 15"
## [1] "sd: 3.20872822580761"
## [1] "quantile: 0" "quantile: 10.2" "quantile: 11" "quantile: 11.9"
## [5] "quantile: 15"
## [1] "IQR: 1.7"
## [1] "---------------------------------------------------------------------------------"
## [1] "ferritin"
## [1] "================================================================================="
## [1] "mean: 883.942933333333"
## [1] "median: 147.3"
## [1] "min: 0"
## [1] "max: 50000"
## [1] "range: 50000"
## [1] "sd: 3052.68673704692"
## [1] "quantile: 0" "quantile: 0" "quantile: 147.3"
## [4] "quantile: 881.95" "quantile: 50000"
## [1] "IQR: 881.95"
## [1] "---------------------------------------------------------------------------------"
## [1] "rbc_distribution_width_sd"
## [1] "================================================================================="
## [1] "mean: 40.3882666666667"
## [1] "median: 41.1"
## [1] "min: 0"
## [1] "max: 113.3"
## [1] "range: 113.3"
## [1] "sd: 12.772592270079"
## [1] "quantile: 0" "quantile: 38.55" "quantile: 41.1" "quantile: 45.15"
## [5] "quantile: 113.3"
## [1] "IQR: 6.60000000000001"
## [1] "---------------------------------------------------------------------------------"
## [1] "thrombin_time"
## [1] "================================================================================="
## [1] "mean: 15.1773333333333"
## [1] "median: 16.5"
## [1] "min: 0"
## [1] "max: 161.9"
## [1] "range: 161.9"
## [1] "sd: 13.0846705153605"
## [1] "quantile: 0" "quantile: 15" "quantile: 16.5" "quantile: 18.3"
## [5] "quantile: 161.9"
## [1] "IQR: 3.3"
## [1] "---------------------------------------------------------------------------------"
## [1] "percent_lymphocyte"
## [1] "================================================================================="
## [1] "mean: 19.236"
## [1] "median: 17.7"
## [1] "min: 0"
## [1] "max: 60"
## [1] "range: 60"
## [1] "sd: 14.3091874304513"
## [1] "quantile: 0" "quantile: 6.1" "quantile: 17.7" "quantile: 31.05"
## [5] "quantile: 60"
## [1] "IQR: 24.95"
## [1] "---------------------------------------------------------------------------------"
## [1] "hcv_antibody_quantification"
## [1] "================================================================================="
## [1] "mean: 0.0822666666666667"
## [1] "median: 0.05"
## [1] "min: 0"
## [1] "max: 2.09"
## [1] "range: 2.09"
## [1] "sd: 0.192442597409416"
## [1] "quantile: 0" "quantile: 0" "quantile: 0.05" "quantile: 0.08"
## [5] "quantile: 2.09"
## [1] "IQR: 0.08"
## [1] "---------------------------------------------------------------------------------"
## [1] "d_d_dimer"
## [1] "================================================================================="
## [1] "mean: 6.78184"
## [1] "median: 1.27"
## [1] "min: 0"
## [1] "max: 60"
## [1] "range: 60"
## [1] "sd: 9.27520148827659"
## [1] "quantile: 0" "quantile: 0.425" "quantile: 1.27" "quantile: 19.78"
## [5] "quantile: 60"
## [1] "IQR: 19.355"
## [1] "---------------------------------------------------------------------------------"
## [1] "total_cholesterol"
## [1] "================================================================================="
## [1] "mean: 3.79714666666667"
## [1] "median: 3.88"
## [1] "min: 0"
## [1] "max: 7.3"
## [1] "range: 7.3"
## [1] "sd: 1.32915602562809"
## [1] "quantile: 0" "quantile: 3.155" "quantile: 3.88" "quantile: 4.515"
## [5] "quantile: 7.3"
## [1] "IQR: 1.36"
## [1] "---------------------------------------------------------------------------------"
## [1] "aspartate_aminotransferase"
## [1] "================================================================================="
## [1] "mean: 62.416"
## [1] "median: 33"
## [1] "min: 0"
## [1] "max: 1858"
## [1] "range: 1858"
## [1] "sd: 150.569143993697"
## [1] "quantile: 0" "quantile: 22" "quantile: 33" "quantile: 53.5"
## [5] "quantile: 1858"
## [1] "IQR: 31.5"
## [1] "---------------------------------------------------------------------------------"
## [1] "uric_acid"
## [1] "================================================================================="
## [1] "mean: 308.073066666667"
## [1] "median: 286"
## [1] "min: 0"
## [1] "max: 1176"
## [1] "range: 1176"
## [1] "sd: 172.536536764377"
## [1] "quantile: 0" "quantile: 212.2" "quantile: 286"
## [4] "quantile: 372.55" "quantile: 1176"
## [1] "IQR: 160.35"
## [1] "---------------------------------------------------------------------------------"
## [1] "hco3"
## [1] "================================================================================="
## [1] "mean: 23.2208"
## [1] "median: 24.6"
## [1] "min: 0"
## [1] "max: 36.3"
## [1] "range: 36.3"
## [1] "sd: 6.75612175559844"
## [1] "quantile: 0" "quantile: 21.8" "quantile: 24.6" "quantile: 27.15"
## [5] "quantile: 36.3"
## [1] "IQR: 5.35"
## [1] "---------------------------------------------------------------------------------"
## [1] "calcium"
## [1] "================================================================================="
## [1] "mean: 2.03805333333333"
## [1] "median: 2.15"
## [1] "min: 0"
## [1] "max: 2.62"
## [1] "range: 2.62"
## [1] "sd: 0.514584300669525"
## [1] "quantile: 0" "quantile: 2.04" "quantile: 2.15" "quantile: 2.26"
## [5] "quantile: 2.62"
## [1] "IQR: 0.22"
## [1] "---------------------------------------------------------------------------------"
## [1] "amino_terminal_brain_natriuretic_peptide_precursor_nt_pro_bnp"
## [1] "================================================================================="
## [1] "mean: 2811.68533333333"
## [1] "median: 99"
## [1] "min: 0"
## [1] "max: 70000"
## [1] "range: 70000"
## [1] "sd: 9387.95662943333"
## [1] "quantile: 0" "quantile: 0" "quantile: 99" "quantile: 941"
## [5] "quantile: 70000"
## [1] "IQR: 941"
## [1] "---------------------------------------------------------------------------------"
## [1] "lactate_dehydrogenase"
## [1] "================================================================================="
## [1] "mean: 519.592"
## [1] "median: 355"
## [1] "min: 0"
## [1] "max: 1867"
## [1] "range: 1867"
## [1] "sd: 436.03199416537"
## [1] "quantile: 0" "quantile: 239.5" "quantile: 355" "quantile: 661"
## [5] "quantile: 1867"
## [1] "IQR: 421.5"
## [1] "---------------------------------------------------------------------------------"
## [1] "platelet_large_cell_ratio"
## [1] "================================================================================="
## [1] "mean: 31.664"
## [1] "median: 32.5"
## [1] "min: 0"
## [1] "max: 62.2"
## [1] "range: 62.2"
## [1] "sd: 12.6608563463322"
## [1] "quantile: 0" "quantile: 26" "quantile: 32.5" "quantile: 39.35"
## [5] "quantile: 62.2"
## [1] "IQR: 13.35"
## [1] "---------------------------------------------------------------------------------"
## [1] "interleukin_6"
## [1] "================================================================================="
## [1] "mean: 77.4030666666667"
## [1] "median: 2.58"
## [1] "min: 0"
## [1] "max: 5000"
## [1] "range: 5000"
## [1] "sd: 460.193926173624"
## [1] "quantile: 0" "quantile: 0" "quantile: 2.58" "quantile: 33.8"
## [5] "quantile: 5000"
## [1] "IQR: 33.8"
## [1] "---------------------------------------------------------------------------------"
## [1] "fibrin_degradation_products"
## [1] "================================================================================="
## [1] "mean: 29.6808"
## [1] "median: 4"
## [1] "min: 0"
## [1] "max: 190.8"
## [1] "range: 190.8"
## [1] "sd: 55.7152472377118"
## [1] "quantile: 0" "quantile: 0" "quantile: 4" "quantile: 7.4"
## [5] "quantile: 190.8"
## [1] "IQR: 7.4"
## [1] "---------------------------------------------------------------------------------"
## [1] "monocytes_count"
## [1] "================================================================================="
## [1] "mean: 0.648213333333333"
## [1] "median: 0.48"
## [1] "min: 0"
## [1] "max: 39.92"
## [1] "range: 39.92"
## [1] "sd: 2.05815873720233"
## [1] "quantile: 0" "quantile: 0.355" "quantile: 0.48" "quantile: 0.7"
## [5] "quantile: 39.92"
## [1] "IQR: 0.345"
## [1] "---------------------------------------------------------------------------------"
## [1] "plt_distribution_width"
## [1] "================================================================================="
## [1] "mean: 12.816"
## [1] "median: 13"
## [1] "min: 0"
## [1] "max: 25.3"
## [1] "range: 25.3"
## [1] "sd: 4.80697576889886"
## [1] "quantile: 0" "quantile: 11.2" "quantile: 13" "quantile: 15.15"
## [5] "quantile: 25.3"
## [1] "IQR: 3.95"
## [1] "---------------------------------------------------------------------------------"
## [1] "globulin"
## [1] "================================================================================="
## [1] "mean: 33.4546666666667"
## [1] "median: 34.6"
## [1] "min: 0"
## [1] "max: 50.6"
## [1] "range: 50.6"
## [1] "sd: 9.43016966028432"
## [1] "quantile: 0" "quantile: 31.1" "quantile: 34.6" "quantile: 38.7"
## [5] "quantile: 50.6"
## [1] "IQR: 7.6"
## [1] "---------------------------------------------------------------------------------"
## [1] "glutamyl_transpeptidase"
## [1] "================================================================================="
## [1] "mean: 57.6213333333333"
## [1] "median: 37"
## [1] "min: 0"
## [1] "max: 732"
## [1] "range: 732"
## [1] "sd: 78.5797430274511"
## [1] "quantile: 0" "quantile: 20.5" "quantile: 37" "quantile: 61"
## [5] "quantile: 732"
## [1] "IQR: 40.5"
## [1] "---------------------------------------------------------------------------------"
## [1] "international_standard_ratio"
## [1] "================================================================================="
## [1] "mean: 1.28170666666667"
## [1] "median: 1.11"
## [1] "min: 0"
## [1] "max: 13.48"
## [1] "range: 13.48"
## [1] "sd: 1.03490022218339"
## [1] "quantile: 0" "quantile: 1.02" "quantile: 1.11" "quantile: 1.31"
## [5] "quantile: 13.48"
## [1] "IQR: 0.29"
## [1] "---------------------------------------------------------------------------------"
## [1] "basophil_count_number"
## [1] "================================================================================="
## [1] "mean: 0.0241333333333333"
## [1] "median: 0.02"
## [1] "min: 0"
## [1] "max: 0.12"
## [1] "range: 0.12"
## [1] "sd: 0.0205988768016656"
## [1] "quantile: 0" "quantile: 0.01" "quantile: 0.02" "quantile: 0.03"
## [5] "quantile: 0.12"
## [1] "IQR: 0.02"
## [1] "---------------------------------------------------------------------------------"
## [1] "x2019_n_co_v_nucleic_acid_detection"
## [1] "================================================================================="
## [1] "mean: -0.581333333333333"
## [1] "median: -1"
## [1] "min: -1"
## [1] "max: 0"
## [1] "range: 1"
## [1] "sd: 0.493999646380325"
## [1] "quantile: -1" "quantile: -1" "quantile: -1" "quantile: 0"
## [5] "quantile: 0"
## [1] "IQR: 1"
## [1] "---------------------------------------------------------------------------------"
## [1] "mean_corpuscular_hemoglobin"
## [1] "================================================================================="
## [1] "mean: 29.7725333333333"
## [1] "median: 31"
## [1] "min: 0"
## [1] "max: 50.8"
## [1] "range: 50.8"
## [1] "sd: 7.56932789939007"
## [1] "quantile: 0" "quantile: 29.75" "quantile: 31" "quantile: 32.35"
## [5] "quantile: 50.8"
## [1] "IQR: 2.59999999999999"
## [1] "---------------------------------------------------------------------------------"
## [1] "activation_of_partial_thromboplastin_time"
## [1] "================================================================================="
## [1] "mean: 34.4565333333333"
## [1] "median: 39.1"
## [1] "min: 0"
## [1] "max: 144"
## [1] "range: 144"
## [1] "sd: 20.189724546286"
## [1] "quantile: 0" "quantile: 33.4" "quantile: 39.1" "quantile: 44.6"
## [5] "quantile: 144"
## [1] "IQR: 11.2"
## [1] "---------------------------------------------------------------------------------"
## [1] "high_sensitivity_c_reactive_protein"
## [1] "================================================================================="
## [1] "mean: 86.6608"
## [1] "median: 54.1"
## [1] "min: 0"
## [1] "max: 320"
## [1] "range: 320"
## [1] "sd: 87.9859750272583"
## [1] "quantile: 0" "quantile: 10.1" "quantile: 54.1" "quantile: 144.4"
## [5] "quantile: 320"
## [1] "IQR: 134.3"
## [1] "---------------------------------------------------------------------------------"
## [1] "hiv_antibody_quantification"
## [1] "================================================================================="
## [1] "mean: 0.0732"
## [1] "median: 0.08"
## [1] "min: 0"
## [1] "max: 0.27"
## [1] "range: 0.27"
## [1] "sd: 0.0561405798498943"
## [1] "quantile: 0" "quantile: 0" "quantile: 0.08" "quantile: 0.1"
## [5] "quantile: 0.27"
## [1] "IQR: 0.1"
## [1] "---------------------------------------------------------------------------------"
## [1] "serum_sodium"
## [1] "================================================================================="
## [1] "mean: 134.9216"
## [1] "median: 141.3"
## [1] "min: 0"
## [1] "max: 179.7"
## [1] "range: 179.7"
## [1] "sd: 33.7104973828152"
## [1] "quantile: 0" "quantile: 138.7" "quantile: 141.3"
## [4] "quantile: 143.85" "quantile: 179.7"
## [1] "IQR: 5.15000000000003"
## [1] "---------------------------------------------------------------------------------"
## [1] "thrombocytocrit"
## [1] "================================================================================="
## [1] "mean: 0.22928"
## [1] "median: 0.24"
## [1] "min: 0"
## [1] "max: 0.51"
## [1] "range: 0.51"
## [1] "sd: 0.116031823036696"
## [1] "quantile: 0" "quantile: 0.16" "quantile: 0.24" "quantile: 0.3"
## [5] "quantile: 0.51"
## [1] "IQR: 0.14"
## [1] "---------------------------------------------------------------------------------"
## [1] "esr"
## [1] "================================================================================="
## [1] "mean: 28.4506666666667"
## [1] "median: 22"
## [1] "min: 0"
## [1] "max: 110"
## [1] "range: 110"
## [1] "sd: 27.6133946674868"
## [1] "quantile: 0" "quantile: 2" "quantile: 22" "quantile: 43"
## [5] "quantile: 110"
## [1] "IQR: 41"
## [1] "---------------------------------------------------------------------------------"
## [1] "glutamic_pyruvic_transaminase"
## [1] "================================================================================="
## [1] "mean: 47.9946666666667"
## [1] "median: 29"
## [1] "min: 0"
## [1] "max: 1600"
## [1] "range: 1600"
## [1] "sd: 102.51852564206"
## [1] "quantile: 0" "quantile: 18" "quantile: 29" "quantile: 48.5"
## [5] "quantile: 1600"
## [1] "IQR: 30.5"
## [1] "---------------------------------------------------------------------------------"
## [1] "e_gfr"
## [1] "================================================================================="
## [1] "mean: 85.4397333333333"
## [1] "median: 92.9"
## [1] "min: 0"
## [1] "max: 224"
## [1] "range: 224"
## [1] "sd: 34.2058778595522"
## [1] "quantile: 0" "quantile: 72.1" "quantile: 92.9" "quantile: 108.4"
## [5] "quantile: 224"
## [1] "IQR: 36.3"
## [1] "---------------------------------------------------------------------------------"
## [1] "creatinine"
## [1] "================================================================================="
## [1] "mean: 120.322666666667"
## [1] "median: 81"
## [1] "min: 0"
## [1] "max: 1497"
## [1] "range: 1497"
## [1] "sd: 160.171732347307"
## [1] "quantile: 0" "quantile: 58" "quantile: 81" "quantile: 105"
## [5] "quantile: 1497"
## [1] "IQR: 47"
The data in cleaner form at this point, so analysis can be conducted. We’ll start by looking at age and gender of the Covid-19 patients.
#set temp df for analysis
df_analysis <- df_clean
df_analysis$age_factor <- cut(df_analysis$age, breaks = c(0,30,55,65,Inf),labels=c("young","middle","older", "old"))
#head(df_analysis[,c(1:10)]) %>% regulartable() %>% theme_zebra() %>% autofit()
df_analysis$gender_factor <- cut(df_analysis$gender, breaks = c(0,1,2),labels=c("Female","Male"))
#head(df_analysis[,c(1:10)]) %>% regulartable() %>% theme_zebra() %>% autofit()
#head(df_analysis[,c(1:4,81:82)],8)
The histogram showing distribution of age across the 375 patients is very normal looking.
Hist Line Color Legend: * Mean (red) * Median (gray) * Mode (green)
require(plotly)
#require(cowplot)
getmode <- function(v) {
uniqv <- unique(v)
uniqv[which.max(tabulate(match(v, uniqv)))]
}
df_analysis %>%
ggplot(aes(age)) +
geom_histogram(bins = 10, boundary = 0, fill = "#FF6666", col = "black", aes(y = ..density..), ) +
geom_density(alpha = 0.2, fill = "gray") +
ggtitle("Age Histogram",
subtitle = "Mean (grey) and Median (blue) and Mode (green)") +
theme(axis.title = element_text(), axis.title.x = element_text()) +
geom_vline(xintercept = round(mean(df_analysis$age), 2), size = 2, linetype = 3) +
geom_vline(aes(xintercept=mean(df_analysis$age,na.rm='T')),color="blue", linetype=3, size=1) +
geom_vline(aes(xintercept=median(df_analysis$age,na.rm='T')), color="gray", linetype="dashed", size=1) +
geom_vline(aes(xintercept=getmode(df_analysis$age)), color="green", linetype="dashed", size=1) +
gbl_ggtheme -> ph1
#ph1 + theme_bw()
ph1
ph1ly <- ggplotly(ph1)
#ph1ly
#plot_grid(ph1,ph2, nrow = 1, ncol = 2)
The age is better studied in buckets of young, middle, older, old.
#quantile(df_clean$age)
#data.table(df_clean$age)
age_df <- as.data.frame(prop.table(table(df_analysis$age_factor)))
names(age_df)[1:2] <- c("Age","Percentage")
age_df %>% kable() %>% kable_styling()
| Age | Percentage |
|---|---|
| young | 0.0586667 |
| middle | 0.3040000 |
| older | 0.2346667 |
| old | 0.4026667 |
How many patients greater than or equal to 90 years old.
age_outliers <- which(df_analysis$age >= 90)
paste("There are", length(age_outliers), "patients age 90 or older.")
## [1] "There are 5 patients age 90 or older."
df_analysis[age_outliers, "age"] %>% arrange(age)
## # A tibble: 5 x 1
## # Groups: age [5]
## age
## <dbl>
## 1 90
## 2 91
## 3 92
## 4 94
## 5 95
Gender isn’t significantly correlated whereas age does show significant positive correlation to the patient outcome. We have more females than males in this particular patient sample.
gender_df <- as.data.frame(prop.table(table(df_analysis$gender_factor)))
names(gender_df)[1:2] <- c("Gender","Percentage")
gender_df %>% kable() %>% kable_styling()
| Gender | Percentage |
|---|---|
| Female | 0.5973333 |
| Male | 0.4026667 |
#ggpairs(df_analysis, columns = c(3,4,81:82), ggplot2::aes(colour=df_analysis$gender_factor)) +gbl_ggtheme
#chart.Correlation(cor_df1[,c(1,2:5)])
ggcorr(df_analysis[c(2:3,4,81:82)], method = c("everything", "pearson")) + gbl_ggtheme
Visual analysis of the outcome (dependent feature) with all others (independent features).
During the analysis, the following features were found to be important. They are list in order of relative importance.
Figure 1 - “outcome”,“percent_lymphocyte”,“lactate_dehydrogenase”,“lymphocyte_count”,“albumin”, “neutrophils_percent”,“eosinophils_percent”,“eosinophil_count”, “x2019_n_co_v_nucleic_acid_detection”,“direct_bilirubin”,“thrombin_time”, “monocytes_percent”,“international_standard_ratio”
Figure 2 - “outcome”,“amino_terminal_brain_natriuretic_peptide_precursor_nt_pro_bnp”, “thrombocytocrit”,“hypersensitive_cardiac_troponin_i”,“fibrin_degradation_products”,“procalcitonin”, “prothrombin_activity”,“d_d_dimer”,“aspartate_aminotransferase”,“neutrophils_count”, “high_sensitivity_c_reactive_protein”
#head(df_clean)
#ensures we are starting with the clean data set
df <- df_clean
cor_df <- df[,c(2:4,7:80)]
#glimpse(cor_df)
#head(cor_df)
require(GGally)
require(plotly)
vars1 <- c("outcome","percent_lymphocyte","lactate_dehydrogenase","lymphocyte_count","albumin",
"neutrophils_percent","eosinophils_percent","eosinophil_count","x2019_n_co_v_nucleic_acid_detection",
"direct_bilirubin","thrombin_time","monocytes_percent","international_standard_ratio")
cor_df_vars1 <- cor_df[,c(vars1)]
ggcorr(cor_df_vars1,label = TRUE, label_size = 3,hjust = 0.75, size = 1.5, method = c("everything", "pearson")) + gbl_ggtheme
vars1b <- c("outcome","percent_lymphocyte","lactate_dehydrogenase","x2019_n_co_v_nucleic_acid_detection",
"direct_bilirubin","thrombin_time","monocytes_percent")
# cor_df_vars1b <- cor_df[,c(vars1b)]
# ggp2 <- ggpairs(cor_df_vars1b) + gbl_ggtheme
# ggp2 <- ggpairs(cor_df_vars1b, ggplot2::aes(colour="#E64141")) +gbl_ggtheme
# ggp2
#ggplotly(ggp2)
vars2 <- c("outcome","amino_terminal_brain_natriuretic_peptide_precursor_nt_pro_bnp",
"thrombocytocrit","hypersensitive_cardiac_troponin_i","fibrin_degradation_products","procalcitonin",
"prothrombin_activity","d_d_dimer","aspartate_aminotransferase","neutrophils_count",
"high_sensitivity_c_reactive_protein")
#cor_df[,c(vars1)]
cor_df_vars2 <- cor_df[,c(vars2)]
#cor_df_vars2
ggcorr(cor_df_vars2, label = TRUE, label_size = 3,hjust = 0.75, size = 1.5, method = c("everything", "pearson")) + gbl_ggtheme
vars2b <- c("outcome","thrombocytocrit","hypersensitive_cardiac_troponin_i","fibrin_degradation_products",
"prothrombin_activity", "high_sensitivity_c_reactive_protein")
# cor_df_vars2b <- cor_df[,c(vars2b)]
# ggp2 <- ggpairs(cor_df_vars2b, ggplot2::aes(colour="#E64141")) +gbl_ggtheme
# ggp2
#set value to 3 since the dependent is in the 3rd position
#dependent_var <- c(1)
#paste("variable number ",dependent_var," is = ",names(cor_df)[3],sep="")
#cor_df1 <- cor_df_vars2[,c(dependent_var,6:17)]
#corrplot(cor(cor_df_vars2), method = "square")
#ggcorr(cor_df_vars2, method = c("everything", "pearson")) + gbl_ggtheme
#chart.Correlation(cor_df_vars2[,c(1,2:5)])
#ggpairs(cor_df_vars2[,c(1,2:5)], title="correlogram with ggpairs()") #+ gbl_ggtheme
#
# cor_df2 <- df[,c(dependent_var,18:28)]
# #corrplot(cor(cor_df2), method = "square")
# ggcorr(cor_df2, method = c("everything", "pearson")) + gbl_ggtheme
#
# cor_df3 <- df[,c(dependent_var,29:39)]
# #corrplot(cor(cor_df3), method = "square")
# ggcorr(cor_df3, method = c("everything", "pearson")) + gbl_ggtheme
#
# cor_df4 <- df[,c(dependent_var,40:50)]
# #corrplot(cor(cor_df4), method = "square")
# ggcorr(cor_df4, method = c("everything", "pearson")) + gbl_ggtheme
#
# cor_df5 <- df[,c(dependent_var,51:61)]
# #corrplot(cor(cor_df5), method = "square")
# ggcorr(cor_df5, method = c("everything", "pearson")) + gbl_ggtheme
#
# cor_df6 <- df[,c(dependent_var,62:72)]
# #corrplot(cor(cor_df6), method = "square")
# ggcorr(cor_df6, method = c("everything", "pearson")) + gbl_ggtheme
#
# cor_df7 <- df[,c(dependent_var,73:80)]
# #corrplot(cor(cor_df7), method = "square")
# ggcorr(cor_df7, method = c("everything", "pearson")) + gbl_ggtheme
paste("")
## [1] ""
kmeans() function returns a list of components, including:
*cluster: A vector of integers (from 1:k) indicating the cluster to which each point is allocated
*centers: A matrix of cluster centers (cluster means)
*totss: The total sum of squares (TSS), i.e ∑(xi−x¯)2. TSS measures the total variance in the data.
*withinss: Vector of within-cluster sum of squares, one component per cluster
*tot.withinss: Total within-cluster sum of squares, i.e. sum(withinss)
*betweenss: The between-cluster sum of squares, i.e. totss−tot.withinss
*size: The number of observations in each cluster
require(NbClust)
df <- df_clean
#df
df <- df[,c(7:80)] #
dim(df)
## [1] 375 74
#df
#df %>% select(-patient_id,-admission_time,-discharge_time)
df_subset1 <- df #%>% select(-gender,-outcome)
#df_subset1
df <- df_subset1
head(df[,c(1:5)]) %>% regulartable() %>% theme_zebra() %>% autofit()
hypersensitive_cardiac_troponin_i | hemoglobin | serum_chloride | prothrombin_time | procalcitonin |
19.9 | 140 | 103.1 | 14.1 | 0.09 |
16.9 | 151 | 100.5 | 14.3 | 0.09 |
0.0 | 126 | 102.9 | 13.6 | 0.06 |
4.8 | 110 | 103.1 | 16.3 | 0.38 |
5.6 | 134 | 102.2 | 14.6 | 0.02 |
19.7 | 108 | 105.8 | 12.4 | 0.10 |
X<- df
X_norm <- apply(X, 1, function(i) round(i/sum(i),3))
#X_norm
#str(X)
#kmeans function
kmeansCluster <- function (mtrx,vcenters,v_nstart = 50,v_iter.max = 150) {
km.res <- kmeans(mtrx, centers = vcenters, nstart = v_nstart, iter.max = v_iter.max)
z<-fviz_cluster(km.res, data = X, geom = "point", stand = FALSE, ellipse.type = "norm",main = paste("Kmeans cluster plot vcenters=",vcenters)) + theme_minimal()
z2 <- fviz_cluster(km.res, data = mtrx, geom = "point", stand = FALSE, ellipse.type = "norm", outlier.color = "black",main = paste("Kmeans cluster plot vcenters=",vcenters), show.clust.cent = TRUE) + theme_minimal()
s <- summary(km.res)
return(list(z,s))
s <- summary(km.res)
return(list(z2,s))
#return(c(x))
}
k2 <-kmeansCluster(X,2)
k2
## [[1]]
##
## [[2]]
## Length Class Mode
## cluster 375 -none- numeric
## centers 148 -none- numeric
## totss 1 -none- numeric
## withinss 2 -none- numeric
## tot.withinss 1 -none- numeric
## betweenss 1 -none- numeric
## size 2 -none- numeric
## iter 1 -none- numeric
## ifault 1 -none- numeric
k3 <-kmeansCluster(X,3)
k3
## [[1]]
## Too few points to calculate an ellipse
##
## [[2]]
## Length Class Mode
## cluster 375 -none- numeric
## centers 222 -none- numeric
## totss 1 -none- numeric
## withinss 3 -none- numeric
## tot.withinss 1 -none- numeric
## betweenss 1 -none- numeric
## size 3 -none- numeric
## iter 1 -none- numeric
## ifault 1 -none- numeric
k4 <-kmeansCluster(X,4)
k4
## [[1]]
## Too few points to calculate an ellipse
##
## [[2]]
## Length Class Mode
## cluster 375 -none- numeric
## centers 296 -none- numeric
## totss 1 -none- numeric
## withinss 4 -none- numeric
## tot.withinss 1 -none- numeric
## betweenss 1 -none- numeric
## size 4 -none- numeric
## iter 1 -none- numeric
## ifault 1 -none- numeric
k5 <-kmeansCluster(X,5)
k5
## [[1]]
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
##
## [[2]]
## Length Class Mode
## cluster 375 -none- numeric
## centers 370 -none- numeric
## totss 1 -none- numeric
## withinss 5 -none- numeric
## tot.withinss 1 -none- numeric
## betweenss 1 -none- numeric
## size 5 -none- numeric
## iter 1 -none- numeric
## ifault 1 -none- numeric
# k6 <-kmeansCluster(X,6)
# k6
The plot represents the variance within the clusters. It decreases as k increases, and a bend (or “elbow”) can be seen at k = 4. This bend indicates that additional clusters beyond the fourth have little value. In the next section, we’ll classify the observations into 4 clusters.
k<-2:7 #K from 2 to 7
tries <-10 #Run the K Means algorithm n times
avg.totw.ss <-integer(length(k)) #Set up an empty vector to hold all of points
for(v in k){ # For each value of the range variable
v.totw.ss <-integer(tries) #Set up an empty vector to hold the 100 tries
for(i in 1:tries){
k.temp <-kmeans(X,centers=v) #Run kmeans
v.totw.ss[i] <-k.temp$tot.withinss#Store the total withinss
}
avg.totw.ss[v-1] <-mean(v.totw.ss) #Average the 100 total withinss
}
df <- as.data.frame(cbind(k,avg.totw.ss))
ggplot(df,aes(x=k, y=avg.totw.ss)) + geom_point() +
geom_line(aes(color="Elbow")) +
ggtitle("Total Within SS by Various K",subtitle = "Elbow") + theme_bw()
#plot(rng,avg.totw.ss,type="b", main="Total Within SS by Various K", ylab="Average Total Within Sum of Squares", xlab="Value of K")
Results using k=4 clustering with patient age and 74 biomarkers
set.seed(seed)
km.res <- NULL
km.res <- kmeans(X, 4, nstart = 50,iter.max = 150)
#km.res4 <- kmeans(X, 4)
#summary(km.res)
#glimpse(km.res)
z2 <- fviz_cluster(km.res, data = X, geom = "point", stand = FALSE, ellipse.type = "norm", outlier.color = "black",main = "Kmeans cluster plot vcenters=4", show.clust.cent = TRUE) + theme_minimal()
z2
Samples from the 4 clusters are displayed below in descending order. Only 5 of the 74 biomarkers are displayed. Clusters 1 and 4 had the most observations and we know there were not that many important features out of the 74 biomarkers. This may mean clusters 3 and 2 were defined by how the important features set them apart from the rest.
hypersensitive_cardiac_troponin_i: This is the preferred biomarker for acute myocardial infarction. This helps diagnose acute coronary syndromes in those who are critically ill.
amino_terminal_brain_natriuretic_peptide_precursor_nt_pro_bnp: Natriuretic peptide tests measure the presence of brain natriuretic peptide and B-terminal pro b-type natriuretic peptides present. These are substances created by the heart. Small levels are normally found in the bloodstream. High levels of BNP, NT-proBNP indicate that the heart is failing to pump the appropriate amount of blood needed.
lactate_dehydrogenase: LDH is an enzyme released when there is tissue damage present in a patient’s body tissues. This marker measures the amount of LDH in blood or other body fluids and is vital in monitoring viral conditions or cancer.
ferritin: Ferritin is a major intracellular iron storage protein in all organisms. It binds free ions of the trace element, neutralising its toxic properties and increasing its solubility. In the soluble form, the body is able to expend iron as needed, in particular for regulation of cellular oxygen metabolism. Elevated levels of ferritin, or hyperferritinemia, indicate the presence of viruses and bacteria into the body.
Impvars <- c("amino_terminal_brain_natriuretic_peptide_precursor_nt_pro_bnp",
"thrombocytocrit","hypersensitive_cardiac_troponin_i","fibrin_degradation_products","procalcitonin",
"prothrombin_activity","d_d_dimer","aspartate_aminotransferase","neutrophils_count",
"high_sensitivity_c_reactive_protein","thrombocytocrit","hypersensitive_cardiac_troponin_i","fibrin_degradation_products",
"prothrombin_activity", "high_sensitivity_c_reactive_protein","ferritin")
#head(X[,Impvars])
dd <- cbind(X, cluster = km.res$cluster)
dd <- dd %>% select("cluster","lactate_dehydrogenase","amino_terminal_brain_natriuretic_peptide_precursor_nt_pro_bnp","hypersensitive_cardiac_troponin_i", "high_sensitivity_c_reactive_protein","ferritin")
dd.names <-names(dd) #back up original names
dfname <-dd.names
names2 <- base::abbreviate( dfname,25)
#names2 #checking abbreviations
names(dd) <- names2
#print(km.res)
#str(km.res)
#km.res$cluster %>% regulartable() %>% theme_zebra() %>% autofit()
#cluster centers
#km.res$centers %>% regulartable() %>% theme_zebra() %>% autofit()
#head(dd)
#dim(dd)
k=4 cluster dimensions:
[1] 324, 12 - Cluster one is the largest
[2] 3, 12 - Cluster two is the smallest
[3] 42, 12 - Cluster three is the second largest
[4] 6, 12 - Cluster four is very small as well
The smaller clusters have elevated markers, while the larger clusters do not. Note: long biomarker names have been abbreviated to 25 characters.
#abbreviate long biomarker names to 25 chars make analysis easier
#Create biomarker aggregate with means
agg_clus <- aggregate(X, by=list(cluster=km.res$cluster), mean)
#abbreviate long biomarker names to 25 chars make analysis easier
agg_clus.names <-names(agg_clus) #back up original names
agg_clus.dfname <- agg_clus.names
agg_clus.names2 <- base::abbreviate( agg_clus.dfname,25)
#names2 #checking abbreviations
names(agg_clus) <- agg_clus.names2
Impvars_short <- base::abbreviate( Impvars,25)
agg_clus[,Impvars_short] %>% regulartable() %>% theme_zebra() %>% autofit()
amn_trmnl_brn_ntrrtc_____ | thrombocytocrit | hypersensitiv_crdc_trpnn_ | fibrin_degradation_prdcts | procalcitonin | prothrombin_activity | d_d_dimer | aspartate_aminotransferas | neutrophils_count | high_senstvty_c_rctv_prtn | thrombocytocrit.1 | hypersensitiv_crdc_trpnn_.1 | fibrin_degradation_prdcts.1 | prothrombin_activity.1 | high_senstvty_c_rctv_prtn.1 | ferritin |
35716.3333 | 0.13666667 | 50000.0000 | 150.00000 | 0.3700000 | 74.33333 | 21.000000 | 228.6667 | 19.076667 | 130.90000 | 0.13666667 | 50000.0000 | 150.00000 | 74.33333 | 130.90000 | 6594.5333 |
10561.6429 | 0.17000000 | 2423.6452 | 84.70000 | 5.4209524 | 71.42857 | 15.254762 | 128.8571 | 15.698333 | 183.77619 | 0.17000000 | 2423.6452 | 84.70000 | 71.42857 | 183.77619 | 2145.2833 |
62566.0000 | 0.09166667 | 4629.9000 | 53.90000 | 4.4066667 | 70.66667 | 12.835000 | 377.3333 | 14.578333 | 100.15000 | 0.09166667 | 4629.9000 | 53.90000 | 70.66667 | 100.15000 | 5100.3000 |
395.8272 | 0.24037037 | 164.3849 | 20.98611 | 0.4383951 | 83.32099 | 5.439753 | 46.4321 | 7.710679 | 73.41235 | 0.24037037 | 164.3849 | 20.98611 | 83.32099 | 73.41235 | 589.4793 |
clus_id1 <- dd %>% filter(cluster == 1) # %>% head(3) #%>% regulartable() %>% theme_zebra() %>% autofit()
head(clus_id1[,c(1:6)]) %>% kable() %>% kable_styling()
| cluster | lactate_dehydrogenase | amn_trmnl_brn_ntrrtc_____ | hypersensitiv_crdc_trpnn_ | high_senstvty_c_rctv_prtn | ferritin |
|---|---|---|---|---|---|
| 1 | 1867 | 20906 | 50000 | 113.1 | 1935.6 |
| 1 | 1500 | 70000 | 50000 | 87.3 | 0.0 |
| 1 | 785 | 16243 | 50000 | 192.3 | 17848.0 |
clus_id2 <- dd %>% filter(cluster == 2) #%>% head(3) #%>% regulartable() %>% theme_zebra() %>% autofit()
head(clus_id2[,c(1:6)]) %>% kable() %>% kable_styling()
| cluster | lactate_dehydrogenase | amn_trmnl_brn_ntrrtc_____ | hypersensitiv_crdc_trpnn_ | high_senstvty_c_rctv_prtn | ferritin |
|---|---|---|---|---|---|
| 2 | 208 | 11017 | 0.0 | 24.3 | 0.0 |
| 2 | 187 | 13006 | 124.0 | 2.1 | 0.0 |
| 2 | 310 | 22525 | 617.0 | 160.2 | 545.7 |
| 2 | 1772 | 9310 | 46.3 | 274.1 | 2578.7 |
| 2 | 890 | 7447 | 304.2 | 86.1 | 2661.8 |
| 2 | 492 | 11534 | 1317.8 | 196.9 | 0.0 |
clus_id3 <-dd %>% filter(cluster == 3) #%>% head(3) #%>% regulartable() %>% theme_zebra() %>% autofit()
head(clus_id3[,c(1:6)]) %>% kable() %>% kable_styling()
| cluster | lactate_dehydrogenase | amn_trmnl_brn_ntrrtc_____ | hypersensitiv_crdc_trpnn_ | high_senstvty_c_rctv_prtn | ferritin |
|---|---|---|---|---|---|
| 3 | 284 | 70000 | 85.5 | 81.7 | 0.0 |
| 3 | 1752 | 40096 | 2.6 | 84.6 | 15718.0 |
| 3 | 1867 | 70000 | 523.1 | 36.7 | 0.0 |
| 3 | 1126 | 55300 | 3404.7 | 138.8 | 4995.6 |
| 3 | 1867 | 70000 | 21953.4 | 259.1 | 9888.2 |
| 3 | 702 | 70000 | 1810.1 | 0.0 | 0.0 |
clus_id4 <-dd %>% filter(cluster == 4) # %>% head(3) #%>% regulartable() %>% theme_zebra() %>% autofit()
head(clus_id4[,c(1:6)]) %>% kable() %>% kable_styling()
| cluster | lactate_dehydrogenase | amn_trmnl_brn_ntrrtc_____ | hypersensitiv_crdc_trpnn_ | high_senstvty_c_rctv_prtn | ferritin |
|---|---|---|---|---|---|
| 4 | 306 | 60 | 19.9 | 43.1 | 634.9 |
| 4 | 738 | 173 | 16.9 | 27.4 | 1512.9 |
| 4 | 328 | 0 | 0.0 | 42.3 | 567.2 |
| 4 | 338 | 152 | 4.8 | 108.2 | 0.0 |
| 4 | 195 | 5 | 5.6 | 7.0 | 121.1 |
| 4 | 268 | 645 | 19.7 | 29.3 | 1124.6 |
Boxplot summary of most important biomarkers.
# kmeansplot(dd,"lactate_dehydrogenase")
ggplot(data=dd) +
geom_boxplot(aes(x=cluster,y=lactate_dehydrogenase,group=cluster))+
facet_grid(cluster ~ ., scales = "free", space = "free")+ gbl_ggtheme + coord_flip()
# kmeansplot(dd,"amn_trmnl_brn_ntrrtc_____")
ggplot(data=dd) +
geom_boxplot(aes(x=cluster,y=amn_trmnl_brn_ntrrtc_____,group=cluster))+
facet_grid(cluster ~ ., scales = "free", space = "free")+ gbl_ggtheme + coord_flip()
# kmeansplot(dd,"hypersensitiv_crdc_trpnn_")
ggplot(data=dd) +
geom_boxplot(aes(x=cluster,y=hypersensitiv_crdc_trpnn_,group=cluster))+
facet_grid(cluster ~ ., scales = "free", space = "free")+ gbl_ggtheme + coord_flip()
# kmeansplot(dd,"high_senstvty_c_rctv_prtn")
ggplot(data=dd) +
geom_boxplot(aes(x=cluster,y=high_senstvty_c_rctv_prtn,group=cluster))+
facet_grid(cluster ~ ., scales = "free", space = "free")+ gbl_ggtheme + coord_flip()
# kmeansplot(dd,"ferritin")
ggplot(data=dd) +
geom_boxplot(aes(x=cluster,y=ferritin,group=cluster))+
facet_grid(cluster ~ ., scales = "free", space = "free")+ gbl_ggtheme + coord_flip()
Quantiles statistics for most important biomarkers.
df_stat <- dd
dim(df_stat)
## [1] 375 6
dfIntCols <- df_stat %>% select_if(., is.numeric)
names(dfIntCols)
## [1] "cluster" "lactate_dehydrogenase"
## [3] "amn_trmnl_brn_ntrrtc_____" "hypersensitiv_crdc_trpnn_"
## [5] "high_senstvty_c_rctv_prtn" "ferritin"
print('===============================================================')
## [1] "==============================================================="
print('===============================================================')
## [1] "==============================================================="
statInfo <- function(a) {
for (i in colnames(a)){
paste('Column: ', colnames(df_stat[i]), sep=',')
print('---------------------------------------------------------------')
print(colnames(df_stat[i]))
print('---------------------------------------------------------------')
printVecInfo(df_stat[[i]])
}
}
statInfo(dfIntCols)
## [1] "---------------------------------------------------------------"
## [1] "cluster"
## [1] "---------------------------------------------------------------"
## [1] "mean: 3.736"
## [1] "median: 4"
## [1] "min: 1"
## [1] "max: 4"
## [1] "range: 3"
## [1] "sd: 0.68377686575303"
## [1] "quantile: 1" "quantile: 4" "quantile: 4" "quantile: 4" "quantile: 4"
## [1] "IQR: 0"
## [1] "---------------------------------------------------------------"
## [1] "lactate_dehydrogenase"
## [1] "---------------------------------------------------------------"
## [1] "mean: 519.592"
## [1] "median: 355"
## [1] "min: 0"
## [1] "max: 1867"
## [1] "range: 1867"
## [1] "sd: 436.03199416537"
## [1] "quantile: 0" "quantile: 239.5" "quantile: 355" "quantile: 661"
## [5] "quantile: 1867"
## [1] "IQR: 421.5"
## [1] "---------------------------------------------------------------"
## [1] "amn_trmnl_brn_ntrrtc_____"
## [1] "---------------------------------------------------------------"
## [1] "mean: 2811.68533333333"
## [1] "median: 99"
## [1] "min: 0"
## [1] "max: 70000"
## [1] "range: 70000"
## [1] "sd: 9387.95662943333"
## [1] "quantile: 0" "quantile: 0" "quantile: 99" "quantile: 941"
## [5] "quantile: 70000"
## [1] "IQR: 941"
## [1] "---------------------------------------------------------------"
## [1] "hypersensitiv_crdc_trpnn_"
## [1] "---------------------------------------------------------------"
## [1] "mean: 887.5552"
## [1] "median: 4.8"
## [1] "min: 0"
## [1] "max: 50000"
## [1] "range: 50000"
## [1] "sd: 4933.42370112608"
## [1] "quantile: 0" "quantile: 1.9" "quantile: 4.8" "quantile: 47.2"
## [5] "quantile: 50000"
## [1] "IQR: 45.3"
## [1] "---------------------------------------------------------------"
## [1] "high_senstvty_c_rctv_prtn"
## [1] "---------------------------------------------------------------"
## [1] "mean: 86.6608"
## [1] "median: 54.1"
## [1] "min: 0"
## [1] "max: 320"
## [1] "range: 320"
## [1] "sd: 87.9859750272583"
## [1] "quantile: 0" "quantile: 10.1" "quantile: 54.1" "quantile: 144.4"
## [5] "quantile: 320"
## [1] "IQR: 134.3"
## [1] "---------------------------------------------------------------"
## [1] "ferritin"
## [1] "---------------------------------------------------------------"
## [1] "mean: 883.942933333333"
## [1] "median: 147.3"
## [1] "min: 0"
## [1] "max: 50000"
## [1] "range: 50000"
## [1] "sd: 3052.68673704692"
## [1] "quantile: 0" "quantile: 0" "quantile: 147.3"
## [4] "quantile: 881.95" "quantile: 50000"
## [1] "IQR: 881.95"
#dd[,c(76,1:10)] %>% arrange(cluster) %>% head() %>% kable() %>% kable_styling()
# dd[,c(76,11:20)] %>% arrange(cluster) %>% head() %>% kable() %>% kable_styling()
# dd[,c(76,21:30)] %>% arrange(cluster) %>% head() %>% kable() %>% kable_styling()
# dd[,c(76,31:40)] %>% arrange(cluster) %>% head() %>% kable() %>% kable_styling()
# dd[,c(76,41:50)] %>% arrange(cluster) %>% head() %>% kable() %>% kable_styling()
# dd[,c(76,51:60)] %>% arrange(cluster) %>% head() %>% kable() %>% kable_styling()
# dd[,c(76,61:70)] %>% arrange(cluster) %>% head() %>% kable() %>% kable_styling()
# dd[,c(76,71:74)] %>% arrange(cluster) %>% head() %>% kable() %>% kable_styling()
Principal Component Analysis (PCA) is a multivariate technique that allows us to summarize the systematic patterns of variations in the data.
From a data analysis standpoint, PCA is used for studying one table of observations and variables with the main idea of transforming the observed variables into a set of new variables, the principal components, which are uncorrelated and explain the variation in the data. For this reason, PCA allows us to reduce a “complex” data set to a lower dimension in order to reveal the structures or the dominant types of variations in both the observations and the variables. Note: starts by setting with clean data set.
df <- df_clean
df <- df[,c(4,7:80)] # %>% select(-patient_id,-admission_time,-discharge_time)
#rename outcome to Survived
names(df)[1] <- c("Survived")
df$Survived <- factor(df$Survived)
#Discretize Survived
df$Survived=dplyr::recode(df$Survived, `0`="Deceased", `1`="Survived")
colnames(df) <- make.names(colnames(df))
head(df[,1:5]) %>% kable %>% kable_styling()
| Survived | hypersensitive_cardiac_troponin_i | hemoglobin | serum_chloride | prothrombin_time |
|---|---|---|---|---|
| Deceased | 19.9 | 140 | 103.1 | 14.1 |
| Deceased | 16.9 | 151 | 100.5 | 14.3 |
| Deceased | 0.0 | 126 | 102.9 | 13.6 |
| Deceased | 4.8 | 110 | 103.1 | 16.3 |
| Deceased | 5.6 | 134 | 102.2 | 14.6 |
| Deceased | 19.7 | 108 | 105.8 | 12.4 |
There are several methods for doing PCA using R, but we’re going use princomp method, since its in base R stats.
Typical PCA results should consist of the following:
A set of eigenvalues
A table with the scores or Principal Components (PCs)
A table of loadings (or correlations between variables and PCs).
The eigenvalues provide information of the variability in the data. The scores provide information about the structure of the observations. The loadings (or correlations) allow you to get a sense of the relationships between variables, as well as their associations with the extracted PCs.
features<-df[,-1] #remove the label
pca<-princomp(features)
std_dev <- pca[1:74]$sdev
pr_var <- std_dev^2
prop_varex <- pr_var/sum(pr_var)
res.pca <- pca
#pca
As seen in the plot below, nearly first 15 components explains most variation in the data. The second plot indicates 15-20 components might be adequate for the modeling data sets.
require(factoextra)
require(plotly)
plot(cumsum(prop_varex[1:74]), xlab = "Principal Component",
ylab = "Cumulative Proportion of Variance Explained",
type = "b",
col = 4)
plot(cumsum(prop_varex[1:20]), xlab = "Principal Component",
ylab = "Cumulative Proportion of Variance Explained",
type = "b",
col = 4)
The scree plot shows the first component is explaining 76.3% of the variances. The line drops of quickly from there, which bolsters the case that we can use far less than 74 features to get a good prediction.
require(factoextra)
require(plotly)
res.pca <- pca
#fplt <- fviz_eig(res.pca, addlabels = TRUE)
fplt <- fviz_screeplot(res.pca, addlabels = TRUE)
#ggplotly(fplt)
fplt
Since the majority of the variance are explained by the first three components a contribution analysis is performed on them here.
Top contributors:
hypersensitive_cardiac_troponin_i: This is the preferred biomarker for acute myocardial infarction. This helps diagnose acute coronary syndromes in those who are critically ill.
amino_terminal_brain_natriuretic_peptide_precursor_nt_pro_bnp: Natriuretic peptide tests measure the presence of brain natriuretic peptide and B-terminal pro b-type natriuretic peptides present. These are substances created by the heart. Small levels are normally found in the bloodstream. High levels of BNP, NT-proBNP indicate that the heart is failing to pump the appropriate amount of blood needed.
lactate_dehydrogenase: LDH is an enzyme released when there is tissue damage present in a patient’s body tissues. This marker measures the amount of LDH in blood or other body fluids and is vital in monitoring viral conditions or cancer.
ferritin: Ferritin is a major intracellular iron storage protein in all organisms. It binds free ions of the trace element, neutralising its toxic properties and increasing its solubility. In the soluble form, the body is able to expend iron as needed, in particular for regulation of cellular oxygen metabolism. Elevated levels of ferritin, or hyperferritinemia, indicate the presence of viruses and bacteria into the body.
fviz_pca_var(res.pca, col.var="contrib",
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = FALSE # Avoid text overlapping
)
# Contributions of variables to PC1
fviz_contrib(res.pca, choice = "var", axes = 1, top = 3)
# Contributions of variables to PC2
fviz_contrib(res.pca, choice = "var", axes = 2, top = 3)
# Contributions of variables to PC3
fviz_contrib(res.pca, choice = "var", axes = 3, top = 3)
Several classification models will be leveraged and compared while using the Covid-19, patient, biomarker data. Classification in this case predicts a binary outcome of case fatality or or survival. The models will be trained in supervised mode using the dependent variable we will call label, which holds values 0 for fatal and 1 for survived. The models will then be tested to determine if they can classify case fatality and survival without having the benefit of the labels.
The list of classification models to be used can be viewed below here:
Decision Tree(rpart)
Random Forest(rf)
Naive Bayes(nb)
K Nearest Neighbors(knn)
Support Vector Machines(svm)
Gradient Boosting Machines(gbm)
Classification Based On Association Rules (cba)
The models will each be described further as they are used.
Checking the dependent variable “Survived” to ensure imbalance will not be a problem. The output indicates there is good balance of fatalities versus survived outcomes in our data set.
#summary(df$Survived)
survived_df <- as.data.frame(prop.table(table(df$Survived)))
names(survived_df)[1:2] <- c("Survived","Percentage")
survived_df
## Survived Percentage
## 1 Deceased 0.536
## 2 Survived 0.464
Using the first 20 components for the model splitting after determining that gets the best predictions without loosing predictive quality using all 74 features.
nDim <- 21
set.seed(seed)
new_df<-data.frame(label = df[, "Survived"], pca$scores)
df2 <- new_df
dim(df2)
## [1] 375 75
pca_df<- new_df[,1:nDim]
samp_size <-floor(0.80* nrow(pca_df))
train_ind <-sample(seq_len(nrow(pca_df)), size = samp_size)
train <- pca_df[train_ind,]
dim(train)
## [1] 300 21
test <-pca_df[-train_ind,]
dim(test)
## [1] 75 21
names(pca_df)[1] <- c("label")
names(df)[1] <- c("label")
The data will be split into 80/20 for training and test sets.
## dimension_of_training_dataset dimension_of_testing_dataset
## 1 300 75
## 2 21 21
## [1] 300 21
## [1] 75 21
## [1] 75 21
# #view the data sets
# head(train)
# head(test)
# head(hold)
Model controls will consist of cross validation set to 5-fold, and fit metric set to “Accuracy.”
## [conflicted] Removing existing preference
## [conflicted] Will prefer caret::confusionMatrix over any other package
## Loading required package: foreach
##
## Attaching package: 'foreach'
## The following objects are masked from 'package:purrr':
##
## accumulate, when
## Loading required package: iterators
## Loading required package: parallel
## Loading required package: xts
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
##
## Attaching package: 'xts'
## The following object is masked from 'package:leaflet':
##
## addLegend
## The following objects are masked from 'package:data.table':
##
## first, last
## The following objects are masked from 'package:dplyr':
##
## first, last
Reviewing the models and their different parameters which can be tuned. Caret will be used with method model training controls to run algorithms u5-fold cross validation and various parameter tuning. The most important features are printed after each model is trained for additional information.
x1<- modelLookup(model='rpart')
x2<-modelLookup(model='rf')
x3<-modelLookup(model='nb')
x4<-modelLookup(model='knn')
x5<-modelLookup(model='svmRadial')
x6<-modelLookup(model='gbm')
mlist <-rbind(x1,x2,x3,x4,x5,x6)
mlist
## model parameter label forReg forClass
## 1 rpart cp Complexity Parameter TRUE TRUE
## 2 rf mtry #Randomly Selected Predictors TRUE TRUE
## 3 nb fL Laplace Correction FALSE TRUE
## 4 nb usekernel Distribution Type FALSE TRUE
## 5 nb adjust Bandwidth Adjustment FALSE TRUE
## 6 knn k #Neighbors TRUE TRUE
## 7 svmRadial sigma Sigma TRUE TRUE
## 8 svmRadial C Cost TRUE TRUE
## 9 gbm n.trees # Boosting Iterations TRUE TRUE
## 10 gbm interaction.depth Max Tree Depth TRUE TRUE
## 11 gbm shrinkage Shrinkage TRUE TRUE
## 12 gbm n.minobsinnode Min. Terminal Node Size TRUE TRUE
## probModel
## 1 TRUE
## 2 TRUE
## 3 TRUE
## 4 TRUE
## 5 TRUE
## 6 TRUE
## 7 TRUE
## 8 TRUE
## 9 TRUE
## 10 TRUE
## 11 TRUE
## 12 TRUE
Running a similar cross validation experiment using decision trees to compare the results We are now ready to train and test using classifiers. Below we use a few different decision tree models with different parameters and prunings to get varied results.
Caret will be used with method used for model training controls to run algorithms 5-fold cross validation and various parameter tuning.
Note: The most important features are printed after each model is trained for additional information.
library(rpart)
#Ten Fold CV used with maincontrol2
control <- maincontrol
#Train Tree Model 2 CV
ptm <- proc.time()
fit.tree <- train(label~ ., data = train, method="rpart", metric=fitmetric, trControl=control,control=rpart.control(cp = 0.4))
fit.tree
## CART
##
## 300 samples
## 20 predictor
## 2 classes: 'Deceased', 'Survived'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 240, 240, 241, 239, 240
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa
## 0.04347826 0.8397610 0.6815205
## 0.13768116 0.7701436 0.5319054
## 0.62318841 0.7159063 0.4094905
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.04347826.
proc.time() - ptm
## user system elapsed
## 1.68 0.03 1.89
#fancy tree plots
suppressMessages(library(rattle))
fancyRpartPlot(fit.tree$finalModel)
#summary(fit.tree2$finalModel)
#important variables
treeImp <- varImp(fit.tree)
#treeImp
ggplot(treeImp) + gbl_ggtheme
require(gbm)
## Loading required package: gbm
## Loaded gbm 2.1.5
#Train Random Forest CV
control <- maincontrol
gbmGrid <- expand.grid(interaction.depth = c(1, 5, 9),
n.trees = (1:30)*50,
shrinkage = 0.1,
n.minobsinnode = 20)
nrow(gbmGrid)
## [1] 90
ptm <- proc.time()
fit.gbm <- train(label ~ ., data = train,
method = 'gbm',
tuneGrid = gbmGrid,
trControl = control,
## This last option is actually one
## for gbm() that passes through
verbose = FALSE)
fit.gbm
## Stochastic Gradient Boosting
##
## 300 samples
## 20 predictor
## 2 classes: 'Deceased', 'Survived'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 241, 239, 241, 240, 239
## Resampling results across tuning parameters:
##
## interaction.depth n.trees Accuracy Kappa
## 1 50 0.8869760 0.7722846
## 1 100 0.8934204 0.7850006
## 1 150 0.9133130 0.8257010
## 1 200 0.9000871 0.7982149
## 1 250 0.9065351 0.8118068
## 1 300 0.9132583 0.8254766
## 1 350 0.9068667 0.8121216
## 1 400 0.9232602 0.8457358
## 1 450 0.9034787 0.8054518
## 1 500 0.9133148 0.8258101
## 1 550 0.9099250 0.8189050
## 1 600 0.9165916 0.8323539
## 1 650 0.9232602 0.8456811
## 1 700 0.9232055 0.8455723
## 1 750 0.9297629 0.8588103
## 1 800 0.9133695 0.8253909
## 1 850 0.9232055 0.8455723
## 1 900 0.9133695 0.8253909
## 1 950 0.9199268 0.8389179
## 1 1000 0.9068121 0.8118238
## 1 1050 0.9068121 0.8118238
## 1 1100 0.9166481 0.8322628
## 1 1150 0.9098685 0.8186061
## 1 1200 0.8966426 0.7911592
## 1 1250 0.9067009 0.8116982
## 1 1300 0.9164259 0.8316007
## 1 1350 0.9034222 0.8048151
## 1 1400 0.9165370 0.8319092
## 1 1450 0.9067009 0.8116171
## 1 1500 0.9165370 0.8319092
## 5 50 0.9103130 0.8201424
## 5 100 0.9004770 0.7997087
## 5 150 0.9136464 0.8264013
## 5 200 0.9234824 0.8462389
## 5 250 0.9202056 0.8399207
## 5 300 0.9103695 0.8189685
## 5 350 0.9168723 0.8331420
## 5 400 0.9167611 0.8331009
## 5 450 0.9069251 0.8125138
## 5 500 0.9167611 0.8332795
## 5 550 0.9069251 0.8128074
## 5 600 0.9036464 0.8059687
## 5 650 0.9133713 0.8261328
## 5 700 0.9036464 0.8056365
## 5 750 0.9167611 0.8331009
## 5 800 0.9100926 0.8194051
## 5 850 0.9002566 0.7986684
## 5 900 0.9002566 0.7986684
## 5 950 0.9100926 0.8194051
## 5 1000 0.9035352 0.8055457
## 5 1050 0.9002566 0.7986684
## 5 1100 0.9100926 0.8194051
## 5 1150 0.9035352 0.8055457
## 5 1200 0.9133713 0.8261328
## 5 1250 0.9035352 0.8055457
## 5 1300 0.9035352 0.8055457
## 5 1350 0.9133713 0.8261328
## 5 1400 0.9001454 0.7986578
## 5 1450 0.9099815 0.8192450
## 5 1500 0.9133148 0.8259852
## 9 50 0.9002001 0.7997006
## 9 100 0.9102019 0.8192672
## 9 150 0.9333750 0.8665816
## 9 200 0.9202603 0.8396702
## 9 250 0.9235389 0.8460798
## 9 300 0.9268176 0.8530253
## 9 350 0.9137029 0.8260286
## 9 400 0.9268176 0.8530253
## 9 450 0.9234843 0.8464329
## 9 500 0.9136482 0.8258457
## 9 550 0.9103695 0.8189685
## 9 600 0.9069797 0.8121997
## 9 650 0.9168158 0.8332545
## 9 700 0.9069797 0.8121997
## 9 750 0.9102584 0.8190770
## 9 800 0.9167611 0.8331009
## 9 850 0.9167611 0.8331009
## 9 900 0.9167611 0.8331009
## 9 950 0.9134824 0.8263732
## 9 1000 0.9133713 0.8258490
## 9 1050 0.9035352 0.8051123
## 9 1100 0.9002566 0.7984271
## 9 1150 0.8968667 0.7910900
## 9 1200 0.9133713 0.8256077
## 9 1250 0.9067028 0.8123286
## 9 1300 0.9036464 0.8053364
## 9 1350 0.9069251 0.8118805
## 9 1400 0.9001454 0.7981035
## 9 1450 0.9100926 0.8190961
## 9 1500 0.9100926 0.8190961
##
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 20
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 150, interaction.depth =
## 9, shrinkage = 0.1 and n.minobsinnode = 20.
proc.time() - ptm
## user system elapsed
## 16.97 0.09 18.26
#fancy tree plots
suppressMessages(library(rattle))
plot(fit.gbm$finalModel)
#summary(fit.rf$$finalModel)
gbmImp <- varImp(fit.gbm)
#rfImp
#plot(rfImp)
ggplot(gbmImp) + gbl_ggtheme
#summary(fit.rf$finalModel)
#Train Random Forest CV
control <- maincontrol
ptm <- proc.time()
fit.rf <- train(label~ ., data = train, method="rf", metric=fitmetric, ntree = 35, trControl=control)
fit.rf
## Random Forest
##
## 300 samples
## 20 predictor
## 2 classes: 'Deceased', 'Survived'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 240, 240, 240, 240, 240
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.8966667 0.7939077
## 11 0.9166667 0.8328082
## 20 0.9000000 0.7994747
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 11.
proc.time() - ptm
## user system elapsed
## 1.59 0.02 1.75
#fancy tree plots
suppressMessages(library(rattle))
plot(fit.rf$finalModel)
#summary(fit.rf$$finalModel)
rfImp <- varImp(fit.rf)
#rfImp
#plot(rfImp)
ggplot(rfImp) + gbl_ggtheme
#summary(fit.rf$finalModel)
#Training Support Vector Machine As a Challenger Model to Decision Tree.
control <- maincontrol
#control <- trainControl(method = "repeatedcv", number = 5,repeats=2,verbose = FALSE)
#methods "svmRadial" "svmLinear"
#
#
SVMgrid <- expand.grid(C= c(.5,1,2,5), sigma = c(0.01)) #C = 0.5, sigma =0.0001253863
ptm <- proc.time()
#fit.svm <- train(Survived~ ., data = train, method="svmLinear", metric=fitmetric, trControl=control,preProcess = #c("center","scale"),tuneLength = 10)
fit.svm <- train(label~ ., data = train, method="svmRadial", metric=fitmetric, trControl=control,preProcess = c("center", "scale"),tuneLength = 10)
fit.svm
## Support Vector Machines with Radial Basis Function Kernel
##
## 300 samples
## 20 predictor
## 2 classes: 'Deceased', 'Survived'
##
## Pre-processing: centered (20), scaled (20)
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 240, 241, 240, 239, 240
## Resampling results across tuning parameters:
##
## C Accuracy Kappa
## 0.25 0.8764351 0.7533862
## 0.50 0.8763786 0.7532587
## 1.00 0.8730453 0.7462505
## 2.00 0.8697666 0.7398198
## 4.00 0.8665444 0.7335186
## 8.00 0.8598759 0.7202493
## 16.00 0.8631546 0.7274778
## 32.00 0.8664879 0.7336732
## 64.00 0.8664879 0.7339260
## 128.00 0.8664879 0.7339260
##
## Tuning parameter 'sigma' was held constant at a value of 0.1594806
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 0.1594806 and C = 0.25.
proc.time() - ptm
## user system elapsed
## 7.87 0.08 8.39
#print(fit.svm)
#plot(fit.svm$finalModel)
#summary(fit.svm$finalModel)
svmImp <- varImp(fit.svm)
#svmImp
#plot(svmImp)
ggplot(svmImp) + gbl_ggtheme
control <- maincontrol
#install.packages("naivebayes")
library(naivebayes)
library(e1071) #naiveBayes
# # Define tuning grid
nb_grid <- expand.grid(usekernel = c(TRUE, FALSE),laplace = c(0, 0.5, 1), adjust = c(0.75, 1, 1.25, 1.5))
ptm <- proc.time()
fit.nb <- train(label~ ., data = train, method="naive_bayes", tuneGrid = nb_grid, trControl=control)
fit.nb
## Naive Bayes
##
## 300 samples
## 20 predictor
## 2 classes: 'Deceased', 'Survived'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 241, 240, 240, 239, 240
## Resampling results across tuning parameters:
##
## usekernel laplace adjust Accuracy Kappa
## FALSE 0.0 0.75 0.7296934 0.4377316
## FALSE 0.0 1.00 0.7296934 0.4377316
## FALSE 0.0 1.25 0.7296934 0.4377316
## FALSE 0.0 1.50 0.7296934 0.4377316
## FALSE 0.5 0.75 0.7296934 0.4377316
## FALSE 0.5 1.00 0.7296934 0.4377316
## FALSE 0.5 1.25 0.7296934 0.4377316
## FALSE 0.5 1.50 0.7296934 0.4377316
## FALSE 1.0 0.75 0.7296934 0.4377316
## FALSE 1.0 1.00 0.7296934 0.4377316
## FALSE 1.0 1.25 0.7296934 0.4377316
## FALSE 1.0 1.50 0.7296934 0.4377316
## TRUE 0.0 0.75 0.8433769 0.6817564
## TRUE 0.0 1.00 0.8366518 0.6682333
## TRUE 0.0 1.25 0.8265370 0.6458675
## TRUE 0.0 1.50 0.8298138 0.6519463
## TRUE 0.5 0.75 0.8433769 0.6817564
## TRUE 0.5 1.00 0.8366518 0.6682333
## TRUE 0.5 1.25 0.8265370 0.6458675
## TRUE 0.5 1.50 0.8298138 0.6519463
## TRUE 1.0 0.75 0.8433769 0.6817564
## TRUE 1.0 1.00 0.8366518 0.6682333
## TRUE 1.0 1.25 0.8265370 0.6458675
## TRUE 1.0 1.50 0.8298138 0.6519463
##
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were laplace = 0, usekernel = TRUE
## and adjust = 0.75.
proc.time() - ptm
## user system elapsed
## 6.96 0.09 7.26
#fit.nb$bestTune
nbImp <- varImp(fit.nb)
#nbImp
#plot(nbImp)
ggplot(nbImp) + gbl_ggtheme
control <- maincontrol
# # Define tuning grid
knn_grid <- expand.grid(k= c(2,3,5,7,9))
ptm <- proc.time()
fit.knn <- train(label~ ., data = train, method="knn", tuneGrid = knn_grid, trControl=control)
fit.knn
## k-Nearest Neighbors
##
## 300 samples
## 20 predictor
## 2 classes: 'Deceased', 'Survived'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 241, 240, 240, 240, 239
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 2 0.8667611 0.7312582
## 3 0.8700417 0.7370683
## 5 0.8667083 0.7305085
## 7 0.8534278 0.7028591
## 9 0.8568723 0.7097726
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 3.
proc.time() - ptm
## user system elapsed
## 1.52 0.02 1.59
fit.knn$bestTune
## k
## 2 3
#plot(fit.knn$finalModel)
knnImp <- varImp(fit.knn)
#nbImp
#plot(knnImp)
ggplot(knnImp) +gbl_ggtheme
The graph below shows two evaluation metrics for the models.
Accuracy is the percentage of correctly classifies instances out of all instances. It is more useful on a binary classification than multi-class classification problems because it can be less clear exactly how the accuracy breaks down across those classes (e.g. you need to go deeper with a confusion matrix). Learn more about Accuracy here.
Kappa or Cohen’s Kappa is like classification accuracy, except that it is normalized at the baseline of random chance on your data set. It is a more useful measure to use on problems that have an imbalance in the classes (e.g. 70-30 split for classes 0 and 1 and you can achieve 70% accuracy by predicting all instances are for class 0). Learn more about Kappa here.
The gradient algorithm denoted by gbm is the better model according to these fit metrics, but all models performed reasonably well on the training data set.
##
## Call:
## summary.resamples(object = results)
##
## Models: gbm, rf, tree, nb, svm, knn
## Number of resamples: 5
##
## Accuracy
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## gbm 0.9016393 0.9166667 0.9322034 0.9333750 0.9491525 0.9672131 0
## rf 0.8666667 0.9166667 0.9166667 0.9166667 0.9333333 0.9500000 0
## tree 0.8000000 0.8135593 0.8500000 0.8397610 0.8500000 0.8852459 0
## nb 0.7666667 0.8166667 0.8500000 0.8433769 0.8852459 0.8983051 0
## svm 0.8000000 0.8666667 0.8813559 0.8764351 0.8833333 0.9508197 0
## knn 0.8166667 0.8500000 0.8688525 0.8700417 0.8813559 0.9333333 0
##
## Kappa
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## gbm 0.8051118 0.8329621 0.8626310 0.8665816 0.8978650 0.9343380 0
## rf 0.7306397 0.8337029 0.8337029 0.8328082 0.8666667 0.8993289 0
## tree 0.6017699 0.6233314 0.7045952 0.6815205 0.7058824 0.7720235 0
## nb 0.5291480 0.6232877 0.6966292 0.6817564 0.7657707 0.7939464 0
## svm 0.6017699 0.7342193 0.7630522 0.7533862 0.7671840 0.9007054 0
## knn 0.6283784 0.6952596 0.7359307 0.7370683 0.7603018 0.8654709 0
It should be noted that we fitted our models using “accuracy” to aide with finding the best tuning parameters. However, it is good to measure models with multiple metrics and AUC-ROC is one of the additional metrics to consider when making a decision.
The last graph indicates all the models look very good in terms of AUC-ROC, but gradient boosting, random forest are standouts. This means the gradient boosting machines resulted in a better fit than all other models, judging by the AUC-ROC metric. The Naive Bayes, Support Vector Machines, and K Nearest Neighbors were not far behind GBM, but the decision tree was trailing a bit more.
The recall sensitivity is import here too, since that is our ability to accurately discern all the fatalities. The gradient boosting machines method proving to be a good model for true positives. Random forests and support vector machines do reasonably here, but naive bayes, decision tree, and knn trail significantly on this metric.
## run MLeval comparing tree 2 and 3
library(MLeval)
#??MLeval #for help
res <- evalm(list(fit.gbm,fit.tree,fit.rf,fit.nb,fit.svm,fit.knn),gnames=c('gbm','dt','rf','nb','svm','knn'))
## ***MLeval: Machine Learning Model Evaluation***
## Input: caret train function object
## Not averaging probs.
## Group 1 type: cv
## Group 2 type: cv
## Group 3 type: cv
## Group 4 type: cv
## Group 5 type: cv
## Group 6 type: cv
## Observations: 1800
## Number of groups: 6
## Observations per group: 300
## Positive: Survived
## Negative: Deceased
## Group: gbm
## Positive: 138
## Negative: 162
## Group: dt
## Positive: 138
## Negative: 162
## Group: rf
## Positive: 138
## Negative: 162
## Group: nb
## Positive: 138
## Negative: 162
## Group: svm
## Positive: 138
## Negative: 162
## Group: knn
## Positive: 138
## Negative: 162
## ***Performance Metrics***
## gbm Optimal Informedness = 0.870101986044015
## dt Optimal Informedness = 0.691894793344069
## rf Optimal Informedness = 0.843263553408481
## nb Optimal Informedness = 0.72812667740204
## svm Optimal Informedness = 0.773483628556092
## knn Optimal Informedness = 0.747987117552335
## gbm AUC-ROC = 0.98
## dt AUC-ROC = 0.88
## rf AUC-ROC = 0.97
## nb AUC-ROC = 0.91
## svm AUC-ROC = 0.92
## knn AUC-ROC = 0.92
Definition of the Terms related to measuring classification models:
Accuracy: * Classification Rate or Accuracy is given by the relation: (TP + TN) / (TP + TN + FP + FN) * It assumes equal costs for both kinds of errors. A 99% accuracy can be excellent, good, mediocre, poor or terrible depending upon the problem.
Recall: Recall can be defined as the ratio of the total number of correctly classified positive examples divide to the total number of positive examples. High Recall indicates the class is correctly recognized (a small number of FN). Recall is given by the relation: TP / (TP + FN)
AUC-ROC curve: This is a performance measurement for classification problem at various thresholds settings. ROC is a probability curve and AUC represents degree or measure of separability. It tells how much model is capable of distinguishing between classes. Higher the AUC, better the model is at binary predictions.
Summary the test results for each of the following classifier models. The results vary slightly from run to run due to random sampling.
Before PCA:
Decision Tree - Accuracy : 0.90
Random Forest - Accuracy : 0.94
Naive Bayes - Accuracy : 0.90
K Nearest Neighbors - Accuracy : 0.92
Support Vector Machines (radial)- Accuracy : 0.89
Classification Based On Association Rules (CBA) - Accuracy : .88 **this was run with no PCA involved
After PCA:
Decision Tree - Accuracy : 0.94
Random Forest - Accuracy : 0.92
Naive Bayes - Accuracy : 0.88
K Nearest Neighbors - Accuracy : 0.96
Support Vector Machines (radial)- Accuracy : 0.88
Gradient Boosting Machines - Accuracy : .96 **this was added after PCA, so there is no before PCA run
test <- test
#head(test)
#https://stackoverflow.com/questions/23891140/r-how-to-visualize-confusion-matrix-using-the-caret-package
draw_confusion_matrix <- function(cm) {
# total <- sum(cm$table)
# res <- as.numeric(cm$table)
#
# # Generate color gradients. Palettes come from RColorBrewer.
# greenPalette <- c("#F7FCF5","#E5F5E0","#C7E9C0","#A1D99B","#74C476","#41AB5D","#238B45","#006D2C","#00441B")
# redPalette <- c("#FFF5F0","#FEE0D2","#FCBBA1","#FC9272","#FB6A4A","#EF3B2C","#CB181D","#A50F15","#E64141")
# getColor <- function (greenOrRed = "green", amount = 0) {
# if (amount == 0)
# return("#FFFFFF")
# palette <- greenPalette
# if (greenOrRed == "red")
# palette <- redPalette
# colorRampPalette(palette)(100)[10 + ceiling(90 * amount / total)]
# }
#
layout(matrix(c(1,1,2)))
par(mar=c(2,2,2,2))
plot(c(100, 345), c(300, 450), type = "n", xlab="", ylab="", xaxt='n', yaxt='n')
title('CONFUSION MATRIX', cex.main=2)
# create the matrix
rect(150, 430, 240, 370, col='#a2aba5')
text(195, 435, 'Deceased ', cex=1.2)
rect(250, 430, 340, 370, col='#E64141')
text(295, 435, 'Survived', cex=1.2)
text(125, 370, 'Predicted', cex=1.3, srt=90, font=2)
text(245, 450, 'Actual', cex=1.3, font=2)
rect(150, 305, 240, 365, col='#E64141')
rect(250, 305, 340, 365, col='#a2aba5')
text(140, 400, 'Deceased ', cex=1.2, srt=90)
text(140, 335, 'Survived', cex=1.2, srt=90)
# add in the cm results
res <- as.numeric(cm$table)
text(195, 400, res[1], cex=1.6, font=2, col='white')
text(195, 335, res[2], cex=1.6, font=2, col='white')
text(295, 400, res[3], cex=1.6, font=2, col='white')
text(295, 335, res[4], cex=1.6, font=2, col='white')
# add in the specifics
plot(c(100, 0), c(100, 0), type = "n", xlab="", ylab="", main = "DETAILS", xaxt='n', yaxt='n')
text(10, 85, names(cm$byClass[1]), cex=1.2, font=2)
text(10, 70, round(as.numeric(cm$byClass[1]), 3), cex=1.2)
text(30, 85, names(cm$byClass[2]), cex=1.2, font=2)
text(30, 70, round(as.numeric(cm$byClass[2]), 3), cex=1.2)
text(50, 85, names(cm$byClass[5]), cex=1.2, font=2)
text(50, 70, round(as.numeric(cm$byClass[5]), 3), cex=1.2)
text(70, 85, names(cm$byClass[6]), cex=1.2, font=2)
text(70, 70, round(as.numeric(cm$byClass[6]), 3), cex=1.2)
text(90, 85, names(cm$byClass[7]), cex=1.2, font=2)
text(90, 70, round(as.numeric(cm$byClass[7]), 3), cex=1.2)
# add in the accuracy information
text(30, 35, names(cm$overall[1]), cex=1.5, font=2)
text(30, 20, round(as.numeric(cm$overall[1]), 3), cex=1.4)
text(70, 35, names(cm$overall[2]), cex=1.5, font=2)
text(70, 20, round(as.numeric(cm$overall[2]), 3), cex=1.4)
}
# Error: [conflicted] `layout` found in 2 packages.
# Either pick the one you want with `::` * plotly::layout * graphics::layout Or declare a preference with
# `conflict_prefer()` * conflict_prefer("layout", "plotly") * conflict_prefer("layout", "graphics")
conflict_prefer("layout", "plotly")
predict.gbm= predict(fit.gbm, test, type="raw")
#predict.tree= predict(fit.tree, test, type="prob")
#summary(predict.gbm)
#caret confusion
cm <- confusionMatrix(factor(predict.gbm), factor(test$label))
#cm
#fourfoldplot(cm$table, color = c("grey", "#E64141"), conf.level = 0, margin = 1, main = "Confusion Matrix")
draw_confusion_matrix(cm)
predict.tree= predict(fit.tree, test, type="raw")
#predict.tree= predict(fit.tree, test, type="prob")
#summary(predict.tree)
#caret confusion
cm <- confusionMatrix(factor(predict.tree), factor(test$label))
#cm
#fourfoldplot(cm$table, color = c("grey", "#E64141"), conf.level = 0, margin = 1, main = "Confusion Matrix")
draw_confusion_matrix(cm)
#Test Results SVM
predict.svm= predict(fit.svm, test, type="raw")
#predict.svm= predict(fit.svm, test, type="prob")
#summary(predict.svm)
#confusion matrix to find correct and incorrect predictions
#table(labelship=predict.svm, true=test$Survived)
#caret confusion
cm <-confusionMatrix(factor(predict.svm), factor(test$label))
#cm
#plot(cm$table)
#fourfoldplot(cm$table, color = c("grey", "#E64141"), conf.level = 0, margin = 1, main = "Confusion Matrix")
draw_confusion_matrix(cm)
predict.rf= predict(fit.rf, test, type="raw")
#predicted3= predict(fit.rf, test, type="prob")
#summary(predict.rf)
#confusion matrix to find correct and incorrect predictions
#table(labelship=predict.rf, true=test$Survived)
#caret confusion
cm <- confusionMatrix(factor(predict.rf), factor(test$label))
#cm
#fourfoldplot(cm$table, color = c("grey", "#E64141"), conf.level = 0, margin = 1, main = "Confusion Matrix")
draw_confusion_matrix(cm)
predict.knn= predict(fit.knn, test, type="raw")
#predicted3= predict(fit.knn, test, type="prob")
#summary(predict.knn)
#table(labelship=predict.rf, true=test$Survived)
#caret confusion
cm <- confusionMatrix(factor(predict.knn), factor(test$label))
#cm
#fourfoldplot(cm$table, color = c("grey", "#E64141"), conf.level = 0, margin = 1, main = "Confusion Matrix")
draw_confusion_matrix(cm)
predict.nb= predict(fit.nb, test, type="raw")
#summary(predict.nb)
#confusion matrix to find correct and incorrect predictions
#table(labelship=predict.nb, true=test$Survived)
cm <- confusionMatrix(factor(predict.nb), factor(test$label))
#cm
#fourfoldplot(cm$table, color = c("grey", "#E64141"), conf.level = 0, margin = 1, main = "Confusion Matrix")
draw_confusion_matrix(cm)
The keras library will be used to classify survival or fatality of covid-19 patient sample based upon the same biomarker data set.
require(keras)
## Loading required package: keras
#start with the clean data frame again
df <- df_clean
df <- df[,c(4,7:80)] # or df <- df %>% select(-patient_id,-admission_time,-discharge_time)
#rename outcome to label
names(df)[1] <- c("label")
df$label <- factor(df$label)
#Discretize Survived
#df$Survived=dplyr::recode(df$Survived, `0`="Deceased", `1`="Survived")
colnames(df) <- make.names(colnames(df))
head(df[,1:5]) %>% kable() %>% kable_styling()
| label | hypersensitive_cardiac_troponin_i | hemoglobin | serum_chloride | prothrombin_time |
|---|---|---|---|---|
| 0 | 19.9 | 140 | 103.1 | 14.1 |
| 0 | 16.9 | 151 | 100.5 | 14.3 |
| 0 | 0.0 | 126 | 102.9 | 13.6 |
| 0 | 4.8 | 110 | 103.1 | 16.3 |
| 0 | 5.6 | 134 | 102.2 | 14.6 |
| 0 | 19.7 | 108 | 105.8 | 12.4 |
The keras model will use the hold out approach for the data splits. 80%, 20% where the 20%, which normally simply test will be further split 80/20 into test and holdout respectively.
require(kableExtra)
dim(df)
## [1] 375 75
paste("-------")
## [1] "-------"
nDim <- dim(df)[2]-1
#nDim
x <- modelData2(df) # 80/20 train/test; holdout = test
#x <- modelDataC(df) # removes 10% for holdout then splits 80/20 for remainder
train <-x@trn
test <- x@tst
hold <- x@hld
#hold <-z[-data,] #balance what is left over after train @.90
trn <- as.data.frame(dim(train))
names(trn) <- c("dimension_training_train")
trn
## dimension_training_train
## 1 300
## 2 75
tst <-as.data.frame(dim(test))
names(tst) <- c("dimension_testing_test")
hld <-as.data.frame(dim(hold))
names(hld) <- c("dimension_testing_holdout")
train <- train
dim(train)
## [1] 300 75
test <- test
dim(test)
## [1] 75 75
hold <- hold
dim(hold)
## [1] 75 75
split <- cbind(trn,tst,hld)
split %>%kable() %>% kable_styling()
| dimension_training_train | dimension_testing_test | dimension_testing_holdout |
|---|---|---|
| 300 | 75 | 75 |
| 75 | 75 | 75 |
#view the data sets
head(train[,1:5]) %>% kable() %>% kable_styling()
| label | hypersensitive_cardiac_troponin_i | hemoglobin | serum_chloride | prothrombin_time |
|---|---|---|---|---|
| 1 | 1776.3 | 124 | 107.1 | 19.1 |
| 1 | 62.3 | 156 | 90.5 | 13.2 |
| 1 | 0.0 | 129 | 95.0 | 13.2 |
| 0 | 3.7 | 151 | 104.6 | 13.6 |
| 0 | 1.9 | 116 | 103.8 | 14.7 |
| 0 | 0.0 | 120 | 104.4 | 14.9 |
head(test[,1:5]) %>% kable() %>% kable_styling()
| label | hypersensitive_cardiac_troponin_i | hemoglobin | serum_chloride | prothrombin_time |
|---|---|---|---|---|
| 1 | 8.2 | 119 | 100.8 | 17.1 |
| 1 | 32.6 | 132 | 105.6 | 17.0 |
| 1 | 140.8 | 159 | 131.7 | 32.2 |
| 1 | 4.4 | 124 | 109.8 | 15.4 |
| 1 | 8443.5 | 124 | 103.3 | 20.0 |
| 0 | 8.1 | 130 | 101.9 | 15.1 |
head(hold[,1:5]) %>% kable() %>% kable_styling()
| label | hypersensitive_cardiac_troponin_i | hemoglobin | serum_chloride | prothrombin_time |
|---|---|---|---|---|
| 0 | 19.9 | 140 | 103.1 | 14.1 |
| 0 | 0.0 | 126 | 102.9 | 13.6 |
| 0 | 3.7 | 151 | 104.5 | 13.3 |
| 0 | 10.4 | 125 | 107.1 | 13.1 |
| 0 | 0.0 | 125 | 106.3 | 13.6 |
| 0 | 0.0 | 133 | 103.3 | 12.9 |
#devtools::install_github("rstudio/keras")
library(keras)
x_train <- train[,2:(nDim+1)] #omit label
x_test <- test[,2:(nDim+1)] #omit label
x_hold <- hold[,2:(nDim+1)] #omit label
scale data frame’s numerical columns
library(caret)
# Assuming goal class is column 10
normalize <- function(x) {
return ((x - min(x)) / (max(x) - min(x)))
}
x_train <- as.data.frame(lapply(x_train, normalize))
head(x_train[,1:5])
## hypersensitive_cardiac_troponin_i hemoglobin serum_chloride prothrombin_time
## 1 0.035526 0.6966292 0.7628205 0.1591667
## 2 0.001246 0.8764045 0.6445869 0.1100000
## 3 0.000000 0.7247191 0.6766382 0.1100000
## 4 0.000074 0.8483146 0.7450142 0.1133333
## 5 0.000038 0.6516854 0.7393162 0.1225000
## 6 0.000000 0.6741573 0.7435897 0.1241667
## procalcitonin
## 1 0.0251880357
## 2 0.0083960119
## 3 0.0034983383
## 4 0.0008745846
## 5 0.0006996677
## 6 0.0006996677
x_test <- as.data.frame(lapply(x_test, normalize))
head(x_test[,1:5])
## hypersensitive_cardiac_troponin_i hemoglobin serum_chloride prothrombin_time
## 1 0.000164 0.6685393 0.7601810 0.5310559
## 2 0.000652 0.7415730 0.7963801 0.5279503
## 3 0.002816 0.8932584 0.9932127 1.0000000
## 4 0.000088 0.6966292 0.8280543 0.4782609
## 5 0.168870 0.6966292 0.7790347 0.6211180
## 6 0.000162 0.7303371 0.7684766 0.4689441
## procalcitonin
## 1 0.003907638
## 2 0.006394316
## 3 0.032326821
## 4 0.698046181
## 5 0.043694494
## 6 0.001776199
x_hold <- as.data.frame(lapply(x_hold, normalize))
head(x_hold[,1:5])
## hypersensitive_cardiac_troponin_i hemoglobin serum_chloride prothrombin_time
## 1 0.000398 0.7865169 0.7775264 0.4378882
## 2 0.000000 0.7078652 0.7760181 0.4223602
## 3 0.000074 0.8483146 0.7880845 0.4130435
## 4 0.000208 0.7022472 0.8076923 0.4068323
## 5 0.000000 0.7022472 0.8016591 0.4223602
## 6 0.000000 0.7471910 0.7790347 0.4006211
## procalcitonin
## 1 0.0031971581
## 2 0.0021314387
## 3 0.0014209591
## 4 0.0039076377
## 5 0.0014209591
## 6 0.0007104796
Make normalized data frame a matrix shape
x_train <- as.matrix(x_train)
x_test <- as.matrix(x_test)
x_hold <- as.matrix(x_hold)
dim(x_train)
## [1] 300 74
head(x_train[,1:5]) %>% kable() %>% kable_styling()
| hypersensitive_cardiac_troponin_i | hemoglobin | serum_chloride | prothrombin_time | procalcitonin |
|---|---|---|---|---|
| 0.035526 | 0.6966292 | 0.7628205 | 0.1591667 | 0.0251880 |
| 0.001246 | 0.8764045 | 0.6445869 | 0.1100000 | 0.0083960 |
| 0.000000 | 0.7247191 | 0.6766382 | 0.1100000 | 0.0034983 |
| 0.000074 | 0.8483146 | 0.7450142 | 0.1133333 | 0.0008746 |
| 0.000038 | 0.6516854 | 0.7393162 | 0.1225000 | 0.0006997 |
| 0.000000 | 0.6741573 | 0.7435897 | 0.1241667 | 0.0006997 |
dim(x_test)
## [1] 75 74
head(x_test[,1:5]) %>% kable() %>% kable_styling()
| hypersensitive_cardiac_troponin_i | hemoglobin | serum_chloride | prothrombin_time | procalcitonin |
|---|---|---|---|---|
| 0.000164 | 0.6685393 | 0.7601810 | 0.5310559 | 0.0039076 |
| 0.000652 | 0.7415730 | 0.7963801 | 0.5279503 | 0.0063943 |
| 0.002816 | 0.8932584 | 0.9932127 | 1.0000000 | 0.0323268 |
| 0.000088 | 0.6966292 | 0.8280543 | 0.4782609 | 0.6980462 |
| 0.168870 | 0.6966292 | 0.7790347 | 0.6211180 | 0.0436945 |
| 0.000162 | 0.7303371 | 0.7684766 | 0.4689441 | 0.0017762 |
dim(x_hold)
## [1] 75 74
head(x_hold[,1:5]) %>% kable() %>% kable_styling()
| hypersensitive_cardiac_troponin_i | hemoglobin | serum_chloride | prothrombin_time | procalcitonin |
|---|---|---|---|---|
| 0.000398 | 0.7865169 | 0.7775264 | 0.4378882 | 0.0031972 |
| 0.000000 | 0.7078652 | 0.7760181 | 0.4223602 | 0.0021314 |
| 0.000074 | 0.8483146 | 0.7880845 | 0.4130435 | 0.0014210 |
| 0.000208 | 0.7022472 | 0.8076923 | 0.4068323 | 0.0039076 |
| 0.000000 | 0.7022472 | 0.8016591 | 0.4223602 | 0.0014210 |
| 0.000000 | 0.7471910 | 0.7790347 | 0.4006211 | 0.0007105 |
x_train <- array_reshape(x_train, c(nrow(x_train), nDim))
x_test <- array_reshape(x_test, c(nrow(x_test), nDim))
x_hold <- array_reshape(x_hold, c(nrow(x_hold), nDim))
#check dimensions
dim(x_train)
## [1] 300 74
dim(x_test)
## [1] 75 74
dim(x_hold)
## [1] 75 74
Note that we use the array_reshape() function rather than the dim<-() function to reshape the array. This is so that the data is re-interpreted using row-major semantics (as opposed to R’s default column-major semantics), which is in turn compatible with the way that the numerical libraries called by Keras interpret array dimensions.
The y data is an integer vector with values ranging from 0 to 1. To prepare this data for training we one-hot encode the vectors into binary class matrices using the Keras to_categorical() function:
#y_train <- as.numeric(train$label)
y_train <- train$label #these were already R factor
head(y_train)
## [1] 1 1 1 0 0 0
## Levels: 0 1
#y_test <- as.numeric(test$label)
y_test <- test$label #these were already R factor
head(y_test)
## [1] 1 1 1 1 1 0
## Levels: 0 1
y_hold <- hold$label #these were already R factor
head(y_hold)
## [1] 0 0 0 0 0 0
## Levels: 0 1
# categorical conversion
y_train <- to_categorical(y_train, 2)
y_test <- to_categorical(y_test, 2)
y_hold <- to_categorical(y_hold, 2)
https://henry090.github.io/kerastuneR/
conflict_prefer("train", "caret")
## [conflicted] Will prefer caret::train over any other package
library(reticulate)
#use_python("C:\Users\knjpo\AppData\Local\r-miniconda")
use_condaenv("r-miniconda")
#install.packages('kerastuneR')
library(keras)
library(tensorflow)
#library(kerastuneR)
#install_kerastuner()
The simplest type of model is the Sequential model, a linear stack of layers. We begin by creating a sequential model and then adding layers using the pipe (%>%) operator:
# 97.3%
# tf.fit <- keras_model_sequential()
# tf.fit %>%
# layer_dense(units = 50, activation = 'relu', input_shape = c(nDim)) %>%
# layer_dropout(rate = 0.3) %>%
# layer_dense(units = 20, activation = 'relu') %>%
# layer_dropout(rate = 0.2) %>%
# layer_dense(units = 10, activation = 'relu') %>%
# layer_dropout(rate = 0.1) %>%
# layer_dense(units = 2, activation = 'softmax')
tf.fit <- keras_model_sequential()
tf.fit %>%
layer_dense(units = 50, activation = 'relu', input_shape = c(nDim)) %>%
layer_dropout(rate = 0.3) %>%
layer_dense(units = 20, activation = 'relu') %>%
layer_dropout(rate = 0.2) %>%
layer_dense(units = 10, activation = 'relu') %>%
layer_dropout(rate = 0.1) %>%
layer_dense(units = 2, activation = 'softmax')
The input_shape argument to the first layer specifies the shape of the input data (a length 784 numeric vector representing a grayscale image). The final layer outputs a length 10 numeric vector (probabilities for each digit) using a softmax activation function.
Use the summary() function to print the details of the tf.fit:
summary(tf.fit)
## Model: "sequential"
## ________________________________________________________________________________
## Layer (type) Output Shape Param #
## ================================================================================
## dense (Dense) (None, 50) 3750
## ________________________________________________________________________________
## dropout (Dropout) (None, 50) 0
## ________________________________________________________________________________
## dense_1 (Dense) (None, 20) 1020
## ________________________________________________________________________________
## dropout_1 (Dropout) (None, 20) 0
## ________________________________________________________________________________
## dense_2 (Dense) (None, 10) 210
## ________________________________________________________________________________
## dropout_2 (Dropout) (None, 10) 0
## ________________________________________________________________________________
## dense_3 (Dense) (None, 2) 22
## ================================================================================
## Total params: 5,002
## Trainable params: 5,002
## Non-trainable params: 0
## ________________________________________________________________________________
Next, compile the tf.fit with appropriate loss function, optimizer, and metrics:
tf.fit %>% keras::compile(
loss = 'categorical_crossentropy',
optimizer = optimizer_rmsprop(),
metrics = c('accuracy')
)
Use the fit() function to train the tf.fit for 50 epochs using batches of 5:
ptm <- proc.time() #timer
tf.train <- tf.fit %>% fit(
x_train, y_train,
epochs = 33, batch_size = 30,
#epochs = 30, batch_size = 25,
#epochs = 30, batch_size = 128,
validation_split = 0.20
)
proc.time() - ptm
## user system elapsed
## 4.17 0.15 4.28
The history object returned by fit() includes loss and accuracy metrics which we can plot:
plot(tf.train)
## `geom_smooth()` using formula 'y ~ x'
# Results
#alluvial plots
plotCM <- function(cm){
cmdf <- as.data.frame(cm[["table"]])
cmdf[["color"]] <- ifelse(cmdf[[1]] == cmdf[[2]], "grey", "red")
alluvial::alluvial(cmdf[,1:2]
, freq = cmdf$Freq
, col = cmdf[["color"]]
, alpha = 0.5
, hide = cmdf$Freq == 0
)
}
require(plotly)
ggPlotCM <- function(cm){
## ggplotly version
cm_d <- as.data.frame(cm$table)
# confusion matrix statistics as data.frame
cm_st <-data.frame(cm$overall)
# round the values
cm_st$cm.overall <- round(cm_st$cm.overall,2)
# here we also have the rounded percentage values
cm_p <- as.data.frame(prop.table(cm$table))
cm_d$Perc <- round(cm_p$Freq*100,2)
# plotting the matrix
cm_d_p <- ggplot(data = cm_d, aes(x = Prediction , y = Reference, fill = Freq))+
geom_tile() +
geom_text(aes(label = paste("",Freq,",",Perc,"%")), color = 'red', size = 3) +
theme_light() +
guides(fill=FALSE) + theme_1()
# plotting the stats
#cm_st_p <- tableGrob(cm_st)
cm_d_ply <-ggplotly(cm_d_p)
return (cm_d_ply)
}
Evaluate the tf.fit’s performance on the test data:
tf.fit %>% keras::evaluate(x_test, y_test)
## loss accuracy
## 0.1227164 0.9600000
#Generate predictions on new data:
pred <-tf.fit %>% predict_classes(x_test)
#table(Predicted=pred, Actual =test$label)
cm <- confusionMatrix(factor(pred), factor(test$label))
cm
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 35 1
## 1 2 37
##
## Accuracy : 0.96
## 95% CI : (0.8875, 0.9917)
## No Information Rate : 0.5067
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.92
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9459
## Specificity : 0.9737
## Pos Pred Value : 0.9722
## Neg Pred Value : 0.9487
## Prevalence : 0.4933
## Detection Rate : 0.4667
## Detection Prevalence : 0.4800
## Balanced Accuracy : 0.9598
##
## 'Positive' Class : 0
##
draw_confusion_matrix(cm)
#plotCM(cm)
#ggPlotCM(cm)
Evaluate the tf.fit’s performance on the hold out data:
tf.fit %>% keras::evaluate(x_hold, y_hold)
## loss accuracy
## 0.1227164 0.9600000
#Generate predictions on new data:
pred2 <-tf.fit %>% predict_classes(x_hold)
#table(Predicted=pred, Actual =test$label)
cm <- confusionMatrix(factor(pred2), factor(hold$label))
cm
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 35 1
## 1 2 37
##
## Accuracy : 0.96
## 95% CI : (0.8875, 0.9917)
## No Information Rate : 0.5067
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.92
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9459
## Specificity : 0.9737
## Pos Pred Value : 0.9722
## Neg Pred Value : 0.9487
## Prevalence : 0.4933
## Detection Rate : 0.4667
## Detection Prevalence : 0.4800
## Balanced Accuracy : 0.9598
##
## 'Positive' Class : 0
##
draw_confusion_matrix(cm)
#plotCM(cm)
#ggPlotCM(cm)
This analysis is designed as a demonstration of using the Arules CBA method to predict covid 19 patient fatality using a sample of 375 patients and their biomarkers measured during hospital stay. The input data originated from another more expansive predictive modeling study on the topic, and the input data used here is simplified for this Apriori and ArulesCBA demonstration.
#Return multiple data frames in one object using the S4 class system.
setClass(
"mdl",
representation(
trn = "vector",
tst = "vector",
hld = "vector"
)
)
modelData2 <- function (df1 = df_ww) {
z <- df1
# get number of rows in the data frame
len <- as.numeric(dim(z)[1])
randIndex <- sample(as.numeric(rownames(z)[1:len]))
# setting cutpoint to 2/3 of the input data frame
cutPoint80 <- floor(.8 * len)
# setting cutpoint to 1/3 of the input data frame
cutPoint20 <- floor(.2 * len)
trn2 <- z[randIndex[1:cutPoint80],]
tst2 <- z[randIndex[cutPoint80+1:c(len-cutPoint80)],]
hld2 <- z[-randIndex[1: cutPoint80],]
mdl <- new("mdl",
trn = trn2,
tst = tst2,
hld = hld2 )
return(mdl)
}
View anonymized Covid-19 patient information
df_CBA <- read.csv("covid19_biomarkers_max.csv") # this data resulted a write.csv of df_clean data
head(df_CBA[,2:7]) %>% kable() %>% kable_styling()
| patient_id | age | gender | outcome | admission_time | discharge_time |
|---|---|---|---|---|---|
| 1 | 73 | 1 | 0 | 2020-01-30 22:12:47 | 2020-02-17 12:40:09 |
| 2 | 61 | 1 | 0 | 2020-02-04 21:39:03 | 2020-02-19 12:59:01 |
| 3 | 70 | 2 | 0 | 2020-01-23 10:59:36 | 2020-02-08 17:52:31 |
| 4 | 74 | 1 | 0 | 2020-01-31 23:03:59 | 2020-02-18 12:59:12 |
| 5 | 29 | 2 | 0 | 2020-02-01 20:59:54 | 2020-02-18 10:33:06 |
| 6 | 81 | 2 | 0 | 2020-01-24 10:47:10 | 2020-02-07 09:06:58 |
#remove first column that created when csvwas written
View slice of model data input set. The target variable outcome has been discretized.
df_CBA <- df_CBA[,-1]
# omit columns that will not be part of the model data set
#df_CBA <- df_CBA[,c(4,7:80)]
df_CBA <- df_CBA %>% dplyr::select(-patient_id,-age,-gender,-admission_time,-discharge_time)
#rename outcome to Survived
names(df_CBA)[1] <- c("outcome")
df_CBA$outcome <- factor(df_CBA$outcome)
#Discretize Survived
df_CBA$outcome=dplyr::recode(df_CBA$outcome, `0`="Deceased", `1`="Survived")
colnames(df_CBA) <- make.names(colnames(df_CBA))
head(df_CBA[,1:5]) %>% kable() %>% kable_styling()
| outcome | hypersensitive_cardiac_troponin_i | hemoglobin | serum_chloride | prothrombin_time |
|---|---|---|---|---|
| Deceased | 19.9 | 140 | 103.1 | 14.1 |
| Deceased | 16.9 | 151 | 100.5 | 14.3 |
| Deceased | 0.0 | 126 | 102.9 | 13.6 |
| Deceased | 4.8 | 110 | 103.1 | 16.3 |
| Deceased | 5.6 | 134 | 102.2 | 14.6 |
| Deceased | 19.7 | 108 | 105.8 | 12.4 |
# set model input
biomarkers <- df_CBA
During the exploratory data analysis, the following 8 features were found to be important out of 74 total. They are listed in order of relative importance.
lactate_dehydrogenase 100.00
monocytes_percent 83.50
international_standard_ratio 78.47
percent_lymphocyte 64.12
eosinophil_count 59.75
eosinophils_percent 50.99
neutrophils_percent 49.15
amino_terminal_brain_natriuretic_peptide_precursor_nt_pro_bnp 48.62
The CBA analysis will use the top four:
lactate_dehydrogenase
monocytes_percent
international_standard_ratio
percent_lymphocyte
Research to inform cutoffs for descritization of the top four features
https://cllsociety.org/toolbox/normal-lab-values/
https://cllsociety.org/toolbox/normal-lab-values/
The information will be used with mutate() and ifelse to discretize the top four features in low, normal, and high buckets based on the researched reference values.
#lactate_dehydrogenase
biomarkers <- biomarkers %>% mutate(lactate_dehydrogenase = ifelse(lactate_dehydrogenase < 140, "low"
,ifelse(lactate_dehydrogenase > 140, "high" , "normal")))
#monocytes_percent
biomarkers <- biomarkers %>% mutate(monocytes_percent = ifelse(monocytes_percent < 2, "low"
,ifelse(monocytes_percent > 8, "high" , "normal")))
#international_standard_ratio
biomarkers <- biomarkers %>% mutate(international_standard_ratio = ifelse(international_standard_ratio < 0.8, "low" ,ifelse(international_standard_ratio > 1.1, "high" , "normal")))
#percent_lymphocyte
biomarkers <- biomarkers %>% mutate(percent_lymphocyte = ifelse(percent_lymphocyte < 20, "low"
,ifelse(percent_lymphocyte > 40, "high" , "normal")))
# Make dataframe from discretized biomarkers
df.biomarkers <- data.frame(outcome = biomarkers$outcome
, lactate_dehydrogenase = biomarkers$lactate_dehydrogenase
, monocytes_percent = biomarkers$monocytes_percent
, international_standard_ratio = biomarkers$international_standard_ratio
, percent_lymphocyte = biomarkers$percent_lymphocyte)
# Make factors of all columns
df.biomarkers$outcome <- factor(df.biomarkers$outcome)
df.biomarkers$lactate_dehydrogenase <- factor(df.biomarkers$lactate_dehydrogenase)
df.biomarkers$monocytes_percent <- factor(df.biomarkers$monocytes_percent)
df.biomarkers$percent_lymphocyte <- factor(df.biomarkers$percent_lymphocyte)
head(df.biomarkers) %>% kable() %>% kable_styling()
| outcome | lactate_dehydrogenase | monocytes_percent | international_standard_ratio | percent_lymphocyte |
|---|---|---|---|---|
| Deceased | high | high | normal | normal |
| Deceased | high | high | normal | normal |
| Deceased | high | normal | normal | normal |
| Deceased | high | high | high | normal |
| Deceased | high | normal | high | normal |
| Deceased | high | high | normal | normal |
#head(df.biomarkers)
glimpse(df.biomarkers)
## Rows: 375
## Columns: 5
## $ outcome <fct> Deceased, Deceased, Deceased, Deceased...
## $ lactate_dehydrogenase <fct> high, high, high, high, high, high, hi...
## $ monocytes_percent <fct> high, high, normal, high, normal, high...
## $ international_standard_ratio <fct> normal, normal, normal, high, high, no...
## $ percent_lymphocyte <fct> normal, normal, normal, normal, normal...
The Apriori algorithm can now be used on the transformed data to generate association rules. Parameters of support,confidence and lift will be leveraged to study rules formed from Covid-19 patient biomarkers.
Support: This measure gives an idea of how frequent an items is in all the transactions. A transaction in this case means patient and their associated biomarkers.
Confidence: Confidence is an indication of how often the rule has been found to be true.
Lift: Lift is the ratio of the observed support to that expected
Several different combinations of the support, and confidence were use to filter the universe of strong rules for the data.
A set of strong rules will be obtained from these parameters. The discretized data set will be used to perform apriori analysis prior to CBA.
The initial rules view is using support of .07, confidence of .75, and maxlen=3
arule_df <- df.biomarkers
#set top n
n = 20
#n = 20
#n = 10
#n = 5
#rules = apriori(arule_df, parameter = list(supp = 0.09, conf = 0.95, maxlen = 3))
rules = apriori(arule_df, parameter = list(supp = 0.07, conf = 0.75, maxlen = 3))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.75 0.1 1 none FALSE TRUE 5 0.07 1
## maxlen target ext
## 3 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 26
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[13 item(s), 375 transaction(s)] done [0.00s].
## sorting and recoding items ... [11 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3
## Warning in apriori(arule_df, parameter = list(supp = 0.07, conf = 0.75, : Mining
## stopped (maxlen reached). Only patterns up to a length of 3 returned!
## done [0.00s].
## writing ... [76 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
#arules::inspect(head(rules,n))
inspectDT(rules)
View higher quality rules with elevated support of .11 and confidence of .79. The number rules decreased to 50 rules.
# This combination was used to produced the 53 rules strong rules
subrules <- rules[quality(rules)$support >= 0.11]
subrules <- subrules[quality(subrules)$confidence >= 0.79]
inspectDT(subrules)
View 50 rules but with lift greater than or equal to 1. The result is reduced to 47 rules meeting that criterion.
A few initial Arule observations:
monocytes_percents=high and outcome=deceased
percent_lymphocytes=low and outcomes=surviving.
The right hand side of rule as lactate_dehydrogenas=high is a common
#The lift was
subrules_lift <- subrules[quality(subrules)$lift >= 1]
inspectDT(subrules_lift)
# set to subrules to the lift =1 result
rules <- subrules_lift
The 47 strong rules are plotted below here with a few different plotting schemes.
#plot (rules, measure=c("confidence","lift"), shading="lift")
plot (rules, measure=c("support","lift"), shading="lift")
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.
plot(rules, method = "matrix", measure = "lift")
## Itemsets in Antecedent (LHS)
## [1] "{lactate_dehydrogenase=high,percent_lymphocyte=low}"
## [2] "{lactate_dehydrogenase=high,percent_lymphocyte=normal}"
## [3] "{percent_lymphocyte=low}"
## [4] "{outcome=Survived,lactate_dehydrogenase=high}"
## [5] "{outcome=Survived}"
## [6] "{lactate_dehydrogenase=high,international_standard_ratio=normal}"
## [7] "{outcome=Survived,percent_lymphocyte=low}"
## [8] "{monocytes_percent=normal,international_standard_ratio=high}"
## [9] "{monocytes_percent=normal,percent_lymphocyte=low}"
## [10] "{lactate_dehydrogenase=high,monocytes_percent=high}"
## [11] "{outcome=Survived,monocytes_percent=normal}"
## [12] "{international_standard_ratio=high,percent_lymphocyte=low}"
## [13] "{international_standard_ratio=normal,percent_lymphocyte=normal}"
## [14] "{outcome=Survived,international_standard_ratio=high}"
## [15] "{monocytes_percent=high,percent_lymphocyte=normal}"
## [16] "{percent_lymphocyte=normal}"
## [17] "{monocytes_percent=high,international_standard_ratio=normal}"
## [18] "{international_standard_ratio=normal}"
## [19] "{monocytes_percent=high}"
## [20] "{monocytes_percent=normal}"
## [21] "{monocytes_percent=normal,international_standard_ratio=normal}"
## [22] "{outcome=Deceased,monocytes_percent=normal}"
## [23] "{monocytes_percent=high,percent_lymphocyte=low}"
## [24] "{international_standard_ratio=high}"
## [25] "{outcome=Deceased,international_standard_ratio=normal}"
## [26] "{monocytes_percent=high,international_standard_ratio=high}"
## [27] "{outcome=Deceased,international_standard_ratio=high}"
## [28] "{outcome=Deceased,percent_lymphocyte=normal}"
## [29] "{outcome=Deceased,monocytes_percent=high}"
## [30] "{outcome=Deceased}"
## [31] "{}"
## Itemsets in Consequent (RHS)
## [1] "{lactate_dehydrogenase=high}" "{international_standard_ratio=high}"
## [3] "{outcome=Deceased}" "{percent_lymphocyte=low}"
## [5] "{outcome=Survived}"
plot(rules, method = "matrix" , engine="3d", measure = "lift")
## Itemsets in Antecedent (LHS)
## [1] "{lactate_dehydrogenase=high,percent_lymphocyte=low}"
## [2] "{lactate_dehydrogenase=high,percent_lymphocyte=normal}"
## [3] "{percent_lymphocyte=low}"
## [4] "{outcome=Survived,lactate_dehydrogenase=high}"
## [5] "{outcome=Survived}"
## [6] "{lactate_dehydrogenase=high,international_standard_ratio=normal}"
## [7] "{outcome=Survived,percent_lymphocyte=low}"
## [8] "{monocytes_percent=normal,international_standard_ratio=high}"
## [9] "{monocytes_percent=normal,percent_lymphocyte=low}"
## [10] "{lactate_dehydrogenase=high,monocytes_percent=high}"
## [11] "{outcome=Survived,monocytes_percent=normal}"
## [12] "{international_standard_ratio=high,percent_lymphocyte=low}"
## [13] "{international_standard_ratio=normal,percent_lymphocyte=normal}"
## [14] "{outcome=Survived,international_standard_ratio=high}"
## [15] "{monocytes_percent=high,percent_lymphocyte=normal}"
## [16] "{percent_lymphocyte=normal}"
## [17] "{monocytes_percent=high,international_standard_ratio=normal}"
## [18] "{international_standard_ratio=normal}"
## [19] "{monocytes_percent=high}"
## [20] "{monocytes_percent=normal}"
## [21] "{monocytes_percent=normal,international_standard_ratio=normal}"
## [22] "{outcome=Deceased,monocytes_percent=normal}"
## [23] "{monocytes_percent=high,percent_lymphocyte=low}"
## [24] "{international_standard_ratio=high}"
## [25] "{outcome=Deceased,international_standard_ratio=normal}"
## [26] "{monocytes_percent=high,international_standard_ratio=high}"
## [27] "{outcome=Deceased,international_standard_ratio=high}"
## [28] "{outcome=Deceased,percent_lymphocyte=normal}"
## [29] "{outcome=Deceased,monocytes_percent=high}"
## [30] "{outcome=Deceased}"
## [31] "{}"
## Itemsets in Consequent (RHS)
## [1] "{lactate_dehydrogenase=high}" "{international_standard_ratio=high}"
## [3] "{outcome=Deceased}" "{percent_lymphocyte=low}"
## [5] "{outcome=Survived}"
This sort gives us perspective in terms of what high support means in the context of the Covid-19 dataset. The highest support is .94, but that number tapers off very fast after just ten entries.
Our strong rule results sorted by confidence. The right hand side again shows lactate_dehydrogenase=high with high frequency. The highest confidence is 1 within the top 47 strong rules.
The top 47 rules are sorted again, but this time by lift. The right hand side outcome=“Survived” dominates the top 10 rules sorted by lift.
Sticking with the premise that most important higher values for most features will have a tendency to result in Covid-19 patient fatality, we use the left hand side (LHS) filter to specify outcome=Deceased. Recall that “outcome=Deceased” means the patient passed away as a result of contracting covid-19.
This results in only top 9 rules, but they are very strong ones. Not only do they have good support, confidence and lift, but they also occur a lot. Recall there are only 375 observations in the patient data set, so over 100 occurrences of a rule is high.
Primary rule filter is (supp = 0.10,conf = 0.78, maxlen = 3).
Top 9 Rules for with RHS filtered by Deceased Patients.
Top 9 Rules for with RHS filtered by Survived Patients. This results in 5 of the 9 top rules.
The two plots show five rules with RHS filtered to Survived and Deceased respectively. The size is controlled by support and color is controlled by lift.
The degree of size and color indicates level of support, and lactate_dehydrogenase is indicated with high support.
The two key plot shows many rules have between .08 and 1 confidence and support between .02 and .06.
Classification by Arules or CBA is a classification method.
The data will be split into 80/20 using a function based upon caret partitioning. The split results in 300 rows to train CBA upon and 75 for testing. The top four features are used as predictors.
set.seed(seed)
#seed
#Used 80/20 split to be consistent with the rest of the study.
x <- modelData2(df.biomarkers)
train <-x@trn
test <- x@tst
hold <- x@hld
trn <- as.data.frame(dim(train))
names(trn) <- c("dimension_of_training_dataset")
tst <-as.data.frame(dim(test))
names(tst) <- c("dimension_of_testing_dataset")
hld <-as.data.frame(dim(hold))
names(hld) <- c("dimension_of_hold_dataset")
split80_20 <- cbind(trn,tst)
split80_20 %>% kable() %>% kable_styling()
| dimension_of_training_dataset | dimension_of_testing_dataset |
|---|---|
| 300 | 75 |
| 5 | 5 |
cba_train <- train
#dim(train)
cba_test <- test
#dim(test)
cba_hold <- hold
#dim(hold)
Arc CBA is called to fit model on the training data set and the rules generated by the fit are shown.
set.seed(seed)
require(arulesCBA)
#install.packages("arc", dependencies = TRUE)
require(arc)
#use same support and confidence settings as were used in the strong rules analysis.
v_supp <- .11
v_conf <- .76
# train CBA classifers
#88% accuracy
#fit.cba.4 <- CBA(outcome ~ ., data = cba_train, supp = v_supp, conf=v_conf, verbose = FALSE)
#88% accuracy
fit.cba.4 <- arc::cba(cba_train, "outcome",rulelearning_options = list(minsupp = v_supp, minconf = v_conf,minlen = 2, maxlen = 3, maxtime = 1000,target_rule_count = 50000,trim=TRUE, find_conf_supp_thresholds = FALSE))
# 84% accuracy
#fit.cba.4 <- arc::cba(cba_train, "outcome",rulelearning_options = list(target_rule_count = 60))
#summary(fit.cba.4)
#fit.cba.4
#arules::inspect(fit.cba.4@rules)
# DT breaks on CBA train but works on arc train.
inspectDT(fit.cba.4@rules)
#https://stackoverflow.com/questions/23891140/r-how-to-visualize-confusion-matrix-using-the-caret-package
draw_confusion_matrix <- function(cm) {
# total <- sum(cm$table)
# res <- as.numeric(cm$table)
#
# # Generate color gradients. Palettes come from RColorBrewer.
# greenPalette <- c("#F7FCF5","#E5F5E0","#C7E9C0","#A1D99B","#74C476","#41AB5D","#238B45","#006D2C","#00441B")
# redPalette <- c("#FFF5F0","#FEE0D2","#FCBBA1","#FC9272","#FB6A4A","#EF3B2C","#CB181D","#A50F15","#E64141")
# getColor <- function (greenOrRed = "green", amount = 0) {
# if (amount == 0)
# return("#FFFFFF")
# palette <- greenPalette
# if (greenOrRed == "red")
# palette <- redPalette
# colorRampPalette(palette)(100)[10 + ceiling(90 * amount / total)]
# }
#
layout(matrix(c(1,1,2)))
par(mar=c(2,2,2,2))
plot(c(100, 345), c(300, 450), type = "n", xlab="", ylab="", xaxt='n', yaxt='n')
title('CONFUSION MATRIX', cex.main=2)
# create the matrix
rect(150, 430, 240, 370, col='#a2aba5')
text(195, 435, 'Deceased ', cex=1.2)
rect(250, 430, 340, 370, col='#E64141')
text(295, 435, 'Survived', cex=1.2)
text(125, 370, 'Predicted', cex=1.3, srt=90, font=2)
text(245, 450, 'Actual', cex=1.3, font=2)
rect(150, 305, 240, 365, col='#E64141')
rect(250, 305, 340, 365, col='#a2aba5')
text(140, 400, 'Deceased ', cex=1.2, srt=90)
text(140, 335, 'Survived', cex=1.2, srt=90)
# add in the cm results
res <- as.numeric(cm$table)
text(195, 400, res[1], cex=1.6, font=2, col='white')
text(195, 335, res[2], cex=1.6, font=2, col='white')
text(295, 400, res[3], cex=1.6, font=2, col='white')
text(295, 335, res[4], cex=1.6, font=2, col='white')
# add in the specifics
plot(c(100, 0), c(100, 0), type = "n", xlab="", ylab="", main = "DETAILS", xaxt='n', yaxt='n')
text(10, 85, names(cm$byClass[1]), cex=1.2, font=2)
text(10, 70, round(as.numeric(cm$byClass[1]), 3), cex=1.2)
text(30, 85, names(cm$byClass[2]), cex=1.2, font=2)
text(30, 70, round(as.numeric(cm$byClass[2]), 3), cex=1.2)
text(50, 85, names(cm$byClass[5]), cex=1.2, font=2)
text(50, 70, round(as.numeric(cm$byClass[5]), 3), cex=1.2)
text(70, 85, names(cm$byClass[6]), cex=1.2, font=2)
text(70, 70, round(as.numeric(cm$byClass[6]), 3), cex=1.2)
text(90, 85, names(cm$byClass[7]), cex=1.2, font=2)
text(90, 70, round(as.numeric(cm$byClass[7]), 3), cex=1.2)
# add in the accuracy information
text(30, 35, names(cm$overall[1]), cex=1.5, font=2)
text(30, 20, round(as.numeric(cm$overall[1]), 3), cex=1.4)
text(70, 35, names(cm$overall[2]), cex=1.5, font=2)
text(70, 20, round(as.numeric(cm$overall[2]), 3), cex=1.4)
}
# Error: [conflicted] `layout` found in 2 packages.
# Either pick the one you want with `::` * plotly::layout * graphics::layout Or declare a preference with
# `conflict_prefer()` * conflict_prefer("layout", "plotly") * conflict_prefer("layout", "graphics")
conflict_prefer("layout", "plotly")
Classification Based On Association Rules - a classifier based on association rules mined for an input data set. CBA demonstrated good accuracy at 88%, and only misclassified six fatalities. Test results with caret confusion matrix can be viewed below.
# test CBA
predict.cba <- predict(fit.cba.4,cba_test)
#cm <- table(predicted = predict.cba, true = cba_test$outcome)
cm <- confusionMatrix(predict.cba, cba_test$outcome)
cm
## Confusion Matrix and Statistics
##
## Reference
## Prediction Deceased Survived
## Deceased 36 6
## Survived 3 30
##
## Accuracy : 0.88
## 95% CI : (0.784, 0.944)
## No Information Rate : 0.52
## P-Value [Acc > NIR] : 3.57e-11
##
## Kappa : 0.759
##
## Mcnemar's Test P-Value : 0.505
##
## Sensitivity : 0.923
## Specificity : 0.833
## Pos Pred Value : 0.857
## Neg Pred Value : 0.909
## Prevalence : 0.520
## Detection Rate : 0.480
## Detection Prevalence : 0.560
## Balanced Accuracy : 0.878
##
## 'Positive' Class : Deceased
##
draw_confusion_matrix(cm)
Fortunately, a very small percentage of people who contract Covid-19 die from it. However, just the possibility of contracting a virus, which can cause fatality, has sent fears throughout the global community. Some citizens are highly susceptible to the virus and there is data that points to pre-existing health conditions as a major factor. Biomarkers are measured to help indicate phenomenons such as disease, infection, or environmental exposures. The biomarkers collected from Covid-19 patients have shown to be very useful in predicting mortality rates with very high accuracy.
Several machine learning models were tested against the Covid-19 biomarker data, and all the models did very well. We can conclude the biomarker data lends itself well to the use case of predicting Covid-19 fatalities. This was especially true when combined with the principal component analysis. Ultimately, the gradient boosting machines(gbm) and K Nearest Neighbors had the best accuracy at 96%, but other modeling algorithms were not far behind. In fact, none of the methods tested had accuracy below 88%. Even the classification by association (CBA) rules method achieved 88% accuracy using only four of seventy four biomarkers. The four biomarkers were informed by a combination of earlier analysis in the project such as principal component analysis, k-means clustering, important feature results, and correlation analysis.
To improve upon the traditional machine learning results, deep learning (DL) was employed. The Keras interface to R has presented new possibilities for R data scientists, and was used in this study. The method was proven to be very useful with this use case as it yielded the 97% accuracy with a sequential model. Although deep learning is harder to explain than the traditional machine learning (ML) models, it is still worthy of deployment consideration.
In summary, this study developed several classification models, which have been shown to predict mortality rates with accuracy rates as high 97%. These results are promising, and if used as early detection mechanisms, for patients with Covid-19, could assist in reducing case fatality rates (CFR) now or in the future. It is paramount models such as these be implementing within a hospital setting and used in conjunction with high frequency testing of the most predictive biomarkers. As part of a solution, machine learning may give doctors the edge they need to save more lives.
Related Links:
https://www.nature.com/articles/s42256-020-0180-7.pdf https://github.com/HAIRLAB/Pre_Surv_COVID_19