# Set Working Directory
setwd("c:/Users/acerna/Desktop/DSU/MLM/Project")

#Load in Packages
library(readr)
library(readxl)
library(dplyr)
library(tidyr)
library(ggplot2)
library(VIM)
library(class)

#Use the color scales in this package to make plots that are pretty, better represent your data, 
# easier to read by those with colorblindness, and print well in grey scale.
library(viridis)
library(caret)
library(mltools)
library(randomForest)
library(psych) #matrix plots
library(corrplot) #correlation plots
library(RColorBrewer) #plot coloring
library(ggpubr) #used for qqplots
library(gridExtra) #arranging qqplots in 1 graph
library(glmnet) #used for Lasso/Ridge

1 Context / Background

According to the World Health Organization “ischaemic heart disease and stroke are the world’s biggest killers, accounting for a combined 15.2 million deaths in 2016.” The Mayo Clinic says that the term heart disease describes a wide range of conditions that affect your heart. Diseases under the heart disease umbrella include blood vessel diseases, such as coronary artery disease; heart rhythm problems (arrhythmias); and heart defects you’re born with (congenital heart defects), among others.

Top 10 global causes of deaths, 2016, WHO

The term “heart disease” is often used interchangeably with the term “cardiovascular disease.” Cardiovascular disease generally refers to conditions that involve narrowed or blocked blood vessels that can lead to a heart attack, chest pain (angina) or stroke. Other heart conditions, such as those that affect your heart’s muscle, valves or rhythm, also are considered forms of heart disease.

Given the above information, it is clear why it would be beneficial to determine which factors are most important when attempting to predict heart disease.

2 Data Source

This report made use of UC Irvine’s Machine Learning Repository: Heart Disease Data Set. This directory contains 4 databases concerning heart disease diagnosis. The data was collected from four different locations:

  1. Cleveland Clinic Foundation (cleveland.data), 303 observations
  2. Hungarian Institute of Cardiology, Budapest (hungarian.data), 294 observations
  3. V.A. Medical Center, Long Beach, CA (long-beach-va.data), 200 observations
  4. University Hospital, Zurich, Switzerland (switzerland.data), 123 observations

It is composed of a mix of categorical and numeric data. Per the repository, “while the databases have 76 raw attributes, only 14 of them are actually used.” The main attributes used in this analysis are:

  1. age: age in years
  2. sex: sex (1 = male; 0 = female)
  3. cp: chest pain type
    • Value 1: typical angina
    • Value 2: atypical angina
    • Value 3: non-anginal pain
    • Value 4: asymptomatic
  4. trestbps: resting blood pressure (in mm Hg on admission to the hospital)
  5. chol: serum cholesterol in mg/dl
  6. fbs: (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)
  7. restecg: resting electrocardiographic results
    • Value 0: normal
    • Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
    • Value 2: showing probable or definite left ventricular hypertrophy by Estes’ criteria
  8. thalach: maximum heart rate achieved
  9. exang: exercise induced angina (1 = yes; 0 = no)
  10. oldpeak: ST depression induced by exercise relative to rest
  11. slope: the slope of the peak exercise ST segment
    • Value 1: upsloping
    • Value 2: flat
    • Value 3: downsloping
  12. ca: number of major vessels (0-3) colored by fluoroscopy
  13. thal: 3 = normal; 6 = fixed defect; 7 = reversible defect
  14. num: diagnosis of heart disease (angiographic disease status). This field refers to the presence of heart disease in the patient. It is integer valued from 0 (no presence) to 4. Experiments with the Cleveland database have concentrated on simply attempting to distinguish presence (values 1,2,3,4) from absence (value 0).

An angiogram is an X-ray image of blood vessels after they are filled with a contrast material. An angiogram of the heart, a coronary angiogram, is the “gold standard” for the evaluation of coronary artery disease (CAD). Angiographic images accurately reveal the extent and severity of all coronary artery blockages. Therefore, the num variable label tells us the severity of the angiographic disease of each patient after an angiogram exam.

Coronary artery disease (CAD) occurs when plaque, a sticky substance, narrows or partially obstructs coronary arteries (like sticky material stopping up a straw) and can result in reduced blood flow. This reduced blood flow may cause chest pain (angina), a warning sign of potential heart problems such as a heart attack. Plaque may also trap small blood clots, completely blocking a coronary artery suddenly, resulting in a heart attack.

Right Image: Shows blood flow blockage, RxList

Attribute #14 (num) served as my dependent variable (i.e. the value that my models predicted) based on the remaining 13 variables. I imported the 4 data sets into R, merged them into one data set, and then added the appropriate labels to the categorical variables. This produced a data set with 920 total observations.

clev <- read_csv("processed.cleveland.csv", col_names=FALSE, na="?", col_types="dffddffdfdffff")
hung <- read_csv("processed.hungarian.csv", col_names=FALSE, na="?", col_types="dffddffdfdffff")
swit <- read_csv("processed.switzerland.csv", col_names=FALSE, na="?", col_types="dffddffdfdffff")
va <- read_csv("processed.va.csv", col_names=FALSE, na="?", col_types="dffddffdfdffff")

#Merge 4 Data Sets and add correct column labels
heart <- rbind(clev, hung, swit, va)
names(heart) <- c("age", "sex", "cp", "trestbps", "chol", "fbs", "restecg", "thalach", 
                  "exang", "oldpeak", "slope", "ca", "thal", "num")

#Add Labels to Other Factor Columns if necessary
heart$sex <- factor(heart$sex, levels = c(0, 1), 
                    labels = c("female", "male"))

heart$cp <- factor(heart$cp, levels = c(1,2,3,4), 
                   labels = c("typical angina", "atypical angina", "non-anginl pain", "asymptomatic"))

heart$fbs <- factor(heart$fbs, levels = c(0, 1), 
                    labels = c("FALSE", "TRUE"))

heart$restecg <- factor(heart$restecg, levels = c(0, 1, 2), 
                        labels = c("normal", "ST-T abnormality", "hypertrophy"))

heart$exang <- factor(heart$exang, levels = c(0, 1), 
                      labels = c("no", "yes"))

heart$slope <- factor(heart$slope, levels = c(1, 2, 3), 
                      labels = c("upsloping", "flat", "downsloping"))

heart$num <- as.factor(heart$num)


glimpse(heart)
## Observations: 920
## Variables: 14
## $ age      <dbl> 63, 67, 67, 37, 41, 56, 62, 57, 63, 53, 57, 56, 56, 4...
## $ sex      <fct> male, male, male, male, female, male, female, female,...
## $ cp       <fct> typical angina, asymptomatic, asymptomatic, non-angin...
## $ trestbps <dbl> 145, 160, 120, 130, 130, 120, 140, 120, 130, 140, 140...
## $ chol     <dbl> 233, 286, 229, 250, 204, 236, 268, 354, 254, 203, 192...
## $ fbs      <fct> TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE...
## $ restecg  <fct> hypertrophy, hypertrophy, hypertrophy, normal, hypert...
## $ thalach  <dbl> 150, 108, 129, 187, 172, 178, 160, 163, 147, 155, 148...
## $ exang    <fct> no, yes, yes, no, no, no, no, yes, no, yes, no, no, y...
## $ oldpeak  <dbl> 2.3, 1.5, 2.6, 3.5, 1.4, 0.8, 3.6, 0.6, 1.4, 3.1, 0.4...
## $ slope    <fct> downsloping, flat, flat, downsloping, upsloping, upsl...
## $ ca       <fct> 0, 3, 2, 0, 0, 0, 2, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,...
## $ thal     <fct> 6, 3, 7, 3, 3, 3, 3, 3, 7, 7, 6, 3, 6, 7, 7, 3, 7, 3,...
## $ num      <fct> 0, 2, 1, 0, 0, 0, 3, 0, 2, 1, 0, 0, 2, 0, 0, 0, 1, 0,...
summary(heart)
##       age            sex                    cp         trestbps    
##  Min.   :28.00   female:194   typical angina : 46   Min.   :  0.0  
##  1st Qu.:47.00   male  :726   atypical angina:174   1st Qu.:120.0  
##  Median :54.00                non-anginl pain:204   Median :130.0  
##  Mean   :53.51                asymptomatic   :496   Mean   :132.1  
##  3rd Qu.:60.00                                      3rd Qu.:140.0  
##  Max.   :77.00                                      Max.   :200.0  
##                                                     NA's   :59     
##       chol          fbs                  restecg       thalach     
##  Min.   :  0.0   FALSE:692   normal          :551   Min.   : 60.0  
##  1st Qu.:175.0   TRUE :138   ST-T abnormality:179   1st Qu.:120.0  
##  Median :223.0   NA's : 90   hypertrophy     :188   Median :140.0  
##  Mean   :199.1               NA's            :  2   Mean   :137.5  
##  3rd Qu.:268.0                                      3rd Qu.:157.0  
##  Max.   :603.0                                      Max.   :202.0  
##  NA's   :30                                         NA's   :55     
##   exang        oldpeak                slope        ca        thal    
##  no  :528   Min.   :-2.6000   upsloping  :203   0   :181   6   : 46  
##  yes :337   1st Qu.: 0.0000   flat       :345   3   : 20   3   :196  
##  NA's: 55   Median : 0.5000   downsloping: 63   2   : 41   7   :192  
##             Mean   : 0.8788   NA's       :309   1   : 67   NA's:486  
##             3rd Qu.: 1.5000                     NA's:611             
##             Max.   : 6.2000                                          
##             NA's   :62                                               
##  num    
##  0:411  
##  2:109  
##  1:265  
##  3:107  
##  4: 28  
##         
## 

3 Business Problem, Question & Hypothesis

Million Hearts claims that “approximately 1.5 million heart attacks and strokes occur every year in the United States.” In terms of monetary costs, WebMD states that “in 2010, the cost of cardiovascular disease in the U.S. was about $444 billion. That includes costs for treatment of: heart conditions, stroke, peripheral artery disease, and high blood pressure.” The treatment of these diseases accounts for $1 of every $6 spent on healthcare in the U.S.

This analysis attempted to answer the following question: which health markers are considered to be the most important when trying to predict the presence of heart disease? This huge number of deaths and dollars spent could be reduced if there were a way to pinpoint these health markers. This analysis looked at four common supervised machine learning classification models. It compared and contrasted the predictive power of each, as well as their interpretability. When should we choose one over the other?

Growing up in a family that has been affected by high blood pressure and heart problems has shifted our attention to trying to live a healthier lifestyle. My mother has tried to reduce the amount of oil she uses when prepping meals and tells us not to eat excess amounts of shrimp because of their high cholesterol levels. Foods that contain high amounts of cholesterol are consistently scrutinized at home. Therefore, I hypothesized that the variable chol would appear in the top 3 most important measurements when it comes to predicting num (heart disease). I also hypothesized that cp (chest pain) would be a significant predictor variable.

4 Data Preparation

At the end of the Data Source section, the summary table of the combined data set showed that many variables contained missing values (NA). I first explored the amount of missing data on a per-observation basis (i.e. by row) and then on a per-variable basis (i.e. by column).

4.1 Missing Values: By Observations

After exploring the total number of NA values for each individual observation, I noticed that only 299 of the 920 observations (32.5%) had zero missing values. I decided to completely remove the observations that had at least 7 NA values. In other words, if 7 of the 14 measurements for a given observation were missing, then at least 50% of the data for that observation was missing. These observations weren’t very meaningful because they would require a large amount of imputation.

#----------- Exploring Missing Values: By Observations (Rows)

table(rowSums(is.na(heart))) #number of observations with x number of NA values
## 
##   0   1   2   3   4   5   6   7   8 
## 299  74 208 253  29   3   1  51   2
table(rowSums(is.na(heart))) / nrow(heart) #percentage of observations with x number of NA values
## 
##           0           1           2           3           4           5 
## 0.325000000 0.080434783 0.226086957 0.275000000 0.031521739 0.003260870 
##           6           7           8 
## 0.001086957 0.055434783 0.002173913
# Remove observations with 7+ variables of missing values (NA)
# There are 14 variables, and 7+ missing values per observation means at least 50% of the data 
# for that particular observation is missing. 
heart2 <- heart[rowSums(is.na(heart)) < 7,] #removed 53 observations
table(rowSums(is.na(heart2)))
## 
##   0   1   2   3   4   5   6 
## 299  74 208 253  29   3   1

A total of 53 observations were removed. I ended up with 867 observations.
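
As a quick check (a small addition, not part of the original code), the row counts before and after the filtering confirm these numbers:

#Verify row counts (added for clarity)
nrow(heart) - nrow(heart2) #observations removed
nrow(heart2) #observations remaining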

4.2 Missing Values: By Variables

Once I removed the observations that had at least 50% of their data missing, I turned my attention to individual variables. The plot below shows that the variables ca and thal had more than 50% of their data missing. This implied that these variables didn’t provide much information for the predictive models, and imputation for such a large amount of missing data would also be cumbersome. Therefore, I removed these variables from my data set entirely. After this, 8 of the remaining 12 variables required some form of data imputation to deal with the remaining missing values.

#----------- Exploring Missing Values: By Variables (Columns)
# Visualize NA Values       
heart_plot <- aggr(heart2, col=c("dimgrey", "coral2"), numbers=TRUE, 
                   sortVars=TRUE, labels=names(heart), cex.axis=0.7,
                   gap=3, ylab=c("% of Missing Data", "Pattern of Missing Data"))

## 
##  Variables sorted by number of missings: 
##  Variable       Count
##        ca 0.643598616
##      thal 0.500576701
##     slope 0.295271050
##       fbs 0.103806228
##      chol 0.031141869
##   oldpeak 0.010380623
##  trestbps 0.006920415
##   restecg 0.002306805
##   thalach 0.002306805
##     exang 0.002306805
##       age 0.000000000
##       sex 0.000000000
##        cp 0.000000000
##       num 0.000000000
#Drop the 'ca' and 'thal' variables that have 50% + missing data
drops <- c("ca", "thal")
heart3 <- heart2[, !(names(heart2) %in% drops)]

sort(colSums(is.na(heart3)) / nrow(heart3)) #8 of 12 Variables with NA's will require imputation
##         age         sex          cp         num     restecg     thalach 
## 0.000000000 0.000000000 0.000000000 0.000000000 0.002306805 0.002306805 
##       exang    trestbps     oldpeak        chol         fbs       slope 
## 0.002306805 0.006920415 0.010380623 0.031141869 0.103806228 0.295271050
summary(heart3)
##       age            sex                    cp         trestbps    
##  Min.   :28.00   female:193   typical angina : 42   Min.   :  0.0  
##  1st Qu.:46.00   male  :674   atypical angina:168   1st Qu.:120.0  
##  Median :54.00                non-anginl pain:186   Median :130.0  
##  Mean   :53.15                asymptomatic   :471   Mean   :132.1  
##  3rd Qu.:60.00                                      3rd Qu.:140.0  
##  Max.   :77.00                                      Max.   :200.0  
##                                                     NA's   :6      
##       chol          fbs                  restecg       thalach     
##  Min.   :  0.0   FALSE:660   normal          :532   Min.   : 60.0  
##  1st Qu.:175.0   TRUE :117   ST-T abnormality:152   1st Qu.:120.0  
##  Median :224.0   NA's : 90   hypertrophy     :181   Median :140.0  
##  Mean   :199.2               NA's            :  2   Mean   :137.5  
##  3rd Qu.:268.0                                      3rd Qu.:157.0  
##  Max.   :603.0                                      Max.   :202.0  
##  NA's   :27                                         NA's   :2      
##   exang        oldpeak                slope     num    
##  no  :528   Min.   :-2.6000   upsloping  :203   0:392  
##  yes :337   1st Qu.: 0.0000   flat       :345   2:102  
##  NA's:  2   Median : 0.5000   downsloping: 63   1:252  
##             Mean   : 0.8788   NA's       :256   3: 95  
##             3rd Qu.: 1.5000                     4: 26  
##             Max.   : 6.2000                            
##             NA's   :9

4.3 Outliers: trestbps & chol

Aside from the NA values, one other thing that stood out was that some observations had a value of zero for trestbps (resting blood pressure) and chol (serum cholesterol in mg/dl). This is obviously incorrect information, which means that these zero values required imputation as well. Only 1 observation had a 0 value for trestbps; however, 163 observations had a zero value for chol. This large number of zeros can be seen toward the bottom of the boxplot.

#----------- Fix Outliers: Variables trestbps (diastolic/resting blood pressure) and 
#----------- chol (serum cholesterol in mg/dl) cannot be zero.
heart3[which(heart3$trestbps == 0),] #one observation with trestbps = 0
age  sex   cp               trestbps  chol  fbs    restecg  thalach  exang  oldpeak  slope  num
 55  male  non-anginl pain         0     0  FALSE  normal       155  no         1.5  flat     3
dim(heart3[which(heart3$chol == 0),]) #163 observations with chol = 0
## [1] 163  12
heart3 %>%
  ggplot(aes(x=num,y=chol, fill=num)) +
  geom_boxplot(na.rm = TRUE) +
  scale_fill_viridis(discrete = TRUE, alpha=0.6) +
  geom_jitter(color="black", size=0.4, alpha=0.9) +
  theme(legend.position="none", plot.title = element_text(size=11)) +
  ggtitle("Boxplot: chol by num")

4.4 Data Imputation

4.4.1 Imputing Continuous Variables

The goal of this section was to use the heart disease variable num to find the marginal average of each variable and then impute accordingly. For example, I first filtered out all trestbps values that were either NA or set to zero (the invalid value explained in the previous section). I then grouped the remaining values by the num label and found the average trestbps for each category. Lastly, I used these averages to impute the missing and zero values.

The 4 continuous variables that required imputation were: trestbps, chol, thalach, and oldpeak.

#----------- Data Imputation For Continuous Variables
# Goal: Use our heart disease dependent variable 'num' to find marginal
# average of each variable. Impute accordingly.
# Continuous Variables: trestbps, chol, thalach, oldpeak

#Create a version 4 which will have the manually imputated data
heart4 <- heart3

#------- Imputing 'trestbps'
#Check the average trestbps for each of the 5 heart diagnoses where trestbps is not NA and not 0 (invalid value)
imp1 <- heart3 %>%
  filter(!is.na(trestbps) & trestbps != 0) %>%
  group_by(num) %>%
  summarise(avgTrestbps = mean(trestbps)) 

imp1 <- round(imp1$avgTrestbps)

#Impute accordingly
for (i in 0:4) {
  heart4$trestbps[which((is.na(heart4$trestbps) | heart4$trestbps == 0) & heart4$num == i)] <- imp1[i+1]
}
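
As a quick sanity check (a small addition, not part of the original code), no NA or zero values should remain in trestbps after this loop:

#Sanity check (added): should return 0 if every NA/zero trestbps value was imputed
sum(is.na(heart4$trestbps) | heart4$trestbps == 0)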


#------- Imputing 'chol'
#Check the average chol for each of the 5 heart diagnoses where chol is not NA and not 0 (invalid value)
imp2 <- heart3 %>%
  filter(!is.na(chol) & chol != 0) %>%
  group_by(num) %>%
  summarise(avgChol = mean(chol))

imp2 <- round(imp2$avgChol)

#Impute accordingly
for (i in 0:4) {
  heart4$chol[which((is.na(heart4$chol) | heart4$chol == 0) & heart4$num == i)] <- imp2[i+1]
}

#------- Imputing 'thalach'
#Check the average thalach for each of the 5 heart diagnoses
imp3 <- heart3 %>%
    filter(!is.na(thalach)) %>%
    group_by(num) %>%
    summarise(avgThalach = mean(thalach))

imp3 <- round(imp3$avgThalach, 1)

#Impute accordingly
for (i in 0:4) {
  heart4$thalach[which(is.na(heart4$thalach) & heart4$num == i)] <- imp3[i+1]
}

#------- Imputing 'oldpeak'
#Check the average oldpeak for each of the 5 heart diagnoses
imp4 <- heart3 %>%
  filter(!is.na(oldpeak)) %>%
  group_by(num) %>%
  summarise(avgOldpeak = mean(oldpeak))

imp4 <- round(imp4$avgOldpeak, 1)

#Impute accordingly
for (i in 0:4) {
  heart4$oldpeak[which(is.na(heart4$oldpeak) & heart4$num == i)] <- imp4[i+1]
}

sort(colSums(is.na(heart4)) / nrow(heart4)) #4 of 12 Variables with NA's still require imputation
##         age         sex          cp    trestbps        chol     thalach 
## 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 
##     oldpeak         num     restecg       exang         fbs       slope 
## 0.000000000 0.000000000 0.002306805 0.002306805 0.103806228 0.295271050

4.4.2 Imputing Categorical Variables

This section used the heart disease dependent variable num, together with sex, to find the marginal counts of each variable. I then imputed the missing values with the most common category. The four categorical variables imputed in this section were fbs, restecg, exang, and slope.

For the fbs variable, it is clear that across all of the subcategories, “FALSE” is the most likely label. Therefore, all NA’s within fbs were set to FALSE.

#---------------------------------------------------------------------------------
#----------- Data Imputation For Categorical Variables
# Goal: Use our heart disease dependent variable 'num' to find marginal
# counts of each variable. Impute accordingly.
# Categorical variables: fbs, restecg, exang, slope

#------- Imputing 'fbs'
heart3 %>%
  filter(!is.na(fbs)) %>%
  ggplot(aes(x = num, fill=fbs)) + 
  geom_bar(position = "dodge") +
  facet_grid(~sex) +
  scale_fill_brewer(palette = "Set1") +
  labs(title="fbs Frequency: By num and sex ")

# Obvious that fbs is most likely FALSE
heart4$fbs[is.na(heart4$fbs)] <- FALSE

I repeated the same process for restecg. There were two observations with an NA value, both male with num = 1. After plotting the frequencies, it was evident that the most likely restecg label was “normal”, so this value was imputed.

#------- Imputing 'restecg'
#Check distribution
heart3[is.na(heart3$restecg),] #only two observations with NA
age  sex   cp              trestbps  chol  fbs    restecg  thalach  exang  oldpeak  slope      num
 55  male  typical angina       140   295  FALSE  NA           136  no         0.0  NA           1
 34  male  asymptomatic         115     0  NA     NA           154  no         0.2  upsloping    1
heart3 %>%
  filter(!is.na(restecg)) %>%
  ggplot(aes(x = num, fill=restecg)) + 
  geom_bar(position = "dodge") +
  facet_grid(~sex) +
  scale_fill_brewer(palette = "Dark2") +
  labs(title="restecg Frequency: By num and sex")

# Both are Male with num = 1. Most likely restecg = 'normal'
heart4$restecg[is.na(heart4$restecg)] <- 'normal'

There were only two observations with an NA value for exang. The first was a female with num = 0; based on these criteria, the most likely label for exang would be “no”. The second was a male with num = 3, for which the most likely label would be exang = “yes”.

#------- Imputing 'exang'
#Check distribution
heart3[is.na(heart3$exang),] #only two observations with NA
age  sex     cp               trestbps  chol  fbs    restecg           thalach  exang  oldpeak  slope      num
 48  female  atypical angina        NA   308  FALSE  ST-T abnormality       NA  NA           2  upsloping    0
 69  male    asymptomatic           NA     0  FALSE  ST-T abnormality       NA  NA          NA  NA           3
which(is.na(heart3$exang)) #observations row number
## [1] 394 715
heart3 %>%
  filter(!is.na(exang)) %>%
  ggplot(aes(x = num, fill=exang)) + 
  geom_bar(position = "dodge") +
  facet_grid(~sex) +
  scale_fill_brewer(palette = "Set1") +
  labs(title="exang Frequency: By num and sex")

#Observation 394: female with num=0. Set exang most likely to be 'no'
heart4$exang[394] <- 'no'

#Observation 715: male with num=3. Set exang most likely to be 'yes'
heart4$exang[715] <- 'yes'

Lastly, I imputed the slope variable. There were a total of 256 observations with slope = NA. My frequency plots showed that, regardless of sex, “upsloping” was the most common slope value for num = 0, and “flat” was the most common slope value for num = 1, 2, 3, or 4. The “downsloping” value was not common in either of the subcategories, so it was not used for imputation. This completed my data imputation.

#------- Imputing 'slope'
nrow(heart3[is.na(heart3$slope),]) #256 observations with slope = NA
## [1] 256
heart3 %>%
  filter(!is.na(slope)) %>%
  ggplot(aes(x = num, fill=slope)) + 
  geom_bar(position = "dodge") +
  facet_grid(~sex) +
  scale_fill_brewer(palette = "Dark2") +
  labs(title="slope Frequency: By num and sex")

heart4$slope[is.na(heart4$slope) & heart4$num == 0] <- 'upsloping'
heart4$slope[is.na(heart4$slope) & heart4$num %in% c(1,2,3,4)] <- 'flat'
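
As a final check (a small addition, not part of the original code), the fully imputed data set should now contain no missing values at all:

#Final check (added): every column of heart4 should now show 0 NA values
colSums(is.na(heart4))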

4.5 Binary Dependent Variable

For my final data preparation stage, I created two copies of my clean data set. My first copy was the original multi-class data set that kept the 5 different num categories. My second data set transformed the num variable into a binary label: 0 = ‘no disease’ vs. 1 = ‘displaying heart disease’ (i.e. num = 1, 2, 3, or 4). This served as the primary data set that I used for my four binary classification models during my modeling phase.

#create two copies: one will maintain the 5 categories of heart disease
# The second will make num into a binary variable for heart disease:
# disease vs. no disease
multiheart <- heart4 # 'num' is multiclass
binheart <- heart4 # 'num' is binary


#------- Dependent Variable: Make 'num' column into binary variable. 
# 0 = 'no disease' vs. 1 = 'displaying heart disease' 
binheart$num[binheart$num %in% c(1,2,3,4)] <- 1
binheart$num <- factor(binheart$num, levels = c(0, 1), 
                    labels = c("no disease", "heart disease"))

5 Exploratory Data Analysis

I began my EDA by creating two plot matrices that displayed the marginal and two-dimensional distributions. Due to the large number of variables, I broke the plot matrices into two parts: numeric data and categorical data.

Looking at the marginal plots of the numeric data, we can see that all variables appear to more or less follow a bell-shaped curve, with the exception of oldpeak. I checked their normality using a Shapiro-Wilk test, which showed that these numeric variables are NOT normally distributed. The Q-Q plots confirm this as well.

The correlation plot also showed that these numeric variables are not highly correlated with one another, which is good for modeling purposes.

# Exploratory Data Analysis:
all.cat <- c("sex", "cp", "fbs", "restecg", "exang", "slope", "num")
all.numer <- c("age", "trestbps", "chol", "thalach", "oldpeak")

pairs.panels(binheart[,all.numer], 
             method="pearson", 
             hist.col = "#00AFBB",
             density = TRUE,  # show density plots
             ellipses = TRUE # show correlation ellipses
)

#Shapiro Wilk Test for Normality: All reject null hypothesis
#i.e. variables are NOT normally distributed
apply(binheart[,all.numer], 2, shapiro.test)
## $age
## 
##  Shapiro-Wilk normality test
## 
## data:  newX[, i]
## W = 0.99141, p-value = 6.122e-05
## 
## 
## $trestbps
## 
##  Shapiro-Wilk normality test
## 
## data:  newX[, i]
## W = 0.96923, p-value = 1.463e-12
## 
## 
## $chol
## 
##  Shapiro-Wilk normality test
## 
## data:  newX[, i]
## W = 0.9374, p-value < 2.2e-16
## 
## 
## $thalach
## 
##  Shapiro-Wilk normality test
## 
## data:  newX[, i]
## W = 0.99041, p-value = 1.968e-05
## 
## 
## $oldpeak
## 
##  Shapiro-Wilk normality test
## 
## data:  newX[, i]
## W = 0.85201, p-value < 2.2e-16
#Checking for normality of continuous variables
q1 <- ggqqplot(binheart$age, main="age: Normal Q-Q Plot")
q2 <- ggqqplot(binheart$trestbps, main="trestbps: Normal Q-Q Plot")
q3 <- ggqqplot(binheart$chol, main="chol: Normal Q-Q Plot")
q4 <- ggqqplot(binheart$thalach, main="thalach: Normal Q-Q Plot")
q5 <- ggqqplot(binheart$oldpeak, main="oldpeak: Normal Q-Q Plot")

grid.arrange(q1, q2, q3, q4, q5, ncol=3)

#Plotting Correlations of Continuous Variables
corrplot.mixed(cor(binheart[,all.numer]),
               lower.col='black',
               upper="square",
               title="Correlation Plot of Continuous Variables",
               mar=c(0,0,1,0) )

The plot matrix for the categorical variables was a little difficult to interpret due to overlapping values, so I made use of the jitter (jiggle) option for the sections of the plots with a higher concentration of data points. The marginal plot of the num dependent variable shows that our binary classes are not highly imbalanced: roughly 45% of the data set is composed of “no disease” observations and 55% of “heart disease” observations.

pairs.panels(binheart[,all.cat], 
             method="pearson", 
             hist.col = "#800000",
             density = FALSE,
             ellipses = FALSE,
             smooth = FALSE,
             jiggle = TRUE, #jitter points
             factor=1 #jitter amount
)

table(binheart$num) / length(binheart$num) #Comparing ratio of heart disease classes
## 
##    no disease heart disease 
##     0.4521338     0.5478662

6 Data Analysis / Modeling

This section looked at four of the most common supervised classification models: Lasso Logistic Regression, Ridge Logistic Regression, K-Nearest Neighbors, and Random Forest. First, I split my data into an 80% training set and a 20% test set. The test set was NOT touched until the very end, once the hyperparameters of the models had been tuned. In the training phase, I used 10-fold cross-validation to find the optimal model. The 20% test set was then used to compare the final accuracy of all 4 models.

#---------------------------------------------------------------------------------
#---------------------------Predictors: Categorical and Numeric
cat <- c("sex", "cp", "fbs", "restecg", "exang", "slope") #dont include 'num' for one hot encode
numer <- c("age", "trestbps", "chol", "thalach", "oldpeak")

#---------------------------Split the data into training and test set
set.seed(123)
training.samples <- binheart$num %>% 
  createDataPartition(p = 0.8, list = FALSE)
train.data  <- binheart[training.samples,] #non-standardized train data
test.data <- binheart[-training.samples,] #non-standardized test data

6.1 Standardization of Data

My first three models required the data to be standardized (i.e. mean = 0 and standard deviation = 1). Lasso and ridge both place “constraints on the size of the coefficients associated to each variable. However, this value will depend on the magnitude of each variable. It is therefore necessary to center and reduce, or standardize, the variables.” The K-Nearest Neighbors model also required standardization because it uses Euclidean distance when assessing which neighbors are considered the ‘nearest.’ Note that standardization is NOT necessary for Random Forest.

The code below one-hot encodes my categorical variables and then standardizes all of the predictor columns. I ended up with 4 data sets: ‘train.final’ and ‘test.final’ are one-hot encoded and standardized, while ‘train.data’ and ‘test.data’ are non-standardized and left as is.

#---------------------------------------------------------------------------------
#----------------Standardize Function
#create standardization function: subtract mean and divide by sd for each column
# z = data frame to standardize
# avg.vec = vector of column means
# sd.vec = vector of column standard deviations
standardize <- function(z, avg.vec, sd.vec) {
  rv <- sweep(z, 2, avg.vec,"-")  #subtracting mean
  rv <- sweep(rv, 2, sd.vec, "/")  # dividing by sd
  return(rv)
}
#----------------Standardize Train Data

#one Hot Encode Categorical Variables (dont include dependent variable)
train.cat <- one_hot(as.data.table(train.data[,cat]))

#Merge One hot encode variables with numeric 
train.full <- cbind(train.data[,numer], train.cat)

#Find column means and sd for training data
train.mean <-  apply(train.full, 2, mean)
train.sd <- apply(train.full,2, sd)

#Final Standardized Train Data with dependent variable
train.final <- cbind(standardize(train.full, train.mean, train.sd), train.data[,"num"])

#----------------Standardize Test Data (Note: Using Train Mean and Train SD)
#one Hot Encode Categorical Variables (dont include dependent variable)
test.cat <- one_hot(as.data.table(test.data[,cat]))

#Merge One hot encode variables with numeric 
test.full <- cbind(test.data[,numer], test.cat)


#Final Standardized test Data with dependent variable
#Note: Use train mean and train sd to keep data in same scale
test.final <- cbind(standardize(test.full, train.mean, train.sd), test.data[,"num"])

# ******4 Data Sets used for modeling
# Non-standardized: train.data & test.data
# One Hot Encoded Standardized: train.final & test.final.
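
As a quick sanity check (a small addition, not in the original code), the standardized training columns should now have means of roughly 0 and standard deviations of roughly 1; the test columns, which were scaled with the training mean and sd, will be close to but not exactly 0 and 1:

#Sanity check (added): standardized training columns should have mean ~0 and sd ~1
round(colMeans(train.final[, -ncol(train.final)]), 3)
round(apply(train.final[, -ncol(train.final)], 2, sd), 3)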

6.2 Model 1: Lasso Logistic Regression

Lasso Logistic Regression is a form of penalized regression that is used for model selection. The lambda value ‘shrinks’ unimportant coefficients and can bring them all the way down to zero, effectively removing them from the model entirely. I tuned this hyperparameter with 10-fold cross-validation. Notice that the plot displays the CV error according to log(lambda). The left dashed vertical line indicates that the log of the optimal value of lambda is approximately -5, which is the value that minimizes the prediction error.

The purpose of regularization is to balance accuracy and simplicity: a model with the smallest number of predictors that still gives a ‘good’ accuracy rate relative to the optimum. The function cv.glmnet() also finds the value of lambda that gives the simplest model while lying within one standard error of the optimal lambda. This value is called lambda.1se and corresponds to the right vertical line on the plot.

#---------------------------------------------------------------------------------
# Model #1: Lasso Logistic Regression
# Requires One Hot Encoded Standardized Data
#----------------
# Matrix of Predictor Variables (dont include 'num')
x <- as.matrix(train.final[,-ncol(train.final)])

# Convert the outcome (class) to a numerical variable
y <- ifelse(train.final$num == "heart disease", 1, 0)


# Find the best lambda using 10 fold cross-validation. alpha = 1 is for lasso
# i.e. Find the optimal value of lambda that minimizes the CV error
set.seed(123) 
cv.lasso <- cv.glmnet(x, y, alpha = 1, family = "binomial", nfolds=10)

# Plot displays the CV error according to the log(lambda)
plot(cv.lasso)

cv.lasso$lambda.min
## [1] 0.006622656
log(cv.lasso$lambda.min) #left vertical line
## [1] -5.017259
cv.lasso$lambda.1se 
## [1] 0.03220334
log(cv.lasso$lambda.1se) #right vertical line
## [1] -3.435685

Because I planned to compare my four models’ accuracy rates, I trained my final model with the lambda.min value. The coefficients of my final lasso logistic regression model are displayed below, as well as the final prediction accuracy on the hold-out test data set. My test accuracy rate was 82.65896%.

# Fit the final model on the training data
lasso.model <- glmnet(x, y, alpha = 1, family = "binomial",
                      lambda = cv.lasso$lambda.min)

# Display regression coefficients
coef(lasso.model)
## 22 x 1 sparse Matrix of class "dgCMatrix"
##                                     s0
## (Intercept)               2.633197e-01
## age                       2.473973e-01
## trestbps                  8.905061e-02
## chol                     -2.115127e-01
## thalach                  -1.581075e-01
## oldpeak                   3.706193e-01
## sex_female               -5.594024e-01
## sex_male                  4.607935e-15
## cp_typical angina         .           
## cp_atypical angina       -4.514957e-02
## cp_non-anginl pain        .           
## cp_asymptomatic           8.034857e-01
## fbs_FALSE                 .           
## fbs_TRUE                  .           
## restecg_normal            .           
## restecg_ST-T abnormality  3.471912e-02
## restecg_hypertrophy       .           
## exang_no                 -1.965360e-01
## exang_yes                 2.389183e-15
## slope_upsloping          -6.829909e-01
## slope_flat                3.449807e-01
## slope_downsloping         .
# Make predictions on the test data
x.test <- as.matrix(test.final[,-ncol(test.final)])
probabilities <- lasso.model %>% predict(newx = x.test, type = "response") #predicted probabilities
predicted.classes <- ifelse(probabilities > 0.5, "heart disease", "no disease")

# Model accuracy
observed.classes <- test.final$num
mean(predicted.classes == observed.classes) #0.8265896
## [1] 0.8265896

6.3 Model 2: Ridge Logistic Regression

Similar to lasso, ridge is another form of penalized logistic regression. The difference is that ridge only shrinks coefficient values toward zero but never reduces them all the way to zero. Therefore, ridge regression keeps all coefficients in the model and simply shrinks their magnitudes. This is why lasso is a more effective way of doing variable selection.

The process and code were identical to the lasso; I only had to set alpha = 0 to specify a ridge model. The final test accuracy for this model was 83.23699%, a slight improvement over the lasso model. However, the lasso model is a bit more interpretable because it uses fewer variables.

#---------------------------------------------------------------------------------
# Model #2: Ridge Logistic Regression

#----------------

# Find the best lambda using 10 fold cross-validation. alpha = 0 is for ridge
# i.e. Find the optimal value of lambda that minimizes the CV error
set.seed(123) 
cv.ridge <- cv.glmnet(x, y, alpha = 0, family = "binomial", nfolds=10)

# Plot displays the CV error according to the log(lambda)
# Left dashed vertical line indicates that the log of the 
# optimal value of lambda is approximately -5, which is the 
# one that minimizes the prediction error.
plot(cv.ridge)

cv.ridge$lambda.min
## [1] 0.03003294
log(cv.ridge$lambda.min) #left vertical line
## [1] -3.505461
# Purpose of regularization is to balance accuracy and simplicity. 
# This means, a model with the smallest number of predictors that also 
# gives a good accuracy. Function cv.glmnet() finds also the value of 
# lambda that gives the simplest model but also lies within one standard 
# error of the optimal value of lambda. This value is called lambda.1se.
cv.ridge$lambda.1se 
## [1] 0.1460382
log(cv.ridge$lambda.1se) #right vertical line
## [1] -1.923887
# Fit the final model on the training data
ridge.model <- glmnet(x, y, alpha = 0, family = "binomial",
                      lambda = cv.ridge$lambda.min)

# Display regression coefficients
coef(ridge.model)
## 22 x 1 sparse Matrix of class "dgCMatrix"
##                                     s0
## (Intercept)               0.2664624910
## age                       0.2254069053
## trestbps                  0.1059003978
## chol                     -0.2065357036
## thalach                  -0.1830542682
## oldpeak                   0.3443846682
## sex_female               -0.2636030692
## sex_male                  0.2629080955
## cp_typical angina        -0.0988940810
## cp_atypical angina       -0.2852317072
## cp_non-anginl pain       -0.2184684213
## cp_asymptomatic           0.4489077919
## fbs_FALSE                -0.0004151994
## fbs_TRUE                  0.0005184995
## restecg_normal           -0.0325449856
## restecg_ST-T abnormality  0.0516350786
## restecg_hypertrophy      -0.0112171893
## exang_no                 -0.1308310527
## exang_yes                 0.1305933596
## slope_upsloping          -0.4891743973
## slope_flat                0.4493948234
## slope_downsloping         0.0650008646
# Make predictions on the test data
x.test <- as.matrix(test.final[,-ncol(test.final)])
probabilities <- ridge.model %>% predict(newx = x.test, type = "response") #predicted probabilities
predicted.classes <- ifelse(probabilities > 0.5, "heart disease", "no disease")

# Model accuracy
observed.classes <- test.final$num
mean(predicted.classes == observed.classes) #0.8323699
## [1] 0.8323699

6.4 Model 3: K-Nearest Neighbors

The KNN algorithm is a non-parametric method that is very popular for classification due to its lack of assumptions about the distribution of the data. Per Wikipedia, “an object is classified by a plurality vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors. If k = 1, then the object is simply assigned to the class of that single nearest neighbor.”

Similar to my first two models, I ran a 10-fold cross-validation to find the optimal k (number of neighbors). The k value that yielded the highest accuracy rate was k = 86, with a CV accuracy rate of 85.00621%.

#----------------------
#Use 10-fold CV - One Hot Encoded Standardized Data

#Randomly shuffle the data
set.seed(10)
binheartcv <- train.final[sample(nrow(train.final)),]

#Create 10 equally size folds
folds <- cut(seq(1,nrow(binheartcv)),breaks=10,labels=FALSE)

#Loop through different K's (nearest Neighbors) using 10 fold CV
cv.acc <- c() #cv accuracy for different values of K
for(g in 1:400) {
  accuracy <- c()
  for(i in 1:10){
    #Segement your data by fold using the which() function 
    testIndexes <- which(folds==i,arr.ind=TRUE)
    testData <- binheartcv[testIndexes, ]
    trainData <- binheartcv[-testIndexes, ]
    
    #Use the test and train data partitions however you desire...
    set.seed(5)
    knn.pred <- knn(train = trainData[,-ncol(trainData)],
                    test = testData[,-ncol(testData)],
                    cl = trainData[,ncol(trainData)],
                    k=g)
    
    accuracy[i] <- mean(knn.pred == testData[,ncol(testData)])
  }
  cv.acc[g] <- mean(accuracy)
  
}

qplot(seq_along(cv.acc), cv.acc, geom="line", col=I("red")) + 
  labs(x="k (# of nearest neighbors)",
       y="10-Fold CV Accuracy Rate",
       title="K Nearest Neighbors") +
  geom_point(aes(x=which(cv.acc==max(cv.acc)), 
                 y=max(cv.acc)),
             colour="blue", size=3)

which(cv.acc==max(cv.acc)) #k=86 (optimal K)
## [1] 86
max(cv.acc) #0.8500621
## [1] 0.8500621

Lastly, I used k = 86 to train my final model. The final test accuracy rate on the hold-out test data set was 88.43931%. This non-parametric model performed noticeably better than both the lasso and ridge logistic regression models.

# Train Final KNN Model with optimal k = 86
# Use hold out Test Data set for Test Accuracy rate
set.seed(5)
knn.pred.final <- knn(train = train.final[,-ncol(train.final)],
                      test = test.final[,-ncol(test.final)],
                      cl = train.final[,ncol(train.final)],
                      k=86)

mean(knn.pred.final == test.final[,ncol(test.final)]) #0.8843931
## [1] 0.8843931

6.5 Model 4: Random Forest

One of the great things about a random forest model is that it does not require the data to be standardized. Therefore, I used the data sets that were neither one-hot encoded nor standardized. I used 10-fold cross-validation to find the optimal mtry hyperparameter. The plot below shows that mtry = 2 yielded the highest accuracy rate, with a CV accuracy rate of 85.86749%.

#---------------------------------------------------------------------------------
#Model #4: Random Forest

#Random Forest: Use K-fold CV
#Randomly shuffle the data. Note: Use non-standardized data
#Standardization not required for random forest
set.seed(10)
binheartcv <- train.data[sample(nrow(train.data)),]

#Create 10 equally size folds
folds <- cut(seq(1,nrow(binheartcv)),breaks=10,labels=FALSE)

#Perform a 10 fold cross validation to find optimal mtry value
cv.acc <- c() #cv accuracy for different values of mtry
for(g in 1:11) {
  accuracy <- c()
  for(i in 1:10){
    #Segement your data by fold using the which() function 
    testIndexes <- which(folds==i,arr.ind=TRUE)
    testData <- binheartcv[testIndexes, ]
    trainData <- binheartcv[-testIndexes, ]
    
    #Use the test and train data partitions however you desire...
    set.seed(5)
    rf.mod <- randomForest(num~., data=trainData, ntree=500, mtry=g, importance=TRUE)
    predValid <- predict(rf.mod, newdata=testData, type="class")
    accuracy[i] <- mean(predValid == testData$num)
  }
  cv.acc[g] <- mean(accuracy)
}

qplot(seq_along(cv.acc), cv.acc, geom="line", col=I("red")) + 
  geom_point() +
  labs(x="mtry",
       y="10-Fold CV Accuracy Rate",
       title="Random Forest: Tuning mtry hyperparameter") +
  geom_point(aes(x=which(cv.acc==max(cv.acc)), 
                 y=max(cv.acc)),
             colour="blue", size=3) +
  scale_x_continuous(breaks=c(0:12))

which(cv.acc==max(cv.acc)) #Optimal mtry=2
## [1] 2
max(cv.acc) #0.8586749
## [1] 0.8586749

I then used mtry = 2 to train my final model on the full training data set. Lastly, I used the hold-out test data set to obtain a test accuracy rate of 89.01734%. This was the most accurate of the four predictive models.

#-------------------
#Final Model w/ ROC, AUC Plots
set.seed(5)
final.rf <- randomForest(num~., data=train.data, mtry=2, ntree=500, 
                         importance=TRUE, keep.forest=TRUE)
final.pred.rf <- predict(final.rf, newdata=test.data, type="class")
mean(final.pred.rf == test.data$num) #0.8901734
## [1] 0.8901734
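
Beyond the single accuracy number, a confusion matrix breaks the test-set predictions down into sensitivity and specificity (an optional sketch using the already-loaded caret package; it is not part of the original output):

#Optional (added): confusion matrix for the final random forest on the hold-out test set
confusionMatrix(final.pred.rf, test.data$num, positive = "heart disease")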

7 Summary of Analysis & Conclusion

When doing a direct comparison of the four classification models, one clear pattern was that the most flexible models (i.e. KNN and Random Forest) had the highest accuracy. Lasso and Ridge Logistic Regression are considered the least flexible models, but they generally have a higher level of interpretability. This is because they are parametric models that assume a linear relationship between the predictors and the log-odds of the outcome. In particular, Lasso also serves as a subset selection algorithm that zeroes out the least important variables, effectively removing them from the model entirely. A popular graph in Chapter 2 (page 25) of An Introduction to Statistical Learning shows how Lasso is one of the most interpretable models with the least flexibility, while Random Forest (similar to bagging) is a highly flexible model with low interpretability.

Lasso Logistic Regression had a test accuracy rate of 82.65896%. Ridge Logistic Regression had a test accuracy rate of 83.23699%. The KNN model had a test accuracy rate of 88.43931%. Finally, the Random Forest model had a test accuracy rate of 89.01734%. Comparing the most accurate to the least accurate model, we see an improvement of roughly 6.36 percentage points in accuracy, which is a meaningful difference.
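
For a side-by-side comparison, the four test accuracies can be collected into a single table (assembled here for convenience; this snippet is not part of the original code, and the values are simply the test-set results quoted above):

#Summary of test accuracies (added); values are the results reported above
data.frame(model = c("Lasso Logistic Regression", "Ridge Logistic Regression",
                     "K-Nearest Neighbors (k = 86)", "Random Forest (mtry = 2)"),
           test.accuracy = c(0.8265896, 0.8323699, 0.8843931, 0.8901734))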

Turning to the topic of interpretability, Lasso is by far the most interpretable. Because the variables were standardized, we can look at the magnitude of each coefficient in the final lasso model and directly compare the variables. Lasso shows that cp, slope, and sex were the top 3 variables with the highest importance.

The ridge model shows that slope, cp, and oldpeak were the top 3 most important predictors when looking strictly at their magnitude.

Although random forest is considered one of the least interpretable models, the variable importance plot allows us to see which predictors were used most often and their effect on the accuracy of the model. The Mean Decrease Accuracy measure shows that the top 3 variables are slope, cp, and oldpeak, while the Mean Decrease Gini measure shows that the top 3 variables are slope, cp, and thalach.

All four models were largely in agreement as to which variables were the most important when it comes to predicting heart disease. My hypothesis that cp (chest pain) would appear as a top 3 important variable was correct. Cholesterol (chol), unfortunately, did not crack the top 3 in any of my models. This concludes my analysis.

#Lasso Ordered Coefficients
data.frame(lasso.coef.name = dimnames(coef(lasso.model))[[1]], 
           coef.value = matrix(coef(lasso.model))) %>% 
  mutate(abs.coef.value  = round(abs(coef.value), 5)) %>%
  arrange(desc(abs.coef.value))
lasso.coef.name           coef.value  abs.coef.value
cp_asymptomatic            0.8034857         0.80349
slope_upsloping           -0.6829909         0.68299
sex_female                -0.5594024         0.55940
oldpeak                    0.3706193         0.37062
slope_flat                 0.3449807         0.34498
(Intercept)                0.2633197         0.26332
age                        0.2473973         0.24740
chol                      -0.2115127         0.21151
exang_no                  -0.1965360         0.19654
thalach                   -0.1581075         0.15811
trestbps                   0.0890506         0.08905
cp_atypical angina        -0.0451496         0.04515
restecg_ST-T abnormality   0.0347191         0.03472
sex_male                   0.0000000         0.00000
cp_typical angina          0.0000000         0.00000
cp_non-anginl pain         0.0000000         0.00000
fbs_FALSE                  0.0000000         0.00000
fbs_TRUE                   0.0000000         0.00000
restecg_normal             0.0000000         0.00000
restecg_hypertrophy        0.0000000         0.00000
exang_yes                  0.0000000         0.00000
slope_downsloping          0.0000000         0.00000
#Ridge Ordered
data.frame(ridge.coef.name = dimnames(coef(ridge.model))[[1]], 
           coef.value = matrix(coef(ridge.model))) %>% 
  mutate(abs.coef.value  = round(abs(coef.value), 5)) %>%
  arrange(desc(abs.coef.value))
ridge.coef.name           coef.value  abs.coef.value
slope_upsloping           -0.4891744         0.48917
slope_flat                 0.4493948         0.44939
cp_asymptomatic            0.4489078         0.44891
oldpeak                    0.3443847         0.34438
cp_atypical angina        -0.2852317         0.28523
(Intercept)                0.2664625         0.26646
sex_female                -0.2636031         0.26360
sex_male                   0.2629081         0.26291
age                        0.2254069         0.22541
cp_non-anginl pain        -0.2184684         0.21847
chol                      -0.2065357         0.20654
thalach                   -0.1830543         0.18305
exang_no                  -0.1308311         0.13083
exang_yes                  0.1305934         0.13059
trestbps                   0.1059004         0.10590
cp_typical angina         -0.0988941         0.09889
slope_downsloping          0.0650009         0.06500
restecg_ST-T abnormality   0.0516351         0.05164
restecg_normal            -0.0325450         0.03254
restecg_hypertrophy       -0.0112172         0.01122
fbs_TRUE                   0.0005185         0.00052
fbs_FALSE                 -0.0004152         0.00042
#Random Forest Importance
varImpPlot(final.rf)
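
The numeric values behind varImpPlot() can also be extracted directly (a small addition, not part of the original code), which makes it easier to read off the top predictors by each importance measure:

#Importance values (added): sort the predictors by Mean Decrease Accuracy
imp <- importance(final.rf)
imp[order(imp[, "MeanDecreaseAccuracy"], decreasing = TRUE), ]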

8 Ethical Considerations

The heart disease data used in this report underwent pre-processing before being made available to the public. The documentation states that “the names and social security numbers of the patients were recently removed from the database, replaced with dummy values.” I am not exactly sure how ‘recently’ this information was removed, which is a little concerning: it implies that the private information of patients was originally made available to the public and then redacted. However, this is out of my control and appears to no longer be an issue.

9 Opportunity for Improvement

One of the main areas for improvement is the imputation of the missing data. Data imputation is still a very subjective area because many different techniques are used by different analysts. A subject matter expert who is familiar with this kind of data set might provide valuable insight for this portion.

I also felt that I could not provide a very in-depth analysis of my variables and how they interacted with my dependent variable simply because I do not have an SME-level understanding of this data. This is where domain and industry knowledge are indispensable in this kind of work.

10 Citations

The following are the names of the principal investigators that were responsible for the heart disease data that was collected at each institution:

  1. Hungarian Institute of Cardiology. Budapest: Andras Janosi, M.D.
  2. University Hospital, Zurich, Switzerland: William Steinbrunn, M.D.
  3. University Hospital, Basel, Switzerland: Matthias Pfisterer, M.D.
  4. V.A. Medical Center, Long Beach and Cleveland Clinic Foundation: Robert Detrano, M.D., Ph.D.

Other citations:

  • World Health Organization, “The top 10 causes of death”
  • Mayo Clinic, “Heart Disease”
  • UC Irvine’s Machine Learning Repository: Heart Disease Data Set
  • MedicineNet, “Coronary Angiogram”
  • RxList, “Heart Disease: Causes of a Heart Attack”
  • Million Hearts, “Costs & Consequences”
  • WebMD, “Heart Disease: What Are the Medical Costs?”
  • StackExchange: Cross Validated, “Is standardisation before Lasso really necessary?” & “Why do you need to scale data in KNN”
  • Wikipedia, “k-nearest neighbors algorithm”
  • Tibshirani, Robert; James, Gareth; Witten, Daniela; Hastie, Trevor, “An Introduction to Statistical Learning with Applications in R”