Introduction


In recent years, advances in computing and machine learning have made it possible to build software that assists doctors in diagnosing heart disease at an early stage. A heart disease prediction system can help medical professionals predict a patient's heart disease status from clinical data.

The main objective of this project is to answer the following research questions:

  • Can we predict who will suffer Heart Disease?
  • Can we discover interesting features that affect Heart Disease?
  • Can we start to understand what causes Heart Disease?

For this project, we use the Heart Disease dataset from http://archive.ics.uci.edu/ml/datasets/Heart+Disease. The dataset was donated to UCI on 1 July 1988. The file we use is processed.cleveland.data. The data were collected at the Cleveland Clinic Foundation. The principal investigator for the data collection is Robert Detrano, M.D., Ph.D., of the V.A. Medical Center, Long Beach and Cleveland Clinic Foundation.

The dataset consists of 14 features, described below:

| No. | Feature | Description |
|-----|---------|-------------|
| 1 | age | age in years |
| 2 | sex | sex (1 = male; 0 = female) |
| 3 | cp | chest pain type (1 = typical angina; 2 = atypical angina; 3 = non-anginal pain; 4 = asymptomatic) |
| 4 | trestbps | resting blood pressure (in mm Hg on admission to the hospital) |
| 5 | chol | serum cholesterol in mg/dl |
| 6 | fbs | fasting blood sugar > 120 mg/dl (1 = true; 0 = false) |
| 7 | restecg | resting electrocardiographic results (0 = normal; 1 = having ST-T wave abnormality; 2 = hypertrophy) |
| 8 | thalach | maximum heart rate achieved |
| 9 | exang | exercise induced angina (1 = yes; 0 = no) |
| 10 | oldpeak | ST depression induced by exercise relative to rest |
| 11 | slope | the slope of the peak exercise ST segment (1 = upsloping; 2 = flat; 3 = downsloping) |
| 12 | ca | number of major vessels (0-3) colored by fluoroscopy |
| 13 | thal | 3 = normal; 6 = fixed defect; 7 = reversible defect |
| 14 | num | the predicted attribute: diagnosis of heart disease (angiographic disease status) (0 = < 50% diameter narrowing; 1 = > 50% diameter narrowing) |


## Libraries

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyverse)
## ── Attaching packages ──────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2     ✓ purrr   0.3.4
## ✓ tibble  3.0.3     ✓ stringr 1.4.0
## ✓ tidyr   1.1.3     ✓ forcats 0.5.0
## ✓ readr   1.4.0
## ── Conflicts ─────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(ggplot2)
library(gridExtra)
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
#library(cowplot)
library(waffle)
library(ggcorrplot)
library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift

Data Cleaning


First, read the raw data and store it in a data frame.

heart_df <- read.csv("processed.cleveland.data")
head(heart_df)
##   X63.0 X1.0 X1.0.1 X145.0 X233.0 X1.0.2 X2.0 X150.0 X0.0 X2.3 X3.0 X0.0.1 X6.0
## 1    67    1      4    160    286      0    2    108    1  1.5    2    3.0  3.0
## 2    67    1      4    120    229      0    2    129    1  2.6    2    2.0  7.0
## 3    37    1      3    130    250      0    0    187    0  3.5    3    0.0  3.0
## 4    41    0      2    130    204      0    2    172    0  1.4    1    0.0  3.0
## 5    56    1      2    120    236      0    0    178    0  0.8    1    0.0  3.0
## 6    62    0      4    140    268      0    2    160    0  3.6    3    2.0  3.0
##   X0
## 1  2
## 2  1
## 3  0
## 4  0
## 5  0
## 6  3

The file has no header row, so read.csv() used the first record as column names. We rename the columns according to the features.

names(heart_df) <- c('Age', 'Sex', 'Chest Pain Type', 'Resting Blood Pressure', 'Cholesterol', 'Fasting Blood Sugar', 'Resting ECG', 'Max. HR Achieved', 'Exercise Induced Angina', 'ST Depression', 'ST Slope', 'Num. Major Blood Vessels', 'Thalassemia', 'Condition')
head(heart_df)
##   Age Sex Chest Pain Type Resting Blood Pressure Cholesterol
## 1  67   1               4                    160         286
## 2  67   1               4                    120         229
## 3  37   1               3                    130         250
## 4  41   0               2                    130         204
## 5  56   1               2                    120         236
## 6  62   0               4                    140         268
##   Fasting Blood Sugar Resting ECG Max. HR Achieved Exercise Induced Angina
## 1                   0           2              108                       1
## 2                   0           2              129                       1
## 3                   0           0              187                       0
## 4                   0           2              172                       0
## 5                   0           0              178                       0
## 6                   0           2              160                       0
##   ST Depression ST Slope Num. Major Blood Vessels Thalassemia Condition
## 1           1.5        2                      3.0         3.0         2
## 2           2.6        2                      2.0         7.0         1
## 3           3.5        3                      0.0         3.0         0
## 4           1.4        1                      0.0         3.0         0
## 5           0.8        1                      0.0         3.0         0
## 6           3.6        3                      2.0         3.0         3
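
Because read.csv() assumes a header row by default, the first record of the file ended up as the column names (X63.0, X1.0, ...) rather than as data, which is why 302 rather than 303 observations are reported later. A minimal sketch of reading a headerless file is shown below. It uses three sample lines instead of the real file, the short feature names from the description table, and na.strings to turn "?" into NA at read time. The first two lines come from the output above; the "?" in the third line is inserted here purely for illustration.

```r
# Sample lines in the same comma-separated layout as processed.cleveland.data.
# The "?" in the third line is an illustrative placeholder, not a real value.
sample_lines <- c(
  "63.0,1.0,1.0,145.0,233.0,1.0,2.0,150.0,0.0,2.3,3.0,0.0,6.0,0",
  "67.0,1.0,4.0,160.0,286.0,0.0,2.0,108.0,1.0,1.5,2.0,3.0,3.0,2",
  "67.0,1.0,4.0,120.0,229.0,0.0,2.0,129.0,1.0,2.6,2.0,?,7.0,1"
)

# Short feature names from the description table
uci_names <- c("age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
               "thalach", "exang", "oldpeak", "slope", "ca", "thal", "num")

# header = FALSE keeps the first record as data;
# na.strings = "?" converts unknown values to NA while reading
demo_df <- read.csv(text = sample_lines, header = FALSE,
                    col.names = uci_names, na.strings = "?")

nrow(demo_df)        # all three records are kept
sum(is.na(demo_df))  # the "?" is now counted as missing
```

With this approach the ca and thal columns are read as numeric directly, so no later conversion from character would be needed.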

We will check the data for any missing values.

sum(is.na(heart_df))
## [1] 0
summary(heart_df)
##       Age             Sex         Chest Pain Type Resting Blood Pressure
##  Min.   :29.00   Min.   :0.0000   Min.   :1.000   Min.   : 94.0         
##  1st Qu.:48.00   1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:120.0         
##  Median :55.50   Median :1.0000   Median :3.000   Median :130.0         
##  Mean   :54.41   Mean   :0.6788   Mean   :3.166   Mean   :131.6         
##  3rd Qu.:61.00   3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:140.0         
##  Max.   :77.00   Max.   :1.0000   Max.   :4.000   Max.   :200.0         
##   Cholesterol    Fasting Blood Sugar  Resting ECG     Max. HR Achieved
##  Min.   :126.0   Min.   :0.0000      Min.   :0.0000   Min.   : 71.0   
##  1st Qu.:211.0   1st Qu.:0.0000      1st Qu.:0.0000   1st Qu.:133.2   
##  Median :241.5   Median :0.0000      Median :0.5000   Median :153.0   
##  Mean   :246.7   Mean   :0.1457      Mean   :0.9868   Mean   :149.6   
##  3rd Qu.:275.0   3rd Qu.:0.0000      3rd Qu.:2.0000   3rd Qu.:166.0   
##  Max.   :564.0   Max.   :1.0000      Max.   :2.0000   Max.   :202.0   
##  Exercise Induced Angina ST Depression      ST Slope    
##  Min.   :0.0000          Min.   :0.000   Min.   :1.000  
##  1st Qu.:0.0000          1st Qu.:0.000   1st Qu.:1.000  
##  Median :0.0000          Median :0.800   Median :2.000  
##  Mean   :0.3278          Mean   :1.035   Mean   :1.596  
##  3rd Qu.:1.0000          3rd Qu.:1.600   3rd Qu.:2.000  
##  Max.   :1.0000          Max.   :6.200   Max.   :3.000  
##  Num. Major Blood Vessels Thalassemia          Condition     
##  Length:302               Length:302         Min.   :0.0000  
##  Class :character         Class :character   1st Qu.:0.0000  
##  Mode  :character         Mode  :character   Median :0.0000  
##                                              Mean   :0.9404  
##                                              3rd Qu.:2.0000  
##                                              Max.   :4.0000
str(heart_df)
## 'data.frame':    302 obs. of  14 variables:
##  $ Age                     : num  67 67 37 41 56 62 57 63 53 57 ...
##  $ Sex                     : num  1 1 1 0 1 0 0 1 1 1 ...
##  $ Chest Pain Type         : num  4 4 3 2 2 4 4 4 4 4 ...
##  $ Resting Blood Pressure  : num  160 120 130 130 120 140 120 130 140 140 ...
##  $ Cholesterol             : num  286 229 250 204 236 268 354 254 203 192 ...
##  $ Fasting Blood Sugar     : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ Resting ECG             : num  2 2 0 2 0 2 0 2 2 0 ...
##  $ Max. HR Achieved        : num  108 129 187 172 178 160 163 147 155 148 ...
##  $ Exercise Induced Angina : num  1 1 0 0 0 0 1 0 1 0 ...
##  $ ST Depression           : num  1.5 2.6 3.5 1.4 0.8 3.6 0.6 1.4 3.1 0.4 ...
##  $ ST Slope                : num  2 2 3 1 1 3 1 2 3 2 ...
##  $ Num. Major Blood Vessels: chr  "3.0" "2.0" "0.0" "0.0" ...
##  $ Thalassemia             : chr  "3.0" "7.0" "3.0" "3.0" ...
##  $ Condition               : int  2 1 0 0 0 3 0 2 1 0 ...

sum(is.na()) reports no missing values. However, the features Num. Major Blood Vessels and Thalassemia are of class character, although the feature description says they should be numeric. We need to check the distinct values of these features.

unique(heart_df$`Num. Major Blood Vessels`)
## [1] "3.0" "2.0" "0.0" "1.0" "?"
unique(heart_df$`Thalassemia`)
## [1] "3.0" "7.0" "6.0" "?"

Unknown values are denoted by ?. Rather than imputing them (for example, with the column median), we set them to NA and remove the affected rows. The two columns are converted to numeric in a later step.

heart_df$`Num. Major Blood Vessels`[heart_df$`Num. Major Blood Vessels` == "?"] <- NA
heart_df$`Thalassemia`[heart_df$`Thalassemia` == "?"] <- NA
nrow(heart_df)
## [1] 302
heart_df <- heart_df[complete.cases(heart_df),]
str(heart_df)
## 'data.frame':    296 obs. of  14 variables:
##  $ Age                     : num  67 67 37 41 56 62 57 63 53 57 ...
##  $ Sex                     : num  1 1 1 0 1 0 0 1 1 1 ...
##  $ Chest Pain Type         : num  4 4 3 2 2 4 4 4 4 4 ...
##  $ Resting Blood Pressure  : num  160 120 130 130 120 140 120 130 140 140 ...
##  $ Cholesterol             : num  286 229 250 204 236 268 354 254 203 192 ...
##  $ Fasting Blood Sugar     : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ Resting ECG             : num  2 2 0 2 0 2 0 2 2 0 ...
##  $ Max. HR Achieved        : num  108 129 187 172 178 160 163 147 155 148 ...
##  $ Exercise Induced Angina : num  1 1 0 0 0 0 1 0 1 0 ...
##  $ ST Depression           : num  1.5 2.6 3.5 1.4 0.8 3.6 0.6 1.4 3.1 0.4 ...
##  $ ST Slope                : num  2 2 3 1 1 3 1 2 3 2 ...
##  $ Num. Major Blood Vessels: chr  "3.0" "2.0" "0.0" "0.0" ...
##  $ Thalassemia             : chr  "3.0" "7.0" "3.0" "3.0" ...
##  $ Condition               : int  2 1 0 0 0 3 0 2 1 0 ...
summary(heart_df)
##       Age             Sex         Chest Pain Type Resting Blood Pressure
##  Min.   :29.00   Min.   :0.0000   Min.   :1.000   Min.   : 94.0         
##  1st Qu.:48.00   1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:120.0         
##  Median :56.00   Median :1.0000   Median :3.000   Median :130.0         
##  Mean   :54.51   Mean   :0.6757   Mean   :3.166   Mean   :131.6         
##  3rd Qu.:61.00   3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:140.0         
##  Max.   :77.00   Max.   :1.0000   Max.   :4.000   Max.   :200.0         
##   Cholesterol    Fasting Blood Sugar  Resting ECG     Max. HR Achieved
##  Min.   :126.0   Min.   :0.0000      Min.   :0.0000   Min.   : 71.0   
##  1st Qu.:211.0   1st Qu.:0.0000      1st Qu.:0.0000   1st Qu.:133.0   
##  Median :243.0   Median :0.0000      Median :1.0000   Median :153.0   
##  Mean   :247.4   Mean   :0.1419      Mean   :0.9932   Mean   :149.6   
##  3rd Qu.:276.2   3rd Qu.:0.0000      3rd Qu.:2.0000   3rd Qu.:166.0   
##  Max.   :564.0   Max.   :1.0000      Max.   :2.0000   Max.   :202.0   
##  Exercise Induced Angina ST Depression      ST Slope    
##  Min.   :0.0000          Min.   :0.000   Min.   :1.000  
##  1st Qu.:0.0000          1st Qu.:0.000   1st Qu.:1.000  
##  Median :0.0000          Median :0.800   Median :2.000  
##  Mean   :0.3277          Mean   :1.051   Mean   :1.598  
##  3rd Qu.:1.0000          3rd Qu.:1.600   3rd Qu.:2.000  
##  Max.   :1.0000          Max.   :6.200   Max.   :3.000  
##  Num. Major Blood Vessels Thalassemia          Condition     
##  Length:296               Length:296         Min.   :0.0000  
##  Class :character         Class :character   1st Qu.:0.0000  
##  Mode  :character         Mode  :character   Median :0.0000  
##                                              Mean   :0.9493  
##                                              3rd Qu.:2.0000  
##                                              Max.   :4.0000
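
Dropping the incomplete rows is the approach taken above. For comparison, median imputation could be sketched as follows. This is an illustration on a toy vector, not the real column, and it is not what the analysis in this document does.

```r
# Toy values in the same string format as the raw column (not real data)
ca_raw <- c("3.0", "2.0", "0.0", "?", "1.0")

# Mark the unknown value as NA, then convert to numeric
ca_raw[ca_raw == "?"] <- NA
ca_num <- as.numeric(ca_raw)

# Replace NA with the median of the observed values
ca_num[is.na(ca_num)] <- median(ca_num, na.rm = TRUE)
ca_num
```

Note that the median of a count variable can be a non-integer (1.5 in this toy case), which is one reason removing a handful of rows can be the simpler choice.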

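In the raw file, Condition takes values from 0 to 4, and any value greater than or equal to 1 denotes heart disease. We recode all such values to 1, convert Condition to a factor, and convert Thalassemia and Num. Major Blood Vessels to numeric.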

unique(heart_df$`Condition`)
## [1] 2 1 0 3 4
# Collapse disease levels 1 to 4 into a single "disease present" value
heart_df$`Condition`[heart_df$`Condition` >= 1] <- 1
unique(heart_df$`Condition`)
## [1] 1 0
heart_df$Condition <- as.factor(heart_df$Condition)
heart_df$Thalassemia <- as.numeric(heart_df$Thalassemia)
heart_df$`Num. Major Blood Vessels` <- as.numeric(heart_df$`Num. Major Blood Vessels`)

From observation, the features can be classified into numerical and categorical variables as follows:

| No. | Feature | Category |
|-----|---------|----------|
| 1 | Age | Numerical |
| 2 | Sex | Categorical |
| 3 | Chest Pain Type | Categorical |
| 4 | Resting Blood Pressure | Numerical |
| 5 | Cholesterol | Numerical |
| 6 | Fasting Blood Sugar | Categorical |
| 7 | Resting ECG | Categorical |
| 8 | Max. HR Achieved | Numerical |
| 9 | Exercise Induced Angina | Categorical |
| 10 | ST Depression | Numerical |
| 11 | ST Slope | Categorical |
| 12 | Num. Major Blood Vessels | Numerical |
| 13 | Thalassemia | Categorical |
| 14 | Condition | Categorical |

Our data is now ready for analysis.
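
Note that the str() output shows the categorical features stored as numeric codes, so a model formula would treat them as numbers. Whether to convert them is a modelling choice that this analysis does not make. If one wanted to, the conversion could be sketched as below, on a toy data frame built from the first four rows printed earlier (the name toy_df is introduced here only for illustration).

```r
# Toy frame with the same column naming style as heart_df
toy_df <- data.frame(
  Age = c(67, 67, 37, 41),
  Sex = c(1, 1, 1, 0),
  `Chest Pain Type` = c(4, 4, 3, 2),
  check.names = FALSE  # keep the space in the column name
)

# Columns to treat as categorical
categorical_cols <- c("Sex", "Chest Pain Type")

# Convert each listed column to a factor, leaving the rest numeric
toy_df[categorical_cols] <- lapply(toy_df[categorical_cols], factor)
str(toy_df)
```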

Data Exploration and Feature Engineering

How many have a heart condition in our dataset?

cond <- heart_df %>%
  ggplot(aes(x=Condition,fill=factor(Condition))) +
  geom_bar(alpha=0.8) +
  geom_text(
    aes(label = sprintf('%s (%.0f%%)', after_stat(count), after_stat(count/sum(count)*100))),
    stat='count', 
    vjust = -0.25
  )
grid.arrange(cond, ncol=1)

#waffle(heart_df$Condition/3)

The classes are fairly balanced. In the original file of 303 samples:

  • 54% or 164 samples have no heart disease
  • 46% or 139 samples have heart disease

These counts refer to the full file. The cleaned data frame plotted above has 296 rows, so its counts are slightly lower, but the proportions remain close to these.

However, according to the CDC fact sheet https://www.cdc.gov/nchs/fastats/heart-disease.htm, only 4.6 percent of adults in the U.S. have ever been diagnosed with coronary heart disease. Although there is a large discrepancy between our dataset and this figure, it is not a problem for our prediction task. It is just something to be aware of.
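
The percentages quoted for the 303-sample file follow directly from the two class counts. As a quick check of the arithmetic (the vector name class_counts is introduced here for illustration):

```r
# Class counts quoted in the text for the full 303-sample file
class_counts <- c(no_disease = 164, disease = 139)

# Share of each class, as a rounded percentage
class_pct <- round(100 * class_counts / sum(class_counts))
class_pct
```

On a data frame, the same shares could be obtained with prop.table(table(...)) applied to the class column.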

Numerical Variable Distribution?

age <- heart_df %>%
  ggplot(aes(x=`Age`, fill=factor(Condition))) +
  geom_density(alpha = 0.8)
rbp <- heart_df %>%
  ggplot(aes(x=`Resting Blood Pressure`,fill=factor(Condition))) +
  geom_density(alpha = 0.8)
chl <- heart_df %>%
  ggplot(aes(x=`Cholesterol`,fill=factor(Condition))) +
  geom_density(alpha = 0.8)
mha <- heart_df %>%
  ggplot(aes(x=`Max. HR Achieved`,fill=factor(Condition))) +
  geom_density(alpha = 0.8)
std <- heart_df %>%
  ggplot(aes(x=`ST Depression`,fill=factor(Condition))) +
  geom_density(alpha = 0.8)
mbv <- heart_df %>%
  ggplot(aes(x=`Num. Major Blood Vessels`,fill=factor(Condition))) +
  geom_density(alpha = 0.8)
grid.arrange(age, rbp, chl, mha, std, mbv, ncol=2, nrow=3)

We do see some differences between the two conditions. In particular, Num. Major Blood Vessels, Age, ST Depression and Max. HR Achieved appear to separate the groups well. We will explore these further, as they seem likely to be useful for our models. Let's zoom in on two of the most noticeable plots.

grid.arrange(mha, mbv, ncol=2)

The distributions of these two variables differ markedly between the two groups, so they will likely become important features for our model later on.

Categorical Variable Distribution?

# One bar chart per categorical feature, named after the feature it shows
sex_bar <- heart_df %>%
  ggplot(aes(x=`Sex`)) +
  geom_bar(position='dodge',fill="orange",alpha=0.8)
cp_bar <- heart_df %>%
  ggplot(aes(x=`Chest Pain Type`)) +
  geom_bar(position='dodge',fill="orange",alpha=0.8)
fbs_bar <- heart_df %>%
  ggplot(aes(x=`Fasting Blood Sugar`)) +
  geom_bar(position='dodge',fill="orange",alpha=0.8)
ecg_bar <- heart_df %>%
  ggplot(aes(x=`Resting ECG`)) +
  geom_bar(position='dodge',fill="orange",alpha=0.8)
eia_bar <- heart_df %>%
  ggplot(aes(x=`Exercise Induced Angina`)) +
  geom_bar(position='dodge',fill="orange",alpha=0.8)
slope_bar <- heart_df %>%
  ggplot(aes(x=`ST Slope`)) +
  geom_bar(position='dodge',fill="orange",alpha=0.8)
thal_bar <- heart_df %>%
  ggplot(aes(x=`Thalassemia`)) +
  geom_bar(position='dodge',fill="orange",alpha=0.8)
grid.arrange(sex_bar, cp_bar, fbs_bar, ecg_bar, eia_bar, slope_bar, thal_bar, ncol=2, nrow=4)

So above we see how common or uncommon certain categories are. For example, category 1 of Resting ECG is very uncommon. But how does the Condition variable present itself with respect to each of these features? Can we learn anything?

# Same features, now split by Condition; names reflect the feature plotted
sex_cond <- heart_df %>%
  ggplot(aes(x=`Sex`, fill=factor(Condition))) +
  geom_bar(position='dodge',alpha=0.8)
cp_cond <- heart_df %>%
  ggplot(aes(x=`Chest Pain Type`,fill=factor(Condition))) +
  geom_bar(position='dodge',alpha=0.8)
fbs_cond <- heart_df %>%
  ggplot(aes(x=`Fasting Blood Sugar`,fill=factor(Condition))) +
  geom_bar(position='dodge',alpha=0.8)
ecg_cond <- heart_df %>%
  ggplot(aes(x=`Resting ECG`,fill=factor(Condition))) +
  geom_bar(position='dodge',alpha=0.8)
eia_cond <- heart_df %>%
  ggplot(aes(x=`Exercise Induced Angina`,fill=factor(Condition))) +
  geom_bar(position='dodge',alpha=0.8)
slope_cond <- heart_df %>%
  ggplot(aes(x=`ST Slope`,fill=factor(Condition))) +
  geom_bar(position='dodge',alpha=0.8)
thal_cond <- heart_df %>%
  ggplot(aes(x=`Thalassemia`,fill=factor(Condition))) +
  geom_bar(position='dodge',alpha=0.8)
grid.arrange(sex_cond, cp_cond, fbs_cond, ecg_cond, eia_cond, slope_cond, thal_cond, ncol=2, nrow=4)

Let's now zoom in on a couple of standout observations.

grid.arrange(cp_cond, thal_cond, ncol=2)

Certain Thalassemia and Chest Pain Type values appear to be highly indicative of heart disease, while other values of the same features are associated with lower risk.

Important Observations?

So far, a few important points have been observed.

  • Risk of heart disease increases with Age.
  • Rising Cholesterol and Resting Blood Pressure do not appear to be major indicators of heart disease.
  • A low Max. HR Achieved is a big warning sign of heart disease.
  • Rising ST Depression and Num. Major Blood Vessels are indicators of heart disease.
  • In general, females have less heart disease than males.
  • Chest Pain Type 4 is a major warning sign of heart disease.
  • Fasting Blood Sugar and Resting ECG show little correlation with heart disease.
  • Exercise Induced Angina is an indicator of heart disease.
  • A flat ST Slope is a major indicator of heart disease, and the risk is lower for an upsloping ST Slope.
  • Thalassemia of the reversible defect type is a major indicator of heart disease.
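
Observations like these are read off the plots, but they can also be checked numerically with a group summary. The sketch below does this for Max. HR Achieved using only the six records printed by head() earlier, so the numbers are illustrative and not a result for the full dataset.

```r
# Max. HR Achieved and raw Condition for the six rows shown by head()
max_hr        <- c(108, 129, 187, 172, 178, 160)
condition_raw <- c(2, 1, 0, 0, 0, 3)

# Any value >= 1 denotes heart disease
has_disease <- as.integer(condition_raw >= 1)

# Mean maximum heart rate per group ("0" = no disease, "1" = disease)
mean_hr <- sapply(split(max_hr, has_disease), mean)
mean_hr
```

In these six rows the disease group has the lower mean maximum heart rate, in line with the observation above, though six records are far too few to conclude anything.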

Predictions

We split the dataset into a training and test set with 80% of data in the training set and 20% of the data in the test set.

X = heart_df[,c('Age', 'Sex', 'Chest Pain Type', 'Resting Blood Pressure', 'Cholesterol', 'Fasting Blood Sugar', 'Resting ECG', 'Max. HR Achieved', 'Exercise Induced Angina', 'ST Depression', 'ST Slope', 'Num. Major Blood Vessels', 'Thalassemia')]

y = heart_df[,'Condition']

set.seed(1234)

trainIndex <- createDataPartition(y, p = 0.8, 
                                  list = FALSE, 
                                  times = 1)
heart_train <- heart_df[trainIndex,]
heart_test <- heart_df[-trainIndex,]

y_test <- heart_test$Condition
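
createDataPartition() samples within each class so that the training and test sets keep roughly the same class balance. The idea can be sketched in base R as below. This is an illustration on a toy outcome vector and is not caret's actual implementation.

```r
set.seed(1234)

# Toy outcome: 50 observations of class 0 and 40 of class 1
y_toy <- factor(rep(c(0, 1), times = c(50, 40)))

# For each class, draw 80% of its row indices for the training set
train_idx <- unlist(
  lapply(split(seq_along(y_toy), y_toy), function(idx) {
    idx[sample.int(length(idx), size = round(0.8 * length(idx)))]
  }),
  use.names = FALSE
)

table(y_toy[train_idx])   # class counts in the training set
table(y_toy[-train_idx])  # class counts in the test set
```

Sampling per class rather than over all rows at once is what keeps the 50:40 ratio intact in both subsets.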

We resample the data using 10-fold cross-validation repeated 10 times. The modelling techniques used are GBM, Random Forest, Boosted Logistic Regression, and a regularized GLM (glmnet).

fitControl <- trainControl(## 10-fold CV
                           method = "repeatedcv",
                           number = 10,
                           ## repeated ten times
                           repeats = 10)
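
To make the resampling scheme concrete, the sketch below generates fold assignments for 10-fold cross-validation repeated 10 times in base R, for the 238 training samples reported in the model output. It is a simplified, unstratified illustration and not how caret builds its folds internally.

```r
set.seed(1234)

n       <- 238  # number of training samples
k       <- 10   # folds per repeat
repeats <- 10   # number of repeats

# For each repeat, shuffle a vector of fold labels 1..k across the n rows
fold_ids <- lapply(seq_len(repeats), function(r) {
  sample(rep(seq_len(k), length.out = n))
})

# Training-set size for each fold of the first repeat:
# all rows except those held out in that fold
train_sizes <- n - as.vector(table(fold_ids[[1]]))
train_sizes

# Total number of model fits per tuning-parameter combination
k * repeats
```

Since 238 is not divisible by 10, each held-out fold has 23 or 24 rows, which leaves training sets of 215 or 214 rows. Each tuning-parameter combination is therefore fitted 100 times.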

Gradient boosted model

#Gradient boosted model

gbmGrid <-  expand.grid(interaction.depth = c(1, 5, 9), 
                        n.trees = (1:30)*50, 
                        shrinkage = 0.1,
                        n.minobsinnode = 20)
                        
nrow(gbmGrid)
## [1] 90
gbmFit <- train(Condition ~ ., data = heart_train, 
                 method = "gbm", 
                 trControl = fitControl, 
                 verbose = FALSE, 
                 tuneGrid = gbmGrid)
gbmFit
## Stochastic Gradient Boosting 
## 
## 238 samples
##  13 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times) 
## Summary of sample sizes: 214, 214, 214, 215, 214, 214, ... 
## Resampling results across tuning parameters:
## 
##   interaction.depth  n.trees  Accuracy   Kappa    
##   1                    50     0.8295471  0.6538782
##   1                   100     0.8223007  0.6398456
##   1                   150     0.8173913  0.6298750
##   1                   200     0.8123370  0.6199509
##   1                   250     0.8073551  0.6103521
##   1                   300     0.8094746  0.6144505
##   1                   350     0.8011232  0.5976685
##   1                   400     0.8019384  0.5992130
##   1                   450     0.7998007  0.5953020
##   1                   500     0.7976993  0.5910714
##   1                   550     0.8006522  0.5968089
##   1                   600     0.7989493  0.5934410
##   1                   650     0.8032428  0.6021590
##   1                   700     0.7973188  0.5901957
##   1                   750     0.7977174  0.5909452
##   1                   800     0.7956522  0.5867058
##   1                   850     0.7952174  0.5860359
##   1                   900     0.7948007  0.5854955
##   1                   950     0.7981703  0.5920128
##   1                  1000     0.7926993  0.5809246
##   1                  1050     0.7943297  0.5840233
##   1                  1100     0.7951993  0.5857265
##   1                  1150     0.7897645  0.5747940
##   1                  1200     0.7880616  0.5717075
##   1                  1250     0.7893841  0.5742973
##   1                  1300     0.7876449  0.5709684
##   1                  1350     0.7868116  0.5693416
##   1                  1400     0.7880797  0.5718237
##   1                  1450     0.7918297  0.5792441
##   1                  1500     0.7888949  0.5739633
##   5                    50     0.8175725  0.6302962
##   5                   100     0.8063949  0.6077375
##   5                   150     0.7996739  0.5939868
##   5                   200     0.7955797  0.5857695
##   5                   250     0.7895833  0.5737680
##   5                   300     0.7837862  0.5622346
##   5                   350     0.7816304  0.5575412
##   5                   400     0.7782971  0.5514361
##   5                   450     0.7833514  0.5621680
##   5                   500     0.7815761  0.5583140
##   5                   550     0.7845290  0.5646668
##   5                   600     0.7820109  0.5596157
##   5                   650     0.7807428  0.5565966
##   5                   700     0.7819565  0.5591593
##   5                   750     0.7844565  0.5642783
##   5                   800     0.7793841  0.5542145
##   5                   850     0.7802536  0.5558767
##   5                   900     0.7807065  0.5568982
##   5                   950     0.7823913  0.5599739
##   5                  1000     0.7828080  0.5605561
##   5                  1050     0.7844928  0.5641076
##   5                  1100     0.7807065  0.5566675
##   5                  1150     0.7806703  0.5565545
##   5                  1200     0.7778080  0.5507785
##   5                  1250     0.7794746  0.5541416
##   5                  1300     0.7794565  0.5541551
##   5                  1350     0.7815580  0.5585346
##   5                  1400     0.7798913  0.5553025
##   5                  1450     0.7773732  0.5500746
##   5                  1500     0.7773370  0.5500615
##   9                    50     0.8218841  0.6394643
##   9                   100     0.8072464  0.6090133
##   9                   150     0.8043297  0.6036770
##   9                   200     0.7937862  0.5823047
##   9                   250     0.7912862  0.5771588
##   9                   300     0.7909420  0.5766143
##   9                   350     0.7883333  0.5715755
##   9                   400     0.7845652  0.5641770
##   9                   450     0.7874819  0.5699383
##   9                   500     0.7841848  0.5632181
##   9                   550     0.7824638  0.5598663
##   9                   600     0.7819928  0.5591557
##   9                   650     0.7840761  0.5632732
##   9                   700     0.7841486  0.5634000
##   9                   750     0.7837138  0.5624185
##   9                   800     0.7853623  0.5658350
##   9                   850     0.7815942  0.5584138
##   9                   900     0.7807428  0.5568626
##   9                   950     0.7782428  0.5515547
##   9                  1000     0.7828442  0.5610630
##   9                  1050     0.7824638  0.5597803
##   9                  1100     0.7803261  0.5557048
##   9                  1150     0.7790761  0.5533578
##   9                  1200     0.7820290  0.5589906
##   9                  1250     0.7819928  0.5588478
##   9                  1300     0.7811775  0.5575289
##   9                  1350     0.7837319  0.5627105
##   9                  1400     0.7803804  0.5559026
##   9                  1450     0.7778261  0.5508612
##   9                  1500     0.7769928  0.5490401
## 
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 20
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 50, interaction.depth =
##  1, shrinkage = 0.1 and n.minobsinnode = 20.
pred_gbm <- predict(gbmFit,heart_test)

gbmConf <- confusionMatrix(reference = heart_test$Condition, data = pred_gbm, mode='everything', positive='0')

gbmConf$byClass
##          Sensitivity          Specificity       Pos Pred Value 
##            0.7419355            0.9259259            0.9200000 
##       Neg Pred Value            Precision               Recall 
##            0.7575758            0.9200000            0.7419355 
##                   F1           Prevalence       Detection Rate 
##            0.8214286            0.5344828            0.3965517 
## Detection Prevalence    Balanced Accuracy 
##            0.4310345            0.8339307
#Values from the output above, with '0' (no heart disease) as the positive class
#Accuracy: 82.8%
#Sensitivity: 74.2%
#Precision: 92.0%
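
The relationship between the byClass values and the summary metrics can be made explicit with confusion-matrix counts. The counts below are not printed anywhere in this document. They are reconstructed from the GBM byClass output above (Prevalence 31/58, Sensitivity 23/31, Specificity 25/27), so treat them as an assumption.

```r
# Reconstructed counts for the 58 test samples, positive class = '0'
tp <- 23  # predicted 0, actually 0
fn <- 8   # predicted 1, actually 0
fp <- 2   # predicted 0, actually 1
tn <- 25  # predicted 1, actually 1

accuracy    <- (tp + tn) / (tp + fn + fp + tn)
sensitivity <- tp / (tp + fn)  # also called recall
specificity <- tn / (tn + fp)
precision   <- tp / (tp + fp)  # also called positive predictive value

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, precision = precision), 4)
```

Because the positive class is '0', sensitivity here measures how many patients without heart disease are correctly identified. The ability to detect disease is reflected in the specificity.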

Random forest

mtry <- sqrt(ncol(X))

rfGrid <-  expand.grid(mtry = mtry)

rfFit <- train(Condition ~ ., data = heart_train, 
                 method = "rf", 
                 trControl = fitControl, 
                 verbose = FALSE,
                 tuneGrid = rfGrid)
rfFit
## Random Forest 
## 
## 238 samples
##  13 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times) 
## Summary of sample sizes: 214, 214, 214, 214, 214, 214, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.8024457  0.6006811
## 
## Tuning parameter 'mtry' was held constant at a value of 3.605551
pred_rf <- predict(rfFit,heart_test)

rfConf <- confusionMatrix(reference = heart_test$Condition, data = pred_rf, mode='everything', positive='0')

rfConf$byClass
##          Sensitivity          Specificity       Pos Pred Value 
##            0.7419355            0.9259259            0.9200000 
##       Neg Pred Value            Precision               Recall 
##            0.7575758            0.9200000            0.7419355 
##                   F1           Prevalence       Detection Rate 
##            0.8214286            0.5344828            0.3965517 
## Detection Prevalence    Balanced Accuracy 
##            0.4310345            0.8339307
#Values from the output above, with '0' (no heart disease) as the positive class
#Accuracy: 82.8%
#Sensitivity: 74.2%
#Precision: 92.0%

Boosted Logistic regression

logitFit <- train(Condition ~ ., data = heart_train, 
                 method = "LogitBoost", 
                 trControl = fitControl, 
                 verbose = FALSE )
logitFit
## Boosted Logistic Regression 
## 
## 238 samples
##  13 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times) 
## Summary of sample sizes: 214, 214, 214, 214, 215, 215, ... 
## Resampling results across tuning parameters:
## 
##   nIter  Accuracy   Kappa    
##   11     0.8060688  0.6083785
##   21     0.7865399  0.5704245
##   31     0.7818659  0.5593302
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was nIter = 11.
pred_logit <- predict(logitFit,heart_test)

logitConf <- confusionMatrix(reference = heart_test$Condition, data = pred_logit, mode='everything', positive='0')

logitConf$byClass
##          Sensitivity          Specificity       Pos Pred Value 
##            0.6129032            0.9259259            0.9047619 
##       Neg Pred Value            Precision               Recall 
##            0.6756757            0.9047619            0.6129032 
##                   F1           Prevalence       Detection Rate 
##            0.7307692            0.5344828            0.3275862 
## Detection Prevalence    Balanced Accuracy 
##            0.3620690            0.7694146
#Values from the output above, with '0' (no heart disease) as the positive class
#Accuracy: 75.9%
#Sensitivity: 61.3%
#Precision: 90.5%

Generalized Linear Model

glmnetGrid <- expand.grid(alpha = 0:1, lambda = seq(0.0001, 1, length = 100))

glmnetFit <- train(Condition ~ ., data = heart_train, 
                 method = "glmnet", 
                 trControl = fitControl, 
                 verbose = FALSE, tuneGrid = glmnetGrid )
glmnetFit
## glmnet 
## 
## 238 samples
##  13 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times) 
## Summary of sample sizes: 214, 214, 214, 215, 215, 214, ... 
## Resampling results across tuning parameters:
## 
##   alpha  lambda  Accuracy   Kappa        
##   0      0.0001  0.8346014   6.643597e-01
##   0      0.0102  0.8346014   6.643597e-01
##   0      0.0203  0.8346014   6.643597e-01
##   0      0.0304  0.8346014   6.643597e-01
##   0      0.0405  0.8341667   6.635040e-01
##   0      0.0506  0.8341667   6.635040e-01
##   0      0.0607  0.8329167   6.608108e-01
##   0      0.0708  0.8329167   6.607756e-01
##   0      0.0809  0.8333333   6.615968e-01
##   0      0.0910  0.8328986   6.606261e-01
##   0      0.1011  0.8328986   6.606261e-01
##   0      0.1112  0.8333333   6.614218e-01
##   0      0.1213  0.8329167   6.605770e-01
##   0      0.1314  0.8324819   6.595840e-01
##   0      0.1415  0.8328986   6.604046e-01
##   0      0.1516  0.8316486   6.577714e-01
##   0      0.1617  0.8316486   6.577357e-01
##   0      0.1718  0.8312319   6.568416e-01
##   0      0.1819  0.8320652   6.584734e-01
##   0      0.1920  0.8328986   6.601144e-01
##   0      0.2021  0.8328986   6.601144e-01
##   0      0.2122  0.8328986   6.601144e-01
##   0      0.2223  0.8324819   6.592330e-01
##   0      0.2324  0.8325000   6.591941e-01
##   0      0.2425  0.8325000   6.591941e-01
##   0      0.2526  0.8329167   6.600250e-01
##   0      0.2627  0.8333514   6.608927e-01
##   0      0.2728  0.8329348   6.599986e-01
##   0      0.2829  0.8325181   6.591176e-01
##   0      0.2930  0.8333696   6.607835e-01
##   0      0.3031  0.8329348   6.598887e-01
##   0      0.3132  0.8333514   6.607098e-01
##   0      0.3233  0.8333333   6.605717e-01
##   0      0.3334  0.8333333   6.605717e-01
##   0      0.3435  0.8333333   6.605717e-01
##   0      0.3536  0.8329167   6.596431e-01
##   0      0.3637  0.8333152   6.604142e-01
##   0      0.3738  0.8324819   6.586631e-01
##   0      0.3839  0.8328986   6.594959e-01
##   0      0.3940  0.8337500   6.611165e-01
##   0      0.4041  0.8320471   6.575378e-01
##   0      0.4142  0.8307790   6.548965e-01
##   0      0.4243  0.8307790   6.548965e-01
##   0      0.4344  0.8307790   6.548965e-01
##   0      0.4445  0.8299094   6.530859e-01
##   0      0.4546  0.8294928   6.521932e-01
##   0      0.4647  0.8286594   6.504305e-01
##   0      0.4748  0.8286594   6.503580e-01
##   0      0.4849  0.8286594   6.503580e-01
##   0      0.4950  0.8294928   6.519998e-01
##   0      0.5051  0.8303261   6.536641e-01
##   0      0.5152  0.8303261   6.535781e-01
##   0      0.5253  0.8299094   6.526710e-01
##   0      0.5354  0.8290761   6.509201e-01
##   0      0.5455  0.8290761   6.509201e-01
##   0      0.5556  0.8286594   6.500260e-01
##   0      0.5657  0.8286594   6.500260e-01
##   0      0.5758  0.8286594   6.500260e-01
##   0      0.5859  0.8282428   6.491694e-01
##   0      0.5960  0.8282428   6.491694e-01
##   0      0.6061  0.8282428   6.491694e-01
##   0      0.6162  0.8282428   6.491694e-01
##   0      0.6263  0.8278080   6.481798e-01
##   0      0.6364  0.8278080   6.481798e-01
##   0      0.6465  0.8273913   6.472983e-01
##   0      0.6566  0.8273913   6.472983e-01
##   0      0.6667  0.8273913   6.472983e-01
##   0      0.6768  0.8269746   6.464042e-01
##   0      0.6869  0.8261413   6.446192e-01
##   0      0.6970  0.8265580   6.454283e-01
##   0      0.7071  0.8265580   6.454283e-01
##   0      0.7172  0.8261413   6.445596e-01
##   0      0.7273  0.8261413   6.445596e-01
##   0      0.7374  0.8257246   6.435812e-01
##   0      0.7475  0.8253080   6.427122e-01
##   0      0.7576  0.8248913   6.418431e-01
##   0      0.7677  0.8244746   6.409862e-01
##   0      0.7778  0.8257246   6.434473e-01
##   0      0.7879  0.8261413   6.442802e-01
##   0      0.7980  0.8257246   6.434233e-01
##   0      0.8081  0.8252899   6.425354e-01
##   0      0.8182  0.8252899   6.425354e-01
##   0      0.8283  0.8252899   6.425354e-01
##   0      0.8384  0.8252899   6.425354e-01
##   0      0.8485  0.8257065   6.433439e-01
##   0      0.8586  0.8261232   6.441157e-01
##   0      0.8687  0.8252899   6.423390e-01
##   0      0.8788  0.8252899   6.423390e-01
##   0      0.8889  0.8252899   6.423390e-01
##   0      0.8990  0.8252899   6.422530e-01
##   0      0.9091  0.8252899   6.422530e-01
##   0      0.9192  0.8252899   6.422530e-01
##   0      0.9293  0.8252899   6.422530e-01
##   0      0.9394  0.8252899   6.422530e-01
##   0      0.9495  0.8252899   6.422530e-01
##   0      0.9596  0.8252899   6.422530e-01
##   0      0.9697  0.8261232   6.439072e-01
##   0      0.9798  0.8261232   6.439072e-01
##   0      0.9899  0.8265399   6.447279e-01
##   0      1.0000  0.8265399   6.447279e-01
##   1      0.0001  0.8333333   6.623223e-01
##   1      0.0102  0.8295290   6.541642e-01
##   1      0.0203  0.8269928   6.485851e-01
##   1      0.0304  0.8295290   6.538773e-01
##   1      0.0405  0.8282609   6.512860e-01
##   1      0.0506  0.8248551   6.444621e-01
##   1      0.0607  0.8210326   6.367883e-01
##   1      0.0708  0.8176630   6.298780e-01
##   1      0.0809  0.8117935   6.177948e-01
##   1      0.0910  0.8092572   6.124459e-01
##   1      0.1011  0.8037500   6.014228e-01
##   1      0.1112  0.7961594   5.858516e-01
##   1      0.1213  0.7899094   5.729635e-01
##   1      0.1314  0.7823913   5.577120e-01
##   1      0.1415  0.7759964   5.448798e-01
##   1      0.1516  0.7701087   5.329407e-01
##   1      0.1617  0.7667572   5.262559e-01
##   1      0.1718  0.7642572   5.209645e-01
##   1      0.1819  0.7617572   5.158794e-01
##   1      0.1920  0.7588949   5.096484e-01
##   1      0.2021  0.7496558   4.900685e-01
##   1      0.2122  0.7271739   4.420015e-01
##   1      0.2223  0.6733152   3.210903e-01
##   1      0.2324  0.5773913   1.005120e-01
##   1      0.2425  0.5368478   1.611613e-03
##   1      0.2526  0.5372645  -6.993007e-05
##   1      0.2627  0.5376812   0.000000e+00
##   1      0.2728  0.5376812   0.000000e+00
##   1      0.2829  0.5376812   0.000000e+00
##   1      0.2930  0.5376812   0.000000e+00
##   1      0.3031  0.5376812   0.000000e+00
##   1      0.3132  0.5376812   0.000000e+00
##   1      0.3233  0.5376812   0.000000e+00
##   1      0.3334  0.5376812   0.000000e+00
##   1      0.3435  0.5376812   0.000000e+00
##   1      0.3536  0.5376812   0.000000e+00
##   1      0.3637  0.5376812   0.000000e+00
##   1      0.3738  0.5376812   0.000000e+00
##   1      0.3839  0.5376812   0.000000e+00
##   1      0.3940  0.5376812   0.000000e+00
##   1      0.4041  0.5376812   0.000000e+00
##   1      0.4142  0.5376812   0.000000e+00
##   1      0.4243  0.5376812   0.000000e+00
##   1      0.4344  0.5376812   0.000000e+00
##   1      0.4445  0.5376812   0.000000e+00
##   1      0.4546  0.5376812   0.000000e+00
##   1      0.4647  0.5376812   0.000000e+00
##   1      0.4748  0.5376812   0.000000e+00
##   1      0.4849  0.5376812   0.000000e+00
##   1      0.4950  0.5376812   0.000000e+00
##   1      0.5051  0.5376812   0.000000e+00
##   1      0.5152  0.5376812   0.000000e+00
##   1      0.5253  0.5376812   0.000000e+00
##   1      0.5354  0.5376812   0.000000e+00
##   1      0.5455  0.5376812   0.000000e+00
##   1      0.5556  0.5376812   0.000000e+00
##   1      0.5657  0.5376812   0.000000e+00
##   1      0.5758  0.5376812   0.000000e+00
##   1      0.5859  0.5376812   0.000000e+00
##   1      0.5960  0.5376812   0.000000e+00
##   1      0.6061  0.5376812   0.000000e+00
##   1      0.6162  0.5376812   0.000000e+00
##   1      0.6263  0.5376812   0.000000e+00
##   1      0.6364  0.5376812   0.000000e+00
##   1      0.6465  0.5376812   0.000000e+00
##   1      0.6566  0.5376812   0.000000e+00
##   1      0.6667  0.5376812   0.000000e+00
##   1      0.6768  0.5376812   0.000000e+00
##   1      0.6869  0.5376812   0.000000e+00
##   1      0.6970  0.5376812   0.000000e+00
##   1      0.7071  0.5376812   0.000000e+00
##   1      0.7172  0.5376812   0.000000e+00
##   1      0.7273  0.5376812   0.000000e+00
##   1      0.7374  0.5376812   0.000000e+00
##   1      0.7475  0.5376812   0.000000e+00
##   1      0.7576  0.5376812   0.000000e+00
##   1      0.7677  0.5376812   0.000000e+00
##   1      0.7778  0.5376812   0.000000e+00
##   1      0.7879  0.5376812   0.000000e+00
##   1      0.7980  0.5376812   0.000000e+00
##   1      0.8081  0.5376812   0.000000e+00
##   1      0.8182  0.5376812   0.000000e+00
##   1      0.8283  0.5376812   0.000000e+00
##   1      0.8384  0.5376812   0.000000e+00
##   1      0.8485  0.5376812   0.000000e+00
##   1      0.8586  0.5376812   0.000000e+00
##   1      0.8687  0.5376812   0.000000e+00
##   1      0.8788  0.5376812   0.000000e+00
##   1      0.8889  0.5376812   0.000000e+00
##   1      0.8990  0.5376812   0.000000e+00
##   1      0.9091  0.5376812   0.000000e+00
##   1      0.9192  0.5376812   0.000000e+00
##   1      0.9293  0.5376812   0.000000e+00
##   1      0.9394  0.5376812   0.000000e+00
##   1      0.9495  0.5376812   0.000000e+00
##   1      0.9596  0.5376812   0.000000e+00
##   1      0.9697  0.5376812   0.000000e+00
##   1      0.9798  0.5376812   0.000000e+00
##   1      0.9899  0.5376812   0.000000e+00
##   1      1.0000  0.5376812   0.000000e+00
## 
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were alpha = 0 and lambda = 0.0304.
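
The lambda column above is consistent with 100 evenly spaced values between 0.0001 and 1, crossed with alpha = 0 (ridge) and alpha = 1 (lasso). The sketch below rebuilds a grid of that shape in base R. It is an assumption about how the grid was produced; the names alpha_values, lambda_values and tune_grid are illustrative and not taken from the original code.

```r
# Assumed reconstruction of the tuning grid; the original train() call is not shown here.
alpha_values  <- c(0, 1)                           # 0 = ridge penalty, 1 = lasso penalty
lambda_values <- seq(0.0001, 1, length.out = 100)  # evenly spaced penalty strengths

# One row per (alpha, lambda) combination
tune_grid <- expand.grid(alpha = alpha_values, lambda = lambda_values)

nrow(tune_grid)                # number of candidate models
head(round(lambda_values, 4))  # compare with the first lambda values in the table
```

In the table, the alpha = 1 rows fall to an accuracy of about 0.538 with Kappa 0 once lambda exceeds roughly 0.26. That pattern is what one would expect if the lasso penalty has shrunk every coefficient to zero, leaving a model that always predicts the same class.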
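# Predict classes for the held-out test set
pred_glmnet <- predict(glmnetFit, heart_test)

# Confusion matrix with class '0' treated as the positive class
glmConf <- confusionMatrix(reference = heart_test$Condition, data = pred_glmnet, mode = 'everything', positive = '0')

glmConf$byClass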
##          Sensitivity          Specificity       Pos Pred Value 
##            0.7741935            0.9259259            0.9230769 
##       Neg Pred Value            Precision               Recall 
##            0.7812500            0.9230769            0.7741935 
##                   F1           Prevalence       Detection Rate 
##            0.8421053            0.5344828            0.4137931 
## Detection Prevalence    Balanced Accuracy 
##            0.4482759            0.8500597
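# Accuracy: 86.2%
# Sensitivity: 87.1%
# Precision: 87.1%
# Note: the byClass output above reports Sensitivity 77.4% and Precision 92.3%, so these recorded figures do not match that output.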
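
The byClass figures printed above can be reproduced from four confusion-matrix counts. The counts in this sketch are not from the original output: they are inferred from the printed ratios under the assumption of 58 test rows, with class '0' as the positive class as in the confusionMatrix() call.

```r
# Counts inferred from the printed ratios, assuming 58 test rows (an assumption).
# Class '0' is the positive class.
tp <- 24  # predicted '0', actually '0'
fn <- 7   # predicted '1', actually '0'
fp <- 2   # predicted '0', actually '1'
tn <- 25  # predicted '1', actually '1'

sensitivity       <- tp / (tp + fn)            # also called recall
specificity       <- tn / (tn + fp)
precision         <- tp / (tp + fp)            # also called positive predictive value
f1                <- 2 * precision * sensitivity / (precision + sensitivity)
accuracy          <- (tp + tn) / (tp + fn + fp + tn)
balanced_accuracy <- (sensitivity + specificity) / 2

round(c(sensitivity = sensitivity, specificity = specificity,
        precision = precision, f1 = f1,
        accuracy = accuracy, balanced_accuracy = balanced_accuracy), 7)
```

Under these assumed counts the overall accuracy works out to 49/58, or about 84.5%. Reading the metrics from glmConf$byClass and glmConf$overall directly, rather than copying them by hand, avoids transcription drift between runs.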

Overall results

# Test-set metrics (in %) recorded for each model
model    <- c("GBM", "RF", "Logit", "GLMnet")
accuracy <- c(84.5, 87.9, 79.3, 86.2)
sens     <- c(83.9, 87.1, 77.4, 87.1)
prec     <- c(86.7, 90.0, 82.8, 87.1)

data.frame(model, accuracy, sens, prec)
##    model accuracy sens prec
## 1    GBM     84.5 83.9 86.7
## 2     RF     87.9 87.1 90.0
## 3  Logit     79.3 77.4 82.8
## 4 GLMnet     86.2 87.1 87.1

Most accurate model: Random Forest (87.9%)

Most sensitive model: Random Forest and GLMnet (tied at 87.1%)

Most precise model: Random Forest (90.0%)
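
The three conclusions above can also be read off the results table programmatically. The sketch below rebuilds the table from the values already shown and returns every model that attains the maximum of each metric, so ties are kept. The names results and best_models are illustrative and not part of the original analysis.

```r
# Results table rebuilt from the values reported above
results <- data.frame(
  model    = c("GBM", "RF", "Logit", "GLMnet"),
  accuracy = c(84.5, 87.9, 79.3, 86.2),
  sens     = c(83.9, 87.1, 77.4, 87.1),
  prec     = c(86.7, 90.0, 82.8, 87.1),
  stringsAsFactors = FALSE
)

# Hypothetical helper: all models that reach the maximum of a metric (ties included)
best_models <- function(df, metric) {
  df$model[df[[metric]] == max(df[[metric]])]
}

# Named list with one entry per metric
best <- sapply(c("accuracy", "sens", "prec"),
               function(m) best_models(results, m),
               simplify = FALSE)
best
```

Returning all tied models rather than only the first match is a deliberate choice here, since a call such as which.max() would report Random Forest alone for sensitivity and hide the tie with GLMnet.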

Most important variables according to RF

plot(varImp(rfFit, scale = FALSE))

As the plot shows, Chest Pain Type, Thalassemia and Num. Major Blood Vessels are the three most important predictors in the random forest model.

AUC of each predictor

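# Per-predictor ROC AUC on the training set; the last column (Condition) is dropped from the predictors
roc <- filterVarImp(x = heart_train[, -ncol(heart_train)], y = heart_train$Condition)
roc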
##                                 X0        X1
## Age                      0.6836648 0.6836648
## Sex                      0.6350142 0.6350142
## Chest Pain Type          0.7473722 0.7473722
## Resting Blood Pressure   0.5828480 0.5828480
## Cholesterol              0.5838068 0.5838068
## Fasting Blood Sugar      0.5011364 0.5011364
## Resting ECG              0.6001065 0.6001065
## Max. HR Achieved         0.7440696 0.7440696
## Exercise Induced Angina  0.6757812 0.6757812
## ST Depression            0.7515625 0.7515625
## ST Slope                 0.7104403 0.7104403
## Num. Major Blood Vessels 0.7451705 0.7451705
## Thalassemia              0.7582386 0.7582386
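
For a two-class outcome, the X0 and X1 columns above are identical, so a single column is enough to rank the predictors. The sketch below copies the printed AUC values into a named vector and sorts them in decreasing order. The names auc and auc_ranked are illustrative. With the fitted object available, the same ranking could be obtained by ordering roc on one of its columns.

```r
# AUC values copied from the output above (one column, since X0 and X1 are identical)
auc <- c(
  "Age"                      = 0.6836648,
  "Sex"                      = 0.6350142,
  "Chest Pain Type"          = 0.7473722,
  "Resting Blood Pressure"   = 0.5828480,
  "Cholesterol"              = 0.5838068,
  "Fasting Blood Sugar"      = 0.5011364,
  "Resting ECG"              = 0.6001065,
  "Max. HR Achieved"         = 0.7440696,
  "Exercise Induced Angina"  = 0.6757812,
  "ST Depression"            = 0.7515625,
  "ST Slope"                 = 0.7104403,
  "Num. Major Blood Vessels" = 0.7451705,
  "Thalassemia"              = 0.7582386
)

# Rank predictors from most to least discriminative on their own
auc_ranked <- sort(auc, decreasing = TRUE)
round(auc_ranked, 3)

names(auc_ranked)[1:3]  # three strongest single predictors by AUC
```

By this univariate measure the three strongest predictors are Thalassemia, ST Depression and Chest Pain Type, while Fasting Blood Sugar, at about 0.50, is close to chance. The ranking overlaps with the random forest importance but is not identical, which is expected because AUC scores each predictor in isolation.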