In recent years, advances in computing and machine learning have enabled software that assists doctors in detecting heart disease at an early stage. A heart disease prediction system can help medical professionals predict a patient's heart disease status from clinical data.
The main objective of this project is to answer the following research questions:
For this project, we use the Heart Disease dataset from http://archive.ics.uci.edu/ml/datasets/Heart+Disease, donated to UCI on 1 July 1988. The file used is processed.cleveland.data, and the data was collected at the Cleveland Clinic Foundation. The principal investigator for the data collection was Robert Detrano, M.D., Ph.D., of the V.A. Medical Center, Long Beach, and the Cleveland Clinic Foundation.
The dataset consists of 14 features, described below:
| No. | Features | Description |
|---|---|---|
| 1 | age | age in years |
| 2 | sex | sex (1 = male; 0 = female) |
| 3 | cp | chest pain type (1 = typical angina; 2 = atypical angina; 3 = non-anginal pain; 4 = asymptomatic) |
| 4 | trestbps | resting blood pressure (in mm Hg on admission to the hospital) |
| 5 | chol | serum cholesterol in mg/dl |
| 6 | fbs | fasting blood sugar > 120 mg/dl (1 = true; 0 = false) |
| 7 | restecg | resting electrocardiographic results (0 = normal; 1 = ST-T wave abnormality; 2 = probable or definite left ventricular hypertrophy) |
| 8 | thalach | maximum heart rate achieved |
| 9 | exang | exercise induced angina (1 = yes; 0 = no) |
| 10 | oldpeak | ST depression induced by exercise relative to rest |
| 11 | slope | the slope of the peak exercise ST segment (1 = upsloping; 2 = flat; 3 = downsloping) |
| 12 | ca | number of major vessels (0-3) colored by fluoroscopy |
| 13 | thal | 3 = normal; 6 = fixed defect; 7 = reversible defect |
| 14 | num | the predicted attribute: diagnosis of heart disease (angiographic disease status); 0 = < 50% diameter narrowing, 1-4 = > 50% diameter narrowing |
## Libraries
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyverse)
## ── Attaching packages ──────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2 ✓ purrr 0.3.4
## ✓ tibble 3.0.3 ✓ stringr 1.4.0
## ✓ tidyr 1.1.3 ✓ forcats 0.5.0
## ✓ readr 1.4.0
## ── Conflicts ─────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(ggplot2)
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
#library(cowplot)
library(waffle)
library(ggcorrplot)
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
First, read the raw data and store it in a data frame.
heart_df <- read.csv("processed.cleveland.data")
head(heart_df)
## X63.0 X1.0 X1.0.1 X145.0 X233.0 X1.0.2 X2.0 X150.0 X0.0 X2.3 X3.0 X0.0.1 X6.0
## 1 67 1 4 160 286 0 2 108 1 1.5 2 3.0 3.0
## 2 67 1 4 120 229 0 2 129 1 2.6 2 2.0 7.0
## 3 37 1 3 130 250 0 0 187 0 3.5 3 0.0 3.0
## 4 41 0 2 130 204 0 2 172 0 1.4 1 0.0 3.0
## 5 56 1 2 120 236 0 0 178 0 0.8 1 0.0 3.0
## 6 62 0 4 140 268 0 2 160 0 3.6 3 2.0 3.0
## X0
## 1 2
## 2 1
## 3 0
## 4 0
## 5 0
## 6 3
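The odd column names (X63.0, X1.0, ...) appear because read.csv() treated the first data record as a header row, so that record is dropped from the data frame (302 rows instead of 303). A minimal alternative read, assuming the raw file ships without a header and marks unknowns with "?", might look like:
# Sketch (assumption): read the headerless file directly so all 303 records are kept,
# and let "?" become NA on import instead of fixing it later.
heart_raw <- read.csv("processed.cleveland.data", header = FALSE, na.strings = "?")
nrow(heart_raw)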
The raw file does not contain column names, so we rename the columns according to the feature descriptions above.
names(heart_df) <- c('Age', 'Sex', 'Chest Pain Type', 'Resting Blood Pressure', 'Cholesterol', 'Fasting Blood Sugar', 'Resting ECG', 'Max. HR Achieved', 'Exercise Induced Angina', 'ST Depression', 'ST Slope', 'Num. Major Blood Vessels', 'Thalassemia', 'Condition')
head(heart_df)
## Age Sex Chest Pain Type Resting Blood Pressure Cholesterol
## 1 67 1 4 160 286
## 2 67 1 4 120 229
## 3 37 1 3 130 250
## 4 41 0 2 130 204
## 5 56 1 2 120 236
## 6 62 0 4 140 268
## Fasting Blood Sugar Resting ECG Max. HR Achieved Exercise Induced Angina
## 1 0 2 108 1
## 2 0 2 129 1
## 3 0 0 187 0
## 4 0 2 172 0
## 5 0 0 178 0
## 6 0 2 160 0
## ST Depression ST Slope Num. Major Blood Vessels Thalassemia Condition
## 1 1.5 2 3.0 3.0 2
## 2 2.6 2 2.0 7.0 1
## 3 3.5 3 0.0 3.0 0
## 4 1.4 1 0.0 3.0 0
## 5 0.8 1 0.0 3.0 0
## 6 3.6 3 2.0 3.0 3
We will check the data for any missing values.
sum(is.na(heart_df))
## [1] 0
summary(heart_df)
## Age Sex Chest Pain Type Resting Blood Pressure
## Min. :29.00 Min. :0.0000 Min. :1.000 Min. : 94.0
## 1st Qu.:48.00 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:120.0
## Median :55.50 Median :1.0000 Median :3.000 Median :130.0
## Mean :54.41 Mean :0.6788 Mean :3.166 Mean :131.6
## 3rd Qu.:61.00 3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:140.0
## Max. :77.00 Max. :1.0000 Max. :4.000 Max. :200.0
## Cholesterol Fasting Blood Sugar Resting ECG Max. HR Achieved
## Min. :126.0 Min. :0.0000 Min. :0.0000 Min. : 71.0
## 1st Qu.:211.0 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:133.2
## Median :241.5 Median :0.0000 Median :0.5000 Median :153.0
## Mean :246.7 Mean :0.1457 Mean :0.9868 Mean :149.6
## 3rd Qu.:275.0 3rd Qu.:0.0000 3rd Qu.:2.0000 3rd Qu.:166.0
## Max. :564.0 Max. :1.0000 Max. :2.0000 Max. :202.0
## Exercise Induced Angina ST Depression ST Slope
## Min. :0.0000 Min. :0.000 Min. :1.000
## 1st Qu.:0.0000 1st Qu.:0.000 1st Qu.:1.000
## Median :0.0000 Median :0.800 Median :2.000
## Mean :0.3278 Mean :1.035 Mean :1.596
## 3rd Qu.:1.0000 3rd Qu.:1.600 3rd Qu.:2.000
## Max. :1.0000 Max. :6.200 Max. :3.000
## Num. Major Blood Vessels Thalassemia Condition
## Length:302 Length:302 Min. :0.0000
## Class :character Class :character 1st Qu.:0.0000
## Mode :character Mode :character Median :0.0000
## Mean :0.9404
## 3rd Qu.:2.0000
## Max. :4.0000
str(heart_df)
## 'data.frame': 302 obs. of 14 variables:
## $ Age : num 67 67 37 41 56 62 57 63 53 57 ...
## $ Sex : num 1 1 1 0 1 0 0 1 1 1 ...
## $ Chest Pain Type : num 4 4 3 2 2 4 4 4 4 4 ...
## $ Resting Blood Pressure : num 160 120 130 130 120 140 120 130 140 140 ...
## $ Cholesterol : num 286 229 250 204 236 268 354 254 203 192 ...
## $ Fasting Blood Sugar : num 0 0 0 0 0 0 0 0 1 0 ...
## $ Resting ECG : num 2 2 0 2 0 2 0 2 2 0 ...
## $ Max. HR Achieved : num 108 129 187 172 178 160 163 147 155 148 ...
## $ Exercise Induced Angina : num 1 1 0 0 0 0 1 0 1 0 ...
## $ ST Depression : num 1.5 2.6 3.5 1.4 0.8 3.6 0.6 1.4 3.1 0.4 ...
## $ ST Slope : num 2 2 3 1 1 3 1 2 3 2 ...
## $ Num. Major Blood Vessels: chr "3.0" "2.0" "0.0" "0.0" ...
## $ Thalassemia : chr "3.0" "7.0" "3.0" "3.0" ...
## $ Condition : int 2 1 0 0 0 3 0 2 1 0 ...
No missing values were found. However, the Num. Major Blood Vessels and Thalassemia features were imported as character, although according to the feature description they should be numeric, so we check their distinct values.
unique(heart_df$`Num. Major Blood Vessels`)
## [1] "3.0" "2.0" "0.0" "1.0" "?"
unique(heart_df$`Thalassemia`)
## [1] "3.0" "7.0" "6.0" "?"
Unknown values are denoted by "?". One option is to replace them with the median of the series; here we instead recode "?" as NA, drop the affected rows, and convert the columns to numeric (a median-imputation sketch is shown after the cleaning steps below).
heart_df$`Num. Major Blood Vessels`[heart_df$`Num. Major Blood Vessels` == "?"] <- NA
heart_df$`Thalassemia`[heart_df$`Thalassemia` == "?"] <- NA
nrow(heart_df)
## [1] 302
heart_df <- heart_df[complete.cases(heart_df),]
str(heart_df)
## 'data.frame': 296 obs. of 14 variables:
## $ Age : num 67 67 37 41 56 62 57 63 53 57 ...
## $ Sex : num 1 1 1 0 1 0 0 1 1 1 ...
## $ Chest Pain Type : num 4 4 3 2 2 4 4 4 4 4 ...
## $ Resting Blood Pressure : num 160 120 130 130 120 140 120 130 140 140 ...
## $ Cholesterol : num 286 229 250 204 236 268 354 254 203 192 ...
## $ Fasting Blood Sugar : num 0 0 0 0 0 0 0 0 1 0 ...
## $ Resting ECG : num 2 2 0 2 0 2 0 2 2 0 ...
## $ Max. HR Achieved : num 108 129 187 172 178 160 163 147 155 148 ...
## $ Exercise Induced Angina : num 1 1 0 0 0 0 1 0 1 0 ...
## $ ST Depression : num 1.5 2.6 3.5 1.4 0.8 3.6 0.6 1.4 3.1 0.4 ...
## $ ST Slope : num 2 2 3 1 1 3 1 2 3 2 ...
## $ Num. Major Blood Vessels: chr "3.0" "2.0" "0.0" "0.0" ...
## $ Thalassemia : chr "3.0" "7.0" "3.0" "3.0" ...
## $ Condition : int 2 1 0 0 0 3 0 2 1 0 ...
summary(heart_df)
## Age Sex Chest Pain Type Resting Blood Pressure
## Min. :29.00 Min. :0.0000 Min. :1.000 Min. : 94.0
## 1st Qu.:48.00 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:120.0
## Median :56.00 Median :1.0000 Median :3.000 Median :130.0
## Mean :54.51 Mean :0.6757 Mean :3.166 Mean :131.6
## 3rd Qu.:61.00 3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:140.0
## Max. :77.00 Max. :1.0000 Max. :4.000 Max. :200.0
## Cholesterol Fasting Blood Sugar Resting ECG Max. HR Achieved
## Min. :126.0 Min. :0.0000 Min. :0.0000 Min. : 71.0
## 1st Qu.:211.0 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:133.0
## Median :243.0 Median :0.0000 Median :1.0000 Median :153.0
## Mean :247.4 Mean :0.1419 Mean :0.9932 Mean :149.6
## 3rd Qu.:276.2 3rd Qu.:0.0000 3rd Qu.:2.0000 3rd Qu.:166.0
## Max. :564.0 Max. :1.0000 Max. :2.0000 Max. :202.0
## Exercise Induced Angina ST Depression ST Slope
## Min. :0.0000 Min. :0.000 Min. :1.000
## 1st Qu.:0.0000 1st Qu.:0.000 1st Qu.:1.000
## Median :0.0000 Median :0.800 Median :2.000
## Mean :0.3277 Mean :1.051 Mean :1.598
## 3rd Qu.:1.0000 3rd Qu.:1.600 3rd Qu.:2.000
## Max. :1.0000 Max. :6.200 Max. :3.000
## Num. Major Blood Vessels Thalassemia Condition
## Length:296 Length:296 Min. :0.0000
## Class :character Class :character 1st Qu.:0.0000
## Mode :character Mode :character Median :0.0000
## Mean :0.9493
## 3rd Qu.:2.0000
## Max. :4.0000
According to the feature description, any Condition value greater than or equal to 1 denotes the presence of heart disease, so we recode values 2, 3, and 4 to 1.
unique(heart_df$`Condition`)
## [1] 2 1 0 3 4
heart_df$`Condition`[heart_df$`Condition` == 2] <- 1
heart_df$`Condition`[heart_df$`Condition` == 3] <- 1
heart_df$`Condition`[heart_df$`Condition` == 4] <- 1
unique(heart_df$`Condition`)
## [1] 1 0
heart_df$Condition <- as.factor(heart_df$Condition)
heart_df$Thalassemia <-as.numeric(heart_df$Thalassemia)
heart_df$`Num. Major Blood Vessels` <-as.numeric(heart_df$`Num. Major Blood Vessels`)
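For reference, the median-replacement option mentioned earlier could look like the sketch below. It is not used in this analysis, and for a categorical code such as Thalassemia the mode would arguably be more appropriate than the median:
# Sketch (alternative, not used here): impute the column median instead of dropping
# the rows where "?" appeared; this would need to run before complete.cases() above.
impute_median <- function(x) {
  x <- as.numeric(x)                      # "?" has already been recoded to NA
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}
# heart_df$`Num. Major Blood Vessels` <- impute_median(heart_df$`Num. Major Blood Vessels`)
# heart_df$Thalassemia <- impute_median(heart_df$Thalassemia)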
From these observations, the features can be classified as numerical or categorical, as summarised below (an optional factor-conversion sketch follows the table):
| No. | Features | Category |
|---|---|---|
| 1 | Age | Numerical |
| 2 | Sex | Categorical |
| 3 | Chest Pain Type | Categorical |
| 4 | Resting Blood Pressure | Numerical |
| 5 | Cholesterol | Numerical |
| 6 | Fasting Blood Sugar | Categorical |
| 7 | Resting ECG | Categorical |
| 8 | Max. HR Achieved | Numerical |
| 9 | Exercise Induced Angina | Categorical |
| 10 | ST Depression | Numerical |
| 11 | ST Slope | Categorical |
| 12 | Num. Major Blood Vessels | Numerical |
| 13 | Thalassemia | Categorical |
| 14 | Condition | Categorical |
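If we wanted the models to treat the categorical features as unordered factors rather than numeric codes, a conversion sketch could look like the following (an alternative; the analysis below keeps the numeric encoding):
# Sketch (alternative, not used below): convert categorical features to factors
categorical_cols <- c('Sex', 'Chest Pain Type', 'Fasting Blood Sugar', 'Resting ECG',
                      'Exercise Induced Angina', 'ST Slope', 'Thalassemia')
heart_factored <- heart_df
heart_factored[categorical_cols] <- lapply(heart_factored[categorical_cols], as.factor)
str(heart_factored[categorical_cols])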
Our data is now ready for analysis.
cond <- heart_df %>%
ggplot(aes(x=Condition,fill=factor(Condition))) +
geom_bar(alpha=0.8) +
geom_text(
aes(label = sprintf('%s (%.0f%%)', after_stat(count), after_stat(count/sum(count)*100))),
stat='count',
vjust = -0.25
)
grid.arrange(cond, ncol=1)
#waffle(heart_df$Condition/3)
From the bar chart, the cleaned dataset (296 samples) is fairly balanced: slightly more than half of the patients have no heart disease (Condition = 0) and the remainder have heart disease (Condition = 1).
However, according to the CDC fact sheet https://www.cdc.gov/nchs/fastats/heart-disease.htm, only about 4.6 percent of U.S. adults have ever been diagnosed with coronary heart disease. Although there is a large discrepancy between our dataset and the general population, it will not be a problem for our prediction; it is simply something to be aware of.
age <- heart_df %>%
ggplot(aes(x=`Age`, fill=factor(Condition))) +
geom_density(alpha = 0.8)
rbp <- heart_df %>%
ggplot(aes(x=`Resting Blood Pressure`,fill=factor(Condition))) +
geom_density(alpha = 0.8)
chl <- heart_df %>%
ggplot(aes(x=`Cholesterol`,fill=factor(Condition))) +
geom_density(alpha = 0.8)
mha <- heart_df %>%
ggplot(aes(x=`Max. HR Achieved`,fill=factor(Condition))) +
geom_density(alpha = 0.8)
std <- heart_df %>%
ggplot(aes(x=`ST Depression`,fill=factor(Condition))) +
geom_density(alpha = 0.8)
mbv <- heart_df %>%
ggplot(aes(x=`Num. Major Blood Vessels`,fill=factor(Condition))) +
geom_density(alpha = 0.8)
grid.arrange(age, rbp, chl, mha, std, mbv, ncol=2, nrow=3)
We do see some differences between the conditions. In particular, Num. Major Blood Vessels, Age, ST Depression, and Max. HR Achieved appear to be important; we will explore these further, as they are likely to be useful for our models. Let's zoom in on two noticeable plots.
grid.arrange(mha, mbv, ncol=2)
It looks like these two variables have a strong impact. They will likely become important features for our model later on.
sex_p <- heart_df %>%
ggplot(aes(x=`Sex`)) +
geom_bar(position='dodge',fill="orange",alpha=0.8)
cp_p <- heart_df %>%
ggplot(aes(x=`Chest Pain Type`)) +
geom_bar(position='dodge',fill="orange",alpha=0.8)
fbs_p <- heart_df %>%
ggplot(aes(x=`Fasting Blood Sugar`)) +
geom_bar(position='dodge',fill="orange",alpha=0.8)
recg_p <- heart_df %>%
ggplot(aes(x=`Resting ECG`)) +
geom_bar(position='dodge',fill="orange",alpha=0.8)
eia_p <- heart_df %>%
ggplot(aes(x=`Exercise Induced Angina`)) +
geom_bar(position='dodge',fill="orange",alpha=0.8)
slope_p <- heart_df %>%
ggplot(aes(x=`ST Slope`)) +
geom_bar(position='dodge',fill="orange",alpha=0.8)
thal_p <- heart_df %>%
ggplot(aes(x=`Thalassemia`)) +
geom_bar(position='dodge',fill="orange",alpha=0.8)
grid.arrange(sex_p, cp_p, fbs_p, recg_p, eia_p, slope_p, thal_p, ncol=2, nrow=4)
So above we see how common or uncommon certain categories are. For example, category 1 of Resting ECG is very uncommon. But how does the Condition variable present itself with respect to each of these features? Can we learn anything?
sex_p <- heart_df %>%
ggplot(aes(x=`Sex`, fill=factor(Condition))) +
geom_bar(position='dodge',alpha=0.8)
cp_p <- heart_df %>%
ggplot(aes(x=`Chest Pain Type`,fill=factor(Condition))) +
geom_bar(position='dodge',alpha=0.8)
fbs_p <- heart_df %>%
ggplot(aes(x=`Fasting Blood Sugar`,fill=factor(Condition))) +
geom_bar(position='dodge',alpha=0.8)
recg_p <- heart_df %>%
ggplot(aes(x=`Resting ECG`,fill=factor(Condition))) +
geom_bar(position='dodge',alpha=0.8)
eia_p <- heart_df %>%
ggplot(aes(x=`Exercise Induced Angina`,fill=factor(Condition))) +
geom_bar(position='dodge',alpha=0.8)
slope_p <- heart_df %>%
ggplot(aes(x=`ST Slope`,fill=factor(Condition))) +
geom_bar(position='dodge')
thal_p <- heart_df %>%
ggplot(aes(x=`Thalassemia`,fill=factor(Condition))) +
geom_bar(position='dodge')
grid.arrange(sex_p, cp_p, fbs_p, recg_p, eia_p, slope_p, thal_p, ncol=2, nrow=4)
Let's now zoom in on a couple of standout observations.
grid.arrange(cp_p, thal_p, ncol=2)
Certain Thalassemia and Chest Pain Type values look highly indicative of heart disease, while others are associated with lower risk.
So far, a few important points have been observed.
We split the dataset into a training set and a test set, with 80% of the data in the training set and 20% in the test set.
X = heart_df[,c('Age', 'Sex', 'Chest Pain Type', 'Resting Blood Pressure', 'Cholesterol', 'Fasting Blood Sugar', 'Resting ECG', 'Max. HR Achieved', 'Exercise Induced Angina', 'ST Depression', 'ST Slope', 'Num. Major Blood Vessels', 'Thalassemia')]
y = heart_df[,'Condition']
set.seed(1234)
trainIndex <- createDataPartition(y, p = 0.8,
list = FALSE,
times = 1)
heart_train <- heart_df[trainIndex,]
heart_test <- heart_df[-trainIndex,]
y_test <- heart_test$Condition
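Because createDataPartition() samples within each class of the outcome, the split is stratified; a quick check of the class proportions would confirm this (a sketch; output not shown):
# Sketch: class proportions should be similar across the full data, training set, and test set
prop.table(table(heart_df$Condition))
prop.table(table(heart_train$Condition))
prop.table(table(heart_test$Condition))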
We resample using 10-fold cross-validation repeated 10 times. The modelling techniques used are Gradient Boosting (GBM), Random Forest, Boosted Logistic Regression (LogitBoost), and a penalised GLM (glmnet).
fitControl <- trainControl(## 10-fold CV
method = "repeatedcv",
number = 10,
## repeated ten times
repeats = 10)
#Gradient boosted model
gbmGrid <- expand.grid(interaction.depth = c(1, 5, 9),
n.trees = (1:30)*50,
shrinkage = 0.1,
n.minobsinnode = 20)
nrow(gbmGrid)
## [1] 90
gbmFit <- train(Condition ~ ., data = heart_train,
method = "gbm",
trControl = fitControl,
verbose = FALSE,
tuneGrid = gbmGrid)
gbmFit
## Stochastic Gradient Boosting
##
## 238 samples
## 13 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times)
## Summary of sample sizes: 214, 214, 214, 215, 214, 214, ...
## Resampling results across tuning parameters:
##
## interaction.depth n.trees Accuracy Kappa
## 1 50 0.8295471 0.6538782
## 1 100 0.8223007 0.6398456
## 1 150 0.8173913 0.6298750
## 1 200 0.8123370 0.6199509
## 1 250 0.8073551 0.6103521
## 1 300 0.8094746 0.6144505
## 1 350 0.8011232 0.5976685
## 1 400 0.8019384 0.5992130
## 1 450 0.7998007 0.5953020
## 1 500 0.7976993 0.5910714
## 1 550 0.8006522 0.5968089
## 1 600 0.7989493 0.5934410
## 1 650 0.8032428 0.6021590
## 1 700 0.7973188 0.5901957
## 1 750 0.7977174 0.5909452
## 1 800 0.7956522 0.5867058
## 1 850 0.7952174 0.5860359
## 1 900 0.7948007 0.5854955
## 1 950 0.7981703 0.5920128
## 1 1000 0.7926993 0.5809246
## 1 1050 0.7943297 0.5840233
## 1 1100 0.7951993 0.5857265
## 1 1150 0.7897645 0.5747940
## 1 1200 0.7880616 0.5717075
## 1 1250 0.7893841 0.5742973
## 1 1300 0.7876449 0.5709684
## 1 1350 0.7868116 0.5693416
## 1 1400 0.7880797 0.5718237
## 1 1450 0.7918297 0.5792441
## 1 1500 0.7888949 0.5739633
## 5 50 0.8175725 0.6302962
## 5 100 0.8063949 0.6077375
## 5 150 0.7996739 0.5939868
## 5 200 0.7955797 0.5857695
## 5 250 0.7895833 0.5737680
## 5 300 0.7837862 0.5622346
## 5 350 0.7816304 0.5575412
## 5 400 0.7782971 0.5514361
## 5 450 0.7833514 0.5621680
## 5 500 0.7815761 0.5583140
## 5 550 0.7845290 0.5646668
## 5 600 0.7820109 0.5596157
## 5 650 0.7807428 0.5565966
## 5 700 0.7819565 0.5591593
## 5 750 0.7844565 0.5642783
## 5 800 0.7793841 0.5542145
## 5 850 0.7802536 0.5558767
## 5 900 0.7807065 0.5568982
## 5 950 0.7823913 0.5599739
## 5 1000 0.7828080 0.5605561
## 5 1050 0.7844928 0.5641076
## 5 1100 0.7807065 0.5566675
## 5 1150 0.7806703 0.5565545
## 5 1200 0.7778080 0.5507785
## 5 1250 0.7794746 0.5541416
## 5 1300 0.7794565 0.5541551
## 5 1350 0.7815580 0.5585346
## 5 1400 0.7798913 0.5553025
## 5 1450 0.7773732 0.5500746
## 5 1500 0.7773370 0.5500615
## 9 50 0.8218841 0.6394643
## 9 100 0.8072464 0.6090133
## 9 150 0.8043297 0.6036770
## 9 200 0.7937862 0.5823047
## 9 250 0.7912862 0.5771588
## 9 300 0.7909420 0.5766143
## 9 350 0.7883333 0.5715755
## 9 400 0.7845652 0.5641770
## 9 450 0.7874819 0.5699383
## 9 500 0.7841848 0.5632181
## 9 550 0.7824638 0.5598663
## 9 600 0.7819928 0.5591557
## 9 650 0.7840761 0.5632732
## 9 700 0.7841486 0.5634000
## 9 750 0.7837138 0.5624185
## 9 800 0.7853623 0.5658350
## 9 850 0.7815942 0.5584138
## 9 900 0.7807428 0.5568626
## 9 950 0.7782428 0.5515547
## 9 1000 0.7828442 0.5610630
## 9 1050 0.7824638 0.5597803
## 9 1100 0.7803261 0.5557048
## 9 1150 0.7790761 0.5533578
## 9 1200 0.7820290 0.5589906
## 9 1250 0.7819928 0.5588478
## 9 1300 0.7811775 0.5575289
## 9 1350 0.7837319 0.5627105
## 9 1400 0.7803804 0.5559026
## 9 1450 0.7778261 0.5508612
## 9 1500 0.7769928 0.5490401
##
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 20
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 50, interaction.depth =
## 1, shrinkage = 0.1 and n.minobsinnode = 20.
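Before settling on the selected values, the resampling profile across the tuning grid can also be visualised (a sketch; plot not shown):
# Sketch: accuracy across n.trees and interaction.depth for the tuned GBM
plot(gbmFit)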
pred_gbm <- predict(gbmFit,heart_test)
gbmConf <- confusionMatrix(reference = heart_test$Condition, data = pred_gbm, mode='everything', positive='0')
gbmConf$byClass
## Sensitivity Specificity Pos Pred Value
## 0.7419355 0.9259259 0.9200000
## Neg Pred Value Precision Recall
## 0.7575758 0.9200000 0.7419355
## F1 Prevalence Detection Rate
## 0.8214286 0.5344828 0.3965517
## Detection Prevalence Balanced Accuracy
## 0.4310345 0.8339307
#Accuracy : 84.5%
#Sensitivity: 83.9%
#Precision: 86.7%
mtry <- sqrt(ncol(X)) # square root of the number of predictors; floor() would give the conventional classification default of 3
rfGrid <- expand.grid(mtry = mtry)
rfFit <- train(Condition ~ ., data = heart_train,
method = "rf",
trControl = fitControl,
verbose = FALSE,
tuneGrid = rfGrid)
rfFit
## Random Forest
##
## 238 samples
## 13 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times)
## Summary of sample sizes: 214, 214, 214, 214, 214, 214, ...
## Resampling results:
##
## Accuracy Kappa
## 0.8024457 0.6006811
##
## Tuning parameter 'mtry' was held constant at a value of 3.605551
pred_rf <- predict(rfFit,heart_test)
rfConf <- confusionMatrix(reference = heart_test$Condition, data = pred_rf, mode='everything', positive='0')
rfConf$byClass
## Sensitivity Specificity Pos Pred Value
## 0.7419355 0.9259259 0.9200000
## Neg Pred Value Precision Recall
## 0.7575758 0.9200000 0.7419355
## F1 Prevalence Detection Rate
## 0.8214286 0.5344828 0.3965517
## Detection Prevalence Balanced Accuracy
## 0.4310345 0.8339307
#Accuracy: 87.9%
#Sensitivity: 87.1%
#Precision: 90.0%
logitFit <- train(Condition ~ ., data = heart_train,
method = "LogitBoost",
trControl = fitControl,
verbose = FALSE )
logitFit
## Boosted Logistic Regression
##
## 238 samples
## 13 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times)
## Summary of sample sizes: 214, 214, 214, 214, 215, 215, ...
## Resampling results across tuning parameters:
##
## nIter Accuracy Kappa
## 11 0.8060688 0.6083785
## 21 0.7865399 0.5704245
## 31 0.7818659 0.5593302
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was nIter = 11.
pred_logit <- predict(logitFit,heart_test)
logitConf <- confusionMatrix(reference = heart_test$Condition, data = pred_logit, mode='everything', positive='0')
logitConf$byClass
## Sensitivity Specificity Pos Pred Value
## 0.6129032 0.9259259 0.9047619
## Neg Pred Value Precision Recall
## 0.6756757 0.9047619 0.6129032
## F1 Prevalence Detection Rate
## 0.7307692 0.5344828 0.3275862
## Detection Prevalence Balanced Accuracy
## 0.3620690 0.7694146
#Accuracy: 79.3%
#Sensitivity: 77.4%
#Precision: 82.8%
glmnetGrid <- expand.grid(alpha = 0:1, lambda = seq(0.0001, 1, length = 100))
glmnetFit <- train(Condition ~ ., data = heart_train,
method = "glmnet",
trControl = fitControl,
verbose = FALSE, tuneGrid = glmnetGrid )
glmnetFit
## glmnet
##
## 238 samples
## 13 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times)
## Summary of sample sizes: 214, 214, 214, 215, 215, 214, ...
## Resampling results across tuning parameters:
##
## alpha lambda Accuracy Kappa
## 0 0.0001 0.8346014 6.643597e-01
## 0 0.0102 0.8346014 6.643597e-01
## 0 0.0203 0.8346014 6.643597e-01
## 0 0.0304 0.8346014 6.643597e-01
## 0 0.0405 0.8341667 6.635040e-01
## 0 0.0506 0.8341667 6.635040e-01
## 0 0.0607 0.8329167 6.608108e-01
## 0 0.0708 0.8329167 6.607756e-01
## 0 0.0809 0.8333333 6.615968e-01
## 0 0.0910 0.8328986 6.606261e-01
## 0 0.1011 0.8328986 6.606261e-01
## 0 0.1112 0.8333333 6.614218e-01
## 0 0.1213 0.8329167 6.605770e-01
## 0 0.1314 0.8324819 6.595840e-01
## 0 0.1415 0.8328986 6.604046e-01
## 0 0.1516 0.8316486 6.577714e-01
## 0 0.1617 0.8316486 6.577357e-01
## 0 0.1718 0.8312319 6.568416e-01
## 0 0.1819 0.8320652 6.584734e-01
## 0 0.1920 0.8328986 6.601144e-01
## 0 0.2021 0.8328986 6.601144e-01
## 0 0.2122 0.8328986 6.601144e-01
## 0 0.2223 0.8324819 6.592330e-01
## 0 0.2324 0.8325000 6.591941e-01
## 0 0.2425 0.8325000 6.591941e-01
## 0 0.2526 0.8329167 6.600250e-01
## 0 0.2627 0.8333514 6.608927e-01
## 0 0.2728 0.8329348 6.599986e-01
## 0 0.2829 0.8325181 6.591176e-01
## 0 0.2930 0.8333696 6.607835e-01
## 0 0.3031 0.8329348 6.598887e-01
## 0 0.3132 0.8333514 6.607098e-01
## 0 0.3233 0.8333333 6.605717e-01
## 0 0.3334 0.8333333 6.605717e-01
## 0 0.3435 0.8333333 6.605717e-01
## 0 0.3536 0.8329167 6.596431e-01
## 0 0.3637 0.8333152 6.604142e-01
## 0 0.3738 0.8324819 6.586631e-01
## 0 0.3839 0.8328986 6.594959e-01
## 0 0.3940 0.8337500 6.611165e-01
## 0 0.4041 0.8320471 6.575378e-01
## 0 0.4142 0.8307790 6.548965e-01
## 0 0.4243 0.8307790 6.548965e-01
## 0 0.4344 0.8307790 6.548965e-01
## 0 0.4445 0.8299094 6.530859e-01
## 0 0.4546 0.8294928 6.521932e-01
## 0 0.4647 0.8286594 6.504305e-01
## 0 0.4748 0.8286594 6.503580e-01
## 0 0.4849 0.8286594 6.503580e-01
## 0 0.4950 0.8294928 6.519998e-01
## 0 0.5051 0.8303261 6.536641e-01
## 0 0.5152 0.8303261 6.535781e-01
## 0 0.5253 0.8299094 6.526710e-01
## 0 0.5354 0.8290761 6.509201e-01
## 0 0.5455 0.8290761 6.509201e-01
## 0 0.5556 0.8286594 6.500260e-01
## 0 0.5657 0.8286594 6.500260e-01
## 0 0.5758 0.8286594 6.500260e-01
## 0 0.5859 0.8282428 6.491694e-01
## 0 0.5960 0.8282428 6.491694e-01
## 0 0.6061 0.8282428 6.491694e-01
## 0 0.6162 0.8282428 6.491694e-01
## 0 0.6263 0.8278080 6.481798e-01
## 0 0.6364 0.8278080 6.481798e-01
## 0 0.6465 0.8273913 6.472983e-01
## 0 0.6566 0.8273913 6.472983e-01
## 0 0.6667 0.8273913 6.472983e-01
## 0 0.6768 0.8269746 6.464042e-01
## 0 0.6869 0.8261413 6.446192e-01
## 0 0.6970 0.8265580 6.454283e-01
## 0 0.7071 0.8265580 6.454283e-01
## 0 0.7172 0.8261413 6.445596e-01
## 0 0.7273 0.8261413 6.445596e-01
## 0 0.7374 0.8257246 6.435812e-01
## 0 0.7475 0.8253080 6.427122e-01
## 0 0.7576 0.8248913 6.418431e-01
## 0 0.7677 0.8244746 6.409862e-01
## 0 0.7778 0.8257246 6.434473e-01
## 0 0.7879 0.8261413 6.442802e-01
## 0 0.7980 0.8257246 6.434233e-01
## 0 0.8081 0.8252899 6.425354e-01
## 0 0.8182 0.8252899 6.425354e-01
## 0 0.8283 0.8252899 6.425354e-01
## 0 0.8384 0.8252899 6.425354e-01
## 0 0.8485 0.8257065 6.433439e-01
## 0 0.8586 0.8261232 6.441157e-01
## 0 0.8687 0.8252899 6.423390e-01
## 0 0.8788 0.8252899 6.423390e-01
## 0 0.8889 0.8252899 6.423390e-01
## 0 0.8990 0.8252899 6.422530e-01
## 0 0.9091 0.8252899 6.422530e-01
## 0 0.9192 0.8252899 6.422530e-01
## 0 0.9293 0.8252899 6.422530e-01
## 0 0.9394 0.8252899 6.422530e-01
## 0 0.9495 0.8252899 6.422530e-01
## 0 0.9596 0.8252899 6.422530e-01
## 0 0.9697 0.8261232 6.439072e-01
## 0 0.9798 0.8261232 6.439072e-01
## 0 0.9899 0.8265399 6.447279e-01
## 0 1.0000 0.8265399 6.447279e-01
## 1 0.0001 0.8333333 6.623223e-01
## 1 0.0102 0.8295290 6.541642e-01
## 1 0.0203 0.8269928 6.485851e-01
## 1 0.0304 0.8295290 6.538773e-01
## 1 0.0405 0.8282609 6.512860e-01
## 1 0.0506 0.8248551 6.444621e-01
## 1 0.0607 0.8210326 6.367883e-01
## 1 0.0708 0.8176630 6.298780e-01
## 1 0.0809 0.8117935 6.177948e-01
## 1 0.0910 0.8092572 6.124459e-01
## 1 0.1011 0.8037500 6.014228e-01
## 1 0.1112 0.7961594 5.858516e-01
## 1 0.1213 0.7899094 5.729635e-01
## 1 0.1314 0.7823913 5.577120e-01
## 1 0.1415 0.7759964 5.448798e-01
## 1 0.1516 0.7701087 5.329407e-01
## 1 0.1617 0.7667572 5.262559e-01
## 1 0.1718 0.7642572 5.209645e-01
## 1 0.1819 0.7617572 5.158794e-01
## 1 0.1920 0.7588949 5.096484e-01
## 1 0.2021 0.7496558 4.900685e-01
## 1 0.2122 0.7271739 4.420015e-01
## 1 0.2223 0.6733152 3.210903e-01
## 1 0.2324 0.5773913 1.005120e-01
## 1 0.2425 0.5368478 1.611613e-03
## 1 0.2526 0.5372645 -6.993007e-05
## 1 0.2627 0.5376812 0.000000e+00
## 1 0.2728 0.5376812 0.000000e+00
## 1 0.2829 0.5376812 0.000000e+00
## 1 0.2930 0.5376812 0.000000e+00
## 1 0.3031 0.5376812 0.000000e+00
## 1 0.3132 0.5376812 0.000000e+00
## 1 0.3233 0.5376812 0.000000e+00
## 1 0.3334 0.5376812 0.000000e+00
## 1 0.3435 0.5376812 0.000000e+00
## 1 0.3536 0.5376812 0.000000e+00
## 1 0.3637 0.5376812 0.000000e+00
## 1 0.3738 0.5376812 0.000000e+00
## 1 0.3839 0.5376812 0.000000e+00
## 1 0.3940 0.5376812 0.000000e+00
## 1 0.4041 0.5376812 0.000000e+00
## 1 0.4142 0.5376812 0.000000e+00
## 1 0.4243 0.5376812 0.000000e+00
## 1 0.4344 0.5376812 0.000000e+00
## 1 0.4445 0.5376812 0.000000e+00
## 1 0.4546 0.5376812 0.000000e+00
## 1 0.4647 0.5376812 0.000000e+00
## 1 0.4748 0.5376812 0.000000e+00
## 1 0.4849 0.5376812 0.000000e+00
## 1 0.4950 0.5376812 0.000000e+00
## 1 0.5051 0.5376812 0.000000e+00
## 1 0.5152 0.5376812 0.000000e+00
## 1 0.5253 0.5376812 0.000000e+00
## 1 0.5354 0.5376812 0.000000e+00
## 1 0.5455 0.5376812 0.000000e+00
## 1 0.5556 0.5376812 0.000000e+00
## 1 0.5657 0.5376812 0.000000e+00
## 1 0.5758 0.5376812 0.000000e+00
## 1 0.5859 0.5376812 0.000000e+00
## 1 0.5960 0.5376812 0.000000e+00
## 1 0.6061 0.5376812 0.000000e+00
## 1 0.6162 0.5376812 0.000000e+00
## 1 0.6263 0.5376812 0.000000e+00
## 1 0.6364 0.5376812 0.000000e+00
## 1 0.6465 0.5376812 0.000000e+00
## 1 0.6566 0.5376812 0.000000e+00
## 1 0.6667 0.5376812 0.000000e+00
## 1 0.6768 0.5376812 0.000000e+00
## 1 0.6869 0.5376812 0.000000e+00
## 1 0.6970 0.5376812 0.000000e+00
## 1 0.7071 0.5376812 0.000000e+00
## 1 0.7172 0.5376812 0.000000e+00
## 1 0.7273 0.5376812 0.000000e+00
## 1 0.7374 0.5376812 0.000000e+00
## 1 0.7475 0.5376812 0.000000e+00
## 1 0.7576 0.5376812 0.000000e+00
## 1 0.7677 0.5376812 0.000000e+00
## 1 0.7778 0.5376812 0.000000e+00
## 1 0.7879 0.5376812 0.000000e+00
## 1 0.7980 0.5376812 0.000000e+00
## 1 0.8081 0.5376812 0.000000e+00
## 1 0.8182 0.5376812 0.000000e+00
## 1 0.8283 0.5376812 0.000000e+00
## 1 0.8384 0.5376812 0.000000e+00
## 1 0.8485 0.5376812 0.000000e+00
## 1 0.8586 0.5376812 0.000000e+00
## 1 0.8687 0.5376812 0.000000e+00
## 1 0.8788 0.5376812 0.000000e+00
## 1 0.8889 0.5376812 0.000000e+00
## 1 0.8990 0.5376812 0.000000e+00
## 1 0.9091 0.5376812 0.000000e+00
## 1 0.9192 0.5376812 0.000000e+00
## 1 0.9293 0.5376812 0.000000e+00
## 1 0.9394 0.5376812 0.000000e+00
## 1 0.9495 0.5376812 0.000000e+00
## 1 0.9596 0.5376812 0.000000e+00
## 1 0.9697 0.5376812 0.000000e+00
## 1 0.9798 0.5376812 0.000000e+00
## 1 0.9899 0.5376812 0.000000e+00
## 1 1.0000 0.5376812 0.000000e+00
##
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were alpha = 0 and lambda = 0.0304.
pred_glmnet <- predict(glmnetFit,heart_test)
glmConf <- confusionMatrix(reference = heart_test$Condition, data = pred_glmnet, mode='everything', positive='0')
glmConf$byClass
## Sensitivity Specificity Pos Pred Value
## 0.7741935 0.9259259 0.9230769
## Neg Pred Value Precision Recall
## 0.7812500 0.9230769 0.7741935
## F1 Prevalence Detection Rate
## 0.8421053 0.5344828 0.4137931
## Detection Prevalence Balanced Accuracy
## 0.4482759 0.8500597
#Accuracy: 86.2%
#Sensitivity: 87.1%
#Precision: 87.1%
model <- c("GBM","RF", "Logit", "GLMnet")
accuracy <- c(84.5,87.9,79.3,86.2)
sens <-c(83.9,87.1,77.4,87.1)
prec <-c(86.7,90.0,82.8,87.1)
data.frame(model,accuracy,sens,prec)
## model accuracy sens prec
## 1 GBM 84.5 83.9 86.7
## 2 RF 87.9 87.1 90.0
## 3 Logit 79.3 77.4 82.8
## 4 GLMnet 86.2 87.1 87.1
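For reference, the same comparison table could be assembled directly from the stored confusionMatrix objects instead of hard-coding the percentages (a sketch; output not shown):
# Sketch: pull accuracy, sensitivity, and precision from each confusion matrix
conf_list <- list(GBM = gbmConf, RF = rfConf, Logit = logitConf, GLMnet = glmConf)
data.frame(
  model    = names(conf_list),
  accuracy = sapply(conf_list, function(cm) round(100 * unname(cm$overall["Accuracy"]), 1)),
  sens     = sapply(conf_list, function(cm) round(100 * unname(cm$byClass["Sensitivity"]), 1)),
  prec     = sapply(conf_list, function(cm) round(100 * unname(cm$byClass["Precision"]), 1))
)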
Most accurate model: Random Forest
Most sensitive models: Random Forest and GLMnet
Most precise model: Random Forest
Most important variables according to RF
plot(varImp(rfFit, scale = FALSE))
As we can see, Chest Pain Type, Thalassemia, and Num. Major Blood Vessels are the three most important predictors.
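The numeric importance scores behind the plot can also be listed directly (a sketch; output not shown):
# Sketch: importance scores from the random forest fit, sorted in decreasing order
rf_imp <- varImp(rfFit, scale = FALSE)$importance
rf_imp[order(-rf_imp$Overall), , drop = FALSE]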
AUC of each predictor
roc <- filterVarImp(x = heart_train[, -ncol(heart_train)], y = heart_train$Condition)
roc
## X0 X1
## Age 0.6836648 0.6836648
## Sex 0.6350142 0.6350142
## Chest Pain Type 0.7473722 0.7473722
## Resting Blood Pressure 0.5828480 0.5828480
## Cholesterol 0.5838068 0.5838068
## Fasting Blood Sugar 0.5011364 0.5011364
## Resting ECG 0.6001065 0.6001065
## Max. HR Achieved 0.7440696 0.7440696
## Exercise Induced Angina 0.6757812 0.6757812
## ST Depression 0.7515625 0.7515625
## ST Slope 0.7104403 0.7104403
## Num. Major Blood Vessels 0.7451705 0.7451705
## Thalassemia 0.7582386 0.7582386