According to the World Health Organization (WHO), cardiovascular diseases (CVDs) are the leading cause of death globally. An estimated 17.9 million people died from CVDs in 2019, representing 32% of all global deaths. Of these deaths, 85% were due to heart attack and stroke.
Moreover, the Centers for Disease Control and Prevention (CDC) reports a high CVD mortality rate in the United States, where one person dies from CVD every 36 seconds.
Although most people know how serious CVDs, or heart disease in general, can be, how aware are they of ways to prevent it? As the saying goes, “Prevention is always better than cure”, so the emphasis should be on preventing CVDs and reducing the risk of developing them.
In the medical field, machine learning can be used for the diagnosis, detection and prediction of various diseases. In this study, we present several prediction models for detecting heart disease at an early stage. Early detection in turn helps provide effective treatment to patients and avoid severe consequences.
library(dplyr)
library(Rcpp)
library(lattice)
library(ggplot2)
library(proto)
library(RSQLite)
library(gsubfn)
library(caret)
library(sqldf)
library(Amelia)
library(BinMat)
library(tidyr)
library(tidyverse)
library(MASS)
library(Hmisc)
library(Formula)
library(klaR)
library(e1071)
library(survival)
library(mlbench)
library(readr)
library(skimr)
library(DataExplorer)
library(funModeling)
library(ROCR)
library(knitr)
library(kableExtra)
library(GGally)
library(rsample)
library(viridisLite)
library(yardstick)
library(parsnip)
library(recipes)
This section describes the source of the data and gives a high-level explanation of the features (taken from https://archive.ics.uci.edu/ml/datasets/Heart+Disease).
The data used for this project is an open dataset obtained from the UCI Machine Learning Repository. It consists of 303 rows and 14 columns. The last column is the target feature, which indicates the presence of heart disease.
#heart<-read.csv("processed_cleveland_ori_data.csv", header = T)
heart <- read.csv("D:/01a Prog DS (Thursday)/01 Project/cleveland.csv", header = FALSE)
# List out column names
names <- c("Age",
"Sex",
"Chest_Pain_Type",
"Resting_Blood_Pressure",
"Serum_Cholesterol",
"Fasting_Blood_Sugar",
"Resting_ECG",
"Max_Heart_Rate_Achieved",
"Exercise_Induced_Angina",
"ST_Depression_Exercise",
"Peak_Exercise_ST_Segment",
"Num_Major_Vessels_Flouro",
"Thalassemia",
"target")
# Apply column names to the dataframe
colnames(heart) <- names
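The `str()` and `skim()` output further down shows `Age` stored as character, with one value five characters long, and `?` placeholders in two columns. One possible explanation, stated here as an assumption that the source does not confirm, is a UTF-8 byte order mark at the start of the file. The sketch below shows a read call that would handle both cases. It uses a small hypothetical demo file, and `demo_path` and `heart_demo` are illustrative names only.

```r
# Build a tiny hypothetical CSV in the same 14-column layout, prefixed with a UTF-8 BOM.
demo_path <- tempfile(fileext = ".csv")
con <- file(demo_path, open = "wb")
writeBin(as.raw(c(0xEF, 0xBB, 0xBF)), con)  # byte order mark (assumed, for illustration)
writeBin(charToRaw(paste0(
  "63,1,1,145,233,1,2,150,0,2.3,3,0,6,0\n",
  "67,1,4,160,286,0,2,108,1,1.5,2,3,3,2\n",
  "38,1,3,138,175,0,0,173,0,0.0,1,?,3,0\n"
)), con)
close(con)

# "UTF-8-BOM" strips the mark on read; na.strings turns "?" into NA at import time.
heart_demo <- read.csv(demo_path, header = FALSE,
                       fileEncoding = "UTF-8-BOM", na.strings = "?")

# Column 1 (Age) and column 12 (vessel count) now come in as numeric.
str(heart_demo[, c(1, 12)])
```

If the real file has no byte order mark, the `fileEncoding` argument can be dropped and `na.strings = "?"` alone should still let the two affected columns be read as numbers.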
The details of all the features are listed below:
Age: Age of subject
Sex: Sex of subject: 0 = female, 1 = male
Chest-pain type: Type of chest pain experienced by the subject: 1 = typical angina, 2 = atypical angina, 3 = non-anginal pain, 4 = asymptomatic
Resting Blood Pressure: Resting blood pressure in mm Hg
Serum Cholesterol: Serum cholesterol in mg/dl
Fasting Blood Sugar: Fasting blood sugar level relative to 120 mg/dl: 0 = fasting blood sugar <= 120 mg/dl, 1 = fasting blood sugar > 120 mg/dl
Resting ECG: Resting electrocardiographic results: 0 = normal, 1 = ST-T wave abnormality, 2 = left ventricular hypertrophy
Max Heart Rate Achieved: Maximum heart rate achieved by the subject
Exercise Induced Angina: 0 = no, 1 = yes
ST Depression Induced by Exercise Relative to Rest: ST depression of subject
Peak Exercise ST Segment: Slope of the peak exercise ST segment: 1 = upsloping, 2 = flat, 3 = downsloping
Number of Major Vessels (0-3) Visible on Fluoroscopy: Number of major vessels visible under fluoroscopy
Thal: Form of thalassemia: 3 = normal, 6 = fixed defect, 7 = reversible defect
target: Indicates whether the subject has heart disease: 0 = absence, 1-4 = heart disease present
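Because several features are integer codes rather than quantities, it can help to attach the labels from the list above before plotting or tabulating. The following is a minimal sketch for one feature, assuming a short hypothetical vector of codes; `chest_pain_codes` and `chest_pain` are illustrative names and not part of the original analysis.

```r
# Hypothetical sample of chest-pain codes, using the coding from the feature list.
chest_pain_codes <- c(1, 4, 4, 3, 2)

# Map each code to its label; levels fixes the order, labels supplies the text.
chest_pain <- factor(
  chest_pain_codes,
  levels = 1:4,
  labels = c("typical angina", "atypical angina", "non-anginal pain", "asymptomatic")
)

# Frequency of each labelled category.
table(chest_pain)
```

The same pattern would apply to the other coded features, with `levels` and `labels` taken from the corresponding entry in the list.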
# HEAD / TAIL
# head() and tail() show the first and last 6 rows by default.
head(heart)
## Age Sex Chest_Pain_Type Resting_Blood_Pressure Serum_Cholesterol
## 1 63 1 1 145 233
## 2 67 1 4 160 286
## 3 67 1 4 120 229
## 4 37 1 3 130 250
## 5 41 0 2 130 204
## 6 56 1 2 120 236
## Fasting_Blood_Sugar Resting_ECG Max_Heart_Rate_Achieved
## 1 1 2 150
## 2 0 2 108
## 3 0 2 129
## 4 0 0 187
## 5 0 2 172
## 6 0 0 178
## Exercise_Induced_Angina ST_Depression_Exercise Peak_Exercise_ST_Segment
## 1 0 2.3 3
## 2 1 1.5 2
## 3 1 2.6 2
## 4 0 3.5 3
## 5 0 1.4 1
## 6 0 0.8 1
## Num_Major_Vessels_Flouro Thalassemia target
## 1 0 6 0
## 2 3 3 2
## 3 2 7 1
## 4 0 3 0
## 5 0 3 0
## 6 0 3 0
tail(heart)
## Age Sex Chest_Pain_Type Resting_Blood_Pressure Serum_Cholesterol
## 298 57 0 4 140 241
## 299 45 1 1 110 264
## 300 68 1 4 144 193
## 301 57 1 4 130 131
## 302 57 0 2 130 236
## 303 38 1 3 138 175
## Fasting_Blood_Sugar Resting_ECG Max_Heart_Rate_Achieved
## 298 0 0 123
## 299 0 0 132
## 300 1 0 141
## 301 0 0 115
## 302 0 2 174
## 303 0 0 173
## Exercise_Induced_Angina ST_Depression_Exercise Peak_Exercise_ST_Segment
## 298 1 0.2 2
## 299 0 1.2 2
## 300 0 3.4 2
## 301 1 1.2 2
## 302 0 0.0 2
## 303 0 0.0 1
## Num_Major_Vessels_Flouro Thalassemia target
## 298 0 7 1
## 299 0 7 1
## 300 2 7 2
## 301 1 7 3
## 302 1 3 1
## 303 ? 3 0
# Structure of the dataset
str(heart)
## 'data.frame': 303 obs. of 14 variables:
## $ Age : chr "63" "67" "67" "37" ...
## $ Sex : int 1 1 1 1 0 1 0 0 1 1 ...
## $ Chest_Pain_Type : int 1 4 4 3 2 2 4 4 4 4 ...
## $ Resting_Blood_Pressure : int 145 160 120 130 130 120 140 120 130 140 ...
## $ Serum_Cholesterol : int 233 286 229 250 204 236 268 354 254 203 ...
## $ Fasting_Blood_Sugar : int 1 0 0 0 0 0 0 0 0 1 ...
## $ Resting_ECG : int 2 2 2 0 2 0 2 0 2 2 ...
## $ Max_Heart_Rate_Achieved : int 150 108 129 187 172 178 160 163 147 155 ...
## $ Exercise_Induced_Angina : int 0 1 1 0 0 0 0 1 0 1 ...
## $ ST_Depression_Exercise : num 2.3 1.5 2.6 3.5 1.4 0.8 3.6 0.6 1.4 3.1 ...
## $ Peak_Exercise_ST_Segment: int 3 2 2 3 1 1 3 1 2 3 ...
## $ Num_Major_Vessels_Flouro: chr "0" "3" "2" "0" ...
## $ Thalassemia : chr "6" "3" "7" "3" ...
## $ target : int 0 2 1 0 0 0 3 0 2 1 ...
# Summary of the dataset
summary(heart)
## Age Sex Chest_Pain_Type Resting_Blood_Pressure
## Length:303 Min. :0.0000 Min. :1.000 Min. : 94.0
## Class :character 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:120.0
## Mode :character Median :1.0000 Median :3.000 Median :130.0
## Mean :0.6799 Mean :3.158 Mean :131.7
## 3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:140.0
## Max. :1.0000 Max. :4.000 Max. :200.0
## Serum_Cholesterol Fasting_Blood_Sugar Resting_ECG Max_Heart_Rate_Achieved
## Min. :126.0 Min. :0.0000 Min. :0.0000 Min. : 71.0
## 1st Qu.:211.0 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:133.5
## Median :241.0 Median :0.0000 Median :1.0000 Median :153.0
## Mean :246.7 Mean :0.1485 Mean :0.9901 Mean :149.6
## 3rd Qu.:275.0 3rd Qu.:0.0000 3rd Qu.:2.0000 3rd Qu.:166.0
## Max. :564.0 Max. :1.0000 Max. :2.0000 Max. :202.0
## Exercise_Induced_Angina ST_Depression_Exercise Peak_Exercise_ST_Segment
## Min. :0.0000 Min. :0.00 Min. :1.000
## 1st Qu.:0.0000 1st Qu.:0.00 1st Qu.:1.000
## Median :0.0000 Median :0.80 Median :2.000
## Mean :0.3267 Mean :1.04 Mean :1.601
## 3rd Qu.:1.0000 3rd Qu.:1.60 3rd Qu.:2.000
## Max. :1.0000 Max. :6.20 Max. :3.000
## Num_Major_Vessels_Flouro Thalassemia target
## Length:303 Length:303 Min. :0.0000
## Class :character Class :character 1st Qu.:0.0000
## Mode :character Mode :character Median :0.0000
## Mean :0.9373
## 3rd Qu.:2.0000
## Max. :4.0000
# DIMENSION
# Displays the dimensions of the table. The output takes the form of row, column.
dim(heart)
## [1] 303 14
# GLIMPSE
# Displays a vertical preview of the dataset: each column is shown as a row,
# with its data type and the first few values.
glimpse(heart)
## Rows: 303
## Columns: 14
## $ Age <chr> "63", "67", "67", "37", "41", "56", "62", ~
## $ Sex <int> 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, ~
## $ Chest_Pain_Type <int> 1, 4, 4, 3, 2, 2, 4, 4, 4, 4, 4, 2, 3, 2, 3, ~
## $ Resting_Blood_Pressure <int> 145, 160, 120, 130, 130, 120, 140, 120, 130, ~
## $ Serum_Cholesterol <int> 233, 286, 229, 250, 204, 236, 268, 354, 254, ~
## $ Fasting_Blood_Sugar <int> 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, ~
## $ Resting_ECG <int> 2, 2, 2, 0, 2, 0, 2, 0, 2, 2, 0, 2, 2, 0, 0, ~
## $ Max_Heart_Rate_Achieved <int> 150, 108, 129, 187, 172, 178, 160, 163, 147, ~
## $ Exercise_Induced_Angina <int> 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, ~
## $ ST_Depression_Exercise <dbl> 2.3, 1.5, 2.6, 3.5, 1.4, 0.8, 3.6, 0.6, 1.4, ~
## $ Peak_Exercise_ST_Segment <int> 3, 2, 2, 3, 1, 1, 3, 1, 2, 3, 2, 2, 2, 1, 1, ~
## $ Num_Major_Vessels_Flouro <chr> "0", "3", "2", "0", "0", "0", "2", "0", "1", ~
## $ Thalassemia <chr> "6", "3", "7", "3", "3", "3", "3", "3", "7", ~
## $ target <int> 0, 2, 1, 0, 0, 0, 3, 0, 2, 1, 0, 0, 2, 0, 0, ~
# Skim
# This function is a good addition to the summary function.
# It displays most of the numerical attributes from summary, but it also
# displays missing values, more quantile information and an inline histogram for each variable
skim(heart)
| Name | heart |
| Number of rows | 303 |
| Number of columns | 14 |
| _______________________ | |
| Column type frequency: | |
| character | 3 |
| numeric | 11 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| Age | 0 | 1 | 2 | 5 | 0 | 42 | 0 |
| Num_Major_Vessels_Flouro | 0 | 1 | 1 | 1 | 0 | 5 | 0 |
| Thalassemia | 0 | 1 | 1 | 1 | 0 | 4 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Sex | 0 | 1 | 0.68 | 0.47 | 0 | 0.0 | 1.0 | 1.0 | 1.0 | ▃▁▁▁▇ |
| Chest_Pain_Type | 0 | 1 | 3.16 | 0.96 | 1 | 3.0 | 3.0 | 4.0 | 4.0 | ▁▃▁▅▇ |
| Resting_Blood_Pressure | 0 | 1 | 131.69 | 17.60 | 94 | 120.0 | 130.0 | 140.0 | 200.0 | ▃▇▅▁▁ |
| Serum_Cholesterol | 0 | 1 | 246.69 | 51.78 | 126 | 211.0 | 241.0 | 275.0 | 564.0 | ▃▇▂▁▁ |
| Fasting_Blood_Sugar | 0 | 1 | 0.15 | 0.36 | 0 | 0.0 | 0.0 | 0.0 | 1.0 | ▇▁▁▁▂ |
| Resting_ECG | 0 | 1 | 0.99 | 0.99 | 0 | 0.0 | 1.0 | 2.0 | 2.0 | ▇▁▁▁▇ |
| Max_Heart_Rate_Achieved | 0 | 1 | 149.61 | 22.88 | 71 | 133.5 | 153.0 | 166.0 | 202.0 | ▁▂▅▇▂ |
| Exercise_Induced_Angina | 0 | 1 | 0.33 | 0.47 | 0 | 0.0 | 0.0 | 1.0 | 1.0 | ▇▁▁▁▃ |
| ST_Depression_Exercise | 0 | 1 | 1.04 | 1.16 | 0 | 0.0 | 0.8 | 1.6 | 6.2 | ▇▂▁▁▁ |
| Peak_Exercise_ST_Segment | 0 | 1 | 1.60 | 0.62 | 1 | 1.0 | 2.0 | 2.0 | 3.0 | ▇▁▇▁▁ |
| target | 0 | 1 | 0.94 | 1.23 | 0 | 0.0 | 0.0 | 2.0 | 4.0 | ▇▃▂▂▁ |
# Analyzing categorical variables
# freq function runs for all factor or character variables automatically:
freq(heart)
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
## Age frequency percentage cumulative_perc
## 1 58 19 6.27 6.27
## 2 57 17 5.61 11.88
## 3 54 16 5.28 17.16
## 4 59 14 4.62 21.78
## 5 52 13 4.29 26.07
## 6 51 12 3.96 30.03
## 7 60 12 3.96 33.99
## 8 44 11 3.63 37.62
## 9 56 11 3.63 41.25
## 10 62 11 3.63 44.88
## 11 41 10 3.30 48.18
## 12 64 10 3.30 51.48
## 13 67 9 2.97 54.45
## 14 42 8 2.64 57.09
## 15 43 8 2.64 59.73
## 16 45 8 2.64 62.37
## 17 53 8 2.64 65.01
## 18 55 8 2.64 67.65
## 19 61 8 2.64 70.29
## 20 63 8 2.64 72.93
## 21 65 8 2.64 75.57
## 22 46 7 2.31 77.88
## 23 48 7 2.31 80.19
## 24 50 7 2.31 82.50
## 25 66 7 2.31 84.81
## 26 47 5 1.65 86.46
## 27 49 5 1.65 88.11
## 28 35 4 1.32 89.43
## 29 39 4 1.32 90.75
## 30 68 4 1.32 92.07
## 31 70 4 1.32 93.39
## 32 40 3 0.99 94.38
## 33 69 3 0.99 95.37
## 34 71 3 0.99 96.36
## 35 34 2 0.66 97.02
## 36 37 2 0.66 97.68
## 37 38 2 0.66 98.34
## 38 29 1 0.33 98.67
## 39 74 1 0.33 99.00
## 40 76 1 0.33 99.33
## 41 77 1 0.33 99.66
## 42 63 1 0.33 100.00
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
## Num_Major_Vessels_Flouro frequency percentage cumulative_perc
## 1 0 176 58.09 58.09
## 2 1 65 21.45 79.54
## 3 2 38 12.54 92.08
## 4 3 20 6.60 98.68
## 5 ? 4 1.32 100.00
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
## Thalassemia frequency percentage cumulative_perc
## 1 3 166 54.79 54.79
## 2 7 117 38.61 93.40
## 3 6 18 5.94 99.34
## 4 ? 2 0.66 100.00
## [1] "Variables processed: Age, Num_Major_Vessels_Flouro, Thalassemia"
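The frequency tables above show `?` appearing as a value in two columns, which functions such as `is.na()` do not count as missing. A minimal sketch of how such placeholders could be counted per column follows, using a hypothetical mini data frame; `demo` and `placeholder_counts` are illustrative names.

```r
# Hypothetical mini data frame mimicking the character columns that contain "?".
demo <- data.frame(
  Num_Major_Vessels_Flouro = c("0", "3", "?", "1"),
  Thalassemia = c("6", "?", "7", "3"),
  target = c(0, 2, 1, 0)
)

# Count "?" entries per column; columns without the placeholder return 0.
placeholder_counts <- sapply(demo, function(col) sum(col == "?", na.rm = TRUE))
placeholder_counts
```

A non-zero count flags a column whose placeholders need converting to `NA` before any numeric conversion or imputation.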
# Analyzing numerical variables
# Quantitatively
# profiling_num runs for all numerical/integer variables automatically:
profiling_num(heart)
## variable mean std_dev variation_coef p_01 p_05
## 1 Sex 0.6798680 0.4672988 0.6873376 0.00 0.0
## 2 Chest_Pain_Type 3.1584158 0.9601256 0.3039896 1.00 1.0
## 3 Resting_Blood_Pressure 131.6897690 17.5997477 0.1336455 100.00 108.0
## 4 Serum_Cholesterol 246.6930693 51.7769175 0.2098840 149.00 175.1
## 5 Fasting_Blood_Sugar 0.1485149 0.3561979 2.3983990 0.00 0.0
## 6 Resting_ECG 0.9900990 0.9949713 1.0049210 0.00 0.0
## 7 Max_Heart_Rate_Achieved 149.6072607 22.8750033 0.1529004 95.02 108.1
## 8 Exercise_Induced_Angina 0.3267327 0.4697945 1.4378558 0.00 0.0
## 9 ST_Depression_Exercise 1.0396040 1.1610750 1.1168436 0.00 0.0
## 10 Peak_Exercise_ST_Segment 1.6006601 0.6162261 0.3849825 1.00 1.0
## 11 target 0.9372937 1.2285357 1.3107265 0.00 0.0
## p_25 p_50 p_75 p_95 p_99 skewness kurtosis iqr range_98
## 1 0.0 1.0 1.0 1.0 1.00 -0.77109346 1.594585 1.0 [0, 1]
## 2 3.0 3.0 4.0 4.0 4.00 -0.83758103 2.586189 1.0 [1, 4]
## 3 120.0 130.0 140.0 160.0 180.00 0.70253461 3.845881 20.0 [100, 180]
## 4 211.0 241.0 275.0 326.9 406.74 1.12987410 7.398208 64.0 [149, 406.74]
## 5 0.0 0.0 0.0 1.0 1.00 1.97680346 4.907752 0.0 [0, 1]
## 6 0.0 1.0 2.0 2.0 2.00 0.01980163 1.013773 2.0 [0, 2]
## 7 133.5 153.0 166.0 181.9 191.96 -0.53478437 2.927602 32.5 [95.02, 191.96]
## 8 0.0 0.0 1.0 1.0 1.00 0.73885058 1.545900 1.0 [0, 1]
## 9 0.0 0.8 1.6 3.4 4.20 1.26342552 4.530193 1.6 [0, 4.2]
## 10 1.0 2.0 2.0 3.0 3.00 0.50579573 2.363050 1.0 [1, 3]
## 11 0.0 0.0 2.0 3.0 4.00 1.05324831 2.843788 2.0 [0, 4]
## range_80
## 1 [0, 1]
## 2 [2, 4]
## 3 [110, 152]
## 4 [188.8, 308.8]
## 5 [0, 1]
## 6 [0, 2]
## 7 [116, 176.6]
## 8 [0, 1]
## 9 [0, 2.8]
## 10 [1, 2]
## 11 [0, 3]
# Graphically
# plot_num, like profiling_num, runs automatically for all numerical/integer variables:
plot_num(heart)
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
# Describe from Hmisc Package
# Analyzing numerical and categorical at the same time
describe(heart)
## heart
##
## 14 Variables 303 Observations
## --------------------------------------------------------------------------------
## Age
## n missing distinct
## 303 0 42
##
## lowest : 29 34 35 37 38 , highest: 71 74 76 77 63
## --------------------------------------------------------------------------------
## Sex
## n missing distinct Info Sum Mean Gmd
## 303 0 2 0.653 206 0.6799 0.4367
##
## --------------------------------------------------------------------------------
## Chest_Pain_Type
## n missing distinct Info Mean Gmd
## 303 0 4 0.865 3.158 1.008
##
## Value 1 2 3 4
## Frequency 23 50 86 144
## Proportion 0.076 0.165 0.284 0.475
## --------------------------------------------------------------------------------
## Resting_Blood_Pressure
## n missing distinct Info Mean Gmd .05 .10
## 303 0 50 0.995 131.7 19.41 108 110
## .25 .50 .75 .90 .95
## 120 130 140 152 160
##
## lowest : 94 100 101 102 104, highest: 174 178 180 192 200
## --------------------------------------------------------------------------------
## Serum_Cholesterol
## n missing distinct Info Mean Gmd .05 .10
## 303 0 152 1 246.7 55.91 175.1 188.8
## .25 .50 .75 .90 .95
## 211.0 241.0 275.0 308.8 326.9
##
## lowest : 126 131 141 149 157, highest: 394 407 409 417 564
## --------------------------------------------------------------------------------
## Fasting_Blood_Sugar
## n missing distinct Info Sum Mean Gmd
## 303 0 2 0.379 45 0.1485 0.2538
##
## --------------------------------------------------------------------------------
## Resting_ECG
## n missing distinct Info Mean Gmd
## 303 0 3 0.76 0.9901 1.003
##
## Value 0 1 2
## Frequency 151 4 148
## Proportion 0.498 0.013 0.488
## --------------------------------------------------------------------------------
## Max_Heart_Rate_Achieved
## n missing distinct Info Mean Gmd .05 .10
## 303 0 91 1 149.6 25.73 108.1 116.0
## .25 .50 .75 .90 .95
## 133.5 153.0 166.0 176.6 181.9
##
## lowest : 71 88 90 95 96, highest: 190 192 194 195 202
## --------------------------------------------------------------------------------
## Exercise_Induced_Angina
## n missing distinct Info Sum Mean Gmd
## 303 0 2 0.66 99 0.3267 0.4414
##
## --------------------------------------------------------------------------------
## ST_Depression_Exercise
## n missing distinct Info Mean Gmd .05 .10
## 303 0 40 0.964 1.04 1.225 0.0 0.0
## .25 .50 .75 .90 .95
## 0.0 0.8 1.6 2.8 3.4
##
## lowest : 0.0 0.1 0.2 0.3 0.4, highest: 4.0 4.2 4.4 5.6 6.2
## --------------------------------------------------------------------------------
## Peak_Exercise_ST_Segment
## n missing distinct Info Mean Gmd
## 303 0 3 0.798 1.601 0.6291
##
## Value 1 2 3
## Frequency 142 140 21
## Proportion 0.469 0.462 0.069
## --------------------------------------------------------------------------------
## Num_Major_Vessels_Flouro
## n missing distinct
## 303 0 5
##
## lowest : ? 0 1 2 3, highest: ? 0 1 2 3
##
## Value ? 0 1 2 3
## Frequency 4 176 65 38 20
## Proportion 0.013 0.581 0.215 0.125 0.066
## --------------------------------------------------------------------------------
## Thalassemia
## n missing distinct
## 303 0 4
##
## Value ? 3 6 7
## Frequency 2 166 18 117
## Proportion 0.007 0.548 0.059 0.386
## --------------------------------------------------------------------------------
## target
## n missing distinct Info Mean Gmd
## 303 0 5 0.832 0.9373 1.25
##
## lowest : 0 1 2 3 4, highest: 0 1 2 3 4
##
## Value 0 1 2 3 4
## Frequency 164 55 36 35 13
## Proportion 0.541 0.182 0.119 0.116 0.043
## --------------------------------------------------------------------------------
#Determine the number of values in each level of dependent variable
heart %>%
drop_na() %>%
group_by(target) %>%
count() %>%
ungroup() %>%
kable(align = rep("c", 2)) %>% kable_styling("full_width" = F)
| target | n |
|---|---|
| 0 | 164 |
| 1 | 55 |
| 2 | 36 |
| 3 | 35 |
| 4 | 13 |
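The counts in this table also show how the classes will balance once levels 1 to 4 are merged into a single "disease present" class. The arithmetic can be sketched as follows, with the counts copied from the table; `target_counts` and `binary_counts` are illustrative names.

```r
# Counts per target level, copied from the table above.
target_counts <- c(`0` = 164, `1` = 55, `2` = 36, `3` = 35, `4` = 13)

# Share of each original level.
round(prop.table(target_counts), 3)

# Split once levels 1-4 are merged into one "present" class.
binary_counts <- c(absent = target_counts[["0"]], present = sum(target_counts[-1]))
binary_counts
```

With 164 subjects without heart disease and 139 with it, the binary target is reasonably balanced.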
#Identify the different levels of Thalassemia
heart %>%
drop_na() %>%
group_by(Thalassemia) %>%
count() %>%
ungroup() %>%
kable(align = rep("c", 2)) %>% kable_styling("full_width" = F)
| Thalassemia | n |
|---|---|
| ? | 2 |
| 3 | 166 |
| 6 | 18 |
| 7 | 117 |
# The section below checks for missing values and performs missing value imputation (using the median)
heart$Num_Major_Vessels_Flouro[which(heart$Num_Major_Vessels_Flouro== "?")] <- NA
heart$Thalassemia[which(heart$Thalassemia== "?")] <- NA
colSums(is.na(heart))
## Age Sex Chest_Pain_Type
## 0 0 0
## Resting_Blood_Pressure Serum_Cholesterol Fasting_Blood_Sugar
## 0 0 0
## Resting_ECG Max_Heart_Rate_Achieved Exercise_Induced_Angina
## 0 0 0
## ST_Depression_Exercise Peak_Exercise_ST_Segment Num_Major_Vessels_Flouro
## 0 0 4
## Thalassemia target
## 2 0
# Change the data type
heart$Num_Major_Vessels_Flouro <- as.numeric(heart$Num_Major_Vessels_Flouro)
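# Obtain the median value
median.result_heart <- median(heart$Num_Major_Vessels_Flouro, na.rm = TRUE)
median.result_heart
## [1] 0
# Missing Value Imputation with Median
# Replace NA values with the computed median (0) rather than a hard-coded number
heart$Num_Major_Vessels_Flouro[is.na(heart$Num_Major_Vessels_Flouro)] <- median.result_heart
# Thalassemia is still a character column, so impute with the code "3" (its median and most frequent value)
heart$Thalassemia[is.na(heart$Thalassemia)] <- "3"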
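Median imputation is less error-prone when the fill value is taken directly from the data instead of being typed by hand. A minimal sketch of a reusable helper follows, assuming a numeric vector; `impute_median`, `vessels` and `vessels_imputed` are hypothetical names introduced only for illustration.

```r
# Hypothetical helper: replace NA in a numeric vector with that vector's median.
impute_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}

# Hypothetical sample with two missing vessel counts.
vessels <- c(0, 3, 2, 0, NA, 0, NA)

# The non-missing values are 0, 0, 0, 2, 3, so both NAs are filled with the median 0.
vessels_imputed <- impute_median(vessels)
vessels_imputed
```

Because the helper recomputes the median on each call, it stays correct if rows are added or removed upstream.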
# Recode the 'target' Feature Into a Binary Class
# Any value above 0 in column ‘target’ indicates the presence of heart disease,
# we can combine all levels > 0 together so the classification predictions are
# binary – Yes or No (1 or 0).
heart$target <- ifelse(heart$target== 0, yes = 0, no=1)
# Check the latest class for 'target'
str(heart)
## 'data.frame': 303 obs. of 14 variables:
## $ Age : chr "63" "67" "67" "37" ...
## $ Sex : int 1 1 1 1 0 1 0 0 1 1 ...
## $ Chest_Pain_Type : int 1 4 4 3 2 2 4 4 4 4 ...
## $ Resting_Blood_Pressure : int 145 160 120 130 130 120 140 120 130 140 ...
## $ Serum_Cholesterol : int 233 286 229 250 204 236 268 354 254 203 ...
## $ Fasting_Blood_Sugar : int 1 0 0 0 0 0 0 0 0 1 ...
## $ Resting_ECG : int 2 2 2 0 2 0 2 0 2 2 ...
## $ Max_Heart_Rate_Achieved : int 150 108 129 187 172 178 160 163 147 155 ...
## $ Exercise_Induced_Angina : int 0 1 1 0 0 0 0 1 0 1 ...
## $ ST_Depression_Exercise : num 2.3 1.5 2.6 3.5 1.4 0.8 3.6 0.6 1.4 3.1 ...
## $ Peak_Exercise_ST_Segment: int 3 2 2 3 1 1 3 1 2 3 ...
## $ Num_Major_Vessels_Flouro: num 0 3 2 0 0 0 2 0 1 0 ...
## $ Thalassemia : chr "6" "3" "7" "3" ...
## $ target : num 0 1 1 0 0 0 1 0 1 1 ...
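A quick cross-tabulation is one way to confirm that the recoding mapped 0 to 0 and every level from 1 to 4 to 1. The sketch below uses the first ten `target` values shown in the earlier `str()` output; `target_raw` and `target_binary` are illustrative names.

```r
# First ten original target codes, as shown in the earlier str() output.
target_raw <- c(0, 2, 1, 0, 0, 0, 3, 0, 2, 1)

# Same mapping as ifelse(target == 0, 0, 1), written as a comparison.
target_binary <- as.integer(target_raw > 0)

# Each original level should fall in exactly one recoded column.
table(original = target_raw, recoded = target_binary)
```

Any original level that appeared under both recoded columns would indicate a mistake in the recoding.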
# Copy the clean data into a new DF (for model creation purpose)
heart1 <- heart
# Print the full cleaned data frame
heart
## Age Sex Chest_Pain_Type Resting_Blood_Pressure Serum_Cholesterol
## 1 63 1 1 145 233
## 2 67 1 4 160 286
## 3 67 1 4 120 229
## 4 37 1 3 130 250
## 5 41 0 2 130 204
## 6 56 1 2 120 236
## 7 62 0 4 140 268
## 8 57 0 4 120 354
## 9 63 1 4 130 254
## 10 53 1 4 140 203
## 11 57 1 4 140 192
## 12 56 0 2 140 294
## 13 56 1 3 130 256
## 14 44 1 2 120 263
## 15 52 1 3 172 199
## 16 57 1 3 150 168
## 17 48 1 2 110 229
## 18 54 1 4 140 239
## 19 48 0 3 130 275
## 20 49 1 2 130 266
## 21 64 1 1 110 211
## 22 58 0 1 150 283
## 23 58 1 2 120 284
## 24 58 1 3 132 224
## 25 60 1 4 130 206
## 26 50 0 3 120 219
## 27 58 0 3 120 340
## 28 66 0 1 150 226
## 29 43 1 4 150 247
## 30 40 1 4 110 167
## 31 69 0 1 140 239
## 32 60 1 4 117 230
## 33 64 1 3 140 335
## 34 59 1 4 135 234
## 35 44 1 3 130 233
## 36 42 1 4 140 226
## 37 43 1 4 120 177
## 38 57 1 4 150 276
## 39 55 1 4 132 353
## 40 61 1 3 150 243
## 41 65 0 4 150 225
## 42 40 1 1 140 199
## 43 71 0 2 160 302
## 44 59 1 3 150 212
## 45 61 0 4 130 330
## 46 58 1 3 112 230
## 47 51 1 3 110 175
## 48 50 1 4 150 243
## 49 65 0 3 140 417
## 50 53 1 3 130 197
## 51 41 0 2 105 198
## 52 65 1 4 120 177
## 53 44 1 4 112 290
## 54 44 1 2 130 219
## 55 60 1 4 130 253
## 56 54 1 4 124 266
## 57 50 1 3 140 233
## 58 41 1 4 110 172
## 59 54 1 3 125 273
## 60 51 1 1 125 213
## 61 51 0 4 130 305
## 62 46 0 3 142 177
## 63 58 1 4 128 216
## 64 54 0 3 135 304
## 65 54 1 4 120 188
## 66 60 1 4 145 282
## 67 60 1 3 140 185
## 68 54 1 3 150 232
## 69 59 1 4 170 326
## 70 46 1 3 150 231
## 71 65 0 3 155 269
## 72 67 1 4 125 254
## 73 62 1 4 120 267
## 74 65 1 4 110 248
## 75 44 1 4 110 197
## 76 65 0 3 160 360
## 77 60 1 4 125 258
## 78 51 0 3 140 308
## 79 48 1 2 130 245
## 80 58 1 4 150 270
## 81 45 1 4 104 208
## 82 53 0 4 130 264
## 83 39 1 3 140 321
## 84 68 1 3 180 274
## 85 52 1 2 120 325
## 86 44 1 3 140 235
## 87 47 1 3 138 257
## 88 53 0 3 128 216
## 89 53 0 4 138 234
## 90 51 0 3 130 256
## 91 66 1 4 120 302
## 92 62 0 4 160 164
## 93 62 1 3 130 231
## 94 44 0 3 108 141
## 95 63 0 3 135 252
## 96 52 1 4 128 255
## 97 59 1 4 110 239
## 98 60 0 4 150 258
## 99 52 1 2 134 201
## 100 48 1 4 122 222
## 101 45 1 4 115 260
## 102 34 1 1 118 182
## 103 57 0 4 128 303
## 104 71 0 3 110 265
## 105 49 1 3 120 188
## 106 54 1 2 108 309
## 107 59 1 4 140 177
## 108 57 1 3 128 229
## 109 61 1 4 120 260
## 110 39 1 4 118 219
## 111 61 0 4 145 307
## 112 56 1 4 125 249
## 113 52 1 1 118 186
## 114 43 0 4 132 341
## 115 62 0 3 130 263
## 116 41 1 2 135 203
## 117 58 1 3 140 211
## 118 35 0 4 138 183
## 119 63 1 4 130 330
## 120 65 1 4 135 254
## 121 48 1 4 130 256
## 122 63 0 4 150 407
## 123 51 1 3 100 222
## 124 55 1 4 140 217
## 125 65 1 1 138 282
## 126 45 0 2 130 234
## 127 56 0 4 200 288
## 128 54 1 4 110 239
## 129 44 1 2 120 220
## 130 62 0 4 124 209
## 131 54 1 3 120 258
## 132 51 1 3 94 227
## 133 29 1 2 130 204
## 134 51 1 4 140 261
## 135 43 0 3 122 213
## 136 55 0 2 135 250
## 137 70 1 4 145 174
## 138 62 1 2 120 281
## 139 35 1 4 120 198
## 140 51 1 3 125 245
## 141 59 1 2 140 221
## 142 59 1 1 170 288
## 143 52 1 2 128 205
## 144 64 1 3 125 309
## 145 58 1 3 105 240
## 146 47 1 3 108 243
## 147 57 1 4 165 289
## 148 41 1 3 112 250
## 149 45 1 2 128 308
## 150 60 0 3 102 318
## 151 52 1 1 152 298
## 152 42 0 4 102 265
## 153 67 0 3 115 564
## 154 55 1 4 160 289
## 155 64 1 4 120 246
## 156 70 1 4 130 322
## 157 51 1 4 140 299
## 158 58 1 4 125 300
## 159 60 1 4 140 293
## 160 68 1 3 118 277
## 161 46 1 2 101 197
## 162 77 1 4 125 304
## 163 54 0 3 110 214
## 164 58 0 4 100 248
## 165 48 1 3 124 255
## 166 57 1 4 132 207
## 167 52 1 3 138 223
## 168 54 0 2 132 288
## 169 35 1 4 126 282
## 170 45 0 2 112 160
## 171 70 1 3 160 269
## 172 53 1 4 142 226
## 173 59 0 4 174 249
## 174 62 0 4 140 394
## 175 64 1 4 145 212
## 176 57 1 4 152 274
## 177 52 1 4 108 233
## 178 56 1 4 132 184
## 179 43 1 3 130 315
## 180 53 1 3 130 246
## 181 48 1 4 124 274
## 182 56 0 4 134 409
## 183 42 1 1 148 244
## 184 59 1 1 178 270
## 185 60 0 4 158 305
## 186 63 0 2 140 195
## 187 42 1 3 120 240
## 188 66 1 2 160 246
## 189 54 1 2 192 283
## 190 69 1 3 140 254
## 191 50 1 3 129 196
## 192 51 1 4 140 298
## 193 43 1 4 132 247
## 194 62 0 4 138 294
## 195 68 0 3 120 211
## 196 67 1 4 100 299
## 197 69 1 1 160 234
## 198 45 0 4 138 236
## 199 50 0 2 120 244
## 200 59 1 1 160 273
## 201 50 0 4 110 254
## 202 64 0 4 180 325
## 203 57 1 3 150 126
## 204 64 0 3 140 313
## 205 43 1 4 110 211
## 206 45 1 4 142 309
## 207 58 1 4 128 259
## 208 50 1 4 144 200
## 209 55 1 2 130 262
## 210 62 0 4 150 244
## 211 37 0 3 120 215
## 212 38 1 1 120 231
## 213 41 1 3 130 214
## 214 66 0 4 178 228
## 215 52 1 4 112 230
## 216 56 1 1 120 193
## 217 46 0 2 105 204
## 218 46 0 4 138 243
## 219 64 0 4 130 303
## 220 59 1 4 138 271
## 221 41 0 3 112 268
## 222 54 0 3 108 267
## 223 39 0 3 94 199
## 224 53 1 4 123 282
## 225 63 0 4 108 269
## 226 34 0 2 118 210
## 227 47 1 4 112 204
## 228 67 0 3 152 277
## 229 54 1 4 110 206
## 230 66 1 4 112 212
## 231 52 0 3 136 196
## 232 55 0 4 180 327
## 233 49 1 3 118 149
## 234 74 0 2 120 269
## 235 54 0 3 160 201
## 236 54 1 4 122 286
## 237 56 1 4 130 283
## 238 46 1 4 120 249
## 239 49 0 2 134 271
## 240 42 1 2 120 295
## 241 41 1 2 110 235
## 242 41 0 2 126 306
## 243 49 0 4 130 269
## 244 61 1 1 134 234
## 245 60 0 3 120 178
## 246 67 1 4 120 237
## 247 58 1 4 100 234
## 248 47 1 4 110 275
## 249 52 1 4 125 212
## 250 62 1 2 128 208
## 251 57 1 4 110 201
## 252 58 1 4 146 218
## 253 64 1 4 128 263
## 254 51 0 3 120 295
## 255 43 1 4 115 303
## 256 42 0 3 120 209
## 257 67 0 4 106 223
## 258 76 0 3 140 197
## 259 70 1 2 156 245
## 260 57 1 2 124 261
## 261 44 0 3 118 242
## 262 58 0 2 136 319
## 263 60 0 1 150 240
## 264 44 1 3 120 226
## 265 61 1 4 138 166
## 266 42 1 4 136 315
## 267 52 1 4 128 204
## 268 59 1 3 126 218
## 269 40 1 4 152 223
## 270 42 1 3 130 180
## 271 61 1 4 140 207
## 272 66 1 4 160 228
## 273 46 1 4 140 311
## 274 71 0 4 112 149
## 275 59 1 1 134 204
## 276 64 1 1 170 227
## 277 66 0 3 146 278
## 278 39 0 3 138 220
## 279 57 1 2 154 232
## 280 58 0 4 130 197
## 281 57 1 4 110 335
## 282 47 1 3 130 253
## 283 55 0 4 128 205
## 284 35 1 2 122 192
## 285 61 1 4 148 203
## 286 58 1 4 114 318
## 287 58 0 4 170 225
## 288 58 1 2 125 220
## 289 56 1 2 130 221
## 290 56 1 2 120 240
## 291 67 1 3 152 212
## 292 55 0 2 132 342
## 293 44 1 4 120 169
## 294 63 1 4 140 187
## 295 63 0 4 124 197
## 296 41 1 2 120 157
## 297 59 1 4 164 176
## 298 57 0 4 140 241
## 299 45 1 1 110 264
## 300 68 1 4 144 193
## 301 57 1 4 130 131
## 302 57 0 2 130 236
## 303 38 1 3 138 175
## Fasting_Blood_Sugar Resting_ECG Max_Heart_Rate_Achieved
## 1 1 2 150
## 2 0 2 108
## 3 0 2 129
## 4 0 0 187
## 5 0 2 172
## 6 0 0 178
## 7 0 2 160
## 8 0 0 163
## 9 0 2 147
## 10 1 2 155
## 11 0 0 148
## 12 0 2 153
## 13 1 2 142
## 14 0 0 173
## 15 1 0 162
## 16 0 0 174
## 17 0 0 168
## 18 0 0 160
## 19 0 0 139
## 20 0 0 171
## 21 0 2 144
## 22 1 2 162
## 23 0 2 160
## 24 0 2 173
## 25 0 2 132
## 26 0 0 158
## 27 0 0 172
## 28 0 0 114
## 29 0 0 171
## 30 0 2 114
## 31 0 0 151
## 32 1 0 160
## 33 0 0 158
## 34 0 0 161
## 35 0 0 179
## 36 0 0 178
## 37 0 2 120
## 38 0 2 112
## 39 0 0 132
## 40 1 0 137
## 41 0 2 114
## 42 0 0 178
## 43 0 0 162
## 44 1 0 157
## 45 0 2 169
## 46 0 2 165
## 47 0 0 123
## 48 0 2 128
## 49 1 2 157
## 50 1 2 152
## 51 0 0 168
## 52 0 0 140
## 53 0 2 153
## 54 0 2 188
## 55 0 0 144
## 56 0 2 109
## 57 0 0 163
## 58 0 2 158
## 59 0 2 152
## 60 0 2 125
## 61 0 0 142
## 62 0 2 160
## 63 0 2 131
## 64 1 0 170
## 65 0 0 113
## 66 0 2 142
## 67 0 2 155
## 68 0 2 165
## 69 0 2 140
## 70 0 0 147
## 71 0 0 148
## 72 1 0 163
## 73 0 0 99
## 74 0 2 158
## 75 0 2 177
## 76 0 2 151
## 77 0 2 141
## 78 0 2 142
## 79 0 2 180
## 80 0 2 111
## 81 0 2 148
## 82 0 2 143
## 83 0 2 182
## 84 1 2 150
## 85 0 0 172
## 86 0 2 180
## 87 0 2 156
## 88 0 2 115
## 89 0 2 160
## 90 0 2 149
## 91 0 2 151
## 92 0 2 145
## 93 0 0 146
## 94 0 0 175
## 95 0 2 172
## 96 0 0 161
## 97 0 2 142
## 98 0 2 157
## 99 0 0 158
## 100 0 2 186
## 101 0 2 185
## 102 0 2 174
## 103 0 2 159
## 104 1 2 130
## 105 0 0 139
## 106 0 0 156
## 107 0 0 162
## 108 0 2 150
## 109 0 0 140
## 110 0 0 140
## 111 0 2 146
## 112 1 2 144
## 113 0 2 190
## 114 1 2 136
## 115 0 0 97
## 116 0 0 132
## 117 1 2 165
## 118 0 0 182
## 119 1 2 132
## 120 0 2 127
## 121 1 2 150
## 122 0 2 154
## 123 0 0 143
## 124 0 0 111
## 125 1 2 174
## 126 0 2 175
## 127 1 2 133
## 128 0 0 126
## 129 0 0 170
## 130 0 0 163
## 131 0 2 147
## 132 0 0 154
## 133 0 2 202
## 134 0 2 186
## 135 0 0 165
## 136 0 2 161
## 137 0 0 125
## 138 0 2 103
## 139 0 0 130
## 140 1 2 166
## 141 0 0 164
## 142 0 2 159
## 143 1 0 184
## 144 0 0 131
## 145 0 2 154
## 146 0 0 152
## 147 1 2 124
## 148 0 0 179
## 149 0 2 170
## 150 0 0 160
## 151 1 0 178
## 152 0 2 122
## 153 0 2 160
## 154 0 2 145
## 155 0 2 96
## 156 0 2 109
## 157 0 0 173
## 158 0 2 171
## 159 0 2 170
## 160 0 0 151
## 161 1 0 156
## 162 0 2 162
## 163 0 0 158
## 164 0 2 122
## 165 1 0 175
## 166 0 0 168
## 167 0 0 169
## 168 1 2 159
## 169 0 2 156
## 170 0 0 138
## 171 0 0 112
## 172 0 2 111
## 173 0 0 143
## 174 0 2 157
## 175 0 2 132
## 176 0 0 88
## 177 1 0 147
## 178 0 2 105
## 179 0 0 162
## 180 1 2 173
## 181 0 2 166
## 182 0 2 150
## 183 0 2 178
## 184 0 2 145
## 185 0 2 161
## 186 0 0 179
## 187 1 0 194
## 188 0 0 120
## 189 0 2 195
## 190 0 2 146
## 191 0 0 163
## 192 0 0 122
## 193 1 2 143
## 194 1 0 106
## 195 0 2 115
## 196 0 2 125
## 197 1 2 131
## 198 0 2 152
## 199 0 0 162
## 200 0 2 125
## 201 0 2 159
## 202 0 0 154
## 203 1 0 173
## 204 0 0 133
## 205 0 0 161
## 206 0 2 147
## 207 0 2 130
## 208 0 2 126
## 209 0 0 155
## 210 0 0 154
## 211 0 0 170
## 212 0 0 182
## 213 0 2 168
## 214 1 0 165
## 215 0 0 160
## 216 0 2 162
## 217 0 0 172
## 218 0 2 152
## 219 0 0 122
## 220 0 2 182
## 221 0 2 172
## 222 0 2 167
## 223 0 0 179
## 224 0 0 95
## 225 0 0 169
## 226 0 0 192
## 227 0 0 143
## 228 0 0 172
## 229 0 2 108
## 230 0 2 132
## 231 0 2 169
## 232 0 1 117
## 233 0 2 126
## 234 0 2 121
## 235 0 0 163
## 236 0 2 116
## 237 1 2 103
## 238 0 2 144
## 239 0 0 162
## 240 0 0 162
## 241 0 0 153
## 242 0 0 163
## 243 0 0 163
## 244 0 0 145
## 245 1 0 96
## 246 0 0 71
## 247 0 0 156
## 248 0 2 118
## 249 0 0 168
## 250 1 2 140
## 251 0 0 126
## 252 0 0 105
## 253 0 0 105
## 254 0 2 157
## 255 0 0 181
## 256 0 0 173
## 257 0 0 142
## 258 0 1 116
## 259 0 2 143
## 260 0 0 141
## 261 0 0 149
## 262 1 2 152
## 263 0 0 171
## 264 0 0 169
## 265 0 2 125
## 266 0 0 125
## 267 1 0 156
## 268 1 0 134
## 269 0 0 181
## 270 0 0 150
## 271 0 2 138
## 272 0 2 138
## 273 0 0 120
## 274 0 0 125
## 275 0 0 162
## 276 0 2 155
## 277 0 2 152
## 278 0 0 152
## 279 0 2 164
## 280 0 0 131
## 281 0 0 143
## 282 0 0 179
## 283 0 1 130
## 284 0 0 174
## 285 0 0 161
## 286 0 1 140
## 287 1 2 146
## 288 0 0 144
## 289 0 2 163
## 290 0 0 169
## 291 0 2 150
## 292 0 0 166
## 293 0 0 144
## 294 0 2 144
## 295 0 0 136
## 296 0 0 182
## 297 1 2 90
## 298 0 0 123
## 299 0 0 132
## 300 1 0 141
## 301 0 0 115
## 302 0 2 174
## 303 0 0 173
## Exercise_Induced_Angina ST_Depression_Exercise Peak_Exercise_ST_Segment
## 1 0 2.3 3
## 2 1 1.5 2
## 3 1 2.6 2
## 4 0 3.5 3
## 5 0 1.4 1
## 6 0 0.8 1
## 7 0 3.6 3
## 8 1 0.6 1
## 9 0 1.4 2
## 10 1 3.1 3
## 11 0 0.4 2
## 12 0 1.3 2
## 13 1 0.6 2
## 14 0 0.0 1
## 15 0 0.5 1
## 16 0 1.6 1
## 17 0 1.0 3
## 18 0 1.2 1
## 19 0 0.2 1
## 20 0 0.6 1
## 21 1 1.8 2
## 22 0 1.0 1
## 23 0 1.8 2
## 24 0 3.2 1
## 25 1 2.4 2
## 26 0 1.6 2
## 27 0 0.0 1
## 28 0 2.6 3
## 29 0 1.5 1
## 30 1 2.0 2
## 31 0 1.8 1
## 32 1 1.4 1
## 33 0 0.0 1
## 34 0 0.5 2
## 35 1 0.4 1
## 36 0 0.0 1
## 37 1 2.5 2
## 38 1 0.6 2
## 39 1 1.2 2
## 40 1 1.0 2
## 41 0 1.0 2
## 42 1 1.4 1
## 43 0 0.4 1
## 44 0 1.6 1
## 45 0 0.0 1
## 46 0 2.5 2
## 47 0 0.6 1
## 48 0 2.6 2
## 49 0 0.8 1
## 50 0 1.2 3
## 51 0 0.0 1
## 52 0 0.4 1
## 53 0 0.0 1
## 54 0 0.0 1
## 55 1 1.4 1
## 56 1 2.2 2
## 57 0 0.6 2
## 58 0 0.0 1
## 59 0 0.5 3
## 60 1 1.4 1
## 61 1 1.2 2
## 62 1 1.4 3
## 63 1 2.2 2
## 64 0 0.0 1
## 65 0 1.4 2
## 66 1 2.8 2
## 67 0 3.0 2
## 68 0 1.6 1
## 69 1 3.4 3
## 70 0 3.6 2
## 71 0 0.8 1
## 72 0 0.2 2
## 73 1 1.8 2
## 74 0 0.6 1
## 75 0 0.0 1
## 76 0 0.8 1
## 77 1 2.8 2
## 78 0 1.5 1
## 79 0 0.2 2
## 80 1 0.8 1
## 81 1 3.0 2
## 82 0 0.4 2
## 83 0 0.0 1
## 84 1 1.6 2
## 85 0 0.2 1
## 86 0 0.0 1
## 87 0 0.0 1
## 88 0 0.0 1
## 89 0 0.0 1
## 90 0 0.5 1
## 91 0 0.4 2
## 92 0 6.2 3
## 93 0 1.8 2
## 94 0 0.6 2
## 95 0 0.0 1
## 96 1 0.0 1
## 97 1 1.2 2
## 98 0 2.6 2
## 99 0 0.8 1
## 100 0 0.0 1
## 101 0 0.0 1
## 102 0 0.0 1
## 103 0 0.0 1
## 104 0 0.0 1
## 105 0 2.0 2
## 106 0 0.0 1
## 107 1 0.0 1
## 108 0 0.4 2
## 109 1 3.6 2
## 110 0 1.2 2
## 111 1 1.0 2
## 112 1 1.2 2
## 113 0 0.0 2
## 114 1 3.0 2
## 115 0 1.2 2
## 116 0 0.0 2
## 117 0 0.0 1
## 118 0 1.4 1
## 119 1 1.8 1
## 120 0 2.8 2
## 121 1 0.0 1
## 122 0 4.0 2
## 123 1 1.2 2
## 124 1 5.6 3
## 125 0 1.4 2
## 126 0 0.6 2
## 127 1 4.0 3
## 128 1 2.8 2
## 129 0 0.0 1
## 130 0 0.0 1
## 131 0 0.4 2
## 132 1 0.0 1
## 133 0 0.0 1
## 134 1 0.0 1
## 135 0 0.2 2
## 136 0 1.4 2
## 137 1 2.6 3
## 138 0 1.4 2
## 139 1 1.6 2
## 140 0 2.4 2
## 141 1 0.0 1
## 142 0 0.2 2
## 143 0 0.0 1
## 144 1 1.8 2
## 145 1 0.6 2
## 146 0 0.0 1
## 147 0 1.0 2
## 148 0 0.0 1
## 149 0 0.0 1
## 150 0 0.0 1
## 151 0 1.2 2
## 152 0 0.6 2
## 153 0 1.6 2
## 154 1 0.8 2
## 155 1 2.2 3
## 156 0 2.4 2
## 157 1 1.6 1
## 158 0 0.0 1
## 159 0 1.2 2
## 160 0 1.0 1
## 161 0 0.0 1
## 162 1 0.0 1
## 163 0 1.6 2
## 164 0 1.0 2
## 165 0 0.0 1
## 166 1 0.0 1
## 167 0 0.0 1
## 168 1 0.0 1
## 169 1 0.0 1
## 170 0 0.0 2
## 171 1 2.9 2
## 172 1 0.0 1
## 173 1 0.0 2
## 174 0 1.2 2
## 175 0 2.0 2
## 176 1 1.2 2
## 177 0 0.1 1
## 178 1 2.1 2
## 179 0 1.9 1
## 180 0 0.0 1
## 181 0 0.5 2
## 182 1 1.9 2
## 183 0 0.8 1
## 184 0 4.2 3
## 185 0 0.0 1
## 186 0 0.0 1
## 187 0 0.8 3
## 188 1 0.0 2
## 189 0 0.0 1
## 190 0 2.0 2
## 191 0 0.0 1
## 192 1 4.2 2
## 193 1 0.1 2
## 194 0 1.9 2
## 195 0 1.5 2
## 196 1 0.9 2
## 197 0 0.1 2
## 198 1 0.2 2
## 199 0 1.1 1
## 200 0 0.0 1
## 201 0 0.0 1
## 202 1 0.0 1
## 203 0 0.2 1
## 204 0 0.2 1
## 205 0 0.0 1
## 206 1 0.0 2
## 207 1 3.0 2
## 208 1 0.9 2
## 209 0 0.0 1
## 210 1 1.4 2
## 211 0 0.0 1
## 212 1 3.8 2
## 213 0 2.0 2
## 214 1 1.0 2
## 215 0 0.0 1
## 216 0 1.9 2
## 217 0 0.0 1
## 218 1 0.0 2
## 219 0 2.0 2
## 220 0 0.0 1
## 221 1 0.0 1
## 222 0 0.0 1
## 223 0 0.0 1
## 224 1 2.0 2
## 225 1 1.8 2
## 226 0 0.7 1
## 227 0 0.1 1
## 228 0 0.0 1
## 229 1 0.0 2
## 230 1 0.1 1
## 231 0 0.1 2
## 232 1 3.4 2
## 233 0 0.8 1
## 234 1 0.2 1
## 235 0 0.0 1
## 236 1 3.2 2
## 237 1 1.6 3
## 238 0 0.8 1
## 239 0 0.0 2
## 240 0 0.0 1
## 241 0 0.0 1
## 242 0 0.0 1
## 243 0 0.0 1
## 244 0 2.6 2
## 245 0 0.0 1
## 246 0 1.0 2
## 247 0 0.1 1
## 248 1 1.0 2
## 249 0 1.0 1
## 250 0 0.0 1
## 251 1 1.5 2
## 252 0 2.0 2
## 253 1 0.2 2
## 254 0 0.6 1
## 255 0 1.2 2
## 256 0 0.0 2
## 257 0 0.3 1
## 258 0 1.1 2
## 259 0 0.0 1
## 260 0 0.3 1
## 261 0 0.3 2
## 262 0 0.0 1
## 263 0 0.9 1
## 264 0 0.0 1
## 265 1 3.6 2
## 266 1 1.8 2
## 267 1 1.0 2
## 268 0 2.2 2
## 269 0 0.0 1
## 270 0 0.0 1
## 271 1 1.9 1
## 272 0 2.3 1
## 273 1 1.8 2
## 274 0 1.6 2
## 275 0 0.8 1
## 276 0 0.6 2
## 277 0 0.0 2
## 278 0 0.0 2
## 279 0 0.0 1
## 280 0 0.6 2
## 281 1 3.0 2
## 282 0 0.0 1
## 283 1 2.0 2
## 284 0 0.0 1
## 285 0 0.0 1
## 286 0 4.4 3
## 287 1 2.8 2
## 288 0 0.4 2
## 289 0 0.0 1
## 290 0 0.0 3
## 291 0 0.8 2
## 292 0 1.2 1
## 293 1 2.8 3
## 294 1 4.0 1
## 295 1 0.0 2
## 296 0 0.0 1
## 297 0 1.0 2
## 298 1 0.2 2
## 299 0 1.2 2
## 300 0 3.4 2
## 301 1 1.2 2
## 302 0 0.0 2
## 303 0 0.0 1
## Num_Major_Vessels_Flouro Thalassemia target
## 1 0 6 0
## 2 3 3 1
## 3 2 7 1
## 4 0 3 0
## 5 0 3 0
## 6 0 3 0
## 7 2 3 1
## 8 0 3 0
## 9 1 7 1
## 10 0 7 1
## 11 0 6 0
## 12 0 3 0
## 13 1 6 1
## 14 0 7 0
## 15 0 7 0
## 16 0 3 0
## 17 0 7 1
## 18 0 3 0
## 19 0 3 0
## 20 0 3 0
## 21 0 3 0
## 22 0 3 0
## 23 0 3 1
## 24 2 7 1
## 25 2 7 1
## 26 0 3 0
## 27 0 3 0
## 28 0 3 0
## 29 0 3 0
## 30 0 7 1
## 31 2 3 0
## 32 2 7 1
## 33 0 3 1
## 34 0 7 0
## 35 0 3 0
## 36 0 3 0
## 37 0 7 1
## 38 1 6 1
## 39 1 7 1
## 40 0 3 0
## 41 3 7 1
## 42 0 7 0
## 43 2 3 0
## 44 0 3 0
## 45 0 3 1
## 46 1 7 1
## 47 0 3 0
## 48 0 7 1
## 49 1 3 0
## 50 0 3 0
## 51 1 3 0
## 52 0 7 0
## 53 1 3 1
## 54 0 3 0
## 55 1 7 1
## 56 1 7 1
## 57 1 7 1
## 58 0 7 1
## 59 1 3 0
## 60 1 3 0
## 61 0 7 1
## 62 0 3 0
## 63 3 7 1
## 64 0 3 0
## 65 1 7 1
## 66 2 7 1
## 67 0 3 1
## 68 0 7 0
## 69 0 7 1
## 70 0 3 1
## 71 0 3 0
## 72 2 7 1
## 73 2 7 1
## 74 2 6 1
## 75 1 3 1
## 76 0 3 0
## 77 1 7 1
## 78 1 3 0
## 79 0 3 0
## 80 0 7 1
## 81 0 3 0
## 82 0 3 0
## 83 0 3 0
## 84 0 7 1
## 85 0 3 0
## 86 0 3 0
## 87 0 3 0
## 88 0 3 0
## 89 0 3 0
## 90 0 3 0
## 91 0 3 0
## 92 3 7 1
## 93 3 7 0
## 94 0 3 0
## 95 0 3 0
## 96 1 7 1
## 97 1 7 1
## 98 2 7 1
## 99 1 3 0
## 100 0 3 0
## 101 0 3 0
## 102 0 3 0
## 103 1 3 0
## 104 1 3 0
## 105 3 7 1
## 106 0 7 0
## 107 1 7 1
## 108 1 7 1
## 109 1 7 1
## 110 0 7 1
## 111 0 7 1
## 112 1 3 1
## 113 0 6 0
## 114 0 7 1
## 115 1 7 1
## 116 0 6 0
## 117 0 3 0
## 118 0 3 0
## 119 3 7 1
## 120 1 7 1
## 121 2 7 1
## 122 3 7 1
## 123 0 3 0
## 124 0 7 1
## 125 1 3 1
## 126 0 3 0
## 127 2 7 1
## 128 1 7 1
## 129 0 3 0
## 130 0 3 0
## 131 0 7 0
## 132 1 7 0
## 133 0 3 0
## 134 0 3 0
## 135 0 3 0
## 136 0 3 0
## 137 0 7 1
## 138 1 7 1
## 139 0 7 1
## 140 0 3 0
## 141 0 3 0
## 142 0 7 1
## 143 0 3 0
## 144 0 7 1
## 145 0 7 0
## 146 0 3 1
## 147 3 7 1
## 148 0 3 0
## 149 0 3 0
## 150 1 3 0
## 151 0 7 0
## 152 0 3 0
## 153 0 7 0
## 154 1 7 1
## 155 1 3 1
## 156 3 3 1
## 157 0 7 1
## 158 2 7 1
## 159 2 7 1
## 160 1 7 0
## 161 0 7 0
## 162 3 3 1
## 163 0 3 0
## 164 0 3 0
## 165 2 3 0
## 166 0 7 0
## 167 1 3 0
## 168 1 3 0
## 169 0 7 1
## 170 0 3 0
## 171 1 7 1
## 172 0 7 0
## 173 0 3 1
## 174 0 3 0
## 175 2 6 1
## 176 1 7 1
## 177 3 7 0
## 178 1 6 1
## 179 1 3 0
## 180 3 3 0
## 181 0 7 1
## 182 2 7 1
## 183 2 3 0
## 184 0 7 0
## 185 0 3 1
## 186 2 3 0
## 187 0 7 0
## 188 3 6 1
## 189 1 7 1
## 190 3 7 1
## 191 0 3 0
## 192 3 7 1
## 193 1 7 1
## 194 3 3 1
## 195 0 3 0
## 196 2 3 1
## 197 1 3 0
## 198 0 3 0
## 199 0 3 0
## 200 0 3 1
## 201 0 3 0
## 202 0 3 0
## 203 1 7 0
## 204 0 7 0
## 205 0 7 0
## 206 3 7 1
## 207 2 7 1
## 208 0 7 1
## 209 0 3 0
## 210 0 3 1
## 211 0 3 0
## 212 0 7 1
## 213 0 3 0
## 214 2 7 1
## 215 1 3 1
## 216 0 7 0
## 217 0 3 0
## 218 0 3 0
## 219 2 3 0
## 220 0 3 0
## 221 0 3 0
## 222 0 3 0
## 223 0 3 0
## 224 2 7 1
## 225 2 3 1
## 226 0 3 0
## 227 0 3 0
## 228 1 3 0
## 229 1 3 1
## 230 1 3 1
## 231 0 3 0
## 232 0 3 1
## 233 3 3 1
## 234 1 3 0
## 235 1 3 0
## 236 2 3 1
## 237 0 7 1
## 238 0 7 1
## 239 0 3 0
## 240 0 3 0
## 241 0 3 0
## 242 0 3 0
## 243 0 3 0
## 244 2 3 1
## 245 0 3 0
## 246 0 3 1
## 247 1 7 1
## 248 1 3 1
## 249 2 7 1
## 250 0 3 0
## 251 0 6 0
## 252 1 7 1
## 253 1 7 0
## 254 0 3 0
## 255 0 3 0
## 256 0 3 0
## 257 2 3 0
## 258 0 3 0
## 259 0 3 0
## 260 0 7 1
## 261 1 3 0
## 262 2 3 1
## 263 0 3 0
## 264 0 3 0
## 265 1 3 1
## 266 0 6 1
## 267 0 3 1
## 268 1 6 1
## 269 0 7 1
## 270 0 3 0
## 271 1 7 1
## 272 0 6 0
## 273 2 7 1
## 274 0 3 0
## 275 2 3 1
## 276 0 7 0
## 277 1 3 0
## 278 0 3 0
## 279 1 3 1
## 280 0 3 0
## 281 1 7 1
## 282 0 3 0
## 283 1 7 1
## 284 0 3 0
## 285 1 7 1
## 286 3 6 1
## 287 2 6 1
## 288 1 7 0
## 289 0 7 0
## 290 0 3 0
## 291 0 7 1
## 292 0 3 0
## 293 0 6 1
## 294 2 7 1
## 295 0 3 1
## 296 0 3 0
## 297 2 6 1
## 298 0 7 1
## 299 0 7 1
## 300 2 7 1
## 301 1 7 1
## 302 1 3 1
## 303 1 3 0
hd_long_fact_tbl <- heart %>%
dplyr::select(Sex, Chest_Pain_Type, Fasting_Blood_Sugar, Resting_ECG, Exercise_Induced_Angina,Peak_Exercise_ST_Segment,Thalassemia,target) %>%
mutate(Sex = recode_factor(Sex, `0` = "female",
`1` = "male" ),
Chest_Pain_Type = recode_factor(Chest_Pain_Type, `1` = "typical",
`2` = "atypical",
`3` = "non-angina",
`4` = "asymptomatic"),
Fasting_Blood_Sugar = recode_factor(Fasting_Blood_Sugar, `0` = "<= 120 mg/dl",
`1` = "> 120 mg/dl"),
Resting_ECG = recode_factor(Resting_ECG, `0` = "normal",
`1` = "ST-T abnormality",
`2` = "LV hypertrophy"),
Exercise_Induced_Angina = recode_factor(Exercise_Induced_Angina, `0` = "no",
`1` = "yes"),
Peak_Exercise_ST_Segment = recode_factor(Peak_Exercise_ST_Segment, `1` = "upsloping",
`2` = "flat",
`3` = "downsloping"),
Thalassemia = recode_factor(Thalassemia, `3` = "normal",
`6` = "fixed defect",
`7` = "reversible defect")) %>%
gather(key = "key", value = "value", -target)
## Warning: attributes are not identical across measure variables;
## they will be dropped
#Visualize with bar plot
hd_long_fact_tbl %>%
ggplot(aes(value)) +
geom_bar(aes(x = value,
fill = factor(target)), # a discrete fill is required by scale_fill_manual()
alpha = .6,
position = "dodge",
color = "black",
width = .8
) +
labs(x = "",
y = "",
title = "Scaled Effect of Categorical Variables") +
theme(
axis.text.y = element_blank(),
axis.ticks.y = element_blank()) +
facet_wrap(~ key, scales = "free", nrow = 4) +
scale_fill_manual(
values = c("#fde725ff", "#20a486ff"),
name = "Heart\nDisease",
labels = c("No HD", "Yes HD"))
#Must gather() data first in order to facet wrap by key
#(default gather call puts all var names into new key col)
#hd_long_cont_tbl <- heart %>%
#dplyr::select(Age, Resting_Blood_Pressure, Serum_Cholesterol, Max_Heart_Rate_Achieved,
#ST_Depression_Exercise, Num_Major_Vessels_Flouro, target) %>%
#gather(key = "key",
#value = "value",
#-target)
#Visualize numeric variables as boxplots
#hd_long_cont_tbl %>%
#ggplot(aes(y = value)) +
#geom_boxplot(aes(fill = target),
#alpha = .6,
#fatten = .7) +
#labs(x = "",
#y = "",
#title = "Boxplots for Numeric Variables") +
#scale_fill_manual(
#values = c("#fde725ff", "#20a486ff"),
#name = "Heart\nDisease",
#labels = c("No HD", "Yes HD")) +
#theme(
#axis.text.x = element_blank(),
#axis.ticks.x = element_blank()) +
#facet_wrap(~ key,
#scales = "free",
#ncol = 2)
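# Count patients per chest pain type (named cp_counts to avoid shadowing base::table)
cp_counts <- table(as.numeric(heart$Chest_Pain_Type))
pie(cp_counts)
From the pie chart, we can observe that most individuals are asymptomatic (type 4), followed by those with non-anginal pain (type 3).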
The faceted plots for categorical and numeric variables suggest the following conditions are associated with increased prevalence of heart disease.
We can’t all be cardiologists but these do seem to pass the eye check. Particularly: age, blood pressure, cholesterol, and sex all point in the right direction based on what we generally know about the world around us. This provides a nice phase gate to let us proceed with the analysis.
Highly correlated variables can lead to overly complicated models or wonky predictions. The ggcorr() function from GGally package provides a nice, clean correlation matrix of the numeric variables. The default method is Pearson which I use here first. Pearson isn’t ideal if the data is skewed or has a lot of outliers so I’ll check using the rank-based Kendall method as well.
Correlation analysis helps us understand the strength and direction of the relationship between features. For example, a coefficient of +0.8 indicates a very strong positive relationship, while 0 indicates no linear relationship.
Two methods, Pearson and Kendall, are used for the correlation analysis in this study.
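As a rough illustration of why both methods are checked, the base R sketch below compares Pearson and Kendall coefficients on a small made-up pair of vectors, before and after adding a single extreme outlier. The values are assumptions chosen for illustration and are not taken from the heart data.

```r
# Toy data (hypothetical values, not from the heart dataset)
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
y <- c(2, 4, 5, 4, 6, 7, 8, 9, 10, 12)

# Both coefficients are high on the clean data
pearson_clean <- cor(x, y, method = "pearson")
kendall_clean <- cor(x, y, method = "kendall")

# Add one extreme outlier and recompute
x_out <- c(x, 11)
y_out <- c(y, -40)

pearson_out <- cor(x_out, y_out, method = "pearson")
kendall_out <- cor(x_out, y_out, method = "kendall")

# With the outlier, Pearson turns negative while the rank-based
# Kendall coefficient stays clearly positive
round(c(pearson_clean = pearson_clean, kendall_clean = kendall_clean,
        pearson_out = pearson_out, kendall_out = kendall_out), 2)
```

In this toy case a single point is enough to flip the sign of the Pearson coefficient, which is why a rank-based check is a reasonable complement when outliers are suspected.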
#Correlation matrix using Pearson method, default method is Pearson
heart %>% ggcorr(high = "#20a486ff",
low = "#fde725ff",
label = TRUE,
hjust = .75,
size = 3,
label_size = 3,
nbreaks = 5
) +
labs(title = "Correlation Matrix",
subtitle = "Pearson Method Using Pairwise Observations")
## Warning in ggcorr(., high = "#20a486ff", low = "#fde725ff", label = TRUE, : data
## in column(s) 'Age', 'Thalassemia' are not numeric and were ignored
#Correlation matrix using Kendall method
heart %>% ggcorr(method = c("pairwise", "kendall"),
high = "#20a486ff",
low = "#fde725ff",
label = TRUE,
hjust = .75,
size = 3,
label_size = 3,
nbreaks = 5
) +
labs(title = "Correlation Matrix",
subtitle = "Kendall Method Using Pairwise Observations")
## Warning in ggcorr(., method = c("pairwise", "kendall"), high = "#20a486ff", :
## data in column(s) 'Age', 'Thalassemia' are not numeric and were ignored
Based on the Pearson and Kendall correlation results, the features with a correlation of at least 40% with the target feature are: chest pain type, number of major vessels, exercise-induced angina, and ST depression.
There are only minor differences between the Pearson and Kendall results. No variables appear to be highly correlated (i.e. > 50%). As such, it seems reasonable to keep all 14 original variables as we proceed to the modeling section. Additional steps can be taken at the modelling stage to identify the statistically significant features.
The plan is to split up the original data set to form a training group (70%) and testing group (30%). The training group will be used to fit the model while the testing group will be used to evaluate predictions. The initial_split() function creates a split object which is just an efficient way to store both the training and testing sets. The training() and testing() functions are used to extract the appropriate dataframes out of the split object when needed.
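A minimal sketch of the split object described above, assuming the rsample package is installed. The data frame `toy` is a made-up stand-in for the heart data, and `split_obj`, `train_tbl` and `test_tbl` are hypothetical names. Note that the modeling code in this report uses caret's createDataPartition() for the same purpose.

```r
library(rsample)

set.seed(100)

# Hypothetical stand-in for the heart data: 100 rows with a binary target
toy <- data.frame(
  age    = sample(30:75, 100, replace = TRUE),
  target = factor(rep(c(0, 1), times = c(55, 45)))
)

# 70/30 split, stratified so both sets keep a similar class balance
split_obj <- initial_split(toy, prop = 0.7, strata = target)

# Extract the two data frames from the split object
train_tbl <- training(split_obj)
test_tbl  <- testing(split_obj)

c(train = nrow(train_tbl), test = nrow(test_tbl))
```

Stratifying on the target is a design choice rather than a requirement; it keeps the proportion of heart-disease cases roughly equal in both sets, which matters for a dataset this small.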
Logistic regression is a form of regression analysis. Binary logistic regression is used when the target feature is dichotomous, taking only two values such as 0 or 1, or Yes or No. In this project, it is used to predict the presence of heart disease and to identify which conditions are associated with it.
# Load the caret library
library(caret)
'%ni%' <- Negate('%in%') # define 'not in' func
options(scipen=999) # prevents printing scientific notations.
# Prep Training and Test dataset
# First, create a Train-test split with 70% data included in the training set
set.seed(100)
trainDataIndex <- createDataPartition(heart1$target, p=0.7, list = F) # 70% training data
trainData <- heart1[trainDataIndex, ]
testData <- heart1[-trainDataIndex, ]
# Build Logistic Model (Using All Features)
logitmod1 <- glm(target ~ Age + Sex + Chest_Pain_Type + Resting_Blood_Pressure + Serum_Cholesterol + Fasting_Blood_Sugar + Resting_ECG+Max_Heart_Rate_Achieved+Exercise_Induced_Angina+ST_Depression_Exercise+Peak_Exercise_ST_Segment+Num_Major_Vessels_Flouro+Thalassemia, family = "binomial", data=trainData)
# Print the model summary
summary(logitmod1)
##
## Call:
## glm(formula = target ~ Age + Sex + Chest_Pain_Type + Resting_Blood_Pressure +
## Serum_Cholesterol + Fasting_Blood_Sugar + Resting_ECG + Max_Heart_Rate_Achieved +
## Exercise_Induced_Angina + ST_Depression_Exercise + Peak_Exercise_ST_Segment +
## Num_Major_Vessels_Flouro + Thalassemia, family = "binomial",
## data = trainData)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.70796 -0.29831 -0.00013 0.20624 2.47762
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -20.826675 6522.640154 -0.003 0.997452
## Age34 0.916971 7751.573625 0.000 0.999906
## Age35 15.292539 6522.639838 0.002 0.998129
## Age37 -1.361105 7719.002128 0.000 0.999859
## Age38 15.931586 6522.638764 0.002 0.998051
## Age39 16.802190 6522.641395 0.003 0.997945
## Age40 15.259690 6522.638789 0.002 0.998133
## Age41 13.912322 6522.638702 0.002 0.998298
## Age42 12.722683 6522.638870 0.002 0.998444
## Age43 12.592353 6522.638994 0.002 0.998460
## Age44 14.685295 6522.638658 0.002 0.998204
## Age45 13.138367 6522.638733 0.002 0.998393
## Age46 12.893359 6522.638869 0.002 0.998423
## Age47 15.692236 6522.638756 0.002 0.998080
## Age48 10.731397 6522.639944 0.002 0.998687
## Age49 13.698389 6522.639723 0.002 0.998324
## Age50 13.257435 6522.639576 0.002 0.998378
## Age51 11.605676 6522.638754 0.002 0.998580
## Age52 12.206813 6522.638855 0.002 0.998507
## Age53 -7.476687 6929.428706 -0.001 0.999139
## Age54 11.352843 6522.638751 0.002 0.998611
## Age55 14.177270 6522.639090 0.002 0.998266
## Age56 14.051898 6522.639028 0.002 0.998281
## Age57 13.649477 6522.638667 0.002 0.998330
## Age58 12.011296 6522.638729 0.002 0.998531
## Age59 14.265998 6522.638692 0.002 0.998255
## Age60 15.716961 6522.638784 0.002 0.998077
## Age61 15.444033 6522.638766 0.002 0.998111
## Age62 11.136944 6522.638944 0.002 0.998638
## Age63 15.611120 6522.639018 0.002 0.998090
## Age64 9.540608 6522.639142 0.001 0.998833
## Age65 13.005983 6522.638740 0.002 0.998409
## Age66 11.670180 6522.638779 0.002 0.998572
## Age67 13.227907 6522.638869 0.002 0.998382
## Age68 10.691786 6522.638924 0.002 0.998692
## Age69 11.043980 6522.646017 0.002 0.998649
## Age70 27.651256 7288.812690 0.004 0.996973
## Age71 -2.825241 7742.837763 0.000 0.999709
## Age74 -4.821633 9224.404119 -0.001 0.999583
## Age76 -2.549170 9224.404083 0.000 0.999780
## Age77 26.566605 9224.404210 0.003 0.997702
## Age63 -3.739744 9224.404076 0.000 0.999677
## Sex 1.974530 1.011735 1.952 0.050982 .
## Chest_Pain_Type 0.745306 0.327136 2.278 0.022710 *
## Resting_Blood_Pressure 0.017929 0.018716 0.958 0.338073
## Serum_Cholesterol 0.014237 0.008057 1.767 0.077220 .
## Fasting_Blood_Sugar -0.158072 1.056833 -0.150 0.881102
## Resting_ECG 0.767514 0.367585 2.088 0.036799 *
## Max_Heart_Rate_Achieved -0.042724 0.019933 -2.143 0.032079 *
## Exercise_Induced_Angina 0.830954 0.762069 1.090 0.275541
## ST_Depression_Exercise 0.123910 0.370983 0.334 0.738377
## Peak_Exercise_ST_Segment 0.655676 0.733402 0.894 0.371312
## Num_Major_Vessels_Flouro 1.733654 0.486418 3.564 0.000365 ***
## Thalassemia6 0.143307 1.320759 0.109 0.913597
## Thalassemia7 2.618716 0.861268 3.041 0.002362 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 293.584 on 212 degrees of freedom
## Residual deviance: 99.031 on 158 degrees of freedom
## AIC: 209.03
##
## Number of Fisher Scoring iterations: 17
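The summary above contains one coefficient per distinct age (Age34, Age35, ...), each with a very large standard error. The str(heart) output later in this report lists Age as `chr`, so glm() treats it as a categorical variable rather than a number. The sketch below uses made-up data (not the heart dataset) to show the difference; `toy`, `fit_chr` and `fit_num` are hypothetical names.

```r
set.seed(1)

# Made-up data: 200 patients whose risk rises gently with age
toy <- data.frame(
  age_chr = as.character(sample(40:60, 200, replace = TRUE)),
  stringsAsFactors = FALSE
)
toy$age_num <- as.numeric(toy$age_chr)
toy$target  <- rbinom(200, 1, plogis(-2.5 + 0.05 * toy$age_num))

# Character age: one dummy coefficient per additional distinct age value
fit_chr <- glm(target ~ age_chr, family = binomial, data = toy)

# Numeric age: a single slope
fit_num <- glm(target ~ age_num, family = binomial, data = toy)

length(coef(fit_chr))  # intercept plus one dummy per additional distinct age
length(coef(fit_num))  # intercept and one slope
```

Under this assumption, converting Age with as.numeric() before fitting would replace the dozens of age dummies with one interpretable coefficient and free up degrees of freedom for the other predictors.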
# Re-fit the logistic model with a reduced set of predictors
logitmod2 <- glm(target ~ Chest_Pain_Type+Resting_Blood_Pressure+Exercise_Induced_Angina+Num_Major_Vessels_Flouro+Thalassemia, family = "binomial", data=trainData)
summary(logitmod2)
##
## Call:
## glm(formula = target ~ Chest_Pain_Type + Resting_Blood_Pressure +
## Exercise_Induced_Angina + Num_Major_Vessels_Flouro + Thalassemia,
## family = "binomial", data = trainData)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7817 -0.4983 -0.2581 0.5417 2.5019
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.44547 1.79949 -3.582 0.000341 ***
## Chest_Pain_Type 0.59300 0.21625 2.742 0.006103 **
## Resting_Blood_Pressure 0.01464 0.01106 1.324 0.185665
## Exercise_Induced_Angina 1.27346 0.43939 2.898 0.003753 **
## Num_Major_Vessels_Flouro 1.26868 0.25465 4.982 0.00000062885 ***
## Thalassemia6 1.44373 0.76835 1.879 0.060244 .
## Thalassemia7 2.53388 0.43588 5.813 0.00000000613 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 293.58 on 212 degrees of freedom
## Residual deviance: 159.04 on 206 degrees of freedom
## AIC: 173.04
##
## Number of Fisher Scoring iterations: 5
# Apply the model to the test data.
# pred holds, for each test observation, the predicted probability that heart disease is present.
# For logistic regression, set type = "response" to obtain predicted probabilities.
pred <- predict(logitmod2, newdata = testData, type = "response")
# Measure the accuracy of the predictions on the test data.
# The common practice is to use a probability cutoff of 0.5.
# If the predicted probability is > 0.5, the observation is classified as an event (presence of heart disease).
# So if pred is greater than 0.5, the prediction is positive (heart disease = yes); otherwise it is negative.
y_pred_num <- ifelse(pred > 0.5, 1, 0)
y_pred <- factor(y_pred_num, levels=c(0, 1))
y_act <- testData$target
# Result : Prediction Accuracy (Proportion of predicted target that matches with actual target)
mean(y_pred == y_act)
## [1] 0.8222222
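Accuracy alone does not show which kind of error the model makes. As a minimal sketch, the base R code below applies the same 0.5 cutoff to a small set of made-up probabilities and labels (not the heart data) and derives accuracy, sensitivity and specificity from the confusion table. All values and variable names are hypothetical.

```r
# Hypothetical predicted probabilities and true labels
prob   <- c(0.91, 0.80, 0.65, 0.55, 0.45, 0.40, 0.30, 0.20, 0.10, 0.05)
actual <- c(1,    1,    0,    1,    1,    0,    0,    0,    1,    0)

# Classify with a 0.5 cutoff
predicted <- ifelse(prob > 0.5, 1, 0)

# Fix the factor levels so the table is always 2 x 2
cm <- table(Predicted = factor(predicted, levels = c(0, 1)),
            Actual    = factor(actual,    levels = c(0, 1)))

tp <- cm["1", "1"]  # true positives
tn <- cm["0", "0"]  # true negatives
fp <- cm["1", "0"]  # false positives
fn <- cm["0", "1"]  # false negatives

acc  <- (tp + tn) / sum(cm)   # 0.7
sens <- tp / (tp + fn)        # 0.6
spec <- tn / (tn + fp)        # 0.8

cm
c(accuracy = acc, sensitivity = sens, specificity = spec)
```

For a screening task such as heart disease detection, sensitivity (the share of true cases that are caught) is often the more important of the two rates.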
# Plot ROC Curve
# install.packages("InformationValue")
library(InformationValue)
##
## Attaching package: 'InformationValue'
## The following objects are masked from 'package:yardstick':
##
## npv, precision, sensitivity, specificity
## The following objects are masked from 'package:caret':
##
## confusionMatrix, precision, sensitivity, specificity
InformationValue::plotROC(y_act, pred)
InformationValue::AUROC(y_act, pred)
## [1] 0.8603671
## Interpretation of ROC
# The trade-off between sensitivity and specificity is captured by the
# 'Receiver Operating Characteristic' (ROC) curve. The area under the ROC curve
# can be used as an evaluation metric to compare the efficacy of models.
The logistic regression model shows an accuracy of 82% on the test dataset, and the area under the ROC curve (AUC) is about 0.86.
In the reduced model, the features significant at the 5% level are Chest_Pain_Type, Exercise_Induced_Angina, Num_Major_Vessels_Flouro and Thalassemia (level 7, reversible defect). Resting_Blood_Pressure was included but is not significant (p ≈ 0.19).
This finding is broadly consistent with the Pearson correlation result.
The ROC curve is a way of analyzing how sensitivity and specificity behave over the full range of probability cutoffs, from 0 to 1. Ideally, with a perfect model, all events have a probability score of 1 and all non-events have a score of 0. For such a model, the area under the ROC curve is a perfect 1. As we trace the curve from the bottom left, the probability cutoff decreases from 1 towards 0. With a good model, more of the real events are predicted as events, resulting in high sensitivity and a low false positive rate (FPR). In that case, the curve rises steeply, covering a large area before reaching the top right. Therefore, the larger the area under the ROC curve, the better the model.
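The cutoff sweep described here can be sketched in base R. The scores and labels below are made up for illustration (they are not the model's predictions), and the area is computed with the trapezoidal rule rather than with the InformationValue functions used in this report.

```r
# Hypothetical scores and true labels
score <- c(0.95, 0.85, 0.70, 0.60, 0.55, 0.40, 0.35, 0.20, 0.15, 0.05)
label <- c(1,    1,    0,    1,    1,    0,    1,    0,    0,    0)

# Sweep the cutoff from high to low; Inf gives the (0, 0) starting point
cutoffs <- c(Inf, sort(unique(score), decreasing = TRUE))

# True positive rate (sensitivity) and false positive rate at each cutoff
tpr <- sapply(cutoffs, function(k) sum(score >= k & label == 1) / sum(label == 1))
fpr <- sapply(cutoffs, function(k) sum(score >= k & label == 0) / sum(label == 0))

# Area under the curve by the trapezoidal rule
auc <- sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)
auc  # 0.84

# Draw the curve with the chance diagonal for reference
plot(fpr, tpr, type = "s", xlab = "False positive rate", ylab = "Sensitivity")
abline(0, 1, lty = 2)
```

The same value can be read as the share of (event, non-event) pairs in which the event receives the higher score: 21 of the 25 pairs in this toy example.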
Naive Bayes (NB) is a classification technique based on Bayes’ Theorem with an assumption of independence among predictors. In simple terms, a Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature.
A Naive Bayes model is easy to build and scales well to very large data sets. Despite its simplicity, Naive Bayes can perform competitively with more sophisticated classification methods.
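To make the independence assumption concrete, here is a hand-rolled Gaussian Naive Bayes sketch in base R. It uses the built-in iris data as a stand-in (an assumption, since the heart data is not reproduced here) and is not the caret 'nb' implementation used in this report.

```r
# Predictors and class labels from the built-in iris data
X <- iris[, 1:4]
y <- iris$Species
classes <- levels(y)

# Class priors P(class)
prior <- setNames(as.numeric(table(y)) / length(y), classes)

# Per-class mean and standard deviation of each feature (features x classes)
mu    <- sapply(classes, function(k) colMeans(X[y == k, ]))
sigma <- sapply(classes, function(k) apply(X[y == k, ], 2, sd))

# Log-posterior up to a constant: log prior plus the SUM of per-feature
# log-likelihoods. Summing is exactly the "naive" independence assumption.
log_post <- sapply(classes, function(k) {
  ll <- sapply(seq_along(X), function(j) {
    dnorm(X[[j]], mean = mu[j, k], sd = sigma[j, k], log = TRUE)
  })
  log(prior[[k]]) + rowSums(ll)
})

# Predict the class with the highest log-posterior
pred <- classes[max.col(log_post, ties.method = "first")]

# In-sample accuracy (optimistic, since no data is held out)
mean(pred == y)
```

Each feature contributes its own term to the score, with no interaction between features, which is why the model is cheap to fit even with many predictors.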
indxTrain <- createDataPartition(y = heart$target,p = 0.70,list = FALSE)
training <- heart[indxTrain,]
testing <- heart[-indxTrain,]
#Check that the class proportions are preserved by the split
prop.table(table(heart$target)) * 100
##
## 0 1
## 54.12541 45.87459
prop.table(table(training$target)) * 100
##
## 0 1
## 54.46009 45.53991
#create objects x which holds the predictor variables and y which holds the response variables
x = training[,-14]
y = training$target
y <- as.factor(y)
defaultW <- getOption("warn")
options(warn = -1)
model = train(x,y,'nb',trControl=trainControl(method='cv',number=10))
model
## Naive Bayes
##
## 213 samples
## 13 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 192, 192, 191, 191, 191, 191, ...
## Resampling results across tuning parameters:
##
## usekernel Accuracy Kappa
## FALSE 0.8341775 0.6599184
## TRUE 0.8198918 0.6278793
##
## Tuning parameter 'fL' was held constant at a value of 0
## Tuning
## parameter 'adjust' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were fL = 0, usekernel = FALSE and adjust
## = 1.
#Model Evaluation
#Predict testing set
Predict <- predict(model,newdata = testing )
#Get the confusion matrix to see the accuracy and other statistics.
#The caret:: prefix is required because InformationValue, loaded earlier, masks confusionMatrix().
caret::confusionMatrix(Predict, as.factor(testing$target))
options(warn = defaultW)
The Naive Bayes model performs similarly to Logistic Regression. Its 10-fold cross-validated accuracy is about 83% without kernel density estimation (usekernel = FALSE) and about 82% with it (usekernel = TRUE). Overall, the prediction accuracy is about 82%, which is similar to Logistic Regression.
Random forest is a supervised machine learning algorithm that is widely used for classification and regression problems. It builds decision trees on different bootstrap samples of the data and combines them by majority vote for classification or by averaging for regression.
One of the most important features of the random forest algorithm is that it can handle both continuous targets, as in regression, and categorical targets, as in classification. It often performs well on classification problems.
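The "different samples plus majority vote" idea can be sketched with bagged decision trees. This is a simplification, not a full random forest: it omits the random subset of variables tried at each split. It assumes the rpart package (shipped with standard R installations) and uses the built-in iris data as a stand-in for the heart data.

```r
library(rpart)

set.seed(100)
n_trees <- 25
n <- nrow(iris)

# Fit each tree on a bootstrap sample and collect its predictions
# for every row (one column of votes per tree)
votes <- sapply(seq_len(n_trees), function(b) {
  idx  <- sample(n, n, replace = TRUE)
  tree <- rpart(Species ~ ., data = iris[idx, ], method = "class")
  as.character(predict(tree, newdata = iris, type = "class"))
})

# Majority vote across trees for each row
majority <- apply(votes, 1, function(v) names(which.max(table(v))))

# In-sample accuracy of the ensemble (optimistic, no held-out data)
mean(majority == iris$Species)
```

The randomForest() call used in this report adds the per-split variable sampling on top of this scheme and reports an out-of-bag error, which is a fairer estimate than the in-sample accuracy printed here.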
# Import library
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
## The following object is masked from 'package:dplyr':
##
## combine
# To control the sampling permutation
set.seed(100)
# Change column 'target' to factor
heart$target <- as.factor(heart$target)
# Check the latest class of 'target'
str(heart)
## 'data.frame': 303 obs. of 14 variables:
## $ Age : chr "63" "67" "67" "37" ...
## $ Sex : int 1 1 1 1 0 1 0 0 1 1 ...
## $ Chest_Pain_Type : int 1 4 4 3 2 2 4 4 4 4 ...
## $ Resting_Blood_Pressure : int 145 160 120 130 130 120 140 120 130 140 ...
## $ Serum_Cholesterol : int 233 286 229 250 204 236 268 354 254 203 ...
## $ Fasting_Blood_Sugar : int 1 0 0 0 0 0 0 0 0 1 ...
## $ Resting_ECG : int 2 2 2 0 2 0 2 0 2 2 ...
## $ Max_Heart_Rate_Achieved : int 150 108 129 187 172 178 160 163 147 155 ...
## $ Exercise_Induced_Angina : int 0 1 1 0 0 0 0 1 0 1 ...
## $ ST_Depression_Exercise : num 2.3 1.5 2.6 3.5 1.4 0.8 3.6 0.6 1.4 3.1 ...
## $ Peak_Exercise_ST_Segment: int 3 2 2 3 1 1 3 1 2 3 ...
## $ Num_Major_Vessels_Flouro: num 0 3 2 0 0 0 2 0 1 0 ...
## $ Thalassemia : chr "6" "3" "7" "3" ...
## $ target : Factor w/ 2 levels "0","1": 1 2 2 1 1 1 2 1 2 2 ...
# Split dataset into training and testing set with probability 75% & 25%
rf_sample <- sample(2, nrow(heart), replace = TRUE, prob = c(0.75, 0.25))
rf_train <- heart[rf_sample==1,]
rf_test <- heart[rf_sample==2,]
str(rf_train)
## 'data.frame': 228 obs. of 14 variables:
## $ Age : chr "63" "67" "67" "37" ...
## $ Sex : int 1 1 1 1 0 1 0 1 1 1 ...
## $ Chest_Pain_Type : int 1 4 4 3 2 2 4 4 4 4 ...
## $ Resting_Blood_Pressure : int 145 160 120 130 130 120 120 130 140 140 ...
## $ Serum_Cholesterol : int 233 286 229 250 204 236 354 254 203 192 ...
## $ Fasting_Blood_Sugar : int 1 0 0 0 0 0 0 0 1 0 ...
## $ Resting_ECG : int 2 2 2 0 2 0 0 2 2 0 ...
## $ Max_Heart_Rate_Achieved : int 150 108 129 187 172 178 163 147 155 148 ...
## $ Exercise_Induced_Angina : int 0 1 1 0 0 0 1 0 1 0 ...
## $ ST_Depression_Exercise : num 2.3 1.5 2.6 3.5 1.4 0.8 0.6 1.4 3.1 0.4 ...
## $ Peak_Exercise_ST_Segment: int 3 2 2 3 1 1 1 2 3 2 ...
## $ Num_Major_Vessels_Flouro: num 0 3 2 0 0 0 0 1 0 0 ...
## $ Thalassemia : chr "6" "3" "7" "3" ...
## $ target : Factor w/ 2 levels "0","1": 1 2 2 1 1 1 1 2 2 1 ...
# Running the Random Forest model
rf <- randomForest(target~., data=rf_train, proximity=TRUE)
print(rf)
##
## Call:
## randomForest(formula = target ~ ., data = rf_train, proximity = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 17.98%
## Confusion matrix:
## 0 1 class.error
## 0 111 16 0.1259843
## 1 25 76 0.2475248
#Checking the accuracy on the training and testing sets.
#The caret:: prefix is required because InformationValue, loaded earlier, masks confusionMatrix().
p1 <- predict(rf, rf_train)
caret::confusionMatrix(p1, rf_train$target)
p2 <- predict(rf, rf_test)
caret::confusionMatrix(p2, as.factor(rf_test$target))
plot(rf)
The result shows an OOB (out-of-bag) estimate of error rate of 17.98%. Therefore, the overall accuracy of RF is about 82%.
Overall, all three machine learning models showed a similar accuracy of about 82%. However, Logistic Regression is deemed the better model here because its output shows which features are significant, which allows for easier interpretation and implementation.
In conclusion, with the increasing number of deaths due to heart disease, it has become essential to develop a system that predicts heart disease effectively and accurately. The motivation for this study was to find the most efficient ML algorithm for detecting heart disease. This study compares the accuracy of the Logistic Regression, Naive Bayes and Random Forest algorithms for predicting heart disease using the UCI Machine Learning Repository dataset. The results indicate that Logistic Regression is the most suitable algorithm, with an accuracy of 82%. Its advantages are ease of interpretation and implementation, and the ability to identify the significant features in the model. In contrast, RF provides similar accuracy, but its interpretation and implementation are rather complex.
In future, the work can be enhanced by developing a web application based on the Logistic Regression algorithm and by using a larger dataset than the one used in this analysis, which should help provide better results and help health professionals predict heart disease effectively and efficiently.
heart3 <- heart
# Recode some categorical variables as factors for report generation
heart3$Num_Major_Vessels_Flouro = factor(heart3$Num_Major_Vessels_Flouro, levels = c("0","1","2","3"), labels = c(0,1,2,3))
heart3$Thalassemia = factor(heart3$Thalassemia, levels = c("3","6","7"), labels = c(3,6,7))
# Reporting
# Create_report in DataExplorer
# Pull a full data profile of your data frame.
# It will produce an html file with the basic statistics, structure, missing data,
# distribution visualizations, correlation matrix and principal component analysis for your data frame
DataExplorer::create_report(heart1)
## Output created: report.html