knitr::include_graphics("HeartDiseases.png")
Heart disease, also known as cardiovascular disease (CVD), is a generic term for conditions that affect the heart's blood vessels. Heart disease is the leading cause of death globally, and it typically results from fatty deposits building up in the arteries.
Typical heart disease symptoms include chest pain (angina), shortness of breath, and numbness, coldness, or weakness in the arms and legs.
Arrhythmia symptoms include fluttering in the chest, a racing heartbeat (tachycardia), a slow or weak heartbeat (bradycardia), and fainting.
Cardiovascular problems may also be marked by blue or pale gray skin.
Thickened heart muscle, termed cardiomyopathy, can result in bloating and breathlessness, among other symptoms.
High blood pressure, age, diabetes, poor diet or exercise habits, smoking, substance abuse, obesity, and chronic kidney disease also contribute to heart problems.
The heart disease dataset comes from the University of California, Irvine (UCI) Machine Learning Repository. It consists of 12 variables and 918 observations. The variables are described below:
Age : age of the patient [years]
Sex : sex of the patient
ChestPainType : chest pain type
RestingBP : resting blood pressure [mmHg]
Cholesterol : serum cholesterol [mg/dl]
FastingBS : fasting blood sugar
RestingECG : resting electrocardiogram results
MaxHR : maximum heart rate achieved [Numeric value between 60 and 202]
ExerciseAngina : exercise-induced angina
Oldpeak : ST depression induced by exercise relative to rest [numeric]
ST_Slope : the slope of the peak exercise ST segment
HeartDisease : output class [1: heart disease, 0: normal]
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(caret) #confusionmatrix
## Loading required package: ggplot2
## Loading required package: lattice
library(class) #knn
library(e1071) #naiveBayes
library(ROCR) #ROC
library(partykit) #Decision Tree
## Loading required package: grid
## Loading required package: libcoin
## Loading required package: mvtnorm
library(randomForest)
## randomForest 4.7-1
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
## The following object is masked from 'package:dplyr':
##
## combine
library(knitr)
heart <- read.csv("heart.csv")
str(heart)
## 'data.frame': 918 obs. of 12 variables:
## $ Age : int 40 49 37 48 54 39 45 54 37 48 ...
## $ Sex : chr "M" "F" "M" "F" ...
## $ ChestPainType : chr "ATA" "NAP" "ATA" "ASY" ...
## $ RestingBP : int 140 160 130 138 150 120 130 110 140 120 ...
## $ Cholesterol : int 289 180 283 214 195 339 237 208 207 284 ...
## $ FastingBS : int 0 0 0 0 0 0 0 0 0 0 ...
## $ RestingECG : chr "Normal" "Normal" "ST" "Normal" ...
## $ MaxHR : int 172 156 98 108 122 170 170 142 130 120 ...
## $ ExerciseAngina: chr "N" "N" "N" "Y" ...
## $ Oldpeak : num 0 1 0 1.5 0 0 0 0 1.5 0 ...
## $ ST_Slope : chr "Up" "Flat" "Up" "Flat" ...
## $ HeartDisease : int 0 1 0 1 0 0 0 0 1 0 ...
colSums(is.na(heart))
## Age Sex ChestPainType RestingBP Cholesterol
## 0 0 0 0 0
## FastingBS RestingECG MaxHR ExerciseAngina Oldpeak
## 0 0 0 0 0
## ST_Slope HeartDisease
## 0 0
heart_clean <- heart %>%
mutate_if(is.character, as.factor) %>%
mutate_at(vars(FastingBS,HeartDisease), as.factor)
str(heart_clean)
## 'data.frame': 918 obs. of 12 variables:
## $ Age : int 40 49 37 48 54 39 45 54 37 48 ...
## $ Sex : Factor w/ 2 levels "F","M": 2 1 2 1 2 2 1 2 2 1 ...
## $ ChestPainType : Factor w/ 4 levels "ASY","ATA","NAP",..: 2 3 2 1 3 3 2 2 1 2 ...
## $ RestingBP : int 140 160 130 138 150 120 130 110 140 120 ...
## $ Cholesterol : int 289 180 283 214 195 339 237 208 207 284 ...
## $ FastingBS : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ RestingECG : Factor w/ 3 levels "LVH","Normal",..: 2 2 3 2 2 2 2 2 2 2 ...
## $ MaxHR : int 172 156 98 108 122 170 170 142 130 120 ...
## $ ExerciseAngina: Factor w/ 2 levels "N","Y": 1 1 1 2 1 1 1 1 2 1 ...
## $ Oldpeak : num 0 1 0 1.5 0 0 0 0 1.5 0 ...
## $ ST_Slope : Factor w/ 3 levels "Down","Flat",..: 3 2 3 2 3 3 3 3 2 3 ...
## $ HeartDisease : Factor w/ 2 levels "0","1": 1 2 1 2 1 1 1 1 2 1 ...
summary(heart_clean)
## Age Sex ChestPainType RestingBP Cholesterol
## Min. :28.00 F:193 ASY:496 Min. : 0.0 Min. : 0.0
## 1st Qu.:47.00 M:725 ATA:173 1st Qu.:120.0 1st Qu.:173.2
## Median :54.00 NAP:203 Median :130.0 Median :223.0
## Mean :53.51 TA : 46 Mean :132.4 Mean :198.8
## 3rd Qu.:60.00 3rd Qu.:140.0 3rd Qu.:267.0
## Max. :77.00 Max. :200.0 Max. :603.0
## FastingBS RestingECG MaxHR ExerciseAngina Oldpeak
## 0:704 LVH :188 Min. : 60.0 N:547 Min. :-2.6000
## 1:214 Normal:552 1st Qu.:120.0 Y:371 1st Qu.: 0.0000
## ST :178 Median :138.0 Median : 0.6000
## Mean :136.8 Mean : 0.8874
## 3rd Qu.:156.0 3rd Qu.: 1.5000
## Max. :202.0 Max. : 6.2000
## ST_Slope HeartDisease
## Down: 63 0:410
## Flat:460 1:508
## Up :395
##
##
##
Split the data into train and test datasets: 80% of the data will be used as the train dataset, and 20% will be used as the test dataset.
RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(310)
ind_heart <- sample(x = nrow(heart_clean), size = nrow(heart_clean)*0.8)
heart_train <- heart_clean[ind_heart,]
heart_test <- heart_clean[-ind_heart,]
heart_train$HeartDisease %>%
table() %>%
prop.table()
## .
## 0 1
## 0.4509537 0.5490463
The positive class makes up 55% of the train dataset and the negative class 45%. Although the classes are slightly imbalanced, the proportions are within an acceptable range, so there is no need to upsample or downsample.
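For reference, if the imbalance were more severe, the training data could be rebalanced before modelling. A minimal sketch using caret's upSample() is shown below; it is illustrative only, is not used in the models that follow, and heart_train_up is a name introduced here.
# Illustrative only: upsample the minority class to a 50/50 split
heart_train_up <- upSample(x = heart_train %>% select(-HeartDisease),
                           y = heart_train$HeartDisease,
                           yname = "HeartDisease")
prop.table(table(heart_train_up$HeartDisease))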
Build a baseline logistic regression model with all variables; its AIC is 513.56.
hmodel_logreg <- glm(formula = HeartDisease ~ . , data = heart_train, family = "binomial")
summary(hmodel_logreg)
##
## Call:
## glm(formula = HeartDisease ~ ., family = "binomial", data = heart_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5501 -0.3742 0.1803 0.4480 2.6262
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.605788 1.559330 -1.030 0.303107
## Age 0.019757 0.015125 1.306 0.191457
## SexM 1.478898 0.310109 4.769 1.85e-06 ***
## ChestPainTypeATA -1.946743 0.366836 -5.307 1.12e-07 ***
## ChestPainTypeNAP -1.607122 0.294460 -5.458 4.82e-08 ***
## ChestPainTypeTA -1.296979 0.464342 -2.793 0.005220 **
## RestingBP 0.007600 0.006742 1.127 0.259637
## Cholesterol -0.003811 0.001231 -3.096 0.001962 **
## FastingBS1 1.093455 0.303533 3.602 0.000315 ***
## RestingECGNormal -0.113511 0.298322 -0.380 0.703576
## RestingECGST -0.104270 0.385371 -0.271 0.786722
## MaxHR -0.007241 0.005440 -1.331 0.183137
## ExerciseAnginaY 1.002417 0.268274 3.737 0.000187 ***
## Oldpeak 0.256180 0.135863 1.886 0.059353 .
## ST_SlopeFlat 1.484516 0.462843 3.207 0.001339 **
## ST_SlopeUp -0.820405 0.496396 -1.653 0.098387 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1010.47 on 733 degrees of freedom
## Residual deviance: 481.56 on 718 degrees of freedom
## AIC: 513.56
##
## Number of Fisher Scoring iterations: 5
Tune the baseline logistic regression model using backward stepwise feature selection. The resulting AIC is slightly better at 508.7 (compared to 513.56), so we will use the model produced by feature selection.
hmodel_logreg_back <- step(object = hmodel_logreg, direction = "backward", trace = 0)
summary(hmodel_logreg_back)
##
## Call:
## glm(formula = HeartDisease ~ Age + Sex + ChestPainType + Cholesterol +
## FastingBS + ExerciseAngina + Oldpeak + ST_Slope, family = "binomial",
## data = heart_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5332 -0.3741 0.1850 0.4566 2.6337
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.275175 0.983254 -2.314 0.020672 *
## Age 0.030517 0.013807 2.210 0.027087 *
## SexM 1.479617 0.306649 4.825 1.40e-06 ***
## ChestPainTypeATA -1.958143 0.361297 -5.420 5.97e-08 ***
## ChestPainTypeNAP -1.675442 0.291650 -5.745 9.21e-09 ***
## ChestPainTypeTA -1.317708 0.465893 -2.828 0.004679 **
## Cholesterol -0.003805 0.001160 -3.279 0.001041 **
## FastingBS1 1.095335 0.300604 3.644 0.000269 ***
## ExerciseAnginaY 1.094505 0.260487 4.202 2.65e-05 ***
## Oldpeak 0.247598 0.133228 1.858 0.063104 .
## ST_SlopeFlat 1.497483 0.462311 3.239 0.001199 **
## ST_SlopeUp -0.854692 0.492659 -1.735 0.082766 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1010.5 on 733 degrees of freedom
## Residual deviance: 484.7 on 722 degrees of freedom
## AIC: 508.7
##
## Number of Fisher Scoring iterations: 5
Predict the test dataset using the Logistic Regression Model
hpred_logreg_back <- predict(object = hmodel_logreg_back, newdata = heart_test, type = "response")
# convert predicted probabilities into class labels using a 0.5 threshold
hpred_logreg_back_label <- as.factor(ifelse(hpred_logreg_back > 0.5, "1", "0"))
Model evaluation using a confusion matrix: the model's accuracy is 88%, recall/sensitivity is 88%, and precision/positive predictive value is 91%.
conf_logreg <- confusionMatrix(data = hpred_logreg_back_label, reference = heart_test$HeartDisease, positive = "1")
conf_logreg
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 70 13
## 1 9 92
##
## Accuracy : 0.8804
## 95% CI : (0.8246, 0.9235)
## No Information Rate : 0.5707
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.7575
##
## Mcnemar's Test P-Value : 0.5224
##
## Sensitivity : 0.8762
## Specificity : 0.8861
## Pos Pred Value : 0.9109
## Neg Pred Value : 0.8434
## Prevalence : 0.5707
## Detection Rate : 0.5000
## Detection Prevalence : 0.5489
## Balanced Accuracy : 0.8811
##
## 'Positive' Class : 1
##
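For comparison with the later models, the same ROCR workflow could also be applied to the logistic regression probabilities. A minimal sketch, reusing hpred_logreg_back from above (hpred_logreg_roc is a name introduced here; this step was not part of the original analysis):
hpred_logreg_roc <- prediction(predictions = hpred_logreg_back,
                               labels = as.numeric(heart_test$HeartDisease == "1"))
plot(performance(prediction.obj = hpred_logreg_roc,
                 measure = "tpr",
                 x.measure = "fpr"))
performance(prediction.obj = hpred_logreg_roc, measure = "auc")@y.values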
Split the predictors (X) and the target (Y) from the train and test datasets. Only the numeric predictors are kept, since KNN works with numeric features.
heart_train_x <- heart_train %>%
select_if(is.numeric)
heart_train_y <- heart_train %>%
select(HeartDisease)
heart_test_x <- heart_test %>%
select_if(is.numeric)
heart_test_y <- heart_test %>%
select(HeartDisease)
Scale the train predictors, as they are not on the same scale, then scale the test predictors using the centering and scaling values from the train data.
heart_train_x_scale <- scale(heart_train_x)
heart_test_x_scale <- scale(heart_test_x,
center = attr(heart_train_x_scale,"scaled:center"),
scale = attr(heart_train_x_scale,"scaled:scale"))
Define the K value as roughly the square root of the number of training observations.
heart_train_x %>%
nrow() %>%
sqrt()
## [1] 27.09243
Use K = 27 (rounded from 27.09; an odd K also avoids ties in binary classification).
Predict the test dataset using the KNN Model
hpred_knn <- knn(train = heart_train_x_scale,
test = heart_test_x_scale,
cl = heart_train_y$HeartDisease,
k = 27)
Model evaluation using a confusion matrix: the model's accuracy is 82%, recall/sensitivity is 79%, and precision/positive predictive value is 88%.
conf_knn <- confusionMatrix(data = hpred_knn, reference = heart_test_y$HeartDisease, positive = "1")
conf_knn
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 68 22
## 1 11 83
##
## Accuracy : 0.8207
## 95% CI : (0.7575, 0.8732)
## No Information Rate : 0.5707
## P-Value [Acc > NIR] : 5.171e-13
##
## Kappa : 0.6402
##
## Mcnemar's Test P-Value : 0.08172
##
## Sensitivity : 0.7905
## Specificity : 0.8608
## Pos Pred Value : 0.8830
## Neg Pred Value : 0.7556
## Prevalence : 0.5707
## Detection Rate : 0.4511
## Detection Prevalence : 0.5109
## Balanced Accuracy : 0.8256
##
## 'Positive' Class : 1
##
Build a model with Naive Bayes, using Laplace smoothing (laplace = 1).
hmodel_NB <- naiveBayes(formula = HeartDisease ~ . , data = heart_train, laplace = 1)
Predict the test dataset using the Naive Bayes Model
hpred_NB <- predict(object = hmodel_NB, newdata = heart_test, type = "class")
Model evaluation using a confusion matrix: the model's accuracy is 88%, recall/sensitivity is 87%, and precision/positive predictive value is 91%.
conf_NB <- confusionMatrix(data = hpred_NB, reference = heart_test$HeartDisease, positive = "1")
conf_NB
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 70 14
## 1 9 91
##
## Accuracy : 0.875
## 95% CI : (0.8184, 0.9191)
## No Information Rate : 0.5707
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.7469
##
## Mcnemar's Test P-Value : 0.4042
##
## Sensitivity : 0.8667
## Specificity : 0.8861
## Pos Pred Value : 0.9100
## Neg Pred Value : 0.8333
## Prevalence : 0.5707
## Detection Rate : 0.4946
## Detection Prevalence : 0.5435
## Balanced Accuracy : 0.8764
##
## 'Positive' Class : 1
##
Receiver Operating Characteristic (ROC) curve: plot the True Positive Rate (sensitivity or recall) against the False Positive Rate (1 - specificity). The resulting curve rises steeply and then flattens near the top-left corner, which indicates a good classifier.
hpred_NBprob <- predict(object = hmodel_NB, newdata = heart_test, type = "raw")
hpred_NB_roc <- prediction(predictions = hpred_NBprob[,2],
labels = as.numeric(heart_test$HeartDisease == "1"))
plot(performance(prediction.obj = hpred_NB_roc,
measure = "tpr",
x.measure = "fpr"))
Area Under ROC Curve (AUC) is 0.93.
hpred_NB_auc <- performance(prediction.obj = hpred_NB_roc, measure = "auc")
hpred_NB_auc@y.values
## [[1]]
## [1] 0.9320072
Build a model with a Decision Tree; the plot shows that ST_Slope is the root node.
hmodel_DT <- ctree(formula = HeartDisease ~ . , data = heart_train)
plot(hmodel_DT, type = "simple")
Predict the test dataset using the Decision Tree model
hpred_DT <- predict(object = hmodel_DT, newdata = heart_test, type = "response")
Model evaluation using a confusion matrix: the model's accuracy is 88%, recall/sensitivity is 90%, and precision/positive predictive value is 89%.
conf_DT <- confusionMatrix(data = hpred_DT, reference = heart_test$HeartDisease, positive = "1")
conf_DT
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 67 10
## 1 12 95
##
## Accuracy : 0.8804
## 95% CI : (0.8246, 0.9235)
## No Information Rate : 0.5707
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.7552
##
## Mcnemar's Test P-Value : 0.8312
##
## Sensitivity : 0.9048
## Specificity : 0.8481
## Pos Pred Value : 0.8879
## Neg Pred Value : 0.8701
## Prevalence : 0.5707
## Detection Rate : 0.5163
## Detection Prevalence : 0.5815
## Balanced Accuracy : 0.8764
##
## 'Positive' Class : 1
##
Receiver Operating Characteristic (ROC) curve: plot the True Positive Rate (sensitivity or recall) against the False Positive Rate (1 - specificity). The resulting curve rises steeply and then flattens near the top-left corner, which indicates a good classifier.
hpred_DTprob <- predict(object = hmodel_DT, newdata = heart_test, type = "prob")
hpred_DT_roc <- prediction(predictions = hpred_DTprob[,2],
labels = as.numeric(heart_test$HeartDisease == "1"))
plot(performance(prediction.obj = hpred_DT_roc,
measure = "tpr",
x.measure = "fpr"))
Area Under ROC Curve (AUC) is 0.91.
hpred_DT_auc <- performance(prediction.obj = hpred_DT_roc, measure = "auc")
hpred_DT_auc@y.values
## [[1]]
## [1] 0.9136227
Build a Random Forest model with k-fold cross-validation (k = 25), repeated 10 times, and save the model for future use.
set.seed(310)
control <- trainControl(method = "repeatedcv", number = 25, repeats = 10)
# hmodel_RF <- train(form = HeartDisease ~ . , data = heart_train,
# method = "rf", trControl = control)
#
# saveRDS(hmodel_RF, "hmodel_RF.RDS")
Each resampling fold trains on 705 observations. The final, optimal mtry is 2, and the out-of-bag (OOB) error estimate is about 13%, i.e. the model is roughly 87% accurate on the out-of-bag observations.
hmodel_RF <- readRDS("hmodel_RF.RDS")
hmodel_RF
## Random Forest
##
## 734 samples
## 11 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (25 fold, repeated 10 times)
## Summary of sample sizes: 705, 705, 705, 705, 705, 705, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.8674846 0.7303772
## 8 0.8534797 0.7028073
## 15 0.8368805 0.6696213
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
hmodel_RF$finalModel
##
## Call:
## randomForest(x = x, y = y, mtry = min(param$mtry, ncol(x)))
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 13.22%
## Confusion matrix:
## 0 1 class.error
## 0 274 57 0.17220544
## 1 40 363 0.09925558
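As a quick check of the OOB figure, the error rate can be recomputed from the confusion matrix printed above (a sketch using the counts shown there):
# misclassified out-of-bag predictions divided by total training rows
(57 + 40) / (274 + 57 + 40 + 363) # ~0.132, i.e. ~13% OOB error, ~87% accuracy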
The most important variable is ST_Slope, the same as in the Decision Tree model.
plot(varImp(hmodel_RF))
Predict the test dataset using the Random Forest model
hpred_RF <- predict(object = hmodel_RF, newdata = heart_test)
Model evaluation using a confusion matrix: the model's accuracy is 91%, recall/sensitivity is 91%, and precision/positive predictive value is 93%.
conf_RF <- confusionMatrix(data = hpred_RF, reference = heart_test$HeartDisease, positive = "1")
conf_RF
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 72 9
## 1 7 96
##
## Accuracy : 0.913
## 95% CI : (0.8626, 0.9495)
## No Information Rate : 0.5707
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.8231
##
## Mcnemar's Test P-Value : 0.8026
##
## Sensitivity : 0.9143
## Specificity : 0.9114
## Pos Pred Value : 0.9320
## Neg Pred Value : 0.8889
## Prevalence : 0.5707
## Detection Rate : 0.5217
## Detection Prevalence : 0.5598
## Balanced Accuracy : 0.9128
##
## 'Positive' Class : 1
##
Receiver Operating Characteristic (ROC) curve: plot the True Positive Rate (sensitivity or recall) against the False Positive Rate (1 - specificity). The resulting curve rises steeply and then flattens near the top-left corner, which indicates a good classifier.
hpred_RFprob <- predict(object = hmodel_RF, newdata = heart_test, type = "prob")
hpred_RF_roc <- prediction(predictions = hpred_RFprob[,2],
labels = as.numeric(heart_test$HeartDisease == "1"))
plot(performance(prediction.obj = hpred_RF_roc,
measure = "tpr",
x.measure = "fpr"))
Area Under ROC Curve (AUC) is 0.94.
hpred_RF_auc <- performance(prediction.obj = hpred_RF_roc, measure = "auc")
hpred_RF_auc@y.values
## [[1]]
## [1] 0.9393611
Summarize the confusion matrices of all the models. On this heart disease dataset, the best model for predicting heart disease is the Random Forest, with 91% accuracy, 91% recall/sensitivity, and 93% precision/positive predictive value.
conf_list <- list ("Logistic Regression" = conf_logreg,
"K-Nearest Neighbor" = conf_knn,
"Naive Bayes" = conf_NB,
"Decision Tree" = conf_DT,
"Random Forest" = conf_RF)
conf_results <- sapply(conf_list, function(x) x$byClass)
options(digits = 2)
results_table <- kable(conf_results)
results_table
|  | Logistic Regression | K-Nearest Neighbor | Naive Bayes | Decision Tree | Random Forest |
|---|---|---|---|---|---|
| Sensitivity | 0.88 | 0.79 | 0.87 | 0.90 | 0.91 |
| Specificity | 0.89 | 0.86 | 0.89 | 0.85 | 0.91 |
| Pos Pred Value | 0.91 | 0.88 | 0.91 | 0.89 | 0.93 |
| Neg Pred Value | 0.84 | 0.76 | 0.83 | 0.87 | 0.89 |
| Precision | 0.91 | 0.88 | 0.91 | 0.89 | 0.93 |
| Recall | 0.88 | 0.79 | 0.87 | 0.90 | 0.91 |
| F1 | 0.89 | 0.83 | 0.89 | 0.90 | 0.92 |
| Prevalence | 0.57 | 0.57 | 0.57 | 0.57 | 0.57 |
| Detection Rate | 0.50 | 0.45 | 0.49 | 0.52 | 0.52 |
| Detection Prevalence | 0.55 | 0.51 | 0.54 | 0.58 | 0.56 |
| Balanced Accuracy | 0.88 | 0.83 | 0.88 | 0.88 | 0.91 |
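The table above reports only the per-class (byClass) metrics; overall accuracy can be pulled from the same confusion-matrix objects in the same way. A minimal sketch (conf_accuracy is a name introduced here):
conf_accuracy <- sapply(conf_list, function(x) x$overall[["Accuracy"]])
kable(t(conf_accuracy))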