This database contains 76 attributes, but all published experiments refer to using a subset of 14 of them. In particular, the Cleveland database is the only one that has been used by ML researchers to this date. The “goal” field refers to the presence of heart disease in the patient. It is integer valued from 0 (no presence) to 4.
library(dplyr)
library(class)
library(caret)
library(stringr)
library(ggplot2)
library(caret)
library(MASS)heart <- read.csv("heart.csv")
heart <- heart %>%
mutate(age = as.numeric(ï..age),
cp = as.factor(cp),
sex = factor(sex, levels = c(0,1),
labels = c("Female", "Male")),
fbs = as.factor(fbs),
restecg = as.factor(restecg),
exang = as.factor(exang),
target = factor(target, levels = c(0,1),
labels = c("Healthy", "Unhealthy")))
heart <- heart[,-c(1)]
glimpse(heart)## Rows: 303
## Columns: 14
## $ sex <fct> Male, Male, Female, Male, Female, Male, Female, Male, Male, M~
## $ cp <fct> 3, 2, 1, 1, 0, 0, 1, 1, 2, 2, 0, 2, 1, 3, 3, 2, 2, 3, 0, 3, 0~
## $ trestbps <int> 145, 130, 130, 120, 120, 140, 140, 120, 172, 150, 140, 130, 1~
## $ chol <int> 233, 250, 204, 236, 354, 192, 294, 263, 199, 168, 239, 275, 2~
## $ fbs <fct> 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0~
## $ restecg <fct> 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1~
## $ thalach <int> 150, 187, 172, 178, 163, 148, 153, 173, 162, 174, 160, 139, 1~
## $ exang <fct> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0~
## $ oldpeak <dbl> 2.3, 3.5, 1.4, 0.8, 0.6, 0.4, 1.3, 0.0, 0.5, 1.6, 1.2, 0.2, 0~
## $ slope <int> 0, 0, 2, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 1, 2, 1, 2, 0, 2, 2, 1~
## $ ca <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0~
## $ thal <int> 1, 2, 2, 2, 2, 1, 2, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3~
## $ target <fct> Unhealthy, Unhealthy, Unhealthy, Unhealthy, Unhealthy, Unheal~
## $ age <dbl> 63, 37, 41, 56, 57, 57, 56, 44, 52, 57, 54, 48, 49, 64, 58, 5~
Here are some details regarding each columns:
age Patient’s Age
sex Patient’s Sex
cp chest pain type (4 values)
trestbpsresting blood pressure
chol cholesterol level in mg/dl
fbs fasting blood sugar > 120 mg/dl
restecg resting electrocardiographic results (values 0,1,2)
thalach maximum heart rate achieved
exang exercise induced angina
oldpeak ST depression induced by exercise relative to rest
slope the slope of the peak exercise ST segment
ca number of major vessels (0-3) colored by flourosopy
thal 3 = normal; 6 = fixed defect; 7 = reversable defect
target 0 = normal, 1 = sick
Let’s look at the proportion of target
prop.table(table(heart$target))##
## Healthy Unhealthy
## 0.4554455 0.5445545
It seems that the proportion between both classes are quite even, therefore further processing is not necessary.
The next step is to split the data into two data, Train & Test. The aim of data splitting is to use the Train data for modelling purpose, while the test data will be useful to test the model when it’s faced with unseen data. We are going to use 80% of the data to Train, and 20% to Test.
RNGkind(sample.kind = "Rounding")
set.seed(100)
intrain <- sample(nrow(heart), nrow(heart)*0.8)
heart_train <- heart[intrain,]
heart_test <- heart[-intrain,]
heart$target %>%
levels()## [1] "Healthy" "Unhealthy"
The first model we are going to make is GLM using the code glm().
model1 <- glm(formula = target ~ ., family = "binomial",
data = heart_train)
summary(model1)##
## Call:
## glm(formula = target ~ ., family = "binomial", data = heart_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7125 -0.3193 0.1277 0.4542 2.4473
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 5.707908 3.100004 1.841 0.065584 .
## sexMale -1.996535 0.575835 -3.467 0.000526 ***
## cp1 1.888420 0.623743 3.028 0.002465 **
## cp2 2.314759 0.539733 4.289 1.8e-05 ***
## cp3 3.397750 0.918005 3.701 0.000215 ***
## trestbps -0.021491 0.012403 -1.733 0.083133 .
## chol -0.002380 0.004463 -0.533 0.593851
## fbs1 0.268893 0.658713 0.408 0.683120
## restecg1 0.506083 0.433106 1.168 0.242606
## restecg2 -0.193956 2.500707 -0.078 0.938178
## thalach 0.008265 0.012098 0.683 0.494525
## exang1 -1.122501 0.504731 -2.224 0.026151 *
## oldpeak -0.803860 0.312516 -2.572 0.010105 *
## slope 0.489853 0.478451 1.024 0.305916
## ca -0.627754 0.227226 -2.763 0.005733 **
## [ reached getOption("max.print") -- omitted 2 rows ]
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 332.24 on 241 degrees of freedom
## Residual deviance: 152.40 on 225 degrees of freedom
## AIC: 186.4
##
## Number of Fisher Scoring iterations: 6
When we look at the first model, there are a lot of insignificant predictors, therefore we are going to make a new model, this time using Stepwise to automatically determine the best predictors to predict TARGET.
model2 <- stepAIC(model1, direction = "backward", trace = 0)Using backward method in Stepwise, we acquire model 2 as follows
summary(model2)##
## Call:
## glm(formula = target ~ sex + cp + trestbps + exang + oldpeak +
## ca + thal, family = "binomial", data = heart_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6763 -0.3321 0.1471 0.4664 2.2073
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 6.56578 1.79186 3.664 0.000248 ***
## sexMale -1.63453 0.51534 -3.172 0.001515 **
## cp1 1.91884 0.58847 3.261 0.001111 **
## cp2 2.49942 0.52805 4.733 2.21e-06 ***
## cp3 3.34330 0.88264 3.788 0.000152 ***
## trestbps -0.02009 0.01175 -1.709 0.087415 .
## exang1 -1.31695 0.47640 -2.764 0.005703 **
## oldpeak -1.09549 0.26419 -4.147 3.37e-05 ***
## ca -0.60635 0.21219 -2.858 0.004268 **
## thal -0.85413 0.33210 -2.572 0.010114 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 332.24 on 241 degrees of freedom
## Residual deviance: 157.09 on 232 degrees of freedom
## AIC: 177.09
##
## Number of Fisher Scoring iterations: 6
Using model2 from Stepwise, we are going to try to predict using the test data.
heart_test$prob_heart<-predict(model2, type = "response", newdata = heart_test)Let’s look at the distribution of our prediction
ggplot(heart_test, aes(x=prob_heart)) +
geom_density(lwd=0.5) +
labs(title = "Distribution of Prediction on Heart Attack Patient") +
theme_minimal()If we look at the plot above, the plot seems quite balanced, though the density on 0 seems higher. This means that more of the patients are predicted to be Healthy.
heart_test$pred_heart <- factor(ifelse(heart_test$prob_heart > 0.5, "Unhealthy","Healthy"))
heart_test[1:10, c("pred_heart", "target")]## pred_heart target
## 1 Unhealthy Unhealthy
## 2 Healthy Unhealthy
## 5 Unhealthy Unhealthy
## 10 Unhealthy Unhealthy
## 11 Healthy Unhealthy
## 12 Unhealthy Unhealthy
## 19 Healthy Unhealthy
## 25 Unhealthy Unhealthy
## 29 Unhealthy Unhealthy
## 30 Unhealthy Unhealthy
Using the Syntax above we classified that those that has a prob_heart of more than 0.5 are classified as Sick or Unhealthy.
We are going to use Confusion Matrix to evaluate our model.
confmatrix <- confusionMatrix(heart_test$pred_heart, heart_test$target, positive = "Unhealthy")
confmatrix## Confusion Matrix and Statistics
##
## Reference
## Prediction Healthy Unhealthy
## Healthy 24 8
## Unhealthy 7 22
##
## Accuracy : 0.7541
## 95% CI : (0.6271, 0.8554)
## No Information Rate : 0.5082
## P-Value [Acc > NIR] : 7.39e-05
##
## Kappa : 0.5078
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.7333
## Specificity : 0.7742
## Pos Pred Value : 0.7586
## Neg Pred Value : 0.7500
## Prevalence : 0.4918
## Detection Rate : 0.3607
## Detection Prevalence : 0.4754
## Balanced Accuracy : 0.7538
##
## 'Positive' Class : Unhealthy
##
Accuracy is the overall accuracy of the model.
\((TP+TN)/TOTAL\)
Recall is the model accuracy of predicting Positive.
\(TP/(TP+FN)\)
Precision is the model accuracy of predicting Positive class.
\(TP/(TP+FP)\)
Specificity is the model accuracy of predicting Negative.
\(TN/(TN+FP)\)
Accuracy <- round((24+22)/(24+8+7+22),2)
Recall <- round((22)/(22+8),2)
Precision <- round((22)/(22+8),2)
Specificity <- round((24)/(24+7),2)
performance <- cbind.data.frame(Accuracy, Recall, Precision, Specificity)
performance## Accuracy Recall Precision Specificity
## 1 0.75 0.73 0.73 0.77
Based on the result above, we can summarize that our model’s Overall Accuracy is 75%. The Model’s accuracy at predicting the Positive (Unhealthy) patients is 73%. While its accuracy at predicting Positive Class is 73%, and its accuracy at predicting Negative (Healthy) patients is at 77%.
First of all, let’s seperate the target variable and its predictor.
heart_train_x <- heart_train %>% select_if(is.numeric)
heart_test_x <- heart_test %>% select_if(is.numeric)
heart_test_x <- heart_test_x[,-c(9)]
heart_train_y <- heart_train[,13]
heart_test_y <- heart_test[,13]We are also going to scale the predictor using Z-Score Standardization method.
heart_train_xs <- scale(x = heart_train_x)
heart_test_xs <- scale(x = heart_test_x,
center = attr(heart_train_xs, "scaled:center"),
scale = attr(heart_train_xs, "scaled:scale"))Find the Optimum K.
round(sqrt(nrow(heart_train)))## [1] 16
Now, let’s make the KNN Model.
knn_model <- knn(train = heart_train_xs,
test = heart_test_xs,
cl = heart_train_y,
k = 16)
head(knn_model)## [1] Healthy Unhealthy Unhealthy Unhealthy Unhealthy Unhealthy
## Levels: Healthy Unhealthy
Evaluate the model using Confusion Matrix
confusionMatrix(data = knn_model,
reference = heart_test_y,
positive = "Unhealthy") ## Confusion Matrix and Statistics
##
## Reference
## Prediction Healthy Unhealthy
## Healthy 22 4
## Unhealthy 9 26
##
## Accuracy : 0.7869
## 95% CI : (0.6632, 0.8814)
## No Information Rate : 0.5082
## P-Value [Acc > NIR] : 6.823e-06
##
## Kappa : 0.5748
##
## Mcnemar's Test P-Value : 0.2673
##
## Sensitivity : 0.8667
## Specificity : 0.7097
## Pos Pred Value : 0.7429
## Neg Pred Value : 0.8462
## Prevalence : 0.4918
## Detection Rate : 0.4262
## Detection Prevalence : 0.5738
## Balanced Accuracy : 0.7882
##
## 'Positive' Class : Unhealthy
##
Accuracy2 <- round((26+22)/(22+4+9+26),2)
Recall2 <- round((26)/(26+4),2)
Precision2 <- round((26)/(26+9),2)
Specificity2 <- round((22)/(22+9),2)
knn_perf <- cbind.data.frame(Accuracy2, Recall2, Precision2, Specificity2)
knn_perf## Accuracy2 Recall2 Precision2 Specificity2
## 1 0.79 0.87 0.74 0.71
Based on the result above, we can summarize that our model’s Overall Accuracy is 79%. The Model’s accuracy at predicting the Positive (Unhealthy) patients is 87%. While its accuracy at predicting Positive Class is 74%, and its accuracy at predicting Negative (Healthy) patients is at 71%.
From the models above we can pretty much say that the model acquired from KNN are better. Although it has lower Specificity Accuracy, it has better accuracy in the other area. With that said, the selection of a model in this case should be chosen on the better need of the hospital (or doctors) after weighing the Pros and Cons.
Specificity (Where the KNN Model fall short) is the accuracy of predicting the negative target, in this case the Healthy patients. I believe identifying the Sick or Unhealthy patients are more important in this model, so the model that has better Precision should be chosen. Therefore, I would like to suggest the use of the Model acquired from KNN Model to be chosen over the Generalized Linear Model. Apart from higher Precision Accuracy, it has also better Overall Accuracy and Recall Accuracy.