1. The Data

1.1 Context

This database contains 76 attributes, but all published experiments refer to a subset of 14 of them. In particular, the Cleveland database is the only one that has been used by ML researchers to date. The “goal” field refers to the presence of heart disease in the patient and is integer valued from 0 (no presence) to 4.

1.2 Library Preparation

# dplyr   : data wrangling (mutate, glimpse)
# class   : knn()
# caret   : confusionMatrix() and other ML helpers
# stringr : string utilities
# ggplot2 : visualization
# MASS    : stepAIC() for stepwise model selection
library(dplyr)
library(class)
library(caret)
library(stringr)
library(ggplot2)
library(MASS)

2. Data Preparation

2.1 Read Data

heart <- read.csv("heart.csv")

# The first column name is read in as "ï..age" because of a UTF-8 byte-order
# mark in the csv file, so we copy it into a clean numeric "age" column and
# drop the original column afterwards. Categorical columns are converted to
# factors, and sex and target get descriptive labels.
heart <- heart %>% 
  mutate(age = as.numeric(ï..age),
         cp = as.factor(cp),
         sex = factor(sex, levels = c(0,1), 
                         labels = c("Female", "Male")),
         fbs = as.factor(fbs),
         restecg = as.factor(restecg),
         exang = as.factor(exang),
         target = factor(target, levels = c(0,1), 
                            labels = c("Healthy", "Unhealthy")))

# Drop the original BOM-named first column
heart <- heart[,-c(1)]
glimpse(heart)
## Rows: 303
## Columns: 14
## $ sex      <fct> Male, Male, Female, Male, Female, Male, Female, Male, Male, M~
## $ cp       <fct> 3, 2, 1, 1, 0, 0, 1, 1, 2, 2, 0, 2, 1, 3, 3, 2, 2, 3, 0, 3, 0~
## $ trestbps <int> 145, 130, 130, 120, 120, 140, 140, 120, 172, 150, 140, 130, 1~
## $ chol     <int> 233, 250, 204, 236, 354, 192, 294, 263, 199, 168, 239, 275, 2~
## $ fbs      <fct> 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0~
## $ restecg  <fct> 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1~
## $ thalach  <int> 150, 187, 172, 178, 163, 148, 153, 173, 162, 174, 160, 139, 1~
## $ exang    <fct> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0~
## $ oldpeak  <dbl> 2.3, 3.5, 1.4, 0.8, 0.6, 0.4, 1.3, 0.0, 0.5, 1.6, 1.2, 0.2, 0~
## $ slope    <int> 0, 0, 2, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 1, 2, 1, 2, 0, 2, 2, 1~
## $ ca       <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0~
## $ thal     <int> 1, 2, 2, 2, 2, 1, 2, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3~
## $ target   <fct> Unhealthy, Unhealthy, Unhealthy, Unhealthy, Unhealthy, Unheal~
## $ age      <dbl> 63, 37, 41, 56, 57, 57, 56, 44, 52, 57, 54, 48, 49, 64, 58, 5~

Here are some details regarding each column:

age: Patient’s Age

sex: Patient’s Sex

cp: chest pain type (4 values)

trestbps: resting blood pressure

chol: cholesterol level in mg/dl

fbs: fasting blood sugar > 120 mg/dl

restecg: resting electrocardiographic results (values 0, 1, 2)

thalach: maximum heart rate achieved

exang: exercise induced angina

oldpeak: ST depression induced by exercise relative to rest

slope: the slope of the peak exercise ST segment

ca: number of major vessels (0-3) colored by fluoroscopy

thal: 3 = normal; 6 = fixed defect; 7 = reversible defect

target: 0 = normal (labelled Healthy); 1 = sick (labelled Unhealthy)
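
Before moving on, it is also worth checking for missing values. This small check is an addition to the original write-up and uses only base R:

# Count missing values per column; all zeros means no imputation is needed
colSums(is.na(heart))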

2.2 Data Pre-processing

Let’s look at the proportion of target

prop.table(table(heart$target))
## 
##   Healthy Unhealthy 
## 0.4554455 0.5445545

The two classes are roughly balanced (about 46% Healthy vs 54% Unhealthy), so no resampling to correct class imbalance is necessary.
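
For reference only, if the classes had been heavily imbalanced, the training data could be rebalanced before modelling, for example with caret’s downSample(). A minimal sketch that is not applied here:

# Illustrative only: downSample() randomly drops rows of the majority class
# until both classes have equal counts; the labels are returned in a column
# named "Class"
heart_balanced <- downSample(x = heart[, setdiff(names(heart), "target")],
                             y = heart$target)
table(heart_balanced$Class)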

2.3 Data Splitting

The next step is to split the data into two sets: train and test. The train data is used to fit the models, while the test data is held out to evaluate how each model performs on unseen data. We are going to use 80% of the data for training and 20% for testing.

# sample.kind = "Rounding" reproduces the sampling behavior of R < 3.6.0
RNGkind(sample.kind = "Rounding")
set.seed(100)

# Randomly select 80% of the row indices for the training set
intrain <- sample(nrow(heart), nrow(heart)*0.8)
heart_train <- heart[intrain,]
heart_test <- heart[-intrain,]

# Check the factor levels of the target
heart$target %>% 
  levels()
## [1] "Healthy"   "Unhealthy"

3. Generalized Linear Model (GLM)

3.1 Model Creation

The first model is a logistic regression, fitted as a GLM with a binomial family using the glm() function.

model1 <- glm(formula = target ~ ., family = "binomial", 
             data = heart_train)
summary(model1)
## 
## Call:
## glm(formula = target ~ ., family = "binomial", data = heart_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7125  -0.3193   0.1277   0.4542   2.4473  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  5.707908   3.100004   1.841 0.065584 .  
## sexMale     -1.996535   0.575835  -3.467 0.000526 ***
## cp1          1.888420   0.623743   3.028 0.002465 ** 
## cp2          2.314759   0.539733   4.289  1.8e-05 ***
## cp3          3.397750   0.918005   3.701 0.000215 ***
## trestbps    -0.021491   0.012403  -1.733 0.083133 .  
## chol        -0.002380   0.004463  -0.533 0.593851    
## fbs1         0.268893   0.658713   0.408 0.683120    
## restecg1     0.506083   0.433106   1.168 0.242606    
## restecg2    -0.193956   2.500707  -0.078 0.938178    
## thalach      0.008265   0.012098   0.683 0.494525    
## exang1      -1.122501   0.504731  -2.224 0.026151 *  
## oldpeak     -0.803860   0.312516  -2.572 0.010105 *  
## slope        0.489853   0.478451   1.024 0.305916    
## ca          -0.627754   0.227226  -2.763 0.005733 ** 
##  [ reached getOption("max.print") -- omitted 2 rows ]
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 332.24  on 241  degrees of freedom
## Residual deviance: 152.40  on 225  degrees of freedom
## AIC: 186.4
## 
## Number of Fisher Scoring iterations: 6

3.2 Model Fitting

Looking at the first model, many predictors are not statistically significant, so we build a second model using stepwise selection (backward elimination based on AIC, via stepAIC() from MASS) to automatically choose a smaller set of predictors for target.

model2 <- stepAIC(model1, direction = "backward", trace = 0)

Using backward elimination, we obtain model2 as follows:

summary(model2)
## 
## Call:
## glm(formula = target ~ sex + cp + trestbps + exang + oldpeak + 
##     ca + thal, family = "binomial", data = heart_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6763  -0.3321   0.1471   0.4664   2.2073  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  6.56578    1.79186   3.664 0.000248 ***
## sexMale     -1.63453    0.51534  -3.172 0.001515 ** 
## cp1          1.91884    0.58847   3.261 0.001111 ** 
## cp2          2.49942    0.52805   4.733 2.21e-06 ***
## cp3          3.34330    0.88264   3.788 0.000152 ***
## trestbps    -0.02009    0.01175  -1.709 0.087415 .  
## exang1      -1.31695    0.47640  -2.764 0.005703 ** 
## oldpeak     -1.09549    0.26419  -4.147 3.37e-05 ***
## ca          -0.60635    0.21219  -2.858 0.004268 ** 
## thal        -0.85413    0.33210  -2.572 0.010114 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 332.24  on 241  degrees of freedom
## Residual deviance: 157.09  on 232  degrees of freedom
## AIC: 177.09
## 
## Number of Fisher Scoring iterations: 6
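
Since the coefficients above are on the log-odds scale, exponentiating them gives odds ratios, which are often easier to interpret. This is a small optional step added here for interpretation:

# Odds ratios for model2: values above 1 increase the odds of the Unhealthy
# class, values below 1 decrease them
round(exp(coef(model2)), 3)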

3.3 Prediction

Using model2 from the stepwise selection, we predict the probability of heart disease for the test data.

heart_test$prob_heart<-predict(model2, type = "response", newdata = heart_test)

Let’s look at the distribution of the predicted probabilities:

ggplot(heart_test, aes(x=prob_heart)) +
  geom_density(lwd=0.5) +
  labs(title = "Distribution of Prediction on Heart Attack Patient") +
  theme_minimal()

The distribution of predicted probabilities looks fairly balanced, though the density near 0 is somewhat higher, meaning slightly more patients are predicted to be Healthy.

heart_test$pred_heart <- factor(ifelse(heart_test$prob_heart > 0.5, "Unhealthy","Healthy"))
heart_test[1:10, c("pred_heart", "target")]
##    pred_heart    target
## 1   Unhealthy Unhealthy
## 2     Healthy Unhealthy
## 5   Unhealthy Unhealthy
## 10  Unhealthy Unhealthy
## 11    Healthy Unhealthy
## 12  Unhealthy Unhealthy
## 19    Healthy Unhealthy
## 25  Unhealthy Unhealthy
## 29  Unhealthy Unhealthy
## 30  Unhealthy Unhealthy

With the code above, patients whose prob_heart exceeds 0.5 are classified as Unhealthy; the rest are classified as Healthy.
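
The 0.5 cutoff is only a common default. As a quick illustrative check (not part of the original analysis), the test-set accuracy at a few alternative cutoffs can be computed like this:

# Accuracy of the GLM predictions at several probability cutoffs
cutoffs <- c(0.3, 0.4, 0.5, 0.6, 0.7)
sapply(cutoffs, function(k) {
  pred <- factor(ifelse(heart_test$prob_heart > k, "Unhealthy", "Healthy"),
                 levels = levels(heart_test$target))
  mean(pred == heart_test$target)
})

In practice the cutoff would be tuned on the training data rather than the test set, and the choice would also weigh the cost of false negatives against false positives.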

3.4 Model Evaluation

We are going to use a confusion matrix to evaluate the model.

confmatrix <- confusionMatrix(heart_test$pred_heart, heart_test$target, positive = "Unhealthy")
confmatrix
## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  Healthy Unhealthy
##   Healthy        24         8
##   Unhealthy       7        22
##                                           
##                Accuracy : 0.7541          
##                  95% CI : (0.6271, 0.8554)
##     No Information Rate : 0.5082          
##     P-Value [Acc > NIR] : 7.39e-05        
##                                           
##                   Kappa : 0.5078          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.7333          
##             Specificity : 0.7742          
##          Pos Pred Value : 0.7586          
##          Neg Pred Value : 0.7500          
##              Prevalence : 0.4918          
##          Detection Rate : 0.3607          
##    Detection Prevalence : 0.4754          
##       Balanced Accuracy : 0.7538          
##                                           
##        'Positive' Class : Unhealthy       
## 

Accuracy is the overall proportion of correct predictions.

\((TP+TN)/TOTAL\)

Recall (sensitivity) is the proportion of actual positives that the model correctly identifies.

\(TP/(TP+FN)\)

Precision (positive predictive value) is the proportion of predicted positives that are actually positive.

\(TP/(TP+FP)\)

Specificity is the proportion of actual negatives that the model correctly identifies.

\(TN/(TN+FP)\)

Accuracy <- round((24+22)/(24+8+7+22),2) 
Recall <- round((22)/(22+8),2) 
Precision <- round((22)/(22+7),2) 
Specificity <- round((24)/(24+7),2) 

performance <- cbind.data.frame(Accuracy, Recall, Precision, Specificity)
performance
##   Accuracy Recall Precision Specificity
## 1     0.75   0.73      0.76        0.77

Based on the result above, the model’s overall accuracy is 75%. Its recall on the Positive (Unhealthy) class is 73%, its precision is 76%, and its specificity on the Negative (Healthy) class is 77%.

4. K-Nearest Neighbour (KNN)

4.1 Model Creation

First of all, let’s separate the predictors from the target variable.

# Keep only the numeric columns as KNN predictors
heart_train_x <- heart_train %>% select_if(is.numeric)
heart_test_x <- heart_test %>%  select_if(is.numeric)

# Drop prob_heart (column 9), which was added to heart_test in section 3.3
heart_test_x <- heart_test_x[,-c(9)]

# Column 13 of the prepared data is the target
heart_train_y <- heart_train[,13]
heart_test_y <- heart_test[,13]

We are also going to scale the predictors using z-score standardization; the test set is scaled with the centers and standard deviations learned from the training set to avoid information leakage.

heart_train_xs <- scale(x = heart_train_x)
heart_test_xs <- scale(x = heart_test_x,
                     center = attr(heart_train_xs, "scaled:center"),
                     scale = attr(heart_train_xs, "scaled:scale"))
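
As a quick sanity check, the scaled training predictors should now have means of approximately 0 and standard deviations of approximately 1:

# Means should be ~0 and standard deviations ~1 after z-score scaling
round(colMeans(heart_train_xs), 3)
round(apply(heart_train_xs, 2, sd), 3)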

A common rule of thumb for choosing k is the square root of the number of training observations:

round(sqrt(nrow(heart_train)))
## [1] 16
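
The square-root rule is only a heuristic. As an alternative that is not used in the original analysis, k could instead be tuned with 5-fold cross-validation on the training data using caret; a hedged sketch:

# Hypothetical tuning sketch: pick k by cross-validated accuracy
set.seed(100)
knn_cv <- train(x = heart_train_xs,
                y = heart_train_y,
                method = "knn",
                trControl = trainControl(method = "cv", number = 5),
                tuneGrid = data.frame(k = seq(3, 25, by = 2)))
knn_cv$bestTune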

Now, let’s make the KNN Model.

knn_model <- knn(train = heart_train_xs,
                 test = heart_test_xs, 
                 cl = heart_train_y, 
                 k = 16)

head(knn_model)
## [1] Healthy   Unhealthy Unhealthy Unhealthy Unhealthy Unhealthy
## Levels: Healthy Unhealthy

4.2 Model Evaluation

Evaluate the model using a confusion matrix:

confusionMatrix(data = knn_model, 
                reference = heart_test_y, 
                positive = "Unhealthy") 
## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  Healthy Unhealthy
##   Healthy        22         4
##   Unhealthy       9        26
##                                           
##                Accuracy : 0.7869          
##                  95% CI : (0.6632, 0.8814)
##     No Information Rate : 0.5082          
##     P-Value [Acc > NIR] : 6.823e-06       
##                                           
##                   Kappa : 0.5748          
##                                           
##  Mcnemar's Test P-Value : 0.2673          
##                                           
##             Sensitivity : 0.8667          
##             Specificity : 0.7097          
##          Pos Pred Value : 0.7429          
##          Neg Pred Value : 0.8462          
##              Prevalence : 0.4918          
##          Detection Rate : 0.4262          
##    Detection Prevalence : 0.5738          
##       Balanced Accuracy : 0.7882          
##                                           
##        'Positive' Class : Unhealthy       
## 

Accuracy2 <- round((26+22)/(22+4+9+26),2) 
Recall2 <- round((26)/(26+4),2) 
Precision2 <- round((26)/(26+9),2) 
Specificity2 <- round((22)/(22+9),2) 


knn_perf <- cbind.data.frame(Accuracy2, Recall2, Precision2, Specificity2)
knn_perf
##   Accuracy2 Recall2 Precision2 Specificity2
## 1      0.79    0.87       0.74         0.71

Based on the result above, the KNN model’s overall accuracy is 79%. Its recall on the Positive (Unhealthy) class is 87%, its precision is 74%, and its specificity on the Negative (Healthy) class is 71%.
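
To make the comparison easier, the metrics of the two models can be put side by side (a small addition to the write-up):

# Combine the GLM and KNN metrics into one comparison table
model_comparison <- data.frame(
  Model = c("GLM", "KNN"),
  Accuracy = c(Accuracy, Accuracy2),
  Recall = c(Recall, Recall2),
  Precision = c(Precision, Precision2),
  Specificity = c(Specificity, Specificity2))
model_comparison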

5. Conclusion

From the results above, the KNN model is arguably the better of the two. Although it has lower specificity and slightly lower precision than the GLM, it achieves better overall accuracy and recall. With that said, the final choice of model should depend on the needs of the hospital (or doctors) after weighing the pros and cons.

Specificity (where the KNN model falls short) is the accuracy of predicting the negative class, in this case the Healthy patients. I believe that correctly identifying the Sick (Unhealthy) patients is more important in this setting, so the model with the better recall should be chosen. Therefore, I suggest using the KNN model over the Generalized Linear Model: apart from its much higher recall, it also has better overall accuracy.