Introduction

The WHO states that a characteristic of terminal illness is its slow development in the host. Currently, heart disease is the deadliest disease in the world.

As of 2015, of the 56.5 million deaths reported, 68% were caused by chronic, long-term diseases, and heart disease was the leading cause among them.

As a data scientist, I will try to predict heart disease in patients using several predictor variables.

For the analysis, I will use Logistic Regression and K-Nearest Neighbor to build models that predict whether a patient has heart disease or not. Both of these models belong to the supervised learning family.

Attaching Necessary Library

We will attach the libraries used throughout this analysis: dplyr, gtools, gmodels, ggplot2, class, tidyr, lattice, caret, and rmdformats.

library(dplyr)
library(gtools)
library(gmodels)
library(ggplot2)
library(class)
library(tidyr)
library(lattice)
library(caret)
library(rmdformats)

Logistic regression

Data Import

We will use a dataset that records information about patients with heart disease. The dataset can be downloaded from Kaggle.

heart <- read.csv("heart.csv")
# Overview of Data

glimpse(heart)
## Rows: 303
## Columns: 14
## $ ï..age   <int> 63, 37, 41, 56, 57, 57, 56, 44, 52, 57, 54, 48, 49, 64, 58...
## $ sex      <int> 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0...
## $ cp       <int> 3, 2, 1, 1, 0, 0, 1, 1, 2, 2, 0, 2, 1, 3, 3, 2, 2, 3, 0, 3...
## $ trestbps <int> 145, 130, 130, 120, 120, 140, 140, 120, 172, 150, 140, 130...
## $ chol     <int> 233, 250, 204, 236, 354, 192, 294, 263, 199, 168, 239, 275...
## $ fbs      <int> 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0...
## $ restecg  <int> 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1...
## $ thalach  <int> 150, 187, 172, 178, 163, 148, 153, 173, 162, 174, 160, 139...
## $ exang    <int> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0...
## $ oldpeak  <dbl> 2.3, 3.5, 1.4, 0.8, 0.6, 0.4, 1.3, 0.0, 0.5, 1.6, 1.2, 0.2...
## $ slope    <int> 0, 0, 2, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 1, 2, 1, 2, 0, 2, 2...
## $ ca       <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2...
## $ thal     <int> 1, 2, 2, 2, 2, 1, 2, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2...
## $ target   <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...

Based on the data above, here is the information recorded for each variable:

  • ï..age : age of the respondent
  • sex : 1 = male; 0 = female
  • cp : type / severity of chest pain
  • trestbps : resting blood pressure (mm Hg)
  • chol : cholesterol (mg/dl)
  • fbs : fasting blood sugar, 1 = above 120 mg/dl; 0 = below 120 mg/dl
  • restecg : resting electrocardiograph result
  • thalach : maximum heart rate recorded
  • exang : exercise-induced angina, 1 = yes; 0 = no
  • oldpeak : ST depression induced by exercise relative to rest
  • slope : slope of the ST segment
  • ca : number of major blood vessels
  • thal : 3 = normal, 6 = permanent (fixed) defect, 7 = reversible defect
  • target : 1 = sick; 0 = not sick

Here is an overview of the data that we will use:

head(heart, 3)
##   ï..age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal
## 1     63   1  3      145  233   1       0     150     0     2.3     0  0    1
## 2     37   1  2      130  250   0       1     187     0     3.5     0  0    2
## 3     41   0  1      130  204   0       0     172     0     1.4     2  0    2
##   target
## 1      1
## 2      1
## 3      1

Data Wrangling

Some of the predictor variables do not have the correct data type, so we will adjust their types.

heart <- heart %>%
  #mutate_if(is.integer, as.factor) %>%
  mutate(cp = as.factor(cp),
         restecg = as.factor(restecg),
         slope = as.factor(slope),
         ca = as.factor(ca),
         thal = as.factor(thal),
         sex = factor(sex, levels = c(0,1), labels = c("female", "male")),
         fbs = factor(fbs, levels = c(0,1), labels = c("False", "True")),
         exang = factor(exang, levels = c(0,1), labels = c("No", "Yes")),
         target = factor(target, levels = c(0,1), labels = c("Health", "Not Health")))

glimpse(heart)
## Rows: 303
## Columns: 14
## $ ï..age   <int> 63, 37, 41, 56, 57, 57, 56, 44, 52, 57, 54, 48, 49, 64, 58...
## $ sex      <fct> male, male, female, male, female, male, female, male, male...
## $ cp       <fct> 3, 2, 1, 1, 0, 0, 1, 1, 2, 2, 0, 2, 1, 3, 3, 2, 2, 3, 0, 3...
## $ trestbps <int> 145, 130, 130, 120, 120, 140, 140, 120, 172, 150, 140, 130...
## $ chol     <int> 233, 250, 204, 236, 354, 192, 294, 263, 199, 168, 239, 275...
## $ fbs      <fct> True, False, False, False, False, False, False, False, Tru...
## $ restecg  <fct> 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1...
## $ thalach  <int> 150, 187, 172, 178, 163, 148, 153, 173, 162, 174, 160, 139...
## $ exang    <fct> No, No, No, No, Yes, No, No, No, No, No, No, No, No, Yes, ...
## $ oldpeak  <dbl> 2.3, 3.5, 1.4, 0.8, 0.6, 0.4, 1.3, 0.0, 0.5, 1.6, 1.2, 0.2...
## $ slope    <fct> 0, 0, 2, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 1, 2, 1, 2, 0, 2, 2...
## $ ca       <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2...
## $ thal     <fct> 1, 2, 2, 2, 2, 1, 2, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2...
## $ target   <fct> Not Health, Not Health, Not Health, Not Health, Not Health...

Next, we will check each variable for missing values.

colSums(is.na(heart))
##   ï..age      sex       cp trestbps     chol      fbs  restecg  thalach 
##        0        0        0        0        0        0        0        0 
##    exang  oldpeak    slope       ca     thal   target 
##        0        0        0        0        0        0

We can see that there are no missing values in any of the variables, so we do not need to drop any of them from the analysis.

Data Pre-Processing

Before we create our model, we check the class proportions of the target variable using prop.table().

These are the proportions of the target classes:

prop.table(table(heart$target))
## 
##     Health Not Health 
##  0.4554455  0.5445545

And these are the actual counts of each target class:

table(heart$target)
## 
##     Health Not Health 
##        138        165

We can see that the class proportions are balanced enough, so we do not need to balance the classes with down-sampling or up-sampling.
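
If the classes had been heavily skewed, caret provides helpers for balancing. A minimal sketch of how that could look (not needed, and not used further, for this dataset):

# Balance the classes by down-sampling the majority class or up-sampling the minority class
heart_down <- downSample(x = heart %>% select(-target), y = heart$target, yname = "target")
heart_up   <- upSample(x = heart %>% select(-target), y = heart$target, yname = "target")

# Both results have perfectly balanced target proportions
prop.table(table(heart_down$target))
prop.table(table(heart_up$target))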

Cross Validation

The next step of our analysis is to split the data into train and test sets. We will use the train data to train our model and the test data to validate how the model handles unseen data.

set.seed(100)
index <- sample(nrow(heart), nrow(heart)*0.7)

# Data train
train_heart <- heart[index,]

# Data test
test_heart <- heart[-index,]

Next, we check whether the class proportions in the train data are still balanced enough to train the model; this is needed so the model does not end up biased toward one class.

prop.table(table(train_heart$target))
## 
##     Health Not Health 
##  0.4575472  0.5424528
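
The proportions in the train data are close to those of the full dataset, so the random split is acceptable. As an alternative, caret's createDataPartition() performs a stratified split that preserves the target proportions by construction; a minimal sketch (not used further here):

set.seed(100)

# Stratified split: sample 70% of the rows within each target class
idx_strat <- createDataPartition(heart$target, p = 0.7, list = FALSE)
train_strat <- heart[idx_strat, ]
test_strat  <- heart[-idx_strat, ]

prop.table(table(train_strat$target))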

Modelling

We will create a model based on our train data, using several variables that may have a significant effect on the target variable: sex, cp, fbs, and thal.

Then we will also use the stepwise method to see if we can get a better model than our previous one.

# Create a model
model1 <- glm(formula = target ~ sex + cp +  fbs + thal, family = "binomial", data = train_heart)

# Model summary
summary(model1)
## 
## Call:
## glm(formula = target ~ sex + cp + fbs + thal, family = "binomial", 
##     data = train_heart)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5494  -0.6195   0.2813   0.6259   2.1699  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.42390    1.89429  -0.224 0.822931    
## sexmale     -1.00808    0.46155  -2.184 0.028955 *  
## cp1          2.63834    0.58159   4.536 5.72e-06 ***
## cp2          1.96708    0.43148   4.559 5.14e-06 ***
## cp3          2.65973    0.75439   3.526 0.000422 ***
## fbsTrue     -0.11120    0.55873  -0.199 0.842244    
## thal1       -0.01373    1.99362  -0.007 0.994505    
## thal2        0.99582    1.89198   0.526 0.598653    
## thal3       -0.71130    1.89910  -0.375 0.708000    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 292.36  on 211  degrees of freedom
## Residual deviance: 188.77  on 203  degrees of freedom
## AIC: 206.77
## 
## Number of Fisher Scoring iterations: 5

As we can see, only a few of the variables are significant in this model. Since we have not yet tried other variables that may affect the target, we will use the stepwise method to search for a better model than this first one.

# Create a model without predictor
model_none <- glm(target ~ 1, family = "binomial", data = train_heart)

# Create a model with all predictor
model_all <- glm(target ~ ., family = "binomial", data = train_heart)
# Stepwise regression backward
model_back <- step(object = model_all, direction = "backward", trace = F)

# Stepwise regression forward
# (note: because the starting object is model_all, forward selection has nothing left to add,
#  so model_forw keeps every predictor)
model_forw <- step(object = model_all, scope = list(lower = model_none, upper = model_all), direction = "forward", trace = F)

# Stepwise regression both
model_both <- step(object = model_all, scope = list(lower = model_none, upper = model_all), direction = "both", trace = F)

Next, we will look at the summary of each of the models created above.

We will compare them based on:

  1. AIC, the amount of information lost by the model; a lower AIC indicates a better model.
  2. Residual deviance, the error of the model once predictors are included; a lower residual deviance indicates a better fit.
# Model Summary

# Backward

summary(model_back)
## 
## Call:
## glm(formula = target ~ sex + cp + trestbps + restecg + slope + 
##     ca + thal, family = "binomial", data = train_heart)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -3.00360  -0.31563   0.08544   0.38932   2.80515  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  1.14771    3.54773   0.324 0.746311    
## sexmale     -1.94702    0.66841  -2.913 0.003581 ** 
## cp1          1.82245    0.68720   2.652 0.008002 ** 
## cp2          2.16193    0.57076   3.788 0.000152 ***
## cp3          3.67871    0.96664   3.806 0.000141 ***
## trestbps    -0.02357    0.01392  -1.693 0.090392 .  
## restecg1     0.89406    0.48413   1.847 0.064786 .  
## restecg2    -1.34017    1.93060  -0.694 0.487574    
## slope1       0.33531    1.02241   0.328 0.742943    
## slope2       2.29316    1.05204   2.180 0.029278 *  
## ca1         -2.24324    0.58867  -3.811 0.000139 ***
## ca2         -3.54535    0.95429  -3.715 0.000203 ***
## ca3         -1.84653    0.91095  -2.027 0.042660 *  
## ca4          1.23995    1.87483   0.661 0.508379    
## thal1        2.48179    2.93285   0.846 0.397440    
## thal2        2.49503    2.81405   0.887 0.375277    
## thal3        0.91494    2.81824   0.325 0.745447    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 292.36  on 211  degrees of freedom
## Residual deviance: 127.71  on 195  degrees of freedom
## AIC: 161.71
## 
## Number of Fisher Scoring iterations: 6
# Forward

summary(model_forw)
## 
## Call:
## glm(formula = target ~ ï..age + sex + cp + trestbps + chol + 
##     fbs + restecg + thalach + exang + oldpeak + slope + ca + 
##     thal, family = "binomial", data = train_heart)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.97383  -0.26792   0.09287   0.42050   2.90599  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.698162   4.308864   0.162 0.871283    
## ï..age       0.037131   0.031962   1.162 0.245360    
## sexmale     -2.024142   0.722806  -2.800 0.005104 ** 
## cp1          1.269686   0.744520   1.705 0.088125 .  
## cp2          1.793341   0.615266   2.915 0.003560 ** 
## cp3          3.422971   1.031057   3.320 0.000901 ***
## trestbps    -0.027117   0.015591  -1.739 0.081984 .  
## chol        -0.005909   0.006067  -0.974 0.330050    
## fbsTrue      0.255722   0.763957   0.335 0.737826    
## restecg1     0.945576   0.502120   1.883 0.059678 .  
## restecg2    -0.960831   3.051418  -0.315 0.752853    
## thalach      0.009953   0.014681   0.678 0.497774    
## exangYes    -0.646220   0.557791  -1.159 0.246646    
## oldpeak     -0.303202   0.322751  -0.939 0.347511    
## slope1      -0.083720   1.180784  -0.071 0.943476    
## slope2       1.494174   1.286747   1.161 0.245559    
## ca1         -2.541702   0.651145  -3.903 9.48e-05 ***
## ca2         -3.686123   1.069670  -3.446 0.000569 ***
## ca3         -2.076167   1.004198  -2.067 0.038688 *  
## ca4          1.048384   2.079025   0.504 0.614074    
## thal1        2.883801   2.739549   1.053 0.292499    
## thal2        2.609207   2.608416   1.000 0.317164    
## thal3        1.108764   2.609270   0.425 0.670886    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 292.36  on 211  degrees of freedom
## Residual deviance: 122.15  on 189  degrees of freedom
## AIC: 168.15
## 
## Number of Fisher Scoring iterations: 6
# Both

summary(model_both)
## 
## Call:
## glm(formula = target ~ sex + cp + trestbps + restecg + slope + 
##     ca + thal, family = "binomial", data = train_heart)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -3.00360  -0.31563   0.08544   0.38932   2.80515  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  1.14771    3.54773   0.324 0.746311    
## sexmale     -1.94702    0.66841  -2.913 0.003581 ** 
## cp1          1.82245    0.68720   2.652 0.008002 ** 
## cp2          2.16193    0.57076   3.788 0.000152 ***
## cp3          3.67871    0.96664   3.806 0.000141 ***
## trestbps    -0.02357    0.01392  -1.693 0.090392 .  
## restecg1     0.89406    0.48413   1.847 0.064786 .  
## restecg2    -1.34017    1.93060  -0.694 0.487574    
## slope1       0.33531    1.02241   0.328 0.742943    
## slope2       2.29316    1.05204   2.180 0.029278 *  
## ca1         -2.24324    0.58867  -3.811 0.000139 ***
## ca2         -3.54535    0.95429  -3.715 0.000203 ***
## ca3         -1.84653    0.91095  -2.027 0.042660 *  
## ca4          1.23995    1.87483   0.661 0.508379    
## thal1        2.48179    2.93285   0.846 0.397440    
## thal2        2.49503    2.81405   0.887 0.375277    
## thal3        0.91494    2.81824   0.325 0.745447    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 292.36  on 211  degrees of freedom
## Residual deviance: 127.71  on 195  degrees of freedom
## AIC: 161.71
## 
## Number of Fisher Scoring iterations: 6

Based on the results, model_back and model_both are identical: both have an AIC of 161.71 and a residual deviance of 127.71, while model_forw has an AIC of 168.15 and a residual deviance of 122.15.
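
These numbers can also be tabulated directly with base R's AIC() and deviance(); a minimal sketch:

# Compare the three stepwise models side by side
data.frame(model = c("model_back", "model_forw", "model_both"),
           AIC = c(AIC(model_back), AIC(model_forw), AIC(model_both)),
           residual_deviance = c(deviance(model_back), deviance(model_forw), deviance(model_both)))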

Although model_forw has a lower residual deviance, it has a higher AIC and fewer predictors that are significant for the target variable. Therefore, we will not choose model_forw and will pick either model_back or model_both.

In this case, we will choose model_both for the rest of the analysis.

Prediction

Using model_both from the stepwise method, we will make predictions on our test data. With the parameter type = "response", the prediction returns a probability for each observation.

With ggplot() we can then look at the distribution of these predicted probabilities.

test_heart$prediction <-  predict(model_both, type = "response", newdata = test_heart)

# Create Plot

test_heart %>%
  ggplot(aes(x=prediction)) +
  geom_density() +
  labs(title = "Probabilities Distribution of Prediction Data") +
  theme_minimal()

As we can see from the plot, the predicted probabilities lean toward 1 (Not Health).

pred <- predict(model_both, type = "response", newdata = test_heart)
result_pred <- ifelse(pred >= 0.5, "Not Health", "Health")

# Put our result prediction into our test data

test_heart$prediction <- result_pred

Here is a quick comparison between our predictions and the target variable of the test data:

test_heart %>%
  select(target, prediction) %>%
  head(5)
##        target prediction
## 6  Not Health     Health
## 10 Not Health Not Health
## 17 Not Health Not Health
## 18 Not Health Not Health
## 21 Not Health     Health

Model Evaluation

In model evaluation, we will see how good our model is based on several different metrics:

  • Accuracy : how often the model predicts the target variable correctly overall
  • Recall / Sensitivity : how well the model correctly identifies the positive class
  • Specificity : how well the model correctly identifies the negative class
  • Precision : how many of the predicted positives are actually positive

conf_mat <- confusionMatrix(as.factor(result_pred), reference = test_heart$target, positive = "Not Health")

conf_mat
## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   Health Not Health
##   Health         30          4
##   Not Health     11         46
##                                           
##                Accuracy : 0.8352          
##                  95% CI : (0.7427, 0.9047)
##     No Information Rate : 0.5495          
##     P-Value [Acc > NIR] : 7.843e-09       
##                                           
##                   Kappa : 0.6619          
##                                           
##  Mcnemar's Test P-Value : 0.1213          
##                                           
##             Sensitivity : 0.9200          
##             Specificity : 0.7317          
##          Pos Pred Value : 0.8070          
##          Neg Pred Value : 0.8824          
##              Prevalence : 0.5495          
##          Detection Rate : 0.5055          
##    Detection Prevalence : 0.6264          
##       Balanced Accuracy : 0.8259          
##                                           
##        'Positive' Class : Not Health      
## 
recall <- round(46/(46+4),3)
specificity <- round(30/(30+11),3)
precision <- round(46/(46+11),3)
accuracy <- round((46+30)/(46+30+11+4),3)

matrix <- cbind.data.frame(accuracy, recall, specificity, precision)

matrix
##   accuracy recall specificity precision
## 1    0.835   0.92       0.732     0.807
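
These metrics can also be pulled straight from the confusionMatrix object instead of hard-coding the cell counts; a minimal sketch:

# Accuracy is stored in conf_mat$overall; the class-wise metrics in conf_mat$byClass
round(conf_mat$overall["Accuracy"], 3)
round(conf_mat$byClass[c("Sensitivity", "Specificity", "Pos Pred Value")], 3)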

Based on the metric summary, our model predicts the target variable correctly for 83.5% of the test data. Its ability to correctly identify Not Health patients (recall) is 92%, while its ability to correctly identify Health patients (specificity) is 73.2%. Furthermore, of all the Not Health predictions it makes, 80.7% are correct (precision).

Model Interpretation

By analyzing the model coefficients, we can see how each predictor relates to the probability of the positive class. We will create a data frame of the model coefficients and transform the values with inv.logit(), which converts log-of-odds values into probabilities.

# Return the probability value

model_both$coefficients %>%
  inv.logit() %>%
  data.frame()
##                      .
## (Intercept) 0.75909323
## sexmale     0.12487893
## cp1         0.86085997
## cp2         0.89677858
## cp3         0.97536664
## trestbps    0.49410880
## restecg1    0.70972632
## restecg2    0.20748268
## slope1      0.58305041
## slope2      0.90830878
## ca1         0.09593436
## ca2         0.02804914
## ca3         0.13628139
## ca4         0.77555478
## thal1       0.92285510
## thal2       0.92379261
## thal3       0.71400963

Some conclusions that we can draw from this data frame are:

  1. The probability of a male being diagnosed with heart disease is about 12.5%.

  2. People with the most severe type of chest pain (cp = 3) have about a 97.5% probability of being diagnosed with heart disease.
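
As a sanity check, inv.logit() is simply the logistic function 1 / (1 + exp(-x)); applying it by hand to the sexmale coefficient reproduces the value interpreted above:

# Manual logistic transform of the sexmale log-odds coefficient
sex_coef <- model_both$coefficients["sexmale"]
1 / (1 + exp(-sex_coef))   # approximately 0.125, the same value returned by inv.logit()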

K - Nearest Neighbor

Data Wrangling

Since the K-Nearest Neighbor method takes a different approach, we need to create a new data frame of dummy variables that we will use to predict the target variable.

# Create dummy variable

dummy <- dummyVars("~target + sex +cp + trestbps + chol + fbs + restecg + thalach + exang + oldpeak + slope + ca + thal", data = heart)

# Create new data frame

dummy <- data.frame(predict(dummy, newdata = heart))

# Check our data frame structure

str(dummy)
## 'data.frame':    303 obs. of  31 variables:
##  $ target.Health    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ target.Not.Health: num  1 1 1 1 1 1 1 1 1 1 ...
##  $ sex.female       : num  0 0 1 0 1 0 1 0 0 0 ...
##  $ sex.male         : num  1 1 0 1 0 1 0 1 1 1 ...
##  $ cp.0             : num  0 0 0 0 1 1 0 0 0 0 ...
##  $ cp.1             : num  0 0 1 1 0 0 1 1 0 0 ...
##  $ cp.2             : num  0 1 0 0 0 0 0 0 1 1 ...
##  $ cp.3             : num  1 0 0 0 0 0 0 0 0 0 ...
##  $ trestbps         : num  145 130 130 120 120 140 140 120 172 150 ...
##  $ chol             : num  233 250 204 236 354 192 294 263 199 168 ...
##  $ fbs.False        : num  0 1 1 1 1 1 1 1 0 1 ...
##  $ fbs.True         : num  1 0 0 0 0 0 0 0 1 0 ...
##  $ restecg.0        : num  1 0 1 0 0 0 1 0 0 0 ...
##  $ restecg.1        : num  0 1 0 1 1 1 0 1 1 1 ...
##  $ restecg.2        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ thalach          : num  150 187 172 178 163 148 153 173 162 174 ...
##  $ exang.No         : num  1 1 1 1 0 1 1 1 1 1 ...
##  $ exang.Yes        : num  0 0 0 0 1 0 0 0 0 0 ...
##  $ oldpeak          : num  2.3 3.5 1.4 0.8 0.6 0.4 1.3 0 0.5 1.6 ...
##  $ slope.0          : num  1 1 0 0 0 0 0 0 0 0 ...
##  $ slope.1          : num  0 0 0 0 0 1 1 0 0 0 ...
##  $ slope.2          : num  0 0 1 1 1 0 0 1 1 1 ...
##  $ ca.0             : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ ca.1             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ ca.2             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ ca.3             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ ca.4             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ thal.0           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ thal.1           : num  1 0 0 0 0 1 0 0 0 0 ...
##  $ thal.2           : num  0 1 1 1 1 0 1 0 0 1 ...
##  $ thal.3           : num  0 0 0 0 0 0 0 1 1 0 ...

After creating the new data frame, called dummy, we will remove one dummy column for each variable that originally had only two categories, since a single column is enough to encode them.

Those variables are:

  • target
  • sex
  • fbs
  • exang
dummy$target.Health <- NULL
dummy$sex.female <- NULL
dummy$fbs.False <- NULL
dummy$exang.No <- NULL
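
A related shortcut worth knowing: dummyVars() has a fullRank argument that drops the first level of every factor (not only the two-level ones), giving a full-rank encoding in one step. A minimal sketch of that variant (not used further here):

# Full-rank encoding: one dummy column fewer per factor
dummy_fr <- dummyVars("~ target + sex + cp + trestbps + chol + fbs + restecg + thalach + exang + oldpeak + slope + ca + thal",
                      data = heart, fullRank = TRUE)
dummy_fr <- data.frame(predict(dummy_fr, newdata = heart))

head(dummy_fr, 3)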

Below is an overview of the new data frame that we will use for this part of the analysis.

head(dummy, 3)
##   target.Not.Health sex.male cp.0 cp.1 cp.2 cp.3 trestbps chol fbs.True
## 1                 1        1    0    0    0    1      145  233        1
## 2                 1        1    0    0    1    0      130  250        0
## 3                 1        0    0    1    0    0      130  204        0
##   restecg.0 restecg.1 restecg.2 thalach exang.Yes oldpeak slope.0 slope.1
## 1         1         0         0     150         0     2.3       1       0
## 2         0         1         0     187         0     3.5       1       0
## 3         1         0         0     172         0     1.4       0       0
##   slope.2 ca.0 ca.1 ca.2 ca.3 ca.4 thal.0 thal.1 thal.2 thal.3
## 1       0    1    0    0    0    0      0      1      0      0
## 2       0    1    0    0    0    0      0      0      1      0
## 3       1    1    0    0    0    0      0      0      1      0

Cross Validation: K - Nearest Method

The K-Nearest Neighbor method takes a different approach to cross-validation: we split the predictor and target variables separately into train and test data.

The proportion will be 70% for our train data and 30% for our test data.

set.seed(100)

# Predictor (reusing the index created earlier, so the split matches the logistic regression split)

train_x <- dummy[index, -1]
test_x <- dummy[-index, -1]

# Target

train_y <- dummy[index, 1]
test_y <- dummy[-index, 1]

Choose k

We will choose k using the common rule of thumb: the square root of the number of observations in the training data.

sqrt(nrow(train_x))
## [1] 14.56022

Based on the calculation, we will choose k = 14.
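
The square-root rule is only a heuristic (and an odd k is often preferred in binary problems to avoid ties). If we wanted to tune k instead, caret can cross-validate over a grid of candidate values; a minimal sketch, not used further here:

set.seed(100)

# Cross-validate k on the training predictors; "zv" drops any zero-variance dummy columns
knn_tune <- train(x = train_x,
                  y = factor(train_y, levels = c(0, 1), labels = c("Health", "Not Health")),
                  method = "knn",
                  preProcess = c("zv", "center", "scale"),
                  trControl = trainControl(method = "cv", number = 5),
                  tuneGrid = data.frame(k = seq(3, 25, by = 2)))

knn_tune$bestTune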

Scaling

As the overview of the new data frame showed, some variables have very different ranges from one another. Before we make our K-Nearest Neighbor prediction, we need to scale the data to deal with this.

We will use the scale() function on the predictors, scaling the test data with the centers and scales learned from the train data so that no information from the test set leaks into the preprocessing.

train_x <- scale(x = train_x)
test_x <- scale(x = test_x, center = attr(train_x, "scaled:center"), scale = attr(train_x, "scaled:scale"))

Create K - Nearest Neighbor Prediction

Next, we will create the predictions using knn() from the class package.

pred_knn <- knn(train = train_x, test = test_x, cl = train_y, k = 14)

To make the results easier to read, we will turn the K-Nearest Neighbor prediction object into a data frame and relabel the levels with their original labels:

  • 0 : Health
  • 1 : Not Health
pred_knn <- pred_knn %>%
  as.data.frame() %>%
  mutate(pred_knn = factor(pred_knn, levels = c(0,1), labels = c("Health", "Not Health"))) %>%
  select(pred_knn)

We will do the same with our target test data, which will be used as the reference in our confusion matrix:

  • 0 : Health
  • 1 : Not Health
test_y <- test_y %>%
  as.data.frame() %>%
  mutate(target = factor(test_y, levels = c(0,1), labels = c("Health", "Not Health"))) %>%
  select(target)

Create Confusion Matrix

Finally, we will create a confusion matrix for our K-Nearest Neighbor predictions.

conf_mat_knn <- confusionMatrix(pred_knn$pred_knn, reference = test_y$target, positive = "Not Health")

conf_mat_knn
## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   Health Not Health
##   Health         29          2
##   Not Health     12         48
##                                           
##                Accuracy : 0.8462          
##                  95% CI : (0.7554, 0.9133)
##     No Information Rate : 0.5495          
##     P-Value [Acc > NIR] : 1.82e-09        
##                                           
##                   Kappa : 0.6823          
##                                           
##  Mcnemar's Test P-Value : 0.01616         
##                                           
##             Sensitivity : 0.9600          
##             Specificity : 0.7073          
##          Pos Pred Value : 0.8000          
##          Neg Pred Value : 0.9355          
##              Prevalence : 0.5495          
##          Detection Rate : 0.5275          
##    Detection Prevalence : 0.6593          
##       Balanced Accuracy : 0.8337          
##                                           
##        'Positive' Class : Not Health      
## 
recall_knn <- round(48/(48+2),3)
specificity_knn <- round(29/(29+12),3)
precision_knn <- round(48/(48+12),3)
accuracy_knn <- round((48+29)/(48+29+12+2),3)

matrix_knn <- cbind.data.frame(accuracy_knn, recall_knn, specificity_knn, precision_knn)

matrix_knn
##   accuracy_knn recall_knn specificity_knn precision_knn
## 1        0.846       0.96           0.707           0.8

Based on the metric summary, the K-Nearest Neighbor model predicts the target variable correctly for 84.6% of the test data. Its ability to correctly identify Not Health patients (recall) is 96%, while its ability to correctly identify Health patients (specificity) is 70.7%. Furthermore, of all the Not Health predictions it makes, 80% are correct (precision).

Model Comparison: Logistic Regression vs K - Nearest Neighbor

We now have two methods in our analysis, and we need to choose which one to use to predict whether a person has heart disease.

# Matrix summary from Logistic Regression

matrix
##   accuracy recall specificity precision
## 1    0.835   0.92       0.732     0.807
# Matrix summary from K - Nearest Neighbor

matrix_knn
##   accuracy_knn recall_knn specificity_knn precision_knn
## 1        0.846       0.96           0.707           0.8

We can see from the table above that the metric values are quite similar: K-Nearest Neighbor has slightly better accuracy and recall, while the Logistic Regression model has better specificity and precision.

Conclusion

To conclude our analysis, we need to come back to the situation and the question we want to answer:

If a doctor wants to give a different treatment to people diagnosed with heart disease than to those who are not, I would use the method with the better precision. If the model misclassifies healthy people as having heart disease, there is a chance the doctor will give the wrong treatment to some patients, which could harm them. On the other hand, if a doctor simply wants to catch as many people with heart disease as possible and is less concerned about false alarms, then the method with the better recall should be chosen.
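
If precision were the priority, one practical lever for the logistic model is the probability cutoff: raising it above 0.5 makes the model more conservative about labelling someone as Not Health, which usually trades recall for precision. A minimal sketch using the objects created above, with an arbitrary cutoff of 0.7 for illustration:

# Re-classify the logistic predictions with a stricter cutoff (illustration only)
pred_strict <- factor(ifelse(pred >= 0.7, "Not Health", "Health"),
                      levels = c("Health", "Not Health"))

confusionMatrix(pred_strict, reference = test_heart$target, positive = "Not Health")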