Background

Introduction

The objective of this document is to compare the effectiveness of logistic regression and kNN on this dataset, where the goal is to detect the presence of heart disease in a patient based on various predictor variables.

Since the predictor variables in the dataset are a mix of numerical and categorical, I think it’s interesting to pit a logistic regression model against kNN and see which one performs better.

Data Import

No NA values were found in the dataset.

## [1] FALSE
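
A minimal sketch of how the import and NA check might look (the file name heart.csv and the object name heart are assumptions, not taken from the original code):

# read the data and check whether any value is missing
heart <- read.csv("heart.csv")
anyNA(heart)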


Data Explanation

age : age of patient
sex : sex of patient
cp : chest pain type (4 values)
trestbps : resting blood pressure
chol : serum cholesterol in mg/dl
fbs : fasting blood sugar > 120 mg/dl
restecg : resting electrocardiographic results (values 0,1,2)
thalach : maximum heart rate achieved
exang : exercise induced angina
oldpeak : ST depression induced by exercise relative to rest
slope : the slope of the peak exercise ST segment
ca : number of major vessels (0-3) colored by fluoroscopy
thal : 3 = normal; 6 = fixed defect; 7 = reversible defect
target : whether one has a heart disease or not
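
Because the model summaries below show dummy levels such as sex1, cp2, and ca3, the categorical columns were evidently stored as factors before modelling. A sketch of that conversion (column names from the list above; the object name heart is an assumption):

# convert categorical columns to factors so glm() creates dummy variables
cat_cols <- c("sex", "cp", "fbs", "restecg", "exang", "slope", "ca", "thal", "target")
heart[cat_cols] <- lapply(heart[cat_cols], as.factor)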


Target Variable Proportion

Checking the proportion of our target variable.

## 
##         0         1 
## 0.4554455 0.5445545

I’d say our target variable is fairly well balanced, with roughly a 46:54 split.
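
For reference, the check above might be produced with prop.table, and an approximate 80/20 train/test split would be consistent with the 243 training rows and 60 test rows seen in the outputs further down. The split proportion, seed, and object names below are assumptions:

prop.table(table(heart$target))

# hold out roughly 20% of rows for testing
set.seed(100)
idx <- sample(nrow(heart), size = round(0.8 * nrow(heart)))
heart_wr_train <- heart[idx, ]
heart_wr_test  <- heart[-idx, ]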


Modelling - Logistic Regression

We will use model version 2 for our predictions.


Version 1 - All Variables
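
The first version uses every available predictor. The call below matches the Call line in the summary output; the object name model_all is an assumption:

# logistic regression with all predictors
model_all <- glm(target ~ ., family = "binomial", data = heart_wr_train)
summary(model_all)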

## 
## Call:
## glm(formula = target ~ ., family = "binomial", data = heart_wr_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5945  -0.3482   0.1299   0.4521   3.0798  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.216e+00  3.423e+00  -0.647 0.517426    
## age          1.741e-02  2.867e-02   0.607 0.543739    
## sex1        -1.933e+00  6.185e-01  -3.126 0.001774 ** 
## cp1          1.256e+00  6.614e-01   1.898 0.057637 .  
## cp2          1.590e+00  5.421e-01   2.932 0.003365 ** 
## cp3          2.511e+00  8.134e-01   3.087 0.002022 ** 
## trestbps    -1.663e-02  1.328e-02  -1.251 0.210765    
## chol        -4.870e-03  4.558e-03  -1.068 0.285384    
## fbs1         8.340e-01  6.472e-01   1.289 0.197545    
## restecg1     4.543e-01  4.394e-01   1.034 0.301161    
## restecg2    -1.406e+01  1.263e+03  -0.011 0.991116    
## thalach      2.514e-02  1.305e-02   1.927 0.053994 .  
## exang1      -3.694e-01  4.912e-01  -0.752 0.452060    
## oldpeak     -2.348e-01  2.662e-01  -0.882 0.377794    
## slope1       2.628e-01  1.020e+00   0.258 0.796634    
## slope2       1.488e+00  1.129e+00   1.318 0.187544    
## ca1         -2.117e+00  5.850e-01  -3.619 0.000296 ***
## ca2         -3.369e+00  9.695e-01  -3.475 0.000510 ***
## ca3         -1.970e+00  9.270e-01  -2.125 0.033588 *  
## ca4          7.893e-01  1.640e+00   0.481 0.630354    
## thal1        2.286e+00  2.058e+00   1.111 0.266658    
## thal2        2.287e+00  1.920e+00   1.191 0.233625    
## thal3        8.857e-01  1.930e+00   0.459 0.646313    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 335.05  on 242  degrees of freedom
## Residual deviance: 150.93  on 220  degrees of freedom
## AIC: 196.93
## 
## Number of Fisher Scoring iterations: 15

Version 2 - Stepwise

Our stepwise-selected model is shown below.
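
A sketch of the stepwise selection, starting from the full model of version 1 (the direction argument and the object name model_step are assumptions):

# stepwise selection by AIC, starting from the full model
model_step <- step(model_all, direction = "backward", trace = FALSE)
model_step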

## 
## Call:  glm(formula = target ~ sex + cp + trestbps + thalach + slope + 
##     ca + thal, family = "binomial", data = heart_wr_train)
## 
## Coefficients:
## (Intercept)         sex1          cp1          cp2          cp3     trestbps  
##    -2.03553     -1.74001      1.59716      1.84120      2.76912     -0.01895  
##     thalach       slope1       slope2          ca1          ca2          ca3  
##     0.02514      0.30537      1.70208     -2.02765     -3.06457     -1.72804  
##         ca4        thal1        thal2        thal3  
##     1.29932      1.76422      1.69175      0.22606  
## 
## Degrees of Freedom: 242 Total (i.e. Null);  227 Residual
## Null Deviance:       335.1 
## Residual Deviance: 158.4     AIC: 190.4

We tried the full set of variables recommended by the stepwise selection and saw that the variable thal is not significant enough, so we decided to take it out of the model.
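
The refit without thal matches the formula in the output below; the object name model_step2 is an assumption:

# final logistic regression model: stepwise selection minus thal
model_step2 <- glm(target ~ sex + cp + trestbps + thalach + slope + ca,
                   family = "binomial", data = heart_wr_train)
summary(model_step2)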

## 
## Call:
## glm(formula = target ~ sex + cp + trestbps + thalach + slope + 
##     ca, family = "binomial", data = heart_wr_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4021  -0.4573   0.1312   0.4835   2.9851  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.35131    2.35016  -0.149 0.881171    
## sex1        -2.10903    0.49646  -4.248 2.16e-05 ***
## cp1          1.89121    0.60176   3.143 0.001674 ** 
## cp2          1.96363    0.47660   4.120 3.79e-05 ***
## cp3          2.88931    0.77280   3.739 0.000185 ***
## trestbps    -0.02348    0.01099  -2.137 0.032621 *  
## thalach      0.02559    0.01106   2.314 0.020690 *  
## slope1       0.30869    0.83322   0.370 0.711030    
## slope2       1.90695    0.84690   2.252 0.024342 *  
## ca1         -1.97334    0.52718  -3.743 0.000182 ***
## ca2         -3.00723    0.76657  -3.923 8.75e-05 ***
## ca3         -1.78755    0.78563  -2.275 0.022887 *  
## ca4          0.64862    1.59084   0.408 0.683478    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 335.05  on 242  degrees of freedom
## Residual deviance: 170.72  on 230  degrees of freedom
## AIC: 196.72
## 
## Number of Fisher Scoring iterations: 6


Prediction using Logistic Regression


Confusion Matrix using Logistic Regression
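A sketch of how the predictions and confusion matrix might be produced, assuming the caret package, a 0.5 probability cutoff, and the model and data objects from the sketches above:

library(caret)

# predicted probabilities on the test set, then a 0.5 cutoff
prob_lr <- predict(model_step2, newdata = heart_wr_test, type = "response")
pred_lr <- as.factor(ifelse(prob_lr > 0.5, "1", "0"))

confusionMatrix(pred_lr, heart_wr_test$target, positive = "1")
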

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 21  2
##          1  6 31
##                                           
##                Accuracy : 0.8667          
##                  95% CI : (0.7541, 0.9406)
##     No Information Rate : 0.55            
##     P-Value [Acc > NIR] : 1.653e-07       
##                                           
##                   Kappa : 0.727           
##                                           
##  Mcnemar's Test P-Value : 0.2888          
##                                           
##             Sensitivity : 0.9394          
##             Specificity : 0.7778          
##          Pos Pred Value : 0.8378          
##          Neg Pred Value : 0.9130          
##              Prevalence : 0.5500          
##          Detection Rate : 0.5167          
##    Detection Prevalence : 0.6167          
##       Balanced Accuracy : 0.8586          
##                                           
##        'Positive' Class : 1               
## 


Using kNN

The kNN method only accepts numerical predictors. Therefore, we select the numerical predictors and scale them so they can be processed with kNN.
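
A sketch of that preprocessing, assuming the numerical columns listed in the data explanation and scaling the test set with the training set's center and spread (object names are assumptions):

num_cols <- c("age", "trestbps", "chol", "thalach", "oldpeak")

# scale the training predictors, then apply the same center/scale to the test set
train_x <- scale(heart_wr_train[num_cols])
test_x  <- scale(heart_wr_test[num_cols],
                 center = attr(train_x, "scaled:center"),
                 scale  = attr(train_x, "scaled:scale"))

train_y <- heart_wr_train$target
test_y  <- heart_wr_test$target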


Determine the best k number

## [1] 16

Since we have 2 target classes, we round up to the odd number 17 to avoid tied votes.
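
The output above is presumably the rounded square root of the number of training rows, the usual rule of thumb for choosing k; a sketch (train_x as defined above):

# square root of the number of training rows, rounded: 16
round(sqrt(nrow(train_x)))

# use an odd k so the two classes cannot tie in the vote
k_chosen <- 17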


Prediction using kNN

Confusion Matrix using kNN
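A sketch of the kNN prediction and evaluation, assuming the class package and the objects defined in the sketches above:

library(class)

# k-nearest neighbours with k = 17
pred_knn <- knn(train = train_x, test = test_x, cl = train_y, k = 17)

confusionMatrix(pred_knn, test_y, positive = "1")
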

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 16  5
##          1 11 28
##                                           
##                Accuracy : 0.7333          
##                  95% CI : (0.6034, 0.8393)
##     No Information Rate : 0.55            
##     P-Value [Acc > NIR] : 0.002709        
##                                           
##                   Kappa : 0.4502          
##                                           
##  Mcnemar's Test P-Value : 0.211300        
##                                           
##             Sensitivity : 0.8485          
##             Specificity : 0.5926          
##          Pos Pred Value : 0.7179          
##          Neg Pred Value : 0.7619          
##              Prevalence : 0.5500          
##          Detection Rate : 0.4667          
##    Detection Prevalence : 0.6500          
##       Balanced Accuracy : 0.7205          
##                                           
##        'Positive' Class : 1               
## 


Conclusion

The metric we focus on in this exercise is Sensitivity (Recall): the closer it is to 1, the better. In medical cases it is often better to have more “false alarms” (False Positives) than the other way around, where a patient has the disease yet is diagnosed as healthy (a False Negative).

Sensitivity or Recall is also known as the true positive rate. A good way to understand it: suppose 100 people are actually sick. A recall value of 0.9 means the model correctly classifies 90 of them as sick, while the remaining 10 sick people are missed and predicted as healthy.

Our logistic regression model performed better than our kNN model, with a Sensitivity/Recall of 0.94 compared to 0.85 for kNN. We believe this happened because the categorical variables we had to omit from the kNN model carry a lot of predictive information, and leaving them out hurt that model's quality.
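
Both sensitivity values can be recovered directly from the confusion matrices above, since recall = TP / (TP + FN) with class 1 as the positive class:

# logistic regression: 31 true positives, 2 false negatives
31 / (31 + 2)   # 0.9394

# kNN: 28 true positives, 5 false negatives
28 / (28 + 5)   # 0.8485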

While kNN usually boasts good results when the predictors are purely numerical, in this case, where the predictors mix categorical and numerical variables, logistic regression takes the cake.

The variables used in our final logistic regression model are: sex, chest pain type, resting blood pressure, maximum heart rate, slope of the peak exercise ST segment, and number of major vessels colored by fluoroscopy.