Background

Introduction

The objective of this document is to compare the effectiveness of logistic regression and kNN on this dataset, where the goal is to detect the presence of heart disease in a patient based on various predictor variables.

Since the predictor variables in the dataset are a mix of numerical and categorical, I think it’s interesting to pit a logistic regression model against kNN and see which one performs better.

Data Import

No NA values were found in the dataset.

## [1] FALSE
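
A minimal sketch of how the import and NA check might look (the file name heart.csv and the object name heart are assumptions, not taken from the original code):

# read the data and check whether any value is missing
heart <- read.csv("heart.csv")
anyNA(heart)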


Data Explanation

age : age of patient
sex : sex of patient
cp : chest pain type (4 values)
trestbps : resting blood pressure
chol : serum cholesterol in mg/dl
fbs : fasting blood sugar > 120 mg/dl
restecg : resting electrocardiographic results (values 0,1,2)
thalach : maximum heart rate achieved
exang : exercise induced angina
oldpeak : ST depression induced by exercise relative to rest
slope : the slope of the peak exercise ST segment
ca : number of major vessels (0-3) colored by fluoroscopy
thal : 3 = normal; 6 = fixed defect; 7 = reversible defect
target : whether one has a heart disease or not
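
Because the model summaries below show dummy levels such as sex1, cp2, and ca3, the categorical columns were evidently stored as factors before modelling. A sketch of that conversion (column names from the list above; the object name heart is an assumption):

# convert categorical columns to factors so glm() creates dummy variables
cat_cols <- c("sex", "cp", "fbs", "restecg", "exang", "slope", "ca", "thal", "target")
heart[cat_cols] <- lapply(heart[cat_cols], as.factor)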


Target Variable Proportion

Checking the proportion of our target variable.

## 
##         0         1 
## 0.4554455 0.5445545

I’d say our target variable is fairly well balanced, with roughly a 46:54 split.
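
For reference, the check above might be produced with prop.table, and an approximate 80/20 train/test split would be consistent with the 243 training rows and 60 test rows seen in the outputs further down. The split proportion, seed, and object names below are assumptions:

prop.table(table(heart$target))

# hold out roughly 20% of rows for testing
set.seed(100)
idx <- sample(nrow(heart), size = round(0.8 * nrow(heart)))
heart_wr_train <- heart[idx, ]
heart_wr_test  <- heart[-idx, ]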


Modelling - Logistic Regression

We will use model version 2 for our predictions.


Version 1 - All Variables
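
The first version uses every available predictor. The call below matches the Call line in the summary output; the object name model_all is an assumption:

# logistic regression with all predictors
model_all <- glm(target ~ ., family = "binomial", data = heart_wr_train)
summary(model_all)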

## 
## Call:
## glm(formula = target ~ ., family = "binomial", data = heart_wr_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5945  -0.3482   0.1299   0.4521   3.0798  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.216e+00  3.423e+00  -0.647 0.517426    
## age          1.741e-02  2.867e-02   0.607 0.543739    
## sex1        -1.933e+00  6.185e-01  -3.126 0.001774 ** 
## cp1          1.256e+00  6.614e-01   1.898 0.057637 .  
## cp2          1.590e+00  5.421e-01   2.932 0.003365 ** 
## cp3          2.511e+00  8.134e-01   3.087 0.002022 ** 
## trestbps    -1.663e-02  1.328e-02  -1.251 0.210765    
## chol        -4.870e-03  4.558e-03  -1.068 0.285384    
## fbs1         8.340e-01  6.472e-01   1.289 0.197545    
## restecg1     4.543e-01  4.394e-01   1.034 0.301161    
## restecg2    -1.406e+01  1.263e+03  -0.011 0.991116    
## thalach      2.514e-02  1.305e-02   1.927 0.053994 .  
## exang1      -3.694e-01  4.912e-01  -0.752 0.452060    
## oldpeak     -2.348e-01  2.662e-01  -0.882 0.377794    
## slope1       2.628e-01  1.020e+00   0.258 0.796634    
## slope2       1.488e+00  1.129e+00   1.318 0.187544    
## ca1         -2.117e+00  5.850e-01  -3.619 0.000296 ***
## ca2         -3.369e+00  9.695e-01  -3.475 0.000510 ***
## ca3         -1.970e+00  9.270e-01  -2.125 0.033588 *  
## ca4          7.893e-01  1.640e+00   0.481 0.630354    
## thal1        2.286e+00  2.058e+00   1.111 0.266658    
## thal2        2.287e+00  1.920e+00   1.191 0.233625    
## thal3        8.857e-01  1.930e+00   0.459 0.646313    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 335.05  on 242  degrees of freedom
## Residual deviance: 150.93  on 220  degrees of freedom
## AIC: 196.93
## 
## Number of Fisher Scoring iterations: 15

Version 2 - Stepwise

Our stepwise-selected model is shown below.
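
A sketch of the stepwise selection, starting from the full model of version 1 (the direction argument and the object name model_step are assumptions):

# stepwise selection by AIC, starting from the full model
model_step <- step(model_all, direction = "backward", trace = FALSE)
model_step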

## 
## Call:  glm(formula = target ~ sex + cp + trestbps + thalach + slope + 
##     ca + thal, family = "binomial", data = heart_wr_train)
## 
## Coefficients:
## (Intercept)         sex1          cp1          cp2          cp3     trestbps  
##    -2.03553     -1.74001      1.59716      1.84120      2.76912     -0.01895  
##     thalach       slope1       slope2          ca1          ca2          ca3  
##     0.02514      0.30537      1.70208     -2.02765     -3.06457     -1.72804  
##         ca4        thal1        thal2        thal3  
##     1.29932      1.76422      1.69175      0.22606  
## 
## Degrees of Freedom: 242 Total (i.e. Null);  227 Residual
## Null Deviance:       335.1 
## Residual Deviance: 158.4     AIC: 190.4

We tried the full set of variables recommended by the stepwise selection and saw that the variable thal is not significant enough, so we decided to take it out of the model.
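
The refit without thal matches the formula in the output below; the object name model_step2 is an assumption:

# final logistic regression model: stepwise selection minus thal
model_step2 <- glm(target ~ sex + cp + trestbps + thalach + slope + ca,
                   family = "binomial", data = heart_wr_train)
summary(model_step2)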

## 
## Call:
## glm(formula = target ~ sex + cp + trestbps + thalach + slope + 
##     ca, family = "binomial", data = heart_wr_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4021  -0.4573   0.1312   0.4835   2.9851  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.35131    2.35016  -0.149 0.881171    
## sex1        -2.10903    0.49646  -4.248 2.16e-05 ***
## cp1          1.89121    0.60176   3.143 0.001674 ** 
## cp2          1.96363    0.47660   4.120 3.79e-05 ***
## cp3          2.88931    0.77280   3.739 0.000185 ***
## trestbps    -0.02348    0.01099  -2.137 0.032621 *  
## thalach      0.02559    0.01106   2.314 0.020690 *  
## slope1       0.30869    0.83322   0.370 0.711030    
## slope2       1.90695    0.84690   2.252 0.024342 *  
## ca1         -1.97334    0.52718  -3.743 0.000182 ***
## ca2         -3.00723    0.76657  -3.923 8.75e-05 ***
## ca3         -1.78755    0.78563  -2.275 0.022887 *  
## ca4          0.64862    1.59084   0.408 0.683478    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 335.05  on 242  degrees of freedom
## Residual deviance: 170.72  on 230  degrees of freedom
## AIC: 196.72
## 
## Number of Fisher Scoring iterations: 6


Prediction using Logistic Regression


Confusion Matrix using Logistic Regression
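A sketch of how the predictions and confusion matrix might be produced, assuming the caret package, a 0.5 probability cutoff, and the model and data objects from the sketches above:

library(caret)

# predicted probabilities on the test set, then a 0.5 cutoff
prob_lr <- predict(model_step2, newdata = heart_wr_test, type = "response")
pred_lr <- as.factor(ifelse(prob_lr > 0.5, "1", "0"))

confusionMatrix(pred_lr, heart_wr_test$target, positive = "1")
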

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 21  2
##          1  6 31
##                                           
##                Accuracy : 0.8667          
##                  95% CI : (0.7541, 0.9406)
##     No Information Rate : 0.55            
##     P-Value [Acc > NIR] : 1.653e-07       
##                                           
##                   Kappa : 0.727           
##                                           
##  Mcnemar's Test P-Value : 0.2888          
##                                           
##             Sensitivity : 0.9394          
##             Specificity : 0.7778          
##          Pos Pred Value : 0.8378          
##          Neg Pred Value : 0.9130          
##              Prevalence : 0.5500          
##          Detection Rate : 0.5167          
##    Detection Prevalence : 0.6167          
##       Balanced Accuracy : 0.8586          
##                                           
##        'Positive' Class : 1               
## 


Using kNN

The kNN method only accepts numerical predictors. Therefore, we select the numerical predictors and scale them so they can be processed with kNN.
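
A sketch of that preprocessing, assuming the numerical columns listed in the data explanation and scaling the test set with the training set's center and spread (object names are assumptions):

num_cols <- c("age", "trestbps", "chol", "thalach", "oldpeak")

# scale the training predictors, then apply the same center/scale to the test set
train_x <- scale(heart_wr_train[num_cols])
test_x  <- scale(heart_wr_test[num_cols],
                 center = attr(train_x, "scaled:center"),
                 scale  = attr(train_x, "scaled:scale"))

train_y <- heart_wr_train$target
test_y  <- heart_wr_test$target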


Determine the best k number

## [1] 16

Since we have 2 target classes, we round up to the odd number 17 to avoid tied votes.
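
The output above is presumably the rounded square root of the number of training rows, the usual rule of thumb for choosing k; a sketch (train_x as defined above):

# square root of the number of training rows, rounded: 16
round(sqrt(nrow(train_x)))

# use an odd k so the two classes cannot tie in the vote
k_chosen <- 17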


Prediction using kNN

Confusion Matrix using kNN
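A sketch of the kNN prediction and evaluation, assuming the class package and the objects defined in the sketches above:

library(class)

# k-nearest neighbours with k = 17
pred_knn <- knn(train = train_x, test = test_x, cl = train_y, k = 17)

confusionMatrix(pred_knn, test_y, positive = "1")
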

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 16  5
##          1 11 28
##                                           
##                Accuracy : 0.7333          
##                  95% CI : (0.6034, 0.8393)
##     No Information Rate : 0.55            
##     P-Value [Acc > NIR] : 0.002709        
##                                           
##                   Kappa : 0.4502          
##                                           
##  Mcnemar's Test P-Value : 0.211300        
##                                           
##             Sensitivity : 0.8485          
##             Specificity : 0.5926          
##          Pos Pred Value : 0.7179          
##          Neg Pred Value : 0.7619          
##              Prevalence : 0.5500          
##          Detection Rate : 0.4667          
##    Detection Prevalence : 0.6500          
##       Balanced Accuracy : 0.7205          
##                                           
##        'Positive' Class : 1               
## 


Conclusion

The metric we focus on in this exercise is Sensitivity (Recall): the closer it is to 1, the better. In medical cases it is often better to have more “false alarms” (False Positives) than the other way around, where a patient has the disease yet is diagnosed as healthy (a False Negative).

Sensitivity or Recall is also known as the true positive rate. A good way to understand it: suppose 100 people are actually sick. A recall value of 0.9 means the model correctly classifies 90 of them as sick, while the remaining 10 sick people are missed and predicted as healthy.

Our logistic regression model performed better than our kNN model, with a Sensitivity/Recall of 0.94 compared to 0.85 for kNN. We believe this happened because the categorical variables we had to omit from the kNN model carry a lot of predictive information, and leaving them out hurt that model's quality.
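
Both sensitivity values can be recovered directly from the confusion matrices above, since recall = TP / (TP + FN) with class 1 as the positive class:

# logistic regression: 31 true positives, 2 false negatives
31 / (31 + 2)   # 0.9394

# kNN: 28 true positives, 5 false negatives
28 / (28 + 5)   # 0.8485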

While kNN usually boasts good results when the predictors are purely numerical, in this case, where the predictors mix categorical and numerical variables, logistic regression takes the cake.

The variables used in our final logistic regression model are: sex, chest pain type, resting blood pressure, maximum heart rate, slope of the peak exercise ST segment, and number of major vessels colored by fluoroscopy.