The objective of this document is to compare the effectiveness of logistic regression and k-nearest neighbors (kNN) on this dataset, where the goal is to detect the presence of heart disease in a patient based on various predictor variables.
Since the predictor variables are a mix of numerical and categorical, I think it's interesting to pit a logistic regression model against kNN and see which one performs better.
No NAs were found in the dataset.
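The FALSE below was likely produced by a missing-value check such as the following (a sketch, assuming the raw data frame is named heart, as it is in the preprocessing step further down):

anyNA(heart)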
## [1] FALSE
age : age of patient
sex : sex of patient
cp : chest pain type (4 values)
trestbps : resting blood pressure
chol : serum cholesterol in mg/dl
fbs : fasting blood sugar > 120 mg/dl
restecg : resting electrocardiographic results (values 0,1,2)
thalach : maximum heart rate achieved
exang : exercise induced angina
oldpeak : ST depression induced by exercise relative to rest
slope : the slope of the peak exercise ST segment
ca : number of major vessels (0-3) colored by fluoroscopy
thal : 3 = normal; 6 = fixed defect; 7 = reversible defect
target : whether the patient has heart disease (1) or not (0)
Convert the following variables into factors: sex, cp, fbs, restecg, exang, slope, ca, thal.
library(dplyr)    # mutate(), select(), %>%
library(rsample)  # initial_split(), training(), testing()
library(caret)    # confusionMatrix()

heart_wr <- heart %>%
  mutate(sex = as.factor(sex),
         cp = as.factor(cp),
         fbs = as.factor(fbs),
         restecg = as.factor(restecg),
         exang = as.factor(exang),
         slope = as.factor(slope),
         ca = as.factor(ca),
         thal = as.factor(thal),
         target = as.factor(target))

set.seed(345)
train_split <- initial_split(heart_wr, prop = .8)
heart_wr_train <- training(train_split)
heart_wr_test <- testing(train_split)
heart_wr_test_knn <- testing(train_split)

Checking the proportion of our target variable.
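The proportions below are presumably from a simple frequency table; a minimal sketch, assuming the check is run on the full heart_wr data (whose 0/1 shares match the figures shown):

prop.table(table(heart_wr$target))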
##
## 0 1
## 0.4554455 0.5445545
I'd say our target variable is reasonably balanced, at roughly a 46:54 ratio.
We first fit a full logistic regression model with all predictors; after variable selection, we will use the second version of the model (v2) for prediction.
model_log_heart <- glm(formula = target~., data = heart_wr_train, family = "binomial")
summary(model_log_heart)

##
## Call:
## glm(formula = target ~ ., family = "binomial", data = heart_wr_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5945 -0.3482 0.1299 0.4521 3.0798
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.216e+00 3.423e+00 -0.647 0.517426
## age 1.741e-02 2.867e-02 0.607 0.543739
## sex1 -1.933e+00 6.185e-01 -3.126 0.001774 **
## cp1 1.256e+00 6.614e-01 1.898 0.057637 .
## cp2 1.590e+00 5.421e-01 2.932 0.003365 **
## cp3 2.511e+00 8.134e-01 3.087 0.002022 **
## trestbps -1.663e-02 1.328e-02 -1.251 0.210765
## chol -4.870e-03 4.558e-03 -1.068 0.285384
## fbs1 8.340e-01 6.472e-01 1.289 0.197545
## restecg1 4.543e-01 4.394e-01 1.034 0.301161
## restecg2 -1.406e+01 1.263e+03 -0.011 0.991116
## thalach 2.514e-02 1.305e-02 1.927 0.053994 .
## exang1 -3.694e-01 4.912e-01 -0.752 0.452060
## oldpeak -2.348e-01 2.662e-01 -0.882 0.377794
## slope1 2.628e-01 1.020e+00 0.258 0.796634
## slope2 1.488e+00 1.129e+00 1.318 0.187544
## ca1 -2.117e+00 5.850e-01 -3.619 0.000296 ***
## ca2 -3.369e+00 9.695e-01 -3.475 0.000510 ***
## ca3 -1.970e+00 9.270e-01 -2.125 0.033588 *
## ca4 7.893e-01 1.640e+00 0.481 0.630354
## thal1 2.286e+00 2.058e+00 1.111 0.266658
## thal2 2.287e+00 1.920e+00 1.191 0.233625
## thal3 8.857e-01 1.930e+00 0.459 0.646313
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 335.05 on 242 degrees of freedom
## Residual deviance: 150.93 on 220 degrees of freedom
## AIC: 196.93
##
## Number of Fisher Scoring iterations: 15
Next, we run stepwise variable selection on the full model.
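The model printed below is consistent with an AIC-based step() call on the full model; a minimal sketch, assuming both-direction selection with the per-step printout suppressed (the exact call is not shown in the source):

model_log_heart_step <- step(model_log_heart, direction = "both", trace = FALSE)
model_log_heart_step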
##
## Call: glm(formula = target ~ sex + cp + trestbps + thalach + slope +
## ca + thal, family = "binomial", data = heart_wr_train)
##
## Coefficients:
## (Intercept) sex1 cp1 cp2 cp3 trestbps
## -2.03553 -1.74001 1.59716 1.84120 2.76912 -0.01895
## thalach slope1 slope2 ca1 ca2 ca3
## 0.02514 0.30537 1.70208 -2.02765 -3.06457 -1.72804
## ca4 thal1 thal2 thal3
## 1.29932 1.76422 1.69175 0.22606
##
## Degrees of Freedom: 242 Total (i.e. Null); 227 Residual
## Null Deviance: 335.1
## Residual Deviance: 158.4 AIC: 190.4
We tried the full set of variables recommended by the stepwise selection and found that thal is not statistically significant, so we decided to drop it from the model.
model_log_heart_v2 <- glm(formula = target ~ sex + cp + trestbps + thalach + slope +
ca, family = "binomial", data = heart_wr_train)
summary(model_log_heart_v2)

##
## Call:
## glm(formula = target ~ sex + cp + trestbps + thalach + slope +
## ca, family = "binomial", data = heart_wr_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4021 -0.4573 0.1312 0.4835 2.9851
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.35131 2.35016 -0.149 0.881171
## sex1 -2.10903 0.49646 -4.248 2.16e-05 ***
## cp1 1.89121 0.60176 3.143 0.001674 **
## cp2 1.96363 0.47660 4.120 3.79e-05 ***
## cp3 2.88931 0.77280 3.739 0.000185 ***
## trestbps -0.02348 0.01099 -2.137 0.032621 *
## thalach 0.02559 0.01106 2.314 0.020690 *
## slope1 0.30869 0.83322 0.370 0.711030
## slope2 1.90695 0.84690 2.252 0.024342 *
## ca1 -1.97334 0.52718 -3.743 0.000182 ***
## ca2 -3.00723 0.76657 -3.923 8.75e-05 ***
## ca3 -1.78755 0.78563 -2.275 0.022887 *
## ca4 0.64862 1.59084 0.408 0.683478
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 335.05 on 242 degrees of freedom
## Residual deviance: 170.72 on 230 degrees of freedom
## AIC: 196.72
##
## Number of Fisher Scoring iterations: 6
We classify a patient as positive when the predicted probability exceeds 0.4; lowering the cutoff below the default 0.5 trades some specificity for the higher sensitivity we care about here.

heart_wr_test$predict_risk <- predict(model_log_heart_v2, heart_wr_test, type = "response")
heart_wr_test$pred_label <- as.factor(ifelse(heart_wr_test$predict_risk > .4, "1", "0"))

logreg_conf <- confusionMatrix(data = heart_wr_test$pred_label, reference = heart_wr_test$target, positive = "1")
logreg_conf

## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 21 2
## 1 6 31
##
## Accuracy : 0.8667
## 95% CI : (0.7541, 0.9406)
## No Information Rate : 0.55
## P-Value [Acc > NIR] : 1.653e-07
##
## Kappa : 0.727
##
## Mcnemar's Test P-Value : 0.2888
##
## Sensitivity : 0.9394
## Specificity : 0.7778
## Pos Pred Value : 0.8378
## Neg Pred Value : 0.9130
## Prevalence : 0.5500
## Detection Rate : 0.5167
## Detection Prevalence : 0.6167
## Balanced Accuracy : 0.8586
##
## 'Positive' Class : 1
##
The kNN method (as implemented in class::knn) only accepts numeric predictors. Therefore, we select the numeric predictors and scale them so they can be processed by kNN.
# Select the numeric predictors and scale the training set
numericIndex <- sapply(X = heart_wr_train, FUN = is.numeric)
heart_wr_train_x <- scale(heart_wr_train[, numericIndex])
heart_wr_train_y <- select(heart_wr_train, target)

# Scale the test set using the training set's centers and scales
heart_wr_test_knn_x <- scale(heart_wr_test_knn[, numericIndex],
                             center = attr(heart_wr_train_x, "scaled:center"),
                             scale = attr(heart_wr_train_x, "scaled:scale"))
heart_wr_test_knn_y <- select(heart_wr_test_knn, target)

## [1] 16
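The 16 printed above does not come from any code shown here; it presumably reflects the common k ≈ sqrt(n) rule of thumb applied to the training set (a hypothetical reconstruction):

# square-root-of-n heuristic for picking k; with 243 training rows this gives 16
round(sqrt(nrow(heart_wr_train)))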
Since we have 2 target classes, we round k up to the nearest odd number, 17, to avoid tied votes.
library(class)
heart_wr_knn_pred <- knn(train = as.data.frame(heart_wr_train_x),
                         test = as.data.frame(heart_wr_test_knn_x),
                         cl = heart_wr_train_y$target,
                         k = 17)

knn_conf <- confusionMatrix(heart_wr_knn_pred, reference = heart_wr_test_knn_y$target, positive = "1")
knn_conf

## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 16 5
## 1 11 28
##
## Accuracy : 0.7333
## 95% CI : (0.6034, 0.8393)
## No Information Rate : 0.55
## P-Value [Acc > NIR] : 0.002709
##
## Kappa : 0.4502
##
## Mcnemar's Test P-Value : 0.211300
##
## Sensitivity : 0.8485
## Specificity : 0.5926
## Pos Pred Value : 0.7179
## Neg Pred Value : 0.7619
## Prevalence : 0.5500
## Detection Rate : 0.4667
## Detection Prevalence : 0.6500
## Balanced Accuracy : 0.7205
##
## 'Positive' Class : 1
##
eval_log <- data.frame(Accuracy = logreg_conf$overall[1],
Recall = logreg_conf$byClass[1],
Specificity = logreg_conf$byClass[2],
Precision = logreg_conf$byClass[3])
eval_knn <- data.frame(Accuracy = knn_conf$overall[1],
Recall = knn_conf$byClass[1],
Specificity = knn_conf$byClass[2],
Precision = knn_conf$byClass[3])
eval_total <- rbind(eval_log, eval_knn)
rownames(eval_total) <- c("Logistic Regression", "kNN")
eval_total

The metric we are most interested in for this exercise is Sensitivity (Recall): the closer it is to 1, the better. In medical settings, it is often preferable to have more "false alarms" (False Positives) than the other way around, where a patient who has the disease is diagnosed as healthy (a False Negative).
Sensitivity, or Recall, is also known as the true positive rate. A good way to understand it: suppose 100 people in our data are actually sick. A recall of 0.9 means the model correctly flags 90 of them as sick and misses the remaining 10, who are sick but classified as healthy.
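As a quick sanity check, recall can be read straight off the confusion matrices above as TP / (TP + FN):

# Logistic regression: 31 true positives, 2 false negatives
31 / (31 + 2)   # 0.9394
# kNN: 28 true positives, 5 false negatives
28 / (28 + 5)   # 0.8485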
Our logistic regression model performed better than our kNN model, with a Sensitivity/Recall of 0.94 compared to 0.85 for kNN. We believe this is because the categorical variables we had to leave out of the kNN model carry a lot of information about the outcome.
While kNN often does well when all predictors are numerical and on a comparable scale, in this case, where the predictors are a mix of categorical and numerical variables, logistic regression takes the cake.
The variables used in our final logistic regression model are: sex, chest pain type, resting blood pressure, maximum heart rate achieved, slope of the peak exercise ST segment, and number of major vessels colored by fluoroscopy.