In this notebook I will build a prediction model to determine whether a patient (or you) has a healthy or unhealthy heart condition, using a Logistic Regression model and a KNN model.
The prediction should be used by both the medical examiner and the patient as guidance on whether or not to proceed with a medical procedure, rather than as an absolute answer.
options(scipen = 999)
library(dplyr)
library(gtools)
library(ggplot2)
library(class)
library(tidyr)
library(rsample)
library(MASS)
library(caret)
heart = read.csv('heart.csv')
head(heart)
Data Dictionary :
age
sex : 1 = male, 0 = female
chest pain type (4 values)
resting blood pressure
serum cholesterol in mg/dl
fasting blood sugar > 120 mg/dl : 1 = yes, 0 = no
resting electrocardiographic results (values 0,1,2)
maximum heart rate achieved
exercise induced angina : 1 = yes, 0 = no
oldpeak = ST depression induced by exercise relative to rest
the slope of the peak exercise ST segment
number of major vessels (0-3) colored by fluoroscopy
thal: 3 = normal; 6 = fixed defect; 7 = reversible defect
target : 1 = Not Healthy, 0 = Healthy
summary(heart)
#> ï..age sex cp trestbps
#> Min. :29.00 Min. :0.0000 Min. :0.000 Min. : 94.0
#> 1st Qu.:47.50 1st Qu.:0.0000 1st Qu.:0.000 1st Qu.:120.0
#> Median :55.00 Median :1.0000 Median :1.000 Median :130.0
#> Mean :54.37 Mean :0.6832 Mean :0.967 Mean :131.6
#> 3rd Qu.:61.00 3rd Qu.:1.0000 3rd Qu.:2.000 3rd Qu.:140.0
#> Max. :77.00 Max. :1.0000 Max. :3.000 Max. :200.0
#> chol fbs restecg thalach
#> Min. :126.0 Min. :0.0000 Min. :0.0000 Min. : 71.0
#> 1st Qu.:211.0 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:133.5
#> Median :240.0 Median :0.0000 Median :1.0000 Median :153.0
#> Mean :246.3 Mean :0.1485 Mean :0.5281 Mean :149.6
#> 3rd Qu.:274.5 3rd Qu.:0.0000 3rd Qu.:1.0000 3rd Qu.:166.0
#> Max. :564.0 Max. :1.0000 Max. :2.0000 Max. :202.0
#> exang oldpeak slope ca
#> Min. :0.0000 Min. :0.00 Min. :0.000 Min. :0.0000
#> 1st Qu.:0.0000 1st Qu.:0.00 1st Qu.:1.000 1st Qu.:0.0000
#> Median :0.0000 Median :0.80 Median :1.000 Median :0.0000
#> Mean :0.3267 Mean :1.04 Mean :1.399 Mean :0.7294
#> 3rd Qu.:1.0000 3rd Qu.:1.60 3rd Qu.:2.000 3rd Qu.:1.0000
#> Max. :1.0000 Max. :6.20 Max. :2.000 Max. :4.0000
#> thal target
#> Min. :0.000 Min. :0.0000
#> 1st Qu.:2.000 1st Qu.:0.0000
#> Median :2.000 Median :1.0000
#> Mean :2.314 Mean :0.5446
#> 3rd Qu.:3.000 3rd Qu.:1.0000
#> Max. :3.000 Max. :1.0000
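Note that the first column appears as ï..age: a UTF-8 byte-order mark leaked into the header when read.csv decoded the file. A minimal sketch of the fix, left commented out here so the later code that references ï..age still matches its output:
# re-read the file declaring the BOM so the first column is simply named 'age'
# heart <- read.csv('heart.csv', fileEncoding = 'UTF-8-BOM')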
Check target class proportion.
prop.table(table(heart$target))
#>
#> 0 1
#> 0.4554455 0.5445545
Our classes are balanced, so our model will be able to learn from both classes.
Change the labels to their proper names rather than zero and one, so they will be easier to interpret.
heart <- heart %>%
mutate_if(is.integer, as.factor) %>%
mutate(sex = factor(sex, levels = c(0,1), labels = c("Female", "Male")),
fbs = factor(fbs, levels = c(0,1), labels = c("No", "Yes")),
exang = factor(exang, levels = c(0,1), labels = c("No", "Yes")),
target = factor(target, levels = c(0,1), labels = c("Healthy", "Not Healthy")),
ï..age = as.integer(ï..age),
trestbps = as.integer(trestbps),
chol = as.integer(chol),
thalach = as.integer(thalach))
Check for missing values.
anyNA(heart)
#> [1] FALSE
No missing values found; we can move on to the next step.
Split the data for training and testing.
set.seed(420)
index <- initial_split(heart, 0.7, strata = 'target')
heart.train <- training(index)
heart.test <- testing(index)
Check the proportions again on both sets; hopefully they retain the original class proportion.
prop.table(table(heart.train$target))
#>
#> Healthy Not Healthy
#> 0.4553991 0.5446009
prop.table(table(heart.test$target))
#>
#> Healthy Not Healthy
#> 0.4555556 0.5444444
The easiest way to choose predictors without domain knowledge is to create a model using all available variables, and then eliminate variables one by one using a stepwise procedure.
init_all <- glm(target~., 'binomial', heart.train)
Backward Stepwise Method
backstep_model <- step(init_all, direction = 'backward', trace = 0)
summary(backstep_model)
#>
#> Call:
#> glm(formula = target ~ ï..age + sex + cp + trestbps + chol +
#> fbs + thalach + exang + slope + ca + thal, family = "binomial",
#> data = heart.train)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -2.8734 -0.3131 0.1023 0.4336 3.3986
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 13.624107 1455.398180 0.009 0.99253
#> ï..age 0.055260 0.033267 1.661 0.09669 .
#> sexMale -1.894728 0.631525 -3.000 0.00270 **
#> cp1 0.626882 0.699937 0.896 0.37045
#> cp2 1.979165 0.659954 2.999 0.00271 **
#> cp3 1.528225 0.816119 1.873 0.06113 .
#> trestbps -0.066326 0.025622 -2.589 0.00963 **
#> chol -0.015633 0.007114 -2.198 0.02798 *
#> fbsYes 1.276485 0.778762 1.639 0.10119
#> thalach 0.032078 0.015280 2.099 0.03579 *
#> exangYes -0.885005 0.575772 -1.537 0.12427
#> slope1 -1.252268 0.971136 -1.289 0.19723
#> slope2 0.455515 0.986081 0.462 0.64412
#> ca1 -2.175795 0.622176 -3.497 0.00047 ***
#> ca2 -4.739092 1.029772 -4.602 0.00000418 ***
#> ca3 -2.091254 0.998433 -2.095 0.03621 *
#> ca4 1.459926 1.700220 0.859 0.39052
#> thal1 -11.883109 1455.398489 -0.008 0.99349
#> thal2 -10.806283 1455.398015 -0.007 0.99408
#> thal3 -12.343589 1455.397972 -0.008 0.99323
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 293.58 on 212 degrees of freedom
#> Residual deviance: 127.65 on 193 degrees of freedom
#> AIC: 167.65
#>
#> Number of Fisher Scoring iterations: 14
Apparently, the thal variable indicates perfect separation (identifiable by the unusually high log-odds estimates and exploded standard errors), so we have to remove it.
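As a quick check, a minimal sketch reusing the summary above: coefficients affected by separation show wildly inflated standard errors, so we can flag them programmatically (the cutoff of 100 is an arbitrary choice of mine).
# flag coefficients whose standard errors exploded,
# a common symptom of (quasi-)complete separation
coefs <- summary(backstep_model)$coefficients
coefs[coefs[, "Std. Error"] > 100, , drop = FALSE]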
model_heart <- glm(target ~ sex + cp + trestbps + exang + oldpeak + ca + slope, 'binomial', heart.train)
summary(model_heart)
#>
#> Call:
#> glm(formula = target ~ sex + cp + trestbps + exang + oldpeak +
#> ca + slope, family = "binomial", data = heart.train)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -2.8501 -0.3986 0.1696 0.5160 2.6324
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 4.56710 1.39230 3.280 0.001037 **
#> sexMale -1.72180 0.52842 -3.258 0.001120 **
#> cp1 1.03857 0.64351 1.614 0.106543
#> cp2 2.03498 0.57740 3.524 0.000424 ***
#> cp3 2.07917 0.82697 2.514 0.011930 *
#> trestbps -0.04457 0.02199 -2.027 0.042628 *
#> exangYes -1.18674 0.52438 -2.263 0.023629 *
#> oldpeak -0.57757 0.26333 -2.193 0.028285 *
#> ca1 -1.94683 0.53520 -3.638 0.000275 ***
#> ca2 -3.43424 0.83659 -4.105 0.0000404 ***
#> ca3 -1.70245 0.85405 -1.993 0.046219 *
#> ca4 0.55490 1.62975 0.340 0.733494
#> slope1 -2.09305 1.03141 -2.029 0.042427 *
#> slope2 -0.47960 1.12540 -0.426 0.669992
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 293.58 on 212 degrees of freedom
#> Residual deviance: 144.37 on 199 degrees of freedom
#> AIC: 172.37
#>
#> Number of Fisher Scoring iterations: 6
Now that we have a model, let's evaluate it.
Run the prediction on test set.
#Model Prediction
heart_pred <- predict(model_heart, heart.test, 'response')
By default the threshold is 0.5, but I adjust it to 0.4 to prioritize recall.
heart_pred_class <- as.factor(ifelse(heart_pred > 0.4, 'Not Healthy', 'Healthy'))
confusionMatrix(heart_pred_class, heart.test$target, positive = "Not Healthy")
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction Healthy Not Healthy
#> Healthy 30 5
#> Not Healthy 11 44
#>
#> Accuracy : 0.8222
#> 95% CI : (0.7274, 0.8948)
#> No Information Rate : 0.5444
#> P-Value [Acc > NIR] : 0.0000000284
#>
#> Kappa : 0.6373
#>
#> Mcnemar's Test P-Value : 0.2113
#>
#> Sensitivity : 0.8980
#> Specificity : 0.7317
#> Pos Pred Value : 0.8000
#> Neg Pred Value : 0.8571
#> Prevalence : 0.5444
#> Detection Rate : 0.4889
#> Detection Prevalence : 0.6111
#> Balanced Accuracy : 0.8148
#>
#> 'Positive' Class : Not Healthy
#>
Because we do not want to misdiagnose an unhealthy person as healthy, I reduced the threshold to 0.4. It could be lower, but that would result in more false positives (diagnosing healthy people as unhealthy). This might sound 'bad' for now, but if you think about it, in the healthcare industry you most definitely do not want to misdiagnose your patients.
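To see the trade-off concretely, here is a minimal sketch (the threshold grid is my own choice) that sweeps the cutoff and reports recall and specificity on the test set, using caret's sensitivity() and specificity() helpers.
# sweep the classification threshold: lower cutoffs raise recall at the cost of specificity
sapply(c(0.3, 0.4, 0.5, 0.6), function(t) {
  pred <- factor(ifelse(heart_pred > t, "Not Healthy", "Healthy"),
                 levels = levels(heart.test$target))
  c(threshold = t,
    recall = sensitivity(pred, heart.test$target, positive = "Not Healthy"),
    specificity = specificity(pred, heart.test$target, negative = "Healthy"))
})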
exp(model_heart$coefficients)
#> (Intercept) sexMale cp1 cp2 cp3 trestbps
#> 96.26421814 0.17874380 2.82518196 7.65212794 7.99781582 0.95640458
#> exangYes oldpeak ca1 ca2 ca3 ca4
#> 0.30521567 0.56126227 0.14272622 0.03224996 0.18223599 1.74176524
#> slope1 slope2
#> 0.12330988 0.61903111
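As a worked reading of one of these odds ratios (interpretation only, based on the output above): the value for sexMale is about 0.179, meaning that, holding the other predictors fixed, a male patient's odds of being Not Healthy are roughly 82% lower than a female patient's in this training data.
# convert an odds ratio into a percentage change in odds (worked example)
or_male <- exp(model_heart$coefficients["sexMale"])
(or_male - 1) * 100 # about -82: odds of "Not Healthy" are ~82% lower for males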
The KNN model does not work well with categorical variables, and unfortunately most of our variables are categorical. To make our data work properly with the KNN model, we have to transform each categorical variable into multiple dummy (one-hot) variables.
heart2 <- dummyVars(" ~target+sex+cp+fbs+exang+oldpeak+slope+ca+thal", data = heart)
heart2 <- data.frame(predict(heart2, newdata = heart))
str(heart2)
#> 'data.frame': 303 obs. of 25 variables:
#> $ target.Healthy : num 0 0 0 0 0 0 0 0 0 0 ...
#> $ target.Not.Healthy: num 1 1 1 1 1 1 1 1 1 1 ...
#> $ sex.Female : num 0 0 1 0 1 0 1 0 0 0 ...
#> $ sex.Male : num 1 1 0 1 0 1 0 1 1 1 ...
#> $ cp.0 : num 0 0 0 0 1 1 0 0 0 0 ...
#> $ cp.1 : num 0 0 1 1 0 0 1 1 0 0 ...
#> $ cp.2 : num 0 1 0 0 0 0 0 0 1 1 ...
#> $ cp.3 : num 1 0 0 0 0 0 0 0 0 0 ...
#> $ fbs.No : num 0 1 1 1 1 1 1 1 0 1 ...
#> $ fbs.Yes : num 1 0 0 0 0 0 0 0 1 0 ...
#> $ exang.No : num 1 1 1 1 0 1 1 1 1 1 ...
#> $ exang.Yes : num 0 0 0 0 1 0 0 0 0 0 ...
#> $ oldpeak : num 2.3 3.5 1.4 0.8 0.6 0.4 1.3 0 0.5 1.6 ...
#> $ slope.0 : num 1 1 0 0 0 0 0 0 0 0 ...
#> $ slope.1 : num 0 0 0 0 0 1 1 0 0 0 ...
#> $ slope.2 : num 0 0 1 1 1 0 0 1 1 1 ...
#> $ ca.0 : num 1 1 1 1 1 1 1 1 1 1 ...
#> $ ca.1 : num 0 0 0 0 0 0 0 0 0 0 ...
#> $ ca.2 : num 0 0 0 0 0 0 0 0 0 0 ...
#> $ ca.3 : num 0 0 0 0 0 0 0 0 0 0 ...
#> $ ca.4 : num 0 0 0 0 0 0 0 0 0 0 ...
#> $ thal.0 : num 0 0 0 0 0 0 0 0 0 0 ...
#> $ thal.1 : num 1 0 0 0 0 1 0 0 0 0 ...
#> $ thal.2 : num 0 1 1 1 1 0 1 0 0 1 ...
#> $ thal.3 : num 0 0 0 0 0 0 0 1 1 0 ...
Remove every negative-class dummy variable, and transform the target into a factor.
heart2$sex.Female <- NULL
heart2$fbs.No <- NULL
heart2$exang.No <- NULL
heart2$target.Healthy <- NULL
heart2$target.Not.Healthy <- as.factor(heart2$target.Not.Healthy)
Train Test Split
set.seed(420)
index <- initial_split(heart2, 0.7)
heart2.train <- training(index)
heart2.test <- testing(index)
Separating Predictor and Target
# predictor variables (train)
train_x <- heart2.train %>%
select_if(is.numeric)
# target data train
train_y <- heart2.train %>%
dplyr::select(target.Not.Healthy)
# predictor variables (test)
test_x <- heart2.test %>%
select_if(is.numeric)
# target data test
test_y <- heart2.test %>%
dplyr::select(target.Not.Healthy)
We can eliminate features that have near-zero variance, as they carry little to no information.
n0vartrainx <- nearZeroVar(train_x)
train_x <- train_x[, -n0vartrainx]
test_x <- test_x[, -n0vartrainx] # drop the same columns from the test set
Our data needs to be normalized so every variable has the same range.
train_x <- scale(train_x)
test_x <- scale(test_x,
center = attr(train_x, "scaled:center"),
scale = attr(train_x, "scaled:scale"))
heartknn <- knn(train = train_x, test = test_x, cl = train_y$target.Not.Healthy, k = 18)
The k value is chosen as roughly the square root of the number of observations.
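For reference, a minimal sketch of that rule of thumb (computed here on the training set; preferring an odd k is a common extra convention to avoid ties in the majority vote):
# pick k near the square root of the training size, nudged to an odd value
k_guess <- round(sqrt(nrow(train_x)))
if (k_guess %% 2 == 0) k_guess <- k_guess + 1
k_guess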
confusionMatrix(data = heartknn, reference = heart2.test$target.Not.Healthy, positive = "1")
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 35 5
#> 1 8 42
#>
#> Accuracy : 0.8556
#> 95% CI : (0.7657, 0.9208)
#> No Information Rate : 0.5222
#> P-Value [Acc > NIR] : 0.00000000002547
#>
#> Kappa : 0.7097
#>
#> Mcnemar's Test P-Value : 0.5791
#>
#> Sensitivity : 0.8936
#> Specificity : 0.8140
#> Pos Pred Value : 0.8400
#> Neg Pred Value : 0.8750
#> Prevalence : 0.5222
#> Detection Rate : 0.4667
#> Detection Prevalence : 0.5556
#> Balanced Accuracy : 0.8538
#>
#> 'Positive' Class : 1
#>
To compare the two models, display both confusion matrices again.
confusionMatrix(heart_pred_class, heart.test$target, positive = "Not Healthy")
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction Healthy Not Healthy
#> Healthy 30 5
#> Not Healthy 11 44
#>
#> Accuracy : 0.8222
#> 95% CI : (0.7274, 0.8948)
#> No Information Rate : 0.5444
#> P-Value [Acc > NIR] : 0.0000000284
#>
#> Kappa : 0.6373
#>
#> Mcnemar's Test P-Value : 0.2113
#>
#> Sensitivity : 0.8980
#> Specificity : 0.7317
#> Pos Pred Value : 0.8000
#> Neg Pred Value : 0.8571
#> Prevalence : 0.5444
#> Detection Rate : 0.4889
#> Detection Prevalence : 0.6111
#> Balanced Accuracy : 0.8148
#>
#> 'Positive' Class : Not Healthy
#>
confusionMatrix(data = heartknn, reference = heart2.test$target.Not.Healthy, positive = "1")
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 35 5
#> 1 8 42
#>
#> Accuracy : 0.8556
#> 95% CI : (0.7657, 0.9208)
#> No Information Rate : 0.5222
#> P-Value [Acc > NIR] : 0.00000000002547
#>
#> Kappa : 0.7097
#>
#> Mcnemar's Test P-Value : 0.5791
#>
#> Sensitivity : 0.8936
#> Specificity : 0.8140
#> Pos Pred Value : 0.8400
#> Neg Pred Value : 0.8750
#> Prevalence : 0.5222
#> Detection Rate : 0.4667
#> Detection Prevalence : 0.5556
#> Balanced Accuracy : 0.8538
#>
#> 'Positive' Class : 1
#>
Ideally, we would use accuracy (overall performance on both classes) as the indicator of how good our model is. Realistically, we want to prioritize one of the classes; here I decided to prioritize recall (minimizing false negatives). Putting myself in a medical practitioner's shoes, I do not want a patient who is actually Not Healthy to be diagnosed as healthy; that could end in a lawsuit, and that is a monetary loss.
Our KNN model is slightly better on all metrics compared to the Logistic Regression model. If the goal is only to predict the outcome, use the KNN model, because it has better metrics and performance; however, it cannot be interpreted, so we cannot figure out which variables affect the target the most and the least. If data understanding is the goal, then use the Logistic Regression model, where every variable can be interpreted in terms of probability and odds towards the target.
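For a side-by-side view, a minimal sketch (it assumes both confusionMatrix() results are first stored in variables, which the code above prints without saving):
# store both confusion matrices, then compare headline metrics side by side
cm_glm <- confusionMatrix(heart_pred_class, heart.test$target, positive = "Not Healthy")
cm_knn <- confusionMatrix(heartknn, heart2.test$target.Not.Healthy, positive = "1")
rbind(LogReg = cm_glm$byClass[c("Sensitivity", "Specificity", "Pos Pred Value")],
      KNN = cm_knn$byClass[c("Sensitivity", "Specificity", "Pos Pred Value")])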