We will build machine learning models with logistic regression and the K-NN method. The objective is to predict whether a person has heart disease or not. The dataset is obtained from Kaggle; it contains 14 columns and 303 rows.
First, we load the required packages, read the data, and take a quick look at its structure.
library(tidyverse)
library(GGally)
library(caret)
heart <- read.csv("heart.csv")
glimpse(heart)
## Rows: 303
## Columns: 14
## $ age <int> 63, 37, 41, 56, 57, 57, 56, 44, 52, 57, 54, 48, 49, 64, 58...
## $ sex <int> 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0...
## $ cp <int> 3, 2, 1, 1, 0, 0, 1, 1, 2, 2, 0, 2, 1, 3, 3, 2, 2, 3, 0, 3...
## $ trestbps <int> 145, 130, 130, 120, 120, 140, 140, 120, 172, 150, 140, 130...
## $ chol <int> 233, 250, 204, 236, 354, 192, 294, 263, 199, 168, 239, 275...
## $ fbs <int> 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0...
## $ restecg <int> 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1...
## $ thalach <int> 150, 187, 172, 178, 163, 148, 153, 173, 162, 174, 160, 139...
## $ exang <int> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0...
## $ oldpeak <dbl> 2.3, 3.5, 1.4, 0.8, 0.6, 0.4, 1.3, 0.0, 0.5, 1.6, 1.2, 0.2...
## $ slope <int> 0, 0, 2, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 1, 2, 1, 2, 0, 2, 2...
## $ ca <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2...
## $ thal <int> 1, 2, 2, 2, 2, 1, 2, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2...
## $ target <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
We need to change some of the data types into factors.
heart1 <- heart %>%
  mutate(sex = ifelse(sex == 0, "Female", "Male"),
         cp = as.factor(cp),
         fbs = ifelse(fbs == 0, "False", "True"),
         restecg = as.factor(restecg),
         exang = ifelse(exang == 0, "No", "Yes"),
         oldpeak = as.factor(oldpeak),
         slope = as.factor(slope),
         ca = as.factor(ca),
         thal = as.factor(thal),
         target = ifelse(target == 0, "No", "Yes")) %>%
  mutate_if(is.character, as.factor) %>%
  glimpse()
## Rows: 303
## Columns: 14
## $ age <int> 63, 37, 41, 56, 57, 57, 56, 44, 52, 57, 54, 48, 49, 64, 58...
## $ sex <fct> Male, Male, Female, Male, Female, Male, Female, Male, Male...
## $ cp <fct> 3, 2, 1, 1, 0, 0, 1, 1, 2, 2, 0, 2, 1, 3, 3, 2, 2, 3, 0, 3...
## $ trestbps <int> 145, 130, 130, 120, 120, 140, 140, 120, 172, 150, 140, 130...
## $ chol <int> 233, 250, 204, 236, 354, 192, 294, 263, 199, 168, 239, 275...
## $ fbs <fct> True, False, False, False, False, False, False, False, Tru...
## $ restecg <fct> 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1...
## $ thalach <int> 150, 187, 172, 178, 163, 148, 153, 173, 162, 174, 160, 139...
## $ exang <fct> No, No, No, No, Yes, No, No, No, No, No, No, No, No, Yes, ...
## $ oldpeak <fct> 2.3, 3.5, 1.4, 0.8, 0.6, 0.4, 1.3, 0, 0.5, 1.6, 1.2, 0.2, ...
## $ slope <fct> 0, 0, 2, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 1, 2, 1, 2, 0, 2, 2...
## $ ca <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2...
## $ thal <fct> 1, 2, 2, 2, 2, 1, 2, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2...
## $ target <fct> Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes...
After that, we need to check whether there are any NAs in the dataset.
colSums(is.na(heart1))
## age sex cp trestbps chol fbs restecg thalach
## 0 0 0 0 0 0 0 0
## exang oldpeak slope ca thal target
## 0 0 0 0 0 0
Now we know that the data contain no NAs, and we can move on to exploring them.
First, we need to know the proportion of the target classes.
round(prop.table(table(heart1$target)), 2)
##
## No Yes
## 0.46 0.54
table(heart1$target)
##
## No Yes
## 138 165
The proportions are quite balanced, so we don't need to upsample the data.
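For reference only: if the classes had been heavily imbalanced, one option would be to upsample the minority class with caret's upSample(). This is just an illustrative sketch (the new object name balanced is introduced here for the example); it is not needed for this dataset.
# Illustrative sketch: resample the minority class until both classes
# have equal counts. Not required here because the classes are balanced.
balanced <- upSample(x = heart1 %>% select(-target),
                     y = heart1$target,
                     yname = "target")
table(balanced$target)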
Next, we split the dataset into train and test sets. This lets us check whether the model can classify new data it has never seen. The data will be split with a ratio of 80/20 (80% of the data for training, 20% for testing).
set.seed(123)
row_data <- nrow(heart1)
index <- sample(row_data, row_data*0.8)
data_train <- heart1[ index, ]
data_test <- heart1[-index, ]
Now we will build the logistic regression model, with target as the response variable. For this case, we can use stepwise selection in both directions, from an empty model (no predictors) up to the full model (all predictors).
set.seed(123)
model_full <- glm(target ~ ., data_train, family = "binomial")
model_none <- glm(target ~ 1, data_train, family = "binomial")
model_step <- step(model_none,
scope = list(lower = model_none, upper = model_full),
direction = "both",
trace = 0
)
summary(model_step)
##
## Call:
## glm(formula = target ~ thal + ca + exang + slope + restecg +
## sex + cp + trestbps + thalach, family = "binomial", data = data_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7131 -0.3347 0.1409 0.4946 2.8580
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -14.57528 2399.54627 -0.006 0.99515
## thal1 17.06009 2399.54493 0.007 0.99433
## thal2 16.20605 2399.54482 0.007 0.99461
## thal3 14.74306 2399.54480 0.006 0.99510
## ca1 -1.25908 0.54096 -2.328 0.01994 *
## ca2 -3.27025 0.71848 -4.552 5.32e-06 ***
## ca3 -1.83853 0.93903 -1.958 0.05024 .
## ca4 1.56751 1.84412 0.850 0.39532
## exangYes -0.96791 0.48579 -1.992 0.04632 *
## slope1 -0.66512 0.86447 -0.769 0.44166
## slope2 0.69563 0.89318 0.779 0.43608
## restecg1 0.84330 0.43422 1.942 0.05213 .
## restecg2 -14.66681 1589.24431 -0.009 0.99264
## sexMale -1.69406 0.57028 -2.971 0.00297 **
## cp1 1.02171 0.67414 1.516 0.12963
## cp2 1.36972 0.56154 2.439 0.01472 *
## cp3 1.94003 0.69914 2.775 0.00552 **
## trestbps -0.01969 0.01144 -1.721 0.08516 .
## thalach 0.01932 0.01231 1.569 0.11666
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 333.48 on 241 degrees of freedom
## Residual deviance: 155.45 on 223 degrees of freedom
## AIC: 193.45
##
## Number of Fisher Scoring iterations: 15
Let's use the model to predict on the test data, then use a confusion matrix to evaluate it.
pred_test <- predict(model_step, data_test, type = "response")
pred_class_test <- ifelse(pred_test > 0.5, "Yes", "No") %>%
as.factor
confusionMatrix(pred_class_test, data_test$target, positive = "Yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 22 5
## Yes 6 28
##
## Accuracy : 0.8197
## 95% CI : (0.7002, 0.9064)
## No Information Rate : 0.541
## P-Value [Acc > NIR] : 4.82e-06
##
## Kappa : 0.6359
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.8485
## Specificity : 0.7857
## Pos Pred Value : 0.8235
## Neg Pred Value : 0.8148
## Prevalence : 0.5410
## Detection Rate : 0.4590
## Detection Prevalence : 0.5574
## Balanced Accuracy : 0.8171
##
## 'Positive' Class : Yes
##
Based on these results, the model performs quite well: accuracy is 81.97%, sensitivity is 84.85%, specificity is 78.57%, and precision (Pos Pred Value) is 82.35%.
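As a quick sanity check, these metrics can be recomputed by hand from the counts in the confusion matrix above, with "Yes" as the positive class.
# Recompute the test-set metrics directly from the confusion matrix counts
# (TP = predicted Yes & actual Yes, TN = predicted No & actual No, etc.).
TP <- 28; TN <- 22; FP <- 6; FN <- 5
accuracy    <- (TP + TN) / (TP + TN + FP + FN)  # 50/61 = 0.8197
sensitivity <- TP / (TP + FN)                   # 28/33 = 0.8485
specificity <- TN / (TN + FP)                   # 22/28 = 0.7857
precision   <- TP / (TP + FP)                   # 28/34 = 0.8235
round(c(accuracy, sensitivity, specificity, precision), 4)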
Let's compare the predictions on the test data to those on the train data to see whether the model is underfitting, overfitting, or about right.
pred_train <- predict(model_step, data_train, type = "response")
pred_class_train <- ifelse(pred_train > 0.5, "Yes", "No") %>%
as.factor()
confusionMatrix(pred_class_train, data_train$target, positive = "Yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 89 12
## Yes 21 120
##
## Accuracy : 0.8636
## 95% CI : (0.8139, 0.9042)
## No Information Rate : 0.5455
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.7231
##
## Mcnemar's Test P-Value : 0.1637
##
## Sensitivity : 0.9091
## Specificity : 0.8091
## Pos Pred Value : 0.8511
## Neg Pred Value : 0.8812
## Prevalence : 0.5455
## Detection Rate : 0.4959
## Detection Prevalence : 0.5826
## Balanced Accuracy : 0.8591
##
## 'Positive' Class : Yes
##
If we compare the test data and the train data, accuracy differs by 4.39%, sensitivity by 6.06%, specificity by 2.34%, and precision (Pos Pred Value) by 2.76%. Since every difference is below 10%, we can say the model is neither underfitting nor overfitting.
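As a small optional sketch, those gaps can also be computed programmatically by storing both confusionMatrix() results and subtracting the metrics; the object names cm_train and cm_test are introduced here only for illustration.
# Store both confusion matrices and compare the metrics directly
# instead of reading them off by eye.
cm_test  <- confusionMatrix(pred_class_test,  data_test$target,  positive = "Yes")
cm_train <- confusionMatrix(pred_class_train, data_train$target, positive = "Yes")
round(cm_train$overall["Accuracy"] - cm_test$overall["Accuracy"], 4)
round(cm_train$byClass[c("Sensitivity", "Specificity", "Pos Pred Value")] -
        cm_test$byClass[c("Sensitivity", "Specificity", "Pos Pred Value")], 4)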
Let's build another model with the K-NN method. First, we need to change the target variable into a factor.
heart2 <- heart %>%
  mutate(target = as.factor(target))
After that, as with the logistic regression model, we need to split the data.
set.seed(123)
row_data2 <- nrow(heart2)
index2 <- sample(row_data2, row_data2*0.8)
data_train2 <- heart2[index2, ]
data_test2 <- heart2[-index2, ]
We need to scale the predictors with the z-score method. First, we scale the train data; then we use the centers and standard deviations obtained from the train data to scale the test data. We do this to prevent data leakage.
train_x <- data_train2 %>%
select(-target) %>%
scale()
train_y <- data_train2$target
test_x <- data_test2 %>%
select(-target) %>%
scale(center = attr(train_x, "scaled:center"),
scale = attr(train_x, "scaled:scale")
)
test_y <- data_test2$target
To make predictions, we need to choose the number of nearest neighbors k. A common rule of thumb is the square root of the number of training rows.
k_choose <- sqrt(nrow(train_x)) %>% floor()
k_choose
## [1] 15
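The square-root rule is only a heuristic. As an alternative sketch (not run in the original analysis), k could also be tuned by cross-validation with caret's train(), using the scaled training matrix and labels prepared above.
# Illustrative alternative to the square-root rule: pick k by 5-fold
# cross-validation over a grid of odd values.
knn_cv <- train(x = train_x,
                y = train_y,
                method = "knn",
                tuneGrid = data.frame(k = seq(3, 25, by = 2)),
                trControl = trainControl(method = "cv", number = 5))
knn_cv$bestTune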
After we get the number, we can use knn3Train() to get the predictions.
pred_knn <- knn3Train(train = train_x,
cl = train_y,
test = test_x,
k = k_choose
) %>%
  as.factor()
As with the logistic regression model, we can evaluate it with a confusion matrix.
confusionMatrix(pred_knn, test_y)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 18 3
## 1 10 30
##
## Accuracy : 0.7869
## 95% CI : (0.6632, 0.8814)
## No Information Rate : 0.541
## P-Value [Acc > NIR] : 5.878e-05
##
## Kappa : 0.5626
##
## Mcnemar's Test P-Value : 0.09609
##
## Sensitivity : 0.6429
## Specificity : 0.9091
## Pos Pred Value : 0.8571
## Neg Pred Value : 0.7500
## Prevalence : 0.4590
## Detection Rate : 0.2951
## Detection Prevalence : 0.3443
## Balanced Accuracy : 0.7760
##
## 'Positive' Class : 0
##
Based on these results, the K-NN model shows higher specificity and precision than the logistic regression model, but lower accuracy and sensitivity. Note that confusionMatrix() here treats "0" (no heart disease) as the positive class, while the logistic model used "Yes", so the class-specific metrics are reported from opposite perspectives and should be compared with care.
If we are a doctor treating patients, we need the model to have high precision: we don't want a patient to receive the wrong treatment because of the model. In this case, the two models have similar precision, differing by only about 3%. If we want the better-scoring model and don't need interpretation, K-NN is the choice; but if we want a model that can be interpreted, we can choose logistic regression. My personal choice in this case is logistic regression: even though it has slightly lower precision, it has higher accuracy and sensitivity, and it is interpretable.
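To illustrate that interpretability, one could (as a sketch, using the fitted model_step above) exponentiate the coefficients to read them as odds ratios.
# Sketch of how the logistic model can be interpreted: exponentiating the
# coefficients of model_step gives odds ratios for each predictor level.
# Values above 1 increase the odds of heart disease; values below 1 decrease them.
round(exp(coef(model_step)), 3)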