The objective of this project is to predict whether a hospital patient has heart disease, using the logistic regression and K-Nearest Neighbors (KNN) algorithms. This notebook also compares the performance of the two algorithms on this task.
Before going any further, we first load the libraries that will be needed.
library(ggplot2)
library(dplyr)
library(MLmetrics)
library(gtools)
library(caret)
library(class)
With the read.csv() function we can easily import the data. After looking at the table, the first column name does not seem correct, so I rename it.
heart <- read.csv(file = "heart.csv")
names(heart)[names(heart) == "ï..age"] <- "age"
All columns contain numeric values. However, some variables are actually categorical and are merely represented as numbers. The next step is to convert sex, cp, fbs, exang, and target to factors.
# Transforming data into Category
heart <- heart %>%
mutate_at(vars(target, exang, fbs, cp, sex), as.factor)
glimpse(heart)
#> Rows: 303
#> Columns: 14
#> $ age <int> 63, 37, 41, 56, 57, 57, 56, 44, 52, 57, 54, 48, 49, 64, 58, 5~
#> $ sex <fct> 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1~
#> $ cp <fct> 3, 2, 1, 1, 0, 0, 1, 1, 2, 2, 0, 2, 1, 3, 3, 2, 2, 3, 0, 3, 0~
#> $ trestbps <int> 145, 130, 130, 120, 120, 140, 140, 120, 172, 150, 140, 130, 1~
#> $ chol <int> 233, 250, 204, 236, 354, 192, 294, 263, 199, 168, 239, 275, 2~
#> $ fbs <fct> 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0~
#> $ restecg <int> 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1~
#> $ thalach <int> 150, 187, 172, 178, 163, 148, 153, 173, 162, 174, 160, 139, 1~
#> $ exang <fct> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0~
#> $ oldpeak <dbl> 2.3, 3.5, 1.4, 0.8, 0.6, 0.4, 1.3, 0.0, 0.5, 1.6, 1.2, 0.2, 0~
#> $ slope <int> 0, 0, 2, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 1, 2, 1, 2, 0, 2, 2, 1~
#> $ ca <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0~
#> $ thal <int> 1, 2, 2, 2, 2, 1, 2, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3~
#> $ target <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~
The data has 14 columns, consisting of several factors that may predict heart disease. The target variable is the target column. Details below:
1. Age : age
2. Sex : sex
3. Chest pain type (4 values) : cp
4. Resting blood pressure : trestbps
5. Serum cholesterol in mg/dl : chol
6. Fasting blood sugar > 120 mg/dl : fbs
7. Resting electrocardiographic results (values 0, 1, 2) : restecg
8. Maximum heart rate achieved : thalach
9. Exercise-induced angina : exang
10. ST depression induced by exercise relative to rest : oldpeak
11. The slope of the peak exercise ST segment : slope
12. Number of major vessels (0-3) colored by fluoroscopy : ca
13. Thalassemia (3 = normal; 6 = fixed defect; 7 = reversible defect) : thal
14. Target (1 or 0, Yes or No) : target
The data has 303 rows and 14 columns.
dim(heart)
#> [1] 303 14
To check the proportion of the target variable, we can use the prop.table() function.
data.frame(prop.table(table(heart$target)), table(heart$target))
Out of 303 observations, 138 patients do not have heart disease and 165 do. The proportion of the target variable is 46% without heart disease and 54% with heart disease.
# Cross validation
RNGkind(sample.kind = "Rounding")
set.seed(198)
split <- sample(nrow(heart), nrow(heart)*0.80) # splitting 80:20
heart_train <- heart[split,] # 80% data train
heart_test <- heart[-split, ] # 20% data test
# Target variable proportion after splitting
data.frame(prop.table(table(heart_train$target)))
data.frame(prop.table(table(heart_test$target)))
The data is split quite proportionally; the difference in class proportions between train and test is still tolerable.
To build the logistic regression model, we use glm(), providing the formula (target against the predictors), the training data, and family = "binomial" (the target is binary: 1 or 0, Yes or No). After the model is built, I perform stepwise regression with step() in the backward direction, hoping this model will perform better than the full logistic regression.
# Model Logistic Regression
model_lr <- glm(formula = target~.,
data = heart_train,
family = "binomial")
model_lr_back <- step(model_lr, direction = "backward", trace = F)
summary(model_lr_back)
#>
#> Call:
#> glm(formula = target ~ sex + cp + trestbps + thalach + exang +
#> oldpeak + slope + ca + thal, family = "binomial", data = heart_train)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -2.6787 -0.3338 0.1322 0.5744 2.3857
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 2.42641 2.19588 1.105 0.269167
#> sex1 -1.67032 0.51488 -3.244 0.001178 **
#> cp1 0.50142 0.60850 0.824 0.409931
#> cp2 1.98504 0.53148 3.735 0.000188 ***
#> cp3 2.42118 0.76049 3.184 0.001454 **
#> trestbps -0.02082 0.01160 -1.794 0.072762 .
#> thalach 0.02381 0.01076 2.213 0.026910 *
#> exang1 -1.15566 0.46590 -2.481 0.013119 *
#> oldpeak -0.63043 0.24048 -2.622 0.008752 **
#> slope 0.92867 0.38853 2.390 0.016840 *
#> ca -0.92712 0.22824 -4.062 0.0000487 ***
#> thal -1.02073 0.32198 -3.170 0.001524 **
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 332.68 on 241 degrees of freedom
#> Residual deviance: 161.25 on 230 degrees of freedom
#> AIC: 185.25
#>
#> Number of Fisher Scoring iterations: 6
From summary(model_lr_back), most of the retained variables are significant at the 0.05 level; only the intercept, cp1, and trestbps are not.
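As a quick optional check (my own addition, not part of the original output), we can pull the coefficient matrix out of the summary and keep only the terms with p-values below 0.05:
# Sketch: extract the coefficient table from the backward-selected
# model and filter to terms significant at the 0.05 level
coefs <- summary(model_lr_back)$coefficients
coefs[coefs[, "Pr(>|z|)"] < 0.05, ]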
Predicting with logistic regression returns either a probability or a logit. type = "response" gives the probability, while type = "link" returns the logit (log-odds). To turn the probabilities into classes, we have to set a threshold manually. In this case, I classify probabilities greater than 0.5 as 1 (heart disease), and the rest as 0 (no heart disease).
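To illustrate the relationship between the two prediction types (a small sketch I added, not from the original notebook), the inverse logit function plogis() turns the log-odds back into probabilities:
# Sketch: type = "link" returns log-odds; applying the inverse
# logit (plogis) recovers the same values as type = "response"
pred_logit <- predict(model_lr_back, heart_test, type = "link")
all.equal(as.numeric(plogis(pred_logit)),
          as.numeric(predict(model_lr_back, heart_test, type = "response")))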
## Predicting with Logistic Regression
pred_lr <- predict(model_lr_back, heart_test, type = "response")
pred_label <- ifelse(pred_lr > 0.5,"1","0")
pred_label <- as.factor(pred_label)
## Logistic Regression
eval_lr <- confusionMatrix(pred_label, heart_test$target, positive = "1" )
eval_lr
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 22 2
#> 1 8 29
#>
#> Accuracy : 0.8361
#> 95% CI : (0.7191, 0.9185)
#> No Information Rate : 0.5082
#> P-Value [Acc > NIR] : 0.00000009418
#>
#> Kappa : 0.671
#>
#> Mcnemar's Test P-Value : 0.1138
#>
#> Sensitivity : 0.9355
#> Specificity : 0.7333
#> Pos Pred Value : 0.7838
#> Neg Pred Value : 0.9167
#> Prevalence : 0.5082
#> Detection Rate : 0.4754
#> Detection Prevalence : 0.6066
#> Balanced Accuracy : 0.8344
#>
#> 'Positive' Class : 1
#>
From the confusion matrix, we conclude that this model performs quite well at predicting the 1 (has heart disease) class, with a Sensitivity of 0.9355. This model is meant to predict the heart disease class strictly, so that patients who actually have heart disease are rarely missed; a false negative here means a sick patient who goes untreated.
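As a sanity check (my addition), the sensitivity can be recomputed by hand from the confusion matrix above: it is the share of actual heart disease cases that the model catches.
# Sensitivity = TP / (TP + FN), read off the confusion matrix:
# TP = 29 (predicted 1, actual 1), FN = 2 (predicted 0, actual 1)
TP <- 29
FN <- 2
TP / (TP + FN)
#> [1] 0.9354839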
KNN is known as an algorithm that classifies observations based on numeric predictors: it computes the Euclidean distance between points. Although the categorical variables here are encoded as numbers, they are not suitable for distance calculations. Therefore, for the KNN model I select only the truly numeric variables to predict heart disease.
# Data Preparation
heart_knn <- heart %>%
select(-c(sex, cp, fbs, restecg, exang))
After selecting only the numeric variables, there is still another problem. The predictors sit on very different scales, such as trestbps in the hundreds versus oldpeak in single digits. Such unequal ranges make one variable dominate the distance calculation over the others. Therefore, before building the KNN model, the predictor variables must be scaled.
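To make the scale problem concrete, here is a small illustration (my own sketch, using the first two rows shown in the glimpse output): the Euclidean distance between two patients on chol and oldpeak is driven almost entirely by chol, simply because its values are in the hundreds.
# Euclidean distance on two unscaled predictors: chol dominates
p1 <- c(chol = 233, oldpeak = 2.3) # patient 1
p2 <- c(chol = 250, oldpeak = 3.5) # patient 2
sqrt(sum((p1 - p2)^2)) # ~17.04, almost equal to the chol gap of 17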
Scaling will use Z-score standardization with the scale() function; it subtracts the mean from every x value and divides by the standard deviation. First, the whole data set is split into 80% training and 20% testing data. Second, the training data is scaled. After that, the test data is scaled with the mean and standard deviation of the training data. For selecting the best k, I use the square root of the number of training rows as a starting point.
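As a brief aside (my addition, not in the original analysis), we can verify that scale() computes exactly this z-score:
# scale() reproduces the manual z-score (x - mean(x)) / sd(x)
x <- heart$age
all.equal(as.numeric(scale(x)), (x - mean(x)) / sd(x))
#> [1] TRUE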
# Cross validation
RNGkind(sample.kind = "Rounding")
set.seed(234) # fix the random values produced by the sample() function
split <- sample(nrow(heart_knn), nrow(heart_knn)*0.80) # 80% of the data
data_train_knn <- heart_knn[split,] # 80% data train
data_test_knn <- heart_knn[-split, ] # 20% data test
# Scaling
# Selecting X from train data (except target column)
heart_x_train <- data_train_knn %>% select(-target) %>% scale()
# Selecting Y (target) from train data
heart_y_train <- data_train_knn %>% select(target)
# Scale the X of test data with mean and sd of train data
heart_x_test <- data_test_knn %>%
select(-target) %>%
scale(center = attr(heart_x_train, "scaled:center"), # center = mean
scale = attr(heart_x_train, "scaled:scale")) # scale = standard deviation
# Selecting Y (target) from testing data
heart_y_test <- data_test_knn %>% select(target)
# Defining K
sqrt(nrow(heart_x_train))
#> [1] 15.55635
Candidate values for k: 13, 15, and 17 (odd numbers around the square root).
# KNN Modeling
model_knn <- knn(train = heart_x_train,
test = heart_x_test,
cl = heart_y_train$target,
k = 15)
Building a KNN model in R is quite short: one line of code and the model is set. The knn() function takes the training data, the test data, the class labels to predict, and the number of neighbors k.
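To check that k = 15 is a reasonable choice, a small loop (my own sketch, not part of the original analysis) can compare the test accuracy of the three candidate values:
# Compare test accuracy for the candidate k values
for (k in c(13, 15, 17)) {
  pred_k <- knn(train = heart_x_train,
                test = heart_x_test,
                cl = heart_y_train$target,
                k = k)
  print(paste("k =", k, "accuracy =",
              round(mean(pred_k == heart_y_test$target), 3)))
}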
As with logistic regression, I use the confusionMatrix() function to evaluate accuracy and the other evaluation metrics. I set the positive class to 1; the positive class is the class we want to predict, or concentrate on.
## KNN
eval_knn <- confusionMatrix(model_knn, heart_y_test$target, positive = "1" )
eval_knn
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 15 4
#> 1 8 34
#>
#> Accuracy : 0.8033
#> 95% CI : (0.6816, 0.894)
#> No Information Rate : 0.623
#> P-Value [Acc > NIR] : 0.00195
#>
#> Kappa : 0.5664
#>
#> Mcnemar's Test P-Value : 0.38648
#>
#> Sensitivity : 0.8947
#> Specificity : 0.6522
#> Pos Pred Value : 0.8095
#> Neg Pred Value : 0.7895
#> Prevalence : 0.6230
#> Detection Rate : 0.5574
#> Detection Prevalence : 0.6885
#> Balanced Accuracy : 0.7735
#>
#> 'Positive' Class : 1
#>
table_lr <- data.frame(Accuracy = round(eval_lr$overall[1], 3),
                       Recall = round(eval_lr$byClass[1], 3),
                       Specificity = round(eval_lr$byClass[2], 3),
                       Precision = round(eval_lr$byClass[3], 3))
table_knn <- data.frame(Accuracy = round(eval_knn$overall[1], 3),
                        Recall = round(eval_knn$byClass[1], 3),
                        Specificity = round(eval_knn$byClass[2], 3),
                        Precision = round(eval_knn$byClass[3], 3))
table_lr
table_knn
Comparing the evaluation tables of logistic regression and KNN, logistic regression has higher scores on Accuracy, Recall, Specificity, and Precision. Because the goal of this model is to predict the heart disease class, Recall is the most suitable metric for comparing the two models, and the logistic regression model has the better Recall for the heart disease class.
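For an easier side-by-side view, the two one-row tables can optionally be stacked (a small sketch of mine; the Model column is my addition):
# Stack both evaluation rows into a single comparison table
comparison <- rbind(cbind(Model = "Logistic Regression", table_lr),
                    cbind(Model = "KNN", table_knn))
comparison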