The objective of this project is to predict whether a hospital patient has heart disease, using the logistic regression and K-Nearest Neighbors (KNN) algorithms. This notebook also compares the performance of the two algorithms on this task.
Before going any further, we first load the libraries that will be needed.
library(ggplot2)
library(dplyr)
library(MLmetrics)
library(gtools)
library(caret)
library(class)
With the read.csv() function we can easily import the data. After looking at the table, the first column name does not seem correct, so I rename it.
heart <- read.csv(file = "heart.csv")
names(heart)[names(heart) == "ï..age"] <- "age"
All columns contain numeric values. However, some variables are actually categorical and are merely represented as numbers. The next step is to convert sex, cp, fbs, exang, and target to factors.
# Transforming data into Category
heart <- heart %>%
mutate_at(vars(target, exang, fbs, cp, sex), as.factor)
glimpse(heart)
#> Rows: 303
#> Columns: 14
#> $ age <int> 63, 37, 41, 56, 57, 57, 56, 44, 52, 57, 54, 48, 49, 64, 58, 5~
#> $ sex <fct> 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1~
#> $ cp <fct> 3, 2, 1, 1, 0, 0, 1, 1, 2, 2, 0, 2, 1, 3, 3, 2, 2, 3, 0, 3, 0~
#> $ trestbps <int> 145, 130, 130, 120, 120, 140, 140, 120, 172, 150, 140, 130, 1~
#> $ chol <int> 233, 250, 204, 236, 354, 192, 294, 263, 199, 168, 239, 275, 2~
#> $ fbs <fct> 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0~
#> $ restecg <int> 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1~
#> $ thalach <int> 150, 187, 172, 178, 163, 148, 153, 173, 162, 174, 160, 139, 1~
#> $ exang <fct> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0~
#> $ oldpeak <dbl> 2.3, 3.5, 1.4, 0.8, 0.6, 0.4, 1.3, 0.0, 0.5, 1.6, 1.2, 0.2, 0~
#> $ slope <int> 0, 0, 2, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 1, 2, 1, 2, 0, 2, 2, 1~
#> $ ca <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0~
#> $ thal <int> 1, 2, 2, 2, 2, 1, 2, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3~
#> $ target <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~
The data has 14 columns, consisting of several factors that may predict heart disease. The target variable is the target column. Details below:
1. Age : age
2. Sex : sex
3. Chest pain type (4 values) : cp
4. Resting blood pressure : trestbps
5. Serum cholesterol in mg/dl : chol
6. Fasting blood sugar > 120 mg/dl : fbs
7. Resting electrocardiographic results (values 0, 1, 2) : restecg
8. Maximum heart rate achieved : thalach
9. Exercise-induced angina : exang
10. ST depression induced by exercise relative to rest : oldpeak
11. The slope of the peak exercise ST segment : slope
12. Number of major vessels (0-3) colored by fluoroscopy : ca
13. Thalassemia (3 = normal; 6 = fixed defect; 7 = reversible defect) : thal
14. Target (1 or 0, Yes or No) : target
The data has 303 rows and 14 columns.
dim(heart)
#> [1] 303 14
To check the proportion of the target variable, we can use the prop.table() function.
data.frame(prop.table(table(heart$target)), table(heart$target))
Out of 303 observations, 138 patients do not have heart disease and 165 do. The proportion of the target variable is 46% without heart disease and 54% with heart disease.
# Cross validation
RNGkind(sample.kind = "Rounding")
set.seed(198)
split <- sample(nrow(heart), nrow(heart)*0.80) # splitting 80:20
heart_train <- heart[split,] # 80% data train
heart_test <- heart[-split, ] # 20% data test
# Target variable proportion after splitting
data.frame(prop.table(table(heart_train$target)))
data.frame(prop.table(table(heart_test$target)))
The data is split quite proportionally; the difference in class proportions between train and test is still tolerable.
To build the logistic regression model, we use glm(), providing the formula (target against the predictors), the training data, and family = "binomial" (the target is binary: 1 or 0, Yes or No). After the model is built, I perform stepwise regression with step() in the backward direction, hoping this model will perform better than the full logistic regression.
# Model Logistic Regression
model_lr <- glm(formula = target~.,
data = heart_train,
family = "binomial")
model_lr_back <- step(model_lr, direction = "backward", trace = F)
summary(model_lr_back)
#>
#> Call:
#> glm(formula = target ~ sex + cp + trestbps + thalach + exang +
#> oldpeak + slope + ca + thal, family = "binomial", data = heart_train)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -2.6787 -0.3338 0.1322 0.5744 2.3857
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 2.42641 2.19588 1.105 0.269167
#> sex1 -1.67032 0.51488 -3.244 0.001178 **
#> cp1 0.50142 0.60850 0.824 0.409931
#> cp2 1.98504 0.53148 3.735 0.000188 ***
#> cp3 2.42118 0.76049 3.184 0.001454 **
#> trestbps -0.02082 0.01160 -1.794 0.072762 .
#> thalach 0.02381 0.01076 2.213 0.026910 *
#> exang1 -1.15566 0.46590 -2.481 0.013119 *
#> oldpeak -0.63043 0.24048 -2.622 0.008752 **
#> slope 0.92867 0.38853 2.390 0.016840 *
#> ca -0.92712 0.22824 -4.062 0.0000487 ***
#> thal -1.02073 0.32198 -3.170 0.001524 **
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 332.68 on 241 degrees of freedom
#> Residual deviance: 161.25 on 230 degrees of freedom
#> AIC: 185.25
#>
#> Number of Fisher Scoring iterations: 6
From summary(model_lr_back), most of the retained variables are significant at the 0.05 level; only the intercept, cp1, and trestbps are not.
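As a quick optional check (my own addition, not part of the original output), we can pull the coefficient matrix out of the summary and keep only the terms with p-values below 0.05:
# Sketch: extract the coefficient table from the backward-selected
# model and filter to terms significant at the 0.05 level
coefs <- summary(model_lr_back)$coefficients
coefs[coefs[, "Pr(>|z|)"] < 0.05, ]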
Predicting with logistic regression returns either a probability or a logit. type = "response" gives the probability, while type = "link" returns the logit (log-odds). To turn the probabilities into classes, we have to set a threshold manually. In this case, I classify probabilities greater than 0.5 as 1 (heart disease), and the rest as 0 (no heart disease).
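To illustrate the relationship between the two prediction types (a small sketch I added, not from the original notebook), the inverse logit function plogis() turns the log-odds back into probabilities:
# Sketch: type = "link" returns log-odds; applying the inverse
# logit (plogis) recovers the same values as type = "response"
pred_logit <- predict(model_lr_back, heart_test, type = "link")
all.equal(as.numeric(plogis(pred_logit)),
          as.numeric(predict(model_lr_back, heart_test, type = "response")))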
## Predicting with Logistic Regression
pred_lr <- predict(model_lr_back, heart_test, type = "response")
pred_label <- ifelse(pred_lr > 0.5,"1","0")
pred_label <- as.factor(pred_label)
## Logistic Regression
eval_lr <- confusionMatrix(pred_label, heart_test$target, positive = "1" )
eval_lr
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 22 2
#> 1 8 29
#>
#> Accuracy : 0.8361
#> 95% CI : (0.7191, 0.9185)
#> No Information Rate : 0.5082
#> P-Value [Acc > NIR] : 0.00000009418
#>
#> Kappa : 0.671
#>
#> Mcnemar's Test P-Value : 0.1138
#>
#> Sensitivity : 0.9355
#> Specificity : 0.7333
#> Pos Pred Value : 0.7838
#> Neg Pred Value : 0.9167
#> Prevalence : 0.5082
#> Detection Rate : 0.4754
#> Detection Prevalence : 0.6066
#> Balanced Accuracy : 0.8344
#>
#> 'Positive' Class : 1
#>
From the confusion matrix, we conclude that this model performs quite well at predicting the 1 (has heart disease) class, with a Sensitivity of 0.9355. This model is meant to predict the heart disease class strictly, so that patients who actually have heart disease are rarely missed; a false negative here means a sick patient who goes untreated.
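As a sanity check (my addition), the sensitivity can be recomputed by hand from the confusion matrix above: it is the share of actual heart disease cases that the model catches.
# Sensitivity = TP / (TP + FN), read off the confusion matrix:
# TP = 29 (predicted 1, actual 1), FN = 2 (predicted 0, actual 1)
TP <- 29
FN <- 2
TP / (TP + FN)
#> [1] 0.9354839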
KNN is known as an algorithm that classifies observations based on numeric predictors: it computes the Euclidean distance between points. Although the categorical variables here are encoded as numbers, they are not suitable for distance calculations. Therefore, for the KNN model I select only the truly numeric variables to predict heart disease.
# Data Preparation
heart_knn <- heart %>%
select(-c(sex, cp, fbs, restecg, exang))
After selecting only the numeric variables, there is still another problem. The predictors sit on very different scales, such as trestbps in the hundreds versus oldpeak in single digits. Such unequal ranges make one variable dominate the distance calculation over the others. Therefore, before building the KNN model, the predictor variables must be scaled.
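To make the scale problem concrete, here is a small illustration (my own sketch, using the first two rows shown in the glimpse output): the Euclidean distance between two patients on chol and oldpeak is driven almost entirely by chol, simply because its values are in the hundreds.
# Euclidean distance on two unscaled predictors: chol dominates
p1 <- c(chol = 233, oldpeak = 2.3) # patient 1
p2 <- c(chol = 250, oldpeak = 3.5) # patient 2
sqrt(sum((p1 - p2)^2)) # ~17.04, almost equal to the chol gap of 17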
Scaling will use Z-score standardization with the scale() function; it subtracts the mean from every x value and divides by the standard deviation. First, the whole data set is split into 80% training and 20% testing data. Second, the training data is scaled. After that, the test data is scaled with the mean and standard deviation of the training data. For selecting the best k, I use the square root of the number of training rows as a starting point.
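As a brief aside (my addition, not in the original analysis), we can verify that scale() computes exactly this z-score:
# scale() reproduces the manual z-score (x - mean(x)) / sd(x)
x <- heart$age
all.equal(as.numeric(scale(x)), (x - mean(x)) / sd(x))
#> [1] TRUE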
# Cross validation
RNGkind(sample.kind = "Rounding")
set.seed(234) # fix the random values produced by the sample() function
split <- sample(nrow(heart_knn), nrow(heart_knn)*0.80) # 80% of the data
data_train_knn <- heart_knn[split,] # 80% data train
data_test_knn <- heart_knn[-split, ] # 20% data test
# Scaling
# Selecting X from train data (except target column)
heart_x_train <- data_train_knn %>% select(-target) %>% scale()
# Selecting Y (target) from train data
heart_y_train <- data_train_knn %>% select(target)
# Scale the X of test data with mean and sd of train data
heart_x_test <- data_test_knn %>%
select(-target) %>%
scale(center = attr(heart_x_train, "scaled:center"), # center = mean
scale = attr(heart_x_train, "scaled:scale")) # scale = standard deviation
# Selecting Y (target) from testing data
heart_y_test <- data_test_knn %>% select(target)
# Defining K
sqrt(nrow(heart_x_train))
#> [1] 15.55635
Candidate values for k: 13, 15, and 17 (odd numbers around the square root).
# KNN Modeling
model_knn <- knn(train = heart_x_train,
test = heart_x_test,
cl = heart_y_train$target,
k = 15)
Building a KNN model in R is quite short: one line of code and the model is set. The knn() function takes the training data, the test data, the class labels to predict, and the number of neighbors k.
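To check that k = 15 is a reasonable choice, a small loop (my own sketch, not part of the original analysis) can compare the test accuracy of the three candidate values:
# Compare test accuracy for the candidate k values
for (k in c(13, 15, 17)) {
  pred_k <- knn(train = heart_x_train,
                test = heart_x_test,
                cl = heart_y_train$target,
                k = k)
  print(paste("k =", k, "accuracy =",
              round(mean(pred_k == heart_y_test$target), 3)))
}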
As with logistic regression, I use the confusionMatrix() function to evaluate accuracy and the other evaluation metrics. I set the positive class to 1; the positive class is the class we want to predict, or concentrate on.
## KNN
eval_knn <- confusionMatrix(model_knn, heart_y_test$target, positive = "1" )
eval_knn
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 15 4
#> 1 8 34
#>
#> Accuracy : 0.8033
#> 95% CI : (0.6816, 0.894)
#> No Information Rate : 0.623
#> P-Value [Acc > NIR] : 0.00195
#>
#> Kappa : 0.5664
#>
#> Mcnemar's Test P-Value : 0.38648
#>
#> Sensitivity : 0.8947
#> Specificity : 0.6522
#> Pos Pred Value : 0.8095
#> Neg Pred Value : 0.7895
#> Prevalence : 0.6230
#> Detection Rate : 0.5574
#> Detection Prevalence : 0.6885
#> Balanced Accuracy : 0.7735
#>
#> 'Positive' Class : 1
#>
table_lr <- data.frame(Accuracy = round(eval_lr$overall[1], 3),
                       Recall = round(eval_lr$byClass[1], 3),
                       Specificity = round(eval_lr$byClass[2], 3),
                       Precision = round(eval_lr$byClass[3], 3))
table_knn <- data.frame(Accuracy = round(eval_knn$overall[1], 3),
                        Recall = round(eval_knn$byClass[1], 3),
                        Specificity = round(eval_knn$byClass[2], 3),
                        Precision = round(eval_knn$byClass[3], 3))
table_lr
table_knn
Comparing the evaluation tables of logistic regression and KNN, logistic regression has higher scores on Accuracy, Recall, Specificity, and Precision. Because the goal of this model is to predict the heart disease class, Recall is the most suitable metric for comparing the two models, and the logistic regression model has the better Recall for the heart disease class.
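For an easier side-by-side view, the two one-row tables can optionally be stacked (a small sketch of mine; the Model column is my addition):
# Stack both evaluation rows into a single comparison table
comparison <- rbind(cbind(Model = "Logistic Regression", table_lr),
                    cbind(Model = "KNN", table_knn))
comparison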