We will build machine learning models with logistic regression and the K-NN method. The objective is to predict whether a person has heart disease or not. The dataset is obtained from Kaggle; it contains 14 columns and 303 rows.
First, we load the required packages, read the data, and take a quick look at its structure.
library(tidyverse)
library(GGally)
library(caret)
heart <- read.csv("heart.csv")
glimpse(heart)
## Rows: 303
## Columns: 14
## $ age <int> 63, 37, 41, 56, 57, 57, 56, 44, 52, 57, 54, 48, 49, 64, 58...
## $ sex <int> 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0...
## $ cp <int> 3, 2, 1, 1, 0, 0, 1, 1, 2, 2, 0, 2, 1, 3, 3, 2, 2, 3, 0, 3...
## $ trestbps <int> 145, 130, 130, 120, 120, 140, 140, 120, 172, 150, 140, 130...
## $ chol <int> 233, 250, 204, 236, 354, 192, 294, 263, 199, 168, 239, 275...
## $ fbs <int> 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0...
## $ restecg <int> 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1...
## $ thalach <int> 150, 187, 172, 178, 163, 148, 153, 173, 162, 174, 160, 139...
## $ exang <int> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0...
## $ oldpeak <dbl> 2.3, 3.5, 1.4, 0.8, 0.6, 0.4, 1.3, 0.0, 0.5, 1.6, 1.2, 0.2...
## $ slope <int> 0, 0, 2, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 1, 2, 1, 2, 0, 2, 2...
## $ ca <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2...
## $ thal <int> 1, 2, 2, 2, 2, 1, 2, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2...
## $ target <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
We need to change some of the data types into factors.
heart1 <- heart %>%
  mutate(sex = ifelse(sex == 0, "Female", "Male"),
         cp = as.factor(cp),
         fbs = ifelse(fbs == 0, "False", "True"),
         restecg = as.factor(restecg),
         exang = ifelse(exang == 0, "No", "Yes"),
         oldpeak = as.factor(oldpeak),
         slope = as.factor(slope),
         ca = as.factor(ca),
         thal = as.factor(thal),
         target = ifelse(target == 0, "No", "Yes")) %>%
  mutate_if(is.character, as.factor) %>%
  glimpse()
## Rows: 303
## Columns: 14
## $ age <int> 63, 37, 41, 56, 57, 57, 56, 44, 52, 57, 54, 48, 49, 64, 58...
## $ sex <fct> Male, Male, Female, Male, Female, Male, Female, Male, Male...
## $ cp <fct> 3, 2, 1, 1, 0, 0, 1, 1, 2, 2, 0, 2, 1, 3, 3, 2, 2, 3, 0, 3...
## $ trestbps <int> 145, 130, 130, 120, 120, 140, 140, 120, 172, 150, 140, 130...
## $ chol <int> 233, 250, 204, 236, 354, 192, 294, 263, 199, 168, 239, 275...
## $ fbs <fct> True, False, False, False, False, False, False, False, Tru...
## $ restecg <fct> 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1...
## $ thalach <int> 150, 187, 172, 178, 163, 148, 153, 173, 162, 174, 160, 139...
## $ exang <fct> No, No, No, No, Yes, No, No, No, No, No, No, No, No, Yes, ...
## $ oldpeak <fct> 2.3, 3.5, 1.4, 0.8, 0.6, 0.4, 1.3, 0, 0.5, 1.6, 1.2, 0.2, ...
## $ slope <fct> 0, 0, 2, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 1, 2, 1, 2, 0, 2, 2...
## $ ca <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2...
## $ thal <fct> 1, 2, 2, 2, 2, 1, 2, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2...
## $ target <fct> Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes...
After that, we need to check whether there are any NAs in the dataset.
colSums(is.na(heart1))
## age sex cp trestbps chol fbs restecg thalach
## 0 0 0 0 0 0 0 0
## exang oldpeak slope ca thal target
## 0 0 0 0 0 0
Now we know that the data contain no NAs, and we can move on to exploring them.
First, we need to know the proportion of the target classes.
round(prop.table(table(heart1$target)), 2)
##
## No Yes
## 0.46 0.54
table(heart1$target)
##
## No Yes
## 138 165
The proportions are quite balanced, so we don't need to upsample the data.
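For reference only: if the classes had been heavily imbalanced, one option would be to upsample the minority class with caret's upSample(). This is just an illustrative sketch (the new object name balanced is introduced here for the example); it is not needed for this dataset.
# Illustrative sketch: resample the minority class until both classes
# have equal counts. Not required here because the classes are balanced.
balanced <- upSample(x = heart1 %>% select(-target),
                     y = heart1$target,
                     yname = "target")
table(balanced$target)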
Next, we split the dataset into train and test sets. This lets us check whether the model can classify new data it has never seen. The data will be split with a ratio of 80/20 (80% of the data for training, 20% for testing).
set.seed(123)
row_data <- nrow(heart1)
index <- sample(row_data, row_data*0.8)
data_train <- heart1[ index, ]
data_test <- heart1[-index, ]
Now we will build the logistic regression model, with target as the response variable. For this case, we can use stepwise selection in both directions, from an empty model (no predictors) up to the full model (all predictors).
set.seed(123)
model_full <- glm(target ~ ., data_train, family = "binomial")
model_none <- glm(target ~ 1, data_train, family = "binomial")
model_step <- step(model_none,
scope = list(lower = model_none, upper = model_full),
direction = "both",
trace = 0
)
summary(model_step)
##
## Call:
## glm(formula = target ~ thal + ca + exang + slope + restecg +
## sex + cp + trestbps + thalach, family = "binomial", data = data_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7131 -0.3347 0.1409 0.4946 2.8580
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -14.57528 2399.54627 -0.006 0.99515
## thal1 17.06009 2399.54493 0.007 0.99433
## thal2 16.20605 2399.54482 0.007 0.99461
## thal3 14.74306 2399.54480 0.006 0.99510
## ca1 -1.25908 0.54096 -2.328 0.01994 *
## ca2 -3.27025 0.71848 -4.552 5.32e-06 ***
## ca3 -1.83853 0.93903 -1.958 0.05024 .
## ca4 1.56751 1.84412 0.850 0.39532
## exangYes -0.96791 0.48579 -1.992 0.04632 *
## slope1 -0.66512 0.86447 -0.769 0.44166
## slope2 0.69563 0.89318 0.779 0.43608
## restecg1 0.84330 0.43422 1.942 0.05213 .
## restecg2 -14.66681 1589.24431 -0.009 0.99264
## sexMale -1.69406 0.57028 -2.971 0.00297 **
## cp1 1.02171 0.67414 1.516 0.12963
## cp2 1.36972 0.56154 2.439 0.01472 *
## cp3 1.94003 0.69914 2.775 0.00552 **
## trestbps -0.01969 0.01144 -1.721 0.08516 .
## thalach 0.01932 0.01231 1.569 0.11666
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 333.48 on 241 degrees of freedom
## Residual deviance: 155.45 on 223 degrees of freedom
## AIC: 193.45
##
## Number of Fisher Scoring iterations: 15
Let's use the model to predict on the test data, then use a confusion matrix to evaluate it.
pred_test <- predict(model_step, data_test, type = "response")
pred_class_test <- ifelse(pred_test > 0.5, "Yes", "No") %>%
as.factor
confusionMatrix(pred_class_test, data_test$target, positive = "Yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 22 5
## Yes 6 28
##
## Accuracy : 0.8197
## 95% CI : (0.7002, 0.9064)
## No Information Rate : 0.541
## P-Value [Acc > NIR] : 4.82e-06
##
## Kappa : 0.6359
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.8485
## Specificity : 0.7857
## Pos Pred Value : 0.8235
## Neg Pred Value : 0.8148
## Prevalence : 0.5410
## Detection Rate : 0.4590
## Detection Prevalence : 0.5574
## Balanced Accuracy : 0.8171
##
## 'Positive' Class : Yes
##
Based on these results, the model performs quite well: accuracy is 81.97%, sensitivity is 84.85%, specificity is 78.57%, and precision (Pos Pred Value) is 82.35%.
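As a quick sanity check, these metrics can be recomputed by hand from the counts in the confusion matrix above, with "Yes" as the positive class.
# Recompute the test-set metrics directly from the confusion matrix counts
# (TP = predicted Yes & actual Yes, TN = predicted No & actual No, etc.).
TP <- 28; TN <- 22; FP <- 6; FN <- 5
accuracy    <- (TP + TN) / (TP + TN + FP + FN)  # 50/61 = 0.8197
sensitivity <- TP / (TP + FN)                   # 28/33 = 0.8485
specificity <- TN / (TN + FP)                   # 22/28 = 0.7857
precision   <- TP / (TP + FP)                   # 28/34 = 0.8235
round(c(accuracy, sensitivity, specificity, precision), 4)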
Let's compare the predictions on the test data to those on the train data to see whether the model is underfitting, overfitting, or about right.
pred_train <- predict(model_step, data_train, type = "response")
pred_class_train <- ifelse(pred_train > 0.5, "Yes", "No") %>%
as.factor()
confusionMatrix(pred_class_train, data_train$target, positive = "Yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 89 12
## Yes 21 120
##
## Accuracy : 0.8636
## 95% CI : (0.8139, 0.9042)
## No Information Rate : 0.5455
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.7231
##
## Mcnemar's Test P-Value : 0.1637
##
## Sensitivity : 0.9091
## Specificity : 0.8091
## Pos Pred Value : 0.8511
## Neg Pred Value : 0.8812
## Prevalence : 0.5455
## Detection Rate : 0.4959
## Detection Prevalence : 0.5826
## Balanced Accuracy : 0.8591
##
## 'Positive' Class : Yes
##
If we compare the test data and the train data, accuracy differs by 4.39%, sensitivity by 6.06%, specificity by 2.34%, and precision (Pos Pred Value) by 2.76%. Since every difference is below 10%, we can say the model is neither underfitting nor overfitting.
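As a small optional sketch, those gaps can also be computed programmatically by storing both confusionMatrix() results and subtracting the metrics; the object names cm_train and cm_test are introduced here only for illustration.
# Store both confusion matrices and compare the metrics directly
# instead of reading them off by eye.
cm_test  <- confusionMatrix(pred_class_test,  data_test$target,  positive = "Yes")
cm_train <- confusionMatrix(pred_class_train, data_train$target, positive = "Yes")
round(cm_train$overall["Accuracy"] - cm_test$overall["Accuracy"], 4)
round(cm_train$byClass[c("Sensitivity", "Specificity", "Pos Pred Value")] -
        cm_test$byClass[c("Sensitivity", "Specificity", "Pos Pred Value")], 4)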
Let's build another model with the K-NN method. First, we need to change the target variable into a factor.
heart2 <- heart %>%
  mutate(target = as.factor(target))
After that, as with the logistic regression model, we need to split the data.
set.seed(123)
row_data2 <- nrow(heart2)
index2 <- sample(row_data2, row_data2*0.8)
data_train2 <- heart2[index2, ]
data_test2 <- heart2[-index2, ]
We need to scale the predictors with the z-score method. First, we scale the train data; then we use the centers and standard deviations obtained from the train data to scale the test data. We do this to prevent data leakage.
train_x <- data_train2 %>%
select(-target) %>%
scale()
train_y <- data_train2$target
test_x <- data_test2 %>%
select(-target) %>%
scale(center = attr(train_x, "scaled:center"),
scale = attr(train_x, "scaled:scale")
)
test_y <- data_test2$target
To make predictions, we need to choose the number of nearest neighbors k. A common rule of thumb is the square root of the number of training rows.
k_choose <- sqrt(nrow(train_x)) %>% floor()
k_choose
## [1] 15
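The square-root rule is only a heuristic. As an alternative sketch (not run in the original analysis), k could also be tuned by cross-validation with caret's train(), using the scaled training matrix and labels prepared above.
# Illustrative alternative to the square-root rule: pick k by 5-fold
# cross-validation over a grid of odd values.
knn_cv <- train(x = train_x,
                y = train_y,
                method = "knn",
                tuneGrid = data.frame(k = seq(3, 25, by = 2)),
                trControl = trainControl(method = "cv", number = 5))
knn_cv$bestTune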
After we get the number, we can use knn3Train() to get the predictions.
pred_knn <- knn3Train(train = train_x,
cl = train_y,
test = test_x,
k = k_choose
) %>%
  as.factor()
As with the logistic regression model, we can evaluate it with a confusion matrix.
confusionMatrix(pred_knn, test_y)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 18 3
## 1 10 30
##
## Accuracy : 0.7869
## 95% CI : (0.6632, 0.8814)
## No Information Rate : 0.541
## P-Value [Acc > NIR] : 5.878e-05
##
## Kappa : 0.5626
##
## Mcnemar's Test P-Value : 0.09609
##
## Sensitivity : 0.6429
## Specificity : 0.9091
## Pos Pred Value : 0.8571
## Neg Pred Value : 0.7500
## Prevalence : 0.4590
## Detection Rate : 0.2951
## Detection Prevalence : 0.3443
## Balanced Accuracy : 0.7760
##
## 'Positive' Class : 0
##
Based on these results, the K-NN model shows higher specificity and precision than the logistic regression model, but lower accuracy and sensitivity. Note that confusionMatrix() here treats "0" (no heart disease) as the positive class, while the logistic model used "Yes", so the class-specific metrics are reported from opposite perspectives and should be compared with care.
If we are a doctor treating patients, we need the model to have high precision: we don't want a patient to receive the wrong treatment because of the model. In this case, the two models have similar precision, differing by only about 3%. If we want the better-scoring model and don't need interpretation, K-NN is the choice; but if we want a model that can be interpreted, we can choose logistic regression. My personal choice in this case is logistic regression: even though it has slightly lower precision, it has higher accuracy and sensitivity, and it is interpretable.
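To illustrate that interpretability, one could (as a sketch, using the fitted model_step above) exponentiate the coefficients to read them as odds ratios.
# Sketch of how the logistic model can be interpreted: exponentiating the
# coefficients of model_step gives odds ratios for each predictor level.
# Values above 1 increase the odds of heart disease; values below 1 decrease them.
round(exp(coef(model_step)), 3)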