Introduction

About the data

Some descriptions of the columns:


- age: age in years
- sex: 1 = male, 0 = female
- cp: chest pain type (4 values)
- trestbps: resting blood pressure on admission to the hospital (mm Hg)
- chol: serum cholesterol (mg/dl)
- fbs : fasting blood sugar > 120 mg/dl (1 = true, 0 = false)
- restecg : resting electrocardiographic results (values 0,1,2)
- thalach : maximum heart rate achieved
- exang : exercise induced angina (1 = yes, 0 = no)
- oldpeak: ST depression induced by exercise relative to rest
- slope : the slope of the peak exercise ST segment
- ca: number of major vessels (0-3) colored by fluoroscopy
- thal: 3 = normal; 6 = fixed defect; 7 = reversible defect

Goal

This dataset was created to predict the presence of heart disease. It was built to help ML researchers build models, find trends in heart data that predict certain cardiovascular events, and look for clear indicators of heart health. The predicted output gives a fair idea of whether heart disease is present or absent in a patient.

What we will do

We will use Logistic Regression and KNN models on the Heart Disease UCI data from Kaggle. We want to understand the relationships among the variables, especially between the target and the other variables. We also want to predict the chance of someone having heart disease using historical data. You can download the data here: https://www.kaggle.com/ronitf/heart-disease-uci

Import Library

library(dplyr)
library(ggplot2)
library(GGally)
library(performance)
library(MLmetrics)
library(rmdformats)
library(class)
library(caret)

Data Preparation

Read data

heart <- read.csv("heart.csv")
rmarkdown::paged_table(heart)

Data Wrangling/Preprocessing

Check for missing values

heart %>% 
  is.na() %>% 
  colSums()/nrow(heart)
##      age      sex       cp trestbps     chol      fbs  restecg  thalach 
##        0        0        0        0        0        0        0        0 
##    exang  oldpeak    slope       ca     thal   target 
##        0        0        0        0        0        0

There are no missing values, so the data is ready for the next steps.

Check for mismatched data types and change them if necessary

Changing columns to the right data types will ease the data analysis and machine learning process.

str(heart)
## 'data.frame':    303 obs. of  14 variables:
##  $ age     : int  63 37 41 56 57 57 56 44 52 57 ...
##  $ sex     : int  1 1 0 1 0 1 0 1 1 1 ...
##  $ cp      : int  3 2 1 1 0 0 1 1 2 2 ...
##  $ trestbps: int  145 130 130 120 120 140 140 120 172 150 ...
##  $ chol    : int  233 250 204 236 354 192 294 263 199 168 ...
##  $ fbs     : int  1 0 0 0 0 0 0 0 1 0 ...
##  $ restecg : int  0 1 0 1 1 1 0 1 1 1 ...
##  $ thalach : int  150 187 172 178 163 148 153 173 162 174 ...
##  $ exang   : int  0 0 0 0 1 0 0 0 0 0 ...
##  $ oldpeak : num  2.3 3.5 1.4 0.8 0.6 0.4 1.3 0 0.5 1.6 ...
##  $ slope   : int  0 0 2 2 2 1 1 2 2 2 ...
##  $ ca      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ thal    : int  1 2 2 2 2 1 2 3 3 2 ...
##  $ target  : int  1 1 1 1 1 1 1 1 1 1 ...

It looks like we need to change sex, cp, fbs, exang, and target to factors.

Change the respective columns to the right data types

heart <- heart %>% 
        mutate_at(vars(sex, cp, fbs, exang, target), as.factor)
str(heart)
## 'data.frame':    303 obs. of  14 variables:
##  $ age     : int  63 37 41 56 57 57 56 44 52 57 ...
##  $ sex     : Factor w/ 2 levels "0","1": 2 2 1 2 1 2 1 2 2 2 ...
##  $ cp      : Factor w/ 4 levels "0","1","2","3": 4 3 2 2 1 1 2 2 3 3 ...
##  $ trestbps: int  145 130 130 120 120 140 140 120 172 150 ...
##  $ chol    : int  233 250 204 236 354 192 294 263 199 168 ...
##  $ fbs     : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 1 1 2 1 ...
##  $ restecg : int  0 1 0 1 1 1 0 1 1 1 ...
##  $ thalach : int  150 187 172 178 163 148 153 173 162 174 ...
##  $ exang   : Factor w/ 2 levels "0","1": 1 1 1 1 2 1 1 1 1 1 ...
##  $ oldpeak : num  2.3 3.5 1.4 0.8 0.6 0.4 1.3 0 0.5 1.6 ...
##  $ slope   : int  0 0 2 2 2 1 1 2 2 2 ...
##  $ ca      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ thal    : int  1 2 2 2 2 1 2 3 3 2 ...
##  $ target  : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...

Exploratory Data Analysis

Check the correlations among the numeric columns (factor columns, including target, will be ignored)

ggcorr(heart, label = T, hjust = 0.5)
## Warning in ggcorr(heart, label = T, hjust = 0.5): data in column(s) 'sex', 'cp',
## 'fbs', 'exang', 'target' are not numeric and were ignored

Cross Validation

This step is necessary to prepare some “unseen” data for evaluating the ML models’ accuracy and performance. We will use a 75:25 train:test proportion for this data.

Check if the target class proportion is balanced

prop.table(table(heart$target))
## 
##         0         1 
## 0.4554455 0.5445545

The target classes are fairly balanced.

Splitting the data into train and test sets

set.seed(123)
index <- sample(nrow(heart), nrow(heart)*0.75)

data_train <- heart[index,]
data_test <- heart[-index,]

Check target class proportion on the training data

prop.table(table(data_train$target))
## 
##         0         1 
## 0.4537445 0.5462555
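
The random 75:25 split kept the class proportions close to the full data. If we wanted the split itself to preserve the proportions, a stratified split could be used instead; below is a minimal sketch with caret::createDataPartition (caret is already loaded; this is an alternative, not part of the workflow that follows):

# Stratified alternative to the random split above (sketch, not used later)
# createDataPartition() samples within each level of the target,
# so train and test keep roughly the same class proportions
set.seed(123)
index_strat <- createDataPartition(heart$target, p = 0.75, list = FALSE)

data_train_strat <- heart[index_strat, ]
data_test_strat <- heart[-index_strat, ]

prop.table(table(data_train_strat$target))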

Model Building

Numeric variables scaling

Scaling puts all the numeric variables on a comparable scale, which matters especially for KNN since it relies on distances between observations.

# Take all the numeric variables
data_train_numeric <- data_train %>% select_if(is.numeric)
data_test_numeric <- data_test %>% select_if(is.numeric)

# Take all the non numeric variables
data_train_nn <- data_train %>% select(sex, cp, fbs, exang, target)
data_test_nn <- data_test %>% select(sex, cp, fbs, exang, target)

# Scale all the numeric variables
data_train_numeric_scaled <- scale(data_train_numeric)
data_test_numeric_scaled <- scale(data_test_numeric,
                           center = attr(data_train_numeric_scaled, "scaled:center"),
                           scale = attr(data_train_numeric_scaled, "scaled:scale"))

# Combining the numeric data with the non numeric data
data_train_new <- cbind(data_train_nn, data_train_numeric_scaled)
data_test_new <- cbind(data_test_nn, data_test_numeric_scaled)

Separating the target column from the rest of the dataset

data_train_x <- data_train_new %>% select(-target)
data_test_x <- data_test_new %>% select(-target)
  
data_train_y <- data_train_new[,"target"]
data_test_y <- data_test_new[,"target"]

Logistic Regression model

lg_model <- glm(formula = target~.,
                   family="binomial",
                   data = data_train_new)

KNN model

Determine the number of classes in the target variable

unique(data_train_new$target)
## [1] 0 1
## Levels: 0 1

Since the target variable has 2 classes (an even number), we should use an odd number for K in the KNN model so that the majority vote cannot end in a tie.

Determine the K for the KNN model

sqrt(nrow(data_train_new))
## [1] 15.06652
knn_model <- knn(train = data_train_x,
                     test = data_test_x,
                     cl = data_train_y,
                     k = 15)

Model Evaluation

The models will be evaluated with a confusion matrix. A confusion matrix is a table that shows four categories: True Positive, True Negative, False Positive, and False Negative. The performance will be measured by Accuracy, Sensitivity/Recall, Specificity, and Precision (Saito and Rehmsmeier, 2015). Accuracy measures how many of the observations are predicted correctly. Sensitivity measures how many of the actual positive outcomes are correctly predicted. Specificity measures how many of the actual negative outcomes are correctly predicted. Precision measures how many of the positive predictions are correct.

In this case we will focus on Recall, since we care more about reducing false negatives than false positives.
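
To make these definitions concrete, here is a small sketch that computes the four metrics directly from the cells of a confusion matrix (the TP, TN, FP, and FN counts below are hypothetical, not taken from our models):

# Hypothetical confusion matrix cell counts, for illustration only
TP <- 40; TN <- 30; FP <- 10; FN <- 5

(TP + TN) / (TP + TN + FP + FN) # Accuracy: share of all predictions that are correct
TP / (TP + FN)                  # Sensitivity/Recall: share of actual positives that are found
TN / (TN + FP)                  # Specificity: share of actual negatives that are found
TP / (TP + FP)                  # Precision: share of positive predictions that are correct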

Logistic regression confusion matrix

Making a prediction using the logistic regression model

data_test_new$pred_Risk <- predict(object = lg_model,
        newdata = data_test_new,
        type = "response")

data_test_new$pred_Label <- ifelse(data_test_new$pred_Risk > 0.5 , yes = "1", no = "0")
data_test_new$pred_Label <- as.factor(data_test_new$pred_Label)
confusionMatrix(data = as.factor(data_test_new$pred_Label),
                reference = as.factor(data_test_y),
                positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 25  2
##          1 10 39
##                                           
##                Accuracy : 0.8421          
##                  95% CI : (0.7404, 0.9157)
##     No Information Rate : 0.5395          
##     P-Value [Acc > NIR] : 2.51e-08        
##                                           
##                   Kappa : 0.6768          
##                                           
##  Mcnemar's Test P-Value : 0.04331         
##                                           
##             Sensitivity : 0.9512          
##             Specificity : 0.7143          
##          Pos Pred Value : 0.7959          
##          Neg Pred Value : 0.9259          
##              Prevalence : 0.5395          
##          Detection Rate : 0.5132          
##    Detection Prevalence : 0.6447          
##       Balanced Accuracy : 0.8328          
##                                           
##        'Positive' Class : 1               
## 

KNN confusion matrix

confusionMatrix(data = as.factor(knn_model),
                reference = as.factor(data_test_y),
                positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 23  1
##          1 12 40
##                                           
##                Accuracy : 0.8289          
##                  95% CI : (0.7253, 0.9057)
##     No Information Rate : 0.5395          
##     P-Value [Acc > NIR] : 1.083e-07       
##                                           
##                   Kappa : 0.6476          
##                                           
##  Mcnemar's Test P-Value : 0.005546        
##                                           
##             Sensitivity : 0.9756          
##             Specificity : 0.6571          
##          Pos Pred Value : 0.7692          
##          Neg Pred Value : 0.9583          
##              Prevalence : 0.5395          
##          Detection Rate : 0.5263          
##    Detection Prevalence : 0.6842          
##       Balanced Accuracy : 0.8164          
##                                           
##        'Positive' Class : 1               
## 

Model fine-tuning

Logistic regression model fine-tuning

Using step-wise regression to find the best predictors

lg_tuned <- step(lg_model, direction = "backward", trace = F)
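
To see which predictors backward elimination kept, we could inspect the reduced model, for example:

# Retained predictors of the reduced model and AIC comparison with the full model
formula(lg_tuned)
AIC(lg_model, lg_tuned)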

Making a prediction using the fine-tuned logistic regression model

data_test_new$pred_Risk <- predict(object = lg_tuned,
        newdata = data_test_new,
        type = "response")

data_test_new$pred_Label <- ifelse(data_test_new$pred_Risk > 0.5 , yes = "1", no = "0")
data_test_new$pred_Label <- as.factor(data_test_new$pred_Label)

Logistic regression model confusion matrix

confusionMatrix(data = as.factor(data_test_new$pred_Label),
                reference = as.factor(data_test_y),
                positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 25  3
##          1 10 38
##                                           
##                Accuracy : 0.8289          
##                  95% CI : (0.7253, 0.9057)
##     No Information Rate : 0.5395          
##     P-Value [Acc > NIR] : 1.083e-07       
##                                           
##                   Kappa : 0.6506          
##                                           
##  Mcnemar's Test P-Value : 0.09609         
##                                           
##             Sensitivity : 0.9268          
##             Specificity : 0.7143          
##          Pos Pred Value : 0.7917          
##          Neg Pred Value : 0.8929          
##              Prevalence : 0.5395          
##          Detection Rate : 0.5000          
##    Detection Prevalence : 0.6316          
##       Balanced Accuracy : 0.8206          
##                                           
##        'Positive' Class : 1               
## 

KNN model fine-tuning

Changing the K value of KNN

knn_tuned <- knn(train = data_train_x,
                test = data_test_x,
                cl = data_train_y,
                k = 17)
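
Rather than trying a single alternative K by hand, one way to fine-tune is to scan several odd K values and compare their recall on the test set. A minimal sketch, assuming the same scaled train/test objects (not part of the original workflow):

# Sketch: try several odd K values and record recall (sensitivity) on the test set
k_values <- seq(5, 25, by = 2)

recalls <- sapply(k_values, function(k) {
  pred <- knn(train = data_train_x,
              test = data_test_x,
              cl = data_train_y,
              k = k)
  # Recall = TP / (TP + FN), with "1" as the positive class
  sum(pred == "1" & data_test_y == "1") / sum(data_test_y == "1")
})

data.frame(k = k_values, recall = recalls)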

KNN model confusion matrix

confusionMatrix(data = as.factor(knn_tuned),
                reference = as.factor(data_test_y),
                positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 23  1
##          1 12 40
##                                           
##                Accuracy : 0.8289          
##                  95% CI : (0.7253, 0.9057)
##     No Information Rate : 0.5395          
##     P-Value [Acc > NIR] : 1.083e-07       
##                                           
##                   Kappa : 0.6476          
##                                           
##  Mcnemar's Test P-Value : 0.005546        
##                                           
##             Sensitivity : 0.9756          
##             Specificity : 0.6571          
##          Pos Pred Value : 0.7692          
##          Neg Pred Value : 0.9583          
##              Prevalence : 0.5395          
##          Detection Rate : 0.5263          
##    Detection Prevalence : 0.6842          
##       Balanced Accuracy : 0.8164          
##                                           
##        'Positive' Class : 1               
## 

Conclusion

  • Since we focus on recall/sensitivity, the KNN model performed better on this dataset than the logistic regression model.
  • The original logistic regression model performed better on both accuracy and recall than the fine-tuned one.

In the medical world, I think it is crucial to have a model that reduces false negatives, that is, classifying a person as healthy when they actually are not, especially for cardiovascular disease since the heart is one of our core organs. Giving treatment to people who are actually healthy (false positives) is comparatively less dangerous.

Reference

Heart Disease UCI dataset creators: Andras Janosi, M.D. (Hungarian Institute of Cardiology, Budapest); William Steinbrunn, M.D. (University Hospital, Zurich, Switzerland); Matthias Pfisterer, M.D. (University Hospital, Basel, Switzerland); Robert Detrano, M.D., Ph.D. (V.A. Medical Center, Long Beach, and Cleveland Clinic Foundation). Donor: David W. Aha (aha ‘@’ ics.uci.edu), (714) 856-8779.

Saito, T. and Rehmsmeier, M. (2015). The Precision-Recall Plot Is More Informative than the ROC Plot When Evaluating Binary Classifiers on Imbalanced Datasets. PLoS ONE, 10(3): e0118432.