- age: age in years
- sex: sex (1 = male, 0 = female)
- cp: chest pain type (4 values)
- trestbps: resting blood pressure on admission to the hospital (mm Hg)
- chol: serum cholesterol (mg/dl)
- fbs: fasting blood sugar > 120 mg/dl (1 = true, 0 = false)
- restecg: resting electrocardiographic results (values 0, 1, 2)
- thalach: maximum heart rate achieved
- exang: exercise-induced angina (1 = yes, 0 = no)
- oldpeak: ST depression induced by exercise relative to rest
- slope: the slope of the peak exercise ST segment
- ca: number of major vessels (0-3) colored by fluoroscopy
- thal: 3 = normal; 6 = fixed defect; 7 = reversible defect
This dataset was created to predict the presence of heart disease. It was built to help ML researchers train models, look for trends in heart data that predict cardiovascular events, and find clear indicators of heart health. The predicted output gives a fair idea of whether heart disease is present or absent in a patient.
We will use Logistic Regression and KNN models on the Heart Disease UCI data from Kaggle. We want to understand the relationships among the variables, especially between the target and the other variables. We also want to predict the chance of someone having heart disease using historical data. You can download the data here: https://www.kaggle.com/ronitf/heart-disease-uci
library(dplyr)        # data wrangling
library(ggplot2)      # plotting
library(GGally)       # ggcorr() correlation plot
library(performance)  # model performance checks
library(MLmetrics)    # evaluation metrics
library(rmdformats)   # report theme
library(class)        # knn()
library(caret)        # data splitting and confusionMatrix()

heart <- read.csv("heart.csv")
rmarkdown::paged_table(heart)

heart %>%
  is.na() %>%
  colSums() / nrow(heart)
##      age      sex       cp trestbps     chol      fbs  restecg  thalach
##        0        0        0        0        0        0        0        0
##    exang  oldpeak    slope       ca     thal   target
##        0        0        0        0        0        0
There are no missing values, so the data is well prepared.
Next we inspect the data types; converting columns to the right types will ease the data analysis and machine learning process.
str(heart)
## 'data.frame': 303 obs. of 14 variables:
## $ age : int 63 37 41 56 57 57 56 44 52 57 ...
## $ sex : int 1 1 0 1 0 1 0 1 1 1 ...
## $ cp : int 3 2 1 1 0 0 1 1 2 2 ...
## $ trestbps: int 145 130 130 120 120 140 140 120 172 150 ...
## $ chol : int 233 250 204 236 354 192 294 263 199 168 ...
## $ fbs : int 1 0 0 0 0 0 0 0 1 0 ...
## $ restecg : int 0 1 0 1 1 1 0 1 1 1 ...
## $ thalach : int 150 187 172 178 163 148 153 173 162 174 ...
## $ exang : int 0 0 0 0 1 0 0 0 0 0 ...
## $ oldpeak : num 2.3 3.5 1.4 0.8 0.6 0.4 1.3 0 0.5 1.6 ...
## $ slope : int 0 0 2 2 2 1 1 2 2 2 ...
## $ ca : int 0 0 0 0 0 0 0 0 0 0 ...
## $ thal : int 1 2 2 2 2 1 2 3 3 2 ...
## $ target : int 1 1 1 1 1 1 1 1 1 1 ...
It looks like we need to convert sex, cp, fbs, exang, and target to factors:
heart <- heart %>%
  mutate_at(vars(sex, cp, fbs, exang, target), as.factor)
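In newer dplyr versions (1.0+), mutate_at is superseded by across; an equivalent using the newer idiom, if you prefer it, would be this sketch:
heart <- heart %>%
  mutate(across(c(sex, cp, fbs, exang, target), as.factor))  # same result as mutate_at above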
str(heart)
## 'data.frame': 303 obs. of 14 variables:
## $ age : int 63 37 41 56 57 57 56 44 52 57 ...
## $ sex : Factor w/ 2 levels "0","1": 2 2 1 2 1 2 1 2 2 2 ...
## $ cp : Factor w/ 4 levels "0","1","2","3": 4 3 2 2 1 1 2 2 3 3 ...
## $ trestbps: int 145 130 130 120 120 140 140 120 172 150 ...
## $ chol : int 233 250 204 236 354 192 294 263 199 168 ...
## $ fbs : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 1 1 2 1 ...
## $ restecg : int 0 1 0 1 1 1 0 1 1 1 ...
## $ thalach : int 150 187 172 178 163 148 153 173 162 174 ...
## $ exang : Factor w/ 2 levels "0","1": 1 1 1 1 2 1 1 1 1 1 ...
## $ oldpeak : num 2.3 3.5 1.4 0.8 0.6 0.4 1.3 0 0.5 1.6 ...
## $ slope : int 0 0 2 2 2 1 1 2 2 2 ...
## $ ca : int 0 0 0 0 0 0 0 0 0 0 ...
## $ thal : int 1 2 2 2 2 1 2 3 3 2 ...
## $ target : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
ggcorr(heart, label = T, hjust = 0.5)
## Warning in ggcorr(heart, label = T, hjust = 0.5): data in column(s) 'sex', 'cp',
## 'fbs', 'exang', 'target' are not numeric and were ignored
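If we also want the factor columns in the correlation overview, one option (a rough sketch; it treats the factor codes as plain numbers, so interpret with care) is to coerce everything back to numeric just for the plot:
heart %>%
  mutate_all(~ as.numeric(as.character(.))) %>%  # factor levels "0","1",... back to numbers
  ggcorr(label = T, hjust = 0.5)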
This step is necessary to set aside some “unseen” data for measuring the model’s accuracy and performance. We will use a 75:25 train:test split for this data.
prop.table(table(heart$target))
##
##         0         1
## 0.4554455 0.5445545
The classes are fairly balanced (roughly 46:54).
set.seed(123)
index <- sample(nrow(heart), nrow(heart)*0.75)
data_train <- heart[index,]
data_test <- heart[-index,]

prop.table(table(data_train$target))
##
##         0         1
## 0.4537445 0.5462555
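The random split already preserves the class proportions well. If we want a split that is stratified on target by construction, caret (loaded above) offers createDataPartition; a sketch:
set.seed(123)
idx <- createDataPartition(heart$target, p = 0.75, list = FALSE)
strat_train <- heart[idx, ]    # class proportions of target preserved by design
strat_test  <- heart[-idx, ]
prop.table(table(strat_train$target))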
Next we scale the numeric variables so they are all on a comparable scale; this matters especially for KNN, which classifies by distance and would otherwise be dominated by wide-ranged variables like chol.
# Take all the numeric variables
data_train_numeric <- data_train %>% select_if(is.numeric)
data_test_numeric <- data_test %>% select_if(is.numeric)
# Take all the non numeric variables
data_train_nn <- data_train %>% select(sex, cp, fbs, exang, target)
data_test_nn <- data_test %>% select(sex, cp, fbs, exang, target)
# Scale all the numeric variables
data_train_numeric_scaled <- scale(data_train_numeric)
data_test_numeric_scaled <- scale(data_test_numeric,
                                  center = attr(data_train_numeric_scaled, "scaled:center"),
                                  scale = attr(data_train_numeric_scaled, "scaled:scale"))
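As a quick sanity check (a sketch): the scaled training columns should have mean ~0 and sd ~1, while the test columns will be close to but not exactly that, since they reuse the training center and scale:
round(colMeans(data_train_numeric_scaled), 3)   # ~0 by construction
round(apply(data_test_numeric_scaled, 2, sd), 3)  # close to 1, but not exact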
# Combining the numeric data with the non numeric data
data_train_new <- cbind(data_train_nn, data_train_numeric_scaled)
data_test_new <- cbind(data_test_nn, data_test_numeric_scaled)

data_train_x <- data_train_new %>% select(-target)
data_test_x <- data_test_new %>% select(-target)
data_train_y <- data_train_new[, "target"]
data_test_y <- data_test_new[, "target"]

lg_model <- glm(formula = target ~ .,
                family = "binomial",
                data = data_train_new)

unique(data_train_new$target)
## [1] 0 1
## Levels: 0 1
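The logistic regression coefficients are on the log-odds scale; exponentiating turns them into odds ratios, which are easier to read (a quick sketch):
exp(coef(lg_model))   # odds ratio per unit increase in each predictor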
Since the target has 2 classes (an even number), we should use an odd number for k in the KNN model so that majority votes cannot tie. A common starting point is the square root of the number of training rows:
sqrt(nrow(data_train_new))
## [1] 15.06652
knn_model <- knn(train = data_train_x,
                 test = data_test_x,
                 cl = data_train_y,
                 k = 15)

Model evaluation will be done with a confusion matrix, a table that shows four categories: True Positive, True Negative, False Positive, and False Negative. From it we derive Accuracy, Sensitivity/Recall, Specificity, and Precision (Saito and Rehmsmeier, 2015). Accuracy measures how much of the data is predicted correctly. Sensitivity/Recall measures how many of the actual positives are predicted correctly. Specificity measures how many of the actual negatives are predicted correctly. Precision measures how many of our positive predictions are correct.
In this case we will evaluate our models mainly on Recall, since we put more weight on avoiding false negatives than false positives.
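As a concrete illustration with made-up counts (hypothetical numbers, not this model’s results), the four metrics follow directly from the confusion-matrix cells:
# Hypothetical confusion-matrix counts, for illustration only.
TP <- 90; TN <- 80; FP <- 15; FN <- 5
c(Accuracy    = (TP + TN) / (TP + TN + FP + FN),  # share of all predictions that are correct
  Recall      = TP / (TP + FN),                   # actual positives caught
  Specificity = TN / (TN + FP),                   # actual negatives caught
  Precision   = TP / (TP + FP))                   # positive predictions that are right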
Making predictions with the logistic regression model:
data_test_new$pred_Risk <- predict(object = lg_model,
                                   newdata = data_test_new,
                                   type = "response")
data_test_new$pred_Label <- ifelse(data_test_new$pred_Risk > 0.5, yes = "1", no = "0")
data_test_new$pred_Label <- as.factor(data_test_new$pred_Label)

confusionMatrix(data = as.factor(data_test_new$pred_Label),
                reference = as.factor(data_test_y),
                positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 25 2
## 1 10 39
##
## Accuracy : 0.8421
## 95% CI : (0.7404, 0.9157)
## No Information Rate : 0.5395
## P-Value [Acc > NIR] : 2.51e-08
##
## Kappa : 0.6768
##
## Mcnemar's Test P-Value : 0.04331
##
## Sensitivity : 0.9512
## Specificity : 0.7143
## Pos Pred Value : 0.7959
## Neg Pred Value : 0.9259
## Prevalence : 0.5395
## Detection Rate : 0.5132
## Detection Prevalence : 0.6447
## Balanced Accuracy : 0.8328
##
## 'Positive' Class : 1
##
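Since we emphasize recall, one lever worth noting (a sketch, not applied in the rest of this analysis) is lowering the 0.5 cutoff: a lower threshold flags more patients as positive, trading precision for recall:
lower_cut <- ifelse(data_test_new$pred_Risk > 0.3, "1", "0")  # hypothetical 0.3 cutoff
table(Prediction = lower_cut, Reference = data_test_y)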
confusionMatrix(data = as.factor(knn_model),
                reference = as.factor(data_test_y),
                positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 23 1
## 1 12 40
##
## Accuracy : 0.8289
## 95% CI : (0.7253, 0.9057)
## No Information Rate : 0.5395
## P-Value [Acc > NIR] : 1.083e-07
##
## Kappa : 0.6476
##
## Mcnemar's Test P-Value : 0.005546
##
## Sensitivity : 0.9756
## Specificity : 0.6571
## Pos Pred Value : 0.7692
## Neg Pred Value : 0.9583
## Prevalence : 0.5395
## Detection Rate : 0.5263
## Detection Prevalence : 0.6842
## Balanced Accuracy : 0.8164
##
## 'Positive' Class : 1
##
For tuning, we apply backward elimination with step(), which iteratively drops predictors that do not improve the model’s AIC, and then re-evaluate on the test set.
lg_tuned <- step(lg_model, direction = "backward", trace = F)

data_test_new$pred_Risk <- predict(object = lg_tuned,
                                   newdata = data_test_new,
                                   type = "response")
data_test_new$pred_Label <- ifelse(data_test_new$pred_Risk > 0.5, yes = "1", no = "0")
data_test_new$pred_Label <- as.factor(data_test_new$pred_Label)

confusionMatrix(data = as.factor(data_test_new$pred_Label),
                reference = as.factor(data_test_y),
                positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 25 3
## 1 10 38
##
## Accuracy : 0.8289
## 95% CI : (0.7253, 0.9057)
## No Information Rate : 0.5395
## P-Value [Acc > NIR] : 1.083e-07
##
## Kappa : 0.6506
##
## Mcnemar's Test P-Value : 0.09609
##
## Sensitivity : 0.9268
## Specificity : 0.7143
## Pos Pred Value : 0.7917
## Neg Pred Value : 0.8929
## Prevalence : 0.5395
## Detection Rate : 0.5000
## Detection Prevalence : 0.6316
## Balanced Accuracy : 0.8206
##
## 'Positive' Class : 1
##
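To see which predictors backward elimination actually kept, we can inspect the tuned model (a quick sketch):
formula(lg_tuned)   # predictors retained after backward elimination
summary(lg_tuned)   # coefficients of the reduced model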
For the KNN model, we try the next odd value, k = 17:
knn_tuned <- knn(train = data_train_x,
                 test = data_test_x,
                 cl = data_train_y,
                 k = 17)

confusionMatrix(data = as.factor(knn_tuned),
                reference = as.factor(data_test_y),
                positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 23 1
## 1 12 40
##
## Accuracy : 0.8289
## 95% CI : (0.7253, 0.9057)
## No Information Rate : 0.5395
## P-Value [Acc > NIR] : 1.083e-07
##
## Kappa : 0.6476
##
## Mcnemar's Test P-Value : 0.005546
##
## Sensitivity : 0.9756
## Specificity : 0.6571
## Pos Pred Value : 0.7692
## Neg Pred Value : 0.9583
## Prevalence : 0.5395
## Detection Rate : 0.5263
## Detection Prevalence : 0.6842
## Balanced Accuracy : 0.8164
##
## 'Positive' Class : 1
##
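Rather than trying a single alternative k by hand, we could scan a range of odd k values and compare test-set recall (a sketch reusing the objects defined above; the row/column names assume the factor levels "0" and "1"):
ks <- seq(3, 25, by = 2)   # odd values only, to avoid tied votes
recalls <- sapply(ks, function(k) {
  pred <- knn(train = data_train_x, test = data_test_x, cl = data_train_y, k = k)
  cm <- table(Prediction = pred, Reference = data_test_y)
  cm["1", "1"] / sum(cm[, "1"])   # recall = TP / (TP + FN)
})
data.frame(k = ks, recall = recalls)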
Based on recall/sensitivity, the KNN model performed better on this dataset than the logistic regression model. In the medical world, it is crucial to have a model that reduces cases where a patient is declared healthy when they actually are not, especially in cardiology, since the heart is one of our core organs. By contrast, giving treatment to a person who is actually healthy (a false positive) is the less dangerous mistake.
Data source acknowledgements:
- Hungarian Institute of Cardiology, Budapest: Andras Janosi, M.D.
- University Hospital, Zurich, Switzerland: William Steinbrunn, M.D.
- University Hospital, Basel, Switzerland: Matthias Pfisterer, M.D.
- V.A. Medical Center, Long Beach and Cleveland Clinic Foundation: Robert Detrano, M.D., Ph.D.
- Donor: David W. Aha (aha ‘@’ ics.uci.edu), (714) 856-8779