A hospital wants to predict which patients have heart disease. Predictions can be made from variables that are considered supporting factors. In this analysis, logistic regression and k-Nearest Neighbors (k-NN) are used as classification methods. Classifying patients as having heart disease or not may help the hospital decide on the required treatment
library(dplyr)
library(tidyverse)
library(readr)
library(class)
library(performance)
library(caret)
library(fastDummies)
library(ggplot2)
heart <- read.csv("heart.csv")
head(heart)
names(heart)
## [1] "ï..age" "sex" "cp" "trestbps" "chol" "fbs"
## [7] "restecg" "thalach" "exang" "oldpeak" "slope" "ca"
## [13] "thal" "target"
ï..age : age in years
sex : 1 = male ; 0 = female
cp : chest pain type
trestbps : resting blood pressure (in mmHg on admission to the hospital)
chol : serum cholesterol in mg/dL
fbs : fasting blood sugar > 120 mg/dL (1 = true ; 0 = false)
restecg : resting electrocardiographic results
thalach : maximum heart rate achieved
exang : exercise induced angina (1 = yes ; 0 = no)
oldpeak : ST depression induced by exercise relative to rest
slope : the slope of the peak exercise ST segment
ca : number of major vessels (0-3) colored by fluoroscopy
thal : 3 = normal ; 6 = fixed defect ; 7 = reversible defect
target : 1 = Not Health ; 0 = Health
str(heart)
## 'data.frame': 303 obs. of 14 variables:
## $ ï..age : int 63 37 41 56 57 57 56 44 52 57 ...
## $ sex : int 1 1 0 1 0 1 0 1 1 1 ...
## $ cp : int 3 2 1 1 0 0 1 1 2 2 ...
## $ trestbps: int 145 130 130 120 120 140 140 120 172 150 ...
## $ chol : int 233 250 204 236 354 192 294 263 199 168 ...
## $ fbs : int 1 0 0 0 0 0 0 0 1 0 ...
## $ restecg : int 0 1 0 1 1 1 0 1 1 1 ...
## $ thalach : int 150 187 172 178 163 148 153 173 162 174 ...
## $ exang : int 0 0 0 0 1 0 0 0 0 0 ...
## $ oldpeak : num 2.3 3.5 1.4 0.8 0.6 0.4 1.3 0 0.5 1.6 ...
## $ slope : int 0 0 2 2 2 1 1 2 2 2 ...
## $ ca : int 0 0 0 0 0 0 0 0 0 0 ...
## $ thal : int 1 2 2 2 2 1 2 3 3 2 ...
## $ target : int 1 1 1 1 1 1 1 1 1 1 ...
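Note that the first column is named ï..age rather than age; this is an artifact of the UTF-8 byte-order mark (BOM) at the start of the CSV file. The analysis below keeps the name as read, but a quick fix (a sketch only, assuming the file really is UTF-8 with a BOM; heart_clean is a hypothetical name) would be:
# Sketch: re-read the CSV declaring the BOM encoding so the first column is "age".
# (The rest of this analysis keeps the original ï..age name.)
heart_clean <- read.csv("heart.csv", fileEncoding = "UTF-8-BOM")
names(heart_clean)[1]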
Some of the variables have inappropriate data types, so we need to convert them
heart <- heart %>%
mutate(ï..age = as.numeric(ï..age),
trestbps = as.numeric(trestbps),
chol = as.numeric(chol),
thalach = as.numeric(thalach)) %>%
mutate_if(is.integer , as.factor) %>%
mutate(sex = factor(sex, levels = c(0,1), labels = c("Female", "Male")),
fbs =factor(fbs, levels = c(0,1), labels = c("False", "True")),
exang = factor(exang, levels = c(0,1), labels = c("No", "Yes")),
target = factor(target, levels = c(0,1), labels = c("Health", "Not Health")))
head(heart)
Then, we need to check for missing values
colSums(is.na(heart))
## ï..age sex cp trestbps chol fbs restecg thalach
## 0 0 0 0 0 0 0 0
## exang oldpeak slope ca thal target
## 0 0 0 0 0 0
Great! This dataset doesn't have any missing values
Check the class proportions of the target variable
prop.table(table(heart$target))
##
## Health Not Health
## 0.4554455 0.5445545
The target variable is almost balanced, so we don't need to upsample or downsample
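Had the classes been strongly imbalanced, the data could be rebalanced, for example with caret's upSample() (a sketch only; in practice this would be applied to the training split, and it is not needed for this dataset):
# Sketch: upsample the minority class so both classes have equal counts (not applied here)
heart_balanced <- upSample(x = heart %>% select(-target),
                           y = heart$target,
                           yname = "target")
prop.table(table(heart_balanced$target))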
Cross Validation (Train-Test Split)
Before creating a model, we need to split the data into a training set for modelling and a test set for evaluating the model. In this analysis, we use 80% of the data for training and 20% for testing
set.seed(123)
idx <- sample(nrow(heart), nrow(heart)*0.8)
heart_train <- heart[idx,]
heart_test <- heart[-idx,]
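As a quick sanity check (using only the objects defined above), we can confirm that the random split roughly preserves the class proportions:
# Class proportions in the training and test splits
prop.table(table(heart_train$target))
prop.table(table(heart_test$target))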
Modeling with logistic regression. The first model uses all predictor variables
model_all <- glm(formula = target~. , family = "binomial" , data = heart_train)
summary(model_all)
##
## Call:
## glm(formula = target ~ ., family = "binomial", data = heart_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7437 -0.3060 0.1529 0.4913 2.7541
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -14.062425 2399.547024 -0.006 0.99532
## ï..age 0.014721 0.027289 0.539 0.58957
## sexMale -1.734085 0.617151 -2.810 0.00496 **
## cp1 0.975779 0.681920 1.431 0.15245
## cp2 1.464324 0.590866 2.478 0.01320 *
## cp3 2.154947 0.726005 2.968 0.00300 **
## trestbps -0.023711 0.012476 -1.901 0.05736 .
## chol -0.002700 0.004582 -0.589 0.55579
## fbsTrue 0.595495 0.629487 0.946 0.34415
## restecg1 0.792680 0.446120 1.777 0.07560 .
## restecg2 -13.543663 1639.908265 -0.008 0.99341
## thalach 0.020688 0.013625 1.518 0.12892
## exangYes -0.907873 0.495323 -1.833 0.06682 .
## oldpeak -0.462527 0.264652 -1.748 0.08052 .
## slope1 -1.194357 1.045651 -1.142 0.25337
## slope2 -0.185171 1.145802 -0.162 0.87161
## ca1 -1.453988 0.575932 -2.525 0.01158 *
## ca2 -3.227193 0.796225 -4.053 0.0000505 ***
## ca3 -1.815324 1.017877 -1.783 0.07451 .
## ca4 0.928804 1.716882 0.541 0.58852
## thal1 17.826233 2399.545003 0.007 0.99407
## thal2 16.845401 2399.544873 0.007 0.99440
## thal3 15.513589 2399.544860 0.006 0.99484
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 333.48 on 241 degrees of freedom
## Residual deviance: 150.20 on 219 degrees of freedom
## AIC: 196.2
##
## Number of Fisher Scoring iterations: 15
Interpretation
At a 90% confidence level, the significant variables are sex, cp, trestbps, restecg, exang, oldpeak, and ca. For the first refitted model we will use only these significant variables, and for the second we will use the backward stepwise method
Fitting Model
In the first model, many predictor variables are not significant for the target variable. Therefore, we refit the model using only the significant variables
model_sig <- glm(formula = target~sex + cp + trestbps + restecg + exang + oldpeak + ca,
family = "binomial" , data = heart_train)
summary(model_sig)
##
## Call:
## glm(formula = target ~ sex + cp + trestbps + restecg + exang +
## oldpeak + ca, family = "binomial", data = heart_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5781 -0.3886 0.1710 0.5900 2.1931
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.84985 1.61524 3.003 0.002677 **
## sexMale -1.61782 0.46869 -3.452 0.000557 ***
## cp1 1.50081 0.62669 2.395 0.016628 *
## cp2 1.70387 0.50102 3.401 0.000672 ***
## cp3 2.22019 0.67264 3.301 0.000964 ***
## trestbps -0.02087 0.01107 -1.884 0.059513 .
## restecg1 0.60462 0.39344 1.537 0.124355
## restecg2 -13.27217 1021.02451 -0.013 0.989629
## exangYes -1.21697 0.44092 -2.760 0.005778 **
## oldpeak -0.71828 0.21339 -3.366 0.000762 ***
## ca1 -1.58006 0.48345 -3.268 0.001082 **
## ca2 -2.46323 0.64678 -3.808 0.000140 ***
## ca3 -2.17454 0.91821 -2.368 0.017873 *
## ca4 -0.06637 1.48599 -0.045 0.964377
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 333.48 on 241 degrees of freedom
## Residual deviance: 175.76 on 228 degrees of freedom
## AIC: 203.76
##
## Number of Fisher Scoring iterations: 14
For the second fitting model, we will use the stepwise backward method
heart_back <- step(object = model_all, direction = "backward" , trace = F)
heart_back
##
## Call: glm(formula = target ~ sex + cp + trestbps + thalach + exang +
## oldpeak + slope + ca + thal, family = "binomial", data = heart_train)
##
## Coefficients:
## (Intercept) sexMale cp1 cp2 cp3 trestbps
## -11.18813 -1.69715 1.01884 1.73436 2.20559 -0.02346
## thalach exangYes oldpeak slope1 slope2 ca1
## 0.01849 -0.85210 -0.51950 -1.19189 -0.17873 -1.53821
## ca2 ca3 ca4 thal1 thal2 thal3
## -2.92377 -1.59633 1.04799 15.89164 14.73215 13.52745
##
## Degrees of Freedom: 241 Total (i.e. Null); 224 Residual
## Null Deviance: 333.5
## Residual Deviance: 155.5 AIC: 191.5
model_back <- glm(formula = target ~ sex + cp + trestbps + thalach + exang + oldpeak + slope + ca + thal, family = "binomial", data = heart_train)
summary(model_back)
##
## Call:
## glm(formula = target ~ sex + cp + trestbps + thalach + exang +
## oldpeak + slope + ca + thal, family = "binomial", data = heart_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7222 -0.3292 0.1518 0.4779 2.8708
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -11.18813 1455.40016 -0.008 0.99387
## sexMale -1.69715 0.57842 -2.934 0.00335 **
## cp1 1.01884 0.66750 1.526 0.12692
## cp2 1.73436 0.58000 2.990 0.00279 **
## cp3 2.20559 0.71105 3.102 0.00192 **
## trestbps -0.02346 0.01111 -2.111 0.03475 *
## thalach 0.01849 0.01252 1.478 0.13946
## exangYes -0.85210 0.48565 -1.755 0.07933 .
## oldpeak -0.51950 0.25345 -2.050 0.04039 *
## slope1 -1.19189 1.03092 -1.156 0.24762
## slope2 -0.17873 1.12988 -0.158 0.87431
## ca1 -1.53821 0.54601 -2.817 0.00484 **
## ca2 -2.92377 0.72864 -4.013 0.00006 ***
## ca3 -1.59633 0.98979 -1.613 0.10679
## ca4 1.04799 1.60576 0.653 0.51399
## thal1 15.89164 1455.39784 0.011 0.99129
## thal2 14.73215 1455.39766 0.010 0.99192
## thal3 13.52745 1455.39765 0.009 0.99258
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 333.48 on 241 degrees of freedom
## Residual deviance: 155.54 on 224 degrees of freedom
## AIC: 191.54
##
## Number of Fisher Scoring iterations: 14
Now that we have three models, we need to choose the best one. We can compare the models with the performance package and select the one with the smallest AIC
compare_performance(model_all, model_sig , model_back)
Of the three models, the logistic regression model obtained with the backward stepwise method has the smallest AIC, so we will make predictions with model_back
pred_heart <- predict(model_back, newdata = heart_test, type = "response")
Visualize the distribution of probabilities from our prediction vector
ggplot(heart_test, aes(x = pred_heart)) +
geom_histogram(bins = 10, fill = "lightblue3") +
labs(title = "Distribution of Probability Prediction") +
theme_minimal()
From the histogram we can see that the predicted probabilities lean towards 1, which corresponds to Not Health
Next, we set the classification threshold at 0.5 and build the confusion matrix
result_pred <- ifelse(pred_heart >= 0.5, "Not Health", "Health")
conf_log <- confusionMatrix(as.factor(result_pred),
reference = heart_test$target,
positive = "Not Health")
conf_log
## Confusion Matrix and Statistics
##
## Reference
## Prediction Health Not Health
## Health 21 2
## Not Health 7 31
##
## Accuracy : 0.8525
## 95% CI : (0.7383, 0.9302)
## No Information Rate : 0.541
## P-Value [Acc > NIR] : 0.00000026
##
## Kappa : 0.6988
##
## Mcnemar's Test P-Value : 0.1824
##
## Sensitivity : 0.9394
## Specificity : 0.7500
## Pos Pred Value : 0.8158
## Neg Pred Value : 0.9130
## Prevalence : 0.5410
## Detection Rate : 0.5082
## Detection Prevalence : 0.6230
## Balanced Accuracy : 0.8447
##
## 'Positive' Class : Not Health
##
Interpretation
From the confusion matrix: the logistic regression model classifies 85.25% of the test observations correctly (accuracy), detects 93.94% of the patients who actually have heart disease (sensitivity/recall), correctly identifies 75.00% of the healthy patients (specificity), and 81.58% of the patients it predicts to have heart disease actually do (precision/positive predictive value).
We check the data types again because k-Nearest Neighbors (k-NN) computes distances between observations, so it needs numeric predictors and cannot use factor columns directly
str(heart)
## 'data.frame': 303 obs. of 14 variables:
## $ ï..age : num 63 37 41 56 57 57 56 44 52 57 ...
## $ sex : Factor w/ 2 levels "Female","Male": 2 2 1 2 1 2 1 2 2 2 ...
## $ cp : Factor w/ 4 levels "0","1","2","3": 4 3 2 2 1 1 2 2 3 3 ...
## $ trestbps: num 145 130 130 120 120 140 140 120 172 150 ...
## $ chol : num 233 250 204 236 354 192 294 263 199 168 ...
## $ fbs : Factor w/ 2 levels "False","True": 2 1 1 1 1 1 1 1 2 1 ...
## $ restecg : Factor w/ 3 levels "0","1","2": 1 2 1 2 2 2 1 2 2 2 ...
## $ thalach : num 150 187 172 178 163 148 153 173 162 174 ...
## $ exang : Factor w/ 2 levels "No","Yes": 1 1 1 1 2 1 1 1 1 1 ...
## $ oldpeak : num 2.3 3.5 1.4 0.8 0.6 0.4 1.3 0 0.5 1.6 ...
## $ slope : Factor w/ 3 levels "0","1","2": 1 1 3 3 3 2 2 3 3 3 ...
## $ ca : Factor w/ 5 levels "0","1","2","3",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ thal : Factor w/ 4 levels "0","1","2","3": 2 3 3 3 3 2 3 4 4 3 ...
## $ target : Factor w/ 2 levels "Health","Not Health": 2 2 2 2 2 2 2 2 2 2 ...
Create dummy variables from the categorical columns with the fastDummies package, so they can be used in the k-NN classification
heart_knn <- dummy_cols(heart,
select_columns = c("sex","cp","fbs","restecg","exang","slope","ca","thal"),
remove_selected_columns = TRUE)
str(heart_knn)
## 'data.frame': 303 obs. of 31 variables:
## $ ï..age : num 63 37 41 56 57 57 56 44 52 57 ...
## $ trestbps : num 145 130 130 120 120 140 140 120 172 150 ...
## $ chol : num 233 250 204 236 354 192 294 263 199 168 ...
## $ thalach : num 150 187 172 178 163 148 153 173 162 174 ...
## $ oldpeak : num 2.3 3.5 1.4 0.8 0.6 0.4 1.3 0 0.5 1.6 ...
## $ target : Factor w/ 2 levels "Health","Not Health": 2 2 2 2 2 2 2 2 2 2 ...
## $ sex_Female: int 0 0 1 0 1 0 1 0 0 0 ...
## $ sex_Male : int 1 1 0 1 0 1 0 1 1 1 ...
## $ cp_0 : int 0 0 0 0 1 1 0 0 0 0 ...
## $ cp_1 : int 0 0 1 1 0 0 1 1 0 0 ...
## $ cp_2 : int 0 1 0 0 0 0 0 0 1 1 ...
## $ cp_3 : int 1 0 0 0 0 0 0 0 0 0 ...
## $ fbs_False : int 0 1 1 1 1 1 1 1 0 1 ...
## $ fbs_True : int 1 0 0 0 0 0 0 0 1 0 ...
## $ restecg_0 : int 1 0 1 0 0 0 1 0 0 0 ...
## $ restecg_1 : int 0 1 0 1 1 1 0 1 1 1 ...
## $ restecg_2 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ exang_No : int 1 1 1 1 0 1 1 1 1 1 ...
## $ exang_Yes : int 0 0 0 0 1 0 0 0 0 0 ...
## $ slope_0 : int 1 1 0 0 0 0 0 0 0 0 ...
## $ slope_1 : int 0 0 0 0 0 1 1 0 0 0 ...
## $ slope_2 : int 0 0 1 1 1 0 0 1 1 1 ...
## $ ca_0 : int 1 1 1 1 1 1 1 1 1 1 ...
## $ ca_1 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ ca_2 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ ca_3 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ ca_4 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ thal_0 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ thal_1 : int 1 0 0 0 0 1 0 0 0 0 ...
## $ thal_2 : int 0 1 1 1 1 0 1 0 0 1 ...
## $ thal_3 : int 0 0 0 0 0 0 0 1 1 0 ...
For each two-level variable we drop one of its dummy columns (it is redundant), and then we scale the data
heart_knn <- heart_knn %>%
select(-c(sex_Female , fbs_False , exang_No)) %>%
mutate_if(is.numeric, scale)
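Note that the scaling above standardizes each column using the mean and standard deviation of the full dataset. A stricter alternative, sketched below but not used in the rest of this analysis, is to compute the scaling statistics on the training rows only and apply them to the test rows, so no test-set information enters the preprocessing:
# Sketch: rebuild the unscaled dummy data, then scale with training-set statistics only
heart_knn_raw <- dummy_cols(heart,
                            select_columns = c("sex","cp","fbs","restecg",
                                               "exang","slope","ca","thal"),
                            remove_selected_columns = TRUE) %>%
  select(-c(sex_Female, fbs_False, exang_No))
train_scaled <- scale(heart_knn_raw[idx, -6])   # column 6 is the target factor
test_scaled  <- scale(heart_knn_raw[-idx, -6],
                      center = attr(train_scaled, "scaled:center"),
                      scale  = attr(train_scaled, "scaled:scale"))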
set.seed(321)
train_y <- heart_knn[idx , 6]
test_y <- heart_knn[-idx , 6]
train_x <- heart_knn[idx , -6]
test_x <- heart_knn[-idx , -6]
Choose the number of neighbors k. A common rule of thumb is the square root of the number of training observations
sqrt(nrow(train_x))
## [1] 15.55635
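The square-root rule suggests k ≈ 15.6, so we use k = 15 (an odd value avoids voting ties between the two classes). If we preferred to tune k rather than rely on this rule of thumb, a rough sketch (ideally evaluated on a separate validation split rather than the test set) is:
# Sketch: test-set accuracy for a range of odd k values
k_values <- seq(3, 25, by = 2)
acc <- sapply(k_values, function(k) {
  mean(knn(train = train_x, test = test_x, cl = train_y, k = k) == test_y)
})
data.frame(k = k_values, accuracy = acc)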
pred_knn <- knn(train = train_x,
test = test_x,
cl = train_y,
k = 15)
conf_knn <- confusionMatrix(as.factor(pred_knn),
reference = as.factor(test_y),
positive = "Not Health")
conf_knn
## Confusion Matrix and Statistics
##
## Reference
## Prediction Health Not Health
## Health 21 3
## Not Health 7 30
##
## Accuracy : 0.8361
## 95% CI : (0.7191, 0.9185)
## No Information Rate : 0.541
## P-Value [Acc > NIR] : 0.000001184
##
## Kappa : 0.6663
##
## Mcnemar's Test P-Value : 0.3428
##
## Sensitivity : 0.9091
## Specificity : 0.7500
## Pos Pred Value : 0.8108
## Neg Pred Value : 0.8750
## Prevalence : 0.5410
## Detection Rate : 0.4918
## Detection Prevalence : 0.6066
## Balanced Accuracy : 0.8295
##
## 'Positive' Class : Not Health
##
Interpretation
From the confusion matrix: the k-NN model reaches 83.61% accuracy on the test data, with 90.91% sensitivity (recall), 75.00% specificity, and 81.08% precision (positive predictive value).
eval_logistic <- tibble(Accuracy = conf_log$overall[1],
                        Recall = conf_log$byClass[1],
                        Specificity = conf_log$byClass[2],
                        Precision = conf_log$byClass[3])
eval_knn <- tibble(Accuracy = conf_knn$overall[1],
                   Recall = conf_knn$byClass[1],
                   Specificity = conf_knn$byClass[2],
                   Precision = conf_knn$byClass[3])
Model Evaluation Logistic Regression
eval_logistic
Model Evaluation k-Nearest Neighbors
eval_knn
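For a side-by-side comparison, the two evaluation tables can be combined into one (using only the objects above):
# Combine the evaluation metrics of both models into a single table
bind_rows("Logistic Regression" = eval_logistic,
          "k-NN" = eval_knn,
          .id = "Model")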
Comparing the two methods, the Logistic Regression model performs better than k-Nearest Neighbors here: its precision of 81.58% (the proportion of patients predicted to have heart disease who actually have it) is higher than the 81.08% achieved by k-NN
From the results of this analysis: a doctor or hospital nurse wants to minimize the number of people who do not have heart disease but are predicted to have it, because giving heart-disease treatment to patients who are not actually sick could endanger them. Therefore, the metric of interest here is precision.
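Because precision is the priority, the 0.5 cutoff used earlier is not fixed: raising the threshold makes the model more conservative about predicting heart disease, which typically raises precision at the cost of recall. A sketch (the 0.7 cutoff is purely illustrative, not a recommendation):
# Sketch: a stricter threshold to favor precision over recall (0.7 is illustrative)
result_strict <- ifelse(pred_heart >= 0.7, "Not Health", "Health")
confusionMatrix(factor(result_strict, levels = levels(heart_test$target)),
                reference = heart_test$target,
                positive = "Not Health")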