Intro

A hospital wants to predict which patients have heart disease, based on variables considered to be supporting factors. In this analysis, logistic regression and k-Nearest Neighbors (k-NN) are used as classification methods. Classifying patients as having heart disease or not may help the hospital decide on the required treatment.

Library Preparation

library(dplyr)
library(tidyverse)
library(readr)
library(class)
library(performance)
library(caret)
library(fastDummies)
library(ggplot2)

Logistic Regression

Import Data

heart <- read.csv("heart.csv")

Data Inspection

head(heart)
names(heart)
##  [1] "ï..age"   "sex"      "cp"       "trestbps" "chol"     "fbs"     
##  [7] "restecg"  "thalach"  "exang"    "oldpeak"  "slope"    "ca"      
## [13] "thal"     "target"
  • ï..age : age in years
  • sex : 1 = male ; 0 = female
  • cp : chest pain type
  • trestbps : resting blood pressure (in mmHg on admission to the hospital)
  • chol : serum cholesterol in mg/dL
  • fbs : fasting blood sugar > 120 mg/dL (1 = true ; 0 = false)
  • restecg : resting electrocardiographic results
  • thalach : maximum heart rate achieved
  • exang : exercise induced angina (1 = yes; 0 = no)
  • oldpeak : ST depression induced by exercise relative to rest
  • slope : the slope of the peak exercise ST segment
  • ca : number of major vessels (0-3) colored by fluoroscopy
  • thal : 3 = normal; 6 = fixed defect; 7 = reversible defect
  • target : 1 = Not Health (has heart disease) ; 0 = Health (no heart disease)

Data Cleansing

str(heart)
## 'data.frame':    303 obs. of  14 variables:
##  $ ï..age  : int  63 37 41 56 57 57 56 44 52 57 ...
##  $ sex     : int  1 1 0 1 0 1 0 1 1 1 ...
##  $ cp      : int  3 2 1 1 0 0 1 1 2 2 ...
##  $ trestbps: int  145 130 130 120 120 140 140 120 172 150 ...
##  $ chol    : int  233 250 204 236 354 192 294 263 199 168 ...
##  $ fbs     : int  1 0 0 0 0 0 0 0 1 0 ...
##  $ restecg : int  0 1 0 1 1 1 0 1 1 1 ...
##  $ thalach : int  150 187 172 178 163 148 153 173 162 174 ...
##  $ exang   : int  0 0 0 0 1 0 0 0 0 0 ...
##  $ oldpeak : num  2.3 3.5 1.4 0.8 0.6 0.4 1.3 0 0.5 1.6 ...
##  $ slope   : int  0 0 2 2 2 1 1 2 2 2 ...
##  $ ca      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ thal    : int  1 2 2 2 2 1 2 3 3 2 ...
##  $ target  : int  1 1 1 1 1 1 1 1 1 1 ...

Some of the variables have inappropriate data types, so we need to convert them before modeling.

heart <- heart %>% 
  mutate(ï..age = as.numeric(ï..age),
         trestbps = as.numeric(trestbps),
         chol = as.numeric(chol),
         thalach = as.numeric(thalach)) %>% 
  mutate_if(is.integer , as.factor) %>% 
  mutate(sex = factor(sex, levels = c(0,1), labels = c("Female", "Male")),
         fbs =factor(fbs, levels = c(0,1), labels = c("False", "True")),
         exang = factor(exang, levels = c(0,1), labels = c("No", "Yes")),
         target = factor(target, levels = c(0,1), labels = c("Health", "Not Health")))

head(heart)

Next, we check for missing values.

colSums(is.na(heart))
##   ï..age      sex       cp trestbps     chol      fbs  restecg  thalach 
##        0        0        0        0        0        0        0        0 
##    exang  oldpeak    slope       ca     thal   target 
##        0        0        0        0        0        0

Great! This dataset doesn't have any missing values.

Pre-Processing Data

Check the class proportions of the target variable.

prop.table(table(heart$target))
## 
##     Health Not Health 
##  0.4554455  0.5445545

The target classes are almost balanced, so we don't need to upsample or downsample.
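
Had the classes been imbalanced, caret's downSample()/upSample() could rebalance them. A sketch, shown only for reference (heart_down is not used below):

# Sketch: downsample the majority class to a 50/50 split
heart_down <- downSample(x = heart %>% select(-target),
                         y = heart$target,
                         yname = "target")
prop.table(table(heart_down$target))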

Cross Validation

Before building a model, we need to split the data into a training set for modeling and a test set for measuring how well the model generalizes. In this analysis, we use 80% of the data for training and 20% for testing.

set.seed(123)
idx <- sample(nrow(heart), nrow(heart)*0.8)
heart_train <- heart[idx,]
heart_test <- heart[-idx,]
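
Note that sample() draws rows without regard to class. caret's createDataPartition() performs a stratified split that preserves the Health/Not Health proportions in both sets; a sketch (idx_strat is illustrative, and the analysis below keeps the simple split):

# Sketch: stratified 80/20 split on the target variable
set.seed(123)
idx_strat <- createDataPartition(heart$target, p = 0.8, list = FALSE)
prop.table(table(heart[idx_strat, "target"]))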

Modeling

We model with logistic regression. The first model uses all predictor variables.

model_all <- glm(formula = target~. , family = "binomial" , data = heart_train)
summary(model_all)
## 
## Call:
## glm(formula = target ~ ., family = "binomial", data = heart_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7437  -0.3060   0.1529   0.4913   2.7541  
## 
## Coefficients:
##                Estimate  Std. Error z value  Pr(>|z|)    
## (Intercept)  -14.062425 2399.547024  -0.006   0.99532    
## ï..age         0.014721    0.027289   0.539   0.58957    
## sexMale       -1.734085    0.617151  -2.810   0.00496 ** 
## cp1            0.975779    0.681920   1.431   0.15245    
## cp2            1.464324    0.590866   2.478   0.01320 *  
## cp3            2.154947    0.726005   2.968   0.00300 ** 
## trestbps      -0.023711    0.012476  -1.901   0.05736 .  
## chol          -0.002700    0.004582  -0.589   0.55579    
## fbsTrue        0.595495    0.629487   0.946   0.34415    
## restecg1       0.792680    0.446120   1.777   0.07560 .  
## restecg2     -13.543663 1639.908265  -0.008   0.99341    
## thalach        0.020688    0.013625   1.518   0.12892    
## exangYes      -0.907873    0.495323  -1.833   0.06682 .  
## oldpeak       -0.462527    0.264652  -1.748   0.08052 .  
## slope1        -1.194357    1.045651  -1.142   0.25337    
## slope2        -0.185171    1.145802  -0.162   0.87161    
## ca1           -1.453988    0.575932  -2.525   0.01158 *  
## ca2           -3.227193    0.796225  -4.053 0.0000505 ***
## ca3           -1.815324    1.017877  -1.783   0.07451 .  
## ca4            0.928804    1.716882   0.541   0.58852    
## thal1         17.826233 2399.545003   0.007   0.99407    
## thal2         16.845401 2399.544873   0.007   0.99440    
## thal3         15.513589 2399.544860   0.006   0.99484    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 333.48  on 241  degrees of freedom
## Residual deviance: 150.20  on 219  degrees of freedom
## AIC: 196.2
## 
## Number of Fisher Scoring iterations: 15

Interpretation: At a 90% confidence level, the significant variables are sex, cp, trestbps, restecg, exang, oldpeak, and ca. (The extreme estimates and standard errors for the thal and restecg2 terms suggest quasi-complete separation for those factor levels, so their p-values should not be taken at face value.) For the first fitted model we will use only the significant variables, and for the second we will use the stepwise method.
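
The list of significant terms can also be pulled programmatically from the coefficient table (a small convenience sketch; 0.10 is the cutoff for the 90% level):

# Rows of the coefficient table with p-value below 0.10
coef_table <- coef(summary(model_all))
rownames(coef_table)[coef_table[, "Pr(>|z|)"] < 0.10]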

Fitting Model

The first model still contains many predictors that are not significant for the target variable, so we refit using only the significant ones.

model_sig <- glm(formula = target~sex + cp + trestbps + restecg + exang + oldpeak + ca, 
                 family = "binomial" , data = heart_train)
summary(model_sig)
## 
## Call:
## glm(formula = target ~ sex + cp + trestbps + restecg + exang + 
##     oldpeak + ca, family = "binomial", data = heart_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5781  -0.3886   0.1710   0.5900   2.1931  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    4.84985    1.61524   3.003 0.002677 ** 
## sexMale       -1.61782    0.46869  -3.452 0.000557 ***
## cp1            1.50081    0.62669   2.395 0.016628 *  
## cp2            1.70387    0.50102   3.401 0.000672 ***
## cp3            2.22019    0.67264   3.301 0.000964 ***
## trestbps      -0.02087    0.01107  -1.884 0.059513 .  
## restecg1       0.60462    0.39344   1.537 0.124355    
## restecg2     -13.27217 1021.02451  -0.013 0.989629    
## exangYes      -1.21697    0.44092  -2.760 0.005778 ** 
## oldpeak       -0.71828    0.21339  -3.366 0.000762 ***
## ca1           -1.58006    0.48345  -3.268 0.001082 ** 
## ca2           -2.46323    0.64678  -3.808 0.000140 ***
## ca3           -2.17454    0.91821  -2.368 0.017873 *  
## ca4           -0.06637    1.48599  -0.045 0.964377    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 333.48  on 241  degrees of freedom
## Residual deviance: 175.76  on 228  degrees of freedom
## AIC: 203.76
## 
## Number of Fisher Scoring iterations: 14

For the second fitted model, we use backward stepwise selection.

heart_back <- step(object = model_all, direction = "backward", trace = FALSE)
heart_back
## 
## Call:  glm(formula = target ~ sex + cp + trestbps + thalach + exang + 
##     oldpeak + slope + ca + thal, family = "binomial", data = heart_train)
## 
## Coefficients:
## (Intercept)      sexMale          cp1          cp2          cp3     trestbps  
##   -11.18813     -1.69715      1.01884      1.73436      2.20559     -0.02346  
##     thalach     exangYes      oldpeak       slope1       slope2          ca1  
##     0.01849     -0.85210     -0.51950     -1.19189     -0.17873     -1.53821  
##         ca2          ca3          ca4        thal1        thal2        thal3  
##    -2.92377     -1.59633      1.04799     15.89164     14.73215     13.52745  
## 
## Degrees of Freedom: 241 Total (i.e. Null);  224 Residual
## Null Deviance:       333.5 
## Residual Deviance: 155.5     AIC: 191.5
model_back <- glm(formula = target ~ sex + cp + trestbps + thalach + exang + oldpeak + slope + ca + thal, family = "binomial", data = heart_train)
summary(model_back)
## 
## Call:
## glm(formula = target ~ sex + cp + trestbps + thalach + exang + 
##     oldpeak + slope + ca + thal, family = "binomial", data = heart_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7222  -0.3292   0.1518   0.4779   2.8708  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -11.18813 1455.40016  -0.008  0.99387    
## sexMale       -1.69715    0.57842  -2.934  0.00335 ** 
## cp1            1.01884    0.66750   1.526  0.12692    
## cp2            1.73436    0.58000   2.990  0.00279 ** 
## cp3            2.20559    0.71105   3.102  0.00192 ** 
## trestbps      -0.02346    0.01111  -2.111  0.03475 *  
## thalach        0.01849    0.01252   1.478  0.13946    
## exangYes      -0.85210    0.48565  -1.755  0.07933 .  
## oldpeak       -0.51950    0.25345  -2.050  0.04039 *  
## slope1        -1.19189    1.03092  -1.156  0.24762    
## slope2        -0.17873    1.12988  -0.158  0.87431    
## ca1           -1.53821    0.54601  -2.817  0.00484 ** 
## ca2           -2.92377    0.72864  -4.013  0.00006 ***
## ca3           -1.59633    0.98979  -1.613  0.10679    
## ca4            1.04799    1.60576   0.653  0.51399    
## thal1         15.89164 1455.39784   0.011  0.99129    
## thal2         14.73215 1455.39766   0.010  0.99192    
## thal3         13.52745 1455.39765   0.009  0.99258    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 333.48  on 241  degrees of freedom
## Residual deviance: 155.54  on 224  degrees of freedom
## AIC: 191.54
## 
## Number of Fisher Scoring iterations: 14
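
Because the model uses a logit link, exponentiating a coefficient gives an odds ratio. For example, exp(-1.697) ≈ 0.18 for sexMale: holding the other predictors fixed, the estimated odds of heart disease for male patients are about 18% of the odds for female patients.

# Odds ratios: multiplicative effect of each predictor on the odds of "Not Health"
exp(coef(model_back))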

Now that we have three models, we need to choose the best one. We can compare them with the performance package and select the model with the smallest AIC.

compare_performance(model_all, model_sig , model_back)
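
As a quick cross-check, base R's AIC() reports the same ranking:

# Base-R cross-check of the AIC comparison
AIC(model_all, model_sig, model_back)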

Of the three models, the backward stepwise model has the smallest AIC (191.54), so we will predict with model_back.

Prediction

pred_heart <- predict(model_back, newdata = heart_test, type = "response")

Visualize the distribution of probabilities from our prediction vector

ggplot(heart_test, aes(x = pred_heart)) +
  geom_histogram(bins = 10, fill = "lightblue3") +
  labs(title = "Distribution of Probability Prediction") +
  theme_minimal()

The histogram shows that the predicted probabilities lean toward 1, which corresponds to Not Health.

Next, we convert the probabilities into class labels with a 0.5 threshold.

result_pred <- ifelse(pred_heart >= 0.5, "Not Health", "Health")
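
The 0.5 cutoff is a modeling choice, not a given. Raising it makes the model stricter about flagging a patient as Not Health, trading recall for precision; a sketch with a hypothetical 0.7 cutoff:

# Hypothetical stricter cutoff: fewer patients flagged as Not Health
result_strict <- ifelse(pred_heart >= 0.7, "Not Health", "Health")
table(Prediction = result_strict, Reference = heart_test$target)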

Model Evaluation

conf_log <- confusionMatrix(as.factor(result_pred), 
                            reference = heart_test$target, 
                            positive = "Not Health")
conf_log
## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   Health Not Health
##   Health         21          2
##   Not Health      7         31
##                                           
##                Accuracy : 0.8525          
##                  95% CI : (0.7383, 0.9302)
##     No Information Rate : 0.541           
##     P-Value [Acc > NIR] : 0.00000026      
##                                           
##                   Kappa : 0.6988          
##                                           
##  Mcnemar's Test P-Value : 0.1824          
##                                           
##             Sensitivity : 0.9394          
##             Specificity : 0.7500          
##          Pos Pred Value : 0.8158          
##          Neg Pred Value : 0.9130          
##              Prevalence : 0.5410          
##          Detection Rate : 0.5082          
##    Detection Prevalence : 0.6230          
##       Balanced Accuracy : 0.8447          
##                                           
##        'Positive' Class : Not Health      
## 
  • Accuracy = Proportion of correctly identified cases from all cases
  • Sensitivity/Recall = Proportion of positives that are correctly identified
  • Specificity = Proportion of negatives that are correctly identified
  • Precision (Pos Pred Value) = Proportion of correctly identified positives from all cases classified as positive
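
Computed directly from the matrix above (TP = 31, FN = 2, TN = 21, FP = 7):

(21 + 31) / 61   # Accuracy    = 0.8525
31 / (31 + 2)    # Recall      = 0.9394
21 / (21 + 7)    # Specificity = 0.7500
31 / (31 + 7)    # Precision   = 0.8158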

Interpretation

From the confusion matrix:

  • Out of the 61 patients in our test set, we classified 52 correctly (Accuracy = 85.25%)
  • Out of the 33 patients who actually have heart disease, we classified 31 correctly (Recall = 93.94%)
  • Out of the 28 actually healthy patients, we classified 21 correctly (Specificity = 75%)
  • Out of the 38 patients predicted to have heart disease, 31 actually do (Precision = 81.58%)

k-Nearest Neighbour

Pre-Processing Data

We check the data types again because k-Nearest Neighbors (k-NN) is distance-based and cannot work directly with factor variables.

str(heart)
## 'data.frame':    303 obs. of  14 variables:
##  $ ï..age  : num  63 37 41 56 57 57 56 44 52 57 ...
##  $ sex     : Factor w/ 2 levels "Female","Male": 2 2 1 2 1 2 1 2 2 2 ...
##  $ cp      : Factor w/ 4 levels "0","1","2","3": 4 3 2 2 1 1 2 2 3 3 ...
##  $ trestbps: num  145 130 130 120 120 140 140 120 172 150 ...
##  $ chol    : num  233 250 204 236 354 192 294 263 199 168 ...
##  $ fbs     : Factor w/ 2 levels "False","True": 2 1 1 1 1 1 1 1 2 1 ...
##  $ restecg : Factor w/ 3 levels "0","1","2": 1 2 1 2 2 2 1 2 2 2 ...
##  $ thalach : num  150 187 172 178 163 148 153 173 162 174 ...
##  $ exang   : Factor w/ 2 levels "No","Yes": 1 1 1 1 2 1 1 1 1 1 ...
##  $ oldpeak : num  2.3 3.5 1.4 0.8 0.6 0.4 1.3 0 0.5 1.6 ...
##  $ slope   : Factor w/ 3 levels "0","1","2": 1 1 3 3 3 2 2 3 3 3 ...
##  $ ca      : Factor w/ 5 levels "0","1","2","3",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ thal    : Factor w/ 4 levels "0","1","2","3": 2 3 3 3 3 2 3 4 4 3 ...
##  $ target  : Factor w/ 2 levels "Health","Not Health": 2 2 2 2 2 2 2 2 2 2 ...

Create dummy variables from the categorical predictors with the fastDummies package.

heart_knn <- dummy_cols(heart, 
                        select_columns = c("sex","cp","fbs","restecg","exang","slope","ca","thal"),
                        remove_selected_columns = TRUE)
str(heart_knn)
## 'data.frame':    303 obs. of  31 variables:
##  $ ï..age    : num  63 37 41 56 57 57 56 44 52 57 ...
##  $ trestbps  : num  145 130 130 120 120 140 140 120 172 150 ...
##  $ chol      : num  233 250 204 236 354 192 294 263 199 168 ...
##  $ thalach   : num  150 187 172 178 163 148 153 173 162 174 ...
##  $ oldpeak   : num  2.3 3.5 1.4 0.8 0.6 0.4 1.3 0 0.5 1.6 ...
##  $ target    : Factor w/ 2 levels "Health","Not Health": 2 2 2 2 2 2 2 2 2 2 ...
##  $ sex_Female: int  0 0 1 0 1 0 1 0 0 0 ...
##  $ sex_Male  : int  1 1 0 1 0 1 0 1 1 1 ...
##  $ cp_0      : int  0 0 0 0 1 1 0 0 0 0 ...
##  $ cp_1      : int  0 0 1 1 0 0 1 1 0 0 ...
##  $ cp_2      : int  0 1 0 0 0 0 0 0 1 1 ...
##  $ cp_3      : int  1 0 0 0 0 0 0 0 0 0 ...
##  $ fbs_False : int  0 1 1 1 1 1 1 1 0 1 ...
##  $ fbs_True  : int  1 0 0 0 0 0 0 0 1 0 ...
##  $ restecg_0 : int  1 0 1 0 0 0 1 0 0 0 ...
##  $ restecg_1 : int  0 1 0 1 1 1 0 1 1 1 ...
##  $ restecg_2 : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ exang_No  : int  1 1 1 1 0 1 1 1 1 1 ...
##  $ exang_Yes : int  0 0 0 0 1 0 0 0 0 0 ...
##  $ slope_0   : int  1 1 0 0 0 0 0 0 0 0 ...
##  $ slope_1   : int  0 0 0 0 0 1 1 0 0 0 ...
##  $ slope_2   : int  0 0 1 1 1 0 0 1 1 1 ...
##  $ ca_0      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ ca_1      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ ca_2      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ ca_3      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ ca_4      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ thal_0    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ thal_1    : int  1 0 0 0 0 1 0 0 0 0 ...
##  $ thal_2    : int  0 1 1 1 1 0 1 0 0 1 ...
##  $ thal_3    : int  0 0 0 0 0 0 0 1 1 0 ...

For the two-level variables, drop one of the two dummy columns (it is redundant), then scale the numeric data.

heart_knn <- heart_knn %>% 
  select(-c(sex_Female , fbs_False , exang_No)) %>% 
  mutate_if(is.numeric, scale)
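
One caveat: scaling the full dataset before splitting lets the test rows influence the centers and scales. A leakage-free sketch (the heart_dummy, train_sc, and test_sc names are illustrative) scales the test set with training-set parameters instead:

# Sketch: scale the test set with parameters estimated on the training set only
heart_dummy <- dummy_cols(heart,
                          select_columns = c("sex","cp","fbs","restecg","exang","slope","ca","thal"),
                          remove_selected_columns = TRUE) %>% 
  select(-c(sex_Female, fbs_False, exang_No))
num_cols <- setdiff(names(heart_dummy), "target")
centers  <- colMeans(heart_dummy[idx, num_cols])
sds      <- apply(heart_dummy[idx, num_cols], 2, sd)
train_sc <- scale(heart_dummy[idx,  num_cols], center = centers, scale = sds)
test_sc  <- scale(heart_dummy[-idx, num_cols], center = centers, scale = sds)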

Cross Validation

We split the dummy data with the same train/test indices (idx) used earlier; column 6 is the target variable.

set.seed(321)
train_y <- heart_knn[idx , 6]
test_y <- heart_knn[-idx , 6]

train_x <- heart_knn[idx , -6]
test_x <- heart_knn[-idx , -6]

Choose number of k

sqrt(nrow(train_x))
## [1] 15.55635
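
A common rule of thumb puts k near the square root of the training-set size (≈ 15.6 here), and an odd k avoids tie votes in binary classification, so we take k = 15. A quick sketch to sanity-check nearby odd values (illustrative; a proper search would cross-validate on the training data rather than score the test set):

# Test-set accuracy for a few odd k values around sqrt(n)
for (k in seq(9, 21, by = 2)) {
  pred_k <- knn(train = train_x, test = test_x, cl = train_y, k = k)
  cat("k =", k, " accuracy =", round(mean(pred_k == test_y), 4), "\n")
}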

Prediction

pred_knn <- knn(train = train_x, 
                 test = test_x, 
                 cl = train_y, 
                 k = 15)

Model Evaluation

conf_knn <- confusionMatrix(as.factor(pred_knn), 
                            reference = as.factor(test_y), 
                            positive = "Not Health")
conf_knn
## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   Health Not Health
##   Health         21          3
##   Not Health      7         30
##                                           
##                Accuracy : 0.8361          
##                  95% CI : (0.7191, 0.9185)
##     No Information Rate : 0.541           
##     P-Value [Acc > NIR] : 0.000001184     
##                                           
##                   Kappa : 0.6663          
##                                           
##  Mcnemar's Test P-Value : 0.3428          
##                                           
##             Sensitivity : 0.9091          
##             Specificity : 0.7500          
##          Pos Pred Value : 0.8108          
##          Neg Pred Value : 0.8750          
##              Prevalence : 0.5410          
##          Detection Rate : 0.4918          
##    Detection Prevalence : 0.6066          
##       Balanced Accuracy : 0.8295          
##                                           
##        'Positive' Class : Not Health      
## 

Interpretation

From the confusion matrix:

  • Out of the 61 patients in our test set, we classified 51 correctly (Accuracy = 83.61%)
  • Out of the 33 patients who actually have heart disease, we classified 30 correctly (Recall = 90.91%)
  • Out of the 28 actually healthy patients, we classified 21 correctly (Specificity = 75%)
  • Out of the 37 patients predicted to have heart disease, 30 actually do (Precision = 81.08%)

Model Comparison

# data_frame() is deprecated; tibble() is the current equivalent
eval_logistic <- tibble(Accuracy = conf_log$overall[1],
                        Recall = conf_log$byClass[1],
                        Specificity = conf_log$byClass[2],
                        Precision = conf_log$byClass[3])

eval_knn <- tibble(Accuracy = conf_knn$overall[1],
                   Recall = conf_knn$byClass[1],
                   Specificity = conf_knn$byClass[2],
                   Precision = conf_knn$byClass[3])

Model Evaluation Logistic Regression

eval_logistic

Model Evaluation k-Nearest Neighbors

eval_knn
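
For a side-by-side view of both models:

# Stack the two evaluation tables with a model label
bind_rows("Logistic Regression" = eval_logistic,
          "k-NN" = eval_knn,
          .id = "Model")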

Comparing the two methods, logistic regression predicts heart disease patients better: its precision (81.58%) is higher than k-NN's (81.08%), and it also wins on accuracy (85.25% vs 83.61%) and recall (93.94% vs 90.91%).

Conclusion

From the results of the analysis, a doctor or nurse would want to minimize the number of healthy patients who are wrongly predicted to have heart disease, because treatment intended for heart disease patients could endanger a patient who is not sick. That makes precision the metric to focus on, and by that metric the logistic regression model (precision = 81.58%) is the better choice.