1. Case Description

Brief Introduction

Hi, my name is Ade Anggi Naluriawan Santoso. Nice to meet you! This is my fourth RPubs document, written as part of the Learn By Building assignment at Algoritma Data Science School.

Case Description

Cardiovascular diseases (CVDs) are the number 1 cause of death globally, taking an estimated 17.9 million lives each year, which accounts for 31% of all deaths worldwide. Four out of five CVD deaths are due to heart attacks and strokes, and one-third of these deaths occur prematurely in people under 70 years of age. Heart failure is a common event caused by CVDs, and this dataset contains 11 features that can be used to predict possible heart disease.

People with cardiovascular disease or who are at high cardiovascular risk (due to the presence of one or more risk factors such as hypertension, diabetes, hyperlipidaemia or already established disease) need early detection and management wherein a machine learning model can be of great help.

Attribute Information

  1. Age: age of the patient [years]
  2. Sex: sex of the patient [M: Male, F: Female]
  3. ChestPainType: chest pain type [TA: Typical Angina, ATA: Atypical Angina, NAP: Non-Anginal Pain, ASY: Asymptomatic]
  4. RestingBP: resting blood pressure [mm Hg]
  5. Cholesterol: serum cholesterol [mg/dl]
  6. FastingBS: fasting blood sugar [1: if FastingBS > 120 mg/dl, 0: otherwise]
  7. RestingECG: resting electrocardiogram results [Normal: Normal, ST: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV), LVH: showing probable or definite left ventricular hypertrophy by Estes’ criteria]
  8. MaxHR: maximum heart rate achieved [Numeric value between 60 and 202]
  9. ExerciseAngina: exercise-induced angina [Y: Yes, N: No]
  10. Oldpeak: ST depression induced by exercise relative to rest [Numeric value]
  11. ST_Slope: the slope of the peak exercise ST segment [Up: upsloping, Flat: flat, Down: downsloping]
  12. HeartDisease: output class [1: heart disease, 0: Normal]

Objectives

  • The choice of target variable, which depends on the perspective taken on the case.
  • Data analysis and the process of selecting predictor variables (feature selection).
  • Pre-processing of the data, from data cleansing to cross-validation.
  • An explanation of the evaluation of the model used: which metric is best for evaluating the model, and why.
  • Documentation of how the model’s performance can be improved (e.g., the process of selecting the optimum k in the KNN model) and/or a comparison of the logistic regression and KNN models.

2. Setup Environment

Before we start developing the analysis, let’s begin with the environment setup.

# clear-up the environment
rm(list = ls())

# chunk options
knitr::opts_chunk$set(
  message = FALSE,
  warning = FALSE,
  fig.align = "center",
  comment = "#>"
)

options(scipen=9999)

Once the environment setup is done, let’s load the libraries that will be used throughout the analysis.

library(dplyr)
library(gtools)
library(gmodels)
library(ggplot2)
library(class)
library(tidyr)
library(GGally)
library(caret)

3. Data Preprocessing

3.1 Data Input & Structure

Before we build classification models on our dataset, we need to understand what is inside it. Let’s start by reading heart.csv from our folder using the read.csv() function (converting chr columns to Factor), saving it into an object named heart, and checking the data structure and data types.

heart <- read.csv("heart.csv",stringsAsFactors = TRUE)
str(heart)
#> 'data.frame':    918 obs. of  12 variables:
#>  $ Age           : int  40 49 37 48 54 39 45 54 37 48 ...
#>  $ Sex           : Factor w/ 2 levels "F","M": 2 1 2 1 2 2 1 2 2 1 ...
#>  $ ChestPainType : Factor w/ 4 levels "ASY","ATA","NAP",..: 2 3 2 1 3 3 2 2 1 2 ...
#>  $ RestingBP     : int  140 160 130 138 150 120 130 110 140 120 ...
#>  $ Cholesterol   : int  289 180 283 214 195 339 237 208 207 284 ...
#>  $ FastingBS     : int  0 0 0 0 0 0 0 0 0 0 ...
#>  $ RestingECG    : Factor w/ 3 levels "LVH","Normal",..: 2 2 3 2 2 2 2 2 2 2 ...
#>  $ MaxHR         : int  172 156 98 108 122 170 170 142 130 120 ...
#>  $ ExerciseAngina: Factor w/ 2 levels "N","Y": 1 1 1 2 1 1 1 1 2 1 ...
#>  $ Oldpeak       : num  0 1 0 1.5 0 0 0 0 1.5 0 ...
#>  $ ST_Slope      : Factor w/ 3 levels "Down","Flat",..: 3 2 3 2 3 3 3 3 2 3 ...
#>  $ HeartDisease  : int  0 1 0 1 0 0 0 0 1 0 ...

Based on the output above, we know that our data has 918 rows and 12 columns. On the other hand, we still need to change the int columns to Factor.

heart <- heart %>% 
  mutate_if(is.integer, as.factor)
glimpse(heart)
#> Rows: 918
#> Columns: 12
#> $ Age            <fct> 40, 49, 37, 48, 54, 39, 45, 54, 37, 48, 37, 58, 39, 49,…
#> $ Sex            <fct> M, F, M, F, M, M, F, M, M, F, F, M, M, M, F, F, M, F, M…
#> $ ChestPainType  <fct> ATA, NAP, ATA, ASY, NAP, NAP, ATA, ATA, ASY, ATA, NAP, …
#> $ RestingBP      <fct> 140, 160, 130, 138, 150, 120, 130, 110, 140, 120, 130, …
#> $ Cholesterol    <fct> 289, 180, 283, 214, 195, 339, 237, 208, 207, 284, 211, …
#> $ FastingBS      <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
#> $ RestingECG     <fct> Normal, Normal, ST, Normal, Normal, Normal, Normal, Nor…
#> $ MaxHR          <fct> 172, 156, 98, 108, 122, 170, 170, 142, 130, 120, 142, 9…
#> $ ExerciseAngina <fct> N, N, N, Y, N, N, N, N, Y, N, N, Y, N, Y, N, N, N, N, N…
#> $ Oldpeak        <dbl> 0.0, 1.0, 0.0, 1.5, 0.0, 0.0, 0.0, 0.0, 1.5, 0.0, 0.0, …
#> $ ST_Slope       <fct> Up, Flat, Up, Flat, Up, Up, Up, Up, Flat, Up, Up, Flat,…
#> $ HeartDisease   <fct> 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1…

Good, now our dataset has the correct data types for each column.

3.2 Missing Data & Practical Statistics

Even though the dataset already has the correct data types, we still need to check for missing data and apply the necessary treatment if any is found.

anyNA(heart)
#> [1] FALSE
colSums(is.na(heart))
#>            Age            Sex  ChestPainType      RestingBP    Cholesterol 
#>              0              0              0              0              0 
#>      FastingBS     RestingECG          MaxHR ExerciseAngina        Oldpeak 
#>              0              0              0              0              0 
#>       ST_Slope   HeartDisease 
#>              0              0

Based on the output above, the dataset does not contain any missing values, so no missing-value treatment is needed.

Next, let’s look at the summary statistics of our dataset.

summary(heart)
#>       Age      Sex     ChestPainType   RestingBP    Cholesterol  FastingBS
#>  54     : 51   F:193   ASY:496       120    :132   0      :172   0:704    
#>  58     : 42   M:725   ATA:173       130    :118   254    : 11   1:214    
#>  55     : 41           NAP:203       140    :107   220    : 10            
#>  56     : 38           TA : 46       110    : 58   223    : 10            
#>  57     : 38                         150    : 55   204    :  9            
#>  52     : 36                         160    : 50   211    :  9            
#>  (Other):672                         (Other):398   (Other):697            
#>   RestingECG      MaxHR     ExerciseAngina    Oldpeak        ST_Slope  
#>  LVH   :188   150    : 43   N:547          Min.   :-2.6000   Down: 63  
#>  Normal:552   140    : 41   Y:371          1st Qu.: 0.0000   Flat:460  
#>  ST    :178   120    : 36                  Median : 0.6000   Up  :395  
#>               130    : 33                  Mean   : 0.8874             
#>               160    : 25                  3rd Qu.: 1.5000             
#>               110    : 23                  Max.   : 6.2000             
#>               (Other):717                                              
#>  HeartDisease
#>  0:410       
#>  1:508       
#>              
#>              
#>              
#>              
#> 

We also plot a correlation heatmap as one consideration for feature selection during model development later.

# re-read the raw data so numeric columns stay numeric;
# ggcorr computes correlations for numeric variables only
heart1 <- read.csv("heart.csv",stringsAsFactors = TRUE)
ggcorr(heart1, label = T, label_size = 2.9, hjust = 1, layout.exp = 3)+
  labs(title = 'Heart Failure Dataset Heatmap Correlation')+
  theme(plot.title = element_text(hjust = 0.5))

3.3 Splitting Data

It is important that the class proportions are balanced so that the model can predict both classes well. Let’s check the current proportion of our target class.

prop.table(table(heart$HeartDisease))
#> 
#>         0         1 
#> 0.4466231 0.5533769
table(heart$HeartDisease)
#> 
#>   0   1 
#> 410 508

Looks like our dataset has a reasonably balanced class proportion, which means we don’t need an additional preprocessing step to balance the classes and can proceed to splitting the dataset. The goal is to use the train data for modeling, while the test data serves to evaluate the model we have created when faced with unseen data. We use an 80:20 train/test split for this exercise.

RNGkind(sample.kind = "Rounding")
set.seed(123)
# draw 80% of the row indices for the training set
index <- sample(x = nrow(heart), size = nrow(heart)*0.8)
heart_train <- heart[index , ]
heart_test <-  heart[-index , ]
heart$HeartDisease %>% levels()
#> [1] "0" "1"

4. Classification Model

4.1 Logistic Regression

Our first exercise with Logistic Regression is to build a model based on the correlation heatmap. From the heatmap, we select the two parameters with the highest correlation to the target (Oldpeak, FastingBS) plus the other categorical parameters (Sex, ChestPainType, RestingECG, ExerciseAngina, ST_Slope).

# Model Base
model_base <- glm(formula = HeartDisease ~ Oldpeak + FastingBS + Sex + ChestPainType + RestingECG + ExerciseAngina + ST_Slope,
                  data = heart_train,
                  family = "binomial")
summary(model_base)
#> 
#> Call:
#> glm(formula = HeartDisease ~ Oldpeak + FastingBS + Sex + ChestPainType + 
#>     RestingECG + ExerciseAngina + ST_Slope, family = "binomial", 
#>     data = heart_train)
#> 
#> Deviance Residuals: 
#>     Min       1Q   Median       3Q      Max  
#> -2.6898  -0.3836   0.1858   0.4504   2.9212  
#> 
#> Coefficients:
#>                  Estimate Std. Error z value       Pr(>|z|)    
#> (Intercept)      -1.26870    0.61893  -2.050        0.04038 *  
#> Oldpeak           0.37866    0.12562   3.014        0.00257 ** 
#> FastingBS1        1.46473    0.30213   4.848 0.000001247576 ***
#> SexM              1.77747    0.30343   5.858 0.000000004687 ***
#> ChestPainTypeATA -1.96598    0.34364  -5.721 0.000000010590 ***
#> ChestPainTypeNAP -1.81816    0.28856  -6.301 0.000000000296 ***
#> ChestPainTypeTA  -1.45144    0.47286  -3.069        0.00214 ** 
#> RestingECGNormal -0.00692    0.28777  -0.024        0.98081    
#> RestingECGST      0.09025    0.37108   0.243        0.80785    
#> ExerciseAnginaY   0.72350    0.26192   2.762        0.00574 ** 
#> ST_SlopeFlat      1.22205    0.49118   2.488        0.01285 *  
#> ST_SlopeUp       -1.25607    0.50987  -2.464        0.01376 *  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 1010.47  on 733  degrees of freedom
#> Residual deviance:  491.35  on 722  degrees of freedom
#> AIC: 515.35
#> 
#> Number of Fisher Scoring iterations: 5

Our second model is developed using the stepwise backward-elimination method. We build a model with all available parameters in the dataset and then tune it with backward stepwise selection. We store the result as model_step.

# model step-wise backward
model_all <- glm(formula = HeartDisease ~ ., 
                   data = heart_train, 
                   family = "binomial")
model_step <- step(object = model_all, 
                    direction = "backward", 
                    trace = F)
summary(model_step)
#> 
#> Call:
#> glm(formula = HeartDisease ~ Sex + ChestPainType + FastingBS + 
#>     ExerciseAngina + Oldpeak + ST_Slope, family = "binomial", 
#>     data = heart_train)
#> 
#> Deviance Residuals: 
#>     Min       1Q   Median       3Q      Max  
#> -2.6947  -0.3871   0.1888   0.4486   2.9505  
#> 
#> Coefficients:
#>                  Estimate Std. Error z value       Pr(>|z|)    
#> (Intercept)       -1.2586     0.5943  -2.118        0.03419 *  
#> SexM               1.7864     0.3015   5.926 0.000000003112 ***
#> ChestPainTypeATA  -1.9686     0.3429  -5.741 0.000000009388 ***
#> ChestPainTypeNAP  -1.8196     0.2878  -6.323 0.000000000256 ***
#> ChestPainTypeTA   -1.4576     0.4699  -3.102        0.00192 ** 
#> FastingBS1         1.4738     0.3005   4.905 0.000000935445 ***
#> ExerciseAnginaY    0.7318     0.2598   2.817        0.00485 ** 
#> Oldpeak            0.3768     0.1248   3.020        0.00253 ** 
#> ST_SlopeFlat       1.2137     0.4902   2.476        0.01330 *  
#> ST_SlopeUp        -1.2616     0.5094  -2.476        0.01327 *  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 1010.47  on 733  degrees of freedom
#> Residual deviance:  491.45  on 724  degrees of freedom
#> AIC: 511.45
#> 
#> Number of Fisher Scoring iterations: 5

When we compare both models, model_step looks more efficient than model_base because it uses fewer parameters while achieving a slightly lower AIC (511.45 vs 515.35). We will compare their performance based on classification metrics later.
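
A quick way to quantify this efficiency (a supplementary check using base R’s AIC() function): a lower AIC indicates a comparable fit achieved with fewer parameters, which favors model_step here.

# lower AIC favors model_step (511.45 vs 515.35, per the summaries above)
AIC(model_base, model_step)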

4.2 K-Nearest Neighbour

In order to develop a K-Nearest Neighbour (KNN) model, we need to build dummy (one-hot) variables based on the parameters used in model_step.

dmy <- dummyVars(" ~HeartDisease+Sex+ChestPainType+FastingBS+ExerciseAngina+Oldpeak+ST_Slope", data = heart)
dmy <- data.frame(predict(dmy, newdata = heart))
str(dmy)
#> 'data.frame':    918 obs. of  16 variables:
#>  $ HeartDisease.0   : num  1 0 1 0 1 1 1 1 0 1 ...
#>  $ HeartDisease.1   : num  0 1 0 1 0 0 0 0 1 0 ...
#>  $ Sex.F            : num  0 1 0 1 0 0 1 0 0 1 ...
#>  $ Sex.M            : num  1 0 1 0 1 1 0 1 1 0 ...
#>  $ ChestPainType.ASY: num  0 0 0 1 0 0 0 0 1 0 ...
#>  $ ChestPainType.ATA: num  1 0 1 0 0 0 1 1 0 1 ...
#>  $ ChestPainType.NAP: num  0 1 0 0 1 1 0 0 0 0 ...
#>  $ ChestPainType.TA : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ FastingBS.0      : num  1 1 1 1 1 1 1 1 1 1 ...
#>  $ FastingBS.1      : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ ExerciseAngina.N : num  1 1 1 0 1 1 1 1 0 1 ...
#>  $ ExerciseAngina.Y : num  0 0 0 1 0 0 0 0 1 0 ...
#>  $ Oldpeak          : num  0 1 0 1.5 0 0 0 0 1.5 0 ...
#>  $ ST_Slope.Down    : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ ST_Slope.Flat    : num  0 1 0 1 0 0 0 0 1 0 ...
#>  $ ST_Slope.Up      : num  1 0 1 0 1 1 1 1 0 1 ...

After that, we delete the redundant dummy variables: for each two-level variable, one dummy is completely determined by the other, so only one needs to be kept (we also drop HeartDisease.0, the complement of our label).

dmy$HeartDisease.0 <- NULL
dmy$Sex.F <- NULL
dmy$FastingBS.0 <- NULL
dmy$ExerciseAngina.N <- NULL
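
As an alternative worth noting (caret supports this directly), dummyVars() can drop the reference level of every factor via its fullRank argument. Note that this would also drop the first level of multi-level factors such as ChestPainType.ASY and ST_Slope.Down, giving a standard full-rank encoding that differs slightly from the manual deletions above; the object name dmy_fr is ours.

# hedged alternative: fullRank = TRUE drops one level from every factor
dmy_fr <- dummyVars(~ HeartDisease + Sex + ChestPainType + FastingBS +
                      ExerciseAngina + Oldpeak + ST_Slope,
                    data = heart, fullRank = TRUE)
head(data.frame(predict(dmy_fr, newdata = heart)))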

Let’s check the names of the available parameters.

names(dmy)
#>  [1] "HeartDisease.1"    "Sex.M"             "ChestPainType.ASY"
#>  [4] "ChestPainType.ATA" "ChestPainType.NAP" "ChestPainType.TA" 
#>  [7] "FastingBS.1"       "ExerciseAngina.Y"  "Oldpeak"          
#> [10] "ST_Slope.Down"     "ST_Slope.Flat"     "ST_Slope.Up"

Based on the output above, we have 11 input parameters and 1 output parameter. We will use this dataset for our KNN model and split it using the same index as before.

set.seed(123)
# reuse the same index so the KNN split matches the earlier train/test split
dmy_train <- dmy[index,2:12]
dmy_test <- dmy[-index,2:12]

dmy_train_label <- dmy[index,1]
dmy_test_label <- dmy[-index,1]
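
One possible refinement to keep in mind: KNN is distance-based, and Oldpeak ranges from about -2.6 to 6.2 while every other predictor here is a 0/1 dummy, so Oldpeak can dominate the distance calculation. Below is a minimal sketch of z-scoring it with training-set statistics only (the names dmy_train_sc and dmy_test_sc are ours; we keep the unscaled data in what follows so the reported results still match).

# optional sketch: scale Oldpeak using training-set mean/sd so the
# test set never leaks into the preprocessing
dmy_train_sc <- dmy_train
dmy_test_sc  <- dmy_test
oldpeak_mean <- mean(dmy_train$Oldpeak)
oldpeak_sd   <- sd(dmy_train$Oldpeak)
dmy_train_sc$Oldpeak <- (dmy_train$Oldpeak - oldpeak_mean) / oldpeak_sd
dmy_test_sc$Oldpeak  <- (dmy_test$Oldpeak  - oldpeak_mean) / oldpeak_sd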

For the KNN model, we need to choose the number of neighbours k; a common rule of thumb is the square root of the number of observations.

round(sqrt(nrow(heart)),0)
#> [1] 30

According to the output, the rule of thumb suggests k = 30. Since our target has two classes, an odd k (e.g., 29 or 31) would avoid voting ties, but we proceed with k = 30 here. Now we can develop the model using this value.

pred_knn <- class::knn(train = dmy_train,
                       test = dmy_test, 
                       cl = dmy_train_label, 
                       k = 30)
head(pred_knn)
#> [1] 0 1 0 1 1 0
#> Levels: 0 1
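
The square-root heuristic is only a starting point. Since the objectives ask us to document how the optimum k could be selected, here is a hedged sketch (the names k_grid and acc are ours) that scans odd values of k and reports the best test accuracy; odd k avoids voting ties with a binary target.

# illustrative sketch: evaluate test accuracy across odd values of k
set.seed(123)
k_grid <- seq(3, 51, by = 2)
acc <- sapply(k_grid, function(k) {
  pred <- class::knn(train = dmy_train, test = dmy_test,
                     cl = dmy_train_label, k = k)
  mean(as.character(pred) == as.character(dmy_test_label))
})
# k with the highest accuracy on the test set
data.frame(k = k_grid, accuracy = acc)[which.max(acc), ]

In practice, one would tune k with cross-validation on the training set rather than against the test set, so that the test data remains a fair measure of generalization.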

5. Model Evaluation

First, we evaluate the logistic regression models. We compare the odds ratios of all coefficients for both logistic regressions.

# Odds ratio all coefficients for model base
exp(model_base$coefficients) %>% 
  data.frame()
# Odds ratio all coefficients for model step
exp(model_step$coefficients) %>% 
  data.frame()

Based on the tables above, both models have similar odds ratios for each parameter used in the model. The first hypothesis is that both models will show similar metric performance.
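
As a worked reading of one odds ratio from model_step: the coefficient for SexM is 1.7864, so exp(1.7864) ≈ 5.97, meaning that, holding the other predictors constant, male patients have roughly six times the odds of heart disease compared to female patients.

# worked example: odds ratio for the SexM coefficient of model_step
exp(coef(model_step)["SexM"])  # ~ 5.97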

In order to know the real performance, we need to generate predictions from each model on the heart_test data.

model_base_pred <- predict(model_base, type = "response", newdata = heart_test)
model_step_pred <- predict(model_step, type = "response", newdata = heart_test)

Next, we visualize the predicted probabilities of both models.

heart_test$model_base_pred <- model_base_pred
heart_test$model_step_pred <- model_step_pred
ggplot(heart_test, aes(x=model_base_pred)) +
  geom_density(lwd=0.5) +
  labs(title = "Distribution of Probability Prediction Data Base Model") +
  theme_minimal()

ggplot(heart_test, aes(x=model_step_pred)) +
  geom_density(lwd=0.5) +
  labs(title = "Distribution of Probability Prediction Data Step Model") +
  theme_minimal()

From the graphs above, the predicted probabilities lean towards 1, which corresponds to Heart Disease.

Let’s move on to checking the overall metric performance of all models by calculating the confusion matrix. A confusion matrix is a table that visualizes and summarizes the performance of a classification algorithm by cross-tabulating predicted against actual classes. We convert the predicted probabilities into classes using a 0.5 threshold.

#Confusion Matrix model_base
heart_test$model_base_pred <- factor(ifelse(heart_test$model_base_pred > 0.5, "1","0"))
confusionMatrix(heart_test$model_base_pred, heart_test$HeartDisease, positive = "1")
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction  0  1
#>          0 64  6
#>          1 15 99
#>                                               
#>                Accuracy : 0.8859              
#>                  95% CI : (0.8308, 0.9279)    
#>     No Information Rate : 0.5707              
#>     P-Value [Acc > NIR] : < 0.0000000000000002
#>                                               
#>                   Kappa : 0.7638              
#>                                               
#>  Mcnemar's Test P-Value : 0.08086             
#>                                               
#>             Sensitivity : 0.9429              
#>             Specificity : 0.8101              
#>          Pos Pred Value : 0.8684              
#>          Neg Pred Value : 0.9143              
#>              Prevalence : 0.5707              
#>          Detection Rate : 0.5380              
#>    Detection Prevalence : 0.6196              
#>       Balanced Accuracy : 0.8765              
#>                                               
#>        'Positive' Class : 1                   
#> 
#Confusion Matrix model_step
heart_test$model_step_pred <- factor(ifelse(heart_test$model_step_pred > 0.5, "1","0"))
confusionMatrix(heart_test$model_step_pred, heart_test$HeartDisease, positive = "1")
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction  0  1
#>          0 64  6
#>          1 15 99
#>                                               
#>                Accuracy : 0.8859              
#>                  95% CI : (0.8308, 0.9279)    
#>     No Information Rate : 0.5707              
#>     P-Value [Acc > NIR] : < 0.0000000000000002
#>                                               
#>                   Kappa : 0.7638              
#>                                               
#>  Mcnemar's Test P-Value : 0.08086             
#>                                               
#>             Sensitivity : 0.9429              
#>             Specificity : 0.8101              
#>          Pos Pred Value : 0.8684              
#>          Neg Pred Value : 0.9143              
#>              Prevalence : 0.5707              
#>          Detection Rate : 0.5380              
#>    Detection Prevalence : 0.6196              
#>       Balanced Accuracy : 0.8765              
#>                                               
#>        'Positive' Class : 1                   
#> 
#Confusion Matrix model_knn
confusionMatrix(pred_knn, heart_test$HeartDisease, positive = "1")
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction  0  1
#>          0 66  9
#>          1 13 96
#>                                              
#>                Accuracy : 0.8804             
#>                  95% CI : (0.8246, 0.9235)   
#>     No Information Rate : 0.5707             
#>     P-Value [Acc > NIR] : <0.0000000000000002
#>                                              
#>                   Kappa : 0.7545             
#>                                              
#>  Mcnemar's Test P-Value : 0.5224             
#>                                              
#>             Sensitivity : 0.9143             
#>             Specificity : 0.8354             
#>          Pos Pred Value : 0.8807             
#>          Neg Pred Value : 0.8800             
#>              Prevalence : 0.5707             
#>          Detection Rate : 0.5217             
#>    Detection Prevalence : 0.5924             
#>       Balanced Accuracy : 0.8749             
#>                                              
#>        'Positive' Class : 1                  
#> 

Accuracy is one metric for evaluating classification models. Informally, accuracy is the fraction of predictions our model got right:

\[Accuracy = \frac{TP + TN}{TP + FP + TN + FN}\]

Pos Pred Value / Precision is the ability of a classification model to identify only the relevant data points:

\[Precision = \frac{TP}{TP + FP}\]

Sensitivity / Recall is the ability of a classification model to identify all data points in a relevant class:

\[Recall = \frac{TP}{TP + FN}\]

Here TP = true positive, TN = true negative, FP = false positive, and FN = false negative.
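
As a quick check on these definitions, we can recompute the model_base metrics by hand from its confusion matrix above (TP = 99, TN = 64, FP = 15, FN = 6):

# recompute model_base metrics from the confusion-matrix counts
TP <- 99; TN <- 64; FP <- 15; FN <- 6
(TP + TN) / (TP + TN + FP + FN)  # accuracy  = 0.8859
TP / (TP + FP)                   # precision = 0.8684
TP / (TP + FN)                   # recall    = 0.9429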

From the confusion matrices, we already know the value of each metric; we now compile them into a data frame so we can easily compare the models’ performance.

Accuracy <- c(0.8859, 0.8859, 0.8804)
Precision <- c(0.8684, 0.8684, 0.8807)
Recall <- c(0.9429, 0.9429, 0.9143)
Name <- c('model_base', 'model_step', 'model_knn')
metric_df <- data.frame(Name,Accuracy,Precision,Recall)
metric_df

Based on the table above, model_base and model_step have exactly the same metric values, but model_step is more efficient given its smaller number of input parameters. In terms of Precision, model_knn performs better than the other models, while in Accuracy and Recall, model_base and model_step outperform model_knn.

6. Conclusion

Looking at the objective of our case, we focus on the Precision metric because we want to minimize false positives, that is, healthy patients who are wrongly predicted to have heart disease. Imagine the consequences when a doctor falsely diagnoses a healthy patient. Based on this consideration, we select model_knn as our final model because it has the best Precision. We could also expand this research by exploring other classification algorithms.


A work by Ade Anggi N S