Hi My Name is Ade Anggi Naluriawan Santoso. Nice to meet you! This is my Fourth RPubs Document as part of Learn By Building Assignment at Algoritma Data Science School.
Cardiovascular diseases (CVDs) are the number 1 cause of death
globally, taking an estimated 17.9 million lives each year, which
accounts for 31% of all deaths worldwide. Four out of 5CVD deaths are
due to heart attacks and strokes, and one-third of these deaths occur
prematurely in people under 70 years of age. Heart failure is a common
event caused by CVDs and this dataset contains 11 features that can be
used to predict a possible heart disease.
People with cardiovascular disease or who are at high cardiovascular
risk (due to the presence of one or more risk factors such as
hypertension, diabetes, hyperlipidaemia or already established disease)
need early detection and management wherein a machine learning model can
be of great help.
Before we start develop the analysis, let’s start with environment setup.
# clear-up the environment
rm(list = ls())
# chunk options
knitr::opts_chunk$set(
message = FALSE,
warning = FALSE,
fig.align = "center",
comment = "#>"
)
options(scipen=9999)After we done with environment setup, let’s call several important library that will be used during analysis.
library(dplyr)
library(gtools)
library(gmodels)
library(ggplot2)
library(class)
library(tidyr)
library(GGally)
library(caret)Before we generate classification model of our dataset, we need to
understand what’s inside the dataset. Let’s start with reading
heart.csv in our folder using read.csv()
function, change chr to Factor, save it into
an object named heart, check the data structure and the
data types.
heart <- read.csv("heart.csv",stringsAsFactors = TRUE)
str(heart)#> 'data.frame': 918 obs. of 12 variables:
#> $ Age : int 40 49 37 48 54 39 45 54 37 48 ...
#> $ Sex : Factor w/ 2 levels "F","M": 2 1 2 1 2 2 1 2 2 1 ...
#> $ ChestPainType : Factor w/ 4 levels "ASY","ATA","NAP",..: 2 3 2 1 3 3 2 2 1 2 ...
#> $ RestingBP : int 140 160 130 138 150 120 130 110 140 120 ...
#> $ Cholesterol : int 289 180 283 214 195 339 237 208 207 284 ...
#> $ FastingBS : int 0 0 0 0 0 0 0 0 0 0 ...
#> $ RestingECG : Factor w/ 3 levels "LVH","Normal",..: 2 2 3 2 2 2 2 2 2 2 ...
#> $ MaxHR : int 172 156 98 108 122 170 170 142 130 120 ...
#> $ ExerciseAngina: Factor w/ 2 levels "N","Y": 1 1 1 2 1 1 1 1 2 1 ...
#> $ Oldpeak : num 0 1 0 1.5 0 0 0 0 1.5 0 ...
#> $ ST_Slope : Factor w/ 3 levels "Down","Flat",..: 3 2 3 2 3 3 3 3 2 3 ...
#> $ HeartDisease : int 0 1 0 1 0 0 0 0 1 0 ...
Based on the output above, we now that our data has dimension of
918 rows and 12 columns. In the other hand, we still
need to change int types to Factor.
heart <- heart %>%
mutate_if(is.integer, as.factor)
glimpse(heart)#> Rows: 918
#> Columns: 12
#> $ Age <fct> 40, 49, 37, 48, 54, 39, 45, 54, 37, 48, 37, 58, 39, 49,…
#> $ Sex <fct> M, F, M, F, M, M, F, M, M, F, F, M, M, M, F, F, M, F, M…
#> $ ChestPainType <fct> ATA, NAP, ATA, ASY, NAP, NAP, ATA, ATA, ASY, ATA, NAP, …
#> $ RestingBP <fct> 140, 160, 130, 138, 150, 120, 130, 110, 140, 120, 130, …
#> $ Cholesterol <fct> 289, 180, 283, 214, 195, 339, 237, 208, 207, 284, 211, …
#> $ FastingBS <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
#> $ RestingECG <fct> Normal, Normal, ST, Normal, Normal, Normal, Normal, Nor…
#> $ MaxHR <fct> 172, 156, 98, 108, 122, 170, 170, 142, 130, 120, 142, 9…
#> $ ExerciseAngina <fct> N, N, N, Y, N, N, N, N, Y, N, N, Y, N, Y, N, N, N, N, N…
#> $ Oldpeak <dbl> 0.0, 1.0, 0.0, 1.5, 0.0, 0.0, 0.0, 0.0, 1.5, 0.0, 0.0, …
#> $ ST_Slope <fct> Up, Flat, Up, Flat, Up, Up, Up, Up, Flat, Up, Up, Flat,…
#> $ HeartDisease <fct> 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1…
Good, now our dataset has the correct data types for each column.
Even though the dataset already has the correct data types, we still need to check for any missing data and conduct necessary data treatment for those missing data.
anyNA(heart)#> [1] FALSE
colSums(is.na(heart))#> Age Sex ChestPainType RestingBP Cholesterol
#> 0 0 0 0 0
#> FastingBS RestingECG MaxHR ExerciseAngina Oldpeak
#> 0 0 0 0 0
#> ST_Slope HeartDisease
#> 0 0
Based on the output above, the dataset doesn’t have any missing
values. That means we don’t need to do any missing value data treatment
on the dataset.
Next, we would like to check statistic analysis of our dataset.
summary(heart)#> Age Sex ChestPainType RestingBP Cholesterol FastingBS
#> 54 : 51 F:193 ASY:496 120 :132 0 :172 0:704
#> 58 : 42 M:725 ATA:173 130 :118 254 : 11 1:214
#> 55 : 41 NAP:203 140 :107 220 : 10
#> 56 : 38 TA : 46 110 : 58 223 : 10
#> 57 : 38 150 : 55 204 : 9
#> 52 : 36 160 : 50 211 : 9
#> (Other):672 (Other):398 (Other):697
#> RestingECG MaxHR ExerciseAngina Oldpeak ST_Slope
#> LVH :188 150 : 43 N:547 Min. :-2.6000 Down: 63
#> Normal:552 140 : 41 Y:371 1st Qu.: 0.0000 Flat:460
#> ST :178 120 : 36 Median : 0.6000 Up :395
#> 130 : 33 Mean : 0.8874
#> 160 : 25 3rd Qu.: 1.5000
#> 110 : 23 Max. : 6.2000
#> (Other):717
#> HeartDisease
#> 0:410
#> 1:508
#>
#>
#>
#>
#>
We also need to plot Heatmap Correlation as one of consideration during feature selection in model development later.
heart1 <- read.csv("heart.csv",stringsAsFactors = TRUE)
ggcorr(heart1, label = T, label_size = 2.9, hjust = 1, layout.exp = 3)+
labs(title = 'Heart Failure Dataset Heatmap Correlation')+
theme(plot.title = element_text(hjust = 0.5))It is important to balance the class proportion so that model can predict well in both classes. Let’s check current proportion of our class target.
prop.table(table(heart$HeartDisease))#>
#> 0 1
#> 0.4466231 0.5533769
table(heart$HeartDisease)#>
#> 0 1
#> 410 508
Looks like our dataset has balanced proportion, that means we don’t
need additional preprocessing step to balance our class proportion and
can proceed to splitting dataset.The goal is that we will use the train
data for modeling, while the test data will be used as testers for the
model we have created when faced with unseen data. In addition, this can
be used to see the ability of the model that we create in dealing with
unseen data. We use ratio train & test data 80:20 for this
exercise.
RNGkind(sample.kind = "Rounding")
set.seed(123)
index <- sample(x = nrow(heart), size = nrow(heart)*0.8)
heart_train <- heart[index , ]
heart_test <- heart[-index , ]
heart$HeartDisease %>% levels()#> [1] "0" "1"
Our first exercise on Logistic Regression is by build model based on heatmap correlation. According to Heatmap Correlation, we select top 2 parameters with highest correlation value (Oldpeak, FastingBS) and other string parameters(Sex, ChestPainType, RestingECG, ExerciseAngina, ST_Slope).
# Model Base
model_base <- glm(formula = HeartDisease ~ Oldpeak + FastingBS + Sex + ChestPainType + RestingECG + ExerciseAngina + ST_Slope,
data = heart_train,
family = "binomial")
summary(model_base)#>
#> Call:
#> glm(formula = HeartDisease ~ Oldpeak + FastingBS + Sex + ChestPainType +
#> RestingECG + ExerciseAngina + ST_Slope, family = "binomial",
#> data = heart_train)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -2.6898 -0.3836 0.1858 0.4504 2.9212
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -1.26870 0.61893 -2.050 0.04038 *
#> Oldpeak 0.37866 0.12562 3.014 0.00257 **
#> FastingBS1 1.46473 0.30213 4.848 0.000001247576 ***
#> SexM 1.77747 0.30343 5.858 0.000000004687 ***
#> ChestPainTypeATA -1.96598 0.34364 -5.721 0.000000010590 ***
#> ChestPainTypeNAP -1.81816 0.28856 -6.301 0.000000000296 ***
#> ChestPainTypeTA -1.45144 0.47286 -3.069 0.00214 **
#> RestingECGNormal -0.00692 0.28777 -0.024 0.98081
#> RestingECGST 0.09025 0.37108 0.243 0.80785
#> ExerciseAnginaY 0.72350 0.26192 2.762 0.00574 **
#> ST_SlopeFlat 1.22205 0.49118 2.488 0.01285 *
#> ST_SlopeUp -1.25607 0.50987 -2.464 0.01376 *
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 1010.47 on 733 degrees of freedom
#> Residual deviance: 491.35 on 722 degrees of freedom
#> AIC: 515.35
#>
#> Number of Fisher Scoring iterations: 5
Our second model is developed using step wise backward method. We
need to build model based on all parameters available in the dataset and
then we tune the model with step-wise backward. We store the model as
model_step.
# model step-wise backward
model_all <- glm(formula = HeartDisease ~ .,
data = heart_train,
family = "binomial")
model_step <- step(object = model_all,
direction = "backward",
trace = F)
summary(model_step)#>
#> Call:
#> glm(formula = HeartDisease ~ Sex + ChestPainType + FastingBS +
#> ExerciseAngina + Oldpeak + ST_Slope, family = "binomial",
#> data = heart_train)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -2.6947 -0.3871 0.1888 0.4486 2.9505
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -1.2586 0.5943 -2.118 0.03419 *
#> SexM 1.7864 0.3015 5.926 0.000000003112 ***
#> ChestPainTypeATA -1.9686 0.3429 -5.741 0.000000009388 ***
#> ChestPainTypeNAP -1.8196 0.2878 -6.323 0.000000000256 ***
#> ChestPainTypeTA -1.4576 0.4699 -3.102 0.00192 **
#> FastingBS1 1.4738 0.3005 4.905 0.000000935445 ***
#> ExerciseAnginaY 0.7318 0.2598 2.817 0.00485 **
#> Oldpeak 0.3768 0.1248 3.020 0.00253 **
#> ST_SlopeFlat 1.2137 0.4902 2.476 0.01330 *
#> ST_SlopeUp -1.2616 0.5094 -2.476 0.01327 *
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 1010.47 on 733 degrees of freedom
#> Residual deviance: 491.45 on 724 degrees of freedom
#> AIC: 511.45
#>
#> Number of Fisher Scoring iterations: 5
When we compare both model, model_step looks more
efficient than model_base because it uses less parameter.
We will compare the performance based on classification metrics
later.
In order to develop K-Nearest Neighbour (KNN), we requires to build
dummy variable based on parameters used in model_step
dmy <- dummyVars(" ~HeartDisease+Sex+ChestPainType+FastingBS+ExerciseAngina+Oldpeak+ST_Slope", data = heart)
dmy <- data.frame(predict(dmy, newdata = heart))
str(dmy)#> 'data.frame': 918 obs. of 16 variables:
#> $ HeartDisease.0 : num 1 0 1 0 1 1 1 1 0 1 ...
#> $ HeartDisease.1 : num 0 1 0 1 0 0 0 0 1 0 ...
#> $ Sex.F : num 0 1 0 1 0 0 1 0 0 1 ...
#> $ Sex.M : num 1 0 1 0 1 1 0 1 1 0 ...
#> $ ChestPainType.ASY: num 0 0 0 1 0 0 0 0 1 0 ...
#> $ ChestPainType.ATA: num 1 0 1 0 0 0 1 1 0 1 ...
#> $ ChestPainType.NAP: num 0 1 0 0 1 1 0 0 0 0 ...
#> $ ChestPainType.TA : num 0 0 0 0 0 0 0 0 0 0 ...
#> $ FastingBS.0 : num 1 1 1 1 1 1 1 1 1 1 ...
#> $ FastingBS.1 : num 0 0 0 0 0 0 0 0 0 0 ...
#> $ ExerciseAngina.N : num 1 1 1 0 1 1 1 1 0 1 ...
#> $ ExerciseAngina.Y : num 0 0 0 1 0 0 0 0 1 0 ...
#> $ Oldpeak : num 0 1 0 1.5 0 0 0 0 1.5 0 ...
#> $ ST_Slope.Down : num 0 0 0 0 0 0 0 0 0 0 ...
#> $ ST_Slope.Flat : num 0 1 0 1 0 0 0 0 1 0 ...
#> $ ST_Slope.Up : num 1 0 1 0 1 1 1 1 0 1 ...
After that, we need to delete dummy variable that has only 2 category.
dmy$HeartDisease.0 <- NULL
dmy$Sex.F <- NULL
dmy$FastingBS.0 <- NULL
dmy$ExerciseAngina.N <- NULLLet’s check the name of available parameters
names(dmy)#> [1] "HeartDisease.1" "Sex.M" "ChestPainType.ASY"
#> [4] "ChestPainType.ATA" "ChestPainType.NAP" "ChestPainType.TA"
#> [7] "FastingBS.1" "ExerciseAngina.Y" "Oldpeak"
#> [10] "ST_Slope.Down" "ST_Slope.Flat" "ST_Slope.Up"
Based on the output above, we have 11 input parameters and 1 output parameters. We will use this dataset for our KNN model and conduct splitting data.
set.seed(123)
dmy_train <- dmy[index,2:12]
dmy_test <- dmy[-index,2:12]
dmy_train_label <- dmy[index,1]
dmy_test_label <- dmy[-index,1]In KNN model, we need to find optimum class for our model by running this script.
round(sqrt(nrow(heart)),0)#> [1] 30
According to the output, the most optimum number of class is 30. Now we can develop the model by using the information above
pred_knn <- class::knn(train = dmy_train,
test = dmy_test,
cl = dmy_train_label,
k = 30)
head(pred_knn)#> [1] 0 1 0 1 1 0
#> Levels: 0 1
First we will talk about Logistic Regression model evaluation. we compare odds ratio all coefficients for both logistic regression.
# Odds ratio all coefficients for model base
exp(model_base$coefficients) %>%
data.frame()# Odds ratio all coefficients for model step
exp(model_step$coefficients) %>%
data.frame()Based on tables above, it looks like both model have similar odds
ration for each parameters used in the model. The first hypothesis is
both model potentially have similar metrics performance.
In order to know the real performance, we need to calculate prediction
based on model by using heart_test data.
model_base_pred <- predict(model_base, type = "response", newdata = heart_test)
model_step_pred <- predict(model_step, type = "response", newdata = heart_test)Next, we visualize both model performance in plot.
heart_test$model_base_pred <- model_base_pred
heart_test$model_step_pred <- model_step_pred
ggplot(heart_test, aes(x=model_base_pred)) +
geom_density(lwd=0.5) +
labs(title = "Distribution of Probability Prediction Data Base Model") +
theme_minimal()ggplot(heart_test, aes(x=model_step_pred)) +
geom_density(lwd=0.5) +
labs(title = "Distribution of Probability Prediction Data Step Model") +
theme_minimal()
In the graphs above, it can be interpreted that the prediction results
are more inclined towards 1, which means
Heart Disease.
Let’s move to check overall metrics performance in all models by
calculating the confusion matrix. A confusion matrix is a table that is
used to define the performance of a classification algorithm. A
confusion matrix visualizes and summarizes the performance of a
classification algorithm.
#Confusion Matrix model_base
heart_test$model_base_pred <- factor(ifelse(heart_test$model_base_pred > 0.5, "1","0"))
confusionMatrix(heart_test$model_base_pred, heart_test$HeartDisease, positive = "1")#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 64 6
#> 1 15 99
#>
#> Accuracy : 0.8859
#> 95% CI : (0.8308, 0.9279)
#> No Information Rate : 0.5707
#> P-Value [Acc > NIR] : < 0.0000000000000002
#>
#> Kappa : 0.7638
#>
#> Mcnemar's Test P-Value : 0.08086
#>
#> Sensitivity : 0.9429
#> Specificity : 0.8101
#> Pos Pred Value : 0.8684
#> Neg Pred Value : 0.9143
#> Prevalence : 0.5707
#> Detection Rate : 0.5380
#> Detection Prevalence : 0.6196
#> Balanced Accuracy : 0.8765
#>
#> 'Positive' Class : 1
#>
#Confusion Matrix model_step
heart_test$model_step_pred <- factor(ifelse(heart_test$model_step_pred > 0.5, "1","0"))
confusionMatrix(heart_test$model_step_pred, heart_test$HeartDisease, positive = "1")#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 64 6
#> 1 15 99
#>
#> Accuracy : 0.8859
#> 95% CI : (0.8308, 0.9279)
#> No Information Rate : 0.5707
#> P-Value [Acc > NIR] : < 0.0000000000000002
#>
#> Kappa : 0.7638
#>
#> Mcnemar's Test P-Value : 0.08086
#>
#> Sensitivity : 0.9429
#> Specificity : 0.8101
#> Pos Pred Value : 0.8684
#> Neg Pred Value : 0.9143
#> Prevalence : 0.5707
#> Detection Rate : 0.5380
#> Detection Prevalence : 0.6196
#> Balanced Accuracy : 0.8765
#>
#> 'Positive' Class : 1
#>
#Confusion Matrix model_knn
confusionMatrix(pred_knn, heart_test$HeartDisease, positive = "1")#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 66 9
#> 1 13 96
#>
#> Accuracy : 0.8804
#> 95% CI : (0.8246, 0.9235)
#> No Information Rate : 0.5707
#> P-Value [Acc > NIR] : <0.0000000000000002
#>
#> Kappa : 0.7545
#>
#> Mcnemar's Test P-Value : 0.5224
#>
#> Sensitivity : 0.9143
#> Specificity : 0.8354
#> Pos Pred Value : 0.8807
#> Neg Pred Value : 0.8800
#> Prevalence : 0.5707
#> Detection Rate : 0.5217
#> Detection Prevalence : 0.5924
#> Balanced Accuracy : 0.8749
#>
#> 'Positive' Class : 1
#>
Accuracy is one metric for evaluating classification
models. Informally, accuracy is the fraction of predictions our model
got right. Formally, accuracy has the following definition: \[Accuracy = \frac{TP + TN}{TP + FP + TN +
FN}\] Pos Pred Value / Precision is the ability
of a classification model to identify only the relevant data points.
Formally, precision has the following definition: \[Precision = \frac{TP}{TP + FP}\]
Sensitivity / Recall is the ability of a classification
model to identify all data points in a relevant class. Formally, recall
has the following definition: \[Recall =
\frac{TP}{TP + FN}\] TP : TRUE POSITIVE, TN : TRUE NEGATIVE, FP :
FALSE POSITIVE, FN : FALSE NEGATIVE.
From the confusion matrix, we already know value of each metrics and we
will compile into a dataframe table so we can easily compare each model
performances.
Accuracy <- c(0.8859, 0.8859, 0.8804)
Precision <- c(0.8684, 0.8684, 0.8807)
Recall <- c(0.9429, 0.9429, 0.9143)
Name <- c('model_base', 'model_step', 'model_knn')
metric_df <- data.frame(Name,Accuracy,Precision,Recall)
metric_dfBased on the table above, model_base &
model_step have exactly same metrics value but in terms of
efficiency model_step is more efficient based on number of
input parameters. In terms of Accuracy & Precision,
model_knn has better performance than other models. While
in Recall, model_base & model_step have
better performance than model_knn.
Looking at the objective of our case, we need to focus on Precision
metrics because we want to minimize false prediction for patient with
Heart Disease. Let’s imagine what happened when the doctor false
diagnose healthy patient. Based on this consideration, we tends to
select model_knn to represent our model because it has the
best precision metrics. We can also expand our research by search new
model by using other classification algorithms.
A work by Ade Anggi N S