1 Introduction
For the LBB (Learning by Building) project on Classification in Machine Learning 1, I am using the Kaggle Heart Failure Prediction Dataset. The purpose of this report is to build a model that performs well in predicting heart disease, since heart disease is often lethal and can strike suddenly. Predicting it early, before a late diagnosis, can save a patient's life.
List of libraries
library(tidyverse) #For data wrangling
library(GGally) #For visualization
library(MLmetrics) #For evaluation metric
library(ggplot2) #For visualization
library(rsample) #For splitting data
library(MASS) #For stepwise method
library(caret) #Confusion matrix
theme_set(theme_minimal())
2 Data Preparation
2.1 Import and Data Checking
heart <- read.csv("heart.csv") #Import data from local storage
head(heart) #Print the first 6 rows of dataset
Data column descriptions:
Age: age of the patient [years]
Sex: sex of the patient [M: Male, F: Female]
ChestPainType: chest pain type [TA: Typical Angina, ATA: Atypical Angina, NAP: Non-Anginal Pain, ASY: Asymptomatic]
RestingBP: resting blood pressure [mm Hg]
Cholesterol: serum cholesterol [mg/dl]
FastingBS: fasting blood sugar [1: if FastingBS > 120 mg/dl, 0: otherwise]
RestingECG: resting electrocardiogram results [Normal: Normal, ST: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV), LVH: showing probable or definite left ventricular hypertrophy by Estes’ criteria]
MaxHR: maximum heart rate achieved [numeric value between 60 and 202]
ExerciseAngina: exercise-induced angina [Y: Yes, N: No]
Oldpeak: oldpeak = ST [numeric value measured in depression]
ST_Slope: the slope of the peak exercise ST segment [Up: upsloping, Flat: flat, Down: downsloping]
HeartDisease: output class [1: heart disease, 0: Normal]
colSums(is.na(heart)) #checking missing / "NA" in each column.
#> Age Sex ChestPainType RestingBP Cholesterol
#> 0 0 0 0 0
#> FastingBS RestingECG MaxHR ExerciseAngina Oldpeak
#> 0 0 0 0 0
#> ST_Slope HeartDisease
#> 0 0
glimpse(heart) #glimpse dataset
#> Rows: 918
#> Columns: 12
#> $ Age <int> 40, 49, 37, 48, 54, 39, 45, 54, 37, 48, 37, 58, 39, 49,~
#> $ Sex <chr> "M", "F", "M", "F", "M", "M", "F", "M", "M", "F", "F", ~
#> $ ChestPainType <chr> "ATA", "NAP", "ATA", "ASY", "NAP", "NAP", "ATA", "ATA",~
#> $ RestingBP <int> 140, 160, 130, 138, 150, 120, 130, 110, 140, 120, 130, ~
#> $ Cholesterol <int> 289, 180, 283, 214, 195, 339, 237, 208, 207, 284, 211, ~
#> $ FastingBS <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
#> $ RestingECG <chr> "Normal", "Normal", "ST", "Normal", "Normal", "Normal",~
#> $ MaxHR <int> 172, 156, 98, 108, 122, 170, 170, 142, 130, 120, 142, 9~
#> $ ExerciseAngina <chr> "N", "N", "N", "Y", "N", "N", "N", "N", "Y", "N", "N", ~
#> $ Oldpeak <dbl> 0.0, 1.0, 0.0, 1.5, 0.0, 0.0, 0.0, 0.0, 1.5, 0.0, 0.0, ~
#> $ ST_Slope <chr> "Up", "Flat", "Up", "Flat", "Up", "Up", "Up", "Up", "Fl~
#> $ HeartDisease <int> 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1~
# Count the distinct values in each column.
# across() replaces summarise_all(funs(...)), since funs() is deprecated in dplyr.
heart %>% summarise(across(everything(), n_distinct))
Based on glimpse() and the distinct-value check for each column, we noted that Sex, ChestPainType, FastingBS, RestingECG, ExerciseAngina, ST_Slope and HeartDisease have only a few distinct values, so we will convert them to factors.
2.2 Clean Data
#To convert several columns data type into factor.
heart_clean <-
heart %>%
mutate(across(c(Sex, ChestPainType, FastingBS, RestingECG, ExerciseAngina, ST_Slope, HeartDisease), as.factor))
#checking data frame
glimpse(heart_clean)
#> Rows: 918
#> Columns: 12
#> $ Age <int> 40, 49, 37, 48, 54, 39, 45, 54, 37, 48, 37, 58, 39, 49,~
#> $ Sex <fct> M, F, M, F, M, M, F, M, M, F, F, M, M, M, F, F, M, F, M~
#> $ ChestPainType <fct> ATA, NAP, ATA, ASY, NAP, NAP, ATA, ATA, ASY, ATA, NAP, ~
#> $ RestingBP <int> 140, 160, 130, 138, 150, 120, 130, 110, 140, 120, 130, ~
#> $ Cholesterol <int> 289, 180, 283, 214, 195, 339, 237, 208, 207, 284, 211, ~
#> $ FastingBS <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
#> $ RestingECG <fct> Normal, Normal, ST, Normal, Normal, Normal, Normal, Nor~
#> $ MaxHR <int> 172, 156, 98, 108, 122, 170, 170, 142, 130, 120, 142, 9~
#> $ ExerciseAngina <fct> N, N, N, Y, N, N, N, N, Y, N, N, Y, N, Y, N, N, N, N, N~
#> $ Oldpeak <dbl> 0.0, 1.0, 0.0, 1.5, 0.0, 0.0, 0.0, 0.0, 1.5, 0.0, 0.0, ~
#> $ ST_Slope <fct> Up, Flat, Up, Flat, Up, Up, Up, Up, Flat, Up, Up, Flat,~
#> $ HeartDisease <fct> 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1~
The data types have been converted properly, so we can now use this dataset for analysis and modelling.
3 Exploratory Data Analysis
# Check the HeartDisease class proportion in the dataset
prop.table(table(heart_clean$HeartDisease))
#>
#> 0 1
#> 0.4466231 0.5533769
Based on the prop.table() output, the classes of our target variable HeartDisease are fairly balanced.
3.1 Barplot Categorical Predictor Variable
3.1.1 Heart Disease by Sex
heart_clean %>%
ggplot(aes(x = Sex, fill = HeartDisease))+
geom_bar(position = "stack")
Insight: Males are more likely to have HeartDisease than females.
3.1.2 Heart Disease by Chest Pain Type
heart_clean %>%
ggplot(aes(x = ChestPainType, fill = HeartDisease))+
geom_bar(position = "stack")
Insight: Most people with HeartDisease have the ASY (Asymptomatic) ChestPainType. This type is also called the Silent Killer, since most people are unaware of the symptoms in their body that lead to HeartDisease.
3.1.3 Heart Disease by Fasting Blood Sugar
heart_clean %>%
ggplot(aes(x = FastingBS, fill = HeartDisease))+
geom_bar(position = "stack")
Insight:
- 0 (fasting blood sugar of 120 mg/dl or below) is treated as normal, 1 (above 120 mg/dl) as elevated. The result is taken from a person who fasted before the test.
- As we can see, most people with blood sugar above 120 mg/dl have HeartDisease; high blood sugar can also lead to other problems such as diabetes.
- Judging from this plot alone, high blood sugar does not appear to be a major cause of HeartDisease.
3.1.4 Heart Disease by Resting Electrocardiogram
heart_clean %>%
ggplot(aes(x = RestingECG, fill = HeartDisease))+
geom_bar(position = "stack")
Insight: People with HeartDisease are still most likely to have a Normal result in the RestingECG test.
3.1.5 Heart Disease by ST Slope
heart_clean %>%
ggplot(aes(x = ST_Slope, fill = HeartDisease))+
geom_bar(position = "stack")
Insight: People with a Flat ST slope result tend to have a higher chance of HeartDisease.
Based on the visualizations of the categorical predictors, we will include Sex, ChestPainType and ST_Slope in our model.
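Stacked counts can hide differences in rates when the groups have different sizes. As a complement to the bar plots, here is a minimal sketch of a row-wise proportion table. The data frame `toy` and its values are hypothetical and are not taken from heart.csv.

```r
# Hypothetical toy data; the values are made up for illustration only.
toy <- data.frame(
  Sex          = factor(c("M", "M", "M", "M", "F", "F", "F", "F")),
  HeartDisease = factor(c(1, 1, 1, 0, 1, 0, 0, 0))
)

# Count each Sex / HeartDisease combination.
counts <- table(toy$Sex, toy$HeartDisease)

# margin = 1 makes every row sum to 1, so each cell is the share
# of that Sex with (1) or without (0) HeartDisease.
rates <- prop.table(counts, margin = 1)
print(rates)
```

On the report's data, the same idea would be prop.table(table(heart_clean$Sex, heart_clean$HeartDisease), margin = 1), and likewise for the other categorical predictors.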
3.2 Boxplot Numeric Predictor Variable
3.2.1 Age
heart_clean %>%
ggplot(aes(x = HeartDisease, y = Age, fill = HeartDisease))+
geom_boxplot(alpha = 0.5, show.legend = F)+
geom_jitter(alpha = 0.8, show.legend = F)+
ggtitle("Heart Disease by Age")+
theme(plot.title = element_text(hjust = 0.5))
Insight: From the plot above, we noted that HeartDisease can occur across the whole range of Age.
3.2.2 Cholesterol
heart_clean %>%
ggplot(aes(x = HeartDisease, y = Cholesterol, fill = HeartDisease))+
geom_boxplot(alpha = 0.5, show.legend = F)+
geom_jitter(alpha = 0.8, show.legend = F)+
ggtitle("Heart Disease by Cholesterol")+
theme(plot.title = element_text(hjust = 0.5))
3.2.3 RestingBP
heart_clean %>%
ggplot(aes(x = HeartDisease, y = RestingBP, fill = HeartDisease))+
geom_boxplot(alpha = 0.5, show.legend = F)+
geom_jitter(alpha = 0.8, show.legend = F)+
ggtitle("Heart Disease by RestingBP")+
theme(plot.title = element_text(hjust = 0.5))
3.2.4 MaxHR
heart_clean %>%
ggplot(aes(x = HeartDisease, y = MaxHR, fill = HeartDisease))+
geom_boxplot(alpha = 0.5, show.legend = F)+
geom_jitter(alpha = 0.8, show.legend = F)+
ggtitle("Heart Disease by MaxHR")+
theme(plot.title = element_text(hjust = 0.5))
Based on the boxplots of the numeric predictors, we noted zero values in Cholesterol and RestingBP. A zero is not a plausible measurement for either variable, so we treat it as a missing value. We will try not to delete the rows with zero Cholesterol, and instead look at its correlation with Age, since according to this article, “As we get older, cholesterol levels rise”. Before modifying the Cholesterol column, we will delete the rows with zero RestingBP, since only a few rows contain a zero.
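The number of zero values per column can also be counted directly rather than read off a boxplot. A minimal sketch, assuming a small made-up data frame (`toy` is hypothetical and is not the report's data):

```r
# Hypothetical toy data; a 0 stands for a value that was not recorded.
toy <- data.frame(
  Cholesterol = c(289, 0, 283, 0, 195),
  RestingBP   = c(140, 160, 0, 138, 150)
)

# Comparing the data frame with 0 gives a logical matrix;
# colSums() then counts the TRUE values in each column.
zero_counts <- colSums(toy == 0)
print(zero_counts)
```

Applied to heart_clean, the same call restricted to the Cholesterol and RestingBP columns would show how many rows each decision affects.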
3.3 Data Modification
# Remove the rows where RestingBP is zero using filter().
heart_clean <-
heart_clean %>%
filter(RestingBP != 0)
cat(
nrow(heart),# full row
nrow(heart_clean),#after filter
sep = "\n")
#> 918
#> 917
Only 1 row was deleted from the heart_clean dataset.
heart_clean %>%
filter(Cholesterol > 0) %>%
ggplot(aes(x = Age, y = Cholesterol, color = HeartDisease))+
geom_point()+
ggtitle("Cholesterol correlation with Age")+
theme(plot.title = element_text(hjust = 0.5))
Based on the scatter plot above, Cholesterol appears to have a positive correlation with Age, even though it is weak.
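The visual impression from a scatter plot can be backed by a number. Below is a minimal sketch using base R's cor() and cor.test(); the data frame `toy` and its values are made up for illustration and do not come from the dataset.

```r
# Hypothetical toy data; values are made up for illustration.
toy <- data.frame(
  Age         = c(35, 42, 48, 53, 57, 61, 66, 70),
  Cholesterol = c(210, 245, 198, 260, 231, 275, 240, 252)
)

# Pearson correlation coefficient, between -1 and 1.
r <- cor(toy$Age, toy$Cholesterol)

# cor.test() adds a p-value for the null hypothesis of zero correlation.
test_result <- cor.test(toy$Age, toy$Cholesterol)

print(r)
print(test_result$p.value)
```

For this report, the inputs would be the Age and Cholesterol columns of the rows with Cholesterol > 0, so that the zero placeholders do not distort the estimate.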
To fill the missing values in the Cholesterol column (recorded as 0), we will use the median of the non-missing values.
Separate the dataset into 2 datasets:
heart_full: dataset with no missing value in the Cholesterol column
heart_miss: dataset with missing value in the Cholesterol column
#create dataset with no missing value in cholesterol column
heart_full <-
heart_clean %>%
filter(Cholesterol > 0) %>%
mutate(cholesterol_new = Cholesterol)
#create dataset with only missing value in cholesterol column
heart_miss <-
heart_clean %>%
filter(Cholesterol == 0)
Create a function that turns Age into a categorical value. The Age column ranges from 28 to 77. We will split Age into intervals of about 10 years, so there will be 5 categories in the group_age column.
# Create function: takes a single age and returns its age group label
convert_age <- function(y){
  if(y > 27 && y <= 40){ # only a few ages are below 30, so they are grouped into 28-40
    "Age 28-40"
  }else if(y > 40 && y <= 50){
    "Age 41-50"
  }else if(y > 50 && y <= 60){
    "Age 51-60"
  }else if(y > 60 && y <= 70){
    "Age 61-70"
  }else{
    "Above 70"
  }
}
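The same binning can be sketched without an if/else chain by using base R's cut(), which is vectorized and so needs no sapply(). The helper name `convert_age_cut` is hypothetical and is not part of the report.

```r
# Sketch of a vectorized alternative to an if/else chain.
# `convert_age_cut` is a hypothetical helper, not part of the report.
convert_age_cut <- function(age) {
  cut(
    age,
    breaks = c(27, 40, 50, 60, 70, Inf),  # intervals are (27,40], (40,50], ...
    labels = c("Age 28-40", "Age 41-50", "Age 51-60", "Age 61-70", "Above 70")
  )
}

ages <- c(28, 40, 41, 60, 70, 77)
groups <- convert_age_cut(ages)
print(as.character(groups))
```

Two differences are worth noting: cut() returns a factor rather than a character vector, and ages of 27 or below become NA instead of falling into "Above 70".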
Create a new column group_age using the convert_age function created above. The purpose of this column is to calculate the mean or median cholesterol value per age group, which is then used to fill the missing values in the heart_miss dataset.
#Apply function
heart_full$group_age <- sapply(heart_full$Age,
FUN = convert_age)
heart_miss$group_age <- sapply (heart_miss$Age,
FUN = convert_age)
Separate the heart_full and heart_miss datasets by HeartDisease, on the assumption that cholesterol levels differ between the two groups and that people with HeartDisease tend to have higher Cholesterol than healthy people.
heart_full_0 <- heart_full %>%
filter(HeartDisease == 0)
heart_full_1 <- heart_full %>%
filter(HeartDisease == 1)
heart_miss_0 <- heart_miss %>%
filter(HeartDisease == 0)
heart_miss_1 <- heart_miss %>%
filter(HeartDisease == 1)
3.3.1 Group Age with Heart Disease
heart_full_1 %>%
ggplot(aes(x = cholesterol_new, group = group_age, fill = group_age))+
geom_density()+
facet_wrap(~group_age)
3.3.2 Healthy Group Age
heart_full_0 %>%
ggplot(aes(x = cholesterol_new, group = group_age, fill = group_age))+
geom_density()+
facet_wrap(~group_age)
Based on the distribution plots, we will use the median cholesterol by age group to fill the missing values in the heart_miss_0 and heart_miss_1 datasets.
# Median cholesterol by age group for healthy people (HeartDisease == 0)
heart_full_0 %>%
group_by(group_age) %>%
summarise(median_chol = median(cholesterol_new))
# Median cholesterol by age group for people with heart disease (HeartDisease == 1)
heart_full_1 %>%
group_by(group_age) %>%
summarise(median_chol = median(cholesterol_new))
Fill the missing values with the median cholesterol value of each age group, in a new column called cholesterol_new.
# to fill missing value
heart_miss_0$cholesterol_new[heart_miss_0$group_age == "Above 70"] <- 267
heart_miss_0$cholesterol_new[heart_miss_0$group_age == "Age 28-40"] <- 220
heart_miss_0$cholesterol_new[heart_miss_0$group_age == "Age 41-50"] <- 234.5
heart_miss_0$cholesterol_new[heart_miss_0$group_age == "Age 51-60"] <- 228
heart_miss_0$cholesterol_new[heart_miss_0$group_age == "Age 61-70"] <- 244
heart_miss_1$cholesterol_new[heart_miss_1$group_age == "Above 70"] <- 218.5
heart_miss_1$cholesterol_new[heart_miss_1$group_age == "Age 28-40"] <- 246
heart_miss_1$cholesterol_new[heart_miss_1$group_age == "Age 41-50"] <- 243
heart_miss_1$cholesterol_new[heart_miss_1$group_age == "Age 51-60"] <- 246
heart_miss_1$cholesterol_new[heart_miss_1$group_age == "Age 61-70"] <- 254
# Convert the imputed values to integer (as.integer() truncates 234.5 to 234 and 218.5 to 218).
heart_miss_0 <-
heart_miss_0 %>%
mutate(cholesterol_new = as.integer(cholesterol_new))
heart_miss_1 <-
heart_miss_1 %>%
mutate(cholesterol_new = as.integer(cholesterol_new))
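The hard-coded medians above have to be retyped whenever the data change. As a minimal sketch of one alternative, the group medians can be computed and filled in one grouped mutate(). The data frame `toy`, its values and the helper column `group_median` are hypothetical; only the column names HeartDisease, group_age, Cholesterol and cholesterol_new mirror the report.

```r
library(dplyr)

# Hypothetical toy data; a Cholesterol of 0 marks a missing measurement.
toy <- data.frame(
  HeartDisease = factor(c(0, 0, 0, 1, 1, 1)),
  group_age    = c("Age 41-50", "Age 41-50", "Age 41-50",
                   "Age 51-60", "Age 51-60", "Age 51-60"),
  Cholesterol  = c(200, 240, 0, 250, 0, 270)
)

imputed <- toy %>%
  group_by(HeartDisease, group_age) %>%
  mutate(
    # Median of the recorded (non-zero) values within the group.
    group_median    = median(Cholesterol[Cholesterol > 0]),
    # Keep recorded values; replace zeros with the group median.
    cholesterol_new = ifelse(Cholesterol == 0, group_median, Cholesterol)
  ) %>%
  ungroup()

print(imputed)
```

This sketch keeps the imputed values as doubles instead of truncating them to integers, and it does not require splitting the data into separate data frames first.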
After filling the missing values with the chosen medians, we need to recombine the datasets that were separated earlier. We will assign the combined dataset to heart_clean_new.
heart_clean_new <- rbind(heart_full, heart_miss_0, heart_miss_1)
heart_clean_new <- heart_clean_new %>%
dplyr::select(-c(Cholesterol, group_age)) # dplyr:: prefix is needed because MASS::select masks dplyr::select
nrow(heart_clean_new)
#> [1] 917
A new dataset with no missing values has been created, so we can move to the next step.
3.4 Correlation Numeric Variable
heart_clean_new %>%
mutate(HeartDisease = as.integer(HeartDisease)) %>%
ggcorr(label = T, hjust = 0.7)
Oldpeak has the highest positive correlation with HeartDisease, followed by Age, cholesterol_new and RestingBP, while MaxHR has a negative correlation.
4 Modelling
4.1 Splitting Dataset
RNGkind(sample.kind = "Rounding")
set.seed(363)
# Sampling Index with proportion for training and test 80 : 20
index <- initial_split(heart_clean_new, prop = 0.8, strata = "HeartDisease")
# Splitting into Train & Test
heart_train <- training(index)
heart_test <- testing(index)
cat(
nrow(heart_train),
nrow(heart_test),
sep = "\n")
#> 733
#> 184
4.2 Create Model
We will first build a model manually, using several predictors that appear related to HeartDisease based on the visualizations.
#Create model
model_heart_manual <- glm( formula = HeartDisease ~ #Target Variable
Sex + ChestPainType + ST_Slope + #Categorical Predictor
Oldpeak + Age, #Numeric Predictor
data = heart_train,
family = "binomial")
summary(model_heart_manual)
#>
#> Call:
#> glm(formula = HeartDisease ~ Sex + ChestPainType + ST_Slope +
#> Oldpeak + Age, family = "binomial", data = heart_train)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -2.6634 -0.4295 0.2462 0.5034 2.8046
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -2.44084 0.87997 -2.774 0.005541 **
#> SexM 1.73081 0.29449 5.877 4.17e-09 ***
#> ChestPainTypeATA -2.02712 0.33268 -6.093 1.11e-09 ***
#> ChestPainTypeNAP -1.64590 0.26929 -6.112 9.84e-10 ***
#> ChestPainTypeTA -2.02320 0.50719 -3.989 6.63e-05 ***
#> ST_SlopeFlat 0.81514 0.48019 1.698 0.089593 .
#> ST_SlopeUp -1.83647 0.49354 -3.721 0.000198 ***
#> Oldpeak 0.40905 0.12504 3.271 0.001070 **
#> Age 0.04123 0.01263 3.263 0.001101 **
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 1008.05 on 732 degrees of freedom
#> Residual deviance: 523.43 on 724 degrees of freedom
#> AIC: 541.43
#>
#> Number of Fisher Scoring iterations: 5
Based on the summary() output, almost all predictor terms are statistically significant at the 5% level; the exception is ST_SlopeFlat (p ≈ 0.09).
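The estimates in a logistic regression summary are on the log-odds scale, which is hard to read directly. A minimal sketch of converting three of the coefficients printed above into odds ratios with exp(); the vector name `log_odds` is an illustrative choice, and the numbers are copied from the summary(model_heart_manual) output.

```r
# Coefficients copied from the summary(model_heart_manual) output above.
log_odds <- c(SexM = 1.73081, Oldpeak = 0.40905, Age = 0.04123)

# exp() turns a log-odds coefficient into an odds ratio:
# the factor by which the odds of HeartDisease change for a
# one-unit increase in the predictor, other predictors held fixed.
odds_ratios <- exp(log_odds)
print(round(odds_ratios, 2))
```

Under this model, being male multiplies the odds of HeartDisease by roughly 5.6 compared with being female, holding the other predictors fixed. On a fitted model object the same conversion would be exp(coef(model_heart_manual)).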
Let’s create another model for comparison, using all predictor variables.
model_heart_all <- glm(formula = HeartDisease ~ .,
data = heart_train,
family = "binomial")
summary(model_heart_all)
#>
#> Call:
#> glm(formula = HeartDisease ~ ., family = "binomial", data = heart_train)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -2.9708 -0.3970 0.2027 0.4679 2.6618
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -2.4620489 1.6557037 -1.487 0.137012
#> Age 0.0295181 0.0147217 2.005 0.044955 *
#> SexM 1.5328887 0.3108229 4.932 8.15e-07 ***
#> ChestPainTypeATA -1.6454860 0.3511843 -4.686 2.79e-06 ***
#> ChestPainTypeNAP -1.4360878 0.2899597 -4.953 7.32e-07 ***
#> ChestPainTypeTA -1.6542076 0.5244165 -3.154 0.001608 **
#> RestingBP -0.0001279 0.0066482 -0.019 0.984652
#> FastingBS1 1.1955431 0.2951648 4.050 5.11e-05 ***
#> RestingECGNormal 0.2572881 0.3002505 0.857 0.391494
#> RestingECGST -0.1297717 0.3832890 -0.339 0.734931
#> MaxHR -0.0063036 0.0054246 -1.162 0.245219
#> ExerciseAnginaY 0.9347833 0.2700153 3.462 0.000536 ***
#> Oldpeak 0.3646838 0.1312762 2.778 0.005470 **
#> ST_SlopeFlat 1.0036713 0.5076139 1.977 0.048015 *
#> ST_SlopeUp -1.3669854 0.5281467 -2.588 0.009646 **
#> cholesterol_new 0.0023397 0.0022120 1.058 0.290178
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 1008.05 on 732 degrees of freedom
#> Residual deviance: 488.36 on 717 degrees of freedom
#> AIC: 520.36
#>
#> Number of Fisher Scoring iterations: 5
Based on the summary() output, several predictors are not statistically significant: RestingBP, RestingECG, MaxHR and cholesterol_new. However, two significant predictors, FastingBS and ExerciseAngina, were excluded from model_heart_manual.
#To check AIC from both model
cat(
model_heart_manual$aic,
model_heart_all$aic,
sep = "\n")
#> 541.4283
#> 520.3574
model_heart_all has a lower AIC than model_heart_manual, so we will choose model_heart_all for prediction. Because it still contains several non-significant predictors, we will first refine it using a stepwise method.
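The two AIC values can be reproduced by hand from the summary() outputs, which shows what the comparison rewards and penalizes. A minimal sketch, with the residual deviances and coefficient counts copied from the outputs above; the vector names are illustrative.

```r
# Residual deviance and number of estimated coefficients (including the
# intercept), read from the two summary() outputs above.
residual_deviance <- c(model_heart_manual = 523.43, model_heart_all = 488.36)
n_coefficients    <- c(model_heart_manual = 9,      model_heart_all = 16)

# For a 0/1 response the residual deviance equals -2 * log-likelihood,
# so AIC = residual deviance + 2 * (number of coefficients).
aic <- residual_deviance + 2 * n_coefficients
print(aic)
```

The full model pays a larger penalty (2 * 16 instead of 2 * 9), but its deviance is lower by more than that difference, which is why its AIC comes out lower.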
4.3 Model Fitting
model_heart_fit <- stepAIC(model_heart_all,
direction = "backward",
trace = F)
summary(model_heart_fit)
#>
#> Call:
#> glm(formula = HeartDisease ~ Age + Sex + ChestPainType + FastingBS +
#> ExerciseAngina + Oldpeak + ST_Slope, family = "binomial",
#> data = heart_train)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -2.8376 -0.3908 0.2048 0.4855 2.5917
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -2.62277 0.92081 -2.848 0.004395 **
#> Age 0.03037 0.01316 2.308 0.021004 *
#> SexM 1.49168 0.30233 4.934 8.06e-07 ***
#> ChestPainTypeATA -1.67557 0.34885 -4.803 1.56e-06 ***
#> ChestPainTypeNAP -1.48934 0.28345 -5.254 1.49e-07 ***
#> ChestPainTypeTA -1.78487 0.52421 -3.405 0.000662 ***
#> FastingBS1 1.20701 0.29339 4.114 3.89e-05 ***
#> ExerciseAnginaY 0.97933 0.26101 3.752 0.000175 ***
#> Oldpeak 0.33763 0.12895 2.618 0.008836 **
#> ST_SlopeFlat 1.03768 0.49674 2.089 0.036709 *
#> ST_SlopeUp -1.42748 0.51347 -2.780 0.005435 **
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 1008.1 on 732 degrees of freedom
#> Residual deviance: 492.5 on 722 degrees of freedom
#> AIC: 514.5
#>
#> Number of Fisher Scoring iterations: 5
Using the stepwise method, the AIC drops from 520.36 in model_heart_all to 514.5 in model_heart_fit.
4.4 Predicting
Generate predictions from model_heart_fit with the predict() function and store them in a new column called prediction. With type = "response", the values are predicted probabilities of HeartDisease.
#Predict model
heart_test$prediction <- predict(object = model_heart_fit,
newdata = heart_test,
type = "response")
Create a new column that converts the predicted probability in prediction into a binary label, 0 or 1, using a threshold of 0.5.
heart_test$pred.label <- ifelse(heart_test$prediction > 0.5, 1, 0) %>%
as.factor()
# Visualize the distribution of the predicted probabilities.
ggplot(heart_test, aes(x = prediction)) +
geom_density(lwd=0.5)
Most of the predicted probabilities lean toward 1.
confusionMatrix(data = heart_test$pred.label,
reference = heart_test$HeartDisease,
positive = "1")
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 68 16
#> 1 14 86
#>
#> Accuracy : 0.837
#> 95% CI : (0.7755, 0.8872)
#> No Information Rate : 0.5543
#> P-Value [Acc > NIR] : 3.67e-16
#>
#> Kappa : 0.6708
#>
#> Mcnemar's Test P-Value : 0.8551
#>
#> Sensitivity : 0.8431
#> Specificity : 0.8293
#> Pos Pred Value : 0.8600
#> Neg Pred Value : 0.8095
#> Prevalence : 0.5543
#> Detection Rate : 0.4674
#> Detection Prevalence : 0.5435
#> Balanced Accuracy : 0.8362
#>
#> 'Positive' Class : 1
#>
4.5 Conclusion
Based on the confusionMatrix() output, model_heart_fit performs well in predicting HeartDisease, with a sensitivity (recall) of 84%. Sensitivity is used as the performance metric for this model because we want as many people with HeartDisease as possible to get an early warning; in many real cases, patients are diagnosed only when their condition is already severe.
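The reported metrics follow directly from the four cells of the confusion matrix. A minimal sketch of the arithmetic in base R, with the counts copied from the confusionMatrix() output above; the variable names are illustrative.

```r
# Counts copied from the confusion matrix above (positive class = 1).
tp <- 86  # predicted 1, actually 1
fn <- 16  # predicted 0, actually 1
fp <- 14  # predicted 1, actually 0
tn <- 68  # predicted 0, actually 0

sensitivity <- tp / (tp + fn)                    # share of actual positives found
specificity <- tn / (tn + fp)                    # share of actual negatives found
precision   <- tp / (tp + fp)                    # Pos Pred Value in the output
accuracy    <- (tp + tn) / (tp + fn + fp + tn)

print(round(c(sensitivity = sensitivity, specificity = specificity,
              precision = precision, accuracy = accuracy), 4))
```

The 16 false negatives are the cases that matter most for the stated goal, since they are patients with HeartDisease who would not receive a warning.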