Stroke Prediction

Tera Putera

12/26/2021

1 Introduction

For LBB (Learning by Building) - Classification in Machine Learning 1, I'm using the Kaggle Heart Failure Prediction Dataset. The purpose of this report is to build a model that performs well in predicting heart disease, since heart disease is considered a lethal disease and often strikes suddenly. Early prediction that helps a patient avoid a late diagnosis can be a life saver.

List of library

library(tidyverse) #For data wrangling
library(GGally) #For visualization
library(MLmetrics) #For evaluation metric
library(ggplot2) #For visualization
library(rsample) #For splitting data
library(MASS) #For stepwise method
library(caret) #Confusion matrix
theme_set(theme_minimal())

2 Data Preparation

2.1 Import and Data Checking

heart <- read.csv("heart.csv") #Import data from local storage
head(heart) #Print the first 6 rows of dataset

Data columns description:

  • Age: age of the patient [years]
  • Sex: sex of the patient [M: Male, F: Female]
  • ChestPainType: chest pain type [TA: Typical Angina, ATA: Atypical Angina, NAP: Non-Anginal Pain, ASY: Asymptomatic]
  • RestingBP: resting blood pressure [mm Hg]
  • Cholesterol: serum cholesterol [mg/dl]
  • FastingBS: fasting blood sugar [1: if FastingBS > 120 mg/dl, 0: otherwise]
  • RestingECG: resting electrocardiogram results [Normal: Normal, ST: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV), LVH: showing probable or definite left ventricular hypertrophy by Estes’ criteria]
  • MaxHR: maximum heart rate achieved [Numeric value between 60 and 202]
  • ExerciseAngina: exercise-induced angina [Y: Yes, N: No]
  • Oldpeak: ST depression induced by exercise relative to rest [numeric value]
  • ST_Slope: the slope of the peak exercise ST segment [Up: upsloping, Flat: flat, Down: downsloping]
  • HeartDisease: output class [1: heart disease, 0: Normal]
colSums(is.na(heart)) #checking missing / "NA" in each column.
#>            Age            Sex  ChestPainType      RestingBP    Cholesterol 
#>              0              0              0              0              0 
#>      FastingBS     RestingECG          MaxHR ExerciseAngina        Oldpeak 
#>              0              0              0              0              0 
#>       ST_Slope   HeartDisease 
#>              0              0
glimpse(heart) #glimpse dataset
#> Rows: 918
#> Columns: 12
#> $ Age            <int> 40, 49, 37, 48, 54, 39, 45, 54, 37, 48, 37, 58, 39, 49,~
#> $ Sex            <chr> "M", "F", "M", "F", "M", "M", "F", "M", "M", "F", "F", ~
#> $ ChestPainType  <chr> "ATA", "NAP", "ATA", "ASY", "NAP", "NAP", "ATA", "ATA",~
#> $ RestingBP      <int> 140, 160, 130, 138, 150, 120, 130, 110, 140, 120, 130, ~
#> $ Cholesterol    <int> 289, 180, 283, 214, 195, 339, 237, 208, 207, 284, 211, ~
#> $ FastingBS      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
#> $ RestingECG     <chr> "Normal", "Normal", "ST", "Normal", "Normal", "Normal",~
#> $ MaxHR          <int> 172, 156, 98, 108, 122, 170, 170, 142, 130, 120, 142, 9~
#> $ ExerciseAngina <chr> "N", "N", "N", "Y", "N", "N", "N", "N", "Y", "N", "N", ~
#> $ Oldpeak        <dbl> 0.0, 1.0, 0.0, 1.5, 0.0, 0.0, 0.0, 0.0, 1.5, 0.0, 0.0, ~
#> $ ST_Slope       <chr> "Up", "Flat", "Up", "Flat", "Up", "Up", "Up", "Up", "Fl~
#> $ HeartDisease   <int> 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1~
# To check the number of unique values in each column.
# funs() is deprecated in dplyr; across() is the current equivalent.
heart %>% summarise(across(everything(), n_distinct))

Based on glimpse() and the unique-value check for each column, Sex, ChestPainType, FastingBS, RestingECG, ExerciseAngina, ST_Slope and HeartDisease have only a few unique values, so we will convert them to factors.
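The same check can be done programmatically instead of by eye. This is a minimal sketch in base R, assuming a small invented data frame (`toy`) in place of `heart` and an arbitrary cutoff of 5 distinct values; neither comes from the report.

```r
# Toy stand-in for `heart`; the values are invented for illustration only.
toy <- data.frame(
  Age = c(40, 49, 37, 48, 54, 39),
  Sex = c("M", "F", "M", "F", "M", "M"),
  FastingBS = c(0, 0, 1, 0, 0, 1),
  stringsAsFactors = FALSE
)

# Count the distinct values in each column.
n_unique <- sapply(toy, function(col) length(unique(col)))

# Columns with few distinct values are candidates for conversion to factor.
# The cutoff of 5 is an assumption made for this sketch.
factor_candidates <- names(n_unique)[n_unique <= 5]
print(factor_candidates)
```

On the real dataset the cutoff would need to be chosen with care, since a numeric column with few observed values is not necessarily categorical.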

2.2 Clean Data

#To convert several columns data type into factor.
heart_clean <-
heart %>% 
  mutate(across(c(Sex, ChestPainType, FastingBS, RestingECG, ExerciseAngina, ST_Slope, HeartDisease), as.factor))
#checking data frame
glimpse(heart_clean)
#> Rows: 918
#> Columns: 12
#> $ Age            <int> 40, 49, 37, 48, 54, 39, 45, 54, 37, 48, 37, 58, 39, 49,~
#> $ Sex            <fct> M, F, M, F, M, M, F, M, M, F, F, M, M, M, F, F, M, F, M~
#> $ ChestPainType  <fct> ATA, NAP, ATA, ASY, NAP, NAP, ATA, ATA, ASY, ATA, NAP, ~
#> $ RestingBP      <int> 140, 160, 130, 138, 150, 120, 130, 110, 140, 120, 130, ~
#> $ Cholesterol    <int> 289, 180, 283, 214, 195, 339, 237, 208, 207, 284, 211, ~
#> $ FastingBS      <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
#> $ RestingECG     <fct> Normal, Normal, ST, Normal, Normal, Normal, Normal, Nor~
#> $ MaxHR          <int> 172, 156, 98, 108, 122, 170, 170, 142, 130, 120, 142, 9~
#> $ ExerciseAngina <fct> N, N, N, Y, N, N, N, N, Y, N, N, Y, N, Y, N, N, N, N, N~
#> $ Oldpeak        <dbl> 0.0, 1.0, 0.0, 1.5, 0.0, 0.0, 0.0, 0.0, 1.5, 0.0, 0.0, ~
#> $ ST_Slope       <fct> Up, Flat, Up, Flat, Up, Up, Up, Up, Flat, Up, Up, Flat,~
#> $ HeartDisease   <fct> 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1~

Datatype has been converted properly, now we can use this data set for analysis and modelling.

3 Exploratory Data Analysis

#checking heart disease proportion in dataset
prop.table(table(heart_clean$HeartDisease))
#> 
#>         0         1 
#> 0.4466231 0.5533769

Based on prop.table(), the proportions of our target variable HeartDisease are fairly balanced.

3.1 Barplot Categorical Predictor Variable

3.1.1 Heart Disease by Sex

heart_clean %>% 
ggplot(aes(x = Sex, fill = HeartDisease))+
geom_bar(position = "stack")

Insight: Males are more likely to suffer from HeartDisease than females.

3.1.2 Heart Disease by Chest Pain Type

heart_clean %>% 
ggplot(aes(x = ChestPainType, fill = HeartDisease))+
geom_bar(position = "stack")

Insight: Most people with HeartDisease have the ASY (Asymptomatic) ChestPainType. This type is sometimes called a silent killer, since most people are unaware of the symptoms that lead to HeartDisease.

3.1.3 Heart Disease by Fasting Blood Sugar

heart_clean %>% 
ggplot(aes(x = FastingBS, fill = HeartDisease))+
geom_bar(position = "stack")

Insight:

  • 0 (fasting blood sugar of 120 mg/dl or below) is treated as normal, 1 (above 120 mg/dl) as prediabetes. The measurement is taken after the person has fasted.
  • As we can see, most people with blood sugar above 120 mg/dl have HeartDisease. High blood sugar can also lead to other problems such as diabetes.
  • High blood sugar on its own does not appear to be a significant cause of HeartDisease.

3.1.4 Heart Disease by Resting Electrocardiogram

heart_clean %>% 
ggplot(aes(x = RestingECG, fill = HeartDisease))+
geom_bar(position = "stack")

Insight: People with HeartDisease are more likely to have a Normal RestingECG result.

3.1.5 Heart Disease by ST Slope

heart_clean %>% 
ggplot(aes(x = ST_Slope, fill = HeartDisease))+
geom_bar(position = "stack")

Insight: People with a Flat ST_Slope tend to have a higher chance of suffering from HeartDisease.

Based on the visualization of the categorical predictors, we will include Sex, ChestPainType and ST_Slope in our model.
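Stacked bars show counts, so groups of different sizes are hard to compare directly. One way to read "more likely" more precisely is to look at row proportions. The sketch below uses base R on an invented data frame (`toy`); the counts are illustrative and are not taken from the dataset.

```r
# Toy data, invented for illustration; in practice use heart_clean.
toy <- data.frame(
  Sex = c("M", "M", "M", "M", "F", "F", "F", "F"),
  HeartDisease = c(1, 1, 1, 0, 1, 0, 0, 0)
)

# Cross-tabulate, then convert each row to proportions (margin = 1),
# so every Sex group sums to 1 regardless of its size.
counts <- table(toy$Sex, toy$HeartDisease)
rates <- prop.table(counts, margin = 1)
print(rates)
```

In ggplot2, `geom_bar(position = "fill")` draws the same within-group proportions instead of raw counts.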

3.2 Boxplot Numeric Predictor Variable

3.2.1 Age

heart_clean %>% 
ggplot(aes(x = HeartDisease, y = Age, fill = HeartDisease))+
  geom_boxplot(alpha = 0.5, show.legend = F)+
  geom_jitter(alpha = 0.8, show.legend = F)+
  ggtitle("Heart Disease by Age")+
  theme(plot.title = element_text(hjust = 0.5))

Insight: From the plot above, HeartDisease can occur across the whole range of Age.

3.2.2 Cholesterol

heart_clean %>% 
ggplot(aes(x = HeartDisease, y = Cholesterol, fill = HeartDisease))+
  geom_boxplot(alpha = 0.5, show.legend = F)+
  geom_jitter(alpha = 0.8, show.legend = F)+
  ggtitle("Heart Disease by Cholesterol")+
  theme(plot.title = element_text(hjust = 0.5))

3.2.3 RestingBP

heart_clean %>% 
ggplot(aes(x = HeartDisease, y = RestingBP, fill = HeartDisease))+
  geom_boxplot(alpha = 0.5, show.legend = F)+
  geom_jitter(alpha = 0.8, show.legend = F)+
  ggtitle("Heart Disease by RestingBP")+
  theme(plot.title = element_text(hjust = 0.5))

3.2.4 MaxHR

heart_clean %>% 
ggplot(aes(x = HeartDisease, y = MaxHR, fill = HeartDisease))+
  geom_boxplot(alpha = 0.5, show.legend = F)+
  geom_jitter(alpha = 0.8, show.legend = F)+
  ggtitle("Heart Disease by MaxHR")+
  theme(plot.title = element_text(hjust = 0.5))

Based on the boxplots of the numeric predictors, there are zero values in Cholesterol and RestingBP. We will try not to delete the zero values in the Cholesterol column; instead we will look at its relationship with Age, since according to this article, “As we get older, cholesterol levels rise”. Before modifying the Cholesterol column, we will delete the zero values in RestingBP, since only a few rows contain them.
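The number of zero values can also be counted directly rather than read off the boxplots. A minimal sketch in base R, using an invented data frame (`toy`) in place of `heart_clean`:

```r
# Toy data, invented for illustration; in practice use heart_clean.
toy <- data.frame(
  RestingBP = c(140, 0, 130, 120),
  Cholesterol = c(289, 0, 0, 214)
)

# Comparing a data frame with 0 gives a logical matrix;
# colSums() then counts the TRUE values per column.
zero_counts <- colSums(toy == 0)
print(zero_counts)
```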

3.3 Data Modification

#Delete rows with zero RestingBP using the filter() function.
heart_clean <- 
  heart_clean %>% 
  filter(RestingBP != 0)

cat(
nrow(heart),# full row
nrow(heart_clean),#after filter
sep = "\n")
#> 918
#> 917

Only 1 row was deleted from the heart_clean dataset.

heart_clean %>% 
  filter(Cholesterol > 0) %>% 
  ggplot(aes(x = Age, y = Cholesterol, color = HeartDisease))+
  geom_point()+
  ggtitle("Cholesterol correlation with Age")+
  theme(plot.title = element_text(hjust = 0.5))

Based on the scatter plot above, we can assume that Cholesterol is correlated with Age, even though the relationship is weak.
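A correlation coefficient gives a number to go with the visual impression. The sketch below computes the Pearson correlation on an invented data frame (`toy`), after dropping zero Cholesterol values in the same way as the plot; the numbers are illustrative only.

```r
# Toy data, invented for illustration; in practice use heart_clean.
toy <- data.frame(
  Age = c(30, 40, 50, 60, 70, 45),
  Cholesterol = c(200, 0, 240, 230, 260, 210)
)

# Drop rows where Cholesterol is recorded as 0, as in the scatter plot.
nonzero <- toy[toy$Cholesterol > 0, ]

# Pearson correlation between Age and Cholesterol.
r <- cor(nonzero$Age, nonzero$Cholesterol, method = "pearson")
print(r)
```

`cor.test()` takes the same two vectors and additionally reports a p-value and confidence interval.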

To fill the missing values in the Cholesterol column, we will use the median of the non-missing values.

Separate dataset into 2 dataset:

  • heart_full dataset with no missing value in Cholesterol column
  • heart_miss dataset with missing value in Cholesterol column
#create dataset with no missing value in cholesterol column
heart_full <-
  heart_clean %>% 
  filter(Cholesterol > 0) %>% 
  mutate(cholesterol_new = Cholesterol)

#create dataset with only missing value in cholesterol column
heart_miss <- 
  heart_clean %>% 
  filter(Cholesterol == 0)

Create a function that assigns an age category. The Age column ranges from 28 to 77. We will split Age into intervals of roughly 10 years, giving 5 categories in the group_age column.

#Create function 
convert_age <- function(y){
  # Few ages in the dataset are below 30, so they are included in the 28-40 group.
  # && is used because each call receives a single age value.
  if (y > 27 && y <= 40) {
    "Age 28-40"
  } else if (y > 40 && y <= 50) {
    "Age 41-50"
  } else if (y > 50 && y <= 60) {
    "Age 51-60"
  } else if (y > 60 && y <= 70) {
    "Age 61-70"
  } else {
    "Above 70"
  }
}

Create a new column group_age using the convert_age function defined above. This column is used to calculate the mean or median cholesterol per age group, which then fills the missing values in the heart_miss dataset.

#Apply function 
heart_full$group_age <- sapply(heart_full$Age,
                                FUN = convert_age)

heart_miss$group_age <- sapply (heart_miss$Age,
                                FUN = convert_age)
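As an alternative to an if/else function applied with sapply(), base R's cut() can bin a whole vector at once. This is a sketch on an invented vector of ages (`ages`). One difference to note: cut() returns NA for ages of 27 or below, whereas convert_age labels them "Above 70".

```r
# Example ages, invented for illustration.
ages <- c(28, 40, 41, 50, 60, 61, 70, 71, 77)

# right = TRUE makes each interval closed on the right: (27,40], (40,50], ...
group_age <- cut(
  ages,
  breaks = c(27, 40, 50, 60, 70, Inf),
  labels = c("Age 28-40", "Age 41-50", "Age 51-60", "Age 61-70", "Above 70"),
  right = TRUE
)

# cut() returns a factor; convert to character to match convert_age output.
group_age <- as.character(group_age)
print(group_age)
```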

Separate the heart_full and heart_miss datasets by HeartDisease, since cholesterol levels differ between people with HeartDisease and healthy people: those with HeartDisease tend to have higher Cholesterol.

heart_full_0 <- heart_full %>% 
                filter(HeartDisease == 0)

heart_full_1 <- heart_full %>% 
                filter(HeartDisease == 1)

heart_miss_0 <- heart_miss %>% 
                filter(HeartDisease == 0)

heart_miss_1 <- heart_miss %>% 
                filter(HeartDisease == 1)

3.3.1 Group Age with Heart Disease

heart_full_1 %>% 
  ggplot(aes(x = cholesterol_new, group = group_age, fill = group_age))+
  geom_density()+
  facet_wrap(~group_age)

3.3.2 Healthy Group Age

heart_full_0 %>% 
  ggplot(aes(x = cholesterol_new, group = group_age, fill = group_age))+
  geom_density()+
  facet_wrap(~group_age)

Based on the distribution plots, we will use the median cholesterol by age group to fill the missing values in the heart_miss_0 and heart_miss_1 datasets.

# to calculate median cholesterol by group age (without heart disease)
heart_full_0 %>% 
  group_by(group_age) %>% 
  summarise(median_chol = median(cholesterol_new))
# to calculate median cholesterol by group age (with heart disease)
heart_full_1 %>% 
  group_by(group_age) %>% 
  summarise(median_chol = median(cholesterol_new))

Fill the missing values with the median cholesterol by age group in a new column called cholesterol_new.

# to fill missing value
heart_miss_0$cholesterol_new[heart_miss_0$group_age == "Above 70"] <- 267
heart_miss_0$cholesterol_new[heart_miss_0$group_age == "Age 28-40"] <- 220  
heart_miss_0$cholesterol_new[heart_miss_0$group_age == "Age 41-50"] <- 234.5    
heart_miss_0$cholesterol_new[heart_miss_0$group_age == "Age 51-60"] <- 228  
heart_miss_0$cholesterol_new[heart_miss_0$group_age == "Age 61-70"] <- 244  

heart_miss_1$cholesterol_new[heart_miss_1$group_age == "Above 70"] <- 218.5
heart_miss_1$cholesterol_new[heart_miss_1$group_age == "Age 28-40"] <- 246  
heart_miss_1$cholesterol_new[heart_miss_1$group_age == "Age 41-50"] <- 243  
heart_miss_1$cholesterol_new[heart_miss_1$group_age == "Age 51-60"] <- 246  
heart_miss_1$cholesterol_new[heart_miss_1$group_age == "Age 61-70"] <- 254  

#convert the imputed values to integer (as.integer truncates, so 234.5 becomes 234).
heart_miss_0 <- 
  heart_miss_0 %>%
  mutate(cholesterol_new = as.integer(cholesterol_new))

heart_miss_1 <- 
  heart_miss_1 %>%
  mutate(cholesterol_new = as.integer(cholesterol_new))
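The medians above are typed in by hand, which is easy to get wrong if the data changes. The sketch below shows one way to compute and assign group medians in a single step with base R's ave(). It uses an invented data frame (`toy`) with made-up group labels and values, and treats 0 as missing, as the report does. It is a swapped-in approach, not the method used in the report.

```r
# Toy data, invented for illustration; group labels and values are made up.
toy <- data.frame(
  group_age = c("A", "A", "A", "B", "B", "B"),
  HeartDisease = c(0, 0, 0, 1, 1, 1),
  Cholesterol = c(200, 220, 0, 250, 0, 270)
)

# Treat a recorded 0 as a missing measurement.
chol <- ifelse(toy$Cholesterol == 0, NA, toy$Cholesterol)

# One grouping key per (age group, HeartDisease) combination.
key <- paste(toy$group_age, toy$HeartDisease)

# ave() returns, for every row, the median of the observed values in its group.
group_median <- ave(chol, key, FUN = function(x) median(x, na.rm = TRUE))

# Keep observed values; fill missing ones with the group median.
toy$cholesterol_new <- ifelse(is.na(chol), group_median, chol)
print(toy)
```

A group with no observed values at all would yield NA here, so that case would need separate handling on real data.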

After filling the missing values with the group medians, we need to recombine the datasets that were separated earlier. We will create a new dataset and assign it to heart_clean_new.

heart_clean_new <- rbind(heart_full, heart_miss_0, heart_miss_1)

heart_clean_new <- heart_clean_new %>% 
                    dplyr::select(-c(Cholesterol, group_age)) ## dplyr:: prefix is needed because MASS::select masks dplyr::select

nrow(heart_clean_new)
#> [1] 917

New dataset with no missing value has been created, now we can move to the next step.

3.4 Correlation Numeric Variable

heart_clean_new %>% 
  mutate(HeartDisease = as.integer(HeartDisease)) %>% 
  ggcorr(label = T, hjust = 0.7) 

Oldpeak has the highest positive correlation with HeartDisease, followed by Age, cholesterol_new and RestingBP, while MaxHR has a negative correlation.
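One detail worth knowing about the conversion used in the correlation step: as.integer() on a factor returns the internal level codes, not the original labels, so a 0/1 factor becomes 1/2. The correlation is unaffected because this is only a shift by a constant, but the values themselves differ. A small sketch with an invented factor (`f`):

```r
# A 0/1 factor, invented for illustration.
f <- factor(c(0, 1, 1, 0))

# as.integer() on a factor gives the level codes (1 for "0", 2 for "1").
as_codes <- as.integer(f)

# Going through as.character() first recovers the original 0/1 values.
as_values <- as.integer(as.character(f))

print(as_codes)
print(as_values)
```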

4 Modelling

4.1 Splitting Dataset

RNGkind(sample.kind = "Rounding")
set.seed(363) 

# Sampling Index with proportion for training and test 80 : 20
index <- initial_split(heart_clean_new, prop = 0.8, strata = "HeartDisease")

# Splitting into Train & Test
heart_train <- training(index)
heart_test <- testing(index)

cat(
nrow(heart_train),
nrow(heart_test),
sep = "\n")
#> 733
#> 184

4.2 Create Model

We will first build a model manually, using several predictors that we assume are related to the target based on the visualizations.

#Create model
model_heart_manual <- glm( formula = HeartDisease ~ #Target Variable
                        Sex + ChestPainType + ST_Slope + #Categorical Predictor
                        Oldpeak + Age, #Numeric Predictor
                      data = heart_train,
                      family = "binomial")

summary(model_heart_manual)
#> 
#> Call:
#> glm(formula = HeartDisease ~ Sex + ChestPainType + ST_Slope + 
#>     Oldpeak + Age, family = "binomial", data = heart_train)
#> 
#> Deviance Residuals: 
#>     Min       1Q   Median       3Q      Max  
#> -2.6634  -0.4295   0.2462   0.5034   2.8046  
#> 
#> Coefficients:
#>                  Estimate Std. Error z value Pr(>|z|)    
#> (Intercept)      -2.44084    0.87997  -2.774 0.005541 ** 
#> SexM              1.73081    0.29449   5.877 4.17e-09 ***
#> ChestPainTypeATA -2.02712    0.33268  -6.093 1.11e-09 ***
#> ChestPainTypeNAP -1.64590    0.26929  -6.112 9.84e-10 ***
#> ChestPainTypeTA  -2.02320    0.50719  -3.989 6.63e-05 ***
#> ST_SlopeFlat      0.81514    0.48019   1.698 0.089593 .  
#> ST_SlopeUp       -1.83647    0.49354  -3.721 0.000198 ***
#> Oldpeak           0.40905    0.12504   3.271 0.001070 ** 
#> Age               0.04123    0.01263   3.263 0.001101 ** 
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 1008.05  on 732  degrees of freedom
#> Residual deviance:  523.43  on 724  degrees of freedom
#> AIC: 541.43
#> 
#> Number of Fisher Scoring iterations: 5

Based on the summary() output, nearly all predictors are statistically significant at the 0.05 level; the exception is ST_SlopeFlat (p ≈ 0.09).
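Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios that are easier to interpret. The sketch below uses three estimates copied from the summary output above; the vector name `log_odds` is introduced here for illustration.

```r
# Estimates copied from the model_heart_manual summary output.
log_odds <- c(SexM = 1.73081, Oldpeak = 0.40905, Age = 0.04123)

# exp() converts a log-odds coefficient into an odds ratio.
odds_ratio <- exp(log_odds)
print(round(odds_ratio, 2))
```

An odds ratio above 1 means the odds of HeartDisease increase with that predictor, holding the others fixed. With the fitted object available, `exp(coef(model_heart_manual))` returns the odds ratios for all terms at once.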

Let’s create another model for comparison using all predictor variable.

model_heart_all <- glm(formula = HeartDisease ~ .,
                      data = heart_train,
                      family = "binomial")

summary(model_heart_all)
#> 
#> Call:
#> glm(formula = HeartDisease ~ ., family = "binomial", data = heart_train)
#> 
#> Deviance Residuals: 
#>     Min       1Q   Median       3Q      Max  
#> -2.9708  -0.3970   0.2027   0.4679   2.6618  
#> 
#> Coefficients:
#>                    Estimate Std. Error z value Pr(>|z|)    
#> (Intercept)      -2.4620489  1.6557037  -1.487 0.137012    
#> Age               0.0295181  0.0147217   2.005 0.044955 *  
#> SexM              1.5328887  0.3108229   4.932 8.15e-07 ***
#> ChestPainTypeATA -1.6454860  0.3511843  -4.686 2.79e-06 ***
#> ChestPainTypeNAP -1.4360878  0.2899597  -4.953 7.32e-07 ***
#> ChestPainTypeTA  -1.6542076  0.5244165  -3.154 0.001608 ** 
#> RestingBP        -0.0001279  0.0066482  -0.019 0.984652    
#> FastingBS1        1.1955431  0.2951648   4.050 5.11e-05 ***
#> RestingECGNormal  0.2572881  0.3002505   0.857 0.391494    
#> RestingECGST     -0.1297717  0.3832890  -0.339 0.734931    
#> MaxHR            -0.0063036  0.0054246  -1.162 0.245219    
#> ExerciseAnginaY   0.9347833  0.2700153   3.462 0.000536 ***
#> Oldpeak           0.3646838  0.1312762   2.778 0.005470 ** 
#> ST_SlopeFlat      1.0036713  0.5076139   1.977 0.048015 *  
#> ST_SlopeUp       -1.3669854  0.5281467  -2.588 0.009646 ** 
#> cholesterol_new   0.0023397  0.0022120   1.058 0.290178    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 1008.05  on 732  degrees of freedom
#> Residual deviance:  488.36  on 717  degrees of freedom
#> AIC: 520.36
#> 
#> Number of Fisher Scoring iterations: 5

Based on the summary() output, several predictors are not statistically significant for our target variable: RestingBP, RestingECG, MaxHR and cholesterol_new. On the other hand, ExerciseAngina and FastingBS are significant, and both were excluded from model_heart_manual.

#To check AIC from both model
cat(
model_heart_manual$aic,
model_heart_all$aic,
sep = "\n")
#> 541.4283
#> 520.3574

model_heart_all has a lower AIC than model_heart_manual, so we will choose model_heart_all for prediction. Before predicting, since several predictors are not significant, we will perform feature selection using the stepwise method.
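The AIC values can be checked by hand. For a logistic regression on individual 0/1 outcomes, the residual deviance equals -2 times the log-likelihood, so AIC = residual deviance + 2k, where k is the number of estimated coefficients. The sketch below applies this to the figures printed in the two summaries; the helper name `aic_from_deviance` is introduced here for illustration.

```r
# AIC = residual deviance + 2 * (number of coefficients),
# valid here because the response is an ungrouped 0/1 variable.
aic_from_deviance <- function(residual_deviance, n_coef) {
  residual_deviance + 2 * n_coef
}

# model_heart_manual: residual deviance 523.43, 9 coefficients (incl. intercept).
aic_manual <- aic_from_deviance(523.43, 9)

# model_heart_all: residual deviance 488.36, 16 coefficients (incl. intercept).
aic_all <- aic_from_deviance(488.36, 16)

print(c(manual = aic_manual, all = aic_all))
```

The penalty term 2k is why a model with more predictors must reduce the deviance by enough to justify its extra coefficients.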

4.3 Model Fitting

model_heart_fit <- stepAIC(model_heart_all, 
                           direction = "backward",
                           trace = F)

summary(model_heart_fit)
#> 
#> Call:
#> glm(formula = HeartDisease ~ Age + Sex + ChestPainType + FastingBS + 
#>     ExerciseAngina + Oldpeak + ST_Slope, family = "binomial", 
#>     data = heart_train)
#> 
#> Deviance Residuals: 
#>     Min       1Q   Median       3Q      Max  
#> -2.8376  -0.3908   0.2048   0.4855   2.5917  
#> 
#> Coefficients:
#>                  Estimate Std. Error z value Pr(>|z|)    
#> (Intercept)      -2.62277    0.92081  -2.848 0.004395 ** 
#> Age               0.03037    0.01316   2.308 0.021004 *  
#> SexM              1.49168    0.30233   4.934 8.06e-07 ***
#> ChestPainTypeATA -1.67557    0.34885  -4.803 1.56e-06 ***
#> ChestPainTypeNAP -1.48934    0.28345  -5.254 1.49e-07 ***
#> ChestPainTypeTA  -1.78487    0.52421  -3.405 0.000662 ***
#> FastingBS1        1.20701    0.29339   4.114 3.89e-05 ***
#> ExerciseAnginaY   0.97933    0.26101   3.752 0.000175 ***
#> Oldpeak           0.33763    0.12895   2.618 0.008836 ** 
#> ST_SlopeFlat      1.03768    0.49674   2.089 0.036709 *  
#> ST_SlopeUp       -1.42748    0.51347  -2.780 0.005435 ** 
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 1008.1  on 732  degrees of freedom
#> Residual deviance:  492.5  on 722  degrees of freedom
#> AIC: 514.5
#> 
#> Number of Fisher Scoring iterations: 5

Using the stepwise method, we lowered the AIC from 520.36 in model_heart_all to 514.5 in model_heart_fit.

4.4 Predicting

Apply model_heart_fit to the test set with the predict() function and store the predicted probabilities in a new column called prediction.

#Predict model
heart_test$prediction <- predict(object = model_heart_fit, 
                                 newdata = heart_test,
                                 type = "response")

Create a new column that converts the predicted probability into a binary label (0 or 1), using a threshold of 0.5.

heart_test$pred.label <- ifelse(heart_test$prediction > 0.5, 1, 0) %>% 
                          as.factor()
#to Visualize prediction result.
ggplot(heart_test, aes(x = prediction)) +
  geom_density(lwd=0.5)

The predicted probabilities lean mostly toward 1.
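The 0.5 cutoff is a choice, and moving it trades sensitivity against specificity. The sketch below illustrates this on invented probabilities and labels (`prob`, `truth`); the helper `rates_at` is hypothetical and the numbers are not from the model.

```r
# Invented predicted probabilities and true labels, for illustration only.
prob  <- c(0.10, 0.35, 0.45, 0.55, 0.80, 0.95)
truth <- c(0, 0, 1, 1, 1, 1)

# Sensitivity and specificity for a given probability threshold.
rates_at <- function(threshold) {
  pred <- as.integer(prob > threshold)
  c(
    sensitivity = sum(pred == 1 & truth == 1) / sum(truth == 1),
    specificity = sum(pred == 0 & truth == 0) / sum(truth == 0)
  )
}

at_050 <- rates_at(0.5)
at_030 <- rates_at(0.3)
print(rbind(at_050, at_030))
```

In this toy example, lowering the threshold from 0.5 to 0.3 catches every positive case but misclassifies one of the two negatives.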

confusionMatrix(data = heart_test$pred.label,
                reference = heart_test$HeartDisease,
                positive = "1")
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction  0  1
#>          0 68 16
#>          1 14 86
#>                                           
#>                Accuracy : 0.837           
#>                  95% CI : (0.7755, 0.8872)
#>     No Information Rate : 0.5543          
#>     P-Value [Acc > NIR] : 3.67e-16        
#>                                           
#>                   Kappa : 0.6708          
#>                                           
#>  Mcnemar's Test P-Value : 0.8551          
#>                                           
#>             Sensitivity : 0.8431          
#>             Specificity : 0.8293          
#>          Pos Pred Value : 0.8600          
#>          Neg Pred Value : 0.8095          
#>              Prevalence : 0.5543          
#>          Detection Rate : 0.4674          
#>    Detection Prevalence : 0.5435          
#>       Balanced Accuracy : 0.8362          
#>                                           
#>        'Positive' Class : 1               
#> 

4.5 Conclusion

Based on the confusionMatrix() output, model_heart_fit performs well in predicting HeartDisease, with a sensitivity (recall) of 84%. Sensitivity is used as the performance metric for this model because we want as many people as possible to get an early warning about HeartDisease; in many real cases, patients are diagnosed with HeartDisease only when they are already in a severe condition.
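The reported metrics can be reproduced from the four cells of the confusion matrix in section 4.4. In the sketch below the counts are copied from that output, with 1 as the positive class; the variable names are introduced here for illustration.

```r
# Cell counts copied from the confusion matrix (positive class = 1).
tp <- 86  # predicted 1, actually 1
fn <- 16  # predicted 0, actually 1
fp <- 14  # predicted 1, actually 0
tn <- 68  # predicted 0, actually 0

sensitivity <- tp / (tp + fn)                   # share of true cases detected
specificity <- tn / (tn + fp)                   # share of healthy cases cleared
precision   <- tp / (tp + fp)                   # Pos Pred Value in caret output
accuracy    <- (tp + tn) / (tp + fn + fp + tn)

print(round(c(sensitivity = sensitivity, specificity = specificity,
              precision = precision, accuracy = accuracy), 4))
```

The 16 false negatives are the cases that matter most for an early-warning use, since these are people with HeartDisease whom the model would not flag.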