1 Introduction

Cardiovascular diseases (CVDs) are a leading cause of death worldwide, taking approximately 17.9 million lives each year, or 31% of all global deaths. Heart attacks and strokes account for about 80% of CVD-related deaths, and roughly a third of these deaths occur prematurely in people under 70. Heart failure is common in people with CVDs, and early detection is essential for proper treatment. This research explores how well machine learning models can predict heart failure.

We use a dataset compiled from the heart disease datasets in the UCI Machine Learning Repository, with 918 observations and 12 columns: 11 clinical features and one target column, HeartDisease. The features act as predictors of the target, which records the presence or absence of heart disease. Two machine learning algorithms, Decision Tree and Naive Bayes, are used to build two prediction models, and metrics such as accuracy, precision, and recall are used to evaluate them. The goal is to determine which method predicts heart failure more effectively on this dataset.

By examining the potential of machine learning in this field, we hope to support earlier identification of heart problems and more timely intervention. The results may also motivate the investigation of other machine learning algorithms or more advanced deep learning models to further improve prediction accuracy.

2 Data Preparation

Before moving forward, we first load the libraries required for this analysis.

library(dplyr)
library(ggplot2)

# For machine learning
library(partykit)
library(randomForest)
library(caret)
library(e1071)
library(ROCR)

Heart Disease Dataset Attribute Information

Column Name	Description	Values / Unit
Age	Age of the patient	Years (numeric)
Sex	Sex of the patient	M: Male, F: Female
ChestPainType	Chest pain type	TA: Typical Angina, ATA: Atypical Angina, NAP: Non-Anginal Pain, ASY: Asymptomatic
RestingBP	Resting blood pressure	mm Hg (numeric)
Cholesterol	Serum cholesterol	mg/dl (numeric)
FastingBS	Fasting blood sugar	1: if FastingBS > 120 mg/dl, 0: otherwise
RestingECG	Resting electrocardiogram results	Normal: Normal, ST: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV), LVH: showing probable or definite left ventricular hypertrophy by Estes’ criteria
MaxHR	Maximum heart rate achieved	Numeric value between 60 and 202
ExerciseAngina	Exercise-induced angina	Y: Yes, N: No
Oldpeak	ST depression induced by exercise relative to rest	Numeric value
ST_Slope	Slope of the peak exercise ST segment	Up: upsloping, Flat: flat, Down: downsloping
HeartDisease	Presence of heart disease	1: heart disease, 0: Normal

Source: This dataset was created by combining five heart disease datasets that had previously been available only separately. They were merged on 11 shared features, making this the largest heart disease dataset available for research to date.

Final dataset: 918 data points

All datasets utilized are accessible in the Index of heart disease datasets on UCI Machine Learning Repository at https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/

2.1 Import & Read Data

The first step is to import the dataset with the read.csv() function.

heart <- read.csv("data_input/heart.csv")
heart

2.2 Inspect Data

Next, we inspect the imported dataset by viewing its first and last rows with the head() and tail() functions.

head(heart)

tail(heart)

2.3 Structure Data

We first check the data type of each column with the glimpse() function.

heart %>% 
  glimpse()

## Rows: 918
## Columns: 12
## $ Age            <int> 40, 49, 37, 48, 54, 39, 45, 54, 37, 48, 37, 58, 39, 49,…
## $ Sex            <chr> "M", "F", "M", "F", "M", "M", "F", "M", "M", "F", "F", …
## $ ChestPainType  <chr> "ATA", "NAP", "ATA", "ASY", "NAP", "NAP", "ATA", "ATA",…
## $ RestingBP      <int> 140, 160, 130, 138, 150, 120, 130, 110, 140, 120, 130, …
## $ Cholesterol    <int> 289, 180, 283, 214, 195, 339, 237, 208, 207, 284, 211, …
## $ FastingBS      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ RestingECG     <chr> "Normal", "Normal", "ST", "Normal", "Normal", "Normal",…
## $ MaxHR          <int> 172, 156, 98, 108, 122, 170, 170, 142, 130, 120, 142, 9…
## $ ExerciseAngina <chr> "N", "N", "N", "Y", "N", "N", "N", "N", "Y", "N", "N", …
## $ Oldpeak        <dbl> 0.0, 1.0, 0.0, 1.5, 0.0, 0.0, 0.0, 0.0, 1.5, 0.0, 0.0, …
## $ ST_Slope       <chr> "Up", "Flat", "Up", "Flat", "Up", "Up", "Up", "Up", "Fl…
## $ HeartDisease   <int> 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1…

Here we check the distribution of values in the heart$ChestPainType column.

table(heart$ChestPainType)

## 
## ASY ATA NAP  TA 
## 496 173 203  46

where: TA: Typical Angina, ATA: Atypical Angina, NAP: Non-Anginal Pain, ASY: Asymptomatic

2.4 Data Cleansing

Before proceeding, we convert the categorical columns (Sex, ChestPainType, RestingECG, ExerciseAngina, and ST_Slope) and the target column HeartDisease from character or integer type to factor type, so that the modeling functions treat them as categories rather than as text or numbers.

heart_clean <- 
  heart %>% 
  mutate(Sex = as.factor(Sex),
         ChestPainType = as.factor(ChestPainType),
         RestingECG = as.factor(RestingECG),
         ExerciseAngina = as.factor(ExerciseAngina),
         ST_Slope = as.factor(ST_Slope),
         HeartDisease = as.factor(HeartDisease))

head(heart_clean)

Next, we check the columns again to confirm that each data type is now correct.

heart_clean %>% 
  glimpse()

## Rows: 918
## Columns: 12
## $ Age            <int> 40, 49, 37, 48, 54, 39, 45, 54, 37, 48, 37, 58, 39, 49,…
## $ Sex            <fct> M, F, M, F, M, M, F, M, M, F, F, M, M, M, F, F, M, F, M…
## $ ChestPainType  <fct> ATA, NAP, ATA, ASY, NAP, NAP, ATA, ATA, ASY, ATA, NAP, …
## $ RestingBP      <int> 140, 160, 130, 138, 150, 120, 130, 110, 140, 120, 130, …
## $ Cholesterol    <int> 289, 180, 283, 214, 195, 339, 237, 208, 207, 284, 211, …
## $ FastingBS      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ RestingECG     <fct> Normal, Normal, ST, Normal, Normal, Normal, Normal, Nor…
## $ MaxHR          <int> 172, 156, 98, 108, 122, 170, 170, 142, 130, 120, 142, 9…
## $ ExerciseAngina <fct> N, N, N, Y, N, N, N, N, Y, N, N, Y, N, Y, N, N, N, N, N…
## $ Oldpeak        <dbl> 0.0, 1.0, 0.0, 1.5, 0.0, 0.0, 0.0, 0.0, 1.5, 0.0, 0.0, …
## $ ST_Slope       <fct> Up, Flat, Up, Flat, Up, Up, Up, Up, Flat, Up, Up, Flat,…
## $ HeartDisease   <fct> 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1…

Each column now has the correct data type, so we can move on to further processing.

2.5 Check Missing Value

Once those steps are completed, it is also important to check the dataset for missing values.

heart_clean %>% 
  is.na() %>% 
  colSums()

##            Age            Sex  ChestPainType      RestingBP    Cholesterol 
##              0              0              0              0              0 
##      FastingBS     RestingECG          MaxHR ExerciseAngina        Oldpeak 
##              0              0              0              0              0 
##       ST_Slope   HeartDisease 
##              0              0

Since there are no missing values in this dataset, it is ready to move on to the next stages.
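One caveat is that is.na() only detects values stored as NA. In clinical data a missing measurement is sometimes recorded as 0, and the Cholesterol <= 0 split in the decision tree output later in this report hints that such zeros exist here. The sketch below counts zeros in columns where 0 is not a plausible measurement. The toy data frame and the choice of columns are assumptions for illustration, not values taken from heart.csv.

```r
# Toy data standing in for heart_clean (hypothetical values)
toy <- data.frame(
  RestingBP   = c(140, 0, 130, 120),
  Cholesterol = c(289, 0, 0, 214),
  MaxHR       = c(172, 156, 98, 108)
)

# Columns where a value of 0 is physiologically implausible (assumption)
check_cols <- c("RestingBP", "Cholesterol")

# Count zeros per column
zero_counts <- colSums(toy[check_cols] == 0)
zero_counts
```

Running the same count on heart_clean instead of toy would show whether zeros should be treated as missing before modeling.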

3 Exploratory Data Analysis (EDA)

Even though the previous section gave a brief summary of the data, we can further examine it by utilizing Exploratory Data Analysis (EDA). EDA enables us to discover the traits of the data, spot possible patterns, and unveil connections among variables. This information is essential for constructing efficient machine learning models.

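For example, a boxplot can compare the distribution of FastingBS between patients with and without heart disease. Note that FastingBS is a binary indicator (1 if fasting blood sugar exceeds 120 mg/dl, 0 otherwise), not a continuous glucose measurement, so the plot can only show how the 0/1 values are distributed in each class. The comparison is made per class because the target variable, HeartDisease, is binary (1: presence, 0: absence).

# boxplot of the FastingBS indicator in each class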

library(ggplot2)

ggplot(data = heart_clean,
       mapping = aes(x = HeartDisease, y = FastingBS,
                     fill = HeartDisease)) +
  geom_boxplot()

Insights:

There might be a trend of elevated FastingBS being associated with heart disease. This is consistent with medical knowledge that high blood sugar is a risk factor for heart disease.
Because FastingBS takes only the values 0 and 1, the wider spread in the heart disease group reflects a larger share of patients with elevated fasting blood sugar, not a wider range of glucose levels.
The boxplot suggests a potential link between elevated FastingBS and heart disease, but more analysis is needed to confirm the strength of this association and to identify other significant factors that may influence heart disease.
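For a 0/1 indicator such as FastingBS, a table of proportions per class is often easier to read than a boxplot. The sketch below shows the idea on a small made-up data frame; the values are hypothetical and only the column names follow the dataset.

```r
# Toy data standing in for heart_clean (hypothetical values)
toy <- data.frame(
  HeartDisease = factor(c(0, 0, 0, 0, 1, 1, 1, 1)),
  FastingBS    = c(0, 0, 0, 1, 0, 1, 1, 0)
)

# Row-wise proportions: share of FastingBS = 0 / 1 within each class
fbs_by_class <- prop.table(table(toy$HeartDisease, toy$FastingBS), margin = 1)
fbs_by_class
```

Each row sums to 1, so the second column can be read directly as the share of patients with elevated fasting blood sugar in that class.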

Next, we check the class proportions with prop.table() to identify any imbalance.

prop.table(table(heart_clean$HeartDisease))

## 
##         0         1 
## 0.4466231 0.5533769

The proportion of class 1 (heart disease present) is about 55%, while the proportion of class 0 (no heart disease) is about 45%. In classification tasks, a dataset is considered balanced when the class proportions are close to 50:50, so this dataset is reasonably balanced.

3.1 Feature Selection Approach

Selecting the most appropriate predictors is essential in building machine learning models to guarantee precise predictions. Feature selection is the process that helps pinpoint the most informative variables that play a significant role in the model’s performance.

In this research, we use a statistical method for selecting features. It measures how much each variable varies and flags variables that carry too little information to be useful for the machine learning model.

We apply the nearZeroVar() function from the caret package to the heart_clean dataset to identify columns with very little variation in their values. Such columns are unlikely to provide useful information to the model.

n0_var <- nearZeroVar(heart_clean)
n0_var

## integer(0)

Insight: In the context of the nearZeroVar() function, an empty result (integer(0)) implies that the function didn’t flag any variables for removal due to low variance. This suggests that all the predictors in the dataset have sufficient variation and could contribute to the model’s performance.
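To make the idea of "very little variation" concrete, the base R sketch below computes the two statistics this kind of screening relies on: the ratio between the frequencies of the most common and second most common value, and the percentage of distinct values. It assumes the default cutoffs of nearZeroVar() (freqCut = 95/5 and uniqueCut = 10), under which a column is flagged when the ratio is above 19 and the percentage of distinct values is below 10. The helper nzv_stats() and the two example vectors are hypothetical.

```r
# Two statistics behind near-zero-variance screening (base R sketch)
nzv_stats <- function(x) {
  counts <- sort(table(x), decreasing = TRUE)
  # Ratio of the most common value's frequency to the second most common
  freq_ratio <- if (length(counts) > 1) counts[[1]] / counts[[2]] else Inf
  # Percentage of distinct values relative to the number of observations
  pct_unique <- 100 * length(counts) / length(x)
  c(freq_ratio = freq_ratio, pct_unique = pct_unique)
}

# Hypothetical columns: one almost constant, one evenly split
almost_constant <- c(rep(0, 99), 1)
evenly_split    <- rep(c(0, 1), each = 50)

nzv_stats(almost_constant)
nzv_stats(evenly_split)
```

The almost constant column has a frequency ratio of 99 and would be flagged, while the evenly split column has a ratio of 1 and would be kept even though it has only two distinct values.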

4 Decision Tree

Having explored the data and checked for possible problems such as class imbalance, we are ready to create a machine learning model to predict heart disease. The first algorithm we examine is the Decision Tree.

Decision trees are supervised learning models that resemble a flowchart of decisions. They work by repeatedly splitting the data on the characteristics (predictors) that most effectively separate the classes (here, the presence or absence of heart disease). Every split creates a new branch in the tree, and the groups of observations become more homogeneous the further down the tree you go.

4.1 Cross Validation

Next, we divide the dataset into training (heart_train) and testing (heart_test) sets in an 80%:20% ratio. This is a single hold-out split rather than k-fold cross-validation.

RNGkind(sample.kind = "Rounding")

## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used

set.seed(100)

index_heart <- sample(nrow(heart_clean), nrow(heart_clean)*0.80)

heart_train <- heart_clean[index_heart,] # for training
heart_test <- heart_clean[-index_heart,] # for testing
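After a hold-out split it is worth confirming that the two sets have the expected sizes, do not overlap, and have similar class proportions. The sketch below does this on made-up data; the toy data frame is hypothetical, and floor() is used only to make the rounding of the training size explicit.

```r
set.seed(100)

# Toy data standing in for heart_clean (hypothetical values)
toy <- data.frame(
  x = rnorm(100),
  HeartDisease = factor(rep(c(0, 1), times = c(45, 55)))
)

# 80% of the rows, with the size rounded down explicitly
n_train   <- floor(nrow(toy) * 0.80)
idx       <- sample(nrow(toy), n_train)
toy_train <- toy[idx, ]
toy_test  <- toy[-idx, ]

# Sizes of the two sets
nrow(toy_train)
nrow(toy_test)

# Row names shared by both sets (should be empty)
overlap <- intersect(rownames(toy_train), rownames(toy_test))
length(overlap)

# Class proportions in each set
prop.table(table(toy_train$HeartDisease))
prop.table(table(toy_test$HeartDisease))
```

If the class proportions differ noticeably between the two sets, a stratified split is a common alternative.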

4.2 Model Fitting

The process of training a machine learning model on a dataset is referred to as model fitting. In the context of building a decision tree model for heart disease prediction, the ctree() function from the partykit library in R can be employed.

🧪 Function: ctree(formula, data)

formula: y ~ x
- y: Dependent variable or target variable.
- x: Independent variables or predictors.
data: A data frame containing both the dependent and independent variables.

heart_tree <- ctree(formula = HeartDisease ~ .,
                   data = heart_train)
  
heart_tree

## 
## Model formula:
## HeartDisease ~ Age + Sex + ChestPainType + RestingBP + Cholesterol + 
##     FastingBS + RestingECG + MaxHR + ExerciseAngina + Oldpeak + 
##     ST_Slope
## 
## Fitted party:
## [1] root
## |   [2] ST_Slope in Down, Flat
## |   |   [3] Sex in F
## |   |   |   [4] FastingBS <= 0: 0 (n = 51, err = 45.1%)
## |   |   |   [5] FastingBS > 0: 1 (n = 8, err = 0.0%)
## |   |   [6] Sex in M
## |   |   |   [7] MaxHR <= 150
## |   |   |   |   [8] ChestPainType in ASY: 1 (n = 235, err = 5.5%)
## |   |   |   |   [9] ChestPainType in ATA, NAP, TA: 1 (n = 75, err = 17.3%)
## |   |   |   [10] MaxHR > 150: 1 (n = 47, err = 40.4%)
## |   [11] ST_Slope in Up
## |   |   [12] ChestPainType in ASY
## |   |   |   [13] FastingBS <= 0
## |   |   |   |   [14] ExerciseAngina in N: 0 (n = 61, err = 26.2%)
## |   |   |   |   [15] ExerciseAngina in Y: 1 (n = 25, err = 40.0%)
## |   |   |   [16] FastingBS > 0
## |   |   |   |   [17] Cholesterol <= 0: 1 (n = 17, err = 0.0%)
## |   |   |   |   [18] Cholesterol > 0: 0 (n = 7, err = 42.9%)
## |   |   [19] ChestPainType in ATA, NAP, TA
## |   |   |   [20] Oldpeak <= 1.9: 0 (n = 200, err = 5.5%)
## |   |   |   [21] Oldpeak > 1.9: 1 (n = 8, err = 37.5%)
## 
## Number of inner nodes:    10
## Number of terminal nodes: 11

Visualization

# visualize the decision tree
plot(heart_tree, type = "simple")

Insights:

The decision tree splits first on ST_Slope and then on Sex, ChestPainType, FastingBS, MaxHR, ExerciseAngina, Cholesterol, and Oldpeak, which highlights the importance of these predictors for heart disease.
The model seems to capture some interactions between features, for example, how FastingBS interacts with ExerciseAngina.
The presence of terminal nodes with varying error rates suggests the model might perform better for certain patient profiles compared to others.

4.3 Model Prediction and Evaluation

Having established the decision tree model, the next step is to utilize it for generating predictions on unseen data. The predict() function in R serves this purpose effectively.

🧪 Function: predict(object, newdata, type)

object: represents the model we want to use for making predictions.
newdata: specifies the new data on which we want to generate predictions.
type: controls the format of the predictions returned by the predict() function.
- type = "prob": outputs the probability of each class for each row in the dataset.
- type = "response": directly assigns a class label to each row in the dataset.

pred_test <- predict(
  object = heart_tree,
  newdata = heart_test,
  type = "response"
)

Once predictions have been generated for the unseen data, the next step is to evaluate the performance of the decision tree model. This step allows us to assess how accurately the model generalizes to new data and to identify its strengths and weaknesses. We evaluate the model using confusionMatrix().

confusionMatrix(
  data = pred_test,
  reference = heart_test$HeartDisease,
  positive = "1"
)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 75 11
##          1 11 87
##                                           
##                Accuracy : 0.8804          
##                  95% CI : (0.8246, 0.9235)
##     No Information Rate : 0.5326          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.7598          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.8878          
##             Specificity : 0.8721          
##          Pos Pred Value : 0.8878          
##          Neg Pred Value : 0.8721          
##              Prevalence : 0.5326          
##          Detection Rate : 0.4728          
##    Detection Prevalence : 0.5326          
##       Balanced Accuracy : 0.8799          
##                                           
##        'Positive' Class : 1               
##

Insights:

The primary objective in this context is to accurately identify patients with the condition, enabling healthcare professionals to implement preventive measures and improve patient outcomes.

Accuracy: 88.04% - indicating a good ability to correctly classify cases in the testing data.
Sensitivity (Recall): 88.78% - reflects the model’s ability to correctly identify individuals with heart disease (true positive rate).
Positive Predictive Value (Precision): 88.78% - indicates that a high proportion of patients predicted to have heart disease by the model actually have the condition.
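These metrics follow directly from the four cells of the confusion matrix. As a check, the sketch below recomputes them by hand from the counts reported above; the variable names are illustrative.

```r
# Counts copied from the confusion matrix above (positive class = "1")
tp <- 87  # predicted 1, actual 1
fp <- 11  # predicted 1, actual 0
fn <- 11  # predicted 0, actual 1
tn <- 75  # predicted 0, actual 0

accuracy    <- (tp + tn) / (tp + fp + fn + tn)
sensitivity <- tp / (tp + fn)  # recall
specificity <- tn / (tn + fp)
precision   <- tp / (tp + fp)  # positive predictive value

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, precision = precision), 4)
```

Sensitivity and precision coincide here only because the numbers of false positives and false negatives happen to be equal (11 each).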

4.4 Considerations on Decision Tree: Pruning and Tree-size

Decision trees are powerful machine learning algorithms for classification and regression tasks. However, a key challenge associated with them is overfitting. This occurs when the tree becomes overly complex and captures noise or irrelevant details in the training data. The result is inflated performance on the training data but poor performance on unseen data, such as the testing set.

To address this issue, we can strategically influence the decision tree construction process, promoting the development of a less complex and more focused tree.

Arguments:

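mincriterion = 0.95: the value of 1 - p-value that a candidate split must exceed before it is made. A value of 0.95 corresponds to a 5% significance level and matches the ctree_control() default; raising it enforces stricter splitting, focusing the tree on the most informative features and preventing noise-based splits.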

minsplit = 50: This ensures a minimum number of observations (50 in this case) are present at a node before it can be further split. This prevents the tree from splitting based on small data subsets that might not be representative of the broader population.

minbucket = 50: Setting a minimum of 50 observations per leaf prevents the creation of overly specific branches with limited data. This encourages the model to learn more generalizable patterns, reducing overfitting and improving unseen data predictions.

Consider the following model after we add ctree_control() arguments.

heart_tree_complex <- ctree(formula = HeartDisease ~ ., 
                            data = heart_train,
                            control = ctree_control(mincriterion = 0.95, 
                                                    minsplit = 50,
                                                    minbucket = 50))

4.4.1 Results

Before

# original decision tree
plot(heart_tree, type = "simple")

After

# modified decision tree
plot(heart_tree_complex, type='simple')

4.5 Data Training Model Evaluation

Following the acquisition of the model, we proceed to evaluate its performance using the training data.

# class prediction using training data
pred_heart_train <- predict(heart_tree_complex, 
                           heart_train, 
                           type = "response")

# confusion matrix data train
confusionMatrix(pred_heart_train,
                heart_train$HeartDisease, 
                positive = "1")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 235  32
##          1  89 378
##                                           
##                Accuracy : 0.8351          
##                  95% CI : (0.8063, 0.8613)
##     No Information Rate : 0.5586          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.6594          
##                                           
##  Mcnemar's Test P-Value : 3.564e-07       
##                                           
##             Sensitivity : 0.9220          
##             Specificity : 0.7253          
##          Pos Pred Value : 0.8094          
##          Neg Pred Value : 0.8801          
##              Prevalence : 0.5586          
##          Detection Rate : 0.5150          
##    Detection Prevalence : 0.6362          
##       Balanced Accuracy : 0.8236          
##                                           
##        'Positive' Class : 1               
##

Insights:

Accuracy: 83.51% - indicating a good ability to correctly classify cases in the training data.
Sensitivity (Recall): 92.20% - reflects the model’s ability to correctly identify individuals with heart disease (true positive rate).
Positive Predictive Value (Precision): 80.94% - indicates that a high proportion of patients predicted to have heart disease by the model actually have the condition.

4.6 Model Evaluation

After evaluating the model’s performance on the training data, we proceed to evaluate its performance on the testing data.

# class prediction using testing data
pred_heart_test <- predict(heart_tree_complex, 
                     heart_test, 
                     type = "response")

# confusion matrix testing
confusionMatrix(pred_heart_test, 
                heart_test$HeartDisease, 
                positive = "1")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 62  5
##          1 24 93
##                                           
##                Accuracy : 0.8424          
##                  95% CI : (0.7816, 0.8918)
##     No Information Rate : 0.5326          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.6791          
##                                           
##  Mcnemar's Test P-Value : 0.0008302       
##                                           
##             Sensitivity : 0.9490          
##             Specificity : 0.7209          
##          Pos Pred Value : 0.7949          
##          Neg Pred Value : 0.9254          
##              Prevalence : 0.5326          
##          Detection Rate : 0.5054          
##    Detection Prevalence : 0.6359          
##       Balanced Accuracy : 0.8350          
##                                           
##        'Positive' Class : 1               
##

Insights:

Accuracy: 84.24% - indicating a good ability to correctly classify cases in the testing data.
Sensitivity (Recall): 94.90% - reflects the model’s ability to correctly identify individuals with heart disease (true positive rate).
Positive Predictive Value (Precision): 79.49% - indicates that a high proportion of patients predicted to have heart disease by the model actually have the condition.

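Conclusion: Based on the results above, the constrained model shows promising performance. Here’s why:

High Sensitivity (Recall): Both the training data (92.20%) and the testing data (94.90%) show high sensitivity, indicating that the model is good at identifying true positives.
Consistent Accuracy: Accuracy is similar on the training data (83.51%) and the testing data (84.24%), so there is no sign of overfitting.
Trade-off: Compared with the original tree on the testing data, sensitivity rises from 88.78% to 94.90%, while accuracy falls from 88.04% to 84.24% and specificity falls from 87.21% to 72.09%, so the constrained tree produces more false positives.

However, other factors also need to be considered before we can definitively claim that the model performs well.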
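The comparison behind this conclusion can be reproduced from the confusion-matrix counts reported above. The sketch below puts the three evaluations side by side; the helper cm_metrics() and the row labels are illustrative names, not part of caret.

```r
# Helper: metrics from confusion-matrix counts (positive class = "1")
cm_metrics <- function(tp, fp, fn, tn) {
  c(accuracy    = (tp + tn) / (tp + fp + fn + tn),
    sensitivity = tp / (tp + fn),
    specificity = tn / (tn + fp),
    precision   = tp / (tp + fp))
}

# Counts copied from the three confusion matrices reported above
comparison <- rbind(
  original_test     = cm_metrics(tp = 87,  fp = 11, fn = 11, tn = 75),
  constrained_train = cm_metrics(tp = 378, fp = 89, fn = 32, tn = 235),
  constrained_test  = cm_metrics(tp = 93,  fp = 24, fn = 5,  tn = 62)
)
round(comparison, 4)
```

Reading across the rows makes the trade-off visible: the constrained tree gains sensitivity at the cost of specificity and precision.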

5 Naive Bayes

Naive Bayes is a probabilistic classifier that assumes the features are independent of one another given the class. The probability that a data point belongs to a specific class is computed by multiplying the prior probability of the class by the probabilities of each feature value occurring in that class. The method is simple and efficient even on large datasets, but the independence assumption rarely holds exactly in real-world problems, which can limit its accuracy.
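Written as a formula, the rule looks as follows. The notation is chosen here for illustration: y is the class, x_1, ..., x_n are the feature values of one observation, and the predicted class is the one with the largest score.

```latex
P(y \mid x_1, \dots, x_n) \;\propto\; P(y) \prod_{i=1}^{n} P(x_i \mid y),
\qquad
\hat{y} = \arg\max_{y} \; P(y) \prod_{i=1}^{n} P(x_i \mid y)
```

For categorical features, P(x_i | y) is estimated from the share of each level within the class. For numeric features, naiveBayes() from e1071 uses a Gaussian density with the class-specific mean and standard deviation.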

5.1 Cross Validation Naive Bayes

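Next, we divide the dataset into training (heart_nb_train) and testing (heart_nb_test) sets in an 80%:20% ratio. Because the same seed and the same sample() call are used as for the decision tree, the split is identical, so both models are evaluated on the same testing data.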

RNGkind(sample.kind = "Rounding")

## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used

set.seed(100)

# train-test splitting: 80%:20%
split_heart_nb <- sample(nrow(heart_clean), nrow(heart_clean)*0.80)
heart_nb_train <- heart_clean[split_heart_nb, ]
heart_nb_test <- heart_clean[-split_heart_nb, ]

5.2 Modeling with `naiveBayes()`

Now that the training and testing data are ready, we can construct the model using the naiveBayes() function.

🧪 Function: naiveBayes(formula, data)

formula = y ~ x
- y: represents the name of the variable that we want to predict.
- x: represents the names of the variables that we use to predict the target variable.
data: specifies the data frame that contains both the target variable and the predictor variables.

# construct a model using all predictors
nb_heart_all <- naiveBayes(
  formula = HeartDisease ~ .,
  data = heart_nb_train,
  laplace = 1)

nb_heart_all

## 
## Naive Bayes Classifier for Discrete Predictors
## 
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
## 
## A-priori probabilities:
## Y
##         0         1 
## 0.4414169 0.5585831 
## 
## Conditional probabilities:
##    Age
## Y       [,1]     [,2]
##   0 50.64198 9.757187
##   1 55.99756 8.659971
## 
##    Sex
## Y            F          M
##   0 0.33435583 0.66564417
##   1 0.09223301 0.90776699
## 
##    ChestPainType
## Y          ASY        ATA        NAP         TA
##   0 0.27134146 0.34451220 0.30792683 0.07621951
##   1 0.76086957 0.04589372 0.14734300 0.04589372
## 
##    RestingBP
## Y       [,1]     [,2]
##   0 130.6914 17.14653
##   1 134.1366 20.24703
## 
##    Cholesterol
## Y       [,1]      [,2]
##   0 225.6389  73.16369
##   1 172.2000 125.73428
## 
##    FastingBS
## Y         [,1]      [,2]
##   0 0.09567901 0.2946055
##   1 0.34634146 0.4763849
## 
##    RestingECG
## Y         LVH    Normal        ST
##   0 0.1987768 0.6483180 0.1529052
##   1 0.2106538 0.5593220 0.2300242
## 
##    MaxHR
## Y       [,1]     [,2]
##   0 147.6358 23.12819
##   1 127.5146 23.55758
## 
##    ExerciseAngina
## Y           N         Y
##   0 0.8558282 0.1441718
##   1 0.3883495 0.6116505
## 
##    Oldpeak
## Y        [,1]      [,2]
##   0 0.4021605 0.7077818
##   1 1.3292683 1.1790643
## 
##    ST_Slope
## Y         Down       Flat         Up
##   0 0.03975535 0.18960245 0.77064220
##   1 0.09927361 0.73607748 0.16464891

Insights:

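The prior probability of heart disease (class 1) in the training data is 55.9%. For numeric predictors, the first column of each table is the class mean and the second is the standard deviation; for categorical predictors, the table shows the conditional probability of each level within each class. Patients with heart disease are on average older (56.0 vs. 50.6 years), more often male (90.8% vs. 66.6%), and have a slightly higher resting blood pressure (134.1 vs. 130.7 mm Hg) and a lower maximum heart rate (127.5 vs. 147.6). The asymptomatic chest pain type (ASY) is far more common in the heart disease class (76.1% vs. 27.1%), as are exercise-induced angina (61.2% vs. 14.4%) and a flat ST slope (73.6% vs. 19.0%). Elevated fasting blood sugar is also more frequent among patients with heart disease (34.6% vs. 9.6%). Resting ECG results differ only modestly between the two classes.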
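To see how these tables turn into a prediction, the sketch below applies the Naive Bayes rule by hand to a hypothetical patient who is male, has exercise-induced angina, and has a flat ST slope. It uses only three of the eleven predictors, so the result illustrates the mechanics and will not match what the full model returns for any real row.

```r
# Priors and conditional probabilities copied from the model output above
prior        <- c("0" = 0.4414169,  "1" = 0.5585831)
p_sex_m      <- c("0" = 0.66564417, "1" = 0.90776699)  # P(Sex = M | class)
p_angina_y   <- c("0" = 0.1441718,  "1" = 0.6116505)   # P(ExerciseAngina = Y | class)
p_slope_flat <- c("0" = 0.18960245, "1" = 0.73607748)  # P(ST_Slope = Flat | class)

# Unnormalized score per class: prior times the product of the likelihoods
score <- prior * p_sex_m * p_angina_y * p_slope_flat

# Normalize so that the two class probabilities sum to 1
posterior <- score / sum(score)
round(posterior, 3)
```

For this profile the posterior strongly favors class 1, because all three feature values are much more common among patients with heart disease.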

5.3 Prediction

After the model has been trained on the training data, we use it to make predictions on the testing data.

# construct predictions using testing data
heart_nb_pred <- predict(nb_heart_all,
                      heart_nb_test,
                      type = "class")

table(heart_nb_pred)

## heart_nb_pred
##  0  1 
## 90 94

Insights: The resulting table shows the distribution of predicted class labels for the unseen data in heart_nb_test.

90: The model predicted “no heart disease” (class 0) for 90 data points.
94: The model predicted “heart disease” (class 1) for 94 data points.

5.4 Model Evaluation

After making predictions on the test data, the performance of the model can be thoroughly evaluated by creating a confusion matrix. This matrix illustrates the model’s classification strengths and weaknesses by comparing predicted class labels to actual class labels in the test data.

5.4.1 Confusion Matrix

# model evaluation with confusion matrix
confusionMatrix(data = heart_nb_pred,
                reference = heart_nb_test$HeartDisease,
                positive = "1",
                mode = "everything")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 76 14
##          1 10 84
##                                           
##                Accuracy : 0.8696          
##                  95% CI : (0.8122, 0.9146)
##     No Information Rate : 0.5326          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.7388          
##                                           
##  Mcnemar's Test P-Value : 0.5403          
##                                           
##             Sensitivity : 0.8571          
##             Specificity : 0.8837          
##          Pos Pred Value : 0.8936          
##          Neg Pred Value : 0.8444          
##               Precision : 0.8936          
##                  Recall : 0.8571          
##                      F1 : 0.8750          
##              Prevalence : 0.5326          
##          Detection Rate : 0.4565          
##    Detection Prevalence : 0.5109          
##       Balanced Accuracy : 0.8704          
##                                           
##        'Positive' Class : 1               
##

Insights:

Accuracy: 86.96% - indicating a good ability to correctly classify cases in the testing data.
Sensitivity (Recall): 85.71% - reflects the model’s ability to correctly identify individuals with heart disease (true positive rate).
Positive Predictive Value (Precision): 89.36% - indicates that a high proportion of patients predicted to have heart disease by the model actually have the condition.

5.4.2 ROC-AUC

Evaluating a model’s performance shouldn’t solely rely on accuracy, especially when dealing with unequal class sizes. Choosing a threshold is crucial because it impacts the effectiveness of precision and recall as performance metrics.

ROC curves show the model’s performance at various thresholds. They plot the true positive rate (TPR), the share of actual positives that are correctly identified, against the false positive rate (FPR), the share of actual negatives that are incorrectly classified as positive.

ROC curves help determine the threshold that best balances the model’s ability to detect true positives against its rate of false positives. They are a valuable tool for assessing binary classification models, particularly when class sizes differ.

ROC Curve Operation:

1. Vary the classification threshold: The classification threshold is systematically adjusted to different values.
2. Calculate the True Positive Rate (TPR) and False Positive Rate (FPR): For each threshold, the corresponding TPR and FPR are computed.
3. Iterate: Steps 1 and 2 are repeated for a range of thresholds.
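The steps above can be sketched in a few lines of base R. The probabilities and labels below are made up for illustration and are not output from the Naive Bayes model; a coarse grid of five thresholds is used to keep the result readable.

```r
# Hypothetical predicted probabilities of the positive class and true labels
prob  <- c(0.95, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10)
label <- c(1,    1,    0,    1,    1,    0,    0,    0)

# Step 1: a grid of classification thresholds
thresholds <- seq(0, 1, by = 0.25)

# Steps 2 and 3: TPR and FPR at each threshold
# (an observation is predicted positive when prob >= threshold)
roc_points <- t(sapply(thresholds, function(th) {
  pred <- as.integer(prob >= th)
  c(threshold = th,
    tpr = sum(pred == 1 & label == 1) / sum(label == 1),
    fpr = sum(pred == 1 & label == 0) / sum(label == 0))
}))
roc_points
```

Plotting the fpr column against the tpr column, for example with plot(), draws the ROC curve. A threshold of 0 labels everything positive (TPR and FPR both 1), and a threshold of 1 labels nothing positive here (both 0).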

Ideal ROC Curve:

TPR approaching 1: The model exhibits excellent performance in correctly identifying positive instances.
FPR approaching 0: The model effectively minimizes the number of false positive predictions.

The ROC curve provides a visual representation of a model’s performance across various threshold settings. However, to obtain a single, quantitative measure of overall model performance, the Area Under the Curve (AUC) is calculated.

AUC Criteria:

Range: The AUC value falls between 0 and 1.
- AUC approaching 1: The model demonstrates exceptional ability to discriminate between positive and negative classes.
- AUC approaching 0.5: The model’s predictions are essentially random.
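One way to build intuition for these values is that the AUC equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The base R sketch below computes it that way on made-up scores; the data are hypothetical and this is not how ROCR computes the value internally.

```r
# Hypothetical predicted probabilities of the positive class and true labels
prob  <- c(0.95, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10)
label <- c(1,    1,    0,    1,    1,    0,    0,    0)

pos <- prob[label == 1]
neg <- prob[label == 0]

# Compare every positive score with every negative score;
# ties count as half a win
wins <- outer(pos, neg, ">") + 0.5 * outer(pos, neg, "==")
auc  <- mean(wins)
auc
```

Here 14 of the 16 positive-negative pairs are ordered correctly, giving an AUC of 0.875. A model that ranked cases at random would average about 0.5.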

To evaluate the performance of our previously trained Naive Bayes model (nb_heart_all) on the testing data (heart_nb_test), let us construct the ROC curve and compute the corresponding AUC value.

pred_test_prob <- predict(nb_heart_all,
                         heart_nb_test,
                          type = "raw")

head(pred_test_prob)

##               0            1
## [1,] 0.05834116 9.416588e-01
## [2,] 0.98088801 1.911199e-02
## [3,] 0.99914034 8.596551e-04
## [4,] 0.99826564 1.734355e-03
## [5,] 0.99993593 6.406891e-05
## [6,] 0.96518364 3.481636e-02

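# Column 1 of pred_test_prob holds P(HeartDisease = 0). With "1" as the positive
# class, an ROC curve needs P(HeartDisease = 1) instead: pred_test_prob[, "1"]
pred_prob <- pred_test_prob[,1]
pred_prob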

##   [1] 5.834116e-02 9.808880e-01 9.991403e-01 9.982656e-01 9.999359e-01
##   [6] 9.651836e-01 7.742575e-05 9.800087e-01 3.645031e-01 9.994885e-01
##  [11] 6.285830e-03 9.997827e-01 6.694946e-01 7.524167e-02 9.997510e-01
##  [16] 4.594216e-01 9.990596e-01 2.162319e-01 4.835268e-04 9.435662e-01
##  [21] 7.930881e-03 8.271165e-01 9.995223e-01 8.498010e-01 9.174348e-01
##  [26] 9.615428e-01 5.899562e-04 9.998411e-01 9.895507e-01 1.147551e-01
##  [31] 9.970687e-01 2.718132e-01 6.274243e-02 9.792799e-01 4.086518e-02
##  [36] 5.206547e-04 9.995812e-01 4.393737e-07 9.900046e-01 9.507560e-01
##  [41] 9.945091e-01 1.580460e-02 9.988613e-01 8.237324e-01 8.835710e-01
##  [46] 4.920855e-01 2.336587e-01 9.948962e-01 9.992759e-01 9.123338e-01
##  [51] 8.686088e-04 5.437716e-02 5.445966e-02 9.990059e-01 9.960810e-01
##  [56] 1.189617e-02 9.997825e-01 3.505663e-01 9.973239e-01 9.955656e-01
##  [61] 3.365337e-04 5.879611e-04 4.340943e-05 2.297844e-03 3.919590e-04
##  [66] 1.691800e-04 3.286940e-05 1.882433e-04 1.111365e-02 5.789934e-04
##  [71] 3.780454e-05 3.052390e-04 3.686583e-04 1.523090e-04 3.165478e-02
##  [76] 3.998336e-05 8.689443e-03 1.608124e-03 1.660783e-03 5.091675e-01
##  [81] 5.299630e-02 5.345526e-05 1.513300e-04 2.150153e-03 3.399541e-06
##  [86] 1.882768e-04 1.259815e-04 2.264948e-04 2.399075e-02 4.457754e-03
##  [91] 9.511338e-01 1.314386e-04 2.525499e-01 9.981415e-01 5.276445e-06
##  [96] 2.105061e-03 5.638275e-04 5.566922e-05 3.031100e-05 6.136015e-01
## [101] 1.198238e-01 3.915369e-04 9.860899e-01 4.496259e-02 5.098174e-05
## [106] 2.328762e-05 7.656063e-01 6.229237e-03 8.081638e-02 8.755514e-02
## [111] 4.560817e-03 1.871306e-01 3.881270e-07 1.727914e-04 5.800469e-01
## [116] 7.554467e-02 2.877258e-03 1.741972e-02 9.014790e-01 9.991450e-01
## [121] 8.875839e-01 9.635095e-02 9.978664e-01 9.925036e-01 9.935405e-01
## [126] 3.130949e-01 9.985146e-01 6.131639e-01 1.713541e-01 9.990412e-01
## [131] 9.923529e-01 2.023962e-01 9.998593e-01 9.992199e-01 2.210835e-01
## [136] 8.508582e-01 9.999425e-01 3.460963e-02 9.997223e-01 9.978772e-01
## [141] 9.944604e-01 9.980184e-01 3.733353e-01 9.894948e-01 9.735134e-01
## [146] 9.999857e-01 9.969055e-01 2.302151e-03 9.990163e-01 1.610146e-04
## [151] 9.923150e-01 9.304060e-02 3.729885e-02 9.016681e-01 2.891035e-04
## [156] 9.467848e-01 9.469247e-01 9.244130e-01 7.191219e-01 5.990742e-01
## [161] 9.958924e-01 2.472157e-03 9.999768e-01 9.997772e-01 9.992593e-01
## [166] 3.284626e-05 8.481597e-01 6.442361e-01 9.993699e-01 9.922426e-01
## [171] 3.828279e-01 1.295512e-02 3.701615e-04 8.117484e-01 9.895089e-01
## [176] 9.997763e-01 9.963533e-01 7.353274e-01 9.991346e-01 1.856468e-05
## [181] 9.631131e-01 9.927885e-01 9.993285e-01 2.427410e-01

table(heart_nb_test$HeartDisease)

## 
##  0  1 
## 86 98

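# Rename the factor levels in their existing order: "0" becomes "heart disease"
# and "1" becomes "normal". If 1 codes heart disease in the data, these display
# names are swapped; the AUC is unaffected because pred_prob is P(class 0).
levels(heart_nb_test$HeartDisease) <- c("heart disease", "normal")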

head(heart_nb_test)

# Build the ROCR prediction object. label.ordering lists the negative class
# first and the positive class second, so the level renamed from "0" is the
# positive class, matching pred_prob = P(class 0).
library(ROCR)
bayes_roc <- prediction(predictions = pred_prob,
                        labels = heart_nb_test$HeartDisease,
                        label.ordering = c("normal", "heart disease"))

5.4.3 Plot ROC

# Create ROC plot
model_roc_vec <- performance(bayes_roc, 
                             "tpr", 
                             "fpr")
plot(model_roc_vec)
abline(0, 1, lty = 2)  # dashed diagonal: performance of a random classifier

# Calculate AUC
bayes_auc <- performance(bayes_roc, measure = "auc")
auc_value <- as.numeric(bayes_auc@y.values[[1]])
cat("AUC Value:", auc_value, "\n")

## AUC Value: 0.9172995

Insights:

The ROC curve plots the True Positive Rate (TPR) against the False Positive Rate (FPR). An ideal curve hugs the upper left corner of the plot, where the TPR is close to 1 and the FPR is close to 0, forming a near-perfect inverted L shape; the closer a curve comes to that corner, the better the model separates the positive and negative classes.
An AUC of 0.9172995 indicates that the Naive Bayes model distinguishes the two classes well on the test data: a randomly chosen positive case receives a higher score than a randomly chosen negative case about 92% of the time.

6 Conclusion

Both the Decision Tree and Naive Bayes models demonstrate strong performance in predicting heart disease, with accuracy scores exceeding 80%. However, a closer examination of their individual strengths reveals key differences.

Decision Tree

- Accuracy: 84.24%, indicating a good ability to correctly classify cases in the testing data.
- Sensitivity (Recall): 94.90%, reflecting the model’s ability to correctly identify individuals with heart disease (true positive rate).
- Positive Predictive Value (Precision): 79.49%, indicating that a high proportion of patients predicted to have heart disease by the model actually have the condition.

Naive Bayes

- Accuracy: 83.76%, indicating a good ability to correctly classify cases in the testing data.
- Sensitivity (Recall): 85.71%, reflecting the model’s ability to correctly identify individuals with heart disease (true positive rate).
- Positive Predictive Value (Precision): 89.36%, indicating that a high proportion of patients predicted to have heart disease by the model actually have the condition.
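For reference, the three metrics follow directly from the cells of a confusion matrix. The sketch below uses counts back-solved from the reported Decision Tree percentages and the test-set class sizes (98 positive, 86 negative); they are an assumption made for illustration, not values copied from the report's confusion matrix output.

```r
# Assumed counts, back-solved from the reported Decision Tree percentages
tp <- 93  # heart disease cases correctly flagged
fn <- 5   # heart disease cases missed
fp <- 24  # normal cases wrongly flagged
tn <- 62  # normal cases correctly cleared

accuracy    <- (tp + tn) / (tp + tn + fp + fn)  # share of all cases classified correctly
sensitivity <- tp / (tp + fn)                   # share of actual positives found (recall)
precision   <- tp / (tp + fp)                   # share of predicted positives that are real

metrics <- round(100 * c(accuracy = accuracy,
                         sensitivity = sensitivity,
                         precision = precision), 2)
print(metrics)
##    accuracy sensitivity   precision 
##       84.24       94.90       79.49
```

The same three formulas apply to the Naive Bayes model once its own confusion-matrix counts are substituted.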

Based on the results above, the Decision Tree model is better than Naive Bayes at identifying heart disease cases, although Naive Bayes is the more precise of the two (89.36% versus 79.49%). Here’s why:

Accuracy: The Decision Tree has the higher accuracy (84.24% versus 83.76%).
High Sensitivity (Recall): The Decision Tree excels at identifying true positive cases (patients with heart disease), minimizing the risk of false negatives (94.90% versus 85.71%). This is crucial in medical applications, where early detection is critical and identifying every patient with heart disease is a top priority to prevent complications.

Heart Failure Prediction with Decision Tree & Naive Bayes

Intan M Sari

2024-07-25

1 Introduction

2 Data Preparation

2.1 Import & Read Data

2.2 Inspect Data

2.3 Structure Data

2.4 Data Cleansing

2.5 Check Missing Value

3 Exploratory Data Analysis (EDA)

3.1 Feature Selection Approach

4 Decision Tree

4.1 Cross Validation

4.2 Model Fitting

4.3 Model Prediction and Evaluation

4.4 Considerations on Decision Tree: Pruning and Tree-size

4.4.1 Results

Before

After

4.5 Data Training Model Evaluation

4.6 Model Evaluation

5 Naive Bayes

5.1 Cross Validation Naive Bayes

5.2 Modeling with `naiveBayes()`

5.3 Prediction

5.4 Model Evaluation

5.4.1 Confusion Matrix

5.4.2 ROC-AUC

5.4.3 Plot ROC

6 Conclusion
