Cardiovascular diseases (CVDs) are the leading cause of death worldwide, taking approximately 17.9 million lives each year, or 31% of all global deaths. Heart attacks and strokes account for about 80% of CVD deaths, and roughly one third of these deaths occur prematurely in people under 70. Heart failure is common among people with CVDs, and early detection is essential for proper treatment. This study explores how well machine learning models can predict heart failure.
We will use a dataset from the UCI Machine Learning Repository with 918 observations and 12 columns: 11 clinical features and the target, HeartDisease. The features serve as predictors for the presence or absence of heart disease, our main focus. Two machine learning algorithms, Decision Tree and Naive Bayes, will be used to build two prediction models. Metrics such as accuracy, precision, and recall will be used to evaluate how well these models predict heart disease. The aim is to determine which of the two methods performs better on this particular dataset.
By examining the potential of machine learning in this field, we could enhance the identification of issues at an early stage and enable prompt actions. This study may lead to the investigation of other machine learning algorithms or more advanced deep learning models to possibly enhance heart failure prediction accuracy.
Before moving forward, we first load the libraries required for this analysis.
library(dplyr)
library(ggplot2)
# For machine learning
library(partykit)
## Warning: package 'partykit' was built under R version 4.4.1
## Warning: package 'libcoin' was built under R version 4.4.1
library(randomForest)
## Warning: package 'randomForest' was built under R version 4.4.1
library(caret)
## Warning: package 'caret' was built under R version 4.4.1
library(e1071)
library(ROCR)
## Warning: package 'ROCR' was built under R version 4.4.1
Heart Disease Dataset Attribute Information
| Column Name | Description | Values / Units |
|---|---|---|
| Age | Age of the patient | Years (numeric) |
| Sex | Sex of the patient | M: Male, F: Female |
| ChestPainType | Chest pain type | TA: Typical Angina, ATA: Atypical Angina, NAP: Non-Anginal Pain, ASY: Asymptomatic |
| RestingBP | Resting blood pressure | mm Hg (numeric) |
| Cholesterol | Serum cholesterol | mg/dl (numeric) |
| FastingBS | Fasting blood sugar | 1: if FastingBS > 120 mg/dl, 0: otherwise |
| RestingECG | Resting electrocardiogram results | Normal: Normal, ST: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV), LVH: showing probable or definite left ventricular hypertrophy by Estes’ criteria |
| MaxHR | Maximum heart rate achieved | Numeric value between 60 and 202 |
| ExerciseAngina | Exercise-induced angina | Y: Yes, N: No |
| Oldpeak | ST depression induced by exercise relative to rest | Numeric value |
| ST_Slope | Slope of the peak exercise ST segment | Up: upsloping, Flat: flat, Down: downsloping |
| HeartDisease | Presence of heart disease | 1: heart disease, 0: Normal |
Source: This dataset was created by combining five heart disease datasets that were previously available only separately. They were merged over 11 common features, which makes this the largest heart disease dataset available for research purposes so far.
Final dataset: 918 observations
All datasets utilized are accessible in the Index of heart disease datasets on UCI Machine Learning Repository at https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/
The first step is to import the dataset with the read.csv() function.
heart <- read.csv("data_input/heart.csv")
heart
Next, we inspect the imported dataset by looking at its first and
last rows with the head() and tail() functions.
head(heart)
tail(heart)
We then check whether each column has an appropriate data type, using
the glimpse() function.
heart %>%
glimpse()
## Rows: 918
## Columns: 12
## $ Age <int> 40, 49, 37, 48, 54, 39, 45, 54, 37, 48, 37, 58, 39, 49,…
## $ Sex <chr> "M", "F", "M", "F", "M", "M", "F", "M", "M", "F", "F", …
## $ ChestPainType <chr> "ATA", "NAP", "ATA", "ASY", "NAP", "NAP", "ATA", "ATA",…
## $ RestingBP <int> 140, 160, 130, 138, 150, 120, 130, 110, 140, 120, 130, …
## $ Cholesterol <int> 289, 180, 283, 214, 195, 339, 237, 208, 207, 284, 211, …
## $ FastingBS <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ RestingECG <chr> "Normal", "Normal", "ST", "Normal", "Normal", "Normal",…
## $ MaxHR <int> 172, 156, 98, 108, 122, 170, 170, 142, 130, 120, 142, 9…
## $ ExerciseAngina <chr> "N", "N", "N", "Y", "N", "N", "N", "N", "Y", "N", "N", …
## $ Oldpeak <dbl> 0.0, 1.0, 0.0, 1.5, 0.0, 0.0, 0.0, 0.0, 1.5, 0.0, 0.0, …
## $ ST_Slope <chr> "Up", "Flat", "Up", "Flat", "Up", "Up", "Up", "Up", "Fl…
## $ HeartDisease <int> 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1…
Here we check the distribution of values in the
heart$ChestPainType column.
table(heart$ChestPainType)
##
## ASY ATA NAP TA
## 496 173 203 46
where: TA: Typical Angina, ATA: Atypical Angina, NAP: Non-Anginal Pain, ASY: Asymptomatic
Before proceeding, the categorical columns (Sex, ChestPainType,
RestingECG, ExerciseAngina, ST_Slope) and the target HeartDisease need
to be converted to factor type, since they are currently stored as
character or integer values.
heart_clean <-
heart %>%
mutate(Sex = as.factor(Sex),
ChestPainType = as.factor(ChestPainType),
RestingECG = as.factor(RestingECG),
ExerciseAngina = as.factor(ExerciseAngina),
ST_Slope = as.factor(ST_Slope),
HeartDisease = as.factor(HeartDisease))
head(heart_clean)
Next, we reassess the remaining columns to ensure the data type is accurate.
heart_clean %>%
glimpse()
## Rows: 918
## Columns: 12
## $ Age <int> 40, 49, 37, 48, 54, 39, 45, 54, 37, 48, 37, 58, 39, 49,…
## $ Sex <fct> M, F, M, F, M, M, F, M, M, F, F, M, M, M, F, F, M, F, M…
## $ ChestPainType <fct> ATA, NAP, ATA, ASY, NAP, NAP, ATA, ATA, ASY, ATA, NAP, …
## $ RestingBP <int> 140, 160, 130, 138, 150, 120, 130, 110, 140, 120, 130, …
## $ Cholesterol <int> 289, 180, 283, 214, 195, 339, 237, 208, 207, 284, 211, …
## $ FastingBS <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ RestingECG <fct> Normal, Normal, ST, Normal, Normal, Normal, Normal, Nor…
## $ MaxHR <int> 172, 156, 98, 108, 122, 170, 170, 142, 130, 120, 142, 9…
## $ ExerciseAngina <fct> N, N, N, Y, N, N, N, N, Y, N, N, Y, N, Y, N, N, N, N, N…
## $ Oldpeak <dbl> 0.0, 1.0, 0.0, 1.5, 0.0, 0.0, 0.0, 0.0, 1.5, 0.0, 0.0, …
## $ ST_Slope <fct> Up, Flat, Up, Flat, Up, Up, Up, Up, Flat, Up, Up, Flat,…
## $ HeartDisease <fct> 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1…
Each column now has the correct data type.
Before processing the data further, we also check the dataset for missing values.
heart_clean %>%
is.na() %>%
colSums()
## Age Sex ChestPainType RestingBP Cholesterol
## 0 0 0 0 0
## FastingBS RestingECG MaxHR ExerciseAngina Oldpeak
## 0 0 0 0 0
## ST_Slope HeartDisease
## 0 0
Since there are no missing values in this dataset, it is ready to move on to the next stages.
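A check with is.na() only finds values that are explicitly coded as NA. The decision tree output later in this analysis contains a split on Cholesterol <= 0, which suggests that some records store 0 where a measurement may not have been available. The sketch below shows one way to count such zero entries; the toy data frame and its values are made up for illustration and are not taken from heart_clean.

```r
# Toy data standing in for two numeric columns (values are hypothetical)
toy <- data.frame(
  RestingBP   = c(140, 0, 130, 120),
  Cholesterol = c(289, 0, 0, 214)
)

# Explicit missing values per column
na_count <- colSums(is.na(toy))

# Zero entries per column, which may act as hidden missing values
zero_count <- colSums(toy == 0)

print(na_count)
print(zero_count)
```

Whether zeros should be treated as missing depends on how the source datasets recorded unavailable measurements, so this is a check to consider rather than a required step.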
Even though the previous section gave a brief summary of the data, we can further examine it by utilizing Exploratory Data Analysis (EDA). EDA enables us to discover the traits of the data, spot possible patterns, and unveil connections among variables. This information is essential for constructing efficient machine learning models.
For example, a boxplot can show how FastingBS is distributed among patients with and without heart disease, the two classes of our binary target (1: presence, 0: absence). Note that FastingBS is itself a 0/1 indicator (1 if fasting blood sugar exceeds 120 mg/dl) rather than a continuous glucose level, so the plot mainly shows whether elevated fasting blood sugar is more common in one class.
# boxplot of the fasting blood sugar indicator in each class
library(ggplot2)
ggplot(data = heart_clean,
mapping = aes(x = HeartDisease, y = FastingBS,
fill = HeartDisease)) +
geom_boxplot()
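Because FastingBS only takes the values 0 and 1, a table of within-class proportions may be easier to read than a boxplot. The following is a minimal sketch on made-up toy data (not the heart dataset), assuming the same column names:

```r
# Toy data: HeartDisease is the class, FastingBS a 0/1 indicator (hypothetical values)
toy <- data.frame(
  HeartDisease = factor(c(0, 0, 0, 0, 1, 1, 1, 1)),
  FastingBS    = c(0, 0, 0, 1, 0, 1, 1, 1)
)

# Row-wise proportions: share of FastingBS = 0 / 1 within each class
fbs_share <- prop.table(table(toy$HeartDisease, toy$FastingBS), margin = 1)
print(fbs_share)
```

With margin = 1 each row sums to 1, so the column for FastingBS = 1 can be compared directly between the two classes.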
Insights:
Next, we verify data proportions to identify any imbalances using
prop.table()
prop.table(table(heart_clean$HeartDisease))
##
## 0 1
## 0.4466231 0.5533769
The class proportions show that class 1 (heart disease present) makes up about 55% of the data and class 0 (no heart disease) about 45%. A classification dataset is considered balanced when the class ratio is close to 50:50, so this dataset is close enough to balanced that no resampling is applied.
Selecting appropriate predictors is an important step in building machine learning models, because uninformative variables add noise without improving predictions. Feature selection is the process of identifying the variables that contribute most to the model’s performance.
In this study, we use a statistical method for selecting features: each variable is assessed with a simple statistic, and variables that carry little information are dropped.
We will apply the nearZeroVar() function from the caret
package on the heart_clean dataset to remove columns with
very little variation in their values. It is improbable that these
columns will offer significant data for the model.
n0_var <- nearZeroVar(heart_clean)
n0_var
## integer(0)
Insight: In the context of the
nearZeroVar() function, an empty result
(integer(0)) implies that the function didn’t flag any
variables for removal due to low variance. This suggests that all the
predictors in the dataset have sufficient variation and could contribute
to the model’s performance.
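To illustrate what a flagged column looks like, here is a small sketch on made-up data. The data frame and its column names are hypothetical; only nearZeroVar() with its default settings comes from the caret package.

```r
library(caret)

set.seed(1)
# Toy data: one column with plenty of variation, one that is almost constant
toy <- data.frame(
  varied        = rnorm(100),
  constant_like = c(rep(0, 99), 1)
)

# nearZeroVar() returns the positions of columns with near-zero variance
flagged <- nearZeroVar(toy)
names(toy)[flagged]
```

The second column is flagged because one value dominates it (99 of 100 rows) and it has very few distinct values, which is the pattern the function did not find in heart_clean.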
Having explored the data and checked for possible problems such as class imbalance, we are ready to move forward with creating a machine learning model to predict heart disease. The first algorithm we examine is the Decision Tree.
Decision trees are supervised learning models that resemble a flowchart of decisions. They work by repeatedly splitting the data on the features (predictors) that best separate the classes (here, the presence or absence of heart disease). Each split creates a new branch, and the groups of observations become more homogeneous further down the tree.
Next, we will divide the dataset into train
(heart_train) and test (heart_test) datasets,
maintaining an 80%:20% ratio.
RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(100)
index_heart <- sample(nrow(heart_clean), nrow(heart_clean)*0.80)
heart_train <- heart_clean[index_heart,] # for training
heart_test <- heart_clean[-index_heart,] # for testing
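After a random split it can be useful to confirm that both parts have the expected size, do not overlap, and keep roughly the same class proportions. The sketch below does this on a made-up toy data frame; the object names are hypothetical and the data are not the heart dataset.

```r
set.seed(100)
# Toy data with a 45:55 class ratio (hypothetical values)
toy <- data.frame(
  x            = rnorm(200),
  HeartDisease = factor(rep(c(0, 1), times = c(90, 110)))
)

# 80%:20% random split, same pattern as above
idx       <- sample(nrow(toy), nrow(toy) * 0.80)
toy_train <- toy[idx, ]
toy_test  <- toy[-idx, ]

# Sizes and class proportions of both parts
c(train = nrow(toy_train), test = nrow(toy_test))
round(prop.table(table(toy_train$HeartDisease)), 3)
round(prop.table(table(toy_test$HeartDisease)), 3)
```

If the proportions differ noticeably between the two parts, a stratified split (for example with caret's createDataPartition()) is one possible alternative.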
The process of training a machine learning model on a dataset is
referred to as model fitting. In the context of building a decision tree
model for heart disease prediction, the ctree() function
from the partykit library in R can be employed.
🧪 Function: ctree(formula, data)
formula: y ~ x
y: Dependent variable or target variable.
x: Independent variables or predictors.
data: A data frame containing both the dependent and independent variables.
heart_tree <- ctree(formula = HeartDisease ~ .,
                    data = heart_train)
heart_tree
##
## Model formula:
## HeartDisease ~ Age + Sex + ChestPainType + RestingBP + Cholesterol +
## FastingBS + RestingECG + MaxHR + ExerciseAngina + Oldpeak +
## ST_Slope
##
## Fitted party:
## [1] root
## | [2] ST_Slope in Down, Flat
## | | [3] Sex in F
## | | | [4] FastingBS <= 0: 0 (n = 51, err = 45.1%)
## | | | [5] FastingBS > 0: 1 (n = 8, err = 0.0%)
## | | [6] Sex in M
## | | | [7] MaxHR <= 150
## | | | | [8] ChestPainType in ASY: 1 (n = 235, err = 5.5%)
## | | | | [9] ChestPainType in ATA, NAP, TA: 1 (n = 75, err = 17.3%)
## | | | [10] MaxHR > 150: 1 (n = 47, err = 40.4%)
## | [11] ST_Slope in Up
## | | [12] ChestPainType in ASY
## | | | [13] FastingBS <= 0
## | | | | [14] ExerciseAngina in N: 0 (n = 61, err = 26.2%)
## | | | | [15] ExerciseAngina in Y: 1 (n = 25, err = 40.0%)
## | | | [16] FastingBS > 0
## | | | | [17] Cholesterol <= 0: 1 (n = 17, err = 0.0%)
## | | | | [18] Cholesterol > 0: 0 (n = 7, err = 42.9%)
## | | [19] ChestPainType in ATA, NAP, TA
## | | | [20] Oldpeak <= 1.9: 0 (n = 200, err = 5.5%)
## | | | [21] Oldpeak > 1.9: 1 (n = 8, err = 37.5%)
##
## Number of inner nodes: 10
## Number of terminal nodes: 11
Visualization
# visualize the decision tree
plot(heart_tree, type = "simple")
Insights:
Having established the decision tree model, the next step is to
utilize it for generating predictions on unseen data. The
predict() function in R serves this purpose
effectively.
🧪 Function:
predict(object, newdata, type)
object: the model we want to use for making predictions.
newdata: the new data on which we want to generate predictions.
type: controls the format of the predictions returned by the predict() function.
type = "prob": outputs the probability of each class for each row in the dataset.
type = "response": directly assigns a class label to each row in the dataset.
pred_test <- predict(
  object = heart_tree,
  newdata = heart_test,
  type = "response"
)
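The difference between the two type options can be seen in a small sketch. It uses a two-class subset of the built-in iris data rather than the heart data, and the cutoff of 0.3 is a hypothetical value chosen only to show how class probabilities can be turned into labels with a custom threshold.

```r
library(partykit)

# Two-class toy problem built from iris (not the heart data)
two_class <- droplevels(subset(iris, Species != "setosa"))
toy_tree  <- ctree(Species ~ ., data = two_class)

# Class probabilities: one column per class
prob <- predict(toy_tree, newdata = two_class, type = "prob")
head(prob)

# Hypothetical lower cutoff for the class treated as positive
cutoff <- 0.3
pred_custom <- factor(
  ifelse(prob[, "virginica"] >= cutoff, "virginica", "versicolor"),
  levels = levels(two_class$Species)
)
table(pred_custom, two_class$Species)
```

Lowering the cutoff labels more rows as positive, which tends to raise recall at the cost of precision.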
Once predictions have been generated for the unseen data using the
predict() function, the next step is to evaluate the
performance of the decision tree model. This crucial step allows us to
assess how accurately the model generalizes to new data and identifies
its strengths and weaknesses. For this process we evaluate the model
using confusionMatrix().
confusionMatrix(
data = pred_test,
reference = heart_test$HeartDisease,
positive = "1"
)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 75 11
## 1 11 87
##
## Accuracy : 0.8804
## 95% CI : (0.8246, 0.9235)
## No Information Rate : 0.5326
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.7598
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.8878
## Specificity : 0.8721
## Pos Pred Value : 0.8878
## Neg Pred Value : 0.8721
## Prevalence : 0.5326
## Detection Rate : 0.4728
## Detection Prevalence : 0.5326
## Balanced Accuracy : 0.8799
##
## 'Positive' Class : 1
##
Insights:
The primary objective in this context is to accurately identify patients with the condition, enabling healthcare professionals to implement preventive measures and improve patient outcomes.
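The statistics reported by confusionMatrix() can be reproduced by hand from the four cell counts of the table above, with class 1 as the positive class. The variable names below are chosen for illustration.

```r
# Cell counts from the confusion matrix above (positive class = 1)
tp <- 87  # predicted 1, actual 1
fn <- 11  # predicted 0, actual 1
fp <- 11  # predicted 1, actual 0
tn <- 75  # predicted 0, actual 0

accuracy    <- (tp + tn) / (tp + tn + fp + fn)
sensitivity <- tp / (tp + fn)   # recall
specificity <- tn / (tn + fp)
precision   <- tp / (tp + fp)   # positive predictive value

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, precision = precision), 4)
```

Sensitivity is the metric tied to the objective stated above: it is the share of actual heart disease cases that the model finds.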
Decision trees are powerful machine learning algorithms for classification and regression tasks. However, a key challenge associated with them is overfitting. This occurs when the tree becomes overly complex, capturing noise or irrelevant details in the training data. The result is inflated performance on the training data but poor performance on unseen data, such as the testing set.
To address this issue, we can strategically influence the decision tree construction process, promoting the development of a less complex and more focused tree.
Arguments:
mincriterion = 0.95: a split is made only when 1 - p-value of the
association test exceeds 0.95 (that is, p < 0.05). This equals the
partykit default; a higher value would make splitting stricter and
further limit noise-based splits.
minsplit = 50: This ensures a minimum number of
observations (50 in this case) are present at a node before it can be
further split. This prevents the tree from splitting based on small data
subsets that might not be representative of the broader population.
minbucket = 50: Setting a minimum of 50 observations per
leaf prevents the creation of overly specific branches with limited
data. This encourages the model to learn more generalizable patterns,
reducing overfitting and improving unseen data predictions.
Consider the following model after we add
ctree_control() arguments.
heart_tree_complex <- ctree(formula = HeartDisease ~ .,
data = heart_train,
control = ctree_control(mincriterion = 0.95,
minsplit = 50,
minbucket = 50))
# original decision tree
plot(heart_tree, type = "simple")
# modified decision tree
plot(heart_tree_complex, type='simple')
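Besides comparing the two plots visually, the size of the trees can be compared by counting terminal nodes. The sketch below uses the built-in iris data instead of the heart data, and n_leaves() is a hypothetical helper defined here for illustration.

```r
library(partykit)

# Default tree versus a tree with stricter control settings (iris, for illustration)
tree_default <- ctree(Species ~ ., data = iris)
tree_strict  <- ctree(Species ~ ., data = iris,
                      control = ctree_control(mincriterion = 0.95,
                                              minsplit = 50,
                                              minbucket = 50))

# Hypothetical helper: number of terminal nodes (leaves) of a fitted tree
n_leaves <- function(tree) length(nodeids(tree, terminal = TRUE))

c(default = n_leaves(tree_default), strict = n_leaves(tree_strict))
```

With minbucket = 50 and 150 rows, the stricter tree cannot have more than three leaves, which shows how these arguments cap the complexity of the tree.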
After fitting the model, we evaluate its performance on the training data.
# class prediction using training data
pred_heart_train <- predict(heart_tree_complex,
heart_train,
type = "response")
# confusion matrix data train
confusionMatrix(pred_heart_train,
heart_train$HeartDisease,
positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 235 32
## 1 89 378
##
## Accuracy : 0.8351
## 95% CI : (0.8063, 0.8613)
## No Information Rate : 0.5586
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6594
##
## Mcnemar's Test P-Value : 3.564e-07
##
## Sensitivity : 0.9220
## Specificity : 0.7253
## Pos Pred Value : 0.8094
## Neg Pred Value : 0.8801
## Prevalence : 0.5586
## Detection Rate : 0.5150
## Detection Prevalence : 0.6362
## Balanced Accuracy : 0.8236
##
## 'Positive' Class : 1
##
Insights:
After evaluating the model on the training data, we evaluate its performance on the testing data.
# class prediction using testing data
pred_heart_test <- predict(heart_tree_complex,
heart_test,
type = "response")
# confusion matrix testing
confusionMatrix(pred_heart_test,
heart_test$HeartDisease,
positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 62 5
## 1 24 93
##
## Accuracy : 0.8424
## 95% CI : (0.7816, 0.8918)
## No Information Rate : 0.5326
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6791
##
## Mcnemar's Test P-Value : 0.0008302
##
## Sensitivity : 0.9490
## Specificity : 0.7209
## Pos Pred Value : 0.7949
## Neg Pred Value : 0.9254
## Prevalence : 0.5326
## Detection Rate : 0.5054
## Detection Prevalence : 0.6359
## Balanced Accuracy : 0.8350
##
## 'Positive' Class : 1
##
Insights:
Conclusion: Based on the results above, the model shows promising performance. Sensitivity (recall) is high on both the training data (92.20%) and the testing data (94.90%), so the model is good at identifying true positives. Accuracy is also similar on the two sets (83.51% and 84.24%), which suggests that the tuned tree is not overfitting. Specificity, however, is only about 72% on both sets, so other metrics need to be considered before claiming that the model performs well overall.
Naive Bayes is a probabilistic classifier that assumes the features are independent of each other given the class. The score of a class for a data point is the class prior multiplied by the conditional probability of each feature value in that class, and the class with the highest score is predicted. The method is simple and efficient even on large datasets, but the independence assumption rarely holds exactly in real problems, which can limit its accuracy when predictors are strongly correlated.
Next, we will divide the dataset into train
(heart_nb_train) and test (heart_nb_test)
datasets, maintaining an 80%:20% ratio.
RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(100)
# train-test splitting: 80%:20%
split_heart_nb <- sample(nrow(heart_clean), nrow(heart_clean)*0.80)
heart_nb_train <- heart_clean[split_heart_nb, ]
heart_nb_test <- heart_clean[-split_heart_nb, ]
Once the training and testing data are ready, we can build the model
with the naiveBayes() function.
🧪 Function:
naiveBayes(formula, data)
formula = y ~ x
y: the name of the variable that we want to predict.
x: the names of the variables used to predict the target variable.
data: the data frame that contains both the target variable and the predictor variables.
# construct a model using all predictors
nb_heart_all <- naiveBayes(
  formula = HeartDisease ~ .,
  data = heart_nb_train,
  laplace = 1)
nb_heart_all
##
## Naive Bayes Classifier for Discrete Predictors
##
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
##
## A-priori probabilities:
## Y
## 0 1
## 0.4414169 0.5585831
##
## Conditional probabilities:
## Age
## Y [,1] [,2]
## 0 50.64198 9.757187
## 1 55.99756 8.659971
##
## Sex
## Y F M
## 0 0.33435583 0.66564417
## 1 0.09223301 0.90776699
##
## ChestPainType
## Y ASY ATA NAP TA
## 0 0.27134146 0.34451220 0.30792683 0.07621951
## 1 0.76086957 0.04589372 0.14734300 0.04589372
##
## RestingBP
## Y [,1] [,2]
## 0 130.6914 17.14653
## 1 134.1366 20.24703
##
## Cholesterol
## Y [,1] [,2]
## 0 225.6389 73.16369
## 1 172.2000 125.73428
##
## FastingBS
## Y [,1] [,2]
## 0 0.09567901 0.2946055
## 1 0.34634146 0.4763849
##
## RestingECG
## Y LVH Normal ST
## 0 0.1987768 0.6483180 0.1529052
## 1 0.2106538 0.5593220 0.2300242
##
## MaxHR
## Y [,1] [,2]
## 0 147.6358 23.12819
## 1 127.5146 23.55758
##
## ExerciseAngina
## Y N Y
## 0 0.8558282 0.1441718
## 1 0.3883495 0.6116505
##
## Oldpeak
## Y [,1] [,2]
## 0 0.4021605 0.7077818
## 1 1.3292683 1.1790643
##
## ST_Slope
## Y Down Flat Up
## 0 0.03975535 0.18960245 0.77064220
## 1 0.09927361 0.73607748 0.16464891
Insights:
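The prior probability of heart disease (class 1) is 55.9%, so before looking at any feature the model slightly favors class 1. In the conditional tables, patients with heart disease are on average older (56.0 vs 50.6 years), more often male (90.8% vs 66.6%), and have a lower maximum heart rate (127.5 vs 147.6) and a higher Oldpeak (1.33 vs 0.40). Asymptomatic chest pain (ASY) is the dominant chest pain type in class 1 (76.1% vs 27.1%), exercise-induced angina is far more common (61.2% vs 14.4%), and a flat ST slope is typical (73.6% vs 19.0%). Resting blood pressure differs only slightly (134.1 vs 130.7 mm Hg). Elevated fasting blood sugar is more frequent in class 1 (mean of the 0/1 indicator 0.35 vs 0.10). Mean cholesterol is lower in class 1 (172.2 vs 225.6) with a much larger standard deviation, which may reflect records with a cholesterol value of 0 rather than a real effect.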
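To make the multiplication rule concrete, the sketch below applies it by hand to a hypothetical male patient with exercise-induced angina. It uses the prior and only two of the conditional tables printed above, so the result is an illustration of the mechanism and not the prediction the full model would give.

```r
# Values copied from the model output above
prior    <- c("0" = 0.4414169,  "1" = 0.5585831)   # A-priori probabilities
p_male   <- c("0" = 0.66564417, "1" = 0.90776699)  # P(Sex = M | class)
p_angina <- c("0" = 0.1441718,  "1" = 0.6116505)   # P(ExerciseAngina = Y | class)

# Naive Bayes score: prior times the product of the conditional probabilities
score <- prior * p_male * p_angina

# Normalize the scores so they sum to 1
posterior <- score / sum(score)
round(posterior, 3)
```

Based on these two features alone, class 1 receives a posterior of about 0.88. The numeric predictors enter the product in the same way, through a normal density built from the mean and standard deviation shown in their tables.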
After the model has been trained on the training data, we use it to predict the testing data.
# generate predictions for the testing data
heart_nb_pred <- predict(nb_heart_all,
heart_nb_test,
type = "class")
table(heart_nb_pred)
## heart_nb_pred
## 0 1
## 90 94
Insights: The resulting table shows the distribution
of predicted class labels for the unseen data in
heart_nb_test.
After making predictions on the test data, the performance of the model can be thoroughly evaluated by creating a confusion matrix. This matrix illustrates the model’s classification strengths and weaknesses by comparing predicted class labels to actual class labels in the test data.
# model evaluation with confusion matrix
confusionMatrix(data = heart_nb_pred,
reference = heart_nb_test$HeartDisease,
positive = "1",
mode = "everything")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 76 14
## 1 10 84
##
## Accuracy : 0.8696
## 95% CI : (0.8122, 0.9146)
## No Information Rate : 0.5326
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.7388
##
## Mcnemar's Test P-Value : 0.5403
##
## Sensitivity : 0.8571
## Specificity : 0.8837
## Pos Pred Value : 0.8936
## Neg Pred Value : 0.8444
## Precision : 0.8936
## Recall : 0.8571
## F1 : 0.8750
## Prevalence : 0.5326
## Detection Rate : 0.4565
## Detection Prevalence : 0.5109
## Balanced Accuracy : 0.8704
##
## 'Positive' Class : 1
##
Insights:
Evaluating a model’s performance shouldn’t solely rely on accuracy, especially when dealing with unequal class sizes. Choosing a threshold is crucial because it impacts the effectiveness of precision and recall as performance metrics.
ROC curves show the model’s performance at various thresholds. They plot the true positive rate (TPR), the share of actual positives correctly identified, against the false positive rate (FPR), the share of actual negatives incorrectly classified as positive.
ROC curves assist in determining the optimal threshold that balances the model’s ability to accurately detect true positives and true negatives. It is a valuable tool for assessing binary classification models, particularly when there are varying class sizes.
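Each point on a ROC curve is the (FPR, TPR) pair obtained at one threshold. The sketch below computes three such points on made-up scores and labels; the data and the helper rates_at() are hypothetical and only serve to show the calculation.

```r
# Toy predicted probabilities and true labels (1 = positive), made up for illustration
scores <- c(0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2)
labels <- c(1,   1,   0,   1,   0,    1,   0,   0)

# Hypothetical helper: TPR and FPR when rows with score >= cutoff are labeled positive
rates_at <- function(cutoff) {
  pred <- as.integer(scores >= cutoff)
  c(cutoff = cutoff,
    tpr = sum(pred == 1 & labels == 1) / sum(labels == 1),
    fpr = sum(pred == 1 & labels == 0) / sum(labels == 0))
}

# One row per threshold
roc_points <- t(sapply(c(0.25, 0.5, 0.75), rates_at))
roc_points
```

Raising the threshold lowers both rates: at 0.25 the toy model finds every positive but flags three of four negatives, while at 0.75 it flags no negatives but finds only half of the positives.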
ROC Curve Operation:
Ideal ROC Curve:
The ROC curve provides a visual representation of a model’s performance across various threshold settings. However, to obtain a single, quantitative measure of overall model performance, the Area Under the Curve (AUC) is calculated.
AUC Criteria:
To evaluate the performance of our previously trained Naive Bayes
model (nb_heart_all) on the test data (heart_nb_test), we construct the
ROC curve and compute the corresponding AUC value.
pred_test_prob <- predict(nb_heart_all,
heart_nb_test,
type = "raw")
head(pred_test_prob)
## 0 1
## [1,] 0.05834116 9.416588e-01
## [2,] 0.98088801 1.911199e-02
## [3,] 0.99914034 8.596551e-04
## [4,] 0.99826564 1.734355e-03
## [5,] 0.99993593 6.406891e-05
## [6,] 0.96518364 3.481636e-02
# keep the first column: the predicted probability of class 0 (no heart disease)
pred_prob <- pred_test_prob[,1]
pred_prob
## [1] 5.834116e-02 9.808880e-01 9.991403e-01 9.982656e-01 9.999359e-01
## [6] 9.651836e-01 7.742575e-05 9.800087e-01 3.645031e-01 9.994885e-01
## [11] 6.285830e-03 9.997827e-01 6.694946e-01 7.524167e-02 9.997510e-01
## [16] 4.594216e-01 9.990596e-01 2.162319e-01 4.835268e-04 9.435662e-01
## [21] 7.930881e-03 8.271165e-01 9.995223e-01 8.498010e-01 9.174348e-01
## [26] 9.615428e-01 5.899562e-04 9.998411e-01 9.895507e-01 1.147551e-01
## [31] 9.970687e-01 2.718132e-01 6.274243e-02 9.792799e-01 4.086518e-02
## [36] 5.206547e-04 9.995812e-01 4.393737e-07 9.900046e-01 9.507560e-01
## [41] 9.945091e-01 1.580460e-02 9.988613e-01 8.237324e-01 8.835710e-01
## [46] 4.920855e-01 2.336587e-01 9.948962e-01 9.992759e-01 9.123338e-01
## [51] 8.686088e-04 5.437716e-02 5.445966e-02 9.990059e-01 9.960810e-01
## [56] 1.189617e-02 9.997825e-01 3.505663e-01 9.973239e-01 9.955656e-01
## [61] 3.365337e-04 5.879611e-04 4.340943e-05 2.297844e-03 3.919590e-04
## [66] 1.691800e-04 3.286940e-05 1.882433e-04 1.111365e-02 5.789934e-04
## [71] 3.780454e-05 3.052390e-04 3.686583e-04 1.523090e-04 3.165478e-02
## [76] 3.998336e-05 8.689443e-03 1.608124e-03 1.660783e-03 5.091675e-01
## [81] 5.299630e-02 5.345526e-05 1.513300e-04 2.150153e-03 3.399541e-06
## [86] 1.882768e-04 1.259815e-04 2.264948e-04 2.399075e-02 4.457754e-03
## [91] 9.511338e-01 1.314386e-04 2.525499e-01 9.981415e-01 5.276445e-06
## [96] 2.105061e-03 5.638275e-04 5.566922e-05 3.031100e-05 6.136015e-01
## [101] 1.198238e-01 3.915369e-04 9.860899e-01 4.496259e-02 5.098174e-05
## [106] 2.328762e-05 7.656063e-01 6.229237e-03 8.081638e-02 8.755514e-02
## [111] 4.560817e-03 1.871306e-01 3.881270e-07 1.727914e-04 5.800469e-01
## [116] 7.554467e-02 2.877258e-03 1.741972e-02 9.014790e-01 9.991450e-01
## [121] 8.875839e-01 9.635095e-02 9.978664e-01 9.925036e-01 9.935405e-01
## [126] 3.130949e-01 9.985146e-01 6.131639e-01 1.713541e-01 9.990412e-01
## [131] 9.923529e-01 2.023962e-01 9.998593e-01 9.992199e-01 2.210835e-01
## [136] 8.508582e-01 9.999425e-01 3.460963e-02 9.997223e-01 9.978772e-01
## [141] 9.944604e-01 9.980184e-01 3.733353e-01 9.894948e-01 9.735134e-01
## [146] 9.999857e-01 9.969055e-01 2.302151e-03 9.990163e-01 1.610146e-04
## [151] 9.923150e-01 9.304060e-02 3.729885e-02 9.016681e-01 2.891035e-04
## [156] 9.467848e-01 9.469247e-01 9.244130e-01 7.191219e-01 5.990742e-01
## [161] 9.958924e-01 2.472157e-03 9.999768e-01 9.997772e-01 9.992593e-01
## [166] 3.284626e-05 8.481597e-01 6.442361e-01 9.993699e-01 9.922426e-01
## [171] 3.828279e-01 1.295512e-02 3.701615e-04 8.117484e-01 9.895089e-01
## [176] 9.997763e-01 9.963533e-01 7.353274e-01 9.991346e-01 1.856468e-05
## [181] 9.631131e-01 9.927885e-01 9.993285e-01 2.427410e-01
table(heart_nb_test$HeartDisease)
##
## 0 1
## 86 98
# relabel the classes: level "0" becomes "normal", level "1" becomes "heart disease"
levels(heart_nb_test$HeartDisease) <- c("normal", "heart disease")
head(heart_nb_test)
# pred_prob holds P(class 0), so 1 - pred_prob is the predicted probability of heart disease
bayes_roc <- prediction(predictions = 1 - pred_prob,
                        labels = heart_nb_test$HeartDisease,
                        label.ordering = c("normal", "heart disease"))
# Create ROC plot
model_roc_vec <- performance(bayes_roc,
"tpr",
"fpr")
plot(model_roc_vec)
abline(0,1 , lty = 2)
# Calculate AUC
bayes_auc <- performance(bayes_roc, measure = "auc")
auc_value <- as.numeric(bayes_auc@y.values[[1]])
cat("AUC Value:", auc_value, "\n")
## AUC Value: 0.9172995
Insights:
The ROC curve is a plot of the True Positive Rate (TPR) against the False Positive Rate (FPR). An ideal ROC curve would approach the upper left corner of the plot, where the TPR is close to 1 and the FPR is close to 0, forming a near-perfect inverse L shape. This indicates that the model has excellent discrimination between the positive and negative classes.
An AUC of about 0.917 indicates that the model separates the two classes very well on the test data. The closer the value is to 1, the better the model ranks positive instances above negative ones, while a value of 0.5 corresponds to random guessing.
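One way to read the AUC is as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The sketch below checks this interpretation on made-up scores and labels, computing the value once by comparing all positive-negative pairs and once with the equivalent rank formula. The data are hypothetical and unrelated to the model above.

```r
# Toy predicted probabilities and true labels (1 = positive), made up for illustration
scores <- c(0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2)
labels <- c(1,   1,   0,   1,   0,    1,   0,   0)

pos <- scores[labels == 1]
neg <- scores[labels == 0]

# Share of positive-negative pairs ranked correctly (ties count as half)
auc_pairs <- mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))

# Same quantity from the rank sum of the positive cases
r <- rank(scores)
auc_rank <- (sum(r[labels == 1]) - length(pos) * (length(pos) + 1) / 2) /
  (length(pos) * length(neg))

c(pairs = auc_pairs, rank = auc_rank)
```

Both calculations give 0.8125 for the toy data: 13 of the 16 positive-negative pairs are ordered correctly.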
Both the Decision Tree and Naive Bayes models perform well in predicting heart disease, with test accuracy above 80%. A closer look at the individual metrics reveals key differences.
Decision Tree (tuned with ctree_control(), testing data)
* Accuracy: 84.24% - indicating a good ability to correctly classify cases in the testing data.
* Sensitivity (Recall): 94.90% - reflects the model’s ability to correctly identify individuals with heart disease (true positive rate).
* Positive Predictive Value (Precision): 79.49% - about four in five patients predicted to have heart disease actually have the condition.
Naive Bayes (testing data)
* Accuracy: 86.96% - indicating a good ability to correctly classify cases in the testing data.
* Sensitivity (Recall): 85.71% - reflects the model’s ability to correctly identify individuals with heart disease (true positive rate).
* Positive Predictive Value (Precision): 89.36% - indicates that a high proportion of patients predicted to have heart disease by the model actually have the condition.
Based on the results above, the Decision Tree is the better model for identifying all heart disease cases, while Naive Bayes is slightly ahead on overall accuracy and precision. Here’s why:
Accuracy: Naive Bayes is slightly higher (86.96% vs 84.24%), so accuracy alone does not favor the Decision Tree.
High Sensitivity (Recall): The Decision Tree excels at identifying true positive cases (patients with heart disease), with a recall of 94.90% against 85.71% for Naive Bayes, which minimizes the risk of false negatives. Identifying all patients with heart disease is the top priority in order to prevent complications, and this is crucial in medical applications where early detection is critical.