Introduction

Did you know that in the United States, someone experiences a heart attack every 40 seconds? Additionally, approximately 647,000 people die from heart disease each year in this region. Heart disease is currently the leading cause of death for men, women, and individuals from various racial and ethnic backgrounds.

For our project, we aim to analyze the causes of heart attacks and develop a predictive model that can be valuable in predicting the occurrence of heart disease events. To achieve this, we will utilize the Heart Disease Data Set from the UCI Machine Learning Repository, which consists of patient data collected from Cleveland, Hungary, Switzerland, and Long Beach.

To summarize our Goals for this dataset:

Conduct a comprehensive Exploratory Data Analysis on both our independent and dependent variables.
Build machine learning models.
Utilize our models to predict heart disease events in unseen data.
Evaluate the performance of our models.
Select the best-performing model.

Data Understanding

Before conducting our Exploratory Data Analysis, we will first examine our dataset to identify any anomalies that can be rectified within it. It is crucial to recognize that the quality of insights relies heavily on the integrity of the underlying data. Consequently, ensuring that the data is clean and in a usable format is of utmost importance.

Columnwise Description

heart <- read.csv("heart.csv")
rmarkdown::paged_table(heart)

Each column in the given list corresponds to specific information in the dataset. Here’s a description of each column:

Age: Patient’s Age in years.
Sex: Patient’s Gender. (M = Male, F = Female)
ChestPainType: Chest Pain type. (4 values: ATA, NAP, ASY, TA)
RestingBP: resting Blood Pressure. ( in mm Hg )
Cholesterol: Serum Cholesterol. ( in mg/dl )
FastingBS: Fasting Blood Sugar > 120 mg/dl. (0 = True, 1 = False)
RestingECG: resting Electroencephalographic result. (values: Normal, ST, LVH)
MaxHR: Maximum Heart Rate achieved.
ExerciseAngina: Exercise induced Angina. (N = No, Y = Yes)
Oldpeak: ST Depression induced by Exercise relative to rest.
ST_Slope: Slope of the peak exercise ST segment. (values: Up, Flat, Down)
HeartDisease:: Heart Disease occured. (0 = No, 1 = Yes)

Data Wrangling

Check the Data Types and Data Content:

glimpse(heart)

## Rows: 918
## Columns: 12
## $ Age            <int> 40, 49, 37, 48, 54, 39, 45, 54, 37, 48, 37, 58, 39, 49,…
## $ Sex            <chr> "M", "F", "M", "F", "M", "M", "F", "M", "M", "F", "F", …
## $ ChestPainType  <chr> "ATA", "NAP", "ATA", "ASY", "NAP", "NAP", "ATA", "ATA",…
## $ RestingBP      <int> 140, 160, 130, 138, 150, 120, 130, 110, 140, 120, 130, …
## $ Cholesterol    <int> 289, 180, 283, 214, 195, 339, 237, 208, 207, 284, 211, …
## $ FastingBS      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ RestingECG     <chr> "Normal", "Normal", "ST", "Normal", "Normal", "Normal",…
## $ MaxHR          <int> 172, 156, 98, 108, 122, 170, 170, 142, 130, 120, 142, 9…
## $ ExerciseAngina <chr> "N", "N", "N", "Y", "N", "N", "N", "N", "Y", "N", "N", …
## $ Oldpeak        <dbl> 0.0, 1.0, 0.0, 1.5, 0.0, 0.0, 0.0, 0.0, 1.5, 0.0, 0.0, …
## $ ST_Slope       <chr> "Up", "Flat", "Up", "Flat", "Up", "Up", "Up", "Up", "Fl…
## $ HeartDisease   <int> 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1…

As pointed out earlier, our variable’s data type still doesn’t match its type. Some changes need to be made, namely:

Change some of the following columns to the factor data type: Sex, ChestPainType, FastingBS, RestingECG, ExerciseAngina, and ST_Slope.
Change the value of the sex column where “M” stands for Male and “F” stands for Female
Change the value of the FastingBS column where “0” becomes “< 120 mg/dl” and “1” becomes “> 120 mg/dl”
Changed the value of the ExerciseAngina column where “N” becomes “No” and “Y” becomes “Yes”

heart <- 
  heart %>% 
  mutate_at(vars(Sex, ChestPainType, FastingBS, RestingECG, ExerciseAngina, ST_Slope), as.factor) %>% 
  mutate(Sex = case_when(
    Sex == "M" ~ "Male",
    Sex == "F" ~ "Female"),
    FastingBS = case_when(
    FastingBS == "0" ~ "< 120 mg/dl",
    FastingBS == "1" ~ "> 120 mg/dl"),
    ExerciseAngina = case_when(
    ExerciseAngina == "N" ~ "No",
    ExerciseAngina == "Y" ~ "Yes"))

Check for Missing Values:

anyNA(heart)

## [1] FALSE

Our dataset does not contain any missing values.

Check Multicollinearity Indications

We will assess the Pearson Correlation among our numerical variables to determine whether there exists any initial multicollinearity.

variable_num <- heart %>% 
  select(Age, RestingBP, Cholesterol, MaxHR, Oldpeak, HeartDisease)

format <- round(cor(variable_num),2)

get_lower_tri<-function(format){
  format[upper.tri(format)] <- NA
  return(format)
  }

get_upper_tri <- function(format){
  format[lower.tri(format)]<- NA
  return(format)
}

lower_tri <- get_lower_tri(format)

heart_cor_melt <- reshape2::melt(lower_tri, na.rm = TRUE)
heart_cor_melt <- heart_cor_melt %>% 
  mutate(toltip = glue("{Var1} ~ {Var2}"))

plot_pm <- ggplot(heart_cor_melt,aes(Var1, Var2, text = toltip)) +
  geom_tile(aes(fill = value)) +
  geom_text(aes(label = round(value, 1)), alpha=0.5, size = 3, color = "White") + 
  scale_fill_gradientn(colors = c("#A8D8E0","#428793","#27636D"),
                      values = rescale(c(-1,0,1)),
                      limits = c(-1,1)) +
  labs(x = NULL,
      y = NULL,
      fill = "Pearson Corr:") +
  theme(legend.background = element_rect(fill = "#0E264E", color = "#0E264E"),
        plot.background = element_rect(fill = "#0E264E", color = "#0E264E"),
        panel.background = element_rect(fill = "#0E264E"),
        panel.grid = element_line(colour = "#0E264E"),
        panel.grid.major.x = element_line(colour = "#0E264E"),
        panel.grid.minor.x = element_line(colour = "#0E264E"),
        legend.title = element_text(colour = "#1B5249", face ="bold", family = "Times New Roman"),
        legend.text = element_text(colour = "#1B5249", face ="bold", family = "Times New Roman"),
        legend.justification = c(1, 0),
        legend.position = c(0.6, 0.7),
        legend.direction = "horizontal",
        axis.text.x = element_text(color = "#1B5249", family = "Times New Roman",
                                angle = 45, vjust = 1, hjust = 1),
        axis.text.y = element_blank(),
        axis.ticks = element_blank()) +
        guides(fill = guide_colorbar(barwidth = 7, barheight = 1,
                  title.position = "top", title.hjust = 0.5))

ggplotly(plot_pm, tooltip = "text") %>% 
  layout(hoverlabel = list( bgcolor = "rgba(255,255,255,0.75)",
                         font = list(
                           color = "Black",
                           family = "Cardo",
                           size = 12
                         )))

Based on the provided information, there is no evidence of multicollinearity as indicated by the absence of significant correlation coefficients exceeding a range of 0.4/-0.4.

Exploratory Data Analysis

Before doing modeling, we need to look at the proportion of the target variable that we have in the HeartDisease column.

proptarget <- paste(round(prop.table((table(heart$HeartDisease)))*100,2),"%")
proptarget_df <- data.frame(HeartDisease = c("No","Yes"), Prop = c(proptarget[1], proptarget[2]))
rmarkdown::paged_table(proptarget_df)

When viewed from the proportions of the two classes, it is quite balanced, so we don’t need additional pre-processing to balance the proportions between the two class target variables.

Build Models

Naive Bayes Classifer

The Naive Bayes algorithm is a type of supervised learning algorithm that utilizes Bayes’ theorem to solve classification problems. It is considered one of the simplest and most efficient classification algorithms, enabling the creation of rapid machine learning models capable of making swift predictions.

To grasp the concept of Naive Bayes, we can refer to the following equation:

\[ P(A|B) = \frac{P(B|A)P(A)}{P(B)} \]

Information:

P(A) = Probability of A occurring without any additional information (prior)
P(A∣B) = Probability of A occurring if it is known that B has occurred (posterior)
Event A is often called a hypothesis (target variable)
Event B is often called observation (evidence)

Cross Validation

Prior to delving into the modeling process, our initial step is to conduct Cross Validation on our dataset. Why is this necessary? The Cross Validation technique is employed to assess the effectiveness of machine learning algorithms in making predictions on unfamiliar data. In this particular scenario, we will allocate 80% of our dataset for training our model and reserve 20% for testing our predictions, in order to evaluate the performance of the model.

data_nb <- heart %>% 
  mutate(HeartDisease = as.factor(HeartDisease))

RNGkind(sample.kind = "Rounding")
set.seed(100)
index <- sample(nrow(data_nb), size = nrow(data_nb)*0.80)
train_nb <- data_nb[index,] #take 80%
test_nb <- data_nb[-index,] #take 20%

head(data_nb)

##   Age    Sex ChestPainType RestingBP Cholesterol   FastingBS RestingECG MaxHR
## 1  40   Male           ATA       140         289 < 120 mg/dl     Normal   172
## 2  49 Female           NAP       160         180 < 120 mg/dl     Normal   156
## 3  37   Male           ATA       130         283 < 120 mg/dl         ST    98
## 4  48 Female           ASY       138         214 < 120 mg/dl     Normal   108
## 5  54   Male           NAP       150         195 < 120 mg/dl     Normal   122
## 6  39   Male           NAP       120         339 < 120 mg/dl     Normal   170
##   ExerciseAngina Oldpeak ST_Slope HeartDisease
## 1             No     0.0       Up            0
## 2             No     1.0     Flat            1
## 3             No     0.0       Up            0
## 4            Yes     1.5     Flat            1
## 5             No     0.0       Up            0
## 6             No     0.0       Up            0

Re-check Class-Imbalance:

proptrain_nb <- paste(round(prop.table((table(train_nb$HeartDisease)))*100,2),"%")
proptrain_dfnb <- data.frame(HeartDisease = c("No","Yes"), Prop = c(proptrain_nb[1], proptrain_nb[2]))
rmarkdown::paged_table(proptrain_dfnb)

As demonstrated earlier, the class remains evenly distributed.

Model Fitting

Once we have divided our data, our next step is to proceed with training our models using the train_nb dataset. Initially, we will utilize all variables available as independent variables (X).

All Predictor

model_nb <- naiveBayes(formula = HeartDisease~. , data = train_nb, laplace = 1)
model_nb

## 
## Naive Bayes Classifier for Discrete Predictors
## 
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
## 
## A-priori probabilities:
## Y
##         0         1 
## 0.4414169 0.5585831 
## 
## Conditional probabilities:
##    Age
## Y       [,1]     [,2]
##   0 50.64198 9.757187
##   1 55.99756 8.659971
## 
##    Sex
## Y       Female       Male
##   0 0.33641975 0.66975309
##   1 0.09268293 0.91219512
## 
##    ChestPainType
## Y          ASY        ATA        NAP         TA
##   0 0.27134146 0.34451220 0.30792683 0.07621951
##   1 0.76086957 0.04589372 0.14734300 0.04589372
## 
##    RestingBP
## Y       [,1]     [,2]
##   0 130.6914 17.14653
##   1 134.1366 20.24703
## 
##    Cholesterol
## Y       [,1]      [,2]
##   0 225.6389  73.16369
##   1 172.2000 125.73428
## 
##    FastingBS
## Y   < 120 mg/dl > 120 mg/dl
##   0  0.90740741  0.09876543
##   1  0.65609756  0.34878049
## 
##    RestingECG
## Y         LVH    Normal        ST
##   0 0.1987768 0.6483180 0.1529052
##   1 0.2106538 0.5593220 0.2300242
## 
##    MaxHR
## Y       [,1]     [,2]
##   0 147.6358 23.12819
##   1 127.5146 23.55758
## 
##    ExerciseAngina
## Y          No       Yes
##   0 0.8611111 0.1450617
##   1 0.3902439 0.6146341
## 
##    Oldpeak
## Y        [,1]      [,2]
##   0 0.4021605 0.7077818
##   1 1.3292683 1.1790643
## 
##    ST_Slope
## Y         Down       Flat         Up
##   0 0.03975535 0.18960245 0.77064220
##   1 0.09927361 0.73607748 0.16464891

💡 Insight :

The Flat ST Slope has the highest likelihood of experiencing a Heart Disease Condition (77,60%), whereas the UP ST Slope has a greater likelihood (77,06%) of not having a Heart Disease Condition compared to its counterparts.
Patients who experience Exercise Angina have a greater likelihood of developing a Heart Disease Condition (61,46%), whereas patients without Exercise Angina have an 86,11% higher chance of not developing a Heart Disease Condition.
A patient with a normal resting ECG has a higher likelihood of developing a heart disease condition (55,93%) compared to others, while they also have a 65,83%higher chance of not developing a heart disease condition.
Patients with a Fasting Blood Sugar level below 120mg have a greater likelihood of developing a Heart Disease Condition (65,60%), and are also 90,74% more likely to not develop a Heart Disease Condition compared to patients with a Fasting Blood Sugar level above 120mg.
ASY Chest Pain has the highest likelihood of indicating a Heart Disease Condition (76,08%) , whereas ATA Chest Pain has a higher likelihood (34,45%) of not indicating a Heart Disease Condition compared to other types of chest pain.
The likelihood of developing a heart disease condition is higher among males (91,21%), while females have a 66,97% lower probability of not developing a heart disease condition.

Model Prediction

Using our model_nb, we will make predictions on our test_nb dataset.

naive_pred <- predict(model_nb, test_nb, type = "class")
naive_prob <- predict(model_nb, test_nb, type = "raw")
naive_pred_train <- predict(model_nb, train_nb, type = "class")

naive_table <- select(test_nb, HeartDisease) %>% 
  bind_cols(heart_pred = naive_pred) %>% 
  bind_cols(heart_e_prob = round(naive_prob[,1],4)) %>% 
  bind_cols(heart_p_prob = round(naive_prob[,2],4)) 

naive_table <- naive_table %>% 
  mutate(HeartDisease = as.factor(HeartDisease))

rmarkdown::paged_table(head(naive_table, 10))

Decision Tree

Decision Tree Model is a relatively straightforward model based on trees that demonstrates strong and effective predictive capabilities. It generates a decision tree visualization that is easily interpretable. The objective of employing a Decision Tree is to construct a training model that can accurately predict the class or value of the target variable by learning uncomplicated decision rules derived from previous data, also known as training data.

Cross Validation

data_dt <- heart %>% 
  mutate_if(is.character, as.factor) 

RNGkind(sample.kind = "Rounding")
set.seed(123)
index <- sample(nrow(heart), size = nrow(heart)*0.80)
train_dt <- data_dt[index,] #take 80%
test_dt <- data_dt[-index,] #take 20%

Re-check Class-Imbalance:

proptrain_dt <- paste(round(prop.table((table(train_dt$HeartDisease)))*100,2),"%")
proptrain_dfdt <- data.frame(HeartDisease = c("No","Yes"), Prop = c(proptrain_dt[1], proptrain_dt[2]))
rmarkdown::paged_table(proptrain_dfdt)

As demonstrated earlier, the class remains evenly distributed.

Model Fitting

To visualize the appearance of Decision Tree, we will generate a tree plot for our model using the fancyRPartPlot() function from the rpart.plot packages. Initially, we will include all variables as predictor variables.

model_dt <- rpart(formula = HeartDisease~., data = train_dt, method = "class")
plot_dtree <- rpart.plot(model_dt, cex = 0.8)

Our decision tree is composed of various components. The topmost box represents the root node, specifically associated with ST_Slope. The root node divides and generates branches based on specific rules. Each branch concludes with a node. Subsequently, these nodes further split into additional nodes referred to as internal nodes. Nodes that no longer divide are located at the bottom of our visualization, known as terminal nodes or leaf nodes. In conclusion, every node displays the following information: (Per Node)

Predicted class (Yes or No)
Predicted class probabilities
The proportion of observations within the node

Model Prediction

Now that we have our model in place, we can proceed to make predictions using our test_dt.

dt_pred <- predict(model_dt, test_dt, type = "class")
dt_prob <- predict(model_dt, test_dt, type = "prob")
dt_pred_train <- predict(model_dt, train_dt, type = "class")

dt_table <- select(test_dt, HeartDisease) %>% 
  bind_cols(heart_pred = dt_pred) %>% 
  bind_cols(heart_e_prob = round(dt_prob[,1],4)) %>% 
  bind_cols(heart_p_prob = round(dt_prob[,2],4))

dt_table <- dt_table %>% 
  mutate(HeartDisease = as.factor(HeartDisease))

rmarkdown::paged_table(head(dt_table, 10))

Evaluation

To assess the performance, we will utilize confusion matrix and ROC/AUC function which are capable of producing:

Accuracy: how capable my model correctly guesses target Y. \[ \frac {TP + TN} {TP + TN + FP + FN} \] Recall or Sensitivity: From all positive actual data, how well can the proportion of my model guess correctly? \[ \frac {TP} {TP + FN} \] Specificity: of all negative actual data, how well can the proportion of my model guess the correct one? \[ \frac {TN} {TN + FP} \] Precision: From all the prediction results, how well can my model correctly guess the positive class? \[ \frac {TP} {TP + FP} \]

Naive Bayes

Evaluation Summary

con_mat_nb <- confusionMatrix(data = naive_table$heart_pred, 
                              reference = naive_table$HeartDisease, 
                              positive = "1")

con_mat_nb_t <- confusionMatrix(data = naive_pred_train, 
                                reference = as.factor(train_nb$HeartDisease), 
                                positive = "1")

performance_nb <- cbind.data.frame(Accuracy = c(con_mat_nb_t$overall[[1]], con_mat_nb$overall[[1]]),
                                  Recall = c(con_mat_nb_t$byClass[[1]], con_mat_nb$byClass[[1]]),
                                  Specificity = c(con_mat_nb_t$byClass[[2]], con_mat_nb$byClass[[2]]),
                                  Precision = c(con_mat_nb_t$byClass[[3]], con_mat_nb$byClass[[3]]))

rownames(performance_nb) <- c("On Training Data", "On Unseen Data")

rmarkdown::paged_table(performance_nb)

There are minimal disparities in the Accuracy, Sensitivity, Specificity, and Precision scores between our Unseen and Training Data.
This indicates that there are no signs of Overfitting and Underfitting in our Naive Nayes Classifer Model.

roc_nb <- roc(predictor = naive_table$heart_e_prob,
            response = naive_table$HeartDisease,
            percent = TRUE)

df_nb <-  data.frame(Specificity=roc_nb$specificities, Sensitivity=roc_nb$sensitivities)

plot_nb <- ggplot(data = df_nb, aes(x = Specificity, y = Sensitivity))+
  with_outer_glow(geom_path(colour = '#1F6E8C', size = 1), colour = "#1B5249", sigma = 10, expand = 1)+
  scale_x_reverse() +
  geom_abline(intercept = 100, slope = 1, color="#1F6E8C", linetype = "longdash") +
  annotate("text", x = 40, y = 40, label = paste0('AUC: ', round(roc_nb$auc,1), '%'), size = 4, color = "#1B5249", family = "Times New Roman")+
  ylab('Recall (%)')+
  xlab('Specificity (%)')+
  scale_fill_gradient(low = "#1B5249", high = "#1F6E8C") +
  scale_color_gradient(low = "#1B5249", high = "#1F6E8C") +
  theme(legend.position = "none",
        plot.background = element_rect(fill = "#0E264E", color = "#0E264E"),
        panel.background = element_rect(fill = "#0E264E"),
        panel.grid = element_line(colour = "white"),
        panel.grid.major.x = element_line(colour = "white"),
        panel.grid.minor.x = element_line(colour = "white"),
        axis.text.x = element_text(color = "white", family = "Times New Roman", size = 14),
        axis.text.y = element_text(color = "white", family = "Times New Roman", size = 14),
        axis.title.x = element_text(color = "#1B5249", family = "Times New Roman", size = 14),
        axis.title.y = element_text(color = "#1B5249", family = "Times New Roman", size = 14),
        axis.ticks = element_blank())

ggarrange(plot_nb, nrow = 1, ncol = 1)

Our naive bayes model achieves an accuracy rate of accuracy of 88.55% when applied to new and unseen data, indicating that it correctly classifies 88.55% of the data.
The sensitivity value of style=“color:red”>sensitivity is 87.75% demonstrates the accurate classification of the majority of positive outcomes.
The specificity value stands at specificity is 89.53%, signifying accurate classification of the majority of negative outcomes.
The precision value stands at 90.52%, indicating that 90.52% of our positive predictions are accurate.
The ROC Curve we have obtained demonstrates a remarkable level of distinction with an AUC score of 92.6%. This indicates that our model has been verified to possess a high degree of effectiveness in distinguishing between the classes we are targeting.

Decision Tree

Evaluation Summary

con_mat_dt <- confusionMatrix(data = dt_table$heart_pred, 
                              reference = dt_table$HeartDisease, 
                              positive = "1")

con_mat_dt_train <- confusionMatrix(data = dt_pred_train, 
                                reference = as.factor(train_dt$HeartDisease), 
                                positive = "1")

performance_dt <- cbind.data.frame(Accuracy = c(con_mat_dt_train$overall[[1]], con_mat_dt$overall[[1]]),
                                   Recall = c(con_mat_dt_train$byClass[[1]], con_mat_dt$byClass[[1]]),
                                   Specificity = c(con_mat_dt_train$byClass[[2]], con_mat_dt$byClass[[2]]),
                                   Precision = c(con_mat_dt_train$byClass[[3]], con_mat_dt$byClass[[3]]))

rownames(performance_dt) <- c("On Training Data", "On Unseen Data")

rmarkdown::paged_table(performance_dt)

There are minimal disparities in the Accuracy, Sensitivity, Specificity, and Precision scores between our Unseen and Training Data.
This indicates that there are no signs of Overfitting and Underfitting in our Decision Tree Model.

roc_dt <- roc(predictor = dt_table$heart_e_prob,
            response = dt_table$HeartDisease,
            levels = c("0", "1"),
            percent = TRUE)

df_dt <- data.frame(Specificity=roc_dt$specificities, Sensitivity=roc_dt$sensitivities)



plot_dt <- ggplot(data = df_dt, aes(x = Specificity, y = Sensitivity))+
  with_outer_glow(geom_path(colour = '#1F6E8C', size = 1), colour = "#1B5249", sigma = 10, expand = 1)+
  scale_x_reverse() +
  geom_abline(intercept = 100, slope = 1, color="#1F6E8C", linetype = "longdash") +
  annotate("text", x = 40, y = 40, label = paste0('AUC: ', round(roc_dt$auc,1), '%'), size = 4, color = "#1B5249", family = "Times New Roman")+
  ylab('Recall (%)')+
  xlab('Specificity (%)')+
  scale_fill_gradient(low = "#1B5249", high = "#1F6E8C") +
  scale_color_gradient(low = "#1B5249", high = "#1F6E8C") +
  theme(legend.position = "none",
        plot.background = element_rect(fill = "#0E264E", color = "#0E264E"),
        panel.background = element_rect(fill = "#0E264E"),
        panel.grid = element_line(colour = "white"),
        panel.grid.major.x = element_line(colour = "white"),
        panel.grid.minor.x = element_line(colour = "white"),
        axis.text.x = element_text(color = "white", family = "Times New Roman", size = 14),
        axis.text.y = element_text(color = "white", family = "Times New Roman", size = 14),
        axis.title.x = element_text(color = "#1B5249", family = "Times New Roman", size = 14),
        axis.title.y = element_text(color = "#1B5249", family = "Times New Roman", size = 14),
        axis.ticks = element_blank())

ggarrange(plot_dt, nrow = 1, ncol = 1)

Our decision tree model achieves an accuracy rate of accuracy of 82.06% when applied to new and unseen data, indicating that it correctly classifies 82.06% of the data.
The sensitivity value of sensitivity is 87.61% demonstrates the accurate classification of the majority of positive outcomes.
The specificity value stands at specificity is 74.68%, signifying accurate classification of the majority of negative outcomes.
The precision value stands at 90.52%, indicating that 82.14% of our positive predictions are accurate.
The ROC Curve we have obtained demonstrates a remarkable level of distinction with an AUC score of 90.7%. This indicates that our model has been verified to possess a high degree of effectiveness in distinguishing between the classes we are targeting.

Conclusion

Since we are discussing health diagnosis prediction, our goal is to minimize the error in predicting positive outcomes, specifically for patients with heart disease. Upon evaluating the performance of our models, we aim to select a model with a high sensitivity score.

Based on the two models we developed, the Naive Bayes model showed a higher sensitivity score of 88.77% compared to the Decision Tree model. However, if you look at it, it turns out that the difference between the two models is not too significant, the difference is very small and this can be interpreted that the two models have a good sensitivity score for later model use.

This concludes my attempt to classify heart disease using classification models. As time progresses, I will incorporate additional models that have not yet been included in this project. Nonetheless, I sincerely hope that this study will assist individuals dealing with similar cases in the future.

A work by Ahmad Fauzi

wrk.ahmadfauzi@gmail.com

with R Language

Heart Disease Classification

Naive Bayes Classifer and Decision Tree Models

Introduction

Data Understanding

Columnwise Description

Data Wrangling

Exploratory Data Analysis

Build Models

Naive Bayes Classifer

Cross Validation

Model Fitting

Model Prediction

Decision Tree

Cross Validation

Model Fitting

Model Prediction

Evaluation

Naive Bayes

Decision Tree

Conclusion