Did you know that in the United States, someone experiences a heart attack every 40 seconds? Additionally, approximately 647,000 people die from heart disease each year in this region. Heart disease is currently the leading cause of death for men, women, and individuals from various racial and ethnic backgrounds.
For our project, we aim to analyze the causes of heart attacks and develop a predictive model that can be valuable in predicting the occurrence of heart disease events. To achieve this, we will utilize the Heart Disease Data Set from the UCI Machine Learning Repository, which consists of patient data collected from Cleveland, Hungary, Switzerland, and Long Beach.
To summarize our Goals for this dataset:
Before conducting our Exploratory Data Analysis, we will first examine our dataset to identify any anomalies that can be rectified within it. It is crucial to recognize that the quality of insights relies heavily on the integrity of the underlying data. Consequently, ensuring that the data is clean and in a usable format is of utmost importance.
heart <- read.csv("heart.csv")
rmarkdown::paged_table(heart)
Each column in the given list corresponds to specific information in the dataset. Here’s a description of each column:
Age: Patient’s Age in years.Sex: Patient’s Gender. (M = Male, F =
Female)ChestPainType: Chest Pain type. (4 values: ATA,
NAP, ASY, TA)RestingBP: resting Blood Pressure. ( in mm Hg
)Cholesterol: Serum Cholesterol. ( in mg/dl
)FastingBS: Fasting Blood Sugar > 120 mg/dl.
(0 = True, 1 = False)RestingECG: resting Electroencephalographic result.
(values: Normal, ST, LVH)MaxHR: Maximum Heart Rate achieved.ExerciseAngina: Exercise induced Angina. (N =
No, Y = Yes)Oldpeak: ST Depression induced by Exercise relative to
rest.ST_Slope: Slope of the peak exercise ST segment.
(values: Up, Flat, Down)HeartDisease:: Heart Disease occured. (0 =
No, 1 = Yes)glimpse(heart)
## Rows: 918
## Columns: 12
## $ Age <int> 40, 49, 37, 48, 54, 39, 45, 54, 37, 48, 37, 58, 39, 49,…
## $ Sex <chr> "M", "F", "M", "F", "M", "M", "F", "M", "M", "F", "F", …
## $ ChestPainType <chr> "ATA", "NAP", "ATA", "ASY", "NAP", "NAP", "ATA", "ATA",…
## $ RestingBP <int> 140, 160, 130, 138, 150, 120, 130, 110, 140, 120, 130, …
## $ Cholesterol <int> 289, 180, 283, 214, 195, 339, 237, 208, 207, 284, 211, …
## $ FastingBS <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ RestingECG <chr> "Normal", "Normal", "ST", "Normal", "Normal", "Normal",…
## $ MaxHR <int> 172, 156, 98, 108, 122, 170, 170, 142, 130, 120, 142, 9…
## $ ExerciseAngina <chr> "N", "N", "N", "Y", "N", "N", "N", "N", "Y", "N", "N", …
## $ Oldpeak <dbl> 0.0, 1.0, 0.0, 1.5, 0.0, 0.0, 0.0, 0.0, 1.5, 0.0, 0.0, …
## $ ST_Slope <chr> "Up", "Flat", "Up", "Flat", "Up", "Up", "Up", "Up", "Fl…
## $ HeartDisease <int> 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1…
As pointed out earlier, our variable’s data type still doesn’t match its type. Some changes need to be made, namely:
Change some of the following columns to the factor
data type: Sex, ChestPainType,
FastingBS, RestingECG,
ExerciseAngina, and ST_Slope.
Change the value of the sex column where “M” stands for Male and “F” stands for Female
Change the value of the FastingBS column where
“0” becomes “< 120 mg/dl” and “1” becomes “> 120
mg/dl”
Changed the value of the ExerciseAngina column where
“N” becomes “No” and “Y” becomes
“Yes”
heart <-
heart %>%
mutate_at(vars(Sex, ChestPainType, FastingBS, RestingECG, ExerciseAngina, ST_Slope), as.factor) %>%
mutate(Sex = case_when(
Sex == "M" ~ "Male",
Sex == "F" ~ "Female"),
FastingBS = case_when(
FastingBS == "0" ~ "< 120 mg/dl",
FastingBS == "1" ~ "> 120 mg/dl"),
ExerciseAngina = case_when(
ExerciseAngina == "N" ~ "No",
ExerciseAngina == "Y" ~ "Yes"))
anyNA(heart)
## [1] FALSE
Our dataset does not contain any missing values.
We will assess the Pearson Correlation among our numerical variables to determine whether there exists any initial multicollinearity.
variable_num <- heart %>%
select(Age, RestingBP, Cholesterol, MaxHR, Oldpeak, HeartDisease)
format <- round(cor(variable_num),2)
get_lower_tri<-function(format){
format[upper.tri(format)] <- NA
return(format)
}
get_upper_tri <- function(format){
format[lower.tri(format)]<- NA
return(format)
}
lower_tri <- get_lower_tri(format)
heart_cor_melt <- reshape2::melt(lower_tri, na.rm = TRUE)
heart_cor_melt <- heart_cor_melt %>%
mutate(toltip = glue("{Var1} ~ {Var2}"))
plot_pm <- ggplot(heart_cor_melt,aes(Var1, Var2, text = toltip)) +
geom_tile(aes(fill = value)) +
geom_text(aes(label = round(value, 1)), alpha=0.5, size = 3, color = "White") +
scale_fill_gradientn(colors = c("#A8D8E0","#428793","#27636D"),
values = rescale(c(-1,0,1)),
limits = c(-1,1)) +
labs(x = NULL,
y = NULL,
fill = "Pearson Corr:") +
theme(legend.background = element_rect(fill = "#0E264E", color = "#0E264E"),
plot.background = element_rect(fill = "#0E264E", color = "#0E264E"),
panel.background = element_rect(fill = "#0E264E"),
panel.grid = element_line(colour = "#0E264E"),
panel.grid.major.x = element_line(colour = "#0E264E"),
panel.grid.minor.x = element_line(colour = "#0E264E"),
legend.title = element_text(colour = "#1B5249", face ="bold", family = "Times New Roman"),
legend.text = element_text(colour = "#1B5249", face ="bold", family = "Times New Roman"),
legend.justification = c(1, 0),
legend.position = c(0.6, 0.7),
legend.direction = "horizontal",
axis.text.x = element_text(color = "#1B5249", family = "Times New Roman",
angle = 45, vjust = 1, hjust = 1),
axis.text.y = element_blank(),
axis.ticks = element_blank()) +
guides(fill = guide_colorbar(barwidth = 7, barheight = 1,
title.position = "top", title.hjust = 0.5))
ggplotly(plot_pm, tooltip = "text") %>%
layout(hoverlabel = list( bgcolor = "rgba(255,255,255,0.75)",
font = list(
color = "Black",
family = "Cardo",
size = 12
)))
Based on the provided information, there is no evidence of multicollinearity as indicated by the absence of significant correlation coefficients exceeding a range of 0.4/-0.4.
Before doing modeling, we need to look at the proportion
of the target variable that we have in the
HeartDisease column.
proptarget <- paste(round(prop.table((table(heart$HeartDisease)))*100,2),"%")
proptarget_df <- data.frame(HeartDisease = c("No","Yes"), Prop = c(proptarget[1], proptarget[2]))
rmarkdown::paged_table(proptarget_df)
When viewed from the proportions of the two classes, it is quite balanced, so we don’t need additional pre-processing to balance the proportions between the two class target variables.
The Naive Bayes algorithm is a type of supervised learning algorithm that utilizes Bayes’ theorem to solve classification problems. It is considered one of the simplest and most efficient classification algorithms, enabling the creation of rapid machine learning models capable of making swift predictions.
To grasp the concept of Naive Bayes, we can refer to the following equation:
\[ P(A|B) = \frac{P(B|A)P(A)}{P(B)} \]
Information:
P(A) = Probability of A occurring without any
additional information (prior)
P(A∣B) = Probability of A occurring if it is known
that B has occurred (posterior)
Event A is often called a hypothesis
(target variable)
Event B is often called observation
(evidence)
Prior to delving into the modeling process, our initial step is to conduct Cross Validation on our dataset. Why is this necessary? The Cross Validation technique is employed to assess the effectiveness of machine learning algorithms in making predictions on unfamiliar data. In this particular scenario, we will allocate 80% of our dataset for training our model and reserve 20% for testing our predictions, in order to evaluate the performance of the model.
data_nb <- heart %>%
mutate(HeartDisease = as.factor(HeartDisease))
RNGkind(sample.kind = "Rounding")
set.seed(100)
index <- sample(nrow(data_nb), size = nrow(data_nb)*0.80)
train_nb <- data_nb[index,] #take 80%
test_nb <- data_nb[-index,] #take 20%
head(data_nb)
## Age Sex ChestPainType RestingBP Cholesterol FastingBS RestingECG MaxHR
## 1 40 Male ATA 140 289 < 120 mg/dl Normal 172
## 2 49 Female NAP 160 180 < 120 mg/dl Normal 156
## 3 37 Male ATA 130 283 < 120 mg/dl ST 98
## 4 48 Female ASY 138 214 < 120 mg/dl Normal 108
## 5 54 Male NAP 150 195 < 120 mg/dl Normal 122
## 6 39 Male NAP 120 339 < 120 mg/dl Normal 170
## ExerciseAngina Oldpeak ST_Slope HeartDisease
## 1 No 0.0 Up 0
## 2 No 1.0 Flat 1
## 3 No 0.0 Up 0
## 4 Yes 1.5 Flat 1
## 5 No 0.0 Up 0
## 6 No 0.0 Up 0
proptrain_nb <- paste(round(prop.table((table(train_nb$HeartDisease)))*100,2),"%")
proptrain_dfnb <- data.frame(HeartDisease = c("No","Yes"), Prop = c(proptrain_nb[1], proptrain_nb[2]))
rmarkdown::paged_table(proptrain_dfnb)
As demonstrated earlier, the class remains evenly distributed.
Once we have divided our data, our next step is to proceed with training our models using the train_nb dataset. Initially, we will utilize all variables available as independent variables (X).
All Predictor
model_nb <- naiveBayes(formula = HeartDisease~. , data = train_nb, laplace = 1)
model_nb
##
## Naive Bayes Classifier for Discrete Predictors
##
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
##
## A-priori probabilities:
## Y
## 0 1
## 0.4414169 0.5585831
##
## Conditional probabilities:
## Age
## Y [,1] [,2]
## 0 50.64198 9.757187
## 1 55.99756 8.659971
##
## Sex
## Y Female Male
## 0 0.33641975 0.66975309
## 1 0.09268293 0.91219512
##
## ChestPainType
## Y ASY ATA NAP TA
## 0 0.27134146 0.34451220 0.30792683 0.07621951
## 1 0.76086957 0.04589372 0.14734300 0.04589372
##
## RestingBP
## Y [,1] [,2]
## 0 130.6914 17.14653
## 1 134.1366 20.24703
##
## Cholesterol
## Y [,1] [,2]
## 0 225.6389 73.16369
## 1 172.2000 125.73428
##
## FastingBS
## Y < 120 mg/dl > 120 mg/dl
## 0 0.90740741 0.09876543
## 1 0.65609756 0.34878049
##
## RestingECG
## Y LVH Normal ST
## 0 0.1987768 0.6483180 0.1529052
## 1 0.2106538 0.5593220 0.2300242
##
## MaxHR
## Y [,1] [,2]
## 0 147.6358 23.12819
## 1 127.5146 23.55758
##
## ExerciseAngina
## Y No Yes
## 0 0.8611111 0.1450617
## 1 0.3902439 0.6146341
##
## Oldpeak
## Y [,1] [,2]
## 0 0.4021605 0.7077818
## 1 1.3292683 1.1790643
##
## ST_Slope
## Y Down Flat Up
## 0 0.03975535 0.18960245 0.77064220
## 1 0.09927361 0.73607748 0.16464891
💡 Insight :
The Flat ST Slope has the highest likelihood of experiencing a Heart Disease Condition (77,60%), whereas the UP ST Slope has a greater likelihood (77,06%) of not having a Heart Disease Condition compared to its counterparts.
Patients who experience Exercise Angina have a greater likelihood of developing a Heart Disease Condition (61,46%), whereas patients without Exercise Angina have an 86,11% higher chance of not developing a Heart Disease Condition.
A patient with a normal resting ECG has a higher likelihood of developing a heart disease condition (55,93%) compared to others, while they also have a 65,83%higher chance of not developing a heart disease condition.
Patients with a Fasting Blood Sugar level below 120mg have a greater likelihood of developing a Heart Disease Condition (65,60%), and are also 90,74% more likely to not develop a Heart Disease Condition compared to patients with a Fasting Blood Sugar level above 120mg.
ASY Chest Pain has the highest likelihood of indicating a Heart Disease Condition (76,08%) , whereas ATA Chest Pain has a higher likelihood (34,45%) of not indicating a Heart Disease Condition compared to other types of chest pain.
The likelihood of developing a heart disease condition is higher among males (91,21%), while females have a 66,97% lower probability of not developing a heart disease condition.
Using our model_nb, we will make predictions on our test_nb dataset.
naive_pred <- predict(model_nb, test_nb, type = "class")
naive_prob <- predict(model_nb, test_nb, type = "raw")
naive_pred_train <- predict(model_nb, train_nb, type = "class")
naive_table <- select(test_nb, HeartDisease) %>%
bind_cols(heart_pred = naive_pred) %>%
bind_cols(heart_e_prob = round(naive_prob[,1],4)) %>%
bind_cols(heart_p_prob = round(naive_prob[,2],4))
naive_table <- naive_table %>%
mutate(HeartDisease = as.factor(HeartDisease))
rmarkdown::paged_table(head(naive_table, 10))
Decision Tree Model is a relatively straightforward model based on trees that demonstrates strong and effective predictive capabilities. It generates a decision tree visualization that is easily interpretable. The objective of employing a Decision Tree is to construct a training model that can accurately predict the class or value of the target variable by learning uncomplicated decision rules derived from previous data, also known as training data.
data_dt <- heart %>%
mutate_if(is.character, as.factor)
RNGkind(sample.kind = "Rounding")
set.seed(123)
index <- sample(nrow(heart), size = nrow(heart)*0.80)
train_dt <- data_dt[index,] #take 80%
test_dt <- data_dt[-index,] #take 20%
proptrain_dt <- paste(round(prop.table((table(train_dt$HeartDisease)))*100,2),"%")
proptrain_dfdt <- data.frame(HeartDisease = c("No","Yes"), Prop = c(proptrain_dt[1], proptrain_dt[2]))
rmarkdown::paged_table(proptrain_dfdt)
As demonstrated earlier, the class remains evenly distributed.
To visualize the appearance of Decision
Tree, we will generate a tree plot for our model using the
fancyRPartPlot() function from the rpart.plot
packages. Initially, we will include all variables as
predictor variables.
model_dt <- rpart(formula = HeartDisease~., data = train_dt, method = "class")
plot_dtree <- rpart.plot(model_dt, cex = 0.8)
Our decision tree is composed of various
components. The topmost box represents the root node,
specifically associated with ST_Slope. The root node
divides and generates branches based on specific rules. Each branch
concludes with a node. Subsequently, these nodes further split into
additional nodes referred to as internal nodes. Nodes that no longer
divide are located at the bottom of our visualization, known as terminal
nodes or leaf nodes. In conclusion, every node displays the following
information: (Per Node)
Predicted class (Yes or No)
Predicted class probabilities
The proportion of observations within the node
Now that we have our model in place, we can proceed to make predictions using our test_dt.
dt_pred <- predict(model_dt, test_dt, type = "class")
dt_prob <- predict(model_dt, test_dt, type = "prob")
dt_pred_train <- predict(model_dt, train_dt, type = "class")
dt_table <- select(test_dt, HeartDisease) %>%
bind_cols(heart_pred = dt_pred) %>%
bind_cols(heart_e_prob = round(dt_prob[,1],4)) %>%
bind_cols(heart_p_prob = round(dt_prob[,2],4))
dt_table <- dt_table %>%
mutate(HeartDisease = as.factor(HeartDisease))
rmarkdown::paged_table(head(dt_table, 10))
To assess the performance, we will utilize confusion matrix and ROC/AUC function which are capable of producing:
con_mat_nb <- confusionMatrix(data = naive_table$heart_pred,
reference = naive_table$HeartDisease,
positive = "1")
con_mat_nb_t <- confusionMatrix(data = naive_pred_train,
reference = as.factor(train_nb$HeartDisease),
positive = "1")
performance_nb <- cbind.data.frame(Accuracy = c(con_mat_nb_t$overall[[1]], con_mat_nb$overall[[1]]),
Recall = c(con_mat_nb_t$byClass[[1]], con_mat_nb$byClass[[1]]),
Specificity = c(con_mat_nb_t$byClass[[2]], con_mat_nb$byClass[[2]]),
Precision = c(con_mat_nb_t$byClass[[3]], con_mat_nb$byClass[[3]]))
rownames(performance_nb) <- c("On Training Data", "On Unseen Data")
rmarkdown::paged_table(performance_nb)
There are minimal disparities in the Accuracy, Sensitivity, Specificity, and Precision scores between our Unseen and Training Data.
This indicates that there are no signs of Overfitting and Underfitting in our Naive Nayes Classifer Model.
roc_nb <- roc(predictor = naive_table$heart_e_prob,
response = naive_table$HeartDisease,
percent = TRUE)
df_nb <- data.frame(Specificity=roc_nb$specificities, Sensitivity=roc_nb$sensitivities)
plot_nb <- ggplot(data = df_nb, aes(x = Specificity, y = Sensitivity))+
with_outer_glow(geom_path(colour = '#1F6E8C', size = 1), colour = "#1B5249", sigma = 10, expand = 1)+
scale_x_reverse() +
geom_abline(intercept = 100, slope = 1, color="#1F6E8C", linetype = "longdash") +
annotate("text", x = 40, y = 40, label = paste0('AUC: ', round(roc_nb$auc,1), '%'), size = 4, color = "#1B5249", family = "Times New Roman")+
ylab('Recall (%)')+
xlab('Specificity (%)')+
scale_fill_gradient(low = "#1B5249", high = "#1F6E8C") +
scale_color_gradient(low = "#1B5249", high = "#1F6E8C") +
theme(legend.position = "none",
plot.background = element_rect(fill = "#0E264E", color = "#0E264E"),
panel.background = element_rect(fill = "#0E264E"),
panel.grid = element_line(colour = "white"),
panel.grid.major.x = element_line(colour = "white"),
panel.grid.minor.x = element_line(colour = "white"),
axis.text.x = element_text(color = "white", family = "Times New Roman", size = 14),
axis.text.y = element_text(color = "white", family = "Times New Roman", size = 14),
axis.title.x = element_text(color = "#1B5249", family = "Times New Roman", size = 14),
axis.title.y = element_text(color = "#1B5249", family = "Times New Roman", size = 14),
axis.ticks = element_blank())
ggarrange(plot_nb, nrow = 1, ncol = 1)
Our naive bayes model achieves an accuracy rate of accuracy of 88.55% when applied to new and unseen data, indicating that it correctly classifies 88.55% of the data.
The sensitivity value of style=“color:red”>sensitivity is 87.75% demonstrates the accurate classification of the majority of positive outcomes.
The specificity value stands at specificity is 89.53%, signifying accurate classification of the majority of negative outcomes.
The precision value stands at 90.52%, indicating that 90.52% of our positive predictions are accurate.
The ROC Curve we have obtained demonstrates a remarkable level of distinction with an AUC score of 92.6%. This indicates that our model has been verified to possess a high degree of effectiveness in distinguishing between the classes we are targeting.
con_mat_dt <- confusionMatrix(data = dt_table$heart_pred,
reference = dt_table$HeartDisease,
positive = "1")
con_mat_dt_train <- confusionMatrix(data = dt_pred_train,
reference = as.factor(train_dt$HeartDisease),
positive = "1")
performance_dt <- cbind.data.frame(Accuracy = c(con_mat_dt_train$overall[[1]], con_mat_dt$overall[[1]]),
Recall = c(con_mat_dt_train$byClass[[1]], con_mat_dt$byClass[[1]]),
Specificity = c(con_mat_dt_train$byClass[[2]], con_mat_dt$byClass[[2]]),
Precision = c(con_mat_dt_train$byClass[[3]], con_mat_dt$byClass[[3]]))
rownames(performance_dt) <- c("On Training Data", "On Unseen Data")
rmarkdown::paged_table(performance_dt)
There are minimal disparities in the Accuracy, Sensitivity, Specificity, and Precision scores between our Unseen and Training Data.
This indicates that there are no signs of Overfitting and Underfitting in our Decision Tree Model.
roc_dt <- roc(predictor = dt_table$heart_e_prob,
response = dt_table$HeartDisease,
levels = c("0", "1"),
percent = TRUE)
df_dt <- data.frame(Specificity=roc_dt$specificities, Sensitivity=roc_dt$sensitivities)
plot_dt <- ggplot(data = df_dt, aes(x = Specificity, y = Sensitivity))+
with_outer_glow(geom_path(colour = '#1F6E8C', size = 1), colour = "#1B5249", sigma = 10, expand = 1)+
scale_x_reverse() +
geom_abline(intercept = 100, slope = 1, color="#1F6E8C", linetype = "longdash") +
annotate("text", x = 40, y = 40, label = paste0('AUC: ', round(roc_dt$auc,1), '%'), size = 4, color = "#1B5249", family = "Times New Roman")+
ylab('Recall (%)')+
xlab('Specificity (%)')+
scale_fill_gradient(low = "#1B5249", high = "#1F6E8C") +
scale_color_gradient(low = "#1B5249", high = "#1F6E8C") +
theme(legend.position = "none",
plot.background = element_rect(fill = "#0E264E", color = "#0E264E"),
panel.background = element_rect(fill = "#0E264E"),
panel.grid = element_line(colour = "white"),
panel.grid.major.x = element_line(colour = "white"),
panel.grid.minor.x = element_line(colour = "white"),
axis.text.x = element_text(color = "white", family = "Times New Roman", size = 14),
axis.text.y = element_text(color = "white", family = "Times New Roman", size = 14),
axis.title.x = element_text(color = "#1B5249", family = "Times New Roman", size = 14),
axis.title.y = element_text(color = "#1B5249", family = "Times New Roman", size = 14),
axis.ticks = element_blank())
ggarrange(plot_dt, nrow = 1, ncol = 1)
Our decision tree model achieves an accuracy rate of accuracy of 82.06% when applied to new and unseen data, indicating that it correctly classifies 82.06% of the data.
The sensitivity value of sensitivity is 87.61% demonstrates the accurate classification of the majority of positive outcomes.
The specificity value stands at specificity is 74.68%, signifying accurate classification of the majority of negative outcomes.
The precision value stands at 90.52%, indicating that 82.14% of our positive predictions are accurate.
The ROC Curve we have obtained demonstrates a remarkable level of distinction with an AUC score of 90.7%. This indicates that our model has been verified to possess a high degree of effectiveness in distinguishing between the classes we are targeting.
Since we are discussing health diagnosis prediction, our goal is to minimize the error in predicting positive outcomes, specifically for patients with heart disease. Upon evaluating the performance of our models, we aim to select a model with a high sensitivity score.
Based on the two models we developed, the Naive Bayes model showed a higher sensitivity score of 88.77% compared to the Decision Tree model. However, if you look at it, it turns out that the difference between the two models is not too significant, the difference is very small and this can be interpreted that the two models have a good sensitivity score for later model use.
This concludes my attempt to classify heart disease using classification models. As time progresses, I will incorporate additional models that have not yet been included in this project. Nonetheless, I sincerely hope that this study will assist individuals dealing with similar cases in the future.
A work by Ahmad Fauzi
wrk.ahmadfauzi@gmail.com
with R Language