Introduction

For this project I chose to analyze a dataset on heart disease posted by user FEDESORIANO to kaggle.com. This is a very personal and important topic to me, as I have had family members pass away from heart attacks, and high blood pressure runs in my family. The dataset can be found here.

For this project, I trained several machine learning models that we studied in this course to see which could best predict heart failure based on the provided data. As with many serious medical conditions, early intervention can significantly improve patient outcomes, so successful use of machine learning to help identify those at risk of heart failure could literally be lifesaving.

With many Americans lacking adequate healthcare coverage, it is common for them to resort to emergency rooms or urgent care clinics for needed medical attention. Doctors at these clinics can become overwhelmed with the logistics of evaluating which patients are in need of further attention. If they are too cautious and give their full attention to any patient with risk factors, they may hit their capacity and not be able to provide needed treatment to those with urgent issues. On the other hand, sending a patient who does need further attention home prematurely can be a fatal error. Successful application of machine learning to predict heart failure can support doctors in this difficult decision process.

Data Management

Data Acquisition

For this project, I used the tidyverse to investigate and transform the data and the caret package to train the various machine learning models. A copy of the dataset is loaded from my GitHub repository.

# Load required libraries
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## 
## The following object is masked from 'package:purrr':
## 
##     lift
library(naivebayes)
## naivebayes 1.0.0 loaded
## For more information please visit: 
## https://majkamichal.github.io/naivebayes/
library(party)
## Loading required package: grid
## Loading required package: mvtnorm
## Loading required package: modeltools
## Loading required package: stats4
## Loading required package: strucchange
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## 
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## 
## Loading required package: sandwich
## 
## Attaching package: 'strucchange'
## 
## The following object is masked from 'package:stringr':
## 
##     boundary
## 
## 
## Attaching package: 'party'
## 
## The following object is masked from 'package:dplyr':
## 
##     where
library(randomForest)
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## 
## The following object is masked from 'package:dplyr':
## 
##     combine
## 
## The following object is masked from 'package:ggplot2':
## 
##     margin
library(xgboost)
## 
## Attaching package: 'xgboost'
## 
## The following object is masked from 'package:dplyr':
## 
##     slice
library(tidytext)

# Load data
heart_failure_url <- 'https://raw.githubusercontent.com/Marley-Myrianthopoulos/grad_school_data/refs/heads/main/heart.csv'
heart_failure_data <- read.csv(heart_failure_url)

# Set random seed
set.seed(31415)

Exploratory Data Analysis

# See a summary of the data
glimpse(heart_failure_data)
## Rows: 918
## Columns: 12
## $ Age            <int> 40, 49, 37, 48, 54, 39, 45, 54, 37, 48, 37, 58, 39, 49,…
## $ Sex            <chr> "M", "F", "M", "F", "M", "M", "F", "M", "M", "F", "F", …
## $ ChestPainType  <chr> "ATA", "NAP", "ATA", "ASY", "NAP", "NAP", "ATA", "ATA",…
## $ RestingBP      <int> 140, 160, 130, 138, 150, 120, 130, 110, 140, 120, 130, …
## $ Cholesterol    <int> 289, 180, 283, 214, 195, 339, 237, 208, 207, 284, 211, …
## $ FastingBS      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ RestingECG     <chr> "Normal", "Normal", "ST", "Normal", "Normal", "Normal",…
## $ MaxHR          <int> 172, 156, 98, 108, 122, 170, 170, 142, 130, 120, 142, 9…
## $ ExerciseAngina <chr> "N", "N", "N", "Y", "N", "N", "N", "N", "Y", "N", "N", …
## $ Oldpeak        <dbl> 0.0, 1.0, 0.0, 1.5, 0.0, 0.0, 0.0, 0.0, 1.5, 0.0, 0.0, …
## $ ST_Slope       <chr> "Up", "Flat", "Up", "Flat", "Up", "Up", "Up", "Up", "Fl…
## $ HeartDisease   <int> 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1…

The data set contains 918 observations of 11 predictor variables and 1 target variable (HeartDisease). There are six numeric predictor variables (Age, RestingBP, Cholesterol, FastingBS, MaxHR, and Oldpeak) and five categorical predictor variables (Sex, ChestPainType, RestingECG, ExerciseAngina, and ST_Slope).
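As a quick additional check (not part of the original output), the numeric predictors can be summarized on their own; this is a minimal sketch using dplyr's where() selector (output omitted):

# Summarize only the numeric columns of the dataset (output not shown)
heart_failure_data |>
  select(where(is.numeric)) |>
  summary()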

# Verify that there are no missing values
print(sum(is.na(heart_failure_data)))
## [1] 0

There are no missing values in this data set.

# Print the unique values and frequencies for each categorical column
print(table(heart_failure_data$Sex))
## 
##   F   M 
## 193 725
print(table(heart_failure_data$ChestPainType))
## 
## ASY ATA NAP  TA 
## 496 173 203  46
print(table(heart_failure_data$RestingECG))
## 
##    LVH Normal     ST 
##    188    552    178
print(table(heart_failure_data$ExerciseAngina))
## 
##   N   Y 
## 547 371
print(table(heart_failure_data$ST_Slope))
## 
## Down Flat   Up 
##   63  460  395

In examining the categorical columns, the following details emerged:

  • The ChestPainType and RestingECG variables are categorical without clear ordering in the categories.
  • The Sex and ExerciseAngina columns are binary.
  • The ST_Slope column has clear levels (“Down”, “Flat”, and “Up”).

Data Cleaning

The following steps are taken to clean the categorical predictors:

  • The ChestPainType and RestingECG variables are categorical without clear ordering in the categories, so these are converted to factor columns.

  • The Sex column is binary and is therefore converted to 0 (for “M”) or 1 (for “F”).

  • The ExerciseAngina column is binary and is therefore converted to 0 (for “N”) or 1 (for “Y”).

  • The ST_Slope column has clear levels (“Down”, “Flat”, and “Up”) and is therefore converted to an ordinal variable.

# Convert binary columns to values of 0 and 1
heart_failure_cleaned <- heart_failure_data |>
  mutate(Sex = case_when(Sex == 'M' ~ 0, Sex == 'F' ~ 1)) |>
  mutate(ExerciseAngina = case_when(ExerciseAngina == 'N' ~ 0, ExerciseAngina == 'Y' ~ 1))

# Convert categorical columns to factors
heart_failure_cleaned$ChestPainType <- factor(heart_failure_cleaned$ChestPainType)
heart_failure_cleaned$RestingECG <- factor(heart_failure_cleaned$RestingECG)

# Convert ordinal column
heart_failure_cleaned$ST_Slope <- factor(heart_failure_cleaned$ST_Slope, levels = c('Down', 'Flat', 'Up'), ordered = TRUE)

The response variable is converted to a factor with levels “N” and “Y”. Character labels are used instead of 0 and 1 because caret requires factor levels that are valid R variable names when class probabilities are computed.

# Convert response variable to "N" for 0 or "Y" for 1
heart_failure_cleaned <- heart_failure_cleaned |>
  mutate(HeartDisease = case_when(HeartDisease == 0 ~ 'N', HeartDisease == 1 ~ 'Y'))

heart_failure_cleaned$HeartDisease <- factor(heart_failure_cleaned$HeartDisease)

Data Pre-Processing

The data is now split into training and test sets, with 20% of the observations held out for testing. Because createDataPartition samples within the levels of the response variable, the split preserves the proportion of HeartDisease classes in both sets.

# Create train/test split
data_split <- createDataPartition(heart_failure_cleaned$HeartDisease, p = 0.8, list = FALSE)
heart_training <- heart_failure_cleaned[data_split,]
heart_test <- heart_failure_cleaned[-data_split,]

# Verify that the training set has 80% of the original data and the testing set has 20%
print(nrow(heart_training) / nrow(heart_failure_cleaned))
## [1] 0.8006536
print(nrow(heart_test) / nrow(heart_failure_cleaned))
## [1] 0.1993464

The numeric predictor variables are now centered and scaled. The centering and scaling parameters are estimated from the training set only and then applied to both sets, which avoids leaking information from the test set into the preprocessing step.

pre_processing <- preProcess(heart_training, method = c("center", "scale"))

heart_training_proc <- predict(pre_processing, heart_training)
heart_test_proc <- predict(pre_processing, heart_test)
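As a quick sanity check (not part of the original analysis), the numeric columns of the processed training set should now have a mean of roughly 0 and a standard deviation of roughly 1:

# Verify the centering and scaling: each numeric column should have mean ~0, sd ~1
heart_training_proc |>
  summarise(across(where(is.numeric), list(mean = mean, sd = sd))) |>
  glimpse()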

Model Training

Each model was trained using 5-fold cross-validation. Five folds were used instead of 10 due to the relatively small size of the data set, to ensure that each fold had a reasonable number of observations for assessing model performance. To reduce the variance of the resulting performance estimates, the cross-validation was repeated 3 times. The metric used to select the optimal model was Recall (sensitivity), since when diagnosing life-threatening health conditions a false negative is a much more serious error than a false positive. A tuning grid was used to test different parameter values for each of the models.
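For reference, recall (which caret labels sensitivity, hence metric = 'Sens' in the training calls below) and precision are defined in terms of true positives (TP), false negatives (FN), and false positives (FP) as

\[ \text{Recall} = \frac{TP}{TP + FN}, \qquad \text{Precision} = \frac{TP}{TP + FP}, \]

so selecting models by recall directly penalizes false negatives.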

# Establish training rules for machine learning models
training_rules <- trainControl(method = 'repeatedcv',
                               repeats = 3,
                               number = 5,
                               classProbs = TRUE,
                               summaryFunction = twoClassSummary)

Logistic Regression

The glm method in caret has no tuning parameters.

# Train model
log_reg_model <- train(HeartDisease ~ .,
                       data = heart_training_proc,
                       method = 'glm',
                       trControl = training_rules,
                       metric = 'Sens')
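Although there is nothing to tune, the fitted coefficients can still be examined if desired; caret stores the underlying glm fit in the finalModel slot of the trained object (output omitted here):

# Inspect the fitted logistic regression coefficients (output not shown)
summary(log_reg_model$finalModel)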

Naive Bayes

For the Naive Bayes model, I tried Laplace smoothing values of 0, 0.25, 0.5, 0.75, and 1 (the laplace parameter in the tuning grid below).
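As background (a standard definition rather than anything specific to this package), Laplace smoothing with parameter \(\alpha\) adds a pseudo-count to the frequency estimates used for categorical features, so that a category never observed for a class still receives a non-zero probability:

\[ \hat{P}(x = v \mid c) = \frac{N_{v,c} + \alpha}{N_c + \alpha k}, \]

where \(N_{v,c}\) is the number of training observations of class \(c\) with feature value \(v\), \(N_c\) is the total number of observations of class \(c\), and \(k\) is the number of possible values of the feature.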

# Define hyper-parameter tuning values
nb_grid <- expand.grid(laplace = c(0, 0.25, 0.5, 0.75, 1),
                       usekernel = FALSE,
                       adjust = 1)

# Train model
nb_model <- train(HeartDisease ~ .,
                  data = heart_training_proc,
                  method = 'naive_bayes',
                  tuneGrid = nb_grid,
                  trControl = training_rules,
                  metric = 'Sens')

Decision Tree

For the Decision Tree model, I tried maximum tree depths of 5, 10, 15, and 20, with the splitting criterion (mincriterion) held fixed at 0.05.

# Define hyper-parameter tuning values
tree_grid <- expand.grid(maxdepth = c(5, 10, 15, 20),
                         mincriterion = 0.05)

# Train model
tree_model <- train(HeartDisease ~ .,
                    data = heart_training_proc,
                    method = 'ctree2',
                    tuneGrid = tree_grid,
                    trControl = training_rules,
                    metric = 'Sens')
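If desired, the tree selected by cross-validation can be visualized; caret stores the fitted conditional inference tree in the finalModel slot (figure not shown here):

# Plot the final conditional inference tree (figure not shown)
plot(tree_model$finalModel)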

Random Forest

For the Random Forest model, I tried mtry values of 2, 5, 8, and 11, where mtry is the number of predictors randomly sampled as split candidates at each node.

# Define hyper-parameter tuning values
rf_grid <- expand.grid(mtry = c(2, 5, 8, 11))

# Train model
rf_model <- train(HeartDisease ~ .,
                  data = heart_training_proc,
                  method = 'rf',
                  tuneGrid = rf_grid,
                  trControl = training_rules,
                  metric = 'Sens')

XGBoost

For the XGBoost model, I tried the following hyper-parameter combinations:

  • Boosting iterations of 50, 100, 150, and 200.
  • Max tree depths of 2, 4, 6, 8, and 10.
  • Learning rates of 0.01, 0.1, 0.5, and 1.
  • Feature subsample ratios of 0.25, 0.5, 0.75, and 1.
  • Observation subsample ratios of 0.25, 0.5, 0.75, and 1.
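This grid contains 4 × 5 × 4 × 4 × 4 = 1,280 hyper-parameter combinations, and with 5-fold cross-validation repeated 3 times each combination is fit 15 times, making XGBoost by far the most computationally expensive model to tune in this project.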
# Define hyper-parameter tuning values
xgb_grid <- expand.grid(nrounds = c(50, 100, 150, 200),
                        max_depth = c(2, 4, 6, 8, 10),
                        eta = c(0.01, 0.1, 0.5, 1),
                        gamma = 0,
                        colsample_bytree = c(0.25, 0.5, 0.75, 1),
                        min_child_weight = 1,
                        subsample = c(0.25, 0.5, 0.75, 1))

# Train model
xgb_model <- train(HeartDisease ~ .,
                   data = heart_training_proc,
                   method = 'xgbTree',
                   tuneGrid = xgb_grid,
                   trControl = training_rules,
                   metric = 'Sens',
                   verbosity = 0)

Support Vector Machine

For the SVM model, I tried cost (C) values of 0.01, 0.1, 1, 10, and 100; smaller values of C impose stronger regularization, favoring a wider margin at the expense of more misclassified training points.

# Define hyper-parameter tuning values
svm_grid <- expand.grid(C = c(0.01, 0.1, 1, 10, 100))

# Train model
svm_model <- train(HeartDisease ~ .,
                    data = heart_training_proc,
                    method = 'svmLinear',
                    tuneGrid = svm_grid,
                    trControl = training_rules,
                    metric = 'Sens')

Neural Network

For the neural network model, I tried the following hyper-parameter combinations:

  • Number of hidden units of 1, 2, 3, 5, and 10.
  • Weight decay values of 0.001, 0.01, and 0.1.
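The nnet method fits a single-hidden-layer feed-forward network, so this grid covers 5 × 3 = 15 combinations of hidden-layer size and regularization strength.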
# Define hyper-parameter tuning values
nnet_grid <- expand.grid(size = c(1, 2, 3, 5, 10),
                         decay = c(0.001, 0.01, 0.1))

# Train neural network model
nnet_model <- train(HeartDisease ~ .,
                    data = heart_training_proc,
                    method = 'nnet',
                    tuneGrid = nnet_grid,
                    trControl = training_rules,
                    metric = 'Sens',
                    trace = FALSE)

Results Comparison

Each model is used to predict the response variable for each observation in the test set. Summary metrics for each model’s performance are recorded and compared below.

# Make a list of all of the models
model_list <- list(log_reg_model, nb_model, tree_model, rf_model, xgb_model, svm_model, nnet_model)

# Make empty vectors to hold performance metrics
acc_list <- c()
prec_list <- c()
rec_list <- c()
f1_list <- c()

# Use each model to generate predictions on the test set and report results
for (i in seq_along(model_list)) {
  model <- model_list[[i]]
  predictions <- predict(model, heart_test_proc)
  conf_matrix <- confusionMatrix(predictions, heart_test_proc$HeartDisease, positive = "Y")
  acc_list <- append(acc_list, conf_matrix$overall[['Accuracy']])
  prec_list <- append(prec_list, conf_matrix$byClass[['Precision']])
  rec_list <- append(rec_list, conf_matrix$byClass[['Recall']])
  f1_list <- append(f1_list, conf_matrix$byClass[['F1']])
}

# Report model metrics in a data frame
model_summary_df <- data.frame(Model = c('Logistic Regression',
                                         'Naive Bayes',
                                         'Decision Tree',
                                         'Random Forest',
                                         'XGBoost',
                                         'Support Vector Machine',
                                         'Neural Net'),
                               Accuracy = as.numeric(acc_list),
                               Precision = as.numeric(prec_list),
                               Recall = as.numeric(rec_list),
                               F1Score = as.numeric(f1_list)) |> arrange(desc(Recall))

print(model_summary_df)
##                    Model  Accuracy Precision    Recall   F1Score
## 1 Support Vector Machine 0.9071038 0.8888889 0.9504950 0.9186603
## 2            Naive Bayes 0.9071038 0.8962264 0.9405941 0.9178744
## 3          Random Forest 0.9180328 0.9134615 0.9405941 0.9268293
## 4             Neural Net 0.8852459 0.8636364 0.9405941 0.9004739
## 5    Logistic Regression 0.8852459 0.8703704 0.9306931 0.8995215
## 6          Decision Tree 0.8907104 0.8785047 0.9306931 0.9038462
## 7                XGBoost 0.8743169 0.8900000 0.8811881 0.8855721
# Convert the data frame to long format
long_model_summary <- model_summary_df |>
  pivot_longer(!Model,
               names_to = 'Metric',
               values_to = 'Score')

# Plot model performance by metric
ggplot(long_model_summary, aes(x = reorder_within(Model, Score, Metric), y = Score, fill = Model)) +
  geom_col() +
  coord_flip() +
  facet_wrap(~Metric, scales = 'free_y') +
  theme(axis.text.y = element_blank()) + 
  labs(title = 'Model Performance Metrics',
       x = 'Model',
       y = 'Score')

The Random Forest model had the best overall Accuracy and F1-Score, with almost 92% of test observations classified correctly. However, the Support Vector Machine model was a close second on both Accuracy and F1-Score. The SVM model had a lower Precision than the Random Forest model (more false positives relative to true positives) but a higher Recall (fewer false negatives).

# Plot a comparison of the Random Forest model and SVM model across all metrics
rf_svm_comp <- model_summary_df |>
  filter(Model %in% c('Random Forest', 'Support Vector Machine')) |>
  pivot_longer(!Model,
               names_to = 'Metric',
               values_to = 'Score')

ggplot(rf_svm_comp, aes(x = Metric, y = Score, fill = Model)) + 
  geom_bar(position = 'dodge', stat = 'identity') + 
  scale_x_discrete(limits = rev) + 
  ylim(0,1) +
  labs(title = 'Support Vector Machine vs. Random Forest Model Metrics') +
  geom_text(aes(label = round(Score, 2)), position = position_dodge(width = .9), vjust = .5, hjust = 1.5) + 
  coord_flip()

The confusion matrices for these two models are examined below.

# Generate confusion matrix for the random forest model
rf_predictions <- predict(rf_model, heart_test_proc)
print(confusionMatrix(rf_predictions, heart_test_proc$HeartDisease, positive = "Y"))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  N  Y
##          N 73  6
##          Y  9 95
##                                           
##                Accuracy : 0.918           
##                  95% CI : (0.8684, 0.9534)
##     No Information Rate : 0.5519          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.8337          
##                                           
##  Mcnemar's Test P-Value : 0.6056          
##                                           
##             Sensitivity : 0.9406          
##             Specificity : 0.8902          
##          Pos Pred Value : 0.9135          
##          Neg Pred Value : 0.9241          
##              Prevalence : 0.5519          
##          Detection Rate : 0.5191          
##    Detection Prevalence : 0.5683          
##       Balanced Accuracy : 0.9154          
##                                           
##        'Positive' Class : Y               
## 
# Generate confusion matrix for the SVM model
svm_predictions <- predict(svm_model, heart_test_proc)
print(confusionMatrix(svm_predictions, heart_test_proc$HeartDisease, positive = "Y"))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  N  Y
##          N 70  5
##          Y 12 96
##                                           
##                Accuracy : 0.9071          
##                  95% CI : (0.8554, 0.9449)
##     No Information Rate : 0.5519          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.8107          
##                                           
##  Mcnemar's Test P-Value : 0.1456          
##                                           
##             Sensitivity : 0.9505          
##             Specificity : 0.8537          
##          Pos Pred Value : 0.8889          
##          Neg Pred Value : 0.9333          
##              Prevalence : 0.5519          
##          Detection Rate : 0.5246          
##    Detection Prevalence : 0.5902          
##       Balanced Accuracy : 0.9021          
##                                           
##        'Positive' Class : Y               
## 

As seen in the confusion matrices, the Random Forest model flagged slightly fewer patients as positive than the Support Vector Machine model, resulting in 3 fewer false positives but one more false negative. Given the potentially catastrophic consequences of a false negative when predicting heart disease, I would be inclined under the circumstances to select the SVM model as the most appropriate. If the SVM model had produced far more false positives than the Random Forest model, that trade-off might be worth reconsidering for the sake of practicality; since the difference is only three false positives, and the SVM model's Accuracy and F1-Score were second only to the Random Forest model, I believe it is the right choice.

# Report model parameters for the SVM model
print(svm_model$bestTune)
##     C
## 2 0.1

The best tune of the Support Vector Machine model used \(C = 0.1\), a relatively small cost value corresponding to stronger regularization.

# Report feature importance for the SVM model
plot(varImp(svm_model))

It is a little surprising that Cholesterol and RestingBP are the weakest predictors of heart disease in this model, as both are frequently cited as major risk factors.
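One possible explanation worth checking (not something examined in the original analysis) is whether these columns contain physiologically implausible zero values standing in for missing measurements, which would weaken their apparent relationship with the response:

# Count entries recorded as 0 in Cholesterol and RestingBP, since a serum
# cholesterol or resting blood pressure of 0 is not physiologically plausible
heart_failure_data |>
  summarise(zero_cholesterol = sum(Cholesterol == 0),
            zero_resting_bp = sum(RestingBP == 0))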

Conclusions

Of all the models tested, the Support Vector Machine model would be the most valuable for predicting instances of heart disease in patients, as it almost always correctly identifies patients with heart disease while still doing a very good job of identifying patients who do not require further interventions. Although balancing the need to avoid false positives and false negatives is difficult, in this context it is clear that false negatives are a much more costly error. By avoiding false negatives better than any other model while still limiting false positives (resulting in an excellent F1-Score), the SVM model distinguished itself as the best choice. The Random Forest model would also be an excellent alternative, due to its strong performance across all metrics (including the highest overall Accuracy, F1-Score, and Precision).