Introduction

In this RPubs article, I will be building and evaluating several ML classification models using data from the Heart Failure Prediction Dataset available on Kaggle.

This page will be split into several sections. The first will cover data description and preparation, and the following will be for each model.

I start with a random forest model and will update with more models periodically.

Part One: Data Description, Processing, and Preparation

STEP ONE: LOADING LIBRARIES AND DATA

We will make use of six different R libraries for our initial analysis and data modeling. First, tidymodels is a collection of packages for modeling and machine learning using tidyverse principles. The workflows library lets us use workflow container objects that are useful for modeling and analysis projects. The tune library will let us iterate through and find the best parameters for our models. The ranger library gives us access to a fast and efficient random forest algorithm. Lastly, ggplot2 lets us make convenient and nice-looking plots.

library(tidymodels)
library(tidyverse)
library(workflows)
library(tune)
library(ranger)
library(ggplot2)

orig_data <- RKaggle::get_dataset('fedesoriano/heart-failure-prediction')

STEP TWO: PRELIMINARY EDA, DATA IMPUTATION

Let’s take a glance at our data.

glimpse(orig_data)
## Rows: 918
## Columns: 12
## $ Age            <dbl> 40, 49, 37, 48, 54, 39, 45, 54, 37, 48, 37, 58, 39, 49,…
## $ Sex            <chr> "M", "F", "M", "F", "M", "M", "F", "M", "M", "F", "F", …
## $ ChestPainType  <chr> "ATA", "NAP", "ATA", "ASY", "NAP", "NAP", "ATA", "ATA",…
## $ RestingBP      <dbl> 140, 160, 130, 138, 150, 120, 130, 110, 140, 120, 130, …
## $ Cholesterol    <dbl> 289, 180, 283, 214, 195, 339, 237, 208, 207, 284, 211, …
## $ FastingBS      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ RestingECG     <chr> "Normal", "Normal", "ST", "Normal", "Normal", "Normal",…
## $ MaxHR          <dbl> 172, 156, 98, 108, 122, 170, 170, 142, 130, 120, 142, 9…
## $ ExerciseAngina <chr> "N", "N", "N", "Y", "N", "N", "N", "N", "Y", "N", "N", …
## $ Oldpeak        <dbl> 0.0, 1.0, 0.0, 1.5, 0.0, 0.0, 0.0, 0.0, 1.5, 0.0, 0.0, …
## $ ST_Slope       <chr> "Up", "Flat", "Up", "Flat", "Up", "Up", "Up", "Up", "Fl…
## $ HeartDisease   <dbl> 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1…

So, our data has 12 columns. Note that our output column (HeartDisease) is in 0s and 1s, representing negative and positive for heart disease, respectively. This is what we want to predict.

We have two different column types, double and character. Let’s take a look at the character variables in the dataset.

orig_data %>% select(where(is.character)) %>% apply(2, table)
## $Sex
## 
##   F   M 
## 193 725 
## 
## $ChestPainType
## 
## ASY ATA NAP  TA 
## 496 173 203  46 
## 
## $RestingECG
## 
##    LVH Normal     ST 
##    188    552    178 
## 
## $ExerciseAngina
## 
##   N   Y 
## 547 371 
## 
## $ST_Slope
## 
## Down Flat   Up 
##   63  460  395

Let’s go through these variables. The labeling of each variable is available on the Kaggle Data Card.

Sex is skewed towards males; this might be something worth considering when inferring from this dataset.

There are four different chest pain types (ChestPainType) - asymptomatic (ASY), atypical angina (ATA), non-anginal pain (NAP), and typical angina (TA).

RestingECG is either LVH (showing probable or definite left ventricular hypertrophy by Estes’ criteria), Normal, or ST (having ST-T wave abnormality - T wave inversions and/or ST elevation or depression of > 0.05 mV).

ExerciseAngina indicates whether or not the patient has exercise-induced angina.

Finally, ST_Slope indicates the slope of the peak exercise ST segment (Up: upsloping, Flat: flat, Down: downsloping).

Let’s make a new, cleaned data frame where we convert all character features and the output feature to factors.

finalData <- orig_data %>% mutate(HeartDisease=as.factor(HeartDisease)) %>% 
    mutate(across(where(is.character), as.factor))

Now, let’s take a look at the non-factor variables.

(Note: figuring out how to loop through and graph numeric variables was a huge pain! Shout-out to ggplot for making visualization easier).

To make these graphs, we first select the relevant variables, then work the data frame into two columns (this will help us separate by each variable). Then, we can make a plot with ggplot to visualize the distribution of our numeric variables.

orig_data %>% select(where(is.double)) %>% pivot_longer(cols=everything()) %>% 
    ggplot(aes(x=value, fill=name)) + 
    facet_wrap(~ name, scales = 'free') + geom_histogram(bins=30)

It seems like Age, maxHR, and RestingBP are approximately normally distributed. RestingBP seemingly has an outlier at 0 (a dead person was counted in the data, probably). Cholesterol has a spike at 0, corresponding to missing data. We can probably leave Oldpeak as it is. Lastly, FastingBS is another binary data type, so we should convert it to factor so it is better represented in our data.

finalData <- finalData %>% mutate(FastingBS=as.factor(FastingBS))

finalData <- finalData %>% mutate(across(Cholesterol, function(i) {
        if_else(i==0, as.numeric(NA), i)
    }))

Lastly, let’s check if any our variables are correlated; if they are, they can increase redundancy and make the model more complex, therefore increasing the risk of overfitting.

finalData %>% select(where(is.numeric)) %>% cor
##                    Age  RestingBP Cholesterol      MaxHR    Oldpeak
## Age          1.0000000  0.2543994          NA -0.3820447  0.2586115
## RestingBP    0.2543994  1.0000000          NA -0.1121350  0.1648030
## Cholesterol         NA         NA           1         NA         NA
## MaxHR       -0.3820447 -0.1121350          NA  1.0000000 -0.1606906
## Oldpeak      0.2586115  0.1648030          NA -0.1606906  1.0000000

Thankfully, none of our features are highly correlated. The highest (absolute) correlation is between Age and Maximum Heart Rate, which makes sense. But we can ignore it for now.

Part Two: Random Forest Model

STEP ONE: CREATING THE MODEL FRAMEWORK

Okay, now we have to split our original dataset into the training and testing subparts. Tidymodels provides a simple framework to do this with initial_split(). We also make a cross validation object which will come in handy later.

set.seed(12345)
dataSplit <- initial_split(finalData, prop = 3/4)
dataTraining <- training(dataSplit)
dataTesting <- testing(dataSplit)
dataCV <- vfold_cv(dataTraining)
dataCV
## #  10-fold cross-validation 
## # A tibble: 10 × 2
##    splits           id    
##    <list>           <chr> 
##  1 <split [619/69]> Fold01
##  2 <split [619/69]> Fold02
##  3 <split [619/69]> Fold03
##  4 <split [619/69]> Fold04
##  5 <split [619/69]> Fold05
##  6 <split [619/69]> Fold06
##  7 <split [619/69]> Fold07
##  8 <split [619/69]> Fold08
##  9 <split [620/68]> Fold09
## 10 <split [620/68]> Fold10

Okay, great. So we have two new datasets selected at random (dataTraining and dataTesting) and a cross-validation object. Now, we have to create the framework of the model. We’ll do this with the recipe() function from tidymodels. Crucially, we specify the model to normalize all numeric columns (subtract the mean of each column and divide by its standard deviation) and impute the missing values in the Cholesterol column using the k-nearest neighbors method.

dataRecipe <- recipe(HeartDisease ~ ., data=finalData) %>% 
    step_normalize(all_numeric()) %>% step_impute_knn(Cholesterol)

So now we have our “recipe” to make the model; it’s time to follow through. The rand_forest() function initializes a random forest model, then we set parameters with set_args(). We set mtry (the number of variables evaluated per tree) and min_n (the minimum number of observations needed in a node to split the node further - the lower the value, the deeper the tree) to tune() - this will allow us to iterate through these values on our cross validation datasets. The importance parameter says that we will define the importance of each variable by permutation (I will explain what this is later). Lastly, we tell the model that we want classification (0 or 1) instead of regression (continuous).

rfModel <- rand_forest() %>% set_args(mtry=tune(), min_n=tune()) %>% 
    set_engine('ranger', importance = 'permutation') %>% 
    set_mode('classification')

We make a workflow object with our recipe and our model. Let’s take a look at it:

rfWorkflow <- workflow() %>% add_recipe(dataRecipe) %>% add_model(rfModel)
rfWorkflow
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: rand_forest()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 2 Recipe Steps
## 
## • step_normalize()
## • step_impute_knn()
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## Random Forest Model Specification (classification)
## 
## Main Arguments:
##   mtry = tune()
##   min_n = tune()
## 
## Engine-Specific Arguments:
##   importance = permutation
## 
## Computational engine: ranger

STEP TWO: TESTING DIFFERENT PARAMETERS

The reason I tune()-d the mtry and min_n parameters is because these might significantly influence the accuracy and dependability of our model. Now that we have all the required frameworks, we can test parameter values.

# Good mtry is sqrt(total vars) = sqrt(11) = 3
rfGrid <- expand.grid(mtry=2:6, min_n=c(1, 3, 5, 10, 15, 20, 30))
rfTuneResults <- rfWorkflow %>% tune_grid(resamples=dataCV, grid=rfGrid, 
    metrics=metric_set(accuracy, roc_auc))
rfTuneResults %>% collect_metrics()
## # A tibble: 70 × 8
##     mtry min_n .metric  .estimator  mean     n std_err .config              
##    <int> <dbl> <chr>    <chr>      <dbl> <int>   <dbl> <chr>                
##  1     2     1 accuracy binary     0.862    10 0.0109  Preprocessor1_Model01
##  2     2     1 roc_auc  binary     0.922    10 0.00802 Preprocessor1_Model01
##  3     3     1 accuracy binary     0.855    10 0.0101  Preprocessor1_Model02
##  4     3     1 roc_auc  binary     0.922    10 0.00774 Preprocessor1_Model02
##  5     4     1 accuracy binary     0.855    10 0.0107  Preprocessor1_Model03
##  6     4     1 roc_auc  binary     0.919    10 0.00756 Preprocessor1_Model03
##  7     5     1 accuracy binary     0.852    10 0.0108  Preprocessor1_Model04
##  8     5     1 roc_auc  binary     0.916    10 0.00760 Preprocessor1_Model04
##  9     6     1 accuracy binary     0.853    10 0.0102  Preprocessor1_Model05
## 10     6     1 roc_auc  binary     0.917    10 0.00719 Preprocessor1_Model05
## # ℹ 60 more rows

I’m interested in the effect of mtry. Let’s take a look across various min_n:

rfTuneResults %>% collect_metrics() %>% filter(min_n %in% c(1,3,30), .metric=='accuracy')
## # A tibble: 15 × 8
##     mtry min_n .metric  .estimator  mean     n std_err .config              
##    <int> <dbl> <chr>    <chr>      <dbl> <int>   <dbl> <chr>                
##  1     2     1 accuracy binary     0.862    10 0.0109  Preprocessor1_Model01
##  2     3     1 accuracy binary     0.855    10 0.0101  Preprocessor1_Model02
##  3     4     1 accuracy binary     0.855    10 0.0107  Preprocessor1_Model03
##  4     5     1 accuracy binary     0.852    10 0.0108  Preprocessor1_Model04
##  5     6     1 accuracy binary     0.853    10 0.0102  Preprocessor1_Model05
##  6     2     3 accuracy binary     0.861    10 0.00945 Preprocessor1_Model06
##  7     3     3 accuracy binary     0.859    10 0.00886 Preprocessor1_Model07
##  8     4     3 accuracy binary     0.850    10 0.00834 Preprocessor1_Model08
##  9     5     3 accuracy binary     0.855    10 0.0114  Preprocessor1_Model09
## 10     6     3 accuracy binary     0.855    10 0.0112  Preprocessor1_Model10
## 11     2    30 accuracy binary     0.862    10 0.0103  Preprocessor1_Model31
## 12     3    30 accuracy binary     0.859    10 0.0103  Preprocessor1_Model32
## 13     4    30 accuracy binary     0.859    10 0.0106  Preprocessor1_Model33
## 14     5    30 accuracy binary     0.858    10 0.0109  Preprocessor1_Model34
## 15     6    30 accuracy binary     0.856    10 0.00972 Preprocessor1_Model35

Comparing mtry’s for each min_n, it seems like there is a small, weak negative correlation between mtry and model accuracy. However, min_n seems a lot more interesting. Let’s keep experimenting. So, let’s select the smallest mtry (2 and let’s try 1 as well) and try a larger range of min_n.

rfGrid2 <- expand.grid(mtry=1:2, min_n=c(1, 3, 5, seq(5, 50, 5)))
rfTuneResults2 <- rfWorkflow %>% tune_grid(resamples=dataCV, grid=rfGrid2, 
    metrics=metric_set(accuracy, roc_auc))
## Warning: Duplicate rows in grid of tuning combinations found and removed.
rfTuneResults2 %>% collect_metrics() %>% filter(.metric=='accuracy') %>% 
    arrange(desc(mean))
## # A tibble: 24 × 8
##     mtry min_n .metric  .estimator  mean     n std_err .config              
##    <int> <dbl> <chr>    <chr>      <dbl> <int>   <dbl> <chr>                
##  1     2    20 accuracy binary     0.866    10 0.0111  Preprocessor1_Model12
##  2     2     1 accuracy binary     0.866    10 0.0112  Preprocessor1_Model02
##  3     2     3 accuracy binary     0.866    10 0.00957 Preprocessor1_Model04
##  4     2    10 accuracy binary     0.866    10 0.0118  Preprocessor1_Model08
##  5     1    10 accuracy binary     0.863    10 0.0128  Preprocessor1_Model07
##  6     2    35 accuracy binary     0.863    10 0.00966 Preprocessor1_Model18
##  7     1     3 accuracy binary     0.862    10 0.0114  Preprocessor1_Model03
##  8     2    15 accuracy binary     0.861    10 0.00779 Preprocessor1_Model10
##  9     2    40 accuracy binary     0.861    10 0.0106  Preprocessor1_Model20
## 10     1    15 accuracy binary     0.859    10 0.0129  Preprocessor1_Model09
## # ℹ 14 more rows
rfTuneResults2 %>% collect_metrics() %>% filter(.metric=='roc_auc') %>% 
    arrange(desc(mean))
## # A tibble: 24 × 8
##     mtry min_n .metric .estimator  mean     n std_err .config              
##    <int> <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
##  1     2    50 roc_auc binary     0.927    10 0.00658 Preprocessor1_Model24
##  2     2    20 roc_auc binary     0.927    10 0.00721 Preprocessor1_Model12
##  3     2    25 roc_auc binary     0.926    10 0.00713 Preprocessor1_Model14
##  4     1    20 roc_auc binary     0.926    10 0.00831 Preprocessor1_Model11
##  5     2    35 roc_auc binary     0.926    10 0.00706 Preprocessor1_Model18
##  6     2    45 roc_auc binary     0.926    10 0.00708 Preprocessor1_Model22
##  7     2    10 roc_auc binary     0.926    10 0.00742 Preprocessor1_Model08
##  8     2    30 roc_auc binary     0.926    10 0.00712 Preprocessor1_Model16
##  9     1     5 roc_auc binary     0.926    10 0.00815 Preprocessor1_Model05
## 10     1    25 roc_auc binary     0.926    10 0.00809 Preprocessor1_Model13
## # ℹ 14 more rows

mtry=2, min_n=20 comes in at first place for accuracy, first (tied) for roc_auc, and has low standard errors. It’s safe to say that this model is our winner.

Now, let’s apply it to our workflow.

paramFinal <- rfTuneResults2 %>% select_best(metric='accuracy')
rfWorkflow <- rfWorkflow %>% finalize_workflow(paramFinal)

STEP THREE: FIT THE MODEL

So now we have our workflow:

rfWorkflow
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: rand_forest()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 2 Recipe Steps
## 
## • step_normalize()
## • step_impute_knn()
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## Random Forest Model Specification (classification)
## 
## Main Arguments:
##   mtry = 2
##   min_n = 20
## 
## Engine-Specific Arguments:
##   importance = permutation
## 
## Computational engine: ranger

Now we can fit this model onto our split data. Tidymodels is nice because it lets you automatically train on the training data and then evaluate on the testing data using the initial_split() object. Without further ado, let’s fit it and see how it performs:

rfFit <- rfWorkflow %>% last_fit(dataSplit)
testPerformance <- rfFit %>% collect_metrics()
testPerformance
## # A tibble: 3 × 4
##   .metric     .estimator .estimate .config             
##   <chr>       <chr>          <dbl> <chr>               
## 1 accuracy    binary         0.843 Preprocessor1_Model1
## 2 roc_auc     binary         0.921 Preprocessor1_Model1
## 3 brier_class binary         0.112 Preprocessor1_Model1

Accuracy of 84.3%, not bad!

Let’s visualize our results. We’ll make a confusion matrix:

rfFit %>% collect_predictions() %>% conf_mat(truth=HeartDisease, 
    estimate=.pred_class)
##           Truth
## Prediction   0   1
##          0  81  15
##          1  21 113

Our model on the testing data resulted in 81 true negatives, 113 true positive, 15 false negatives, and 21 false positives.

rfFit %>% collect_predictions() %>% ggplot() + 
    geom_density(aes(x=.pred_1, fill=HeartDisease, alpha=0.5))

We can see here that we have some neat-looking separation. The plot colors positive outcomes in blue and negative in red. We can see that the model is pretty good at separating the positives from the negatives.

STEP FOUR: FIT ON A TARGET DATASET

Now we can fit our final model on the dataset and analyze the importance of each variable.

finalModel <- fit(rfWorkflow, finalData)
extract_fit_parsnip(finalModel)$fit$variable.importance
##            Age            Sex  ChestPainType      RestingBP    Cholesterol 
##   0.0057066074   0.0125310608   0.0302620031   0.0021074592   0.0004058326 
##      FastingBS     RestingECG          MaxHR ExerciseAngina        Oldpeak 
##   0.0089750191   0.0019149438   0.0140150003   0.0212245216   0.0244531546 
##       ST_Slope 
##   0.0814793534

Remember our “permutation” argument from before? To find variable importance, the model takes each variable one by one, shuffles its values (effectively removing its correlation to the outcome), and evaluates the decrease in the model metric (in this case, accuracy).

To conclude, our model performed quite well with 84.3% accuracy, an area of 0.921 under the ROC curve, and a Brier score of 0.112. ST_Slope seems to be the most important variable, accounting for ~9% of the model’s predictive power.

Part Three: Boosted Trees Model

STEP ONE: CREATING THE MODEL FRAMEWORK

Now let’s make a boosted tree model for our data. We will be using the C5.0 algorithm, made convenient through R. Briefly, it sequentially creates trees in a forest, each one learning from the mistakes of the previous one. The first tree builds its nodes, finding the best split at each one using Gini impurity or entropy as in random forest. However, at the end of the tree, the observations are weighted - an increase in weight if misclassified, a decrease if correctly classified. The next tree puts more weight on classifying correctly the observations that were previously misclassified, new weights are calculated at the end of the tree, and the whole process is iterated until the forest finishes.

library(C50)

It is important to keep in mind that we can overfit our data by setting the amount of trees too high - our model will learn the noise, not the pattern. So, we will iterate through that as well as min_n as before.

abModel <- boost_tree() %>% set_args(trees=tune(), min_n=tune()) %>% 
    set_engine('C5.0') %>% set_mode('classification')
abWorkflow <- workflow() %>% add_recipe(dataRecipe) %>% add_model(abModel)

STEP TWO: TESTING DIFFERENT PARAMETERS

We will expand the grid as before.

abGrid <- expand.grid(trees=seq(5,50,5), 
              min_n=c(2,seq(5,25,10)))
abTuneResults <- abWorkflow %>% tune_grid(resamples=dataCV, grid=abGrid, 
                      metrics=metric_set(accuracy, roc_auc))
## → A | warning: 'trials' should be <= 25 for this object. Predictions generated using 25 trials
## There were issues with some computations   A: x1                                                 → B | warning: 'trials' should be <= 31 for this object. Predictions generated using 31 trials
## There were issues with some computations   A: x1There were issues with some computations   A: x1   B: x1                                                         → C | warning: 'trials' should be <= 19 for this object. Predictions generated using 19 trials
## There were issues with some computations   A: x1   B: x1There were issues with some computations   A: x1   B: x1   C: x1                                                                 → D | warning: 'trials' should be <= 23 for this object. Predictions generated using 23 trials
## There were issues with some computations   A: x1   B: x1   C: x1There were issues with some computations   A: x1   B: x1   C: x1   D: x1There were issues with some computations   A: x1   B: x1   C: x1   D: x1
abTuneResults %>% collect_metrics() %>% filter(.metric=='accuracy') %>% 
    arrange(desc(mean)) %>% select(trees, min_n, mean) %>% 
    pivot_longer(cols=c(trees, min_n)) %>% ggplot(aes(x=value,y=mean)) + 
    geom_point() + facet_wrap(~name,scales='free')

It seems like there might be a positive correlation between min_n and accuracy. It’s worth investigating this further.

abGrid2 <- expand.grid(trees=50, min_n=seq(25,100,5))
abTuneResults2 <- abWorkflow %>% tune_grid(resamples=dataCV, grid=abGrid2, 
                      metrics=metric_set(accuracy, roc_auc))
abTuneResults2 %>% collect_metrics() %>% filter(.metric=='accuracy') %>% 
    arrange(desc(mean)) %>% select(trees, min_n, mean) %>% 
    pivot_longer(cols=c(trees, min_n)) %>% ggplot(aes(x=value,y=mean)) + 
    geom_point() + facet_wrap(~name,scales='free')
Graph of min_n
Graph of min_n

Interestingly, it seems like the accuracy falls. Oh well, it didn’t hurt to investigate. Let’s finalize our workflow again:

paramFinal <- abTuneResults %>% select_best(metric='accuracy')
abWorkflow <- abWorkflow %>% finalize_workflow(paramFinal)

STEP THREE: FIT THE MODEL

Let’s fit our model like we did with RF:

abFit <- abWorkflow %>% last_fit(dataSplit)
testPerformance <- abFit %>% collect_metrics()
testPerformance
## # A tibble: 3 × 4
##   .metric     .estimator .estimate .config             
##   <chr>       <chr>          <dbl> <chr>               
## 1 accuracy    binary         0.861 Preprocessor1_Model1
## 2 roc_auc     binary         0.920 Preprocessor1_Model1
## 3 brier_class binary         0.107 Preprocessor1_Model1

Our model has 86.1% accuracy, slightly better than RF.

Let’s make a confusion matrix and plot how our model separates values:

abFit %>% collect_predictions() %>% conf_mat(truth=HeartDisease, 
                         estimate=.pred_class)
##           Truth
## Prediction   0   1
##          0  85  15
##          1  17 113
abFit %>% collect_predictions() %>% ggplot() + 
    geom_density(aes(x=.pred_1, fill=HeartDisease, alpha=0.5))

In conclusion, this model performed similarly relative to random forest – it is a bit more accurate and has near-equivalent ROC AUC and Brier score metrics.

New model coming soon!