In this RPubs article, I will be building and evaluating several ML classification models using data from the Heart Failure Prediction Dataset available on Kaggle.
This page will be split into several sections. The first will cover data description and preparation, and the following will be for each model.
I start with a random forest model and will update with more models periodically.
We will make use of six different R libraries for our initial
analysis and data modeling. First, tidymodels is a
collection of packages for modeling and machine learning using
tidyverse principles. The workflows library
lets us use workflow container objects that are useful for modeling and
analysis projects. The tune library will let us iterate
through and find the best parameters for our models. The
ranger library gives us access to a fast and efficient
random forest algorithm. Lastly, ggplot2 lets us make
convenient and nice-looking plots.
library(tidymodels)
library(tidyverse)
library(workflows)
library(tune)
library(ranger)
library(ggplot2)
orig_data <- RKaggle::get_dataset('fedesoriano/heart-failure-prediction')
Let’s take a glance at our data.
glimpse(orig_data)
## Rows: 918
## Columns: 12
## $ Age <dbl> 40, 49, 37, 48, 54, 39, 45, 54, 37, 48, 37, 58, 39, 49,…
## $ Sex <chr> "M", "F", "M", "F", "M", "M", "F", "M", "M", "F", "F", …
## $ ChestPainType <chr> "ATA", "NAP", "ATA", "ASY", "NAP", "NAP", "ATA", "ATA",…
## $ RestingBP <dbl> 140, 160, 130, 138, 150, 120, 130, 110, 140, 120, 130, …
## $ Cholesterol <dbl> 289, 180, 283, 214, 195, 339, 237, 208, 207, 284, 211, …
## $ FastingBS <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ RestingECG <chr> "Normal", "Normal", "ST", "Normal", "Normal", "Normal",…
## $ MaxHR <dbl> 172, 156, 98, 108, 122, 170, 170, 142, 130, 120, 142, 9…
## $ ExerciseAngina <chr> "N", "N", "N", "Y", "N", "N", "N", "N", "Y", "N", "N", …
## $ Oldpeak <dbl> 0.0, 1.0, 0.0, 1.5, 0.0, 0.0, 0.0, 0.0, 1.5, 0.0, 0.0, …
## $ ST_Slope <chr> "Up", "Flat", "Up", "Flat", "Up", "Up", "Up", "Up", "Fl…
## $ HeartDisease <dbl> 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1…
So, our data has 12 columns. Note that our output column
(HeartDisease) is in 0s and 1s, representing negative and
positive for heart disease, respectively. This is what we want to
predict.
We have two different column types, double and character. Let’s take a look at the character variables in the dataset.
orig_data %>% select(where(is.character)) %>% apply(2, table)
## $Sex
##
## F M
## 193 725
##
## $ChestPainType
##
## ASY ATA NAP TA
## 496 173 203 46
##
## $RestingECG
##
## LVH Normal ST
## 188 552 178
##
## $ExerciseAngina
##
## N Y
## 547 371
##
## $ST_Slope
##
## Down Flat Up
## 63 460 395
Let’s go through these variables. The labeling of each variable is available on the Kaggle Data Card.
Sex is skewed towards males; this might be something
worth considering when inferring from this dataset.
There are four different chest pain types
(ChestPainType) - asymptomatic (ASY), atypical angina
(ATA), non-anginal pain (NAP), and typical angina (TA).
RestingECG is either LVH (showing probable or definite
left ventricular hypertrophy by Estes’ criteria), Normal, or ST (having
ST-T wave abnormality - T wave inversions and/or ST elevation or
depression of > 0.05 mV).
ExerciseAngina indicates whether or not the patient has
exercise-induced angina.
Finally, ST_Slope indicates the slope of the peak
exercise ST segment (Up: upsloping, Flat: flat, Down: downsloping).
Let’s make a new, cleaned data frame where we convert all character features and the output feature to factors.
finalData <- orig_data %>% mutate(HeartDisease=as.factor(HeartDisease)) %>%
mutate(across(where(is.character), as.factor))
Now, let’s take a look at the non-factor variables.
(Note: figuring out how to loop through and graph numeric variables was a huge pain! Shout-out to ggplot for making visualization easier).
To make these graphs, we first select the relevant variables, then work the data frame into two columns (this will help us separate by each variable). Then, we can make a plot with ggplot to visualize the distribution of our numeric variables.
orig_data %>% select(where(is.double)) %>% pivot_longer(cols=everything()) %>%
ggplot(aes(x=value, fill=name)) +
facet_wrap(~ name, scales = 'free') + geom_histogram(bins=30)
It seems like Age, maxHR, and
RestingBP are approximately normally distributed.
RestingBP seemingly has an outlier at 0 (a dead person was
counted in the data, probably). Cholesterol has a spike at
0, corresponding to missing data. We can probably leave
Oldpeak as it is. Lastly, FastingBS is another
binary data type, so we should convert it to factor so it is better
represented in our data.
finalData <- finalData %>% mutate(FastingBS=as.factor(FastingBS))
finalData <- finalData %>% mutate(across(Cholesterol, function(i) {
if_else(i==0, as.numeric(NA), i)
}))
Lastly, let’s check if any our variables are correlated; if they are, they can increase redundancy and make the model more complex, therefore increasing the risk of overfitting.
finalData %>% select(where(is.numeric)) %>% cor
## Age RestingBP Cholesterol MaxHR Oldpeak
## Age 1.0000000 0.2543994 NA -0.3820447 0.2586115
## RestingBP 0.2543994 1.0000000 NA -0.1121350 0.1648030
## Cholesterol NA NA 1 NA NA
## MaxHR -0.3820447 -0.1121350 NA 1.0000000 -0.1606906
## Oldpeak 0.2586115 0.1648030 NA -0.1606906 1.0000000
Thankfully, none of our features are highly correlated. The highest (absolute) correlation is between Age and Maximum Heart Rate, which makes sense. But we can ignore it for now.
Okay, now we have to split our original dataset into the training and
testing subparts. Tidymodels provides a simple framework to do this with
initial_split(). We also make a cross validation object
which will come in handy later.
set.seed(12345)
dataSplit <- initial_split(finalData, prop = 3/4)
dataTraining <- training(dataSplit)
dataTesting <- testing(dataSplit)
dataCV <- vfold_cv(dataTraining)
dataCV
## # 10-fold cross-validation
## # A tibble: 10 × 2
## splits id
## <list> <chr>
## 1 <split [619/69]> Fold01
## 2 <split [619/69]> Fold02
## 3 <split [619/69]> Fold03
## 4 <split [619/69]> Fold04
## 5 <split [619/69]> Fold05
## 6 <split [619/69]> Fold06
## 7 <split [619/69]> Fold07
## 8 <split [619/69]> Fold08
## 9 <split [620/68]> Fold09
## 10 <split [620/68]> Fold10
Okay, great. So we have two new datasets selected at random
(dataTraining and dataTesting) and a
cross-validation object. Now, we have to create the framework of the
model. We’ll do this with the recipe() function from
tidymodels. Crucially, we specify the model to normalize all numeric
columns (subtract the mean of each column and divide by its standard
deviation) and impute the missing values in the Cholesterol
column using the k-nearest neighbors method.
dataRecipe <- recipe(HeartDisease ~ ., data=finalData) %>%
step_normalize(all_numeric()) %>% step_impute_knn(Cholesterol)
So now we have our “recipe” to make the model; it’s time to follow
through. The rand_forest() function initializes a random
forest model, then we set parameters with set_args(). We
set mtry (the number of variables evaluated per tree) and
min_n (the minimum number of observations needed in a node
to split the node further - the lower the value, the deeper the tree) to
tune() - this will allow us to iterate through these values
on our cross validation datasets. The importance parameter
says that we will define the importance of each variable by permutation
(I will explain what this is later). Lastly, we tell the model that we
want classification (0 or 1) instead of regression (continuous).
rfModel <- rand_forest() %>% set_args(mtry=tune(), min_n=tune()) %>%
set_engine('ranger', importance = 'permutation') %>%
set_mode('classification')
We make a workflow object with our recipe and our model.
Let’s take a look at it:
rfWorkflow <- workflow() %>% add_recipe(dataRecipe) %>% add_model(rfModel)
rfWorkflow
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: rand_forest()
##
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 2 Recipe Steps
##
## • step_normalize()
## • step_impute_knn()
##
## ── Model ───────────────────────────────────────────────────────────────────────
## Random Forest Model Specification (classification)
##
## Main Arguments:
## mtry = tune()
## min_n = tune()
##
## Engine-Specific Arguments:
## importance = permutation
##
## Computational engine: ranger
The reason I tune()-d the mtry and min_n
parameters is because these might significantly influence the accuracy
and dependability of our model. Now that we have all the required
frameworks, we can test parameter values.
# Good mtry is sqrt(total vars) = sqrt(11) = 3
rfGrid <- expand.grid(mtry=2:6, min_n=c(1, 3, 5, 10, 15, 20, 30))
rfTuneResults <- rfWorkflow %>% tune_grid(resamples=dataCV, grid=rfGrid,
metrics=metric_set(accuracy, roc_auc))
rfTuneResults %>% collect_metrics()
## # A tibble: 70 × 8
## mtry min_n .metric .estimator mean n std_err .config
## <int> <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 2 1 accuracy binary 0.862 10 0.0109 Preprocessor1_Model01
## 2 2 1 roc_auc binary 0.922 10 0.00802 Preprocessor1_Model01
## 3 3 1 accuracy binary 0.855 10 0.0101 Preprocessor1_Model02
## 4 3 1 roc_auc binary 0.922 10 0.00774 Preprocessor1_Model02
## 5 4 1 accuracy binary 0.855 10 0.0107 Preprocessor1_Model03
## 6 4 1 roc_auc binary 0.919 10 0.00756 Preprocessor1_Model03
## 7 5 1 accuracy binary 0.852 10 0.0108 Preprocessor1_Model04
## 8 5 1 roc_auc binary 0.916 10 0.00760 Preprocessor1_Model04
## 9 6 1 accuracy binary 0.853 10 0.0102 Preprocessor1_Model05
## 10 6 1 roc_auc binary 0.917 10 0.00719 Preprocessor1_Model05
## # ℹ 60 more rows
I’m interested in the effect of mtry. Let’s take a look
across various min_n:
rfTuneResults %>% collect_metrics() %>% filter(min_n %in% c(1,3,30), .metric=='accuracy')
## # A tibble: 15 × 8
## mtry min_n .metric .estimator mean n std_err .config
## <int> <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 2 1 accuracy binary 0.862 10 0.0109 Preprocessor1_Model01
## 2 3 1 accuracy binary 0.855 10 0.0101 Preprocessor1_Model02
## 3 4 1 accuracy binary 0.855 10 0.0107 Preprocessor1_Model03
## 4 5 1 accuracy binary 0.852 10 0.0108 Preprocessor1_Model04
## 5 6 1 accuracy binary 0.853 10 0.0102 Preprocessor1_Model05
## 6 2 3 accuracy binary 0.861 10 0.00945 Preprocessor1_Model06
## 7 3 3 accuracy binary 0.859 10 0.00886 Preprocessor1_Model07
## 8 4 3 accuracy binary 0.850 10 0.00834 Preprocessor1_Model08
## 9 5 3 accuracy binary 0.855 10 0.0114 Preprocessor1_Model09
## 10 6 3 accuracy binary 0.855 10 0.0112 Preprocessor1_Model10
## 11 2 30 accuracy binary 0.862 10 0.0103 Preprocessor1_Model31
## 12 3 30 accuracy binary 0.859 10 0.0103 Preprocessor1_Model32
## 13 4 30 accuracy binary 0.859 10 0.0106 Preprocessor1_Model33
## 14 5 30 accuracy binary 0.858 10 0.0109 Preprocessor1_Model34
## 15 6 30 accuracy binary 0.856 10 0.00972 Preprocessor1_Model35
Comparing mtry’s for each min_n, it seems
like there is a small, weak negative correlation between
mtry and model accuracy. However, min_n seems
a lot more interesting. Let’s keep experimenting. So, let’s select the
smallest mtry (2 and let’s try 1 as well) and try a larger
range of min_n.
rfGrid2 <- expand.grid(mtry=1:2, min_n=c(1, 3, 5, seq(5, 50, 5)))
rfTuneResults2 <- rfWorkflow %>% tune_grid(resamples=dataCV, grid=rfGrid2,
metrics=metric_set(accuracy, roc_auc))
## Warning: Duplicate rows in grid of tuning combinations found and removed.
rfTuneResults2 %>% collect_metrics() %>% filter(.metric=='accuracy') %>%
arrange(desc(mean))
## # A tibble: 24 × 8
## mtry min_n .metric .estimator mean n std_err .config
## <int> <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 2 20 accuracy binary 0.866 10 0.0111 Preprocessor1_Model12
## 2 2 1 accuracy binary 0.866 10 0.0112 Preprocessor1_Model02
## 3 2 3 accuracy binary 0.866 10 0.00957 Preprocessor1_Model04
## 4 2 10 accuracy binary 0.866 10 0.0118 Preprocessor1_Model08
## 5 1 10 accuracy binary 0.863 10 0.0128 Preprocessor1_Model07
## 6 2 35 accuracy binary 0.863 10 0.00966 Preprocessor1_Model18
## 7 1 3 accuracy binary 0.862 10 0.0114 Preprocessor1_Model03
## 8 2 15 accuracy binary 0.861 10 0.00779 Preprocessor1_Model10
## 9 2 40 accuracy binary 0.861 10 0.0106 Preprocessor1_Model20
## 10 1 15 accuracy binary 0.859 10 0.0129 Preprocessor1_Model09
## # ℹ 14 more rows
rfTuneResults2 %>% collect_metrics() %>% filter(.metric=='roc_auc') %>%
arrange(desc(mean))
## # A tibble: 24 × 8
## mtry min_n .metric .estimator mean n std_err .config
## <int> <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 2 50 roc_auc binary 0.927 10 0.00658 Preprocessor1_Model24
## 2 2 20 roc_auc binary 0.927 10 0.00721 Preprocessor1_Model12
## 3 2 25 roc_auc binary 0.926 10 0.00713 Preprocessor1_Model14
## 4 1 20 roc_auc binary 0.926 10 0.00831 Preprocessor1_Model11
## 5 2 35 roc_auc binary 0.926 10 0.00706 Preprocessor1_Model18
## 6 2 45 roc_auc binary 0.926 10 0.00708 Preprocessor1_Model22
## 7 2 10 roc_auc binary 0.926 10 0.00742 Preprocessor1_Model08
## 8 2 30 roc_auc binary 0.926 10 0.00712 Preprocessor1_Model16
## 9 1 5 roc_auc binary 0.926 10 0.00815 Preprocessor1_Model05
## 10 1 25 roc_auc binary 0.926 10 0.00809 Preprocessor1_Model13
## # ℹ 14 more rows
mtry=2, min_n=20 comes in at first place for accuracy, first (tied) for roc_auc, and has low standard errors. It’s safe to say that this model is our winner.
Now, let’s apply it to our workflow.
paramFinal <- rfTuneResults2 %>% select_best(metric='accuracy')
rfWorkflow <- rfWorkflow %>% finalize_workflow(paramFinal)
So now we have our workflow:
rfWorkflow
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: rand_forest()
##
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 2 Recipe Steps
##
## • step_normalize()
## • step_impute_knn()
##
## ── Model ───────────────────────────────────────────────────────────────────────
## Random Forest Model Specification (classification)
##
## Main Arguments:
## mtry = 2
## min_n = 20
##
## Engine-Specific Arguments:
## importance = permutation
##
## Computational engine: ranger
Now we can fit this model onto our split data. Tidymodels is nice
because it lets you automatically train on the training data and then
evaluate on the testing data using the initial_split()
object. Without further ado, let’s fit it and see how it performs:
rfFit <- rfWorkflow %>% last_fit(dataSplit)
testPerformance <- rfFit %>% collect_metrics()
testPerformance
## # A tibble: 3 × 4
## .metric .estimator .estimate .config
## <chr> <chr> <dbl> <chr>
## 1 accuracy binary 0.843 Preprocessor1_Model1
## 2 roc_auc binary 0.921 Preprocessor1_Model1
## 3 brier_class binary 0.112 Preprocessor1_Model1
Accuracy of 84.3%, not bad!
Let’s visualize our results. We’ll make a confusion matrix:
rfFit %>% collect_predictions() %>% conf_mat(truth=HeartDisease,
estimate=.pred_class)
## Truth
## Prediction 0 1
## 0 81 15
## 1 21 113
Our model on the testing data resulted in 81 true negatives, 113 true positive, 15 false negatives, and 21 false positives.
rfFit %>% collect_predictions() %>% ggplot() +
geom_density(aes(x=.pred_1, fill=HeartDisease, alpha=0.5))
We can see here that we have some neat-looking separation. The plot colors positive outcomes in blue and negative in red. We can see that the model is pretty good at separating the positives from the negatives.
Now we can fit our final model on the dataset and analyze the importance of each variable.
finalModel <- fit(rfWorkflow, finalData)
extract_fit_parsnip(finalModel)$fit$variable.importance
## Age Sex ChestPainType RestingBP Cholesterol
## 0.0057066074 0.0125310608 0.0302620031 0.0021074592 0.0004058326
## FastingBS RestingECG MaxHR ExerciseAngina Oldpeak
## 0.0089750191 0.0019149438 0.0140150003 0.0212245216 0.0244531546
## ST_Slope
## 0.0814793534
Remember our “permutation” argument from before? To find variable
importance, the model takes each variable one by one,
shuffles its values (effectively removing its correlation to the
outcome), and evaluates the decrease in the model metric (in this case,
accuracy).
To conclude, our model performed quite well with 84.3% accuracy, an
area of 0.921 under the ROC curve, and a Brier score of 0.112.
ST_Slope seems to be the most important variable,
accounting for ~9% of the model’s predictive power.
Now let’s make a boosted tree model for our data. We will be using
the C5.0 algorithm, made convenient through R. Briefly, it
sequentially creates trees in a forest, each one learning from the
mistakes of the previous one. The first tree builds its nodes, finding
the best split at each one using Gini impurity or entropy as in random
forest. However, at the end of the tree, the observations are weighted -
an increase in weight if misclassified, a decrease if correctly
classified. The next tree puts more weight on classifying correctly the
observations that were previously misclassified, new weights are
calculated at the end of the tree, and the whole process is iterated
until the forest finishes.
library(C50)
It is important to keep in mind that we can overfit our data by
setting the amount of trees too high - our model will learn
the noise, not the pattern. So, we will iterate through that as well as
min_n as before.
abModel <- boost_tree() %>% set_args(trees=tune(), min_n=tune()) %>%
set_engine('C5.0') %>% set_mode('classification')
abWorkflow <- workflow() %>% add_recipe(dataRecipe) %>% add_model(abModel)
We will expand the grid as before.
abGrid <- expand.grid(trees=seq(5,50,5),
min_n=c(2,seq(5,25,10)))
abTuneResults <- abWorkflow %>% tune_grid(resamples=dataCV, grid=abGrid,
metrics=metric_set(accuracy, roc_auc))
## → A | warning: 'trials' should be <= 25 for this object. Predictions generated using 25 trials
## There were issues with some computations A: x1 → B | warning: 'trials' should be <= 31 for this object. Predictions generated using 31 trials
## There were issues with some computations A: x1There were issues with some computations A: x1 B: x1 → C | warning: 'trials' should be <= 19 for this object. Predictions generated using 19 trials
## There were issues with some computations A: x1 B: x1There were issues with some computations A: x1 B: x1 C: x1 → D | warning: 'trials' should be <= 23 for this object. Predictions generated using 23 trials
## There were issues with some computations A: x1 B: x1 C: x1There were issues with some computations A: x1 B: x1 C: x1 D: x1There were issues with some computations A: x1 B: x1 C: x1 D: x1
abTuneResults %>% collect_metrics() %>% filter(.metric=='accuracy') %>%
arrange(desc(mean)) %>% select(trees, min_n, mean) %>%
pivot_longer(cols=c(trees, min_n)) %>% ggplot(aes(x=value,y=mean)) +
geom_point() + facet_wrap(~name,scales='free')
It seems like there might be a positive correlation between min_n and accuracy. It’s worth investigating this further.
abGrid2 <- expand.grid(trees=50, min_n=seq(25,100,5))
abTuneResults2 <- abWorkflow %>% tune_grid(resamples=dataCV, grid=abGrid2,
metrics=metric_set(accuracy, roc_auc))
abTuneResults2 %>% collect_metrics() %>% filter(.metric=='accuracy') %>%
arrange(desc(mean)) %>% select(trees, min_n, mean) %>%
pivot_longer(cols=c(trees, min_n)) %>% ggplot(aes(x=value,y=mean)) +
geom_point() + facet_wrap(~name,scales='free')
Interestingly, it seems like the accuracy falls. Oh well, it didn’t hurt to investigate. Let’s finalize our workflow again:
paramFinal <- abTuneResults %>% select_best(metric='accuracy')
abWorkflow <- abWorkflow %>% finalize_workflow(paramFinal)
Let’s fit our model like we did with RF:
abFit <- abWorkflow %>% last_fit(dataSplit)
testPerformance <- abFit %>% collect_metrics()
testPerformance
## # A tibble: 3 × 4
## .metric .estimator .estimate .config
## <chr> <chr> <dbl> <chr>
## 1 accuracy binary 0.861 Preprocessor1_Model1
## 2 roc_auc binary 0.920 Preprocessor1_Model1
## 3 brier_class binary 0.107 Preprocessor1_Model1
Our model has 86.1% accuracy, slightly better than RF.
Let’s make a confusion matrix and plot how our model separates values:
abFit %>% collect_predictions() %>% conf_mat(truth=HeartDisease,
estimate=.pred_class)
## Truth
## Prediction 0 1
## 0 85 15
## 1 17 113
abFit %>% collect_predictions() %>% ggplot() +
geom_density(aes(x=.pred_1, fill=HeartDisease, alpha=0.5))
In conclusion, this model performed similarly relative to random forest – it is a bit more accurate and has near-equivalent ROC AUC and Brier score metrics.
New model coming soon!