# Library
```r
library(tidyverse)    # for general wrangling and visualization
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(rpart)        # for decision trees
## Warning: package 'rpart' was built under R version 4.4.3

library(visNetwork)   # for DT plotting
## Warning: package 'visNetwork' was built under R version 4.4.3

library(ipred)        # for bagging
## Warning: package 'ipred' was built under R version 4.4.2

library(randomForest) # for random forest
## Warning: package 'randomForest' was built under R version 4.4.3
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## 
## The following object is masked from 'package:dplyr':
## 
##     combine
## 
## The following object is masked from 'package:ggplot2':
## 
##     margin

library(caret)        # for RF tuning
## Warning: package 'caret' was built under R version 4.4.2
## Loading required package: lattice
## 
## Attaching package: 'caret'
## 
## The following object is masked from 'package:purrr':
## 
##     lift
```
Bootstrap aggregating, or “bagging”, is an ensemble learning technique most often used with decision tree models. When creating a single decision tree, algorithms like `rpart` or C5.0 use the predictors in the training data to produce classifications or numerical predictions through a series of splits. Bagging uses this same approach, but repeats it over multiple bootstrap samples drawn from the training set, fitting a separate decision tree to each. The votes from these multiple models are then combined (aggregated) into the final prediction.
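As a quick sketch of that workflow (using the built-in `iris` data purely as a stand-in, with example settings rather than recommendations), the `bagging()` function from the `ipred` package loaded above fits many `rpart` trees to bootstrap samples and aggregates their votes:

```r
library(ipred)
set.seed(42)

# Fit 50 rpart trees, each on its own bootstrap sample of the training data
# (nbagg and coob are illustrative choices, not tuned values)
bag_fit <- bagging(Species ~ ., data = iris, nbagg = 50, coob = TRUE)

# coob = TRUE reports an out-of-bag estimate of the misclassification error
bag_fit

# Aggregated (majority-vote) predictions from the 50 trees
head(predict(bag_fit, newdata = iris))
```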
Given that bagging is a relatively simple ensemble of models, it performs best when used with unstable learners, i.e., models whose predictions tend to change dramatically when the input data change only slightly. This is why bagging works particularly well for decision trees like those produced by `rpart`. That said, bagging can also function quite well with other model types that vary significantly when the input data are modified.
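To see why this instability matters, here is a small illustration (again with `iris` as a stand-in): fitting `rpart` to two different bootstrap samples of the same data typically shifts the chosen cutpoints, and sometimes the split variables, which is exactly the tree-to-tree variation that bagging averages away.

```r
library(rpart)
set.seed(1)

# Two bootstrap samples of the same training data
boot1 <- iris[sample(nrow(iris), replace = TRUE), ]
boot2 <- iris[sample(nrow(iris), replace = TRUE), ]

# The same algorithm on slightly different data can yield different trees
tree1 <- rpart(Species ~ ., data = boot1)
tree2 <- rpart(Species ~ ., data = boot2)

# Compare the split variables and cutpoints in each printed tree
tree1
tree2
```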
A random forest is similar to bagging in its approach to creating model ensembles, but it exclusively uses decision tree models. In fact, a decision tree ensemble is only technically called a “random forest” when it is built in the particular way defined by Breiman and Cutler in 2001, as the name is a trademarked approach.
Random forests repeatedly draw random subsets from the training data to create decision trees. Unlike bagging, however, random forests randomize the features as well as the datapoints: each tree considers only a random subset of the predictors when splitting, not just a bootstrap sample of the rows. This allows them to handle very large datasets by working with more computationally manageable chunks of the data while retaining an error rate similar to other machine learning approaches. The resulting votes from the individual trees are then combined by majority vote (categorical outcomes) or averaged (numerical outcomes) to create the final predictions.

This makes a random forest an excellent approach for many problems: it can handle a great deal of noise, which is often cancelled out across the different submodels, and the final model effectively emphasizes only the most important features. However, the output from a random forest is not nearly as easy to interpret as a single decision tree, which can complicate understanding and communicating the results.
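The aggregation step itself is simple; a toy sketch with made-up per-tree predictions (not output from a real forest) shows the two rules:

```r
# Classification: each tree casts one vote; the majority class wins
tree_votes <- c("setosa", "versicolor", "setosa", "setosa", "virginica")
names(which.max(table(tree_votes)))
## [1] "setosa"

# Regression: the per-tree numeric predictions are simply averaged
tree_preds <- c(22.4, 25.1, 23.8, 24.0, 22.9)
mean(tree_preds)
## [1] 23.64
```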
The `randomForest` package is a very common implementation of the random forest approach and has functionality within the `caret` package. When run on its own, `randomForest()` defaults to creating 500 different decision trees, each considering at every split a number of candidate features roughly equal to the square root of the total number of features (for classification). The relatively small number of features per split and the high number of trees ensure that each feature is used in many trees within the forest. The function then returns a model object that can be used to make predictions on a testing dataset. Interestingly, `randomForest()` doesn’t just set aside a single validation set: for each tree, predictions are made for every datapoint not included in that tree’s bootstrap sample, and these “out-of-bag” predictions are aggregated into an estimate of the model’s error. This gives the model a useful measure of predictive performance even without further cross validation.
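A minimal sketch of that workflow, assuming an arbitrary train/test split of the built-in `iris` data (the seed, split, and dataset are illustrative choices only):

```r
library(randomForest)
set.seed(123)

# Hold out 20% of rows as a testing set
train_idx  <- sample(nrow(iris), 0.8 * nrow(iris))
iris_train <- iris[train_idx, ]
iris_test  <- iris[-train_idx, ]

# Defaults: ntree = 500 trees, and for classification mtry = floor(sqrt(p))
# candidate features considered at each split
rf_fit <- randomForest(Species ~ ., data = iris_train)

# Printing the fit reports the out-of-bag (OOB) error estimate, built from
# the datapoints left out of each tree's bootstrap sample
rf_fit

# Predictions on the held-out testing data
rf_preds <- predict(rf_fit, newdata = iris_test)
table(predicted = rf_preds, actual = iris_test$Species)
```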
When wrapping this function in the `train()` function from `caret`, we can add cross validation steps and try multiple values of `mtry`, the number of features sampled at each split. `train()` then selects the best model based on performance (accuracy or kappa) or on performance and parsimony (i.e., the simplest model within one standard error of the highest accuracy); other rules for selecting the best model from the repeated iterations are available as well. We can then compare different numbers of features, tested across the whole training set, to find the best set of parameters for the random forest. This can take time, as we are repeatedly cross validating and rerunning the random forest, but it will usually produce a more accurate (and sometimes more parsimonious) model than a single decision tree or an untuned random forest.
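A sketch of what that tuning might look like, again with `iris` as a stand-in; the fold count, `mtry` grid, and the one-standard-error selection rule are example settings rather than recommendations:

```r
library(caret)
set.seed(123)

# 5-fold cross validation; "oneSE" picks the simplest model within one
# standard error of the best performance
ctrl <- trainControl(method = "cv", number = 5,
                     selectionFunction = "oneSE")

# Try several values of mtry, the number of features sampled at each split;
# ntree is passed through to randomForest()
rf_tuned <- train(Species ~ ., data = iris,
                  method = "rf",
                  trControl = ctrl,
                  tuneGrid = expand.grid(mtry = c(1, 2, 3, 4)),
                  ntree = 500)

rf_tuned$bestTune    # the mtry value chosen by the selection rule
rf_tuned$finalModel  # the randomForest refit on the full training set
```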
I am curious as to how the proper “Random Forest” technique trademarked by Breiman and Cutler differs from other decision tree forest approaches.
I would very much like to play around with tuning and testing a new model with the fish catch data. However, due to time constraints I stopped here for the initial submission. If I am able to make some headway this evening beyond poorly-annotated code slop, I plan to reupload a version with some fun tests.