What is the {forester} AutoML package?

{forester} is an AutoML tool in R for regression and binary classification on tabular data. It wraps the whole machine learning pipeline into a single train() function, which includes:
i) rendering a brief data check report,
ii) preprocessing the initial dataset enough for models to be trained,
iii) training 5 tree-based models (decision tree, random forest, xgboost, catboost, lightgbm) with default parameters, random search and Bayesian optimization,
iv) evaluating them and providing a ranked list.

Source: https://www.r-bloggers.com/2023/02/forester-an-r-package-for-automated-building-of-tree-based-models/

Before we begin, let’s set up our working directory.

setwd("~/Documents/Using the AutoML {forester} package for Tree-based Models")

Classification machine learning (ML) dataset analysis

Here are the steps for conducting a classification machine learning (ML) analysis of a dataset:

  1. Install and/or load the packages.
# install.packages("tidyverse")
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# install.packages("devtools")
# devtools::install_github("ModelOriented/forester")
library(forester)
## 
## Attaching package: 'forester'
## 
## The following object is masked from 'package:dplyr':
## 
##     explain
# install.packages("DALEX")
library(DALEX)
## Welcome to DALEX (version: 2.4.3).
## Find examples and detailed introduction at: http://ema.drwhy.ai/
## 
## 
## Attaching package: 'DALEX'
## 
## The following object is masked from 'package:forester':
## 
##     explain
## 
## The following object is masked from 'package:dplyr':
## 
##     explain
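Before running the remaining steps, it can help to check which of the required packages are already installed. The following is a minimal sketch using base R's requireNamespace(); the helper name missing_packages is hypothetical and not part of any package. Note that, as in the commented lines above, {forester} is installed from GitHub with {devtools} rather than with install.packages().

```r
# Hypothetical helper: returns the names of packages that are not installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Packages used in this walkthrough
required <- c("tidyverse", "forester", "DALEX", "mlbench", "skimr", "datarium")
missing_packages(required)
```

An empty result means everything needed below is available.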
  1. Load the Pima Indians diabetes dataset. The goal is to predict whether or not a patient has diabetes, based on the diagnostic measurements included in the dataset.
# Use the Pima Indians diabetes dataset (originally from the UCI repository),
# which ships with the {mlbench} package.
# install.packages("mlbench")
library(mlbench) 
data(PimaIndiansDiabetes)
  1. Use the {skimr} package to quickly display summary statistics for the Pima Indians diabetes dataset.
# Skim the dataset
# install.packages("skimr")
library(skimr)
skim(PimaIndiansDiabetes)
Data summary
Name                     PimaIndiansDiabetes
Number of rows           768
Number of columns        9
Column type frequency:
  factor                 1
  numeric                8
Group variables          None

Variable type: factor

skim_variable  n_missing  complete_rate  ordered  n_unique  top_counts
diabetes       0          1              FALSE    2         neg: 500, pos: 268

Variable type: numeric

skim_variable  n_missing  complete_rate  mean    sd      p0     p25    p50     p75     p100    hist
pregnant       0          1              3.85    3.37    0.00   1.00   3.00    6.00    17.00   ▇▃▂▁▁
glucose        0          1              120.89  31.97   0.00   99.00  117.00  140.25  199.00  ▁▁▇▆▂
pressure       0          1              69.11   19.36   0.00   62.00  72.00   80.00   122.00  ▁▁▇▇▁
triceps        0          1              20.54   15.95   0.00   0.00   23.00   32.00   99.00   ▇▇▂▁▁
insulin        0          1              79.80   115.24  0.00   0.00   30.50   127.25  846.00  ▇▁▁▁▁
mass           0          1              31.99   7.88    0.00   27.30  32.00   36.60   67.10   ▁▃▇▂▁
pedigree       0          1              0.47    0.33    0.08   0.24   0.37    0.63    2.42    ▇▃▁▁▁
age            0          1              33.24   11.76   21.00  24.00  29.00   41.00   81.00   ▇▃▁▁▁
  1. Create a data report using the check_data() function from the {forester} package.
# Run the data check pipeline to look for potential problems with the data
check <- check_data(PimaIndiansDiabetes, 'diabetes')
##  -------------------- CHECK DATA REPORT -------------------- 
##  
## The dataset has 768 observations and 9 columns, which names are: 
## pregnant; glucose; pressure; triceps; insulin; mass; pedigree; age; diabetes; 
## 
## With the target described by a column diabetes.
## 
## ✔ No static columns. 
## 
## ✔ No duplicate columns.
## 
## ✔ No target values are missing. 
## 
## ✔ No predictor values are missing. 
## 
## ✔  No issues with dimensionality. 
## 
## ✔ No strongly correlated, by Spearman rank, pairs of numerical values. 
## 
## ✖ There are more than 50 possible outliers in the data set, so we are not printing them. They are returned in the output as a vector. 
## 
## ✖ Dataset is unbalanced with: 1.865672 proportion with neg being a dominating class.
## 
## ✔ Columns names suggest that none of them are IDs. 
## 
## ✔ Columns data suggest that none of them are IDs. 
## 
##  -------------------- CHECK DATA REPORT END -------------------- 
## 
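The imbalance figure in the report (1.865672) matches the ratio of the majority class count to the minority class count given in the skim summary (500 neg, 268 pos). The sketch below reproduces that arithmetic in base R. It is an assumption that {forester} computes the value this way; the code only shows that the numbers agree.

```r
# Class counts as reported by skim(): 500 "neg" and 268 "pos".
diabetes <- factor(rep(c("neg", "pos"), times = c(500, 268)))

class_counts     <- table(diabetes)
imbalance_ratio  <- as.numeric(max(class_counts) / min(class_counts))
dominating_class <- names(which.max(class_counts))

round(imbalance_ratio, 6)   # 1.865672
dominating_class            # "neg"
```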
  1. Train basic models with {forester}, and view the models’ ranking list.
output_1_diabetes <- train(data         = PimaIndiansDiabetes,
                           y            = 'diabetes',
                           bayes_iter   = 0,      # no Bayesian optimization
                           random_evals = 0,      # no random search
                           verbose      = FALSE,
                           sort_by      = 'auc')  # rank models by AUC

head(output_1_diabetes$score_test)
##   no.                name        engine tuning  accuracy       auc        f1
## 1   1        ranger_model        ranger  basic 0.7727273 0.8238889 0.6391753
## 2   2       xgboost_model       xgboost  basic 0.7597403 0.8087037 0.6476190
## 3   4      lightgbm_model      lightgbm  basic 0.7402597 0.8055556 0.6078431
## 4   3 decision_tree_model decision_tree  basic 0.7012987 0.7705556 0.6101695
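To make the accuracy and f1 columns concrete, here is a small sketch that computes both from confusion-matrix counts. The counts are illustrative assumptions, chosen so that the results coincide with the ranger_model row above. They are not the model's actual confusion matrix, and which class {forester} treats as positive is not stated in the output. The auc column cannot be derived from counts alone, since it depends on the predicted probabilities.

```r
# Illustrative confusion-matrix counts (assumed, not taken from the model):
tp <- 31   # true positives
fp <- 15   # false positives
fn <- 20   # false negatives
tn <- 88   # true negatives

accuracy  <- (tp + tn) / (tp + fp + fn + tn)
precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
f1        <- 2 * precision * recall / (precision + recall)

round(c(accuracy = accuracy, f1 = f1), 7)
# accuracy 0.7727273, f1 0.6391753
```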
  1. Create an explainer and a feature importance plot for the top-ranked basic model, showing which columns were most important to it.
library(DALEX)
# Build an explainer for the top-ranked model. forester::explain is called
# explicitly because DALEX masks explain().
ex_1_diabetes <- forester::explain(models    = output_1_diabetes$best_models[[1]],
                                   test_data = output_1_diabetes$test_data,
                                   y         = output_1_diabetes$y)

# Variable importance for the ranger model, plotting up to 9 variables
model_1_parts_diabetes <- DALEX::model_parts(ex_1_diabetes$ranger_model)
plot(model_1_parts_diabetes, max_vars = 9)
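DALEX::model_parts() reports permutation-based variable importance: a column is shuffled, and the resulting increase in the model's loss is taken as that column's importance. The sketch below illustrates the idea in base R on a toy linear model fitted to the built-in mtcars data. The model, the RMSE loss, and the helper permutation_importance are stand-ins chosen for illustration, not what {forester} or {DALEX} run internally.

```r
set.seed(123)

# Toy regression model on a built-in dataset (stand-in for a forester model).
model <- lm(mpg ~ wt + hp + qsec, data = mtcars)
rmse  <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Loss of the model on the unshuffled data
baseline_loss <- rmse(mtcars$mpg, predict(model, mtcars))

# Hypothetical helper: mean increase in loss after shuffling one column.
permutation_importance <- function(variable, n_permutations = 10) {
  losses <- replicate(n_permutations, {
    shuffled <- mtcars
    shuffled[[variable]] <- sample(shuffled[[variable]])
    rmse(mtcars$mpg, predict(model, shuffled))
  })
  mean(losses) - baseline_loss
}

importance <- vapply(c("wt", "hp", "qsec"), permutation_importance, numeric(1))
sort(importance, decreasing = TRUE)
```

A larger value means the model relies more on that column; a value near zero means shuffling it barely changes the predictions.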

  1. Train tuned models with {forester}, and view the models’ ranking list. Please be patient while the models are tuned and selected.
output_2_diabetes <- train(data         = PimaIndiansDiabetes,
                           y            = 'diabetes',
                           bayes_iter   = 20,     # Bayesian optimization iterations
                           random_evals = 20,     # random search evaluations
                           verbose      = FALSE,
                           sort_by      = 'auc')  # rank models by AUC

head(output_2_diabetes$score_test)
##   no.         name engine        tuning  accuracy       auc        f1
## 1   6  ranger_RS_2 ranger random_search 0.7727273 0.8288889 0.6391753
## 2  17 ranger_RS_13 ranger random_search 0.7792208 0.8257407 0.6458333
## 3   1 ranger_model ranger         basic 0.7727273 0.8238889 0.6391753
## 4  10  ranger_RS_6 ranger random_search 0.7792208 0.8238889 0.6458333
## 5   5  ranger_RS_1 ranger random_search 0.7662338 0.8201852 0.6250000
## 6   7  ranger_RS_3 ranger random_search 0.7857143 0.8192593 0.6526316
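The random_search rows in the ranking come from trying randomly sampled hyperparameter settings and scoring each one. As a generic illustration of that idea, and not a reproduction of {forester}'s internals, the sketch below runs a random search over the penalty of a ridge regression on the built-in mtcars data and ranks the candidates by validation MSE. The model, the search range, and all names are assumptions made for this example.

```r
set.seed(42)

# Toy data: predict mpg from four numeric columns of mtcars.
X <- as.matrix(mtcars[, c("wt", "hp", "qsec", "disp")])
y <- mtcars$mpg
train_idx <- sample(nrow(X), 22)

# Standardise the predictors using training-set statistics only.
centre <- colMeans(X[train_idx, ])
spread <- apply(X[train_idx, ], 2, sd)
X_std  <- sweep(sweep(X, 2, centre, "-"), 2, spread, "/")

# Validation MSE of a ridge regression with penalty `lambda`.
validation_mse <- function(lambda) {
  X_tr <- X_std[train_idx, ]
  y_tr <- y[train_idx]
  beta <- solve(crossprod(X_tr) + lambda * diag(ncol(X_tr)),
                crossprod(X_tr, y_tr - mean(y_tr)))
  pred <- mean(y_tr) + drop(X_std[-train_idx, ] %*% beta)
  mean((y[-train_idx] - pred)^2)
}

# Random search: sample candidate penalties, score each, rank by MSE.
n_evals <- 20
lambdas <- 10^runif(n_evals, min = -3, max = 2)
ranking <- data.frame(lambda = lambdas,
                      mse    = vapply(lambdas, validation_mse, numeric(1)))
ranking <- ranking[order(ranking$mse), ]
head(ranking, 3)
```

The sorted data frame plays the same role as score_test above: one row per candidate, best first.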
  1. Create an explainer and a feature importance plot for the top-ranked tuned model, showing which columns were most important to it.
library(DALEX)
# Build an explainer for the top-ranked tuned model
ex_2_diabetes <- forester::explain(models    = output_2_diabetes$best_models[[1]],
                                   test_data = output_2_diabetes$test_data,
                                   y         = output_2_diabetes$y)

# The explainer is accessed by model name. If ranger_RS_13 is not present in
# your run, check names(ex_2_diabetes) and use the name listed there.
model_2_parts_diabetes <- DALEX::model_parts(ex_2_diabetes$ranger_RS_13)
plot(model_2_parts_diabetes, max_vars = 9)

Regression machine learning (ML) dataset analysis

Here are the steps for conducting a regression machine learning (ML) analysis of a dataset:

  1. Load the marketing dataset for the regression analysis.
# install.packages("datarium")
library(datarium)
data(marketing)
  1. Use the {skimr} package to quickly display summary statistics for the marketing dataset.
# Skim the dataset
# install.packages("skimr")
library(skimr)
skim(marketing)
Data summary
Name                     marketing
Number of rows           200
Number of columns        4
Column type frequency:
  numeric                4
Group variables          None

Variable type: numeric

skim_variable  n_missing  complete_rate  mean    sd      p0     p25    p50     p75     p100    hist
youtube        0          1              176.45  103.03  0.84   89.25  179.70  262.59  355.68  ▇▆▆▇▆
facebook       0          1              27.92   17.82   0.00   11.97  27.48   43.83   59.52   ▇▆▆▆▆
newspaper      0          1              36.66   26.13   0.36   15.30  30.90   54.12   136.80  ▇▆▃▁▁
sales          0          1              16.83   6.26    1.92   12.45  15.48   20.88   32.40   ▁▇▇▅▂
  1. Train basic models with {forester}, and view the models’ ranking list.
output_1_sales <- train(data         = marketing,
                        y            = 'sales',
                        bayes_iter   = 0,      # no Bayesian optimization
                        random_evals = 0,      # no random search
                        verbose      = FALSE,
                        sort_by      = 'mse')  # rank models by MSE

head(output_1_sales$score_test)
##   no.                name        engine tuning     rmse      mse        r2
## 1   2       xgboost_model       xgboost  basic 1.348093 1.817355 0.9514301
## 2   4      lightgbm_model      lightgbm  basic 1.354851 1.835621 0.9509419
## 3   1        ranger_model        ranger  basic 2.028579 4.115134 0.8900206
## 4   3 decision_tree_model decision_tree  basic 2.145566 4.603452 0.8769700
##        mae
## 1 1.095299
## 2 1.158454
## 3 1.401629
## 4 1.769859
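The rmse, mse, r2 and mae columns are standard regression metrics. The sketch below computes them in base R on a few made-up values, so the vectors are illustrative and do not come from the marketing data. The last line checks that the reported xgboost_model row is internally consistent, since RMSE is the square root of MSE.

```r
# Toy actual and predicted values (illustrative, not from the marketing data).
actual    <- c(12.0, 15.5, 20.1, 9.8, 17.3)
predicted <- c(11.2, 16.0, 18.9, 10.5, 17.0)

errors <- actual - predicted
mse  <- mean(errors^2)                  # mean squared error
rmse <- sqrt(mse)                       # root mean squared error
mae  <- mean(abs(errors))               # mean absolute error
r2   <- 1 - sum(errors^2) / sum((actual - mean(actual))^2)

round(c(rmse = rmse, mse = mse, r2 = r2, mae = mae), 4)

# Consistency check on the reported xgboost_model row: rmse^2 equals mse.
all.equal(1.348093^2, 1.817355, tolerance = 1e-6)
```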
  1. Create an explainer and a feature importance plot for the top-ranked basic sales model, showing which columns were most important to it.
library(DALEX)
# Build an explainer for the top-ranked basic model
ex_1_sales <- forester::explain(models    = output_1_sales$best_models[[1]],
                                test_data = output_1_sales$test_data,
                                y         = output_1_sales$y)

# Variable importance for the xgboost model, plotting up to 4 variables
model_1_parts_sales <- DALEX::model_parts(ex_1_sales$xgboost_model)
plot(model_1_parts_sales, max_vars = 4)

  1. Train tuned models with {forester}, and view the models’ ranking list. Please be patient while the models are tuned and selected.
output_2_sales <- train(data         = marketing,
                        y            = 'sales',
                        bayes_iter   = 20,     # Bayesian optimization iterations
                        random_evals = 20,     # random search evaluations
                        verbose      = FALSE,
                        sort_by      = 'mse')  # rank models by MSE

head(output_2_sales$score_test)
##   no.           name   engine        tuning     rmse      mse        r2
## 1  29   xgboost_RS_5  xgboost random_search 1.120546 1.255623 0.9664427
## 2  25   xgboost_RS_1  xgboost random_search 1.296193 1.680117 0.9550979
## 3  88 lightgbm_bayes lightgbm     bayes_opt 1.315767 1.731243 0.9537315
## 4  86  xgboost_bayes  xgboost     bayes_opt 1.339707 1.794814 0.9520325
## 5   2  xgboost_model  xgboost         basic 1.348093 1.817355 0.9514301
## 6   4 lightgbm_model lightgbm         basic 1.354851 1.835621 0.9509419
##         mae
## 1 0.8525029
## 2 1.0966873
## 3 1.1216577
## 4 1.0284954
## 5 1.0952987
## 6 1.1584541
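To put the effect of tuning in perspective, the test-set MSE of the best tuned model can be compared with that of the best basic model. The short sketch below does the arithmetic with the two values copied from the ranking tables; the variable names are introduced only for this example.

```r
# Test-set MSE values copied from the ranking tables.
mse_basic <- 1.817355   # xgboost_model (basic)
mse_tuned <- 1.255623   # xgboost_RS_5 (random_search)

relative_reduction <- (mse_basic - mse_tuned) / mse_basic
round(100 * relative_reduction, 1)   # about 30.9 (% lower MSE)
```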
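  1. Create an explainer and a feature importance plot for the top-ranked tuned sales model, showing which columns were most important to it.
library(DALEX)
# Build an explainer for the top-ranked tuned model
ex_2_sales <- forester::explain(models    = output_2_sales$best_models[[1]],
                                test_data = output_2_sales$test_data,
                                y         = output_2_sales$y)

# The explainer is accessed by model name. If xgboost_RS_5 is not present in
# your run, check names(ex_2_sales) and use the name listed there.
model_2_parts_sales <- DALEX::model_parts(ex_2_sales$xgboost_RS_5)
plot(model_2_parts_sales, max_vars = 9)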

Finally, special thanks to the authors/developers of the {forester} AutoML package.
For more information about the package, please visit: https://github.com/ModelOriented/forester