Using the AutoML {forester} package for Tree-based Models

What is the {forester} AutoML package?

{forester} is an AutoML package for R that handles regression and binary classification tasks on tabular data. It wraps the whole machine learning pipeline in a single train() function, which covers:
i) rendering a brief data check report,
ii) preprocessing the initial dataset just enough for models to be trained,
iii) training 5 tree-based models (decision tree, random forest, xgboost, catboost, lightgbm) with default parameters, random search, and Bayesian optimization,
iv) evaluating the models and returning a ranked list.

Source: https://www.r-bloggers.com/2023/02/forester-an-r-package-for-automated-building-of-tree-based-models/

Before we begin, let’s set up our working directory.

setwd("~/Documents/Using the AutoML {forester} package for Tree-based Models")
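
Note that setwd() stops with an error if the folder does not exist yet. A minimal sketch of one way to guard against that, assuming you are happy for the folder to be created on first run. The helper name ensure_dir() is hypothetical (it is not part of base R or any package), and the demo uses a temporary folder so it has no side effects.

```r
# Hypothetical helper: create the folder if needed and return its full path.
ensure_dir <- function(path) {
  if (!dir.exists(path)) {
    dir.create(path, recursive = TRUE)
  }
  normalizePath(path)
}

# Demonstrated on a temporary folder; swap in your own project path
# before calling setwd().
project_dir <- ensure_dir(file.path(tempdir(), "forester-demo"))
# setwd(project_dir)
```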

Classification machine learning (ML) dataset analysis

Here are the steps for running a classification machine learning (ML) analysis:

Install (if needed) and load the packages.

# install.packages("tidyverse")
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
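
The commented-out install.packages() lines in this post only need to be run once. As a rough sketch, assuming you would rather check what is missing than reinstall everything, something like the following could be used. The helper missing_packages() is a made-up name for illustration; {forester} itself is installed from GitHub, so it is not in the CRAN list.

```r
# Hypothetical helper: which of the requested packages are not installed yet?
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# CRAN packages used in this post.
cran_pkgs <- c("tidyverse", "devtools", "DALEX", "mlbench", "skimr", "datarium")
to_install <- missing_packages(cran_pkgs)
print(to_install)

# Uncomment to install whatever is missing from CRAN:
# install.packages(to_install)
```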

# install.packages("devtools")
# devtools::install_github("ModelOriented/forester")
library(forester)

## 
## Attaching package: 'forester'
## 
## The following object is masked from 'package:dplyr':
## 
##     explain

# install.packages("DALEX")
library(DALEX)

## Welcome to DALEX (version: 2.4.3).
## Find examples and detailed introduction at: http://ema.drwhy.ai/
## 
## 
## Attaching package: 'DALEX'
## 
## The following object is masked from 'package:forester':
## 
##     explain
## 
## The following object is masked from 'package:dplyr':
## 
##     explain
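
The "masked from" messages above mean that several attached packages define a function called explain(), and a bare explain() call uses whichever package was attached last. That is why the code later in this post writes forester::explain() explicitly. A small sketch of how to inspect this yourself, using utils::find() from base R; resolves_to() is a hypothetical helper name, and the demo uses filter() so it runs without any extra packages.

```r
# utils::find() lists every attached environment that defines a name,
# in search-path order; the first entry is the one a bare call uses.
utils::find("filter", mode = "function")

# Hypothetical helper: report which attached package a bare call resolves to.
resolves_to <- function(fn_name) {
  hits <- utils::find(fn_name, mode = "function")
  if (length(hits) == 0) NA_character_ else hits[1]
}

# "package:stats" in a fresh session, "package:dplyr" once dplyr is attached.
resolves_to("filter")

# With forester and DALEX attached in the order shown above, the same call
# for "explain" should point at DALEX; pkg::fun() bypasses the search path.
# resolves_to("explain")
```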

Load the Pima Indians diabetes dataset. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset.

# Pima Indians Diabetes dataset, originally from the UCI repository
# and shipped with the {mlbench} package.
# install.packages("mlbench")
library(mlbench) 
data(PimaIndiansDiabetes)

Use the {skimr} package to quickly display the Pima Indians diabetes dataset summary statistics.

# skimming dataset
# install.packages("skimr")
library(skimr)
skim(PimaIndiansDiabetes)

Data summary
Name	PimaIndiansDiabetes
Number of rows	768
Number of columns	9
_______________________
Column type frequency:
factor	1
numeric	8
________________________
Group variables	None

Variable type: factor

skim_variable	n_missing	complete_rate	ordered	n_unique	top_counts
diabetes	0	1	FALSE	2	neg: 500, pos: 268

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
pregnant	1	3.85	3.37	0.00	1.00	3.00	6.00	17.00	▇▃▂▁▁
glucose	1	120.89	31.97	0.00	99.00	117.00	140.25	199.00	▁▁▇▆▂
pressure	1	69.11	19.36	0.00	62.00	72.00	80.00	122.00	▁▁▇▇▁
triceps	1	20.54	15.95	0.00	0.00	23.00	32.00	99.00	▇▇▂▁▁
insulin	1	79.80	115.24	0.00	0.00	30.50	127.25	846.00	▇▁▁▁▁
mass	1	31.99	7.88	0.00	27.30	32.00	36.60	67.10	▁▃▇▂▁
pedigree	1	0.47	0.33	0.08	0.24	0.37	0.63	2.42	▇▃▁▁▁
age	1	33.24	11.76	21.00	24.00	29.00	41.00	81.00	▇▃▁▁▁
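
One thing worth noticing in the p0 column: glucose, pressure, triceps, insulin, and mass all have a minimum of 0, which is not physiologically plausible. The {mlbench} documentation treats these zeros as missing measurements and provides a corrected version, PimaIndiansDiabetes2, with them set to NA. A quick sketch for counting zeros per column follows; count_zeros() is a hypothetical helper, and the small data frame is made up so the example runs without {mlbench}.

```r
# Hypothetical helper: number of exact zeros in each numeric column.
count_zeros <- function(df) {
  numeric_cols <- vapply(df, is.numeric, logical(1))
  colSums(df[, numeric_cols, drop = FALSE] == 0, na.rm = TRUE)
}

# Toy data frame (made up for illustration, not taken from the Pima data).
toy <- data.frame(
  glucose  = c(0, 99, 117),
  mass     = c(27.3, 0, 0),
  diabetes = factor(c("neg", "pos", "neg"))
)
zero_counts <- count_zeros(toy)
zero_counts

# With {mlbench} loaded, the same call works on the real data:
# count_zeros(PimaIndiansDiabetes)
```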

Create a data report using the check_data() function from the {forester} package.

# Run the data check pipeline to look for potential problems with the data
check <- check_data(PimaIndiansDiabetes, 'diabetes')

##  -------------------- CHECK DATA REPORT -------------------- 
##  
## The dataset has 768 observations and 9 columns, which names are: 
## pregnant; glucose; pressure; triceps; insulin; mass; pedigree; age; diabetes; 
## 
## With the target described by a column diabetes.
## 
## ✔ No static columns. 
## 
## ✔ No duplicate columns.
## 
## ✔ No target values are missing. 
## 
## ✔ No predictor values are missing. 
## 
## ✔  No issues with dimensionality. 
## 
## ✔ No strongly correlated, by Spearman rank, pairs of numerical values. 
## 
## ✖ There are more than 50 possible outliers in the data set, so we are not printing them. They are returned in the output as a vector. 
## 
## ✖ Dataset is unbalanced with: 1.865672 proportion with neg being a dominating class.
## 
## ✔ Columns names suggest that none of them are IDs. 
## 
## ✔ Columns data suggest that none of them are IDs. 
## 
##  -------------------- CHECK DATA REPORT END -------------------- 
##
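
The imbalance figure of 1.865672 in the report appears to be the ratio of the majority class to the minority class. This can be checked by hand from the class counts that skim() printed earlier (neg: 500, pos: 268). The snippet below is just that arithmetic, not forester's own code.

```r
# Class counts as reported by skim() earlier in this post.
class_counts <- c(neg = 500, pos = 268)

# Ratio of the dominating class to the minority class.
imbalance_ratio <- max(class_counts) / min(class_counts)
round(imbalance_ratio, 6)

# Share of each class in the data.
round(prop.table(class_counts), 3)
```

So roughly 65% of the patients are negative and 35% are positive, which is worth keeping in mind when reading the accuracy column later on.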

Train basic models with {forester}, and view models’ ranking list.

output_1_diabetes <- train(data=PimaIndiansDiabetes,
                      y = 'diabetes',
                      bayes_iter   = 0,
                      random_evals = 0,
                      verbose = FALSE,
                      sort_by = 'auc')

head(output_1_diabetes$score_test)

##   no.                name        engine tuning  accuracy       auc        f1
## 1   1        ranger_model        ranger  basic 0.7727273 0.8238889 0.6391753
## 2   2       xgboost_model       xgboost  basic 0.7597403 0.8087037 0.6476190
## 3   4      lightgbm_model      lightgbm  basic 0.7402597 0.8055556 0.6078431
## 4   3 decision_tree_model decision_tree  basic 0.7012987 0.7705556 0.6101695
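
For readers who want to see what the three columns measure, here is a minimal base R sketch of accuracy, F1, and AUC using their standard textbook definitions. The function names, the 0.5 threshold, the choice of positive class, and the toy vectors are all assumptions for illustration; forester's internal implementation may differ in those details.

```r
# Share of correct predictions.
accuracy <- function(truth, pred) mean(truth == pred)

# Harmonic mean of precision and recall, written in terms of TP, FP, FN.
f1_score <- function(truth, pred, positive = 1) {
  tp <- sum(pred == positive & truth == positive)
  fp <- sum(pred == positive & truth != positive)
  fn <- sum(pred != positive & truth == positive)
  2 * tp / (2 * tp + fp + fn)
}

# AUC via the rank (Mann-Whitney) formula: the probability that a random
# positive case gets a higher score than a random negative case.
auc_rank <- function(truth, prob, positive = 1) {
  r <- rank(prob)  # average ranks handle ties
  n_pos <- sum(truth == positive)
  n_neg <- sum(truth != positive)
  (sum(r[truth == positive]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy labels and predicted probabilities (made up for illustration).
truth <- c(1, 1, 1, 0, 0, 0, 0, 1)
prob  <- c(0.9, 0.8, 0.4, 0.3, 0.2, 0.6, 0.1, 0.7)
pred  <- as.integer(prob >= 0.5)

metrics <- c(
  accuracy = accuracy(truth, pred),
  f1       = f1_score(truth, pred),
  auc      = auc_rank(truth, prob)
)
metrics
```

Accuracy and F1 depend on the classification threshold, while AUC only depends on how the predicted probabilities are ordered.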

Create an explainer and feature importance plot for the basic models that shows us which columns were the most important for the model.

library(DALEX)
ex_1_diabetes <- forester::explain(models = output_1_diabetes$best_models[[1]],
                        test_data = output_1_diabetes$test_data,
                        y = output_1_diabetes$y)

model_1_parts_diabetes <- DALEX::model_parts(ex_1_diabetes$ranger_model)
plot(model_1_parts_diabetes, max_vars = 9)
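
By default, DALEX::model_parts() measures importance by permutation: it shuffles one column at a time and records how much the model's loss gets worse. The sketch below shows the idea in base R. It is not DALEX's code; perm_importance() is a hypothetical helper, and a linear model on the built-in mtcars data stands in for the tree-based models so the example runs without extra packages.

```r
set.seed(42)

rmse <- function(y, yhat) sqrt(mean((y - yhat)^2))

# Hypothetical helper: average increase in RMSE after shuffling each predictor.
perm_importance <- function(model, data, target, n_rep = 10) {
  predictors <- setdiff(names(data), target)
  baseline <- rmse(data[[target]], predict(model, data))
  vapply(predictors, function(v) {
    losses <- replicate(n_rep, {
      shuffled <- data
      shuffled[[v]] <- sample(shuffled[[v]])  # break the link to the target
      rmse(data[[target]], predict(model, shuffled))
    })
    mean(losses) - baseline
  }, numeric(1))
}

# Stand-in model and data (assumption: lm on mtcars instead of a forest).
cars_df <- mtcars[, c("mpg", "wt", "hp")]
fit <- lm(mpg ~ wt + hp, data = cars_df)
imp <- perm_importance(fit, cars_df, "mpg")
sort(imp, decreasing = TRUE)
```

A column whose shuffling barely changes the loss contributes little to the predictions; a large increase marks a column the model relies on.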

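Train tuned models with {forester}, and view models’ ranking list. Please be patient: with bayes_iter = 20 and random_evals = 20, many more models are trained and compared than in the basic run, so this step takes noticeably longer.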

output_2_diabetes <- train(data=PimaIndiansDiabetes,
                         y = 'diabetes',
                         bayes_iter   = 20,
                         random_evals = 20,
                         verbose = FALSE,
                         sort_by = 'auc')

head(output_2_diabetes$score_test)

##   no.         name engine        tuning  accuracy       auc        f1
## 1   6  ranger_RS_2 ranger random_search 0.7727273 0.8288889 0.6391753
## 2  17 ranger_RS_13 ranger random_search 0.7792208 0.8257407 0.6458333
## 3   1 ranger_model ranger         basic 0.7727273 0.8238889 0.6391753
## 4  10  ranger_RS_6 ranger random_search 0.7792208 0.8238889 0.6458333
## 5   5  ranger_RS_1 ranger random_search 0.7662338 0.8201852 0.6250000
## 6   7  ranger_RS_3 ranger random_search 0.7857143 0.8192593 0.6526316
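
The random_search rows come from the random_evals = 20 setting: hyperparameter combinations are drawn at random, each candidate is scored, and the results are ranked. The following is a generic base R sketch of that idea, not forester's implementation. To keep it self-contained, closed-form ridge regression on the built-in mtcars data is swapped in for the tree-based models, with the penalty lambda as the only hyperparameter; all names here are illustrative.

```r
set.seed(1)

# Standardised predictors and a centred target from the built-in mtcars data.
x <- scale(as.matrix(mtcars[, c("wt", "hp", "disp", "qsec")]))
y <- mtcars$mpg - mean(mtcars$mpg)
train_idx <- sample(nrow(x), 22)  # remaining 10 rows are used for validation

# Closed-form ridge regression; lambda is the hyperparameter being tuned.
ridge_fit <- function(x, y, lambda) {
  solve(crossprod(x) + lambda * diag(ncol(x)), crossprod(x, y))
}

validation_rmse <- function(lambda) {
  beta <- ridge_fit(x[train_idx, ], y[train_idx], lambda)
  pred <- x[-train_idx, ] %*% beta
  sqrt(mean((y[-train_idx] - pred)^2))
}

# Random search: draw 20 candidates on a log scale and score each one.
candidates <- 10^runif(20, min = -3, max = 3)
results <- data.frame(
  lambda = candidates,
  rmse   = vapply(candidates, validation_rmse, numeric(1))
)

# Rank the candidates the way a leaderboard would, best first.
results <- results[order(results$rmse), ]
head(results, 3)
```

Because the candidates are random, two runs without a fixed seed can produce different winners, which is one reason the model names in your own ranking may not match the ones printed here.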

Create an explainer and feature importance plot for the tuned models that shows us which columns were the most important for the model.

library(DALEX)
ex_2_diabetes <- forester::explain(models = output_2_diabetes$best_models[[1]],
                        test_data = output_2_diabetes$test_data,
                        y = output_2_diabetes$y)

model_2_parts_diabetes <- DALEX::model_parts(ex_2_diabetes$ranger_RS_13)
plot(model_2_parts_diabetes, max_vars = 9)

Regression machine learning (ML) dataset analysis

Here are the steps for running a regression machine learning (ML) analysis:

Load the marketing dataset from the {datarium} package for the regression analysis.

library(datarium)
data(marketing)

Use the {skimr} package to quickly display the marketing dataset summary statistics.

# skimming dataset
# install.packages("skimr")
library(skimr)
skim(marketing)

Data summary
Name	marketing
Number of rows	200
Number of columns	4
_______________________
Column type frequency:
numeric	4
________________________
Group variables	None

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
youtube	1	176.45	103.03	0.84	89.25	179.70	262.59	355.68	▇▆▆▇▆
facebook	1	27.92	17.82	0.00	11.97	27.48	43.83	59.52	▇▆▆▆▆
newspaper	1	36.66	26.13	0.36	15.30	30.90	54.12	136.80	▇▆▃▁▁
sales	1	16.83	6.26	1.92	12.45	15.48	20.88	32.40	▁▇▇▅▂

Train basic models with {forester}, and view models’ ranking list.

output_1_sales <- train(data=marketing,
                      y = 'sales',
                      bayes_iter   = 0,
                      random_evals = 0,
                      verbose = FALSE,
                      sort_by = 'mse')

head(output_1_sales$score_test)

##   no.                name        engine tuning     rmse      mse        r2
## 1   2       xgboost_model       xgboost  basic 1.348093 1.817355 0.9514301
## 2   4      lightgbm_model      lightgbm  basic 1.354851 1.835621 0.9509419
## 3   1        ranger_model        ranger  basic 2.028579 4.115134 0.8900206
## 4   3 decision_tree_model decision_tree  basic 2.145566 4.603452 0.8769700
##        mae
## 1 1.095299
## 2 1.158454
## 3 1.401629
## 4 1.769859
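
The four regression columns can be reproduced with a few lines of base R. This sketch uses the standard definitions and a made-up toy example; the function names are illustrative, and forester's own implementation may differ in details.

```r
mse  <- function(y, yhat) mean((y - yhat)^2)
rmse <- function(y, yhat) sqrt(mse(y, yhat))
mae  <- function(y, yhat) mean(abs(y - yhat))
r2   <- function(y, yhat) 1 - sum((y - yhat)^2) / sum((y - mean(y))^2)

# Toy observed and predicted values (made up for illustration).
y    <- c(3, 5, 7, 9)
yhat <- c(2, 5, 8, 10)
reg_metrics <- c(
  rmse = rmse(y, yhat),
  mse  = mse(y, yhat),
  r2   = r2(y, yhat),
  mae  = mae(y, yhat)
)
reg_metrics

# Consistency check on the table above: rmse is the square root of mse.
sqrt(1.817355)  # close to the reported rmse of 1.348093 for xgboost_model
```

Since rmse is just the square root of mse, ranking the models by either metric puts them in the same order.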

Create an explainer and feature importance plot for the best basic sales model, showing which columns were the most important for its predictions.

library(DALEX)
ex_1_sales <- forester::explain(models = output_1_sales$best_models[[1]],
                        test_data = output_1_sales$test_data,
                        y = output_1_sales$y)

model_1_parts_sales <- DALEX::model_parts(ex_1_sales$xgboost_model)
plot(model_1_parts_sales, max_vars = 4)

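Train tuned models with {forester}, and view models’ ranking list. Please be patient: with bayes_iter = 20 and random_evals = 20, many more models are trained and compared than in the basic run, so this step takes noticeably longer.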

output_2_sales <- train(data=marketing,
                         y = 'sales',
                         bayes_iter   = 20,
                         random_evals = 20,
                         verbose = FALSE,
                         sort_by = 'mse')

head(output_2_sales$score_test)

##   no.           name   engine        tuning     rmse      mse        r2
## 1  29   xgboost_RS_5  xgboost random_search 1.120546 1.255623 0.9664427
## 2  25   xgboost_RS_1  xgboost random_search 1.296193 1.680117 0.9550979
## 3  88 lightgbm_bayes lightgbm     bayes_opt 1.315767 1.731243 0.9537315
## 4  86  xgboost_bayes  xgboost     bayes_opt 1.339707 1.794814 0.9520325
## 5   2  xgboost_model  xgboost         basic 1.348093 1.817355 0.9514301
## 6   4 lightgbm_model lightgbm         basic 1.354851 1.835621 0.9509419
##         mae
## 1 0.8525029
## 2 1.0966873
## 3 1.1216577
## 4 1.0284954
## 5 1.0952987
## 6 1.1584541
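
To put the gain from tuning into perspective, the best tuned model can be compared with the best basic model from the earlier run. The snippet below is plain arithmetic on the mse values printed in this post, nothing more.

```r
# Test-set MSE values copied from the rankings in this post.
mse_basic <- 1.817355  # xgboost_model (basic)
mse_tuned <- 1.255623  # xgboost_RS_5 (random_search)

# Relative reduction in MSE achieved by the tuned model, in percent.
relative_reduction <- (mse_basic - mse_tuned) / mse_basic
round(100 * relative_reduction, 1)
```

That is a reduction of about 31% in test-set MSE. Keep in mind that the marketing dataset has only 200 rows, so the test set is small and part of the difference may come down to the particular split.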

Create an explainer and feature importance plot for the tuned models that shows us which columns were the most important for the model.

library(DALEX)
ex_2_sales <- forester::explain(models = output_2_sales$best_models[[1]],
                        test_data = output_2_sales$test_data,
                        y = output_2_sales$y)

model_2_parts_sales <- DALEX::model_parts(ex_2_sales$xgboost_RS_5)
plot(model_2_parts_sales, max_vars = 9)

Finally, special thanks to the authors and developers of the {forester} AutoML package.
For more information regarding the package please visit: https://github.com/ModelOriented/forester

Ramon Rodriguez-Santana, MBA, MPH

2024-02-11