The {forester} package is an AutoML tool in R for regression and binary classification tasks on tabular data. It wraps the entire machine learning pipeline into a single train() function (a minimal sketch follows the source link below), which:
i) renders a brief data check report,
ii) preprocesses the input dataset just enough for the models to be trained,
iii) trains five tree-based models (decision tree, random forest, xgboost, catboost, lightgbm) with default parameters, random search, and Bayesian optimization,
iv) evaluates them and returns a ranked list.
Source: https://www.r-bloggers.com/2023/02/forester-an-r-package-for-automated-building-of-tree-based-models/
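In practice the whole pipeline boils down to one call. Here is a minimal sketch, where df and 'target' are placeholders for your own data frame and the name of its target column:
# one call runs the data check, preprocessing, training, and evaluation
output <- train(data = df, y = 'target')
head(output$score_test)  # ranked list of the trained models, best first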
Before we begin, let's set up our working directory.
setwd("~/Documents/Using the AutoML {forester} package for Tree-based Models")
Here are the steps for conducting a machine learning (ML) analysis on a classification dataset:
# install.packages("tidyverse")
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# install.packages("devtools")
# devtools::install_github("ModelOriented/forester")
library(forester)
##
## Attaching package: 'forester'
##
## The following object is masked from 'package:dplyr':
##
## explain
# install.packages("DALEX")
library(DALEX)
## Welcome to DALEX (version: 2.4.3).
## Find examples and detailed introduction at: http://ema.drwhy.ai/
##
##
## Attaching package: 'DALEX'
##
## The following object is masked from 'package:forester':
##
## explain
##
## The following object is masked from 'package:dplyr':
##
## explain
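Note that both {forester} and {DALEX} export an explain() function (hence the masking messages above). To keep the calls unambiguous, the code below uses namespace-qualified names such as forester::explain() and DALEX::model_parts().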
# Using the Pima Indians Diabetes dataset
# from the UCI repository (via the {mlbench} package)
# install.packages("mlbench")
library(mlbench)
data(PimaIndiansDiabetes)
# skimming dataset
# install.packages("skimr")
library(skimr)
skim(PimaIndiansDiabetes)
| | |
|---|---|
| Name | PimaIndiansDiabetes |
| Number of rows | 768 |
| Number of columns | 9 |
| Column type frequency: | |
| factor | 1 |
| numeric | 8 |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| diabetes | 0 | 1 | FALSE | 2 | neg: 500, pos: 268 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| pregnant | 0 | 1 | 3.85 | 3.37 | 0.00 | 1.00 | 3.00 | 6.00 | 17.00 | ▇▃▂▁▁ |
| glucose | 0 | 1 | 120.89 | 31.97 | 0.00 | 99.00 | 117.00 | 140.25 | 199.00 | ▁▁▇▆▂ |
| pressure | 0 | 1 | 69.11 | 19.36 | 0.00 | 62.00 | 72.00 | 80.00 | 122.00 | ▁▁▇▇▁ |
| triceps | 0 | 1 | 20.54 | 15.95 | 0.00 | 0.00 | 23.00 | 32.00 | 99.00 | ▇▇▂▁▁ |
| insulin | 0 | 1 | 79.80 | 115.24 | 0.00 | 0.00 | 30.50 | 127.25 | 846.00 | ▇▁▁▁▁ |
| mass | 0 | 1 | 31.99 | 7.88 | 0.00 | 27.30 | 32.00 | 36.60 | 67.10 | ▁▃▇▂▁ |
| pedigree | 0 | 1 | 0.47 | 0.33 | 0.08 | 0.24 | 0.37 | 0.63 | 2.42 | ▇▃▁▁▁ |
| age | 0 | 1 | 33.24 | 11.76 | 21.00 | 24.00 | 29.00 | 41.00 | 81.00 | ▇▃▁▁▁ |
# Run the data check pipeline to look for potential problems in the data
check <- check_data(PimaIndiansDiabetes, 'diabetes')
## -------------------- CHECK DATA REPORT --------------------
##
## The dataset has 768 observations and 9 columns, which names are:
## pregnant; glucose; pressure; triceps; insulin; mass; pedigree; age; diabetes;
##
## With the target described by a column diabetes.
##
## ✔ No static columns.
##
## ✔ No duplicate columns.
##
## ✔ No target values are missing.
##
## ✔ No predictor values are missing.
##
## ✔ No issues with dimensionality.
##
## ✔ No strongly correlated, by Spearman rank, pairs of numerical values.
##
## ✖ There are more than 50 possible outliers in the data set, so we are not printing them. They are returned in the output as a vector.
##
## ✖ Dataset is unbalanced with: 1.865672 proportion with neg being a dominating class.
##
## ✔ Columns names suggest that none of them are IDs.
##
## ✔ Columns data suggest that none of them are IDs.
##
## -------------------- CHECK DATA REPORT END --------------------
##
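The check is also returned as an R object, so the flagged outliers can be retrieved programmatically rather than read off the report. A quick sketch (the exact element names may differ between forester versions, so we inspect the structure first):
# the check report is returned as a list; its structure shows where
# the outlier indices flagged above are stored
str(check, max.level = 1)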
# baseline run: default hyperparameters only, no tuning
output_1_diabetes <- train(data = PimaIndiansDiabetes,
                           y = 'diabetes',
                           bayes_iter = 0,    # skip Bayesian optimization
                           random_evals = 0,  # skip random search
                           verbose = FALSE,
                           sort_by = 'auc')   # rank models by test AUC
head(output_1_diabetes$score_test)
## no. name engine tuning accuracy auc f1
## 1 1 ranger_model ranger basic 0.7727273 0.8238889 0.6391753
## 2 2 xgboost_model xgboost basic 0.7597403 0.8087037 0.6476190
## 3 4 lightgbm_model lightgbm basic 0.7402597 0.8055556 0.6078431
## 4 3 decision_tree_model decision_tree basic 0.7012987 0.7705556 0.6101695
library(DALEX)
# wrap the best model in a DALEX explainer for model-agnostic analysis
ex_1_diabetes <- forester::explain(models = output_1_diabetes$best_models[[1]],
                                   test_data = output_1_diabetes$test_data,
                                   y = output_1_diabetes$y)
# permutation-based variable importance of the best (ranger) model
model_1_parts_diabetes <- DALEX::model_parts(ex_1_diabetes$ranger_model)
plot(model_1_parts_diabetes, max_vars = 9)
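model_parts() computes permutation-based variable importance for the explained model. The same explainer object works with other DALEX functions as well; for instance, a quick partial dependence sketch (model_profile() is standard DALEX API, applied here to the explainer built above):
# partial dependence profiles of the predictors for the best model
model_1_profile_diabetes <- DALEX::model_profile(ex_1_diabetes$ranger_model)
plot(model_1_profile_diabetes)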
# tuned run: add 20 random search evaluations and 20 Bayesian optimization iterations
output_2_diabetes <- train(data = PimaIndiansDiabetes,
                           y = 'diabetes',
                           bayes_iter = 20,
                           random_evals = 20,
                           verbose = FALSE,
                           sort_by = 'auc')
head(output_2_diabetes$score_test)
## no. name engine tuning accuracy auc f1
## 1 6 ranger_RS_2 ranger random_search 0.7727273 0.8288889 0.6391753
## 2 17 ranger_RS_13 ranger random_search 0.7792208 0.8257407 0.6458333
## 3 1 ranger_model ranger basic 0.7727273 0.8238889 0.6391753
## 4 10 ranger_RS_6 ranger random_search 0.7792208 0.8238889 0.6458333
## 5 5 ranger_RS_1 ranger random_search 0.7662338 0.8201852 0.6250000
## 6 7 ranger_RS_3 ranger random_search 0.7857143 0.8192593 0.6526316
library(DALEX)
# explain the best tuned model the same way
ex_2_diabetes <- forester::explain(models = output_2_diabetes$best_models[[1]],
                                   test_data = output_2_diabetes$test_data,
                                   y = output_2_diabetes$y)
# variable importance of the best random search ranger model
model_2_parts_diabetes <- DALEX::model_parts(ex_2_diabetes$ranger_RS_13)
plot(model_2_parts_diabetes, max_vars = 9)
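To see what the tuning bought us, we can compare the best test AUC of the two runs directly from the score tables shown above:
# best test AUC: baseline run vs. run with random search and Bayesian optimization
max(output_1_diabetes$score_test$auc)  # 0.8238889 (basic ranger)
max(output_2_diabetes$score_test$auc)  # 0.8288889 (ranger_RS_2)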
Here are the steps for conducting a machine learning (ML) analysis on a regression dataset:
# install.packages("datarium")
library(datarium)
data(marketing)
# skimming dataset
# install.packages("skimr")
library(skimr)
skim(marketing)
| | |
|---|---|
| Name | marketing |
| Number of rows | 200 |
| Number of columns | 4 |
| Column type frequency: | |
| numeric | 4 |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| youtube | 0 | 1 | 176.45 | 103.03 | 0.84 | 89.25 | 179.70 | 262.59 | 355.68 | ▇▆▆▇▆ |
| facebook | 0 | 1 | 27.92 | 17.82 | 0.00 | 11.97 | 27.48 | 43.83 | 59.52 | ▇▆▆▆▆ |
| newspaper | 0 | 1 | 36.66 | 26.13 | 0.36 | 15.30 | 30.90 | 54.12 | 136.80 | ▇▆▃▁▁ |
| sales | 0 | 1 | 16.83 | 6.26 | 1.92 | 12.45 | 15.48 | 20.88 | 32.40 | ▁▇▇▅▂ |
# baseline run for the regression task: default models only, no tuning
output_1_sales <- train(data = marketing,
                        y = 'sales',
                        bayes_iter = 0,
                        random_evals = 0,
                        verbose = FALSE,
                        sort_by = 'mse')  # rank models by test MSE
head(output_1_sales$score_test)
## no. name engine tuning rmse mse r2
## 1 2 xgboost_model xgboost basic 1.348093 1.817355 0.9514301
## 2 4 lightgbm_model lightgbm basic 1.354851 1.835621 0.9509419
## 3 1 ranger_model ranger basic 2.028579 4.115134 0.8900206
## 4 3 decision_tree_model decision_tree basic 2.145566 4.603452 0.8769700
## mae
## 1 1.095299
## 2 1.158454
## 3 1.401629
## 4 1.769859
library(DALEX)
# explain the best regression model (xgboost) with DALEX
ex_1_sales <- forester::explain(models = output_1_sales$best_models[[1]],
                                test_data = output_1_sales$test_data,
                                y = output_1_sales$y)
# permutation-based variable importance
model_1_parts_sales <- DALEX::model_parts(ex_1_sales$xgboost_model)
plot(model_1_parts_sales, max_vars = 4)
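Beyond variable importance, the same explainer supports residual diagnostics through DALEX::model_performance() (standard DALEX API; a quick sketch on the explainer built above):
# residual-based performance summary and residual distribution plot
perf_1_sales <- DALEX::model_performance(ex_1_sales$xgboost_model)
perf_1_sales
plot(perf_1_sales)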
# tuned run: 20 random search evaluations and 20 Bayesian optimization iterations
output_2_sales <- train(data = marketing,
                        y = 'sales',
                        bayes_iter = 20,
                        random_evals = 20,
                        verbose = FALSE,
                        sort_by = 'mse')
head(output_2_sales$score_test)
## no. name engine tuning rmse mse r2
## 1 29 xgboost_RS_5 xgboost random_search 1.120546 1.255623 0.9664427
## 2 25 xgboost_RS_1 xgboost random_search 1.296193 1.680117 0.9550979
## 3 88 lightgbm_bayes lightgbm bayes_opt 1.315767 1.731243 0.9537315
## 4 86 xgboost_bayes xgboost bayes_opt 1.339707 1.794814 0.9520325
## 5 2 xgboost_model xgboost basic 1.348093 1.817355 0.9514301
## 6 4 lightgbm_model lightgbm basic 1.354851 1.835621 0.9509419
## mae
## 1 0.8525029
## 2 1.0966873
## 3 1.1216577
## 4 1.0284954
## 5 1.0952987
## 6 1.1584541
library(DALEX)
# explain the best tuned regression model the same way
ex_2_sales <- forester::explain(models = output_2_sales$best_models[[1]],
                                test_data = output_2_sales$test_data,
                                y = output_2_sales$y)
model_2_parts_sales <- DALEX::model_parts(ex_2_sales$xgboost_RS_5)
plot(model_2_parts_sales, max_vars = 4)  # marketing has only 3 predictors
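As with the classification task, comparing the best test RMSE of the two runs shows the gain from tuning (values taken from the score tables above):
# best test RMSE: baseline run vs. tuned run
min(output_1_sales$score_test$rmse)  # 1.348093 (basic xgboost)
min(output_2_sales$score_test$rmse)  # 1.120546 (xgboost_RS_5)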
Finally, special thanks to the authors and developers of the {forester} AutoML package.
For more information about the package, please visit: https://github.com/ModelOriented/forester