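# Global chunk options for the rendered post
knitr::opts_chunk$set(
  echo = FALSE,
  warning = FALSE,
  message = FALSE,
  eval = TRUE,
  results = "asis",
  fig.height = 6,
  fig.width = 8
)
# Fix the random seed so resampling results are reproducible
set.seed(1234)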
The Tidymodels framework is a collection of modeling and machine learning packages built on Tidyverse principles. In contrast to the flexibility of R, Tidymodels is an opinionated system with an underlying solution philosophy. For this blog post, I wanted to explore the implications of the Tidymodels approach on the development effort. Tidymodels goes beyond exposing modeling capabilities: it dictates a solution methodology and workflow for problem-solving.
Opinionated systems such as Tidymodels have several benefits, including:
In the negative column, the drawbacks of opinionated systems include:
The core packages that make up the Tidymodels universe include rsample, parsnip, recipes, workflows, tune, yardstick, broom, and dials. For this blog post, I will explore how these packages impact data preparation, model definition, and the model execution workflow.
The Tidyverse advocates for Tidy Data, a consistent representation of the model data. The data preparation step transforms data into a consistent model that adheres to the following rules: a) Each variable must have its own column. b) Each observation must have its own row. c) Each value must have its own cell.
Data that conforms to the tidy data rules is more consistent and easier to work with. Furthermore, these rules enable efficient manipulation with Tidyverse tools such as dplyr and ggplot2.
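As a minimal sketch of the three rules above, the snippet below reshapes a small "wide" table into tidy form with tidyr. The table, the team codes, and the win totals are hypothetical and are not taken from the data set used in this post.

```r
library(tidyr)

# Hypothetical "wide" table: the season is hidden in the column names,
# so the variable "yearID" does not have its own column.
wide <- data.frame(
  teamID = c("AAA", "BBB"),
  `2020` = c(80, 75),
  `2021` = c(90, 70),
  check.names = FALSE
)

# Tidy form: one column per variable, one row per team-season observation.
tidy <- pivot_longer(
  wide,
  cols = c("2020", "2021"),
  names_to = "yearID",
  values_to = "W"
)

print(tidy)
```

After the reshape, each row is a single team-season and can be passed directly to dplyr verbs or ggplot2 aesthetics.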
Name | Piped data |
---|---|
Number of rows | 399 |
Number of columns | 21 |
Column type frequency: | |
character | 5 |
factor | 1 |
numeric | 15 |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
lgID | 8 | 0.98 | 2 | 2 | 0 | 6 | 0 |
teamID | 0 | 1.00 | 3 | 3 | 0 | 74 | 0 |
franchID | 0 | 1.00 | 3 | 3 | 0 | 54 | 0 |
name | 0 | 1.00 | 11 | 29 | 0 | 76 | 0 |
divID | 0 | 1.00 | 0 | 1 | 213 | 4 | 0 |
Variable type: factor
skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|
era_cat | 0 | 1 | FALSE | 3 | 196: 186, 190: 158, 190: 55 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
yearID | 0 | 1 | 1956.70 | 44.56 | 1872.00 | 1916.50 | 1963.00 | 1997.50 | 2021.00 | ▃▅▃▅▇ |
G | 0 | 1 | 149.14 | 25.39 | 9.00 | 153.00 | 157.00 | 162.00 | 164.00 | ▁▁▁▁▇ |
W | 0 | 1 | 73.01 | 18.00 | 2.00 | 64.00 | 75.00 | 86.00 | 106.00 | ▁▁▃▇▅ |
L | 0 | 1 | 75.26 | 17.45 | 7.00 | 67.00 | 77.00 | 87.00 | 112.00 | ▁▁▃▇▂ |
R | 0 | 1 | 676.53 | 139.04 | 54.00 | 612.50 | 691.00 | 759.00 | 1131.00 | ▁▁▇▇▁ |
H | 0 | 1 | 1332.62 | 239.76 | 91.00 | 1297.50 | 1386.00 | 1462.00 | 1724.00 | ▁▁▁▇▇ |
X2B | 0 | 1 | 227.37 | 61.18 | 10.00 | 193.50 | 234.00 | 272.00 | 376.00 | ▁▂▆▇▂ |
X3B | 0 | 1 | 47.43 | 23.38 | 3.00 | 28.00 | 43.00 | 65.00 | 108.00 | ▃▇▅▃▂ |
HR | 0 | 1 | 100.82 | 64.72 | 0.00 | 38.00 | 100.00 | 154.00 | 279.00 | ▇▆▆▃▁ |
RA | 0 | 1 | 685.12 | 138.85 | 140.00 | 605.50 | 701.00 | 778.00 | 1088.00 | ▁▁▇▇▁ |
ER | 0 | 1 | 572.05 | 150.84 | 56.00 | 496.00 | 597.00 | 676.50 | 884.00 | ▁▂▅▇▂ |
HA | 0 | 1 | 1336.93 | 240.40 | 148.00 | 1283.00 | 1396.00 | 1476.00 | 1689.00 | ▁▁▁▅▇ |
HRA | 0 | 1 | 101.90 | 62.16 | 0.00 | 41.50 | 102.00 | 152.50 | 237.00 | ▇▆▇▆▂ |
wPer | 0 | 1 | 0.49 | 0.09 | 0.09 | 0.43 | 0.49 | 0.55 | 0.81 | ▁▂▇▆▁ |
pythPer | 0 | 1 | 0.49 | 0.10 | 0.10 | 0.44 | 0.49 | 0.56 | 0.83 | ▁▂▇▅▁ |
In non-opinionated systems, there are several approaches to accomplishing a specific task, which empowers users to make their own decisions. This flexibility is diminished in opinionated systems, which have a predefined approach or set of approaches for accomplishing tasks.
The Tidymodels model definition and model execution workflow are detailed below.
The Tidymodels recipe is similar to the formula definition in the lm() function; however, it allows for feature engineering, variable role definition, and inheritance. The recipe definition provides a programmatically compact methodology for describing a collection of recipes in a single location.
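To make the comparison with an lm() formula concrete, here is a small sketch of a recipe. It uses the built-in mtcars data rather than the data from this post, and the particular steps (normalization followed by PCA with three components) are assumptions chosen for illustration only.

```r
library(recipes)

# The formula names the outcome and predictors, as in lm();
# the steps that follow add feature engineering on top of it.
rec <- recipe(mpg ~ ., data = mtcars) |>
  step_normalize(all_numeric_predictors()) |>
  step_pca(all_numeric_predictors(), num_comp = 3)

# prep() estimates the step parameters from the training data;
# bake() with new_data = NULL returns the processed training set.
prepped <- prep(rec, training = mtcars)
baked <- bake(prepped, new_data = NULL)

print(names(baked))
```

The recipe itself is only a description; nothing is computed until prep() is called, which is what lets the same definition be reused across resamples.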
The parsnip package provides a unified interface to the available models. This interface decouples the model definition from the semantic details of the underlying package, so users can rapidly experiment with a range of models without getting bogged down in those details. This level of abstraction reduces the learning curve required to execute different models.
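The following sketch illustrates the unified interface, assuming the built-in mtcars data and two engines that ship with a standard R installation. A decision tree via rpart stands in for the random forest used in this post so that the example runs without extra packages.

```r
library(parsnip)

# Two different model types, declared with the same vocabulary.
lm_spec <- linear_reg() |>
  set_engine("lm")

tree_spec <- decision_tree() |>
  set_engine("rpart") |>
  set_mode("regression")

specs <- list(lm = lm_spec, tree = tree_spec)

# The same fit() and predict() calls work for every specification.
fits <- lapply(specs, fit, formula = mpg ~ wt + hp, data = mtcars)
preds <- lapply(fits, predict, new_data = head(mtcars))

print(preds$lm)
print(preds$tree)
```

Both predictions come back as a tibble with a `.pred` column, regardless of what the underlying package would return on its own.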
The recipe, model specification, pre-processing, and post-processing definitions can be bundled together in a workflow. The workflow package offers coordination and synchronization.
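A minimal sketch of bundling a recipe and a model specification into a workflow is shown here. The data (mtcars), the single normalization step, and the object names are assumptions for illustration, not the definitions used to produce the results in this post.

```r
library(recipes)
library(parsnip)
library(workflows)

rec <- recipe(mpg ~ ., data = mtcars) |>
  step_normalize(all_numeric_predictors())

lm_spec <- linear_reg() |>
  set_engine("lm")

# The workflow carries the preprocessing and the model together.
wflow <- workflow() |>
  add_recipe(rec) |>
  add_model(lm_spec)

# fit() preps the recipe and fits the model in one call;
# predict() applies the same preprocessing to new data.
wflow_fit <- fit(wflow, data = mtcars)
preds <- predict(wflow_fit, new_data = head(mtcars))

print(preds)
```

Because the recipe travels with the model, there is no separate step where the preprocessing could fall out of sync with the fitted model.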
  wflow_id    info             option    result
1 simple_lm   <tibble [1 × 4]> <opts[0]> <list [0]>
2 simple_stan <tibble [1 × 4]> <opts[0]> <list [0]>
3 simple_rf   <tibble [1 × 4]> <opts[0]> <list [0]>
4 filter_lm   <tibble [1 × 4]> <opts[0]> <list [0]>
5 filter_stan <tibble [1 × 4]> <opts[0]> <list [0]>
6 filter_rf   <tibble [1 × 4]> <opts[0]> <list [0]>
7 pca_lm      <tibble [1 × 4]> <opts[0]> <list [0]>
8 pca_stan    <tibble [1 × 4]> <opts[0]> <list [0]>
9 pca_rf      <tibble [1 × 4]> <opts[0]> <list [0]>
  wflow_id    info             option    result
1 simple_lm   <tibble [1 × 4]> <opts[3]> <rsmp[+]>
2 simple_stan <tibble [1 × 4]> <opts[3]> <rsmp[+]>
3 simple_rf   <tibble [1 × 4]> <opts[3]> <rsmp[+]>
4 filter_lm   <tibble [1 × 4]> <opts[3]> <tune[+]>
5 filter_stan <tibble [1 × 4]> <opts[3]> <tune[+]>
6 filter_rf   <tibble [1 × 4]> <opts[3]> <tune[+]>
7 pca_lm      <tibble [1 × 4]> <opts[3]> <tune[+]>
8 pca_stan    <tibble [1 × 4]> <opts[3]> <tune[+]>
9 pca_rf      <tibble [1 × 4]> <opts[3]> <tune[+]>
   rank  mean model       wflow_id    .config
1     1  4.36 rand_forest simple_rf   Preprocessor1_Model1
2     2  4.38 rand_forest filter_rf   Preprocessor09_Model1
3     3  4.51 linear_reg  simple_stan Preprocessor1_Model1
4     4  4.51 linear_reg  simple_lm   Preprocessor1_Model1
5     5  4.51 linear_reg  filter_lm   Preprocessor09_Model1
6     6  4.51 linear_reg  filter_stan Preprocessor09_Model1
7     7  4.68 rand_forest pca_rf      Preprocessor1_Model1
8     8  5.56 linear_reg  pca_stan    Preprocessor1_Model1
9     9  5.56 linear_reg  pca_lm      Preprocessor1_Model1
   splits           id     .metrics         .notes
 1 <split [358/41]> Fold01 <tibble [1 × 4]> <tibble [0 × 3]>
 2 <split [358/41]> Fold02 <tibble [1 × 4]> <tibble [0 × 3]>
 3 <split [358/41]> Fold03 <tibble [1 × 4]> <tibble [0 × 3]>
 4 <split [358/41]> Fold04 <tibble [1 × 4]> <tibble [0 × 3]>
 5 <split [358/41]> Fold05 <tibble [1 × 4]> <tibble [0 × 3]>
 6 <split [359/40]> Fold06 <tibble [1 × 4]> <tibble [0 × 3]>
 7 <split [360/39]> Fold07 <tibble [1 × 4]> <tibble [0 × 3]>
 8 <split [360/39]> Fold08 <tibble [1 × 4]> <tibble [0 × 3]>
 9 <split [361/38]> Fold09 <tibble [1 × 4]> <tibble [0 × 3]>
10 <split [361/38]> Fold10 <tibble [1 × 4]> <tibble [0 × 3]>
simple_rf

  .config
1 Preprocessor1_Model1

# A tibble: 2 × 4
  .metric .estimator .estimate .config
1 rmse    standard       5.58  Preprocessor1_Model1
2 rsq     standard       0.905 Preprocessor1_Model1
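Output of this shape (a set of named workflows, resampled results, and a ranking) can be produced with the workflowsets package. The sketch below is not the code behind the results above: it assumes the built-in mtcars data, two illustrative recipes, a linear model and an rpart decision tree in place of the Stan and random forest models, five folds, and plain resampling with `fit_resamples` rather than tuning.

```r
library(recipes)
library(parsnip)
library(rsample)
library(tune)
library(yardstick)
library(workflowsets)

set.seed(1234)

# Two preprocessing options (hypothetical, for illustration).
simple_rec <- recipe(mpg ~ ., data = mtcars)
pca_rec <- simple_rec |>
  step_normalize(all_numeric_predictors()) |>
  step_pca(all_numeric_predictors(), num_comp = 3)

# Two model specifications.
lm_spec <- linear_reg() |> set_engine("lm")
tree_spec <- decision_tree() |> set_engine("rpart") |> set_mode("regression")

# Cross every recipe with every model: ids become "<recipe>_<model>".
wf_set <- workflow_set(
  preproc = list(simple = simple_rec, pca = pca_rec),
  models = list(lm = lm_spec, tree = tree_spec)
)

folds <- vfold_cv(mtcars, v = 5)

# Run the same resampling procedure over every workflow in the set.
results <- workflow_map(
  wf_set,
  fn = "fit_resamples",
  resamples = folds,
  metrics = metric_set(rmse),
  seed = 1234
)

# Rank the workflows by their mean cross-validated RMSE.
ranked <- rank_results(results, rank_metric = "rmse")
print(ranked[, c("rank", "mean", "model", "wflow_id", ".config")])
```

Adding another recipe or model to the lists is enough to extend the comparison, which is where the opinionated structure starts to pay off.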
With opinionated systems, the design decisions are made upfront when you select the framework, and the user has less flexibility on the backend. With non-opinionated systems, the user maintains optionality and flexibility throughout the process.
If flexibility or visibility into the modeling process is important, then using the individual models directly might be more effective. Furthermore, the abstraction in the Tidymodels framework may not be a good option for educational purposes or the execution of a single use case. If you need to use specific parameters for the model execution, then the parsnip abstraction might be an impediment.
However, for broader problems that require exploring different models or model specifications, the benefits of the Tidymodels framework may make it a good option.