# Global knitr chunk options for this post
knitr::opts_chunk$set(echo = FALSE,
                      warning = FALSE,
                      message = FALSE,
                      eval = TRUE,
                      results = "asis",
                      fig.height = 6,
                      fig.width = 8)

set.seed(1234)

Introduction

The Tidymodels framework is a collection of modeling and machine learning packages built on Tidyverse principles. In contrast to the flexibility of R, Tidymodels is an opinionated system with an underlying solution philosophy: it goes beyond exposing modeling capabilities and dictates a solution methodology and workflow for problem-solving. For this blog post, I wanted to explore the implications of the Tidymodels approach on the development effort.

Opinionated systems such as Tidymodels have several benefits, including:

  • Consistency - Opinionated systems promote a consistent mental model, approach, and workflow for solving problems.
  • Encapsulation of Best Practices - Opinionated frameworks provide guardrails that inherently guide developers toward best practices. Furthermore, the execution model enforces a workflow and approach to problem-solving.
  • Faster Development - Reducing the number of upfront decisions and providing framework support for everyday modeling tasks accelerates the development process. Individual developers and teams do not need to write the plumbing, connectivity, or boilerplate code for basic functions.

In the negative column, the drawbacks of opinionated systems include:

  • Required Buy-In - To use the Tidymodels framework, you must buy into the authors’ decisions and problem framing. Using individual elements of the framework without adopting the entire solution is difficult.
  • Hidden Decisions - Standardization of the interfaces to models simplifies the execution but involves some decisions regarding default values.

Tidymodels

The Tidymodels framework is a collection of modeling and machine learning packages. The core packages that make up the Tidymodels universe include rsample, parsnip, recipes, workflows, tune, yardstick, broom, and dials. For this blog post, I will explore how these packages impact data preparation, model definition, and the model execution workflow.

Tidy Data

The Tidyverse advocates for Tidy Data, a consistent representation of the model data. The data preparation step transforms data into a consistent model that adheres to the following rules: a) Each variable must have its own column. b) Each observation must have its own row. c) Each value must have its own cell.

Uniform data that conforms to the tidy data rules is more consistent and easier to work with. Furthermore, these rules enable efficient manipulation with Tidyverse tools such as dplyr or ggplot2.
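As a small illustration of these rules, the sketch below reshapes a made-up wide table, where the season is spread across the column names, into tidy form with tidyr::pivot_longer(). The team IDs and win counts are invented for the example and are not taken from the data used in this post.

```r
library(tidyr)

# Hypothetical untidy table: the "year" variable lives in the column names,
# so each row holds two observations instead of one.
wide <- data.frame(
  teamID = c("AAA", "BBB"),
  `2020` = c(26, 33),
  `2021` = c(77, 92),
  check.names = FALSE
)

# Reshape so that each row is a single team-season observation with
# one column per variable (teamID, yearID, W).
tidy <- pivot_longer(
  wide,
  cols = -teamID,
  names_to = "yearID",
  values_to = "W"
)

tidy
```

Once the data is in this shape, a dplyr group_by() or a ggplot2 aesthetic mapping can refer to yearID and W directly as columns.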

Data summary

Name                      Piped data
Number of rows            399
Number of columns         21
_______________________
Column type frequency:
  character               5
  factor                  1
  numeric                 15
________________________
Group variables           None

Variable type: character

skim_variable  n_missing  complete_rate  min  max  empty  n_unique  whitespace
lgID                   8           0.98    2    2      0         6           0
teamID                 0           1.00    3    3      0        74           0
franchID               0           1.00    3    3      0        54           0
name                   0           1.00   11   29      0        76           0
divID                  0           1.00    0    1    213         4           0

Variable type: factor

skim_variable  n_missing  complete_rate  ordered  n_unique  top_counts
era_cat                0              1  FALSE           3  196: 186, 190: 158, 190: 55

Variable type: numeric

skim_variable  n_missing  complete_rate     mean      sd       p0      p25      p50      p75     p100  hist
yearID                 0              1  1956.70   44.56  1872.00  1916.50  1963.00  1997.50  2021.00  ▃▅▃▅▇
G                      0              1   149.14   25.39     9.00   153.00   157.00   162.00   164.00  ▁▁▁▁▇
W                      0              1    73.01   18.00     2.00    64.00    75.00    86.00   106.00  ▁▁▃▇▅
L                      0              1    75.26   17.45     7.00    67.00    77.00    87.00   112.00  ▁▁▃▇▂
R                      0              1   676.53  139.04    54.00   612.50   691.00   759.00  1131.00  ▁▁▇▇▁
H                      0              1  1332.62  239.76    91.00  1297.50  1386.00  1462.00  1724.00  ▁▁▁▇▇
X2B                    0              1   227.37   61.18    10.00   193.50   234.00   272.00   376.00  ▁▂▆▇▂
X3B                    0              1    47.43   23.38     3.00    28.00    43.00    65.00   108.00  ▃▇▅▃▂
HR                     0              1   100.82   64.72     0.00    38.00   100.00   154.00   279.00  ▇▆▆▃▁
RA                     0              1   685.12  138.85   140.00   605.50   701.00   778.00  1088.00  ▁▁▇▇▁
ER                     0              1   572.05  150.84    56.00   496.00   597.00   676.50   884.00  ▁▂▅▇▂
HA                     0              1  1336.93  240.40   148.00  1283.00  1396.00  1476.00  1689.00  ▁▁▁▅▇
HRA                    0              1   101.90   62.16     0.00    41.50   102.00   152.50   237.00  ▇▆▇▆▂
wPer                   0              1     0.49    0.09     0.09     0.43     0.49     0.55     0.81  ▁▂▇▆▁
pythPer                0              1     0.49    0.10     0.10     0.44     0.49     0.56     0.83  ▁▂▇▅▁

Approach to Building Models and Execution

In non-opinionated systems, there are usually several ways to accomplish a specific task, which empowers users to make their own decisions. Opinionated systems trade away some of that flexibility: they offer a predefined approach, or a small set of approaches, for accomplishing each task.

The Tidymodels model definition and model execution workflow is detailed below.

Recipe

The Tidymodels recipe is similar to the formula definition in the lm() function; however, it also allows for feature engineering, variable role definition, and inheritance. The recipe definition provides a compact, programmatic way to describe a collection of recipes in a single location.
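To give a feel for what that looks like, here is a minimal sketch using the built-in mtcars data rather than the data from this post. The recipe names echo the simple, filter, and pca labels that appear in the output later on, but the specific steps and thresholds are my own assumptions, not the definitions behind those results. The native pipe requires R 4.1 or later.

```r
library(recipes)

# Base recipe: the formula assigns outcome and predictor roles.
simple_rec <- recipe(mpg ~ ., data = mtcars) |>
  step_normalize(all_numeric_predictors())

# Further recipes extend the base one instead of repeating it.
# (step_corr and its threshold are illustrative choices.)
filter_rec <- simple_rec |>
  step_corr(all_numeric_predictors(), threshold = 0.9)

pca_rec <- simple_rec |>
  step_pca(all_numeric_predictors(), num_comp = 3)

# prep() estimates the steps from the data; bake() applies them.
baked <- bake(prep(pca_rec), new_data = NULL)
names(baked)
```

Because each recipe is an ordinary R object, the three variants live side by side and can be handed to any model specification later.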

Model Specification

The parsnip package provides a unified interface to the available models. This interface decouples the model definition from the syntactic details of the underlying package, so users can rapidly experiment with a range of models without getting bogged down in package-specific details. This level of abstraction reduces the learning curve required to run different models.
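A short sketch of that interface, again on mtcars: two specifications are declared the same way, and translate() prints the call template parsnip would pass to the engine, which is one way to surface the defaults chosen on your behalf. The engines and argument values below are illustrative assumptions; only the linear model is actually fit, so the ranger package is not exercised here.

```r
library(parsnip)

# Two specifications with the same interface but different engines.
lm_spec <- linear_reg() |>
  set_engine("lm")

rf_spec <- rand_forest(trees = 500) |>
  set_engine("ranger") |>
  set_mode("regression")

# Show the engine-level call template, including arguments that
# parsnip fills in for the user.
translate(rf_spec)

# Fitting and predicting use the same verbs regardless of engine.
lm_fit <- fit(lm_spec, mpg ~ wt + hp, data = mtcars)
preds <- predict(lm_fit, new_data = head(mtcars, 3))
preds
```

Swapping models then means changing the specification object, while the fit() and predict() calls stay the same.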

Workflow

The recipe, model specification, pre-processing, and post-processing definitions can be bundled together in a workflow. The workflows package coordinates these components so that they are fit and applied together as a single object.
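A minimal sketch of bundling a recipe and a model specification, assuming the built-in mtcars data and a plain linear model as stand-ins for the data and models used in this post:

```r
library(recipes)
library(parsnip)
library(workflows)

# Preprocessing and model are defined separately...
rec <- recipe(mpg ~ wt + hp + disp, data = mtcars) |>
  step_normalize(all_numeric_predictors())

spec <- linear_reg() |>
  set_engine("lm")

# ...and then bundled so they are always fit and applied together.
wf <- workflow() |>
  add_recipe(rec) |>
  add_model(spec)

wf_fit <- fit(wf, data = mtcars)

# predict() re-applies the trained recipe to new data before predicting.
wf_preds <- predict(wf_fit, new_data = head(mtcars, 3))
wf_preds
```

The practical benefit is that the preprocessing cannot drift out of sync with the model: new data passes through the same trained steps automatically.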

A workflow set/tibble: 9 × 4

  wflow_id     info              option     result
1 simple_lm    <tibble [1 × 4]>  <opts[0]>  <list [0]>
2 simple_stan  <tibble [1 × 4]>  <opts[0]>  <list [0]>
3 simple_rf    <tibble [1 × 4]>  <opts[0]>  <list [0]>
4 filter_lm    <tibble [1 × 4]>  <opts[0]>  <list [0]>
5 filter_stan  <tibble [1 × 4]>  <opts[0]>  <list [0]>
6 filter_rf    <tibble [1 × 4]>  <opts[0]>  <list [0]>
7 pca_lm       <tibble [1 × 4]>  <opts[0]>  <list [0]>
8 pca_stan     <tibble [1 × 4]>  <opts[0]>  <list [0]>
9 pca_rf       <tibble [1 × 4]>  <opts[0]>  <list [0]>
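A table like the one above, with one row per recipe and model pairing, is the kind of object the workflowsets package produces. The following is a reduced sketch with two recipes and two models on mtcars; the recipe contents and engines are assumptions for illustration, and nothing is fit at this stage.

```r
library(recipes)
library(parsnip)
library(workflowsets)

# Two illustrative recipes.
simple_rec <- recipe(mpg ~ ., data = mtcars)
pca_rec <- simple_rec |>
  step_normalize(all_numeric_predictors()) |>
  step_pca(all_numeric_predictors(), num_comp = 3)

# Two illustrative model specifications.
lm_spec <- linear_reg() |>
  set_engine("lm")
rf_spec <- rand_forest() |>
  set_engine("ranger") |>
  set_mode("regression")

# cross = TRUE pairs every recipe with every model; the list names
# are combined into the wflow_id column.
wf_set <- workflow_set(
  preproc = list(simple = simple_rec, pca = pca_rec),
  models  = list(lm = lm_spec, rf = rf_spec),
  cross   = TRUE
)

wf_set
```

The result column stays empty until the set is evaluated against resamples.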

Model Selection

10-fold cross-validation using stratification

A tibble: 10 × 2

   splits            id
 1 <split [358/41]>  Fold01
 2 <split [358/41]>  Fold02
 3 <split [358/41]>  Fold03
 4 <split [358/41]>  Fold04
 5 <split [358/41]>  Fold05
 6 <split [359/40]>  Fold06
 7 <split [360/39]>  Fold07
 8 <split [360/39]>  Fold08
 9 <split [361/38]>  Fold09
10 <split [361/38]>  Fold10
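Each split above lists the number of analysis rows and assessment rows, which together add up to the 399 rows in the data. A sketch of how such folds can be created with rsample, using simulated data as a stand-in (the data frame and the choice of stratification variable are assumptions):

```r
library(rsample)

set.seed(1234)

# Simulated stand-in data; not the data set used in this post.
dat <- data.frame(x = rnorm(200))
dat$y <- 2 * dat$x + rnorm(200)

# Stratifying on a numeric outcome bins it first, so every fold
# sees a similar spread of outcome values.
folds <- vfold_cv(dat, v = 10, strata = y)
folds

# Each split separates rows used for fitting from rows held out.
first <- folds$splits[[1]]
nrow(analysis(first))
nrow(assessment(first))
```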

A workflow set/tibble: 9 × 4

  wflow_id     info              option     result
1 simple_lm    <tibble [1 × 4]>  <opts[3]>  <rsmp[+]>
2 simple_stan  <tibble [1 × 4]>  <opts[3]>  <rsmp[+]>
3 simple_rf    <tibble [1 × 4]>  <opts[3]>  <rsmp[+]>
4 filter_lm    <tibble [1 × 4]>  <opts[3]>  <tune[+]>
5 filter_stan  <tibble [1 × 4]>  <opts[3]>  <tune[+]>
6 filter_rf    <tibble [1 × 4]>  <opts[3]>  <tune[+]>
7 pca_lm       <tibble [1 × 4]>  <opts[3]>  <tune[+]>
8 pca_stan     <tibble [1 × 4]>  <opts[3]>  <tune[+]>
9 pca_rf       <tibble [1 × 4]>  <opts[3]>  <tune[+]>

A tibble: 9 × 5

  rank  mean  model        wflow_id     .config
1    1  4.36  rand_forest  simple_rf    Preprocessor1_Model1
2    2  4.38  rand_forest  filter_rf    Preprocessor09_Model1
3    3  4.51  linear_reg   simple_stan  Preprocessor1_Model1
4    4  4.51  linear_reg   simple_lm    Preprocessor1_Model1
5    5  4.51  linear_reg   filter_lm    Preprocessor09_Model1
6    6  4.51  linear_reg   filter_stan  Preprocessor09_Model1
7    7  4.68  rand_forest  pca_rf       Preprocessor1_Model1
8    8  5.56  linear_reg   pca_stan     Preprocessor1_Model1
9    9  5.56  linear_reg   pca_lm       Preprocessor1_Model1
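The evaluate-then-rank pattern behind output like this can be sketched with workflow_map() and rank_results(). Everything specific here is an assumption for illustration: the data is simulated, the recipes are simplified, and a single rpart decision tree stands in for the random forest so the example runs without extra engine packages.

```r
library(recipes)
library(parsnip)
library(rsample)
library(workflowsets)
library(tune)
library(yardstick)

set.seed(1234)

# Simulated stand-in data.
dat <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
dat$y <- 3 * dat$x1 - dat$x2 + rnorm(200)

folds <- vfold_cv(dat, v = 5, strata = y)

simple_rec <- recipe(y ~ ., data = dat)
norm_rec <- simple_rec |>
  step_normalize(all_numeric_predictors())

lm_spec <- linear_reg() |>
  set_engine("lm")
tree_spec <- decision_tree() |>
  set_engine("rpart") |>
  set_mode("regression")

wf_set <- workflow_set(
  preproc = list(simple = simple_rec, norm = norm_rec),
  models  = list(lm = lm_spec, tree = tree_spec)
)

# Fit every workflow on every fold and record RMSE.
results <- workflow_map(
  wf_set,
  fn = "fit_resamples",
  resamples = folds,
  metrics = metric_set(rmse),
  seed = 1234
)

# One row per workflow, ordered by mean RMSE across folds.
ranked <- rank_results(results, rank_metric = "rmse", select_best = TRUE)
ranked[, c("rank", "mean", "model", "wflow_id", ".config")]
```

A single call evaluates every recipe and model pairing under identical resamples, which is where the framework saves the most boilerplate.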

Model Selection

Prediction

Resampling results

10-fold cross-validation using stratification

A tibble: 10 × 4

   splits            id      .metrics          .notes
 1 <split [358/41]>  Fold01  <tibble [1 × 4]>  <tibble [0 × 3]>
 2 <split [358/41]>  Fold02  <tibble [1 × 4]>  <tibble [0 × 3]>
 3 <split [358/41]>  Fold03  <tibble [1 × 4]>  <tibble [0 × 3]>
 4 <split [358/41]>  Fold04  <tibble [1 × 4]>  <tibble [0 × 3]>
 5 <split [358/41]>  Fold05  <tibble [1 × 4]>  <tibble [0 × 3]>
 6 <split [359/40]>  Fold06  <tibble [1 × 4]>  <tibble [0 × 3]>
 7 <split [360/39]>  Fold07  <tibble [1 × 4]>  <tibble [0 × 3]>
 8 <split [360/39]>  Fold08  <tibble [1 × 4]>  <tibble [0 × 3]>
 9 <split [361/38]>  Fold09  <tibble [1 × 4]>  <tibble [0 × 3]>
10 <split [361/38]>  Fold10  <tibble [1 × 4]>  <tibble [0 × 3]>

simple_rf

A tibble: 1 × 1

  .config
1 Preprocessor1_Model1

A tibble: 2 × 4

  .metric  .estimator  .estimate  .config
1 rmse     standard        5.58   Preprocessor1_Model1
2 rsq      standard        0.905  Preprocessor1_Model1
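A two-row metrics table of this shape (rmse and rsq) is what collect_metrics() returns after a final fit on held-out data. The sketch below shows one way to get there with last_fit(); the train/test split, the mtcars data, and the linear model are assumptions and are not necessarily how the numbers above were produced.

```r
library(rsample)
library(parsnip)
library(workflows)
library(tune)

set.seed(1234)

# Hold out a test set that plays no part in model selection.
split <- initial_split(mtcars, prop = 0.75)

# Stand-in for the selected workflow.
wf <- workflow() |>
  add_formula(mpg ~ wt + hp) |>
  add_model(linear_reg() |> set_engine("lm"))

# last_fit() trains on the training portion and evaluates once
# on the test portion.
final <- last_fit(wf, split)

final_metrics <- collect_metrics(final)
final_metrics
```

It is common for the test-set error to differ from the cross-validated estimate, since it comes from a single, smaller sample.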

Conclusion

With opinionated systems, the design decisions are made upfront when you select the framework, and the user has less flexibility later in the process. With non-opinionated systems, the user maintains optionality and flexibility throughout.

If flexibility or visibility into the modeling process is important, then using the underlying modeling packages directly might be more effective. Furthermore, the abstraction in the Tidymodels framework may not be a good fit for educational purposes or for a single use case. If you need to set specific parameters for the model execution, then the parsnip abstraction might be an impediment.

However, for broader problems that require exploring different models or model specifications, the benefits of the Tidymodels framework may make it a good option.