tidymodels
Gain an understanding of the fundamental components of the tidymodels ecosystem and appreciate the advantages of a consolidated modeling framework.
Develop the competence to independently undertake machine learning projects using R.
tidymodels is a “meta-package” for modeling and statistical analysis that shares the underlying design philosophy, grammar, and data structures of the tidyverse.
Developed by
Max Kuhn
Julia Silge
When you load the package, the versions of the attached packages and any conflicts are listed:
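A minimal sketch of the call that triggers this startup message, assuming the tidymodels package is installed. The call to tidymodels_packages() is included only to show which packages the meta-package bundles:

```r
# Attaching the meta-package prints the attached packages with their
# versions, followed by any function-name conflicts with other packages.
library(tidymodels)

# Character vector naming the packages that belong to the ecosystem.
bundled_packages <- tidymodels_packages()
bundled_packages
```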
I ❤️ R but…
My 🤯 with the idiosyncratic syntax developed for different model algorithms.
Dr. Raymond Balise Slides for BST 692-2022
Same model, different packages
The same issue persists if you try to implement the same model using alternative packages.
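As an illustration of the kind of inconsistency meant here, three classifiers each hand back class probabilities through a different predict() interface. The model choices and the toy data derived from mtcars are assumptions made for this sketch, not the examples from the original slides:

```r
library(rpart) # recommended package shipped with R
library(MASS)  # recommended package shipped with R

# Toy data (assumption): transmission type as a two-level factor outcome.
toy_df <- transform(mtcars, am = factor(am))

# glm: probabilities come back as a plain numeric vector via type = "response".
glm_fit  <- glm(am ~ wt + hp, data = toy_df, family = binomial)
glm_prob <- predict(glm_fit, newdata = toy_df, type = "response")

# rpart: probabilities come back as a matrix via type = "prob".
rpart_fit  <- rpart(am ~ wt + hp, data = toy_df, method = "class")
rpart_prob <- predict(rpart_fit, newdata = toy_df, type = "prob")[, "1"]

# MASS::lda: no type argument; probabilities sit in the $posterior element.
lda_fit  <- lda(am ~ wt + hp, data = toy_df)
lda_prob <- predict(lda_fit, newdata = toy_df)$posterior[, "1"]
```

Each call answers the same question (the probability that am is "1"), yet the argument names and return shapes all differ.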
tidymodels Consistency 😎
Through the parsnip package, we get a time-saving framework for exploring multiple models!
Example:
# Logistic Regression
logistic_reg_glm_spec <-
  logistic_reg() %>%
  set_engine('glm') %>%
  set_mode('classification')

# Decision Tree
decision_tree_rpart_spec <-
  decision_tree(
    tree_depth = tune(),
    min_n = tune(),
    cost_complexity = tune()
  ) %>%
  set_engine("rpart") %>%
  set_mode("classification")

# Bagged MARS Model (bag_mars() is provided by the baguette package)
bag_mars_earth_spec <-
  bag_mars() %>%
  set_engine('earth') %>%
  set_mode('classification')

# Naive Bayes (naive_Bayes() is provided by the discrim package)
naive_Bayes_naivebayes_spec <-
  naive_Bayes(smoothness = tune(), Laplace = tune()) %>%
  set_engine('naivebayes') %>%
  set_mode('classification')

# Random Forest
rand_forest_randomForest_spec <-
  rand_forest(mtry = tune(), min_n = tune()) %>%
  set_engine('randomForest') %>%
  set_mode('classification')
The real power of tidymodels is baked into the recipes package.
Binds a sequence of preprocessing steps to a training data set.
Defines the roles that the variables are to play in the design matrix.
Specifies what data cleaning needs to take place, and what feature engineering needs to happen.
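A small sketch of how roles might be assigned and inspected, assuming the recipes package is installed. The data frame toy_df and its column names are hypothetical and exist only for this illustration:

```r
library(recipes)

# Hypothetical toy data: an identifier, an outcome, and two predictors.
toy_df <- data.frame(
  student_id = 1:6,
  focus      = factor(c("no", "yes", "no", "yes", "no", "yes")),
  sleep      = c(5, 6, 7, 8, 6, 7),
  grade      = factor(c("9", "10", "9", "10", "9", "10"))
)

toy_recipe <-
  recipe(focus ~ ., data = toy_df) |>
  # Keep the identifier in the data but out of the design matrix.
  update_role(student_id, new_role = "id") |>
  # Feature engineering step: nominal predictors become indicator columns.
  step_dummy(all_nominal_predictors())

# One row per variable, showing the role each one plays.
role_tbl <- summary(toy_recipe)
role_tbl
```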
Are healthy behaviors, such as diet, sleep, physical activity, and hours of playing video games, associated with concentration in adolescents?
2019 Youth Risk Behavior Surveillance System
Data summary (skimr output):
── Data Summary ────────────────────────
Values
Name healthyBehaviors_df
Number of rows 13677
Number of columns 20
_______________________
Column type frequency:
character 3
factor 1
numeric 16
________________________
Group variables None
── Variable type: character ─────────────────────────────────────────────────────────
skim_variable n_missing complete_rate min max empty n_unique whitespace
1 Sex 151 0.989 4 6 0 2 0
2 Grade 151 0.989 1 2 0 4 0
3 SexOrientation 702 0.949 8 14 0 4 0
── Variable type: factor ────────────────────────────────────────────────────────────
skim_variable n_missing complete_rate ordered n_unique top_counts
1 DifficultyConcentrating 5237 0.617 FALSE 2 0: 5245, 1: 3195
── Variable type: numeric ───────────────────────────────────────────────────────────
skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
1 DrinkFruitJuice 1085 0.921 2.37 1.53 1 1 2 3 7 ▇▂▁▁▁
2 EatFruit 791 0.942 3.11 1.65 1 2 3 4 7 ▇▃▂▂▂
3 EatSalad 1779 0.870 1.97 1.23 1 1 2 2 7 ▇▁▁▁▁
4 EatPotatoes 1778 0.870 1.94 1.12 1 1 2 2 7 ▇▁▁▁▁
5 EatCarrots 1800 0.868 1.72 1.09 1 1 1 2 7 ▇▁▁▁▁
6 EatOtherVeggies 1830 0.866 2.66 1.44 1 2 2 3 7 ▇▃▂▁▁
7 DrinkSoda 2282 0.833 2.31 1.45 1 1 2 3 7 ▇▂▁▁▁
8 DrinkMilk 4188 0.694 2.64 1.64 1 1 2 4 7 ▇▂▂▁▁
9 EatBreakfast 2084 0.848 4.90 2.67 1 3 5 8 8 ▅▂▃▂▇
10 PhysicalActivity 457 0.967 4.69 2.52 1 2 5 7 8 ▇▃▆▃▇
11 HoursTV 881 0.936 2.96 1.81 1 1 3 4 7 ▇▂▃▂▂
12 HoursVideoGames 500 0.963 4.07 2.13 1 2 4 6 7 ▇▃▅▅▇
13 HoursSleep 572 0.958 3.44 1.38 1 2 4 4 7 ▇▇▇▅▂
14 SportsDrinks 4083 0.701 1.94 1.32 1 1 2 2 7 ▇▁▁▁▁
15 DrinksWater 3517 0.743 5.15 1.92 1 4 6 7 7 ▂▂▁▂▇
16 ConcussionSports 2128 0.844 1.25 0.715 1 1 1 1 5 ▇▁▁▁▁
| Variable | No, N = 5,245¹ | Yes, N = 3,195¹ | p-value² |
|---|---|---|---|
| Sex | | | <0.001 |
| Female | 2,328 (55%) | 1,934 (45%) | |
| Male | 2,880 (70%) | 1,229 (30%) | |
| Unknown | 37 | 32 | |
| Grade | | | 0.059 |
| 10 | 1,376 (60%) | 901 (40%) | |
| 11 | 1,232 (61%) | 775 (39%) | |
| 12 | 1,206 (64%) | 672 (36%) | |
| 9 | 1,394 (63%) | 819 (37%) | |
| Unknown | 37 | 28 | |
| SexOrientation | | | <0.001 |
| Bisexual | 236 (33%) | 473 (67%) | |
| Gay or Lesbian | 82 (42%) | 114 (58%) | |
| Heterosexual | 4,483 (67%) | 2,219 (33%) | |
| Not sure | 158 (47%) | 180 (53%) | |
| Unknown | 286 | 209 | |

Data source: MLearnYRBSS::healthyBehaviors
¹ n (%)
² Pearson’s Chi-squared test
Let’s explore the relationship between difficulty concentrating, diet, sleep, physical activity and hours of playing video games.
HOWEVER
Design matrices do not always come in the required format:
healthy_recipe <-
  recipe(formula = DifficultyConcentrating ~ ., data = healthyBehaviors_df) |>
  step_zv(all_predictors()) |>
  step_impute_mode(all_nominal_predictors()) |>
  step_impute_mean(all_numeric_predictors()) |>
  step_corr(all_numeric_predictors(), threshold = 0.7) |>
  step_dummy(all_nominal_predictors())
healthy_recipe
── Recipe ───────────────────────────────────────────────────────────────────────────
── Inputs
Number of variables by role
outcome: 1
predictor: 19
── Operations
• Zero variance filter on: all_predictors()
• Mode imputation for: all_nominal_predictors()
• Mean imputation for: all_numeric_predictors()
• Correlation filter on: all_numeric_predictors()
• Dummy variables from: all_nominal_predictors()
The recipe has only sketched a blueprint of what R is supposed to do with your data. You have NOT performed any actual pre-processing yet.
Imperative Programming
Declarative Programming
This step is crucial!
You have to check your data after the recipe to make sure the transformations look alright.
# A tibble: 13,677 × 24
DrinkFruitJuice EatFruit EatSalad EatPotatoes EatCarrots EatOtherVeggies
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2 3 1 1 1 2
2 2 7 6 5 5 6
3 1 4 2 2 2 5
4 2 2 1 2 2 3
5 4 5 1 2 3 4
6 2 1 2 1 1 2
7 1 4 1 2 1 2
8 4 2 2 1 2 3
9 2 3 3 1 1 3
10 2 2 1 1 1 3
# ℹ 13,667 more rows
# ℹ 18 more variables: DrinkSoda <dbl>, DrinkMilk <dbl>, EatBreakfast <dbl>,
# PhysicalActivity <dbl>, HoursTV <dbl>, HoursVideoGames <dbl>,
# HoursSleep <dbl>, SportsDrinks <dbl>, DrinksWater <dbl>,
# ConcussionSports <dbl>, DifficultyConcentrating <fct>, Sex_Male <dbl>,
# Grade_X11 <dbl>, Grade_X12 <dbl>, Grade_X9 <dbl>,
# SexOrientation_Gay.or.Lesbian <dbl>, SexOrientation_Heterosexual <dbl>, …
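A processed table like the one above only exists once the recipe has been estimated and applied. A minimal sketch of those two steps, assuming the recipes package is installed and using a small hypothetical data frame in place of healthyBehaviors_df:

```r
library(recipes)

# Hypothetical toy data with missing values in both predictor types.
toy_df <- data.frame(
  outcome = factor(c("0", "1", "0", "1", "0", "1")),
  hours   = c(1, NA, 3, 4, 5, 6),
  group   = factor(c("a", "b", NA, "a", "b", "a"))
)

# The blueprint: nothing is computed at this point.
toy_recipe <-
  recipe(outcome ~ ., data = toy_df) |>
  step_impute_mode(all_nominal_predictors()) |>
  step_impute_mean(all_numeric_predictors()) |>
  step_dummy(all_nominal_predictors())

# prep() estimates what each step needs (modes, means, factor levels)
# from the training data.
toy_prepped <- prep(toy_recipe, training = toy_df)

# bake() applies the estimated steps; new_data = NULL returns the
# processed training set so it can be inspected.
toy_baked <- bake(toy_prepped, new_data = NULL)
toy_baked
```

Inspecting toy_baked is the check the slide asks for: missing values should be gone and the nominal predictor should appear as an indicator column.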
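Recipes in ONE image
parsnip: Modeling and Analysis Functions
A model specification has three individual components:
Type: the kind of model to fit (e.g., logistic regression, decision tree, random forest).
Mode: the type of prediction task (classification or regression).
Engine: the computational tool used to fit the model in R, which usually corresponds to a certain modeling function (lm, glm), package (e.g., rpart, glmnet, randomForest), or computing framework (e.g., Stan, sparklyr).
Logistic Regression Model Specification (classification)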
Computational engine: glm
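A short sketch of how a specification like this might be built, fitted, and used for prediction, assuming the parsnip package is installed. The toy data derived from mtcars and the chosen predictors are assumptions for illustration only:

```r
library(parsnip)

# Toy data (assumption): transmission type as a two-level factor outcome.
toy_df <- transform(mtcars, am = factor(am))

# Type (logistic_reg), engine (glm), and mode (classification).
glm_spec <-
  logistic_reg() |>
  set_engine("glm") |>
  set_mode("classification")

# Printing the specification shows the model type, mode, and engine.
glm_spec

# The specification holds no data until fit() is called.
glm_fit <- fit(glm_spec, am ~ wt + hp, data = toy_df)

# Predictions come back as a tibble with one column per class.
glm_pred <- predict(glm_fit, new_data = toy_df, type = "prob")
glm_pred
```

Swapping the engine or the model function changes the underlying package while the fit() and predict() calls stay the same.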
glmnet_recipe <-
  recipe(formula = DifficultyConcentrating ~ ., data = healthyBehaviors_df) %>%
  step_zv(all_predictors()) %>%
  step_normalize(all_numeric_predictors())

glmnet_spec <-
  logistic_reg(penalty = tune(), mixture = tune()) %>%
  set_mode("classification") %>%
  set_engine("glmnet")

glmnet_workflow <-
  workflow() %>%
  add_recipe(glmnet_recipe) %>%
  add_model(glmnet_spec)

glmnet_grid <- tidyr::crossing(
  penalty = 10^seq(-6, -1, length.out = 20),
  mixture = c(0.05, 0.2, 0.4, 0.6, 0.8, 1)
)

# Template placeholder: replace stop(...) with your rsample object before running
glmnet_tune <-
  tune_grid(glmnet_workflow, resamples = stop("add your rsample object"), grid = glmnet_grid)
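The resamples placeholder in the template above needs an rsample object. A minimal sketch of how one might be supplied, assuming the tidymodels and glmnet packages are installed. The toy data derived from mtcars, the 4-fold split, and the small grid are all assumptions chosen to keep the example quick, not recommendations:

```r
library(tidymodels)

set.seed(692)

# Toy data (assumption): transmission type as a two-level factor outcome.
toy_df <- dplyr::mutate(mtcars, am = factor(am))

# The rsample object that fills the placeholder: 4-fold cross-validation.
toy_folds <- vfold_cv(toy_df, v = 4)

toy_recipe <-
  recipe(am ~ wt + hp + mpg, data = toy_df) |>
  step_normalize(all_numeric_predictors())

toy_spec <-
  logistic_reg(penalty = tune(), mixture = tune()) |>
  set_mode("classification") |>
  set_engine("glmnet")

toy_workflow <-
  workflow() |>
  add_recipe(toy_recipe) |>
  add_model(toy_spec)

# Deliberately small grid so the sketch runs in a few seconds.
toy_grid <- tidyr::crossing(
  penalty = 10^seq(-3, -1, length.out = 3),
  mixture = c(0.2, 1)
)

toy_tune <- tune_grid(toy_workflow, resamples = toy_folds, grid = toy_grid)

# Resampled performance for each penalty/mixture combination.
toy_metrics <- collect_metrics(toy_tune)
toy_metrics
```

With so few rows per fold, the resampled estimates are noisy. The point is only the mechanics of passing folds to tune_grid().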
parsnip in ONE image