from the previous section
library(tidymodels)
library(modeldatatoo)
taxi <- data_taxi() |>
  drop_na()
set.seed(123)
taxi_split <- initial_split(taxi, prop = 0.8, strata = tip)
taxi_split
<Training/Testing/Total>
<8000/2000/10000>
taxi_train <- training(taxi_split)
taxi_test <- testing(taxi_split)
How do you fit a linear model in R?
lm for linear model
glm for generalized linear model (e.g. logistic regression)
glmnet for regularized regression
keras for regression using TensorFlow
stan for Bayesian regression
spark for large data sets
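The first two entries already show how the interface changes from one fitting function to the next. A minimal sketch, assuming the built-in mtcars data (used here only for illustration; it is not part of the taxi example):

```r
# Ordinary linear regression with base R's lm()
lm_fit <- lm(mpg ~ wt, data = mtcars)

# Logistic regression with glm(); the family argument selects the model type
glm_fit <- glm(am ~ wt, data = mtcars, family = binomial)

# The two fits return objects of different classes
class(lm_fit)
class(glm_fit)

# For glm(), predicted probabilities require type = "response"
glm_probs <- predict(glm_fit, type = "response")
head(glm_probs)
```

Each function has its own arguments and prediction conventions, which is the inconsistency a unified model specification is meant to smooth over.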
To specify a model
Choose a model
Specify an engine
Set the mode
Model
logistic_reg()
Logistic Regression Model Specification (classification)
Computational engine: glm
Engine
logistic_reg() |>
  set_engine("glmnet")
Logistic Regression Model Specification (classification)
Computational engine: glmnet
logistic_reg() |>
  set_engine("stan")
Logistic Regression Model Specification (classification)
Computational engine: stan
Mode
decision_tree() |>
  set_mode("classification")
Decision Tree Model Specification (classification)
Computational engine: rpart
A complete model specification
decision_tree() |>              # model
  set_engine("rpart") |>        # engine
  set_mode("classification")    # mode
Decision Tree Model Specification (classification)
Computational engine: rpart
All available models are listed at https://www.tidymodels.org/find/parsnip/
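The engines available for a given model type can also be looked up from within R. A small sketch using parsnip's show_engines(); the rows returned depend on which parsnip extension packages are loaded, so no output is shown here:

```r
library(tidymodels)

# Engines parsnip currently knows about for logistic regression
logistic_engines <- show_engines("logistic_reg")
logistic_engines

# The same lookup for decision trees
tree_engines <- show_engines("decision_tree")
tree_engines
```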
A model workflow
Workflows handle new data, such as previously unseen factor levels, better than base R tools
You can use other preprocessors besides formulas (more on feature engineering in lesson 2)
They can help organize your work when working with multiple models
Most importantly, a workflow captures the entire modeling process: fit() and predict() apply to the preprocessing steps in addition to the actual model fit
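As a rough illustration of the last point, the sketch below pairs a recipe with a model in one workflow. The mtcars data, the car_* object names, and the choice of step_normalize() are assumptions made for this example and are not part of the taxi analysis:

```r
library(tidymodels)

# Illustrative data: mtcars with transmission type as a factor outcome
car_data <- mtcars |>
  mutate(am = factor(am, labels = c("automatic", "manual")))

# A recipe is an alternative preprocessor to a formula
car_rec <- recipe(am ~ wt + hp, data = car_data) |>
  step_normalize(all_numeric_predictors())

car_wflow <- workflow() |>
  add_recipe(car_rec) |>
  add_model(logistic_reg())

# fit() estimates the normalization statistics, then fits the model
car_fit <- fit(car_wflow, data = car_data)

# predict() applies the same normalization to new_data before predicting
car_preds <- predict(car_fit, new_data = car_data)
car_preds
```

Because the preprocessing travels with the model, there is no separate step to remember when predicting on new data.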
tree_spec <-
  decision_tree() |>
  set_mode("classification")
tree_spec |>
  fit(tip ~ ., data = taxi_train)
parsnip model object
n= 8000
node), split, n, loss, yval, (yprob)
* denotes terminal node
1) root 8000 616 yes (0.9230000 0.0770000) *
tree_spec <-
  decision_tree() |>
  set_mode("classification")
workflow() |>
  add_formula(tip ~ .) |>
  add_model(tree_spec) |>
  fit(data = taxi_train)
══ Workflow [trained] ══════════════════════════════════════════════════════════
Preprocessor: Formula
Model: decision_tree()
── Preprocessor ────────────────────────────────────────────────────────────────
tip ~ .
── Model ───────────────────────────────────────────────────────────────────────
n= 8000
node), split, n, loss, yval, (yprob)
* denotes terminal node
1) root 8000 616 yes (0.9230000 0.0770000) *
tree_spec <-
  decision_tree() |>
  set_mode("classification")
workflow(tip ~ ., tree_spec) |>
  fit(data = taxi_train)
══ Workflow [trained] ══════════════════════════════════════════════════════════
Preprocessor: Formula
Model: decision_tree()
── Preprocessor ────────────────────────────────────────────────────────────────
tip ~ .
── Model ───────────────────────────────────────────────────────────────────────
n= 8000
node), split, n, loss, yval, (yprob)
* denotes terminal node
1) root 8000 616 yes (0.9230000 0.0770000) *
Predict with your model
tree_spec <-
  decision_tree() |>
  set_mode("classification")
tree_fit <-
  workflow(tip ~ ., tree_spec) |>
  fit(data = taxi_train)
predict(tree_fit, new_data = taxi_test)
# A tibble: 2,000 × 1
.pred_class
<fct>
1 yes
2 yes
3 yes
4 yes
5 yes
6 yes
7 yes
8 yes
9 yes
10 yes
# ℹ 1,990 more rows
augment(tree_fit, new_data = taxi_test)
# A tibble: 2,000 × 10
tip distance company local dow month hour .pred_class .pred_yes .pred_no
<fct> <dbl> <fct> <fct> <fct> <fct> <int> <fct> <dbl> <dbl>
1 yes 20.7 Chicag… no Mon Apr 8 yes 0.923 0.077
2 yes 1.47 City S… no Tue Mar 14 yes 0.923 0.077
3 yes 1 Taxi A… no Mon Feb 18 yes 0.923 0.077
4 yes 1.91 Flash … no Wed Apr 15 yes 0.923 0.077
5 yes 17.2 City S… no Mon Apr 9 yes 0.923 0.077
6 yes 17.8 City S… no Mon Mar 9 yes 0.923 0.077
7 yes 0.53 Taxica… yes Wed Apr 8 yes 0.923 0.077
8 yes 1.77 other no Thu Apr 15 yes 0.923 0.077
9 yes 18.6 Flash … no Thu Apr 12 yes 0.923 0.077
10 no 1.13 other no Sat Feb 14 yes 0.923 0.077
# ℹ 1,990 more rows
The tidymodels prediction guarantee!
The predictions will always be inside a tibble
The column names and types are unsurprising and predictable
The number of rows in new_data and the output are the same
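These three properties can be checked directly. A minimal sketch, assuming the built-in mtcars data and a default logistic regression (both chosen for illustration only):

```r
library(tidymodels)

# Illustrative data: mtcars with transmission type as a factor outcome
car_data <- mtcars |>
  mutate(am = factor(am, labels = c("automatic", "manual")))

car_fit <- logistic_reg() |>
  fit(am ~ wt + hp, data = car_data)

class_preds <- predict(car_fit, new_data = car_data)
prob_preds <- predict(car_fit, new_data = car_data, type = "prob")

# Both results are tibbles with one row per row of new_data
nrow(class_preds) == nrow(car_data)
nrow(prob_preds) == nrow(car_data)

# Column names follow a fixed pattern
names(class_preds)  # ".pred_class"
names(prob_preds)   # ".pred_automatic" ".pred_manual"
```

The probability columns are named after the outcome's factor levels, which is why the taxi output above has .pred_yes and .pred_no.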
Understand your model
library(rpart.plot)
tree_fit |>
  extract_fit_engine() |>
  rpart.plot(roundint = FALSE)
You can extract_*() several components of your fitted workflow.
Note: Never predict() with any extracted components! They skip the preprocessing that the workflow applies.
You can use your fitted workflow for model and/or prediction explanations:
overall variable importance, such as with the vip package
flexible model explainers, such as with the DALEXtra package
Learn more at https://www.tmwr.org/explain.html
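For a quick look at variable importance without extra packages, the importance scores that rpart stores on its fitted object can be read from the extracted engine fit. This is a simpler stand-in for the vip package mentioned above, not a replacement for it. The sketch assumes the built-in iris data, because a tree with no splits (such as the taxi tree shown earlier) has no importance scores to report:

```r
library(tidymodels)

# Illustrative fit: a classification tree on the built-in iris data
iris_fit <-
  workflow(Species ~ ., decision_tree(mode = "classification")) |>
  fit(data = iris)

# Extract the underlying rpart object for inspection only
iris_tree <- extract_fit_engine(iris_fit)

# rpart stores an importance score for each variable it used
importance <- iris_tree$variable.importance
sort(importance, decreasing = TRUE)
```

The extracted object is used here only to inspect the model; predictions should still come from the fitted workflow.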
The whole game - status update