What Makes a Model?

Author

Jamal Rogers

Published

August 17, 2023

From the previous section

library(tidymodels)
library(modeldatatoo)

taxi <- data_taxi() |>
        drop_na()

set.seed(123)
taxi_split <- initial_split(taxi, prop = 0.8, strata = tip)
taxi_split

<Training/Testing/Total>
<8000/2000/10000>

taxi_train <- training(taxi_split)
taxi_test <- testing(taxi_split)

How do you fit a linear model in R?

lm for linear model
glm for generalized linear model (e.g. logistic regression)
glmnet for regularized regression
keras for regression using TensorFlow
stan for Bayesian regression
spark for large data sets
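As a rough illustration of why the choice matters, the first two options in the list already have different interfaces. The sketch below uses only base R and the built-in mtcars data, which is a stand-in and not part of the taxi example.

```r
# mtcars is a built-in data set, used here only as a stand-in
lm_fit <- lm(mpg ~ wt, data = mtcars)

# Logistic regression needs a different function plus a family argument
glm_fit <- glm(am ~ wt, data = mtcars, family = binomial)

# The prediction interfaces differ too: glm() returns log-odds
# unless type = "response" is requested
lm_pred  <- predict(lm_fit, newdata = head(mtcars))
glm_pred <- predict(glm_fit, newdata = head(mtcars), type = "response")

lm_pred
glm_pred
```

Each of the other tools in the list brings its own arguments and output formats in the same way.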

To specify a model

Choose a model
Specify an engine
Set the mode

Model

logistic_reg()

Logistic Regression Model Specification (classification)

Computational engine: glm

Engine

logistic_reg() |>
        set_engine("glmnet")

Logistic Regression Model Specification (classification)

Computational engine: glmnet

logistic_reg() |>
        set_engine("stan")

Logistic Regression Model Specification (classification)

Computational engine: stan

Mode

decision_tree() |>
        set_mode("classification")

Decision Tree Model Specification (classification)

Computational engine: rpart

A complete model specification

decision_tree() |>                  # model
        set_engine("rpart") |>      # engine
        set_mode("classification")  # mode

Decision Tree Model Specification (classification)

Computational engine: rpart

All available models are listed at https://www.tidymodels.org/find/parsnip/
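To check which engines and modes are available for a given model type from within R, one option is parsnip's show_engines(). This is a minimal sketch; the result only covers engines registered by the packages currently loaded, so it may be shorter than the list on the website.

```r
library(tidymodels)

# Engines (and the modes each supports) registered for a model type
logistic_engines <- show_engines("logistic_reg")
tree_engines     <- show_engines("decision_tree")

logistic_engines
tree_engines
```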

A model workflow

Workflows handle new data better than base R tools, particularly when the new data contain new factor levels
You can use other preprocessors besides formulas (more on feature engineering in lesson 2)
They help organize your work when you have multiple models
Most importantly, a workflow captures the entire modeling process: fit() and predict() apply to the preprocessing steps in addition to the actual model fit
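As a sketch of the last two points, the workflow below uses a recipe instead of a formula as the preprocessor. The data set (two_class_dat from the modeldata package) and the normalization step are assumptions chosen so the example runs without downloading the taxi data; they are not part of the original example.

```r
library(tidymodels)

# Stand-in data (assumption): a small two-class data set from modeldata
data("two_class_dat", package = "modeldata")

tree_spec <-
  decision_tree() |>
  set_mode("classification")

# A recipe is an alternative preprocessor to a formula
tree_rec <-
  recipe(Class ~ ., data = two_class_dat) |>
  step_normalize(all_numeric_predictors())

tree_wflow <-
  workflow() |>
  add_recipe(tree_rec) |>
  add_model(tree_spec)

# fit() estimates the recipe steps, then fits the model
tree_wflow_fit <- fit(tree_wflow, data = two_class_dat)

# predict() applies the trained preprocessing to new data before predicting
wflow_pred <- predict(tree_wflow_fit, new_data = head(two_class_dat))
wflow_pred
```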

tree_spec <-
  decision_tree() |> 
  set_mode("classification")

tree_spec |> 
  fit(tip ~ ., data = taxi_train)

parsnip model object

n= 8000 

node), split, n, loss, yval, (yprob)
      * denotes terminal node

1) root 8000 616 yes (0.9230000 0.0770000) *

tree_spec <-
  decision_tree() |> 
  set_mode("classification")

workflow() |>
  add_formula(tip ~ .) |>
  add_model(tree_spec) |>
  fit(data = taxi_train)

══ Workflow [trained] ══════════════════════════════════════════════════════════
Preprocessor: Formula
Model: decision_tree()

── Preprocessor ────────────────────────────────────────────────────────────────
tip ~ .

── Model ───────────────────────────────────────────────────────────────────────
n= 8000 

node), split, n, loss, yval, (yprob)
      * denotes terminal node

1) root 8000 616 yes (0.9230000 0.0770000) *

tree_spec <-
  decision_tree() |>
  set_mode("classification")

workflow(tip ~ ., tree_spec) |>
  fit(data = taxi_train)

══ Workflow [trained] ══════════════════════════════════════════════════════════
Preprocessor: Formula
Model: decision_tree()

── Preprocessor ────────────────────────────────────────────────────────────────
tip ~ .

── Model ───────────────────────────────────────────────────────────────────────
n= 8000 

node), split, n, loss, yval, (yprob)
      * denotes terminal node

1) root 8000 616 yes (0.9230000 0.0770000) *

Predict with your model

tree_spec <-
  decision_tree() |> 
  set_mode("classification")

tree_fit <-
  workflow(tip ~ ., tree_spec) |>
  fit(data = taxi_train)

predict(tree_fit, new_data = taxi_test)

# A tibble: 2,000 × 1
   .pred_class
   <fct>      
 1 yes        
 2 yes        
 3 yes        
 4 yes        
 5 yes        
 6 yes        
 7 yes        
 8 yes        
 9 yes        
10 yes        
# ℹ 1,990 more rows

augment(tree_fit, new_data = taxi_test)

# A tibble: 2,000 × 10
   tip   distance company local dow   month  hour .pred_class .pred_yes .pred_no
   <fct>    <dbl> <fct>   <fct> <fct> <fct> <int> <fct>           <dbl>    <dbl>
 1 yes      20.7  Chicag… no    Mon   Apr       8 yes             0.923    0.077
 2 yes       1.47 City S… no    Tue   Mar      14 yes             0.923    0.077
 3 yes       1    Taxi A… no    Mon   Feb      18 yes             0.923    0.077
 4 yes       1.91 Flash … no    Wed   Apr      15 yes             0.923    0.077
 5 yes      17.2  City S… no    Mon   Apr       9 yes             0.923    0.077
 6 yes      17.8  City S… no    Mon   Mar       9 yes             0.923    0.077
 7 yes       0.53 Taxica… yes   Wed   Apr       8 yes             0.923    0.077
 8 yes       1.77 other   no    Thu   Apr      15 yes             0.923    0.077
 9 yes      18.6  Flash … no    Thu   Apr      12 yes             0.923    0.077
10 no        1.13 other   no    Sat   Feb      14 yes             0.923    0.077
# ℹ 1,990 more rows

The tidymodels prediction guarantee!

The predictions will always be inside a tibble
The column names and types are unsurprising and predictable
The number of rows in new_data and the output are the same
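The three guarantees can be checked directly. This is a minimal sketch that again assumes the stand-in two_class_dat data from modeldata rather than the taxi data; the object names are illustrative.

```r
library(tidymodels)

# Stand-in data (assumption): two_class_dat from modeldata
data("two_class_dat", package = "modeldata")

set.seed(123)
dat_split <- initial_split(two_class_dat, prop = 0.8, strata = Class)
dat_train <- training(dat_split)
dat_test  <- testing(dat_split)

tree_fit <-
  workflow(Class ~ ., decision_tree(mode = "classification")) |>
  fit(data = dat_train)

class_pred <- predict(tree_fit, new_data = dat_test)
prob_pred  <- predict(tree_fit, new_data = dat_test, type = "prob")

# 1. The predictions are inside a tibble
inherits(class_pred, "tbl_df")

# 2. Column names follow a fixed pattern: .pred_class for hard class
#    predictions, one .pred_<level> column per class for probabilities
names(class_pred)
names(prob_pred)

# 3. One output row per row of new_data
nrow(class_pred) == nrow(dat_test)
```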

Understand your model

library(rpart.plot)

tree_fit |>
  extract_fit_engine() |>
  rpart.plot(roundint = FALSE)

You can extract_*() several components of your fitted workflow.

Note: Never predict() with any extracted components!
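A short sketch of a few of the extract_*() helpers, fitted on the stand-in two_class_dat data from modeldata (an assumption, not the taxi example). The extracted objects are meant for inspection; predicting from them would bypass whatever preprocessing the workflow applies.

```r
library(tidymodels)

# Stand-in data (assumption): two_class_dat from modeldata
data("two_class_dat", package = "modeldata")

tree_fit <-
  workflow(Class ~ ., decision_tree(mode = "classification")) |>
  fit(data = two_class_dat)

wf_preproc <- extract_preprocessor(tree_fit)  # the formula
wf_spec    <- extract_spec_parsnip(tree_fit)  # the model specification
wf_parsnip <- extract_fit_parsnip(tree_fit)   # the fitted parsnip model
wf_engine  <- extract_fit_engine(tree_fit)    # the underlying rpart object

class(wf_engine)

# For predictions, go back to the workflow itself
predict(tree_fit, new_data = head(two_class_dat))
```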

You can use your fitted workflow for model and/or prediction explanations:

overall variable importance, such as with the vip package
flexible model explainers, such as with the DALEXtra package
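For the first item, a minimal sketch with the vip package might look like the following. It assumes vip is installed and uses the stand-in two_class_dat data from modeldata; the importance scores are whatever rpart records for its splits, so a tree with no splits (like the root-only taxi tree above) would have none to report.

```r
library(tidymodels)
library(vip)

# Stand-in data (assumption): two_class_dat from modeldata
data("two_class_dat", package = "modeldata")

tree_fit <-
  workflow(Class ~ ., decision_tree(mode = "classification")) |>
  fit(data = two_class_dat)

# Model-based variable importance as a tibble
tree_vi <-
  tree_fit |>
  extract_fit_parsnip() |>
  vi()

tree_vi
```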

Learn more at https://www.tmwr.org/explain.html

The whole game - status update