What Makes a Model?

Author

Jamal Rogers

Published

August 17, 2023

From the previous section

library(tidymodels)
library(modeldatatoo)

taxi <- data_taxi() |>
        drop_na()

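set.seed(123)  # make the random split reproducible
# hold out 20% of the rows for testing, stratified on the outcome `tip`
taxi_split <- initial_split(taxi, prop = 0.8, strata = tip)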
taxi_split
<Training/Testing/Total>
<8000/2000/10000>
taxi_train <- training(taxi_split)
taxi_test <- testing(taxi_split)

How do you fit a linear model in R?

  • lm for linear model

  • glm for generalized linear model (e.g. logistic regression)

  • glmnet for regularized regression

  • keras for regression using TensorFlow

  • stan for Bayesian regression

  • spark for large data sets
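Each of the options above has its own interface, which is part of the motivation for a unified way to specify models. A minimal base R sketch of the first two, assuming the built-in mtcars data purely for illustration (it is not the taxi data used in the rest of this section):

```r
# Ordinary least squares with lm(): a formula plus a data frame
fit_lm <- lm(mpg ~ wt, data = mtcars)

# Logistic regression with glm(): same formula style, but the
# model family has to be supplied as an extra argument
fit_glm <- glm(am ~ wt, data = mtcars, family = binomial)

# predict() defaults differ: lm returns the response scale,
# glm returns the link scale unless type = "response" is given
pred_lm  <- predict(fit_lm,  newdata = mtcars[1:3, ])
pred_glm <- predict(fit_glm, newdata = mtcars[1:3, ], type = "response")
```

Even between these two closely related functions, the arguments and the prediction defaults are not the same.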

To specify a model

  • Choose a model

  • Specify an engine

  • Set the mode
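As a sketch of how the three steps map onto code, here they are applied to a plain linear regression. The object name lm_spec is made up for this illustration; nothing is estimated yet, since a specification only describes the model.

```r
library(parsnip)

lm_spec <-
  linear_reg() |>          # model: linear regression
  set_engine("lm") |>      # engine: base R's lm()
  set_mode("regression")   # mode: numeric outcome

# Printing the specification shows the model type, mode, and engine
lm_spec
```

Calling translate(lm_spec) should show the underlying engine call that parsnip would build from this specification.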

Model

logistic_reg()
Logistic Regression Model Specification (classification)

Computational engine: glm 

Engine

logistic_reg() |>
        set_engine("glmnet")
Logistic Regression Model Specification (classification)

Computational engine: glmnet 
logistic_reg() |>
        set_engine("stan")
Logistic Regression Model Specification (classification)

Computational engine: stan 
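
To check which engines can be passed to set_engine() for a given model type, parsnip provides show_engines(). A short sketch; the exact rows returned depend on the installed parsnip version and on any extension packages that are loaded, so no output is shown here.

```r
library(parsnip)

# Engines (and the modes they support) registered for logistic regression
logistic_engines <- show_engines("logistic_reg")
logistic_engines
```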

Mode

decision_tree() |>
        set_mode("classification")
Decision Tree Model Specification (classification)

Computational engine: rpart 

A complete model specification

decision_tree() |>                  # model
        set_engine("rpart") |>      # engine
        set_mode("classification")  # mode
Decision Tree Model Specification (classification)

Computational engine: rpart 

All available models are listed at https://www.tidymodels.org/find/parsnip/

A model workflow

  • Workflows handle new factor levels in new data better than base R tools do

  • You can use other preprocessors besides formulas (more on feature engineering in lesson 2)

  • They can help organize your work when working with multiple models

  • Most importantly, a workflow captures the entire modeling process: fit() and predict() apply to the preprocessing steps in addition to the actual model fit

tree_spec <-
  decision_tree() |> 
  set_mode("classification")

tree_spec |> 
  fit(tip ~ ., data = taxi_train) 
parsnip model object

n= 8000 

node), split, n, loss, yval, (yprob)
      * denotes terminal node

1) root 8000 616 yes (0.9230000 0.0770000) *

tree_spec <-
  decision_tree() |> 
  set_mode("classification")

workflow() |>
  add_formula(tip ~ .) |>
  add_model(tree_spec) |>
  fit(data = taxi_train) 
══ Workflow [trained] ══════════════════════════════════════════════════════════
Preprocessor: Formula
Model: decision_tree()

── Preprocessor ────────────────────────────────────────────────────────────────
tip ~ .

── Model ───────────────────────────────────────────────────────────────────────
n= 8000 

node), split, n, loss, yval, (yprob)
      * denotes terminal node

1) root 8000 616 yes (0.9230000 0.0770000) *

tree_spec <-
  decision_tree() |>
  set_mode("classification")

workflow(tip ~ ., tree_spec) |>
  fit(data = taxi_train) 
══ Workflow [trained] ══════════════════════════════════════════════════════════
Preprocessor: Formula
Model: decision_tree()

── Preprocessor ────────────────────────────────────────────────────────────────
tip ~ .

── Model ───────────────────────────────────────────────────────────────────────
n= 8000 

node), split, n, loss, yval, (yprob)
      * denotes terminal node

1) root 8000 616 yes (0.9230000 0.0770000) *

Predict with your model

tree_spec <-
  decision_tree() |> 
  set_mode("classification")

tree_fit <-
  workflow(tip ~ ., tree_spec) |>
  fit(data = taxi_train) 
predict(tree_fit, new_data = taxi_test)
# A tibble: 2,000 × 1
   .pred_class
   <fct>      
 1 yes        
 2 yes        
 3 yes        
 4 yes        
 5 yes        
 6 yes        
 7 yes        
 8 yes        
 9 yes        
10 yes        
# ℹ 1,990 more rows
augment(tree_fit, new_data = taxi_test)
# A tibble: 2,000 × 10
   tip   distance company local dow   month  hour .pred_class .pred_yes .pred_no
   <fct>    <dbl> <fct>   <fct> <fct> <fct> <int> <fct>           <dbl>    <dbl>
 1 yes      20.7  Chicag… no    Mon   Apr       8 yes             0.923    0.077
 2 yes       1.47 City S… no    Tue   Mar      14 yes             0.923    0.077
 3 yes       1    Taxi A… no    Mon   Feb      18 yes             0.923    0.077
 4 yes       1.91 Flash … no    Wed   Apr      15 yes             0.923    0.077
 5 yes      17.2  City S… no    Mon   Apr       9 yes             0.923    0.077
 6 yes      17.8  City S… no    Mon   Mar       9 yes             0.923    0.077
 7 yes       0.53 Taxica… yes   Wed   Apr       8 yes             0.923    0.077
 8 yes       1.77 other   no    Thu   Apr      15 yes             0.923    0.077
 9 yes      18.6  Flash … no    Thu   Apr      12 yes             0.923    0.077
10 no        1.13 other   no    Sat   Feb      14 yes             0.923    0.077
# ℹ 1,990 more rows

The tidymodels prediction guarantee!

  • The predictions will always be inside a tibble

  • The column names and types are unsurprising and predictable

  • The number of rows in new_data and the output are the same
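
The three points above can be checked directly. This is a minimal sketch that assumes the built-in mtcars data and a linear regression rather than the taxi tree, so that it runs on its own; lm_fit, new_cars, and preds are names invented for the example.

```r
library(parsnip)

# Fit a linear regression with the default "lm" engine
lm_fit <- linear_reg() |> fit(mpg ~ wt + hp, data = mtcars)

new_cars <- mtcars[1:5, ]
preds <- predict(lm_fit, new_data = new_cars)

inherits(preds, "tbl_df")       # predictions come back in a tibble
names(preds)                    # numeric predictions use the column .pred
nrow(preds) == nrow(new_cars)   # one output row per row of new_data
```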

Understand your model

library(rpart.plot)

tree_fit |>
  extract_fit_engine() |>
  rpart.plot(roundint = FALSE)

You can extract_*() several components of your fitted workflow.

Note: Never predict() with any extracted components!
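
A sketch of the extractors in use, again assuming mtcars and a linear regression so the example is self-contained (wf_fit is a hypothetical name). One reading of the note above is that an extracted engine fit knows nothing about the workflow's preprocessing, so predictions should go through the workflow itself.

```r
library(parsnip)
library(workflows)

wf_fit <-
  workflow(mpg ~ wt + hp, linear_reg()) |>
  fit(data = mtcars)

extract_preprocessor(wf_fit)   # the formula used as preprocessor
extract_fit_parsnip(wf_fit)    # the parsnip model fit
extract_fit_engine(wf_fit)     # the underlying lm object, for inspection only

# Predict with the fitted workflow, not with an extracted component
predict(wf_fit, new_data = mtcars[1:3, ])
```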

You can use your fitted workflow for model and/or prediction explanations:

  • overall variable importance, such as with the vip package

  • flexible model explainers, such as with the DALEXtra package

Learn more at https://www.tmwr.org/explain.html

The whole game - status update