Preliminaries

Author

Jamal Rogers

Published

August 16, 2023

Load the Packages

The following packages are used throughout Module 3 of this course.

# Install the packages if not available

library(bonsai)
library(doParallel)
library(finetune) 
library(lightgbm)
library(lme4)
library(plumber) 
library(probably)
library(ranger)
library(rpart)
library(rpart.plot)
library(stacks)
library(textrecipes)
library(tidymodels)
library(vetiver) 
library(remotes)
library(modeldatatoo) # install from GitHub using remotes::install_github("tidymodels/modeldatatoo")
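
The comment "install the packages if not available" can be made concrete with a short sketch like the one below. It assumes every package except modeldatatoo is installed from CRAN; missing_packages() is a hypothetical helper written for this illustration, not part of any of the packages above.

```r
# Packages from the list above, assumed here to come from CRAN
cran_pkgs <- c(
  "bonsai", "doParallel", "finetune", "lightgbm", "lme4", "plumber",
  "probably", "ranger", "rpart", "rpart.plot", "stacks", "textrecipes",
  "tidymodels", "vetiver", "remotes"
)

# Hypothetical helper: return the names in `pkgs` that are not installed
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Install only what is missing
to_install <- missing_packages(cran_pkgs)
if (length(to_install) > 0) {
  install.packages(to_install)
}

# modeldatatoo is installed from GitHub, as noted in the comment above
if (!requireNamespace("modeldatatoo", quietly = TRUE)) {
  remotes::install_github("tidymodels/modeldatatoo")
}
```

Checking with requireNamespace() first avoids reinstalling packages that are already present, so the chunk is safe to rerun.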

Data on Chicago taxi trips

For this lesson, we use the taxi data returned by data_taxi() in the modeldatatoo package, which is installed from GitHub.

  • The city of Chicago releases anonymized trip-level data on taxi trips in the city.
  • We pulled a sample of 10,000 rides occurring in early 2022.
  • Type ?modeldatatoo::data_taxi() to learn more about this dataset, including references.

Which of these variables can we use?

library(tidymodels)
library(modeldatatoo)

taxi <- data_taxi()

names(taxi)
[1] "tip"      "distance" "company"  "local"    "dow"      "month"    "hour"    

Checklist for predictors

  • Is it ethical to use this variable? (or even legal?)

  • Will this variable be available at prediction time?

  • Does this variable contribute to explainability?

Data on Chicago taxi trips

  • N = 10,000

  • A nominal outcome, tip, with levels “yes” and “no”

  • company, local, dow, and month are nominal predictors

  • distance and hour are numeric predictors

  • Use the glimpse() command to examine the data types

    glimpse(taxi)
    Rows: 10,000
    Columns: 7
    $ tip      <fct> yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, y…
    $ distance <dbl> 17.19, 0.88, 18.11, 20.70, 12.23, 0.94, 17.47, 17.67, 1.85, 1…
    $ company  <fct> Chicago Independents, City Service, other, Chicago Independen…
    $ local    <fct> no, yes, no, no, no, yes, no, no, no, no, no, no, no, yes, no…
    $ dow      <fct> Thu, Thu, Mon, Mon, Sun, Sat, Fri, Sun, Fri, Tue, Tue, Sun, W…
    $ month    <fct> Feb, Mar, Feb, Apr, Mar, Apr, Mar, Jan, Apr, Mar, Mar, Apr, A…
    $ hour     <int> 16, 8, 18, 8, 21, 23, 12, 6, 12, 14, 18, 11, 12, 19, 17, 13, …
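
Beyond the data types, it can help to see how often each level of a nominal column occurs, for example how the "yes" and "no" levels of tip are balanced. The following is a minimal sketch using dplyr (attached with tidymodels); level_summary() is a hypothetical helper introduced here for illustration, and no output is shown because it depends on the data.

```r
library(dplyr)

# Hypothetical helper: count and proportion for each level of one column
level_summary <- function(data, column) {
  data %>%
    count({{ column }}, name = "n") %>%   # one row per level, with its count
    mutate(prop = n / sum(n))             # share of all rows in that level
}

# Usage, assuming `taxi` was created with data_taxi() as shown earlier:
# level_summary(taxi, tip)
# level_summary(taxi, company)
```

The same call works for any of the nominal columns listed above, which makes it a quick way to spot rare levels before modeling.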