# Install the packages if not available
library(bonsai)
library(doParallel)
library(finetune)
library(lightgbm)
library(lme4)
library(plumber)
library(probably)
library(ranger)
library(rpart)
library(rpart.plot)
library(stacks)
library(textrecipes)
library(tidymodels)
library(vetiver)
library(remotes)
library(modeldatatoo) # install from GitHub using remotes::install_github("tidymodels/modeldatatoo")
Preliminaries
Load the Packages
These packages are used throughout Module 3 of this course.
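The comment at the top of the package list asks readers to install anything that is missing. The sketch below shows one way to check for missing packages before calling library(). It assumes every package except modeldatatoo is installed from CRAN. `missing_packages()` is a hypothetical helper name used only for illustration, and the install calls are left commented out so that nothing is installed by accident.

```r
# Packages loaded with library() in this module (modeldatatoo is handled separately)
cran_pkgs <- c(
  "bonsai", "doParallel", "finetune", "lightgbm", "lme4", "plumber",
  "probably", "ranger", "rpart", "rpart.plot", "stacks", "textrecipes",
  "tidymodels", "vetiver", "remotes"
)

# Hypothetical helper: return the names of packages that are not installed
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(cran_pkgs)
if (length(to_install) > 0) {
  message("Not installed: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)  # uncomment to install from CRAN
}

# modeldatatoo is installed from GitHub, as noted in the package list
if (!requireNamespace("modeldatatoo", quietly = TRUE)) {
  message("Not installed: modeldatatoo")
  # remotes::install_github("tidymodels/modeldatatoo")  # uncomment to install
}
```

Checking with requireNamespace() avoids reinstalling packages that are already present, which keeps repeated runs of the setup script fast.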
Data on Chicago taxi trips
For this lesson, we use the data_taxi dataset from the modeldatatoo package, which is available on GitHub.
- The city of Chicago releases anonymized trip-level data on taxi trips in the city.
- We pulled a sample of 10,000 rides occurring in early 2022.
- Type ?modeldatatoo::data_taxi() to learn more about this dataset, including references.
Which of these variables can we use?
library(tidymodels)
library(modeldatatoo)
taxi <- data_taxi()
names(taxi)
[1] "tip" "distance" "company" "local" "dow" "month" "hour"
Checklist for predictors
Is it ethical to use this variable? (or even legal?)
Will this variable be available at prediction time?
Does this variable contribute to explainability?
Data on Chicago taxi trips
N = 10,000
A nominal outcome, tip, with levels “yes” and “no”
company, local, dow, and month are nominal predictors
distance and hour are numeric predictors
Use the glimpse() function to examine the data types
glimpse(taxi)
Rows: 10,000
Columns: 7
$ tip      <fct> yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, y…
$ distance <dbl> 17.19, 0.88, 18.11, 20.70, 12.23, 0.94, 17.47, 17.67, 1.85, 1…
$ company  <fct> Chicago Independents, City Service, other, Chicago Independen…
$ local    <fct> no, yes, no, no, no, yes, no, no, no, no, no, no, no, yes, no…
$ dow      <fct> Thu, Thu, Mon, Mon, Sun, Sat, Fri, Sun, Fri, Tue, Tue, Sun, W…
$ month    <fct> Feb, Mar, Feb, Apr, Mar, Apr, Mar, Jan, Apr, Mar, Mar, Apr, A…
$ hour     <int> 16, 8, 18, 8, 21, 23, 12, 6, 12, 14, 18, 11, 12, 19, 17, 13, …
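The split into nominal and numeric columns can also be checked programmatically. The sketch below is a minimal base-R illustration, not part of the original lesson. To stay runnable without downloading the data, it uses `taxi_head`, a hypothetical three-row stand-in built from the first three values of each column in the glimpse() output. With the real data, the same checks can be applied to `taxi` directly.

```r
# Stand-in for the full dataset: the first three rides from the glimpse() output
taxi_head <- data.frame(
  tip      = factor(c("yes", "yes", "yes"), levels = c("yes", "no")),
  distance = c(17.19, 0.88, 18.11),
  company  = factor(c("Chicago Independents", "City Service", "other")),
  local    = factor(c("no", "yes", "no")),
  dow      = factor(c("Thu", "Thu", "Mon")),
  month    = factor(c("Feb", "Mar", "Feb")),
  hour     = c(16L, 8L, 18L)
)

# Factor columns are nominal; double and integer columns are numeric
is_nominal <- vapply(taxi_head, is.factor, logical(1))
is_numeric <- vapply(taxi_head, is.numeric, logical(1))

nominal_cols <- names(taxi_head)[is_nominal]
numeric_cols <- names(taxi_head)[is_numeric]

print(nominal_cols)
print(numeric_cols)
```

The outcome, tip, appears among the nominal columns because it is stored as a factor; the remaining nominal columns and both numeric columns are the candidate predictors.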