library(tidymodels)
library(modeldatatoo)
taxi <- data_taxi() |>
  drop_na()
set.seed(123)
taxi_split <- initial_split(taxi, prop = 0.8, strata = tip)
taxi_split
<Training/Testing/Total>
<8000/2000/10000>
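For machine learning, we typically split data into a training set, used to estimate model parameters, and a test set, used for an independent assessment of predictive performance.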
Note: Do not use the test set during training.
Spending too much data on training leaves too little to compute a good assessment of predictive performance.
Spending too much data on testing leaves too little to compute a good estimate of model parameters.
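The note above can be illustrated with a minimal sketch, assuming simulated data rather than the taxi data: the model is fit on the training set only, and the test set is used once, at the end, to estimate performance. The tibble `sim`, its columns `x` and `y`, and the choice of logistic regression are illustrative assumptions, not part of the original example.

```r
library(tidymodels)

set.seed(123)
# Simulated two-class data (hypothetical; stands in for a real data set)
n <- 1000
x <- rnorm(n)
sim <- tibble(
  x = x,
  y = factor(ifelse(x + rnorm(n) > 0, "yes", "no"), levels = c("yes", "no"))
)

sim_split <- initial_split(sim, prop = 0.8)
sim_train <- training(sim_split)
sim_test  <- testing(sim_split)

# Fit using the training set only
sim_fit <- logistic_reg() |>
  fit(y ~ x, data = sim_train)

# Use the test set once, at the end, to assess performance
test_acc <- augment(sim_fit, new_data = sim_test) |>
  accuracy(truth = y, estimate = .pred_class)
test_acc
```

Any tuning or model comparison would need to happen inside `sim_train`, so that the accuracy computed on `sim_test` remains an independent estimate.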
The initial_split() function is from the rsample package.
The prop argument sets the proportion of the data allocated to the training set; the remainder goes to the test set. Here, prop = 0.8 puts 8,000 of the 10,000 rows in training and 2,000 in testing.
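As a small sketch of how prop behaves, the split below uses a made-up data frame (`toy`, with hypothetical columns `id` and `value`) and checks how many rows land in each set.

```r
library(rsample)

set.seed(123)
# Hypothetical data: 1,000 rows
toy <- data.frame(id = 1:1000, value = rnorm(1000))

# prop is the proportion of rows assigned to the training set
toy_split <- initial_split(toy, prop = 0.8)

n_train <- nrow(training(toy_split))
n_test  <- nrow(testing(toy_split))

# Row counts and the realized training share
c(train = n_train, test = n_test, train_share = n_train / nrow(toy))
```

The training share should come out at, or very close to, the requested 0.8, and every row ends up in exactly one of the two sets.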
The strata argument stratifies the split on a chosen variable (here, tip), so that its distribution is similar in the training and test sets. Stratification often helps, with very little downside.
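To see what stratification does, the sketch below splits a made-up data frame with an imbalanced outcome and compares the class shares in the two sets. The data frame `toy` and its columns `outcome` and `value` are hypothetical and chosen only for illustration.

```r
library(rsample)

set.seed(123)
# Hypothetical imbalanced outcome: 80% "yes", 20% "no"
toy <- data.frame(
  outcome = factor(rep(c("yes", "no"), times = c(800, 200))),
  value   = rnorm(1000)
)

# Stratify the split on the outcome
strat_split <- initial_split(toy, prop = 0.8, strata = outcome)

# Share of each class in the training and test sets
train_share <- prop.table(table(training(strat_split)$outcome))
test_share  <- prop.table(table(testing(strat_split)$outcome))

round(rbind(train = train_share, test = test_share), 3)
```

With strata set, the class shares in both sets should stay close to those of the full data. Without it, they can drift apart by chance, which matters most for small or imbalanced data.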
taxi_train <- training(taxi_split)
taxi_test <- testing(taxi_split)
taxi_train
# A tibble: 8,000 × 7
   tip   distance company                      local dow   month  hour
   <fct>    <dbl> <fct>                        <fct> <fct> <fct> <int>
 1 yes      17.2  Chicago Independents         no    Thu   Feb      16
 2 yes       0.88 City Service                 yes   Thu   Mar       8
 3 yes      18.1  other                        no    Mon   Feb      18
 4 yes      12.2  Chicago Independents         no    Sun   Mar      21
 5 yes       0.94 Sun Taxi                     yes   Sat   Apr      23
 6 yes      17.5  Flash Cab                    no    Fri   Mar      12
 7 yes      17.7  other                        no    Sun   Jan       6
 8 yes       1.85 Taxicab Insurance Agency Llc no    Fri   Apr      12
 9 yes       0.53 Sun Taxi                     no    Tue   Mar      18
10 yes       6.65 Taxicab Insurance Agency Llc no    Sun   Apr      11
# ℹ 7,990 more rows
taxi_test
# A tibble: 2,000 × 7
   tip   distance company                      local dow   month  hour
   <fct>    <dbl> <fct>                        <fct> <fct> <fct> <int>
 1 yes      20.7  Chicago Independents         no    Mon   Apr       8
 2 yes       1.47 City Service                 no    Tue   Mar      14
 3 yes       1    Taxi Affiliation Services    no    Mon   Feb      18
 4 yes       1.91 Flash Cab                    no    Wed   Apr      15
 5 yes      17.2  City Service                 no    Mon   Apr       9
 6 yes      17.8  City Service                 no    Mon   Mar       9
 7 yes       0.53 Taxicab Insurance Agency Llc yes   Wed   Apr       8
 8 yes       1.77 other                        no    Thu   Apr      15
 9 yes      18.6  Flash Cab                    no    Thu   Apr      12
10 no        1.13 other                        no    Sat   Feb      14
# ℹ 1,990 more rows