Your Data Budget

Author

Jamal Rogers

Published

August 17, 2023

Data Splitting and Spending

For machine learning, we typically split data into training and test sets:

The training set is used to estimate model parameters.
The test set is used to obtain an independent assessment of model performance.

Note: Do not use the test set during training.

Spending too much data in training prevents us from computing a good assessment of predictive performance.
Spending too much data in testing prevents us from computing a good estimate of model parameters.
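The two roles can be sketched in base R. This is a minimal illustration, not part of the taxi analysis: the data are simulated, and the names sim, sim_train, sim_test, and test_rmse are made up for the example. The model is fit on the training rows only, and the test rows are used once, at the end, to measure performance.

```r
# Simulated data (hypothetical): y depends linearly on x plus noise
set.seed(42)
n <- 500
sim <- data.frame(x = runif(n))
sim$y <- 2 + 3 * sim$x + rnorm(n, sd = 0.5)

# Spend 80% of the rows on training and hold out the rest for testing
train_rows <- sample(seq_len(n), size = floor(0.8 * n))
sim_train <- sim[train_rows, ]
sim_test  <- sim[-train_rows, ]

# Estimate model parameters using the training set only
fit <- lm(y ~ x, data = sim_train)

# Assess performance on rows the model has never seen
test_pred <- predict(fit, newdata = sim_test)
test_rmse <- sqrt(mean((sim_test$y - test_pred)^2))
test_rmse
```

Changing the 0.8 moves data from one purpose to the other: a larger value gives lm() more rows to learn from but leaves fewer rows behind the test RMSE.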

The initial split

The initial_split() function is from the rsample package.

The prop argument sets the proportion of the data allocated to the training set; the remainder goes to the test set.
The strata argument stratifies the split on a chosen variable, so that the variable's distribution is similar in the training and test sets. Stratification often helps, with very little downside.

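library(tidymodels)    # attaches rsample, tidyr, and the other core packages
library(modeldatatoo)  # provides data_taxi()

# Load the Chicago taxi data and drop rows with missing values
taxi <- data_taxi() |>
  drop_na()

# Fix the random seed so the split is reproducible
set.seed(123)
taxi_split <- initial_split(taxi, prop = 0.8, strata = tip)
taxi_split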

<Training/Testing/Total>
<8000/2000/10000>
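The effect of prop and strata can also be checked on a small made-up data set. The sketch below assumes a hypothetical data frame toy with a 70/30 class balance; training() and testing() extract the two sets from the split object.

```r
library(rsample)

# Hypothetical toy data: 1,000 rows, 70% "yes" and 30% "no"
toy <- data.frame(
  outcome = factor(rep(c("yes", "no"), times = c(700, 300))),
  x = seq_len(1000)
)

set.seed(123)
toy_split <- initial_split(toy, prop = 0.8, strata = outcome)

# prop = 0.8 sends roughly 80% of the rows to the training set
nrow(training(toy_split))
nrow(testing(toy_split))

# With strata = outcome, the class proportions should be nearly
# identical in the two sets
train_props <- prop.table(table(training(toy_split)$outcome))
test_props  <- prop.table(table(testing(toy_split)$outcome))
train_props
test_props
```

Without strata, the class proportions in a small test set can drift from those in the full data purely by chance, which matters most when one class is rare.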

Accessing the data

taxi_train <- training(taxi_split)
taxi_test <- testing(taxi_split)
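A quick sanity check on training() and testing() is to confirm that the two sets do not overlap and that together they cover the original data. The sketch below does this on a hypothetical data frame toy that carries an explicit row_id column, added here only so rows can be traced after the split.

```r
library(rsample)

# Hypothetical toy data with an explicit row identifier
set.seed(3)
toy <- data.frame(row_id = 1:200, x = rnorm(200))

toy_split <- initial_split(toy, prop = 0.75)
toy_train <- training(toy_split)
toy_test  <- testing(toy_split)

# No row should appear in both sets
shared_rows <- intersect(toy_train$row_id, toy_test$row_id)
length(shared_rows)

# Together the two sets should contain every original row
covers_all <- setequal(c(toy_train$row_id, toy_test$row_id), toy$row_id)
covers_all
```

The same idea would apply to taxi_train and taxi_test, given a row identifier added to taxi before splitting.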

The training and test sets

taxi_train

# A tibble: 8,000 × 7
   tip   distance company                      local dow   month  hour
   <fct>    <dbl> <fct>                        <fct> <fct> <fct> <int>
 1 yes      17.2  Chicago Independents         no    Thu   Feb      16
 2 yes       0.88 City Service                 yes   Thu   Mar       8
 3 yes      18.1  other                        no    Mon   Feb      18
 4 yes      12.2  Chicago Independents         no    Sun   Mar      21
 5 yes       0.94 Sun Taxi                     yes   Sat   Apr      23
 6 yes      17.5  Flash Cab                    no    Fri   Mar      12
 7 yes      17.7  other                        no    Sun   Jan       6
 8 yes       1.85 Taxicab Insurance Agency Llc no    Fri   Apr      12
 9 yes       0.53 Sun Taxi                     no    Tue   Mar      18
10 yes       6.65 Taxicab Insurance Agency Llc no    Sun   Apr      11
# ℹ 7,990 more rows

taxi_test

# A tibble: 2,000 × 7
   tip   distance company                      local dow   month  hour
   <fct>    <dbl> <fct>                        <fct> <fct> <fct> <int>
 1 yes      20.7  Chicago Independents         no    Mon   Apr       8
 2 yes       1.47 City Service                 no    Tue   Mar      14
 3 yes       1    Taxi Affiliation Services    no    Mon   Feb      18
 4 yes       1.91 Flash Cab                    no    Wed   Apr      15
 5 yes      17.2  City Service                 no    Mon   Apr       9
 6 yes      17.8  City Service                 no    Mon   Mar       9
 7 yes       0.53 Taxicab Insurance Agency Llc yes   Wed   Apr       8
 8 yes       1.77 other                        no    Thu   Apr      15
 9 yes      18.6  Flash Cab                    no    Thu   Apr      12
10 no        1.13 other                        no    Sat   Feb      14
# ℹ 1,990 more rows

The whole game - status update