Your Data Budget

Author

Jamal Rogers

Published

August 17, 2023

Data Splitting and Spending

For machine learning, we typically split data into training and test sets:

  • The training set is used to estimate model parameters.
  • The test set is used to obtain an independent assessment of model performance.

Note: Do not use the test set during training.

  • Spending too much data on training leaves too little for testing, so the assessment of predictive performance becomes unreliable.

  • Spending too much data on testing leaves too little for training, so the model parameters are estimated poorly.
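
To put a rough number on the first point, here is a minimal sketch of how the test-set size limits the precision of an accuracy estimate. It assumes a simple binomial model for the number of correct predictions; the accuracy value of 0.9 and the test-set sizes are hypothetical, chosen only for illustration.

```r
# Standard error of an estimated accuracy under a binomial model:
# sqrt(p * (1 - p) / n), where n is the number of test rows.
accuracy  <- 0.9                 # hypothetical true accuracy
test_rows <- c(100, 400, 1600)   # hypothetical test-set sizes

std_error <- sqrt(accuracy * (1 - accuracy) / test_rows)

se_table <- data.frame(
  test_rows     = test_rows,
  std_error     = std_error,
  ci_half_width = 1.96 * std_error   # approximate 95% interval half-width
)
se_table
```

Under these assumptions the standard error is 0.03 with 100 test rows and halves each time the test set is quadrupled, so a very small test set gives only a coarse picture of performance.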

The initial split

The initial_split() function is from the rsample package.

  • The prop argument sets the proportion of the data allocated to the training set; the remainder goes to the test set. The default is 3/4.

  • The strata argument stratifies the split on a chosen variable, so that the training and test sets have similar distributions of that variable. Stratification often helps, with very little downside.

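library(tidymodels)    # attaches rsample, tidyr, dplyr, and friends
library(modeldatatoo)  # provides data_taxi()

# Load the taxi data and drop rows with missing values
taxi <- data_taxi() |>
        drop_na()

# Fix the random seed so the split is reproducible
set.seed(123)
# 80% training, 20% testing, stratified by the outcome tip
taxi_split <- initial_split(taxi, prop = 0.8, strata = tip)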
taxi_split
<Training/Testing/Total>
<8000/2000/10000>
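
To see what strata changes, here is a minimal sketch that compares an unstratified and a stratified split on simulated data. The data frame toy, its columns, and the helper class_share() are hypothetical and are not part of the taxi data; only initial_split(), training(), and testing() come from rsample.

```r
library(rsample)

set.seed(123)
# Hypothetical toy data: a two-class outcome with roughly 20% "no"
toy <- data.frame(
  tip = factor(sample(c("yes", "no"), size = 1000, replace = TRUE,
                      prob = c(0.8, 0.2))),
  distance = runif(1000, min = 0.5, max = 20)
)

plain_split <- initial_split(toy, prop = 0.8)                # no stratification
strat_split <- initial_split(toy, prop = 0.8, strata = tip)  # stratified on tip

# Share of "no" in the full data, the training set, and the test set
class_share <- function(split) {
  c(
    all   = mean(toy$tip == "no"),
    train = mean(training(split)$tip == "no"),
    test  = mean(testing(split)$tip == "no")
  )
}

shares <- rbind(
  unstratified = class_share(plain_split),
  stratified   = class_share(strat_split)
)
round(shares, 3)
```

With stratification, the share of the minority class in the training and test sets should sit very close to its share in the full data; without it, the shares can drift apart by chance, and the drift grows as the data set or the minority class gets smaller.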

Accessing the data

# training() and testing() extract the two data sets from the split object
taxi_train <- training(taxi_split)
taxi_test <- testing(taxi_split)
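
As a sanity check on what training() and testing() return, the sketch below confirms that the two sets are disjoint and together cover the original rows. It uses the built-in mtcars data so that it runs without downloading anything; the cars object and its id column are hypothetical additions made only to track rows.

```r
library(rsample)

# Built-in data with an explicit row identifier (hypothetical helper column)
cars <- mtcars
cars$id <- seq_len(nrow(cars))

set.seed(123)
cars_split <- initial_split(cars, prop = 0.8)
cars_train <- training(cars_split)
cars_test  <- testing(cars_split)

# Every row lands in exactly one of the two sets
rows_covered <- nrow(cars_train) + nrow(cars_test) == nrow(cars)
rows_shared  <- length(intersect(cars_train$id, cars_test$id))

rows_covered
rows_shared
```

If the split behaves as described, rows_covered is TRUE and rows_shared is 0: no row is used for both fitting and assessment.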

The training and test sets

taxi_train
# A tibble: 8,000 × 7
   tip   distance company                      local dow   month  hour
   <fct>    <dbl> <fct>                        <fct> <fct> <fct> <int>
 1 yes      17.2  Chicago Independents         no    Thu   Feb      16
 2 yes       0.88 City Service                 yes   Thu   Mar       8
 3 yes      18.1  other                        no    Mon   Feb      18
 4 yes      12.2  Chicago Independents         no    Sun   Mar      21
 5 yes       0.94 Sun Taxi                     yes   Sat   Apr      23
 6 yes      17.5  Flash Cab                    no    Fri   Mar      12
 7 yes      17.7  other                        no    Sun   Jan       6
 8 yes       1.85 Taxicab Insurance Agency Llc no    Fri   Apr      12
 9 yes       0.53 Sun Taxi                     no    Tue   Mar      18
10 yes       6.65 Taxicab Insurance Agency Llc no    Sun   Apr      11
# ℹ 7,990 more rows
taxi_test
# A tibble: 2,000 × 7
   tip   distance company                      local dow   month  hour
   <fct>    <dbl> <fct>                        <fct> <fct> <fct> <int>
 1 yes      20.7  Chicago Independents         no    Mon   Apr       8
 2 yes       1.47 City Service                 no    Tue   Mar      14
 3 yes       1    Taxi Affiliation Services    no    Mon   Feb      18
 4 yes       1.91 Flash Cab                    no    Wed   Apr      15
 5 yes      17.2  City Service                 no    Mon   Apr       9
 6 yes      17.8  City Service                 no    Mon   Mar       9
 7 yes       0.53 Taxicab Insurance Agency Llc yes   Wed   Apr       8
 8 yes       1.77 other                        no    Thu   Apr      15
 9 yes      18.6  Flash Cab                    no    Thu   Apr      12
10 no        1.13 other                        no    Sat   Feb      14
# ℹ 1,990 more rows

The whole game - status update