Get the data from GitHub
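
The import step looks something like the following sketch, assuming the file is a CSV hosted in the project's GitHub repository (the URL is a placeholder, not the actual path):

library(tidyverse)
library(skimr)

# Placeholder URL -- substitute the actual raw path to the data file
figs <- read_csv("https://raw.githubusercontent.com/USER/REPO/main/figs.csv") %>% 
  mutate(across(where(is.character), as.factor))  # skim reports 14 factor columns

skim(figs)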

Data summary
Name figs
Number of rows 34918
Number of columns 41
_______________________
Column type frequency:
Date 1
factor 14
numeric 26
________________________
Group variables None

Variable type: Date

skim_variable n_missing complete_rate min        max        median     n_unique
race_date             0             1 2016-04-13 2021-12-20 2019-10-31     1433

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
target                0             1 FALSE          2 NoT: 27015, TOP: 7903
horse                 0             1 FALSE       3388 CHA: 57, EL : 53, STA: 53, STA: 52
surface_cond          0             1 FALSE         13 fst: 16041, frm: 10188, wsm: 2845, vsl: 1734
precip                0             1 FALSE          3 cle: 33590, rai: 1250, sno: 78
wind                  0             1 FALSE          3 cal: 30965, hvy: 2537, vhv: 1416
gender                0             1 FALSE          2 Mal: 23692, Fem: 11226
distance              0             1 FALSE         22 8.0: 9161, 6.0: 7606, 8.5: 4518, 7.0: 2920
jky                   0             1 FALSE        729 I O: 1009, L S: 910, M F: 886, J O: 853
trk_code              0             1 FALSE        145 GP: 8279, AQU: 4746, BEL: 3269, SA: 2878
surf                  0             1 FALSE          5 Dir: 21037, Tur: 11548, Syn: 1245, Inn: 1019
s_cond                0             1 FALSE          8 Fst: 18059, Frm: 10175, Gd: 3216, Sly: 2129
PT3                   0             1 FALSE        352 –: 3388, P: 3007, XXX: 2290, TP: 1717
form_cycle            0             1 FALSE         24 AX: 6410, CX: 3647, BX: 3470, A1: 3388
race_type             0             1 FALSE          2 Spr: 18436, Rou: 16482

Variable type: numeric

skim_variable  n_missing complete_rate    mean    sd      p0     p25     p50     p75   p100 hist
yr                     0             1 2019.22  0.99 2016.00 2019.00 2019.00 2020.00 2021.0 ▁▃▇▇▂
age                    0             1    3.64  1.30    1.00    3.00    3.00    4.00   11.0 ▇▆▁▁▁
ftl                    0             1    0.08  0.27    0.00    0.00    0.00    0.00    1.0 ▇▁▁▁▁
clm_by_hot             0             1   -0.93  0.34   -1.00   -1.00   -1.00   -1.00    1.0 ▇▁▁▁▁
start                  0             1    0.03  0.09    0.00    0.00    0.00    0.00    0.9 ▇▁▁▁▁
bled                   0             1    0.00  0.02    0.00    0.00    0.00    0.00    1.0 ▇▁▁▁▁
off_turf               0             1    0.05  0.21    0.00    0.00    0.00    0.00    1.0 ▇▁▁▁▁
bfnr                   0             1    0.01  0.11    0.00    0.00    0.00    0.00    1.0 ▇▁▁▁▁
lame                   0             1    0.00  0.02    0.00    0.00    0.00    0.00    1.0 ▇▁▁▁▁
lsx                    0             1    0.00  0.09   -1.00    0.00    0.00    0.00    1.0 ▁▁▇▁▁
blnkrs                 0             1    0.04  0.27   -1.00    0.00    0.00    0.00    1.0 ▁▁▇▁▁
age_wks                0             1  204.49 65.51   37.00  154.00  190.00  240.00  559.0 ▂▇▂▁▁
wt                     0             1  120.33  2.70  108.00  119.00  120.00  122.00  143.0 ▁▇▂▁▁
off_odds               0             1   12.19 17.14    0.11    2.50    6.00   13.00   99.0 ▇▁▁▁▁
fld                    0             1    8.40  2.21    2.00    7.00    8.00   10.00   30.0 ▅▇▁▁▁
L3                     0             1   18.03  7.04    1.75   13.33   16.92   21.42   99.0 ▇▃▁▁▁
L5                     0             1   18.27  6.78    1.75   13.75   17.25   21.62   99.0 ▇▃▁▁▁
L7                     0             1   18.44  6.68    1.75   14.00   17.43   21.75   99.0 ▇▃▁▁▁
rest                   0             1   44.73 53.55    2.00   21.00   28.00   44.00 1081.0 ▇▁▁▁▁
avg_rest               0             1   39.90 21.70    9.67   25.68   34.80   46.90  489.0 ▇▁▁▁▁
efforts_last90         0             1    0.90  0.95    0.00    0.00    1.00    1.00    8.0 ▇▂▁▁▁
Lag1                   0             1    0.20  0.40    0.00    0.00    0.00    0.00    1.0 ▇▁▁▁▂
Lag2                   0             1    0.19  0.39    0.00    0.00    0.00    0.00    1.0 ▇▁▁▁▂
Lag3                   0             1    0.17  0.37    0.00    0.00    0.00    0.00    1.0 ▇▁▁▁▂
Lag4                   0             1    0.15  0.36    0.00    0.00    0.00    0.00    1.0 ▇▁▁▁▂
Lag5                   0             1    0.14  0.34    0.00    0.00    0.00    0.00    1.0 ▇▁▁▁▁

Build Model

We start by splitting the data into training and test sets.
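
A sketch of the split, assuming a stratified initial_split; the seed and proportion are illustrative (the row counts above and below imply roughly 75/25), and figs_split is an assumed name:

library(tidymodels)

set.seed(123)  # illustrative seed
figs_split <- initial_split(figs, prop = 0.75, strata = target)
figs_train <- training(figs_split)
figs_test  <- testing(figs_split)

skim(figs_test)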

Data summary
Name figs_test
Number of rows 8728
Number of columns 41
_______________________
Column type frequency:
Date 1
factor 14
numeric 26
________________________
Group variables None

Variable type: Date

skim_variable n_missing complete_rate min        max        median     n_unique
race_date             0             1 2016-06-16 2021-03-25 2019-10-28     1193

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
target                0             1 FALSE          2 NoT: 6753, TOP: 1975
horse                 0             1 FALSE       2642 VIC: 21, RAD: 18, HON: 17, ARC: 16
surface_cond          0             1 FALSE         13 fst: 3970, frm: 2517, wsm: 731, vsl: 482
precip                0             1 FALSE          3 cle: 8386, rai: 316, sno: 26
wind                  0             1 FALSE          3 cal: 7743, hvy: 636, vhv: 349
gender                0             1 FALSE          2 Mal: 5940, Fem: 2788
distance              0             1 FALSE         22 8.0: 2291, 6.0: 1905, 8.5: 1161, 7.0: 732
jky                   0             1 FALSE        527 I O: 268, M F: 235, L S: 233, T G: 201
trk_code              0             1 FALSE        105 GP: 2059, AQU: 1246, BEL: 825, SA: 728
surf                  0             1 FALSE          5 Dir: 5287, Tur: 2871, Syn: 301, Inn: 254
s_cond                0             1 FALSE          8 Fst: 4516, Frm: 2528, Gd: 792, Sly: 541
PT3                   0             1 FALSE        351 –: 836, P: 734, XXX: 577, TP: 442
form_cycle            0             1 FALSE         24 AX: 1635, BX: 881, CX: 875, A1: 836
race_type             0             1 FALSE          2 Spr: 4577, Rou: 4151

Variable type: numeric

skim_variable  n_missing complete_rate    mean    sd      p0     p25     p50     p75   p100 hist
yr                     0             1 2019.23  0.99 2016.00 2019.00 2019.00 2020.00 2021.0 ▁▃▇▇▂
age                    0             1    3.64  1.29    2.00    3.00    3.00    4.00   10.0 ▇▆▁▁▁
ftl                    0             1    0.08  0.27    0.00    0.00    0.00    0.00    1.0 ▇▁▁▁▁
clm_by_hot             0             1   -0.93  0.33   -1.00   -1.00   -1.00   -1.00    1.0 ▇▁▁▁▁
start                  0             1    0.02  0.09    0.00    0.00    0.00    0.00    0.9 ▇▁▁▁▁
bled                   0             1    0.00  0.01    0.00    0.00    0.00    0.00    1.0 ▇▁▁▁▁
off_turf               0             1    0.04  0.21    0.00    0.00    0.00    0.00    1.0 ▇▁▁▁▁
bfnr                   0             1    0.01  0.12    0.00    0.00    0.00    0.00    1.0 ▇▁▁▁▁
lame                   0             1    0.00  0.02    0.00    0.00    0.00    0.00    1.0 ▇▁▁▁▁
lsx                    0             1    0.00  0.10   -1.00    0.00    0.00    0.00    1.0 ▁▁▇▁▁
blnkrs                 0             1    0.04  0.27   -1.00    0.00    0.00    0.00    1.0 ▁▁▇▁▁
age_wks                0             1  204.21 65.26   85.00  153.00  190.00  239.00  547.0 ▇▇▂▁▁
wt                     0             1  120.35  2.67  108.00  119.00  120.00  122.00  143.0 ▁▇▂▁▁
off_odds               0             1   11.81 16.57    0.11    2.50    5.00   13.00   99.0 ▇▁▁▁▁
fld                    0             1    8.37  2.21    2.00    7.00    8.00   10.00   30.0 ▆▇▁▁▁
L3                     0             1   17.97  6.78    1.75   13.38   16.92   21.42   99.0 ▇▃▁▁▁
L5                     0             1   18.22  6.56    1.75   13.80   17.25   21.60   99.0 ▇▃▁▁▁
L7                     0             1   18.38  6.45    1.75   14.00   17.43   21.75   99.0 ▇▃▁▁▁
rest                   0             1   45.14 54.78    3.00   21.00   29.00   45.00 1081.0 ▇▁▁▁▁
avg_rest               0             1   39.89 21.30   10.33   25.87   34.79   46.86  293.0 ▇▁▁▁▁
efforts_last90         0             1    0.91  0.96    0.00    0.00    1.00    1.00    7.0 ▇▂▁▁▁
Lag1                   0             1    0.21  0.41    0.00    0.00    0.00    0.00    1.0 ▇▁▁▁▂
Lag2                   0             1    0.19  0.39    0.00    0.00    0.00    0.00    1.0 ▇▁▁▁▂
Lag3                   0             1    0.17  0.38    0.00    0.00    0.00    0.00    1.0 ▇▁▁▁▂
Lag4                   0             1    0.15  0.36    0.00    0.00    0.00    0.00    1.0 ▇▁▁▁▂
Lag5                   0             1    0.14  0.34    0.00    0.00    0.00    0.00    1.0 ▇▁▁▁▁

Create a Recipe

Next we create a recipe to preprocess the data. We use step_other to pool infrequent factor levels and step_dummy to create dummy variables. The dataset is unbalanced, so we also downsample the majority class of the target (NoTop).
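
A sketch of such a recipe follows; the id-role columns, the step_other threshold, and the use of themis::step_downsample are assumptions rather than the exact original:

library(themis)  # provides step_downsample

figs_rec <- recipe(target ~ ., data = figs_train) %>% 
  update_role(race_date, horse, new_role = "id") %>%    # assumed id columns, kept out of the model
  step_other(jky, trk_code, PT3, threshold = 0.01) %>%  # pool rare levels (assumed threshold)
  step_dummy(all_nominal_predictors()) %>%              # dummy variables from the remaining factors
  step_downsample(target)                               # balance TOP vs NoTop in the training data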

Tuning the Random Forest

We create a tuning specification to facilitate tuning the random forest. trees is set to 1000, while min_n and mtry are the parameters that will be tuned. We also set the model engine to ranger and specify a classification model.
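
In code, the specification described above is:

tune_spec <- rand_forest(
  trees = 1000,   # fixed number of trees
  mtry  = tune(), # number of predictors sampled at each split
  min_n = tune()  # minimum node size
) %>% 
  set_engine("ranger") %>% 
  set_mode("classification")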

Set Up a Workflow

tune_wf <- workflow() %>% 
  add_recipe(figs_rec) %>% 
  add_model(tune_spec)

Tune Hyperparameters

The initial training takes place in the following code chunk. We employ parallel processing to expedite the tuning process. After the tuning is complete, collect_metrics() displays the resampled accuracy and ROC AUC for each candidate.

set.seed(234)

figs_folds <- vfold_cv(figs_train)  # 10-fold cross-validation (the vfold_cv default)

doParallel::registerDoParallel(cores = 28)  # parallelize tuning across 28 cores
set.seed(345)

tune_res <- tune_grid(
  tune_wf,
  resamples = figs_folds,
  grid = 10  # 10 candidate combinations of mtry and min_n
)

tune_res %>% 
  collect_metrics()
## # A tibble: 20 x 8
##     mtry min_n .metric  .estimator  mean     n std_err .config              
##    <int> <int> <chr>    <chr>      <dbl> <int>   <dbl> <chr>                
##  1    94    30 accuracy binary     0.697    10 0.00363 Preprocessor1_Model01
##  2    94    30 roc_auc  binary     0.799    10 0.00237 Preprocessor1_Model01
##  3   374    12 accuracy binary     0.694    10 0.00280 Preprocessor1_Model02
##  4   374    12 roc_auc  binary     0.795    10 0.00266 Preprocessor1_Model02
##  5   217    27 accuracy binary     0.695    10 0.00328 Preprocessor1_Model03
##  6   217    27 roc_auc  binary     0.797    10 0.00257 Preprocessor1_Model03
##  7   286    24 accuracy binary     0.694    10 0.00317 Preprocessor1_Model04
##  8   286    24 roc_auc  binary     0.796    10 0.00259 Preprocessor1_Model04
##  9   247    39 accuracy binary     0.695    10 0.00301 Preprocessor1_Model05
## 10   247    39 roc_auc  binary     0.797    10 0.00265 Preprocessor1_Model05
## 11   165     6 accuracy binary     0.696    10 0.00308 Preprocessor1_Model06
## 12   165     6 roc_auc  binary     0.797    10 0.00247 Preprocessor1_Model06
## 13   342    16 accuracy binary     0.694    10 0.00287 Preprocessor1_Model07
## 14   342    16 roc_auc  binary     0.795    10 0.00258 Preprocessor1_Model07
## 15   443    17 accuracy binary     0.692    10 0.00257 Preprocessor1_Model08
## 16   443    17 roc_auc  binary     0.794    10 0.00261 Preprocessor1_Model08
## 17   140     7 accuracy binary     0.697    10 0.00292 Preprocessor1_Model09
## 18   140     7 roc_auc  binary     0.797    10 0.00241 Preprocessor1_Model09
## 19     6    36 accuracy binary     0.702    10 0.00249 Preprocessor1_Model10
## 20     6    36 roc_auc  binary     0.790    10 0.00207 Preprocessor1_Model10
tune_res %>% 
  collect_metrics() %>% 
  filter(.metric == "roc_auc") %>% 
  select(mean, min_n, mtry) %>% 
  pivot_longer(min_n:mtry,
               values_to = "value",
               names_to = "parameter") %>% 
  ggplot(aes(value, mean, color = parameter)) +
  geom_point(show.legend = FALSE) +
  facet_wrap(~ parameter, scales = "free_x")

tune_res
## # Tuning results
## # 10-fold cross-validation 
## # A tibble: 10 x 4
##    splits               id     .metrics          .notes          
##    <list>               <chr>  <list>            <list>          
##  1 <split [23571/2619]> Fold01 <tibble [20 × 6]> <tibble [0 × 1]>
##  2 <split [23571/2619]> Fold02 <tibble [20 × 6]> <tibble [0 × 1]>
##  3 <split [23571/2619]> Fold03 <tibble [20 × 6]> <tibble [0 × 1]>
##  4 <split [23571/2619]> Fold04 <tibble [20 × 6]> <tibble [0 × 1]>
##  5 <split [23571/2619]> Fold05 <tibble [20 × 6]> <tibble [0 × 1]>
##  6 <split [23571/2619]> Fold06 <tibble [20 × 6]> <tibble [0 × 1]>
##  7 <split [23571/2619]> Fold07 <tibble [20 × 6]> <tibble [0 × 1]>
##  8 <split [23571/2619]> Fold08 <tibble [20 × 6]> <tibble [0 × 1]>
##  9 <split [23571/2619]> Fold09 <tibble [20 × 6]> <tibble [0 × 1]>
## 10 <split [23571/2619]> Fold10 <tibble [20 × 6]> <tibble [0 × 1]>

Secondary Tuning

We use the results of the first tuning pass to tune a second time over a narrower grid, with the goal of finding the optimal model parameters.
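
A sketch of the second pass over a narrower, regular grid; the ranges and levels are assumptions inferred from the results below (4 x 4 = 16 candidates, matching the 32 metric rows), and regular_res is an assumed name:

rf_grid <- grid_regular(
  mtry(range = c(100, 105)),  # assumed range around the best first-pass values
  min_n(range = c(25, 28)),   # assumed range
  levels = 4
)

set.seed(456)  # illustrative seed
regular_res <- tune_grid(
  tune_wf,
  resamples = figs_folds,
  grid = rf_grid
)

regular_res %>% 
  collect_metrics()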

## # A tibble: 32 x 8
##     mtry min_n .metric  .estimator  mean     n std_err .config              
##    <int> <int> <chr>    <chr>      <dbl> <int>   <dbl> <chr>                
##  1   100    25 accuracy binary     0.697    10 0.00339 Preprocessor1_Model01
##  2   100    25 roc_auc  binary     0.798    10 0.00258 Preprocessor1_Model01
##  3   101    25 accuracy binary     0.698    10 0.00323 Preprocessor1_Model02
##  4   101    25 roc_auc  binary     0.799    10 0.00254 Preprocessor1_Model02
##  5   103    25 accuracy binary     0.697    10 0.00364 Preprocessor1_Model03
##  6   103    25 roc_auc  binary     0.799    10 0.00249 Preprocessor1_Model03
##  7   105    25 accuracy binary     0.697    10 0.00324 Preprocessor1_Model04
##  8   105    25 roc_auc  binary     0.798    10 0.00244 Preprocessor1_Model04
##  9   100    26 accuracy binary     0.698    10 0.00330 Preprocessor1_Model05
## 10   100    26 roc_auc  binary     0.799    10 0.00257 Preprocessor1_Model05
## # … with 22 more rows

Pick the Best Model

We use roc_auc as the metric to pick the best model and finalize the model specification with the winning parameters.
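
Assuming the second-pass results are stored in regular_res, selecting and finalizing looks like this:

best_auc <- select_best(regular_res, metric = "roc_auc")

final_rf <- finalize_model(tune_spec, best_auc)  # tuning placeholders replaced with the winning values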

Variable Importance

We use the vip package to identify the most important variables.
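
A sketch of the importance plot, refitting the finalized specification with permutation importance switched on; the permutation method and the refit on the prepped training data are assumptions:

library(vip)

final_rf %>% 
  set_engine("ranger", importance = "permutation") %>% 
  fit(target ~ .,
      data = bake(prep(figs_rec), new_data = NULL, all_predictors(), all_outcomes())) %>% 
  vip(geom = "point")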

Final Model

The final model is fit on the training set and evaluated once on the held-out test set; its metrics are calculated and displayed below.
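
Using the names from the earlier sketches, last_fit() performs that train-then-evaluate step:

final_wf <- workflow() %>% 
  add_recipe(figs_rec) %>% 
  add_model(final_rf)

final_res <- final_wf %>% 
  last_fit(figs_split)  # fit on training data, evaluate on the test set

final_res %>% 
  collect_metrics()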

## # A tibble: 2 x 4
##   .metric  .estimator .estimate .config             
##   <chr>    <chr>          <dbl> <chr>               
## 1 accuracy binary         0.682 Preprocessor1_Model1
## 2 roc_auc  binary         0.791 Preprocessor1_Model1

Confusion Matrix

The confusion matrix for the final model is displayed below:
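
One way to produce it from the final fit (final_res is the assumed name from the sketch above):

final_res %>% 
  collect_predictions() %>% 
  conf_mat(truth = target, estimate = .pred_class)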

##           Truth
## Prediction NoTop  TOP
##      NoTop  4437  456
##      TOP    2316 1519

Finally, the model predictions are shown below, with a correct column flagging whether each prediction matched the actual outcome.
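
The correct column can be derived from the collected predictions, for example:

final_res %>% 
  collect_predictions() %>% 
  mutate(correct = if_else(target == .pred_class, "Correct", "Incorrect"))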

## # A tibble: 8,728 x 8
##    id        .pred_NoTop .pred_TOP  .row .pred_class target .config      correct
##    <chr>           <dbl>     <dbl> <int> <fct>       <fct>  <chr>        <chr>  
##  1 train/te…      0.301      0.699     5 TOP         TOP    Preprocesso… Correct
##  2 train/te…      0.400      0.600     7 TOP         NoTop  Preprocesso… Incorr…
##  3 train/te…      0.470      0.530    10 TOP         NoTop  Preprocesso… Incorr…
##  4 train/te…      0.682      0.318    16 NoTop       NoTop  Preprocesso… Correct
##  5 train/te…      0.654      0.346    28 NoTop       NoTop  Preprocesso… Correct
##  6 train/te…      0.301      0.699    37 TOP         TOP    Preprocesso… Correct
##  7 train/te…      0.649      0.351    38 NoTop       TOP    Preprocesso… Incorr…
##  8 train/te…      0.536      0.464    39 NoTop       TOP    Preprocesso… Incorr…
##  9 train/te…      0.0794     0.921    41 TOP         TOP    Preprocesso… Correct
## 10 train/te…      0.252      0.748    42 TOP         TOP    Preprocesso… Correct
## # … with 8,718 more rows