Get the data from GitHub
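The loading code is not shown in the rendered output. A minimal sketch, assuming the data live as a CSV in a GitHub repository (the URL is a placeholder) and that the summary below was produced with skimr::skim():

library(tidyverse)
library(skimr)

# Placeholder path -- substitute the raw GitHub URL of the actual data file
figs <- read_csv("https://raw.githubusercontent.com/<user>/<repo>/main/figs.csv") %>%
  mutate(across(where(is.character), as.factor))

skim(figs)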
Data summary

|                        |       |
|:-----------------------|:------|
| Name                   | figs  |
| Number of rows         | 34918 |
| Number of columns      | 41    |
| Column type frequency: |       |
| Date                   | 1     |
| factor                 | 14    |
| numeric                | 26    |
| Group variables        | None  |

Variable type: Date

| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|:---|---:|---:|:---|:---|:---|---:|
| race_date | 0 | 1 | 2016-04-13 | 2021-12-20 | 2019-10-31 | 1433 |

Variable type: factor

| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|:---|---:|---:|:---|---:|:---|
| target | 0 | 1 | FALSE | 2 | NoT: 27015, TOP: 7903 |
| horse | 0 | 1 | FALSE | 3388 | CHA: 57, EL : 53, STA: 53, STA: 52 |
| surface_cond | 0 | 1 | FALSE | 13 | fst: 16041, frm: 10188, wsm: 2845, vsl: 1734 |
| precip | 0 | 1 | FALSE | 3 | cle: 33590, rai: 1250, sno: 78 |
| wind | 0 | 1 | FALSE | 3 | cal: 30965, hvy: 2537, vhv: 1416 |
| gender | 0 | 1 | FALSE | 2 | Mal: 23692, Fem: 11226 |
| distance | 0 | 1 | FALSE | 22 | 8.0: 9161, 6.0: 7606, 8.5: 4518, 7.0: 2920 |
| jky | 0 | 1 | FALSE | 729 | I O: 1009, L S: 910, M F: 886, J O: 853 |
| trk_code | 0 | 1 | FALSE | 145 | GP: 8279, AQU: 4746, BEL: 3269, SA: 2878 |
| surf | 0 | 1 | FALSE | 5 | Dir: 21037, Tur: 11548, Syn: 1245, Inn: 1019 |
| s_cond | 0 | 1 | FALSE | 8 | Fst: 18059, Frm: 10175, Gd: 3216, Sly: 2129 |
| PT3 | 0 | 1 | FALSE | 352 | –: 3388, P: 3007, XXX: 2290, TP: 1717 |
| form_cycle | 0 | 1 | FALSE | 24 | AX: 6410, CX: 3647, BX: 3470, A1: 3388 |
| race_type | 0 | 1 | FALSE | 2 | Spr: 18436, Rou: 16482 |

Variable type: numeric

| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|:---|---:|---:|---:|---:|---:|---:|---:|---:|---:|:---|
| yr | 0 | 1 | 2019.22 | 0.99 | 2016.00 | 2019.00 | 2019.00 | 2020.00 | 2021.0 | ▁▃▇▇▂ |
| age | 0 | 1 | 3.64 | 1.30 | 1.00 | 3.00 | 3.00 | 4.00 | 11.0 | ▇▆▁▁▁ |
| ftl | 0 | 1 | 0.08 | 0.27 | 0.00 | 0.00 | 0.00 | 0.00 | 1.0 | ▇▁▁▁▁ |
| clm_by_hot | 0 | 1 | -0.93 | 0.34 | -1.00 | -1.00 | -1.00 | -1.00 | 1.0 | ▇▁▁▁▁ |
| start | 0 | 1 | 0.03 | 0.09 | 0.00 | 0.00 | 0.00 | 0.00 | 0.9 | ▇▁▁▁▁ |
| bled | 0 | 1 | 0.00 | 0.02 | 0.00 | 0.00 | 0.00 | 0.00 | 1.0 | ▇▁▁▁▁ |
| off_turf | 0 | 1 | 0.05 | 0.21 | 0.00 | 0.00 | 0.00 | 0.00 | 1.0 | ▇▁▁▁▁ |
| bfnr | 0 | 1 | 0.01 | 0.11 | 0.00 | 0.00 | 0.00 | 0.00 | 1.0 | ▇▁▁▁▁ |
| lame | 0 | 1 | 0.00 | 0.02 | 0.00 | 0.00 | 0.00 | 0.00 | 1.0 | ▇▁▁▁▁ |
| lsx | 0 | 1 | 0.00 | 0.09 | -1.00 | 0.00 | 0.00 | 0.00 | 1.0 | ▁▁▇▁▁ |
| blnkrs | 0 | 1 | 0.04 | 0.27 | -1.00 | 0.00 | 0.00 | 0.00 | 1.0 | ▁▁▇▁▁ |
| age_wks | 0 | 1 | 204.49 | 65.51 | 37.00 | 154.00 | 190.00 | 240.00 | 559.0 | ▂▇▂▁▁ |
| wt | 0 | 1 | 120.33 | 2.70 | 108.00 | 119.00 | 120.00 | 122.00 | 143.0 | ▁▇▂▁▁ |
| off_odds | 0 | 1 | 12.19 | 17.14 | 0.11 | 2.50 | 6.00 | 13.00 | 99.0 | ▇▁▁▁▁ |
| fld | 0 | 1 | 8.40 | 2.21 | 2.00 | 7.00 | 8.00 | 10.00 | 30.0 | ▅▇▁▁▁ |
| L3 | 0 | 1 | 18.03 | 7.04 | 1.75 | 13.33 | 16.92 | 21.42 | 99.0 | ▇▃▁▁▁ |
| L5 | 0 | 1 | 18.27 | 6.78 | 1.75 | 13.75 | 17.25 | 21.62 | 99.0 | ▇▃▁▁▁ |
| L7 | 0 | 1 | 18.44 | 6.68 | 1.75 | 14.00 | 17.43 | 21.75 | 99.0 | ▇▃▁▁▁ |
| rest | 0 | 1 | 44.73 | 53.55 | 2.00 | 21.00 | 28.00 | 44.00 | 1081.0 | ▇▁▁▁▁ |
| avg_rest | 0 | 1 | 39.90 | 21.70 | 9.67 | 25.68 | 34.80 | 46.90 | 489.0 | ▇▁▁▁▁ |
| efforts_last90 | 0 | 1 | 0.90 | 0.95 | 0.00 | 0.00 | 1.00 | 1.00 | 8.0 | ▇▂▁▁▁ |
| Lag1 | 0 | 1 | 0.20 | 0.40 | 0.00 | 0.00 | 0.00 | 0.00 | 1.0 | ▇▁▁▁▂ |
| Lag2 | 0 | 1 | 0.19 | 0.39 | 0.00 | 0.00 | 0.00 | 0.00 | 1.0 | ▇▁▁▁▂ |
| Lag3 | 0 | 1 | 0.17 | 0.37 | 0.00 | 0.00 | 0.00 | 0.00 | 1.0 | ▇▁▁▁▂ |
| Lag4 | 0 | 1 | 0.15 | 0.36 | 0.00 | 0.00 | 0.00 | 0.00 | 1.0 | ▇▁▁▁▂ |
| Lag5 | 0 | 1 | 0.14 | 0.34 | 0.00 | 0.00 | 0.00 | 0.00 | 1.0 | ▇▁▁▁▁ |
Build Model
We start by splitting the data into training and test sets.
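The split itself is not shown; a minimal sketch, assuming rsample's default 75/25 proportion with stratification on the outcome (the seed is an assumption; the test set summarized below holds 8728 of the 34918 rows):

library(tidymodels)

set.seed(123)  # assumed seed; the original is not shown
figs_split <- initial_split(figs, strata = target)  # default prop = 3/4
figs_train <- training(figs_split)
figs_test  <- testing(figs_split)

skim(figs_test)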
Data summary

|                        |           |
|:-----------------------|:----------|
| Name                   | figs_test |
| Number of rows         | 8728      |
| Number of columns      | 41        |
| Column type frequency: |           |
| Date                   | 1         |
| factor                 | 14        |
| numeric                | 26        |
| Group variables        | None      |

Variable type: Date

| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|:---|---:|---:|:---|:---|:---|---:|
| race_date | 0 | 1 | 2016-06-16 | 2021-03-25 | 2019-10-28 | 1193 |

Variable type: factor

| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|:---|---:|---:|:---|---:|:---|
| target | 0 | 1 | FALSE | 2 | NoT: 6753, TOP: 1975 |
| horse | 0 | 1 | FALSE | 2642 | VIC: 21, RAD: 18, HON: 17, ARC: 16 |
| surface_cond | 0 | 1 | FALSE | 13 | fst: 3970, frm: 2517, wsm: 731, vsl: 482 |
| precip | 0 | 1 | FALSE | 3 | cle: 8386, rai: 316, sno: 26 |
| wind | 0 | 1 | FALSE | 3 | cal: 7743, hvy: 636, vhv: 349 |
| gender | 0 | 1 | FALSE | 2 | Mal: 5940, Fem: 2788 |
| distance | 0 | 1 | FALSE | 22 | 8.0: 2291, 6.0: 1905, 8.5: 1161, 7.0: 732 |
| jky | 0 | 1 | FALSE | 527 | I O: 268, M F: 235, L S: 233, T G: 201 |
| trk_code | 0 | 1 | FALSE | 105 | GP: 2059, AQU: 1246, BEL: 825, SA: 728 |
| surf | 0 | 1 | FALSE | 5 | Dir: 5287, Tur: 2871, Syn: 301, Inn: 254 |
| s_cond | 0 | 1 | FALSE | 8 | Fst: 4516, Frm: 2528, Gd: 792, Sly: 541 |
| PT3 | 0 | 1 | FALSE | 351 | –: 836, P: 734, XXX: 577, TP: 442 |
| form_cycle | 0 | 1 | FALSE | 24 | AX: 1635, BX: 881, CX: 875, A1: 836 |
| race_type | 0 | 1 | FALSE | 2 | Spr: 4577, Rou: 4151 |

Variable type: numeric

| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|:---|---:|---:|---:|---:|---:|---:|---:|---:|---:|:---|
| yr | 0 | 1 | 2019.23 | 0.99 | 2016.00 | 2019.00 | 2019.00 | 2020.00 | 2021.0 | ▁▃▇▇▂ |
| age | 0 | 1 | 3.64 | 1.29 | 2.00 | 3.00 | 3.00 | 4.00 | 10.0 | ▇▆▁▁▁ |
| ftl | 0 | 1 | 0.08 | 0.27 | 0.00 | 0.00 | 0.00 | 0.00 | 1.0 | ▇▁▁▁▁ |
| clm_by_hot | 0 | 1 | -0.93 | 0.33 | -1.00 | -1.00 | -1.00 | -1.00 | 1.0 | ▇▁▁▁▁ |
| start | 0 | 1 | 0.02 | 0.09 | 0.00 | 0.00 | 0.00 | 0.00 | 0.9 | ▇▁▁▁▁ |
| bled | 0 | 1 | 0.00 | 0.01 | 0.00 | 0.00 | 0.00 | 0.00 | 1.0 | ▇▁▁▁▁ |
| off_turf | 0 | 1 | 0.04 | 0.21 | 0.00 | 0.00 | 0.00 | 0.00 | 1.0 | ▇▁▁▁▁ |
| bfnr | 0 | 1 | 0.01 | 0.12 | 0.00 | 0.00 | 0.00 | 0.00 | 1.0 | ▇▁▁▁▁ |
| lame | 0 | 1 | 0.00 | 0.02 | 0.00 | 0.00 | 0.00 | 0.00 | 1.0 | ▇▁▁▁▁ |
| lsx | 0 | 1 | 0.00 | 0.10 | -1.00 | 0.00 | 0.00 | 0.00 | 1.0 | ▁▁▇▁▁ |
| blnkrs | 0 | 1 | 0.04 | 0.27 | -1.00 | 0.00 | 0.00 | 0.00 | 1.0 | ▁▁▇▁▁ |
| age_wks | 0 | 1 | 204.21 | 65.26 | 85.00 | 153.00 | 190.00 | 239.00 | 547.0 | ▇▇▂▁▁ |
| wt | 0 | 1 | 120.35 | 2.67 | 108.00 | 119.00 | 120.00 | 122.00 | 143.0 | ▁▇▂▁▁ |
| off_odds | 0 | 1 | 11.81 | 16.57 | 0.11 | 2.50 | 5.00 | 13.00 | 99.0 | ▇▁▁▁▁ |
| fld | 0 | 1 | 8.37 | 2.21 | 2.00 | 7.00 | 8.00 | 10.00 | 30.0 | ▆▇▁▁▁ |
| L3 | 0 | 1 | 17.97 | 6.78 | 1.75 | 13.38 | 16.92 | 21.42 | 99.0 | ▇▃▁▁▁ |
| L5 | 0 | 1 | 18.22 | 6.56 | 1.75 | 13.80 | 17.25 | 21.60 | 99.0 | ▇▃▁▁▁ |
| L7 | 0 | 1 | 18.38 | 6.45 | 1.75 | 14.00 | 17.43 | 21.75 | 99.0 | ▇▃▁▁▁ |
| rest | 0 | 1 | 45.14 | 54.78 | 3.00 | 21.00 | 29.00 | 45.00 | 1081.0 | ▇▁▁▁▁ |
| avg_rest | 0 | 1 | 39.89 | 21.30 | 10.33 | 25.87 | 34.79 | 46.86 | 293.0 | ▇▁▁▁▁ |
| efforts_last90 | 0 | 1 | 0.91 | 0.96 | 0.00 | 0.00 | 1.00 | 1.00 | 7.0 | ▇▂▁▁▁ |
| Lag1 | 0 | 1 | 0.21 | 0.41 | 0.00 | 0.00 | 0.00 | 0.00 | 1.0 | ▇▁▁▁▂ |
| Lag2 | 0 | 1 | 0.19 | 0.39 | 0.00 | 0.00 | 0.00 | 0.00 | 1.0 | ▇▁▁▁▂ |
| Lag3 | 0 | 1 | 0.17 | 0.38 | 0.00 | 0.00 | 0.00 | 0.00 | 1.0 | ▇▁▁▁▂ |
| Lag4 | 0 | 1 | 0.15 | 0.36 | 0.00 | 0.00 | 0.00 | 0.00 | 1.0 | ▇▁▁▁▂ |
| Lag5 | 0 | 1 | 0.14 | 0.34 | 0.00 | 0.00 | 0.00 | 0.00 | 1.0 | ▇▁▁▁▁ |
Create a Recipe
Next we create a recipe to preprocess the data. We use step_other to pool infrequent factor levels and step_dummy to create dummy variables. The dataset is imbalanced, so we also downsample the majority target class (NoTop).
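The recipe code is not shown in the rendered output. A minimal sketch, assuming themis::step_downsample() for the class balancing; the step_other threshold and the removal of race_date are assumptions:

library(themis)  # provides step_downsample()

figs_rec <- recipe(target ~ ., data = figs_train) %>%
  step_rm(race_date) %>%                                      # assumed: drop the raw date
  step_other(all_nominal_predictors(), threshold = 0.01) %>%  # assumed threshold
  step_dummy(all_nominal_predictors()) %>%
  step_downsample(target)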
Tuning the Random Forest
We create a tuning specification for the random forest. The number of trees is set to 1000, while min_n and mtry are left as parameters to be tuned. We also set the model engine to ranger and specify a classification model.
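The chunk itself is not shown in the rendered output; a specification matching the description above would be:

tune_spec <- rand_forest(
  trees = 1000,
  mtry  = tune(),
  min_n = tune()
) %>%
  set_engine("ranger") %>%
  set_mode("classification")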
Set up a Workflow
tune_wf <- workflow() %>%
  add_recipe(figs_rec) %>%
  add_model(tune_spec)
Tune hyperparameters
The initial tuning run takes place in the following code chunk. We employ parallel processing to expedite the process. After the tuning completes, collect_metrics() displays the results for each candidate parameter combination.
set.seed(234)
figs_folds <- vfold_cv(figs_train)

doParallel::registerDoParallel(cores = 28)

set.seed(345)
tune_res <- tune_grid(
  tune_wf,
  resamples = figs_folds,
  grid = 10
)

tune_res %>%
  collect_metrics()
## # A tibble: 20 x 8
## mtry min_n .metric .estimator mean n std_err .config
## <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 94 30 accuracy binary 0.697 10 0.00363 Preprocessor1_Model01
## 2 94 30 roc_auc binary 0.799 10 0.00237 Preprocessor1_Model01
## 3 374 12 accuracy binary 0.694 10 0.00280 Preprocessor1_Model02
## 4 374 12 roc_auc binary 0.795 10 0.00266 Preprocessor1_Model02
## 5 217 27 accuracy binary 0.695 10 0.00328 Preprocessor1_Model03
## 6 217 27 roc_auc binary 0.797 10 0.00257 Preprocessor1_Model03
## 7 286 24 accuracy binary 0.694 10 0.00317 Preprocessor1_Model04
## 8 286 24 roc_auc binary 0.796 10 0.00259 Preprocessor1_Model04
## 9 247 39 accuracy binary 0.695 10 0.00301 Preprocessor1_Model05
## 10 247 39 roc_auc binary 0.797 10 0.00265 Preprocessor1_Model05
## 11 165 6 accuracy binary 0.696 10 0.00308 Preprocessor1_Model06
## 12 165 6 roc_auc binary 0.797 10 0.00247 Preprocessor1_Model06
## 13 342 16 accuracy binary 0.694 10 0.00287 Preprocessor1_Model07
## 14 342 16 roc_auc binary 0.795 10 0.00258 Preprocessor1_Model07
## 15 443 17 accuracy binary 0.692 10 0.00257 Preprocessor1_Model08
## 16 443 17 roc_auc binary 0.794 10 0.00261 Preprocessor1_Model08
## 17 140 7 accuracy binary 0.697 10 0.00292 Preprocessor1_Model09
## 18 140 7 roc_auc binary 0.797 10 0.00241 Preprocessor1_Model09
## 19 6 36 accuracy binary 0.702 10 0.00249 Preprocessor1_Model10
## 20 6 36 roc_auc binary 0.790 10 0.00207 Preprocessor1_Model10
tune_res %>%
  collect_metrics() %>%
  filter(.metric == "roc_auc") %>%
  select(mean, min_n, mtry) %>%
  pivot_longer(min_n:mtry,
               values_to = "value",
               names_to = "parameter") %>%
  ggplot(aes(value, mean, color = parameter)) +
  geom_point(show.legend = FALSE) +
  facet_wrap(~ parameter, scales = "free_x")

tune_res
## # Tuning results
## # 10-fold cross-validation
## # A tibble: 10 x 4
## splits id .metrics .notes
## <list> <chr> <list> <list>
## 1 <split [23571/2619]> Fold01 <tibble [20 × 6]> <tibble [0 × 1]>
## 2 <split [23571/2619]> Fold02 <tibble [20 × 6]> <tibble [0 × 1]>
## 3 <split [23571/2619]> Fold03 <tibble [20 × 6]> <tibble [0 × 1]>
## 4 <split [23571/2619]> Fold04 <tibble [20 × 6]> <tibble [0 × 1]>
## 5 <split [23571/2619]> Fold05 <tibble [20 × 6]> <tibble [0 × 1]>
## 6 <split [23571/2619]> Fold06 <tibble [20 × 6]> <tibble [0 × 1]>
## 7 <split [23571/2619]> Fold07 <tibble [20 × 6]> <tibble [0 × 1]>
## 8 <split [23571/2619]> Fold08 <tibble [20 × 6]> <tibble [0 × 1]>
## 9 <split [23571/2619]> Fold09 <tibble [20 × 6]> <tibble [0 × 1]>
## 10 <split [23571/2619]> Fold10 <tibble [20 × 6]> <tibble [0 × 1]>
Secondary tuning
We use the results of the first tuning pass to tune a second time over a narrower grid, with the goal of finding the optimal model parameters.
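The second-pass code is not shown. A sketch, assuming a regular grid centered on the region the first pass favored; the ranges and seed are assumptions, chosen to match the mtry and min_n values visible in the output below:

rf_grid <- grid_regular(
  mtry(range = c(100, 105)),  # assumed range, near the best first-pass values
  min_n(range = c(25, 28)),   # assumed range
  levels = 4                  # 16 candidates, i.e. 32 metric rows
)

set.seed(456)  # assumed seed
tune_res2 <- tune_grid(
  tune_wf,
  resamples = figs_folds,
  grid = rf_grid
)

tune_res2 %>%
  collect_metrics()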
## # A tibble: 32 x 8
## mtry min_n .metric .estimator mean n std_err .config
## <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 100 25 accuracy binary 0.697 10 0.00339 Preprocessor1_Model01
## 2 100 25 roc_auc binary 0.798 10 0.00258 Preprocessor1_Model01
## 3 101 25 accuracy binary 0.698 10 0.00323 Preprocessor1_Model02
## 4 101 25 roc_auc binary 0.799 10 0.00254 Preprocessor1_Model02
## 5 103 25 accuracy binary 0.697 10 0.00364 Preprocessor1_Model03
## 6 103 25 roc_auc binary 0.799 10 0.00249 Preprocessor1_Model03
## 7 105 25 accuracy binary 0.697 10 0.00324 Preprocessor1_Model04
## 8 105 25 roc_auc binary 0.798 10 0.00244 Preprocessor1_Model04
## 9 100 26 accuracy binary 0.698 10 0.00330 Preprocessor1_Model05
## 10 100 26 roc_auc binary 0.799 10 0.00257 Preprocessor1_Model05
## # … with 22 more rows

Pick the best Model
We use roc_auc as the metric for picking the best model.
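A sketch of the selection and finalization step, using the objects from the sketches above:

best_auc <- select_best(tune_res2, metric = "roc_auc")

final_rf <- finalize_model(tune_spec, best_auc)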
Variable Importance
We use the vip package to identify the most important variables.
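The plotting code is not shown; a sketch, assuming ranger's permutation importance and a fit on the prepped training data:

library(vip)

figs_prep <- prep(figs_rec)

final_rf %>%
  set_engine("ranger", importance = "permutation") %>%  # assumed importance mode
  fit(target ~ ., data = bake(figs_prep, new_data = NULL)) %>%
  vip(geom = "point")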

Final Model
The metrics for the final model are calculated and displayed below.
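A sketch of the last step, assuming last_fit() on the original split object from the earlier sketch:

final_wf <- workflow() %>%
  add_recipe(figs_rec) %>%
  add_model(final_rf)

final_res <- last_fit(final_wf, figs_split)

final_res %>%
  collect_metrics()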
## # A tibble: 2 x 4
## .metric .estimator .estimate .config
## <chr> <chr> <dbl> <chr>
## 1 accuracy binary 0.682 Preprocessor1_Model1
## 2 roc_auc binary 0.791 Preprocessor1_Model1
Confusion Matrix
The confusion matrix for the final model is displayed below:
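It can be computed from the test-set predictions; a sketch, continuing from final_res above:

final_res %>%
  collect_predictions() %>%
  conf_mat(target, .pred_class)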
## Truth
## Prediction NoTop TOP
## NoTop 4437 456
## TOP 2316 1519
Finally, the individual test-set predictions are shown below.
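The correct column flags whether each prediction matched the truth; a sketch of how it could be built:

final_res %>%
  collect_predictions() %>%
  mutate(correct = if_else(target == .pred_class, "Correct", "Incorrect"))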
## # A tibble: 8,728 x 8
## id .pred_NoTop .pred_TOP .row .pred_class target .config correct
## <chr> <dbl> <dbl> <int> <fct> <fct> <chr> <chr>
## 1 train/te… 0.301 0.699 5 TOP TOP Preprocesso… Correct
## 2 train/te… 0.400 0.600 7 TOP NoTop Preprocesso… Incorr…
## 3 train/te… 0.470 0.530 10 TOP NoTop Preprocesso… Incorr…
## 4 train/te… 0.682 0.318 16 NoTop NoTop Preprocesso… Correct
## 5 train/te… 0.654 0.346 28 NoTop NoTop Preprocesso… Correct
## 6 train/te… 0.301 0.699 37 TOP TOP Preprocesso… Correct
## 7 train/te… 0.649 0.351 38 NoTop TOP Preprocesso… Incorr…
## 8 train/te… 0.536 0.464 39 NoTop TOP Preprocesso… Incorr…
## 9 train/te… 0.0794 0.921 41 TOP TOP Preprocesso… Correct
## 10 train/te… 0.252 0.748 42 TOP TOP Preprocesso… Correct
## # … with 8,718 more rows