TUNE XGBOOST WITH TIDYMODELS AND TIDYTUESDAY BEACH VOLLEYBALL
Lately I’ve been publishing screencasts demonstrating how to use the tidymodels framework, starting from just getting started. Today’s screencast explores a more advanced topic: how to tune an XGBoost classification model, using this week’s TidyTuesday dataset on beach volleyball.
suppressMessages(library(tidyverse))
suppressMessages(library(tidymodels))
suppressMessages(library(plotly))
Our modeling goal is to predict whether a beach volleyball team of two won their match based on game play stats like errors, blocks, attacks, etc. from this week’s TidyTuesday dataset.
This dataset has the match stats like serve errors, kills, and so forth divided out by the two players for each team, but we want those combined together because we are going to make a prediction per team (i.e., what makes a team more likely to win). Let’s include predictors like gender, circuit, and year in our model along with the per-match statistics. Let’s omit matches with NA values because we don’t have all kinds of statistics measured for all matches.
vb_matches <- readr::read_csv(
  'https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-05-19/vb_matches.csv',
  guess_max = 76000 # guess column types from many rows so sparsely filled columns parse correctly
)
## Parsed with column specification:
## cols(
## .default = col_double(),
## circuit = col_character(),
## tournament = col_character(),
## country = col_character(),
## date = col_date(format = ""),
## gender = col_character(),
## w_player1 = col_character(),
## w_p1_birthdate = col_date(format = ""),
## w_p1_country = col_character(),
## w_player2 = col_character(),
## w_p2_birthdate = col_date(format = ""),
## w_p2_country = col_character(),
## w_rank = col_character(),
## l_player1 = col_character(),
## l_p1_birthdate = col_date(format = ""),
## l_p1_country = col_character(),
## l_player2 = col_character(),
## l_p2_birthdate = col_date(format = ""),
## l_p2_country = col_character(),
## l_rank = col_character(),
## score = col_character()
## # ... with 3 more columns
## )
## See spec(...) for full column specifications.
vb_parsed <- vb_matches %>%
  transmute(
    circuit,
    gender,
    year,
    w_attacks = w_p1_tot_attacks + w_p2_tot_attacks,
    w_errors = w_p1_tot_errors + w_p2_tot_errors,
    w_aces = w_p1_tot_aces + w_p2_tot_aces,
    w_serve_errors = w_p1_tot_serve_errors + w_p2_tot_serve_errors,
    w_blocks = w_p1_tot_blocks + w_p2_tot_blocks,
    w_digs = w_p1_tot_digs + w_p2_tot_digs,
    l_attacks = l_p1_tot_attacks + l_p2_tot_attacks,
    l_errors = l_p1_tot_errors + l_p2_tot_errors,
    l_aces = l_p1_tot_aces + l_p2_tot_aces,
    l_serve_errors = l_p1_tot_serve_errors + l_p2_tot_serve_errors,
    l_blocks = l_p1_tot_blocks + l_p2_tot_blocks,
    l_digs = l_p1_tot_digs + l_p2_tot_digs
  ) %>%
  na.omit()
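We can quickly check how many matches remain after dropping incomplete ones (output not shown here; the exact count depends on the downloaded data):

nrow(vb_parsed) # number of matches with complete statistics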
Still plenty of data! Next, let’s create separate dataframes for the winners and losers of each match, and then bind them together. I am using functions like rename_with() from the upcoming dplyr 1.0 release here.
winners <- vb_parsed %>%
select(circuit, gender, year, w_attacks:w_digs) %>%
rename_with(~ str_remove_all(., "w_"), w_attacks:w_digs) %>%
mutate(win = 'win')
losers <- vb_parsed %>%
select(circuit, gender, year, l_attacks:l_digs) %>%
rename_with(~ str_remove_all(., "l_"), l_attacks:l_digs) %>%
mutate(win = 'lose')
vb_df <- bind_rows(winners, losers) %>% mutate_if(is.character, factor)
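Since every match contributes exactly one winning and one losing team, the outcome classes should be perfectly balanced. A quick count (a sanity check sketch; output not shown) can confirm:

vb_df %>% count(win) # expect equal numbers of 'win' and 'lose'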
Joshua Cook used a similar approach to data preparation.
Exploratory data analysis is always important before modeling. Let’s make one plot to explore the relationships in this data.
vb_df %>%
  pivot_longer(attacks:digs, names_to = 'stat', values_to = 'value') %>%
  ggplot(aes(x = gender, y = value, fill = win, color = win)) +
  geom_boxplot(alpha = 0.4) +
  facet_wrap(~stat, scales = 'free_y', nrow = 2) +
  labs(y = NULL, color = NULL, fill = NULL) +
  theme_light()
We can start by loading the tidymodels metapackage and splitting our data into training and testing sets.
library(tidymodels)
set.seed(123)
vb_split <- initial_split(vb_df, strata = win)
vb_train <- training(vb_split)
vb_test <- testing(vb_split)
An XGBoost model is based on trees, so we don’t need to do much preprocessing for our data; we don’t need to worry about the factors or about centering or scaling our data. Let’s go straight to setting up our model specification. On the other hand, we are going to tune a lot of model hyperparameters.
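As a sketch of what such a tunable specification might look like, using parsnip’s boost_tree() with tune() placeholders (which hyperparameters to tune, and fixing trees at 1000, are assumptions here):

xgb_spec <- boost_tree(
  trees = 1000,              # a fixed number of trees (assumed value)
  tree_depth = tune(),       # maximum depth of each tree
  min_n = tune(),            # minimum observations in a node to split
  loss_reduction = tune(),   # minimum loss reduction required to split further
  sample_size = tune(),      # proportion of data sampled per tree
  mtry = tune(),             # number of predictors sampled at each split
  learn_rate = tune()        # step size shrinkage
) %>%
  set_engine('xgboost') %>%
  set_mode('classification')

Each tune() is a placeholder; the actual values get filled in later when we tune over a grid of candidate hyperparameter combinations.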