TUNE XGBOOST WITH TIDYMODELS AND TIDYTUESDAY BEACH VOLLEYBALL
Lately I’ve been publishing screencasts demonstrating how to use the tidymodels framework, starting from just getting started. Today’s screencast explores a more advanced topic: how to tune an XGBoost classification model, using this week’s TidyTuesday dataset on beach volleyball.
suppressMessages(library(tidyverse))
suppressMessages(library(tidymodels))
suppressMessages(library(plotly))
Our modeling goal is to predict whether a beach volleyball team of two won their match based on game play stats like errors, blocks, attacks, etc. from this week’s TidyTuesday dataset.
This dataset has the match stats like serve errors, kills, and so forth divided out by the two players for each team, but we want those combined together because we are going to make a prediction per team (i.e., what makes a team more likely to win). Let’s include predictors like gender, circuit, and year in our model along with the per-match statistics. Let’s omit matches with NA values because we don’t have all kinds of statistics measured for all matches.
vb_matches <- readr::read_csv(
  'https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-05-19/vb_matches.csv',
  guess_max = 76000 # guess column types from many rows so sparsely filled columns parse correctly
)
## Parsed with column specification:
## cols(
## .default = col_double(),
## circuit = col_character(),
## tournament = col_character(),
## country = col_character(),
## date = col_date(format = ""),
## gender = col_character(),
## w_player1 = col_character(),
## w_p1_birthdate = col_date(format = ""),
## w_p1_country = col_character(),
## w_player2 = col_character(),
## w_p2_birthdate = col_date(format = ""),
## w_p2_country = col_character(),
## w_rank = col_character(),
## l_player1 = col_character(),
## l_p1_birthdate = col_date(format = ""),
## l_p1_country = col_character(),
## l_player2 = col_character(),
## l_p2_birthdate = col_date(format = ""),
## l_p2_country = col_character(),
## l_rank = col_character(),
## score = col_character()
## # ... with 3 more columns
## )
## See spec(...) for full column specifications.
vb_parsed <- vb_matches %>%
  transmute(
    circuit,
    gender,
    year,
    w_attacks = w_p1_tot_attacks + w_p2_tot_attacks,
    w_errors = w_p1_tot_errors + w_p2_tot_errors,
    w_aces = w_p1_tot_aces + w_p2_tot_aces,
    w_serve_errors = w_p1_tot_serve_errors + w_p2_tot_serve_errors,
    w_blocks = w_p1_tot_blocks + w_p2_tot_blocks,
    w_digs = w_p1_tot_digs + w_p2_tot_digs,
    l_attacks = l_p1_tot_attacks + l_p2_tot_attacks,
    l_errors = l_p1_tot_errors + l_p2_tot_errors,
    l_aces = l_p1_tot_aces + l_p2_tot_aces,
    l_serve_errors = l_p1_tot_serve_errors + l_p2_tot_serve_errors,
    l_blocks = l_p1_tot_blocks + l_p2_tot_blocks,
    l_digs = l_p1_tot_digs + l_p2_tot_digs
  ) %>%
  na.omit()
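We can quickly check how many matches remain after dropping incomplete ones (output not shown here; the exact count depends on the downloaded data):

nrow(vb_parsed) # number of matches with complete statistics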
Still plenty of data! Next, let’s create separate dataframes for the winners and losers of each match, and then bind them together. I am using functions like rename_with() from the upcoming dplyr 1.0 release here.
winners <- vb_parsed %>%
select(circuit, gender, year, w_attacks:w_digs) %>%
rename_with(~ str_remove_all(., "w_"), w_attacks:w_digs) %>%
mutate(win = 'win')
losers <- vb_parsed %>%
select(circuit, gender, year, l_attacks:l_digs) %>%
rename_with(~ str_remove_all(., "l_"), l_attacks:l_digs) %>%
mutate(win = 'lose')
vb_df <- bind_rows(winners, losers) %>% mutate_if(is.character, factor)
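Since every match contributes exactly one winning and one losing team, the outcome classes should be perfectly balanced. A quick count (a sanity check sketch; output not shown) can confirm:

vb_df %>% count(win) # expect equal numbers of 'win' and 'lose'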
Joshua Cook used a similar approach to data preparation.
Exploratory data analysis is always important before modeling. Let’s make one plot to explore the relationships in this data.
vb_df %>%
  pivot_longer(attacks:digs, names_to = 'stat', values_to = 'value') %>%
  ggplot(aes(x = gender, y = value, fill = win, color = win)) +
  geom_boxplot(alpha = 0.4) +
  facet_wrap(~stat, scales = 'free_y', nrow = 2) +
  labs(y = NULL, color = NULL, fill = NULL) +
  theme_light()
We can start by loading the tidymodels metapackage and splitting our data into training and testing sets.
library(tidymodels)
set.seed(123)
vb_split <- initial_split(vb_df, strata = win)
vb_train <- training(vb_split)
vb_test <- testing(vb_split)
An XGBoost model is based on trees, so we don’t need to do much preprocessing for our data; we don’t need to worry about the factors or about centering or scaling our data. Let’s go straight to setting up our model specification. On the other hand, we are going to tune a lot of model hyperparameters.
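As a sketch of what such a tunable specification might look like, using parsnip’s boost_tree() with tune() placeholders (which hyperparameters to tune, and fixing trees at 1000, are assumptions here):

xgb_spec <- boost_tree(
  trees = 1000,              # a fixed number of trees (assumed value)
  tree_depth = tune(),       # maximum depth of each tree
  min_n = tune(),            # minimum observations in a node to split
  loss_reduction = tune(),   # minimum loss reduction required to split further
  sample_size = tune(),      # proportion of data sampled per tree
  mtry = tune(),             # number of predictors sampled at each split
  learn_rate = tune()        # step size shrinkage
) %>%
  set_engine('xgboost') %>%
  set_mode('classification')

Each tune() is a placeholder; the actual values get filled in later when we tune over a grid of candidate hyperparameter combinations.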