This tutorial will walk through creating a model using machine
learning, specifically the eXtreme Gradient Boosting (XGBoost)
algorithm. We will use XGBoost to predict the outcome of AFLW matches.
In R, an XGBoost model can be built using the tidymodels
package, and we will walk through the workflow of building a model
with it.
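All of the packages used in this tutorial are available on CRAN; if you have not used them before, install them first (xgboost is the engine our model will call):
# One-time setup: install the packages used in this tutorial
install.packages(c("tidyverse", "fitzRoy", "tidymodels", "xgboost"))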
Using the fitzRoy package, we can collate team match stats; these will serve as the predictors in our model.
library(tidyverse)
library(fitzRoy)
# fetch data from 2020 to 2022
w_data <- rbind(
  fetch_player_stats(season = 2022, comp = "AFLW"),
  fetch_player_stats(season = 2021, comp = "AFLW"),
  fetch_player_stats(season = 2020, comp = "AFLW")
)
# Selecting variables which contain the match stats
w_data2 <- w_data %>%
  select(teamStatus, providerId, round.roundNumber, team.name, goals, behinds,
         disposals, marks, tackles, inside50s, rebound50s,
         clearances.totalClearances, contestedPossessions, uncontestedPossessions) %>%
  group_by(providerId, team.name) %>%
  summarise(
    status = teamStatus[1],
    across(c(goals:uncontestedPossessions), ~sum(.))
  )
# separating home and away data
w_home <- w_data2 %>%
  filter(status == 'home') %>%
  dplyr::rename_with(.cols = team.name:uncontestedPossessions, ~paste('home', ., sep = "_"))
w_away <- w_data2 %>%
  filter(status == 'away') %>%
  dplyr::rename_with(.cols = team.name:uncontestedPossessions, ~paste('away', ., sep = "_"))
# joining both team match data together
w_data3 <- left_join(w_home, w_away, by = 'providerId')
w_dataFinal <- w_data3 %>%
  ungroup() %>%
  select(-contains('status'), -contains('team')) %>%
  mutate(
    # finding match results: a goal is worth 6 points, a behind 1
    home_total = home_goals * 6 + home_behinds,
    away_total = away_goals * 6 + away_behinds,
    # result: 1 = home win, 0.5 = draw, 0 = home loss
    result = factor(ifelse(home_total == away_total, 0.5,
                           ifelse(home_total > away_total, 1, 0)))
  ) %>%
  select(-contains('goals'), -contains('behinds'), -contains('total'), -providerId)
head(w_dataFinal, 5)
## # A tibble: 5 × 15
##   home_disposals home_…¹ home_…² home_…³ home_…⁴ home_…⁵ home_…⁶ away_…⁷ away_…⁸
##            <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
## 1            154      31      74      11      37      72      70     228      52
## 2            189      30      55      25       8     110      83     147      15
## 3            185      31      53      24      23     100      92     180      22
## 4            216      46      57      19      27     103      98     249      41
## 5            195      39      53      23      20      98     102     148      24
## # … with 6 more variables: away_tackles <dbl>, away_inside50s <dbl>,
## #   away_rebound50s <dbl>, away_contestedPossessions <dbl>,
## #   away_uncontestedPossessions <dbl>, result <fct>, and abbreviated variable
## #   names ¹home_marks, ²home_tackles, ³home_inside50s, ⁴home_rebound50s,
## #   ⁵home_contestedPossessions, ⁶home_uncontestedPossessions, ⁷away_disposals,
## #   ⁸away_marks
With our data ready, we can now begin modelling. We will load the
tidymodels package and build the model: first we split the data,
then we create the model specification.
library(tidymodels)
# Splitting the data
w_split <- initial_split(w_dataFinal)
w_train <- training(w_split)
w_test <- testing(w_split)
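Note that initial_split() allocates rows at random (three-quarters to training by default), so results will differ between runs; the outputs shown below came from the unseeded split above. A minimal sketch of a reproducible, stratified alternative:
# Optional: seed the RNG so the same rows land in each split, and stratify on
# the outcome so the train/test class balance is similar (draws are rare, so
# rsample may warn and pool the small stratum)
set.seed(2022)
w_split <- initial_split(w_dataFinal, prop = 0.75, strata = result)
w_train <- training(w_split)
w_test <- testing(w_split)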
# Creating Model Specification
w_spec <- boost_tree() %>%
  # Set the model to run the xgboost algorithm
  set_engine("xgboost") %>%
  # Type of model
  set_mode("classification")
We will now fit the model to the training data using a formula.
w_fit <- w_spec %>%
  fit(result ~ ., data = w_train)
Our model has now been created, with the algorithm learning from our training data. We can evaluate the model by comparing accuracy and confusion matrices between the training and testing splits.
# Train data
w_fit %>%
  augment(new_data = w_train) %>%
  conf_mat(truth = result, estimate = .pred_class)
##           Truth
## Prediction   0 0.5   1
##        0    90   0   0
##        0.5   0   3   0
##        1     0   0 104
# Test data
w_fit %>%
  augment(new_data = w_test) %>%
  conf_mat(truth = result, estimate = .pred_class)
##           Truth
## Prediction  0 0.5  1
##        0   20   0  9
##        0.5  0   0  0
##        1    8   1 28
# Train data
w_fit %>%
  augment(new_data = w_train) %>%
  accuracy(truth = result, estimate = .pred_class)
## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy multiclass         1
# Test data
w_fit %>%
  augment(new_data = w_test) %>%
  accuracy(truth = result, estimate = .pred_class)
## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy multiclass     0.727
Our model appears to be 100% accurate on the training data. This is expected: the model is overfit, since the match outcome is being predicted from full-match statistics that largely determine the result. The framework for creating and evaluating the model is still valid when applied to more appropriate data. However, this model was created with default hyperparameters, so we will tune the hyperparameters in order to establish a better model.
We will create a new model specification, this time setting the hyperparameters which we would like to tune.
w_spec2 <- boost_tree(min_n = tune(),
                      learn_rate = tune(),
                      tree_depth = tune()) %>%
  set_engine("xgboost") %>%
  set_mode("classification")
Next, we create a grid of candidate values for the parameters being tuned.
grid <- grid_regular(tree_depth(),
                     min_n(),
                     learn_rate(),
                     # levels sets how many values of each parameter to try;
                     # fewer levels means less computation on slower machines
                     levels = 5)
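A regular grid with 5 levels of 3 parameters yields 5³ = 125 candidate combinations. If that is too slow, a space-filling design is a common alternative; a sketch using dials' grid_latin_hypercube() (the name grid_lh and the size of 30 are arbitrary choices):
# Alternative: a space-filling design covers the parameter ranges with far
# fewer candidates than a full regular grid
grid_lh <- grid_latin_hypercube(tree_depth(),
                                min_n(),
                                learn_rate(),
                                size = 30)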
We will now create a workflow with our model and a formula. A workflow is another way to build a model and is especially useful when there are many pre-processing steps; a sketch of that pattern follows the code below.
w_wf <- workflow() %>%
  add_formula(result ~ .) %>%
  add_model(w_spec2)
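For instance, if we later wanted pre-processing, a recipe could replace the bare formula. A hypothetical sketch (w_rec and w_wf_rec are illustrative names, and the normalization step is illustrative only, since boosted trees do not require it):
# Hypothetical: swap the formula for a recipe to attach pre-processing steps
w_rec <- recipe(result ~ ., data = w_train) %>%
  step_normalize(all_numeric_predictors())
w_wf_rec <- workflow() %>%
  add_recipe(w_rec) %>%
  add_model(w_spec2)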
Create folds for our cross-validation. As with the levels in the parameter grid, use fewer folds on slower machines.
folds <- vfold_cv(w_train, v = 5)
Using tune_grid(), we can now tune the hyperparameters specified in
our workflow (w_wf), searching over the grid we created with the
resampled folds.
w_res <- tune_grid(
  w_wf,
  resamples = folds,
  grid = grid
)
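Note that 125 candidates across 5 folds means 625 model fits, which can take a while. tune_grid() will use a parallel backend if one is registered; a sketch assuming the doParallel package is installed (cores = 4 is an arbitrary choice):
# Optional: register a parallel backend before calling tune_grid() so the
# resamples are fitted in parallel
library(doParallel)
registerDoParallel(cores = 4)
# ... call tune_grid() as above ...
stopImplicitCluster()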
To select the best combination of hyperparameters, we will evaluate based on the area under the ROC curve ('roc_auc'); other metrics could also be used.
show_best(w_res, metric = "roc_auc")
## # A tibble: 5 × 9
##   min_n tree_depth learn_rate .metric .estimator  mean     n std_err .config
##   <int>      <int>      <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>
## 1     2          4        0.1 roc_auc hand_till  0.819     5  0.0275 Preprocess…
## 2    21          4        0.1 roc_auc hand_till  0.797     5  0.0327 Preprocess…
## 3    21          8        0.1 roc_auc hand_till  0.797     5  0.0327 Preprocess…
## 4    21         11        0.1 roc_auc hand_till  0.797     5  0.0327 Preprocess…
## 5    21         15        0.1 roc_auc hand_till  0.797     5  0.0327 Preprocess…
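Before committing to one combination, the full set of results can also be inspected as a tibble or plotted:
# All candidate results, one row per combination and metric
collect_metrics(w_res)
# Visualise performance across the tuned hyperparameters
autoplot(w_res)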
With the best hyperparameters found, we can now finalize our workflow and fit the final model. We will also evaluate the model as we did above, comparing training and testing accuracy and confusion matrices.
final_w <- finalize_workflow(
  w_wf,
  select_best(w_res, metric = "roc_auc")
) %>%
  fit(data = w_train)
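As an aside, tidymodels also provides last_fit(), which fits the finalized workflow on the training set and evaluates it on the test set in a single step; a sketch of the equivalent (final_res is an illustrative name):
# Equivalent convenience: fit on the training data and score the test set
final_res <- last_fit(
  finalize_workflow(w_wf, select_best(w_res, metric = "roc_auc")),
  w_split
)
collect_metrics(final_res)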
# Train data
final_w %>%
  augment(new_data = w_train) %>%
  conf_mat(truth = result, estimate = .pred_class)
##           Truth
## Prediction  0 0.5  1
##        0   86   2  8
##        0.5  0   0  0
##        1    4   1 96
# Test data
final_w %>%
  augment(new_data = w_test) %>%
  conf_mat(truth = result, estimate = .pred_class)
##           Truth
## Prediction  0 0.5  1
##        0   21   1  9
##        0.5  0   0  1
##        1    7   0 27
# Train data
final_w %>%
  augment(new_data = w_train) %>%
  accuracy(truth = result, estimate = .pred_class)
## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy multiclass     0.924
# Test data
final_w %>%
  augment(new_data = w_test) %>%
  accuracy(truth = result, estimate = .pred_class)
## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy multiclass     0.727
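Accuracy treats each prediction as simply right or wrong. Class-probability metrics such as ROC AUC (used above for tuning) can also be computed on the test set; a sketch assuming augment() adds one .pred_* probability column per factor level, as yardstick expects:
# ROC AUC on the test set, computed from the predicted class probabilities
final_w %>%
  augment(new_data = w_test) %>%
  roc_auc(truth = result, .pred_0, .pred_0.5, .pred_1)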
To make our model more applicable to predicting AFLW games, we could perform more feature engineering, select other variables, or create our own metrics. Adjustments such as which hyperparameters to tune or how large to make the splits are easy to change, but the overall steps of creating the model remain the same.