This tutorial will walk through creating a model using machine
learning, specifically the eXtreme Gradient Boosting (XGBoost)
algorithm. We will use XGBoost to predict the outcome of AFLW matches.
In R, an XGBoost model can be built using the tidymodels
package, and we will walk through the workflow of building a model
with it.
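All of the packages used in this tutorial are available on CRAN; if you have not used them before, install them first (xgboost is the engine our model will call):
# One-time setup: install the packages used in this tutorial
install.packages(c("tidyverse", "fitzRoy", "tidymodels", "xgboost"))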
Using the fitzRoy package, we can collate team match stats; these will serve as the predictors in our model.
library(tidyverse)
library(fitzRoy)
# fetch data from 2020 to 2022
w_data <- rbind(
  fetch_player_stats(season = 2022, comp = "AFLW"),
  fetch_player_stats(season = 2021, comp = "AFLW"),
  fetch_player_stats(season = 2020, comp = "AFLW")
)
# Selecting variables which contain the match stats
w_data2 <- w_data %>%
  select(teamStatus, providerId, round.roundNumber, team.name, goals, behinds,
         disposals, marks, tackles, inside50s, rebound50s,
         clearances.totalClearances, contestedPossessions, uncontestedPossessions) %>%
  group_by(providerId, team.name) %>%
  summarise(
    status = teamStatus[1],
    across(c(goals:uncontestedPossessions), ~sum(.))
  )
# separating home and away data
w_home <- w_data2 %>%
  filter(status == 'home') %>%
  dplyr::rename_with(.cols = team.name:uncontestedPossessions, ~paste('home', ., sep = "_"))
w_away <- w_data2 %>%
  filter(status == 'away') %>%
  dplyr::rename_with(.cols = team.name:uncontestedPossessions, ~paste('away', ., sep = "_"))
# joining both team match data together
w_data3 <- left_join(w_home, w_away, by = 'providerId')
w_dataFinal <- w_data3 %>%
  ungroup() %>%
  select(-contains('status'), -contains('team')) %>%
  mutate(
    # finding match results: a goal is worth 6 points, a behind 1
    home_total = home_goals * 6 + home_behinds,
    away_total = away_goals * 6 + away_behinds,
    # result: 1 = home win, 0.5 = draw, 0 = home loss
    result = factor(ifelse(home_total == away_total, 0.5,
                           ifelse(home_total > away_total, 1, 0)))
  ) %>%
  select(-contains('goals'), -contains('behinds'), -contains('total'), -providerId)
head(w_dataFinal, 5)
## # A tibble: 5 × 15
##   home_disposals home_…¹ home_…² home_…³ home_…⁴ home_…⁵ home_…⁶ away_…⁷ away_…⁸
##            <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
## 1            154      31      74      11      37      72      70     228      52
## 2            189      30      55      25       8     110      83     147      15
## 3            185      31      53      24      23     100      92     180      22
## 4            216      46      57      19      27     103      98     249      41
## 5            195      39      53      23      20      98     102     148      24
## # … with 6 more variables: away_tackles <dbl>, away_inside50s <dbl>,
## #   away_rebound50s <dbl>, away_contestedPossessions <dbl>,
## #   away_uncontestedPossessions <dbl>, result <fct>, and abbreviated variable
## #   names ¹home_marks, ²home_tackles, ³home_inside50s, ⁴home_rebound50s,
## #   ⁵home_contestedPossessions, ⁶home_uncontestedPossessions, ⁷away_disposals,
## #   ⁸away_marks
With our data ready, we can now begin modelling. We will load the
tidymodels package and build the model: first we split the data,
then we create the model specification.
library(tidymodels)
# Splitting the data
w_split <- initial_split(w_dataFinal)
w_train <- training(w_split)
w_test <- testing(w_split)
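Note that initial_split() allocates rows at random (three-quarters to training by default), so results will differ between runs; the outputs shown below came from the unseeded split above. A minimal sketch of a reproducible, stratified alternative:
# Optional: seed the RNG so the same rows land in each split, and stratify on
# the outcome so the train/test class balance is similar (draws are rare, so
# rsample may warn and pool the small stratum)
set.seed(2022)
w_split <- initial_split(w_dataFinal, prop = 0.75, strata = result)
w_train <- training(w_split)
w_test <- testing(w_split)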
# Creating Model Specification
w_spec <- boost_tree() %>%
  # Set the model to run the xgboost algorithm
  set_engine("xgboost") %>%
  # Type of model
  set_mode("classification")
We will now fit the model to the training data using a formula.
w_fit <- w_spec %>%
  fit(result ~ ., data = w_train)
Our model has now been created, with the algorithm learning from our training data. We can evaluate the model by comparing accuracy and confusion matrices between the training and testing splits.
# Train data
w_fit %>%
  augment(new_data = w_train) %>%
  conf_mat(truth = result, estimate = .pred_class)
##           Truth
## Prediction   0 0.5   1
##        0    90   0   0
##        0.5   0   3   0
##        1     0   0 104
# Test data
w_fit %>%
  augment(new_data = w_test) %>%
  conf_mat(truth = result, estimate = .pred_class)
##           Truth
## Prediction  0 0.5  1
##        0   20   0  9
##        0.5  0   0  0
##        1    8   1 28
# Train data
w_fit %>%
  augment(new_data = w_train) %>%
  accuracy(truth = result, estimate = .pred_class)
## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy multiclass         1
# Test data
w_fit %>%
  augment(new_data = w_test) %>%
  accuracy(truth = result, estimate = .pred_class)
## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy multiclass     0.727
Our model appears to be 100% accurate on the training data. This is expected: the model is overfit, since the match outcome is being predicted from full-match statistics that largely determine the result. The framework for creating and evaluating the model is still valid when applied to more appropriate data. However, this model was created with default hyperparameters, so we will tune the hyperparameters in order to establish a better model.
We will create a new model specification, this time setting the hyperparameters which we would like to tune.
w_spec2 <- boost_tree(min_n = tune(),
                      learn_rate = tune(),
                      tree_depth = tune()) %>%
  set_engine("xgboost") %>%
  set_mode("classification")
Next, we create a grid of candidate values for the parameters being tuned.
grid <- grid_regular(tree_depth(),
                     min_n(),
                     learn_rate(),
                     # levels sets how many values of each parameter to try;
                     # fewer levels means less computation on slower machines
                     levels = 5)
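A regular grid with 5 levels of 3 parameters yields 5³ = 125 candidate combinations. If that is too slow, a space-filling design is a common alternative; a sketch using dials' grid_latin_hypercube() (the name grid_lh and the size of 30 are arbitrary choices):
# Alternative: a space-filling design covers the parameter ranges with far
# fewer candidates than a full regular grid
grid_lh <- grid_latin_hypercube(tree_depth(),
                                min_n(),
                                learn_rate(),
                                size = 30)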
We will now create a workflow with our model and a formula. A workflow is another way to build a model and is especially useful when there are many pre-processing steps; a sketch of that pattern follows the code below.
w_wf <- workflow() %>%
  add_formula(result ~ .) %>%
  add_model(w_spec2)
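For instance, if we later wanted pre-processing, a recipe could replace the bare formula. A hypothetical sketch (w_rec and w_wf_rec are illustrative names, and the normalization step is illustrative only, since boosted trees do not require it):
# Hypothetical: swap the formula for a recipe to attach pre-processing steps
w_rec <- recipe(result ~ ., data = w_train) %>%
  step_normalize(all_numeric_predictors())
w_wf_rec <- workflow() %>%
  add_recipe(w_rec) %>%
  add_model(w_spec2)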
Create folds for our cross-validation. As with the levels in the parameter grid, use fewer folds on slower machines.
folds <- vfold_cv(w_train, v = 5)
Using tune_grid(), we can now tune the hyperparameters specified in
our workflow (w_wf), searching over the grid we created with the
resampled folds.
w_res <- tune_grid(
  w_wf,
  resamples = folds,
  grid = grid
)
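Note that 125 candidates across 5 folds means 625 model fits, which can take a while. tune_grid() will use a parallel backend if one is registered; a sketch assuming the doParallel package is installed (cores = 4 is an arbitrary choice):
# Optional: register a parallel backend before calling tune_grid() so the
# resamples are fitted in parallel
library(doParallel)
registerDoParallel(cores = 4)
# ... call tune_grid() as above ...
stopImplicitCluster()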
To select the best combination of hyperparameters, we will evaluate based on the area under the ROC curve ('roc_auc'); other metrics could also be used.
show_best(w_res, metric = "roc_auc")
## # A tibble: 5 × 9
##   min_n tree_depth learn_rate .metric .estimator  mean     n std_err .config
##   <int>      <int>      <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>
## 1     2          4        0.1 roc_auc hand_till  0.819     5  0.0275 Preprocess…
## 2    21          4        0.1 roc_auc hand_till  0.797     5  0.0327 Preprocess…
## 3    21          8        0.1 roc_auc hand_till  0.797     5  0.0327 Preprocess…
## 4    21         11        0.1 roc_auc hand_till  0.797     5  0.0327 Preprocess…
## 5    21         15        0.1 roc_auc hand_till  0.797     5  0.0327 Preprocess…
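Before committing to one combination, the full set of results can also be inspected as a tibble or plotted:
# All candidate results, one row per combination and metric
collect_metrics(w_res)
# Visualise performance across the tuned hyperparameters
autoplot(w_res)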
With the best hyperparameters found, we can now finalize our workflow and fit the final model. We will also evaluate the model as we did above, comparing training and testing accuracy and confusion matrices.
final_w <- finalize_workflow(
  w_wf,
  select_best(w_res, metric = "roc_auc")
) %>%
  fit(data = w_train)
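As an aside, tidymodels also provides last_fit(), which fits the finalized workflow on the training set and evaluates it on the test set in a single step; a sketch of the equivalent (final_res is an illustrative name):
# Equivalent convenience: fit on the training data and score the test set
final_res <- last_fit(
  finalize_workflow(w_wf, select_best(w_res, metric = "roc_auc")),
  w_split
)
collect_metrics(final_res)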
# Train data
final_w %>%
  augment(new_data = w_train) %>%
  conf_mat(truth = result, estimate = .pred_class)
##           Truth
## Prediction  0 0.5  1
##        0   86   2  8
##        0.5  0   0  0
##        1    4   1 96
# Test data
final_w %>%
  augment(new_data = w_test) %>%
  conf_mat(truth = result, estimate = .pred_class)
##           Truth
## Prediction  0 0.5  1
##        0   21   1  9
##        0.5  0   0  1
##        1    7   0 27
# Train data
final_w %>%
  augment(new_data = w_train) %>%
  accuracy(truth = result, estimate = .pred_class)
## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy multiclass     0.924
# Test data
final_w %>%
  augment(new_data = w_test) %>%
  accuracy(truth = result, estimate = .pred_class)
## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy multiclass     0.727
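Accuracy treats each prediction as simply right or wrong. Class-probability metrics such as ROC AUC (used above for tuning) can also be computed on the test set; a sketch assuming augment() adds one .pred_* probability column per factor level, as yardstick expects:
# ROC AUC on the test set, computed from the predicted class probabilities
final_w %>%
  augment(new_data = w_test) %>%
  roc_auc(truth = result, .pred_0, .pred_0.5, .pred_1)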
To make our model more applicable to predicting AFLW games, we could perform more feature engineering, select other variables, or create our own metrics. Adjustments such as which hyperparameters to tune or how large to make the splits are easy to change, but the overall steps of creating the model remain the same.