Logistic regression models are generally used to predict binary classification outcomes; however, they can also be used for outcomes with multiple classes. In this example, we will use AFL statistics from the 2022 and 2023 seasons to predict the outcomes of the 2023 finals series games.
library(fitzRoy)
library(tidyverse)
library(sjPlot)
library(caret)
Scrape player statistics from AFL Tables for the 2022 and 2023 seasons using the fetch_player_stats_afltables function from the fitzRoy package.
In the next part, gather the statistics for each game by team and summarise some chosen statistics. Here Marks Inside 50, Disposals, Clearances and Kicks have been selected; however, you can choose any other statistics available in the data set.
Finally, use mutate to add a new column called Outcome that describes who won each game: 1 for a home win, 0 for an away win and 0.5 for a draw.
# load data
afl <- fetch_player_stats_afltables(season = 2022:2023)
# clean data & summarise for teams and games
# (scores are identical on every player row of a game, so take the first value)
afl2 <- afl %>%
  group_by(Season, Round, Playing.for, Home.team, Away.team) %>%
  summarise(Home.score = first(Home.score),
            Away.score = first(Away.score),
            Marks.Inside.50 = sum(Marks.Inside.50),
            Disposals = sum(Disposals),
            Clearances = sum(Clearances),
            Kicks = sum(Kicks),
            .groups = "drop")
# calculate outcome (1 = home win, 0 = away win, 0.5 = draw)
afl2 <- afl2 %>%
  mutate(Outcome = ifelse(Home.score > Away.score, 1,
                          ifelse(Home.score < Away.score, 0, 0.5)))
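The same three-way encoding can also be written with dplyr's case_when, which some readers find easier to scan than nested ifelse calls. The following is a minimal sketch on invented scores; toy_games is a hypothetical data frame and not part of the AFL data.

```r
library(dplyr)

# Hypothetical scores for three games (not real AFL results)
toy_games <- tibble(
  Home.score = c(90, 75, 80),
  Away.score = c(70, 75, 95)
)

toy_games <- toy_games %>%
  mutate(Outcome = case_when(
    Home.score > Away.score ~ 1,   # home win
    Home.score < Away.score ~ 0,   # away win
    TRUE                    ~ 0.5  # draw
  ))

toy_games$Outcome
```

Conditions are checked in order, so the final TRUE branch only catches games where the scores are equal.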
Split the data into home and away rows to get separate statistics for each side, and prefix each statistic name with ‘Home.’ or ‘Away.’. In the home statistics, keep Season, Round, the team names and Outcome along with the chosen statistics. For the away statistics, keep the chosen statistics plus the columns that identify the game (Season, Round, Home.team and Away.team). Joining on those identifying columns matches each home row to the away row from the same game, regardless of how either data frame is sorted. This puts every game onto one line with the Season, Round, Home.team, Away.team, Outcome and the home and away game statistics.
# get home team stats
home <- afl2 %>%
  filter(Playing.for == Home.team) %>%
  select(-c(Playing.for, Home.score, Away.score)) %>%
  rename_with(~ paste0("Home.", .), Marks.Inside.50:Kicks)
# get away team stats, keeping the game keys so rows can be matched
away <- afl2 %>%
  filter(Playing.for == Away.team) %>%
  select(Season, Round, Home.team, Away.team, Marks.Inside.50:Kicks) %>%
  rename_with(~ paste0("Away.", .), Marks.Inside.50:Kicks)
# combine the two data frames, matching each game on its keys
afl_total <- inner_join(home, away,
                        by = c("Season", "Round", "Home.team", "Away.team"))
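Row order is easy to get wrong when combining two data frames, so matching on the columns that identify a game is a safer pattern than relying on position. The toy sketch below uses invented team names and numbers (home_toy and away_toy are hypothetical) and dplyr's inner_join to show that the match does not depend on how either data frame is sorted.

```r
library(dplyr)

# Hypothetical two-game round; team names and numbers are made up
home_toy <- tibble(
  Round = "1",
  Home.team = c("Alpha", "Bravo"),
  Away.team = c("Zulu", "Charlie"),
  Home.Kicks = c(200, 210)
)

# Same games, but listed in a different row order
away_toy <- tibble(
  Round = "1",
  Home.team = c("Bravo", "Alpha"),
  Away.team = c("Charlie", "Zulu"),
  Away.Kicks = c(190, 220)
)

# Matching on the game keys is independent of row order
joined <- inner_join(home_toy, away_toy,
                     by = c("Round", "Home.team", "Away.team"))
joined
```

Binding these two toy data frames by position would pair Alpha's home kicks with the away kicks from the Bravo game, which is the kind of silent mismatch a keyed join avoids.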
Split the data into a training data set that includes all games apart from the 2023 finals series games, and testing data that includes the 2023 finals series games.
# split training and testing data
train <- afl_total %>%
  filter(!(Season == 2023 & Round %in% c("EF", "QF", "SF", "PF", "GF")))
test <- afl_total %>%
  filter(Season == 2023 & Round %in% c("EF", "QF", "SF", "PF", "GF"))
Build as many models as you would like in this section. These are some examples.
The tab_model function from the sjPlot package can be used to investigate the coefficients and the significance of the variables included.
# build models
model1 <- glm(formula = Outcome ~ Home.Marks.Inside.50 + Away.Marks.Inside.50,
              family = "binomial",
              data = train)
model2 <- glm(formula = Outcome ~ Home.Marks.Inside.50 + Away.Marks.Inside.50 + Home.Disposals +
                Away.Disposals + Home.Kicks + Away.Kicks,
              family = "binomial",
              data = train)
model3 <- glm(formula = Outcome ~ Home.Marks.Inside.50 + Away.Marks.Inside.50 + Home.Disposals +
                Away.Disposals + Home.Kicks + Away.Kicks + Home.Clearances + Away.Clearances,
              family = "binomial",
              data = train)
# explore coefficients
tab_model(model1, model2, model3)
| Predictors | Outcome (model1): Odds Ratios | Outcome (model1): CI | Outcome (model1): p | Outcome (model2): Odds Ratios | Outcome (model2): CI | Outcome (model2): p | Outcome (model3): Odds Ratios | Outcome (model3): CI | Outcome (model3): p |
|---|---|---|---|---|---|---|---|---|---|
| (Intercept) | 0.06 | 0.02 – 0.16 | <0.001 | 0.00 | 0.00 – 0.00 | <0.001 | 0.00 | 0.00 – 0.00 | <0.001 |
| Home Marks Inside 50 | 1.36 | 1.27 – 1.46 | <0.001 | 1.36 | 1.26 – 1.48 | <0.001 | 1.38 | 1.27 – 1.50 | <0.001 |
| Away Marks Inside 50 | 0.97 | 0.92 – 1.02 | 0.271 | 0.92 | 0.86 – 0.98 | 0.017 | 0.92 | 0.86 – 0.99 | 0.021 |
| Home Disposals | | | | 1.00 | 0.99 – 1.01 | 0.614 | 1.00 | 0.99 – 1.01 | 0.555 |
| Away Disposals | | | | 1.00 | 0.99 – 1.01 | 0.794 | 1.00 | 0.99 – 1.01 | 0.845 |
| Home Kicks | | | | 1.07 | 1.05 – 1.09 | <0.001 | 1.07 | 1.05 – 1.10 | <0.001 |
| Away Kicks | | | | 1.00 | 0.98 – 1.02 | 0.981 | 1.00 | 0.99 – 1.02 | 0.695 |
| Home Clearances | | | | | | | 1.08 | 1.04 – 1.12 | <0.001 |
| Away Clearances | | | | | | | 1.01 | 0.97 – 1.06 | 0.488 |
| Observations | 414 | | | 414 | | | 414 | | |
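The odds ratios in the table are the model coefficients after exponentiation, since glm reports coefficients on the log-odds scale. The sketch below shows that relationship on simulated data using base R only; the variables x and y and the object toy_model are invented for illustration and are not part of the AFL data.

```r
set.seed(1)

# Simulated data: x is a made-up predictor, y a made-up binary outcome
n <- 200
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.2 * x))
toy <- data.frame(x = x, y = y)

toy_model <- glm(y ~ x, family = "binomial", data = toy)

# Coefficients are on the log-odds scale; exponentiating gives odds ratios
odds_ratios <- exp(coef(toy_model))
odds_ratios

# Reading an odds ratio: a value of 1.36 multiplies the odds by 1.36 for
# each extra unit of the predictor, i.e. a 36% increase in the odds
# (not in the probability)
(1.36 - 1) * 100
```

Read this way, the odds ratio of 1.36 for Home Marks Inside 50 in model1 means each additional mark inside 50 by the home team is associated with roughly 36% higher odds of a home win, holding the other predictors fixed.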
Once the models have been built, use the AIC function to compare them. AIC (Akaike information criterion) measures how well a model fits the data while penalising the number of parameters it uses. The lower the AIC value, the better.
# compare the models using AIC
AIC(model1, model2, model3)
## df AIC
## model1 3 462.5627
## model2 7 399.5699
## model3 9 387.5670
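To make the metric concrete, AIC can be computed by hand as 2k minus twice the log-likelihood, where k is the number of estimated parameters (the df column in the output). The sketch below checks this against R's built-in AIC function on simulated data; manual_aic is a hypothetical helper and the variables x1, x2 and y are invented.

```r
set.seed(42)

# Simulated data: only x1 truly influences the made-up outcome y
n <- 150
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(0.3 + 0.8 * x1))
toy <- data.frame(y, x1, x2)

small <- glm(y ~ x1, family = "binomial", data = toy)
large <- glm(y ~ x1 + x2, family = "binomial", data = toy)

# AIC = 2k - 2 * log-likelihood, with k the number of estimated parameters
manual_aic <- function(model) {
  k <- attr(logLik(model), "df")
  2 * k - 2 * as.numeric(logLik(model))
}

# The manual value should agree with the built-in function
c(manual = manual_aic(small), builtin = AIC(small))

# Side-by-side comparison, as done for model1 to model3
AIC(small, large)
```

Because each extra parameter adds 2 to the score, a larger model only achieves a lower AIC if its improvement in fit outweighs that penalty.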
Once the best model has been selected (model3 here, as it has the lowest AIC), use it on the test data to predict Outcome.
The first step gives the predicted probability of the home team winning each game. The second step classifies each probability as a home win (1) if it is above 0.5 and as an away win (0) otherwise, so a draw (0.5) is never predicted. Once the predictions have been made, use a confusion matrix to see how many predictions were correct and how many were incorrect.
# use model on testing data to get probabilities
predict_probs <- model3 %>%
  predict(test, type = "response")
# predict outcome
predicted_classes <- ifelse(predict_probs > 0.5, 1, 0)
# view confusion matrix
conf_matrix <- confusionMatrix(factor(predicted_classes, levels = c(1, 0.5, 0)),
                               factor(test$Outcome, levels = c(1, 0.5, 0)))
conf_matrix$table
## Reference
## Prediction 1 0.5 0
## 1 4 0 1
## 0.5 0 0 0
## 0 2 0 2
We can also retrieve the accuracy of the model on the test data using the code below.
conf_matrix$overall["Accuracy"]
## Accuracy
## 0.6666667
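The accuracy figure can be reproduced by hand from the confusion matrix: it is the number of games on the diagonal (correct predictions) divided by the total number of games. The sketch below re-enters the counts from the confusion matrix output as a plain matrix; conf_table is a hypothetical name used only for this illustration.

```r
# Counts copied from the confusion matrix output
# (rows = prediction, columns = reference)
conf_table <- matrix(
  c(4, 0, 1,
    0, 0, 0,
    2, 0, 2),
  nrow = 3, byrow = TRUE,
  dimnames = list(Prediction = c("1", "0.5", "0"),
                  Reference  = c("1", "0.5", "0"))
)

# Accuracy = correctly classified games / all games
accuracy <- sum(diag(conf_table)) / sum(conf_table)
accuracy
```

Six of the nine finals were classified correctly, which is the 0.6666667 reported by caret.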
The code below puts each finals game into a table with the actual outcome and the predicted outcome. From this we can see which games were correctly and incorrectly predicted.
# build table to compare predicted vs actual
probs <- test %>%
  select(Season, Round, Home.team, Away.team, Outcome) %>%
  mutate(Predicted_Outcome = predicted_classes,
         Round = factor(Round, levels = c("EF", "QF", "SF", "PF", "GF"))) %>%
  arrange(Round)
probs
## Season Round Home.team Away.team Outcome Predicted_Outcome
## 1 2023 EF Carlton Sydney 1 0
## 2 2023 EF St Kilda Greater Western Sydney 0 0
## 3 2023 QF Brisbane Lions Port Adelaide 1 1
## 4 2023 QF Collingwood Melbourne 1 0
## 5 2023 SF Melbourne Carlton 0 1
## 6 2023 SF Port Adelaide Greater Western Sydney 0 0
## 7 2023 PF Brisbane Lions Carlton 1 1
## 8 2023 PF Collingwood Greater Western Sydney 1 1
## 9 2023 GF Collingwood Brisbane Lions 1 1