Logistic regression models are generally used to predict binary outcomes, although they can be extended to outcomes with more than two classes. In this example, we will use AFL statistics from the 2022 and 2023 seasons to predict the outcomes of the 2023 finals series games.



Prepare Data

Step 1: Load Libraries

library(fitzRoy)
library(tidyverse)
library(sjPlot)
library(caret)



Step 2: Get and Clean Data

Scrape player statistics from AFL Tables using the fetch_player_stats_afltables function from the fitzRoy package for seasons 2022 and 2023.

Next, group the player statistics by game and team, and sum the chosen statistics to get team totals. Here Marks Inside 50, Disposals, Clearances and Kicks have been selected, but you can choose any other statistics available in the data set.

Finally, add a new column called Outcome that records the result of each game from the home team's perspective.

  • 1 = Home team won
  • 0.5 = Draw
  • 0 = Home team lost (away team won)

# load data
afl <- fetch_player_stats_afltables(season = 2022:2023)


# clean data & summarise for teams and games
# (scores are constant within a game, so take the first value; sum the player stats)
afl2 <- afl %>% 
  group_by(Season, Round, Playing.for, Home.team, Away.team) %>% 
  summarise(Home.score = first(Home.score),
            Away.score = first(Away.score), 
            Marks.Inside.50 = sum(Marks.Inside.50), 
            Disposals = sum(Disposals), 
            Clearances = sum(Clearances), 
            Kicks = sum(Kicks), 
            .groups = "drop")


# calculate outcome
afl2 <- afl2 %>% 
  mutate(Outcome = ifelse(Home.score > Away.score, 1, 
                          ifelse(Home.score < Away.score, 0, 0.5)))



Step 3: Split Home and Away Statistics

Split the data into home and away subsets so that each team's statistics can be labelled. Prefix the chosen statistics with ‘Home.’ or ‘Away.’ accordingly. Both subsets keep Season, Round, Home.team and Away.team so that they can be matched game by game; the home subset also keeps Outcome. Joining on those four columns puts every game on one row with the Season, Round, Home.team, Away.team, Outcome and the home and away game statistics. A keyed join is used rather than binding the columns side by side, because the two subsets are not guaranteed to list the games in the same order.

# get home team stats
home <- afl2 %>% 
  filter(Playing.for == Home.team) %>% 
  select(-c(Playing.for, Home.score, Away.score)) %>% 
  rename_with(~ paste0("Home.", .), Marks.Inside.50:Kicks)

# get away team stats (keep the game keys so rows can be matched)
away <- afl2 %>% 
  filter(Playing.for == Away.team) %>% 
  select(Season, Round, Home.team, Away.team, Marks.Inside.50:Kicks) %>% 
  rename_with(~ paste0("Away.", .), Marks.Inside.50:Kicks)


# combine the two data frames, matching each game on its keys
afl_total <- inner_join(home, away, 
                        by = c("Season", "Round", "Home.team", "Away.team"))
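
Pairing home and away rows only works if both rows refer to the same game. The following is a minimal sketch on made-up data (the fixtures and kick counts are illustrative, not real results) showing how a join keyed on Season, Round, Home.team and Away.team matches rows even when the two subsets come out in different orders. The names games, home_toy, away_toy and joined are hypothetical.

```r
library(dplyr)

# Toy data: two games in one round, one row per team per game (values are made up)
games <- tibble(
  Season      = 2023,
  Round       = "1",
  Playing.for = c("Carlton", "Geelong", "Richmond", "Collingwood"),
  Home.team   = c("Richmond", "Geelong", "Richmond", "Geelong"),
  Away.team   = c("Carlton", "Collingwood", "Carlton", "Collingwood"),
  Kicks       = c(200, 190, 210, 220)
)

# Home rows come out as Geelong, Richmond
home_toy <- games %>%
  filter(Playing.for == Home.team) %>%
  select(Season, Round, Home.team, Away.team, Home.Kicks = Kicks)

# Away rows come out as Carlton (Richmond's game), Collingwood (Geelong's game)
away_toy <- games %>%
  filter(Playing.for == Away.team) %>%
  select(Season, Round, Home.team, Away.team, Away.Kicks = Kicks)

# The join matches on the game keys, so row order does not matter
joined <- inner_join(home_toy, away_toy,
                     by = c("Season", "Round", "Home.team", "Away.team"))
joined
```

In this toy case a positional bind would have paired Geelong's home kicks with Carlton's away kicks, whereas the join pairs Geelong with Collingwood.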



Step 4: Split Data

Split the data into a training data set that includes all games apart from the 2023 finals series games, and testing data that includes the 2023 finals series games.

# split training and testing data
train <- afl_total %>% 
  filter(!(Season == 2023 & Round %in% c("EF", "QF", "SF", "PF", "GF")))

test <- afl_total %>% 
  filter(Season == 2023 & Round %in% c("EF", "QF", "SF", "PF", "GF"))



Models

Step 1: Build Models

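Build as many candidate models as you would like in this section. The three examples below add predictors step by step. Because Outcome codes a draw as 0.5, glm will warn about non-integer successes if the training data contains a drawn game; the model is still fitted.

The tab_model function from the sjPlot package can be used to inspect the coefficients (reported as odds ratios) and the significance of the variables included.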

# build models
model1 <- glm(formula = Outcome ~ Home.Marks.Inside.50 + Away.Marks.Inside.50, 
              family = "binomial", 
              data = train)

model2 <- glm(formula = Outcome ~ Home.Marks.Inside.50 + Away.Marks.Inside.50 + Home.Disposals +
                Away.Disposals + Home.Kicks + Away.Kicks, 
              family = "binomial", 
              data = train)

model3 <- glm(formula = Outcome ~ Home.Marks.Inside.50 + Away.Marks.Inside.50 + Home.Disposals +
                Away.Disposals + Home.Kicks + Away.Kicks + Home.Clearances + Away.Clearances, 
              family = "binomial", 
              data = train)



# explore coefficients
tab_model(model1, model2, model3)
                      Outcome (model1)                  Outcome (model2)                  Outcome (model3)
Predictors            Odds Ratios  CI           p       Odds Ratios  CI           p       Odds Ratios  CI           p
(Intercept)           0.06         0.02 – 0.16  <0.001  0.00         0.00 – 0.00  <0.001  0.00         0.00 – 0.00  <0.001
Home Marks Inside 50  1.36         1.27 – 1.46  <0.001  1.36         1.26 – 1.48  <0.001  1.38         1.27 – 1.50  <0.001
Away Marks Inside 50  0.97         0.92 – 1.02  0.271   0.92         0.86 – 0.98  0.017   0.92         0.86 – 0.99  0.021
Home Disposals                                          1.00         0.99 – 1.01  0.614   1.00         0.99 – 1.01  0.555
Away Disposals                                          1.00         0.99 – 1.01  0.794   1.00         0.99 – 1.01  0.845
Home Kicks                                              1.07         1.05 – 1.09  <0.001  1.07         1.05 – 1.10  <0.001
Away Kicks                                              1.00         0.98 – 1.02  0.981   1.00         0.99 – 1.02  0.695
Home Clearances                                                                           1.08         1.04 – 1.12  <0.001
Away Clearances                                                                           1.01         0.97 – 1.06  0.488
Observations          414                               414                               414
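
The odds ratios in the table are exponentiated model coefficients. As a rough illustration, the sketch below fits a logistic regression to simulated data and converts the coefficients by hand. The variables home_marks and win and the object toy_model are hypothetical, and the intervals are Wald intervals, which may differ slightly from the ones tab_model prints.

```r
set.seed(1)

# Simulated data: win probability rises with a made-up "home_marks" count
n <- 200
home_marks <- rpois(n, 12)
win <- rbinom(n, 1, plogis(-3 + 0.25 * home_marks))

toy_model <- glm(win ~ home_marks, family = "binomial")

# Coefficients are on the log-odds scale; exponentiate to get odds ratios
odds_ratios <- exp(coef(toy_model))

# Wald confidence intervals, also moved to the odds-ratio scale
wald_ci <- exp(confint.default(toy_model))

round(cbind(OddsRatio = odds_ratios, wald_ci), 2)
```

Read this way, the odds ratio of 1.36 for Home Marks Inside 50 in the table means each additional mark inside 50 multiplies the odds of a home win by about 1.36, holding the other predictors fixed.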



Step 2: Compare Models

Once the models have been built, use the AIC function to compare them. AIC (Akaike Information Criterion) rewards goodness of fit while penalising the number of estimated parameters, so a lower value indicates a better trade-off between fit and complexity.

# compare the models using AIC
AIC(model1, model2, model3)
##        df      AIC
## model1  3 462.5627
## model2  7 399.5699
## model3  9 387.5670
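
For reference, AIC is calculated as 2k minus twice the log-likelihood, where k is the number of estimated parameters (the df column above). The sketch below checks this by hand on R's built-in mtcars data rather than the AFL data; toy_fit is a hypothetical name.

```r
# Small logistic regression on a built-in data set
toy_fit <- glm(am ~ wt, family = "binomial", data = mtcars)

# Number of estimated parameters (intercept + one slope)
k <- attr(logLik(toy_fit), "df")

# AIC = 2k - 2 * log-likelihood
manual_aic <- 2 * k - 2 * as.numeric(logLik(toy_fit))

# Compare with the built-in function
all.equal(manual_aic, AIC(toy_fit))
```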



Step 3: Predict

Once the best model has been selected, use it on the test data to predict Outcome.

The first part gives the predicted probability of the home team winning. These probabilities are then classified as a home win (1) when above 0.5 and a home loss (0) otherwise. This rule never predicts a draw (0.5), which is why the 0.5 row of the confusion matrix is empty. Once the predictions have been made, use a confusion matrix to see how many were correct and how many were incorrect.

# use model on testing data to get probabilities
predict_probs <- model3 %>% 
  predict(test, type = "response")

# predict outcome
predicted_classes <- ifelse(predict_probs > 0.5, 1, 0)

# view confusion matrix
conf_matrix <- confusionMatrix(factor(predicted_classes, levels = c(1, 0.5, 0)), 
                factor(test$Outcome, levels = c(1, 0.5, 0)))

conf_matrix$table
##           Reference
## Prediction 1 0.5 0
##        1   4   0 1
##        0.5 0   0 0
##        0   2   0 2

We can also retrieve the accuracy of the model on the test data using the code below.

conf_matrix$overall["Accuracy"]
##  Accuracy 
## 0.6666667
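
The accuracy figure can also be reproduced by hand from the confusion matrix, as a quick sanity check. The sketch below re-enters the counts printed above; cm, accuracy and home_win_recall are hypothetical names.

```r
# Counts copied from the confusion matrix above (filled column by column)
cm <- matrix(c(4, 0, 2,
               0, 0, 0,
               1, 0, 2),
             nrow = 3,
             dimnames = list(Prediction = c("1", "0.5", "0"),
                             Reference  = c("1", "0.5", "0")))

# Accuracy: correct predictions (the diagonal) over all predictions
accuracy <- sum(diag(cm)) / sum(cm)

# Share of actual home wins that were predicted as home wins
home_win_recall <- cm["1", "1"] / sum(cm[, "1"])

accuracy
home_win_recall
```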



Finals Table

This simply puts each finals game into a table with the actual and predicted outcomes, so we can see which games were predicted correctly and which were not.

# build table to compare predicted vs actual
probs <- test %>% 
  select(Season, Round, Home.team, Away.team, Outcome) %>% 
  mutate(Predicted_Outcome = predicted_classes, 
         Round = factor(Round, levels = c("EF", "QF", "SF", "PF", "GF"))) %>% 
  arrange(Round)

probs
##   Season Round      Home.team              Away.team Outcome Predicted_Outcome
## 1   2023    EF        Carlton                 Sydney       1                 0
## 2   2023    EF       St Kilda Greater Western Sydney       0                 0
## 3   2023    QF Brisbane Lions          Port Adelaide       1                 1
## 4   2023    QF    Collingwood              Melbourne       1                 0
## 5   2023    SF      Melbourne                Carlton       0                 1
## 6   2023    SF  Port Adelaide Greater Western Sydney       0                 0
## 7   2023    PF Brisbane Lions                Carlton       1                 1
## 8   2023    PF    Collingwood Greater Western Sydney       1                 1
## 9   2023    GF    Collingwood         Brisbane Lions       1                 1
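
To make the hits and misses easier to spot, a logical column can flag each game. The sketch below re-enters the values from the table above into a small data frame; the names finals and Correct are hypothetical additions.

```r
# Values copied from the finals table above
finals <- data.frame(
  Round     = c("EF", "EF", "QF", "QF", "SF", "SF", "PF", "PF", "GF"),
  Home.team = c("Carlton", "St Kilda", "Brisbane Lions", "Collingwood",
                "Melbourne", "Port Adelaide", "Brisbane Lions",
                "Collingwood", "Collingwood"),
  Outcome           = c(1, 0, 1, 1, 0, 0, 1, 1, 1),
  Predicted_Outcome = c(0, 0, 1, 0, 1, 0, 1, 1, 1)
)

# Flag games where the prediction matched the result
finals$Correct <- finals$Outcome == finals$Predicted_Outcome

# Games the model got wrong
finals[!finals$Correct, c("Round", "Home.team", "Outcome", "Predicted_Outcome")]
```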