Logistic Regression is used to model the probability of a binary outcome (an outcome with two possible values, such as yes/no or 0/1) based on one or more predictor variables.
The formula for creating a Logistic Regression model in R is as follows:
glm(y ~ x, data = dataset, family = binomial)
Where y is the binary dependent variable, x represents the independent variable(s), dataset is the data frame containing the data to be analysed, and family = binomial specifies a binomial error distribution with a logit link, i.e. that we are performing logistic regression.
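As a minimal sketch of this syntax on data everyone has (not part of the AFL analysis), the example below uses R's built-in mtcars dataset, with transmission type am (already coded 0/1) as the binary outcome and fuel economy mpg as the predictor:
# Toy logistic regression: probability of a manual transmission (am = 1) given mpg
toy_model <- glm(am ~ mpg, data = mtcars, family = binomial)
# Coefficients are on the log-odds scale; exponentiate to obtain odds ratios
summary(toy_model)
exp(coef(toy_model))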
For this example, we will use AFL data from the fitzRoy
package to predict whether a player will receive Brownlow votes based on
their performance statistics. If you don’t have the required packages
installed, you can do so using the following commands:
install.packages("fitzRoy")
install.packages("tidyverse")
install.packages("sjPlot")
install.packages("performance")
install.packages("caret")
install.packages("patchwork")
install.packages("pROC")
library(fitzRoy)
library(tidyverse)
library(sjPlot)
library(performance)
library(caret)
library(patchwork)
library(pROC)
Load 2025 AFL data
# Fetch player statistics from the 2025 AFL season
afl_data <- fetch_player_stats_afltables(season = 2025)
In Logistic Regression, the dependent variable must be binary. Here, we will create a binary outcome indicating whether a player received Brownlow votes (1) or not (0), along with a binary predictor indicating whether a player had 30 or more disposals (1) or not (0).
# Create binary variables for Brownlow votes and 30+ disposals
afl_data <- afl_data %>%
  mutate(
    got_brownlow_votes = ifelse(Brownlow.Votes > 0, 1, 0),
    high_disposals = ifelse(Disposals >= 30, 1, 0)
  )
# Check to see how many high disposal outputs were recorded in 2025 games
table(afl_data$high_disposals)
##
## 0 1
## 9554 428
To begin, create a Simple Logistic Regression model using only the high_disposals variable to predict whether a player receives Brownlow votes.
# Create simple logistic regression model
log_model1 <- glm(got_brownlow_votes ~ high_disposals,
                  data = afl_data,
                  family = binomial)
# Display model summary
log_model1_summary <- tab_model(log_model1,
                                dv.labels = "Received Brownlow Votes")
log_model1_summary
Received Brownlow Votes

| Predictors | Odds Ratios | CI | p |
|---|---|---|---|
| (Intercept) | 0.05 | 0.04 – 0.05 | <0.001 |
| high disposals | 23.68 | 19.03 – 29.49 | <0.001 |
| Observations | 9522 | | |
| R2 Tjur | 0.155 | | |
The Logistic Regression output above shows that having 30 or more disposals is a strong predictor of receiving Brownlow votes (OR = 23.68, 95% CI: 19.03 – 29.49, p < 0.001). In other words, the odds of receiving Brownlow votes are nearly 24 times higher for players who accumulate 30 or more disposals in a game than for those with fewer disposals. On its own, the high disposals variable explains 15.5% of the variation in Brownlow votes received (Tjur's R² = 0.155).
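The odds ratios and confidence intervals in the table come from exponentiating the model's coefficients, which glm() reports on the log-odds scale. A quick sketch to reproduce them from log_model1 (the intervals may differ slightly from tab_model's, depending on the CI method it uses):
# Coefficients on the log-odds scale
coef(log_model1)
# Exponentiate to obtain odds ratios (should be close to 0.05 and 23.68 from the table)
exp(coef(log_model1))
# Approximate 95% confidence intervals on the odds-ratio scale
exp(confint(log_model1))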
We can visualise the results of our Logistic Regression model:
log_model_plot1 <- plot_model(log_model1, type = "pred", terms = "high_disposals",
                              title = "Probability of Receiving Brownlow Votes",
                              axis.title = c("30+ Disposals", "Probability of Votes"))
log_model_plot1
Or, for a starker contrast, we can use a bar chart to compare the two probabilities of receiving Brownlow votes:
# Calculate probabilities
prob_low <- predict(log_model1, newdata = data.frame(high_disposals = 0), type = "response")
prob_high <- predict(log_model1, newdata = data.frame(high_disposals = 1), type = "response")
# Create data frame
plot_data <- data.frame(
  Category = c("< 30 Disposals", "30+ Disposals"),
  Probability = c(prob_low, prob_high)
)
# Plot
log_model_plot2 <- ggplot(plot_data, aes(x = Category, y = Probability, fill = Category)) +
  geom_col(width = 0.6) +
  geom_text(aes(label = paste0(round(Probability * 100, 1), "%")),
            vjust = -0.5, size = 5) +
  scale_y_continuous(labels = scales::percent, limits = c(0, 0.6)) +
  scale_fill_manual(values = c("lightcoral", "steelblue")) +
  labs(title = "Impact of High Disposals on Brownlow Vote Probability",
       x = "",
       y = "Probability of Receiving Votes") +
  theme_minimal() +
  theme(legend.position = "none",
        plot.title = element_text(hjust = 0.5, size = 13, face = "bold"))
log_model_plot2
We can see here that having fewer than 30 disposals results in a very low probability (4.5%) of receiving Brownlow votes, whereas having 30 or more disposals increases this probability substantially, to approximately 52.6%.
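As a quick sanity check, these two probabilities can be recovered directly from the model coefficients with the inverse-logit function plogis():
# Convert log-odds to probabilities for each group
b <- coef(log_model1)
plogis(b[1])        # < 30 disposals: intercept only (~4.5%)
plogis(b[1] + b[2]) # 30+ disposals: intercept plus coefficient (~52.6%)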
To improve the model, we can add further predictor variables:
# Create multiple logistic regression model
log_model2 <- glm(got_brownlow_votes ~ high_disposals + Goals + Tackles,
                  data = afl_data,
                  family = binomial)
# Display model summary
log_model2_summary <- tab_model(log_model2,
                                dv.labels = "Received Brownlow Votes")
log_model2_summary
Received Brownlow Votes

| Predictors | Odds Ratios | CI | p |
|---|---|---|---|
| (Intercept) | 0.01 | 0.01 – 0.01 | <0.001 |
| high disposals | 29.92 | 23.50 – 38.18 | <0.001 |
| Goals | 2.14 | 2.00 – 2.30 | <0.001 |
| Tackles | 1.24 | 1.20 – 1.29 | <0.001 |
| Observations | 9522 | | |
| R2 Tjur | 0.246 | | |
High disposal games remain an important variable: high_disposals, Goals and Tackles are all significant predictors of receiving Brownlow votes (p < 0.001 for all). According to this model, the odds of polling a Brownlow vote are nearly 30 times higher for players with 30 or more Disposals than for those with fewer, holding Goals and Tackles constant (OR = 29.92, 95% CI: 23.50 – 38.18). Each Goal roughly doubles the odds (OR = 2.14, 95% CI: 2.00 – 2.30), while each Tackle increases the odds of receiving a Brownlow vote by 24% (OR = 1.24, 95% CI: 1.20 – 1.29). This model explains 24.6% of the variation in Brownlow votes received.
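The percentage interpretations above follow directly from the odds ratios, since the change in the odds per one-unit increase is (OR - 1) x 100; a short sketch using log_model2:
# Odds ratios for the multiple logistic regression
or2 <- exp(coef(log_model2))
# Percentage change in the odds per unit increase
# (should be close to the table above, e.g. Goals ~ +114%, Tackles ~ +24%)
round((or2 - 1) * 100, 1)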
p1 <- plot_model(log_model2, type = "pred", terms = "high_disposals",
                 axis.title = c("30+ Disposals", "Received Brownlow Votes"))
p2 <- plot_model(log_model2, type = "pred", terms = "Goals",
                 axis.title = c("Effect of Goals", "Received Brownlow Votes"))
p3 <- plot_model(log_model2, type = "pred", terms = "Tackles",
                 axis.title = c("Effect of Tackles", "Received Brownlow Votes"))
# Display plots
multiple_pred <- p1/p2/p3
multiple_pred
From the visualisations above, we can see the individual effect of each predictor variable on the probability of receiving Brownlow votes. High disposal games have a substantial impact. Goals show a steep, compounding effect: 0 goals corresponds to around a 5% probability of receiving votes, 6 goals to around a 65% chance, and 10 goals to nearly 100%. Tackles also show a positive relationship, but the effect is more modest compared with high disposals and Goals; for games with a large number of tackles (e.g. 15+), the probability of receiving votes is around 40%.
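These approximate figures can be checked numerically with predict(). The sketch below holds high_disposals at 0 and Tackles at their sample mean, which only roughly mirrors how plot_model() conditions on the other predictors, so the values may differ slightly from the plotted curve:
# Predicted probability of votes at selected goal tallies,
# holding high_disposals = 0 and Tackles at the sample mean
new_goals <- data.frame(
  high_disposals = 0,
  Goals = c(0, 3, 6, 10),
  Tackles = mean(afl_data$Tackles, na.rm = TRUE)
)
predict(log_model2, newdata = new_goals, type = "response")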
Instead of using a binary threshold (30+ disposals), we can further explore using Disposals as a continuous variable to see how each additional Disposal affects the probability of receiving votes.
# Create model with continuous disposals
log_model3 <- glm(got_brownlow_votes ~ Disposals + Goals + Tackles,
                  data = afl_data,
                  family = binomial)
# Display model summary
log_model3_summary <- tab_model(log_model3)
log_model3_summary
got brownlow votes

| Predictors | Odds Ratios | CI | p |
|---|---|---|---|
| (Intercept) | 0.00 | 0.00 – 0.00 | <0.001 |
| Disposals | 1.30 | 1.27 – 1.32 | <0.001 |
| Goals | 2.78 | 2.55 – 3.04 | <0.001 |
| Tackles | 1.13 | 1.09 – 1.17 | <0.001 |
| Observations | 9522 | | |
| R2 Tjur | 0.346 | | |
From this model, we can see that each additional Disposal increases the odds of receiving a Brownlow vote by 30% (OR = 1.30), with each Goal increasing the odds by 178% (OR = 2.78) and each Tackle by 13% (OR = 1.13). With a Tjur's R² of 0.346, this model achieves the best fit of the three models so far, its predictors explaining 34.6% of the variation in Brownlow votes received.
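Because the effect is multiplicative on the odds scale, the per-disposal odds ratio compounds over larger differences in disposals; a quick sketch:
# Per-disposal odds ratio from Model 3 (~1.30)
or_disp <- exp(coef(log_model3)["Disposals"])
or_disp
# Compounded over 10 extra disposals: roughly a 13-14x increase in the odds
or_disp^10
# Equivalent calculation directly on the log-odds scale
exp(coef(log_model3)["Disposals"] * 10)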
We can also visualise the effect of Disposals as a continuous variable:
# Visualize the effect of disposals
plot_model3 <- plot_model(log_model3, type = "pred", terms = "Disposals [all]",
                          axis.title = c("Disposals", "Probability of Receiving Brownlow Votes"))
print(plot_model3)
The sigmoid shape of the curve is typical of Logistic Regression and highlights the non-linear relationship between Disposals and the probability of receiving Brownlow votes. The probability begins to rise noticeably from around 25 Disposals, with the steepest rise (the inflection point) occurring between roughly 30 and 35 Disposals. Below 20 Disposals the probability of receiving votes is minimal (<5%), while by around 45 Disposals the probability approaches its plateau near 100%.
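The disposal count at which the predicted probability crosses 50% can be approximated by solving for where the linear predictor equals zero. The sketch below holds Goals and Tackles at their sample means, which is roughly how the marginal-effects curve above is drawn, so the result should sit near the steepest section of that curve:
# Solve b0 + b_D * D + b_G * mean(Goals) + b_T * mean(Tackles) = 0 for D
b <- coef(log_model3)
mean_goals <- mean(afl_data$Goals, na.rm = TRUE)
mean_tackles <- mean(afl_data$Tackles, na.rm = TRUE)
inflection_disposals <- -(b["(Intercept)"] +
                            b["Goals"] * mean_goals +
                            b["Tackles"] * mean_tackles) / b["Disposals"]
inflection_disposals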
To assess which model predicts best, we can compare the three of them together:
# Compare model performance
log_model_comp <- tab_model(log_model1, log_model2, log_model3,
                            dv.labels = c("Model 1: Simple",
                                          "Model 2: Binary + G/T",
                                          "Model 3: Continuous"))
log_model_comp
| Predictors | Odds Ratios (Model 1: Simple) | CI | p | Odds Ratios (Model 2: Binary + G/T) | CI | p | Odds Ratios (Model 3: Continuous) | CI | p |
|---|---|---|---|---|---|---|---|---|---|
| (Intercept) | 0.05 | 0.04 – 0.05 | <0.001 | 0.01 | 0.01 – 0.01 | <0.001 | 0.00 | 0.00 – 0.00 | <0.001 |
| high disposals | 23.68 | 19.03 – 29.49 | <0.001 | 29.92 | 23.50 – 38.18 | <0.001 | | | |
| Goals | | | | 2.14 | 2.00 – 2.30 | <0.001 | 2.78 | 2.55 – 3.04 | <0.001 |
| Tackles | | | | 1.24 | 1.20 – 1.29 | <0.001 | 1.13 | 1.09 – 1.17 | <0.001 |
| Disposals | | | | | | | 1.30 | 1.27 – 1.32 | <0.001 |
| Observations | 9522 | | | 9522 | | | 9522 | | |
| R2 Tjur | 0.155 | | | 0.246 | | | 0.346 | | |
# Performance comparison
log_model_comparison <- compare_performance(log_model1, log_model2, log_model3)
log_model_comparison
## # Comparison of Model Performance Indices
##
## Name | Model | AIC (weights) | AICc (weights) | BIC (weights)
## ---------------------------------------------------------------------
## log_model1 | glm | 3896.8 (<.001) | 3896.8 (<.001) | 3911.2 (<.001)
## log_model2 | glm | 3363.5 (<.001) | 3363.5 (<.001) | 3392.1 (<.001)
## log_model3 | glm | 2693.6 (>.999) | 2693.6 (>.999) | 2722.3 (>.999)
##
## Name | Tjur's R2 | RMSE | Sigma | Log_loss | Score_log | Score_spherical | PCP
## ---------------------------------------------------------------------------------------
## log_model1 | 0.155 | 0.227 | 1.000 | 0.204 | -46.646 | 0.004 | 0.897
## log_model2 | 0.246 | 0.214 | 1.000 | 0.176 | -Inf | 0.004 | 0.908
## log_model3 | 0.346 | 0.200 | 1.000 | 0.141 | -Inf | 0.005 | 0.920
Based on its lowest AIC, in combination with the lowest RMSE and highest Tjur's R², we can conclude that Model 3 (using Disposals as a continuous variable along with Goals and Tackles) is the best-performing model for predicting whether a player will receive Brownlow votes based on their performance statistics.
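Because Model 1 is nested within Model 2 (and both are fit to the same 9,522 observations), a likelihood-ratio test offers another way to confirm that adding Goals and Tackles improves the fit; Models 2 and 3 are not nested (binary versus continuous disposals), so the AIC comparison above remains the appropriate tool there. A brief sketch:
# Likelihood-ratio test: does adding Goals and Tackles improve on Model 1?
anova(log_model1, log_model2, test = "Chisq")
# AIC for all three models (lower is better; should echo compare_performance() above)
AIC(log_model1, log_model2, log_model3)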
Next, we’ll evaluate how well the best model (Model 3) performs in predicting Brownlow votes.
# Extract data used in model and add predictions
model_results <- log_model3$model
model_results$predicted_prob <- predict(log_model3, type = "response")
model_results$predicted_votes <- ifelse(model_results$predicted_prob > 0.5, 1, 0)
# Create confusion matrix using model data
conf_matrix <- confusionMatrix(factor(model_results$predicted_votes),
                               factor(model_results$got_brownlow_votes),
                               positive = "1")
print(conf_matrix)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 8808 404
## 1 93 217
##
## Accuracy : 0.9478
## 95% CI : (0.9431, 0.9522)
## No Information Rate : 0.9348
## P-Value [Acc > NIR] : 6.027e-08
##
## Kappa : 0.4419
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.34944
## Specificity : 0.98955
## Pos Pred Value : 0.70000
## Neg Pred Value : 0.95614
## Prevalence : 0.06522
## Detection Rate : 0.02279
## Detection Prevalence : 0.03256
## Balanced Accuracy : 0.66949
##
## 'Positive' Class : 1
##
Accuracy - Based on the confusion matrix, the model correctly classifies 94.8% of cases
Sensitivity - The model correctly identifies 34.9% of actual vote getters. While this may seem low, with only 3 players per game receiving votes (a base rate of about 6.5%), this is roughly 5 times better than chance
Specificity - The model correctly identifies 99% of players who won’t receive a vote
Positive Predictive Value - When the model predicts a player will receive votes, it is correct 70% of the time
Each of these figures can be reproduced directly from the confusion-matrix counts, as sketched below.
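A minimal sketch of those calculations, using the counts from the confusion matrix above:
# Counts from the confusion matrix (Reference = actual outcome, Prediction = model)
TP <- 217  # predicted votes, actually received votes
FP <- 93   # predicted votes, did not receive votes
FN <- 404  # predicted no votes, actually received votes
TN <- 8808 # predicted no votes, did not receive votes
TP / (TP + FN)                  # Sensitivity ~ 0.349
TN / (TN + FP)                  # Specificity ~ 0.990
TP / (TP + FP)                  # Positive predictive value ~ 0.700
(TP + TN) / (TP + TN + FP + FN) # Accuracy ~ 0.948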
# Create ROC curve
roc_obj <- roc(model_results$got_brownlow_votes, model_results$predicted_prob)
# Plot ROC curve
roc <- plot(roc_obj, main = "ROC Curve for Brownlow Vote Prediction",
            col = "blue", lwd = 2)
# Calculate AUC
auc_value <- auc(roc_obj)
# Display AUC
auc_value
## Area under the curve: 0.925
The ROC curve shows excellent differentiation between vote-receivers and non-vote-receivers, with an AUC of 0.925. This high AUC indicates the model successfully identifies vote-worthy performances, even though predicting the exact 3 players chosen by the umpires remains challenging.
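As a possible extension, pROC's coords() can suggest a probability threshold that balances sensitivity and specificity better than the default 0.5 cut-off used for the confusion matrix above; this is a hedged sketch, since the right threshold ultimately depends on the cost assigned to each type of error:
# Threshold that maximises sensitivity + specificity (Youden's J statistic)
coords(roc_obj, "best",
       ret = c("threshold", "sensitivity", "specificity"),
       best.method = "youden")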
Finally, we can check for any potential issues with our model using diagnostic plots.
# Check model diagnostics
check_model(log_model3)
Posterior Predictive Check (Top Left): The model-predicted distribution (blue) closely matches the observed data distribution (green) for both vote categories, indicating the model appropriately captures the structure of the data.
Binned Residuals (Top Right): Most blue points fall within the error bounds, however there is a slight pattern with some red points falling outside the bounds at higher probability levels. This suggests minor model misspecification at the extremes.
Influential Observations (Middle Left): All points fall within the Cook’s distance contour lines (0.8). No individual observations are excessively affecting the model.
Collinearity (Middle Right): All three predictors show VIF values well below 5 (all around 1), confirming no multicollinearity issues. Disposals, Goals, and Tackles contribute independently to the model.
Uniformity of Residuals (Bottom Left): The QQ plot shows points closely following the diagonal line, indicating residuals are approximately uniformly distributed as expected.
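Each of these checks can also be run (and inspected numerically) with individual functions from the performance package; a brief sketch:
# Variance inflation factors for Disposals, Goals and Tackles
check_collinearity(log_model3)
# Numeric summary behind the binned-residuals panel
binned_residuals(log_model3)
# Flag any unusually influential observations
check_outliers(log_model3)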
This analysis demonstrates the key steps in logistic regression modelling for binary outcomes:
Model Development: Progressed from a single binary predictor to multiple predictors, including continuous variables
Model Comparison: Selected the best model based on AIC, RMSE and pseudo R² values
Model Evaluation: Assessed performance through confusion matrices, ROC curves, and diagnostic plots
Practical Application: Disposals, Goals, and Tackles are all important variables for predicting whether a player will receive Brownlow votes, though their influence differs in magnitude
The continuous model (Model 3) provided the best balance of fit and interpretability for this analysis. Model diagnostics confirmed that logistic regression was appropriate for these data.