Logistic Regression is used to model the probability of a binary outcome (an outcome with two possible values, such as yes/no or 0/1) based on one or more predictor variables.
The formula for creating a Logistic Regression model in R is as follows:
glm(y ~ x, data = dataset, family = binomial)
Where y is the binary dependent variable, x represents the independent variable(s), dataset is the data frame containing the data to be analysed, and family = binomial specifies a binomial error distribution with a logit link, i.e. that we are performing logistic regression.
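As a minimal sketch of this syntax on data everyone has (not part of the AFL analysis), the example below uses R's built-in mtcars dataset, with transmission type am (already coded 0/1) as the binary outcome and fuel economy mpg as the predictor:
# Toy logistic regression: probability of a manual transmission (am = 1) given mpg
toy_model <- glm(am ~ mpg, data = mtcars, family = binomial)
# Coefficients are on the log-odds scale; exponentiate to obtain odds ratios
summary(toy_model)
exp(coef(toy_model))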
For this example, we will use AFL data from the fitzRoy
package to predict whether a player will receive Brownlow votes based on
their performance statistics. If you don’t have the required packages
installed, you can do so using the following commands:
install.packages("fitzRoy")
install.packages("tidyverse")
install.packages("sjPlot")
install.packages("performance")
install.packages("caret")
install.packages("patchwork")
install.packages("pROC")
library(fitzRoy)
library(tidyverse)
library(sjPlot)
library(performance)
library(caret)
library(patchwork)
library(pROC)
Load 2025 AFL data
# Fetch player statistics from the 2025 AFL season
afl_data <- fetch_player_stats_afltables(season = 2025)
In Logistic Regression, the dependent variable must be binary. Here, we will create a binary outcome indicating whether a player received Brownlow votes (1) or not (0), along with a binary predictor indicating whether a player had 30 or more disposals (1) or not (0).
# Create binary variables for Brownlow votes and 30+ disposals
afl_data <- afl_data %>%
  mutate(
    got_brownlow_votes = ifelse(Brownlow.Votes > 0, 1, 0),
    high_disposals = ifelse(Disposals >= 30, 1, 0)
  )
# Check to see how many high disposal outputs were recorded in 2025 games
table(afl_data$high_disposals)
##
## 0 1
## 9554 428
To begin, create a Simple Logistic Regression model using only the high_disposals variable to predict whether a player receives Brownlow votes.
# Create simple logistic regression model
log_model1 <- glm(got_brownlow_votes ~ high_disposals,
                  data = afl_data,
                  family = binomial)
# Display model summary
log_model1_summary <- tab_model(log_model1,
                                dv.labels = "Received Brownlow Votes")
log_model1_summary
Received Brownlow Votes

| Predictors | Odds Ratios | CI | p |
|---|---|---|---|
| (Intercept) | 0.05 | 0.04 – 0.05 | <0.001 |
| high disposals | 23.68 | 19.03 – 29.49 | <0.001 |
| Observations | 9522 | | |
| R2 Tjur | 0.155 | | |
The Logistic Regression output above shows that having 30 or more disposals is a strong predictor of receiving Brownlow votes (OR = 23.68, 95% CI: 19.03 – 29.49, p < 0.001). In other words, the odds of receiving Brownlow votes are nearly 24 times higher for players who accumulate 30 or more disposals in a game than for those with fewer disposals. On its own, the high disposals variable explains 15.5% of the variation in Brownlow votes received (Tjur's R² = 0.155).
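The odds ratios and confidence intervals in the table come from exponentiating the model's coefficients, which glm() reports on the log-odds scale. A quick sketch to reproduce them from log_model1 (the intervals may differ slightly from tab_model's, depending on the CI method it uses):
# Coefficients on the log-odds scale
coef(log_model1)
# Exponentiate to obtain odds ratios (should be close to 0.05 and 23.68 from the table)
exp(coef(log_model1))
# Approximate 95% confidence intervals on the odds-ratio scale
exp(confint(log_model1))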
We can visualise the results of our Logistic Regression model:
log_model_plot1 <- plot_model(log_model1, type = "pred", terms = "high_disposals",
                              title = "Probability of Receiving Brownlow Votes",
                              axis.title = c("30+ Disposals", "Probability of Votes"))
log_model_plot1
Or, for a starker contrast, we can use a bar chart to compare the two probabilities of receiving Brownlow votes:
# Calculate probabilities
prob_low <- predict(log_model1, newdata = data.frame(high_disposals = 0), type = "response")
prob_high <- predict(log_model1, newdata = data.frame(high_disposals = 1), type = "response")
# Create data frame
plot_data <- data.frame(
  Category = c("< 30 Disposals", "30+ Disposals"),
  Probability = c(prob_low, prob_high)
)
# Plot
log_model_plot2 <- ggplot(plot_data, aes(x = Category, y = Probability, fill = Category)) +
  geom_col(width = 0.6) +
  geom_text(aes(label = paste0(round(Probability * 100, 1), "%")),
            vjust = -0.5, size = 5) +
  scale_y_continuous(labels = scales::percent, limits = c(0, 0.6)) +
  scale_fill_manual(values = c("lightcoral", "steelblue")) +
  labs(title = "Impact of High Disposals on Brownlow Vote Probability",
       x = "",
       y = "Probability of Receiving Votes") +
  theme_minimal() +
  theme(legend.position = "none",
        plot.title = element_text(hjust = 0.5, size = 13, face = "bold"))
log_model_plot2
We can see here that having fewer than 30 disposals results in a very low probability (4.5%) of receiving Brownlow votes, whereas having 30 or more disposals increases this probability substantially, to approximately 52.6%.
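As a quick sanity check, these two probabilities can be recovered directly from the model coefficients with the inverse-logit function plogis():
# Convert log-odds to probabilities for each group
b <- coef(log_model1)
plogis(b[1])        # < 30 disposals: intercept only (~4.5%)
plogis(b[1] + b[2]) # 30+ disposals: intercept plus coefficient (~52.6%)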
To improve the model, we can add further predictor variables:
# Create multiple logistic regression model
log_model2 <- glm(got_brownlow_votes ~ high_disposals + Goals + Tackles,
                  data = afl_data,
                  family = binomial)
# Display model summary
log_model2_summary <- tab_model(log_model2,
                                dv.labels = "Received Brownlow Votes")
log_model2_summary
Received Brownlow Votes

| Predictors | Odds Ratios | CI | p |
|---|---|---|---|
| (Intercept) | 0.01 | 0.01 – 0.01 | <0.001 |
| high disposals | 29.92 | 23.50 – 38.18 | <0.001 |
| Goals | 2.14 | 2.00 – 2.30 | <0.001 |
| Tackles | 1.24 | 1.20 – 1.29 | <0.001 |
| Observations | 9522 | | |
| R2 Tjur | 0.246 | | |
High disposal games remain an important variable: high_disposals, Goals and Tackles are all significant predictors of receiving Brownlow votes (p < 0.001 for all). According to this model, the odds of polling a Brownlow vote are nearly 30 times higher for players with 30 or more Disposals than for those with fewer, holding Goals and Tackles constant (OR = 29.92, 95% CI: 23.50 – 38.18). Each Goal roughly doubles the odds (OR = 2.14, 95% CI: 2.00 – 2.30), while each Tackle increases the odds of receiving a Brownlow vote by 24% (OR = 1.24, 95% CI: 1.20 – 1.29). This model explains 24.6% of the variation in Brownlow votes received.
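The percentage interpretations above follow directly from the odds ratios, since the change in the odds per one-unit increase is (OR - 1) x 100; a short sketch using log_model2:
# Odds ratios for the multiple logistic regression
or2 <- exp(coef(log_model2))
# Percentage change in the odds per unit increase
# (should be close to the table above, e.g. Goals ~ +114%, Tackles ~ +24%)
round((or2 - 1) * 100, 1)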
p1 <- plot_model(log_model2, type = "pred", terms = "high_disposals",
                 axis.title = c("30+ Disposals", "Received Brownlow Votes"))
p2 <- plot_model(log_model2, type = "pred", terms = "Goals",
                 axis.title = c("Effect of Goals", "Received Brownlow Votes"))
p3 <- plot_model(log_model2, type = "pred", terms = "Tackles",
                 axis.title = c("Effect of Tackles", "Received Brownlow Votes"))
# Display plots
multiple_pred <- p1/p2/p3
multiple_pred
From the visualisations above, we can see the individual effect of each predictor variable on the probability of receiving Brownlow votes. High disposal games have a substantial impact. Goals show a steep, compounding effect: 0 goals corresponds to around a 5% probability of receiving votes, 6 goals to around a 65% chance, and 10 goals to nearly 100%. Tackles also show a positive relationship, but the effect is more modest compared with high disposals and Goals; for games with a large number of tackles (e.g. 15+), the probability of receiving votes is around 40%.
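These approximate figures can be checked numerically with predict(). The sketch below holds high_disposals at 0 and Tackles at their sample mean, which only roughly mirrors how plot_model() conditions on the other predictors, so the values may differ slightly from the plotted curve:
# Predicted probability of votes at selected goal tallies,
# holding high_disposals = 0 and Tackles at the sample mean
new_goals <- data.frame(
  high_disposals = 0,
  Goals = c(0, 3, 6, 10),
  Tackles = mean(afl_data$Tackles, na.rm = TRUE)
)
predict(log_model2, newdata = new_goals, type = "response")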
Instead of using a binary threshold (30+ disposals), we can further explore using Disposals as a continuous variable to see how each additional Disposal affects the probability of receiving votes.
# Create model with continuous disposals
log_model3 <- glm(got_brownlow_votes ~ Disposals + Goals + Tackles,
                  data = afl_data,
                  family = binomial)
# Display model summary
log_model3_summary <- tab_model(log_model3)
log_model3_summary
got brownlow votes

| Predictors | Odds Ratios | CI | p |
|---|---|---|---|
| (Intercept) | 0.00 | 0.00 – 0.00 | <0.001 |
| Disposals | 1.30 | 1.27 – 1.32 | <0.001 |
| Goals | 2.78 | 2.55 – 3.04 | <0.001 |
| Tackles | 1.13 | 1.09 – 1.17 | <0.001 |
| Observations | 9522 | | |
| R2 Tjur | 0.346 | | |
From this model, we can see that each additional Disposal increases the odds of receiving a Brownlow vote by 30% (OR = 1.30), with each Goal increasing the odds by 178% (OR = 2.78) and each Tackle by 13% (OR = 1.13). With a Tjur's R² of 0.346, this model achieves the best fit of the three models so far, its predictors explaining 34.6% of the variation in Brownlow votes received.
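Because the effect is multiplicative on the odds scale, the per-disposal odds ratio compounds over larger differences in disposals; a quick sketch:
# Per-disposal odds ratio from Model 3 (~1.30)
or_disp <- exp(coef(log_model3)["Disposals"])
or_disp
# Compounded over 10 extra disposals: roughly a 13-14x increase in the odds
or_disp^10
# Equivalent calculation directly on the log-odds scale
exp(coef(log_model3)["Disposals"] * 10)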
We can also visualise the effect of Disposals as a continuous variable:
# Visualize the effect of disposals
plot_model3 <- plot_model(log_model3, type = "pred", terms = "Disposals [all]",
                          axis.title = c("Disposals", "Probability of Receiving Brownlow Votes"))
print(plot_model3)
The sigmoid shape of the curve is typical of Logistic Regression and highlights the non-linear relationship between Disposals and the probability of receiving Brownlow votes. The probability begins to rise noticeably from around 25 Disposals, with the steepest rise (the inflection point) occurring between roughly 30 and 35 Disposals. Below 20 Disposals the probability of receiving votes is minimal (<5%), while by around 45 Disposals the probability approaches its plateau near 100%.
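The disposal count at which the predicted probability crosses 50% can be approximated by solving for where the linear predictor equals zero. The sketch below holds Goals and Tackles at their sample means, which is roughly how the marginal-effects curve above is drawn, so the result should sit near the steepest section of that curve:
# Solve b0 + b_D * D + b_G * mean(Goals) + b_T * mean(Tackles) = 0 for D
b <- coef(log_model3)
mean_goals <- mean(afl_data$Goals, na.rm = TRUE)
mean_tackles <- mean(afl_data$Tackles, na.rm = TRUE)
inflection_disposals <- -(b["(Intercept)"] +
                            b["Goals"] * mean_goals +
                            b["Tackles"] * mean_tackles) / b["Disposals"]
inflection_disposals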
To assess which model predicts best, we can compare the three of them together:
# Compare model performance
log_model_comp <- tab_model(log_model1, log_model2, log_model3,
                            dv.labels = c("Model 1: Simple",
                                          "Model 2: Binary + G/T",
                                          "Model 3: Continuous"))
log_model_comp
| Predictors | Odds Ratios (Model 1: Simple) | CI | p | Odds Ratios (Model 2: Binary + G/T) | CI | p | Odds Ratios (Model 3: Continuous) | CI | p |
|---|---|---|---|---|---|---|---|---|---|
| (Intercept) | 0.05 | 0.04 – 0.05 | <0.001 | 0.01 | 0.01 – 0.01 | <0.001 | 0.00 | 0.00 – 0.00 | <0.001 |
| high disposals | 23.68 | 19.03 – 29.49 | <0.001 | 29.92 | 23.50 – 38.18 | <0.001 | | | |
| Goals | | | | 2.14 | 2.00 – 2.30 | <0.001 | 2.78 | 2.55 – 3.04 | <0.001 |
| Tackles | | | | 1.24 | 1.20 – 1.29 | <0.001 | 1.13 | 1.09 – 1.17 | <0.001 |
| Disposals | | | | | | | 1.30 | 1.27 – 1.32 | <0.001 |
| Observations | 9522 | | | 9522 | | | 9522 | | |
| R2 Tjur | 0.155 | | | 0.246 | | | 0.346 | | |
# Performance comparison
log_model_comparison <- compare_performance(log_model1, log_model2, log_model3)
log_model_comparison
## # Comparison of Model Performance Indices
##
## Name | Model | AIC (weights) | AICc (weights) | BIC (weights)
## ---------------------------------------------------------------------
## log_model1 | glm | 3896.8 (<.001) | 3896.8 (<.001) | 3911.2 (<.001)
## log_model2 | glm | 3363.5 (<.001) | 3363.5 (<.001) | 3392.1 (<.001)
## log_model3 | glm | 2693.6 (>.999) | 2693.6 (>.999) | 2722.3 (>.999)
##
## Name | Tjur's R2 | RMSE | Sigma | Log_loss | Score_log | Score_spherical | PCP
## ---------------------------------------------------------------------------------------
## log_model1 | 0.155 | 0.227 | 1.000 | 0.204 | -46.646 | 0.004 | 0.897
## log_model2 | 0.246 | 0.214 | 1.000 | 0.176 | -Inf | 0.004 | 0.908
## log_model3 | 0.346 | 0.200 | 1.000 | 0.141 | -Inf | 0.005 | 0.920
Based on its lowest AIC, in combination with the lowest RMSE and highest Tjur's R², we can conclude that Model 3 (using Disposals as a continuous variable along with Goals and Tackles) is the best-performing model for predicting whether a player will receive Brownlow votes based on their performance statistics.
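Because Model 1 is nested within Model 2 (and both are fit to the same 9,522 observations), a likelihood-ratio test offers another way to confirm that adding Goals and Tackles improves the fit; Models 2 and 3 are not nested (binary versus continuous disposals), so the AIC comparison above remains the appropriate tool there. A brief sketch:
# Likelihood-ratio test: does adding Goals and Tackles improve on Model 1?
anova(log_model1, log_model2, test = "Chisq")
# AIC for all three models (lower is better; should echo compare_performance() above)
AIC(log_model1, log_model2, log_model3)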
Next, we’ll evaluate how well the best model (Model 3) performs in predicting Brownlow votes.
# Extract data used in model and add predictions
model_results <- log_model3$model
model_results$predicted_prob <- predict(log_model3, type = "response")
model_results$predicted_votes <- ifelse(model_results$predicted_prob > 0.5, 1, 0)
# Create confusion matrix using model data
conf_matrix <- confusionMatrix(factor(model_results$predicted_votes),
                               factor(model_results$got_brownlow_votes),
                               positive = "1")
print(conf_matrix)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 8808 404
## 1 93 217
##
## Accuracy : 0.9478
## 95% CI : (0.9431, 0.9522)
## No Information Rate : 0.9348
## P-Value [Acc > NIR] : 6.027e-08
##
## Kappa : 0.4419
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.34944
## Specificity : 0.98955
## Pos Pred Value : 0.70000
## Neg Pred Value : 0.95614
## Prevalence : 0.06522
## Detection Rate : 0.02279
## Detection Prevalence : 0.03256
## Balanced Accuracy : 0.66949
##
## 'Positive' Class : 1
##
Accuracy - Based on the confusion matrix, the model correctly classifies 94.8% of cases
Sensitivity - The model correctly identifies 34.9% of actual vote getters. While this may seem low, with only 3 players per game receiving votes (a base rate of about 6.5%), this is roughly 5 times better than chance
Specificity - The model correctly identifies 99% of players who won’t receive a vote
Positive Predictive Value - When the model predicts a player will receive votes, it is correct 70% of the time
Each of these figures can be reproduced directly from the confusion-matrix counts, as sketched below.
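A minimal sketch of those calculations, using the counts from the confusion matrix above:
# Counts from the confusion matrix (Reference = actual outcome, Prediction = model)
TP <- 217  # predicted votes, actually received votes
FP <- 93   # predicted votes, did not receive votes
FN <- 404  # predicted no votes, actually received votes
TN <- 8808 # predicted no votes, did not receive votes
TP / (TP + FN)                  # Sensitivity ~ 0.349
TN / (TN + FP)                  # Specificity ~ 0.990
TP / (TP + FP)                  # Positive predictive value ~ 0.700
(TP + TN) / (TP + TN + FP + FN) # Accuracy ~ 0.948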
# Create ROC curve
roc_obj <- roc(model_results$got_brownlow_votes, model_results$predicted_prob)
# Plot ROC curve
roc <- plot(roc_obj, main = "ROC Curve for Brownlow Vote Prediction",
            col = "blue", lwd = 2)
# Calculate AUC
auc_value <- auc(roc_obj)
# Display AUC
auc_value
## Area under the curve: 0.925
The ROC curve shows excellent differentiation between vote-receivers and non-vote-receivers, with an AUC of 0.925. This high AUC indicates the model successfully identifies vote-worthy performances, even though predicting the exact 3 players chosen by the umpires remains challenging.
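As a possible extension, pROC's coords() can suggest a probability threshold that balances sensitivity and specificity better than the default 0.5 cut-off used for the confusion matrix above; this is a hedged sketch, since the right threshold ultimately depends on the cost assigned to each type of error:
# Threshold that maximises sensitivity + specificity (Youden's J statistic)
coords(roc_obj, "best",
       ret = c("threshold", "sensitivity", "specificity"),
       best.method = "youden")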
Finally, we can check for any potential issues with our model using diagnostic plots.
# Check model diagnostics
check_model(log_model3)
Posterior Predictive Check (Top Left): The model-predicted distribution (blue) closely matches the observed data distribution (green) for both vote categories, indicating the model appropriately captures the structure of the data.
Binned Residuals (Top Right): Most blue points fall within the error bounds, however there is a slight pattern with some red points falling outside the bounds at higher probability levels. This suggests minor model misspecification at the extremes.
Influential Observations (Middle Left): All points fall within the Cook’s distance contour lines (0.8). No individual observations are excessively affecting the model.
Collinearity (Middle Right): All three predictors show VIF values well below 5 (all around 1), confirming no multicollinearity issues. Disposals, Goals, and Tackles contribute independently to the model.
Uniformity of Residuals (Bottom Left): The QQ plot shows points closely following the diagonal line, indicating residuals are approximately uniformly distributed as expected.
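Each of these checks can also be run (and inspected numerically) with individual functions from the performance package; a brief sketch:
# Variance inflation factors for Disposals, Goals and Tackles
check_collinearity(log_model3)
# Numeric summary behind the binned-residuals panel
binned_residuals(log_model3)
# Flag any unusually influential observations
check_outliers(log_model3)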
This analysis demonstrates the key steps in logistic regression modelling for binary outcomes:
Model Development: Progressed from a single binary predictor to multiple predictors, including continuous variables
Model Comparison: Selected the best model based on AIC, RMSE and pseudo R² values
Model Evaluation: Assessed performance through confusion matrices, ROC curves, and diagnostic plots
Practical Application: Disposals, Goals, and Tackles are all important variables for predicting whether a player will receive Brownlow votes, though their influence differs in magnitude
The continuous model (Model 3) provided the best balance of fit and interpretability for this analysis. Model diagnostics confirmed that logistic regression was appropriate for these data.