To model the relationship between a dependent variable and one or more independent variables, linear regression can be applied. Once fitted, the models can be compared against each other and against the observed data to judge which is better suited to future forecasts of the dependent variable.

In this tutorial, we will use AFL data to build a simple linear regression model and a multiple regression model that predict the total goals a team scores in each match.

Retrieving Data & Necessary Packages

# Install required packages if needed by removing the "#" from the line below
# install.packages(c("fitzRoy", "tidyverse", "GGally", "tidymodels", "performance", "sjPlot"))

library(fitzRoy)
library(tidyverse)
library(GGally)
library(tidymodels)
library(performance)
library(sjPlot)

afldata <- fetch_player_stats_afltables(2025)

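# Keep the home and away season only (rounds 1 to 25).
# Round labels that are not numbers (such as finals codes, if present) become NA
# in as.numeric(), with a coercion warning, and filter() then drops those rows.

regular_season <- afldata %>%
  mutate(Round = as.numeric(Round)) %>%
  filter(Round >= 1 & Round <= 25)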

# Aggregate Player Stats to Team Totals per Match

team_totals <- regular_season %>%
  group_by(Date, Round, Playing.for) %>%
  summarise(
    Kicks = sum(Kicks, na.rm = TRUE),
    Handballs = sum(Handballs, na.rm = TRUE),
    Disposals = sum(Disposals, na.rm = TRUE),
    Marks = sum(Marks, na.rm = TRUE),
    Inside50s = sum(Inside.50s, na.rm = TRUE),
    Contested.Marks = sum(Contested.Marks, na.rm = TRUE),
    Clearances = sum(Clearances, na.rm = TRUE),
    Tackles = sum(Tackles, na.rm = TRUE),
    Goals = sum(Goals, na.rm = TRUE),
    Behinds = sum(Behinds, na.rm = TRUE),
    Marks.Inside.50 = sum(Marks.Inside.50, na.rm = TRUE),
    One.Percenters = sum(One.Percenters, na.rm = TRUE),
    Frees.For = sum(Frees.For, na.rm = TRUE),
    Frees.Against = sum(Frees.Against, na.rm = TRUE),
    Clangers = sum(Clangers, na.rm = TRUE),
    .groups = "drop"
  )
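The round filter above relies on how R handles text that cannot be converted to a number. The sketch below illustrates that behaviour on a small hand-made vector; the labels "EF" and "GF" are assumed examples of finals codes and are not taken from the downloaded data.

```r
# Hypothetical round labels: numbered rounds plus assumed finals codes
rounds <- c("1", "2", "24", "25", "EF", "GF")

# as.numeric() returns NA for labels it cannot parse and raises a warning;
# suppressWarnings() silences it once we know the NAs are intended
round_num <- suppressWarnings(as.numeric(rounds))

# Comparisons involving NA give NA; which() keeps only the TRUE positions,
# mirroring how dplyr::filter() drops rows where the condition is NA
keep <- which(round_num >= 1 & round_num <= 25)
kept_rounds <- rounds[keep]
print(kept_rounds)
```

If the coercion warning is distracting in the real pipeline, wrapping the conversion in suppressWarnings() in the same way is one option.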

Deciding Which Variables to Use for Modelling

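Before creating the models, it helps to look at how the candidate variables relate to each other and to goals. A pairs plot from GGally shows the pairwise relationships and correlations in one figure.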

explore_vars <- team_totals %>%
  select(Goals, Kicks, Disposals, Marks, Inside50s, Contested.Marks, Clearances,
         Marks.Inside.50, Clangers)

ggpairs(
  explore_vars,
  title = "Relationships Between Team Stats and Goals (AFL 2025)",
  progress = FALSE
)

From these results, we can see that disposals, marks inside 50, inside 50s and clearances have some of the higher correlations with goals.
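The same ranking can be read off numerically rather than from the plot. The sketch below uses simulated data (the column names and values are made up for illustration) to show one way of extracting each variable's correlation with Goals and sorting by strength.

```r
set.seed(1)
n <- 200

# Simulated stand-in data: Goals depends on Inside50s but not on Tackles
sim <- data.frame(Inside50s = rnorm(n, mean = 52, sd = 7),
                  Tackles   = rnorm(n, mean = 60, sd = 10))
sim$Goals <- 0.4 * sim$Inside50s + rnorm(n, sd = 3)

# Column of the correlation matrix that relates every variable to Goals
cors <- cor(sim)[, "Goals"]
cors <- cors[names(cors) != "Goals"]   # drop Goals' correlation with itself

# Rank by absolute correlation, strongest first
ranked <- sort(abs(cors), decreasing = TRUE)
print(round(ranked, 2))
```

With the tutorial's data, the equivalent starting point would be cor(explore_vars)[, "Goals"].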

For this example, we will create a simple regression model using only disposals and a multiple regression model using multiple explanatory variables.

Creating Models

# Simple regression model
mod_1 <- lm(Goals ~ Disposals, data = team_totals)

# Multiple regression model
mod_2 <- lm(Goals ~ Disposals + Inside50s + Marks.Inside.50 + Clearances,
            data = team_totals)

Viewing and Comparing Both Models

After creating the two models, we can use the tab_model function from sjPlot to view them side by side and compare their coefficients and fit statistics.

tab_model(mod_1, mod_2,
          show.stat = TRUE,
          show.aic = TRUE,
          show.ci = FALSE,
          title = "Comparison of Simple vs Multiple Regression Models")

Comparison of Simple vs Multiple Regression Models

                    Goals (mod_1)                     Goals (mod_2)
Predictors          Estimates  Statistic  p           Estimates  Statistic  p
(Intercept)         -7.81      -3.76      <0.001      -9.88      -5.85      <0.001
Disposals           0.06       9.70       <0.001      0.02       3.79       <0.001
Inside50s                                             0.11       4.79       <0.001
Marks Inside 50                                       0.45       12.19      <0.001
Clearances                                            0.12       4.96       <0.001
Observations        414                               414
R² / R² adjusted    0.186 / 0.184                     0.520 / 0.516
AIC                 2285.366                          2072.498
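The second model's estimates in the table can be read as an equation: predicted goals equal the intercept plus each coefficient multiplied by the matching team total. The sketch below works through that arithmetic for one match. The coefficients are the rounded values printed in the table, and the match statistics are hypothetical values chosen only for illustration.

```r
# Coefficients of the multiple regression model, rounded as printed above
coefs <- c(Intercept = -9.88, Disposals = 0.02, Inside50s = 0.11,
           Marks.Inside.50 = 0.45, Clearances = 0.12)

# Hypothetical team totals for a single match (illustrative values only);
# the intercept is multiplied by 1
match_stats <- c(Intercept = 1, Disposals = 350, Inside50s = 50,
                 Marks.Inside.50 = 12, Clearances = 38)

# Sum of coefficient * value: -9.88 + 7.00 + 5.50 + 5.40 + 4.56
predicted_goals <- sum(coefs * match_stats[names(coefs)])
print(predicted_goals)
```

This comes to roughly 12.6 goals. Because the printed coefficients are rounded to two decimals, the result is only approximate; predict(mod_2, newdata = ...) uses the unrounded estimates and is the better choice in practice.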

compare_performance(mod_1, mod_2, metrics = c( "RMSE"))

## # Comparison of Model Performance Indices
## 
## Name  | Model |  RMSE
## ---------------------
## mod_1 |    lm | 3.796
## mod_2 |    lm | 2.914

Based on these results, the multiple regression model performs better. All predictors are statistically significant in both models, but RMSE, AIC and R² all favour the multiple regression model. Its RMSE of 2.91 goals, against 3.80 for the simple model, means its fitted values sit closer to the actual goal totals on average. AIC balances goodness of fit against the number of estimated parameters, and a lower value indicates a better trade-off; the second model has the lower AIC (2072.5 against 2285.4). R² is the proportion of variation in total team goals (the dependent variable) explained by the independent variables: 0.52 for the multiple regression model against 0.19 for the simple model. Because R² never decreases when predictors are added, adjusted R² is the fairer comparison, and it tells the same story (0.516 against 0.184). The multiple regression model is therefore the stronger model on all three measures.
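To make these measures concrete, the sketch below computes RMSE, R², adjusted R² and AIC by hand for a small model fitted to simulated data, using base R only. The data and variable names are invented for illustration; the formulas are the standard ones for a linear model fitted with lm.

```r
set.seed(42)
n <- 100
x <- rnorm(n)
y <- 2 + 1.5 * x + rnorm(n)   # simulated data, not AFL data
fit <- lm(y ~ x)

res <- residuals(fit)

# RMSE: square root of the mean squared residual
rmse <- sqrt(mean(res^2))

# R-squared: 1 minus residual sum of squares over total sum of squares
r2 <- 1 - sum(res^2) / sum((y - mean(y))^2)

# Adjusted R-squared penalises the number of coefficients k (incl. intercept)
k <- length(coef(fit))
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - k)

# AIC = -2 * log-likelihood + 2 * (number of estimated parameters);
# for lm the count includes the residual variance, hence k + 1
aic <- -2 * as.numeric(logLik(fit)) + 2 * (k + 1)

print(c(RMSE = rmse, R2 = r2, adj_R2 = adj_r2, AIC = aic))
```

The hand-computed values can be checked against summary(fit)$r.squared, summary(fit)$adj.r.squared and AIC(fit).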

To further examine the multiple regression model we can use the check_model function to check how well the model meets the standard assumptions of linear modelling.

check_model(mod_2)

The model fits the data reasonably well and meets the key linear regression assumptions. Predicted goal totals track observed outcomes, residuals are approximately normal, no predictors show problematic multicollinearity, and no observations are unduly influential. The only mild issue is a small change in residual variance for mid-range goal predictions, which is unlikely to affect model validity. Overall, the multiple regression model provides a reliable and well-behaved fit for AFL team goal prediction.
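Multicollinearity is usually judged with the variance inflation factor (VIF), which is what the collinearity panel of check_model reports. As a rough sketch of the idea, the VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on the other predictors. The example below uses simulated predictors with invented names, one pair of which is deliberately correlated.

```r
set.seed(7)
n <- 150
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n, sd = 0.6)   # deliberately correlated with x1
x3 <- rnorm(n)                        # unrelated to x1 and x2
d <- data.frame(x1, x2, x3)

predictors <- c("x1", "x2", "x3")

# For each predictor, regress it on the others and convert R-squared to VIF
vif_by_hand <- sapply(predictors, function(v) {
  others <- setdiff(predictors, v)
  r2 <- summary(lm(reformulate(others, response = v), data = d))$r.squared
  1 / (1 - r2)
})
print(round(vif_by_hand, 2))
```

A VIF near 1 means a predictor shares little information with the others; a common rule of thumb treats values above roughly 5 to 10 as a concern. For the tutorial's model, performance::check_collinearity(mod_2) returns the same quantity directly.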

Viewing Predicted vs Actual Values

We can now visualise how the predictions of the multiple regression model compare with actual AFL total team goals. To do so, we first split the data into training and testing sets, fit the models on the training set, and predict on the testing set.

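# Set a seed so the random split is reproducible (the value is arbitrary;
# metrics will differ slightly from one split to another)
set.seed(2025)
split_data <- initial_split(team_totals, prop = 0.8)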
train_split <- training(split_data)
test_split  <- testing(split_data)

# Fit models on training set
train_mod1 <- lm(Goals ~ Disposals, data = train_split)
train_mod2 <- lm(Goals ~ Disposals + Inside50s + Marks.Inside.50 + Clearances, data = train_split)

# Predict on test data
pred1 <- predict(train_mod1, newdata = test_split)
pred2 <- predict(train_mod2, newdata = test_split)

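# Evaluate performance on the training data
# (model_performance() and compare_performance() report in-sample metrics
# for the fitted models; they do not use the test-set predictions)
perf1 <- model_performance(train_mod1)
perf2 <- model_performance(train_mod2)

compare_performance(train_mod1, train_mod2)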

## # Comparison of Model Performance Indices
## 
## Name       | Model |  AIC (weights) | AICc (weights) |  BIC (weights) |    R2
## -----------------------------------------------------------------------------
## train_mod1 |    lm | 1825.7 (<.001) | 1825.8 (<.001) | 1837.1 (<.001) | 0.189
## train_mod2 |    lm | 1638.1 (>.999) | 1638.3 (>.999) | 1660.9 (>.999) | 0.548
## 
## Name       | R2 (adj.) |  RMSE | Sigma
## --------------------------------------
## train_mod1 |     0.186 | 3.781 | 3.792
## train_mod2 |     0.542 | 2.822 | 2.843
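The indices above are computed on the data each model was fitted to. To judge predictive accuracy, the same RMSE formula can be applied to held-out predictions. The sketch below shows the idea on simulated data with a base R split; the data, the helper function rmse and the 80/20 split are all illustrative assumptions.

```r
set.seed(123)
n <- 200
d <- data.frame(x = rnorm(n))
d$y <- 3 + 2 * d$x + rnorm(n)   # simulated data, not AFL data

# Hold out 20% of rows for testing, train on the rest
test_idx <- sample(n, size = 0.2 * n)
train <- d[-test_idx, ]
test  <- d[test_idx, ]

fit <- lm(y ~ x, data = train)

# Hypothetical helper: root mean squared error of a set of predictions
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

train_rmse <- rmse(train$y, fitted(fit))                 # in-sample error
test_rmse  <- rmse(test$y, predict(fit, newdata = test)) # out-of-sample error
print(c(train = train_rmse, test = test_rmse))
```

With the tutorial's objects, the test-set RMSE of the multiple regression model would be sqrt(mean((test_split$Goals - pred2)^2)), and the same expression with pred1 gives the simple model's value.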

We can then plot the predicted and actual values together, as below.

predictions <- tibble(
  Actual = test_split$Goals,
  Predicted = pred2
)

ggplot(predictions, aes(x = Actual, y = Predicted)) +
  geom_jitter(width = 0.2, height = 0.2, color = "Black", alpha = 0.6) +
  geom_smooth(method = "lm") + 
  labs(
    title = "Predicted vs. Actual Total Team Goals per Match",
    x = "Actual Total Team Goals per Match",
    y = "Predicted Total Team Goals per Match"
  ) +
  theme_minimal()

This plot shows that the model’s predicted team goal totals follow the actual results: teams that scored more goals in real matches are also predicted to score more by the model. The fitted line shows the overall trend between predicted and actual values, and the spread of points around it gives a sense of the typical prediction error. A better model could improve accuracy, but this is a good demonstration of the approach.
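One optional refinement is to add a 45-degree reference line, on which predicted equals actual, so that over- and under-prediction can be read directly from the plot. The sketch below uses simulated values (not the tutorial's predictions) and assumes ggplot2 is installed; the toy data are built so that the trend line is flatter than the reference line, a pattern that often appears in predicted-versus-actual plots.

```r
library(ggplot2)

set.seed(99)
# Simulated stand-in for actual and predicted goals
toy <- data.frame(Actual = rpois(60, lambda = 12))
toy$Predicted <- 6 + 0.5 * toy$Actual + rnorm(60, sd = 1.5)

p <- ggplot(toy, aes(x = Actual, y = Predicted)) +
  geom_point(alpha = 0.6) +
  # Dashed line where Predicted == Actual (perfect prediction)
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
  # Solid line: fitted linear trend of Predicted on Actual
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  coord_equal() +
  theme_minimal()

print(p)
```

Points above the dashed line are over-predictions and points below it are under-predictions.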

Conclusion

In conclusion, we have demonstrated how to build simple and multiple linear regression models and how to compare their performance. After assessing which model was better, we visualised its predicted values against the observed values, which is an important skill when creating predictive models.

Using Predictive Models

Nathan Dohmen-Jolly

2025-10-30