1. Introduction:

Multiple linear regression (MLR) extends simple regression by modeling a continuous outcome as a linear combination of several predictors: \[ Y=\beta_0+\beta_1X_1+\cdots+\beta_iX_i+\varepsilon \]

  • \(X\) = Independent Variable (Predictor)
  • \(Y\) = Dependent Variable (Outcome)
  • \(\beta_0\) = Intercept (The baseline level of \(Y\) when all predictors equal zero)
  • \(\beta_i\) = Slope coefficients (The expected change in \(Y\) for a one-unit increase in \(X\))
  • \(\varepsilon\) = Random error (Independent with the same variance)

2. Multiple linear regression assumptions:

  • Linearity: There is a linear relationship between \(X\) and \(Y\).
  • Normality: The variance between predicted outcomes and actual outcomes are normally distributed.
  • Homoscedasticity: The variance of residuals should be consistent across all independent variables.
  • Multicollinearity: High correlation between independent variables should be avoided.
  • Error Independence: Errors shouldn’t be correlated with each other.

3. Step-by-Step Guide

3.1 Package Import

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.2     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(fitzRoy)
library(skimr)
library(ggplot2)
library(sjPlot)
## Install package "strengejacke" from GitHub (`devtools::install_github("strengejacke/strengejacke")`) to load all sj-packages at once!
library(performance)

3.2 Data Preparation

3.2.1 Data Import

# Download data

afl_raw_2025 <- fetch_player_stats_afltables(season = 2025)
## ℹ Looking for data from 2025-01-01 to 2025-11-03✔ Looking for data from 2025-01-01 to 2025-11-03 [12ms]
## ℹ Fetching cached data from <github.com/jimmyday12/fitzRoy_data>✔ Fetching cached data from <github.com/jimmyday12/fitzRoy_data> [2.4s]
## ℹ New data found! Fetching new data from 1 matchesℹ Scraping AFL Tables                             ✔ Scraping AFL Tables [123ms]
## ℹ New data found! Fetching new data from 1 matchesℹ Cleaning AFL Tables data                        ✔ Cleaning AFL Tables data [32ms]
## ℹ New data found! Fetching new data from 1 matchesℹ Fetching cached ID data from <github.com/jimmyday12/fitzRoy_data>                                                                    Rows: 13273 Columns: 9
## ℹ Fetching cached ID data from <github.com/jimmyday12/fitzRoy_data>                                                                    
## ℹ Fetching cached ID data from <github.com/jimmyday12/fitzRoy_data>                                                                    ── Column specification ────────────────────────────────────────────────────────
## ℹ Fetching cached ID data from <github.com/jimmyday12/fitzRoy_data>                                                                    Delimiter: ","
## chr (6): player, dob, round, match, date, url
## dbl (3): rank, suffix, ID
## 
## ℹ Fetching cached ID data from <github.com/jimmyday12/fitzRoy_data>                                                                    
## 
## ℹ Fetching cached ID data from <github.com/jimmyday12/fitzRoy_data>                                                                    ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Fetching cached ID data from <github.com/jimmyday12/fitzRoy_data>                                                                    ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## ℹ Fetching cached ID data from <github.com/jimmyday12/fitzRoy_data>✔ Fetching cached ID data from <github.com/jimmyday12/fitzRoy_data> [434ms]
## ℹ New data found! Fetching new data from 1 matches✔ New data found! Fetching new data from 1 matches [639ms]
## ℹ Tidying data✔ Tidying data [18ms]
# Sum each match's home-team Kicks, Marks, and Inside.50s and calculate the margin. 
# Only focused on: Margin, Kicks, Marks, Inside.50s

afl_2025 <- afl_raw_2025 %>% 
  group_by(Round, Playing.for) %>% 
  filter(Playing.for == Home.team) %>% 
  mutate(across(c(Kicks, Marks, Inside.50s), sum)) %>% 
  select(Round, Playing.for, Home.team, Home.score, Away.team, Away.score, 
         Home.Kicks = Kicks, Home.Marks = Marks, Home.Inside.50s = Inside.50s) %>% 
  slice(1) %>% 
  mutate(Margin = Home.score - Away.score) %>% 
  ungroup() %>% 
  select(Margin, Home.Kicks, Home.Marks, Home.Inside.50s)
  
# Preview the Data
afl_2025
## # A tibble: 216 × 4
##    Margin Home.Kicks Home.Marks Home.Inside.50s
##     <int>      <int>      <int>           <int>
##  1      9        232         93              67
##  2     95        209         86              67
##  3     52        214         93              54
##  4    -20        203         71              40
##  5      8        234        126              45
##  6    -14        206        102              62
##  7     -4        210         98              61
##  8     10        193         74              62
##  9    -35        218         98              48
## 10      0        195         71              51
## # ℹ 206 more rows

3.2.2 Exploratory data analysis (EDA)

Independent Variables (\(X\)): Home.Kicks, Home.Marks, Home.Inside.50s.
Dependent Variables (\(Y\)): Margin

3.2.2a Statistical Summary

skim(afl_2025)
## Warning in attr(x, "align"): 'xfun::attr()' is deprecated.
## Use 'xfun::attr2()' instead.
## See help("Deprecated")
## Warning in attr(x, "align"): 'xfun::attr()' is deprecated.
## Use 'xfun::attr2()' instead.
## See help("Deprecated")
Data summary
Name afl_2025
Number of rows 216
Number of columns 4
_______________________
Column type frequency:
numeric 4
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Margin 0 1 5.63 42.16 -101 -20 7.0 30.25 98 ▁▃▇▃▂
Home.Kicks 0 1 211.59 24.15 163 196 209.0 224.25 398 ▇▇▁▁▁
Home.Marks 0 1 90.99 20.27 48 76 89.5 105.00 165 ▃▇▇▂▁
Home.Inside.50s 0 1 52.49 8.78 34 46 53.0 58.00 102 ▅▇▂▁▁

3.2.2b Variable Distribution

Use ggplot to create histogram for each variable

  • Explore whether the data are symmetric or skewed
  • Explore the unusual data
afl_2025 %>% 
  pivot_longer(cols = c(Margin, Home.Kicks, Home.Marks, Home.Inside.50s), 
               names_to = "Metric", values_to = "Value") %>% 
  ggplot(aes(x = Value)) + 
  geom_histogram(color = 'white') + 
  facet_wrap(~ Metric, scale = 'free') + 
  theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

3.2.2c Variable Correlation

Correlation between \(X\) and \(Y\)

afl_2025 %>% 
  pivot_longer(cols = c(Home.Kicks, Home.Marks, Home.Inside.50s), 
               names_to = "Metric", values_to = "Value") %>% 
  ggplot(aes(x = Margin, y = Value)) + 
  geom_point() + 
  geom_smooth(method = 'lm') + 
  facet_wrap(~ Metric, scale = 'free_y') + 
  theme_bw()
## `geom_smooth()` using formula = 'y ~ x'

All correlation are positive means the variables tend to move in the same direction.
Higher values of Home.Kicks, Home.Marks, Home.Inside.50s are associated with higher Margin.

3.3 Model Building

3.3.1 Single Predictor Model

A simple linear regression with one predictor to explain the outcome

model.1.1 <- lm(Margin ~ Home.Kicks, data = afl_2025)
model.1.2 <- lm(Margin ~ Home.Marks, data = afl_2025)
model.1.3 <- lm(Margin ~ Home.Inside.50s, data = afl_2025)

# Coefficients of Predictors
tab_model(model.1.1, model.1.2, model.1.3, 
          dv.labels = c("model.1.1", "model.1.2", "model.1.3"))
  model.1.1 model.1.2 model.1.3
Predictors Estimates CI p Estimates CI p Estimates CI p
(Intercept) -148.18 -193.72 – -102.64 <0.001 -51.62 -76.51 – -26.72 <0.001 -126.85 -156.16 – -97.54 <0.001
Home Kicks 0.73 0.51 – 0.94 <0.001
Home Marks 0.63 0.36 – 0.90 <0.001
Home Inside 50s 2.52 1.97 – 3.07 <0.001
Observations 216 216 216
R2 / R2 adjusted 0.173 / 0.170 0.092 / 0.087 0.276 / 0.273

3.3.2 Multiple Predictor Model

A multiple linear regression with two or more predictors to explain the outcome

model.2.1 <- lm(Margin ~ Home.Kicks + Home.Marks, data = afl_2025)
# "." indicates to all the predictors
model.2.2 <- lm(Margin ~ ., data = afl_2025)

# Coefficients of Predictors
tab_model(model.2.1, model.2.2, 
          dv.labels = c("model.2.1", "model.2.2"))
  model.2.1 model.2.2
Predictors Estimates CI p Estimates CI p
(Intercept) -146.47 -193.54 – -99.40 <0.001 -157.88 -200.82 – -114.94 <0.001
Home Kicks 0.70 0.40 – 0.99 <0.001 0.01 -0.33 – 0.35 0.961
Home Marks 0.05 -0.30 – 0.41 0.769 0.44 0.09 – 0.78 0.013
Home Inside 50s 2.33 1.65 – 3.00 <0.001
Observations 216 216
R2 / R2 adjusted 0.174 / 0.166 0.320 / 0.310

3.3.3 Interaction Term

Interaction Term:

  • Add the effect of one predictor to vary with the level of another predictor
  • Improve model capability
model.3.1 <- lm(Margin ~ Home.Kicks + Home.Marks + Home.Kicks:Home.Marks, 
                data = afl_2025)

# Coefficients of Predictors
tab_model(model.3.1)
  Margin
Predictors Estimates CI p
(Intercept) -529.99 -694.26 – -365.72 <0.001
Home Kicks 2.54 1.73 – 3.35 <0.001
Home Marks 3.63 2.12 – 5.14 <0.001
Home Kicks × Home Marks -0.02 -0.02 – -0.01 <0.001
Observations 216
R2 / R2 adjusted 0.254 / 0.244

3.4 Model Comparison

Using function compare_performance to compare the model performance

compare_performance(model.1.1, model.1.2, model.1.3, model.2.1, model.2.2, model.3.1)
## # Comparison of Model Performance Indices
## 
## Name      | Model |  AIC (weights) | AICc (weights) |  BIC (weights) |    R2
## ----------------------------------------------------------------------------
## model.1.1 |    lm | 2193.2 (<.001) | 2193.3 (<.001) | 2203.3 (<.001) | 0.173
## model.1.2 |    lm | 2213.6 (<.001) | 2213.7 (<.001) | 2223.7 (<.001) | 0.092
## model.1.3 |    lm | 2164.6 (0.009) | 2164.7 (0.010) | 2174.7 (0.205) | 0.276
## model.2.1 |    lm | 2195.1 (<.001) | 2195.3 (<.001) | 2208.6 (<.001) | 0.174
## model.2.2 |    lm | 2155.1 (0.991) | 2155.4 (0.990) | 2172.0 (0.795) | 0.320
## model.3.1 |    lm | 2174.9 (<.001) | 2175.2 (<.001) | 2191.8 (<.001) | 0.254
## 
## Name      | R2 (adj.) |   RMSE |  Sigma
## ---------------------------------------
## model.1.1 |     0.170 | 38.243 | 38.421
## model.1.2 |     0.087 | 40.092 | 40.279
## model.1.3 |     0.273 | 35.791 | 35.958
## model.2.1 |     0.166 | 38.235 | 38.503
## model.2.2 |     0.310 | 34.694 | 35.019
## model.3.1 |     0.244 | 36.325 | 36.666
  • AIC: An information-theoretic score for model selection that balances fit and complexity. Lower AIC score indicates a better model.
  • R-squared (R2): The proportion of variance in the outcome explained by the model (Range from 0 to 1). Higher R-squared score indicates a better fit.
  • Root Mean Square Error (RMSE): The square-root of the average squared prediction error. Lower RMSE score indicates a lower difference between predicted and actual outcomes.

model.2.2 shows the best overall performance:

  • Lowest AIC
  • Highest R2
  • Lowest RMSE

3.5 Model Interpretation

summary(model.2.2)
## 
## Call:
## lm(formula = Margin ~ ., data = afl_2025)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -190.707  -22.278    3.521   24.353   81.722 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -1.579e+02  2.178e+01  -7.248 7.78e-12 ***
## Home.Kicks       8.404e-03  1.712e-01   0.049   0.9609    
## Home.Marks       4.363e-01  1.734e-01   2.516   0.0126 *  
## Home.Inside.50s  2.325e+00  3.448e-01   6.744 1.44e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 35.02 on 212 degrees of freedom
## Multiple R-squared:  0.3197, Adjusted R-squared:  0.3101 
## F-statistic: 33.21 on 3 and 212 DF,  p-value: < 2.2e-16
  • Intercept (-157.88): The predicted margin when all predictors are 0.
  • Home.Kicks (0.01): Each additional home kick is associated with 0.01 points in margin.
  • Home.Marks (0.44): Each additional home mark is associated with 0.44 points in margin.
    • (significant effect, p < 0.05)
  • Home.Inside.50s (2.33): Each extra home inside-50 entry is associated with 2.33 points in margin.
    • (high significant effect, p < 0.001)

3.6 Model Performance

Using function check_model to check the effectiveness of model.

check_model(model.2.2)

  • Posterior Predictive Check:
    • Compare data simulated from the fitted model to the actual data
    • Simulated distributions overlapping the observed (whether fit the data)
  • Linearity:
    • Assesses whether the mean of outcome relates linearly to predictors
    • Residuals vs. fitted without systematic curves
  • Homogeneity of Variance:
    • Errors have constant variance across fitted values
    • No significant fluctuation in residuals vs. fitted
  • Influential Observations:
    • Detects points with large influence
    • Avoid points exceeding reference lines (funnel shape)
  • Collinearity:
    • Check whether the variables appear multicollinearity
    • Avoid high VIF (≥ 10)
  • Normality of Residuals:
    • Check whether residuals are normal distributed
    • The dots should follow the line