1. Recap:

Multiple linear regression (MLR): \[ Y=\beta_0+\beta_1X_1+\cdots+\beta_iX_i+\varepsilon \]

  • \(X\) = Independent Variable (Predictor)
  • \(Y\) = Dependent Variable (Outcome)
  • \(\beta_0\) = Intercept (The baseline level of \(Y\) when all predictors equal zero)
  • \(\beta_i\) = Slope coefficients (The expected change in \(Y\) for a one-unit increase in \(X\))
  • \(\varepsilon\) = Random error (Independent with the same variance)

2. Selecting the Best Model:

  • Primary Goal: Maximize performance while maintaining interpretability.
  • Core Criteria: Greater R2, Lower RMSE, Lower AIC.
  • Selection Rule: Use AIC as the primary metric to compare candidate models fit on the same data; choose the model with the lowest AIC.
  • Overfitting: Cross-validation to prevent overfitting in training dataset.

3. Step-by-Step Guide

3.1 Package Import

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.2     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 1.4.1 ──
## ✔ broom        1.0.9     ✔ rsample      1.3.1
## ✔ dials        1.4.2     ✔ tailor       0.1.0
## ✔ infer        1.0.9     ✔ tune         2.0.0
## ✔ modeldata    1.5.1     ✔ workflows    1.3.0
## ✔ parsnip      1.3.3     ✔ workflowsets 1.1.1
## ✔ recipes      1.3.1     ✔ yardstick    1.3.2
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter()   masks stats::filter()
## ✖ recipes::fixed()  masks stringr::fixed()
## ✖ dplyr::lag()      masks stats::lag()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step()   masks stats::step()
library(fitzRoy)
library(skimr)
library(ggplot2)
library(MASS)
## 
## Attaching package: 'MASS'
## 
## The following object is masked from 'package:dplyr':
## 
##     select
library(sjPlot)
library(performance)
## 
## Attaching package: 'performance'
## 
## The following objects are masked from 'package:yardstick':
## 
##     mae, rmse

3.2 Data Preparation

3.2.1 Data Import

# Download data

afl_raw <- fetch_player_stats_afltables(season = 2023:2025)
## ℹ Looking for data from 2023-01-01 to 2025-11-03✔ Looking for data from 2023-01-01 to 2025-11-03 [29ms]
## ℹ Fetching cached data from <github.com/jimmyday12/fitzRoy_data>✔ Fetching cached data from <github.com/jimmyday12/fitzRoy_data> [3.5s]
## ℹ New data found! Fetching new data from 1 matchesℹ Scraping AFL Tables                             ✔ Scraping AFL Tables [127ms]
## ℹ New data found! Fetching new data from 1 matchesℹ Cleaning AFL Tables data                        ✔ Cleaning AFL Tables data [39ms]
## ℹ New data found! Fetching new data from 1 matchesℹ Fetching cached ID data from <github.com/jimmyday12/fitzRoy_data>                                                                    Rows: 13273 Columns: 9
## ℹ Fetching cached ID data from <github.com/jimmyday12/fitzRoy_data>                                                                    
## ℹ Fetching cached ID data from <github.com/jimmyday12/fitzRoy_data>                                                                    ── Column specification ────────────────────────────────────────────────────────
## ℹ Fetching cached ID data from <github.com/jimmyday12/fitzRoy_data>                                                                    Delimiter: ","
## chr (6): player, dob, round, match, date, url
## dbl (3): rank, suffix, ID
## 
## ℹ Fetching cached ID data from <github.com/jimmyday12/fitzRoy_data>                                                                    
## 
## ℹ Fetching cached ID data from <github.com/jimmyday12/fitzRoy_data>                                                                    ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Fetching cached ID data from <github.com/jimmyday12/fitzRoy_data>                                                                    ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## ℹ Fetching cached ID data from <github.com/jimmyday12/fitzRoy_data>✔ Fetching cached ID data from <github.com/jimmyday12/fitzRoy_data> [1.2s]
## ℹ New data found! Fetching new data from 1 matches✔ New data found! Fetching new data from 1 matches [1.4s]
## ℹ Tidying data✔ Tidying data [26ms]
# Sum each match's home-team Kicks, Marks, and Inside.50s and calculate the margin. 
# Only focused on: Margin, Kicks, Marks, Inside.50s

afl_data <- afl_raw %>% 
  group_by(Season, Round, Playing.for) %>% 
  filter(Playing.for == Home.team) %>% 
  mutate(across(c(Kicks, Marks, Inside.50s), sum)) %>% 
  dplyr::select(Season, Round, Playing.for, Home.team, Home.score, Away.team, Away.score, 
         Home.Kicks = Kicks, Home.Marks = Marks, Home.Inside.50s = Inside.50s) %>% 
  slice(1) %>% 
  mutate(Margin = Home.score - Away.score) %>% 
  ungroup() %>% 
  dplyr::select(Margin, Home.Kicks, Home.Marks, Home.Inside.50s)
  
# Preview the Data
afl_data
## # A tibble: 648 × 4
##    Margin Home.Kicks Home.Marks Home.Inside.50s
##     <int>      <int>      <int>           <int>
##  1    -22        182         69              46
##  2    -49        225         56              44
##  3     16        208         88              60
##  4    -59        224        124              49
##  5     50        251        117              60
##  6      5        208         77              49
##  7     54        230        111              65
##  8      0        228         95              66
##  9     15        229         87              52
## 10     43        253        127              60
## # ℹ 638 more rows

3.2.2 Exploratory data analysis (EDA)

\(X\): Home.Kicks, Home.Marks, Home.Inside.50s.
\(Y\): Margin

3.2.2a Statistical Summary

skim(afl_data)
## Warning in attr(x, "align"): 'xfun::attr()' is deprecated.
## Use 'xfun::attr2()' instead.
## See help("Deprecated")
## Warning in attr(x, "align"): 'xfun::attr()' is deprecated.
## Use 'xfun::attr2()' instead.
## See help("Deprecated")
Data summary
Name afl_data
Number of rows 648
Number of columns 4
_______________________
Column type frequency:
numeric 4
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Margin 0 1 6.74 40.81 -108 -19 5 31.00 171 ▂▇▇▂▁
Home.Kicks 0 1 214.88 21.26 156 200 214 228.00 398 ▃▇▁▁▁
Home.Marks 0 1 92.71 19.82 33 78 92 105.25 165 ▁▆▇▂▁
Home.Inside.50s 0 1 53.19 8.32 31 47 53 59.00 102 ▂▇▃▁▁

3.2.2b Variable Distribution

Use ggplot to create histogram for each variable

afl_data %>% 
  pivot_longer(cols = c(Margin, Home.Kicks, Home.Marks, Home.Inside.50s), 
               names_to = "Metric", values_to = "Value") %>% 
  ggplot(aes(x = Value)) + 
  geom_histogram(color = 'white') + 
  facet_wrap(~ Metric, scale = 'free') + 
  theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

3.2.2c Variable Correlation

Correlation between \(X\) and \(Y\)

afl_data %>% 
  pivot_longer(cols = c(Home.Kicks, Home.Marks, Home.Inside.50s), 
               names_to = "Metric", values_to = "Value") %>% 
  ggplot(aes(x = Margin, y = Value)) + 
  geom_point() + 
  geom_smooth(method = 'lm') + 
  facet_wrap(~ Metric, scale = 'free_y') + 
  theme_bw()
## `geom_smooth()` using formula = 'y ~ x'

3.3 Data Partition

  • 70% for training data
  • 30% for testing data
# set the random seed for reproducible results
set.seed(123)

afl_split <- initial_split(afl_data, prop = 0.7)

afl_train <- training(afl_split)
afl_test <- testing(afl_split)

3.4 Model Building

3.4.1 Build the NULL model

NULL model:

  • An intercept-only model (no predictors)
  • It estimates the outcome’s overall mean
  • Serves as a baseline to judge whether adding predictors actually improves fit
model.0 <- lm(Margin ~ 1, data = afl_train)

3.4.2 Build the FULL model

FULL model:

  • The model that includes all candidate predictors
  • Acts as an upper-bound specification to compare against simpler models
  • Maximizes captured signal but risks overfitting and multicollinearity
model.full.1 <- lm(Margin ~ ., data = afl_train)

# Consider the interaction term
model.full.2 <- lm(Margin ~ .^2, data = afl_train)

3.4.3 Find the best model

Find the best model between NULL model and FULL model by using stepAIC

  • Forward Selection:
    • Begin at the NULL model and add the single predictor that gives the largest AIC drop
    • Keep adding predictors one at a time as long as AIC keeps decreasing
    • Stop when no addition improves AIC (Final Model)
  • Backward Selection:
    • Begin at the FULL model and remove the single predictor that increases AIC the least
    • Keep removing predictors while AIC increases
    • Stop when any further removal would lower AIC (Final Model)
model_best.1 <- stepAIC(model.0, 
                        direction = "forward", 
                        scope = formula(model.full.1), 
                        trace = F)

model_best.2 <- stepAIC(model.0, 
                        direction = "forward", 
                        scope = formula(model.full.2), 
                        trace = F)

# Predictors Selection
tab_model(model_best.1, model_best.2, 
          dv.labels = c("model_best.1", "model_best.2"))
  model_best.1 model_best.2
Predictors Estimates CI p Estimates CI p
(Intercept) -198.30 -228.28 – -168.31 <0.001 -552.07 -638.25 – -465.88 <0.001
Home Inside 50s 2.41 2.01 – 2.81 <0.001 8.14 6.71 – 9.56 <0.001
Home Kicks 0.26 0.04 – 0.48 0.020 2.01 1.61 – 2.42 <0.001
Home Marks 0.21 -0.00 – 0.43 0.053
Home Inside 50s × Home
Kicks
-0.03 -0.03 – -0.02 <0.001
Observations 453 453
R2 / R2 adjusted 0.404 / 0.400 0.479 / 0.475

Predictors Selection:

  • model_best.1: Home.Inside.50s, Home.Kicks, Home.Marks
  • model_best.2: Home.Inside.50s, Home.Kicks, Home.Inside.50s * Home.Kicks (Interaction Term)

3.5 Model Comparison

Using function compare_performance to compare the model performance

compare_performance(model_best.1, model_best.2)
## # Comparison of Model Performance Indices
## 
## Name         | Model |  AIC (weights) | AICc (weights) |  BIC (weights) |    R2
## -------------------------------------------------------------------------------
## model_best.1 |    lm | 4402.3 (<.001) | 4402.4 (<.001) | 4422.9 (<.001) | 0.404
## model_best.2 |    lm | 4341.6 (>.999) | 4341.8 (>.999) | 4362.2 (>.999) | 0.479
## 
## Name         | R2 (adj.) |   RMSE |  Sigma
## ------------------------------------------
## model_best.1 |     0.400 | 30.848 | 30.985
## model_best.2 |     0.475 | 28.850 | 28.978
# Predicted Outcomes in testing set
pred.1 <- predict(model_best.1, newdata = afl_test)
pred.2 <- predict(model_best.2, newdata = afl_test)

# Combine the predicted and actual testing outcomes
res <- bind_rows(
  tibble(truth = afl_test$Margin, estimate = pred.1, model = "model_best.1"),
  tibble(truth = afl_test$Margin, estimate = pred.2, model = "model_best.2")
) 

# Evaluation Metrics for new data
res %>% 
  group_by(model) %>%
  yardstick::rmse(truth = truth, estimate = estimate)
## # A tibble: 2 × 4
##   model        .metric .estimator .estimate
##   <chr>        <chr>   <chr>          <dbl>
## 1 model_best.1 rmse    standard        32.1
## 2 model_best.2 rmse    standard        31.6
res %>% 
  group_by(model) %>%
  yardstick::rsq(truth = truth, estimate = estimate)
## # A tibble: 2 × 4
##   model        .metric .estimator .estimate
##   <chr>        <chr>   <chr>          <dbl>
## 1 model_best.1 rsq     standard       0.455
## 2 model_best.2 rsq     standard       0.454

The performance of both models are similar.

3.6 Model Performance

Using function check_model to check the effectiveness of model.

check_model(model_best.1)

check_model(model_best.2)

Notice:

  • Collinearity:
    • All three predictors in model_best.2 have high VIF (≥ 10).
    • Obvious multicollinearity occurs.

Therefore, model_best.1 is the better model choice.

3.7 Model Interpretation

summary(model_best.1)
## 
## Call:
## lm(formula = Margin ~ Home.Inside.50s + Home.Kicks + Home.Marks, 
##     data = afl_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -227.534  -20.875    2.155   20.334  103.531 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -198.2953    15.2569 -12.997   <2e-16 ***
## Home.Inside.50s    2.4136     0.2037  11.848   <2e-16 ***
## Home.Kicks         0.2582     0.1106   2.335   0.0200 *  
## Home.Marks         0.2133     0.1099   1.941   0.0528 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 30.99 on 449 degrees of freedom
## Multiple R-squared:  0.4041, Adjusted R-squared:  0.4001 
## F-statistic: 101.5 on 3 and 449 DF,  p-value: < 2.2e-16
  • Intercept (-198.3): The predicted margin when all predictors are 0.
  • Home.Inside.50s (2.41): Each extra home inside-50 entry is associated with 2.41 points in margin.
    • (high significant effect, p < 0.001)
  • Home.Kicks (0.26): Each additional home kick is associated with 0.26 points in margin.
    • (significant effect, p < 0.05)
  • Home.Marks (0.21): Each additional home mark is associated with 0.21 points in margin.