1. Recap:

Multiple linear regression (MLR): \[ Y=\beta_0+\beta_1X_1+\cdots+\beta_iX_i+\varepsilon \]

\(X\) = Independent Variable (Predictor)
\(Y\) = Dependent Variable (Outcome)
\(\beta_0\) = Intercept (The baseline level of \(Y\) when all predictors equal zero)
\(\beta_i\) = Slope coefficients (The expected change in \(Y\) for a one-unit increase in \(X\))
\(\varepsilon\) = Random error (Independent with the same variance)

2. Selecting the Best Model:

Primary Goal: Maximize performance while maintaining interpretability.
Core Criteria: Greater R2, Lower RMSE, Lower AIC.
Selection Rule: Use AIC as the primary metric to compare candidate models fit on the same data; choose the model with the lowest AIC.
Overfitting: Cross-validation to prevent overfitting in training dataset.

3. Step-by-Step Guide

3.1 Package Import

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.2     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(tidymodels)

## ── Attaching packages ────────────────────────────────────── tidymodels 1.4.1 ──
## ✔ broom        1.0.9     ✔ rsample      1.3.1
## ✔ dials        1.4.2     ✔ tailor       0.1.0
## ✔ infer        1.0.9     ✔ tune         2.0.0
## ✔ modeldata    1.5.1     ✔ workflows    1.3.0
## ✔ parsnip      1.3.3     ✔ workflowsets 1.1.1
## ✔ recipes      1.3.1     ✔ yardstick    1.3.2
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter()   masks stats::filter()
## ✖ recipes::fixed()  masks stringr::fixed()
## ✖ dplyr::lag()      masks stats::lag()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step()   masks stats::step()

library(fitzRoy)
library(skimr)
library(ggplot2)
library(MASS)

## 
## Attaching package: 'MASS'
## 
## The following object is masked from 'package:dplyr':
## 
##     select

library(sjPlot)
library(performance)

## 
## Attaching package: 'performance'
## 
## The following objects are masked from 'package:yardstick':
## 
##     mae, rmse

3.2 Data Preparation

3.2.1 Data Import

# Download data

afl_raw <- fetch_player_stats_afltables(season = 2023:2025)

## ℹ Looking for data from 2023-01-01 to 2025-11-03✔ Looking for data from 2023-01-01 to 2025-11-03 [29ms]
## ℹ Fetching cached data from <github.com/jimmyday12/fitzRoy_data>✔ Fetching cached data from <github.com/jimmyday12/fitzRoy_data> [3.5s]
## ℹ New data found! Fetching new data from 1 matchesℹ Scraping AFL Tables                             ✔ Scraping AFL Tables [127ms]
## ℹ New data found! Fetching new data from 1 matchesℹ Cleaning AFL Tables data                        ✔ Cleaning AFL Tables data [39ms]
## ℹ New data found! Fetching new data from 1 matchesℹ Fetching cached ID data from <github.com/jimmyday12/fitzRoy_data>                                                                    Rows: 13273 Columns: 9
## ℹ Fetching cached ID data from <github.com/jimmyday12/fitzRoy_data>                                                                    
## ℹ Fetching cached ID data from <github.com/jimmyday12/fitzRoy_data>                                                                    ── Column specification ────────────────────────────────────────────────────────
## ℹ Fetching cached ID data from <github.com/jimmyday12/fitzRoy_data>                                                                    Delimiter: ","
## chr (6): player, dob, round, match, date, url
## dbl (3): rank, suffix, ID
## 
## ℹ Fetching cached ID data from <github.com/jimmyday12/fitzRoy_data>                                                                    
## 
## ℹ Fetching cached ID data from <github.com/jimmyday12/fitzRoy_data>                                                                    ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Fetching cached ID data from <github.com/jimmyday12/fitzRoy_data>                                                                    ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## ℹ Fetching cached ID data from <github.com/jimmyday12/fitzRoy_data>✔ Fetching cached ID data from <github.com/jimmyday12/fitzRoy_data> [1.2s]
## ℹ New data found! Fetching new data from 1 matches✔ New data found! Fetching new data from 1 matches [1.4s]
## ℹ Tidying data✔ Tidying data [26ms]

# Sum each match's home-team Kicks, Marks, and Inside.50s and calculate the margin. 
# Only focused on: Margin, Kicks, Marks, Inside.50s

afl_data <- afl_raw %>% 
  group_by(Season, Round, Playing.for) %>% 
  filter(Playing.for == Home.team) %>% 
  mutate(across(c(Kicks, Marks, Inside.50s), sum)) %>% 
  dplyr::select(Season, Round, Playing.for, Home.team, Home.score, Away.team, Away.score, 
         Home.Kicks = Kicks, Home.Marks = Marks, Home.Inside.50s = Inside.50s) %>% 
  slice(1) %>% 
  mutate(Margin = Home.score - Away.score) %>% 
  ungroup() %>% 
  dplyr::select(Margin, Home.Kicks, Home.Marks, Home.Inside.50s)
  
# Preview the Data
afl_data

## # A tibble: 648 × 4
##    Margin Home.Kicks Home.Marks Home.Inside.50s
##     <int>      <int>      <int>           <int>
##  1    -22        182         69              46
##  2    -49        225         56              44
##  3     16        208         88              60
##  4    -59        224        124              49
##  5     50        251        117              60
##  6      5        208         77              49
##  7     54        230        111              65
##  8      0        228         95              66
##  9     15        229         87              52
## 10     43        253        127              60
## # ℹ 638 more rows

3.2.2 Exploratory data analysis (EDA)

\(X\): Home.Kicks, Home.Marks, Home.Inside.50s.
\(Y\): Margin

3.2.2a Statistical Summary

skim(afl_data)

## Warning in attr(x, "align"): 'xfun::attr()' is deprecated.
## Use 'xfun::attr2()' instead.
## See help("Deprecated")
## Warning in attr(x, "align"): 'xfun::attr()' is deprecated.
## Use 'xfun::attr2()' instead.
## See help("Deprecated")

Data summary
Name	afl_data
Number of rows	648
Number of columns	4
_______________________
Column type frequency:
numeric	4
________________________
Group variables	None

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
Margin	1	6.74	40.81	-108	-19	5	31.00	171	▂▇▇▂▁
Home.Kicks	1	214.88	21.26	156	200	214	228.00	398	▃▇▁▁▁
Home.Marks	1	92.71	19.82	33	78	92	105.25	165	▁▆▇▂▁
Home.Inside.50s	1	53.19	8.32	31	47	53	59.00	102	▂▇▃▁▁

3.2.2b Variable Distribution

Use ggplot to create histogram for each variable

afl_data %>% 
  pivot_longer(cols = c(Margin, Home.Kicks, Home.Marks, Home.Inside.50s), 
               names_to = "Metric", values_to = "Value") %>% 
  ggplot(aes(x = Value)) + 
  geom_histogram(color = 'white') + 
  facet_wrap(~ Metric, scale = 'free') + 
  theme_bw()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

3.2.2c Variable Correlation

Correlation between \(X\) and \(Y\)

afl_data %>% 
  pivot_longer(cols = c(Home.Kicks, Home.Marks, Home.Inside.50s), 
               names_to = "Metric", values_to = "Value") %>% 
  ggplot(aes(x = Margin, y = Value)) + 
  geom_point() + 
  geom_smooth(method = 'lm') + 
  facet_wrap(~ Metric, scale = 'free_y') + 
  theme_bw()

## `geom_smooth()` using formula = 'y ~ x'

3.3 Data Partition

70% for training data
30% for testing data

# set the random seed for reproducible results
set.seed(123)

afl_split <- initial_split(afl_data, prop = 0.7)

afl_train <- training(afl_split)
afl_test <- testing(afl_split)

3.4 Model Building

3.4.1 Build the NULL model

NULL model:

An intercept-only model (no predictors)
It estimates the outcome’s overall mean
Serves as a baseline to judge whether adding predictors actually improves fit

model.0 <- lm(Margin ~ 1, data = afl_train)

3.4.2 Build the FULL model

FULL model:

The model that includes all candidate predictors
Acts as an upper-bound specification to compare against simpler models
Maximizes captured signal but risks overfitting and multicollinearity

model.full.1 <- lm(Margin ~ ., data = afl_train)

# Consider the interaction term
model.full.2 <- lm(Margin ~ .^2, data = afl_train)

3.4.3 Find the best model

Find the best model between NULL model and FULL model by using stepAIC

Forward Selection:
- Begin at the NULL model and add the single predictor that gives the largest AIC drop
- Keep adding predictors one at a time as long as AIC keeps decreasing
- Stop when no addition improves AIC (Final Model)
Backward Selection:
- Begin at the FULL model and remove the single predictor that increases AIC the least
- Keep removing predictors while AIC increases
- Stop when any further removal would lower AIC (Final Model)

model_best.1 <- stepAIC(model.0, 
                        direction = "forward", 
                        scope = formula(model.full.1), 
                        trace = F)

model_best.2 <- stepAIC(model.0, 
                        direction = "forward", 
                        scope = formula(model.full.2), 
                        trace = F)

# Predictors Selection
tab_model(model_best.1, model_best.2, 
          dv.labels = c("model_best.1", "model_best.2"))

	model_best.1			model_best.2
Predictors	Estimates	CI	p	Estimates	CI	p
(Intercept)	-198.30	-228.28 – -168.31	<0.001	-552.07	-638.25 – -465.88	<0.001
Home Inside 50s	2.41	2.01 – 2.81	<0.001	8.14	6.71 – 9.56	<0.001
Home Kicks	0.26	0.04 – 0.48	0.020	2.01	1.61 – 2.42	<0.001
Home Marks	0.21	-0.00 – 0.43	0.053
Home Inside 50s × Home Kicks				-0.03	-0.03 – -0.02	<0.001
Observations	453			453
R² / R² adjusted	0.404 / 0.400			0.479 / 0.475

Predictors Selection:

model_best.1: Home.Inside.50s, Home.Kicks, Home.Marks
model_best.2: Home.Inside.50s, Home.Kicks, Home.Inside.50s * Home.Kicks (Interaction Term)

3.5 Model Comparison

Using function compare_performance to compare the model performance

compare_performance(model_best.1, model_best.2)

## # Comparison of Model Performance Indices
## 
## Name         | Model |  AIC (weights) | AICc (weights) |  BIC (weights) |    R2
## -------------------------------------------------------------------------------
## model_best.1 |    lm | 4402.3 (<.001) | 4402.4 (<.001) | 4422.9 (<.001) | 0.404
## model_best.2 |    lm | 4341.6 (>.999) | 4341.8 (>.999) | 4362.2 (>.999) | 0.479
## 
## Name         | R2 (adj.) |   RMSE |  Sigma
## ------------------------------------------
## model_best.1 |     0.400 | 30.848 | 30.985
## model_best.2 |     0.475 | 28.850 | 28.978

# Predicted Outcomes in testing set
pred.1 <- predict(model_best.1, newdata = afl_test)
pred.2 <- predict(model_best.2, newdata = afl_test)

# Combine the predicted and actual testing outcomes
res <- bind_rows(
  tibble(truth = afl_test$Margin, estimate = pred.1, model = "model_best.1"),
  tibble(truth = afl_test$Margin, estimate = pred.2, model = "model_best.2")
) 

# Evaluation Metrics for new data
res %>% 
  group_by(model) %>%
  yardstick::rmse(truth = truth, estimate = estimate)

## # A tibble: 2 × 4
##   model        .metric .estimator .estimate
##   <chr>        <chr>   <chr>          <dbl>
## 1 model_best.1 rmse    standard        32.1
## 2 model_best.2 rmse    standard        31.6

res %>% 
  group_by(model) %>%
  yardstick::rsq(truth = truth, estimate = estimate)

## # A tibble: 2 × 4
##   model        .metric .estimator .estimate
##   <chr>        <chr>   <chr>          <dbl>
## 1 model_best.1 rsq     standard       0.455
## 2 model_best.2 rsq     standard       0.454

The performance of both models are similar.

3.6 Model Performance

Using function check_model to check the effectiveness of model.

check_model(model_best.1)

check_model(model_best.2)

Notice:

Collinearity:
- All three predictors in model_best.2 have high VIF (≥ 10).
- Obvious multicollinearity occurs.

Therefore, model_best.1 is the better model choice.

3.7 Model Interpretation

summary(model_best.1)

## 
## Call:
## lm(formula = Margin ~ Home.Inside.50s + Home.Kicks + Home.Marks, 
##     data = afl_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -227.534  -20.875    2.155   20.334  103.531 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -198.2953    15.2569 -12.997   <2e-16 ***
## Home.Inside.50s    2.4136     0.2037  11.848   <2e-16 ***
## Home.Kicks         0.2582     0.1106   2.335   0.0200 *  
## Home.Marks         0.2133     0.1099   1.941   0.0528 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 30.99 on 449 degrees of freedom
## Multiple R-squared:  0.4041, Adjusted R-squared:  0.4001 
## F-statistic: 101.5 on 3 and 449 DF,  p-value: < 2.2e-16

Intercept (-198.3): The predicted margin when all predictors are 0.
Home.Inside.50s (2.41): Each extra home inside-50 entry is associated with 2.41 points in margin.
- (high significant effect, p < 0.001)
Home.Kicks (0.26): Each additional home kick is associated with 0.26 points in margin.
- (significant effect, p < 0.05)
Home.Marks (0.21): Each additional home mark is associated with 0.21 points in margin.

Multiple Linear Regression - Advanced Guide

Toby Huang

2025-10-31