Multiple linear regression (MLR) extends simple regression by modeling a continuous outcome as a linear combination of several predictors: \[ Y=\beta_0+\beta_1X_1+\cdots+\beta_iX_i+\varepsilon \]
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.2 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(fitzRoy)
library(skimr)
library(ggplot2)
library(sjPlot)
## Install package "strengejacke" from GitHub (`devtools::install_github("strengejacke/strengejacke")`) to load all sj-packages at once!
library(performance)
# Download data
afl_raw_2025 <- fetch_player_stats_afltables(season = 2025)
## ℹ Looking for data from 2025-01-01 to 2025-11-03✔ Looking for data from 2025-01-01 to 2025-11-03 [12ms]
## ℹ Fetching cached data from <github.com/jimmyday12/fitzRoy_data>✔ Fetching cached data from <github.com/jimmyday12/fitzRoy_data> [2.4s]
## ℹ New data found! Fetching new data from 1 matchesℹ Scraping AFL Tables ✔ Scraping AFL Tables [123ms]
## ℹ New data found! Fetching new data from 1 matchesℹ Cleaning AFL Tables data ✔ Cleaning AFL Tables data [32ms]
## ℹ New data found! Fetching new data from 1 matchesℹ Fetching cached ID data from <github.com/jimmyday12/fitzRoy_data> Rows: 13273 Columns: 9
## ℹ Fetching cached ID data from <github.com/jimmyday12/fitzRoy_data>
## ℹ Fetching cached ID data from <github.com/jimmyday12/fitzRoy_data> ── Column specification ────────────────────────────────────────────────────────
## ℹ Fetching cached ID data from <github.com/jimmyday12/fitzRoy_data> Delimiter: ","
## chr (6): player, dob, round, match, date, url
## dbl (3): rank, suffix, ID
##
## ℹ Fetching cached ID data from <github.com/jimmyday12/fitzRoy_data>
##
## ℹ Fetching cached ID data from <github.com/jimmyday12/fitzRoy_data> ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Fetching cached ID data from <github.com/jimmyday12/fitzRoy_data> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## ℹ Fetching cached ID data from <github.com/jimmyday12/fitzRoy_data>✔ Fetching cached ID data from <github.com/jimmyday12/fitzRoy_data> [434ms]
## ℹ New data found! Fetching new data from 1 matches✔ New data found! Fetching new data from 1 matches [639ms]
## ℹ Tidying data✔ Tidying data [18ms]
# Sum each match's home-team Kicks, Marks, and Inside.50s and calculate the margin.
# Only focused on: Margin, Kicks, Marks, Inside.50s
afl_2025 <- afl_raw_2025 %>%
group_by(Round, Playing.for) %>%
filter(Playing.for == Home.team) %>%
mutate(across(c(Kicks, Marks, Inside.50s), sum)) %>%
select(Round, Playing.for, Home.team, Home.score, Away.team, Away.score,
Home.Kicks = Kicks, Home.Marks = Marks, Home.Inside.50s = Inside.50s) %>%
slice(1) %>%
mutate(Margin = Home.score - Away.score) %>%
ungroup() %>%
select(Margin, Home.Kicks, Home.Marks, Home.Inside.50s)
# Preview the Data
afl_2025
## # A tibble: 216 × 4
## Margin Home.Kicks Home.Marks Home.Inside.50s
## <int> <int> <int> <int>
## 1 9 232 93 67
## 2 95 209 86 67
## 3 52 214 93 54
## 4 -20 203 71 40
## 5 8 234 126 45
## 6 -14 206 102 62
## 7 -4 210 98 61
## 8 10 193 74 62
## 9 -35 218 98 48
## 10 0 195 71 51
## # ℹ 206 more rows
Independent Variables (\(X\)): Home.Kicks, Home.Marks,
Home.Inside.50s.
Dependent Variables (\(Y\)): Margin
skim(afl_2025)
## Warning in attr(x, "align"): 'xfun::attr()' is deprecated.
## Use 'xfun::attr2()' instead.
## See help("Deprecated")
## Warning in attr(x, "align"): 'xfun::attr()' is deprecated.
## Use 'xfun::attr2()' instead.
## See help("Deprecated")
| Name | afl_2025 |
| Number of rows | 216 |
| Number of columns | 4 |
| _______________________ | |
| Column type frequency: | |
| numeric | 4 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Margin | 0 | 1 | 5.63 | 42.16 | -101 | -20 | 7.0 | 30.25 | 98 | ▁▃▇▃▂ |
| Home.Kicks | 0 | 1 | 211.59 | 24.15 | 163 | 196 | 209.0 | 224.25 | 398 | ▇▇▁▁▁ |
| Home.Marks | 0 | 1 | 90.99 | 20.27 | 48 | 76 | 89.5 | 105.00 | 165 | ▃▇▇▂▁ |
| Home.Inside.50s | 0 | 1 | 52.49 | 8.78 | 34 | 46 | 53.0 | 58.00 | 102 | ▅▇▂▁▁ |
Use ggplot to create histogram for each variable
afl_2025 %>%
pivot_longer(cols = c(Margin, Home.Kicks, Home.Marks, Home.Inside.50s),
names_to = "Metric", values_to = "Value") %>%
ggplot(aes(x = Value)) +
geom_histogram(color = 'white') +
facet_wrap(~ Metric, scale = 'free') +
theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Correlation between \(X\) and \(Y\)
afl_2025 %>%
pivot_longer(cols = c(Home.Kicks, Home.Marks, Home.Inside.50s),
names_to = "Metric", values_to = "Value") %>%
ggplot(aes(x = Margin, y = Value)) +
geom_point() +
geom_smooth(method = 'lm') +
facet_wrap(~ Metric, scale = 'free_y') +
theme_bw()
## `geom_smooth()` using formula = 'y ~ x'
All correlation are positive means the variables tend to move in the
same direction.
Higher values of Home.Kicks, Home.Marks,
Home.Inside.50s are associated with higher
Margin.
A simple linear regression with one predictor to explain the outcome
model.1.1 <- lm(Margin ~ Home.Kicks, data = afl_2025)
model.1.2 <- lm(Margin ~ Home.Marks, data = afl_2025)
model.1.3 <- lm(Margin ~ Home.Inside.50s, data = afl_2025)
# Coefficients of Predictors
tab_model(model.1.1, model.1.2, model.1.3,
dv.labels = c("model.1.1", "model.1.2", "model.1.3"))
| model.1.1 | model.1.2 | model.1.3 | |||||||
|---|---|---|---|---|---|---|---|---|---|
| Predictors | Estimates | CI | p | Estimates | CI | p | Estimates | CI | p |
| (Intercept) | -148.18 | -193.72 – -102.64 | <0.001 | -51.62 | -76.51 – -26.72 | <0.001 | -126.85 | -156.16 – -97.54 | <0.001 |
| Home Kicks | 0.73 | 0.51 – 0.94 | <0.001 | ||||||
| Home Marks | 0.63 | 0.36 – 0.90 | <0.001 | ||||||
| Home Inside 50s | 2.52 | 1.97 – 3.07 | <0.001 | ||||||
| Observations | 216 | 216 | 216 | ||||||
| R2 / R2 adjusted | 0.173 / 0.170 | 0.092 / 0.087 | 0.276 / 0.273 | ||||||
A multiple linear regression with two or more predictors to explain the outcome
model.2.1 <- lm(Margin ~ Home.Kicks + Home.Marks, data = afl_2025)
# "." indicates to all the predictors
model.2.2 <- lm(Margin ~ ., data = afl_2025)
# Coefficients of Predictors
tab_model(model.2.1, model.2.2,
dv.labels = c("model.2.1", "model.2.2"))
| model.2.1 | model.2.2 | |||||
|---|---|---|---|---|---|---|
| Predictors | Estimates | CI | p | Estimates | CI | p |
| (Intercept) | -146.47 | -193.54 – -99.40 | <0.001 | -157.88 | -200.82 – -114.94 | <0.001 |
| Home Kicks | 0.70 | 0.40 – 0.99 | <0.001 | 0.01 | -0.33 – 0.35 | 0.961 |
| Home Marks | 0.05 | -0.30 – 0.41 | 0.769 | 0.44 | 0.09 – 0.78 | 0.013 |
| Home Inside 50s | 2.33 | 1.65 – 3.00 | <0.001 | |||
| Observations | 216 | 216 | ||||
| R2 / R2 adjusted | 0.174 / 0.166 | 0.320 / 0.310 | ||||
Interaction Term:
model.3.1 <- lm(Margin ~ Home.Kicks + Home.Marks + Home.Kicks:Home.Marks,
data = afl_2025)
# Coefficients of Predictors
tab_model(model.3.1)
| Margin | |||
|---|---|---|---|
| Predictors | Estimates | CI | p |
| (Intercept) | -529.99 | -694.26 – -365.72 | <0.001 |
| Home Kicks | 2.54 | 1.73 – 3.35 | <0.001 |
| Home Marks | 3.63 | 2.12 – 5.14 | <0.001 |
| Home Kicks × Home Marks | -0.02 | -0.02 – -0.01 | <0.001 |
| Observations | 216 | ||
| R2 / R2 adjusted | 0.254 / 0.244 | ||
Using function compare_performance to compare the model
performance
compare_performance(model.1.1, model.1.2, model.1.3, model.2.1, model.2.2, model.3.1)
## # Comparison of Model Performance Indices
##
## Name | Model | AIC (weights) | AICc (weights) | BIC (weights) | R2
## ----------------------------------------------------------------------------
## model.1.1 | lm | 2193.2 (<.001) | 2193.3 (<.001) | 2203.3 (<.001) | 0.173
## model.1.2 | lm | 2213.6 (<.001) | 2213.7 (<.001) | 2223.7 (<.001) | 0.092
## model.1.3 | lm | 2164.6 (0.009) | 2164.7 (0.010) | 2174.7 (0.205) | 0.276
## model.2.1 | lm | 2195.1 (<.001) | 2195.3 (<.001) | 2208.6 (<.001) | 0.174
## model.2.2 | lm | 2155.1 (0.991) | 2155.4 (0.990) | 2172.0 (0.795) | 0.320
## model.3.1 | lm | 2174.9 (<.001) | 2175.2 (<.001) | 2191.8 (<.001) | 0.254
##
## Name | R2 (adj.) | RMSE | Sigma
## ---------------------------------------
## model.1.1 | 0.170 | 38.243 | 38.421
## model.1.2 | 0.087 | 40.092 | 40.279
## model.1.3 | 0.273 | 35.791 | 35.958
## model.2.1 | 0.166 | 38.235 | 38.503
## model.2.2 | 0.310 | 34.694 | 35.019
## model.3.1 | 0.244 | 36.325 | 36.666
model.2.2 shows the best overall performance:
summary(model.2.2)
##
## Call:
## lm(formula = Margin ~ ., data = afl_2025)
##
## Residuals:
## Min 1Q Median 3Q Max
## -190.707 -22.278 3.521 24.353 81.722
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.579e+02 2.178e+01 -7.248 7.78e-12 ***
## Home.Kicks 8.404e-03 1.712e-01 0.049 0.9609
## Home.Marks 4.363e-01 1.734e-01 2.516 0.0126 *
## Home.Inside.50s 2.325e+00 3.448e-01 6.744 1.44e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 35.02 on 212 degrees of freedom
## Multiple R-squared: 0.3197, Adjusted R-squared: 0.3101
## F-statistic: 33.21 on 3 and 212 DF, p-value: < 2.2e-16
Using function check_model to check the effectiveness of
model.
check_model(model.2.2)