Multiple linear regression (MLR): \[ Y=\beta_0+\beta_1X_1+\cdots+\beta_iX_i+\varepsilon \]
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.2 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 1.4.1 ──
## ✔ broom 1.0.9 ✔ rsample 1.3.1
## ✔ dials 1.4.2 ✔ tailor 0.1.0
## ✔ infer 1.0.9 ✔ tune 2.0.0
## ✔ modeldata 1.5.1 ✔ workflows 1.3.0
## ✔ parsnip 1.3.3 ✔ workflowsets 1.1.1
## ✔ recipes 1.3.1 ✔ yardstick 1.3.2
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter() masks stats::filter()
## ✖ recipes::fixed() masks stringr::fixed()
## ✖ dplyr::lag() masks stats::lag()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step() masks stats::step()
library(fitzRoy)
library(skimr)
library(ggplot2)
library(MASS)
##
## Attaching package: 'MASS'
##
## The following object is masked from 'package:dplyr':
##
## select
library(sjPlot)
library(performance)
##
## Attaching package: 'performance'
##
## The following objects are masked from 'package:yardstick':
##
## mae, rmse
# Download data
afl_raw <- fetch_player_stats_afltables(season = 2023:2025)
## ℹ Looking for data from 2023-01-01 to 2025-11-03✔ Looking for data from 2023-01-01 to 2025-11-03 [29ms]
## ℹ Fetching cached data from <github.com/jimmyday12/fitzRoy_data>✔ Fetching cached data from <github.com/jimmyday12/fitzRoy_data> [3.5s]
## ℹ New data found! Fetching new data from 1 matchesℹ Scraping AFL Tables ✔ Scraping AFL Tables [127ms]
## ℹ New data found! Fetching new data from 1 matchesℹ Cleaning AFL Tables data ✔ Cleaning AFL Tables data [39ms]
## ℹ New data found! Fetching new data from 1 matchesℹ Fetching cached ID data from <github.com/jimmyday12/fitzRoy_data> Rows: 13273 Columns: 9
## ℹ Fetching cached ID data from <github.com/jimmyday12/fitzRoy_data>
## ℹ Fetching cached ID data from <github.com/jimmyday12/fitzRoy_data> ── Column specification ────────────────────────────────────────────────────────
## ℹ Fetching cached ID data from <github.com/jimmyday12/fitzRoy_data> Delimiter: ","
## chr (6): player, dob, round, match, date, url
## dbl (3): rank, suffix, ID
##
## ℹ Fetching cached ID data from <github.com/jimmyday12/fitzRoy_data>
##
## ℹ Fetching cached ID data from <github.com/jimmyday12/fitzRoy_data> ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Fetching cached ID data from <github.com/jimmyday12/fitzRoy_data> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## ℹ Fetching cached ID data from <github.com/jimmyday12/fitzRoy_data>✔ Fetching cached ID data from <github.com/jimmyday12/fitzRoy_data> [1.2s]
## ℹ New data found! Fetching new data from 1 matches✔ New data found! Fetching new data from 1 matches [1.4s]
## ℹ Tidying data✔ Tidying data [26ms]
# Sum each match's home-team Kicks, Marks, and Inside.50s and calculate the margin.
# Only focused on: Margin, Kicks, Marks, Inside.50s
afl_data <- afl_raw %>%
group_by(Season, Round, Playing.for) %>%
filter(Playing.for == Home.team) %>%
mutate(across(c(Kicks, Marks, Inside.50s), sum)) %>%
dplyr::select(Season, Round, Playing.for, Home.team, Home.score, Away.team, Away.score,
Home.Kicks = Kicks, Home.Marks = Marks, Home.Inside.50s = Inside.50s) %>%
slice(1) %>%
mutate(Margin = Home.score - Away.score) %>%
ungroup() %>%
dplyr::select(Margin, Home.Kicks, Home.Marks, Home.Inside.50s)
# Preview the Data
afl_data
## # A tibble: 648 × 4
## Margin Home.Kicks Home.Marks Home.Inside.50s
## <int> <int> <int> <int>
## 1 -22 182 69 46
## 2 -49 225 56 44
## 3 16 208 88 60
## 4 -59 224 124 49
## 5 50 251 117 60
## 6 5 208 77 49
## 7 54 230 111 65
## 8 0 228 95 66
## 9 15 229 87 52
## 10 43 253 127 60
## # ℹ 638 more rows
\(X\): Home.Kicks, Home.Marks,
Home.Inside.50s.
\(Y\): Margin
skim(afl_data)
## Warning in attr(x, "align"): 'xfun::attr()' is deprecated.
## Use 'xfun::attr2()' instead.
## See help("Deprecated")
## Warning in attr(x, "align"): 'xfun::attr()' is deprecated.
## Use 'xfun::attr2()' instead.
## See help("Deprecated")
| Name | afl_data |
| Number of rows | 648 |
| Number of columns | 4 |
| _______________________ | |
| Column type frequency: | |
| numeric | 4 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Margin | 0 | 1 | 6.74 | 40.81 | -108 | -19 | 5 | 31.00 | 171 | ▂▇▇▂▁ |
| Home.Kicks | 0 | 1 | 214.88 | 21.26 | 156 | 200 | 214 | 228.00 | 398 | ▃▇▁▁▁ |
| Home.Marks | 0 | 1 | 92.71 | 19.82 | 33 | 78 | 92 | 105.25 | 165 | ▁▆▇▂▁ |
| Home.Inside.50s | 0 | 1 | 53.19 | 8.32 | 31 | 47 | 53 | 59.00 | 102 | ▂▇▃▁▁ |
Use ggplot to create histogram for each variable
afl_data %>%
pivot_longer(cols = c(Margin, Home.Kicks, Home.Marks, Home.Inside.50s),
names_to = "Metric", values_to = "Value") %>%
ggplot(aes(x = Value)) +
geom_histogram(color = 'white') +
facet_wrap(~ Metric, scale = 'free') +
theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Correlation between \(X\) and \(Y\)
afl_data %>%
pivot_longer(cols = c(Home.Kicks, Home.Marks, Home.Inside.50s),
names_to = "Metric", values_to = "Value") %>%
ggplot(aes(x = Margin, y = Value)) +
geom_point() +
geom_smooth(method = 'lm') +
facet_wrap(~ Metric, scale = 'free_y') +
theme_bw()
## `geom_smooth()` using formula = 'y ~ x'
# set the random seed for reproducible results
set.seed(123)
afl_split <- initial_split(afl_data, prop = 0.7)
afl_train <- training(afl_split)
afl_test <- testing(afl_split)
NULL model:
model.0 <- lm(Margin ~ 1, data = afl_train)
FULL model:
model.full.1 <- lm(Margin ~ ., data = afl_train)
# Consider the interaction term
model.full.2 <- lm(Margin ~ .^2, data = afl_train)
Find the best model between NULL model and FULL model by using
stepAIC
model_best.1 <- stepAIC(model.0,
direction = "forward",
scope = formula(model.full.1),
trace = F)
model_best.2 <- stepAIC(model.0,
direction = "forward",
scope = formula(model.full.2),
trace = F)
# Predictors Selection
tab_model(model_best.1, model_best.2,
dv.labels = c("model_best.1", "model_best.2"))
| model_best.1 | model_best.2 | |||||
|---|---|---|---|---|---|---|
| Predictors | Estimates | CI | p | Estimates | CI | p |
| (Intercept) | -198.30 | -228.28 – -168.31 | <0.001 | -552.07 | -638.25 – -465.88 | <0.001 |
| Home Inside 50s | 2.41 | 2.01 – 2.81 | <0.001 | 8.14 | 6.71 – 9.56 | <0.001 |
| Home Kicks | 0.26 | 0.04 – 0.48 | 0.020 | 2.01 | 1.61 – 2.42 | <0.001 |
| Home Marks | 0.21 | -0.00 – 0.43 | 0.053 | |||
|
Home Inside 50s × Home Kicks |
-0.03 | -0.03 – -0.02 | <0.001 | |||
| Observations | 453 | 453 | ||||
| R2 / R2 adjusted | 0.404 / 0.400 | 0.479 / 0.475 | ||||
Predictors Selection:
Using function compare_performance to compare the model
performance
compare_performance(model_best.1, model_best.2)
## # Comparison of Model Performance Indices
##
## Name | Model | AIC (weights) | AICc (weights) | BIC (weights) | R2
## -------------------------------------------------------------------------------
## model_best.1 | lm | 4402.3 (<.001) | 4402.4 (<.001) | 4422.9 (<.001) | 0.404
## model_best.2 | lm | 4341.6 (>.999) | 4341.8 (>.999) | 4362.2 (>.999) | 0.479
##
## Name | R2 (adj.) | RMSE | Sigma
## ------------------------------------------
## model_best.1 | 0.400 | 30.848 | 30.985
## model_best.2 | 0.475 | 28.850 | 28.978
# Predicted Outcomes in testing set
pred.1 <- predict(model_best.1, newdata = afl_test)
pred.2 <- predict(model_best.2, newdata = afl_test)
# Combine the predicted and actual testing outcomes
res <- bind_rows(
tibble(truth = afl_test$Margin, estimate = pred.1, model = "model_best.1"),
tibble(truth = afl_test$Margin, estimate = pred.2, model = "model_best.2")
)
# Evaluation Metrics for new data
res %>%
group_by(model) %>%
yardstick::rmse(truth = truth, estimate = estimate)
## # A tibble: 2 × 4
## model .metric .estimator .estimate
## <chr> <chr> <chr> <dbl>
## 1 model_best.1 rmse standard 32.1
## 2 model_best.2 rmse standard 31.6
res %>%
group_by(model) %>%
yardstick::rsq(truth = truth, estimate = estimate)
## # A tibble: 2 × 4
## model .metric .estimator .estimate
## <chr> <chr> <chr> <dbl>
## 1 model_best.1 rsq standard 0.455
## 2 model_best.2 rsq standard 0.454
The performance of both models are similar.
Using function check_model to check the effectiveness of
model.
check_model(model_best.1)
check_model(model_best.2)
Notice:
Therefore, model_best.1 is the better model choice.
summary(model_best.1)
##
## Call:
## lm(formula = Margin ~ Home.Inside.50s + Home.Kicks + Home.Marks,
## data = afl_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -227.534 -20.875 2.155 20.334 103.531
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -198.2953 15.2569 -12.997 <2e-16 ***
## Home.Inside.50s 2.4136 0.2037 11.848 <2e-16 ***
## Home.Kicks 0.2582 0.1106 2.335 0.0200 *
## Home.Marks 0.2133 0.1099 1.941 0.0528 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 30.99 on 449 degrees of freedom
## Multiple R-squared: 0.4041, Adjusted R-squared: 0.4001
## F-statistic: 101.5 on 3 and 449 DF, p-value: < 2.2e-16