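In this tutorial, I demonstrate how to build an ARIMA model to forecast the total number of goals scored in soccer matches. ARIMA (AutoRegressive Integrated Moving Average) is a widely used time series forecasting technique: it models autocorrelation through autoregressive and moving-average terms and removes trends through differencing. Its seasonal extension, which auto.arima() also considers for monthly data, adds matching terms at the seasonal lag.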
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 4.0.0 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(forecast)
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
library(lubridate)
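For demonstration, assume we have a dataset of monthly total goals from January 2018 to December 2023.
The dataset is simulated for this tutorial as independent normal draws (mean 45, standard deviation 10) rounded to whole goals, so the series has no built-in trend or seasonality.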
set.seed(123)
date <- seq(as.Date("2018-01-01"), as.Date("2023-12-01"), by = "month")
goals <- round(rnorm(length(date), mean = 45, sd = 10))
soccer_data <- tibble(date, goals)
# Inspect first 6 rows
head(soccer_data)
## # A tibble: 6 × 2
##   date       goals
##   <date>     <dbl>
## 1 2018-01-01    39
## 2 2018-02-01    43
## 3 2018-03-01    61
## 4 2018-04-01    46
## 5 2018-05-01    46
## 6 2018-06-01    62
Convert the data into a time series object suitable for ARIMA modeling and visualize it.
ts_goals <- ts(soccer_data$goals, start = c(2018, 1), frequency = 12)
autoplot(ts_goals) +
  ggtitle("Monthly Total Goals (2018–2023)") +
  xlab("Year") + ylab("Goals")
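Before fitting anything, it can help to confirm that the time index on the ts object is what was intended and to look at the sample autocorrelations. The sketch below rebuilds the simulated series so it runs on its own and uses only base R; the variable names `acf_vals` and `bound` are illustrative and not part of the original workflow.

```r
# Rebuild the simulated series from the tutorial so this block is self-contained
set.seed(123)
date <- seq(as.Date("2018-01-01"), as.Date("2023-12-01"), by = "month")
goals <- round(rnorm(length(date), mean = 45, sd = 10))
ts_goals <- ts(goals, start = c(2018, 1), frequency = 12)

# Confirm the time index matches the intended span
start(ts_goals)      # first (year, month) pair
end(ts_goals)        # last (year, month) pair
frequency(ts_goals)  # observations per year
length(ts_goals)     # total number of months

# Sample autocorrelations for the first two years of lags (no plot drawn)
acf_vals <- acf(ts_goals, lag.max = 24, plot = FALSE)

# Approximate 95% band for white noise; count the lags that fall outside it
bound <- 1.96 / sqrt(length(ts_goals))
sum(abs(acf_vals$acf[-1]) > bound)
```

For a series with no real structure, only a small number of lags would be expected to fall outside the band by chance.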
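Use the auto.arima() function to select an ARIMA specification automatically. By default it runs a stepwise search and keeps the candidate with the lowest AICc, so the result is the best model among those it tried rather than a guaranteed optimum.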
fit_arima <- auto.arima(ts_goals)
summary(fit_arima)
## Series: ts_goals
## ARIMA(0,0,0)(1,0,0)[12] with non-zero mean
##
## Coefficients:
##          sar1     mean
##       -0.2055  45.3750
## s.e.   0.1287   0.9253
##
## sigma^2 = 86.91: log likelihood = -262.15
## AIC=530.29 AICc=530.64 BIC=537.12
##
## Training set error measures:
##                     ME     RMSE      MAE       MPE     MAPE      MASE        ACF1
## Training set 0.0498764 9.192385 7.311318 -4.557838 17.54192 0.6135371 -0.02551311
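In the summary above, the seasonal AR coefficient (sar1 = -0.2055, s.e. 0.1287) is only about 1.6 standard errors from zero, so the seasonal term is weakly supported. A quick way to check whether the fitted model leaves structure in the residuals is a Ljung-Box test. The following is a minimal sketch, assuming a lag of 24 (two seasonal periods) is a reasonable choice; `n_coef`, `z_ratio`, and `lb` are illustrative names.

```r
library(forecast)

# Rebuild the simulated series and refit so this block is self-contained
set.seed(123)
goals <- round(rnorm(72, mean = 45, sd = 10))  # 72 months: Jan 2018 to Dec 2023
ts_goals <- ts(goals, start = c(2018, 1), frequency = 12)
fit_arima <- auto.arima(ts_goals)

# Ratio of each coefficient to its standard error (rough significance guide)
z_ratio <- coef(fit_arima) / sqrt(diag(fit_arima$var.coef))
z_ratio

# Number of estimated AR, MA, seasonal AR and seasonal MA coefficients
n_coef <- sum(fit_arima$arma[1:4])

# Ljung-Box test on the residuals; a large p-value gives no evidence
# of remaining autocorrelation
lb <- Box.test(residuals(fit_arima), lag = 24,
               type = "Ljung-Box", fitdf = n_coef)
lb$p.value
```

The forecast package also provides checkresiduals(fit_arima), which runs a similar test and draws residual plots in one call.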
We will forecast the next 12 months of soccer goals.
future_forecast <- forecast(fit_arima, h = 12)
autoplot(future_forecast) +
  ggtitle("Forecasted Soccer Goals for 2024") +
  xlab("Year") + ylab("Predicted Goals")
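The plot shows the forecasts visually, but the numbers themselves are often needed as well. The sketch below pulls the point forecasts and the 95% prediction interval out of the forecast object into a data frame. It rebuilds the series and model so it runs on its own; `forecast_table` and its column names are illustrative, and the interval levels are set explicitly so that the second column of `lower` and `upper` is the 95% bound.

```r
library(forecast)

# Rebuild the simulated series and refit so this block is self-contained
set.seed(123)
goals <- round(rnorm(72, mean = 45, sd = 10))  # 72 months: Jan 2018 to Dec 2023
ts_goals <- ts(goals, start = c(2018, 1), frequency = 12)
fit_arima <- auto.arima(ts_goals)

# Request 80% and 95% intervals explicitly (column 1 = 80%, column 2 = 95%)
future_forecast <- forecast(fit_arima, h = 12, level = c(80, 95))

# Collect point forecasts and 95% bounds for each month of 2024
forecast_table <- data.frame(
  month = format(seq(as.Date("2024-01-01"), by = "month", length.out = 12), "%Y-%m"),
  point = as.numeric(future_forecast$mean),
  lo95  = as.numeric(future_forecast$lower[, 2]),
  hi95  = as.numeric(future_forecast$upper[, 2])
)
forecast_table
```

The width of the interval relative to the point forecast gives a sense of how much uncertainty the model attaches to each month.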
Split the data into training (2018–2022) and testing (2023) sets to check model accuracy.
train <- window(ts_goals, end = c(2022, 12))
test <- window(ts_goals, start = c(2023, 1))
fit_train <- auto.arima(train)
fc_test <- forecast(fit_train, h = length(test))
accuracy(fc_test, test)
##                       ME      RMSE      MAE        MPE     MAPE      MASE       ACF1 Theil's U
## Training set  0.04300794  8.743761 7.039286  -3.879383 16.34377 0.5835677 -0.1016650        NA
## Test set     -1.31075961 11.231442 9.086641 -10.743487 24.78527 0.7532966  0.1619155  1.084762
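Error measures are easier to judge next to a simple benchmark. The test-set Theil's U above is slightly greater than 1, which suggests the ARIMA forecasts may not improve on a naive forecast for this series. The sketch below compares test-set RMSE for the ARIMA fit against two baselines from the forecast package, the historical mean (meanf) and the seasonal naive method (snaive). The choice of these two baselines is an assumption made for illustration, and `test_rmse` is an illustrative name.

```r
library(forecast)

# Rebuild the simulated series and the train/test split
set.seed(123)
goals <- round(rnorm(72, mean = 45, sd = 10))  # 72 months: Jan 2018 to Dec 2023
ts_goals <- ts(goals, start = c(2018, 1), frequency = 12)
train <- window(ts_goals, end = c(2022, 12))
test  <- window(ts_goals, start = c(2023, 1))
h <- length(test)

# Forecasts from the ARIMA fit and from two simple baselines
fc_arima  <- forecast(auto.arima(train), h = h)
fc_mean   <- meanf(train, h = h)    # every forecast equals the training mean
fc_snaive <- snaive(train, h = h)   # each month repeats the same month of 2022

# Test-set RMSE for each method, side by side
test_rmse <- c(
  arima  = accuracy(fc_arima, test)["Test set", "RMSE"],
  mean   = accuracy(fc_mean, test)["Test set", "RMSE"],
  snaive = accuracy(fc_snaive, test)["Test set", "RMSE"]
)
round(test_rmse, 2)
```

If the ARIMA model does not clearly beat the mean forecast, that is consistent with the series having little predictable structure.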
This tutorial demonstrated fitting an ARIMA model to simulated monthly soccer goal totals, forecasting 12 months ahead, and evaluating the forecasts on a held-out year.
The same workflow can be applied to any time-dependent sports data to produce forecasts that are checked against held-out observations.