Introduction

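In this tutorial, I demonstrate how to build an ARIMA model to forecast the monthly total number of goals scored in soccer matches. ARIMA (AutoRegressive Integrated Moving Average) is a popular time series forecasting technique: the autoregressive terms model dependence on past values, differencing removes trends, and the moving-average terms model dependence on past forecast errors. Its seasonal extension, which auto.arima() also considers, adds the same kinds of terms at the seasonal lag to capture patterns that repeat each year.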
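
To make the three components concrete, the sketch below simulates a non-seasonal ARIMA(1,1,1) series with base R's arima.sim() and refits it with arima(). The coefficients (0.6 and 0.3), the seed, and the series length are arbitrary values chosen for illustration; they are not part of this tutorial's dataset.

```r
# Illustrative only: simulate an ARIMA(1,1,1) process and refit it.
# The coefficients, seed, and length are assumptions chosen for this demo.
set.seed(42)
n <- 200
y <- arima.sim(model = list(order = c(1, 1, 1), ar = 0.6, ma = 0.3), n = n)

# "I" (integrated): one round of differencing removes the stochastic trend,
# leaving a stationary ARMA(1,1) series of length n.
dy <- diff(y)

# "AR" and "MA": refit the same structure and compare the estimates
# with the true values used in the simulation (0.6 and 0.3).
fit_demo <- arima(y, order = c(1, 1, 1))
print(coef(fit_demo))
```

The estimates will not match the true values exactly, since they come from a single finite sample.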


Load Packages

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(forecast)
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
# lubridate is already attached by library(tidyverse) above (see the startup message)

Import and Inspect the Data

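For demonstration, we simulate a dataset of monthly total goals from January 2018 to December 2023 (72 observations).

Each value is drawn independently from a normal distribution with mean 45 and standard deviation 10 and then rounded to a whole number, so the series contains no real trend or seasonality. It serves only to illustrate the workflow.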

set.seed(123)
date <- seq(as.Date("2018-01-01"), as.Date("2023-12-01"), by = "month")
goals <- round(rnorm(length(date), mean = 45, sd = 10))
soccer_data <- tibble(date, goals)

# Inspect the first 6 rows
head(soccer_data)
## # A tibble: 6 × 2
##   date       goals
##   <date>     <dbl>
## 1 2018-01-01    39
## 2 2018-02-01    43
## 3 2018-03-01    61
## 4 2018-04-01    46
## 5 2018-05-01    46
## 6 2018-06-01    62
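
Because the modeling functions used later assume a regularly spaced series, it can help to confirm the shape of the data before going further. The sketch below rebuilds the simulated series with base R only (a plain data frame stands in for the tibble, and the name soccer_check is hypothetical) and checks the row count, missing values, and monthly spacing.

```r
# Rebuild the simulated data with base R (data.frame instead of tibble).
set.seed(123)
date  <- seq(as.Date("2018-01-01"), as.Date("2023-12-01"), by = "month")
goals <- round(rnorm(length(date), mean = 45, sd = 10))
soccer_check <- data.frame(date, goals)  # hypothetical name for this sketch

# Six years of monthly data should give 72 rows.
print(nrow(soccer_check))

# No missing values and no duplicated months.
print(anyNA(soccer_check$goals))
print(anyDuplicated(soccer_check$date))

# Every observation falls on the first day of a month.
print(all(format(soccer_check$date, "%d") == "01"))

# Gaps between consecutive dates should be one calendar month (28 to 31 days).
print(range(as.numeric(diff(soccer_check$date))))
```

With real match data, a failed check here (a missing month, for example) would need to be resolved before calling ts(), which assumes evenly spaced observations.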

Create a Time Series Object

Convert the data into a time series object suitable for ARIMA modeling and visualize it.

ts_goals <- ts(soccer_data$goals, start = c(2018, 1), frequency = 12)

autoplot(ts_goals) +
  ggtitle("Monthly Total Goals (2018–2023)") +
  xlab("Year") + ylab("Goals")
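
Before fitting, it is worth checking whether the series shows any autocorrelation for an ARIMA model to exploit. The sketch below uses base R's acf() and Box.test(); the lag of 12 is an assumption chosen to span one seasonal cycle. Since the values here were drawn independently, little autocorrelation is expected.

```r
# Recreate the series so this block runs on its own.
set.seed(123)
goals <- round(rnorm(72, mean = 45, sd = 10))
ts_goals <- ts(goals, start = c(2018, 1), frequency = 12)

# Confirm the time series attributes: start, end, and observations per year.
print(start(ts_goals))
print(end(ts_goals))
print(frequency(ts_goals))

# Sample autocorrelations up to two years of lags (no plot).
acf_vals <- acf(ts_goals, lag.max = 24, plot = FALSE)
print(acf_vals)

# Ljung-Box test of the null hypothesis that the first 12 autocorrelations
# are jointly zero. A large p-value gives no evidence against white noise.
lb <- Box.test(ts_goals, lag = 12, type = "Ljung-Box")
print(lb)
```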

Fit the ARIMA Model

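Use the auto.arima() function from the forecast package to search over candidate ARIMA models, including seasonal ones, and select the one with the lowest AICc, its default selection criterion.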

fit_arima <- auto.arima(ts_goals)
summary(fit_arima)
## Series: ts_goals 
## ARIMA(0,0,0)(1,0,0)[12] with non-zero mean 
## 
## Coefficients:
##          sar1     mean
##       -0.2055  45.3750
## s.e.   0.1287   0.9253
## 
## sigma^2 = 86.91:  log likelihood = -262.15
## AIC=530.29   AICc=530.64   BIC=537.12
## 
## Training set error measures:
##                     ME     RMSE      MAE       MPE     MAPE      MASE
## Training set 0.0498764 9.192385 7.311318 -4.557838 17.54192 0.6135371
##                     ACF1
## Training set -0.02551311
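
In the output above, the selected model has no non-seasonal terms and a single seasonal AR coefficient (sar1 = -0.2055, s.e. 0.1287), which is only about 1.6 standard errors from zero. A common follow-up is to check that the residuals look like white noise. The sketch below refits the same specification with base R's arima() and applies a Ljung-Box test; this refit is an assumption made to keep the block self-contained, and its estimates may differ slightly from those of auto.arima(). The name fit_check is hypothetical.

```r
# Recreate the series so this block runs on its own.
set.seed(123)
goals <- round(rnorm(72, mean = 45, sd = 10))
ts_goals <- ts(goals, start = c(2018, 1), frequency = 12)

# Refit ARIMA(0,0,0)(1,0,0)[12] with a mean, mirroring the selected model.
fit_check <- arima(ts_goals, order = c(0, 0, 0),
                   seasonal = list(order = c(1, 0, 0), period = 12))

# Ratio of each estimate to its standard error.
z <- coef(fit_check) / sqrt(diag(fit_check$var.coef))
print(z)

# Ljung-Box test on the residuals; fitdf = 1 accounts for the one
# estimated ARMA coefficient (sar1).
res <- residuals(fit_check)
lb_res <- Box.test(res, lag = 12, type = "Ljung-Box", fitdf = 1)
print(lb_res)
```

With the forecast package loaded, checkresiduals(fit_arima) runs a similar test and adds diagnostic plots.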

Forecast Future Goals

We will forecast the next 12 months of soccer goals.

future_forecast <- forecast(fit_arima, h = 12)
autoplot(future_forecast) +
  ggtitle("Forecasted Soccer Goals for 2024") +
  xlab("Year") + ylab("Predicted Goals")
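
The plot is useful for a quick look, but the numbers themselves are often needed for reporting. The sketch below uses base R's predict() on a refit of the selected specification and builds approximate 95% intervals as the point forecast plus or minus 1.96 standard errors, assuming normally distributed errors. The names fit_check and forecast_tbl are hypothetical, and the base R refit may differ slightly from the auto.arima() fit.

```r
# Recreate the series and refit the selected specification with base R.
set.seed(123)
goals <- round(rnorm(72, mean = 45, sd = 10))
ts_goals <- ts(goals, start = c(2018, 1), frequency = 12)
fit_check <- arima(ts_goals, order = c(0, 0, 0),
                   seasonal = list(order = c(1, 0, 0), period = 12))

# Point forecasts and standard errors for the next 12 months.
pred <- predict(fit_check, n.ahead = 12)
z95  <- qnorm(0.975)  # about 1.96

# Collect the forecasts and interval bounds in a table.
forecast_tbl <- data.frame(
  month = format(seq(as.Date("2024-01-01"), by = "month", length.out = 12), "%Y-%m"),
  point = as.numeric(pred$pred),
  lo95  = as.numeric(pred$pred - z95 * pred$se),
  hi95  = as.numeric(pred$pred + z95 * pred$se)
)
print(forecast_tbl, digits = 4)
```

With the forecast package, converting the forecast object with as.data.frame(future_forecast) should give a comparable table that also includes the 80% interval.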

Evaluate Model Accuracy

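Split the data into training (2018–2022) and testing (2023) sets, refit the model on the training set only, and compare its 12-month forecast with the held-out 2023 values to check out-of-sample accuracy.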

train <- window(ts_goals, end = c(2022, 12))
test <- window(ts_goals, start = c(2023, 1))

fit_train <- auto.arima(train)
fc_test <- forecast(fit_train, h = length(test))
accuracy(fc_test, test)
##                       ME      RMSE      MAE        MPE     MAPE      MASE
## Training set  0.04300794  8.743761 7.039286  -3.879383 16.34377 0.5835677
## Test set     -1.31075961 11.231442 9.086641 -10.743487 24.78527 0.7532966
##                    ACF1 Theil's U
## Training set -0.1016650        NA
## Test set      0.1619155  1.084762
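
In the output above, the test-set RMSE (11.23) is higher than the training RMSE (8.74), and Theil's U is slightly above 1, which is usually read as the forecasts doing no better than a naive no-change forecast. One way to put these figures in context is to score simple baselines on the same split. The sketch below does this with base R only; the helper functions rmse() and mae() and the two baselines are illustrative choices, not part of the original workflow.

```r
# Recreate the series and the same train/test split.
set.seed(123)
goals <- round(rnorm(72, mean = 45, sd = 10))
ts_goals <- ts(goals, start = c(2018, 1), frequency = 12)
train <- window(ts_goals, end = c(2022, 12))
test  <- window(ts_goals, start = c(2023, 1))

# Hypothetical helpers for the two error measures.
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
mae  <- function(actual, predicted) mean(abs(actual - predicted))

actual <- as.numeric(test)

# Baseline 1: forecast every month of 2023 with the training mean.
mean_fc <- rep(mean(train), length(actual))

# Baseline 2 (seasonal naive): repeat the 2022 values month by month.
snaive_fc <- as.numeric(window(train, start = c(2022, 1)))

baseline_scores <- data.frame(
  method = c("mean", "seasonal naive"),
  RMSE   = c(rmse(actual, mean_fc), rmse(actual, snaive_fc)),
  MAE    = c(mae(actual, mean_fc),  mae(actual, snaive_fc))
)
print(baseline_scores)
```

If the ARIMA test-set errors are not clearly below these baselines, the extra model complexity is probably not paying off, which would be unsurprising for independently simulated data.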

Conclusion

This tutorial demonstrated how to build an ARIMA model, forecast with it, and evaluate it, using simulated monthly soccer goal data.
The same workflow can be applied to real time-dependent sports data to produce evidence-based predictions.