This document runs through the process of Time Series Forecasting in R. The main aim of time series forecasting is to identify any trends that exist within historical data, as well as to forecast those trends for a certain period into the future. Sports forecasting results are of great interest to many different people, including media, broadcasters, governing bodies, fans and sponsors.
For this example, I have chosen to use data relating to A-League (Australian Soccer) interest / popularity trends throughout Australia.
The following packages are required for Time Series Modelling. If you do not have them installed, please do so using the install.packages function.
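If you are missing any of them, a single call along the following lines will install the lot (a one-off setup step; adjust the vector to the packages you actually need):

install.packages(c("xgboost", "tidymodels", "modeltime",
                   "tidyverse", "lubridate", "timetk", "fpp3"))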
library(xgboost)
library(tidymodels)
library(modeltime)
library(tidyverse)
library(lubridate)
library(timetk)
library(fpp3)
Data for this example was sourced from Google Trends and, as mentioned earlier, relates to A-League popularity trends over the past 5 years in Australia. The data comes as a .csv file, so the read.csv function is used to read it in.
ALeague_interest <- read.csv("A-League Interest Data.csv")
Check the variable types of your data using the str
command. For time series forecasting it is important to change the time
variable into a date data type, as this makes the process more
straightforward later on.
ALeague_interest <- ALeague_interest %>%
mutate(Week = as.Date(Week, format = "%d/%m/%Y"))
str(ALeague_interest)
## 'data.frame': 260 obs. of 2 variables:
## $ Week : Date, format: "2017-10-15" "2017-10-22" ...
## $ ALeague.Interest: int 51 64 59 64 57 40 43 61 52 77 ...
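As an optional sanity check (not part of the original workflow), base R can confirm the date range covered and that there are no missing interest values:

# Date range of the series
range(ALeague_interest$Week)
# Number of missing interest values
sum(is.na(ALeague_interest$ALeague.Interest))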
The first step is to create basic visualisations of the data set using the timetk package. This plot shows the basic shape of the data over the 5-year period, with a smoothed trend line overlaid.
ALeague_interest %>%
timetk::plot_time_series(.date_var = Week,
.value = ALeague.Interest)
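plot_time_series produces an interactive plotly chart by default. If a static ggplot is preferred (for example when knitting to PDF), the .interactive and .title arguments can be used; a quick sketch:

# Static version of the same plot with a custom title
ALeague_interest %>%
  timetk::plot_time_series(.date_var = Week,
                           .value = ALeague.Interest,
                           .interactive = FALSE,
                           .title = "A-League Interest (Google Trends)")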
The following code plots seasonal patterns, which gives us a better idea of how interest in each year (or season) compares with the others.
The first plot displays this by overlaying the seasonal trends on top of each other. For this plot to work, we must first convert the data set to a tibble, and then to a tsibble object.
The second plot shows seasonal diagnostics by week, month, quarter and year using box plots.
# Data as tibble
ALeague_interest <- tibble(ALeague_interest)
# Convert to tsibble
ALeague_interest_ts <- ALeague_interest %>%
as_tsibble(index = Week)
# Seasonal Plot
ALeague_interest_ts %>% gg_season(ALeague.Interest)
# Seasonal Diagnostics
ALeague_interest %>%
timetk::plot_seasonal_diagnostics(.date_var = Week,
.value = ALeague.Interest)
The next step is to identify any anomalies that may exist in the data. At first, the alpha value is left at the default of 0.05, and when this code is run no anomalies are identified. However, it is worth playing around with other alpha levels to see whether any anomalies appear; for this data, an alpha of 0.09 detects one anomaly.
ALeague_interest %>%
timetk::plot_anomaly_diagnostics(.date_var = Week,
.value = ALeague.Interest)
# We can adjust the threshold for anomalies with .alpha
ALeague_interest %>%
timetk::plot_anomaly_diagnostics(.date_var = Week,
.value = ALeague.Interest,
.alpha = 0.09)
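If you would rather have the anomalies as a table than a plot, timetk also provides tk_anomaly_diagnostics, which returns the underlying diagnostic data; a sketch, assuming the same arguments carry over:

# Return only the rows flagged as anomalies at alpha = 0.09
ALeague_interest %>%
  timetk::tk_anomaly_diagnostics(.date_var = Week,
                                 .value = ALeague.Interest,
                                 .alpha = 0.09) %>%
  filter(anomaly == "Yes")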
The ACF (autocorrelation function) gives the correlation between the time series and each of its lagged values, including any indirect relationships carried through the intervening lags. The PACF (partial autocorrelation function) is similar, but removes the effect of the intervening lags, and therefore gives the direct correlation between the time series and each lagged value.
Below, I have provided two different ways of viewing the ACF and PACF plots. The first uses the gridExtra package to arrange the ACF and PACF plots produced by the fpp3 packages, whilst the second uses timetk's lag diagnostics, which displays the ACF and PACF together in a single interactive chart.
# Using gridExtra - bar charts
gridExtra::grid.arrange(
ALeague_interest_ts %>% ACF(ALeague.Interest) %>% autoplot(),
ALeague_interest_ts %>% PACF(ALeague.Interest) %>% autoplot()
)
# Using timetk - scatter plots
ALeague_interest %>%
timetk::plot_acf_diagnostics(.date_var = Week,
.value = ALeague.Interest)
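plot_acf_diagnostics chooses a default number of lags; the .lags argument can extend this, for example to roughly two years of weekly lags (a sketch):

# ACF / PACF out to ~2 years of weekly lags
ALeague_interest %>%
  timetk::plot_acf_diagnostics(.date_var = Week,
                               .value = ALeague.Interest,
                               .lags = 104)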
The decomposed (STL) series gives us a plot of the original data, the overall trend, the seasonal component and the remainder, all in one.
ALeague_interest %>%
timetk::plot_stl_diagnostics(.date_var = Week,
.value = ALeague.Interest,
.facet_scales = "free",
.feature_set = c("observed","trend","season","remainder"))
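Since fpp3 is already loaded, an equivalent STL decomposition can also be produced from the tsibble as a cross-check; a sketch using the default STL settings:

# STL decomposition via fpp3 / feasts
ALeague_interest_ts %>%
  model(STL(ALeague.Interest)) %>%
  components() %>%
  autoplot()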
Next, split the data into training and testing sets. By default, initial_time_split keeps the first 75% of observations for training and holds out the most recent 25% for testing.
splits <- initial_time_split(ALeague_interest)
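If you want a different split, the prop argument controls the proportion kept for training, e.g. an 80/20 split (a sketch, not used in the rest of this walkthrough):

# Keep the first 80% of the series for training
splits_80 <- initial_time_split(ALeague_interest, prop = 0.8)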
Use tk_time_series_cv_plan together with plot_time_series_cv_plan to inspect the split. These functions will show a plot of the original data, with both the training and testing sets labelled.
splits %>%
tk_time_series_cv_plan() %>%
plot_time_series_cv_plan(.date_var = Week,
.value = ALeague.Interest)
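For a more thorough evaluation than a single split, timetk also supports rolling-origin resampling via time_series_cv, which can be inspected with the same plotting functions; a sketch with assumed window sizes:

# Rolling-origin CV: 3-year training windows, 1-year assessment windows
resamples <- timetk::time_series_cv(ALeague_interest,
                                    date_var = Week,
                                    initial = "3 years",
                                    assess = "1 year",
                                    skip = "1 year",
                                    cumulative = FALSE)

resamples %>%
  tk_time_series_cv_plan() %>%
  plot_time_series_cv_plan(.date_var = Week,
                           .value = ALeague.Interest)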
This step simply saves the training and testing splits as named objects, so they can be called up later.
ALeague_train <- training(splits)
ALeague_test <- testing(splits)
Here we create several different time series models. The example below shows some of the model types that can be created; feel free to try other types as well, e.g. a manually specified ARIMA with different orders, and see which produces the best results.
### a. Auto ARIMA
arima_fit <- arima_reg() %>%
set_engine("auto_arima") %>%
fit(ALeague.Interest ~ Week, data = ALeague_train)
### b. Boosted ARIMA
arima_boost_fit <- arima_boost() %>%
set_engine("auto_arima_xgboost") %>%
fit(ALeague.Interest ~ Week, data = ALeague_train)
### c. Exponential Smoothing
ets_fit <- exp_smoothing() %>%
set_engine("ets") %>%
fit(ALeague.Interest ~ Week, data = ALeague_train)
### d. Prophet
prophet_fit <- prophet_reg() %>%
set_engine("prophet") %>%
fit(ALeague.Interest ~ Week, data = ALeague_train)
### e. Linear Regression
lm_fit <- linear_reg() %>%
set_engine("lm") %>%
fit(ALeague.Interest ~ Week, data = ALeague_train)
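As one example of an additional model type (not added to the model table below), modeltime also offers a neural network autoregression via nnetar_reg; a minimal sketch:

# f. Neural Network Autoregression (optional extra)
nnetar_fit <- nnetar_reg() %>%
  set_engine("nnetar") %>%
  fit(ALeague.Interest ~ Week, data = ALeague_train)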
A modeltime table saves all of the fitted models together, so we can calibrate and compare them as one later.
models_tbl <- modeltime_table(
arima_fit,
arima_boost_fit,
ets_fit,
lm_fit,
prophet_fit
)
In this step, we calibrate the models using the testing data.
calibrate_tbl <- models_tbl %>%
modeltime_calibrate(new_data = ALeague_test)
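Before forecasting, it can also be worth inspecting the out-of-sample residuals of each model; modeltime provides modeltime_residuals and plot_modeltime_residuals for this (optional):

# Plot out-of-sample residuals for each calibrated model
calibrate_tbl %>%
  modeltime_residuals() %>%
  plot_modeltime_residuals()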
Forecasting over the testing period shows the model fits against the most recent portion of the data, and how well they capture its peaks, troughs and overall shape.
calibrate_tbl %>%
modeltime_forecast(
actual_data = ALeague_interest,
new_data = ALeague_test
) %>%
plot_modeltime_forecast()
Here we can see that the Prophet and linear regression models were able to follow the data reasonably well, however the Auto and Boosted ARIMAs and the Exponential Smoothing model were all unsuccessful at predicting the data, producing essentially flat lines.
In order to try to improve these results, I will add a manually specified ARIMA(2,1,2) model alongside the Auto ARIMA, and see whether this models the data any better.
### a. ARIMA(2,1,2)
arima_fit_2 <- arima_reg(non_seasonal_ar = 2,
                         non_seasonal_differences = 1,
                         non_seasonal_ma = 2) %>%
set_engine("arima") %>%
fit(ALeague.Interest ~ Week, data = ALeague_train)
# Adding new model to table
models_tbl <- modeltime_table(
arima_fit,
arima_fit_2,
arima_boost_fit,
ets_fit,
lm_fit,
prophet_fit
)
# Calibrating models again
calibrate_tbl_2 <- models_tbl %>%
modeltime_calibrate(new_data = ALeague_test)
# Forecasting models including new model
calibrate_tbl_2 %>%
modeltime_forecast(
actual_data = ALeague_interest,
new_data = ALeague_test
) %>%
plot_modeltime_forecast()
Here we can check the results of our models, using R-squared and the various error metrics (MAE, MAPE, RMSE and so on) as accuracy measures.
calibrate_tbl %>%
modeltime_accuracy()
## # A tibble: 5 x 9
## .model_id .model_desc .type mae mape mase smape rmse rsq
## <int> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 ARIMA(1,0,0) WITH NON-~ Test 22.9 143. 2.28 72.6 25.0 0.0970
## 2 2 ARIMA(1,0,0) WITH NON-~ Test 22.9 143. 2.28 72.6 25.0 0.0970
## 3 3 ETS(A,N,N) Test 22.2 91.5 2.21 72.9 26.8 NA
## 4 4 LM Test 24.1 149. 2.40 75.9 25.9 0.00168
## 5 5 PROPHET Test 16.4 95.8 1.63 53.9 20.4 0.383
We can see that none of these models produced very meaningful results. The Prophet model had the highest R-squared (0.38) and lowest error values (20.43 RMSE), however these figures are still not considered great. The closer the R-squared value is to 1, the better the model.
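If you prefer the accuracy results as a formatted interactive table, modeltime also provides table_modeltime_accuracy, e.g.:

# Accuracy results as an interactive table
calibrate_tbl %>%
  modeltime_accuracy() %>%
  table_modeltime_accuracy()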
Lastly, we will refit the models on the full data set and forecast 3 years into the future.
# Re-fitting
refit_tbl <- calibrate_tbl %>%
modeltime_refit(data = ALeague_interest)
# Forecasting forward
refit_tbl %>%
modeltime_forecast(h = "3 years", actual_data = ALeague_interest) %>%
plot_modeltime_forecast()
refit_tbl %>%
modeltime_accuracy()
## # A tibble: 5 x 9
## .model_id .model_desc .type mae mape mase smape rmse rsq
## <int> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 UPDATE: ARIMA(1,0,1) W~ Test 22.9 143. 2.28 72.6 25.0 0.0970
## 2 2 UPDATE: ARIMA(1,0,1) W~ Test 22.9 143. 2.28 72.6 25.0 0.0970
## 3 3 ETS(A,N,N) Test 22.2 91.5 2.21 72.9 26.8 NA
## 4 4 LM Test 24.1 149. 2.40 75.9 25.9 0.00168
## 5 5 PROPHET Test 16.4 95.8 1.63 53.9 20.4 0.383
We again observe much the same results. The Prophet and linear regression models were able to somewhat predict the data, while all other models produced straight lines, unable to capture the trends of this particular data set.
The results of this time series forecasting example show that models such as the Auto and Boosted ARIMAs and the Exponential Smoothing model have trouble predicting certain data.
This may be a case of the data being too complex, but it could also reflect a limitation of these types of models.
There is plenty of scope to play around with many different parts of this code to see what works best, and I encourage trying out different models to produce the best possible results on your own data.
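As a final sketch (assuming Prophet remains the best performer on your own data), you could keep only that model before refitting and forecasting; the calibration table is an ordinary tibble, so it can be filtered with dplyr:

# Keep only the Prophet model, refit on all data, and forecast 3 years ahead
calibrate_tbl %>%
  filter(.model_desc == "PROPHET") %>%
  modeltime_refit(data = ALeague_interest) %>%
  modeltime_forecast(h = "3 years", actual_data = ALeague_interest) %>%
  plot_modeltime_forecast()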