This document runs through the process of Time Series Forecasting in R. The main aim of time series forecasting is to identify any trends that exist within historical data, as well as to forecast those trends for a certain period into the future. Sports forecasting results are of great interest to many different people, including media, broadcasters, governing bodies, fans and sponsors.
For this example, I have chosen to use data relating to A-League (Australian Soccer) interest / popularity trends throughout Australia.
The following packages are required for Time Series Modelling. If you do not have them installed, please do so using the install.packages function.
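If you are missing any of them, a single call along the following lines will install the lot (a one-off setup step; adjust the vector to the packages you actually need):

install.packages(c("xgboost", "tidymodels", "modeltime",
                   "tidyverse", "lubridate", "timetk", "fpp3"))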
library(xgboost)
library(tidymodels)
library(modeltime)
library(tidyverse)
library(lubridate)
library(timetk)
library(fpp3)
Data for this example was sourced from Google Trends and, as mentioned earlier, relates to A-League popularity trends over the past 5 years in Australia. The data comes as a .csv file, so the read.csv function is used to read it in.
ALeague_interest <- read.csv("A-League Interest Data.csv")
Check the variable types of your data using the str
command. For time series forecasting it is important to change the time
variable into a date data type, as this makes the process more
straightforward later on.
ALeague_interest <- ALeague_interest %>%
mutate(Week = as.Date(Week, format = "%d/%m/%Y"))
str(ALeague_interest)
## 'data.frame': 260 obs. of 2 variables:
## $ Week : Date, format: "2017-10-15" "2017-10-22" ...
## $ ALeague.Interest: int 51 64 59 64 57 40 43 61 52 77 ...
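As an optional sanity check (not part of the original workflow), base R can confirm the date range covered and that there are no missing interest values:

# Date range of the series
range(ALeague_interest$Week)
# Number of missing interest values
sum(is.na(ALeague_interest$ALeague.Interest))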
The first step is to create basic visualisations of the data set using the timetk package. This plot shows the basic shape of the data over the 5-year period, with a smoothed trend line overlaid.
ALeague_interest %>%
timetk::plot_time_series(.date_var = Week,
.value = ALeague.Interest)
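plot_time_series produces an interactive plotly chart by default. If a static ggplot is preferred (for example when knitting to PDF), the .interactive and .title arguments can be used; a quick sketch:

# Static version of the same plot with a custom title
ALeague_interest %>%
  timetk::plot_time_series(.date_var = Week,
                           .value = ALeague.Interest,
                           .interactive = FALSE,
                           .title = "A-League Interest (Google Trends)")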
The following code plots seasonal patterns, which gives us a better idea of how interest in each year (or season) compares with the others.
The first plot displays this by overlaying the seasonal trends on top of each other. For this plot to work, we must first convert the data set to a tibble, and then to a tsibble object.
The second plot shows seasonal diagnostics by week, month, quarter and year using box plots.
# Data as tibble
ALeague_interest <- tibble(ALeague_interest)
# Convert to tsibble
ALeague_interest_ts <- ALeague_interest %>%
as_tsibble(index = Week)
# Seasonal Plot
ALeague_interest_ts %>% gg_season(ALeague.Interest)
# Seasonal Diagnostics
ALeague_interest %>%
timetk::plot_seasonal_diagnostics(.date_var = Week,
.value = ALeague.Interest)
The next step is to identify any anomalies that may exist in the data. At first, the alpha value is left at the default of 0.05, and when this code is run no anomalies are identified. However, it is worth playing around with other alpha levels to see whether any anomalies appear; for this data, an alpha of 0.09 detects one anomaly.
ALeague_interest %>%
timetk::plot_anomaly_diagnostics(.date_var = Week,
.value = ALeague.Interest)
# We can adjust the threshold for anomalies with .alpha
ALeague_interest %>%
timetk::plot_anomaly_diagnostics(.date_var = Week,
.value = ALeague.Interest,
.alpha = 0.09)
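If you would rather have the anomalies as a table than a plot, timetk also provides tk_anomaly_diagnostics, which returns the underlying diagnostic data; a sketch, assuming the same arguments carry over:

# Return only the rows flagged as anomalies at alpha = 0.09
ALeague_interest %>%
  timetk::tk_anomaly_diagnostics(.date_var = Week,
                                 .value = ALeague.Interest,
                                 .alpha = 0.09) %>%
  filter(anomaly == "Yes")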
The ACF (autocorrelation function) gives the correlation between the time series and each of its lagged values, including any indirect relationships carried through the intervening lags. The PACF (partial autocorrelation function) is similar, but removes the effect of the intervening lags, and therefore gives the direct correlation between the time series and each lagged value.
Below, I have provided two different ways of viewing the ACF and PACF plots. The first uses the gridExtra package to arrange the ACF and PACF plots produced by the fpp3 packages, whilst the second uses timetk's lag diagnostics, which displays the ACF and PACF together in a single interactive chart.
# Using gridExtra - bar charts
gridExtra::grid.arrange(
ALeague_interest_ts %>% ACF(ALeague.Interest) %>% autoplot(),
ALeague_interest_ts %>% PACF(ALeague.Interest) %>% autoplot()
)
# Using timetk - scatter plots
ALeague_interest %>%
timetk::plot_acf_diagnostics(.date_var = Week,
.value = ALeague.Interest)
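plot_acf_diagnostics chooses a default number of lags; the .lags argument can extend this, for example to roughly two years of weekly lags (a sketch):

# ACF / PACF out to ~2 years of weekly lags
ALeague_interest %>%
  timetk::plot_acf_diagnostics(.date_var = Week,
                               .value = ALeague.Interest,
                               .lags = 104)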
The decomposed (STL) series gives us a plot of the original data, the overall trend, the seasonal component and the remainder, all in one.
ALeague_interest %>%
timetk::plot_stl_diagnostics(.date_var = Week,
.value = ALeague.Interest,
.facet_scales = "free",
.feature_set = c("observed","trend","season","remainder"))
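Since fpp3 is already loaded, an equivalent STL decomposition can also be produced from the tsibble as a cross-check; a sketch using the default STL settings:

# STL decomposition via fpp3 / feasts
ALeague_interest_ts %>%
  model(STL(ALeague.Interest)) %>%
  components() %>%
  autoplot()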
Next, split the data into training and testing sets. By default, initial_time_split keeps the first 75% of observations for training and holds out the most recent 25% for testing.
splits <- initial_time_split(ALeague_interest)
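If you want a different split, the prop argument controls the proportion kept for training, e.g. an 80/20 split (a sketch, not used in the rest of this walkthrough):

# Keep the first 80% of the series for training
splits_80 <- initial_time_split(ALeague_interest, prop = 0.8)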
Use tk_time_series_cv_plan together with plot_time_series_cv_plan to inspect the split. These functions will show a plot of the original data, with both the training and testing sets labelled.
splits %>%
tk_time_series_cv_plan() %>%
plot_time_series_cv_plan(.date_var = Week,
.value = ALeague.Interest)
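For a more thorough evaluation than a single split, timetk also supports rolling-origin resampling via time_series_cv, which can be inspected with the same plotting functions; a sketch with assumed window sizes:

# Rolling-origin CV: 3-year training windows, 1-year assessment windows
resamples <- timetk::time_series_cv(ALeague_interest,
                                    date_var = Week,
                                    initial = "3 years",
                                    assess = "1 year",
                                    skip = "1 year",
                                    cumulative = FALSE)

resamples %>%
  tk_time_series_cv_plan() %>%
  plot_time_series_cv_plan(.date_var = Week,
                           .value = ALeague.Interest)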
This step simply saves the training and testing splits as named objects, so they can be called up later.
ALeague_train <- training(splits)
ALeague_test <- testing(splits)
Here we create several different time series models. The example below shows some of the model types that can be created; feel free to try other types as well, e.g. a manually specified ARIMA with different orders, and see which produces the best results.
### a. Auto ARIMA
arima_fit <- arima_reg() %>%
set_engine("auto_arima") %>%
fit(ALeague.Interest ~ Week, data = ALeague_train)
### b. Boosted ARIMA
arima_boost_fit <- arima_boost() %>%
set_engine("auto_arima_xgboost") %>%
fit(ALeague.Interest ~ Week, data = ALeague_train)
### c. Exponential Smoothing
ets_fit <- exp_smoothing() %>%
set_engine("ets") %>%
fit(ALeague.Interest ~ Week, data = ALeague_train)
### d. Prophet
prophet_fit <- prophet_reg() %>%
set_engine("prophet") %>%
fit(ALeague.Interest ~ Week, data = ALeague_train)
### e. Linear Regression
lm_fit <- linear_reg() %>%
set_engine("lm") %>%
fit(ALeague.Interest ~ Week, data = ALeague_train)
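As one example of an additional model type (not added to the model table below), modeltime also offers a neural network autoregression via nnetar_reg; a minimal sketch:

# f. Neural Network Autoregression (optional extra)
nnetar_fit <- nnetar_reg() %>%
  set_engine("nnetar") %>%
  fit(ALeague.Interest ~ Week, data = ALeague_train)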
A modeltime table saves all of the fitted models together, so we can calibrate and compare them as one later.
models_tbl <- modeltime_table(
arima_fit,
arima_boost_fit,
ets_fit,
lm_fit,
prophet_fit
)
In this step, we calibrate the models using the testing data.
calibrate_tbl <- models_tbl %>%
modeltime_calibrate(new_data = ALeague_test)
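Before forecasting, it can also be worth inspecting the out-of-sample residuals of each model; modeltime provides modeltime_residuals and plot_modeltime_residuals for this (optional):

# Plot out-of-sample residuals for each calibrated model
calibrate_tbl %>%
  modeltime_residuals() %>%
  plot_modeltime_residuals()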
Forecasting over the testing period shows the model fits against the most recent portion of the data, and how well they capture its peaks, troughs and overall shape.
calibrate_tbl %>%
modeltime_forecast(
actual_data = ALeague_interest,
new_data = ALeague_test
) %>%
plot_modeltime_forecast()
Here we can see that the Prophet and linear regression models were able to follow the data reasonably well, however the Auto and Boosted ARIMAs and the Exponential Smoothing model were all unsuccessful at predicting the data, producing essentially flat lines.
In order to try to improve these results, I will add a manually specified ARIMA(2,1,2) model alongside the Auto ARIMA, and see whether this models the data any better.
### a. ARIMA(2,1,2)
arima_fit_2 <- arima_reg(non_seasonal_ar = 2,
                         non_seasonal_differences = 1,
                         non_seasonal_ma = 2) %>%
set_engine("arima") %>%
fit(ALeague.Interest ~ Week, data = ALeague_train)
# Adding new model to table
models_tbl <- modeltime_table(
arima_fit,
arima_fit_2,
arima_boost_fit,
ets_fit,
lm_fit,
prophet_fit
)
# Calibrating models again
calibrate_tbl_2 <- models_tbl %>%
modeltime_calibrate(new_data = ALeague_test)
# Forecasting models including new model
calibrate_tbl_2 %>%
modeltime_forecast(
actual_data = ALeague_interest,
new_data = ALeague_test
) %>%
plot_modeltime_forecast()
Here we can check the results of our models, using R-squared and the various error metrics (MAE, MAPE, RMSE and so on) as accuracy measures.
calibrate_tbl %>%
modeltime_accuracy()
## # A tibble: 5 x 9
## .model_id .model_desc .type mae mape mase smape rmse rsq
## <int> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 ARIMA(1,0,0) WITH NON-~ Test 22.9 143. 2.28 72.6 25.0 0.0970
## 2 2 ARIMA(1,0,0) WITH NON-~ Test 22.9 143. 2.28 72.6 25.0 0.0970
## 3 3 ETS(A,N,N) Test 22.2 91.5 2.21 72.9 26.8 NA
## 4 4 LM Test 24.1 149. 2.40 75.9 25.9 0.00168
## 5 5 PROPHET Test 16.4 95.8 1.63 53.9 20.4 0.383
We can see that none of these models produced very meaningful results. The Prophet model had the highest R-squared (0.38) and lowest error values (20.43 RMSE), however these figures are still not considered great. The closer the R-squared value is to 1, the better the model.
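If you prefer the accuracy results as a formatted interactive table, modeltime also provides table_modeltime_accuracy, e.g.:

# Accuracy results as an interactive table
calibrate_tbl %>%
  modeltime_accuracy() %>%
  table_modeltime_accuracy()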
Lastly, we will refit the models on the full data set and forecast 3 years into the future.
# Re-fitting
refit_tbl <- calibrate_tbl %>%
modeltime_refit(data = ALeague_interest)
# Forecasting forward
refit_tbl %>%
modeltime_forecast(h = "3 years", actual_data = ALeague_interest) %>%
plot_modeltime_forecast()
refit_tbl %>%
modeltime_accuracy()
## # A tibble: 5 x 9
## .model_id .model_desc .type mae mape mase smape rmse rsq
## <int> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 UPDATE: ARIMA(1,0,1) W~ Test 22.9 143. 2.28 72.6 25.0 0.0970
## 2 2 UPDATE: ARIMA(1,0,1) W~ Test 22.9 143. 2.28 72.6 25.0 0.0970
## 3 3 ETS(A,N,N) Test 22.2 91.5 2.21 72.9 26.8 NA
## 4 4 LM Test 24.1 149. 2.40 75.9 25.9 0.00168
## 5 5 PROPHET Test 16.4 95.8 1.63 53.9 20.4 0.383
We again observe much the same results. The Prophet and linear regression models were able to somewhat predict the data, while all other models produced straight lines, unable to capture the trends of this particular data set.
The results of this time series forecasting example show that models such as the Auto and Boosted ARIMAs and the Exponential Smoothing model have trouble predicting certain data.
This may be a case of the data being too complex, but it could also reflect a limitation of these types of models.
There is plenty of scope to play around with many different parts of this code to see what works best, and I encourage trying out different models to produce the best possible results on your own data.
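As a final sketch (assuming Prophet remains the best performer on your own data), you could keep only that model before refitting and forecasting; the calibration table is an ordinary tibble, so it can be filtered with dplyr:

# Keep only the Prophet model, refit on all data, and forecast 3 years ahead
calibrate_tbl %>%
  filter(.model_desc == "PROPHET") %>%
  modeltime_refit(data = ALeague_interest) %>%
  modeltime_forecast(h = "3 years", actual_data = ALeague_interest) %>%
  plot_modeltime_forecast()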