Time series forecasting is a prediction technique that analyses past observations to find trends and patterns that may repeat in the future. In this analysis, we will look at the popularity of the Essendon Football Club over time, as measured by Google Trends search interest, and forecast the club’s popularity over the next 3 years.
The data used for this analysis can be found at https://trends.google.com.au/trends/explore?date=all&q=%2Fm%2F02sc5&hl=en-AU and covers the period from 2004 to the present. Download the data and save it as essendon.csv.
Within the CSV file, remove the top two rows of the data. Change the names of the two columns to something more appropriate; in this case I have named column 1 Year.Month and column 2 Essendon.Football.Club. Add the CSV to your R project and read in the data.
essendon <- read.csv('essendon.csv')
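As an alternative to editing the CSV by hand, the two leading rows can be skipped and the columns renamed while reading the file. The sketch below is a minimal illustration that assumes the export has the layout described above (two rows before the column header). The sample values written to the temporary file are made up purely so that the example runs on its own.

```r
# Hypothetical raw export that mimics the layout described above:
# two leading rows, then the header row, then the monthly values.
raw_lines <- c(
  "Category: All categories",
  "",
  "Month,Essendon Football Club: (Worldwide)",
  "2004-01,12",
  "2004-02,15",
  "2004-03,30"
)
tmp <- tempfile(fileext = ".csv")
writeLines(raw_lines, tmp)

# skip = 2 drops the two leading rows; col.names replaces the original header
essendon_raw <- read.csv(
  tmp,
  skip = 2,
  header = TRUE,
  col.names = c("Year.Month", "Essendon.Football.Club")
)

print(essendon_raw)
```

With the real file, replacing tmp with 'essendon.csv' should give the same two-column data frame, provided the export matches this assumed layout.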
Load the libraries used in this analysis. If any of them are not installed, install them first.
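One possible way to install only what is missing is sketched below. The helper find_missing() is a hypothetical name introduced for this illustration, and the package list is simply the set of libraries loaded in this post.

```r
# Packages loaded in this analysis (taken from the library() calls in this post)
required_pkgs <- c("tidyverse", "xgboost", "tidymodels", "modeltime",
                   "lubridate", "timetk", "fpp3")

# Hypothetical helper: returns the packages that are not installed locally
find_missing <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

to_install <- find_missing(required_pkgs)

# Only install when something is missing, and only in an interactive
# session, so that a non-interactive script never installs packages silently
if (length(to_install) > 0 && interactive()) {
  install.packages(to_install)
}
```

Once the packages are available, they can be loaded as follows.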
library(tidyverse)
library(xgboost)
library(tidymodels)
library(modeltime)
library(lubridate)
library(timetk)
library(fpp3)
First, we will look at the data on a time series plot. Before plotting, the Year.Month variable needs to be converted to a date type; lubridate's ym() parses year-month strings and sets each date to the first day of the month.
# convert to date
essendon$Year.Month <- ym(essendon$Year.Month)
# time series plot
essendon %>%
  timetk::plot_time_series(.date_var = Year.Month,
                           .value = Essendon.Football.Club)
Now we will investigate this plot further by checking it for anomalies. A quick Google search for the Essendon Football Club together with each flagged time period will help to give context as to why these anomalies occurred. The alpha value defaults to 0.05; lowering it makes the detection more conservative, so fewer points are flagged.
# We can adjust the threshold for anomalies with .alpha
essendon %>%
  timetk::plot_anomaly_diagnostics(.date_var = Year.Month,
                                   .value = Essendon.Football.Club,
                                   .alpha = 0.05)
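To make the role of .alpha more concrete, the following is a simplified sketch of an IQR-based outlier rule of the kind described in the timetk documentation, where the limits are widened by a factor of 0.15 / alpha (3 times the IQR at alpha = 0.05). It is not timetk's implementation: the real function first removes trend and seasonality and applies the rule to the remainder. The helper iqr_limits() and the remainder values are hypothetical.

```r
# Hypothetical helper: IQR-based limits, widened by 0.15 / alpha
iqr_limits <- function(x, alpha = 0.05) {
  q <- stats::quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr_factor <- 0.15 / alpha          # 3 when alpha = 0.05
  spread <- q[2] - q[1]               # interquartile range
  c(lower = q[1] - iqr_factor * spread,
    upper = q[2] + iqr_factor * spread)
}

# Made-up remainder values with one obvious spike at the end
remainder <- c(-2, -1, 0, 1, 2, 1, -1, 0, 40)

limits <- iqr_limits(remainder, alpha = 0.05)
is_anomaly <- remainder < limits[["lower"]] | remainder > limits[["upper"]]

print(limits)
print(remainder[is_anomaly])
```

A smaller alpha gives a larger IQR factor and therefore wider limits, which is why fewer points are flagged as anomalies.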
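Next we will split the data into training and testing sets. This will allow us to train our models on the earlier part of the series and see how they perform on later data they have not seen. By default, initial_time_split() assigns the first 75% of the rows to training and the remainder to testing. The splits can be inspected below.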
## Split the data
splits <- initial_time_split(essendon)
## Create training and testing sets
train <- training(splits)
test <- testing(splits)
# Use tk_time_series_cv_plan to inspect the splits
splits %>%
  tk_time_series_cv_plan() %>%
  plot_time_series_cv_plan(.date_var = Year.Month,
                           .value = Essendon.Football.Club)
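The idea behind a time-ordered split can be sketched in base R as follows. This is only an illustration of the idea using the default proportion of 0.75, not the rsample implementation, and the toy data frame and its values are made up.

```r
# Made-up monthly series standing in for the essendon data frame
toy <- data.frame(
  Year.Month = seq(as.Date("2004-01-01"), by = "month", length.out = 8),
  value = c(10, 12, 9, 14, 20, 18, 11, 13)
)

prop <- 0.75                              # share of rows used for training
n_train <- floor(nrow(toy) * prop)

# Rows are kept in time order: the earliest go to training, the latest to testing
toy_train <- toy[seq_len(n_train), ]
toy_test  <- toy[(n_train + 1):nrow(toy), ]

print(range(toy_train$Year.Month))
print(range(toy_test$Year.Month))
```

Unlike a random split, no observation in the test set is earlier than any observation in the training set, which mirrors how a forecast is used in practice.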
Now we will create our forecast models. For the sake of this paper we will use a linear regression model, an auto-ARIMA model and a Prophet model, although there are plenty more that could be used.
# create the models
# Auto ARIMA
arima_fit <- arima_reg() %>%
  set_engine("auto_arima") %>%
  fit(Essendon.Football.Club ~ Year.Month, data = train)
# Prophet
prophet_fit <- prophet_reg() %>%
  set_engine("prophet") %>%
  fit(Essendon.Football.Club ~ Year.Month, data = train)
# Linear regression
lm_fit <- linear_reg() %>%
  set_engine("lm") %>%
  fit(Essendon.Football.Club ~ Year.Month, data = train)
Next we will add our models to a table and calibrate them to the test data set. The model predictions will then be plotted against the actual data.
# put models into table
models_tbl <- modeltime_table(
  arima_fit,
  prophet_fit,
  lm_fit
)
# calibrate models
calibrate_tbl <- models_tbl %>%
  modeltime_calibrate(new_data = test)
# plot the forecasts
calibrate_tbl %>%
  modeltime_forecast(
    actual_data = essendon,
    new_data = test
  ) %>%
  plot_modeltime_forecast()
We will now check the accuracy of these models to assess which performed the best.
# Check results
calibrate_tbl %>%
  modeltime_accuracy()
## # A tibble: 3 × 9
##   .model_id .model_desc             .type   mae  mape  mase smape  rmse    rsq
##       <int> <chr>                   <chr> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>
## 1         1 ARIMA(1,0,1)(2,1,0)[12] Test   3.28  29.1 0.652  21.3  5.01 0.724
## 2         2 PROPHET                 Test   6.60  48.7 1.31   60.1  7.93 0.566
## 3         3 LM                      Test   9.13 110.  1.81   55.4 11.0  0.0229
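For readers unfamiliar with these metrics, the sketch below shows how a few of them can be computed by hand in base R. The actual and predicted values are made up for illustration and are not taken from the models above. The R-squared line assumes the squared-correlation definition.

```r
# Made-up actual and predicted values, for illustration only
actual    <- c(10, 20, 30, 40)
predicted <- c(12, 18, 33, 36)

errors <- actual - predicted

mae  <- mean(abs(errors))                 # mean absolute error
rmse <- sqrt(mean(errors^2))              # root mean squared error
mape <- mean(abs(errors / actual)) * 100  # mean absolute percentage error
rsq  <- cor(actual, predicted)^2          # assumed: squared correlation

print(c(mae = mae, rmse = rmse, mape = mape, rsq = rsq))
```

Lower values are better for MAE, RMSE and MAPE, while a higher R-squared indicates that the predictions track the actual values more closely.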
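On the test set, the auto-ARIMA model performed best on every metric reported, with the lowest MAE (3.28) and RMSE (5.01) and the highest R-squared (0.724), while the linear regression had the highest MAE and RMSE. Finally, we will refit the models to the full data set and forecast the interest in the Essendon Football Club over the next 3 years.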
## Refit and forecast forward
refit_tbl <- calibrate_tbl %>%
  modeltime_refit(data = essendon)
refit_tbl %>%
  modeltime_forecast(h = "3 years", actual_data = essendon) %>%
  plot_modeltime_forecast()
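With monthly data, a horizon of 3 years corresponds to 36 future time steps after the last observation. A rough base R illustration of what such a future date index looks like is given below. The last observed month used here is hypothetical, and this is not how modeltime builds its forecast dates internally.

```r
# Hypothetical last month in the data (first-of-month dates, as produced by ym())
last_obs <- as.Date("2023-06-01")
horizon_months <- 36                      # 3 years of monthly steps

# Generate one extra date and drop the first so the index starts after last_obs
future_dates <- seq(last_obs, by = "month",
                    length.out = horizon_months + 1)[-1]

print(head(future_dates, 3))
print(tail(future_dates, 1))
```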