This paper will involve a comprehensive analysis of retail turnover data with the aim of preparing it for forecasting future trends.The primary objective of this analysis is to gain insights into the underlying patterns and structure of the data, identify any seasonal or trend components, and to apply appropriate pre-processing techniques to improve the accuracy of future forecasts.

All code snippets will be done in R.

The ‘Turnover’ variable in our data set refers to the total sales revenue generated by retail businesses over a one month period. It represents the aggregate value of goods and services sold by retailers to consumers within a given month. The turnover values are in $Million AUD.

Lets first load the required packages:

library(fpp3)
## ── Attaching packages ────────────────────────────────────────────── fpp3 0.5 ──
## ✔ tibble      3.2.1          ✔ tsibble     1.1.4     
## ✔ dplyr       1.1.4          ✔ tsibbledata 0.4.1     
## ✔ tidyr       1.3.1          ✔ feasts      0.3.2.9000
## ✔ lubridate   1.9.3          ✔ fable       0.3.3     
## ✔ ggplot2     3.5.0          ✔ fabletools  0.4.0
## ── Conflicts ───────────────────────────────────────────────── fpp3_conflicts ──
## ✖ lubridate::date()    masks base::date()
## ✖ dplyr::filter()      masks stats::filter()
## ✖ tsibble::intersect() masks base::intersect()
## ✖ tsibble::interval()  masks lubridate::interval()
## ✖ dplyr::lag()         masks stats::lag()
## ✖ tsibble::setdiff()   masks base::setdiff()
## ✖ tsibble::union()     masks base::union()

And load/format the required data set:

get_my_data <- function(student_id) {
  set.seed(student_id)
  all_data <- fill_gaps(readr::read_rds("https://bit.ly/monashretaildata"))
  while(TRUE) {
    retail <- filter(all_data, `Series ID` == sample(`Series ID`, 1)) 
    if(!any(is.na(retail$Turnover))) return(retail)
  }
}

retail <- get_my_data(33664765)

First, a plot of the time series data using the autoplot() function. This function will be applied to the ‘Turnover’ variable from the ‘retail’ dataset. The function will automatically chose a suitable plot based on the structure of the data. As we are working with time series data, a plot showing the trend is suitable.

retail |>
  autoplot(Turnover) + 
  labs(title = "Turnover Trend",
       x = "Month")

We can see from this data that there is an increasing trend over time, with a sharp drop in 2020 due to the COVID-19 Pandemic.

Next, we will plot the time series data using the gg_season() function. This function is designed to visualise the seasonal patterns in the time series data to help identify recurring patterns at specific times of the year.

retail |>
  gg_season(Turnover, labels = "both") +
  labs(y = "Turnover",
       title = "Seasonal Plot: Retail turnover in $Million AUD")

The plot reveals a consistent pattern of a decline in retail sales occurring in February, followed by a peak in December, with the years showing a generally upward trend, indicating a gradual increase in retail sales over time. Additionally, the plot exhibits a sharp decrease in retail sales starting from the beginning of 2020, persisting until April, after which there is a rebound in sales. This sharp decline is attributed to the impact of the COVID-19 pandemic on retail activity.

Next, a plot of the time series data using the gg_subseries() function. This function will display a seperate time series plot for each time period, in this case for each month, helpful for identifying changes or irregularities in seasonal patterns through time.

retail |>
  gg_subseries(Turnover) +
  labs(y = "Turnover",
       title = "Subseries Plot: Retail turnover in $Million AUD")

This plot allows us to further see the decrease in retail sales in the month of February, as we can see it has the smallest average turnover of all the months. We can also see that December has the highest average retail sales of any month, as we saw in the previous plots. It also shows how retail turnover increases year on year from the first year in the data set to the final year with an increasing trend line.

The next goal is to find an appropriate Box-Cox transformation to stablise the variance of the data set. The aim of a Box-Cox transformation is to stabilise the variance of the data set through time. We want to do this as statistical methods such as time-series analysis and regression have an important assumption that the variance of the data remains constant through time: \(E(u|x) = 0\). Stable variances tend to have better predictive power, and violating this assumption can lead to unreliable inference.

λ will describe how strong the transformation is:

\[ W_t = \begin{cases} \log(Y_t) & \text{if } \lambda = 0 \\ \frac{{\text{sign}(Y_t) |Y_t|^\lambda - 1}}{{\lambda}} & \text{if } \lambda \neq 0 \end{cases} \]

Looking at the original time series plot, we can observe a small variance in earlier years and as time moves forward the variance gets larger and larger. The box_cox transformation is a function to used to stablise the variance. We will need to find the right value of λ for the transformation to be effective.

retail |>
  autoplot(box_cox(Turnover, 0.3)) + 
  labs(y = "Box-Cox transformed turnover",
       x = "Month")

λ = 0.48 was chosen as it’s the value which achieved a transformation to make the variance more constant through time.

Next is an STL decomposition of the transformed data which will provide a visual representation which decomposes the transformed time series into its 3 main components: trend, seasonal and residual. This will allow us to analyse each component seperately to understand the underlying patterns and structure in the data.

The mathematical expression \(y_t = S_t + T_t + R_t\) represents the decomposition of a time series \(y_t\) into its components: the seasonal component \(S_t\), the trend component \(T_t\), and the residual component \(R_t\).

The trend component represents the long term changes in the data through time, capturing the overall direction of the data and ignoring short term fluctuations.

The seasonality component captures the regular, repeating patterns in the data which occur at constant intervals, monthly in this case. Seasonality reflects systematic variations in the data associated with recurring events.

The remainder component, also known as the residual or error component, represents the unexplained variation in the data after accounting for the trend and seasonality components. It captures random fluctuations, noise, or irregularities that are not captured by the trend or seasonality.

retail |>
  model(STL(Turnover)) |>
  components() |>
  autoplot(Turnover) + labs(title = "STL decomposition")

Observing the trend line we can see an increasing smooth curve, as well as a small grey bar meaning the trend is the largest contributor to the original series.

Looking at the seasonal component we can see a consistent pattern over time which indicates the presence of seasonality in the data. The increasing magnitude of the peaks and troughs suggest that seasonal fluctuations are becoming more pronounced which could be due to various factors such as changing consumer behavior, shifts in market dynamics, or external factors influencing the retail industry and overall economy. The large grey bar suggests that the seasonal component is the smallest contributor to the original series.

The remainder component is well behaved up until the start of 2020, fluctuating around 0, suggesting the model adequately captures the underlying structure of the data. Having a mean close to 0 indicates there may not be any bias or trend remaining in the data after accounting for trend and seasonality.

A moving average is a commonly used method in time series analysis to smooth out fluctuations in data and identify underlying trends or patterns. It involves calculating the average of a subset of data points over a specified window or period, which “moves” along the time series.

The expression \(\hat{T}_t = \frac{1}{m} \sum_{j=-k}^{k} Y_{t+j}\) represents the calculation of the centered moving average of the time series data \(Y_t\), where \(m\) is the size of the moving average window and \(k\) represents the number of periods before and after the current time point \(t\) included in the average.

retail_ma <- retail |>
  mutate(
    `12-MA` = slider::slide_dbl(Turnover, mean,
                .before = 5, .after = 6, .complete = TRUE),
    `2x12-MA` = slider::slide_dbl(`12-MA`, mean,
                .before = 1, .after = 0, .complete = TRUE)
  )
retail_ma <- retail_ma %>%
  na.omit() 

retail_ma |>
  autoplot(Turnover, colour = "gray") +
  geom_line(aes(y = `2x12-MA`), colour = "#D55E00") +
  labs(y = "Retail turnover in $Million AUD", 
       title = "Total Retail Turnover")

By visually comparing the original data with the smoothed line, you can assess the overall trend and identify any patterns or changes over time more easily.

Our data is now ready for forecasting techniques to be introduced. While not the goal of this paper, i’ll give a brief example of how the ARIMA statistical method

ARIMA stands for AutoRegressive Integrated Moving Average and is a popular statistical method used for time series forecasting. It consists of three components: AutoRegressive (AR), Integrated (I), and Moving Average (MA).

The (AR) component models the relationship between an observation and a lagged set of observations: \(y_t = c + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \ldots + \phi_p y_{t-p} + \epsilon_t\)

The (I) Component represents differencing, which is used to make the time series stationary: \(I(d) : y_t - y_{t-d}\)

The (MA) Component models the relationship between an observation and a residual error from a moving average model applied to lagged observations: \(MA(q) : \epsilon_t = \theta_1 \epsilon_{t-1} + \theta_2 \epsilon_{t-2} + ... + \theta_q \epsilon_{t-q}\)

Here’s the mathematical representation of the ARIMA(p, d, q) model:

\(y_t = c + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \ldots + \phi_p y_{t-p} + \theta_1 \epsilon_{t-1} + \theta_2 \epsilon_{t-2} + \ldots + \theta_q \epsilon_{t-q} + \epsilon_t\)

Where: \(y_t\) = the observed series at time t, c = the constant term, \(\phi_1\), \(\phi_2\), …, \(\phi_p\) = the autoregressive parameters, \(\theta_1\), \(\theta_2\), …, \(\theta_q\) = the moving average parameters, \(\epsilon_t\) = the error term at time t.

# Fit a model to the data
fit <- retail %>%
  model(
    arima = ARIMA(Turnover)
  )

# Make forecasts
fc <- fit %>%
  forecast(h = "1 year")

# Plot the forecasts
fc %>%
  autoplot(retail) +
  labs(title = "Forecasting Retail Turnover",
       y = "Turnover",
       x = "Month")

autoplot(fc) +
  labs(title = "Retail Turnover Forecasts",
       y = "Turnover",
       x = "Month")

This forecast shows the 80% and 95% prediction intervals. The 80% prediction interval indicates a range within which we expect approximately 80% of the future observations to fall. This is the same as saying there’s 80% probability that the actual future observations will lie within this interval. The 95% prediction interval is wider and provides a higher level of confidence. It indicates a range within which we expect approximately 95% of the future observations to fall. There is a higher degree of certainty associated with this interval compared to the 80% prediction interval.