Time Series Forecasting

Join me as I perform a time series analysis on the aus_retail dataset from the tsibbledata package in RStudio.

Here is a brief outline of what we’ll do in this session:

  1. Wrangling data
  2. Exploratory data analysis
  3. Building models
  4. Evaluating accuracy
  5. Forecasting
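Before we start, here are the packages I'll assume are loaded for the code below. This is a sketch of my setup, inferred from the functions used later, so adjust it to your own environment.

# packages assumed throughout this session
library(tidyverse)    # data wrangling and the %>% pipe
library(tsibble)      # tsibble objects and difference()
library(tsibbledata)  # the aus_retail dataset
library(feasts)       # gg_season(), gg_subseries(), ACF(), STL()
library(timetk)       # tk_time_series_cv_plan(), plot_time_series_cv_plan()
library(tidymodels)   # initial_time_split(), linear_reg() and the fitting workflow
library(modeltime)    # arima_reg(), arima_boost(), exp_smoothing(), prophet_reg()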

Wrangling Data

For this session I am only interested in liquor retailing, and I want to constrain the data to the state of Victoria. Let's inspect the data to see what variables we have, their classes, and whether any missing values exist for our dependent variable.

# filter and save as new data frame
vic_liquor <- aus_retail %>%
  filter(Industry == 'Liquor retailing') %>%
  filter(State == 'Victoria')

# check the data
str(vic_liquor)
## tbl_ts [441 Ă— 5] (S3: tbl_ts/tbl_df/tbl/data.frame)
##  $ State    : chr [1:441] "Victoria" "Victoria" "Victoria" "Victoria" ...
##  $ Industry : chr [1:441] "Liquor retailing" "Liquor retailing" "Liquor retailing" "Liquor retailing" ...
##  $ Series ID: chr [1:441] "A3349414R" "A3349414R" "A3349414R" "A3349414R" ...
##  $ Month    : mth [1:441] 1982 Apr, 1982 May, 1982 Jun, 1982 Jul, 1982 Aug, 1982 Sep...
##  $ Turnover : num [1:441] 17.3 18.1 18.1 18.9 19 18.4 20.9 22.4 29.7 22.9 ...
##  - attr(*, "key")= tibble [1 Ă— 3] (S3: tbl_df/tbl/data.frame)
##   ..$ State   : chr "Victoria"
##   ..$ Industry: chr "Liquor retailing"
##   ..$ .rows   : list<int> [1:1] 
##   .. ..$ : int [1:441] 1 2 3 4 5 6 7 8 9 10 ...
##   .. ..@ ptype: int(0) 
##   ..- attr(*, ".drop")= logi TRUE
##  - attr(*, "index")= chr "Month"
##   ..- attr(*, "ordered")= logi TRUE
##  - attr(*, "index2")= chr "Month"
##  - attr(*, "interval")= interval [1:1] 1M
##   ..@ .regular: logi TRUE
# any missing values?
summary(vic_liquor$Turnover)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   17.30   27.10   62.80   84.39  135.00  342.90
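Note that summary() only adds an NA count when missing values are actually present, so as an explicit check we can count them directly; consistent with the summary above, this returns zero.

# count missing values in the dependent variable directly
sum(is.na(vic_liquor$Turnover))
## [1] 0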

Exploratory Data Analysis

Let’s visualise our data in different ways.

autoplot(vic_liquor)
## Plot variable not specified, automatically selected `.vars = Turnover`

Here we can see a clear seasonal pattern with an increasing trend.

gg_season(vic_liquor)
## Plot variable not specified, automatically selected `y = Turnover`

Here each year is drawn as its own line across the months, which makes it easy to see the months in which liquor retailing experiences higher sales, or Turnover, as it's termed in this dataset.

gg_subseries(vic_liquor)
## Plot variable not specified, automatically selected `y = Turnover`

Looking at the mean (the blue horizontal line), the seasonality is clear: the year starts strong in January, slows in February, has a minor peak in March, then declines steadily until June, after which turnover climbs month by month to a large peak in December.
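If you would rather put a number on how strong this trend and seasonality are than judge by eye, feasts can compute STL-based strength features; this is an optional aside and the output is not shown here.

# STL-based features: trend_strength and seasonal_strength_year close to 1
# indicate a strong trend and strong yearly seasonality
vic_liquor %>%
  features(Turnover, feat_stl)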

vic_liquor %>%
  ACF(Turnover) %>%
  autoplot()

This autocorrelation plot, with large positive correlations that decay only slowly across the lags, tells us the series is non-stationary and needs to be differenced.

vic_liquor %>%
  mutate(diff_turnover = difference(Turnover)) %>%
  ACF(diff_turnover) %>%
  autoplot()

Now we can see that, within a 12-month period, only lags 4, 5, 7 and 8 are not significant.
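As a cross-check on reading the ACF by eye, feasts also provides unit-root features that suggest how many ordinary and seasonal differences are required; again this is an optional aside and the output is not shown here.

# suggested number of first differences and seasonal differences
vic_liquor %>%
  features(Turnover, list(unitroot_ndiffs, unitroot_nsdiffs))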

dcmp <- vic_liquor %>%
  model(stl = STL(Turnover))

components(dcmp) %>%
  autoplot()

We can now visualise the seasonality in its entirety, with the trend clearly increasing.
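By default STL() picks its own trend and seasonal windows. If you want more or less smoothing they can be set explicitly, for example with a fixed seasonal pattern; this sketch is optional and not required for anything that follows.

# STL with a fixed (periodic) seasonal component and a wider trend window
vic_liquor %>%
  model(stl = STL(Turnover ~ trend(window = 21) + season(window = "periodic"))) %>%
  components() %>%
  autoplot()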

Building Models

We will build the following models:

an automatic ARIMA, a boosted ARIMA, exponential smoothing (ETS), Prophet, and a linear regression.

# for the modeltime/timetk workflow we need a data frame, not a tsibble object
vic_liquor_df <- as.data.frame(vic_liquor)

# we also need a Date variable rather than a yearmonth variable;
# as.Date() on a yearmonth returns the first day of each month
vic_liquor_df$Month <- as.Date(vic_liquor_df$Month)

# inspect our data again
str(vic_liquor_df)
## 'data.frame':    441 obs. of  5 variables:
##  $ State    : chr  "Victoria" "Victoria" "Victoria" "Victoria" ...
##  $ Industry : chr  "Liquor retailing" "Liquor retailing" "Liquor retailing" "Liquor retailing" ...
##  $ Series ID: chr  "A3349414R" "A3349414R" "A3349414R" "A3349414R" ...
##  $ Month    : Date, format: "1982-04-01" "1982-05-01" ...
##  $ Turnover : num  17.3 18.1 18.1 18.9 19 18.4 20.9 22.4 29.7 22.9 ...
# create splits 
splits <- initial_time_split(vic_liquor_df)

# visualise splits
splits %>%
  tk_time_series_cv_plan() %>%
  plot_time_series_cv_plan(
    .date_var = Month,
    .value = Turnover
  )
# create training and testing sets
train <- training(splits)
test <- testing(splits)
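By default initial_time_split() keeps the first 75% of rows, in order, for training. A quick sanity check that the split really is chronological costs little before fitting anything; output omitted here.

# sanity check: sizes and date ranges of the two sets
nrow(train)
nrow(test)
range(train$Month)
range(test$Month)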
# a. Auto ARIMA
arima_fit <- arima_reg(seasonal_period = 12) %>%
  set_engine("auto_arima") %>%
  fit(Turnover ~ Month, data = train)

# b. Boosted arima
arima_boost_fit <- arima_boost(seasonal_period = 12) %>%
  set_engine("auto_arima_xgboost") %>%
  fit(Turnover ~ Month, data = train)

# c. Exponential smoothing
ets_fit <- exp_smoothing() %>%
  set_engine("ets") %>%
  fit(Turnover ~ Month, data = train)
## frequency = 12 observations per 1 year
# d. Prophet
prophet_fit <- prophet_reg(seasonality_yearly = TRUE) %>%
  set_engine("prophet") %>%
  fit(Turnover ~ Month, data = train)
## Disabling weekly seasonality. Run prophet with weekly.seasonality=TRUE to override this.
## Disabling daily seasonality. Run prophet with daily.seasonality=TRUE to override this.
# e. Linear reg
lm_fit <- linear_reg() %>%
  set_engine("lm") %>%
  fit(Turnover ~ Month, data = train)

# Add models to table
models_tbl <- modeltime_table(
  arima_fit,
  arima_boost_fit,
  ets_fit,
  prophet_fit,
  lm_fit
)

# Calibrate
calibrate_tbl <- models_tbl %>%
  modeltime_calibrate(new_data = test)

# Make forecasts
calibrate_tbl %>%
  modeltime_forecast(
    actual_data = vic_liquor_df,
    new_data = test
  ) %>%
  plot_modeltime_forecast(
    .conf_interval_show = F
  )
# create a table for accuracy
acc_tab <- calibrate_tbl %>%
  modeltime_accuracy()

# refit models
refit_tbl <- calibrate_tbl %>%
  modeltime_refit(data = vic_liquor_df)
## frequency = 12 observations per 1 year
## Disabling weekly seasonality. Run prophet with weekly.seasonality=TRUE to override this.
## Disabling daily seasonality. Run prophet with daily.seasonality=TRUE to override this.
# forecast refit models
refit_tbl %>%
  modeltime_forecast(h = 12, actual_data = vic_liquor_df) %>%
  plot_modeltime_forecast(
    .conf_interval_show = F
  )

Evaluating Accuracy

acc_tab
## # A tibble: 5 Ă— 9
##   .model_id .model_desc             .type   mae  mape  mase smape  rmse   rsq
##       <int> <chr>                   <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1         1 ARIMA(0,1,1)(0,1,2)[12] Test   40.8 22.7  1.57  19.7   48.9 0.739
## 2         2 ARIMA(2,1,1)(1,1,0)[12] Test   16.1  8.81 0.618  8.41  18.7 0.866
## 3         3 ETS(M,A,N)              Test   21.0 10.1  0.810 10.9   36.5 0.321
## 4         4 PROPHET                 Test   32.0 15.3  1.23  17.6   48.3 0.325
## 5         5 LM                      Test   61.2 31.6  2.36  38.6   72.3 0.321

From this table we can see that the boosted ARIMA model (row 2) outperformed all of the other models: it produced the lowest value on every error metric (MAE, MAPE, MASE, SMAPE and RMSE) and the highest R-squared.
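As a side note, modeltime also ships a helper for rendering this accuracy table as a formatted, optionally interactive table, which can be handy in an RMarkdown report.

# same metrics, displayed with modeltime's table helper
calibrate_tbl %>%
  modeltime_accuracy() %>%
  table_modeltime_accuracy(.interactive = TRUE)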

Forecasting

Here we will visualise only the superior boosted ARIMA model, both in its ability to capture the testing data and in its forecast for the next year.

# Plot Boosted ARIMA
arima_boost_fit %>%
  modeltime_calibrate(new_data = test) %>%
  modeltime_forecast(
    actual_data = vic_liquor_df,
    new_data = test
  ) %>%
  plot_modeltime_forecast(
    .conf_interval_show = F
  )
## Converting to Modeltime Table.

The plot above shows how the model estimated the testing data set. Next, we refit on the full data set and forecast the next 12 months.

# refit on all of the data and forecast a year ahead
arima_boost_fit %>%
  modeltime_calibrate(new_data = test) %>%
  modeltime_refit(data = vic_liquor_df) %>%
  modeltime_forecast(h = 12, actual_data = vic_liquor_df) %>%
  plot_modeltime_forecast(
    .conf_interval_show = F
  )
## Converting to Modeltime Table.

This output can be difficult to interpret on its own. What we have is the predicted turnover for the Victorian liquor retailing industry over the next year. This may be of interest to large-scale retail companies planning at a macro level. For store owners, it may be worth scaling these figures against the total sales or profits of your own store if you aren't able to work with internal data.
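Finally, if you would rather work with the forecast numbers themselves than read them off the plot, for example to scale them against your own store's figures, the forecast object is an ordinary tibble and can be filtered directly; a sketch is shown below.

# pull out the 12 forecast values for the boosted ARIMA as a plain table
arima_boost_fit %>%
  modeltime_calibrate(new_data = test) %>%
  modeltime_refit(data = vic_liquor_df) %>%
  modeltime_forecast(h = 12, actual_data = vic_liquor_df) %>%
  filter(.key == "prediction") %>%
  select(.index, .value)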