Textbook: Hyndman, R.J., & Athanasopoulos, G. (2018) Forecasting: principles and practice, 2nd edition, OTexts: Melbourne, Australia.

# Load packages
library(fpp2)
library(tidyverse)
library(gridExtra)

Exercise 3.1

For the following series, find an appropriate Box-Cox transformation in order to stabilize the variance.

  • usnetelec
  • usgdp
  • mcopper
  • enplanements'

Function is as below:

boxcox_transform = function(timeseries, title){
  #Plot before transforming
   x = autoplot(timeseries) + 
     ggtitle(sprintf("Before: %s", title))+xlab('Year')+ylab('Units')
  #Find lambda
   lambda = BoxCox.lambda(timeseries)
  #plot after transformation
   x_transformed = autoplot(BoxCox(timeseries, lambda)) +
                  labs(title = sprintf("Transformed: %s", title), 
                       subtitle = sprintf("lambda = %0.2f", lambda))+
                        xlab('Year')+ylab('Units')
  grid.arrange(x, x_transformed)
}

Using function on usnetelec

boxcox_transform(usnetelec, 'Annual US Net Electricity Generation')

The data is transformed with a lambda of 0.52. There is no significant change in the variance but the transformed chart does look a little straighter.

Using function on usgdp

boxcox_transform(usgdp, 'Quarterly US GDP')

The data is transformed using a lambda of 0.37 and the chart is less curved than before.

Using function on mcopper

boxcox_transform(mcopper, 'Monthly copper price')

The original data exhibits seasonality and cyclicity. Using a lambda of 0.19, the seasonal variations can be more clearly observed and there is a less prominent spike at the end.

Using function on enplanements

boxcox_transform(enplanements, 'Monthly US Domestic Enplanements')

The data exhibits an upward trend, seasonality,and cycles. With a lambda of 0.23, the variation is more stabilized.

Exercise 3.2

Why is a Box-Cox transformation unhelpful for the cangas data?

boxcox_transform(cangas, 'Monthly Canadian Gas Production')

The data is transformed using a lambda of 0.58 and shows a little less variation than the original. The seasonal variance is low through the early part of the data, The variance between 1978 and 1988 are a bit higher and again smaller from 1988 through 2005. The seasonal variation increases and decreases and maybe that is why the Box Cox transformation does not make a significant difference.

Exercise 3.3

What Box-Cox transformation would you select for your retail data (from Exercise 3 in Section 2.10)?

retaildata = readxl::read_excel("retail.xlsx", skip = 1)
myts = ts(retaildata[, 13], frequency = 12, start = c(1982,4))
boxcox_transform(myts, title = "New South Wales Department Stores")

frequency(myts)
## [1] 12

The data shows an upward trend and a seasonality of frequency 1 year. The seasonal variation increases with time. Using a lambda of 0.12, there is more consistent variation. The consistency can be more clearly seen if we compare the data after 2000 for both the charts.

Exercise 3.8

For your retail time series (from Exercise 3 in Section 2.10):

  1. Split the data into two parts using
#Split into train and test
myts.train = window(myts, end=c(2010,12))
myts.test = window(myts, start=2011)
  1. Check that your data have been split appropriately by producing the following plot.
autoplot(myts) +
  autolayer(myts.train, series="Training") +
  autolayer(myts.test, series="Test")

  1. Calculate forecasts using snaive applied to myts.train.
# Seasonal Naive Method
fc = snaive(myts.train)
fc
##          Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
## Jan 2011          491.3 459.6747 522.9253 442.9333 539.6667
## Feb 2011          354.1 322.4747 385.7253 305.7333 402.4667
## Mar 2011          443.7 412.0747 475.3253 395.3333 492.0667
## Apr 2011          447.4 415.7747 479.0253 399.0333 495.7667
## May 2011          471.7 440.0747 503.3253 423.3333 520.0667
## Jun 2011          512.6 480.9747 544.2253 464.2333 560.9667
## Jul 2011          499.4 467.7747 531.0253 451.0333 547.7667
## Aug 2011          419.6 387.9747 451.2253 371.2333 467.9667
## Sep 2011          465.3 433.6747 496.9253 416.9333 513.6667
## Oct 2011          470.4 438.7747 502.0253 422.0333 518.7667
## Nov 2011          542.8 511.1747 574.4253 494.4333 591.1667
## Dec 2011          887.5 855.8747 919.1253 839.1333 935.8667
## Jan 2012          491.3 446.5751 536.0249 422.8991 559.7009
## Feb 2012          354.1 309.3751 398.8249 285.6991 422.5009
## Mar 2012          443.7 398.9751 488.4249 375.2991 512.1009
## Apr 2012          447.4 402.6751 492.1249 378.9991 515.8009
## May 2012          471.7 426.9751 516.4249 403.2991 540.1009
## Jun 2012          512.6 467.8751 557.3249 444.1991 581.0009
## Jul 2012          499.4 454.6751 544.1249 430.9991 567.8009
## Aug 2012          419.6 374.8751 464.3249 351.1991 488.0009
## Sep 2012          465.3 420.5751 510.0249 396.8991 533.7009
## Oct 2012          470.4 425.6751 515.1249 401.9991 538.8009
## Nov 2012          542.8 498.0751 587.5249 474.3991 611.2009
## Dec 2012          887.5 842.7751 932.2249 819.0991 955.9009
  1. Compare the accuracy of your forecasts against the actual values stored in myts.test.
accuracy(fc, myts.test)
##                     ME     RMSE      MAE       MPE     MAPE     MASE
## Training set  11.22432 24.67736 19.07958  3.240636 5.581014 1.000000
## Test set     -17.55417 25.70837 20.65417 -3.862206 4.506843 1.082527
##                     ACF1 Theil's U
## Training set -0.05265994        NA
## Test set     -0.31868147  0.212556

The mean errors for both training and test sets are far away from 0, so the model is not a very good fit.The RMSE and MAE for both sets are close which means the model does not overfit or underfit.

  1. Check the residuals. Do the residuals appear to be uncorrelated and normally distributed?
checkresiduals(fc)

## 
##  Ljung-Box test
## 
## data:  Residuals from Seasonal naive method
## Q* = 98.61, df = 24, p-value = 5.18e-11
## 
## Model df: 0.   Total lags used: 24

There are a few very large absolute value of residuals from 2000 to 2006. The histogram is right skewed and the mean is different from 0. 7 of the 36 spikes in the Acf (more than 5%) fall outside the acceptable boundary, so this is not a white noise series. The Ljung-Box test returns a p value less than 0.05, so we reject the null hypothesis that the time series is not autocorrelated. Overall, this does not seem to be a good model to use for forecasting from this dataset.

  1. How sensitive are the accuracy measures to the training/test split?

To check for sensitivity, let us split the data differently. I will choose 2000 as end point for the train set as the residuals seem to show more variations after that year.

#Split into train and test
myts.train1 = window(myts, end=c(2006,12))
myts.test1 = window(myts, start=2007)

fc = snaive(myts.train1)

accuracy(fc, myts.test1)
##                    ME     RMSE      MAE      MPE    MAPE    MASE        ACF1
## Training set 11.67649 24.71612 18.71298 3.496326 5.78378 1.00000 -0.09111525
## Test set     24.53333 32.82566 27.94167 4.839588 5.61242 1.49317 -0.29007084
##              Theil's U
## Training set        NA
## Test set     0.2763369

We see that the RMSE and MAE has increased significantly for the test set, so the accuracy measures seem to be very sensitive to the point of split.