Textbook: Hyndman, R.J., & Athanasopoulos, G. (2018) Forecasting: principles and practice, 2nd edition, OTexts: Melbourne, Australia.
# Load packages
library(fpp2)
library(tidyverse)
library(gridExtra)
For the following series, find an appropriate Box-Cox transformation in order to stabilize the variance.
usnetelecusgdpmcopperenplanements'Function is as below:
boxcox_transform = function(timeseries, title){
#Plot before transforming
x = autoplot(timeseries) +
ggtitle(sprintf("Before: %s", title))+xlab('Year')+ylab('Units')
#Find lambda
lambda = BoxCox.lambda(timeseries)
#plot after transformation
x_transformed = autoplot(BoxCox(timeseries, lambda)) +
labs(title = sprintf("Transformed: %s", title),
subtitle = sprintf("lambda = %0.2f", lambda))+
xlab('Year')+ylab('Units')
grid.arrange(x, x_transformed)
}
Using function on usnetelec
boxcox_transform(usnetelec, 'Annual US Net Electricity Generation')
The data is transformed with a lambda of 0.52. There is no significant change in the variance but the transformed chart does look a little straighter.
Using function on usgdp
boxcox_transform(usgdp, 'Quarterly US GDP')
The data is transformed using a lambda of 0.37 and the chart is less curved than before.
Using function on mcopper
boxcox_transform(mcopper, 'Monthly copper price')
The original data exhibits seasonality and cyclicity. Using a lambda of 0.19, the seasonal variations can be more clearly observed and there is a less prominent spike at the end.
Using function on enplanements
boxcox_transform(enplanements, 'Monthly US Domestic Enplanements')
The data exhibits an upward trend, seasonality,and cycles. With a lambda of 0.23, the variation is more stabilized.
Why is a Box-Cox transformation unhelpful for the cangas data?
boxcox_transform(cangas, 'Monthly Canadian Gas Production')
The data is transformed using a lambda of 0.58 and shows a little less variation than the original. The seasonal variance is low through the early part of the data, The variance between 1978 and 1988 are a bit higher and again smaller from 1988 through 2005. The seasonal variation increases and decreases and maybe that is why the Box Cox transformation does not make a significant difference.
What Box-Cox transformation would you select for your retail data (from Exercise 3 in Section 2.10)?
retaildata = readxl::read_excel("retail.xlsx", skip = 1)
myts = ts(retaildata[, 13], frequency = 12, start = c(1982,4))
boxcox_transform(myts, title = "New South Wales Department Stores")
frequency(myts)
## [1] 12
The data shows an upward trend and a seasonality of frequency 1 year. The seasonal variation increases with time. Using a lambda of 0.12, there is more consistent variation. The consistency can be more clearly seen if we compare the data after 2000 for both the charts.
For your retail time series (from Exercise 3 in Section 2.10):
#Split into train and test
myts.train = window(myts, end=c(2010,12))
myts.test = window(myts, start=2011)
autoplot(myts) +
autolayer(myts.train, series="Training") +
autolayer(myts.test, series="Test")
snaive applied to myts.train.# Seasonal Naive Method
fc = snaive(myts.train)
fc
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## Jan 2011 491.3 459.6747 522.9253 442.9333 539.6667
## Feb 2011 354.1 322.4747 385.7253 305.7333 402.4667
## Mar 2011 443.7 412.0747 475.3253 395.3333 492.0667
## Apr 2011 447.4 415.7747 479.0253 399.0333 495.7667
## May 2011 471.7 440.0747 503.3253 423.3333 520.0667
## Jun 2011 512.6 480.9747 544.2253 464.2333 560.9667
## Jul 2011 499.4 467.7747 531.0253 451.0333 547.7667
## Aug 2011 419.6 387.9747 451.2253 371.2333 467.9667
## Sep 2011 465.3 433.6747 496.9253 416.9333 513.6667
## Oct 2011 470.4 438.7747 502.0253 422.0333 518.7667
## Nov 2011 542.8 511.1747 574.4253 494.4333 591.1667
## Dec 2011 887.5 855.8747 919.1253 839.1333 935.8667
## Jan 2012 491.3 446.5751 536.0249 422.8991 559.7009
## Feb 2012 354.1 309.3751 398.8249 285.6991 422.5009
## Mar 2012 443.7 398.9751 488.4249 375.2991 512.1009
## Apr 2012 447.4 402.6751 492.1249 378.9991 515.8009
## May 2012 471.7 426.9751 516.4249 403.2991 540.1009
## Jun 2012 512.6 467.8751 557.3249 444.1991 581.0009
## Jul 2012 499.4 454.6751 544.1249 430.9991 567.8009
## Aug 2012 419.6 374.8751 464.3249 351.1991 488.0009
## Sep 2012 465.3 420.5751 510.0249 396.8991 533.7009
## Oct 2012 470.4 425.6751 515.1249 401.9991 538.8009
## Nov 2012 542.8 498.0751 587.5249 474.3991 611.2009
## Dec 2012 887.5 842.7751 932.2249 819.0991 955.9009
myts.test.accuracy(fc, myts.test)
## ME RMSE MAE MPE MAPE MASE
## Training set 11.22432 24.67736 19.07958 3.240636 5.581014 1.000000
## Test set -17.55417 25.70837 20.65417 -3.862206 4.506843 1.082527
## ACF1 Theil's U
## Training set -0.05265994 NA
## Test set -0.31868147 0.212556
The mean errors for both training and test sets are far away from 0, so the model is not a very good fit.The RMSE and MAE for both sets are close which means the model does not overfit or underfit.
checkresiduals(fc)
##
## Ljung-Box test
##
## data: Residuals from Seasonal naive method
## Q* = 98.61, df = 24, p-value = 5.18e-11
##
## Model df: 0. Total lags used: 24
There are a few very large absolute value of residuals from 2000 to 2006. The histogram is right skewed and the mean is different from 0. 7 of the 36 spikes in the Acf (more than 5%) fall outside the acceptable boundary, so this is not a white noise series. The Ljung-Box test returns a p value less than 0.05, so we reject the null hypothesis that the time series is not autocorrelated. Overall, this does not seem to be a good model to use for forecasting from this dataset.
To check for sensitivity, let us split the data differently. I will choose 2000 as end point for the train set as the residuals seem to show more variations after that year.
#Split into train and test
myts.train1 = window(myts, end=c(2006,12))
myts.test1 = window(myts, start=2007)
fc = snaive(myts.train1)
accuracy(fc, myts.test1)
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 11.67649 24.71612 18.71298 3.496326 5.78378 1.00000 -0.09111525
## Test set 24.53333 32.82566 27.94167 4.839588 5.61242 1.49317 -0.29007084
## Theil's U
## Training set NA
## Test set 0.2763369
We see that the RMSE and MAE has increased significantly for the test set, so the accuracy measures seem to be very sensitive to the point of split.