library(fpp2)
library(zoo)
library(plotly)
require(gridExtra)
For the following series, find an appropriate Box-Cox transformation in order to stabilise the variance. usnetelec usgdp mcopper enplanements
data(usnetelec)
invisible(lambda <- round(BoxCox.lambda(usnetelec),3))
grid.arrange(autoplot(usnetelec) + ggtitle("usnetelec"),autoplot(BoxCox(usnetelec,lambda)) + ggtitle(paste("BoxCox Transform, lambda =", lambda)), ncol = 1)
data(usgdp)
invisible(lambda <- round(BoxCox.lambda(usgdp),3))
grid.arrange(autoplot(usgdp) + ggtitle("usgdp"),autoplot(BoxCox(usgdp,lambda)) + ggtitle(paste("BoxCox Transform, lambda =", lambda)), ncol = 1)
data(mcopper)
invisible(lambda <- round(BoxCox.lambda(mcopper),3))
grid.arrange(autoplot(mcopper) + ggtitle("mcopper"),autoplot(BoxCox(mcopper,lambda)) + ggtitle(paste("BoxCox Transform, lambda =", lambda)), ncol = 1)
data(enplanements)
invisible(lambda <- round(BoxCox.lambda(enplanements),3))
grid.arrange(autoplot(enplanements) + ggtitle("enplanements"),autoplot(BoxCox(enplanements,lambda)) + ggtitle(paste("BoxCox Transform, lambda =", lambda)), ncol = 1)
Why is a Box-Cox transformation unhelpful for the cangas data?
cangas is Monthly Canadian gas production, billions of cubic metres, January 1960 - February 2005
data(cangas)
invisible(lambda <- round(BoxCox.lambda(cangas),3))
grid.arrange(autoplot(cangas) + ggtitle("cangas"),autoplot(BoxCox(cangas,lambda)) + ggtitle(paste("BoxCox Transform, lambda =", lambda)), ncol = 1)
The variability issue in this time series is still the same after the BoxCox transformation (low variability after 1995, high variability between 1980-1990). The other time series did not have evidence of cyclic patterns, but this time series does. The cyclic pattern in this time series causes changes in its variability. In addition, there are situations where variation is not extreme enough for a transformation to be needed.
What Box-Cox transformation would you select for your retail data (from Exercise 3 in Section 2.10)?
I would use a Box-Cox transformation with lambda value of 0.214. This value successfully stabilizes the time series.
temp = tempfile(fileext = ".xlsx")
dataURL <- "https://otexts.com/fpp2/extrafiles/retail.xlsx"
download.file(dataURL, destfile=temp, mode='wb')
retaildata <- readxl::read_excel(temp, skip=1)
myts <- ts(retaildata[,"A3349396W"], frequency=12, start=c(1982,4))
invisible(lambda <- round(BoxCox.lambda(myts),3))
grid.arrange(autoplot(myts) + ggtitle("myts"),autoplot(BoxCox(myts,lambda)) + ggtitle(paste("BoxCox Transform, lambda =", lambda)), ncol = 1)
For your retail time series (from Exercise 3 in Section 2.10):
myts.train <- window(myts, end=c(2010,12))
myts.test <- window(myts, start=2011)
autoplot(myts) +
autolayer(myts.train, series="Training") +
autolayer(myts.test, series="Test")
fc <- snaive(myts.train)
All error measures for the test set are higher than error measures for the training set. Since the values in the test dataset are > 20,000, an RMSE of 983 is a decent result.
accuracy(fc,myts.test)
## ME RMSE MAE MPE MAPE MASE
## Training set 598.4838 699.2360 601.5060 5.961516 5.997664 1.000000
## Test set 882.2833 982.8526 882.2833 4.172002 4.172002 1.466791
## ACF1 Theil's U
## Training set 0.5972618 NA
## Test set 0.5627543 0.4906103
checkresiduals(fc)
##
## Ljung-Box test
##
## data: Residuals from Seasonal naive method
## Q* = 1101, df = 24, p-value < 2.2e-16
##
## Model df: 0. Total lags used: 24
Do the residuals appear to be uncorrelated and normally distributed?
The residuals are not normally distributed (there is right skew), with more residual variability in the test set than the training set. The lag shows that residuals are not uncorrelated.
We can test training/test set split sensitivity using cross validation.
modelcv <- CVar(myts, k=5, lambda=0.15)
autoplot(myts, series="Data") +
autolayer(modelcv$testfit, series="Fits") +
autolayer(modelcv$residuals, series="Residuals")
## Warning: Removed 12 rows containing missing values (geom_path).
## Warning: Removed 12 rows containing missing values (geom_path).
ggAcf(modelcv$residuals)
What would happen if we changed the train and test size? Let’s create a new test train split and test the error.
The error did not change much when moving back the split by 6 months, but moving the split forward increased the error for the test set.
myts.train <- window(myts, end=c(2010,6))
myts.test <- window(myts, start=c(2010,7))
fc <- snaive(myts.train)
accuracy(fc,myts.test)
## ME RMSE MAE MPE MAPE MASE
## Training set 598.1312 699.8920 601.2089 6.017075 6.053886 1.000000
## Test set 890.4333 984.3877 890.4333 4.278399 4.278399 1.481072
## ACF1 Theil's U
## Training set 0.6020708 NA
## Test set 0.4854022 0.4236731
myts.train <- window(myts, end=c(2011,6))
myts.test <- window(myts, start=c(2011,7))
fc <- snaive(myts.train)
accuracy(fc,myts.test)
## ME RMSE MAE MPE MAPE MASE
## Training set 596.4979 696.4529 599.4667 5.900267 5.935776 1.000000
## Test set 962.1125 1042.9384 962.1125 4.487378 4.487378 1.604947
## ACF1 Theil's U
## Training set 0.5919533 NA
## Test set 0.3404947 0.4572131