Problem No. 3.1

For the following series, find an appropriate Box-Cox transformation in order to stabilise the variance: usnetelec, usgdp, mcopper, enplanements.

For each dataset: Calculate the optimal \(\lambda\) and plot the original and transformed data set side by side. Be sure to compare the range of the y-axes.
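For reference, the transformation being tuned here is \(w_t = \log(y_t)\) when \(\lambda = 0\) and \(w_t = (y_t^{\lambda} - 1)/\lambda\) otherwise. A minimal base-R sketch of that definition follows; `box_cox()` is a hypothetical helper written for illustration, and for strictly positive data it should agree with the `BoxCox()` calls used in the solutions.

```r
# Box-Cox transform for strictly positive data:
#   w = log(y)                   if lambda == 0
#   w = (y^lambda - 1) / lambda  otherwise
# box_cox() is a hypothetical helper name, not a package function.
box_cox <- function(y, lambda) {
  stopifnot(all(y > 0))  # this simple form assumes positive observations
  if (lambda == 0) log(y) else (y^lambda - 1) / lambda
}

y <- c(1, 10, 100, 1000)
box_cox(y, 1)    # lambda = 1 only shifts the data (y - 1): no change in shape
box_cox(y, 0.5)  # compresses large values, similar to a square root
box_cox(y, 0)    # natural log, the strongest compression of the three
```

The smaller \(\lambda\) is, the more the large values are pulled in, which is why the y-axis ranges of the original and transformed plots differ so much.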

`usnetelec`

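library(fpp2)       # assumed setup: datasets, BoxCox.lambda(), BoxCox(), autoplot()
library(gridExtra)  # assumed setup: grid.arrange()

(lambda <- BoxCox.lambda(usnetelec))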

## [1] 0.5167714

grid.arrange(ncol=2,
             autoplot(usnetelec, main='Original'),
             autoplot(BoxCox(usnetelec, lambda), main='Box-Cox transformation'))

`usgdp`

(lambda <- BoxCox.lambda(usgdp))

## [1] 0.366352

grid.arrange(ncol=2,
             autoplot(usgdp, main='Original'),
             autoplot(BoxCox(usgdp, lambda), main='Box-Cox transformation'))

`mcopper`

(lambda <- BoxCox.lambda(mcopper))

## [1] 0.1919047

grid.arrange(ncol=2,
             autoplot(mcopper, main='Original'),
             autoplot(BoxCox(mcopper, lambda), main='Box-Cox transformation'))

`enplanements`

(lambda <- BoxCox.lambda(enplanements))

## [1] -0.2269461

grid.arrange(ncol=2,
             autoplot(enplanements, main='Original'),
             autoplot(BoxCox(enplanements, lambda), main='Box-Cox transformation'))

Problem No. 3.2

Why is a Box-Cox transformation unhelpful for the cangas data?

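The problem is the excess seasonal variation in the 1980s: the seasonal swings are largest in the middle of the series and smaller both before and after, so the variance does not change steadily with the level. A Box-Cox transformation applies one monotone power transform to every observation, so it can rein in variation that grows with the level but is unable to ‘smooth’ out a bulge in the middle. It therefore fails in its goal of making subsequent modeling simpler.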

(lambda <- BoxCox.lambda(cangas))

## [1] 0.5767759

grid.arrange(ncol=2,
             autoplot(cangas, main='Original'),
             autoplot(BoxCox(cangas, lambda), main='Box-Cox transformation'))
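One rough way to put a number on "seasonal variation that rises and then falls" is the standard deviation of the observations within each calendar year. The sketch below uses base R only; `yearly_sd()` is a hypothetical helper, and the series `demo` is synthetic, built purely to show the pattern (it is not the cangas data).

```r
# Within-year standard deviation as a rough proxy for seasonal spread.
# yearly_sd() is a hypothetical helper, not a package function.
yearly_sd <- function(x) {
  # Integer year index for each observation, allowing for a mid-year start
  idx  <- (seq_along(x) + start(x)[2] - 2) %/% frequency(x)
  year <- start(x)[1] + idx
  tapply(as.numeric(x), year, sd)
}

# Synthetic monthly series: seasonal amplitude peaks in the middle, then shrinks
t    <- 0:239
amp  <- 1 + 4 * exp(-((t - 120) / 40)^2)
demo <- ts(50 + 0.05 * t + amp * sin(2 * pi * t / 12),
           start = c(1960, 1), frequency = 12)
sd_by_year <- yearly_sd(demo)
round(sd_by_year, 2)

# With fpp2 loaded, the same check could be run on the real series:
# yearly_sd(cangas)
# yearly_sd(BoxCox(cangas, BoxCox.lambda(cangas)))
```

If the transformation were doing its job, the yearly values for the transformed series would be roughly flat; a hump that survives the transformation is the signature of the problem described above.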

Problem No. 3.3

What Box-Cox transformation would you select for your retail data (from Exercise 3 in Section 2.10)?

Load the data and select a particular time series:

retaildata <- readxl::read_excel('~/Downloads/retail.xlsx', skip=1)

myts <- ts(retaildata[,55], frequency=12, start=c(1982, 4))

Find the appropriate \(\lambda\) and plot:

(lambda <- BoxCox.lambda(myts))

## [1] 0.1116205

grid.arrange(ncol=2,
             autoplot(myts, main='Original'),
             autoplot(BoxCox(myts, lambda), main='Box-Cox transformation'))

Problem No. 3.8

For your retail time series:

Split the data into two parts.

Since I have 31 years of monthly data, I will set the test data to the last 6 years (about 20 percent): 2008 through 2013.

train <- window(myts, end=c(2007,12))
test <- window(myts, start=c(2008, 1))

Check that your data have been split appropriately by producing the following plot.

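The split looks right. The black line is presumably the full series drawn by autoplot(myts), showing through wherever the coloured layers do not cover it, most visibly across the join between the end of train and the start of test: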

autoplot(myts) +
  autolayer(train, series="Training") +
  autolayer(test, series="Test")

Calculate forecasts using snaive applied to train.

fc <- snaive(train)
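The seasonal naive method sets each forecast to the last observed value from the same season, so for monthly data it simply repeats the final year of the training set. A base-R sketch of that idea, with the built-in `AirPassengers` series standing in for the retail data; `snaive_sketch()`, `train_demo` and `fc_demo` are hypothetical names, and this is not the forecast package's implementation (it produces point forecasts only, no prediction intervals).

```r
# Seasonal naive point forecasts by hand, base R only.
# AirPassengers (monthly, 1949-1960) is a stand-in for the retail series.
train_demo <- window(AirPassengers, end = c(1958, 12))

# snaive_sketch() is a hypothetical helper, not forecast::snaive().
snaive_sketch <- function(train, h) {
  m <- frequency(train)                      # seasonal period (12 for monthly data)
  last_season <- tail(as.numeric(train), m)  # most recent full year of training data
  rep(last_season, length.out = h)           # repeat it across the forecast horizon
}

fc_demo <- snaive_sketch(train_demo, h = 24)
head(fc_demo, 12)  # identical to the 1958 observations
```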

Compare the accuracy of your forecasts against the actual values stored in test.

The main result of accuracy() is that the forecast performs worse on the test set than on the training set: RMSE roughly doubles (28.47 vs. 13.58) and MASE rises from 1.00 to 2.32. This is expected. Depending on our application, this forecast could still be useful, or it could be too far off the mark for us to use.

accuracy(fc, test)

##                   ME     RMSE      MAE      MPE     MAPE     MASE
## Training set  8.6633 13.57660 10.54680 5.222720 6.585582 1.000000
## Test set     22.7125 28.46872 24.50417 7.361669 8.022287 2.323374
##                    ACF1 Theil's U
## Training set  0.1750858        NA
## Test set     -0.1281384  0.370597
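To make the columns concrete: the test-set MASE is the test MAE divided by the training MAE of the seasonal naive method, which is why the training MASE is exactly 1 and why 24.50417 / 10.54680 gives the 2.32 reported above. Below is a base-R sketch of how RMSE, MAE, MAPE and MASE can be computed by hand. `AirPassengers` stands in for the retail series and all variable names are illustrative, so the numbers will not match the table.

```r
# Hand-computed accuracy measures for a seasonal naive forecast, base R only.
# AirPassengers is a stand-in for the retail series; names are illustrative.
y <- AirPassengers
m <- frequency(y)
train_demo <- window(y, end = c(1958, 12))
test_demo  <- window(y, start = c(1959, 1))

# Seasonal naive forecast: repeat the last training year over the test period
fc_demo <- rep(tail(as.numeric(train_demo), m), length.out = length(test_demo))
err <- as.numeric(test_demo) - fc_demo

# In-sample seasonal naive errors y[t] - y[t - m]; their MAE scales the MASE
mae_train <- mean(abs(diff(as.numeric(train_demo), lag = m)))

measures <- c(
  RMSE = sqrt(mean(err^2)),
  MAE  = mean(abs(err)),
  MAPE = 100 * mean(abs(err / as.numeric(test_demo))),
  MASE = mean(abs(err)) / mae_train
)
round(measures, 3)
```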

A plot of the forecast shows it is reasonable to the eye:

autoplot(window(myts, end=c(2007,12))) +
  autolayer(fc)

Check the residuals. Do the residuals appear to be uncorrelated and normally distributed?

checkresiduals(fc)

## 
##  Ljung-Box test
## 
## data:  Residuals from Seasonal naive method
## Q* = 270.61, df = 24, p-value < 2.2e-16
## 
## Model df: 0.   Total lags used: 24

The histogram of the residuals looks approximately normal, though the right tail is perhaps too long.

The time plot of the residuals looks fairly good, except for 2003–5, where the residuals are consistently positive. We would probably want to investigate that. Elsewhere they appear to scatter around zero, although the training-set ME of 8.66 reported by accuracy() points to an overall positive bias.

The ACF plot is problematic. The first twelve lags are highly (and significantly) correlated. We can also see that the Ljung-Box test has a large \(Q^*\) (270.61, with p-value < 2.2e-16), suggesting the autocorrelation is not due to random noise. So the residuals do not appear to be uncorrelated.
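The Ljung-Box statistic is \(Q^* = n(n+2)\sum_{k=1}^{h} r_k^2/(n-k)\), where \(r_k\) is the lag-\(k\) autocorrelation of the residuals, \(n\) is their number and \(h\) is the number of lags tested. A short base-R sketch computes it by hand and checks it against `Box.test()` from the stats package. The residuals here are a stand-in (seasonal naive residuals for `AirPassengers`, which are just its lag-12 differences), so the value will differ from the output above.

```r
# Ljung-Box statistic by hand, checked against stats::Box.test().
# Seasonal naive residuals are y[t] - y[t - 12], i.e. a lag-12 difference.
res <- diff(as.numeric(AirPassengers), lag = 12)  # stand-in residuals

h <- 24                                           # lags tested, as in checkresiduals()
n <- length(res)
r <- acf(res, lag.max = h, plot = FALSE)$acf[-1]  # autocorrelations, lag 0 dropped
q_star <- n * (n + 2) * sum(r^2 / (n - seq_len(h)))

bt <- Box.test(res, lag = h, type = "Ljung-Box")
c(manual = q_star, box_test = unname(bt$statistic))
bt$p.value  # a small value is evidence against uncorrelated residuals
```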

How sensitive are the accuracy measures to the training/test split?

It’s probably not advisable to split data right where the U.S. economy was about to crash, in late 2007, especially for retail data.

We could write a for loop to evaluate predictions across multiple division points and get a good idea of how sensitive the accuracy measures are to the train/test split.
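A minimal sketch of such a loop, assuming a seasonal naive forecast evaluated over the two years after each split. It uses base R and the built-in `AirPassengers` series as a stand-in for the retail data; the split years and variable names are illustrative choices, not part of the exercise.

```r
# Test-set RMSE of a seasonal naive forecast for several split points.
# AirPassengers stands in for the retail series; names are illustrative.
y <- AirPassengers
m <- frequency(y)
h <- 2 * m                 # evaluate the first two years after each split

split_years <- 1952:1958   # last training year; each leaves at least h test months
rmse <- numeric(length(split_years))

for (i in seq_along(split_years)) {
  train_i <- window(y, end = c(split_years[i], 12))
  test_i  <- window(y, start = c(split_years[i] + 1, 1))
  fc_i    <- rep(tail(as.numeric(train_i), m), length.out = h)  # seasonal naive
  err_i   <- as.numeric(test_i)[seq_len(h)] - fc_i
  rmse[i] <- sqrt(mean(err_i^2))
}

data.frame(train_end = split_years, test_rmse = round(rmse, 2))
```

For the retail series the same loop could call snaive() and accuracy() on each window instead; a wide spread in the resulting test RMSEs would confirm that the 2007/2008 split is an unusually hard one.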

DATA 624—Homework No. 2

Ben Horvath

February 16, 2020