Exercise 3.1: For the following series, find an appropriate Box-Cox transformation in order to stabilise the variance.

library(fpp2)   # loads the forecast and ggplot2 packages plus the data sets used below

autoplot(usnetelec)

elec_lab <- BoxCox.lambda(usnetelec)
autoplot(BoxCox(usnetelec,elec_lab))

The change here is difficult to perceive, which isn't surprising: the original data showed little variance to begin with, so there was little for the transformation to stabilize.
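
Since the difference is subtle, it can help to view the original and transformed series side by side, with the selected \(\lambda\) shown in the title; a minimal sketch, assuming the gridExtra package is installed:

gridExtra::grid.arrange(
  autoplot(usnetelec) + ggtitle("Original"),
  autoplot(BoxCox(usnetelec, elec_lab)) +
    ggtitle(paste("Box-Cox, lambda =", round(elec_lab, 2))),
  nrow = 2   # stack the two panels vertically
)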

autoplot(usgdp)

gdp_lab <- BoxCox.lambda(usgdp)
autoplot(BoxCox(usgdp,gdp_lab))

The shift here is also slight, but it straightens the plot, accentuating deviations from the trend and highlighting that there don't seem to be any cyclic patterns within it.
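
To make the straightening visible, we can overlay a fitted line on the transformed series; a quick sketch (ggplot2 is loaded alongside fpp2):

# A linear fit tracks the transformed series closely, showing it is nearly straight
autoplot(BoxCox(usgdp, gdp_lab)) +
  geom_smooth(method = "lm", se = FALSE)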

autoplot(mcopper)

copper_lab <- BoxCox.lambda(mcopper)
autoplot(BoxCox(mcopper,copper_lab))

With the mcopper data the transformation stabilizes the variance somewhat, equalizing the magnitude of the peaks and troughs throughout the graph. It helps us see that there does seem to be a somewhat cyclic pattern in the data.
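
One way to probe for that cyclic behaviour is to inspect the autocorrelation function of the transformed series:

# Slow, wave-like decay in the ACF is consistent with cyclic behaviour
ggAcf(BoxCox(mcopper, copper_lab))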

autoplot(enplanements)

enplmnts_lab <- BoxCox.lambda(enplanements)
autoplot(BoxCox(enplanements,enplmnts_lab))

The transformation here doesn't change the graph all that much, since the trend is fairly obvious in the original; it simply standardizes the magnitudes of the waves we see somewhat, highlighting the seasonal nature of the data.
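
That seasonality can be confirmed with a season plot, which overlays the months of each year:

# Each line is one year; the repeated within-year shape confirms the seasonality
ggseasonplot(BoxCox(enplanements, enplmnts_lab))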


Exercise 3.2: Why is a Box-Cox transformation unhelpful for the cangas data?

autoplot(cangas)

cangas_lab <- BoxCox.lambda(cangas)
autoplot(BoxCox(cangas,cangas_lab))

The seasonal variation in the cangas data increases through the late 70s and 80s and then decreases again in the 90s. A Box-Cox transformation can only correct variation that changes monotonically with the level of the series, so no single \(\lambda\) can stabilize variance that first grows and then shrinks. As a result our transformation shows little difference from the original plot and adds little value.
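
To illustrate, we can compare a log transformation (\(\lambda = 0\)) against the automatically selected \(\lambda\); a sketch:

# Neither choice evens out variation that grows and then shrinks again
autoplot(cbind(
  log = BoxCox(cangas, 0),
  chosen = BoxCox(cangas, cangas_lab)
), facets = TRUE)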


Exercise 3.3: What Box-Cox transformation would you select for your retail data (from Exercise 3 in Section 2.10)?

# Data available at https://github.com/mkollontai/DATA624
retaildata <- readxl::read_excel("retail.xlsx", skip=1)
myts <- ts(retaildata[,"A3349874C"],
  frequency=12, start=c(1982,4))
autoplot(myts)

retail_lab <- BoxCox.lambda(myts)
autoplot(BoxCox(myts,retail_lab))

A Box-Cox transformation with \(\lambda = -0.0146938\) neatly stabilizes the variance in the data, highlighting the strongly seasonal pattern within the retail data as well as the levelling off of the series in the later years.
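
Since the selected \(\lambda\) is very close to 0, this transformation is essentially a log transformation; a plain log produces a nearly identical plot:

# With lambda near 0, BoxCox() is approximately a natural log
autoplot(log(myts))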


Exercise 3.8: For your retail time series (from Exercise 3 in Section 2.10):

  a. Split the data into two parts using
myts.train <- window(myts, end=c(2010,12))
myts.test <- window(myts, start=2011)
  b. Check that your data have been split appropriately by producing the following plot.
autoplot(myts) +
  autolayer(myts.train, series="Training") +
  autolayer(myts.test, series="Test")

  c. Calculate forecasts using snaive applied to myts.train.
fc <- snaive(myts.train)
  d. Compare the accuracy of your forecasts against the actual values stored in myts.test.
accuracy(fc,myts.test)
##                      ME     RMSE       MAE        MPE      MAPE     MASE
## Training set   6.122823 12.45603  8.666967   5.906047  7.848706 1.000000
## Test set     -31.379167 34.36303 31.379167 -18.882634 18.882634 3.620548
##                   ACF1 Theil's U
## Training set 0.5367142        NA
## Test set     0.1146989 0.8221791

The mean error (ME) values are very different, with the training set showing about 6 and the test set showing about -31. In fact, every test-set measure is far worse than its training-set counterpart (a MASE of 3.62 means the test forecasts are roughly 3.6 times worse than the seasonal naive fit on the training data). This is potentially driven by the fact that there are a few sudden spikes toward the end of the series that are not present in the earlier data, skewing any prediction made.
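
Plotting the forecasts against the held-out observations makes the problem visible:

# snaive repeats the final training year, so any spike there is copied into every forecast year
autoplot(fc) +
  autolayer(myts.test, series = "Actual")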

  e. Check the residuals.
checkresiduals(fc)

## 
##  Ljung-Box test
## 
## data:  Residuals from Seasonal naive method
## Q* = 301.98, df = 24, p-value < 2.2e-16
## 
## Model df: 0.   Total lags used: 24

Do the residuals appear to be uncorrelated and normally distributed?

The Ljung-Box test (Q* = 301.98, p-value < 2.2e-16) strongly rejects the hypothesis of no autocorrelation, so the residuals are clearly correlated. They do appear fairly normally distributed around a value slightly higher than 0, though there do seem to be a few outliers on the far right, most likely a result of the same spikes mentioned in part d where we looked at the accuracy measures.
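
We can quantify that slight positive bias directly by checking the mean residual:

# A positive mean residual confirms the method under-forecast the training data on average
mean(residuals(fc), na.rm = TRUE)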

  f. How sensitive are the accuracy measures to the training/test split?

By splitting the training/test at a different location we can see how sensitive the accuracy measures are:

myts.train <- window(myts, end=c(2005,12))
myts.test <- window(myts, start=2006)
fc <- snaive(myts.train)
accuracy(fc,myts.test)
##                    ME      RMSE      MAE      MPE     MAPE     MASE      ACF1
## Training set 5.308059  9.410004 7.491941 6.143397 7.991817 1.000000 0.5341310
## Test set     4.929167 11.381985 8.620833 2.517747 5.051089 1.150681 0.4141092
##              Theil's U
## Training set        NA
## Test set     0.4153504

Shifting the split from 2011 to 2006, we see that our accuracy measures are now much closer, potentially because the test set now encompasses more data, diluting the effect of the extreme spikes in 2010/2011. This shows how important it is to have an idea of what your data look like before choosing a test/train split point (where possible).
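
To explore this more systematically, we could sweep the split year and record the test-set RMSE each time; a minimal sketch (the range of years is chosen purely for illustration):

# Test-set RMSE of snaive for a range of hypothetical split points
split_years <- 2004:2011
sapply(split_years, function(y) {
  train <- window(myts, end = c(y, 12))
  test  <- window(myts, start = y + 1)
  fc_y  <- snaive(train, h = length(test))
  accuracy(fc_y, test)["Test set", "RMSE"]
})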