Data 624 - Assignment 2

For the following series, find an appropriate Box-Cox transformation in order to stabilise the variance.
- usnetelec
- usgdp
- mcopper
- enplanements

# Assumes forecast (or fpp2), tsibble, feasts, dplyr, stringr and ggthemes are attached.
myCaption <- BoxCox.lambda(usnetelec)

usnetelec %>% 
  as_tsibble() %>% 
  mutate(transformed = BoxCox(value, myCaption)) %>%  # apply the transform; BoxCox.lambda() only returns lambda
  autoplot(transformed) + labs(title = "Annual US Net Electricity Generation", subtitle = str_c('Transformation parameter lambda: ', myCaption), caption= str_c('Data Source: usnetelec')) + 
  theme_fivethirtyeight() + theme(plot.subtitle = element_text(color = "#008FD5"))

myCaption <- BoxCox.lambda(usgdp)

usgdp %>% 
  as_tsibble() %>% 
  mutate(transformed = BoxCox(value, myCaption)) %>%  # apply the transform; BoxCox.lambda() only returns lambda
  autoplot(transformed) + labs(title = "US GDP Timeseries", subtitle = str_c('Transformation parameter lambda: ', myCaption), caption= str_c('Data Source: usgdp')) + 
  theme_fivethirtyeight() + theme(plot.subtitle = element_text(color = "#008FD5"))

myCaption <- BoxCox.lambda(mcopper)

mcopper %>% 
  as_tsibble() %>% 
  mutate(transformed = BoxCox(value, myCaption)) %>%  # apply the transform; BoxCox.lambda() only returns lambda
  autoplot(transformed) + labs(title = "Monthly copper prices", subtitle = str_c('Transformation parameter lambda: ', myCaption), caption= str_c('Data Source: mcopper')) + 
  theme_fivethirtyeight() + theme(plot.subtitle = element_text(color = "#008FD5"))

myCaption <- BoxCox.lambda(enplanements)

enplanements %>% 
  as_tsibble() %>% 
  mutate(transformed = BoxCox(value, myCaption)) %>%  # apply the transform; BoxCox.lambda() only returns lambda
  autoplot(transformed) + labs(title = "Monthly US domestic enplanements", subtitle = str_c('Transformation parameter lambda: ', myCaption), caption= str_c('Data Source: enplanements')) + 
  theme_fivethirtyeight() + theme(plot.subtitle = element_text(color = "#008FD5"))

Why is a Box-Cox transformation unhelpful for the cangas data?

myCaption <- BoxCox.lambda(cangas)

cangas %>% 
  as_tsibble() %>% 
  mutate(transformed = BoxCox(value, myCaption)) %>%  # apply the transform; BoxCox.lambda() only returns lambda
  autoplot(transformed) + labs(title = "Monthly Canadian gas production", subtitle = str_c('Transformation parameter lambda: ', myCaption), caption= str_c('Data Source: cangas')) + 
  theme_fivethirtyeight() + theme(plot.subtitle = element_text(color = "#008FD5"))

cangas %>% 
  as_tsibble() %>% 
  autoplot() + labs(title = "Monthly Canadian gas production", subtitle = "No Box-Cox Transformation", caption= str_c('Data Source: cangas')) + 
theme_fivethirtyeight() + theme(plot.subtitle = element_text(color = "#008FD5"))

A Box-Cox transformation is used to make the size of the seasonal variation roughly constant across the whole time series, which makes forecasting simpler and often more accurate. Because the transformation is a monotonic function of the level, it can only even out variation that grows or shrinks steadily with the level of the series. In this case, the optimal lambda, 0.57677, does not appear to have made the seasonal variation more consistent; this is evident when comparing the transformed plot with the untransformed plot. This may be the result of the error variance changing over time, or of outliers.
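To illustrate how uneven variance can defeat the transformation, here is a minimal sketch on synthetic data (not cangas itself). The series, the hypothetical helpers `box_cox()` and `yearly_range()`, and every constant are assumptions chosen for illustration. The seasonal swing grows and then shrinks while the level keeps rising, and none of the lambdas tried brings the swings close to equal.

```r
# Hypothetical helper: standard Box-Cox transform for positive data.
box_cox <- function(x, lambda) {
  if (lambda == 0) log(x) else (x^lambda - 1) / lambda
}

# Synthetic monthly series (30 years): rising level, seasonal swing peaking mid-sample.
t <- 1:360
level <- 20 + 0.05 * t
amplitude <- 2 + 8 * exp(-((t - 180) / 60)^2)
y <- level + amplitude * sin(2 * pi * t / 12)

# Hypothetical helper: size of the seasonal swing (max - min) within each year.
yearly_range <- function(x) {
  tapply(x, rep(1:30, each = 12), function(v) diff(range(v)))
}

# Ratio of the largest to the smallest yearly swing for several lambdas.
# A ratio near 1 would mean the variance has been stabilised.
lambdas <- c(0, 0.5, 1)
swing_ratio <- sapply(lambdas, function(l) {
  r <- yearly_range(box_cox(y, l))
  max(r) / min(r)
})
names(swing_ratio) <- lambdas
print(round(swing_ratio, 2))
```

If the ratios stay well above 1 for every lambda, the transformation is not the right tool for that series.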

What Box-Cox transformation would you select for your retail data (from Exercise 3 in Section 2.10)?

retaildata <- readxl::read_excel("retail.xlsx", skip=1)
myts <- ts(retaildata[,"A3349335T"], frequency=12, start=c(1982,4))

myCaption <- BoxCox.lambda(myts)

myts %>% 
  as_tsibble() %>% 
  mutate(transformed = BoxCox(value, myCaption)) %>%  # apply the transform; BoxCox.lambda() only returns lambda
  autoplot(transformed) + labs(title = "New South Wales Supermarket and grocery store Sales", subtitle = str_c('Transformation parameter lambda: ', myCaption), caption= str_c('Data Source: A3349335T')) + 
  theme_fivethirtyeight() + theme(plot.subtitle = element_text(color = "#008FD5"))

My first inclination would be to use a log() transformation because the dataset is sales data. The optimal lambda of 0.19 is fairly close to zero, and lambda = 0 corresponds to the log() transform under the Box-Cox framework.

Selected transformation: log() (lambda = 0)
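The link between lambda = 0 and the log transform can be checked numerically. The sketch below uses the standard Box-Cox definition through two hypothetical helpers, `box_cox()` and `inv_box_cox()`, and made-up sales values; it is not the forecast package's implementation.

```r
# Hypothetical helpers: Box-Cox transform and its inverse for positive data.
box_cox <- function(x, lambda) {
  if (lambda == 0) log(x) else (x^lambda - 1) / lambda
}
inv_box_cox <- function(w, lambda) {
  if (lambda == 0) exp(w) else (lambda * w + 1)^(1 / lambda)
}

sales <- c(300, 450, 800, 1200, 2000)  # made-up positive values

# lambda = 0 is defined as the natural log.
log_scale <- box_cox(sales, 0)

# As lambda shrinks toward 0, the transform approaches log(x).
gap <- sapply(c(0.19, 0.01, 0.0001), function(l) {
  max(abs(box_cox(sales, l) - log(sales)))
})
print(gap)

# Forecasts made on the transformed scale must be back-transformed.
back <- inv_box_cox(box_cox(sales, 0.19), 0.19)
print(back)
```

Choosing lambda = 0 rather than 0.19 trades a slightly less optimal fit for a transformation that is easier to interpret.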

For your retail time series (from Exercise 3 in Section 2.10):

retaildata <- readxl::read_excel("retail.xlsx", skip=1)
myts <- ts(retaildata[,"A3349335T"], frequency=12, start=c(1982,4))

Split the data into two parts using

myts.train <- window(myts, end=c(2010,12))
myts.test <- window(myts, start=2011)

Check that your data have been split appropriately by producing the following plot.

autoplot(myts) +
  autolayer(myts.train, series="Training") +
  autolayer(myts.test, series="Test") + labs(title = "New South Wales Supermarket and grocery store Sales", subtitle = "Training: 1982 - 2010, Test: 2011 onward", caption= str_c('Data Source: A3349335T')) + 
  theme_fivethirtyeight() + theme(plot.subtitle = element_text(color = "#008FD5"))

Calculate forecasts using snaive applied to myts.train.

fc <- snaive(myts.train)

Compare the accuracy of your forecasts against the actual values stored in myts.test.

accuracy(fc,myts.test)

##                    ME      RMSE       MAE      MPE     MAPE     MASE      ACF1
## Training set 61.56787  72.20702  61.68438 6.388722 6.404105 1.000000 0.6018274
## Test set     97.44583 109.62545 100.02917 4.629852 4.751209 1.621629 0.2686595
##              Theil's U
## Training set        NA
## Test set     0.9036205
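As a quick arithmetic check on the table above: for seasonal data, MASE scales the MAE by the in-sample MAE of the seasonal naive method. Since the method here is itself seasonal naive, the training MASE is 1 and the test MASE should equal the ratio of the two MAE values. The variable names below are illustrative.

```r
# MAE values copied from the accuracy() output above.
mae_train <- 61.68438
mae_test <- 100.02917

# Test MASE = test MAE / in-sample seasonal naive MAE.
mase_test <- mae_test / mae_train
print(round(mase_test, 4))
```

The result agrees with the reported test MASE of 1.621629, meaning the test-set errors are about 62% larger on average than the in-sample seasonal naive errors.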

Check the residuals. Do the residuals appear to be uncorrelated and normally distributed?

checkresiduals(fc)

## 
##  Ljung-Box test
## 
## data:  Residuals from Seasonal naive method
## Q* = 812.76, df = 24, p-value < 2.2e-16
## 
## Model df: 0.   Total lags used: 24

The residuals appear to be correlated, and their distribution is near-normal but not centered on zero (the mean is greater than zero). The Ljung-Box test has a p-value well below 0.05, so we reject the null hypothesis that the residuals are independent.
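The same two checks can be sketched in base R without checkresiduals(). This example uses the built-in AirPassengers series as stand-in data (an assumption, since retail.xlsx is not bundled with R) and computes seasonal naive residuals by hand as seasonal differences.

```r
# Stand-in data: AirPassengers ships with base R (monthly, 1949-1960).
y <- AirPassengers

# Seasonal naive one-step residuals: each value minus the value 12 months earlier.
res <- diff(y, lag = frequency(y))

# Ljung-Box test over 24 lags, matching the lag count reported above.
lb <- Box.test(res, lag = 24, type = "Ljung-Box")
print(lb)

# A mean well away from zero signals biased forecasts.
print(mean(res))
```

On a trending series like this one, the residual mean is positive because the seasonal naive method ignores the upward trend, which is the same pattern seen in the retail residuals.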

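e <- tsCV(myts.train, snaive)
sqrt(mean(e^2, na.rm = TRUE))  # cross-validated RMSE (tsCV returns NA where no forecast is possible)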

How sensitive are the accuracy measures to the training/test split?

I used time series cross-validation to gauge how sensitive the accuracy measures are to the training/test split. The cross-validated RMSE was 71.28, compared with a training RMSE of 72.21 and a test RMSE of 109.63 (from the accuracy table above). The cross-validated and training errors are close, but the test error is roughly 50% larger, which suggests that the accuracy measures are quite sensitive to where the split falls.
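A more direct way to probe this sensitivity is to move the split point and recompute the test RMSE each time. The sketch below does this in base R on the built-in AirPassengers series (stand-in data, not the retail series), using hypothetical helpers `snaive_fc()` and `rmse()` rather than the forecast package.

```r
# Hypothetical helper: seasonal naive forecast that repeats the last full season.
snaive_fc <- function(train, h) {
  m <- frequency(train)
  last_season <- tail(as.numeric(train), m)
  rep(last_season, length.out = h)
}

# Hypothetical helper: root mean squared error.
rmse <- function(e) sqrt(mean(e^2, na.rm = TRUE))

y <- AirPassengers  # built-in monthly series, 1949-1960

# Try several split years; hold out the 24 months after each one.
split_years <- 1952:1958
test_rmse <- sapply(split_years, function(yr) {
  train <- window(y, end = c(yr, 12))
  test <- window(y, start = c(yr + 1, 1), end = c(yr + 2, 12))
  rmse(as.numeric(test) - snaive_fc(train, length(test)))
})
names(test_rmse) <- split_years
print(round(test_rmse, 1))
```

A wide spread in these values would indicate that a single split gives an unreliable picture of forecast accuracy.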

Jim Mundy