HW week 2 - 624 - Spring 2021

Jeff Shamp

2021-02-21

3.1-3.3 and 3.8 # Question HA 3.1 ## Question For the following series, find a appropriate Box-Cox transformation in order to stablize the variance.

  1. usnetelec
  2. usgdp
  3. mcopper
  4. enplanements

Answer

usnetelec

First let’s visualize this data. We see that this is trending data, but there does not appear to be any seasonal or cyclic components. As such, there is little need to transform this data to stabilze variance. Additionally, the scale is such that a even a simple log transform would unnecessary. This said, the auto-choice function for Box-Cox says that a \(\lambda = 0.5167714\) (basically a sqrt transform) is the best option. When both the raw data and the Box-Cox are plot together, we see that the BC transform reduces the scale of the y-axis, and dampens the variations. I guess this is a better choice in light of making forecasting easier, but from my (albeit naive) perspective, this route seems only marginally helpful.

lambda <- BoxCox.lambda(usnetelec)

autoplot(usnetelec) +
  labs(title= "Original Data - US net electric")

autoplot(BoxCox(usnetelec,lambda)) +
  labs(title = "Box Cox Transformed - US net electric")

usgdp

For US gdp, we see the original data in clearly a power function so some kind of transformation will be helpful - This data seems like a reasonable candidate for a drift forecast if it is transformed.

(lambda <- BoxCox.lambda(usgdp))
## [1] 0.366352
autoplot(usgdp) +
  labs(title= "Original Data- US GDP")

autoplot(BoxCox(usgdp, lambda)) +
  labs(title= "Box-Cox Transformed- US GDP")

mcopper

This is a very complex data set. We see that it is trended data, the massive run up in price around 2005 notwithstanding. There may also be some kind of cycling pattern to this data as well. A sub-series plot showed no meaningful seasonality to this data and the ACF plot showed only light evidence of cycling. Given this, it seems reasonable that a power fit on the data will help to minimize the y-axis scale and dampen the run up in price.

autoplot(mcopper) +
  labs(title= "Original Data - Monthly Copper Price")

lambda <- BoxCox.lambda(mcopper)

autoplot(BoxCox(mcopper, lambda)) +
  labs(title= "Box-Cox Transformed- Monthly Copper Price")

enplanements

Finally some data with a seasonal component! It was my understanding that BoxCox was best used for this type of data. The seasonal variation appears to be increasing so a negative value of lambda is helpful.

autoplot(enplanements) +
  labs(title= "Original Data- Monthly Domestic Flights")

lambda <- BoxCox.lambda(enplanements)

autoplot(BoxCox(enplanements, lambda))

Question HA 3.2

Question

Why is a Box-Cox transformation unhelpful for the cangas data?

Answer

The cangas dataset has a seasonal component whose variance is non-constant. The seasonality increases then decrease ans goes from a regular pattern to a very different regular pattern. These types of dynamic changes in the seasonality are hard to tame with a singular power function.

autoplot(cangas) +
  labs(title = "Original - Cangas")

lambda <- BoxCox.lambda(cangas)

autoplot(BoxCox(cangas, lambda)) +
  labs(title = "Box Cox - Cangas")

Question HA 3.3

Question

What Box-Cox transformation would you select for your retail data (from Exercise 3 in Section 2.10)?

Answer

Re-load the data.

library(httr)
url <- "https://otexts.com/fpp2/extrafiles/retail.xlsx"
GET(url, write_disk("retail.xlsx", overwrite=TRUE))
## Response [https://otexts.com/fpp2/extrafiles/retail.xlsx]
##   Date: 2021-02-22 01:21
##   Status: 200
##   Content-Type: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
##   Size: 639 kB
## <ON DISK>  /Users/jeffshamp/Documents/GitHub/cuny_msds/DATA_624/retail.xlsx
retail<- readxl::read_excel("retail.xlsx", skip=1)
myts <- ts(retail[, "A3349721R"], frequency = 12, start = c(1982, 1))

This data is a really good candidate for Box-Cox transformation as the seaonal component increases over time. As for which value I would choose - I won’t choose one. I don’t like “eye-balling” it, as I have found historically that I tend to convince myself of anything. Thus, if there is some more rigorous way to choose a value, then I perfer to take the calculable solution. Here a small value of lambda does a fine job of regulating the variance in the seasonality.

autoplot(myts) +
  labs(title = "Retail Data - Original")

(lambda<- BoxCox.lambda(myts))
## [1] 0.09105888
autoplot(BoxCox(myts, lambda)) +
  labs(title = paste0("Retail Data - Box Cox Lambda = ", round(lambda, 3)))

Question HA 3.8

Question

Do the following for the retail data loaded up in question 3.3

Part a

Split the data into two parts using

myts.train <- window(myts, end=c(2010,12)) 
myts.test <- window(myts, start=2011)

Part b

Check that your data have been split appropriately by producing the following plot.

autoplot(myts) +
  autolayer(myts.train, series="Training") + 
  autolayer(myts.test, series="Test")

Part c

Calculate forecasts using snaive applied to myts.train .

fc <- snaive(myts.train)

Part d

Compare the accuracy of your forecasts against the actual values stored in myts.test.

accuracy(fc,myts.test)
##                     ME     RMSE       MAE      MPE     MAPE    MASE      ACF1
## Training set  55.94315  71.2746  56.44613 7.856931 7.917269 1.00000 0.8531117
## Test set     133.87500 142.3278 133.87500 7.088744 7.088744 2.37173 0.5693153
##              Theil's U
## Training set        NA
## Test set      1.156762

Part e

Check the residuals. Do the residuals appear to be uncorrelated and normally distributed?

The residuals are not normally distributed, they do not have a constant variance, they are not distributed around zero, and the Ljung-Box test indicates a rejection of the null hypthesis, and the lag plot of shows a clear trend. This is a poor forecast.

checkresiduals(fc)

## 
##  Ljung-Box test
## 
## data:  Residuals from Seasonal naive method
## Q* = 2664.8, df = 24, p-value < 2.2e-16
## 
## Model df: 0.   Total lags used: 24

Part f

How sensitive are the accuracy measures to the training/test split?

There are changes in the accuarcy scores for different splits of time (train on 2009, test on 2010 i.e.), but the differences are marginal and the results are all about the same (very poor in this case). After trying several years in train, test splits the MASE is generally about 2-3 and the RMSE is ~150.

In general, a single train/test split is probably ill-advised. With time series data the random split could occur at a point in time that is a significant outlier, that would like skew the training or testing results. We could try to split at a more “cherry-picked”, split to avoid spilts at locations likely to adversly affect the training or testing results.

myts.train <- window(myts, end=c(2008,12)) 
myts.test <- window(myts, start=2009)
fc_2 <- snaive(myts.train)
accuracy(fc_2,myts.test)
##                     ME      RMSE       MAE      MPE     MAPE     MASE      ACF1
## Training set  53.01442  67.98473  53.55609 8.033980 8.098961 1.000000 0.8388850
## Test set     153.64583 163.99457 153.64583 9.016907 9.016907 2.868877 0.5554098
##              Theil's U
## Training set        NA
## Test set      1.411193