3.1-3.3 and 3.8 # Question HA 3.1 ## Question For the following series, find a appropriate Box-Cox transformation in order to stablize the variance.
- usnetelec
- usgdp
- mcopper
- enplanements
Answer
usnetelec
First let’s visualize this data. We see that this is trending data, but there does not appear to be any seasonal or cyclic components. As such, there is little need to transform this data to stabilze variance. Additionally, the scale is such that a even a simple log transform would unnecessary. This said, the auto-choice function for Box-Cox says that a \(\lambda = 0.5167714\) (basically a sqrt transform) is the best option. When both the raw data and the Box-Cox are plot together, we see that the BC transform reduces the scale of the y-axis, and dampens the variations. I guess this is a better choice in light of making forecasting easier, but from my (albeit naive) perspective, this route seems only marginally helpful.
lambda <- BoxCox.lambda(usnetelec)
autoplot(usnetelec) +
labs(title= "Original Data - US net electric")
usgdp
For US gdp, we see the original data in clearly a power function so some kind of transformation will be helpful - This data seems like a reasonable candidate for a drift forecast if it is transformed.
## [1] 0.366352
mcopper
This is a very complex data set. We see that it is trended data, the massive run up in price around 2005 notwithstanding. There may also be some kind of cycling pattern to this data as well. A sub-series plot showed no meaningful seasonality to this data and the ACF plot showed only light evidence of cycling. Given this, it seems reasonable that a power fit on the data will help to minimize the y-axis scale and dampen the run up in price.
lambda <- BoxCox.lambda(mcopper)
autoplot(BoxCox(mcopper, lambda)) +
labs(title= "Box-Cox Transformed- Monthly Copper Price")
enplanements
Finally some data with a seasonal component! It was my understanding that BoxCox was best used for this type of data. The seasonal variation appears to be increasing so a negative value of lambda is helpful.
Question HA 3.2
Question
Why is a Box-Cox transformation unhelpful for the cangas data?
Answer
The cangas dataset has a seasonal component whose variance is non-constant. The seasonality increases then decrease ans goes from a regular pattern to a very different regular pattern. These types of dynamic changes in the seasonality are hard to tame with a singular power function.
Question HA 3.3
Question
What Box-Cox transformation would you select for your retail data (from Exercise 3 in Section 2.10)?
Answer
Re-load the data.
library(httr)
url <- "https://otexts.com/fpp2/extrafiles/retail.xlsx"
GET(url, write_disk("retail.xlsx", overwrite=TRUE))
## Response [https://otexts.com/fpp2/extrafiles/retail.xlsx]
## Date: 2021-02-22 01:21
## Status: 200
## Content-Type: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
## Size: 639 kB
## <ON DISK> /Users/jeffshamp/Documents/GitHub/cuny_msds/DATA_624/retail.xlsx
retail<- readxl::read_excel("retail.xlsx", skip=1)
myts <- ts(retail[, "A3349721R"], frequency = 12, start = c(1982, 1))
This data is a really good candidate for Box-Cox transformation as the seaonal component increases over time. As for which value I would choose - I won’t choose one. I don’t like “eye-balling” it, as I have found historically that I tend to convince myself of anything. Thus, if there is some more rigorous way to choose a value, then I perfer to take the calculable solution. Here a small value of lambda does a fine job of regulating the variance in the seasonality.
## [1] 0.09105888
autoplot(BoxCox(myts, lambda)) +
labs(title = paste0("Retail Data - Box Cox Lambda = ", round(lambda, 3)))
Question HA 3.8
Question
Do the following for the retail data loaded up in question 3.3
Part a
Split the data into two parts using
Part b
Check that your data have been split appropriately by producing the following plot.
Part d
Compare the accuracy of your forecasts against the actual values stored in myts.test
.
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 55.94315 71.2746 56.44613 7.856931 7.917269 1.00000 0.8531117
## Test set 133.87500 142.3278 133.87500 7.088744 7.088744 2.37173 0.5693153
## Theil's U
## Training set NA
## Test set 1.156762
Part e
Check the residuals. Do the residuals appear to be uncorrelated and normally distributed?
The residuals are not normally distributed, they do not have a constant variance, they are not distributed around zero, and the Ljung-Box test indicates a rejection of the null hypthesis, and the lag plot of shows a clear trend. This is a poor forecast.
##
## Ljung-Box test
##
## data: Residuals from Seasonal naive method
## Q* = 2664.8, df = 24, p-value < 2.2e-16
##
## Model df: 0. Total lags used: 24
Part f
How sensitive are the accuracy measures to the training/test split?
There are changes in the accuarcy scores for different splits of time (train on 2009, test on 2010 i.e.), but the differences are marginal and the results are all about the same (very poor in this case). After trying several years in train, test splits the MASE is generally about 2-3 and the RMSE is ~150.
In general, a single train/test split is probably ill-advised. With time series data the random split could occur at a point in time that is a significant outlier, that would like skew the training or testing results. We could try to split at a more “cherry-picked”, split to avoid spilts at locations likely to adversly affect the training or testing results.
myts.train <- window(myts, end=c(2008,12))
myts.test <- window(myts, start=2009)
fc_2 <- snaive(myts.train)
accuracy(fc_2,myts.test)
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 53.01442 67.98473 53.55609 8.033980 8.098961 1.000000 0.8388850
## Test set 153.64583 163.99457 153.64583 9.016907 9.016907 2.868877 0.5554098
## Theil's U
## Training set NA
## Test set 1.411193