Week 3 - Forecasting - Homework

C. Rosemond 09.13.20


library(fpp2)
library(readxl)


3.1

For the following series, find an appropriate Box-Cox transformation in order to stabilise the variance.

usnetelec

autoplot(usnetelec) # Annual US net electricity generation (billion kwh) (1949-2003)

(lambda <- BoxCox.lambda(usnetelec)) # 0.517
## [1] 0.5167714
autoplot(BoxCox(usnetelec,lambda), main = paste("Lambda =", round(lambda, 3)))

This time series describes annual US net electricity generation from 1949 to 2003. Given the data are annual, there is no seasonality and thus little apparent variation. For the purposes of this exercise, applying the BoxCox.lambda function to the series returns a suggested lambda of approximately 0.52. Using that lambda within a Box-Cox transformation results in a straightening and slight shift upward for the series.


usgdp

autoplot(usgdp) # Quarterly US GDP ($) (1947.1-2006.1)

ggseasonplot(usgdp) # Quick check of seasons

(lambda <- BoxCox.lambda(usgdp)) # 0.366
## [1] 0.366352
autoplot(BoxCox(usgdp, lambda), main = paste("Lambda =", round(lambda, 3)))

This time series describes quarterly US GDP from Q1 1947 to Q1 2006. The data are quarterly, but plotting them does not reveal apparent seasonality; a follow-up seasonal plot serves as confirmation. For the purposes of this exercise, applying the BoxCox.lambda function to the series returns a suggested lambda of approximately 0.37. Using that lambda within a Box-Cox transformation again results in a straightening of the series.


mcopper

autoplot(mcopper) # Monthly copper prices (pounds/ton) (1960 - 2006)

ggseasonplot(mcopper)

(lambda <- BoxCox.lambda(mcopper)) # 0.192
## [1] 0.1919047
autoplot(BoxCox(mcopper, lambda), main = paste("Lambda =", round(lambda, 3)))

autoplot(BoxCox(mcopper, -0.5), main = paste("Lambda =", -0.5))

This time series describes monthly copper prices from 1960 to 2006. There is definite variation in the data, though fixed seasonal patterns are not obvious. There is also an extreme increase in price in around 2005. A seasonal plot shows the increase in price from 2005 to 2006. Applying the BoxCox.lambda function to the series returns a suggested lambda of approximately 0.19. Using that lambda within a Box-Cox transformation results in somewhat more stable variance across the series, though the relatively larger increases are still quite clear. Experimenting with a different lambda of -0.5 results in an increasing logarithmic shape, which does not address variation.


enplanements

autoplot(enplanements) # Monthly US domestic enplanements (1979 - 2002)

ggseasonplot(enplanements)

(lambda <- BoxCox.lambda(enplanements)) # -0.227
## [1] -0.2269461
autoplot(BoxCox(enplanements, lambda), main = paste("Lambda =", round(lambda,3)))

autoplot(BoxCox(enplanements, 0.5), main = paste("Lambda =", 0.5))

This time series describes monthly US domestic enplanements from 1979 to 2002. The data show relatively clear monthly seasonality, which is borne out by a seasonal plot. There is a large drop in 2001, likely in response to 9/11. Applying the BoxCox.lambda function to the series returns a suggested lambda of approximately -0.23. Using that lambda within a Box-Cox transformation results in relatively stable variation across the series but for the 2001 drop.



3.2

Why is a Box-Cox transformation unhelpful for the cangas data?

autoplot(cangas) # Monthly Canadian gas production ()

ggseasonplot(cangas)

ggsubseriesplot(cangas)

(lambda <- BoxCox.lambda(cangas))
## [1] 0.5767759
autoplot(BoxCox(cangas, lambda), main = paste("Lambda =", round(lambda, 3)))

Based on several plots, there appears to be three general levels of variation within the series: roughly 1960 to 1975, 1975 to 1990, and 1990 to 2005. Applying the Box-Cox transformation with a suggested lambda of approximately 0.58 does not address these levels. Given variation is consistent within levels and has a seemingly small magnitude throughout, it may be relatively stable already. A transformation may not be necessary.

*Note: The HA text is pretty sparse on the limitations of Box-Cox re: time series, and I couldn't find resources elsewhere.



3.3

What Box-Cox transformation would you select for your retail data (from Exercise 3 in Section 2.10)?

retaildata <- read_excel("C:\\Users\\Charlie\\Documents\\CUNY_MSDS\\fall20\\624\\Week2\\retail.xlsx", skip=1) # 381 x 190
myts <- ts(retaildata[,4], frequency=12, start=c(1982,4)) # 4th column ("A3349338X")
(lambda <- BoxCox.lambda(myts))
## [1] -0.2536078
autoplot(myts, main = "Non-Transformed")

autoplot(BoxCox(myts, lambda), main = paste("Lambda = ", round(lambda,3)))

This time series describes monthly Australian retail sales of some good/product from 1982 to 2013. It is trending upward from the start and shows monthly seasonality. Starting in late 1994, there is increasing variation that continues to grow over time. The series also shows a notable dip around the start of the year 2000. Applying the BoxCox.lambda function to the series returns a suggested lambda of approximately -0.25. Using that lambda within a Box-Cox transformation results in a straighter series, with generally more stable variation throughout. Notably, the transformation only tempers the drop in 2000.



3.8

For your retail time series (from Exercise 3 in Section 2.10):

a. Split the data into two parts using

myts.train <- window(myts, end=c(2010,12))
myts.test <- window(myts, start=2011)

The retail time series is split in training (pre-2011) and test (2011 onward) sets.


b. Check that your data have been split appropriately by producing the following plot.

autoplot(myts) +
  autolayer(myts.train, series="Training") +
  autolayer(myts.test, series="Test")


c. Calculate forecasts using snaive applied to myts.train.

fc <- snaive(myts.train)


d. Compare the accuracy of your forecasts against the actual values stored in myts.test.

accuracy(fc, myts.test)
##                      ME     RMSE      MAE       MPE      MAPE     MASE
## Training set   5.502402 27.74935 19.33784  3.369450 10.447161 1.000000
## Test set     -10.845833 24.12202 18.52083 -5.910245  9.201624 0.957751
##                   ACF1 Theil's U
## Training set 0.8703252        NA
## Test set     0.3564215 0.9855325

Relative to the test set, the seasonal naive forecast shows smaller (in magnitude) mean error (approximately 5.502 vs -10.846 for the actual values) and mean percentage error (3.369 vs -5.910). The forecast shows slightly larger root mean squared error (27.749 vs. 24.122), mean absolute error (19.338 vs 18.521), mean absolute perecentage error (10.447 vs 9.202), and mean absolute scale error (1 vs 0.958). In addition, autocorrelation at lag 1 for the forecast is 0.870 compared to 0.356 for the test set.


e. Check the residuals. Do the residuals appear to be uncorrelated and normally distributed?

checkresiduals(fc)

## 
##  Ljung-Box test
## 
## data:  Residuals from Seasonal naive method
## Q* = 1256.8, df = 24, p-value < 2.2e-16
## 
## Model df: 0.   Total lags used: 24

A Ljung-Box test of the residuals returns a test-statistic of 1256.8 with 24 degrees of freedom and a p-value of approximately 0. The statistic indicates that at least some autocorrelations are large and that they do not come from white noise. Given the p-value, there is sufficient evidence to reject the null hypothesis that the residuals are indistinguishable from white noise.

Considering the plots, there is clear variation in residuals across the series. Autocorrelation is generally significantly different from zero, with values positive but decreasing through a lag of 12 months then negative for lags through 24 months. The distribution of the residuals is roughly symmetric, with some left skew, but not approaching normality.


f. How sensitive are the accuracy measures to the training/test split?

myts.train2 <- window(myts, end=c(2005,12))
myts.test2 <- window(myts, start=2006)
autoplot(myts, main = "Test - 2011 onward") +
  autolayer(myts.train, series="Training") +
  autolayer(myts.test, series="Test")

autoplot(myts, main = "Test - 2006 onward") +
  autolayer(myts.train2, series="Training") +
  autolayer(myts.test2, series="Test")

fc2 <- snaive(myts.train2)
accuracy(fc, myts.test)
##                      ME     RMSE      MAE       MPE      MAPE     MASE
## Training set   5.502402 27.74935 19.33784  3.369450 10.447161 1.000000
## Test set     -10.845833 24.12202 18.52083 -5.910245  9.201624 0.957751
##                   ACF1 Theil's U
## Training set 0.8703252        NA
## Test set     0.3564215 0.9855325
accuracy(fc2, myts.test2)
##                     ME     RMSE      MAE       MPE     MAPE     MASE
## Training set  8.249451 25.23502 16.97033  4.995277 10.07867 1.000000
## Test set     35.525000 40.93159 35.52500 12.144146 12.14415 2.093359
##                   ACF1 Theil's U
## Training set 0.8592305        NA
## Test set     0.7224740  1.752219

Comparing the initial train/test split (at 2011) with a second, more even split (at 2006) suggests that the cut-off does affect forecast accuracy. Here, the initial split, with its smaller test set, results in a somewhat more accurate forecast relative to the actual data.