require(fpp2)
## Loading required package: fpp2
## Loading required package: ggplot2
## Loading required package: forecast
## Loading required package: fma
## Loading required package: expsmooth
Since the forecast package provides BoxCox.lambda() to find a good lambda value for stabilizing the variance, we will go with that.
l1 <- BoxCox.lambda(usnetelec)
autoplot(BoxCox(usnetelec, l1)) +
ggtitle("usnetelec Box-Cox")
autoplot(usnetelec) +
ggtitle("usnetelec raw")
l2 <- BoxCox.lambda(usgdp)
autoplot(BoxCox(usgdp, l2)) +
ggtitle("usgdp Box-Cox")
autoplot(usgdp) +
ggtitle("usgdp raw")
l3 <- BoxCox.lambda(mcopper)
autoplot(BoxCox(mcopper, l3)) +
ggtitle("mcopper Box-Cox")
autoplot(mcopper) +
ggtitle("mcopper raw")
l4 <- BoxCox.lambda(enplanements)
autoplot(BoxCox(enplanements, l4)) +
ggtitle("enplanement Box-Cox")
autoplot(enplanements) +
ggtitle("enplanements raw")
For usnetelec and usgdp there does not seem to be much seasonal variation in the data to smooth over, although the Box-Cox transformation has rescaled the values, which may help in prediction when other inputs are on a similarly small scale. The biggest shift is in the mcopper dataset, where the Box-Cox transformation damps the huge spike in the more recent entries.
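As a quick check of my own (not part of the original exercise), we can print the lambda values BoxCox.lambda() picked for each series to see how strong a transformation it thinks each one needs.
# print the chosen lambdas side by side (uses l1-l4 from above)
c(usnetelec = l1, usgdp = l2, mcopper = l3, enplanements = l4)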
Why is a Box-Cox transformation unhelpful for the cangas data?
l5 <- BoxCox.lambda(cangas)
autoplot(BoxCox(cangas, l5)) +
ggtitle("cangas Box-Cox")
autoplot(cangas) +
ggtitle("cangas raw")
I suppose that the variation in the cangas data set is not extreme enough for a Box-Cox transformation to have any real effect. We saw this with usnetelec and usgdp in the previous exercise, and even enplanements to a certain extent. Even Hyndman and Athanasopoulos note that a transformation isn’t always needed.
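As a rough sketch of my own (not from the original text), plotting the raw and transformed cangas series in separate panels makes it easy to see how little the shape of the series changes.
# facet the raw and Box-Cox series for a side-by-side comparison
autoplot(cbind(raw = cangas, `Box-Cox` = BoxCox(cangas, l5)), facets = TRUE) +
ggtitle("cangas: raw vs Box-Cox")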
From Exercise 2.3, we know we have to read in the file, but more importantly the file is just a data frame of many time series. We need to pick one in particular to use. For simplicity’s sake, let us choose the first one.
retaildata <- readxl::read_excel("retail.xlsx", skip = 1)
myts <- ts(retaildata[, "A3349335T"],
frequency = 12, start = c(1982, 4))
l6 <- BoxCox.lambda(myts)
autoplot(BoxCox(myts, l6)) +
ggtitle("retail Box-Cox")
autoplot(myts) +
ggtitle("retail raw")
We see a much more drastic transformation here than in our other uses of BoxCox. The upward trend is now roughly linear rather than accelerating, and the seasonal peaks, which grow with the level of the raw series, have been smoothed to a roughly constant size.
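To quantify that claim, here is a small sketch of my own (not from the original text): compute the spread (max minus min) of the observations within each year before and after the transformation. If the transform has stabilized the seasonal swings, the yearly ranges of the transformed series should be roughly constant.
# yearly max - min, aggregated from the monthly series
yearly_range <- function(x) aggregate(x, nfrequency = 1, FUN = function(v) diff(range(v)))
autoplot(yearly_range(myts)) +
ggtitle("Yearly range, raw retail series")
autoplot(yearly_range(BoxCox(myts, l6))) +
ggtitle("Yearly range, Box-Cox retail series")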
myts.train <- window(myts, end = c(2010, 12))
myts.test <- window(myts, start = 2011)
autoplot(myts) +
autolayer(myts.train, series = "Training") +
autolayer(myts.test, series = "Test")
Calculate forecasts using snaive applied to myts.train.
fc <- snaive(myts.train)
Compare the accuracy of your forecasts against the actual values stored in myts.test.
accuracy(fc, myts.test)
## ME RMSE MAE MPE MAPE MASE
## Training set 61.56787 72.20702 61.68438 6.388722 6.404105 1.000000
## Test set 97.44583 109.62545 100.02917 4.629852 4.751209 1.621629
## ACF1 Theil's U
## Training set 0.6018274 NA
## Test set 0.2686595 0.9036205
checkresiduals(fc)
##
## Ljung-Box test
##
## data: Residuals from Seasonal naive method
## Q* = 812.76, df = 24, p-value < 2.2e-16
##
## Model df: 0. Total lags used: 24
Do the residuals appear to be uncorrelated and normally distributed?
No, the residuals do not appear to be normally distributed; their distribution has a rather large right skew. They do appear to be correlated, though: more than 5% of the spikes in the ACF plot exceed the significance bounds (several are beyond 0.2), and the Ljung-Box p-value is essentially zero.
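As a quick check of my own (not part of the original answer), we can compute the approximate 95% white-noise bound and the fraction of residual autocorrelations that fall outside it, and draw a histogram to show the right skew.
# drop the leading NAs that snaive leaves in the residuals
res <- na.omit(as.numeric(residuals(fc)))
bound <- 1.96 / sqrt(length(res)) # approximate 95% significance bound
r <- acf(res, lag.max = 24, plot = FALSE)$acf[-1] # autocorrelations, lag 0 dropped
mean(abs(r) > bound) # proportion of spikes outside the bound
gghistogram(res) +
ggtitle("snaive residuals")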
To me, the only way to see the sensitivity of the accuracy measure is to attempt multiple train/test splits. Let’s do one with a larger test split, and one with a smaller test split.
myts.train2 <- window(myts, end = c(2005, 12))
myts.test2 <- window(myts, start = 2006)
myts.train3 <- window(myts, end = c(2012, 12))
myts.test3 <- window(myts, start = 2013)
fc2 <- snaive(myts.train2)
fc3 <- snaive(myts.train3)
accuracy(fc, myts.test)
## ME RMSE MAE MPE MAPE MASE
## Training set 61.56787 72.20702 61.68438 6.388722 6.404105 1.000000
## Test set 97.44583 109.62545 100.02917 4.629852 4.751209 1.621629
## ACF1 Theil's U
## Training set 0.6018274 NA
## Test set 0.2686595 0.9036205
accuracy(fc2, myts.test2)
## ME RMSE MAE MPE MAPE MASE
## Training set 53.64432 61.52865 53.70586 6.624671 6.638775 1.000000
## Test set 144.62083 162.88771 144.62083 8.390952 8.390952 2.692831
## ACF1 Theil's U
## Training set 0.4783456 NA
## Test set 0.7699170 1.629926
accuracy(fc3, myts.test3)
## ME RMSE MAE MPE MAPE MASE
## Training set 61.43585 72.16410 61.82577 6.150356 6.177893 1.000000
## Test set 74.47500 86.11787 74.47500 3.250322 3.250322 1.204595
## ACF1 Theil's U
## Training set 0.5721717 NA
## Test set -0.1109222 0.5204259
autoplot(myts) +
autolayer(fc, series = "Default SN", PI = FALSE) +
autolayer(fc2, series = "Large Split SN", PI = FALSE) +
autolayer(fc3, series = "Small Split SN", PI = FALSE) +
ggtitle("Retail Split Difference") +
guides(colour = guide_legend(title = "Forecasts"))
So it looks as though the small test split gives the best accuracy, followed by the original, and then the large test split, which shows that the accuracy measures are very sensitive to the train/test split. This makes intuitive sense: the more training data available and the shorter the forecast horizon, the more accurate the model should be in theory.
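As a final sketch of my own (not part of the original exercise), the test-set rows of the three accuracy tables can be gathered into a single comparison, which makes the ordering above easy to read off.
# collect the test-set accuracy of the three splits into one table
rbind(
  default = accuracy(fc, myts.test)["Test set", c("RMSE", "MAPE", "MASE")],
  large = accuracy(fc2, myts.test2)["Test set", c("RMSE", "MAPE", "MASE")],
  small = accuracy(fc3, myts.test3)["Test set", c("RMSE", "MAPE", "MASE")]
)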