Data 624 - Predictive Analytics

Chapter 3

library(fpp2)
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
## ── Attaching packages ────────────────────────────────────────────── fpp2 2.4 ──
## ✓ ggplot2   3.3.3     ✓ fma       2.4  
## ✓ forecast  8.13      ✓ expsmooth 2.3
## 
library(forecast)

3.1 For the following series, find an appropriate Box-Cox transformation in order to stabilise the variance.

usnetelec

help(usnetelec)
autoplot(usnetelec) + ylab("Annual US Electricity Generation (billion kWh)") +  ggtitle("Annual US Net Electricity Generation")

lambda_usnetelec <- BoxCox.lambda(usnetelec)
lambda_usnetelec
## [1] 0.5167714
autoplot(BoxCox(usnetelec,lambda_usnetelec)) +  ggtitle("Box Cox Transformation of Annual US Net Electricity Generation")

usgdp

autoplot(usgdp) + ylab("Quarterly US GDP") +  ggtitle("Quarterly US GDP")

lambda_usgdp <- BoxCox.lambda(usgdp)
lambda_usgdp
## [1] 0.366352
autoplot(BoxCox(usgdp,lambda_usgdp)) +  ggtitle("Box Cox Transformation of Quarterly US GDP")

mcopper

autoplot(mcopper) + ylab("Monthly Copper Prices") +  ggtitle("Monthly Copper Prices")

lambda_mcopper <- BoxCox.lambda(mcopper)
lambda_mcopper
## [1] 0.1919047
autoplot(BoxCox(mcopper,lambda_mcopper)) +  ggtitle("Box Cox Transformation of Monthly Copper Prices")

enplanements

autoplot(enplanements) + ylab("Domestic Revenue Enplanements (millions)") +  ggtitle("Monthly US Domestic Revenue from People Boarding Airplanes")

lambda_enplanements <- BoxCox.lambda(enplanements)
lambda_enplanements
## [1] -0.2269461
autoplot(BoxCox(enplanements,lambda_enplanements)) +  ggtitle("Box Cox Transformation of Monthly US Domestic Revenue from People Boarding Airplanes")

3.2 Why is a Box-Cox transformation unhelpful for the cangas data?

help(cangas)
autoplot(cangas) + ylab("Monthly Canadian Gas Production (billions of cubic meters)") +  ggtitle("Canadian Gas Production")

lambda_cangas <- BoxCox.lambda(cangas)
lambda_cangas
## [1] 0.5767759
autoplot(BoxCox(cangas,lambda_cangas)) +  ggtitle("Box Cox Transformation of Canadian Gas Production")

* Box Cox Transformation doesn’t seem to have an effect on the cangas data due to the dependency on lambda.

* The plot of monthly Canadian gas production displays a seasonality of 1 year and a seasonal variance that is relatively low from 1960 through 1978, larger from 1978 through 1988 and smaller from 1988 through 2005. Because the seasonal variation increases and then decreases, the Box Cox transformation cannot be used to make the seasonal variation uniform.

3.3 What Box-Cox transformation would you select for your retail data (from Exercise 3 in Section 2.10)?

retaildata <- readxl::read_excel("retail.xlsx", skip=1)
myts <- ts(retaildata[,"A3349873A"], frequency=12, start=c(1982,4))
autoplot(myts)

lambda_retail <- BoxCox.lambda(myts)
lambda_retail
## [1] 0.1276369
autoplot(BoxCox(myts,lambda_retail)) + ggtitle("Box Cox Transformation of Australian Retail")

Even though the logarithmic transformation (with λ=0) is an improvement, the transformation with a low-value lambda (λ=0.1276369) is slightly better, since it also better straightens the trend line of the data.

3.8 For your retail time series (from Exercise 3 in Section 2.10):

a. Split the data into two parts using

myts.train <- window(myts, end=c(2010,12))
myts.test <- window(myts, start=2011)

This creates a training set that ends in December of 2010 and a testing set that begins in 2011.

b. Check that your data have been split appropriately by producing the following plot.

autoplot(myts) +
  autolayer(myts.train, series="Training") +
  autolayer(myts.test, series="Test")

The training set is shown in blue up until 2011, which is where the testing set, shown in red, is visible.

c. Calculate forecasts using snaive applied to myts.train.

fc <- snaive(myts.train)
fc
##          Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
## Jan 2011          266.2 240.2540 292.1460 226.5190 305.8810
## Feb 2011          240.0 214.0540 265.9460 200.3190 279.6810
## Mar 2011          267.5 241.5540 293.4460 227.8190 307.1810
## Apr 2011          260.7 234.7540 286.6460 221.0190 300.3810
## May 2011          272.8 246.8540 298.7460 233.1190 312.4810
## Jun 2011          260.5 234.5540 286.4460 220.8190 300.1810
## Jul 2011          268.5 242.5540 294.4460 228.8190 308.1810
## Aug 2011          277.0 251.0540 302.9460 237.3190 316.6810
## Sep 2011          278.7 252.7540 304.6460 239.0190 318.3810
## Oct 2011          279.0 253.0540 304.9460 239.3190 318.6810
## Nov 2011          319.3 293.3540 345.2460 279.6190 358.9810
## Dec 2011          400.2 374.2540 426.1460 360.5190 439.8810
## Jan 2012          266.2 229.5068 302.8932 210.0826 322.3174
## Feb 2012          240.0 203.3068 276.6932 183.8826 296.1174
## Mar 2012          267.5 230.8068 304.1932 211.3826 323.6174
## Apr 2012          260.7 224.0068 297.3932 204.5826 316.8174
## May 2012          272.8 236.1068 309.4932 216.6826 328.9174
## Jun 2012          260.5 223.8068 297.1932 204.3826 316.6174
## Jul 2012          268.5 231.8068 305.1932 212.3826 324.6174
## Aug 2012          277.0 240.3068 313.6932 220.8826 333.1174
## Sep 2012          278.7 242.0068 315.3932 222.5826 334.8174
## Oct 2012          279.0 242.3068 315.6932 222.8826 335.1174
## Nov 2012          319.3 282.6068 355.9932 263.1826 375.4174
## Dec 2012          400.2 363.5068 436.8932 344.0826 456.3174

Forecast were made using Seasonal Naive method

d. Compare the accuracy of your forecasts against the actual values stored in myts.test.

accuracy(fc,myts.test)
##                     ME     RMSE      MAE       MPE      MAPE     MASE      ACF1
## Training set  7.772973 20.24576 15.95676  4.702754  8.109777 1.000000 0.7385090
## Test set     55.300000 71.44309 55.78333 14.900996 15.082019 3.495907 0.5315239
##              Theil's U
## Training set        NA
## Test set      1.297866

e. Check the residuals.Do the residuals appear to be uncorrelated and normally distributed?

checkresiduals(fc)

## 
##  Ljung-Box test
## 
## data:  Residuals from Seasonal naive method
## Q* = 624.45, df = 24, p-value < 2.2e-16
## 
## Model df: 0.   Total lags used: 24

The decrease of LAG suggests coorelation and no, the residuals don’t appear to be normally distributed.

f. How sensitive are the accuracy measures to the training/test split?

In order to check that, performed the forecasts multiple times, each time using different year to split the data, and the accuracies are calculated below.

sen <- function(split_year){
  trainset <- window(myts, end=c(split_year, 12))
  testset <- window(myts, start=split_year+1)
  acc <- accuracy(snaive(trainset), testset)
  return(acc)
}

splits <- c(2000:2011)
accs <- data.frame()
for (year in splits){
  acc <- sen(year)
  temp <- data.frame(t(acc[2,c(1:6)]))
  accs <- rbind(accs, temp)
}
row.names(accs) <- splits
accs
##               ME      RMSE       MAE        MPE      MAPE      MASE
## 2000  46.9958333 51.253085 46.995833 16.0619584 16.061958 3.1627528
## 2001  20.2958333 23.203403 20.454167  6.9090812  6.969629 1.2630187
## 2002   0.9541667 15.196395 12.495833  0.3421824  4.229447 0.7756915
## 2003 -23.9208333 28.733930 25.020833 -8.4660641  8.865573 1.5670274
## 2004  -0.6875000  7.066087  5.820833 -0.2190730  2.103361 0.3571399
## 2005  10.3375000 19.096673 13.870833  3.1513972  4.490635 0.8747372
## 2006  11.5500000 18.573794 13.708333  3.4641276  4.285225 0.8820923
## 2007   4.9208333 24.992541 20.020833  1.9475253  6.575480 1.2948451
## 2008  -0.2375000 25.096256 18.470833  0.1008644  6.074517 1.1985484
## 2009  -5.6958333 34.321622 27.137500 -3.3305299  8.948578 1.7364971
## 2010  55.3000000 71.443089 55.783333 14.9009957 15.082019 3.4959067
## 2011  65.4291667 78.693755 67.345833 15.8536151 16.520064 4.0179004

It is apparent here that the accuracy measures are very sensitive to the split. For example, if we use Dec 2004 to split the data, i.e. using data from Apr 1982 to Dec 2004 to train and make forecast on data from Jan 2005 to Dec 2006, the accuracy will be very good. The MAPE is only 2.1%, which means that on average the forecast is just 2.1% off. But if we pick Dec 2010 to do the split, the accuracy will be very bad, with a MAPE of 15.1%.