DATA 624: Homework 2

1. For the following series, find an appropriate Box-Cox transformation in order to stabilise the variance.
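For reference, here is a minimal sketch of the classical Box-Cox mapping that BoxCox() applies to a positive series: lambda = 0 gives the log, and any other lambda gives a scaled power transform.

# Sketch of the classical Box-Cox transform (assumes y > 0);
# BoxCox.lambda() picks the lambda that best stabilises the variance.
box_cox <- function(y, lambda) {
  if (lambda == 0) log(y)              # log transform at lambda = 0
  else (y^lambda - 1) / lambda         # scaled power transform otherwise
}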

library(fpp2) # loads forecast, ggplot2, and the data sets used below

lambda <- BoxCox.lambda(usnetelec)
dframe <- cbind(RawSeries = usnetelec, TransformedData = BoxCox(usnetelec,lambda))
autoplot(dframe, facets=TRUE) +
  xlab("Year") + ylab("billion kwh") +
  ggtitle(paste0("Annual US net electricity generation,\npower parameter lambda = ", round(lambda,5)))

lambda <- BoxCox.lambda(usgdp)
dframe <- cbind(RawSeries = usgdp, TransformedData = BoxCox(usgdp,lambda))
autoplot(dframe, facets=TRUE) +
  xlab("Year") + ylab("US Dollars") +
  ggtitle(paste0("Quarterly US GDP,\npower parameter lambda = ", round(lambda,5)))

lambda <- BoxCox.lambda(mcopper)
dframe <- cbind(RawSeries = mcopper, TransformedData = BoxCox(mcopper,lambda))
autoplot(dframe, facets=TRUE) +
  xlab("Year") + ylab("pounds per ton") +
  ggtitle(paste0("Monthly copper prices,\npower parameter lambda = ", round(lambda,5)))

lambda <- BoxCox.lambda(enplanements)
dframe <- cbind(RawSeries = enplanements, TransformedData = BoxCox(enplanements,lambda))
autoplot(dframe, facets=TRUE) +
  xlab("Year") + ylab("millions") +
  ggtitle(paste0("US domestic enplanements,\npower parameter lambda = ", round(lambda,5)))

2. Why is a Box-Cox transformation unhelpful for the cangas data?

A Box-Cox transformation helps only when the size of the variation grows or shrinks with the level of the series. In the monthly Canadian gas production series, the variation is largest at mid-levels and smaller at both the low early levels and the high later levels, so no single power parameter lambda can make the variance uniform; a comparison across several lambda values follows the plot below.

lambda <- BoxCox.lambda(cangas)
dframe <- cbind(RawSeries = cangas, TransformedData = BoxCox(cangas,lambda))
autoplot(dframe, facets=TRUE) +
  xlab("Year") + ylab("billion cubic metres") +
  ggtitle(paste0("Monthly Canadian gas production,\npower parameter lambda = ", round(lambda,5)))

3. What Box-Cox transformation would you select for your retail data (from Exercise 3 in Section 2.10)?

retaildata <- readxl::read_excel('retail.xlsx', skip=1)
myts <- ts(retaildata[,"A3349824F"], frequency=12, start=c(1982,4))
lambda <- BoxCox.lambda(myts)
dframe <- cbind(RawSeries = myts, TransformedData = BoxCox(myts,lambda))
autoplot(dframe, facets=TRUE) +
  xlab("Year") + ylab("Sales") +
  ggtitle(paste0("Australian Retail Data (ID: A3349824F),\npower parameter lambda = ", round(lambda,5)))

8. For your retail time series (from Exercise 3 in Section 2.10):

  1. Split the data into two parts using
myts.train <- window(myts, end=c(2010,12))
myts.test <- window(myts, start=2011)
  2. Check that your data have been split appropriately by producing the following plot.
autoplot(myts) +
  autolayer(myts.train, series="Training") +
  autolayer(myts.test, series="Test")
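A quick numeric check complements the plot; this sketch simply verifies that the two pieces partition the original series.

# Sketch: the split should cover the whole series with no gap or overlap.
length(myts.train) + length(myts.test) == length(myts) # expect TRUE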

  3. Calculate forecasts using snaive applied to myts.train.
fc <- snaive(myts.train)
  4. Compare the accuracy of your forecasts against the actual values stored in myts.test.
accuracy(fc,myts.test)
##                    ME     RMSE      MAE       MPE      MAPE     MASE
## Training set 12.11051 25.24173 18.71471  6.299482  8.728705 1.000000
## Test set     66.10833 73.22137 66.10833 14.002564 14.002564 3.532425
##                   ACF1 Theil's U
## Training set 0.7639057        NA
## Test set     0.7434153  1.724674
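As a sanity check, the test-set RMSE can be recomputed by hand. Note that snaive() defaults to a horizon of h = 2 * frequency = 24 months, so the forecasts span only the first two years of the three-year test set, and accuracy() compares over that overlap. A sketch assuming fc and myts.test from above:

# Sketch: recompute the test-set RMSE over the 24 months snaive() forecast.
err <- window(myts.test, end = end(fc$mean)) - fc$mean
sqrt(mean(err^2)) # should match the test-set RMSE from accuracy()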
  5. Check the residuals.
checkresiduals(fc)

## 
##  Ljung-Box test
## 
## data:  Residuals from Seasonal naive method
## Q* = 977.76, df = 24, p-value < 2.2e-16
## 
## Model df: 0.   Total lags used: 24

Do the residuals appear to be uncorrelated and normally distributed?

No on both counts. The ACF shows positive, gradually decaying autocorrelation over roughly the first 11 lags, and the residual histogram is right-skewed rather than bell-shaped. The Ljung-Box test agrees: a Q* this large (977.76 on 24 lags, p-value < 2.2e-16) is far beyond what a white-noise series would produce, so the residuals retain significant autocorrelation.
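The same Ljung-Box statistic can be reproduced with base R's Box.test, which makes the inputs explicit (a sketch assuming fc from above):

# Sketch: Ljung-Box test on the snaive residuals, 24 lags as above.
Box.test(na.omit(residuals(fc)), lag = 24, type = "Ljung-Box") # leading NAs dropped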

  6. How sensitive are the accuracy measures to the training/test split?

In the first case, we held out the last three years of monthly sales (2011 onward) as the test set and forecast them from the training data. The test-set RMSE was 73.22137 and the MASE 3.532425; the training-set values, shown below, are far smaller.

accuracy(fc,myts.test)[,c(2,6)][1,] # training-set RMSE and MASE, for comparison
##     RMSE     MASE 
## 25.24173  1.00000
head(myts) #Time Series begins (1982,4)
##       Apr  May  Jun  Jul  Aug  Sep
## 1982 50.4 57.9 54.4 56.2 55.0 55.6
tail(myts) #Time Series ends (2013,12)
##        Jul   Aug   Sep   Oct   Nov   Dec
## 2013 486.8 503.6 494.3 541.3 561.2 643.6
(dates <- length(myts)) #Total number of monthly observations
## [1] 381

Below, we calculate the mean error (ME) and root mean squared error (RMSE) on the test set for a range of training/test splits.

years <- 1983:2012
df <- NULL
for (i in years) {
  myts.train <- window(myts, end=c(i,12))    # train through December of year i
  myts.test <- window(myts, start=i+1)       # test on everything after year i
  fc <- snaive(myts.train)
  acc <- accuracy(fc,myts.test)[,c(1,2)][2,] # test-set ME and RMSE
  df <- cbind(df, acc)
}
colnames(df) <- years
df2 <- data.frame(t(df))

pct_test <- round((2013-years) * 12 / dates,3) # share of observations held out for testing
df2 <- cbind(df2, pct_test)
print(df2)
##               ME      RMSE pct_test
## 1983  14.8166667 16.037092    0.945
## 1984  10.5333333 11.979357    0.913
## 1985  11.7750000 14.136271    0.882
## 1986  11.4333333 12.404166    0.850
## 1987  10.0416667 13.133228    0.819
## 1988  10.1291667 10.752383    0.787
## 1989   1.1125000  7.367638    0.756
## 1990  15.3500000 24.018812    0.724
## 1991  44.3875000 48.218388    0.693
## 1992  32.8166667 35.544819    0.661
## 1993  14.0375000 19.275513    0.630
## 1994  13.8041667 21.997566    0.598
## 1995  24.3125000 30.123572    0.567
## 1996  31.8541667 35.545563    0.535
## 1997  11.4416667 18.728432    0.504
## 1998  21.5875000 28.649920    0.472
## 1999   9.5000000 16.030362    0.441
## 2000  -6.0875000 18.425695    0.409
## 2001  20.2500000 31.688628    0.378
## 2002  59.8458333 71.808545    0.346
## 2003  68.4000000 71.109452    0.315
## 2004  50.3625000 56.312073    0.283
## 2005  52.7333333 55.954528    0.252
## 2006   0.6583333 30.529644    0.220
## 2007 -30.6250000 35.064583    0.189
## 2008   2.5500000 18.527007    0.157
## 2009   8.8500000 32.502820    0.126
## 2010  66.1083333 73.221371    0.094
## 2011  59.4791667 64.331365    0.063
## 2012  28.2583333 33.297710    0.031
plot(x=df2$pct_test, y=df2$RMSE)

cor(df2$pct_test, df2$RMSE)
## [1] -0.6200111

This is the opposite of what we would expect: a moderately strong negative correlation between the test-set share and RMSE. As the share of data held out for testing grows (leaving less data for training), forecast error should tend to increase, but here it decreases.
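For what it is worth, the correlation can be given a quick significance check; since successive splits share most of their observations, the p-value should be read loosely (a sketch):

# Sketch: significance check on the test-share vs RMSE correlation.
cor.test(df2$pct_test, df2$RMSE)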

plot(x=df2$pct_test, y=df2$ME)

cor(df2$pct_test, df2$ME)
## [1] -0.2830965

Here again there is a negative correlation between the test-set share and the mean error, though a weaker one than for RMSE. Again we would expect a positive relationship.

Conclusion:

The author believes this surprising finding reflects a peculiarity of this particular time series: the early values have low volatility and sit close to the overall mean, while volatility explodes as the series progresses, distorting the forecasts. Note also that snaive() forecasts only h = 2 * frequency = 24 months ahead by default, so each accuracy comparison effectively covers just the two years after the split: the late splits (small test shares) are evaluated entirely on the volatile recent years, while the early splits (large test shares) are evaluated on the calm 1980s.
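One way to probe that explanation is rolling-origin evaluation, which scores every possible split at a fixed horizon instead of a handful of splits at varying horizons. Below is a minimal sketch using forecast::tsCV, assuming myts is defined as above; if the absolute errors grow with the level of the series, later test windows are simply harder to forecast, consistent with the negative correlations observed above.

# Sketch: absolute one-step seasonal-naive errors from every forecast origin.
e <- tsCV(myts, snaive, h = 1)
autoplot(abs(e)) + xlab("Year") + ylab("One-step snaive absolute error")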