1. For the following series, find an appropriate Box-Cox transformation in order to stabilise the variance: usnetelec, usgdp, mcopper, enplanements.

Dataset: usnetelec
autoplot(usnetelec) + labs (title="Annual US Net Electricity Generation", 
                            subtitle = "Non-Transformed", y="KWh", x="Year") +
  scale_y_continuous(labels = comma_format(suffix=" B"))

  • Using the BoxCox.lambda() function, we find the optimal \(\lambda\) is 0.5167714. This is close to 0.5, which is essentially a square-root transform:
# Transformed values are no longer in billions of KWh, so the unit labels are dropped
autoplot(BoxCox(usnetelec,lambda=0.5)) +
  labs (title="Annual US Net Electricity Generation",
        subtitle = expression("Box-Cox Transform " * (lambda * "=" * 0.5)),
        y="Transformed KWh", x="Year")
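
For reference, and assuming the standard definition given in the fpp2 text for positive data, the Box-Cox family is

\[
w_t =
\begin{cases}
\log(y_t) & \text{if } \lambda = 0, \\
(y_t^{\lambda} - 1)/\lambda & \text{otherwise.}
\end{cases}
\]

Setting \(\lambda = 0.5\) gives \(w_t = 2\sqrt{y_t} - 2\), a linear rescaling of the square root, which is why an estimate of 0.5167714 can be rounded to a square-root transform with little change to the shape of the plot.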

Dataset: usgdp
autoplot(usgdp) + labs (title="Quarterly US Gross Domestic Product",
                        subtitle = "Non-Transformed",
                        y="US Dollars", x="Year") +
  scale_y_continuous(labels = dollar_format(prefix="$",suffix=" B", accuracy=1))

  • Using the BoxCox.lambda() function, we find the optimal \(\lambda\) is 0.366352. For simplicity's sake we will use \(\frac{1}{3}\):
# Transformed values are no longer in billions of dollars, so the dollar labels are dropped
autoplot(BoxCox(usgdp, lambda=1/3)) +
  labs (title="Quarterly US Gross Domestic Product",
        subtitle = expression("Box-Cox Transform " * (lambda * " = " * frac(1,3))),
        y="Transformed US Dollars", x="Year")

Dataset: mcopper
autoplot(mcopper) + labs (title="Monthly Copper Prices", 
                            subtitle = "Non-Transformed", y="Price per ton", x="Year") +
  scale_y_continuous(labels = dollar_format(prefix="£"))

  • Using the BoxCox.lambda() function, we find the optimal transform is 0.1919047, which we will call \(\frac{1}{5}\).
# Transformed values are no longer in pounds, so the currency labels are dropped
autoplot(BoxCox(mcopper,lambda=1/5)) +
  labs (title="Monthly Copper Prices",
        subtitle = expression("Box-Cox Transform " * (lambda * "=" * frac(1,5))),
        y="Transformed price per ton", x="Year")

Dataset: enplanements
# enplanements counts passengers (in millions), not dollars
autoplot(enplanements) + labs (title="US Domestic Enplanements",
                               subtitle = "Non-Transformed",
                               y="Enplanements (millions)", x="Year")

  • Using the BoxCox.lambda() function, we find the optimal transform is -0.2269461. We will use \(-\frac{1}{4}\) to make it simpler.
autoplot(BoxCox(enplanements,lambda=-1/4)) +
  labs (title="US Domestic Enplanements",
        subtitle = expression("Box-Cox Transform " * (lambda * " = " * frac(-1,4))),
        y="Transformed enplanements", x="Year")
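
All four transformations above apply the same formula with a different \(\lambda\). The following is a minimal base-R sketch of that formula for strictly positive data; the helper names box_cox and inv_box_cox are hypothetical and are not part of the forecast package, whose BoxCox() and InvBoxCox() should be preferred in practice.

```r
# Hypothetical helper: Box-Cox transform for strictly positive data.
box_cox <- function(y, lambda) {
  stopifnot(all(y > 0))
  if (lambda == 0) log(y) else (y^lambda - 1) / lambda
}

# Hypothetical helper: inverse transform, to return to the original scale.
inv_box_cox <- function(w, lambda) {
  if (lambda == 0) exp(w) else (lambda * w + 1)^(1 / lambda)
}

y <- c(1, 4, 9, 16)
w_sqrt <- box_cox(y, 0.5)    # equals 2 * sqrt(y) - 2
w_neg  <- box_cox(y, -1/4)   # a negative lambda still increases with y
y_back <- inv_box_cox(w_neg, -1/4)
```

Because the transform is increasing in y for any \(\lambda\), rounding the estimated value to a nearby simple fraction changes the curvature slightly but not the ordering of the observations.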


2. Why is a Box-Cox transformation unhelpful for the cangas data?

First, we look at the raw data before transforming:

autoplot(cangas) + labs (title="Monthly Canadian Gas Production",
                         subtitle = "Non-Transformed",
                         y=expression("meters"^3), x="Year") +
  scale_y_continuous(labels = unit_format(suffix="B",accuracy=1))

Next, we look at the transformed data:

autoplot(BoxCox(cangas,lambda=BoxCox.lambda(cangas))) +
  labs (title="Monthly Canadian Gas Production",
        subtitle = "Box-Cox Transformed",
        y="Transformed production", x="Year")

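The transformation does little to even out the variability: the seasonal swings are still largest in the center of the data (1978-1988). A Box-Cox transformation is monotonic in the level of the series, so it can only stabilise a variance that grows or shrinks steadily with the level. Here the variability rises and then falls while the level keeps climbing, so no single \(\lambda\) can correct it.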
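
To illustrate the point, here is a small base-R sketch on simulated data (not cangas itself). The toy series, its parameters, and the helper block_range are all assumptions made for illustration: the level rises throughout, while the seasonal amplitude is small, then large, then small again.

```r
# Hypothetical toy series: 30 years of monthly data with a rising level
# and a seasonal amplitude that is low, then high, then low again.
t     <- 1:360
level <- 5 + 0.03 * t
amp   <- c(rep(0.2, 120), rep(1.5, 120), rep(0.5, 120))
y     <- level + amp * sin(2 * pi * t / 12)

# Hypothetical helper: range within each 12-month block, used as a
# rough measure of seasonal variability.
block_range <- function(x) {
  blocks <- matrix(x, nrow = 12)   # one column per year
  apply(blocks, 2, function(b) diff(range(b)))
}

raw_ranges <- block_range(y)
log_ranges <- block_range(log(y))  # log is Box-Cox with lambda = 0

# Average variability in the early, middle and late thirds after the log
log_by_third <- c(early  = mean(log_ranges[1:10]),
                  middle = mean(log_ranges[11:20]),
                  late   = mean(log_ranges[21:30]))
print(round(log_by_third, 3))
```

In this toy example the middle third stays the most variable even after the log transform, which is the same qualitative behaviour seen in the cangas plots.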


3. What Box-Cox transformation would you select for your retaildata (from Exercise 3 in Section 2.10)?

# Load the retail data and select the same column as HW1
retaildata <- readxl::read_excel("retail.xlsx", skip=1)
retail <- ts(retaildata[,"A3349414R"], frequency = 12, start = c(1982,4))
autoplot(retail)+
  labs(title="Liquor Retailing Turnover",subtitle="Victoria") +
  xlab("Date") + ylab("Turnover")

# Candidate transformation: natural log, i.e. Box-Cox with lambda = 0
autoplot(log(retail))+
  labs(title="Liquor Retailing Turnover",subtitle="Victoria") +
  xlab("Date") + ylab("ln(Turnover)")


8. For your retail time series (from Exercise 3 in Section 2.10):

a. Split the data into two parts using:
retail.train <- window(retail, end=c(2010,12))
retail.test <- window(retail, start=2011)
b. Check that your data have been split appropriately by producing the following plot:
autoplot(retail) +
  autolayer(retail.train, series="Training") +
  autolayer(retail.test, series="Test") +
  labs(title="Liquor Retailing Turnover",subtitle="Victoria") +
  xlab("Date") + ylab("Turnover")

c. Calculate forecasts using snaive applied to myts.train (here, retail.train).
fc <- snaive(retail.train)
d. Compare the accuracy of your forecasts against the actual values stored in retail.test.
accuracy(fc, retail.test)
##                     ME      RMSE       MAE      MPE      MAPE     MASE
## Training set  4.455255  8.699864  5.818619  6.15400  9.948117 1.000000
## Test set     19.170833 22.956217 19.520833 11.59039 11.813322 3.354891
##                   ACF1 Theil's U
## Training set 0.7261600        NA
## Test set     0.5801161 0.7479721
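
The MASE column can be checked by hand. Assuming the usual definition for seasonal data, in which errors are scaled by the in-sample MAE of the seasonal naive method, the training-set MASE of a seasonal naive forecast is 1 by construction, and the test-set value is the ratio of the two MAE entries above:

\[
\text{MASE}_{\text{test}} = \frac{\text{MAE}_{\text{test}}}{\text{MAE}_{\text{train}}} = \frac{19.520833}{5.818619} \approx 3.355
\]

So on the test set the forecasts miss by roughly 3.4 times the typical in-sample seasonal naive error.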
e. Check the residuals. Do the residuals appear to be uncorrelated and normally distributed?
checkresiduals(fc)

## 
##  Ljung-Box test
## 
## data:  Residuals from Seasonal naive method
## Q* = 783.91, df = 24, p-value < 2.2e-16
## 
## Model df: 0.   Total lags used: 24
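  • The residuals are clearly correlated: the Ljung-Box test rejects the hypothesis of no autocorrelation (p-value < 2.2e-16), and the significant correlations fall mainly within the first 12 monthly lags. Their distribution is not very normal either: it has a sharper peak and is skewed to the right.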
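
The Ljung-Box statistic reported by checkresiduals() can also be computed with base R. The sketch below is an illustration under stated assumptions: it uses the built-in AirPassengers series as a stand-in (retail.xlsx is not bundled with R), and it treats the seasonal naive residuals as the lag-12 differences of the series.

```r
# Stand-in data: AirPassengers ships with base R (monthly, 1949-1960).
y <- AirPassengers

# Seasonal naive residuals: each value minus the value 12 months earlier.
res <- diff(y, lag = 12)

# Ljung-Box test over 24 lags; fitdf = 0 because the seasonal naive
# method estimates no parameters (matching "Model df: 0" above).
lb <- Box.test(res, lag = 24, type = "Ljung-Box", fitdf = 0)
print(lb)

# Sample autocorrelations of the residuals for lags 1 to 24
res_acf <- acf(as.numeric(res), lag.max = 24, plot = FALSE)
```

A small p-value here carries the same interpretation as in the output above: the residuals are not white noise, so the method leaves predictable structure in the data.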
f. How sensitive are the accuracy measures to the training/test split?
  • We can change the test/train split and see how a new forecast’s accuracy measures compare to the original ones:
# Add an extra year to the test set
retail.train2 <- window(retail, end=c(2009,12))
retail.test2 <- window(retail, start=2010)

autoplot(retail) +
  autolayer(retail.train2, series="New Training") +
  autolayer(retail.test2, series="New Test") +
  labs(title="Liquor Retailing Turnover",subtitle="Victoria") +
  xlab("Date") + ylab("Turnover")

fc2 <- snaive(retail.train2)

accuracy(fc2, retail.test2)
##                     ME     RMSE       MAE       MPE     MAPE     MASE      ACF1
## Training set  4.207165  8.37434  5.584112  6.126617 10.03072 1.000000 0.7364250
## Test set     18.433333 22.89789 18.933333 11.244706 11.67076 3.390572 0.4125194
##              Theil's U
## Training set        NA
## Test set     0.8752422
checkresiduals(fc2)

## 
##  Ljung-Box test
## 
## data:  Residuals from Seasonal naive method
## Q* = 768.89, df = 24, p-value < 2.2e-16
## 
## Model df: 0.   Total lags used: 24
  • Compared with the original train/test split, the test-set measures changed little: RMSE 22.96 vs. 22.90, MAE 19.52 vs. 18.93, MAPE 11.81 vs. 11.67 and MASE 3.35 vs. 3.39. Theil's U moved the most (0.75 to 0.88). The residuals exhibit much the same pattern as before, so I would say that the accuracy measures are not very sensitive to the train/test split for this example.
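
Comparing only two splits gives a limited picture. As a rough sketch of a broader check, the loop below recomputes test-set measures for several split points. It is an illustration under stated assumptions: it uses the built-in AirPassengers series as a stand-in for the retail data, a hand-written seasonal naive forecast, and a hypothetical helper named snaive_test_errors. Unlike snaive(), whose default horizon is two seasonal periods, it forecasts the whole test period.

```r
# Stand-in data: AirPassengers ships with base R (monthly, 1949-1960).
y <- AirPassengers

# Hypothetical helper: seasonal naive forecast by hand, repeating the
# last observed year of the training set across the test period.
snaive_test_errors <- function(y, last_train_year) {
  train <- window(y, end = c(last_train_year, 12))
  test  <- window(y, start = c(last_train_year + 1, 1))
  last_year <- tail(as.numeric(train), 12)
  fc  <- rep(last_year, length.out = length(test))
  err <- as.numeric(test) - fc
  # Scale for MASE: in-sample MAE of the seasonal naive method
  scale <- mean(abs(diff(as.numeric(train), lag = 12)))
  c(RMSE = sqrt(mean(err^2)),
    MAE  = mean(abs(err)),
    MAPE = mean(abs(err / as.numeric(test))) * 100,
    MASE = mean(abs(err)) / scale)
}

# One row of test-set measures per training end year
split_years <- 1955:1958
results <- t(sapply(split_years, snaive_test_errors, y = y))
rownames(results) <- split_years
print(round(results, 2))
```

If the rows are similar, the conclusion about sensitivity is more convincing than a comparison of two splits alone; large differences between rows would suggest the measures depend heavily on where the series is cut.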