DATA 624 Homework 2 - Forecaster Toolbox

library(fpp)
library(fpp2)
library(ggplot2)
library(knitr)
library(kableExtra)
library(readxl)

Question 3.1

For the following series, find an appropriate Box-Cox transformation in order to stabilise the variance.

usnetelec
usgdp
mcopper
enplanements

# Print a summary of a series, plot it, and plot its Box-Cox transformation
# using the lambda chosen automatically by BoxCox.lambda().
funcCmpr <- function(data, ylabtext, title, bcttitle){
  print(head(data))
  print(summary(data))
  print(autoplot(data) + ylab(ylabtext) + ggtitle(title))
  lambda <- BoxCox.lambda(data)
  print(paste0("Lambda: ", lambda))
  # Plot the transformed series once, with its title
  print(autoplot(BoxCox(data, lambda)) + ggtitle(bcttitle))
}
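For reference, the transformation plotted by funcCmpr can be written as below. This is the standard Box-Cox definition for a positive series, given here as a reminder rather than as a statement of exactly how the package implements it; y_t is the original observation and w_t the transformed one.

```latex
w_t =
\begin{cases}
  \log(y_t) & \text{if } \lambda = 0, \\[4pt]
  \dfrac{y_t^{\lambda} - 1}{\lambda} & \text{otherwise.}
\end{cases}
```

With lambda = 1 the series is only shifted down by one, so its shape is unchanged. Values of lambda closer to 0 pull in large observations more strongly.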

usnetelec

funcCmpr(usnetelec, "Annual US Electricity Generation (billion kWh)", "Annual US Net Electricity Generation", "Box Cox Transformation of Annual US Net Electricity Generation")

## Time Series:
## Start = 1949 
## End = 1954 
## Frequency = 1 
## [1] 296.1 334.1 375.3 403.8 447.0 476.3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   296.1   889.0  2040.9  1972.1  3002.7  3858.5

## [1] "Lambda: 0.516771443964645"

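The usnetelec plot shows an upward trend and, because the data are annual, no seasonality. The variance changes little as time progresses, so the transformation with a lambda of about 0.52 is expected to change the shape of the series only slightly.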

usgdp

funcCmpr(usgdp, "Quarterly US GDP", "Quarterly US GDP", "Box Cox Transformation of Quarterly US GDP")

##        Qtr1   Qtr2   Qtr3   Qtr4
## 1947 1570.5 1568.7 1568.0 1590.9
## 1948 1616.1 1644.6              
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1568    2632    4552    5168    7130   11404

## [1] "Lambda: 0.366352049520934"

The usgdp plot shows an upward trend and no apparent seasonality. The BoxCox.lambda function was used to choose a value for lambda that stabilises the variance across the series. The value of lambda chosen is 0.37. The transformed data is more linear and has less variation than the original data.
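To make the effect of lambda concrete, here is a minimal base R sketch of the standard Box-Cox formula applied to the smallest and largest usgdp values from the summary above. The helper name box_cox is hypothetical and is not the forecast package's BoxCox(); it assumes strictly positive input.

```r
# Hypothetical helper: standard Box-Cox transform for positive values.
box_cox <- function(y, lambda) {
  stopifnot(all(y > 0))
  if (abs(lambda) < 1e-8) log(y) else (y^lambda - 1) / lambda
}

# Smallest and largest usgdp values, taken from the summary output.
y <- c(1568, 11404)
lambda <- 0.366352049520934  # value reported by BoxCox.lambda(usgdp)

w <- box_cox(y, lambda)
print(w)

# Ratio of largest to smallest value before and after the transformation.
# A smaller ratio after transforming means large values were compressed.
print(y[2] / y[1])
print(w[2] / w[1])
```

Because lambda lies between 0 and 1, the gap between small and large values shrinks on the transformed scale, which is why the transformed plot looks closer to a straight line.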

mcopper

funcCmpr(mcopper, "Monthly Copper Prices", "Monthly Copper Prices", "Box Cox Transformation of Monthly Copper Prices")

##        Jan   Feb   Mar   Apr   May   Jun
## 1960 255.2 259.7 249.3 258.0 244.3 246.8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   216.6   566.0   949.2   997.8  1262.5  4306.0

## [1] "Lambda: 0.191904709003829"

The mcopper plot shows an upward trend and cyclic behavior, with a sharp increase in price around 2007. After the transformation (lambda of about 0.19) the series shows less variation.

enplanements

funcCmpr(enplanements, "Domestic Revenue Enplanements (millions)", "Monthly US Domestic Revenue Enplanements", "Box Cox Transformation of Monthly US Domestic Revenue Enplanements")

##        Jan   Feb   Mar   Apr   May   Jun
## 1979 21.12 22.92 25.90 24.38 23.41 26.82
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   20.14   27.18   34.88   35.67   42.78   56.14

## [1] "Lambda: -0.226946111237065"

The enplanements plot shows an upward trend and annual seasonality. The seasonal variation is smaller in some periods than in the rest of the data set. The BoxCox.lambda function was used to choose a value for lambda (about -0.23) that makes the size of the seasonal variation more constant.
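A transformed series has to be brought back to the original scale before forecasts can be read in millions of enplanements. The sketch below shows the round trip in base R for the negative lambda found above. The helpers box_cox and inv_box_cox are hypothetical names written for this illustration; the forecast package provides its own InvBoxCox(), and its forecasting functions that accept a lambda argument are documented to back-transform for you.

```r
# Hypothetical helpers: standard Box-Cox transform and its inverse
# for positive values.
box_cox <- function(y, lambda) {
  if (abs(lambda) < 1e-8) log(y) else (y^lambda - 1) / lambda
}
inv_box_cox <- function(w, lambda) {
  if (abs(lambda) < 1e-8) exp(w) else (lambda * w + 1)^(1 / lambda)
}

lambda <- -0.226946111237065   # value reported for enplanements
y <- c(20.14, 34.88, 56.14)    # min, median, max from the summary above

w <- box_cox(y, lambda)           # transformed scale
y_back <- inv_box_cox(w, lambda)  # back on the original scale

print(w)
print(all.equal(y, y_back))
```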

Question 3.2

Why is a Box-Cox transformation unhelpful for the cangas data?

funcCmpr(cangas, "Monthly Canadian Gas Production (billions of cubic meters)", "Canadian Gas Production", "Box Cox Transformation of Canadian Gas Production")

##         Jan    Feb    Mar    Apr    May    Jun
## 1960 1.4306 1.3059 1.4022 1.1699 1.1161 1.0113
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.966   6.453   8.831   9.777  14.429  19.528

## [1] "Lambda: 0.576775938228139"

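A Box-Cox transformation cannot make the seasonal variation uniform for the cangas data. The transformation is monotonic in the level of the series, so it can only stabilise variance that grows or shrinks steadily with the level. In cangas, the middle region has higher variability than the lower and upper regions, so even with the selected lambda (about 0.58) the variance is not stable.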
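One rough way to check this numerically is to compare the spread within each year against that year's level. The sketch below assumes the fma package (which supplies cangas and is attached by fpp2) is installed, and uses the within-year standard deviation as a crude stand-in for seasonal variation; that proxy is a choice made for this illustration, not part of the exercise.

```r
library(fma)  # provides cangas

# Label every monthly observation with its calendar year.
# The small offset guards against floating-point error in time().
yr <- as.integer(floor(time(cangas) + 1e-6))

# Within-year mean (level) and standard deviation (spread).
yearly_mean <- tapply(as.numeric(cangas), yr, mean)
yearly_sd <- tapply(as.numeric(cangas), yr, sd)

# If spread rose steadily with level, the points would climb throughout.
# A rise followed by a fall cannot be removed by a single lambda.
plot(yearly_mean, yearly_sd,
     xlab = "Yearly mean production",
     ylab = "Within-year standard deviation")
```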

Question 3.3

What Box-Cox transformation would you select for your retail data (from Exercise 3 in Section 2.10)?

retaildata <- read_excel("data/retail.xlsx", skip = 1)

myts <- ts(retaildata[, "A3349398A"], frequency = 12, start = c(1982, 4))
head(myts)

##        Apr   May   Jun   Jul   Aug   Sep
## 1982 408.7 404.9 401.0 414.4 403.8 411.8

funcCmpr(myts, "Retail Clothing Sales", "New South Wales - Clothing Sales", "Box Cox Transformation of Retail Clothing Sales in New South Wales")

##        Apr   May   Jun   Jul   Aug   Sep
## 1982 408.7 404.9 401.0 414.4 403.8 411.8
##    A3349398A     
##  Min.   : 401.0  
##  1st Qu.: 791.8  
##  Median :1311.6  
##  Mean   :1420.6  
##  3rd Qu.:2025.8  
##  Max.   :3278.2

## [1] "Lambda: 0.123156269082221"

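The retail data show an upward trend, and the variance increases with time. BoxCox.lambda selects a lambda of about 0.12, which is close to a log transformation (lambda = 0). The transformed data has less seasonal variation throughout, which makes it better suited to forecasting.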

Question 3.8

For your retail time series (from Exercise 3 in Section 2.10):

Split the data into two parts using

myts.train <- window(myts, end=c(2010,12))
myts.test <- window(myts, start=2011)

Check that your data have been split appropriately by producing the following plot.

autoplot(myts) +
  autolayer(myts.train, series="Training") +
  autolayer(myts.test, series="Test")
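The plot can be complemented by a quick numeric check that the two windows meet without overlapping. This sketch uses a simulated stand-in series y with the same start date and frequency as myts. The series itself, its length of 381 months and the names train and test are assumptions for illustration only.

```r
# Simulated stand-in for the retail series (not the real data).
set.seed(1)
y <- ts(cumsum(rnorm(381)) + 100, frequency = 12, start = c(1982, 4))

train <- window(y, end = c(2010, 12))
test <- window(y, start = 2011)

# The training window should end in December 2010 and the test window
# should begin in January 2011.
print(end(train))
print(start(test))

# Together the two parts should account for every observation.
print(length(train) + length(test) == length(y))
```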

Calculate forecasts using snaive applied to myts.train.

fc <- snaive(myts.train)
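The seasonal naive method sets each forecast equal to the last observed value from the same season. A minimal base R sketch of that idea follows. The function snaive_by_hand is a hypothetical name written for this note and returns point forecasts only, without the prediction intervals that snaive() also produces.

```r
# Hypothetical helper: seasonal naive point forecasts for a ts object.
snaive_by_hand <- function(y, h) {
  m <- frequency(y)                       # seasonal period, e.g. 12
  last_season <- tail(as.numeric(y), m)   # most recent full season
  values <- rep(last_season, length.out = h)
  # Forecasts start one period after the last observation.
  ts(values, start = tsp(y)[2] + 1 / m, frequency = m)
}

# Toy monthly series: the values 1 to 24 over two years.
toy <- ts(1:24, frequency = 12, start = c(2000, 1))
fc_toy <- snaive_by_hand(toy, h = 14)
print(fc_toy)
```

In the toy series the forecasts repeat the second year, so the first forecast (January 2002) equals the January 2001 value.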

Compare the accuracy of your forecasts against the actual values stored in myts.test.

accuracy(fc,myts.test)

##                     ME      RMSE       MAE      MPE     MAPE     MASE      ACF1
## Training set  73.94114  88.31208  75.13514 6.068915 6.134838 1.000000 0.6312891
## Test set     115.00000 127.92727 115.00000 4.459712 4.459712 1.530576 0.2653013
##              Theil's U
## Training set        NA
## Test set     0.7267171

autoplot(myts) +
  autolayer(myts.train, series="Training") +
  autolayer(myts.test, series="Test") +
  autolayer(fc, series="prediction")

The mean error (ME) is about 74 for the training set and 115 for the test set. Both are positive, which means the seasonal naive fitted values and forecasts are on average below the actual values. The root mean square error (RMSE) is about 88 and 128, and the mean absolute error (MAE) is about 75 and 115, so the test-set errors are roughly 1.5 times the training-set errors (MASE of 1.53). The mean percentage error (MPE) is about 6.1% for the training set and 4.5% for the test set.
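The measures discussed above can be reproduced by hand, which helps when interpreting them. The sketch below uses the usual definitions with error = actual - forecast, and scales MASE by the in-sample seasonal naive MAE. The function accuracy_by_hand and the toy numbers are assumptions for illustration and are not the retail results.

```r
# Hypothetical helper: common forecast accuracy measures.
# train and m (seasonal period) are only used for the MASE scaling.
accuracy_by_hand <- function(actual, forecast, train, m) {
  e <- actual - forecast              # forecast errors
  pe <- 100 * e / actual              # percentage errors
  scale <- mean(abs(diff(train, lag = m)))  # in-sample seasonal naive MAE
  c(ME = mean(e),
    RMSE = sqrt(mean(e^2)),
    MAE = mean(abs(e)),
    MPE = mean(pe),
    MAPE = mean(abs(pe)),
    MASE = mean(abs(e)) / scale)
}

# Toy example with a seasonal period of 2.
train_toy <- c(1, 2, 3, 4, 5, 6)
actual_toy <- c(100, 110, 120)
forecast_toy <- c(90, 100, 130)

acc <- accuracy_by_hand(actual_toy, forecast_toy, train_toy, m = 2)
print(round(acc, 3))
```

A MASE above 1 on the test set, as in the retail output, says the out-of-sample errors are larger than the average in-sample seasonal naive error.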

Check the residuals.

checkresiduals(fc)

## 
##  Ljung-Box test
## 
## data:  Residuals from Seasonal naive method
## Q* = 671.41, df = 24, p-value < 2.2e-16
## 
## Model df: 0.   Total lags used: 24

Do the residuals appear to be uncorrelated and normally distributed?

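The residuals are not centered around 0 and are not normally distributed. They also appear to be correlated with each other: the Ljung-Box test gives Q* = 671.41 with a p-value below 2.2e-16, so the hypothesis of no autocorrelation is rejected.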
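The Ljung-Box test can also be run directly with base R's Box.test(), which is useful for seeing how it behaves on residuals that are and are not correlated. The two simulated series below are assumptions made for illustration and are not the retail residuals. The lag of 24 matches the number of lags reported in the output above.

```r
set.seed(123)

# Uncorrelated residuals: plain white noise.
white_noise <- rnorm(300)

# Correlated residuals: an AR(1) process with coefficient 0.8.
correlated <- as.numeric(arima.sim(model = list(ar = 0.8), n = 300))

lb_noise <- Box.test(white_noise, lag = 24, type = "Ljung-Box")
lb_corr <- Box.test(correlated, lag = 24, type = "Ljung-Box")

# A small p-value is evidence against "no autocorrelation".
print(lb_noise$p.value)
print(lb_corr$p.value)
```

The retail residuals behave like the second case: a very large test statistic and a p-value that is effectively zero.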

How sensitive are the accuracy measures to the training/test split?

The accuracy measures differ noticeably between the training and test sets, which suggests they are sensitive to where the split is made. The test set has larger errors than the training set for the mean error, root mean square error, mean absolute error and mean absolute scaled error. The test set has lower values for the mean percentage error and the mean absolute percentage error, and a lower lag-1 autocorrelation of the errors (ACF1 of 0.27 versus 0.63).
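One way to probe this sensitivity is to move the split point and recompute a test-set measure each time. The sketch below does so in base R with a hand-written seasonal naive forecast. The series y is simulated, and the helper test_rmse, the range of split years and the 24-month horizon are all assumptions for illustration. With the real data, myts would take the place of y.

```r
# Simulated monthly stand-in series with trend and seasonality.
set.seed(42)
n <- 381
idx <- seq_len(n)
y <- ts(400 + 7 * idx + 50 * sin(2 * pi * idx / 12) + rnorm(n, sd = 30),
        frequency = 12, start = c(1982, 4))

# Hypothetical helper: test-set RMSE of seasonal naive forecasts when the
# training data end in December of last_train_year.
test_rmse <- function(y, last_train_year, h = 24) {
  train <- window(y, end = c(last_train_year, 12))
  test <- window(y, start = c(last_train_year + 1, 1))
  h <- min(h, length(test))
  m <- frequency(y)
  fc <- rep(tail(as.numeric(train), m), length.out = h)
  sqrt(mean((as.numeric(test)[seq_len(h)] - fc)^2))
}

years <- 2005:2011
rmse_by_split <- sapply(years, function(yr) test_rmse(y, yr))
names(rmse_by_split) <- years
print(round(rmse_by_split, 1))
```

If the values vary a lot across split years, conclusions drawn from a single split should be treated with caution.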