Data 624 HW2: Forecaster’s Toolbox
1 HW2: Forecaster’s Toolbox
Please submit exercises 3.1, 3.2, 3.3 and 3.8 from the online Hyndman book. Please include your Rpubs link along with your .rmd file.
1.1 Ex. 3.1
For the following series, find an appropriate Box-Cox transformation in order to stabilise the variance: usnetelec, usgdp, mcopper, and enplanements.
1.1.1 usnetelec
Answer:
usnetelec: Annual US net electricity generation (billion kwh) for 1949-2003
- The Box-Cox transformation of usnetelec with \(\lambda = 0.5167714\) makes little visible difference here compared with the original plot.
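The \(\lambda\) value printed below was presumably obtained with BoxCox.lambda() from the forecast package (that chunk is not echoed here). A minimal sketch, assuming the series are loaded via the fpp2 package and using the lambda1 name referenced in the plotting code; lambda2 through lambda5 for the remaining series would be computed the same way:
library(fpp2)  # loads the forecast package and the usnetelec, usgdp, mcopper, enplanements, cangas data
lambda1 <- BoxCox.lambda(usnetelec)  # automatically selected Box-Cox parameter
lambda1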
## [1] 0.5167714
dframe <- cbind(elec_Original = usnetelec,
elec_BoxCox = BoxCox(usnetelec, lambda1))
autoplot(dframe, facet=TRUE) +
xlab("Year") + ylab("Electricity (billion kwh)") +
ggtitle("Annual US net electricity generation (billion kwh) for 1949-2003")1.1.2 usgdp
Answer:
usgdp: Quarterly US GDP. 1947:1 - 2006.1.
- The Box-Cox transformation of usgdp with \(\lambda = 0.366352\) produces a slightly smoother trend curve over time compared with the original plot.
## [1] 0.366352
dframe <- cbind(gdp_Original = usgdp,
gdp_BoxCox = BoxCox(usgdp, lambda2))
autoplot(dframe, facet=TRUE) +
xlab("Year") + ylab("GDP") +
ggtitle("Quarterly US GDP. 1947:1 - 2006.1")1.1.3 mcopper
Answer:
mcopper: Monthly copper prices.
- The Box-Cox transformation of mcopper with \(\lambda = 0.1919047\) looks similar to the original plot. The original series has dramatic ups and downs, and judging by the shape of the graph the transformation does not change them significantly.
## [1] 0.1919047
dframe <- cbind(copper_Original = mcopper,
copper_BoxCox = BoxCox(mcopper, lambda3))
autoplot(dframe, facet=TRUE) +
xlab("Year") + ylab("Copper Prices") +
ggtitle("Monthly Copper Prices")1.1.4 enplanements
Answer:
enplanements: Domestic Revenue Enplanements (millions): 1996-2000.
- The Box-Cox transformation of enplanements with \(\lambda = -0.2269461\) reduces the variability over time and makes the seasonal pattern more uniform compared with the original plot.
## [1] -0.2269461
dframe <- cbind(rev_Original = enplanements,
rev_BoxCox = BoxCox(enplanements, lambda4))
autoplot(dframe, facet=TRUE) +
xlab("Year") + ylab("Domestic Revenue Enplanements (millions)") +
ggtitle("Domestic Revenue Enplanements (millions): 1996-2000")1.2 Ex. 3.2
Why is a Box-Cox transformation unhelpful for the cangas data?
1.2.1 cangas
Answer:
cangas: Monthly Canadian gas production, billions of cubic metres, January 1960 - February 2005.
- The Box-Cox transformation of cangas with \(\lambda = 0.5767759\) is not helpful for the cangas data. The transformed series still shows high variability in the middle of the graph while the beginning and end vary little; the transformation does not succeed in producing uniform seasonal variation.
## [1] 0.5767759
dframe <- cbind(gas_Original = cangas,
gas_BoxCox = BoxCox(cangas, lambda5))
autoplot(dframe, facet=TRUE) +
xlab("Year") + ylab("Gas Production (billions of cubic metres)") +
ggtitle("Monthly Canadian gas production (billions of cubic metres) Jan 1960 - Feb 2005")1.3 Ex. 3.3
What Box-Cox transformation would you select for your retail data (from Exercise 3 in Section 2.10)?
1.3.1 retail
Answer:
- The Box-Cox transformation of my retail data uses \(\lambda = 0.1909638\). It reduces the overall variability and produces a more uniform seasonal pattern over time, so the transformed series is more stable than the original.
retail <- import("https://raw.githubusercontent.com/shirley-wong/Data-624/main/HW1/retail.xlsx",
skip=1) #this excel sheet has two header rows
#head(retail)
#summary(retail)
myts <- ts(retail[,"A3349746K"], frequency=12, start=c(1982,4))
autoplot(myts) +
ggtitle("Turnover-Western Australia-Total(Industry) Time Series") +
xlab("Time") +
ylab("Sales")## [1] 0.1909638
dframe <- cbind(retail_Original = myts,
retail_BoxCox = BoxCox(myts, lambda6))
autoplot(dframe, facet=TRUE) +
ggtitle("Turnover-Western Australia-Total(Industry) Time Series") +
xlab("Time") + ylab("Sales")1.4 Ex. 3.8
For your retail time series (from Exercise 3 in Section 2.10):
(a.) Split the data into two parts using window.
(b.) Check that your data have been split appropriately by producing the following plot.
(c.) Calculate forecasts using snaive applied to myts.train.
(d.) Compare the accuracy of your forecasts against the actual values stored in myts.test.
(e.) Check the residuals using checkresiduals(fc). Do the residuals appear to be uncorrelated and normally distributed?
(f.) How sensitive are the accuracy measures to the training/test split?
1.4.1 Part a
Split the data into two parts using window.
Answer:
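The window() calls themselves are not echoed here; judging from the printed training and test sets below and from the code in Part f, a minimal sketch of the split, assuming the myts.train and myts.test names used in the later parts:
myts.train <- window(myts, end=c(2010, 12))  # training data: through December 2010
myts.test <- window(myts, start=2011)        # test data: January 2011 onwards
myts.train
myts.test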
## Jan Feb Mar Apr May Jun Jul Aug Sep Oct
## 1982 298.3 318.5 301.5 316.4 300.5 312.3 318.7
## 1983 295.4 292.6 319.6 313.0 336.8 316.9 326.5 340.9 342.3 340.5
## 1984 333.7 341.7 358.4 337.7 390.5 346.6 364.5 377.0 363.2 396.7
## 1985 381.1 358.2 386.7 383.4 438.3 381.6 417.9 429.1 409.3 437.4
## 1986 434.1 405.5 425.8 431.1 473.9 423.0 447.9 453.6 457.5 492.9
## 1987 477.0 448.5 460.2 496.3 526.6 489.5 519.8 497.1 518.6 555.3
## 1988 498.3 479.6 533.1 538.3 548.4 529.5 517.4 522.5 535.0 552.7
## 1989 567.4 530.1 595.2 562.6 595.7 593.2 581.6 598.7 618.2 614.5
## 1990 599.2 550.2 630.2 609.9 644.6 643.8 618.0 642.9 622.3 658.9
## 1991 635.3 587.1 638.0 633.1 672.4 635.3 674.8 693.4 658.9 738.2
## 1992 720.2 679.2 709.1 749.5 763.9 710.3 745.8 723.7 763.9 854.7
## 1993 773.5 718.1 787.8 804.7 829.1 822.0 867.4 817.3 854.6 872.0
## 1994 852.1 817.9 891.6 855.7 887.0 890.0 890.5 876.3 923.6 928.0
## 1995 891.5 820.7 906.0 895.2 949.9 917.9 923.4 956.3 971.6 1005.2
## 1996 981.3 926.6 951.5 955.7 1014.5 950.9 981.3 992.7 921.2 1049.9
## 1997 1049.3 943.1 1001.8 1005.8 1074.6 966.2 1029.4 1016.6 1032.6 1107.1
## 1998 1100.5 964.5 1034.9 1060.9 1081.5 1039.9 1096.6 1057.8 1070.3 1181.4
## 1999 1134.9 1036.9 1146.5 1103.7 1141.1 1088.2 1148.9 1113.5 1137.8 1231.0
## 2000 1162.7 1129.5 1202.7 1171.3 1205.2 1253.4 1138.1 1207.4 1226.2 1220.6
## 2001 1190.1 1098.9 1226.5 1172.1 1204.4 1179.9 1205.1 1247.4 1216.6 1334.5
## 2002 1339.9 1197.9 1324.1 1281.5 1359.7 1291.2 1288.6 1346.3 1265.1 1437.2
## 2003 1403.0 1239.7 1359.9 1361.8 1428.7 1338.6 1433.7 1427.1 1407.8 1525.2
## 2004 1544.1 1379.2 1479.5 1513.3 1511.5 1513.5 1592.8 1525.4 1595.5 1659.4
## 2005 1560.0 1438.3 1590.1 1583.7 1597.4 1626.4 1645.7 1652.0 1657.1 1722.7
## 2006 1662.7 1562.3 1724.0 1711.9 1754.6 1756.5 1784.0 1802.1 1818.4 1914.5
## 2007 1891.0 1746.9 1965.1 1889.5 1971.1 1942.5 1954.4 1993.6 1973.0 2066.5
## 2008 1981.7 1848.0 1973.5 1979.5 2058.9 1962.5 2105.3 2051.2 2021.4 2135.5
## 2009 2110.0 1842.4 2053.4 2028.6 2129.3 2066.7 2134.2 2078.3 2059.2 2228.4
## 2010 2154.7 1926.3 2135.8 2104.1 2133.7 2095.4 2179.7 2132.5 2170.8 2256.8
## Nov Dec
## 1982 337.9 457.4
## 1983 365.3 485.3
## 1984 416.8 540.2
## 1985 454.1 585.6
## 1986 487.5 671.3
## 1987 538.5 779.1
## 1988 597.6 810.5
## 1989 674.9 858.1
## 1990 695.3 897.6
## 1991 761.0 958.0
## 1992 825.3 1067.6
## 1993 903.6 1236.9
## 1994 970.2 1258.7
## 1995 1048.1 1349.4
## 1996 1054.3 1303.3
## 1997 1100.4 1407.3
## 1998 1161.2 1503.6
## 1999 1252.4 1590.5
## 2000 1273.7 1589.6
## 2001 1404.4 1710.5
## 2002 1489.1 1826.4
## 2003 1553.8 1953.8
## 2004 1692.3 2155.1
## 2005 1789.2 2286.8
## 2006 2007.5 2496.6
## 2007 2176.9 2637.7
## 2008 2143.1 2749.0
## 2009 2232.7 2811.2
## 2010 2281.2 2849.9
## Jan Feb Mar Apr May Jun Jul Aug Sep Oct
## 2011 2221.6 2040.2 2254.5 2227.1 2269.0 2268.3 2347.1 2342.0 2346.2 2475.6
## 2012 2412.3 2297.7 2517.6 2397.4 2547.8 2510.0 2517.1 2596.0 2577.8 2726.9
## 2013 2601.0 2374.1 2650.5 2490.2 2674.5 2559.3 2590.2 2664.4 2563.7 2758.4
## Nov Dec
## 2011 2547.3 3157.0
## 2012 2775.7 3346.2
## 2013 2859.7 3416.1
1.4.2 Part b
Check that your data have been split appropriately by producing the following plot.
Answer:
autoplot(myts) +
autolayer(myts.train, series="Training") +
autolayer(myts.test, series="Test") +
ggtitle("Turnover-Western Australia-Total(Industry) Time Series") +
xlab("Time") + ylab("Sales")1.4.3 Part c
Calculate forecasts using snaive applied to myts.train.
Answer:
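The forecasting chunk is not echoed here; a minimal sketch of the seasonal naive forecast, assuming the fc object name used in Parts d and e and leaving the forecast horizon at its default:
fc <- snaive(myts.train)  # seasonal naive forecasts from the training set
autoplot(fc) + xlab("Time") + ylab("Sales")  # optional visual check of the forecasts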
1.4.4 Part d
Compare the accuracy of your forecasts against the actual values stored in myts.test.
Answer:
- Measured against the actual values stored in myts.test, the forecasts are considerably less accurate than on the training set: the test-set RMSE (about 316) is much larger than the training-set RMSE (about 83), and the test-set MASE is about 4.13, as shown in the table below.
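The table below is produced by comparing the forecasts with the held-out observations; a minimal sketch, assuming fc from Part c:
accuracy(fc, myts.test)  # training-set and test-set error measures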
## ME RMSE MAE MPE MAPE MASE
## Training set 67.72312 83.42504 69.44084 6.621717 6.771199 1.000000
## Test set 286.52500 315.97475 286.52500 11.228334 11.228334 4.126174
## ACF1 Theil's U
## Training set 0.5966165 NA
## Test set 0.7975885 1.365472
1.4.5 Part e
Check the residuals using checkresiduals(fc). Do the residuals appear to be uncorrelated and normally distributed?
Answer:
The time plot of the residuals shows large variation across the historical data, especially after 1993, so the residual variance is very likely non-constant. This is also reflected in the ACF plot and the histogram of the residuals: the ACF plot shows significant correlations at many lags, and the histogram suggests the residuals may not be normal, as it is right-skewed and the mean of the residuals is not close to zero.
Therefore, the residuals appear to be correlated and not normally distributed.
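The diagnostics below (residual time plot, ACF, histogram, and Ljung-Box test) come from checkresiduals(); a minimal sketch:
checkresiduals(fc)  # residual diagnostics for the seasonal naive forecasts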
##
## Ljung-Box test
##
## data: Residuals from Seasonal naive method
## Q* = 1098.7, df = 24, p-value < 2.2e-16
##
## Model df: 0. Total lags used: 24
1.4.6 Part f
How sensitive are the accuracy measures to the training/test split?
Answer:
To assess the sensitivity of the accuracy measures to the training/test split, we can produce forecasts from several different training/test splits of the data and calculate the corresponding accuracy measures.
Splitting the data at each year-end from 2001 to 2010 and calculating the test-set accuracy of the resulting 10 seasonal naive forecasts gives the accuracy metrics below.
From the graphs below, we can see that all metrics fluctuate with the split point, peaking around 2005, dipping around 2007, and peaking again in 2010. They are not uniform and show no clear pattern across the 10 split years, which indicates that the accuracy measures are very sensitive to the training/test split.
df <- data.frame()
for (year in 2001:2010) {
  myts.train <- window(myts, end=c(year, 12))
  myts.test <- window(myts, start=year+1)
  fc <- snaive(myts.train)
  acc <- accuracy(fc, myts.test) %>% data.frame() %>% rownames_to_column()
  df <- df %>% rbind(cbind(year, acc)) %>% filter(rowname=="Test set")
}
df <- df %>% select(-rowname)
df
- Look at the accuracy metrics (RMSE, MAE, MAPE, and MASE) by plotting them in the same plot:
metric <-df %>% select(year, RMSE, MAE, MAPE, MASE)
ggplot(metric, aes(x=year)) +
geom_line(aes(y=RMSE, color="RMSE"), linetype="solid") +
geom_line(aes(y=MAE, color="MAE"), linetype="dotdash") +
geom_line(aes(y=MAPE, color="MAPE"), linetype="dotted") +
geom_line(aes(y=MASE, color="MASE"), linetype="longdash") +
ggtitle("Accuracy Metrics for Training Set Forecasts of Different Year of Split") +
labs(x="Year of Split", y="Accuracy", color="Legend")- To focus on MASE and MAPE from the above plot:
metric <-df %>% select(year, RMSE, MAE, MAPE, MASE)
ggplot(metric, aes(x=year)) +
geom_line(aes(y=MAPE, color="MAPE"), linetype="dotted") +
geom_line(aes(y=MASE, color="MASE"), linetype="longdash") +
ggtitle("Accuracy Metrics for Training Set Forecasts of Different Year of Split") +
labs(x="Year of Split", y="Accuracy", color="Legend")