Data 624 HW2: Forecaster’s Toolbox
1 HW2: Forecaster’s Toolbox
Please submit exercises 3.1, 3.2, 3.3 and 3.8 from the online Hyndman book. Please include your Rpubs link along with your .rmd file.
1.1 Ex. 3.1
For the following series, find an appropriate Box-Cox transformation in order to stabilise the variance: usnetelec, usgdp, mcopper, and enplanements.
1.1.1 usnetelec
Answer:
usnetelec: Annual US net electricity generation (billion kwh) for 1949-2003
- The Box-Cox transformation of usnetelec with \(\lambda = 0.5167714\) makes little visible difference here compared with the original plot.
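The \(\lambda\) value printed below was presumably obtained with BoxCox.lambda() from the forecast package (that chunk is not echoed here). A minimal sketch, assuming the series are loaded via the fpp2 package and using the lambda1 name referenced in the plotting code; lambda2 through lambda5 for the remaining series would be computed the same way:
library(fpp2)  # loads the forecast package and the usnetelec, usgdp, mcopper, enplanements, cangas data
lambda1 <- BoxCox.lambda(usnetelec)  # automatically selected Box-Cox parameter
lambda1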
## [1] 0.5167714
dframe <- cbind(elec_Original = usnetelec,
elec_BoxCox = BoxCox(usnetelec, lambda1))
autoplot(dframe, facet=TRUE) +
xlab("Year") + ylab("Electricity (billion kwh)") +
ggtitle("Annual US net electricity generation (billion kwh) for 1949-2003")1.1.2 usgdp
Answer:
usgdp: Quarterly US GDP. 1947:1 - 2006.1.
- The Box-Cox transformation of usgdp with \(\lambda = 0.366352\) produces a slightly smoother trend curve over time compared with the original plot.
## [1] 0.366352
dframe <- cbind(gdp_Original = usgdp,
gdp_BoxCox = BoxCox(usgdp, lambda2))
autoplot(dframe, facet=TRUE) +
xlab("Year") + ylab("GDP") +
ggtitle("Quarterly US GDP. 1947:1 - 2006.1")1.1.3 mcopper
Answer:
mcopper: Monthly copper prices.
- The Box-Cox transformation of mcopper with \(\lambda = 0.1919047\) looks similar to the original plot. The original series has dramatic ups and downs, and judging by the shape of the graph the transformation does not change them significantly.
## [1] 0.1919047
dframe <- cbind(copper_Original = mcopper,
copper_BoxCox = BoxCox(mcopper, lambda3))
autoplot(dframe, facet=TRUE) +
xlab("Year") + ylab("Copper Prices") +
ggtitle("Monthly Copper Prices")1.1.4 enplanements
Answer:
enplanements: Domestic Revenue Enplanements (millions): 1996-2000.
- The Box-Cox transformation of enplanements with \(\lambda = -0.2269461\) reduces the variability over time and makes the seasonal pattern more uniform compared with the original plot.
## [1] -0.2269461
dframe <- cbind(rev_Original = enplanements,
rev_BoxCox = BoxCox(enplanements, lambda4))
autoplot(dframe, facet=TRUE) +
xlab("Year") + ylab("Domestic Revenue Enplanements (millions)") +
ggtitle("Domestic Revenue Enplanements (millions): 1996-2000")1.2 Ex. 3.2
Why is a Box-Cox transformation unhelpful for the cangas data?
1.2.1 cangas
Answer:
cangas: Monthly Canadian gas production, billions of cubic metres, January 1960 - February 2005.
- The Box-Cox transformation of cangas with \(\lambda = 0.5767759\) is not helpful for the cangas data. The transformed series still shows high variability in the middle of the graph while the beginning and end vary little; the transformation does not succeed in producing uniform seasonal variation.
## [1] 0.5767759
dframe <- cbind(gas_Original = cangas,
gas_BoxCox = BoxCox(cangas, lambda5))
autoplot(dframe, facet=TRUE) +
xlab("Year") + ylab("Gas Production (billions of cubic metres)") +
ggtitle("Monthly Canadian gas production (billions of cubic metres) Jan 1960 - Feb 2005")1.3 Ex. 3.3
What Box-Cox transformation would you select for your retail data (from Exercise 3 in Section 2.10)?
1.3.1 retail
Answer:
- The Box-Cox transformation of my retail data uses \(\lambda = 0.1909638\). It reduces the overall variability and produces a more uniform seasonal pattern over time, so the transformed series is more stable than the original.
retail <- import("https://raw.githubusercontent.com/shirley-wong/Data-624/main/HW1/retail.xlsx",
skip=1) #this excel sheet has two header rows
#head(retail)
#summary(retail)
myts <- ts(retail[,"A3349746K"], frequency=12, start=c(1982,4))
autoplot(myts) +
ggtitle("Turnover-Western Australia-Total(Industry) Time Series") +
xlab("Time") +
ylab("Sales")## [1] 0.1909638
dframe <- cbind(retail_Original = myts,
retail_BoxCox = BoxCox(myts, lambda6))
autoplot(dframe, facet=TRUE) +
ggtitle("Turnover-Western Australia-Total(Industry) Time Series") +
xlab("Time") + ylab("Sales")1.4 Ex. 3.8
For your retail time series (from Exercise 3 in Section 2.10):
(a.) Split the data into two parts using window.
(b.) Check that your data have been split appropriately by producing the following plot.
(c.) Calculate forecasts using snaive applied to myts.train.
(d.) Compare the accuracy of your forecasts against the actual values stored in myts.test.
(e.) Check the residuals using checkresiduals(fc). Do the residuals appear to be uncorrelated and normally distributed?
(f.) How sensitive are the accuracy measures to the training/test split?
1.4.1 Part a
Split the data into two parts using window.
Answer:
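The window() calls themselves are not echoed here; judging from the printed training and test sets below and from the code in Part f, a minimal sketch of the split, assuming the myts.train and myts.test names used in the later parts:
myts.train <- window(myts, end=c(2010, 12))  # training data: through December 2010
myts.test <- window(myts, start=2011)        # test data: January 2011 onwards
myts.train
myts.test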
## Jan Feb Mar Apr May Jun Jul Aug Sep Oct
## 1982 298.3 318.5 301.5 316.4 300.5 312.3 318.7
## 1983 295.4 292.6 319.6 313.0 336.8 316.9 326.5 340.9 342.3 340.5
## 1984 333.7 341.7 358.4 337.7 390.5 346.6 364.5 377.0 363.2 396.7
## 1985 381.1 358.2 386.7 383.4 438.3 381.6 417.9 429.1 409.3 437.4
## 1986 434.1 405.5 425.8 431.1 473.9 423.0 447.9 453.6 457.5 492.9
## 1987 477.0 448.5 460.2 496.3 526.6 489.5 519.8 497.1 518.6 555.3
## 1988 498.3 479.6 533.1 538.3 548.4 529.5 517.4 522.5 535.0 552.7
## 1989 567.4 530.1 595.2 562.6 595.7 593.2 581.6 598.7 618.2 614.5
## 1990 599.2 550.2 630.2 609.9 644.6 643.8 618.0 642.9 622.3 658.9
## 1991 635.3 587.1 638.0 633.1 672.4 635.3 674.8 693.4 658.9 738.2
## 1992 720.2 679.2 709.1 749.5 763.9 710.3 745.8 723.7 763.9 854.7
## 1993 773.5 718.1 787.8 804.7 829.1 822.0 867.4 817.3 854.6 872.0
## 1994 852.1 817.9 891.6 855.7 887.0 890.0 890.5 876.3 923.6 928.0
## 1995 891.5 820.7 906.0 895.2 949.9 917.9 923.4 956.3 971.6 1005.2
## 1996 981.3 926.6 951.5 955.7 1014.5 950.9 981.3 992.7 921.2 1049.9
## 1997 1049.3 943.1 1001.8 1005.8 1074.6 966.2 1029.4 1016.6 1032.6 1107.1
## 1998 1100.5 964.5 1034.9 1060.9 1081.5 1039.9 1096.6 1057.8 1070.3 1181.4
## 1999 1134.9 1036.9 1146.5 1103.7 1141.1 1088.2 1148.9 1113.5 1137.8 1231.0
## 2000 1162.7 1129.5 1202.7 1171.3 1205.2 1253.4 1138.1 1207.4 1226.2 1220.6
## 2001 1190.1 1098.9 1226.5 1172.1 1204.4 1179.9 1205.1 1247.4 1216.6 1334.5
## 2002 1339.9 1197.9 1324.1 1281.5 1359.7 1291.2 1288.6 1346.3 1265.1 1437.2
## 2003 1403.0 1239.7 1359.9 1361.8 1428.7 1338.6 1433.7 1427.1 1407.8 1525.2
## 2004 1544.1 1379.2 1479.5 1513.3 1511.5 1513.5 1592.8 1525.4 1595.5 1659.4
## 2005 1560.0 1438.3 1590.1 1583.7 1597.4 1626.4 1645.7 1652.0 1657.1 1722.7
## 2006 1662.7 1562.3 1724.0 1711.9 1754.6 1756.5 1784.0 1802.1 1818.4 1914.5
## 2007 1891.0 1746.9 1965.1 1889.5 1971.1 1942.5 1954.4 1993.6 1973.0 2066.5
## 2008 1981.7 1848.0 1973.5 1979.5 2058.9 1962.5 2105.3 2051.2 2021.4 2135.5
## 2009 2110.0 1842.4 2053.4 2028.6 2129.3 2066.7 2134.2 2078.3 2059.2 2228.4
## 2010 2154.7 1926.3 2135.8 2104.1 2133.7 2095.4 2179.7 2132.5 2170.8 2256.8
## Nov Dec
## 1982 337.9 457.4
## 1983 365.3 485.3
## 1984 416.8 540.2
## 1985 454.1 585.6
## 1986 487.5 671.3
## 1987 538.5 779.1
## 1988 597.6 810.5
## 1989 674.9 858.1
## 1990 695.3 897.6
## 1991 761.0 958.0
## 1992 825.3 1067.6
## 1993 903.6 1236.9
## 1994 970.2 1258.7
## 1995 1048.1 1349.4
## 1996 1054.3 1303.3
## 1997 1100.4 1407.3
## 1998 1161.2 1503.6
## 1999 1252.4 1590.5
## 2000 1273.7 1589.6
## 2001 1404.4 1710.5
## 2002 1489.1 1826.4
## 2003 1553.8 1953.8
## 2004 1692.3 2155.1
## 2005 1789.2 2286.8
## 2006 2007.5 2496.6
## 2007 2176.9 2637.7
## 2008 2143.1 2749.0
## 2009 2232.7 2811.2
## 2010 2281.2 2849.9
## Jan Feb Mar Apr May Jun Jul Aug Sep Oct
## 2011 2221.6 2040.2 2254.5 2227.1 2269.0 2268.3 2347.1 2342.0 2346.2 2475.6
## 2012 2412.3 2297.7 2517.6 2397.4 2547.8 2510.0 2517.1 2596.0 2577.8 2726.9
## 2013 2601.0 2374.1 2650.5 2490.2 2674.5 2559.3 2590.2 2664.4 2563.7 2758.4
## Nov Dec
## 2011 2547.3 3157.0
## 2012 2775.7 3346.2
## 2013 2859.7 3416.1
1.4.2 Part b
Check that your data have been split appropriately by producing the following plot.
Answer:
autoplot(myts) +
autolayer(myts.train, series="Training") +
autolayer(myts.test, series="Test") +
ggtitle("Turnover-Western Australia-Total(Industry) Time Series") +
xlab("Time") + ylab("Sales")1.4.3 Part c
Calculate forecasts using snaive applied to myts.train.
Answer:
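The forecasting chunk is not echoed here; a minimal sketch of the seasonal naive forecast, assuming the fc object name used in Parts d and e and leaving the forecast horizon at its default:
fc <- snaive(myts.train)  # seasonal naive forecasts from the training set
autoplot(fc) + xlab("Time") + ylab("Sales")  # optional visual check of the forecasts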
1.4.4 Part d
Compare the accuracy of your forecasts against the actual values stored in myts.test.
Answer:
- Measured against the actual values stored in myts.test, the forecasts are considerably less accurate than on the training set: the test-set RMSE (about 316) is much larger than the training-set RMSE (about 83), and the test-set MASE is about 4.13, as shown in the table below.
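The table below is produced by comparing the forecasts with the held-out observations; a minimal sketch, assuming fc from Part c:
accuracy(fc, myts.test)  # training-set and test-set error measures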
## ME RMSE MAE MPE MAPE MASE
## Training set 67.72312 83.42504 69.44084 6.621717 6.771199 1.000000
## Test set 286.52500 315.97475 286.52500 11.228334 11.228334 4.126174
## ACF1 Theil's U
## Training set 0.5966165 NA
## Test set 0.7975885 1.365472
1.4.5 Part e
Check the residuals using checkresiduals(fc). Do the residuals appear to be uncorrelated and normally distributed?
Answer:
The time plot of the residuals shows large variation across the historical data, especially after 1993, so the residual variance is very likely non-constant. This is also reflected in the ACF plot and the histogram of the residuals: the ACF plot shows significant correlations at many lags, and the histogram suggests the residuals may not be normal, as it is right-skewed and the mean of the residuals is not close to zero.
Therefore, the residuals appear to be correlated and not normally distributed.
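The diagnostics below (residual time plot, ACF, histogram, and Ljung-Box test) come from checkresiduals(); a minimal sketch:
checkresiduals(fc)  # residual diagnostics for the seasonal naive forecasts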
##
## Ljung-Box test
##
## data: Residuals from Seasonal naive method
## Q* = 1098.7, df = 24, p-value < 2.2e-16
##
## Model df: 0. Total lags used: 24
1.4.6 Part f
How sensitive are the accuracy measures to the training/test split?
Answer:
To assess the sensitivity of the accuracy measures to the training/test split, we can produce forecasts from several different training/test splits of the data and calculate the corresponding accuracy measures.
Splitting the data at each year-end from 2001 to 2010 and calculating the test-set accuracy of the resulting 10 seasonal naive forecasts gives the accuracy metrics below.
From the graphs below, we can see that all metrics fluctuate with the split point, peaking around 2005, dipping around 2007, and peaking again in 2010. They are not uniform and show no clear pattern across the 10 split years, which indicates that the accuracy measures are very sensitive to the training/test split.
df <- data.frame()
for (year in 2001:2010) {
  myts.train <- window(myts, end=c(year, 12))
  myts.test <- window(myts, start=year+1)
  fc <- snaive(myts.train)
  acc <- accuracy(fc, myts.test) %>% data.frame() %>% rownames_to_column()
  df <- df %>% rbind(cbind(year, acc)) %>% filter(rowname=="Test set")
}
df <- df %>% select(-rowname)
df
- Look at the accuracy metrics (RMSE, MAE, MAPE, and MASE) by plotting them in the same plot:
metric <-df %>% select(year, RMSE, MAE, MAPE, MASE)
ggplot(metric, aes(x=year)) +
geom_line(aes(y=RMSE, color="RMSE"), linetype="solid") +
geom_line(aes(y=MAE, color="MAE"), linetype="dotdash") +
geom_line(aes(y=MAPE, color="MAPE"), linetype="dotted") +
geom_line(aes(y=MASE, color="MASE"), linetype="longdash") +
ggtitle("Accuracy Metrics for Training Set Forecasts of Different Year of Split") +
labs(x="Year of Split", y="Accuracy", color="Legend")- To focus on MASE and MAPE from the above plot:
metric <-df %>% select(year, RMSE, MAE, MAPE, MASE)
ggplot(metric, aes(x=year)) +
geom_line(aes(y=MAPE, color="MAPE"), linetype="dotted") +
geom_line(aes(y=MASE, color="MASE"), linetype="longdash") +
ggtitle("Accuracy Metrics for Training Set Forecasts of Different Year of Split") +
labs(x="Year of Split", y="Accuracy", color="Legend")