Week 3 Forecasting

By: Joel Park

library(fpp2)
## Loading required package: ggplot2
## Loading required package: forecast
## Loading required package: fma
## Loading required package: expsmooth

HW: 3.1 - 3.3, 3.8

3.1: For the following series, find an appropriate Box-Cox transformation in order to stabilize the variance.

A. usnetelec

According to ?usnetelec, this dataset is the annual US net electricity generation (billion kwh) for 1949-2003.

Taking a look at the summary of the data and details of the detail

summary(usnetelec)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   296.1   889.0  2040.9  1972.1  3002.7  3858.5
print(usnetelec)
## Time Series:
## Start = 1949 
## End = 2003 
## Frequency = 1 
##  [1]  296.1  334.1  375.3  403.8  447.0  476.3  550.3  603.9  634.6  648.5
## [11]  713.4  759.2  797.1  857.9  920.0  987.2 1058.4 1147.5 1217.8 1332.8
## [21] 1445.5 1535.1 1615.9 1753.0 1864.1 1870.3 1920.8 2040.9 2127.4 2209.4
## [31] 2250.7 2289.6 2298.0 2244.4 2313.4 2419.5 2473.0 2490.5 2575.3 2707.4
## [41] 2967.3 3038.0 3073.8 3083.9 3197.2 3247.5 3353.5 3444.2 3492.2 3620.3
## [51] 3694.8 3802.1 3736.6 3858.5 3848.0

Plotting out the time series. Please note that the frequency is yearly.

autoplot(usnetelec)

lambda <- BoxCox.lambda(usgdp)
print(paste0("Lambda for usgdp: ", lambda))
## [1] "Lambda for usgdp: 0.366352049520934"
autoplot(BoxCox(usgdp,lambda))

Box-Cox Transformation: square-root transformation

From a visual perspective, the transformation does not appear to be dramatic, but if you look at the subtle aspects of the curve, it appears to have become “more linear” and with more even variance. The initial plot appears to have a slight upward curve (somewhat of a quadratic curve appearance), which may explain why a square-root transformation may make sense here.

B. usgdp

According to ?usgdp, this is the quarterly US GDP from 1971 (Quarter 1) to 2006 (Quarter 1).

Taking a look at the summary of the data and details of the detail

summary(usgdp)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1568    2632    4552    5168    7130   11404
print(usgdp)
##         Qtr1    Qtr2    Qtr3    Qtr4
## 1947  1570.5  1568.7  1568.0  1590.9
## 1948  1616.1  1644.6  1654.1  1658.0
## 1949  1633.2  1628.4  1646.7  1629.9
## 1950  1696.8  1747.3  1815.8  1848.9
## 1951  1871.3  1903.1  1941.1  1944.4
## 1952  1964.7  1966.0  1978.8  2043.8
## 1953  2082.3  2098.1  2085.4  2052.5
## 1954  2042.4  2044.3  2066.9  2107.8
## 1955  2168.5  2204.0  2233.4  2245.3
## 1956  2234.8  2252.5  2249.8  2286.5
## 1957  2300.3  2294.6  2317.0  2292.5
## 1958  2230.2  2243.4  2295.2  2348.0
## 1959  2392.9  2455.8  2453.9  2462.6
## 1960  2517.4  2504.8  2508.7  2476.2
## 1961  2491.2  2538.0  2579.1  2631.8
## 1962  2679.1  2708.4  2733.3  2740.0
## 1963  2775.9  2810.6  2863.5  2885.8
## 1964  2950.5  2984.8  3025.5  3033.6
## 1965  3108.2  3150.2  3214.1  3291.8
## 1966  3372.3  3384.0  3406.3  3433.7
## 1967  3464.1  3464.3  3491.8  3518.2
## 1968  3590.7  3651.6  3676.5  3692.0
## 1969  3750.2  3760.9  3784.2  3766.3
## 1970  3760.0  3767.1  3800.5  3759.8
## 1971  3864.1  3885.9  3916.7  3927.9
## 1972  3997.7  4092.1  4131.1  4198.7
## 1973  4305.3  4355.1  4331.9  4373.3
## 1974  4335.4  4347.9  4305.8  4288.9
## 1975  4237.6  4268.6  4340.9  4397.8
## 1976  4496.8  4530.3  4552.0  4584.6
## 1977  4640.0  4731.1  4815.8  4815.3
## 1978  4830.8  5021.2  5070.7  5137.4
## 1979  5147.4  5152.3  5189.4  5204.7
## 1980  5221.3  5115.9  5107.4  5202.1
## 1981  5307.5  5266.1  5329.8  5263.4
## 1982  5177.1  5204.9  5185.2  5189.8
## 1983  5253.8  5372.3  5478.4  5590.5
## 1984  5699.8  5797.9  5854.3  5902.4
## 1985  5956.9  6007.8  6101.7  6148.6
## 1986  6207.4  6232.0  6291.7  6323.4
## 1987  6365.0  6435.0  6493.4  6606.8
## 1988  6639.1  6723.5  6759.4  6848.6
## 1989  6918.1  6963.5  7013.1  7030.9
## 1990  7112.1  7130.3  7130.8  7076.9
## 1991  7040.8  7086.5  7120.7  7154.1
## 1992  7228.2  7297.9  7369.5  7450.7
## 1993  7459.7  7497.5  7536.0  7637.4
## 1994  7715.1  7815.7  7859.5  7951.6
## 1995  7973.7  7988.0  8053.1  8112.0
## 1996  8169.2  8303.1  8372.7  8470.6
## 1997  8536.1  8665.8  8773.7  8838.4
## 1998  8936.2  8995.3  9098.9  9237.1
## 1999  9315.5  9392.6  9502.2  9671.1
## 2000  9695.6  9847.9  9836.6  9887.7
## 2001  9875.6  9905.9  9871.1  9910.0
## 2002  9977.3 10031.6 10090.7 10095.8
## 2003 10138.6 10230.4 10410.9 10502.6
## 2004 10612.5 10704.1 10808.9 10897.1
## 2005 10999.3 11089.2 11202.3 11248.3
## 2006 11403.6

Plotting out the time series. Please note that the frequency is quarterly.

autoplot(usgdp)

Box-Cox Transformation: cube-root transformation

C. mcopper

According to ?mcopper, this dataset is the monthly copper prices.

Taking a look at the summary of the data and details of the detail

summary(mcopper)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   216.6   566.0   949.2   997.8  1262.5  4306.0
print(mcopper)
##         Jan    Feb    Mar    Apr    May    Jun    Jul    Aug    Sep    Oct
## 1960  255.2  259.7  249.3  258.0  244.3  246.8  250.6  241.3  231.0  218.7
## 1961  216.6  220.1  221.7  225.6  238.6  232.8  226.1  227.2  225.8  225.0
## 1962  226.8  231.3  231.1  230.6  230.5  230.4  230.4  230.4  230.4  230.6
## 1963  230.6  230.6  230.6  230.6  230.6  230.6  230.6  230.6  230.6  230.6
## 1964  234.2  247.8  266.1  307.7  295.8  288.8  305.4  356.9  415.0  485.1
## 1965  356.7  419.2  440.6  480.5  490.8  466.1  404.0  431.6  473.5  500.1
## 1966  599.0  668.7  668.7  679.9  592.8  604.8  559.5  426.4  402.4  454.9
## 1967  443.7  435.4  391.8  354.9  369.5  362.3  355.9  372.7  378.3  406.3
## 1968  586.4  715.8  708.0  522.4  456.5  473.0  439.9  439.9  462.3  449.7
## 1969  523.8  535.0  532.4  578.1  579.6  617.0  605.4  669.0  657.6  647.4
## 1970  677.2  690.1  730.4  725.2  665.9  607.0  567.8  527.5  519.3  475.8
## 1971  420.7  424.9  476.5  521.2  464.1  447.2  464.4  451.0  427.5  417.7
## 1972  418.9  426.9  442.2  433.5  423.3  412.4  423.2  427.0  434.2  428.7
## 1973  474.7  511.8  610.2  638.8  613.0  678.4  795.4  843.5  799.6  849.6
## 1974  913.2 1006.5 1172.0 1267.7 1190.6 1020.1  802.8  767.9  630.1  599.2
## 1975  512.4  528.7  554.5  560.5  539.7  522.5  559.3  603.7  580.2  573.1
## 1976  587.8  601.6  683.7  817.9  836.0  878.0  922.1  862.2  844.9  784.8
## 1977  814.8  833.4  881.0  830.4  797.9  763.1  724.4  665.8  685.6  682.9
## 1978  651.2  627.0  657.6  694.4  716.0  725.0  705.4  734.6  735.8  750.1
## 1979  827.1  969.7 1005.5 1012.3  935.3  888.8  802.4  883.2  953.6  967.6
## 1980 1147.9 1220.2 1045.5  937.1  888.9  858.5  916.7  877.9  857.6  846.2
## 1981  777.4  784.9  814.3  837.0  834.0  861.0  897.3  981.4  942.2  904.9
## 1982  853.6  863.8  836.3  858.5  843.6  740.4  829.9  841.0  832.7  861.4
## 1983  997.3 1074.3 1071.9 1090.0 1122.8 1098.6 1115.6 1091.0 1041.0  958.4
## 1984  978.0  991.5 1030.9 1078.1 1022.3  991.2 1007.9 1018.2 1029.3 1043.5
## 1985 1205.5 1269.8 1234.7 1212.7 1225.3 1118.0 1067.5 1025.7 1001.4  973.8
## 1986  995.4  982.8  984.4  957.3  932.3  937.0  891.7  876.5  916.1  922.9
## 1987  893.8  902.7  920.1  909.3  911.9  964.6 1052.8 1097.5 1100.7 1182.7
## 1988 1476.8 1324.3 1286.3 1216.5 1307.1 1428.8 1297.5 1296.2 1445.7 1689.4
## 1989 1912.6 1765.0 1904.1 1832.3 1679.1 1638.8 1539.1 1731.5 1834.9 1801.8
## 1990 1432.7 1390.9 1615.7 1639.6 1633.7 1510.5 1529.6 1554.4 1611.6 1409.6
## 1991 1265.1 1246.5 1322.5 1412.3 1336.8 1344.8 1353.8 1325.6 1346.1 1371.2
## 1992 1182.2 1240.5 1291.7 1260.9 1224.6 1239.1 1313.8 1297.2 1307.0 1360.3
## 1993 1472.3 1536.7 1472.3 1261.9 1159.0 1228.5 1287.5 1305.0 1219.7 1108.7
## 1994 1208.4 1261.0 1284.0 1268.1 1430.5 1549.9 1589.5 1560.0 1601.0 1586.2
## 1995 1910.5 1830.3 1826.8 1805.7 1745.9 1876.8 1927.4 1936.6 1870.3 1782.7
## 1996 1708.7 1650.2 1676.5 1713.5 1757.3 1407.7 1276.6 1294.9 1244.3 1236.2
## 1997 1469.2 1480.3 1506.4 1466.8 1540.0 1588.0 1466.7 1403.6 1315.4 1256.4
## 1998 1032.1 1014.4 1051.4 1078.1 1057.8 1005.6 1004.2  991.8  979.1  935.6
## 1999  866.7  866.5  849.6  910.2  935.6  891.3 1041.2 1025.5 1077.3 1040.1
## 2000 1124.2 1124.7 1100.6 1059.8 1183.3 1161.6 1192.3 1245.1 1365.3 1308.1
## 2001 1209.8 1214.9 1202.8 1159.3 1178.9 1147.5 1078.5 1019.0  974.4  948.5
## 2002 1049.9 1097.4 1127.9 1101.4 1092.9 1110.0 1022.2  962.4  950.0  952.5
## 2003 1018.8 1046.7 1047.4 1007.9 1015.5 1015.4 1054.0 1103.4 1109.1 1143.9
## 2004 1329.0 1477.5 1647.3 1635.6 1529.5 1469.7 1523.4 1563.1 1614.9 1666.7
## 2005 1687.8 1723.8 1773.0 1790.0 1751.7 1938.0 2063.8 2115.8 2133.3 2301.1
## 2006 2677.7 2851.6 2927.3 3610.9 4306.0 3897.7 4170.5 4059.8 4032.9 3998.6
##         Nov    Dec
## 1960  222.8  227.5
## 1961  225.7  226.3
## 1962  230.4  230.5
## 1963  230.6  232.2
## 1964  500.1  453.3
## 1965  523.8  541.4
## 1966  464.4  433.4
## 1967  515.3  551.3
## 1968  458.0  493.4
## 1969  679.4  707.8
## 1970  451.9  435.4
## 1971  406.4  410.7
## 1972  427.8  435.7
## 1973  950.4  959.9
## 1974  608.2  553.2
## 1975  575.1  568.9
## 1976  780.3  767.3
## 1977  650.2  679.1
## 1978  749.3  771.7
## 1979  978.3 1005.6
## 1980  839.5  800.4
## 1981  867.7  869.3
## 1982  884.3  911.4
## 1983  940.4  986.7
## 1984 1085.1 1113.5
## 1985  950.9  962.4
## 1986  915.0  927.6
## 1987 1419.9 1566.6
## 1988 1826.2 1915.3
## 1989 1647.3 1514.1
## 1990 1315.9 1292.7
## 1991 1336.8 1216.4
## 1992 1413.2 1422.5
## 1993 1100.3 1156.4
## 1994 1763.6 1914.4
## 1995 1904.8 1898.7
## 1996 1343.7 1366.9
## 1997 1135.0 1061.5
## 1998  946.9  881.9
## 1999 1065.3 1093.7
## 2000 1258.9 1264.4
## 2001  994.2 1020.8
## 2002 1006.4 1005.7
## 2003 1216.0 1258.5
## 2004 1678.5 1631.1
## 2005 2461.6 2621.7
## 2006 3675.9 3435.8

Plotting out the time series. Please note that the frequency is monthly. As noted in section “Calendar Adjustments” of section 3.2 of the textbook, we will be adjusting for the calendar adjustments first and then perform a mathematical transformation utilizing Box-Cox.

dailyaverage <- mcopper/monthdays(mcopper)
dframe <- cbind(Monthly = mcopper,
                DailyAverage = dailyaverage)
  autoplot(dframe, facet=TRUE) +
    xlab("Years") + ylab("Copper Prices") +
    ggtitle("Monthly copper prices")

As of note, we should mention that the calendar adjustments may have made some a difference for our use in forecasting. However, from a visualization perspective above, there does not seem to be an obvious change. Below is if we used the daily average to find out BoxCox transformation lambda.

lambda <- BoxCox.lambda(dailyaverage)
print(paste0("Lambda for mcopper (using dailyaverage): ", lambda))
## [1] "Lambda for mcopper (using dailyaverage): 0.181704314971557"
autoplot(BoxCox(dailyaverage,lambda))

Box-Cox Transformation: Power to the (1/5) transformation for the daily average.

However, let us look at mcopper dataset without the calendar transformation and focus solely on the BoxCox transformation.

lambda <- BoxCox.lambda(mcopper)
print(paste0("Lambda for mcopper: ", lambda))
## [1] "Lambda for mcopper: 0.191904709003829"
autoplot(BoxCox(mcopper,lambda))

The transformation does not appear to change drastically from the daily average. As a result, the BoxCox transformation lambda here is 0.2.

D. enplanements

According to ?enplanements, this dataset is the Domestic Revenue Enplacements (millions): 1996 - 2000. (Monthly)

Taking a look at the summary of the data and details of the detail

summary(enplanements)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   20.14   27.18   34.88   35.67   42.78   56.14
print(enplanements)
##        Jan   Feb   Mar   Apr   May   Jun   Jul   Aug   Sep   Oct   Nov
## 1979 21.12 22.92 25.90 24.38 23.41 26.82 26.90 28.72 22.87 23.10 23.11
## 1980 21.35 22.23 24.40 23.41 22.06 24.53 24.00 24.97 20.57 21.55 20.14
## 1981 20.24 20.63 22.03 22.68 22.80 23.78 24.16 22.64 20.50 21.44 20.91
## 1982 20.52 21.66 23.82 24.01 22.52 24.68 24.19 24.73 20.85 21.84 21.70
## 1983 21.83 23.44 26.98 24.61 24.10 26.67 25.63 26.40 23.17 24.24 23.88
## 1984 22.60 23.67 27.00 26.91 26.63 28.93 27.72 29.79 25.61 26.81 26.97
## 1985 25.35 26.91 31.60 31.09 30.70 32.09 32.02 33.16 26.82 28.68 27.80
## 1986 28.07 30.01 34.12 32.90 32.92 34.86 35.67 37.81 30.65 31.96 31.01
## 1987 28.72 33.39 37.76 37.51 35.87 36.86 37.71 38.41 31.65 32.87 32.23
## 1988 29.64 32.87 37.12 35.32 34.82 36.97 36.89 38.76 32.99 34.59 34.28
## 1989 30.52 32.37 35.62 33.81 33.86 37.91 37.03 39.38 32.68 34.72 34.52
## 1990 31.06 33.95 37.11 36.00 34.66 37.88 37.50 39.98 33.08 34.73 34.11
## 1991 30.57 32.06 33.58 34.89 34.21 36.77 37.25 38.73 32.54 34.29 32.28
## 1992 29.81 31.59 34.74 33.74 34.42 40.61 42.74 43.74 36.24 35.02 34.12
## 1993 31.31 33.82 36.59 37.32 36.62 39.34 39.92 40.97 36.62 38.19 37.15
## 1994 32.98 36.23 40.84 39.87 39.97 43.15 44.07 44.43 39.66 41.05 40.60
## 1995 36.43 38.77 42.64 42.19 41.16 44.83 44.28 45.61 39.87 41.86 41.57
## 1996 37.17 41.92 46.55 45.10 44.46 47.47 47.00 48.37 41.76 44.38 41.25
## 1997 39.95 43.00 48.02 45.89 45.47 48.83 49.24 49.53 42.79 44.97 43.60
## 1998 39.21 43.46 47.77 48.41 47.05 50.52 49.88 50.26 43.80 46.72 45.95
## 1999 40.88 44.99 49.85 49.69 47.90 52.24 53.45 52.02 45.89 49.14 49.00
## 2000 41.37 46.47 52.86 51.68 51.98 56.12 55.41 54.38 47.74 50.53 50.93
## 2001 43.83 47.56 52.82 52.10 50.72 54.89 55.50 56.14 31.41 39.82 41.50
## 2002 38.13 42.43 48.05 46.48 46.58 50.29                              
##        Dec
## 1979 22.14
## 1980 21.39
## 1981 21.70
## 1982 22.38
## 1983 23.96
## 1984 26.30
## 1985 30.46
## 1986 32.30
## 1987 31.88
## 1988 32.56
## 1989 32.91
## 1990 32.73
## 1991 34.29
## 1992 33.79
## 1993 36.30
## 1994 39.48
## 1995 40.40
## 1996 43.97
## 1997 44.05
## 1998 45.37
## 1999 45.81
## 2000 46.69
## 2001 40.45
## 2002

Plotting out the time series. Please note that the frequency is monthly. Again, we will apply the calendar adjustments as in the prior example.

dailyaverage <- mcopper/monthdays(enplanements)
dframe <- cbind(Monthly = enplanements,
                DailyAverage = dailyaverage)
  autoplot(dframe, facet=TRUE) +
    xlab("Years") + ylab("Domestic Revenue Enplanements (millions)") +
    ggtitle("Monthly US domestic enplanements")

Looking from this autoplot, there appears to be a general upward trend, but we can confirm this trend and seasonal cycling by using the ACF. At this time, we will disregard the Daily Average and go with the original time series.

ggseasonplot(enplanements)

ggsubseriesplot(enplanements)

gglagplot(enplanements)

ggAcf(enplanements)

There appears to be an upward trend on the ACF plot, however, there appears to be some seasonal component with the “scallop” shape.

lambda <- BoxCox.lambda(enplanements)
print(paste0("Lambda for enplanements: ", lambda))
## [1] "Lambda for enplanements: -0.226946111237065"
autoplot(BoxCox(enplanements,lambda))

Box-Cox Transformation: “-1/4” transformation“.

Just on a side note, it is unclear if this transformation truly made an improvement to the plot, judging from the visualization.

3.2: Why is a Box-Cox transformation unhelpful for the cangas data?

Let’s take a closer look at the cangas time series.

cangas is the monthly Canadian gas production in billions of cubic meters from January 1960 to February 2005.

summary(cangas)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.966   6.453   8.831   9.777  14.429  19.528
head(cangas)
##         Jan    Feb    Mar    Apr    May    Jun
## 1960 1.4306 1.3059 1.4022 1.1699 1.1161 1.0113

This is monthly data. Let’s plot out this time frame.

autoplot(cangas)

lambda <- BoxCox.lambda(cangas)
print(paste0("Lambda for cangas: ", lambda))
## [1] "Lambda for cangas: 0.576775938228139"
autoplot(BoxCox(cangas,lambda))

Though the BoxCox.lambda function suggested a square-root transformation, the visualization shows no obvious improvement to the plot. This suggests that just because a function may recommend a transformation that it does not necessarily means that we should be performing the Box Cox transformation.

3.3: What Box-Cox transformation would you select for your retail data (from Exercise 3 in Section 2.10)

retaildata <- readxl::read_excel("retail.xlsx", skip=1)
head(retaildata)
## # A tibble: 6 x 190
##   `Series ID`         A3349335T A3349627V A3349338X A3349398A A3349468W
##   <dttm>                  <dbl>     <dbl>     <dbl>     <dbl>     <dbl>
## 1 1982-04-01 00:00:00      303.      41.7      63.9      409.      65.8
## 2 1982-05-01 00:00:00      298.      43.1      64        405.      65.8
## 3 1982-06-01 00:00:00      298       40.3      62.7      401       62.3
## 4 1982-07-01 00:00:00      308.      40.9      65.6      414.      68.2
## 5 1982-08-01 00:00:00      299.      42.1      62.6      404.      66  
## 6 1982-09-01 00:00:00      305.      42        64.4      412.      62.3
## # … with 184 more variables: A3349336V <dbl>, A3349337W <dbl>,
## #   A3349397X <dbl>, A3349399C <dbl>, A3349874C <dbl>, A3349871W <dbl>,
## #   A3349790V <dbl>, A3349556W <dbl>, A3349791W <dbl>, A3349401C <dbl>,
## #   A3349873A <dbl>, A3349872X <dbl>, A3349709X <dbl>, A3349792X <dbl>,
## #   A3349789K <dbl>, A3349555V <dbl>, A3349565X <dbl>, A3349414R <dbl>,
## #   A3349799R <dbl>, A3349642T <dbl>, A3349413L <dbl>, A3349564W <dbl>,
## #   A3349416V <dbl>, A3349643V <dbl>, A3349483V <dbl>, A3349722T <dbl>,
## #   A3349727C <dbl>, A3349641R <dbl>, A3349639C <dbl>, A3349415T <dbl>,
## #   A3349349F <dbl>, A3349563V <dbl>, A3349350R <dbl>, A3349640L <dbl>,
## #   A3349566A <dbl>, A3349417W <dbl>, A3349352V <dbl>, A3349882C <dbl>,
## #   A3349561R <dbl>, A3349883F <dbl>, A3349721R <dbl>, A3349478A <dbl>,
## #   A3349637X <dbl>, A3349479C <dbl>, A3349797K <dbl>, A3349477X <dbl>,
## #   A3349719C <dbl>, A3349884J <dbl>, A3349562T <dbl>, A3349348C <dbl>,
## #   A3349480L <dbl>, A3349476W <dbl>, A3349881A <dbl>, A3349410F <dbl>,
## #   A3349481R <dbl>, A3349718A <dbl>, A3349411J <dbl>, A3349638A <dbl>,
## #   A3349654A <dbl>, A3349499L <dbl>, A3349902A <dbl>, A3349432V <dbl>,
## #   A3349656F <dbl>, A3349361W <dbl>, A3349501L <dbl>, A3349503T <dbl>,
## #   A3349360V <dbl>, A3349903C <dbl>, A3349905J <dbl>, A3349658K <dbl>,
## #   A3349575C <dbl>, A3349428C <dbl>, A3349500K <dbl>, A3349577J <dbl>,
## #   A3349433W <dbl>, A3349576F <dbl>, A3349574A <dbl>, A3349816F <dbl>,
## #   A3349815C <dbl>, A3349744F <dbl>, A3349823C <dbl>, A3349508C <dbl>,
## #   A3349742A <dbl>, A3349661X <dbl>, A3349660W <dbl>, A3349909T <dbl>,
## #   A3349824F <dbl>, A3349507A <dbl>, A3349580W <dbl>, A3349825J <dbl>,
## #   A3349434X <dbl>, A3349822A <dbl>, A3349821X <dbl>, A3349581X <dbl>,
## #   A3349908R <dbl>, A3349743C <dbl>, A3349910A <dbl>, A3349435A <dbl>,
## #   A3349365F <dbl>, A3349746K <dbl>, …

In my previous assignment, I had used column A3349661X, which was “# Category: Turnover ; Western Australia ; Furniture, floor coverings, houseware and textile goods retailing”.

Let us transform this data into a time series and then plot out this column, A3349661X.

# Recall from above that the data is monthly, starting April 1982
myts <- ts(retaildata$A3349661X, start = c(1982,4),frequency = 12)
head(myts)
##       Apr  May  Jun  Jul  Aug  Sep
## 1982 19.2 21.9 19.9 19.3 19.6 19.9

Let us now plot out the time series.

autoplot(myts)

Judging from the upward curve of the data, I suspect that a Box Cox transformation would be appropriate. Let us take a closer look.

lambda <- BoxCox.lambda(myts)
print(paste0("Lambda for myts: ", lambda))
## [1] "Lambda for myts: -0.0493897710767389"
autoplot(BoxCox(myts,lambda))

It appears that a logarithmic transformation works best for this data. And from the autoplot time series visualization, the plot suggests that the log transformation is appropriate.

3.8: For your retail time series (from Exercise 3 in Section 2.10):

Again, myts is the series that I have used above in my previous exercise.

A. Split the data into two parts using

myts.train <- window(myts, end=c(2010,12))
myts.test <- window(myts, start=2011)

The data is now loaded into a training and testing time series data set.

B. Check that your data have been split appropriately by producing the following plot.

autoplot(myts) +
  autolayer(myts.train, series="Training") + 
  autolayer(myts.test, series="Test")

It appears that the split worked appropriately.

C. Calculate forecasts using snaive applied to myts.train.

According to Chapter 3.1:

For naïve forecasts, we simply set all forecasts to be the value of the last observation. That is,

\[\hat{y}_{T+h|T} = y_{T}\]

This method works remarkably well for many economic and financial time series. However, we will be using the seasonal naive method. “In this case, we set each forecast to be equal to the last observed value from the same season of the year (e.g., the same month of the previous year). Formally, the forecast for time \(T + h\) is written as:”

\[\hat{y}_{T+h|T} = y_{T+h-m(k+1)}\]

“where \(m=\) the seasonal period, and \(k\) is the integer part of \((h-1)/m\) (i.e., the number of complete years in the forecast period prior to time \(T+h\)). This looks more complicated than it really is. For example, with monthly data, the forecast for all future February values is equal to the last observed February value. With quarterly data, the forecast of all future Q2 values is equal to the last observed Q2 value (where Q2 means the second quarter). Similar rules apply for other months and quarters, and for other seasonal periods.”

fc <- snaive(myts.train)
print(fc)
##          Point Forecast     Lo 80    Hi 80     Lo 95    Hi 95
## Jan 2011          115.4 102.03674 128.7633  94.96265 135.8374
## Feb 2011          104.1  90.73674 117.4633  83.66265 124.5374
## Mar 2011          102.8  89.43674 116.1633  82.36265 123.2374
## Apr 2011           98.7  85.33674 112.0633  78.26265 119.1374
## May 2011          105.5  92.13674 118.8633  85.06265 125.9374
## Jun 2011          119.2 105.83674 132.5633  98.76265 139.6374
## Jul 2011          115.9 102.53674 129.2633  95.46265 136.3374
## Aug 2011          109.6  96.23674 122.9633  89.16265 130.0374
## Sep 2011          112.2  98.83674 125.5633  91.76265 132.6374
## Oct 2011          134.7 121.33674 148.0633 114.26265 155.1374
## Nov 2011          134.9 121.53674 148.2633 114.46265 155.3374
## Dec 2011          149.8 136.43674 163.1633 129.36265 170.2374
## Jan 2012          115.4  96.50149 134.2985  86.49722 144.3028
## Feb 2012          104.1  85.20149 122.9985  75.19722 133.0028
## Mar 2012          102.8  83.90149 121.6985  73.89722 131.7028
## Apr 2012           98.7  79.80149 117.5985  69.79722 127.6028
## May 2012          105.5  86.60149 124.3985  76.59722 134.4028
## Jun 2012          119.2 100.30149 138.0985  90.29722 148.1028
## Jul 2012          115.9  97.00149 134.7985  86.99722 144.8028
## Aug 2012          109.6  90.70149 128.4985  80.69722 138.5028
## Sep 2012          112.2  93.30149 131.0985  83.29722 141.1028
## Oct 2012          134.7 115.80149 153.5985 105.79722 163.6028
## Nov 2012          134.9 116.00149 153.7985 105.99722 163.8028
## Dec 2012          149.8 130.90149 168.6985 120.89722 178.7028
f1 <- snaive(myts.train, h=length(myts.test))
autoplot(f1) + autolayer(myts.test)

Above is the prediction using the seasonal naive forecast. From the visualization, it is easy to see that this model would benefit from an improvement.

D. Compare the accuracy of your forecasts against the actual values stored in myts.test.

According to the textbook, chapter 3.4, “When choosing models, it is common practice to separate the available data into two portions, training and test data, where the training data is used to estimate any parameters of a forecasting method and the test data is used to evaluate its accuracy. Because the test data is not used in determining the forecasts, it should provide a reliable indication of how well the model is likely to forecast on new data.”

accuracy(fc, myts.test)
##                     ME     RMSE       MAE       MPE     MAPE     MASE
## Training set  3.518619 10.42741  7.244144  5.375707 11.81551 1.000000
## Test set     30.512500 33.10478 30.512500 20.388974 20.38897 4.212023
##                   ACF1 Theil's U
## Training set 0.7935927        NA
## Test set     0.3613754   2.66608

According to the textbook:

Mean absolute error (MAE): mean\((|e_t|)\) Root mean squared error (RMSE): \(\sqrt{\text{mean}(e^2_{t})}\) Mean absolute percentage error (MAPE): mean\((|p_{t}|)\) Mean absolute scaled error (MASE): mean\((|q_{j}|)\) where

\[q_{j} = \frac{e_j}{\frac{1}{T-m}\Sigma^{T}_{t=m+1}|y_t-y_{t-m}}\]

As of note, the above are only a few defined examples of accuracy. Also, we see the results listed above. The numbers themselves do not have significance until it is compared to another forecasting model i.e. naive, drift, etc. Then we can compare which models perform better.

E. Check the residuals.

According to chapter 3.3 “Residual Diagnostics”:

The “residuals” in a time series model are what is left over after fitting a model. For many (but not all) time series models, the residuals are equal to the difference between the observations and the corresponding fitted values:

\[e_t = y_t - \hat{y}_t\]

Residuals are useful in checking whether a model has adequately captured the information in the data. A good forecasting method will yield residuals with the following properties:

  1. The residuals are uncorrelated. If there are correlations between residuals, then there is information left in the residuals which should be used in computing forecasts.

  2. The residuals have zero mean. If the residuals have a mean other than zero, then the forecasts are biased.

Any forecasting method that does not satisfy these properties can be improved. However, that does not mean that forecasting methods that satisfy these properties cannot be improved. It is possible to have several different forecasting methods for the same data set, all of which satisfy these properties. Checking these properties is important in order to see whether a method is using all of the available information, but it is not a good way to select a forecasting method.

If either of these properties is not satisfied, then the forecasting method can be modified to give better forecasts. Adjusting for bias is easy: if the residuals have mean m, then simply add m to all forecasts and the bias problem is solved. Fixing the correlation problem is harder.

In addition to these essential properties, it is useful (but not necessary) for the residuals to also have the following two properties.

  1. The residuals have constant variance.

  2. The residuals are normally distributed.

These two properties make the calculation of prediction intervals easier (see Section 3.5 for an example). However, a forecasting method that does not satisfy these properties cannot necessarily be improved. Sometimes applying a Box-Cox transformation may assist with these properties, but otherwise there is usually little that you can do to ensure that your residuals have constant variance and a normal distribution. Instead, an alternative approach to obtaining prediction intervals is necessary.

checkresiduals(fc)

## 
##  Ljung-Box test
## 
## data:  Residuals from Seasonal naive method
## Q* = 1059, df = 24, p-value < 2.2e-16
## 
## Model df: 0.   Total lags used: 24

Do the residuals appear to be uncorrelated and normally distributed?

The residuals appear to be normally distributed, however, the mean is not centered at zero and the tails are quite long. According to the Ljung-Box test, the p-value is quite significant, which leads us to believe that there may be autocorrelation. (This is also evidenced by the ACF plot where the autocorrelation plots appear to exceed the blue dotted line).

F. How sensitive are the accuracy measures to the training/test split?

The accuracy measure are always sensitive to the training/test split. There are better ways to check the robustness of the methods in terms of accuracy such as using a time series cross-validation tsCV().

According to chapter 3.4:

“A more sophisticated version of training/test sets is time series cross-validation. In this procedure, there are a series of test sets, each consisting of a single observation. The corresponding training set consists only of observations that occurred prior to the observation that forms the test set. Thus, no future observations can be used in constructing the forecast. Since it is not possible to obtain a reliable forecast based on a small training set, the earliest observations are not considered as test sets.

The forecast accuracy is computed by averaging over the test sets. This procedure is sometimes known as “evaluation on a rolling forecasting origin” because the “origin” at which the forecast is based rolls forward in time.

With time series forecasting, one-step forecasts may not be as relevant as multi-step forecasts. In this case, the cross-validation procedure based on a rolling forecasting origin can be modified to allow multi-step errors to be used.

Time series cross-validation is implemented with the tsCV() function."