By: Joel Park
library(fpp2)
## Loading required package: ggplot2
## Loading required package: forecast
## Loading required package: fma
## Loading required package: expsmooth
HW: 3.1 - 3.3, 3.8
3.1: For the following series, find an appropriate Box-Cox transformation in order to stabilize the variance.
A. usnetelec
According to ?usnetelec, this dataset is the annual US net electricity generation (billion kWh) for 1949-2003.
Taking a look at a summary and a printout of the data:
summary(usnetelec)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 296.1 889.0 2040.9 1972.1 3002.7 3858.5
print(usnetelec)
## Time Series:
## Start = 1949
## End = 2003
## Frequency = 1
## [1] 296.1 334.1 375.3 403.8 447.0 476.3 550.3 603.9 634.6 648.5
## [11] 713.4 759.2 797.1 857.9 920.0 987.2 1058.4 1147.5 1217.8 1332.8
## [21] 1445.5 1535.1 1615.9 1753.0 1864.1 1870.3 1920.8 2040.9 2127.4 2209.4
## [31] 2250.7 2289.6 2298.0 2244.4 2313.4 2419.5 2473.0 2490.5 2575.3 2707.4
## [41] 2967.3 3038.0 3073.8 3083.9 3197.2 3247.5 3353.5 3444.2 3492.2 3620.3
## [51] 3694.8 3802.1 3736.6 3858.5 3848.0
Plotting out the time series. Please note that the frequency is yearly.
autoplot(usnetelec)
lambda <- BoxCox.lambda(usnetelec)
autoplot(BoxCox(usnetelec, lambda))
Box-Cox Transformation: a square-root transformation (BoxCox.lambda() selects a lambda of roughly 0.5).
From a visual perspective, the transformation does not appear dramatic, but if you look at the subtle aspects of the curve, the transformed series appears to have become “more linear”, with more even variance. The initial plot has a slight upward curve (somewhat of a quadratic appearance), which may explain why a square-root transformation makes sense here.
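As a quick visual check (my own sketch, not part of the assigned answer), we can facet the raw and transformed series side by side; cbind() on ts objects produces a multivariate series that autoplot() can facet.
# Sketch: compare the raw series with its Box-Cox transform in one plot.
autoplot(cbind(Original = usnetelec,
               Transformed = BoxCox(usnetelec, lambda)),
         facets = TRUE) +
  xlab("Year") + ggtitle("usnetelec before and after Box-Cox")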
B. usgdp
According to ?usgdp, this is the quarterly US GDP from 1947 (Quarter 1) to 2006 (Quarter 1).
Taking a look at a summary and a printout of the data:
summary(usgdp)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1568 2632 4552 5168 7130 11404
print(usgdp)
## Qtr1 Qtr2 Qtr3 Qtr4
## 1947 1570.5 1568.7 1568.0 1590.9
## 1948 1616.1 1644.6 1654.1 1658.0
## 1949 1633.2 1628.4 1646.7 1629.9
## 1950 1696.8 1747.3 1815.8 1848.9
## 1951 1871.3 1903.1 1941.1 1944.4
## 1952 1964.7 1966.0 1978.8 2043.8
## 1953 2082.3 2098.1 2085.4 2052.5
## 1954 2042.4 2044.3 2066.9 2107.8
## 1955 2168.5 2204.0 2233.4 2245.3
## 1956 2234.8 2252.5 2249.8 2286.5
## 1957 2300.3 2294.6 2317.0 2292.5
## 1958 2230.2 2243.4 2295.2 2348.0
## 1959 2392.9 2455.8 2453.9 2462.6
## 1960 2517.4 2504.8 2508.7 2476.2
## 1961 2491.2 2538.0 2579.1 2631.8
## 1962 2679.1 2708.4 2733.3 2740.0
## 1963 2775.9 2810.6 2863.5 2885.8
## 1964 2950.5 2984.8 3025.5 3033.6
## 1965 3108.2 3150.2 3214.1 3291.8
## 1966 3372.3 3384.0 3406.3 3433.7
## 1967 3464.1 3464.3 3491.8 3518.2
## 1968 3590.7 3651.6 3676.5 3692.0
## 1969 3750.2 3760.9 3784.2 3766.3
## 1970 3760.0 3767.1 3800.5 3759.8
## 1971 3864.1 3885.9 3916.7 3927.9
## 1972 3997.7 4092.1 4131.1 4198.7
## 1973 4305.3 4355.1 4331.9 4373.3
## 1974 4335.4 4347.9 4305.8 4288.9
## 1975 4237.6 4268.6 4340.9 4397.8
## 1976 4496.8 4530.3 4552.0 4584.6
## 1977 4640.0 4731.1 4815.8 4815.3
## 1978 4830.8 5021.2 5070.7 5137.4
## 1979 5147.4 5152.3 5189.4 5204.7
## 1980 5221.3 5115.9 5107.4 5202.1
## 1981 5307.5 5266.1 5329.8 5263.4
## 1982 5177.1 5204.9 5185.2 5189.8
## 1983 5253.8 5372.3 5478.4 5590.5
## 1984 5699.8 5797.9 5854.3 5902.4
## 1985 5956.9 6007.8 6101.7 6148.6
## 1986 6207.4 6232.0 6291.7 6323.4
## 1987 6365.0 6435.0 6493.4 6606.8
## 1988 6639.1 6723.5 6759.4 6848.6
## 1989 6918.1 6963.5 7013.1 7030.9
## 1990 7112.1 7130.3 7130.8 7076.9
## 1991 7040.8 7086.5 7120.7 7154.1
## 1992 7228.2 7297.9 7369.5 7450.7
## 1993 7459.7 7497.5 7536.0 7637.4
## 1994 7715.1 7815.7 7859.5 7951.6
## 1995 7973.7 7988.0 8053.1 8112.0
## 1996 8169.2 8303.1 8372.7 8470.6
## 1997 8536.1 8665.8 8773.7 8838.4
## 1998 8936.2 8995.3 9098.9 9237.1
## 1999 9315.5 9392.6 9502.2 9671.1
## 2000 9695.6 9847.9 9836.6 9887.7
## 2001 9875.6 9905.9 9871.1 9910.0
## 2002 9977.3 10031.6 10090.7 10095.8
## 2003 10138.6 10230.4 10410.9 10502.6
## 2004 10612.5 10704.1 10808.9 10897.1
## 2005 10999.3 11089.2 11202.3 11248.3
## 2006 11403.6
Plotting out the time series. Please note that the frequency is quarterly.
autoplot(usgdp)
lambda <- BoxCox.lambda(usgdp)
print(paste0("Lambda for usgdp: ", lambda))
## [1] "Lambda for usgdp: 0.366352049520934"
autoplot(BoxCox(usgdp,lambda))
Box-Cox Transformation: a cube-root transformation (lambda of roughly 0.37, close to 1/3).
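For reference, for strictly positive data the transformation BoxCox() applies is the standard Box-Cox formula. A minimal hand-rolled version (a sketch; box_cox is my own hypothetical helper) should match the package function:
# Sketch: log(y) when lambda = 0, (y^lambda - 1)/lambda otherwise
# (valid for strictly positive y, which holds for usgdp).
box_cox <- function(y, lambda) {
  if (lambda == 0) log(y) else (y^lambda - 1) / lambda
}
all.equal(as.numeric(BoxCox(usgdp, lambda)),
          box_cox(as.numeric(usgdp), lambda))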
C. mcopper
According to ?mcopper, this dataset contains monthly copper prices.
Taking a look at a summary and a printout of the data:
summary(mcopper)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 216.6 566.0 949.2 997.8 1262.5 4306.0
print(mcopper)
## Jan Feb Mar Apr May Jun Jul Aug Sep Oct
## 1960 255.2 259.7 249.3 258.0 244.3 246.8 250.6 241.3 231.0 218.7
## 1961 216.6 220.1 221.7 225.6 238.6 232.8 226.1 227.2 225.8 225.0
## 1962 226.8 231.3 231.1 230.6 230.5 230.4 230.4 230.4 230.4 230.6
## 1963 230.6 230.6 230.6 230.6 230.6 230.6 230.6 230.6 230.6 230.6
## 1964 234.2 247.8 266.1 307.7 295.8 288.8 305.4 356.9 415.0 485.1
## 1965 356.7 419.2 440.6 480.5 490.8 466.1 404.0 431.6 473.5 500.1
## 1966 599.0 668.7 668.7 679.9 592.8 604.8 559.5 426.4 402.4 454.9
## 1967 443.7 435.4 391.8 354.9 369.5 362.3 355.9 372.7 378.3 406.3
## 1968 586.4 715.8 708.0 522.4 456.5 473.0 439.9 439.9 462.3 449.7
## 1969 523.8 535.0 532.4 578.1 579.6 617.0 605.4 669.0 657.6 647.4
## 1970 677.2 690.1 730.4 725.2 665.9 607.0 567.8 527.5 519.3 475.8
## 1971 420.7 424.9 476.5 521.2 464.1 447.2 464.4 451.0 427.5 417.7
## 1972 418.9 426.9 442.2 433.5 423.3 412.4 423.2 427.0 434.2 428.7
## 1973 474.7 511.8 610.2 638.8 613.0 678.4 795.4 843.5 799.6 849.6
## 1974 913.2 1006.5 1172.0 1267.7 1190.6 1020.1 802.8 767.9 630.1 599.2
## 1975 512.4 528.7 554.5 560.5 539.7 522.5 559.3 603.7 580.2 573.1
## 1976 587.8 601.6 683.7 817.9 836.0 878.0 922.1 862.2 844.9 784.8
## 1977 814.8 833.4 881.0 830.4 797.9 763.1 724.4 665.8 685.6 682.9
## 1978 651.2 627.0 657.6 694.4 716.0 725.0 705.4 734.6 735.8 750.1
## 1979 827.1 969.7 1005.5 1012.3 935.3 888.8 802.4 883.2 953.6 967.6
## 1980 1147.9 1220.2 1045.5 937.1 888.9 858.5 916.7 877.9 857.6 846.2
## 1981 777.4 784.9 814.3 837.0 834.0 861.0 897.3 981.4 942.2 904.9
## 1982 853.6 863.8 836.3 858.5 843.6 740.4 829.9 841.0 832.7 861.4
## 1983 997.3 1074.3 1071.9 1090.0 1122.8 1098.6 1115.6 1091.0 1041.0 958.4
## 1984 978.0 991.5 1030.9 1078.1 1022.3 991.2 1007.9 1018.2 1029.3 1043.5
## 1985 1205.5 1269.8 1234.7 1212.7 1225.3 1118.0 1067.5 1025.7 1001.4 973.8
## 1986 995.4 982.8 984.4 957.3 932.3 937.0 891.7 876.5 916.1 922.9
## 1987 893.8 902.7 920.1 909.3 911.9 964.6 1052.8 1097.5 1100.7 1182.7
## 1988 1476.8 1324.3 1286.3 1216.5 1307.1 1428.8 1297.5 1296.2 1445.7 1689.4
## 1989 1912.6 1765.0 1904.1 1832.3 1679.1 1638.8 1539.1 1731.5 1834.9 1801.8
## 1990 1432.7 1390.9 1615.7 1639.6 1633.7 1510.5 1529.6 1554.4 1611.6 1409.6
## 1991 1265.1 1246.5 1322.5 1412.3 1336.8 1344.8 1353.8 1325.6 1346.1 1371.2
## 1992 1182.2 1240.5 1291.7 1260.9 1224.6 1239.1 1313.8 1297.2 1307.0 1360.3
## 1993 1472.3 1536.7 1472.3 1261.9 1159.0 1228.5 1287.5 1305.0 1219.7 1108.7
## 1994 1208.4 1261.0 1284.0 1268.1 1430.5 1549.9 1589.5 1560.0 1601.0 1586.2
## 1995 1910.5 1830.3 1826.8 1805.7 1745.9 1876.8 1927.4 1936.6 1870.3 1782.7
## 1996 1708.7 1650.2 1676.5 1713.5 1757.3 1407.7 1276.6 1294.9 1244.3 1236.2
## 1997 1469.2 1480.3 1506.4 1466.8 1540.0 1588.0 1466.7 1403.6 1315.4 1256.4
## 1998 1032.1 1014.4 1051.4 1078.1 1057.8 1005.6 1004.2 991.8 979.1 935.6
## 1999 866.7 866.5 849.6 910.2 935.6 891.3 1041.2 1025.5 1077.3 1040.1
## 2000 1124.2 1124.7 1100.6 1059.8 1183.3 1161.6 1192.3 1245.1 1365.3 1308.1
## 2001 1209.8 1214.9 1202.8 1159.3 1178.9 1147.5 1078.5 1019.0 974.4 948.5
## 2002 1049.9 1097.4 1127.9 1101.4 1092.9 1110.0 1022.2 962.4 950.0 952.5
## 2003 1018.8 1046.7 1047.4 1007.9 1015.5 1015.4 1054.0 1103.4 1109.1 1143.9
## 2004 1329.0 1477.5 1647.3 1635.6 1529.5 1469.7 1523.4 1563.1 1614.9 1666.7
## 2005 1687.8 1723.8 1773.0 1790.0 1751.7 1938.0 2063.8 2115.8 2133.3 2301.1
## 2006 2677.7 2851.6 2927.3 3610.9 4306.0 3897.7 4170.5 4059.8 4032.9 3998.6
## Nov Dec
## 1960 222.8 227.5
## 1961 225.7 226.3
## 1962 230.4 230.5
## 1963 230.6 232.2
## 1964 500.1 453.3
## 1965 523.8 541.4
## 1966 464.4 433.4
## 1967 515.3 551.3
## 1968 458.0 493.4
## 1969 679.4 707.8
## 1970 451.9 435.4
## 1971 406.4 410.7
## 1972 427.8 435.7
## 1973 950.4 959.9
## 1974 608.2 553.2
## 1975 575.1 568.9
## 1976 780.3 767.3
## 1977 650.2 679.1
## 1978 749.3 771.7
## 1979 978.3 1005.6
## 1980 839.5 800.4
## 1981 867.7 869.3
## 1982 884.3 911.4
## 1983 940.4 986.7
## 1984 1085.1 1113.5
## 1985 950.9 962.4
## 1986 915.0 927.6
## 1987 1419.9 1566.6
## 1988 1826.2 1915.3
## 1989 1647.3 1514.1
## 1990 1315.9 1292.7
## 1991 1336.8 1216.4
## 1992 1413.2 1422.5
## 1993 1100.3 1156.4
## 1994 1763.6 1914.4
## 1995 1904.8 1898.7
## 1996 1343.7 1366.9
## 1997 1135.0 1061.5
## 1998 946.9 881.9
## 1999 1065.3 1093.7
## 2000 1258.9 1264.4
## 2001 994.2 1020.8
## 2002 1006.4 1005.7
## 2003 1216.0 1258.5
## 2004 1678.5 1631.1
## 2005 2461.6 2621.7
## 2006 3675.9 3435.8
Plotting out the time series. Please note that the frequency is monthly. As noted in the “Calendar Adjustments” part of Section 3.2 of the textbook, we will first adjust for calendar effects and then perform a mathematical transformation using Box-Cox.
dailyaverage <- mcopper/monthdays(mcopper)
dframe <- cbind(Monthly = mcopper,
DailyAverage = dailyaverage)
autoplot(dframe, facets=TRUE) +
  xlab("Years") + ylab("Copper Prices") +
  ggtitle("Monthly copper prices")
Of note, the calendar adjustment may make some difference for forecasting. However, from the visualization above there does not seem to be an obvious change. Below, we use the daily average to find the Box-Cox transformation lambda.
lambda <- BoxCox.lambda(dailyaverage)
print(paste0("Lambda for mcopper (using dailyaverage): ", lambda))
## [1] "Lambda for mcopper (using dailyaverage): 0.181704314971557"
autoplot(BoxCox(dailyaverage,lambda))
Box-Cox Transformation: roughly a fifth-root (power 1/5) transformation for the daily average.
However, let us look at the mcopper dataset without the calendar adjustment and focus solely on the Box-Cox transformation.
lambda <- BoxCox.lambda(mcopper)
print(paste0("Lambda for mcopper: ", lambda))
## [1] "Lambda for mcopper: 0.191904709003829"
autoplot(BoxCox(mcopper,lambda))
The transformation does not appear to differ drastically from the daily-average version. As a result, we take the Box-Cox lambda here to be approximately 0.2.
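One practical note (my own aside): whatever lambda we settle on, forecasts produced on the transformed scale must be back-transformed, and the forecast package provides InvBoxCox() for this. A quick round-trip check:
# Sketch: the inverse transformation recovers the original series.
transformed <- BoxCox(mcopper, lambda)
all.equal(as.numeric(InvBoxCox(transformed, lambda)), as.numeric(mcopper))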
D. enplanements
According to ?enplanements, this dataset is the monthly US domestic revenue enplanements (millions). Although the quoted description mentions 1996-2000, the printed series below runs from January 1979 to June 2002.
Taking a look at a summary and a printout of the data:
summary(enplanements)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 20.14 27.18 34.88 35.67 42.78 56.14
print(enplanements)
## Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov
## 1979 21.12 22.92 25.90 24.38 23.41 26.82 26.90 28.72 22.87 23.10 23.11
## 1980 21.35 22.23 24.40 23.41 22.06 24.53 24.00 24.97 20.57 21.55 20.14
## 1981 20.24 20.63 22.03 22.68 22.80 23.78 24.16 22.64 20.50 21.44 20.91
## 1982 20.52 21.66 23.82 24.01 22.52 24.68 24.19 24.73 20.85 21.84 21.70
## 1983 21.83 23.44 26.98 24.61 24.10 26.67 25.63 26.40 23.17 24.24 23.88
## 1984 22.60 23.67 27.00 26.91 26.63 28.93 27.72 29.79 25.61 26.81 26.97
## 1985 25.35 26.91 31.60 31.09 30.70 32.09 32.02 33.16 26.82 28.68 27.80
## 1986 28.07 30.01 34.12 32.90 32.92 34.86 35.67 37.81 30.65 31.96 31.01
## 1987 28.72 33.39 37.76 37.51 35.87 36.86 37.71 38.41 31.65 32.87 32.23
## 1988 29.64 32.87 37.12 35.32 34.82 36.97 36.89 38.76 32.99 34.59 34.28
## 1989 30.52 32.37 35.62 33.81 33.86 37.91 37.03 39.38 32.68 34.72 34.52
## 1990 31.06 33.95 37.11 36.00 34.66 37.88 37.50 39.98 33.08 34.73 34.11
## 1991 30.57 32.06 33.58 34.89 34.21 36.77 37.25 38.73 32.54 34.29 32.28
## 1992 29.81 31.59 34.74 33.74 34.42 40.61 42.74 43.74 36.24 35.02 34.12
## 1993 31.31 33.82 36.59 37.32 36.62 39.34 39.92 40.97 36.62 38.19 37.15
## 1994 32.98 36.23 40.84 39.87 39.97 43.15 44.07 44.43 39.66 41.05 40.60
## 1995 36.43 38.77 42.64 42.19 41.16 44.83 44.28 45.61 39.87 41.86 41.57
## 1996 37.17 41.92 46.55 45.10 44.46 47.47 47.00 48.37 41.76 44.38 41.25
## 1997 39.95 43.00 48.02 45.89 45.47 48.83 49.24 49.53 42.79 44.97 43.60
## 1998 39.21 43.46 47.77 48.41 47.05 50.52 49.88 50.26 43.80 46.72 45.95
## 1999 40.88 44.99 49.85 49.69 47.90 52.24 53.45 52.02 45.89 49.14 49.00
## 2000 41.37 46.47 52.86 51.68 51.98 56.12 55.41 54.38 47.74 50.53 50.93
## 2001 43.83 47.56 52.82 52.10 50.72 54.89 55.50 56.14 31.41 39.82 41.50
## 2002 38.13 42.43 48.05 46.48 46.58 50.29
## Dec
## 1979 22.14
## 1980 21.39
## 1981 21.70
## 1982 22.38
## 1983 23.96
## 1984 26.30
## 1985 30.46
## 1986 32.30
## 1987 31.88
## 1988 32.56
## 1989 32.91
## 1990 32.73
## 1991 34.29
## 1992 33.79
## 1993 36.30
## 1994 39.48
## 1995 40.40
## 1996 43.97
## 1997 44.05
## 1998 45.37
## 1999 45.81
## 2000 46.69
## 2001 40.45
## 2002
Plotting out the time series. Please note that the frequency is monthly. Again, we will apply the calendar adjustments as in the prior example.
dailyaverage <- enplanements/monthdays(enplanements)
dframe <- cbind(Monthly = enplanements,
DailyAverage = dailyaverage)
autoplot(dframe, facets=TRUE) +
  xlab("Years") + ylab("Domestic Revenue Enplanements (millions)") +
  ggtitle("Monthly US domestic enplanements")
Looking at this autoplot, there appears to be a general upward trend; we can confirm the trend and the seasonal cycling using the ACF. For now, we will disregard the daily average and work with the original time series.
ggseasonplot(enplanements)
ggsubseriesplot(enplanements)
gglagplot(enplanements)
ggAcf(enplanements)
The slow decay of the ACF reflects the upward trend, and the “scalloped” shape indicates a seasonal component.
lambda <- BoxCox.lambda(enplanements)
print(paste0("Lambda for enplanements: ", lambda))
## [1] "Lambda for enplanements: -0.226946111237065"
autoplot(BoxCox(enplanements,lambda))
Box-Cox Transformation: roughly a power of -1/4 (lambda of approximately -0.23).
As a side note, judging from the visualization it is unclear whether this transformation truly improved the plot.
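A rough way to check that numerically (my own sketch; yearly_sd is a hypothetical helper, not from the text) is to compare the spread of within-year values before and after transforming; more uniform yearly standard deviations would suggest the variance was stabilized.
# Sketch: standard deviation of the observations within each calendar year.
yearly_sd <- function(x) tapply(as.numeric(x), floor(time(x)), sd)
range(yearly_sd(enplanements))
range(yearly_sd(BoxCox(enplanements, lambda)))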
3.2: Why is a Box-Cox transformation unhelpful for the cangas data?
Let’s take a closer look at the cangas time series. cangas is the monthly Canadian gas production in billions of cubic metres from January 1960 to February 2005.
summary(cangas)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.966 6.453 8.831 9.777 14.429 19.528
head(cangas)
## Jan Feb Mar Apr May Jun
## 1960 1.4306 1.3059 1.4022 1.1699 1.1161 1.0113
This is monthly data. Let’s plot the series.
autoplot(cangas)
lambda <- BoxCox.lambda(cangas)
print(paste0("Lambda for cangas: ", lambda))
## [1] "Lambda for cangas: 0.576775938228139"
autoplot(BoxCox(cangas,lambda))
Though the BoxCox.lambda function suggests roughly a square-root transformation, the visualization shows no obvious improvement. The reason is that the seasonal variation in cangas increases through the middle of the series and then decreases again, so no single power transformation can make the variation uniform across the whole series. Just because a function recommends a transformation does not necessarily mean that we should perform it.
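To see this (my own sketch), facet the original and transformed series together; the bulge in seasonal variation in the middle of the series survives the transformation.
# Sketch: the transformed series still shows uneven seasonal variation,
# so the Box-Cox transformation buys us little here.
autoplot(cbind(Original = cangas,
               Transformed = BoxCox(cangas, lambda)),
         facets = TRUE)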
3.3: What Box-Cox transformation would you select for your retail data (from Exercise 3 in Section 2.10)?
retaildata <- readxl::read_excel("retail.xlsx", skip=1)
head(retaildata)
## # A tibble: 6 x 190
## `Series ID` A3349335T A3349627V A3349338X A3349398A A3349468W
## <dttm> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1982-04-01 00:00:00 303. 41.7 63.9 409. 65.8
## 2 1982-05-01 00:00:00 298. 43.1 64 405. 65.8
## 3 1982-06-01 00:00:00 298 40.3 62.7 401 62.3
## 4 1982-07-01 00:00:00 308. 40.9 65.6 414. 68.2
## 5 1982-08-01 00:00:00 299. 42.1 62.6 404. 66
## 6 1982-09-01 00:00:00 305. 42 64.4 412. 62.3
## # … with 184 more variables: A3349336V <dbl>, A3349337W <dbl>,
## # A3349397X <dbl>, A3349399C <dbl>, A3349874C <dbl>, A3349871W <dbl>,
## # A3349790V <dbl>, A3349556W <dbl>, A3349791W <dbl>, A3349401C <dbl>,
## # A3349873A <dbl>, A3349872X <dbl>, A3349709X <dbl>, A3349792X <dbl>,
## # A3349789K <dbl>, A3349555V <dbl>, A3349565X <dbl>, A3349414R <dbl>,
## # A3349799R <dbl>, A3349642T <dbl>, A3349413L <dbl>, A3349564W <dbl>,
## # A3349416V <dbl>, A3349643V <dbl>, A3349483V <dbl>, A3349722T <dbl>,
## # A3349727C <dbl>, A3349641R <dbl>, A3349639C <dbl>, A3349415T <dbl>,
## # A3349349F <dbl>, A3349563V <dbl>, A3349350R <dbl>, A3349640L <dbl>,
## # A3349566A <dbl>, A3349417W <dbl>, A3349352V <dbl>, A3349882C <dbl>,
## # A3349561R <dbl>, A3349883F <dbl>, A3349721R <dbl>, A3349478A <dbl>,
## # A3349637X <dbl>, A3349479C <dbl>, A3349797K <dbl>, A3349477X <dbl>,
## # A3349719C <dbl>, A3349884J <dbl>, A3349562T <dbl>, A3349348C <dbl>,
## # A3349480L <dbl>, A3349476W <dbl>, A3349881A <dbl>, A3349410F <dbl>,
## # A3349481R <dbl>, A3349718A <dbl>, A3349411J <dbl>, A3349638A <dbl>,
## # A3349654A <dbl>, A3349499L <dbl>, A3349902A <dbl>, A3349432V <dbl>,
## # A3349656F <dbl>, A3349361W <dbl>, A3349501L <dbl>, A3349503T <dbl>,
## # A3349360V <dbl>, A3349903C <dbl>, A3349905J <dbl>, A3349658K <dbl>,
## # A3349575C <dbl>, A3349428C <dbl>, A3349500K <dbl>, A3349577J <dbl>,
## # A3349433W <dbl>, A3349576F <dbl>, A3349574A <dbl>, A3349816F <dbl>,
## # A3349815C <dbl>, A3349744F <dbl>, A3349823C <dbl>, A3349508C <dbl>,
## # A3349742A <dbl>, A3349661X <dbl>, A3349660W <dbl>, A3349909T <dbl>,
## # A3349824F <dbl>, A3349507A <dbl>, A3349580W <dbl>, A3349825J <dbl>,
## # A3349434X <dbl>, A3349822A <dbl>, A3349821X <dbl>, A3349581X <dbl>,
## # A3349908R <dbl>, A3349743C <dbl>, A3349910A <dbl>, A3349435A <dbl>,
## # A3349365F <dbl>, A3349746K <dbl>, …
In my previous assignment, I used column A3349661X, which is “Turnover ; Western Australia ; Furniture, floor coverings, houseware and textile goods retailing”.
Let us transform this data into a time series and then plot column A3349661X.
# Recall from above that the data is monthly, starting April 1982
myts <- ts(retaildata$A3349661X, start = c(1982,4),frequency = 12)
head(myts)
## Apr May Jun Jul Aug Sep
## 1982 19.2 21.9 19.9 19.3 19.6 19.9
Let us now plot out the time series.
autoplot(myts)
Judging from the upward curve of the data, I suspect that a Box-Cox transformation would be appropriate. Let us take a closer look.
lambda <- BoxCox.lambda(myts)
print(paste0("Lambda for myts: ", lambda))
## [1] "Lambda for myts: -0.0493897710767389"
autoplot(BoxCox(myts,lambda))
It appears that a logarithmic transformation works best for this data: the selected lambda of about -0.05 is essentially zero, and a lambda of zero corresponds to the log transform. The autoplot of the transformed series also suggests that the log transformation is appropriate.
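Since lambda is so close to zero, BoxCox() here behaves almost exactly like log(); a quick faceted comparison (my own sketch) makes that concrete.
# Sketch: with lambda near 0, the Box-Cox transform and log() are
# nearly indistinguishable apart from location and scale.
autoplot(cbind(BoxCox = BoxCox(myts, lambda),
               Log = log(myts)),
         facets = TRUE)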
3.8: For your retail time series (from Exercise 3 in Section 2.10):
Again, myts is the series I used in the previous exercise.
A. Split the data into two parts using
myts.train <- window(myts, end=c(2010,12))
myts.test <- window(myts, start=2011)
The data is now split into training and test sets.
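A quick numeric sanity check (my own addition) confirms that the split loses no observations:
# Sketch: the two windows should exactly partition the full series.
length(myts.train) + length(myts.test) == length(myts)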
B. Check that your data have been split appropriately by producing the following plot.
autoplot(myts) +
autolayer(myts.train, series="Training") +
autolayer(myts.test, series="Test")
It appears that the split worked appropriately.
C. Calculate forecasts using snaive applied to myts.train.
According to Chapter 3.1:
For naïve forecasts, we simply set all forecasts to be the value of the last observation. That is,
\[\hat{y}_{T+h|T} = y_{T}\]
This method works remarkably well for many economic and financial time series. However, we will be using the seasonal naive method. “In this case, we set each forecast to be equal to the last observed value from the same season of the year (e.g., the same month of the previous year). Formally, the forecast for time \(T + h\) is written as:”
\[\hat{y}_{T+h|T} = y_{T+h-m(k+1)}\]
“where \(m=\) the seasonal period, and \(k\) is the integer part of \((h-1)/m\) (i.e., the number of complete years in the forecast period prior to time \(T+h\)). This looks more complicated than it really is. For example, with monthly data, the forecast for all future February values is equal to the last observed February value. With quarterly data, the forecast of all future Q2 values is equal to the last observed Q2 value (where Q2 means the second quarter). Similar rules apply for other months and quarters, and for other seasonal periods.”
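To make the formula concrete, here is a hand-rolled version of the seasonal naive point forecast (my own sketch; snaive() below is the real implementation), which should reproduce the snaive() mean forecasts:
# Sketch: compute y[T + h - m(k+1)] directly for h = 1..24.
m <- frequency(myts.train)
T_end <- length(myts.train)
h <- 1:24
k <- (h - 1) %/% m
manual <- as.numeric(myts.train)[T_end + h - m * (k + 1)]
all.equal(manual, as.numeric(snaive(myts.train, h = 24)$mean))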
fc <- snaive(myts.train)
print(fc)
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## Jan 2011 115.4 102.03674 128.7633 94.96265 135.8374
## Feb 2011 104.1 90.73674 117.4633 83.66265 124.5374
## Mar 2011 102.8 89.43674 116.1633 82.36265 123.2374
## Apr 2011 98.7 85.33674 112.0633 78.26265 119.1374
## May 2011 105.5 92.13674 118.8633 85.06265 125.9374
## Jun 2011 119.2 105.83674 132.5633 98.76265 139.6374
## Jul 2011 115.9 102.53674 129.2633 95.46265 136.3374
## Aug 2011 109.6 96.23674 122.9633 89.16265 130.0374
## Sep 2011 112.2 98.83674 125.5633 91.76265 132.6374
## Oct 2011 134.7 121.33674 148.0633 114.26265 155.1374
## Nov 2011 134.9 121.53674 148.2633 114.46265 155.3374
## Dec 2011 149.8 136.43674 163.1633 129.36265 170.2374
## Jan 2012 115.4 96.50149 134.2985 86.49722 144.3028
## Feb 2012 104.1 85.20149 122.9985 75.19722 133.0028
## Mar 2012 102.8 83.90149 121.6985 73.89722 131.7028
## Apr 2012 98.7 79.80149 117.5985 69.79722 127.6028
## May 2012 105.5 86.60149 124.3985 76.59722 134.4028
## Jun 2012 119.2 100.30149 138.0985 90.29722 148.1028
## Jul 2012 115.9 97.00149 134.7985 86.99722 144.8028
## Aug 2012 109.6 90.70149 128.4985 80.69722 138.5028
## Sep 2012 112.2 93.30149 131.0985 83.29722 141.1028
## Oct 2012 134.7 115.80149 153.5985 105.79722 163.6028
## Nov 2012 134.9 116.00149 153.7985 105.99722 163.8028
## Dec 2012 149.8 130.90149 168.6985 120.89722 178.7028
f1 <- snaive(myts.train, h=length(myts.test))
autoplot(f1) + autolayer(myts.test)
Above is the prediction using the seasonal naive forecast. From the visualization, there is clearly room for improvement over this model.
D. Compare the accuracy of your forecasts against the actual values stored in myts.test.
According to the textbook, chapter 3.4, “When choosing models, it is common practice to separate the available data into two portions, training and test data, where the training data is used to estimate any parameters of a forecasting method and the test data is used to evaluate its accuracy. Because the test data is not used in determining the forecasts, it should provide a reliable indication of how well the model is likely to forecast on new data.”
accuracy(fc, myts.test)
## ME RMSE MAE MPE MAPE MASE
## Training set 3.518619 10.42741 7.244144 5.375707 11.81551 1.000000
## Test set 30.512500 33.10478 30.512500 20.388974 20.38897 4.212023
## ACF1 Theil's U
## Training set 0.7935927 NA
## Test set 0.3613754 2.66608
According to the textbook:
Mean absolute error: MAE = mean\((|e_{t}|)\)
Root mean squared error: RMSE = \(\sqrt{\text{mean}(e^{2}_{t})}\)
Mean absolute percentage error: MAPE = mean\((|p_{t}|)\), where \(p_{t} = 100 e_{t}/y_{t}\)
Mean absolute scaled error: MASE = mean\((|q_{j}|)\), where
\[q_{j} = \frac{e_{j}}{\frac{1}{T-m}\sum^{T}_{t=m+1}|y_{t}-y_{t-m}|}\]
Of note, the above are only a few of the commonly used accuracy measures. The numbers themselves have little significance until they are compared with those of another forecasting model (e.g., naive, drift); then we can judge which models perform better.
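For instance (my own sketch, using the benchmark methods from the same package), we can put the seasonal naive test-set RMSE next to the naive and drift methods on the same split:
# Sketch: test-set RMSE for three simple benchmarks on the same split.
h <- length(myts.test)
accuracy(naive(myts.train, h = h), myts.test)["Test set", "RMSE"]
accuracy(rwf(myts.train, drift = TRUE, h = h), myts.test)["Test set", "RMSE"]
accuracy(snaive(myts.train, h = h), myts.test)["Test set", "RMSE"]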
E. Check the residuals.
According to chapter 3.3 “Residual Diagnostics”:
The “residuals” in a time series model are what is left over after fitting a model. For many (but not all) time series models, the residuals are equal to the difference between the observations and the corresponding fitted values:
\[e_t = y_t - \hat{y}_t\]
Residuals are useful in checking whether a model has adequately captured the information in the data. A good forecasting method will yield residuals with the following properties:
The residuals are uncorrelated. If there are correlations between residuals, then there is information left in the residuals which should be used in computing forecasts.
The residuals have zero mean. If the residuals have a mean other than zero, then the forecasts are biased.
Any forecasting method that does not satisfy these properties can be improved. However, that does not mean that forecasting methods that satisfy these properties cannot be improved. It is possible to have several different forecasting methods for the same data set, all of which satisfy these properties. Checking these properties is important in order to see whether a method is using all of the available information, but it is not a good way to select a forecasting method.
If either of these properties is not satisfied, then the forecasting method can be modified to give better forecasts. Adjusting for bias is easy: if the residuals have mean m, then simply add m to all forecasts and the bias problem is solved. Fixing the correlation problem is harder.
In addition to these essential properties, it is useful (but not necessary) for the residuals to also have the following two properties.
The residuals have constant variance.
The residuals are normally distributed.
These two properties make the calculation of prediction intervals easier (see Section 3.5 for an example). However, a forecasting method that does not satisfy these properties cannot necessarily be improved. Sometimes applying a Box-Cox transformation may assist with these properties, but otherwise there is usually little that you can do to ensure that your residuals have constant variance and a normal distribution. Instead, an alternative approach to obtaining prediction intervals is necessary.
checkresiduals(fc)
##
## Ljung-Box test
##
## data: Residuals from Seasonal naive method
## Q* = 1059, df = 24, p-value < 2.2e-16
##
## Model df: 0. Total lags used: 24
Do the residuals appear to be uncorrelated and normally distributed?
The residuals appear roughly normally distributed; however, their mean is not centered at zero and the tails are quite long. The Ljung-Box test p-value is highly significant (< 2.2e-16), which leads us to believe the residuals are autocorrelated. (This is also evident from the ACF plot, where many of the autocorrelations exceed the blue dashed significance bounds.)
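The zero-mean property can also be checked directly (my own sketch; for snaive, residuals() returns NA for the first seasonal cycle):
# Sketch: a mean far from zero indicates biased forecasts.
mean(residuals(fc), na.rm = TRUE)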
F. How sensitive are the accuracy measures to the training/test split?
The accuracy measures are always sensitive to the training/test split. There are better ways to check the robustness of a method’s accuracy, such as time series cross-validation with tsCV().
According to chapter 3.4:
“A more sophisticated version of training/test sets is time series cross-validation. In this procedure, there are a series of test sets, each consisting of a single observation. The corresponding training set consists only of observations that occurred prior to the observation that forms the test set. Thus, no future observations can be used in constructing the forecast. Since it is not possible to obtain a reliable forecast based on a small training set, the earliest observations are not considered as test sets.
The forecast accuracy is computed by averaging over the test sets. This procedure is sometimes known as “evaluation on a rolling forecasting origin” because the “origin” at which the forecast is based rolls forward in time.
With time series forecasting, one-step forecasts may not be as relevant as multi-step forecasts. In this case, the cross-validation procedure based on a rolling forecasting origin can be modified to allow multi-step errors to be used.
Time series cross-validation is implemented with the tsCV() function."
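As a sketch of how that looks for the seasonal naive method used here (my own illustration, with one-step-ahead errors):
# Sketch: tsCV() returns the cross-validated forecast errors; the RMSE
# computed from them is comparable across forecasting methods.
e <- tsCV(myts, snaive, h = 1)
sqrt(mean(e^2, na.rm = TRUE))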