This project is divided into three parts, each presenting a specific time series for which the best forecasting model is built.
The selection of a method depends on many factors - the context of the forecast, the relevance of the available historical data, the degree of accuracy desired, and so on. These factors are cross-validated and examined to find the model with potentially greater accuracy at minimal cost.
These are the main packages used for data wrangling, visualization, and graphics. Any other minor packages for analysis will be listed when needed.
Part A explores the common forecasting techniques, namely, Seasonal and Trend decomposition using Loess (STL), Exponential Smoothing, and ARIMA modeling. Part B goes a step further and introduces Hybrid Forecasting, which produces point estimates by combining the functionality of multiple forecasting techniques. Part C transforms the data into a time-based sequence and aggregates it by hour.
Each part more or less follows the work-flow shown below.
The data set contains cash withdrawals from four (4) different ATMs during the period of 1 May 2009 to 30 April 2010. Once the data is separated so that each column of the data frame is a specific ATM, the following exploratory analysis is done.
ATM624Data = read_excel("ATM624Data.xlsx", col_types = c("date", "text", "numeric"))
ATM.new = ATM624Data %>% drop_na() %>% spread(ATM, Cash)
| DATE | ATM1 | ATM2 | ATM3 | ATM4 |
|---|---|---|---|---|
| 2009-05-01 | 96 | 107 | 0 | 776.993423 |
| 2009-05-02 | 82 | 89 | 0 | 524.417959 |
| 2009-05-03 | 85 | 90 | 0 | 792.811362 |
| 2009-05-04 | 90 | 55 | 0 | 908.238457 |
| 2009-05-05 | 99 | 79 | 0 | 52.832103 |
| 2009-05-06 | 88 | 19 | 0 | 52.208454 |
| 2009-05-07 | 8 | 2 | 0 | 55.473609 |
| 2009-05-08 | 104 | 103 | 0 | 558.503251 |
| 2009-05-09 | 87 | 107 | 0 | 904.341359 |
| 2009-05-10 | 93 | 118 | 0 | 879.493588 |
| 2009-05-11 | 86 | 75 | 0 | 274.022340 |
| 2009-05-12 | 111 | 111 | 0 | 396.108347 |
| 2009-05-13 | 75 | 25 | 0 | 274.547188 |
| 2009-05-14 | 6 | 16 | 0 | 16.321159 |
| 2009-05-15 | 102 | 137 | 0 | 852.307037 |
| 2009-05-16 | 73 | 95 | 0 | 379.561703 |
| 2009-05-17 | 92 | 103 | 0 | 31.284953 |
| 2009-05-18 | 82 | 80 | 0 | 491.850577 |
| 2009-05-19 | 86 | 118 | 0 | 83.705480 |
| 2009-05-20 | 73 | 30 | 0 | 128.653781 |
| 2009-05-21 | 20 | 7 | 0 | 14.357590 |
| 2009-05-22 | 100 | 118 | 0 | 815.358321 |
| 2009-05-23 | 93 | 104 | 0 | 758.218587 |
| 2009-05-24 | 90 | 59 | 0 | 601.421108 |
| 2009-05-25 | 94 | 40 | 0 | 906.796873 |
| 2009-05-26 | 98 | 106 | 0 | 502.907894 |
| 2009-05-27 | 73 | 18 | 0 | 88.273138 |
| 2009-05-28 | 10 | 9 | 0 | 35.438336 |
| 2009-05-29 | 97 | 136 | 0 | 338.459425 |
| 2009-05-30 | 102 | 118 | 0 | 4.547996 |
| 2009-05-31 | 85 | 64 | 0 | 122.667745 |
| 2009-06-01 | 85 | 77 | 0 | 150.234945 |
| 2009-06-02 | 108 | 133 | 0 | 721.155281 |
| 2009-06-03 | 94 | 45 | 0 | 443.012568 |
| 2009-06-04 | 14 | 14 | 0 | 17.151720 |
| 2009-06-05 | 3 | 20 | 0 | 14.887695 |
| 2009-06-06 | 96 | 147 | 0 | 740.543155 |
| 2009-06-07 | 109 | 105 | 0 | 1058.083384 |
| 2009-06-08 | 96 | 132 | 0 | 576.183701 |
| 2009-06-09 | 145 | 93 | 0 | 1484.126887 |
| 2009-06-10 | 81 | 26 | 0 | 193.787327 |
| 2009-06-11 | 16 | 7 | 0 | 27.054779 |
| 2009-06-12 | 142 | 112 | 0 | 1190.898921 |
| 2009-06-13 | NA | 91 | 0 | 746.495617 |
| 2009-06-14 | 120 | 72 | 0 | 1220.767882 |
| 2009-06-15 | 106 | 66 | 0 | 1021.503529 |
| 2009-06-16 | NA | 82 | 0 | 372.623610 |
| 2009-06-17 | 108 | 24 | 0 | 321.272935 |
| 2009-06-18 | 21 | NA | 0 | 92.476655 |
| 2009-06-19 | 140 | 134 | 0 | 116.644133 |
| 2009-06-20 | 110 | 95 | 0 | 202.413503 |
| 2009-06-21 | 115 | 82 | 0 | 524.434514 |
| 2009-06-22 | NA | 90 | 0 | 80.644354 |
| 2009-06-23 | 108 | 99 | 0 | 64.273589 |
| 2009-06-24 | 66 | NA | 0 | 90.636571 |
| 2009-06-25 | 13 | 3 | 0 | 48.046061 |
| 2009-06-26 | 99 | 117 | 0 | 1026.032172 |
| 2009-06-27 | 105 | 53 | 0 | 423.770971 |
| 2009-06-28 | 104 | 44 | 0 | 60.626453 |
| 2009-06-29 | 98 | 56 | 0 | 540.281404 |
| 2009-06-30 | 110 | 110 | 0 | 173.705568 |
| 2009-07-01 | 79 | 36 | 0 | 393.239982 |
| 2009-07-02 | 16 | 12 | 0 | 41.948621 |
| 2009-07-03 | 110 | 128 | 0 | 310.006527 |
| 2009-07-04 | 96 | 72 | 0 | 110.051915 |
| 2009-07-05 | 114 | 122 | 0 | 682.182367 |
| 2009-07-06 | 126 | 100 | 0 | 54.667904 |
| 2009-07-07 | 126 | 108 | 0 | 213.989896 |
| 2009-07-08 | 73 | 25 | 0 | 738.126360 |
| 2009-07-09 | 4 | 6 | 0 | 16.206081 |
| 2009-07-10 | 19 | 22 | 0 | 16.223205 |
| 2009-07-11 | 114 | 135 | 0 | 1050.206528 |
| 2009-07-12 | 98 | 69 | 0 | 438.477361 |
| 2009-07-13 | 97 | 52 | 0 | 546.691592 |
| 2009-07-14 | 114 | 81 | 0 | 858.236420 |
| 2009-07-15 | 78 | 27 | 0 | 446.733514 |
| 2009-07-16 | 19 | 4 | 0 | 94.672035 |
| 2009-07-17 | 102 | 147 | 0 | 644.390623 |
| 2009-07-18 | 94 | 102 | 0 | 568.815899 |
| 2009-07-19 | 108 | 83 | 0 | 704.507042 |
| 2009-07-20 | 91 | 55 | 0 | 571.645285 |
| 2009-07-21 | 86 | 74 | 0 | 479.875501 |
| 2009-07-22 | 78 | 22 | 0 | 418.864062 |
| 2009-07-23 | 16 | 4 | 0 | 27.569130 |
| 2009-07-24 | 114 | 104 | 0 | 834.834363 |
| 2009-07-25 | 115 | 81 | 0 | 910.834075 |
| 2009-07-26 | 108 | 61 | 0 | 468.106927 |
| 2009-07-27 | 102 | 70 | 0 | 768.121660 |
| 2009-07-28 | 129 | 126 | 0 | 1089.167762 |
| 2009-07-29 | 79 | 33 | 0 | 266.898417 |
| 2009-07-30 | 13 | 7 | 0 | 7.110665 |
| 2009-07-31 | 103 | 126 | 0 | 704.192012 |
| 2009-08-01 | 90 | 92 | 0 | 495.350606 |
| 2009-08-02 | 68 | 81 | 0 | 142.638951 |
| 2009-08-03 | 85 | 49 | 0 | 428.560174 |
| 2009-08-04 | 99 | 146 | 0 | 894.969364 |
| 2009-08-05 | 86 | 79 | 0 | 610.290623 |
| 2009-08-06 | 13 | 37 | 0 | 70.613731 |
| 2009-08-07 | 116 | 136 | 0 | 593.570261 |
| 2009-08-08 | 105 | 111 | 0 | 341.601807 |
| 2009-08-09 | 123 | 78 | 0 | 735.035708 |
| 2009-08-10 | 114 | 57 | 0 | 462.773028 |
| 2009-08-11 | 127 | 106 | 0 | 1156.496108 |
| 2009-08-12 | 111 | 38 | 0 | 454.091939 |
| 2009-08-13 | 34 | 15 | 0 | 283.337265 |
| 2009-08-14 | 151 | 119 | 0 | 571.508226 |
| 2009-08-15 | 110 | 110 | 0 | 772.179652 |
| 2009-08-16 | 115 | 68 | 0 | 260.125517 |
| 2009-08-17 | 112 | 60 | 0 | 357.515267 |
| 2009-08-18 | 132 | 92 | 0 | 16.157597 |
| 2009-08-19 | 94 | 22 | 0 | 334.312503 |
| 2009-08-20 | 24 | 11 | 0 | 25.696404 |
| 2009-08-21 | 122 | 121 | 0 | 254.887098 |
| 2009-08-22 | 104 | 89 | 0 | 357.003129 |
| 2009-08-23 | 128 | 62 | 0 | 1245.594280 |
| 2009-08-24 | 120 | 79 | 0 | 917.371157 |
| 2009-08-25 | 174 | 83 | 0 | 592.180082 |
| 2009-08-26 | 96 | 31 | 0 | 412.474975 |
| 2009-08-27 | 13 | 4 | 0 | 82.911171 |
| 2009-08-28 | 121 | 96 | 0 | 996.010226 |
| 2009-08-29 | 133 | 50 | 0 | 103.910375 |
| 2009-08-30 | 118 | 52 | 0 | 1116.915044 |
| 2009-08-31 | 91 | 56 | 0 | 816.847015 |
| 2009-09-01 | 120 | 104 | 0 | 914.493389 |
| 2009-09-02 | 88 | 27 | 0 | 648.209198 |
| 2009-09-03 | 19 | 13 | 0 | 140.515569 |
| 2009-09-04 | 150 | 107 | 0 | 1495.154775 |
| 2009-09-05 | 144 | 125 | 0 | 1301.396344 |
| 2009-09-06 | 121 | 103 | 0 | 779.717193 |
| 2009-09-07 | 105 | 42 | 0 | 744.262278 |
| 2009-09-08 | 133 | 90 | 0 | 200.391803 |
| 2009-09-09 | 109 | 29 | 0 | 854.179084 |
| 2009-09-10 | 18 | 8 | 0 | 50.719369 |
| 2009-09-11 | 1 | 2 | 0 | 7.404799 |
| 2009-09-12 | 105 | 84 | 0 | 1061.192780 |
| 2009-09-13 | 112 | 62 | 0 | 715.021459 |
| 2009-09-14 | 82 | 77 | 0 | 35.444668 |
| 2009-09-15 | 111 | 78 | 0 | 491.884510 |
| 2009-09-16 | 79 | 25 | 0 | 343.496923 |
| 2009-09-17 | 13 | 8 | 0 | 20.714186 |
| 2009-09-18 | 112 | 113 | 0 | 505.717125 |
| 2009-09-19 | 99 | 71 | 0 | 97.325113 |
| 2009-09-20 | 140 | 94 | 0 | 473.570486 |
| 2009-09-21 | 110 | 59 | 0 | 899.773474 |
| 2009-09-22 | 180 | 89 | 0 | 1712.074986 |
| 2009-09-23 | 73 | 18 | 0 | 281.388412 |
| 2009-09-24 | 7 | 6 | 0 | 26.337969 |
| 2009-09-25 | 106 | 115 | 0 | 328.644988 |
| 2009-09-26 | 103 | 81 | 0 | 761.371143 |
| 2009-09-27 | 93 | 61 | 0 | 629.319901 |
| 2009-09-28 | 96 | 71 | 0 | 235.753411 |
| 2009-09-29 | 117 | 69 | 0 | 1195.223181 |
| 2009-09-30 | 80 | 36 | 0 | 782.407474 |
| 2009-10-01 | 14 | 14 | 0 | 108.331743 |
| 2009-10-02 | 120 | 104 | 0 | 846.556566 |
| 2009-10-03 | 91 | 73 | 0 | 576.169344 |
| 2009-10-04 | 96 | 86 | 0 | 441.883927 |
| 2009-10-05 | 74 | 85 | 0 | 319.404846 |
| 2009-10-06 | 108 | 126 | 0 | 154.163484 |
| 2009-10-07 | 73 | 31 | 0 | 543.239375 |
| 2009-10-08 | 13 | 9 | 0 | 124.334415 |
| 2009-10-09 | 93 | 114 | 0 | 449.074910 |
| 2009-10-10 | 94 | 78 | 0 | 614.654064 |
| 2009-10-11 | 76 | 45 | 0 | 219.237855 |
| 2009-10-12 | 111 | 60 | 0 | 945.736417 |
| 2009-10-13 | 88 | 91 | 0 | 9.691243 |
| 2009-10-14 | 76 | 22 | 0 | 696.245755 |
| 2009-10-15 | 9 | 7 | 0 | 8.084283 |
| 2009-10-16 | 87 | 75 | 0 | 845.463546 |
| 2009-10-17 | 105 | 66 | 0 | 212.902683 |
| 2009-10-18 | 78 | 64 | 0 | 9.155135 |
| 2009-10-19 | 67 | 51 | 0 | 46.831421 |
| 2009-10-20 | 90 | 94 | 0 | 30.372376 |
| 2009-10-21 | 68 | 23 | 0 | 400.484974 |
| 2009-10-22 | 9 | 4 | 0 | 61.338241 |
| 2009-10-23 | 78 | 127 | 0 | 428.046952 |
| 2009-10-24 | 74 | 61 | 0 | 1.563260 |
| 2009-10-25 | 74 | 0 | 0 | 9.018826 |
| 2009-10-26 | 60 | 95 | 0 | 50.304215 |
| 2009-10-27 | 75 | 79 | 0 | 313.414359 |
| 2009-10-28 | 61 | 38 | 0 | 626.687136 |
| 2009-10-29 | 9 | 8 | 0 | 5.882076 |
| 2009-10-30 | 90 | 119 | 0 | 77.803024 |
| 2009-10-31 | 86 | 57 | 0 | 337.894871 |
| 2009-11-01 | 86 | 58 | 0 | 211.812232 |
| 2009-11-02 | 79 | 80 | 0 | 690.122884 |
| 2009-11-03 | 90 | 82 | 0 | 596.381877 |
| 2009-11-04 | 80 | 49 | 0 | 65.278483 |
| 2009-11-05 | 21 | 16 | 0 | 77.444454 |
| 2009-11-06 | 93 | 116 | 0 | 43.721410 |
| 2009-11-07 | 104 | 61 | 0 | 964.175255 |
| 2009-11-08 | 109 | 59 | 0 | 834.959599 |
| 2009-11-09 | 88 | 80 | 0 | 636.962626 |
| 2009-11-10 | 96 | 86 | 0 | 927.078614 |
| 2009-11-11 | 70 | 23 | 0 | 75.873055 |
| 2009-11-12 | 15 | 7 | 0 | 43.690550 |
| 2009-11-13 | 73 | 91 | 0 | 621.292501 |
| 2009-11-14 | 94 | 57 | 0 | 312.582041 |
| 2009-11-15 | 108 | 58 | 0 | 825.635396 |
| 2009-11-16 | 73 | 61 | 0 | 413.620440 |
| 2009-11-17 | 87 | 77 | 0 | 194.047355 |
| 2009-11-18 | 75 | 20 | 0 | 345.695900 |
| 2009-11-19 | 10 | 5 | 0 | 32.428111 |
| 2009-11-20 | 92 | 132 | 0 | 655.315200 |
| 2009-11-21 | 87 | 49 | 0 | 638.136824 |
| 2009-11-22 | 74 | 57 | 0 | 15.306757 |
| 2009-11-23 | 73 | 68 | 0 | 299.663010 |
| 2009-11-24 | 93 | 80 | 0 | 626.662127 |
| 2009-11-25 | 66 | 31 | 0 | 601.141158 |
| 2009-11-26 | 18 | 3 | 0 | 64.131249 |
| 2009-11-27 | 99 | 85 | 0 | 562.763323 |
| 2009-11-28 | 94 | 53 | 0 | 317.253589 |
| 2009-11-29 | 136 | 46 | 0 | 1167.264437 |
| 2009-11-30 | 6 | 2 | 0 | 47.365125 |
| 2009-12-01 | 140 | 113 | 0 | 993.777470 |
| 2009-12-02 | 73 | 22 | 0 | 687.196935 |
| 2009-12-03 | 9 | 5 | 0 | 71.147567 |
| 2009-12-04 | 140 | 112 | 0 | 1046.971281 |
| 2009-12-05 | 103 | 59 | 0 | 1009.050230 |
| 2009-12-06 | 110 | 72 | 0 | 288.649516 |
| 2009-12-07 | 90 | 77 | 0 | 591.508331 |
| 2009-12-08 | 135 | 85 | 0 | 230.605915 |
| 2009-12-09 | 67 | 27 | 0 | 578.431798 |
| 2009-12-10 | 12 | 1 | 0 | 70.210639 |
| 2009-12-11 | 109 | 91 | 0 | 580.796406 |
| 2009-12-12 | 84 | 36 | 0 | 149.150820 |
| 2009-12-13 | 92 | 46 | 0 | 403.861263 |
| 2009-12-14 | 84 | 100 | 0 | 124.958721 |
| 2009-12-15 | 118 | 73 | 0 | 230.148897 |
| 2009-12-16 | 68 | 22 | 0 | 19.008453 |
| 2009-12-17 | 14 | 9 | 0 | 128.638640 |
| 2009-12-18 | 90 | 117 | 0 | 327.965321 |
| 2009-12-19 | 92 | 44 | 0 | 532.245673 |
| 2009-12-20 | 93 | 44 | 0 | 877.397771 |
| 2009-12-21 | 85 | 78 | 0 | 662.112869 |
| 2009-12-22 | 93 | 89 | 0 | 300.629002 |
| 2009-12-23 | 70 | 33 | 0 | 667.628261 |
| 2009-12-24 | 13 | 5 | 0 | 14.573322 |
| 2009-12-25 | 90 | 102 | 0 | 660.355296 |
| 2009-12-26 | 91 | 68 | 0 | 510.988823 |
| 2009-12-27 | 102 | 64 | 0 | 164.462091 |
| 2009-12-28 | 97 | 81 | 0 | 748.172762 |
| 2009-12-29 | 42 | 9 | 0 | 174.241405 |
| 2009-12-30 | 2 | 2 | 0 | 20.192725 |
| 2009-12-31 | 14 | 2 | 0 | 100.686543 |
| 2010-01-01 | 132 | 117 | 0 | 985.956884 |
| 2010-01-02 | 108 | 56 | 0 | 597.058060 |
| 2010-01-03 | 120 | 55 | 0 | 468.457250 |
| 2010-01-04 | 120 | 112 | 0 | 856.581158 |
| 2010-01-05 | 90 | 100 | 0 | 684.774261 |
| 2010-01-06 | 48 | 17 | 0 | 381.562058 |
| 2010-01-07 | 15 | 7 | 0 | 152.092362 |
| 2010-01-08 | 86 | 49 | 0 | 271.934402 |
| 2010-01-09 | 109 | 96 | 0 | 135.499645 |
| 2010-01-10 | 115 | 47 | 0 | 1105.018595 |
| 2010-01-11 | 123 | 107 | 0 | 291.765790 |
| 2010-01-12 | 123 | 93 | 0 | 1141.426655 |
| 2010-01-13 | 60 | 27 | 0 | 140.502720 |
| 2010-01-14 | 22 | 9 | 0 | 8.781528 |
| 2010-01-15 | 81 | 78 | 0 | 85.260115 |
| 2010-01-16 | 98 | 47 | 0 | 66.655933 |
| 2010-01-17 | 94 | 31 | 0 | 709.938568 |
| 2010-01-18 | 76 | 91 | 0 | 567.549590 |
| 2010-01-19 | 96 | 77 | 0 | 486.824572 |
| 2010-01-20 | 72 | 22 | 0 | 17.481238 |
| 2010-01-21 | 12 | 1 | 0 | 49.208112 |
| 2010-01-22 | 91 | 125 | 0 | 357.233561 |
| 2010-01-23 | 74 | 53 | 0 | 179.888632 |
| 2010-01-24 | 104 | 37 | 0 | 728.992926 |
| 2010-01-25 | 74 | 69 | 0 | 261.121740 |
| 2010-01-26 | 91 | 69 | 0 | 628.863047 |
| 2010-01-27 | 66 | 16 | 0 | 277.044815 |
| 2010-01-28 | 28 | 2 | 0 | 41.475487 |
| 2010-01-29 | 152 | 95 | 0 | 1574.779330 |
| 2010-01-30 | 97 | 45 | 0 | 669.707653 |
| 2010-01-31 | 100 | 61 | 0 | 979.675944 |
| 2010-02-01 | 85 | 71 | 0 | 426.406008 |
| 2010-02-02 | 123 | 91 | 0 | 153.242692 |
| 2010-02-03 | 46 | 28 | 0 | 274.949921 |
| 2010-02-04 | 38 | 9 | 0 | 136.276358 |
| 2010-02-05 | 138 | 97 | 0 | 454.416950 |
| 2010-02-06 | 74 | 60 | 0 | 458.304270 |
| 2010-02-07 | 85 | 27 | 0 | 112.030628 |
| 2010-02-08 | 78 | 74 | 0 | 417.912276 |
| 2010-02-09 | 123 | 45 | 0 | 10919.761638 |
| 2010-02-10 | 50 | 26 | 0 | 42.438078 |
| 2010-02-11 | 28 | 2 | 0 | 280.043427 |
| 2010-02-12 | 105 | 73 | 0 | 412.318881 |
| 2010-02-13 | 109 | 48 | 0 | 852.837416 |
| 2010-02-14 | 26 | 9 | 0 | 179.702084 |
| 2010-02-15 | 54 | 25 | 0 | 226.047368 |
| 2010-02-16 | 105 | 58 | 0 | 989.195610 |
| 2010-02-17 | 179 | 110 | 0 | 824.916781 |
| 2010-02-18 | 125 | 94 | 0 | 966.610561 |
| 2010-02-19 | 84 | 25 | 0 | 734.220167 |
| 2010-02-20 | 99 | 101 | 0 | 121.332622 |
| 2010-02-21 | 110 | 77 | 0 | 287.931829 |
| 2010-02-22 | 80 | 6 | 0 | 502.748346 |
| 2010-02-23 | 1 | 3 | 0 | 9.752776 |
| 2010-02-24 | 111 | 99 | 0 | 258.138869 |
| 2010-02-25 | 115 | 86 | 0 | 1170.288228 |
| 2010-02-26 | 108 | 20 | 0 | 193.079004 |
| 2010-02-27 | 100 | 72 | 0 | 402.770395 |
| 2010-02-28 | 152 | 81 | 0 | 1275.968167 |
| 2010-03-01 | 90 | 16 | 0 | 819.881512 |
| 2010-03-02 | 4 | 2 | 0 | 26.392615 |
| 2010-03-03 | 128 | 110 | 0 | 893.884507 |
| 2010-03-04 | 87 | 75 | 0 | 360.641557 |
| 2010-03-05 | 84 | 35 | 0 | 859.899389 |
| 2010-03-06 | 69 | 102 | 0 | 381.473089 |
| 2010-03-07 | 112 | 100 | 0 | 9.711205 |
| 2010-03-08 | 62 | 9 | 0 | 601.016953 |
| 2010-03-09 | 4 | 2 | 0 | 32.454958 |
| 2010-03-10 | 94 | 100 | 0 | 553.395793 |
| 2010-03-11 | 102 | 85 | 0 | 572.374702 |
| 2010-03-12 | 98 | 33 | 0 | 218.831558 |
| 2010-03-13 | 88 | 24 | 0 | 828.326828 |
| 2010-03-14 | 123 | 111 | 0 | 630.762176 |
| 2010-03-15 | 55 | 5 | 0 | 339.418304 |
| 2010-03-16 | 4 | 1 | 0 | 31.502210 |
| 2010-03-17 | 92 | 106 | 0 | 486.679944 |
| 2010-03-18 | 88 | 102 | 0 | 335.363138 |
| 2010-03-19 | 92 | 25 | 0 | 340.008595 |
| 2010-03-20 | 92 | 64 | 0 | 291.282267 |
| 2010-03-21 | 117 | 78 | 0 | 46.029410 |
| 2010-03-22 | 73 | 5 | 0 | 201.403616 |
| 2010-03-23 | 3 | 1 | 0 | 9.558300 |
| 2010-03-24 | 85 | 83 | 0 | 877.966103 |
| 2010-03-25 | 88 | 78 | 0 | 778.110791 |
| 2010-03-26 | 76 | 66 | 0 | 707.613494 |
| 2010-03-27 | 93 | 15 | 0 | 351.457835 |
| 2010-03-28 | 116 | 64 | 0 | 711.049107 |
| 2010-03-29 | 68 | 3 | 0 | 502.518192 |
| 2010-03-30 | 19 | 0 | 0 | 23.337569 |
| 2010-03-31 | 127 | 102 | 0 | 492.909936 |
| 2010-04-01 | 93 | 99 | 0 | 405.312584 |
| 2010-04-02 | 97 | 41 | 0 | 818.394331 |
| 2010-04-03 | 102 | 79 | 0 | 152.270800 |
| 2010-04-04 | 109 | 71 | 0 | 281.113528 |
| 2010-04-05 | 68 | 9 | 0 | 470.425772 |
| 2010-04-06 | 4 | 2 | 0 | 2.507317 |
| 2010-04-07 | 105 | 103 | 0 | 414.741223 |
| 2010-04-08 | 79 | 67 | 0 | 719.426783 |
| 2010-04-09 | 90 | 85 | 0 | 811.716825 |
| 2010-04-10 | 87 | 78 | 0 | 889.584209 |
| 2010-04-11 | 92 | 67 | 0 | 616.127547 |
| 2010-04-12 | 63 | 12 | 0 | 61.144354 |
| 2010-04-13 | 3 | 1 | 0 | 27.325671 |
| 2010-04-14 | 103 | 97 | 0 | 767.778153 |
| 2010-04-15 | 80 | 106 | 0 | 326.084093 |
| 2010-04-16 | 80 | 74 | 0 | 825.197816 |
| 2010-04-17 | 84 | 86 | 0 | 383.815025 |
| 2010-04-18 | 81 | 55 | 0 | 195.483454 |
| 2010-04-19 | 88 | 13 | 0 | 711.164653 |
| 2010-04-20 | 3 | 1 | 0 | 29.869011 |
| 2010-04-21 | 100 | 100 | 0 | 556.792330 |
| 2010-04-22 | 69 | 103 | 0 | 386.175335 |
| 2010-04-23 | 85 | 44 | 0 | 165.294181 |
| 2010-04-24 | 85 | 61 | 0 | 5.451815 |
| 2010-04-25 | 109 | 89 | 0 | 542.280602 |
| 2010-04-26 | 74 | 11 | 0 | 403.839336 |
| 2010-04-27 | 4 | 2 | 0 | 13.697331 |
| 2010-04-28 | 96 | 107 | 96 | 348.201061 |
| 2010-04-29 | 82 | 89 | 82 | 44.245345 |
| 2010-04-30 | 85 | 90 | 85 | 482.287107 |
From the few summary statistics below, each ATM appears to have some discrepancies. Firstly, ATM1 has a total of n = 362 records, suggesting missing data; similarly, ATM2 has n = 363. Next, ATM3 has a mean of 0.72 over 365 days, with withdrawals ranging from $0 to $9,600. Moreover, the mean absolute deviation for ATM3 is 0, suggesting almost no variation in the data. On the other hand, ATM4 appears to be well used, with a mean daily withdrawal of $47,400.
| | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ATM1 | 362 | 83.89 | 36.66 | 91.00 | 86.65 | 25.20 | 1.00 | 180.00 | 179.0 | -0.71 | 0.19 | 1.93 |
| ATM2 | 363 | 62.58 | 38.90 | 67.00 | 62.24 | 48.93 | 0.00 | 147.00 | 147.0 | -0.03 | -1.09 | 2.04 |
| ATM3 | 365 | 0.72 | 7.94 | 0.00 | 0.00 | 0.00 | 0.00 | 96.00 | 96.0 | 10.93 | 118.38 | 0.42 |
| ATM4 | 365 | 474.04 | 650.94 | 403.84 | 416.68 | 425.80 | 1.56 | 10919.76 | 10918.2 | 11.39 | 178.91 | 34.07 |
Lastly, further examination of the data for ATM3 and ATM4 shows that withdrawals were made from ATM #3 on only 3 days, explaining the odd summary statistics and the number of outliers in the boxplot below. As for ATM4, there is one particular outlier far beyond the typical amount of cash withdrawn from that ATM.
## [1] "Number of days withdrawal were made using ATM #3 is 3 days."
## [1] "On 2010-02-09, an unexpectly high amount of $1091.98K was withdraw from ATM #4."
par(mfrow = c(1,4))
for (i in 2:5){
boxplot(
ATM.new[i], main = sprintf("%s", names(ATM.new)[i]), col = "steelblue2",outcex = 1,
xlab = sprintf("# of outliers = %d", length(boxplot(ATM.new[i], plot = FALSE)$out)))
}
A view of the time series plots highlights the weekly seasonality, particularly for ATM1 and ATM2. The third time series plot shows a flatline until the very end of April 2010. This flatline may suggest that ATM3 was newly installed or replaced, with data only collected from this machine at the end of April, or that it was decommissioned for some time before operating once again. As for ATM4, the outlier is too extreme and uncommon.
ATM624Data %>% drop_na() %>% ggplot(aes(x = DATE, y = Cash, col = ATM)) +
geom_line(show.legend = FALSE) +
facet_wrap(~ ATM, ncol = 1, scales = "free_y") +
labs(title = "ATM Cash Withdrawal", subtitle ="1 May, 2009 to 30 April, 2010",
x = "Date") +
scale_y_continuous("Amount of Cash Withdrawal", labels = scales::dollar_format(scale = 0.1, suffix = "K"))Based on this exploratory analysis, each ATMs exhibited different patterns and will require a few tidying and transformation techniques including missing data treatment. Moreover, forecasting should be conducted on each ATM data sets separately. Thus, the next task will be to estimate missing values and outlier replacements for each ATMs, such that for:
- ATM1 and ATM2: missing data are estimated.
- ATM3: no change is made due to limited data.
- ATM4: the outlier is replaced.

Missing data are handled by fitting a seasonal model to the data, interpolating the seasonally adjusted series, and then re-seasonalizing. To identify outliers and suggest reasonable replacements, residuals are obtained by fitting a periodic STL decomposition to the seasonal data. Residuals are deemed outliers if they lie outside the range \(\pm 2(q_{0.9} - q_{0.1})\), where \(q_p\) is the p-quantile of the residuals.
The tsclean() function from the forecast package combines both procedures: it returns a cleaned version of a time series with outliers and missing values replaced by estimated values.
temp = ts(ATM.new %>% select(-DATE))
ATM.ts = temp
for (i in 1:4){
ATM.ts[,i] = tsclean(ATM.ts[,i])
}
As a result, the summary statistics show that the missing values from ATM1 and ATM2 have been interpolated, and the outlier from ATM4 has been replaced.
describe(ATM.ts)[,-1] %>%
kable(digits = 2L, caption = "Descriptive Statistics of ATM dataset") %>%
kable_styling(bootstrap_options = "striped", full_width = TRUE)
| | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ATM1 | 365 | 84.15 | 36.64 | 91.00 | 86.93 | 25.20 | 1.00 | 180.00 | 179.00 | -0.72 | 0.20 | 1.92 |
| ATM2 | 365 | 62.59 | 38.81 | 67.00 | 62.26 | 48.93 | 0.00 | 147.00 | 147.00 | -0.03 | -1.09 | 2.03 |
| ATM3 | 365 | 0.72 | 7.94 | 0.00 | 0.00 | 0.00 | 0.00 | 96.00 | 96.00 | 10.93 | 118.38 | 0.42 |
| ATM4 | 365 | 444.76 | 351.08 | 402.77 | 414.35 | 424.21 | 1.56 | 1712.07 | 1710.51 | 0.66 | -0.10 | 18.38 |
ATM.df = reshape2::melt(ATM.ts)
ATM.df = cbind(DATE = seq(as.Date("2009-05-1"), as.Date("2010-04-30"), length.out = 365), ATM.df[,-1])
names(ATM.df) = c("DATE", "ATM", "Cash")
Thus, below is the time series of cash withdrawals from the 4 different ATMs over the period 1 May 2009 to 30 April 2010. While there are not many visual differences in the new time series plots for ATM 1-3, the appearance of ATM4 has greatly improved once the extreme outlier was replaced.
ATM.df %>% ggplot(aes(x = DATE, y = Cash, col = ATM)) +
geom_line(show.legend = FALSE) +
facet_wrap(~ ATM, ncol = 1, scales = "free_y") +
labs(title = "ATM Cash Withdrawal", subtitle ="1 May, 2009 to 30 April, 2010",
x = "Date") +
scale_y_continuous("Amount of Cash Withdrawal", labels = scales::dollar_format(scale = 0.1, suffix = "K"))library(gganimate)
ts = ggplot(ATM.df, aes(x = DATE, y = Cash, group = ATM, color = ATM)) +
geom_line() + geom_point() +
labs(title = "ATM Cash Withdrawal", subtitle ="1 May, 2009 to 30 April, 2010",
x = "Date") +
scale_y_continuous("Amount of Cash Withdrawal",
labels = scales::dollar_format(scale = 0.1, suffix = "K")) +
theme_minimal() + transition_reveal(DATE)
animate(ts, height = 6, width = 9, units = "in", res = 150)
anim_save("atm.gif")By zooming in and observing the series for a few weeks, there are distinct dips for each of the ATMs. With a month, there are four dips, further emphasizing the weekly seasonality.
df = with(ATM.df, ATM.df[(DATE >= "2009-07-01" & DATE <= "2009-08-31"), ])
ts = ggplot(df, aes(x = DATE, y = Cash, group = ATM, color = ATM)) +
geom_line() + geom_point() +
labs(title = "ATM Cash Withdrawal", subtitle ="1 July, 2009 to 31 August, 2009",
x = "Date") +
scale_y_continuous("Amount of Cash Withdrawal",
labels = scales::dollar_format(scale = 0.1, suffix = "K")) +
theme_minimal() + transition_reveal(DATE)
animate(ts, height = 6, width = 9, units = "in", res = 150)
anim_save("atm_zoom.gif")In the previous sections, the data itself was examined and prep to prevent bias and errors in the forecast. And so, this section will now dive into the time series themselves to examine their characteristics and determine the best modeling technique to forecast future cash withdrawal amount per ATM.
# Transform the dataframe into a time series
temp = ATM.df %>% drop_na() %>% spread(ATM, Cash)
ATM.ts = ts(temp %>% select(-DATE), frequency = 7)
From the time series plot, it is evident that there is seasonality in this set, weekly as previously established. While there is no steady trend over the weeks, there are periods of increasing and decreasing activity; for instance, there is a downward trend from week 22 to 27. Moreover, the subseries plot highlights that Tuesday tends to have the highest mean cash withdrawal while Saturday has the lowest. The ACF shows a slow decrease at every 7th lag due to the trend, and expanding to longer lags, the autocorrelations form a "scalloped" shape, further confirming the seasonality. In the partial autocorrelation (PACF), there are a few significant lags in the beginning, while all others are within the critical limit, suggesting no meaningful autocorrelation at the remaining lags. Altogether, the time series plots for ATM1 confirm that it is non-stationary, shows seasonality and slight trend variations, and will require differencing at the seasonal lag to be made stationary.
ggtsdisplay(ATM.ts[,1], main = "ATM #1 Cash Withdrawal", ylab = "Cash Withdrawal (in hundreds)", xlab = "Week")
ggsubseriesplot(ATM.ts[,1]) +
labs(title = "ATM #1 Cash Withdrawal", subtitle ="1 May, 2009 to 30 April, 2010",
x = "Days of the Week") +
scale_y_continuous("Amount of Cash Withdrawal",
labels = scales::dollar_format(scale = 0.1, suffix = "K"))
For this time series subset, it was deemed that stabilizing the seasonal variability throughout the year may be necessary. Using a Box-Cox transformation, the optimal \(\lambda = 0.26\). The resulting transformation shows that the smaller variability was stretched, while the larger variability was diminished.
bct = function(ts, title){
a = autoplot(ts) +
labs(title = sprintf("Before: %s", title), x = "Week") +
scale_y_continuous("Amount of Cash Withdrawal",
labels = scales::dollar_format(scale = 0.1, suffix = "K"))
lambda = BoxCox.lambda(ts)
at = autoplot(BoxCox(ts, lambda)) +
labs(title = sprintf("Transformed: %s", title),
subtitle = sprintf("lambda = %0.2f", lambda),
y = "Box-Cox Transformed",
x = "Week")
gridExtra::grid.arrange(a, at)
}
lambda = BoxCox.lambda(ATM.ts[,1])
bct(ATM.ts[,1], title = "ATM #1 Cash Withdrawal")
At this point, applying a decomposition to ATM1 is useful for further studying the time series and exploring historical changes over time. The decomposition of the series into seasonal, trend, and irregular components is done using loess, i.e. the STL decomposition.
ATM.ts[,1] %>% stl(s.window = 7, robust = TRUE) %>%
autoplot() +
labs(title = "Seasonal & Trend Decomposition for ATM #1 Time Series", x = "Week")The decomposition highlights the seasonality component changes drastically from week 36 to week 43. The panel bars indicate that the variation attributed to the trend is much smaller than the seasonal component and consequently only a small part of the variation in the data series.
By considering variations in the combinations of the trend and seasonal components, exponential smoothing methods are not ruled out as a forecasting tool. This technique selects a possible model based on the lowest AICc. It shows that the ETS(A, N, A) model best fits the transformed ATM1 data, i.e. exponential smoothing with additive error, no trend component, and additive seasonality.
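The Call field in the summary below indicates how the model was fitted; a minimal sketch of that call, assuming lambda holds the Box-Cox value computed earlier:
ATM.ts[,1] %>% ets(lambda = lambda, biasadj = TRUE)  # assumed pipeline; ETS form selected automatically by AICc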
## ETS(A,N,A)
##
## Call:
## ets(y = ., lambda = lambda, biasadj = TRUE)
##
## Box-Cox transformation: lambda= 0.2616
##
## Smoothing parameters:
## alpha = 1e-04
## gamma = 0.3513
##
## Initial states:
## l = 7.9717
## s = -4.5094 0.5635 1.0854 0.5711 0.9551 0.5582
## 0.7761
##
## sigma: 1.343
##
## AIC AICc BIC
## 2379.653 2380.275 2418.652
Note that the Box-Cox transformation was kept because it helps to stabilize the variance of a time series. Differencing can help stabilize the mean of a time series by removing changes in its level, thereby reducing trend and seasonality.
## [1] "And so, the number of differences required for time series to be made stationary is 0."
## [1] "While, the number of differences required for time series to be made seasonally stationary is 1."
ARIMA models are capable of modeling a wide range of seasonal and non-seasonal data. The data is non-stationary with seasonality, so a seasonal difference is applied, D = 1. After the Box-Cox transformation, no first difference is needed, d = 0. The unit root test statistic is smaller than the 1% critical value and falls well within the range expected for stationary data; the seasonal differencing has made the series stationary.
##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 5 lags.
##
## Value of test-statistic is: 0.0153
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
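The output above matches the urca package's KPSS test; a sketch of how it can be reproduced, where the definition of tst (the Box-Cox-transformed series referenced in the next chunk) is an assumption:
library(urca)
tst = BoxCox(ATM.ts[,1], lambda)                    # assumed definition of tst
tst %>% diff(lag = 7) %>% ur.kpss() %>% summary()   # KPSS test on the seasonally differenced series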
ggtsdisplay(diff(tst, lag = 7),
main = "ATM #1 Cash Withdrawal",
ylab = "Cash Withdrawal",
xlab = "Week")The aim now is to find an appropriate ARIMA model based on the ACF and PACF. Ignoring one significant spike in each plot that is just outside the limits, and not in the first few lags, the ACF plot is sinusoidal decaying. From the non-seasonal component of the ACF, there are small significant spikes at lag 1 and 6, and a larger one at lag 7. While with the seasonal component, there are none beyond lag 7. This suggests q > 1 and Q = 1. The PACF plot is also sinusoidal decaying, but there is no change in the non-seasonal and seasonal components when multiple differencing was performed. This could suggest that p = 0, and P = 0.
In an \(ARIMA(p, d, q)(P, D, Q)_m\) model, p/P is the order of the autoregressive part, d/D the degree of first differencing involved, q/Q the order of the moving average part, and m the number of observations per seasonal cycle (here m = 7 for weekly seasonality). Altogether, possible models should have a sequence similar to ARIMA(0,0,>1)(0,1,1)[7].
To confirm, the auto.arima() function selects the orders by minimizing the information criteria. The suggested model is ARIMA(0,0,2)(0,1,1)[7], which will be used since the deduction by observation was very close.
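A sketch of the corresponding call (the exact pipeline is not shown, so this is an assumption consistent with the output below):
ATM.ts[,1] %>% auto.arima(lambda = lambda, biasadj = TRUE)  # orders chosen by minimizing the information criteria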
## Series: .
## ARIMA(0,0,2)(0,1,1)[7]
## Box Cox transformation: lambda= 0.2615708
##
## Coefficients:
## ma1 ma2 sma1
## 0.1126 -0.1094 -0.6418
## s.e. 0.0524 0.0520 0.0432
##
## sigma^2 estimated as 1.764: log likelihood=-609.99
## AIC=1227.98 AICc=1228.09 BIC=1243.5
Therefore, the proposed models that might be suitable for fitting ATM1 include:
| Technique | Proposed Model |
|---|---|
| Forecasting with Decomposition | STL Decomposition |
| Forecasting with Exponential Smoothing | ETS(A, N, A) |
| Forecasting with ARIMA | ARIMA(0,0,2)(0,1,1)[7] |
From the time series plot, it is evident that there is seasonality in this set, weekly as previously established. However, in this case, there is no steady trend over the year. Moreover, the subseries plot highlights that Sunday tends to have the highest mean cash withdrawal while Saturday has the lowest. Expanding the ACF lags highlights that the autocorrelations form a "scalloped" shape, further confirming the seasonality. In the partial autocorrelation (PACF), there are a few significant lags in the beginning, while all others are within the critical limit. Altogether, the time series plots for ATM2 confirm that it is non-stationary and seasonal, with no trend, and will require differencing at the seasonal lag to be made stationary.
ggtsdisplay(ATM.ts[,2], main = "ATM #2 Cash Withdrawal", ylab = "Cash Withdrawal (in hundreds)", xlab = "Week")
ggsubseriesplot(ATM.ts[,2]) +
labs(title = "ATM #2 Cash Withdrawal", subtitle ="1 May, 2009 to 30 April, 2010",
x = "Days of the Week") +
scale_y_continuous("Amount of Cash Withdrawal",
labels = scales::dollar_format(scale = 0.1, suffix = "K"))
For this time series subset, it was considered that stabilizing the seasonal variability throughout the year may be necessary. Using a Box-Cox transformation, the optimal \(\lambda = 0.72\). The resulting transformation shows that two peculiar spikes remain large even after transformation.
At this point, applying a decomposition to ATM2 is useful for further studying the time series and exploring historical changes over time. The decomposition of the transformed series is done using the STL decomposition.
ATM.ts[,2] %>% stl(s.window = 7, robust = TRUE) %>%
autoplot() +
labs(title = "Seasonal & Trend Decomposition for ATM #2 Time Series", x = "Week")The decomposition highlights the seasonality component changes drastically from week 40 to week 43, quite similar to ATM1. The panel bar indicates the trend is much smaller than the seasonal component and consequently only a very small part of the variation in the data series, so it was not considered to be influential.
By considering variations in the seasonal components, exponential smoothing methods are not ruled out as a forecasting tool. This technique selects a possible model based on the lowest information criterion. It shows that the ETS(A, N, A) model best fits the ATM2 data, i.e. exponential smoothing with additive error, no trend component, and additive seasonality.
## ETS(A,N,A)
##
## Call:
## ets(y = ., lambda = lambda, biasadj = TRUE)
##
## Box-Cox transformation: lambda= 0.7243
##
## Smoothing parameters:
## alpha = 1e-04
## gamma = 0.3852
##
## Initial states:
## l = 26.7912
## s = -17.8422 -13.3191 10.8227 1.8426 4.2781 5.7994
## 8.4185
##
## sigma: 8.5054
##
## AIC AICc BIC
## 3727.060 3727.682 3766.059
Differencing can help stabilize the mean of a time series by removing changes in its level, thereby reducing seasonality.
## [1] "And so, the number of differences required for time series to be made stationary is 1."
## [1] "While, the number of differences required for time series to be made seasonally stationary is 1."
ARIMA models are capable of modeling a wide range of seasonal and non-seasonal data. The data is non-stationary with seasonality, so there will be a seasonal difference, D = 1; the results above also suggest one first difference, d = 1. The unit root test statistic is smaller than the 1% critical value and falls well within the range expected for stationary data; thus, the differencing has made the data stationary. However, differencing in either the seasonal or the non-seasonal component alone still results in stationary data. Therefore, d and D can each be either 0 or 1.
##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 5 lags.
##
## Value of test-statistic is: 0.0162
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
ggtsdisplay(diff(tst, lag = 7),
main = "ATM #2 Cash Withdrawal",
ylab = "Cash Withdrawal",
xlab = "Week")The aim now is to find an appropriate ARIMA model based on the ACF and PACF. Ignoring one significant spike in each plot that is just outside the limits, and not in the first few lags, both plots are sinusoidal decaying. From the first seasonal differencing component, significant lags were ending along with lag 3, and no change after another single differencing. From the non-seasonal component of both the ACF and PACF, there are small significant spikes at lag 5, and a larger one at lag 7. This suggests (p,q) = (>2,>2) and (P,Q) = (>1,>1).
In an \(ARIMA(p, d, q)(P, D, Q)_m\) model, p/P is the order of the autoregressive part, d/D the degree of first differencing involved, q/Q the order of the moving average part, and m the number of observations per seasonal cycle. Altogether, possible models should have a sequence similar to ARIMA(>2,[0 or 1],>2)(>1,[0 or 1],>1)[7] with drift, since (d,D) could both be 1.
To confirm, the auto.arima() function selects the orders by minimizing the information criteria. The suggested model is ARIMA(3,0,3)(0,1,1)[7] with drift, which will be used since the deduction by observation was very close.
## Series: .
## ARIMA(3,0,3)(0,1,1)[7] with drift
## Box Cox transformation: lambda= 0.7242585
##
## Coefficients:
## ar1 ar2 ar3 ma1 ma2 ma3 sma1 drift
## 0.4902 -0.4948 0.8326 -0.4823 0.3203 -0.7837 -0.7153 -0.0203
## s.e. 0.0863 0.0743 0.0614 0.1060 0.0941 0.0621 0.0453 0.0072
##
## sigma^2 estimated as 67.52: log likelihood=-1260.59
## AIC=2539.18 AICc=2539.69 BIC=2574.1
Therefore, the proposed models that might be suitable for fitting ATM2 include:
| Technique | Proposed Model |
|---|---|
| Forecasting with Decomposition | STL Decomposition |
| Forecasting with Exponential Smoothing | ETS(A, N, A) |
| Forecasting with ARIMA | ARIMA(3,0,3)(0,1,1)[7] with drift |
Given the very limited data, there is not enough information to build a suitable forecast model. Thus, the mean of ATM 1-3 will be used as the expected amount of cash withdrawn from ATM3. The reasoning is that ATM1 and ATM2 have similar withdrawal activity on the same days, while ATM4 presents much higher withdrawal amounts. As a result, with 2/3 of the ATMs showing cash withdrawals similar to ATM3, the expected amount of cash withdrawn from ATM3 is taken to be the mean of cash withdrawals from ATM 1-3 (a sketch of this computation follows the table below).
Therefore, the proposed models that might be suitable for fitting ATM3 include:
| Technique | Proposed Model |
|---|---|
| Forecasting with Expected Value | at time t+1, \(E[ATM_3] = \frac {\sum_{i=1}^{n} (ATM_i)}{n}\) where n = 3 |
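A minimal sketch of this expected value, mirroring the computation used later for the forecast; here temp is assumed to be the wide data frame with ATM1-ATM3 in columns 2 through 4, so rows 363-365 hold the last three observed days:
mean(as.matrix(temp[363:365, 2:4]))  # mean withdrawal across ATM1-ATM3 over the final observed days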
From the time series plot, it is evident that there is seasonality in this set and no steady trend over the weeks. The subseries plot highlights that Sunday tends to have the highest mean cash withdrawal while Saturday has the lowest. The ACF shows a slow decrease at every 7th lag, with the autocorrelations forming a "scalloped" pattern, further confirming the seasonality. In the partial autocorrelation, there are a few significant lags in the beginning, while all others are within the critical limit. Altogether, the time series plots for ATM4 confirm that it is non-stationary and seasonal, and may require differencing at the seasonal lag to be made stationary.
ggtsdisplay(ATM.ts[,4], main = "ATM #4 Cash Withdrawal", ylab = "Cash Withdrawal (in hundreds)", xlab = "Week")
ggsubseriesplot(ATM.ts[,4]) +
labs(title = "ATM #4 Cash Withdrawal", subtitle ="1 May, 2009 to 30 April, 2010",
x = "Days of the Week") +
scale_y_continuous("Amount of Cash Withdrawal",
labels = scales::dollar_format(scale = 0.1, suffix = "K"))
For this time series subset, it was considered that stabilizing the seasonal variability throughout the year may be necessary. Using a Box-Cox transformation, the optimal \(\lambda = 0.45\). The resulting transformation shows that the smaller variability was stretched, while the larger variability was diminished.
At this point, applying a decomposition to ATM4 is useful for further studying the time series and exploring historical changes over time. The decomposition is done using the STL decomposition.
ATM.ts[,4] %>% stl(s.window = 7, robust = TRUE) %>%
autoplot() +
labs(title = "Seasonal & Trend Decomposition for ATM #4 Time Series", x = "Week")Once again, the decomposition highlights the seasonality component changes drastically from week 40 to week 43, in addition to a few earlier periods. The panel bar for trend also suggests that it is much smaller than the seasonal component and consequently only a small part of the variation in the data series.
By considering variations in the combinations of the small trend and seasonal components, exponential smoothing methods are not ruled out as a forecasting tool. This technique selects a possible model based on the lowest information criterion. It shows that the ETS(A, N, A) model best fits the transformed ATM4 data, i.e. exponential smoothing with additive error, no trend component, and additive seasonality.
## ETS(A,N,A)
##
## Call:
## ets(y = ., lambda = lambda, biasadj = TRUE)
##
## Box-Cox transformation: lambda= 0.4498
##
## Smoothing parameters:
## alpha = 1e-04
## gamma = 0.1035
##
## Initial states:
## l = 28.6369
## s = -18.6503 -3.3529 1.6831 4.7437 5.4471 4.9022
## 5.2271
##
## sigma: 12.9202
##
## AIC AICc BIC
## 4032.268 4032.890 4071.267
Note that the Box-Cox transformation was kept because it helps to stabilize the variance of a time series. Differencing can help stabilize the mean of a time series by removing changes in its level, thereby reducing trend and seasonality.
## [1] "And so, the number of differences required for time series to be made stationary is 0."
## [1] "While, the number of differences required for time series to be made seasonally stationary is 0."
ARIMA models are capable of modeling a wide range of seasonal and non-seasonal data. For this series, no seasonal difference is needed, D = 0, and with the transformed data no first difference is needed either, d = 0. The unit root test statistic is smaller than the 1% critical value and falls well within the range expected for stationary data. Thus, the data is considered stationary without any differencing.
##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 5 lags.
##
## Value of test-statistic is: 0.0792
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
The aim now is to find an appropriate ARIMA model based on the ACF and PACF. Ignoring one significant spike in each plot that is just outside the limits, and not in the first few lags, once again both the ACF and PACF decay sinusoidally. The only significant lags are at the seasonal lags, with an almost significant spike at lag 4, indicating that some additional non-seasonal terms need to be included in the model. Lastly, there was no change in the non-seasonal and seasonal components when multiple differencing was performed. This could suggest that (p,Q) = 0 and (P,q) > 2.
In an \(ARIMA(p, d, q)(P, D, Q)_m\) model, p/P is the order of the autoregressive part, d/D the degree of first differencing involved, q/Q the order of the moving average part, and m the number of observations per seasonal cycle. Altogether, possible models should have a sequence similar to ARIMA(0,0,>2)(>2,0,0)[7].
To confirm, the auto.arima() function selects the orders by minimizing the information criteria. The suggested model is ARIMA(0,0,1)(2,0,0)[7] with a non-zero mean, which will be used since the deduction by observation was very close.
## Series: .
## ARIMA(0,0,1)(2,0,0)[7] with non-zero mean
## Box Cox transformation: lambda= 0.449771
##
## Coefficients:
## ma1 sar1 sar2 mean
## 0.0790 0.2078 0.2023 28.6364
## s.e. 0.0527 0.0516 0.0525 1.2405
##
## sigma^2 estimated as 176.5: log likelihood=-1460.57
## AIC=2931.14 AICc=2931.3 BIC=2950.64
Therefore, the proposed models that might be suitable for fitting ATM4 include:
| Technique | Proposed Model |
|---|---|
| Forecasting with Decomposition | STL Decomposition |
| Forecasting with Exponential Smoothing | ETS(A, N, A) |
| Forecasting with ARIMA | ARIMA(0,0,1)(2,0,0)[7] with non-zero mean |
Because each model was fitted using a different transformation, information criteria are not reliable for comparison, since the likelihoods are computed in different ways. Thus, cross-validation is performed by splitting the data into a training and a testing set, where the root mean squared error (RMSE) is used to identify the best-fitting model. The following function splits the data set and fits each model as specified above. A performance test is then conducted to determine the RMSE of forecasts from the training data against the test data. The function can also plot the residuals of each model when specified.
ts_accuracy = function(i, week, rmse, residual){
# Split to train and test
train = window(ATM.ts[,i], end = c(week, 3))
test = window(ATM.ts[,i], start = c(week, 4))
# Finds optimal lambda
lambda = BoxCox.lambda(ATM.ts[,i])
# Forecast Models
stl.model = train %>% stl(s.window = 7, robust = TRUE)
ets.model = train %>% ets(model='ANA', lambda = lambda, biasadj = TRUE)
if (i == 1){
arima.model = train %>% Arima(order = c(0,0,2), seasonal = c(0,1,1),
lambda = lambda, biasadj = TRUE)
}
if (i == 2){
arima.model = train %>% Arima(order = c(3,0,3), seasonal = c(0,1,1),
include.drift = TRUE,
lambda = lambda, biasadj = TRUE)
}
if (i == 4){
arima.model = train %>% Arima(order = c(0,0,1), seasonal = c(2,0,0),
include.mean = TRUE,
lambda = lambda, biasadj = TRUE)
}
# Returns Residuals Plots
if(tolower(residual) == 'stl'){checkresiduals(stl.model)}
if(tolower(residual) == 'ets'){checkresiduals(ets.model)}
if(tolower(residual) == 'arima'){checkresiduals(arima.model)}
# Forecast
stl.fc = forecast(stl.model, h = length(test))$mean
ets.fc = forecast(ets.model, h = length(test))$mean
arima.fc = forecast(arima.model, h = length(test))$mean
# Performance statistics
accuracy = data.frame(RMSE = cbind(accuracy(stl.fc, test)[,2],
accuracy(ets.fc, test)[,2],
accuracy(arima.fc, test)[,2]))
row.names(accuracy) = c(sprintf("ATM #%d", i))
names(accuracy) = c("STL", "ETS", "ARIMA")
# Returns RMSE
if(rmse){accuracy}
}
From the table below, it is evident that the ARIMA models produced the smallest errors among the individual ATM models. While not the smallest error, it is interesting to note that for ATM2, both the STL decomposition and ETS(A, N, A) resulted in nearly identical RMSEs (their actual difference is merely 0.003).
df = data.frame()
for (i in c(1,2,4)){
df = rbind(df,ts_accuracy(i, 44, rmse = TRUE, residual = FALSE))
}
| | STL | ETS | ARIMA |
|---|---|---|---|
| ATM #1 | 51.58 | 42.76 | 37.00 |
| ATM #2 | 59.38 | 59.38 | 46.92 |
| ATM #4 | 296.66 | 325.39 | 286.92 |
The residuals of the best model based on the RMSE are shown below. In this collection of plots, there are a time series plot of the residuals, their ACF, and a histogram of the residuals (with an overlaid normal distribution for comparison).
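Each set of diagnostics below can be reproduced with the ts_accuracy() helper defined earlier, for example for ATM #1:
ts_accuracy(1, 44, rmse = FALSE, residual = 'arima')  # plots the ARIMA residuals and prints the Ljung-Box test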
For ATM1, the ARIMA(0,0,2)(0,1,1)[7] model produced the smallest RMSE. Firstly, the residual time series plot highlights a few large negative residuals. The histogram suggests that the residuals may be left-skewed, but their mean appears to be near zero. Moreover, there is no significant autocorrelation in the residual series, suggesting the forecasts are good. This is further confirmed by the Ljung-Box test, since the residuals are not distinguishable from a white noise series, \(Q^* = 7.23,\ df = 11,\ p\text{-value} > 0.05\). Overall, it seems that forecasts from this method will be good.
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,0,2)(0,1,1)[7]
## Q* = 7.2257, df = 11, p-value = 0.7805
##
## Model df: 3. Total lags used: 14
For ATM2, the ARIMA(3,0,3)(0,1,1)[7] with drift model produced the smallest RMSE. Firstly, the residual time series plot highlights a few large negative residuals. The histogram suggests that the mean of the residuals is near zero. Moreover, there is no significant autocorrelation in the residual series, suggesting the forecasts are good. This is further confirmed by the Ljung-Box test, since the residuals are not distinguishable from a white noise series, \(Q^* = 5.53,\ df = 6,\ p\text{-value} > 0.05\). Overall, it seems that forecasts from this method will be good.
##
## Ljung-Box test
##
## data: Residuals from ARIMA(3,0,3)(0,1,1)[7] with drift
## Q* = 5.5332, df = 6, p-value = 0.4775
##
## Model df: 8. Total lags used: 14
For ATM4, the ARIMA(0,0,1)(2,0,0)[7] with non-zero mean model produced the smallest RMSE. Firstly, the residual time series plot highlights the variability in the residuals. The histogram suggests that the residuals are near-normal with a mean near zero. Moreover, there is no significant autocorrelation in the residual series, suggesting the forecasts are good. This is further confirmed by the Ljung-Box test, since the residuals are not distinguishable from a white noise series, \(Q^* = 13.22,\ df = 10,\ p\text{-value} > 0.05\). Overall, it seems that forecasts from this method will be good.
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,0,1)(2,0,0)[7] with non-zero mean
## Q* = 13.217, df = 10, p-value = 0.2118
##
## Model df: 4. Total lags used: 14
Lastly, the following function plots all the data together with the forecasts from the specified models for a visual comparison. There is also a zoomed-in plot of the forecast for clarity. The function can also return any of the forecasts created by the specified model. This will be used to visually analyze the fit and officially state the final model.
ATM.fc = function(tseries, ATM.no, plot, forecast){
# Finds optimal lambda
lambda = BoxCox.lambda(tseries)
# Forecast Models
stl.model = tseries %>% stl(s.window = 7, robust = TRUE)
ets.model = tseries %>% ets(lambda = lambda, biasadj = TRUE)
arima.model = tseries %>% auto.arima(lambda = lambda, biasadj = TRUE)
# Forecast
ATM.stl.fc = forecast(stl.model, h = 31)$mean %>% ts(start = 366, end = 396)
ATM.ets.fc = forecast(ets.model, h = 31)$mean %>% ts(start = 366, end = 396)
ATM.arima.fc = forecast(arima.model, h = 31)$mean %>% ts(start = 366, end = 396)
# Full forecast plot
p1 = autoplot(ts(tseries, frequency = 1)) +
autolayer(ATM.stl.fc, series = 'STL', PI = FALSE) +
autolayer(ATM.ets.fc, series = 'ETS', PI = FALSE) +
autolayer(ATM.arima.fc, series = 'ARIMA', PI = FALSE) +
labs(title = sprintf("ATM #%0.0f Cash Withdrawal Forecast", ATM.no),
subtitle = "1 May, 2009 to 31 May, 2010",
x = "Days") +
scale_y_continuous("Amount of Cash Withdrawal",
labels = scales::dollar_format(scale = 0.1, suffix = "K")) +
guides(colour = guide_legend(title = "Models")) +
theme(legend.position = "top")
# Zoomed in plot on forecast
p2 = p1 + labs(title = sprintf("Zoom: ATM #%0.0f Cash Withdrawal Forecast", ATM.no),
subtitle = "15 April, 2010 to 31 May, 2010") +
xlim(c(350,396))
# Returns plots
if(plot){gridExtra::grid.arrange(p1, p2, ncol = 1)}
# Return Forecast Values
if(tolower(forecast) == 'stl'){return(ATM.stl.fc)}
if(tolower(forecast) == 'ets'){return(ATM.ets.fc)}
if(tolower(forecast) == 'arima'){return(ATM.arima.fc)}
}
Visually, each model does a good job fitting the series. Therefore, based on the performance metrics, the best-fit model for ATM #1 is the ARIMA(0,0,2)(0,1,1)[7] with a Box-Cox transformation of 0.26.
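The forecast objects saved at the end of this part (e.g. ATM1.fc) were presumably produced with calls along these lines; a sketch for ATM #1:
ATM1.fc = ATM.fc(ATM.ts[,1], ATM.no = 1, plot = TRUE, forecast = 'arima')  # assumed call; plots and returns the ARIMA forecast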
The STL and ETS fits show shifts in the seasonality opposite to the ARIMA model, which remains closely centered. As a result, based on the fit and performance metrics, the best-fit model for ATM #2 is the ARIMA(3,0,3)(0,1,1)[7] with drift and a Box-Cox transformation of 0.72.
Modeling the decided fit to ATM3, the forecast for May 2010 is shown below.
temp = ATM.df %>% drop_na() %>% spread(ATM, Cash)
ATM3.fc = ts(rep(mean(as.matrix(temp[363:365, 2:4])), 31), start = 366, end = 396)
autoplot(ts(ATM.ts[,3], frequency = 1)) +
autolayer(ATM3.fc, series = '') +
labs(title = "ATM #3 Cash Withdrawal Forecast",
subtitle = "1 May, 2009 to 31 May, 2010",
x = "Days") +
scale_y_continuous("Amount of Cash Withdrawal",
labels = scales::dollar_format(scale = 0.1, suffix = "K")) +
theme(legend.position = "none")
Therefore, the best-fit model is the expected value of cash withdrawn from ATM 1-3, i.e. \(E[ATM_3] = \frac {\sum_{i=1}^{n} (ATM_i)}{n}\) where n = 3.
ATM4 is quite haphazard in its seasonality; the prediction intervals for each model are also large, and none of them capture the series as expected. Thus, based on the performance metrics, the best-fit model for ATM #4 is the ARIMA(0,0,1)(2,0,0)[7] with a non-zero mean and a Box-Cox transformation of 0.45.
Finally, let’s save the forecast for May 2010. The best-fit models used to forecast the amount of cash withdrawal for the month of May for each ATM are:
DATE = seq(as.Date('2010-05-01'), as.Date('2010-05-31'), 1)
ATM = c(rep('ATM1', 31), rep('ATM2', 31), rep('ATM3', 31), rep('ATM4', 31))
Cash = c(ATM1.fc, ATM2.fc, ATM3.fc, ATM4.fc)
# openxlsx::write.xlsx(data.frame(DATE, ATM, Cash), "ATM_forecasts_Deokinanan.xlsx")
Daily Forecast for Each ATM for May 2010
| DATE | ATM | Cash |
|---|---|---|
| 2010-05-01 | ATM1 | 92.128019 |
| 2010-05-02 | ATM1 | 106.518331 |
| 2010-05-03 | ATM1 | 78.907122 |
| 2010-05-04 | ATM1 | 5.556121 |
| 2010-05-05 | ATM1 | 106.163569 |
| 2010-05-06 | ATM1 | 84.720405 |
| 2010-05-07 | ATM1 | 91.309507 |
| 2010-05-08 | ATM1 | 93.491218 |
| 2010-05-09 | ATM1 | 107.162085 |
| 2010-05-10 | ATM1 | 79.572744 |
| 2010-05-11 | ATM1 | 5.726439 |
| 2010-05-12 | ATM1 | 106.933945 |
| 2010-05-13 | ATM1 | 85.409793 |
| 2010-05-14 | ATM1 | 92.024822 |
| 2010-05-15 | ATM1 | 94.212213 |
| 2010-05-16 | ATM1 | 107.933303 |
| 2010-05-17 | ATM1 | 80.238366 |
| 2010-05-18 | ATM1 | 5.896758 |
| 2010-05-19 | ATM1 | 107.704320 |
| 2010-05-20 | ATM1 | 86.099181 |
| 2010-05-21 | ATM1 | 92.740136 |
| 2010-05-22 | ATM1 | 94.933207 |
| 2010-05-23 | ATM1 | 108.704520 |
| 2010-05-24 | ATM1 | 80.903988 |
| 2010-05-25 | ATM1 | 6.067076 |
| 2010-05-26 | ATM1 | 108.474696 |
| 2010-05-27 | ATM1 | 86.788569 |
| 2010-05-28 | ATM1 | 93.455451 |
| 2010-05-29 | ATM1 | 95.654201 |
| 2010-05-30 | ATM1 | 109.475738 |
| 2010-05-31 | ATM1 | 81.569609 |
| 2010-05-01 | ATM2 | 58.520690 |
| 2010-05-02 | ATM2 | 63.739250 |
| 2010-05-03 | ATM2 | 11.157593 |
| 2010-05-04 | ATM2 | 15.797994 |
| 2010-05-05 | ATM2 | 91.122422 |
| 2010-05-06 | ATM2 | 82.918441 |
| 2010-05-07 | ATM2 | 61.628753 |
| 2010-05-08 | ATM2 | 59.214121 |
| 2010-05-09 | ATM2 | 64.504077 |
| 2010-05-10 | ATM2 | 11.955182 |
| 2010-05-11 | ATM2 | 17.065913 |
| 2010-05-12 | ATM2 | 91.629964 |
| 2010-05-13 | ATM2 | 83.433271 |
| 2010-05-14 | ATM2 | 61.996778 |
| 2010-05-15 | ATM2 | 59.538018 |
| 2010-05-16 | ATM2 | 64.874775 |
| 2010-05-17 | ATM2 | 12.592851 |
| 2010-05-18 | ATM2 | 19.567481 |
| 2010-05-19 | ATM2 | 91.810515 |
| 2010-05-20 | ATM2 | 83.629674 |
| 2010-05-21 | ATM2 | 62.124839 |
| 2010-05-22 | ATM2 | 59.647823 |
| 2010-05-23 | ATM2 | 65.012212 |
| 2010-05-24 | ATM2 | 13.146759 |
| 2010-05-25 | ATM2 | 23.217389 |
| 2010-05-26 | ATM2 | 91.799041 |
| 2010-05-27 | ATM2 | 83.637832 |
| 2010-05-28 | ATM2 | 62.114829 |
| 2010-05-29 | ATM2 | 59.634255 |
| 2010-05-30 | ATM2 | 65.011836 |
| 2010-05-31 | ATM2 | 13.662563 |
| 2010-05-01 | ATM3 | 90.222222 |
| 2010-05-02 | ATM3 | 90.222222 |
| 2010-05-03 | ATM3 | 90.222222 |
| 2010-05-04 | ATM3 | 90.222222 |
| 2010-05-05 | ATM3 | 90.222222 |
| 2010-05-06 | ATM3 | 90.222222 |
| 2010-05-07 | ATM3 | 90.222222 |
| 2010-05-08 | ATM3 | 90.222222 |
| 2010-05-09 | ATM3 | 90.222222 |
| 2010-05-10 | ATM3 | 90.222222 |
| 2010-05-11 | ATM3 | 90.222222 |
| 2010-05-12 | ATM3 | 90.222222 |
| 2010-05-13 | ATM3 | 90.222222 |
| 2010-05-14 | ATM3 | 90.222222 |
| 2010-05-15 | ATM3 | 90.222222 |
| 2010-05-16 | ATM3 | 90.222222 |
| 2010-05-17 | ATM3 | 90.222222 |
| 2010-05-18 | ATM3 | 90.222222 |
| 2010-05-19 | ATM3 | 90.222222 |
| 2010-05-20 | ATM3 | 90.222222 |
| 2010-05-21 | ATM3 | 90.222222 |
| 2010-05-22 | ATM3 | 90.222222 |
| 2010-05-23 | ATM3 | 90.222222 |
| 2010-05-24 | ATM3 | 90.222222 |
| 2010-05-25 | ATM3 | 90.222222 |
| 2010-05-26 | ATM3 | 90.222222 |
| 2010-05-27 | ATM3 | 90.222222 |
| 2010-05-28 | ATM3 | 90.222222 |
| 2010-05-29 | ATM3 | 90.222222 |
| 2010-05-30 | ATM3 | 90.222222 |
| 2010-05-31 | ATM3 | 90.222222 |
| 2010-05-01 | ATM4 | 323.999169 |
| 2010-05-02 | ATM4 | 434.873284 |
| 2010-05-03 | ATM4 | 511.231817 |
| 2010-05-04 | ATM4 | 241.030250 |
| 2010-05-05 | ATM4 | 474.002125 |
| 2010-05-06 | ATM4 | 349.505099 |
| 2010-05-07 | ATM4 | 415.843486 |
| 2010-05-08 | ATM4 | 296.604731 |
| 2010-05-09 | ATM4 | 475.086554 |
| 2010-05-10 | ATM4 | 465.407561 |
| 2010-05-11 | ATM4 | 288.570878 |
| 2010-05-12 | ATM4 | 446.728575 |
| 2010-05-13 | ATM4 | 332.521570 |
| 2010-05-14 | ATM4 | 460.415111 |
| 2010-05-15 | ATM4 | 388.228709 |
| 2010-05-16 | ATM4 | 451.226998 |
| 2010-05-17 | ATM4 | 464.251973 |
| 2010-05-18 | ATM4 | 365.911910 |
| 2010-05-19 | ATM4 | 453.254166 |
| 2010-05-20 | ATM4 | 402.458707 |
| 2010-05-21 | ATM4 | 444.343570 |
| 2010-05-22 | ATM4 | 401.349389 |
| 2010-05-23 | ATM4 | 453.312522 |
| 2010-05-24 | ATM4 | 454.100385 |
| 2010-05-25 | ATM4 | 394.551766 |
| 2010-05-26 | ATM4 | 448.115768 |
| 2010-05-27 | ATM4 | 412.935666 |
| 2010-05-28 | ATM4 | 448.996814 |
| 2010-05-29 | ATM4 | 424.218997 |
| 2010-05-30 | ATM4 | 448.226111 |
| 2010-05-31 | ATM4 | 450.990009 |
The data set contains monthly residential power usage from January 1998 to December 2013. The variable KWH is power consumption in kilowatt-hours.
power.df = read_excel("ResidentialCustomerForecastLoad-624.xlsx")
power.ts = ts(power.df$KWH, start = c(1998,1), frequency = 12)
| CaseSequence | YYYY-MMM | KWH |
|---|---|---|
| 733 | 1998-Jan | 6862583 |
| 734 | 1998-Feb | 5838198 |
| 735 | 1998-Mar | 5420658 |
| 736 | 1998-Apr | 5010364 |
| 737 | 1998-May | 4665377 |
| 738 | 1998-Jun | 6467147 |
| 739 | 1998-Jul | 8914755 |
| 740 | 1998-Aug | 8607428 |
| 741 | 1998-Sep | 6989888 |
| 742 | 1998-Oct | 6345620 |
| 743 | 1998-Nov | 4640410 |
| 744 | 1998-Dec | 4693479 |
| 745 | 1999-Jan | 7183759 |
| 746 | 1999-Feb | 5759262 |
| 747 | 1999-Mar | 4847656 |
| 748 | 1999-Apr | 5306592 |
| 749 | 1999-May | 4426794 |
| 750 | 1999-Jun | 5500901 |
| 751 | 1999-Jul | 7444416 |
| 752 | 1999-Aug | 7564391 |
| 753 | 1999-Sep | 7899368 |
| 754 | 1999-Oct | 5358314 |
| 755 | 1999-Nov | 4436269 |
| 756 | 1999-Dec | 4419229 |
| 757 | 2000-Jan | 7068296 |
| 758 | 2000-Feb | 5876083 |
| 759 | 2000-Mar | 4807961 |
| 760 | 2000-Apr | 4873080 |
| 761 | 2000-May | 5050891 |
| 762 | 2000-Jun | 7092865 |
| 763 | 2000-Jul | 6862662 |
| 764 | 2000-Aug | 7517830 |
| 765 | 2000-Sep | 8912169 |
| 766 | 2000-Oct | 5844352 |
| 767 | 2000-Nov | 5041769 |
| 768 | 2000-Dec | 6220334 |
| 769 | 2001-Jan | 7538529 |
| 770 | 2001-Feb | 6602448 |
| 771 | 2001-Mar | 5779180 |
| 772 | 2001-Apr | 4835210 |
| 773 | 2001-May | 4787904 |
| 774 | 2001-Jun | 6283324 |
| 775 | 2001-Jul | 7855129 |
| 776 | 2001-Aug | 8450717 |
| 777 | 2001-Sep | 7112069 |
| 778 | 2001-Oct | 5242535 |
| 779 | 2001-Nov | 4461979 |
| 780 | 2001-Dec | 5240995 |
| 781 | 2002-Jan | 7099063 |
| 782 | 2002-Feb | 6413429 |
| 783 | 2002-Mar | 5839514 |
| 784 | 2002-Apr | 5371604 |
| 785 | 2002-May | 5439166 |
| 786 | 2002-Jun | 5850383 |
| 787 | 2002-Jul | 7039702 |
| 788 | 2002-Aug | 8058748 |
| 789 | 2002-Sep | 8245227 |
| 790 | 2002-Oct | 5865014 |
| 791 | 2002-Nov | 4908979 |
| 792 | 2002-Dec | 5779958 |
| 793 | 2003-Jan | 7256079 |
| 794 | 2003-Feb | 6190517 |
| 795 | 2003-Mar | 6120626 |
| 796 | 2003-Apr | 4885643 |
| 797 | 2003-May | 5296096 |
| 798 | 2003-Jun | 6051571 |
| 799 | 2003-Jul | 6900676 |
| 800 | 2003-Aug | 8476499 |
| 801 | 2003-Sep | 7791791 |
| 802 | 2003-Oct | 5344613 |
| 803 | 2003-Nov | 4913707 |
| 804 | 2003-Dec | 5756193 |
| 805 | 2004-Jan | 7584596 |
| 806 | 2004-Feb | 6560742 |
| 807 | 2004-Mar | 6526586 |
| 808 | 2004-Apr | 4831688 |
| 809 | 2004-May | 4878262 |
| 810 | 2004-Jun | 6421614 |
| 811 | 2004-Jul | 7307931 |
| 812 | 2004-Aug | 7309774 |
| 813 | 2004-Sep | 6690366 |
| 814 | 2004-Oct | 5444948 |
| 815 | 2004-Nov | 4824940 |
| 816 | 2004-Dec | 5791208 |
| 817 | 2005-Jan | 8225477 |
| 818 | 2005-Feb | 6564338 |
| 819 | 2005-Mar | 5581725 |
| 820 | 2005-Apr | 5563071 |
| 821 | 2005-May | 4453983 |
| 822 | 2005-Jun | 5900212 |
| 823 | 2005-Jul | 8337998 |
| 824 | 2005-Aug | 7786659 |
| 825 | 2005-Sep | 7057213 |
| 826 | 2005-Oct | 6694523 |
| 827 | 2005-Nov | 4313019 |
| 828 | 2005-Dec | 6181548 |
| 829 | 2006-Jan | 7793358 |
| 830 | 2006-Feb | 5914945 |
| 831 | 2006-Mar | 5819734 |
| 832 | 2006-Apr | 5255988 |
| 833 | 2006-May | 4740588 |
| 834 | 2006-Jun | 7052275 |
| 835 | 2006-Jul | 7945564 |
| 836 | 2006-Aug | 8241110 |
| 837 | 2006-Sep | 7296355 |
| 838 | 2006-Oct | 5104799 |
| 839 | 2006-Nov | 4458429 |
| 840 | 2006-Dec | 6226214 |
| 841 | 2007-Jan | 8031295 |
| 842 | 2007-Feb | 7928337 |
| 843 | 2007-Mar | 6443170 |
| 844 | 2007-Apr | 4841979 |
| 845 | 2007-May | 4862847 |
| 846 | 2007-Jun | 5022647 |
| 847 | 2007-Jul | 6426220 |
| 848 | 2007-Aug | 7447146 |
| 849 | 2007-Sep | 7666970 |
| 850 | 2007-Oct | 5785964 |
| 851 | 2007-Nov | 4907057 |
| 852 | 2007-Dec | 6047292 |
| 853 | 2008-Jan | 7964293 |
| 854 | 2008-Feb | 7597060 |
| 855 | 2008-Mar | 6085644 |
| 856 | 2008-Apr | 5352359 |
| 857 | 2008-May | 4608528 |
| 858 | 2008-Jun | 6548439 |
| 859 | 2008-Jul | 7643987 |
| 860 | 2008-Aug | 8037137 |
| 861 | 2008-Sep | NA |
| 862 | 2008-Oct | 5101803 |
| 863 | 2008-Nov | 4555602 |
| 864 | 2008-Dec | 6442746 |
| 865 | 2009-Jan | 8072330 |
| 866 | 2009-Feb | 6976800 |
| 867 | 2009-Mar | 5691452 |
| 868 | 2009-Apr | 5531616 |
| 869 | 2009-May | 5264439 |
| 870 | 2009-Jun | 5804433 |
| 871 | 2009-Jul | 7713260 |
| 872 | 2009-Aug | 8350517 |
| 873 | 2009-Sep | 7583146 |
| 874 | 2009-Oct | 5566075 |
| 875 | 2009-Nov | 5339890 |
| 876 | 2009-Dec | 7089880 |
| 877 | 2010-Jan | 9397357 |
| 878 | 2010-Feb | 8390677 |
| 879 | 2010-Mar | 7347915 |
| 880 | 2010-Apr | 5776131 |
| 881 | 2010-May | 4919289 |
| 882 | 2010-Jun | 6696292 |
| 883 | 2010-Jul | 770523 |
| 884 | 2010-Aug | 7922701 |
| 885 | 2010-Sep | 7819472 |
| 886 | 2010-Oct | 5875917 |
| 887 | 2010-Nov | 4800733 |
| 888 | 2010-Dec | 6152583 |
| 889 | 2011-Jan | 8394747 |
| 890 | 2011-Feb | 8898062 |
| 891 | 2011-Mar | 6356903 |
| 892 | 2011-Apr | 5685227 |
| 893 | 2011-May | 5506308 |
| 894 | 2011-Jun | 8037779 |
| 895 | 2011-Jul | 10093343 |
| 896 | 2011-Aug | 10308076 |
| 897 | 2011-Sep | 8943599 |
| 898 | 2011-Oct | 5603920 |
| 899 | 2011-Nov | 6154138 |
| 900 | 2011-Dec | 8273142 |
| 901 | 2012-Jan | 8991267 |
| 902 | 2012-Feb | 7952204 |
| 903 | 2012-Mar | 6356961 |
| 904 | 2012-Apr | 5569828 |
| 905 | 2012-May | 5783598 |
| 906 | 2012-Jun | 7926956 |
| 907 | 2012-Jul | 8886851 |
| 908 | 2012-Aug | 9612423 |
| 909 | 2012-Sep | 7559148 |
| 910 | 2012-Oct | 5576852 |
| 911 | 2012-Nov | 5731899 |
| 912 | 2012-Dec | 6609694 |
| 913 | 2013-Jan | 10655730 |
| 914 | 2013-Feb | 7681798 |
| 915 | 2013-Mar | 6517514 |
| 916 | 2013-Apr | 6105359 |
| 917 | 2013-May | 5940475 |
| 918 | 2013-Jun | 7920627 |
| 919 | 2013-Jul | 8415321 |
| 920 | 2013-Aug | 9080226 |
| 921 | 2013-Sep | 7968220 |
| 922 | 2013-Oct | 5759367 |
| 923 | 2013-Nov | 5769083 |
| 924 | 2013-Dec | 9606304 |
From the few summary statistics below, the mean power usage is 6502475 kWh, with a standard deviation of 1447571 kWh. With near-zero skewness and kurtosis, the data seem nearly normal. However, there are a few discrepancies. First, the data set should contain n = 192 cases, but one case is missing. Second, the boxplot reveals a likely outlier in this set.
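For reference, a minimal sketch of the hidden chunk that likely produced the table below, assuming psych::describe and the kable styling used elsewhere in this report:
describe(power.df["KWH"])[,-1] %>% kable(digits = 2L) %>%
  kable_styling(bootstrap_options = "striped", full_width = TRUE)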
|  | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| KWH | 191 | 6502475 | 1447571 | 6283324 | 6439475 | 1543074 | 770523 | 10655730 | 9885207 | 0.17 | 0.45 | 104742.6 |
p1 = ggplot(power.df, aes(y = KWH)) +
geom_boxplot(fill = "steelblue2") +
coord_flip() +
theme(aspect.ratio = 3/10,
axis.text.y = element_blank(),
axis.ticks.y = element_blank()) +
labs(title ="Jan, 1998 to Dec, 2013",
subtitle = "Residential Power Usage Boxplot",
y = sprintf("# of outliers = %d", length(boxplot(power.df[,3], plot = FALSE)$out)))
temp = transform(power.df, date = reshape::colsplit(`YYYY-MMM`, split = "\\-", names = c('year', 'month')))
temp$Date = with(temp, sprintf("%s-%d", date.month, date.year))
p2 = temp %>%
ggplot(aes(x = zoo::as.yearmon(Date,"%b-%Y"), y = KWH)) +
geom_line(color="steelblue2", size = 1) +
labs(subtitle = "Residential Power Usage Time Series",
x = "Month", y = "Power Used, in kWh")
gridExtra::grid.arrange(p1, p2, ncol = 1)

## [1] "This missing case is on 2008-Sep."
## [1] "On 2010-Jul, an unexpectly low power usage of 770523 kWh was recorded."
With limited information, it is unknown why this oddly low usage was recorded; it may be a rare occurrence or an error in the data collection process. Either way, it can influence the analysis, so interpolation is required. As in Part A, tsclean will be used to interpolate both extreme and missing values. Missing values are handled by fitting a seasonal model to the data, interpolating the seasonally adjusted series, and then re-seasonalizing. To identify outliers and suggest reasonable replacements, residuals are obtained by fitting a periodic STL decomposition to the seasonal data; a residual is deemed an outlier if it lies outside the range \(\pm 2(q_{0.9} - q_{0.1})\), where \(q_p\) is the p-quantile of the residuals.
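As a rough illustration of that rule (tsclean wraps this logic via forecast::tsoutliers; this sketch is not the exact internals):
tmp = na.interp(power.ts)                       # interpolate the missing value so stl() can run
fit = stl(tmp, s.window = "periodic", robust = TRUE)
resid = fit$time.series[, "remainder"]          # STL remainder component
q = quantile(resid, probs = c(0.1, 0.9))
which(abs(resid) > 2 * (q[2] - q[1]))           # observations flagged as outliers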
With little data tidying or transformation necessary, the data is made into a workable time series.
power.ts = tsclean(power.ts)
describe(power.ts)[,-1] %>% kable(digits = 2L, caption = "Descriptive Statistics of Power Usage") %>%
kable_styling(bootstrap_options = "striped", full_width = TRUE)

|  | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| X1 | 192 | 6529701 | 1369032 | 6351262 | 6455590 | 1587856 | 4313019 | 10655730 | 6342711 | 0.47 | -0.48 | 98801.41 |
In the previous sections, the data itself was examined and transformed appropriately to prevent bias and errors in the forecast. This section will now dive into the time series itself to examine its characteristics and determine the best modeling technique to forecast the monthly residential power usage in kWh for 2014.
ggtsdisplay(power.ts, main = "Residential Power Usage", ylab = "Power Used, in kWh", xlab = "Month")

With the outlier and missing value now interpolated, the time series plot immediately shows the seasonality of the data. There appear to be two peaks every year, suggesting the pattern runs from peak to trough between the start of the year and mid-year, and again toward the end of the year. There is no steady trend over the years. The ACF shows autocorrelations in a “scalloped” shape, further confirming the seasonality. Because substantially more than 5% of the spikes lie outside the blue dashed bounds, the series is not white noise. In the partial autocorrelation (PACF), there are a few significant lags at the beginning with a sinusoidal decay, while all other values are within the critical limit. Altogether, the plots confirm that the series is non-stationary due to seasonality and will require differencing at the seasonal lag to be made stationary.
p1 = ggseasonplot(power.ts, year.labels = FALSE, continuous = TRUE) +
labs(title = "Residential Power Usage",
subtitle = "Seasonal Plot",
y = "Power Used, in kWh",
x = "Month")
p2 = ggsubseriesplot(power.ts) +
labs(subtitle = "Seasonal Subseries Plot",
y = "Power Used, in kWh",
x = "Month")
gridExtra::grid.arrange(p1, p2, ncol = 1)

The seasonal plot shows a steady decline from January to May, then an increase until August, followed by a drop toward November. August has the highest power consumption, followed by January. This usage pattern reflects the seasons: if these are residential power usages from the Northern Hemisphere, weather seasonality will influence consumption. In the summer (June to September), air conditioning drives electricity usage up, and homes heated with electric systems generally use more electricity during the winter months. This may explain the two distinct peaks in the seasonal plot.
For this time series, stabilizing the seasonal variability across the years may be helpful. Using a Box-Cox transformation, the optimal \(\lambda = -0.14\). The resulting transformation shows that the smaller variability was stretched while the larger variability was compressed; overall, the variation now appears much more stable.
bct = function(ts, title){
a = autoplot(ts) +
labs(title = sprintf("Before: %s", title),
x = "Year",
y = "Amount of Cash Withdrawal")
lambda = BoxCox.lambda(ts)
at = autoplot(BoxCox(ts, lambda)) +
labs(title = sprintf("Transformed: %s", title),
subtitle = sprintf("lambda = %0.2f", lambda),
y = "Box-Cox Transformed",
x = "Year")
gridExtra::grid.arrange(a, at)
}
lambda = BoxCox.lambda(power.ts)
bct(power.ts, title = "Residential Power Usage")

Applying a decomposition is useful for further studying the time series and exploring historical changes over time. Decomposition of the series is done using STL. From the seasonal plot, there do not appear to be quick changes in seasonality over the years, i.e. the seasonal pattern seems constant through time, so the loess window for seasonal extraction is set to be periodic.
power.ts %>% stl(s.window = 'periodic', robust = TRUE) %>%
autoplot() +
labs(title = "Seasonal & Trend Decomposition for Power Usage Time Series", x = "Year")The decomposition highlights the seasonality component, with no drastic changes to note. The panel bars indicate that the variation attributed to the trend is much smaller than the seasonal component and consequently only a small part of the variation in the data series.
Exponential smoothing methods were considered to fit the time series based on variations in its components. This technique selects a candidate model by minimizing the AICc. The ETS(A, Ad, A) model best fits the transformed power usage data, i.e. exponential smoothing with additive error, additive damped trend, and additive seasonality.
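A sketch of the call that produces the summary below (the Call line in the output confirms its form), assuming the lambda computed earlier:
ets.model = power.ts %>% ets(lambda = lambda, biasadj = TRUE)
summary(ets.model)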
## ETS(A,Ad,A)
##
## Call:
## ets(y = ., lambda = lambda, biasadj = TRUE)
##
## Box-Cox transformation: lambda= -0.1443
##
## Smoothing parameters:
## alpha = 0.118
## beta = 1e-04
## gamma = 1e-04
## phi = 0.979
##
## Initial states:
## l = 6.1998
## b = 1e-04
## s = -0.006 -0.0285 -0.0132 0.019 0.0263 0.0212
## 0.0014 -0.0255 -0.0192 -0.0077 0.0081 0.024
##
## sigma: 0.0094
##
## AIC AICc BIC
## -765.9795 -762.0258 -707.3446
Since the time series is non-stationary, differencing must be applied to stabilize its mean by removing changes in level, thereby reducing trend and seasonality. This will in turn make the series stationary and suitable for an ARIMA model, should one fit the data better.
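The counts reported below can be reproduced with forecast's ndiffs() and nsdiffs(); a sketch:
ndiffs(power.ts)   # differences required for stationarity
nsdiffs(power.ts)  # seasonal differences required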
## [1] "And so, the number of differences required for time series to be made stationary is 1."
## [1] "While, the number of differences required for time series to be made seasonally stationary is 1."
ARIMA models are capable of modeling a wide range of seasonal and non-seasonal data. The data is non-stationary with seasonality, so one seasonal difference is taken, D = 1. The seasonally differenced data could still appear non-stationary, but since the KPSS test statistic below is smaller than even the 1% critical value, further differencing is not necessary. Therefore, d = 0, and the data is now treated as stationary.
##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 4 lags.
##
## Value of test-statistic is: 0.1049
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
# `tst` is not defined in the visible code; it is assumed here to be the
# Box-Cox transformed series from a hidden chunk
tst = BoxCox(power.ts, lambda)
ggtsdisplay(diff(tst, lag = 12),
main = "Residential Power Usage",
ylab = "Power Used, in kWh",
xlab = "Month")The aim now is to find an appropriate ARIMA model based on the ACF and PACF. Ignoring one significant spike in each plot that is just outside the limits, and not in the first few lags, both the ACF and PACF plots are sinusoidal decaying and there is a large significant spike at lag 12. There are smaller significant spikes at lag 1 and 4 in the PACF. In the ACF, the first significant lag suggests a non-seasonal moving average component, MA(1), making q = 1 and Q = 0. With no other significant lags in the first few lags of the seasonal component of the PACF (minus one being ignored), p = 0, and P = 2.
In an \(ARIMA(p,d,q)(P,D,Q)_m\) model, p/P are the orders of the autoregressive parts, d/D the degrees of differencing, q/Q the orders of the moving average parts, and m the number of observations per year (the seasonal period). Altogether, a possible model is ARIMA(0,0,1)(2,1,0)[12].
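As a sketch, this hand-picked candidate can be fit directly with Arima(), assuming the lambda from above:
manual.fit = Arima(power.ts, order = c(0, 0, 1), seasonal = c(2, 1, 0),
                   lambda = lambda, biasadj = TRUE)
summary(manual.fit)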
To confirm, the auto.arima() function selects all orders by minimizing the information criterion, i.e. the AICc. The suggested model is ARIMA(0,0,1)(2,1,0)[12] with drift, which will be used since it agrees with the deduction by observation.
## Series: .
## ARIMA(0,0,1)(2,1,0)[12] with drift
## Box Cox transformation: lambda= -0.1442665
##
## Coefficients:
## ma1 sar1 sar2 drift
## 0.2563 -0.7036 -0.3817 1e-04
## s.e. 0.0809 0.0734 0.0748 1e-04
##
## sigma^2 estimated as 8.869e-05: log likelihood=585.32
## AIC=-1160.65 AICc=-1160.3 BIC=-1144.68
The last model of interest is a quick, effective hybrid forecast. This method averages single-model forecasts to produce point estimates that are more likely to be better than any of the contributing models (Peter Ellis, 2016), combining the functionality of multiple forecasting techniques. In class, it was learned that exponential smoothing and ARIMA models are the two most widely used approaches to time series forecasting: while ETS models are based on a description of the trend and seasonality in the data, ARIMA models aim to describe the autocorrelations in the data. A hybrid of the two is therefore fit in search of more accurate coverage.
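A sketch of the hybrid fit behind the printout below, assuming the forecastHybrid package (used later in the cross-validation function) is loaded:
hybrid.model = hybridModel(power.ts, model = 'ae', lambda = lambda)  # 'a' = auto.arima, 'e' = ets
hybrid.model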
## Hybrid forecast model comprised of the following models: auto.arima, ets
## ############
## auto.arima with weight 0.5
## ############
## ets with weight 0.5
Therefore, the proposed models that might be suitable for fitting the time series include:
| Technique | Proposed Model |
|---|---|
| Forecasting with Decomposition | STL Decomposition |
| Forecasting with Exponential Smoothing | ETS(A, Ad, A) |
| Forecasting with ARIMA | ARIMA(0,0,1)(2,1,0)[12] with drift |
| Forecasting with Hybrid Forecasting | 50% ETS + 50% ARIMA (equivalent to ETS(A, A, A) & ARIMA(0,0,1)(2,1,0)[12] with drift) |
Because each model was fitted using different transformations, information criteria are not reliable for comparison; thus cross-validation is performed by splitting the data into training and testing sets, where the root mean squared error is used to identify the best-fitting model. The following function splits the data set and fits each model as specified above. A performance test is then conducted to determine the root mean squared error of each model's forecasts against the test data. The function can also plot the residuals of a specified model.
ts_accuracy = function(tseries, year, rmse, residual){
# Split to train and test
train = window(tseries, end = c(year, 12))
test = window(tseries, start = year + 1)
# Finds optimal lambda
lambda = BoxCox.lambda(tseries)
# Forecast Models
stl.model = train %>% stl(s.window = 'periodic', robust = TRUE)
ets.model = train %>% ets(model='AAA', damped = TRUE, lambda = lambda, biasadj = TRUE)
arima.model = train %>% Arima(order = c(0,0,1), seasonal = c(2,1,0),
lambda = lambda, biasadj = TRUE)
hybrid.model = train %>% hybridModel(model = 'ae', lambda = lambda)
# Returns Residuals Plots
if(tolower(residual) == 'stl'){checkresiduals(stl.model)}
if(tolower(residual) == 'ets'){checkresiduals(ets.model)}
if(tolower(residual) == 'arima'){checkresiduals(arima.model)}
if(tolower(residual) == 'hybrid'){checkresiduals(hybrid.model)}
# Forecast
stl.fc = forecast(stl.model, h = length(test))$mean
ets.fc = forecast(ets.model, h = length(test))$mean
arima.fc = forecast(arima.model, h = length(test))$mean
hybrid.fc = forecast(hybrid.model, h = length(test))$mean
# Performance statistics
accuracy = data.frame(RMSE = cbind(accuracy(stl.fc, test)[,2],
accuracy(ets.fc, test)[,2],
accuracy(arima.fc, test)[,2],
accuracy(hybrid.fc, test)[,2]))
row.names(accuracy) = c(sprintf("Test Year: %d", year + 1))
names(accuracy) = c("STL", "ETS", "ARIMA","HYBRID")
# Returns RMSE
if(rmse){accuracy}
}

From the table below, the overall smallest RMSE belongs to the hybrid model. It is noteworthy, however, that each year of the validation test favored a different model: for the 2009-2010 and 2012-2013 training/testing splits, the best model is the hybrid; for the 2010-2011 split, it is the ETS fit; and for the 2011-2012 split, it is ARIMA(0,0,1)(2,1,0)[12]. Where the hybrid model is not the best fit, it is the second best based on RMSE. This may be because combining ETS and ARIMA sometimes yields larger errors than the better of the two individual models.
df = data.frame()
for (year in c(2009:2012)){
df = rbind(df, ts_accuracy(power.ts, year, rmse = TRUE, residual = FALSE))
}
df = rbind(df, Mean = colMeans(df))

|  | STL | ETS | ARIMA | HYBRID |
|---|---|---|---|---|
| Test Year: 2010 | 1161717.5 | 1151275.4 | 1068433.9 | 1063564.8 |
| Test Year: 2011 | 1064756.6 | 969066.1 | 1023285.1 | 970870.2 |
| Test Year: 2012 | 763539.8 | 1107406.8 | 582788.1 | 618909.4 |
| Test Year: 2013 | 666957.8 | 642963.9 | 673960.2 | 620798.5 |
| Mean | 914242.9 | 967678.0 | 837116.8 | 818535.7 |
Because the best model changed from year to year but the 50% ETS + 50% ARIMA combination produced the smallest overall RMSE, the residuals of this model are shown below. This collection of plots includes a time series plot, the ACF, and a histogram of the residuals (with an overlaid normal distribution for comparison). The residual time series highlights some large negative residuals. The histogram suggests the residuals may be slightly skewed, but their mean appears to be near zero. Moreover, there is no significant autocorrelation in the residual series, suggesting the forecasts are good.
Lastly, the following function plots all the data along with the forecast from each specified model for a visual comparison, including a zoomed-in plot of the forecast for clarity. The function can also return the forecast produced by a specified model. This will be used to visually analyze the fit and state the final model.
power.fc = function(tseries, plot, forecast){
# Finds optimal lambda
lambda = BoxCox.lambda(tseries)
# Forecast Models
stl.model = tseries %>% stl(s.window = 'periodic', robust = TRUE)
ets.model = tseries %>% ets(lambda = lambda, biasadj = TRUE)
arima.model = tseries %>% auto.arima(lambda = lambda, biasadj = TRUE)
hybrid.model = tseries %>% hybridModel(model = 'ae', lambda = lambda)
# Forecast
power.stl.fc = forecast(stl.model, h = 12)
power.ets.fc = forecast(ets.model, h = 12)
power.arima.fc = forecast(arima.model, h = 12)
power.hybrid.fc = forecast(hybrid.model, h = 12)
# Full forecast plot
p1 = autoplot(tseries) +
autolayer(power.stl.fc, series = 'STL', PI = FALSE) +
autolayer(power.ets.fc, series = 'ETS', PI = FALSE) +
autolayer(power.arima.fc, series = 'ARIMA', PI = FALSE) +
autolayer(power.hybrid.fc, series = 'HYBRID', PI = FALSE) +
labs(title = "Residential Power Usage Forecast",
subtitle = "Jan, 1998 to Dec, 2014",
x = "Year",
y = "Power Used, in kWh") +
guides(colour = guide_legend(title = "Models")) +
theme(legend.position = "top")
# Zoomed in plot on forecast
p2 = p1 + labs(title = "Zoom: Residential Power Usage Forecast",
subtitle = "Jan, 2012 to Dec, 2014") +
xlim(c(2012,2015))
# Returns plots
if(plot){gridExtra::grid.arrange(p1, p2, ncol = 1)}
# Return Forecast Values
if(tolower(forecast) == 'stl'){return(power.stl.fc)}
if(tolower(forecast) == 'ets'){return(power.ets.fc)}
if(tolower(forecast) == 'arima'){return(power.arima.fc)}
if(tolower(forecast) == 'hybrid'){return(power.hybrid.fc)}
}

Visually, each model fits the series well. Therefore, based on the performance metrics, the best model for forecasting the residential power usage for 2014 is the 50% ETS + 50% ARIMA hybrid method: an ETS(A, A, A) model combined with an ARIMA(0,0,1)(2,1,0)[12] with drift, under a Box-Cox transformation with \(\lambda = -0.144\).
Finally, let’s save the forecast for 2014.
# Assumed: the final hybrid forecast is retrieved from the helper defined above
fc = power.fc(power.ts, plot = TRUE, forecast = 'hybrid')
date = seq(as.Date('2014-01-01'), as.Date('2014-12-01'), by = 'month') %>% format('%Y-%b')
# openxlsx::write.xlsx(data.frame(YYYY.MM = date, KWH = fc$mean), "power_forecasts_Deokinanan.xlsx")

Monthly Forecast of Residential Power Usage in kWh for 2014
| YYYY.MM | KWH |
|---|---|
| 2014-Jan | 9279925.70 |
| 2014-Feb | 8122460.03 |
| 2014-Mar | 6630075.73 |
| 2014-Apr | 5955509.17 |
| 2014-May | 5759130.98 |
| 2014-Jun | 7747404.94 |
| 2014-Jul | 9165853.78 |
| 2014-Aug | 9671337.91 |
| 2014-Sep | 8556162.26 |
| 2014-Oct | 6065559.46 |
| 2014-Nov | 5774262.02 |
| 2014-Dec | 7166133.60 |
Two separate data sets are provided, each containing n = 1000 recordings of water flow from a different pipeline. Each pipeline is recorded over a different time period. There are no missing data and no outliers.
pipe.1 = read_excel("Waterflow_Pipe1.xlsx", col_types = c("date", "numeric"))
pipe.2 = read_excel("Waterflow_Pipe2.xlsx", col_types = c("date", "numeric"))
pipe.1 is recorded at uneven time intervals from 23 October 2015 to 1 November 2015, while pipe.2 is recorded at even hourly intervals from 23 October 2015 to 3 December 2015. Before more exploration can be performed, the data needs to be transformed into a time-based sequence and aggregated by hour. Moreover, pipe.1 has multiple recordings within an hour, since its daily counts exceed 24, so the hourly mean will be used. pipe.2 does not require transformation at this point, since its daily count never exceeds 24.
| Date | Count |
|---|---|
| 2015-10-23 | 96 |
| 2015-10-24 | 109 |
| 2015-10-25 | 95 |
| 2015-10-26 | 118 |
| 2015-10-27 | 99 |
| 2015-10-28 | 104 |
| 2015-10-29 | 96 |
| 2015-10-30 | 111 |
| 2015-10-31 | 91 |
| 2015-11-01 | 81 |
| Date | Count |
|---|---|
| 2015-10-23 | 23 |
| 2015-10-24 | 24 |
| 2015-10-25 | 24 |
| 2015-10-26 | 24 |
| 2015-10-27 | 24 |
| 2015-10-28 | 24 |
| 2015-10-29 | 24 |
| 2015-10-30 | 24 |
| 2015-10-31 | 24 |
| 2015-11-01 | 24 |
| 2015-11-02 | 24 |
| 2015-11-03 | 24 |
| 2015-11-04 | 24 |
| 2015-11-05 | 24 |
| 2015-11-06 | 24 |
| 2015-11-07 | 24 |
| 2015-11-08 | 24 |
| 2015-11-09 | 24 |
| 2015-11-10 | 24 |
| 2015-11-11 | 24 |
| 2015-11-12 | 24 |
| 2015-11-13 | 24 |
| 2015-11-14 | 24 |
| 2015-11-15 | 24 |
| 2015-11-16 | 24 |
| 2015-11-17 | 24 |
| 2015-11-18 | 24 |
| 2015-11-19 | 24 |
| 2015-11-20 | 24 |
| 2015-11-21 | 24 |
| 2015-11-22 | 24 |
| 2015-11-23 | 24 |
| 2015-11-24 | 24 |
| 2015-11-25 | 24 |
| 2015-11-26 | 24 |
| 2015-11-27 | 24 |
| 2015-11-28 | 24 |
| 2015-11-29 | 24 |
| 2015-11-30 | 24 |
| 2015-12-01 | 24 |
| 2015-12-02 | 24 |
| 2015-12-03 | 17 |
With little data tidying and transformation necessary, the data is made into a workable time series. The transformed data set now shows at most 24 recordings per day, i.e. one recording every hour, whether or not it is the average of multiple recordings within that hour.
pipe.1 = pipe.1 %>%
mutate(Date = date(`Date Time`),
Hour = hour(`Date Time`)) %>%
group_by(Date, Hour) %>%
summarize(WaterFlow = mean(WaterFlow)) %>%
ungroup() %>%
mutate(DateTime = ymd_h(paste(Date, Hour))) %>%
select(DateTime, WaterFlow)

| Date | Count |
|---|---|
| 2015-10-23 | 24 |
| 2015-10-24 | 24 |
| 2015-10-25 | 24 |
| 2015-10-26 | 24 |
| 2015-10-27 | 23 |
| 2015-10-28 | 23 |
| 2015-10-29 | 24 |
| 2015-10-30 | 24 |
| 2015-10-31 | 24 |
| 2015-11-01 | 22 |
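The hourly series are then wrapped as ts objects. A sketch of the assumed construction (frequency = 1, i.e. a plain index over hours, which matches the observation-number window() calls in the cross-validation further down):
pipe.1.ts = ts(pipe.1$WaterFlow)  # assumed; 236 hourly means
pipe.2.ts = ts(pipe.2$WaterFlow)  # assumed; 1000 hourly recordings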
|  | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| pipe.1.ts | 236 | 19.89 | 4.24 | 19.78 | 19.84 | 4.32 | 8.92 | 31.73 | 22.81 | 0.13 | -0.17 | 0.28 |
| pipe.2.ts | 1000 | 39.56 | 16.05 | 39.68 | 39.46 | 16.73 | 1.88 | 78.30 | 76.42 | 0.03 | -0.55 | 0.51 |
p1 = ggplot(pipe.1, aes(y = WaterFlow)) +
geom_boxplot(fill = "steelblue2", outlier.shape = NA) +
coord_flip() +
theme(axis.text.y = element_blank(),
axis.ticks.y = element_blank()) +
labs(title ="Boxplots of Waterflow Pipelines",
subtitle = "Pipeline #1")
p2 = ggplot(pipe.2, aes(y = WaterFlow)) +
geom_boxplot(fill = "steelblue2") +
coord_flip() +
theme(axis.text.y = element_blank(),
axis.ticks.y = element_blank()) +
labs(subtitle = "Pipeline #2")
gridExtra::grid.arrange(p1, p2, ncol = 1)

From the few summary statistics above, the mean water flow in pipeline #1 is 19.89 (\(\sigma = 4.24\)), whereas in pipeline #2 it is 39.56 (\(\sigma = 16.05\)); pipeline #2 has more water flowing through it. With near-zero skewness and kurtosis, the data seem to follow a normal distribution.
By examining each time series' characteristics, the best modeling technique to forecast a week's worth of water flow from each pipeline can be determined.
The time plot depicts the unsteadiness of the series: there is no apparent seasonality and no distinct trend. The ACF and PACF suggest that the autocorrelations are consistent with white noise. Altogether, the series appears to be already stationary; it will not require differencing, and with no seasonal component, STL decomposition cannot be used for forecasting.
ggtsdisplay(pipe.1.ts, main = "Waterflow from Pipeline #1", ylab = "Amount of Water", xlab = "Time")

For this time series, it seems helpful to stabilize the variability, so a Box-Cox transformation was calculated; the optimal \(\lambda = 0.27\). While the transformation makes little difference to the point forecasts, it will have a large effect on the prediction intervals.
bct = function(ts, title){
a = autoplot(ts) +
labs(title = sprintf("Before: %s", title),
x = "Time",
y = "Amount of Water")
lambda = BoxCox.lambda(ts)
at = autoplot(BoxCox(ts, lambda)) +
labs(title = sprintf("Transformed: %s", title),
subtitle = sprintf("lambda = %0.2f", lambda),
y = "Box-Cox Transformed",
x = "Time")
gridExtra::grid.arrange(a, at)
}
lambda = BoxCox.lambda(pipe.1.ts)
bct(pipe.1.ts, title = "Waterflow from Pipeline #1")

Exponential smoothing methods were considered to fit the time series based on its variability. This technique selects a candidate model by minimizing the AICc. It is no surprise that the ETS(A, N, N) model best fits the data, i.e. exponential smoothing with additive error, no trend component, and no seasonality.
## ETS(A,N,N)
##
## Call:
## ets(y = ., lambda = lambda, biasadj = TRUE)
##
## Box-Cox transformation: lambda= 0.272
##
## Smoothing parameters:
## alpha = 1e-04
##
## Initial states:
## l = 4.5771
##
## sigma: 0.4936
##
## AIC AICc BIC
## 960.1697 960.2731 970.5612
Moreover, it was already determined that the number of differences required for the time series to be made stationary is 0, so an ARIMA model is also considered. The unit root test statistic below is smaller than the 1% critical value, therefore d = 0. Based on the ACF and PACF, there are no significant spikes, making (p, q) = (0, 0).
In an \(ARIMA(p,d,q)\) model, p is the order of the autoregressive part, d the degree of first differencing, and q the order of the moving average part. Altogether, a possible model is ARIMA(0,0,0) with a non-zero mean, which is further confirmed by auto.arima() below.
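A sketch of the calls behind the output below, assuming the urca package for the KPSS test and the pipeline #1 lambda from above:
library(urca)
pipe.1.ts %>% ur.kpss() %>% summary()                      # KPSS unit root test
pipe.1.ts %>% auto.arima(lambda = lambda, biasadj = TRUE)  # confirms ARIMA(0,0,0)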
##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 4 lags.
##
## Value of test-statistic is: 0.2346
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
## Series: .
## ARIMA(0,0,0) with non-zero mean
## Box Cox transformation: lambda= 0.271971
##
## Coefficients:
## mean
## 4.5771
## s.e. 0.0320
##
## sigma^2 estimated as 0.2425: log likelihood=-167.21
## AIC=338.42 AICc=338.47 BIC=345.35
Therefore, the proposed models that might be suitable for fitting the time series include:
| Technique | Proposed Model |
|---|---|
| Forecasting with Exponential Smoothing | ETS(A, N, N) |
| Forecasting with ARIMA | ARIMA(0,0,0) with non-zero mean |
Similar to the time plot of pipeline #1, pipeline #2 depicts an unsteady series: there is no apparent seasonality and no distinct trend. The ACF and PACF highlight a few significant autocorrelations with no pattern. The series will therefore undergo a unit root test to determine whether these autocorrelations can be distinguished from white noise; if not, the data will be treated as stationary and will not require differencing.
ggtsdisplay(pipe.2.ts, main = "Waterflow from Pipeline #2", ylab = "Amount of Water", xlab = "Time")

A Box-Cox transformation was calculated to help stabilize the variability; the optimal \(\lambda = 0.85\). Once again, while the transformation makes little difference to the point forecasts, it will have a large effect on the prediction intervals.
As with pipeline #1, it is no surprise that the ETS(A, N, N) model best fits the data, i.e. exponential smoothing with additive error, no trend component, and no seasonality.
## ETS(A,N,N)
##
## Call:
## ets(y = ., lambda = lambda, biasadj = TRUE)
##
## Box-Cox transformation: lambda= 0.8493
##
## Smoothing parameters:
## alpha = 1e-04
##
## Initial states:
## l = 25.2727
##
## sigma: 9.3622
##
## AIC AICc BIC
## 11385.12 11385.14 11399.84
Moreover, the unit root test statistic is smaller than all the critical values, suggesting that the time series is already stationary, so an ARIMA model is also considered. With d = 0 and no significant spike at the first lag, (p, q) = (0, 0). Altogether, a possible model is ARIMA(0,0,0) with a non-zero mean, which is further confirmed by auto.arima().
##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 7 lags.
##
## Value of test-statistic is: 0.1067
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
## Series: .
## ARIMA(0,0,0) with non-zero mean
## Box Cox transformation: lambda= 0.8493285
##
## Coefficients:
## mean
## 25.2700
## s.e. 0.2957
##
## sigma^2 estimated as 87.55: log likelihood=-3654.57
## AIC=7313.14 AICc=7313.15 BIC=7322.96
Therefore, the proposed models that might be suitable for fitting the time series include:
| Technique | Proposed Model |
|---|---|
| Forecasting with Exponential Smoothing | ETS(A, N, N) |
| Forecasting with ARIMA | ARIMA(0,0,0) with non-zero mean |
Simple cross-validation with these models is conducted to measure the root mean squared error. Because both models are expected to forecast close to the mean of the data, \(\Delta RMSE\) should be very small.
ts_accuracy = function(tseries, time, rmse){
# Split to train and test
train = window(tseries, end = time)
test = window(tseries, start = time + 1)
# Finds optimal lambda
lambda = BoxCox.lambda(tseries)
# Forecast Models
ets.model = train %>% ets(model = 'ANN', lambda = lambda, biasadj = TRUE)
arima.model = train %>% Arima(order = c(0,0,0), lambda = lambda, biasadj = TRUE)
# Forecast
ets.fc = forecast(ets.model, h = length(test))$mean
arima.fc = forecast(arima.model, h = length(test))$mean
# Performance statistics
accuracy = data.frame(RMSE = cbind(accuracy(ets.fc, test)[,2],
accuracy(arima.fc, test)[,2],
accuracy(ets.fc, test)[,2]-accuracy(arima.fc, test)[,2]))
row.names(accuracy) = c(sprintf("Test Starting: %d", time + 1))
names(accuracy) = c("ETS", "ARIMA", "Delta RMSE")
# Returns RMSE
if(rmse){accuracy}
}

From the tables below, the overall smallest RMSE comes from the ETS model for pipeline #1 and from ARIMA(0,0,0) for pipeline #2. The differences between the fits are quite small. While one model may fit the training data slightly better than the other, the one with the more accurate forecasts on the test set is usually the better choice.
df.p1 = data.frame()
for (time in seq(150,222,24)){
df.p1 = rbind(df.p1, ts_accuracy(pipe.1.ts, time, rmse = TRUE))
}
df.p1 = rbind(df.p1, Mean = colMeans(df.p1))

|  | ETS | ARIMA | Delta RMSE |
|---|---|---|---|
| Test Starting: 151 | 4.46197 | 4.46250 | -0.00053 |
| Test Starting: 175 | 4.44947 | 4.44998 | -0.00051 |
| Test Starting: 199 | 3.46861 | 3.46857 | 0.00004 |
| Test Starting: 223 | 3.21729 | 3.21755 | -0.00026 |
| Mean | 3.89933 | 3.89965 | -0.00031 |
df.p2 = data.frame()
for (time in seq(850,976,24)){
df.p2 = rbind(df.p2, ts_accuracy(pipe.2.ts, time, rmse = TRUE))
}
df.p2 = rbind(df.p2, Mean = colMeans(df.p2))

|  | ETS | ARIMA | Delta RMSE |
|---|---|---|---|
| Test Starting: 851 | 17.00739 | 17.00745 | -0.00006 |
| Test Starting: 875 | 16.54871 | 16.54855 | 0.00016 |
| Test Starting: 899 | 16.88023 | 16.88028 | -0.00006 |
| Test Starting: 923 | 16.79999 | 16.79957 | 0.00042 |
| Test Starting: 947 | 16.83745 | 16.83094 | 0.00651 |
| Test Starting: 971 | 17.61152 | 17.60991 | 0.00161 |
| Mean | 16.94755 | 16.94612 | 0.00143 |
For pipelines #1 and #2, there is nothing odd among the residuals of either model, and the Ljung-Box test statistics suggest that the models do not show a lack of fit.
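The tests below come from residual diagnostics of each fitted model; a sketch, assuming model objects named as in the forecasting chunk further down:
checkresiduals(ets.model.p1)    # pipeline #1, ETS(A,N,N)
checkresiduals(arima.model.p1)  # pipeline #1, ARIMA(0,0,0)
checkresiduals(ets.model.p2)    # pipeline #2, ETS(A,N,N)
checkresiduals(arima.model.p2)  # pipeline #2, ARIMA(0,0,0)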
##
## Ljung-Box test
##
## data: Residuals from ETS(A,N,N)
## Q* = 5.4338, df = 8, p-value = 0.7104
##
## Model df: 2. Total lags used: 10
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,0,0) with non-zero mean
## Q* = 5.4337, df = 9, p-value = 0.795
##
## Model df: 1. Total lags used: 10
##
## Ljung-Box test
##
## data: Residuals from ETS(A,N,N)
## Q* = 6.5634, df = 8, p-value = 0.5844
##
## Model df: 2. Total lags used: 10
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,0,0) with non-zero mean
## Q* = 6.564, df = 9, p-value = 0.6824
##
## Model df: 1. Total lags used: 10
And so, the forecast is determined for every hour of the day over one week. The following plots show that the forecasts from the ETS and ARIMA models lie on top of each other, i.e. both forecast the mean. For pipeline #1, the models forecast an expected water flow of 19.892, with ETS 95% PI = (11.778, 28.406) and ARIMA 95% PI = (11.795, 28.386). For pipeline #2, the expected water flow is 39.52, with ETS 95% PI = (9.686, 72.560) and ARIMA 95% PI = (9.697, 72.537).
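# Assumed: ets.model.p1, arima.model.p1, ets.model.p2, and arima.model.p2 were
# fit in a hidden chunk using the ETS(A,N,N) and ARIMA(0,0,0) models identified above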
# Forecast
ets.fc.p1 = forecast(ets.model.p1, h = 7*24)
arima.fc.p1 = forecast(arima.model.p1, h = 7*24)
ets.fc.p2 = forecast(ets.model.p2, h = 7*24)
arima.fc.p2 = forecast(arima.model.p2, h = 7*24)

Analytically, each model does a good job fitting both pipeline time series. Therefore, based on the performance statistics, the best model for each pipeline is:

| Pipeline | Best Model |
|---|---|
| #1 | ETS(A, N, N) |
| #2 | ARIMA(0,0,0) with non-zero mean |
Finally, let’s save the forecast.
date.pipe1 = seq(ymd_hms('2015-11-02 00:00:00'), ymd_hms('2015-11-08 23:00:00'), by = as.difftime(hours(1)))  # hour '24:00:00' is not parseable, so the start is written as midnight of the next day
date.pipe2 = seq(ymd_hms('2015-12-03 17:00:00'), ymd_hms('2015-12-10 16:00:00'), by = as.difftime(hours(1)))
# openxlsx::write.xlsx(data.frame('Date.Time' = date.pipe1,
# 'WaterFlow' = ets.fc.p1$mean), "pipe1_forecasts_Deokinanan.xlsx")
# openxlsx::write.xlsx(data.frame('Date.Time' = date.pipe2,
#                                 'WaterFlow' = arima.fc.p2$mean), "pipe2_forecasts_Deokinanan.xlsx")

A Week of Hourly Forecast for Pipelines 1 & 2