This project consists of three parts, two required and one bonus, and is worth 15% of your grade.
Part A – ATM Forecast, ATM624Data.xlsx
In Part A, I want you to forecast how much cash is taken out of four different ATM machines for May 2010. The data is given in a single file. The variable 'Cash' is provided in hundreds of dollars; other than that, it is straightforward. I am being somewhat ambiguous on purpose, to give this more of a business feel. Explain and demonstrate your process, techniques used and not used, and your actual forecast. I am giving you data via an Excel file; please provide your written report with your findings, visuals, discussion, and your R code via an RPubs link, along with the actual .rmd file. Also, please submit the forecast in an Excel-readable file.
Part B – Forecasting Power, ResidentialCustomerForecastLoad-624.xlsx
Part B consists of a simple dataset of residential power usage from January 1998 until December 2013. Your assignment is to model these data and provide a monthly forecast for 2014. The data is given in a single file. The variable 'KWH' is power consumption in kilowatt hours; the rest is straightforward. Add this to your existing files above.
Part C – BONUS, optional (part or all), Waterflow_Pipe1.xlsx and Waterflow_Pipe2.xlsx
Part C consists of two datasets. These are simple two-column sets, but they have different time stamps. Your optional assignment is to time-base sequence the data and aggregate it by hour (an example of what this looks like follows); for multiple recordings within an hour, take the mean. Then determine whether the data is stationary and can be forecast. If so, provide a week-ahead forecast and present the results via RPubs and the .rmd file, with the forecast in an Excel-readable file.
library(tidyverse)
library(ggplot2)
library(kableExtra)
library(gridExtra)
library(rio)
library(fpp2)
library(urca)
library(forecast)
library(lubridate)
library(dplyr)
The goal of the project is to explore time series, decomposition, forecasting, data preprocessing, exponential smoothing, and ARIMA.
First, load the Excel data, clean it by dropping the NA values, and reshape it into a wider format with one column per ATM:
atm <- import("https://raw.githubusercontent.com/GabrielSantos33/DATA624_Project1/main/ATM624Data.xlsx", col_types = c("date", "text", "numeric"))
atm_daily <- atm %>% drop_na() %>% spread(ATM, Cash)
atm_daily
## DATE ATM1 ATM2 ATM3 ATM4
## 1 2009-05-01 96 107 0 776.993423
## 2 2009-05-02 82 89 0 524.417959
## 3 2009-05-03 85 90 0 792.811362
## 4 2009-05-04 90 55 0 908.238457
## 5 2009-05-05 99 79 0 52.832103
## 6 2009-05-06 88 19 0 52.208454
## 7 2009-05-07 8 2 0 55.473609
## 8 2009-05-08 104 103 0 558.503251
## 9 2009-05-09 87 107 0 904.341359
## 10 2009-05-10 93 118 0 879.493588
## 11 2009-05-11 86 75 0 274.022340
## 12 2009-05-12 111 111 0 396.108347
## 13 2009-05-13 75 25 0 274.547188
## 14 2009-05-14 6 16 0 16.321159
## 15 2009-05-15 102 137 0 852.307037
## 16 2009-05-16 73 95 0 379.561703
## 17 2009-05-17 92 103 0 31.284953
## 18 2009-05-18 82 80 0 491.850577
## 19 2009-05-19 86 118 0 83.705480
## 20 2009-05-20 73 30 0 128.653781
## 21 2009-05-21 20 7 0 14.357590
## 22 2009-05-22 100 118 0 815.358321
## 23 2009-05-23 93 104 0 758.218587
## 24 2009-05-24 90 59 0 601.421108
## 25 2009-05-25 94 40 0 906.796873
## 26 2009-05-26 98 106 0 502.907894
## 27 2009-05-27 73 18 0 88.273138
## 28 2009-05-28 10 9 0 35.438335
## 29 2009-05-29 97 136 0 338.459425
## 30 2009-05-30 102 118 0 4.547996
## 31 2009-05-31 85 64 0 122.667745
## 32 2009-06-01 85 77 0 150.234945
## 33 2009-06-02 108 133 0 721.155281
## 34 2009-06-03 94 45 0 443.012568
## 35 2009-06-04 14 14 0 17.151720
## 36 2009-06-05 3 20 0 14.887695
## 37 2009-06-06 96 147 0 740.543155
## 38 2009-06-07 109 105 0 1058.083384
## 39 2009-06-08 96 132 0 576.183701
## 40 2009-06-09 145 93 0 1484.126887
## 41 2009-06-10 81 26 0 193.787327
## 42 2009-06-11 16 7 0 27.054779
## 43 2009-06-12 142 112 0 1190.898921
## 44 2009-06-13 NA 91 0 746.495617
## 45 2009-06-14 120 72 0 1220.767882
## 46 2009-06-15 106 66 0 1021.503529
## 47 2009-06-16 NA 82 0 372.623610
## 48 2009-06-17 108 24 0 321.272935
## 49 2009-06-18 21 NA 0 92.476655
## 50 2009-06-19 140 134 0 116.644133
## 51 2009-06-20 110 95 0 202.413503
## 52 2009-06-21 115 82 0 524.434514
## 53 2009-06-22 NA 90 0 80.644354
## 54 2009-06-23 108 99 0 64.273589
## 55 2009-06-24 66 NA 0 90.636571
## 56 2009-06-25 13 3 0 48.046061
## 57 2009-06-26 99 117 0 1026.032172
## 58 2009-06-27 105 53 0 423.770971
## 59 2009-06-28 104 44 0 60.626453
## 60 2009-06-29 98 56 0 540.281404
## 61 2009-06-30 110 110 0 173.705568
## 62 2009-07-01 79 36 0 393.239982
## 63 2009-07-02 16 12 0 41.948621
## 64 2009-07-03 110 128 0 310.006527
## 65 2009-07-04 96 72 0 110.051915
## 66 2009-07-05 114 122 0 682.182367
## 67 2009-07-06 126 100 0 54.667904
## 68 2009-07-07 126 108 0 213.989896
## 69 2009-07-08 73 25 0 738.126360
## 70 2009-07-09 4 6 0 16.206081
## 71 2009-07-10 19 22 0 16.223205
## 72 2009-07-11 114 135 0 1050.206528
## 73 2009-07-12 98 69 0 438.477361
## 74 2009-07-13 97 52 0 546.691592
## 75 2009-07-14 114 81 0 858.236420
## 76 2009-07-15 78 27 0 446.733514
## 77 2009-07-16 19 4 0 94.672035
## 78 2009-07-17 102 147 0 644.390623
## 79 2009-07-18 94 102 0 568.815899
## 80 2009-07-19 108 83 0 704.507042
## 81 2009-07-20 91 55 0 571.645285
## 82 2009-07-21 86 74 0 479.875501
## 83 2009-07-22 78 22 0 418.864062
## 84 2009-07-23 16 4 0 27.569130
## 85 2009-07-24 114 104 0 834.834364
## 86 2009-07-25 115 81 0 910.834075
## 87 2009-07-26 108 61 0 468.106927
## 88 2009-07-27 102 70 0 768.121660
## 89 2009-07-28 129 126 0 1089.167762
## 90 2009-07-29 79 33 0 266.898417
## 91 2009-07-30 13 7 0 7.110664
## 92 2009-07-31 103 126 0 704.192012
## 93 2009-08-01 90 92 0 495.350606
## 94 2009-08-02 68 81 0 142.638951
## 95 2009-08-03 85 49 0 428.560174
## 96 2009-08-04 99 146 0 894.969364
## 97 2009-08-05 86 79 0 610.290623
## 98 2009-08-06 13 37 0 70.613731
## 99 2009-08-07 116 136 0 593.570261
## 100 2009-08-08 105 111 0 341.601807
## 101 2009-08-09 123 78 0 735.035708
## 102 2009-08-10 114 57 0 462.773028
## 103 2009-08-11 127 106 0 1156.496108
## 104 2009-08-12 111 38 0 454.091938
## 105 2009-08-13 34 15 0 283.337265
## 106 2009-08-14 151 119 0 571.508226
## 107 2009-08-15 110 110 0 772.179652
## 108 2009-08-16 115 68 0 260.125517
## 109 2009-08-17 112 60 0 357.515267
## 110 2009-08-18 132 92 0 16.157597
## 111 2009-08-19 94 22 0 334.312503
## 112 2009-08-20 24 11 0 25.696404
## 113 2009-08-21 122 121 0 254.887098
## 114 2009-08-22 104 89 0 357.003129
## 115 2009-08-23 128 62 0 1245.594280
## 116 2009-08-24 120 79 0 917.371157
## 117 2009-08-25 174 83 0 592.180082
## 118 2009-08-26 96 31 0 412.474975
## 119 2009-08-27 13 4 0 82.911171
## 120 2009-08-28 121 96 0 996.010226
## 121 2009-08-29 133 50 0 103.910375
## 122 2009-08-30 118 52 0 1116.915044
## 123 2009-08-31 91 56 0 816.847015
## 124 2009-09-01 120 104 0 914.493389
## 125 2009-09-02 88 27 0 648.209198
## 126 2009-09-03 19 13 0 140.515569
## 127 2009-09-04 150 107 0 1495.154775
## 128 2009-09-05 144 125 0 1301.396344
## 129 2009-09-06 121 103 0 779.717193
## 130 2009-09-07 105 42 0 744.262278
## 131 2009-09-08 133 90 0 200.391803
## 132 2009-09-09 109 29 0 854.179084
## 133 2009-09-10 18 8 0 50.719369
## 134 2009-09-11 1 2 0 7.404799
## 135 2009-09-12 105 84 0 1061.192780
## 136 2009-09-13 112 62 0 715.021460
## 137 2009-09-14 82 77 0 35.444668
## 138 2009-09-15 111 78 0 491.884510
## 139 2009-09-16 79 25 0 343.496923
## 140 2009-09-17 13 8 0 20.714186
## 141 2009-09-18 112 113 0 505.717125
## 142 2009-09-19 99 71 0 97.325113
## 143 2009-09-20 140 94 0 473.570486
## 144 2009-09-21 110 59 0 899.773474
## 145 2009-09-22 180 89 0 1712.074986
## 146 2009-09-23 73 18 0 281.388412
## 147 2009-09-24 7 6 0 26.337969
## 148 2009-09-25 106 115 0 328.644988
## 149 2009-09-26 103 81 0 761.371143
## 150 2009-09-27 93 61 0 629.319901
## 151 2009-09-28 96 71 0 235.753411
## 152 2009-09-29 117 69 0 1195.223181
## 153 2009-09-30 80 36 0 782.407474
## 154 2009-10-01 14 14 0 108.331743
## 155 2009-10-02 120 104 0 846.556566
## 156 2009-10-03 91 73 0 576.169344
## 157 2009-10-04 96 86 0 441.883927
## 158 2009-10-05 74 85 0 319.404846
## 159 2009-10-06 108 126 0 154.163484
## 160 2009-10-07 73 31 0 543.239375
## 161 2009-10-08 13 9 0 124.334415
## 162 2009-10-09 93 114 0 449.074910
## 163 2009-10-10 94 78 0 614.654064
## 164 2009-10-11 76 45 0 219.237855
## 165 2009-10-12 111 60 0 945.736417
## 166 2009-10-13 88 91 0 9.691243
## 167 2009-10-14 76 22 0 696.245756
## 168 2009-10-15 9 7 0 8.084284
## 169 2009-10-16 87 75 0 845.463546
## 170 2009-10-17 105 66 0 212.902683
## 171 2009-10-18 78 64 0 9.155135
## 172 2009-10-19 67 51 0 46.831421
## 173 2009-10-20 90 94 0 30.372376
## 174 2009-10-21 68 23 0 400.484974
## 175 2009-10-22 9 4 0 61.338241
## 176 2009-10-23 78 127 0 428.046952
## 177 2009-10-24 74 61 0 1.563260
## 178 2009-10-25 74 0 0 9.018826
## 179 2009-10-26 60 95 0 50.304215
## 180 2009-10-27 75 79 0 313.414359
## 181 2009-10-28 61 38 0 626.687136
## 182 2009-10-29 9 8 0 5.882076
## 183 2009-10-30 90 119 0 77.803024
## 184 2009-10-31 86 57 0 337.894871
## 185 2009-11-01 86 58 0 211.812232
## 186 2009-11-02 79 80 0 690.122884
## 187 2009-11-03 90 82 0 596.381877
## 188 2009-11-04 80 49 0 65.278483
## 189 2009-11-05 21 16 0 77.444454
## 190 2009-11-06 93 116 0 43.721409
## 191 2009-11-07 104 61 0 964.175255
## 192 2009-11-08 109 59 0 834.959599
## 193 2009-11-09 88 80 0 636.962626
## 194 2009-11-10 96 86 0 927.078614
## 195 2009-11-11 70 23 0 75.873055
## 196 2009-11-12 15 7 0 43.690550
## 197 2009-11-13 73 91 0 621.292501
## 198 2009-11-14 94 57 0 312.582041
## 199 2009-11-15 108 58 0 825.635396
## 200 2009-11-16 73 61 0 413.620440
## 201 2009-11-17 87 77 0 194.047355
## 202 2009-11-18 75 20 0 345.695900
## 203 2009-11-19 10 5 0 32.428111
## 204 2009-11-20 92 132 0 655.315200
## 205 2009-11-21 87 49 0 638.136824
## 206 2009-11-22 74 57 0 15.306757
## 207 2009-11-23 73 68 0 299.663010
## 208 2009-11-24 93 80 0 626.662127
## 209 2009-11-25 66 31 0 601.141158
## 210 2009-11-26 18 3 0 64.131250
## 211 2009-11-27 99 85 0 562.763323
## 212 2009-11-28 94 53 0 317.253589
## 213 2009-11-29 136 46 0 1167.264437
## 214 2009-11-30 6 2 0 47.365125
## 215 2009-12-01 140 113 0 993.777470
## 216 2009-12-02 73 22 0 687.196935
## 217 2009-12-03 9 5 0 71.147567
## 218 2009-12-04 140 112 0 1046.971281
## 219 2009-12-05 103 59 0 1009.050231
## 220 2009-12-06 110 72 0 288.649516
## 221 2009-12-07 90 77 0 591.508331
## 222 2009-12-08 135 85 0 230.605915
## 223 2009-12-09 67 27 0 578.431798
## 224 2009-12-10 12 1 0 70.210639
## 225 2009-12-11 109 91 0 580.796406
## 226 2009-12-12 84 36 0 149.150820
## 227 2009-12-13 92 46 0 403.861263
## 228 2009-12-14 84 100 0 124.958721
## 229 2009-12-15 118 73 0 230.148897
## 230 2009-12-16 68 22 0 19.008453
## 231 2009-12-17 14 9 0 128.638640
## 232 2009-12-18 90 117 0 327.965321
## 233 2009-12-19 92 44 0 532.245673
## 234 2009-12-20 93 44 0 877.397772
## 235 2009-12-21 85 78 0 662.112869
## 236 2009-12-22 93 89 0 300.629002
## 237 2009-12-23 70 33 0 667.628261
## 238 2009-12-24 13 5 0 14.573322
## 239 2009-12-25 90 102 0 660.355296
## 240 2009-12-26 91 68 0 510.988823
## 241 2009-12-27 102 64 0 164.462091
## 242 2009-12-28 97 81 0 748.172762
## 243 2009-12-29 42 9 0 174.241405
## 244 2009-12-30 2 2 0 20.192725
## 245 2009-12-31 14 2 0 100.686543
## 246 2010-01-01 132 117 0 985.956884
## 247 2010-01-02 108 56 0 597.058060
## 248 2010-01-03 120 55 0 468.457250
## 249 2010-01-04 120 112 0 856.581158
## 250 2010-01-05 90 100 0 684.774261
## 251 2010-01-06 48 17 0 381.562058
## 252 2010-01-07 15 7 0 152.092362
## 253 2010-01-08 86 49 0 271.934402
## 254 2010-01-09 109 96 0 135.499645
## 255 2010-01-10 115 47 0 1105.018595
## 256 2010-01-11 123 107 0 291.765790
## 257 2010-01-12 123 93 0 1141.426655
## 258 2010-01-13 60 27 0 140.502720
## 259 2010-01-14 22 9 0 8.781528
## 260 2010-01-15 81 78 0 85.260115
## 261 2010-01-16 98 47 0 66.655933
## 262 2010-01-17 94 31 0 709.938568
## 263 2010-01-18 76 91 0 567.549590
## 264 2010-01-19 96 77 0 486.824572
## 265 2010-01-20 72 22 0 17.481238
## 266 2010-01-21 12 1 0 49.208112
## 267 2010-01-22 91 125 0 357.233561
## 268 2010-01-23 74 53 0 179.888632
## 269 2010-01-24 104 37 0 728.992926
## 270 2010-01-25 74 69 0 261.121740
## 271 2010-01-26 91 69 0 628.863047
## 272 2010-01-27 66 16 0 277.044815
## 273 2010-01-28 28 2 0 41.475487
## 274 2010-01-29 152 95 0 1574.779330
## 275 2010-01-30 97 45 0 669.707653
## 276 2010-01-31 100 61 0 979.675944
## 277 2010-02-01 85 71 0 426.406008
## 278 2010-02-02 123 91 0 153.242692
## 279 2010-02-03 46 28 0 274.949921
## 280 2010-02-04 38 9 0 136.276358
## 281 2010-02-05 138 97 0 454.416950
## 282 2010-02-06 74 60 0 458.304270
## 283 2010-02-07 85 27 0 112.030628
## 284 2010-02-08 78 74 0 417.912276
## 285 2010-02-09 123 45 0 10919.761638
## 286 2010-02-10 50 26 0 42.438078
## 287 2010-02-11 28 2 0 280.043427
## 288 2010-02-12 105 73 0 412.318881
## 289 2010-02-13 109 48 0 852.837416
## 290 2010-02-14 26 9 0 179.702084
## 291 2010-02-15 54 25 0 226.047368
## 292 2010-02-16 105 58 0 989.195610
## 293 2010-02-17 179 110 0 824.916781
## 294 2010-02-18 125 94 0 966.610561
## 295 2010-02-19 84 25 0 734.220167
## 296 2010-02-20 99 101 0 121.332622
## 297 2010-02-21 110 77 0 287.931829
## 298 2010-02-22 80 6 0 502.748346
## 299 2010-02-23 1 3 0 9.752776
## 300 2010-02-24 111 99 0 258.138869
## 301 2010-02-25 115 86 0 1170.288228
## 302 2010-02-26 108 20 0 193.079004
## 303 2010-02-27 100 72 0 402.770395
## 304 2010-02-28 152 81 0 1275.968167
## 305 2010-03-01 90 16 0 819.881512
## 306 2010-03-02 4 2 0 26.392615
## 307 2010-03-03 128 110 0 893.884507
## 308 2010-03-04 87 75 0 360.641557
## 309 2010-03-05 84 35 0 859.899389
## 310 2010-03-06 69 102 0 381.473089
## 311 2010-03-07 112 100 0 9.711205
## 312 2010-03-08 62 9 0 601.016953
## 313 2010-03-09 4 2 0 32.454958
## 314 2010-03-10 94 100 0 553.395793
## 315 2010-03-11 102 85 0 572.374702
## 316 2010-03-12 98 33 0 218.831558
## 317 2010-03-13 88 24 0 828.326827
## 318 2010-03-14 123 111 0 630.762176
## 319 2010-03-15 55 5 0 339.418304
## 320 2010-03-16 4 1 0 31.502210
## 321 2010-03-17 92 106 0 486.679944
## 322 2010-03-18 88 102 0 335.363138
## 323 2010-03-19 92 25 0 340.008595
## 324 2010-03-20 92 64 0 291.282267
## 325 2010-03-21 117 78 0 46.029411
## 326 2010-03-22 73 5 0 201.403616
## 327 2010-03-23 3 1 0 9.558301
## 328 2010-03-24 85 83 0 877.966103
## 329 2010-03-25 88 78 0 778.110791
## 330 2010-03-26 76 66 0 707.613494
## 331 2010-03-27 93 15 0 351.457835
## 332 2010-03-28 116 64 0 711.049107
## 333 2010-03-29 68 3 0 502.518192
## 334 2010-03-30 19 0 0 23.337569
## 335 2010-03-31 127 102 0 492.909936
## 336 2010-04-01 93 99 0 405.312584
## 337 2010-04-02 97 41 0 818.394331
## 338 2010-04-03 102 79 0 152.270800
## 339 2010-04-04 109 71 0 281.113528
## 340 2010-04-05 68 9 0 470.425772
## 341 2010-04-06 4 2 0 2.507317
## 342 2010-04-07 105 103 0 414.741223
## 343 2010-04-08 79 67 0 719.426783
## 344 2010-04-09 90 85 0 811.716825
## 345 2010-04-10 87 78 0 889.584209
## 346 2010-04-11 92 67 0 616.127547
## 347 2010-04-12 63 12 0 61.144354
## 348 2010-04-13 3 1 0 27.325671
## 349 2010-04-14 103 97 0 767.778153
## 350 2010-04-15 80 106 0 326.084093
## 351 2010-04-16 80 74 0 825.197816
## 352 2010-04-17 84 86 0 383.815025
## 353 2010-04-18 81 55 0 195.483454
## 354 2010-04-19 88 13 0 711.164653
## 355 2010-04-20 3 1 0 29.869011
## 356 2010-04-21 100 100 0 556.792330
## 357 2010-04-22 69 103 0 386.175335
## 358 2010-04-23 85 44 0 165.294181
## 359 2010-04-24 85 61 0 5.451815
## 360 2010-04-25 109 89 0 542.280602
## 361 2010-04-26 74 11 0 403.839336
## 362 2010-04-27 4 2 0 13.697331
## 363 2010-04-28 96 107 96 348.201061
## 364 2010-04-29 82 89 82 44.245345
## 365 2010-04-30 85 90 85 482.287107
The table shows that withdrawals from ATM4 are much larger than those from the other ATMs. In addition, almost all of the ATM3 withdrawals are zero.
Generating the summary statistics and the boxplot:
par(mfrow=c(4,1))
for (i in 2:5) {
print(summary(atm_daily[i]))
}
## ATM1
## Min. : 1.00
## 1st Qu.: 73.00
## Median : 91.00
## Mean : 83.89
## 3rd Qu.:108.00
## Max. :180.00
## NA's :3
## ATM2
## Min. : 0.00
## 1st Qu.: 25.50
## Median : 67.00
## Mean : 62.58
## 3rd Qu.: 93.00
## Max. :147.00
## NA's :2
## ATM3
## Min. : 0.0000
## 1st Qu.: 0.0000
## Median : 0.0000
## Mean : 0.7206
## 3rd Qu.: 0.0000
## Max. :96.0000
## ATM4
## Min. : 1.563
## 1st Qu.: 124.334
## Median : 403.839
## Mean : 474.043
## 3rd Qu.: 704.507
## Max. :10919.762
par(mfrow=c(1,4))
for (i in 2:5){
boxplot(atm_daily[i],
main = sprintf("%s", names(atm_daily)[i]),
col="lightblue", width = 2)
}
From the boxplots above we see that ATM1 has numerous outliers, ATM2 has none, ATM3 has 3 (one for each non-zero observation), and ATM4 has 3 outliers, one of which is quite extreme. Handling these outliers will be important for forecasting.
Now, let’s convert the data into a time series.
atm_ts <- ts(atm_daily %>% select(-DATE))
autoplot(atm_ts) +
ggtitle("Daily Cash Withdrawal", subtitle = "4 ATM machines") +
xlab("Day") +
ylab("Hundreds of Dollars ($100)")
As the spike from ATM4 in the first plot makes it hard to see the details in ATM1, ATM2, and ATM3, individual plots will be shown below.
To handle the data better, the extreme outlier in ATM4 will be replaced with the ATM4 median before forecasting.
atm[1394,3] <- 403.839 # replace the 2010-02-09 outlier (10919.76) with the ATM4 median
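Rather than hardcoding the row index and replacement value, the outlier location and the ATM4 median can be computed programmatically; a minimal sketch of the same step, assuming the long-format atm data frame loaded above:
# locate the extreme ATM4 observation and replace it with the ATM4 median
# (equivalent to the hardcoded line above)
atm4_rows <- which(atm$ATM == "ATM4")
outlier_row <- atm4_rows[which.max(atm$Cash[atm4_rows])]
atm$Cash[outlier_row] <- median(atm$Cash[atm4_rows], na.rm = TRUE)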
atm %>% drop_na() %>% ggplot(aes(x=DATE, y=Cash, col=ATM)) +
geom_line(show.legend = FALSE) +
facet_wrap(~ATM, ncol=1, scales="free_y") +
labs(title="Daily Cash Withdrawal from 4 ATM Machines", subtitle = "May 2009 to April 2010") +
xlab("Date") + ylab("Hundreds of Dollars ($100)")
From the graphs we can see that ATM1 and ATM2 show seasonality, so we will apply a Box-Cox transformation. In the case of ATM3 we do not have enough information for a prediction; we can assume that ATM3 possibly began operating at the end of April 2010. For ATM4 we replace the outlier with the median so the series is better suited for forecasting.
atm1 <- atm_daily[2]
atm1 <- ts(atm1, frequency=7)
atm2 <- atm_daily[3]
atm2 <- ts(atm2, frequency=7)
atm3 <- atm_daily[4]
atm3 <- ts(atm3, start=363)
atm3[which(atm3==0)] <- NA
atm4 <- atm_daily[5]
atm4[285,] <- 403.839
atm4 <- ts(atm4, frequency=7)
ggtsdisplay(atm1, main="Daily Cash Withdrawal in ATM1 (May 2009 - April 2010)")
ggseasonplot(atm1)
ggsubseriesplot(atm1)
atm1_bc <- BoxCox(atm1, lambda = BoxCox.lambda(atm1))
ggtsdisplay(atm1_bc, main="ATM1 with BoxCox Transformation")
ATM1 clearly shows weekly seasonality: the ACF and PACF plots have significant spikes at lags 7, 14, and 21. This is a non-stationary time series.
After applying the Box-Cox transformation, we still see the weekly seasonality in the time series, so seasonal differencing is needed.
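The amount of differencing can also be checked programmatically with the forecast package's unit-root helpers; a quick sketch on the transformed series (we would expect one seasonal difference to be suggested):
ndiffs(atm1_bc)   # suggested number of non-seasonal differences
nsdiffs(atm1_bc)  # suggested number of seasonal differences (frequency = 7)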
ggtsdisplay(diff(atm1_bc, 7), points=FALSE)
atm1_bc %>% diff(., 7) %>% ur.kpss() %>% summary()
##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 5 lags.
##
## Value of test-statistic is: 0.016
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
The test statistic (0.016) is well below all critical values, so the differenced series appears stationary. In an ARIMA(p,d,q)(P,D,Q)[m] model, p/P is the order of the autoregressive part, d/D the degree of differencing, q/Q the order of the moving average part, and m the number of observations per seasonal cycle (here m=7 for weekly seasonality). The significant non-seasonal lag-1 spikes in the ACF and PACF suggest p=q=1.
The seasonal spike at lag 7 suggests seasonal AR(1) and/or MA(1) components. Since the seasonal ACF decays gradually, this points to a seasonal MA(1) with no seasonal AR term, so P=0 and Q=1. I will try ARIMA(1,0,1)(0,1,1)[7].
atm1_arima <- Arima(atm1, order=c(1,0,1), seasonal=c(0,1,1), lambda = BoxCox.lambda(atm1))
summary(atm1_arima)
## Series: atm1
## ARIMA(1,0,1)(0,1,1)[7]
## Box Cox transformation: lambda= 0.2081476
##
## Coefficients:
## ar1 ma1 sma1
## -0.5040 0.6201 -0.6438
## s.e. 0.2417 0.2183 0.0427
##
## sigma^2 = 1.252: log likelihood = -544.29
## AIC=1096.58 AICc=1096.7 BIC=1112.11
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 2.514774 25.09103 15.83582 -89.65246 108.6821 0.8958869
## ACF1
## Training set -0.007347018
checkresiduals(atm1_arima)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(1,0,1)(0,1,1)[7]
## Q* = 9.4438, df = 11, p-value = 0.581
##
## Model df: 3. Total lags used: 14
Now, let’s use auto.arima to find out which is the best Arima model:
atm1_auto <- auto.arima(atm1, approximation = FALSE, lambda=BoxCox.lambda(atm1))
summary(atm1_auto)
## Series: atm1
## ARIMA(0,0,2)(0,1,1)[7]
## Box Cox transformation: lambda= 0.2081476
##
## Coefficients:
## ma1 ma2 sma1
## 0.0989 -0.1107 -0.6482
## s.e. 0.0532 0.0527 0.0425
##
## sigma^2 = 1.247: log likelihood = -543.68
## AIC=1095.37 AICc=1095.48 BIC=1110.89
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 2.493867 25.19797 16.12913 -88.12706 107.3975 0.9124807 0.01895515
checkresiduals(atm1_auto)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,0,2)(0,1,1)[7]
## Q* = 8.5811, df = 11, p-value = 0.6605
##
## Model df: 3. Total lags used: 14
The ARIMA model found by ‘auto.arima’ is ARIMA(0,0,2)(0,1,1)[7]. The ARIMA model I suggested is ARIMA(1,0,1)(0,1,1)[7].
According to the error measures and the residual plots, both models represent the data well, with similar AIC values, similar error measures, and similar Ljung-Box p-values. Either model could be used, but ARIMA(0,0,2)(0,1,1)[7] is preferred because it has a smaller AICc (1095.48 vs. 1096.70).
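As an additional check beyond AICc, the two candidate models can be compared on a holdout sample; a minimal sketch, where the split point (roughly the last four weeks as a test set) is an arbitrary choice of mine:
atm1_train <- window(atm1, end = c(48, 7))    # first 48 weekly cycles
atm1_test  <- window(atm1, start = c(49, 1))  # remaining observations
fit_sug  <- Arima(atm1_train, order = c(1,0,1), seasonal = c(0,1,1), lambda = BoxCox.lambda(atm1_train))
fit_auto <- Arima(atm1_train, order = c(0,0,2), seasonal = c(0,1,1), lambda = BoxCox.lambda(atm1_train))
accuracy(forecast(fit_sug,  h = length(atm1_test)), atm1_test)
accuracy(forecast(fit_auto, h = length(atm1_test)), atm1_test)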
atm1_model <- Arima(atm1, order=c(0,0,2), seasonal=c(0,1,1), lambda = BoxCox.lambda(atm1))
ggtsdisplay(atm2, main="Daily Cash Withdrawal in ATM2 (May 2009 - April 2010)")
ggseasonplot(atm2)
ggsubseriesplot(atm2)
atm2_bc <- BoxCox(atm2, lambda = BoxCox.lambda(atm2))
ggtsdisplay(atm2_bc, main="ATM2 with BoxCox Transformation")
ATM2 also clearly shows weekly seasonality: the ACF and PACF plots have significant spikes at lags 7, 14, and 21. Thus, this is a non-stationary time series.
After applying the Box-Cox transformation, we still see the weekly seasonality in the time series, so seasonal differencing is needed.
ggtsdisplay(diff(atm2_bc, 7), points=FALSE)
atm2_bc %>% diff(., 7) %>% ur.kpss() %>% summary()
##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 5 lags.
##
## Value of test-statistic is: 0.0161
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
The time series now appears to be stationary. One seasonal difference was applied, so D=1, while no non-seasonal differencing was needed, so d=0. The seasonal spikes in the ACF drop off suddenly while those in the PACF decrease gradually, suggesting a seasonal MA(1) with no seasonal AR term, so P=0 and Q=1. The non-seasonal ACF and PACF have spikes at lags 2 and 5, suggesting p=2 and q=2. I will try ARIMA(2,0,2)(0,1,1)[7].
atm2_arima <- Arima(atm2, order=c(2,0,2), seasonal=c(0,1,1), lambda = BoxCox.lambda(atm2))
summary(atm2_arima)
## Series: atm2
## ARIMA(2,0,2)(0,1,1)[7]
## Box Cox transformation: lambda= 0.7164431
##
## Coefficients:
## ar1 ar2 ma1 ma2 sma1
## -0.4188 -0.9019 0.4587 0.7886 -0.7187
## s.e. 0.0587 0.0471 0.0884 0.0619 0.0414
##
## sigma^2 = 62.19: log likelihood = -1240.08
## AIC=2492.16 AICc=2492.4 BIC=2515.45
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 0.2631418 24.22403 16.90092 -Inf Inf 0.8250038 -0.01464052
checkresiduals(atm2_arima)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(2,0,2)(0,1,1)[7]
## Q* = 9.8051, df = 9, p-value = 0.3665
##
## Model df: 5. Total lags used: 14
Now, let’s use auto.arima to find out which is the best Arima model:
atm2_auto <- auto.arima(atm2, approximation = FALSE, lambda=BoxCox.lambda(atm2))
summary(atm2_auto)
## Series: atm2
## ARIMA(2,0,2)(0,1,1)[7]
## Box Cox transformation: lambda= 0.7164431
##
## Coefficients:
## ar1 ar2 ma1 ma2 sma1
## -0.4188 -0.9019 0.4587 0.7886 -0.7187
## s.e. 0.0587 0.0471 0.0884 0.0619 0.0414
##
## sigma^2 = 62.19: log likelihood = -1240.08
## AIC=2492.16 AICc=2492.4 BIC=2515.45
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 0.2631418 24.22403 16.90092 -Inf Inf 0.8250038 -0.01464052
checkresiduals(atm2_auto)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(2,0,2)(0,1,1)[7]
## Q* = 9.8051, df = 9, p-value = 0.3665
##
## Model df: 5. Total lags used: 14
The ARIMA model found by auto.arima is ARIMA(2,0,2)(0,1,1)[7], the same model I suggested. According to the error measures and the residual plots, the model represents the data well; the large Ljung-Box p-value (0.3665) indicates the residuals resemble white noise.
atm2_model <- Arima(atm2, order=c(2,0,2), seasonal=c(0,1,1), lambda = BoxCox.lambda(atm2))
ggtsdisplay(atm3, main="Daily Cash Withdrawal in ATM3 (May 2009 - April 2010)")
There is not enough information in this time series to fit a model, because there are only 3 non-zero data points. We can assume this ATM began operating at the end of April 2010. Therefore, I will use a simple mean forecast.
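Since the zeros were recoded as NA above, we can confirm how few usable observations ATM3 actually has before settling on meanf() (a quick check):
sum(!is.na(atm3))  # number of non-missing observations in ATM3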
ggtsdisplay(atm4, main="Daily Cash Withdrawal in ATM4 (May 2009 - April 2010)")
ggseasonplot(atm4)
ggsubseriesplot(atm4)
atm4_bc <- BoxCox(atm4, lambda = BoxCox.lambda(atm4))
ggtsdisplay(atm4_bc, main="ATM4 with BoxCox Transformation")
ATM4 also clearly shows weekly seasonality: the ACF and PACF plots have significant spikes at lags 7, 14, and 21. Thus, this is a non-stationary time series.
After applying the Box-Cox transformation, we still see the weekly seasonality in the time series, so differencing is needed.
ggtsdisplay(diff(atm4_bc, 7), points=FALSE)
atm4_bc %>% diff() %>% ur.kpss() %>% summary()
##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 5 lags.
##
## Value of test-statistic is: 0.0124
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
The time series now appears to be stationary. One seasonal difference was applied, so D=1, while no non-seasonal differencing was needed, so d=0. The seasonal spikes in the ACF drop off suddenly while those in the PACF decrease gradually, suggesting a seasonal MA(1) with no seasonal AR term, so P=0 and Q=1. The single non-seasonal spike at lag 3 in the ACF and PACF plots suggests p=q=1. I will try ARIMA(1,0,1)(0,1,1)[7].
atm4_arima <- Arima(atm4, order=c(1,0,1), seasonal=c(0,1,1), lambda = BoxCox.lambda(atm4))
summary(atm4_arima)
## Series: atm4
## ARIMA(1,0,1)(0,1,1)[7]
## Box Cox transformation: lambda= 0.4525697
##
## Coefficients:
## ar1 ma1 sma1
## 0.6851 -0.6059 -0.8677
## s.e. 0.3040 0.3315 0.0315
##
## sigma^2 = 171.4: log likelihood = -1432.08
## AIC=2872.16 AICc=2872.27 BIC=2887.68
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 61.63608 337.5467 254.8181 -360.8151 404.2179 0.7355835 0.0185852
checkresiduals(atm4_arima)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(1,0,1)(0,1,1)[7]
## Q* = 16.519, df = 11, p-value = 0.1229
##
## Model df: 3. Total lags used: 14
Now, let’s use auto.arima to find out which is the best Arima model:
atm4_auto <- auto.arima(atm4, approximation = FALSE, lambda=BoxCox.lambda(atm4))
summary(atm4_auto)
## Series: atm4
## ARIMA(1,0,0)(2,0,0)[7] with non-zero mean
## Box Cox transformation: lambda= 0.4525697
##
## Coefficients:
## ar1 sar1 sar2 mean
## 0.0771 0.2090 0.1996 29.0082
## s.e. 0.0526 0.0516 0.0525 1.2627
##
## sigma^2 = 182.2: log likelihood = -1466.34
## AIC=2942.69 AICc=2942.85 BIC=2962.19
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 84.38613 351.7695 274.2406 -370.6751 415.1343 0.7916504 0.01960666
checkresiduals(atm4_auto)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(1,0,0)(2,0,0)[7] with non-zero mean
## Q* = 15.92, df = 11, p-value = 0.1441
##
## Model df: 3. Total lags used: 14
The ARIMA model found by auto.arima is ARIMA(1,0,0)(2,0,0)[7], while the model I suggested is ARIMA(1,0,1)(0,1,1)[7]. My suggested model has the smaller AICc and RMSE, so I consider it the better model.
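One caveat: because the two models use different differencing (D=1 versus D=0), their likelihoods are computed on different data and the AICc values are not strictly comparable; the RMSE comparison is the safer argument. A quick sketch to print both criteria side by side:
c(suggested_aicc = atm4_arima$aicc, auto_aicc = atm4_auto$aicc)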
atm4_model <- Arima(atm4, order=c(1,0,1), seasonal=c(0,1,1), lambda = BoxCox.lambda(atm4))
atm1_f <- forecast(atm1_model, 31, level=95)
atm2_f <- forecast(atm2_model, 31, level=95)
atm3_f <- meanf(atm3, 31, level=95)
atm4_f <- forecast(atm4_model, 31, level=95)
gridExtra::grid.arrange(
autoplot(atm1_f) +
labs(title="ATM1: ARIMA(0,0,2)(0,1,1)[7]", x="Day", y="Hundreds of Dollars($100)"),
autoplot(atm2_f) +
labs(title="ATM2: ARIMA(2,0,2)(0,1,1)[7]", x="Day", y="Hundreds of Dollars($100)"),
autoplot(atm3_f) +
labs(title="ATM3: meanf", x="Day", y="Hundreds of Dollars($100)"),
autoplot(atm4_f) +
labs(title="ATM4: ARIMA(1,0,1)(0,1,1)[7]", x="Day", y="Hundreds of Dollars($100)"),
top = grid::textGrob("Forecast on Cash Withdrawal for May 2010")
)
export <- rbind(atm1_f$mean, atm2_f$mean, atm3_f$mean, atm4_f$mean)
write.csv(export, "ATM_Forecast.csv")
data.frame(export) %>% cbind(ATM = c('ATM1', 'ATM2', 'ATM3', 'ATM4')) %>%
select(ATM, everything())
## ATM X1 X2 X3 X4 X5 X6 X7
## 1 ATM1 86.74294 100.55515 73.62242 4.211032 100.14543 79.32380 85.78103
## 2 ATM2 65.64590 71.57733 11.87248 2.741321 97.99480 88.96331 65.68584
## 3 ATM3 87.66667 87.66667 87.66667 87.666667 87.66667 87.66667 87.66667
## 4 ATM4 280.99138 363.04960 380.34153 64.998646 449.24378 303.25757 507.37695
## X8 X9 X10 X11 X12 X13 X14
## 1 87.22992 100.36819 73.62242 4.211032 100.14543 79.32380 85.78103
## 2 65.66859 71.59474 11.85550 2.739787 98.02444 88.95494 65.66519
## 3 87.66667 87.66667 87.66667 87.666667 87.66667 87.66667 87.66667
## 4 306.98773 383.34690 394.53535 68.714726 456.49955 307.26292 511.00791
## X15 X16 X17 X18 X19 X20 X21
## 1 87.22992 100.36819 73.62242 4.211032 100.14543 79.32380 85.78103
## 2 65.68416 71.60715 11.84374 2.738603 98.04512 88.94938 65.65068
## 3 87.66667 87.66667 87.66667 87.666667 87.66667 87.66667 87.66667
## 4 308.87651 384.80722 395.55131 68.982178 457.01579 307.54766 511.26558
## X22 X23 X24 X25 X26 X27 X28
## 1 87.22992 100.36819 73.62242 4.211032 100.14543 79.32380 85.78103
## 2 65.69484 71.61598 11.83559 2.737698 98.05954 88.94570 65.64048
## 3 87.66667 87.66667 87.66667 87.666667 87.66667 87.66667 87.66667
## 4 309.01050 384.91075 395.62331 69.001140 457.05236 307.56783 511.28383
## X29 X30 X31
## 1 87.22992 100.36819 73.62242
## 2 65.70216 71.62226 11.82995
## 3 87.66667 87.66667 87.66667
## 4 309.01999 384.91809 395.62841
The forecasts for ATM1, ATM2, ATM3, and ATM4 are exported to the file 'ATM_Forecast.csv'.
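Since the assignment asks for an Excel-readable file, the already-loaded rio package can also write a native .xlsx directly; a minimal sketch, assuming a writer backend such as openxlsx is installed:
rio::export(data.frame(export), "ATM_Forecast.xlsx")  # same forecasts, native Excel format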
The ATM1 and ATM2 forecasts seem reasonable, but the ATM4 forecast does not appear to follow the pattern of the data, so I do not consider it reliable; it may be necessary to apply another type of model.
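One alternative worth trying for ATM4 is an STL decomposition with an ETS forecast on the seasonally adjusted series, which can be more robust to this kind of noisy level; a hedged sketch, not a definitive fix:
# STL + ETS alternative for the volatile ATM4 series
atm4_stlf <- stlf(atm4, h = 31, s.window = "periodic", lambda = BoxCox.lambda(atm4))
autoplot(atm4_stlf) + labs(title = "ATM4: STL + ETS (alternative)", x = "Day")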
The goal of the project is to explore time series, decomposition, forecasting, data preprocessing, exponential smoothing, and ARIMA.
First, load the Excel data into R and clean it using the function tsclean(), which handles both outliers and NA values.
kwh <- import("https://raw.githubusercontent.com/GabrielSantos33/DATA624_Project1/main/ResidentialCustomerForecastLoad-624.xlsx")
kwh_ts <- ts(kwh[,"KWH"], start=c(1998,1), frequency=12) %>%
tsclean()
kwh_ts
## Jan Feb Mar Apr May Jun Jul Aug
## 1998 6862583 5838198 5420658 5010364 4665377 6467147 8914755 8607428
## 1999 7183759 5759262 4847656 5306592 4426794 5500901 7444416 7564391
## 2000 7068296 5876083 4807961 4873080 5050891 7092865 6862662 7517830
## 2001 7538529 6602448 5779180 4835210 4787904 6283324 7855129 8450717
## 2002 7099063 6413429 5839514 5371604 5439166 5850383 7039702 8058748
## 2003 7256079 6190517 6120626 4885643 5296096 6051571 6900676 8476499
## 2004 7584596 6560742 6526586 4831688 4878262 6421614 7307931 7309774
## 2005 8225477 6564338 5581725 5563071 4453983 5900212 8337998 7786659
## 2006 7793358 5914945 5819734 5255988 4740588 7052275 7945564 8241110
## 2007 8031295 7928337 6443170 4841979 4862847 5022647 6426220 7447146
## 2008 7964293 7597060 6085644 5352359 4608528 6548439 7643987 8037137
## 2009 8072330 6976800 5691452 5531616 5264439 5804433 7713260 8350517
## 2010 9397357 8390677 7347915 5776131 4919289 6696292 7686742 7922701
## 2011 8394747 8898062 6356903 5685227 5506308 8037779 10093343 10308076
## 2012 8991267 7952204 6356961 5569828 5783598 7926956 8886851 9612423
## 2013 10655730 7681798 6517514 6105359 5940475 7920627 8415321 9080226
## Sep Oct Nov Dec
## 1998 6989888 6345620 4640410 4693479
## 1999 7899368 5358314 4436269 4419229
## 2000 8912169 5844352 5041769 6220334
## 2001 7112069 5242535 4461979 5240995
## 2002 8245227 5865014 4908979 5779958
## 2003 7791791 5344613 4913707 5756193
## 2004 6690366 5444948 4824940 5791208
## 2005 7057213 6694523 4313019 6181548
## 2006 7296355 5104799 4458429 6226214
## 2007 7666970 5785964 4907057 6047292
## 2008 7436335 5101803 4555602 6442746
## 2009 7583146 5566075 5339890 7089880
## 2010 7819472 5875917 4800733 6152583
## 2011 8943599 5603920 6154138 8273142
## 2012 7559148 5576852 5731899 6609694
## 2013 7968220 5759367 5769083 6987874
summary(kwh[,"KWH"]) # before cleaning
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 770523 5429912 6283324 6502475 7620524 10655730 1
summary(kwh_ts) # after cleaning
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4313019 5443502 6351262 6529723 7608792 10655730
Now, let's plot the time series.
autoplot(kwh_ts) +
ggtitle("Monthly Residential Power Usage", subtitle = "Jan 1998 to Dec 2013 (KWH)") +
xlab("Month") +
ylab("Kilowatt hours (KWH)")
ggseasonplot(kwh_ts)
ggsubseriesplot(kwh_ts)
The series peaks roughly every six months, reflecting high power usage in both winter and summer; the underlying seasonality is annual.
ggtsdisplay(kwh_ts, main="Monthly Residential Power Usage (Jan 1998 to Dec 2013) - (KWH)")
We apply the Box-Cox transformation:
kwh_bc <- BoxCox(kwh_ts, lambda = BoxCox.lambda(kwh_ts))
ggtsdisplay(kwh_bc, main="kwh_ts with BoxCox Transformation")
The Box-Cox transformation makes little visible difference to the series, so we proceed with differencing instead.
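As before, the required amount of differencing can be confirmed with the unit-root helpers; a quick sketch (we would expect one seasonal difference for this monthly series):
ndiffs(kwh_ts)   # suggested number of non-seasonal differences
nsdiffs(kwh_ts)  # suggested number of seasonal differences (frequency = 12)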
ggtsdisplay(diff(kwh_ts,12), points=FALSE)
kwh_ts %>% diff(., 12) %>% ur.kpss() %>% summary()
##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 4 lags.
##
## Value of test-statistic is: 0.1159
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
After one seasonal difference, the time series appears to be stationary. In an ARIMA(p,d,q)(P,D,Q)[m] model, p/P is the order of the autoregressive part, d/D the degree of differencing, q/Q the order of the moving average part, and m the number of observations per seasonal cycle (here m=12 for monthly data). The significant non-seasonal lag-1 spikes in the ACF and PACF suggest p=q=1.
The seasonal spikes (at lag 12 in the ACF and lags 12 and 24 in the PACF for this monthly series) suggest seasonal MA components; since the seasonal ACF decays gradually, I take P=0 and Q=2. I will try ARIMA(1,0,1)(0,1,2)[12].
kwh_arima <- Arima(kwh_ts, order=c(1,0,1), seasonal=c(0,1,2), lambda = BoxCox.lambda(kwh_ts))
summary(kwh_arima)
## Series: kwh_ts
## ARIMA(1,0,1)(0,1,2)[12]
## Box Cox transformation: lambda= -0.1454175
##
## Coefficients:
## ar1 ma1 sma1 sma2
## 0.9475 -0.8014 -0.8016 0.1738
## s.e. 0.0838 0.1801 0.0933 0.0912
##
## sigma^2 = 8.786e-05: log likelihood = 585.93
## AIC=-1161.86 AICc=-1161.51 BIC=-1145.89
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 86012.57 597771.6 463632.2 0.8088744 7.00649 0.7783202 0.1518373
checkresiduals(kwh_arima)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(1,0,1)(0,1,2)[12]
## Q* = 24.322, df = 20, p-value = 0.2286
##
## Model df: 4. Total lags used: 24
Now, let’s use auto.arima to find out which is the best Arima model:
kwh_auto <- auto.arima(kwh_ts, approximation = FALSE, lambda=BoxCox.lambda(kwh_ts))
checkresiduals(kwh_auto)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(1,0,0)(0,1,1)[12] with drift
## Q* = 25.402, df = 22, p-value = 0.2782
##
## Model df: 2. Total lags used: 24
summary(kwh_auto)
## Series: kwh_ts
## ARIMA(1,0,0)(0,1,1)[12] with drift
## Box Cox transformation: lambda= -0.1454175
##
## Coefficients:
## ar1 sma1 drift
## 0.2895 -0.7353 1e-04
## s.e. 0.0724 0.0700 1e-04
##
## sigma^2 = 8.429e-05: log likelihood = 588.5
## AIC=-1169 AICc=-1168.77 BIC=-1156.23
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 20961.31 580774.8 445952.2 -0.2249688 6.812064 0.7486401
## ACF1
## Training set 0.05258332
The ARIMA model found by auto.arima is ARIMA(1,0,0)(0,1,1)[12] with drift, while the model I suggested is ARIMA(1,0,1)(0,1,2)[12]. According to the error measures and the residual plots, both models represent the data well, with similar AIC values, similar error measures, and similar Ljung-Box p-values. Either model could be used, but ARIMA(1,0,0)(0,1,1)[12] with drift is preferred because it has a smaller AICc (-1168.77 vs. -1161.51).
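As in Part A, a holdout comparison gives an out-of-sample check beyond AICc; a minimal sketch, holding out the final two years (an arbitrary split of mine):
kwh_train <- window(kwh_ts, end = c(2011, 12))
kwh_test  <- window(kwh_ts, start = c(2012, 1))
fit_sug  <- Arima(kwh_train, order = c(1,0,1), seasonal = c(0,1,2), lambda = BoxCox.lambda(kwh_train))
fit_auto <- Arima(kwh_train, order = c(1,0,0), seasonal = c(0,1,1), include.drift = TRUE, lambda = BoxCox.lambda(kwh_train))
accuracy(forecast(fit_sug,  h = 24), kwh_test)["Test set", "RMSE"]
accuracy(forecast(fit_auto, h = 24), kwh_test)["Test set", "RMSE"]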
kwh_model <- Arima(kwh_ts, order=c(1,0,0), seasonal=c(0,1,1), lambda = BoxCox.lambda(kwh_ts))
kwh_f <- forecast(kwh_model, h=12, level=95)
kwh_f
## Point Forecast Lo 95 Hi 95
## Jan 2014 9396883 7762540 11437614
## Feb 2014 7887475 6475039 9664189
## Mar 2014 6435175 5306104 7848082
## Apr 2014 5756155 4760036 6998419
## May 2014 5570691 4610683 6766688
## Jun 2014 7479517 6140217 9164205
## Jul 2014 8521791 6969852 10482565
## Aug 2014 9080655 7413299 11191835
## Sep 2014 7911892 6484811 9710387
## Oct 2014 5661786 4684004 6880566
## Nov 2014 5568723 4609087 6764249
## Dec 2014 6904666 5681062 8439739
autoplot(kwh_f)
export <- kwh_f$mean
write.csv(export, "kwh_Forecast.csv")
We write the forecast data to the file 'kwh_Forecast.csv'.
The forecast appears reasonable: it captures the seasonal pattern of increased demand in summer and winter while falling between the peaks and troughs of more recent years.
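A quick sanity check is to set the 2014 point forecasts against the historical average for each calendar month; a minimal sketch:
monthly_means <- tapply(kwh_ts, cycle(kwh_ts), mean)  # historical mean by calendar month
round(cbind(forecast_2014 = as.numeric(kwh_f$mean), historical_mean = as.numeric(monthly_means)))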
The goal of the project is to explore time series, decomposition, forecasting, data preprocessing, exponential smoothing, and ARIMA.
Load the Excel data:
wfp1 <- import("https://raw.githubusercontent.com/GabrielSantos33/DATA624_Project1/main/Waterflow_Pipe1.xlsx", col_types = c("date", "numeric"))
wfp2 <- import("https://raw.githubusercontent.com/GabrielSantos33/DATA624_Project1/main/Waterflow_Pipe2.xlsx", col_types = c("date", "numeric"))
colnames(wfp1) <- c("DateTime", "WaterFlow")
colnames(wfp2) <- c("DateTime", "WaterFlow")
Aggregate wfp1 by hour so that its timestamps match wfp2:
wfp1 <- wfp1 %>% mutate(Date = as.Date(DateTime), Time = hour(DateTime)+1) %>%
group_by(Date, Time) %>%
summarise(Water=mean(WaterFlow)) %>%
ungroup() %>%
mutate(DateTime=ymd_h(paste(Date,Time))) %>%
select(DateTime,Water)
wfp1
## # A tibble: 236 × 2
## DateTime Water
## <dttm> <dbl>
## 1 2015-10-23 01:00:00 26.1
## 2 2015-10-23 02:00:00 18.9
## 3 2015-10-23 03:00:00 15.2
## 4 2015-10-23 04:00:00 23.1
## 5 2015-10-23 05:00:00 15.5
## 6 2015-10-23 06:00:00 22.7
## 7 2015-10-23 07:00:00 20.6
## 8 2015-10-23 08:00:00 18.4
## 9 2015-10-23 09:00:00 20.8
## 10 2015-10-23 10:00:00 21.2
## # … with 226 more rows
wfp2 <- wfp2 %>% mutate(Date = as.Date(DateTime), Time = hour(DateTime)) %>%
group_by(Date, Time) %>%
summarise(Water=mean(WaterFlow)) %>%
ungroup() %>%
mutate(DateTime=ymd_h(paste(Date,Time))) %>%
select(DateTime, Water)
wfp2
## # A tibble: 1,000 × 2
## DateTime Water
## <dttm> <dbl>
## 1 2015-10-23 01:00:00 18.8
## 2 2015-10-23 02:00:00 43.1
## 3 2015-10-23 03:00:00 38.0
## 4 2015-10-23 04:00:00 36.1
## 5 2015-10-23 05:00:00 31.9
## 6 2015-10-23 06:00:00 28.2
## 7 2015-10-23 07:00:00 9.86
## 8 2015-10-23 08:00:00 26.7
## 9 2015-10-23 09:00:00 55.8
## 10 2015-10-23 10:00:00 54.2
## # … with 990 more rows
Combining the two waterflows into one:
water <- full_join(wfp1, wfp2, by="DateTime", suffix=c("_1", "_2")) %>%
mutate(Water_1=ifelse(is.na(Water_1), 0, Water_1)) %>%
mutate(Water_2=ifelse(is.na(Water_2), 0, Water_2)) %>%
mutate(Water = Water_1 + Water_2) %>%
select(DateTime, Water)
water
## # A tibble: 1,000 × 2
## DateTime Water
## <dttm> <dbl>
## 1 2015-10-23 01:00:00 44.9
## 2 2015-10-23 02:00:00 61.9
## 3 2015-10-23 03:00:00 53.1
## 4 2015-10-23 04:00:00 59.2
## 5 2015-10-23 05:00:00 47.3
## 6 2015-10-23 06:00:00 51.0
## 7 2015-10-23 07:00:00 30.5
## 8 2015-10-23 08:00:00 45.0
## 9 2015-10-23 09:00:00 76.6
## 10 2015-10-23 10:00:00 75.4
## # … with 990 more rows
water_ts <- ts(water$Water, frequency=24)
ggseasonplot(water_ts) + theme(legend.title = element_blank())
ggsubseriesplot(water_ts)
We cannot see significant seasonality in water_ts, but there is a slightly decreasing trend, so it is a non-stationary time series.
ggtsdisplay(water_ts, main="Hourly Waterflow")
We apply the Box-Cox transformation:
water_bc <- BoxCox(water_ts, lambda = BoxCox.lambda(water_ts))
ggtsdisplay(water_bc, main="Water_ts with BoxCox Transformation")
Trend (first-order) differencing is needed, as the ndiffs() and nsdiffs() checks below confirm:
ndiffs(water_ts)
## [1] 1
nsdiffs(water_ts)
## [1] 0
ggtsdisplay(diff(water_bc), points=FALSE, main="Differenced water_ts with BoxCox Transformation")
water_bc %>% diff() %>% ur.kpss() %>% summary()
##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 7 lags.
##
## Value of test-statistic is: 0.0091
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
The time series now appears to be stationary. One first (non-seasonal) difference was applied, so d=1, and no seasonal difference was needed, so D=0. There is one seasonal spike in the ACF, suggesting Q=1, and one non-seasonal spike at lag 1 in the ACF, suggesting q=1. I will try ARIMA(0,1,1)(0,0,1)[24].
water_arima <- Arima(water_ts, order=c(0,1,1), seasonal=c(0,0,1), lambda = BoxCox.lambda(water_ts))
summary(water_arima)
## Series: water_ts
## ARIMA(0,1,1)(0,0,1)[24]
## Box Cox transformation: lambda= 0.8713037
##
## Coefficients:
## ma1 sma1
## -0.9578 0.0712
## s.e. 0.0101 0.0319
##
## sigma^2 = 103.3: log likelihood = -3734.4
## AIC=7474.8 AICc=7474.83 BIC=7489.52
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 0.1591549 16.33716 13.34589 -28.20783 50.26558 0.7507364
## ACF1
## Training set -0.04444759
checkresiduals(water_arima)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,1,1)(0,0,1)[24]
## Q* = 57.881, df = 46, p-value = 0.1124
##
## Model df: 2. Total lags used: 48
Now, let’s use auto.arima to find out which is the best Arima model:
water_auto <- auto.arima(water_ts, approximation = FALSE, lambda=BoxCox.lambda(water_ts))
summary(water_auto)
## Series: water_ts
## ARIMA(0,1,1)(1,0,0)[24]
## Box Cox transformation: lambda= 0.8713037
##
## Coefficients:
## ma1 sar1
## -0.9578 0.0714
## s.e. 0.0101 0.0322
##
## sigma^2 = 103.3: log likelihood = -3734.4
## AIC=7474.8 AICc=7474.82 BIC=7489.52
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 0.1613506 16.33729 13.34564 -28.19083 50.25213 0.750722
## ACF1
## Training set -0.04431537
checkresiduals(water_auto)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,1,1)(1,0,0)[24]
## Q* = 57.858, df = 46, p-value = 0.1128
##
## Model df: 2. Total lags used: 48
The ARIMA model found by auto.arima is ARIMA(0,1,1)(1,0,0)[24], while the model I suggested is ARIMA(0,1,1)(0,0,1)[24]. The two models are nearly identical statistically, with only a 0.01 difference in AICc. I will use the model generated by auto.arima, ARIMA(0,1,1)(1,0,0)[24].
water_model <- Arima(water_ts, order=c(0,1,1), seasonal=c(1,0,0), lambda = BoxCox.lambda(water_ts))
I produced a one-week forecast of water use: 7 days x 24 hours = 168 hourly observations.
water_f <- forecast(water_model, h=7*24, level=95)
autoplot(water_f) +
labs(title="Water Usage Forecast", subtitle = "ARIMA(0,1,1)(1,0,0)[24]", x="Day")
export <- water_f$mean
write.csv(export, "Waterflow_Forecast.csv")
df <- data.frame(water_f) %>% select(Point.Forecast)
rownames(df) <- NULL
df
## Point.Forecast
## 1 44.13862
## 2 45.16903
## 3 46.07068
## 4 44.27144
## 5 45.44000
## 6 44.89509
## 7 42.95309
## 8 44.67901
## 9 46.80433
## 10 44.92962
## 11 46.40268
## 12 45.53528
## 13 45.54120
## 14 43.38716
## 15 44.16368
## 16 42.46275
## 17 43.26947
## 18 44.53708
## 19 46.52640
## 20 43.60654
## 21 46.10964
## 22 44.44325
## 23 43.74894
## 24 46.10043
## 25 44.52937
## 26 44.60297
## 27 44.66721
## 28 44.53887
## 29 44.62229
## 30 44.58342
## 31 44.44443
## 32 44.56799
## 33 44.71937
## 34 44.58589
## 35 44.69083
## 36 44.62908
## 37 44.62950
## 38 44.47556
## 39 44.53116
## 40 44.40922
## 41 44.46713
## 42 44.55785
## 43 44.69962
## 44 44.49128
## 45 44.66998
## 46 44.55115
## 47 44.50148
## 48 44.66933
## 49 44.55730
## 50 44.56256
## 51 44.56715
## 52 44.55798
## 53 44.56394
## 54 44.56117
## 55 44.55123
## 56 44.56006
## 57 44.57088
## 58 44.56134
## 59 44.56884
## 60 44.56443
## 61 44.56446
## 62 44.55346
## 63 44.55743
## 64 44.54872
## 65 44.55286
## 66 44.55934
## 67 44.56947
## 68 44.55458
## 69 44.56735
## 70 44.55886
## 71 44.55531
## 72 44.56730
## 73 44.55930
## 74 44.55967
## 75 44.56000
## 76 44.55935
## 77 44.55977
## 78 44.55958
## 79 44.55887
## 80 44.55950
## 81 44.56027
## 82 44.55959
## 83 44.56012
## 84 44.55981
## 85 44.55981
## 86 44.55902
## 87 44.55931
## 88 44.55869
## 89 44.55898
## 90 44.55944
## 91 44.56017
## 92 44.55910
## 93 44.56002
## 94 44.55941
## 95 44.55916
## 96 44.56001
## 97 44.55944
## 98 44.55947
## 99 44.55949
## 100 44.55945
## 101 44.55948
## 102 44.55946
## 103 44.55941
## 104 44.55946
## 105 44.55951
## 106 44.55946
## 107 44.55950
## 108 44.55948
## 109 44.55948
## 110 44.55942
## 111 44.55944
## 112 44.55940
## 113 44.55942
## 114 44.55945
## 115 44.55950
## 116 44.55943
## 117 44.55949
## 118 44.55945
## 119 44.55943
## 120 44.55949
## 121 44.55945
## 122 44.55945
## 123 44.55946
## 124 44.55945
## 125 44.55945
## 126 44.55945
## 127 44.55945
## 128 44.55945
## 129 44.55946
## 130 44.55945
## 131 44.55946
## 132 44.55945
## 133 44.55945
## 134 44.55945
## 135 44.55945
## 136 44.55945
## 137 44.55945
## 138 44.55945
## 139 44.55946
## 140 44.55945
## 141 44.55946
## 142 44.55945
## 143 44.55945
## 144 44.55946
## 145 44.55945
## 146 44.55945
## 147 44.55945
## 148 44.55945
## 149 44.55945
## 150 44.55945
## 151 44.55945
## 152 44.55945
## 153 44.55945
## 154 44.55945
## 155 44.55945
## 156 44.55945
## 157 44.55945
## 158 44.55945
## 159 44.55945
## 160 44.55945
## 161 44.55945
## 162 44.55945
## 163 44.55945
## 164 44.55945
## 165 44.55945
## 166 44.55945
## 167 44.55945
## 168 44.55945
The forecast data is written to the file 'Waterflow_Forecast.csv'.
I do not think the generated forecast is reliable: it does not follow the pattern of the data. Because the forecast horizon is long, the point forecasts converge toward the series mean.
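To quantify how little structure the model exploits, its training accuracy can be set against simple benchmarks such as the naive and mean forecasts; a hedged sketch:
water_naive <- naive(water_ts, h = 7*24)
water_meanf <- meanf(water_ts, h = 7*24)
round(c(arima = accuracy(water_model)[, "RMSE"],
        naive = accuracy(water_naive)[, "RMSE"],
        meanf = accuracy(water_meanf)[, "RMSE"]), 2)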