Project - 1

This project consists of 3 parts - two required and one bonus and is worth 15% of your grade.

Part A – ATM Forecast, ATM624Data.xlsx

In part A, I want you to forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable ‘Cash’ is provided in hundreds of dollars, other than that it is straight forward. I am being somewhat ambiguous on purpose to make this have a little more business feeling. Explain and demonstrate your process, techniques used and not used, and your actual forecast. I am giving you data via an excel file, please provide your written report on your findings, visuals, discussion and your R code via an RPubs link along with the actual.rmd file Also please submit the forecast which you will put in an Excel readable file.

Part B – Forecasting Power, ResidentialCustomerForecastLoad-624.xlsx

Part B consists of a simple dataset of residential power usage for January 1998 until December 2013. Your assignment is to model these data and a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in Kilowatt hours, the rest is straight forward. Add this to your existing files above.

Part C – BONUS, optional (part or all), Waterflow_Pipe1.xlsx and Waterflow_Pipe2.xlsx

Part C consists of two data sets. These are simple 2 columns sets, however they have different time stamps. Your optional assignment is to time-base sequence the data and aggregate based on hour (example of what this looks like, follows). Note for multiple recordings within an hour, take the mean. Then to determine if the data is stationary and can it be forecast. If so, provide a week forward forecast and present results via Rpubs and .rmd and the forecast in an Excel readable file.

Libraries

library(tidyverse)
library(ggplot2)
library(kableExtra)
library(gridExtra)
library(rio)
library(fpp2)
library(urca)
library(rio)
library(forecast)
library(lubridate)
library(dplyr)

Part A – ATM Forecast

Objetive

The goal of the project is to explore time series, decomposition, forecasting, data preprocessing, exponential smoothing, and ARIMA.

Load Data

First, load the excel data, clean it by dropping the NA values, and rearrange it to a better format:

atm <- import("https://raw.githubusercontent.com/GabrielSantos33/DATA624_Project1/main/ATM624Data.xlsx", col_types = c("date", "text", "numeric"))
atm_daily <- atm %>% drop_na() %>% spread(ATM, Cash) 
atm_daily 
##           DATE ATM1 ATM2 ATM3         ATM4
## 1   2009-05-01   96  107    0   776.993423
## 2   2009-05-02   82   89    0   524.417959
## 3   2009-05-03   85   90    0   792.811362
## 4   2009-05-04   90   55    0   908.238457
## 5   2009-05-05   99   79    0    52.832103
## 6   2009-05-06   88   19    0    52.208454
## 7   2009-05-07    8    2    0    55.473609
## 8   2009-05-08  104  103    0   558.503251
## 9   2009-05-09   87  107    0   904.341359
## 10  2009-05-10   93  118    0   879.493588
## 11  2009-05-11   86   75    0   274.022340
## 12  2009-05-12  111  111    0   396.108347
## 13  2009-05-13   75   25    0   274.547188
## 14  2009-05-14    6   16    0    16.321159
## 15  2009-05-15  102  137    0   852.307037
## 16  2009-05-16   73   95    0   379.561703
## 17  2009-05-17   92  103    0    31.284953
## 18  2009-05-18   82   80    0   491.850577
## 19  2009-05-19   86  118    0    83.705480
## 20  2009-05-20   73   30    0   128.653781
## 21  2009-05-21   20    7    0    14.357590
## 22  2009-05-22  100  118    0   815.358321
## 23  2009-05-23   93  104    0   758.218587
## 24  2009-05-24   90   59    0   601.421108
## 25  2009-05-25   94   40    0   906.796873
## 26  2009-05-26   98  106    0   502.907894
## 27  2009-05-27   73   18    0    88.273138
## 28  2009-05-28   10    9    0    35.438335
## 29  2009-05-29   97  136    0   338.459425
## 30  2009-05-30  102  118    0     4.547996
## 31  2009-05-31   85   64    0   122.667745
## 32  2009-06-01   85   77    0   150.234945
## 33  2009-06-02  108  133    0   721.155281
## 34  2009-06-03   94   45    0   443.012568
## 35  2009-06-04   14   14    0    17.151720
## 36  2009-06-05    3   20    0    14.887695
## 37  2009-06-06   96  147    0   740.543155
## 38  2009-06-07  109  105    0  1058.083384
## 39  2009-06-08   96  132    0   576.183701
## 40  2009-06-09  145   93    0  1484.126887
## 41  2009-06-10   81   26    0   193.787327
## 42  2009-06-11   16    7    0    27.054779
## 43  2009-06-12  142  112    0  1190.898921
## 44  2009-06-13   NA   91    0   746.495617
## 45  2009-06-14  120   72    0  1220.767882
## 46  2009-06-15  106   66    0  1021.503529
## 47  2009-06-16   NA   82    0   372.623610
## 48  2009-06-17  108   24    0   321.272935
## 49  2009-06-18   21   NA    0    92.476655
## 50  2009-06-19  140  134    0   116.644133
## 51  2009-06-20  110   95    0   202.413503
## 52  2009-06-21  115   82    0   524.434514
## 53  2009-06-22   NA   90    0    80.644354
## 54  2009-06-23  108   99    0    64.273589
## 55  2009-06-24   66   NA    0    90.636571
## 56  2009-06-25   13    3    0    48.046061
## 57  2009-06-26   99  117    0  1026.032172
## 58  2009-06-27  105   53    0   423.770971
## 59  2009-06-28  104   44    0    60.626453
## 60  2009-06-29   98   56    0   540.281404
## 61  2009-06-30  110  110    0   173.705568
## 62  2009-07-01   79   36    0   393.239982
## 63  2009-07-02   16   12    0    41.948621
## 64  2009-07-03  110  128    0   310.006527
## 65  2009-07-04   96   72    0   110.051915
## 66  2009-07-05  114  122    0   682.182367
## 67  2009-07-06  126  100    0    54.667904
## 68  2009-07-07  126  108    0   213.989896
## 69  2009-07-08   73   25    0   738.126360
## 70  2009-07-09    4    6    0    16.206081
## 71  2009-07-10   19   22    0    16.223205
## 72  2009-07-11  114  135    0  1050.206528
## 73  2009-07-12   98   69    0   438.477361
## 74  2009-07-13   97   52    0   546.691592
## 75  2009-07-14  114   81    0   858.236420
## 76  2009-07-15   78   27    0   446.733514
## 77  2009-07-16   19    4    0    94.672035
## 78  2009-07-17  102  147    0   644.390623
## 79  2009-07-18   94  102    0   568.815899
## 80  2009-07-19  108   83    0   704.507042
## 81  2009-07-20   91   55    0   571.645285
## 82  2009-07-21   86   74    0   479.875501
## 83  2009-07-22   78   22    0   418.864062
## 84  2009-07-23   16    4    0    27.569130
## 85  2009-07-24  114  104    0   834.834364
## 86  2009-07-25  115   81    0   910.834075
## 87  2009-07-26  108   61    0   468.106927
## 88  2009-07-27  102   70    0   768.121660
## 89  2009-07-28  129  126    0  1089.167762
## 90  2009-07-29   79   33    0   266.898417
## 91  2009-07-30   13    7    0     7.110664
## 92  2009-07-31  103  126    0   704.192012
## 93  2009-08-01   90   92    0   495.350606
## 94  2009-08-02   68   81    0   142.638951
## 95  2009-08-03   85   49    0   428.560174
## 96  2009-08-04   99  146    0   894.969364
## 97  2009-08-05   86   79    0   610.290623
## 98  2009-08-06   13   37    0    70.613731
## 99  2009-08-07  116  136    0   593.570261
## 100 2009-08-08  105  111    0   341.601807
## 101 2009-08-09  123   78    0   735.035708
## 102 2009-08-10  114   57    0   462.773028
## 103 2009-08-11  127  106    0  1156.496108
## 104 2009-08-12  111   38    0   454.091938
## 105 2009-08-13   34   15    0   283.337265
## 106 2009-08-14  151  119    0   571.508226
## 107 2009-08-15  110  110    0   772.179652
## 108 2009-08-16  115   68    0   260.125517
## 109 2009-08-17  112   60    0   357.515267
## 110 2009-08-18  132   92    0    16.157597
## 111 2009-08-19   94   22    0   334.312503
## 112 2009-08-20   24   11    0    25.696404
## 113 2009-08-21  122  121    0   254.887098
## 114 2009-08-22  104   89    0   357.003129
## 115 2009-08-23  128   62    0  1245.594280
## 116 2009-08-24  120   79    0   917.371157
## 117 2009-08-25  174   83    0   592.180082
## 118 2009-08-26   96   31    0   412.474975
## 119 2009-08-27   13    4    0    82.911171
## 120 2009-08-28  121   96    0   996.010226
## 121 2009-08-29  133   50    0   103.910375
## 122 2009-08-30  118   52    0  1116.915044
## 123 2009-08-31   91   56    0   816.847015
## 124 2009-09-01  120  104    0   914.493389
## 125 2009-09-02   88   27    0   648.209198
## 126 2009-09-03   19   13    0   140.515569
## 127 2009-09-04  150  107    0  1495.154775
## 128 2009-09-05  144  125    0  1301.396344
## 129 2009-09-06  121  103    0   779.717193
## 130 2009-09-07  105   42    0   744.262278
## 131 2009-09-08  133   90    0   200.391803
## 132 2009-09-09  109   29    0   854.179084
## 133 2009-09-10   18    8    0    50.719369
## 134 2009-09-11    1    2    0     7.404799
## 135 2009-09-12  105   84    0  1061.192780
## 136 2009-09-13  112   62    0   715.021460
## 137 2009-09-14   82   77    0    35.444668
## 138 2009-09-15  111   78    0   491.884510
## 139 2009-09-16   79   25    0   343.496923
## 140 2009-09-17   13    8    0    20.714186
## 141 2009-09-18  112  113    0   505.717125
## 142 2009-09-19   99   71    0    97.325113
## 143 2009-09-20  140   94    0   473.570486
## 144 2009-09-21  110   59    0   899.773474
## 145 2009-09-22  180   89    0  1712.074986
## 146 2009-09-23   73   18    0   281.388412
## 147 2009-09-24    7    6    0    26.337969
## 148 2009-09-25  106  115    0   328.644988
## 149 2009-09-26  103   81    0   761.371143
## 150 2009-09-27   93   61    0   629.319901
## 151 2009-09-28   96   71    0   235.753411
## 152 2009-09-29  117   69    0  1195.223181
## 153 2009-09-30   80   36    0   782.407474
## 154 2009-10-01   14   14    0   108.331743
## 155 2009-10-02  120  104    0   846.556566
## 156 2009-10-03   91   73    0   576.169344
## 157 2009-10-04   96   86    0   441.883927
## 158 2009-10-05   74   85    0   319.404846
## 159 2009-10-06  108  126    0   154.163484
## 160 2009-10-07   73   31    0   543.239375
## 161 2009-10-08   13    9    0   124.334415
## 162 2009-10-09   93  114    0   449.074910
## 163 2009-10-10   94   78    0   614.654064
## 164 2009-10-11   76   45    0   219.237855
## 165 2009-10-12  111   60    0   945.736417
## 166 2009-10-13   88   91    0     9.691243
## 167 2009-10-14   76   22    0   696.245756
## 168 2009-10-15    9    7    0     8.084284
## 169 2009-10-16   87   75    0   845.463546
## 170 2009-10-17  105   66    0   212.902683
## 171 2009-10-18   78   64    0     9.155135
## 172 2009-10-19   67   51    0    46.831421
## 173 2009-10-20   90   94    0    30.372376
## 174 2009-10-21   68   23    0   400.484974
## 175 2009-10-22    9    4    0    61.338241
## 176 2009-10-23   78  127    0   428.046952
## 177 2009-10-24   74   61    0     1.563260
## 178 2009-10-25   74    0    0     9.018826
## 179 2009-10-26   60   95    0    50.304215
## 180 2009-10-27   75   79    0   313.414359
## 181 2009-10-28   61   38    0   626.687136
## 182 2009-10-29    9    8    0     5.882076
## 183 2009-10-30   90  119    0    77.803024
## 184 2009-10-31   86   57    0   337.894871
## 185 2009-11-01   86   58    0   211.812232
## 186 2009-11-02   79   80    0   690.122884
## 187 2009-11-03   90   82    0   596.381877
## 188 2009-11-04   80   49    0    65.278483
## 189 2009-11-05   21   16    0    77.444454
## 190 2009-11-06   93  116    0    43.721409
## 191 2009-11-07  104   61    0   964.175255
## 192 2009-11-08  109   59    0   834.959599
## 193 2009-11-09   88   80    0   636.962626
## 194 2009-11-10   96   86    0   927.078614
## 195 2009-11-11   70   23    0    75.873055
## 196 2009-11-12   15    7    0    43.690550
## 197 2009-11-13   73   91    0   621.292501
## 198 2009-11-14   94   57    0   312.582041
## 199 2009-11-15  108   58    0   825.635396
## 200 2009-11-16   73   61    0   413.620440
## 201 2009-11-17   87   77    0   194.047355
## 202 2009-11-18   75   20    0   345.695900
## 203 2009-11-19   10    5    0    32.428111
## 204 2009-11-20   92  132    0   655.315200
## 205 2009-11-21   87   49    0   638.136824
## 206 2009-11-22   74   57    0    15.306757
## 207 2009-11-23   73   68    0   299.663010
## 208 2009-11-24   93   80    0   626.662127
## 209 2009-11-25   66   31    0   601.141158
## 210 2009-11-26   18    3    0    64.131250
## 211 2009-11-27   99   85    0   562.763323
## 212 2009-11-28   94   53    0   317.253589
## 213 2009-11-29  136   46    0  1167.264437
## 214 2009-11-30    6    2    0    47.365125
## 215 2009-12-01  140  113    0   993.777470
## 216 2009-12-02   73   22    0   687.196935
## 217 2009-12-03    9    5    0    71.147567
## 218 2009-12-04  140  112    0  1046.971281
## 219 2009-12-05  103   59    0  1009.050231
## 220 2009-12-06  110   72    0   288.649516
## 221 2009-12-07   90   77    0   591.508331
## 222 2009-12-08  135   85    0   230.605915
## 223 2009-12-09   67   27    0   578.431798
## 224 2009-12-10   12    1    0    70.210639
## 225 2009-12-11  109   91    0   580.796406
## 226 2009-12-12   84   36    0   149.150820
## 227 2009-12-13   92   46    0   403.861263
## 228 2009-12-14   84  100    0   124.958721
## 229 2009-12-15  118   73    0   230.148897
## 230 2009-12-16   68   22    0    19.008453
## 231 2009-12-17   14    9    0   128.638640
## 232 2009-12-18   90  117    0   327.965321
## 233 2009-12-19   92   44    0   532.245673
## 234 2009-12-20   93   44    0   877.397772
## 235 2009-12-21   85   78    0   662.112869
## 236 2009-12-22   93   89    0   300.629002
## 237 2009-12-23   70   33    0   667.628261
## 238 2009-12-24   13    5    0    14.573322
## 239 2009-12-25   90  102    0   660.355296
## 240 2009-12-26   91   68    0   510.988823
## 241 2009-12-27  102   64    0   164.462091
## 242 2009-12-28   97   81    0   748.172762
## 243 2009-12-29   42    9    0   174.241405
## 244 2009-12-30    2    2    0    20.192725
## 245 2009-12-31   14    2    0   100.686543
## 246 2010-01-01  132  117    0   985.956884
## 247 2010-01-02  108   56    0   597.058060
## 248 2010-01-03  120   55    0   468.457250
## 249 2010-01-04  120  112    0   856.581158
## 250 2010-01-05   90  100    0   684.774261
## 251 2010-01-06   48   17    0   381.562058
## 252 2010-01-07   15    7    0   152.092362
## 253 2010-01-08   86   49    0   271.934402
## 254 2010-01-09  109   96    0   135.499645
## 255 2010-01-10  115   47    0  1105.018595
## 256 2010-01-11  123  107    0   291.765790
## 257 2010-01-12  123   93    0  1141.426655
## 258 2010-01-13   60   27    0   140.502720
## 259 2010-01-14   22    9    0     8.781528
## 260 2010-01-15   81   78    0    85.260115
## 261 2010-01-16   98   47    0    66.655933
## 262 2010-01-17   94   31    0   709.938568
## 263 2010-01-18   76   91    0   567.549590
## 264 2010-01-19   96   77    0   486.824572
## 265 2010-01-20   72   22    0    17.481238
## 266 2010-01-21   12    1    0    49.208112
## 267 2010-01-22   91  125    0   357.233561
## 268 2010-01-23   74   53    0   179.888632
## 269 2010-01-24  104   37    0   728.992926
## 270 2010-01-25   74   69    0   261.121740
## 271 2010-01-26   91   69    0   628.863047
## 272 2010-01-27   66   16    0   277.044815
## 273 2010-01-28   28    2    0    41.475487
## 274 2010-01-29  152   95    0  1574.779330
## 275 2010-01-30   97   45    0   669.707653
## 276 2010-01-31  100   61    0   979.675944
## 277 2010-02-01   85   71    0   426.406008
## 278 2010-02-02  123   91    0   153.242692
## 279 2010-02-03   46   28    0   274.949921
## 280 2010-02-04   38    9    0   136.276358
## 281 2010-02-05  138   97    0   454.416950
## 282 2010-02-06   74   60    0   458.304270
## 283 2010-02-07   85   27    0   112.030628
## 284 2010-02-08   78   74    0   417.912276
## 285 2010-02-09  123   45    0 10919.761638
## 286 2010-02-10   50   26    0    42.438078
## 287 2010-02-11   28    2    0   280.043427
## 288 2010-02-12  105   73    0   412.318881
## 289 2010-02-13  109   48    0   852.837416
## 290 2010-02-14   26    9    0   179.702084
## 291 2010-02-15   54   25    0   226.047368
## 292 2010-02-16  105   58    0   989.195610
## 293 2010-02-17  179  110    0   824.916781
## 294 2010-02-18  125   94    0   966.610561
## 295 2010-02-19   84   25    0   734.220167
## 296 2010-02-20   99  101    0   121.332622
## 297 2010-02-21  110   77    0   287.931829
## 298 2010-02-22   80    6    0   502.748346
## 299 2010-02-23    1    3    0     9.752776
## 300 2010-02-24  111   99    0   258.138869
## 301 2010-02-25  115   86    0  1170.288228
## 302 2010-02-26  108   20    0   193.079004
## 303 2010-02-27  100   72    0   402.770395
## 304 2010-02-28  152   81    0  1275.968167
## 305 2010-03-01   90   16    0   819.881512
## 306 2010-03-02    4    2    0    26.392615
## 307 2010-03-03  128  110    0   893.884507
## 308 2010-03-04   87   75    0   360.641557
## 309 2010-03-05   84   35    0   859.899389
## 310 2010-03-06   69  102    0   381.473089
## 311 2010-03-07  112  100    0     9.711205
## 312 2010-03-08   62    9    0   601.016953
## 313 2010-03-09    4    2    0    32.454958
## 314 2010-03-10   94  100    0   553.395793
## 315 2010-03-11  102   85    0   572.374702
## 316 2010-03-12   98   33    0   218.831558
## 317 2010-03-13   88   24    0   828.326827
## 318 2010-03-14  123  111    0   630.762176
## 319 2010-03-15   55    5    0   339.418304
## 320 2010-03-16    4    1    0    31.502210
## 321 2010-03-17   92  106    0   486.679944
## 322 2010-03-18   88  102    0   335.363138
## 323 2010-03-19   92   25    0   340.008595
## 324 2010-03-20   92   64    0   291.282267
## 325 2010-03-21  117   78    0    46.029411
## 326 2010-03-22   73    5    0   201.403616
## 327 2010-03-23    3    1    0     9.558301
## 328 2010-03-24   85   83    0   877.966103
## 329 2010-03-25   88   78    0   778.110791
## 330 2010-03-26   76   66    0   707.613494
## 331 2010-03-27   93   15    0   351.457835
## 332 2010-03-28  116   64    0   711.049107
## 333 2010-03-29   68    3    0   502.518192
## 334 2010-03-30   19    0    0    23.337569
## 335 2010-03-31  127  102    0   492.909936
## 336 2010-04-01   93   99    0   405.312584
## 337 2010-04-02   97   41    0   818.394331
## 338 2010-04-03  102   79    0   152.270800
## 339 2010-04-04  109   71    0   281.113528
## 340 2010-04-05   68    9    0   470.425772
## 341 2010-04-06    4    2    0     2.507317
## 342 2010-04-07  105  103    0   414.741223
## 343 2010-04-08   79   67    0   719.426783
## 344 2010-04-09   90   85    0   811.716825
## 345 2010-04-10   87   78    0   889.584209
## 346 2010-04-11   92   67    0   616.127547
## 347 2010-04-12   63   12    0    61.144354
## 348 2010-04-13    3    1    0    27.325671
## 349 2010-04-14  103   97    0   767.778153
## 350 2010-04-15   80  106    0   326.084093
## 351 2010-04-16   80   74    0   825.197816
## 352 2010-04-17   84   86    0   383.815025
## 353 2010-04-18   81   55    0   195.483454
## 354 2010-04-19   88   13    0   711.164653
## 355 2010-04-20    3    1    0    29.869011
## 356 2010-04-21  100  100    0   556.792330
## 357 2010-04-22   69  103    0   386.175335
## 358 2010-04-23   85   44    0   165.294181
## 359 2010-04-24   85   61    0     5.451815
## 360 2010-04-25  109   89    0   542.280602
## 361 2010-04-26   74   11    0   403.839336
## 362 2010-04-27    4    2    0    13.697331
## 363 2010-04-28   96  107   96   348.201061
## 364 2010-04-29   82   89   82    44.245345
## 365 2010-04-30   85   90   85   482.287107

According to the data in the table, we can see that the withdrawals from ‘ATM4’ are greater than the withdrawals from the other ATMs. In addition, the ‘ATM3’, most of the withdrawals are zero.

Generating the summary statistics and the boxplot:

par(mfrow=c(4,1))
for (i in 2:5) {
  print(summary(atm_daily[i]))
}
##       ATM1       
##  Min.   :  1.00  
##  1st Qu.: 73.00  
##  Median : 91.00  
##  Mean   : 83.89  
##  3rd Qu.:108.00  
##  Max.   :180.00  
##  NA's   :3       
##       ATM2       
##  Min.   :  0.00  
##  1st Qu.: 25.50  
##  Median : 67.00  
##  Mean   : 62.58  
##  3rd Qu.: 93.00  
##  Max.   :147.00  
##  NA's   :2       
##       ATM3        
##  Min.   : 0.0000  
##  1st Qu.: 0.0000  
##  Median : 0.0000  
##  Mean   : 0.7206  
##  3rd Qu.: 0.0000  
##  Max.   :96.0000  
##       ATM4          
##  Min.   :    1.563  
##  1st Qu.:  124.334  
##  Median :  403.839  
##  Mean   :  474.043  
##  3rd Qu.:  704.507  
##  Max.   :10919.762
par(mfrow=c(1,4))
for (i in 2:5){
    boxplot(atm_daily[i], 
            main = sprintf("%s", names(atm_daily)[i]),
                  col="lightblue", width = 2)
}

From the above boxplots we see that ‘ATM1’ has numerous outliers, ‘ATM2’ has none, ‘ATM3’ has 3 (for every non-zero observation), and ‘ATM4’ has 3 outliers, one of which is quite extreme. . The transformation of outliers will be important for forecasting.

Timeseries

Now, let’s convert the data into a time series.

atm_ts <- ts(atm_daily %>% select(-DATE))
autoplot(atm_ts) +
  ggtitle("Daily Cash Withdrawal", subtitle = "4 ATM machines") +
  xlab("Day") +
  ylab("Hundreds of Dollars ($100)")

As the spike from ATM4 in the first plot makes it hard to see the details in ATM1, ATM2, and ATM3, individual plots will be shown below.

To handle the data better, the extremely high outlier in ATM4 will be replaced by median for better forecasting.

atm[1394,3] <- 403.839
atm %>% drop_na() %>% ggplot(aes(x=DATE, y=Cash, col=ATM)) +
  geom_line(show.legend = FALSE) +
  facet_wrap(~ATM, ncol=1, scales="free_y") +
  labs(title="Daily Cash Withdrawal from 4 ATM Machines", subtitle = "May 2019 to April 2020") +
  xlab("Date") + ylab("Hundreds of Dollars ($100)")

From the graphs we can see that ‘ATM1’ and ‘ATM2’ show seasonality. We will apply the Box-Cox transformation. In the case of ‘ATM3’ we do not have enough information for the prediction. We can assume that the ‘ATM3’ possibly began its operation at the end of April 2020. For ‘ATM4’ we will replace the outlier with the median so it is better for forecasting.

ARIMA Models

ATM1

atm1 <- atm_daily[2]
atm1 <- ts(atm1, frequency=7)
atm2 <- atm_daily[3]
atm2 <- ts(atm2, frequency=7)
atm3 <- atm_daily[4]
atm3 <- ts(atm3, start=363)
atm3[which(atm3==0)] <- NA
atm4 <- atm_daily[5]
atm4[285,] <- 403.839
atm4 <- ts(atm4, frequency=7)
ggtsdisplay(atm1, main="Daily Cash Withdrawal in ATM1 (May 2019 - April 2020)")

ggseasonplot(atm1)

ggsubseriesplot(atm1)

atm1_bc <- BoxCox(atm1, lambda = BoxCox.lambda(atm1))
ggtsdisplay(atm1_bc, main="ATM1 with BoxCox Transformation")

ATM1 is clearly with seasonality, which is weekly seasonality. The ACF and PACF plots have significant lag7, lag14, and lag21. This is a non-stationary timeseries.

After applying Box-Cox transformation, we still see the weekly seasonality in the timeseries. Differencing is needed.

ggtsdisplay(diff(atm1_bc, 7), points=FALSE)

atm1_bc %>% diff(.,7) %>% ur.kpss() %>% summary()
## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 5 lags. 
## 
## Value of test-statistic is: 0.016 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739

The timeseries now appears to be stationary. ARIMA(p,d,q)(P,D,Q)[m] model has p/P as order of the autoregressive part, d/D as degree of first differencing involved, q/Q as order of the moving average part, and m as number of observations. The non-seasonal significant lag1 in ACF and PACF suggest non-seasonal p=q=1.

The seasonal spike at lag7 suggest seasonal AR(1) and/or MA(1) components. As ACF decays gradually, this suggests seasonal AR(0) and MA(1), P=0, Q=1. I will try ARIMA(1,0,1)(0,1,1).

atm1_arima <- Arima(atm1, order=c(1,0,1), seasonal=c(0,1,1), lambda = BoxCox.lambda(atm1))
summary(atm1_arima)
## Series: atm1 
## ARIMA(1,0,1)(0,1,1)[7] 
## Box Cox transformation: lambda= 0.2081476 
## 
## Coefficients:
##           ar1     ma1     sma1
##       -0.5040  0.6201  -0.6438
## s.e.   0.2417  0.2183   0.0427
## 
## sigma^2 = 1.252:  log likelihood = -544.29
## AIC=1096.58   AICc=1096.7   BIC=1112.11
## 
## Training set error measures:
##                    ME     RMSE      MAE       MPE     MAPE      MASE
## Training set 2.514774 25.09103 15.83582 -89.65246 108.6821 0.8958869
##                      ACF1
## Training set -0.007347018
checkresiduals(atm1_arima)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(1,0,1)(0,1,1)[7]
## Q* = 9.4438, df = 11, p-value = 0.581
## 
## Model df: 3.   Total lags used: 14

Now, let’s use auto.arima to find out which is the best Arima model:

atm1_auto <- auto.arima(atm1, approximation = FALSE, lambda=BoxCox.lambda(atm1))
summary(atm1_auto)
## Series: atm1 
## ARIMA(0,0,2)(0,1,1)[7] 
## Box Cox transformation: lambda= 0.2081476 
## 
## Coefficients:
##          ma1      ma2     sma1
##       0.0989  -0.1107  -0.6482
## s.e.  0.0532   0.0527   0.0425
## 
## sigma^2 = 1.247:  log likelihood = -543.68
## AIC=1095.37   AICc=1095.48   BIC=1110.89
## 
## Training set error measures:
##                    ME     RMSE      MAE       MPE     MAPE      MASE       ACF1
## Training set 2.493867 25.19797 16.12913 -88.12706 107.3975 0.9124807 0.01895515
checkresiduals(atm1_auto)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,0,2)(0,1,1)[7]
## Q* = 8.5811, df = 11, p-value = 0.6605
## 
## Model df: 3.   Total lags used: 14

The ARIMA model found by ‘auto.arima’ is ARIMA(0,0,2)(0,1,1)[7]. The ARIMA model I suggested is ARIMA(1,0,1)(0,1,1)[7].

According to the error measures and the residual plots, both models represents the data well with similar AIC values, similar error measures, and similar p-values. Both models can be applied, but the best Arima model is ARIMA(0,0,2)(0,1,1)[7], because it has a smaller AICc.

atm1_model <- Arima(atm1, order=c(0,0,2), seasonal=c(0,1,1), lambda = BoxCox.lambda(atm1))

ATM2

ggtsdisplay(atm2, main="Daily Cash Withdrawal in ATM2 (May 2019 - April 2020)")

ggseasonplot(atm2)

ggsubseriesplot(atm2)

atm2_bc <- BoxCox(atm2, lambda = BoxCox.lambda(atm2))
ggtsdisplay(atm2_bc, main="ATM2 with BoxCox Transformation")

ATM2 is clearly with seasonality, which is weekly seasonality. The ACF and PACF plots have significant lag7, lag14, and lag21.Thus, this is a non-stationary timeseries.

After applying Box-Cox transformation, we still see the weekly seasonality in the timeseries. Differencing is needed.

ggtsdisplay(diff(atm2_bc, 7), points=FALSE)

atm2_bc %>% diff(.,7) %>% ur.kpss() %>% summary()
## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 5 lags. 
## 
## Value of test-statistic is: 0.0161 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739

The timeseries now appears to be stationary. One seasonal differencing was applied so D=1, while the non-seasonal part suggests d=0. The seasonal lags at ACF sudden drops while the ones in PACF gradually decrease, suggest AR(0) and MA(1), so P=0, Q=1. The non-differenced ACF and PACF plots have spikes at lag2 and lag5, suggest p=2 and q=2. I will try ARIMA(2,0,2)(0,1,1)[7].

atm2_arima <- Arima(atm2, order=c(2,0,2), seasonal=c(0,1,1), lambda = BoxCox.lambda(atm2))
summary(atm2_arima)
## Series: atm2 
## ARIMA(2,0,2)(0,1,1)[7] 
## Box Cox transformation: lambda= 0.7164431 
## 
## Coefficients:
##           ar1      ar2     ma1     ma2     sma1
##       -0.4188  -0.9019  0.4587  0.7886  -0.7187
## s.e.   0.0587   0.0471  0.0884  0.0619   0.0414
## 
## sigma^2 = 62.19:  log likelihood = -1240.08
## AIC=2492.16   AICc=2492.4   BIC=2515.45
## 
## Training set error measures:
##                     ME     RMSE      MAE  MPE MAPE      MASE        ACF1
## Training set 0.2631418 24.22403 16.90092 -Inf  Inf 0.8250038 -0.01464052
checkresiduals(atm2_arima)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(2,0,2)(0,1,1)[7]
## Q* = 9.8051, df = 9, p-value = 0.3665
## 
## Model df: 5.   Total lags used: 14

Now, let’s use auto.arima to find out which is the best Arima model:

atm2_auto <- auto.arima(atm2, approximation = FALSE, lambda=BoxCox.lambda(atm2))
summary(atm2_auto)
## Series: atm2 
## ARIMA(2,0,2)(0,1,1)[7] 
## Box Cox transformation: lambda= 0.7164431 
## 
## Coefficients:
##           ar1      ar2     ma1     ma2     sma1
##       -0.4188  -0.9019  0.4587  0.7886  -0.7187
## s.e.   0.0587   0.0471  0.0884  0.0619   0.0414
## 
## sigma^2 = 62.19:  log likelihood = -1240.08
## AIC=2492.16   AICc=2492.4   BIC=2515.45
## 
## Training set error measures:
##                     ME     RMSE      MAE  MPE MAPE      MASE        ACF1
## Training set 0.2631418 24.22403 16.90092 -Inf  Inf 0.8250038 -0.01464052
checkresiduals(atm2_auto)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(2,0,2)(0,1,1)[7]
## Q* = 9.8051, df = 9, p-value = 0.3665
## 
## Model df: 5.   Total lags used: 14

The ARIMA model found by ‘auto.arima’ is ARIMA(2,0,2)(0,1,1)[7].The ARIMA model I suggested is ARIMA(2,0,2)(0,1,1)[7].Both models are the same.According to the error measures and the residual plots, the model represents the data well with small p-value.

atm2_model <- Arima(atm2, order=c(2,0,2), seasonal=c(0,1,1), lambda = BoxCox.lambda(atm2))

ATM3

ggtsdisplay(atm3, main="Daily Cash Withdrawal in ATM3 (May 2019 - April 2020)")

There is not enough information to forecast in the time series. Because there are only 3 data points. We can assume that it is an ATM that began to work at the end of April. Therefore, I will use a simple mean forecast.

ATM4

ggtsdisplay(atm4, main="Daily Cash Withdrawal in ATM4 (May 2019 - April 2020)")

ggseasonplot(atm4)

ggsubseriesplot(atm4)

atm4_bc <- BoxCox(atm4, lambda = BoxCox.lambda(atm4))
ggtsdisplay(atm4_bc, main="ATM4 with BoxCox Transformation")

ATM4 is clearly with seasonality, which is weekly seasonality. The ACF and PACF plots have significant lag7, lag14, and lag21. Thus, this is a non-stationary timeseries.

After applying Box-Cox transformation, we still see the weekly seasonality in the timeseries. Differencing is needed.

ggtsdisplay(diff(atm4_bc, 7), points=FALSE)

atm4_bc %>% diff() %>% ur.kpss() %>% summary()
## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 5 lags. 
## 
## Value of test-statistic is: 0.0124 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739

The timeseries now appears to be stationary. One seasonal differencing was applied so D=1, while the non-seasonal part suggests d=0. The seasonal lags at ACF sudden drops while the ones in PACF gradually decrease, suggest AR(0) and MA(1), so P=0, Q=1.There one non-seasonal spike at lag3 in ACF and PACF plots suggest p=q=1.I will try ARIMA(1,0,1)(0,1,1)[7].

atm4_arima <- Arima(atm4, order=c(1,0,1), seasonal=c(0,1,1), lambda = BoxCox.lambda(atm4))
summary(atm4_arima)
## Series: atm4 
## ARIMA(1,0,1)(0,1,1)[7] 
## Box Cox transformation: lambda= 0.4525697 
## 
## Coefficients:
##          ar1      ma1     sma1
##       0.6851  -0.6059  -0.8677
## s.e.  0.3040   0.3315   0.0315
## 
## sigma^2 = 171.4:  log likelihood = -1432.08
## AIC=2872.16   AICc=2872.27   BIC=2887.68
## 
## Training set error measures:
##                    ME     RMSE      MAE       MPE     MAPE      MASE      ACF1
## Training set 61.63608 337.5467 254.8181 -360.8151 404.2179 0.7355835 0.0185852
checkresiduals(atm4_arima)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(1,0,1)(0,1,1)[7]
## Q* = 16.519, df = 11, p-value = 0.1229
## 
## Model df: 3.   Total lags used: 14

Now, let’s use auto.arima to find out which is the best Arima model:

atm4_auto <- auto.arima(atm4, approximation = FALSE, lambda=BoxCox.lambda(atm4))
summary(atm4_auto)
## Series: atm4 
## ARIMA(1,0,0)(2,0,0)[7] with non-zero mean 
## Box Cox transformation: lambda= 0.4525697 
## 
## Coefficients:
##          ar1    sar1    sar2     mean
##       0.0771  0.2090  0.1996  29.0082
## s.e.  0.0526  0.0516  0.0525   1.2627
## 
## sigma^2 = 182.2:  log likelihood = -1466.34
## AIC=2942.69   AICc=2942.85   BIC=2962.19
## 
## Training set error measures:
##                    ME     RMSE      MAE       MPE     MAPE      MASE       ACF1
## Training set 84.38613 351.7695 274.2406 -370.6751 415.1343 0.7916504 0.01960666
checkresiduals(atm4_auto)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(1,0,0)(2,0,0)[7] with non-zero mean
## Q* = 15.92, df = 11, p-value = 0.1441
## 
## Model df: 3.   Total lags used: 14

The ARIMA model found by ‘auto.arima’ is ARIMA(1,0,0)(2,0,0)[7]. The ARIMA model I suggested is ARIMA(1,0,1)(0,1,1)[7].The model I suggested has smaller AICc and RMSE.I think that the best model is the model that I suggested.

atm4_model <- Arima(atm4, order=c(1,0,1), seasonal=c(0,1,1), lambda = BoxCox.lambda(atm4))

Forecast

atm1_f <- forecast(atm1_model, 31, level=95)
atm2_f <- forecast(atm2_model, 31, level=95)
atm3_f <- meanf(atm3, 31, level=95)
atm4_f <- forecast(atm4_model, 31, level=95)

gridExtra::grid.arrange(
  autoplot(atm1_f) +
    labs(title="ATM1: ARIMA(0,0,2)(0,1,1)[7]", x="Day", y="Hundreds of Dollars($100)"),
  autoplot(atm2_f) +
    labs(title="ATM2: ARIMA(2,0,2)(0,1,1)[7]", x="Day", y="Hundreds of Dollars($100)"),
  autoplot(atm3_f) +
    labs(title="ATM3: meanf", x="Day", y="Hundreds of Dollars($100)"),
  autoplot(atm4_f) +
    labs(title="ATM4: ARIMA(1,0,1)(0,1,1)[7]", x="Day", y="Hundreds of Dollars($100)"),
  top = grid::textGrob("Forecast on Cash Withdrawal for May 2020")
)

export <- rbind(atm1_f$mean, atm2_f$mean, atm3_f$mean, atm4_f$mean)
write.csv(export, "ATM_Forecast.csv")

data.frame(export) %>% cbind(ATM = c('ATM1', 'ATM2', 'ATM3', 'ATM4')) %>% 
                    select(ATM, everything())
##    ATM        X1        X2        X3        X4        X5        X6        X7
## 1 ATM1  86.74294 100.55515  73.62242  4.211032 100.14543  79.32380  85.78103
## 2 ATM2  65.64590  71.57733  11.87248  2.741321  97.99480  88.96331  65.68584
## 3 ATM3  87.66667  87.66667  87.66667 87.666667  87.66667  87.66667  87.66667
## 4 ATM4 280.99138 363.04960 380.34153 64.998646 449.24378 303.25757 507.37695
##          X8        X9       X10       X11       X12       X13       X14
## 1  87.22992 100.36819  73.62242  4.211032 100.14543  79.32380  85.78103
## 2  65.66859  71.59474  11.85550  2.739787  98.02444  88.95494  65.66519
## 3  87.66667  87.66667  87.66667 87.666667  87.66667  87.66667  87.66667
## 4 306.98773 383.34690 394.53535 68.714726 456.49955 307.26292 511.00791
##         X15       X16       X17       X18       X19       X20       X21
## 1  87.22992 100.36819  73.62242  4.211032 100.14543  79.32380  85.78103
## 2  65.68416  71.60715  11.84374  2.738603  98.04512  88.94938  65.65068
## 3  87.66667  87.66667  87.66667 87.666667  87.66667  87.66667  87.66667
## 4 308.87651 384.80722 395.55131 68.982178 457.01579 307.54766 511.26558
##         X22       X23       X24       X25       X26       X27       X28
## 1  87.22992 100.36819  73.62242  4.211032 100.14543  79.32380  85.78103
## 2  65.69484  71.61598  11.83559  2.737698  98.05954  88.94570  65.64048
## 3  87.66667  87.66667  87.66667 87.666667  87.66667  87.66667  87.66667
## 4 309.01050 384.91075 395.62331 69.001140 457.05236 307.56783 511.28383
##         X29       X30       X31
## 1  87.22992 100.36819  73.62242
## 2  65.70216  71.62226  11.82995
## 3  87.66667  87.66667  87.66667
## 4 309.01999 384.91809 395.62841

Export the results Forecast of ATM1, ATM2, ATM3 and ATM4 en el file: ‘ATM_Forecast.csv’

I think that the ATM1 and ATM2 forecasts seem correct, but in the case of ATM4, the generated forecast does not seem to follow the pattern of the data, I do not think the forecast is reliable, suddenly it is necessary to apply another type of model.

Part B – Forecasting Power

Objetive

The goal of the project is to explore time series, decomposition, forecasting, data preprocessing, exponential smoothing, and ARIMA.

Load Data

First, load the excel data into R and clean it by using function ‘tsclean()’ which can handle outliers and NA value.

kwh <- import("https://raw.githubusercontent.com/GabrielSantos33/DATA624_Project1/main/ResidentialCustomerForecastLoad-624.xlsx")
kwh_ts <- ts(kwh[,"KWH"], start=c(1998,1), frequency=12) %>% 
  tsclean()
kwh_ts
##           Jan      Feb      Mar      Apr      May      Jun      Jul      Aug
## 1998  6862583  5838198  5420658  5010364  4665377  6467147  8914755  8607428
## 1999  7183759  5759262  4847656  5306592  4426794  5500901  7444416  7564391
## 2000  7068296  5876083  4807961  4873080  5050891  7092865  6862662  7517830
## 2001  7538529  6602448  5779180  4835210  4787904  6283324  7855129  8450717
## 2002  7099063  6413429  5839514  5371604  5439166  5850383  7039702  8058748
## 2003  7256079  6190517  6120626  4885643  5296096  6051571  6900676  8476499
## 2004  7584596  6560742  6526586  4831688  4878262  6421614  7307931  7309774
## 2005  8225477  6564338  5581725  5563071  4453983  5900212  8337998  7786659
## 2006  7793358  5914945  5819734  5255988  4740588  7052275  7945564  8241110
## 2007  8031295  7928337  6443170  4841979  4862847  5022647  6426220  7447146
## 2008  7964293  7597060  6085644  5352359  4608528  6548439  7643987  8037137
## 2009  8072330  6976800  5691452  5531616  5264439  5804433  7713260  8350517
## 2010  9397357  8390677  7347915  5776131  4919289  6696292  7686742  7922701
## 2011  8394747  8898062  6356903  5685227  5506308  8037779 10093343 10308076
## 2012  8991267  7952204  6356961  5569828  5783598  7926956  8886851  9612423
## 2013 10655730  7681798  6517514  6105359  5940475  7920627  8415321  9080226
##           Sep      Oct      Nov      Dec
## 1998  6989888  6345620  4640410  4693479
## 1999  7899368  5358314  4436269  4419229
## 2000  8912169  5844352  5041769  6220334
## 2001  7112069  5242535  4461979  5240995
## 2002  8245227  5865014  4908979  5779958
## 2003  7791791  5344613  4913707  5756193
## 2004  6690366  5444948  4824940  5791208
## 2005  7057213  6694523  4313019  6181548
## 2006  7296355  5104799  4458429  6226214
## 2007  7666970  5785964  4907057  6047292
## 2008  7436335  5101803  4555602  6442746
## 2009  7583146  5566075  5339890  7089880
## 2010  7819472  5875917  4800733  6152583
## 2011  8943599  5603920  6154138  8273142
## 2012  7559148  5576852  5731899  6609694
## 2013  7968220  5759367  5769083  6987874
summary(kwh[,"KWH"]) #before
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
##   770523  5429912  6283324  6502475  7620524 10655730        1
summary(kwh_ts)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  4313019  5443502  6351262  6529723  7608792 10655730

Timeseries

Now, let’s convert the data into a time series.

autoplot(kwh_ts) +
  ggtitle("Monthly Residential Power Usage", subtitle = "Jan 1998 to Dec 2013 (KWH)") +
  xlab("Month") +
  ylab("Kilowatt hours (KWH)")

ggseasonplot(kwh_ts)

ggsubseriesplot(kwh_ts)

Seasonality is found in this timeseries and appears to have a peak every 6 months. The seasonality may be annual due to the high power usage during winter and summer.

ARIMA Model

We see annual seasonality.

ggtsdisplay(kwh_ts, main="Monthly Residential Power Usage (Jan 1998 to Dec 2013) - (KWH)")

We apply the BoxCox Transformation:

kwh_bc <- BoxCox(kwh_ts, lambda = BoxCox.lambda(kwh_ts))
ggtsdisplay(kwh_bc, main="kwh_ts with BoxCox Transformation")

Tried Box-Cox transformation on the timeseries but no huge differences. Will work on differencing instead.

ggtsdisplay(diff(kwh_ts,12), points=FALSE)

kwh_ts %>% diff(.,12) %>% ur.kpss() %>% summary()
## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 4 lags. 
## 
## Value of test-statistic is: 0.1159 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739

After differencing once, the timeseries now appears to be stationary. ARIMA(p,d,q)(P,D,Q)[m] model has p/P as order of the autoregressive part, d/D as degree of first differencing involved, q/Q as order of the moving average part, and m as number of observations. The non-seasonal significant lag1 in ACF and PACF suggest non-seasonal p=q=1.

The seasonal spike at lag7 in ACF and lag7 & lag14 in PACF suggest seasonal AR(1) and MA(2) components. As ACF decays gradually, this suggests seasonal AR(1) and MA(2), P=0, Q=2. I will try ARIMA(1,0,1)(0,1,2).

kwh_arima <- Arima(kwh_ts, order=c(1,0,1), seasonal=c(0,1,2), lambda = BoxCox.lambda(kwh_ts))
summary(kwh_arima)
## Series: kwh_ts 
## ARIMA(1,0,1)(0,1,2)[12] 
## Box Cox transformation: lambda= -0.1454175 
## 
## Coefficients:
##          ar1      ma1     sma1    sma2
##       0.9475  -0.8014  -0.8016  0.1738
## s.e.  0.0838   0.1801   0.0933  0.0912
## 
## sigma^2 = 8.786e-05:  log likelihood = 585.93
## AIC=-1161.86   AICc=-1161.51   BIC=-1145.89
## 
## Training set error measures:
##                    ME     RMSE      MAE       MPE    MAPE      MASE      ACF1
## Training set 86012.57 597771.6 463632.2 0.8088744 7.00649 0.7783202 0.1518373
checkresiduals(kwh_arima)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(1,0,1)(0,1,2)[12]
## Q* = 24.322, df = 20, p-value = 0.2286
## 
## Model df: 4.   Total lags used: 24

Now, let’s use auto.arima to find out which is the best Arima model:

kwh_auto <- auto.arima(kwh_ts, approximation = FALSE, lambda=BoxCox.lambda(kwh_ts))
checkresiduals(kwh_auto)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(1,0,0)(0,1,1)[12] with drift
## Q* = 25.402, df = 22, p-value = 0.2782
## 
## Model df: 2.   Total lags used: 24
summary(kwh_auto)
## Series: kwh_ts 
## ARIMA(1,0,0)(0,1,1)[12] with drift 
## Box Cox transformation: lambda= -0.1454175 
## 
## Coefficients:
##          ar1     sma1  drift
##       0.2895  -0.7353  1e-04
## s.e.  0.0724   0.0700  1e-04
## 
## sigma^2 = 8.429e-05:  log likelihood = 588.5
## AIC=-1169   AICc=-1168.77   BIC=-1156.23
## 
## Training set error measures:
##                    ME     RMSE      MAE        MPE     MAPE      MASE
## Training set 20961.31 580774.8 445952.2 -0.2249688 6.812064 0.7486401
##                    ACF1
## Training set 0.05258332

The ARIMA model found by ‘auto.arima’ is ARIMA(1,0,0)(0,1,1)[12]. The ARIMA model I suggested is ARIMA(1,0,1)(0,1,2)[12]. According to the error measures and the residual plots, both models represents the data well with similar AIC values, similar error measures, and similar p-values. Both models can be applied, but the best Arima model is ARIMA(1,0,0)(0,1,1)[12], because it has a smaller AICc.

Forecast

kwh_model <- Arima(kwh_ts, order=c(1,0,0), seasonal=c(0,1,1), lambda = BoxCox.lambda(kwh_ts))
kwh_f <- forecast(kwh_model, h=12, level=95)
kwh_f
##          Point Forecast   Lo 95    Hi 95
## Jan 2014        9396883 7762540 11437614
## Feb 2014        7887475 6475039  9664189
## Mar 2014        6435175 5306104  7848082
## Apr 2014        5756155 4760036  6998419
## May 2014        5570691 4610683  6766688
## Jun 2014        7479517 6140217  9164205
## Jul 2014        8521791 6969852 10482565
## Aug 2014        9080655 7413299 11191835
## Sep 2014        7911892 6484811  9710387
## Oct 2014        5661786 4684004  6880566
## Nov 2014        5568723 4609087  6764249
## Dec 2014        6904666 5681062  8439739
autoplot(kwh_f)

export <- kwh_f$mean
write.csv(export, "kwh_Forecast.csv")

We write the forecast data in the file: ‘kwh_Forecast.csv’

I think the forecast generated is accurate. Our forecast captures the seasonality of what appears to be increased demand in the summer and winter while falling between the peaks and troughs of more recent years..

Part C– BONUS, optional (part or all)

Objetive

The goal of the project is to explore time series, decomposition, forecasting, data preprocessing, exponential smoothing, and ARIMA.

Load Data

Load the excel data:

wfp1 <- import("https://raw.githubusercontent.com/GabrielSantos33/DATA624_Project1/main/Waterflow_Pipe1.xlsx", col_types = c("date", "numeric"))
wfp2 <- import("https://raw.githubusercontent.com/GabrielSantos33/DATA624_Project1/main/Waterflow_Pipe2.xlsx", col_types = c("date", "numeric"))
colnames(wfp1) <- c("DateTime", "WaterFlow") 
colnames(wfp2) <- c("DateTime", "WaterFlow") 

Match the hour with wfp2:

wfp1 <- wfp1 %>% mutate(Date = as.Date(DateTime), Time = hour(DateTime)+1) %>% 
                  group_by(Date, Time) %>%
                  summarise(Water=mean(WaterFlow)) %>%
                  ungroup() %>%
                  mutate(DateTime=ymd_h(paste(Date,Time))) %>%
                  select(DateTime,Water)
wfp1
## # A tibble: 236 × 2
##    DateTime            Water
##    <dttm>              <dbl>
##  1 2015-10-23 01:00:00  26.1
##  2 2015-10-23 02:00:00  18.9
##  3 2015-10-23 03:00:00  15.2
##  4 2015-10-23 04:00:00  23.1
##  5 2015-10-23 05:00:00  15.5
##  6 2015-10-23 06:00:00  22.7
##  7 2015-10-23 07:00:00  20.6
##  8 2015-10-23 08:00:00  18.4
##  9 2015-10-23 09:00:00  20.8
## 10 2015-10-23 10:00:00  21.2
## # … with 226 more rows
wfp2 <- wfp2 %>% mutate(Date = as.Date(DateTime), Time = hour(DateTime)) %>%
                  group_by(Date, Time) %>%
                  summarise(Water=mean(WaterFlow)) %>%
                  ungroup() %>%
                  mutate(DateTime=ymd_h(paste(Date,Time))) %>%
                  select(DateTime, Water)
wfp2
## # A tibble: 1,000 × 2
##    DateTime            Water
##    <dttm>              <dbl>
##  1 2015-10-23 01:00:00 18.8 
##  2 2015-10-23 02:00:00 43.1 
##  3 2015-10-23 03:00:00 38.0 
##  4 2015-10-23 04:00:00 36.1 
##  5 2015-10-23 05:00:00 31.9 
##  6 2015-10-23 06:00:00 28.2 
##  7 2015-10-23 07:00:00  9.86
##  8 2015-10-23 08:00:00 26.7 
##  9 2015-10-23 09:00:00 55.8 
## 10 2015-10-23 10:00:00 54.2 
## # … with 990 more rows

Timeseries

Combining the two waterflows into one:

water <- full_join(wfp1, wfp2, by="DateTime", suffix=c("_1", "_2")) %>%
  mutate(Water_1=ifelse(is.na(Water_1), 0, Water_1)) %>%
  mutate(Water_2=ifelse(is.na(Water_2), 0, Water_2)) %>%
  mutate(Water = Water_1 + Water_2) %>%
  select(DateTime, Water)
water
## # A tibble: 1,000 × 2
##    DateTime            Water
##    <dttm>              <dbl>
##  1 2015-10-23 01:00:00  44.9
##  2 2015-10-23 02:00:00  61.9
##  3 2015-10-23 03:00:00  53.1
##  4 2015-10-23 04:00:00  59.2
##  5 2015-10-23 05:00:00  47.3
##  6 2015-10-23 06:00:00  51.0
##  7 2015-10-23 07:00:00  30.5
##  8 2015-10-23 08:00:00  45.0
##  9 2015-10-23 09:00:00  76.6
## 10 2015-10-23 10:00:00  75.4
## # … with 990 more rows
water_ts <- ts(water$Water, frequency=24)
ggseasonplot(water_ts) + theme(legend.title = element_blank())

ggsubseriesplot(water_ts)

We cannot see significant seasonality involved in ‘water_ts’ however there is a slightly decreasing trend. It is a non-stationary timeseries.

water_ts <- ts(water$Water, frequency=24)
ggtsdisplay(water_ts, main="Daily Waterflow")

ARIMA Model

We apply BocCox transformation:

water_bc <- BoxCox(water_ts, lambda = BoxCox.lambda(water_ts))
ggtsdisplay(water_bc, main="Water_ts with BoxCox Transformation")

Trend differencing is needed.

ndiffs(water_ts)
## [1] 1
nsdiffs(water_ts)
## [1] 0
ggtsdisplay(diff(water_bc), points=FALSE, main="Differenced water_ts with BoxCox Transformation")

water_bc %>% diff() %>% ur.kpss() %>% summary()
## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 7 lags. 
## 
## Value of test-statistic is: 0.0091 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739

The timeseries now appears to be stationary. One seasonal differencing was applied so D=0, while the non-seasonal part suggests d=1. There is one seasonal lags in ACF, suggest Q=1. There one non-seasonal spike at lag1 in ACF suggest q=1.I will try ARIMA(0,1,1)(0,0,1)[24].

water_arima <- Arima(water_ts, order=c(0,1,1), seasonal=c(0,0,1), lambda = BoxCox.lambda(water_ts))
summary(water_arima)
## Series: water_ts 
## ARIMA(0,1,1)(0,0,1)[24] 
## Box Cox transformation: lambda= 0.8713037 
## 
## Coefficients:
##           ma1    sma1
##       -0.9578  0.0712
## s.e.   0.0101  0.0319
## 
## sigma^2 = 103.3:  log likelihood = -3734.4
## AIC=7474.8   AICc=7474.83   BIC=7489.52
## 
## Training set error measures:
##                     ME     RMSE      MAE       MPE     MAPE      MASE
## Training set 0.1591549 16.33716 13.34589 -28.20783 50.26558 0.7507364
##                     ACF1
## Training set -0.04444759
checkresiduals(water_arima)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,1,1)(0,0,1)[24]
## Q* = 57.881, df = 46, p-value = 0.1124
## 
## Model df: 2.   Total lags used: 48

Now, let’s use auto.arima to find out which is the best Arima model:

water_auto <- auto.arima(water_ts, approximation = FALSE, lambda=BoxCox.lambda(water_ts))
summary(water_auto)
## Series: water_ts 
## ARIMA(0,1,1)(1,0,0)[24] 
## Box Cox transformation: lambda= 0.8713037 
## 
## Coefficients:
##           ma1    sar1
##       -0.9578  0.0714
## s.e.   0.0101  0.0322
## 
## sigma^2 = 103.3:  log likelihood = -3734.4
## AIC=7474.8   AICc=7474.82   BIC=7489.52
## 
## Training set error measures:
##                     ME     RMSE      MAE       MPE     MAPE     MASE
## Training set 0.1613506 16.33729 13.34564 -28.19083 50.25213 0.750722
##                     ACF1
## Training set -0.04431537
checkresiduals(water_auto)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,1,1)(1,0,0)[24]
## Q* = 57.858, df = 46, p-value = 0.1128
## 
## Model df: 2.   Total lags used: 48

The ARIMA model found by ‘auto.arima’ is ARIMA(0,1,1)(1,0,0)[24].The ARIMA model I suggested is ARIMA(0,1,1)(0,0,1)[24].Both models are similars to each other, on the statistics with only 0.01 difference on AICc. I will use the model generated by ‘auto.arima’, ARIMA(0,1,1)(1,0,0)[24].

water_model <- Arima(water_ts, order=c(0,1,1), seasonal=c(1,0,0), lambda = BoxCox.lambda(water_ts))

Forecast

I performed the forecast on a week on water use, 7 days, for 24 hours.

water_f <- forecast(water_model, h=7*24, level=95)
autoplot(water_f) +
  labs(title="Water Usage Forecast", subtitle = "ARIMA(0,1,1)(1,0,0)[24]", x="Day")

export <- water_f$mean
write.csv(export, "Waterflow_Forecast.csv")
df <- data.frame(water_f) %>% select(Point.Forecast)
rownames(df) <- NULL
df
##     Point.Forecast
## 1         44.13862
## 2         45.16903
## 3         46.07068
## 4         44.27144
## 5         45.44000
## 6         44.89509
## 7         42.95309
## 8         44.67901
## 9         46.80433
## 10        44.92962
## 11        46.40268
## 12        45.53528
## 13        45.54120
## 14        43.38716
## 15        44.16368
## 16        42.46275
## 17        43.26947
## 18        44.53708
## 19        46.52640
## 20        43.60654
## 21        46.10964
## 22        44.44325
## 23        43.74894
## 24        46.10043
## 25        44.52937
## 26        44.60297
## 27        44.66721
## 28        44.53887
## 29        44.62229
## 30        44.58342
## 31        44.44443
## 32        44.56799
## 33        44.71937
## 34        44.58589
## 35        44.69083
## 36        44.62908
## 37        44.62950
## 38        44.47556
## 39        44.53116
## 40        44.40922
## 41        44.46713
## 42        44.55785
## 43        44.69962
## 44        44.49128
## 45        44.66998
## 46        44.55115
## 47        44.50148
## 48        44.66933
## 49        44.55730
## 50        44.56256
## 51        44.56715
## 52        44.55798
## 53        44.56394
## 54        44.56117
## 55        44.55123
## 56        44.56006
## 57        44.57088
## 58        44.56134
## 59        44.56884
## 60        44.56443
## 61        44.56446
## 62        44.55346
## 63        44.55743
## 64        44.54872
## 65        44.55286
## 66        44.55934
## 67        44.56947
## 68        44.55458
## 69        44.56735
## 70        44.55886
## 71        44.55531
## 72        44.56730
## 73        44.55930
## 74        44.55967
## 75        44.56000
## 76        44.55935
## 77        44.55977
## 78        44.55958
## 79        44.55887
## 80        44.55950
## 81        44.56027
## 82        44.55959
## 83        44.56012
## 84        44.55981
## 85        44.55981
## 86        44.55902
## 87        44.55931
## 88        44.55869
## 89        44.55898
## 90        44.55944
## 91        44.56017
## 92        44.55910
## 93        44.56002
## 94        44.55941
## 95        44.55916
## 96        44.56001
## 97        44.55944
## 98        44.55947
## 99        44.55949
## 100       44.55945
## 101       44.55948
## 102       44.55946
## 103       44.55941
## 104       44.55946
## 105       44.55951
## 106       44.55946
## 107       44.55950
## 108       44.55948
## 109       44.55948
## 110       44.55942
## 111       44.55944
## 112       44.55940
## 113       44.55942
## 114       44.55945
## 115       44.55950
## 116       44.55943
## 117       44.55949
## 118       44.55945
## 119       44.55943
## 120       44.55949
## 121       44.55945
## 122       44.55945
## 123       44.55946
## 124       44.55945
## 125       44.55945
## 126       44.55945
## 127       44.55945
## 128       44.55945
## 129       44.55946
## 130       44.55945
## 131       44.55946
## 132       44.55945
## 133       44.55945
## 134       44.55945
## 135       44.55945
## 136       44.55945
## 137       44.55945
## 138       44.55945
## 139       44.55946
## 140       44.55945
## 141       44.55946
## 142       44.55945
## 143       44.55945
## 144       44.55946
## 145       44.55945
## 146       44.55945
## 147       44.55945
## 148       44.55945
## 149       44.55945
## 150       44.55945
## 151       44.55945
## 152       44.55945
## 153       44.55945
## 154       44.55945
## 155       44.55945
## 156       44.55945
## 157       44.55945
## 158       44.55945
## 159       44.55945
## 160       44.55945
## 161       44.55945
## 162       44.55945
## 163       44.55945
## 164       44.55945
## 165       44.55945
## 166       44.55945
## 167       44.55945
## 168       44.55945

Let’s generate the forecast data in the file ‘Waterflow_Forecast.csv’.

I think that the forecast generated, allowing us to see what the water flow will be is not reliable, because we can see that it does not follow the pattern of the data. I think it’s because its large, the forecasts are closer to the mean.