ATM DATA
First we read in the data and look for missing values.
There are 19 records with missing values in the dataset, including 14 with a missing ATM (they all represent days from the month of May 2010, which is what we are predicting.) This leaves 5 missing cash values - we use ARIMA to impute them.
Then we separate the original tsibble into 4 tsibbles, each representing one ATM.
## DATE ATM Cash
## Min. :2009-05-01 00:00:00.00 Length:1474 Min. : 0.0
## 1st Qu.:2009-08-01 00:00:00.00 Class :character 1st Qu.: 0.5
## Median :2009-11-01 00:00:00.00 Mode :character Median : 73.0
## Mean :2009-10-31 19:11:48.27 Mean : 155.6
## 3rd Qu.:2010-02-01 00:00:00.00 3rd Qu.: 114.0
## Max. :2010-05-14 00:00:00.00 Max. :10920.0
## NA's :19
## 'data.frame': 1474 obs. of 3 variables:
## $ DATE: POSIXct, format: "2009-05-01" "2009-05-01" ...
## $ ATM : chr "ATM1" "ATM2" "ATM1" "ATM2" ...
## $ Cash: num 96 107 82 89 85 90 90 55 99 79 ...
## [1] "Missing ATM: "
## [1] 731 732 733 734 735 736 737 738 739 740 741 742 743 744
## [1] "Missing Cash: "
## [1] 87 93 98 105 110 731 732 733 734 735 736 737 738 739 740 741 742 743 744
1. Examine the data.
First we look at a boxplot, histogram and decomposition of the data. It shows a number of strange patterns, not the least of which is that the money removed from the ATMs is not in multiples of twenty. Also, the distribution is bimodal. There are possible outliers but none of them are too out of range. Since we have no contextual information, we simply move forward.
2. We will train various models on 75% of the data.
## # A tibble: 5 × 10
## .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ARIMA Test -0.0353 49.1 36.0 -365. 401. NaN NaN 0.0398
## 2 ETS_AAA Test -3.66 50.1 36.2 -397. 430. NaN NaN 0.0626
## 3 ETS_AAdA Test -2.29 50.3 36.1 -390. 425. NaN NaN 0.0671
## 4 ETS_AAN Test 2.70 36.2 27.2 -278. 304. NaN NaN -0.0237
## 5 SNAIVE Test -4.98 49.1 38.8 -327. 360. NaN NaN -0.172
3. Forecast using the best model.
Interestingly, the RMSEs are comparable for all models except ETS_AAN (even SNAIVE). However, the residuals for ETS_AAN indicate a strong issue (because it is not accounting for seasonality) so we don’t trust standard errors and other statistical inference calculations.
In this case we decide to compare the results of several modes before deciding what to do. First we look at total May cash using ARIMA(0-0-3,0-1-1)(7)). The sum is $2,402.
## Series: Cash
## Model: ARIMA(0,0,3)(0,1,1)[7]
##
## Coefficients:
## ma1 ma2 ma3 sma1
## 0.1675 -0.0315 0.0535 -0.6454
## s.e. 0.0539 0.0502 0.0626 0.0490
##
## sigma^2 estimated as 560.5: log likelihood=-1640.74
## AIC=3291.47 AICc=3291.64 BIC=3310.87
## [1] 2401.925
Now we look at total May cash using ETS(AAN). The sum is $2,465. There is only a $63 difference despite the lack of a seasonal component in the ATM. We can see from the Monthly cash plot that May’s cash haul last year was comparable, so we split the difference - $2,435.
## Series: Cash
## Model: ETS(A,A,N)
## Smoothing parameters:
## alpha = 0.002163354
## beta = 0.000100001
##
## Initial states:
## l[0] b[0]
## 87.14808 -0.02005013
##
## sigma^2: 1359.729
##
## AIC AICc BIC
## 4792.930 4793.097 4812.430
## [1] 2465.182
1. Examine the data.
First we look at a boxplot, histogram and decomposition of the data. It shows similar irregularities as ATM1, including the bimodal distribution and cash not in multiples of twenty. There are no apparent outliers. Variance may be falling over time so a boxcox is employed here.
ATM2 with BoxCox
2. We will train various models on 75% of the data.
## # A tibble: 5 × 10
## .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ARIMA Test 1.38 17.9 15.3 -Inf Inf NaN NaN 0.0359
## 2 ETS_AAA Test 1.35 17.8 14.9 -Inf Inf NaN NaN 0.0383
## 3 ETS_AAdA Test -0.0639 17.5 14.5 -Inf Inf NaN NaN 0.0534
## 4 ETS_AAN Test 2.56 11.9 11.0 -Inf Inf NaN NaN 0.0665
## 5 SNAIVE Test 0.874 18.0 15.3 -Inf Inf NaN NaN 0.0413
3. Forecast using the best model.
The RMSEs are again comparable for all models - ETS_AAN again performs best but the residuals suggest the standard error may not be accurate.
We will estimate the cash using ETS_AADA as it had the second lowest RMSE and the residuals indicate white noise. We will need to reverse the boxcox.
Total cash for May is $1,825. This falls short of last May but probably reflects a significant downward trend after last May.
## Series: Cash
## Model: ETS(A,Ad,A)
## Smoothing parameters:
## alpha = 0.0001000573
## beta = 0.0001000475
## gamma = 0.4030829
## phi = 0.85
##
## Initial states:
## l[0] b[0] s[0] s[-1] s[-2] s[-3] s[-4] s[-5]
## 23.25623 0.2202457 -12.06986 -4.874162 3.902075 -0.4611876 2.060142 4.699912
## s[-6]
## 6.743085
##
## sigma^2: 48.5349
##
## AIC AICc BIC
## 3582.294 3583.181 3629.093
## [1] 1825.08
1. Examine the data.
ATM3 has three values for April and nothing else. We could:
But the fruits of this effort are likely to be very small. There is just not enough data. So for this exercise we will simply apply the daily average of April over the 31 days of May. This yields $2,717.67.
## [1] 2717.667
1. Examine the data.
ATM4 has an extreme outlier which must be rectified. We can use Arima interpolation.
After the interpolation there are still some high values for cash, but we keep them since they are merely the tail end of a range of values. The variance looks smaller toward the end of the series so we will employ a boxcox transformation.
Outlier corrected
ATM4 with BoxCox
a. ATM4 Residuals
b. Forecasts and test RMSE
## # A tibble: 5 × 10
## .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ARIMA Test -0.727 13.0 10.5 -83.3 107. NaN NaN -0.0699
## 2 ETS_AAA Test -1.51 16.0 12.7 -98.7 128. NaN NaN 0.0510
## 3 ETS_AAdA Test -1.15 16.0 12.7 -96.1 126. NaN NaN 0.0520
## 4 ETS_AAN Test -0.193 12.9 10.5 -80.0 105. NaN NaN -0.0532
## 5 SNAIVE Test -6.16 21.4 18.2 -129. 162. NaN NaN -0.231
3. Forecast using the best model.
ARIMA performs the “best” - comparable to the lowest RMSE (ETS_AAN) with normal looking residuals centered on zero and looking like white noise.
The model is an (0-0-0), (2-0-0)(7) model, showing how important seasonality is in the estimate. The model estimates $9,931 for May. This is lower than last May and April and probably reflects the downward trend in the data since January.
## Series: Cash
## Model: ARIMA(0,0,0)(2,0,0)[7] w/ mean
##
## Coefficients:
## sar1 sar2 constant
## 0.2113 0.1919 17.0635
## s.e. 0.0517 0.0521 0.6797
##
## sigma^2 estimated as 176.4: log likelihood=-1460.99
## AIC=2929.97 AICc=2930.09 BIC=2945.57
## [1] 9931.181
1. Examine the data
The data has both a missing value and an outlier. We use Arima to interpolate both. We might imagine that the outlier is a typo (and then we might just add a 0 to the end) but we have no other evidence to suggest this.
The variance appears to increase over time so we will apply a boxcox transformation.
## CaseSequence YYYY-MMM KWH
## Min. :733.0 Length:192 Min. : 770523
## 1st Qu.:780.8 Class :character 1st Qu.: 5429912
## Median :828.5 Mode :character Median : 6283324
## Mean :828.5 Mean : 6502475
## 3rd Qu.:876.2 3rd Qu.: 7620524
## Max. :924.0 Max. :10655730
## NA's :1
## tbl_ts [192 × 3] (S3: tbl_ts/tbl_df/tbl/data.frame)
## $ CaseSequence: num [1:192] 733 734 735 736 737 738 739 740 741 742 ...
## $ Month : mth [1:192] 1998 Jan, 1998 Feb, 1998 Mar, 1998 Apr, 1998 May, 1998 Jun...
## $ KWH : num [1:192] 6862583 5838198 5420658 5010364 4665377 ...
## - attr(*, "key")= tibble [1 × 1] (S3: tbl_df/tbl/data.frame)
## ..$ .rows: list<int> [1:1]
## .. ..$ : int [1:192] 1 2 3 4 5 6 7 8 9 10 ...
## .. ..@ ptype: int(0)
## - attr(*, "index")= chr "Month"
## ..- attr(*, "ordered")= logi TRUE
## - attr(*, "index2")= chr "Month"
## - attr(*, "interval")= interval [1:1] 1M
## ..@ .regular: logi TRUE
2. We will train using various methods on 75% of the data.
a. Residual plots for each of the models
b. RMSE and forecasts
## # A tibble: 5 × 10
## .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ARIMA Test 0.00397 0.00548 0.00430 0.0849 0.0920 NaN NaN 0.285
## 2 ETS_MMA Test 0.00207 0.00476 0.00362 0.0442 0.0773 NaN NaN 0.327
## 3 ETS_MMdA Test 0.00503 0.00660 0.00529 0.108 0.113 NaN NaN 0.317
## 4 ETS_MMN Test 0.00364 0.00894 0.00774 0.0775 0.166 NaN NaN 0.399
## 5 SNAIVE Test 0.00399 0.00584 0.00473 0.0853 0.101 NaN NaN 0.247
3. Forecast using the best model.
ETS_MMA has the lowest RMSE and the residuals, with a few small exceptions in the acf plot, look like white noise. However, the ARIMA model also has a low RMSE and more normally distributed residuals, so we’ll look at both.
The ETS model estimates 92,882,533 KW for 2014, while the ARIMA model estimates 91,392,876. The difference between them is about 1.5% of the total. We will split the difference and estimate 92,137,704.5 KWH.
## Series: KWH
## Model: ETS(M,M,A)
## Smoothing parameters:
## alpha = 0.07679262
## beta = 0.00132528
## gamma = 0.0001006672
##
## Initial states:
## l[0] b[0] s[0] s[-1] s[-2] s[-3]
## 4.663678 1.000005 -0.002095888 -0.01078722 -0.005287552 0.006963209
## s[-4] s[-5] s[-6] s[-7] s[-8] s[-9]
## 0.009873606 0.007849855 0.0008124751 -0.009611188 -0.007265691 -0.00301425
## s[-10] s[-11]
## 0.003171756 0.009390889
##
## sigma^2: 0
##
## AIC AICc BIC
## -1125.043 -1121.526 -1069.666
## [1] 92904482
## Series: KWH
## Model: ARIMA(1,0,1)(0,1,1)[12] w/ drift
##
## Coefficients:
## ar1 ma1 sma1 constant
## 0.5318 -0.2706 -0.7165 3e-04
## s.e. 0.2434 0.2813 0.0646 1e-04
##
## sigma^2 estimated as 1.459e-05: log likelihood=754.03
## AIC=-1498.06 AICc=-1497.72 BIC=-1482.1
## [1] 91392876