ATM DATA

First we read in the data and look for missing values.

There are 19 records with missing values in the dataset, including 14 with a missing ATM (they all represent days from the month of May 2010, which is what we are predicting.) This leaves 5 missing cash values - we use ARIMA to impute them.

Then we separate the original tsibble into 4 tsibbles, each representing one ATM.

##       DATE                            ATM                 Cash        
##  Min.   :2009-05-01 00:00:00.00   Length:1474        Min.   :    0.0  
##  1st Qu.:2009-08-01 00:00:00.00   Class :character   1st Qu.:    0.5  
##  Median :2009-11-01 00:00:00.00   Mode  :character   Median :   73.0  
##  Mean   :2009-10-31 19:11:48.27                      Mean   :  155.6  
##  3rd Qu.:2010-02-01 00:00:00.00                      3rd Qu.:  114.0  
##  Max.   :2010-05-14 00:00:00.00                      Max.   :10920.0  
##                                                      NA's   :19
## 'data.frame':    1474 obs. of  3 variables:
##  $ DATE: POSIXct, format: "2009-05-01" "2009-05-01" ...
##  $ ATM : chr  "ATM1" "ATM2" "ATM1" "ATM2" ...
##  $ Cash: num  96 107 82 89 85 90 90 55 99 79 ...
## [1] "Missing ATM: "
##  [1] 731 732 733 734 735 736 737 738 739 740 741 742 743 744
## [1] "Missing Cash: "
##  [1]  87  93  98 105 110 731 732 733 734 735 736 737 738 739 740 741 742 743 744

Forecasts

ATM1

1. Examine the data.

First we look at a boxplot, histogram and decomposition of the data. It shows a number of strange patterns, not the least of which is that the money removed from the ATMs is not in multiples of twenty. Also, the distribution is bimodal. There are possible outliers but none of them are too out of range. Since we have no contextual information, we simply move forward.

2. We will train various models on 75% of the data.

  1. ATM1 Residuals

  1. Forecasts and test RMSE

## # A tibble: 5 × 10
##   .model   .type      ME  RMSE   MAE   MPE  MAPE  MASE RMSSE    ACF1
##   <chr>    <chr>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl>
## 1 ARIMA    Test  -0.0353  49.1  36.0 -365.  401.   NaN   NaN  0.0398
## 2 ETS_AAA  Test  -3.66    50.1  36.2 -397.  430.   NaN   NaN  0.0626
## 3 ETS_AAdA Test  -2.29    50.3  36.1 -390.  425.   NaN   NaN  0.0671
## 4 ETS_AAN  Test   2.70    36.2  27.2 -278.  304.   NaN   NaN -0.0237
## 5 SNAIVE   Test  -4.98    49.1  38.8 -327.  360.   NaN   NaN -0.172

3. Forecast using the best model.

Interestingly, the RMSEs are comparable for all models except ETS_AAN (even SNAIVE). However, the residuals for ETS_AAN indicate a strong issue (because it is not accounting for seasonality) so we don’t trust standard errors and other statistical inference calculations.

In this case we decide to compare the results of several modes before deciding what to do. First we look at total May cash using ARIMA(0-0-3,0-1-1)(7)). The sum is $2,402.

## Series: Cash 
## Model: ARIMA(0,0,3)(0,1,1)[7] 
## 
## Coefficients:
##          ma1      ma2     ma3     sma1
##       0.1675  -0.0315  0.0535  -0.6454
## s.e.  0.0539   0.0502  0.0626   0.0490
## 
## sigma^2 estimated as 560.5:  log likelihood=-1640.74
## AIC=3291.47   AICc=3291.64   BIC=3310.87
## [1] 2401.925

Now we look at total May cash using ETS(AAN). The sum is $2,465. There is only a $63 difference despite the lack of a seasonal component in the ATM. We can see from the Monthly cash plot that May’s cash haul last year was comparable, so we split the difference - $2,435.

## Series: Cash 
## Model: ETS(A,A,N) 
##   Smoothing parameters:
##     alpha = 0.002163354 
##     beta  = 0.000100001 
## 
##   Initial states:
##      l[0]        b[0]
##  87.14808 -0.02005013
## 
##   sigma^2:  1359.729
## 
##      AIC     AICc      BIC 
## 4792.930 4793.097 4812.430
## [1] 2465.182

ATM2

1. Examine the data.

First we look at a boxplot, histogram and decomposition of the data. It shows similar irregularities as ATM1, including the bimodal distribution and cash not in multiples of twenty. There are no apparent outliers. Variance may be falling over time so a boxcox is employed here.

ATM2 with BoxCox

2. We will train various models on 75% of the data.

  1. ATM2 Residuals

  1. Forecasts and test RMSE

## # A tibble: 5 × 10
##   .model   .type      ME  RMSE   MAE   MPE  MAPE  MASE RMSSE   ACF1
##   <chr>    <chr>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>
## 1 ARIMA    Test   1.38    17.9  15.3  -Inf   Inf   NaN   NaN 0.0359
## 2 ETS_AAA  Test   1.35    17.8  14.9  -Inf   Inf   NaN   NaN 0.0383
## 3 ETS_AAdA Test  -0.0639  17.5  14.5  -Inf   Inf   NaN   NaN 0.0534
## 4 ETS_AAN  Test   2.56    11.9  11.0  -Inf   Inf   NaN   NaN 0.0665
## 5 SNAIVE   Test   0.874   18.0  15.3  -Inf   Inf   NaN   NaN 0.0413

3. Forecast using the best model.

The RMSEs are again comparable for all models - ETS_AAN again performs best but the residuals suggest the standard error may not be accurate.

We will estimate the cash using ETS_AADA as it had the second lowest RMSE and the residuals indicate white noise. We will need to reverse the boxcox.

Total cash for May is $1,825. This falls short of last May but probably reflects a significant downward trend after last May.

## Series: Cash 
## Model: ETS(A,Ad,A) 
##   Smoothing parameters:
##     alpha = 0.0001000573 
##     beta  = 0.0001000475 
##     gamma = 0.4030829 
##     phi   = 0.85 
## 
##   Initial states:
##      l[0]      b[0]      s[0]     s[-1]    s[-2]      s[-3]    s[-4]    s[-5]
##  23.25623 0.2202457 -12.06986 -4.874162 3.902075 -0.4611876 2.060142 4.699912
##     s[-6]
##  6.743085
## 
##   sigma^2:  48.5349
## 
##      AIC     AICc      BIC 
## 3582.294 3583.181 3629.093
## [1] 1825.08

ATM3

1. Examine the data.

ATM3 has three values for April and nothing else. We could:

  1. Start with an average from April and apply it to the 31 days of May.
  2. Test statistically whether all four ATMs are connected and use this to modify our forecast (e.g. increase the forecast because May of last year is always larger than April of this year.)
  3. Check to see whether the April dates are pulled from the weekend or weekdays and apply some kind of seasonal adjustment.

But the fruits of this effort are likely to be very small. There is just not enough data. So for this exercise we will simply apply the daily average of April over the 31 days of May. This yields $2,717.67.

## [1] 2717.667

ATM 4

1. Examine the data.

ATM4 has an extreme outlier which must be rectified. We can use Arima interpolation.

After the interpolation there are still some high values for cash, but we keep them since they are merely the tail end of a range of values. The variance looks smaller toward the end of the series so we will employ a boxcox transformation.

Outlier corrected

ATM4 with BoxCox

a. ATM4 Residuals

b. Forecasts and test RMSE

## # A tibble: 5 × 10
##   .model   .type     ME  RMSE   MAE    MPE  MAPE  MASE RMSSE    ACF1
##   <chr>    <chr>  <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl>   <dbl>
## 1 ARIMA    Test  -0.727  13.0  10.5  -83.3  107.   NaN   NaN -0.0699
## 2 ETS_AAA  Test  -1.51   16.0  12.7  -98.7  128.   NaN   NaN  0.0510
## 3 ETS_AAdA Test  -1.15   16.0  12.7  -96.1  126.   NaN   NaN  0.0520
## 4 ETS_AAN  Test  -0.193  12.9  10.5  -80.0  105.   NaN   NaN -0.0532
## 5 SNAIVE   Test  -6.16   21.4  18.2 -129.   162.   NaN   NaN -0.231

3. Forecast using the best model.

ARIMA performs the “best” - comparable to the lowest RMSE (ETS_AAN) with normal looking residuals centered on zero and looking like white noise.

The model is an (0-0-0), (2-0-0)(7) model, showing how important seasonality is in the estimate. The model estimates $9,931 for May. This is lower than last May and April and probably reflects the downward trend in the data since January.

## Series: Cash 
## Model: ARIMA(0,0,0)(2,0,0)[7] w/ mean 
## 
## Coefficients:
##         sar1    sar2  constant
##       0.2113  0.1919   17.0635
## s.e.  0.0517  0.0521    0.6797
## 
## sigma^2 estimated as 176.4:  log likelihood=-1460.99
## AIC=2929.97   AICc=2930.09   BIC=2945.57
## [1] 9931.181

KWH

1. Examine the data

The data has both a missing value and an outlier. We use Arima to interpolate both. We might imagine that the outlier is a typo (and then we might just add a 0 to the end) but we have no other evidence to suggest this.

The variance appears to increase over time so we will apply a boxcox transformation.

##   CaseSequence     YYYY-MMM              KWH          
##  Min.   :733.0   Length:192         Min.   :  770523  
##  1st Qu.:780.8   Class :character   1st Qu.: 5429912  
##  Median :828.5   Mode  :character   Median : 6283324  
##  Mean   :828.5                      Mean   : 6502475  
##  3rd Qu.:876.2                      3rd Qu.: 7620524  
##  Max.   :924.0                      Max.   :10655730  
##                                     NA's   :1
## tbl_ts [192 × 3] (S3: tbl_ts/tbl_df/tbl/data.frame)
##  $ CaseSequence: num [1:192] 733 734 735 736 737 738 739 740 741 742 ...
##  $ Month       : mth [1:192] 1998 Jan, 1998 Feb, 1998 Mar, 1998 Apr, 1998 May, 1998 Jun...
##  $ KWH         : num [1:192] 6862583 5838198 5420658 5010364 4665377 ...
##  - attr(*, "key")= tibble [1 × 1] (S3: tbl_df/tbl/data.frame)
##   ..$ .rows: list<int> [1:1] 
##   .. ..$ : int [1:192] 1 2 3 4 5 6 7 8 9 10 ...
##   .. ..@ ptype: int(0) 
##  - attr(*, "index")= chr "Month"
##   ..- attr(*, "ordered")= logi TRUE
##  - attr(*, "index2")= chr "Month"
##  - attr(*, "interval")= interval [1:1] 1M
##   ..@ .regular: logi TRUE

2. We will train using various methods on 75% of the data.

a. Residual plots for each of the models

b. RMSE and forecasts

## # A tibble: 5 × 10
##   .model   .type      ME    RMSE     MAE    MPE   MAPE  MASE RMSSE  ACF1
##   <chr>    <chr>   <dbl>   <dbl>   <dbl>  <dbl>  <dbl> <dbl> <dbl> <dbl>
## 1 ARIMA    Test  0.00397 0.00548 0.00430 0.0849 0.0920   NaN   NaN 0.285
## 2 ETS_MMA  Test  0.00207 0.00476 0.00362 0.0442 0.0773   NaN   NaN 0.327
## 3 ETS_MMdA Test  0.00503 0.00660 0.00529 0.108  0.113    NaN   NaN 0.317
## 4 ETS_MMN  Test  0.00364 0.00894 0.00774 0.0775 0.166    NaN   NaN 0.399
## 5 SNAIVE   Test  0.00399 0.00584 0.00473 0.0853 0.101    NaN   NaN 0.247

3. Forecast using the best model.

ETS_MMA has the lowest RMSE and the residuals, with a few small exceptions in the acf plot, look like white noise. However, the ARIMA model also has a low RMSE and more normally distributed residuals, so we’ll look at both.

The ETS model estimates 92,882,533 KW for 2014, while the ARIMA model estimates 91,392,876. The difference between them is about 1.5% of the total. We will split the difference and estimate 92,137,704.5 KWH.

## Series: KWH 
## Model: ETS(M,M,A) 
##   Smoothing parameters:
##     alpha = 0.07679262 
##     beta  = 0.00132528 
##     gamma = 0.0001006672 
## 
##   Initial states:
##      l[0]     b[0]         s[0]       s[-1]        s[-2]       s[-3]
##  4.663678 1.000005 -0.002095888 -0.01078722 -0.005287552 0.006963209
##        s[-4]       s[-5]        s[-6]        s[-7]        s[-8]       s[-9]
##  0.009873606 0.007849855 0.0008124751 -0.009611188 -0.007265691 -0.00301425
##       s[-10]      s[-11]
##  0.003171756 0.009390889
## 
##   sigma^2:  0
## 
##       AIC      AICc       BIC 
## -1125.043 -1121.526 -1069.666
## [1] 92904482

## Series: KWH 
## Model: ARIMA(1,0,1)(0,1,1)[12] w/ drift 
## 
## Coefficients:
##          ar1      ma1     sma1  constant
##       0.5318  -0.2706  -0.7165     3e-04
## s.e.  0.2434   0.2813   0.0646     1e-04
## 
## sigma^2 estimated as 1.459e-05:  log likelihood=754.03
## AIC=-1498.06   AICc=-1497.72   BIC=-1482.1
## [1] 91392876