Data

DATA & Data Review


We load the data and then utilize the skim and vis_dat function to assess the quality of the data. Here are some observations:

  • There is no ATM5
  • ATMs 1 & 2 have missing numeric (CASH) data
  • ATMs 1 & 4 appear to have outliers

Next we will perform some exploratory data analysis to confirm our initial findings and identify other potential issues.

##       DATE                ATM                 Cash        
##  Min.   :2009-05-01   Length:1474        Min.   :    0.0  
##  1st Qu.:2009-08-01   Class :character   1st Qu.:    0.5  
##  Median :2009-11-01   Mode  :character   Median :   73.0  
##  Mean   :2009-10-31                      Mean   :  155.6  
##  3rd Qu.:2010-02-01                      3rd Qu.:  114.0  
##  Max.   :2010-05-14                      Max.   :10919.8  
##                                          NA's   :19
Data summary
Name atm_skim
Number of rows 1474
Number of columns 3
_______________________
Column type frequency:
Date 1
numeric 1
________________________
Group variables ATM

Variable type: Date

skim_variable ATM n_missing complete_rate min max median n_unique
DATE ATM1 0 1 2009-05-01 2010-04-30 2009-10-30 365
DATE ATM2 0 1 2009-05-01 2010-04-30 2009-10-30 365
DATE ATM3 0 1 2009-05-01 2010-04-30 2009-10-30 365
DATE ATM4 0 1 2009-05-01 2010-04-30 2009-10-30 365
DATE NA 0 1 2010-05-01 2010-05-14 2010-05-07 14

Variable type: numeric

skim_variable ATM n_missing complete_rate mean sd p0 p25 p50 p75 p100
Cash ATM1 3 0.99 83.89 36.66 1.00 73.00 91.00 108.00 180.00
Cash ATM2 2 0.99 62.58 38.90 0.00 25.50 67.00 93.00 147.00
Cash ATM3 0 1.00 0.72 7.94 0.00 0.00 0.00 0.00 96.00
Cash ATM4 0 1.00 474.04 650.94 1.56 124.33 403.84 704.51 10919.76
Cash NA 14 0.00 NaN NA NA NA NA NA NA

EDA

Exploratory Data Analysis


We utilized a combination of box plots, line charts and the ggtsdisplay function to perform our EDA. Here are some key findings.

  • The outlier in ATM4 may in fact be bad data and will have to be replaced/imputed.
  • Almost all the values for ATM 3 are zero. This ATM may have been deployed very recently.
  • The ACF and PACF chart reveal some weekly patterns for ATMs 1 and 2
  • ATM 2 appears to have the largest variance based up the box plot

We will address the missing data and outliers in our pre-processing section.

Preprocessing

Exploratory Data Analysis


In this section we will address outliers and missing data. Our strategy is to utilize average day of week values by ATM for the missing data.

The dplyr script belows calculate weekday averages by ATM and the replaces NA values with the appropriate weekday average. The Skim function is utilized a second time to ensure the preprocessing has been successful. The skim function confirms there is no longer missing values and the outlier for ATM 4 has been imputed. We are now ready to move to the Modeling section.

A plot of the pre-processed times series is set forth below.

Data summary
Name atm_skim2
Number of rows 1474
Number of columns 3
_______________________
Column type frequency:
Date 1
numeric 1
________________________
Group variables ATM

Variable type: Date

skim_variable ATM n_missing complete_rate min max median n_unique
DATE ATM1 0 1 2009-05-01 2010-04-30 2009-10-30 365
DATE ATM2 0 1 2009-05-01 2010-04-30 2009-10-30 365
DATE ATM3 0 1 2009-05-01 2010-04-30 2009-10-30 365
DATE ATM4 0 1 2009-05-01 2010-04-30 2009-10-30 365
DATE NA 0 1 2010-05-01 2010-05-14 2010-05-07 14

Variable type: numeric

skim_variable ATM n_missing complete_rate mean sd p0 p25 p50 p75 p100
Cash ATM1 0 1 83.94 36.51 1.00 73.00 91.00 108.00 180.00
Cash ATM2 0 1 62.43 38.85 0.00 25.53 66.00 93.00 147.00
Cash ATM3 0 1 0.72 7.94 0.00 0.00 0.00 0.00 96.00
Cash ATM4 0 1 445.35 350.90 1.56 124.33 403.84 704.19 1712.07
Cash NA 14 0 NaN NA NA NA NA NA NA

Modeling

Exploratory Data Analysis


In the modeling section we will utilized STL Decomposition, Exponential Smoothing and Arima to identify the best possible model fit. Based upon my personal work experience, I have determine not to model ATM 3. The reason for this is that both the location and age (time in service) of an ATM are so important to its activity/usage that modeling an ATM by taking an average of other ATMs would be a unproductive.

After completing our preliminary modeling analysis we will utilize the tsCV function to help assess our modeling results. We will leverage the RMSE to pick our preferred model(s). We will model each ATM individually and review modeling residuals. A modeling summary follows:

STL Decomposition

We utilized STL (stl() function) Decomposition for model ATMs 1, 2 and 4. An analysis of the ACF and PACF charts indicated that 7 was the appropriate period (weekly) for seasonality. Plots for each of the modeled ATMs are set forth below.

Exponential Smoothing

We used the ets() function in our exponential smoothing models. Again, we modeled ATMs 1, 2, and 4. Key parameter setting follow:

  • model - Default (ZZZ). This setting enables the function to determine the best model
  • lambda - On and Off.Box Cox is utilized when on to determine the optimal lambda. biasadj is utilized when Box Cox is on.
  • damped - Null (this enables the model to determine the best model)

The exponential smoothing models utilized in this analysis follows:

  • ATM 1: ETS(A, N, A) with Box-Cox transformation: lambda= 0.2454, Stationary Diff=0, Seasonal Diff=1
  • ATM 2: ETS(A,N,A) with Box-Cox transformation: lambda= 0.6936, Stationary Diff=1, Seasonal Diff=1
  • ATM 4: ETS(A,N,A) with Box-Cox transformation: lambda= 0.4533, Stationary Diff=0, Seasonal Diff=0

ARIMA

The auto.arima function was utilized to produce ARIMA forecasts. The models utilized for each ATM machine follow:

  • ATM 1: ARIMA(0,0,2)(0,1,1)[7] with Box Cox transformation: lambda= 0.2454
  • ATM 2: ARIMA(2,0,2)(0,1,1)[7] with Box Cox transformation: lambda= 0.6936
  • ATM 4: ARIMA(0,0,1)(2,0,0)[7] with non-zero mean with Box Cox transformation: lambda= 0.4533

Model Accuracy / Selection

An analysis of model accuracy is performed for each ATM machine. We have utilized RMSE as our accuracy metric. Model accuracy results are summarized below:

  • ATM 1: Best RMSE: Arima - 37.1; STL RMSE:51.44 ; ETS RMSE: 43.61
  • ATM 2: Best RMSE: Arima - 46.1; STL RMSE:58.94 ; ETS RMSE: 59.40
  • ATM 4: Best RMSE: Arima - 287.2; STL RMSE:350.75 ; ETS RMSE: 329.09

Detailed forecasting information by ATM is set forth below:

Box-Cox Transformation

The Box-Cox lambda value for the series is: 0.2453825.

ETS

## ETS(A,N,A) 
## 
## Call:
##  ets(y = ., lambda = lambda_1, biasadj = TRUE) 
## 
##   Box-Cox transformation: lambda= 0.2454 
## 
##   Smoothing parameters:
##     alpha = 1e-04 
##     gamma = 0.353 
## 
##   Initial states:
##     l = 7.5346 
##     s = -4.5312 0.8387 0.2145 0.8725 1.1761 0.925
##            0.5045
## 
##   sigma:  1.2737
## 
##      AIC     AICc      BIC 
## 2340.975 2341.596 2379.973

ARIMA

## Series: . 
## ARIMA(0,0,2)(0,1,1)[7] 
## Box Cox transformation: lambda= 0.2453825 
## 
## Coefficients:
##          ma1      ma2     sma1
##       0.1001  -0.1135  -0.6455
## s.e.  0.0524   0.0527   0.0426
## 
## sigma^2 estimated as 1.584:  log likelihood=-590.73
## AIC=1189.46   AICc=1189.58   BIC=1204.98

Model Accuracy

##            STL      ETS    ARIMA
## RMSE: 51.44506 43.61599 37.14315

Residuals

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,0,2)(0,1,1)[7]
## Q* = 7.9435, df = 11, p-value = 0.7184
## 
## Model df: 3.   Total lags used: 14

ETS

## ETS(A,N,A) 
## 
## Call:
##  ets(y = ., lambda = lambda_2, biasadj = TRUE) 
## 
##   Box-Cox transformation: lambda= 0.6936 
## 
##   Smoothing parameters:
##     alpha = 1e-04 
##     gamma = 0.3967 
## 
##   Initial states:
##     l = 25.4405 
##     s = -13.7016 -10.3688 6.5125 -1.8726 0.3309 10.0456
##            9.054
## 
##   sigma:  7.4538
## 
##      AIC     AICc      BIC 
## 3630.720 3631.342 3669.719

ARIMA

## Series: . 
## ARIMA(2,0,2)(0,1,1)[7] 
## Box Cox transformation: lambda= 0.6935866 
## 
## Coefficients:
##           ar1      ar2     ma1     ma2     sma1
##       -0.4209  -0.9036  0.4633  0.7935  -0.7169
## s.e.   0.0589   0.0458  0.0888  0.0601   0.0416
## 
## sigma^2 estimated as 52.2:  log likelihood=-1215.47
## AIC=2442.94   AICc=2443.18   BIC=2466.23

Model Accuracy

##            STL      ETS    ARIMA
## RMSE: 58.94468 59.40853 46.07943

Residuals

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(3,0,3)(0,1,1)[7] with drift
## Q* = 5.2417, df = 6, p-value = 0.5132
## 
## Model df: 8.   Total lags used: 14

Box Cox Transform

The Box-Cox lambda value for the series is: 0.4533182.

ETS

## ETS(A,N,A) 
## 
## Call:
##  ets(y = ., lambda = lambda_4, biasadj = TRUE) 
## 
##   Box-Cox transformation: lambda= 0.4533 
## 
##   Smoothing parameters:
##     alpha = 1e-04 
##     gamma = 0.1008 
## 
##   Initial states:
##     l = 29.2323 
##     s = -17.9705 -2.5336 0.6625 4.6494 6.2229 4.8631
##            4.1063
## 
##   sigma:  13.1943
## 
##      AIC     AICc      BIC 
## 4047.594 4048.215 4086.593

ARIMA

## Series: . 
## ARIMA(0,0,1)(2,0,0)[7] with non-zero mean 
## Box Cox transformation: lambda= 0.4533182 
## 
## Coefficients:
##          ma1    sar1    sar2     mean
##       0.0765  0.2090  0.1990  29.1078
## s.e.  0.0529  0.0517  0.0525   1.2587
## 
## sigma^2 estimated as 183.8:  log likelihood=-1467.94
## AIC=2945.87   AICc=2946.04   BIC=2965.37

Model Accuracy

##            STL      ETS    ARIMA
## RMSE: 350.7567 329.0905 287.1895

Residuals

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,0,1)(2,0,0)[7] with non-zero mean
## Q* = 12.392, df = 10, p-value = 0.2596
## 
## Model df: 4.   Total lags used: 14

Output Forecast Data

ATM forecat output is set forth below. The ATM forecasts have also been written to a csv file.

##          DATE      ATM1     ATM2     ATM4
## 1  2010-05-01  88.03248 43.73438 513.6667
## 2  2010-05-02  70.27660 29.52705 386.9937
## 3  2010-05-03  76.01683 53.06357 481.4655
## 4  2010-05-04  37.14118 24.18667 661.3958
## 5  2010-05-05 109.88323 42.56347 453.9731
## 6  2010-05-06 101.69941 60.56368 412.7631
## 7  2010-05-07 101.80257 64.93641 578.4927
## 8  2010-05-08  84.34108 38.51018 486.2995
## 9  2010-05-09  74.09145 41.57788 351.5421
## 10 2010-05-10  76.12534 53.16685 442.8794
## 11 2010-05-11  37.21530 17.44156 587.5181
## 12 2010-05-12 110.01498 49.47473 425.1071
## 13 2010-05-13 101.82591 67.45401 451.5255
## 14 2010-05-14 101.92913 56.10449 580.6077
## 15 2010-05-15  84.45562 38.62484 472.1165
## 16 2010-05-16  74.19842 49.50073 424.2479
## 17 2010-05-17  76.23384 49.41827 458.5358
## 18 2010-05-18  37.28942 14.80928 511.2671
## 19 2010-05-19 110.14673 55.76500 450.7936
## 20 2010-05-20 101.95240 68.48927 450.1528
## 21 2010-05-21 102.05569 50.24438 499.2524
## 22 2010-05-22  84.57017 40.80389 464.5069
## 23 2010-05-23  74.30538 53.25603 433.7026
## 24 2010-05-24  76.34235 45.28566 455.3553
## 25 2010-05-25  37.36355 14.22772 486.2518
## 26 2010-05-26 110.27847 60.00312 451.0631
## 27 2010-05-27 102.07890 66.78616 454.9031
## 28 2010-05-28 102.18226 47.09690 482.9237
## 29 2010-05-29  84.68471 43.17946 460.4860
## 30 2010-05-30  74.41235 54.09822 446.9201
## 31 2010-05-31  76.45086 42.13277 456.6020