library(fpp2)
library(ggplot2)
library(dplyr)
The objective is to forecast how much cash is withdrawn from 4 different ATMs during May 2010.
First, load the data and check the summary of the loaded values.
atmdata <- readxl::read_excel("ATM624Data.xlsx")
summary(atmdata)
## DATE ATM Cash
## Min. :39934 Length:1474 Min. : 0.0
## 1st Qu.:40026 Class :character 1st Qu.: 0.5
## Median :40118 Mode :character Median : 73.0
## Mean :40118 Mean : 155.6
## 3rd Qu.:40210 3rd Qu.: 114.0
## Max. :40312 Max. :10919.8
## NA's :19
The summary shows that there are some missing values in the Cash column to deal with, and that the DATE column is stored in Excel's numeric date format.
Let's first convert the dates from the numeric format [Windows Excel uses 1899-12-30 as the origin]:
atmdata$DATE <- as.Date(atmdata$DATE, origin = "1899-12-30")
head(atmdata)
## # A tibble: 6 x 3
## DATE ATM Cash
## <date> <chr> <dbl>
## 1 2009-05-01 ATM1 96
## 2 2009-05-01 ATM2 107
## 3 2009-05-02 ATM1 82
## 4 2009-05-02 ATM2 89
## 5 2009-05-03 ATM1 85
## 6 2009-05-03 ATM2 90
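As a quick sanity check, the minimum Excel serial reported in the summary above (39934) should map to the first date in the converted data:
as.Date(39934, origin = "1899-12-30")  # 2009-05-01, matching the head() output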
Next, separate the data for each of the 4 ATMs [all observations up to May 2010]
atm1 <- atmdata[atmdata$ATM == "ATM1" & atmdata$DATE < "2010-05-01",]
atm2 <- atmdata[atmdata$ATM == "ATM2" & atmdata$DATE < "2010-05-01",]
atm3 <- atmdata[atmdata$ATM == "ATM3" & atmdata$DATE < "2010-05-01",]
atm4 <- atmdata[atmdata$ATM == "ATM4" & atmdata$DATE < "2010-05-01",]
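For reference, the same subsetting can also be written with dplyr (already loaded). An equivalent formulation, shown for ATM1 only and purely illustrative:
# Equivalent dplyr filter (illustrative)
atmdata %>%
  filter(ATM == "ATM1", DATE < as.Date("2010-05-01"))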
summary(atm1)
## DATE ATM Cash
## Min. :2009-05-01 Length:365 Min. : 1.00
## 1st Qu.:2009-07-31 Class :character 1st Qu.: 73.00
## Median :2009-10-30 Mode :character Median : 91.00
## Mean :2009-10-30 Mean : 83.89
## 3rd Qu.:2010-01-29 3rd Qu.:108.00
## Max. :2010-04-30 Max. :180.00
## NA's :3
summary(atm2)
## DATE ATM Cash
## Min. :2009-05-01 Length:365 Min. : 0.00
## 1st Qu.:2009-07-31 Class :character 1st Qu.: 25.50
## Median :2009-10-30 Mode :character Median : 67.00
## Mean :2009-10-30 Mean : 62.58
## 3rd Qu.:2010-01-29 3rd Qu.: 93.00
## Max. :2010-04-30 Max. :147.00
## NA's :2
summary(atm3)
## DATE ATM Cash
## Min. :2009-05-01 Length:365 Min. : 0.0000
## 1st Qu.:2009-07-31 Class :character 1st Qu.: 0.0000
## Median :2009-10-30 Mode :character Median : 0.0000
## Mean :2009-10-30 Mean : 0.7206
## 3rd Qu.:2010-01-29 3rd Qu.: 0.0000
## Max. :2010-04-30 Max. :96.0000
summary(atm4)
## DATE ATM Cash
## Min. :2009-05-01 Length:365 Min. : 1.563
## 1st Qu.:2009-07-31 Class :character 1st Qu.: 124.334
## Median :2009-10-30 Mode :character Median : 403.839
## Mean :2009-10-30 Mean : 474.043
## 3rd Qu.:2010-01-29 3rd Qu.: 704.507
## Max. :2010-04-30 Max. :10919.762
From the summaries we see that each ATM now has 365 daily readings starting in May 2009. However, the ATM1 and ATM2 data have some missing Cash values. Let's deal with those next.
atm1[is.na(atm1$Cash),]
## # A tibble: 3 x 3
## DATE ATM Cash
## <date> <chr> <dbl>
## 1 2009-06-13 ATM1 NA
## 2 2009-06-16 ATM1 NA
## 3 2009-06-22 ATM1 NA
atm2[is.na(atm2$Cash),]
## # A tibble: 2 x 3
## DATE ATM Cash
## <date> <chr> <dbl>
## 1 2009-06-18 ATM2 NA
## 2 2009-06-24 ATM2 NA
All of the missing values fall in June 2009, so imputing them with the mean value for that month should be an acceptable strategy.
# Monthly mean of the observed June 2009 values, rounded
mu1 <- round(mean(unlist(atm1[atm1$DATE >= "2009-06-01"
                              & atm1$DATE < "2009-07-01", "Cash"]), na.rm = TRUE))
mu2 <- round(mean(unlist(atm2[atm2$DATE >= "2009-06-01"
                              & atm2$DATE < "2009-07-01", "Cash"]), na.rm = TRUE))
# Fill the missing values with the respective monthly means
atm1[is.na(atm1$Cash), "Cash"] <- mu1
atm2[is.na(atm2$Cash), "Cash"] <- mu2
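For comparison, a time-series-aware alternative would be interpolation with forecast::na.interp (available via fpp2). A minimal sketch on a fresh copy of the ATM1 subset; it is not used in what follows:
# Alternative imputation (illustrative, not used below): interpolate the NAs
atm1_alt <- atmdata[atmdata$ATM == "ATM1" & atmdata$DATE < "2010-05-01", ]
atm1_alt$Cash <- as.numeric(na.interp(atm1_alt$Cash))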
Now it should be safe to create the time series objects, which can then be used for modelling.
NOTE: frequency = 1 was used for the daily data [treating it as non-seasonal, similar to the ibmclose dataset]. With frequency = 1 a calendar start has no meaning, so the time index is simply 1 to 365.
ts1 <- ts(atm1$Cash, frequency = 1)
ts2 <- ts(atm2$Cash, frequency = 1)
ts3 <- ts(atm3$Cash, frequency = 1)
ts4 <- ts(atm4$Cash, frequency = 1)
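For reference, had we wanted the models to capture the weekly pattern through a seasonal index instead, the series could have been built with frequency = 7; that alternative is not pursued here:
# Alternative construction (illustrative, not used here): weekly seasonality
ts1_weekly <- ts(atm1$Cash, frequency = 7)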
ggtsdisplay(ts1)
ndiffs(ts1)
## [1] 0
There appears to be no need for a transformation, as there is no evidence of increasing or decreasing variance. The unit-root test does not suggest any differencing either. The ACF and PACF show significant spikes at lag 7 (a weekly pattern), and a manual search around that lag turned up ARIMA(7,0,7) as the best model.
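A sketch of how such a manual search can be automated, comparing AICc across candidate non-seasonal orders (the loop bounds are illustrative and not necessarily the exact search that was performed):
# Illustrative AICc grid search over ARIMA(p,0,q) candidates for ATM1
# (fits up to 64 models, so it can take a while; failed fits are skipped)
best_aicc <- Inf
best_order <- c(0, 0, 0)
for (p in 0:7) {
  for (q in 0:7) {
    fit <- tryCatch(Arima(ts1, order = c(p, 0, q)),
                    error = function(e) NULL)
    if (!is.null(fit) && fit$aicc < best_aicc) {
      best_aicc <- fit$aicc
      best_order <- c(p, 0, q)
    }
  }
}
best_order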
(fit1 <- Arima(ts1, order = c(7,0,7)))
## Series: ts1
## ARIMA(7,0,7) with non-zero mean
##
## Coefficients:
## ar1 ar2 ar3 ar4 ar5 ar6 ar7 ma1
## -0.0457 -0.0316 -0.0457 -0.0509 -0.0084 -0.0280 0.9277 0.1858
## s.e. 0.0315 0.0322 0.0316 0.0318 0.0320 0.0315 0.0303 0.0587
## ma2 ma3 ma4 ma5 ma6 ma7 mean
## -0.0043 0.0993 0.1666 -0.0766 -0.0478 -0.6147 83.3332
## s.e. 0.0621 0.0585 0.0629 0.0606 0.0597 0.0552 2.8944
##
## sigma^2 estimated as 538.9: log likelihood=-1662.78
## AIC=3357.56 AICc=3359.13 BIC=3419.96
checkresiduals(fit1)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(7,0,7) with non-zero mean
## Q* = 4.6861, df = 3, p-value = 0.1963
##
## Model df: 15. Total lags used: 18
The residual plots and the Ljung-Box test (p-value = 0.1963) suggest that the residuals are indistinguishable from a white noise series. Therefore this is a good model for ATM1.
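For reference, the Ljung-Box statistic reported above can also be computed directly with stats::Box.test, using the same lag and fitted degrees of freedom as checkresiduals():
# Direct Ljung-Box test on the ATM1 residuals (lag 18, 15 model df)
Box.test(residuals(fit1), lag = 18, fitdf = 15, type = "Ljung-Box")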
ggtsdisplay(ts2)
ndiffs(ts2)
## [1] 1
The ATM2 data looks similar to ATM1. However, the unit-root test suggests one level of differencing. Considering the lag information in the ACF and PACF plots, a manual search turned up ARIMA(7,1,7) as the better model.
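Before settling on the ARMA orders, it can also help to inspect the differenced series directly (illustrative):
# ACF/PACF of the once-differenced ATM2 series
ggtsdisplay(diff(ts2))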
(fit2 <- Arima(ts2, order = c(7,1,7)))
## Series: ts2
## ARIMA(7,1,7)
##
## Coefficients:
## ar1 ar2 ar3 ar4 ar5 ar6 ar7
## -1.0996 -1.0927 -1.0882 -1.1030 -1.0798 -1.0723 -0.0954
## s.e. 0.0823 0.0816 0.0812 0.0822 0.0832 0.0768 0.0770
## ma1 ma2 ma3 ma4 ma5 ma6 ma7
## 0.1599 -0.1332 -0.0125 0.1084 -0.1307 -0.0837 -0.6920
## s.e. 0.0638 0.0446 0.0513 0.0428 0.0446 0.0567 0.0407
##
## sigma^2 estimated as 591.5: log likelihood=-1677.3
## AIC=3384.61 AICc=3385.99 BIC=3443.07
checkresiduals(fit2)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(7,1,7)
## Q* = 13.971, df = 3, p-value = 0.002945
##
## Model df: 14. Total lags used: 17
Here the Ljung-Box p-value (0.0029) indicates that some autocorrelation remains in the ATM2 residuals; the model is not perfect, but it was the best of the candidates tried.
ggtsdisplay(ts3)
The ATM3 data is odd: almost all of the values are zero, with only a few actual observations at the very end of the series. For this kind of data a more naive forecasting model works well. An average model would not work, however, because the predicted values would be zero or very close to it. A simple random walk forecasting model is probably best in this case, as can be seen in the graph below.
fit3 <- rwf(ts3, h=31)
autoplot(ts3) +
  autolayer(fit3)
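Since the series ends on a non-zero value, the random walk forecast simply carries that last observation forward. A quick illustrative check:
# Every rwf() point forecast equals the last observed value of ts3
all(fit3$mean == tail(ts3, 1))  # expected: TRUE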
ggtsdisplay(ts4)
Two features become evident from the graph of the ATM4 data. First, there is an outlier far outside the range of the rest of the data; without a domain expert to consult, however, it may be unwise to adjust or remove it. Second, the data appears to be a white noise series, so the best way to model it is ARIMA(0,0,0) around a non-zero mean (i.e. white noise).
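If one did want to examine the outlier, forecast::tsoutliers() will locate it and suggest a replacement value. A minimal sketch that leaves the data unaltered:
# Inspect the outlier without modifying the series (illustrative)
which.max(ts4)   # position of the extreme value
max(ts4)         # the outlier itself (10919.762)
tsoutliers(ts4)  # suggested replacement, had we chosen to adjust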
(fit4 <- auto.arima(ts4))
## Series: ts4
## ARIMA(0,0,0) with non-zero mean
##
## Coefficients:
## mean
## 474.0433
## s.e. 34.0248
##
## sigma^2 estimated as 423718: log likelihood=-2882.03
## AIC=5768.06 AICc=5768.1 BIC=5775.86
checkresiduals(fit4)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,0,0) with non-zero mean
## Q* = 3.4236, df = 9, p-value = 0.9451
##
## Model df: 1. Total lags used: 10
# Forecast dates: the 31 days of May 2010
sd <- as.Date("2010-05-01")
fd <- sd + 0:30
# Forecast 31 days ahead for each ATM and collect the point forecasts
f1 <- forecast(fit1, h = 31)
df1 <- data.frame(DATE = fd, ATM = rep("ATM1", 31), Cash = f1$mean)
f2 <- forecast(fit2, h = 31)
df2 <- data.frame(DATE = fd, ATM = rep("ATM2", 31), Cash = f2$mean)
f3 <- forecast(fit3, h = 31)
df3 <- data.frame(DATE = fd, ATM = rep("ATM3", 31), Cash = f3$mean)
f4 <- forecast(fit4, h = 31)
df4 <- data.frame(DATE = fd, ATM = rep("ATM4", 31), Cash = f4$mean)
# Combine all four ATMs and export the predictions
df <- bind_rows(list(df1, df2, df3, df4))
write.csv(df, file = "ATM624Predicted.csv", row.names = FALSE)
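As a final sanity check, the exported file can be read back and the predicted totals summarized per ATM (illustrative):
# Read the exported forecasts back and total the predicted cash per ATM
pred <- read.csv("ATM624Predicted.csv")
aggregate(Cash ~ ATM, data = pred, FUN = sum)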