ATM

I want to forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable ‘Cash’ is provided in hundreds of dollars, other than that it is straight forward. I am being somewhat ambiguous on purpose to make this have a little more business feeling. Explain and demonstrate your process, techniques used and not used, and your actual forecast. I am giving you data via an excel file, please provide your written report on your findings, visuals, discussion and your R code via an RPubs link along with the actual.rmd file Also please submit the forecast which you will put in an Excel readable file.

atm <- readxl::read_excel("ATM624Data.xlsx") %>% mutate(DATE = as.Date(DATE, origin='1899-12-30'))
head(atm)
## # A tibble: 6 x 3
##   DATE       ATM    Cash
##   <date>     <chr> <dbl>
## 1 2009-05-01 ATM1     96
## 2 2009-05-01 ATM2    107
## 3 2009-05-02 ATM1     82
## 4 2009-05-02 ATM2     89
## 5 2009-05-03 ATM1     85
## 6 2009-05-03 ATM2     90

Understanding the data

summary(atm)
##       DATE                ATM                 Cash        
##  Min.   :2009-05-01   Length:1474        Min.   :    0.0  
##  1st Qu.:2009-08-01   Class :character   1st Qu.:    0.5  
##  Median :2009-11-01   Mode  :character   Median :   73.0  
##  Mean   :2009-10-31                      Mean   :  155.6  
##  3rd Qu.:2010-02-01                      3rd Qu.:  114.0  
##  Max.   :2010-05-14                      Max.   :10919.8  
##                                          NA's   :19
str(atm)
## Classes 'tbl_df', 'tbl' and 'data.frame':    1474 obs. of  3 variables:
##  $ DATE: Date, format: "2009-05-01" "2009-05-01" ...
##  $ ATM : chr  "ATM1" "ATM2" "ATM1" "ATM2" ...
##  $ Cash: num  96 107 82 89 85 90 90 55 99 79 ...

There looks like 19 observations that do not have a value for cash. We can remove those values as it does not help our study

atm <- atm[complete.cases(atm),]
ggplot(atm, aes(ATM)) +
geom_bar()

ggplot(atm, aes(x=ATM, y=Cash)) +
geom_boxplot()

ATM4 has the largest values out of the 4 ATM’s but there is 1 observation that looks like it is an outlier.

head(atm[order(-atm$Cash),],5)
## # A tibble: 5 x 3
##   DATE       ATM     Cash
##   <date>     <chr>  <dbl>
## 1 2010-02-09 ATM4  10920.
## 2 2009-09-22 ATM4   1712.
## 3 2010-01-29 ATM4   1575.
## 4 2009-09-04 ATM4   1495.
## 5 2009-06-09 ATM4   1484.

When we look at the highest transactions, the transaction is over 6 times greater than the next. ATM4 may be located in a place where there are lot cash transactions such as a casino however, this seems like a one time transaction so we’ll remove from out data set.

atm %>% group_by(ATM) %>% summarise(avg = mean(Cash,na.rm = TRUE))
## # A tibble: 4 x 2
##   ATM       avg
##   <chr>   <dbl>
## 1 ATM1   83.9  
## 2 ATM2   62.6  
## 3 ATM3    0.721
## 4 ATM4  474.

When we look at the average cash value of transactions, ATM3 has significantly less amount than the other 3 locations. Either, the ATM is located in an unpopular location or there is an error in the data.

atm3 <- atm %>% group_by(ATM) %>% filter(ATM == 'ATM3')
head(atm3[order(-atm3$Cash),],10)
## # A tibble: 10 x 3
## # Groups:   ATM [1]
##    DATE       ATM    Cash
##    <date>     <chr> <dbl>
##  1 2010-04-28 ATM3     96
##  2 2010-04-30 ATM3     85
##  3 2010-04-29 ATM3     82
##  4 2009-05-01 ATM3      0
##  5 2009-05-02 ATM3      0
##  6 2009-05-03 ATM3      0
##  7 2009-05-04 ATM3      0
##  8 2009-05-05 ATM3      0
##  9 2009-05-06 ATM3      0
## 10 2009-05-07 ATM3      0

It looks like ATM3 registered a total of 3 transactions. Since there isn’t a large enough sample size for ATM3, we can’t make a forecast model for that ATM.

Cleaning the Data

To clean the data, we’ll remove all observations greater than 3000

atm <- atm[atm$Cash<3000,]
atm %>%
  ggplot(aes(DATE, Cash, color=ATM)) +
  geom_point() + geom_smooth() +
  facet_wrap(~ATM, scales='free') +
  theme_bw() + theme(legend.position = 'none') + labs(x="", y="")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

##seperate ATM data
atm1 <- atm %>% group_by(ATM) %>% filter(ATM == 'ATM1')
atm2 <- atm %>% group_by(ATM) %>% filter(ATM == 'ATM2')
atm4 <- atm %>% group_by(ATM) %>% filter(ATM == 'ATM4') 
#convert dataset to time series
atm1<- ts(atm1[,"Cash"],frequency = 7)
atm2<- ts(atm2[,"Cash"],frequency = 7)
atm3<- ts(atm3[,"Cash"],frequency = 7)
atm4<- ts(atm4[,"Cash"],frequency = 7)

Fitting a Forecast Model

ATM1

autoplot(atm1,main="ATM1",xlab="Weeks",ylab="Cash in Hundreds")

ndiffs(atm1)
## [1] 0
nsdiffs(atm1)
## [1] 1

It appears that there is seasonality trend and require a seasonal difference in order to be stationary.

atm12<- diff(atm1,7)
ggtsdisplay(atm12,main="ATM1",xlab="Weeks",ylab="Cash in Hundreds")

After taking a seasonal difference, the data appears stationary. ACF and PACF both show a spike at lag 1. In the PACF chart, We also see spikes at 7, 14, and 21 that decrease, indicating that that the PACF is geometric. Based on the analysis, MA(1) model may be appropriate.

Box.test(atm12, type = "Ljung-Box")
## 
##  Box-Ljung test
## 
## data:  atm12
## X-squared = 5.4172, df = 1, p-value = 0.01994
kpss.test(atm12)
## Warning in kpss.test(atm12): p-value greater than printed p-value
## 
##  KPSS Test for Level Stationarity
## 
## data:  atm12
## KPSS Level = 0.021606, Truncation lag parameter = 4, p-value = 0.1

The Box-Ljung test indicates that we can reject the null hypothesis that the data is white noise and we can assume that the values are showing dependence of each other. The KPSS test indicates that the we can not reject the null hypothesis of stationarity around a trend and we can assume that the values are stationary.

We will move forward with the MA(1) model and check the residuals.

#fit the 
atm1_fit <- Arima(atm12, order = c(0,0,1))
checkresiduals(atm1_fit)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,0,1) with non-zero mean
## Q* = 71.986, df = 12, p-value = 1.359e-10
## 
## Model df: 2.   Total lags used: 14
summary(atm1_fit)
## Series: atm12 
## ARIMA(0,0,1) with non-zero mean 
## 
## Coefficients:
##          ma1     mean
##       0.1380  -0.0351
## s.e.  0.0557   1.7837
## 
## sigma^2 estimated as 877.7:  log likelihood=-1705.7
## AIC=3417.4   AICc=3417.47   BIC=3429.02
## 
## Training set error measures:
##                        ME     RMSE      MAE MPE MAPE      MASE
## Training set -0.002151755 29.54249 18.57711 NaN  Inf 0.5538281
##                      ACF1
## Training set -0.006524952

The residuals appear to be normally distributed with a mean around zero. The ACF does not show an auto-correlation but there is a spike at 7. The model appears acceptable for forecasting.

Below we see a 5 week forecast based on the ARIMA(0,0,1)

autoplot(forecast(atm1_fit, h=5))

ATM2

For ATM2, we see a similar seasonal pattern we saw in ATM1. The ndiffs function suggests that we need a difference of 1 and a seasonal difference. The data also shows a strong volatile variance that requires a transformation.

autoplot(atm2,main="ATM2",xlab="Weeks",ylab="Cash in Hundreds")

ndiffs(atm2)
## [1] 1
nsdiffs(atm2)
## [1] 1
atm2.lambda <- BoxCox.lambda(atm2)
atm22<- diff(diff(BoxCox(atm2,atm2.lambda),7))
ggtsdisplay(atm22,main="ATM2",xlab="Weeks",ylab="Cash in Hundreds")

After making a box-cox transformation and taking 1 difference and a seasonal difference, the data appears stationary. The ACF chart shows significant spikes early in the critical values. PACF chart shows a geometric decrease

Box.test(atm22, type = "Ljung-Box")
## 
##  Box-Ljung test
## 
## data:  atm22
## X-squared = 93.322, df = 1, p-value < 2.2e-16
kpss.test(atm22)
## Warning in kpss.test(atm22): p-value greater than printed p-value
## 
##  KPSS Test for Level Stationarity
## 
## data:  atm22
## KPSS Level = 0.0094355, Truncation lag parameter = 4, p-value =
## 0.1

The Box-Ljung test indicates that we can reject the null hypothesis that the data is white noise and we can assume that the values are showing dependence of each other. The KPSS test indicates that the we can not reject the null hypothesis of stationarity around a trend and we can assume that the values are stationary.

Since it looks like the data requires a seasonal and a non-seasonal fitting, I decided to use auto arima to find the best fit.

atm2_fit <- auto.arima(atm22, stepwise =FALSE)
checkresiduals(atm2_fit)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(1,0,3)(1,0,0)[7] with zero mean
## Q* = 29.375, df = 9, p-value = 0.0005601
## 
## Model df: 5.   Total lags used: 14

Auto arima function suggests a non-seasonal AR(1) & MA(3) and a seasonal AR(1) model. The residuals appear to be normally distributed with a mean around zero. The ACF does not show an auto-correlation but there are a number of spikes. The model appears acceptable for forecasting.

Below we see a 5 week forecast based on the ARIMA(1,0,3)(1,0,0)

autoplot(forecast(atm2_fit, h=5))

ATM4

autoplot(atm4,main="ATM4",xlab="Weeks",ylab="Cash in Hundreds")

ATM4 has much larger sums of money being transacted but we still see a seasonality in the plot.

ndiffs(atm4)
## [1] 0
nsdiffs(atm4)
## [1] 0

The ndiffs function suggests that the data is stationary and does not need any differencing.

ggtsdisplay(atm4,main="ATM4",xlab="Weeks",ylab="Cash in Hundreds")

Both the ACF and PACF plots show values that spike greater than the critical values and requires a seasonal & non-seasonal fitting.

Box.test(atm2, type = "Ljung-Box")
## 
##  Box-Ljung test
## 
## data:  atm2
## X-squared = 0.056235, df = 1, p-value = 0.8125
kpss.test(atm2)
## Warning in kpss.test(atm2): p-value smaller than printed p-value
## 
##  KPSS Test for Level Stationarity
## 
## data:  atm2
## KPSS Level = 1.8143, Truncation lag parameter = 4, p-value = 0.01

The Box-Ljung test indicates that we can reject the null hypothesis that the data is white noise and we can assume that the values are showing dependence of each other. The KPSS test indicates that the we can not reject the null hypothesis of stationarity around a trend and we can assume that the values are stationary.

atm4_fit <- auto.arima(atm4, stepwise =FALSE)
checkresiduals(atm4_fit)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(1,0,0)(2,0,0)[7] with non-zero mean
## Q* = 15.693, df = 10, p-value = 0.1088
## 
## Model df: 4.   Total lags used: 14

Auto arima function suggests a non-seasonal AR(1) and a seasonal AR(2) model. The residuals appear to be slight skew to the right but The ACF does not show an auto-correlation and there are no significant spikes greater than the critical values. The model appears acceptable for forecasting.

Below we see a 5 week forecast based on the ARIMA(1,0,0)(2,0,0)

autoplot(forecast(atm4_fit, h=5))

Conclustion

gridExtra::grid.arrange(
  autoplot(forecast(atm1_fit, h=5)) + 
    labs(title = "ATM1: ARIMA(0,0,1)", x = "Week", y = NULL) +
    theme(legend.position = "none"),
  autoplot(forecast(atm2_fit, h=5)) + 
    labs(title = "ATM2: ARIMA(1,0,3)(1,0,0)", x = "Week") +
    theme(legend.position = "none"),
  autoplot(forecast(atm4_fit, h=5)) + 
    labs(title = "ATM4: ARIMA(1,0,0)(2,0,0)", x = "Week") +
    theme(legend.position = "none"),
  top = grid::textGrob("ATM withdrawals (in hundreds of dollars)")
)

Using ARIMA Models, we were able to forecast the ATM withdraws for 3 ATM’s in the data set. We weren’t able to create a forecast for ATM3 as it did not have a large enough sample size to make a forecast. ATM2 & ATM4 required ARMA model as both had significant values in the ACF/PACF charts. ATM1, a moving average ARIMA Model was sufficient for forecasting.