I want to forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable ‘Cash’ is provided in hundreds of dollars, other than that it is straight forward. I am being somewhat ambiguous on purpose to make this have a little more business feeling. Explain and demonstrate your process, techniques used and not used, and your actual forecast. I am giving you data via an excel file, please provide your written report on your findings, visuals, discussion and your R code via an RPubs link along with the actual.rmd file Also please submit the forecast which you will put in an Excel readable file.
atm <- readxl::read_excel("ATM624Data.xlsx") %>% mutate(DATE = as.Date(DATE, origin='1899-12-30'))
head(atm)
## # A tibble: 6 x 3
## DATE ATM Cash
## <date> <chr> <dbl>
## 1 2009-05-01 ATM1 96
## 2 2009-05-01 ATM2 107
## 3 2009-05-02 ATM1 82
## 4 2009-05-02 ATM2 89
## 5 2009-05-03 ATM1 85
## 6 2009-05-03 ATM2 90
summary(atm)
## DATE ATM Cash
## Min. :2009-05-01 Length:1474 Min. : 0.0
## 1st Qu.:2009-08-01 Class :character 1st Qu.: 0.5
## Median :2009-11-01 Mode :character Median : 73.0
## Mean :2009-10-31 Mean : 155.6
## 3rd Qu.:2010-02-01 3rd Qu.: 114.0
## Max. :2010-05-14 Max. :10919.8
## NA's :19
str(atm)
## Classes 'tbl_df', 'tbl' and 'data.frame': 1474 obs. of 3 variables:
## $ DATE: Date, format: "2009-05-01" "2009-05-01" ...
## $ ATM : chr "ATM1" "ATM2" "ATM1" "ATM2" ...
## $ Cash: num 96 107 82 89 85 90 90 55 99 79 ...
There looks like 19 observations that do not have a value for cash. We can remove those values as it does not help our study
atm <- atm[complete.cases(atm),]
ggplot(atm, aes(ATM)) +
geom_bar()
ggplot(atm, aes(x=ATM, y=Cash)) +
geom_boxplot()
ATM4 has the largest values out of the 4 ATM’s but there is 1 observation that looks like it is an outlier.
head(atm[order(-atm$Cash),],5)
## # A tibble: 5 x 3
## DATE ATM Cash
## <date> <chr> <dbl>
## 1 2010-02-09 ATM4 10920.
## 2 2009-09-22 ATM4 1712.
## 3 2010-01-29 ATM4 1575.
## 4 2009-09-04 ATM4 1495.
## 5 2009-06-09 ATM4 1484.
When we look at the highest transactions, the transaction is over 6 times greater than the next. ATM4 may be located in a place where there are lot cash transactions such as a casino however, this seems like a one time transaction so we’ll remove from out data set.
atm %>% group_by(ATM) %>% summarise(avg = mean(Cash,na.rm = TRUE))
## # A tibble: 4 x 2
## ATM avg
## <chr> <dbl>
## 1 ATM1 83.9
## 2 ATM2 62.6
## 3 ATM3 0.721
## 4 ATM4 474.
When we look at the average cash value of transactions, ATM3 has significantly less amount than the other 3 locations. Either, the ATM is located in an unpopular location or there is an error in the data.
atm3 <- atm %>% group_by(ATM) %>% filter(ATM == 'ATM3')
head(atm3[order(-atm3$Cash),],10)
## # A tibble: 10 x 3
## # Groups: ATM [1]
## DATE ATM Cash
## <date> <chr> <dbl>
## 1 2010-04-28 ATM3 96
## 2 2010-04-30 ATM3 85
## 3 2010-04-29 ATM3 82
## 4 2009-05-01 ATM3 0
## 5 2009-05-02 ATM3 0
## 6 2009-05-03 ATM3 0
## 7 2009-05-04 ATM3 0
## 8 2009-05-05 ATM3 0
## 9 2009-05-06 ATM3 0
## 10 2009-05-07 ATM3 0
It looks like ATM3 registered a total of 3 transactions. Since there isn’t a large enough sample size for ATM3, we can’t make a forecast model for that ATM.
To clean the data, we’ll remove all observations greater than 3000
atm <- atm[atm$Cash<3000,]
atm %>%
ggplot(aes(DATE, Cash, color=ATM)) +
geom_point() + geom_smooth() +
facet_wrap(~ATM, scales='free') +
theme_bw() + theme(legend.position = 'none') + labs(x="", y="")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
##seperate ATM data
atm1 <- atm %>% group_by(ATM) %>% filter(ATM == 'ATM1')
atm2 <- atm %>% group_by(ATM) %>% filter(ATM == 'ATM2')
atm4 <- atm %>% group_by(ATM) %>% filter(ATM == 'ATM4')
#convert dataset to time series
atm1<- ts(atm1[,"Cash"],frequency = 7)
atm2<- ts(atm2[,"Cash"],frequency = 7)
atm3<- ts(atm3[,"Cash"],frequency = 7)
atm4<- ts(atm4[,"Cash"],frequency = 7)
autoplot(atm1,main="ATM1",xlab="Weeks",ylab="Cash in Hundreds")
ndiffs(atm1)
## [1] 0
nsdiffs(atm1)
## [1] 1
It appears that there is seasonality trend and require a seasonal difference in order to be stationary.
atm12<- diff(atm1,7)
ggtsdisplay(atm12,main="ATM1",xlab="Weeks",ylab="Cash in Hundreds")
After taking a seasonal difference, the data appears stationary. ACF and PACF both show a spike at lag 1. In the PACF chart, We also see spikes at 7, 14, and 21 that decrease, indicating that that the PACF is geometric. Based on the analysis, MA(1) model may be appropriate.
Box.test(atm12, type = "Ljung-Box")
##
## Box-Ljung test
##
## data: atm12
## X-squared = 5.4172, df = 1, p-value = 0.01994
kpss.test(atm12)
## Warning in kpss.test(atm12): p-value greater than printed p-value
##
## KPSS Test for Level Stationarity
##
## data: atm12
## KPSS Level = 0.021606, Truncation lag parameter = 4, p-value = 0.1
The Box-Ljung test indicates that we can reject the null hypothesis that the data is white noise and we can assume that the values are showing dependence of each other. The KPSS test indicates that the we can not reject the null hypothesis of stationarity around a trend and we can assume that the values are stationary.
We will move forward with the MA(1) model and check the residuals.
#fit the
atm1_fit <- Arima(atm12, order = c(0,0,1))
checkresiduals(atm1_fit)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,0,1) with non-zero mean
## Q* = 71.986, df = 12, p-value = 1.359e-10
##
## Model df: 2. Total lags used: 14
summary(atm1_fit)
## Series: atm12
## ARIMA(0,0,1) with non-zero mean
##
## Coefficients:
## ma1 mean
## 0.1380 -0.0351
## s.e. 0.0557 1.7837
##
## sigma^2 estimated as 877.7: log likelihood=-1705.7
## AIC=3417.4 AICc=3417.47 BIC=3429.02
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -0.002151755 29.54249 18.57711 NaN Inf 0.5538281
## ACF1
## Training set -0.006524952
The residuals appear to be normally distributed with a mean around zero. The ACF does not show an auto-correlation but there is a spike at 7. The model appears acceptable for forecasting.
Below we see a 5 week forecast based on the ARIMA(0,0,1)
autoplot(forecast(atm1_fit, h=5))
For ATM2, we see a similar seasonal pattern we saw in ATM1. The ndiffs function suggests that we need a difference of 1 and a seasonal difference. The data also shows a strong volatile variance that requires a transformation.
autoplot(atm2,main="ATM2",xlab="Weeks",ylab="Cash in Hundreds")
ndiffs(atm2)
## [1] 1
nsdiffs(atm2)
## [1] 1
atm2.lambda <- BoxCox.lambda(atm2)
atm22<- diff(diff(BoxCox(atm2,atm2.lambda),7))
ggtsdisplay(atm22,main="ATM2",xlab="Weeks",ylab="Cash in Hundreds")
After making a box-cox transformation and taking 1 difference and a seasonal difference, the data appears stationary. The ACF chart shows significant spikes early in the critical values. PACF chart shows a geometric decrease
Box.test(atm22, type = "Ljung-Box")
##
## Box-Ljung test
##
## data: atm22
## X-squared = 93.322, df = 1, p-value < 2.2e-16
kpss.test(atm22)
## Warning in kpss.test(atm22): p-value greater than printed p-value
##
## KPSS Test for Level Stationarity
##
## data: atm22
## KPSS Level = 0.0094355, Truncation lag parameter = 4, p-value =
## 0.1
The Box-Ljung test indicates that we can reject the null hypothesis that the data is white noise and we can assume that the values are showing dependence of each other. The KPSS test indicates that the we can not reject the null hypothesis of stationarity around a trend and we can assume that the values are stationary.
Since it looks like the data requires a seasonal and a non-seasonal fitting, I decided to use auto arima to find the best fit.
atm2_fit <- auto.arima(atm22, stepwise =FALSE)
checkresiduals(atm2_fit)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(1,0,3)(1,0,0)[7] with zero mean
## Q* = 29.375, df = 9, p-value = 0.0005601
##
## Model df: 5. Total lags used: 14
Auto arima function suggests a non-seasonal AR(1) & MA(3) and a seasonal AR(1) model. The residuals appear to be normally distributed with a mean around zero. The ACF does not show an auto-correlation but there are a number of spikes. The model appears acceptable for forecasting.
Below we see a 5 week forecast based on the ARIMA(1,0,3)(1,0,0)
autoplot(forecast(atm2_fit, h=5))
autoplot(atm4,main="ATM4",xlab="Weeks",ylab="Cash in Hundreds")
ATM4 has much larger sums of money being transacted but we still see a seasonality in the plot.
ndiffs(atm4)
## [1] 0
nsdiffs(atm4)
## [1] 0
The ndiffs function suggests that the data is stationary and does not need any differencing.
ggtsdisplay(atm4,main="ATM4",xlab="Weeks",ylab="Cash in Hundreds")
Both the ACF and PACF plots show values that spike greater than the critical values and requires a seasonal & non-seasonal fitting.
Box.test(atm2, type = "Ljung-Box")
##
## Box-Ljung test
##
## data: atm2
## X-squared = 0.056235, df = 1, p-value = 0.8125
kpss.test(atm2)
## Warning in kpss.test(atm2): p-value smaller than printed p-value
##
## KPSS Test for Level Stationarity
##
## data: atm2
## KPSS Level = 1.8143, Truncation lag parameter = 4, p-value = 0.01
The Box-Ljung test indicates that we can reject the null hypothesis that the data is white noise and we can assume that the values are showing dependence of each other. The KPSS test indicates that the we can not reject the null hypothesis of stationarity around a trend and we can assume that the values are stationary.
atm4_fit <- auto.arima(atm4, stepwise =FALSE)
checkresiduals(atm4_fit)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(1,0,0)(2,0,0)[7] with non-zero mean
## Q* = 15.693, df = 10, p-value = 0.1088
##
## Model df: 4. Total lags used: 14
Auto arima function suggests a non-seasonal AR(1) and a seasonal AR(2) model. The residuals appear to be slight skew to the right but The ACF does not show an auto-correlation and there are no significant spikes greater than the critical values. The model appears acceptable for forecasting.
Below we see a 5 week forecast based on the ARIMA(1,0,0)(2,0,0)
autoplot(forecast(atm4_fit, h=5))
gridExtra::grid.arrange(
autoplot(forecast(atm1_fit, h=5)) +
labs(title = "ATM1: ARIMA(0,0,1)", x = "Week", y = NULL) +
theme(legend.position = "none"),
autoplot(forecast(atm2_fit, h=5)) +
labs(title = "ATM2: ARIMA(1,0,3)(1,0,0)", x = "Week") +
theme(legend.position = "none"),
autoplot(forecast(atm4_fit, h=5)) +
labs(title = "ATM4: ARIMA(1,0,0)(2,0,0)", x = "Week") +
theme(legend.position = "none"),
top = grid::textGrob("ATM withdrawals (in hundreds of dollars)")
)
Using ARIMA Models, we were able to forecast the ATM withdraws for 3 ATM’s in the data set. We weren’t able to create a forecast for ATM3 as it did not have a large enough sample size to make a forecast. ATM2 & ATM4 required ARMA model as both had significant values in the ACF/PACF charts. ATM1, a moving average ARIMA Model was sufficient for forecasting.