library(fpp2)
library(ggplot2)
library(dplyr)
The objective is to forecast how much cash is withdrawn from 4 different ATMs during May 2010.
First, load the data and check the summary of the loaded values.
atmdata <- readxl::read_excel("ATM624Data.xlsx")
summary(atmdata)
## DATE ATM Cash
## Min. :39934 Length:1474 Min. : 0.0
## 1st Qu.:40026 Class :character 1st Qu.: 0.5
## Median :40118 Mode :character Median : 73.0
## Mean :40118 Mean : 155.6
## 3rd Qu.:40210 3rd Qu.: 114.0
## Max. :40312 Max. :10919.8
## NA's :19
The summary shows that there are some missing values in the Cash column to deal with, and that the DATE column is stored in Excel's numeric date format.
Let's first convert the dates from the numeric format [Windows Excel uses 1899-12-30 as the origin]:
atmdata$DATE <- as.Date(atmdata$DATE, origin = "1899-12-30")
head(atmdata)
## # A tibble: 6 x 3
## DATE ATM Cash
## <date> <chr> <dbl>
## 1 2009-05-01 ATM1 96
## 2 2009-05-01 ATM2 107
## 3 2009-05-02 ATM1 82
## 4 2009-05-02 ATM2 89
## 5 2009-05-03 ATM1 85
## 6 2009-05-03 ATM2 90
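As a quick sanity check, the minimum Excel serial reported in the summary above (39934) should map to the first date in the converted data:
as.Date(39934, origin = "1899-12-30")  # 2009-05-01, matching the head() output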
Next, separate the data for each of the 4 ATMs [all observations up to May 2010]
atm1 <- atmdata[atmdata$ATM == "ATM1" & atmdata$DATE < "2010-05-01",]
atm2 <- atmdata[atmdata$ATM == "ATM2" & atmdata$DATE < "2010-05-01",]
atm3 <- atmdata[atmdata$ATM == "ATM3" & atmdata$DATE < "2010-05-01",]
atm4 <- atmdata[atmdata$ATM == "ATM4" & atmdata$DATE < "2010-05-01",]
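For reference, the same subsetting can also be written with dplyr (already loaded). An equivalent formulation, shown for ATM1 only and purely illustrative:
# Equivalent dplyr filter (illustrative)
atmdata %>%
  filter(ATM == "ATM1", DATE < as.Date("2010-05-01"))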
summary(atm1)
## DATE ATM Cash
## Min. :2009-05-01 Length:365 Min. : 1.00
## 1st Qu.:2009-07-31 Class :character 1st Qu.: 73.00
## Median :2009-10-30 Mode :character Median : 91.00
## Mean :2009-10-30 Mean : 83.89
## 3rd Qu.:2010-01-29 3rd Qu.:108.00
## Max. :2010-04-30 Max. :180.00
## NA's :3
summary(atm2)
## DATE ATM Cash
## Min. :2009-05-01 Length:365 Min. : 0.00
## 1st Qu.:2009-07-31 Class :character 1st Qu.: 25.50
## Median :2009-10-30 Mode :character Median : 67.00
## Mean :2009-10-30 Mean : 62.58
## 3rd Qu.:2010-01-29 3rd Qu.: 93.00
## Max. :2010-04-30 Max. :147.00
## NA's :2
summary(atm3)
## DATE ATM Cash
## Min. :2009-05-01 Length:365 Min. : 0.0000
## 1st Qu.:2009-07-31 Class :character 1st Qu.: 0.0000
## Median :2009-10-30 Mode :character Median : 0.0000
## Mean :2009-10-30 Mean : 0.7206
## 3rd Qu.:2010-01-29 3rd Qu.: 0.0000
## Max. :2010-04-30 Max. :96.0000
summary(atm4)
## DATE ATM Cash
## Min. :2009-05-01 Length:365 Min. : 1.563
## 1st Qu.:2009-07-31 Class :character 1st Qu.: 124.334
## Median :2009-10-30 Mode :character Median : 403.839
## Mean :2009-10-30 Mean : 474.043
## 3rd Qu.:2010-01-29 3rd Qu.: 704.507
## Max. :2010-04-30 Max. :10919.762
From the summaries we see that each ATM now has 365 daily readings starting in May 2009. However, the ATM1 and ATM2 data have some missing Cash values. Let's deal with those next.
atm1[is.na(atm1$Cash),]
## # A tibble: 3 x 3
## DATE ATM Cash
## <date> <chr> <dbl>
## 1 2009-06-13 ATM1 NA
## 2 2009-06-16 ATM1 NA
## 3 2009-06-22 ATM1 NA
atm2[is.na(atm2$Cash),]
## # A tibble: 2 x 3
## DATE ATM Cash
## <date> <chr> <dbl>
## 1 2009-06-18 ATM2 NA
## 2 2009-06-24 ATM2 NA
All of the missing values fall in June 2009, so imputing them with the mean value for that month should be an acceptable strategy.
# Monthly mean of the observed June 2009 values, rounded
mu1 <- round(mean(unlist(atm1[atm1$DATE >= "2009-06-01"
                              & atm1$DATE < "2009-07-01", "Cash"]), na.rm = TRUE))
mu2 <- round(mean(unlist(atm2[atm2$DATE >= "2009-06-01"
                              & atm2$DATE < "2009-07-01", "Cash"]), na.rm = TRUE))
# Fill the missing values with the respective monthly means
atm1[is.na(atm1$Cash), "Cash"] <- mu1
atm2[is.na(atm2$Cash), "Cash"] <- mu2
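For comparison, a time-series-aware alternative would be interpolation with forecast::na.interp (available via fpp2). A minimal sketch on a fresh copy of the ATM1 subset; it is not used in what follows:
# Alternative imputation (illustrative, not used below): interpolate the NAs
atm1_alt <- atmdata[atmdata$ATM == "ATM1" & atmdata$DATE < "2010-05-01", ]
atm1_alt$Cash <- as.numeric(na.interp(atm1_alt$Cash))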
Now it should be safe to create the time series objects, which can then be used for modelling.
NOTE: frequency = 1 was used for the daily data [treating it as non-seasonal, similar to the ibmclose dataset]. With frequency = 1 a calendar start has no meaning, so the time index is simply 1 to 365.
ts1 <- ts(atm1$Cash, frequency = 1)
ts2 <- ts(atm2$Cash, frequency = 1)
ts3 <- ts(atm3$Cash, frequency = 1)
ts4 <- ts(atm4$Cash, frequency = 1)
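For reference, had we wanted the models to capture the weekly pattern through a seasonal index instead, the series could have been built with frequency = 7; that alternative is not pursued here:
# Alternative construction (illustrative, not used here): weekly seasonality
ts1_weekly <- ts(atm1$Cash, frequency = 7)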
ggtsdisplay(ts1)
ndiffs(ts1)
## [1] 0
There appears to be no need for a transformation, as there is no evidence of increasing or decreasing variance. The unit-root test does not suggest any differencing either. The ACF and PACF show significant spikes at lag 7 (a weekly pattern), and a manual search around that lag turned up ARIMA(7,0,7) as the best model.
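A sketch of how such a manual search can be automated, comparing AICc across candidate non-seasonal orders (the loop bounds are illustrative and not necessarily the exact search that was performed):
# Illustrative AICc grid search over ARIMA(p,0,q) candidates for ATM1
# (fits up to 64 models, so it can take a while; failed fits are skipped)
best_aicc <- Inf
best_order <- c(0, 0, 0)
for (p in 0:7) {
  for (q in 0:7) {
    fit <- tryCatch(Arima(ts1, order = c(p, 0, q)),
                    error = function(e) NULL)
    if (!is.null(fit) && fit$aicc < best_aicc) {
      best_aicc <- fit$aicc
      best_order <- c(p, 0, q)
    }
  }
}
best_order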
(fit1 <- Arima(ts1, order = c(7,0,7)))
## Series: ts1
## ARIMA(7,0,7) with non-zero mean
##
## Coefficients:
## ar1 ar2 ar3 ar4 ar5 ar6 ar7 ma1
## -0.0457 -0.0316 -0.0457 -0.0509 -0.0084 -0.0280 0.9277 0.1858
## s.e. 0.0315 0.0322 0.0316 0.0318 0.0320 0.0315 0.0303 0.0587
## ma2 ma3 ma4 ma5 ma6 ma7 mean
## -0.0043 0.0993 0.1666 -0.0766 -0.0478 -0.6147 83.3332
## s.e. 0.0621 0.0585 0.0629 0.0606 0.0597 0.0552 2.8944
##
## sigma^2 estimated as 538.9: log likelihood=-1662.78
## AIC=3357.56 AICc=3359.13 BIC=3419.96
checkresiduals(fit1)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(7,0,7) with non-zero mean
## Q* = 4.6861, df = 3, p-value = 0.1963
##
## Model df: 15. Total lags used: 18
The residual plots and the Ljung-Box test (p-value = 0.1963) suggest that the residuals are indistinguishable from a white noise series. Therefore this is a good model for ATM1.
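For reference, the Ljung-Box statistic reported above can also be computed directly with stats::Box.test, using the same lag and fitted degrees of freedom as checkresiduals():
# Direct Ljung-Box test on the ATM1 residuals (lag 18, 15 model df)
Box.test(residuals(fit1), lag = 18, fitdf = 15, type = "Ljung-Box")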
ggtsdisplay(ts2)
ndiffs(ts2)
## [1] 1
The ATM2 data looks similar to ATM1. However, the unit-root test suggests one level of differencing. Considering the lag information in the ACF and PACF plots, a manual search turned up ARIMA(7,1,7) as the better model.
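Before settling on the ARMA orders, it can also help to inspect the differenced series directly (illustrative):
# ACF/PACF of the once-differenced ATM2 series
ggtsdisplay(diff(ts2))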
(fit2 <- Arima(ts2, order = c(7,1,7)))
## Series: ts2
## ARIMA(7,1,7)
##
## Coefficients:
## ar1 ar2 ar3 ar4 ar5 ar6 ar7
## -1.0996 -1.0927 -1.0882 -1.1030 -1.0798 -1.0723 -0.0954
## s.e. 0.0823 0.0816 0.0812 0.0822 0.0832 0.0768 0.0770
## ma1 ma2 ma3 ma4 ma5 ma6 ma7
## 0.1599 -0.1332 -0.0125 0.1084 -0.1307 -0.0837 -0.6920
## s.e. 0.0638 0.0446 0.0513 0.0428 0.0446 0.0567 0.0407
##
## sigma^2 estimated as 591.5: log likelihood=-1677.3
## AIC=3384.61 AICc=3385.99 BIC=3443.07
checkresiduals(fit2)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(7,1,7)
## Q* = 13.971, df = 3, p-value = 0.002945
##
## Model df: 14. Total lags used: 17
Here the Ljung-Box p-value (0.0029) indicates that some autocorrelation remains in the ATM2 residuals; the model is not perfect, but it was the best of the candidates tried.
ggtsdisplay(ts3)
The ATM3 data is odd: almost all of the values are zero, with only a few actual observations at the very end of the series. For this kind of data a more naive forecasting model works well. An average model would not work, however, because the predicted values would be zero or very close to it. A simple random walk forecasting model is probably best in this case, as can be seen in the graph below.
fit3 <- rwf(ts3, h=31)
autoplot(ts3) +
  autolayer(fit3)
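Since the series ends on a non-zero value, the random walk forecast simply carries that last observation forward. A quick illustrative check:
# Every rwf() point forecast equals the last observed value of ts3
all(fit3$mean == tail(ts3, 1))  # expected: TRUE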
ggtsdisplay(ts4)
Two features become evident from the graph of the ATM4 data. First, there is an outlier far outside the range of the rest of the data; without a domain expert to consult, however, it may be unwise to adjust or remove it. Second, the data appears to be a white noise series, so the best way to model it is ARIMA(0,0,0) around a non-zero mean (i.e. white noise).
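If one did want to examine the outlier, forecast::tsoutliers() will locate it and suggest a replacement value. A minimal sketch that leaves the data unaltered:
# Inspect the outlier without modifying the series (illustrative)
which.max(ts4)   # position of the extreme value
max(ts4)         # the outlier itself (10919.762)
tsoutliers(ts4)  # suggested replacement, had we chosen to adjust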
(fit4 <- auto.arima(ts4))
## Series: ts4
## ARIMA(0,0,0) with non-zero mean
##
## Coefficients:
## mean
## 474.0433
## s.e. 34.0248
##
## sigma^2 estimated as 423718: log likelihood=-2882.03
## AIC=5768.06 AICc=5768.1 BIC=5775.86
checkresiduals(fit4)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,0,0) with non-zero mean
## Q* = 3.4236, df = 9, p-value = 0.9451
##
## Model df: 1. Total lags used: 10
# Forecast dates: the 31 days of May 2010
sd <- as.Date("2010-05-01")
fd <- sd + 0:30
# Forecast 31 days ahead for each ATM and collect the point forecasts
f1 <- forecast(fit1, h = 31)
df1 <- data.frame(DATE = fd, ATM = rep("ATM1", 31), Cash = f1$mean)
f2 <- forecast(fit2, h = 31)
df2 <- data.frame(DATE = fd, ATM = rep("ATM2", 31), Cash = f2$mean)
f3 <- forecast(fit3, h = 31)
df3 <- data.frame(DATE = fd, ATM = rep("ATM3", 31), Cash = f3$mean)
f4 <- forecast(fit4, h = 31)
df4 <- data.frame(DATE = fd, ATM = rep("ATM4", 31), Cash = f4$mean)
# Combine all four ATMs and export the predictions
df <- bind_rows(list(df1, df2, df3, df4))
write.csv(df, file = "ATM624Predicted.csv", row.names = FALSE)
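As a final sanity check, the exported file can be read back and the predicted totals summarized per ATM (illustrative):
# Read the exported forecasts back and total the predicted cash per ATM
pred <- read.csv("ATM624Predicted.csv")
aggregate(Cash ~ ATM, data = pred, FUN = sum)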