We load the data and then use the `skim()` and `vis_dat()` functions to assess the quality of the data; the resulting summaries appear below, along with some observations.
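A minimal sketch of this loading and quality-check step follows; the file name `ATM624Data.xlsx` and the date conversion are assumptions, not confirmed by the output.

```r
# Hedged sketch of the load + quality-check step described above.
library(readxl)   # read the Excel workbook (file name is an assumption)
library(dplyr)
library(skimr)
library(visdat)

atm <- read_excel("ATM624Data.xlsx") %>%
  mutate(DATE = as.Date(DATE))                  # DATE may arrive as POSIXct from Excel

summary(atm)                                    # base-R overview (printed below)
atm_skim <- atm %>% group_by(ATM) %>% skim()    # grouped skim tables (below)
vis_dat(atm)                                    # visual map of column types and NAs
```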
Next we will perform some exploratory data analysis to confirm our initial findings and identify other potential issues.
##       DATE                ATM                 Cash
##  Min.   :2009-05-01   Length:1474        Min.   :    0.0
##  1st Qu.:2009-08-01   Class :character   1st Qu.:    0.5
##  Median :2009-11-01   Mode  :character   Median :   73.0
##  Mean   :2009-10-31                      Mean   :  155.6
##  3rd Qu.:2010-02-01                      3rd Qu.:  114.0
##  Max.   :2010-05-14                      Max.   :10919.8
##                                          NA's   :19
| Name                   | atm_skim |
|------------------------|----------|
| Number of rows         | 1474     |
| Number of columns      | 3        |
| Column type frequency: |          |
| Date                   | 1        |
| numeric                | 1        |
| Group variables        | ATM      |
Variable type: Date
| skim_variable | ATM | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|---|
| DATE | ATM1 | 0 | 1 | 2009-05-01 | 2010-04-30 | 2009-10-30 | 365 |
| DATE | ATM2 | 0 | 1 | 2009-05-01 | 2010-04-30 | 2009-10-30 | 365 |
| DATE | ATM3 | 0 | 1 | 2009-05-01 | 2010-04-30 | 2009-10-30 | 365 |
| DATE | ATM4 | 0 | 1 | 2009-05-01 | 2010-04-30 | 2009-10-30 | 365 |
| DATE | NA | 0 | 1 | 2010-05-01 | 2010-05-14 | 2010-05-07 | 14 |
Variable type: numeric
| skim_variable | ATM | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
|---|---|---|---|---|---|---|---|---|---|---|
| Cash | ATM1 | 3 | 0.99 | 83.89 | 36.66 | 1.00 | 73.00 | 91.00 | 108.00 | 180.00 |
| Cash | ATM2 | 2 | 0.99 | 62.58 | 38.90 | 0.00 | 25.50 | 67.00 | 93.00 | 147.00 |
| Cash | ATM3 | 0 | 1.00 | 0.72 | 7.94 | 0.00 | 0.00 | 0.00 | 0.00 | 96.00 |
| Cash | ATM4 | 0 | 1.00 | 474.04 | 650.94 | 1.56 | 124.33 | 403.84 | 704.51 | 10919.76 |
| Cash | NA | 14 | 0.00 | NaN | NA | NA | NA | NA | NA | NA |
We used a combination of box plots, line charts, and the `ggtsdisplay()` function to perform our EDA; a sketch of the box-plot step follows, along with our key findings.
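A minimal sketch of the box-plot step, assuming the data are still in the long `DATE`/`ATM`/`Cash` form:

```r
# Hedged sketch of the EDA box plots: one box per ATM makes the ATM 4
# outlier (10,919.76) and ATM 3's near-zero activity easy to spot.
library(ggplot2)
library(tidyr)

atm %>%
  drop_na(ATM, Cash) %>%                # ignore unlabeled/missing rows for the plot
  ggplot(aes(x = ATM, y = Cash)) +
  geom_boxplot() +
  labs(title = "Cash Withdrawals by ATM", x = NULL, y = "Cash (hundreds)")
```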
We will address the missing data and outliers in our pre-processing section.
In this section we address the outliers and missing data. Our strategy is to impute each missing value with the average cash withdrawal for that day of the week, computed separately for each ATM.
The dplyr script below calculates weekday averages by ATM and then replaces NA values with the appropriate weekday average. The `skim()` function is run a second time to verify that the preprocessing succeeded: it confirms that there are no remaining missing values and that the outlier for ATM 4 has been imputed. We are now ready to move to the Modeling section.
A plot of the pre-processed time series is set forth below.
atm <- atm %>%
  mutate(wday = wday(DATE, label = TRUE, abbr = TRUE)) %>%   # add day-of-week label
  mutate(Cash = if_else(Cash > 9000, NA_real_, Cash)) %>%    # treat the ATM 4 outlier as missing
  group_by(ATM, wday) %>%
  mutate(wday_avg = mean(Cash, na.rm = TRUE)) %>%            # weekday average per ATM
  arrange(ATM, DATE) %>%
  mutate(Cash = if_else(is.na(Cash), wday_avg, Cash)) %>%    # impute NAs with the weekday average
  ungroup() %>%
  select(DATE, ATM, Cash)

| Name                   | atm_skim2 |
|------------------------|-----------|
| Number of rows         | 1474      |
| Number of columns      | 3         |
| Column type frequency: |           |
| Date                   | 1         |
| numeric                | 1         |
| Group variables        | ATM       |
Variable type: Date
| skim_variable | ATM | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|---|
| DATE | ATM1 | 0 | 1 | 2009-05-01 | 2010-04-30 | 2009-10-30 | 365 |
| DATE | ATM2 | 0 | 1 | 2009-05-01 | 2010-04-30 | 2009-10-30 | 365 |
| DATE | ATM3 | 0 | 1 | 2009-05-01 | 2010-04-30 | 2009-10-30 | 365 |
| DATE | ATM4 | 0 | 1 | 2009-05-01 | 2010-04-30 | 2009-10-30 | 365 |
| DATE | NA | 0 | 1 | 2010-05-01 | 2010-05-14 | 2010-05-07 | 14 |
Variable type: numeric
| skim_variable | ATM | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
|---|---|---|---|---|---|---|---|---|---|---|
| Cash | ATM1 | 0 | 1 | 83.94 | 36.51 | 1.00 | 73.00 | 91.00 | 108.00 | 180.00 |
| Cash | ATM2 | 0 | 1 | 62.43 | 38.85 | 0.00 | 25.53 | 66.00 | 93.00 | 147.00 |
| Cash | ATM3 | 0 | 1 | 0.72 | 7.94 | 0.00 | 0.00 | 0.00 | 0.00 | 96.00 |
| Cash | ATM4 | 0 | 1 | 445.35 | 350.90 | 1.56 | 124.33 | 403.84 | 704.19 | 1712.07 |
| Cash | NA | 14 | 0 | NaN | NA | NA | NA | NA | NA | NA |
In the Modeling section we will use STL decomposition, exponential smoothing, and ARIMA to identify the best possible model fit. Based upon my work experience, I have determined not to model ATM 3: both the location and the age (time in service) of an ATM are so important to its activity/usage that modeling ATM 3 as an average of the other ATMs would be unproductive.
After completing our preliminary modeling analysis we will use the `tsCV()` function to help assess our modeling results, leveraging RMSE to pick our preferred model(s). We will model each ATM individually and review the model residuals; a sketch of the cross-validation step follows, and a modeling summary comes after it.
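A minimal sketch of the cross-validation step, assuming the weekly `ts` object `atm` built in the forecasting code further below; the helper `arima_fc()` is hypothetical and uses the ATM 1 ARIMA specification purely for illustration:

```r
# Hedged sketch: rolling-origin CV errors for one candidate model via tsCV().
library(forecast)

arima_fc <- function(y, h) {            # hypothetical helper, not the author's code
  forecast(Arima(y, order = c(0, 0, 2), seasonal = c(0, 1, 1)), h = h)
}

e <- tsCV(atm[, 1], arima_fc, h = 1)    # one-step forecast error at every origin
sqrt(mean(e^2, na.rm = TRUE))           # RMSE used to rank candidate models
```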
We used STL decomposition (the `stl()` function) to model ATMs 1, 2, and 4. An analysis of the ACF and PACF charts indicated that 7 (a weekly period) was the appropriate seasonal frequency. Decomposition plots for each of the modeled ATMs are set forth below.
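The forecasts behind the STL column of the RMSE comparison are not printed; a plausible sketch, assuming `forecast::stlf()` with the same weekly window, follows:

```r
# Hedged sketch of an STL-based forecast for ATM 1 (s.window = 7 matches the
# decomposition plots; h = 31 covers the May 2010 horizon).
atm1_stl_fc <- stlf(atm[, 1], s.window = 7, robust = TRUE, h = 31)
autoplot(atm1_stl_fc)
```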
We used the `ets()` function for our exponential smoothing models, again modeling ATMs 1, 2, and 4. The key parameter settings and the fitted models appear in the output below.
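A sketch reconstructing the printed call for ATM 1; sourcing `lambda_1` from `BoxCox.lambda()` is an assumption (the output reports lambda = 0.2454):

```r
# Hedged sketch of the ETS fit implied by the printed call below.
lambda_1 <- BoxCox.lambda(atm[, 1])     # assumption: where lambda_1 comes from
atm1_ets_fit <- ets(atm[, 1], lambda = lambda_1, biasadj = TRUE)
summary(atm1_ets_fit)                   # prints the ETS(A,N,A) summary below
```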
The `auto.arima()` function was used to produce the ARIMA forecasts. The model selected for each ATM appears in the output below.
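A sketch of the call for ATM 1, reusing `lambda_1` from the sketch above; the exact arguments are assumptions beyond what the output confirms:

```r
# Hedged sketch: automated order selection picks the models reported below,
# e.g. ARIMA(0,0,2)(0,1,1)[7] for ATM 1.
atm1_arima_fit <- auto.arima(atm[, 1], lambda = lambda_1, biasadj = TRUE)
checkresiduals(atm1_arima_fit)          # Ljung-Box output shown below
```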
Model accuracy was analyzed for each ATM using RMSE as our accuracy metric. The results are summarized below.
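How the RMSE rows were assembled is not shown; one hedged possibility, using the training-set `accuracy()` measures of the three ATM 1 fits sketched above (the reported numbers could equally come from the `tsCV()` errors), is:

```r
# Hedged sketch only; object names come from the earlier sketches.
c(STL   = accuracy(atm1_stl_fc)["Training set", "RMSE"],
  ETS   = accuracy(atm1_ets_fit)["Training set", "RMSE"],
  ARIMA = accuracy(atm1_arima_fit)["Training set", "RMSE"])
```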
Detailed forecasting information by ATM is set forth below:
temp = atm %>%
  drop_na() %>%           # drop the 14 unlabeled May 2010 rows
  spread(ATM, Cash)       # wide form: one column per ATM
atm = ts(temp %>% select(-DATE), frequency = 7)   # weekly seasonal period

ggtsdisplay(atm[,1], main = "Withdrawals", ylab = "Withdrawals (in hundreds)", xlab = "Week")

ggsubseriesplot(atm[,1]) +
labs(title = "Withdrawals", subtitle ="May 2009 - April 2010",
x = "Days of the Week") +
scale_y_continuous("Amount of Cash Withdrawal",
labels = scales::dollar_format(scale = 0.1, suffix = "K"))

atm1_stl_fit <- atm[,1] %>% stl(s.window = 7, robust = TRUE) %>%
autoplot() +
labs(title = "ATM 1 - STL Decomposition", x = "Week")
atm1_stl_fit

## ETS(A,N,A)
##
## Call:
## ets(y = ., lambda = lambda_1, biasadj = TRUE)
##
## Box-Cox transformation: lambda= 0.2454
##
## Smoothing parameters:
## alpha = 1e-04
## gamma = 0.353
##
## Initial states:
## l = 7.5346
## s = -4.5312 0.8387 0.2145 0.8725 1.1761 0.925
## 0.5045
##
## sigma: 1.2737
##
## AIC AICc BIC
## 2340.975 2341.596 2379.973
## Series: .
## ARIMA(0,0,2)(0,1,1)[7]
## Box Cox transformation: lambda= 0.2453825
##
## Coefficients:
## ma1 ma2 sma1
## 0.1001 -0.1135 -0.6455
## s.e. 0.0524 0.0527 0.0426
##
## sigma^2 estimated as 1.584: log likelihood=-590.73
## AIC=1189.46 AICc=1189.58 BIC=1204.98
## STL ETS ARIMA
## RMSE: 51.44506 43.61599 37.14315
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,0,2)(0,1,1)[7]
## Q* = 7.9435, df = 11, p-value = 0.7184
##
## Model df: 3. Total lags used: 14
ggsubseriesplot(atm[,2]) +
labs(title = "Withdrawals", subtitle ="May 2009 - April 2010",
x = "Days of the Week") +
scale_y_continuous("Cash Withdrawals",
labels = scales::dollar_format(scale = 0.1, suffix = "K"))

atm2_stl_fit <- atm[,2] %>% stl(s.window = 7, robust = TRUE) %>%
autoplot() +
labs(title = "ATM 2 - STL Decomposition", x = "Week")
atm2_stl_fit

## ETS(A,N,A)
##
## Call:
## ets(y = ., lambda = lambda_2, biasadj = TRUE)
##
## Box-Cox transformation: lambda= 0.6936
##
## Smoothing parameters:
## alpha = 1e-04
## gamma = 0.3967
##
## Initial states:
## l = 25.4405
## s = -13.7016 -10.3688 6.5125 -1.8726 0.3309 10.0456
## 9.054
##
## sigma: 7.4538
##
## AIC AICc BIC
## 3630.720 3631.342 3669.719
## Series: .
## ARIMA(2,0,2)(0,1,1)[7]
## Box Cox transformation: lambda= 0.6935866
##
## Coefficients:
## ar1 ar2 ma1 ma2 sma1
## -0.4209 -0.9036 0.4633 0.7935 -0.7169
## s.e. 0.0589 0.0458 0.0888 0.0601 0.0416
##
## sigma^2 estimated as 52.2: log likelihood=-1215.47
## AIC=2442.94 AICc=2443.18 BIC=2466.23
## STL ETS ARIMA
## RMSE: 58.94468 59.40853 46.07943
##
## Ljung-Box test
##
## data: Residuals from ARIMA(3,0,3)(0,1,1)[7] with drift
## Q* = 5.2417, df = 6, p-value = 0.5132
##
## Model df: 8. Total lags used: 14
ggsubseriesplot(atm[,4]) +
labs(title = "ATM 4 Withdrawals", subtitle ="May 2009 - April 2010",
x = "Days of the Week") +
scale_y_continuous("Cash Withdrawals",
labels = scales::dollar_format(scale = 0.1, suffix = "K"))

atm4_stl_fit <- atm[,4] %>% stl(s.window = 7, robust = TRUE) %>%
autoplot() +
labs(title = "ATM 4 - Decomposition", x = "Week")
atm4_stl_fit

## ETS(A,N,A)
##
## Call:
## ets(y = ., lambda = lambda_4, biasadj = TRUE)
##
## Box-Cox transformation: lambda= 0.4533
##
## Smoothing parameters:
## alpha = 1e-04
## gamma = 0.1008
##
## Initial states:
## l = 29.2323
## s = -17.9705 -2.5336 0.6625 4.6494 6.2229 4.8631
## 4.1063
##
## sigma: 13.1943
##
## AIC AICc BIC
## 4047.594 4048.215 4086.593
## Series: .
## ARIMA(0,0,1)(2,0,0)[7] with non-zero mean
## Box Cox transformation: lambda= 0.4533182
##
## Coefficients:
## ma1 sar1 sar2 mean
## 0.0765 0.2090 0.1990 29.1078
## s.e. 0.0529 0.0517 0.0525 1.2587
##
## sigma^2 estimated as 183.8: log likelihood=-1467.94
## AIC=2945.87 AICc=2946.04 BIC=2965.37
## STL ETS ARIMA
## RMSE: 350.7567 329.0905 287.1895
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,0,1)(2,0,0)[7] with non-zero mean
## Q* = 12.392, df = 10, p-value = 0.2596
##
## Model df: 4. Total lags used: 14
atm_output = function(week){
  # Split each series into train/test at day 3 of the given week
  train1 = window(atm[,1], end = c(week, 3))
  train2 = window(atm[,2], end = c(week, 3))
  train4 = window(atm[,4], end = c(week, 3))
  test1 = window(atm[,1], start = c(week, 4))
  test2 = window(atm[,2], start = c(week, 4))
  test4 = window(atm[,4], start = c(week, 4))
  # Refit the preferred ARIMA specification for each ATM on the training window
  arima.model1 = train1 %>% Arima(order = c(0,0,2), seasonal = c(0,1,1), lambda = lambda_1, biasadj = TRUE)
  arima.model2 = train2 %>% Arima(order = c(3,0,3), seasonal = c(0,1,1), include.drift = TRUE, lambda = lambda_2, biasadj = TRUE)
  arima.model4 = train4 %>% Arima(order = c(0,0,1), seasonal = c(2,0,0), include.mean = TRUE, lambda = lambda_4, biasadj = TRUE)
  # Create the 31-day point forecasts, re-indexed as days 366-396
  ATM_1_Fcst = forecast(arima.model1, h = 31)$mean %>% ts(start = 366, end = 396)
  ATM_2_Fcst = forecast(arima.model2, h = 31)$mean %>% ts(start = 366, end = 396)
  ATM_4_Fcst = forecast(arima.model4, h = 31)$mean %>% ts(start = 366, end = 396)
  # Assemble the May 2010 forecast table
  DATE <- seq(as.Date('2010-05-01'), as.Date('2010-05-31'), 1)
  ATM1 <- c(ATM_1_Fcst)
  ATM2 <- c(ATM_2_Fcst)
  ATM4 <- c(ATM_4_Fcst)
  df <- data.frame(DATE, ATM1, ATM2, ATM4)
  df   # return the forecast data frame
}
myATMFcst <- atm_output(44)
myATMFcst

## DATE ATM1 ATM2 ATM4
## 1 2010-05-01 88.03248 43.73438 513.6667
## 2 2010-05-02 70.27660 29.52705 386.9937
## 3 2010-05-03 76.01683 53.06357 481.4655
## 4 2010-05-04 37.14118 24.18667 661.3958
## 5 2010-05-05 109.88323 42.56347 453.9731
## 6 2010-05-06 101.69941 60.56368 412.7631
## 7 2010-05-07 101.80257 64.93641 578.4927
## 8 2010-05-08 84.34108 38.51018 486.2995
## 9 2010-05-09 74.09145 41.57788 351.5421
## 10 2010-05-10 76.12534 53.16685 442.8794
## 11 2010-05-11 37.21530 17.44156 587.5181
## 12 2010-05-12 110.01498 49.47473 425.1071
## 13 2010-05-13 101.82591 67.45401 451.5255
## 14 2010-05-14 101.92913 56.10449 580.6077
## 15 2010-05-15 84.45562 38.62484 472.1165
## 16 2010-05-16 74.19842 49.50073 424.2479
## 17 2010-05-17 76.23384 49.41827 458.5358
## 18 2010-05-18 37.28942 14.80928 511.2671
## 19 2010-05-19 110.14673 55.76500 450.7936
## 20 2010-05-20 101.95240 68.48927 450.1528
## 21 2010-05-21 102.05569 50.24438 499.2524
## 22 2010-05-22 84.57017 40.80389 464.5069
## 23 2010-05-23 74.30538 53.25603 433.7026
## 24 2010-05-24 76.34235 45.28566 455.3553
## 25 2010-05-25 37.36355 14.22772 486.2518
## 26 2010-05-26 110.27847 60.00312 451.0631
## 27 2010-05-27 102.07890 66.78616 454.9031
## 28 2010-05-28 102.18226 47.09690 482.9237
## 29 2010-05-29 84.68471 43.17946 460.4860
## 30 2010-05-30 74.41235 54.09822 446.9201
## 31 2010-05-31 76.45086 42.13277 456.6020