DATA 624 Project 1 Part A

Introduction

I am to forecast how much cash will be taken out of 4 different ATM machines for May 2010. Data was provided for this project. The cash is in hundreds of dollars.

Data Exploration

I will begin by plotting the cash withdrawn by each ATM.

if (!file.exists("ATM624Data.xlsx")) {
  download.file("https://github.com/mikeasilva/CUNY-SPS/raw/master/DATA624/ATM624Data.xlsx", "ATM624Data.xlsx")
}

atm <- read_excel("ATM624Data.xlsx") %>%
  mutate(DATE = as.Date(DATE, origin = "1899-12-30"),
         ATM = as.factor(ATM)) %>%
  data.frame()

atm %>%
  ggplot(aes(DATE, Cash, color = ATM)) +
  geom_line() +
  ggtitle("Cash Withdrawal by Date and ATM") +
  scale_color_brewer(palette = "Set1") +
  scale_y_continuous(labels = comma) +
  facet_wrap(.~ATM, ncol = 2) +
  theme(legend.title = element_blank(),
        legend.position = "bottom",
        axis.title = element_blank())

ATM #4 is utilized at a higher rate than every other ATM in the data series. There is an unusual spike in ATM #4. According to a 2010 article in Time magazine an average sized ATM can hold $200,000. So the outlier is an implausible figure. There are some missing ATM labels in the data. Let’s remove ATM #4 and the missing label from the dataset and replot the visualization:

atm %>%
  filter(ATM != "ATM4") %>%
  filter(!is.na(ATM)) %>%
  ggplot(aes(DATE, Cash, color = ATM)) +
  geom_line() +
  facet_wrap(.~ATM, ncol = 2) +
  ggtitle("Cash Withdrawal by Date and ATM") +
  scale_color_brewer(palette = "Set1") +
  theme(legend.title = element_blank(),
        legend.position = "bottom",
        axis.title = element_blank())

It looks like ATM #3 was not utilized until recently. ATM #1 and #2 seem very similar to each other. It doesn’t seem like there is much of a trend in the data. Let’s explore the distributions further with this visualization:

atm %>%
  na.omit() %>%
  ggplot(aes(ATM, Cash, color = ATM)) +
  geom_boxplot() +
  ggtitle("Cash Withdrawal by ATM") +
  scale_color_brewer(palette = "Set1") +
  facet_wrap(.~ATM, nrow = 2, scales = "free") +
  theme(legend.position = "none",
        axis.title.y = element_blank(),
        axis.title.x = element_blank(),
        axis.text.x = element_blank(),
        axis.ticks.x = element_blank())

atm %>%
  filter(!is.na(ATM)) %>%
  group_by(ATM) %>%
  summarise(Minimum = min(Cash, na.rm = T),
            `1st Qu.` = quantile(Cash, .25, na.rm = T),
            Mean = mean(Cash, na.rm = T),
            Median = median(Cash, na.rm = T),
            `3rd Qu.` = quantile(Cash, .75, na.rm = T),
            Maximum = max(Cash, na.rm = T),
            `NA's` = sum(is.na(Cash))) %>%
  kable() %>%
  kable_styling()

ATM	Minimum	1st Qu.	Mean	Median	3rd Qu.	Maximum	NA’s
ATM1	1.00000	73.0000	83.8867403	91.0000	108.000	180.00	3
ATM2	0.00000	25.5000	62.5785124	67.0000	93.000	147.00	2
ATM3	0.00000	0.0000	0.7205479	0.0000	0.000	96.00	0
ATM4	1.56326	124.3344	474.0433449	403.8393	704.507	10919.76	0

Is there any Daily Effects?

There is a pattern in the time series but fluctuations are very tight. I wonder if it is explained by the day of the week. For example, Friday and Saturday might have heavier usage and a Tuesday night might be calm. Let’s examine this hypothesis:

atm %>%
  na.omit() %>%
  mutate(`Day of Week` = recode(as.factor(wday(DATE)), "1" = "Sunday", "2" = "Monday", "3" = "Tuesday", "4" = "Wednesday", "5" = "Thursday", "6" = "Friday", "7" = "Saturday"))  %>%
  ggplot(aes(`Day of Week`, Cash, color = `Day of Week`)) +
  geom_boxplot() +
  facet_grid(ATM ~ `Day of Week`, scales = "free") +
  scale_color_brewer(palette = "Set1") +
  theme(legend.position = "none", axis.title = element_blank(), axis.text.x = element_blank(), axis.ticks = element_blank())

It looks like there is some varriation based on the day of the week. Thursday looks to be less busy than the other days of the week. There is so much variation between the 4 ATMs that each will be modeled seperately.

A Quick Note on Process

I will clean up the data to produce the forecasts. I will then explore some candidate modeling techniques. Once I have a handful of candidates, I will prefrom a cross-validation on the models and get the RMSE. The model that minimizes the RMSE during cross-validation will be selected as the model of choice.

ATM #1

Data Cleanup

I will begin ATM #1. There are 3 missing observations that will need to be cleaned up.

atm1 <- atm %>%
  filter(ATM == "ATM1")

atm1$clean_cash <- c(tsclean(ts(atm1$Cash, start = decimal_date(as.Date(min(atm1$DATE))), frequency = 365)))

Now that we have a complete data set I will create time series objects. Since there is a weekly effect I will be using a frequency of 7. This does mess up the dates in the plot however so please pay no reguard to them.

ggtsplot <- function(ts, title) {
  # A ggplot2 version of tsdisplay(df)
  # Args:
  #    ts (Time-Series): The time series we want to plot
  #    title (str): The title of the graph
  grid.arrange(
    autoplot(ts) +
      ggtitle(title) +
      scale_y_continuous(labels = comma) +
      theme(axis.title = element_blank()),
    grid.arrange(
      ggAcf(ts) + ggtitle(element_blank()),
      ggPacf(ts) + ggtitle(element_blank()), ncol = 2)
    , nrow = 2)
}

atm1_ts <- ts(atm1$clean_cash, start = decimal_date(as.Date(min(atm1$DATE))), frequency = 7)
ggtsplot(atm1_ts, "ATM #1")

One readily observes the repeating peaks on every 7th lag. There also seems to be an interesting negative correlation between the 1st and both the 3rd and the 5th lag.

Model Creation

STL Decomposition Models

I will try a couple of seasonal decomposition models. I will set the seasonal window to 7 so it picks up the day of the week variation.

atm1_ts %>%
  stl(s.window = 7, robust = TRUE) %>%
  autoplot()

This seems promising to me. I will do STL decomposition forecasts using both the ETS and ARIMA models and check their residual plots

STL + ETS

h <- 31
atm1_stl_ets_fit <- atm1_ts %>%
  stlf(h = h, s.window = 7, robust = TRUE, method = "ets")
checkresiduals(atm1_stl_ets_fit)


    Ljung-Box test

data:  Residuals from STL +  ETS(A,N,N)
Q* = 13.759, df = 12, p-value = 0.3164

Model df: 2.   Total lags used: 14

This model seems to hold merit and should be taken under consideration.

STL + ARIMA

atm1_stl_arima_fit <- atm1_ts %>%
  stlf(h = h, s.window = 7, robust = TRUE, method = "arima")
checkresiduals(atm1_stl_arima_fit)


    Ljung-Box test

data:  Residuals from STL +  ARIMA(0,1,2)
Q* = 8.4479, df = 12, p-value = 0.7492

Model df: 2.   Total lags used: 14

This model also preformed well. It is definately a candidate for the cross-validation stage.

Holt-Winters

atm1_hw_fit <- hw(atm1_ts, h = h)
checkresiduals(atm1_hw_fit)


    Ljung-Box test

data:  Residuals from Holt-Winters' additive method
Q* = 28.877, df = 3, p-value = 0.000002376

Model df: 11.   Total lags used: 14

This method did a fairly good job. I will move it on to the next phase.

Holt-Winters with Box Cox Adjustment

atm1_lambda <- BoxCox.lambda(atm1_ts)
atm1_adj_hw_fit <- hw(atm1_ts, h = h, lambda = atm1_lambda)
checkresiduals(atm1_adj_hw_fit)


    Ljung-Box test

data:  Residuals from Holt-Winters' additive method
Q* = 18.824, df = 3, p-value = 0.0002973

Model df: 11.   Total lags used: 14

This seems to be a strong candidate. We will see how if fares in the cross-validation.

ARIMA

atm1_arima_fit <- auto.arima(atm1_ts)
checkresiduals(atm1_arima_fit)


    Ljung-Box test

data:  Residuals from ARIMA(0,0,1)(0,1,2)[7]
Q* = 15.247, df = 11, p-value = 0.1715

Model df: 3.   Total lags used: 14

This model looks like it is preforming well. Let’s see how all of them stack up.

Cross Validation

In order to understand how well a model is likely to preform at predicting out of sample data, I will use the tsCV function and evaluate the models. As prevously noted my goal is to minimize the RMSE. First I will get the errors from the cross validation process, then I will compute the RMSE.

get_rmse <- function(e) {
  sqrt(mean(e^2, na.rm = TRUE))
}

atm1_arima_forecast <- function(x, h) {
  forecast(Arima(x, order = c(0, 0, 1), seasonal = c(0, 1, 2)), h = h)
}

stl_ets_errors <- tsCV(atm1_ts, stlf, h = h, s.window = 7, robust = TRUE, method = "ets")
stl_arima_errors <- tsCV(atm1_ts, stlf, h = h, s.window = 7, robust = TRUE, method = "arima")
hw_errors <- tsCV(atm1_ts, hw, h = h)
adj_hw_errors <- tsCV(atm1_ts, hw, h = h, lambda = atm1_lambda)
arima_errors <- tsCV(atm1_ts, atm1_arima_forecast, h = h)

data.frame(Model = c("STL ETS", "STL ARIMA", "ARIMA", "Holt-Winters", "Adjusted Holt-Winters"),
           RMSE = c(get_rmse(stl_ets_errors[, h]), get_rmse(stl_arima_errors[, h]), get_rmse(arima_errors[, h]), get_rmse(hw_errors[, h]), get_rmse(adj_hw_errors[, h]))) %>%
  arrange(RMSE) %>%
  kable() %>%
  kable_styling()

Model	RMSE
ARIMA	29.67002
STL ARIMA	31.80636
STL ETS	31.81959
Adjusted Holt-Winters	34.30067
Holt-Winters	39.49958

It looks like the ARIMA model did the best job predicting on the out of sample data set.

Model Selection

I will select the ARIMA model to produce the forecast for ATM #1. I have decided on this because it is the model that preformed the best in the cross validation.

ATM #2

Now I will repeat the above process for ATM #2. As this is a repeat I will not include as much explanatory text.

Data Cleanup

There are 2 missing observations. This will be cleaned up using the tsclean function again.

atm2 <- atm %>%
  filter(ATM == "ATM2")

atm2$clean_cash <- c(tsclean(ts(atm2$Cash, start = decimal_date(as.Date(min(atm2$DATE))), frequency = 365)))

atm2_ts <- ts(atm2$clean_cash, start = decimal_date(as.Date(min(atm2$DATE))), frequency = 7)
ggtsplot(atm2_ts, "ATM #2")

Once again the ACF plot has the regular spikes on evry multiple of 7.

Model Creation

STL Decomposition Models

Again I will try a couple of seasonal decomposition models. I will set the seasonal window to 7 so it picks up the day of the week variation.

atm2_ts %>%
  stl(s.window = 7, robust = TRUE) %>%
  autoplot()

This seems very similar to ATM #1.

STL + ETS

atm2_stl_ets_fit <- atm2_ts %>%
  stlf(h = h, s.window = 7, robust = TRUE, method = "ets")
checkresiduals(atm2_stl_ets_fit)


    Ljung-Box test

data:  Residuals from STL +  ETS(A,N,N)
Q* = 9.2372, df = 12, p-value = 0.6825

Model df: 2.   Total lags used: 14

This model preformed much better than the it did for the ATM #1 time series. It will be interesting to see how it does in cross-validation

STL + ARIMA

atm2_stl_arima_fit <- atm2_ts %>%
  stlf(h = h, s.window = 7, robust = TRUE, method = "arima")
checkresiduals(atm2_stl_arima_fit)


    Ljung-Box test

data:  Residuals from STL +  ARIMA(2,1,2)
Q* = 7.4932, df = 10, p-value = 0.6782

Model df: 4.   Total lags used: 14

Interesting. The residuals have a bit more spread but the left tail is shorter than the STL+ETS model.

Holt-Winters

atm2_hw_fit <- hw(atm2_ts, h = h)
checkresiduals(atm2_hw_fit)


    Ljung-Box test

data:  Residuals from Holt-Winters' additive method
Q* = 37.435, df = 3, p-value = 0.00000003722

Model df: 11.   Total lags used: 14

This method did not preform as well as the others. I will, however keep it in so I can compare it to ATM #1’s statistics.

Holt-Winters with Box Cox Adjustment

atm2_lambda <- BoxCox.lambda(atm2_ts)
atm2_adj_hw_fit <- hw(atm2_ts, h = h, lambda = atm2_lambda)
checkresiduals(atm2_adj_hw_fit)


    Ljung-Box test

data:  Residuals from Holt-Winters' additive method
Q* = 34.776, df = 3, p-value = 0.0000001358

Model df: 11.   Total lags used: 14

ARIMA

atm2_arima_fit <- auto.arima(atm2_ts)
checkresiduals(atm2_arima_fit)


    Ljung-Box test

data:  Residuals from ARIMA(2,0,2)(0,1,1)[7]
Q* = 10.231, df = 9, p-value = 0.3321

Model df: 5.   Total lags used: 14

This model looks like it is preforming well. Let’s see how all of them stack up.

Cross Validation

atm2_arima_forecast <- function(x, h) {
  forecast(Arima(x, order = c(2, 0, 2), seasonal = c(0, 1, 1)), h = h)
}

stl_ets_errors <- tsCV(atm2_ts, stlf, h = h, s.window = 7, robust = TRUE, method = "ets")
stl_arima_errors <- tsCV(atm2_ts, stlf, h = h, s.window = 7, robust = TRUE, method = "arima")
hw_errors <- tsCV(atm2_ts, hw, h = h)
adj_hw_errors <- tsCV(atm2_ts, hw, h = h, lambda = atm2_lambda)
arima_errors <- tsCV(atm2_ts, atm2_arima_forecast, h = h)

data.frame(Model = c("STL ETS", "STL ARIMA", "ARIMA", "Holt-Winters", "Adjusted Holt-Winters"),
           RMSE = c(get_rmse(stl_ets_errors[, h]), get_rmse(stl_arima_errors[, h]), get_rmse(arima_errors[, h]), get_rmse(hw_errors[, h]), get_rmse(adj_hw_errors[, h]))) %>%
  arrange(RMSE) %>%
  kable() %>%
  kable_styling()

Model	RMSE
ARIMA	33.56421
STL ARIMA	34.17962
STL ETS	34.96679
Adjusted Holt-Winters	41.53970
Holt-Winters	55.33134

Interesting. It looks like the ARIMA model was the top preformer again. I find it interesting that the RMSE is higher and more spread out for ATM #2. ATM #1’s RMSEs ranged from roughyl 30 to 39. The RMSEs for ATM #2 ranged from 34 to 55. There is more variability in this data that isn’t captured by the model.

Model Selection

As stated in the previous section I will use the ARIMA model to produce the forecast for ATM #2.

ATM #3

This model is quite different from the two proceeding cases. You can see it in the plot:

atm %>%
  filter(ATM == "ATM3") %>%
  mutate(nonzero = if_else(Cash == 0, "No", "Yes")) %>%
  ggplot(aes(DATE, Cash, color = nonzero)) +
  geom_point() +
  ggtitle("ATM #3") +
  scale_color_brewer(palette = "Set1") +
  theme(axis.title = element_blank(), legend.position = "none")

Most of the values are zeros (points in red above), except for 3 points (shown in blue). The three non-zero points are the most current. This presents a serious challenge.

Model Creation/Selection

There is one fundamental question with this dataset. Are the three points an outlier, or is it the begining of the new normal? If they are outliers one would expect the cash value to return to zero. If the three points are an indication of change, then the historical data have little relevance.

I will be assuming the new data are the begining of the new normal. The challenge is we only have three data points. In the absense of more data I will calculate the average of these three points and use it for the forecast with the recommendation to revise it frequently. As the average is only based off of three data points it should not be considered stable.

atm3 <- atm %>%
  filter(ATM == "ATM3", Cash > 0)

atm3_mean <- mean(atm3$Cash)

ATM #4

This ATM is different too, but not as radically different as ATM #3. I will be using the same approach used for ATM #1 and #2 with this ATM.

Data Cleanup

There is one major outlier in the data set. We will clean it up by simply by using the tsclean function once again.

atm4 <- atm %>%
  filter(ATM == "ATM4")

atm4$clean_cash <- c(tsclean(ts(atm4$Cash, start = decimal_date(as.Date(min(atm4$DATE))), frequency = 365)))

atm4_ts <- ts(atm4$clean_cash, start = decimal_date(as.Date(min(atm4$DATE))), frequency = 7)
ggtsplot(atm4_ts, "ATM #4")

Once again there is the familiar patter of peaks on the multiples of seven.

Model Creation

STL Decomposition Models

Again I will try a couple of seasonal decomposition models. I will set the seasonal window to 7 so it picks up the day of the week variation.

atm2_ts %>%
  stl(s.window = 7, robust = TRUE) %>%
  autoplot()

This seems very similar to ATM #1.

STL + ETS

atm4_stl_ets_fit <- atm4_ts %>%
  stlf(h = h, s.window = 7, robust = TRUE, method = "ets")
checkresiduals(atm4_stl_ets_fit)


    Ljung-Box test

data:  Residuals from STL +  ETS(A,N,N)
Q* = 30.61, df = 12, p-value = 0.002258

Model df: 2.   Total lags used: 14

This model did a fair job. There are a couple of ACF spikes that are outside the bands. Let’s see if the ARIMA model does better:

STL + ARIMA

atm4_stl_arima_fit <- atm4_ts %>%
  stlf(h = h, s.window = 7, robust = TRUE, method = "arima")
checkresiduals(atm4_stl_arima_fit)


    Ljung-Box test

data:  Residuals from STL +  ARIMA(1,0,1) with non-zero mean
Q* = 21.78, df = 11, p-value = 0.02614

Model df: 3.   Total lags used: 14

This is an improvement over the STL+ETS model. There are still a couple of spikes on the ACF that falls outside the threshold.

Holt-Winters

atm4_hw_fit <- hw(atm4_ts, h = h)
checkresiduals(atm4_hw_fit)


    Ljung-Box test

data:  Residuals from Holt-Winters' additive method
Q* = 18.279, df = 3, p-value = 0.0003852

Model df: 11.   Total lags used: 14

This method seems to have preformed better than any of the STL models.

Holt-Winters with Box Cox Adjustment

atm4_lambda <- BoxCox.lambda(atm4_ts)
atm4_adj_hw_fit <- hw(atm4_ts, h = h, lambda = atm4_lambda)
checkresiduals(atm4_adj_hw_fit)


    Ljung-Box test

data:  Residuals from Holt-Winters' additive method
Q* = 18.934, df = 3, p-value = 0.0002822

Model df: 11.   Total lags used: 14

This model preformed very well. It doesn’t look like any of the ACF spikes are outside the bands. This is a contender for sure.

ARIMA

atm4_arima_fit <- auto.arima(atm4_ts)
checkresiduals(atm4_arima_fit)


    Ljung-Box test

data:  Residuals from ARIMA(0,0,3)(1,0,0)[7] with non-zero mean
Q* = 16.96, df = 9, p-value = 0.04934

Model df: 5.   Total lags used: 14

This model did fairly well, but it seems like the Holt-Winters with Box-Cox adjustment did better. It’s time to cross-validate and see how all of them preform.

Cross Validation

Once again, I will use the tsCV function and evaluate the models. My goal to minimize the RMSE remains for this dataset.

atm4_arima_forecast <- function(x, h) {
  forecast(Arima(x, order = c(0, 0, 3), seasonal = c(1, 0, 0)), h = h)
}

stl_ets_errors <- tsCV(atm4_ts, stlf, h = h, s.window = 7, robust = TRUE, method = "ets")
stl_arima_errors <- tsCV(atm4_ts, stlf, h = h, s.window = 7, robust = TRUE, method = "arima")
hw_errors <- tsCV(atm4_ts, hw, h = h)
adj_hw_errors <- tsCV(atm4_ts, hw, h = h, lambda = atm4_lambda)
arima_errors <- tsCV(atm4_ts, atm4_arima_forecast, h = h)

data.frame(Model = c("STL ETS", "STL ARIMA", "ARIMA", "Holt-Winters", "Adjusted Holt-Winters"),
           RMSE = c(get_rmse(stl_ets_errors[, h]), get_rmse(stl_arima_errors[, h]), get_rmse(arima_errors[, h]), get_rmse(hw_errors[, h]), get_rmse(adj_hw_errors[, h]))) %>%
  arrange(RMSE) %>%
  kable() %>%
  kable_styling()

Model	RMSE
ARIMA	346.7434
STL ARIMA	353.7169
STL ETS	366.0626
Adjusted Holt-Winters	388.4804
Holt-Winters	504.2440

That’s an unexpected result. The Adjusted Holt-Winters (the one I thought would preform the best), did not preform as well. The ARIMA model, once again, rose to the top.

Model Selection

Once again I will use the forecasts produced by the ARIMA model as the projections for ATM #4.

Summary

I set out to create predictions for 4 different ATMs. After testing multiple approaches using cross-validation, I selected ARIMA models for ATM #1, #2 and #4, as it was the modeling technique with the lowest RMSE. For ATM #3 we used the mean of all non-zero data (3 observations). This model needs to be updated once more data becomes available. I will finish this project by exporting my forcasts in the same file format as the original data.

DATA 624 Project 1 Part A - ATM