Import Libraries

library(fpp3)
## Warning: package 'fpp3' was built under R version 4.1.2
## ── Attaching packages ────────────────────────────────────────────── fpp3 0.5 ──
## ✓ tibble      3.1.6     ✓ tsibble     1.1.3
## ✓ dplyr       1.0.8     ✓ tsibbledata 0.4.1
## ✓ tidyr       1.2.0     ✓ feasts      0.3.0
## ✓ lubridate   1.8.0     ✓ fable       0.3.2
## ✓ ggplot2     3.3.5     ✓ fabletools  0.3.2
## Warning: package 'dplyr' was built under R version 4.1.2
## Warning: package 'tidyr' was built under R version 4.1.2
## Warning: package 'tsibble' was built under R version 4.1.2
## Warning: package 'tsibbledata' was built under R version 4.1.2
## Warning: package 'feasts' was built under R version 4.1.2
## Warning: package 'fable' was built under R version 4.1.2
## ── Conflicts ───────────────────────────────────────────────── fpp3_conflicts ──
## x lubridate::date()    masks base::date()
## x dplyr::filter()      masks stats::filter()
## x tsibble::intersect() masks base::intersect()
## x tsibble::interval()  masks lubridate::interval()
## x dplyr::lag()         masks stats::lag()
## x tsibble::setdiff()   masks base::setdiff()
## x tsibble::union()     masks base::union()
library(dplyr)
library(tidyr)
library(ggplot2)
library(readxl)

Part 1 - ATM Data

In part A, I want you to forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable ‘Cash’ is provided in hundreds of dollars; other than that it is straightforward. I am being somewhat ambiguous on purpose to make this have a little more business feeling. Explain and demonstrate your process, techniques used and not used, and your actual forecast. I am giving you data via an Excel file. Please provide your written report on your findings, visuals, discussion and your R code via an RPubs link along with the actual .rmd file. Also, please submit the forecast, which you will put in an Excel-readable file.

Load and Inspect Data

atm <- read_excel("/Users/alecmccabe/Downloads/ATM624DATA.xlsx", col_types = c('date', 'text', 'numeric')) %>%
  # converting into date format
  mutate(DATE = as_date(DATE)) %>%
  # convert to tsibble with DATE as index and ATM as key
  as_tsibble(index = DATE, key = ATM)


print(dim(atm))
## [1] 1474    3
summary(atm)
##       DATE                ATM                 Cash        
##  Min.   :2009-05-01   Length:1474        Min.   :    0.0  
##  1st Qu.:2009-08-01   Class :character   1st Qu.:    0.5  
##  Median :2009-11-01   Mode  :character   Median :   73.0  
##  Mean   :2009-10-31                      Mean   :  155.6  
##  3rd Qu.:2010-02-01                      3rd Qu.:  114.0  
##  Max.   :2010-05-14                      Max.   :10919.8  
##                                          NA's   :19

The ATM dataset contains 1474 entries. Each entry corresponds to the cash withdrawn from a specific ATM on a specific date. The dates range from May 2009 to May 2010.

colSums(is.na(atm))
## DATE  ATM Cash 
##    0   14   19
atm[is.na(atm$ATM),]

There are 14 missing values in the ATM column. Upon inspection, however, they all fall in May 2010, the period we are attempting to forecast, so we can remove them without losing any usable data.

atm <- atm %>%
  drop_na(ATM)
atm[is.na(atm$Cash),]

After removing the entries with missing ATM values, 5 entries with missing Cash values remain. All of them fall in June 2009. In the next steps we will consider different ways to handle this.

We can handle the missing values in a few ways:

- First, we can replace the missing values with the mean or median value for that ATM and month.
- Second, we can aggregate all the data to a weekly or monthly level. This may not be the best approach, however, since all of the missing Cash values come from the same 2-week period in June; potentially a monthly aggregation would be appropriate.
- Lastly, we can skip imputation entirely and rely on an ARIMA model, which can inherently handle missing values.

For this project, I will explore options 1 and 3. I will impute the missing June Cash values in the working dataset, and I will also fit ARIMA models in later steps, which could have handled the missing values without imputation, and compare results between the approaches.
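For reference, option 2 could look like the sketch below. This weekly aggregation is illustrative only (the yearweek() index and the na.rm choice are assumptions) and is not used in the remainder of the analysis.

# Sketch only: aggregate the daily series to weekly totals (option 2, not pursued here)
atm_weekly <- atm %>%
  group_by_key() %>%
  index_by(Week = ~ yearweek(.)) %>%
  summarise(Cash = sum(Cash, na.rm = TRUE))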

# calculate the June median Cash value for each ATM (all missing values fall in June)
atm1_median <- median(atm[(atm$ATM == 'ATM1') & (month(atm$DATE)==6) & (!is.na(atm$Cash)),]$Cash)
atm2_median <- median(atm[(atm$ATM == 'ATM2') & (month(atm$DATE)==6) & (!is.na(atm$Cash)),]$Cash)


# replace missing values with median values
atm[(atm$ATM == 'ATM1') & (is.na(atm$Cash)),]$Cash = atm1_median
atm[(atm$ATM == 'ATM2') & (is.na(atm$Cash)),]$Cash = atm2_median

Exploring the Data

atm %>%
  autoplot(Cash) +
  facet_wrap(~ATM, scales = "free", nrow = 4) +
  labs(title = "ATM Time Series by ATM")

From the above, we can see that ATMs 1 and 2 have similar series, with continued activity (withdrawals) across the period. It is difficult to ascertain the behavior of ATM4 because of an extreme outlier, which we will need to address.
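As a quick sanity check, a minimal sketch to locate the date of that extreme observation before deciding how to treat it:

# Sketch: find when the extreme ATM4 withdrawal occurred
atm %>%
  filter(ATM == 'ATM4') %>%
  filter(Cash == max(Cash))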

ATM 3 differs, mainly in that it exhibits significantly smaller levels of cash withdrawals. In fact, ATM3 has 0 withdrawals until late April of 2010, after which there is a large spike. Is this spike an outlier, or was this third ATM newly introduced to a busy corner? Regardless, we don’t have enough data for this ATM to include it in our analysis, so we will remove it in the next step.

atm <- atm %>%
  filter(ATM != 'ATM3')

For ATM 4, we will need to address the single outlier occurring around March of 2010. We can do this by replacing it with the median of the series.

# avoid shadowing base::max and stats::median with variable names
atm4_max <- max(atm[atm$ATM=='ATM4',]$Cash)
atm4_median <- median(atm[atm$ATM=='ATM4',]$Cash)

atm[(atm$ATM=='ATM4') & (atm$Cash == atm4_max),]$Cash = atm4_median
atm %>%
  autoplot(Cash) +
  facet_wrap(~ATM, scales = "free", nrow = 4) +
  labs(title = "ATM Time Series by ATM")

atm %>%
  ACF(Cash) %>%
  autoplot() +
  facet_wrap(~ATM, scales = "free", nrow = 4) +
  labs(title = "ATM Time Series by ATM")

From the ACF plots above, we can see that ATMs 1 and 2 both exhibit weekly seasonality. We do not see the same type of seasonality for ATM 4.
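To back this reading up numerically, the STL-based strength features from feasts can be used; seasonal_strength_week values near 1 indicate strong weekly seasonality. A minimal sketch:

# Sketch: quantify trend and weekly-seasonal strength per ATM
atm %>%
  features(Cash, feat_stl)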

Exploring ATM 1

atm1 <- atm %>%
  filter(ATM == 'ATM1')

# ATM1 STL Decomposition
atm1 %>%
  model(STL(Cash ~ season(window="periodic"), robust = TRUE)) %>%
  components() %>%
  autoplot() +
  labs(title = "ATM1 with Seasonal Decomposition")

Based on the seasonal decomposition above, we can see that there is no significant trend in the series. The data follows a weekly seasonality. We also see that the remainder is mostly random, but exhibits seemingly non-random behavior around March of 2010.

atm1 |>
  gg_lag(Cash, geom = "point") +
  labs(x = "lag(Cash, k)")

Now looking at the lag plot for ATM 1, we can once again see that this ATM’s time series follows a weekly (7-day) seasonality cycle.
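A seasonal plot makes the same 7-day pattern explicit by overlaying each week; a minimal sketch:

# Sketch: overlay each week of ATM1 withdrawals to show the 7-day cycle
atm1 %>%
  gg_season(Cash, period = "week") +
  labs(title = "ATM1 Withdrawals by Day of Week")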

# create training set, holding out the final month for validation

atm1_train <- atm1 %>%
  filter(DATE < '2010-04-01')
# lambda for box cox transformation
atm1_lambda <- atm1_train %>%
  features(Cash, features = guerrero) %>%
  pull(lambda_guerrero)

atm1_fit <- atm1_train %>%
  model(
    # additive ETS model
    additive = ETS(Cash ~ error("A") + trend("N") + season("A")),
    # multiplicative ETS model
    multiplicative = ETS(Cash ~ error("M") + trend("N") + season("M")),
    # SNAIVE model
    snaive = SNAIVE(Cash),
    # transformed additive ETS model
    additive_ts = ETS(box_cox(Cash,atm1_lambda) ~ error("A") + trend("N") + season("A")),
    # transformed multiplicative ETS model
    multiplicative_ts = ETS(box_cox(Cash,atm1_lambda) ~ error("M") + trend("N") + season("M")),
    # transformed SNAIVE model
    snaive_ts = SNAIVE(box_cox(Cash,atm1_lambda)),
    # arima model
    ARIMA = ARIMA(box_cox(Cash,atm1_lambda))
  )

accuracy(atm1_fit)

For ATM 1, the transformed multiplicative ETS model produced the lowest RMSE.
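Since accuracy() returns one row per model, a quick ranking makes the comparison explicit; a minimal sketch:

# Sketch: rank candidate models by training RMSE
accuracy(atm1_fit) %>%
  arrange(RMSE) %>%
  select(.model, RMSE, MAE, MAPE)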

atm1_forecast <- atm1_fit %>%
  forecast(h = 31) %>%
  filter(.model=='multiplicative_ts')


atm1_forecast %>%
  autoplot(atm1, level=NULL) +
  labs(title="Transformed Multiplicative ETS Forecast")

# report of the transformed multiplicative ETS model
atm1_fit %>% select(.model = "multiplicative_ts") %>% report()
## Series: Cash 
## Model: ETS(M,N,M) 
## Transformation: box_cox(Cash, atm1_lambda) 
##   Smoothing parameters:
##     alpha = 0.03183972 
##     gamma = 0.2749021 
## 
##   Initial states:
##      l[0]      s[0]    s[-1]    s[-2]    s[-3]    s[-4]    s[-5]    s[-6]
##  11.94849 0.4187292 1.012255 1.149805 1.054564 1.076637 1.048191 1.239819
## 
##   sigma^2:  0.0533
## 
##      AIC     AICc      BIC 
## 2657.902 2658.581 2696.043
# residuals
atm1_fit %>%
  select("multiplicative_ts") %>%
  gg_tsresiduals()

# Box-Pierce test, ℓ = 2m for seasonal data, m = 7
atm1_fit %>% 
  select(.model = "multiplicative_ts") %>%
  augment() %>% 
  features(.innov, box_pierce, lag = 14, dof = 0)

Exploring ATM 2

atm2 <- atm %>%
  filter(ATM == 'ATM2')

# ATM2 STL Decomposition
atm2 %>%
  model(STL(Cash ~ season(window="periodic"), robust = TRUE)) %>%
  components() %>%
  autoplot() +
  labs(title = "ATM2 with Seasonal Decomposition")

Based on the seasonal decomposition above, we can see that there is no significant trend in the series. The data follows a weekly seasonality. We also see that the remainder is mostly random, but exhibits seemingly non-random behavior around March of 2010.

atm2 |>
  gg_lag(Cash, geom = "point") +
  labs(x = "lag(Cash, k)")

Now looking at the lag plot for ATM 2, we can once again see that this ATM’s time series follows a weekly (7-day) seasonality cycle.

# create training set, holding out the final month for validation

atm2_train <- atm2 %>%
  filter(DATE < '2010-04-01')
# lambda for box cox transformation
atm2_lambda <- atm2_train %>%
  features(Cash, features = guerrero) %>%
  pull(lambda_guerrero)

atm2_fit <- atm2_train %>%
  model(
    # additive ETS model
    additive = ETS(Cash ~ error("A") + trend("N") + season("A")),
    # multiplicative ETS model
    multiplicative = ETS(Cash ~ error("M") + trend("N") + season("M")),
    # SNAIVE model
    snaive = SNAIVE(Cash),
    # transformed additive ETS model
    additive_ts = ETS(box_cox(Cash,atm2_lambda) ~ error("A") + trend("N") + season("A")),
    # transformed multiplicative ETS model
    multiplicative_ts = ETS(box_cox(Cash,atm2_lambda) ~ error("M") + trend("N") + season("M")),
    # transformed SNAIVE model
    snaive_ts = SNAIVE(box_cox(Cash,atm2_lambda)),
    # arima model
    ARIMA = ARIMA(box_cox(Cash,atm2_lambda))
  )

accuracy(atm2_fit)

For ATM 2, ARIMA and the transformed additive ETS model produced the lowest RMSE values. We can compare their AIC and BIC to determine which model we should move forward with.
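One caveat: information criteria are only strictly comparable between models of the same class fitted to the same (transformed) response, so this comparison is indicative rather than decisive. Rather than reading two separate reports, glance() can pull the criteria side by side; a minimal sketch:

# Sketch: compare information criteria across the fitted models
# (values will be NA for models, like SNAIVE, that report no likelihood)
atm2_fit %>%
  glance() %>%
  select(.model, AIC, AICc, BIC)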

atm2_fit %>%
  forecast(h = 31) %>%
  filter(.model %in% c('additive_ts', 'ARIMA')) %>%
  autoplot(atm2, level=NULL) +
  labs(title="Transformed Additive ETS vs. ARIMA Forecasts")

# report of the transformed additive ETS model
atm2_fit %>% select(.model = "additive_ts") %>% report()
## Series: Cash 
## Model: ETS(A,N,A) 
## Transformation: box_cox(Cash, atm2_lambda) 
##   Smoothing parameters:
##     alpha = 0.0001000105 
##     gamma = 0.3788876 
## 
##   Initial states:
##      l[0]      s[0]     s[-1]    s[-2]    s[-3]    s[-4]    s[-5]    s[-6]
##  28.54438 -19.49563 -14.30877 4.824496 1.906367 7.832009 9.026157 10.21537
## 
##   sigma^2:  83.0954
## 
##      AIC     AICc      BIC 
## 3439.307 3439.986 3477.448
# report of the ARIMA model
atm2_fit %>% select(.model = "ARIMA") %>% report()
## Series: Cash 
## Model: ARIMA(2,0,2)(0,1,1)[7] 
## Transformation: box_cox(Cash, atm2_lambda) 
## 
## Coefficients:
##           ar1      ar2    ma1     ma2     sma1
##       -0.4242  -0.9271  0.462  0.8169  -0.7544
## s.e.   0.0503   0.0425  0.081  0.0574   0.0460
## 
## sigma^2 estimated as 78.22:  log likelihood=-1180.26
## AIC=2372.52   AICc=2372.78   BIC=2395.27

The ARIMA model has significantly lower AIC and BIC values. Considering it also has the lowest RMSE, ARIMA is the clear choice to move forward with for ATM2.

atm2_forecast <- atm2_fit %>%
  forecast(h = 31) %>%
  filter(.model %in% c('ARIMA'))
# residuals
atm2_fit %>%
  select("ARIMA") %>%
  gg_tsresiduals()

# Box-Pierce test, ℓ = 2m for seasonal data, m = 7
atm2_fit %>% 
  select(.model = "ARIMA") %>%
  augment() %>% 
  features(.innov, box_pierce, lag = 14, dof = 0)

Exploring ATM 4

atm4 <- atm %>%
  filter(ATM == 'ATM4')

# ATM4 STL Decomposition
atm4 %>%
  model(STL(Cash ~ season(window="periodic"), robust = TRUE)) %>%
  components() %>%
  autoplot() +
  labs(title = "ATM4 with Seasonal Decomposition")

Based on the seasonal decomposition above, we can see that there is no significant trend in the series. The data follows a weekly seasonality. We also see that the remainder is mostly random, but exhibits seemingly non-random behavior around March of 2010.

atm4 |>
  gg_lag(Cash, geom = "point") +
  labs(x = "lag(Cash, k)")

Now looking at the lag plot for ATM 4, we can once again see that this ATM’s time series follows a weekly (7-day) seasonality cycle.

# create training set, holding out the final month for validation

atm4_train <- atm4 %>%
  filter(DATE < '2010-04-01')
# lambda for box cox transformation
atm4_lambda <- atm4_train %>%
  features(Cash, features = guerrero) %>%
  pull(lambda_guerrero)

atm4_fit <- atm4_train %>%
  model(
    # additive ETS model
    additive = ETS(Cash ~ error("A") + trend("N") + season("A")),
    # multiplicative ETS model
    multiplicative = ETS(Cash ~ error("M") + trend("N") + season("M")),
    # SNAIVE model
    snaive = SNAIVE(Cash),
    # transformed additive ETS model
    additive_ts = ETS(box_cox(Cash,atm4_lambda) ~ error("A") + trend("N") + season("A")),
    # transformed multiplicative ETS model
    multiplicative_ts = ETS(box_cox(Cash,atm4_lambda) ~ error("M") + trend("N") + season("M")),
    # transformed SNAIVE model
    snaive_ts = SNAIVE(box_cox(Cash,atm4_lambda)),
    # arima model
    ARIMA = ARIMA(box_cox(Cash,atm4_lambda))
  )

accuracy(atm4_fit)

For ATM 4, the transformed additive ETS model produced the lowest RMSE.

atm4_forecast <- atm4_fit %>%
  forecast(h = 40) %>%
  filter(.model=='additive_ts')


atm4_forecast %>%
  autoplot(atm4, level=NULL) +
  labs(title="Transformed Additive ETS Forecast")

While the mean of the forecasted values matches the mean of the last months in the series, we can see that the chosen model does not do a great job of capturing the true variance of cash withdrawals from ATM 4.
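One way to judge whether that shortfall matters is to re-plot the forecast with its prediction intervals and check that the observed day-to-day spread at least falls inside the forecast distribution; a minimal sketch:

# Sketch: re-plot with the default 80/95% prediction intervals
atm4_forecast %>%
  autoplot(atm4) +
  labs(title = "Transformed Additive ETS Forecast with Prediction Intervals")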

# report of the transformed additive ETS model
atm4_fit %>% select(.model = "additive_ts") %>% report()
## Series: Cash 
## Model: ETS(A,N,A) 
## Transformation: box_cox(Cash, atm4_lambda) 
##   Smoothing parameters:
##     alpha = 0.000100027 
##     gamma = 0.0001000061 
## 
##   Initial states:
##      l[0]      s[0]        s[-1]     s[-2]    s[-3]    s[-4]    s[-5]   s[-6]
##  23.12951 -10.53245 -0.008230648 0.3562908 2.212736 2.473327 2.258718 3.23961
## 
##   sigma^2:  92.8782
## 
##      AIC     AICc      BIC 
## 3476.592 3477.271 3514.734
# residuals
atm4_fit %>%
  select("additive_ts") %>%
  gg_tsresiduals()

# Box-Pierce test, ℓ = 2m for seasonal data, m = 7
atm4_fit %>% 
  select(.model = "additive_ts") %>%
  augment() %>% 
  features(.innov, box_pierce, lag = 14, dof = 0)

Producing Forecasts for May

forecasts <- bind_rows(atm1_forecast, atm2_forecast, atm4_forecast) %>%
  as.data.frame() %>%
  select(DATE, ATM, .mean) %>%
  rename(Cash = .mean)
forecasts
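Per the assignment, the forecasts also need to be delivered in an Excel-readable file. A minimal sketch using base R (the filename is an assumption):

# Sketch: export the combined May forecasts to an Excel-readable CSV
write.csv(forecasts, "ATM624_forecasts_May2010.csv", row.names = FALSE)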
# plot the combined forecasts against the historical series
forecasts %>%
  as_tsibble(index = DATE, key = ATM) %>%
  autoplot(Cash) +
  autolayer(atm, Cash, colour = "black") +
  facet_wrap(~ATM, scales = "free", nrow = 4) +
  labs(title = "ATM Withdrawls")