library(fpp3)
library(tidyverse)
library(ggrepel)
library(seasonal)
library(fabletools)
library(corrplot)
library(GGally)
library(patchwork)
library(caret)
library(readxl)
library(imputeTS)
theme_set(theme_bw())

Problem

In part A, I want you to forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable ‘Cash’ is provided in hundreds of dollars, other than that it is straight forward. I am being somewhat ambiguous on purpose to make this have a little more business feeling. Explain and demonstrate your process, techniques used and not used, and your actual forecast. I am giving you data via an excel file, please provide your written report on your findings, visuals, discussion and your R code via an RPubs link along with the actual.rmd file Also please submit the forecast which you will put in an Excel readable file.

Download Data and Handling Missing Values

The first step in the analysis is to download the data and explore any missing values within the dataset. The below code loads the data from the provide Excel file and checks for any missing values within the ATM column first.

# Load in Data from Excel File
atm_data = read_excel("ATM624Data.xlsx", col_types = c("date", "text", "numeric"))
# Look for Any Missing Values in ATM Column
atm_data |> filter(is.na(ATM))

For all missing values in ATM column above, the corresponding Cash value is also missing. Thus, this data can be removed. The below code removes the rows of data where the ATM value is missing, and also sorts the data and converts the DATE column into the same data type.

After transforming the dataset, I converted the dataset into a tsibble and looked at whether there were any missing values in the Cash column. For the missing values in the Cash column, linear interpolation is used to estimate the missing values. This was chosen instead of removing these observations because this would affect the analysis when looking at potential seasonality.

# Filter, sort, and mutate data
atm_data = atm_data |> arrange(DATE, ATM) |> filter(!is.na(ATM)) |> mutate(DATE = as_date(DATE))
# Convert Data into a Tsibble
atm_data = as_tsibble(atm_data, index = "DATE", key = "ATM")
# Look for Any Missing Values in Cash Column
atm_data |> filter(is.na(Cash))
# Use Linear Interpolation to Fill in Missing Values
atm_data_inter = atm_data |> select(Cash) |> na_interpolation(option = "linear")

Handling Outliers

After managing all of the missing values in the dataset, the dataset is further analyzed for any potential outliers.

# Plot the Data
atm_data_inter |> autoplot() +
  labs(title = "Time Plot of Cash Withdrawn for 4 ATMs", y = "Cash Withdrawn (in hundreds)")

# Sort Data to Find Max Value of Cash to Look at Outlier
atm_data_inter |> arrange(desc(Cash))

Looking at the plot above, there appears to be a clear outlier from ATM4 that occurs on 2010-02-09 and has a value of 10,919.76, wheras next highest value is 1712.07. This appears to be a typo since all other ATMs don’t have a major outlier on this day, and this value is almost 10x more than the next closest maximum. As a result, I wanted to look at modifying this observation using imputation by first looking at whether there is strong seasonality that can be used to modify this value.

atm_data_inter |> filter(ATM == "ATM4") |> gg_season(Cash, period = "week") +
  labs(title = "Seasonal Plot of Cash Withdrawn for ATM4", y = "Cash Withdrawn (in hundreds)")

atm_data_inter |> filter(ATM == "ATM4") |> ACF(Cash, lag_max = 28) |> autoplot() +
  labs(title = "ACF Plot of Cash Withdrawn for ATM4")

atm_data_inter |> filter(ATM == "ATM4") |> gg_lag(Cash, lags = 1:7, geom = "point") +
  labs(title = "Lag Plot of Cash Withdrawn for ATM4")

Upon initial investigation, there does not appear to be strong seasonality within the data that can be used for imputation. As a result, linear interpolation was used to adjust impute the value of the outlier.

atm_outlier_before = atm_data_inter |>
  filter(ATM == "ATM4", DATE == "2010-02-08") |>
  select(Cash)
atm_outlier_after = atm_data_inter |>
  filter(ATM == "ATM4", DATE == "2010-02-10") |>
  select(Cash)
atm_data_inter = atm_data_inter |>
  mutate(Cash = case_when(
    DATE == "2010-02-09" ~ mean(c(atm_outlier_before$Cash, atm_outlier_after$Cash)),
    .default = Cash))

After using imputation to modify the outlier, the data was replotted below to verify the outlier was removed, and also verify there are no other major outliers. Additionally, the cash withdrawn from each ATM was summed to get one dataset to analyze for the remainder of the report.

# Replot Data
atm_data_inter |> autoplot() +
  labs(title = "Time Plot of Cash Withdrawn for 4 ATMs after\nHandling Outliers and Missing Data",
       y = "Cash Withdrawn (in hundreds)")

# Get Total Cash Across all ATMs for Every Day in Data
atm_data_total = atm_data_inter |> summarise(total_cash = sum(Cash))

Data Exploration and Plots

Now that all missing values and outliers have been handled, I wanted to further investigate the data for potential trends and seasonality. Since the data is only given for 1 year, there is no cyclicity.

# Plot Total Cash Data
atm_data_total |> autoplot() +
  labs(title = "Time Plot of Cash Withdrawn for All ATMs", y = "Cash Withdrawn (in hundreds)")

# Check for Seasonality
atm_data_total |> gg_season(total_cash, period = "week") +
  labs(title = "Seasonal Plot of Cash Withdrawn for All ATMs", y = "Cash Withdrawn (in hundreds)")

atm_data_total |> ACF(total_cash, lag_max = 35) |> autoplot() +
  labs(title = "ACF Plot of Cash Withdrawn for All ATMs")

atm_data_total |> gg_lag(total_cash, lags = 1:14, geom = "point") +
  labs(title = "Lag Plot of Cash Withdrawn for All ATMs")

Based on the time, seasonal, and lag plots above, there does not appear to be any obvious trend in the data, and it is unclear whether there is seasonality within the data. However, the ACF plot shows peaks at lag multiples of 7, all of which lie above the line showing significance when compared to zero. This suggests that there is likely weekly seasonality.

Check for Possible Transformations

After checking the data for possible trends and seasonality, the data was analyzed to see if any data transformations would improve the analysis.

BoxCoxTrans(atm_data_total$total_cash)
## Box-Cox Transformation
## 
## 365 data points used to estimate Lambda
## 
## Input data summary:
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##    8.507  271.668  558.175  593.656  865.427 1981.075 
## 
## Largest/Smallest: 233 
## Sample Skewness: 0.502 
## 
## Estimated Lambda: 0.6

Although the above function suggests that taking the square root of the data would result in a better model, after looking at the time plot above, the magnitude of the peaks appears to increase at first, followed by a decrease in magnitude. Therefore, a Box Cox transformation can not be used.

Checking Different Models

Training and Test Sets

In order to test different models and their accuracy, a training and test set will be generated from the dataset. The training data will then be used to generate a forecast, which will then be compared to the test set. For the split, 80% of the data will be used for the training set, and 20% will be used for the test set.

# Training Set
atm_train = atm_data_total |> head(nrow(atm_data_total)*0.8)

# Test Set
atm_test = atm_data_total |> tail(nrow(atm_data_total)*0.2)

Decomposing Data

Before generating models, I wanted to further explore the data by decomposing the dataset into their seasonal, trend, and remainder components to get a better idea of how the data is behaving. The below code uses various decomposition methods to break these components out.

# STL Decomposition
stl_dcmp = atm_train |> model(stl = STL(total_cash))
stl_dcmp |> components() |> autoplot() 

# Classical Decomposition (Additive)
class_dcmp_add = atm_train |> model(classical_decomposition(total_cash, type = "additive"))
class_dcmp_add |> components() |> autoplot()

# Classical Decomposition (Multiplicative)
class_dcmp_mul = atm_train |> model(classical_decomposition(total_cash, type = "multiplicative"))
class_dcmp_mul |> components() |> autoplot()

One thing to note is that the X11 and SEATS methods for decomposition cannot be used since the data is daily, and the X11 and SEATS methods have not been developed for daily data.

All decomposition methods above show that there is some level of seasonality within the data. Additionally, the trend component does not appear to clearly positive or negative trend, but instead appears to oscillate around a value of about 600.

Simple Models

The following few subsections create models using a variety of methods. The first section below creates models using simple methods.

# Mean Method
mean_model = atm_train |> model(Mean = MEAN(total_cash))

# Naive Method
naive_model = atm_train |> model(Naive = NAIVE(total_cash))

# Seasonal Naive Method
snaive_model = atm_train |> model(SNaive = SNAIVE(total_cash ~ lag("week")))

# Drift Method
drift_model = atm_train |> model(Drift = RW(total_cash ~ drift()))

Models with Decomposition

The following models use simple models on seasonally adjusted data. The data has been decomposed using the STL method. Only the STL Model is be used because the classic decomposition methods utilize moving averages that doesn’t allow for the trends to be calculated for the first and last few observations.

# MEAN Method on Seasonally Adjusted Data
stl_mean_model = atm_train |> model(STL_MEAN = decomposition_model(STL(total_cash),
                                                                     MEAN(season_adjust)))
# Naive Method on Seasonally Adjusted Data
stl_naive_model = atm_train |> model(STL_NAIVE = decomposition_model(STL(total_cash),
                                                                     NAIVE(season_adjust)))

Exponential Smoothing

The following code generates a model using exponential smoothing. The parameters used for the model are automatically selected from the ETS function.

ets_model = atm_train |> model(ETS = ETS(total_cash))
report(ets_model)
## Series: total_cash 
## Model: ETS(M,N,M) 
##   Smoothing parameters:
##     alpha = 0.02424798 
##     gamma = 0.0001004256 
## 
##   Initial states:
##    l[0]      s[0]     s[-1]   s[-2]    s[-3]   s[-4]    s[-5]    s[-6]
##  567.65 0.1814013 0.7760135 1.31874 1.044221 1.21863 1.143943 1.317051
## 
##   sigma^2:  0.317
## 
##      AIC     AICc      BIC 
## 4970.215 4970.998 5006.982
components(ets_model) |> autoplot()

Based on the output from the model above, the best exponential model is an ETS(M,N,M) model, which has multiplicative errors, no trend component, and multiplicative seasonality.

ARIMA Modeling

The following code looks into ARIMA modeling. Before generating the model, the dataset needs to be stationary.

atm_train |> gg_tsdisplay(total_cash, plot_type = "partial", lag = 28) +
  labs(title = "Plots of Cash Withdrawn for All ATMs", y = "Cash Withdrawn (in hundreds)")

Looking at the plot above, the data does not appear to be stationary due to potential seasonality within the dataset. As a result, differencing needs to be looked at.

# Check for Number of Differences Based on Unit Root Tests
atm_train |> features(total_cash, unitroot_kpss) 
atm_train |> features(total_cash, unitroot_ndiffs)
# Check for Number of Seasonal Differences Based on Unit Root Tests
atm_train |> features(total_cash, feat_stl) 
atm_train |> features(total_cash, unitroot_nsdiffs)

The KPSS root test shows a p-value greater than 0.05, which suggests that no differencing is required. Additionally, the seasonal strength is less than 0.64, which suggests that no seasonal differencing is required. Based on these unit root tests, the data appears to be stationary without any differencing. However, since based on previous discussions the data appears to have weekly seasonality, a seasonal difference of 7 is used in the following analysis.

atm_train |> gg_tsdisplay(difference(total_cash, 7), plot_type = "partial", lag = 28) +
  labs(title = "Plots of Cash Withdrawn for All ATMs after Seasonal Differencing",
       y = "Cash Withdrawn (in hundreds)")

# Check for Number of Differences Based on Unit Root Tests
atm_train |> features(difference(total_cash, 7), unitroot_kpss)
atm_train |> features(difference(total_cash, 7), unitroot_ndiffs)

The data appears to be stationary after a seasonal differencing of 7. The significant spike at lag = 7 on the ACF plot above suggests a seasonal MA(1) component. Additionally, the spike at lag = 3 on the ACF plot suggests a non-seasonal MA(3) component. Overall, this would suggest an \(ARIMA(0,0,3)(0,1,1)_7\) model. However, the spike at lag = 3 on the PACF plot suggests a non-seasonal AR(3) component. This would suggest an \(ARIMA(3,0,0)(0,1,1)_7\) model. Both of these models, along with a model that is automatically selected by the ARIMA function, will be compared.

arima_models = atm_train |>
  model(
    arima003011 = ARIMA(total_cash ~ pdq(0,0,3) + PDQ(0,1,1)),
    arima300011 = ARIMA(total_cash ~ pdq(3,0,0) + PDQ(0,1,1)),
    arima_auto = ARIMA(total_cash, stepwise = FALSE, approx = FALSE)
  )

arima_models |> pivot_longer(everything(), names_to = "Model name",
                     values_to = "Orders")

Looking at the list of models above, the model that was automatically generated from the ARIMA function was an \(ARIMA(0,0,0)(2,0,0)_7\ with\ mean\).

Generating Forecasts and Comparing Results

Now that all of the models have been generated using the training dataset, all models will generate forecasts and compared to the test dataset. The below code generates these forecasts against the test set.

# Mean Method
mean_forecast = mean_model |> forecast(new_data = atm_test)

# Naive Method
naive_forecast = naive_model |> forecast(new_data = atm_test)

# Seasonal Naive Method
snaive_forecast = snaive_model |> forecast(new_data = atm_test)

# Drift Method
drift_forecast = drift_model |> forecast(new_data = atm_test)

# Naive Method Using Seasonally Adjusted Data from STL Method
stl_naive_forecast = stl_naive_model |> forecast(new_data = atm_test)
stl_mean_forecast = stl_mean_model |> forecast(new_data = atm_test)

# ETS Method
ets_forecast = ets_model |> forecast(new_data = atm_test)

# ARIMA Methods
arima_forecasts = arima_models |> forecast(new_data = atm_test)

The code below generates 3 different tables using the forecasts above: performance comparisons based on residuals, performance comparisons based on point forecasts, and performance comparisons based on distributional forecasts.

# Compare Performance of Models Based on Residuals
bind_rows(
  mean_model |> accuracy(),
  naive_model |> accuracy(),
  snaive_model |> accuracy(),
  drift_model |> accuracy(),
  stl_mean_model |> accuracy(),
  stl_naive_model |> accuracy(),
  ets_model |> accuracy(),
  arima_models |> accuracy()
)  |>
  select(-ME, -MPE, -ACF1) |>
  arrange(RMSE)
# Compare Performance of Models Based on Point Forecasts
bind_rows(
  mean_forecast |> accuracy(atm_data_total),
  naive_forecast |> accuracy(atm_data_total),
  snaive_forecast |> accuracy(atm_data_total),
  drift_forecast |> accuracy(atm_data_total),
  stl_mean_forecast |> accuracy(atm_data_total),
  stl_naive_forecast |> accuracy(atm_data_total),
  ets_forecast |> accuracy(atm_data_total),
  arima_forecasts |> accuracy(atm_data_total)
)  |>
  select(-ME, -MPE, -ACF1) |>
  arrange(RMSE)
# Compare Performance of Models Based on Distributional Forecasts
bind_rows(
  mean_forecast |> accuracy(atm_data_total, list(crps = CRPS)),
  naive_forecast |> accuracy(atm_data_total, list(crps = CRPS)),
  snaive_forecast |> accuracy(atm_data_total, list(crps = CRPS)),
  drift_forecast |> accuracy(atm_data_total, list(crps = CRPS)),
  stl_mean_forecast |> accuracy(atm_data_total, list(crps = CRPS)),
  stl_naive_forecast |> accuracy(atm_data_total, list(crps = CRPS)),
  ets_forecast |> accuracy(atm_data_total, list(crps = CRPS)),
  arima_forecasts |> accuracy(atm_data_total, list(crps = CRPS))
) |> arrange(crps)

Based on the tables above, the method that performed the best when modelling over the training set was the MEAN method using the seasonally adjusted data from the STL decomposition, followed by the expontential smoothing method. However, the method that generated the best forecast on the test dataset when compared to the actual dataset is the MEAN method. This is based on having the lowest Root Mean Squared Error (RMSE) value for the point forecasts, as well as the lowest Continuous Ranked Probability Score (CRPS) value for the distributional forecasts. As a result, the MEAN method should be used for forecasting future data. However, before using the MEAN method for forecasting, the residuals are evaluated to ensure that everything has been captured in the model.

mean_model |> gg_tsresiduals(lag_max = 28) +
  labs(title = "Residual Plots for MEAN Model of Cash Withdrawn for All ATMs",
       y = "Cash Withdrawn (in hundreds)")

mean_model |> augment() |> features(.innov, ljung_box, lag = 14)

Based on the plot below, it does not appear that the residuals behave like white noise. This is confirmed with the Ljung-Box test, which has a p-value of less than 0.05. Additionally, the distribution of residuals is not normal, but instead right skewed. These bits of information suggest that the MEAN method does not capture all necessary information from the data. As a result, the next best model when looking at the performance of the forecasts on the test dataset is the ARIMA model that was auto-selected, which was \(ARIMA(0,0,0)(2,0,0)_7\ with\ mean\).

arima_models |> select(arima_auto) |> gg_tsresiduals(lag_max = 28) +
  labs(title = "Residual Plots for ARIMA Model of Cash Withdrawn for All ATMs",
       y = "Cash Withdrawn (in hundreds)")

arima_models |> select(arima_auto) |> augment() |> features(.innov, ljung_box, lag = 14)
arima_forecasts |> filter(.model == "arima_auto") |> autoplot(atm_data_total)

Looking at the residual plots above, the residuals appear to behave like white noise and normally distributed. They also appear to have a mean about zero, suggesting this is a good model. This is confirmed with the Ljung-Box test, which has a p-value greater than 0.05. As a result, the \(ARIMA(0,0,0)(2,0,0)_7\ with\ mean\) model can be used for forecasting.

Generating Future Forecasts

The following section generates a 14 day forecast for the total cash pulled out of all ATMs using the \(ARIMA(0,0,0)(2,0,0)_7\ with\ mean\) model. The model utilizes all data from the dataset to ensure all information is taken into consideration.

atm_arima = atm_data_total |> model(ARIMA(total_cash ~ 1 + pdq(0,0,0) + PDQ(2,0,0)))
report(atm_arima)
## Series: total_cash 
## Model: ARIMA(0,0,0)(2,0,0)[7] w/ mean 
## 
## Coefficients:
##         sar1    sar2  constant
##       0.2346  0.1659  354.5421
## s.e.  0.0516  0.0519   19.1513
## 
## sigma^2 estimated as 139824:  log likelihood=-2679.18
## AIC=5366.35   AICc=5366.46   BIC=5381.95
atm_arima_forecast = atm_arima |> forecast(h = 14)
atm_arima_forecast |> autoplot(atm_data_total) +
  labs(title = paste("Time Plot of Total Cash Pulled out of 4 ATMs with",
       "14 Day Forecast Using ARIMA(0,0,0)(2,0,0)7 with Mean Model", sep = "\n"))

# Convert Forecast Distributions to Intervals
atm_arima_intervals = atm_arima_forecast |> hilo()

# Total Cash Pulled out Over 14 Day Forecast Based on Mean
sum(atm_arima_intervals$.mean)
## [1] 7636.013
# Total Cash Pulled out Over 14 Day Forecast Based on 80% Confidence Interval
sum(atm_arima_intervals$`80%`$lower)
## [1] 835.9873
sum(atm_arima_intervals$`80%`$upper)
## [1] 14436.04
# Total Cash Pulled out Over 14 Day Forecast Based on 95% Confidence Interval
sum(atm_arima_intervals$`95%`$lower)
## [1] -2763.729
sum(atm_arima_intervals$`95%`$upper)
## [1] 18035.75

The plot above shows a 14 Day Forecast for the total cash to be pulled out from 4 ATMs. Based on the forecast intervals, the total cash pulled out over the next 14 days has an 80% chance of being between $83,598.73 and $144,360.40, and a 95% chance of being between $0.00 and $180,357.50. (Note that the lower interval for the 95% interval was raised to $0.00 since the data represents the total amount of cash pulled out of ATMs, which cannot be negative.) The average total cash pulled out over the next 14 days is $76,360.13.

The final bit of code below exports the relevant outputs from the forecast to a CSV file.

atm_arima_intervals_full = atm_arima_intervals |> mutate(
  lower_interval_95 = atm_arima_intervals$`95%`$lower,
  upper_interval_95 = atm_arima_intervals$`95%`$upper,
  lower_interval_80 = atm_arima_intervals$`80%`$lower,
  upper_interval_80 = atm_arima_intervals$`80%`$upper
)

atm_arima_intervals_full = atm_arima_intervals_full |> 
  select(-.model, -total_cash, -`80%`, -`95%`) |>
  rename("total_cash_mean" = .mean)

write_excel_csv(atm_arima_intervals_full, "data624_partA_forecast.csv")