Project Instructions

This project consists of 3 parts - two required and one bonus and is worth 15% of your grade. The project is due at 11:59 PM on Sunday Oct 25. I will accept late submissions with a penalty until the meetup after that when we review some projects.

Part A – ATM Forecast, ATM624Data.xlsx

In part A, I want you to forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable ‘Cash’ is provided in hundreds of dollars, other than that it is straight forward. I am being somewhat ambiguous on purpose to make this have a little more business feeling. Explain and demonstrate your process, techniques used and not used, and your actual forecast. I am giving you data via an excel file, please provide your written report on your findings, visuals, discussion and your R code via an RPubs link along with the actual.rmd file Also please submit the forecast which you will put in an Excel readable file.

Setup

atm <- read_excel(file.path(projPath, "ATM624Data.xlsx"))
head(atm)

## # A tibble: 6 × 3
##    DATE ATM    Cash
##   <dbl> <chr> <dbl>
## 1 39934 ATM1     96
## 2 39934 ATM2    107
## 3 39935 ATM1     82
## 4 39935 ATM2     89
## 5 39936 ATM1     85
## 6 39936 ATM2     90

unique(atm$ATM)

## [1] "ATM1" "ATM2" NA     "ATM3" "ATM4"

# Fixing the date formatting
atm$DATE <- as.Date(atm$DATE, origin = "1900-01-01")
head(atm)

## # A tibble: 6 × 3
##   DATE       ATM    Cash
##   <date>     <chr> <dbl>
## 1 2009-05-03 ATM1     96
## 2 2009-05-03 ATM2    107
## 3 2009-05-04 ATM1     82
## 4 2009-05-04 ATM2     89
## 5 2009-05-05 ATM1     85
## 6 2009-05-05 ATM2     90

# Tsibble
atm <- as_tsibble(atm, index = DATE, key = ATM)

Basic Plots

# Basic Plot
atm %>% 
  autoplot(Cash)

## Warning: Removed 14 rows containing missing values or values outside the scale range
## (`geom_line()`).

  labs(title = "Basic Plot of Cash (in hundreds of dollars) withdrawn from ATMS over time") +
  scale_x_date(
    date_breaks = "1 month",   # I want a label per month
    date_labels = "%b %Y"      # Mon. Year format
  ) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

## NULL

# Basic Plot ATM 1
atm %>% 
  filter(ATM == "ATM1") %>%
  autoplot(Cash) + 
  labs(title = "Basic Plot of Cash (in hundreds of dollars) withdrawn from ATM1") +
  scale_x_date(
    date_breaks = "1 month",   # I want a label per month
    date_labels = "%b %Y"      # Mon. Year format
  ) + 
  geom_line(color = "#F8766D") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Basic Plot ATM 2
atm %>% 
  filter(ATM == "ATM2") %>%
  autoplot(Cash) + 
  labs(title = "Basic Plot of Cash (in hundreds of dollars) withdrawn from ATM2") +
  scale_x_date(
    date_breaks = "1 month",   # I want a label per month
    date_labels = "%b %Y"      # Mon. Year format
  ) +
  geom_line(color = "#7CAE00") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Basic Plot ATM 3
atm %>% 
  filter(ATM == "ATM3") %>%
  autoplot(Cash) + 
  labs(title = "Basic Plot of Cash (in hundreds of dollars) withdrawn from ATM3") +
  scale_x_date(
    date_breaks = "1 month",   # I want a label per month
    date_labels = "%b %Y"      # Mon. Year format
  ) +
  geom_line(color = "#00BFC4") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Basic Plot ATM 4
atm %>% 
  filter(ATM == "ATM4") %>%
  autoplot(Cash) + 
  labs(title = "Basic Plot of Cash (in hundreds of dollars) withdrawn from ATM4") +
  scale_x_date(
    date_breaks = "1 month",   # I want a label per month
    date_labels = "%b %Y"      # Mon. Year format
  ) +
  geom_line(color = "#C77CFF") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Data Exploration

Weird Specific ATM Behaviors

Looking at when ATM3 started getting withdrawals, and looking at the ATM4 Feb. spike.

# Days money moved for ATM3
atm %>% 
  filter((ATM == "ATM3") & (Cash != 0)) %>%
  kable(caption = "Days with non-zero cash movement for ATM3")

Days with non-zero cash movement for ATM3
DATE	ATM	Cash
2010-04-30	ATM3	96
2010-05-01	ATM3	82
2010-05-02	ATM3	85

# Days money spiked for ATM 4
atm %>% 
  filter((ATM == "ATM4") & (DATE >= '2010-02-01') & (DATE <= '2010-02-15')) %>%
  kable(caption = "ATM4 Cash Spike ($10,919*100 value on Feb 11)")

ATM4 Cash Spike ($10,919*100 value on Feb 11)
DATE	ATM	Cash
2010-02-01	ATM4	669.70765
2010-02-02	ATM4	979.67594
2010-02-03	ATM4	426.40601
2010-02-04	ATM4	153.24269
2010-02-05	ATM4	274.94992
2010-02-06	ATM4	136.27636
2010-02-07	ATM4	454.41695
2010-02-08	ATM4	458.30427
2010-02-09	ATM4	112.03063
2010-02-10	ATM4	417.91228
2010-02-11	ATM4	10919.76164
2010-02-12	ATM4	42.43808
2010-02-13	ATM4	280.04343
2010-02-14	ATM4	412.31888
2010-02-15	ATM4	852.83742

Summary Table

Get some high level counts and sums of dollars withdrawn, including counts for NA ATMs.

# Overall statistical summary
summary(atm)

##       DATE                ATM                 Cash        
##  Min.   :2009-05-03   Length:1474        Min.   :    0.0  
##  1st Qu.:2009-08-03   Class :character   1st Qu.:    0.5  
##  Median :2009-11-03   Mode  :character   Median :   73.0  
##  Mean   :2009-11-02                      Mean   :  155.6  
##  3rd Qu.:2010-02-03                      3rd Qu.:  114.0  
##  Max.   :2010-05-16                      Max.   :10919.8  
##                                          NA's   :19

# Aggregate counts and missing values by ATM
atm %>%
  as_tibble() %>%
  group_by(ATM) %>%
  summarise(
    count = n(), 
    total_cash = sum(Cash, na.rm = TRUE),
    zero_days = sum(Cash == 0, na.rm = TRUE), 
    missing_days = sum(is.na(Cash)) 
  ) %>%
  kable(caption = "Total records, cash, and missing days per ATM")

Total records, cash, and missing days per ATM
ATM	count	total_cash	zero_days	missing_days
ATM1	365	30367.0	0	3
ATM2	365	22716.0	2	2
ATM3	365	263.0	362	0
ATM4	365	173025.8	0	0
NA	14	0.0	0	14

NA Records

Further investigate when the ATM values are NA.

# Extract dates with NA ATM value
na_atm_dates <- atm %>%
  filter(is.na(ATM)) %>%
  pull(DATE)

# Check all ATM activity on those specific dates
atm %>%
  filter(DATE %in% na_atm_dates) %>%
  arrange(DATE, ATM) %>%
  kable(caption = "Dates with a missing 'NA' ATM record")

Dates with a missing ‘NA’ ATM record
DATE	ATM	Cash
2010-05-03	NA	NA
2010-05-04	NA	NA
2010-05-05	NA	NA
2010-05-06	NA	NA
2010-05-07	NA	NA
2010-05-08	NA	NA
2010-05-09	NA	NA
2010-05-10	NA	NA
2010-05-11	NA	NA
2010-05-12	NA	NA
2010-05-13	NA	NA
2010-05-14	NA	NA
2010-05-15	NA	NA
2010-05-16	NA	NA

Also investigate when the Cash values are NA, but it’s not that both ATM and Cash are NA.

Missing Cash Records

Next, let’s identify the specific dates where an ATM is known to exist, but its Cash withdrawal value is missing. This deliberately excludes completely blank phantom rows where both the ATM name and Cash value are NA.

# Filter for missing Cash, but ensure ATM is NOT missing
atm %>%
  filter(is.na(Cash) & !is.na(ATM)) %>%
  arrange(ATM, DATE) %>%
  kable(caption = "Table 5: Records with missing Cash values for known ATMs")

Table 5: Records with missing Cash values for known ATMs
DATE	ATM	Cash
2009-06-15	ATM1	NA
2009-06-18	ATM1	NA
2009-06-24	ATM1	NA
2009-06-20	ATM2	NA
2009-06-26	ATM2	NA

See if spikes in withdrawals are mondays

Being in fintech right now, I suspected that the oscillations on the graphs are Monday highs and weekend lows. I just wanted to dummy check that.

# Average Cash by Day of the Week and ATM
atm %>%
  as_tibble() %>% 
  mutate(Day_of_Week = wday(DATE, label = TRUE, abbr = FALSE)) %>% # get day of week
  group_by(ATM, Day_of_Week) %>%
  summarise(
    Avg_Cash = mean(Cash, na.rm = TRUE),
    .groups = 'drop'
  ) %>%
  pivot_wider(names_from = Day_of_Week, values_from = Avg_Cash) %>% # day of week to column
  kable(caption = "Average Daily Cash Withdrawals by ATM and Day of the Week", digits = 2)

Average Daily Cash Withdrawals by ATM and Day of the Week
ATM	Sunday	Monday	Tuesday	Wednesday	Thursday	Friday	Saturday
ATM1	98.64	96.61	102.65	86.00	89.57	82.15	31.69
ATM2	92.02	75.98	67.15	58.73	73.25	43.75	25.53
ATM3	1.60	0.00	0.00	0.00	0.00	1.85	1.58
ATM4	573.53	491.54	539.07	481.29	647.27	413.72	169.96
NA	NaN	NaN	NaN	NaN	NaN	NaN	NaN

Initial Notes

ATM4 is seeing the most money movement by far.
- In a business context, I’d confirm with someone that this is expected.
ATM4 has a crazy spike in withdrawal volume on Feb 11th.
- In a business context, I’d investigate if there was a known reason.
- In machine learning context, this is an outlier that may skew projections
ATM3 only sees money moving starting from 2010-04-30. Maybe ATM3 was only set up that time.
Withdrawal volume is highest at the start of the work week (Mon.) and lowest on the weekends.
There are some NA values for ATM in the data set. These seem to all have NA for Cash, and be from May (i.e. the bottom of the excel spreadsheet with the source data.) Since the ask here is to predict May data, I think this is safe to cut from the data set.
There are some NA values for Cash in the data set. The ATMs and dates these happened on don’t have a pattern or calendar significance (though they all take place in June). These probably need to be imputed.

Next steps

Remove the May values with NA for both ATM and Cash.
Impute the missing values.
Replace the outlier with imputed value

Cleaning NA and Outliers

Cut the May missing values

There is May data for the 1st and 2nd for all ATMs, so I’m leaving that as is. I’m just cutting the NAs there on.

# CUT NA records (which is May 3rd onward) 
atm_clean <- atm %>%
  drop_na(ATM)

Imputation for Cash NAs

For imputation, I’m opting to use ARIMA interpolation, one of the recommended methods in the Hyndman textbook.

# Check for time gaps (before ARIMA)
atm_clean %>% 
  has_gaps()

## # A tibble: 5 × 2
##   ATM   .gaps
##   <chr> <lgl>
## 1 ATM1  FALSE
## 2 ATM2  FALSE
## 3 ATM3  FALSE
## 4 ATM4  FALSE
## 5 <NA>  FALSE

# Where are the time gaps if there are some
atm_clean %>% 
  count_gaps() %>%
  knitr::kable(caption = "Time Gaps by ATM")

Time Gaps by ATM
ATM	.from	.to	.n

# No Time Gaps

# Remove "NA" as a key from when I originally made the data tsibble 
atm_clean <- atm_clean %>%
  as_tibble() %>%                   
  as_tsibble(index = DATE, key = ATM) # rebuild

key_data(atm_clean) %>% 
  select(-.rows) %>% 
  kable(caption = "Ensuring that NA is not a key in the Tsibble anymore")

Ensuring that NA is not a key in the Tsibble anymore
ATM
ATM1
ATM2
ATM3
ATM4

# Impute NAs with ARIMA 
atm_clean <- atm_clean %>%
  model(arima = ARIMA(Cash)) %>%
  interpolate(atm_clean)

Outlier Analysis and Imputation

I couldn’t use STL() until the NA values were handled. Now that that is done, I can use it for outlier analysis in the traditional way: using remainders > 3*IQR from the median remainder

# Check for outliers in addition to the one I found earlier
# Using remainders > 3x IQR from the median remainder
outlier_analysis <- atm_clean %>%
  model(stlo = STL(Cash ~ season(window = "periodic"), robust = TRUE)) %>%
  components() %>%
  group_by(ATM) %>%
  mutate(
    rem_iqr = IQR(remainder, na.rm = TRUE),
    rem_mid = median(remainder, na.rm = TRUE),
    is_outlier = abs(remainder - rem_mid) > (3 * rem_iqr)
  ) %>%
  ungroup()

# Look at Outliers
outlier_analysis %>%
  filter(is_outlier == TRUE)

## # A tsibble: 87 x 11 [1D]
## # Key:       ATM, .model [4]
##    ATM   .model DATE        Cash trend season_week remainder season_adjust
##    <chr> <chr>  <date>     <dbl> <dbl>       <dbl>     <dbl>         <dbl>
##  1 ATM1  stlo   2009-06-07     3  88.1       16.8     -102.         -13.8 
##  2 ATM1  stlo   2009-07-12    19  86.4       16.8      -84.2          2.20
##  3 ATM1  stlo   2009-08-27   174 103.        23.0       48.0        151.  
##  4 ATM1  stlo   2009-09-13     1  95.6       16.8     -111.         -15.8 
##  5 ATM1  stlo   2009-09-24   180  90.1       23.0       67.0        157.  
##  6 ATM1  stlo   2009-12-02     6  82.0        3.80     -79.8          2.20
##  7 ATM1  stlo   2009-12-31    42  89.8       23.0      -70.8         19.0 
##  8 ATM1  stlo   2010-01-01     2  92.6       -4.32     -86.2          6.32
##  9 ATM1  stlo   2010-01-31   152  81.4       16.8       53.8        135.  
## 10 ATM1  stlo   2010-02-07   138  79.1       16.8       42.1        121.  
## # ℹ 77 more rows
## # ℹ 3 more variables: rem_iqr <dbl>, rem_mid <dbl>, is_outlier <lgl>

# Visually validate the outliers
atm_clean %>%
  left_join(select(outlier_analysis, ATM, DATE, is_outlier), by = c("ATM", "DATE")) %>%
  ggplot(aes(x = DATE, y = Cash)) +
  geom_line(color = "grey60", alpha = 0.7) +
  geom_point(data = . %>% filter(is_outlier == TRUE), color = "red", size = 2) +
  facet_wrap(~ ATM, scales = "free_y") +
  labs(
    title = "Visual Inspection of Methodically Identified Outliers",
    subtitle = "Red dots indicate flagged outliers (3x IQR of STL Remainder)",
    x = "Date", 
    y = "Cash Withdrawn (Hundreds)"
  ) +
  theme_minimal()

This method classified a lot of data points as outliers that frankly, are not. It remains that the only “outlier” I am concerned with is the one from ATM4 on Feb. 11th. So I will impute that one only.

# Replace my One Feb 11th Outlier with NA
atm_clean <- atm_clean %>%
  mutate(
    Cash = if_else(ATM == "ATM4" & DATE == as.Date("2010-02-11"), NA_real_, Cash)
  ) # NA_real_ is explicitly for missing decimal number

# Impute
atm_clean <- atm_clean %>%
  model(arima = ARIMA(Cash)) %>%
  interpolate(atm_clean)

# Peaking at the imputed value
atm_clean %>%
  filter(ATM == "ATM4", DATE >= as.Date("2010-02-09") & DATE <= as.Date("2010-02-13")) %>%
  knitr::kable(caption = "ATM4: Feb 11th Outlier Successfully Imputed")

ATM4: Feb 11th Outlier Successfully Imputed
ATM	DATE	Cash
ATM4	2010-02-09	112.03063
ATM4	2010-02-10	417.91228
ATM4	2010-02-11	214.38754
ATM4	2010-02-12	42.43808
ATM4	2010-02-13	280.04343

Training and Test Sets

# Training Set (Everything up to March 31, 2010)
atm_train <- atm_clean %>% 
  filter(DATE <= as.Date("2010-03-31"))

# Test Set (April 1st through May 2nd)
atm_test <- atm_clean %>% 
  filter(DATE > as.Date("2010-03-31"))

# Verify the split
cat("Training Set ends on:", as.character(max(atm_train$DATE)), "and has", nrow(atm_train), "days.\n")

## Training Set ends on: 2010-03-31 and has 1332 days.

cat("Test Set starts on:", as.character(min(atm_test$DATE)), "and has", nrow(atm_test), "days.\n")

## Test Set starts on: 2010-04-01 and has 128 days.

Modeling ATM1

STL Decompositon for ATM1

# STL decomposition
atm_train %>%
  filter(ATM == "ATM1") %>%
  model(STL(Cash ~ season(window = "periodic"), robust = TRUE)) %>%
  components() %>%
  autoplot() +
  labs(title = "STL Decomposition for ATM1")

I already confirmed earlier that withdrawal volume is highest Mondays and lowest on the weekends. I can see weekly seasonality confirmed on the STL Decomp. view.

I also see that the trend wanders without much direction, but interestingly, the variation in the remainder becomes much higher from roughly March 2010 onwards.

Model Prep. ATM1

# Isolate ATM1 
atm1 <- atm_clean %>% filter(ATM == "ATM1")
atm1_train <- atm_train %>% filter(ATM == "ATM1")
atm1_test <- atm_test %>% filter(ATM == "ATM1")

#Optimal Box-Cox Lambda using Guerrero's method
lambda_atm1 <- atm1_train %>%
  features(Cash, features = guerrero) %>%
  pull(lambda_guerrero)

cat("Optimal Box-Cox Lambda for ATM1:", round(lambda_atm1, 4), "\n")

## Optimal Box-Cox Lambda for ATM1: 0.6236

# Check for required differencing on the transformed data 
# seasonal
nsdiffs_atm1 <- atm1_train %>%
  mutate(box_cash = box_cox(Cash, lambda_atm1)) %>%
  features(box_cash, unitroot_nsdiffs) %>%
  pull(nsdiffs)

cat("Seasonal differences required (nsdiffs):", nsdiffs_atm1, "\n")

## Seasonal differences required (nsdiffs): 1

# ordinary
ndiffs_atm1 <- atm1_train %>%
  mutate(box_cash = box_cox(Cash, lambda_atm1)) %>%
  features(box_cash, unitroot_ndiffs) %>%
  pull(ndiffs)

cat("Ordinary differences required (ndiffs):", ndiffs_atm1, "\n")

## Ordinary differences required (ndiffs): 0

I’m going to move forward using the optimal Box-Cox lambda because I am more focused on effective modeling than on explanatory power at this time.. The value of 0.6236 is a bit between 0.5 and 1. Where 0.5 is taking the square root transformation to stabilize the variance, and 1 is doing nothing (there is a stable enough variance).

(I’ll use the optimal lambda values as they are found for the rest of this project as well, rather than rounding to more “explainable” lambdas.)

Model Comparisons ATM1

# Fit models on training data
atm1_train_fit <- atm1_train %>%
  model(
    snaive = SNAIVE(box_cox(Cash, lambda_atm1)),
    ets = ETS(box_cox(Cash, lambda_atm1)),
    arima_auto = ARIMA(box_cox(Cash, lambda_atm1)),
    arima_manual = ARIMA(box_cox(Cash, lambda_atm1) ~ pdq(d=0) + PDQ(D=1))
  )

# Forecast the length of the TEST data
atm1_test_forecast <- atm1_train_fit %>%
  forecast(new_data = atm1_test)

# Grade the forecast against the true historical data to find out-of-sample accuracy
accuracy(atm1_test_forecast, atm1) %>%
  arrange(RMSE) %>%
  select(.model, RMSE, MAE, MAPE, MASE) %>% 
  knitr::kable(caption = "Out-of-Sample Accuracy Metrics for ATM1 (Test Set)")

Out-of-Sample Accuracy Metrics for ATM1 (Test Set)
.model	RMSE	MAE	MAPE	MASE
arima_auto	12.65584	10.031486	57.80838	0.5491826
arima_manual	12.65584	10.031486	57.80838	0.5491826
ets	12.73463	9.927388	44.49709	0.5434837
snaive	17.05096	13.744518	80.29855	0.7524559

Comparing ATM1 models by accuracy performance

The MASE of all of the models is under 1, which is good.
SNAIVE is the worst of these, with MASE of 0.75 (expected)
ARIMA performed really well, with MASE of 0.55 RMSE of 12.66.
ETS has the lowest MASE (0.54) and a slightly higher RMSE (12.73).
ARIMA and ETS are so close that I’ll go to the residuals to better understand the best model.

# ETS Residuals
atm1_train_fit %>%
  select(ets) %>%
  gg_tsresiduals() +
  labs(title = "Residual Diagnostics for ATM1 (ETS Model)")

## Warning: `gg_tsresiduals()` was deprecated in feasts 0.4.2.
## ℹ Please use `ggtime::gg_tsresiduals()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

# ARIMA Residuals
atm1_train_fit %>%
  select(arima_auto) %>%
  gg_tsresiduals() +
  labs(title = "Residual Diagnostics for ATM1 (ARIMA Model)")

# Portmanteau Tests
# Ljung-Box test  
# lag = 14 for daily data to look at 2 weeks of data
atm1_train_fit %>%
  select(ets, arima_auto) %>%
  augment() %>%
  features(.innov, ljung_box, lag = 14) %>%
  knitr::kable(caption = "Ljung-Box Test for ETS and ARIMA Residuals")

Ljung-Box Test for ETS and ARIMA Residuals
.model	lb_stat	lb_pvalue
arima_auto	10.87316	0.6959701
ets	23.63239	0.0507266

Comparing the ARIMA and ETS model residual plots and portmanteau:

The residual plots for ARIMA and ETS look pretty similar.
The residual distribution also look similar.
The ACF plots, however, show that for the ETS model, some early autocorrelation goes past the boundary lines.
The Portmanteau tests for both have p-values over 0.05 (which is good, because if the p value were less than 0.05, I’d conclude that the residuals do not look like white noise.) However, the p-value for the ETS model is 0.0507, which is just **barely* clearing that 0.05 bar. ARIMA looks much better at 0.70.

The ARIMA model is the best choice for ATM1.

# Model from training:
atm1_train_fit$arima_auto

## <lst_mdl[1]>
## [1] <ARIMA(0,0,2)(0,1,2)[7]>

# Refit the ARIMA model to full dataset
atm1_final_fit <- atm1 %>%
  model(
    arima_final = ARIMA(box_cox(Cash, lambda_atm1) ~ pdq(0,0,2) + PDQ(0,1,2, period = 7))
  )

# Final Model
report(atm1_final_fit)

## Series: Cash 
## Model: ARIMA(0,0,2)(0,1,2)[7] 
## Transformation: box_cox(Cash, lambda_atm1) 
## 
## Coefficients:
##          ma1      ma2     sma1     sma2
##       0.1590  -0.0823  -0.5835  -0.0782
## s.e.  0.0523   0.0503   0.0515   0.0500
## 
## sigma^2 estimated as 24.99:  log likelihood=-1083.97
## AIC=2177.93   AICc=2178.1   BIC=2197.34

The final model for ATM1 is ready to use for forecasting

Forecasting ATM1

# Forecast
# h = 30 for 30 days
atm1_fc <- atm1_final_fit %>%
  forecast(h = 30)

# Plot forecast
atm1 %>%
  autoplot(Cash) +
  # Add the forecast layer specifically
  autolayer(atm1_fc, color = "#F8766D", level = NULL) + 
  labs(
    title = "30-Day Cash Forecast: ATM1",
    subtitle = "ARIMA(0,0,2)(0,1,2)[7] Structure",
    y = "Cash",
    x = "Date"
  ) +
  scale_x_date(
    date_breaks = "1 month",   # I want a label per month
    date_labels = "%b %Y"      # Mon. Year format
  ) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Modeling ATM2

STL Decomposition for ATM2

# STL decomposition
atm_train %>%
  filter(ATM == "ATM2") %>%
  model(STL(Cash ~ season(window = "periodic"), robust = TRUE)) %>%
  components() %>%
  autoplot() +
  labs(title = "STL Decomposition for ATM2")

Much like ATM1, weekly seasonality remains, and the variation in the remainder becomes much higher from roughly March 2010 onwards. But what’s different here is the trend, which was relatively steady with a slight decline, also up until March 2010.

Model Prep ATM2

# Isolate atm2 
atm2 <- atm_clean %>% filter(ATM == "ATM2")
atm2_train <- atm_train %>% filter(ATM == "ATM2")
atm2_test <- atm_test %>% filter(ATM == "ATM2")

#Optimal Box-Cox Lambda using Guerrero's method
lambda_atm2 <- atm2_train %>%
  features(Cash, features = guerrero) %>%
  pull(lambda_guerrero)

cat("Optimal Box-Cox Lambda for atm2:", round(lambda_atm2, 4), "\n")

## Optimal Box-Cox Lambda for atm2: 0.5398

# Check for required differencing on the transformed data 
# seasonal
nsdiffs_atm2 <- atm2_train %>%
  mutate(box_cash = box_cox(Cash, lambda_atm2)) %>%
  features(box_cash, unitroot_nsdiffs) %>%
  pull(nsdiffs)

cat("Seasonal differences required (nsdiffs):", nsdiffs_atm2, "\n")

## Seasonal differences required (nsdiffs): 1

# ordinary
ndiffs_atm2 <- atm2_train %>%
  mutate(box_cash = box_cox(Cash, lambda_atm2)) %>%
  features(box_cash, unitroot_ndiffs) %>%
  pull(ndiffs)

cat("Ordinary differences required (ndiffs):", ndiffs_atm2, "\n")

## Ordinary differences required (ndiffs): 1

Different from ATM1, when checking for differencing, ATM2 calls for 1 seasonal and 1 ordinary difference.

Model Comparisons ATM2

# Fit models on training data
atm2_train_fit <- atm2_train %>%
  model(
    snaive = SNAIVE(box_cox(Cash, lambda_atm2)),
    ets = ETS(box_cox(Cash, lambda_atm2)),
    arima_auto = ARIMA(box_cox(Cash, lambda_atm2)),
    arima_manual = ARIMA(box_cox(Cash, lambda_atm2) ~ pdq(d=1) + PDQ(D=1))
  )

# Forecast the length of the TEST data
atm2_test_forecast <- atm2_train_fit %>%
  forecast(new_data = atm2_test)

# Grade the forecast against the true historical data to find out-of-sample accuracy
accuracy(atm2_test_forecast, atm2) %>%
  arrange(RMSE) %>%
  select(.model, RMSE, MAE, MAPE, MASE) %>% 
  knitr::kable(caption = "Out-of-Sample Accuracy Metrics for ATM2 (Test Set)")

Out-of-Sample Accuracy Metrics for ATM2 (Test Set)
.model	RMSE	MAE	MAPE	MASE
ets	19.09029	14.17161	Inf	0.6794205
arima_auto	19.83110	15.44513	Inf	0.7404763
snaive	22.95675	16.60056	Inf	0.7958699
arima_manual	47.44609	35.18190	Inf	1.6867035

Comparing ATM2 models by accuracy performance

The MASE of the manual ARIMA model is 1.68, which means it performed significantly worse than a baseline naive forecast. I’ll discard this model choice.
- Side note: Even though the unit root tests suggested both an ordinary and seasonal difference (d=1, D=1), explicitly forcing these constraints in the manual ARIMA model clearly resulted in the worst performance. Adhering to the KPSS tests probably led to overdifferencing.
The SNAIVE model serves as a reasonable baseline with a MASE of 0.79.
The Auto ARIMA model performed decently, beating the SNAIVE baseline with a MASE of 0.74 and an RMSE of 19.83.
ETS is the winner from the accuracy perspective, with the lowest MASE (0.68) and the lowest RMSE (19.09).

However, especially based on ATM1, I will look at both ETS and ARIMA (the auto-selected one, not my manual) from a residual perspective

# ETS Residuals
atm2_train_fit %>%
  select(ets) %>%
  gg_tsresiduals() +
  labs(title = "Residual Diagnostics for ATM2 (ETS Model)")

# ARIMA Residuals
atm2_train_fit %>%
  select(arima_auto) %>%
  gg_tsresiduals() +
  labs(title = "Residual Diagnostics for ATM2 (ARIMA Model)")

# Portmanteau Tests
# Ljung-Box test  
# lag = 14 for daily data to look at 2 weeks of data
atm2_train_fit %>%
  select(ets, arima_auto) %>%
  augment() %>%
  features(.innov, ljung_box, lag = 14) %>%
  knitr::kable(caption = "Ljung-Box Test for ETS and ARIMA Residuals")

Ljung-Box Test for ETS and ARIMA Residuals
.model	lb_stat	lb_pvalue
arima_auto	10.49173	0.7254213
ets	31.07524	0.0054106

Comparing the ARIMA and ETS model residual plots and portmanteau:

The residual plots for ARIMA and ETS look pretty similar.
The residual distribution also look similar, but the one for ARIMA does look a bit more centered and symmetric.
The ACF plots have differences again. ETS is spiking beyond the boundary lines in a couple places and at higher amplitudes in general compared to ARIMA, which appears cleaner.
The Portmanteau tests for show that ETS has a p-value under 0.05, which means I reject the null hypothesis that the residuals are randon noise for the ETS model.

The ARIMA model is the best choice for ATM2.

# Model from training:
atm2_train_fit$arima_auto

## <lst_mdl[1]>
## [1] <ARIMA(2,0,2)(0,1,1)[7]>

# Refit the ARIMA model to full dataset
atm2_final_fit <- atm2 %>%
  model(
    arima_final = ARIMA(box_cox(Cash, lambda_atm2) ~ pdq(2,0,2) + PDQ(0,1,1, period = 7))
  )

# Final Model
report(atm2_final_fit)

## Series: Cash 
## Model: ARIMA(2,0,2)(0,1,1)[7] 
## Transformation: box_cox(Cash, lambda_atm2) 
## 
## Coefficients:
##           ar1      ar2     ma1     ma2     sma1
##       -0.4158  -0.8961  0.4560  0.7860  -0.6971
## s.e.   0.0616   0.0513  0.0909  0.0652   0.0431
## 
## sigma^2 estimated as 16.57:  log likelihood=-1009.89
## AIC=2031.78   AICc=2032.02   BIC=2055.07

The final model for ATM2 is ready to use for forecasting

Forecasting ATM2

# Forecast
# h = 30 for 30 days
atm2_fc <- atm2_final_fit %>%
  forecast(h = 30)

# Plot forecast
atm2 %>%
  autoplot(Cash) +
  # Add the forecast layer specifically
  autolayer(atm2_fc, color = "#7CAE00", level = NULL) + 
  labs(
    title = "30-Day Cash Forecast: ATM2",
    subtitle = "ARIMA(2,0,2)(0,1,1)[7] Structure",
    y = "Cash",
    x = "Date"
  ) +
  scale_x_date(
    date_breaks = "1 month",   # I want a label per month
    date_labels = "%b %Y"      # Mon. Year format
  ) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Modeling ATM3

I already know that ATM3 has only three days’ worth of data. So this is going to look pretty different from ATM1 and ATM2.

STL Decomposition for ATM3

For the sake of thoroughness, I’ll still look at STL decomp for ATM3.

# STL decomposition
atm_train %>%
  filter(ATM == "ATM3") %>%
  model(STL(Cash ~ season(window = "periodic"), robust = TRUE)) %>%
  components() %>%
  autoplot() +
  labs(title = "STL Decomposition for ATM3")

As expected, STL Decomp. yielded nothing here.

Model Prep ATM3

The model preparation is different here too. For ATM1 and ATM2 I looked at the Box-Cox Lambda using Guerrero’s method and at seasonal and ordinary differencing. With so little data, I’m skipping that here.

Model Comparisons ATM3

Again, with three days’ data, I’m not going to entertain ETS and ARIMA. Nor am I going to train and test models. I’ll use the mean to forecast the data because I have already identified that there is seasonal (weekly) fluctuation. And I’d rather use the mean of 3 days as a predictor than the most recent value of one day.

atm3 <- atm_clean %>%
  filter(ATM == "ATM3")

# When ATM3 has data
atm3_active <- atm3 %>% 
  filter(Cash > 0)
# Fit a simple Mean model
atm3_fit <- atm3_active %>%
  model(mean = MEAN(Cash))

# Model report
report(atm3_fit)

## Series: Cash 
## Model: MEAN 
## 
## Mean: 87.6667 
## sigma^2: 54.3333

Forecasting ATM3

# Forecast
# h = 30 for 30 days
atm3_fc <- atm3_fit %>%
  forecast(h = 30)

# Plot forecast
atm3 %>%
  autoplot(Cash) +
  # Add the forecast layer specifically
  autolayer(atm3_fc, color = "#00BFC4", level = NULL) + 
  labs(
    title = "30-Day Cash Forecast: ATM3",
    subtitle = "Simple Mean Model",
    y = "Cash",
    x = "Date"
  ) +
  scale_x_date(
    date_breaks = "1 month",   # I want a label per month
    date_labels = "%b %Y"      # Mon. Year format
  ) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Modeling ATM4

STL Decomposition for ATM4

# STL decomposition
atm_train %>%
  filter(ATM == "ATM4") %>%
  model(STL(Cash ~ season(window = "periodic"), robust = TRUE)) %>%
  components() %>%
  autoplot() +
  labs(title = "STL Decomposition for ATM4")

I can still see weekly seasonality for ATM4. And the trend wanders a lot like ATM1. Interestingly, the variation in the remainder does not become much higher from March 2010 onwards as it did with ATM1 and ATM2.

As I confirmed before, the magnitude of withdrawals is higher in ATM4 than the others, but also, I see that the outlier I removed is indeed gone, as I’d hoped when I imputed it.

Model Prep ATM4

# Isolate ATM4 
atm4 <- atm_clean %>% filter(ATM == "ATM4")
atm4_train <- atm_train %>% filter(ATM == "ATM4")
atm4_test <- atm_test %>% filter(ATM == "ATM4")

#Optimal Box-Cox Lambda using Guerrero's method
lambda_atm4 <- atm4_train %>%
  features(Cash, features = guerrero) %>%
  pull(lambda_guerrero)

cat("Optimal Box-Cox Lambda for ATM4:", round(lambda_atm4, 4), "\n")

## Optimal Box-Cox Lambda for ATM4: 0.4028

# Check for required differencing on the transformed data 
# seasonal
nsdiffs_atm4 <- atm4_train %>%
  mutate(box_cash = box_cox(Cash, lambda_atm4)) %>%
  features(box_cash, unitroot_nsdiffs) %>%
  pull(nsdiffs)

cat("Seasonal differences required (nsdiffs):", nsdiffs_atm4, "\n")

## Seasonal differences required (nsdiffs): 0

# ordinary
ndiffs_atm4 <- atm4_train %>%
  mutate(box_cash = box_cox(Cash, lambda_atm4)) %>%
  features(box_cash, unitroot_ndiffs) %>%
  pull(ndiffs)

cat("Ordinary differences required (ndiffs):", ndiffs_atm4, "\n")

## Ordinary differences required (ndiffs): 0

Model Comparisons ATM4

# Fit models on training data
atm4_train_fit <- atm4_train %>%
  model(
    snaive = SNAIVE(box_cox(Cash, lambda_atm4)),
    ets = ETS(box_cox(Cash, lambda_atm4)),
    arima_auto = ARIMA(box_cox(Cash, lambda_atm4)),
    arima_manual_from_SPSS = ARIMA(box_cox(Cash, lambda_atm4) ~ pdq(d=0) + PDQ(D=0)),
    arima_manual_from_intuition = ARIMA(box_cox(Cash, lambda_atm4) ~ pdq(d=0) + PDQ(D=1))
  )

## Warning in sqrt(diag(best$var.coef)): NaNs produced

# Forecast the length of the TEST data
atm4_test_forecast <- atm4_train_fit %>%
  forecast(new_data = atm4_test)

# Grade the forecast against the true historical data to find out-of-sample accuracy
accuracy(atm4_test_forecast, atm4) %>%
  arrange(RMSE) %>%
  select(.model, RMSE, MAE, MAPE, MASE) %>% 
  knitr::kable(caption = "Out-of-Sample Accuracy Metrics for ATM4 (Test Set)")

Out-of-Sample Accuracy Metrics for ATM4 (Test Set)
.model	RMSE	MAE	MAPE	MASE
arima_manual_from_intuition	248.8757	210.6874	861.8666	0.5953832
arima_auto	260.9621	224.0422	914.2848	0.6331225
arima_manual_from_SPSS	260.9621	224.0422	914.2848	0.6331225
ets	303.1305	258.4645	1181.5676	0.7303968
snaive	789.8017	666.7780	1252.9923	1.8842529

Comparing ATM4 models by accuracy performance

SNAIVE was quite bad, with MASE of 1.88
ETS was alright with a MASE of 0.73, but ARIMA performed better in accuracy
Here’s where I was experimenting with ARIMA. The manual SPSS tests for ARIMA called for no differencing. So I decided to test with that manually, with 1 seasonal difference based on intuition, and then with the auto-ARIMA.
- ARIMA where I manually adhered to the results of the SPSS test, and the auto-ARIMA performed identically
- My ARIMA where I manually went off of my intuition for seasonal differencing had the best accuracy metrics accross the board.
- I suspect that my manual intuition ARIMA model might be overly complex or unstable, so I’ll check for this

# complexity (AICc)
atm4_train_fit %>%
  select(arima_auto, arima_manual_from_intuition) %>%
  glance() %>%
  arrange(AICc) %>%
  select(.model, AICc, BIC) %>%
  knitr::kable(caption = "Information Criteria Comparison: Auto vs Intuition")

Information Criteria Comparison: Auto vs Intuition
.model	AICc	BIC
arima_manual_from_intuition	2449.251	2475.407
arima_auto	2508.037	2526.894

#  (p,d,q)(P,D,Q) parameter compare
atm4_train_fit %>%
  select(arima_auto, arima_manual_from_intuition)

## # A mable: 1 x 2
##                         arima_auto arima_manual_from_intuition
##                            <model>                     <model>
## 1 <ARIMA(0,0,1)(2,0,0)[7] w/ mean>    <ARIMA(2,0,1)(1,1,2)[7]>

# AR and MA Roots for Stability
# Auto ARIMA Roots
atm4_train_fit %>%
  select(arima_auto) %>%
  gg_arma() +
  labs(
    title = "Inverse AR/MA Roots: Auto ARIMA",
    subtitle = "Points should lie entirely inside the unit circle"
  )

## Warning: `gg_arma()` was deprecated in feasts 0.4.2.
## ℹ Please use `ggtime::gg_arma()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

# Manual Intuition ARIMA Roots
atm4_train_fit %>%
  select(arima_manual_from_intuition) %>%
  gg_arma() +
  labs(
    title = "Inverse AR/MA Roots: Manual Intuition ARIMA",
    subtitle = "Checking for borderline instability (roots on or near the circle)"
  )

# Check for stability more definitively than with the circle
atm4_train_fit %>%
  select(arima_auto, arima_manual_from_intuition) %>%
  glance() %>%
  select(.model, ar_roots, ma_roots) %>%
  pivot_longer(ends_with("roots"), names_to = "Root_Type", values_to = "Root") %>%
  unnest(Root) %>%
  mutate(
    Inverse_Distance = Mod(1 / Root),
    Is_Stable = Inverse_Distance < 1
  ) %>%
  select(.model, Root_Type, Inverse_Distance, Is_Stable) %>%
  knitr::kable(digits = 4, caption = "Definitive Root Check: Auto vs Intuition")

Definitive Root Check: Auto vs Intuition
.model	Root_Type	Inverse_Distance	Is_Stable
arima_auto	ar_roots	0.9166	TRUE
arima_auto	ar_roots	0.9166	TRUE
arima_auto	ar_roots	0.8586	TRUE
arima_auto	ar_roots	0.8586	TRUE
arima_auto	ar_roots	0.8586	TRUE
arima_auto	ar_roots	0.8586	TRUE
arima_auto	ar_roots	0.9166	TRUE
arima_auto	ar_roots	0.9166	TRUE
arima_auto	ar_roots	0.9166	TRUE
arima_auto	ar_roots	0.9166	TRUE
arima_auto	ar_roots	0.8586	TRUE
arima_auto	ar_roots	0.8586	TRUE
arima_auto	ar_roots	0.8586	TRUE
arima_auto	ar_roots	0.9166	TRUE
arima_auto	ma_roots	0.0816	TRUE
arima_manual_from_intuition	ar_roots	0.8715	TRUE
arima_manual_from_intuition	ar_roots	0.8715	TRUE
arima_manual_from_intuition	ar_roots	0.8715	TRUE
arima_manual_from_intuition	ar_roots	0.8715	TRUE
arima_manual_from_intuition	ar_roots	0.8715	TRUE
arima_manual_from_intuition	ar_roots	0.8715	TRUE
arima_manual_from_intuition	ar_roots	0.8715	TRUE
arima_manual_from_intuition	ar_roots	0.7533	TRUE
arima_manual_from_intuition	ar_roots	0.0363	TRUE
arima_manual_from_intuition	ma_roots	0.9851	TRUE
arima_manual_from_intuition	ma_roots	0.9851	TRUE
arima_manual_from_intuition	ma_roots	0.9851	TRUE
arima_manual_from_intuition	ma_roots	0.9851	TRUE
arima_manual_from_intuition	ma_roots	0.8732	TRUE
arima_manual_from_intuition	ma_roots	0.8732	TRUE
arima_manual_from_intuition	ma_roots	0.8732	TRUE
arima_manual_from_intuition	ma_roots	0.9851	TRUE
arima_manual_from_intuition	ma_roots	0.9851	TRUE
arima_manual_from_intuition	ma_roots	0.9851	TRUE
arima_manual_from_intuition	ma_roots	0.8732	TRUE
arima_manual_from_intuition	ma_roots	0.8732	TRUE
arima_manual_from_intuition	ma_roots	0.8732	TRUE
arima_manual_from_intuition	ma_roots	0.8732	TRUE
arima_manual_from_intuition	ma_roots	0.6950	TRUE

Even though my manually built model scored better on the accuracy tests, it’s not better than the auto-ARIMA.

The manual-intuition ARIAM is more complex, which AIC penalizes. But it does have a better/lower AIC.
However, the manual-intuition ARIMA has roots nearly on the unit circle. When it has 6 parameters (higher complexity), having the roots be this close means that the model in potentially unstable and should probably be rejected. Otherwise, it could have long-term predictions that may go a little crazy.

I’m going to proceed like the auto-ARIMA model is the best choice, and I’ll still check its residual plot below. I’ll also look at ETS here.

# ETS Residuals
atm4_train_fit %>%
  select(ets) %>%
  gg_tsresiduals() +
  labs(title = "Residual Diagnostics for ATM4 (ETS Model)")

# ARIMA Residuals
atm4_train_fit %>%
  select(arima_auto) %>%
  gg_tsresiduals() +
  labs(title = "Residual Diagnostics for ATM4 (ARIMA Model)")

# Portmanteau Tests
# Ljung-Box test  
# lag = 14 for daily data to look at 2 weeks of data
atm4_train_fit %>%
  select(ets, arima_auto) %>%
  augment() %>%
  features(.innov, ljung_box, lag = 14) %>%
  knitr::kable(caption = "Ljung-Box Test for ARIMA Residuals")

Ljung-Box Test for ARIMA Residuals
.model	lb_stat	lb_pvalue
arima_auto	13.90588	0.4567468
ets	18.62921	0.1796068

Looking at the ARIMA model (the auto one) residual plots and portmanteau:

The residuals look like white noise for both ARIMA and ETS, and the p-values well over 0.05 from the Portmanteau test confirms this.
The ACF plots for both ARIMA and ETS show a spike outside of the bounds at some point, though it doesn’t look like a systemic pattern.
The residual histograms looks slightly left skewed for both, though ETS looks a little less well centered at 0 than ARIMA. Pretty okay overall.

I’ll move forward with the auto selected ARIMA model based on it scoring slightly better in accuracy and residual analysis than the others.

# Model from training:
atm4_train_fit$arima_auto

## <lst_mdl[1]>
## [1] <ARIMA(0,0,1)(2,0,0)[7] w/ mean>

# Refit the ARIMA model to full dataset
atm4_final_fit <- atm4 %>%
  model(
    arima_final = ARIMA(box_cox(Cash, lambda_atm4) ~ pdq(0,0,1) + PDQ(2,0,0, period = 7))
  )
# Final Model
report(atm4_final_fit)

## Series: Cash 
## Model: ARIMA(0,0,1)(2,0,0)[7] w/ mean 
## Transformation: box_cox(Cash, lambda_atm4) 
## 
## Coefficients:
##          ma1    sar1    sar2  constant
##       0.0785  0.2128  0.2082   13.5498
## s.e.  0.0528  0.0516  0.0525    0.5624
## 
## sigma^2 estimated as 104.5:  log likelihood=-1364.98
## AIC=2739.96   AICc=2740.13   BIC=2759.46

The final model for ATM4 is ready to use for forecasting

Forecasting ATM4

# Forecast
# h = 30 for 30 days
atm4_fc <- atm4_final_fit %>%
  forecast(h = 30)

# Plot forecast
atm4 %>%
  autoplot(Cash) +
  # Add the forecast layer specifically
  autolayer(atm4_fc, color = "#C77CFF", level = NULL) + 
  labs(
    title = "30-Day Cash Forecast: ATM4",
    subtitle = "ARIMA(0,0,1)(2,0,0)[7] Structure",
    y = "Cash",
    x = "Date"
  ) +
  scale_x_date(
    date_breaks = "1 month",   # I want a label per month
    date_labels = "%b %Y"      # Mon. Year format
  ) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

This prediction clearly does not appear to be right. Because this ARIMA model had no differencing, it’s assuming that the data is already stationary, so the prediction is progressing towards the mean.

I think I need the seasonal differencing component, but I also need to ensure that it’s used to create a stable model. I’ll try that below.

Model Iteration 2 for ATM4

# Try a seasonally difference ARIMA that is forced to be simpler
# Also try ETS again
atm4_train2_fit <- atm4_train %>%
  model(
    arima_seasonal_and_stable = ARIMA(box_cox(Cash, lambda_atm4) ~ pdq(0,0,1) + PDQ(0,1,1, period = 7))
  )

# Forecast the length of the TEST data
atm4_test2_forecast <- atm4_train2_fit %>%
  forecast(new_data = atm4_test)

# Grade the forecast against the true historical data to find out-of-sample accuracy
accuracy(atm4_test2_forecast, atm4) %>%
  arrange(RMSE) %>%
  select(.model, RMSE, MAE, MAPE, MASE) %>% 
  knitr::kable(caption = "Out-of-Sample Accuracy Metrics for ATM4 (Test Set)")

Out-of-Sample Accuracy Metrics for ATM4 (Test Set)
.model	RMSE	MAE	MAPE	MASE
arima_seasonal_and_stable	248.1606	210.4713	853.3203	0.5947723

This model’s RMSE, MAE, MAPE and MASE are all performing as well or slightly better than my arima_manual_from_intuition that I modeled earlier. That means the real test is in the stability and complexity.

# complexity (AICc)
atm4_train2_fit %>%
  select(arima_seasonal_and_stable) %>%
  glance() %>%
  arrange(AICc) %>%
  select(.model, AICc, BIC) %>%
  knitr::kable(caption = "Information Criteria Comparison: Auto vs Intuition")

Information Criteria Comparison: Auto vs Intuition
.model	AICc	BIC
arima_seasonal_and_stable	2442.302	2453.589

#  (p,d,q)(P,D,Q) parameter compare
atm4_train2_fit %>%
  select(arima_seasonal_and_stable)

## # A mable: 1 x 1
##   arima_seasonal_and_stable
##                     <model>
## 1  <ARIMA(0,0,1)(0,1,1)[7]>

# AR and MA Roots for Stability 
atm4_train2_fit %>%
  select(arima_seasonal_and_stable) %>%
  gg_arma() +
  labs(
    title = "Inverse AR/MA Roots: Auto ARIMA",
    subtitle = "Points should lie entirely inside the unit circle"
  )

# Check for stability more definitively than with the circle
atm4_train2_fit %>%
  select(arima_seasonal_and_stable) %>%
  glance() %>%
  select(.model, ar_roots, ma_roots) %>%
  pivot_longer(ends_with("roots"), names_to = "Root_Type", values_to = "Root") %>%
  unnest(Root) %>% # Flatten the lists of roots into individual rows
  mutate(
    Inverse_Distance = Mod(1 / Root),
    Is_Stable = Inverse_Distance < 1
  ) %>%
  select(.model, Root_Type, Inverse_Distance, Is_Stable) %>%
  knitr::kable(digits = 4, caption = "Definitive Root Check: Iteration 2")

Definitive Root Check: Iteration 2
.model	Root_Type	Inverse_Distance	Is_Stable
arima_seasonal_and_stable	ma_roots	0.9853	TRUE
arima_seasonal_and_stable	ma_roots	0.9853	TRUE
arima_seasonal_and_stable	ma_roots	0.9853	TRUE
arima_seasonal_and_stable	ma_roots	0.9853	TRUE
arima_seasonal_and_stable	ma_roots	0.9853	TRUE
arima_seasonal_and_stable	ma_roots	0.9853	TRUE
arima_seasonal_and_stable	ma_roots	0.9853	TRUE
arima_seasonal_and_stable	ma_roots	0.0972	TRUE

This model has fewer parameters (only 2). It does have many roots nearly on the unit circle, but I feel much more confident in this. The lower complexity means I’m not so worried about instability. I’ll try this model now.

# Model from training:
atm4_train2_fit$arima_seasonal_and_stable

## <lst_mdl[1]>
## [1] <ARIMA(0,0,1)(0,1,1)[7]>

# Refit the ARIMA model to full dataset
atm4_final2_fit <- atm4 %>%
  model(
    arima_final2 = ARIMA(box_cox(Cash, lambda_atm4) ~ pdq(0,0,1) + PDQ(0,1,1, period = 7))
  )
# Final Model
report(atm4_final2_fit)

## Series: Cash 
## Model: ARIMA(0,0,1)(0,1,1)[7] 
## Transformation: box_cox(Cash, lambda_atm4) 
## 
## Coefficients:
##          ma1     sma1
##       0.0976  -0.8664
## s.e.  0.0524   0.0321
## 
## sigma^2 estimated as 98.1:  log likelihood=-1332.73
## AIC=2671.46   AICc=2671.53   BIC=2683.1

Model Iteration 2 - Forecasting for ATM4

# Forecast
# h = 30 for 30 days
atm4_fc2 <- atm4_final2_fit %>%
  forecast(h = 30)

# Plot forecast
atm4 %>%
  autoplot(Cash) +
  # Add the forecast layer specifically
  autolayer(atm4_fc2, color = "#C77CFF", level = NULL) + 
  labs(
    title = "30-Day Cash Forecast: ATM4",
    subtitle = "ARIMA(0,0,1)(0,1,1)[7] Structure",
    y = "Cash",
    x = "Date"
  ) +
  scale_x_date(
    date_breaks = "1 month",   # I want a label per month
    date_labels = "%b %Y"      # Mon. Year format
  ) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Exporting Final Forecasts

# Get the forecasts (.mean) from all final models 
all_forecasts <- bind_rows(
  atm1_fc %>% as_tibble() %>% select(ATM, DATE, Forecasted_Cash = .mean),
  atm2_fc %>% as_tibble() %>% select(ATM, DATE, Forecasted_Cash = .mean),
  atm3_fc %>% as_tibble() %>% select(ATM, DATE, Forecasted_Cash = .mean),
  atm4_fc2 %>% as_tibble() %>% select(ATM, DATE, Forecasted_Cash = .mean) # Using the iter2 forecast
)

# Join forecasts to the OG data
# NOT using my outlier-cleaned data or imputed data 
final_export <- atm %>%
  as_tibble() %>%
  rename(Original_Cash = Cash) %>%
  full_join(all_forecasts, by = c("ATM", "DATE")) %>%
  arrange(ATM, DATE)

# Export to CSV
export_path <- file.path(projPath, "ATM_Forecast_Results.csv")
write_csv(final_export, export_path)

cat("Exported original data and forecasts to:", export_path, "\n")

## Exported original data and forecasts to: C:/Users/cdube/Grad School/DATA 624/ATM_Forecast_Results.csv

The final forecasts have been exported to a csv.

Important note: I know the ask was to forecast the month of May. Because there were 2 days of may that had data (and those days were important for ATM3 since it had next to no data), I started the forecast from May 3rd onward. I also included June 1st in the forecast. Essentially, I opted to forecast one month forward.

Conclusions

# final plot
final_export %>%
  filter(!(is.na(ATM))) %>%
  ggplot(aes(x = DATE)) +
  # OG Data  
  geom_line(aes(y = Original_Cash, color = ATM), alpha = 0.5) +
  # Forecasted
  geom_line(aes(y = Forecasted_Cash, color = ATM), linewidth = 1) +
  facet_wrap(~ ATM, scales = "free_y", ncol = 2) +
  scale_color_manual(values = c(
    "ATM1" = "#F8766D", 
    "ATM2" = "#7CAE00", 
    "ATM3" = "#00BFC4", 
    "ATM4" = "#C77CFF"
  )) +
  labs(
    title = "Final 30-Day Forecasts vs. Original Raw Data by ATM",
    subtitle = "Bold lines in the May 2010 onward are forecasts",
    x = "Date",
    y = "Cash Withdrawn (Hundreds)",
    color = "ATM"
  ) +
  scale_x_date(
    date_breaks = "2 months", 
    date_labels = "%b %Y"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    legend.position = "none", 
    strip.text = element_text(face = "bold", size = 12)
  )

## Warning: Removed 120 rows containing missing values or values outside the scale range
## (`geom_line()`).

## Warning: Removed 1460 rows containing missing values or values outside the scale range
## (`geom_line()`).

Findings and Forecast Recap

Forecasting the cash withdrawals for these four atm locations shows distinct behavior for each.

All of the ATMs that are active have a weekly seasonality: high volume at the start of the work week, low volume on weekends.
ATM3 just barely became an active machine. Since it’s only been up for 3 days, we’re forecasting for it very simplistically, and this could be revisited in the future.
ATM4 sees significantly more volume in withdrawals than the others.
ATM1 and ATM2 both saw increased variability starting around March 2010. We might want to investigate events that may be contributing to this.
There was a big outlier on Feb. 11th for ATM4 with $1M+ withdrawn. This has been handled in terms of forecasting, but it should be called out to business if it needs to be investigated as expected behavior, a fraud event, etc.

Part B – Forecasting Power, ResidentialCustomerForecastLoad-624.xlsx

Part B consists of a simple dataset of residential power usage for January 1998 until December 2013. Your assignment is to model these data and a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in Kilowatt hours, the rest is straight forward. Add this to your existing files above.

Setup

power_usage <- read_excel(file.path(projPath, "ResidentialCustomerForecastLoad-624.xlsx"))
head(power_usage)

## # A tibble: 6 × 3
##   CaseSequence `YYYY-MMM`     KWH
##          <dbl> <chr>        <dbl>
## 1          733 1998-Jan   6862583
## 2          734 1998-Feb   5838198
## 3          735 1998-Mar   5420658
## 4          736 1998-Apr   5010364
## 5          737 1998-May   4665377
## 6          738 1998-Jun   6467147

# relying on lubridate to create a date column here since there is no day
power_usage <- power_usage %>% 
  mutate(ReportDate = yearmonth(`YYYY-MMM`)) %>%
  select(CaseSequence, ReportDate, KWH) %>%
  as_tsibble(, key=NULL, index=ReportDate)

Exploration

# Quick Peek
power_usage %>%
  autoplot(KWH) + 
  labs(title = "Residential Power Usage (KWH) over Time", 
       x = "Report Date") +
  scale_x_yearmonth(
    date_breaks = "4 month",   # I want a label per quarter
    date_labels = "%b %Y"      # Mon. Year format
  ) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size=6.5)) # tiny text to fit on x axis

# Dip in power usage 
min_index <- which.min(power_usage$KWH)
power_usage %>%
  slice((min_index - 2):(min_index + 2)) %>%
  kable(caption = "5-Point Window: The Lowest KWH Reading and its Neighbors")

5-Point Window: The Lowest KWH Reading and its Neighbors
CaseSequence	ReportDate	KWH
881	2010 May	4919289
882	2010 Jun	6696292
883	2010 Jul	770523
884	2010 Aug	7922701
885	2010 Sep	7819472

# Check for any missing case sequence
# sequence from min to max and see what is missing 
full_seq <- min(power_usage$CaseSequence, na.rm = TRUE):max(power_usage$CaseSequence, na.rm = TRUE)
missing_vals <- setdiff(full_seq, power_usage$CaseSequence)

data.frame(Missing_CaseSequence = missing_vals) %>%
  kable(caption =  "Gaps in Case Sequence")

Gaps in Case Sequence
Missing_CaseSequence

# Check for repeating case sequence
power_usage %>%
  as_tibble() %>%
  group_by(CaseSequence) %>%
  summarise(count = n()) %>%
  filter(count > 1) %>%
  kable(caption = "Repating Case Sequences")

Repating Case Sequences
CaseSequence	count

# check for missing dates 
gap_check <- power_usage %>% has_gaps() 
power_usage %>% 
  scan_gaps() %>% 
  kable(caption = "Time Gaps (Missing Rows) Found in Data")

Time Gaps (Missing Rows) Found in Data
ReportDate

# Check for any NAs in any columns
power_usage %>%
  as_tibble() %>%
  summarise(across(everything(), ~sum(is.na(.)))) %>%
  pivot_longer(everything(), names_to = "Column_Name", values_to = "NA_Count") %>%
  kable(caption = "Count of Missing Values (NAs) by Column")

Count of Missing Values (NAs) by Column
Column_Name	NA_Count
CaseSequence	0
ReportDate	0
KWH	1

power_usage %>%
  filter(is.na(KWH)) %>%
  kable(caption = "NA value for KWH")

NA value for KWH
CaseSequence	ReportDate	KWH
861	2008 Sep	NA

# Check for any repeating dates
power_usage %>%
  count(ReportDate) %>%
  filter(n > 1) %>%
  rename(Occurrences = n) %>%
  kable(caption = "Duplicate Report Dates")

Duplicate Report Dates
ReportDate	Occurrences

# Summary
power_usage %>%
  as_tibble() %>%
  # Convert yearmonth to Date so summary() can read it
  mutate(ReportDate = as.Date(ReportDate)) %>% 
  summary() %>%
  kable(caption = "Statistical Summary of Power Usage Data")

Statistical Summary of Power Usage Data
CaseSequence	ReportDate	KWH
Min. :733.0	Min. :1998-01-01	Min. : 770523
1st Qu.:780.8	1st Qu.:2001-12-24	1st Qu.: 5429912
Median :828.5	Median :2005-12-16	Median : 6283324
Mean :828.5	Mean :2005-12-15	Mean : 6502475
3rd Qu.:876.2	3rd Qu.:2009-12-08	3rd Qu.: 7620524
Max. :924.0	Max. :2013-12-01	Max. :10655730
NA	NA	NA’s :1

To start with, I don’t see much that is “messy” about the data (like duplicate dates or missing dates). There is just one missing KWH value from Sept. 2008.

From visual inspection, I also see that there is clear seasonality. Power usage spikes in the winter and summer.

It also looks like power usage started trending upwards in 2010, following a dip in power usage in July 2010. This dip in power usage for July 2010 appears to be an outlier, and it might even be a user-data-entry error. Since adding a 0 to the end of the number would make it quite plausible. However, I’m not certain that is the case, so I’m going to impute it regardless.

Handle NAs (Imputation)

For imputation of the NA value spotted in exploration, much like in Part A I’m opting to use ARIMA interpolation, one of the recommended methods in the Hyndman textbook.

# NA Value
power_usage %>%
  filter(is.na(KWH))

## # A tsibble: 1 x 3 [1M]
##   CaseSequence ReportDate   KWH
##          <dbl>      <mth> <dbl>
## 1          861   2008 Sep    NA

power_usage_impute_me <- power_usage %>%
  filter(is.na(KWH)) %>%
  pull(CaseSequence)

# Impute NAs with ARIMA 
power_usage_clean <- power_usage %>%
  model(arima = ARIMA(KWH)) %>%
  interpolate(power_usage) %>%
  left_join(as_tibble(power_usage) %>% select(ReportDate, CaseSequence), by = "ReportDate")


# Check where NA used to be 
power_usage_clean %>%
  filter(CaseSequence == power_usage_impute_me)

## # A tsibble: 1 x 3 [1M]
##   ReportDate      KWH CaseSequence
##        <mth>    <dbl>        <dbl>
## 1   2008 Sep 7597704.          861

Handle Outliers

I feel like I’ve learned my lesson from Part A of this project. I could systematically identify all the outliers by some rule. But visually, I see that the only outlier I am concerned with is the one from July 2010. So that’s the one I will handle.

# Dip in power usage 
# I'm rerunning this on power_usage_clean just to be safe, since this is now my working df
min_index <- which.min(power_usage_clean$KWH)

# I prefer anchoring to the CaseSequnce anyway though. So i'm doing both,
min_case_sequence <- power_usage_clean %>%
  filter(KWH == min(power_usage_clean$KWH)) %>%
  pull(CaseSequence)

power_usage_clean %>%
  filter(KWH == min(power_usage_clean$KWH)) %>%
  kable(caption = "Minimum KWH value is an outlier. To be Imputed.")

Minimum KWH value is an outlier. To be Imputed.
ReportDate	KWH	CaseSequence
2010 Jul	770523	883

# Replace outlier with NA
power_usage_clean <- power_usage_clean %>%
  mutate(
    KWH = if_else(CaseSequence == min_case_sequence, NA_real_, KWH)
  )

# Impute
power_usage_clean <- power_usage_clean %>%
  model(arima = ARIMA(KWH)) %>%
  interpolate(power_usage_clean) %>%
  left_join(as_tibble(power_usage_clean) %>% select(ReportDate, CaseSequence), by = "ReportDate")

# Double Check new Value
power_usage_clean %>%
  filter(CaseSequence == min_case_sequence) %>%
  kable(caption = "New Imputed Value to Handle Outlier")

New Imputed Value to Handle Outlier
ReportDate	KWH	CaseSequence
2010 Jul	7728530	883

Training and Test Set

My goal is to forecast for 2014, I’ll hold out one year of data (all of 2013) from my test set.

# Training Set (Everything up to December 2012)
power_train <- power_usage_clean %>% 
  filter(ReportDate <= yearmonth("2012-12"))

# Test Set (January 2013 through December 2013)
power_test <- power_usage_clean %>% 
  filter(ReportDate > yearmonth("2012-12"))

# Split Check
cat("Training Set ends on:", as.character(max(power_train$ReportDate)), "and has", nrow(power_train), "months.\n")

## Training Set ends on: 2012 Dec and has 180 months.

cat("Test Set starts on:", as.character(min(power_test$ReportDate)), "and ends on:", as.character(max(power_test$ReportDate)), "with", nrow(power_test), "months.\n")

## Test Set starts on: 2013 Jan and ends on: 2013 Dec with 12 months.

Modeling

STL Decomposition

I want to look at the training data through the lense of STL decomposition.

# STL decomposition on the training set
power_train %>%
  model(STL(KWH ~ season(window = "periodic"), robust = TRUE)) %>%
  components() %>%
  autoplot() +
  labs(
    title = "STL Decomposition for Residential Power Usage",
    subtitle = "Training Set (1998 - 2012)",
    x = "Report Date"
  )

Trend:I see a shift upward starting around 2010 that I hesitantly spotted earlier in EDA.
Seasonality: As expected, I see seasonality relative to the summer-winter annual cycle. It’s also accounting for a lot of the behavior if I just look at the magnitude of the y axis on it.
Remainder: This generally looks like noise, though I do see the variability is a bit tighter in the 2000-2007 time range. It gets a bit larger after that. Box-cox might be called for to stabilize the variance (which I’ll check)

Model Prep.

# Optimal Box-Cox Lambda using Guerrero's method
lambda_power <- power_train %>%
  features(KWH, features = guerrero) %>%
  pull(lambda_guerrero)

cat("Optimal Box-Cox Lambda for Power Usage:", round(lambda_power, 4), "\n")

## Optimal Box-Cox Lambda for Power Usage: -0.1696

# Check for required differencing on the transformed data 
# Seasonal differencing
nsdiffs_power <- power_train %>%
  mutate(box_kwh = box_cox(KWH, lambda_power)) %>%
  features(box_kwh, unitroot_nsdiffs) %>%
  pull(nsdiffs)

cat("Seasonal differences required (nsdiffs):", nsdiffs_power, "\n")

## Seasonal differences required (nsdiffs): 1

# Ordinary differencing
ndiffs_power <- power_train %>%
  mutate(box_kwh = box_cox(KWH, lambda_power)) %>%
  features(box_kwh, unitroot_ndiffs) %>%
  pull(ndiffs)

cat("Ordinary differences required (ndiffs):", ndiffs_power, "\n")

## Ordinary differences required (ndiffs): 1

The optimal Box-Cox lambda is -0.1696, which is pretty close to 0 (natural log transofrmation). The SPSS test also calls for both one seasonal difference (which I expected) and one ordinary difference (maybe to handle the trend seen earlier).

Model Comparisons

I’m going to be just a little bit more thorough with the “theory” or “explanatory power” I’m looking for while fitting these models than I did with Part A. I honestly think it’s a bit overkill, but it’s nice to reinforce the learning. (Also Part A had so many components with many similar trends in each, and that could be used to inform on expectations of each ATM, whereas Part B has fewer components, so I’d rather be a little more thorough up front.)

# Fit models
power_fit <- power_train %>%
  model(
    # Snaive (the baseline basically)
    snaive = SNAIVE(KWH),
    
    # Auto models without transformation
    ets_auto = ETS(KWH),
    arima_auto = ARIMA(KWH),
    
    # Auto models with Box-Cox transformation
    ets_bc = ETS(box_cox(KWH, lambda_power)),
    arima_bc = ARIMA(box_cox(KWH, lambda_power)),
    
    # Manual ARIMA based on SPSS test (d=1, D=1)
    arima_manual = ARIMA(box_cox(KWH, lambda_power) ~ pdq(d=1) + PDQ(D=1))
  )

# Forecast against the 2013 test set
power_fc <- power_fit %>%
  forecast(new_data = power_test)

# Grade the forecast against the true historical test data
accuracy(power_fc, power_usage_clean) %>%
  arrange(RMSE) %>%
  select(.model, RMSE, MAE, MAPE, MASE) %>% 
  knitr::kable(caption = "Out-of-Sample Accuracy Metrics for Power Usage (Test Set)")

Out-of-Sample Accuracy Metrics for Power Usage (Test Set)
.model	RMSE	MAE	MAPE	MASE
arima_bc	963502.9	683535.0	8.050892	1.125034
arima_auto	969439.6	655830.8	7.694001	1.079436
arima_manual	970940.7	742764.1	8.925284	1.222520
snaive	1035537.6	618605.6	7.058492	1.018166
ets_bc	1054366.9	697314.2	8.339581	1.147713
ets_auto	1100750.2	690764.1	8.126241	1.136932

Comparing the models by accuracy performance

SNAIVE performed really well here for MAE, MAPE and MASE. It didn’t perform well in RMSE though, implying that when it had some large forecasting errors.
The models are in a somewhat tight race, but the automatically determined ARIMA is the winner with the the lowest or close to lowest of each kind of error.
Because Box-cox transformed ARIMA had the lowest RMSE, I want give it added attention as well.

## Residual Plots

# SNAIVE Residuals
power_fit %>%
  select(snaive) %>%
  gg_tsresiduals() +
  labs(title = "Residual Diagnostics: SNAIVE Baseline")

## Warning: Removed 12 rows containing missing values or values outside the scale range
## (`geom_line()`).

## Warning: Removed 12 rows containing missing values or values outside the scale range
## (`geom_point()`).

## Warning: Removed 12 rows containing non-finite outside the scale range
## (`stat_bin()`).

## Warning: Removed 12 rows containing missing values or values outside the scale range
## (`geom_rug()`).

# Auto ARIMA Residuals
power_fit %>%
  select(arima_auto) %>%
  gg_tsresiduals() +
  labs(title = "Residual Diagnostics: Auto ARIMA (No Transformation)")

# Box-Cox ARIMA Residuals
power_fit %>%
  select(arima_bc) %>%
  gg_tsresiduals() +
  labs(title = "Residual Diagnostics: Box-Cox ARIMA")

## Portmanteau Tests 

# Ljung-Box test for white noise
# lag = 24 for two years of monthly seasonal data
power_fit %>%
  select(snaive, arima_auto, arima_bc) %>%
  augment() %>%
  features(.innov, ljung_box, lag = 24) %>%
  arrange(lb_pvalue) %>%
  knitr::kable(caption = "Ljung-Box Test Results (lag = 24)")

Ljung-Box Test Results (lag = 24)
.model	lb_stat	lb_pvalue
snaive	107.64492	0.0000000
arima_bc	31.42627	0.1418684
arima_auto	27.01241	0.3038750

## Stability Checks (ARIMA only)

# Visual check of AR/MA roots
power_fit %>%
  select(arima_auto, arima_bc) %>%
  gg_arma() +
  labs(
    title = "Inverse AR/MA Roots: ARIMA Models",
    subtitle = "Points should lie entirely inside the unit circle"
  )

# Definitive root check table 
power_fit %>%
  select(arima_auto, arima_bc) %>%
  glance() %>%
  select(.model, ar_roots, ma_roots) %>%
  pivot_longer(ends_with("roots"), names_to = "Root_Type", values_to = "Root") %>%
  unnest(Root) %>%
  mutate(
    Inverse_Distance = Mod(1 / Root),
    Is_Stable = Inverse_Distance < 1
  ) %>%
  select(.model, Root_Type, Inverse_Distance, Is_Stable) %>%
  knitr::kable(digits = 4, caption = "Definitive Root Stability Check")

Definitive Root Stability Check
.model	Root_Type	Inverse_Distance	Is_Stable
arima_auto	ar_roots	0.9633	TRUE
arima_auto	ar_roots	0.9633	TRUE
arima_auto	ar_roots	0.9633	TRUE
arima_auto	ar_roots	0.9633	TRUE
arima_auto	ar_roots	0.9633	TRUE
arima_auto	ar_roots	0.9633	TRUE
arima_auto	ar_roots	0.9633	TRUE
arima_auto	ar_roots	0.9633	TRUE
arima_auto	ar_roots	0.9633	TRUE
arima_auto	ar_roots	0.9633	TRUE
arima_auto	ar_roots	0.9633	TRUE
arima_auto	ar_roots	0.9633	TRUE
arima_auto	ar_roots	0.9633	TRUE
arima_auto	ar_roots	0.9633	TRUE
arima_auto	ar_roots	0.9633	TRUE
arima_auto	ar_roots	0.9633	TRUE
arima_auto	ar_roots	0.9633	TRUE
arima_auto	ar_roots	0.9633	TRUE
arima_auto	ar_roots	0.9633	TRUE
arima_auto	ar_roots	0.9633	TRUE
arima_auto	ar_roots	0.9633	TRUE
arima_auto	ar_roots	0.9633	TRUE
arima_auto	ar_roots	0.9633	TRUE
arima_auto	ar_roots	0.9633	TRUE
arima_auto	ma_roots	0.4122	TRUE
arima_bc	ar_roots	0.9594	TRUE
arima_bc	ar_roots	0.9594	TRUE
arima_bc	ar_roots	0.9594	TRUE
arima_bc	ar_roots	0.9594	TRUE
arima_bc	ar_roots	0.9594	TRUE
arima_bc	ar_roots	0.9594	TRUE
arima_bc	ar_roots	0.9594	TRUE
arima_bc	ar_roots	0.9594	TRUE
arima_bc	ar_roots	0.9594	TRUE
arima_bc	ar_roots	0.9594	TRUE
arima_bc	ar_roots	0.9594	TRUE
arima_bc	ar_roots	0.9594	TRUE
arima_bc	ar_roots	0.9594	TRUE
arima_bc	ar_roots	0.9594	TRUE
arima_bc	ar_roots	0.9594	TRUE
arima_bc	ar_roots	0.9594	TRUE
arima_bc	ar_roots	0.9594	TRUE
arima_bc	ar_roots	0.9594	TRUE
arima_bc	ar_roots	0.9594	TRUE
arima_bc	ar_roots	0.9594	TRUE
arima_bc	ar_roots	0.9594	TRUE
arima_bc	ar_roots	0.9594	TRUE
arima_bc	ar_roots	0.9594	TRUE
arima_bc	ar_roots	0.9594	TRUE
arima_bc	ma_roots	0.2843	TRUE

Comparing the model residual plots and portmanteau:

The SNAIVE residual analysis show a huge spike in autocorrelation at 12 months. And the Portmanteau test has a p value under 0.05. So the SNAIVE is clearly not capturing some of the annual behavior.
The two ARIMA models perform similarly, but the auto-generated ARIMA model appears to have more normally distributed residuals, and a higher p-value, which is preferred.
Both ARIMA models also show an early autocorrelation spike beyond the boundary lines, which is suboptimal, but again the auto ARIMA shows a smaller amplitude in this than the box-cox one.
However, these things are pretty equal in terms of performance. The thing I come back to here is the, where Box-Cox Transformed ARIMA was superior. If that means it is handling extreme values better, that seems better suited when forecasting something as important as power consumption, since it might play into available power supply.

The Box-Cox Transformed ARIMA model is the best choice.

power_fit$arima_bc

## <lst_mdl[1]>
## [1] <ARIMA(0,0,1)(2,1,0)[12] w/ drift>

# Refit the Box-Cox ARIMA model to the FULL dataset
power_final_fit <- power_usage_clean %>%
  model(
    arima_final = ARIMA(box_cox(KWH, lambda_power) ~ 1 + pdq(0,0,1) + PDQ(2,1,0, period = 12))
  )

# Review the final model
report(power_final_fit)

## Series: KWH 
## Model: ARIMA(0,0,1)(2,1,0)[12] w/ drift 
## Transformation: box_cox(KWH, lambda_power) 
## 
## Coefficients:
##          ma1     sar1     sar2  constant
##       0.2486  -0.7202  -0.3560    0.0023
## s.e.  0.0834   0.0752   0.0769    0.0006
## 
## sigma^2 estimated as 4.313e-05:  log likelihood=651.95
## AIC=-1293.9   AICc=-1293.55   BIC=-1277.93

The final model is ready to use for forecasting

Forecasting

# Forecast the next 12 months (January - December 2014)
power_fc <- power_final_fit %>%
  forecast(h = 12)

# Plot the historical data with the forecast
power_usage_clean %>%
  autoplot(KWH) +
  # Add the forecast layer
  autolayer(power_fc, color = "#4682B4", level = c(80, 95)) + 
  labs(
    title = "12-Month Residential Power Usage Forecast (2014)",
    subtitle = "ARIMA(0,0,1)(2,1,0)[12] with drift & Box-Cox transformation",
    y = "Power Usage (KWH)",
    x = "Report Date"
  ) +
  scale_x_yearmonth(
    date_breaks = "1 year", 
    date_labels = "%Y"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Exporting Final Forecasts

# Extract the point forecasts  
power_export <- power_fc %>%
  as_tibble() %>%
  select(ReportDate, Forecasted_KWH = .mean) %>%
  mutate(ReportDate = as.character(ReportDate)) 

# Define export path and write to CSV
export_path_b <- file.path(projPath, "Power_Forecast_2014.csv")
write_csv(power_export, export_path_b)

cat("Exported 2014 Power Usage forecast to:", export_path_b, "\n")

## Exported 2014 Power Usage forecast to: C:/Users/cdube/Grad School/DATA 624/Power_Forecast_2014.csv

The final forecast has been exported to a csv.

Conclusion

Residential power consumption has shown a steady upward trajectory since 2010. This indicates a growing customer baseline or increased household energy reliance.
Summer and winter drive peak usage, which is quite predictable behavior. However, the severity or intensity of these piked is increasing as time goes on. This seems like something that would require some thoughtful planning for future adequate power procurement. *The drop in power usage July 2010 was dealt with for the purposes of forecasting, but that “blip” could be an error in data entry, or a localized power outage, or some other failure that may warrant investigation.

Part C – BONUS, optional (part or all), Waterflow_Pipe1.xlsx and Waterflow_Pipe2.xlsx

Part C consists of two data sets. These are simple 2 columns sets, however they have different time stamps. Your optional assignment is to time-base sequence the data and aggregate based on hour (example of what this looks like, follows). Note for multiple recordings within an hour, take the mean. Then to determine if the data is stationary and can it be forecast. If so, provide a week forward forecast and present results via Rpubs and .rmd and the forecast in an Excel readable file.

Setup

Because the timestamps provided in the raw data do not fall perfectly on the hour, I’m rounding the DateTime down to the nearest hour. Then I’m getting the average waterflow to handle the case where there are multiple recordings within the same hour.

# read data
pipe1_raw <- read_excel(
  file.path(projPath, "Waterflow_Pipe1.xlsx"), 
  col_types = c("date", "numeric") # Forces Excel to parse as POSIXct datetime
)
 
pipe2_raw <- read_excel(
  file.path(projPath, "Waterflow_Pipe2.xlsx"), 
  col_types = c("date", "numeric")
)

# checking the time period
any(minute(pipe1_raw$`Date Time`) != 0 | second(pipe1_raw$`Date Time`) != 0)

## [1] TRUE

any(duplicated(floor_date(pipe1_raw$`Date Time`, "hour")))

## [1] TRUE

# cleaning data
pipe1_agg <- pipe1_raw %>%
  # Rename columns to remove spaces and standardize
  rename(DateTime = `Date Time`, WaterFlow = WaterFlow) %>%
  # Floor the datetime to the nearest hour
  mutate(DateTimeHour = floor_date(DateTime, "hour")) %>%
  # Aggregate multiple recordings within the same hour
  group_by(DateTimeHour) %>%
  summarise(WaterFlow = mean(WaterFlow, na.rm = TRUE), .groups = 'drop') %>%
  # Convert to tsibble
  as_tsibble(index = DateTimeHour)

pipe2_agg <- pipe2_raw %>%
  rename(DateTime = `Date Time`, WaterFlow = WaterFlow) %>%
  mutate(DateTimeHour = floor_date(DateTime, "hour")) %>%
  group_by(DateTimeHour) %>%
  summarise(WaterFlow = mean(WaterFlow, na.rm = TRUE), .groups = 'drop') %>%
  as_tsibble(index = DateTimeHour)

Data Cleaning Gaps

When working with sensor data, there periods where the sensor has gaps in the time series (the hour simply doesn’t exist in the data frame). So I’ll impute these using ARIMA.

#  
pipe1_agg %>% has_gaps() %>% kable(caption = "Does Pipe 1 have gaps?")

Does Pipe 1 have gaps?
.gaps
TRUE

pipe2_agg %>% has_gaps() %>% kable(caption = "Does Pipe 2 have gaps?")

Does Pipe 2 have gaps?
.gaps
FALSE

# summarized count of the gaps
pipe1_agg %>% 
  count_gaps() %>% 
  kable(caption = "Summary of Gap Sizes in Pipe 1")

Summary of Gap Sizes in Pipe 1
.from	.to	.n
2015-10-27 17:00:00	2015-10-27 17:00:00	1
2015-10-28 00:00:00	2015-10-28 00:00:00	1
2015-11-01 05:00:00	2015-11-01 05:00:00	1
2015-11-01 09:00:00	2015-11-01 09:00:00	1

# fill gaps
pipe1_clean <- pipe1_agg %>% fill_gaps()
pipe2_clean <- pipe2_agg %>% fill_gaps()

cat("Missing hours found in Pipe 1:", sum(is.na(pipe1_clean$WaterFlow)), "\n")

## Missing hours found in Pipe 1: 4

cat("Missing hours found in Pipe 2:", sum(is.na(pipe2_clean$WaterFlow)), "\n")

## Missing hours found in Pipe 2: 0

# Impute NAs using ARIMA interpolation (if any exist)
if(sum(is.na(pipe1_clean$WaterFlow)) > 0) {
  pipe1_clean <- pipe1_clean %>%
    model(arima = ARIMA(WaterFlow)) %>%
    interpolate(pipe1_clean)
}

if(sum(is.na(pipe2_clean$WaterFlow)) > 0) {
  pipe2_clean <- pipe2_clean %>%
    model(arima = ARIMA(WaterFlow)) %>%
    interpolate(pipe2_clean)
}

Cleaned Data Inspection

pipe1_clean %>%
  autoplot(WaterFlow) +
  labs(title = "Hourly Waterflow: Pipe 1", x = "Date/Time", y = "Waterflow") +
  theme_minimal()

pipe2_clean %>%
  autoplot(WaterFlow) +
  labs(title = "Hourly Waterflow: Pipe 2", x = "Date/Time", y = "Waterflow") +
  theme_minimal()

KPSS

Visually, both datasets appear to have a stable mean and variance over time without any long-term wandering trends. I’ll run a KPSS unit root test to confirm that.

# KPSS Test
bind_rows(
  pipe1_clean %>% features(WaterFlow, unitroot_kpss) %>% mutate(Pipe = "Pipe 1"),
  pipe2_clean %>% features(WaterFlow, unitroot_kpss) %>% mutate(Pipe = "Pipe 2")
) %>%
  select(Pipe, kpss_stat, kpss_pvalue) %>%
  kable(caption = "KPSS Stationarity Test Results") # p value > 0.05 so fail to reject null

KPSS Stationarity Test Results
Pipe	kpss_stat	kpss_pvalue
Pipe 1	0.0793697	0.1
Pipe 2	0.1053662	0.1

Both Pipe 1 and Pipe 2 return a p-value greater than 0.05. Therefore, I fail to reject the null hypothesis. The data is stationary and requires no ordinary differencing ($d=0$) for modeling. Because the variance looks highly stable in the EDA plots, I am also opting to skip Box-Cox transformations for this section.

Modeling

I will fit an automated ARIMA model to both pipes, letting the algorithm identify any seasonal or autoregressive patterns in the stationary hourly data.

# Fit ARIMA models
pipe1_fit <- pipe1_clean %>% model(arima_auto = ARIMA(WaterFlow))
pipe2_fit <- pipe2_clean %>% model(arima_auto = ARIMA(WaterFlow))

# Review the selected models
report(pipe1_fit)

## Series: WaterFlow 
## Model: ARIMA(0,0,0) w/ mean 
## 
## Coefficients:
##       constant
##        19.4683
## s.e.    0.3430
## 
## sigma^2 estimated as 28.35:  log likelihood=-741.4
## AIC=1486.8   AICc=1486.85   BIC=1493.76

report(pipe2_fit)

## Series: WaterFlow 
## Model: ARIMA(0,0,0)(0,0,1)[24] w/ mean 
## 
## Coefficients:
##         sma1  constant
##       0.0846   39.5732
## s.e.  0.0314    0.5472
## 
## sigma^2 estimated as 256:  log likelihood=-4190.54
## AIC=8387.08   AICc=8387.1   BIC=8401.8

# Residual diagnostics for Pipe 1
pipe1_fit %>% gg_tsresiduals() + labs(title = "Residual Diagnostics: Pipe 1")

# Residual diagnostics for Pipe 2
pipe2_fit %>% gg_tsresiduals() + labs(title = "Residual Diagnostics: Pipe 2")

The auto-ARIMA models are pretty simple, which I expected base on the KPSS tests and visually inspecting the data and seeing how stationary it was.

*Pipe 1 (ARIMA(0,0,0) w/ mean):** For the first pipe, the algorithm essentially fitted a “mean model.” Because it found no significant autoregressive or moving average terms to improve the forecast, it will simply predict the historical average water flow (a constant of roughly 19.47). Looking at the residual diagnostics, the ACF plot shows the residuals mostly behave like white noise. The histogram is fairly normal, but there are a few negative outliers skewing it (sudden drops in water flow around Oct 28 and Nov 1).

*Pipe 2 (ARIMA(0,0,0)(0,0,1)[24] w/ mean):** For the second pipe, the algorithm also fitted a mean model (with a constant around 39.57), but it successfully detected a daily seasonal pattern. The (0,0,1)[24] component indicates a seasonal moving average term adjusting for a 24-hour cycle. The variance is consistent over time, the histogram is beautifully centered and normal, and while a few lags briefly poke past the significance boundaries on the ACF plot, this is entirely expected by chance with high-frequency hourly data.

Forecasting and Export

I’ll forecast one week forward. Because the data is hourly, this translates to $h = 168$ periods (24 hours * 7 days).

# Fit ARIMA models to both pipes 
pipe1_fit <- pipe1_clean %>% model(arima_auto = ARIMA(WaterFlow))
pipe2_fit <- pipe2_clean %>% model(arima_auto = ARIMA(WaterFlow))

# Forecast a week forward (24 hours * 7 days = 168 hours)
pipe1_fc <- pipe1_fit %>% forecast(h = 168)
pipe2_fc <- pipe2_fit %>% forecast(h = 168)

# Plot the forecasts
# Pipe 1 Plot
pipe1_clean %>%
  autoplot(WaterFlow) +
  autolayer(pipe1_fc, color = "#F8766D", level = NULL) +
  labs(
    title = "1-Week Waterflow Forecast: Pipe 1", 
    x = "Date/Time (Hourly)", 
    y = "Waterflow"
  ) +
  theme_minimal()

# Pipe 2 Plot
pipe2_clean %>%
  autoplot(WaterFlow) +
  autolayer(pipe2_fc, color = "#00BFC4", level = NULL) +
  labs(
    title = "1-Week Waterflow Forecast: Pipe 2", 
    x = "Date/Time (Hourly)", 
    y = "Waterflow"
  ) +
  theme_minimal()

# Prepare data for export
# Extract point forecasts (.mean) and label the pipes
pipe1_export <- pipe1_fc %>%
  as_tibble() %>%
  mutate(Pipe = "Pipe 1") %>%
  select(Pipe, DateTimeHour, Forecasted_WaterFlow = .mean)

pipe2_export <- pipe2_fc %>%
  as_tibble() %>%
  mutate(Pipe = "Pipe 2") %>%
  select(Pipe, DateTimeHour, Forecasted_WaterFlow = .mean)

# Combine both into a single data frame
combined_pipe_forecasts <- bind_rows(pipe1_export, pipe2_export)

# Export to CSV (Ensuring standard file naming)
export_path_c <- file.path(projPath, "Waterflow_Forecast_Results.csv")
write_csv(combined_pipe_forecasts, export_path_c)

cat("Exported Pipe 1 and Pipe 2 forecasts to:", export_path_c, "\n")

## Exported Pipe 1 and Pipe 2 forecasts to: C:/Users/cdube/Grad School/DATA 624/Waterflow_Forecast_Results.csv

DATA 624 Project 1

Catherine Dube

2026-03-29

Project Instructions

Part A – ATM Forecast, ATM624Data.xlsx

Setup

Basic Plots

Data Exploration

Weird Specific ATM Behaviors

Summary Table

NA Records

Missing Cash Records

See if spikes in withdrawals are mondays

Initial Notes

Next steps

Cleaning NA and Outliers

Cut the May missing values

Imputation for Cash NAs

Outlier Analysis and Imputation

Training and Test Sets

Modeling ATM1

STL Decompositon for ATM1

Model Prep. ATM1

Model Comparisons ATM1

Forecasting ATM1

Modeling ATM2

STL Decomposition for ATM2

Model Prep ATM2

Model Comparisons ATM2

Forecasting ATM2

Modeling ATM3

STL Decomposition for ATM3

Model Prep ATM3

Model Comparisons ATM3

Forecasting ATM3

Modeling ATM4

STL Decomposition for ATM4

Model Prep ATM4

Model Comparisons ATM4

Forecasting ATM4

Model Iteration 2 for ATM4

Model Iteration 2 - Forecasting for ATM4

Exporting Final Forecasts

Conclusions

Findings and Forecast Recap

Part B – Forecasting Power, ResidentialCustomerForecastLoad-624.xlsx

Setup

Exploration

Handle NAs (Imputation)

Handle Outliers

Training and Test Set

Modeling

STL Decomposition

Model Prep.

Model Comparisons

Forecasting

Exporting Final Forecasts

Conclusion

Part C – BONUS, optional (part or all), Waterflow_Pipe1.xlsx and Waterflow_Pipe2.xlsx

Setup

Data Cleaning Gaps

Cleaned Data Inspection

KPSS

Modeling

Forecasting and Export