Project Instructions

This project consists of 3 parts - two required and one bonus and is worth 15% of your grade. The project is due at 11:59 PM on Sunday Oct 25. I will accept late submissions with a penalty until the meetup after that when we review some projects.

Part A – ATM Forecast, ATM624Data.xlsx

In part A, I want you to forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable ‘Cash’ is provided in hundreds of dollars, other than that it is straight forward. I am being somewhat ambiguous on purpose to make this have a little more business feeling. Explain and demonstrate your process, techniques used and not used, and your actual forecast. I am giving you data via an excel file, please provide your written report on your findings, visuals, discussion and your R code via an RPubs link along with the actual.rmd file Also please submit the forecast which you will put in an Excel readable file.

Setup

atm <- read_excel(file.path(projPath, "ATM624Data.xlsx"))
head(atm)
## # A tibble: 6 × 3
##    DATE ATM    Cash
##   <dbl> <chr> <dbl>
## 1 39934 ATM1     96
## 2 39934 ATM2    107
## 3 39935 ATM1     82
## 4 39935 ATM2     89
## 5 39936 ATM1     85
## 6 39936 ATM2     90
unique(atm$ATM)
## [1] "ATM1" "ATM2" NA     "ATM3" "ATM4"
# Fixing the date formatting
atm$DATE <- as.Date(atm$DATE, origin = "1900-01-01")
head(atm) 
## # A tibble: 6 × 3
##   DATE       ATM    Cash
##   <date>     <chr> <dbl>
## 1 2009-05-03 ATM1     96
## 2 2009-05-03 ATM2    107
## 3 2009-05-04 ATM1     82
## 4 2009-05-04 ATM2     89
## 5 2009-05-05 ATM1     85
## 6 2009-05-05 ATM2     90
# Tsibble
atm <- as_tsibble(atm, index = DATE, key = ATM)

Basic Plots

# Basic Plot
atm %>% 
  autoplot(Cash) 
## Warning: Removed 14 rows containing missing values or values outside the scale range
## (`geom_line()`).

  labs(title = "Basic Plot of Cash (in hundreds of dollars) withdrawn from ATMS over time") +
  scale_x_date(
    date_breaks = "1 month",   # I want a label per month
    date_labels = "%b %Y"      # Mon. Year format
  ) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) 
## NULL
# Basic Plot ATM 1
atm %>% 
  filter(ATM == "ATM1") %>%
  autoplot(Cash) + 
  labs(title = "Basic Plot of Cash (in hundreds of dollars) withdrawn from ATM1") +
  scale_x_date(
    date_breaks = "1 month",   # I want a label per month
    date_labels = "%b %Y"      # Mon. Year format
  ) + 
  geom_line(color = "#F8766D") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) 

# Basic Plot ATM 2
atm %>% 
  filter(ATM == "ATM2") %>%
  autoplot(Cash) + 
  labs(title = "Basic Plot of Cash (in hundreds of dollars) withdrawn from ATM2") +
  scale_x_date(
    date_breaks = "1 month",   # I want a label per month
    date_labels = "%b %Y"      # Mon. Year format
  ) +
  geom_line(color = "#7CAE00") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) 

# Basic Plot ATM 3
atm %>% 
  filter(ATM == "ATM3") %>%
  autoplot(Cash) + 
  labs(title = "Basic Plot of Cash (in hundreds of dollars) withdrawn from ATM3") +
  scale_x_date(
    date_breaks = "1 month",   # I want a label per month
    date_labels = "%b %Y"      # Mon. Year format
  ) +
  geom_line(color = "#00BFC4") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) 

# Basic Plot ATM 4
atm %>% 
  filter(ATM == "ATM4") %>%
  autoplot(Cash) + 
  labs(title = "Basic Plot of Cash (in hundreds of dollars) withdrawn from ATM4") +
  scale_x_date(
    date_breaks = "1 month",   # I want a label per month
    date_labels = "%b %Y"      # Mon. Year format
  ) +
  geom_line(color = "#C77CFF") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) 

Data Exploration

Weird Specific ATM Behaviors

Looking at when ATM3 started getting withdrawals, and looking at the ATM4 Feb. spike.

# Days money moved for ATM3
atm %>% 
  filter((ATM == "ATM3") & (Cash != 0)) %>%
  kable(caption = "Days with non-zero cash movement for ATM3")
Days with non-zero cash movement for ATM3
DATE ATM Cash
2010-04-30 ATM3 96
2010-05-01 ATM3 82
2010-05-02 ATM3 85
# Days money spiked for ATM 4
atm %>% 
  filter((ATM == "ATM4") & (DATE >= '2010-02-01') & (DATE <= '2010-02-15')) %>%
  kable(caption = "ATM4 Cash Spike ($10,919*100 value on Feb 11)")
ATM4 Cash Spike ($10,919*100 value on Feb 11)
DATE ATM Cash
2010-02-01 ATM4 669.70765
2010-02-02 ATM4 979.67594
2010-02-03 ATM4 426.40601
2010-02-04 ATM4 153.24269
2010-02-05 ATM4 274.94992
2010-02-06 ATM4 136.27636
2010-02-07 ATM4 454.41695
2010-02-08 ATM4 458.30427
2010-02-09 ATM4 112.03063
2010-02-10 ATM4 417.91228
2010-02-11 ATM4 10919.76164
2010-02-12 ATM4 42.43808
2010-02-13 ATM4 280.04343
2010-02-14 ATM4 412.31888
2010-02-15 ATM4 852.83742

Summary Table

Get some high level counts and sums of dollars withdrawn, including counts for NA ATMs.

# Overall statistical summary
summary(atm)
##       DATE                ATM                 Cash        
##  Min.   :2009-05-03   Length:1474        Min.   :    0.0  
##  1st Qu.:2009-08-03   Class :character   1st Qu.:    0.5  
##  Median :2009-11-03   Mode  :character   Median :   73.0  
##  Mean   :2009-11-02                      Mean   :  155.6  
##  3rd Qu.:2010-02-03                      3rd Qu.:  114.0  
##  Max.   :2010-05-16                      Max.   :10919.8  
##                                          NA's   :19
# Aggregate counts and missing values by ATM
atm %>%
  as_tibble() %>%
  group_by(ATM) %>%
  summarise(
    count = n(), 
    total_cash = sum(Cash, na.rm = TRUE),
    zero_days = sum(Cash == 0, na.rm = TRUE), 
    missing_days = sum(is.na(Cash)) 
  ) %>%
  kable(caption = "Total records, cash, and missing days per ATM")
Total records, cash, and missing days per ATM
ATM count total_cash zero_days missing_days
ATM1 365 30367.0 0 3
ATM2 365 22716.0 2 2
ATM3 365 263.0 362 0
ATM4 365 173025.8 0 0
NA 14 0.0 0 14

NA Records

Further investigate when the ATM values are NA.

# Extract dates with NA ATM value
na_atm_dates <- atm %>%
  filter(is.na(ATM)) %>%
  pull(DATE)

# Check all ATM activity on those specific dates
atm %>%
  filter(DATE %in% na_atm_dates) %>%
  arrange(DATE, ATM) %>%
  kable(caption = "Dates with a missing 'NA' ATM record")
Dates with a missing ‘NA’ ATM record
DATE ATM Cash
2010-05-03 NA NA
2010-05-04 NA NA
2010-05-05 NA NA
2010-05-06 NA NA
2010-05-07 NA NA
2010-05-08 NA NA
2010-05-09 NA NA
2010-05-10 NA NA
2010-05-11 NA NA
2010-05-12 NA NA
2010-05-13 NA NA
2010-05-14 NA NA
2010-05-15 NA NA
2010-05-16 NA NA

Also investigate when the Cash values are NA, but it’s not that both ATM and Cash are NA.

Missing Cash Records

Next, let’s identify the specific dates where an ATM is known to exist, but its Cash withdrawal value is missing. This deliberately excludes completely blank phantom rows where both the ATM name and Cash value are NA.

# Filter for missing Cash, but ensure ATM is NOT missing
atm %>%
  filter(is.na(Cash) & !is.na(ATM)) %>%
  arrange(ATM, DATE) %>%
  kable(caption = "Table 5: Records with missing Cash values for known ATMs")
Table 5: Records with missing Cash values for known ATMs
DATE ATM Cash
2009-06-15 ATM1 NA
2009-06-18 ATM1 NA
2009-06-24 ATM1 NA
2009-06-20 ATM2 NA
2009-06-26 ATM2 NA

See if spikes in withdrawals are mondays

Being in fintech right now, I suspected that the oscillations on the graphs are Monday highs and weekend lows. I just wanted to dummy check that.

# Average Cash by Day of the Week and ATM
atm %>%
  as_tibble() %>% 
  mutate(Day_of_Week = wday(DATE, label = TRUE, abbr = FALSE)) %>% # get day of week
  group_by(ATM, Day_of_Week) %>%
  summarise(
    Avg_Cash = mean(Cash, na.rm = TRUE),
    .groups = 'drop'
  ) %>%
  pivot_wider(names_from = Day_of_Week, values_from = Avg_Cash) %>% # day of week to column
  kable(caption = "Average Daily Cash Withdrawals by ATM and Day of the Week", digits = 2)
Average Daily Cash Withdrawals by ATM and Day of the Week
ATM Sunday Monday Tuesday Wednesday Thursday Friday Saturday
ATM1 98.64 96.61 102.65 86.00 89.57 82.15 31.69
ATM2 92.02 75.98 67.15 58.73 73.25 43.75 25.53
ATM3 1.60 0.00 0.00 0.00 0.00 1.85 1.58
ATM4 573.53 491.54 539.07 481.29 647.27 413.72 169.96
NA NaN NaN NaN NaN NaN NaN NaN

Initial Notes

  • ATM4 is seeing the most money movement by far.
    • In a business context, I’d confirm with someone that this is expected.
  • ATM4 has a crazy spike in withdrawal volume on Feb 11th.
    • In a business context, I’d investigate if there was a known reason.
    • In machine learning context, this is an outlier that may skew projections
  • ATM3 only sees money moving starting from 2010-04-30. Maybe ATM3 was only set up that time.
  • Withdrawal volume is highest at the start of the work week (Mon.) and lowest on the weekends.
  • There are some NA values for ATM in the data set. These seem to all have NA for Cash, and be from May (i.e. the bottom of the excel spreadsheet with the source data.) Since the ask here is to predict May data, I think this is safe to cut from the data set.
  • There are some NA values for Cash in the data set. The ATMs and dates these happened on don’t have a pattern or calendar significance (though they all take place in June). These probably need to be imputed.

Next steps

  • Remove the May values with NA for both ATM and Cash.
  • Impute the missing values.
  • Replace the outlier with imputed value

Cleaning NA and Outliers

Cut the May missing values

There is May data for the 1st and 2nd for all ATMs, so I’m leaving that as is. I’m just cutting the NAs there on.

# CUT NA records (which is May 3rd onward) 
atm_clean <- atm %>%
  drop_na(ATM) 

Imputation for Cash NAs

For imputation, I’m opting to use ARIMA interpolation, one of the recommended methods in the Hyndman textbook.

# Check for time gaps (before ARIMA)
atm_clean %>% 
  has_gaps()
## # A tibble: 5 × 2
##   ATM   .gaps
##   <chr> <lgl>
## 1 ATM1  FALSE
## 2 ATM2  FALSE
## 3 ATM3  FALSE
## 4 ATM4  FALSE
## 5 <NA>  FALSE
# Where are the time gaps if there are some
atm_clean %>% 
  count_gaps() %>%
  knitr::kable(caption = "Time Gaps by ATM")
Time Gaps by ATM
ATM .from .to .n
# No Time Gaps
# Remove "NA" as a key from when I originally made the data tsibble 
atm_clean <- atm_clean %>%
  as_tibble() %>%                   
  as_tsibble(index = DATE, key = ATM) # rebuild

key_data(atm_clean) %>% 
  select(-.rows) %>% 
  kable(caption = "Ensuring that NA is not a key in the Tsibble anymore")
Ensuring that NA is not a key in the Tsibble anymore
ATM
ATM1
ATM2
ATM3
ATM4
# Impute NAs with ARIMA 
atm_clean <- atm_clean %>%
  model(arima = ARIMA(Cash)) %>%
  interpolate(atm_clean)

Outlier Analysis and Imputation

I couldn’t use STL() until the NA values were handled. Now that that is done, I can use it for outlier analysis in the traditional way: using remainders > 3*IQR from the median remainder

# Check for outliers in addition to the one I found earlier
# Using remainders > 3x IQR from the median remainder
outlier_analysis <- atm_clean %>%
  model(stlo = STL(Cash ~ season(window = "periodic"), robust = TRUE)) %>%
  components() %>%
  group_by(ATM) %>%
  mutate(
    rem_iqr = IQR(remainder, na.rm = TRUE),
    rem_mid = median(remainder, na.rm = TRUE),
    is_outlier = abs(remainder - rem_mid) > (3 * rem_iqr)
  ) %>%
  ungroup()

# Look at Outliers
outlier_analysis %>%
  filter(is_outlier == TRUE)
## # A tsibble: 87 x 11 [1D]
## # Key:       ATM, .model [4]
##    ATM   .model DATE        Cash trend season_week remainder season_adjust
##    <chr> <chr>  <date>     <dbl> <dbl>       <dbl>     <dbl>         <dbl>
##  1 ATM1  stlo   2009-06-07     3  88.1       16.8     -102.         -13.8 
##  2 ATM1  stlo   2009-07-12    19  86.4       16.8      -84.2          2.20
##  3 ATM1  stlo   2009-08-27   174 103.        23.0       48.0        151.  
##  4 ATM1  stlo   2009-09-13     1  95.6       16.8     -111.         -15.8 
##  5 ATM1  stlo   2009-09-24   180  90.1       23.0       67.0        157.  
##  6 ATM1  stlo   2009-12-02     6  82.0        3.80     -79.8          2.20
##  7 ATM1  stlo   2009-12-31    42  89.8       23.0      -70.8         19.0 
##  8 ATM1  stlo   2010-01-01     2  92.6       -4.32     -86.2          6.32
##  9 ATM1  stlo   2010-01-31   152  81.4       16.8       53.8        135.  
## 10 ATM1  stlo   2010-02-07   138  79.1       16.8       42.1        121.  
## # ℹ 77 more rows
## # ℹ 3 more variables: rem_iqr <dbl>, rem_mid <dbl>, is_outlier <lgl>
# Visually validate the outliers
atm_clean %>%
  left_join(select(outlier_analysis, ATM, DATE, is_outlier), by = c("ATM", "DATE")) %>%
  ggplot(aes(x = DATE, y = Cash)) +
  geom_line(color = "grey60", alpha = 0.7) +
  geom_point(data = . %>% filter(is_outlier == TRUE), color = "red", size = 2) +
  facet_wrap(~ ATM, scales = "free_y") +
  labs(
    title = "Visual Inspection of Methodically Identified Outliers",
    subtitle = "Red dots indicate flagged outliers (3x IQR of STL Remainder)",
    x = "Date", 
    y = "Cash Withdrawn (Hundreds)"
  ) +
  theme_minimal()

This method classified a lot of data points as outliers that frankly, are not. It remains that the only “outlier” I am concerned with is the one from ATM4 on Feb. 11th. So I will impute that one only.

# Replace my One Feb 11th Outlier with NA
atm_clean <- atm_clean %>%
  mutate(
    Cash = if_else(ATM == "ATM4" & DATE == as.Date("2010-02-11"), NA_real_, Cash)
  ) # NA_real_ is explicitly for missing decimal number

# Impute
atm_clean <- atm_clean %>%
  model(arima = ARIMA(Cash)) %>%
  interpolate(atm_clean)

# Peaking at the imputed value
atm_clean %>%
  filter(ATM == "ATM4", DATE >= as.Date("2010-02-09") & DATE <= as.Date("2010-02-13")) %>%
  knitr::kable(caption = "ATM4: Feb 11th Outlier Successfully Imputed")
ATM4: Feb 11th Outlier Successfully Imputed
ATM DATE Cash
ATM4 2010-02-09 112.03063
ATM4 2010-02-10 417.91228
ATM4 2010-02-11 214.38754
ATM4 2010-02-12 42.43808
ATM4 2010-02-13 280.04343

Training and Test Sets

# Training Set (Everything up to March 31, 2010)
atm_train <- atm_clean %>% 
  filter(DATE <= as.Date("2010-03-31"))

# Test Set (April 1st through May 2nd)
atm_test <- atm_clean %>% 
  filter(DATE > as.Date("2010-03-31"))

# Verify the split
cat("Training Set ends on:", as.character(max(atm_train$DATE)), "and has", nrow(atm_train), "days.\n")
## Training Set ends on: 2010-03-31 and has 1332 days.
cat("Test Set starts on:", as.character(min(atm_test$DATE)), "and has", nrow(atm_test), "days.\n")
## Test Set starts on: 2010-04-01 and has 128 days.

Modeling ATM1

STL Decompositon for ATM1

# STL decomposition
atm_train %>%
  filter(ATM == "ATM1") %>%
  model(STL(Cash ~ season(window = "periodic"), robust = TRUE)) %>%
  components() %>%
  autoplot() +
  labs(title = "STL Decomposition for ATM1")

I already confirmed earlier that withdrawal volume is highest Mondays and lowest on the weekends. I can see weekly seasonality confirmed on the STL Decomp. view.

I also see that the trend wanders without much direction, but interestingly, the variation in the remainder becomes much higher from roughly March 2010 onwards.

Model Prep. ATM1

# Isolate ATM1 
atm1 <- atm_clean %>% filter(ATM == "ATM1")
atm1_train <- atm_train %>% filter(ATM == "ATM1")
atm1_test <- atm_test %>% filter(ATM == "ATM1")

#Optimal Box-Cox Lambda using Guerrero's method
lambda_atm1 <- atm1_train %>%
  features(Cash, features = guerrero) %>%
  pull(lambda_guerrero)

cat("Optimal Box-Cox Lambda for ATM1:", round(lambda_atm1, 4), "\n")
## Optimal Box-Cox Lambda for ATM1: 0.6236
# Check for required differencing on the transformed data 
# seasonal
nsdiffs_atm1 <- atm1_train %>%
  mutate(box_cash = box_cox(Cash, lambda_atm1)) %>%
  features(box_cash, unitroot_nsdiffs) %>%
  pull(nsdiffs)

cat("Seasonal differences required (nsdiffs):", nsdiffs_atm1, "\n")
## Seasonal differences required (nsdiffs): 1
# ordinary
ndiffs_atm1 <- atm1_train %>%
  mutate(box_cash = box_cox(Cash, lambda_atm1)) %>%
  features(box_cash, unitroot_ndiffs) %>%
  pull(ndiffs)

cat("Ordinary differences required (ndiffs):", ndiffs_atm1, "\n")
## Ordinary differences required (ndiffs): 0

I’m going to move forward using the optimal Box-Cox lambda because I am more focused on effective modeling than on explanatory power at this time.. The value of 0.6236 is a bit between 0.5 and 1. Where 0.5 is taking the square root transformation to stabilize the variance, and 1 is doing nothing (there is a stable enough variance).

(I’ll use the optimal lambda values as they are found for the rest of this project as well, rather than rounding to more “explainable” lambdas.)

Model Comparisons ATM1

# Fit models on training data
atm1_train_fit <- atm1_train %>%
  model(
    snaive = SNAIVE(box_cox(Cash, lambda_atm1)),
    ets = ETS(box_cox(Cash, lambda_atm1)),
    arima_auto = ARIMA(box_cox(Cash, lambda_atm1)),
    arima_manual = ARIMA(box_cox(Cash, lambda_atm1) ~ pdq(d=0) + PDQ(D=1))
  )

# Forecast the length of the TEST data
atm1_test_forecast <- atm1_train_fit %>%
  forecast(new_data = atm1_test)

# Grade the forecast against the true historical data to find out-of-sample accuracy
accuracy(atm1_test_forecast, atm1) %>%
  arrange(RMSE) %>%
  select(.model, RMSE, MAE, MAPE, MASE) %>% 
  knitr::kable(caption = "Out-of-Sample Accuracy Metrics for ATM1 (Test Set)")
Out-of-Sample Accuracy Metrics for ATM1 (Test Set)
.model RMSE MAE MAPE MASE
arima_auto 12.65584 10.031486 57.80838 0.5491826
arima_manual 12.65584 10.031486 57.80838 0.5491826
ets 12.73463 9.927388 44.49709 0.5434837
snaive 17.05096 13.744518 80.29855 0.7524559

Comparing ATM1 models by accuracy performance

  • The MASE of all of the models is under 1, which is good.
  • SNAIVE is the worst of these, with MASE of 0.75 (expected)
  • ARIMA performed really well, with MASE of 0.55 RMSE of 12.66.
  • ETS has the lowest MASE (0.54) and a slightly higher RMSE (12.73).
  • ARIMA and ETS are so close that I’ll go to the residuals to better understand the best model.
# ETS Residuals
atm1_train_fit %>%
  select(ets) %>%
  gg_tsresiduals() +
  labs(title = "Residual Diagnostics for ATM1 (ETS Model)")
## Warning: `gg_tsresiduals()` was deprecated in feasts 0.4.2.
## ℹ Please use `ggtime::gg_tsresiduals()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

# ARIMA Residuals
atm1_train_fit %>%
  select(arima_auto) %>%
  gg_tsresiduals() +
  labs(title = "Residual Diagnostics for ATM1 (ARIMA Model)")

# Portmanteau Tests
# Ljung-Box test  
# lag = 14 for daily data to look at 2 weeks of data
atm1_train_fit %>%
  select(ets, arima_auto) %>%
  augment() %>%
  features(.innov, ljung_box, lag = 14) %>%
  knitr::kable(caption = "Ljung-Box Test for ETS and ARIMA Residuals")
Ljung-Box Test for ETS and ARIMA Residuals
.model lb_stat lb_pvalue
arima_auto 10.87316 0.6959701
ets 23.63239 0.0507266

Comparing the ARIMA and ETS model residual plots and portmanteau:

  • The residual plots for ARIMA and ETS look pretty similar.
  • The residual distribution also look similar.
  • The ACF plots, however, show that for the ETS model, some early autocorrelation goes past the boundary lines.
  • The Portmanteau tests for both have p-values over 0.05 (which is good, because if the p value were less than 0.05, I’d conclude that the residuals do not look like white noise.) However, the p-value for the ETS model is 0.0507, which is just **barely* clearing that 0.05 bar. ARIMA looks much better at 0.70.

The ARIMA model is the best choice for ATM1.

# Model from training:
atm1_train_fit$arima_auto
## <lst_mdl[1]>
## [1] <ARIMA(0,0,2)(0,1,2)[7]>
# Refit the ARIMA model to full dataset
atm1_final_fit <- atm1 %>%
  model(
    arima_final = ARIMA(box_cox(Cash, lambda_atm1) ~ pdq(0,0,2) + PDQ(0,1,2, period = 7))
  )

# Final Model
report(atm1_final_fit)
## Series: Cash 
## Model: ARIMA(0,0,2)(0,1,2)[7] 
## Transformation: box_cox(Cash, lambda_atm1) 
## 
## Coefficients:
##          ma1      ma2     sma1     sma2
##       0.1590  -0.0823  -0.5835  -0.0782
## s.e.  0.0523   0.0503   0.0515   0.0500
## 
## sigma^2 estimated as 24.99:  log likelihood=-1083.97
## AIC=2177.93   AICc=2178.1   BIC=2197.34

The final model for ATM1 is ready to use for forecasting

Forecasting ATM1

# Forecast
# h = 30 for 30 days
atm1_fc <- atm1_final_fit %>%
  forecast(h = 30)

# Plot forecast
atm1 %>%
  autoplot(Cash) +
  # Add the forecast layer specifically
  autolayer(atm1_fc, color = "#F8766D", level = NULL) + 
  labs(
    title = "30-Day Cash Forecast: ATM1",
    subtitle = "ARIMA(0,0,2)(0,1,2)[7] Structure",
    y = "Cash",
    x = "Date"
  ) +
  scale_x_date(
    date_breaks = "1 month",   # I want a label per month
    date_labels = "%b %Y"      # Mon. Year format
  ) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) 

Modeling ATM2

STL Decomposition for ATM2

# STL decomposition
atm_train %>%
  filter(ATM == "ATM2") %>%
  model(STL(Cash ~ season(window = "periodic"), robust = TRUE)) %>%
  components() %>%
  autoplot() +
  labs(title = "STL Decomposition for ATM2")

Much like ATM1, weekly seasonality remains, and the variation in the remainder becomes much higher from roughly March 2010 onwards. But what’s different here is the trend, which was relatively steady with a slight decline, also up until March 2010.

Model Prep ATM2

# Isolate atm2 
atm2 <- atm_clean %>% filter(ATM == "ATM2")
atm2_train <- atm_train %>% filter(ATM == "ATM2")
atm2_test <- atm_test %>% filter(ATM == "ATM2")

#Optimal Box-Cox Lambda using Guerrero's method
lambda_atm2 <- atm2_train %>%
  features(Cash, features = guerrero) %>%
  pull(lambda_guerrero)

cat("Optimal Box-Cox Lambda for atm2:", round(lambda_atm2, 4), "\n")
## Optimal Box-Cox Lambda for atm2: 0.5398
# Check for required differencing on the transformed data 
# seasonal
nsdiffs_atm2 <- atm2_train %>%
  mutate(box_cash = box_cox(Cash, lambda_atm2)) %>%
  features(box_cash, unitroot_nsdiffs) %>%
  pull(nsdiffs)

cat("Seasonal differences required (nsdiffs):", nsdiffs_atm2, "\n")
## Seasonal differences required (nsdiffs): 1
# ordinary
ndiffs_atm2 <- atm2_train %>%
  mutate(box_cash = box_cox(Cash, lambda_atm2)) %>%
  features(box_cash, unitroot_ndiffs) %>%
  pull(ndiffs)

cat("Ordinary differences required (ndiffs):", ndiffs_atm2, "\n")
## Ordinary differences required (ndiffs): 1

Different from ATM1, when checking for differencing, ATM2 calls for 1 seasonal and 1 ordinary difference.

Model Comparisons ATM2

# Fit models on training data
atm2_train_fit <- atm2_train %>%
  model(
    snaive = SNAIVE(box_cox(Cash, lambda_atm2)),
    ets = ETS(box_cox(Cash, lambda_atm2)),
    arima_auto = ARIMA(box_cox(Cash, lambda_atm2)),
    arima_manual = ARIMA(box_cox(Cash, lambda_atm2) ~ pdq(d=1) + PDQ(D=1))
  )

# Forecast the length of the TEST data
atm2_test_forecast <- atm2_train_fit %>%
  forecast(new_data = atm2_test)

# Grade the forecast against the true historical data to find out-of-sample accuracy
accuracy(atm2_test_forecast, atm2) %>%
  arrange(RMSE) %>%
  select(.model, RMSE, MAE, MAPE, MASE) %>% 
  knitr::kable(caption = "Out-of-Sample Accuracy Metrics for ATM2 (Test Set)")
Out-of-Sample Accuracy Metrics for ATM2 (Test Set)
.model RMSE MAE MAPE MASE
ets 19.09029 14.17161 Inf 0.6794205
arima_auto 19.83110 15.44513 Inf 0.7404763
snaive 22.95675 16.60056 Inf 0.7958699
arima_manual 47.44609 35.18190 Inf 1.6867035

Comparing ATM2 models by accuracy performance

  • The MASE of the manual ARIMA model is 1.68, which means it performed significantly worse than a baseline naive forecast. I’ll discard this model choice.
    • Side note: Even though the unit root tests suggested both an ordinary and seasonal difference (d=1, D=1), explicitly forcing these constraints in the manual ARIMA model clearly resulted in the worst performance. Adhering to the KPSS tests probably led to overdifferencing.
  • The SNAIVE model serves as a reasonable baseline with a MASE of 0.79.
  • The Auto ARIMA model performed decently, beating the SNAIVE baseline with a MASE of 0.74 and an RMSE of 19.83.
  • ETS is the winner from the accuracy perspective, with the lowest MASE (0.68) and the lowest RMSE (19.09).

However, especially based on ATM1, I will look at both ETS and ARIMA (the auto-selected one, not my manual) from a residual perspective

# ETS Residuals
atm2_train_fit %>%
  select(ets) %>%
  gg_tsresiduals() +
  labs(title = "Residual Diagnostics for ATM2 (ETS Model)")

# ARIMA Residuals
atm2_train_fit %>%
  select(arima_auto) %>%
  gg_tsresiduals() +
  labs(title = "Residual Diagnostics for ATM2 (ARIMA Model)")

# Portmanteau Tests
# Ljung-Box test  
# lag = 14 for daily data to look at 2 weeks of data
atm2_train_fit %>%
  select(ets, arima_auto) %>%
  augment() %>%
  features(.innov, ljung_box, lag = 14) %>%
  knitr::kable(caption = "Ljung-Box Test for ETS and ARIMA Residuals")
Ljung-Box Test for ETS and ARIMA Residuals
.model lb_stat lb_pvalue
arima_auto 10.49173 0.7254213
ets 31.07524 0.0054106

Comparing the ARIMA and ETS model residual plots and portmanteau:

  • The residual plots for ARIMA and ETS look pretty similar.
  • The residual distribution also look similar, but the one for ARIMA does look a bit more centered and symmetric.
  • The ACF plots have differences again. ETS is spiking beyond the boundary lines in a couple places and at higher amplitudes in general compared to ARIMA, which appears cleaner.
  • The Portmanteau tests for show that ETS has a p-value under 0.05, which means I reject the null hypothesis that the residuals are randon noise for the ETS model.

The ARIMA model is the best choice for ATM2.

# Model from training:
atm2_train_fit$arima_auto
## <lst_mdl[1]>
## [1] <ARIMA(2,0,2)(0,1,1)[7]>
# Refit the ARIMA model to full dataset
atm2_final_fit <- atm2 %>%
  model(
    arima_final = ARIMA(box_cox(Cash, lambda_atm2) ~ pdq(2,0,2) + PDQ(0,1,1, period = 7))
  )

# Final Model
report(atm2_final_fit)
## Series: Cash 
## Model: ARIMA(2,0,2)(0,1,1)[7] 
## Transformation: box_cox(Cash, lambda_atm2) 
## 
## Coefficients:
##           ar1      ar2     ma1     ma2     sma1
##       -0.4158  -0.8961  0.4560  0.7860  -0.6971
## s.e.   0.0616   0.0513  0.0909  0.0652   0.0431
## 
## sigma^2 estimated as 16.57:  log likelihood=-1009.89
## AIC=2031.78   AICc=2032.02   BIC=2055.07

The final model for ATM2 is ready to use for forecasting

Forecasting ATM2

# Forecast
# h = 30 for 30 days
atm2_fc <- atm2_final_fit %>%
  forecast(h = 30)

# Plot forecast
atm2 %>%
  autoplot(Cash) +
  # Add the forecast layer specifically
  autolayer(atm2_fc, color = "#7CAE00", level = NULL) + 
  labs(
    title = "30-Day Cash Forecast: ATM2",
    subtitle = "ARIMA(2,0,2)(0,1,1)[7] Structure",
    y = "Cash",
    x = "Date"
  ) +
  scale_x_date(
    date_breaks = "1 month",   # I want a label per month
    date_labels = "%b %Y"      # Mon. Year format
  ) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) 

Modeling ATM3

I already know that ATM3 has only three days’ worth of data. So this is going to look pretty different from ATM1 and ATM2.

STL Decomposition for ATM3

For the sake of thoroughness, I’ll still look at STL decomp for ATM3.

# STL decomposition
atm_train %>%
  filter(ATM == "ATM3") %>%
  model(STL(Cash ~ season(window = "periodic"), robust = TRUE)) %>%
  components() %>%
  autoplot() +
  labs(title = "STL Decomposition for ATM3")

As expected, STL Decomp. yielded nothing here.

Model Prep ATM3

The model preparation is different here too. For ATM1 and ATM2 I looked at the Box-Cox Lambda using Guerrero’s method and at seasonal and ordinary differencing. With so little data, I’m skipping that here.

Model Comparisons ATM3

Again, with three days’ data, I’m not going to entertain ETS and ARIMA. Nor am I going to train and test models. I’ll use the mean to forecast the data because I have already identified that there is seasonal (weekly) fluctuation. And I’d rather use the mean of 3 days as a predictor than the most recent value of one day.

atm3 <- atm_clean %>%
  filter(ATM == "ATM3")

# When ATM3 has data
atm3_active <- atm3 %>% 
  filter(Cash > 0)
# Fit a simple Mean model
atm3_fit <- atm3_active %>%
  model(mean = MEAN(Cash))

# Model report
report(atm3_fit)
## Series: Cash 
## Model: MEAN 
## 
## Mean: 87.6667 
## sigma^2: 54.3333

Forecasting ATM3

# Forecast
# h = 30 for 30 days
atm3_fc <- atm3_fit %>%
  forecast(h = 30)

# Plot forecast
atm3 %>%
  autoplot(Cash) +
  # Add the forecast layer specifically
  autolayer(atm3_fc, color = "#00BFC4", level = NULL) + 
  labs(
    title = "30-Day Cash Forecast: ATM3",
    subtitle = "Simple Mean Model",
    y = "Cash",
    x = "Date"
  ) +
  scale_x_date(
    date_breaks = "1 month",   # I want a label per month
    date_labels = "%b %Y"      # Mon. Year format
  ) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) 

Modeling ATM4

STL Decomposition for ATM4

# STL decomposition
atm_train %>%
  filter(ATM == "ATM4") %>%
  model(STL(Cash ~ season(window = "periodic"), robust = TRUE)) %>%
  components() %>%
  autoplot() +
  labs(title = "STL Decomposition for ATM4")

I can still see weekly seasonality for ATM4. And the trend wanders a lot like ATM1. Interestingly, the variation in the remainder does not become much higher from March 2010 onwards as it did with ATM1 and ATM2.

As I confirmed before, the magnitude of withdrawals is higher in ATM4 than the others, but also, I see that the outlier I removed is indeed gone, as I’d hoped when I imputed it.

Model Prep ATM4

# Isolate ATM4 
atm4 <- atm_clean %>% filter(ATM == "ATM4")
atm4_train <- atm_train %>% filter(ATM == "ATM4")
atm4_test <- atm_test %>% filter(ATM == "ATM4")

#Optimal Box-Cox Lambda using Guerrero's method
lambda_atm4 <- atm4_train %>%
  features(Cash, features = guerrero) %>%
  pull(lambda_guerrero)

cat("Optimal Box-Cox Lambda for ATM4:", round(lambda_atm4, 4), "\n")
## Optimal Box-Cox Lambda for ATM4: 0.4028
# Check for required differencing on the transformed data 
# seasonal
nsdiffs_atm4 <- atm4_train %>%
  mutate(box_cash = box_cox(Cash, lambda_atm4)) %>%
  features(box_cash, unitroot_nsdiffs) %>%
  pull(nsdiffs)

cat("Seasonal differences required (nsdiffs):", nsdiffs_atm4, "\n")
## Seasonal differences required (nsdiffs): 0
# ordinary
ndiffs_atm4 <- atm4_train %>%
  mutate(box_cash = box_cox(Cash, lambda_atm4)) %>%
  features(box_cash, unitroot_ndiffs) %>%
  pull(ndiffs)

cat("Ordinary differences required (ndiffs):", ndiffs_atm4, "\n")
## Ordinary differences required (ndiffs): 0

Model Comparisons ATM4

# Fit models on training data
atm4_train_fit <- atm4_train %>%
  model(
    snaive = SNAIVE(box_cox(Cash, lambda_atm4)),
    ets = ETS(box_cox(Cash, lambda_atm4)),
    arima_auto = ARIMA(box_cox(Cash, lambda_atm4)),
    arima_manual_from_SPSS = ARIMA(box_cox(Cash, lambda_atm4) ~ pdq(d=0) + PDQ(D=0)),
    arima_manual_from_intuition = ARIMA(box_cox(Cash, lambda_atm4) ~ pdq(d=0) + PDQ(D=1))
  )
## Warning in sqrt(diag(best$var.coef)): NaNs produced
# Forecast the length of the TEST data
atm4_test_forecast <- atm4_train_fit %>%
  forecast(new_data = atm4_test)

# Grade the forecast against the true historical data to find out-of-sample accuracy
accuracy(atm4_test_forecast, atm4) %>%
  arrange(RMSE) %>%
  select(.model, RMSE, MAE, MAPE, MASE) %>% 
  knitr::kable(caption = "Out-of-Sample Accuracy Metrics for ATM4 (Test Set)")
Out-of-Sample Accuracy Metrics for ATM4 (Test Set)
.model RMSE MAE MAPE MASE
arima_manual_from_intuition 248.8757 210.6874 861.8666 0.5953832
arima_auto 260.9621 224.0422 914.2848 0.6331225
arima_manual_from_SPSS 260.9621 224.0422 914.2848 0.6331225
ets 303.1305 258.4645 1181.5676 0.7303968
snaive 789.8017 666.7780 1252.9923 1.8842529

Comparing ATM4 models by accuracy performance

  • SNAIVE was quite bad, with MASE of 1.88
  • ETS was alright with a MASE of 0.73, but ARIMA performed better in accuracy
  • Here’s where I was experimenting with ARIMA. The manual SPSS tests for ARIMA called for no differencing. So I decided to test with that manually, with 1 seasonal difference based on intuition, and then with the auto-ARIMA.
    • ARIMA where I manually adhered to the results of the SPSS test, and the auto-ARIMA performed identically
    • My ARIMA where I manually went off of my intuition for seasonal differencing had the best accuracy metrics accross the board.
    • I suspect that my manual intuition ARIMA model might be overly complex or unstable, so I’ll check for this
# complexity (AICc)
atm4_train_fit %>%
  select(arima_auto, arima_manual_from_intuition) %>%
  glance() %>%
  arrange(AICc) %>%
  select(.model, AICc, BIC) %>%
  knitr::kable(caption = "Information Criteria Comparison: Auto vs Intuition")
Information Criteria Comparison: Auto vs Intuition
.model AICc BIC
arima_manual_from_intuition 2449.251 2475.407
arima_auto 2508.037 2526.894
#  (p,d,q)(P,D,Q) parameter compare
atm4_train_fit %>%
  select(arima_auto, arima_manual_from_intuition) 
## # A mable: 1 x 2
##                         arima_auto arima_manual_from_intuition
##                            <model>                     <model>
## 1 <ARIMA(0,0,1)(2,0,0)[7] w/ mean>    <ARIMA(2,0,1)(1,1,2)[7]>
# AR and MA Roots for Stability
# Auto ARIMA Roots
atm4_train_fit %>%
  select(arima_auto) %>%
  gg_arma() +
  labs(
    title = "Inverse AR/MA Roots: Auto ARIMA",
    subtitle = "Points should lie entirely inside the unit circle"
  )
## Warning: `gg_arma()` was deprecated in feasts 0.4.2.
## ℹ Please use `ggtime::gg_arma()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

# Manual Intuition ARIMA Roots
atm4_train_fit %>%
  select(arima_manual_from_intuition) %>%
  gg_arma() +
  labs(
    title = "Inverse AR/MA Roots: Manual Intuition ARIMA",
    subtitle = "Checking for borderline instability (roots on or near the circle)"
  )

# Check for stability more definitively than with the circle
atm4_train_fit %>%
  select(arima_auto, arima_manual_from_intuition) %>%
  glance() %>%
  select(.model, ar_roots, ma_roots) %>%
  pivot_longer(ends_with("roots"), names_to = "Root_Type", values_to = "Root") %>%
  unnest(Root) %>%
  mutate(
    Inverse_Distance = Mod(1 / Root),
    Is_Stable = Inverse_Distance < 1
  ) %>%
  select(.model, Root_Type, Inverse_Distance, Is_Stable) %>%
  knitr::kable(digits = 4, caption = "Definitive Root Check: Auto vs Intuition")
Definitive Root Check: Auto vs Intuition
.model Root_Type Inverse_Distance Is_Stable
arima_auto ar_roots 0.9166 TRUE
arima_auto ar_roots 0.9166 TRUE
arima_auto ar_roots 0.8586 TRUE
arima_auto ar_roots 0.8586 TRUE
arima_auto ar_roots 0.8586 TRUE
arima_auto ar_roots 0.8586 TRUE
arima_auto ar_roots 0.9166 TRUE
arima_auto ar_roots 0.9166 TRUE
arima_auto ar_roots 0.9166 TRUE
arima_auto ar_roots 0.9166 TRUE
arima_auto ar_roots 0.8586 TRUE
arima_auto ar_roots 0.8586 TRUE
arima_auto ar_roots 0.8586 TRUE
arima_auto ar_roots 0.9166 TRUE
arima_auto ma_roots 0.0816 TRUE
arima_manual_from_intuition ar_roots 0.8715 TRUE
arima_manual_from_intuition ar_roots 0.8715 TRUE
arima_manual_from_intuition ar_roots 0.8715 TRUE
arima_manual_from_intuition ar_roots 0.8715 TRUE
arima_manual_from_intuition ar_roots 0.8715 TRUE
arima_manual_from_intuition ar_roots 0.8715 TRUE
arima_manual_from_intuition ar_roots 0.8715 TRUE
arima_manual_from_intuition ar_roots 0.7533 TRUE
arima_manual_from_intuition ar_roots 0.0363 TRUE
arima_manual_from_intuition ma_roots 0.9851 TRUE
arima_manual_from_intuition ma_roots 0.9851 TRUE
arima_manual_from_intuition ma_roots 0.9851 TRUE
arima_manual_from_intuition ma_roots 0.9851 TRUE
arima_manual_from_intuition ma_roots 0.8732 TRUE
arima_manual_from_intuition ma_roots 0.8732 TRUE
arima_manual_from_intuition ma_roots 0.8732 TRUE
arima_manual_from_intuition ma_roots 0.9851 TRUE
arima_manual_from_intuition ma_roots 0.9851 TRUE
arima_manual_from_intuition ma_roots 0.9851 TRUE
arima_manual_from_intuition ma_roots 0.8732 TRUE
arima_manual_from_intuition ma_roots 0.8732 TRUE
arima_manual_from_intuition ma_roots 0.8732 TRUE
arima_manual_from_intuition ma_roots 0.8732 TRUE
arima_manual_from_intuition ma_roots 0.6950 TRUE

Even though my manually built model scored better on the accuracy tests, it’s not better than the auto-ARIMA.

  • The manual-intuition ARIAM is more complex, which AIC penalizes. But it does have a better/lower AIC.
  • However, the manual-intuition ARIMA has roots nearly on the unit circle. When it has 6 parameters (higher complexity), having the roots be this close means that the model in potentially unstable and should probably be rejected. Otherwise, it could have long-term predictions that may go a little crazy.

I’m going to proceed like the auto-ARIMA model is the best choice, and I’ll still check its residual plot below. I’ll also look at ETS here.

# ETS Residuals
atm4_train_fit %>%
  select(ets) %>%
  gg_tsresiduals() +
  labs(title = "Residual Diagnostics for ATM4 (ETS Model)")

# ARIMA Residuals
atm4_train_fit %>%
  select(arima_auto) %>%
  gg_tsresiduals() +
  labs(title = "Residual Diagnostics for ATM4 (ARIMA Model)")

# Portmanteau Tests
# Ljung-Box test  
# lag = 14 for daily data to look at 2 weeks of data
atm4_train_fit %>%
  select(ets, arima_auto) %>%
  augment() %>%
  features(.innov, ljung_box, lag = 14) %>%
  knitr::kable(caption = "Ljung-Box Test for ARIMA Residuals")
Ljung-Box Test for ARIMA Residuals
.model lb_stat lb_pvalue
arima_auto 13.90588 0.4567468
ets 18.62921 0.1796068

Looking at the ARIMA model (the auto one) residual plots and portmanteau:

  • The residuals look like white noise for both ARIMA and ETS, and the p-values well over 0.05 from the Portmanteau test confirms this.
  • The ACF plots for both ARIMA and ETS show a spike outside of the bounds at some point, though it doesn’t look like a systemic pattern.
  • The residual histograms looks slightly left skewed for both, though ETS looks a little less well centered at 0 than ARIMA. Pretty okay overall.

I’ll move forward with the auto selected ARIMA model based on it scoring slightly better in accuracy and residual analysis than the others.

# Model from training:
atm4_train_fit$arima_auto
## <lst_mdl[1]>
## [1] <ARIMA(0,0,1)(2,0,0)[7] w/ mean>
# Refit the ARIMA model to full dataset
atm4_final_fit <- atm4 %>%
  model(
    arima_final = ARIMA(box_cox(Cash, lambda_atm4) ~ pdq(0,0,1) + PDQ(2,0,0, period = 7))
  )
# Final Model
report(atm4_final_fit)
## Series: Cash 
## Model: ARIMA(0,0,1)(2,0,0)[7] w/ mean 
## Transformation: box_cox(Cash, lambda_atm4) 
## 
## Coefficients:
##          ma1    sar1    sar2  constant
##       0.0785  0.2128  0.2082   13.5498
## s.e.  0.0528  0.0516  0.0525    0.5624
## 
## sigma^2 estimated as 104.5:  log likelihood=-1364.98
## AIC=2739.96   AICc=2740.13   BIC=2759.46

The final model for ATM4 is ready to use for forecasting

Forecasting ATM4

# Forecast
# h = 30 for 30 days
atm4_fc <- atm4_final_fit %>%
  forecast(h = 30)

# Plot forecast
atm4 %>%
  autoplot(Cash) +
  # Add the forecast layer specifically
  autolayer(atm4_fc, color = "#C77CFF", level = NULL) + 
  labs(
    title = "30-Day Cash Forecast: ATM4",
    subtitle = "ARIMA(0,0,1)(2,0,0)[7] Structure",
    y = "Cash",
    x = "Date"
  ) +
  scale_x_date(
    date_breaks = "1 month",   # I want a label per month
    date_labels = "%b %Y"      # Mon. Year format
  ) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) 

This prediction clearly does not appear to be right. Because this ARIMA model had no differencing, it’s assuming that the data is already stationary, so the prediction is progressing towards the mean.

I think I need the seasonal differencing component, but I also need to ensure that it’s used to create a stable model. I’ll try that below.

Model Iteration 2 for ATM4

# Try a seasonally difference ARIMA that is forced to be simpler
# Also try ETS again
atm4_train2_fit <- atm4_train %>%
  model(
    arima_seasonal_and_stable = ARIMA(box_cox(Cash, lambda_atm4) ~ pdq(0,0,1) + PDQ(0,1,1, period = 7))
  )

# Forecast the length of the TEST data
atm4_test2_forecast <- atm4_train2_fit %>%
  forecast(new_data = atm4_test)

# Grade the forecast against the true historical data to find out-of-sample accuracy
accuracy(atm4_test2_forecast, atm4) %>%
  arrange(RMSE) %>%
  select(.model, RMSE, MAE, MAPE, MASE) %>% 
  knitr::kable(caption = "Out-of-Sample Accuracy Metrics for ATM4 (Test Set)")
Out-of-Sample Accuracy Metrics for ATM4 (Test Set)
.model RMSE MAE MAPE MASE
arima_seasonal_and_stable 248.1606 210.4713 853.3203 0.5947723

This model’s RMSE, MAE, MAPE and MASE are all performing as well or slightly better than my arima_manual_from_intuition that I modeled earlier. That means the real test is in the stability and complexity.

# complexity (AICc)
atm4_train2_fit %>%
  select(arima_seasonal_and_stable) %>%
  glance() %>%
  arrange(AICc) %>%
  select(.model, AICc, BIC) %>%
  knitr::kable(caption = "Information Criteria Comparison: Auto vs Intuition")
Information Criteria Comparison: Auto vs Intuition
.model AICc BIC
arima_seasonal_and_stable 2442.302 2453.589
#  (p,d,q)(P,D,Q) parameter compare
atm4_train2_fit %>%
  select(arima_seasonal_and_stable) 
## # A mable: 1 x 1
##   arima_seasonal_and_stable
##                     <model>
## 1  <ARIMA(0,0,1)(0,1,1)[7]>
# AR and MA Roots for Stability 
atm4_train2_fit %>%
  select(arima_seasonal_and_stable) %>%
  gg_arma() +
  labs(
    title = "Inverse AR/MA Roots: Auto ARIMA",
    subtitle = "Points should lie entirely inside the unit circle"
  )

# Check for stability more definitively than with the circle
atm4_train2_fit %>%
  select(arima_seasonal_and_stable) %>%
  glance() %>%
  select(.model, ar_roots, ma_roots) %>%
  pivot_longer(ends_with("roots"), names_to = "Root_Type", values_to = "Root") %>%
  unnest(Root) %>% # Flatten the lists of roots into individual rows
  mutate(
    Inverse_Distance = Mod(1 / Root),
    Is_Stable = Inverse_Distance < 1
  ) %>%
  select(.model, Root_Type, Inverse_Distance, Is_Stable) %>%
  knitr::kable(digits = 4, caption = "Definitive Root Check: Iteration 2")
Definitive Root Check: Iteration 2
.model Root_Type Inverse_Distance Is_Stable
arima_seasonal_and_stable ma_roots 0.9853 TRUE
arima_seasonal_and_stable ma_roots 0.9853 TRUE
arima_seasonal_and_stable ma_roots 0.9853 TRUE
arima_seasonal_and_stable ma_roots 0.9853 TRUE
arima_seasonal_and_stable ma_roots 0.9853 TRUE
arima_seasonal_and_stable ma_roots 0.9853 TRUE
arima_seasonal_and_stable ma_roots 0.9853 TRUE
arima_seasonal_and_stable ma_roots 0.0972 TRUE

This model has fewer parameters (only 2). It does have many roots nearly on the unit circle, but I feel much more confident in this. The lower complexity means I’m not so worried about instability. I’ll try this model now.

# Model from training:
atm4_train2_fit$arima_seasonal_and_stable
## <lst_mdl[1]>
## [1] <ARIMA(0,0,1)(0,1,1)[7]>
# Refit the ARIMA model to full dataset
atm4_final2_fit <- atm4 %>%
  model(
    arima_final2 = ARIMA(box_cox(Cash, lambda_atm4) ~ pdq(0,0,1) + PDQ(0,1,1, period = 7))
  )
# Final Model
report(atm4_final2_fit)
## Series: Cash 
## Model: ARIMA(0,0,1)(0,1,1)[7] 
## Transformation: box_cox(Cash, lambda_atm4) 
## 
## Coefficients:
##          ma1     sma1
##       0.0976  -0.8664
## s.e.  0.0524   0.0321
## 
## sigma^2 estimated as 98.1:  log likelihood=-1332.73
## AIC=2671.46   AICc=2671.53   BIC=2683.1

Model Iteration 2 - Forecasting for ATM4

# Forecast
# h = 30 for 30 days
atm4_fc2 <- atm4_final2_fit %>%
  forecast(h = 30)

# Plot forecast
atm4 %>%
  autoplot(Cash) +
  # Add the forecast layer specifically
  autolayer(atm4_fc2, color = "#C77CFF", level = NULL) + 
  labs(
    title = "30-Day Cash Forecast: ATM4",
    subtitle = "ARIMA(0,0,1)(0,1,1)[7] Structure",
    y = "Cash",
    x = "Date"
  ) +
  scale_x_date(
    date_breaks = "1 month",   # I want a label per month
    date_labels = "%b %Y"      # Mon. Year format
  ) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) 

Exporting Final Forecasts

# Get the forecasts (.mean) from all final models 
all_forecasts <- bind_rows(
  atm1_fc %>% as_tibble() %>% select(ATM, DATE, Forecasted_Cash = .mean),
  atm2_fc %>% as_tibble() %>% select(ATM, DATE, Forecasted_Cash = .mean),
  atm3_fc %>% as_tibble() %>% select(ATM, DATE, Forecasted_Cash = .mean),
  atm4_fc2 %>% as_tibble() %>% select(ATM, DATE, Forecasted_Cash = .mean) # Using the iter2 forecast
)

# Join forecasts to the OG data
# NOT using my outlier-cleaned data or imputed data 
final_export <- atm %>%
  as_tibble() %>%
  rename(Original_Cash = Cash) %>%
  full_join(all_forecasts, by = c("ATM", "DATE")) %>%
  arrange(ATM, DATE)

# Export to CSV
export_path <- file.path(projPath, "ATM_Forecast_Results.csv")
write_csv(final_export, export_path)

cat("Exported original data and forecasts to:", export_path, "\n")
## Exported original data and forecasts to: C:/Users/cdube/Grad School/DATA 624/ATM_Forecast_Results.csv

The final forecasts have been exported to a csv.

  • Important note: I know the ask was to forecast the month of May. Because there were 2 days of may that had data (and those days were important for ATM3 since it had next to no data), I started the forecast from May 3rd onward. I also included June 1st in the forecast. Essentially, I opted to forecast one month forward.

Conclusions

# final plot
final_export %>%
  filter(!(is.na(ATM))) %>%
  ggplot(aes(x = DATE)) +
  # OG Data  
  geom_line(aes(y = Original_Cash, color = ATM), alpha = 0.5) +
  # Forecasted
  geom_line(aes(y = Forecasted_Cash, color = ATM), linewidth = 1) +
  facet_wrap(~ ATM, scales = "free_y", ncol = 2) +
  scale_color_manual(values = c(
    "ATM1" = "#F8766D", 
    "ATM2" = "#7CAE00", 
    "ATM3" = "#00BFC4", 
    "ATM4" = "#C77CFF"
  )) +
  labs(
    title = "Final 30-Day Forecasts vs. Original Raw Data by ATM",
    subtitle = "Bold lines in the May 2010 onward are forecasts",
    x = "Date",
    y = "Cash Withdrawn (Hundreds)",
    color = "ATM"
  ) +
  scale_x_date(
    date_breaks = "2 months", 
    date_labels = "%b %Y"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    legend.position = "none", 
    strip.text = element_text(face = "bold", size = 12)
  )
## Warning: Removed 120 rows containing missing values or values outside the scale range
## (`geom_line()`).
## Warning: Removed 1460 rows containing missing values or values outside the scale range
## (`geom_line()`).

Findings and Forecast Recap

Forecasting the cash withdrawals for these four atm locations shows distinct behavior for each.

  • All of the ATMs that are active have a weekly seasonality: high volume at the start of the work week, low volume on weekends.
  • ATM3 just barely became an active machine. Since it’s only been up for 3 days, we’re forecasting for it very simplistically, and this could be revisited in the future.
  • ATM4 sees significantly more volume in withdrawals than the others.
  • ATM1 and ATM2 both saw increased variability starting around March 2010. We might want to investigate events that may be contributing to this.
  • There was a big outlier on Feb. 11th for ATM4 with $1M+ withdrawn. This has been handled in terms of forecasting, but it should be called out to business if it needs to be investigated as expected behavior, a fraud event, etc.

Part B – Forecasting Power, ResidentialCustomerForecastLoad-624.xlsx

Part B consists of a simple dataset of residential power usage for January 1998 until December 2013. Your assignment is to model these data and a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in Kilowatt hours, the rest is straight forward. Add this to your existing files above.

Setup

power_usage <- read_excel(file.path(projPath, "ResidentialCustomerForecastLoad-624.xlsx"))
head(power_usage)
## # A tibble: 6 × 3
##   CaseSequence `YYYY-MMM`     KWH
##          <dbl> <chr>        <dbl>
## 1          733 1998-Jan   6862583
## 2          734 1998-Feb   5838198
## 3          735 1998-Mar   5420658
## 4          736 1998-Apr   5010364
## 5          737 1998-May   4665377
## 6          738 1998-Jun   6467147
# relying on lubridate to create a date column here since there is no day
power_usage <- power_usage %>% 
  mutate(ReportDate = yearmonth(`YYYY-MMM`)) %>%
  select(CaseSequence, ReportDate, KWH) %>%
  as_tsibble(, key=NULL, index=ReportDate)

Exploration

# Quick Peek
power_usage %>%
  autoplot(KWH) + 
  labs(title = "Residential Power Usage (KWH) over Time", 
       x = "Report Date") +
  scale_x_yearmonth(
    date_breaks = "4 month",   # I want a label per quarter
    date_labels = "%b %Y"      # Mon. Year format
  ) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size=6.5)) # tiny text to fit on x axis

# Dip in power usage 
min_index <- which.min(power_usage$KWH)
power_usage %>%
  slice((min_index - 2):(min_index + 2)) %>%
  kable(caption = "5-Point Window: The Lowest KWH Reading and its Neighbors")
5-Point Window: The Lowest KWH Reading and its Neighbors
CaseSequence ReportDate KWH
881 2010 May 4919289
882 2010 Jun 6696292
883 2010 Jul 770523
884 2010 Aug 7922701
885 2010 Sep 7819472
# Check for any missing case sequence
# sequence from min to max and see what is missing 
full_seq <- min(power_usage$CaseSequence, na.rm = TRUE):max(power_usage$CaseSequence, na.rm = TRUE)
missing_vals <- setdiff(full_seq, power_usage$CaseSequence)

data.frame(Missing_CaseSequence = missing_vals) %>%
  kable(caption =  "Gaps in Case Sequence")
Gaps in Case Sequence
Missing_CaseSequence
# Check for repeating case sequence
power_usage %>%
  as_tibble() %>%
  group_by(CaseSequence) %>%
  summarise(count = n()) %>%
  filter(count > 1) %>%
  kable(caption = "Repating Case Sequences")
Repating Case Sequences
CaseSequence count
# check for missing dates 
gap_check <- power_usage %>% has_gaps() 
power_usage %>% 
  scan_gaps() %>% 
  kable(caption = "Time Gaps (Missing Rows) Found in Data")
Time Gaps (Missing Rows) Found in Data
ReportDate
# Check for any NAs in any columns
power_usage %>%
  as_tibble() %>%
  summarise(across(everything(), ~sum(is.na(.)))) %>%
  pivot_longer(everything(), names_to = "Column_Name", values_to = "NA_Count") %>%
  kable(caption = "Count of Missing Values (NAs) by Column")
Count of Missing Values (NAs) by Column
Column_Name NA_Count
CaseSequence 0
ReportDate 0
KWH 1
power_usage %>%
  filter(is.na(KWH)) %>%
  kable(caption = "NA value for KWH")
NA value for KWH
CaseSequence ReportDate KWH
861 2008 Sep NA
# Check for any repeating dates
power_usage %>%
  count(ReportDate) %>%
  filter(n > 1) %>%
  rename(Occurrences = n) %>%
  kable(caption = "Duplicate Report Dates")
Duplicate Report Dates
ReportDate Occurrences
# Summary
power_usage %>%
  as_tibble() %>%
  # Convert yearmonth to Date so summary() can read it
  mutate(ReportDate = as.Date(ReportDate)) %>% 
  summary() %>%
  kable(caption = "Statistical Summary of Power Usage Data")
Statistical Summary of Power Usage Data
CaseSequence ReportDate KWH
Min. :733.0 Min. :1998-01-01 Min. : 770523
1st Qu.:780.8 1st Qu.:2001-12-24 1st Qu.: 5429912
Median :828.5 Median :2005-12-16 Median : 6283324
Mean :828.5 Mean :2005-12-15 Mean : 6502475
3rd Qu.:876.2 3rd Qu.:2009-12-08 3rd Qu.: 7620524
Max. :924.0 Max. :2013-12-01 Max. :10655730
NA NA NA’s :1

To start with, I don’t see much that is “messy” about the data (like duplicate dates or missing dates). There is just one missing KWH value from Sept. 2008.

From visual inspection, I also see that there is clear seasonality. Power usage spikes in the winter and summer.

It also looks like power usage started trending upwards in 2010, following a dip in power usage in July 2010. This dip in power usage for July 2010 appears to be an outlier, and it might even be a user-data-entry error. Since adding a 0 to the end of the number would make it quite plausible. However, I’m not certain that is the case, so I’m going to impute it regardless.

Handle NAs (Imputation)

For imputation of the NA value spotted in exploration, much like in Part A I’m opting to use ARIMA interpolation, one of the recommended methods in the Hyndman textbook.

# NA Value
power_usage %>%
  filter(is.na(KWH))
## # A tsibble: 1 x 3 [1M]
##   CaseSequence ReportDate   KWH
##          <dbl>      <mth> <dbl>
## 1          861   2008 Sep    NA
power_usage_impute_me <- power_usage %>%
  filter(is.na(KWH)) %>%
  pull(CaseSequence)

# Impute NAs with ARIMA 
power_usage_clean <- power_usage %>%
  model(arima = ARIMA(KWH)) %>%
  interpolate(power_usage) %>%
  left_join(as_tibble(power_usage) %>% select(ReportDate, CaseSequence), by = "ReportDate")


# Check where NA used to be 
power_usage_clean %>%
  filter(CaseSequence == power_usage_impute_me)
## # A tsibble: 1 x 3 [1M]
##   ReportDate      KWH CaseSequence
##        <mth>    <dbl>        <dbl>
## 1   2008 Sep 7597704.          861

Handle Outliers

I feel like I’ve learned my lesson from Part A of this project. I could systematically identify all the outliers by some rule. But visually, I see that the only outlier I am concerned with is the one from July 2010. So that’s the one I will handle.

# Dip in power usage 
# I'm rerunning this on power_usage_clean just to be safe, since this is now my working df
min_index <- which.min(power_usage_clean$KWH)

# I prefer anchoring to the CaseSequnce anyway though. So i'm doing both,
min_case_sequence <- power_usage_clean %>%
  filter(KWH == min(power_usage_clean$KWH)) %>%
  pull(CaseSequence)

power_usage_clean %>%
  filter(KWH == min(power_usage_clean$KWH)) %>%
  kable(caption = "Minimum KWH value is an outlier. To be Imputed.")
Minimum KWH value is an outlier. To be Imputed.
ReportDate KWH CaseSequence
2010 Jul 770523 883
# Replace outlier with NA
power_usage_clean <- power_usage_clean %>%
  mutate(
    KWH = if_else(CaseSequence == min_case_sequence, NA_real_, KWH)
  )

# Impute
power_usage_clean <- power_usage_clean %>%
  model(arima = ARIMA(KWH)) %>%
  interpolate(power_usage_clean) %>%
  left_join(as_tibble(power_usage_clean) %>% select(ReportDate, CaseSequence), by = "ReportDate")

# Double Check new Value
power_usage_clean %>%
  filter(CaseSequence == min_case_sequence) %>%
  kable(caption = "New Imputed Value to Handle Outlier")
New Imputed Value to Handle Outlier
ReportDate KWH CaseSequence
2010 Jul 7728530 883

Training and Test Set

My goal is to forecast for 2014, I’ll hold out one year of data (all of 2013) from my test set.

# Training Set (Everything up to December 2012)
power_train <- power_usage_clean %>% 
  filter(ReportDate <= yearmonth("2012-12"))

# Test Set (January 2013 through December 2013)
power_test <- power_usage_clean %>% 
  filter(ReportDate > yearmonth("2012-12"))

# Split Check
cat("Training Set ends on:", as.character(max(power_train$ReportDate)), "and has", nrow(power_train), "months.\n")
## Training Set ends on: 2012 Dec and has 180 months.
cat("Test Set starts on:", as.character(min(power_test$ReportDate)), "and ends on:", as.character(max(power_test$ReportDate)), "with", nrow(power_test), "months.\n")
## Test Set starts on: 2013 Jan and ends on: 2013 Dec with 12 months.

Modeling

STL Decomposition

I want to look at the training data through the lense of STL decomposition.

# STL decomposition on the training set
power_train %>%
  model(STL(KWH ~ season(window = "periodic"), robust = TRUE)) %>%
  components() %>%
  autoplot() +
  labs(
    title = "STL Decomposition for Residential Power Usage",
    subtitle = "Training Set (1998 - 2012)",
    x = "Report Date"
  )

  • Trend:I see a shift upward starting around 2010 that I hesitantly spotted earlier in EDA.
  • Seasonality: As expected, I see seasonality relative to the summer-winter annual cycle. It’s also accounting for a lot of the behavior if I just look at the magnitude of the y axis on it.
  • Remainder: This generally looks like noise, though I do see the variability is a bit tighter in the 2000-2007 time range. It gets a bit larger after that. Box-cox might be called for to stabilize the variance (which I’ll check)

Model Prep.

# Optimal Box-Cox Lambda using Guerrero's method
lambda_power <- power_train %>%
  features(KWH, features = guerrero) %>%
  pull(lambda_guerrero)

cat("Optimal Box-Cox Lambda for Power Usage:", round(lambda_power, 4), "\n")
## Optimal Box-Cox Lambda for Power Usage: -0.1696
# Check for required differencing on the transformed data 
# Seasonal differencing
nsdiffs_power <- power_train %>%
  mutate(box_kwh = box_cox(KWH, lambda_power)) %>%
  features(box_kwh, unitroot_nsdiffs) %>%
  pull(nsdiffs)

cat("Seasonal differences required (nsdiffs):", nsdiffs_power, "\n")
## Seasonal differences required (nsdiffs): 1
# Ordinary differencing
ndiffs_power <- power_train %>%
  mutate(box_kwh = box_cox(KWH, lambda_power)) %>%
  features(box_kwh, unitroot_ndiffs) %>%
  pull(ndiffs)

cat("Ordinary differences required (ndiffs):", ndiffs_power, "\n")
## Ordinary differences required (ndiffs): 1

The optimal Box-Cox lambda is -0.1696, which is pretty close to 0 (natural log transofrmation). The SPSS test also calls for both one seasonal difference (which I expected) and one ordinary difference (maybe to handle the trend seen earlier).

Model Comparisons

I’m going to be just a little bit more thorough with the “theory” or “explanatory power” I’m looking for while fitting these models than I did with Part A. I honestly think it’s a bit overkill, but it’s nice to reinforce the learning. (Also Part A had so many components with many similar trends in each, and that could be used to inform on expectations of each ATM, whereas Part B has fewer components, so I’d rather be a little more thorough up front.)

# Fit models
power_fit <- power_train %>%
  model(
    # Snaive (the baseline basically)
    snaive = SNAIVE(KWH),
    
    # Auto models without transformation
    ets_auto = ETS(KWH),
    arima_auto = ARIMA(KWH),
    
    # Auto models with Box-Cox transformation
    ets_bc = ETS(box_cox(KWH, lambda_power)),
    arima_bc = ARIMA(box_cox(KWH, lambda_power)),
    
    # Manual ARIMA based on SPSS test (d=1, D=1)
    arima_manual = ARIMA(box_cox(KWH, lambda_power) ~ pdq(d=1) + PDQ(D=1))
  )

# Forecast against the 2013 test set
power_fc <- power_fit %>%
  forecast(new_data = power_test)

# Grade the forecast against the true historical test data
accuracy(power_fc, power_usage_clean) %>%
  arrange(RMSE) %>%
  select(.model, RMSE, MAE, MAPE, MASE) %>% 
  knitr::kable(caption = "Out-of-Sample Accuracy Metrics for Power Usage (Test Set)")
Out-of-Sample Accuracy Metrics for Power Usage (Test Set)
.model RMSE MAE MAPE MASE
arima_bc 963502.9 683535.0 8.050892 1.125034
arima_auto 969439.6 655830.8 7.694001 1.079436
arima_manual 970940.7 742764.1 8.925284 1.222520
snaive 1035537.6 618605.6 7.058492 1.018166
ets_bc 1054366.9 697314.2 8.339581 1.147713
ets_auto 1100750.2 690764.1 8.126241 1.136932

Comparing the models by accuracy performance

  • SNAIVE performed really well here for MAE, MAPE and MASE. It didn’t perform well in RMSE though, implying that when it had some large forecasting errors.
  • The models are in a somewhat tight race, but the automatically determined ARIMA is the winner with the the lowest or close to lowest of each kind of error.
  • Because Box-cox transformed ARIMA had the lowest RMSE, I want give it added attention as well.
## Residual Plots

# SNAIVE Residuals
power_fit %>%
  select(snaive) %>%
  gg_tsresiduals() +
  labs(title = "Residual Diagnostics: SNAIVE Baseline")
## Warning: Removed 12 rows containing missing values or values outside the scale range
## (`geom_line()`).
## Warning: Removed 12 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 12 rows containing non-finite outside the scale range
## (`stat_bin()`).
## Warning: Removed 12 rows containing missing values or values outside the scale range
## (`geom_rug()`).

# Auto ARIMA Residuals
power_fit %>%
  select(arima_auto) %>%
  gg_tsresiduals() +
  labs(title = "Residual Diagnostics: Auto ARIMA (No Transformation)")

# Box-Cox ARIMA Residuals
power_fit %>%
  select(arima_bc) %>%
  gg_tsresiduals() +
  labs(title = "Residual Diagnostics: Box-Cox ARIMA")

## Portmanteau Tests 

# Ljung-Box test for white noise
# lag = 24 for two years of monthly seasonal data
power_fit %>%
  select(snaive, arima_auto, arima_bc) %>%
  augment() %>%
  features(.innov, ljung_box, lag = 24) %>%
  arrange(lb_pvalue) %>%
  knitr::kable(caption = "Ljung-Box Test Results (lag = 24)")
Ljung-Box Test Results (lag = 24)
.model lb_stat lb_pvalue
snaive 107.64492 0.0000000
arima_bc 31.42627 0.1418684
arima_auto 27.01241 0.3038750
## Stability Checks (ARIMA only)

# Visual check of AR/MA roots
power_fit %>%
  select(arima_auto, arima_bc) %>%
  gg_arma() +
  labs(
    title = "Inverse AR/MA Roots: ARIMA Models",
    subtitle = "Points should lie entirely inside the unit circle"
  )

# Definitive root check table 
power_fit %>%
  select(arima_auto, arima_bc) %>%
  glance() %>%
  select(.model, ar_roots, ma_roots) %>%
  pivot_longer(ends_with("roots"), names_to = "Root_Type", values_to = "Root") %>%
  unnest(Root) %>%
  mutate(
    Inverse_Distance = Mod(1 / Root),
    Is_Stable = Inverse_Distance < 1
  ) %>%
  select(.model, Root_Type, Inverse_Distance, Is_Stable) %>%
  knitr::kable(digits = 4, caption = "Definitive Root Stability Check")
Definitive Root Stability Check
.model Root_Type Inverse_Distance Is_Stable
arima_auto ar_roots 0.9633 TRUE
arima_auto ar_roots 0.9633 TRUE
arima_auto ar_roots 0.9633 TRUE
arima_auto ar_roots 0.9633 TRUE
arima_auto ar_roots 0.9633 TRUE
arima_auto ar_roots 0.9633 TRUE
arima_auto ar_roots 0.9633 TRUE
arima_auto ar_roots 0.9633 TRUE
arima_auto ar_roots 0.9633 TRUE
arima_auto ar_roots 0.9633 TRUE
arima_auto ar_roots 0.9633 TRUE
arima_auto ar_roots 0.9633 TRUE
arima_auto ar_roots 0.9633 TRUE
arima_auto ar_roots 0.9633 TRUE
arima_auto ar_roots 0.9633 TRUE
arima_auto ar_roots 0.9633 TRUE
arima_auto ar_roots 0.9633 TRUE
arima_auto ar_roots 0.9633 TRUE
arima_auto ar_roots 0.9633 TRUE
arima_auto ar_roots 0.9633 TRUE
arima_auto ar_roots 0.9633 TRUE
arima_auto ar_roots 0.9633 TRUE
arima_auto ar_roots 0.9633 TRUE
arima_auto ar_roots 0.9633 TRUE
arima_auto ma_roots 0.4122 TRUE
arima_bc ar_roots 0.9594 TRUE
arima_bc ar_roots 0.9594 TRUE
arima_bc ar_roots 0.9594 TRUE
arima_bc ar_roots 0.9594 TRUE
arima_bc ar_roots 0.9594 TRUE
arima_bc ar_roots 0.9594 TRUE
arima_bc ar_roots 0.9594 TRUE
arima_bc ar_roots 0.9594 TRUE
arima_bc ar_roots 0.9594 TRUE
arima_bc ar_roots 0.9594 TRUE
arima_bc ar_roots 0.9594 TRUE
arima_bc ar_roots 0.9594 TRUE
arima_bc ar_roots 0.9594 TRUE
arima_bc ar_roots 0.9594 TRUE
arima_bc ar_roots 0.9594 TRUE
arima_bc ar_roots 0.9594 TRUE
arima_bc ar_roots 0.9594 TRUE
arima_bc ar_roots 0.9594 TRUE
arima_bc ar_roots 0.9594 TRUE
arima_bc ar_roots 0.9594 TRUE
arima_bc ar_roots 0.9594 TRUE
arima_bc ar_roots 0.9594 TRUE
arima_bc ar_roots 0.9594 TRUE
arima_bc ar_roots 0.9594 TRUE
arima_bc ma_roots 0.2843 TRUE

Comparing the model residual plots and portmanteau:

  • The SNAIVE residual analysis show a huge spike in autocorrelation at 12 months. And the Portmanteau test has a p value under 0.05. So the SNAIVE is clearly not capturing some of the annual behavior.
  • The two ARIMA models perform similarly, but the auto-generated ARIMA model appears to have more normally distributed residuals, and a higher p-value, which is preferred.
  • Both ARIMA models also show an early autocorrelation spike beyond the boundary lines, which is suboptimal, but again the auto ARIMA shows a smaller amplitude in this than the box-cox one.
  • However, these things are pretty equal in terms of performance. The thing I come back to here is the, where Box-Cox Transformed ARIMA was superior. If that means it is handling extreme values better, that seems better suited when forecasting something as important as power consumption, since it might play into available power supply.

The Box-Cox Transformed ARIMA model is the best choice.

power_fit$arima_bc
## <lst_mdl[1]>
## [1] <ARIMA(0,0,1)(2,1,0)[12] w/ drift>
# Refit the Box-Cox ARIMA model to the FULL dataset
power_final_fit <- power_usage_clean %>%
  model(
    arima_final = ARIMA(box_cox(KWH, lambda_power) ~ 1 + pdq(0,0,1) + PDQ(2,1,0, period = 12))
  )

# Review the final model
report(power_final_fit)
## Series: KWH 
## Model: ARIMA(0,0,1)(2,1,0)[12] w/ drift 
## Transformation: box_cox(KWH, lambda_power) 
## 
## Coefficients:
##          ma1     sar1     sar2  constant
##       0.2486  -0.7202  -0.3560    0.0023
## s.e.  0.0834   0.0752   0.0769    0.0006
## 
## sigma^2 estimated as 4.313e-05:  log likelihood=651.95
## AIC=-1293.9   AICc=-1293.55   BIC=-1277.93

The final model is ready to use for forecasting

Forecasting

# Forecast the next 12 months (January - December 2014)
power_fc <- power_final_fit %>%
  forecast(h = 12)

# Plot the historical data with the forecast
power_usage_clean %>%
  autoplot(KWH) +
  # Add the forecast layer
  autolayer(power_fc, color = "#4682B4", level = c(80, 95)) + 
  labs(
    title = "12-Month Residential Power Usage Forecast (2014)",
    subtitle = "ARIMA(0,0,1)(2,1,0)[12] with drift & Box-Cox transformation",
    y = "Power Usage (KWH)",
    x = "Report Date"
  ) +
  scale_x_yearmonth(
    date_breaks = "1 year", 
    date_labels = "%Y"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Exporting Final Forecasts

# Extract the point forecasts  
power_export <- power_fc %>%
  as_tibble() %>%
  select(ReportDate, Forecasted_KWH = .mean) %>%
  mutate(ReportDate = as.character(ReportDate)) 

# Define export path and write to CSV
export_path_b <- file.path(projPath, "Power_Forecast_2014.csv")
write_csv(power_export, export_path_b)

cat("Exported 2014 Power Usage forecast to:", export_path_b, "\n")
## Exported 2014 Power Usage forecast to: C:/Users/cdube/Grad School/DATA 624/Power_Forecast_2014.csv

The final forecast has been exported to a csv.

Conclusion

  • Residential power consumption has shown a steady upward trajectory since 2010. This indicates a growing customer baseline or increased household energy reliance.
  • Summer and winter drive peak usage, which is quite predictable behavior. However, the severity or intensity of these piked is increasing as time goes on. This seems like something that would require some thoughtful planning for future adequate power procurement. *The drop in power usage July 2010 was dealt with for the purposes of forecasting, but that “blip” could be an error in data entry, or a localized power outage, or some other failure that may warrant investigation.

Part C – BONUS, optional (part or all), Waterflow_Pipe1.xlsx and Waterflow_Pipe2.xlsx

Part C consists of two data sets. These are simple 2 columns sets, however they have different time stamps. Your optional assignment is to time-base sequence the data and aggregate based on hour (example of what this looks like, follows). Note for multiple recordings within an hour, take the mean. Then to determine if the data is stationary and can it be forecast. If so, provide a week forward forecast and present results via Rpubs and .rmd and the forecast in an Excel readable file.

Setup

Because the timestamps provided in the raw data do not fall perfectly on the hour, I’m rounding the DateTime down to the nearest hour. Then I’m getting the average waterflow to handle the case where there are multiple recordings within the same hour.

# read data
pipe1_raw <- read_excel(
  file.path(projPath, "Waterflow_Pipe1.xlsx"), 
  col_types = c("date", "numeric") # Forces Excel to parse as POSIXct datetime
)
 
pipe2_raw <- read_excel(
  file.path(projPath, "Waterflow_Pipe2.xlsx"), 
  col_types = c("date", "numeric")
)

# checking the time period
any(minute(pipe1_raw$`Date Time`) != 0 | second(pipe1_raw$`Date Time`) != 0)
## [1] TRUE
any(duplicated(floor_date(pipe1_raw$`Date Time`, "hour")))
## [1] TRUE
# cleaning data
pipe1_agg <- pipe1_raw %>%
  # Rename columns to remove spaces and standardize
  rename(DateTime = `Date Time`, WaterFlow = WaterFlow) %>%
  # Floor the datetime to the nearest hour
  mutate(DateTimeHour = floor_date(DateTime, "hour")) %>%
  # Aggregate multiple recordings within the same hour
  group_by(DateTimeHour) %>%
  summarise(WaterFlow = mean(WaterFlow, na.rm = TRUE), .groups = 'drop') %>%
  # Convert to tsibble
  as_tsibble(index = DateTimeHour)

pipe2_agg <- pipe2_raw %>%
  rename(DateTime = `Date Time`, WaterFlow = WaterFlow) %>%
  mutate(DateTimeHour = floor_date(DateTime, "hour")) %>%
  group_by(DateTimeHour) %>%
  summarise(WaterFlow = mean(WaterFlow, na.rm = TRUE), .groups = 'drop') %>%
  as_tsibble(index = DateTimeHour)

Data Cleaning Gaps

When working with sensor data, there periods where the sensor has gaps in the time series (the hour simply doesn’t exist in the data frame). So I’ll impute these using ARIMA.

#  
pipe1_agg %>% has_gaps() %>% kable(caption = "Does Pipe 1 have gaps?")
Does Pipe 1 have gaps?
.gaps
TRUE
pipe2_agg %>% has_gaps() %>% kable(caption = "Does Pipe 2 have gaps?")
Does Pipe 2 have gaps?
.gaps
FALSE
# summarized count of the gaps
pipe1_agg %>% 
  count_gaps() %>% 
  kable(caption = "Summary of Gap Sizes in Pipe 1")
Summary of Gap Sizes in Pipe 1
.from .to .n
2015-10-27 17:00:00 2015-10-27 17:00:00 1
2015-10-28 00:00:00 2015-10-28 00:00:00 1
2015-11-01 05:00:00 2015-11-01 05:00:00 1
2015-11-01 09:00:00 2015-11-01 09:00:00 1
# fill gaps
pipe1_clean <- pipe1_agg %>% fill_gaps()
pipe2_clean <- pipe2_agg %>% fill_gaps()

cat("Missing hours found in Pipe 1:", sum(is.na(pipe1_clean$WaterFlow)), "\n")
## Missing hours found in Pipe 1: 4
cat("Missing hours found in Pipe 2:", sum(is.na(pipe2_clean$WaterFlow)), "\n")
## Missing hours found in Pipe 2: 0
# Impute NAs using ARIMA interpolation (if any exist)
if(sum(is.na(pipe1_clean$WaterFlow)) > 0) {
  pipe1_clean <- pipe1_clean %>%
    model(arima = ARIMA(WaterFlow)) %>%
    interpolate(pipe1_clean)
}

if(sum(is.na(pipe2_clean$WaterFlow)) > 0) {
  pipe2_clean <- pipe2_clean %>%
    model(arima = ARIMA(WaterFlow)) %>%
    interpolate(pipe2_clean)
}

Cleaned Data Inspection

pipe1_clean %>%
  autoplot(WaterFlow) +
  labs(title = "Hourly Waterflow: Pipe 1", x = "Date/Time", y = "Waterflow") +
  theme_minimal()

pipe2_clean %>%
  autoplot(WaterFlow) +
  labs(title = "Hourly Waterflow: Pipe 2", x = "Date/Time", y = "Waterflow") +
  theme_minimal()

KPSS

Visually, both datasets appear to have a stable mean and variance over time without any long-term wandering trends. I’ll run a KPSS unit root test to confirm that.

# KPSS Test
bind_rows(
  pipe1_clean %>% features(WaterFlow, unitroot_kpss) %>% mutate(Pipe = "Pipe 1"),
  pipe2_clean %>% features(WaterFlow, unitroot_kpss) %>% mutate(Pipe = "Pipe 2")
) %>%
  select(Pipe, kpss_stat, kpss_pvalue) %>%
  kable(caption = "KPSS Stationarity Test Results") # p value > 0.05 so fail to reject null
KPSS Stationarity Test Results
Pipe kpss_stat kpss_pvalue
Pipe 1 0.0793697 0.1
Pipe 2 0.1053662 0.1

Both Pipe 1 and Pipe 2 return a p-value greater than 0.05. Therefore, I fail to reject the null hypothesis. The data is stationary and requires no ordinary differencing (\(d=0\)) for modeling. Because the variance looks highly stable in the EDA plots, I am also opting to skip Box-Cox transformations for this section.

Modeling

I will fit an automated ARIMA model to both pipes, letting the algorithm identify any seasonal or autoregressive patterns in the stationary hourly data.

# Fit ARIMA models
pipe1_fit <- pipe1_clean %>% model(arima_auto = ARIMA(WaterFlow))
pipe2_fit <- pipe2_clean %>% model(arima_auto = ARIMA(WaterFlow))

# Review the selected models
report(pipe1_fit)
## Series: WaterFlow 
## Model: ARIMA(0,0,0) w/ mean 
## 
## Coefficients:
##       constant
##        19.4683
## s.e.    0.3430
## 
## sigma^2 estimated as 28.35:  log likelihood=-741.4
## AIC=1486.8   AICc=1486.85   BIC=1493.76
report(pipe2_fit)
## Series: WaterFlow 
## Model: ARIMA(0,0,0)(0,0,1)[24] w/ mean 
## 
## Coefficients:
##         sma1  constant
##       0.0846   39.5732
## s.e.  0.0314    0.5472
## 
## sigma^2 estimated as 256:  log likelihood=-4190.54
## AIC=8387.08   AICc=8387.1   BIC=8401.8
# Residual diagnostics for Pipe 1
pipe1_fit %>% gg_tsresiduals() + labs(title = "Residual Diagnostics: Pipe 1")

# Residual diagnostics for Pipe 2
pipe2_fit %>% gg_tsresiduals() + labs(title = "Residual Diagnostics: Pipe 2")

The auto-ARIMA models are pretty simple, which I expected base on the KPSS tests and visually inspecting the data and seeing how stationary it was.

*Pipe 1 (ARIMA(0,0,0) w/ mean):** For the first pipe, the algorithm essentially fitted a “mean model.” Because it found no significant autoregressive or moving average terms to improve the forecast, it will simply predict the historical average water flow (a constant of roughly 19.47). Looking at the residual diagnostics, the ACF plot shows the residuals mostly behave like white noise. The histogram is fairly normal, but there are a few negative outliers skewing it (sudden drops in water flow around Oct 28 and Nov 1).

*Pipe 2 (ARIMA(0,0,0)(0,0,1)[24] w/ mean):** For the second pipe, the algorithm also fitted a mean model (with a constant around 39.57), but it successfully detected a daily seasonal pattern. The (0,0,1)[24] component indicates a seasonal moving average term adjusting for a 24-hour cycle. The variance is consistent over time, the histogram is beautifully centered and normal, and while a few lags briefly poke past the significance boundaries on the ACF plot, this is entirely expected by chance with high-frequency hourly data.

Forecasting and Export

I’ll forecast one week forward. Because the data is hourly, this translates to \(h = 168\) periods (24 hours * 7 days).

# Fit ARIMA models to both pipes 
pipe1_fit <- pipe1_clean %>% model(arima_auto = ARIMA(WaterFlow))
pipe2_fit <- pipe2_clean %>% model(arima_auto = ARIMA(WaterFlow))

# Forecast a week forward (24 hours * 7 days = 168 hours)
pipe1_fc <- pipe1_fit %>% forecast(h = 168)
pipe2_fc <- pipe2_fit %>% forecast(h = 168)

# Plot the forecasts
# Pipe 1 Plot
pipe1_clean %>%
  autoplot(WaterFlow) +
  autolayer(pipe1_fc, color = "#F8766D", level = NULL) +
  labs(
    title = "1-Week Waterflow Forecast: Pipe 1", 
    x = "Date/Time (Hourly)", 
    y = "Waterflow"
  ) +
  theme_minimal()

# Pipe 2 Plot
pipe2_clean %>%
  autoplot(WaterFlow) +
  autolayer(pipe2_fc, color = "#00BFC4", level = NULL) +
  labs(
    title = "1-Week Waterflow Forecast: Pipe 2", 
    x = "Date/Time (Hourly)", 
    y = "Waterflow"
  ) +
  theme_minimal()

# Prepare data for export
# Extract point forecasts (.mean) and label the pipes
pipe1_export <- pipe1_fc %>%
  as_tibble() %>%
  mutate(Pipe = "Pipe 1") %>%
  select(Pipe, DateTimeHour, Forecasted_WaterFlow = .mean)

pipe2_export <- pipe2_fc %>%
  as_tibble() %>%
  mutate(Pipe = "Pipe 2") %>%
  select(Pipe, DateTimeHour, Forecasted_WaterFlow = .mean)

# Combine both into a single data frame
combined_pipe_forecasts <- bind_rows(pipe1_export, pipe2_export)

# Export to CSV (Ensuring standard file naming)
export_path_c <- file.path(projPath, "Waterflow_Forecast_Results.csv")
write_csv(combined_pipe_forecasts, export_path_c)

cat("Exported Pipe 1 and Pipe 2 forecasts to:", export_path_c, "\n")
## Exported Pipe 1 and Pipe 2 forecasts to: C:/Users/cdube/Grad School/DATA 624/Waterflow_Forecast_Results.csv