This project consists of 3 parts - two required and one bonus and is worth 15% of your grade. The project is due at 11:59 PM on Sunday Oct 25. I will accept late submissions with a penalty until the meetup after that when we review some projects.
In part A, I want you to forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable ‘Cash’ is provided in hundreds of dollars, other than that it is straight forward. I am being somewhat ambiguous on purpose to make this have a little more business feeling. Explain and demonstrate your process, techniques used and not used, and your actual forecast. I am giving you data via an excel file, please provide your written report on your findings, visuals, discussion and your R code via an RPubs link along with the actual.rmd file Also please submit the forecast which you will put in an Excel readable file.
atm <- read_excel(file.path(projPath, "ATM624Data.xlsx"))
head(atm)
## # A tibble: 6 × 3
## DATE ATM Cash
## <dbl> <chr> <dbl>
## 1 39934 ATM1 96
## 2 39934 ATM2 107
## 3 39935 ATM1 82
## 4 39935 ATM2 89
## 5 39936 ATM1 85
## 6 39936 ATM2 90
unique(atm$ATM)
## [1] "ATM1" "ATM2" NA "ATM3" "ATM4"
# Fixing the date formatting
atm$DATE <- as.Date(atm$DATE, origin = "1900-01-01")
head(atm)
## # A tibble: 6 × 3
## DATE ATM Cash
## <date> <chr> <dbl>
## 1 2009-05-03 ATM1 96
## 2 2009-05-03 ATM2 107
## 3 2009-05-04 ATM1 82
## 4 2009-05-04 ATM2 89
## 5 2009-05-05 ATM1 85
## 6 2009-05-05 ATM2 90
# Tsibble
atm <- as_tsibble(atm, index = DATE, key = ATM)
# Basic Plot
atm %>%
autoplot(Cash)
## Warning: Removed 14 rows containing missing values or values outside the scale range
## (`geom_line()`).
labs(title = "Basic Plot of Cash (in hundreds of dollars) withdrawn from ATMS over time") +
scale_x_date(
date_breaks = "1 month", # I want a label per month
date_labels = "%b %Y" # Mon. Year format
) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
## NULL
# Basic Plot ATM 1
atm %>%
filter(ATM == "ATM1") %>%
autoplot(Cash) +
labs(title = "Basic Plot of Cash (in hundreds of dollars) withdrawn from ATM1") +
scale_x_date(
date_breaks = "1 month", # I want a label per month
date_labels = "%b %Y" # Mon. Year format
) +
geom_line(color = "#F8766D") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Basic Plot ATM 2
atm %>%
filter(ATM == "ATM2") %>%
autoplot(Cash) +
labs(title = "Basic Plot of Cash (in hundreds of dollars) withdrawn from ATM2") +
scale_x_date(
date_breaks = "1 month", # I want a label per month
date_labels = "%b %Y" # Mon. Year format
) +
geom_line(color = "#7CAE00") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Basic Plot ATM 3
atm %>%
filter(ATM == "ATM3") %>%
autoplot(Cash) +
labs(title = "Basic Plot of Cash (in hundreds of dollars) withdrawn from ATM3") +
scale_x_date(
date_breaks = "1 month", # I want a label per month
date_labels = "%b %Y" # Mon. Year format
) +
geom_line(color = "#00BFC4") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Basic Plot ATM 4
atm %>%
filter(ATM == "ATM4") %>%
autoplot(Cash) +
labs(title = "Basic Plot of Cash (in hundreds of dollars) withdrawn from ATM4") +
scale_x_date(
date_breaks = "1 month", # I want a label per month
date_labels = "%b %Y" # Mon. Year format
) +
geom_line(color = "#C77CFF") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Looking at when ATM3 started getting withdrawals, and looking at the ATM4 Feb. spike.
# Days money moved for ATM3
atm %>%
filter((ATM == "ATM3") & (Cash != 0)) %>%
kable(caption = "Days with non-zero cash movement for ATM3")
| DATE | ATM | Cash |
|---|---|---|
| 2010-04-30 | ATM3 | 96 |
| 2010-05-01 | ATM3 | 82 |
| 2010-05-02 | ATM3 | 85 |
# Days money spiked for ATM 4
atm %>%
filter((ATM == "ATM4") & (DATE >= '2010-02-01') & (DATE <= '2010-02-15')) %>%
kable(caption = "ATM4 Cash Spike ($10,919*100 value on Feb 11)")
| DATE | ATM | Cash |
|---|---|---|
| 2010-02-01 | ATM4 | 669.70765 |
| 2010-02-02 | ATM4 | 979.67594 |
| 2010-02-03 | ATM4 | 426.40601 |
| 2010-02-04 | ATM4 | 153.24269 |
| 2010-02-05 | ATM4 | 274.94992 |
| 2010-02-06 | ATM4 | 136.27636 |
| 2010-02-07 | ATM4 | 454.41695 |
| 2010-02-08 | ATM4 | 458.30427 |
| 2010-02-09 | ATM4 | 112.03063 |
| 2010-02-10 | ATM4 | 417.91228 |
| 2010-02-11 | ATM4 | 10919.76164 |
| 2010-02-12 | ATM4 | 42.43808 |
| 2010-02-13 | ATM4 | 280.04343 |
| 2010-02-14 | ATM4 | 412.31888 |
| 2010-02-15 | ATM4 | 852.83742 |
Get some high level counts and sums of dollars withdrawn, including counts for NA ATMs.
# Overall statistical summary
summary(atm)
## DATE ATM Cash
## Min. :2009-05-03 Length:1474 Min. : 0.0
## 1st Qu.:2009-08-03 Class :character 1st Qu.: 0.5
## Median :2009-11-03 Mode :character Median : 73.0
## Mean :2009-11-02 Mean : 155.6
## 3rd Qu.:2010-02-03 3rd Qu.: 114.0
## Max. :2010-05-16 Max. :10919.8
## NA's :19
# Aggregate counts and missing values by ATM
atm %>%
as_tibble() %>%
group_by(ATM) %>%
summarise(
count = n(),
total_cash = sum(Cash, na.rm = TRUE),
zero_days = sum(Cash == 0, na.rm = TRUE),
missing_days = sum(is.na(Cash))
) %>%
kable(caption = "Total records, cash, and missing days per ATM")
| ATM | count | total_cash | zero_days | missing_days |
|---|---|---|---|---|
| ATM1 | 365 | 30367.0 | 0 | 3 |
| ATM2 | 365 | 22716.0 | 2 | 2 |
| ATM3 | 365 | 263.0 | 362 | 0 |
| ATM4 | 365 | 173025.8 | 0 | 0 |
| NA | 14 | 0.0 | 0 | 14 |
Further investigate when the ATM values are NA.
# Extract dates with NA ATM value
na_atm_dates <- atm %>%
filter(is.na(ATM)) %>%
pull(DATE)
# Check all ATM activity on those specific dates
atm %>%
filter(DATE %in% na_atm_dates) %>%
arrange(DATE, ATM) %>%
kable(caption = "Dates with a missing 'NA' ATM record")
| DATE | ATM | Cash |
|---|---|---|
| 2010-05-03 | NA | NA |
| 2010-05-04 | NA | NA |
| 2010-05-05 | NA | NA |
| 2010-05-06 | NA | NA |
| 2010-05-07 | NA | NA |
| 2010-05-08 | NA | NA |
| 2010-05-09 | NA | NA |
| 2010-05-10 | NA | NA |
| 2010-05-11 | NA | NA |
| 2010-05-12 | NA | NA |
| 2010-05-13 | NA | NA |
| 2010-05-14 | NA | NA |
| 2010-05-15 | NA | NA |
| 2010-05-16 | NA | NA |
Also investigate when the Cash values are NA, but it’s
not that both ATM and Cash are NA.
Next, let’s identify the specific dates where an ATM is known to
exist, but its Cash withdrawal value is missing. This
deliberately excludes completely blank phantom rows where both the ATM
name and Cash value are NA.
# Filter for missing Cash, but ensure ATM is NOT missing
atm %>%
filter(is.na(Cash) & !is.na(ATM)) %>%
arrange(ATM, DATE) %>%
kable(caption = "Table 5: Records with missing Cash values for known ATMs")
| DATE | ATM | Cash |
|---|---|---|
| 2009-06-15 | ATM1 | NA |
| 2009-06-18 | ATM1 | NA |
| 2009-06-24 | ATM1 | NA |
| 2009-06-20 | ATM2 | NA |
| 2009-06-26 | ATM2 | NA |
Being in fintech right now, I suspected that the oscillations on the graphs are Monday highs and weekend lows. I just wanted to dummy check that.
# Average Cash by Day of the Week and ATM
atm %>%
as_tibble() %>%
mutate(Day_of_Week = wday(DATE, label = TRUE, abbr = FALSE)) %>% # get day of week
group_by(ATM, Day_of_Week) %>%
summarise(
Avg_Cash = mean(Cash, na.rm = TRUE),
.groups = 'drop'
) %>%
pivot_wider(names_from = Day_of_Week, values_from = Avg_Cash) %>% # day of week to column
kable(caption = "Average Daily Cash Withdrawals by ATM and Day of the Week", digits = 2)
| ATM | Sunday | Monday | Tuesday | Wednesday | Thursday | Friday | Saturday |
|---|---|---|---|---|---|---|---|
| ATM1 | 98.64 | 96.61 | 102.65 | 86.00 | 89.57 | 82.15 | 31.69 |
| ATM2 | 92.02 | 75.98 | 67.15 | 58.73 | 73.25 | 43.75 | 25.53 |
| ATM3 | 1.60 | 0.00 | 0.00 | 0.00 | 0.00 | 1.85 | 1.58 |
| ATM4 | 573.53 | 491.54 | 539.07 | 481.29 | 647.27 | 413.72 | 169.96 |
| NA | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
ATM4 is seeing the most money movement by far.
ATM4 has a crazy spike in withdrawal volume on Feb
11th.
ATM3 only sees money moving starting from 2010-04-30.
Maybe ATM3 was only set up that time.ATM in the data set. These
seem to all have NA for Cash, and be from May (i.e. the
bottom of the excel spreadsheet with the source data.) Since the ask
here is to predict May data, I think this is safe to cut from the data
set.Cash in the data set. The
ATMs and dates these happened on don’t have a pattern or calendar
significance (though they all take place in June). These probably need
to be imputed.ATM and
Cash.There is May data for the 1st and 2nd for all
ATMs, so I’m leaving that as is. I’m just cutting the NAs
there on.
# CUT NA records (which is May 3rd onward)
atm_clean <- atm %>%
drop_na(ATM)
For imputation, I’m opting to use ARIMA interpolation, one of the recommended methods in the Hyndman textbook.
# Check for time gaps (before ARIMA)
atm_clean %>%
has_gaps()
## # A tibble: 5 × 2
## ATM .gaps
## <chr> <lgl>
## 1 ATM1 FALSE
## 2 ATM2 FALSE
## 3 ATM3 FALSE
## 4 ATM4 FALSE
## 5 <NA> FALSE
# Where are the time gaps if there are some
atm_clean %>%
count_gaps() %>%
knitr::kable(caption = "Time Gaps by ATM")
| ATM | .from | .to | .n |
|---|
# No Time Gaps
# Remove "NA" as a key from when I originally made the data tsibble
atm_clean <- atm_clean %>%
as_tibble() %>%
as_tsibble(index = DATE, key = ATM) # rebuild
key_data(atm_clean) %>%
select(-.rows) %>%
kable(caption = "Ensuring that NA is not a key in the Tsibble anymore")
| ATM |
|---|
| ATM1 |
| ATM2 |
| ATM3 |
| ATM4 |
# Impute NAs with ARIMA
atm_clean <- atm_clean %>%
model(arima = ARIMA(Cash)) %>%
interpolate(atm_clean)
I couldn’t use STL() until the NA values were handled. Now that that is done, I can use it for outlier analysis in the traditional way: using remainders > 3*IQR from the median remainder
# Check for outliers in addition to the one I found earlier
# Using remainders > 3x IQR from the median remainder
outlier_analysis <- atm_clean %>%
model(stlo = STL(Cash ~ season(window = "periodic"), robust = TRUE)) %>%
components() %>%
group_by(ATM) %>%
mutate(
rem_iqr = IQR(remainder, na.rm = TRUE),
rem_mid = median(remainder, na.rm = TRUE),
is_outlier = abs(remainder - rem_mid) > (3 * rem_iqr)
) %>%
ungroup()
# Look at Outliers
outlier_analysis %>%
filter(is_outlier == TRUE)
## # A tsibble: 87 x 11 [1D]
## # Key: ATM, .model [4]
## ATM .model DATE Cash trend season_week remainder season_adjust
## <chr> <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ATM1 stlo 2009-06-07 3 88.1 16.8 -102. -13.8
## 2 ATM1 stlo 2009-07-12 19 86.4 16.8 -84.2 2.20
## 3 ATM1 stlo 2009-08-27 174 103. 23.0 48.0 151.
## 4 ATM1 stlo 2009-09-13 1 95.6 16.8 -111. -15.8
## 5 ATM1 stlo 2009-09-24 180 90.1 23.0 67.0 157.
## 6 ATM1 stlo 2009-12-02 6 82.0 3.80 -79.8 2.20
## 7 ATM1 stlo 2009-12-31 42 89.8 23.0 -70.8 19.0
## 8 ATM1 stlo 2010-01-01 2 92.6 -4.32 -86.2 6.32
## 9 ATM1 stlo 2010-01-31 152 81.4 16.8 53.8 135.
## 10 ATM1 stlo 2010-02-07 138 79.1 16.8 42.1 121.
## # ℹ 77 more rows
## # ℹ 3 more variables: rem_iqr <dbl>, rem_mid <dbl>, is_outlier <lgl>
# Visually validate the outliers
atm_clean %>%
left_join(select(outlier_analysis, ATM, DATE, is_outlier), by = c("ATM", "DATE")) %>%
ggplot(aes(x = DATE, y = Cash)) +
geom_line(color = "grey60", alpha = 0.7) +
geom_point(data = . %>% filter(is_outlier == TRUE), color = "red", size = 2) +
facet_wrap(~ ATM, scales = "free_y") +
labs(
title = "Visual Inspection of Methodically Identified Outliers",
subtitle = "Red dots indicate flagged outliers (3x IQR of STL Remainder)",
x = "Date",
y = "Cash Withdrawn (Hundreds)"
) +
theme_minimal()
This method classified a lot of data points as outliers that frankly, are not. It remains that the only “outlier” I am concerned with is the one from ATM4 on Feb. 11th. So I will impute that one only.
# Replace my One Feb 11th Outlier with NA
atm_clean <- atm_clean %>%
mutate(
Cash = if_else(ATM == "ATM4" & DATE == as.Date("2010-02-11"), NA_real_, Cash)
) # NA_real_ is explicitly for missing decimal number
# Impute
atm_clean <- atm_clean %>%
model(arima = ARIMA(Cash)) %>%
interpolate(atm_clean)
# Peaking at the imputed value
atm_clean %>%
filter(ATM == "ATM4", DATE >= as.Date("2010-02-09") & DATE <= as.Date("2010-02-13")) %>%
knitr::kable(caption = "ATM4: Feb 11th Outlier Successfully Imputed")
| ATM | DATE | Cash |
|---|---|---|
| ATM4 | 2010-02-09 | 112.03063 |
| ATM4 | 2010-02-10 | 417.91228 |
| ATM4 | 2010-02-11 | 214.38754 |
| ATM4 | 2010-02-12 | 42.43808 |
| ATM4 | 2010-02-13 | 280.04343 |
# Training Set (Everything up to March 31, 2010)
atm_train <- atm_clean %>%
filter(DATE <= as.Date("2010-03-31"))
# Test Set (April 1st through May 2nd)
atm_test <- atm_clean %>%
filter(DATE > as.Date("2010-03-31"))
# Verify the split
cat("Training Set ends on:", as.character(max(atm_train$DATE)), "and has", nrow(atm_train), "days.\n")
## Training Set ends on: 2010-03-31 and has 1332 days.
cat("Test Set starts on:", as.character(min(atm_test$DATE)), "and has", nrow(atm_test), "days.\n")
## Test Set starts on: 2010-04-01 and has 128 days.
# STL decomposition
atm_train %>%
filter(ATM == "ATM1") %>%
model(STL(Cash ~ season(window = "periodic"), robust = TRUE)) %>%
components() %>%
autoplot() +
labs(title = "STL Decomposition for ATM1")
I already confirmed earlier that withdrawal volume is highest Mondays and lowest on the weekends. I can see weekly seasonality confirmed on the STL Decomp. view.
I also see that the trend wanders without much direction, but interestingly, the variation in the remainder becomes much higher from roughly March 2010 onwards.
# Isolate ATM1
atm1 <- atm_clean %>% filter(ATM == "ATM1")
atm1_train <- atm_train %>% filter(ATM == "ATM1")
atm1_test <- atm_test %>% filter(ATM == "ATM1")
#Optimal Box-Cox Lambda using Guerrero's method
lambda_atm1 <- atm1_train %>%
features(Cash, features = guerrero) %>%
pull(lambda_guerrero)
cat("Optimal Box-Cox Lambda for ATM1:", round(lambda_atm1, 4), "\n")
## Optimal Box-Cox Lambda for ATM1: 0.6236
# Check for required differencing on the transformed data
# seasonal
nsdiffs_atm1 <- atm1_train %>%
mutate(box_cash = box_cox(Cash, lambda_atm1)) %>%
features(box_cash, unitroot_nsdiffs) %>%
pull(nsdiffs)
cat("Seasonal differences required (nsdiffs):", nsdiffs_atm1, "\n")
## Seasonal differences required (nsdiffs): 1
# ordinary
ndiffs_atm1 <- atm1_train %>%
mutate(box_cash = box_cox(Cash, lambda_atm1)) %>%
features(box_cash, unitroot_ndiffs) %>%
pull(ndiffs)
cat("Ordinary differences required (ndiffs):", ndiffs_atm1, "\n")
## Ordinary differences required (ndiffs): 0
I’m going to move forward using the optimal Box-Cox lambda because I am more focused on effective modeling than on explanatory power at this time.. The value of 0.6236 is a bit between 0.5 and 1. Where 0.5 is taking the square root transformation to stabilize the variance, and 1 is doing nothing (there is a stable enough variance).
(I’ll use the optimal lambda values as they are found for the rest of this project as well, rather than rounding to more “explainable” lambdas.)
# Fit models on training data
atm1_train_fit <- atm1_train %>%
model(
snaive = SNAIVE(box_cox(Cash, lambda_atm1)),
ets = ETS(box_cox(Cash, lambda_atm1)),
arima_auto = ARIMA(box_cox(Cash, lambda_atm1)),
arima_manual = ARIMA(box_cox(Cash, lambda_atm1) ~ pdq(d=0) + PDQ(D=1))
)
# Forecast the length of the TEST data
atm1_test_forecast <- atm1_train_fit %>%
forecast(new_data = atm1_test)
# Grade the forecast against the true historical data to find out-of-sample accuracy
accuracy(atm1_test_forecast, atm1) %>%
arrange(RMSE) %>%
select(.model, RMSE, MAE, MAPE, MASE) %>%
knitr::kable(caption = "Out-of-Sample Accuracy Metrics for ATM1 (Test Set)")
| .model | RMSE | MAE | MAPE | MASE |
|---|---|---|---|---|
| arima_auto | 12.65584 | 10.031486 | 57.80838 | 0.5491826 |
| arima_manual | 12.65584 | 10.031486 | 57.80838 | 0.5491826 |
| ets | 12.73463 | 9.927388 | 44.49709 | 0.5434837 |
| snaive | 17.05096 | 13.744518 | 80.29855 | 0.7524559 |
Comparing ATM1 models by accuracy performance
# ETS Residuals
atm1_train_fit %>%
select(ets) %>%
gg_tsresiduals() +
labs(title = "Residual Diagnostics for ATM1 (ETS Model)")
## Warning: `gg_tsresiduals()` was deprecated in feasts 0.4.2.
## ℹ Please use `ggtime::gg_tsresiduals()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
# ARIMA Residuals
atm1_train_fit %>%
select(arima_auto) %>%
gg_tsresiduals() +
labs(title = "Residual Diagnostics for ATM1 (ARIMA Model)")
# Portmanteau Tests
# Ljung-Box test
# lag = 14 for daily data to look at 2 weeks of data
atm1_train_fit %>%
select(ets, arima_auto) %>%
augment() %>%
features(.innov, ljung_box, lag = 14) %>%
knitr::kable(caption = "Ljung-Box Test for ETS and ARIMA Residuals")
| .model | lb_stat | lb_pvalue |
|---|---|---|
| arima_auto | 10.87316 | 0.6959701 |
| ets | 23.63239 | 0.0507266 |
Comparing the ARIMA and ETS model residual plots and portmanteau:
The ARIMA model is the best choice for ATM1.
# Model from training:
atm1_train_fit$arima_auto
## <lst_mdl[1]>
## [1] <ARIMA(0,0,2)(0,1,2)[7]>
# Refit the ARIMA model to full dataset
atm1_final_fit <- atm1 %>%
model(
arima_final = ARIMA(box_cox(Cash, lambda_atm1) ~ pdq(0,0,2) + PDQ(0,1,2, period = 7))
)
# Final Model
report(atm1_final_fit)
## Series: Cash
## Model: ARIMA(0,0,2)(0,1,2)[7]
## Transformation: box_cox(Cash, lambda_atm1)
##
## Coefficients:
## ma1 ma2 sma1 sma2
## 0.1590 -0.0823 -0.5835 -0.0782
## s.e. 0.0523 0.0503 0.0515 0.0500
##
## sigma^2 estimated as 24.99: log likelihood=-1083.97
## AIC=2177.93 AICc=2178.1 BIC=2197.34
The final model for ATM1 is ready to use for forecasting
# Forecast
# h = 30 for 30 days
atm1_fc <- atm1_final_fit %>%
forecast(h = 30)
# Plot forecast
atm1 %>%
autoplot(Cash) +
# Add the forecast layer specifically
autolayer(atm1_fc, color = "#F8766D", level = NULL) +
labs(
title = "30-Day Cash Forecast: ATM1",
subtitle = "ARIMA(0,0,2)(0,1,2)[7] Structure",
y = "Cash",
x = "Date"
) +
scale_x_date(
date_breaks = "1 month", # I want a label per month
date_labels = "%b %Y" # Mon. Year format
) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# STL decomposition
atm_train %>%
filter(ATM == "ATM2") %>%
model(STL(Cash ~ season(window = "periodic"), robust = TRUE)) %>%
components() %>%
autoplot() +
labs(title = "STL Decomposition for ATM2")
Much like ATM1, weekly seasonality remains, and the variation in the remainder becomes much higher from roughly March 2010 onwards. But what’s different here is the trend, which was relatively steady with a slight decline, also up until March 2010.
# Isolate atm2
atm2 <- atm_clean %>% filter(ATM == "ATM2")
atm2_train <- atm_train %>% filter(ATM == "ATM2")
atm2_test <- atm_test %>% filter(ATM == "ATM2")
#Optimal Box-Cox Lambda using Guerrero's method
lambda_atm2 <- atm2_train %>%
features(Cash, features = guerrero) %>%
pull(lambda_guerrero)
cat("Optimal Box-Cox Lambda for atm2:", round(lambda_atm2, 4), "\n")
## Optimal Box-Cox Lambda for atm2: 0.5398
# Check for required differencing on the transformed data
# seasonal
nsdiffs_atm2 <- atm2_train %>%
mutate(box_cash = box_cox(Cash, lambda_atm2)) %>%
features(box_cash, unitroot_nsdiffs) %>%
pull(nsdiffs)
cat("Seasonal differences required (nsdiffs):", nsdiffs_atm2, "\n")
## Seasonal differences required (nsdiffs): 1
# ordinary
ndiffs_atm2 <- atm2_train %>%
mutate(box_cash = box_cox(Cash, lambda_atm2)) %>%
features(box_cash, unitroot_ndiffs) %>%
pull(ndiffs)
cat("Ordinary differences required (ndiffs):", ndiffs_atm2, "\n")
## Ordinary differences required (ndiffs): 1
Different from ATM1, when checking for differencing, ATM2 calls for 1 seasonal and 1 ordinary difference.
# Fit models on training data
atm2_train_fit <- atm2_train %>%
model(
snaive = SNAIVE(box_cox(Cash, lambda_atm2)),
ets = ETS(box_cox(Cash, lambda_atm2)),
arima_auto = ARIMA(box_cox(Cash, lambda_atm2)),
arima_manual = ARIMA(box_cox(Cash, lambda_atm2) ~ pdq(d=1) + PDQ(D=1))
)
# Forecast the length of the TEST data
atm2_test_forecast <- atm2_train_fit %>%
forecast(new_data = atm2_test)
# Grade the forecast against the true historical data to find out-of-sample accuracy
accuracy(atm2_test_forecast, atm2) %>%
arrange(RMSE) %>%
select(.model, RMSE, MAE, MAPE, MASE) %>%
knitr::kable(caption = "Out-of-Sample Accuracy Metrics for ATM2 (Test Set)")
| .model | RMSE | MAE | MAPE | MASE |
|---|---|---|---|---|
| ets | 19.09029 | 14.17161 | Inf | 0.6794205 |
| arima_auto | 19.83110 | 15.44513 | Inf | 0.7404763 |
| snaive | 22.95675 | 16.60056 | Inf | 0.7958699 |
| arima_manual | 47.44609 | 35.18190 | Inf | 1.6867035 |
Comparing ATM2 models by accuracy performance
However, especially based on ATM1, I will look at both ETS and ARIMA (the auto-selected one, not my manual) from a residual perspective
# ETS Residuals
atm2_train_fit %>%
select(ets) %>%
gg_tsresiduals() +
labs(title = "Residual Diagnostics for ATM2 (ETS Model)")
# ARIMA Residuals
atm2_train_fit %>%
select(arima_auto) %>%
gg_tsresiduals() +
labs(title = "Residual Diagnostics for ATM2 (ARIMA Model)")
# Portmanteau Tests
# Ljung-Box test
# lag = 14 for daily data to look at 2 weeks of data
atm2_train_fit %>%
select(ets, arima_auto) %>%
augment() %>%
features(.innov, ljung_box, lag = 14) %>%
knitr::kable(caption = "Ljung-Box Test for ETS and ARIMA Residuals")
| .model | lb_stat | lb_pvalue |
|---|---|---|
| arima_auto | 10.49173 | 0.7254213 |
| ets | 31.07524 | 0.0054106 |
Comparing the ARIMA and ETS model residual plots and portmanteau:
The ARIMA model is the best choice for ATM2.
# Model from training:
atm2_train_fit$arima_auto
## <lst_mdl[1]>
## [1] <ARIMA(2,0,2)(0,1,1)[7]>
# Refit the ARIMA model to full dataset
atm2_final_fit <- atm2 %>%
model(
arima_final = ARIMA(box_cox(Cash, lambda_atm2) ~ pdq(2,0,2) + PDQ(0,1,1, period = 7))
)
# Final Model
report(atm2_final_fit)
## Series: Cash
## Model: ARIMA(2,0,2)(0,1,1)[7]
## Transformation: box_cox(Cash, lambda_atm2)
##
## Coefficients:
## ar1 ar2 ma1 ma2 sma1
## -0.4158 -0.8961 0.4560 0.7860 -0.6971
## s.e. 0.0616 0.0513 0.0909 0.0652 0.0431
##
## sigma^2 estimated as 16.57: log likelihood=-1009.89
## AIC=2031.78 AICc=2032.02 BIC=2055.07
The final model for ATM2 is ready to use for forecasting
# Forecast
# h = 30 for 30 days
atm2_fc <- atm2_final_fit %>%
forecast(h = 30)
# Plot forecast
atm2 %>%
autoplot(Cash) +
# Add the forecast layer specifically
autolayer(atm2_fc, color = "#7CAE00", level = NULL) +
labs(
title = "30-Day Cash Forecast: ATM2",
subtitle = "ARIMA(2,0,2)(0,1,1)[7] Structure",
y = "Cash",
x = "Date"
) +
scale_x_date(
date_breaks = "1 month", # I want a label per month
date_labels = "%b %Y" # Mon. Year format
) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
I already know that ATM3 has only three days’ worth of data. So this is going to look pretty different from ATM1 and ATM2.
For the sake of thoroughness, I’ll still look at STL decomp for ATM3.
# STL decomposition
atm_train %>%
filter(ATM == "ATM3") %>%
model(STL(Cash ~ season(window = "periodic"), robust = TRUE)) %>%
components() %>%
autoplot() +
labs(title = "STL Decomposition for ATM3")
As expected, STL Decomp. yielded nothing here.
The model preparation is different here too. For ATM1 and ATM2 I looked at the Box-Cox Lambda using Guerrero’s method and at seasonal and ordinary differencing. With so little data, I’m skipping that here.
Again, with three days’ data, I’m not going to entertain ETS and ARIMA. Nor am I going to train and test models. I’ll use the mean to forecast the data because I have already identified that there is seasonal (weekly) fluctuation. And I’d rather use the mean of 3 days as a predictor than the most recent value of one day.
atm3 <- atm_clean %>%
filter(ATM == "ATM3")
# When ATM3 has data
atm3_active <- atm3 %>%
filter(Cash > 0)
# Fit a simple Mean model
atm3_fit <- atm3_active %>%
model(mean = MEAN(Cash))
# Model report
report(atm3_fit)
## Series: Cash
## Model: MEAN
##
## Mean: 87.6667
## sigma^2: 54.3333
# Forecast
# h = 30 for 30 days
atm3_fc <- atm3_fit %>%
forecast(h = 30)
# Plot forecast
atm3 %>%
autoplot(Cash) +
# Add the forecast layer specifically
autolayer(atm3_fc, color = "#00BFC4", level = NULL) +
labs(
title = "30-Day Cash Forecast: ATM3",
subtitle = "Simple Mean Model",
y = "Cash",
x = "Date"
) +
scale_x_date(
date_breaks = "1 month", # I want a label per month
date_labels = "%b %Y" # Mon. Year format
) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# STL decomposition
atm_train %>%
filter(ATM == "ATM4") %>%
model(STL(Cash ~ season(window = "periodic"), robust = TRUE)) %>%
components() %>%
autoplot() +
labs(title = "STL Decomposition for ATM4")
I can still see weekly seasonality for ATM4. And the trend wanders a lot like ATM1. Interestingly, the variation in the remainder does not become much higher from March 2010 onwards as it did with ATM1 and ATM2.
As I confirmed before, the magnitude of withdrawals is higher in ATM4 than the others, but also, I see that the outlier I removed is indeed gone, as I’d hoped when I imputed it.
# Isolate ATM4
atm4 <- atm_clean %>% filter(ATM == "ATM4")
atm4_train <- atm_train %>% filter(ATM == "ATM4")
atm4_test <- atm_test %>% filter(ATM == "ATM4")
#Optimal Box-Cox Lambda using Guerrero's method
lambda_atm4 <- atm4_train %>%
features(Cash, features = guerrero) %>%
pull(lambda_guerrero)
cat("Optimal Box-Cox Lambda for ATM4:", round(lambda_atm4, 4), "\n")
## Optimal Box-Cox Lambda for ATM4: 0.4028
# Check for required differencing on the transformed data
# seasonal
nsdiffs_atm4 <- atm4_train %>%
mutate(box_cash = box_cox(Cash, lambda_atm4)) %>%
features(box_cash, unitroot_nsdiffs) %>%
pull(nsdiffs)
cat("Seasonal differences required (nsdiffs):", nsdiffs_atm4, "\n")
## Seasonal differences required (nsdiffs): 0
# ordinary
ndiffs_atm4 <- atm4_train %>%
mutate(box_cash = box_cox(Cash, lambda_atm4)) %>%
features(box_cash, unitroot_ndiffs) %>%
pull(ndiffs)
cat("Ordinary differences required (ndiffs):", ndiffs_atm4, "\n")
## Ordinary differences required (ndiffs): 0
# Fit models on training data
atm4_train_fit <- atm4_train %>%
model(
snaive = SNAIVE(box_cox(Cash, lambda_atm4)),
ets = ETS(box_cox(Cash, lambda_atm4)),
arima_auto = ARIMA(box_cox(Cash, lambda_atm4)),
arima_manual_from_SPSS = ARIMA(box_cox(Cash, lambda_atm4) ~ pdq(d=0) + PDQ(D=0)),
arima_manual_from_intuition = ARIMA(box_cox(Cash, lambda_atm4) ~ pdq(d=0) + PDQ(D=1))
)
## Warning in sqrt(diag(best$var.coef)): NaNs produced
# Forecast the length of the TEST data
atm4_test_forecast <- atm4_train_fit %>%
forecast(new_data = atm4_test)
# Grade the forecast against the true historical data to find out-of-sample accuracy
accuracy(atm4_test_forecast, atm4) %>%
arrange(RMSE) %>%
select(.model, RMSE, MAE, MAPE, MASE) %>%
knitr::kable(caption = "Out-of-Sample Accuracy Metrics for ATM4 (Test Set)")
| .model | RMSE | MAE | MAPE | MASE |
|---|---|---|---|---|
| arima_manual_from_intuition | 248.8757 | 210.6874 | 861.8666 | 0.5953832 |
| arima_auto | 260.9621 | 224.0422 | 914.2848 | 0.6331225 |
| arima_manual_from_SPSS | 260.9621 | 224.0422 | 914.2848 | 0.6331225 |
| ets | 303.1305 | 258.4645 | 1181.5676 | 0.7303968 |
| snaive | 789.8017 | 666.7780 | 1252.9923 | 1.8842529 |
Comparing ATM4 models by accuracy performance
# complexity (AICc)
atm4_train_fit %>%
select(arima_auto, arima_manual_from_intuition) %>%
glance() %>%
arrange(AICc) %>%
select(.model, AICc, BIC) %>%
knitr::kable(caption = "Information Criteria Comparison: Auto vs Intuition")
| .model | AICc | BIC |
|---|---|---|
| arima_manual_from_intuition | 2449.251 | 2475.407 |
| arima_auto | 2508.037 | 2526.894 |
# (p,d,q)(P,D,Q) parameter compare
atm4_train_fit %>%
select(arima_auto, arima_manual_from_intuition)
## # A mable: 1 x 2
## arima_auto arima_manual_from_intuition
## <model> <model>
## 1 <ARIMA(0,0,1)(2,0,0)[7] w/ mean> <ARIMA(2,0,1)(1,1,2)[7]>
# AR and MA Roots for Stability
# Auto ARIMA Roots
atm4_train_fit %>%
select(arima_auto) %>%
gg_arma() +
labs(
title = "Inverse AR/MA Roots: Auto ARIMA",
subtitle = "Points should lie entirely inside the unit circle"
)
## Warning: `gg_arma()` was deprecated in feasts 0.4.2.
## ℹ Please use `ggtime::gg_arma()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
# Manual Intuition ARIMA Roots
atm4_train_fit %>%
select(arima_manual_from_intuition) %>%
gg_arma() +
labs(
title = "Inverse AR/MA Roots: Manual Intuition ARIMA",
subtitle = "Checking for borderline instability (roots on or near the circle)"
)
# Check for stability more definitively than with the circle
atm4_train_fit %>%
select(arima_auto, arima_manual_from_intuition) %>%
glance() %>%
select(.model, ar_roots, ma_roots) %>%
pivot_longer(ends_with("roots"), names_to = "Root_Type", values_to = "Root") %>%
unnest(Root) %>%
mutate(
Inverse_Distance = Mod(1 / Root),
Is_Stable = Inverse_Distance < 1
) %>%
select(.model, Root_Type, Inverse_Distance, Is_Stable) %>%
knitr::kable(digits = 4, caption = "Definitive Root Check: Auto vs Intuition")
| .model | Root_Type | Inverse_Distance | Is_Stable |
|---|---|---|---|
| arima_auto | ar_roots | 0.9166 | TRUE |
| arima_auto | ar_roots | 0.9166 | TRUE |
| arima_auto | ar_roots | 0.8586 | TRUE |
| arima_auto | ar_roots | 0.8586 | TRUE |
| arima_auto | ar_roots | 0.8586 | TRUE |
| arima_auto | ar_roots | 0.8586 | TRUE |
| arima_auto | ar_roots | 0.9166 | TRUE |
| arima_auto | ar_roots | 0.9166 | TRUE |
| arima_auto | ar_roots | 0.9166 | TRUE |
| arima_auto | ar_roots | 0.9166 | TRUE |
| arima_auto | ar_roots | 0.8586 | TRUE |
| arima_auto | ar_roots | 0.8586 | TRUE |
| arima_auto | ar_roots | 0.8586 | TRUE |
| arima_auto | ar_roots | 0.9166 | TRUE |
| arima_auto | ma_roots | 0.0816 | TRUE |
| arima_manual_from_intuition | ar_roots | 0.8715 | TRUE |
| arima_manual_from_intuition | ar_roots | 0.8715 | TRUE |
| arima_manual_from_intuition | ar_roots | 0.8715 | TRUE |
| arima_manual_from_intuition | ar_roots | 0.8715 | TRUE |
| arima_manual_from_intuition | ar_roots | 0.8715 | TRUE |
| arima_manual_from_intuition | ar_roots | 0.8715 | TRUE |
| arima_manual_from_intuition | ar_roots | 0.8715 | TRUE |
| arima_manual_from_intuition | ar_roots | 0.7533 | TRUE |
| arima_manual_from_intuition | ar_roots | 0.0363 | TRUE |
| arima_manual_from_intuition | ma_roots | 0.9851 | TRUE |
| arima_manual_from_intuition | ma_roots | 0.9851 | TRUE |
| arima_manual_from_intuition | ma_roots | 0.9851 | TRUE |
| arima_manual_from_intuition | ma_roots | 0.9851 | TRUE |
| arima_manual_from_intuition | ma_roots | 0.8732 | TRUE |
| arima_manual_from_intuition | ma_roots | 0.8732 | TRUE |
| arima_manual_from_intuition | ma_roots | 0.8732 | TRUE |
| arima_manual_from_intuition | ma_roots | 0.9851 | TRUE |
| arima_manual_from_intuition | ma_roots | 0.9851 | TRUE |
| arima_manual_from_intuition | ma_roots | 0.9851 | TRUE |
| arima_manual_from_intuition | ma_roots | 0.8732 | TRUE |
| arima_manual_from_intuition | ma_roots | 0.8732 | TRUE |
| arima_manual_from_intuition | ma_roots | 0.8732 | TRUE |
| arima_manual_from_intuition | ma_roots | 0.8732 | TRUE |
| arima_manual_from_intuition | ma_roots | 0.6950 | TRUE |
Even though my manually built model scored better on the accuracy tests, it’s not better than the auto-ARIMA.
I’m going to proceed like the auto-ARIMA model is the best choice, and I’ll still check its residual plot below. I’ll also look at ETS here.
# ETS Residuals
atm4_train_fit %>%
select(ets) %>%
gg_tsresiduals() +
labs(title = "Residual Diagnostics for ATM4 (ETS Model)")
# ARIMA Residuals
atm4_train_fit %>%
select(arima_auto) %>%
gg_tsresiduals() +
labs(title = "Residual Diagnostics for ATM4 (ARIMA Model)")
# Portmanteau Tests
# Ljung-Box test
# lag = 14 for daily data to look at 2 weeks of data
atm4_train_fit %>%
select(ets, arima_auto) %>%
augment() %>%
features(.innov, ljung_box, lag = 14) %>%
knitr::kable(caption = "Ljung-Box Test for ARIMA Residuals")
| .model | lb_stat | lb_pvalue |
|---|---|---|
| arima_auto | 13.90588 | 0.4567468 |
| ets | 18.62921 | 0.1796068 |
Looking at the ARIMA model (the auto one) residual plots and portmanteau:
I’ll move forward with the auto selected ARIMA model based on it scoring slightly better in accuracy and residual analysis than the others.
# Model from training:
atm4_train_fit$arima_auto
## <lst_mdl[1]>
## [1] <ARIMA(0,0,1)(2,0,0)[7] w/ mean>
# Refit the ARIMA model to full dataset
atm4_final_fit <- atm4 %>%
model(
arima_final = ARIMA(box_cox(Cash, lambda_atm4) ~ pdq(0,0,1) + PDQ(2,0,0, period = 7))
)
# Final Model
report(atm4_final_fit)
## Series: Cash
## Model: ARIMA(0,0,1)(2,0,0)[7] w/ mean
## Transformation: box_cox(Cash, lambda_atm4)
##
## Coefficients:
## ma1 sar1 sar2 constant
## 0.0785 0.2128 0.2082 13.5498
## s.e. 0.0528 0.0516 0.0525 0.5624
##
## sigma^2 estimated as 104.5: log likelihood=-1364.98
## AIC=2739.96 AICc=2740.13 BIC=2759.46
The final model for ATM4 is ready to use for forecasting
# Forecast
# h = 30 for 30 days
atm4_fc <- atm4_final_fit %>%
forecast(h = 30)
# Plot forecast
atm4 %>%
autoplot(Cash) +
# Add the forecast layer specifically
autolayer(atm4_fc, color = "#C77CFF", level = NULL) +
labs(
title = "30-Day Cash Forecast: ATM4",
subtitle = "ARIMA(0,0,1)(2,0,0)[7] Structure",
y = "Cash",
x = "Date"
) +
scale_x_date(
date_breaks = "1 month", # I want a label per month
date_labels = "%b %Y" # Mon. Year format
) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
This prediction clearly does not appear to be right. Because this ARIMA model had no differencing, it’s assuming that the data is already stationary, so the prediction is progressing towards the mean.
I think I need the seasonal differencing component, but I also need to ensure that it’s used to create a stable model. I’ll try that below.
# Try a seasonally difference ARIMA that is forced to be simpler
# Also try ETS again
atm4_train2_fit <- atm4_train %>%
model(
arima_seasonal_and_stable = ARIMA(box_cox(Cash, lambda_atm4) ~ pdq(0,0,1) + PDQ(0,1,1, period = 7))
)
# Forecast the length of the TEST data
atm4_test2_forecast <- atm4_train2_fit %>%
forecast(new_data = atm4_test)
# Grade the forecast against the true historical data to find out-of-sample accuracy
accuracy(atm4_test2_forecast, atm4) %>%
arrange(RMSE) %>%
select(.model, RMSE, MAE, MAPE, MASE) %>%
knitr::kable(caption = "Out-of-Sample Accuracy Metrics for ATM4 (Test Set)")
| .model | RMSE | MAE | MAPE | MASE |
|---|---|---|---|---|
| arima_seasonal_and_stable | 248.1606 | 210.4713 | 853.3203 | 0.5947723 |
This model’s RMSE, MAE, MAPE and MASE are all performing as well or slightly better than my arima_manual_from_intuition that I modeled earlier. That means the real test is in the stability and complexity.
# complexity (AICc)
atm4_train2_fit %>%
select(arima_seasonal_and_stable) %>%
glance() %>%
arrange(AICc) %>%
select(.model, AICc, BIC) %>%
knitr::kable(caption = "Information Criteria Comparison: Auto vs Intuition")
| .model | AICc | BIC |
|---|---|---|
| arima_seasonal_and_stable | 2442.302 | 2453.589 |
# (p,d,q)(P,D,Q) parameter compare
atm4_train2_fit %>%
select(arima_seasonal_and_stable)
## # A mable: 1 x 1
## arima_seasonal_and_stable
## <model>
## 1 <ARIMA(0,0,1)(0,1,1)[7]>
# AR and MA Roots for Stability
atm4_train2_fit %>%
select(arima_seasonal_and_stable) %>%
gg_arma() +
labs(
title = "Inverse AR/MA Roots: Auto ARIMA",
subtitle = "Points should lie entirely inside the unit circle"
)
# Check for stability more definitively than with the circle
atm4_train2_fit %>%
select(arima_seasonal_and_stable) %>%
glance() %>%
select(.model, ar_roots, ma_roots) %>%
pivot_longer(ends_with("roots"), names_to = "Root_Type", values_to = "Root") %>%
unnest(Root) %>% # Flatten the lists of roots into individual rows
mutate(
Inverse_Distance = Mod(1 / Root),
Is_Stable = Inverse_Distance < 1
) %>%
select(.model, Root_Type, Inverse_Distance, Is_Stable) %>%
knitr::kable(digits = 4, caption = "Definitive Root Check: Iteration 2")
| .model | Root_Type | Inverse_Distance | Is_Stable |
|---|---|---|---|
| arima_seasonal_and_stable | ma_roots | 0.9853 | TRUE |
| arima_seasonal_and_stable | ma_roots | 0.9853 | TRUE |
| arima_seasonal_and_stable | ma_roots | 0.9853 | TRUE |
| arima_seasonal_and_stable | ma_roots | 0.9853 | TRUE |
| arima_seasonal_and_stable | ma_roots | 0.9853 | TRUE |
| arima_seasonal_and_stable | ma_roots | 0.9853 | TRUE |
| arima_seasonal_and_stable | ma_roots | 0.9853 | TRUE |
| arima_seasonal_and_stable | ma_roots | 0.0972 | TRUE |
This model has fewer parameters (only 2). It does have many roots nearly on the unit circle, but I feel much more confident in this. The lower complexity means I’m not so worried about instability. I’ll try this model now.
# Model from training:
atm4_train2_fit$arima_seasonal_and_stable
## <lst_mdl[1]>
## [1] <ARIMA(0,0,1)(0,1,1)[7]>
# Refit the ARIMA model to full dataset
atm4_final2_fit <- atm4 %>%
model(
arima_final2 = ARIMA(box_cox(Cash, lambda_atm4) ~ pdq(0,0,1) + PDQ(0,1,1, period = 7))
)
# Final Model
report(atm4_final2_fit)
## Series: Cash
## Model: ARIMA(0,0,1)(0,1,1)[7]
## Transformation: box_cox(Cash, lambda_atm4)
##
## Coefficients:
## ma1 sma1
## 0.0976 -0.8664
## s.e. 0.0524 0.0321
##
## sigma^2 estimated as 98.1: log likelihood=-1332.73
## AIC=2671.46 AICc=2671.53 BIC=2683.1
# Forecast
# h = 30 for 30 days
atm4_fc2 <- atm4_final2_fit %>%
forecast(h = 30)
# Plot forecast
atm4 %>%
autoplot(Cash) +
# Add the forecast layer specifically
autolayer(atm4_fc2, color = "#C77CFF", level = NULL) +
labs(
title = "30-Day Cash Forecast: ATM4",
subtitle = "ARIMA(0,0,1)(0,1,1)[7] Structure",
y = "Cash",
x = "Date"
) +
scale_x_date(
date_breaks = "1 month", # I want a label per month
date_labels = "%b %Y" # Mon. Year format
) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Get the forecasts (.mean) from all final models
all_forecasts <- bind_rows(
atm1_fc %>% as_tibble() %>% select(ATM, DATE, Forecasted_Cash = .mean),
atm2_fc %>% as_tibble() %>% select(ATM, DATE, Forecasted_Cash = .mean),
atm3_fc %>% as_tibble() %>% select(ATM, DATE, Forecasted_Cash = .mean),
atm4_fc2 %>% as_tibble() %>% select(ATM, DATE, Forecasted_Cash = .mean) # Using the iter2 forecast
)
# Join forecasts to the OG data
# NOT using my outlier-cleaned data or imputed data
final_export <- atm %>%
as_tibble() %>%
rename(Original_Cash = Cash) %>%
full_join(all_forecasts, by = c("ATM", "DATE")) %>%
arrange(ATM, DATE)
# Export to CSV
export_path <- file.path(projPath, "ATM_Forecast_Results.csv")
write_csv(final_export, export_path)
cat("Exported original data and forecasts to:", export_path, "\n")
## Exported original data and forecasts to: C:/Users/cdube/Grad School/DATA 624/ATM_Forecast_Results.csv
The final forecasts have been exported to a csv.
# final plot
final_export %>%
filter(!(is.na(ATM))) %>%
ggplot(aes(x = DATE)) +
# OG Data
geom_line(aes(y = Original_Cash, color = ATM), alpha = 0.5) +
# Forecasted
geom_line(aes(y = Forecasted_Cash, color = ATM), linewidth = 1) +
facet_wrap(~ ATM, scales = "free_y", ncol = 2) +
scale_color_manual(values = c(
"ATM1" = "#F8766D",
"ATM2" = "#7CAE00",
"ATM3" = "#00BFC4",
"ATM4" = "#C77CFF"
)) +
labs(
title = "Final 30-Day Forecasts vs. Original Raw Data by ATM",
subtitle = "Bold lines in the May 2010 onward are forecasts",
x = "Date",
y = "Cash Withdrawn (Hundreds)",
color = "ATM"
) +
scale_x_date(
date_breaks = "2 months",
date_labels = "%b %Y"
) +
theme_minimal() +
theme(
axis.text.x = element_text(angle = 45, hjust = 1),
legend.position = "none",
strip.text = element_text(face = "bold", size = 12)
)
## Warning: Removed 120 rows containing missing values or values outside the scale range
## (`geom_line()`).
## Warning: Removed 1460 rows containing missing values or values outside the scale range
## (`geom_line()`).
Forecasting the cash withdrawals for these four atm locations shows distinct behavior for each.
Part B consists of a simple dataset of residential power usage for January 1998 until December 2013. Your assignment is to model these data and a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in Kilowatt hours, the rest is straight forward. Add this to your existing files above.
power_usage <- read_excel(file.path(projPath, "ResidentialCustomerForecastLoad-624.xlsx"))
head(power_usage)
## # A tibble: 6 × 3
## CaseSequence `YYYY-MMM` KWH
## <dbl> <chr> <dbl>
## 1 733 1998-Jan 6862583
## 2 734 1998-Feb 5838198
## 3 735 1998-Mar 5420658
## 4 736 1998-Apr 5010364
## 5 737 1998-May 4665377
## 6 738 1998-Jun 6467147
# relying on lubridate to create a date column here since there is no day
power_usage <- power_usage %>%
mutate(ReportDate = yearmonth(`YYYY-MMM`)) %>%
select(CaseSequence, ReportDate, KWH) %>%
as_tsibble(, key=NULL, index=ReportDate)
# Quick Peek
power_usage %>%
autoplot(KWH) +
labs(title = "Residential Power Usage (KWH) over Time",
x = "Report Date") +
scale_x_yearmonth(
date_breaks = "4 month", # I want a label per quarter
date_labels = "%b %Y" # Mon. Year format
) +
theme(axis.text.x = element_text(angle = 45, hjust = 1, size=6.5)) # tiny text to fit on x axis
# Dip in power usage
min_index <- which.min(power_usage$KWH)
power_usage %>%
slice((min_index - 2):(min_index + 2)) %>%
kable(caption = "5-Point Window: The Lowest KWH Reading and its Neighbors")
| CaseSequence | ReportDate | KWH |
|---|---|---|
| 881 | 2010 May | 4919289 |
| 882 | 2010 Jun | 6696292 |
| 883 | 2010 Jul | 770523 |
| 884 | 2010 Aug | 7922701 |
| 885 | 2010 Sep | 7819472 |
# Check for any missing case sequence
# sequence from min to max and see what is missing
full_seq <- min(power_usage$CaseSequence, na.rm = TRUE):max(power_usage$CaseSequence, na.rm = TRUE)
missing_vals <- setdiff(full_seq, power_usage$CaseSequence)
data.frame(Missing_CaseSequence = missing_vals) %>%
kable(caption = "Gaps in Case Sequence")
| Missing_CaseSequence |
|---|
# Check for repeating case sequence
power_usage %>%
as_tibble() %>%
group_by(CaseSequence) %>%
summarise(count = n()) %>%
filter(count > 1) %>%
kable(caption = "Repating Case Sequences")
| CaseSequence | count |
|---|
# check for missing dates
gap_check <- power_usage %>% has_gaps()
power_usage %>%
scan_gaps() %>%
kable(caption = "Time Gaps (Missing Rows) Found in Data")
| ReportDate |
|---|
# Check for any NAs in any columns
power_usage %>%
as_tibble() %>%
summarise(across(everything(), ~sum(is.na(.)))) %>%
pivot_longer(everything(), names_to = "Column_Name", values_to = "NA_Count") %>%
kable(caption = "Count of Missing Values (NAs) by Column")
| Column_Name | NA_Count |
|---|---|
| CaseSequence | 0 |
| ReportDate | 0 |
| KWH | 1 |
power_usage %>%
filter(is.na(KWH)) %>%
kable(caption = "NA value for KWH")
| CaseSequence | ReportDate | KWH |
|---|---|---|
| 861 | 2008 Sep | NA |
# Check for any repeating dates
power_usage %>%
count(ReportDate) %>%
filter(n > 1) %>%
rename(Occurrences = n) %>%
kable(caption = "Duplicate Report Dates")
| ReportDate | Occurrences |
|---|
# Summary
power_usage %>%
as_tibble() %>%
# Convert yearmonth to Date so summary() can read it
mutate(ReportDate = as.Date(ReportDate)) %>%
summary() %>%
kable(caption = "Statistical Summary of Power Usage Data")
| CaseSequence | ReportDate | KWH | |
|---|---|---|---|
| Min. :733.0 | Min. :1998-01-01 | Min. : 770523 | |
| 1st Qu.:780.8 | 1st Qu.:2001-12-24 | 1st Qu.: 5429912 | |
| Median :828.5 | Median :2005-12-16 | Median : 6283324 | |
| Mean :828.5 | Mean :2005-12-15 | Mean : 6502475 | |
| 3rd Qu.:876.2 | 3rd Qu.:2009-12-08 | 3rd Qu.: 7620524 | |
| Max. :924.0 | Max. :2013-12-01 | Max. :10655730 | |
| NA | NA | NA’s :1 |
To start with, I don’t see much that is “messy” about the data (like
duplicate dates or missing dates). There is just one missing
KWH value from Sept. 2008.
From visual inspection, I also see that there is clear seasonality. Power usage spikes in the winter and summer.
It also looks like power usage started trending upwards in 2010, following a dip in power usage in July 2010. This dip in power usage for July 2010 appears to be an outlier, and it might even be a user-data-entry error. Since adding a 0 to the end of the number would make it quite plausible. However, I’m not certain that is the case, so I’m going to impute it regardless.
For imputation of the NA value spotted in exploration, much like in Part A I’m opting to use ARIMA interpolation, one of the recommended methods in the Hyndman textbook.
# NA Value
power_usage %>%
filter(is.na(KWH))
## # A tsibble: 1 x 3 [1M]
## CaseSequence ReportDate KWH
## <dbl> <mth> <dbl>
## 1 861 2008 Sep NA
power_usage_impute_me <- power_usage %>%
filter(is.na(KWH)) %>%
pull(CaseSequence)
# Impute NAs with ARIMA
power_usage_clean <- power_usage %>%
model(arima = ARIMA(KWH)) %>%
interpolate(power_usage) %>%
left_join(as_tibble(power_usage) %>% select(ReportDate, CaseSequence), by = "ReportDate")
# Check where NA used to be
power_usage_clean %>%
filter(CaseSequence == power_usage_impute_me)
## # A tsibble: 1 x 3 [1M]
## ReportDate KWH CaseSequence
## <mth> <dbl> <dbl>
## 1 2008 Sep 7597704. 861
I feel like I’ve learned my lesson from Part A of this project. I could systematically identify all the outliers by some rule. But visually, I see that the only outlier I am concerned with is the one from July 2010. So that’s the one I will handle.
# Dip in power usage
# I'm rerunning this on power_usage_clean just to be safe, since this is now my working df
min_index <- which.min(power_usage_clean$KWH)
# I prefer anchoring to the CaseSequnce anyway though. So i'm doing both,
min_case_sequence <- power_usage_clean %>%
filter(KWH == min(power_usage_clean$KWH)) %>%
pull(CaseSequence)
power_usage_clean %>%
filter(KWH == min(power_usage_clean$KWH)) %>%
kable(caption = "Minimum KWH value is an outlier. To be Imputed.")
| ReportDate | KWH | CaseSequence |
|---|---|---|
| 2010 Jul | 770523 | 883 |
# Replace outlier with NA
power_usage_clean <- power_usage_clean %>%
mutate(
KWH = if_else(CaseSequence == min_case_sequence, NA_real_, KWH)
)
# Impute
power_usage_clean <- power_usage_clean %>%
model(arima = ARIMA(KWH)) %>%
interpolate(power_usage_clean) %>%
left_join(as_tibble(power_usage_clean) %>% select(ReportDate, CaseSequence), by = "ReportDate")
# Double Check new Value
power_usage_clean %>%
filter(CaseSequence == min_case_sequence) %>%
kable(caption = "New Imputed Value to Handle Outlier")
| ReportDate | KWH | CaseSequence |
|---|---|---|
| 2010 Jul | 7728530 | 883 |
My goal is to forecast for 2014, I’ll hold out one year of data (all of 2013) from my test set.
# Training Set (Everything up to December 2012)
power_train <- power_usage_clean %>%
filter(ReportDate <= yearmonth("2012-12"))
# Test Set (January 2013 through December 2013)
power_test <- power_usage_clean %>%
filter(ReportDate > yearmonth("2012-12"))
# Split Check
cat("Training Set ends on:", as.character(max(power_train$ReportDate)), "and has", nrow(power_train), "months.\n")
## Training Set ends on: 2012 Dec and has 180 months.
cat("Test Set starts on:", as.character(min(power_test$ReportDate)), "and ends on:", as.character(max(power_test$ReportDate)), "with", nrow(power_test), "months.\n")
## Test Set starts on: 2013 Jan and ends on: 2013 Dec with 12 months.
I want to look at the training data through the lense of STL decomposition.
# STL decomposition on the training set
power_train %>%
model(STL(KWH ~ season(window = "periodic"), robust = TRUE)) %>%
components() %>%
autoplot() +
labs(
title = "STL Decomposition for Residential Power Usage",
subtitle = "Training Set (1998 - 2012)",
x = "Report Date"
)
# Optimal Box-Cox Lambda using Guerrero's method
lambda_power <- power_train %>%
features(KWH, features = guerrero) %>%
pull(lambda_guerrero)
cat("Optimal Box-Cox Lambda for Power Usage:", round(lambda_power, 4), "\n")
## Optimal Box-Cox Lambda for Power Usage: -0.1696
# Check for required differencing on the transformed data
# Seasonal differencing
nsdiffs_power <- power_train %>%
mutate(box_kwh = box_cox(KWH, lambda_power)) %>%
features(box_kwh, unitroot_nsdiffs) %>%
pull(nsdiffs)
cat("Seasonal differences required (nsdiffs):", nsdiffs_power, "\n")
## Seasonal differences required (nsdiffs): 1
# Ordinary differencing
ndiffs_power <- power_train %>%
mutate(box_kwh = box_cox(KWH, lambda_power)) %>%
features(box_kwh, unitroot_ndiffs) %>%
pull(ndiffs)
cat("Ordinary differences required (ndiffs):", ndiffs_power, "\n")
## Ordinary differences required (ndiffs): 1
The optimal Box-Cox lambda is -0.1696, which is pretty close to 0 (natural log transofrmation). The SPSS test also calls for both one seasonal difference (which I expected) and one ordinary difference (maybe to handle the trend seen earlier).
I’m going to be just a little bit more thorough with the “theory” or “explanatory power” I’m looking for while fitting these models than I did with Part A. I honestly think it’s a bit overkill, but it’s nice to reinforce the learning. (Also Part A had so many components with many similar trends in each, and that could be used to inform on expectations of each ATM, whereas Part B has fewer components, so I’d rather be a little more thorough up front.)
# Fit models
power_fit <- power_train %>%
model(
# Snaive (the baseline basically)
snaive = SNAIVE(KWH),
# Auto models without transformation
ets_auto = ETS(KWH),
arima_auto = ARIMA(KWH),
# Auto models with Box-Cox transformation
ets_bc = ETS(box_cox(KWH, lambda_power)),
arima_bc = ARIMA(box_cox(KWH, lambda_power)),
# Manual ARIMA based on SPSS test (d=1, D=1)
arima_manual = ARIMA(box_cox(KWH, lambda_power) ~ pdq(d=1) + PDQ(D=1))
)
# Forecast against the 2013 test set
power_fc <- power_fit %>%
forecast(new_data = power_test)
# Grade the forecast against the true historical test data
accuracy(power_fc, power_usage_clean) %>%
arrange(RMSE) %>%
select(.model, RMSE, MAE, MAPE, MASE) %>%
knitr::kable(caption = "Out-of-Sample Accuracy Metrics for Power Usage (Test Set)")
| .model | RMSE | MAE | MAPE | MASE |
|---|---|---|---|---|
| arima_bc | 963502.9 | 683535.0 | 8.050892 | 1.125034 |
| arima_auto | 969439.6 | 655830.8 | 7.694001 | 1.079436 |
| arima_manual | 970940.7 | 742764.1 | 8.925284 | 1.222520 |
| snaive | 1035537.6 | 618605.6 | 7.058492 | 1.018166 |
| ets_bc | 1054366.9 | 697314.2 | 8.339581 | 1.147713 |
| ets_auto | 1100750.2 | 690764.1 | 8.126241 | 1.136932 |
Comparing the models by accuracy performance
## Residual Plots
# SNAIVE Residuals
power_fit %>%
select(snaive) %>%
gg_tsresiduals() +
labs(title = "Residual Diagnostics: SNAIVE Baseline")
## Warning: Removed 12 rows containing missing values or values outside the scale range
## (`geom_line()`).
## Warning: Removed 12 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 12 rows containing non-finite outside the scale range
## (`stat_bin()`).
## Warning: Removed 12 rows containing missing values or values outside the scale range
## (`geom_rug()`).
# Auto ARIMA Residuals
power_fit %>%
select(arima_auto) %>%
gg_tsresiduals() +
labs(title = "Residual Diagnostics: Auto ARIMA (No Transformation)")
# Box-Cox ARIMA Residuals
power_fit %>%
select(arima_bc) %>%
gg_tsresiduals() +
labs(title = "Residual Diagnostics: Box-Cox ARIMA")
## Portmanteau Tests
# Ljung-Box test for white noise
# lag = 24 for two years of monthly seasonal data
power_fit %>%
select(snaive, arima_auto, arima_bc) %>%
augment() %>%
features(.innov, ljung_box, lag = 24) %>%
arrange(lb_pvalue) %>%
knitr::kable(caption = "Ljung-Box Test Results (lag = 24)")
| .model | lb_stat | lb_pvalue |
|---|---|---|
| snaive | 107.64492 | 0.0000000 |
| arima_bc | 31.42627 | 0.1418684 |
| arima_auto | 27.01241 | 0.3038750 |
## Stability Checks (ARIMA only)
# Visual check of AR/MA roots
power_fit %>%
select(arima_auto, arima_bc) %>%
gg_arma() +
labs(
title = "Inverse AR/MA Roots: ARIMA Models",
subtitle = "Points should lie entirely inside the unit circle"
)
# Definitive root check table
power_fit %>%
select(arima_auto, arima_bc) %>%
glance() %>%
select(.model, ar_roots, ma_roots) %>%
pivot_longer(ends_with("roots"), names_to = "Root_Type", values_to = "Root") %>%
unnest(Root) %>%
mutate(
Inverse_Distance = Mod(1 / Root),
Is_Stable = Inverse_Distance < 1
) %>%
select(.model, Root_Type, Inverse_Distance, Is_Stable) %>%
knitr::kable(digits = 4, caption = "Definitive Root Stability Check")
| .model | Root_Type | Inverse_Distance | Is_Stable |
|---|---|---|---|
| arima_auto | ar_roots | 0.9633 | TRUE |
| arima_auto | ar_roots | 0.9633 | TRUE |
| arima_auto | ar_roots | 0.9633 | TRUE |
| arima_auto | ar_roots | 0.9633 | TRUE |
| arima_auto | ar_roots | 0.9633 | TRUE |
| arima_auto | ar_roots | 0.9633 | TRUE |
| arima_auto | ar_roots | 0.9633 | TRUE |
| arima_auto | ar_roots | 0.9633 | TRUE |
| arima_auto | ar_roots | 0.9633 | TRUE |
| arima_auto | ar_roots | 0.9633 | TRUE |
| arima_auto | ar_roots | 0.9633 | TRUE |
| arima_auto | ar_roots | 0.9633 | TRUE |
| arima_auto | ar_roots | 0.9633 | TRUE |
| arima_auto | ar_roots | 0.9633 | TRUE |
| arima_auto | ar_roots | 0.9633 | TRUE |
| arima_auto | ar_roots | 0.9633 | TRUE |
| arima_auto | ar_roots | 0.9633 | TRUE |
| arima_auto | ar_roots | 0.9633 | TRUE |
| arima_auto | ar_roots | 0.9633 | TRUE |
| arima_auto | ar_roots | 0.9633 | TRUE |
| arima_auto | ar_roots | 0.9633 | TRUE |
| arima_auto | ar_roots | 0.9633 | TRUE |
| arima_auto | ar_roots | 0.9633 | TRUE |
| arima_auto | ar_roots | 0.9633 | TRUE |
| arima_auto | ma_roots | 0.4122 | TRUE |
| arima_bc | ar_roots | 0.9594 | TRUE |
| arima_bc | ar_roots | 0.9594 | TRUE |
| arima_bc | ar_roots | 0.9594 | TRUE |
| arima_bc | ar_roots | 0.9594 | TRUE |
| arima_bc | ar_roots | 0.9594 | TRUE |
| arima_bc | ar_roots | 0.9594 | TRUE |
| arima_bc | ar_roots | 0.9594 | TRUE |
| arima_bc | ar_roots | 0.9594 | TRUE |
| arima_bc | ar_roots | 0.9594 | TRUE |
| arima_bc | ar_roots | 0.9594 | TRUE |
| arima_bc | ar_roots | 0.9594 | TRUE |
| arima_bc | ar_roots | 0.9594 | TRUE |
| arima_bc | ar_roots | 0.9594 | TRUE |
| arima_bc | ar_roots | 0.9594 | TRUE |
| arima_bc | ar_roots | 0.9594 | TRUE |
| arima_bc | ar_roots | 0.9594 | TRUE |
| arima_bc | ar_roots | 0.9594 | TRUE |
| arima_bc | ar_roots | 0.9594 | TRUE |
| arima_bc | ar_roots | 0.9594 | TRUE |
| arima_bc | ar_roots | 0.9594 | TRUE |
| arima_bc | ar_roots | 0.9594 | TRUE |
| arima_bc | ar_roots | 0.9594 | TRUE |
| arima_bc | ar_roots | 0.9594 | TRUE |
| arima_bc | ar_roots | 0.9594 | TRUE |
| arima_bc | ma_roots | 0.2843 | TRUE |
Comparing the model residual plots and portmanteau:
The Box-Cox Transformed ARIMA model is the best choice.
power_fit$arima_bc
## <lst_mdl[1]>
## [1] <ARIMA(0,0,1)(2,1,0)[12] w/ drift>
# Refit the Box-Cox ARIMA model to the FULL dataset
power_final_fit <- power_usage_clean %>%
model(
arima_final = ARIMA(box_cox(KWH, lambda_power) ~ 1 + pdq(0,0,1) + PDQ(2,1,0, period = 12))
)
# Review the final model
report(power_final_fit)
## Series: KWH
## Model: ARIMA(0,0,1)(2,1,0)[12] w/ drift
## Transformation: box_cox(KWH, lambda_power)
##
## Coefficients:
## ma1 sar1 sar2 constant
## 0.2486 -0.7202 -0.3560 0.0023
## s.e. 0.0834 0.0752 0.0769 0.0006
##
## sigma^2 estimated as 4.313e-05: log likelihood=651.95
## AIC=-1293.9 AICc=-1293.55 BIC=-1277.93
The final model is ready to use for forecasting
# Forecast the next 12 months (January - December 2014)
power_fc <- power_final_fit %>%
forecast(h = 12)
# Plot the historical data with the forecast
power_usage_clean %>%
autoplot(KWH) +
# Add the forecast layer
autolayer(power_fc, color = "#4682B4", level = c(80, 95)) +
labs(
title = "12-Month Residential Power Usage Forecast (2014)",
subtitle = "ARIMA(0,0,1)(2,1,0)[12] with drift & Box-Cox transformation",
y = "Power Usage (KWH)",
x = "Report Date"
) +
scale_x_yearmonth(
date_breaks = "1 year",
date_labels = "%Y"
) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Extract the point forecasts
power_export <- power_fc %>%
as_tibble() %>%
select(ReportDate, Forecasted_KWH = .mean) %>%
mutate(ReportDate = as.character(ReportDate))
# Define export path and write to CSV
export_path_b <- file.path(projPath, "Power_Forecast_2014.csv")
write_csv(power_export, export_path_b)
cat("Exported 2014 Power Usage forecast to:", export_path_b, "\n")
## Exported 2014 Power Usage forecast to: C:/Users/cdube/Grad School/DATA 624/Power_Forecast_2014.csv
The final forecast has been exported to a csv.
Part C consists of two data sets. These are simple 2 columns sets, however they have different time stamps. Your optional assignment is to time-base sequence the data and aggregate based on hour (example of what this looks like, follows). Note for multiple recordings within an hour, take the mean. Then to determine if the data is stationary and can it be forecast. If so, provide a week forward forecast and present results via Rpubs and .rmd and the forecast in an Excel readable file.
Because the timestamps provided in the raw data do not fall perfectly on the hour, I’m rounding the DateTime down to the nearest hour. Then I’m getting the average waterflow to handle the case where there are multiple recordings within the same hour.
# read data
pipe1_raw <- read_excel(
file.path(projPath, "Waterflow_Pipe1.xlsx"),
col_types = c("date", "numeric") # Forces Excel to parse as POSIXct datetime
)
pipe2_raw <- read_excel(
file.path(projPath, "Waterflow_Pipe2.xlsx"),
col_types = c("date", "numeric")
)
# checking the time period
any(minute(pipe1_raw$`Date Time`) != 0 | second(pipe1_raw$`Date Time`) != 0)
## [1] TRUE
any(duplicated(floor_date(pipe1_raw$`Date Time`, "hour")))
## [1] TRUE
# cleaning data
pipe1_agg <- pipe1_raw %>%
# Rename columns to remove spaces and standardize
rename(DateTime = `Date Time`, WaterFlow = WaterFlow) %>%
# Floor the datetime to the nearest hour
mutate(DateTimeHour = floor_date(DateTime, "hour")) %>%
# Aggregate multiple recordings within the same hour
group_by(DateTimeHour) %>%
summarise(WaterFlow = mean(WaterFlow, na.rm = TRUE), .groups = 'drop') %>%
# Convert to tsibble
as_tsibble(index = DateTimeHour)
pipe2_agg <- pipe2_raw %>%
rename(DateTime = `Date Time`, WaterFlow = WaterFlow) %>%
mutate(DateTimeHour = floor_date(DateTime, "hour")) %>%
group_by(DateTimeHour) %>%
summarise(WaterFlow = mean(WaterFlow, na.rm = TRUE), .groups = 'drop') %>%
as_tsibble(index = DateTimeHour)
When working with sensor data, there periods where the sensor has gaps in the time series (the hour simply doesn’t exist in the data frame). So I’ll impute these using ARIMA.
#
pipe1_agg %>% has_gaps() %>% kable(caption = "Does Pipe 1 have gaps?")
| .gaps |
|---|
| TRUE |
pipe2_agg %>% has_gaps() %>% kable(caption = "Does Pipe 2 have gaps?")
| .gaps |
|---|
| FALSE |
# summarized count of the gaps
pipe1_agg %>%
count_gaps() %>%
kable(caption = "Summary of Gap Sizes in Pipe 1")
| .from | .to | .n |
|---|---|---|
| 2015-10-27 17:00:00 | 2015-10-27 17:00:00 | 1 |
| 2015-10-28 00:00:00 | 2015-10-28 00:00:00 | 1 |
| 2015-11-01 05:00:00 | 2015-11-01 05:00:00 | 1 |
| 2015-11-01 09:00:00 | 2015-11-01 09:00:00 | 1 |
# fill gaps
pipe1_clean <- pipe1_agg %>% fill_gaps()
pipe2_clean <- pipe2_agg %>% fill_gaps()
cat("Missing hours found in Pipe 1:", sum(is.na(pipe1_clean$WaterFlow)), "\n")
## Missing hours found in Pipe 1: 4
cat("Missing hours found in Pipe 2:", sum(is.na(pipe2_clean$WaterFlow)), "\n")
## Missing hours found in Pipe 2: 0
# Impute NAs using ARIMA interpolation (if any exist)
if(sum(is.na(pipe1_clean$WaterFlow)) > 0) {
pipe1_clean <- pipe1_clean %>%
model(arima = ARIMA(WaterFlow)) %>%
interpolate(pipe1_clean)
}
if(sum(is.na(pipe2_clean$WaterFlow)) > 0) {
pipe2_clean <- pipe2_clean %>%
model(arima = ARIMA(WaterFlow)) %>%
interpolate(pipe2_clean)
}
pipe1_clean %>%
autoplot(WaterFlow) +
labs(title = "Hourly Waterflow: Pipe 1", x = "Date/Time", y = "Waterflow") +
theme_minimal()
pipe2_clean %>%
autoplot(WaterFlow) +
labs(title = "Hourly Waterflow: Pipe 2", x = "Date/Time", y = "Waterflow") +
theme_minimal()
Visually, both datasets appear to have a stable mean and variance over time without any long-term wandering trends. I’ll run a KPSS unit root test to confirm that.
# KPSS Test
bind_rows(
pipe1_clean %>% features(WaterFlow, unitroot_kpss) %>% mutate(Pipe = "Pipe 1"),
pipe2_clean %>% features(WaterFlow, unitroot_kpss) %>% mutate(Pipe = "Pipe 2")
) %>%
select(Pipe, kpss_stat, kpss_pvalue) %>%
kable(caption = "KPSS Stationarity Test Results") # p value > 0.05 so fail to reject null
| Pipe | kpss_stat | kpss_pvalue |
|---|---|---|
| Pipe 1 | 0.0793697 | 0.1 |
| Pipe 2 | 0.1053662 | 0.1 |
Both Pipe 1 and Pipe 2 return a p-value greater than 0.05. Therefore, I fail to reject the null hypothesis. The data is stationary and requires no ordinary differencing (\(d=0\)) for modeling. Because the variance looks highly stable in the EDA plots, I am also opting to skip Box-Cox transformations for this section.
I will fit an automated ARIMA model to both pipes, letting the algorithm identify any seasonal or autoregressive patterns in the stationary hourly data.
# Fit ARIMA models
pipe1_fit <- pipe1_clean %>% model(arima_auto = ARIMA(WaterFlow))
pipe2_fit <- pipe2_clean %>% model(arima_auto = ARIMA(WaterFlow))
# Review the selected models
report(pipe1_fit)
## Series: WaterFlow
## Model: ARIMA(0,0,0) w/ mean
##
## Coefficients:
## constant
## 19.4683
## s.e. 0.3430
##
## sigma^2 estimated as 28.35: log likelihood=-741.4
## AIC=1486.8 AICc=1486.85 BIC=1493.76
report(pipe2_fit)
## Series: WaterFlow
## Model: ARIMA(0,0,0)(0,0,1)[24] w/ mean
##
## Coefficients:
## sma1 constant
## 0.0846 39.5732
## s.e. 0.0314 0.5472
##
## sigma^2 estimated as 256: log likelihood=-4190.54
## AIC=8387.08 AICc=8387.1 BIC=8401.8
# Residual diagnostics for Pipe 1
pipe1_fit %>% gg_tsresiduals() + labs(title = "Residual Diagnostics: Pipe 1")
# Residual diagnostics for Pipe 2
pipe2_fit %>% gg_tsresiduals() + labs(title = "Residual Diagnostics: Pipe 2")
The auto-ARIMA models are pretty simple, which I expected base on the KPSS tests and visually inspecting the data and seeing how stationary it was.
*Pipe 1 (ARIMA(0,0,0) w/ mean):** For the first pipe, the algorithm essentially fitted a “mean model.” Because it found no significant autoregressive or moving average terms to improve the forecast, it will simply predict the historical average water flow (a constant of roughly 19.47). Looking at the residual diagnostics, the ACF plot shows the residuals mostly behave like white noise. The histogram is fairly normal, but there are a few negative outliers skewing it (sudden drops in water flow around Oct 28 and Nov 1).
*Pipe 2 (ARIMA(0,0,0)(0,0,1)[24] w/ mean):** For the second pipe, the algorithm also fitted a mean model (with a constant around 39.57), but it successfully detected a daily seasonal pattern. The (0,0,1)[24] component indicates a seasonal moving average term adjusting for a 24-hour cycle. The variance is consistent over time, the histogram is beautifully centered and normal, and while a few lags briefly poke past the significance boundaries on the ACF plot, this is entirely expected by chance with high-frequency hourly data.
I’ll forecast one week forward. Because the data is hourly, this translates to \(h = 168\) periods (24 hours * 7 days).
# Fit ARIMA models to both pipes
pipe1_fit <- pipe1_clean %>% model(arima_auto = ARIMA(WaterFlow))
pipe2_fit <- pipe2_clean %>% model(arima_auto = ARIMA(WaterFlow))
# Forecast a week forward (24 hours * 7 days = 168 hours)
pipe1_fc <- pipe1_fit %>% forecast(h = 168)
pipe2_fc <- pipe2_fit %>% forecast(h = 168)
# Plot the forecasts
# Pipe 1 Plot
pipe1_clean %>%
autoplot(WaterFlow) +
autolayer(pipe1_fc, color = "#F8766D", level = NULL) +
labs(
title = "1-Week Waterflow Forecast: Pipe 1",
x = "Date/Time (Hourly)",
y = "Waterflow"
) +
theme_minimal()
# Pipe 2 Plot
pipe2_clean %>%
autoplot(WaterFlow) +
autolayer(pipe2_fc, color = "#00BFC4", level = NULL) +
labs(
title = "1-Week Waterflow Forecast: Pipe 2",
x = "Date/Time (Hourly)",
y = "Waterflow"
) +
theme_minimal()
# Prepare data for export
# Extract point forecasts (.mean) and label the pipes
pipe1_export <- pipe1_fc %>%
as_tibble() %>%
mutate(Pipe = "Pipe 1") %>%
select(Pipe, DateTimeHour, Forecasted_WaterFlow = .mean)
pipe2_export <- pipe2_fc %>%
as_tibble() %>%
mutate(Pipe = "Pipe 2") %>%
select(Pipe, DateTimeHour, Forecasted_WaterFlow = .mean)
# Combine both into a single data frame
combined_pipe_forecasts <- bind_rows(pipe1_export, pipe2_export)
# Export to CSV (Ensuring standard file naming)
export_path_c <- file.path(projPath, "Waterflow_Forecast_Results.csv")
write_csv(combined_pipe_forecasts, export_path_c)
cat("Exported Pipe 1 and Pipe 2 forecasts to:", export_path_c, "\n")
## Exported Pipe 1 and Pipe 2 forecasts to: C:/Users/cdube/Grad School/DATA 624/Waterflow_Forecast_Results.csv