Part A

Part A – ATM Forecast, ATM624Data.xlsx

In part A, I want you to forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable ‘Cash’ is provided in hundreds of dollars; other than that it is straightforward. I am being somewhat ambiguous on purpose to make this have a little more business feeling. Explain and demonstrate your process, techniques used and not used, and your actual forecast. I am giving you data via an Excel file. Please provide your written report on your findings, visuals, discussion, and your R code via an RPubs link along with the actual .rmd file. Also please submit the forecast, which you will put in an Excel-readable file.

Introduction

Based on the guidelines of Part A, my strategy is to build an fpp3 workflow that ingests the single Excel file from my GitHub repo for easy reproducibility, keeps the data in long form as a keyed tsibble (ATM by date), and converts the Cash variable from hundreds of dollars to USD for better interpretability, since the units were intentionally left ambiguous. From there, I explore structure with STL decomposition and day-of-week effects to motivate model choice, reserve the last 28 pre-May days for validation to control overfitting, and compare ETS and ARIMA against a NAIVE baseline, selecting the best model per ATM by RMSE. After residual checks (Ljung–Box and the residual ACF), I refit through April 30, 2010 and generate daily forecasts for May 1–31, 2010 with 80%/95% prediction intervals. (A concluding visualization follows for both Parts A and B.)

knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE, fig.width = 9, fig.height = 4.5)
set.seed(123)
library(tidyverse)
library(lubridate)
library(janitor)
library(readxl)
library(writexl)
library(tsibble)
library(feasts)
library(fable)
library(fabletools)
library(scales)
library(forecast)

Data Import

After downloading the dataset directly from the project folder, I uploaded the file to my GitHub repository so the import below is easily reproducible. I standardize column names with janitor::clean_names(), yielding the lower-case fields date, atm, and cash. Finally, I preview the result to confirm the expected long layout (one row per ATM per day); the first six records show cash expressed in hundreds of dollars (to be converted to USD in the next section).

get_atm_data <- function() {
  url <- "https://raw.githubusercontent.com/GullitNa/DATA624-Project1/main/ATM624Data.xlsx"
  tf  <- tempfile(fileext = ".xlsx")
  utils::download.file(url, tf, mode = "wb", quiet = TRUE)
  readxl::read_xlsx(tf) %>% janitor::clean_names()
}
raw <- get_atm_data()
raw %>% head()

Data Transformation

The following code chunk loads the file, keeps the dataset in long form, and standardizes types (atm as text, date as Date, cash as numeric). It converts cash (in hundreds) to dollars via cash_usd = cash * 100, removes the early-May placeholder rows with missing values so only observed history remains, and constructs a tsibble keyed by atm and indexed by date, the structure required for the fpp3 decomposition and forecasting workflow. (The April 30 training cutoff is applied in the next section.)

raw <- get_atm_data()

coerce_excel_date <- function(x) {
  if (inherits(x, "Date"))  return(lubridate::as_date(x))
  if (is.numeric(x))        return(lubridate::as_date(x, origin = "1899-12-30"))
  if (is.character(x))      return(lubridate::as_date(lubridate::parse_date_time(x, orders = c("ymd","mdy","dmy"))))
  stop("Unsupported date type for conversion.")
}
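
# Quick illustration of the numeric branch (hypothetical value, my addition):
# Excel's 1900 date system counts days from origin "1899-12-30",
# so serial 39934 maps to the first day of this dataset:
# coerce_excel_date(39934)   # returns as.Date("2009-05-01")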

atm_long <- raw %>%
  transmute(
    atm  = as.character(atm),
    date = coerce_excel_date(date),
    cash_hundreds = as.numeric(cash),
    cash_usd = cash_hundreds * 100
  ) %>%
  filter(!is.na(atm), !is.na(cash_hundreds)) %>%
  arrange(atm, date)

atm_ts <- atm_long %>% as_tsibble(index = date, key = atm)

expected_days <- seq(as.Date("2009-05-01"), as.Date("2010-04-30"), by = "day")

missing_tbl <- atm_long %>%
  group_by(atm) %>%
  tidyr::complete(date = expected_days) %>%   # create the full daily panel
  filter(is.na(cash_usd)) %>%
  arrange(atm, date)

missing_tbl
# With the season-aware path, fill each missing cash_usd by imputing from the local weekly pattern within its ATM series.
# fill_gaps() first reinstates the dropped dates as explicit NA rows; without it, na.interp() would have no rows to impute.
atm_ts <- atm_ts %>%
  tsibble::fill_gaps() %>%
  tsibble::group_by_key() %>%
  dplyr::mutate(
    cash_usd = dplyr::if_else(
      date <= as.Date("2010-04-30"),
      as.numeric(forecast::na.interp(ts(cash_usd, frequency = 7))),
      cash_usd
    )
  ) %>%
  dplyr::ungroup()

Data Visualizations: Time-Series EDA and Decomposition

Overview Time Series

In order to maintain uniformity across all figures, I define a reusable ggplot theme (minimal base, no minor gridlines, bold titles and facet labels) and set the May-2010 forecast horizon (may_start, may_end) for consistent shading across plots.

viz_theme <- theme_minimal(base_size = 12) +
  theme(panel.grid.minor = element_blank(),
        plot.title = element_text(face = "bold"),
        strip.text  = element_text(face = "bold"))

may_start <- as.Date("2010-05-01")
may_end <- as.Date("2010-05-31")

Now I can plot daily withdrawals for each ATM with a consistent theme, overlay a 7-day moving average to reveal structure, and shade May 2010 to cue the forecast horizon.

# 7-day moving average per ATM
overview_df <- atm_ts %>%
  tsibble::group_by_key() %>%
  dplyr::mutate(ma7 = as.numeric(stats::filter(cash_usd, rep(1/7, 7), sides = 2))) %>%
  dplyr::ungroup()

band <- overview_df %>%
  dplyr::distinct(atm) %>%
  dplyr::mutate(xmin = may_start, xmax = may_end, ymin = -Inf, ymax = Inf)

ggplot(overview_df, aes(date, cash_usd)) +
  geom_rect(data = band,
            aes(xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymax),
            inherit.aes = FALSE, fill = "grey85", alpha = 0.4) +
  geom_line(alpha = 0.6) +
  geom_line(aes(y = ma7), linewidth = 0.7, na.rm = TRUE) +  # <- suppress boundary NA warning
  facet_wrap(~ atm, scales = "free_y") +
  scale_y_continuous(labels = scales::label_dollar()) +
  labs(
    title = "ATM Withdrawals Over Time (USD)",
    subtitle = "Grey band: May 2010 forecast horizon, Thin: daily, Thick: 7-day average",
    x = NULL, y = "USD per day"
  ) +
  viz_theme

ATM1 and ATM2 show clear weekly fluctuation with a fairly stable level through April 2010, which suggests the models should capture a strong weekly seasonal pattern. ATM3 and ATM4 contain extreme spikes that compress the scale, so their behavior is harder to read here; the zoomed-in graph below details their results more clearly.

# ---- Zoomed/capped version WITHOUT joins ----
overview_capped <- overview_df %>%
  tibble::as_tibble() %>%          # drop tsibble metadata to avoid reconstruction errors
  dplyr::group_by(atm) %>%
  dplyr::mutate(
    cap99      = stats::quantile(cash_usd, 0.99, na.rm = TRUE),
    y_plot     = pmin(cash_usd, cap99),
    is_clipped = cash_usd > cap99
  ) %>%
  dplyr::ungroup()

cap_lines <- overview_capped %>% dplyr::distinct(atm, cap99)

# Same as before just zoomed in
ggplot(overview_capped, aes(date, y_plot)) +
  geom_rect(data = band,
            aes(xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymax),
            inherit.aes = FALSE, fill = "grey90", alpha = 0.5) +
  geom_hline(data = cap_lines, aes(yintercept = cap99),
             linetype = "dashed", alpha = 0.6) +
  geom_line(alpha = 0.6) +
  geom_line(aes(y = ma7), linewidth = 0.7, na.rm = TRUE) +  # <- here too
  geom_point(data = subset(overview_capped, is_clipped),
             aes(y = cap99), size = 1.2, alpha = 0.7) +
  facet_wrap(~ atm, scales = "free_y") +
  scale_y_continuous(labels = scales::label_dollar()) +
  labs(
    title = "ATM Withdrawals Over Time (USD) (Zoomed for Visibility)",
    subtitle = "Values above the dashed line (per-ATM 99th percentile)\nare clipped for readability where the points at the line indicate clipped days",
    x = NULL, y = "USD per day"
  ) +
  viz_theme +
  theme(plot.subtitle = element_text(margin = margin(b = 6)))

ATM3 stays near zero for most of the year and then spikes sharply in late April, which looks like a one-off event rather than a persistent pattern. ATM4 has a higher, more variable baseline with several large spikes, so its typical level is moderate but occasional extremes occur. These patterns recur in the visualizations that follow.

Day-of-Week Seasonality

For day-of-week seasonality, I create a weekday factor, compute per-ATM medians, highlight the weekday with the highest median, and then plot capped (99th-percentile) boxplots with median markers so ATM3 and ATM4 remain readable.

pre_may <- atm_ts %>% dplyr::filter(date <= as_date("2010-04-30"))
# Add weekday
dow_df <- pre_may %>%
  dplyr::mutate(dow = lubridate::wday(date, label = TRUE))

# Peak weekday (highest median) per ATM
dow_peak <- dow_df %>%
  dplyr::group_by(atm, dow) %>%
  dplyr::summarise(med = median(cash_usd, na.rm = TRUE), .groups = "drop") %>%
  dplyr::group_by(atm) %>%
  dplyr::mutate(is_peak = med == max(med)) %>%
  dplyr::ungroup()

# Zooming in for ATM3 and ATM4
dow_capped <- dow_df %>%
  tibble::as_tibble() %>%
  dplyr::group_by(atm) %>%
  dplyr::mutate(
    cap99      = stats::quantile(cash_usd, 0.99, na.rm = TRUE),
    y_plot     = pmin(cash_usd, cap99),
    is_clipped = cash_usd > cap99
  ) %>%
  dplyr::ungroup()

cap_lines <- dow_capped %>% dplyr::distinct(atm, cap99)

dow_peak_capped <- dow_peak %>%
  tibble::as_tibble() %>%
  dplyr::left_join(cap_lines, by = "atm") %>%
  dplyr::mutate(med_capped = pmin(med, cap99))

ggplot(dow_capped, aes(dow, y_plot)) +
  geom_hline(data = cap_lines, aes(yintercept = cap99),
             linetype = "dashed", alpha = 0.6) +
  geom_boxplot(outlier.alpha = 0.25) +
  stat_summary(fun = median, geom = "point", size = 2) +
  geom_point(data = dplyr::filter(dow_peak_capped, is_peak),
             aes(dow, med_capped), size = 3) +
  facet_wrap(~ atm, scales = "free_y") +
  scale_y_continuous(labels = scales::label_dollar()) +
  labs(
    title = "Day-of-Week Effects by ATM",
    subtitle = "Boxes show distribution; black dots are medians; larger dots mark the weekday with the highest median\nDashed line shows the per-ATM 99th percentile to improve visibility for high-outlier machines",
    x = "Day of week", y = "USD per day"
  ) +
  viz_theme +
  theme(plot.subtitle = element_text(margin = margin(b = 6)))

The result shows clear weekly seasonality for ATM1 and ATM2 with consistent central tendencies across weekdays. ATM3 stays near zero most days with rare surges, while ATM4 has a higher baseline and wider spread with several very large outliers, and the larger dot in each facet marks the peak-median weekday.

Autocorrelation (ACF) Diagnostics

For the ACF diagnostics, I compute and plot the autocorrelation up to 28 days (4 weeks) to diagnose weekly dependence.

acf_ready <- atm_ts %>%
  dplyr::filter(date <= as.Date("2010-04-30")) %>%
  tsibble::group_by_key() %>%
  tsibble::fill_gaps() %>%
  dplyr::mutate(
    cash_usd = as.numeric(forecast::na.interp(ts(cash_usd, frequency = 7)))
  ) %>%
  dplyr::ungroup()

acf_df <- acf_ready %>%
  feasts::ACF(cash_usd, lag_max = 28)

guides <- tibble::tibble(lag = seq(7, 28, by = 7))

feasts::autoplot(acf_df) +
  ggplot2::geom_vline(data = guides, ggplot2::aes(xintercept = lag),
                      linetype = "dashed", alpha = 0.5) +
  ggplot2::labs(
    title = "Autocorrelation of Daily Withdrawals",
    subtitle = "Dashed lines at 7, 14, 21, 28 days highlight weekly seasonality",
    x = "Lag (days)", y = "ACF"
  ) +
  viz_theme

ATM1, ATM2, and ATM4 show clear spikes at lags 7, 14, and 21, confirming a strong weekly seasonal cycle that the models must capture. ATM3 shows weak autocorrelation, consistent with its near-zero baseline and occasional surges.

STL Decomposition (Structure vs Noise)

Here the final visualization regularizes the daily series, applies STL with a weekly period, plots trend/seasonal/remainder components, and computes seasonality and trend strength per ATM.
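
For reference, the strength measures computed below follow the FPP3 definitions, where $T_t$, $S_t$, and $R_t$ are the STL trend, seasonal, and remainder components; values near 1 indicate a strong component and values near 0 indicate it is dominated by noise:

$$F_S = \max\left(0,\ 1 - \frac{\operatorname{Var}(R_t)}{\operatorname{Var}(S_t + R_t)}\right), \qquad F_T = \max\left(0,\ 1 - \frac{\operatorname{Var}(R_t)}{\operatorname{Var}(T_t + R_t)}\right)$$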

# Regularize pre-May to daily
pre_may_stl <- atm_ts %>%
  dplyr::filter(date <= as.Date("2010-04-30")) %>%
  tsibble::group_by_key() %>%
  tsibble::fill_gaps() %>%
  dplyr::mutate(cash_usd = as.numeric(forecast::na.interp(ts(cash_usd, frequency = 7)))) %>%
  dplyr::ungroup()

# STL with explicit weekly seasonality
stl_comp <- pre_may_stl %>%
  model(STL(cash_usd ~ season(period = 7))) %>%
  components()

autoplot(stl_comp) +
  labs(
    title = "STL Decomposition of ATM Withdrawals (pre-May)",
    subtitle = "Trend, weekly seasonality (period = 7), and remainder by ATM",
    x = NULL, y = NULL
  ) +
  theme(legend.position = "none") + 
  viz_theme

sc <- dplyr::as_tibble(stl_comp)
season_col <- grep("^season_", names(sc), value = TRUE)[1]
stl_strength <- sc %>%
  dplyr::group_by(atm) %>%
  dplyr::summarise(
    # Hyndman formula
    seasonal_strength = pmax(0, 1 - stats::var(remainder, na.rm = TRUE) /
                                 stats::var(.data[[season_col]] + remainder, na.rm = TRUE)),
    # Trend strength
    trend_strength = pmax(0, 1 - stats::var(remainder, na.rm = TRUE) /
                                 stats::var(trend + remainder, na.rm = TRUE)),
    .groups = "drop"
  )
stl_strength
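
As a cross-check (my addition, not part of the original workflow), feasts::feat_stl() computes the same trend and seasonal strength measures directly, with the weekly period inferred from the daily index:

# Should broadly agree with the manual Hyndman-formula strengths above
pre_may_stl %>%
  fabletools::features(cash_usd, feasts::feat_stl)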

ATM1 and ATM2 exhibit strong weekly seasonality (seasonal strength about 0.79) with modest trend, consistent with stable weekly cycles. ATM3 has weak seasonality (about 0.23) but a strong trend (0.82) driven by its late-April surge, while ATM4 shows low seasonality (0.25) and weak trend (0.20), indicating a noisier series with occasional large spikes.

Conclusion and Forecasting/Excel Export

According to my findings, from May 2009 to April 2010 the four ATMs behave differently. ATM1 and ATM2 have clear weekly patterns and steady levels, ATM3 is near zero most of the year and then jumps in late April, and ATM4 runs higher and is more variable with a few very large days. The ACF confirms weekly cycles for ATM1, ATM2, and ATM4. STL shows seasonality is strong for ATM1 and ATM2 (about 0.79) and weak for ATM3 and ATM4 (roughly 0.23 to 0.25); ATM3 also shows a strong trend from the late spike. To limit overfitting, I hold out the last 28 pre-May days, pick ETS or ARIMA per ATM against a NAIVE baseline, refit through April 30, and forecast May 2010 with 80% and 95% intervals.

Forecasting/Excel Export

atm_ts_reg <- atm_ts %>%
  tsibble::group_by_key() %>%
  tsibble::fill_gaps() %>%
  dplyr::ungroup()

# --- Validation split (28 days pre-May) ---
cutoff    <- as.Date("2010-04-30")
val_len   <- 28
val_start <- cutoff - lubridate::days(val_len - 1)

train <- atm_ts_reg %>%
  dplyr::filter(date < val_start) %>%             # val_start is the first holdout day, so exclude it from training
  tsibble::group_by_key() %>%
  dplyr::mutate(                                  # fill any training NAs, weekly-aware
    cash_usd = as.numeric(forecast::na.interp(ts(cash_usd, frequency = 7)))
  ) %>%
  dplyr::ungroup()

valid <- atm_ts_reg %>%
  dplyr::filter(date >= val_start, date <= cutoff)   # 28 validation days: Apr 3-30, 2010

# new_data must be a tsibble with same key/index
valid_new <- valid %>%
  dplyr::select(atm, date) %>%
  tsibble::as_tsibble(index = date, key = atm)

# --- Fit candidates & validate ---
fits <- train %>%
  fabletools::model(
    ETS   = fable::ETS(cash_usd),
    ARIMA = fable::ARIMA(cash_usd),
    NAIVE = fable::NAIVE(cash_usd)
  )

fc_valid  <- fabletools::forecast(fits, new_data = valid_new)
acc_valid <- fabletools::accuracy(fc_valid, valid) %>% dplyr::arrange(atm, RMSE)

best_map <- acc_valid %>%
  dplyr::group_by(atm) %>%
  dplyr::slice_min(RMSE, n = 1, with_ties = FALSE) %>%
  dplyr::ungroup() %>%
  dplyr::select(atm, .model)
best_map
pre_may_reg <- atm_ts_reg %>%
  dplyr::filter(date <= cutoff) %>%
  tsibble::group_by_key() %>%
  dplyr::mutate(
    cash_usd = as.numeric(forecast::na.interp(ts(cash_usd, frequency = 7)))
  ) %>%
  dplyr::ungroup()

# 2) Future dates for May 2010 as a tsibble
may_new <- pre_may_reg %>%
  dplyr::distinct(atm) %>%
  tidyr::crossing(
    tibble::tibble(date = seq.Date(as.Date("2010-05-01"),
                                   as.Date("2010-05-31"), by = "day"))
  ) %>%
  tsibble::as_tsibble(index = date, key = atm)

# 3) Final fits and forecasts
fits_final <- pre_may_reg %>%
  fabletools::model(
    ETS   = fable::ETS(cash_usd),
    ARIMA = fable::ARIMA(cash_usd),
    NAIVE = fable::NAIVE(cash_usd)
  )

fc_may_all  <- fabletools::forecast(fits_final, new_data = may_new)
fc_may_best <- fc_may_all %>%
  dplyr::semi_join(best_map, by = c("atm", ".model"))  # <-- preserves fbl_ts

# Add intervals robustly (no `.distribution` references)
fc80 <- fabletools::hilo(fc_may_best, 80)  # adds `80%`
fc95 <- fabletools::hilo(fc_may_best, 95)  # adds `95%`

# Bring the two interval columns together, then unpack to numeric bounds
fc_join <- fc80 %>%
  dplyr::select(atm, date, .model, .mean, `80%`) %>%
  dplyr::left_join(
    fc95 %>% dplyr::select(atm, date, .model, `95%`),
    by = c("atm", "date", ".model")
  )

fc_may_tbl <- fc_join %>%
  fabletools::unpack_hilo(`80%`) %>%   # -> `80%_lower`, `80%_upper`
  dplyr::rename(lo80 = `80%_lower`, hi80 = `80%_upper`) %>%
  fabletools::unpack_hilo(`95%`) %>%   # -> `95%_lower`, `95%_upper`
  dplyr::rename(lo95 = `95%_lower`, hi95 = `95%_upper`) %>%
  dplyr::transmute(
    atm, date,
    point = as.numeric(.mean),
    lo80, hi80, lo95, hi95,
    model = .model
  ) %>%
  dplyr::arrange(atm, date)

fc_may_hundreds <- fc_may_tbl %>%
  dplyr::mutate(
    point_hundreds = point / 100,
    lo80_hundreds  = lo80  / 100, hi80_hundreds = hi80 / 100,
    lo95_hundreds  = lo95  / 100, hi95_hundreds = hi95 / 100
  ) %>%
  dplyr::select(atm, date, point_hundreds, lo80_hundreds, hi80_hundreds,
                lo95_hundreds, hi95_hundreds, model)

# Optional: residual check for chosen models
ljung_tbl <- fits_final %>%
  fabletools::augment() %>%                              # adds .model and .innov
  dplyr::semi_join(best_map, by = c("atm", ".model")) %>%
  fabletools::features(.innov, feasts::ljung_box, lag = 14, dof = 2) %>%
  dplyr::arrange(atm)

# 5) Write the Excel deliverable
writexl::write_xlsx(
  list(
    May2010_Forecast_USD      = fc_may_tbl,
    May2010_Forecast_Hundreds = fc_may_hundreds,
    Validation_Accuracy       = acc_valid,
    Selected_Model            = best_map,
    Residual_LjungBox         = ljung_tbl
  ),
  path = "ATM624_May2010_Forecasts.xlsx"
)

"ATM624_May2010_Forecasts.xlsx"
## [1] "ATM624_May2010_Forecasts.xlsx"

Part B

Part B – Forecasting Power, ResidentialCustomerForecastLoad-624.xlsx

Part B consists of a simple dataset of residential power usage for January 1998 until December 2013. Your assignment is to model these data and produce a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in kilowatt hours; the rest is straightforward. Add this to your existing files above.

Introduction

For Part B, I model the monthly residential electricity use from Jan 1998 to Dec 2013 and produce a 2014 monthly forecast using an fpp3 workflow similar to Part A's. The overall plan is to import and tidy the data into a monthly tsibble, verify regularity and types, and prepare the series for seasonal ETS/ARIMA modeling, with a simple 2013 hold-out before refitting through December 2013 and forecasting 2014.

Data Import

After importing, the preview shows the first six records with three columns: case_sequence (row id), yyyy_mmm (a month label like 1998-Jan), and kwh (monthly consumption in kilowatt hours).

get_power_data <- function(){
  url <- "https://raw.githubusercontent.com/GullitNa/DATA624-Project1/main/ResidentialCustomerForecastLoad-624.xlsx"
  tf <- tempfile(fileext = ".xlsx")
  utils::download.file(url, tf, mode = "wb", quiet = TRUE)
  readxl::read_xlsx(tf) %>% clean_names()
}

power_raw <- get_power_data()
power_raw %>% head()

Data Transformation

To transform the data, I convert the yyyy_mmm label into a proper yearmonth index, coerce kwh to numeric, remove NAs, and sort the data, then build a monthly tsibble for later use. Finally, I insert the one missing month and fill its value by interpolation at a 12-month frequency so the series is complete for modeling.

power_long <- power_raw %>%
  transmute(
    month = tsibble::yearmonth(str_replace(yyyy_mmm, "-", " ")),  # "1998 Jan"
    kwh   = as.numeric(kwh)
  ) %>%
  filter(!is.na(month), !is.na(kwh)) %>%
  arrange(month)
power_ts <- power_long %>% as_tsibble(index = month)

# Add the missing month and fill it
power_ts <- power_ts %>%
  tsibble::fill_gaps() %>%
  dplyr::mutate(kwh = as.numeric(forecast::na.interp(ts(kwh, frequency = 12))))
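
As a quick sanity check (my addition), tsibble::has_gaps() confirms the monthly index is now complete:

# Expect .gaps = FALSE now that the missing month is filled
power_ts %>% tsibble::has_gaps()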

Data Visualizations: Time Series in terms of Power

2013 Validation Band and 12-month MA Monthly Usage

The plot shows monthly kwh from 1998–2013, highlights 2013 in grey as the validation window, and overlays a 12-month moving average to smooth short-term noise. This view is important because it shows trend and seasonality together, which guides model choice (seasonal ETS or ARIMA) and motivates the 2013 hold-out.

val_start <- tsibble::yearmonth("2013 Jan")
val_end   <- tsibble::yearmonth("2013 Dec")

overview_df <- power_ts %>%
  dplyr::mutate(
    month_date = as.Date(month),
    ma12 = as.numeric(stats::filter(kwh, rep(1/12, 12), sides = 2))
  )

band <- tibble::tibble(
  xmin = as.Date(val_start),
  xmax = as.Date(val_end + 1),
  ymin = -Inf, ymax = Inf
)

ggplot2::ggplot(overview_df, ggplot2::aes(month_date, kwh)) +
  ggplot2::geom_rect(data = band,
    ggplot2::aes(xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymax),
    inherit.aes = FALSE, fill = "grey85", alpha = 0.4) +
  ggplot2::geom_line(alpha = 0.7) +
  ggplot2::geom_line(ggplot2::aes(y = ma12), linewidth = 0.8) +
  ggplot2::labs(
    title = "Monthly Residential kWh (1998–2013)",
    subtitle = "Grey band = 2013 (validation) • Thin = monthly values • Thick = 12-month moving average",
    x = NULL, y = "kWh"
  ) +
  viz_theme

The plot shows strong yearly peaks and troughs with an overall upward trend, a sharp one-time dip around 2011, and demand rising into 2013.

Seasonal pattern by month-of-year

Using feasts::gg_subseries(kwh), I draw one panel per month, plotting all years for that month with a horizontal line at the month's mean. The month-of-year view matters here because it shows the seasonal shape and how stable it is from year to year, which guides seasonal model choice.

power_ts %>%
  feasts::gg_subseries(kwh) +
  ggplot2::labs(
    title = "Seasonal Subseries: kWh by Month-of-Year",
    subtitle = "Each panel shows all years for that month; horizontal line is the month’s mean",
    x = "Year", y = "kWh"
  ) +
  viz_theme

Visibly, demand is highest in Jul–Aug and Dec–Jan and lowest around Apr–May and Oct–Nov, with a clear seasonal cycle and a visible outlier around 2011.

Autocorrelation (ACF) Diagnostics

Similarly to Part A's autocorrelation diagnostics, I compute the autocorrelation of monthly kwh out to 48 lags and plot it, with dashed guides at lags 12, 24, and 36. The ACF matters here because it reveals persistence and the strength of annual seasonality, which informs whether to use seasonal ETS or SARIMA with period 12.

acf_df <- power_ts %>% feasts::ACF(kwh, lag_max = 48)

guides <- tibble::tibble(lag = c(12, 24, 36))

autoplot(acf_df) +
  ggplot2::geom_vline(data = guides, ggplot2::aes(xintercept = lag),
                      linetype = "dashed", alpha = 0.5) +
  ggplot2::labs(
    title = "Autocorrelation of Monthly kwh",
    subtitle = "Dashed lines at lags 12, 24, 36 are annual seasonality",
    x = "Lag (months)", y = "ACF"
  ) +
  viz_theme

Large positive spikes at 12, 24, 36 confirm a strong yearly cycle, and the mild negative values near 6, 18, 30 reflect the mid-year opposite season.

STL decomposition

STL decomposition splits monthly kwh into trend, seasonal pattern, and remainder so we can see structure versus noise and report how strong the seasonality and trend are.

stl_comp <- power_ts %>%
  fabletools::model(STL(kwh ~ season(window = "periodic"))) %>%
  fabletools::components()

autoplot(stl_comp) +
  ggplot2::labs(
    title = "STL Decomposition of Monthly khh (1998–2013)",
    subtitle = "Trend, seasonal pattern, and remainder"
  ) +
  viz_theme

# Quantify strengths directly
stl_strength <- stl_comp %>%
  tibble::as_tibble() %>%
  dplyr::summarise(
    seasonal_strength = pmax(0, 1 - stats::var(remainder, na.rm = TRUE) /
                                stats::var(remainder + season_year, na.rm = TRUE)),
    trend_strength = pmax(0, 1 - stats::var(remainder, na.rm = TRUE) /
                                stats::var(remainder + trend,        na.rm = TRUE))
  )
stl_strength

The results show a strong, regular yearly pattern, an upward trend that rises after 2010, and mostly small residuals except for one large negative dip around 2011.

Conclusion

According to my findings, the monthly series for 1998–2013 has one missing month, which I filled. The plots show a clear yearly pattern with highs in summer and winter and lows in spring and fall. There is a modest upward trend that steps up after 2010, plus one big dip around early 2011. STL backs this up: seasonality is strong (0.70) and trend is moderate (0.36). Based on this, I use a seasonal model with period 12 (ETS or SARIMA), pick the best model using 2013 as validation, then refit on all data through 2013 and forecast each month of 2014, as sketched below.
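
The section stops short of the forecasting code itself, so here is a minimal sketch of the stated plan, mirroring Part A's workflow. The object names (power_train, power_valid, fc_2014) are my own, and the candidate model set is an assumption rather than a final specification:

# Hold out 2013 for validation
power_train <- power_ts %>% dplyr::filter(month < tsibble::yearmonth("2013 Jan"))
power_valid <- power_ts %>% dplyr::filter(month >= tsibble::yearmonth("2013 Jan"))

# Fit seasonal candidates plus a seasonal-naive baseline
power_fits <- power_train %>%
  fabletools::model(
    ETS    = fable::ETS(kwh),
    ARIMA  = fable::ARIMA(kwh),
    SNAIVE = fable::SNAIVE(kwh)
  )

# Score the 12-month holdout and keep the lowest-RMSE model
fc_2013   <- fabletools::forecast(power_fits, h = 12)
acc_2013  <- fabletools::accuracy(fc_2013, power_valid) %>% dplyr::arrange(RMSE)
best_name <- acc_2013$.model[1]

# Refit on all data through Dec 2013 and forecast Jan-Dec 2014
fc_2014 <- power_ts %>%
  fabletools::model(
    ETS    = fable::ETS(kwh),
    ARIMA  = fable::ARIMA(kwh),
    SNAIVE = fable::SNAIVE(kwh)
  ) %>%
  dplyr::select(dplyr::all_of(best_name)) %>%
  fabletools::forecast(h = 12)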