Load Required Packages

Load the required libraries

# Load necessary libraries
# Function to check and install packages if not already installed
install_if_missing <- function(packages) {
  missing_packages <- packages[!(packages %in% installed.packages()[, "Package"])]
  if (length(missing_packages)) {
    install.packages(missing_packages)
  }
}

# List of required packages, including fpp3
packages <- c("readxl", "tsibble", "dplyr", "fable", "lubridate", "tseries", "forecast", "zoo", "tidyr", "fpp3", "ggplot2", "gridExtra", "kableExtra", "openxlsx", "imputeTS", "tseries", "purrr")

# Install missing packages
install_if_missing(packages)

# Load all the required libraries
library(fpp3)       # Includes tsibble, fable, and others
## Registered S3 method overwritten by 'tsibble':
##   method               from 
##   as_tibble.grouped_df dplyr
## ── Attaching packages ──────────────────────────────────────────── fpp3 1.0.0 ──
## ✔ tibble      3.2.1     ✔ tsibble     1.1.5
## ✔ dplyr       1.1.4     ✔ tsibbledata 0.4.1
## ✔ tidyr       1.3.1     ✔ feasts      0.3.2
## ✔ lubridate   1.9.3     ✔ fable       0.3.4
## ✔ ggplot2     3.5.1     ✔ fabletools  0.4.2
## ── Conflicts ───────────────────────────────────────────────── fpp3_conflicts ──
## ✖ lubridate::date()    masks base::date()
## ✖ dplyr::filter()      masks stats::filter()
## ✖ tsibble::intersect() masks base::intersect()
## ✖ tsibble::interval()  masks lubridate::interval()
## ✖ dplyr::lag()         masks stats::lag()
## ✖ tsibble::setdiff()   masks base::setdiff()
## ✖ tsibble::union()     masks base::union()
library(tidyr)
library(tseries)
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
library(forecast)
library(zoo)
## 
## Attaching package: 'zoo'
## The following object is masked from 'package:tsibble':
## 
##     index
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
library(gridExtra)
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
library(kableExtra)
## 
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
## 
##     group_rows
library(openxlsx)
library(imputeTS)
## 
## Attaching package: 'imputeTS'
## The following object is masked from 'package:zoo':
## 
##     na.locf
## The following object is masked from 'package:tseries':
## 
##     na.remove
library(readxl)      # Correctly load the readxl package (read_excel is a function inside it)
library(tseries)
library(purrr)

Part A: ATM Projects

Introduction

In Part A of the project, we analyze ATM cash withdrawal data to develop a model for forecasting future cash demands. The dataset includes records from various ATMs, documenting daily withdrawals over a given period.

A key challenge is managing missing values in fields like ATM identifiers and withdrawal amounts. Accurate data imputation is critical to preserve time series patterns, including seasonality and trends. By employing suitable imputation techniques, we aim to transform the raw data into a well-structured time series for forecasting.

This analysis aims to provide insights into withdrawal behaviors, ensuring ATMs are adequately stocked, enhancing operational efficiency, and optimizing cash replenishment schedules.

Load and Review the Data

A review of the data below shows that the dataset consists of 1,474 rows and 3 columns. Here is a brief description of the columns:

  • DATE: This column contains the dates (in datetime format) for the ATM transactions. There are no missing values in this column.
  • ATM: This column specifies the ATM identifier (e.g., “ATM1”, “ATM2”). There are 14 missing values in this column, as it has 1,460 non-null entries.
  • Cash: This column records the cash withdrawn from the ATMs (in float64 format). There are 19 missing values in this column, as it has 1,455 non-null entries.

The dataset spans daily transactions, and missing data may need to be addressed before analysis, particularly for forecasting purposes.

# File path
file_path <- "ATM624Data.xlsx"

# Load the file using read_excel
atm_data <- read_excel(file_path, col_types = c("numeric", "text", "numeric"))


atm_data <- atm_data %>%
  mutate(DATE = as_date(DATE, origin = "1899-12-30")) |>  # Convert numeric date
  rename(atm = ATM, date = DATE, cash = Cash) |>  # Rename the 'ATM' column to 'name'
  as_tsibble(index = date, key = atm) |>  # Convert to tsibble format
  arrange(date)

summary(atm_data) |>
  kable(caption = "ATM Data Summary") |>
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
ATM Data Summary
date atm cash
Min. :2009-05-01 Length:1474 Min. : 0.0
1st Qu.:2009-08-01 Class :character 1st Qu.: 0.5
Median :2009-11-01 Mode :character Median : 73.0
Mean :2009-10-31 NA Mean : 155.6
3rd Qu.:2010-02-01 NA 3rd Qu.: 114.0
Max. :2010-05-14 NA Max. :10919.8
NA NA NA’s :19

Identify any missing values

Identify and quantify the presence of missing values within the dataset, focusing specifically on the cash column and restricting the analysis to data recorded on or before April 30, 2010 which is the end of observation. This helps in understanding the scope and distribution of missing data within the given timeframe.

The output below shows a summary of missing values in the dataset, excluding dates after April 30, 2010. Specifically: - There are no missing values in the date and atm columns. - There are 5 missing values in the cash column, indicating a small portion of the withdrawal amounts are missing within the filtered date range.

library(dplyr)
library(knitr)
library(kableExtra)

# Filter data for entries before May 2010
atm_data <- atm_data |>
  filter(date <= as.Date("2010-04-30"))  # Exclude dates after April 30, 2010

# Check for missing values in all columns of the filtered data
missing_values <- sapply(atm_data, function(x) sum(is.na(x)))

# Convert missing values summary to a data frame
missing_values_df <- data.frame(Column = names(missing_values), Missing_Values = missing_values)

# Display the missing values using kable
missing_values_df |>
  kable(caption = "Summary of Missing Values (Excluding Dates After April 30, 2010)") |>
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Summary of Missing Values (Excluding Dates After April 30, 2010)
Column Missing_Values
date date 0
atm atm 0
cash cash 5

Check for patterns in missing values to inform imputation method

The output below provides a summary of missing cash values by specific dates, day of the week, and ATM location, focusing on data prior to May 2010. Each row represents a missing data point for a specific ATM on a particular date. This detailed breakdown helps in identifying potential patterns in the missing data, such as whether certain ATMs or days of the week are more prone to missing values. Understanding these patterns is crucial for deciding the most appropriate method for imputing the missing data, such as time-based or location-based imputation strategies.

From the output, we can observe the following potential patterns in the missing data:

  • ATM-Specific Patterns: The missing data is distributed across two ATMs, ATM1 and ATM2. The occurrences of missing values alternate between these two ATMs, with both having multiple instances of missing cash values.

  • Temporal Patterns: The missing values are spread across different days of the week, with no immediate concentration on weekends or weekdays. The dates with missing values range from June 13, 2009 (Saturday) to June 24, 2009 (Wednesday), indicating that missing data is not confined to a particular type of day, such as weekends or weekdays.

These observations suggest that the missing data is likely distributed across both ATMs and dates without a strong temporal or ATM-specific bias.

# Filter rows where there are missing values in the 'cash' column and dates are before May 2010
missing_data_summary <- atm_data |>
  as_tibble() |>
  filter(is.na(cash) & date <= as.Date("2010-04-30")) |>  # Focus on rows with missing cash values and date <= April 30, 2010
  mutate(day_of_week = wday(date, label = TRUE)) |>       # Add a day of the week column
  group_by(date, day_of_week, atm) |>                     # Group by date, day of week, and ATM
  summarise(missing_count = sum(is.na(cash)), .groups = "drop") |>  # Count missing values and drop grouping after summarizing
  arrange(date)                                            # Arrange by date

# Display the results
missing_data_summary |>
  kable(caption = "Summary of Missing Values by Date and ATM (Excluding May 2010)") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Summary of Missing Values by Date and ATM (Excluding May 2010)
date day_of_week atm missing_count
2009-06-13 Sat ATM1 1
2009-06-16 Tue ATM1 1
2009-06-18 Thu ATM2 1
2009-06-22 Mon ATM1 1
2009-06-24 Wed ATM2 1

Impute missing values for each ATM based on the mean for the same day of the same month

The missing values may be due to system downtimes or maintenance, and no transactions were recorded because the ATMs were genuinely not operational. In this scenario, filling missing values with 0 could be a valid approach if the missing data is genuinely the result of downtime or periods where no activity occurred. However, it’s important to note that this can potentially distort the time series data and lead to inaccurate forecasting. An approach which would better preserve the patterns and integrity of the data for forecasting purposes will be used for missing data imputation.

The custom imputation method below fills missing values in the ATM withdrawal data based on the average withdrawals for the same weekday within the same month. By grouping the data by ATM, year-month, and weekday (e.g., all Tuesdays in May), missing values are replaced with the average cash withdrawals for that specific day across the same month.

Why We Use This Method:

  • Preserves Temporal Patterns: This approach captures both weekly and monthly trends, ensuring that the imputation respects seasonal withdrawal behavior.
  • Localized Imputation: The method focuses on each ATM, month, and weekday, providing more realistic imputations by reflecting actual patterns during that specific time.
  • Supports Accurate Forecasting: By maintaining the structure of the time series, this method enhances the accuracy of any future forecasting, ensuring that the missing data is imputed in context.

This method is ideal for datasets with distinct weekly or monthly patterns, ensuring that imputed values align with expected trends.

library(lubridate)
library(dplyr)

# Group by ATM, year-month, and weekday, then impute missing values for the same weekday in the same month
atm_data <- atm_data %>%
  group_by_key() %>%
  mutate(
    # Extract year-month and weekday information
    year_month = floor_date(date, "month"),
    weekday = wday(date, label = TRUE),
    
    # Impute missing cash values based on the same weekday in the same month
    cash = ifelse(
      is.na(cash),
      ave(cash, year_month, weekday, FUN = function(x) mean(x, na.rm = TRUE)),
      cash
    )
  ) %>%
  ungroup()  # Ungroup the data

# Check for missing values after imputation and get a basic summary of the data
missing_imputed_values <- colSums(is.na(atm_data))

# Display the results
missing_imputed_values %>%
  kable(caption = "Summary of Missing Values by Date and ATM (Excluding May 2010)") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Summary of Missing Values by Date and ATM (Excluding May 2010)
x
date 0
atm 0
cash 0
year_month 0
weekday 0
atm1_data <- atm_data |>
  filter(atm == "ATM1")

atm2_data <- atm_data |>
  filter(atm == "ATM2")

atm3_data <- atm_data |>
  filter(atm == "ATM3")

atm4_data <- atm_data |>
  filter(atm == "ATM4")

Summarize the data and plot the distribution to understand its structure, contents and identify anomalies

The summary statistics and box plots for each ATM reveals the following:

  • ATM1:
    • The cash withdrawals at ATM1 show a balanced distribution, with a median withdrawal of 91.00. The majority of withdrawals fall between 73.00 and 108.00, and the largest withdrawal recorded is 180.00. There are a few outliers above the interquartile range, suggesting some larger-than-usual transactions.
  • ATM2:
    • Withdrawals from ATM2 have a slightly lower median of 66.00, with most withdrawals occurring between 25.00 and 93.00. The maximum recorded withdrawal is 147.00. The distribution is somewhat similar to ATM1, but with generally smaller withdrawal amounts. There are also a few outliers indicating higher than typical withdrawals.
  • ATM3:
    • ATM3 appears to be underutilized, with a median of 0.00 and a significant number of transactions showing zero withdrawals. The few recorded withdrawals are very small, with the largest being only 96.00. The distribution indicates that ATM3 sees minimal activity, with outliers reflecting withdrawals only at the end of the observation period. ATM3 was either out of service for a prolonged time period or is newly installed.
  • ATM4:
    • ATM4 shows the most significant cash withdrawal activity, with a median of 403.84 and a much wider range of withdrawals. The bulk of withdrawals fall between 124.33 and 704.51, and the maximum recorded withdrawal is an exceptionally high 10,919.76. The unually extreme value of 10,919.76 is not typical of ATM withdrawai. It is either a fraudulent withdrawal or a malfunction. This values will be excluded from ATM4 for forecasting purposes. Without the extremely high outlier, the distribution for ATM4 appears more balanced. This suggests that ATM4 likely serves a high-demand location, potentially with frequent and sizable withdrawals.
# Step 1: Convert the tsibble to a tibble
atm_summary_data <- as_tibble(atm_data)  # Convert tsibble to tibble

# Step 2: Exclude the DATE column and group by ATM
atm_summary <- atm_summary_data |>
  select(-date) |>  # Remove the DATE column
  group_by(atm) |>  # Group by ATM column
  summarise(
    min = min(cash),
    `25%` = quantile(cash, 0.25),
    median = median(cash),
    mean = mean(cash),
    `75%` = quantile(cash, 0.75),
    max = max(cash)
  )

# Step 3: Display the atm_summary data frame using kable for better presentation
atm_summary |>
  kable(caption = "Summary of Imputed ATM Data") |>
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Summary of Imputed ATM Data
atm min 25% median mean 75% max
ATM1 1.00000 73.0000 91.0000 84.0675799 108.000 180.00
ATM2 0.00000 25.0000 66.0000 62.3442922 93.000 147.00
ATM3 0.00000 0.0000 0.0000 0.7205479 0.000 96.00
ATM4 1.56326 124.3344 403.8393 474.0433449 704.507 10919.76
# Load required library
library(grid)
library(gridExtra)

# Create separate box plots for each ATM
p1 <- atm1_data |>
  ggplot(aes(x = atm, y = cash)) +
  geom_boxplot() +
  labs(y = "Cash Withdrawals") 


p2 <- atm2_data |>
  ggplot(aes(x = atm, y = cash)) +
  geom_boxplot() 

p3 <- atm3_data |>
  ggplot(aes(x = atm, y = cash)) +
  geom_boxplot() 

# Plot for ATM4 with outlier labeled
p4 <- atm4_data |>
  ggplot(aes(x = atm, y = cash)) +
  geom_boxplot(outlier.colour = "red", outlier.shape = 16, outlier.size = 3) +
  
  # Label the specific outlier (10,919.76) - filter the atm_data explicitly
  geom_text(
    data = atm_data |>
      filter(atm == "ATM4", cash > 10919),  # Explicitly filter atm_data
    aes(label = cash),
    vjust = -0.5, color = "blue", size = 3
  ) 

# Plot ATM4 again without the max value
p4_filtered <- atm4_data |>
  filter(cash != max(cash, na.rm = TRUE)) |>
  ggplot(aes(x = atm, y = cash)) +
  geom_boxplot(outlier.colour = "red", outlier.shape = 16, outlier.size = 3)

# Combine the plots using gridExtra with 3 columns and add a title
grid.arrange(
  p1, p2, p3, p4, p4_filtered, 
  ncol = 3, 
  top = textGrob("Distribution of ATM Withdrawals", gp = gpar(fontsize = 12), just = "left", x = 0)
)

# Replace the maximum Cash value for ATM4 with the average for the same day of the month
atm4_data <- atm4_data %>%
  group_by(atm) %>%
  mutate(cash = ifelse(cash == max(cash[atm == "ATM4"], na.rm = TRUE),
    ave(cash, day(date), FUN = function(x) mean(x, na.rm = TRUE)),
    cash
  )) %>%
  ungroup()

ATM1

Plot the Data

The plot shows the cash withdrawals over time for ATM1. There is a noticeable fluctuation in the cash withdrawals, with periodic peaks and troughs indicating regular variations in activity. The fluctuations suggest possible seasonality or cyclical patterns in the cash withdrawal behavior over the observed period from mid-2009 to early 2010.

atm1_data |>
  ggplot(aes(x = date, y = cash)) +
  geom_line() +
  labs(title = "Cash Withdrawals Over Time - ATM1", x = "Date", y = "Cash Withdrawn") 

Transform the data using Box-Cox transformation to stabilize the variance

Box-Cox Transformed ATM1: The Box-Cox transformed plot for ATM1 shows stabilized variance across time. However, clear weekly seasonal patterns and minor trends still persist, indicating that further differencing (e.g., seasonal differencing) will be necessary for effective time series modeling. SARIMA models with appropriate seasonal components would be suitable for capturing these patterns.

# Step 1: Shift cash values by adding a small constant to handle zero values
atm1_data <- atm1_data |> 
  mutate(cash_adjusted = cash + 1)  # Adding 1 to avoid zero or negative values

# Step 2: Calculate the Box-Cox transformation using the adjusted cash values
box_cox_lambda <- atm1_data |> 
  BoxCox.lambda(cash_adjusted)  # Calculate lambda for Box-Cox

# Apply transformations and differences
atm1_data <- atm1_data |> 
  mutate(
    box_cox_cash = BoxCox(cash_adjusted, lambda = box_cox_lambda),  # Apply Box-Cox transformation
  )

# Step 3: Time Series Plot for Box-Cox Transformed Data
atm1_data |>
  ggplot(aes(x = date, y = box_cox_cash)) +
  geom_line() +
  labs(title = "Box-Cox Transformed ATM1",
       x = "Time", y = "Box-Cox Transformed ATM1")

Decompose the Time Series to reveal any seasonality and trend

The STL decomposition of ATM1 (Box-Cox transformed withdrawals) reveals the following components:

  1. Trend: The trend shows moderate fluctuations over time, with an increase towards the middle of the period and a slight decline near the end. This indicates some variability in the overall withdrawal levels but no strong, sustained trend.
  2. Seasonality (Weekly): The seasonality component exhibits a clear weekly pattern, suggesting consistent withdrawal behavior that aligns with a weekly cycle.
  3. Remainder: The remainder captures irregularities and noise in the data. There is increased variability in the remainder component towards early 2010, indicating greater unpredictability in withdrawals during this period.

Overall, this decomposition confirms the presence of a stable weekly pattern and some trend fluctuations, while the remainder highlights increased randomness towards the end of the observed period.

# STL Decomposition for ATM1
atm1_data |>
  model(
    stl = STL(box_cox_cash ~ trend(window = 7) + season(window = "periodic"), robust = TRUE)
  ) |>
  components() |>
  autoplot() +
  labs(title = "STL Decomposition - ATM1 (Box-Cox Transformed Withdrawals)")

ACF Plot to Identify Autocorrelation

The ACF plot for Box-Cox transformed ATM1 withdrawals shows strong spikes at lags 7, 14, and 21, confirming a weekly seasonality as observed in the seasonal component of the STL decomposition. This ACF pattern reinforces the presence of a clear weekly cycle in the data. These observations indicate that any forecasting model should account for weekly seasonality to effectively capture the periodic withdrawal behavior.

# Generate ACF Plot for ATM1
atm1_data |>
  ACF(box_cox_cash) |>
  autoplot() +
  labs(title = "ACF Plot for Box-Cox Transformed ATM1 Cash Withdrawals ")

ATM1 Difference the series to make the series stationary

Perform differencing to identify the differncing parameters d and D for manually specifying ARIMA and SARIMA models.

Seasonal Differencing to Determine Seasonal Parameter, D

The Lag 7 differencing for ATM1 data removes significant seasonality, but the ACF plot still shows notable autocorrelation at lag 7. This suggests that while weekly seasonality is reduced, there may still be some recurring weekly patterns.

# Apply differencing
atm1_data <- atm1_data |> 
  mutate(
    diff_lag7 = difference(box_cox_cash, lag = 7)        # Lag 7 Seasonal differencing 
  )

# Time Series Plot for Lag 1 non-seasonally Differenced Data
plot_ts <- atm1_data |>
  drop_na(diff_lag7) |>
  autoplot(diff_lag7) + 
  labs(title = "Lag 7 Seasonally Differenced ATM1",
       x = "Time", y = "Lag 7 Differenced ATM1")

# Generate ACF Plot for ATM1
plot_acf <- atm1_data |>
  ACF(diff_lag7) |>
  autoplot() +
  labs(title = "ACF Plot for Lag 7 Differenced ATM1 Cash Withdrawals ")

# Combine the plots into a grid
grid.arrange(plot_ts, plot_acf, ncol = 1)

Non-seasonal Differencing to Determine Non-seasonal Parameter, d

With Lag 1 on Lag 7 differencing, the data is further stabilized, as seen in the smoother time series and reduced autocorrelation in the ACF plot. However, the Ljung-Box test results in a high \(lb\_stat\) and p-value of 0, indicating that residual autocorrelation remains. This supports the need for a SARIMA model that can accommodate both non-seasonal and seasonal differencing components to effectively model the data’s underlying structure.

# Apply differencing
atm1_data <- atm1_data |> 
  mutate(
    diff_lag7_lag1 = difference(diff_lag7, lag = 1)        # First-order differencing after seasonal differencing
  )

# Time Series Plot for Lag 1 non-seasonally Differenced Data
plot_ts <- atm1_data |>
  drop_na(diff_lag7_lag1) |>
  autoplot(diff_lag7_lag1) + 
  labs(title = "Lag 1 on Lag 7 Differenced ATM1",
       x = "Time", y = "Lag 1 on Lag 7 Differenced ATM1")

# Generate ACF Plot for ATM1
plot_acf <- atm1_data |>
  ACF(diff_lag7_lag1) |>
  autoplot() +
  labs(title = "ACF Plot for Lag 1 on Lag 7 Differenced ATM1 Cash Withdrawals ")

# Combine the plots into a grid
grid.arrange(plot_ts, plot_acf, ncol = 1)

#  Apply Ljung-Box test to the diff_lag1 series for ATM1
ljung_box_results <- atm1_data |>
  drop_na(diff_lag7_lag1) |>     # Drop NA values before performing Ljung-Box test
  features(diff_lag7_lag1, ljung_box, lag = 10)  # Apply Ljung-Box test with lag = 10
  
# Display the Ljung-Box test results
ljung_box_results |>
  kable(caption = "ATM1 Ljung Test Result for Lag 1 on Lag 7 Differencing") |>
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
ATM1 Ljung Test Result for Lag 1 on Lag 7 Differencing
atm lb_stat lb_pvalue
ATM1 143.6979 0

Define Models

Examine the ACF/PACF plots

We will examine the ACF (Autocorrelation Function) and PACF (Partial Autocorrelation Function) plots for selecting appropriate model parameters and understanding the structure of the time series.

Based on the ACF and PACF plots for the Box-Cox transformed cash withdrawals (without differencing), the following models are recommended, including differencing to achieve stationarity:

  1. ARIMA Models:
    • ARIMA(1,1,0): Significant PACF spike at lag 1 suggests AR(1) component.
    • ARIMA(0,1,1): Significant ACF spike at lag 1 indicates MA(1) component.
    • Both use first differencing (\(d=1\)) for stationarity.
  2. SARIMA Models:
    • Weekly seasonality in ACF spikes (lag 7) justifies seasonal differencing (\(D=1\)) and seasonal MA(1) components.
    • SARIMA(1,1,0)(0,1,1)[7]: AR(1) for short-term dependency.
    • SARIMA(0,1,1)(0,1,1)[7]: MA(1) for short-term dependency.
    • SARIMA(1,1,1)(0,1,1)[7]: Combines AR(1) and MA(1) for flexibility.
  3. ETS Models:
    • Weekly seasonality requires Multiplicative seasonality (M).
    • ETS(M, A, M): Handles variability in error and trend.
    • ETS(A, A, M): Suitable if linear trend fits well.
    • ETS(M, Ad, M): Damped trend option for leveling effects over time.

These choices capture trend and weekly seasonality effectively.

# Plot ACF/PACF
atm1_data |>
  gg_tsdisplay(box_cox_cash, plot_type = 'partial', lag_max = 21)

Train the Models and Select the Best Model

Split data into train and test sets

Split the data and leave out the last 14 days (up to April 30, 2010) for ATM1 to create training and test sets.

# Define the end date for the training set (14 days before the last date, April 16, 2010)
train_end_date <- as.Date("2010-04-16")

# Split ATM1 data into training and test sets
atm1_train <- atm1_data |> filter(date <= train_end_date)
atm1_test <- atm1_data |> filter(date > train_end_date)

Train the Models

The models are trained on the Box-Cox transformed ATM cash withdrawals data to capture trend, seasonality, and short-term dependencies. Multiple ARIMA, SARIMA, and ETS models are fitted, incorporating non-seasonal differencing (\(d=1\)) and seasonal differencing (\(D=1\)) to address stationarity and weekly seasonality.

Based on the AICc and BIC values in the table below, the best model is the stepwise ARIMA model with parameters ARIMA(0,0,1)(0,1,2)[7], as it has the lowest AICc (3169.255) and BIC (3184.499) values among all models listed.

This suggests that the stepwise-selected ARIMA model performs better in terms of information criteria and is the most suitable choice for this data.

# Load necessary libraries
library(dplyr)
library(fable)
library(tsibble)
library(ggplot2)
library(tidyr)
library(kableExtra)

# Fit ARIMA, SARIMA, and ETS models to the Box-Cox transformed ATM data
atm1_fit <- atm1_train |>
  model(
    arima110 = ARIMA(box_cox_cash ~ pdq(1,1,0)),                 # ARIMA(1,1,0) - AR(1) for short-term dependency
    arima011 = ARIMA(box_cox_cash ~ pdq(0,1,1)),                 # ARIMA(0,1,1) - MA(1) for short-term dependency
    sarima110011 = ARIMA(box_cox_cash ~ pdq(1,1,0) + PDQ(0,1,1, period = 7)),  # SARIMA(1,1,0)(0,1,1)[7] - AR(1) with seasonal MA(1)
    sarima011011 = ARIMA(box_cox_cash ~ pdq(0,1,1) + PDQ(0,1,1, period = 7)),  # SARIMA(0,1,1)(0,1,1)[7] - MA(1) with seasonal MA(1)
    sarima111011 = ARIMA(box_cox_cash ~ pdq(1,1,1) + PDQ(0,1,1, period = 7)),  # SARIMA(1,1,1)(0,1,1)[7] - AR(1) and MA(1) with seasonal MA(1)
    ets_mam = ETS(box_cox_cash ~ error("M") + trend("A") + season("M")),       # ETS(M, A, M) - Multiplicative seasonality with additive trend
    ets_aam = ETS(box_cox_cash ~ error("A") + trend("A") + season("M")),       # ETS(A, A, M) - Additive trend and error, multiplicative seasonality
    ets_madm = ETS(box_cox_cash ~ error("M") + trend("Ad") + season("M")),     # ETS(M, Ad, M) - Damped trend with multiplicative seasonality
    stepwise = ARIMA(box_cox_cash),                           # Automatic ARIMA using stepwise search
    search = ARIMA(box_cox_cash, stepwise = FALSE)            # Automatic ARIMA using exhaustive search
  )

# Extract the model summary with AICc and BIC values
model_summary <- glance(atm1_fit) |>
  arrange(AICc) |>
  select(.model, AIC, AICc, BIC, MSE, AMSE)

# Extract the ARIMA orders (p,d,q) for each model
model_orders <- atm1_fit |>
  pivot_longer(cols = -atm, names_to = "Model name", values_to = "Orders") |>
  select(`Model name`, Orders)

# Combine the AICc, BIC and ARIMA orders into a single data frame
combined_results <- model_summary |> 
  left_join(model_orders, by = c(".model" = "Model name"))

# Display the combined data frame using kable
combined_results |>
  kable(caption = "Ranked ATM1 Models based on AICc") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Ranked ATM1 Models based on AICc
.model AIC AICc BIC MSE AMSE Orders
stepwise 3169.137 3169.255 3184.499 NA NA <ARIMA(0,0,1)(0,1,2)[7]>
search 3169.137 3169.255 3184.499 NA NA <ARIMA(0,0,1)(0,1,2)[7]>
sarima111011 3173.805 3173.923 3189.156 NA NA <ARIMA(1,1,1)(0,1,1)[7]>
arima011 3179.820 3179.938 3195.171 NA NA <ARIMA(0,1,1)(0,1,2)[7]>
sarima011011 3181.156 3181.226 3192.669 NA NA <ARIMA(0,1,1)(0,1,1)[7]>
arima110 3293.020 3293.138 3308.371 NA NA <ARIMA(1,1,0)(0,1,2)[7]>
sarima110011 3295.772 3295.843 3307.285 NA NA <ARIMA(1,1,0)(0,1,1)[7]>
ets_aam 4316.458 4317.381 4362.787 583.1468 586.3747 <ETS(A,A,M)>
ets_mam 4396.864 4397.787 4443.194 702.3972 713.0573 <ETS(M,A,M)>
ets_madm 4397.976 4399.056 4448.166 710.7313 724.7224 <ETS(M,Ad,M)>

Evaluate model performance

The CRPS scores for ATM1’s 14-day forecasts indicate that ARIMA(0,1,1) performs best with the lowest CRPS value (7.53), closely followed by the Search (7.61) and Stepwise (7.61) models. The SARIMA(1,1,1)(0,1,1)[7] and SARIMA(0,1,1)(0,1,1)[7] models also show competitive accuracy with CRPS values around 7.64 and 7.69, respectively. These models provide better short-term forecast accuracy compared to others, with higher CRPS scores for ETS models and additional SARIMA configurations indicating less accurate predictions. This CRPS analysis supports prioritizing ARIMA(0,1,1), Search, and Stepwise models for short-term forecasts.

# Load necessary libraries for forecasting accuracy
library(fabletools)
library(dplyr)
library(kableExtra)

# Generate 14-day forecasts for all models
atm1_forecasts <- atm1_fit |>
  forecast(h = "14 days")

# Calculate accuracy metrics, including CRPS, for the 14-day forecast and sort by CRPS
atm1_forecast_accuracy <- atm1_forecasts |>
  accuracy(atm1_test, list(crps = CRPS)) |>
  arrange(crps)  # Sort from most to least accurate (lowest to highest CRPS)

# Display sorted accuracy results for each model
atm1_forecast_accuracy |>
  kable(caption = "CRPS Scores for 14-Day ATM Withdrawal Forecasts (Sorted by Accuracy)") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
CRPS Scores for 14-Day ATM Withdrawal Forecasts (Sorted by Accuracy)
.model atm .type crps
arima011 ATM1 Test 7.530348
search ATM1 Test 7.612832
stepwise ATM1 Test 7.612832
sarima111011 ATM1 Test 7.644105
sarima011011 ATM1 Test 7.686854
ets_aam ATM1 Test 7.695302
ets_mam ATM1 Test 9.055706
ets_madm ATM1 Test 9.273915
sarima110011 ATM1 Test 16.265857
arima110 ATM1 Test 16.319732

Residual Analysis of ATM1 Forecast

Based on both the CRPS scores and AICc values shown in the tables, the following four models are selected for residual analysis:

  1. ARIMA(0,1,1): This model has the lowest CRPS score, making it highly accurate for short-term forecasts.
  2. Stepwise ARIMA (ARIMA(0,0,1)(0,1,2)[7]): It shares the lowest AICc with the search model and performs well in terms of CRPS, ensuring both model fit and forecast accuracy.
  3. Search ARIMA (ARIMA(0,0,1)(0,1,2)[7]): With low AICc and competitive CRPS, this model is robust in fit and accuracy.
  4. SARIMA(1,1,1)(0,1,1)[7]: This model balances AICc and CRPS well, offering additional seasonality handling, which may improve longer-term accuracy.

Based on the residual analysis, the Stepwise ARIMA and Search ARIMA models show the most promising results, with minimal residual bias and low autocorrelation, indicating a good fit to the underlying data patterns. ARIMA(0,1,1) also performs adequately but exhibits slightly higher autocorrelation in the residuals, suggesting it may not capture short-term dependencies as effectively. SARIMA(1,1,1)(0,1,1)[7] manages seasonality well but shows a bit more spread in residuals, indicating slightly higher variance. Overall, Stepwise ARIMA and Search ARIMA models demonstrate the most stable and well-behaved residuals, making them preferable for forecasting.

Analyzing these models’ residuals helps determine the best model for generating reliable future forecasts by assessing residual patterns, autocorrelation, and normality.

atm1_fit |>
  select(stepwise) |>
  gg_tsresiduals() +
  labs(title = "Stepwise Residuals")

atm1_fit |>
  select(search) |>
  gg_tsresiduals() +
  labs(title = "Search Model Residuals")

atm1_fit |>
  select(arima011) |>
  gg_tsresiduals() +
  labs(title = "arima011 Model Residuals")

atm1_fit |>
  select(sarima111011) |>
  gg_tsresiduals() +
  labs(title = "sarima111011 Model Residuals")

Model Selection

Based on the evaluation of models using CRPS scores, AICc/BIC information criteria and Residual Analysis:

  1. Model Comparison by Information Criteria:
    • The stepwise ARIMA model has the lowest AICc (3169.255) and BIC (3184.499) among all models, making it the best model based on information criteria.
    • The search model is close in AICc, with slightly higher values, indicating competitive performance in terms of model complexity.
  2. Model Comparison by Forecast Accuracy (CRPS):
    • The ARIMA(0,1,1) model (arima011) performs best on CRPS (7.53), indicating superior forecast accuracy for the test set. The search model is the next best with a CRPS of 7.61, demonstrating consistent accuracy across time.
  3. Residual Analysis:
    • The residual plots for all models show a reasonably consistent spread around zero, with no major patterns in the ACF plot, indicating that all models handle the autocorrelations adequately.
    • Stepwise ARIMA has a clean residual distribution but shows a few large residuals, similar to the search model and ARIMA(0,1,1).

Best Model Recommendation

Considering both information criteria (AICc and BIC) and forecast accuracy (CRPS), the stepwise ARIMA model is recommended due to its balance between model complexity and accuracy. However, the ARIMA(0,1,1) model (arima011) could also be preferred if slightly better short-term forecast accuracy is a priority.

Generate Forecast for May 2010

The stepwise ARIMA model is used for a month-long forecast to maintain reliability and stability. If short-term accuracy is a priority (e.g., for daily or weekly forecasts), the ARIMA(0,1,1) model could also be considered as a secondary choice.

# Load necessary libraries
library(tsibble)
library(fable)
library(dplyr)
library(tidyr)
library(kableExtra)

# Fit stepwise model to the Box-Cox transformed ATM1 data to generate forecast for May 2010
atm1_forecast_fit <- atm1_data |>
  model(
    stepwise = ARIMA(box_cox_cash)  # Automatic ARIMA using stepwise search
  )

# Generate 31-day forecast for the stepwise model
atm1_may_forecasts <- atm1_forecast_fit |>
  forecast(h = "31 days")

# Convert atm1_may_forecasts to a regular data frame to avoid tsibble constraints
forecast_means <- as_tibble(atm1_may_forecasts) |>
  mutate(
    forecast = round(.mean, 1)  # Format as rounded mean values
  ) |>
  select(.model, date, forecast) |>
  pivot_wider(names_from = .model, values_from = forecast)

# Display the forecast table with formatted means only
forecast_means |>
  head() |>
  kable(caption = "31-Day Forecasts for ATM1 (Stepwise Model)") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
31-Day Forecasts for ATM1 (Stepwise Model)
date stepwise
2010-05-01 87.2
2010-05-02 104.1
2010-05-03 73.1
2010-05-04 8.1
2010-05-05 100.0
2010-05-06 80.9
# Export the data to CSV
write.csv(forecast_means, "atm1_31day_Forecast.csv", row.names = FALSE)

Visual Forecast Comparison for ATM1 Withdrawals

The 31-day forecast for ATM1 cash withdrawals, generated by the stepwise ARIMA model, demonstrates a clear seasonal pattern with daily fluctuations. The zoomed-in forecast view highlights the 80% and 95% confidence intervals, which encapsulate potential variability in the cash withdrawals, reflecting the model’s anticipated uncertainty. The forecast suggests a consistent cyclical trend, with withdrawals expected to vary within a relatively stable range. The intervals indicate that the model is confident in predicting the general range of withdrawals while allowing for typical day-to-day variations.

# Load necessary libraries
library(ggplot2)
library(fable)
library(tsibble)
library(gridExtra)

# Fit the best model (stepwise) to the Box-Cox transformed data and generate a 31-day forecast
atm1_forecast_fit <- atm1_data |>
  model(stepwise = ARIMA(box_cox_cash))

# Generate the forecast for 31 days
atm1_may_forecast <- atm1_forecast_fit |>
  forecast(h = "31 days")

# Extract the forecast period date range
forecast_start <- min(atm1_may_forecast$date)
forecast_end <- max(atm1_may_forecast$date)

# Full view plot including the training and test data
plot_full <- atm1_may_forecast |>
  autoplot(atm1_data, level = c(80, 95)) +  # Include confidence intervals
  xlab("Date") + ylab("ATM Withdrawals") +
  ggtitle("31-Day Forecast for ATM1 - Stepwise Model (Full View)") +
  theme_minimal()

# Zoomed-in view on the forecast period only
plot_zoom <- atm1_may_forecast |>
  autoplot(atm1_data, level = c(80, 95)) +
  xlab("Date") + ylab("ATM Withdrawals") +
  ggtitle("31-Day Forecast for ATM1 - Stepwise Model (Forecast Period Only)") +
  theme_minimal() +
  coord_cartesian(xlim = c(forecast_start, forecast_end))  # Set x-axis to forecast range only

# Arrange the plots vertically
grid.arrange(plot_full, plot_zoom, ncol = 1)

ATM2

Plot the Data

The plot shows cash withdrawals over time for ATM1, with significant fluctuations throughout the observed period from mid-2009 to early 2010. There is a noticeable high-frequency pattern, suggesting regular peaks and troughs, which may indicate daily or weekly withdrawal patterns. However, there is no clear upward or downward trend, suggesting that cash withdrawals remained relatively stable on average, with consistent variability around a central level over time.

atm2_data |>
  ggplot(aes(x = date, y = cash)) +
  geom_line() +
  labs(title = "Cash Withdrawals Over Time - ATM2", x = "Date", y = "Cash Withdrawn") 

Transform the data using Box-Cox transformation to stabilize the variance

Box-Cox Transformed ATM2: TThe Box-Cox transformation has been applied to stabilize variance without altering the underlying pattern significantly, suggesting that the transformed data may be more suitable for time series modeling by reducing potential heteroscedasticity. The overall trend remains stable, with no clear long-term increase or decrease in withdrawal levels.

# Step 1: Shift cash values by adding a small constant to handle zero values
atm2_data <- atm2_data |> 
  mutate(cash_adjusted = cash + 1)  # Adding 1 to avoid zero or negative values

# Step 2: Calculate the Box-Cox transformation using the adjusted cash values
box_cox_lambda <- atm2_data |> 
  BoxCox.lambda(cash_adjusted)  # Calculate lambda for Box-Cox

# Apply transformations and differences
atm2_data <- atm2_data |> 
  mutate(
    box_cox_cash = BoxCox(cash_adjusted, lambda = box_cox_lambda),  # Apply Box-Cox transformation
  )

# Step 3: Time Series Plot for Box-Cox Transformed Data
atm2_data |>
  ggplot(aes(x = date, y = box_cox_cash)) +
  geom_line() +
  labs(title = "Box-Cox Transformed ATM2",
       x = "Time", y = "Box-Cox Transformed ATM2")

Decompose the Time Series to reveal any seasonality and trend

The STL decomposition of ATM2 (Box-Cox transformed withdrawals) breaks down the data into trend, weekly seasonality, and remainder (noise) components:

  1. Trend: The trend component shows a mostly stable pattern with a slight decline towards the end, indicating no significant long-term increase or decrease in withdrawals.
  2. Seasonality (Weekly): The strong, regular pattern in the seasonality component suggests a clear weekly cycle in ATM withdrawals, highlighting consistent patterns likely tied to weekly user behavior.
  3. Remainder: The remainder component captures irregular fluctuations that are not explained by trend or seasonality, with higher variability towards the end of the period, potentially indicating more unpredictable withdrawal behavior.

This decomposition confirms a stable trend with strong weekly seasonality, while the remainder shows increased noise, especially in early 2010.

# STL Decomposition for ATM1
atm2_data |>
  model(
    stl = STL(box_cox_cash ~ trend(window = 7) + season(window = "periodic"), robust = TRUE)
  ) |>
  components() |>
  autoplot() +
  labs(title = "STL Decomposition - ATM2 (Box-Cox Transformed Withdrawals)")

ACF Plot to Identify Autocorrelation

Lag 7 Differencing for ATM2 removed a significant portion of the seasonality, as shown in the first differenced plot. However, some autocorrelation remains at lag 7 in the ACF, suggesting that additional differencing or seasonal components in the model may be required.

# Generate ACF Plot for ATM1
atm2_data |>
  ACF(box_cox_cash) |>
  autoplot() +
  labs(title = "ACF Plot for Box-Cox Transformed ATM2 Cash Withdrawals ")

ATM2 Differencing to make the series stationary

Perform differencing to identify the differncing parameters d and D for manually specifying ARIMA and SARIMA models.

Seasonal Differencing to Determine Seasonal Parameter, D

Lag 7 Differencing for ATM2 removed a significant portion of the seasonality, as shown in the first differenced plot. However, some autocorrelation remains at lag 7 in the ACF, suggesting that additional differencing or seasonal components in the model may be required.

# Apply differencing
atm2_data <- atm2_data |> 
  mutate(
    diff_lag7 = difference(box_cox_cash, lag = 7)        # Lag 7 Seasonal differencing 
  )

# Time Series Plot for Lag 1 non-seasonally Differenced Data
plot_ts <- atm2_data |>
  drop_na(diff_lag7) |>
  autoplot(diff_lag7) + 
  labs(title = "Lag 7 Seasonally Differenced ATM2",
       x = "Time", y = "Lag 7 Differenced ATM2")

# Generate ACF Plot for ATM2
plot_acf <- atm2_data |>
  ACF(diff_lag7) |>
  autoplot() +
  labs(title = "ACF Plot for Lag 7 Differenced ATM2 Cash Withdrawals")

# Combine the plots into a grid
grid.arrange(plot_ts, plot_acf, ncol = 1)

### Non-seasonal Differencing to Determine Non-seasonal Parameter, d

The plots and Ljung-Box test results indicate that:

Lag 1 on Lag 7 Differencing further stabilizes the series for ATM2, as reflected in the second plot and ACF, showing reduced autocorrelation at smaller lags. The high Ljung-Box statistic with a p-value of 0 suggests residual autocorrelation, emphasizing the need for more complex seasonal modeling, possibly a SARIMA model, to account for the remaining dependencies.

# Apply differencing
atm2_data <- atm2_data |> 
  mutate(
    diff_lag7_lag1 = difference(diff_lag7, lag = 1)        # First-order differencing after seasonal differencing
  )

# Time Series Plot for Lag 1 non-seasonally Differenced Data
plot_ts <- atm2_data |>
  drop_na(diff_lag7_lag1) |>
  autoplot(diff_lag7_lag1) + 
  labs(title = "Lag 1 on Lag 7 Differenced ATM1",
       x = "Time", y = "Lag 1 on Lag 7 Differenced ATM2")

# Generate ACF Plot for ATM1
plot_acf <- atm2_data |>
  ACF(diff_lag7_lag1) |>
  autoplot() +
  labs(title = "ACF Plot for Lag 1 on Lag 7 Differenced ATM2 Cash Withdrawals ")

# Combine the plots into a grid
grid.arrange(plot_ts, plot_acf, ncol = 1)

# Display the Ljung-Box test results
ljung_box_results |>
  kable(caption = "ATM2 Ljung Test Result for Lag 1 on Lag 7 Differencing") |>
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
ATM2 Ljung Test Result for Lag 1 on Lag 7 Differencing
atm lb_stat lb_pvalue
ATM1 143.6979 0

Define Models

Examine the ACF/PACF plots

We will examine the ACF (Autocorrelation Function) and PACF (Partial Autocorrelation Function) plots for selecting appropriate model parameters and understanding the structure of the time series.

Based on the ACF and PACF plots for the Box-Cox transformed ATM2 withdrawals (without differencing), the following models are recommended, including differencing for stationarity:

  1. ARIMA Models:
    • ARIMA(2,1,0): PACF spike at lag 2 suggests AR(2) component.
    • ARIMA(0,1,2): ACF spike at lag 2 indicates MA(2) component.
    • Both use first differencing (\(d=1\)) for stationarity.
  2. SARIMA Models:
    • Weekly seasonality in ACF spikes (lag 7) justifies seasonal differencing (\(D=1\)) and seasonal MA(1).
    • SARIMA(2,1,0)(0,1,1)[7]: AR(2) for short-term dependency.
    • SARIMA(0,1,2)(0,1,1)[7]: MA(2) for short-term dependency.
    • SARIMA(2,1,2)(0,1,1)[7]: Combines AR(2) and MA(2) for flexibility.
  3. ETS Models:
    • Weekly seasonality requires Multiplicative seasonality (M).
    • ETS(M, A, M): Handles variability in error and trend.
    • ETS(A, A, M): Suitable if linear trend fits well.
    • ETS(M, Ad, M): Damped trend option for leveling over time.

These models capture ATM2’s trend and weekly seasonality effectively.

# Plot ACF/PACF
atm2_data |>
  gg_tsdisplay(box_cox_cash, plot_type = 'partial', lag_max = 21)

Train the Models and Select the Best Model

Split data into train and test sets

Split the data and leave out the last 14 days (up to April 30, 2010) for ATM2 to create training and test sets.

# Define the end date for the training set (14 days before the last date, April 16, 2010)
train_end_date <- as.Date("2010-04-16")

# Split ATM1 data into training and test sets
atm2_train <- atm2_data |> filter(date <= train_end_date)
atm2_test <- atm2_data |> filter(date > train_end_date)

Train the Models

The models are trained on the Box-Cox transformed ATM2 cash withdrawals data to capture trend, seasonality, and short-term dependencies. Various ARIMA, SARIMA, and ETS models were fitted, with non-seasonal differencing (\(d=2\)) and seasonal differencing (\(D=1\)) to handle stationarity and weekly seasonality.

Based on the AICc and BIC values shown in the table, the best model is the stepwise ARIMA model with parameters ARIMA(2,0,2)(0,1,1)[7], having the lowest AICc (3188.332) and BIC (3211.126) values among the models.

This indicates that the stepwise-selected ARIMA model is the most suitable choice for forecasting ATM2 cash withdrawals, offering the best fit based on information criteria.

# Load necessary libraries
library(dplyr)
library(fable)
library(tsibble)
library(ggplot2)
library(tidyr)
library(kableExtra)

# Fit ARIMA, SARIMA, and ETS models to the Box-Cox transformed ATM2 data
atm2_fit <- atm2_train |>
  model(
    arima210 = ARIMA(box_cox_cash ~ pdq(2,1,0)),                 # ARIMA(2,1,0) - AR(2) for short-term dependency
    arima012 = ARIMA(box_cox_cash ~ pdq(0,1,2)),                 # ARIMA(0,1,2) - MA(2) for short-term dependency
    sarima210011 = ARIMA(box_cox_cash ~ pdq(2,1,0) + PDQ(0,1,1, period = 7)),  # SARIMA(2,1,0)(0,1,1)[7] - AR(2) with seasonal MA(1)
    sarima012011 = ARIMA(box_cox_cash ~ pdq(0,1,2) + PDQ(0,1,1, period = 7)),  # SARIMA(0,1,2)(0,1,1)[7] - MA(2) with seasonal MA(1)
    sarima212011 = ARIMA(box_cox_cash ~ pdq(2,1,2) + PDQ(0,1,1, period = 7)),  # SARIMA(2,1,2)(0,1,1)[7] - AR(2) and MA(2) with seasonal MA(1)
    ets_mam = ETS(box_cox_cash ~ error("M") + trend("A") + season("M")),       # ETS(M, A, M) - Multiplicative seasonality with additive trend
    ets_aam = ETS(box_cox_cash ~ error("A") + trend("A") + season("M")),       # ETS(A, A, M) - Additive trend and error, multiplicative seasonality
    ets_madm = ETS(box_cox_cash ~ error("M") + trend("Ad") + season("M")),     # ETS(M, Ad, M) - Damped trend with multiplicative seasonality
    stepwise = ARIMA(box_cox_cash),                           # Automatic ARIMA using stepwise search
    search = ARIMA(box_cox_cash, stepwise = FALSE)            # Automatic ARIMA using exhaustive search
  )
## Warning: 1 error encountered for arima012
## [1] Could not find an appropriate ARIMA model.
## This is likely because automatic selection does not select models with characteristic roots that may be numerically unstable.
## For more details, refer to https://otexts.com/fpp3/arima-r.html#plotting-the-characteristic-roots
# Extract the model summary with AICc and BIC values
model_summary <- glance(atm2_fit) |>
  arrange(AICc) |>
  select(.model, AIC, AICc, BIC, MSE, AMSE)

# Extract the ARIMA orders (p,d,q) for each model
model_orders <- atm2_fit |>
  pivot_longer(cols = -atm, names_to = "Model name", values_to = "Orders") |>
  select(`Model name`, Orders)

# Combine the AICc, BIC, and ARIMA orders into a single data frame
combined_results <- model_summary |> 
  left_join(model_orders, by = c(".model" = "Model name"))

# Display the combined data frame using kable
combined_results |>
  kable(caption = "Ranked ATM2 Models based on AICc") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Ranked ATM2 Models based on AICc
.model AIC AICc BIC MSE AMSE Orders
stepwise 3188.082 3188.332 3211.126 NA NA <ARIMA(2,0,2)(0,1,1)[7]>
search 3191.233 3191.566 3218.117 NA NA <ARIMA(5,0,0)(0,1,1)[7]>
sarima012011 3205.108 3205.226 3220.459 NA NA <ARIMA(0,1,2)(0,1,1)[7]>
sarima212011 3205.538 3205.788 3228.564 NA NA <ARIMA(2,1,2)(0,1,1)[7]>
arima210 3301.614 3301.732 3316.965 NA NA <ARIMA(2,1,0)(0,1,1)[7]>
sarima210011 3301.614 3301.732 3316.965 NA NA <ARIMA(2,1,0)(0,1,1)[7]>
ets_aam 4363.130 4364.053 4409.459 666.0784 665.0423 <ETS(A,A,M)>
ets_mam 4576.257 4577.180 4622.587 1153.6274 1154.7978 <ETS(M,A,M)>
ets_madm 4632.066 4633.146 4682.256 1362.9499 1362.7706 <ETS(M,Ad,M)>

Evaluate model performance

The CRPS values indicate that the ETS(A,A,M) model performs the best for short-term forecasts, achieving the lowest CRPS (8.66) among all models tested. The SARIMA(0,1,2)(0,1,1)[7] and SARIMA(2,1,2)(0,1,1)[7] models follow closely with CRPS values of 8.70 and 8.88, respectively. The Search and Stepwise ARIMA models also perform well with CRPS scores below 10, making them viable options. Higher CRPS scores for other models suggest less accuracy in short-term forecasts. These results help prioritize model selection based on forecast accuracy.

# Load necessary libraries for forecasting accuracy
library(fabletools)
library(dplyr)
library(kableExtra)

# Generate 14-day forecasts for all models
atm2_forecasts <- atm2_fit |>
  forecast(h = "14 days")

# Calculate accuracy metrics, including CRPS, for the 14-day forecast and sort by CRPS
atm2_forecast_accuracy <- atm2_forecasts |>
  accuracy(atm2_test, list(crps = CRPS)) |>
  arrange(crps)  # Sort from most to least accurate (lowest to highest CRPS)

# Display sorted accuracy results for each model
atm2_forecast_accuracy |>
  kable(caption = "CRPS Scores for 14-Day ATM2 Withdrawal Forecasts (Sorted by Accuracy)") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
CRPS Scores for 14-Day ATM2 Withdrawal Forecasts (Sorted by Accuracy)
.model atm .type crps
ets_aam ATM2 Test 8.621016
sarima012011 ATM2 Test 8.697526
sarima212011 ATM2 Test 8.875338
search ATM2 Test 8.985455
stepwise ATM2 Test 9.225573
arima210 ATM2 Test 14.862258
sarima210011 ATM2 Test 14.862258
ets_madm ATM2 Test 27.398077
ets_mam ATM2 Test 32.840325
arima012 ATM2 Test NaN

Residual Analysis of ATM2 Forecast

For residual analysis, I will focus on the following models:

  1. Stepwise ARIMA (2,0,2)(0,1,1)[7]: Selected due to its lowest AICc, indicating it’s the best fit based on information criteria.
  2. SARIMA(0,1,2)(0,1,1)[7]: Among the top models in both AICc and CRPS, suggesting a strong balance between fit and forecast accuracy.
  3. ETS(A,A,M): Has the lowest CRPS score, indicating good forecast accuracy, even though it ranks lower in AICc.
  4. Search ARIMA: Although it ranks closely to the stepwise model in terms of AICc, it has a relatively strong CRPS performance, warranting a comparative analysis.

The Stepwise ARIMA and Search ARIMA models show well-centered residuals with low variance and minimal autocorrelation, indicating effective capture of data patterns. SARIMA(0,1,2)(0,1,1)[7] also performs well but has slightly higher autocorrelation. ETS(A,A,M) captures seasonality but shows higher residual variance. Overall, Stepwise and Search ARIMA models are preferred for their balanced residual behavior, making them strong choices for forecasting.

atm2_fit |>
  select(stepwise) |>
  gg_tsresiduals() +
  labs(title = "Stepwise Residuals")

atm2_fit |>
  select(sarima012011) |>
  gg_tsresiduals() +
  labs(title = "sarima012011 Model Residuals")

atm2_fit |>
  select(ets_aam) |>
  gg_tsresiduals() +
  labs(title = "ets_aam Model Residuals")

atm2_fit |>
  select(search) |>
  gg_tsresiduals() +
  labs(title = "Search Model Residuals")

Model Selection

Based on the residual analysis, AICc, BIC, and CRPS values, the following models are recommended:

  1. Short-Term Forecasting: The ETS(A,A,M) model is ideal for short-term forecasts due to its low CRPS (8.66), suggesting good accuracy over a short forecasting horizon.

  2. Long-Term Forecasting: The Stepwise ARIMA (2,0,2)(0,1,1)[7] model performs well in terms of AICc and BIC, making it a suitable choice for longer-term forecasts, where stability and consistency are crucial.

These selections balance forecast accuracy with model complexity and are based on the observed performance in residual behavior, information criteria, and continuous ranked probability scores.

Generate Forecast for May 2010

The stepwise ARIMA model is used for a month-long forecast to maintain reliability and stability. If short-term accuracy is a priority (e.g., for daily or weekly forecasts), the ETS(A,A,M) model could also be considered as a secondary choice.

# Load necessary libraries
library(tsibble)
library(fable)
library(dplyr)
library(tidyr)
library(kableExtra)

# Fit stepwise model to the Box-Cox transformed ATM2 data to generate forecast for May 2010
atm2_forecast_fit <- atm2_data |>
  model(
    stepwise = ARIMA(box_cox_cash)  # Automatic ARIMA using stepwise search
  )

# Generate 31-day forecast for the stepwise model
atm2_may_forecasts <- atm2_forecast_fit |>
  forecast(h = "31 days")

# Convert atm1_may_forecasts to a regular data frame to avoid tsibble constraints
forecast_means <- as_tibble(atm2_may_forecasts) |>
  mutate(
    forecast = round(.mean, 1)  # Format as rounded mean values
  ) |>
  select(.model, date, forecast) |>
  pivot_wider(names_from = .model, values_from = forecast)

# Display the forecast table with formatted means only
forecast_means |>
  head() |>
  kable(caption = "31-Day Forecasts for ATM2 (Stepwise Model)") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
31-Day Forecasts for ATM2 (Stepwise Model)
date stepwise
2010-05-01 66.4
2010-05-02 72.7
2010-05-03 13.6
2010-05-04 5.0
2010-05-05 97.8
2010-05-06 88.7
# Export the data to CSV
write.csv(forecast_means, "atm2_31day_Forecast.csv", row.names = FALSE)

Visual Forecast Comparison for ATM2 Withdrawals

The 31-day forecast for ATM2 withdrawals using the Stepwise ARIMA model shows consistent weekly patterns with credible intervals that capture the expected fluctuations. In the zoomed-in view for May, the forecast maintains steady seasonality with tighter prediction intervals, indicating reliable short-term accuracy. The model effectively predicts variations within the 80% and 95% confidence levels, making it a robust choice for capturing periodic withdrawal behavior.

# Load necessary libraries
library(ggplot2)
library(fable)
library(tsibble)
library(gridExtra)

# Fit the best model (stepwise) to the Box-Cox transformed data and generate a 31-day forecast
atm2_forecast_fit <- atm2_data |>
  model(stepwise = ARIMA(box_cox_cash))

# Generate the forecast for 31 days
atm2_may_forecast <- atm2_forecast_fit |>
  forecast(h = "31 days")

# Extract the forecast period date range
forecast_start <- min(atm2_may_forecast$date)
forecast_end <- max(atm2_may_forecast$date)

# Full view plot including the training and test data
plot_full <- atm2_may_forecast |>
  autoplot(atm2_data, level = c(80, 95)) +  # Include confidence intervals
  xlab("Date") + ylab("ATM Withdrawals") +
  ggtitle("31-Day Forecast for ATM2 - Stepwise Model (Full View)") +
  theme_minimal()

# Zoomed-in view on the forecast period only
plot_zoom <- atm2_may_forecast |>
  autoplot(atm2_data, level = c(80, 95)) +
  xlab("Date") + ylab("ATM Withdrawals") +
  ggtitle("31-Day Forecast for ATM2 - Stepwise Model (Forecast Period Only)") +
  theme_minimal() +
  coord_cartesian(xlim = c(forecast_start, forecast_end))  # Set x-axis to forecast range only

# Arrange the plots vertically
grid.arrange(plot_full, plot_zoom, ncol = 1)

ATM4

Plot the Data

This plot displays the cash withdrawals over time for ATM4, showing distinct periodic peaks and troughs, suggesting strong seasonality in the data. Withdrawals tend to spike periodically, with notable peaks around mid-2009 and early 2010, potentially corresponding to recurring events or demand patterns. The variability in withdrawal amounts also appears high, indicating fluctuations in cash usage at this ATM over the analyzed period. This seasonality and variability would likely need to be considered in any forecasting model applied to this data.

atm4_data |>
  ggplot(aes(x = date, y = cash)) +
  geom_line() +
  labs(title = "Cash Withdrawals Over Time - ATM4", x = "Date", y = "Cash Withdrawn") 

Transform the data using Box-Cox transformation to stabilize the variance

Box-Cox Transformed ATM4: The Box-Cox transformation helps in making the series more stationary, preparing it for modeling. Peaks and trends are preserved, but the distribution across values appears more balanced, indicating that a Box-Cox transformation is effective in handling non-constant variance often observed in raw time series data.

# Step 1: Shift cash values by adding a small constant to handle zero values
atm4_data <- atm4_data |> 
  mutate(cash_adjusted = cash + 1)  # Adding 1 to avoid zero or negative values

# Step 2: Calculate the Box-Cox transformation using the adjusted cash values
box_cox_lambda <- atm4_data |> 
  BoxCox.lambda(cash_adjusted)  # Calculate lambda for Box-Cox

# Apply transformations and differences
atm4_data <- atm4_data |> 
  mutate(
    box_cox_cash = BoxCox(cash_adjusted, lambda = box_cox_lambda),  # Apply Box-Cox transformation
  )

# Step 3: Time Series Plot for Box-Cox Transformed Data
atm4_data |>
  ggplot(aes(x = date, y = box_cox_cash)) +
  geom_line() +
  labs(title = "Box-Cox Transformed ATM4",
       x = "Time", y = "Box-Cox Transformed ATM2")

Decompose the Time Series to reveal any seasonality and trend

The STL decomposition of ATM4 (Box-Cox transformed withdrawals) breaks down the data into trend, weekly seasonality, and remainder (noise) components:

  1. Trend: The trend component shows moderate fluctuations, with periods of both increase and decrease, indicating some variability in the overall level of withdrawals over time.
  2. Seasonality (Weekly): The seasonality component displays a strong and consistent weekly cycle, highlighting regular withdrawal patterns likely driven by user behavior each week.
  3. Remainder: The remainder component captures irregular, random fluctuations not explained by trend or seasonality, remaining relatively small but showing occasional spikes, especially in the latter part of the period.

This decomposition highlights the strong weekly cycle in withdrawals, with a fluctuating trend and limited unexplained variability.

# STL Decomposition for ATM1
atm4_data |>
  model(
    stl = STL(box_cox_cash ~ trend(window = 7) + season(window = "periodic"), robust = TRUE)
  ) |>
  components() |>
  autoplot() +
  labs(title = "STL Decomposition - ATM4 (Box-Cox Transformed Withdrawals)")

ACF Plot to Identify Autocorrelation

The ACF plot for Box-Cox transformed ATM4 cash withdrawals confirms the findings from the STL decomposition. The strong spikes at lags 7, 14, and 21 reinforce the weekly seasonality observed in the STL’s seasonality component, indicating a clear 7-day cycle in withdrawal patterns. This alignment with the STL decomposition underscores the stability and regularity of weekly patterns in the data.

# Generate ACF Plot for ATM1
atm4_data |>
  ACF(box_cox_cash) |>
  autoplot() +
  labs(title = "ACF Plot for Box-Cox Transformed ATM4 Cash Withdrawals ")

ATM4 Differencing to make the series stationary

Perform differencing to identify the differncing parameters d and D for manually specifying ARIMA and SARIMA models.

Seasonal Differencing to Determine Seasonal Parameter, D

The Lag 7 seasonally differenced plot for ATM4 cash withdrawals reduces the original series’ strong weekly pattern, as indicated by a much lower amplitude compared to the undifferenced series. The corresponding ACF plot shows minimal autocorrelation after the first lag, suggesting that the differencing has effectively mitigated seasonality, aligning the data closer to stationarity. This transformation supports further modeling by stabilizing the variance and removing repetitive weekly cycles, consistent with the weekly seasonality identified in previous analyses.

# Apply differencing
atm4_data <- atm4_data |> 
  mutate(
    diff_lag7 = difference(box_cox_cash, lag = 7)        # Lag 7 Seasonal differencing 
  )

# Time Series Plot for Lag 1 non-seasonally Differenced Data
plot_ts <- atm4_data |>
  drop_na(diff_lag7) |>
  autoplot(diff_lag7) + 
  labs(title = "Lag 7 Seasonally Differenced ATM4",
       x = "Time", y = "Lag 7 Differenced ATM2")

# Generate ACF Plot for ATM2
plot_acf <- atm4_data |>
  ACF(diff_lag7) |>
  autoplot() +
  labs(title = "ACF Plot for Lag 7 Differenced ATM4 Cash Withdrawals")

# Combine the plots into a grid
grid.arrange(plot_ts, plot_acf, ncol = 1)

### Non-seasonal Differencing to Determine Non-seasonal Parameter, d

The analysis of ATM4 cash withdrawals using lag 7 seasonal differencing, followed by lag 1 differencing, shows stable variations around zero, indicating seasonality and stationarity. The ACF plot shows minimal autocorrelation beyond lag 7, suggesting that differencing has effectively reduced correlations. However, the Ljung-Box test results indicate significant residual autocorrelation (\(p\)-value = 0), confirming that additional model refinements, such as an ARIMA or SARIMA model, may be needed to further capture the remaining dependencies.

# Apply differencing
atm4_data <- atm4_data |> 
  mutate(
    diff_lag7_lag1 = difference(diff_lag7, lag = 1)        # First-order differencing after seasonal differencing
  )

# Time Series Plot for Lag 1 non-seasonally Differenced Data
plot_ts <- atm4_data |>
  drop_na(diff_lag7_lag1) |>
  autoplot(diff_lag7_lag1) + 
  labs(title = "Lag 1 on Lag 7 Differenced ATM1",
       x = "Time", y = "Lag 1 on Lag 7 Differenced ATM2")

# Generate ACF Plot for ATM1
plot_acf <- atm4_data |>
  ACF(diff_lag7_lag1) |>
  autoplot() +
  labs(title = "ACF Plot for Lag 1 on Lag 7 Differenced ATM2 Cash Withdrawals ")

# Combine the plots into a grid
grid.arrange(plot_ts, plot_acf, ncol = 1)

# Display the Ljung-Box test results
ljung_box_results |>
  kable(caption = "ATM4 Ljung Test Result for Lag 1 on Lag 7 Differencing") |>
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
ATM4 Ljung Test Result for Lag 1 on Lag 7 Differencing
atm lb_stat lb_pvalue
ATM1 143.6979 0

Define Models

Examine the ACF/PACF plots

We will examine the ACF (Autocorrelation Function) and PACF (Partial Autocorrelation Function) plots for selecting appropriate model parameters and understanding the structure of the time series.

Based on the ACF and PACF plots for the Box-Cox transformed ATM4 cash withdrawals (without differencing), the following models are recommended, incorporating differencing to achieve stationarity:

  1. ARIMA Models:
    • ARIMA(1,1,0): The significant PACF spike at lag 1 suggests an AR(1) component.
    • ARIMA(0,1,1): The significant ACF spike at lag 1 indicates an MA(1) component.
    • Both models utilize first differencing (\(d=1\)) to address non-stationarity.
  2. SARIMA Models:
    • The noticeable weekly seasonality in the ACF plot at lag 7 justifies seasonal differencing (\(D=1\)) and a seasonal MA(1) component.
    • SARIMA(1,1,0)(0,1,1)[7]: Combines AR(1) with seasonal MA(1) for weekly seasonality.
    • SARIMA(0,1,1)(0,1,1)[7]: Combines MA(1) with seasonal MA(1) for flexibility.
    • SARIMA(1,1,1)(0,1,1)[7]: Adds both AR(1) and MA(1) components for additional model adaptability.
  3. ETS Models:
    • Weekly seasonality observed in the ACF plot suggests Multiplicative seasonality (M).
    • ETS(M, A, M): Allows for multiplicative seasonality with an additive trend.
    • ETS(A, A, M): Suitable if linear trends are more consistent with the data.
    • ETS(M, Ad, M): Uses a damped trend to manage potential leveling effects over time.

These models capture the observed weekly seasonality, trend, and short-term dependencies effectively.

# Plot ACF/PACF
atm4_data |>
  gg_tsdisplay(box_cox_cash, plot_type = 'partial', lag_max = 21)

Train the Models and Select the Best Model

Split data into train and test sets

Split the data and leave out the last 14 days (up to April 30, 2010) for ATM4 to create training and test sets.

# Define the end date for the training set (14 days before the last date, April 16, 2010)
train_end_date <- as.Date("2010-04-16")

# Split ATM1 data into training and test sets
atm4_train <- atm4_data |> filter(date <= train_end_date)
atm4_test <- atm4_data |> filter(date > train_end_date)

Train the Models

The models are trained on the Box-Cox transformed ATM2 cash withdrawals data to capture trend, seasonality, and short-term dependencies. Various ARIMA, SARIMA, and ETS models were fitted, with non-seasonal differencing (\(d=2\)) and seasonal differencing (\(D=1\)) to handle stationarity and weekly seasonality.

The model selection process for ATM4 revealed the following insights:

  • SARIMA(1,1,1)(0,1,1)[7] and SARIMA(0,1,1)(0,1,1)[7] are top-performing models based on AICc, with values significantly lower than other models, indicating they capture the data’s structure effectively.
  • Stepwise and Search ARIMA models (both using automatic selection) ranked closely behind the top SARIMA models, providing reasonable fits while leveraging automated selection methods.
  • ETS models displayed higher AICc values, suggesting that exponential smoothing may not capture the underlying patterns as effectively for ATM4.

The warning for ARIMA(0,1,1) indicates potential instability due to non-characteristic roots, making it unsuitable for this dataset.

# Load necessary libraries
library(dplyr)
library(fable)
library(tsibble)
library(ggplot2)
library(tidyr)
library(kableExtra)

# Fit ARIMA, SARIMA, and ETS models to the Box-Cox transformed ATM4 data
atm4_fit <- atm4_train |>
  model(
    arima110 = ARIMA(box_cox_cash ~ pdq(1,1,0)),                     # ARIMA(1,1,0) - AR(1) for short-term dependency
    arima011 = ARIMA(box_cox_cash ~ pdq(0,1,1)),                     # ARIMA(0,1,1) - MA(1) for short-term dependency
    sarima110011 = ARIMA(box_cox_cash ~ pdq(1,1,0) + PDQ(0,1,1, period = 7)),  # SARIMA(1,1,0)(0,1,1)[7] - AR(1) with seasonal MA(1)
    sarima011011 = ARIMA(box_cox_cash ~ pdq(0,1,1) + PDQ(0,1,1, period = 7)),  # SARIMA(0,1,1)(0,1,1)[7] - MA(1) with seasonal MA(1)
    sarima111011 = ARIMA(box_cox_cash ~ pdq(1,1,1) + PDQ(0,1,1, period = 7)),  # SARIMA(1,1,1)(0,1,1)[7] - AR(1) and MA(1) with seasonal MA(1)
    ets_mam = ETS(box_cox_cash ~ error("M") + trend("A") + season("M")),       # ETS(M, A, M) - Multiplicative seasonality with additive trend
    ets_aam = ETS(box_cox_cash ~ error("A") + trend("A") + season("M")),       # ETS(A, A, M) - Additive trend and error, multiplicative seasonality
    ets_madm = ETS(box_cox_cash ~ error("M") + trend("Ad") + season("M")),     # ETS(M, Ad, M) - Damped trend with multiplicative seasonality
    stepwise = ARIMA(box_cox_cash),                                            # Automatic ARIMA using stepwise search
    search = ARIMA(box_cox_cash, stepwise = FALSE)                             # Automatic ARIMA using exhaustive search
  )
## Warning: 1 error encountered for arima011
## [1] Could not find an appropriate ARIMA model.
## This is likely because automatic selection does not select models with characteristic roots that may be numerically unstable.
## For more details, refer to https://otexts.com/fpp3/arima-r.html#plotting-the-characteristic-roots
# Extract the model summary with AICc and BIC values
model_summary <- glance(atm4_fit) |>
  arrange(AICc) |>
  select(.model, AIC, AICc, BIC, MSE, AMSE)

# Extract the ARIMA orders (p,d,q) for each model
model_orders <- atm4_fit |>
  pivot_longer(cols = -atm, names_to = "Model name", values_to = "Orders") |>
  select(`Model name`, Orders)

# Combine the AICc, BIC, and ARIMA orders into a single data frame
combined_results <- model_summary |> 
  left_join(model_orders, by = c(".model" = "Model name"))

# Display the combined data frame using kable
combined_results |>
  kable(caption = "Ranked ATM4 Models based on AICc") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Ranked ATM4 Models based on AICc
.model AIC AICc BIC MSE AMSE Orders
sarima111011 5007.190 5007.308 5022.541 NA NA <ARIMA(1,1,1)(0,1,1)[7]>
sarima011011 5007.626 5007.697 5019.139 NA NA <ARIMA(0,1,1)(0,1,1)[7]>
stepwise 5118.958 5119.027 5130.540 NA NA <ARIMA(0,0,0)(1,0,0)[7] w/ mean>
search 5118.958 5119.027 5130.540 NA NA <ARIMA(0,0,0)(1,0,0)[7] w/ mean>
sarima110011 5120.322 5120.392 5131.835 NA NA <ARIMA(1,1,0)(0,1,1)[7]>
arima110 5236.010 5236.126 5251.442 NA NA <ARIMA(1,1,0)(2,0,0)[7]>
ets_mam 6171.051 6171.974 6217.380 142759.6 143845.0 <ETS(M,A,M)>
ets_madm 6179.144 6180.224 6229.334 136950.6 137808.1 <ETS(M,Ad,M)>
ets_aam 6206.119 6207.042 6252.448 127018.4 127679.9 <ETS(A,A,M)>

Evaluate model performance

Based on the AICc and CRPS results for ATM4:

  1. Top Models by AICc: The SARIMA models, specifically SARIMA(1,1,1)(0,1,1)[7] and SARIMA(0,1,1)(0,1,1)[7], show the best performance in terms of AICc, indicating these models effectively capture underlying patterns with minimal complexity.

  2. Top Models by CRPS: The same SARIMA models also have the lowest CRPS scores, demonstrating their accuracy in forecast performance for ATM4 over a 14-day period.

  3. Alternative Models: While the ETS models rank lower in terms of AICc and CRPS, the ETS(M,Ad,M) and ETS(M,A,M) offer competitive forecast accuracy, though with slightly higher CRPS scores.

Overall, SARIMA(1,1,1)(0,1,1)[7] and SARIMA(0,1,1)(0,1,1)[7] are the best choices for reliable ATM4 forecasting based on both AICc and CRPS.

# Load necessary libraries for forecasting accuracy
library(fabletools)
library(dplyr)
library(kableExtra)

# Generate 14-day forecasts for all models
atm4_forecasts <- atm4_fit |>
  forecast(h = "14 days")

# Calculate accuracy metrics, including CRPS, for the 14-day forecast and sort by CRPS
atm4_forecast_accuracy <- atm4_forecasts |>
  accuracy(atm4_test, list(crps = CRPS)) |>
  arrange(crps)  # Sort from most to least accurate (lowest to highest CRPS)

# Display sorted accuracy results for each model
atm4_forecast_accuracy |>
  kable(caption = "CRPS Scores for 14-Day ATM4 Withdrawal Forecasts (Sorted by Accuracy)") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
CRPS Scores for 14-Day ATM4 Withdrawal Forecasts (Sorted by Accuracy)
.model atm .type crps
sarima011011 ATM4 Test 151.0016
sarima111011 ATM4 Test 151.6449
search ATM4 Test 159.4341
stepwise ATM4 Test 159.4341
ets_madm ATM4 Test 163.5230
ets_mam ATM4 Test 171.1137
ets_aam ATM4 Test 179.1936
sarima110011 ATM4 Test 254.1342
arima110 ATM4 Test 262.8403
arima011 ATM4 Test NaN

Residual Analysis of ATM2 Forecast

For residual analysis, I will focus on the following models:

  1. SARIMA(1,1,1)(0,1,1)[7] and SARIMA(0,1,1)(0,1,1)[7]: Top AICc and CRPS scores indicate strong performance. Residuals will confirm effective seasonal/trend capture.

  2. Stepwise ARIMA: Good CRPS, simpler structure. Residuals will show if it fully captures underlying patterns.

  3. ETS(M, Ad, M): Models multiplicative seasonality with damped trend. Residuals will assess adequacy of seasonal/trend handling.

The SARIMA(1,1,1)(0,1,1)[7] and SARIMA(0,1,1)(0,1,1)[7] models show well-centered residuals with minimal autocorrelation, effectively capturing the weekly seasonality in ATM4 withdrawals. The Stepwise ARIMA model also performs adequately, but it exhibits slightly higher residual variance, indicating some limitations in capturing all patterns. ETS(M, Ad, M) manages to capture multiplicative seasonality but displays the highest residual variance, making it less effective in this context. Overall, SARIMA models are preferred for their balanced residual behavior, making them robust options for forecasting.

# Residual analysis for selected models on ATM4 data

atm4_fit |>
  select(sarima111011) |>
  gg_tsresiduals() +
  labs(title = "SARIMA(1,1,1)(0,1,1)[7] Model Residuals - ATM4")

atm4_fit |>
  select(sarima011011) |>
  gg_tsresiduals() +
  labs(title = "SARIMA(0,1,1)(0,1,1)[7] Model Residuals - ATM4")

atm4_fit |>
  select(stepwise) |>
  gg_tsresiduals() +
  labs(title = "Stepwise Residuals - ATM4")

atm4_fit |>
  select(ets_madm) |>
  gg_tsresiduals() +
  labs(title = "ETS(M, Ad, M) Model Residuals - ATM4")

Model Selection

Based on the residual analysis, AICc, BIC, and CRPS values:

  • Short-term forecasting: The SARIMA(0,1,1)(0,1,1)[7] model is recommended as it shows the lowest CRPS score among models, indicating high accuracy for shorter forecasting horizons. Its residuals are well-centered with minimal autocorrelation, which aligns well with the observed patterns.

  • Long-term forecasting: SARIMA(1,1,1)(0,1,1)[7] emerges as the best model due to its low AICc and BIC values, indicating better model fit for extended forecasts. Its residuals also display stability with minimal bias, suggesting it captures the overall data structure effectively.

Generate Forecast for May 2010

The sarima111011 model is used for a month-long forecast to maintain reliability and stability. If short-term accuracy is a priority (e.g., for daily or weekly forecasts), the sarima011011 model could also be considered as a secondary choice.

# Load necessary libraries
library(tsibble)
library(fable)
library(dplyr)
library(tidyr)
library(kableExtra)

# Fit sarima111011 model to the Box-Cox transformed ATM2 data to generate forecast for May 2010
atm4_forecast_fit <- atm4_data |>
  model(
    sarima111011 = ARIMA(box_cox_cash ~ pdq(1,1,1) + PDQ(0,1,1, period = 7)),  # SARIMA(1,1,1)(0,1,1)[7] - AR(1) and MA(1) with seasonal MA(1)
  )

# Generate 31-day forecast for the stepwise model
atm4_may_forecasts <- atm4_forecast_fit |>
  forecast(h = "31 days")

# Convert atm1_may_forecasts to a regular data frame to avoid tsibble constraints
forecast_means <- as_tibble(atm4_may_forecasts) |>
  mutate(
    forecast = round(.mean, 1)  # Format as rounded mean values
  ) |>
  select(.model, date, forecast) |>
  pivot_wider(names_from = .model, values_from = forecast)

# Display the forecast table with formatted means only
forecast_means |>
  head() |>
  kable(caption = "31-Day Forecasts for ATM4 (sarima111011 Model)") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
31-Day Forecasts for ATM4 (sarima111011 Model)
date sarima111011
2010-05-01 406.2
2010-05-02 478.9
2010-05-03 438.7
2010-05-04 264.4
2010-05-05 467.5
2010-05-06 332.4
# Export the data to CSV
write.csv(forecast_means, "atm4_31day_Forecast.csv", row.names = FALSE)

Visual Forecast Comparison for ATM2 Withdrawals

The 31-day forecast for ATM4 using the sarima111011 forecast shows a stable range with confidence intervals, indicating consistent ATM withdrawal levels expected in the near term. The 80% and 95% prediction intervals suggest moderate variability in forecasts, capturing most potential outcomes with limited extremes. This model projects reliable, centered predictions for upcoming ATM activity.

# Load necessary libraries
library(ggplot2)
library(fable)
library(tsibble)
library(gridExtra)

# Fit the best model (sarima111011) to the Box-Cox transformed data and generate a 31-day forecast
atm4_forecast_fit <- atm4_data |>
  model(sarima111011 = ARIMA(box_cox_cash ~ pdq(1,1,1) + PDQ(0,1,1, period = 7)))

# Generate the forecast for 31 days
atm4_may_forecast <- atm4_forecast_fit |>
  forecast(h = "31 days")

# Extract the forecast period date range
forecast_start <- min(atm4_may_forecast$date)
forecast_end <- max(atm4_may_forecast$date)

# Full view plot including the training and test data
plot_full <- atm4_may_forecast |>
  autoplot(atm4_data, level = c(80, 95)) +  # Include confidence intervals
  xlab("Date") + ylab("ATM4 Withdrawals") +
  ggtitle("31-Day Forecast for ATM4 - sarima111011 Model (Full View)") +
  theme_minimal()

# Zoomed-in view on the forecast period only
plot_zoom <- atm4_may_forecast |>
  autoplot(atm2_data, level = c(80, 95)) +
  xlab("Date") + ylab("ATM4 Withdrawals") +
  ggtitle("31-Day Forecast for ATM4 - sarima111011 Model (Forecast Period Only)") +
  theme_minimal() +
  coord_cartesian(xlim = c(forecast_start, forecast_end))  # Set x-axis to forecast range only

# Arrange the plots vertically
grid.arrange(plot_full, plot_zoom, ncol = 1)

ATM3

Plot the data and identify any unusual observations

The visualizations for ATM3 show a prolonged period of inactivity with zero cash withdrawals across the full data range, with withdrawals only starting in late April 2010. The focused view captures the initial 14 zero values followed by a rapid increase in transactions, reflecting a sharp shift from inactivity to regular withdrawal patterns.

# Load necessary libraries
library(dplyr)
library(ggplot2)
library(gridExtra)

# Find the first date with non-zero cash withdrawals
first_nonzero_date <- atm3_data %>%
  filter(cash > 0) %>%
  summarize(first_date = min(date)) %>%
  pull(first_date)

# Calculate the date range to include the first 14 zeros
include_zero_start <- atm3_data %>%
  filter(cash == 0 & date <= first_nonzero_date) %>%
  slice_tail(n = 14) %>%
  summarize(start_date = min(date)) %>%
  pull(start_date)
## Warning: There was 1 warning in `filter()`.
## ℹ In argument: `cash == 0 & date <= first_nonzero_date`.
## Caused by warning in `<=.default`:
## ! longer object length is not a multiple of shorter object length
# Plot the full range of ATM3 data
full_range_plot <- atm3_data %>%
  ggplot(aes(x = date, y = cash)) +
  geom_line() +
  labs(title = "Cash Withdrawals Over Time - ATM3 (Full Range)",
       x = "Date", y = "Cash Withdrawn")

# Plot the non-zero data range including the first 14 zeros
nonzero_data_plot <- atm3_data %>%
  filter(date >= include_zero_start) %>%
  ggplot(aes(x = date, y = cash)) +
  geom_line() +
  labs(title = "Cash Withdrawals Over Time - ATM3 (Non-Zero Data with First 14 Zeros)",
       x = "Date", y = "Cash Withdrawn") 
## Warning: There was 1 warning in `filter()`.
## ℹ In argument: `date >= include_zero_start`.
## Caused by warning in `>=.default`:
## ! longer object length is not a multiple of shorter object length
# Combine the two plots with gridExtra
grid.arrange(full_range_plot, nonzero_data_plot, ncol = 1)

tail(atm3_data) %>%
  kable(caption = "Non-zero data") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Non-zero data
date atm cash year_month weekday
2010-04-25 ATM3 0 2010-04-01 Sun
2010-04-26 ATM3 0 2010-04-01 Mon
2010-04-27 ATM3 0 2010-04-01 Tue
2010-04-28 ATM3 96 2010-04-01 Wed
2010-04-29 ATM3 82 2010-04-01 Thu
2010-04-30 ATM3 85 2010-04-01 Fri

Modeling and Forecasting

Given the data pattern for ATM3—a long period of zeros followed by only a few non-zero withdrawals—there is not enough data to build a reliable forecasting model. Effective time series forecasting usually requires consistent patterns or enough variation to capture seasonality, trend, or other dependencies. In this case:

  1. Lack of Consistent Withdrawals: The sudden shift from zero to non-zero values without a clear trend or seasonality makes it challenging to apply standard forecasting models, which assume continuity or periodicity.

  2. Limited Observations for Non-Zero Values: Forecasting models need a reasonable number of non-zero observations to identify patterns. With only three active data points at the end, any trend inferred from them would likely be an overfit rather than a generalizable pattern.

  3. No Clear Seasonality: With such sparse data, it’s impossible to detect or leverage any weekly or monthly seasonality, which is often essential for ATM cash withdrawal modeling.

This dataset, in its current form, lacks sufficient variability and consistency to support reliable forecasting. In summary, it’s best to avoid trying to forecast this series until there’s more data reflecting regular ATM usage patterns.

Selection of models for the limited data

If we must proceed with modeling given the limited data, the following models can technically be used but with important caveats:

  1. ARIMA(0,1,0): This minimal ARIMA model will capture a linear trend but may produce unrealistic forecasts due to insufficient data variability.
  2. Mean Model: Useful if you assume future withdrawals will average out around the mean, though this is simplistic.
  3. Naive Model: Suitable if you expect the last value to represent future values, but it’s unreliable with such limited data.
  4. Random Walk and Random Walk with Drift: These assume changes are random or follow a simple trend. Drift can add an upward or downward trend, but without enough historical data, the forecast could diverge unrealistically.

Seasonal models are unsuitable for ATM3 because the data lacks sufficient length, consistent repeating patterns, and enough non-zero values to establish a reliable seasonal trend.

Train the Models

The model is trained on the original data without transformation.

Based on the calculated MAE and RMSE values for each model on ATM3’s non-zero data, the ARIMA(0,1,0) model has the lowest error on 2010-04-28, suggesting it captures the minimal variations well in this sparse dataset. The Naive and Random Walk with Drift models show higher errors, especially on certain days, indicating they may be less suited for reliable forecasting in this context. Overall, ARIMA(0,1,0) is preferable for short-term accuracy, though all models face limitations due to the limited and irregular nature of the data.

# Load necessary libraries
library(dplyr)
library(fable)
library(tsibble)
library(yardstick)
## 
## Attaching package: 'yardstick'
## The following object is masked from 'package:forecast':
## 
##     accuracy
## The following object is masked from 'package:fabletools':
## 
##     accuracy
# Filter for non-zero data only
nonzero_data <- atm3_data %>%
  filter(cash > 0)

# Fit models
atm3_fit <- nonzero_data |>
  model(
    arima010 = ARIMA(cash ~ pdq(0,1,0)),        # Non-seasonal ARIMA(0,1,0)
    mean_model = MEAN(cash),                    # Mean model
    naive_model = NAIVE(cash),                  # Naive model
    rw_model = RW(cash),                        # Random Walk model
    rw_drift = RW(cash + 1)               # Random Walk with Drift
  )

# Forecast for the next few periods (example: 5)
atm3_forecast <- atm3_fit |>
  forecast(h = 5)

# Get residuals for each model and calculate accuracy measures
accuracy_measures <- atm3_fit %>%
  augment() %>%  # Get fitted values and residuals
  group_by(.model) %>%
  summarise(
    mae = mean(abs(.resid), na.rm = TRUE),
    rmse = sqrt(mean(.resid^2, na.rm = TRUE))
  )

# View the accuracy measures for model comparison
print(accuracy_measures)
## # A tsibble: 15 x 4 [1D]
## # Key:       .model [5]
##    .model      date           mae    rmse
##    <chr>       <date>       <dbl>   <dbl>
##  1 arima010    2010-04-28   0.101   0.101
##  2 arima010    2010-04-29   8.50    8.50 
##  3 arima010    2010-04-30   8.50    8.50 
##  4 mean_model  2010-04-28   8.33    8.33 
##  5 mean_model  2010-04-29   5.67    5.67 
##  6 mean_model  2010-04-30   2.67    2.67 
##  7 naive_model 2010-04-28 NaN     NaN    
##  8 naive_model 2010-04-29  14      14    
##  9 naive_model 2010-04-30   3       3    
## 10 rw_drift    2010-04-28 NaN     NaN    
## 11 rw_drift    2010-04-29  14      14    
## 12 rw_drift    2010-04-30   3       3    
## 13 rw_model    2010-04-28 NaN     NaN    
## 14 rw_model    2010-04-29  14      14    
## 15 rw_model    2010-04-30   3       3

Residual Analysis of ATM2 Forecast

The residual plots for the ARIMA(0,1,0), Random Walk with Drift, and Naive models for ATM3 indicate limited data points, which makes assessing model adequacy challenging. Key observations include:

  1. ARIMA(0,1,0): Residuals exhibit minimal autocorrelation, but with only three data points, this is insufficient for robust diagnostics.
  2. Random Walk with Drift: The residuals lack significant structure, but the limited observations constrain reliability in inference.
  3. Naive Model: Similar to the other models, there’s minimal discernible autocorrelation, but too few data points to confirm model suitability.

Given the very small sample size, it’s difficult to conduct a meaningful residual analysis, and none of these models can be fully validated. The lack of data points calls for cautious interpretation of any forecasting model for ATM3.

# Plot residuals for each model
atm3_fit |>
  select(arima010) |>
  gg_tsresiduals() +
  labs(title = "Residuals Analysis for arima010 Model - ATM3")

# Plot residuals for each model
atm3_fit |>
  select(naive_model) |>
  gg_tsresiduals() +
  labs(title = "Residuals Analysis for naive_model Model - ATM3")
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 1 row containing non-finite outside the scale range
## (`stat_bin()`).

# Plot residuals for each model
atm3_fit |>
  select(rw_drift) |>
  gg_tsresiduals() +
  labs(title = "Residuals Analysis for rw_drift Model - ATM3")
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 1 row containing non-finite outside the scale range
## (`stat_bin()`).

Visualize the forecast

The 31-day forecast plot for ATM3, based on the sarima111011 model and other models (ARIMA(0,1,0), Mean, Naive, Random Walk, Random Walk with Drift), reveals the following:

  1. Forecast Instability: The forecast shows significant uncertainty, particularly with a wide range of confidence intervals (80% and 95%) that even dip into negative values, which is unrealistic for withdrawal amounts.
  2. Lack of Historical Data Impact: The forecast is heavily influenced by limited historical non-zero data points, leading to unreliable and volatile predictions.
  3. Comparison Across Models: Different models offer conflicting forecasts due to data sparsity, reflected by overlapping and divergent confidence bands.

Overall, this plot highlights the difficulty in making reliable forecasts with limited non-zero data, and none of the models provide a stable or realistic forecast. The confidence intervals indicate substantial forecast variability, which should prompt consideration of more data collection before any model-based decision-making.

Given the limited non-zero data available, Random Walk with Drift (rw_drift) may be slightly better suited than the other models for short-term forecasting due to its ability to capture a linear trend based on recent values. However, all models show considerable instability, and none stand out as ideal.

# Load necessary libraries
library(ggplot2)
library(fable)
library(tsibble)
library(gridExtra)

# Generate the forecast for 31 days
atm3_may_forecast <- atm3_fit |>
  forecast(h = "31 days")

# Full view plot including the training and test data
plot_full <- atm3_may_forecast |>
  autoplot(atm3_data, level = c(80, 95)) +  # Include confidence intervals
  xlab("Date") + ylab("ATM3 Withdrawals") +
  ggtitle("31-Day Forecast for ATM3 - sarima111011 Model (Full View)") +
  theme_minimal()

# Arrange the plots vertically
plot_full

Calculate the forecast

# Load necessary libraries
library(tsibble)
library(fable)
library(dplyr)
library(tidyr)
library(kableExtra)

# Fit sarima111011 model to the Box-Cox transformed ATM2 data to generate forecast for May 2010
atm3_forecast_fit <- atm3_data |>
  model(
    rw_drift = RW(cash + 1)               # Random Walk with Drift
  )

# Generate 31-day forecast for the stepwise model
atm3_may_forecasts <- atm3_forecast_fit |>
  forecast(h = "31 days")

# Convert atm1_may_forecasts to a regular data frame to avoid tsibble constraints
forecast_means <- as_tibble(atm3_may_forecasts) |>
  mutate(
    forecast = round(.mean, 1)  # Format as rounded mean values
  ) |>
  select(.model, date, forecast) |>
  pivot_wider(names_from = .model, values_from = forecast)

# Display the forecast table with formatted means only
forecast_means |>
  head() |>
  kable(caption = "31-Day Forecasts for ATM3 (rw_drift Model)") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
31-Day Forecasts for ATM3 (rw_drift Model)
date rw_drift
2010-05-01 85
2010-05-02 85
2010-05-03 85
2010-05-04 85
2010-05-05 85
2010-05-06 85
# Export the data to CSV
write.csv(forecast_means, "atm3_31day_Forecast.csv", row.names = FALSE)

Part B: Overview

Introduction

In Part B of this project, we analyze residential power usage load data to explore patterns and trends over time, focusing on handling missing data and preparing the dataset for time series forecasting. The data presents challenges, such as missing values. Our goal is to address these gaps using appropriate imputation techniques while preserving the time series characteristics, including potential seasonality and trends. By transforming and visualizing the data, we aim to develop a robust forecasting model that can predict future energy demand.

Data Preparation

Load the data

A review of the data shows that the dataset consists of multiple rows and columns. Here is a brief description of the columns:

  • YYYY-MMM: Contains the dates for the energy records.
  • CaseSequence: Specifies the Load identifier.
  • KWH: Records the energy load in kilowatt-hours.
# Load necessary libraries
library(tsibble)
library(lubridate)
library(readxl)
library(kableExtra)

# Load the file 
file_path <- "ResidentialCustomerForecastLoad-624.xlsx"
load_data <- read_excel(file_path)

# Convert the date column to yearmonth format and make it a tsibble
load_data <- load_data %>%
  mutate(date = yearmonth(`YYYY-MMM`)) %>%  # Convert to yearmonth format
  rename(load_id = CaseSequence, load = KWH) %>%  # Rename columns for clarity
  select(-`YYYY-MMM`) %>%  # Drop original column
  as_tsibble(index = date)  # Convert to tsibble with date as the index

# Display the first few rows of the updated data
load_data %>%
  head() %>%
  kable(caption = "Residential Customer Load Data (Yearmonth Format)") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Residential Customer Load Data (Yearmonth Format)
load_id load date
733 6862583 1998 Jan
734 5838198 1998 Feb
735 5420658 1998 Mar
736 5010364 1998 Apr
737 4665377 1998 May
738 6467147 1998 Jun

Plot and Visualize the Time Series

The time series plot shows a significant anomaly in July 2010, where the energy load drops sharply. This anomaly, along with a noticeable break in the line plot, indicates both an unusual dip and missing data. To handle these issues, the missing values will be filled using seasonal imputation, where the missing values are imputed based on the median energy load for the same month across different years. The sharp drop in July 2010 will also be replaced by the median value for the same month to ensure consistency and avoid distortion in the time series analysis.

library(ggplot2)
library(dplyr)

# Detect the dip programmatically 
dip_threshold <- 0.5 * mean(load_data$load, na.rm = TRUE)

# Identify rows where the load is significantly below the average 
dip_data <- load_data |>
  filter(load < dip_threshold, format(date, "%Y-%m") == "2010-07")

# Identify missing values in the dataset
missing_data <- load_data |>
  filter(is.na(load))

# Combine dip_data and missing_data into one data frame
combined_data <- bind_rows(dip_data, missing_data)

# Display the combined data frame
combined_data |>
  kable(caption = "Missing value and anomaly") |>
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Missing value and anomaly
load_id load date
861 NA 2008 Sep
883 770523 2010 Jul
# Plot with both missing values and the dip highlighted
ggplot(load_data, aes(x = date, y = load)) +
  geom_line() +
  geom_point(data = dip_data, aes(x = date, y = load), color = "red", size = 3) +  # Highlight the dip
  geom_text(data = dip_data, aes(x = date, y = load, label = paste0(date, "\n", round(load, 0))),
            vjust = 1.0, color = "red", size = 4) +  # Label the dip
  labs(title = "Energy Load Over Time with July 2010 Dip and Missing Values Highlighted",
       y = "Load (kWh)",
       x = "Date")

Impute for Missing Value and Unusual Value

The plot of the imputed time series exhibits a clear seasonal pattern with recurring peaks and troughs, reflecting periodic fluctuations in the energy load.

Key observations:

  • Trend: The plot exhibit a generally increasing trend over time, which needs to be considered for long-term forecasting.
  • Seasonality: There is a clear annual seasonality in both plots, suggesting that any model should account for periodic fluctuations, likely on a yearly basis.
  • Variance: The variance appears to be changing over time (heteroscedasticity), which require Box-Cox transformation to help stabilize.
library(dplyr)
library(lubridate)

# Identify the median load for each month
median_monthly_load <- load_data %>%
  group_by(month = month(date, label = TRUE), year = year(date)) %>%
  summarize(median_load = median(load, na.rm = TRUE))

# Impute missing values using the median of the same month across years
load_data <- load_data %>%
  mutate(
    month = month(date, label = TRUE),
    # Replace missing values with median for the same month
    load = ifelse(is.na(load), 
                  ave(load, month, FUN = function(x) median(x, na.rm = TRUE)), 
                  load),
    # Replace July 2010 dip with the median value for July
    load = ifelse(
      format(date, "%Y-%m") == "2010-07" & load < (0.5 * mean(load, na.rm = TRUE)), 
      median(load[month == "Jul"], na.rm = TRUE), 
      load
    )
  )

# Drop the extra 'month' column added for processing
load_data <- load_data %>% select(-month)

# Plot to verify the correction
ggplot(load_data, aes(x = date, y = load)) +
  geom_line() +
  labs(title = "Energy Load Over Time with Imputed Missing Values and Corrected July 2010 Dip",
       y = "Load (kWh)",
       x = "Date")

Box-Cox Transformation and Seasonal Differencing (Lag 12)

The Box-Cox transformation has been applied to stabilize the variance across the time series, reducing fluctuations in the load over time. After that, the seasonal differencing (Lag 12) has been applied to remove the annual seasonality, effectively making the series more stationary. This process prepares the data for ARIMA modeling by addressing both non-constant variance and seasonal patterns.

# Load necessary libraries
library(fable)
library(ggplot2)
library(gridExtra)
library(dplyr)
library(kableExtra)

# Apply Box-Cox transformation (assuming load_data already exists)
load_data <- load_data %>%
  mutate(
    box_cox_load = box_cox(load, lambda = 0.5),  # Example lambda value
    diff_lag12 = difference(box_cox_load, lag = 12),   # Seasonal differencing (lag = 12)
  )

# Step 1: Time Series Plot for Lag 12 Seasonally Differenced Data
plot1_ts <- load_data %>%
  select(date, diff_lag12) %>%
  drop_na(diff_lag12) %>%
  ggplot(aes(x = date, y = diff_lag12)) +
  geom_line() +
  labs(title = "Lag 12 Seasonally Differenced Load",
       x = "Date", y = "Lag 12 Differenced Load")

# Step 2: Plot the Box-Cox transformed data
plot2_ts <- load_data %>%
  ggplot(aes(x = date, y = box_cox_load)) +
  geom_line() +
  labs(title = "Box-Cox Transformed Load",
       x = "Date", y = "Box-Cox Transformed Load")

# Step 4: Combine all time series plots into a grid for comparison
grid.arrange(plot2_ts, plot1_ts, ncol = 1)

# Step 5: Apply Ljung-Box test to the diff_lag12_lag1 series
ljung_box_results <- load_data %>%
  drop_na(diff_lag12) %>%
  features(diff_lag12, ljung_box, lag = 10)  # Apply Ljung-Box test with lag = 10

Decompose the Time Series

The STL decomposition plot shows the following:

  1. Original Series (Box-Cox Transformed Load): The overall energy load pattern with seasonal peaks and troughs, showing a clear repeating seasonal pattern each year.

  2. Trend: The long-term upward trend is generally stable with slight fluctuations, particularly around 2005, where the trend increases before stabilizing.

  3. Seasonal Component: The seasonal pattern is quite regular, with a consistent annual cycle. This indicates that the energy load exhibits strong seasonal effects across years.

  4. Remainder: The remainder component captures the residual variations that aren’t explained by the trend or seasonal components. Noticeable spikes in the remainder, especially in the later periods, indicate unexplained irregularities in the data that may require further investigation.

This decomposition is essential in understanding the distinct components of the data, which will help in selecting and refining models for accurate forecasting. By isolating the trend and seasonality, we can build models that predict the future behavior of the energy load more effectively.

# Load necessary libraries for plotting and STL decomposition
library(fable)
library(fabletools)
library(ggplot2)

# Ensure the data has the required column for STL decomposition
# Assuming the Box-Cox transformed data is already in 'box_cox_load' column
# Perform STL decomposition
decomp_box_cox <- load_data |>
  model(
    stl = STL(box_cox_load ~ trend(window = 13) + season(window = "periodic"), robust = TRUE)
  ) |>
  components()  # Extract the decomposition components

# Plot the decomposition results
autoplot(decomp_box_cox) +
  labs(title = "STL Decomposition of Box-Cox Transformed Load",
       x = "Date", y = "Value")

## Define the Models

Select Models to Fit by Examining the ACF/PACF plots

Based on the ACF and PACF plots for the Box-Cox transformed load (without differencing), the following models are recommended, including differencing to achieve stationarity:

  1. ARIMA(1,1,0): The PACF plot shows a significant spike at lag 1, indicating a strong AR(1) component. First differencing (d=1) is necessary to remove the non-stationarity in the series and make it stationary.

  2. ARIMA(0,1,1): The ACF plot shows a clear spike at lag 1, suggesting an MA(1) component. First differencing (d=1) will be applied to address the non-stationary nature of the data.

  3. SARIMA(1,1,0)(0,1,1)[12]: Seasonal spikes at lag 12 in the ACF plot indicate yearly seasonality. This model will include an AR(1) component for short-term dependencies and a seasonal MA(1) component to account for the yearly seasonal pattern, with both seasonal and non-seasonal differencing.

  4. SARIMA(0,1,1)(0,1,1)[12]: This model, incorporating an MA(1) and seasonal MA(1) component, is suggested based on the ACF spikes at lag 1 and lag 12. Both seasonal and non-seasonal differencing will be included to account for annual seasonality.

  5. SARIMA(1,1,1)(0,1,1)[12]: A mixed model with both AR(1) and MA(1) components, along with seasonal MA(1), is suitable to capture short-term dependencies and yearly seasonality. This model includes both non-seasonal and seasonal differencing.

  6. ETS (Exponential Smoothing State Space Model): ETS models automatically handle trends and seasonality without requiring explicit differencing. It provides an alternative to ARIMA models, which need differencing to stabilize the data, making ETS a flexible choice to account for trend and seasonal components.

These models incorporate necessary non-seasonal and seasonal differencing to address stationarity and capture the autocorrelation patterns identified in the ACF/PACF plots, ensuring a better fit for future forecasting.

# Load necessary library
library(feasts)
library(tsibble)
library(ggplot2)

# Assuming `load_data` contains the time series data and we focus on `box_cox_load`
# Visualize the time series, ACF, and PACF using gg_tsdisplay
# Plot ACF/PACF
load_data |>
  gg_tsdisplay(box_cox_load, plot_type = 'partial', lag_max = 24) +
  labs(
    title = "Time Series, ACF, and PACF of Box-Cox Transformed Load",
    y = "Box-Cox Transformed Load",
    x = "Date"
  )

Train the Models and Select the Best Model

Fit and Select the Best Model

Based on the information criteria (AIC, AICc, and BIC), the sarima111011 model has the lowest values across AIC (2483.402), AICc (2483.632), and BIC (2496.151), indicating it is the best-performing model in terms of model fit. The arima011 and stepwise models follow closely, with slightly higher AIC and BIC values, suggesting they also provide a good fit but are marginally less optimal than sarima111011. The ets model has the highest AIC, AICc, and BIC values, indicating it is the least favorable among the models tested in terms of fit. Therefore, sarima111011 is the most favorable model based on information criteria.

# Load necessary libraries
library(dplyr)
library(fable)
library(tsibble)
library(kableExtra)

# Fit the recommended ARIMA and SARIMA models to the Box-Cox transformed load data
load_fit <- load_data |> 
  model(
    arima110 = ARIMA(box_cox_load ~ pdq(1,1,0)),                   # ARIMA(1,1,0)
    arima011 = ARIMA(box_cox_load ~ pdq(0,1,1)),                   # ARIMA(0,1,1)
    sarima110011 = ARIMA(box_cox_load ~ pdq(1,1,0) + PDQ(0,1,1,12)), # SARIMA(1,1,0)(0,1,1)[12]
    sarima011011 = ARIMA(box_cox_load ~ pdq(0,1,1) + PDQ(0,1,1,12)), # SARIMA(0,1,1)(0,1,1)[12]
    sarima111011 = ARIMA(box_cox_load ~ pdq(1,1,1) + PDQ(0,1,1,12)), # SARIMA(1,1,1)(0,1,1)[12]
    ets = ETS(box_cox_load),                                        # ETS model
    stepwise = ARIMA(box_cox_load),                           # Automatic ARIMA using stepwise search
    search = ARIMA(box_cox_load, stepwise = FALSE)            # Automatic ARIMA using exhaustive search

  )

# Extract the model summary with AICc and BIC values
model_summary <- glance(load_fit) |>
  arrange(AICc) |>
  select(.model, AIC, AICc, BIC)

# Display the model summary using kable
model_summary %>%
  kable(caption = "Model Comparison Summary: AICc, BIC") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Model Comparison Summary: AICc, BIC
.model AIC AICc BIC
sarima111011 2483.402 2483.632 2496.151
arima011 2488.015 2488.245 2500.765
stepwise 2488.314 2488.965 2510.664
search 2488.314 2488.965 2510.664
sarima011011 2492.345 2492.482 2501.907
sarima110011 2526.340 2526.477 2535.902
arima110 2526.787 2527.134 2542.724
ets 3127.800 3130.527 3176.662

Perform Residual Analysis

In conducting the residual analysis for the top 3 models and the ets model, we observe the following:

  1. SARIMA(1,1,1)(0,1,1)[12]: The residual plot shows consistent spread around zero, and the ACF plot indicates low autocorrelation across lags, suggesting that this model captures most of the dependencies. The residuals are fairly normally distributed, showing a strong fit.

  2. Stepwise Model: The residuals exhibit some spread and a few spikes, indicating potential unexplained variation. The ACF plot reveals some mild autocorrelation at a few lags, which may suggest a slightly less accurate model compared to SARIMA.

  3. ARIMA(0,1,1): This model has relatively well-distributed residuals around zero, but the ACF plot shows slightly more autocorrelation at early lags than the SARIMA model, suggesting that it may be less effective in capturing seasonal dependencies.

  4. ETS Model: This model shows very little autocorrelation in the ACF plot, with residuals tightly clustered around zero. However, the histogram of residuals suggests a non-normal distribution, which may indicate that the model is less suited for datasets with a strong seasonal component.

In summary, SARIMA(1,1,1)(0,1,1)[12] shows the most promising fit for capturing both seasonal and non-seasonal patterns. The ETS model also performs well for non-seasonal data but may not fully capture seasonality, while the ARIMA(0,1,1) and Stepwise models show some residual autocorrelation, indicating potentially lower accuracy in seasonal forecasting.

load_fit |>
  select(sarima111011) |>
  gg_tsresiduals() +
  labs(title = "sarima111011 Residuals")

load_fit |>
  select(stepwise) |>
  gg_tsresiduals() +
  labs(title = "stepwise Residuals")

load_fit |>
  select(arima011) |>
  gg_tsresiduals() +
  labs(title = "arima011 Residuals")

load_fit |>
  select(ets) |>
  gg_tsresiduals() +
  labs(title = "ets Residuals")

Ljung Test Result Analysis

In conducting the residual analysis for the top 3 models and the ets model, we observe the following:

The Ljung-Box test results indicate that the “stepwise” model has the highest p-value (0.9333), suggesting that it has the least autocorrelation in its residuals and is the best fit among the models tested, as it passes the test for independence of residuals. The “sarima111011” model also shows a relatively high p-value (0.1302), indicating moderate autocorrelation in its residuals but still within acceptable limits. Conversely, the “arima011” and “ets” models exhibit lower p-values (0.0803 and 0.0010, respectively), suggesting that these models have more significant autocorrelation in their residuals, with the “ets” model showing the highest level of autocorrelation among the selected models. Therefore, the “stepwise” model is the most favorable for capturing the data patterns with minimal residual autocorrelation.

# Load necessary libraries
library(dplyr)
library(fabletools)
library(tsibble)
library(kableExtra)

# Define the models to include in the Ljung-Box test
models_to_include <- c("sarima111011", "arima011", "stepwise", "ets")

# Perform Ljung-Box test for each model's residuals and collect results in a single data frame
ljung_results <- augment(load_fit) |>
  filter(.model %in% models_to_include) |>  # Filter to only the specified models
  group_by(.model) |>
  features(.innov, ljung_box, lag = 12, dof = 3) |>
  ungroup() |>
  select(.model, lb_statistic = lb_stat, lb_p_value = lb_pvalue) |>
  arrange(desc(lb_p_value))  # Sort by p-value in descending order

# Display the Ljung-Box test results in a formatted table
ljung_results %>%
  kable(caption = "Ljung-Box Test Results for Selected Model Residuals (Sorted by Descending p-value)") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Ljung-Box Test Results for Selected Model Residuals (Sorted by Descending p-value)
.model lb_statistic lb_p_value
stepwise 3.645989 0.9331318
sarima111011 13.785551 0.1301607
arima011 15.408077 0.0803196
ets 27.822019 0.0010214

Model Selection Based on Information Criteria, Residual Analysis, and Ljung-Box Test

Based on the combined analysis:

  • Information Criteria: The SARIMA(1,1,1)(0,1,1)[12] (sarima111011) model has the lowest AIC, AICc, and BIC values, making it the most favorable model in terms of fit.

  • Residual Analysis: The SARIMA(1,1,1)(0,1,1)[12] model also shows a good distribution of residuals around zero, with low autocorrelation in the ACF plot, indicating it captures dependencies well.

  • Ljung-Box Test: The Stepwise model has the highest p-value (0.9333), indicating the least residual autocorrelation and passing the test for independence of residuals. However, sarima111011 also has an acceptable p-value (0.1302), suggesting moderate but acceptable residual independence.

Recommended Models

For long-term forecasting, SARIMA(1,1,1)(0,1,1)[12] is the best choice due to its superior fit indicated by information criteria and effective residual behavior. For short-term forecasting or scenarios where minimal autocorrelation is critical, the Stepwise model is slightly better due to its higher Ljung-Box p-value. Thus:

  • Best Overall Model (Long-term): SARIMA(1,1,1)(0,1,1)[12]
  • Best Model for Residual Independence (Short-term): Stepwise

Calculate and save Forecast

# Generate the forecast for the best model for long-term forecasting ('sarima111011')
forecast_load <- load_fit |>
  forecast(h = "12 months") |>
  filter(.model == "sarima111011")

# Extract just the model, date, and the forecast mean without any distribution info
forecast_clean <- forecast_load |>
  as_tibble() |> 
  select(.model, date, .mean)


# Display the forecast data in a clean table format
forecast_clean %>%
  kable(caption = "12-Month Forecasts for sarima111011 Models") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
12-Month Forecasts for sarima111011 Models
.model date .mean
sarima111011 2014 Jan 6459.343
sarima111011 2014 Feb 5866.020
sarima111011 2014 Mar 5304.488
sarima111011 2014 Apr 5002.353
sarima111011 2014 May 4904.648
sarima111011 2014 Jun 5626.898
sarima111011 2014 Jul 6034.145
sarima111011 2014 Aug 6213.165
sarima111011 2014 Sep 5855.638
sarima111011 2014 Oct 4995.141
sarima111011 2014 Nov 4904.060
sarima111011 2014 Dec 5704.108
# Export the cleaned data to CSV
write.csv(forecast_clean, "ForecastLoad.csv", row.names = FALSE)

Plot the Forecast

The 12-month forecast for the SARIMA(1,1,1)(0,1,1)[12] model captures the seasonal patterns observed in the historical data, with projected values maintaining a similar range. The confidence intervals (80% and 95%) suggest a moderate level of uncertainty, widening slightly toward the end of the forecast period. This model effectively accounts for both the trend and seasonal components in the data, providing a reliable prediction with seasonal variations.

 # Plot the forecast against the original data
forecast_load |>
  autoplot(load_data) +
  ggtitle("12-Month Forecast for SARIMA(1,1,1)(0,1,1)[12] Model") +
  ylab("Load") +
  xlab("Date")

Part C: Overview

Introduction to Part C

Load Necessary Libraries

Load the required libraries for data manipulation, time series analysis, and exporting to Excel.

# Load necessary libraries
library(readxl)
library(dplyr)
library(lubridate)
library(tsibble)
library(fable)
library(forecast)
library(openxlsx)

Load the Data

Reads in the Excel files for each pipe. These files contain water flow measurements over time. Ensure that the Date Time column is in date-time format for both datasets. Here is a brief description of the columns:

  • Pipe1 Data:
    • Rows: 1,000
    • Columns: 2
    • Date Time: Timestamps from 2015-10-23 to 2015-11-01
    • WaterFlow: Range from 1.07 to 38.91

-Pipe2 Data - Rows: 1,000 - Columns: 2 - Date Time: Timestamps from 2015-10-23 to 2015-12-03 - WaterFlow: Range from 1.88 to 78.30

Note: Different date spans and flow ranges indicate the need for timestamp alignment before combined analysis.

# Load necessary libraries
library(readxl)
library(dplyr)
library(lubridate)
library(knitr)
library(kableExtra)

# Load and arrange data
pipe1_data <- read_excel("Waterflow_Pipe1.xlsx", col_types = c("numeric", "numeric")) %>%
  arrange(`Date Time`) 

pipe2_data <- read_excel("Waterflow_Pipe2.xlsx", col_types = c("numeric", "numeric")) %>%
  arrange(`Date Time`)

# Define a custom function to convert Excel date serial numbers to POSIXct
convert_excel_date <- function(serial_date) {
  as_datetime(serial_date * 86400, origin = "1899-12-30", tz = "UTC")
}

# Apply the custom function to convert the Date Time column
pipe1_data <- pipe1_data %>%
  mutate(`Date Time` = convert_excel_date(`Date Time`))

pipe2_data <- pipe2_data %>%
  mutate(`Date Time` = convert_excel_date(`Date Time`))

pipe1_data
## # A tibble: 1,000 × 2
##    `Date Time`         WaterFlow
##    <dttm>                  <dbl>
##  1 2015-10-23 00:24:06     23.4 
##  2 2015-10-23 00:40:02     28.0 
##  3 2015-10-23 00:53:51     23.1 
##  4 2015-10-23 00:55:40     30.0 
##  5 2015-10-23 01:19:17      6.00
##  6 2015-10-23 01:23:58     15.9 
##  7 2015-10-23 01:50:05     26.6 
##  8 2015-10-23 01:55:33     33.3 
##  9 2015-10-23 01:59:15     12.4 
## 10 2015-10-23 02:51:51     21.8 
## # ℹ 990 more rows
# Calculate number of rows
num_rows_pipe1 <- nrow(pipe1_data)
num_rows_pipe2 <- nrow(pipe2_data)

# Display summaries with formatted table, including row counts
pipe1_summary <- summary(pipe1_data)
pipe2_summary <- summary(pipe2_data)

# Convert to kable format with row count
kable(pipe1_summary, caption = paste("Pipe1 Data Summary - Rows:", num_rows_pipe1)) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Pipe1 Data Summary - Rows: 1000
Date Time WaterFlow
Min. :2015-10-23 00:24:06.49 Min. : 1.067
1st Qu.:2015-10-25 11:21:35.49 1st Qu.:13.683
Median :2015-10-27 20:07:30.89 Median :19.880
Mean :2015-10-27 20:49:15.93 Mean :19.897
3rd Qu.:2015-10-30 08:24:51.07 3rd Qu.:26.159
Max. :2015-11-01 23:35:43.10 Max. :38.913
kable(pipe2_summary, caption = paste("Pipe2 Data Summary - Rows:", num_rows_pipe2)) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Pipe2 Data Summary - Rows: 1000
Date Time WaterFlow
Min. :2015-10-23 01:00:00 Min. : 1.885
1st Qu.:2015-11-02 10:45:00 1st Qu.:28.140
Median :2015-11-12 20:30:00 Median :39.682
Mean :2015-11-12 20:30:00 Mean :39.556
3rd Qu.:2015-11-23 06:15:00 3rd Qu.:50.782
Max. :2015-12-03 16:00:00 Max. :78.303

Visualize the Time Series for Each Pipe

The time series plots for Water Flow - Pipe 1 and Water Flow - Pipe 2 reveal distinct patterns and variations.

  • Pipe 1 shows consistent, moderate fluctuations in water flow over a shorter time span, with values primarily ranging between 0 and 40 units. This indicates a stable flow rate with periodic peaks and troughs, suggesting regular usage or inflow patterns.

  • Pipe 2 spans a longer period and displays greater variability in flow, with values reaching up to 80 units. The flow appears more dynamic with noticeable peaks, likely indicating higher water demand or variability in this system over time.

The contrast between these two pipes—Pipe 1’s steady pattern versus Pipe 2’s high variability—may suggest differences in operational contexts or flow requirements. This analysis provides a foundation for further exploration of flow behaviors and forecasting needs for each pipe.

# Plot for Pipe 1
pipe1_plot <- ggplot(pipe1_data, aes(x = `Date Time`, y = WaterFlow)) +
  geom_line(color = "blue") +
  labs(title = "Water Flow - Pipe 1",
       x = "Time",
       y = "Water Flow") 

# Plot for Pipe 2
pipe2_plot <- ggplot(pipe2_data, aes(x = `Date Time`, y = WaterFlow)) +
  geom_line(color = "red") +
  labs(title = "Water Flow - Pipe 2",
       x = "Time",
       y = "Water Flow") 

# Arrange the plots vertically
grid.arrange(pipe1_plot, pipe2_plot , ncol = 1)

Aggregate Data by Hour

Convert timestamps to hourly data by rounding down to the nearest hour and then calculating the mean for each hour.

The table and plots provide insights into the data collection frequency and transformations applied to the water flow measurements for each pipe.

For Pipe 1: - Initial data points: 1000 - Hourly mean data points after aggregation: 236 - Average interval between measurements: approximately 14.37 minutes - Median interval: 10.02 minutes

For Pipe 2: - Initial data points: 1000 - Hourly mean data points after aggregation: 667 - Average interval between measurements: 60 minutes - Median interval: 60 minutes

This indicates that Pipe 1 measurements were collected at a higher frequency than Pipe 2, resulting in fewer aggregated hourly data points when averaged per hour. The higher frequency in Pipe 1’s measurements contributed to more granular data, whereas Pipe 2’s hourly interval aligns with its lower frequency of data collection.

# Load necessary libraries
library(dplyr)
library(ggplot2)
library(lubridate)
library(gridExtra)
library(knitr)
library(kableExtra)

# Calculate number of rows before mean flow calculation
num_points_before_pipe1 <- nrow(pipe1_data)
num_points_before_pipe2 <- nrow(pipe2_data)

# Convert 'Date Time' to hourly timestamps and calculate the hourly mean of 'WaterFlow'
pipe1_hourly  <- pipe1_data %>%
  mutate(Timestamp = floor_date(`Date Time`, "hour")) %>%  # Floor to the nearest hour
  group_by(Timestamp) %>%
  summarize(Mean_WaterFlow = mean(WaterFlow, na.rm = TRUE)) %>%  # Calculate the mean of 'WaterFlow'
  ungroup()  # Remove grouping to avoid issues in further steps

pipe2_hourly  <- pipe2_data %>%
  mutate(Timestamp = floor_date(`Date Time`, "hour")) %>%
  group_by(Timestamp) %>%
  summarize(Mean_WaterFlow = mean(WaterFlow, na.rm = TRUE)) %>%
  ungroup()

# Calculate number of rows after mean flow calculation
num_points_after_pipe1 <- nrow(pipe1_hourly)
num_points_after_pipe2 <- nrow(pipe2_hourly)

# Calculate time intervals for Pipe 1
pipe1_intervals <- pipe1_data %>%
  arrange(`Date Time`) %>%
  mutate(Interval = difftime(`Date Time`, lag(`Date Time`), units = "mins")) %>%
  filter(!is.na(Interval))  # Remove the first NA caused by lag

# Calculate time intervals for Pipe 2
pipe2_intervals <- pipe2_data %>%
  arrange(`Date Time`) %>%
  mutate(Interval = difftime(`Date Time`, lag(`Date Time`), units = "mins")) %>%
  filter(!is.na(Interval))  # Remove the first NA caused by lag

# Create a data frame to display the counts and intervals for each pipe
data_points_df <- data.frame(
  Pipe = c("Pipe 1", "Pipe 2"),
  Data_Points_Before = c(num_points_before_pipe1, num_points_before_pipe2),
  Data_Points_After = c(num_points_after_pipe1, num_points_after_pipe2),
  Avg_Interval = c(mean(pipe1_intervals$Interval), mean(pipe2_intervals$Interval)),
  Median_Interval = c(median(pipe1_intervals$Interval), median(pipe2_intervals$Interval))
)

# Display the data frame with formatted table
data_points_df |>
  kable(caption = "Summary of Data Points and Time Intervals Between Measurements for Each Pipe") |>
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Summary of Data Points and Time Intervals Between Measurements for Each Pipe
Pipe Data_Points_Before Data_Points_After Avg_Interval Median_Interval
Pipe 1 1000 236 14.36598 mins 10.01989 mins
Pipe 2 1000 667 60.00000 mins 60.00000 mins
# Plot for Pipe 1
pipe1_hourly_plot <- ggplot(pipe1_hourly, aes(x = Timestamp, y = Mean_WaterFlow)) +
  geom_line(color = "blue") +
  labs(title = "Hourly Mean Water Flow - Pipe 1",
       x = "Timestamp",
       y = "Mean Water Flow") 

# Plot for Pipe 2
pipe2_hourly_plot <- ggplot(pipe2_hourly, aes(x = Timestamp, y = Mean_WaterFlow)) +
  geom_line(color = "red") +
  labs(title = "Hourly Mean Water Flow - Pipe 2",
       x = "Timestamp",
       y = "Mean Water Flow") 

# Arrange the plots vertically
grid.arrange(pipe1_hourly_plot, pipe2_hourly_plot , ncol = 1)

Check Stationarity with the Augmented Dickey-Fuller Test

Convert the merged data to a tsibble format and check each series for stationarity using the ADF test.

Based on the Augmented Dickey-Fuller (ADF) test results for each pipe, here is the interpretation:

  1. Pipe 1:
    • Test Statistic: The ADF test statistic is significantly negative, indicating a strong likelihood of stationarity.
    • Critical Values: Since the test statistic is lower than the critical values at the 1%, 5%, and 10% levels, we reject the null hypothesis of a unit root.
    • Conclusion: Pipe 1’s series appears stationary based on the ADF test, making it suitable for ARIMA modeling.
  2. Pipe 2:
    • Test Statistic: The ADF test statistic for Pipe 2 is also significantly negative.
    • Critical Values: Like Pipe 1, the test statistic for Pipe 2 is lower than all critical values, allowing us to reject the null hypothesis.
    • Conclusion: Pipe 2’s series is also stationary, suitable for ARIMA modeling.

The stationarity confirmed by these results indicates that both series can proceed to forecasting without further differencing adjustments.

# Load the urca package if not already loaded
library(urca)
library(dplyr)
# Step 3: Convert to tsibble format for time series analysis
pipe1_tsibble <- as_tsibble(pipe1_hourly, index = Timestamp) %>%
  fill_gaps()

pipe2_tsibble <- as_tsibble(pipe2_hourly, index = Timestamp) %>%
  fill_gaps()

# Step 4: Test for stationarity using ADF test
adf_test_pipe1 <- ur.df(pipe1_hourly$Mean_WaterFlow, type = "drift")
adf_test_pipe1 <- ur.df(pipe2_hourly$Mean_WaterFlow, type = "drift")


summary(adf_test_pipe1)
## 
## ############################################### 
## # Augmented Dickey-Fuller Test Unit Root Test # 
## ############################################### 
## 
## Test regression drift 
## 
## 
## Call:
## lm(formula = z.diff ~ z.lag.1 + 1 + z.diff.lag)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -37.732  -9.229   0.298   9.370  37.233 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 38.55674    2.29084  16.831   <2e-16 ***
## z.lag.1     -0.96884    0.05603 -17.292   <2e-16 ***
## z.diff.lag  -0.06832    0.03888  -1.757   0.0793 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.83 on 662 degrees of freedom
## Multiple R-squared:  0.5207, Adjusted R-squared:  0.5193 
## F-statistic: 359.6 on 2 and 662 DF,  p-value: < 2.2e-16
## 
## 
## Value of test-statistic is: -17.2918 149.5061 
## 
## Critical values for test statistics: 
##       1pct  5pct 10pct
## tau2 -3.43 -2.86 -2.57
## phi1  6.43  4.59  3.78
summary(adf_test_pipe1)
## 
## ############################################### 
## # Augmented Dickey-Fuller Test Unit Root Test # 
## ############################################### 
## 
## Test regression drift 
## 
## 
## Call:
## lm(formula = z.diff ~ z.lag.1 + 1 + z.diff.lag)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -37.732  -9.229   0.298   9.370  37.233 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 38.55674    2.29084  16.831   <2e-16 ***
## z.lag.1     -0.96884    0.05603 -17.292   <2e-16 ***
## z.diff.lag  -0.06832    0.03888  -1.757   0.0793 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.83 on 662 degrees of freedom
## Multiple R-squared:  0.5207, Adjusted R-squared:  0.5193 
## F-statistic: 359.6 on 2 and 662 DF,  p-value: < 2.2e-16
## 
## 
## Value of test-statistic is: -17.2918 149.5061 
## 
## Critical values for test statistics: 
##       1pct  5pct 10pct
## tau2 -3.43 -2.86 -2.57
## phi1  6.43  4.59  3.78

Pipe 1 Modeling and Forecastng

Apply Box-Cox transformation

The top plot shows the Box-Cox transformed Pipe 2 Flow, which stabilizes variance, while the bottom plot shows the filled series, ensuring continuity for accurate forecasting. These steps prepare the data for reliable model fitting.

pipe1_tsibble <- pipe1_tsibble |> 
  mutate(Mean_WaterFlow)  # Adding 1 to avoid zero or negative values

# Calculate the Box-Cox transformation using the adjusted cash values
box_cox_lambda <- pipe1_tsibble |> 
  BoxCox.lambda(Mean_WaterFlow)  # Calculate lambda for Box-Cox

# Apply transformations and differences
pipe1_tsibble <- pipe1_tsibble |> 
  mutate(
    box_cox_flow = BoxCox(Mean_WaterFlow, lambda = box_cox_lambda),  # Apply Box-Cox transformation
  )

# Fill missing values in the Mean_WaterFlow column using linear interpolation
pipe1_tsibble_filled <- pipe1_tsibble %>%
  mutate(Mean_WaterFlow = na_interpolation(Mean_WaterFlow))

# Step 3: Time Series Plot for Box-Cox Transformed Data
p1 <- pipe1_tsibble |>
  ggplot(aes(x = Timestamp, y = box_cox_flow)) +
  geom_line() +
  labs(title = "Box-Cox Transformed Pipe 1 Flow",
       x = "Time", y = "Pipe 1 Flow")

# Step : Time Series Plot for Box-Cox Transformed Data
p2 <- pipe1_tsibble_filled |>
  ggplot(aes(x = Timestamp, y = Mean_WaterFlow)) +
  geom_line() +
  labs(title = "Gap Filled Pipe 1 Flow",
       x = "Time", y = "Pipe 1 Flow")

# Combine the plots using gridExtra with 3 columns and add a title
grid.arrange(
  p1, p2, 
  ncol = 1
)

Decompose the Time Series to reveal any seasonality and trend

The STL decomposition of Pipe 1 water flow data shows:

  1. Trend: A stable pattern with minor fluctuations, indicating no strong long-term trend.
  2. Seasonal (Daily): Clear daily cycles, capturing regular high and low water flow periods each day.
  3. Remainder: Small, random fluctuations around zero, representing noise and minor anomalies.

Overall, the decomposition effectively captures the main structure, suggesting that a model with these components could forecast this data accurately.

# Step 1: Decompose the series using STL
pipe1_stl <- pipe1_tsibble_filled %>%
  model(STL(Mean_WaterFlow ~ trend(window = 7) + season(window = "periodic"), robust = TRUE)) %>%
  components()

# Step 2: Plot the STL components
autoplot(pipe1_stl) +
  labs(title = "STL Decomposition of Pipe 1 Water Flow")

Fit the Models for Pipe 1

Here’s twe fit ARIMA and different configurations of ETS models on the time series data, including ETS(M, A, M), ETS(A, A, M), and ETS(M, Ad, M). This approach allows you to compare the models and select the best one based on criteria such as AIC or forecast performance.

Fit models to each series and forecast

To determine the best model for forecasting Pipe 1 Water Flow, we need to consider the model performance metrics, especially the AIC, AICc, BIC, MSE, and MAE.

Even though the stepwise and search ARIMA models have a lower AICc, they exhibit a much narrower confidence interval, suggesting that they fail to capture the variability seen in the actual water flow data. This lack of variability indicates that these models may underfit the data, leading to less reliable predictions under changing conditions. In contrast, ETS models (such as ets_mam) show wider and more realistic confidence intervals, better reflecting the fluctuations in the observed series, making them more suitable for accurate forecasting in this context.

  1. ARIMA:
    • Lowest AIC and AICc, indicating a good fit.
    • Missing MSE and MAE metrics, limiting full accuracy comparison.
  2. ETS (AAM):
    • Moderate AIC and MSE.
    • Performs well with seasonal patterns but not the lowest errors.
  3. ETS (MADM):
    • Highest AIC, MSE, and MAE among models, indicating weaker performance.
  4. ETS (MAM):
    • Lowest MSE, MAE, and competitive AIC among ETS models.
    • Best at capturing trend and seasonality with minimal error.

Best Model: ETS (MAM) due to balanced low error metrics and solid fit.

# Load necessary libraries
library(fable)
library(feasts)
library(tsibble)
library(dplyr)

# Fit multiple ETS models
pipe1_fit <- pipe1_tsibble_filled %>%
  model(
    ets_mam = ETS(Mean_WaterFlow ~ error("M") + trend("A") + season("M")),  # ETS(M, A, M)
    ets_aam = ETS(Mean_WaterFlow ~ error("A") + trend("A") + season("M")),  # ETS(A, A, M)
    ets_madm = ETS(Mean_WaterFlow ~ error("M") + trend("Ad") + season("M")), # ETS(M, Ad, M)
    stepwise = ARIMA(Mean_WaterFlow),                           # Automatic ARIMA using stepwise search
    search = ARIMA(Mean_WaterFlow, stepwise = FALSE)            # Automatic ARIMA using exhaustive search
  )

# Extract model summary and select the best model based on AIC or AICc
model_summary <- glance(pipe1_fit)

# Display model details to compare AIC, AICc, and BIC
model_summary %>%
  select(-ar_roots, -ma_roots) %>%
  arrange(AICc) %>%
  kable(caption = "ETS Model Comparison Based on AICc") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
ETS Model Comparison Based on AICc
.model sigma2 log_lik AIC AICc BIC MSE AMSE MAE
stepwise 17.8346691 -685.7815 1375.563 1375.614 1382.524 NA NA NA
search 17.9552502 -685.6035 1379.207 1379.377 1393.129 NA NA NA
ets_mam 0.0475971 -995.0379 2048.076 2056.361 2149.014 16.82212 16.86331 0.1618203
ets_aam 18.9938721 -996.0843 2050.169 2058.454 2151.107 16.77792 16.81792 3.2571771
ets_madm 0.0503575 -999.7983 2059.597 2068.496 2164.016 17.24243 17.24774 0.1666085
# Forecast with each ETS model
pipe1_forecasts <- pipe1_fit %>%
  forecast(h = "1 week")

# Plot the forecasts with different models shown in separate rows
autoplot(pipe1_forecasts, pipe1_tsibble_filled) +
  labs(title = "1-Week Forecast for Pipe 1 Water Flow Using ETS Models",
       y = "Mean Water Flow", x = "Timestamp") +
  facet_wrap(~ .model, scales = "free_y", ncol = 1) +  # Arrange in a single column with each model in a row
  theme_minimal()

Plot the forecast for the best model

The forecast for Pipe 1 Water Flow using the best ETS model, ETS(M, A, M), provides a projection with clear confidence intervals. This model, which captures multiplicative error, additive trend, and multiplicative seasonality, shows a reasonable fit, effectively capturing the seasonal variations in the data. The increasing width of the confidence intervals over time reflects the model’s uncertainty in predicting future values. Given the strong seasonal pattern and consistent trend, this model is well-suited to represent the data’s structure, allowing for meaningful future predictions within expected variability.

# Filter the forecasts to include only the best model (ets_mam)
best_forecast <- pipe1_forecasts %>% 
  filter(.model == "ets_mam")

# Plot the forecast for the best model (ets_mam)
autoplot(best_forecast, pipe1_tsibble_filled) +
  labs(
    title = "1-Week Forecast for Pipe 1 Water Flow Using Best Model (ETS(M, A, M))",
    y = "Mean Water Flow", 
    x = "Timestamp"
  )

Pipe 2 Modeling and Forecastng

Apply Box-Cox transformation

The top plot shows the Box-Cox transformed Pipe 2 Flow, which stabilizes variance, while the bottom plot shows the filled series, ensuring continuity for accurate forecasting. These steps prepare the data for reliable model fitting.

pipe2_tsibble <- pipe2_tsibble |> 
  mutate(Mean_WaterFlow)  # Adding 1 to avoid zero or negative values

# Calculate the Box-Cox transformation using the adjusted cash values
box_cox_lambda <- pipe2_tsibble |> 
  BoxCox.lambda(Mean_WaterFlow)  # Calculate lambda for Box-Cox

# Apply transformations and differences
pipe2_tsibble <- pipe2_tsibble |> 
  mutate(
    box_cox_flow = BoxCox(Mean_WaterFlow, lambda = box_cox_lambda),  # Apply Box-Cox transformation
  )

# Fill missing values in the Mean_WaterFlow column using linear interpolation
pipe2_tsibble_filled <- pipe2_tsibble %>%
  mutate(Mean_WaterFlow = na_interpolation(Mean_WaterFlow))

# Step 3: Time Series Plot for Box-Cox Transformed Data
p1 <- pipe2_tsibble |>
  ggplot(aes(x = Timestamp, y = box_cox_flow)) +
  geom_line() +
  labs(title = "Box-Cox Transformed Pipe 2 Flow",
       x = "Time", y = "Pipe 2 Flow")

# Step : Time Series Plot for Box-Cox Transformed Data
p2 <- pipe2_tsibble_filled |>
  ggplot(aes(x = Timestamp, y = Mean_WaterFlow)) +
  geom_line() +
  labs(title = "Gap Filled Pipe 2 Flow",
       x = "Time", y = "Pipe 2 Flow")

# Combine the plots using gridExtra with 3 columns and add a title
grid.arrange(
  p1, p2, 
  ncol = 1
)

Decompose the Time Series to reveal any seasonality and trend

The STL decomposition of Pipe 2 Water Flow reveals the following:

  • Trend: The trend component shows gradual fluctuations over time, indicating underlying long-term changes in water flow.
  • Seasonal (Daily): The season_day component captures daily variations, showing consistent cyclical patterns that repeat each day, characteristic of diurnal changes in water flow.
  • Remainder: The remainder component contains high-frequency noise, capturing short-term fluctuations not explained by the trend or daily seasonality.

Overall, this decomposition highlights a steady trend with strong daily seasonal effects, alongside short-term noise. This structure can inform model selection, suggesting that any forecast model should account for both trend and daily seasonal components.

# Step 1: Decompose the series using STL
pipe2_stl <- pipe1_tsibble_filled %>%
  model(STL(Mean_WaterFlow ~ trend(window = 7) + season(window = "periodic"), robust = TRUE)) %>%
  components()

# Step 2: Plot the STL components
autoplot(pipe2_stl) +
  labs(title = "STL Decomposition of Pipe 2 Water Flow")

Fit the Models for Pipe 2

Here’s we fit ARIMA and different configurations of ETS models on the time series data, including ETS(M, A, M), ETS(A, A, M), and ETS(M, Ad, M). This approach allows you to compare the models and select the best one based on criteria such as AIC or forecast performance.

Fit models to each series and forecast

To determine the best model for forecasting Pipe 1 Water Flow, we need to consider the model performance metrics, especially the AIC, AICc, BIC, MSE, and MAE.

The best model for forecasting Pipe 2 Water Flow based on the provided metrics is ets_madm. This model has a low AICc, BIC, MSE, and MAE compared to the other models. Although the search and stepwise models show lower AICc values, their much larger sigma2 and log-likelihood values suggest poor fit and stability issues. Therefore, ets_madm provides a more reliable forecast with appropriate error and stability characteristics.

# Load necessary libraries
library(fable)
library(feasts)
library(tsibble)
library(dplyr)

# Fit multiple models
pipe2_fit <- pipe2_tsibble_filled %>%
  model(
    ets_mam = ETS(Mean_WaterFlow ~ error("M") + trend("A") + season("M")),  # ETS(M, A, M)
    ets_aam = ETS(Mean_WaterFlow ~ error("A") + trend("A") + season("M")),  # ETS(A, A, M)
    ets_madm = ETS(Mean_WaterFlow ~ error("M") + trend("Ad") + season("M")), # ETS(M, Ad, M)
    stepwise = ARIMA(Mean_WaterFlow),                           # Automatic ARIMA using stepwise search
    search = ARIMA(Mean_WaterFlow, stepwise = FALSE)            # Automatic ARIMA using exhaustive search
  )

# Extract model summary and select the best model based on AIC or AICc
model_summary <- glance(pipe2_fit)

# Display model details to compare AIC, AICc, and BIC
model_summary %>%
  select(-ar_roots, -ma_roots) %>%
  arrange(AICc) %>%
  kable(caption = "ETS Model Comparison Based on AICc") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
ETS Model Comparison Based on AICc
.model sigma2 log_lik AIC AICc BIC MSE AMSE MAE
search 145.5595822 -3902.861 7819.723 7819.836 7854.070 NA NA NA
stepwise 156.7908794 -3942.633 7891.266 7891.290 7905.986 NA NA NA
ets_madm 0.1017133 -5980.418 12020.837 12022.756 12168.070 155.3642 158.8355 0.2533285
ets_aam 163.1633975 -5987.054 12032.108 12033.902 12174.433 158.5948 176.0376 10.0992134
ets_mam 0.1019389 -5987.697 12033.394 12035.188 12175.719 158.5184 162.9869 0.2536301
# Forecast with each ETS model
pipe2_forecasts <- pipe2_fit %>%
  forecast(h = "1 week")

# Plot the forecasts with different models shown in separate rows
autoplot(pipe2_forecasts, pipe2_tsibble_filled) +
  labs(title = "1-Week Forecast for Pipe 2 Water Flow",
       y = "Mean Water Flow", x = "Timestamp") +
  facet_wrap(~ .model, scales = "free_y", ncol = 1) +  # Arrange in a single column with each model in a row
  theme_minimal()

Plot the forecast for the best model

The ETS(M, Ad, M) model provides a one-week forecast for Pipe 2 Water Flow, capturing both multiplicative error and seasonality, with an additive dampened trend. The forecast maintains the high variability observed in the historical data and shows increasing uncertainty over the forecast period, as represented by the widening confidence intervals. This model is suitable for capturing the complexity and fluctuating pattern of water flow in Pipe 2, reflecting both the daily and seasonal variations effectively.

# Filter the forecasts to include only the best model (ets_mam)
best_forecast <- pipe2_forecasts %>% 
  filter(.model == "ets_madm")

# Plot the forecast for the best model (ets_mam)
autoplot(best_forecast, pipe2_tsibble_filled) +
  labs(
    title = "1-Week Forecast for Pipe 2 Water Flow (ETS(M, Ad, M))",
    y = "Mean Water Flow", 
    x = "Timestamp"
  ) +
  theme_minimal()