Load the required libraries
# Load necessary libraries
# Function to check and install packages if not already installed
install_if_missing <- function(packages) {
  missing_packages <- packages[!(packages %in% installed.packages()[, "Package"])]
  if (length(missing_packages)) {
    install.packages(missing_packages)
  }
}
# List of required packages, including fpp3
packages <- c("readxl", "tsibble", "dplyr", "fable", "lubridate", "tseries", "forecast", "zoo", "tidyr", "fpp3", "ggplot2", "gridExtra", "kableExtra", "openxlsx", "imputeTS", "purrr")
# Install missing packages
install_if_missing(packages)
# Load all the required libraries
library(fpp3) # Includes tsibble, fable, and others
## Registered S3 method overwritten by 'tsibble':
## method from
## as_tibble.grouped_df dplyr
## ── Attaching packages ──────────────────────────────────────────── fpp3 1.0.0 ──
## ✔ tibble 3.2.1 ✔ tsibble 1.1.5
## ✔ dplyr 1.1.4 ✔ tsibbledata 0.4.1
## ✔ tidyr 1.3.1 ✔ feasts 0.3.2
## ✔ lubridate 1.9.3 ✔ fable 0.3.4
## ✔ ggplot2 3.5.1 ✔ fabletools 0.4.2
## ── Conflicts ───────────────────────────────────────────────── fpp3_conflicts ──
## ✖ lubridate::date() masks base::date()
## ✖ dplyr::filter() masks stats::filter()
## ✖ tsibble::intersect() masks base::intersect()
## ✖ tsibble::interval() masks lubridate::interval()
## ✖ dplyr::lag() masks stats::lag()
## ✖ tsibble::setdiff() masks base::setdiff()
## ✖ tsibble::union() masks base::union()
library(tidyr)
library(tseries)
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
library(forecast)
library(zoo)
##
## Attaching package: 'zoo'
## The following object is masked from 'package:tsibble':
##
## index
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
library(kableExtra)
##
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
##
## group_rows
library(openxlsx)
library(imputeTS)
##
## Attaching package: 'imputeTS'
## The following object is masked from 'package:zoo':
##
## na.locf
## The following object is masked from 'package:tseries':
##
## na.remove
library(readxl) # Provides read_excel() for importing the data
library(purrr)
In Part A of the project, we analyze ATM cash withdrawal data to develop a model for forecasting future cash demands. The dataset includes records from various ATMs, documenting daily withdrawals over a given period.
A key challenge is managing missing values in fields like ATM identifiers and withdrawal amounts. Accurate data imputation is critical to preserve time series patterns, including seasonality and trends. By employing suitable imputation techniques, we aim to transform the raw data into a well-structured time series for forecasting.
This analysis aims to provide insights into withdrawal behaviors, ensuring ATMs are adequately stocked, enhancing operational efficiency, and optimizing cash replenishment schedules.
A review of the data below shows that the dataset consists of 1,474 rows and 3 columns. Here is a brief description of the columns:

- DATE: the transaction date (datetime format). There are no missing values in this column.
- ATM: the ATM identifier (text format).
- Cash: the withdrawal amount (float64 format). There are 19 missing values in this column, as it has 1,455 non-null entries.

The dataset spans daily transactions, and missing data may need to be addressed before analysis, particularly for forecasting purposes.
# File path
file_path <- "ATM624Data.xlsx"
# Load the file using read_excel
atm_data <- read_excel(file_path, col_types = c("numeric", "text", "numeric"))
atm_data <- atm_data |>
  mutate(DATE = as_date(DATE, origin = "1899-12-30")) |> # Convert Excel serial dates to dates
  rename(atm = ATM, date = DATE, cash = Cash) |> # Rename the columns to lowercase names
  as_tsibble(index = date, key = atm) |> # Convert to tsibble format
  arrange(date)
summary(atm_data) |>
kable(caption = "ATM Data Summary") |>
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
| date | atm | cash |
|---|---|---|
| Min. :2009-05-01 | Length:1474 | Min. : 0.0 |
| 1st Qu.:2009-08-01 | Class :character | 1st Qu.: 0.5 |
| Median :2009-11-01 | Mode :character | Median : 73.0 |
| Mean :2009-10-31 | NA | Mean : 155.6 |
| 3rd Qu.:2010-02-01 | NA | 3rd Qu.: 114.0 |
| Max. :2010-05-14 | NA | Max. :10919.8 |
| NA | NA | NA's :19 |
Identify and quantify the presence of missing values within the dataset, focusing specifically on the cash column and restricting the analysis to data recorded on or before April 30, 2010, which is the end of the observation period. This helps in understanding the scope and distribution of missing data within the given timeframe.
The output below shows a summary of missing values in the dataset, excluding dates after April 30, 2010. Specifically, there are no missing values in the date and atm columns, while there are 5 missing values in the cash column, indicating that a small portion of the withdrawal amounts is missing within the filtered date range.
library(dplyr)
library(knitr)
library(kableExtra)
# Filter data for entries before May 2010
atm_data <- atm_data |>
filter(date <= as.Date("2010-04-30")) # Exclude dates after April 30, 2010
# Check for missing values in all columns of the filtered data
missing_values <- sapply(atm_data, function(x) sum(is.na(x)))
# Convert missing values summary to a data frame
missing_values_df <- data.frame(Column = names(missing_values), Missing_Values = missing_values)
# Display the missing values using kable
missing_values_df |>
kable(caption = "Summary of Missing Values (Excluding Dates After April 30, 2010)") |>
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
| Column | Missing_Values |
|---|---|
| date | 0 |
| atm | 0 |
| cash | 5 |
The output below provides a summary of missing cash values by specific dates, day of the week, and ATM location, focusing on data prior to May 2010. Each row represents a missing data point for a specific ATM on a particular date. This detailed breakdown helps in identifying potential patterns in the missing data, such as whether certain ATMs or days of the week are more prone to missing values. Understanding these patterns is crucial for deciding the most appropriate method for imputing the missing data, such as time-based or location-based imputation strategies.
From the output, we can observe the following potential patterns in the missing data:
ATM-Specific Patterns: The missing data is distributed across two ATMs, ATM1 and ATM2. The occurrences of missing values alternate between these two ATMs, with both having multiple instances of missing cash values.
Temporal Patterns: The missing values are spread across different days of the week, with no immediate concentration on weekends or weekdays. The dates with missing values range from June 13, 2009 (Saturday) to June 24, 2009 (Wednesday), indicating that missing data is not confined to a particular type of day, such as weekends or weekdays.
These observations suggest that the missing data is likely distributed across both ATMs and dates without a strong temporal or ATM-specific bias.
# Filter rows where there are missing values in the 'cash' column and dates are before May 2010
missing_data_summary <- atm_data |>
as_tibble() |>
filter(is.na(cash) & date <= as.Date("2010-04-30")) |> # Focus on rows with missing cash values and date <= April 30, 2010
mutate(day_of_week = wday(date, label = TRUE)) |> # Add a day of the week column
group_by(date, day_of_week, atm) |> # Group by date, day of week, and ATM
summarise(missing_count = sum(is.na(cash)), .groups = "drop") |> # Count missing values and drop grouping after summarizing
arrange(date) # Arrange by date
# Display the results
missing_data_summary |>
kable(caption = "Summary of Missing Values by Date and ATM (Excluding May 2010)") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
| date | day_of_week | atm | missing_count |
|---|---|---|---|
| 2009-06-13 | Sat | ATM1 | 1 |
| 2009-06-16 | Tue | ATM1 | 1 |
| 2009-06-18 | Thu | ATM2 | 1 |
| 2009-06-22 | Mon | ATM1 | 1 |
| 2009-06-24 | Wed | ATM2 | 1 |
The missing values may be due to system downtimes or maintenance, and no transactions were recorded because the ATMs were genuinely not operational. In this scenario, filling missing values with 0 could be a valid approach if the missing data is genuinely the result of downtime or periods where no activity occurred. However, it’s important to note that this can potentially distort the time series data and lead to inaccurate forecasting. An approach which would better preserve the patterns and integrity of the data for forecasting purposes will be used for missing data imputation.
The custom imputation method below fills missing values in the ATM withdrawal data based on the average withdrawals for the same weekday within the same month. By grouping the data by ATM, year-month, and weekday (e.g., all Tuesdays in May), missing values are replaced with the average cash withdrawals for that specific day across the same month.
Why We Use This Method:
This method is ideal for datasets with distinct weekly or monthly patterns, ensuring that imputed values align with expected trends.
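As a minimal illustration of this logic (with hypothetical values), ave() returns the group mean for every element, while ifelse() ensures that only the missing entries are replaced:

# Toy example: three observations for the same weekday within one month (hypothetical values)
x <- c(10, NA, 14)
grp <- factor(c("May-Tue", "May-Tue", "May-Tue"))
# ave() yields the group mean (12) for each element; ifelse() keeps observed values intact
ifelse(is.na(x), ave(x, grp, FUN = function(v) mean(v, na.rm = TRUE)), x)
# Returns: 10 12 14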
library(lubridate)
library(dplyr)
# Group by ATM, year-month, and weekday, then impute missing values for the same weekday in the same month
atm_data <- atm_data %>%
  group_by_key() %>%
  mutate(
    # Extract year-month and weekday information
    year_month = floor_date(date, "month"),
    weekday = wday(date, label = TRUE),
    # Impute missing cash values based on the same weekday in the same month
    cash = ifelse(
      is.na(cash),
      ave(cash, year_month, weekday, FUN = function(x) mean(x, na.rm = TRUE)),
      cash
    )
  ) %>%
  ungroup() # Ungroup the data
# Check for missing values after imputation and get a basic summary of the data
missing_imputed_values <- colSums(is.na(atm_data))
# Display the results
missing_imputed_values %>%
kable(caption = "Summary of Missing Values by Date and ATM (Excluding May 2010)") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
| | x |
|---|---|
| date | 0 |
| atm | 0 |
| cash | 0 |
| year_month | 0 |
| weekday | 0 |
atm1_data <- atm_data |>
filter(atm == "ATM1")
atm2_data <- atm_data |>
filter(atm == "ATM2")
atm3_data <- atm_data |>
filter(atm == "ATM3")
atm4_data <- atm_data |>
filter(atm == "ATM4")
The summary statistics and box plots for each ATM reveal the following:
# Step 1: Convert the tsibble to a tibble
atm_summary_data <- as_tibble(atm_data) # Convert tsibble to tibble
# Step 2: Exclude the DATE column and group by ATM
atm_summary <- atm_summary_data |>
select(-date) |> # Remove the DATE column
group_by(atm) |> # Group by ATM column
summarise(
min = min(cash),
`25%` = quantile(cash, 0.25),
median = median(cash),
mean = mean(cash),
`75%` = quantile(cash, 0.75),
max = max(cash)
)
# Step 3: Display the atm_summary data frame using kable for better presentation
atm_summary |>
kable(caption = "Summary of Imputed ATM Data") |>
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
| atm | min | 25% | median | mean | 75% | max |
|---|---|---|---|---|---|---|
| ATM1 | 1.00000 | 73.0000 | 91.0000 | 84.0675799 | 108.000 | 180.00 |
| ATM2 | 0.00000 | 25.0000 | 66.0000 | 62.3442922 | 93.000 | 147.00 |
| ATM3 | 0.00000 | 0.0000 | 0.0000 | 0.7205479 | 0.000 | 96.00 |
| ATM4 | 1.56326 | 124.3344 | 403.8393 | 474.0433449 | 704.507 | 10919.76 |
# Load required libraries
library(grid)
library(gridExtra)
# Create separate box plots for each ATM
p1 <- atm1_data |>
ggplot(aes(x = atm, y = cash)) +
geom_boxplot() +
labs(y = "Cash Withdrawals")
p2 <- atm2_data |>
ggplot(aes(x = atm, y = cash)) +
geom_boxplot()
p3 <- atm3_data |>
ggplot(aes(x = atm, y = cash)) +
geom_boxplot()
# Plot for ATM4 with outlier labeled
p4 <- atm4_data |>
ggplot(aes(x = atm, y = cash)) +
geom_boxplot(outlier.colour = "red", outlier.shape = 16, outlier.size = 3) +
# Label the specific outlier (10,919.76) - filter the atm_data explicitly
geom_text(
data = atm_data |>
filter(atm == "ATM4", cash > 10919), # Explicitly filter atm_data
aes(label = cash),
vjust = -0.5, color = "blue", size = 3
)
# Plot ATM4 again without the max value
p4_filtered <- atm4_data |>
filter(cash != max(cash, na.rm = TRUE)) |>
ggplot(aes(x = atm, y = cash)) +
geom_boxplot(outlier.colour = "red", outlier.shape = 16, outlier.size = 3)
# Combine the plots using gridExtra with 3 columns and add a title
grid.arrange(
p1, p2, p3, p4, p4_filtered,
ncol = 3,
top = textGrob("Distribution of ATM Withdrawals", gp = gpar(fontsize = 12), just = "left", x = 0)
)
# Replace the maximum Cash value for ATM4 with the average for the same day of the month
atm4_data <- atm4_data %>%
group_by(atm) %>%
mutate(cash = ifelse(cash == max(cash[atm == "ATM4"], na.rm = TRUE),
ave(cash, day(date), FUN = function(x) mean(x, na.rm = TRUE)),
cash
)) %>%
ungroup()
The plot shows the cash withdrawals over time for ATM1. There is a noticeable fluctuation in the cash withdrawals, with periodic peaks and troughs indicating regular variations in activity. The fluctuations suggest possible seasonality or cyclical patterns in the cash withdrawal behavior over the observed period from mid-2009 to early 2010.
atm1_data |>
ggplot(aes(x = date, y = cash)) +
geom_line() +
labs(title = "Cash Withdrawals Over Time - ATM1", x = "Date", y = "Cash Withdrawn")
Box-Cox Transformed ATM1: The Box-Cox transformed plot for ATM1 shows stabilized variance across time. However, clear weekly seasonal patterns and minor trends still persist, indicating that further differencing (e.g., seasonal differencing) will be necessary for effective time series modeling. SARIMA models with appropriate seasonal components would be suitable for capturing these patterns.
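For reference, the Box-Cox transformation applied here (with the +1 shift used below to handle zero withdrawals) follows the standard form

\[
w_t =
\begin{cases}
\dfrac{(y_t + 1)^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\[1ex]
\log(y_t + 1), & \lambda = 0,
\end{cases}
\]

where \(\lambda\) is estimated from the data (BoxCox.lambda() uses Guerrero's method by default).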
# Step 1: Shift cash values by adding a small constant to handle zero values
atm1_data <- atm1_data |>
mutate(cash_adjusted = cash + 1) # Adding 1 to avoid zero or negative values
# Step 2: Estimate the Box-Cox lambda from the adjusted cash values
box_cox_lambda <- BoxCox.lambda(atm1_data$cash_adjusted) # Guerrero method by default
# Apply the Box-Cox transformation
atm1_data <- atm1_data |>
  mutate(
    box_cox_cash = BoxCox(cash_adjusted, lambda = box_cox_lambda) # Apply Box-Cox transformation
  )
# Step 3: Time Series Plot for Box-Cox Transformed Data
atm1_data |>
ggplot(aes(x = date, y = box_cox_cash)) +
geom_line() +
labs(title = "Box-Cox Transformed ATM1",
x = "Time", y = "Box-Cox Transformed ATM1")
The STL decomposition of ATM1 (Box-Cox transformed withdrawals) separates the series into trend, weekly seasonal, and remainder components. Overall, this decomposition confirms the presence of a stable weekly pattern and some trend fluctuations, while the remainder highlights increased randomness towards the end of the observed period.
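On the Box-Cox scale the STL decomposition is additive,

\[
w_t = T_t + S_t + R_t,
\]

where \(T_t\) is the trend, \(S_t\) the weekly seasonal component, and \(R_t\) the remainder.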
# STL Decomposition for ATM1
atm1_data |>
model(
stl = STL(box_cox_cash ~ trend(window = 7) + season(window = "periodic"), robust = TRUE)
) |>
components() |>
autoplot() +
labs(title = "STL Decomposition - ATM1 (Box-Cox Transformed Withdrawals)")
The ACF plot for Box-Cox transformed ATM1 withdrawals shows strong spikes at lags 7, 14, and 21, confirming a weekly seasonality as observed in the seasonal component of the STL decomposition. This ACF pattern reinforces the presence of a clear weekly cycle in the data. These observations indicate that any forecasting model should account for weekly seasonality to effectively capture the periodic withdrawal behavior.
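For reference, the sample autocorrelation at lag \(k\) plotted below is

\[
r_k = \frac{\sum_{t=k+1}^{T} (w_t - \bar{w})(w_{t-k} - \bar{w})}{\sum_{t=1}^{T} (w_t - \bar{w})^2},
\]

so the spikes at \(k = 7, 14, 21\) quantify the strength of the weekly cycle.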
# Generate ACF Plot for ATM1
atm1_data |>
ACF(box_cox_cash) |>
autoplot() +
labs(title = "ACF Plot for Box-Cox Transformed ATM1 Cash Withdrawals ")
Perform differencing to identify the differencing parameters d and D for manually specifying ARIMA and SARIMA models.
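As a cross-check on these choices, the unit-root features from the feasts package (attached via fpp3) can suggest the number of differences; a minimal sketch:

# Suggested number of seasonal differences (D) for the Box-Cox transformed series
atm1_data |> features(box_cox_cash, unitroot_nsdiffs)
# Suggested number of non-seasonal differences (d) after seasonal differencing
atm1_data |> features(difference(box_cox_cash, 7), unitroot_ndiffs)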
The Lag 7 differencing for ATM1 data removes significant seasonality, but the ACF plot still shows notable autocorrelation at lag 7. This suggests that while weekly seasonality is reduced, there may still be some recurring weekly patterns.
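In backshift notation, the lag 7 seasonal difference and the subsequent lag 1 difference applied below are

\[
\nabla_7 w_t = w_t - w_{t-7} = (1 - B^7)\, w_t,
\qquad
\nabla \nabla_7 w_t = (1 - B)(1 - B^7)\, w_t .
\]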
# Apply differencing
atm1_data <- atm1_data |>
mutate(
diff_lag7 = difference(box_cox_cash, lag = 7) # Lag 7 Seasonal differencing
)
# Time Series Plot for Lag 7 Seasonally Differenced Data
plot_ts <- atm1_data |>
drop_na(diff_lag7) |>
autoplot(diff_lag7) +
labs(title = "Lag 7 Seasonally Differenced ATM1",
x = "Time", y = "Lag 7 Differenced ATM1")
# Generate ACF Plot for ATM1
plot_acf <- atm1_data |>
ACF(diff_lag7) |>
autoplot() +
labs(title = "ACF Plot for Lag 7 Differenced ATM1 Cash Withdrawals ")
# Combine the plots into a grid
grid.arrange(plot_ts, plot_acf, ncol = 1)
With Lag 1 on Lag 7 differencing, the data is further stabilized, as seen in the smoother time series and reduced autocorrelation in the ACF plot. However, the Ljung-Box test results in a high \(lb\_stat\) and p-value of 0, indicating that residual autocorrelation remains. This supports the need for a SARIMA model that can accommodate both non-seasonal and seasonal differencing components to effectively model the data’s underlying structure.
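For reference, the Ljung-Box statistic computed below over \(h = 10\) lags is

\[
Q^{*} = T(T+2) \sum_{k=1}^{h} \frac{r_k^2}{T-k},
\]

which is compared against a \(\chi^2\) distribution; a p-value near 0 indicates that significant autocorrelation remains.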
# Apply differencing
atm1_data <- atm1_data |>
mutate(
diff_lag7_lag1 = difference(diff_lag7, lag = 1) # First-order differencing after seasonal differencing
)
# Time Series Plot for Lag 1 on Lag 7 Differenced Data
plot_ts <- atm1_data |>
drop_na(diff_lag7_lag1) |>
autoplot(diff_lag7_lag1) +
labs(title = "Lag 1 on Lag 7 Differenced ATM1",
x = "Time", y = "Lag 1 on Lag 7 Differenced ATM1")
# Generate ACF Plot for ATM1
plot_acf <- atm1_data |>
ACF(diff_lag7_lag1) |>
autoplot() +
labs(title = "ACF Plot for Lag 1 on Lag 7 Differenced ATM1 Cash Withdrawals ")
# Combine the plots into a grid
grid.arrange(plot_ts, plot_acf, ncol = 1)
# Apply Ljung-Box test to the diff_lag1 series for ATM1
ljung_box_results <- atm1_data |>
drop_na(diff_lag7_lag1) |> # Drop NA values before performing Ljung-Box test
features(diff_lag7_lag1, ljung_box, lag = 10) # Apply Ljung-Box test with lag = 10
# Display the Ljung-Box test results
ljung_box_results |>
kable(caption = "ATM1 Ljung Test Result for Lag 1 on Lag 7 Differencing") |>
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
| atm | lb_stat | lb_pvalue |
|---|---|---|
| ATM1 | 143.6979 | 0 |
We will examine the ACF (Autocorrelation Function) and PACF (Partial Autocorrelation Function) plots for selecting appropriate model parameters and understanding the structure of the time series.
Based on the ACF and PACF plots for the Box-Cox transformed cash withdrawals (without differencing), the candidate models fitted below incorporate differencing to achieve stationarity: ARIMA(1,1,0), ARIMA(0,1,1), and the seasonal variants SARIMA(1,1,0)(0,1,1)[7], SARIMA(0,1,1)(0,1,1)[7], and SARIMA(1,1,1)(0,1,1)[7]. These choices capture trend and weekly seasonality effectively.
# Plot ACF/PACF
atm1_data |>
gg_tsdisplay(box_cox_cash, plot_type = 'partial', lag_max = 21)
Split data into train and test sets
Split the data and leave out the last 14 days (up to April 30, 2010) for ATM1 to create training and test sets.
# Define the end date for the training set (14 days before the last date, April 16, 2010)
train_end_date <- as.Date("2010-04-16")
# Split ATM1 data into training and test sets
atm1_train <- atm1_data |> filter(date <= train_end_date)
atm1_test <- atm1_data |> filter(date > train_end_date)
The models are trained on the Box-Cox transformed ATM cash withdrawals data to capture trend, seasonality, and short-term dependencies. Multiple ARIMA, SARIMA, and ETS models are fitted, incorporating non-seasonal differencing (\(d=1\)) and seasonal differencing (\(D=1\)) to address stationarity and weekly seasonality.
Based on the AICc and BIC values in the table below, the best specification is ARIMA(0,0,1)(0,1,2)[7], selected identically by both the stepwise and exhaustive search procedures, as it has the lowest AICc (3169.255) and BIC (3184.499) values among all models listed.
This suggests that the stepwise-selected ARIMA model performs better in terms of information criteria and is the most suitable choice for this data.
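In backshift notation, the selected ARIMA(0,0,1)(0,1,2)[7] model can be written as

\[
(1 - B^7)\, w_t = (1 + \theta_1 B)(1 + \Theta_1 B^7 + \Theta_2 B^{14})\, \varepsilon_t,
\]

where \(w_t\) is the Box-Cox transformed series: a seasonal difference combined with a non-seasonal MA(1) and a seasonal MA(2) term.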
# Load necessary libraries
library(dplyr)
library(fable)
library(tsibble)
library(ggplot2)
library(tidyr)
library(kableExtra)
# Fit ARIMA, SARIMA, and ETS models to the Box-Cox transformed ATM data
atm1_fit <- atm1_train |>
model(
arima110 = ARIMA(box_cox_cash ~ pdq(1,1,0)), # ARIMA(1,1,0) - AR(1) for short-term dependency
arima011 = ARIMA(box_cox_cash ~ pdq(0,1,1)), # ARIMA(0,1,1) - MA(1) for short-term dependency
sarima110011 = ARIMA(box_cox_cash ~ pdq(1,1,0) + PDQ(0,1,1, period = 7)), # SARIMA(1,1,0)(0,1,1)[7] - AR(1) with seasonal MA(1)
sarima011011 = ARIMA(box_cox_cash ~ pdq(0,1,1) + PDQ(0,1,1, period = 7)), # SARIMA(0,1,1)(0,1,1)[7] - MA(1) with seasonal MA(1)
sarima111011 = ARIMA(box_cox_cash ~ pdq(1,1,1) + PDQ(0,1,1, period = 7)), # SARIMA(1,1,1)(0,1,1)[7] - AR(1) and MA(1) with seasonal MA(1)
ets_mam = ETS(box_cox_cash ~ error("M") + trend("A") + season("M")), # ETS(M, A, M) - Multiplicative seasonality with additive trend
ets_aam = ETS(box_cox_cash ~ error("A") + trend("A") + season("M")), # ETS(A, A, M) - Additive trend and error, multiplicative seasonality
ets_madm = ETS(box_cox_cash ~ error("M") + trend("Ad") + season("M")), # ETS(M, Ad, M) - Damped trend with multiplicative seasonality
stepwise = ARIMA(box_cox_cash), # Automatic ARIMA using stepwise search
search = ARIMA(box_cox_cash, stepwise = FALSE) # Automatic ARIMA using exhaustive search
)
# Extract the model summary with AICc and BIC values
model_summary <- glance(atm1_fit) |>
arrange(AICc) |>
select(.model, AIC, AICc, BIC, MSE, AMSE)
# Extract the ARIMA orders (p,d,q) for each model
model_orders <- atm1_fit |>
pivot_longer(cols = -atm, names_to = "Model name", values_to = "Orders") |>
select(`Model name`, Orders)
# Combine the AICc, BIC and ARIMA orders into a single data frame
combined_results <- model_summary |>
left_join(model_orders, by = c(".model" = "Model name"))
# Display the combined data frame using kable
combined_results |>
kable(caption = "Ranked ATM1 Models based on AICc") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
| .model | AIC | AICc | BIC | MSE | AMSE | Orders |
|---|---|---|---|---|---|---|
| stepwise | 3169.137 | 3169.255 | 3184.499 | NA | NA | <ARIMA(0,0,1)(0,1,2)[7]> |
| search | 3169.137 | 3169.255 | 3184.499 | NA | NA | <ARIMA(0,0,1)(0,1,2)[7]> |
| sarima111011 | 3173.805 | 3173.923 | 3189.156 | NA | NA | <ARIMA(1,1,1)(0,1,1)[7]> |
| arima011 | 3179.820 | 3179.938 | 3195.171 | NA | NA | <ARIMA(0,1,1)(0,1,2)[7]> |
| sarima011011 | 3181.156 | 3181.226 | 3192.669 | NA | NA | <ARIMA(0,1,1)(0,1,1)[7]> |
| arima110 | 3293.020 | 3293.138 | 3308.371 | NA | NA | <ARIMA(1,1,0)(0,1,2)[7]> |
| sarima110011 | 3295.772 | 3295.843 | 3307.285 | NA | NA | <ARIMA(1,1,0)(0,1,1)[7]> |
| ets_aam | 4316.458 | 4317.381 | 4362.787 | 583.1468 | 586.3747 | <ETS(A,A,M)> |
| ets_mam | 4396.864 | 4397.787 | 4443.194 | 702.3972 | 713.0573 | <ETS(M,A,M)> |
| ets_madm | 4397.976 | 4399.056 | 4448.166 | 710.7313 | 724.7224 | <ETS(M,Ad,M)> |
The CRPS scores for ATM1’s 14-day forecasts indicate that ARIMA(0,1,1) performs best with the lowest CRPS value (7.53), closely followed by the Search (7.61) and Stepwise (7.61) models. The SARIMA(1,1,1)(0,1,1)[7] and SARIMA(0,1,1)(0,1,1)[7] models also show competitive accuracy with CRPS values around 7.64 and 7.69, respectively. These models provide better short-term forecast accuracy compared to others, with higher CRPS scores for ETS models and additional SARIMA configurations indicating less accurate predictions. This CRPS analysis supports prioritizing ARIMA(0,1,1), Search, and Stepwise models for short-term forecasts.
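For reference, the CRPS compares the entire forecast distribution \(F\) with the realized value \(y\),

\[
\text{CRPS}(F, y) = \int_{-\infty}^{\infty} \bigl( F(x) - \mathbb{1}\{x \ge y\} \bigr)^2 \, dx,
\]

so lower scores indicate forecast distributions concentrated near the observed withdrawals.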
# Load necessary libraries for forecasting accuracy
library(fabletools)
library(dplyr)
library(kableExtra)
# Generate 14-day forecasts for all models
atm1_forecasts <- atm1_fit |>
forecast(h = "14 days")
# Calculate accuracy metrics, including CRPS, for the 14-day forecast and sort by CRPS
atm1_forecast_accuracy <- atm1_forecasts |>
accuracy(atm1_test, list(crps = CRPS)) |>
arrange(crps) # Sort from most to least accurate (lowest to highest CRPS)
# Display sorted accuracy results for each model
atm1_forecast_accuracy |>
kable(caption = "CRPS Scores for 14-Day ATM Withdrawal Forecasts (Sorted by Accuracy)") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
| .model | atm | .type | crps |
|---|---|---|---|
| arima011 | ATM1 | Test | 7.530348 |
| search | ATM1 | Test | 7.612832 |
| stepwise | ATM1 | Test | 7.612832 |
| sarima111011 | ATM1 | Test | 7.644105 |
| sarima011011 | ATM1 | Test | 7.686854 |
| ets_aam | ATM1 | Test | 7.695302 |
| ets_mam | ATM1 | Test | 9.055706 |
| ets_madm | ATM1 | Test | 9.273915 |
| sarima110011 | ATM1 | Test | 16.265857 |
| arima110 | ATM1 | Test | 16.319732 |
Based on both the CRPS scores and AICc values shown in the tables, the following four models are selected for residual analysis: the stepwise ARIMA, the search ARIMA, ARIMA(0,1,1), and SARIMA(1,1,1)(0,1,1)[7].
Based on the residual analysis, the Stepwise ARIMA and Search ARIMA models show the most promising results, with minimal residual bias and low autocorrelation, indicating a good fit to the underlying data patterns. ARIMA(0,1,1) also performs adequately but exhibits slightly higher autocorrelation in the residuals, suggesting it may not capture short-term dependencies as effectively. SARIMA(1,1,1)(0,1,1)[7] manages seasonality well but shows a bit more spread in residuals, indicating slightly higher variance. Overall, Stepwise ARIMA and Search ARIMA models demonstrate the most stable and well-behaved residuals, making them preferable for forecasting.
Analyzing these models’ residuals helps determine the best model for generating reliable future forecasts by assessing residual patterns, autocorrelation, and normality.
atm1_fit |>
select(stepwise) |>
gg_tsresiduals() +
labs(title = "Stepwise Residuals")
atm1_fit |>
select(search) |>
gg_tsresiduals() +
labs(title = "Search Model Residuals")
atm1_fit |>
select(arima011) |>
gg_tsresiduals() +
labs(title = "arima011 Model Residuals")
atm1_fit |>
select(sarima111011) |>
gg_tsresiduals() +
labs(title = "sarima111011 Model Residuals")
Based on the evaluation of models using CRPS scores, AICc/BIC information criteria and Residual Analysis:
Best Model Recommendation
Considering both information criteria (AICc and BIC) and forecast accuracy (CRPS), the stepwise ARIMA model is recommended due to its balance between model complexity and accuracy. However, the ARIMA(0,1,1) model (arima011) could also be preferred if slightly better short-term forecast accuracy is a priority.
The stepwise ARIMA model is used for a month-long forecast to maintain reliability and stability. If short-term accuracy is a priority (e.g., for daily or weekly forecasts), the ARIMA(0,1,1) model could also be considered as a secondary choice.
# Load necessary libraries
library(tsibble)
library(fable)
library(dplyr)
library(tidyr)
library(kableExtra)
# Fit stepwise model to the Box-Cox transformed ATM1 data to generate forecast for May 2010
atm1_forecast_fit <- atm1_data |>
model(
stepwise = ARIMA(box_cox_cash) # Automatic ARIMA using stepwise search
)
# Generate 31-day forecast for the stepwise model
atm1_may_forecasts <- atm1_forecast_fit |>
forecast(h = "31 days")
# Convert atm1_may_forecasts to a regular data frame to avoid tsibble constraints
forecast_means <- as_tibble(atm1_may_forecasts) |>
mutate(
forecast = round(.mean, 1) # Format as rounded mean values
) |>
select(.model, date, forecast) |>
pivot_wider(names_from = .model, values_from = forecast)
# Display the forecast table with formatted means only
forecast_means |>
head() |>
kable(caption = "31-Day Forecasts for ATM1 (Stepwise Model)") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
| date | stepwise |
|---|---|
| 2010-05-01 | 87.2 |
| 2010-05-02 | 104.1 |
| 2010-05-03 | 73.1 |
| 2010-05-04 | 8.1 |
| 2010-05-05 | 100.0 |
| 2010-05-06 | 80.9 |
# Export the data to CSV
write.csv(forecast_means, "atm1_31day_Forecast.csv", row.names = FALSE)
The 31-day forecast for ATM1 cash withdrawals, generated by the stepwise ARIMA model, demonstrates a clear seasonal pattern with daily fluctuations. The zoomed-in forecast view highlights the 80% and 95% prediction intervals, which encapsulate potential variability in the cash withdrawals, reflecting the model's anticipated uncertainty. The forecast suggests a consistent cyclical trend, with withdrawals expected to vary within a relatively stable range. The intervals indicate that the model is confident in predicting the general range of withdrawals while allowing for typical day-to-day variations.
# Load necessary libraries
library(ggplot2)
library(fable)
library(tsibble)
library(gridExtra)
# Fit the best model (stepwise) to the Box-Cox transformed data and generate a 31-day forecast
atm1_forecast_fit <- atm1_data |>
model(stepwise = ARIMA(box_cox_cash))
# Generate the forecast for 31 days
atm1_may_forecast <- atm1_forecast_fit |>
forecast(h = "31 days")
# Extract the forecast period date range
forecast_start <- min(atm1_may_forecast$date)
forecast_end <- max(atm1_may_forecast$date)
# Full view plot including the training and test data
plot_full <- atm1_may_forecast |>
autoplot(atm1_data, level = c(80, 95)) + # Include confidence intervals
xlab("Date") + ylab("ATM Withdrawals") +
ggtitle("31-Day Forecast for ATM1 - Stepwise Model (Full View)") +
theme_minimal()
# Zoomed-in view on the forecast period only
plot_zoom <- atm1_may_forecast |>
autoplot(atm1_data, level = c(80, 95)) +
xlab("Date") + ylab("ATM Withdrawals") +
ggtitle("31-Day Forecast for ATM1 - Stepwise Model (Forecast Period Only)") +
theme_minimal() +
coord_cartesian(xlim = c(forecast_start, forecast_end)) # Set x-axis to forecast range only
# Arrange the plots vertically
grid.arrange(plot_full, plot_zoom, ncol = 1)
The plot shows cash withdrawals over time for ATM2, with significant fluctuations throughout the observed period from mid-2009 to early 2010. There is a noticeable high-frequency pattern, suggesting regular peaks and troughs, which may indicate daily or weekly withdrawal patterns. However, there is no clear upward or downward trend, suggesting that cash withdrawals remained relatively stable on average, with consistent variability around a central level over time.
atm2_data |>
ggplot(aes(x = date, y = cash)) +
geom_line() +
labs(title = "Cash Withdrawals Over Time - ATM2", x = "Date", y = "Cash Withdrawn")
Box-Cox Transformed ATM2: The Box-Cox transformation has been applied to stabilize variance without significantly altering the underlying pattern, suggesting that the transformed data may be more suitable for time series modeling by reducing potential heteroscedasticity. The overall trend remains stable, with no clear long-term increase or decrease in withdrawal levels.
# Step 1: Shift cash values by adding a small constant to handle zero values
atm2_data <- atm2_data |>
mutate(cash_adjusted = cash + 1) # Adding 1 to avoid zero or negative values
# Step 2: Estimate the Box-Cox lambda from the adjusted cash values
box_cox_lambda <- BoxCox.lambda(atm2_data$cash_adjusted) # Guerrero method by default
# Apply the Box-Cox transformation
atm2_data <- atm2_data |>
  mutate(
    box_cox_cash = BoxCox(cash_adjusted, lambda = box_cox_lambda) # Apply Box-Cox transformation
  )
# Step 3: Time Series Plot for Box-Cox Transformed Data
atm2_data |>
ggplot(aes(x = date, y = box_cox_cash)) +
geom_line() +
labs(title = "Box-Cox Transformed ATM2",
x = "Time", y = "Box-Cox Transformed ATM2")
The STL decomposition of ATM2 (Box-Cox transformed withdrawals) breaks down the data into trend, weekly seasonality, and remainder (noise) components.
This decomposition confirms a stable trend with strong weekly seasonality, while the remainder shows increased noise, especially in early 2010.
# STL Decomposition for ATM2
atm2_data |>
model(
stl = STL(box_cox_cash ~ trend(window = 7) + season(window = "periodic"), robust = TRUE)
) |>
components() |>
autoplot() +
labs(title = "STL Decomposition - ATM2 (Box-Cox Transformed Withdrawals)")
# Generate ACF Plot for ATM2
atm2_data |>
ACF(box_cox_cash) |>
autoplot() +
labs(title = "ACF Plot for Box-Cox Transformed ATM2 Cash Withdrawals ")
Perform differencing to identify the differencing parameters d and D for manually specifying ARIMA and SARIMA models.
Lag 7 Differencing for ATM2 removed a significant portion of the seasonality, as shown in the first differenced plot. However, some autocorrelation remains at lag 7 in the ACF, suggesting that additional differencing or seasonal components in the model may be required.
# Apply differencing
atm2_data <- atm2_data |>
mutate(
diff_lag7 = difference(box_cox_cash, lag = 7) # Lag 7 Seasonal differencing
)
# Time Series Plot for Lag 7 Seasonally Differenced Data
plot_ts <- atm2_data |>
drop_na(diff_lag7) |>
autoplot(diff_lag7) +
labs(title = "Lag 7 Seasonally Differenced ATM2",
x = "Time", y = "Lag 7 Differenced ATM2")
# Generate ACF Plot for ATM2
plot_acf <- atm2_data |>
ACF(diff_lag7) |>
autoplot() +
labs(title = "ACF Plot for Lag 7 Differenced ATM2 Cash Withdrawals")
# Combine the plots into a grid
grid.arrange(plot_ts, plot_acf, ncol = 1)
### Non-seasonal Differencing to Determine Non-seasonal Parameter, d
The plots and Ljung-Box test results indicate that lag 1 on lag 7 differencing further stabilizes the series for ATM2, as reflected in the second plot and ACF, showing reduced autocorrelation at smaller lags. The high Ljung-Box statistic with a p-value of 0 suggests residual autocorrelation, emphasizing the need for more complex seasonal modeling, possibly a SARIMA model, to account for the remaining dependencies.
# Apply differencing
atm2_data <- atm2_data |>
mutate(
diff_lag7_lag1 = difference(diff_lag7, lag = 1) # First-order differencing after seasonal differencing
)
# Time Series Plot for Lag 1 on Lag 7 Differenced Data
plot_ts <- atm2_data |>
  drop_na(diff_lag7_lag1) |>
  autoplot(diff_lag7_lag1) +
  labs(title = "Lag 1 on Lag 7 Differenced ATM2",
       x = "Time", y = "Lag 1 on Lag 7 Differenced ATM2")
# Generate ACF Plot for ATM2
plot_acf <- atm2_data |>
ACF(diff_lag7_lag1) |>
autoplot() +
labs(title = "ACF Plot for Lag 1 on Lag 7 Differenced ATM2 Cash Withdrawals ")
# Combine the plots into a grid
grid.arrange(plot_ts, plot_acf, ncol = 1)
# Apply Ljung-Box test to the lag 1 on lag 7 differenced series for ATM2
ljung_box_results <- atm2_data |>
  drop_na(diff_lag7_lag1) |> # Drop NA values before performing the Ljung-Box test
  features(diff_lag7_lag1, ljung_box, lag = 10) # Apply Ljung-Box test with lag = 10
# Display the Ljung-Box test results
ljung_box_results |>
  kable(caption = "ATM2 Ljung-Box Test Result for Lag 1 on Lag 7 Differencing") |>
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
| atm | lb_stat | lb_pvalue |
|---|---|---|
| ATM1 | 143.6979 | 0 |
We will examine the ACF (Autocorrelation Function) and PACF (Partial Autocorrelation Function) plots for selecting appropriate model parameters and understanding the structure of the time series.
Based on the ACF and PACF plots for the Box-Cox transformed ATM2 withdrawals (without differencing), the candidate models fitted below incorporate differencing for stationarity: ARIMA(2,1,0), ARIMA(0,1,2), and the seasonal variants SARIMA(2,1,0)(0,1,1)[7], SARIMA(0,1,2)(0,1,1)[7], and SARIMA(2,1,2)(0,1,1)[7]. These models capture ATM2's trend and weekly seasonality effectively.
# Plot ACF/PACF
atm2_data |>
gg_tsdisplay(box_cox_cash, plot_type = 'partial', lag_max = 21)
Split data into train and test sets
Split the data and leave out the last 14 days (up to April 30, 2010) for ATM2 to create training and test sets.
# Define the end date for the training set (14 days before the last date, April 16, 2010)
train_end_date <- as.Date("2010-04-16")
# Split ATM2 data into training and test sets
atm2_train <- atm2_data |> filter(date <= train_end_date)
atm2_test <- atm2_data |> filter(date > train_end_date)
The models are trained on the Box-Cox transformed ATM2 cash withdrawals data to capture trend, seasonality, and short-term dependencies. Various ARIMA, SARIMA, and ETS models were fitted, with non-seasonal differencing (\(d=1\)) and seasonal differencing (\(D=1\)) to handle stationarity and weekly seasonality.
Based on the AICc and BIC values shown in the table, the best model is the stepwise ARIMA model with parameters ARIMA(2,0,2)(0,1,1)[7], having the lowest AICc (3188.332) and BIC (3211.126) values among the models.
This indicates that the stepwise-selected ARIMA model is the most suitable choice for forecasting ATM2 cash withdrawals, offering the best fit based on information criteria.
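In backshift notation, the selected ARIMA(2,0,2)(0,1,1)[7] model takes the form

\[
(1 - \phi_1 B - \phi_2 B^2)(1 - B^7)\, w_t = (1 + \theta_1 B + \theta_2 B^2)(1 + \Theta_1 B^7)\, \varepsilon_t,
\]

where \(w_t\) is the Box-Cox transformed ATM2 series: a seasonal difference with non-seasonal ARMA(2,2) dynamics and a seasonal MA(1) term.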
# Load necessary libraries
library(dplyr)
library(fable)
library(tsibble)
library(ggplot2)
library(tidyr)
library(kableExtra)
# Fit ARIMA, SARIMA, and ETS models to the Box-Cox transformed ATM2 data
atm2_fit <- atm2_train |>
model(
arima210 = ARIMA(box_cox_cash ~ pdq(2,1,0)), # ARIMA(2,1,0) - AR(2) for short-term dependency
arima012 = ARIMA(box_cox_cash ~ pdq(0,1,2)), # ARIMA(0,1,2) - MA(2) for short-term dependency
sarima210011 = ARIMA(box_cox_cash ~ pdq(2,1,0) + PDQ(0,1,1, period = 7)), # SARIMA(2,1,0)(0,1,1)[7] - AR(2) with seasonal MA(1)
sarima012011 = ARIMA(box_cox_cash ~ pdq(0,1,2) + PDQ(0,1,1, period = 7)), # SARIMA(0,1,2)(0,1,1)[7] - MA(2) with seasonal MA(1)
sarima212011 = ARIMA(box_cox_cash ~ pdq(2,1,2) + PDQ(0,1,1, period = 7)), # SARIMA(2,1,2)(0,1,1)[7] - AR(2) and MA(2) with seasonal MA(1)
ets_mam = ETS(box_cox_cash ~ error("M") + trend("A") + season("M")), # ETS(M, A, M) - Multiplicative seasonality with additive trend
ets_aam = ETS(box_cox_cash ~ error("A") + trend("A") + season("M")), # ETS(A, A, M) - Additive trend and error, multiplicative seasonality
ets_madm = ETS(box_cox_cash ~ error("M") + trend("Ad") + season("M")), # ETS(M, Ad, M) - Damped trend with multiplicative seasonality
stepwise = ARIMA(box_cox_cash), # Automatic ARIMA using stepwise search
search = ARIMA(box_cox_cash, stepwise = FALSE) # Automatic ARIMA using exhaustive search
)
## Warning: 1 error encountered for arima012
## [1] Could not find an appropriate ARIMA model.
## This is likely because automatic selection does not select models with characteristic roots that may be numerically unstable.
## For more details, refer to https://otexts.com/fpp3/arima-r.html#plotting-the-characteristic-roots
# Extract the model summary with AICc and BIC values
model_summary <- glance(atm2_fit) |>
arrange(AICc) |>
select(.model, AIC, AICc, BIC, MSE, AMSE)
# Extract the ARIMA orders (p,d,q) for each model
model_orders <- atm2_fit |>
pivot_longer(cols = -atm, names_to = "Model name", values_to = "Orders") |>
select(`Model name`, Orders)
# Combine the AICc, BIC, and ARIMA orders into a single data frame
combined_results <- model_summary |>
left_join(model_orders, by = c(".model" = "Model name"))
# Display the combined data frame using kable
combined_results |>
kable(caption = "Ranked ATM2 Models based on AICc") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
| .model | AIC | AICc | BIC | MSE | AMSE | Orders |
|---|---|---|---|---|---|---|
| stepwise | 3188.082 | 3188.332 | 3211.126 | NA | NA | <ARIMA(2,0,2)(0,1,1)[7]> |
| search | 3191.233 | 3191.566 | 3218.117 | NA | NA | <ARIMA(5,0,0)(0,1,1)[7]> |
| sarima012011 | 3205.108 | 3205.226 | 3220.459 | NA | NA | <ARIMA(0,1,2)(0,1,1)[7]> |
| sarima212011 | 3205.538 | 3205.788 | 3228.564 | NA | NA | <ARIMA(2,1,2)(0,1,1)[7]> |
| arima210 | 3301.614 | 3301.732 | 3316.965 | NA | NA | <ARIMA(2,1,0)(0,1,1)[7]> |
| sarima210011 | 3301.614 | 3301.732 | 3316.965 | NA | NA | <ARIMA(2,1,0)(0,1,1)[7]> |
| ets_aam | 4363.130 | 4364.053 | 4409.459 | 666.0784 | 665.0423 | <ETS(A,A,M)> |
| ets_mam | 4576.257 | 4577.180 | 4622.587 | 1153.6274 | 1154.7978 | <ETS(M,A,M)> |
| ets_madm | 4632.066 | 4633.146 | 4682.256 | 1362.9499 | 1362.7706 | <ETS(M,Ad,M)> |
The CRPS values indicate that the ETS(A,A,M) model performs the best for short-term forecasts, achieving the lowest CRPS (8.62) among all models tested. The SARIMA(0,1,2)(0,1,1)[7] and SARIMA(2,1,2)(0,1,1)[7] models follow closely with CRPS values of 8.70 and 8.88, respectively. The Search and Stepwise ARIMA models also perform well with CRPS scores below 10, making them viable options. Higher CRPS scores for other models suggest less accuracy in short-term forecasts. These results help prioritize model selection based on forecast accuracy.
# Load necessary libraries for forecasting accuracy
library(fabletools)
library(dplyr)
library(kableExtra)
# Generate 14-day forecasts for all models
atm2_forecasts <- atm2_fit |>
forecast(h = "14 days")
# Calculate accuracy metrics, including CRPS, for the 14-day forecast and sort by CRPS
atm2_forecast_accuracy <- atm2_forecasts |>
accuracy(atm2_test, list(crps = CRPS)) |>
arrange(crps) # Sort from most to least accurate (lowest to highest CRPS)
# Display sorted accuracy results for each model
atm2_forecast_accuracy |>
kable(caption = "CRPS Scores for 14-Day ATM2 Withdrawal Forecasts (Sorted by Accuracy)") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
| .model | atm | .type | crps |
|---|---|---|---|
| ets_aam | ATM2 | Test | 8.621016 |
| sarima012011 | ATM2 | Test | 8.697526 |
| sarima212011 | ATM2 | Test | 8.875338 |
| search | ATM2 | Test | 8.985455 |
| stepwise | ATM2 | Test | 9.225573 |
| arima210 | ATM2 | Test | 14.862258 |
| sarima210011 | ATM2 | Test | 14.862258 |
| ets_madm | ATM2 | Test | 27.398077 |
| ets_mam | ATM2 | Test | 32.840325 |
| arima012 | ATM2 | Test | NaN |
For residual analysis, I will focus on the following models: the stepwise ARIMA, the search ARIMA, SARIMA(0,1,2)(0,1,1)[7], and ETS(A,A,M).
The Stepwise ARIMA and Search ARIMA models show well-centered residuals with low variance and minimal autocorrelation, indicating effective capture of data patterns. SARIMA(0,1,2)(0,1,1)[7] also performs well but has slightly higher autocorrelation. ETS(A,A,M) captures seasonality but shows higher residual variance. Overall, Stepwise and Search ARIMA models are preferred for their balanced residual behavior, making them strong choices for forecasting.
atm2_fit |>
select(stepwise) |>
gg_tsresiduals() +
labs(title = "Stepwise Residuals")
atm2_fit |>
select(sarima012011) |>
gg_tsresiduals() +
labs(title = "sarima012011 Model Residuals")
atm2_fit |>
select(ets_aam) |>
gg_tsresiduals() +
labs(title = "ets_aam Model Residuals")
atm2_fit |>
select(search) |>
gg_tsresiduals() +
labs(title = "Search Model Residuals")
Based on the residual analysis, AICc, BIC, and CRPS values, the following models are recommended:
Short-Term Forecasting: The ETS(A,A,M) model is ideal for short-term forecasts due to its low CRPS (8.62), suggesting good accuracy over a short forecasting horizon (its state-space form is shown after this list).
Long-Term Forecasting: The Stepwise ARIMA (2,0,2)(0,1,1)[7] model performs well in terms of AICc and BIC, making it a suitable choice for longer-term forecasts, where stability and consistency are crucial.
These selections balance forecast accuracy with model complexity and are based on the observed performance in residual behavior, information criteria, and continuous ranked probability scores.
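For reference, the recommended short-term model ETS(A,A,M) has the standard innovations state-space form with weekly period \(m = 7\):

\[
\begin{aligned}
y_t &= (\ell_{t-1} + b_{t-1})\, s_{t-m} + \varepsilon_t \\
\ell_t &= \ell_{t-1} + b_{t-1} + \alpha\, \varepsilon_t / s_{t-m} \\
b_t &= b_{t-1} + \beta\, \varepsilon_t / s_{t-m} \\
s_t &= s_{t-m} + \gamma\, \varepsilon_t / (\ell_{t-1} + b_{t-1}),
\end{aligned}
\]

where \(\ell_t\), \(b_t\), and \(s_t\) are the level, trend, and seasonal states.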
The stepwise ARIMA model is used for a month-long forecast to maintain reliability and stability. If short-term accuracy is a priority (e.g., for daily or weekly forecasts), the ETS(A,A,M) model could also be considered as a secondary choice.
# Load necessary libraries
library(tsibble)
library(fable)
library(dplyr)
library(tidyr)
library(kableExtra)
# Fit stepwise model to the Box-Cox transformed ATM2 data to generate forecast for May 2010
atm2_forecast_fit <- atm2_data |>
model(
stepwise = ARIMA(box_cox_cash) # Automatic ARIMA using stepwise search
)
# Generate 31-day forecast for the stepwise model
atm2_may_forecasts <- atm2_forecast_fit |>
forecast(h = "31 days")
# Convert atm2_may_forecasts to a regular data frame to avoid tsibble constraints
forecast_means <- as_tibble(atm2_may_forecasts) |>
mutate(
forecast = round(.mean, 1) # Format as rounded mean values
) |>
select(.model, date, forecast) |>
pivot_wider(names_from = .model, values_from = forecast)
# Display the forecast table with formatted means only
forecast_means |>
head() |>
kable(caption = "31-Day Forecasts for ATM2 (Stepwise Model)") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
| date | stepwise |
|---|---|
| 2010-05-01 | 66.4 |
| 2010-05-02 | 72.7 |
| 2010-05-03 | 13.6 |
| 2010-05-04 | 5.0 |
| 2010-05-05 | 97.8 |
| 2010-05-06 | 88.7 |
# Export the data to CSV
write.csv(forecast_means, "atm2_31day_Forecast.csv", row.names = FALSE)
The 31-day forecast for ATM2 withdrawals using the stepwise ARIMA model shows consistent weekly patterns with prediction intervals that capture the expected fluctuations. In the zoomed-in view for May, the forecast maintains steady seasonality with tighter prediction intervals, indicating reliable short-term accuracy. The model effectively predicts variations within the 80% and 95% prediction levels, making it a robust choice for capturing periodic withdrawal behavior.
# Load necessary libraries
library(ggplot2)
library(fable)
library(tsibble)
library(gridExtra)
# Fit the best model (stepwise) to the Box-Cox transformed data and generate a 31-day forecast
atm2_forecast_fit <- atm2_data |>
model(stepwise = ARIMA(box_cox_cash))
# Generate the forecast for 31 days
atm2_may_forecast <- atm2_forecast_fit |>
forecast(h = "31 days")
# Extract the forecast period date range
forecast_start <- min(atm2_may_forecast$date)
forecast_end <- max(atm2_may_forecast$date)
# Full view plot including the training and test data
plot_full <- atm2_may_forecast |>
autoplot(atm2_data, level = c(80, 95)) + # Include confidence intervals
xlab("Date") + ylab("ATM Withdrawals") +
ggtitle("31-Day Forecast for ATM2 - Stepwise Model (Full View)") +
theme_minimal()
# Zoomed-in view on the forecast period only
plot_zoom <- atm2_may_forecast |>
autoplot(atm2_data, level = c(80, 95)) +
xlab("Date") + ylab("ATM Withdrawals") +
ggtitle("31-Day Forecast for ATM2 - Stepwise Model (Forecast Period Only)") +
theme_minimal() +
coord_cartesian(xlim = c(forecast_start, forecast_end)) # Set x-axis to forecast range only
# Arrange the plots vertically
grid.arrange(plot_full, plot_zoom, ncol = 1)
This plot displays the cash withdrawals over time for ATM4, showing distinct periodic peaks and troughs, suggesting strong seasonality in the data. Withdrawals tend to spike periodically, with notable peaks around mid-2009 and early 2010, potentially corresponding to recurring events or demand patterns. The variability in withdrawal amounts also appears high, indicating fluctuations in cash usage at this ATM over the analyzed period. This seasonality and variability would likely need to be considered in any forecasting model applied to this data.
atm4_data |>
ggplot(aes(x = date, y = cash)) +
geom_line() +
labs(title = "Cash Withdrawals Over Time - ATM4", x = "Date", y = "Cash Withdrawn")
Box-Cox Transformed ATM4: The Box-Cox transformation helps in making the series more stationary, preparing it for modeling. Peaks and trends are preserved, but the distribution across values appears more balanced, indicating that a Box-Cox transformation is effective in handling non-constant variance often observed in raw time series data.
# Step 1: Shift cash values by adding a small constant to handle zero values
atm4_data <- atm4_data |>
mutate(cash_adjusted = cash + 1) # Adding 1 to avoid zero or negative values
# Step 2: Estimate the Box-Cox lambda from the adjusted cash values
box_cox_lambda <- BoxCox.lambda(atm4_data$cash_adjusted) # Guerrero method by default
# Apply the Box-Cox transformation
atm4_data <- atm4_data |>
  mutate(
    box_cox_cash = BoxCox(cash_adjusted, lambda = box_cox_lambda) # Apply Box-Cox transformation
  )
# Step 3: Time Series Plot for Box-Cox Transformed Data
atm4_data |>
ggplot(aes(x = date, y = box_cox_cash)) +
geom_line() +
labs(title = "Box-Cox Transformed ATM4",
x = "Time", y = "Box-Cox Transformed ATM2")
The STL decomposition of ATM4 (Box-Cox transformed withdrawals) breaks down the data into trend, weekly seasonality, and remainder (noise) components.
This decomposition highlights the strong weekly cycle in withdrawals, with a fluctuating trend and limited unexplained variability.
# STL Decomposition for ATM4
atm4_data |>
model(
stl = STL(box_cox_cash ~ trend(window = 7) + season(window = "periodic"), robust = TRUE)
) |>
components() |>
autoplot() +
labs(title = "STL Decomposition - ATM4 (Box-Cox Transformed Withdrawals)")
The ACF plot for Box-Cox transformed ATM4 cash withdrawals confirms the findings from the STL decomposition. The strong spikes at lags 7, 14, and 21 reinforce the weekly seasonality observed in the STL’s seasonality component, indicating a clear 7-day cycle in withdrawal patterns. This alignment with the STL decomposition underscores the stability and regularity of weekly patterns in the data.
# Generate ACF Plot for ATM4
atm4_data |>
ACF(box_cox_cash) |>
autoplot() +
labs(title = "ACF Plot for Box-Cox Transformed ATM4 Cash Withdrawals ")
Perform differencing to identify the differencing parameters d and D for manually specifying ARIMA and SARIMA models.
The Lag 7 seasonally differenced plot for ATM4 cash withdrawals reduces the original series’ strong weekly pattern, as indicated by a much lower amplitude compared to the undifferenced series. The corresponding ACF plot shows minimal autocorrelation after the first lag, suggesting that the differencing has effectively mitigated seasonality, aligning the data closer to stationarity. This transformation supports further modeling by stabilizing the variance and removing repetitive weekly cycles, consistent with the weekly seasonality identified in previous analyses.
# Apply differencing
atm4_data <- atm4_data |>
mutate(
diff_lag7 = difference(box_cox_cash, lag = 7) # Lag 7 Seasonal differencing
)
# Time Series Plot for Lag 7 Seasonally Differenced Data
plot_ts <- atm4_data |>
  drop_na(diff_lag7) |>
  autoplot(diff_lag7) +
  labs(title = "Lag 7 Seasonally Differenced ATM4",
       x = "Time", y = "Lag 7 Differenced ATM4")
# Generate ACF Plot for ATM4
plot_acf <- atm4_data |>
ACF(diff_lag7) |>
autoplot() +
labs(title = "ACF Plot for Lag 7 Differenced ATM4 Cash Withdrawals")
# Combine the plots into a grid
grid.arrange(plot_ts, plot_acf, ncol = 1)
### Non-seasonal Differencing to Determine Non-seasonal Parameter, d
The analysis of ATM4 cash withdrawals using lag 7 seasonal differencing, followed by lag 1 differencing, shows stable variation around zero, indicating that seasonality has been removed and the series is close to stationary. The ACF plot shows minimal autocorrelation beyond lag 7, suggesting that differencing has effectively reduced correlations. However, the Ljung-Box test results indicate significant residual autocorrelation (\(p\)-value = 0), confirming that additional model refinements, such as an ARIMA or SARIMA model, may be needed to further capture the remaining dependencies.
# Apply differencing
atm4_data <- atm4_data |>
mutate(
diff_lag7_lag1 = difference(diff_lag7, lag = 1) # First-order differencing after seasonal differencing
)
# Time Series Plot for Lag 1 on Lag 7 Differenced Data
plot_ts <- atm4_data |>
  drop_na(diff_lag7_lag1) |>
  autoplot(diff_lag7_lag1) +
  labs(title = "Lag 1 on Lag 7 Differenced ATM4",
       x = "Time", y = "Lag 1 on Lag 7 Differenced ATM4")
# Generate ACF Plot for ATM4
plot_acf <- atm4_data |>
  ACF(diff_lag7_lag1) |>
  autoplot() +
  labs(title = "ACF Plot for Lag 1 on Lag 7 Differenced ATM4 Cash Withdrawals")
# Combine the plots into a grid
grid.arrange(plot_ts, plot_acf, ncol = 1)
# Apply Ljung-Box test to the lag 1 on lag 7 differenced series for ATM4
ljung_box_results <- atm4_data |>
  drop_na(diff_lag7_lag1) |> # Drop NA values before performing the Ljung-Box test
  features(diff_lag7_lag1, ljung_box, lag = 10) # Apply Ljung-Box test with lag = 10
# Display the Ljung-Box test results
ljung_box_results |>
  kable(caption = "ATM4 Ljung-Box Test Result for Lag 1 on Lag 7 Differencing") |>
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
| atm | lb_stat | lb_pvalue |
|---|---|---|
| ATM1 | 143.6979 | 0 |
We will examine the ACF (Autocorrelation Function) and PACF (Partial Autocorrelation Function) plots for selecting appropriate model parameters and understanding the structure of the time series.
Based on the ACF and PACF plots for the Box-Cox transformed ATM4 cash withdrawals (without differencing), the candidate models fitted below incorporate differencing to achieve stationarity: ARIMA(1,1,0), ARIMA(0,1,1), and the seasonal variants SARIMA(1,1,0)(0,1,1)[7], SARIMA(0,1,1)(0,1,1)[7], and SARIMA(1,1,1)(0,1,1)[7]. These models capture the observed weekly seasonality, trend, and short-term dependencies effectively.
# Plot ACF/PACF
atm4_data |>
gg_tsdisplay(box_cox_cash, plot_type = 'partial', lag_max = 21)
Split data into train and test sets
Split the data and leave out the last 14 days (up to April 30, 2010) for ATM4 to create training and test sets.
# Define the end date for the training set (14 days before the last date, April 16, 2010)
train_end_date <- as.Date("2010-04-16")
# Split ATM4 data into training and test sets
atm4_train <- atm4_data |> filter(date <= train_end_date)
atm4_test <- atm4_data |> filter(date > train_end_date)
The models are trained on the Box-Cox transformed ATM4 cash withdrawals data to capture trend, seasonality, and short-term dependencies. Various ARIMA, SARIMA, and ETS models were fitted, with non-seasonal differencing (\(d=1\)) and seasonal differencing (\(D=1\)) to handle stationarity and weekly seasonality.
The model selection process for ATM4 revealed the following insights:
The warning for ARIMA(0,1,1) indicates that no stable model could be estimated, likely because its characteristic roots are numerically unstable, making this specification unsuitable for this dataset.
# Load necessary libraries
library(dplyr)
library(fable)
library(tsibble)
library(ggplot2)
library(tidyr)
library(kableExtra)
# Fit ARIMA, SARIMA, and ETS models to the Box-Cox transformed ATM4 data
atm4_fit <- atm4_train |>
model(
arima110 = ARIMA(box_cox_cash ~ pdq(1,1,0)), # ARIMA(1,1,0) - AR(1) for short-term dependency
arima011 = ARIMA(box_cox_cash ~ pdq(0,1,1)), # ARIMA(0,1,1) - MA(1) for short-term dependency
sarima110011 = ARIMA(box_cox_cash ~ pdq(1,1,0) + PDQ(0,1,1, period = 7)), # SARIMA(1,1,0)(0,1,1)[7] - AR(1) with seasonal MA(1)
sarima011011 = ARIMA(box_cox_cash ~ pdq(0,1,1) + PDQ(0,1,1, period = 7)), # SARIMA(0,1,1)(0,1,1)[7] - MA(1) with seasonal MA(1)
sarima111011 = ARIMA(box_cox_cash ~ pdq(1,1,1) + PDQ(0,1,1, period = 7)), # SARIMA(1,1,1)(0,1,1)[7] - AR(1) and MA(1) with seasonal MA(1)
ets_mam = ETS(box_cox_cash ~ error("M") + trend("A") + season("M")), # ETS(M, A, M) - Multiplicative seasonality with additive trend
ets_aam = ETS(box_cox_cash ~ error("A") + trend("A") + season("M")), # ETS(A, A, M) - Additive trend and error, multiplicative seasonality
ets_madm = ETS(box_cox_cash ~ error("M") + trend("Ad") + season("M")), # ETS(M, Ad, M) - Damped trend with multiplicative seasonality
stepwise = ARIMA(box_cox_cash), # Automatic ARIMA using stepwise search
search = ARIMA(box_cox_cash, stepwise = FALSE) # Automatic ARIMA using exhaustive search
)
## Warning: 1 error encountered for arima011
## [1] Could not find an appropriate ARIMA model.
## This is likely because automatic selection does not select models with characteristic roots that may be numerically unstable.
## For more details, refer to https://otexts.com/fpp3/arima-r.html#plotting-the-characteristic-roots
# Extract the model summary with AICc and BIC values
model_summary <- glance(atm4_fit) |>
arrange(AICc) |>
select(.model, AIC, AICc, BIC, MSE, AMSE)
# Extract the ARIMA orders (p,d,q) for each model
model_orders <- atm4_fit |>
pivot_longer(cols = -atm, names_to = "Model name", values_to = "Orders") |>
select(`Model name`, Orders)
# Combine the AICc, BIC, and ARIMA orders into a single data frame
combined_results <- model_summary |>
left_join(model_orders, by = c(".model" = "Model name"))
# Display the combined data frame using kable
combined_results |>
kable(caption = "Ranked ATM4 Models based on AICc") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
| .model | AIC | AICc | BIC | MSE | AMSE | Orders |
|---|---|---|---|---|---|---|
| sarima111011 | 5007.190 | 5007.308 | 5022.541 | NA | NA | <ARIMA(1,1,1)(0,1,1)[7]> |
| sarima011011 | 5007.626 | 5007.697 | 5019.139 | NA | NA | <ARIMA(0,1,1)(0,1,1)[7]> |
| stepwise | 5118.958 | 5119.027 | 5130.540 | NA | NA | <ARIMA(0,0,0)(1,0,0)[7] w/ mean> |
| search | 5118.958 | 5119.027 | 5130.540 | NA | NA | <ARIMA(0,0,0)(1,0,0)[7] w/ mean> |
| sarima110011 | 5120.322 | 5120.392 | 5131.835 | NA | NA | <ARIMA(1,1,0)(0,1,1)[7]> |
| arima110 | 5236.010 | 5236.126 | 5251.442 | NA | NA | <ARIMA(1,1,0)(2,0,0)[7]> |
| ets_mam | 6171.051 | 6171.974 | 6217.380 | 142759.6 | 143845.0 | <ETS(M,A,M)> |
| ets_madm | 6179.144 | 6180.224 | 6229.334 | 136950.6 | 137808.1 | <ETS(M,Ad,M)> |
| ets_aam | 6206.119 | 6207.042 | 6252.448 | 127018.4 | 127679.9 | <ETS(A,A,M)> |
Based on the AICc and CRPS results for ATM4:
Top Models by AICc: The SARIMA models, specifically SARIMA(1,1,1)(0,1,1)[7] and SARIMA(0,1,1)(0,1,1)[7], show the best performance in terms of AICc, indicating these models effectively capture underlying patterns with minimal complexity.
Top Models by CRPS: The same SARIMA models also have the lowest CRPS scores, demonstrating their accuracy in forecast performance for ATM4 over a 14-day period.
Alternative Models: While the ETS models rank lower in terms of AICc and CRPS, the ETS(M,Ad,M) and ETS(M,A,M) offer competitive forecast accuracy, though with slightly higher CRPS scores.
Overall, SARIMA(1,1,1)(0,1,1)[7] and SARIMA(0,1,1)(0,1,1)[7] are the best choices for reliable ATM4 forecasting based on both AICc and CRPS.
# Load necessary libraries for forecasting accuracy
library(fabletools)
library(dplyr)
library(kableExtra)
# Generate 14-day forecasts for all models
atm4_forecasts <- atm4_fit |>
forecast(h = "14 days")
# Calculate accuracy metrics, including CRPS, for the 14-day forecast and sort by CRPS
atm4_forecast_accuracy <- atm4_forecasts |>
accuracy(atm4_test, list(crps = CRPS)) |>
arrange(crps) # Sort from most to least accurate (lowest to highest CRPS)
# Display sorted accuracy results for each model
atm4_forecast_accuracy |>
kable(caption = "CRPS Scores for 14-Day ATM4 Withdrawal Forecasts (Sorted by Accuracy)") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
.model | atm | .type | crps |
---|---|---|---|
sarima011011 | ATM4 | Test | 151.0016 |
sarima111011 | ATM4 | Test | 151.6449 |
search | ATM4 | Test | 159.4341 |
stepwise | ATM4 | Test | 159.4341 |
ets_madm | ATM4 | Test | 163.5230 |
ets_mam | ATM4 | Test | 171.1137 |
ets_aam | ATM4 | Test | 179.1936 |
sarima110011 | ATM4 | Test | 254.1342 |
arima110 | ATM4 | Test | 262.8403 |
arima011 | ATM4 | Test | NaN |
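As a complementary check (a sketch using the same forecasts and test set as above, not part of the original comparison), point-forecast measures can be requested alongside CRPS in the same accuracy() call:
# Sketch: point accuracy (RMSE, MAE) alongside CRPS for the 14-day forecasts
atm4_forecasts |>
accuracy(atm4_test, list(crps = CRPS, rmse = RMSE, mae = MAE)) |>
arrange(crps)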
For residual analysis, I will focus on the following models:
SARIMA(1,1,1)(0,1,1)[7] and SARIMA(0,1,1)(0,1,1)[7]: Top AICc and CRPS scores indicate strong performance. Residuals will confirm effective seasonal/trend capture.
Stepwise ARIMA: Good CRPS, simpler structure. Residuals will show if it fully captures underlying patterns.
ETS(M, Ad, M): Models multiplicative seasonality with damped trend. Residuals will assess adequacy of seasonal/trend handling.
The SARIMA(1,1,1)(0,1,1)[7] and SARIMA(0,1,1)(0,1,1)[7] models show well-centered residuals with minimal autocorrelation, effectively capturing the weekly seasonality in ATM4 withdrawals. The Stepwise ARIMA model also performs adequately, but it exhibits slightly higher residual variance, indicating some limitations in capturing all patterns. ETS(M, Ad, M) manages to capture multiplicative seasonality but displays the highest residual variance, making it less effective in this context. Overall, SARIMA models are preferred for their balanced residual behavior, making them robust options for forecasting.
# Residual analysis for selected models on ATM4 data
atm4_fit |>
select(sarima111011) |>
gg_tsresiduals() +
labs(title = "SARIMA(1,1,1)(0,1,1)[7] Model Residuals - ATM4")
atm4_fit |>
select(sarima011011) |>
gg_tsresiduals() +
labs(title = "SARIMA(0,1,1)(0,1,1)[7] Model Residuals - ATM4")
atm4_fit |>
select(stepwise) |>
gg_tsresiduals() +
labs(title = "Stepwise Residuals - ATM4")
atm4_fit |>
select(ets_madm) |>
gg_tsresiduals() +
labs(title = "ETS(M, Ad, M) Model Residuals - ATM4")
Based on the residual analysis, AICc, BIC, and CRPS values:
Short-term forecasting: The SARIMA(0,1,1)(0,1,1)[7] model is recommended as it shows the lowest CRPS score among models, indicating high accuracy for shorter forecasting horizons. Its residuals are well-centered with minimal autocorrelation, which aligns well with the observed patterns.
Long-term forecasting: SARIMA(1,1,1)(0,1,1)[7] emerges as the best model due to its low AICc and BIC values, indicating better model fit for extended forecasts. Its residuals also display stability with minimal bias, suggesting it captures the overall data structure effectively.
The sarima111011 model is used for a month-long forecast to maintain reliability and stability. If short-term accuracy is a priority (e.g., for daily or weekly forecasts), the sarima011011 model could also be considered as a secondary choice.
# Load necessary libraries
library(tsibble)
library(fable)
library(dplyr)
library(tidyr)
library(kableExtra)
# Fit sarima111011 model to the Box-Cox transformed ATM4 data to generate forecast for May 2010
atm4_forecast_fit <- atm4_data |>
model(
sarima111011 = ARIMA(box_cox_cash ~ pdq(1,1,1) + PDQ(0,1,1, period = 7)), # SARIMA(1,1,1)(0,1,1)[7] - AR(1) and MA(1) with seasonal MA(1)
)
# Generate a 31-day forecast from the sarima111011 model
atm4_may_forecasts <- atm4_forecast_fit |>
forecast(h = "31 days")
# Convert atm4_may_forecasts to a regular data frame to avoid tsibble constraints
forecast_means <- as_tibble(atm4_may_forecasts) |>
mutate(
forecast = round(.mean, 1) # Format as rounded mean values
) |>
select(.model, date, forecast) |>
pivot_wider(names_from = .model, values_from = forecast)
# Display the forecast table with formatted means only
forecast_means |>
head() |>
kable(caption = "31-Day Forecasts for ATM4 (sarima111011 Model)") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
date | sarima111011 |
---|---|
2010-05-01 | 406.2 |
2010-05-02 | 478.9 |
2010-05-03 | 438.7 |
2010-05-04 | 264.4 |
2010-05-05 | 467.5 |
2010-05-06 | 332.4 |
# Export the data to CSV
write.csv(forecast_means, "atm4_31day_Forecast.csv", row.names = FALSE)
The 31-day forecast for ATM4 using the sarima111011 model shows a stable range, indicating consistent ATM withdrawal levels expected in the near term. The 80% and 95% prediction intervals suggest moderate variability in the forecasts, capturing most potential outcomes with limited extremes. This model projects reliable, centered predictions for upcoming ATM activity.
# Load necessary libraries
library(ggplot2)
library(fable)
library(tsibble)
library(gridExtra)
# Fit the best model (sarima111011) to the Box-Cox transformed data and generate a 31-day forecast
atm4_forecast_fit <- atm4_data |>
model(sarima111011 = ARIMA(box_cox_cash ~ pdq(1,1,1) + PDQ(0,1,1, period = 7)))
# Generate the forecast for 31 days
atm4_may_forecast <- atm4_forecast_fit |>
forecast(h = "31 days")
# Extract the forecast period date range
forecast_start <- min(atm4_may_forecast$date)
forecast_end <- max(atm4_may_forecast$date)
# Full view plot including the training and test data
plot_full <- atm4_may_forecast |>
autoplot(atm4_data, level = c(80, 95)) + # Include confidence intervals
xlab("Date") + ylab("ATM4 Withdrawals") +
ggtitle("31-Day Forecast for ATM4 - sarima111011 Model (Full View)") +
theme_minimal()
# Zoomed-in view on the forecast period only
plot_zoom <- atm4_may_forecast |>
autoplot(atm4_data, level = c(80, 95)) +
xlab("Date") + ylab("ATM4 Withdrawals") +
ggtitle("31-Day Forecast for ATM4 - sarima111011 Model (Forecast Period Only)") +
theme_minimal() +
coord_cartesian(xlim = c(forecast_start, forecast_end)) # Set x-axis to forecast range only
# Arrange the plots vertically
grid.arrange(plot_full, plot_zoom, ncol = 1)
The visualizations for ATM3 show a prolonged period of inactivity with zero cash withdrawals across nearly the full data range, with withdrawals only starting in late April 2010. The focused view captures the final 14 zero values followed by the first non-zero withdrawals, reflecting a sharp shift from inactivity to activity.
# Load necessary libraries
library(dplyr)
library(ggplot2)
library(gridExtra)
# Find the first date with non-zero cash withdrawals
first_nonzero_date <- atm3_data %>%
as_tibble() %>% # drop the tsibble index so summarize() collapses to a single date
filter(cash > 0) %>%
summarize(first_date = min(date)) %>%
pull(first_date)
# Calculate the date range to include the last 14 zeros before withdrawals begin
include_zero_start <- atm3_data %>%
as_tibble() %>% # again collapse to one start date
filter(cash == 0 & date <= first_nonzero_date) %>%
slice_tail(n = 14) %>%
summarize(start_date = min(date)) %>%
pull(start_date)
# Plot the full range of ATM3 data
full_range_plot <- atm3_data %>%
ggplot(aes(x = date, y = cash)) +
geom_line() +
labs(title = "Cash Withdrawals Over Time - ATM3 (Full Range)",
x = "Date", y = "Cash Withdrawn")
# Plot the non-zero data range including the first 14 zeros
nonzero_data_plot <- atm3_data %>%
filter(date >= include_zero_start) %>%
ggplot(aes(x = date, y = cash)) +
geom_line() +
labs(title = "Cash Withdrawals Over Time - ATM3 (Non-Zero Data with First 14 Zeros)",
x = "Date", y = "Cash Withdrawn")
# Combine the two plots with gridExtra
grid.arrange(full_range_plot, nonzero_data_plot, ncol = 1)
tail(atm3_data) %>%
kable(caption = "Non-zero data") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
date | atm | cash | year_month | weekday |
---|---|---|---|---|
2010-04-25 | ATM3 | 0 | 2010-04-01 | Sun |
2010-04-26 | ATM3 | 0 | 2010-04-01 | Mon |
2010-04-27 | ATM3 | 0 | 2010-04-01 | Tue |
2010-04-28 | ATM3 | 96 | 2010-04-01 | Wed |
2010-04-29 | ATM3 | 82 | 2010-04-01 | Thu |
2010-04-30 | ATM3 | 85 | 2010-04-01 | Fri |
Given the data pattern for ATM3—a long period of zeros followed by only a few non-zero withdrawals—there is not enough data to build a reliable forecasting model. Effective time series forecasting usually requires consistent patterns or enough variation to capture seasonality, trend, or other dependencies. In this case:
Lack of Consistent Withdrawals: The sudden shift from zero to non-zero values without a clear trend or seasonality makes it challenging to apply standard forecasting models, which assume continuity or periodicity.
Limited Observations for Non-Zero Values: Forecasting models need a reasonable number of non-zero observations to identify patterns. With only three active data points at the end, any trend inferred from them would likely be an overfit rather than a generalizable pattern.
No Clear Seasonality: With such sparse data, it’s impossible to detect or leverage any weekly or monthly seasonality, which is often essential for ATM cash withdrawal modeling.
This dataset, in its current form, lacks sufficient variability and consistency to support reliable forecasting. In summary, it’s best to avoid trying to forecast this series until there’s more data reflecting regular ATM usage patterns.
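One technique worth noting for intermittent series like this is Croston's method, which models non-zero demand sizes and the gaps between them separately. A minimal sketch with fable's CROSTON() is shown below, assuming atm3_data as constructed above; it is illustrative only, since even Croston's method would be unreliable with just three non-zero observations.
# Sketch: Croston's method for intermittent demand (illustrative only)
atm3_croston <- atm3_data |>
model(croston = CROSTON(cash))
atm3_croston |>
forecast(h = "31 days") |>
head()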
If we must proceed with modeling given the limited data, the following models can technically be used but with important caveats:
Seasonal models are unsuitable for ATM3 because the data lacks sufficient length, consistent repeating patterns, and enough non-zero values to establish a reliable seasonal trend.
The models below are trained on the original data without transformation.
Based on the calculated MAE and RMSE values for each model on ATM3’s non-zero data, the ARIMA(0,1,0) model has the lowest error on 2010-04-28, suggesting it captures the minimal variations well in this sparse dataset. The Naive and Random Walk with Drift models show higher errors, especially on certain days, indicating they may be less suited for reliable forecasting in this context. Overall, ARIMA(0,1,0) is preferable for short-term accuracy, though all models face limitations due to the limited and irregular nature of the data.
# Load necessary libraries
library(dplyr)
library(fable)
library(tsibble)
library(yardstick)
##
## Attaching package: 'yardstick'
## The following object is masked from 'package:forecast':
##
## accuracy
## The following object is masked from 'package:fabletools':
##
## accuracy
# Filter for non-zero data only
nonzero_data <- atm3_data %>%
filter(cash > 0)
# Fit models
atm3_fit <- nonzero_data |>
model(
arima010 = ARIMA(cash ~ pdq(0,1,0)), # Non-seasonal ARIMA(0,1,0)
mean_model = MEAN(cash), # Mean model
naive_model = NAIVE(cash), # Naive model
rw_model = RW(cash), # Random Walk model (equivalent to NAIVE)
rw_drift = RW(cash ~ drift()) # Random Walk with Drift
)
# Forecast for the next few periods (example: 5)
atm3_forecast <- atm3_fit |>
forecast(h = 5)
# Get residuals for each model and calculate accuracy measures
accuracy_measures <- atm3_fit %>%
augment() %>% # Get fitted values and residuals
group_by(.model) %>%
summarise(
mae = mean(abs(.resid), na.rm = TRUE),
rmse = sqrt(mean(.resid^2, na.rm = TRUE))
)
# View the accuracy measures for model comparison
print(accuracy_measures)
## # A tsibble: 15 x 4 [1D]
## # Key: .model [5]
## .model date mae rmse
## <chr> <date> <dbl> <dbl>
## 1 arima010 2010-04-28 0.101 0.101
## 2 arima010 2010-04-29 8.50 8.50
## 3 arima010 2010-04-30 8.50 8.50
## 4 mean_model 2010-04-28 8.33 8.33
## 5 mean_model 2010-04-29 5.67 5.67
## 6 mean_model 2010-04-30 2.67 2.67
## 7 naive_model 2010-04-28 NaN NaN
## 8 naive_model 2010-04-29 14 14
## 9 naive_model 2010-04-30 3 3
## 10 rw_drift 2010-04-28 NaN NaN
## 11 rw_drift 2010-04-29 14 14
## 12 rw_drift 2010-04-30 3 3
## 13 rw_model 2010-04-28 NaN NaN
## 14 rw_model 2010-04-29 14 14
## 15 rw_model 2010-04-30 3 3
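Note that the table above contains one row per model and date because augment() returns a tsibble, so summarise() keeps the date index. A sketch of how to collapse it to a single MAE/RMSE row per model, by dropping the index first:
# Sketch: one overall MAE/RMSE row per model (drop the tsibble index first)
atm3_fit %>%
augment() %>%
as_tibble() %>%
group_by(.model) %>%
summarise(
mae = mean(abs(.resid), na.rm = TRUE),
rmse = sqrt(mean(.resid^2, na.rm = TRUE))
)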
The residual plots for the ARIMA(0,1,0), Random Walk with Drift, and Naive models for ATM3 reflect the very limited number of data points, which makes assessing model adequacy challenging. Given the very small sample size, it is difficult to conduct a meaningful residual analysis, and none of these models can be fully validated. The lack of data points calls for cautious interpretation of any forecasting model for ATM3.
# Plot residuals for each model
atm3_fit |>
select(arima010) |>
gg_tsresiduals() +
labs(title = "Residuals Analysis for arima010 Model - ATM3")
# Plot residuals for each model
atm3_fit |>
select(naive_model) |>
gg_tsresiduals() +
labs(title = "Residuals Analysis for naive_model Model - ATM3")
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 1 row containing non-finite outside the scale range
## (`stat_bin()`).
# Plot residuals for each model
atm3_fit |>
select(rw_drift) |>
gg_tsresiduals() +
labs(title = "Residuals Analysis for rw_drift Model - ATM3")
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 1 row containing non-finite outside the scale range
## (`stat_bin()`).
The 31-day forecast plot for ATM3, based on the fitted models (ARIMA(0,1,0), Mean, Naive, Random Walk, and Random Walk with Drift), highlights the difficulty of making reliable forecasts with limited non-zero data: none of the models provide a stable or realistic forecast. The wide confidence intervals indicate substantial forecast variability, which should prompt consideration of more data collection before any model-based decision-making.
Given the limited non-zero data available, Random Walk with Drift (rw_drift) may be slightly better suited than the other models for short-term forecasting due to its ability to capture a linear trend based on recent values. However, all models show considerable instability, and none stand out as ideal.
# Load necessary libraries
library(ggplot2)
library(fable)
library(tsibble)
library(gridExtra)
# Generate the forecast for 31 days
atm3_may_forecast <- atm3_fit |>
forecast(h = "31 days")
# Full view plot of the historical data with forecasts from all fitted models
plot_full <- atm3_may_forecast |>
autoplot(atm3_data, level = c(80, 95)) + # Include confidence intervals
xlab("Date") + ylab("ATM3 Withdrawals") +
ggtitle("31-Day Forecast for ATM3 - All Fitted Models (Full View)") +
theme_minimal()
# Display the plot
plot_full
# Load necessary libraries
library(tsibble)
library(fable)
library(dplyr)
library(tidyr)
library(kableExtra)
# Fit the rw_drift model to the ATM3 data to generate a forecast for May 2010
atm3_forecast_fit <- atm3_data |>
model(
rw_drift = RW(cash ~ drift()) # Random Walk with Drift
)
# Generate a 31-day forecast from the rw_drift model
atm3_may_forecasts <- atm3_forecast_fit |>
forecast(h = "31 days")
# Convert atm3_may_forecasts to a regular data frame to avoid tsibble constraints
forecast_means <- as_tibble(atm3_may_forecasts) |>
mutate(
forecast = round(.mean, 1) # Format as rounded mean values
) |>
select(.model, date, forecast) |>
pivot_wider(names_from = .model, values_from = forecast)
# Display the forecast table with formatted means only
forecast_means |>
head() |>
kable(caption = "31-Day Forecasts for ATM3 (rw_drift Model)") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
date | rw_drift |
---|---|
2010-05-01 | 85 |
2010-05-02 | 85 |
2010-05-03 | 85 |
2010-05-04 | 85 |
2010-05-05 | 85 |
2010-05-06 | 85 |
# Export the data to CSV
write.csv(forecast_means, "atm3_31day_Forecast.csv", row.names = FALSE)
In Part B of this project, we analyze residential power usage load data to explore patterns and trends over time, focusing on handling missing data and preparing the dataset for time series forecasting. The data presents challenges, such as missing values. Our goal is to address these gaps using appropriate imputation techniques while preserving the time series characteristics, including potential seasonality and trends. By transforming and visualizing the data, we aim to develop a robust forecasting model that can predict future energy demand.
Data Preparation
Load the data
A review of the data shows that the dataset consists of 192 monthly observations, spanning January 1998 through December 2013, in 3 columns. Here is a brief description of the columns: CaseSequence, a sequential case identifier (renamed load_id below); YYYY-MMM, the observation month (converted to a yearmonth index); and KWH, residential power usage in kilowatt-hours (renamed load), which contains one missing value (September 2008).
# Load necessary libraries
library(tsibble)
library(lubridate)
library(readxl)
library(kableExtra)
# Load the file
file_path <- "ResidentialCustomerForecastLoad-624.xlsx"
load_data <- read_excel(file_path)
# Convert the date column to yearmonth format and make it a tsibble
load_data <- load_data %>%
mutate(date = yearmonth(`YYYY-MMM`)) %>% # Convert to yearmonth format
rename(load_id = CaseSequence, load = KWH) %>% # Rename columns for clarity
select(-`YYYY-MMM`) %>% # Drop original column
as_tsibble(index = date) # Convert to tsibble with date as the index
# Display the first few rows of the updated data
load_data %>%
head() %>%
kable(caption = "Residential Customer Load Data (Yearmonth Format)") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
load_id | load | date |
---|---|---|
733 | 6862583 | 1998 Jan |
734 | 5838198 | 1998 Feb |
735 | 5420658 | 1998 Mar |
736 | 5010364 | 1998 Apr |
737 | 4665377 | 1998 May |
738 | 6467147 | 1998 Jun |
The time series plot shows a significant anomaly in July 2010, where the energy load drops sharply. This anomaly, along with a noticeable break in the line plot, indicates both an unusual dip and missing data. To handle these issues, the missing values will be filled using seasonal imputation, where the missing values are imputed based on the median energy load for the same month across different years. The sharp drop in July 2010 will also be replaced by the median value for the same month to ensure consistency and avoid distortion in the time series analysis.
library(ggplot2)
library(dplyr)
# Detect the dip programmatically
dip_threshold <- 0.5 * mean(load_data$load, na.rm = TRUE)
# Identify rows where the load is significantly below the average
dip_data <- load_data |>
filter(load < dip_threshold, format(date, "%Y-%m") == "2010-07")
# Identify missing values in the dataset
missing_data <- load_data |>
filter(is.na(load))
# Combine dip_data and missing_data into one data frame
combined_data <- bind_rows(dip_data, missing_data)
# Display the combined data frame
combined_data |>
kable(caption = "Missing value and anomaly") |>
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
load_id | load | date |
---|---|---|
861 | NA | 2008 Sep |
883 | 770523 | 2010 Jul |
# Plot with both missing values and the dip highlighted
ggplot(load_data, aes(x = date, y = load)) +
geom_line() +
geom_point(data = dip_data, aes(x = date, y = load), color = "red", size = 3) + # Highlight the dip
geom_text(data = dip_data, aes(x = date, y = load, label = paste0(date, "\n", round(load, 0))),
vjust = 1.0, color = "red", size = 4) + # Label the dip
labs(title = "Energy Load Over Time with July 2010 Dip and Missing Values Highlighted",
y = "Load (kWh)",
x = "Date")
The plot of the imputed time series exhibits a clear seasonal pattern with recurring peaks and troughs, reflecting periodic fluctuations in the energy load, with the September 2008 gap filled and the July 2010 dip corrected.
library(dplyr)
library(lubridate)
# Identify the median load for each calendar month across all years (for reference)
median_monthly_load <- load_data %>%
as_tibble() %>% # drop the tsibble index so the summary is one row per month
group_by(month = month(date, label = TRUE)) %>%
summarize(median_load = median(load, na.rm = TRUE))
# Impute missing values using the median of the same month across years
load_data <- load_data %>%
mutate(
month = month(date, label = TRUE),
# Replace missing values with median for the same month
load = ifelse(is.na(load),
ave(load, month, FUN = function(x) median(x, na.rm = TRUE)),
load),
# Replace July 2010 dip with the median value for July
load = ifelse(
format(date, "%Y-%m") == "2010-07" & load < (0.5 * mean(load, na.rm = TRUE)),
median(load[month == "Jul"], na.rm = TRUE),
load
)
)
# Drop the extra 'month' column added for processing
load_data <- load_data %>% select(-month)
# Plot to verify the correction
ggplot(load_data, aes(x = date, y = load)) +
geom_line() +
labs(title = "Energy Load Over Time with Imputed Missing Values and Corrected July 2010 Dip",
y = "Load (kWh)",
x = "Date")
The Box-Cox transformation is applied to stabilize the variance across the time series, reducing fluctuations in the load over time. Seasonal differencing (lag 12) is then applied to remove the annual seasonality, making the series approximately stationary. This process prepares the data for ARIMA modeling by addressing both non-constant variance and seasonal patterns.
# Load necessary libraries
library(fable)
library(ggplot2)
library(gridExtra)
library(dplyr)
library(kableExtra)
# Apply Box-Cox transformation and seasonal differencing (assuming load_data already exists)
load_data <- load_data %>%
mutate(
box_cox_load = box_cox(load, lambda = 0.5), # Fixed example lambda; see the Guerrero sketch below
diff_lag12 = difference(box_cox_load, lag = 12) # Seasonal differencing (lag = 12)
)
# Step 1: Time Series Plot for Lag 12 Seasonally Differenced Data
plot1_ts <- load_data %>%
select(date, diff_lag12) %>%
drop_na(diff_lag12) %>%
ggplot(aes(x = date, y = diff_lag12)) +
geom_line() +
labs(title = "Lag 12 Seasonally Differenced Load",
x = "Date", y = "Lag 12 Differenced Load")
# Step 2: Plot the Box-Cox transformed data
plot2_ts <- load_data %>%
ggplot(aes(x = date, y = box_cox_load)) +
geom_line() +
labs(title = "Box-Cox Transformed Load",
x = "Date", y = "Box-Cox Transformed Load")
# Step 3: Combine both time series plots into a grid for comparison
grid.arrange(plot2_ts, plot1_ts, ncol = 1)
# Step 4: Apply Ljung-Box test to the seasonally differenced (diff_lag12) series
ljung_box_results <- load_data %>%
drop_na(diff_lag12) %>%
features(diff_lag12, ljung_box, lag = 10) # Apply Ljung-Box test with lag = 10
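The lambda of 0.5 above is a fixed example value. As a sketch of a data-driven alternative (not used for the results in this report), the Guerrero method from feasts can estimate lambda directly from the series:
# Sketch: estimate the Box-Cox lambda with the Guerrero method
lambda_guerrero <- load_data |>
features(load, features = guerrero) |>
pull(lambda_guerrero)
lambda_guerrero # compare against the fixed example value of 0.5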
The STL decomposition plot shows the following:
Original Series (Box-Cox Transformed Load): The overall energy load pattern with seasonal peaks and troughs, showing a clear repeating seasonal pattern each year.
Trend: The long-term upward trend is generally stable with slight fluctuations, particularly around 2005, where the trend increases before stabilizing.
Seasonal Component: The seasonal pattern is quite regular, with a consistent annual cycle. This indicates that the energy load exhibits strong seasonal effects across years.
Remainder: The remainder component captures the residual variations that aren’t explained by the trend or seasonal components. Noticeable spikes in the remainder, especially in the later periods, indicate unexplained irregularities in the data that may require further investigation.
This decomposition is essential in understanding the distinct components of the data, which will help in selecting and refining models for accurate forecasting. By isolating the trend and seasonality, we can build models that predict the future behavior of the energy load more effectively.
# Load necessary libraries for plotting and STL decomposition
library(fable)
library(fabletools)
library(ggplot2)
# Ensure the data has the required column for STL decomposition
# Assuming the Box-Cox transformed data is already in 'box_cox_load' column
# Perform STL decomposition
decomp_box_cox <- load_data |>
model(
stl = STL(box_cox_load ~ trend(window = 13) + season(window = "periodic"), robust = TRUE)
) |>
components() # Extract the decomposition components
# Plot the decomposition results
autoplot(decomp_box_cox) +
labs(title = "STL Decomposition of Box-Cox Transformed Load",
x = "Date", y = "Value")
Define the Models
Select Models to Fit by Examining the ACF/PACF plots
Based on the ACF and PACF plots for the Box-Cox transformed load (without differencing), the following models are recommended, including differencing to achieve stationarity:
ARIMA(1,1,0): The PACF plot shows a significant spike at lag 1, indicating a strong AR(1) component. First differencing (d=1) is necessary to remove the non-stationarity in the series and make it stationary.
ARIMA(0,1,1): The ACF plot shows a clear spike at lag 1, suggesting an MA(1) component. First differencing (d=1) will be applied to address the non-stationary nature of the data.
SARIMA(1,1,0)(0,1,1)[12]: Seasonal spikes at lag 12 in the ACF plot indicate yearly seasonality. This model will include an AR(1) component for short-term dependencies and a seasonal MA(1) component to account for the yearly seasonal pattern, with both seasonal and non-seasonal differencing.
SARIMA(0,1,1)(0,1,1)[12]: This model, incorporating an MA(1) and seasonal MA(1) component, is suggested based on the ACF spikes at lag 1 and lag 12. Both seasonal and non-seasonal differencing will be included to account for annual seasonality.
SARIMA(1,1,1)(0,1,1)[12]: A mixed model with both AR(1) and MA(1) components, along with seasonal MA(1), is suitable to capture short-term dependencies and yearly seasonality. This model includes both non-seasonal and seasonal differencing.
ETS (Exponential Smoothing State Space Model): ETS models automatically handle trends and seasonality without requiring explicit differencing. It provides an alternative to ARIMA models, which need differencing to stabilize the data, making ETS a flexible choice to account for trend and seasonal components.
These models incorporate necessary non-seasonal and seasonal differencing to address stationarity and capture the autocorrelation patterns identified in the ACF/PACF plots, ensuring a better fit for future forecasting.
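As a quick cross-check of the differencing orders suggested above (a sketch, not part of the original analysis), feasts' unit-root features estimate the required non-seasonal (d) and seasonal (D) differences:
# Sketch: estimate required differencing orders for the Box-Cox series
load_data |>
features(box_cox_load, list(unitroot_ndiffs, unitroot_nsdiffs))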
# Load necessary library
library(feasts)
library(tsibble)
library(ggplot2)
# Assuming `load_data` contains the time series data and we focus on `box_cox_load`
# Visualize the time series, ACF, and PACF using gg_tsdisplay
# Plot ACF/PACF
load_data |>
gg_tsdisplay(box_cox_load, plot_type = 'partial', lag_max = 24) +
labs(
title = "Time Series, ACF, and PACF of Box-Cox Transformed Load",
y = "Box-Cox Transformed Load",
x = "Date"
)
Based on the information criteria (AIC, AICc, and BIC), the sarima111011 model has the lowest values across AIC (2483.402), AICc (2483.632), and BIC (2496.151), indicating it is the best-performing model in terms of model fit. The arima011 and stepwise models follow closely, with slightly higher AIC and BIC values, suggesting they also provide a good fit but are marginally less optimal than sarima111011. The ets model has the highest AIC, AICc, and BIC values, indicating it is the least favorable among the models tested in terms of fit. Therefore, sarima111011 is the most favorable model based on information criteria.
# Load necessary libraries
library(dplyr)
library(fable)
library(tsibble)
library(kableExtra)
# Fit the recommended ARIMA and SARIMA models to the Box-Cox transformed load data
load_fit <- load_data |>
model(
arima110 = ARIMA(box_cox_load ~ pdq(1,1,0)), # ARIMA(1,1,0)
arima011 = ARIMA(box_cox_load ~ pdq(0,1,1)), # ARIMA(0,1,1)
sarima110011 = ARIMA(box_cox_load ~ pdq(1,1,0) + PDQ(0,1,1,12)), # SARIMA(1,1,0)(0,1,1)[12]
sarima011011 = ARIMA(box_cox_load ~ pdq(0,1,1) + PDQ(0,1,1,12)), # SARIMA(0,1,1)(0,1,1)[12]
sarima111011 = ARIMA(box_cox_load ~ pdq(1,1,1) + PDQ(0,1,1,12)), # SARIMA(1,1,1)(0,1,1)[12]
ets = ETS(box_cox_load), # ETS model
stepwise = ARIMA(box_cox_load), # Automatic ARIMA using stepwise search
search = ARIMA(box_cox_load, stepwise = FALSE) # Automatic ARIMA using exhaustive search
)
# Extract the model summary with AICc and BIC values
model_summary <- glance(load_fit) |>
arrange(AICc) |>
select(.model, AIC, AICc, BIC)
# Display the model summary using kable
model_summary %>%
kable(caption = "Model Comparison Summary: AICc, BIC") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
.model | AIC | AICc | BIC |
---|---|---|---|
sarima111011 | 2483.402 | 2483.632 | 2496.151 |
arima011 | 2488.015 | 2488.245 | 2500.765 |
stepwise | 2488.314 | 2488.965 | 2510.664 |
search | 2488.314 | 2488.965 | 2510.664 |
sarima011011 | 2492.345 | 2492.482 | 2501.907 |
sarima110011 | 2526.340 | 2526.477 | 2535.902 |
arima110 | 2526.787 | 2527.134 | 2542.724 |
ets | 3127.800 | 3130.527 | 3176.662 |
In conducting the residual analysis for the top 3 models and the ets model, we observe the following:
SARIMA(1,1,1)(0,1,1)[12]: The residual plot shows consistent spread around zero, and the ACF plot indicates low autocorrelation across lags, suggesting that this model captures most of the dependencies. The residuals are fairly normally distributed, showing a strong fit.
Stepwise Model: The residuals exhibit some spread and a few spikes, indicating potential unexplained variation. The ACF plot reveals some mild autocorrelation at a few lags, which may suggest a slightly less accurate model compared to SARIMA.
ARIMA(0,1,1): This model has relatively well-distributed residuals around zero, but the ACF plot shows slightly more autocorrelation at early lags than the SARIMA model, suggesting that it may be less effective in capturing seasonal dependencies.
ETS Model: The residuals are fairly tightly clustered around zero, but the histogram suggests a non-normal distribution, and the Ljung-Box test below reveals significant residual autocorrelation, indicating that the model is less suited for this strongly seasonal dataset.
In summary, SARIMA(1,1,1)(0,1,1)[12] shows the most promising fit for capturing both seasonal and non-seasonal patterns. The ETS model also performs well for non-seasonal data but may not fully capture seasonality, while the ARIMA(0,1,1) and Stepwise models show some residual autocorrelation, indicating potentially lower accuracy in seasonal forecasting.
load_fit |>
select(sarima111011) |>
gg_tsresiduals() +
labs(title = "sarima111011 Residuals")
load_fit |>
select(stepwise) |>
gg_tsresiduals() +
labs(title = "stepwise Residuals")
load_fit |>
select(arima011) |>
gg_tsresiduals() +
labs(title = "arima011 Residuals")
load_fit |>
select(ets) |>
gg_tsresiduals() +
labs(title = "ets Residuals")
Applying the Ljung-Box test to the residuals of the same four models, we observe the following:
The Ljung-Box test results indicate that the "stepwise" model has the highest p-value (0.9331), suggesting the least autocorrelation in its residuals; it comfortably passes the test for independence of residuals. The "sarima111011" model also shows an acceptable p-value (0.1302), so its residuals are consistent with white noise at the 5% level. Conversely, the "arima011" and "ets" models exhibit lower p-values (0.0803 and 0.0010, respectively); the "ets" model in particular fails the test, indicating significant residual autocorrelation. Therefore, the "stepwise" model is the most favorable for capturing the data patterns with minimal residual autocorrelation.
# Load necessary libraries
library(dplyr)
library(fabletools)
library(tsibble)
library(kableExtra)
# Define the models to include in the Ljung-Box test
models_to_include <- c("sarima111011", "arima011", "stepwise", "ets")
# Perform Ljung-Box test for each model's residuals and collect results in a single data frame
ljung_results <- augment(load_fit) |>
filter(.model %in% models_to_include) |> # Filter to only the specified models
group_by(.model) |>
features(.innov, ljung_box, lag = 12, dof = 3) |> # dof = 3 matches sarima111011's parameter count; approximate for the other models
ungroup() |>
select(.model, lb_statistic = lb_stat, lb_p_value = lb_pvalue) |>
arrange(desc(lb_p_value)) # Sort by p-value in descending order
# Display the Ljung-Box test results in a formatted table
ljung_results %>%
kable(caption = "Ljung-Box Test Results for Selected Model Residuals (Sorted by Descending p-value)") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
.model | lb_statistic | lb_p_value |
---|---|---|
stepwise | 3.645989 | 0.9331318 |
sarima111011 | 13.785551 | 0.1301607 |
arima011 | 15.408077 | 0.0803196 |
ets | 27.822019 | 0.0010214 |
Based on the combined analysis:
Information Criteria: The SARIMA(1,1,1)(0,1,1)[12] (sarima111011) model has the lowest AIC, AICc, and BIC values, making it the most favorable model in terms of fit.
Residual Analysis: The SARIMA(1,1,1)(0,1,1)[12] model also shows a good distribution of residuals around zero, with low autocorrelation in the ACF plot, indicating it captures dependencies well.
Ljung-Box Test: The Stepwise model has the highest p-value (0.9331), indicating the least residual autocorrelation and passing the test for independence of residuals. However, sarima111011 also has an acceptable p-value (0.1302), so its residuals are likewise consistent with independence.
Recommended Models
For long-term forecasting, SARIMA(1,1,1)(0,1,1)[12] is the best choice due to its superior fit indicated by information criteria and effective residual behavior. For short-term forecasting or scenarios where minimal autocorrelation is critical, the Stepwise model is slightly better due to its higher Ljung-Box p-value. We therefore use sarima111011 to generate the 12-month forecast below.
# Generate the forecast for the best model for long-term forecasting ('sarima111011')
forecast_load <- load_fit |>
forecast(h = "12 months") |>
filter(.model == "sarima111011")
# Extract just the model, date, and the forecast mean without any distribution info
forecast_clean <- forecast_load |>
as_tibble() |>
select(.model, date, .mean)
# Display the forecast data in a clean table format
forecast_clean %>%
kable(caption = "12-Month Forecasts for sarima111011 Models") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
.model | date | .mean |
---|---|---|
sarima111011 | 2014 Jan | 6459.343 |
sarima111011 | 2014 Feb | 5866.020 |
sarima111011 | 2014 Mar | 5304.488 |
sarima111011 | 2014 Apr | 5002.353 |
sarima111011 | 2014 May | 4904.648 |
sarima111011 | 2014 Jun | 5626.898 |
sarima111011 | 2014 Jul | 6034.145 |
sarima111011 | 2014 Aug | 6213.165 |
sarima111011 | 2014 Sep | 5855.638 |
sarima111011 | 2014 Oct | 4995.141 |
sarima111011 | 2014 Nov | 4904.060 |
sarima111011 | 2014 Dec | 5704.108 |
# Export the cleaned data to CSV
write.csv(forecast_clean, "ForecastLoad.csv", row.names = FALSE)
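One caveat: because the Box-Cox transformation was applied as a separate column before modeling, the forecast means above (and the exported CSV) are on the transformed scale rather than in kWh. Below is a sketch of the back-transformation with fabletools::inv_box_cox, reusing the same lambda = 0.5 and ignoring any bias adjustment; forecast_kwh is a hypothetical name.
# Sketch: back-transform the forecast means to the original kWh scale
forecast_kwh <- forecast_clean |>
mutate(load_kwh = inv_box_cox(.mean, lambda = 0.5))
head(forecast_kwh)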
The 12-month forecast for the SARIMA(1,1,1)(0,1,1)[12] model captures the seasonal patterns observed in the historical data, with projected values (on the Box-Cox scale) maintaining a similar range. The 80% and 95% prediction intervals suggest a moderate level of uncertainty, widening slightly toward the end of the forecast period. This model effectively accounts for both the trend and seasonal components in the data, providing a reliable prediction with seasonal variations.
# Plot the forecast against the original data
forecast_load |>
autoplot(load_data) +
ggtitle("12-Month Forecast for SARIMA(1,1,1)(0,1,1)[12] Model") +
ylab("Load") +
xlab("Date")
Load the required libraries for data manipulation, time series analysis, and exporting to Excel.
# Load necessary libraries
library(readxl)
library(dplyr)
library(lubridate)
library(tsibble)
library(fable)
library(forecast)
library(openxlsx)
Read in the Excel files for each pipe; these files contain water flow measurements over time. Ensure that the Date Time column is in date-time format for both datasets. Here is a brief description of the columns:
- Pipe1 Data - Rows: 1,000 - Columns: 2 - Date Time: Timestamps from 2015-10-23 to 2015-11-01 - WaterFlow: Range from 1.07 to 38.91
- Pipe2 Data - Rows: 1,000 - Columns: 2 - Date Time: Timestamps from 2015-10-23 to 2015-12-03 - WaterFlow: Range from 1.88 to 78.30
Note: Different date spans and flow ranges indicate the need for timestamp alignment before combined analysis.
# Load necessary libraries
library(readxl)
library(dplyr)
library(lubridate)
library(knitr)
library(kableExtra)
# Load and arrange data
pipe1_data <- read_excel("Waterflow_Pipe1.xlsx", col_types = c("numeric", "numeric")) %>%
arrange(`Date Time`)
pipe2_data <- read_excel("Waterflow_Pipe2.xlsx", col_types = c("numeric", "numeric")) %>%
arrange(`Date Time`)
# Define a custom function to convert Excel date serial numbers to POSIXct
convert_excel_date <- function(serial_date) {
as_datetime(serial_date * 86400, origin = "1899-12-30", tz = "UTC")
}
# Apply the custom function to convert the Date Time column
pipe1_data <- pipe1_data %>%
mutate(`Date Time` = convert_excel_date(`Date Time`))
pipe2_data <- pipe2_data %>%
mutate(`Date Time` = convert_excel_date(`Date Time`))
pipe1_data
## # A tibble: 1,000 × 2
## `Date Time` WaterFlow
## <dttm> <dbl>
## 1 2015-10-23 00:24:06 23.4
## 2 2015-10-23 00:40:02 28.0
## 3 2015-10-23 00:53:51 23.1
## 4 2015-10-23 00:55:40 30.0
## 5 2015-10-23 01:19:17 6.00
## 6 2015-10-23 01:23:58 15.9
## 7 2015-10-23 01:50:05 26.6
## 8 2015-10-23 01:55:33 33.3
## 9 2015-10-23 01:59:15 12.4
## 10 2015-10-23 02:51:51 21.8
## # ℹ 990 more rows
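As an aside, the manual serial-number conversion above can usually be avoided by letting readxl parse the timestamps directly (a sketch; pipe1_alt is a hypothetical name not used elsewhere):
# Sketch: declare the first column as a date type so read_excel parses it
pipe1_alt <- read_excel("Waterflow_Pipe1.xlsx",
col_types = c("date", "numeric")) %>%
arrange(`Date Time`)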
# Calculate number of rows
num_rows_pipe1 <- nrow(pipe1_data)
num_rows_pipe2 <- nrow(pipe2_data)
# Display summaries with formatted table, including row counts
pipe1_summary <- summary(pipe1_data)
pipe2_summary <- summary(pipe2_data)
# Convert to kable format with row count
kable(pipe1_summary, caption = paste("Pipe1 Data Summary - Rows:", num_rows_pipe1)) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Date Time | WaterFlow | |
---|---|---|
Min. :2015-10-23 00:24:06.49 | Min. : 1.067 | |
1st Qu.:2015-10-25 11:21:35.49 | 1st Qu.:13.683 | |
Median :2015-10-27 20:07:30.89 | Median :19.880 | |
Mean :2015-10-27 20:49:15.93 | Mean :19.897 | |
3rd Qu.:2015-10-30 08:24:51.07 | 3rd Qu.:26.159 | |
Max. :2015-11-01 23:35:43.10 | Max. :38.913 |
kable(pipe2_summary, caption = paste("Pipe2 Data Summary - Rows:", num_rows_pipe2)) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Date Time | WaterFlow | |
---|---|---|
Min. :2015-10-23 01:00:00 | Min. : 1.885 | |
1st Qu.:2015-11-02 10:45:00 | 1st Qu.:28.140 | |
Median :2015-11-12 20:30:00 | Median :39.682 | |
Mean :2015-11-12 20:30:00 | Mean :39.556 | |
3rd Qu.:2015-11-23 06:15:00 | 3rd Qu.:50.782 | |
Max. :2015-12-03 16:00:00 | Max. :78.303 |
The time series plots for Water Flow - Pipe 1 and Water Flow - Pipe 2 reveal distinct patterns and variations.
Pipe 1 shows consistent, moderate fluctuations in water flow over a shorter time span, with values primarily ranging between 0 and 40 units. This indicates a stable flow rate with periodic peaks and troughs, suggesting regular usage or inflow patterns.
Pipe 2 spans a longer period and displays greater variability in flow, with values reaching up to 80 units. The flow appears more dynamic with noticeable peaks, likely indicating higher water demand or variability in this system over time.
The contrast between these two pipes—Pipe 1’s steady pattern versus Pipe 2’s high variability—may suggest differences in operational contexts or flow requirements. This analysis provides a foundation for further exploration of flow behaviors and forecasting needs for each pipe.
# Plot for Pipe 1
pipe1_plot <- ggplot(pipe1_data, aes(x = `Date Time`, y = WaterFlow)) +
geom_line(color = "blue") +
labs(title = "Water Flow - Pipe 1",
x = "Time",
y = "Water Flow")
# Plot for Pipe 2
pipe2_plot <- ggplot(pipe2_data, aes(x = `Date Time`, y = WaterFlow)) +
geom_line(color = "red") +
labs(title = "Water Flow - Pipe 2",
x = "Time",
y = "Water Flow")
# Arrange the plots vertically
grid.arrange(pipe1_plot, pipe2_plot , ncol = 1)
Convert timestamps to hourly data by rounding down to the nearest hour and then calculating the mean for each hour.
The table and plots provide insights into the data collection frequency and transformations applied to the water flow measurements for each pipe.
For Pipe 1: - Initial data points: 1000 - Hourly mean data points after aggregation: 236 - Average interval between measurements: approximately 14.37 minutes - Median interval: 10.02 minutes
For Pipe 2: - Initial data points: 1000 - Hourly mean data points after aggregation: 667 - Average interval between measurements: 60 minutes - Median interval: 60 minutes
This indicates that Pipe 1 measurements were collected at a higher frequency than Pipe 2, resulting in fewer aggregated hourly data points when averaged per hour. The higher frequency in Pipe 1’s measurements contributed to more granular data, whereas Pipe 2’s hourly interval aligns with its lower frequency of data collection.
# Load necessary libraries
library(dplyr)
library(ggplot2)
library(lubridate)
library(gridExtra)
library(knitr)
library(kableExtra)
# Calculate number of rows before mean flow calculation
num_points_before_pipe1 <- nrow(pipe1_data)
num_points_before_pipe2 <- nrow(pipe2_data)
# Convert 'Date Time' to hourly timestamps and calculate the hourly mean of 'WaterFlow'
pipe1_hourly <- pipe1_data %>%
mutate(Timestamp = floor_date(`Date Time`, "hour")) %>% # Floor to the nearest hour
group_by(Timestamp) %>%
summarize(Mean_WaterFlow = mean(WaterFlow, na.rm = TRUE)) %>% # Calculate the mean of 'WaterFlow'
ungroup() # Remove grouping to avoid issues in further steps
pipe2_hourly <- pipe2_data %>%
mutate(Timestamp = floor_date(`Date Time`, "hour")) %>%
group_by(Timestamp) %>%
summarize(Mean_WaterFlow = mean(WaterFlow, na.rm = TRUE)) %>%
ungroup()
# Calculate number of rows after mean flow calculation
num_points_after_pipe1 <- nrow(pipe1_hourly)
num_points_after_pipe2 <- nrow(pipe2_hourly)
# Calculate time intervals for Pipe 1
pipe1_intervals <- pipe1_data %>%
arrange(`Date Time`) %>%
mutate(Interval = difftime(`Date Time`, lag(`Date Time`), units = "mins")) %>%
filter(!is.na(Interval)) # Remove the first NA caused by lag
# Calculate time intervals for Pipe 2
pipe2_intervals <- pipe2_data %>%
arrange(`Date Time`) %>%
mutate(Interval = difftime(`Date Time`, lag(`Date Time`), units = "mins")) %>%
filter(!is.na(Interval)) # Remove the first NA caused by lag
# Create a data frame to display the counts and intervals for each pipe
data_points_df <- data.frame(
Pipe = c("Pipe 1", "Pipe 2"),
Data_Points_Before = c(num_points_before_pipe1, num_points_before_pipe2),
Data_Points_After = c(num_points_after_pipe1, num_points_after_pipe2),
Avg_Interval = c(mean(pipe1_intervals$Interval), mean(pipe2_intervals$Interval)),
Median_Interval = c(median(pipe1_intervals$Interval), median(pipe2_intervals$Interval))
)
# Display the data frame with formatted table
data_points_df |>
kable(caption = "Summary of Data Points and Time Intervals Between Measurements for Each Pipe") |>
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Pipe | Data_Points_Before | Data_Points_After | Avg_Interval | Median_Interval |
---|---|---|---|---|
Pipe 1 | 1000 | 236 | 14.36598 mins | 10.01989 mins |
Pipe 2 | 1000 | 667 | 60.00000 mins | 60.00000 mins |
# Plot for Pipe 1
pipe1_hourly_plot <- ggplot(pipe1_hourly, aes(x = Timestamp, y = Mean_WaterFlow)) +
geom_line(color = "blue") +
labs(title = "Hourly Mean Water Flow - Pipe 1",
x = "Timestamp",
y = "Mean Water Flow")
# Plot for Pipe 2
pipe2_hourly_plot <- ggplot(pipe2_hourly, aes(x = Timestamp, y = Mean_WaterFlow)) +
geom_line(color = "red") +
labs(title = "Hourly Mean Water Flow - Pipe 2",
x = "Timestamp",
y = "Mean Water Flow")
# Arrange the plots vertically
grid.arrange(pipe1_hourly_plot, pipe2_hourly_plot , ncol = 1)
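Following up on the earlier note about timestamp alignment: once both series are aggregated to hourly means, they can be joined on the shared Timestamp column. A sketch is below; combined_hourly and the renamed columns are hypothetical names not used elsewhere in this report.
# Sketch: align the two hourly series on their shared hourly timestamps
combined_hourly <- full_join(
pipe1_hourly %>% rename(Pipe1_Flow = Mean_WaterFlow),
pipe2_hourly %>% rename(Pipe2_Flow = Mean_WaterFlow),
by = "Timestamp"
)
head(combined_hourly)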
Convert each hourly series to a tsibble format and check each series for stationarity using the ADF test.
Based on the Augmented Dickey-Fuller (ADF) test results, the test statistic (-17.29) is far below the 1% critical value of -3.43, so the null hypothesis of a unit root is rejected. The stationarity confirmed by these results indicates that both series can proceed to forecasting without further differencing adjustments.
# Load the urca package if not already loaded
library(urca)
library(dplyr)
# Step 3: Convert to tsibble format for time series analysis
pipe1_tsibble <- as_tsibble(pipe1_hourly, index = Timestamp) %>%
fill_gaps()
pipe2_tsibble <- as_tsibble(pipe2_hourly, index = Timestamp) %>%
fill_gaps()
# Step 4: Test for stationarity using ADF test
adf_test_pipe1 <- ur.df(pipe1_hourly$Mean_WaterFlow, type = "drift")
adf_test_pipe2 <- ur.df(pipe2_hourly$Mean_WaterFlow, type = "drift")
summary(adf_test_pipe2) # output shown below; summary(adf_test_pipe1) can be inspected the same way
##
## ###############################################
## # Augmented Dickey-Fuller Test Unit Root Test #
## ###############################################
##
## Test regression drift
##
##
## Call:
## lm(formula = z.diff ~ z.lag.1 + 1 + z.diff.lag)
##
## Residuals:
## Min 1Q Median 3Q Max
## -37.732 -9.229 0.298 9.370 37.233
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 38.55674 2.29084 16.831 <2e-16 ***
## z.lag.1 -0.96884 0.05603 -17.292 <2e-16 ***
## z.diff.lag -0.06832 0.03888 -1.757 0.0793 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.83 on 662 degrees of freedom
## Multiple R-squared: 0.5207, Adjusted R-squared: 0.5193
## F-statistic: 359.6 on 2 and 662 DF, p-value: < 2.2e-16
##
##
## Value of test-statistic is: -17.2918 149.5061
##
## Critical values for test statistics:
## 1pct 5pct 10pct
## tau2 -3.43 -2.86 -2.57
## phi1 6.43 4.59 3.78
The top plot shows the Box-Cox transformed Pipe 1 Flow, which stabilizes variance, while the bottom plot shows the gap-filled series, ensuring continuity for accurate forecasting. These steps prepare the data for reliable model fitting.
# All Pipe 1 flows are positive (min ~1.07), so no offset is needed before Box-Cox
# Calculate the Box-Cox lambda from the hourly mean flows
box_cox_lambda <- BoxCox.lambda(na.omit(pipe1_tsibble$Mean_WaterFlow))
# Apply transformations and differences
pipe1_tsibble <- pipe1_tsibble |>
mutate(
box_cox_flow = BoxCox(Mean_WaterFlow, lambda = box_cox_lambda), # Apply Box-Cox transformation
)
# Fill missing values in the Mean_WaterFlow column using linear interpolation
pipe1_tsibble_filled <- pipe1_tsibble %>%
mutate(Mean_WaterFlow = na_interpolation(Mean_WaterFlow))
# Step 3: Time Series Plot for Box-Cox Transformed Data
p1 <- pipe1_tsibble |>
ggplot(aes(x = Timestamp, y = box_cox_flow)) +
geom_line() +
labs(title = "Box-Cox Transformed Pipe 1 Flow",
x = "Time", y = "Pipe 1 Flow")
# Step 4: Time Series Plot for the Gap-Filled Data
p2 <- pipe1_tsibble_filled |>
ggplot(aes(x = Timestamp, y = Mean_WaterFlow)) +
geom_line() +
labs(title = "Gap Filled Pipe 1 Flow",
x = "Time", y = "Pipe 1 Flow")
# Combine the two plots in a single column using gridExtra
grid.arrange(
p1, p2,
ncol = 1
)
The STL decomposition of Pipe 1 water flow data effectively captures the main structure of the series, suggesting that a model with trend and seasonal components could forecast this data accurately.
# Step 1: Decompose the series using STL
pipe1_stl <- pipe1_tsibble_filled %>%
model(STL(Mean_WaterFlow ~ trend(window = 7) + season(window = "periodic"), robust = TRUE)) %>%
components()
# Step 2: Plot the STL components
autoplot(pipe1_stl) +
labs(title = "STL Decomposition of Pipe 1 Water Flow")
Here we fit ARIMA models and several configurations of ETS models on the time series data, including ETS(M, A, M), ETS(A, A, M), and ETS(M, Ad, M). This approach allows us to compare the models and select the best one based on criteria such as AICc or forecast performance.
To determine the best model for forecasting Pipe 1 Water Flow, we need to consider the model performance metrics, especially the AIC, AICc, BIC, MSE, and MAE.
Even though the stepwise and search ARIMA models have a lower AICc, they exhibit a much narrower confidence interval, suggesting that they fail to capture the variability seen in the actual water flow data. This lack of variability indicates that these models may underfit the data, leading to less reliable predictions under changing conditions. In contrast, ETS models (such as ets_mam) show wider and more realistic confidence intervals, better reflecting the fluctuations in the observed series, making them more suitable for accurate forecasting in this context.
Best Model: ETS (MAM) due to balanced low error metrics and solid fit.
# Load necessary libraries
library(fable)
library(feasts)
library(tsibble)
library(dplyr)
# Fit multiple ETS models
pipe1_fit <- pipe1_tsibble_filled %>%
model(
ets_mam = ETS(Mean_WaterFlow ~ error("M") + trend("A") + season("M")), # ETS(M, A, M)
ets_aam = ETS(Mean_WaterFlow ~ error("A") + trend("A") + season("M")), # ETS(A, A, M)
ets_madm = ETS(Mean_WaterFlow ~ error("M") + trend("Ad") + season("M")), # ETS(M, Ad, M)
stepwise = ARIMA(Mean_WaterFlow), # Automatic ARIMA using stepwise search
search = ARIMA(Mean_WaterFlow, stepwise = FALSE) # Automatic ARIMA using exhaustive search
)
# Extract model summary and select the best model based on AIC or AICc
model_summary <- glance(pipe1_fit)
# Display model details to compare AIC, AICc, and BIC
model_summary %>%
select(-ar_roots, -ma_roots) %>%
arrange(AICc) %>%
kable(caption = "ETS Model Comparison Based on AICc") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
.model | sigma2 | log_lik | AIC | AICc | BIC | MSE | AMSE | MAE |
---|---|---|---|---|---|---|---|---|
stepwise | 17.8346691 | -685.7815 | 1375.563 | 1375.614 | 1382.524 | NA | NA | NA |
search | 17.9552502 | -685.6035 | 1379.207 | 1379.377 | 1393.129 | NA | NA | NA |
ets_mam | 0.0475971 | -995.0379 | 2048.076 | 2056.361 | 2149.014 | 16.82212 | 16.86331 | 0.1618203 |
ets_aam | 18.9938721 | -996.0843 | 2050.169 | 2058.454 | 2151.107 | 16.77792 | 16.81792 | 3.2571771 |
ets_madm | 0.0503575 | -999.7983 | 2059.597 | 2068.496 | 2164.016 | 17.24243 | 17.24774 | 0.1666085 |
# Forecast with each ETS model
pipe1_forecasts <- pipe1_fit %>%
forecast(h = "1 week")
# Plot the forecasts with different models shown in separate rows
autoplot(pipe1_forecasts, pipe1_tsibble_filled) +
labs(title = "1-Week Forecast for Pipe 1 Water Flow Using ETS Models",
y = "Mean Water Flow", x = "Timestamp") +
facet_wrap(~ .model, scales = "free_y", ncol = 1) + # Arrange in a single column with each model in a row
theme_minimal()
The forecast for Pipe 1 Water Flow using the best ETS model, ETS(M, A, M), provides a projection with clear confidence intervals. This model, which captures multiplicative error, additive trend, and multiplicative seasonality, shows a reasonable fit, effectively capturing the seasonal variations in the data. The increasing width of the confidence intervals over time reflects the model’s uncertainty in predicting future values. Given the strong seasonal pattern and consistent trend, this model is well-suited to represent the data’s structure, allowing for meaningful future predictions within expected variability.
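To see the estimated smoothing parameters behind this fit (a sketch using the pipe1_fit mable from above), the selected model can be reported directly:
# Sketch: print the estimated ETS(M, A, M) parameters (alpha, beta, gamma)
pipe1_fit %>%
select(ets_mam) %>%
report()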
# Filter the forecasts to include only the best model (ets_mam)
best_forecast <- pipe1_forecasts %>%
filter(.model == "ets_mam")
# Plot the forecast for the best model (ets_mam)
autoplot(best_forecast, pipe1_tsibble_filled) +
labs(
title = "1-Week Forecast for Pipe 1 Water Flow Using Best Model (ETS(M, A, M))",
y = "Mean Water Flow",
x = "Timestamp"
)
head(best_forecast) %>%
kable(caption = "1-Week Forecasts for Pipe 1") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
.model | Timestamp | Mean_WaterFlow | .mean |
---|---|---|---|
ets_mam | 2015-11-02 00:00:00 | N(21, 21) | 20.89573 |
ets_mam | 2015-11-02 01:00:00 | N(22, 22) | 21.58641 |
ets_mam | 2015-11-02 02:00:00 | N(21, 21) | 20.83362 |
ets_mam | 2015-11-02 03:00:00 | N(19, 17) | 18.66852 |
ets_mam | 2015-11-02 04:00:00 | N(22, 23) | 22.00720 |
ets_mam | 2015-11-02 05:00:00 | N(19, 17) | 18.94728 |
# Save the forecast means for the best model (ets_mam) to CSV
best_forecast %>%
as_tibble() %>%
select(.model, Timestamp, .mean) %>%
write.csv("pipe1_best_ets_forecast.csv", row.names = FALSE)
The top plot shows the Box-Cox transformed Pipe 2 Flow, which stabilizes variance, while the bottom plot shows the filled series, ensuring continuity for accurate forecasting. These steps prepare the data for reliable model fitting.
# All Pipe 2 flows are positive (min ~1.89), so no offset is needed before Box-Cox
# Calculate the Box-Cox lambda from the hourly mean flows
box_cox_lambda <- BoxCox.lambda(na.omit(pipe2_tsibble$Mean_WaterFlow))
# Apply transformations and differences
pipe2_tsibble <- pipe2_tsibble |>
mutate(
box_cox_flow = BoxCox(Mean_WaterFlow, lambda = box_cox_lambda), # Apply Box-Cox transformation
)
# Fill missing values in the Mean_WaterFlow column using linear interpolation
pipe2_tsibble_filled <- pipe2_tsibble %>%
mutate(Mean_WaterFlow = na_interpolation(Mean_WaterFlow))
# Step 3: Time Series Plot for Box-Cox Transformed Data
p1 <- pipe2_tsibble |>
ggplot(aes(x = Timestamp, y = box_cox_flow)) +
geom_line() +
labs(title = "Box-Cox Transformed Pipe 2 Flow",
x = "Time", y = "Pipe 2 Flow")
# Step 4: Time Series Plot for the Gap-Filled Data
p2 <- pipe2_tsibble_filled |>
ggplot(aes(x = Timestamp, y = Mean_WaterFlow)) +
geom_line() +
labs(title = "Gap Filled Pipe 2 Flow",
x = "Time", y = "Pipe 2 Flow")
# Combine the two plots in a single column using gridExtra
grid.arrange(
p1, p2,
ncol = 1
)
The STL decomposition of Pipe 2 Water Flow highlights a steady trend with strong daily seasonal effects, alongside short-term noise in the remainder. This structure can inform model selection, suggesting that any forecast model should account for both trend and daily seasonal components.
# Step 1: Decompose the series using STL
pipe2_stl <- pipe2_tsibble_filled %>%
model(STL(Mean_WaterFlow ~ trend(window = 7) + season(window = "periodic"), robust = TRUE)) %>%
components()
# Step 2: Plot the STL components
autoplot(pipe2_stl) +
labs(title = "STL Decomposition of Pipe 2 Water Flow")
Here we fit ARIMA models and several configurations of ETS models on the time series data, including ETS(M, A, M), ETS(A, A, M), and ETS(M, Ad, M), and compare them to select the best one based on criteria such as AICc or forecast performance.
To determine the best model for forecasting Pipe 2 Water Flow, we need to consider the model performance metrics, especially the AIC, AICc, BIC, MSE, and MAE.
The best model for forecasting Pipe 2 Water Flow based on the provided metrics is ets_madm. This model has the lowest AICc, BIC, MSE, and MAE among the ETS variants. Although the search and stepwise ARIMA models show lower AICc values, their information criteria are not directly comparable to those of the ETS models, and, as with Pipe 1, their much narrower prediction intervals suggest they understate the variability in the series. Therefore, ets_madm provides a more reliable forecast with appropriate error and stability characteristics.
# Load necessary libraries
library(fable)
library(feasts)
library(tsibble)
library(dplyr)
# Fit multiple models
pipe2_fit <- pipe2_tsibble_filled %>%
model(
ets_mam = ETS(Mean_WaterFlow ~ error("M") + trend("A") + season("M")), # ETS(M, A, M)
ets_aam = ETS(Mean_WaterFlow ~ error("A") + trend("A") + season("M")), # ETS(A, A, M)
ets_madm = ETS(Mean_WaterFlow ~ error("M") + trend("Ad") + season("M")), # ETS(M, Ad, M)
stepwise = ARIMA(Mean_WaterFlow), # Automatic ARIMA using stepwise search
search = ARIMA(Mean_WaterFlow, stepwise = FALSE) # Automatic ARIMA using exhaustive search
)
# Extract model summary and select the best model based on AIC or AICc
model_summary <- glance(pipe2_fit)
# Display model details to compare AIC, AICc, and BIC
model_summary %>%
select(-ar_roots, -ma_roots) %>%
arrange(AICc) %>%
kable(caption = "ETS Model Comparison Based on AICc") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
.model | sigma2 | log_lik | AIC | AICc | BIC | MSE | AMSE | MAE |
---|---|---|---|---|---|---|---|---|
search | 145.5595822 | -3902.861 | 7819.723 | 7819.836 | 7854.070 | NA | NA | NA |
stepwise | 156.7908794 | -3942.633 | 7891.266 | 7891.290 | 7905.986 | NA | NA | NA |
ets_madm | 0.1017133 | -5980.418 | 12020.837 | 12022.756 | 12168.070 | 155.3642 | 158.8355 | 0.2533285 |
ets_aam | 163.1633975 | -5987.054 | 12032.108 | 12033.902 | 12174.433 | 158.5948 | 176.0376 | 10.0992134 |
ets_mam | 0.1019389 | -5987.697 | 12033.394 | 12035.188 | 12175.719 | 158.5184 | 162.9869 | 0.2536301 |
# Forecast with each ETS model
pipe2_forecasts <- pipe2_fit %>%
forecast(h = "1 week")
# Plot the forecasts with different models shown in separate rows
autoplot(pipe2_forecasts, pipe2_tsibble_filled) +
labs(title = "1-Week Forecast for Pipe 2 Water Flow",
y = "Mean Water Flow", x = "Timestamp") +
facet_wrap(~ .model, scales = "free_y", ncol = 1) + # Arrange in a single column with each model in a row
theme_minimal()
The ETS(M, Ad, M) model provides a one-week forecast for Pipe 2 Water Flow, capturing multiplicative error and seasonality with a damped additive trend. The forecast maintains the high variability observed in the historical data and shows increasing uncertainty over the forecast period, as represented by the widening confidence intervals. This model is suitable for capturing the complexity and fluctuating pattern of water flow in Pipe 2, reflecting the daily seasonal variations effectively.
# Filter the forecasts to include only the best model (ets_madm)
best_forecast <- pipe2_forecasts %>%
filter(.model == "ets_madm")
# Plot the forecast for the best model (ets_madm)
autoplot(best_forecast, pipe2_tsibble_filled) +
labs(
title = "1-Week Forecast for Pipe 2 Water Flow (ETS(M, Ad, M))",
y = "Mean Water Flow",
x = "Timestamp"
) +
theme_minimal()
# Save the forecast means for the best model (ets_madm) to CSV
best_forecast %>%
as_tibble() %>%
select(.model, Timestamp, .mean) %>%
write.csv("pipe2_best_ets_forecast.csv", row.names = FALSE)