Part A: Forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable ‘Cash’ is provided in hundreds of dollars; other than that, it is straightforward. Explain and demonstrate your process, the techniques used and not used, and your actual forecast.
The objective of this analysis was to forecast the total amount of cash expected to be withdrawn from four ATM machines during May 2010, using historical daily withdrawal data contained in ATM624Data.xlsx. The data represent daily withdrawals in hundreds of dollars; observed withdrawals span May 1, 2009 through April 30, 2010, and the file also contains placeholder rows with missing values for the first two weeks of May 2010.
The dataset was imported from Excel and cleaned as follows:
atm <- read_excel("/Users/aaliyahmjh/Documents/FA25/DATA624/ATM624Data.xlsx") %>%
  mutate(
    DATE = if (is.numeric(DATE)) as.Date(DATE, origin = "1899-12-30") else as.Date(DATE),
    Cash = Cash * 100  # convert from hundreds of dollars to dollars
  )
atm <- atm %>%
  rename_with(~tolower(.x)) %>%
  rename(date = any_of(c("date", "DATE")),
         cash = any_of(c("cash", "Cash")),
         atm  = any_of(c("atm", "ATM"))) %>%
  filter(!is.na(date), !is.na(cash), !is.na(atm)) %>%
  arrange(atm, date)
summary(atm$date)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## "2009-05-01" "2009-08-01" "2009-10-31" "2009-10-30" "2010-01-29" "2010-04-30"
range(atm$date)
## [1] "2009-05-01" "2010-04-30"
# Plot
ggplot(atm, aes(date, cash, color = atm)) +
  geom_line(linewidth = 0.7) +
  facet_wrap(~atm, scales = "free_y") +
  labs(
    title = "ATM Daily Withdrawals (Cleaned Data)",
    x = "Date",
    y = "Cash ($)"
  ) +
  theme_minimal() +
  theme(legend.position = "right")
Because each ATM operates independently, time-series models were fitted separately.
The ATM withdrawal data showed strong weekly seasonality, especially for ATMs 1 and 2, where withdrawals followed a regular day-of-week rhythm. I considered simpler benchmarks: the mean and naïve forecasts would ignore the weekly pattern entirely, and while the seasonal naïve method does repeat the previous week, it cannot adapt to short-term changes in withdrawal volume.
I also considered exponential smoothing (ETS) and decomposition-based methods, but these work best for longer, smoother series with clear long-term trends. The ATM data were short and noisy, so those methods were unlikely to adapt well; a quick hold-out comparison (sketched below) can be used to check whether a model-based approach actually improves on the seasonal naïve benchmark.
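The following minimal sketch (my own addition, not part of the original workflow) holds out the last four weeks of ATM 1 from the cleaned atm data frame above and compares the seasonal naïve benchmark against auto.arima() on that hold-out:
# Hold-out benchmark sketch (illustration only; ATM1 chosen arbitrarily).
# Uses the cleaned, lower-case `atm` data frame from the chunk above.
atm1 <- atm %>% filter(atm == "ATM1") %>% arrange(date)

n_test   <- 28                                          # hold out the last 4 weeks
train_ts <- ts(head(atm1$cash, -n_test), frequency = 7) # weekly seasonal period
test_obs <- tail(atm1$cash, n_test)

fc_snaive <- snaive(train_ts, h = n_test)               # seasonal naive benchmark
fc_arima  <- forecast(auto.arima(train_ts), h = n_test)

accuracy(fc_snaive, test_obs)["Test set", "RMSE"]
accuracy(fc_arima,  test_obs)["Test set", "RMSE"]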
Technique Used
From my previous visual inspection, I found that:
ATMs 1 and 2 show a stable, repeating weekly withdrawal cycle;
ATM 3 shows near-zero activity over almost the entire period;
ATM 4 follows a broadly similar weekly rhythm but contains one extreme spike, most likely an outlier transaction.
These patterns suggested that ATMs 1 & 2 were ideal for model-based forecasting, while 3 & 4 required fallback logic and more cautious interpretation.
atm <- read_excel("ATM624Data.xlsx") %>%
  mutate(
    DATE = if (is.numeric(DATE)) as.Date(DATE, origin = "1899-12-30") else as.Date(DATE),
    Cash = Cash * 100  # convert from hundreds of dollars to dollars
  ) %>%
  arrange(ATM, DATE)
range(atm$DATE)
## [1] "2009-05-01" "2010-05-14"
# Split into list by ATM
atm_list <- split(atm, atm$ATM)

forecast_atm <- function(df) {
  df <- df %>% arrange(DATE)
  # Fallback for very short series: flat mean forecast with rough +/-10% / +/-20% bands
  if (nrow(df) < 10) {
    avg <- mean(df$Cash, na.rm = TRUE)
    message(unique(df$ATM), ": too few points (", nrow(df), "). Using mean forecast.")
    return(tibble(
      Date = seq(max(df$DATE) + 1, by = "day", length.out = 31),
      Forecast = rep(avg, 31),
      Lower80 = rep(avg * 0.9, 31),
      Upper80 = rep(avg * 1.1, 31),
      Lower95 = rep(avg * 0.8, 31),
      Upper95 = rep(avg * 1.2, 31),
      ATM = unique(df$ATM)
    ))
  }
  # Otherwise fit an automated ARIMA with weekly seasonality and forecast 31 days
  ts_train <- ts(df$Cash, frequency = 7)
  fit <- auto.arima(ts_train)
  fc <- forecast(fit, h = 31)
  tibble(
    Date = seq(max(df$DATE) + 1, by = "day", length.out = 31),
    Forecast = as.numeric(fc$mean),
    Lower80 = as.numeric(fc$lower[, 1]),
    Upper80 = as.numeric(fc$upper[, 1]),
    Lower95 = as.numeric(fc$lower[, 2]),
    Upper95 = as.numeric(fc$upper[, 2]),
    ATM = unique(df$ATM)
  )
}
atm_forecasts <- lapply(atm_list, forecast_atm)
final_fc <- bind_rows(atm_forecasts)
# Save to Excel
write_xlsx(final_fc, "/Users/aaliyahmjh/Documents/FA25/DATA624/ATM_May2010_Forecasts.xlsx")
# Summarize monthly totals
monthly_totals <- final_fc %>%
  group_by(ATM) %>%
  summarise(May2010_TotalCash = sum(Forecast, na.rm = TRUE))
# Save monthly totals to a separate file
write_xlsx(monthly_totals, "/Users/aaliyahmjh/Documents/FA25/DATA624/ATM_May2010_TotalForecasts.xlsx")
head(final_fc,15)
## # A tibble: 15 × 7
## Date Forecast Lower80 Upper80 Lower95 Upper95 ATM
## <date> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 2010-05-01 8519. 5362. 11676. 3691. 13347. ATM1
## 2 2010-05-02 9810. 6618. 13003. 4928. 14693. ATM1
## 3 2010-05-03 7546. 4353. 10739. 2662. 12429. ATM1
## 4 2010-05-04 353. -2840. 3546. -4531. 5237. ATM1
## 5 2010-05-05 9849. 6656. 13043. 4966. 14733. ATM1
## 6 2010-05-06 7814. 4621. 11008. 2931. 12698. ATM1
## 7 2010-05-07 8398. 5204. 11591. 3514. 13281. ATM1
## 8 2010-05-08 8490. 4884. 12095. 2976. 14003. ATM1
## 9 2010-05-09 9749. 6135. 13363. 4222. 15276. ATM1
## 10 2010-05-10 7764. 4150. 11378. 2237. 13292. ATM1
## 11 2010-05-11 355. -3260. 3969. -5173. 5882. ATM1
## 12 2010-05-12 9814. 6200. 13428. 4287. 15342. ATM1
## 13 2010-05-13 7729. 4115. 11344. 2202. 13257. ATM1
## 14 2010-05-14 8446. 4831. 12060. 2918. 13973. ATM1
## 15 2010-05-15 8500. 4495. 12504. 2375. 14624. ATM1
monthly_totals
## # A tibble: 4 × 2
## ATM May2010_TotalCash
## <chr> <dbl>
## 1 ATM1 235716.
## 2 ATM2 188268.
## 3 ATM3 402.
## 4 ATM4 1469534.
Each ATM’s forecast plot shows:
Blue line: observed withdrawals through April 30, 2010.
Red line: projected withdrawals for May 1-31, 2010.
Shaded bands: 80% (darker) and 95% (lighter) prediction intervals.
ATMs 1 & 2 demonstrate a clear continuation of the weekly pattern; the prediction intervals widen slightly but remain centered around magnitudes similar to prior months.
ATMs 3 & 4 show that sparse activity or anomalous data lead to flat, near-constant forecasts, which is expected behavior; a possible outlier adjustment for ATM 4 is sketched after the plotting code below.
for (atm_name in names(atm_list)) {
  obs <- atm %>% filter(ATM == atm_name)
  fc <- forecast_atm(atm_list[[atm_name]])
  p <- ggplot() +
    geom_line(data = obs, aes(x = DATE, y = Cash), color = "steelblue", linewidth = 1) +
    # 95% Prediction Interval (lighter)
    geom_ribbon(
      data = fc,
      aes(x = Date, ymin = Lower95, ymax = Upper95),
      fill = "firebrick", alpha = 0.15
    ) +
    # 80% Prediction Interval (darker)
    geom_ribbon(
      data = fc,
      aes(x = Date, ymin = Lower80, ymax = Upper80),
      fill = "firebrick", alpha = 0.3
    ) +
    # Forecast line
    geom_line(data = fc, aes(x = Date, y = Forecast), color = "firebrick", linewidth = 1) +
    labs(
      title = paste0(atm_name, " Forecast - May 2010"),
      x = "Date",
      y = "Cash ($)",
      caption = "Blue = Historical | Red = Forecast | Shaded = 80% (dark) and 95% (light) Prediction Intervals"
    ) +
    theme_minimal()
  print(p)
}
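One possible refinement, not applied in the forecasts reported above, would be to check and smooth the single extreme spike in ATM 4 before fitting. A minimal sketch using forecast::tsoutliers()/tsclean() on the atm_list object defined earlier:
# Optional outlier check for ATM4 (sketch only; not used in the reported results).
atm4_ts <- ts(atm_list[["ATM4"]]$Cash, frequency = 7)

tsoutliers(atm4_ts)              # flags the index of the spike and a suggested replacement
atm4_clean <- tsclean(atm4_ts)   # replaces identified outliers / interpolates missing values

autoplot(forecast(auto.arima(atm4_clean), h = 31)) +
  labs(title = "ATM4 Forecast After Outlier Adjustment (sketch)", y = "Cash ($)")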
The goal of this analysis was to forecast daily cash withdrawals for
four ATMs for May 2010.
Each ATM’s historical series was modeled individually using an automated
ARIMA approach when sufficient data were available, and a simple
mean-based projection when the historical window was too short to
support reliable parameter estimation.
The forecasts indicate stable withdrawal behavior consistent with
previous months’ weekly cycles.
For ATMs 1 and 2, the forecasts imply average daily withdrawals of roughly
$7,600 and $6,100 respectively (about $235,700 and $188,300 over the 31 days
of May), with regular weekly fluctuations and no clear upward or downward
trend heading into May 2010.
ATMs 3 and 4 displayed more irregular patterns, with ATM 3 showing
near-zero activity and ATM 4 producing one unusually high spike, likely
due to an outlier transaction or exceptional event.
The resulting Excel file, ATM_May2010_Forecasts.xlsx, contains day-by-day
forecasts and the corresponding 80% and 95% prediction intervals for each ATM;
the per-ATM monthly totals are saved separately in ATM_May2010_TotalForecasts.xlsx.
Overall, ATM 4 is projected to account for roughly 78% of total cash volume,
followed by ATMs 1 and 2, while ATM 3 remains minimally active, possibly due
to location, accessibility, or lower customer traffic.
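This share can be checked directly from the monthly_totals table computed earlier; a quick sketch:
# Per-ATM share of the projected May 2010 cash volume (from monthly_totals above)
monthly_totals %>%
  mutate(Share_pct = round(100 * May2010_TotalCash / sum(May2010_TotalCash), 1))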
These results suggest that replenishment schedules for May can largely
mirror those from April, with additional liquidity allocated to ATM 4 to
accommodate its substantially higher demand.
This analysis successfully produced short-term forecasts of daily cash withdrawals for four ATMs for May 2010 using historical transaction data. Automated ARIMA models provided robust and interpretable forecasts for ATMs with sufficient history, while mean-based projections were applied to sparse or inconsistent series to maintain reliability.
Overall, the forecasts suggest stable withdrawal demand across the network, with ATM 4 accounting for the majority of expected volume. No major trend shifts were detected, implying that existing replenishment schedules remain appropriate, provided ATM 4 continues to receive proportionally higher allocations.
Part B consists of a simple dataset of residential power usage for January 1998 through December 2013. Your assignment is to model these data and produce a monthly forecast for 2014. The data are given in a single file. The variable ‘KWH’ is power consumption in kilowatt hours; the rest is straightforward. Add this to your existing files above.
The goal of this analysis was to forecast monthly residential power consumption for the year 2014 based on historical usage data spanning January 1998-December 2013. The dataset includes monthly energy consumption (in kilowatt-hours, KWH) and exhibits a clear yearly seasonal cycle, likely corresponding to variations in heating and cooling demand.
The data was imported from Excel, cleaned, and formatted into a date-based structure suitable for time series analysis. Each observation represents monthly residential power consumption.
A time-series line plot of Monthly Residential Power Consumption (1998-2013) reveals strong seasonality: power use peaks every summer, with a secondary rise in winter and dips in the milder spring and fall months, along with a gradual upward trend over time (an STL decomposition, sketched after the plot below, makes this structure explicit). Simple models like the mean or naïve forecast would not capture those repeating seasonal patterns. Classical decomposition could separate trend and seasonality, but it assumes they stay constant, which does not fit the steady growth here.
Because of that, I focused on seasonal ARIMA and ETS models. Both handle seasonality and long-term trends well. SARIMA is good for capturing autocorrelation, while ETS adapts smoothly to changing seasonal patterns. These methods make the most sense for this kind of monthly energy consumption data.
# Import and clean data
power <- read_excel("/Users/aaliyahmjh/Documents/FA25/DATA624/ResidentialCustomerForecastLoad-624.xlsx") %>%
  rename(YearMonth = `YYYY-MMM`) %>%  # rename safely
  mutate(
    Date = as.Date(paste0("01-", YearMonth), format = "%d-%Y-%b"),
    KWH = as.numeric(KWH)
  ) %>%
  arrange(Date)
glimpse(power)
## Rows: 192
## Columns: 4
## $ CaseSequence <dbl> 733, 734, 735, 736, 737, 738, 739, 740, 741, 742, 743, 74…
## $ YearMonth <chr> "1998-Jan", "1998-Feb", "1998-Mar", "1998-Apr", "1998-May…
## $ KWH <dbl> 6862583, 5838198, 5420658, 5010364, 4665377, 6467147, 891…
## $ Date <date> 1998-01-01, 1998-02-01, 1998-03-01, 1998-04-01, 1998-05-…
ggplot(power, aes(x = Date, y = KWH)) +
  geom_line(color = "steelblue", linewidth = 1) +
  labs(
    title = "Monthly Residential Power Consumption (1998-2013)",
    x = "Date",
    y = "KWH (Kilowatt hours)"
  ) +
  theme_minimal()
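Before comparing models, the trend and seasonal structure described above can be made explicit with a quick STL decomposition. This is a minimal sketch, not part of the model selection itself; na.interp() from the forecast package fills the few missing KWH values so stl() can run:
# STL decomposition sketch: separates trend, seasonal, and remainder components.
power$KWH %>%
  ts(start = c(1998, 1), frequency = 12) %>%
  na.interp() %>%                  # interpolate the handful of missing KWH values
  stl(s.window = "periodic") %>%
  autoplot()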
Two forecasting models were compared: Exponential Smoothing (ETS) and Auto ARIMA.
Both models were trained on the full series. Their RMSE (Root Mean Squared Error) values on the training data were compared to determine the better-performing model.
The ETS model achieved the lower RMSE, so it was selected for forecasting. ETS automatically chose an additive error, multiplicative trend, and additive seasonal (A, M, A) structure, which captures the steady year-to-year increase in energy consumption while adapting smoothly to recurring seasonal peaks. The model produced a 12-month forecast for 2014, continuing the established seasonal trend. The plot displays both 80% (darker band) and 95% (lighter band) prediction intervals. The narrower inner band captures the more likely outcomes, while the wider band reflects greater uncertainty at higher confidence. As expected, the forecast retains the established seasonal pattern, with uncertainty gradually increasing further into 2014.
# Convert to monthly time series
power_ts <- ts(power$KWH, start = c(1998, 1), frequency = 12)
# Train models
fit_ets <- ets(power_ts)
## Warning in ets(power_ts): Missing values encountered. Using longest contiguous
## portion of time series
fit_arima <- auto.arima(power_ts)
# Compare accuracy
acc_ets <- accuracy(fit_ets)["Training set", "RMSE"]
acc_arima <- accuracy(fit_arima)["Training set", "RMSE"]
cat("\n ETS RMSE:", acc_ets, "\n")
##
## ETS RMSE: 516453.6
cat("\n ARIMA RMSE:", acc_arima, "\n")
##
## ARIMA RMSE: 845160.6
# Select better model
if (acc_ets < acc_arima) {
  best_fit <- fit_ets
  model_used <- "ETS"
} else {
  best_fit <- fit_arima
  model_used <- "ARIMA"
}
cat("\n Selected model:", model_used, "\n")
##
## Selected model: ETS
# Forecast for 12 months
power_fc <- forecast(best_fit, h = 12)
# Build future month dates (Jan-Dec 2014): step monthly from the last observed
# month and drop the starting month so each date falls on the 1st
future_dates <- seq(from = max(power$Date), by = "month", length.out = 13)[-1]
forecast_df <- data.frame(
  Date = future_dates,
  Forecast = as.numeric(power_fc$mean),
  Lower80 = as.numeric(power_fc$lower[,1]),
  Upper80 = as.numeric(power_fc$upper[,1]),
  Lower95 = as.numeric(power_fc$lower[,2]),
  Upper95 = as.numeric(power_fc$upper[,2])
)
ggplot() +
  # Historical data
  geom_line(data = power, aes(x = Date, y = KWH), color = "steelblue") +
  # 95% prediction interval (wider, lighter)
  geom_ribbon(data = forecast_df,
              aes(x = Date, ymin = Lower95, ymax = Upper95),
              fill = "blue", alpha = 0.15) +
  # 80% prediction interval (narrower, slightly darker)
  geom_ribbon(data = forecast_df,
              aes(x = Date, ymin = Lower80, ymax = Upper80),
              fill = "blue", alpha = 0.3) +
  # Point forecast
  geom_line(data = forecast_df,
            aes(x = Date, y = Forecast),
            color = "darkblue", linewidth = 1) +
  labs(
    title = paste("Monthly Residential Power Consumption Forecast (", model_used, ")", sep = ""),
    x = "Year",
    y = "KWH (Kilowatt hours)"
  ) +
  scale_x_date(date_breaks = "2 years", date_labels = "%Y") +
  theme_minimal()
A 12-month forecast for January-December 2014 was generated using the selected model. The output includes point forecasts along with 80% and 95% prediction intervals, which quantify the expected range of future values.
The forecast continues the yearly seasonal cycle: higher energy consumption in summer and winter, with dips in spring and fall. The prediction intervals widen modestly over time, reflecting increasing uncertainty at longer horizons.
The file Residential_Power_2014_Forecast.xlsx contains the monthly forecast results and both interval estimates for reference.
# Create dataframe for Excel
power_forecast_df <- data.frame(
  Month = seq(as.Date("2014-01-01"), by = "month", length.out = 12),
  Forecast = round(power_fc$mean, 0),
  Lower80 = round(power_fc$lower[,1], 0),
  Upper80 = round(power_fc$upper[,1], 0),
  Lower95 = round(power_fc$lower[,2], 0),
  Upper95 = round(power_fc$upper[,2], 0)
)
# Save results
write_xlsx(power_forecast_df, "/Users/aaliyahmjh/Documents/FA25/DATA624/Residential_Power_2014_Forecast.xlsx")
head(power_forecast_df)
## Month Forecast Lower80 Upper80 Lower95 Upper95
## 1 2014-01-01 7665313 6814500 8516126 6364107 8966519
## 2 2014-02-01 5793862 5150621 6437104 4810109 6777615
## 3 2014-03-01 4681272 4161430 5201114 3886242 5476302
## 4 2014-04-01 5689482 5057533 6321430 4723000 6655964
## 5 2014-05-01 7643044 6793908 8492179 6344403 8941684
## 6 2014-06-01 6592915 5860277 7325553 5472441 7713389
Monthly residential power consumption data from January 1998 to December 2013 exhibited strong seasonality, with pronounced peaks during the summer months, a smaller winter peak, and lower consumption in the spring and fall. The gradual upward trend across years indicates steady growth in residential energy demand, likely driven by population increases and greater household electricity use.
Both ETS and ARIMA models were evaluated for forecasting. The ETS model achieved a markedly lower training RMSE (approximately 516,454 vs 845,161 for ARIMA) and provided a smoother representation of the clear seasonal cycle, making it the better fit for this dataset.
The 12-month forecast for 2014 continues this pattern of seasonal variation, with projected peaks in mid-summer, a secondary winter peak, and dips in the spring and fall months. Predicted monthly consumption ranges between approximately 4.5 and 9.0 million kWh, with uncertainty gradually increasing toward the end of 2014, as reflected by the widening prediction intervals.
Overall, the model effectively captures historical consumption behavior and offers realistic projections for 2014. The seasonal regularity makes this data ideal for ETS or seasonal ARIMA approaches, both of which can accommodate repeating patterns and long-term growth.
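As a quick supporting check on that claim, the residuals of the selected ETS fit (fit_ets above) could be examined; a minimal sketch:
# Residual diagnostics sketch: plots the residual series, ACF, and histogram,
# and reports a Ljung-Box test for remaining autocorrelation in fit_ets.
checkresiduals(fit_ets)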
While the ETS model provided accurate forecasts and effectively captured the seasonal nature of residential electricity consumption, several limitations should be noted:
Historical Dependence: The model relies entirely on past consumption patterns and does not incorporate explanatory factors such as weather conditions, economic activity, or population growth, which can significantly influence power demand.
Data Gaps and Anomalies: Minor missing values and irregularities, such as the dip observed around 2010, may reduce model precision, especially if they stem from reporting or measurement inconsistencies rather than true behavioral changes.
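One way the first limitation could be addressed, if a suitable covariate such as monthly average temperature were available, is dynamic regression (ARIMA with external regressors). The sketch below is purely illustrative: the provided file contains no weather data, so a synthetic temperature series stands in for a real covariate.
# Hypothetical dynamic-regression sketch (synthetic covariate; illustration only).
set.seed(624)
n <- length(power_ts)
avg_temp <- ts(20 + 10 * sin(2 * pi * (1:n) / 12) + rnorm(n, sd = 2),
               start = start(power_ts), frequency = 12)  # stand-in "temperature" series

fit_xreg <- auto.arima(na.interp(power_ts), xreg = as.numeric(avg_temp))

# Re-use the last observed year of the stand-in covariate as its "future" values
future_temp <- as.numeric(window(avg_temp, start = c(2013, 1)))
fc_xreg <- forecast(fit_xreg, xreg = future_temp)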
This analysis demonstrates that exponential smoothing (ETS) is well-suited for forecasting residential power consumption that exhibits consistent, repeating seasonal patterns. The model projects continued cyclical variation in 2014, with higher demand during summer months and moderate consumption during cooler seasons.
The forecasts can aid utility companies and energy planners in aligning production, grid management, and resource allocation with predictable seasonal peaks.
Part C consists of two data sets. These are simple two-column sets, but they have different time stamps. Your optional assignment is to time-base sequence the data and aggregate based on hour (an example of what this looks like follows). Note: for multiple recordings within an hour, take the mean. Then determine whether the data are stationary and can be forecast. If so, provide a week-forward forecast and present the results via RPubs and .rmd, with the forecast in an Excel-readable file.
Two datasets, Waterflow_Pipe1.xlsx and Waterflow_Pipe2.xlsx, were imported and cleaned to ensure consistent timestamp formatting. Each contained irregular time intervals, occasional duplicate timestamps, and multiple readings within the same hour. To create a uniform time base, all measurements were aggregated to hourly averages; when more than one reading occurred within an hour, the mean value was taken.
Timestamps were standardized by converting from Excel serial or text formats into POSIX datetime objects, and each record was labeled by pipe (“Pipe 1” or “Pipe 2”). The final hourly dataset included three variables: hour, Value, and Pipe, allowing both data streams to be analyzed together at a consistent resolution suitable for time-series modeling.
path_pipe1 <- "/Users/aaliyahmjh/Documents/FA25/DATA624/Waterflow_Pipe1.xlsx"
path_pipe2 <- "/Users/aaliyahmjh/Documents/FA25/DATA624/Waterflow_Pipe2.xlsx"
hourly_aggregate <- function(path, id_label) {
  df <- read_excel(path)
  # Handle possible column name variations
  if (all(c("Date", "Time") %in% names(df))) {
    df <- df %>% mutate(DateTime = as_datetime(paste(Date, Time)))
  } else if ("Date Time" %in% names(df)) {
    df <- df %>% rename(DateTime = `Date Time`)
  } else if ("DateTime" %in% names(df)) {
    df <- df
  } else {
    stop("No recognizable datetime column found in file.")
  }
  # Convert Excel serials if numeric
  if (is.numeric(df$DateTime)) {
    df <- df %>% mutate(DateTime = as_datetime(DateTime * 86400, origin = "1899-12-30"))
  }
  df %>%
    rename(Value = WaterFlow) %>%
    mutate(hour = floor_date(DateTime, "hour")) %>%
    group_by(hour) %>%
    summarise(Value = mean(Value, na.rm = TRUE), .groups = "drop") %>%
    mutate(Pipe = id_label)
}
pipe1_hr <- hourly_aggregate(path_pipe1, "Pipe1")
pipe2_hr <- hourly_aggregate(path_pipe2, "Pipe2")
# Quick look
head(pipe1_hr, 10)
## # A tibble: 10 × 3
## hour Value Pipe
## <dttm> <dbl> <chr>
## 1 2015-10-23 00:00:00 26.1 Pipe1
## 2 2015-10-23 01:00:00 18.9 Pipe1
## 3 2015-10-23 02:00:00 15.2 Pipe1
## 4 2015-10-23 03:00:00 23.1 Pipe1
## 5 2015-10-23 04:00:00 15.5 Pipe1
## 6 2015-10-23 05:00:00 22.7 Pipe1
## 7 2015-10-23 06:00:00 20.6 Pipe1
## 8 2015-10-23 07:00:00 18.4 Pipe1
## 9 2015-10-23 08:00:00 20.8 Pipe1
## 10 2015-10-23 09:00:00 21.2 Pipe1
head(pipe2_hr, 10)
## # A tibble: 10 × 3
## hour Value Pipe
## <dttm> <dbl> <chr>
## 1 2015-10-23 01:00:00 30.9 Pipe2
## 2 2015-10-23 03:00:00 38.0 Pipe2
## 3 2015-10-23 04:00:00 34.0 Pipe2
## 4 2015-10-23 06:00:00 28.2 Pipe2
## 5 2015-10-23 07:00:00 18.3 Pipe2
## 6 2015-10-23 09:00:00 55.8 Pipe2
## 7 2015-10-23 10:00:00 61.3 Pipe2
## 8 2015-10-23 12:00:00 55.7 Pipe2
## 9 2015-10-23 13:00:00 37.1 Pipe2
## 10 2015-10-23 15:00:00 35.1 Pipe2
To confirm the data were appropriate for ARIMA modeling, two stationarity tests were applied:
ADF test: p < 0.05, so the null hypothesis of a unit root is rejected (the series is stationary).
KPSS test: p > 0.05, so level stationarity cannot be rejected.
Both tests agree that each series is stationary, meaning the mean and variance remain roughly constant over time.
pipe1_ts <- ts(pipe1_hr$Value, frequency = 24) # daily cycle
pipe2_ts <- ts(pipe2_hr$Value, frequency = 24)
# ADF (tests for unit root -> nonstationarity)
adf.test(pipe1_ts)
## Warning in adf.test(pipe1_ts): p-value smaller than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: pipe1_ts
## Dickey-Fuller = -6.2649, Lag order = 6, p-value = 0.01
## alternative hypothesis: stationary
adf.test(pipe2_ts)
## Warning in adf.test(pipe2_ts): p-value smaller than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: pipe2_ts
## Dickey-Fuller = -8.6584, Lag order = 8, p-value = 0.01
## alternative hypothesis: stationary
# KPSS (tests for trend stationarity)
kpss.test(pipe1_ts)
## Warning in kpss.test(pipe1_ts): p-value greater than printed p-value
##
## KPSS Test for Level Stationarity
##
## data: pipe1_ts
## KPSS Level = 0.24656, Truncation lag parameter = 4, p-value = 0.1
kpss.test(pipe2_ts)
##
## KPSS Test for Level Stationarity
##
## data: pipe2_ts
## KPSS Level = 0.42049, Truncation lag parameter = 6, p-value = 0.06832
Because both series were stationary and lacked strong trend components, I used ARIMA models, selected automatically with auto.arima(). This function chooses the best parameters and seasonal components based on AICc, making it reliable for short-term forecasting without manual tuning.
I considered methods like ETS or exponential smoothing, but those work best when clear trends or evolving seasonal patterns are present. Decomposition-based approaches were also unnecessary here because the data were already stationary. ARIMA was the best match since it handles short-term autocorrelation and produces stable forecasts for steady, mean-reverting data like these water flow series.
Each model produced a one-week-ahead forecast (168 hours).
The resulting forecasts for both pipes are centered around their respective long-run means: roughly 20 units for Pipe 1 and about 40 units for Pipe 2.
The shaded regions on the forecast plots represent 95% prediction intervals, showing the expected range of variation over the week. The flat appearance does not indicate model failure; it reflects that the data have no clear upward or downward pattern, so the ARIMA models predict stable, mean-reverting behavior.
fit1 <- auto.arima(pipe1_ts)
fit2 <- auto.arima(pipe2_ts)
# Forecast 1 week ahead = 168 hours
fc1 <- forecast(fit1, h = 168)
fc2 <- forecast(fit2, h = 168)
autoplot(fc1) +
  labs(title = "Pipe 1: One-Week Forecast (ARIMA)",
       y = "Flow (units)", x = "Time") +
  theme_minimal()
autoplot(fc2) +
  labs(title = "Pipe 2: One-Week Forecast (ARIMA)",
       y = "Flow (units)", x = "Time") +
  theme_minimal()
# Combined forecast visualization
weekly_fc <- bind_rows(
  data.frame(
    Time = seq(max(pipe1_hr$hour) + hours(1), by = "hour", length.out = 168),
    Forecast = as.numeric(fc1$mean),
    Lower95 = as.numeric(fc1$lower[,2]),
    Upper95 = as.numeric(fc1$upper[,2]),
    Pipe = "Pipe1"
  ),
  data.frame(
    Time = seq(max(pipe2_hr$hour) + hours(1), by = "hour", length.out = 168),
    Forecast = as.numeric(fc2$mean),
    Lower95 = as.numeric(fc2$lower[,2]),
    Upper95 = as.numeric(fc2$upper[,2]),
    Pipe = "Pipe2"
  )
)

ggplot(weekly_fc, aes(Time, Forecast, color = Pipe, fill = Pipe)) +
  geom_line(linewidth = 0.8) +
  geom_ribbon(aes(ymin = Lower95, ymax = Upper95), alpha = 0.15, color = NA) +
  labs(
    title = "One-Week Ahead Water Flow Forecasts (ARIMA)",
    subtitle = "Forecasted hourly flow by pipe with 95% prediction intervals",
    x = "Time (hourly)", y = "Water Flow (forecasted units)"
  ) +
  theme_minimal() +
  theme(legend.position = "bottom")
The results were exported to an Excel-readable file: Waterflow_Weekly_Forecasts.xlsx.
pipe1_fc_df <- data.frame(
  Hour = seq(max(pipe1_hr$hour) + 3600, by = 3600, length.out = 168),
  Forecast = as.numeric(fc1$mean),
  Lower95 = as.numeric(fc1$lower[,2]),
  Upper95 = as.numeric(fc1$upper[,2]),
  Pipe = "Pipe1"
)
pipe2_fc_df <- data.frame(
  Hour = seq(max(pipe2_hr$hour) + 3600, by = 3600, length.out = 168),
  Forecast = as.numeric(fc2$mean),
  Lower95 = as.numeric(fc2$lower[,2]),
  Upper95 = as.numeric(fc2$upper[,2]),
  Pipe = "Pipe2"
)
write_xlsx(bind_rows(pipe1_fc_df, pipe2_fc_df),
           "/Users/aaliyahmjh/Documents/FA25/DATA624/Waterflow_Weekly_Forecasts.xlsx")
Hourly water-flow data from Pipe 1 and Pipe 2 were aggregated from raw timestamped measurements into hourly means to ensure consistent time intervals for analysis. After cleaning and aggregation, each series displayed distinct flow behavior: Pipe 1 maintained a steady flow of roughly 20 units, while Pipe 2 averaged about 40 units with noticeably greater variability.
Stationarity tests confirmed the suitability of both series for time-series forecasting. The ADF test rejected the null hypothesis of a unit root (p < 0.05), and the KPSS test failed to reject level stationarity (p > 0.05), indicating both series were stationary.
Each pipe’s flow series was modeled using an ARIMA model selected automatically via auto.arima(), which optimized parameters based on AICc. One-week-ahead forecasts (168 hours) were produced and visualized with 95% prediction intervals.
These prediction intervals represent the expected operational ranges for each pipe and can inform proactive decisions, such as pump scheduling, flow balancing, or maintenance planning.
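As an example of turning the interval output into planning figures, the weekly_fc table built above can be condensed per pipe; a small sketch:
# Condense the one-week forecasts into per-pipe planning numbers (weekly_fc from above)
weekly_fc %>%
  group_by(Pipe) %>%
  summarise(
    MeanForecast = mean(Forecast),
    PeakUpper95  = max(Upper95),   # upper bound relevant for capacity planning
    .groups = "drop"
  )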
Overall, this analysis demonstrates that short-term ARIMA forecasting can effectively model high-frequency operational data when appropriately aggregated and validated for stationarity.
While the ARIMA models performed reasonably well for short-term forecasting, several limitations should be noted:
After timestamp normalization and hourly aggregation, both Pipe 1 and Pipe 2 were successfully modeled and forecasted using ARIMA. The models indicate that Pipe 1 maintains steady flow near 20 units, while Pipe 2 averages about 40 units with greater variability. These results provide a practical short-term outlook to support water-flow management, pump scheduling, and system-monitoring decisions.