Automated Teller Machines (ATMs) play a critical role in modern banking by providing customers with continuous access to cash. Efficient management of ATM cash levels is essential to ensure reliable service while minimizing operational costs. The dataset used in this analysis contains daily cash withdrawal amounts from four ATM machines over a specified time period, with withdrawals measured in hundreds of dollars.
Understanding the patterns in cash demand—such as trends, variability, and potential seasonality—is important for developing accurate forecasts. Time series methods, including transformation, decomposition, and ARIMA modeling, are used to capture these patterns and generate reliable predictions for future cash requirements.
The primary objective of this analysis is to forecast daily cash withdrawals for each ATM for the month of May 2010. These forecasts will support better planning and decision-making for ATM cash management.
Financial institutions must ensure that ATMs are adequately stocked with cash at all times. Underestimating demand can lead to empty machines, resulting in poor customer experience and potential revenue loss. Overestimating demand, on the other hand, leads to excess cash being held in machines, increasing security risks and operational costs.
The key business problem is to determine how much cash should be allocated to each ATM on a daily basis. This requires accurate forecasting of withdrawal demand based on historical data.
By developing robust time series models, this analysis aims to:
Predict future cash withdrawal patterns Optimize cash replenishment schedules Reduce operational and transportation costs Minimize the risk associated with holding excess cash Improve overall service reliability
Ultimately, the goal is to provide a data-driven approach to ATM cash management, enabling more efficient and cost-effective operations.
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
library(readr)
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.4.3
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(tsibble)
## Warning: package 'tsibble' was built under R version 4.4.3
## Registered S3 method overwritten by 'tsibble':
## method from
## as_tibble.grouped_df dplyr
##
## Attaching package: 'tsibble'
## The following object is masked from 'package:lubridate':
##
## interval
## The following objects are masked from 'package:base':
##
## intersect, setdiff, union
library(fpp3)
## Warning: package 'fpp3' was built under R version 4.4.3
## Registered S3 methods overwritten by 'ggtime':
## method from
## +.gg_tsensemble feasts
## autolayer.fbl_ts fabletools
## autolayer.tbl_ts fabletools
## autoplot.dcmp_ts fabletools
## autoplot.fbl_ts fabletools
## autoplot.tbl_cf feasts
## autoplot.tbl_ts fabletools
## chooseOpsMethod.gg_tsensemble feasts
## fortify.fbl_ts fabletools
## grid.draw.gg_tsensemble feasts
## print.gg_tsensemble feasts
## scale_type.cf_lag feasts
## Warning: replacing previous import 'feasts::scale_x_cf_lag' by
## 'ggtime::scale_x_cf_lag' when loading 'fpp3'
## Warning: replacing previous import 'feasts::gg_season' by 'ggtime::gg_season'
## when loading 'fpp3'
## Warning: replacing previous import 'feasts::gg_tsresiduals' by
## 'ggtime::gg_tsresiduals' when loading 'fpp3'
## Warning: replacing previous import 'feasts::gg_irf' by 'ggtime::gg_irf' when
## loading 'fpp3'
## Warning: replacing previous import 'feasts::gg_arma' by 'ggtime::gg_arma' when
## loading 'fpp3'
## Warning: replacing previous import 'feasts::gg_tsdisplay' by
## 'ggtime::gg_tsdisplay' when loading 'fpp3'
## Warning: replacing previous import 'feasts::gg_subseries' by
## 'ggtime::gg_subseries' when loading 'fpp3'
## Warning: replacing previous import 'feasts::gg_lag' by 'ggtime::gg_lag' when
## loading 'fpp3'
## ── Attaching packages ──────────────────────────────────────────── fpp3 1.0.3 ──
## ✔ tibble 3.2.1 ✔ ggtime 0.2.0
## ✔ tidyr 1.3.1 ✔ feasts 0.4.2
## ✔ ggplot2 3.5.2 ✔ fable 0.5.0
## ✔ tsibbledata 0.4.1
## Warning: package 'ggplot2' was built under R version 4.4.3
## Warning: package 'tsibbledata' was built under R version 4.4.3
## Warning: package 'ggtime' was built under R version 4.4.3
## Warning: package 'feasts' was built under R version 4.4.3
## Warning: package 'fabletools' was built under R version 4.4.3
## Warning: package 'fable' was built under R version 4.4.3
## ── Conflicts ───────────────────────────────────────────────── fpp3_conflicts ──
## ✖ lubridate::date() masks base::date()
## ✖ dplyr::filter() masks stats::filter()
## ✖ feasts::gg_arma() masks ggtime::gg_arma()
## ✖ feasts::gg_irf() masks ggtime::gg_irf()
## ✖ feasts::gg_lag() masks ggtime::gg_lag()
## ✖ feasts::gg_season() masks ggtime::gg_season()
## ✖ feasts::gg_subseries() masks ggtime::gg_subseries()
## ✖ feasts::gg_tsdisplay() masks ggtime::gg_tsdisplay()
## ✖ feasts::gg_tsresiduals() masks ggtime::gg_tsresiduals()
## ✖ tsibble::intersect() masks base::intersect()
## ✖ tsibble::interval() masks lubridate::interval()
## ✖ dplyr::lag() masks stats::lag()
## ✖ feasts::scale_x_cf_lag() masks ggtime::scale_x_cf_lag()
## ✖ tsibble::setdiff() masks base::setdiff()
## ✖ tsibble::union() masks base::union()
library(tidyr)
library(ggplot2)
library(zoo)
## Warning: package 'zoo' was built under R version 4.4.3
##
## Attaching package: 'zoo'
## The following object is masked from 'package:tsibble':
##
## index
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
library(janitor)
## Warning: package 'janitor' was built under R version 4.4.3
##
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
ATMDATA <- read_csv("https://raw.githubusercontent.com/tanzil64/DATA624/refs/heads/main/ATM624Data.csv")
ATMDATA <- ATMDATA %>%
mutate(DATE = as.Date(DATE, format = "%m/%d/%Y")) %>%
filter(!is.na(DATE))
head(ATMDATA)
summary(ATMDATA)
## DATE ATM Cash
## Min. :2009-05-01 Length:1474 Min. : 0.0
## 1st Qu.:2009-08-01 Class :character 1st Qu.: 0.5
## Median :2009-11-01 Mode :character Median : 73.0
## Mean :2009-10-31 Mean : 155.6
## 3rd Qu.:2010-02-01 3rd Qu.: 114.0
## Max. :2010-05-14 Max. :10920.0
## NA's :19
str(ATMDATA)
## tibble [1,474 × 3] (S3: tbl_df/tbl/data.frame)
## $ DATE: Date[1:1474], format: "2009-05-01" "2009-05-01" ...
## $ ATM : chr [1:1474] "ATM1" "ATM2" "ATM1" "ATM2" ...
## $ Cash: num [1:1474] 96 107 82 89 85 90 90 55 99 79 ...
head(ATMDATA)
summary(ATMDATA)
## DATE ATM Cash
## Min. :2009-05-01 Length:1474 Min. : 0.0
## 1st Qu.:2009-08-01 Class :character 1st Qu.: 0.5
## Median :2009-11-01 Mode :character Median : 73.0
## Mean :2009-10-31 Mean : 155.6
## 3rd Qu.:2010-02-01 3rd Qu.: 114.0
## Max. :2010-05-14 Max. :10920.0
## NA's :19
library(naniar)
gg_miss_var(ATMDATA)
atm <- ATMDATA %>%
mutate(Cash = as.integer(Cash))
str(atm)
## tibble [1,474 × 3] (S3: tbl_df/tbl/data.frame)
## $ DATE: Date[1:1474], format: "2009-05-01" "2009-05-01" ...
## $ ATM : chr [1:1474] "ATM1" "ATM2" "ATM1" "ATM2" ...
## $ Cash: int [1:1474] 96 107 82 89 85 90 90 55 99 79 ...
Because the four ATM machines may behave differently, each series will be analyzed separately.
atm1 <- atm %>%
filter(ATM == "ATM1") %>%
as_tsibble(index = DATE)
atm2 <- atm %>%
filter(ATM == "ATM2") %>%
as_tsibble(index = DATE)
atm3 <- atm %>%
filter(ATM == "ATM3") %>%
as_tsibble(index = DATE)
atm4 <- atm %>%
filter(ATM == "ATM4") %>%
as_tsibble(index = DATE)
autoplot(atm1) +
labs(title = "ATM1", subtitle = "Cash withdrawals per day", y = "Hundreds USD")
autoplot(atm2) +
labs(title = "ATM2", subtitle = "Cash withdrawals per day", y = "Hundreds USD")
autoplot(atm3) +
labs(title = "ATM3", subtitle = "Cash withdrawals per day", y = "Hundreds USD")
autoplot(atm4) +
labs(title = "ATM4", subtitle = "Cash withdrawals per day", y = "Hundreds USD")
From the plots, ATM1 and ATM2 appear to have relatively stable variation
and possible seasonality. ATM3 contains very few observations. ATM4
contains one clear outlier. # Outlier Treatment
which(atm4$Cash == 10919)
## integer(0)
The value 10919 is an obvious outlier, so it is replaced with the mean of the ATM4 series.
atm4 <- atm4 %>%
mutate(Cash = ifelse(Cash == 10919,
round(mean(Cash, na.rm = TRUE), 0),
Cash))
autoplot(atm4) +
labs(title = "ATM4 After Outlier Correction",
subtitle = "Cash withdrawals per day",
y = "Hundreds USD")
sum(is.na(atm1$Cash))
## [1] 3
sum(is.na(atm2$Cash))
## [1] 2
sum(is.na(atm3$Cash))
## [1] 0
sum(is.na(atm4$Cash))
## [1] 0
which(is.na(atm1$Cash))
## [1] 44 47 53
which(is.na(atm2$Cash))
## [1] 49 55
We inspect the distributions of ATM1 and ATM2 before imputing the missing values.
hist(atm1$Cash, main = "ATM1 Cash Distribution", xlab = "Cash")
hist(atm2$Cash, main = "ATM2 Cash Distribution", xlab = "Cash")
Since the distributions do not appear heavily skewed, the missing values
are imputed using the median.
atm1$Cash[44] <- round(median(atm1$Cash, na.rm = TRUE), 0)
atm1$Cash[47] <- round(median(atm1$Cash, na.rm = TRUE), 0)
atm1$Cash[53] <- round(median(atm1$Cash, na.rm = TRUE), 0)
atm2$Cash[49] <- round(median(atm2$Cash, na.rm = TRUE), 0)
atm2$Cash[55] <- round(median(atm2$Cash, na.rm = TRUE), 0)
lambda1 <- atm1 %>%
filter(!is.na(Cash)) %>%
features(Cash + 1, guerrero) %>%
pull(lambda_guerrero)
lambda2 <- atm2 %>%
filter(!is.na(Cash)) %>%
features(Cash + 1, guerrero) %>%
pull(lambda_guerrero)
lambda3 <- atm3 %>%
filter(!is.na(Cash)) %>%
features(Cash + 1, guerrero) %>%
pull(lambda_guerrero)
lambda4 <- atm4 %>%
filter(!is.na(Cash)) %>%
features(Cash + 1, guerrero) %>%
pull(lambda_guerrero)
plot_trans1 <- atm1 %>%
autoplot(box_cox(Cash + 1, lambda1)) +
labs(title = "ATM1 Transformed", subtitle = "Cash withdrawals per day", y = "USD")
plot_trans2 <- atm2 %>%
autoplot(box_cox(Cash + 1, lambda2)) +
labs(title = "ATM2 Transformed", subtitle = "Cash withdrawals per day", y = "USD")
plot_trans3 <- atm3 %>%
autoplot(box_cox(Cash + 1, lambda3)) +
labs(title = "ATM3 Transformed", subtitle = "Cash withdrawals per day", y = "USD")
plot_trans4 <- atm4 %>%
autoplot(box_cox(Cash + 1, lambda4)) +
labs(title = "ATM4 Transformed", subtitle = "Cash withdrawals per day", y = "USD")
atm1 %>%
model(classical_decomposition(box_cox(Cash, lambda1), type = "additive")) %>%
components() %>%
autoplot() +
labs(title = "Classical Additive Decomposition of ATM1")
atm2 %>%
model(classical_decomposition(box_cox(Cash, lambda2), type = "additive")) %>%
components() %>%
autoplot() +
labs(title = "Classical Additive Decomposition of ATM2")
atm3 %>%
model(classical_decomposition(box_cox(Cash, lambda3), type = "additive")) %>%
components() %>%
autoplot() +
labs(title = "Classical Additive Decomposition of ATM3")
atm4 %>%
model(classical_decomposition(box_cox(Cash, lambda4), type = "additive")) %>%
components() %>%
autoplot() +
labs(title = "Classical Additive Decomposition of ATM4")
plot_acf1 <- atm1 %>%
ACF(box_cox(Cash, lambda1)) %>%
autoplot() +
labs(title = "Autocorrelation of Cash - ATM1")
plot_acf2 <- atm2 %>%
ACF(box_cox(Cash, lambda2)) %>%
autoplot() +
labs(title = "Autocorrelation of Cash - ATM2")
plot_acf3 <- atm3 %>%
ACF(box_cox(Cash, lambda3)) %>%
autoplot() +
labs(title = "Autocorrelation of Cash - ATM3")
plot_acf4 <- atm4 %>%
ACF(box_cox(Cash, lambda4)) %>%
autoplot() +
labs(title = "Autocorrelation of Cash - ATM4")
library(gridExtra)
grid.arrange(plot_acf1, plot_acf2, plot_acf3, plot_acf4, nrow = 2)
atm1 %>% features(box_cox(Cash, lambda1), unitroot_nsdiffs)
atm2 %>% features(box_cox(Cash, lambda2), unitroot_nsdiffs)
atm3 %>% features(box_cox(Cash, lambda3), unitroot_nsdiffs)
atm4 %>% features(box_cox(Cash, lambda4), unitroot_nsdiffs)
The results indicate that ATM1 and ATM2 require one seasonal difference, while ATM3 and ATM4 do not.
We next check whether any additional regular differencing is required.
atm1 %>% features(difference(box_cox(Cash, lambda1), 7), unitroot_ndiffs)
atm2 %>% features(difference(box_cox(Cash, lambda2), 7), unitroot_ndiffs)
atm3 %>% features(box_cox(Cash, lambda3), unitroot_ndiffs)
atm4 %>% features(box_cox(Cash, lambda4), unitroot_ndiffs)
No additional non-seasonal differencing appears necessary.
atm1 %>%
ACF(difference(box_cox(Cash, lambda1), 7)) %>%
autoplot() +
labs(title = "ACF of Differenced ATM1")
atm2 %>%
ACF(difference(box_cox(Cash, lambda2), 7)) %>%
autoplot() +
labs(title = "ACF of Differenced ATM2")
After differencing, the series look much closer to white noise. # Model Building
Because ATM1 and ATM2 need seasonal differencing, ARIMA models are appropriate. ATM3 has too little data, so a naive model is more reasonable. # ATM1 Model
atm1_fit <- atm1 %>%
model(ARIMA(box_cox(Cash, lambda1)))
report(atm1_fit)
## Series: Cash
## Model: ARIMA(0,0,2)(0,1,1)[7]
## Transformation: box_cox(Cash, lambda1)
##
## Coefficients:
## ma1 ma2 sma1
## 0.0980 -0.1143 -0.6463
## s.e. 0.0524 0.0527 0.0425
##
## sigma^2 estimated as 1.483: log likelihood=-578.9
## AIC=1165.81 AICc=1165.92 BIC=1181.33
atm1_fit %>% gg_tsresiduals()
atm2_fit <- atm2 %>%
model(ARIMA(box_cox(Cash, lambda2)))
report(atm2_fit)
## Series: Cash
## Model: ARIMA(2,0,2)(0,1,1)[7]
## Transformation: box_cox(Cash, lambda2)
##
## Coefficients:
## ar1 ar2 ma1 ma2 sma1
## -0.4246 -0.9072 0.4679 0.8003 -0.7257
## s.e. 0.0587 0.0439 0.0890 0.0584 0.0409
##
## sigma^2 estimated as 67.61: log likelihood=-1261.88
## AIC=2535.75 AICc=2535.99 BIC=2559.03
atm2_fit %>% gg_tsresiduals()
atm3_fit <- atm3 %>%
model(NAIVE(box_cox(Cash, lambda3)))
report(atm3_fit)
## Series: Cash
## Model: NAIVE
## Transformation: box_cox(Cash, lambda3)
##
## sigma^2: 5.0877
atm3_fit %>% gg_tsresiduals()
atm4_fit <- atm4 %>%
model(ARIMA(box_cox(Cash, lambda4)))
report(atm4_fit)
## Series: Cash
## Model: ARIMA(0,0,0)(2,0,0)[7] w/ mean
## Transformation: box_cox(Cash, lambda4)
##
## Coefficients:
## sar1 sar2 constant
## 0.2501 0.1951 2.4713
## s.e. 0.0521 0.0525 0.0460
##
## sigma^2 estimated as 0.8133: log likelihood=-479.32
## AIC=966.65 AICc=966.76 BIC=982.25
atm4_fit %>% gg_tsresiduals()
We now forecast 31 days ahead for May 2010.
forecast_atm1 <- atm1_fit %>% forecast(h = 31)
forecast_atm2 <- atm2_fit %>% forecast(h = 31)
forecast_atm3 <- atm3_fit %>% forecast(h = 31)
forecast_atm4 <- atm4_fit %>% forecast(h = 31)
autoplot(forecast_atm1, atm1) + labs(title = "Forecast for ATM1")
autoplot(forecast_atm2, atm2) + labs(title = "Forecast for ATM2")
autoplot(forecast_atm3, atm3) + labs(title = "Forecast for ATM3")
autoplot(forecast_atm4, atm4) + labs(title = "Forecast for ATM4")
write.csv(as.data.frame(forecast_atm1), "forecast_atm1.csv", row.names = FALSE)
write.csv(as.data.frame(forecast_atm2), "forecast_atm2.csv", row.names = FALSE)
write.csv(as.data.frame(forecast_atm3), "forecast_atm3.csv", row.names = FALSE)
write.csv(as.data.frame(forecast_atm4), "forecast_atm4.csv", row.names = FALSE)
The business problem is to determine how much cash each ATM should hold in order to avoid shortages while minimizing excess cash. The time series models developed in this analysis directly address this problem by converting historical withdrawal patterns into quantitative forecasts of future demand.
First, the models capture key characteristics of the data, such as trend, variability, and seasonality. For example, ATM1 and ATM2 showed clear weekly patterns, which were incorporated through seasonal differencing in the ARIMA models. By accounting for these patterns, the models are able to produce more accurate and reliable forecasts.
Second, the forecasting step provides daily predictions of cash withdrawals for each ATM. These forecasts allow decision-makers to estimate how much cash will be needed on each day of May 2010. Instead of relying on intuition or simple averages, the bank can use these statistically grounded predictions to plan cash allocation.
As a result, the models help solve the business problem in the following ways:
Prevent stockouts: Forecasts ensure that enough cash is loaded into each ATM to meet expected demand, reducing the risk of machines running empty. Reduce excess cash: By avoiding overestimation, the bank minimizes unnecessary cash holding, lowering security risks and opportunity costs. Optimize replenishment schedules: Knowing expected demand allows for efficient planning of cash delivery routes and frequencies. Improve operational efficiency: Resources such as staff and transportation can be allocated more effectively based on predicted needs.
In summary, the models transform raw historical data into actionable insights, enabling the bank to make informed, data-driven decisions about ATM cash management. This directly improves service reliability while reducing operational costs.
Accurate forecasting of residential electricity demand is critical for efficient energy planning and resource allocation. The dataset used in this analysis contains monthly residential power consumption from January 1998 through December 2013. The goal is to build a time series model and generate forecasts for 2014.
Energy providers must anticipate future electricity demand to ensure reliable service while minimizing operational costs. Underestimating demand can lead to shortages, while overestimating demand results in inefficient resource allocation. This analysis aims to develop a data-driven forecasting model to support better planning and decision-making.
power <- read_csv("https://raw.githubusercontent.com/tanzil64/DATA624/refs/heads/main/ResidentialCustomerForecastLoad-624.csv") %>%
rename(Date = `YYYY-MMM`) %>%
mutate(Date = yearmonth(Date)) %>%
as_tsibble(index = Date)
power
power %>% filter(is.na(KWH))
There is one missing value (September 2008). This value is replaced using the median of the series.
power <- power %>%
mutate(KWH = ifelse(is.na(KWH),
median(KWH, na.rm = TRUE),
KWH))
power %>%
autoplot(KWH) +
labs(title = "Monthly Residential Power Consumption",
y = "KWH")
The data shows strong seasonal patterns and variability over time.
power %>%
gg_season(KWH) +
labs(title = "Seasonal Plot")
power %>%
gg_subseries(KWH) +
labs(title = "Subseries Plot")
These plots confirm strong yearly seasonality.
lambda <- power %>%
features(KWH, guerrero) %>%
pull(lambda_guerrero)
lambda
## [1] 0.1041666
power %>%
autoplot(box_cox(KWH, lambda)) +
labs(title = "Box-Cox Transformed Series")
The transformation stabilizes variance and improves model suitability.
power %>%
ACF(box_cox(KWH, lambda)) %>%
autoplot() +
labs(title = "ACF Plot")
The slow decay of the ACF suggests non-stationarity.
power %>%
features(KWH, unitroot_nsdiffs)
power %>%
features(KWH, unitroot_ndiffs)
Seasonal differencing is required, which will be handled internally by the ARIMA model.
fit_arima <- power %>%
model(ARIMA(box_cox(KWH, lambda)))
report(fit_arima)
## Series: KWH
## Model: ARIMA(0,1,2)(0,0,2)[12]
## Transformation: box_cox(KWH, lambda)
##
## Coefficients:
## ma1 ma2 sma1 sma2
## -0.7266 -0.2492 0.2269 0.2104
## s.e. 0.0717 0.0706 0.0787 0.0656
##
## sigma^2 estimated as 1.294: log likelihood=-295.53
## AIC=601.05 AICc=601.38 BIC=617.31
fit_arima %>%
gg_tsresiduals()
The residual diagnostics indicate that the residuals behave like white noise, suggesting a good model fit.
fit_ets <- power %>%
model(ETS(KWH))
report(fit_ets)
## Series: KWH
## Model: ETS(M,N,M)
## Smoothing parameters:
## alpha = 0.1167131
## gamma = 0.0001038822
##
## Initial states:
## l[0] s[0] s[-1] s[-2] s[-3] s[-4] s[-5] s[-6]
## 6133758 0.9567738 0.7524052 0.8702234 1.184894 1.261653 1.185654 1.000083
## s[-7] s[-8] s[-9] s[-10] s[-11]
## 0.7742886 0.8153702 0.9116166 1.065073 1.221965
##
## sigma^2: 0.0144
##
## AIC AICc BIC
## 6224.787 6227.514 6273.649
fit_ets %>%
gg_tsresiduals()
accuracy(fit_arima)
accuracy(fit_ets)
The ARIMA model shows lower RMSE and better residual behavior, making it the preferred model.
fc <- fit_arima %>%
forecast(h = "12 months")
fc %>%
autoplot(power) +
labs(title = "Forecast for 2014")
fc_final <- fc %>%
as_tibble() %>%
mutate(KWH = inv_box_cox(.mean, lambda))
fc_final
write.csv(fc_final, "results_rcfl.csv", row.names = FALSE)
The ARIMA model captures both trend and seasonal patterns in electricity consumption and generates reliable monthly forecasts. These forecasts allow energy providers to anticipate demand, optimize production, and improve operational efficiency.
This analysis developed a forecasting model for residential electricity demand using historical monthly data. After handling missing values and applying a Box-Cox transformation, the ARIMA model was selected based on its superior performance and residual diagnostics. The forecasts for 2014 provide a reliable basis for planning and decision-making, helping energy providers manage resources efficiently and reduce operational risk.
The objective of this analysis is to model and forecast total water flow using time series techniques. Accurate forecasting supports operational planning, system efficiency, and resource allocation for water infrastructure management.
Two datasets representing water flow from Pipe 1 and Pipe 2 were combined to create a unified time series. Missing values were handled by replacing them with zero, ensuring continuity in the dataset.
The data was aggregated to hourly intervals to align both datasets and create a consistent time index. The resulting time series reflects total water flow across both pipelines.
library(readr)
library(dplyr)
library(tsibble)
library(lubridate)
library(forecast)
library(ggplot2)
# Load Pipe 1
wfp1 <- read_csv(
"https://raw.githubusercontent.com/tanzil64/DATA624/refs/heads/main/Waterflow_Pipe1.csv",
show_col_types = FALSE
) %>%
rename(WaterFlow = WaterFlow) %>%
mutate(DateTime = mdy_hm(DateTime))
# Load Pipe 2
wfp2 <- read_csv(
"https://raw.githubusercontent.com/tanzil64/DATA624/refs/heads/main/Waterflow_Pipe2.csv",
show_col_types = FALSE
) %>%
rename(WaterFlow = WaterFlow) %>%
mutate(DateTime = mdy_hm(DateTime))
# Match Pipe 1 hour format to project example
wfp1 <- wfp1 %>%
mutate(Date = as.Date(DateTime),
Time = hour(DateTime) + 1) %>%
group_by(Date, Time) %>%
summarise(Water = mean(WaterFlow, na.rm = TRUE), .groups = "drop") %>%
mutate(DateTime = ymd_h(paste(Date, Time))) %>%
select(DateTime, Water)
# Match Pipe 2 hour format to project example
wfp2 <- wfp2 %>%
mutate(Date = as.Date(DateTime),
Time = hour(DateTime)) %>%
group_by(Date, Time) %>%
summarise(Water = mean(WaterFlow, na.rm = TRUE), .groups = "drop") %>%
mutate(DateTime = ymd_h(paste(Date, Time))) %>%
select(DateTime, Water)
# Combine the two pipes
water <- full_join(wfp1, wfp2, by = "DateTime", suffix = c("_1", "_2")) %>%
mutate(
Water_1 = ifelse(is.na(Water_1), 0, Water_1),
Water_2 = ifelse(is.na(Water_2), 0, Water_2),
Water = Water_1 + Water_2
) %>%
select(DateTime, Water)
A Box-Cox transformation was applied to stabilize variance. Differencing analysis showed:
Non-seasonal differencing required (d = 1) No seasonal differencing required (D = 0)
After transformation and differencing, the series exhibited stationarity, making it suitable for ARIMA modeling.
# Create time series
water_ts <- ts(water$Water, frequency = 24)
# Plot series
autoplot(water_ts) +
labs(title = "Combined Water Flow Time Series",
x = "Time",
y = "Water Flow")
# Seasonal plots
ggseasonplot(water_ts) +
theme(legend.title = element_blank()) +
labs(title = "Seasonal Plot of Water Flow")
ggsubseriesplot(water_ts) +
labs(title = "Subseries Plot of Water Flow")
# Box-Cox transformation
lambda <- BoxCox.lambda(water_ts)
water_bc <- BoxCox(water_ts, lambda)
# Display transformed series
ggtsdisplay(water_bc, main = "Box-Cox Transformed Water Flow")
# Differencing checks
ndiffs(water_ts)
## [1] 1
nsdiffs(water_ts)
## [1] 0
# Differenced transformed series
ggtsdisplay(diff(water_bc), points = FALSE,
main = "Differenced Box-Cox Transformed Water Flow")
# Manual ARIMA model
water_arima <- Arima(
water_ts,
order = c(0, 1, 1),
seasonal = c(0, 0, 1),
lambda = lambda
)
summary(water_arima)
## Series: water_ts
## ARIMA(0,1,1)(0,0,1)[24]
## Box Cox transformation: lambda= 0.8713037
##
## Coefficients:
## ma1 sma1
## -0.9578 0.0712
## s.e. 0.0101 0.0319
##
## sigma^2 = 103.3: log likelihood = -3734.4
## AIC=7474.8 AICc=7474.83 BIC=7489.52
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 0.1591549 16.33716 13.34589 -28.20783 50.26558 0.7507364
## ACF1
## Training set -0.04444759
checkresiduals(water_arima)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,1,1)(0,0,1)[24]
## Q* = 57.881, df = 46, p-value = 0.1124
##
## Model df: 2. Total lags used: 48
# Auto ARIMA model
water_auto <- auto.arima(
water_ts,
approximation = FALSE,
lambda = lambda
)
summary(water_auto)
## Series: water_ts
## ARIMA(0,1,1)(1,0,0)[24]
## Box Cox transformation: lambda= 0.8713037
##
## Coefficients:
## ma1 sar1
## -0.9578 0.0714
## s.e. 0.0101 0.0322
##
## sigma^2 = 103.3: log likelihood = -3734.4
## AIC=7474.8 AICc=7474.82 BIC=7489.52
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 0.1613506 16.33729 13.34564 -28.19083 50.25213 0.750722
## ACF1
## Training set -0.04431537
checkresiduals(water_auto)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,1,1)(1,0,0)[24]
## Q* = 57.858, df = 46, p-value = 0.1128
##
## Model df: 2. Total lags used: 48
# Final model
water_model <- water_auto
# Forecast 7 days = 168 hours
water_f <- forecast(water_model, h = 7 * 24, level = 95)
# Plot forecast
autoplot(water_f) +
labs(title = "Water Usage Forecast",
subtitle = "ARIMA Forecast for Combined Pipe Flow",
x = "Time",
y = "Water Flow")
# Export forecast
write.csv(as.data.frame(water_f), "Waterflow_Forecast.csv", row.names = FALSE)
# Show point forecasts
forecast_df <- as.data.frame(water_f) %>%
rename(Point_Forecast = `Point Forecast`) %>%
select(Point_Forecast)
forecast_df
Forecast values gradually stabilize around the mean Limited variability in predictions Confidence intervals reflect uncertainty in long-term forecasting
While the ARIMA model provides statistically valid forecasts, it shows limited responsiveness to short-term fluctuations. The forecasts converge toward the mean, suggesting:
The system has low predictable structure Water flow may be influenced by external factors not captured in the model Short-term variability is difficult to model with ARIMA alone
The ARIMA model provides a statistically sound approach for modeling water flow data. However, its forecasting capability is limited in capturing variability, indicating that additional modeling approaches may be required for enhanced predictive performance.
The ARIMA model delivers stable baseline forecasts for water flow but lacks the ability to capture short-term variability, limiting its effectiveness for operational decision-making.