624 Project 1

Question 1: ATM Forecasting

Introduction

The goal of this analysis is to forecast daily ATM withdrawals for four machines through the end of May 2010. These forecasts support better cash replenishment decisions and reduce the risk of shortages.

library(readxl)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(forecast)

## Warning: package 'forecast' was built under R version 4.5.3

library(writexl)

## Warning: package 'writexl' was built under R version 4.5.3

data <- read_excel("C:/Users/rbron/Downloads/ATM624Data (1).xlsx")
data$DATE <- as.Date(data$DATE)

This dataset contains daily cash withdrawals for four ATMs, measured in hundreds of dollars.

atm_list <- split(data, data$ATM)

Each ATM is modeled separately to capture differences in withdrawal behavior.

ts_list <- list()

for(name in names(atm_list)) {
  atm_data <- atm_list[[name]] %>% arrange(DATE)
  ts_list[[name]] <- ts(atm_data$Cash, frequency = 7)
}

Each dataset is converted into a time series object with weekly frequency.

models <- list()

models[["ATM1"]] <- Arima(ts_list[["ATM1"]], order = c(1,1,1))
models[["ATM2"]] <- Arima(ts_list[["ATM2"]], order = c(1,1,1))
models[["ATM3"]] <- Arima(ts_list[["ATM3"]], order = c(0,0,2))
models[["ATM4"]] <- Arima(ts_list[["ATM4"]], order = c(0,0,0), include.mean = TRUE)

print(models)

## $ATM1
## Series: ts_list[["ATM1"]] 
## ARIMA(1,1,1) 
## 
## Coefficients:
##         ar1      ma1
##       0.055  -0.9899
## s.e.  0.054   0.0135
## 
## sigma^2 = 1353:  log likelihood = -1814.5
## AIC=3635.01   AICc=3635.07   BIC=3646.7
## 
## $ATM2
## Series: ts_list[["ATM2"]] 
## ARIMA(1,1,1) 
## 
## Coefficients:
##          ar1      ma1
##       -0.007  -0.9823
## s.e.   0.053   0.0077
## 
## sigma^2 = 1500:  log likelihood = -1838.08
## AIC=3682.16   AICc=3682.23   BIC=3693.86
## 
## $ATM3
## Series: ts_list[["ATM3"]] 
## ARIMA(0,0,2) with non-zero mean 
## 
## Coefficients:
##          ma1     ma2    mean
##       0.8369  0.8516  0.7294
## s.e.  0.0497  0.0613  0.7049
## 
## sigma^2 = 25.4:  log likelihood = -1108.16
## AIC=2224.32   AICc=2224.44   BIC=2239.92
## 
## $ATM4
## Series: ts_list[["ATM4"]] 
## ARIMA(0,0,0) with non-zero mean 
## 
## Coefficients:
##           mean
##       474.0433
## s.e.   34.0248
## 
## sigma^2 = 423718:  log likelihood = -2882.03
## AIC=5768.06   AICc=5768.1   BIC=5775.86

ARIMA was chosen because it is better with short term forecasting and can be a more flexible model.

ATM1 and ATM2 are modeled using ARIMA(1,1,1), which captures both trend and short-term dependence in the data. ATM3 uses an ARIMA(0,0,2) model, indicating no underlying trend but the presence of short-term shock effects. ATM4 is modeled using a mean model, suggesting there is no strong time dependence in its withdrawal pattern.

forecasts <- list()

for(name in names(models)) {
  forecasts[[name]] <- forecast(models[[name]], h = 17)
}

for(name in names(forecasts)) {
  plot(forecasts[[name]], main = paste("Forecast for", name))
}

Forecasting the remaining 17 days of May.

ATM1

ATM1 shows relatively stable behavior with moderate fluctuations. The forecast remains centered around its historical average, with widening prediction intervals indicating increasing uncertainty over time.

ATM2

ATM2 displays similar variability to ATM1 but with slightly more dispersion. The forecast suggests consistent withdrawals with no strong upward or downward trend.

ATM3

ATM3 behaves very differently. The historical data is mostly near zero with a sudden spike. As a result, the model predicts values close to zero going forward. This suggests that the spike was likely an anomaly rather than a recurring pattern.

ATM4

ATM4 shows large variability and extreme spikes in the historical data. Because of this, a simple mean model was used. The forecast centers around the average withdrawal level but with very wide prediction intervals, reflecting high uncertainty.

forecast_df <- data.frame()

for(name in names(forecasts)) {
  
  fc_vals <- as.numeric(forecasts[[name]]$mean)
  
  temp <- data.frame(
    ATM = name,
    Day = 1:17,
    Forecast_Cash = fc_vals
  )
  
  forecast_df <- rbind(forecast_df, temp)
  
}

#r export forecast to excel

# Create proper future dates (remaining days of May 2010)

last_date <- max(data$DATE)
future_dates <- seq(from = last_date + 1, by = "day", length.out = 17)

# Build clean forecast dataframe with actual dates
forecast_df <- data.frame()

for(name in names(forecasts)) {
  
  fc_vals <- as.numeric(forecasts[[name]]$mean)
  
  temp <- data.frame(
    ATM = name,
    Date = future_dates,
    Forecast_Cash = round(fc_vals, 2)   # cleaner numbers
  )
  
  forecast_df <- rbind(forecast_df, temp)
}

# Optional: reshape for a more readable Excel format (wide format)
forecast_wide <- forecast_df %>%
  tidyr::pivot_wider(names_from = ATM, values_from = Forecast_Cash)

# Write BOTH formats to Excel (two tabs)
write_xlsx(
  list(
    "Long_Format" = forecast_df,
    "Readable_Table" = forecast_wide
  ),
  "ATM_Forecasts_May2010.xlsx"
)

# Preview
head(forecast_wide)

## # A tibble: 6 × 5
##   Date        ATM1  ATM2  ATM3  ATM4
##   <date>     <dbl> <dbl> <dbl> <dbl>
## 1 2080-05-16  81.6  57.0  3.24  474.
## 2 2080-05-17  81.4  57.3  2.08  474.
## 3 2080-05-18  81.4  57.3  0.73  474.
## 4 2080-05-19  81.4  57.3  0.73  474.
## 5 2080-05-20  81.4  57.3  0.73  474.
## 6 2080-05-21  81.4  57.3  0.73  474.

ATM1 and ATM2 show stable and predictable demand, making them suitable for consistent replenishment schedules. In contrast, ATM3 appears inactive or irregular, with near-zero expected withdrawals, suggesting it should be monitored for potential operational issues or low usage. ATM4 is highly volatile, indicating a need for a larger safety buffer to manage uncertainty in withdrawals. Based on these insights, it is recommended to maintain regular cash replenishment for ATM1 and ATM2, closely monitor ATM3, and allocate additional buffer cash to ATM4. Overall, the models provide reasonable short-term forecasts, highlighting that while some ATMs follow stable patterns, others require more adaptive cash management strategies.

Q2: Forecasting Residential Power Useage

This analysis models monthly residential electricity consumption from January 1998 through December 2013 and produces a forecast for 2014. Since energy usage typically exhibits both trend and seasonality, time series models such as ARIMA and ETS are used to capture these patterns and generate accurate forecasts.

library(readxl)
library(forecast)
library(tseries)

## Warning: package 'tseries' was built under R version 4.5.3

## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo

library(ggplot2)
library(writexl)

# Load data
power_data <- read_excel("C:/Users/rbron/Downloads/ResidentialCustomerForecastLoad-624.xlsx")

# Convert to time series
power_ts <- ts(power_data$KWH, start = c(1998, 1), frequency = 12)

# Plot the series
autoplot(power_ts) +
  ggtitle("Monthly Residential Power Usage (KWH)") +
  xlab("Year") +
  ylab("KWH")

The time series plot shows a clear and consistent seasonal pattern, with regular peaks and troughs each year. There is also a slight upward trend over time and a noticeable drop around 2010, suggesting a possible outlier or unusual event.

# Handle missing values
power_ts_clean <- na.interp(power_ts)

# Decompose the series
decomp <- decompose(power_ts_clean)
autoplot(decomp)

The decomposition confirms strong seasonality with a stable repeating yearly pattern. The trend component shows gradual growth with a temporary dip around 2010, while the remainder is mostly random aside from a few extreme observations.

# Stationarity test
adf.test(power_ts_clean)

## Warning in adf.test(power_ts_clean): p-value smaller than printed p-value

## 
##  Augmented Dickey-Fuller Test
## 
## data:  power_ts_clean
## Dickey-Fuller = -4.8594, Lag order = 5, p-value = 0.01
## alternative hypothesis: stationary

# Differencing
diff_ts <- diff(power_ts_clean, differences = 1)
seasonal_diff_ts <- diff(diff_ts, lag = 12)

# Plot differenced series
autoplot(seasonal_diff_ts) +
  ggtitle("Differenced Series")

# ACF and PACF plots
Acf(seasonal_diff_ts)

Pacf(seasonal_diff_ts)

The Augmented Dickey-Fuller test indicates that the differenced series is stationary. Applying both first and seasonal differencing removes trend and seasonal effects, resulting in a stable series suitable for modeling. The ACF and PACF plots suggest the presence of both short-term and seasonal dependencies.

# Fit ARIMA model
model_arima <- auto.arima(power_ts, seasonal = TRUE)
summary(model_arima)

## Series: power_ts 
## ARIMA(0,0,2)(2,1,0)[12] with drift 
## 
## Coefficients:
##          ma1     ma2     sar1     sar2     drift
##       0.1739  0.0505  -0.7591  -0.4124  8750.907
## s.e.  0.0766  0.0844   0.0697   0.0682  3214.839
## 
## sigma^2 = 7.841e+11:  log likelihood = -2707.12
## AIC=5426.25   AICc=5426.73   BIC=5445.4
## 
## Training set error measures:
##                     ME     RMSE      MAE       MPE     MAPE     MASE
## Training set -9730.731 845160.6 506279.9 -5.055329 11.57781 0.730904
##                     ACF1
## Training set 0.008533477

# Residual diagnostics
checkresiduals(model_arima)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,0,2)(2,1,0)[12] with drift
## Q* = 14.247, df = 20, p-value = 0.8178
## 
## Model df: 4.   Total lags used: 24

The selected ARIMA model captures both short-term and seasonal patterns in the data. Residual diagnostics show no significant autocorrelation, and the Ljung-Box test indicates that the residuals behave like white noise, confirming a good model fit.

# Forecast 2014
forecast_2014 <- forecast(model_arima, h = 12)

autoplot(forecast_2014) +
  ggtitle("Forecast of Monthly Power Usage for 2014") +
  xlab("Year") +
  ylab("KWH")

forecast_2014

##          Point Forecast   Lo 80    Hi 80   Lo 95    Hi 95
## Jan 2014        9790458 8655663 10925252 8054939 11525977
## Feb 2014        8624753 7472931  9776575 6863193 10386313
## Mar 2014        6623643 5470398  7776888 4859907  8387379
## Apr 2014        5974459 4821213  7127704 4210722  7738195
## May 2014        5935072 4781827  7088317 4171336  7698808
## Jun 2014        8199158 7045913  9352403 6435422  9962894
## Jul 2014        9498803 8345558 10652048 7735067 11262539
## Aug 2014        9999104 8845859 11152349 8235368 11762840
## Sep 2014        8456634 7303389  9609879 6692898 10220370
## Oct 2014        5860011 4706766  7013256 4096275  7623747
## Nov 2014        6143003 4989758  7296248 4379267  7906739
## Dec 2014        8245603 7092357  9398848 6481866 10009339

# Convert ARIMA forecast to a data frame
forecast_df <- data.frame(
  Month = seq(as.Date("2014-01-01"), by = "month", length.out = 12),
  Forecast_KWH = as.numeric(forecast_2014$mean),
  Lower_80 = as.numeric(forecast_2014$lower[,1]),
  Upper_80 = as.numeric(forecast_2014$upper[,1]),
  Lower_95 = as.numeric(forecast_2014$lower[,2]),
  Upper_95 = as.numeric(forecast_2014$upper[,2])
)

# ETS model
model_ets <- ets(power_ts)
forecast_ets <- forecast(model_ets, h = 12)

autoplot(forecast_ets) +
  ggtitle("ETS Forecast for 2014")

# Accuracy comparison
accuracy(model_arima)

##                     ME     RMSE      MAE       MPE     MAPE     MASE
## Training set -9730.731 845160.6 506279.9 -5.055329 11.57781 0.730904
##                     ACF1
## Training set 0.008533477

accuracy(model_ets)

##                    ME   RMSE      MAE       MPE     MAPE      MASE      ACF1
## Training set 56391.73 830674 502742.5 -4.469644 12.06537 0.7257972 0.1665399

#ETS forecast
forecast_ets_df <- data.frame(
  Month = seq(as.Date("2014-01-01"), by = "month", length.out = 12),
  ETS_Forecast_KWH = as.numeric(forecast_ets$mean)
)

# Combine both forecasts
final_forecast <- dplyr::left_join(forecast_df, forecast_ets_df, by = "Month")

# Export to Excel
write_xlsx(final_forecast, "Power_Forecast_2014.xlsx")

The forecast for 2014 maintains the strong seasonal pattern observed in the historical data. Prediction intervals widen over time, reflecting increasing uncertainty, but overall trends remain stable and consistent.

Conclusion: Residential electricity usage exhibits strong seasonal behavior with predictable fluctuations throughout the year. The ARIMA model effectively captures these dynamics and provides reliable forecasts for 2014. The ETS model produces a similar seasonal forecast pattern. Both models perform comparably, though the ARIMA model shows slightly better accuracy based on error metrics.

The results suggest stable cyclical demand with moderate uncertainty. These forecasts can support planning decisions such as resource allocation and preparation for periods of peak electricity usage.

Q3: Bonus-Waterflow Pipe

This analysis examines two water flow datasets collected at irregular time intervals. To enable meaningful comparison and forecasting, both datasets are first aligned to a common hourly time scale by aggregating observations within each hour using the mean.

After standardizing the data, time series methods are applied to evaluate stationarity and determine whether the series can be reliably forecast. Appropriate ARIMA models are then fitted, and one-week (168-hour) forecasts are generated to provide insight into short-term water flow behavior for each pipe.

The required libraries are loaded and both datasets are imported. Column names are standardized for consistency.

library(readxl)
library(dplyr)
library(lubridate)

## 
## Attaching package: 'lubridate'

## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

library(forecast)
library(tseries)
library(ggplot2)
library(writexl)

pipe1 <- read_excel("C:/Users/rbron/Downloads/Waterflow_Pipe1.xlsx",
                    col_types = c("numeric", "numeric"))

pipe2 <- read_excel("C:/Users/rbron/Downloads/Waterflow_Pipe2.xlsx",
                    col_types = c("numeric", "numeric"))

colnames(pipe1) <- c("DateTimeSerial", "Flow")
colnames(pipe2) <- c("DateTimeSerial", "Flow")

Timestamps are converted into proper datetime format to allow time-based operations.

pipe1$Timestamp <- as.POSIXct(pipe1$DateTimeSerial * 86400,
                              origin = "1899-12-30", tz = "UTC")

pipe2$Timestamp <- as.POSIXct(pipe2$DateTimeSerial * 86400,
                              origin = "1899-12-30", tz = "UTC")

Data is grouped by hour. When multiple observations occur within the same hour, the mean value is used.

pipe1_hourly <- pipe1 %>%
  mutate(Hour = floor_date(Timestamp, "hour")) %>%
  group_by(Hour) %>%
  summarise(Flow = mean(Flow, na.rm = TRUE)) %>%
  ungroup()

pipe2_hourly <- pipe2 %>%
  mutate(Hour = floor_date(Timestamp, "hour")) %>%
  group_by(Hour) %>%
  summarise(Flow = mean(Flow, na.rm = TRUE)) %>%
  ungroup()

The hourly aggregated data is converted into time series format using a frequency of 24 to represent daily cycles.

pipe1_ts <- ts(pipe1_hourly$Flow, frequency = 24)
pipe2_ts <- ts(pipe2_hourly$Flow, frequency = 24)

r ts} pipe1_ts <- ts(pipe1_hourly$Flow, frequency = 24) pipe2_ts <- ts(pipe2_hourly$Flow, frequency = 24)

The Augmented Dickey-Fuller test is used to evaluate whether each series is stationary.

adf.test(pipe1_ts)

## Warning in adf.test(pipe1_ts): p-value smaller than printed p-value

## 
##  Augmented Dickey-Fuller Test
## 
## data:  pipe1_ts
## Dickey-Fuller = -6.2649, Lag order = 6, p-value = 0.01
## alternative hypothesis: stationary

adf.test(pipe2_ts)

## Warning in adf.test(pipe2_ts): p-value smaller than printed p-value

## 
##  Augmented Dickey-Fuller Test
## 
## data:  pipe2_ts
## Dickey-Fuller = -8.6584, Lag order = 8, p-value = 0.01
## alternative hypothesis: stationary

The Augmented Dickey-Fuller test was applied to both time series. For Pipe 1, the reported p-value is 0.01 with a warning indicating the true p-value is even smaller. For Pipe 2, the same result is observed. Since both p-values are well below the 0.05 threshold, the null hypothesis of a unit root is rejected. Both series are therefore strongly stationary in levels, indicating stable mean and variance over time. Differencing is applied if the series is non-stationary to stabilize the mean.

ARIMA models are fitted automatically to each time series.

model_pipe1 <- auto.arima(pipe1_ts)
model_pipe2 <- auto.arima(pipe2_ts)

summary(model_pipe1)

## Series: pipe1_ts 
## ARIMA(0,0,0) with non-zero mean 
## 
## Coefficients:
##          mean
##       19.8927
## s.e.   0.2754
## 
## sigma^2 = 17.98:  log likelihood = -675.29
## AIC=1354.59   AICc=1354.64   BIC=1361.51
## 
## Training set error measures:
##                        ME     RMSE      MAE       MPE     MAPE      MASE
## Training set 3.043898e-13 4.231136 3.332455 -5.064092 18.29517 0.6846484
##                    ACF1
## Training set 0.02101267

summary(model_pipe2)

## Series: pipe2_ts 
## ARIMA(1,0,1) with non-zero mean 
## 
## Coefficients:
##           ar1     ma1     mean
##       -0.9861  0.9650  39.7820
## s.e.   0.0162  0.0264   0.5251
## 
## sigma^2 = 188.8:  log likelihood = -2692.75
## AIC=5393.5   AICc=5393.56   BIC=5411.51
## 
## Training set error measures:
##                        ME     RMSE      MAE       MPE     MAPE      MASE
## Training set -0.004127904 13.70841 11.10012 -21.76767 41.73933 0.7235743
##                      ACF1
## Training set -0.002008699

The auto.arima() procedure selected the following models:

Pipe 1: ARIMA(0,0,0) with mean Pipe 2: ARIMA(1,0,1) with mean

The Pipe 1 model suggests that the series behaves as a stationary process fluctuating randomly around a constant mean, with no significant autocorrelation structure.

The Pipe 2 model incorporates both autoregressive and moving average components, indicating the presence of short-term temporal dependencies in the data.

For Pipe 1:

RMSE ≈ 4.23 MAE ≈ 3.33 MAPE ≈ 18.30%

For Pipe 2:

RMSE ≈ 13.71 MAE ≈ 11.10 MAPE ≈ 41.74%

Pipe 1 demonstrates substantially lower forecast error, indicating a more stable and predictable process. Pipe 2 exhibits higher variability, which reduces forecast accuracy.

Forecasts are generated for 168 hours, representing one week.

forecast_pipe1 <- forecast(model_pipe1, h = 168)
forecast_pipe2 <- forecast(model_pipe2, h = 168)

autoplot(forecast_pipe1) + ggtitle("Pipe 1 Weekly Forecast")

autoplot(forecast_pipe2) + ggtitle("Pipe 2 Weekly Forecast")

Forecast Interpretation

The Pipe 1 forecast remains nearly constant around its mean (~20), with relatively narrow confidence intervals. This behavior is consistent with a white noise process and indicates no underlying trend or seasonal pattern.

The Pipe 2 forecast converges toward its long-run mean (~40), but with noticeably wider confidence intervals. This reflects greater uncertainty and volatility in the underlying process.

Important note about the warning: The warning indicating that the p-value is smaller than the printed value is expected behavior in the ADF test. This implies that the true p-value is less than 0.01, providing strong statistical evidence of stationarity.

Forecast results are saved into an Excel file with separate sheets for each pipe.

forecast_df1 <- data.frame(
  Time = 1:168,
  Forecast = as.numeric(forecast_pipe1$mean)
)

forecast_df2 <- data.frame(
  Time = 1:168,
  Forecast = as.numeric(forecast_pipe2$mean)
)

write_xlsx(
  list(
    Pipe1_Forecast = forecast_df1,
    Pipe2_Forecast = forecast_df2
  ),
  "C:/Users/rbron/Downloads/Waterflow_Forecasts.xlsx"
)

Both datasets were successfully aligned to a common hourly time base and aggregated using mean values within each hour. Statistical testing confirmed that both series are stationary without the need for differencing, allowing direct application of ARIMA models. The resulting forecasts indicate that Pipe 1 follows a stable, mean-reverting process with relatively high predictability, while Pipe 2 displays greater variability and reduced forecast precision. Overall, the modeling approach is appropriate, and the forecasts provide a reasonable short-term outlook for both systems.

624 Project 1

Rebecca Bronstein

2026-03-29