Part A

Part A Forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable ‘Cash’ is provided in hundreds of dollars; other than that it is straightforward. Explain and demonstrate your process, techniques used and not used, and your actual forecast.

Overview

The objective of this analysis was to forecast the total amount of cash expected to be withdrawn from four ATM machines during May 2010, using historical daily withdrawal data contained in ATM624Data.xlsx. The data represent daily withdrawals in hundreds of dollars, covering May 2009 through April 2010; the file also contains empty placeholder rows for May 2010, the month to be forecast.

Data Preparation & EDA

Data Preparation

The dataset was imported from Excel and cleaned by:

  • Converting cash values from hundreds to actual dollars (Cash × 100).
  • Correcting the Excel serial-date conversion by specifying the proper origin (1899-12-30), ensuring all dates aligned with 2009-2010 instead of 2079-2080 (the roughly 70-year offset comes from reading Excel’s serial numbers against R’s default 1970 origin).
  • Standardizing variable names to lower-case and removing rows with missing values.
    • Missing rows were removed to ensure each observation had a valid date, ATM identifier, and cash value.
    • This prevents gaps in the time series and avoids errors in later visualization and modeling steps.
  • Sorting data chronologically within each ATM.

Exploratory Data Analysis

Key Observations

  • ATM 1 and ATM 2 show consistent daily withdrawal cycles with clear weekly seasonality, likely reflecting weekday vs. weekend customer behavior.
  • ATM 3 displays an unusual pattern: almost no activity for most of the timeline followed by a sudden surge near the end, which could indicate data reporting issues or a late activation of that ATM.
  • ATM 4 shows sparse activity overall with one extreme outlier, possibly due to a bulk cash refill or service event.

Visualization: Daily Withdrawals

  • Each ATM’s withdrawals were plotted over time using ggplot2 with facet_wrap() to allow comparison on different y-scales.
  • ATM 1 and ATM 2 dominate activity with clear repeating patterns, whereas ATM 3 and ATM 4 are nearly flat with a few spikes.
# Libraries used throughout this report
library(readxl); library(writexl)   # Excel import/export
library(dplyr); library(ggplot2)    # data wrangling and plotting
library(forecast)                   # ARIMA / ETS modeling and forecasting

atm <- read_excel("/Users/aaliyahmjh/Documents/FA25/DATA624/ATM624Data.xlsx") %>% 
  mutate(
    DATE = if (is.numeric(DATE)) as.Date(DATE, origin = "1899-12-30") else as.Date(DATE),
    Cash = Cash * 100   # convert from hundreds of dollars to dollars
  )

atm <- atm %>%
  rename_with(~tolower(.x)) %>%
  rename(date = any_of(c("date", "DATE")),
         cash = any_of(c("cash", "Cash")),
         atm = any_of(c("atm", "ATM"))) %>%
  filter(!is.na(date), !is.na(cash), !is.na(atm)) %>%
  arrange(atm, date)

summary(atm$date)
##         Min.      1st Qu.       Median         Mean      3rd Qu.         Max. 
## "2009-05-01" "2009-08-01" "2009-10-31" "2009-10-30" "2010-01-29" "2010-04-30"
range(atm$date)
## [1] "2009-05-01" "2010-04-30"
# Plot
ggplot(atm, aes(date, cash, color = atm)) +
  geom_line(linewidth = 0.7) +
  facet_wrap(~atm, scales = "free_y") +
  labs(
    title = "ATM Daily Withdrawals (Cleaned Data)",
    x = "Date",
    y = "Cash ($)"
  ) +
  theme_minimal() +
  theme(legend.position = "right")
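
Before modeling, the weekly cycle noted above can be verified directly from the autocorrelation function. A minimal sketch for ATM 1, assuming the cleaned lower-case atm data frame from the previous chunk; spikes at lags 7, 14, 21, and 28 would support a seven-day cycle:

# Sketch: check for a 7-day cycle in ATM1 withdrawals before modeling
atm1_cash <- atm %>% filter(atm == "ATM1") %>% pull(cash)
Acf(atm1_cash, lag.max = 28, main = "ATM1 Daily Cash: ACF")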

Model Fitting & Forecasting

Modeling Approach

Because each ATM operates independently, time-series models were fitted separately.

The ATM withdrawal data showed strong weekly seasonality, especially for ATMs 1 and 2, where withdrawals followed a seven-day rhythm. I considered simpler methods such as the mean, naïve, or seasonal naïve forecasts, but those either ignore the weekly pattern entirely or simply repeat last week’s values without adapting to short-term changes in withdrawal volume.

I also thought about using exponential smoothing (ETS) or decomposition-based methods, but these work best for longer, smoother time series with clear long-term trends. The ATM data were short and noisy, so those methods wouldn’t adapt well.

Technique Used

  • The more active machines (ATM 1 and 2) show clear weekly cycles. To capture these repeating patterns, I used an ARIMA model selected automatically with auto.arima() from the forecast package; a holdout comparison against the simpler candidates is sketched below.
    • This approach chooses the best combination of non-seasonal and seasonal terms (by AICc) without manual tuning, making it practical for short, daily time series.
  • I also chose this approach because auto.arima() selects any needed differencing and seasonal terms itself, so no manual differencing or transformation was required.
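
To make the comparison concrete, the candidate methods can be scored on a held-out window. A minimal sketch, assuming the cleaned atm data frame from above; the last 31 days of ATM 1 serve as the test set (the exact RMSE values will depend on the data, and ARIMA is not guaranteed to win on every machine):

# Sketch: holdout comparison of candidate methods on ATM1
atm1 <- atm %>% filter(atm == "ATM1") %>% arrange(date)
y <- ts(atm1$cash, frequency = 7)
n <- length(y)
train <- window(y, end = time(y)[n - 31])   # hold out the final 31 days
test  <- window(y, start = time(y)[n - 30])

fits <- list(
  snaive = snaive(train, h = 31),               # seasonal naive baseline
  ets    = forecast(ets(train), h = 31),        # exponential smoothing
  arima  = forecast(auto.arima(train), h = 31)  # automatic ARIMA
)
sapply(fits, function(f) accuracy(f, test)["Test set", "RMSE"])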

Forecasting Process

  • Each ATM’s series was transformed into a weekly seasonal time-series (frequency = 7).
  • A 31-day horizon (May 1-31, 2010) was forecast with 80% and 95% prediction intervals.
  • For ATMs with insufficient history, a simple mean forecast was used as a fallback.

From my previous visual inspection, I found that:

  • ATM 1 & 2: regular weekly cycles and stable volumes.
  • ATM 3: almost inactive except one large spike.
  • ATM 4: mostly small values but one extreme outlier.

These patterns suggested that ATMs 1 & 2 were ideal for model-based forecasting, while 3 & 4 required fallback logic.

# Re-read the raw file (no NA filter) so each ATM keeps its full history; the
# empty May 2010 placeholder rows lack ATM labels and are dropped by split()
# below, which is why the forecasts begin on 2010-05-01.
atm <- read_excel("ATM624Data.xlsx") %>%
  mutate(
    DATE = if (is.numeric(DATE)) as.Date(DATE, origin = "1899-12-30") else as.Date(DATE),
    Cash = Cash * 100   # convert from hundreds of dollars to dollars
  ) %>%
  arrange(ATM, DATE)

range(atm$DATE)
## [1] "2009-05-01" "2010-05-14"
# Split into list by ATM
atm_list <- split(atm, atm$ATM)

forecast_atm <- function(df) {
  df <- df %>% arrange(DATE)
  
  # Fallback: with fewer than 10 observations, return a flat mean forecast with
  # heuristic +/-10% (80%) and +/-20% (95%) bands instead of fitting a model.
  if (nrow(df) < 10) {
    avg <- mean(df$Cash, na.rm = TRUE)
    message(unique(df$ATM), ": too few points (", nrow(df), "). Using mean forecast.")
    return(tibble(
      Date = seq(max(df$DATE) + 1, by = "day", length.out = 31),
      Forecast = rep(avg, 31),
      Lower80 = rep(avg * 0.9, 31),
      Upper80 = rep(avg * 1.1, 31),
      Lower95 = rep(avg * 0.8, 31),
      Upper95 = rep(avg * 1.2, 31),
      ATM = unique(df$ATM)
    ))
  }
  
  # Weekly seasonal series (frequency = 7), automatic ARIMA, 31-day horizon
  ts_train <- ts(df$Cash, frequency = 7)
  fit <- auto.arima(ts_train)
  fc <- forecast(fit, h = 31)
  
  tibble(
    Date = seq(max(df$DATE) + 1, by = "day", length.out = 31),
    Forecast = as.numeric(fc$mean),
    Lower80 = as.numeric(fc$lower[, 1]),
    Upper80 = as.numeric(fc$upper[, 1]),
    Lower95 = as.numeric(fc$lower[, 2]),
    Upper95 = as.numeric(fc$upper[, 2]),
    ATM = unique(df$ATM)
  )
}

atm_forecasts <- lapply(atm_list, forecast_atm)
final_fc <- bind_rows(atm_forecasts)

# Save to Excel
write_xlsx(final_fc, "/Users/aaliyahmjh/Documents/FA25/DATA624/ATM_May2010_Forecasts.xlsx")

# Summarize monthly totals
monthly_totals <- final_fc %>%
  group_by(ATM) %>%
  summarise(May2010_TotalCash = sum(Forecast, na.rm = TRUE))

#  Save monthly totals to a separate file
write_xlsx(monthly_totals, "/Users/aaliyahmjh/Documents/FA25/DATA624/ATM_May2010_TotalForecasts.xlsx")

head(final_fc,15)
## # A tibble: 15 × 7
##    Date       Forecast Lower80 Upper80 Lower95 Upper95 ATM  
##    <date>        <dbl>   <dbl>   <dbl>   <dbl>   <dbl> <chr>
##  1 2010-05-01    8519.   5362.  11676.   3691.  13347. ATM1 
##  2 2010-05-02    9810.   6618.  13003.   4928.  14693. ATM1 
##  3 2010-05-03    7546.   4353.  10739.   2662.  12429. ATM1 
##  4 2010-05-04     353.  -2840.   3546.  -4531.   5237. ATM1 
##  5 2010-05-05    9849.   6656.  13043.   4966.  14733. ATM1 
##  6 2010-05-06    7814.   4621.  11008.   2931.  12698. ATM1 
##  7 2010-05-07    8398.   5204.  11591.   3514.  13281. ATM1 
##  8 2010-05-08    8490.   4884.  12095.   2976.  14003. ATM1 
##  9 2010-05-09    9749.   6135.  13363.   4222.  15276. ATM1 
## 10 2010-05-10    7764.   4150.  11378.   2237.  13292. ATM1 
## 11 2010-05-11     355.  -3260.   3969.  -5173.   5882. ATM1 
## 12 2010-05-12    9814.   6200.  13428.   4287.  15342. ATM1 
## 13 2010-05-13    7729.   4115.  11344.   2202.  13257. ATM1 
## 14 2010-05-14    8446.   4831.  12060.   2918.  13973. ATM1 
## 15 2010-05-15    8500.   4495.  12504.   2375.  14624. ATM1
monthly_totals
## # A tibble: 4 × 2
##   ATM   May2010_TotalCash
##   <chr>             <dbl>
## 1 ATM1            235716.
## 2 ATM2            188268.
## 3 ATM3               402.
## 4 ATM4           1469534.

Forecast Visualization

Each ATM’s forecast plot overlays the historical series (blue) with the May 2010 forecast (red) and its shaded 80% (darker) and 95% (lighter) prediction intervals:

for (atm_name in names(atm_list)) {
  obs <- atm %>% filter(ATM == atm_name)
  fc  <- atm_forecasts[[atm_name]]   # reuse the forecasts computed above (avoids refitting)
  
  p <- ggplot() +
    geom_line(data = obs, aes(x = DATE, y = Cash), color = "steelblue", linewidth = 1) +

    # 95% Prediction Interval (lighter)
    geom_ribbon(
      data = fc,
      aes(x = Date, ymin = Lower95, ymax = Upper95),
      fill = "firebrick", alpha = 0.15
    ) +

    # 80% Prediction Interval (darker)
    geom_ribbon(
      data = fc,
      aes(x = Date, ymin = Lower80, ymax = Upper80),
      fill = "firebrick", alpha = 0.3
    ) +

    # Forecast line
    geom_line(data = fc, aes(x = Date, y = Forecast), color = "firebrick", linewidth = 1) +

    labs(
      title = paste0(atm_name, " Forecast - May 2010"),
      x = "Date",
      y = "Cash ($)",
      caption = "Blue = Historical | Red = Forecast | Shaded = 80% (dark) and 95% (light) Prediction Intervals"
    ) +
    theme_minimal()
  
  print(p)
}

Discussion

The goal of this analysis was to forecast daily cash withdrawals for four ATMs for May 2010.
Each ATM’s historical series was modeled individually using an automated ARIMA approach when sufficient data were available, and a simple mean-based projection when the historical window was too short to support reliable parameter estimation.

The forecasts indicate stable withdrawal behavior consistent with previous months’ weekly cycles.
For ATMs 1 and 2, forecast daily withdrawals average roughly $7,600 and $6,100 respectively (consistent with the monthly totals above), with regular weekly fluctuations and no clear upward or downward trend heading into May 2010.
ATMs 3 and 4 displayed more irregular patterns, with ATM 3 showing near-zero activity and ATM 4 producing one unusually high spike, likely due to an outlier transaction or exceptional event.

The resulting Excel file, ATM_May2010_Forecasts.xlsx, contains day-by-day forecasts and corresponding 80% and 95% prediction intervals for each ATM.

Overall, ATM 4 is projected to account for roughly 75% of total cash volume, followed by ATMs 1 and 2, while ATM 3 remains minimally active—possibly due to location, accessibility, or lower customer traffic.
These results suggest that replenishment schedules for May can largely mirror those from April, with additional liquidity allocated to ATM 4 to accommodate its substantially higher demand.

Model Limitations

  • Historical dependence: The models rely entirely on past withdrawal patterns and do not incorporate external drivers such as holidays, paydays, or local events.
  • Sparse and distorted series: ATM 3’s near-total inactivity and ATM 4’s extreme outlier limit how much weight their forecasts should carry; ATM 4’s projected total in particular is likely inflated by that single spike.
  • Heuristic fallback intervals: For short series, the mean-based fallback uses fixed ±10% / ±20% bands rather than model-based prediction intervals.

Conclusion

This analysis successfully produced short-term forecasts of daily cash withdrawals for four ATMs for May 2010 using historical transaction data. Automated ARIMA models provided robust and interpretable forecasts for ATMs with sufficient history, while mean-based projections were applied to sparse or inconsistent series to maintain reliability.

Overall, the forecasts suggest stable withdrawal demand across the network, with ATM 4 accounting for the majority of expected volume. No major trend shifts were detected, implying that existing replenishment schedules remain appropriate, provided ATM 4 continues to receive proportionally higher allocations.

Part B

Part B consists of a simple dataset of residential power usage from January 1998 until December 2013. Your assignment is to model these data and produce a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in kilowatt hours; the rest is straightforward. Add this to your existing files above.

Objective

The goal of this analysis was to forecast monthly residential power consumption for the year 2014 based on historical usage data spanning January 1998-December 2013. The dataset includes monthly energy consumption (in kilowatt-hours, KWH) and exhibits a clear yearly seasonal cycle, likely corresponding to variations in heating and cooling demand.

Data Preparation & Exploration

The data was imported from Excel, cleaned, and formatted into a date-based structure suitable for time series analysis. Each observation represents monthly residential power consumption.

A time-series line plot of Monthly Residential Power Consumption (1998-2013) reveals:

  • Pronounced annual seasonality, with recurring peaks and troughs each year.
  • A slight upward trend over the 15-year period, indicating gradual growth in residential energy demand.
  • A temporary dip around 2010, possibly caused by missing data or an external anomaly such as economic slowdown or measurement gaps.

The data show strong seasonality: power use rises every summer and dips in cooler months, along with a gradual upward trend over time. Simple models like the mean or naïve forecast wouldn’t capture those repeating seasonal patterns. Decomposition could separate trend and seasonality, but it assumes they stay constant, which doesn’t fit the steady growth here.

Because of that, I focused on seasonal ARIMA and ETS models. Both handle seasonality and long-term trends well. SARIMA is good for capturing autocorrelation, while ETS adapts smoothly to changing seasonal patterns. These methods make the most sense for this kind of monthly energy consumption data.

# Import and clean data
power <- read_excel("/Users/aaliyahmjh/Documents/FA25/DATA624/ResidentialCustomerForecastLoad-624.xlsx") %>%
  rename(YearMonth = `YYYY-MMM`) %>%    # rename safely
  mutate(
    Date = as.Date(paste0("01-", YearMonth), format = "%d-%Y-%b"),
    KWH = as.numeric(KWH)
  ) %>%
  arrange(Date)

glimpse(power)
## Rows: 192
## Columns: 4
## $ CaseSequence <dbl> 733, 734, 735, 736, 737, 738, 739, 740, 741, 742, 743, 74…
## $ YearMonth    <chr> "1998-Jan", "1998-Feb", "1998-Mar", "1998-Apr", "1998-May…
## $ KWH          <dbl> 6862583, 5838198, 5420658, 5010364, 4665377, 6467147, 891…
## $ Date         <date> 1998-01-01, 1998-02-01, 1998-03-01, 1998-04-01, 1998-05-…
ggplot(power, aes(x = Date, y = KWH)) +
  geom_line(color = "steelblue", linewidth = 1) +
  labs(
    title = "Monthly Residential Power Consumption (1998-2013)",
    x = "Date",
    y = "KWH (Kilowatt hours)"
  ) +
  theme_minimal()
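
To visually confirm the seasonality and gradual trend described above, the series can be decomposed. A minimal STL sketch, assuming the power data frame from the chunk above; forecast::na.interp() interpolates the few missing KWH values, which stl() cannot handle:

# Sketch: STL decomposition to inspect trend and seasonal components
power_ts_filled <- na.interp(ts(power$KWH, start = c(1998, 1), frequency = 12))
plot(stl(power_ts_filled, s.window = "periodic"),
     main = "STL Decomposition of Monthly Residential KWH")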

Modeling & Forecasting

Two forecasting models were compared:

  • Exponential Smoothing (ETS)
  • Auto ARIMA

Both models were trained on the full series. Their RMSE (Root Mean Squared Error) values on the training data were compared to determine the better-performing model.

The ETS model achieved the lower RMSE, so it was selected for forecasting. ETS automatically chose an additive error, multiplicative trend, and additive seasonal (A, M, A) structure, which captures the steady year-to-year increase in energy consumption while adapting smoothly to recurring seasonal peaks.

The selected model produced a 12-month forecast for 2014 that continues the established seasonal pattern. The plot displays both 80% (darker band) and 95% (lighter band) prediction intervals: the narrower inner band captures the more likely outcomes, while the wider band reflects the greater uncertainty at the higher confidence level, which gradually increases further into 2014.
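
Because training-set RMSE can favor the more flexible model, a holdout check is a useful complement. A minimal sketch, assuming the power data frame from above, holding out 2013 as a test year (ets() will again fall back to the longest contiguous portion if missing values remain):

# Sketch: holdout comparison of ETS vs ARIMA on the final 12 months
y <- ts(power$KWH, start = c(1998, 1), frequency = 12)
train <- window(y, end = c(2012, 12))   # fit on 1998-2012
test  <- window(y, start = c(2013, 1))  # evaluate on 2013

fc_e <- forecast(ets(train), h = 12)
fc_a <- forecast(auto.arima(train), h = 12)
c(ETS   = accuracy(fc_e, test)["Test set", "RMSE"],
  ARIMA = accuracy(fc_a, test)["Test set", "RMSE"])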

# Convert to monthly time series
power_ts <- ts(power$KWH, start = c(1998, 1), frequency = 12)

# Train models
fit_ets   <- ets(power_ts)
## Warning in ets(power_ts): Missing values encountered. Using longest contiguous
## portion of time series
fit_arima <- auto.arima(power_ts)

# Compare accuracy
acc_ets   <- accuracy(fit_ets)["Training set", "RMSE"]
acc_arima <- accuracy(fit_arima)["Training set", "RMSE"]

cat("\n ETS RMSE:", acc_ets, "\n")
## 
##  ETS RMSE: 516453.6
cat("\n ARIMA RMSE:", acc_arima, "\n")
## 
##  ARIMA RMSE: 845160.6
# Select better model
if (acc_ets < acc_arima) {
  best_fit <- fit_ets
  model_used <- "ETS"
} else {
  best_fit <- fit_arima
  model_used <- "ARIMA"
}

cat("\n Selected model:", model_used, "\n")
## 
##  Selected model: ETS
# Forecast for 12 months
power_fc <- forecast(best_fit, h = 12)

# First-of-month dates for Jan-Dec 2014 (stepping by "month" from a month-end
# date such as Dec 31 can skip or roll over months, so step from the last
# observed first-of-month date and drop it)
future_dates <- seq(from = max(power$Date), by = "month", length.out = 13)[-1]

forecast_df <- data.frame(
  Date = future_dates,
  Forecast = as.numeric(power_fc$mean),
  Lower80 = as.numeric(power_fc$lower[,1]),
  Upper80 = as.numeric(power_fc$upper[,1]),
  Lower95 = as.numeric(power_fc$lower[,2]),
  Upper95 = as.numeric(power_fc$upper[,2])
)

ggplot() +
  # Historical data
  geom_line(data = power, aes(x = Date, y = KWH), color = "steelblue") +

  # 95% prediction interval (wider, lighter)
  geom_ribbon(data = forecast_df,
              aes(x = Date, ymin = Lower95, ymax = Upper95),
              fill = "blue", alpha = 0.15) +

  # 80% prediction interval (narrower, slightly darker)
  geom_ribbon(data = forecast_df,
              aes(x = Date, ymin = Lower80, ymax = Upper80),
              fill = "blue", alpha = 0.3) +

  # Point forecast
  geom_line(data = forecast_df,
            aes(x = Date, y = Forecast),
            color = "darkblue", linewidth = 1) +

  labs(
    title = paste("Monthly Residential Power Consumption Forecast (", model_used, ")", sep = ""),
    x = "Year",
    y = "KWH (Kilowatt hours)"
  ) +
  scale_x_date(date_breaks = "2 years", date_labels = "%Y") +
  theme_minimal()

Export Forecasts

A 12-month forecast for January-December 2014 was generated using the selected model. The output includes point forecasts along with 80% and 95% prediction intervals, which quantify the expected range of future values.

The forecast continues the yearly seasonal cycle: higher energy consumption in summer and winter, with dips in spring and fall. The prediction intervals widen modestly over time, reflecting increasing uncertainty for longer-term forecasts.

The file Residential_Power_2014_Forecast.xlsx contains the monthly forecast results and both interval estimates for reference.

# Create dataframe for Excel
power_forecast_df <- data.frame(
  Month = seq(as.Date("2014-01-01"), by = "month", length.out = 12),
  Forecast = round(power_fc$mean, 0),
  Lower80 = round(power_fc$lower[,1], 0),
  Upper80 = round(power_fc$upper[,1], 0),
  Lower95 = round(power_fc$lower[,2], 0),
  Upper95 = round(power_fc$upper[,2], 0)
)

# Save results
write_xlsx(power_forecast_df, "/Users/aaliyahmjh/Documents/FA25/DATA624/Residential_Power_2014_Forecast.xlsx")

head(power_forecast_df)
##        Month Forecast Lower80 Upper80 Lower95 Upper95
## 1 2014-01-01  7665313 6814500 8516126 6364107 8966519
## 2 2014-02-01  5793862 5150621 6437104 4810109 6777615
## 3 2014-03-01  4681272 4161430 5201114 3886242 5476302
## 4 2014-04-01  5689482 5057533 6321430 4723000 6655964
## 5 2014-05-01  7643044 6793908 8492179 6344403 8941684
## 6 2014-06-01  6592915 5860277 7325553 5472441 7713389

Discussion

Monthly residential power consumption data from January 1998 to December 2013 exhibited strong seasonality, with pronounced peaks during the summer months and lower consumption in cooler periods. The gradual upward trend across years indicates steady growth in residential energy demand, likely driven by population increases and greater household electricity use.

Both ETS and ARIMA models were evaluated for forecasting. The ETS model achieved a substantially lower training RMSE (516,453.6 vs. 845,160.6 for ARIMA) and provided a smoother representation of the clear seasonal cycle, making it the better fit for this dataset.

The 12-month forecast for 2014 continues this pattern of seasonal variation, with projected peaks in mid-summer and mid-winter and dips in spring and fall. Predicted monthly consumption ranges between approximately 4.5 and 9.0 million kWh, with uncertainty gradually increasing toward the end of 2014, as reflected by the widening prediction intervals.

Overall, the model effectively captures historical consumption behavior and offers realistic projections for 2014. The seasonal regularity makes this data ideal for ETS or seasonal ARIMA approaches, both of which can accommodate repeating patterns and long-term growth.

Model Limitations

While the ETS model provided accurate forecasts and effectively captured the seasonal nature of residential electricity consumption, several limitations should be noted:

  • Historical Dependence: The model relies entirely on past consumption patterns and does not incorporate explanatory factors such as weather conditions, economic activity, or population growth, which can significantly influence power demand.

  • Data Gaps and Anomalies: Minor missing values and irregularities, such as the dip observed around 2010, may reduce model precision, especially if they stem from reporting or measurement inconsistencies rather than true behavioral changes.

Conclusion

This analysis demonstrates that exponential smoothing (ETS) is well-suited for forecasting residential power consumption that exhibits consistent, repeating seasonal patterns. The model projects continued cyclical variation in 2014, with higher demand during summer months and moderate consumption during cooler seasons.

The forecasts can aid utility companies and energy planners in aligning production, grid management, and resource allocation with predictable seasonal peaks.

Part C

Part C consists of two data sets. These are simple 2-column sets; however, they have different time stamps. Your optional assignment is to time-base sequence the data and aggregate based on hour (an example of what this looks like follows). Note: for multiple recordings within an hour, take the mean. Then determine whether the data are stationary and can be forecast. If so, provide a week-forward forecast and present results via RPubs and .rmd, and the forecast in an Excel-readable file.

Data Preparation

Two datasets, Waterflow_Pipe1.xlsx and Waterflow_Pipe2.xlsx, were imported and cleaned to ensure consistent timestamp formatting. Each contained irregular time intervals, occasional duplicate timestamps, and multiple readings within the same hour. To create a uniform time base, all measurements were aggregated to hourly averages; when more than one reading occurred within an hour, their mean value was computed.

Timestamps were standardized by converting from Excel serial or text formats into POSIX datetime objects, and each record was labeled by pipe (“Pipe 1” or “Pipe 2”). The final hourly dataset included three variables: hour, Value, and Pipe, allowing both data streams to be analyzed together at a consistent resolution suitable for time-series modeling.

library(lubridate)   # as_datetime(), floor_date(), hours()
library(tseries)     # adf.test(), kpss.test()

path_pipe1 <- "/Users/aaliyahmjh/Documents/FA25/DATA624/Waterflow_Pipe1.xlsx"
path_pipe2 <- "/Users/aaliyahmjh/Documents/FA25/DATA624/Waterflow_Pipe2.xlsx"

hourly_aggregate <- function(path, id_label) {
  df <- read_excel(path)

  # Handle possible column name variations
  if (all(c("Date", "Time") %in% names(df))) {
    df <- df %>% mutate(DateTime = as_datetime(paste(Date, Time)))
  } else if ("Date Time" %in% names(df)) {
    df <- df %>% rename(DateTime = `Date Time`)
  } else if ("DateTime" %in% names(df)) {
    df <- df
  } else {
    stop("No recognizable datetime column found in file.")
  }

  # Convert Excel serials if numeric
  if (is.numeric(df$DateTime)) {
    df <- df %>% mutate(DateTime = as_datetime(DateTime * 86400, origin = "1899-12-30"))
  }

  df %>%
    rename(Value = WaterFlow) %>%
    mutate(hour = floor_date(DateTime, "hour")) %>%
    group_by(hour) %>%
    summarise(Value = mean(Value, na.rm = TRUE), .groups = "drop") %>%
    mutate(Pipe = id_label)
}

pipe1_hr <- hourly_aggregate(path_pipe1, "Pipe1")
pipe2_hr <- hourly_aggregate(path_pipe2, "Pipe2")

# Quick look
head(pipe1_hr, 10)
## # A tibble: 10 × 3
##    hour                Value Pipe 
##    <dttm>              <dbl> <chr>
##  1 2015-10-23 00:00:00  26.1 Pipe1
##  2 2015-10-23 01:00:00  18.9 Pipe1
##  3 2015-10-23 02:00:00  15.2 Pipe1
##  4 2015-10-23 03:00:00  23.1 Pipe1
##  5 2015-10-23 04:00:00  15.5 Pipe1
##  6 2015-10-23 05:00:00  22.7 Pipe1
##  7 2015-10-23 06:00:00  20.6 Pipe1
##  8 2015-10-23 07:00:00  18.4 Pipe1
##  9 2015-10-23 08:00:00  20.8 Pipe1
## 10 2015-10-23 09:00:00  21.2 Pipe1
head(pipe2_hr, 10)
## # A tibble: 10 × 3
##    hour                Value Pipe 
##    <dttm>              <dbl> <chr>
##  1 2015-10-23 01:00:00  30.9 Pipe2
##  2 2015-10-23 03:00:00  38.0 Pipe2
##  3 2015-10-23 04:00:00  34.0 Pipe2
##  4 2015-10-23 06:00:00  28.2 Pipe2
##  5 2015-10-23 07:00:00  18.3 Pipe2
##  6 2015-10-23 09:00:00  55.8 Pipe2
##  7 2015-10-23 10:00:00  61.3 Pipe2
##  8 2015-10-23 12:00:00  55.7 Pipe2
##  9 2015-10-23 13:00:00  37.1 Pipe2
## 10 2015-10-23 15:00:00  35.1 Pipe2

Stationarity Testing

To confirm the data were appropriate for ARIMA modeling:

  • ADF test: p < 0.05 for both series, so the null hypothesis of a unit root is rejected (evidence of stationarity).
  • KPSS test: p > 0.05 for both series (borderline for Pipe 2, p ≈ 0.068), so level stationarity cannot be rejected.

Both tests agree that each series is stationary, meaning the mean and variance remain roughly constant over time.

pipe1_ts <- ts(pipe1_hr$Value, frequency = 24)  # daily cycle
pipe2_ts <- ts(pipe2_hr$Value, frequency = 24)

# ADF (tests for unit root -> nonstationarity)
adf.test(pipe1_ts)
## Warning in adf.test(pipe1_ts): p-value smaller than printed p-value
## 
##  Augmented Dickey-Fuller Test
## 
## data:  pipe1_ts
## Dickey-Fuller = -6.2649, Lag order = 6, p-value = 0.01
## alternative hypothesis: stationary
adf.test(pipe2_ts)
## Warning in adf.test(pipe2_ts): p-value smaller than printed p-value
## 
##  Augmented Dickey-Fuller Test
## 
## data:  pipe2_ts
## Dickey-Fuller = -8.6584, Lag order = 8, p-value = 0.01
## alternative hypothesis: stationary
# KPSS (tests for trend stationarity)
kpss.test(pipe1_ts)
## Warning in kpss.test(pipe1_ts): p-value greater than printed p-value
## 
##  KPSS Test for Level Stationarity
## 
## data:  pipe1_ts
## KPSS Level = 0.24656, Truncation lag parameter = 4, p-value = 0.1
kpss.test(pipe2_ts)
## 
##  KPSS Test for Level Stationarity
## 
## data:  pipe2_ts
## KPSS Level = 0.42049, Truncation lag parameter = 6, p-value = 0.06832

Model Forecasting

Because both series were stationary and lacked strong trend components, I used ARIMA models, selected automatically with auto.arima(). This function chooses the best parameters and seasonal components based on AICc, making it reliable for short-term forecasting without manual tuning.

I considered exponential smoothing (ETS), but it works best when clear trends or evolving seasonal patterns are present. Decomposition-based approaches were likewise unnecessary because the data were already stationary. ARIMA was the best match, since it handles short-term autocorrelation and produces stable forecasts for steady, mean-reverting data like these water-flow series.

Each model produced a one-week-ahead forecast (168 hours).

The resulting forecasts for both pipes are centered around their respective long-run means:

  • Pipe 1: forecast centered near 20 units with narrow 95% prediction intervals.
  • Pipe 2: forecast centered near 40 units with wider 95% prediction intervals, reflecting its higher variability.

The shaded regions on the forecast plots represent 95% prediction intervals, showing the expected range of variation over the week. The flat appearance doesn’t indicate model failure; it reflects that the data have no clear upward or downward pattern, so the ARIMA model predicts stable mean-reverting behavior.

fit1 <- auto.arima(pipe1_ts)
fit2 <- auto.arima(pipe2_ts)

# Forecast 1 week ahead = 168 hours
fc1 <- forecast(fit1, h = 168)
fc2 <- forecast(fit2, h = 168)

autoplot(fc1) +
  labs(title = "Pipe 1: One-Week Forecast (ARIMA)",
       y = "Flow (units)", x = "Time") +
  theme_minimal()

autoplot(fc2) +
  labs(title = "Pipe 2: One-Week Forecast (ARIMA)",
       y = "Flow (units)", x = "Time") +
  theme_minimal()

# Combined forecast visualization
weekly_fc <- bind_rows(
  data.frame(
    Time = seq(max(pipe1_hr$hour) + hours(1), by = "hour", length.out = 168),
    Forecast = as.numeric(fc1$mean),
    Lower95 = as.numeric(fc1$lower[,2]),
    Upper95 = as.numeric(fc1$upper[,2]),
    Pipe = "Pipe1"
  ),
  data.frame(
    Time = seq(max(pipe2_hr$hour) + hours(1), by = "hour", length.out = 168),
    Forecast = as.numeric(fc2$mean),
    Lower95 = as.numeric(fc2$lower[,2]),
    Upper95 = as.numeric(fc2$upper[,2]),
    Pipe = "Pipe2"
  )
)

ggplot(weekly_fc, aes(Time, Forecast, color = Pipe, fill = Pipe)) +
  geom_line(linewidth = 0.8) +
  geom_ribbon(aes(ymin = Lower95, ymax = Upper95), alpha = 0.15, color = NA) +
  labs(
    title = "One-Week Ahead Water Flow Forecasts (ARIMA)",
    subtitle = "Forecasted hourly flow by pipe with 95% confidence intervals",
    x = "Time (hourly)", y = "Water Flow (forecasted units)"
  ) +
  theme_minimal() +
  theme(legend.position = "bottom")

The results were exported to an Excel-readable file: Waterflow_Weekly_Forecasts.xlsx.

pipe1_fc_df <- data.frame(
  Hour = seq(max(pipe1_hr$hour) + 3600, by = 3600, length.out = 168),
  Forecast = as.numeric(fc1$mean),
  Lower95 = as.numeric(fc1$lower[,2]),
  Upper95 = as.numeric(fc1$upper[,2]),
  Pipe = "Pipe1"
)

pipe2_fc_df <- data.frame(
  Hour = seq(max(pipe2_hr$hour) + 3600, by = 3600, length.out = 168),
  Forecast = as.numeric(fc2$mean),
  Lower95 = as.numeric(fc2$lower[,2]),
  Upper95 = as.numeric(fc2$upper[,2]),
  Pipe = "Pipe2"
)

write_xlsx(bind_rows(pipe1_fc_df, pipe2_fc_df),
           "/Users/aaliyahmjh/Documents/FA25/DATA624/Waterflow_Weekly_Forecasts.xlsx")

Discussion

Hourly water-flow data from Pipe 1 and Pipe 2 were aggregated from raw timestamped measurements into hourly means to ensure consistent time intervals for analysis. After cleaning and aggregation, each series displayed distinct flow behavior:

  • Pipe 1 covered a shorter operating period (late October to early November) with stable flows between 15 and 25 units, suggesting consistent delivery and minimal fluctuation.
  • Pipe 2, which extended through November into December, showed greater variability, ranging from 30 to 70 units, with short-term spikes likely driven by demand surges or pressure fluctuations.

Stationarity tests confirmed the suitability of both series for time-series forecasting. The ADF test rejected the null hypothesis of a unit root (p < 0.05), and the KPSS test failed to reject level stationarity (p > 0.05), indicating both series were stationary.

Each pipe’s flow series was modeled using an ARIMA model selected automatically via auto.arima(), which optimized parameters based on AICc. One-week-ahead forecasts (168 hours) were produced and visualized with 95% prediction intervals.

  • Pipe 1 (red) showed stable forecasts centered around recent averages with narrow prediction bands, implying predictable near-term behavior.
  • Pipe 2 (blue) exhibited wider intervals and stronger volatility, reflecting greater uncertainty due to higher hourly fluctuations.

These prediction intervals represent the expected operational ranges for each pipe and can inform proactive decisions, such as pump scheduling, flow balancing, or maintenance planning.

Overall, this analysis demonstrates that short-term ARIMA forecasting can effectively model high-frequency operational data when appropriately aggregated and validated for stationarity.

Model Limitations

While the ARIMA models performed reasonably well for short-term forecasting, several limitations should be noted:

  • Short data window (Pipe 1): The limited historical coverage restricts model reliability beyond a few days.
  • High variability (Pipe 2): Frequent short-term spikes increase forecast uncertainty and widen prediction intervals.
  • No external regressors: Factors such as weather, system pressure, or scheduled maintenance were not included; a hypothetical extension is sketched below.
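
If such covariates became available, they could enter the models as exogenous regressors through ARIMA’s xreg argument. A hypothetical sketch only: temp_hist and temp_future are assumed hourly temperature series aligned with pipe1_ts and its 168-hour horizon, and are not part of the provided data:

# Hypothetical dynamic regression; temp_hist / temp_future are assumed inputs,
# NOT part of the provided datasets.
fit1_x <- auto.arima(pipe1_ts, xreg = temp_hist)
fc1_x  <- forecast(fit1_x, xreg = temp_future)   # xreg length sets the 168-hour horizon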

Conclusion

After timestamp normalization and hourly aggregation, both Pipe 1 and Pipe 2 were successfully modeled and forecasted using ARIMA. The models indicate that Pipe 1 maintains steady flow near 20 units, while Pipe 2 averages about 40 units with greater variability. These results provide a practical short-term outlook to support water-flow management, pump scheduling, and system-monitoring decisions.