Project 1 consists of three parts: Part A, Part B, and an optional Part C.
Part A – In part A, I want you to forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable ‘Cash’ is provided in hundreds of dollars; other than that it is straightforward. I am being somewhat ambiguous on purpose to make this have a little more business feeling. Explain and demonstrate your process, techniques used and not used, and your actual forecast. I am giving you data via an Excel file; please provide your written report on your findings, visuals, discussion and your R code via an RPubs link along with the actual .rmd file. Also please submit the forecast, which you will put in an Excel-readable file.
Part A Overview: the goal is to forecast cash withdrawals based on the given ATM records. First I’ll do some data exploration to view summary statistics and check whether any data is missing or has formatting issues that require conversion. Then I’ll compare different forecasting methods and pick the best fit.
At a glance, the Part A file contains 1,474 rows. A quick summary shows 19 NA values. Worth noting: I initially encountered warnings about the date format in the first column (“warning: Coercing numeric to date in A2…”), so I fix the Excel date format on import (see the as.Date conversion with the Excel origin below).
library(readxl)    # read_excel()
library(dplyr)     # data wrangling verbs used below
library(ggplot2)   # plots used throughout
atm <- read_excel("C:/Users/xmei/Downloads/ATM624Data.xlsx") |>
  mutate(
    DATE = as.Date(DATE, origin = "1899-12-30"),  # fix Excel serial-date format
    Cash = Cash * 100  # 'Cash' is given in hundreds of dollars; convert to dollars
  )
summary(atm)
## DATE ATM Cash
## Min. :2009-05-01 Length:1474 Min. : 0
## 1st Qu.:2009-08-01 Class :character 1st Qu.: 50
## Median :2009-11-01 Mode :character Median : 7300
## Mean :2009-10-31 Mean : 15558
## 3rd Qu.:2010-02-01 3rd Qu.: 11400
## Max. :2010-05-14 Max. :1091976
## NA's :19
# take a closer look at na values
atm[is.na(atm$Cash), ]
## # A tibble: 19 × 3
## DATE ATM Cash
## <date> <chr> <dbl>
## 1 2009-06-13 ATM1 NA
## 2 2009-06-16 ATM1 NA
## 3 2009-06-18 ATM2 NA
## 4 2009-06-22 ATM1 NA
## 5 2009-06-24 ATM2 NA
## 6 2010-05-01 <NA> NA
## 7 2010-05-02 <NA> NA
## 8 2010-05-03 <NA> NA
## 9 2010-05-04 <NA> NA
## 10 2010-05-05 <NA> NA
## 11 2010-05-06 <NA> NA
## 12 2010-05-07 <NA> NA
## 13 2010-05-08 <NA> NA
## 14 2010-05-09 <NA> NA
## 15 2010-05-10 <NA> NA
## 16 2010-05-11 <NA> NA
## 17 2010-05-12 <NA> NA
## 18 2010-05-13 <NA> NA
## 19 2010-05-14 <NA> NA
Below is another summary view with plots, without the NA values. From the plots, ATM 1 and ATM 2 show a pattern with seasonality. ATM 3 doesn’t seem to have much activity except toward the end, around April 2010. ATM 4 appears to show a similar pattern, with the exception of one very large withdrawal.
atm <- atm |>
rename_with(~tolower(.x)) |>
rename(date = any_of(c("date", "DATE")),
cash = any_of(c("cash", "Cash")),
atm = any_of(c("atm", "ATM"))) |>
filter(!is.na(date), !is.na(cash), !is.na(atm)) |>
arrange(atm, date)
ggplot(atm, aes(date, cash, color = atm)) +
geom_line(linewidth = 0.3) +
facet_wrap(~atm, scales = "free_y") +
labs(
title = "ATM Daily Withdrawals",
x = "Date",
y = "Cash ($)"
) +
theme_minimal() +
theme(legend.position = "right")
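To support the weekly-seasonality observation above, a quick day-of-week view helps; below is a minimal sketch (it assumes lubridate is available for wday()).
library(lubridate)
atm |>
  mutate(weekday = wday(date, label = TRUE)) |>
  ggplot(aes(weekday, cash)) +
  geom_boxplot() +   # distribution of daily withdrawals by weekday, per ATM
  facet_wrap(~atm, scales = "free_y") +
  labs(title = "Withdrawals by Day of Week", x = NULL, y = "Cash ($)") +
  theme_minimal()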
More observations and forecasting plan: from the observations above, forecasting ATM1 and ATM2 should be easier since we have relatively robust data; we can use a 7-day rolling average of the days before and after each NA value to fill it in (see the sketch below). ATM3, however, will not be a straightforward case due to its inactivity. One possible explanation is that ATM3 is in a good location but inside a store that has been vacated; when ATM3 is active, its withdrawals appear to be within a normal range comparable to the other ATMs’ records. Therefore we can look for the ATM whose withdrawals most closely match ATM3’s and consider basing a forecast on that ATM’s record. ATM3 has withdrawals around April 28 to April 30, so let’s find out which ATM most closely matches ATM3 on those days. As for ATM4, I’ll smooth out the spike by replacing the outlier with a rolling median.
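Since the few missing ATM1/ATM2 days were simply dropped during cleaning above, the rolling-average fill is shown here only as a sketch; atm_raw is a hypothetical copy of the data kept before the NA filter.
library(zoo)
atm_imputed <- atm_raw |>   # atm_raw: hypothetical pre-filter copy that still contains the NAs
  group_by(atm) |>
  arrange(date) |>
  mutate(cash = ifelse(is.na(cash),
                       rollapply(cash, width = 7, FUN = mean, na.rm = TRUE, fill = NA, align = "center"),
                       cash)) |>
  ungroup()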
From the plots below we can see ATM1 has the pattern that most closely matches ATM3, so we can note in the conclusion that if ATM3 begins to operate daily, given its location and records, its withdrawals should be close to the ATM1 forecast. That’s a reasonable starting point given the records available now, and we should revisit it when more ATM3 records come in.
# closer look at days when ATM3 has activities
atm_3withcash <- atm |>
filter(date >= as.Date("2010-04-28") & date <= as.Date("2010-04-30"))
ggplot(atm_3withcash, aes(date, cash, color = atm)) +
geom_line(linewidth = 0.7) +
geom_point(size = 1.5) +
facet_wrap(~atm, scales = "free_y") +
labs(
title = "ATM Withdrawals from April 28-30, 2010",
x = "Date",
y = "Cash ($)"
) +
theme_minimal() +
theme(legend.position = "right")
Now, with the above plan, here is the forecasting part. As mentioned, I’ll smooth out the spike in ATM4 using a 7-day rolling median to reduce the impact of the outlier on the forecast model. I’ll forecast ATM3 with the data as given, but will mention the alternative of referencing ATM1’s numbers in the written conclusion. For any ATM with fewer than 10 observations (in practice only ATM3), a simple mean forecast is used: the forecast for the next 31 days is the historical average, with intervals set arbitrarily (80%: ±10%, 95%: ±20%). For ATMs with 10 or more observations (ATM1, ATM2 and ATM4), an ARIMA model with weekly seasonality (frequency = 7) is used: auto.arima selects the best model, and a 31-day forecast is generated, from which the mean and prediction intervals are extracted.
library(forecast)
library(dplyr)
library(writexl)  # for write_xlsx() below
library(zoo)
##
## Attaching package: 'zoo'
## The following object is masked from 'package:tsibble':
##
## index
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
# Smooth out spikes in ATM4
atm_clean <- atm |>
group_by(atm) |>
arrange(date) |>
mutate(
cash_smoothed = ifelse(
atm == "ATM4",
rollapply(cash, width = 7, FUN = median, na.rm = TRUE, fill = NA, align = "center"),
cash
)
) |>
mutate(
cash_smoothed = ifelse(
is.na(cash_smoothed) & atm == "ATM4",
cash,
cash_smoothed
)
) |>
ungroup()
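A quick before/after look at ATM4 (a sketch using the columns created above) confirms the large spike is damped while the overall pattern is preserved.
atm_clean |>
  filter(atm == "ATM4") |>
  ggplot(aes(date)) +
  geom_line(aes(y = cash), color = "gray70") +              # raw series
  geom_line(aes(y = cash_smoothed), color = "steelblue") +  # 7-day rolling median
  labs(title = "ATM4: Raw vs. Smoothed", x = "Date", y = "Cash ($)") +
  theme_minimal()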
# forecast
atm_list <- split(atm_clean, atm_clean$atm)
forecast_atm <- function(df) {
df <- df |> arrange(date)
if (nrow(df) < 10) {
avg <- mean(df$cash_smoothed, na.rm = TRUE)
return(tibble(
date = seq(max(df$date) + 1, by = "day", length.out = 31),
forecast = rep(avg, 31),
lower80 = rep(avg * 0.9, 31),
upper80 = rep(avg * 1.1, 31),
lower95 = rep(avg * 0.8, 31),
upper95 = rep(avg * 1.2, 31),
atm = unique(df$atm)
))
}
ts_train <- ts(df$cash_smoothed, frequency = 7)
fit <- auto.arima(ts_train)
fc <- forecast(fit, h = 31)
tibble(
date = seq(max(df$date) + 1, by = "day", length.out = 31),
forecast = as.numeric(fc$mean),
lower80 = as.numeric(fc$lower[, 1]),
upper80 = as.numeric(fc$upper[, 1]),
lower95 = as.numeric(fc$lower[, 2]),
upper95 = as.numeric(fc$upper[, 2]),
atm = unique(df$atm)
)
}
atm_forecasts <- lapply(atm_list, forecast_atm)
final_fc <- bind_rows(atm_forecasts)
write_xlsx(final_fc, "C:/Users/xmei/Downloads/ATM_May2010_Forecasts_Smoothed.xlsx")
monthly_totals <- final_fc |>
group_by(atm) |>
summarise(may2010_totalcash = sum(forecast, na.rm = TRUE))
write_xlsx(monthly_totals, "C:/Users/xmei/Downloads/ATM_May2010_TotalForecasts_Smoothed.xlsx")
head(final_fc, 15)
## # A tibble: 15 × 7
## date forecast lower80 upper80 lower95 upper95 atm
## <date> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 2010-05-01 8681. 5372. 11989. 3620. 13741. ATM1
## 2 2010-05-02 10064. 6709. 13419. 4933. 15195. ATM1
## 3 2010-05-03 7471. 4117. 10826. 2341. 12602. ATM1
## 4 2010-05-04 476. -2879. 3831. -4654. 5607. ATM1
## 5 2010-05-05 10006. 6652. 13361. 4876. 15137. ATM1
## 6 2010-05-06 7936. 4581. 11290. 2805. 13066. ATM1
## 7 2010-05-07 8541. 5186. 11896. 3410. 13672. ATM1
## 8 2010-05-08 8697. 5067. 12326. 3146. 14248. ATM1
## 9 2010-05-09 10064. 6427. 13701. 4502. 15626. ATM1
## 10 2010-05-10 7471. 3835. 11108. 1909. 13034. ATM1
## 11 2010-05-11 476. -3161. 4113. -5086. 6039. ATM1
## 12 2010-05-12 10006. 6369. 13643. 4444. 15569. ATM1
## 13 2010-05-13 7936. 4299. 11573. 2373. 13498. ATM1
## 14 2010-05-14 8541. 4904. 12178. 2979. 14103. ATM1
## 15 2010-05-15 8697. 4805. 12589. 2745. 14649. ATM1
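Before relying on the ARIMA forecasts, a quick residual check is worthwhile. Here is a sketch for ATM1 that refits the same kind of model used inside the helper function above.
atm1_ts  <- ts(filter(atm_clean, atm == "ATM1")$cash_smoothed, frequency = 7)
atm1_fit <- auto.arima(atm1_ts)
checkresiduals(atm1_fit)   # residual time plot, ACF and Ljung-Box test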
# MONTHLY TOTALS
cat("\n")
cat("MONTHLY FORECAST TOTALS FOR MAY 2010:")
## MONTHLY FORECAST TOTALS FOR MAY 2010:
print(monthly_totals)
## # A tibble: 4 × 2
## atm may2010_totalcash
## <chr> <dbl>
## 1 ATM1 238982.
## 2 ATM2 182596.
## 3 ATM3 402.
## 4 ATM4 1367631.
cat("\n")
Here is the visual presentation of the forecasts. I use a dashed vertical line to separate the historical pattern from the predictions, and include two prediction intervals for uncertainty assessment.
for (atm_name in names(atm_list)) {
obs <- atm_clean %>% filter(atm == atm_name)
fc <- atm_forecasts[[atm_name]]
last_historical_date <- max(obs$date)
p <- ggplot() +
geom_line(data = obs, aes(x = date, y = cash_smoothed), color = "steelblue", linewidth = 0.8) +
geom_line(data = fc, aes(x = date, y = forecast), color = "firebrick", linewidth = 0.8) +
# 95% prediction interval
geom_ribbon(data = fc, aes(x = date, ymin = lower95, ymax = upper95),
fill = "firebrick", alpha = 0.2) +
# 80% prediction interval
geom_ribbon(data = fc, aes(x = date, ymin = lower80, ymax = upper80),
fill = "firebrick", alpha = 0.3) +
# Vertical line separating historical and forecast
geom_vline(xintercept = last_historical_date, linetype = "dashed", color = "darkgray") +
# Point at the transition
geom_point(data = obs %>% filter(date == last_historical_date),
aes(x = date, y = cash_smoothed), color = "steelblue", size = 2) +
geom_point(data = fc %>% filter(date == min(date)),
aes(x = date, y = forecast), color = "firebrick", size = 2) +
labs(
title = paste0(atm_name, " Cash Withdrawal Forecast"),
subtitle = "May 2010 Forecast with Prediction Intervals",
x = "Date",
y = "Cash Amount ($)",
caption = paste0("Blue: Historical (smoothed) | Red: Forecast | ",
"Dark shade: 80% PI | Light shade: 95% PI")
) +
theme_minimal() +
theme(plot.caption = element_text(size = 9, color = "gray50"))
print(p)
}
Additional comment for Part A, as suggested earlier regarding ATM3: its prediction has limitations given the limited records. If ATM3 resumes daily operation at the same level seen at the end of April, an alternative would be to use the ATM1 forecast, since I demonstrated the two have a similar pattern; we should then revisit and adjust the prediction as more ATM3 records become available.
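If we later decide to go with that alternative, substituting ATM1’s forecast as a proxy for ATM3 is a one-liner; this is only a sketch and the proxy label is my own.
atm3_proxy <- atm_forecasts[["ATM1"]] |>
  mutate(atm = "ATM3 (proxy from ATM1)")   # reuse ATM1's 31-day forecast as an ATM3 stand-in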
Part B – Part B consists of a simple dataset of residential power usage for January 1998 until December 2013. Your assignment is to model these data and provide a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in kilowatt hours; the rest is straightforward. Add this to your existing files above.
After downloading the data, I fix the date formatting and show the data: the “YYYY-MMM” strings (e.g., “1998-Jan”) are converted to proper Date objects. I then use the data for a time series line plot of electricity consumption over time, to look for seasonal trends, peak usage periods and any unusual consumption periods. The usage looks seasonal. Unusual points worth noting are a brief missing record around August 2008 and a sharp drop in usage in 2010.
power <- read_excel("C:/Users/xmei/Downloads/ResidentialCustomerForecastLoad-624.xlsx") |>
mutate(Date = as.Date(paste0(`YYYY-MMM`, "-01"), format = "%Y-%b-%d")) |>
select(-`YYYY-MMM`, -CaseSequence)
power |>
ggplot(aes(x = Date, y = KWH)) +
geom_line(colour = "Black") +
geom_hline(yintercept = 0, linetype = "dashed", color = "gray")
With the observation above, I need a way to handle the missing values. Two options come to mind: 1) use the record from the same month in previous years; 2) use an average of the surrounding months to fill in the blank. I’ll go with the second option, a rolling-average imputation, implemented in the chunk below (a sketch of option 1 follows it for reference).
# rolling average imputation for all missing values
power <- power |>
arrange(Date) |>
mutate(
# 7-point centered moving average (monthly data)
KWH_imputed = ifelse(
is.na(KWH),
rollapply(KWH, width = 7, FUN = mean, na.rm = TRUE, fill = NA, align = "center"),
KWH
),
KWH = ifelse(is.na(KWH_imputed), na.approx(KWH_imputed, na.rm = FALSE), KWH_imputed)
) |>
select(-KWH_imputed)
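For reference, option 1 (borrowing the same calendar month from other years) could look like the sketch below if applied to the raw KWH column before any imputation; power_alt is a hypothetical name and this is not used in the final model.
power_alt <- power |>
  mutate(month_of_year = format(Date, "%m")) |>
  group_by(month_of_year) |>
  mutate(KWH = ifelse(is.na(KWH), mean(KWH, na.rm = TRUE), KWH)) |>  # fill with that month's average across years
  ungroup() |>
  select(-month_of_year)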
To further study the data, I perform a seasonal-trend decomposition using STL: create a time series object ts_data starting in January 1998 with monthly frequency (12), then apply the STL decomposition with a periodic seasonal window.
ts_data <- ts(power$KWH, start = c(1998, 1), frequency = 12)
stl_decomp <- stl(ts_data, s.window = "periodic")
plot(stl_decomp)
For the best electricity consumption forecast, I first apply a log transformation to stabilize the variance and handle multiplicative effects. After the transformation, I fit several time series forecasting models (various ETS and ARIMA specifications) and compare their training RMSEs to pick one. The automatic ETS selection (ETSZZZ, RMSE 0.192600) and ETS(M,A,M) (RMSE 0.192635) are essentially tied for the best fit; I’ll proceed with the ETS(M,A,M) model.
library(stats)
library(forecast)
get_rmse <- function(model, model_name) {
accuracy_metrics <- accuracy(model)
model_accuracy <- data.frame(
Model = model_name,
RMSE = round(accuracy_metrics["Training set", "RMSE"], 6)
)
return(model_accuracy)
}
power_ts <- ts(log(power$KWH), start = c(1998, 1), frequency = 12)
# Fit models and get RMSE
etszzz_rmse <- get_rmse(ets(power_ts, model = "ZZZ"), "ETSZZZ")
etsmnn_rmse <- get_rmse(ets(power_ts, model = "MNM"), "ETSMNM")
etsana_rmse <- get_rmse(ets(power_ts, model = "ANA"), "ETSANA")
etsmam_rmse <- get_rmse(ets(power_ts, model = "MAM"), "ETSMAM")
auto_arima_rmse <- get_rmse(auto.arima(power_ts), "A-AUTO")
arima010_rmse <- get_rmse(arima(power_ts, order = c(0, 1, 0)), "A-010")
arimas111_rmse <- get_rmse(arima(power_ts, order = c(0, 1, 1), seasonal = list(order = c(1, 1, 1), period = 12)), "A-011-111")
arimas211_rmse <- get_rmse(arima(power_ts, order = c(0, 1, 1), seasonal = list(order = c(2, 1, 1), period = 12)), "A-011-211")
arimas010_rmse <- get_rmse(arima(power_ts, order = c(0, 0, 0), seasonal = list(order = c(0, 1, 0), period = 12)), "A-000-010")
# show all
rmse_table <- rbind(etszzz_rmse, etsmnn_rmse, etsana_rmse, etsmam_rmse,
auto_arima_rmse, arimas111_rmse, arimas211_rmse,
arimas010_rmse, arima010_rmse)
rmse_table <- rmse_table[order(rmse_table$RMSE), ]
print(rmse_table)
## Model RMSE
## 1 ETSZZZ 0.192600
## 4 ETSMAM 0.192635
## 3 ETSANA 0.192737
## 2 ETSMNM 0.194576
## 7 A-011-211 0.198978
## 6 A-011-111 0.199224
## 5 A-AUTO 0.231882
## 8 A-000-010 0.274346
## 9 A-010 0.311059
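Because these are training-set RMSEs, they can flatter the more flexible models; a quick hold-out comparison is a reasonable sanity check. A sketch, assuming the last 24 months (2012-2013) as the test set:
train_ts <- window(power_ts, end = c(2011, 12))    # training data through Dec 2011
test_ts  <- window(power_ts, start = c(2012, 1))   # 2012-2013 held out
fc_mam   <- forecast(ets(train_ts, model = "MAM"), h = length(test_ts))
accuracy(fc_mam, test_ts)["Test set", "RMSE"]      # out-of-sample RMSE on the log scale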
Now it’s time to produce the forecast using the ETS(M,A,M) model and create a visualization of the results. The forecast horizon is 12 months. After getting the result, I convert the log-scale predictions back to KWH units. For the visualization, two prediction intervals are shown: the 80% interval (darker gray, alpha = 0.8) and the 95% interval (lighter gray, alpha = 0.3).
power_ts <- ts(log(power$KWH), start = c(1998, 1), frequency = 12)
best_ets_model <- ets(power_ts, model = "MAM")
fckw_values <- forecast(best_ets_model, h = 12)
fckw_values$mean <- exp(fckw_values$mean)
fckw_values$lower <- exp(fckw_values$lower)
fckw_values$upper <- exp(fckw_values$upper)
historical_data_df <- as.data.frame(exp(power_ts))
historical_data_df$Date <- seq(as.Date("1998-01-01"), by = "month",
length.out = nrow(historical_data_df))
historical_data_df <- historical_data_df |>
rename(kwh = x)
# head(historical_data_df)
forecast_data_df <- as.data.frame(fckw_values)
forecast_data_df$Date <- myd(paste0(rownames(forecast_data_df), "-01"))
forecast_data_df <- forecast_data_df |> rename(kwh = `Point Forecast`)
# head(forecast_data_df)
ggplot() +
geom_line(data = filter(historical_data_df, Date > as.Date("2010-12-31")), aes(x = Date, y = kwh), colour = "Black") +
geom_line(data = forecast_data_df, aes(x = Date, y = kwh), colour = "Black") +
geom_ribbon(data = forecast_data_df, aes(x = Date, ymin = `Lo 95`, ymax = `Hi 95`), fill = "gray", alpha = 0.3) +
geom_ribbon(data = forecast_data_df, aes(x = Date, ymin = `Lo 80`, ymax = `Hi 80`), fill = "gray", alpha = 0.8) +
labs(title = "Power Consumption Forecast (Last Two Years of Historical Data)", x = "Date", y = "KWH") +
theme_minimal()
Finally, the forecast values are presented in the table below:
print(forecast_data_df)
## kwh Lo 80 Hi 80 Lo 95 Hi 95 Date
## Jan 2014 8870710 6826522 11527025 5942637 13241511 2014-01-01
## Feb 2014 7748851 5975949 10047726 5208085 11529131 2014-02-01
## Mar 2014 6820044 5270229 8825612 4597933 10116066 2014-03-01
## Apr 2014 6014421 4656855 7767744 4067048 8894229 2014-04-01
## May 2014 5726501 4437076 7390637 3876563 8459250 2014-05-01
## Jun 2014 7440588 5739925 9645135 5003177 11065439 2014-06-01
## Jul 2014 7967699 6138989 10341152 5347531 11871688 2014-07-01
## Aug 2014 9503710 7300452 12371906 6349129 14225654 2014-08-01
## Sep 2014 8759739 6737126 11389580 5862974 13087733 2014-09-01
## Oct 2014 6485212 5011819 8391758 4372642 9618436 2014-10-01
## Nov 2014 5704227 4416967 7366639 3857678 8434661 2014-11-01
## Dec 2014 7126953 5497656 9239112 4791856 10599953 2014-12-01
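To keep the deliverables consistent with Part A, the 2014 forecast can also be written to an Excel-readable file; the path and file name below are my own assumptions.
write_xlsx(forecast_data_df, "C:/Users/xmei/Downloads/Power_2014_MonthlyForecast.xlsx")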
Part C – BONUS. Part C consists of two data sets. These are simple two-column sets, however they have different time stamps. Your optional assignment is to time-base sequence the data and aggregate based on hour (an example of what this looks like follows). Note: for multiple recordings within an hour, take the mean. Then determine if the data is stationary and whether it can be forecast. If so, provide a week-ahead forecast and present results via RPubs and the .rmd file, and the forecast in an Excel-readable file.
library(readxl)     # read_excel()
library(lubridate)  # as_datetime(), floor_date()
library(tseries)    # adf.test(), kpss.test() used below
path_pipe1 <- "C:/Users/xmei/Downloads/Waterflow_Pipe1.xlsx"
path_pipe2 <- "C:/Users/xmei/Downloads/Waterflow_Pipe2.xlsx"
hourly_aggregate <- function(path, id_label) {
df <- read_excel(path)
# standardize the datetime column name
if (all(c("Date", "Time") %in% names(df))) {
df <- df |> mutate(DateTime = as_datetime(paste(Date, Time)))
} else if ("Date Time" %in% names(df)) {
df <- df |> rename(DateTime = `Date Time`)
} else if ("DateTime" %in% names(df)) {
df <- df
}
# convert Excel serial date-times if needed
if (is.numeric(df$DateTime)) {
df <- df |> mutate(DateTime = as_datetime(DateTime * 86400, origin = "1899-12-30"))
}
df |>
rename(Value = WaterFlow) |>
mutate(hour = floor_date(DateTime, "hour")) |>
group_by(hour) |>
summarise(Value = mean(Value, na.rm = TRUE), .groups = "drop") |>
mutate(Pipe = id_label)
}
pipe1_hr <- hourly_aggregate(path_pipe1, "Pipe1")
pipe2_hr <- hourly_aggregate(path_pipe2, "Pipe2")
head(pipe1_hr, 10)
## # A tibble: 10 × 3
## hour Value Pipe
## <dttm> <dbl> <chr>
## 1 2015-10-23 00:00:00 26.1 Pipe1
## 2 2015-10-23 01:00:00 18.9 Pipe1
## 3 2015-10-23 02:00:00 15.2 Pipe1
## 4 2015-10-23 03:00:00 23.1 Pipe1
## 5 2015-10-23 04:00:00 15.5 Pipe1
## 6 2015-10-23 05:00:00 22.7 Pipe1
## 7 2015-10-23 06:00:00 20.6 Pipe1
## 8 2015-10-23 07:00:00 18.4 Pipe1
## 9 2015-10-23 08:00:00 20.8 Pipe1
## 10 2015-10-23 09:00:00 21.2 Pipe1
head(pipe2_hr, 10)
## # A tibble: 10 × 3
## hour Value Pipe
## <dttm> <dbl> <chr>
## 1 2015-10-23 01:00:00 30.9 Pipe2
## 2 2015-10-23 03:00:00 38.0 Pipe2
## 3 2015-10-23 04:00:00 34.0 Pipe2
## 4 2015-10-23 06:00:00 28.2 Pipe2
## 5 2015-10-23 07:00:00 18.3 Pipe2
## 6 2015-10-23 09:00:00 55.8 Pipe2
## 7 2015-10-23 10:00:00 61.3 Pipe2
## 8 2015-10-23 12:00:00 55.7 Pipe2
## 9 2015-10-23 13:00:00 37.1 Pipe2
## 10 2015-10-23 15:00:00 35.1 Pipe2
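Before the stationarity checks, a quick plot of both hourly series (a sketch using the aggregated tables above):
bind_rows(pipe1_hr, pipe2_hr) |>
  ggplot(aes(hour, Value)) +
  geom_line(linewidth = 0.3) +
  facet_wrap(~Pipe, ncol = 1, scales = "free_x") +
  labs(title = "Hourly Water Flow by Pipe", x = "Time", y = "Flow") +
  theme_minimal()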
For the pipe1 and pipe2 data, I will check whether (and how many orders of) seasonal differencing should be applied to each series, then use ADF and KPSS tests to determine whether the time series are stationary enough for forecasting. As shown below, no seasonal differencing is recommended, the ADF tests reject non-stationarity (p ≈ 0.01 for both pipes), and the KPSS tests fail to reject level stationarity (p = 0.1 for Pipe 1 and 0.068 for Pipe 2, the latter marginal), so the data appear stationary and ready to forecast.
pipe1_ts <- ts(pipe1_hr$Value, frequency = 24)
pipe2_ts <- ts(pipe2_hr$Value, frequency = 24)
# seasonal unit root analysis
seasonal_analysis <- function(ts_data, name) {
cat("-- ", name, " --\n")
# Seasonal differencing recommend
seas_diff <- nsdiffs(ts_data)
cat("Seasonal differences needed:", seas_diff, "\n")
# Auto.arima
model <- auto.arima(ts_data, seasonal = TRUE, stepwise = TRUE)
cat("ARIMA model:", arimaorder(model), "\n")
}
seasonal_analysis(pipe1_ts, "Pipe1")
## -- Pipe1 --
## Seasonal differences needed: 0
## ARIMA model: 0 0 0
seasonal_analysis(pipe2_ts, "Pipe2")
## -- Pipe2 --
## Seasonal differences needed: 0
## ARIMA model: 1 0 1
adf.test(pipe1_ts)
## Warning in adf.test(pipe1_ts): p-value smaller than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: pipe1_ts
## Dickey-Fuller = -6.2649, Lag order = 6, p-value = 0.01
## alternative hypothesis: stationary
adf.test(pipe2_ts)
## Warning in adf.test(pipe2_ts): p-value smaller than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: pipe2_ts
## Dickey-Fuller = -8.6584, Lag order = 8, p-value = 0.01
## alternative hypothesis: stationary
kpss.test(pipe1_ts)
## Warning in kpss.test(pipe1_ts): p-value greater than printed p-value
##
## KPSS Test for Level Stationarity
##
## data: pipe1_ts
## KPSS Level = 0.24656, Truncation lag parameter = 4, p-value = 0.1
kpss.test(pipe2_ts)
##
## KPSS Test for Level Stationarity
##
## data: pipe2_ts
## KPSS Level = 0.42049, Truncation lag parameter = 6, p-value = 0.06832
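Also worth noting: the automatic fit for Pipe 1 came back as ARIMA(0,0,0), i.e., white noise around a constant mean, which an ACF plot can confirm. A sketch using forecast::ggAcf:
ggAcf(pipe1_ts) +
  labs(title = "Pipe 1 Hourly Flow: ACF") +
  theme_minimal()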
For the forecast, I use the auto.arima function to automatically select the best ARIMA model for each pipe’s time series. Then I generate a one-week-ahead forecast (168 hours = 7 days × 24 hours/day), plot the results, and write the forecast to an Excel file.
library(forecast)
library(ggplot2)
library(writexl)
library(dplyr)
library(lubridate)
fit1 <- auto.arima(pipe1_ts)
fit2 <- auto.arima(pipe2_ts)
# Forecast 1 week ahead
fc1 <- forecast(fit1, h = 168)
fc2 <- forecast(fit2, h = 168)
autoplot(fc1) +
labs(title = "Pipe 1: One-Week Forecast",
y = "Flow (units)", x = "Time") +
theme_minimal()
autoplot(fc2) +
labs(title = "Pipe 2: One-Week Forecast (ARIMA)",
y = "Flow (units)", x = "Time") +
theme_minimal()
# Create Excel
pipe1_forecast_df <- data.frame(
DateTime = seq(max(pipe1_hr$hour) + hours(1), by = "hour", length.out = 168),
Point_Forecast = as.numeric(fc1$mean),
Lower_80 = as.numeric(fc1$lower[, 1]),
Upper_80 = as.numeric(fc1$upper[, 1]),
Lower_95 = as.numeric(fc1$lower[, 2]),
Upper_95 = as.numeric(fc1$upper[, 2]),
Pipe = "Pipe1"
)
pipe2_forecast_df <- data.frame(
DateTime = seq(max(pipe2_hr$hour) + hours(1), by = "hour", length.out = 168),
Point_Forecast = as.numeric(fc2$mean),
Lower_80 = as.numeric(fc2$lower[, 1]),
Upper_80 = as.numeric(fc2$upper[, 1]),
Lower_95 = as.numeric(fc2$lower[, 2]),
Upper_95 = as.numeric(fc2$upper[, 2]),
Pipe = "Pipe2"
)
# Combine
combined_forecasts <- bind_rows(pipe1_forecast_df, pipe2_forecast_df)
# Create summary statistics by pipe
forecast_summary <- combined_forecasts |>
group_by(Pipe) |>
summarise(
Forecast_Start = min(DateTime),
Forecast_End = max(DateTime),
Average_Forecast = round(mean(Point_Forecast, na.rm = TRUE), 2),
Minimum_Forecast = round(min(Point_Forecast, na.rm = TRUE), 2),
Maximum_Forecast = round(max(Point_Forecast, na.rm = TRUE), 2),
Total_Weekly_Volume = round(sum(Point_Forecast, na.rm = TRUE), 2),
Avg_Uncertainty_95 = round(mean(Upper_95 - Lower_95, na.rm = TRUE), 2)
)
excel_file_path <- "C:/Users/xmei/Downloads/Waterflow_Pipe_Forecasts.xlsx"
# dataframes for different sheets
excel_data <- list(
All_Forecasts = combined_forecasts,
Forecast_Summary = forecast_summary,
Pipe1_Detailed = pipe1_forecast_df,
Pipe2_Detailed = pipe2_forecast_df
)
write_xlsx(excel_data, excel_file_path)
print(forecast_summary)
## # A tibble: 2 × 8
## Pipe Forecast_Start Forecast_End Average_Forecast
## <chr> <dttm> <dttm> <dbl>
## 1 Pipe1 2015-11-02 00:00:00 2015-11-08 23:00:00 19.9
## 2 Pipe2 2015-12-03 17:00:00 2015-12-10 16:00:00 39.8
## # ℹ 4 more variables: Minimum_Forecast <dbl>, Maximum_Forecast <dbl>,
## # Total_Weekly_Volume <dbl>, Avg_Uncertainty_95 <dbl>