library(fpp3)
library(fable)
library(patchwork)
library(scales)
library(stringr)
library(forecast)
library(ggfortify)
library(readxl)
library(zoo)
options(scipen=999)
<- "https://github.com/tonythor/cuny-datascience/raw/develop/data624-predictive/project1/data/" git_p1_data
624 Project 1
Part A – ATM Forecast
In part A, I want you to forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable ‘Cash’ is provided in hundreds of dollars, other than that it is straight forward. I am being somewhat ambiguous on purpose to make this have a little more business feeling. Explain and demonstrate your process, techniques used and not used, and your actual forecast. I am giving you data via an excel file, please provide your written report on your findings, visuals, discussion and your R code via an RPubs link along with the actual .rmd file. Also please submit the forecast, which you will put in an Excel readable file.
Explore
At first glance
- This is an excel file. There will be conversion, type and/or formatting issues.
- There are some malformed records in the spreadsheet. Each record should have an ATM and a date. Some are missing ATM and CASH.
- It’s difficult to tell if we have every record for every day for every atm. We should look at that.
- After loading the data into R, the date column is in epoch time. That needs to be converted.
Let’s start with a chart.
<- paste0(git_p1_data, "atm_624.xlsx")
file_url <- "atm_624.xlsx"
local_file_path download.file(file_url, local_file_path, mode = "wb")
<- read_excel(local_file_path)
atm_data
unlink(local_file_path)
# immediately fix the date out of excel epoch time.
$DATE <- as.Date(atm_data$DATE, origin = "1899-12-30")
atm_data
<- as_tsibble(atm_data, index = DATE, key = ATM)
atm_tsibble
ggplot(atm_tsibble, aes(x = DATE, y = Cash, color = ATM)) +
geom_line() +
labs(title = "ATM Cash Withdraws", x = "Date", y = "Cash") +
theme_minimal()
Quick thoughts
For a quick sanity check, let’s consider this through the lens of transactions per minute. There are only 1,440 minutes in a 24 hour period, so assume a maximum throughput of 1 transaction per minute. Given the maximum most banks will let you withdraw at any one time ($2,000), a hasty theoretical maximum withdrawal over a 24 hour period might be around $2,880,000. That spike is 11,000 in units of hundreds of dollars, or $1,100,000, so perhaps that is theoretically possible (a quick arithmetic check follows the list below).
Still though, 1.1M is:
55,000 twenty dollar bills
2200 wrapped stacks of 20’s
A 19 feet tall stack of 20’s, or a 4 foot tall stack of 100’s
If that is a valid data point, then there were people unwrapping cash and feeding that ATM all day.
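To double check those numbers, here is a quick back-of-the-envelope calculation. The bill thickness (about 0.0043 inches) and the 25-bill stack size are assumptions chosen to reproduce the figures above, not anything from the data.
# back-of-the-envelope check of the figures above (bill thickness ~0.0043 in is an assumption)
spike_dollars   <- 11000 * 100            # Cash is in hundreds of dollars
theoretical_max <- 1440 * 2000            # 1 withdrawal per minute at a $2,000 per-withdrawal cap
twenties        <- spike_dollars / 20     # number of $20 bills
stacks_of_20s   <- twenties / 25          # matches "2200 wrapped stacks" if a stack holds 25 bills
height_20s_ft   <- twenties * 0.0043 / 12              # height of one stack of $20s, in feet
height_100s_ft  <- (spike_dollars / 100) * 0.0043 / 12 # height of one stack of $100s, in feet
c(theoretical_max = theoretical_max, spike_dollars = spike_dollars, twenties = twenties,
  stacks = stacks_of_20s, height_20s_ft = round(height_20s_ft, 1), height_100s_ft = round(height_100s_ft, 1))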
One ATM is much higher volume than the others, and that’s odd.
After another quick look at this graph, it seems again like we should look for missing data.
Research Spikes
Spike 1: What about the outlier? Let’s look at that spike and a couple of days before and after, just to see.
atm_tsibble |>
  filter(DATE >= as.Date("2010-02-07") & DATE <= as.Date("2010-02-14")) |>
  ggplot(aes(x = DATE, y = Cash, color = ATM)) +
  geom_line() +
  labs(title = "ATM Withdrawals", x = "Date", y = "Withdraw Count") +
  scale_x_date(date_breaks = "1 day", date_labels = "%a %b %d")
That spike is Tuesday the 9th of February. A quick Google search shows the Northeast United States experienced a massive blizzard on the 9th and 10th. An event like that could cause that spike, and even explain how the next two days were so low traffic at ATM4.
Obviously, if this were a real world project, we’d need to verify that the ATM’s location was in the Northeast near the blizzard. Given this assignment, I suspect it was, and I suspect that’s why the assignment contains this data set.
Decision 1: Smooth over the outlier. We’ll smooth that over by replacing the storm record with an average of a few days before and after. And we’ll politely suggest to the bank that if there is another storm, they have two people on standby with a 15 foot stack of 20 dollar bills on hand.
Spike 2: How much data are we missing? What do we do? For missing time series data, let’s construct a dataframe with all the dates, and then left join everything together to make sure we account for each day/ATM combination. That’s one of the best ways I know to check every date in a time series.
all_dates <- seq(min(atm_tsibble$DATE), max(atm_tsibble$DATE), by = "day")
atm_tsibble <- atm_tsibble |> filter(!is.na(ATM)) # this is for the date records with no atm or cash.
complete_atm_tibble <- expand.grid(DATE = all_dates, ATM = unique(atm_tsibble$ATM))
complete_atm_tibble <- left_join(complete_atm_tibble, atm_tsibble, by = c("DATE", "ATM"))

missing_data <- complete_atm_tibble |>
  filter(is.na(Cash) | Cash == 0)

ggplot(missing_data, aes(x = DATE, y = ATM)) +
  geom_point(aes(color = ATM), shape = 4) + # Use shape 4 for X's
  labs(title = "Missing ATM Records by Date",
       x = "Date", y = "ATM") +
  theme_minimal()

# Show all the missing records
complete_atm_tibble |>
  filter(ATM != "ATM3" & (is.na(Cash) | Cash == 0) & DATE < as.Date("2010-05-01"))
DATE ATM Cash
1 2009-06-13 ATM1 NA
2 2009-06-16 ATM1 NA
3 2009-06-22 ATM1 NA
4 2009-06-18 ATM2 NA
5 2009-06-24 ATM2 NA
6 2009-10-25 ATM2 0
7 2010-03-30 ATM2 0
Decision 2: Impute NA’s in ATM1 and ATM2. Above, it looks like we’re missing a couple of days’ worth of data (denoted as NA in the list of records, and as x’s on the chart). We also have a couple of days where the cash dispensed was zero. Zero is to be expected; the ATM simply didn’t give out any money that day. The NA’s are missing records that were caught by the left join.
We’ll impute those missing records as part of the plan.
Decision 3: Drop ATM3, suggest ATM3 is like ATM1. ATM3 only dispensed cash on 2010-04-28, 2010-04-29, and 2010-04-30. It isn’t missing data; it just seems like it hadn’t been giving out cash. Perhaps the ATM was stationed in a building under renovation, we don’t know. Regardless, let’s quickly look at those dates, and whichever other ATM seems the closest can be our recommendation for ATM3.
complete_atm_tibble |>
  filter(DATE >= as.Date("2010-04-28") & DATE <= as.Date("2010-04-30"))
DATE ATM Cash
1 2010-04-28 ATM1 96.00000
2 2010-04-29 ATM1 82.00000
3 2010-04-30 ATM1 85.00000
4 2010-04-28 ATM2 107.00000
5 2010-04-29 ATM2 89.00000
6 2010-04-30 ATM2 90.00000
7 2010-04-28 ATM3 96.00000
8 2010-04-29 ATM3 82.00000
9 2010-04-30 ATM3 85.00000
10 2010-04-28 ATM4 348.20106
11 2010-04-29 ATM4 44.24535
12 2010-04-30 ATM4 482.28711
Perfect, ATM3 appears to be identical to ATM1 with respect to cash withdrawals. We’ll use ATM1 as a proxy for ATM3 in the final conclusion, though we have to remember to suggest actively monitoring ATM3’s behavior until more data is collected.
Plan
Let’s organize the plan start to finish before we act.
- Wrangle: Remove all improperly formatted 2010-05 records from complete_atm_tibble.
- Wrangle: Remove ATM3 from complete_atm_tibble.
- Wrangle: Smooth that outlier on ATM4.
- Wrangle: Impute using the mean in complete_atm_tibble where Cash is NA.
- Forecast: Forecast with ETS and ARIMA.
- Forecast: Sanity check.
- Conclude: Suggest how much cash to have on hand per ATM.
Wrangle
## remove improperly formatted records and ATM3
complete_atm_tibble <- complete_atm_tibble |>
  filter(ATM != "ATM3" & DATE < as.Date("2010-05-01"))

## replace the outlier
date_to_replace <- complete_atm_tibble |> filter(Cash >= 3000) |> pull(DATE)
s_date <- date_to_replace - 9
e_date <- date_to_replace - 2
average_cash <- complete_atm_tibble |>
  filter(ATM == "ATM4", DATE >= s_date, DATE <= e_date) |>
  summarize(Average_Cash = mean(Cash, na.rm = TRUE)) |>
  pull(Average_Cash)

complete_atm_tibble <- complete_atm_tibble |>
  mutate(Cash = ifelse(DATE == date_to_replace & ATM == "ATM4", average_cash, Cash))

# double check that one date, should be Feb 09, about $375.
print(complete_atm_tibble |> filter(DATE == as.Date(date_to_replace) & ATM == "ATM4"))
DATE ATM Cash
1 2010-02-09 ATM4 374.4128
## Impute the three missing records for ATM1 and the two for ATM2
average_cash_per_atm <- complete_atm_tibble |>
  group_by(ATM) |>
  summarize(Average_Cash = mean(Cash, na.rm = TRUE))

complete_atm_tibble <- complete_atm_tibble |>
  left_join(average_cash_per_atm, by = "ATM") |>
  mutate(Cash = ifelse(is.na(Cash), Average_Cash, Cash)) |>
  select(-Average_Cash)

## double check that imputation worked, the NA records should be gone
## but we should still see the zero records. Zero is a valid number.
print(complete_atm_tibble |>
  filter((is.na(Cash) | Cash == 0) & DATE < as.Date("2010-05-01")))
DATE ATM Cash
1 2009-10-25 ATM2 0
2 2010-03-30 ATM2 0
We’ll plot those two charts again to be sure. The outlier is gone and there is no missing data.
## now let's plot those charts again!
p1 <- ggplot(complete_atm_tibble, aes(x = DATE, y = Cash, color = ATM)) +
  geom_line() +
  labs(title = "ATM Cash Trend", x = "Date", y = "Cash") +
  theme_minimal()

p2 <- ggplot(complete_atm_tibble |> filter(is.na(Cash)), aes(x = DATE, y = ATM)) +
  geom_point(aes(color = ATM), shape = 4) + # Use shape 4 for X's
  labs(title = "Missing ATM Records by Date",
       x = "Date", y = "ATM")

(p1 + p2)
That’s perfect.
Select Model
To help us select a model, we’ll decompose each series and review its trend and seasonality.
atm1_tsibble <- as_tsibble(complete_atm_tibble |> filter(ATM == "ATM1"), index = DATE)
atm1_stl_fit <- atm1_tsibble |> model(STL(Cash ~ season(window = "periodic")))
atm1_components <- atm1_stl_fit |> components()

atm2_tsibble <- as_tsibble(complete_atm_tibble |> filter(ATM == "ATM2"), index = DATE)
atm2_stl_fit <- atm2_tsibble |> model(STL(Cash ~ season(window = "periodic")))
atm2_components <- atm2_stl_fit |> components()

atm4_tsibble <- as_tsibble(complete_atm_tibble |> filter(ATM == "ATM4"), index = DATE)
atm4_stl_fit <- atm4_tsibble |> model(STL(Cash ~ season(window = "periodic")))
atm4_components <- atm4_stl_fit |> components()

pa1 <- autoplot(atm1_components)
pa2 <- autoplot(atm2_components)
pa4 <- autoplot(atm4_components)
(pa1 + pa2 + pa4)
Above we can see considerable weekly seasonality. ETS and SARIMA are both likely choices for good forecasting models.
If we compare ETS and SARIMA across all three ATM machines, like below, we see that either ETS or SARIMA would work fine. We’ll pick SARIMA as our model though, because it fits slightly better to ATM4, which has the highest volume.
fit_compare_ets_sarima <- function(atm_tsibble) {
  ets_model <- atm_tsibble |> model(ETS(Cash ~ error("A") + trend("N") + season("A")))
  sarima_model <- atm_tsibble |> model(ARIMA(Cash ~ PDQ(0,0,0) + season("W")))
  models <- list(ets = ets_model, sarima = sarima_model)
  accuracy_table <- purrr::map_df(models, accuracy, .id = "model")
  return(accuracy_table)
}
# print(fit_compare_ets_sarima(atm1_tsibble))
# print(fit_compare_ets_sarima(atm2_tsibble))
# print(fit_compare_ets_sarima(atm4_tsibble))

combined_results <- rbind(
  mutate(fit_compare_ets_sarima(atm1_tsibble), ATM = "ATM1"),
  mutate(fit_compare_ets_sarima(atm2_tsibble), ATM = "ATM2"),
  mutate(fit_compare_ets_sarima(atm4_tsibble), ATM = "ATM4") # Add additional data frames as needed
)
print(combined_results |> select(-.type, -.model, -ME))
# A tibble: 6 × 9
model RMSE MAE MPE MAPE MASE RMSSE ACF1 ATM
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 ets 23.8 15.1 -106. 121. 0.846 0.854 0.125 ATM1
2 sarima 25.0 17.9 -113. 129. 1.00 0.897 -0.00694 ATM1
3 ets 25.0 17.8 -Inf Inf 0.856 0.832 0.0177 ATM2
4 sarima 26.6 20.5 -Inf Inf 0.987 0.883 0.0395 ATM2
5 ets 329. 265. -520. 550. 0.764 0.726 0.101 ATM4
6 sarima 327. 260. -492. 522. 0.752 0.722 -0.000300 ATM4
Forecast
Let’s forecast and plot to see if we’re on the right track. It looks like we are.
forecast_atm <- function(atm_tsibble) {
  sarima_fit <- atm_tsibble |> model(ARIMA(Cash ~ PDQ(0,0,0) + season("W")))
  forecasted_values <- sarima_fit |> forecast(h = 30)
  return(forecasted_values)
}

atm1_fc <- forecast_atm(atm1_tsibble)
atm2_fc <- forecast_atm(atm2_tsibble)
atm4_fc <- forecast_atm(atm4_tsibble)

at1 <- autoplot(atm1_fc) +
  autolayer(atm1_tsibble |> filter(DATE >= as.Date("2010-04-01"))) +
  labs(title = "ATM1 FC Cash Withdrawals", y = "Cash")
at2 <- autoplot(atm2_fc) +
  autolayer(atm2_tsibble |> filter(DATE >= as.Date("2010-04-01"))) +
  labs(title = "ATM2 FC Cash Withdrawals", y = "Cash")
at4 <- autoplot(atm4_fc) +
  autolayer(atm4_tsibble |> filter(DATE >= as.Date("2010-04-01"))) +
  labs(title = "ATM4 FC Cash Withdrawals", y = "Cash")

atm1_fc_ti <- atm1_fc |> mutate(ATM = "ATM1") |> as_tibble()
atm2_fc_ti <- atm2_fc |> mutate(ATM = "ATM2") |> as_tibble()
atm4_fc_ti <- atm4_fc |> mutate(ATM = "ATM4") |> as_tibble()
all_fc <- bind_rows(atm1_fc_ti, atm2_fc_ti, atm4_fc_ti)
# all_fc <- as_tibble(all_fc)

# Plot using ggplot
all_three <- ggplot(all_fc, aes(x = DATE, y = .mean, color = ATM)) +
  geom_line() +
  labs(title = "ATM Cash Withdrawal Forecasts", x = "Date", y = "Forecasted Cash") +
  theme_minimal()

combined_plot <- (at1 | at2) /
  (at4 | all_three)
print(combined_plot)
Conclude
For the last step, we must forecast how much cash each ATM gives out. Remembering two things, that units of Cash are in hundreds and that we are using ATM1 as a proxy for ATM3, we only have to sum the means of the forecast data sets to get a reasonable forecast for the month per ATM.
For clarity in our final numbers, we will round to the nearest $10, add commas, and leave out confidence band information.
That said, assuming no major blizzards or other events that would cause bank runs, below is a reasonable prediction of how much money we’ll need for each ATM for the month of May 2010. Also, please remember to watch ATM3 until we have more historical data.
summarize_forecast <- function(fc) {
  total_mean <- fc |>
    summarise(total_mean = sum(.mean, na.rm = TRUE)) |>
    pull(total_mean) * 100
  total_mean_rounded <- round(total_mean / 10) * 10
  total_mean_dollars <- dollar(total_mean_rounded)
  return(total_mean_dollars)
}

atm1_cash_fc <- summarize_forecast(atm1_fc_ti)
atm2_cash_fc <- summarize_forecast(atm2_fc_ti)
atm4_cash_fc <- summarize_forecast(atm4_fc_ti)

# ATM1 stands in for ATM3, per Decision 3 above.
final_atm <- sprintf("Required cash for:<br>ATM1 %s<br>ATM2 %s<br>ATM3 %s<br>ATM4 %s",
                     atm1_cash_fc, atm2_cash_fc, atm1_cash_fc, atm4_cash_fc)
cat(final_atm)
Required cash for:
ATM1 $256,060
ATM2 $179,140
ATM3 $256,060
ATM4 $1,347,760
Note: the above section took about 13 hours to complete
Part B – Forecasting Power
Part B consists of a simple dataset of residential power usage for January 1998 until December 2013. Your assignment is to model these data and a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in Kilowatt hours, the rest is straight forward. Add this to your existing files above.
Explore
Initial Thoughts
- This is an excel file, there might be some formatting issues.
- There are no missing dates between the first and last record, which is great.
- There is one missing record, 2008-Sep. Its KWH is NULL.
- Ugh, it’s a yearmonth datatype, which doesn’t play well with the tsibble/fable/mable ecosystem, and I’m tired of wasting time on that. I am making the conscious decision to skip all that and do this section using the R stats and forecast packages instead. It might take longer, but it’s probably a more fully baked toolset anyway.
Let’s start with a chart, but we need to wrangle that date first.
<- paste0(git_p1_data, "residential_customer_forecast.xlsx")
file_url <- "residential_customer_forecast.xlsx"
local_file_path download.file(file_url, local_file_path, mode = "wb")
<- read_excel(local_file_path) |>
power mutate(Date = as.Date(paste0(`YYYY-MMM`, "-01"), format = "%Y-%b-%d")) |>
select(-`YYYY-MMM`, -CaseSequence)
unlink(local_file_path)
|>
power ggplot(aes(x = Date, y = KWH)) +
geom_line(colour = "blue") +
geom_hline(yintercept = 0, linetype = "dashed", color = "gray")
The plot might show a light positive trend from about 2006 on. It is definitely seasonal with respect to month. There’s a pretty big outlier, but it seems like a valid record. Let’s leave it alone for now.
Plan
- Wrangle: Get dates loaded. (Done above)
- Impute that one null KWH record.
- Plot a decomposition
- Spend some time researching the stats package
- Forecast: Select a model
- Forecast: Wrangle and forecast
- Conclude: Predict
Wrangle
Impute First. We’ll replace that missing record with the average KWH of the two records from September 2007 and September 2009. This is highly seasonal data, and there might be a slight trend; taking the average of those two specific records keeps the imputed value in line with the seasonality.
Also, it’s as good a guess as any.
# Let's impute that missing date, maybe we just grab the record from last year. It might not contain the trend, but it'll contain the seasonal.
average_kwh <- power |>
  filter(Date %in% as.Date(c("2007-09-01", "2009-09-01"))) |>
  summarize(average_kwh = mean(KWH)) |>
  pull(average_kwh)

power <- power |>
  mutate(KWH = ifelse(Date == as.Date("2008-09-01"), average_kwh, KWH))
Select Model
Let’s decompose first. We quickly see that this has a slight trend and is extremely seasonal across the months. Let’s compare some models and see which one we like more.
ts_data <- ts(power$KWH, start = c(1998, 1), frequency = 12)
stl_decomp <- stl(ts_data, s.window = "periodic")
plot(stl_decomp)
Awesome. Definitely seems like we have some seasonality. Let’s test ARIMA and ETS again, but this time using the stats and forecast packages instead. We’re definitely going to apply a log transform to the data.
Looks like ETS(MAM) performs the best, and we needed to log the data to get the best results. We’ll use ETS(MAM) for the next section.
library(stats)
library(forecast)

get_accuracy <- function(model, model_name) {
  accuracy_metrics <- accuracy(model)
  model_accuracy <- data.frame(
    Model = model_name,
    ME = accuracy_metrics["Training set", "ME"],
    RMSE = accuracy_metrics["Training set", "RMSE"],
    MAE = accuracy_metrics["Training set", "MAE"],
    MPE = accuracy_metrics["Training set", "MPE"],
    MAPE = accuracy_metrics["Training set", "MAPE"],
    MASE = accuracy_metrics["Training set", "MASE"],
    ACF1 = accuracy_metrics["Training set", "ACF1"]
  )
  model_accuracy[, -1] <- round(model_accuracy[, -1], 6)
  return(model_accuracy)
}

power_ts <- ts(log(power$KWH), start = c(1998, 1), frequency = 12)

etszzz_ac <- get_accuracy(ets(power_ts, model = "ZZZ"), "ETSZZZ")
etsmnn_ac <- get_accuracy(ets(power_ts, model = "MNM"), "ETSMNN")
etsana_ac <- get_accuracy(ets(power_ts, model = "ANA"), "ETSANA")
etsmam_ac <- get_accuracy(ets(power_ts, model = "MAM"), "ETSMAM")
auto_arima <- get_accuracy(auto.arima(power_ts), "A-AUTO")
arima010 <- get_accuracy(arima(power_ts, order = c(0, 1, 0)), "A-010")
arimas111 <- get_accuracy(arima(power_ts, order = c(0, 1, 1), seasonal = list(order = c(1, 1, 1), period = 12)), "A-011-111")
arimas211 <- get_accuracy(arima(power_ts, order = c(0, 1, 1), seasonal = list(order = c(2, 1, 1), period = 12)), "A-011-211")
arimas010 <- get_accuracy(arima(power_ts, order = c(0, 0, 0), seasonal = list(order = c(0, 1, 0), period = 12)), "A-000-010")

accuracy_table <- rbind(etszzz_ac, etsmnn_ac, etsana_ac, etsmam_ac, auto_arima, arimas111, arimas211, arimas010, arima010)
print(accuracy_table)
Model ME RMSE MAE MPE MAPE MASE ACF1
1 ETSZZZ 0.019334 0.192173 0.090667 0.107158 0.588246 0.770975 0.090482
2 ETSMNN 0.022302 0.195954 0.090882 0.126263 0.589943 0.772807 0.096247
3 ETSANA 0.018992 0.192264 0.091528 0.104975 0.593769 0.778298 0.087817
4 ETSMAM 0.019547 0.191668 0.089658 0.108533 0.581894 0.762391 0.115953
5 A-AUTO 0.017476 0.231948 0.140563 0.087735 0.905738 1.195260 -0.008806
6 A-011-111 0.010949 0.198691 0.089306 0.054186 0.580158 0.452785 0.083968
7 A-011-211 0.011218 0.198475 0.089979 0.055934 0.584470 0.456199 0.077472
8 A-000-010 0.014021 0.273751 0.111227 0.072525 0.718556 0.563927 0.104852
9 A-010 0.001834 0.311569 0.196291 -0.009728 1.262295 0.995207 -0.226022
Forecast
Using ETS(MAM), now we forecast and plot for a year, and we’re going to use the stats library for this also!
power_ts <- ts(log(power$KWH), start = c(1998, 1), frequency = 12)
best_ets_model <- ets(power_ts, model = "MAM")
fckw_values <- forecast(best_ets_model, h = 12)
fckw_values$mean <- exp(fckw_values$mean)
fckw_values$lower <- exp(fckw_values$lower)
fckw_values$upper <- exp(fckw_values$upper)

historical_data_df <- as.data.frame(exp(power_ts))
historical_data_df$Date <- seq(as.Date("1998-01-01"), by = "month",
                               length.out = nrow(historical_data_df))
historical_data_df <- historical_data_df |>
  rename(kwh = x)
# head(historical_data_df)

forecast_data_df <- as.data.frame(fckw_values)
forecast_months <- strptime(rownames(forecast_data_df), format = "%b %Y")
forecast_data_df$Date <- myd(paste0(rownames(forecast_data_df), "-01"))
forecast_data_df <- forecast_data_df |> rename(kwh = `Point Forecast`)
# head(forecast_data_df)

ggplot() +
  geom_line(data = filter(historical_data_df, Date > as.Date("2010-12-31")), aes(x = Date, y = kwh), colour = "blue") +
  geom_line(data = forecast_data_df, aes(x = Date, y = kwh), colour = "red") +
  geom_ribbon(data = forecast_data_df, aes(x = Date, ymin = `Lo 95`, ymax = `Hi 95`), fill = "orange", alpha = 0.2) +
  geom_ribbon(data = forecast_data_df, aes(x = Date, ymin = `Lo 80`, ymax = `Hi 80`), fill = "orange", alpha = 0.5) +
  labs(title = "Power Consumption Forecast (Last Two Years of Historical Data)", x = "Date", y = "KWH") +
  theme_minimal()
Conclude
Here is our forecast for power consumption for 2014. We’ll leave the confidence intervals in this time, since this is power and that might be really important to somebody.
print(forecast_data_df)
kwh Lo 80 Hi 80 Lo 95 Hi 95 Date
Jan 2014 8932180 6881668 11593678 5994259 13310042 2014-01-01
Feb 2014 7682089 5932900 9946990 5174464 11404948 2014-02-01
Mar 2014 6526263 5053467 8428294 4413568 9650266 2014-03-01
Apr 2014 5903916 4578847 7612444 4002417 8708791 2014-04-01
May 2014 5659175 4391866 7292178 3840285 8339555 2014-05-01
Jun 2014 7344873 5675603 9505097 4951508 10895096 2014-06-01
Jul 2014 7801821 6022440 10106935 5251206 11591319 2014-07-01
Aug 2014 9348917 7194993 12147650 6263612 13953970 2014-08-01
Sep 2014 8661938 6674169 11241725 5813843 12905263 2014-09-01
Oct 2014 6365869 4929394 8220948 4305263 9412733 2014-10-01
Nov 2014 5561192 4315497 7166464 3773356 8196114 2014-11-01
Dec 2014 7026197 5431147 9089689 4739069 10417118 2014-12-01
Personal note: There isn’t as much ‘built in stuff’ or as many shortcuts in stats as there are in tsibbles, mables, fables, etc., but overall this library feels a lot more stable and predictable. This was a good exercise, considering I’ll most likely never use R for forecasting again; I’m going to have to learn a different library in Scala or Python anyway.
Part C: Water Flow Pipes
Part C consists of two data sets. These are simple 2 columns sets, however they have different time stamps. Your optional assignment is to time-base sequence the data and aggregate based on hour. Note for multiple recordings within an hour, take the mean. Then to determine if the data is stationary and can it be forecast. If so, provide a week forward forecast and present results via Rpubs and .rmd and the forecast in an Excel readable file.
Explore
Initial thoughts
- The problem asks if this can be forecast. Maybe it can’t, watch out.
- There are two excel files. Expect an import problem.
- One file appears to be hourly scheduled records from 10/23/15 to 12/3/15. The other, a much more granular than hourly record set, contains data between 10/23 and 11/1. We’ll refer to these as the “scheduled” one and the “random” (sampled) one.
- These data sets are not from the same water pipe. At 10/23/2015 9:00 AM one sheet shows 6.2 and the other shows 55.7. They are also on very different scales.
- There is some record overlap between the two, date is not a unique field. The source must be taken into account to identify uniqueness.
- These dates are in serial datetime, Excel’s proprietary date storage format.
- For the scheduled dataset, we might have to check that later to see if the series is complete.
After further exploration
- Well into this exercise, I found that when converting a serial datetime to a real date, the scheduled records were slightly off, causing records like 4:59:59 AM instead of 5 AM. Later we group_by() and sum, which means data got lost. I have retrofitted the upstream load-from-Excel code to leverage lubridate’s round_date() function to account for this (a small demonstration follows this list).
- The point above about the two data feeds not fully overlapping in time is what broke this problem.
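To show the rounding issue, here is a minimal sketch. The serial value is made up for illustration (not taken from the file); the conversion and round_date() call mirror what the wrangle code below does.
# illustrative only: a hypothetical Excel serial datetime that lands just shy of the hour
library(lubridate)
serial <- 42300.208333                                    # made-up value, roughly 5 AM on 2015-10-23
raw <- as.POSIXct((serial - 25569) * 86400, origin = "1970-01-01", tz = "UTC")
raw                                                       # prints as roughly "2015-10-23 04:59:59 UTC"
round_date(raw, unit = "hour")                            # "2015-10-23 05:00:00 UTC"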
Plan
- Wrangle: Tag each feed’s WaterFlow records with their source, union them, and sort by time.
- Wrangle: Group by hour and take the mean of WaterFlow for a final dataset.
- Forecast: Decompose and chart
- Forecast: check for stationary data, acf/pacf
- Forecast
- Conclude
Wrangle
First we’ll load these into dataframes, fix the dates, union and sort.
# w -> water, s -> scheduled, r -> random
water_r <- "waterflow_pipe1.xlsx"
water_s <- "waterflow_pipe2.xlsx"
ws_p <- paste0(git_p1_data, water_s)
wr_p <- paste0(git_p1_data, water_r)

download.file(ws_p, water_s, mode = "wb")
download.file(wr_p, water_r, mode = "wb")

# we could probably do some of this mutation after the union, but I prefer clean data to start.
ws <- read_excel(water_s) |>
  mutate(datetime = as.POSIXct((`Date Time` - 25569) * 86400, origin = "1970-01-01", tz = "UTC")) |>
  mutate(datetime = round_date(datetime, unit = "hour")) |>
  select(-`Date Time`) |>
  mutate(dataset = "ws")

wr <- read_excel(water_r) |>
  mutate(datetime = as.POSIXct((`Date Time` - 25569) * 86400, origin = "1970-01-01", tz = "UTC")) |>
  select(-`Date Time`) |>
  mutate(dataset = "wr")

water_df <- union(wr, ws) |>
  arrange(datetime) |>
  mutate(WaterFlow = round(WaterFlow, 3)) # just simplify display

unlink(water_s)
unlink(water_r)
print(n = 8, water_df)
# A tibble: 2,000 × 3
WaterFlow datetime dataset
<dbl> <dttm> <chr>
1 23.4 2015-10-23 00:24:06 wr
2 28.0 2015-10-23 00:40:02 wr
3 23.1 2015-10-23 00:53:51 wr
4 30.0 2015-10-23 00:55:40 wr
5 18.8 2015-10-23 01:00:00 ws
6 6.00 2015-10-23 01:19:17 wr
7 15.9 2015-10-23 01:23:58 wr
8 26.6 2015-10-23 01:50:05 wr
# ℹ 1,992 more rows
Perfect, now let’s aggregate the date/time to the hour. The cleanest way to do that is to create a column for the date and a column for the hour.
Also, after being grouped, the last day in this combined data set is incomplete: there are only 17 records when there should be 24. That makes sense, since the data cuts off in the afternoon, so we can chop off that partial day to create an hourly dataset to forecast from.
Then we can plot.
water_df <- water_df |>
  mutate(
    date = as.Date(datetime), # extracting just the date part
    hour = hour(datetime)     # extracting the hour part; we don't need the year, it's all 2015
  )
# looks good.
# print(n=8, water_df)

water_df <- water_df |>
  group_by(date, hour) |>
  summarise(avg_water_flow = mean(WaterFlow, na.rm = TRUE)) |>
  filter(date < "2015-12-03") |>
  ungroup()
# great, data by hour.
print(n = 4, water_df)
# A tibble: 984 × 3
date hour avg_water_flow
<date> <int> <dbl>
1 2015-10-23 0 26.1
2 2015-10-23 1 18.8
3 2015-10-23 2 24.5
4 2015-10-23 3 25.6
# ℹ 980 more rows
# prove that there are 24 records per day. if this shows up empty, it's perfect.
water_df %>%
  group_by(date) %>%
  summarise(recordsPerDay = n()) |>
  filter(recordsPerDay != 24)
# A tibble: 0 × 2
# ℹ 2 variables: date <date>, recordsPerDay <int>
# perfect, now let's try to plot.
# print(n=8, water_df)
water_df %>%
  mutate(datetime = as.POSIXct(paste(date, hour, sep = " "), format = "%Y-%m-%d %H")) |>
  ggplot(aes(x = datetime, y = avg_water_flow)) +
  geom_line() +
  labs(title = "Average Water Flow Over Time", x = "Date Time", y = "Average Water Flow") +
  theme_minimal()
Decompose
Perfect. Now let’s decompose. STL requires a time series object, so we’ll create one to plot.
start_date <- as.numeric(format(min(water_df$date), "%Y%m%d"))
ts_data <- ts(water_df$avg_water_flow, frequency = 24, start = c(substr(start_date, 1, 4), substr(start_date, 5, 6)))
stl_decomp <- stl(ts_data, s.window = "periodic")

acf_plot <- ggAcf(ts_data)
pacf_plot <- ggPacf(ts_data)
combined_plot <- acf_plot + pacf_plot

plot(stl_decomp)
plot(combined_plot)
Ok, this definitely isn’t stationary. Let’s apply the most aggressive transformation we’ve used in our class, which is diff(diff(log())). As one can see, this still isn’t stationary, and in fact log(), diff(log()), and diff(diff(log())) all look about the same (a quick side-by-side of their ACFs appears after the code below).
diff_ts_data <- diff(diff(log(ts_data)))

stl_decomp_diff <- stl(diff_ts_data, s.window = "periodic")
acf_plot_diff <- ggAcf(diff_ts_data)
pacf_plot_diff <- ggPacf(diff_ts_data)
combined_plot_diff <- acf_plot_diff + pacf_plot_diff

plot(stl_decomp_diff)
plot(combined_plot_diff)
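To back up the “all look about the same” claim, here is a quick side-by-side of the ACFs for the three transformations; a sketch that assumes ts_data from above.
# compare ACFs of the three transformations (assumes ts_data from above)
p_log <- ggAcf(log(ts_data)) + ggtitle("ACF: log")
p_d1 <- ggAcf(diff(log(ts_data))) + ggtitle("ACF: diff(log)")
p_d2 <- ggAcf(diff(diff(log(ts_data)))) + ggtitle("ACF: diff(diff(log))")
(p_log + p_d1 + p_d2) # combined with patchwork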
Because we’re in this non-stationary pattern, the immediate assumption is these two datasets shouldn’t have been combined in the first place.
Research Spike 1: Scale of these datasets. We already know the more randomly sampled dataset is only a few days’ worth of data, while the scheduled one is longer. Let’s go back, roll up by day, and look at them together.
We knew this already, but the daily totals before November 1st are going to be a lot larger than those from the middle of November, because the random feed has many more records per day and here we are rolling up with daily sums.
water_df <- union(wr, ws) |>
  arrange(datetime) |>
  mutate(WaterFlow = round(WaterFlow, 3)) |>
  mutate(
    date = as.Date(datetime), # extracting just the date part
    hour = hour(datetime)     # extracting the hour part
  ) |>
  select(-datetime) |>
  group_by(date, dataset) %>%
  summarise(flow_per_day = sum(WaterFlow))
`summarise()` has grouped output by 'date'. You can override using the
`.groups` argument.
ggplot(water_df, aes(x = date, y = flow_per_day, color = dataset)) +
geom_line() +
labs(title = "Daily Water Flow", x = "Date", y = "Flow per Day", color = "Dataset") +
theme_minimal()
At this point, my gut is telling me that we either need to get more data from the more randomly sampled feed, so that it extends out to the end of the scheduled data set, or we need to cut off the scheduled dataset so the timelines overlap.
You can forecast the “sum of all waterflow in the pipes,” or the “sum of all car sales in Montana for July,” but only if you have overlapping time series.
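If we went the second route, a minimal sketch of trimming to the overlapping window might look like the following. It assumes the wr and ws tibbles from the wrangle step above; overlap_end is simply the last timestamp in the random feed.
# sketch only: trim both feeds to the window where they overlap, then re-aggregate hourly
overlap_end <- max(wr$datetime)

water_overlap <- union(wr, ws) |>
  filter(datetime <= overlap_end) |>
  mutate(date = as.Date(datetime), hour = hour(datetime)) |>
  group_by(date, hour) |>
  summarise(avg_water_flow = mean(WaterFlow, na.rm = TRUE), .groups = "drop")

# from here we could re-check stationarity (ACF/PACF) on this shorter, consistent series
# before deciding whether a one-week forecast is defensible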
Conclude
If I had started this problem with more common sense, maybe I would have saved myself hours of work, but in the end merging these two datasets won’t work. Complex seasonality or trend is one thing, and there could even be a second seasonal pattern (at both the monthly and weekly level), but the shock of those big numbers in the first week of this combined series won’t work.