library(fpp3)
library(fable)
library(patchwork)
library(scales)
library(stringr)
library(forecast)
library(ggfortify)
library(readxl)
library(zoo)
options(scipen=999)
<- "https://github.com/tonythor/cuny-datascience/raw/develop/data624-predictive/project1/data/" git_p1_data
624 Project 1
Part A – ATM Forecast
In part A, I want you to forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable ‘Cash’ is provided in hundreds of dollars, other than that it is straight forward. I am being somewhat ambiguous on purpose to make this have a little more business feeling. Explain and demonstrate your process, techniques used and not used, and your actual forecast. I am giving you data via an excel file, please provide your written report on your findings, visuals, discussion and your R code via an RPubs link along with the actual .rmd file. Also please submit the forecast, which you will put in an Excel readable file.
Explore
At first glance
- This is an excel file. There will be conversion, type and/or formatting issues.
- There are some malformed records in the spreadsheet. Each record should have an ATM and a date. Some are missing ATM and CASH.
- It’s difficult to tell if we have every record for every day for every atm. We should look at that.
- After loading the data into R, the date column is in epoch time. That needs to be converted.
Let’s start with a chart.
<- paste0(git_p1_data, "atm_624.xlsx")
file_url <- "atm_624.xlsx"
local_file_path download.file(file_url, local_file_path, mode = "wb")
<- read_excel(local_file_path)
atm_data
unlink(local_file_path)
# immediately fix the date out of excel epoch time.
$DATE <- as.Date(atm_data$DATE, origin = "1899-12-30")
atm_data
<- as_tsibble(atm_data, index = DATE, key = ATM)
atm_tsibble
ggplot(atm_tsibble, aes(x = DATE, y = Cash, color = ATM)) +
geom_line() +
labs(title = "ATM Cash Withdraws", x = "Date", y = "Cash") +
theme_minimal()
Quick thoughts
For a quick sanity check, let’s consider this through the lens of transactions per minute. There are only 1,440 minutes in a 24 hour period, so assume a maximum throughput of 1 transaction per minute. Given the maximum most banks will let you withdraw at any one time ($2,000), a hasty theoretical maximum withdrawal over a 24 hour period might be around $2,880,000. That spike is 11,000 in units of hundreds of dollars, or $1,100,000, so perhaps that is theoretically possible (a quick arithmetic check follows the list below).
Still though, 1.1M is:
55,000 twenty dollar bills
2200 wrapped stacks of 20’s
A 19 feet tall stack of 20’s, or a 4 foot tall stack of 100’s
If that is a valid data point, then there were people unwrapping cash and feeding that ATM all day.
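To double check those numbers, here is a quick back-of-the-envelope calculation. The bill thickness (about 0.0043 inches) and the 25-bill stack size are assumptions chosen to reproduce the figures above, not anything from the data.
# back-of-the-envelope check of the figures above (bill thickness ~0.0043 in is an assumption)
spike_dollars   <- 11000 * 100            # Cash is in hundreds of dollars
theoretical_max <- 1440 * 2000            # 1 withdrawal per minute at a $2,000 per-withdrawal cap
twenties        <- spike_dollars / 20     # number of $20 bills
stacks_of_20s   <- twenties / 25          # matches "2200 wrapped stacks" if a stack holds 25 bills
height_20s_ft   <- twenties * 0.0043 / 12              # height of one stack of $20s, in feet
height_100s_ft  <- (spike_dollars / 100) * 0.0043 / 12 # height of one stack of $100s, in feet
c(theoretical_max = theoretical_max, spike_dollars = spike_dollars, twenties = twenties,
  stacks = stacks_of_20s, height_20s_ft = round(height_20s_ft, 1), height_100s_ft = round(height_100s_ft, 1))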
One ATM is much higher volume than the others, and that’s odd.
After another quick look at this graph, it seems again like we should look for missing data.
Research Spikes
Spike 1: What about the outlier? Let’s look at that spike and a couple of days before and after, just to see.
atm_tsibble |>
  filter(DATE >= as.Date("2010-02-07") & DATE <= as.Date("2010-02-14")) |>
  ggplot(aes(x = DATE, y = Cash, color = ATM)) +
  geom_line() +
  labs(title = "ATM Withdrawals", x = "Date", y = "Withdraw Count") +
  scale_x_date(date_breaks = "1 day", date_labels = "%a %b %d")
That spike is Tuesday the 9th of February. A quick Google search shows the Northeast United States experienced a massive blizzard on the 9th and 10th. An event like that could cause that spike, and even explain how the next two days were so low traffic at ATM4.
Obviously, if this were a real world project, we’d need to verify that the ATM’s location was in the Northeast near the blizzard. Given this assignment, I suspect it was, and I suspect that’s why the assignment contains this data set.
Decision 1: Smooth over the outlier. We’ll smooth that over by replacing the storm record with an average of a few days before and after. And we’ll politely suggest to the bank that if there is another storm, they have two people on standby with a 15 foot stack of 20 dollar bills on hand.
Spike 2: How much data are we missing? What do we do? For missing time series data, let’s construct a dataframe with all the dates, and then left join everything together to make sure we account for each day/ATM combination. That’s one of the best ways I know to check every date in a time series.
all_dates <- seq(min(atm_tsibble$DATE), max(atm_tsibble$DATE), by = "day")
atm_tsibble <- atm_tsibble |> filter(!is.na(ATM)) # this is for the date records with no atm or cash.
complete_atm_tibble <- expand.grid(DATE = all_dates, ATM = unique(atm_tsibble$ATM))
complete_atm_tibble <- left_join(complete_atm_tibble, atm_tsibble, by = c("DATE", "ATM"))

missing_data <- complete_atm_tibble |>
  filter(is.na(Cash) | Cash == 0)

ggplot(missing_data, aes(x = DATE, y = ATM)) +
  geom_point(aes(color = ATM), shape = 4) + # Use shape 4 for X's
  labs(title = "Missing ATM Records by Date",
       x = "Date", y = "ATM") +
  theme_minimal()

# Show all the missing records
complete_atm_tibble |>
  filter(ATM != "ATM3" & (is.na(Cash) | Cash == 0) & DATE < as.Date("2010-05-01"))
DATE ATM Cash
1 2009-06-13 ATM1 NA
2 2009-06-16 ATM1 NA
3 2009-06-22 ATM1 NA
4 2009-06-18 ATM2 NA
5 2009-06-24 ATM2 NA
6 2009-10-25 ATM2 0
7 2010-03-30 ATM2 0
Decision 2: Impute NA’s in ATM1 and ATM2. Above, it looks like we’re missing a couple of days’ worth of data (denoted as NA in the list of records, and as x’s on the chart). We also have a couple of days where the cash dispensed was zero. Zero is to be expected; the ATM simply didn’t give out any money that day. The NA’s are missing records that were caught by the left join.
We’ll impute those missing records as part of the plan.
Decision 3: Drop ATM3, suggest ATM3 is like ATM1. ATM3 only dispensed cash on 2010-04-28, 2010-04-29, and 2010-04-30. It isn’t missing data; it just seems like it hadn’t been giving out cash. Perhaps the ATM was stationed in a building under renovation, we don’t know. Regardless, let’s quickly look at those dates, and whichever other ATM seems the closest can be our recommendation for ATM3.
complete_atm_tibble |>
  filter(DATE >= as.Date("2010-04-28") & DATE <= as.Date("2010-04-30"))
DATE ATM Cash
1 2010-04-28 ATM1 96.00000
2 2010-04-29 ATM1 82.00000
3 2010-04-30 ATM1 85.00000
4 2010-04-28 ATM2 107.00000
5 2010-04-29 ATM2 89.00000
6 2010-04-30 ATM2 90.00000
7 2010-04-28 ATM3 96.00000
8 2010-04-29 ATM3 82.00000
9 2010-04-30 ATM3 85.00000
10 2010-04-28 ATM4 348.20106
11 2010-04-29 ATM4 44.24535
12 2010-04-30 ATM4 482.28711
Perfect, ATM3 appears to be identical to ATM1 with respect to cash withdrawals. We’ll use ATM1 as a proxy for ATM3 in the final conclusion, though we have to remember to suggest actively monitoring ATM3’s behavior until more data is collected.
Plan
Let’s organize the plan start to finish before we act.
- Wrangle: Remove all improperly formatted 2010-05 records from complete_atm_tibble.
- Wrangle: Remove ATM3 from complete_atm_tibble.
- Wrangle: Smooth that outlier on ATM4.
- Wrangle: Impute using the mean in complete_atm_tibble where Cash is NA.
- Forecast: Forecast with ETS and ARIMA.
- Forecast: Sanity check.
- Conclude: Suggest how much cash to have on hand per ATM.
Wrangle
## remove improperly formatted records and ATM3
complete_atm_tibble <- complete_atm_tibble |>
  filter(ATM != "ATM3" & DATE < as.Date("2010-05-01"))

## replace the outlier
date_to_replace <- complete_atm_tibble |> filter(Cash >= 3000) |> pull(DATE)
s_date <- date_to_replace - 9
e_date <- date_to_replace - 2
average_cash <- complete_atm_tibble |>
  filter(ATM == "ATM4", DATE >= s_date, DATE <= e_date) |>
  summarize(Average_Cash = mean(Cash, na.rm = TRUE)) |>
  pull(Average_Cash)

complete_atm_tibble <- complete_atm_tibble |>
  mutate(Cash = ifelse(DATE == date_to_replace & ATM == "ATM4", average_cash, Cash))

# double check that one date, should be Feb 09, about $375.
print(complete_atm_tibble |> filter(DATE == as.Date(date_to_replace) & ATM == "ATM4"))
DATE ATM Cash
1 2010-02-09 ATM4 374.4128
## Impute the three missing records for ATM1 and the two for ATM2
average_cash_per_atm <- complete_atm_tibble |>
  group_by(ATM) |>
  summarize(Average_Cash = mean(Cash, na.rm = TRUE))

complete_atm_tibble <- complete_atm_tibble |>
  left_join(average_cash_per_atm, by = "ATM") |>
  mutate(Cash = ifelse(is.na(Cash), Average_Cash, Cash)) |>
  select(-Average_Cash)

## double check that imputation worked, the NA records should be gone
## but we should still see the zero records. Zero is a valid number.
print(complete_atm_tibble |>
  filter((is.na(Cash) | Cash == 0) & DATE < as.Date("2010-05-01")))
DATE ATM Cash
1 2009-10-25 ATM2 0
2 2010-03-30 ATM2 0
We’ll plot those two charts again to be sure. The outlier is gone and there is no missing data.
## now let's plot those charts again!
p1 <- ggplot(complete_atm_tibble, aes(x = DATE, y = Cash, color = ATM)) +
  geom_line() +
  labs(title = "ATM Cash Trend", x = "Date", y = "Cash") +
  theme_minimal()

p2 <- ggplot(complete_atm_tibble |> filter(is.na(Cash)), aes(x = DATE, y = ATM)) +
  geom_point(aes(color = ATM), shape = 4) + # Use shape 4 for X's
  labs(title = "Missing ATM Records by Date",
       x = "Date", y = "ATM")

(p1 + p2)
That’s perfect.
Select Model
To help us select a model, we’ll decompose each series and review its trend and seasonality.
atm1_tsibble <- as_tsibble(complete_atm_tibble |> filter(ATM == "ATM1"), index = DATE)
atm1_stl_fit <- atm1_tsibble |> model(STL(Cash ~ season(window = "periodic")))
atm1_components <- atm1_stl_fit |> components()

atm2_tsibble <- as_tsibble(complete_atm_tibble |> filter(ATM == "ATM2"), index = DATE)
atm2_stl_fit <- atm2_tsibble |> model(STL(Cash ~ season(window = "periodic")))
atm2_components <- atm2_stl_fit |> components()

atm4_tsibble <- as_tsibble(complete_atm_tibble |> filter(ATM == "ATM4"), index = DATE)
atm4_stl_fit <- atm4_tsibble |> model(STL(Cash ~ season(window = "periodic")))
atm4_components <- atm4_stl_fit |> components()

pa1 <- autoplot(atm1_components)
pa2 <- autoplot(atm2_components)
pa4 <- autoplot(atm4_components)
(pa1 + pa2 + pa4)
Above we can see considerable weekly seasonality. ETS and SARIMA are both likely choices for good forecasting models.
If we compare ETS and SARIMA across all three ATM machines, like below, we see that either ETS or SARIMA would work fine. We’ll pick SARIMA as our model though, because it fits slightly better to ATM4, which has the highest volume.
fit_compare_ets_sarima <- function(atm_tsibble) {
  ets_model <- atm_tsibble |> model(ETS(Cash ~ error("A") + trend("N") + season("A")))
  sarima_model <- atm_tsibble |> model(ARIMA(Cash ~ PDQ(0,0,0) + season("W")))
  models <- list(ets = ets_model, sarima = sarima_model)
  accuracy_table <- purrr::map_df(models, accuracy, .id = "model")
  return(accuracy_table)
}
# print(fit_compare_ets_sarima(atm1_tsibble))
# print(fit_compare_ets_sarima(atm2_tsibble))
# print(fit_compare_ets_sarima(atm4_tsibble))

combined_results <- rbind(
  mutate(fit_compare_ets_sarima(atm1_tsibble), ATM = "ATM1"),
  mutate(fit_compare_ets_sarima(atm2_tsibble), ATM = "ATM2"),
  mutate(fit_compare_ets_sarima(atm4_tsibble), ATM = "ATM4") # Add additional data frames as needed
)
print(combined_results |> select(-.type, -.model, -ME))
# A tibble: 6 × 9
model RMSE MAE MPE MAPE MASE RMSSE ACF1 ATM
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 ets 23.8 15.1 -106. 121. 0.846 0.854 0.125 ATM1
2 sarima 25.0 17.9 -113. 129. 1.00 0.897 -0.00694 ATM1
3 ets 25.0 17.8 -Inf Inf 0.856 0.832 0.0177 ATM2
4 sarima 26.6 20.5 -Inf Inf 0.987 0.883 0.0395 ATM2
5 ets 329. 265. -520. 550. 0.764 0.726 0.101 ATM4
6 sarima 327. 260. -492. 522. 0.752 0.722 -0.000300 ATM4
Forecast
Let’s forecast and plot to see if we’re on the right track. It looks like we are.
forecast_atm <- function(atm_tsibble) {
  sarima_fit <- atm_tsibble |> model(ARIMA(Cash ~ PDQ(0,0,0) + season("W")))
  forecasted_values <- sarima_fit |> forecast(h = 30)
  return(forecasted_values)
}

atm1_fc <- forecast_atm(atm1_tsibble)
atm2_fc <- forecast_atm(atm2_tsibble)
atm4_fc <- forecast_atm(atm4_tsibble)

at1 <- autoplot(atm1_fc) +
  autolayer(atm1_tsibble |> filter(DATE >= as.Date("2010-04-01"))) +
  labs(title = "ATM1 FC Cash Withdrawals", y = "Cash")
at2 <- autoplot(atm2_fc) +
  autolayer(atm2_tsibble |> filter(DATE >= as.Date("2010-04-01"))) +
  labs(title = "ATM2 FC Cash Withdrawals", y = "Cash")
at4 <- autoplot(atm4_fc) +
  autolayer(atm4_tsibble |> filter(DATE >= as.Date("2010-04-01"))) +
  labs(title = "ATM4 FC Cash Withdrawals", y = "Cash")

atm1_fc_ti <- atm1_fc |> mutate(ATM = "ATM1") |> as_tibble()
atm2_fc_ti <- atm2_fc |> mutate(ATM = "ATM2") |> as_tibble()
atm4_fc_ti <- atm4_fc |> mutate(ATM = "ATM4") |> as_tibble()
all_fc <- bind_rows(atm1_fc_ti, atm2_fc_ti, atm4_fc_ti)
# all_fc <- as_tibble(all_fc)

# Plot using ggplot
all_three <- ggplot(all_fc, aes(x = DATE, y = .mean, color = ATM)) +
  geom_line() +
  labs(title = "ATM Cash Withdrawal Forecasts", x = "Date", y = "Forecasted Cash") +
  theme_minimal()

combined_plot <- (at1 | at2) /
  (at4 | all_three)
print(combined_plot)
Conclude
For the last step, we must forecast how much cash each ATM gives out. Remembering two things, that units of Cash are in hundreds and that we are using ATM1 as a proxy for ATM3, we only have to sum the means of the forecast data sets to get a reasonable forecast for the month per ATM.
For clarity in our final numbers, we will round to the nearest $10, add commas, and leave out confidence band information.
That said, assuming no major blizzards or other events that would cause bank runs, below is a reasonable prediction of how much money we’ll need for each ATM for the month of May 2010. Also, please remember to watch ATM3 until we have more historical data.
summarize_forecast <- function(fc) {
  total_mean <- fc |>
    summarise(total_mean = sum(.mean, na.rm = TRUE)) |>
    pull(total_mean) * 100
  total_mean_rounded <- round(total_mean / 10) * 10
  total_mean_dollars <- dollar(total_mean_rounded)
  return(total_mean_dollars)
}

atm1_cash_fc <- summarize_forecast(atm1_fc_ti)
atm2_cash_fc <- summarize_forecast(atm2_fc_ti)
atm4_cash_fc <- summarize_forecast(atm4_fc_ti)

# ATM1 stands in for ATM3, per Decision 3 above.
final_atm <- sprintf("Required cash for:<br>ATM1 %s<br>ATM2 %s<br>ATM3 %s<br>ATM4 %s",
                     atm1_cash_fc, atm2_cash_fc, atm1_cash_fc, atm4_cash_fc)
cat(final_atm)
Required cash for:
ATM1 $256,060
ATM2 $179,140
ATM3 $256,060
ATM4 $1,347,760
Note: the above section took about 13 hours to complete
Part B – Forecasting Power
Part B consists of a simple dataset of residential power usage for January 1998 until December 2013. Your assignment is to model these data and a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in Kilowatt hours, the rest is straight forward. Add this to your existing files above.
Explore
Initial Thoughts
- This is an excel file, there might be some formatting issues.
- There are no missing dates between the first and last record, which is great.
- There is one missing record, 2008-Sep. Its KWH is NULL.
- Ugh, it’s a yearmonth datatype, which doesn’t play well with the tsibble/fable/mable ecosystem, and I’m tired of wasting time on that. I am making the conscious decision to skip all that and do this section using the R stats and forecast packages instead. It might take longer, but it’s probably a more fully baked toolset anyway.
Let’s start with a chart, but we need to wrangle that date first.
<- paste0(git_p1_data, "residential_customer_forecast.xlsx")
file_url <- "residential_customer_forecast.xlsx"
local_file_path download.file(file_url, local_file_path, mode = "wb")
<- read_excel(local_file_path) |>
power mutate(Date = as.Date(paste0(`YYYY-MMM`, "-01"), format = "%Y-%b-%d")) |>
select(-`YYYY-MMM`, -CaseSequence)
unlink(local_file_path)
|>
power ggplot(aes(x = Date, y = KWH)) +
geom_line(colour = "blue") +
geom_hline(yintercept = 0, linetype = "dashed", color = "gray")
The plot might show a light positive trend from about 2006 on. It is definitely seasonal with respect to month. There’s a pretty big outlier, but it seems like a valid record. Let’s leave it alone for now.
Plan
- Wrangle: Get dates loaded. (Done above)
- Impute that one null KWH record.
- Plot a decomposition
- Spend some time researching the stats package
- Forecast: Select a model
- Forecast: Wrangle and forecast
- Conclude: Predict
Wrangle
Impute First. We’ll replace that missing record with the average KWH of the two records from September 2007 and September 2009. This is highly seasonal data, and there might be a slight trend; taking the average of those two specific records keeps the imputed value in line with the seasonality.
Also, it’s as good a guess as any.
# Let's impute that missing date, maybe we just grab the record from last year. It might not contain the trend, but it'll contain the seasonal.
average_kwh <- power |>
  filter(Date %in% as.Date(c("2007-09-01", "2009-09-01"))) |>
  summarize(average_kwh = mean(KWH)) |>
  pull(average_kwh)

power <- power |>
  mutate(KWH = ifelse(Date == as.Date("2008-09-01"), average_kwh, KWH))
Select Model
Let’s decompose first. We quickly see that this has a slight trend and is extremely seasonal across the months. Let’s compare some models and see which one we like more.
ts_data <- ts(power$KWH, start = c(1998, 1), frequency = 12)
stl_decomp <- stl(ts_data, s.window = "periodic")
plot(stl_decomp)
Awesome. Definitely seems like we have some seasonality. Let’s test ARIMA and ETS again, but this time using the stats and forecast packages instead. We’re definitely going to apply a log transform to the data.
Looks like ETS(MAM) performs the best, and we needed to log the data to get the best results. We’ll use ETS(MAM) for the next section.
library(stats)
library(forecast)

get_accuracy <- function(model, model_name) {
  accuracy_metrics <- accuracy(model)
  model_accuracy <- data.frame(
    Model = model_name,
    ME = accuracy_metrics["Training set", "ME"],
    RMSE = accuracy_metrics["Training set", "RMSE"],
    MAE = accuracy_metrics["Training set", "MAE"],
    MPE = accuracy_metrics["Training set", "MPE"],
    MAPE = accuracy_metrics["Training set", "MAPE"],
    MASE = accuracy_metrics["Training set", "MASE"],
    ACF1 = accuracy_metrics["Training set", "ACF1"]
  )
  model_accuracy[, -1] <- round(model_accuracy[, -1], 6)
  return(model_accuracy)
}

power_ts <- ts(log(power$KWH), start = c(1998, 1), frequency = 12)

etszzz_ac <- get_accuracy(ets(power_ts, model = "ZZZ"), "ETSZZZ")
etsmnn_ac <- get_accuracy(ets(power_ts, model = "MNM"), "ETSMNN")
etsana_ac <- get_accuracy(ets(power_ts, model = "ANA"), "ETSANA")
etsmam_ac <- get_accuracy(ets(power_ts, model = "MAM"), "ETSMAM")
auto_arima <- get_accuracy(auto.arima(power_ts), "A-AUTO")
arima010 <- get_accuracy(arima(power_ts, order = c(0, 1, 0)), "A-010")
arimas111 <- get_accuracy(arima(power_ts, order = c(0, 1, 1), seasonal = list(order = c(1, 1, 1), period = 12)), "A-011-111")
arimas211 <- get_accuracy(arima(power_ts, order = c(0, 1, 1), seasonal = list(order = c(2, 1, 1), period = 12)), "A-011-211")
arimas010 <- get_accuracy(arima(power_ts, order = c(0, 0, 0), seasonal = list(order = c(0, 1, 0), period = 12)), "A-000-010")

accuracy_table <- rbind(etszzz_ac, etsmnn_ac, etsana_ac, etsmam_ac, auto_arima, arimas111, arimas211, arimas010, arima010)
print(accuracy_table)
Model ME RMSE MAE MPE MAPE MASE ACF1
1 ETSZZZ 0.019334 0.192173 0.090667 0.107158 0.588246 0.770975 0.090482
2 ETSMNN 0.022302 0.195954 0.090882 0.126263 0.589943 0.772807 0.096247
3 ETSANA 0.018992 0.192264 0.091528 0.104975 0.593769 0.778298 0.087817
4 ETSMAM 0.019547 0.191668 0.089658 0.108533 0.581894 0.762391 0.115953
5 A-AUTO 0.017476 0.231948 0.140563 0.087735 0.905738 1.195260 -0.008806
6 A-011-111 0.010949 0.198691 0.089306 0.054186 0.580158 0.452785 0.083968
7 A-011-211 0.011218 0.198475 0.089979 0.055934 0.584470 0.456199 0.077472
8 A-000-010 0.014021 0.273751 0.111227 0.072525 0.718556 0.563927 0.104852
9 A-010 0.001834 0.311569 0.196291 -0.009728 1.262295 0.995207 -0.226022
Forecast
Using ETS(MAM), now we forecast and plot for a year, and we’re going to use the stats library for this also!
power_ts <- ts(log(power$KWH), start = c(1998, 1), frequency = 12)
best_ets_model <- ets(power_ts, model = "MAM")
fckw_values <- forecast(best_ets_model, h = 12)
fckw_values$mean <- exp(fckw_values$mean)
fckw_values$lower <- exp(fckw_values$lower)
fckw_values$upper <- exp(fckw_values$upper)

historical_data_df <- as.data.frame(exp(power_ts))
historical_data_df$Date <- seq(as.Date("1998-01-01"), by = "month",
                               length.out = nrow(historical_data_df))
historical_data_df <- historical_data_df |>
  rename(kwh = x)
# head(historical_data_df)

forecast_data_df <- as.data.frame(fckw_values)
forecast_months <- strptime(rownames(forecast_data_df), format = "%b %Y")
forecast_data_df$Date <- myd(paste0(rownames(forecast_data_df), "-01"))
forecast_data_df <- forecast_data_df |> rename(kwh = `Point Forecast`)
# head(forecast_data_df)

ggplot() +
  geom_line(data = filter(historical_data_df, Date > as.Date("2010-12-31")), aes(x = Date, y = kwh), colour = "blue") +
  geom_line(data = forecast_data_df, aes(x = Date, y = kwh), colour = "red") +
  geom_ribbon(data = forecast_data_df, aes(x = Date, ymin = `Lo 95`, ymax = `Hi 95`), fill = "orange", alpha = 0.2) +
  geom_ribbon(data = forecast_data_df, aes(x = Date, ymin = `Lo 80`, ymax = `Hi 80`), fill = "orange", alpha = 0.5) +
  labs(title = "Power Consumption Forecast (Last Two Years of Historical Data)", x = "Date", y = "KWH") +
  theme_minimal()
Conclude
Here is our forecast for power consumption for 2014. We’ll leave the confidence intervals in this time, since this is power and that might be really important to somebody.
print(forecast_data_df)
kwh Lo 80 Hi 80 Lo 95 Hi 95 Date
Jan 2014 8932180 6881668 11593678 5994259 13310042 2014-01-01
Feb 2014 7682089 5932900 9946990 5174464 11404948 2014-02-01
Mar 2014 6526263 5053467 8428294 4413568 9650266 2014-03-01
Apr 2014 5903916 4578847 7612444 4002417 8708791 2014-04-01
May 2014 5659175 4391866 7292178 3840285 8339555 2014-05-01
Jun 2014 7344873 5675603 9505097 4951508 10895096 2014-06-01
Jul 2014 7801821 6022440 10106935 5251206 11591319 2014-07-01
Aug 2014 9348917 7194993 12147650 6263612 13953970 2014-08-01
Sep 2014 8661938 6674169 11241725 5813843 12905263 2014-09-01
Oct 2014 6365869 4929394 8220948 4305263 9412733 2014-10-01
Nov 2014 5561192 4315497 7166464 3773356 8196114 2014-11-01
Dec 2014 7026197 5431147 9089689 4739069 10417118 2014-12-01
Personal note: There isn’t as much ‘built in stuff’ or as many shortcuts in stats as there are in tsibbles, mables, fables, etc., but overall this library feels a lot more stable and predictable. This was a good exercise, considering I’ll most likely never use R for forecasting again; I’m going to have to learn a different library in Scala or Python anyway.
Part C: Water Flow Pipes
Part C consists of two data sets. These are simple 2 columns sets, however they have different time stamps. Your optional assignment is to time-base sequence the data and aggregate based on hour. Note for multiple recordings within an hour, take the mean. Then to determine if the data is stationary and can it be forecast. If so, provide a week forward forecast and present results via Rpubs and .rmd and the forecast in an Excel readable file.
Explore
Initial thoughts
- The problem asks if this can be forecast. Maybe it can’t, watch out.
- There are two excel files. Expect an import problem.
- One file appears to be hourly scheduled records from 10/23/15 to 12/3/15. The other, a much more granular than hourly record set, contains data between 10/23 and 11/1. We’ll refer to these as the “scheduled” one and the “random” (sampled) one.
- These data sets are not from the same water pipe. At 10/23/2015 9:00 AM one sheet shows 6.2 and the other shows 55.7. They are also on very different scales.
- There is some record overlap between the two, date is not a unique field. The source must be taken into account to identify uniqueness.
- These dates are in serial datetime, Excel’s proprietary date storage format.
- For the scheduled dataset, we might have to check that later to see if the series is complete.
After further exploration
- Well into this exercise, I found that when converting a serial datetime to a real date, the scheduled records were slightly off, causing records like 4:59:59 AM instead of 5 AM. Later we group_by() and sum, which means data got lost. I have retrofitted the upstream load-from-Excel code to leverage lubridate’s round_date() function to account for this (a small demonstration follows this list).
- The point above about the two data feeds not fully overlapping in time is what broke this problem.
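To show the rounding issue, here is a minimal sketch. The serial value is made up for illustration (not taken from the file); the conversion and round_date() call mirror what the wrangle code below does.
# illustrative only: a hypothetical Excel serial datetime that lands just shy of the hour
library(lubridate)
serial <- 42300.208333                                    # made-up value, roughly 5 AM on 2015-10-23
raw <- as.POSIXct((serial - 25569) * 86400, origin = "1970-01-01", tz = "UTC")
raw                                                       # prints as roughly "2015-10-23 04:59:59 UTC"
round_date(raw, unit = "hour")                            # "2015-10-23 05:00:00 UTC"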
Plan
- Wrangle: Tag each feed’s WaterFlow records with their source, union them, and sort by time.
- Wrangle: Group by hour and take the mean of WaterFlow for a final dataset.
- Forecast: Decompose and chart
- Forecast: check for stationary data, acf/pacf
- Forecast
- Conclude
Wrangle
First we’ll load these into dataframes, fix the dates, union and sort.
# w -> water, s -> scheduled, r -> random
water_r <- "waterflow_pipe1.xlsx"
water_s <- "waterflow_pipe2.xlsx"
ws_p <- paste0(git_p1_data, water_s)
wr_p <- paste0(git_p1_data, water_r)

download.file(ws_p, water_s, mode = "wb")
download.file(wr_p, water_r, mode = "wb")

# we could probably do some of this mutation after the union, but I prefer clean data to start.
ws <- read_excel(water_s) |>
  mutate(datetime = as.POSIXct((`Date Time` - 25569) * 86400, origin = "1970-01-01", tz = "UTC")) |>
  mutate(datetime = round_date(datetime, unit = "hour")) |>
  select(-`Date Time`) |>
  mutate(dataset = "ws")

wr <- read_excel(water_r) |>
  mutate(datetime = as.POSIXct((`Date Time` - 25569) * 86400, origin = "1970-01-01", tz = "UTC")) |>
  select(-`Date Time`) |>
  mutate(dataset = "wr")

water_df <- union(wr, ws) |>
  arrange(datetime) |>
  mutate(WaterFlow = round(WaterFlow, 3)) # just simplify display

unlink(water_s)
unlink(water_r)
print(n = 8, water_df)
# A tibble: 2,000 × 3
WaterFlow datetime dataset
<dbl> <dttm> <chr>
1 23.4 2015-10-23 00:24:06 wr
2 28.0 2015-10-23 00:40:02 wr
3 23.1 2015-10-23 00:53:51 wr
4 30.0 2015-10-23 00:55:40 wr
5 18.8 2015-10-23 01:00:00 ws
6 6.00 2015-10-23 01:19:17 wr
7 15.9 2015-10-23 01:23:58 wr
8 26.6 2015-10-23 01:50:05 wr
# ℹ 1,992 more rows
Perfect, now let’s aggregate the date/time to the hour. The cleanest way to do that is to create a column for the date and a column for the hour.
Also, after being grouped, the last day in this combined data set is incomplete: there are only 17 records when there should be 24. That makes sense, since the data cuts off in the afternoon, so we can chop off that partial day to create an hourly dataset to forecast from.
Then we can plot.
water_df <- water_df |>
  mutate(
    date = as.Date(datetime), # extracting just the date part
    hour = hour(datetime)     # extracting the hour part; we don't need the year, it's all 2015
  )
# looks good.
# print(n=8, water_df)

water_df <- water_df |>
  group_by(date, hour) |>
  summarise(avg_water_flow = mean(WaterFlow, na.rm = TRUE)) |>
  filter(date < "2015-12-03") |>
  ungroup()
# great, data by hour.
print(n = 4, water_df)
# A tibble: 984 × 3
date hour avg_water_flow
<date> <int> <dbl>
1 2015-10-23 0 26.1
2 2015-10-23 1 18.8
3 2015-10-23 2 24.5
4 2015-10-23 3 25.6
# ℹ 980 more rows
# prove that there are 24 records per day. if this shows up empty, it's perfect.
water_df %>%
  group_by(date) %>%
  summarise(recordsPerDay = n()) |>
  filter(recordsPerDay != 24)
# A tibble: 0 × 2
# ℹ 2 variables: date <date>, recordsPerDay <int>
# perfect, now let's try to plot.
# print(n=8, water_df)
water_df %>%
  mutate(datetime = as.POSIXct(paste(date, hour, sep = " "), format = "%Y-%m-%d %H")) |>
  ggplot(aes(x = datetime, y = avg_water_flow)) +
  geom_line() +
  labs(title = "Average Water Flow Over Time", x = "Date Time", y = "Average Water Flow") +
  theme_minimal()
Decompose
Perfect. Now let’s decompose. STL requires a time series object, so we’ll create one to plot.
start_date <- as.numeric(format(min(water_df$date), "%Y%m%d"))
ts_data <- ts(water_df$avg_water_flow, frequency = 24, start = c(substr(start_date, 1, 4), substr(start_date, 5, 6)))
stl_decomp <- stl(ts_data, s.window = "periodic")

acf_plot <- ggAcf(ts_data)
pacf_plot <- ggPacf(ts_data)
combined_plot <- acf_plot + pacf_plot

plot(stl_decomp)
plot(combined_plot)
Ok, this definitely isn’t stationary. Let’s apply the most aggressive transformation we’ve used in our class, which is diff(diff(log())). As one can see, this still isn’t stationary, and in fact log(), diff(log()), and diff(diff(log())) all look about the same (a quick side-by-side of their ACFs appears after the code below).
diff_ts_data <- diff(diff(log(ts_data)))

stl_decomp_diff <- stl(diff_ts_data, s.window = "periodic")
acf_plot_diff <- ggAcf(diff_ts_data)
pacf_plot_diff <- ggPacf(diff_ts_data)
combined_plot_diff <- acf_plot_diff + pacf_plot_diff

plot(stl_decomp_diff)
plot(combined_plot_diff)
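To back up the “all look about the same” claim, here is a quick side-by-side of the ACFs for the three transformations; a sketch that assumes ts_data from above.
# compare ACFs of the three transformations (assumes ts_data from above)
p_log <- ggAcf(log(ts_data)) + ggtitle("ACF: log")
p_d1 <- ggAcf(diff(log(ts_data))) + ggtitle("ACF: diff(log)")
p_d2 <- ggAcf(diff(diff(log(ts_data)))) + ggtitle("ACF: diff(diff(log))")
(p_log + p_d1 + p_d2) # combined with patchwork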
Because we’re in this non-stationary pattern, the immediate assumption is these two datasets shouldn’t have been combined in the first place.
Research Spike 1: Scale of these datasets. We already know the more randomly sampled dataset is only a few days’ worth of data, while the scheduled one is longer. Let’s go back, roll up by day, and look at them together.
We knew this already, but the daily totals before November 1st are going to be a lot larger than those from the middle of November, because the random feed has many more records per day and here we are rolling up with daily sums.
water_df <- union(wr, ws) |>
  arrange(datetime) |>
  mutate(WaterFlow = round(WaterFlow, 3)) |>
  mutate(
    date = as.Date(datetime), # extracting just the date part
    hour = hour(datetime)     # extracting the hour part
  ) |>
  select(-datetime) |>
  group_by(date, dataset) %>%
  summarise(flow_per_day = sum(WaterFlow))
`summarise()` has grouped output by 'date'. You can override using the
`.groups` argument.
ggplot(water_df, aes(x = date, y = flow_per_day, color = dataset)) +
geom_line() +
labs(title = "Daily Water Flow", x = "Date", y = "Flow per Day", color = "Dataset") +
theme_minimal()
At this point, my gut is telling me that we either need to get more data from the more randomly sampled feed, so that it extends out to the end of the scheduled data set, or we need to cut off the scheduled dataset so the timelines overlap.
You can forecast the “sum of all waterflow in the pipes,” or the “sum of all car sales in Montana for July,” but only if you have overlapping time series.
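If we went the second route, a minimal sketch of trimming to the overlapping window might look like the following. It assumes the wr and ws tibbles from the wrangle step above; overlap_end is simply the last timestamp in the random feed.
# sketch only: trim both feeds to the window where they overlap, then re-aggregate hourly
overlap_end <- max(wr$datetime)

water_overlap <- union(wr, ws) |>
  filter(datetime <= overlap_end) |>
  mutate(date = as.Date(datetime), hour = hour(datetime)) |>
  group_by(date, hour) |>
  summarise(avg_water_flow = mean(WaterFlow, na.rm = TRUE), .groups = "drop")

# from here we could re-check stationarity (ACF/PACF) on this shorter, consistent series
# before deciding whether a one-week forecast is defensible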
Conclude
If I had started this problem with more common sense, maybe I would have saved myself hours of work, but in the end merging these two datasets won’t work. Complex seasonality or trend is one thing, and there could even be a second seasonal pattern (at both the monthly and weekly level), but the shock of those big numbers in the first week of this combined series won’t work.