Part A - ATM Forecast

Forecast how much cash is taken out of 4 different ATM machines for May 2010.


Data

Here we process our data, develop an intuition for how to approach the forecast, and handle missing or anomalous data.

Section Summary
* Import
* Visualize the data
* Patterns
* Problematic ATM3
* Remaining missing data
* Replacing missing values
* Convert to time series


Import

Here we import the data, which has columns for DATE, ATM and Cash, where ATM indicates which of the four ATMs the money was withdrawn from and Cash is recorded in hundreds of dollars.

We remove the empty rows from 2010-05-01 to 2010-05-14, dates which we will later forecast.

Interestingly, ATM4's withdrawals are recorded to the tenth of a penny, so we will assume it was a Mexican peso ATM with withdrawals converted into USD before being recorded.

# Check out packages
library(tidyverse)
library(fpp3)
library(readxl)
library(knitr)
library(gridExtra)
library(lubridate)

# Load data
atm <- read_excel("~/Documents/D624/Proj1/ATM624Data.xlsx", 
    col_types = c("date", "text", "numeric"))

# Remove empty rows starting 2010-05-01
atm <- atm |>
  filter(DATE < ymd('2010-05-01'))

# Display subset of ATM4
table01 <- atm |>
  filter(DATE > ymd('2009-07-17') & DATE < ymd('2009-07-25') & ATM == 'ATM4')
kable(table01, align="c")
DATE ATM Cash
2009-07-18 ATM4 568.81590
2009-07-19 ATM4 704.50704
2009-07-20 ATM4 571.64528
2009-07-21 ATM4 479.87550
2009-07-22 ATM4 418.86406
2009-07-23 ATM4 27.56913
2009-07-24 ATM4 834.83436


Visualize the data

The data for ATMs 1 and 2 look reasonable.

ATM3 looks like it only came online three days before our measurement period ended.

ATM4 appears to have one anomalous data point; it looks as if 10,000 was added to the original value. We will delete it and treat it like a missing data point.

# Subset data by ATM
atm1_data <- atm |>
  filter(ATM == 'ATM1')
atm2_data <- atm |>
  filter(ATM == 'ATM2')
atm3_data <- atm |>
  filter(ATM == 'ATM3')
atm4_data <- atm |>
  filter(ATM == 'ATM4')

# Graph the four ATMs
plot1a <- ggplot(atm1_data, aes(x = DATE, y = Cash)) +
  geom_point() +                
  labs(title = "ATM1: Cash Withdrawals ($100s)", 
       x = "", 
       y = "") +
  theme_minimal()     
plot1b <- ggplot(atm2_data, aes(x = DATE, y = Cash)) +
  geom_point() +                
  labs(title = "ATM2: Cash Withdrawals ($100s)", 
       x = "", 
       y = "") +
  theme_minimal()  
plot1c <- ggplot(atm3_data, aes(x = DATE, y = Cash)) +
  geom_point() +                
  labs(title = "ATM3: Cash Withdrawals ($100s)", 
       x = "", 
       y = "") +
  theme_minimal()  
plot1d <- ggplot(atm4_data, aes(x = DATE, y = Cash)) +
  geom_point() +                
  labs(title = "ATM4: Cash Withdrawals ($100s)", 
       x = "", 
       y = "") +
  theme_minimal()  
grid.arrange(plot1a, plot1b, plot1c, plot1d, ncol=2)

# Zero out the anomalous ATM4 point; we treat it as missing and replace it later
atm[which(atm$DATE == ymd('2010-02-09') & atm$ATM == 'ATM4'),"Cash"] <- 0


Patterns

Here we've plotted the average cash withdrawn by day of the week, excluding ATM3 since it's almost all zeros. Note that withdrawals are significantly lower on Thursdays. Since Fridays are paydays this makes sense: people hold off on withdrawals the day before they get paid.

This informs how we will replace missing data. We’ll replace the missing data with the average withdrawn for that ATM for that same day of the week.

Note, this also means our data will have seasonality on a weekly basis at a minimum.

# Add a column for weekday
atm$WDay <- wday(atm$DATE,week_start = getOption("lubridate.week.start", 1), label=TRUE)

# Create average cash withdrawn by day of the week
table02 <- atm |>
  group_by(WDay) |>
  filter(!is.na(Cash) & ATM != 'ATM3') |>
  summarize(mean(Cash))
kable(table02, align="c")
WDay mean(Cash)
Mon 209.46382
Tue 200.74528
Wed 180.75221
Thu 76.05229
Fri 254.73080
Sat 222.18087
Sun 236.29308


Problematic ATM3

It looks like ATM3 went online starting 2010-04-28, with only three days of withdrawals in our data.

We can't combine all four ATMs' cash flows and forecast the total, because the three days of withdrawals from ATM3 would act like noise in the model and the forecast would underrepresent future withdrawals.

We also shouldn't backfill ATM3's earlier days with the mean of its three observed days and then combine all four ATMs' cash flows, because that could distort the estimated trend or seasonality.

Instead we'll separate ATM3 from the others for purposes of forecasting. Even though we only have three data points, we can do a simple mean forecast of 87.67 per day. We could do a naïve forecast, but the last day of withdrawals is lower than the mean and I'd expect usage to grow as people become familiar with the new ATM.

# Show the three data points for ATM3
table03 <- atm |>
  filter(ATM == 'ATM3' & Cash != 0)
kable(table03, align="c")
DATE ATM Cash WDay
2010-04-28 ATM3 96 Wed
2010-04-29 ATM3 82 Thu
2010-04-30 ATM3 85 Fri


Remaining missing data

In addition to the one zero value we created for ATM4, there are five missing values and two zero values remaining. None of them look to be related to holidays.

Looking at the dates of the missing values, I assume they are missing at random, so we can impute them without biasing the analysis. Possibly the ATMs were offline, or a professor deleted the values to make the data more interesting.

The two zero values may not be missing at random; for example, if the ATMs ran out of cash the day before, we would also want to estimate what the true withdrawals would have been on the preceding (and possibly following) day. For the purposes of this analysis, however, we will assume the two zero-value days for ATM2 are missing at random and estimate them without adjusting the cash values of the surrounding days.

We’ll also delete the zeros from ATM3 so they don’t factor into the mean forecast.

# Check for NAs and zeros not in ATM3
table04 <- atm |> 
  filter((is.na(Cash) | Cash == 0) & ATM != 'ATM3')
kable(table04, align="c")
DATE ATM Cash WDay
2009-06-13 ATM1 NA Sat
2009-06-16 ATM1 NA Tue
2009-06-18 ATM2 NA Thu
2009-06-22 ATM1 NA Mon
2009-06-24 ATM2 NA Wed
2009-10-25 ATM2 0 Sun
2010-03-30 ATM2 0 Tue
2010-02-09 ATM4 0 Tue


Replacing missing values

Here we are replacing the missing data with the average withdrawn for that ATM for that same day of the week.

If we later discover the data has a trend (and we're not populating missing values in the middle of our time series), or if it is seasonal on a quarterly basis (summer withdrawals differ from winter withdrawals - we already know the data is seasonal on a weekly basis), then we may revise this section by taking the average of the same weekday immediately before and after each missing value.
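
A minimal sketch of that possible revision, using the atm tibble and WDay column created above (Cash_alt is a hypothetical column name used only for illustration, and this approach still leaves an NA if a neighbouring same-weekday value is itself missing):

# Sketch: impute each NA with the average of the same weekday immediately before and after
atm_alt <- atm |>
  group_by(ATM, WDay) |>
  arrange(DATE, .by_group = TRUE) |>
  mutate(Cash_alt = if_else(is.na(Cash), (lag(Cash) + lead(Cash)) / 2, Cash)) |>
  ungroup()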

Here are the values we’re populating based on the table of mean weekday withdrawals for ATMs 1, 2 and 4 below:

# Show the values we're using to replace missing data
table05 <- data.frame(
  ATM = c("ATM1", "ATM1", "ATM1", "ATM2", "ATM2", "ATM2", "ATM2", "ATM4"),
  Date = as.Date(c("2009-06-13", "2009-06-16", "2009-06-22", "2009-06-16", 
                   "2009-06-24", "2009-10-25", "2010-03-30", "2010-02-09")),
  Day = c("Sat", "Tues", "Mon", "Thur", "Wed", "Sun", "Tues", "Tues"),
  Amount = c(96.61, 89.57, 86.00, 25.53, 43.75, 68.47, 74.69, 445.85)
)
kable(table05, align="c")
ATM Date Day Amount
ATM1 2009-06-13 Sat 96.61
ATM1 2009-06-16 Tue 89.57
ATM1 2009-06-22 Mon 86.00
ATM2 2009-06-18 Thu 25.53
ATM2 2009-06-24 Wed 43.75
ATM2 2009-10-25 Sun 68.47
ATM2 2010-03-30 Tue 74.69
ATM4 2010-02-09 Tue 445.85
# Calculate the mean by weekday for ATM 1 2 and 4
atm1 <- atm |>
  filter(ATM == 'ATM1' & !is.na(Cash))
atm_avg <- atm1 |>
  group_by(WDay) |>
  summarize(mean(Cash))
atm2 <- atm |>
  filter(ATM == 'ATM2' & !is.na(Cash) & Cash != 0)
atm2_avg <- atm2 |>
  group_by(WDay) |>
  summarize(mean(Cash))
atm4 <- atm |>
  filter(ATM == 'ATM4' & Cash != 0)
atm4_avg <- atm4 |>
  group_by(WDay) |>
  summarize(mean(Cash))

# Combine them for display
atm_avg$ATM2 <- atm2_avg$'mean(Cash)'
atm_avg$ATM4 <- atm4_avg$'mean(Cash)'
colnames(atm_avg)[colnames(atm_avg) == "mean(Cash)"] <- "ATM1"
kable(atm_avg, align="c")
WDay ATM1 ATM2 ATM4
Mon 86.00000 58.73077 481.2864
Tue 89.56863 74.68627 445.8533
Wed 82.15385 43.74510 413.7229
Thu 31.69231 25.52941 169.9635
Fri 98.64151 92.01887 573.5320
Sat 96.60784 75.98077 491.5391
Sun 102.65385 68.47059 539.0716
# Populate the missing values
atm[which(atm$DATE == ymd('2009-06-13') & atm$ATM == 'ATM1'),"Cash"] <- 96.61
atm[which(atm$DATE == ymd('2009-06-16') & atm$ATM == 'ATM1'),"Cash"] <- 89.57
atm[which(atm$DATE == ymd('2009-06-22') & atm$ATM == 'ATM1'),"Cash"] <- 86.00
atm[which(atm$DATE == ymd('2009-06-18') & atm$ATM == 'ATM2'),"Cash"] <- 25.53
atm[which(atm$DATE == ymd('2009-06-24') & atm$ATM == 'ATM2'),"Cash"] <- 43.75
atm[which(atm$DATE == ymd('2009-10-25') & atm$ATM == 'ATM2'),"Cash"] <- 68.47
atm[which(atm$DATE == ymd('2010-03-30') & atm$ATM == 'ATM2'),"Cash"] <- 74.69
atm[which(atm$DATE == ymd('2010-02-09') & atm$ATM == 'ATM4'),"Cash"] <- 445.85

# Remove zero values from ATM3
atm3_data <- tail(atm3_data,3)


Convert to time series

We've converted our data into time series, both in aggregate (excluding ATM3) and individually.

# ATMs 1 2 and 4 in aggregate
atmts <- atm |>
  filter(ATM != 'ATM3') |>
  group_by(DATE) |>
  summarize(sum(Cash))
colnames(atmts)[colnames(atmts) == "sum(Cash)"] <- "Cash"
atmts <- atmts |>
  mutate(DATE = ymd(DATE)) |>
  as_tsibble(index = DATE)
plot2a <- autoplot(atmts,Cash) +
  labs(x="", y="$100s", title="ATM1, ATM2 & ATM4")

# ATMs 1 2 4 separately
atm1_data <- atm |>
  filter(ATM == 'ATM1') |>
  mutate(DATE = ymd(DATE)) |>
  as_tsibble(index = DATE)
plot2b <- autoplot(atm1_data,Cash) +
  labs(x="", y="$100s", title="ATM1")  
atm2_data <- atm |>
  filter(ATM == 'ATM2') |>
  mutate(DATE = ymd(DATE)) |>
  as_tsibble(index = DATE)
plot2c <- autoplot(atm2_data,Cash) +
  labs(x="", y="$100s", title="ATM2")
atm3_data <- atm |>
  filter(ATM == 'ATM3' & Cash != 0) |>
  mutate(DATE = ymd(DATE)) |>
  as_tsibble(index = DATE)
atm4_data <- atm |>
  filter(ATM == 'ATM4') |>
  mutate(DATE = ymd(DATE)) |>
  as_tsibble(index = DATE)
plot2d <- autoplot(atm4_data,Cash) +
  labs(x="", y="$100s", title="ATM4")

# Display all
grid.arrange(plot2a, plot2b, plot2c, plot2d, ncol=2)




Process

Explain and demonstrate your process, techniques used and not used

Initially we aggregated ATMs 1, 2 and 4 and ran a variety of ETS models using both a Box-Cox transformation with Guerrero-optimized lambda and a log transformation. Ultimately we settled on feeding log-transformed data into an ETS model with multiplicative error and seasonality and no trend, which optimized AICc. Unusually, the trend in the decomposition looked like a random walk that never drifted far from the mean. Averaged over longer and longer periods it would resolve into a flat line, which encouraged us to use no trend in the ETS model; doing so actually resulted in a lower AICc.

However, there was a major concern. The error component of the decomposition showed a delineated and persistent change starting in February 2010, and ETS models aren't suited to data with abruptly changing seasonality. One option was to refit the ETS model using only data from February 2010 onward to capture the new behavior; however, that would discard the larger portion of our historical data.

We opted to try an ARIMA model, since the moving average (MA) component of ARIMA can capture the new patterns in the error. As before, a log transformation outperformed the Box-Cox transformation with Guerrero-optimized lambda, and the ARIMA approach resulted in a lower AICc than ETS.

Then we questioned whether combining the ATM 1, 2 and 4 data was muddying the patterns and producing a worse forecast. We conjectured that forecasting them separately might yield better mean predictions, at the cost of not having an aggregate prediction interval. To decide whether forecasting them separately was worth it, we fit each of ATMs 1, 2 and 4 to an ARIMA model and found that all three were best fit by different ARIMA models. We concluded the three series were not similar enough to aggregate and model together.
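
As an aside, the separate fits shown later could also be expressed in one call by keeping ATM as a key in a single tsibble; a minimal sketch, assuming the cleaned atm data from the Data section (atm_keyed is a hypothetical name):

# Sketch: fit one ARIMA model per ATM via a keyed tsibble
atm_keyed <- atm |>
  filter(ATM != 'ATM3') |>
  mutate(DATE = ymd(DATE)) |>
  as_tsibble(index = DATE, key = ATM)
atm_keyed |>
  model(ARIMA(log(Cash))) |>
  forecast(h = 31)

fable fits a model for each key, which is equivalent to the three individual fits used below.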

When adding the forecasts together we also include the simple mean forecast for ATM 3.

Below we memorialize enough of our process that someone could follow the same path.

Section Summary
* Lag correlation plots
* Data decomposition
* ETS model
* ARIMA model in aggregate
* ARIMA model separately
* ATM3 mean forecast


Lag correlation plots

Here we show the lag correlation plots of the aggregate data after a Box-Cox transformation.

We observe week-based seasonality in the spikes every seven days in the Autocorrelation Function (ACF) plot.

While optimizing lambda for the aggregate data, we also compute Guerrero-optimized lambdas for ATMs 1, 2 and 4 individually for later use.

# Optimize lambda for all
lambda <- atmts |>
  features(Cash, features = guerrero) |>
  pull(lambda_guerrero)
lambda1 <- atm1_data |>
  features(Cash, features = guerrero) |>
  pull(lambda_guerrero)
lambda2 <- atm2_data |>
  features(Cash, features = guerrero) |>
  pull(lambda_guerrero)
lambda4 <- atm4_data |>
  features(Cash, features = guerrero) |>
  pull(lambda_guerrero)

# Display lag correlation plots
atmts |>
  gg_tsdisplay(box_cox(Cash, lambda), plot_type='partial', lag_max = 50)


Data decomposition

Here we're showing the classical decomposition of our aggregate data. While ETS can model seasonality that changes gradually over time, it cannot adapt to the abrupt and persistent change in seasonality that starts in February 2010.

Also note that the trend component looks like a random walk that never drifts far from the mean, so there is no discernible trend; this would support a mean forecast if there were no week-to-week seasonality.

#classical decomposition of aggregate data
atmts |>
  model(
    classical_decomposition(Cash, type = "multiplicative")
  ) |>
  components() |>
  autoplot()


Below is the decomposition of ATM1 alone; note how starkly the error component takes on what appears to be a new and persistent seasonal pattern.

# classical decomposition of ATM1
atm1_data |>
  model(
    classical_decomposition(Cash, type = "multiplicative")
  ) |>
  components() |>
  autoplot()


ETS model

We noticed decreasing AICc with the following changes:
* Error from Additive to Multiplicative
* Seasonality from Additive to Multiplicative
* Trend from Additive to Damped to None
* Transformation from Box-Cox to log

We ultimately arrive at an AICc of 2024.83. Note that the forecast's uncertainty looks inconsistent with the historical data; it doesn't seem bound to the mean the way the historical trend is.

Lastly, we look at our model residuals and note that they are a little peaked but otherwise roughly normal, with a slight left skew. The lag correlations are reasonably within the 95% threshold, so the errors would ordinarily be considered white noise; however, we know this is deceiving because the errors in the decomposition show a marked and persistent new week-to-week seasonality starting in February.

Just for demonstration we also include (commented out) code for a Ljung-Box test. Even the Ljung-Box test is deceived: over the whole time period the p-value is greater than 0.05, meaning we can't distinguish the residuals from white noise. I'll leave it as a future exercise to run a Ljung-Box test on just the residuals after the state change in February; that test should indicate the residuals from that period are not white noise and there is more pattern to capture in our model.

# Fit ETS model
fit <- atmts |>
  model(ETS(log(Cash) ~ error("M") + trend("N") + season("M")))

# Truncate data for display only
atmts_trunc <- atmts |>
  filter_index("2010-01-01" ~ .)

# Plot ETS forecast
ets_plot <- fit |>
  forecast(h = 31) |>
  autoplot(atmts_trunc) +
  labs(x="", y="$100s", title="ETS forecast: ATM1 ATM2 ATM4")
ets_plot

# To obtain AICc
#report(fit)

# Show ETS model residuals
fit |> gg_tsresiduals()

# Ljung-box test
#augment(fit) |>
#  features(.innov, ljung_box, lag = 10)
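
A minimal sketch of that future exercise, restricting the Ljung-Box test to residuals from the apparent state change onward (the 2010-02-01 cutoff is an assumption for illustration):

# Sketch: Ljung-Box test on post-change residuals only
augment(fit) |>
  filter(DATE >= ymd('2010-02-01')) |>
  features(.innov, ljung_box, lag = 10)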


ARIMA model in aggregate

Here we fit an ARIMA model on our log transformed data and arrive at an AICc of 868.55 which is a significant improvement over the ETS model’s 2024.83.

The selected model is ARIMA(0,0,0)(0,1,1)[7], meaning one order of seasonal differencing and one seasonal moving-average term at the seven-day period, with no non-seasonal terms.

As with the ETS model, the ARIMA model performed better on log-transformed data than on Box-Cox-transformed data.

Looking at the forecast, we notice the ARIMA prediction interval is narrower than the ETS prediction interval. This is expected and not in itself confirmation of better modeling: ARIMA prediction intervals typically understate uncertainty because, while they capture the uncertainty in the fitted parameters, they don't capture the uncertainty around choosing one of many possible combinations of differencing and AR/MA orders.

Note that while the lag correlations are all within the 95% threshold, indicating no more pattern to extract for our model, the errors look skewed left. However, I'm now assuming this is because our data has a hard boundary at zero on one side and can't fluctuate normally around its mean, and that's OK.

# Fit ARIMA model
atmts_fit <- atmts |>
  model(ARIMA(log(Cash)))
        
# To obtain AICc and model type
#report(atmts_fit)

# Plot ARIMA forecast
arima_plot <- atmts_fit |>
  forecast(h=31) |>
  autoplot(atmts_trunc) +
  labs(x="", y="$100s", title="ARIMA forecast: ATM1 ATM2 ATM4")
arima_plot

# Show ARIMA model residuals
atmts_fit |> gg_tsresiduals()
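
One way to address the distributional part of that understated uncertainty (though not the model-selection part) would be prediction intervals bootstrapped from the residuals; a minimal sketch, not part of the model comparison above (times = 1000 is an arbitrary illustrative choice):

# Sketch: bootstrapped prediction intervals simulated from resampled residuals
atmts_fit |>
  forecast(h = 31, bootstrap = TRUE, times = 1000) |>
  autoplot(atmts_trunc) +
  labs(x="", y="$100s", title="ARIMA forecast with bootstrapped intervals: ATM1 ATM2 ATM4")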


ARIMA model separately

To check whether it’s appropriate to combine ATM 1, 2 and 4 for modeling we checked if ARIMA would select a different model for each individually and it does.

This is enough to convince us to model the three ATMs separately and add the final forecasts together along with ATM3’s mean forecast.

Optimal ARIMA models for:

ATM1: ARIMA(0,0,0)(0,1,1)[7]
ATM2: ARIMA(2,0,2)(1,1,1)[7]
ATM4: ARIMA(0,0,1)(2,0,0)[7] w/ mean

Note that "w/ mean" for ATM4 indicates the model uses a constant term, so the forecast will fluctuate around a mean instead of around zero. I'll leave it as a future exercise to check whether the residuals for ATM4's model have less skew than the others, since its values fluctuate further from the hard boundary of zero than ATM1's and ATM2's do.

# Fit ARIMA models for each ATM separately
atmts_fit_1 <- atm1_data |> 
  model(ARIMA(log(Cash)))
atmts_fit_2 <- atm2_data |>
  model(ARIMA(log(Cash)))
atmts_fit_4 <- atm4_data |>
  model(ARIMA(log(Cash)))

# Obtain model types
#report(atmts_fit_1)
#report(atmts_fit_2)
#report(atmts_fit_4)
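
A minimal sketch of the skew comparison mentioned above, assuming the moments package (not used elsewhere in this report) for its skewness() function:

# Sketch: compare innovation-residual skewness across the three ARIMA fits
library(moments)
bind_rows(
  as_tibble(augment(atmts_fit_1)) |> mutate(ATM = "ATM1"),
  as_tibble(augment(atmts_fit_2)) |> mutate(ATM = "ATM2"),
  as_tibble(augment(atmts_fit_4)) |> mutate(ATM = "ATM4")
) |>
  group_by(ATM) |>
  summarize(skew = skewness(.innov, na.rm = TRUE))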


ATM3 mean forecast

Here we fit ATM3 to a mean forecast model.

# Fit ATM3 to a mean model
atmts_fit_3 <- atm3_data |>
  model(MEAN(Cash))

# Review model
#report(atmts_fit_3)

# Plot mean forecast
mean_plot <- atmts_fit_3 |>
  forecast(h=31) |>
  autoplot(atm3_data) +
  labs(x="", y="$100s", title="Mean forecast: ATM3")
mean_plot




Forecast

Explain and demonstrate your actual forecast

Here we show our final four forecasts side-by-side and add them together for extraction.

Section Summary
* Display forecasts
* Extract combined forecast


Display forecasts

Here we display our forecasts before combining them for extraction. Notice how the mean term in the ARIMA model for ATM4 shows the forecast settling towards a mean value.

Recall we are forecasting each ATM separately because they have different characteristics, so modeling the combined cash flows wouldn't capture their individual behavior. ATM3 gets a mean forecast since there is not enough data to model, but we still want its new cash flows in our combined forecast. The other three ATMs each have their own ARIMA model fit to log-transformed data.

# Produce truncated data for display purposes
atm1_trunc <- atm1_data |>
  filter_index("2010-01-01" ~ .)
atm2_trunc <- atm2_data |>
  filter_index("2010-01-01" ~ .)
atm4_trunc <- atm4_data |>
  filter_index("2010-01-01" ~ .)

# Display forecasts
plot3a <- atmts_fit_1 |>
  forecast(h=31) |>
  autoplot(atm1_trunc) +
  labs(x="", y="$100s", title="ARIMA Forecast: ATM1")
plot3b <- atmts_fit_2 |>
  forecast(h=31) |>
  autoplot(atm2_trunc) +
  labs(x="", y="$100s", title="ARIMA Forecast: ATM2")
plot3d <- atmts_fit_4 |>
  forecast(h=31) |>
  autoplot(atm4_trunc) +
  labs(x="", y="$100s", title="ARIMA Forecast: ATM4")
grid.arrange(plot3a, plot3b, mean_plot, plot3d, ncol=2)


Extract combined forecast

Here we collate and sum the May 2010 forecasts for all four ATMs. Lastly, we save the file to "forecast_output_pkoflaherty_parta.csv".

Here are the first 10 rows of our combined forecast. The results look typical of the respective ATMs and follow the weekly seasonality, with a pronounced dip on 2010-05-04.

# Produce all four forecasts
fc_atm1 <- atmts_fit_1 |>
  forecast(h=31)
fc_atm2 <- atmts_fit_2 |>
  forecast(h=31)
fc_atm3 <- atmts_fit_3 |>
  forecast(h=31)
fc_atm4 <- atmts_fit_4 |>
  forecast(h=31)

# Capture mean forecast values
fc_atm1 <- data.frame(Time = fc_atm1$DATE, Forecast = fc_atm1$.mean)
fc_atm2 <- data.frame(Time = fc_atm2$DATE, Forecast = fc_atm2$.mean)
fc_atm3 <- data.frame(Time = fc_atm3$DATE, Forecast = fc_atm3$.mean)
fc_atm4 <- data.frame(Time = fc_atm4$DATE, Forecast = fc_atm4$.mean)

# Combine forecasts
fc_atm1$ATM2FC <- fc_atm2$Forecast
fc_atm1$ATM3FC <- fc_atm3$Forecast
fc_atm1$ATM4FC <- fc_atm4$Forecast
colnames(fc_atm1)[colnames(fc_atm1) == "Forecast"] <- "ATM1FC"
fc_atm1$Combined_Forecast <- fc_atm1$ATM1FC + fc_atm1$ATM2FC + fc_atm1$ATM3FC + fc_atm1$ATM4FC
fc_full <- fc_atm1

# Extract forecast
write.csv(fc_full, file = "forecast_output_pkoflaherty_parta.csv", row.names = FALSE)

# Display first ten rows of combined forecast
kable(head(fc_full,10), digits=2, align="c")
Time ATM1FC ATM2FC ATM3FC ATM4FC Combined_Forecast
2010-05-01 102.90 77.52 87.67 203.50 471.58
2010-05-02 117.59 94.53 87.67 516.64 816.42
2010-05-03 86.21 12.57 87.67 640.91 827.35
2010-05-04 4.81 2.38 87.67 139.15 234.01
2010-05-05 117.52 129.67 87.67 585.49 920.34
2010-05-06 93.00 114.70 87.67 326.92 622.28
2010-05-07 100.91 76.85 87.67 483.74 749.16
2010-05-08 104.65 81.64 87.67 163.12 437.08
2010-05-09 119.60 97.76 87.67 568.78 873.81
2010-05-10 87.68 12.90 87.67 561.34 749.59
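
To double-check which weekday each forecast value falls on, the forecast dates can be labelled with their weekday; a quick sketch using the fc_full data frame above:

# Sketch: label each forecast date with its weekday
fc_full |>
  mutate(WDay = wday(Time, label = TRUE, week_start = 1)) |>
  head(10)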





Part B - Forecasting Power

Model residential power usage for January 1998 to December 2013 and then produce a monthly forecast for 2014.

We’re going to inspect and clean the data, try a range of models based on intuition developed looking at diagnostics, and then demonstrate and share our final forecast.




Data

Ultimately the data was fairly clean. We only had to replace one anomalous and one missing value and we had to manipulate the data a little after importing it to transform it into a time series.


Import

Here we load the data from Excel. The variable KWH is power consumption in kilowatt-hours (kWh).

# Load data
rcfl <- read_excel("~/Documents/D624/Proj1/ResidentialCustomerForecastLoad-624.xlsx", 
    col_types = c("skip", "text", "numeric"))

# Convert date column into a useable date from characters
rcfl$DATE <- yearmonth(rcfl$`YYYY-MMM`)

# Select and reorder our future date index and target variable
rcfl <- rcfl |>
  select(DATE, KWH)


View data

An import warning alerts us to one missing value, and by inspection we see one anomalous value. Otherwise the data looks good, with a bend upwards in the trend starting in 2010.

# Display data
ggplot(rcfl, aes(x = DATE, y = KWH)) +
  geom_point() +                
  labs(title = "Residential power usage (kWh)", 
       x = "", 
       y = "") +
  theme_minimal()   


Anomalous value

It looks like a digit was dropped, so we could either interpolate or adjust the number; however, we don't know which digit was dropped. If it were the first digit the correction would be significant, if it were the last it would be insignificant, but we're not going to make that determination. We'll simply delete the value and handle it with the other missing value.

# View anomalous data
ex6 <- rcfl |>
  filter(DATE > yearmonth('2010 Jan') & DATE < yearmonth('2010 Dec'))

kable(ex6, align="c")
DATE KWH
2010 Feb 8390677
2010 Mar 7347915
2010 Apr 5776131
2010 May 4919289
2010 Jun 6696292
2010 Jul 770523
2010 Aug 7922701
2010 Sep 7819472
2010 Oct 5875917
2010 Nov 4800733
# Replace data with zero for now
rcfl[which(rcfl$DATE == yearmonth('2010 Jul')),"KWH"] <- 0


Missing values

There is one missing value in addition to the anomalous data point. We’ll update them by interpolating between the immediately adjacent values.

Interpolating between the data points immediately before and after yields 6,569,470 for 2008 Sep and 7,309,496.5 for 2010 Jul.

# Find location of NAs and zeros
#rcfl |> 
#  filter(is.na(KWH) | KWH == 0)

# Display NAs and zeros and their adjacent values
ex1 <- rcfl |> 
  filter(DATE >= yearmonth('2008 Aug') & DATE <= yearmonth('2008 Oct') |
         DATE >= yearmonth('2010 Jun') & DATE <= yearmonth('2010 Aug'))
kable(ex1, align="c")
DATE KWH
2008 Aug 8037137
2008 Sep NA
2008 Oct 5101803
2010 Jun 6696292
2010 Jul 0
2010 Aug 7922701
# Replace missing
rcfl[which(rcfl$DATE == yearmonth('2008 Sep')),"KWH"] <- (8037137+5101803)/2
rcfl[which(rcfl$DATE == yearmonth('2010 Jul')),"KWH"] <- (6696292+7922701)/2


Convert to time series

We've converted and displayed our data as a time series.

# Convert to timeseries
rcfl <- rcfl |>
  as_tsibble(index = DATE)

# Display
autoplot(rcfl,KWH) +                
  labs(title = "Residential power usage (kWh)", 
       x = "", 
       y = "") +
  theme_minimal()   




Modeling

We noticed annual and quarterly seasonality in the lag correlation plots and, from the decomposition, an upwards trend starting in 2009.

Experimentation showed that a Box-Cox transformation significantly outperforms the log transformation.

When we arrived at our best and final model we noticed two large spikes (out of 24 lags) outside the 95% threshold, so we tried an additional ARIMA search with the step-wise shortcut disabled and found a better model, with only slightly improved AICc but significantly improved diagnostics.


Lag correlation plots

Paying attention to the ACF plot, we see quarterly seasonality in the spikes every three months. Presumably winter and summer have high energy consumption due to heating and cooling, while fall and spring have low energy consumption due to milder temperatures.

# Optimize lambda
lambda <- rcfl |>
  features(KWH, features = guerrero) |>
  pull(lambda_guerrero)

# Display lag correlation plots
rcfl |>
  gg_tsdisplay(box_cox(KWH, lambda), plot_type='partial')


Data decomposition

Here the error component is interesting; as we start to build models we can assess the errors for additional patterns to draw out into the model. Notably, we have confirmation of the rising trend starting in 2009, which may suggest more widespread adoption and use of air conditioning around that time.

#classical decomposition
rcfl |>
  model(
    classical_decomposition(KWH, type = "multiplicative")
  ) |>
  components() |>
  autoplot()


ETS model

Finding the optimal ETS

We went through a series of ETS options to find the best one.

Interestingly, the Box-Cox transformation significantly outperforms the log transformation. The Guerrero-optimized lambda is -0.2066, which lies between a log transformation (lambda = 0) and an inverse transformation (lambda = -1).

Surprisingly, the best AICc values are negative. This isn't alarming: one of the AICc terms can be highly negative when the log-likelihood is high. We still need to compare against other models to understand relative fit and to select the best model.
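
For reference, the corrected criterion is AICc = AIC + 2k(k+1)/(T - k - 1) with AIC = -2 log(L) + 2k, where k is the number of parameters and T the number of observations; a large positive log-likelihood, as can occur on the compressed Box-Cox scale, makes the whole expression negative.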

* log transformation, ETS(A,A,A): AICc 126.8757
* Box-Cox, ETS(A,A,A): AICc -1122.740
* Box-Cox, ETS(M,A,A): AICc -1121.368
* Box-Cox, ETS(M,A,M): AICc -1119.897
* Box-Cox, ETS(M,Ad,M): AICc -1119.770
* log transformation, ETS(M,Ad,M): AICc 126.9328

The ETS model we selected has multiplicative error and seasonality and a damped trend, with an AICc of -1,119.770.

Note, the plot has truncated data for display only to enlarge the forecast detail.

# Fit ETS model
fit <- rcfl |>
  model(ETS(box_cox(KWH, lambda) ~ error("M") + trend("Ad") + season("M")))

# To obtain AICc
#report(fit)

# Truncate data for display only
rcfl_trunc <- rcfl |>
  filter_index("2006 Jan" ~ .)

# Plot ETS forecast
ets_plot <- fit |>
  forecast(h = 12) |>
  autoplot(rcfl_trunc) +
  labs(x="", y="kWh", title="ETS forecast: Residential power usage")
ets_plot


ETS residuals

Note, the ACF plot shows two, possibly three, spikes outside the 95% threshold, which suggests there is still pattern in the errors not yet captured by our model.

# Show ETS model residuals
fit |> gg_tsresiduals()


Ljung-box test

To confirm our suspicions we ran a Ljung-Box test on the model's residuals. The p-value of 0.0015 argues clearly that the model's residuals do not resemble white noise.

# Ljung-box test
ex5 <- augment(fit) |>
  features(.innov, ljung_box, lag = 10)

kable(ex5, align="c")
.model lb_stat lb_pvalue
ETS(box_cox(KWH, lambda) ~ error(“M”) + trend(“Ad”) + season(“M”)) 28.43258 0.0015385


ARIMA model

Initial model

Our initial ARIMA model is ARIMA(1,0,0)(2,1,0)[12] w/ drift with an AICc of -1,500.26, which significantly outperforms the ETS model.

As with ETS, the model performed better with the Box-Cox transformation than with the log transformation.

Looking at the innovation residuals, note that the first 12 values on the left are missing because the seasonal differencing at lag 12 drops them.

I'm still concerned, however, because the innovation residuals have multiple points in a row above the zero mean, indicating trend left out of the model. Additionally, two strong spikes out of 24 lags fall outside the 95% threshold, and this time the pattern appears to be quarterly. The residuals also do not look Gaussian.

# Fit ARIMA model
rcfl_fit <- rcfl |>
  model(ARIMA(box_cox(KWH,lambda)))
        
# To obtain AICc and model type
#report(rcfl_fit)

# Show ARIMA model residuals
rcfl_fit |> gg_tsresiduals()


Trying additional models

Our first thought was that ARIMA's step-wise algorithm had found a local minimum and that there was a better model to find, so we ran a search with stepwise=FALSE.

Additionally, from the first ARIMA model's ACF plot it looked like there was quarterly seasonality in addition to the annual seasonality, so I tried some additional models with step-wise increments of differencing; however, all of them came with warnings about inducing a quadratic trend. I believe forecasting with multiple seasonalities is outside the scope of the textbook.
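
For completeness, here is a minimal sketch (not used in this analysis) of dynamic harmonic regression, the fpp3 approach that represents a seasonal cycle with Fourier terms and models the remainder with ARMA errors; it generalizes to multiple seasonal periods by adding a fourier() term per period. The example uses the annual period only, and K = 4 harmonics is an arbitrary illustrative choice:

# Sketch: dynamic harmonic regression with Box-Cox transformation and ARMA errors
rcfl |>
  model(dhr = ARIMA(box_cox(KWH, lambda) ~ fourier(K = 4) + PDQ(0, 0, 0)))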

Our expanded ARIMA search found a better performing model with a change in p and q.

Original:
ARIMA(1,0,0)(2,1,0)[12] w/ drift

Final:
ARIMA(0,0,3)(2,1,0)[12] w/ drift

# Try several models
rcfl_fit_basket <- rcfl |>
  model(arima1003210 = ARIMA(box_cox(KWH,lambda) ~ 1 + pdq(0,0,3) + PDQ(2,1,0)),
        arima1013210 = ARIMA(box_cox(KWH,lambda) ~ 1 + pdq(0,1,3) + PDQ(2,1,0)),
        arima1003220 = ARIMA(box_cox(KWH,lambda) ~ 1 + pdq(0,0,3) + PDQ(2,2,0)),
        arima1013220 = ARIMA(box_cox(KWH,lambda) ~ 1 + pdq(0,1,3) + PDQ(2,2,0)),
        stepwise = ARIMA(box_cox(KWH,lambda)),
        search = ARIMA(box_cox(KWH,lambda), stepwise=FALSE))

# Display models ranked by AICc
ex7 <- glance(rcfl_fit_basket) |> arrange(AICc) |> select(.model:BIC)

kable(ex7, align="c")
.model sigma2 log_lik AIC AICc BIC
arima1003210 1.40e-05 759.5857 -1505.171 -1504.520 -1482.821
search 1.40e-05 759.5857 -1505.171 -1504.520 -1482.821
stepwise 1.45e-05 755.3032 -1500.606 -1500.262 -1484.642
arima1013210 1.47e-05 750.0060 -1486.012 -1485.357 -1463.700
arima1003220 2.44e-05 659.0007 -1304.001 -1303.301 -1282.134
arima1013220 2.58e-05 635.4842 -1256.968 -1256.264 -1235.143

While the AIC only decreased slightly, from -1,500.606 to -1,505.171, the residual diagnostics are significantly better, with only one spike out of 24 outside the 95% threshold, so we'll take it.

# Fit ARIMA model
rcfl_fit <- rcfl |>
  model(ARIMA(box_cox(KWH,lambda), stepwise=FALSE))

# Show ARIMA model residuals
rcfl_fit |> gg_tsresiduals()




Forecast

Forecast graph

Here is our final model forecast for 2014.

Recall we’ve gone with an ARIMA model with seasonality (SARIMA) and an extended search of models for the best fit.

# Plot ARIMA forecast
arima_plot <- rcfl_fit |>
  forecast(h=12) |>
  autoplot(rcfl_trunc) +
  labs(x="", y="kWh", title="ARIMA forecast: Residential power usage")
arima_plot

Forecast numbers

Here is our forecast in numbers:

# Produce forecast
rcfl_fc <- rcfl_fit |>
  forecast(h=12)

# Capture mean forecast values
fc_rcfl <- data.frame(DATE = rcfl_fc$DATE, Forecast_kWh = rcfl_fc$.mean)

# Display the 12 rows of forecast
kable(head(fc_rcfl,12), align="c")
DATE Forecast_kWh
2014 Jan 10393275
2014 Feb 8860839
2014 Mar 7064897
2014 Apr 5978216
2014 May 5938635
2014 Jun 8314016
2014 Jul 9623971
2014 Aug 10178785
2014 Sep 8541825
2014 Oct 5857955
2014 Nov 6142157
2014 Dec 8277646
# Extract forecast
write.csv(fc_rcfl, file = "forecast_output_pkoflaherty_partb.csv", row.names = FALSE)





Part C - Bonus

If the waterflow data is stationary, provide a week forward forecast


Data

We import both datasets, convert the one with more frequent measurements into the same hourly time stamps, and check the data.

The water flow measurement could be in gallons, cubic feet or liters per minute or second depending on the context.


Import and View

Here we import and view data for both water pipes and visually determine there are no obviously anomalous values.

Additionally we checked for missing values and there were none.

# Read data
WFP1 <- read_excel("~/Documents/D624/Proj1/Waterflow_Pipe1.xlsx", 
    col_types = c("date", "numeric"))
WFP2 <- read_excel("~/Documents/D624/Proj1/Waterflow_Pipe2.xlsx", 
    col_types = c("date", "numeric"))

# Improve column names
WFP1 <- WFP1 |> rename(DATE = 'Date Time')
WFP2 <- WFP2 |> rename(DATE = 'Date Time')

# View Data
view1 <- ggplot(WFP1, aes(x = DATE, y = WaterFlow)) +
  geom_point() +
  labs(title = "Pipe 1: Waterflow", x = "", y = "Water Flow") +
  theme_minimal()
view2 <- ggplot(WFP2, aes(x = DATE, y = WaterFlow)) +
  geom_point() +
  labs(title = "Pipe 2: Waterflow", x = "", y = "Water Flow") +
  theme_minimal()
grid.arrange(view1, view2, ncol=1)

# Checked for missing values and there were none
#sum(is.na(WFP1$DATE))
#sum(is.na(WFP2$DATE))
#sum(is.na(WFP1$WaterFlow))
#sum(is.na(WFP2$WaterFlow))


Aggregate data

Here we convert the first pipe's measurements into hourly data matching the second pipe's time stamps. We do this by rounding the time stamps to the nearest hour and then taking the mean of any measurements within a given hour. Then we trim the data so the two series have matching time stamps before we sum them together.

While I rounded to the nearest hour, it's possible the problem was designed expecting the time stamps to be rounded up (ceiling) to the next hour, so these results might differ slightly from what was expected.

# Round Pipe 1's time stamps to the nearest hour
WFP1$DATE <- round_date(WFP1$DATE, unit = "hours")

# Take the mean of any at a given hour
WFP1 <- WFP1 |>
  group_by(DATE) |>
  summarize(WaterFlow = mean(WaterFlow))

# Trim the data to fit each other
WFP1 <- WFP1 |> 
  slice(-1)
WFP2 <- WFP2 |>
  slice_head(n=240)

# Aggregate the two water flows
WFP <- WFP1
WFP$WaterFlow <- WFP1$WaterFlow + WFP2$WaterFlow
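
The positional sum above assumes the two trimmed series line up exactly hour for hour; a minimal sketch of a join-based alternative that doesn't rely on positional alignment (WFP_alt is a hypothetical name):

# Sketch: aggregate the two pipes by joining on the hourly time stamp
WFP_alt <- full_join(WFP1, WFP2, by = "DATE", suffix = c("_p1", "_p2")) |>
  mutate(WaterFlow = coalesce(WaterFlow_p1, 0) + coalesce(WaterFlow_p2, 0)) |>
  select(DATE, WaterFlow)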


Convert to time series

Here we convert to time series. Interestingly, using ymd_hms() resulted in NA values for times at 00:00:00 but we found as_datetime() worked as expected.

# Convert to time series
WFP <- WFP |>
  mutate(DATE = as_datetime(DATE)) |> #ymd_hms
  as_tsibble(index = DATE)

# Display as time series
autoplot(WFP, WaterFlow) +
  labs(x="", y="", title="Aggregated water flow ")




Model

We ran typical diagnostics such as lag correlation plots, data decomposition and KPSS to test for stationarity.

Then we found the best ETS and ARIMA models. The ETS forecast looked more compelling because it carried the seasonality forward, while the ARIMA forecast regressed to a mean.

Ultimately we tried the same ARIMA model without a mean, plus a step-wise search restricted to models without a mean, and settled on a final ARIMA model that preserves the seasonality observed in our data (by not using a mean). It has a significantly lower AICc than the best ETS model, with only a slight performance reduction compared to the ARIMA model with a mean, which didn't preserve seasonality in the forecast.


Lag correlation plots

We can see lag correlation in the plots, though not at the 24-hour mark. It could be that there are both 24-hour and 7-day correlations, but we don't have enough data to capture the 7-day correlations and we're not able to separate the two seasonalities.

We ran lag correlation plots with lag_max = 100 so that we could see patterns across a higher range of lags.

# Optimize lambda
lambda <- WFP |>
  features(WaterFlow, features = guerrero) |>
  pull(lambda_guerrero)

# Display lag correlation plots without differencing
WFP |>
  gg_tsdisplay(WaterFlow, plot_type='partial', lag_max = 100)

Here we show the lag plots with differencing. The cluster of spikes in the PACF shows there is a pattern, but describing it from the lag plots is difficult. Since we're looking at differences, we're seeing patterns in the rate of change, and the PACF shows the impact of earlier lags with the intervening lags' impact removed; they're all negative, so the best description is that the rate of change of the water flow alternates between increasing and decreasing almost every hour.

# Display lag correlation plots with differencing
WFP |>
  gg_tsdisplay(difference(WaterFlow), plot_type='partial', lag_max = 100)

Data decomposition

Here the seasonality is a complex daily pattern but what we’re seeing in the trend line is likely a partial weekly seasonality.

#classical decomposition
WFP |>
  model(
    classical_decomposition(WaterFlow, type = "multiplicative")
  ) |>
  components() |>
  autoplot()

KPSS test

We use the Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test to determine whether the data is stationary. The p-value is reported at its cap of 0.1 (reported values are capped between 0.01 and 0.1), so we fail to reject the null hypothesis of stationarity; the series can be treated as stationary.

# Confirm if stationary
kpss <- WFP |>
  features(WaterFlow, unitroot_kpss)
kable(kpss, align="c")
kpss_stat kpss_pvalue
0.1888434 0.1

ETS model

Here we tried many variations and found the ETS model with the lowest AICc, 2,712.987, by using no transformation on the data and an ETS model with additive error, trend and seasonality.

The forecast preserves the complex daily seasonality we observed in the data decomposition.

Box-Cox transformation:
* ETS(M,Ad,M): AICc 3806.511
* ETS(M,A,M): AICc 3811.008
* ETS(A,A,M): AICc 3805.857
* ETS(A,Ad,M): AICc 3801.886
* ETS(A,Ad,A): AICc 3799.757

No transformation:
* ETS(A,A,A): AICc 2712.987
* ETS(A,Ad,A): AICc 2713.710
* ETS(M,Ad,A): AICc 2720.969
* ETS(M,A,A): AICc 2720.902
* ETS(M,A,M): AICc 2721.128

# Fit ETS model
fit <- WFP |>
  model(ETS(WaterFlow ~ error("A") + trend("A") + season("A")))

# To obtain AICc
#report(fit)

# Plot ETS forecast
ets_plot <- fit |>
  forecast(h = 168) |>
  autoplot(WFP) +
  labs(x="", y="", title="ETS forecast: Water flow")
ets_plot

And in the residuals we see roughly four spikes past the 95% threshold over 100 lags, which is consistent with white noise.

# Show ETS model residuals
fit |> gg_tsresiduals(lag_max = 100)

And when we apply a Ljung-Box test, the very high p-value indicates our residuals are consistent with white noise.

# Ljung-box test
ex5 <- augment(fit) |>
  features(.innov, ljung_box, lag = 10)

kable(ex5, align="c")
.model lb_stat lb_pvalue
ETS(WaterFlow ~ error(“A”) + trend(“A”) + season(“A”)) 4.776997 0.9055683

ARIMA Model

We tried ARIMA models with both the step-wise and the extended search, on both the Box-Cox-transformed and untransformed data, and concluded that the step-wise search was sufficient or better and that we should model the untransformed data.

We arrive at a lower AICc than we did with the ETS model; however, the forecast converges on a mean, which makes me wonder whether we should have forecast the two pipes separately, or whether we should try manually searching ARIMA models without a mean.

Model: ARIMA(1,0,0)(0,0,1)[24] w/ mean
AICc=3123.03

Model: ARIMA(0,0,1)(1,0,1)[24] w/ mean
AICc=2038.81

# Fit ARIMA model
WFP_fit <- WFP |>
  model(ARIMA(WaterFlow))
        
# To obtain AICc and model type
#report(WFP_fit)

# Plot ARIMA forecast
arima_plot <- WFP_fit |>
  forecast(h=168) |>
  autoplot(WFP) +
  labs(x="", y="", title="ARIMA forecast: Water flow")
arima_plot

Here are the residual diagnostics which show similar results as the ETS model.

# Show ARIMA model residuals
WFP_fit |> gg_tsresiduals(lag_max = 100)

ARIMA model without mean

We tried a version of the ARIMA model without mean and it performed worse with an AICc of 2,153 compared to 2,038. Additionally we tried a step-wise search only looking at models without mean and it selected an even worse model with an AICc of 2,237.

So while the ARIMA model with a mean is the technically better fit, we're ultimately selecting the ARIMA model without a mean to preserve the daily seasonality in our forecast, and it still has a lower AICc than the best ETS model.

# Try several models
wfp_fit_basket <- WFP |>
  model(arima1001101 = ARIMA(WaterFlow ~ 1 + pdq(0,0,1) + PDQ(1,0,1)),
        arima0001101 = ARIMA(WaterFlow ~ 0 + pdq(0,0,1) + PDQ(1,0,1)),
        stepwise = ARIMA(WaterFlow ~ 0))

# Display models ranked by AICc
ex12 <- glance(wfp_fit_basket) |> arrange(AICc) |> select(.model:BIC)

kable(ex12, align="c")
.model sigma2 log_lik AIC AICc BIC
arima1001101 278.5293 -1014.279 2038.558 2038.814 2055.961
arima0001101 323.6212 -1072.638 2153.276 2153.446 2167.198
stepwise 602.4915 -1111.509 2237.017 2237.500 2261.382



Forecast

Here we display our final forecast as a graph, show a table of partial results, and write the full forecast to a csv file.

Additionally, for an interested reader, we provide the code to fit our data to a neural network model.


Forecast graph

Here is the most accurate forecast we’ve found so far that preserves the daily seasonality observed.

Model: ARIMA(0,0,1)(1,0,1)[24]
AICc = 2,153.45

wfp_fit_woMean <- WFP |>
  model(arima0001101 = ARIMA(WaterFlow ~ 0 + pdq(0,0,1) + PDQ(1,0,1)))

# Plot ARIMA forecast
arima_plot <- wfp_fit_woMean |>
  forecast(h=168) |>
  autoplot(WFP) +
  labs(x="", y="", title="ARIMA (without mean) forecast: Water flow")
arima_plot

# To obtain AICc and model type
#report(wfp_fit_woMean)


Forecast numbers

Here is our forecast in numbers for the first 24 hours. We save the full 168 hour forecast to csv.

# Produce forecast
wfp_fc <- wfp_fit_woMean |>
  forecast(h=168)

# Capture mean forecast values
fc_wfp <- data.frame(DATE = wfp_fc$DATE, Forecast_WaterFlow = wfp_fc$.mean)

# Display the first 24 rows of forecast
kable(head(fc_wfp,24), align="c")
DATE Forecast_WaterFlow
2015-11-02 01:00:00 63.59479
2015-11-02 02:00:00 63.48805
2015-11-02 03:00:00 68.07687
2015-11-02 04:00:00 51.62703
2015-11-02 05:00:00 62.38238
2015-11-02 06:00:00 65.03996
2015-11-02 07:00:00 50.15033
2015-11-02 08:00:00 56.54127
2015-11-02 09:00:00 57.07658
2015-11-02 10:00:00 47.74977
2015-11-02 11:00:00 56.77604
2015-11-02 12:00:00 61.80018
2015-11-02 13:00:00 53.23077
2015-11-02 14:00:00 57.82800
2015-11-02 15:00:00 66.74459
2015-11-02 16:00:00 63.84864
2015-11-02 17:00:00 50.15427
2015-11-02 18:00:00 67.68386
2015-11-02 19:00:00 53.11179
2015-11-02 20:00:00 56.25635
2015-11-02 21:00:00 56.29655
2015-11-02 22:00:00 53.79341
2015-11-02 23:00:00 66.63699
2015-11-03 00:00:00 58.66060
# Extract forecast
write.csv(fc_wfp, file = "forecast_output_pkoflaherty_partc.csv", row.names = FALSE)


Neural network model

We don't have the hardware to fit our data to a neural network model at the same scale: a 7-hour forecast took two minutes, so a 168-hour forecast is outside what we can demonstrate here.

Here is the code for a potential future exercise:

#fit <- WFP |>
#  model(NNETAR(box_cox(WaterFlow, lambda)))
#fit |>
#  forecast(h = 168) |>
#  autoplot(WFP) +
#  labs(x = "", y = "", title = "NNETAR forecast: Water flow")






References

This work was informed by ‘Forecasting: Principles and Practice’ by Rob J Hyndman and George Athanasopoulos, 3rd Ed
https://otexts.com/fpp3/

This work was produced 100% independently by the author, PK O’Flaherty.