This project consists of analysis and forecasting of two provided datasets, one dealing with ATM withdrawals and the other power consumption. We will begin with the ATM dataset.
The ATM dataset consists of daily cash withdrawal totals from 4 different ATM machines. We have been tasked to “forecast how much cash is taken out of the 4 ATM machines for May 2010”. This task can be interpreted multiple ways, so we will begin by creating a more concrete definition of the task. Our goal will be to forecast the daily combined total amount of cash withdrawn from the 4 ATM machines in May 2010.
We begin by looking at a sample of the ATM data which provides us with several useful observations. The format of the DATE column appears unusual and will need to be converted for us to better understand the range of data we have. Additionally the first observation for ATM 3 is 0, indicating we may have to address 0 value days as well as missing values. Lastly, looking at the tail it seems likely there are a certain amount of placeholder observations included in the data.
raw_atm <- readxl::read_xlsx("ATM624Data.xlsx") |>
as_tibble() |>
arrange(DATE)
head(raw_atm)
## # A tibble: 6 × 3
## DATE ATM Cash
## <dbl> <chr> <dbl>
## 1 39934 ATM1 96
## 2 39934 ATM2 107
## 3 39934 ATM3 0
## 4 39934 ATM4 777.
## 5 39935 ATM1 82
## 6 39935 ATM2 89
tail(raw_atm)
## # A tibble: 6 × 3
## DATE ATM Cash
## <dbl> <chr> <dbl>
## 1 40307 <NA> NA
## 2 40308 <NA> NA
## 3 40309 <NA> NA
## 4 40310 <NA> NA
## 5 40311 <NA> NA
## 6 40312 <NA> NA
We will begin by addressing the dates. When loaded into a dataframe we can see the dates are clearly a numeric value; however, when viewed in Excel they are nicely formatted as dates. It turns out Excel has an usual method of storing dates, tracking the number of days since January 1, 1900 according to Microsoft documentation. As an interesting anecdote this is actually incorrect, as Excel wrongly assumes 1900 was a leap year, thus the value is actually the number of days since December 30, 1899. Thankfully the janitor package takes care of this lapse for us. Upon converting the dates we can see that our data ranges from May 1st, 2009 to May 14th, 2010; however, the rows in May 2010 appear to be empty fillers which will need to be addressed. Using the complete function we can populate any dates that may be missing.
corrected_date <- raw_atm |>
mutate(date = excel_numeric_to_date(DATE)) |>
select(!DATE) |>
complete(date = seq.Date(min(date), max(date), by="day")) |>
arrange(date) |>
rename_with(~ tolower(.x))
head(corrected_date)
## # A tibble: 6 × 3
## date atm cash
## <date> <chr> <dbl>
## 1 2009-05-01 ATM1 96
## 2 2009-05-01 ATM2 107
## 3 2009-05-01 ATM3 0
## 4 2009-05-01 ATM4 777.
## 5 2009-05-02 ATM1 82
## 6 2009-05-02 ATM2 89
tail(corrected_date)
## # A tibble: 6 × 3
## date atm cash
## <date> <chr> <dbl>
## 1 2010-05-09 <NA> NA
## 2 2010-05-10 <NA> NA
## 3 2010-05-11 <NA> NA
## 4 2010-05-12 <NA> NA
## 5 2010-05-13 <NA> NA
## 6 2010-05-14 <NA> NA
Checking our assumption that the May 2010 dates are placeholders, we can check for missing values on dates at the end of the dataset. If there are more missing values at the end of our data we will need to reevaluate how we will approach the problem. This turns out not to be necessary, as the only days missing data at the end of our sample are in May and these rows can be safely dropped.
corrected_date |>
filter(date >= "2010-04-01" & (is.na(atm) | is.na(cash)))
## # A tibble: 14 × 3
## date atm cash
## <date> <chr> <dbl>
## 1 2010-05-01 <NA> NA
## 2 2010-05-02 <NA> NA
## 3 2010-05-03 <NA> NA
## 4 2010-05-04 <NA> NA
## 5 2010-05-05 <NA> NA
## 6 2010-05-06 <NA> NA
## 7 2010-05-07 <NA> NA
## 8 2010-05-08 <NA> NA
## 9 2010-05-09 <NA> NA
## 10 2010-05-10 <NA> NA
## 11 2010-05-11 <NA> NA
## 12 2010-05-12 <NA> NA
## 13 2010-05-13 <NA> NA
## 14 2010-05-14 <NA> NA
remove_empties <- corrected_date |>
filter(date < "2010-05-01")
Checking our data for empty values we can see that they only occur in the cash column, which will be relatively easy to address.
colSums(is.na(remove_empties))
## date atm cash
## 0 0 5
As we noted earlier; however, there are some entries with 0 values in the cash column. Counting the rows with 0s, it is clear there is a significant number of observations with 0 cash withdrawn which merits further exploration.
colSums(is.na(remove_empties) | remove_empties == 0)
## date atm cash
## 0 0 369
We will begin by briefly looking at the missing observations. We can see that only two ATMs have observations with NA in the cash column and all of those observations are in June 2009. There is nothing particularly notable about these absences, so we will impute these values later.
completed_df <- remove_empties |>
complete(date, atm)
completed_df |>
filter(is.na(cash))
## # A tibble: 5 × 3
## date atm cash
## <date> <chr> <dbl>
## 1 2009-06-13 ATM1 NA
## 2 2009-06-16 ATM1 NA
## 3 2009-06-18 ATM2 NA
## 4 2009-06-22 ATM1 NA
## 5 2009-06-24 ATM2 NA
Moving on to the observations with 0 cash, we can clearly see ATM 3 has no cash withdrawn for a huge proportion of the observed period. The two days of 0 withdrawals for ATM 2 may or may not be reasonable and will be checked later, but we need to take a closer look at why ATM 3 is missing so much data.
completed_df |>
filter(cash == 0) |>
group_by(atm) |>
count()
## # A tibble: 2 × 2
## # Groups: atm [2]
## atm n
## <chr> <int>
## 1 ATM2 2
## 2 ATM3 362
completed_df |>
filter(atm == "ATM3", cash == 0) |>
count()
## # A tibble: 1 × 1
## n
## <int>
## 1 362
completed_df |>
filter(atm == "ATM3", cash != 0) |>
count()
## # A tibble: 1 × 1
## n
## <int>
## 1 3
Looking at the days ATM 3 has non-zero withdrawals, we can see they are the last 3 days of our sample period. From this we will conclude that ATM 3 is likely new, which will need to be dealt with when begin making our forecasts. For now we will continue our exploration of the data.
completed_df |>
filter(atm == "ATM3", cash != 0)
## # A tibble: 3 × 3
## date atm cash
## <date> <chr> <dbl>
## 1 2010-04-28 ATM3 96
## 2 2010-04-29 ATM3 82
## 3 2010-04-30 ATM3 85
We will now explore the distribution of the data as we check for outliers, as well as evaluate any differences between the ATMs we have not already uncovered. Plotting the full dataset on a box and whisker plot we can immediately see a massive outlier for ATM 4 which heavily distorts our plots. Noting this, we will remove the outlier and impute a more suitable value for it later.
completed_df |>
ggplot(aes(x = atm, y = cash)) +
geom_boxplot() +
labs(title = "All ATMs")
## Warning: Removed 5 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
Removing the outlier as well as ATM3 as we know it has very few actual observations, we can see that ATM 4 has approximately 4 times more withdrawals than ATMs 1 and 2. Most of these withdrawals are less than 500, although there is a long tail. It is clear there is something about ATM 4 which results in a higher withdrawal rate, but we cannot conclude what that is from the data provided. If we knew more about the ATMs we might be able to use that information to better estimate the withdrawals for ATM 3, but for now we will move on.
completed_df |>
filter(cash < 3000, atm != "ATM3") |>
ggplot(aes(x = atm, y = cash)) +
geom_boxplot() +
labs(title = "ATMs 1, 2, and 4 With Outliers Removed")
Focusing in on ATMs 1 and 2 we can see they have largely similar withdrawal patterns. ATM 1 has a much more consistent withdrawal rate, but the ranges for both machines are virtually identical. While ATM 1 technically has some outliers, they are within a reasonable range for the other ATMs and therefore do not need to be addressed.
completed_df |>
filter(cash < 3000, atm %in% c("ATM1", "ATM2")) |>
ggplot(aes(x = atm, y = cash)) +
geom_boxplot() +
labs(title = "ATM 1 & 2")
Our last step in our data exploration is to plot the data. We will still exclude the large outlier from ATM 4 to allow us to better understand the data. None of the ATMs exhibit a strong trend, and all have strong seasonality. Looking at the ranges of values it does not appear that the days with 0 withdrawals for ATMs 1 and 2 are outside the realm of possibility, so we will not touch these values. Our last observation of note is that ATM 3 has the exact same withdrawal numbers as ATM 1 for the days we have data on.
completed_df |>
filter(atm == "ATM1") |>
ggplot(aes(x = date, y = cash)) +
geom_line() +
labs(title = "ATM 1 Withdrawals")
completed_df |>
filter(atm == "ATM2") |>
ggplot(aes(x = date, y = cash)) +
geom_line() +
labs(title = "ATM 2 Withdrawals")
completed_df |>
filter(cash < 3000, atm == "ATM4") |>
ggplot(aes(x = date, y = cash)) +
geom_line() +
labs(title = "ATM 4 Withdrawals")
completed_df |>
filter(cash < 3000, atm %in% c("ATM1", "ATM2", "ATM3"), date > "2010-04-01") |>
ggplot(aes(x = date, y = cash, color = atm)) +
geom_line() +
labs(title = "ATM 1, 2, and 3 Withdrawals")
Taking a look at the ACF and PACF plots for each ATM we can confirm our earlier assumption of seasonality. The seasonal period is 7 days and is very strong for every machine.
# This operation is actually needed for the imputing below but is done here to save us the trouble of continuing to manually exclude the outlier
prep_impute <- completed_df |>
mutate(cash = if_else(cash > 3000, NA, cash))
prep_impute |>
filter(atm == "ATM1") |>
as_tsibble() |>
gg_tsdisplay(cash, plot_type = "partial") +
labs(title = "ATM 1")
## Using `date` as index variable.
## Warning: Removed 3 rows containing missing values or values outside the scale range
## (`geom_point()`).
prep_impute |>
filter(atm == "ATM2") |>
as_tsibble() |>
gg_tsdisplay(cash, plot_type = "partial") +
labs(title = "ATM 2")
## Using `date` as index variable.
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).
prep_impute |>
filter(atm == "ATM4") |>
as_tsibble() |>
gg_tsdisplay(cash, plot_type = "partial") +
labs(title = "ATM 4")
## Using `date` as index variable.
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).
Reviewing the findings of our data exploration, we come to the conclusion that we will need to impute some missing observations as well as a replacement value for an outlier from ATM 4. Additionally, and more significantly, we need to come up with a method of addressing the lack of data for ATM 3. We will begin by imputing the few values needed, then address ATM 3.
Earlier we concluded that days with 0 withdrawals could reasonably happen and, aside from ATM 3, happened sparingly enough they should ultimately remain untouched. There are still a few rows which have no cash observations at all, and we will therefore impute values for these days as well as the extreme outlier from ATM 4. Our method of imputation is to take the average of the preceding 2 seasonal lags and the following 2 seasonal leads for the given ATM. The lack of overall trend and strong seasonality lead us to conclude this is the best method of imputing the required values. Carrying forward the last observation on the seasonal lag would also be a reasonable choice; however, our moving window average reduces the potential risk of including outliers if our missing observations happen to occur at a peak or trough in the seasonal pattern.
imputed_vals <- prep_impute |>
filter(atm != "ATM3") |>
group_by(atm) |>
mutate(cash = if_else(is.na(cash), (lag(cash, n = 14, order_by = atm) + lag(cash, n = 7, order_by = atm) + lead(cash, n = 7, order_by = atm) + lead(cash, n = 14, order_by = atm)) / 2, cash))
Earlier we concluded ATM 3 was likely new and as a result only had withdrawals for a few days. We can assume that ATM 3 will continue to receive traffic for the forecast period, and need to come up with a way to include this anticipated volume in our forecasts. Looking at our data for the other ATMs, we know ATM 4 has a much larger withdrawal rate than ATMs 1 and 2. Based on the days for which we have data on ATM 3 it does not seem likely that it will experience similarly large withdrawal rates and we will therefore exclude ATM 4 withdrawals from any imputation for ATM 3. We also know that ATMs 1 and 2 have very similar average withdrawal rates as well as ranges of withdrawals. Lastly, we observed that ATM 3 had the exact same withdrawal rates as ATM 1 on the days it received traffic. While we could attempt to construct a model for ATM 3 withdrawals and impute values for the observed period based upon ATMs 1 and 2 this would be a rather complicated solution that is unlikely to provide a substantial improvement on the naive. We will therefore use the Seasonal Naive method to populate data for ATM 3 over the sample period, which is to say we will copy the values for ATM 1 to ATM 3 for the sample period.
fully_populated_data <- imputed_vals |>
filter(atm == "ATM1") |>
mutate(atm = "ATM3") |>
bind_rows(imputed_vals) |>
arrange(date)
Having ensured our data is fully populated and any outliers have been suitably addressed we can now safely aggregate the totals from the ATMs over our sample period and perform some final analysis before we begin the forecasting process. Plotting the withdrawal totals we can see there is a rather large degree of variance as well as a relatively flat overall trend.
aggregated_df <- fully_populated_data |>
ungroup() |>
group_by(date) |>
summarise(cash = sum(cash)) |>
as_tsibble(index = date)
aggregated_df |>
autoplot() +
labs(title = "Total ATM Withdrawals")
## Plot variable not specified, automatically selected `.vars = cash`
Looking at some summary statistics we can see some indicators that our data may be difficult to produce an accurate model for. The large standard deviation relative to the mean indicates that there is a high degree of variance within our data, and if that variance exists outside the seasonal pattern we may see some large confidence intervals in our models. We can validate this initial observation by looking at a STL decomposition and can confirm there appears to be a high degree of random variation, which will make generating accurate models difficult.
aggregated_df |>
as_tibble() |>
summarise(mean = mean(cash), median = median(cash), sd = sd(cash), min = min(cash), max = max(cash))
## # A tibble: 1 × 5
## mean median sd min max
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 679. 648. 422. 11.4 2161.
aggregated_df |>
model(STL(cash ~ trend(window = 31) + season(window = "periodic"), robust = TRUE)) |>
components() |>
autoplot()
We are not quite done manipulating the data, but the rest of the transformations we will apply are all within the context of generating a model. We will begin by evaluating if and what kind of transformations could improve our model performance then train and evaluate several potential models.
To determine whether a transformation could improve model accuracy we will be using the Guerrero method of generating an optimal lambda for a Box-Cox transformation. Doing so we get a \(\lambda\) value of 0.17 which is relatively close to the logarithmic transformation of \(\lambda = 0\) which could be easier to explain and may be preferable if it does not significantly impact the data distribution. Plotting the transformations side by side, we can see they are fairly similar although the Guerrero lambda has a slightly more even distribution around the mean. Given how much data has been imputed already due to ATM 3, we are going to err on the side of caution and select the Guerrero lambda to generate the best forecasts possible.
lambda_atm <- aggregated_df |>
features(cash, features = guerrero) |>
pull(lambda_guerrero)
aggregated_df |>
gg_tsdisplay(box_cox(cash, lambda_atm), plot_type = "partial") +
labs(y = "",
title = latex2exp::TeX(paste0(
"Transformed ATM cash total with $\\lambda$ = ",
round(lambda_atm,2))))
aggregated_df |>
gg_tsdisplay(box_cox(cash, 0), plot_type = "partial") +
labs(y = "",
title = latex2exp::TeX(paste0("Transformed ATM cash total $\\lambda$ = 0 (log transform)")))
Given the relative stability of the aggregated time series over the sample period and the amount of data available we can be confident splitting our data into test and evaluation sets will give us a good model. We will be evaluating three types of models: ARIMA, ETS, and Seasonal Naive. Both the ARIMA and ETS models will be autoselected, and we will not be evaluating additional manually selected models due to time constraints combined with the limited available processing power on the laptop being used for this analysis. Attempts were made to include more options, but the R session crashed repeatedly. To help compensate for this, we will not be using the stepwise feature of the ARIMA model to ensure we are getting the best performing ARIMA option. Both the ARIMA and ETS models will be compared to the Seasonal Naive in order to determine if they perform any better.
training_atm <- aggregated_df |>
filter_index(. ~ "2010-01-01")
fit_atm_training <- training_atm |>
model(arima = ARIMA(box_cox(cash, lambda_atm), stepwise = FALSE),
ets = ETS(box_cox(cash, lambda_atm)),
naive = SNAIVE(cash ~ lag(7)))
To get a better understanding of the models selected we can look at the reports generated. R has selected an ARIMA(2,0,1)(1,1,2) model as well as an ETS(M,N,A) model. The ARIMA model is rather complicated with 6 degrees of freedom which will need to be considered when we are selecting our final forecasting model.
fit_atm_training |>
select(arima) |>
report()
## Series: cash
## Model: ARIMA(2,0,1)(1,1,2)[7]
## Transformation: box_cox(cash, lambda_atm)
##
## Coefficients:
## ar1 ar2 ma1 sar1 sma1 sma2
## 1.0070 -0.0489 -0.8945 0.6036 -1.6924 0.7161
## s.e. 0.1108 0.0789 0.0814 0.1762 0.1580 0.1638
##
## sigma^2 estimated as 3.628: log likelihood=-498.7
## AIC=1011.4 AICc=1011.89 BIC=1035.74
fit_atm_training |>
select(ets) |>
report()
## Series: cash
## Model: ETS(M,N,A)
## Transformation: box_cox(cash, lambda_atm)
## Smoothing parameters:
## alpha = 0.0410589
## gamma = 0.0001000129
##
## Initial states:
## l[0] s[0] s[-1] s[-2] s[-3] s[-4] s[-5] s[-6]
## 11.11792 -4.767462 -0.1863009 0.9601199 0.6894683 0.9569522 1.095417 1.251806
##
## sigma^2: 0.0265
##
## AIC AICc BIC
## 1665.875 1666.811 1700.928
As we cannot use AICc to compare performance across model classes we will rely on more traditional error measurements with a particular emphasis on RMSE due to the ease of interpretation. Doing so we can see both the ARIMA and ETS models outperform the Seasonal Naive method by a large enough margin to justify their selection over the Seasonal Naive. While the ARIMA model performed slightly better on the training set, the ETS model outperformed it by a similar margin on the test set. The ETS model outperforms the ARIMA model in every relevant measure of error on the test set indicating the ARIMA model may be slightly overfit to the training data. Since the ETS model performs best on the test set and is a much simpler and more explainable model than ARIMA we will move forward with the ETS model. We should note; however, that there are indicators none of these models perform particularly well when looking at the large MAPE values and considering the size of the RMSE with regards to the average withdrawal amount.
bind_rows(
fit_atm_training |> accuracy(),
fit_atm_training |> forecast(h = 30) |> accuracy(aggregated_df)
) |>
group_by(.type) |>
select(!ME) |>
arrange(RMSE, .by_group = TRUE)
## # A tibble: 6 × 9
## # Groups: .type [2]
## .model .type RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ets Test 366. 262. -19.2 46.8 0.691 0.737 -0.00110
## 2 arima Test 371. 275. -28.3 53.7 0.725 0.747 -0.0181
## 3 naive Test 547. 453. -9.69 79.4 1.19 1.10 -0.0296
## 4 arima Training 352. 262. -62.4 95.1 0.692 0.710 0.0288
## 5 ets Training 360. 273. -68.7 103. 0.720 0.724 0.109
## 6 naive Training 497. 379. -120. 161. 1 1 0.0564
Before fully committing to the ETS model we will consider the residuals to ensure there is no undue bias. Looking at the ACF of the residuals there is only one lag which possibly reaches the level of significance; however, this is not unexpected when looking at 30 lags and is consistent with white noise. The residuals do appear to have a left skew however, and there are enough outliers to warrant further evaluation. Upon performing a Ljung-Box test we can see the large p value indicates the residuals are consistent with white noise and we can be confident in the accuracy of the ETS model and will select it for forecasting.
fit_atm_training |>
select(ets) |>
gg_tsresiduals(type = "partial") +
labs(title = "ETS Residuals")
augment(fit_atm_training) |>
filter(.model == "ets") |>
features(.innov, ljung_box, lag = 30)
## # A tibble: 1 × 3
## .model lb_stat lb_pvalue
## <chr> <dbl> <dbl>
## 1 ets 32.4 0.349
Looking at the generated forecast we can make a few interesting observations. There appears to be a rather high degree of uncertainty in the predictions for most days, especially with regards to when and how large the potential peaks will be. Conversely the model has a high degree of confidence in the troughs, both in terms of when they will happen and what the value will be. We mentioned earlier in the Closing Data Observations section that there was the potential for large confidence intervals due to the random variance in the data and this has indeed come to pass. While our model is better than the Naive, any forecasts on this data should be used with a high degree of caution.
forecast_atm <- aggregated_df |>
model(ets = ETS(box_cox(cash, lambda_atm))) |>
forecast(h = 31)
small_window <- aggregated_df |>
filter(date > "2010-1-1")
forecast_atm |>
autoplot(small_window) +
labs(title = "ETS Forecast for ATM Withdrawals in May 2010")
The power dataset tracks residential power usage from January 1998 through December 2013. We are tasked with modeling the data and producing a forecast for the entirety of 2014.
Taking a peak at the data we can see it is fairly straightforward. The dates are in an easy to process format, and while there is an unknown column called CaseSequence it is evident it is simply an index as \(924 - 733 = 191\) which is the number of months between our start and end dates.
raw_power <- readxl::read_xlsx("ResidentialCustomerForecastLoad-624.xlsx") |>
as_tibble()
head(raw_power)
## # A tibble: 6 × 3
## CaseSequence `YYYY-MMM` KWH
## <dbl> <chr> <dbl>
## 1 733 1998-Jan 6862583
## 2 734 1998-Feb 5838198
## 3 735 1998-Mar 5420658
## 4 736 1998-Apr 5010364
## 5 737 1998-May 4665377
## 6 738 1998-Jun 6467147
tail(raw_power)
## # A tibble: 6 × 3
## CaseSequence `YYYY-MMM` KWH
## <dbl> <chr> <dbl>
## 1 919 2013-Jul 8415321
## 2 920 2013-Aug 9080226
## 3 921 2013-Sep 7968220
## 4 922 2013-Oct 5759367
## 5 923 2013-Nov 5769083
## 6 924 2013-Dec 9606304
To confirm there are no missing entries we can use the fill_gaps function to populate any missing months. After doing so, we can see there is one month with a missing observation which will need to be addressed later.
completed_power <-raw_power |>
mutate(date = yearmonth(parse_date_time(`YYYY-MMM`, "Yb"))) |>
select(date, KWH) |>
as_tsibble(index = date) |>
fill_gaps() |>
rename_with(~ tolower(.x))
completed_power |>
filter(is.na(kwh))
## # A tsibble: 1 x 2 [1M]
## date kwh
## <mth> <dbl>
## 1 2008 Sep NA
Replacing the missing observation with a 0 for now, we can plot the power consumption to get a better understanding of the data. Immediately we should note there is an outlier other than our introduced 0 value which we will likely want to impute a replacement for. The data has strong autocorrelations at lags 1 and 6 with inversions at the 3 lag interval which is to be expected given heating and cooling needs. Taking a yearly view of the data we see a consistent annual pattern with usage peaking during the Summer and Winter months and falling off in the transitional seasons.
completed_power |>
mutate(kwh = if_else(is.na(kwh), 0, kwh)) |>
gg_tsdisplay(y = kwh) +
labs(title="Monthly Residential Power Consumption Jan 1998 - December 2013")
Given the overall consistency of the annual pattern and strong autocorrelations we could simply carry forward the previous value at the seasonal lag to impute our missing values without much risk of negatively impacting our models. Just to err on the side of caution we will take a windowed average of the previous two annual lags and following two leading annual values for the two observations we are imputing to avoid the chance of including an anomalous value. Looking at the plots of our imputed data we can now see that there is a slight upward trend in power consumption, and the variance appears to be increasing slightly with time. We should also note the increase in the values of the ACF, although the patterns remain the same.
imputed_power <- completed_power |>
mutate(imputed_kwh = if_else(is.na(kwh) | kwh < 3000000, (lag(kwh, 24) + lag(kwh, 12) + lead(kwh, 12) + lead(kwh, 24)) / 4, kwh)) |>
select(!kwh) |>
rename(kwh = imputed_kwh)
imputed_power |>
gg_tsdisplay() +
labs(title="Imputed Replacement of Power Consumption")
## Plot variable not specified, automatically selected `y = kwh`
Beginning the modeling process, we will first check to see if the data would benefit from any transformations. The Guerrero method indicates a \(\lambda\) of 0.17 would perform best, however when comparing the Guerrero Box-Cox transformation to a simple log transformation we can see the difference is negligible. We will therefore stick with the simpler and more explainable log transformation.
lambda_power <- imputed_power |>
features(kwh, features = guerrero) |>
pull(lambda_guerrero)
imputed_power |>
gg_tsdisplay(box_cox(kwh, lambda_power), plot_type = "partial") +
labs(y = "",
title = latex2exp::TeX(paste0(
"Transformed Power Consumption with $\\lambda$ = ",
round(lambda_power,2))))
imputed_power |>
gg_tsdisplay(box_cox(kwh, 0), plot_type = "partial") +
labs(y = "",
title = latex2exp::TeX(paste0(
"Transformed Power Consumption with $\\lambda$ = 0")))
We will once again be testing models by splitting our data into test and training sets. The models we will explore are ARIMA, ETS, and the Seasonal Naive to determine which will produce the best forecast. We will be considering two ARIMA models, one selected with the stepwise feature and one without to see if there is a substantial difference in performance.
training_power <- imputed_power |>
filter_index(. ~ "2008 Jun")
fit_power_models <- training_power |>
model(arima_no_step = ARIMA(log(kwh), stepwise = FALSE),
arima = ARIMA(log(kwh)),
ets = ETS(log(kwh)),
naive = SNAIVE(kwh ~ lag("year")))
To better understand the models we are comparing we can look at the reports for each model. The ARIMA model selected without the stepwise feature is ARIMA(4,0,0)(0,1,1) with drift. This is a relatively complicated model with a large AR non-seasonal component and 5 degrees of freedom. The model selected with the stepwise feature is ARIMA(1,0,1)(0,1,1) with drift which is an overall simpler model. The AICc of both models is relatively close, so we will continue to consider both for now. The ETS model selected is once again ETS(M,N,A) although with an exceptionally small gamma coefficient which is anecdotally interesting.
fit_power_models |>
select(arima_no_step) |>
report()
## Series: kwh
## Model: ARIMA(4,0,0)(0,1,1)[12] w/ drift
## Transformation: log(kwh)
##
## Coefficients:
## ar1 ar2 ar3 ar4 sma1 constant
## 0.2333 -0.0492 0.2148 -0.3066 -0.7837 0.0053
## s.e. 0.0902 0.0908 0.0899 0.0885 0.1041 0.0028
##
## sigma^2 estimated as 0.007118: log likelihood=117.5
## AIC=-221.01 AICc=-219.95 BIC=-201.86
fit_power_models |>
select(arima) |>
report()
## Series: kwh
## Model: ARIMA(1,0,1)(0,1,1)[12] w/ drift
## Transformation: log(kwh)
##
## Coefficients:
## ar1 ma1 sma1 constant
## -0.5034 0.7142 -0.7380 0.0088
## s.e. 0.1904 0.1491 0.0914 0.0054
##
## sigma^2 estimated as 0.007854: log likelihood=111.95
## AIC=-213.9 AICc=-213.35 BIC=-200.22
fit_power_models |>
select(ets) |>
report()
## Series: kwh
## Model: ETS(M,N,A)
## Transformation: log(kwh)
## Smoothing parameters:
## alpha = 0.04271041
## gamma = 0.0001000095
##
## Initial states:
## l[0] s[0] s[-1] s[-2] s[-3] s[-4] s[-5]
## 15.60086 -0.08425774 -0.2653176 -0.07629524 0.2073936 0.2606581 0.2010284
## s[-6] s[-7] s[-8] s[-9] s[-10] s[-11]
## 0.00460603 -0.2411172 -0.1879366 -0.06790398 0.04876882 0.2003735
##
## sigma^2: 0
##
## AIC AICc BIC
## 8.316489 12.680125 50.860718
Checking the accuracy of our models on the test set, we can see all models outperform the naive by a large enough margin to be considered. The non-stepwise selected ARIMA model again shows indications of being overfit on the training data, performing worse on the test data than the stepwise ARIMA. Given this performance we will consider the stepwise ARIMA model over the non-stepwise model. Comparing our selected ARIMA model to the ETS model we can see a significant difference in favor of the ARIMA model by most metrics. Of particular note is the difference in RMSE and MAPE, the first indicating errors that are not large on the scale of the data and the second confirming this observation. Together they indicate the models do a good job of modeling the data.
bind_rows(
fit_power_models |> accuracy(),
fit_power_models |> forecast(h = 12) |> accuracy(imputed_power)
) |>
group_by(.type) |>
select(!ME) |>
arrange(RMSE, .by_group = TRUE)
## # A tibble: 8 × 9
## # Groups: .type [2]
## .model .type RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 arima Test 340697. 284838. -1.51 4.99 0.509 0.473 -0.0373
## 2 arima_no_step Test 351101. 298345. -1.46 5.11 0.533 0.487 0.0308
## 3 ets Test 395541. 334759. 0.0360 5.64 0.598 0.549 0.190
## 4 naive Test 588246. 501331. -0.359 8.10 0.895 0.816 0.147
## 5 arima_no_step Training 505063. 383538. -0.251 6.12 0.685 0.701 0.0104
## 6 arima Training 521701. 398737. -0.303 6.43 0.712 0.724 0.0414
## 7 ets Training 523978. 409071. 0.523 6.50 0.730 0.727 0.229
## 8 naive Training 720781. 560002. -0.272 8.97 1 1 0.195
Looking at the residuals of our chosen ARIMA model we can see they do resemble white noise which is encouraging. The residuals are normally distributed and while there is one observation in the ACF plot which rises to the level of significance this is still in line with the expectations for white noise.
fit_power_models |>
select(arima) |>
gg_tsresiduals(lag = 24) +
labs(title = "ARIMA(1,0,1)(0,1,1) w/ Drift Residuals")
To verify our observation the residuals of the ARIMA model are consistent with white noise we can perform a Ljung-Box test. The large p value indicates the residuals are consistent with white noise and we can be confident in the forecasts produced.
augment(fit_power_models) |>
filter(.model == "arima") |>
features(.innov, ljung_box, lag = 24, dof = 3) |>
mutate_if(is.numeric, round, 5)
## # A tibble: 1 × 3
## .model lb_stat lb_pvalue
## <chr> <dbl> <dbl>
## 1 arima 27.3 0.160
Looking at the forecast for 2015 it seems reasonable. The highest degrees of uncertainty are at the peaks and troughs, with the transitionary periods having very tight confidence intervals. We can see the model correctly anticipates consumption reaching a peak at the start of the forecast period which is in line with past behavior. Based on the metrics above we can have a high degree of confidence in this model and should feel comfortable using the produced forecast.
small_window <- imputed_power |>
filter_index("2010 Jan" ~ .)
forecast_power <- imputed_power |>
model(ARIMA(log(kwh))) |>
forecast(h = "1 year")
forecast_power |>
autoplot(small_window) +
labs(title = "Predicted Residential Power Consumption for 2015")