library(fpp3)
library(tidyverse)
library(tsibble)
library(readxl)
library(writexl)

Part A - ATM Forecast

In part A, I want you to forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable ‘Cash’ is provided in hundreds of dollars, other than that it is straight forward. I am being somewhat ambiguous on purpose to make this have a little more business feeling. Explain and demonstrate your process, techniques used and not used, and your actual forecast. I am giving you data via an excel file, please provide your written report on your findings, visuals, discussion and your R code via an RPubs link along with the actual.rmd file Also please submit the forecast which you will put in an Excel readable file.

A quick look at the excel sheet shows three columns: DATE, ATM, Cash. Currently, the DATE and Cash column are character types so let’s import them as a datetime and numeric variable. Afterwards, we can convert the data into a tsibble.

raw <- read_excel("ATM624Data.xlsx", col_types = c("date", "text", "numeric")) %>% 
  mutate(DATE = as_date(DATE)) %>% #drops the time
  as_tsibble(index = DATE, key = ATM)
head(raw)
## # A tsibble: 6 x 3 [1D]
## # Key:       ATM [1]
##   DATE       ATM    Cash
##   <date>     <chr> <dbl>
## 1 2009-05-01 ATM1     96
## 2 2009-05-02 ATM1     82
## 3 2009-05-03 ATM1     85
## 4 2009-05-04 ATM1     90
## 5 2009-05-05 ATM1     99
## 6 2009-05-06 ATM1     88
summary(raw)
##       DATE                ATM                 Cash        
##  Min.   :2009-05-01   Length:1474        Min.   :    0.0  
##  1st Qu.:2009-08-01   Class :character   1st Qu.:    0.5  
##  Median :2009-11-01   Mode  :character   Median :   73.0  
##  Mean   :2009-10-31                      Mean   :  155.6  
##  3rd Qu.:2010-02-01                      3rd Qu.:  114.0  
##  Max.   :2010-05-14                      Max.   :10919.8  
##                                          NA's   :19

The dataset includes 1,474 observations with 19 missing Cash values, about 1.2% of the total data. Given that the missing values were only a small proportion of the total data, the missing entries were left alone for now.

Let’s plot the data to get a sense of trend and seasonality.

atm <- raw 

atm %>% 
  autoplot(Cash)
## Warning: Removed 14 rows containing missing values or values outside the scale range
## (`geom_line()`).

This plot of the ATMs and Cash is hard to read since they are all on different scales and there seems to be an outlier distorting the plot. Let’s plot each ATM separately.

atm %>% 
  autoplot(Cash) +
  facet_wrap(~ATM, scales = "free", nrow = 4)
## Warning: Removed 14 rows containing missing values or values outside the scale range
## (`geom_line()`).

From this plot, we can see that ATM1 and ATM2 display clear seasonality. ATM3 shows nearly zero withdrawals most of the year, then a sudden surge at the end of April 2010. ATM4 shows an extreme spike between Jan 2010 and Apr 2010. The spikes in ATM3 and ATM4 should be examined since they skew the scale of the data.

# Histogram of ATM Withrdawals by ATM
atm %>%
  ggplot(aes(x = Cash, fill = ATM)) +
  geom_histogram(bins = 30) +
  facet_wrap(~ATM, scales = "free")
## Warning: Removed 19 rows containing non-finite outside the scale range
## (`stat_bin()`).

The distribution of ATM1 is somewhat normal with a center around 100. ATM2 does not show any clear pattern but most of the ATM withdrawals are less than 100. Almost all of ATM3’s withdrawals are at or near zero with a few large withdrawals. The distribution of ATM4 is extremely right skewed with outliers above 9000, which was distorting the scale of the time plot.

Based on the time series plot and the histogram, ATM3 was inactive and had zero withdrawals until April 2010 when the first cash withdrawals were made. It would be difficult to model this since nearly all observations are zero and the most recent data has nonzero values.

ATM4 contained a single outlier, exceeding 9000. Because this value likely represents an anomaly rather than normal behavior, it was replaced with a NA to prevent it from skewing the scale.

# Remove only that one extreme ATM4 observation
atm_clean <- atm %>%
  mutate(Cash = if_else(ATM == "ATM4" & Cash > 9000, NA_real_, Cash))

atm_clean %>%
  autoplot(Cash) +
  facet_wrap(~ATM, scales = "free", nrow = 4) +
  labs(title = "ATM Withdrawals")
## Warning: Removed 14 rows containing missing values or values outside the scale range
## (`geom_line()`).

After cleaning the data and removing one extreme outlier from ATM4, ATM4 now displays a more stable withdrawal pattern.

ATM1 and ATM2 exhibit consistent withdrawal behavior with weekly seasonality and stable variance, which makes them good candidates for ARIMA modeling. Neither ATMs showed any upward or downward trend so exponential smoothing was not considered. ATM3, on the other hand, shows almost no activity until late April 2010, followed by a sudden increase. In this case, a simple mean or naive model would be most appropriate due to the limited history of the data. ATM4 has higher variance than ATM1 and ATM2 so a Box-Cox transformation should be applied to stabilize variance prior to ARIMA modeling. Exponential smoothing was not appropriate for forecasting ATM4’s withdrawals since there was no trend and the data showed a repetitive weekly cycle, which ARIMA models handle better.

ATM1

atm1 <- atm_clean %>%
  filter(ATM == "ATM1")

gg_tsdisplay(atm1, Cash, plot_type = "partial") +
  labs(title = "ATM1 Daily Withdrawals")

The ACF plot shows large positive spikes at lags 7, 14, and 21, confirming the presence of weekly seasonality. The slow decay between these seasonal lags suggests that seasonal differencing may be required to achieve stationarity. The PACF plot displays a single significant spike around lag 7, indicating that a seasonal autoregressive term of order 1 may be appropriate.These patterns suggest that an ARIMA model with a seasonal component of period 7 would be best for modeling ATM1’s withdrawals.

fit_arima1 <- atm1 %>%
  model(ARIMA(Cash))

report(fit_arima1)
## Series: Cash 
## Model: ARIMA(1,0,0)(2,1,0)[7] 
## 
## Coefficients:
##          ar1     sar1     sar2
##       0.1511  -0.4698  -0.2048
## s.e.  0.0535   0.0517   0.0514
## 
## sigma^2 estimated as 601.5:  log likelihood=-1641.05
## AIC=3290.11   AICc=3290.22   BIC=3305.63

The automatic ARIMA function selected an ARIMA(1,0,0)(2,1,0)[7] model for ATM1. This model incorporates one non-seasonal autoregressive term (AR(1)) and two seasonal autoregressive terms (SAR(2)), along with one seasonal difference (D=1) for a weekly period of 7 days. The presence of a single seasonal difference (D=1) confirms the presence of weekly seasonality observed in the data. The negative SAR coefficients suggesting mean reverting behavior across weeks. A residual diagnostic should be done to check if autocorrelation remains. There was some variance in the time plot so a Box-Cox transformation will be done to improve the

gg_tsresiduals(fit_arima1)
## Warning: `gg_tsresiduals()` was deprecated in feasts 0.4.2.
## ℹ Please use `ggtime::gg_tsresiduals()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: Removed 3 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 3 rows containing non-finite outside the scale range
## (`stat_bin()`).
## Warning: Removed 3 rows containing missing values or values outside the scale range
## (`geom_rug()`).

augment(fit_arima1) %>%
  features(.innov, ljung_box, lag = 14, dof = 4)
## # A tibble: 1 × 4
##   ATM   .model      lb_stat lb_pvalue
##   <chr> <chr>         <dbl>     <dbl>
## 1 ATM1  ARIMA(Cash)    20.3    0.0267

The residual diagnostics indicate that the ARIMA(1,0,0)(2,1,0)[7] model provides a reasonable fit. The residuals fluctuate around zero with no visible pattern. The ACF of residuals show most autocorrelations fall within the confidence bounds, suggesting the model has captured the main structure of the data. The histogram of the residual is normally distributed. However, the Ljung-Box test yields a p-value of 0.0267, suggesting that residuals are not entirely independent and that some autocorrelation remains. This indicates the model may not fully capture all patterns in the data. There was some variance in the time plot so a Box-Cox transformation will be done to stabalize the variance and improve the model.

lambda1 <- atm1 %>%
  features(Cash, guerrero) %>%
  pull(lambda_guerrero)

atm1_bc <- atm1 %>%
  mutate(Cash_bc = box_cox(Cash, lambda = lambda1))

fit_arima1_bc <- atm1_bc %>%
  model(ARIMA(Cash_bc))

report(fit_arima1_bc)
## Series: Cash_bc 
## Model: ARIMA(0,0,0)(2,1,0)[7] 
## 
## Coefficients:
##          sar1     sar2
##       -0.5351  -0.2433
## s.e.   0.0508   0.0504
## 
## sigma^2 estimated as 1.333:  log likelihood=-556.79
## AIC=1119.58   AICc=1119.65   BIC=1131.23
gg_tsresiduals(fit_arima1_bc)
## Warning: Removed 3 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 3 rows containing non-finite outside the scale range
## (`stat_bin()`).
## Warning: Removed 3 rows containing missing values or values outside the scale range
## (`geom_rug()`).

augment(fit_arima1_bc) %>%
  features(.innov, ljung_box, lag = 14, dof = 4)
## # A tibble: 1 × 4
##   ATM   .model         lb_stat lb_pvalue
##   <chr> <chr>            <dbl>     <dbl>
## 1 ATM1  ARIMA(Cash_bc)    19.5    0.0341

The Box-Cox transformation improved the model fit and stabilized variance. After transformation, the ARIMA function selected an ARIMA(0,0,0)(2,1,0)[7] model, which includes two seasonal autoregressive terms and one seasonal difference, but no non-seasonal AR or MA terms. The seasonal AR coefficients remain negative, indicating mean-reverting weekly behavior. The AIC dropped from 3290 to 1119 and the BIC dropped from 3305 to 1131, confirming that the transformed model is a better fit.

Residual diagnostics show residuals fluctuating randomly around zero with no visible trend, and most autocorrelations fall within confidence bounds. The histogram of residuals is approximately normal. The Ljung-Box test yields a p-value of 0.0341, below 0.05 but still better than the intial model (p=0.0267). This suggests that some mild autocorrelation remains, but the transformed model performs better overall. Therefore, this model will be used to forecast ATM withdrawals for May 2010.

lambda1 <- atm1 %>% features(Cash, guerrero) %>% pull(lambda_guerrero)

fit_atm <- atm1 %>%
  model(arima = ARIMA(box_cox(Cash, lambda1)))

fc_atm <- forecast(fit_atm, h = 31)

autoplot(fc_atm, atm1) +
  labs(title = "ATM1 Forecast", x = "Date", y = "Cash")

The ARIMA model forecast captures the repeating weekly seasonality and overall fluctuation pattern in ATM1’s cash withdrawals. Although it does not fully reproduce the occasional extreme spikes observed in the historical data, the model successfully reflects the consistent 7-day cycle and provides a reasonable short-term projection of expected withdrawal levels for May 2010.

# Exporting to excel
fc_mean <- fc_atm %>%
  as_tibble() %>%
  select(DATE,ATM, .mean) %>% 
  rename(Cash = .mean)

# Export options
#write_xlsx(fc_mean, "ATM1_Forecast_May2010.xlsx")

ATM2

atm2 <- atm_clean %>% filter(ATM == "ATM2") %>% as_tsibble(index = DATE)

gg_tsdisplay(atm2, Cash, plot_type = "partial")
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).

ATM2’s daily withdrawals fluctuate between 0 and 150 with a clearly repeating pattern. The ACF plot has strong positive spikes at lag 7, 14, and 21, indicating weekly seasonality. A Box-Cox transform is unnecessary since the variance appears constant. However, seasonal differencing may be required to address the weekly cycle. Based on these patterns, an ARIMA model with a seasonal term of period 7 would likely capture this structure well.

fit2 <- atm2 %>% model(ARIMA(Cash))
report(fit2)
## Series: Cash 
## Model: ARIMA(0,0,0)(2,1,0)[7] 
## 
## Coefficients:
##          sar1     sar2
##       -0.5747  -0.1766
## s.e.   0.0521   0.0518
## 
## sigma^2 estimated as 649.2:  log likelihood=-1659.31
## AIC=3324.61   AICc=3324.68   BIC=3336.26
gg_tsresiduals(fit2)
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_bin()`).
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_rug()`).

augment(fit2) %>% features(.innov, ljung_box, lag = 14, dof = 4)
## # A tibble: 1 × 4
##   ATM   .model      lb_stat lb_pvalue
##   <chr> <chr>         <dbl>     <dbl>
## 1 ATM2  ARIMA(Cash)    27.4   0.00226

ATM2’s daily withdrawals were modeled using an ARIMA(0,0,0)(2,1,0)[7] model, which includes two seasonal AR terms and one seasonal difference over a 7 day period to capture the weekly withdrawal pattern. Both seasonal parameters were significant, confirming the presence of strong weekly dependence in cash withdrawal behavior. The residual diagnostics showed residuals fluctuating randomly around zero and and appearing normally distributed. However, small peaks remained in the residual ACF, and the Ljung-Box test (p = 0.0023) indicated some remaining autocorrelation. Overall, the model fits the data reasonably well and effectively explains the weekly seasonality so we can move on to forecasting.

fc2 <- fit2 %>% forecast(h = 31)

autoplot(fc2, atm2) +
  labs(title = "ATM2 ARIMA Forecast for May 2010")

The fitted ARIMA(0,0,0)(2,1,0)[7] model used to forecast daily withdrawals for May 2010 successfully maintains the weekly seasonal pattern observed in the historical data. The forecast captures the consistent 7-day cycle and produces prediction intervals that align with the historical variability, making it suitable for short-term planning. While the model does not fully reproduce occasional extreme spikes, it provides a reasonable projection of expected withdrawal levels. Given the Ljung-Box test indicated residual autocorrelation, adding a seasonal MA term could have been considered to improve the fit. However, since the forecast aligned with historical data and seemed reasonable, additional adjustments were not needed.

# Export to Excel
fc2_export <- fc2 %>%
  as_tibble() %>%
  transmute(Date = DATE, ATM, Cash = .mean)

#write_xlsx(fc2_export, "ATM2_Forecast_May2010.xlsx")

ATM3

atm3 <- atm_clean %>% filter(ATM == "ATM3") %>% as_tsibble(index = DATE)

autoplot(atm3, Cash) +
  labs(title = "ATM3 Daily Withdrawals")

As mentioned earlier, ATM3 shows almost no withdrawal activity until late April 2010, followed by a sudden increase. Because the historical data lacks sufficient variation and continuity, ARIMA or exponential smoothing models are not suitable. Instead, a Naive is most appropriate since there is limited historical data and it simply uses the most recent observations as the forecast. A Mean model could also be considered, but with only three nonzero values, the average would quickly flatten and fail to reflect the recent increase. Therefore, the Naive mode approach the best short-term projection given the limited data.

fit3 <- atm3 %>% model(NAIVE(Cash))
report(fit3)
## Series: Cash 
## Model: NAIVE 
## 
## sigma^2: 25.8985
fc3 <- fit3 %>% forecast(h = 31)

autoplot(fc3, atm3) +
  labs(title = "ATM3 Naive Forecast for May 2010")

The Naive model used the last observed value to forecast the Cash withdrawals in May 2010. In this case, the final observed withdrawal amount was 85, which the model projects for each day in May 2010. Because nearly all prior observations were zero, residual and autocorrelation diagnostics are not necessary in this case.

# Export to Excel
fc3_export <- fc3 %>%
  as_tibble() %>%
  transmute(Date = DATE, ATM, Cash = .mean)

#write_xlsx(fc3_export, "ATM3_Forecast_May2010.xlsx")

ATM4

# ATM4 without the outlier
atm4 <- atm_clean %>% 
  filter(ATM == "ATM4") %>% 
  as_tsibble(index = DATE)

gg_tsdisplay(atm4, Cash, plot_type = "partial") +
  labs(title = "ATM4")
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).

ATM4 shows clear weekly fluctuations and high variance. There are large spikes over 1500 on some days and near zero on others, suggesting heteroscedasticity where variance changes over time. The ACF plot shows strong autocorrelations at lag 7, 14, and 21, confirming weekly seasonality. To stabilize the variance, we can perform a Box-Cox transformation before fitting an ARIMA model.

lambda4 <- atm4 %>%
  features(Cash, guerrero) %>%
  pull(lambda_guerrero)

lambda4
## [1] 0.4512229
atm4_bc <- atm4 %>%
  mutate(Cash_bc = box_cox(Cash, lambda = lambda4))

fit_arima4_bc <- atm4_bc %>%
  model(ARIMA(Cash_bc))

report(fit_arima4_bc)
## Series: Cash_bc 
## Model: ARIMA(0,0,0)(2,0,0)[7] w/ mean 
## 
## Coefficients:
##         sar1    sar2  constant
##       0.2116  0.1901   17.2491
## s.e.  0.0518  0.0522    0.6879
## 
## sigma^2 estimated as 180:  log likelihood=-1461.14
## AIC=2930.29   AICc=2930.4   BIC=2945.89
gg_tsresiduals(fit_arima4_bc)
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 1 row containing non-finite outside the scale range
## (`stat_bin()`).
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_rug()`).

augment(fit_arima4_bc) %>%
  features(.innov, ljung_box, lag = 14, dof = 4)
## # A tibble: 1 × 4
##   ATM   .model         lb_stat lb_pvalue
##   <chr> <chr>            <dbl>     <dbl>
## 1 ATM4  ARIMA(Cash_bc)    18.9    0.0411

The ARIMA function selected an ARIMA(0,0,0)(2,0,0)[7] with mean for ATM4 after applying a Box-Cox transformation. This specification includes two seasonal AR terms and no differencing. Both seasonal AR coefficients are positive and statistically significant, indicating persistent weekly dependence.

The residuals fluctuate randomly around zero with and almost normal shape. The ACF of residuals shows several autocorrelations outside of the significance bound and the Ljung-box test suggests that some autocorrelation remain. To improve the fit, adding seasonal differencing (D=1) and incorporating a seasonal MA term could help address remaining autocorrelation and stabilize the series further.

# Adding one seasonal differencing and one MA term
fit_arima4_refined <- atm4_bc %>%
  model(ARIMA(Cash_bc ~ pdq(0,0,0) + PDQ(2,1,1,7)))
report(fit_arima4_refined)
## Series: Cash_bc 
## Model: ARIMA(0,0,0)(2,1,1)[7] 
## 
## Coefficients:
##         sar1    sar2     sma1
##       0.0215  0.0154  -0.8805
## s.e.  0.0663  0.0628   0.0416
## 
## sigma^2 estimated as 170.3:  log likelihood=-1427.7
## AIC=2863.41   AICc=2863.52   BIC=2878.93
gg_tsresiduals(fit_arima4_refined)
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 1 row containing non-finite outside the scale range
## (`stat_bin()`).
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_rug()`).

augment(fit_arima4_refined) %>%
  features(.innov, ljung_box, lag = 14, dof = 2)
## # A tibble: 1 × 4
##   ATM   .model                                          lb_stat lb_pvalue
##   <chr> <chr>                                             <dbl>     <dbl>
## 1 ATM4  ARIMA(Cash_bc ~ pdq(0, 0, 0) + PDQ(2, 1, 1, 7))    19.8    0.0719

After adding one seasonal differencing and a seasonal MA term, the ARIMA(0,0,0)(2,1,1)[7] model displayed a better fit. The model’s performance improved substantially, with AIC decreasing from 2930 to 2863 and BIC from 2930 to 2878. Residual diagnostics show residuals fluctuating randomly around zero with an approximately normal distribution, and the Ljung-Box test (p = 0.0719) suggests that residual autocorrelation has been largely resolved. This refined model provides the best overall fit and will be used to forecast ATM4 withdrawals for May 2010.

# Forecast May 2010
fit4 <- atm4 %>%
  model(arima = ARIMA(box_cox(Cash, lambda4) ~ pdq(0,0,0) + PDQ(2,1,0,7)))

# Forecast
fc4 <- forecast(fit4, h = 31)

# Plot on original scale (fable back-transforms)
autoplot(fc4, atm4) +
  labs(title = "ATM4 ARIMA Forecast for May 2010", x = "Date", y = "Cash")

#writexl::#write_xlsx(fc4_bt, "ATM4_Forecast_May2010.xlsx")

The ARIMA(0,0,0)(2,1,1)[7] model with a Box-Cox transformation forecasts relatively stable ATM4 withdrawals for May 2010. The predicted values fluctuate around 450-500, reflecting the weekly seasonality seen in the historical data. Compared to the large historical volatility, the forecast is smoother and has narrower variations, indicating the model is predicting consistent withdrawal activity.

# Exporting to excel
fc4_out <- fc4 %>%
  as_tibble() %>%
  rename(Date = DATE) %>%       
  select(Date, ATM, Cash = .mean)

#write_xlsx(fc4_out, "ATM4_Forecast_May2010.xlsx")

Part B - Forecasting Power

Part B consists of a simple dataset of residential power usage for January 1998 until December 2013. Your assignment is to model these data and a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in Kilowatt hours, the rest is straight forward. Add this to your existing files above.

A glance at the dataset shows 192 observations spanning from Jan 1998 to Dec 2013.

raw <- read_excel("ResidentialCustomerForecastLoad-624.xlsx")
head(raw)
## # A tibble: 6 × 3
##   CaseSequence `YYYY-MMM`     KWH
##          <dbl> <chr>        <dbl>
## 1          733 1998-Jan   6862583
## 2          734 1998-Feb   5838198
## 3          735 1998-Mar   5420658
## 4          736 1998-Apr   5010364
## 5          737 1998-May   4665377
## 6          738 1998-Jun   6467147
summary(raw$KWH)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
##   770523  5429912  6283324  6502475  7620524 10655730        1
# Converting YYYY-MM column to datetime
data <- raw %>%
  mutate(Date = yearmonth(`YYYY-MMM`)) %>% 
  select(Date, KWH) %>%
  arrange(Date)

# Converting to tsibble
ts <- data %>%
  as_tsibble(index = Date)

head(ts)
## # A tsibble: 6 x 2 [1M]
##       Date     KWH
##      <mth>   <dbl>
## 1 1998 Jan 6862583
## 2 1998 Feb 5838198
## 3 1998 Mar 5420658
## 4 1998 Apr 5010364
## 5 1998 May 4665377
## 6 1998 Jun 6467147
ts %>% filter(is.na(KWH))
## # A tsibble: 1 x 2 [1M]
##       Date   KWH
##      <mth> <dbl>
## 1 2008 Sep    NA

There is one NA value but to avoid any issues later on, let’s impute this missing value with the average KWH of the month before and after.

ts <-ts %>%
  mutate(KWH = if_else(Date == yearmonth("2008-09"),
                       (lag(KWH) + lead(KWH)) / 2,
                       KWH))
gg_tsdisplay(ts, KWH, plot_type = "partial") +
  labs(title = "Monthly KWH usage")

gg_season(ts) +
  labs(title = "Seasonal Plot of Power Usage")
## Plot variable not specified, automatically selected `y = KWH`

The time series and seasonal plots show that residential power usage fluctuates cyclically each year. Power consumption is higher during the winter and summer months and lower in the spring and fall. Aside from the sharp drop in the July 2010, the series exhibits a slight upward trend in beginning in 2010. The significant drop in July 2010 is an outlier and should be handled before ARIMA and ETS modeling. The outlier will be replaced by the average value of the month before (Jun 2010) and the month after (Aug 2010).

The ACF plot shows significant spikes at lags 6, 12, and 18, confirming strong seasonality with a 12-month period.

# Handling the significant drop in July 2010
ts <-ts %>%
  mutate(KWH = if_else(Date == yearmonth("2010-07"),
                       (lag(KWH) + lead(KWH)) / 2,
                       KWH))

autoplot(ts, KWH)

Now that the significant drop in July 2010 has been adjusted, we can estimate the lambda to see if a Box-Cox transformation is needed.

lambda <- ts %>%
  features(KWH, features = guerrero) %>%
  pull(lambda_guerrero)

lambda
## [1] -0.2066353

Lambda is close to zero so a log Box-Cox transformation is needed.

power_bc <- ts %>%
  mutate(KWH_bc = box_cox(KWH, lambda = lambda))

autoplot(power_bc)
## Plot variable not specified, automatically selected `.vars = KWH`

Now that the variance is stabilized, we can proceed with modeling. Both ARIMA and ETS will be used to forecast the power usage for 2014.

fit_models <- power_bc %>%
  model(
    arima = ARIMA(KWH_bc),
    ets = ETS(KWH_bc)
  )

# ARIMA summary
report(fit_models %>% select(arima))
## Series: KWH_bc 
## Model: ARIMA(1,0,0)(2,1,0)[12] w/ drift 
## 
## Coefficients:
##          ar1     sar1     sar2  constant
##       0.2545  -0.7271  -0.3680     1e-03
## s.e.  0.0754   0.0743   0.0765     3e-04
## 
## sigma^2 estimated as 1.449e-05:  log likelihood=755.3
## AIC=-1500.61   AICc=-1500.26   BIC=-1484.64
# ETS summary
report(fit_models %>% select(ets))
## Series: KWH_bc 
## Model: ETS(M,N,M) 
##   Smoothing parameters:
##     alpha = 0.1114426 
##     gamma = 0.2041424 
## 
##   Initial states:
##      l[0]      s[0]    s[-1]     s[-2]    s[-3]    s[-4]    s[-5]    s[-6]
##  4.646672 0.9990819 0.997859 0.9995432 1.001871 1.002372 1.001925 1.000118
##      s[-7]     s[-8]     s[-9]   s[-10]   s[-11]
##  0.9978945 0.9984037 0.9991044 1.000203 1.001623
## 
##   sigma^2:  0
## 
##       AIC      AICc       BIC 
## -1125.552 -1122.825 -1076.690

Based on the AIC, the ARIMA(1,0,0)(2,1,0)[12] w/ drift model performed better than the ETS model as it had a lower AIC. The selected ARIMA model incorporated one non-seasonal autoregressive (AR) term and two seasonal AR terms with differencing. This captured both short term dependencies as well as the repeating yearly pattern in power usage. In contrast, the ETS model might not have captured the strong autocorrelation and moderate trend changes. Since ARIMA fit the data better, it will be used to forecast the power usage in May 2014.

# Extract only the ARIMA model
fit_arima <- fit_models %>% select(arima)

# Plot residual diagnostics
gg_tsresiduals(fit_arima)

# Ljung-Box test for autocorrelation
augment(fit_arima) %>%
  features(.innov, ljung_box, lag = 24, dof = 4)
## # A tibble: 1 × 3
##   .model lb_stat lb_pvalue
##   <chr>    <dbl>     <dbl>
## 1 arima     32.2    0.0414

The residual plot shows that the residuals are centered around zero with no obvious trend or seasonal pattern. The histogram has a somewhat bell shape, suggesting the residuals are normally distributed. Most autocorrelations fall within the significance bounds. However, there are small spikes at lags 3, 6, and 9, indicating minor remaining autocorrelation that the model did not fully capture. Overall, the residuals suggest the model provides a good fit to the data, so it can be used to generate forecasts for 2014.

fit_tf <- ts %>%
  model(arima = ARIMA(box_cox(KWH, lambda)))

fc <- fit_tf %>% forecast(h = 12)

autoplot(fc, ts) +
  labs(title = "Residential Power Forecast for 2014",
       x = "Year", y = "KWH")

The ARIMA(1,0,0)(2,1,0)[12] with drift model was used to forecast residential power usage for 2014. The forecast maintains the strong seasonal pattern observed in previous years where there is higher consumption in the winter and summer months and lower consumption the rest of the year. The forecasted model also retains the slight upward trend from the data as well. Overall, the ARIMA(1,0,0)(2,1,0)[12] with drift model provides a projection that aligns with the historical seasonal behvaior of residential power usage.

# Extracting forecasted values
fc_tbl <- fc %>%
  as_tibble() %>%
  transmute(
    Date,
    Forecast_KWH = .mean
  )

# Export to Excel
#write_xlsx(fc_tbl, "PowerUsage_Forecast_2014.xlsx")