library(tidyverse)
library(fpp3)
library(tseries)
library(forecast)
library(kableExtra)
library(reactable)
library(seasonal)
library(tsibble)
library(openxlsx)
library(readxl)
library(mice)
library(caret)
library(zoo)
library(lubridate)
library(imputeTS)
library(naniar)
library(timeplyr)
library(rstatix)
library(timetk)
set.seed(1234)

Project 1

Part A – ATM Forecast

In part A, I want you to forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable ‘Cash’ is provided in hundreds of dollars, other than that it is straight forward. I am being somewhat ambiguous on purpose to make this have a little more business feeling. Explain and demonstrate your process, techniques used and not used, and your actual forecast. I am giving you data via an excel file, please provide your written report on your findings, visuals, discussion and your R code via an RPubs link along with the actual.rmd file. Also please submit the forecast which you will put in an Excel readable file.

atm_df <- read_excel("/Users/mohamedhassan/Downloads/ATM624Data (1).xlsx", col_types = c('date', 'text', 'numeric'))

After uploading the data, I mutated the DATE column into a date format, since the values were not formatted correctly. I rounded the Cash column to the nearest dollar, since ATM machines are known to dispense cash, not change, and transformed the values to hundreds of dollars, as stated in the question. Then I assigned the date as the index and ATM as the key to match the values. Finally, I filtered to not include data values beyond May 2010, since we are attempting to predict cash withdrawals for May 2010:

atm_df2 <- atm_df |>
  mutate(Date = as_date(DATE)) |>
  #mutate(Month = floor_date(Date, "month")) |>
  mutate(Cash = round(Cash)) |>
  mutate_if(is.numeric, ~ . * 100) |>
  as_tsibble(index = Date, key = ATM) |>
  filter(Date < "2010-05-01") |>
  select(-DATE)
reactable(atm_df2)
atm_df2 |>
  autoplot(Cash) +
  facet_wrap(~ATM, scales = "free", nrow = 4) +
  labs(title = "ATM Withdrawals: May 2009 - April 2010") +
  theme(plot.title = element_text(hjust = 0.5))

We can see in the initial autoplot that ATM1 and ATM2 has no trend, seasonality or cyclic behavior. There are random fluctuations with no distinct patterns. ATM3 did not have any cash withdrawals until the last 3 days in April 2010, which may indicate that ATM3 is a new machine being used. ATM4 also has no trend, seasonality, or cyclic behavior, but has an outlier of more than $90,000 around February 2010, which will be addressed later. Let’s look at the distribution of each ATM’s cash withdrawal using a histogram:

atm_df2 |> 
  ggplot(aes(x = Cash)) +
  geom_histogram(bins = 30) +
  facet_wrap(~ATM, ncol = 2, scales = "free") +
  labs(title = "ATM Withdrawals:  May 2009 - April 2010") +
  scale_y_continuous("Withdrawals (hundreds)") +
  theme(plot.title = element_text(hjust = 0.5))

ATM1 and ATM2 does not show a normal distribution of data. However, they show more of a wider spread than ATM3 and ATM4. ATM1 appears to have an outlier, with a cash value greater than $15,000. ATM3 and ATM4 show a skewness to the right, with outliers in each. I’ll explore any missing values and outliers in more detail next.

Missing Values

reactable(round(miss_case_table(atm_df2), 2))
atm_df2[!complete.cases(atm_df2),]
## # A tsibble: 5 x 3 [1D]
## # Key:       ATM [2]
##   ATM    Cash Date      
##   <chr> <dbl> <date>    
## 1 ATM1     NA 2009-06-13
## 2 ATM1     NA 2009-06-16
## 3 ATM1     NA 2009-06-22
## 4 ATM2     NA 2009-06-18
## 5 ATM2     NA 2009-06-24

There are only 5 NA values in the whole dataset, constituting 0.34% of the data. While this is a small amount, all 5 of the NA values occur in the same month and year, June 2009, with 3 values missing from ATM 1 and 2 values missing from ATM 2. Because each of the missing values does not look to be at random, I will impute the missing values. Initially I thought to either ignore the missing values or use the 7-day rolling average of the week in which the days of the missing values were in. Instead, I used the imputeTS package to fill in the missing values.

Here are some sources that discuss how to handle Missing Not at Random (MNAR) values: How to Deal with Missing DataHandling Missing Data

colSums(is.na(atm_df2))
##  ATM Cash Date 
##    0    5    0
atm_df2 <- imputeTS::na_interpolation(atm_df2, option = "linear")
colSums(is.na(atm_df2))
##  ATM Cash Date 
##    0    0    0

Outliers

atm_outliers <- atm_df2 |>
  identify_outliers(Cash)
reactable(atm_outliers)

Using the identify_outliers() function from the rstatix package, there were a lot of outliers identified, with ATM4 being the only ATM with outliers. To get a better idea of the outliers in ATM4, I used Hyndman’s method for finding outliers outlined in section 13.9 of his textbook:

outliers <- atm_df2 |>
  filter(ATM == 'ATM4') |>
  model(
    stl = STL(Cash ~ season(window = "periodic"), robust = TRUE)
  ) |>
  components()
outliers <- outliers |>
  filter(
    remainder < quantile(remainder, 0.25) - 3*IQR(remainder) |
    remainder > quantile(remainder, 0.75) + 3*IQR(remainder)
  )

outliers
## # A dable: 2 x 8 [1D]
## # Key:     ATM, .model [1]
## # :        Cash = trend + season_week + remainder
##   ATM   .model Date          Cash  trend season_week remainder season_adjust
##   <chr> <chr>  <date>       <dbl>  <dbl>       <dbl>     <dbl>         <dbl>
## 1 ATM4  stl    2009-09-22  171200 37267.      -7162.   141095.       178362.
## 2 ATM4  stl    2010-02-09 1092000 27902.      -7162.  1071260.      1099162.

I replaced the 2 outliers values with NA values, then used the ARIMA and interpolate functions to model and insert the replacement values for each:

atm_df2 <- atm_df2 |>
  mutate(Cash = replace(Cash, ATM == 'ATM4' & Date == '2009-09-22', NA),
         Cash = replace(Cash, ATM == 'ATM4' & Date == '2010-02-09', NA))

atm_df2 <- atm_df2 |>
  model(ARIMA(Cash)) |>
  interpolate(atm_df2)
atm_df2 |> 
  filter(ATM == "ATM4") |>
  filter(Date == '2009-09-22' | Date == '2010-02-09')
## # A tsibble: 2 x 3 [1D]
## # Key:       ATM [1]
##   ATM   Date         Cash
##   <chr> <date>      <dbl>
## 1 ATM4  2009-09-22 23175.
## 2 ATM4  2010-02-09 20764.

I rounded the new values to whole numbers and re-created the Month column:

atm_df2 <- atm_df2 |>
  mutate(Cash = round(atm_df2$Cash)) 
  #mutate(Month = floor_date(Date, "month"))
atm_df2 |> 
  filter(ATM == "ATM4") |>
  filter(Date == '2009-09-22' | Date == '2010-02-09')
## # A tsibble: 2 x 3 [1D]
## # Key:       ATM [1]
##   ATM   Date        Cash
##   <chr> <date>     <dbl>
## 1 ATM4  2009-09-22 23175
## 2 ATM4  2010-02-09 20764

The autoplot below shows that the outlier in ATM4 is no longer distinct:

atm_df2 |>
  autoplot(Cash) +
  facet_wrap(~ATM, scales = "free", nrow = 4) +
  labs(title = "ATM Withdrawals: May 2009 - April 2010 (with Replacement Outliers)") +
  theme(plot.title = element_text(hjust = 0.5))

I will begin the process of separating each of the 4 ATM machines. Since we’re attempting to predict the May 2010 totals for each ATM Machine, I wanted to get the Monthly totals for each ATM machine. This changes the composition of the data from seasonal to non-seasonal:

# atm_df3 <- atm_df2 |>
#   group_by_key(ATM) |> 
#   index_by(Date) |>
#   summarize(Cash_Total = sum(Cash)) |>
#   ungroup() |>
#   mutate(Month = yearmonth(as.character(Month))) |>
#   as_tsibble(index = Month)
atm1 <- atm_df2 |> filter(ATM == "ATM1")
atm2 <- atm_df2 |> filter(ATM == "ATM2") 
atm3 <- atm_df2 |> filter(ATM == "ATM3") 
atm4 <- atm_df2 |> filter(ATM == "ATM4") 

When looking at the monthly cash withdrawals, ATM 1 and 4 show a precipitous drop in the amount of cash withdrawals in October 2009. With the exception of ATM3, which doesn’t have any data, the other 3 ATM machines show a decline in withdrawals sometime during the summer months of 2009. ATM1 and ATM2 show declines almost around the same time, while the decline in ATM4 begins about a month or two later.

ATM

ATM1

atm1 |>
  autoplot(Cash)

atm1 |>
  gg_tsdisplay(Cash, plot_type = 'partial') + 
  labs(title = 'ATM1 Withdrawals')

There are distinct points beyond the confidence intervals in the ACF and PACF plots

lambda <- atm1 |>
  features(Cash, features = guerrero) |>
  pull(lambda_guerrero)

atm1_fit <- atm1 |>
  model(
    "Additive ETS" = ETS(Cash ~ error("A") + trend("N") + season("A")),
    "Multiplicative ETS" = ETS(Cash ~ error("M") + trend("N") + season("M")),
    "SNAIVE" = SNAIVE(Cash),
    "Transformed Additive" = ETS(box_cox(Cash,lambda) ~ error("A") + trend("N") + season("A")),
    "Transformed Multiplicative" = ETS(box_cox(Cash,lambda) ~ error("M") + trend("N") + season("M")),
    "Transformed SNAIVE" = SNAIVE(box_cox(Cash,lambda)),
    "ARIMA" = ARIMA(box_cox(Cash,lambda))
  )


left_join(glance(atm1_fit) |>
            select(.model:BIC), 
          accuracy(atm1_fit) |> 
            select(.model, RMSE, MAE)) |>
  arrange(RMSE)
## # A tibble: 7 × 8
##   .model                           sigma2 log_lik   AIC  AICc   BIC  RMSE   MAE
##   <chr>                             <dbl>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Additive ETS               5812790.      -3915. 7849. 7850. 7888. 2381. 1512.
## 2 Transformed Multiplicative       0.0182  -1636. 3291. 3292. 3330. 2484. 1573.
## 3 ARIMA                           19.6     -1041. 2090. 2091. 2106. 2493. 1596.
## 4 Transformed Additive            20.2     -1621. 3261. 3262. 3300. 2508. 1604.
## 5 Multiplicative ETS               0.134   -3954. 7928. 7929. 7967. 2625. 1616.
## 6 SNAIVE                     7719917.         NA    NA    NA    NA  2775. 1770.
## 7 Transformed SNAIVE              27.6        NA    NA    NA    NA  2775. 1770.

The RMSE of each of the models were not widely different. The Additive ETS model had the lowest RMSE with 2381.063, while the Seasonal Naive and Transformed Seasonal Naive had the highest. However, when comparing the AIC , BIC, and AICc, the ARIMA model performed the best, having the lowest values among the tested models. Therefore, I decided to use the ARIMA model as the best fit:

atm1_fit |>
  select(.model = "ARIMA") |>
  report()
## Series: Cash 
## Model: ARIMA(0,0,2)(0,1,1)[7] 
## Transformation: box_cox(Cash, lambda) 
## 
## Coefficients:
##          ma1      ma2     sma1
##       0.1126  -0.1094  -0.6418
## s.e.  0.0524   0.0520   0.0432
## 
## sigma^2 estimated as 19.63:  log likelihood=-1041.23
## AIC=2090.46   AICc=2090.57   BIC=2105.98
fc_atm1 <- atm1_fit |>
  forecast(h = 31) |>
  filter(.model=='ARIMA')
atm1_fit |>
  select(ARIMA) |>
  gg_tsresiduals()

The autoplot of the residuals are mostly stationary around the zero threshold, with a few outliers. The ACF plot is within the confidence intervals, and appear show white noise with no distinct pattern. The center of the histogram appears to be at zero and not spread widely, which is a good indication that the ARIMA model is a good fit.

fc_atm1 |>
  autoplot(atm1) 

atm1_fit |> 
  select(.model = "ARIMA") |>
  augment() |>
  features(.innov, box_pierce, lag = 10, dof = 0)
## # A tibble: 1 × 3
##   .model bp_stat bp_pvalue
##   <chr>    <dbl>     <dbl>
## 1 .model    6.37     0.784
atm1_fit |> 
  select(.model = "ARIMA") |>
  augment() |>
  features(.innov, ljung_box, lag = 10, dof = 0)
## # A tibble: 1 × 3
##   .model lb_stat lb_pvalue
##   <chr>    <dbl>     <dbl>
## 1 .model    6.50     0.772

The large p-values confirm that the residuals are not distinguishable from a white noise series.

ATM2

atm2 |>
  autoplot(Cash)

atm2 |>
  gg_tsdisplay(Cash, plot_type = 'partial') + 
  labs(title = 'ATM2 Withdrawals')

lambda2 <- atm2 |>
  features(Cash, features = guerrero) |>
  pull(lambda_guerrero)

atm2_fit <- atm2 |>
  model(
    "Additive ETS" = ETS(Cash ~ error("A") + trend("N") + season("A")),
    "Multiplicative ETS" = ETS(Cash ~ error("M") + trend("N") + season("M")),
    "SNAIVE" = SNAIVE(Cash),
    "Transformed Additive" = ETS(box_cox(Cash,lambda2) ~ error("A") + trend("N") + season("A")),
    "Transformed Multiplicative" = ETS(box_cox(Cash,lambda2) ~ error("M") + trend("N") + season("M")),
    "Transformed SNAIVE" = SNAIVE(box_cox(Cash,lambda2)),
    "ARIMA" = ARIMA(box_cox(Cash,lambda2))
  )


left_join(glance(atm2_fit) |>
            select(.model:BIC), 
          accuracy(atm2_fit) |> 
            select(.model, RMSE, MAE)) |>
  arrange(RMSE)
## # A tibble: 7 × 8
##   .model                          sigma2 log_lik   AIC  AICc   BIC  RMSE   MAE
##   <chr>                            <dbl>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ARIMA                        53782.     -2457. 4927. 4927. 4950. 2446. 1713.
## 2 Additive ETS               6489600.     -3935. 7890. 7890. 7929. 2516. 1790.
## 3 Transformed Additive         57689.     -3073. 6166. 6166. 6205. 2545. 1790.
## 4 SNAIVE                     9144253.        NA    NA    NA    NA  3020. 2079.
## 5 Transformed SNAIVE           79405.        NA    NA    NA    NA  3020. 2079.
## 6 Multiplicative ETS               0.332  -4065. 8151. 8151. 8190. 3385. 2804.
## 7 Transformed Multiplicative       0.338  -3237. 6494. 6494. 6533. 5487. 2777.

ARIMA performs the best across the board, having the lowest values in all categories. For this reason, I chose this model as the best fit:

atm2_fit |>
  select(.model = "ARIMA") |> 
  report()
## Series: Cash 
## Model: ARIMA(2,0,2)(0,1,1)[7] 
## Transformation: box_cox(Cash, lambda2) 
## 
## Coefficients:
##           ar1      ar2     ma1     ma2     sma1
##       -0.4256  -0.9051  0.4719  0.7967  -0.7271
## s.e.   0.0588   0.0444  0.0888  0.0590   0.0406
## 
## sigma^2 estimated as 53782:  log likelihood=-2457.41
## AIC=4926.82   AICc=4927.06   BIC=4950.1
fc_atm2 <- atm2_fit |>
  forecast(h = 31) |>
  filter(.model=='ARIMA')
atm2_fit |>
  select(ARIMA) |>
  gg_tsresiduals()

The residuals shows stationarity, with the ACF plot appearing to be randomized with no distinct patterns and histogram showing a normal distribution with its center at zero.

fc_atm2 |>
  autoplot(atm2) 

atm2_fit |> 
  select(.model = "ARIMA") |>
  augment() |>
  features(.innov, box_pierce, lag = 10, dof = 0)
## # A tibble: 1 × 3
##   .model bp_stat bp_pvalue
##   <chr>    <dbl>     <dbl>
## 1 .model    8.88     0.544
atm2_fit |> 
  select(.model = "ARIMA") |>
  augment() |>
  features(.innov, ljung_box, lag = 10, dof = 0)
## # A tibble: 1 × 3
##   .model lb_stat lb_pvalue
##   <chr>    <dbl>     <dbl>
## 1 .model    9.06     0.527

The large p-values confirm that the residuals are not distinguishable from a white noise series.

ATM3

atm3 |>
  autoplot(Cash)

atm3 |>
  gg_tsdisplay(Cash, plot_type = 'partial') + 
  labs(title = 'ATM3 Withdrawals')

Because we only have 3 days of data to work with, I thought the best way to forecast the upcoming month was to take the average of the 3 days of data and use a MEAN model:

atm3_fit <- atm3 |>
  filter(Cash != 0) |>
  model(MEAN(Cash))
fc_atm3 <- atm3_fit |>
  forecast(h = 31) 
atm3_fit |>
  select(`MEAN(Cash)`) |>
  gg_tsresiduals()

I wanted to see what the residuals of the model looked like. Predictably, each plot does not look normalized due to the lack of data available.

fc_atm3 |>
  autoplot(atm3) +
  labs(title = "ATM3 Forecasted with the MEAN() Model") +
  theme(plot.title = element_text(hjust = 0.5))

Based on the plot above, the forecast for May 2010 appears to show cash withdrawals from around $7,400 to around $11,000, which is not a bad start for the new ATM machine.

ATM4

atm4 |>
  autoplot(Cash)

atm4 |>
  gg_tsdisplay(Cash, plot_type = 'partial') + 
  labs(title = 'ATM4 Withdrawals')

lambda4 <- atm4 |>
  features(Cash, features = guerrero) |>
  pull(lambda_guerrero)

atm4_fit <- atm4 |>
  model(
    "Additive ETS" = ETS(Cash ~ error("A") + trend("N") + season("A")),
    "Multiplicative ETS" = ETS(Cash ~ error("M") + trend("N") + season("M")),
    "SNAIVE" = SNAIVE(Cash),
    "Additive_Box_Cox" = ETS(box_cox(Cash,lambda4) ~ error("A") + trend("N") + season("A")),
    "Multiplicative_Box_Cox" = ETS(box_cox(Cash,lambda4) ~ error("M") + trend("N") + season("M")),
    "SNAIVE_Box_Cox" = SNAIVE(box_cox(Cash,lambda4)),
    "ARIMA" = ARIMA(box_cox(Cash,lambda4)),
    "Additive_BC_Non_Seasonal" = ETS(box_cox(Cash,lambda4) ~ error("A") + trend("N") + season("N")),
    "Multiplicative_BC_Non_Seasonal" = ETS(box_cox(Cash,lambda4) ~ error("M") + trend("N") + season("N"))
  )

# stats for the models
left_join(glance(atm4_fit) |> 
            select(.model:BIC), 
          accuracy(atm4_fit) |> 
            select(.model, RMSE, MAE)) |>
  arrange(RMSE)
## # A tibble: 9 × 8
##   .model                          sigma2 log_lik   AIC  AICc   BIC   RMSE    MAE
##   <chr>                            <dbl>   <dbl> <dbl> <dbl> <dbl>  <dbl>  <dbl>
## 1 Additive ETS                   1.07e+9  -4866. 9752. 9752. 9791. 32249. 26320.
## 2 Additive_Box_Cox               1.85e+4  -2865. 5750. 5751. 5789. 33236. 25773.
## 3 Multiplicative_Box_Cox         2.04e-1  -2859. 5738. 5738. 5777. 33642. 25860.
## 4 ARIMA                          1.95e+4  -2319. 4649. 4649. 4668. 34464. 27004.
## 5 Multiplicative ETS             7.24e-1  -4869. 9759. 9759. 9798. 34898. 27443.
## 6 Multiplicative_BC_Non_Seasonal 2.26e-1  -2898. 5802. 5802. 5813. 35562. 28867.
## 7 Additive_BC_Non_Seasonal       2.17e+4  -2898. 5802. 5802. 5813. 35567. 28869.
## 8 SNAIVE                         2.04e+9     NA    NA    NA    NA  45078. 34495.
## 9 SNAIVE_Box_Cox                 3.27e+4     NA    NA    NA    NA  45078. 34495.

The Additive ETS had the lowest RMSE, but had the second-highest AIC value. Depsite having the 4th-lowest RMSE and MAE, I ultimately decided to choose the ARIMA model, since it had the lowest AIC, AICc, and BIC values.

atm4_fit |>
  select(.model = "ARIMA") |> 
  report()
## Series: Cash 
## Model: ARIMA(0,0,1)(2,0,0)[7] w/ mean 
## Transformation: box_cox(Cash, lambda4) 
## 
## Coefficients:
##          ma1    sar1    sar2  constant
##       0.0773  0.1894  0.2197  182.4995
## s.e.  0.0526  0.0514  0.0524    7.6781
## 
## sigma^2 estimated as 19510:  log likelihood=-2319.32
## AIC=4648.64   AICc=4648.81   BIC=4668.14
fc_atm4 <- atm4_fit |>
  forecast(h = 31) |>
  filter(.model=='ARIMA')
atm4_fit |>
  select(ARIMA) |>
  gg_tsresiduals()

The autoplot of the model shows stationarity around the zero threshold with no distinct patterns with a few prominent outlier points. The lines in the ACF plot are mostly within the confidence intervals, with no distinct patterns. The histogram shows close to a normal distribution, despite the apex of the residuals not appearing to be at zero.

fc_atm4 |>
  autoplot(atm4) 

atm4_fit |> 
  select(.model = "ARIMA") |>
  augment() |>
  features(.innov, box_pierce, lag = 10, dof = 0)
## # A tibble: 1 × 3
##   .model bp_stat bp_pvalue
##   <chr>    <dbl>     <dbl>
## 1 .model    18.5    0.0469
atm4_fit |> 
  select(.model = "ARIMA") |>
  augment() |>
  features(.innov, ljung_box, lag = 10, dof = 0)
## # A tibble: 1 × 3
##   .model lb_stat lb_pvalue
##   <chr>    <dbl>     <dbl>
## 1 .model    19.0    0.0401

The p-values for each test is less than 0.05, which means that the residuals are distinguishable from a white noise series.

Putting it Altogether

atm_may_fc <- bind_rows(fc_atm1, fc_atm2, fc_atm3, fc_atm4)
write.xlsx(atm_may_fc, file = "/Users/mohamedhassan/Downloads/ATM624Data.xlsx",
      sheetName = "ATM May Forecast", append = FALSE)

Part B – Forecasting Power

Part B consists of a simple dataset of residential power usage for January 1998 until December 2013. Your assignment is to model these data and a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in Kilowatt hours, the rest is straight forward. Add this to your existing files above.

forecast_powerdf <- readxl::read_excel("/Users/mohamedhassan/Downloads/ResidentialCustomerForecastLoad-624.xlsx")
reactable(forecast_powerdf)
forecast_powerdf <- forecast_powerdf |>
  rename(Month = `YYYY-MMM`) |>
  mutate(Month = yearmonth(Month)) |>
  as_tsibble(index = Month) 
reactable(forecast_powerdf)
forecast_powerdf |>
  autoplot(KWH) 

forecast_powerdf |> 
  ggplot(aes(x = KWH)) +
  geom_histogram(bins = 30) +
  labs(title = "Residential Power Usage: January 1998 - December 2013") +
  theme(plot.title = element_text(hjust = 0.5))

The amount of KWH usage does not follow a normal distribution. There appears to be outliers, with a couple of months have over one million KWH, as well as one month with very low KWH usage. We can see that there is one missing value in the KWH column, as well as a distinct outlier after January 2010.

Missing Values

forecast_powerdf[!complete.cases(forecast_powerdf),]
## # A tsibble: 1 x 3 [1M]
##   CaseSequence    Month   KWH
##          <dbl>    <mth> <dbl>
## 1          861 2008 Sep    NA
colSums(is.na(forecast_powerdf))
## CaseSequence        Month          KWH 
##            0            0            1
forecast_powerdf <- forecast_powerdf %>%
  model(ARIMA(KWH)) %>%
  interpolate(forecast_powerdf)

Outliers

forecast_powerdf |>
  identify_outliers(KWH)
## # A tsibble: 1 x 4 [1M]
##      Month    KWH is.outlier is.extreme
##      <mth>  <dbl> <lgl>      <lgl>     
## 1 2010 Jul 770523 TRUE       FALSE
outliers <- forecast_powerdf %>%
  model(
    stl = STL(KWH ~ season(window = "periodic"), robust = TRUE)
  ) |>
  components()
outliers <- outliers |>
  filter(
    remainder < quantile(remainder, 0.25) - 3*IQR(remainder) |
    remainder > quantile(remainder, 0.75) + 3*IQR(remainder)
  )

outliers
## # A dable: 2 x 7 [1M]
## # Key:     .model [1]
## # :        KWH = trend + season_year + remainder
##   .model    Month     KWH    trend season_year remainder season_adjust
##   <chr>     <mth>   <dbl>    <dbl>       <dbl>     <dbl>         <dbl>
## 1 stl    2010 Jul  770523 6779462.    1266280. -7275219.      -495757.
## 2 stl    2013 Dec 9606304 7123483.    -402370.  2885192.     10008674.
forecast_powerdf$KWH[tsoutliers(ts(forecast_powerdf$KWH, frequency=12))$index] = tsoutliers(ts(forecast_powerdf$KWH, frequency=12))$replacements
forecast_powerdf |>
  autoplot(KWH) 

The outlier is no longer distinguishable, and the time series plot shows a regular up-and-down pattern.

forecast_powerdf |>
  gg_season(KWH, labels = "both") +
  labs(title = "Monthtly Residential Power Usage: January 1998 - December 2013") +
  theme(plot.title = element_text(hjust = 0.5))

forecast_powerdf |>
  gg_subseries(KWH) +
  labs(title = "Monthtly Residential Power Usage: January 1998 - December 2013") +
  theme(plot.title = element_text(hjust = 0.5))

Not surprisingly, residential power usage increases during the summer months due to the warmer, hotter weather, as well as during the colder, winter weather months. In recent years, however, the amount of residential power usage has increased.

Modeling

lambda_kwh <- forecast_powerdf |>
  features(KWH, features = guerrero) |>
  pull(lambda_guerrero)

kwh_fit <- forecast_powerdf |>
  model(
    "Additive_ETS" = ETS(KWH ~ error("A") + trend("A") + season("A")),
    "Multiplicative_ETS" = ETS(KWH ~ error("M") + trend("A") + season("M")),
    "Additive_Damped" = ETS(KWH ~ error("A") + trend("Ad") + season("A")),
    "SNAIVE" = SNAIVE(KWH),
    "Additive_BC_ETS" = ETS(box_cox(KWH,lambda_kwh) ~ error("A") + trend("A") + season("A")),
    "Multiplicative_BC_ETS" = ETS(box_cox(KWH,lambda_kwh) ~ error("M") + trend("A") + season("M")),
    "BC Transformed Additive Damped" = ETS(box_cox(KWH,lambda_kwh) ~ error("A") + trend("Ad") + season("A")),
    "BC_SNAIVE" = SNAIVE(box_cox(KWH,lambda_kwh)),
    "ARIMA" = ARIMA(KWH),
    "ARIMA_BC" = ARIMA(box_cox(KWH,lambda_kwh))
  )


left_join(glance(kwh_fit) |>
            select(.model:BIC), 
          accuracy(kwh_fit) |> 
            select(.model, RMSE, MAE)) |>
  arrange(AICc)
## # A tibble: 10 × 8
##    .model                      sigma2 log_lik    AIC   AICc    BIC   RMSE    MAE
##    <chr>                        <dbl>   <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
##  1 ARIMA_BC                  8.33e- 5    592. -1170. -1169. -1147. 5.67e5 4.32e5
##  2 BC Transformed Additive … 8.64e- 5    402.  -769.  -765.  -710. 5.83e5 4.38e5
##  3 Multiplicative_BC_ETS     2.27e- 6    401.  -768.  -765.  -713. 5.78e5 4.45e5
##  4 Additive_BC_ETS           8.78e- 5    400.  -767.  -763.  -711. 5.81e5 4.53e5
##  5 ARIMA                     3.41e+11  -2646.  5309.  5310.  5334. 5.54e5 4.07e5
##  6 Multiplicative_ETS        8.17e- 3  -3044.  6123.  6126.  6178. 5.84e5 4.42e5
##  7 Additive_ETS              3.79e+11  -3056.  6146.  6149.  6201. 5.90e5 4.45e5
##  8 Additive_Damped           3.84e+11  -3056.  6149.  6153.  6207. 5.91e5 4.48e5
##  9 SNAIVE                    6.05e+11     NA     NA     NA     NA  7.80e5 5.94e5
## 10 BC_SNAIVE                 1.42e- 4     NA     NA     NA     NA  7.80e5 5.94e5

When comparing the results of each model, ARIMA with the Box-Cox Transformation performed the best, having the lowest values in AIC, AICc, BIC, RMSE, and MAE measurements. For this reason, I decided to choose the ARIMA_BC model as the best fit model.

Best-Fit Model Exploration

kwh_fit %>% 
  select(.model = "ARIMA_BC") %>% 
  report()
## Series: KWH 
## Model: ARIMA(0,0,3)(2,1,0)[12] w/ drift 
## Transformation: box_cox(KWH, lambda_kwh) 
## 
## Coefficients:
##          ma1     ma2     ma3     sar1     sar2  constant
##       0.2675  0.0586  0.2237  -0.7101  -0.4124    0.0032
## s.e.  0.0736  0.0807  0.0669   0.0725   0.0743    0.0011
## 
## sigma^2 estimated as 8.327e-05:  log likelihood=591.85
## AIC=-1169.69   AICc=-1169.04   BIC=-1147.34
kwh_fc <- kwh_fit |>
  forecast(h = 12) |>
  filter(.model=='ARIMA_BC')
kwh_fit |>
  select(ARIMA_BC) |>
  gg_tsresiduals(lag=24)

The autoplot of the residuals appear stationary, with no distinguishable patterns. The ACF plot appears to show white noise, with no distinct patterns of oscillation and all values within the confidence intervals. The histogram shows a normal distribution of the residuals, with its center around zero a spread that’s not very wide.

kwh_fc |>
  autoplot(forecast_powerdf) 

kwh_fit |> 
  select(.model = "ARIMA_BC") |>
  augment() |>
  features(.innov, box_pierce, lag = 24, dof = 0)
## # A tibble: 1 × 3
##   .model bp_stat bp_pvalue
##   <chr>    <dbl>     <dbl>
## 1 .model    10.8     0.991
kwh_fit |> 
  select(.model = "ARIMA_BC") |>
  augment() |>
  features(.innov, ljung_box, lag = 24, dof = 0)
## # A tibble: 1 × 3
##   .model lb_stat lb_pvalue
##   <chr>    <dbl>     <dbl>
## 1 .model    11.7     0.983

The large p-values confirms that the residuals are not distinguishable from white noise.

kbl(kwh_fc) |>
kable_styling(latex_options="scale_down", c("striped", "hover", "condensed", full_width=F))
.model Month KWH .mean
ARIMA_BC 2014 Jan t(N(6.2, 8.3e-05)) 9516620
ARIMA_BC 2014 Feb t(N(6.2, 8.9e-05)) 8623095
ARIMA_BC 2014 Mar t(N(6.2, 9e-05)) 6612499
ARIMA_BC 2014 Apr t(N(6.2, 9.4e-05)) 5974879
ARIMA_BC 2014 May t(N(6.2, 9.4e-05)) 5915752
ARIMA_BC 2014 Jun t(N(6.2, 9.4e-05)) 8272424
ARIMA_BC 2014 Jul t(N(6.2, 9.4e-05)) 9572529
ARIMA_BC 2014 Aug t(N(6.2, 9.4e-05)) 10110546
ARIMA_BC 2014 Sep t(N(6.2, 9.4e-05)) 8536696
ARIMA_BC 2014 Oct t(N(6.2, 9.4e-05)) 5841829
ARIMA_BC 2014 Nov t(N(6.2, 9.4e-05)) 6125923
ARIMA_BC 2014 Dec t(N(6.2, 9.4e-05)) 7638758

Adding to File

write.xlsx(kwh_fc, file = "/Users/mohamedhassan/Downloads/ResidentialCustomerForecastLoad-624 (1).xlsx",
      sheetName = "2014 Power Usage Forecast", append = FALSE)