DATA 624 Project 1

Author

Henock Montcho

Published

May 20, 2025

Part A – ATM Forecast, ATM624Data.xlsx

In part A, I want you to forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable ‘Cash’ is provided in hundreds of dollars, other than that it is straight forward. I am being somewhat ambiguous on purpose to make this have a little more business feeling. Explain and demonstrate your process, techniques used and not used, and your actual forecast. I am giving you data via an excel file, please provide your written report on your findings, visuals, discussion and your R code via an RPubs link along with the actual.rmd file. Also please submit the forecast which you will put in an Excel readable file.

Loading required packages:

library(tsibble)
library(readxl)
library(dplyr)
library(tidyr)
library(ggplot2)
library(forecast)
library(lubridate)
library(openxlsx)

Step 1: Loading and converting the data into time series format

setwd("C:/Users/month/Downloads")
atm <- read_excel("ATM624Data.xlsx", 
                  col_names = TRUE,
                  col_types = c('date', 'text', 'numeric'))  |>
  mutate(DATE = as_date(DATE))  |>
  as_tsibble(index = DATE, key = ATM)  |>
  filter(DATE < "2010-05-01")
str(atm)

tbl_ts [1,460 × 3] (S3: tbl_ts/tbl_df/tbl/data.frame)
 $ DATE: Date[1:1460], format: "2009-05-01" "2009-05-02" ...
 $ ATM : chr [1:1460] "ATM1" "ATM1" "ATM1" "ATM1" ...
 $ Cash: num [1:1460] 96 82 85 90 99 88 8 104 87 93 ...
 - attr(*, "key")= tibble [4 × 2] (S3: tbl_df/tbl/data.frame)
  ..$ ATM  : chr [1:4] "ATM1" "ATM2" "ATM3" "ATM4"
  ..$ .rows: list<int> [1:4] 
  .. ..$ : int [1:365] 1 2 3 4 5 6 7 8 9 10 ...
  .. ..$ : int [1:365] 366 367 368 369 370 371 372 373 374 375 ...
  .. ..$ : int [1:365] 731 732 733 734 735 736 737 738 739 740 ...
  .. ..$ : int [1:365] 1096 1097 1098 1099 1100 1101 1102 1103 1104 1105 ...
  .. ..@ ptype: int(0) 
  ..- attr(*, ".drop")= logi TRUE
 - attr(*, "index")= chr "DATE"
  ..- attr(*, "ordered")= logi TRUE
 - attr(*, "index2")= chr "DATE"
 - attr(*, "interval")= interval [1:1] 1D
  ..@ .regular: logi TRUE

summary(atm)

      DATE                ATM                 Cash        
 Min.   :2009-05-01   Length:1460        Min.   :    0.0  
 1st Qu.:2009-07-31   Class :character   1st Qu.:    0.5  
 Median :2009-10-30   Mode  :character   Median :   73.0  
 Mean   :2009-10-30                      Mean   :  155.6  
 3rd Qu.:2010-01-29                      3rd Qu.:  114.0  
 Max.   :2010-04-30                      Max.   :10919.8  
                                         NA's   :5

Comment: The ATM624Data.xlsx data was loaded, DATE was formatted into [yyyy-mm-dd]. The dates were filtered to only select dates from May 2009 to April 2010 because we will be forecasting May 2010’s cash withdrawals.

Step 2: Treating missing Cash observations

atm  |> 
  as.data.frame()  |>
  group_by(ATM)  |>
  summarise(`Missing Values` = sum(is.na(Cash)))

# A tibble: 4 × 2
  ATM   `Missing Values`
  <chr>            <int>
1 ATM1                 3
2 ATM2                 2
3 ATM3                 0
4 ATM4                 0

library(fable)
library(imputeTS)

# Estimate Cash for the missing values.
  atm <- na_interpolation(atm, option = "linear")

#double-checking for any missing values remaining
atm  |> 
  as.data.frame()  |>
  group_by(ATM)  |>
  summarise(`Missing Values` = sum(is.na(Cash)))

# A tibble: 4 × 2
  ATM   `Missing Values`
  <chr>            <int>
1 ATM1                 0
2 ATM2                 0
3 ATM3                 0
4 ATM4                 0

Comment: Three missing values were detected in ATM1 and two in ATM2. These five missing entries were resolved using interpolation.

Step 3: Exploratory Data Analysis, modeling and forecasting.

atm  |>
  autoplot(Cash) +
  facet_wrap(~ATM, scales = "free", nrow = 4) +
  labs(title = "ATM Before Removing Outliers")

Comment: Initially, it appears that ATM1 and ATM2 exhibit signs of seasonality, whereas ATM3 seems to have only become active toward the end of April 2010. Additionally, a significant outlier is noticeable in ATM4. Removing this outlier would be advisable to allow for more accurate data exploration.

ATM1 Exploration

library(fpp3)
#plot
atm  |>
  filter(ATM == "ATM1")  |>
  autoplot(Cash) +
  ggtitle("Non-tranformed ATM1")

# STL decomposition
atm  |> 
  filter(ATM == "ATM1")  |>
  model(
    STL(Cash ~ season(window = "periodic"), robust = TRUE))  |>
  components()  |>
  autoplot() +
  labs(title = "STL Decomposition for ATM1")

Comment: ATM1 does not exhibit a clear trend, but displays a weekly seasonal pattern characterized by a noticeable drop in cash withdrawals once per week. The remainder component appears random until March 2010, after which a pattern of increased variability becomes evident.

#lambda for box cox transformation
lambda <- atm  |>
  filter(ATM == "ATM1")  |>
  features(Cash, features = guerrero)  |>
  pull(lambda_guerrero)

atm1_fit <- atm  |>
  filter(ATM == "ATM1")  |>
  model(
    additive = ETS(Cash ~ error("A") + trend("N") + season("A")),
    multiplicative = ETS(Cash ~ error("M") + trend("N") + season("M")),
    snaive = SNAIVE(Cash),
    additive_ts = ETS(box_cox(Cash,lambda) ~ error("A") + trend("N") + season("A")),
    multiplicative_ts = ETS(box_cox(Cash,lambda) ~ error("M") + trend("N") + season("M")),
    snaive_ts = SNAIVE(box_cox(Cash,lambda)),
    ARIMA = ARIMA(box_cox(Cash,lambda))
  )

#stats for the models
left_join(glance(atm1_fit)  |>
            select(.model:BIC), 
          accuracy(atm1_fit)  |>
            select(.model, RMSE))  |>
  arrange(RMSE)

# A tibble: 7 × 7
  .model              sigma2 log_lik   AIC  AICc   BIC  RMSE
  <chr>                <dbl>   <dbl> <dbl> <dbl> <dbl> <dbl>
1 multiplicative_ts   0.0383  -1219. 2458. 2458. 2497.  23.3
2 additive          580.      -2234. 4487. 4488. 4526.  23.8
3 additive_ts         1.80    -1180. 2380. 2380. 2419.  24.9
4 ARIMA               1.76     -610. 1228. 1228. 1244.  24.9
5 multiplicative      0.134   -2273. 4566. 4567. 4605.  26.3
6 snaive_ts           2.48       NA    NA    NA    NA   27.7
7 snaive            772.         NA    NA    NA    NA   27.7

Comment: To determine the most suitable model, metrics such as RMSE, AIC, AICc, and BIC were considered. While the transformed multiplicative ETS model achieved the lowest RMSE, the ARIMA model outperformed others in terms of AIC, AICc, and BIC. As a result, the ARIMA model was selected for forecasting May 2010.

# report of the arima model
atm1_fit  |>
  select(.model = "ARIMA")  |>
  report()

Series: Cash 
Model: ARIMA(0,0,2)(0,1,1)[7] 
Transformation: box_cox(Cash, lambda) 

Coefficients:
         ma1      ma2     sma1
      0.1126  -0.1094  -0.6418
s.e.  0.0524   0.0520   0.0432

sigma^2 estimated as 1.764:  log likelihood=-609.99
AIC=1227.98   AICc=1228.09   BIC=1243.5

ARIMA(0,0,2)(0,1,1)[7] was modeled on ATM1.

library(latex2exp)
# forecasting
fc_atm1 <- atm1_fit  |>
  forecast(h = 31)  |>
  filter(.model=='ARIMA')

# forcasted plot
fc_atm1  |>
  autoplot(atm) +
  ggtitle(TeX(paste0("ATM 1 Forcasted with $ARIMA(0,0,2)(0,1,1)_7$ and $\\lambda$ = ",
         round(lambda,2))))

# residuals
atm1_fit  |>
  select(ARIMA)  |>
  gg_tsresiduals() +
  ggtitle(TeX(paste0("Residuals for $ARIMA(0,0,2)(0,1,1)_7$ with $\\lambda$ = ",
         round(lambda,2))))

#Box-Pierce test, ℓ=2m for seasonal data, m=7
atm1_fit  |>
  select(.model = "ARIMA")  |>
  augment()  |>
  features(.innov, box_pierce, lag = 14, dof = 0)

# A tibble: 1 × 3
  .model bp_stat bp_pvalue
  <chr>    <dbl>     <dbl>
1 .model    9.60     0.791

Comment: The residuals resemble white noise, though a few data points appear to cluster in the left tail of the distribution.

ATM2 Exploration

#plot
atm  |>
  filter(ATM == "ATM2")  |>
  autoplot(Cash) +
  ggtitle("Non-tranformed ATM2")

# STL decomposition
atm  |>
  filter(ATM == "ATM2")  |>
  model(STL(Cash ~ season(window = "periodic"), robust = TRUE))  |>
  components()  |>
  autoplot() +
  labs(title = "STL Decomposition for ATM2")

Comment: ATM2 exhibits a clear weekly seasonal pattern, marked by a sharp decline followed by a rapid increase in cash withdrawals once per week. The remainder component appears largely random until around February or March 2010, when variability becomes more pronounced. Additionally, the overall trend is slightly downward, with a noticeable spike and subsequent drop occurring during that same February/March period.

# lambda for box cox transformation
lambda <- atm  |>
  filter(ATM == "ATM2")  |> 
  features(Cash, features = guerrero)  |>
  pull(lambda_guerrero)

atm2_fit <- atm  |>
  filter(ATM == "ATM2")  |>
  model(
    additive = ETS(Cash ~ error("A") + trend("N") + season("A")),
    multiplicative = ETS(Cash ~ error("M") + trend("N") + season("M")),
    snaive = SNAIVE(Cash),
    additive_ts = ETS(box_cox(Cash,lambda) ~ error("A") + trend("N") + season("A")),
    multiplicative_ts = ETS(box_cox(Cash,lambda) ~ error("M") + trend("N") + season("M")),
    snaive_ts = SNAIVE(box_cox(Cash,lambda)),
    ARIMA = ARIMA(box_cox(Cash,lambda))
  )

# stats for the models
left_join(glance(atm2_fit)  |>
          select(.model:BIC), 
          accuracy(atm2_fit)  |>
            select(.model, RMSE))  |>
  arrange(RMSE)

# A tibble: 7 × 7
  .model             sigma2 log_lik   AIC  AICc   BIC  RMSE
  <chr>               <dbl>   <dbl> <dbl> <dbl> <dbl> <dbl>
1 ARIMA              68.2    -1263. 2539. 2539. 2562.  24.5
2 additive          648.     -2254. 4527. 4528. 4566.  25.1
3 additive_ts        72.4    -1854. 3728. 3728. 3767.  25.4
4 snaive_ts         101.        NA    NA    NA    NA   30.2
5 snaive            914.        NA    NA    NA    NA   30.2
6 multiplicative_ts   0.308  -2011. 4043. 4043. 4082.  34.1
7 multiplicative      0.424  -2399. 4818. 4819. 4857.  34.4

Comment: The ARIMA model has the lowest RMSE, AIC, AICc, and BIC. Therefore, the ARIMA model was chosen to forecast May 2010.

# report of the arima model
atm2_fit  |>
  select(.model = "ARIMA")  |>
  report()

Series: Cash 
Model: ARIMA(2,0,2)(0,1,1)[7] 
Transformation: box_cox(Cash, lambda) 

Coefficients:
          ar1      ar2     ma1     ma2     sma1
      -0.4256  -0.9051  0.4719  0.7967  -0.7271
s.e.   0.0588   0.0444  0.0888  0.0590   0.0406

sigma^2 estimated as 68.16:  log likelihood=-1263.33
AIC=2538.66   AICc=2538.9   BIC=2561.94

ARIMA(2,0,2)(0,1,1)[7] was modeled on ATM2.

# forecasting
fc_atm2 <- atm2_fit  |>
  forecast(h = 31)  |>
  filter(.model=='ARIMA')

# forcasted plot
fc_atm2  |>
  autoplot(atm) +
  ggtitle(TeX(paste0("ATM 2 Forcasted with $ARIMA(2,0,2)(0,1,1)_7$ and $\\lambda$ = ",
         round(lambda,2))))

# residuals
atm2_fit  |>
  select(ARIMA)  |>
  gg_tsresiduals() +
  ggtitle(TeX(paste0("Residuals for $ARIMA(2,0,2)(0,1,1)_7$ with $\\lambda$ = ",
         round(lambda,2))))

#Box-Pierce test, ℓ=2m for seasonal data, m=7
atm2_fit  |>
  select(.model = "ARIMA")  |>
  augment()  |>
  features(.innov, box_pierce, lag = 14, dof = 0)

# A tibble: 1 × 3
  .model bp_stat bp_pvalue
  <chr>    <dbl>     <dbl>
1 .model    10.4     0.732

Comment: The residuals appear to resemble white noise, with a few values falling into both the lower and upper tails of the distribution.

ATM3 Exploration

#plot
atm  |>
  filter(ATM == "ATM3")  |>
  autoplot(Cash) +
  ggtitle("Non-tranformed ATM3")

# mean model
atm3_fit <- atm  |>
  filter(ATM == "ATM3",
         Cash != 0)  |>
  model(MEAN(Cash))

# forecasting
fc_atm3 <- atm3_fit  |>
  forecast(h = 31) 

# forcasted plot
fc_atm3  |>
  autoplot(atm) +
  ggtitle("ATM 3 Forecasted with the MEAN() Model")

Comment: ATM3 differs from the other machines in that it only contains three recorded data points, which is insufficient for identifying trends, seasonality, or generating a reliable forecast for an entire month. As a practical alternative, using the average of these three values to estimate withdrawals for May would be the most reasonable approach.

ATM4 Exploration

atm  |>
  filter(ATM == "ATM4")  |>
  autoplot(Cash) +
  ggtitle("Non-tranformed ATM4")

# STL decomposition
dcmp_4 <- atm  |>
  filter(ATM == "ATM4")  |>
  model(STL(Cash ~ season(window = "periodic"), robust = TRUE))  |>
  components()

dcmp_4  |>
  autoplot() +
  labs(title = "STL Decomposition for ATM4")

# identifying the outliers
outliers <- dcmp_4  |>
  filter(remainder < quantile(remainder, 0.25) - 3*IQR(remainder) |
           remainder > quantile(remainder, 0.75) + 3*IQR(remainder))
outliers

# A dable: 2 x 8 [1D]
# Key:     ATM, .model [1]
# :        Cash = trend + season_week + remainder
  ATM   .model       DATE         Cash trend season_week remainder season_adjust
  <chr> <chr>        <date>      <dbl> <dbl>       <dbl>     <dbl>         <dbl>
1 ATM4  "STL(Cash ~… 2009-09-22  1712.  373.       -71.5     1411.         1784.
2 ATM4  "STL(Cash ~… 2010-02-09 10920.  279.       -71.5    10712.        10991.

Comment: ATM4 exhibits an outlier that will need to be removed.

library(dplyr)
atm_miss <- atm  |>
  #replace outliers
  mutate(Cash = replace(Cash, ATM == "ATM4" & DATE == "2010-02-09", NA)) 
  
atm <- atm_miss  |>
  # Fit ARIMA model to the data with missing values
  model(ARIMA(Cash))  |>
  # Estimate Cash for the missing values / outliers
  interpolate(atm_miss)

#plot
atm  |>
  filter(ATM == "ATM4")  |>
  autoplot(Cash) +
  ggtitle("ATM4 with No Outlier")

# STL decomposition
atm  |>
  filter(ATM == "ATM4")  |>
  model(STL(Cash ~ season(window = "periodic"), robust = TRUE))  |>
  components()  |>
  autoplot() +
  labs(title = "STL Decomposition for ATM4 with No Outlier")

Comment: There is a clear weekly pattern (seasonality) in the data, characterized by a sharp drop followed by a steep rise in the amount of cash withdrawn once each week. However, the gray bar in the seasonal component is relatively wide, indicating that this pattern has the least variation compared to the overall variability in the data.

# lambda for box cox transformation
lambda <- atm  |>
  filter(ATM == "ATM4")  |>
  features(Cash, features = guerrero)  |>
  pull(lambda_guerrero)

atm4_fit <- atm  |>
  filter(ATM == "ATM4")  |>
  model(
    # additive ETS model
    additive = ETS(Cash ~ error("A") + trend("N") + season("A")),
    # multiplicative ETS model
    multiplicative = ETS(Cash ~ error("M") + trend("N") + season("M")),
    # SNAIVE model
    snaive = SNAIVE(Cash),
    # transformed additive ETS model
    additive_ts = ETS(box_cox(Cash,lambda) ~ error("A") + trend("N") + season("A")),
    # transformed multiplicative ETS model
    multiplicative_ts = ETS(box_cox(Cash,lambda) ~ error("M") + trend("N") + season("M")),
    # transformed SNAIVE model
    snaive_ts = SNAIVE(box_cox(Cash,lambda)),
    # arima model
  ARIMA = ARIMA(box_cox(Cash,lambda)),
     # transformed additive ETS model, no seasonality
    additive_ts_no_s = ETS(box_cox(Cash,lambda) ~ error("A") + trend("N") + season("N")),
    # transformed multiplicative ETS model, no seasonality
    multiplicative_ts_no_s = ETS(box_cox(Cash,lambda) ~ error("M") + trend("N") + season("N"))
  )

# stats for the models
left_join(glance(atm4_fit)  |>
          select(.model:BIC), 
          accuracy(atm4_fit)  |>
          select(.model, RMSE))  |>
  arrange(RMSE)

# A tibble: 9 × 7
  .model                     sigma2 log_lik   AIC  AICc   BIC  RMSE
  <chr>                       <dbl>   <dbl> <dbl> <dbl> <dbl> <dbl>
1 additive               110883.     -3192. 6404. 6405. 6443.  329.
2 additive_ts               167.     -2006. 4032. 4032. 4071.  338.
3 multiplicative_ts           0.220  -2003. 4025. 4026. 4064.  345.
4 ARIMA                     176.     -1460. 2930. 2930. 2950.  352.
5 multiplicative              0.728  -3192. 6404. 6405. 6443.  354.
6 multiplicative_ts_no_s      0.238  -2039. 4084. 4084. 4096.  363.
7 additive_ts_no_s          196.     -2039. 4084. 4084. 4096.  363.
8 snaive                 205867.        NA    NA    NA    NA   453.
9 snaive_ts                 289.        NA    NA    NA    NA   453.

Comment: The ARIMA model has the lowest AIC, AICc, and BIC. Therefore, the ARIMA model was chosen to forecast May 2010.

# report of the arima model
atm4_fit  |>
  select(.model = "ARIMA")  |>
  report()

Series: Cash 
Model: ARIMA(0,0,1)(2,0,0)[7] w/ mean 
Transformation: box_cox(Cash, lambda) 

Coefficients:
         ma1    sar1    sar2  constant
      0.0792  0.2076  0.2026   16.8725
s.e.  0.0527  0.0516  0.0524    0.7311

sigma^2 estimated as 176.1:  log likelihood=-1460.16
AIC=2930.31   AICc=2930.48   BIC=2949.81

Comment: ARIMA(0,0,1)(2,0,0)[7] will be used to model ATM4.

# forecasting
fc_atm4 <- atm4_fit  |>
  forecast(h = 31)  |>
  filter(.model=='ARIMA')

# forcasted plot
fc_atm4  |>
  autoplot(atm) +
  ggtitle(TeX(paste0("ATM 4 Forcasted with $ARIMA(0,0,1)(2,0,0)_7$ and $\\lambda$ = ",
         round(lambda,2))))

# residuals
atm4_fit  |>
  select(ARIMA)  |>
  gg_tsresiduals() +
  ggtitle(TeX(paste0("Residuals for $ARIMA(0,0,1)(2,0,0)_7$ with $\\lambda$ = ",
         round(lambda,2))))

#Box-Pierce test, ℓ=2m for seasonal data, m=7
atm2_fit  |>
  select(.model = "ARIMA")  |>
  augment() %>% 
  features(.innov, box_pierce, lag = 14, dof = 0)

# A tibble: 1 × 3
  .model bp_stat bp_pvalue
  <chr>    <dbl>     <dbl>
1 .model    10.4     0.732

Comment: The residuals appear to resemble white noise, with a few values falling in the extreme ends of the distribution. While the forecasts may not seem to capture the underlying pattern effectively—evident in their tendency to revert toward the mean—the prediction intervals might still encompass the actual values for May 2010.

Step 4: Final Forecast

library(openxlsx)
# save as data frame
fc <- bind_rows(fc_atm1, fc_atm2, fc_atm3, fc_atm4)  |>
  as.data.frame()  |>
  select(DATE, ATM, .mean)  |>
  rename(Cash = .mean)

# export file
fc  |>
  write.xlsx("ATM_forecasts_Data_624.xlsx")

#plot of may 2010
fc  |>
  as_tsibble(index = DATE, key = ATM)  |>
  autoplot(Cash) +
  facet_wrap(~ATM, scales = "free", nrow = 4) +
  labs(title = "Forecasted ATM Withdrawls in May 2010")

#Summary plot
fc  |>
  as_tsibble(index = DATE, key = ATM)  |>
  autoplot(Cash) +
  autolayer(atm, Cash, colour = "black") +
  facet_wrap(~ATM, scales = "free", nrow = 4) +
  labs(title = "ATM Withdrawls")

Part B – Forecasting Power, ResidentialCustomerForecastLoad-624.xlsx

Part B consists of a simple dataset of residential power usage for January 1998 until December 2013. Your assignment is to model these data and a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in Kilowatt hours, the rest is straight forward. Add this to your existing files above.

Loading the data

setwd("C:/Users/month/Downloads")
resident_load <- read_excel("ResidentialCustomerForecastLoad-624.xlsx")  |>
  rename(Month = 'YYYY-MMM')  |>
  mutate(Month = yearmonth(Month))  |>
  as_tsibble(index = Month) 

head(resident_load)

# A tsibble: 6 x 3 [1M]
  CaseSequence      Month     KWH
         <dbl>      <mth>   <dbl>
1          733 1998 janv. 6862583
2          734 1998 févr. 5838198
3          735  1998 mars 5420658
4          736  1998 avr. 5010364
5          737   1998 mai 4665377
6          738  1998 juin 6467147

summary(resident_load$KWH)

    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
  770523  5429912  6283324  6502475  7620524 10655730        1

Comment: There is a missing observation for September 2008. An ARIMA model will be fitted on the data in order to interpolate the missing observation.

resident_load <- resident_load  |>
# Fit ARIMA model to the data containing missing values
  model(ARIMA(KWH))  |>
# Estimate Cash for the missing values.
  interpolate(resident_load)

# check to see if any missing values again
resident_load  |>
  as.data.frame()  |>
  summarise(`Missing Values` = sum(is.na(KWH)))

  Missing Values
1              0

Exploratory Data Analysis

resident_load  |>
  autoplot(KWH) +
  labs(title = "Residential Power Usage")

Comment: The data appears to exhibit seasonality along with a slight upward trend. An anomaly is noticeable in July 2010, where the recorded value is 770,523 (observed when excel file is sorted from smallest to largest)—likely a data entry error, possibly missing an extra digit. To address this outlier, the best approach is to remove it and apply interpolation to estimate a more accurate value.

residential_miss <- resident_load  |>
  #replace outliers with NA
  mutate(KWH = replace(KWH, KWH == 770523, NA)) 
  
resident_load <- residential_miss  |>
  # Fit ARIMA model to the data with missing values
  model(ARIMA(KWH))  |>
  # Estimate Cash for the missing values / outliers
  interpolate(residential_miss)

#plot
resident_load  |>
  autoplot(KWH) +
  ggtitle("Residential Power Usage with no Outlier")

# STL decomposition
resident_load  |>
  model(STL(KWH ~ season(window = "periodic"), robust = TRUE))  |>
  components()  |>
  autoplot() +
  labs(title = "STL Decomposition with no Outlier")

Comment: After removing the outlier, the graph becomes clearer and more interpretable. A gradual upward trend is evident, contributing the least to the overall variation in the data. The pattern also reveals annual seasonality, with noticeable peaks during the summer and winter months. The residuals appear random, though there is a noticeable spike in December 2013, which marks the end of the observation period.

resident_load  |>
  gg_season(KWH, labels = "both") +
  labs(title = "Seasonal plot: Residential Power Usage")

resident_load  |>
  gg_subseries(KWH) +
  labs(title = "Residential Power Usage")

Modeling and Forecasting

Comment: A Box-Cox transformation will be applied to certain models and their performance will be compared to models that were not transformed.

#lambda for box cox transformation
lambda <- resident_load  |>
  features(KWH, features = guerrero)  |>
  pull(lambda_guerrero)

res_fit <- resident_load  |>
  model(
    additive = ETS(KWH ~ error("A") + trend("A") + season("A")),
    multiplicative = ETS(KWH ~ error("M") + trend("A") + season("M")),
    damped = ETS(KWH ~ error("A") + trend("Ad") + season("A")),
    snaive = SNAIVE(KWH),
    additive_bc = ETS(box_cox(KWH,lambda) ~ error("A") + trend("A") + season("A")),
    multiplicative_bc = ETS(box_cox(KWH,lambda) ~ error("M") + trend("A") + season("M")),
    damped_bc = ETS(box_cox(KWH,lambda) ~ error("A") + trend("Ad") + season("A")),
    snaive_bc = SNAIVE(box_cox(KWH,lambda)),
    ARIMA = ARIMA(box_cox(KWH,lambda))
  )

# stats for the models
left_join(glance(res_fit)  |>
            select(.model:BIC), 
          accuracy(res_fit)  |>
          select(.model, RMSE))  |>
  arrange(AICc)

# A tibble: 9 × 7
  .model              sigma2 log_lik    AIC   AICc    BIC    RMSE
  <chr>                <dbl>   <dbl>  <dbl>  <dbl>  <dbl>   <dbl>
1 ARIMA             1.51e- 5    752. -1493. -1493. -1478. 627673.
2 additive_bc       1.40e- 5    577. -1120. -1116. -1064. 610956.
3 damped_bc         1.41e- 5    576. -1116. -1112. -1058. 618831.
4 multiplicative_bc 6.51e- 7    575. -1115. -1112. -1060. 625349.
5 multiplicative    9.00e- 3  -3053.  6140.  6144.  6196. 613581.
6 additive          4.19e+11  -3065.  6165.  6168.  6220. 619703.
7 damped            4.25e+11  -3066.  6169.  6173.  6227. 622538.
8 snaive            6.51e+11     NA     NA     NA     NA  809926.
9 snaive_bc         2.25e- 5     NA     NA     NA     NA  809926.

Comment: Unlike the other datasets, this one yielded negative values for AIC, AICc, and BIC. Although the additive exponential smoothing model with a Box-Cox transformation achieved the lowest RMSE, the ARIMA model had the most favorable AIC, AICc, and BIC scores. Therefore, the ARIMA model was selected as the most suitable choice.

# report of the arima model
res_fit  |>
  select(.model = "ARIMA")  |>
  report()

Series: KWH 
Model: ARIMA(0,0,1)(2,1,0)[12] w/ drift 
Transformation: box_cox(KWH, lambda) 

Coefficients:
         ma1     sar1     sar2  constant
      0.2462  -0.7202  -0.3546    0.0013
s.e.  0.0833   0.0751   0.0769    0.0004

sigma^2 estimated as 1.505e-05:  log likelihood=751.74
AIC=-1493.47   AICc=-1493.13   BIC=-1477.51

Comment: ARIMA(0,0,1)(2,1,0)[12] w/ drift

# forecasting
fc_res <- res_fit  |>
  forecast(h = 12)  |>
  filter(.model=='ARIMA')

# forcasted plot
fc_res  |>
  autoplot(resident_load) +
  ggtitle(TeX(paste0("ATM 1 Forcasted with $(0,0,1)(2,1,0)_{12}$ and $\\lambda$ = ",
         round(lambda,2))))

# residuals
res_fit  |>
  select(ARIMA)  |>
  gg_tsresiduals(lag = 24) +
  ggtitle(TeX(paste0("Residuals for $(0,0,1)(2,1,0)_{12}$ with $\\lambda$ = ",
         round(lambda,2))))

#Box-Pierce test, ℓ=2m for seasonal data, m=12
res_fit  |>
  select(.model = "ARIMA")  |>
  augment()  |>
  features(.innov, box_pierce, lag = 24, dof = 0)

# A tibble: 1 × 3
  .model bp_stat bp_pvalue
  <chr>    <dbl>     <dbl>
1 .model    29.9     0.187

Comment: The residuals appear to resemble white noise, as indicated by both the diagnostic plots and the results of the Box-Pierce test. Interestingly, the initial residuals seem to remain constant for a short period.

Final Forecasting using the ARIMA model:

fc_res <- fc_res  |>
  as.data.frame()  |>
  select(Month, .mean)  |>
  rename(KWH = .mean)  |>
  # adding back the case sequence
  mutate(CaseSequence = 925:936)  |>
  # rearranging order
  relocate(CaseSequence)
fc_res

   CaseSequence      Month      KWH
1           925 2014 janv. 10297282
2           926 2014 févr.  8531331
3           927  2014 mars  6650696
4           928  2014 avr.  5974089
5           929   2014 mai  5942940
6           930  2014 juin  8291370
7           931 2014 juil.  9539978
8           932  2014 août 10113059
9           933 2014 sept.  8474487
10          934  2014 oct.  5849784
11          935  2014 nov.  6112937
12          936  2014 déc.  8253088

library(openxlsx)
# export file
fc_res  |>
  write.xlsx("Residential_KWH_forecasts.xlsx")