Required Libraries

library(fpp3)
library(tidyverse)
library(kableExtra)
library(janitor)
library(readxl)
library(psych)
library(cowplot)
library(DescTools)
library(latex2exp)

Part A – ATM Forecast

The goal is to forecast how much cash is taken out of four (4) different ATM machines for May 2010. We are given an Excel formatted spreadsheet containing three columns:

Import Excel Data

Initial Import

When the file was imported initially, the date was converted into a five digit code representing the original date. We also have inconsistencies with our column naming style as DATE and ATM are completely capitalized while Cash was not.

atm <- 
  read_excel("atm.xlsx") 

kbl(head(atm), 
    caption = "ATM Withdrawals") |>
  kable_classic(full_width = F, html_font = "Cambria")
ATM Withdrawals
DATE ATM Cash
39934 ATM1 96
39934 ATM2 107
39935 ATM1 82
39935 ATM2 89
39936 ATM1 85
39936 ATM2 90

When we look at the summary counts of the data, we see that there are some discrepancies.

describe_import <-
  read_excel("atm.xlsx", col_types = c('date', 'text', 'numeric')) |>
  clean_names() |>
  rename(name = atm) |>
  describe()

kbl(describe_import[2], 
    caption = "Variables") |>
  kable_classic(full_width = F, html_font = "Cambria")
Variables
n
date 1474
name* 1460
cash 1455

Grouping the atm variable, we can see that there are missing values for ATM1 and ATM2. We also see fourteen (14) missing values labeled as <NA>, where we have 14 days that do not have an atm label and no cash withdrawn. As these <NA>values provide us zero information they will be dropped. As for the 5 missing ATM withdrawal amounts, it is possible that these ATMs may have been offline and not giving cash out. If that is the case, we can insert zero for those days.

missing_values <-
  read_excel("atm.xlsx", col_types = c('date', 'text', 'numeric')) |>
  clean_names() |>
  rename(name = atm) |>
  group_by(name) |>
  summarise(total = sum(is.na(cash))) |>
  spread(name, total)

kbl(missing_values, 
    caption = "Missing Values") |>
  kable_classic(full_width = F, html_font = "Cambria")
Missing Values
ATM1 ATM2 ATM3 ATM4 <NA>
3 2 0 0 14

Standardizing the Import

Updating the read_excel function we set our columns to be read as specific data types. The column names were also formatted to be a similar format using the clean_names() function. The table was converted into a tsibble format to allow us to model our functions later on with the index = date and our key = ATM. We have decimal points in some of our values that indicate coins could have been given or that there is some reliability on the data. We will assume the data is valid. However, it does not help us to have such granularity, as we are focused on forecasting the dollar amount, so these values will be rounded. To avoid any confusion, the cash column was changed from representing amounts in hundreds to the whole number.

atm_ts <- 
  read_excel("atm.xlsx", col_types = c('date', 'text', 'numeric')) |>
  clean_names() |>
  mutate(date = as_date(date)) |>
  rename(name = atm) |>
  as_tsibble(index = date, key = name) |>
  arrange(date)

kbl(head(atm_ts), 
    caption = "ATM Withdrawals") |>
  kable_classic(full_width = F, html_font = "Cambria")
ATM Withdrawals
date name cash
2009-05-01 ATM1 96.0000
2009-05-01 ATM2 107.0000
2009-05-01 ATM3 0.0000
2009-05-01 ATM4 776.9934
2009-05-02 ATM1 82.0000
2009-05-02 ATM2 89.0000
atm_ts <-
  atm_ts |>
  mutate(cash = round(cash * 100, 0)) |>
  filter(!is.na(name))

kbl(head(atm_ts), 
    caption = "ATM Withdrawals") |>
  kable_classic(full_width = F, html_font = "Cambria")
ATM Withdrawals
date name cash
2009-05-01 ATM1 9600
2009-05-01 ATM2 10700
2009-05-01 ATM3 0
2009-05-01 ATM4 77699
2009-05-02 ATM1 8200
2009-05-02 ATM2 8900

Data Exploration

Time Series

ATM3 immediately shows us that there was no cash withdrawals until the most recent dates. This likely means that it was not in service yet and is not a major concern. ATM4 shows a large spike on February 9, 2010 with an amount of $1,091,976. It is extremely higher than any other value in all of our ATMs and highly unlikely this is correct. Most ATMs “typically hold between $20,000 to $50,000 in cash. Imposing withdrawal limits helps ensure ATMs have sufficient funds on hand to cover customers’ needs in between the times that they’re restocked.” (USA Today).

atm_colors <- c('#1170aa', '#fc7d0b', '#a3acb9', '#57606c')

names(atm_colors) <- c('ATM1', 'ATM2', 'ATM3', 'ATM4')

atm_ts |>
  ggplot(aes(date, cash, colour = name)) + 
  geom_line() +
  facet_wrap(~name, scales = "free", nrow = 4) +
  labs(title = "", x='') + 
  theme_bw() +
  scale_color_manual(values = atm_colors)

Boxplot

The boxplot further shows the extreme outlier within ATM4, and a several others on ATM1

atm_ts |>
  filter(name != "ATM3") |>
  ggplot(aes(x = cash, fill = name)) +
  geom_boxplot() + 
  facet_wrap(~ name, scales = "free_x", ncol=1) +
  scale_fill_manual(values = atm_colors) + 
  labs(title = "", x = "Cash", y = "Count") +
  theme_bw() +
  theme(legend.position = "none", legend.title=element_blank())

Data Manipulation

Missing Values

As mentioned before, we will assume these ATMs on these dates were offline and no cash was withdrawn. Given that, these N/A values will be replaced with zero.

  • Note: If these dates were indeed not offline, we could have used ARIMA’s interpolate() function to help fill in these missing values. This provides an estimated value approach based on an ARIMA best fit model, whereas back-filling or forward-filling these values would be less robust.
kbl(atm_ts |>
  filter(is.na(cash)), 
    caption = "Missing Withdrawals") |>
  kable_classic(full_width = F, html_font = "Cambria")
Missing Withdrawals
date name cash
2009-06-13 ATM1 NA
2009-06-16 ATM1 NA
2009-06-18 ATM2 NA
2009-06-22 ATM1 NA
2009-06-24 ATM2 NA
missing_values <-
  atm_ts |>
  filter(is.na(cash)) |>
  mutate(cash = 0)

atm_ts <-
  atm_ts |>
  filter(!is.na(cash)) |>
  bind_rows(missing_values)

Winsorizing

To handle outliers on all the ATMs, any value that is outside the 99th percentile will be replaced with its corresponding 1% or 99% value. This method of handling outliers is called winsorizing and allows us to not eliminate these values altogether. We can see how this treatment has changed our time series and boxplot graphs.

filter_atm3 <- 
  atm_ts |>
  as_tibble() |>
  filter(name != "ATM3")

winsorize_data <-
  filter_atm3 |>
  group_by(name) |>
  mutate(cash_winsor = Winsorize(cash, probs = c(0.01, 0.99)))

atm_3 <-
  atm_ts |>
  filter(name == "ATM3")

atm_ts <-
  winsorize_data |>
  as_tsibble(index = date, key = name) |>
  arrange(date) |>
  bind_rows(atm_3)
missing_values <-
  atm_ts |>
  filter(is.na(cash_winsor)) |>
  mutate(cash_winsor = if_else(is.na(cash), 0, cash))

atm_ts <-
  atm_ts |>
  filter(!is.na(cash_winsor)) |>
  bind_rows(missing_values) |>
  mutate(cash_winsor = round(cash_winsor, 0)) |>
  select(date, name, cash_winsor)
atm_ts |>
  ggplot(aes(date, cash_winsor, colour = name)) + 
  geom_line() +
  facet_wrap(~name, scales = "free", nrow = 4) +
  labs(title = "", x='') + 
  theme_bw() +
  scale_color_manual(values = atm_colors)

atm_ts |>
  filter(name != "ATM3") |>
  ggplot(aes(x = cash_winsor, fill = name)) +
  geom_boxplot() + 
  facet_wrap(~ name, scales = "free_x", ncol=1) +
  scale_fill_manual(values = atm_colors) + 
  labs(title = "", x = "Cash", y = "Count") +
  theme_bw() +
  theme(legend.position = "none", legend.title=element_blank())

Trend and Seasonality

ATM1

There is a seasonality (weekly) with no apparent trend. We can see the remainder has some large spikes but we can verify if it is white noise with an ACF plot.

atm_ts |>
  filter(name == 'ATM1') |>
  model(STL(cash_winsor)) |>
  components() |>
  autoplot()

The ACF plot indeed shows there is some severe autocorrelation every lag 7 that needs to be dealt with.

atm_ts |>
  filter(name == 'ATM1') |>
  ACF(cash_winsor) |>
  autoplot()

ATM2

There is a seasonality (weekly) with a marginal trend that has decreased until recent dates. We can see the remainder has some large spikes but we can verify if it is white noise with an ACF plot.

atm_ts |>
  filter(name == 'ATM2') |>
  model(STL(cash_winsor)) |>
  components() |>
  autoplot()

The ACF plot indeed shows there is some severe autocorrelation every lag 2, 5 and 7 that needs to be dealt with.

atm_ts |>
  filter(name == 'ATM2') |>
  ACF(cash_winsor) |>
  autoplot()

ATM3

We can weakly say there is a positive trend and no seasonality, however, given that there are only three (3) days worth of values, forecasting for ATM3 will use a NAIVE() method given the limited information we have for it.

atm_ts |>
  filter(name == 'ATM3') |>
  model(STL(cash_winsor)) |>
  components() |>
  autoplot()

ATM4

There is a seasonality (weekly) with no apparent trend. We can see the remainder has some large spikes but we can verify if it is white noise with an ACF plot.

atm_ts |>
  filter(name == 'ATM4') |>
  model(STL(cash_winsor)) |>
  components() |>
  autoplot()

The ACF plot indeed shows there is some autocorrelation every lag 7 that slowly decays and needs to be dealt with.

atm_ts |>
  filter(name == 'ATM4') |>
  ACF(cash_winsor) |>
  autoplot()

Modeling & Forecasting

For our seasonal data with ATM1, ATM2, and ATM4, we can look to minimize our autocorrelation and stationary issues with a seasonal, first and/or second order differencing and build ARIMA models. We can then compare how these ARIMA models stack against additive ETS models. Again, ATM3 will be using a simple NAIVE() method for forecasting as we only have three data points.

Looking at the unit root tests, ATM2 appears to only need a seasonal difference, while the others have kpss p-values over 0.05.

kpss <-
  atm_ts |>
  features(cash_winsor, unitroot_kpss)

n_diffs <-
  atm_ts |>
  features(cash_winsor, unitroot_ndiffs)

unit_root <-
  merge(kpss, n_diffs, by="name")

kbl(unit_root, 
    caption = "Unit Root Tests") |>
  kable_classic(full_width = F, html_font = "Cambria")
Unit Root Tests
name kpss_stat kpss_pvalue ndiffs
ATM1 0.3095347 0.1000000 0
ATM2 1.9434796 0.0100000 1
ATM3 0.3891606 0.0818273 0
ATM4 0.0965402 0.1000000 0

ATM1

We can see that our ARIMA model outperforms the ETS model, most importantly looking where the AICc is significantly lower. The ARIMA model chosen is AR(0, 0, 0)(0,1,1)[7]. We can also see that our model’s innovation residuals are white noise based on the Box-Pierce test p-value and our residual plots. The Box-Pierce test lag was set to 14 where \(l = 2m\) and \(m = 7\), given we have weekly seasonality.

atm1 <-
  atm_ts |>
  filter(name == 'ATM1')

lambda <- 
  atm1 |>
  features(cash_winsor, features = guerrero) |>
  pull(lambda_guerrero)

atm1_fit <-
  atm1 |>
  model(ETS = ETS(box_cox(cash_winsor,lambda) ~ error("A") + trend("N") + season("A")),
        ARIMA = ARIMA(box_cox(cash_winsor,lambda)))

benchmarks <-
  atm1_fit |>
  glance() |>
  select(.model:BIC)

kbl(benchmarks, 
    caption = "Benchmark Tests") |>
  kable_classic(full_width = F, html_font = "Cambria")
Benchmark Tests
.model sigma2 log_lik AIC AICc BIC
ETS 28752.76 -2945.809 5911.618 5912.240 5950.617
ARIMA 28526.70 -2345.966 4695.931 4695.965 4703.692
atm1_fit |>
  select(.model = "ARIMA") |>
  report()
## Series: cash_winsor 
## Model: ARIMA(0,0,0)(0,1,1)[7] 
## Transformation: box_cox(cash_winsor, lambda) 
## 
## Coefficients:
##          sma1
##       -0.6829
## s.e.   0.0411
## 
## sigma^2 estimated as 28527:  log likelihood=-2345.97
## AIC=4695.93   AICc=4695.96   BIC=4703.69
atm1_fit |>
  select(ARIMA) |>
  gg_tsresiduals() 

bp_test <-
  atm1_fit |>
  select(.model = "ARIMA") |>
  augment() |>
  features(.innov, box_pierce, lag = 14)

kbl(bp_test, 
    caption = "Box-Pierce Test") |>
  kable_classic(full_width = F, html_font = "Cambria")
Box-Pierce Test
.model bp_stat bp_pvalue
.model 27.27467 0.0177297

As we forecast ATM1, we see how the seasonality affects the future values, however we do have somewhat of a large prediction interval.

atm1_fc <- 
  atm1_fit |>
  forecast(h = 31) |>
  filter(.model=='ARIMA')

atm1_fc |>
  autoplot(atm1) +
  labs(title = TeX(paste0("ATM1: $AR(0,0,0)(0,1,1)[7]$ Forecast")),
       caption =  TeX(paste0("$\\lambda$ Trans = ", round(lambda,2))))

ATM2

ATM2 requires a seasonal differencing before we can model and forecast any values. We can see that lag 7 still has a high ACF and PACF where a moving average of MA(1) could be of use.

atm2 <-
  atm_ts |>
  filter(name == 'ATM2')

lambda <- 
  atm2 |>
  features(cash_winsor, features = guerrero) |>
  pull(lambda_guerrero)

atm2 |>
  autoplot(box_cox(cash_winsor, lambda)) +
  labs(title = latex2exp::TeX(paste0(
         "Transformed ATM2 with $\\lambda$ = ",
         round(lambda,3))), 
       y='')

atm2 |>
  gg_tsdisplay(difference(cash_winsor, lag=7), 
               plot_type="partial") +
  labs(title = "ATM2: Seasonal Differencing")

We will now use an ARIMA vs ETS model. We see that our AICc for our ARIMA model is the best choice. The ARIMA model ARIMA(0,0,0)(0,1,1)[7].

atm2_fit <-
  atm2 |>
  model(ETS = ETS(box_cox(cash_winsor,lambda) ~ error("A") + trend("N") + season("A")),
        ARIMA = ARIMA(box_cox(cash_winsor,lambda)))

benchmarks <-
  atm2_fit |>
  glance() |>
  select(.model:BIC)

kbl(benchmarks, 
    caption = "Benchmark Tests") |>
  kable_classic(full_width = F, html_font = "Cambria")
Benchmark Tests
.model sigma2 log_lik AIC AICc BIC
ETS 28715.35 -2945.572 5911.143 5911.765 5950.142
ARIMA 26762.42 -2332.363 4676.726 4676.965 4700.009
atm2_fit |>
  select(.model = "ARIMA") |>
  report()
## Series: cash_winsor 
## Model: ARIMA(2,0,2)(0,1,1)[7] 
## Transformation: box_cox(cash_winsor, lambda) 
## 
## Coefficients:
##           ar1      ar2     ma1     ma2     sma1
##       -0.4161  -0.8921  0.4555  0.7717  -0.7139
## s.e.   0.0599   0.0524  0.0885  0.0673   0.0416
## 
## sigma^2 estimated as 26762:  log likelihood=-2332.36
## AIC=4676.73   AICc=4676.96   BIC=4700.01
atm2_fit |>
  select(ARIMA) |>
  gg_tsresiduals() 

bp_test <-
  atm2_fit |>
  select(.model = "ARIMA") |>
  augment() |>
  features(.innov, box_pierce, lag = 14)

kbl(bp_test, 
    caption = "Box-Pierce Test") |>
  kable_classic(full_width = F, html_font = "Cambria")
Box-Pierce Test
.model bp_stat bp_pvalue
.model 9.549906 0.7942951

We see that our forecast as the same seasonal aspects of the past values. The prediction intervals are somewhat wide, but nothing drastic.

atm2_fc <- 
  atm2_fit |>
  forecast(h = 31) |>
  filter(.model=='ARIMA')

atm2_fc |>
  autoplot(atm2) +
  labs(title = TeX(paste0("ATM2: $AR(0,0,0)(0,1,1)[7]$ Forecast")),
       caption =  TeX(paste0("$\\lambda$ Trans = ", round(lambda,2))))

ATM3

ATM3 will be using a simple NAIVE() method for forecasting as we only have three data points. However, just to add some comparison we will use an ETS(A,N,N) model.

atm3 <-
  atm_ts |>
  filter(name == 'ATM3')

atm3_fit <-
  atm3 |>
  model(ETS = ETS(box_cox(cash_winsor,lambda) ~ error("A") + trend("N") + season("N")),
        NAIVE = NAIVE(cash_winsor))

benchmarks <-
  atm3_fit |>
  glance() |>
  select(.model:BIC)

kbl(benchmarks, 
    caption = "Benchmark Tests") |>
  kable_classic(full_width = F, html_font = "Cambria")
Benchmark Tests
.model sigma2 log_lik AIC AICc BIC
ETS 1680.576 -2431.136 4868.272 4868.339 4879.972
NAIVE 258984.879 NA NA NA NA
bp_test <-
  atm3_fit |>
  select(.model = "ETS") |>
  augment() |>
  features(.innov, box_pierce, lag = 14)

kbl(bp_test, 
    caption = "Box-Pierce Test") |>
  kable_classic(full_width = F, html_font = "Cambria")
Box-Pierce Test
.model bp_stat bp_pvalue
.model 0.1465766 1

We can see from our NAIVE() and ETS() models, the latter model has a much larger prediction interval, and a increasing slope. Given that we do not have much information of ATM withdrawals, it would be best to use a simple approach of NAIVE()

atm3_fc <- 
  atm3_fit |>
  forecast(h = 31)

atm3_fc |>
  autoplot(atm3) +
  labs(title = TeX(paste0("ATM3: ETS(A,N,N) & NAIVE Forecast")))

ATM4

We again see that our ARIMA model outperforms the ETS model where AICc is lower. The ARIMA model chosen is ARIMA(0,0,0)(2,0,0)[7] with mean. We can also see that our model’s innovation residuals are white noise based on the Box-Pierce test p-value and our residual plots.

atm4 <-
  atm_ts |>
  filter(name == 'ATM4')

lambda <- 
  atm4 |>
  features(cash_winsor, features = guerrero) |>
  pull(lambda_guerrero)

atm4_fit <-
  atm4 |>
  model(ETS = ETS(box_cox(cash_winsor,lambda) ~ error("A") + trend("N") + season("A")),
        ARIMA = ARIMA(box_cox(cash_winsor,lambda)))

benchmarks <-
  atm4_fit |>
  glance() |>
  select(.model:BIC)

kbl(benchmarks, 
    caption = "Benchmark Tests") |>
  kable_classic(full_width = F, html_font = "Cambria")
Benchmark Tests
.model sigma2 log_lik AIC AICc BIC
ETS 18267.53 -2863.026 5746.051 5746.673 5785.05
ARIMA 19163.81 -2316.470 4640.941 4641.052 4656.54
atm4_fit |>
  select(.model = "ARIMA") |>
  report()
## Series: cash_winsor 
## Model: ARIMA(0,0,0)(2,0,0)[7] w/ mean 
## Transformation: box_cox(cash_winsor, lambda) 
## 
## Coefficients:
##         sar1    sar2  constant
##       0.2111  0.1804  185.6600
## s.e.  0.0518  0.0522    7.0914
## 
## sigma^2 estimated as 19164:  log likelihood=-2316.47
## AIC=4640.94   AICc=4641.05   BIC=4656.54
atm4_fit |>
  select(ARIMA) |>
  gg_tsresiduals()

bp_test <-
  atm4_fit |>
  select(.model = "ARIMA") |>
  augment() |>
  features(.innov, box_pierce, lag = 14)

kbl(bp_test, 
    caption = "Box-Pierce Test") |>
  kable_classic(full_width = F, html_font = "Cambria")
Box-Pierce Test
.model bp_stat bp_pvalue
.model 14.51123 0.4123591

As we forecast ATM4, we see how our future values variance become narrower into the future. This is concerning as a point forecast because we do not see this happen previously. However we do have a substantially large prediction interval indicating that this model is having trouble covering the large possible future values and not a great model to use as is.

atm4_fc <- 
  atm4_fit |>
  forecast(h = 31) |>
  filter(.model=='ARIMA')

atm4_fc |>
  autoplot(atm4) +
  labs(title = TeX(paste0("ATM4: $AR(0,0,0)(2,0,0)[7]$ with Mean Forecast")),
       caption =  TeX(paste0("$\\lambda$ Trans = ", round(lambda,2))))

Export ATM Forecasts

export_atm1 <-
  atm1_fc %>%
  as_tibble |>
  select(date, name, .mean) |>
  rename(cash = .mean)

export_atm2 <-
  atm2_fc %>%
  as_tibble |>
  select(date, name, .mean) |>
  rename(cash = .mean)

export_atm3 <-
  atm3_fc %>%
  as_tibble |>
  select(date, name, .mean) |>
  rename(cash = .mean)

export_atm4 <-
  atm4_fc %>%
  as_tibble |>
  select(date, name, .mean) |>
  rename(cash = .mean)

export_atm <- 
  bind_rows(export_atm1, export_atm2, export_atm3, export_atm4)

write.csv(export_atm, 'atm_forecasts.csv', row.names = FALSE)

Part B – Power Consumption Forecast

The goal is to forecast how much power consumption will be used on a month basis for 2014. We are given an Excel formatted spreadsheet containing three columns:

Import Excel Data

Initial Import

In our initial import, we have CaseSequence which is not needed for this analysis and can be dropped. We also see our date format in Year-Month format and will change this to an appropriate datetime format for our time series.

power <- 
  read_excel("power.xlsx") 

kbl(head(power), 
    caption = "Power Consumption") |>
  kable_classic(full_width = F, html_font = "Cambria")
Power Consumption
CaseSequence YYYY-MMM KWH
733 1998-Jan 6862583
734 1998-Feb 5838198
735 1998-Mar 5420658
736 1998-Apr 5010364
737 1998-May 4665377
738 1998-Jun 6467147

Standardizing the Import

power <-
  power |>
  clean_names() |>
  rename(date = yyyy_mmm) |>
  mutate(date = yearmonth(date)) |>
  select(date, kwh) |>
  as_tsibble(index=date)

kbl(head(power), 
    caption = "Power Consumption",
    format.args = list(big.mark = ",")) |>
  kable_classic(full_width = F, html_font = "Cambria")
Power Consumption
date kwh
1998 Jan 6,862,583
1998 Feb 5,838,198
1998 Mar 5,420,658
1998 Apr 5,010,364
1998 May 4,665,377
1998 Jun 6,467,147

Data Exploration

Time Series

We viewing the time series we see a break around 2008-2009.

power |>
  autoplot(kwh)

Boxplot

When viewing our data on a boxplot, we see we have one value that is missing, as we saw on the time series. We also see an extreme outlier that is quite low compared to the other values we have.

power |>
  ggplot(aes(x = kwh)) +
  geom_boxplot()
## Warning: Removed 1 rows containing non-finite values (`stat_boxplot()`).

Data Manipulation

Missing Values

We have one value missing on September 2008. We will use ARIMA’s interpolate() function to help with replacing the NA to an estimated value.

kbl(power |>
  filter(is.na(kwh)), 
    caption = "Missing Power (1.1)") |>
  kable_classic(full_width = F, html_font = "Cambria")
Missing Power (1.1)
date kwh
2008 Sep NA
power <-
  power |>
  model(ARIMA(kwh)) |>
  interpolate(power)

kbl(power |>
  filter(date == yearmonth('2008 Sep')), 
    caption = "Missing Power (1.2)",
  format.args = list(big.mark = ",")) |>
  kable_classic(full_width = F, html_font = "Cambria")
Missing Power (1.2)
date kwh
2008 Sep 7,597,704

Winsorizing

We can see our outlier is quite low! It is less than 1 million when the mean is 6.5 million. This can either be attributed to a major power outage due to natural disasters, or an error on the collection of the data for the month. We will treat this outlier as abnormal and replace it by winsorizing it.

sum_stats <-
  describe(as_tibble(power)) |>
  select(mean, min, max) |>
  tail(1)

kbl(sum_stats, 
    caption = "Summary Statistics",
    format.args = list(big.mark = ",")) |>
  kable_classic(full_width = F, html_font = "Cambria")
Summary Statistics
mean min max
kwh 6,508,179 770,523 10,655,730

To handle the one outlier it will be replaced with its corresponding 1st percentile value.

power <-
  power |>
  mutate(kwh = Winsorize(kwh, probs = c(0.01, 0.99)))

We can confirm that our time series now has no breaks and that the outlier was handled with.

power |>
  autoplot(kwh)

power |>
  ggplot(aes(x = kwh)) +
  geom_boxplot()

Trend and Seasonality

There is a a trend upwards in power consumption after January 2010. As for seasonality we see a yearly seasonality pattern. We also see a deep trough in the remainder post January 2010.

power |>
  model(STL(kwh)) |>
  components() |>
  autoplot()

We can confirm seasonality where December-January and July-August has the highest consumption due to weather changes of winter and summer.

power |>
  gg_season(kwh) +
  scale_y_continuous(labels=scales::comma) +
  labs(title = "Seasonal Decomposition")

Modeling and Forecasting

We need to fix our seasonal stationary issue if we want to use an ARIMA model. When we run our unit root tests, we see we only need 1, which a seasonal differencing should be enough. We see we have lag 1 and 12 are high, but that coincides with out annual seasonality we were looking at before.

kpss <-
  power |>
  features(kwh, unitroot_kpss)

n_diffs <-
  power |>
  features(kwh, unitroot_ndiffs)

unit_root <-
  bind_cols(kpss, n_diffs)

kbl(unit_root, 
    caption = "Unit Root Tests") |>
  kable_classic(full_width = F, html_font = "Cambria")
Unit Root Tests
kpss_stat kpss_pvalue ndiffs
1.512599 0.01 1
power |>
  gg_tsdisplay(difference(kwh, lag=12), 
               plot_type="partial") +
  labs(title = "Power Consumption: Seasonal Differecing")

We can see we can transform the power seasonality with a \(\lambda = -0.137\)

lambda <- 
  power |>
  features(kwh, features = guerrero) |>
  pull(lambda_guerrero)

power |>
  autoplot(box_cox(kwh, lambda)) +
  labs(title = latex2exp::TeX(paste0(
         "Transformed Power with $\\lambda$ = ",
         round(lambda,3))), 
       y='')

We can see our ARIMA model once again has the best AICc. This is the model we will forecast with using ARIMA(2,0,2)(0,1,1)[7]

power_fit <-
  power |>
  model(ETS = ETS(box_cox(kwh,lambda) ~ error("A") + trend("N") + season("A")),
        ARIMA = ARIMA(box_cox(kwh,lambda)))

benchmarks <-
  power_fit |>
  glance() |>
  select(.model:BIC)

kbl(benchmarks, 
    caption = "Benchmark Tests") |>
  kable_classic(full_width = F, html_font = "Cambria")
Benchmark Tests
.model sigma2 log_lik AIC AICc BIC
ETS 0.0001470 349.7756 -669.5512 -666.8239 -620.6888
ARIMA 0.0001474 538.1746 -1066.3493 -1066.0044 -1050.3845
atm2_fit |>
  select(.model = "ARIMA") |>
  report()
## Series: cash_winsor 
## Model: ARIMA(2,0,2)(0,1,1)[7] 
## Transformation: box_cox(cash_winsor, lambda) 
## 
## Coefficients:
##           ar1      ar2     ma1     ma2     sma1
##       -0.4161  -0.8921  0.4555  0.7717  -0.7139
## s.e.   0.0599   0.0524  0.0885  0.0673   0.0416
## 
## sigma^2 estimated as 26762:  log likelihood=-2332.36
## AIC=4676.73   AICc=4676.96   BIC=4700.01

Our innovation residuals look fine, besides one outlier that makes it left-skewed

power_fit |>
  select(ARIMA) |>
  gg_tsresiduals() 

Our Box-Pierce test shows that the remainder is white noise

bp_test <-
  power_fit |>
  select(.model = "ARIMA") |>
  augment() |>
  features(.innov, box_pierce, lag = 14)

kbl(bp_test, 
    caption = "Box-Pierce Test") |>
  kable_classic(full_width = F, html_font = "Cambria")
Box-Pierce Test
.model bp_stat bp_pvalue
.model 11.77447 0.6244091

Our forecasted values looks like they provide decent point forcasts, with small ranges for the prediction intervals.

power_fc <- 
  power_fit |>
  forecast(h = 12) |>
  filter(.model=='ARIMA')

power_fc |>
  autoplot(power) +
  labs(title = TeX(paste0("Power Consumption: $AR(2,0,2)(0,1,1)[7]$ Forecast")),
       caption =  TeX(paste0("$\\lambda$ Trans = ", round(lambda,2))))

## Export Power Forecasts

export_power <-
  power_fc %>%
  as_tibble |>
  select(date, .mean) |>
  rename(kwh = .mean)

write.csv(export_power, 'power_forecasts.csv', row.names = FALSE)