ATM, Power & Water Forecasting

Required Libraries

library(fpp3)
library(tidyverse)
library(kableExtra)
library(janitor)
library(readxl)
library(psych)
library(cowplot)
library(DescTools)
library(latex2exp)

Part A – ATM Forecast

The goal is to forecast how much cash is taken out of four (4) different ATM machines for May 2010. We are given an Excel formatted spreadsheet containing three columns:

date: the date the money was withdrawn
atm: which ATM the money was withdrawn from
cash: the amount of money withdrawn from a particular ATM on a specific date (in hundreds).

Import Excel Data

Initial Import

When the file was imported initially, the date was converted into a five digit code representing the original date. We also have inconsistencies with our column naming style as DATE and ATM are completely capitalized while Cash was not.

atm <- 
  read_excel("atm.xlsx") 

kbl(head(atm), 
    caption = "ATM Withdrawals") |>
  kable_classic(full_width = F, html_font = "Cambria")

ATM Withdrawals
DATE	ATM	Cash
39934	ATM1	96
39934	ATM2	107
39935	ATM1	82
39935	ATM2	89
39936	ATM1	85
39936	ATM2	90

When we look at the summary counts of the data, we see that there are some discrepancies.

describe_import <-
  read_excel("atm.xlsx", col_types = c('date', 'text', 'numeric')) |>
  clean_names() |>
  rename(name = atm) |>
  describe()

kbl(describe_import[2], 
    caption = "Variables") |>
  kable_classic(full_width = F, html_font = "Cambria")

Variables
	n
date	1474
name*	1460
cash	1455

Grouping the atm variable, we can see that there are missing values for ATM1 and ATM2. We also see fourteen (14) missing values labeled as <NA>, where we have 14 days that do not have an atm label and no cash withdrawn. As these <NA>values provide us zero information they will be dropped. As for the 5 missing ATM withdrawal amounts, it is possible that these ATMs may have been offline and not giving cash out. If that is the case, we can insert zero for those days.

missing_values <-
  read_excel("atm.xlsx", col_types = c('date', 'text', 'numeric')) |>
  clean_names() |>
  rename(name = atm) |>
  group_by(name) |>
  summarise(total = sum(is.na(cash))) |>
  spread(name, total)

kbl(missing_values, 
    caption = "Missing Values") |>
  kable_classic(full_width = F, html_font = "Cambria")

Missing Values
ATM1	ATM2	ATM3	ATM4	<NA>
3	2	0	0	14

Standardizing the Import

Updating the read_excel function we set our columns to be read as specific data types. The column names were also formatted to be a similar format using the clean_names() function. The table was converted into a tsibble format to allow us to model our functions later on with the index = date and our key = ATM. We have decimal points in some of our values that indicate coins could have been given or that there is some reliability on the data. We will assume the data is valid. However, it does not help us to have such granularity, as we are focused on forecasting the dollar amount, so these values will be rounded. To avoid any confusion, the cash column was changed from representing amounts in hundreds to the whole number.

atm_ts <- 
  read_excel("atm.xlsx", col_types = c('date', 'text', 'numeric')) |>
  clean_names() |>
  mutate(date = as_date(date)) |>
  rename(name = atm) |>
  as_tsibble(index = date, key = name) |>
  arrange(date)

kbl(head(atm_ts), 
    caption = "ATM Withdrawals") |>
  kable_classic(full_width = F, html_font = "Cambria")

ATM Withdrawals
date	name	cash
2009-05-01	ATM1	96.0000
2009-05-01	ATM2	107.0000
2009-05-01	ATM3	0.0000
2009-05-01	ATM4	776.9934
2009-05-02	ATM1	82.0000
2009-05-02	ATM2	89.0000

atm_ts <-
  atm_ts |>
  mutate(cash = round(cash * 100, 0)) |>
  filter(!is.na(name))

kbl(head(atm_ts), 
    caption = "ATM Withdrawals") |>
  kable_classic(full_width = F, html_font = "Cambria")

ATM Withdrawals
date	name	cash
2009-05-01	ATM1	9600
2009-05-01	ATM2	10700
2009-05-01	ATM3	0
2009-05-01	ATM4	77699
2009-05-02	ATM1	8200
2009-05-02	ATM2	8900

Data Exploration

Time Series

ATM3 immediately shows us that there was no cash withdrawals until the most recent dates. This likely means that it was not in service yet and is not a major concern. ATM4 shows a large spike on February 9, 2010 with an amount of $1,091,976. It is extremely higher than any other value in all of our ATMs and highly unlikely this is correct. Most ATMs “typically hold between $20,000 to $50,000 in cash. Imposing withdrawal limits helps ensure ATMs have sufficient funds on hand to cover customers’ needs in between the times that they’re restocked.” (USA Today).

atm_colors <- c('#1170aa', '#fc7d0b', '#a3acb9', '#57606c')

names(atm_colors) <- c('ATM1', 'ATM2', 'ATM3', 'ATM4')

atm_ts |>
  ggplot(aes(date, cash, colour = name)) + 
  geom_line() +
  facet_wrap(~name, scales = "free", nrow = 4) +
  labs(title = "", x='') + 
  theme_bw() +
  scale_color_manual(values = atm_colors)

Boxplot

The boxplot further shows the extreme outlier within ATM4, and a several others on ATM1

atm_ts |>
  filter(name != "ATM3") |>
  ggplot(aes(x = cash, fill = name)) +
  geom_boxplot() + 
  facet_wrap(~ name, scales = "free_x", ncol=1) +
  scale_fill_manual(values = atm_colors) + 
  labs(title = "", x = "Cash", y = "Count") +
  theme_bw() +
  theme(legend.position = "none", legend.title=element_blank())

Data Manipulation

Missing Values

As mentioned before, we will assume these ATMs on these dates were offline and no cash was withdrawn. Given that, these N/A values will be replaced with zero.

Note: If these dates were indeed not offline, we could have used ARIMA’s interpolate() function to help fill in these missing values. This provides an estimated value approach based on an ARIMA best fit model, whereas back-filling or forward-filling these values would be less robust.

kbl(atm_ts |>
  filter(is.na(cash)), 
    caption = "Missing Withdrawals") |>
  kable_classic(full_width = F, html_font = "Cambria")

Missing Withdrawals
date	name	cash
2009-06-13	ATM1	NA
2009-06-16	ATM1	NA
2009-06-18	ATM2	NA
2009-06-22	ATM1	NA
2009-06-24	ATM2	NA

missing_values <-
  atm_ts |>
  filter(is.na(cash)) |>
  mutate(cash = 0)

atm_ts <-
  atm_ts |>
  filter(!is.na(cash)) |>
  bind_rows(missing_values)

Winsorizing

To handle outliers on all the ATMs, any value that is outside the 99th percentile will be replaced with its corresponding 1% or 99% value. This method of handling outliers is called winsorizing and allows us to not eliminate these values altogether. We can see how this treatment has changed our time series and boxplot graphs.

filter_atm3 <- 
  atm_ts |>
  as_tibble() |>
  filter(name != "ATM3")

winsorize_data <-
  filter_atm3 |>
  group_by(name) |>
  mutate(cash_winsor = Winsorize(cash, probs = c(0.01, 0.99)))

atm_3 <-
  atm_ts |>
  filter(name == "ATM3")

atm_ts <-
  winsorize_data |>
  as_tsibble(index = date, key = name) |>
  arrange(date) |>
  bind_rows(atm_3)

missing_values <-
  atm_ts |>
  filter(is.na(cash_winsor)) |>
  mutate(cash_winsor = if_else(is.na(cash), 0, cash))

atm_ts <-
  atm_ts |>
  filter(!is.na(cash_winsor)) |>
  bind_rows(missing_values) |>
  mutate(cash_winsor = round(cash_winsor, 0)) |>
  select(date, name, cash_winsor)

atm_ts |>
  ggplot(aes(date, cash_winsor, colour = name)) + 
  geom_line() +
  facet_wrap(~name, scales = "free", nrow = 4) +
  labs(title = "", x='') + 
  theme_bw() +
  scale_color_manual(values = atm_colors)

atm_ts |>
  filter(name != "ATM3") |>
  ggplot(aes(x = cash_winsor, fill = name)) +
  geom_boxplot() + 
  facet_wrap(~ name, scales = "free_x", ncol=1) +
  scale_fill_manual(values = atm_colors) + 
  labs(title = "", x = "Cash", y = "Count") +
  theme_bw() +
  theme(legend.position = "none", legend.title=element_blank())

Trend and Seasonality

ATM1

There is a seasonality (weekly) with no apparent trend. We can see the remainder has some large spikes but we can verify if it is white noise with an ACF plot.

atm_ts |>
  filter(name == 'ATM1') |>
  model(STL(cash_winsor)) |>
  components() |>
  autoplot()

The ACF plot indeed shows there is some severe autocorrelation every lag 7 that needs to be dealt with.

atm_ts |>
  filter(name == 'ATM1') |>
  ACF(cash_winsor) |>
  autoplot()

ATM2

There is a seasonality (weekly) with a marginal trend that has decreased until recent dates. We can see the remainder has some large spikes but we can verify if it is white noise with an ACF plot.

atm_ts |>
  filter(name == 'ATM2') |>
  model(STL(cash_winsor)) |>
  components() |>
  autoplot()

The ACF plot indeed shows there is some severe autocorrelation every lag 2, 5 and 7 that needs to be dealt with.

atm_ts |>
  filter(name == 'ATM2') |>
  ACF(cash_winsor) |>
  autoplot()

ATM3

We can weakly say there is a positive trend and no seasonality, however, given that there are only three (3) days worth of values, forecasting for ATM3 will use a NAIVE() method given the limited information we have for it.

atm_ts |>
  filter(name == 'ATM3') |>
  model(STL(cash_winsor)) |>
  components() |>
  autoplot()

ATM4

There is a seasonality (weekly) with no apparent trend. We can see the remainder has some large spikes but we can verify if it is white noise with an ACF plot.

atm_ts |>
  filter(name == 'ATM4') |>
  model(STL(cash_winsor)) |>
  components() |>
  autoplot()

The ACF plot indeed shows there is some autocorrelation every lag 7 that slowly decays and needs to be dealt with.

atm_ts |>
  filter(name == 'ATM4') |>
  ACF(cash_winsor) |>
  autoplot()

Modeling & Forecasting

For our seasonal data with ATM1, ATM2, and ATM4, we can look to minimize our autocorrelation and stationary issues with a seasonal, first and/or second order differencing and build ARIMA models. We can then compare how these ARIMA models stack against additive ETS models. Again, ATM3 will be using a simple NAIVE() method for forecasting as we only have three data points.

Looking at the unit root tests, ATM2 appears to only need a seasonal difference, while the others have kpss p-values over 0.05.

kpss <-
  atm_ts |>
  features(cash_winsor, unitroot_kpss)

n_diffs <-
  atm_ts |>
  features(cash_winsor, unitroot_ndiffs)

unit_root <-
  merge(kpss, n_diffs, by="name")

kbl(unit_root, 
    caption = "Unit Root Tests") |>
  kable_classic(full_width = F, html_font = "Cambria")

Unit Root Tests
name	kpss_stat	kpss_pvalue	ndiffs
ATM1	0.3095347	0.1000000	0
ATM2	1.9434796	0.0100000	1
ATM3	0.3891606	0.0818273	0
ATM4	0.0965402	0.1000000	0

ATM1

We can see that our ARIMA model outperforms the ETS model, most importantly looking where the AICc is significantly lower. The ARIMA model chosen is AR(0, 0, 0)(0,1,1)[7]. We can also see that our model’s innovation residuals are white noise based on the Box-Pierce test p-value and our residual plots. The Box-Pierce test lag was set to 14 where $l = 2m$ and $m = 7$, given we have weekly seasonality.

atm1 <-
  atm_ts |>
  filter(name == 'ATM1')

lambda <- 
  atm1 |>
  features(cash_winsor, features = guerrero) |>
  pull(lambda_guerrero)

atm1_fit <-
  atm1 |>
  model(ETS = ETS(box_cox(cash_winsor,lambda) ~ error("A") + trend("N") + season("A")),
        ARIMA = ARIMA(box_cox(cash_winsor,lambda)))

benchmarks <-
  atm1_fit |>
  glance() |>
  select(.model:BIC)

kbl(benchmarks, 
    caption = "Benchmark Tests") |>
  kable_classic(full_width = F, html_font = "Cambria")

Benchmark Tests
.model	sigma2	log_lik	AIC	AICc	BIC
ETS	28752.76	-2945.809	5911.618	5912.240	5950.617
ARIMA	28526.70	-2345.966	4695.931	4695.965	4703.692

atm1_fit |>
  select(.model = "ARIMA") |>
  report()

## Series: cash_winsor 
## Model: ARIMA(0,0,0)(0,1,1)[7] 
## Transformation: box_cox(cash_winsor, lambda) 
## 
## Coefficients:
##          sma1
##       -0.6829
## s.e.   0.0411
## 
## sigma^2 estimated as 28527:  log likelihood=-2345.97
## AIC=4695.93   AICc=4695.96   BIC=4703.69

atm1_fit |>
  select(ARIMA) |>
  gg_tsresiduals()

bp_test <-
  atm1_fit |>
  select(.model = "ARIMA") |>
  augment() |>
  features(.innov, box_pierce, lag = 14)

kbl(bp_test, 
    caption = "Box-Pierce Test") |>
  kable_classic(full_width = F, html_font = "Cambria")

Box-Pierce Test
.model	bp_stat	bp_pvalue
.model	27.27467	0.0177297

As we forecast ATM1, we see how the seasonality affects the future values, however we do have somewhat of a large prediction interval.

atm1_fc <- 
  atm1_fit |>
  forecast(h = 31) |>
  filter(.model=='ARIMA')

atm1_fc |>
  autoplot(atm1) +
  labs(title = TeX(paste0("ATM1: $AR(0,0,0)(0,1,1)[7]$ Forecast")),
       caption =  TeX(paste0("$\\lambda$ Trans = ", round(lambda,2))))

ATM2

ATM2 requires a seasonal differencing before we can model and forecast any values. We can see that lag 7 still has a high ACF and PACF where a moving average of MA(1) could be of use.

atm2 <-
  atm_ts |>
  filter(name == 'ATM2')

lambda <- 
  atm2 |>
  features(cash_winsor, features = guerrero) |>
  pull(lambda_guerrero)

atm2 |>
  autoplot(box_cox(cash_winsor, lambda)) +
  labs(title = latex2exp::TeX(paste0(
         "Transformed ATM2 with $\\lambda$ = ",
         round(lambda,3))), 
       y='')

atm2 |>
  gg_tsdisplay(difference(cash_winsor, lag=7), 
               plot_type="partial") +
  labs(title = "ATM2: Seasonal Differencing")

We will now use an ARIMA vs ETS model. We see that our AICc for our ARIMA model is the best choice. The ARIMA model ARIMA(0,0,0)(0,1,1)[7].

atm2_fit <-
  atm2 |>
  model(ETS = ETS(box_cox(cash_winsor,lambda) ~ error("A") + trend("N") + season("A")),
        ARIMA = ARIMA(box_cox(cash_winsor,lambda)))

benchmarks <-
  atm2_fit |>
  glance() |>
  select(.model:BIC)

kbl(benchmarks, 
    caption = "Benchmark Tests") |>
  kable_classic(full_width = F, html_font = "Cambria")

Benchmark Tests
.model	sigma2	log_lik	AIC	AICc	BIC
ETS	28715.35	-2945.572	5911.143	5911.765	5950.142
ARIMA	26762.42	-2332.363	4676.726	4676.965	4700.009

atm2_fit |>
  select(.model = "ARIMA") |>
  report()

## Series: cash_winsor 
## Model: ARIMA(2,0,2)(0,1,1)[7] 
## Transformation: box_cox(cash_winsor, lambda) 
## 
## Coefficients:
##           ar1      ar2     ma1     ma2     sma1
##       -0.4161  -0.8921  0.4555  0.7717  -0.7139
## s.e.   0.0599   0.0524  0.0885  0.0673   0.0416
## 
## sigma^2 estimated as 26762:  log likelihood=-2332.36
## AIC=4676.73   AICc=4676.96   BIC=4700.01

atm2_fit |>
  select(ARIMA) |>
  gg_tsresiduals()

bp_test <-
  atm2_fit |>
  select(.model = "ARIMA") |>
  augment() |>
  features(.innov, box_pierce, lag = 14)

kbl(bp_test, 
    caption = "Box-Pierce Test") |>
  kable_classic(full_width = F, html_font = "Cambria")

Box-Pierce Test
.model	bp_stat	bp_pvalue
.model	9.549906	0.7942951

We see that our forecast as the same seasonal aspects of the past values. The prediction intervals are somewhat wide, but nothing drastic.

atm2_fc <- 
  atm2_fit |>
  forecast(h = 31) |>
  filter(.model=='ARIMA')

atm2_fc |>
  autoplot(atm2) +
  labs(title = TeX(paste0("ATM2: $AR(0,0,0)(0,1,1)[7]$ Forecast")),
       caption =  TeX(paste0("$\\lambda$ Trans = ", round(lambda,2))))

ATM3

ATM3 will be using a simple NAIVE() method for forecasting as we only have three data points. However, just to add some comparison we will use an ETS(A,N,N) model.

atm3 <-
  atm_ts |>
  filter(name == 'ATM3')

atm3_fit <-
  atm3 |>
  model(ETS = ETS(box_cox(cash_winsor,lambda) ~ error("A") + trend("N") + season("N")),
        NAIVE = NAIVE(cash_winsor))

benchmarks <-
  atm3_fit |>
  glance() |>
  select(.model:BIC)

kbl(benchmarks, 
    caption = "Benchmark Tests") |>
  kable_classic(full_width = F, html_font = "Cambria")

Benchmark Tests
.model	sigma2	log_lik	AIC	AICc	BIC
ETS	1680.576	-2431.136	4868.272	4868.339	4879.972
NAIVE	258984.879	NA	NA	NA	NA

bp_test <-
  atm3_fit |>
  select(.model = "ETS") |>
  augment() |>
  features(.innov, box_pierce, lag = 14)

kbl(bp_test, 
    caption = "Box-Pierce Test") |>
  kable_classic(full_width = F, html_font = "Cambria")

Box-Pierce Test
.model	bp_stat	bp_pvalue
.model	0.1465766	1

We can see from our NAIVE() and ETS() models, the latter model has a much larger prediction interval, and a increasing slope. Given that we do not have much information of ATM withdrawals, it would be best to use a simple approach of NAIVE()

atm3_fc <- 
  atm3_fit |>
  forecast(h = 31)

atm3_fc |>
  autoplot(atm3) +
  labs(title = TeX(paste0("ATM3: ETS(A,N,N) & NAIVE Forecast")))

ATM4

We again see that our ARIMA model outperforms the ETS model where AICc is lower. The ARIMA model chosen is ARIMA(0,0,0)(2,0,0)[7] with mean. We can also see that our model’s innovation residuals are white noise based on the Box-Pierce test p-value and our residual plots.

atm4 <-
  atm_ts |>
  filter(name == 'ATM4')

lambda <- 
  atm4 |>
  features(cash_winsor, features = guerrero) |>
  pull(lambda_guerrero)

atm4_fit <-
  atm4 |>
  model(ETS = ETS(box_cox(cash_winsor,lambda) ~ error("A") + trend("N") + season("A")),
        ARIMA = ARIMA(box_cox(cash_winsor,lambda)))

benchmarks <-
  atm4_fit |>
  glance() |>
  select(.model:BIC)

kbl(benchmarks, 
    caption = "Benchmark Tests") |>
  kable_classic(full_width = F, html_font = "Cambria")

Benchmark Tests
.model	sigma2	log_lik	AIC	AICc	BIC
ETS	18267.53	-2863.026	5746.051	5746.673	5785.05
ARIMA	19163.81	-2316.470	4640.941	4641.052	4656.54

atm4_fit |>
  select(.model = "ARIMA") |>
  report()

## Series: cash_winsor 
## Model: ARIMA(0,0,0)(2,0,0)[7] w/ mean 
## Transformation: box_cox(cash_winsor, lambda) 
## 
## Coefficients:
##         sar1    sar2  constant
##       0.2111  0.1804  185.6600
## s.e.  0.0518  0.0522    7.0914
## 
## sigma^2 estimated as 19164:  log likelihood=-2316.47
## AIC=4640.94   AICc=4641.05   BIC=4656.54

atm4_fit |>
  select(ARIMA) |>
  gg_tsresiduals()

bp_test <-
  atm4_fit |>
  select(.model = "ARIMA") |>
  augment() |>
  features(.innov, box_pierce, lag = 14)

kbl(bp_test, 
    caption = "Box-Pierce Test") |>
  kable_classic(full_width = F, html_font = "Cambria")

Box-Pierce Test
.model	bp_stat	bp_pvalue
.model	14.51123	0.4123591

As we forecast ATM4, we see how our future values variance become narrower into the future. This is concerning as a point forecast because we do not see this happen previously. However we do have a substantially large prediction interval indicating that this model is having trouble covering the large possible future values and not a great model to use as is.

atm4_fc <- 
  atm4_fit |>
  forecast(h = 31) |>
  filter(.model=='ARIMA')

atm4_fc |>
  autoplot(atm4) +
  labs(title = TeX(paste0("ATM4: $AR(0,0,0)(2,0,0)[7]$ with Mean Forecast")),
       caption =  TeX(paste0("$\\lambda$ Trans = ", round(lambda,2))))

Export ATM Forecasts

export_atm1 <-
  atm1_fc %>%
  as_tibble |>
  select(date, name, .mean) |>
  rename(cash = .mean)

export_atm2 <-
  atm2_fc %>%
  as_tibble |>
  select(date, name, .mean) |>
  rename(cash = .mean)

export_atm3 <-
  atm3_fc %>%
  as_tibble |>
  select(date, name, .mean) |>
  rename(cash = .mean)

export_atm4 <-
  atm4_fc %>%
  as_tibble |>
  select(date, name, .mean) |>
  rename(cash = .mean)

export_atm <- 
  bind_rows(export_atm1, export_atm2, export_atm3, export_atm4)

write.csv(export_atm, 'atm_forecasts.csv', row.names = FALSE)

Part B – Power Consumption Forecast

The goal is to forecast how much power consumption will be used on a month basis for 2014. We are given an Excel formatted spreadsheet containing three columns:

case sequence: id of power consumption recorded
yyyy-mmm: the year-month of when the power was consumed
KWH: power consumption in Kilowatt hours

Import Excel Data

Initial Import

In our initial import, we have CaseSequence which is not needed for this analysis and can be dropped. We also see our date format in Year-Month format and will change this to an appropriate datetime format for our time series.

power <- 
  read_excel("power.xlsx") 

kbl(head(power), 
    caption = "Power Consumption") |>
  kable_classic(full_width = F, html_font = "Cambria")

Power Consumption
CaseSequence	YYYY-MMM	KWH
733	1998-Jan	6862583
734	1998-Feb	5838198
735	1998-Mar	5420658
736	1998-Apr	5010364
737	1998-May	4665377
738	1998-Jun	6467147

Standardizing the Import

power <-
  power |>
  clean_names() |>
  rename(date = yyyy_mmm) |>
  mutate(date = yearmonth(date)) |>
  select(date, kwh) |>
  as_tsibble(index=date)

kbl(head(power), 
    caption = "Power Consumption",
    format.args = list(big.mark = ",")) |>
  kable_classic(full_width = F, html_font = "Cambria")

Power Consumption
date	kwh
1998 Jan	6,862,583
1998 Feb	5,838,198
1998 Mar	5,420,658
1998 Apr	5,010,364
1998 May	4,665,377
1998 Jun	6,467,147

Data Exploration

Time Series

We viewing the time series we see a break around 2008-2009.

power |>
  autoplot(kwh)

Boxplot

When viewing our data on a boxplot, we see we have one value that is missing, as we saw on the time series. We also see an extreme outlier that is quite low compared to the other values we have.

power |>
  ggplot(aes(x = kwh)) +
  geom_boxplot()

## Warning: Removed 1 rows containing non-finite values (`stat_boxplot()`).

Data Manipulation

Missing Values

We have one value missing on September 2008. We will use ARIMA’s interpolate() function to help with replacing the NA to an estimated value.

kbl(power |>
  filter(is.na(kwh)), 
    caption = "Missing Power (1.1)") |>
  kable_classic(full_width = F, html_font = "Cambria")

Missing Power (1.1)
date	kwh
2008 Sep	NA

power <-
  power |>
  model(ARIMA(kwh)) |>
  interpolate(power)

kbl(power |>
  filter(date == yearmonth('2008 Sep')), 
    caption = "Missing Power (1.2)",
  format.args = list(big.mark = ",")) |>
  kable_classic(full_width = F, html_font = "Cambria")

Missing Power (1.2)
date	kwh
2008 Sep	7,597,704

Winsorizing

We can see our outlier is quite low! It is less than 1 million when the mean is 6.5 million. This can either be attributed to a major power outage due to natural disasters, or an error on the collection of the data for the month. We will treat this outlier as abnormal and replace it by winsorizing it.

sum_stats <-
  describe(as_tibble(power)) |>
  select(mean, min, max) |>
  tail(1)

kbl(sum_stats, 
    caption = "Summary Statistics",
    format.args = list(big.mark = ",")) |>
  kable_classic(full_width = F, html_font = "Cambria")

Summary Statistics
	mean	min	max
kwh	6,508,179	770,523	10,655,730

To handle the one outlier it will be replaced with its corresponding 1st percentile value.

power <-
  power |>
  mutate(kwh = Winsorize(kwh, probs = c(0.01, 0.99)))

We can confirm that our time series now has no breaks and that the outlier was handled with.

power |>
  autoplot(kwh)

power |>
  ggplot(aes(x = kwh)) +
  geom_boxplot()

Trend and Seasonality

There is a a trend upwards in power consumption after January 2010. As for seasonality we see a yearly seasonality pattern. We also see a deep trough in the remainder post January 2010.

power |>
  model(STL(kwh)) |>
  components() |>
  autoplot()

We can confirm seasonality where December-January and July-August has the highest consumption due to weather changes of winter and summer.

power |>
  gg_season(kwh) +
  scale_y_continuous(labels=scales::comma) +
  labs(title = "Seasonal Decomposition")

Modeling and Forecasting

We need to fix our seasonal stationary issue if we want to use an ARIMA model. When we run our unit root tests, we see we only need 1, which a seasonal differencing should be enough. We see we have lag 1 and 12 are high, but that coincides with out annual seasonality we were looking at before.

kpss <-
  power |>
  features(kwh, unitroot_kpss)

n_diffs <-
  power |>
  features(kwh, unitroot_ndiffs)

unit_root <-
  bind_cols(kpss, n_diffs)

kbl(unit_root, 
    caption = "Unit Root Tests") |>
  kable_classic(full_width = F, html_font = "Cambria")

Unit Root Tests
kpss_stat	kpss_pvalue	ndiffs
1.512599	0.01	1

power |>
  gg_tsdisplay(difference(kwh, lag=12), 
               plot_type="partial") +
  labs(title = "Power Consumption: Seasonal Differecing")

We can see we can transform the power seasonality with a $\lambda = -0.137$

lambda <- 
  power |>
  features(kwh, features = guerrero) |>
  pull(lambda_guerrero)

power |>
  autoplot(box_cox(kwh, lambda)) +
  labs(title = latex2exp::TeX(paste0(
         "Transformed Power with $\\lambda$ = ",
         round(lambda,3))), 
       y='')

We can see our ARIMA model once again has the best AICc. This is the model we will forecast with using ARIMA(2,0,2)(0,1,1)[7]

power_fit <-
  power |>
  model(ETS = ETS(box_cox(kwh,lambda) ~ error("A") + trend("N") + season("A")),
        ARIMA = ARIMA(box_cox(kwh,lambda)))

benchmarks <-
  power_fit |>
  glance() |>
  select(.model:BIC)

kbl(benchmarks, 
    caption = "Benchmark Tests") |>
  kable_classic(full_width = F, html_font = "Cambria")

Benchmark Tests
.model	sigma2	log_lik	AIC	AICc	BIC
ETS	0.0001470	349.7756	-669.5512	-666.8239	-620.6888
ARIMA	0.0001474	538.1746	-1066.3493	-1066.0044	-1050.3845

atm2_fit |>
  select(.model = "ARIMA") |>
  report()

## Series: cash_winsor 
## Model: ARIMA(2,0,2)(0,1,1)[7] 
## Transformation: box_cox(cash_winsor, lambda) 
## 
## Coefficients:
##           ar1      ar2     ma1     ma2     sma1
##       -0.4161  -0.8921  0.4555  0.7717  -0.7139
## s.e.   0.0599   0.0524  0.0885  0.0673   0.0416
## 
## sigma^2 estimated as 26762:  log likelihood=-2332.36
## AIC=4676.73   AICc=4676.96   BIC=4700.01

Our innovation residuals look fine, besides one outlier that makes it left-skewed

power_fit |>
  select(ARIMA) |>
  gg_tsresiduals()

Our Box-Pierce test shows that the remainder is white noise

bp_test <-
  power_fit |>
  select(.model = "ARIMA") |>
  augment() |>
  features(.innov, box_pierce, lag = 14)

kbl(bp_test, 
    caption = "Box-Pierce Test") |>
  kable_classic(full_width = F, html_font = "Cambria")

Box-Pierce Test
.model	bp_stat	bp_pvalue
.model	11.77447	0.6244091

Our forecasted values looks like they provide decent point forcasts, with small ranges for the prediction intervals.

power_fc <- 
  power_fit |>
  forecast(h = 12) |>
  filter(.model=='ARIMA')

power_fc |>
  autoplot(power) +
  labs(title = TeX(paste0("Power Consumption: $AR(2,0,2)(0,1,1)[7]$ Forecast")),
       caption =  TeX(paste0("$\\lambda$ Trans = ", round(lambda,2))))

## Export Power Forecasts

export_power <-
  power_fc %>%
  as_tibble |>
  select(date, .mean) |>
  rename(kwh = .mean)

write.csv(export_power, 'power_forecasts.csv', row.names = FALSE)

ATM, Power & Water Forecasting

John Cruz

2024-03-18

Required Libraries

Part A – ATM Forecast

Import Excel Data

Initial Import

Standardizing the Import

Data Exploration

Time Series

Boxplot

Data Manipulation

Missing Values

Winsorizing

Trend and Seasonality

ATM1

ATM2

ATM3

ATM4

Modeling & Forecasting

ATM1

ATM2

ATM3

ATM4

Export ATM Forecasts

Part B – Power Consumption Forecast

Import Excel Data

Initial Import

Standardizing the Import

Data Exploration

Time Series

Boxplot

Data Manipulation

Missing Values

Winsorizing

Trend and Seasonality

Modeling and Forecasting