library(fpp3)
library(tidyverse)
library(kableExtra)
library(janitor)
library(readxl)
library(psych)
library(cowplot)
library(DescTools)
library(latex2exp)
The goal is to forecast how much cash is taken out of four (4) different ATM machines for May 2010. We are given an Excel formatted spreadsheet containing three columns:
When the file was imported initially, the date was converted into a
five digit code representing the original date. We also have
inconsistencies with our column naming style as DATE
and
ATM
are completely capitalized while Cash
was
not.
atm <-
read_excel("atm.xlsx")
kbl(head(atm),
caption = "ATM Withdrawals") |>
kable_classic(full_width = F, html_font = "Cambria")
DATE | ATM | Cash |
---|---|---|
39934 | ATM1 | 96 |
39934 | ATM2 | 107 |
39935 | ATM1 | 82 |
39935 | ATM2 | 89 |
39936 | ATM1 | 85 |
39936 | ATM2 | 90 |
When we look at the summary counts of the data, we see that there are some discrepancies.
describe_import <-
read_excel("atm.xlsx", col_types = c('date', 'text', 'numeric')) |>
clean_names() |>
rename(name = atm) |>
describe()
kbl(describe_import[2],
caption = "Variables") |>
kable_classic(full_width = F, html_font = "Cambria")
n | |
---|---|
date | 1474 |
name* | 1460 |
cash | 1455 |
Grouping the atm
variable, we can see that there are
missing values for ATM1
and ATM2
. We also see
fourteen (14) missing values labeled as <NA>
, where
we have 14 days that do not have an atm
label and no
cash
withdrawn. As these <NA>
values
provide us zero information they will be dropped. As for the 5 missing
ATM withdrawal amounts, it is possible that these ATMs may have been
offline and not giving cash out. If that is the case, we can insert zero
for those days.
missing_values <-
read_excel("atm.xlsx", col_types = c('date', 'text', 'numeric')) |>
clean_names() |>
rename(name = atm) |>
group_by(name) |>
summarise(total = sum(is.na(cash))) |>
spread(name, total)
kbl(missing_values,
caption = "Missing Values") |>
kable_classic(full_width = F, html_font = "Cambria")
ATM1 | ATM2 | ATM3 | ATM4 | <NA> |
---|---|---|---|---|
3 | 2 | 0 | 0 | 14 |
Updating the read_excel
function we set our columns to
be read as specific data types. The column names were also formatted to
be a similar format using the clean_names()
function. The
table was converted into a tsibble format to allow us to model our
functions later on with the index = date
and our
key = ATM
. We have decimal points in some of our values
that indicate coins could have been given or that there is some
reliability on the data. We will assume the data is valid. However, it
does not help us to have such granularity, as we are focused on
forecasting the dollar amount, so these values will be rounded. To avoid
any confusion, the cash
column was changed from
representing amounts in hundreds to the whole
number.
atm_ts <-
read_excel("atm.xlsx", col_types = c('date', 'text', 'numeric')) |>
clean_names() |>
mutate(date = as_date(date)) |>
rename(name = atm) |>
as_tsibble(index = date, key = name) |>
arrange(date)
kbl(head(atm_ts),
caption = "ATM Withdrawals") |>
kable_classic(full_width = F, html_font = "Cambria")
date | name | cash |
---|---|---|
2009-05-01 | ATM1 | 96.0000 |
2009-05-01 | ATM2 | 107.0000 |
2009-05-01 | ATM3 | 0.0000 |
2009-05-01 | ATM4 | 776.9934 |
2009-05-02 | ATM1 | 82.0000 |
2009-05-02 | ATM2 | 89.0000 |
atm_ts <-
atm_ts |>
mutate(cash = round(cash * 100, 0)) |>
filter(!is.na(name))
kbl(head(atm_ts),
caption = "ATM Withdrawals") |>
kable_classic(full_width = F, html_font = "Cambria")
date | name | cash |
---|---|---|
2009-05-01 | ATM1 | 9600 |
2009-05-01 | ATM2 | 10700 |
2009-05-01 | ATM3 | 0 |
2009-05-01 | ATM4 | 77699 |
2009-05-02 | ATM1 | 8200 |
2009-05-02 | ATM2 | 8900 |
ATM3
immediately shows us that there was no cash
withdrawals until the most recent dates. This likely means that it was
not in service yet and is not a major concern. ATM4
shows a
large spike on February 9, 2010 with an amount of $1,091,976. It is
extremely higher than any other value in all of our ATMs and highly
unlikely this is correct. Most ATMs “typically hold between $20,000
to $50,000 in cash. Imposing withdrawal limits helps ensure ATMs have
sufficient funds on hand to cover customers’ needs in between the times
that they’re restocked.” (USA
Today).
atm_colors <- c('#1170aa', '#fc7d0b', '#a3acb9', '#57606c')
names(atm_colors) <- c('ATM1', 'ATM2', 'ATM3', 'ATM4')
atm_ts |>
ggplot(aes(date, cash, colour = name)) +
geom_line() +
facet_wrap(~name, scales = "free", nrow = 4) +
labs(title = "", x='') +
theme_bw() +
scale_color_manual(values = atm_colors)
The boxplot further shows the extreme outlier within
ATM4
, and a several others on ATM1
atm_ts |>
filter(name != "ATM3") |>
ggplot(aes(x = cash, fill = name)) +
geom_boxplot() +
facet_wrap(~ name, scales = "free_x", ncol=1) +
scale_fill_manual(values = atm_colors) +
labs(title = "", x = "Cash", y = "Count") +
theme_bw() +
theme(legend.position = "none", legend.title=element_blank())
As mentioned before, we will assume these ATMs on these dates were
offline and no cash was withdrawn. Given that, these N/A
values will be replaced with zero.
interpolate()
function to help fill
in these missing values. This provides an estimated value approach based
on an ARIMA best fit model, whereas back-filling or forward-filling
these values would be less robust.kbl(atm_ts |>
filter(is.na(cash)),
caption = "Missing Withdrawals") |>
kable_classic(full_width = F, html_font = "Cambria")
date | name | cash |
---|---|---|
2009-06-13 | ATM1 | NA |
2009-06-16 | ATM1 | NA |
2009-06-18 | ATM2 | NA |
2009-06-22 | ATM1 | NA |
2009-06-24 | ATM2 | NA |
missing_values <-
atm_ts |>
filter(is.na(cash)) |>
mutate(cash = 0)
atm_ts <-
atm_ts |>
filter(!is.na(cash)) |>
bind_rows(missing_values)
To handle outliers on all the ATMs, any value that is outside the 99th percentile will be replaced with its corresponding 1% or 99% value. This method of handling outliers is called winsorizing and allows us to not eliminate these values altogether. We can see how this treatment has changed our time series and boxplot graphs.
filter_atm3 <-
atm_ts |>
as_tibble() |>
filter(name != "ATM3")
winsorize_data <-
filter_atm3 |>
group_by(name) |>
mutate(cash_winsor = Winsorize(cash, probs = c(0.01, 0.99)))
atm_3 <-
atm_ts |>
filter(name == "ATM3")
atm_ts <-
winsorize_data |>
as_tsibble(index = date, key = name) |>
arrange(date) |>
bind_rows(atm_3)
missing_values <-
atm_ts |>
filter(is.na(cash_winsor)) |>
mutate(cash_winsor = if_else(is.na(cash), 0, cash))
atm_ts <-
atm_ts |>
filter(!is.na(cash_winsor)) |>
bind_rows(missing_values) |>
mutate(cash_winsor = round(cash_winsor, 0)) |>
select(date, name, cash_winsor)
atm_ts |>
ggplot(aes(date, cash_winsor, colour = name)) +
geom_line() +
facet_wrap(~name, scales = "free", nrow = 4) +
labs(title = "", x='') +
theme_bw() +
scale_color_manual(values = atm_colors)
atm_ts |>
filter(name != "ATM3") |>
ggplot(aes(x = cash_winsor, fill = name)) +
geom_boxplot() +
facet_wrap(~ name, scales = "free_x", ncol=1) +
scale_fill_manual(values = atm_colors) +
labs(title = "", x = "Cash", y = "Count") +
theme_bw() +
theme(legend.position = "none", legend.title=element_blank())
There is a seasonality (weekly) with no apparent trend. We can see the remainder has some large spikes but we can verify if it is white noise with an ACF plot.
atm_ts |>
filter(name == 'ATM1') |>
model(STL(cash_winsor)) |>
components() |>
autoplot()
The ACF plot indeed shows there is some severe autocorrelation every lag 7 that needs to be dealt with.
atm_ts |>
filter(name == 'ATM1') |>
ACF(cash_winsor) |>
autoplot()
There is a seasonality (weekly) with a marginal trend that has decreased until recent dates. We can see the remainder has some large spikes but we can verify if it is white noise with an ACF plot.
atm_ts |>
filter(name == 'ATM2') |>
model(STL(cash_winsor)) |>
components() |>
autoplot()
The ACF plot indeed shows there is some severe autocorrelation every lag 2, 5 and 7 that needs to be dealt with.
atm_ts |>
filter(name == 'ATM2') |>
ACF(cash_winsor) |>
autoplot()
We can weakly say there is a positive trend and no seasonality,
however, given that there are only three (3) days worth of values,
forecasting for ATM3
will use a NAIVE()
method
given the limited information we have for it.
atm_ts |>
filter(name == 'ATM3') |>
model(STL(cash_winsor)) |>
components() |>
autoplot()
There is a seasonality (weekly) with no apparent trend. We can see the remainder has some large spikes but we can verify if it is white noise with an ACF plot.
atm_ts |>
filter(name == 'ATM4') |>
model(STL(cash_winsor)) |>
components() |>
autoplot()
The ACF plot indeed shows there is some autocorrelation every lag 7 that slowly decays and needs to be dealt with.
atm_ts |>
filter(name == 'ATM4') |>
ACF(cash_winsor) |>
autoplot()
For our seasonal data with ATM1
, ATM2
, and
ATM4
, we can look to minimize our autocorrelation and
stationary issues with a seasonal, first and/or second order
differencing and build ARIMA models. We can then compare how these ARIMA
models stack against additive ETS models. Again, ATM3
will
be using a simple NAIVE()
method for forecasting as we only
have three data points.
Looking at the unit root tests, ATM2
appears to only
need a seasonal difference, while the others have kpss p-values over
0.05.
kpss <-
atm_ts |>
features(cash_winsor, unitroot_kpss)
n_diffs <-
atm_ts |>
features(cash_winsor, unitroot_ndiffs)
unit_root <-
merge(kpss, n_diffs, by="name")
kbl(unit_root,
caption = "Unit Root Tests") |>
kable_classic(full_width = F, html_font = "Cambria")
name | kpss_stat | kpss_pvalue | ndiffs |
---|---|---|---|
ATM1 | 0.3095347 | 0.1000000 | 0 |
ATM2 | 1.9434796 | 0.0100000 | 1 |
ATM3 | 0.3891606 | 0.0818273 | 0 |
ATM4 | 0.0965402 | 0.1000000 | 0 |
We can see that our ARIMA model outperforms the ETS model, most importantly looking where the AICc is significantly lower. The ARIMA model chosen is AR(0, 0, 0)(0,1,1)[7]. We can also see that our model’s innovation residuals are white noise based on the Box-Pierce test p-value and our residual plots. The Box-Pierce test lag was set to 14 where \(l = 2m\) and \(m = 7\), given we have weekly seasonality.
atm1 <-
atm_ts |>
filter(name == 'ATM1')
lambda <-
atm1 |>
features(cash_winsor, features = guerrero) |>
pull(lambda_guerrero)
atm1_fit <-
atm1 |>
model(ETS = ETS(box_cox(cash_winsor,lambda) ~ error("A") + trend("N") + season("A")),
ARIMA = ARIMA(box_cox(cash_winsor,lambda)))
benchmarks <-
atm1_fit |>
glance() |>
select(.model:BIC)
kbl(benchmarks,
caption = "Benchmark Tests") |>
kable_classic(full_width = F, html_font = "Cambria")
.model | sigma2 | log_lik | AIC | AICc | BIC |
---|---|---|---|---|---|
ETS | 28752.76 | -2945.809 | 5911.618 | 5912.240 | 5950.617 |
ARIMA | 28526.70 | -2345.966 | 4695.931 | 4695.965 | 4703.692 |
atm1_fit |>
select(.model = "ARIMA") |>
report()
## Series: cash_winsor
## Model: ARIMA(0,0,0)(0,1,1)[7]
## Transformation: box_cox(cash_winsor, lambda)
##
## Coefficients:
## sma1
## -0.6829
## s.e. 0.0411
##
## sigma^2 estimated as 28527: log likelihood=-2345.97
## AIC=4695.93 AICc=4695.96 BIC=4703.69
atm1_fit |>
select(ARIMA) |>
gg_tsresiduals()
bp_test <-
atm1_fit |>
select(.model = "ARIMA") |>
augment() |>
features(.innov, box_pierce, lag = 14)
kbl(bp_test,
caption = "Box-Pierce Test") |>
kable_classic(full_width = F, html_font = "Cambria")
.model | bp_stat | bp_pvalue |
---|---|---|
.model | 27.27467 | 0.0177297 |
As we forecast ATM1
, we see how the seasonality affects
the future values, however we do have somewhat of a large prediction
interval.
atm1_fc <-
atm1_fit |>
forecast(h = 31) |>
filter(.model=='ARIMA')
atm1_fc |>
autoplot(atm1) +
labs(title = TeX(paste0("ATM1: $AR(0,0,0)(0,1,1)[7]$ Forecast")),
caption = TeX(paste0("$\\lambda$ Trans = ", round(lambda,2))))
ATM2
requires a seasonal differencing before we can
model and forecast any values. We can see that lag 7 still has a high
ACF and PACF where a moving average of MA(1) could be of use.
atm2 <-
atm_ts |>
filter(name == 'ATM2')
lambda <-
atm2 |>
features(cash_winsor, features = guerrero) |>
pull(lambda_guerrero)
atm2 |>
autoplot(box_cox(cash_winsor, lambda)) +
labs(title = latex2exp::TeX(paste0(
"Transformed ATM2 with $\\lambda$ = ",
round(lambda,3))),
y='')
atm2 |>
gg_tsdisplay(difference(cash_winsor, lag=7),
plot_type="partial") +
labs(title = "ATM2: Seasonal Differencing")
We will now use an ARIMA vs ETS model. We see that our AICc for our ARIMA model is the best choice. The ARIMA model ARIMA(0,0,0)(0,1,1)[7].
atm2_fit <-
atm2 |>
model(ETS = ETS(box_cox(cash_winsor,lambda) ~ error("A") + trend("N") + season("A")),
ARIMA = ARIMA(box_cox(cash_winsor,lambda)))
benchmarks <-
atm2_fit |>
glance() |>
select(.model:BIC)
kbl(benchmarks,
caption = "Benchmark Tests") |>
kable_classic(full_width = F, html_font = "Cambria")
.model | sigma2 | log_lik | AIC | AICc | BIC |
---|---|---|---|---|---|
ETS | 28715.35 | -2945.572 | 5911.143 | 5911.765 | 5950.142 |
ARIMA | 26762.42 | -2332.363 | 4676.726 | 4676.965 | 4700.009 |
atm2_fit |>
select(.model = "ARIMA") |>
report()
## Series: cash_winsor
## Model: ARIMA(2,0,2)(0,1,1)[7]
## Transformation: box_cox(cash_winsor, lambda)
##
## Coefficients:
## ar1 ar2 ma1 ma2 sma1
## -0.4161 -0.8921 0.4555 0.7717 -0.7139
## s.e. 0.0599 0.0524 0.0885 0.0673 0.0416
##
## sigma^2 estimated as 26762: log likelihood=-2332.36
## AIC=4676.73 AICc=4676.96 BIC=4700.01
atm2_fit |>
select(ARIMA) |>
gg_tsresiduals()
bp_test <-
atm2_fit |>
select(.model = "ARIMA") |>
augment() |>
features(.innov, box_pierce, lag = 14)
kbl(bp_test,
caption = "Box-Pierce Test") |>
kable_classic(full_width = F, html_font = "Cambria")
.model | bp_stat | bp_pvalue |
---|---|---|
.model | 9.549906 | 0.7942951 |
We see that our forecast as the same seasonal aspects of the past values. The prediction intervals are somewhat wide, but nothing drastic.
atm2_fc <-
atm2_fit |>
forecast(h = 31) |>
filter(.model=='ARIMA')
atm2_fc |>
autoplot(atm2) +
labs(title = TeX(paste0("ATM2: $AR(0,0,0)(0,1,1)[7]$ Forecast")),
caption = TeX(paste0("$\\lambda$ Trans = ", round(lambda,2))))
ATM3
will be using a simple NAIVE()
method
for forecasting as we only have three data points. However, just to add
some comparison we will use an ETS(A,N,N)
model.
atm3 <-
atm_ts |>
filter(name == 'ATM3')
atm3_fit <-
atm3 |>
model(ETS = ETS(box_cox(cash_winsor,lambda) ~ error("A") + trend("N") + season("N")),
NAIVE = NAIVE(cash_winsor))
benchmarks <-
atm3_fit |>
glance() |>
select(.model:BIC)
kbl(benchmarks,
caption = "Benchmark Tests") |>
kable_classic(full_width = F, html_font = "Cambria")
.model | sigma2 | log_lik | AIC | AICc | BIC |
---|---|---|---|---|---|
ETS | 1680.576 | -2431.136 | 4868.272 | 4868.339 | 4879.972 |
NAIVE | 258984.879 | NA | NA | NA | NA |
bp_test <-
atm3_fit |>
select(.model = "ETS") |>
augment() |>
features(.innov, box_pierce, lag = 14)
kbl(bp_test,
caption = "Box-Pierce Test") |>
kable_classic(full_width = F, html_font = "Cambria")
.model | bp_stat | bp_pvalue |
---|---|---|
.model | 0.1465766 | 1 |
We can see from our NAIVE()
and ETS()
models, the latter model has a much larger prediction interval, and a
increasing slope. Given that we do not have much information of ATM
withdrawals, it would be best to use a simple approach of
NAIVE()
atm3_fc <-
atm3_fit |>
forecast(h = 31)
atm3_fc |>
autoplot(atm3) +
labs(title = TeX(paste0("ATM3: ETS(A,N,N) & NAIVE Forecast")))
We again see that our ARIMA model outperforms the ETS model where AICc is lower. The ARIMA model chosen is ARIMA(0,0,0)(2,0,0)[7] with mean. We can also see that our model’s innovation residuals are white noise based on the Box-Pierce test p-value and our residual plots.
atm4 <-
atm_ts |>
filter(name == 'ATM4')
lambda <-
atm4 |>
features(cash_winsor, features = guerrero) |>
pull(lambda_guerrero)
atm4_fit <-
atm4 |>
model(ETS = ETS(box_cox(cash_winsor,lambda) ~ error("A") + trend("N") + season("A")),
ARIMA = ARIMA(box_cox(cash_winsor,lambda)))
benchmarks <-
atm4_fit |>
glance() |>
select(.model:BIC)
kbl(benchmarks,
caption = "Benchmark Tests") |>
kable_classic(full_width = F, html_font = "Cambria")
.model | sigma2 | log_lik | AIC | AICc | BIC |
---|---|---|---|---|---|
ETS | 18267.53 | -2863.026 | 5746.051 | 5746.673 | 5785.05 |
ARIMA | 19163.81 | -2316.470 | 4640.941 | 4641.052 | 4656.54 |
atm4_fit |>
select(.model = "ARIMA") |>
report()
## Series: cash_winsor
## Model: ARIMA(0,0,0)(2,0,0)[7] w/ mean
## Transformation: box_cox(cash_winsor, lambda)
##
## Coefficients:
## sar1 sar2 constant
## 0.2111 0.1804 185.6600
## s.e. 0.0518 0.0522 7.0914
##
## sigma^2 estimated as 19164: log likelihood=-2316.47
## AIC=4640.94 AICc=4641.05 BIC=4656.54
atm4_fit |>
select(ARIMA) |>
gg_tsresiduals()
bp_test <-
atm4_fit |>
select(.model = "ARIMA") |>
augment() |>
features(.innov, box_pierce, lag = 14)
kbl(bp_test,
caption = "Box-Pierce Test") |>
kable_classic(full_width = F, html_font = "Cambria")
.model | bp_stat | bp_pvalue |
---|---|---|
.model | 14.51123 | 0.4123591 |
As we forecast ATM4
, we see how our future values
variance become narrower into the future. This is concerning as a point
forecast because we do not see this happen previously. However we do
have a substantially large prediction interval indicating that this
model is having trouble covering the large possible future values and
not a great model to use as is.
atm4_fc <-
atm4_fit |>
forecast(h = 31) |>
filter(.model=='ARIMA')
atm4_fc |>
autoplot(atm4) +
labs(title = TeX(paste0("ATM4: $AR(0,0,0)(2,0,0)[7]$ with Mean Forecast")),
caption = TeX(paste0("$\\lambda$ Trans = ", round(lambda,2))))
export_atm1 <-
atm1_fc %>%
as_tibble |>
select(date, name, .mean) |>
rename(cash = .mean)
export_atm2 <-
atm2_fc %>%
as_tibble |>
select(date, name, .mean) |>
rename(cash = .mean)
export_atm3 <-
atm3_fc %>%
as_tibble |>
select(date, name, .mean) |>
rename(cash = .mean)
export_atm4 <-
atm4_fc %>%
as_tibble |>
select(date, name, .mean) |>
rename(cash = .mean)
export_atm <-
bind_rows(export_atm1, export_atm2, export_atm3, export_atm4)
write.csv(export_atm, 'atm_forecasts.csv', row.names = FALSE)
The goal is to forecast how much power consumption will be used on a month basis for 2014. We are given an Excel formatted spreadsheet containing three columns:
In our initial import, we have CaseSequence
which is not
needed for this analysis and can be dropped. We also see our date format
in Year-Month format and will change this to an appropriate datetime
format for our time series.
power <-
read_excel("power.xlsx")
kbl(head(power),
caption = "Power Consumption") |>
kable_classic(full_width = F, html_font = "Cambria")
CaseSequence | YYYY-MMM | KWH |
---|---|---|
733 | 1998-Jan | 6862583 |
734 | 1998-Feb | 5838198 |
735 | 1998-Mar | 5420658 |
736 | 1998-Apr | 5010364 |
737 | 1998-May | 4665377 |
738 | 1998-Jun | 6467147 |
power <-
power |>
clean_names() |>
rename(date = yyyy_mmm) |>
mutate(date = yearmonth(date)) |>
select(date, kwh) |>
as_tsibble(index=date)
kbl(head(power),
caption = "Power Consumption",
format.args = list(big.mark = ",")) |>
kable_classic(full_width = F, html_font = "Cambria")
date | kwh |
---|---|
1998 Jan | 6,862,583 |
1998 Feb | 5,838,198 |
1998 Mar | 5,420,658 |
1998 Apr | 5,010,364 |
1998 May | 4,665,377 |
1998 Jun | 6,467,147 |
We viewing the time series we see a break around 2008-2009.
power |>
autoplot(kwh)
When viewing our data on a boxplot, we see we have one value that is missing, as we saw on the time series. We also see an extreme outlier that is quite low compared to the other values we have.
power |>
ggplot(aes(x = kwh)) +
geom_boxplot()
## Warning: Removed 1 rows containing non-finite values (`stat_boxplot()`).
We have one value missing on September 2008
. We will use
ARIMA’s interpolate()
function to help with replacing the
NA
to an estimated value.
kbl(power |>
filter(is.na(kwh)),
caption = "Missing Power (1.1)") |>
kable_classic(full_width = F, html_font = "Cambria")
date | kwh |
---|---|
2008 Sep | NA |
power <-
power |>
model(ARIMA(kwh)) |>
interpolate(power)
kbl(power |>
filter(date == yearmonth('2008 Sep')),
caption = "Missing Power (1.2)",
format.args = list(big.mark = ",")) |>
kable_classic(full_width = F, html_font = "Cambria")
date | kwh |
---|---|
2008 Sep | 7,597,704 |
We can see our outlier is quite low! It is less than 1 million when the mean is 6.5 million. This can either be attributed to a major power outage due to natural disasters, or an error on the collection of the data for the month. We will treat this outlier as abnormal and replace it by winsorizing it.
sum_stats <-
describe(as_tibble(power)) |>
select(mean, min, max) |>
tail(1)
kbl(sum_stats,
caption = "Summary Statistics",
format.args = list(big.mark = ",")) |>
kable_classic(full_width = F, html_font = "Cambria")
mean | min | max | |
---|---|---|---|
kwh | 6,508,179 | 770,523 | 10,655,730 |
To handle the one outlier it will be replaced with its corresponding 1st percentile value.
power <-
power |>
mutate(kwh = Winsorize(kwh, probs = c(0.01, 0.99)))
We can confirm that our time series now has no breaks and that the outlier was handled with.
power |>
autoplot(kwh)
power |>
ggplot(aes(x = kwh)) +
geom_boxplot()
There is a a trend upwards in power consumption after January 2010. As for seasonality we see a yearly seasonality pattern. We also see a deep trough in the remainder post January 2010.
power |>
model(STL(kwh)) |>
components() |>
autoplot()
We can confirm seasonality where December-January and July-August has the highest consumption due to weather changes of winter and summer.
power |>
gg_season(kwh) +
scale_y_continuous(labels=scales::comma) +
labs(title = "Seasonal Decomposition")
We need to fix our seasonal stationary issue if we want to use an ARIMA model. When we run our unit root tests, we see we only need 1, which a seasonal differencing should be enough. We see we have lag 1 and 12 are high, but that coincides with out annual seasonality we were looking at before.
kpss <-
power |>
features(kwh, unitroot_kpss)
n_diffs <-
power |>
features(kwh, unitroot_ndiffs)
unit_root <-
bind_cols(kpss, n_diffs)
kbl(unit_root,
caption = "Unit Root Tests") |>
kable_classic(full_width = F, html_font = "Cambria")
kpss_stat | kpss_pvalue | ndiffs |
---|---|---|
1.512599 | 0.01 | 1 |
power |>
gg_tsdisplay(difference(kwh, lag=12),
plot_type="partial") +
labs(title = "Power Consumption: Seasonal Differecing")
We can see we can transform the power seasonality with a \(\lambda = -0.137\)
lambda <-
power |>
features(kwh, features = guerrero) |>
pull(lambda_guerrero)
power |>
autoplot(box_cox(kwh, lambda)) +
labs(title = latex2exp::TeX(paste0(
"Transformed Power with $\\lambda$ = ",
round(lambda,3))),
y='')
We can see our ARIMA model once again has the best AICc. This is the model we will forecast with using ARIMA(2,0,2)(0,1,1)[7]
power_fit <-
power |>
model(ETS = ETS(box_cox(kwh,lambda) ~ error("A") + trend("N") + season("A")),
ARIMA = ARIMA(box_cox(kwh,lambda)))
benchmarks <-
power_fit |>
glance() |>
select(.model:BIC)
kbl(benchmarks,
caption = "Benchmark Tests") |>
kable_classic(full_width = F, html_font = "Cambria")
.model | sigma2 | log_lik | AIC | AICc | BIC |
---|---|---|---|---|---|
ETS | 0.0001470 | 349.7756 | -669.5512 | -666.8239 | -620.6888 |
ARIMA | 0.0001474 | 538.1746 | -1066.3493 | -1066.0044 | -1050.3845 |
atm2_fit |>
select(.model = "ARIMA") |>
report()
## Series: cash_winsor
## Model: ARIMA(2,0,2)(0,1,1)[7]
## Transformation: box_cox(cash_winsor, lambda)
##
## Coefficients:
## ar1 ar2 ma1 ma2 sma1
## -0.4161 -0.8921 0.4555 0.7717 -0.7139
## s.e. 0.0599 0.0524 0.0885 0.0673 0.0416
##
## sigma^2 estimated as 26762: log likelihood=-2332.36
## AIC=4676.73 AICc=4676.96 BIC=4700.01
Our innovation residuals look fine, besides one outlier that makes it left-skewed
power_fit |>
select(ARIMA) |>
gg_tsresiduals()
Our Box-Pierce test shows that the remainder is white noise
bp_test <-
power_fit |>
select(.model = "ARIMA") |>
augment() |>
features(.innov, box_pierce, lag = 14)
kbl(bp_test,
caption = "Box-Pierce Test") |>
kable_classic(full_width = F, html_font = "Cambria")
.model | bp_stat | bp_pvalue |
---|---|---|
.model | 11.77447 | 0.6244091 |
Our forecasted values looks like they provide decent point forcasts, with small ranges for the prediction intervals.
power_fc <-
power_fit |>
forecast(h = 12) |>
filter(.model=='ARIMA')
power_fc |>
autoplot(power) +
labs(title = TeX(paste0("Power Consumption: $AR(2,0,2)(0,1,1)[7]$ Forecast")),
caption = TeX(paste0("$\\lambda$ Trans = ", round(lambda,2))))
## Export Power Forecasts
export_power <-
power_fc %>%
as_tibble |>
select(date, .mean) |>
rename(kwh = .mean)
write.csv(export_power, 'power_forecasts.csv', row.names = FALSE)