library(tsibble)
library(readxl)
library(dplyr)
library(tidyr)
library(ggplot2)
library(forecast)
library(lubridate)
library(openxlsx)
DATA 624 Project 1
Part A – ATM Forecast, ATM624Data.xlsx
In part A, I want you to forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable ‘Cash’ is provided in hundreds of dollars, other than that it is straight forward. I am being somewhat ambiguous on purpose to make this have a little more business feeling. Explain and demonstrate your process, techniques used and not used, and your actual forecast. I am giving you data via an excel file, please provide your written report on your findings, visuals, discussion and your R code via an RPubs link along with the actual .rmd file. Also please submit the forecast which you will put in an Excel readable file.
- Loading required packages:
Step 1: Loading and converting the data into time series format
setwd("C:/Users/month/Downloads")
atm <- read_excel("ATM624Data.xlsx",
col_names = TRUE,
col_types = c('date', 'text', 'numeric')) |>
mutate(DATE = as_date(DATE)) |>
as_tsibble(index = DATE, key = ATM) |>
filter(DATE < "2010-05-01")
str(atm)
tbl_ts [1,460 × 3] (S3: tbl_ts/tbl_df/tbl/data.frame)
$ DATE: Date[1:1460], format: "2009-05-01" "2009-05-02" ...
$ ATM : chr [1:1460] "ATM1" "ATM1" "ATM1" "ATM1" ...
$ Cash: num [1:1460] 96 82 85 90 99 88 8 104 87 93 ...
- attr(*, "key")= tibble [4 × 2] (S3: tbl_df/tbl/data.frame)
..$ ATM : chr [1:4] "ATM1" "ATM2" "ATM3" "ATM4"
..$ .rows: list<int> [1:4]
.. ..$ : int [1:365] 1 2 3 4 5 6 7 8 9 10 ...
.. ..$ : int [1:365] 366 367 368 369 370 371 372 373 374 375 ...
.. ..$ : int [1:365] 731 732 733 734 735 736 737 738 739 740 ...
.. ..$ : int [1:365] 1096 1097 1098 1099 1100 1101 1102 1103 1104 1105 ...
.. ..@ ptype: int(0)
..- attr(*, ".drop")= logi TRUE
- attr(*, "index")= chr "DATE"
..- attr(*, "ordered")= logi TRUE
- attr(*, "index2")= chr "DATE"
- attr(*, "interval")= interval [1:1] 1D
..@ .regular: logi TRUE
summary(atm)
      DATE                ATM                 Cash
 Min.   :2009-05-01   Length:1460        Min.   :    0.0
 1st Qu.:2009-07-31   Class :character   1st Qu.:    0.5
 Median :2009-10-30   Mode  :character   Median :   73.0
 Mean   :2009-10-30                      Mean   :  155.6
 3rd Qu.:2010-01-29                      3rd Qu.:  114.0
 Max.   :2010-04-30                      Max.   :10919.8
                                         NA's   :5
Comment: The ATM624Data.xlsx data was loaded, DATE was formatted into [yyyy-mm-dd]. The dates were filtered to only select dates from May 2009 to April 2010 because we will be forecasting May 2010’s cash withdrawals.
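As a quick sanity check, the observation count and date range per ATM can be confirmed directly; a minimal sketch using dplyr verbs (the column names n_obs, first_date, and last_date are illustrative):
# sanity check sketch: each ATM series should span 2009-05-01 to 2010-04-30 (365 days)
atm |>
as_tibble() |>
group_by(ATM) |>
summarise(n_obs = n(), first_date = min(DATE), last_date = max(DATE))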
Step 2: Treating missing Cash observations
atm |>
as.data.frame() |>
group_by(ATM) |>
summarise(`Missing Values` = sum(is.na(Cash)))
# A tibble: 4 × 2
ATM `Missing Values`
<chr> <int>
1 ATM1 3
2 ATM2 2
3 ATM3 0
4 ATM4 0
library(fable)
library(imputeTS)
# Estimate Cash for the missing values.
atm <- na_interpolation(atm, option = "linear")
#double-checking for any missing values remaining
atm |>
as.data.frame() |>
group_by(ATM) |>
summarise(`Missing Values` = sum(is.na(Cash)))
# A tibble: 4 × 2
ATM `Missing Values`
<chr> <int>
1 ATM1 0
2 ATM2 0
3 ATM3 0
4 ATM4 0
Comment: Three missing values were detected in ATM1 and two in ATM2. These five missing entries were resolved using interpolation.
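Had the gaps not been filled linearly, a model-based alternative would be to interpolate each ATM series with fable, in the same way the ATM4 outlier is handled later. This is a sketch only, where atm_with_na and atm_filled are placeholder names for the tsibble before and after imputation:
# sketch of model-based interpolation (an alternative to linear interpolation);
# atm_with_na stands in for the tsibble before na_interpolation() was applied
atm_filled <- atm_with_na |>
model(ARIMA(Cash)) |>
interpolate(atm_with_na)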
Step 3: Exploratory Data Analysis, modeling and forecasting.
atm |>
autoplot(Cash) +
facet_wrap(~ATM, scales = "free", nrow = 4) +
labs(title = "ATM Before Removing Outliers")Comment: Initially, it appears that ATM1 and ATM2 exhibit signs of seasonality, whereas ATM3 seems to have only become active toward the end of April 2010. Additionally, a significant outlier is noticeable in ATM4. Removing this outlier would be advisable to allow for more accurate data exploration.
- ATM1 Exploration
library(fpp3)
#plot
atm |>
filter(ATM == "ATM1") |>
autoplot(Cash) +
ggtitle("Non-tranformed ATM1")# STL decomposition
atm |>
filter(ATM == "ATM1") |>
model(
STL(Cash ~ season(window = "periodic"), robust = TRUE)) |>
components() |>
autoplot() +
labs(title = "STL Decomposition for ATM1")Comment: ATM1 does not exhibit a clear trend, but displays a weekly seasonal pattern characterized by a noticeable drop in cash withdrawals once per week. The remainder component appears random until March 2010, after which a pattern of increased variability becomes evident.
#lambda for box cox transformation
lambda <- atm |>
filter(ATM == "ATM1") |>
features(Cash, features = guerrero) |>
pull(lambda_guerrero)
atm1_fit <- atm |>
filter(ATM == "ATM1") |>
model(
additive = ETS(Cash ~ error("A") + trend("N") + season("A")),
multiplicative = ETS(Cash ~ error("M") + trend("N") + season("M")),
snaive = SNAIVE(Cash),
additive_ts = ETS(box_cox(Cash,lambda) ~ error("A") + trend("N") + season("A")),
multiplicative_ts = ETS(box_cox(Cash,lambda) ~ error("M") + trend("N") + season("M")),
snaive_ts = SNAIVE(box_cox(Cash,lambda)),
ARIMA = ARIMA(box_cox(Cash,lambda))
)
#stats for the models
left_join(glance(atm1_fit) |>
select(.model:BIC),
accuracy(atm1_fit) |>
select(.model, RMSE)) |>
arrange(RMSE)
# A tibble: 7 × 7
.model sigma2 log_lik AIC AICc BIC RMSE
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 multiplicative_ts 0.0383 -1219. 2458. 2458. 2497. 23.3
2 additive 580. -2234. 4487. 4488. 4526. 23.8
3 additive_ts 1.80 -1180. 2380. 2380. 2419. 24.9
4 ARIMA 1.76 -610. 1228. 1228. 1244. 24.9
5 multiplicative 0.134 -2273. 4566. 4567. 4605. 26.3
6 snaive_ts 2.48 NA NA NA NA 27.7
7 snaive 772. NA NA NA NA 27.7
Comment: To determine the most suitable model, metrics such as RMSE, AIC, AICc, and BIC were considered. While the transformed multiplicative ETS model achieved the lowest RMSE, the ARIMA model outperformed others in terms of AIC, AICc, and BIC. As a result, the ARIMA model was selected for forecasting May 2010.
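Because the in-sample RMSE and the information criteria point to different models, an out-of-sample check on a short holdout could also be run. A sketch, assuming a split at 2010-04-02 (the last four weeks held out) and the hypothetical names atm1_train and atm1_test:
# sketch: hold out the last 4 weeks of ATM1 and compare out-of-sample accuracy
atm1_train <- atm |>
filter(ATM == "ATM1", DATE <= "2010-04-02")
atm1_test <- atm |>
filter(ATM == "ATM1", DATE > "2010-04-02")
atm1_train |>
model(
ARIMA = ARIMA(box_cox(Cash, lambda)),
multiplicative_ts = ETS(box_cox(Cash, lambda) ~ error("M") + trend("N") + season("M"))
) |>
forecast(h = nrow(atm1_test)) |>
accuracy(atm1_test)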
# report of the arima model
atm1_fit |>
select(.model = "ARIMA") |>
report()
Series: Cash
Model: ARIMA(0,0,2)(0,1,1)[7]
Transformation: box_cox(Cash, lambda)
Coefficients:
ma1 ma2 sma1
0.1126 -0.1094 -0.6418
s.e. 0.0524 0.0520 0.0432
sigma^2 estimated as 1.764: log likelihood=-609.99
AIC=1227.98 AICc=1228.09 BIC=1243.5
- ARIMA(0,0,2)(0,1,1)[7] was fitted to ATM1.
library(latex2exp)
# forecasting
fc_atm1 <- atm1_fit |>
forecast(h = 31) |>
filter(.model=='ARIMA')
# forecasted plot
fc_atm1 |>
autoplot(atm) +
ggtitle(TeX(paste0("ATM 1 Forcasted with $ARIMA(0,0,2)(0,1,1)_7$ and $\\lambda$ = ",
round(lambda,2))))# residuals
atm1_fit |>
select(ARIMA) |>
gg_tsresiduals() +
ggtitle(TeX(paste0("Residuals for $ARIMA(0,0,2)(0,1,1)_7$ with $\\lambda$ = ",
round(lambda,2))))
# Box-Pierce test, ℓ = 2m for seasonal data, m = 7
atm1_fit |>
select(.model = "ARIMA") |>
augment() |>
features(.innov, box_pierce, lag = 14, dof = 0)
# A tibble: 1 × 3
.model bp_stat bp_pvalue
<chr> <dbl> <dbl>
1 .model 9.60 0.791
Comment: The residuals resemble white noise, though a few data points appear to cluster in the left tail of the distribution.
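A Ljung-Box test is often preferred over Box-Pierce for shorter series and could be run the same way; a sketch, with dof set to the three estimated ARIMA coefficients (ma1, ma2, sma1) as an assumption:
# Ljung-Box sketch: dof = 3 for the three estimated coefficients (ma1, ma2, sma1)
atm1_fit |>
select(ARIMA) |>
augment() |>
features(.innov, ljung_box, lag = 14, dof = 3)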
- ATM2 Exploration
#plot
atm |>
filter(ATM == "ATM2") |>
autoplot(Cash) +
ggtitle("Non-tranformed ATM2")# STL decomposition
atm |>
filter(ATM == "ATM2") |>
model(STL(Cash ~ season(window = "periodic"), robust = TRUE)) |>
components() |>
autoplot() +
labs(title = "STL Decomposition for ATM2")Comment: ATM2 exhibits a clear weekly seasonal pattern, marked by a sharp decline followed by a rapid increase in cash withdrawals once per week. The remainder component appears largely random until around February or March 2010, when variability becomes more pronounced. Additionally, the overall trend is slightly downward, with a noticeable spike and subsequent drop occurring during that same February/March period.
# lambda for box cox transformation
lambda <- atm |>
filter(ATM == "ATM2") |>
features(Cash, features = guerrero) |>
pull(lambda_guerrero)
atm2_fit <- atm |>
filter(ATM == "ATM2") |>
model(
additive = ETS(Cash ~ error("A") + trend("N") + season("A")),
multiplicative = ETS(Cash ~ error("M") + trend("N") + season("M")),
snaive = SNAIVE(Cash),
additive_ts = ETS(box_cox(Cash,lambda) ~ error("A") + trend("N") + season("A")),
multiplicative_ts = ETS(box_cox(Cash,lambda) ~ error("M") + trend("N") + season("M")),
snaive_ts = SNAIVE(box_cox(Cash,lambda)),
ARIMA = ARIMA(box_cox(Cash,lambda))
)
# stats for the models
left_join(glance(atm2_fit) |>
select(.model:BIC),
accuracy(atm2_fit) |>
select(.model, RMSE)) |>
arrange(RMSE)
# A tibble: 7 × 7
.model sigma2 log_lik AIC AICc BIC RMSE
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 ARIMA 68.2 -1263. 2539. 2539. 2562. 24.5
2 additive 648. -2254. 4527. 4528. 4566. 25.1
3 additive_ts 72.4 -1854. 3728. 3728. 3767. 25.4
4 snaive_ts 101. NA NA NA NA 30.2
5 snaive 914. NA NA NA NA 30.2
6 multiplicative_ts 0.308 -2011. 4043. 4043. 4082. 34.1
7 multiplicative 0.424 -2399. 4818. 4819. 4857. 34.4
Comment: The ARIMA model has the lowest RMSE, AIC, AICc, and BIC. Therefore, the ARIMA model was chosen to forecast May 2010.
# report of the arima model
atm2_fit |>
select(.model = "ARIMA") |>
report()
Series: Cash
Model: ARIMA(2,0,2)(0,1,1)[7]
Transformation: box_cox(Cash, lambda)
Coefficients:
ar1 ar2 ma1 ma2 sma1
-0.4256 -0.9051 0.4719 0.7967 -0.7271
s.e. 0.0588 0.0444 0.0888 0.0590 0.0406
sigma^2 estimated as 68.16: log likelihood=-1263.33
AIC=2538.66 AICc=2538.9 BIC=2561.94
- ARIMA(2,0,2)(0,1,1)[7] was fitted to ATM2.
# forecasting
fc_atm2 <- atm2_fit |>
forecast(h = 31) |>
filter(.model=='ARIMA')
# forecasted plot
fc_atm2 |>
autoplot(atm) +
ggtitle(TeX(paste0("ATM 2 Forcasted with $ARIMA(2,0,2)(0,1,1)_7$ and $\\lambda$ = ",
round(lambda,2))))# residuals
atm2_fit |>
select(ARIMA) |>
gg_tsresiduals() +
ggtitle(TeX(paste0("Residuals for $ARIMA(2,0,2)(0,1,1)_7$ with $\\lambda$ = ",
round(lambda,2))))
# Box-Pierce test, ℓ = 2m for seasonal data, m = 7
atm2_fit |>
select(.model = "ARIMA") |>
augment() |>
features(.innov, box_pierce, lag = 14, dof = 0)
# A tibble: 1 × 3
.model bp_stat bp_pvalue
<chr> <dbl> <dbl>
1 .model 10.4 0.732
Comment: The residuals appear to resemble white noise, with a few values falling into both the lower and upper tails of the distribution.
- ATM3 Exploration
#plot
atm |>
filter(ATM == "ATM3") |>
autoplot(Cash) +
ggtitle("Non-tranformed ATM3")# mean model
atm3_fit <- atm |>
filter(ATM == "ATM3",
Cash != 0) |>
model(MEAN(Cash))
# forecasting
fc_atm3 <- atm3_fit |>
forecast(h = 31)
# forecasted plot
fc_atm3 |>
autoplot(atm) +
ggtitle("ATM 3 Forecasted with the MEAN() Model")Comment: ATM3 differs from the other machines in that it only contains three recorded data points, which is insufficient for identifying trends, seasonality, or generating a reliable forecast for an entire month. As a practical alternative, using the average of these three values to estimate withdrawals for May would be the most reasonable approach.
- ATM4 Exploration
atm |>
filter(ATM == "ATM4") |>
autoplot(Cash) +
ggtitle("Non-tranformed ATM4")# STL decomposition
dcmp_4 <- atm |>
filter(ATM == "ATM4") |>
model(STL(Cash ~ season(window = "periodic"), robust = TRUE)) |>
components()
dcmp_4 |>
autoplot() +
labs(title = "STL Decomposition for ATM4")# identifying the outliers
outliers <- dcmp_4 |>
filter(remainder < quantile(remainder, 0.25) - 3*IQR(remainder) |
remainder > quantile(remainder, 0.75) + 3*IQR(remainder))
outliers
# A dable: 2 x 8 [1D]
# Key: ATM, .model [1]
# : Cash = trend + season_week + remainder
ATM .model DATE Cash trend season_week remainder season_adjust
<chr> <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl>
1 ATM4 "STL(Cash ~… 2009-09-22 1712. 373. -71.5 1411. 1784.
2 ATM4 "STL(Cash ~… 2010-02-09 10920. 279. -71.5 10712. 10991.
Comment: ATM4 exhibits an extreme outlier on 2010-02-09 that will need to be removed and re-estimated.
library(dplyr)
atm_miss <- atm |>
#replace outliers
mutate(Cash = replace(Cash, ATM == "ATM4" & DATE == "2010-02-09", NA))
atm <- atm_miss |>
# Fit ARIMA model to the data with missing values
model(ARIMA(Cash)) |>
# Estimate Cash for the missing values / outliers
interpolate(atm_miss)
# plot
atm |>
filter(ATM == "ATM4") |>
autoplot(Cash) +
ggtitle("ATM4 with No Outlier")# STL decomposition
atm |>
filter(ATM == "ATM4") |>
model(STL(Cash ~ season(window = "periodic"), robust = TRUE)) |>
components() |>
autoplot() +
labs(title = "STL Decomposition for ATM4 with No Outlier")Comment: There is a clear weekly pattern (seasonality) in the data, characterized by a sharp drop followed by a steep rise in the amount of cash withdrawn once each week. However, the gray bar in the seasonal component is relatively wide, indicating that this pattern has the least variation compared to the overall variability in the data.
# lambda for box cox transformation
lambda <- atm |>
filter(ATM == "ATM4") |>
features(Cash, features = guerrero) |>
pull(lambda_guerrero)
atm4_fit <- atm |>
filter(ATM == "ATM4") |>
model(
# additive ETS model
additive = ETS(Cash ~ error("A") + trend("N") + season("A")),
# multiplicative ETS model
multiplicative = ETS(Cash ~ error("M") + trend("N") + season("M")),
# SNAIVE model
snaive = SNAIVE(Cash),
# transformed additive ETS model
additive_ts = ETS(box_cox(Cash,lambda) ~ error("A") + trend("N") + season("A")),
# transformed multiplicative ETS model
multiplicative_ts = ETS(box_cox(Cash,lambda) ~ error("M") + trend("N") + season("M")),
# transformed SNAIVE model
snaive_ts = SNAIVE(box_cox(Cash,lambda)),
# arima model
ARIMA = ARIMA(box_cox(Cash,lambda)),
# transformed additive ETS model, no seasonality
additive_ts_no_s = ETS(box_cox(Cash,lambda) ~ error("A") + trend("N") + season("N")),
# transformed multiplicative ETS model, no seasonality
multiplicative_ts_no_s = ETS(box_cox(Cash,lambda) ~ error("M") + trend("N") + season("N"))
)
# stats for the models
left_join(glance(atm4_fit) |>
select(.model:BIC),
accuracy(atm4_fit) |>
select(.model, RMSE)) |>
arrange(RMSE)
# A tibble: 9 × 7
.model sigma2 log_lik AIC AICc BIC RMSE
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 additive 110883. -3192. 6404. 6405. 6443. 329.
2 additive_ts 167. -2006. 4032. 4032. 4071. 338.
3 multiplicative_ts 0.220 -2003. 4025. 4026. 4064. 345.
4 ARIMA 176. -1460. 2930. 2930. 2950. 352.
5 multiplicative 0.728 -3192. 6404. 6405. 6443. 354.
6 multiplicative_ts_no_s 0.238 -2039. 4084. 4084. 4096. 363.
7 additive_ts_no_s 196. -2039. 4084. 4084. 4096. 363.
8 snaive 205867. NA NA NA NA 453.
9 snaive_ts 289. NA NA NA NA 453.
Comment: Although the additive ETS model has the lowest RMSE, the ARIMA model has the lowest AIC, AICc, and BIC. Therefore, the ARIMA model was chosen to forecast May 2010.
# report of the arima model
atm4_fit |>
select(.model = "ARIMA") |>
report()
Series: Cash
Model: ARIMA(0,0,1)(2,0,0)[7] w/ mean
Transformation: box_cox(Cash, lambda)
Coefficients:
ma1 sar1 sar2 constant
0.0792 0.2076 0.2026 16.8725
s.e. 0.0527 0.0516 0.0524 0.7311
sigma^2 estimated as 176.1: log likelihood=-1460.16
AIC=2930.31 AICc=2930.48 BIC=2949.81
Comment: ARIMA(0,0,1)(2,0,0)[7] will be used to model ATM4.
# forecasting
fc_atm4 <- atm4_fit |>
forecast(h = 31) |>
filter(.model=='ARIMA')
# forecasted plot
fc_atm4 |>
autoplot(atm) +
ggtitle(TeX(paste0("ATM 4 Forcasted with $ARIMA(0,0,1)(2,0,0)_7$ and $\\lambda$ = ",
round(lambda,2))))# residuals
atm4_fit |>
select(ARIMA) |>
gg_tsresiduals() +
ggtitle(TeX(paste0("Residuals for $ARIMA(0,0,1)(2,0,0)_7$ with $\\lambda$ = ",
round(lambda,2))))
# Box-Pierce test, ℓ = 2m for seasonal data, m = 7
atm4_fit |>
select(.model = "ARIMA") |>
augment() |>
features(.innov, box_pierce, lag = 14, dof = 0)
# A tibble: 1 × 3
.model bp_stat bp_pvalue
<chr> <dbl> <dbl>
1 .model 10.4 0.732
Comment: The residuals appear to resemble white noise, with a few values falling in the extreme ends of the distribution. While the forecasts may not seem to capture the underlying pattern effectively—evident in their tendency to revert toward the mean—the prediction intervals might still encompass the actual values for May 2010.
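The prediction intervals mentioned above can be extracted explicitly from the fable; a sketch using hilo():
# sketch: 80% and 95% prediction intervals for the ATM4 forecasts
fc_atm4 |>
hilo(level = c(80, 95)) |>
select(DATE, .mean, `80%`, `95%`)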
Step 4: Final Forecast
library(openxlsx)
# save as data frame
fc <- bind_rows(fc_atm1, fc_atm2, fc_atm3, fc_atm4) |>
as.data.frame() |>
select(DATE, ATM, .mean) |>
rename(Cash = .mean)
# export file
fc |>
write.xlsx("ATM_forecasts_Data_624.xlsx")
#plot of may 2010
fc |>
as_tsibble(index = DATE, key = ATM) |>
autoplot(Cash) +
facet_wrap(~ATM, scales = "free", nrow = 4) +
labs(title = "Forecasted ATM Withdrawls in May 2010")#Summary plot
fc |>
as_tsibble(index = DATE, key = ATM) |>
autoplot(Cash) +
autolayer(atm, Cash, colour = "black") +
facet_wrap(~ATM, scales = "free", nrow = 4) +
labs(title = "ATM Withdrawls")Part B – Forecasting Power, ResidentialCustomerForecastLoad-624.xlsx
Part B consists of a simple dataset of residential power usage for January 1998 until December 2013. Your assignment is to model these data and provide a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in Kilowatt hours, the rest is straight forward. Add this to your existing files above.
Loading the data
setwd("C:/Users/month/Downloads")
resident_load <- read_excel("ResidentialCustomerForecastLoad-624.xlsx") |>
rename(Month = 'YYYY-MMM') |>
mutate(Month = yearmonth(Month)) |>
as_tsibble(index = Month)
head(resident_load)
# A tsibble: 6 x 3 [1M]
CaseSequence Month KWH
<dbl> <mth> <dbl>
1 733 1998 Jan 6862583
2 734 1998 Feb 5838198
3 735 1998 Mar 5420658
4 736 1998 Apr 5010364
5 737 1998 May 4665377
6 738 1998 Jun 6467147
summary(resident_load$KWH)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's
  770523  5429912  6283324  6502475  7620524 10655730        1
Comment: There is a missing observation for September 2008. An ARIMA model will be fitted on the data in order to interpolate the missing observation.
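The missing month can be located directly before interpolating; a minimal sketch:
# sketch: locate the missing KWH observation
resident_load |>
filter(is.na(KWH))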
resident_load <- resident_load |>
# Fit ARIMA model to the data containing missing values
model(ARIMA(KWH)) |>
# Estimate KWH for the missing values.
interpolate(resident_load)
# check to see if any missing values again
resident_load |>
as.data.frame() |>
summarise(`Missing Values` = sum(is.na(KWH)))
  Missing Values
1 0
Exploratory Data Analysis
resident_load |>
autoplot(KWH) +
labs(title = "Residential Power Usage")Comment: The data appears to exhibit seasonality along with a slight upward trend. An anomaly is noticeable in July 2010, where the recorded value is 770,523 (observed when excel file is sorted from smallest to largest)—likely a data entry error, possibly missing an extra digit. To address this outlier, the best approach is to remove it and apply interpolation to estimate a more accurate value.
residential_miss <- resident_load |>
#replace outliers with NA
mutate(KWH = replace(KWH, KWH == 770523, NA))
resident_load <- residential_miss |>
# Fit ARIMA model to the data with missing values
model(ARIMA(KWH)) |>
# Estimate KWH for the missing values / outliers
interpolate(residential_miss)
# plot
resident_load |>
autoplot(KWH) +
ggtitle("Residential Power Usage with no Outlier")# STL decomposition
resident_load |>
model(STL(KWH ~ season(window = "periodic"), robust = TRUE)) |>
components() |>
autoplot() +
labs(title = "STL Decomposition with no Outlier")Comment: After removing the outlier, the graph becomes clearer and more interpretable. A gradual upward trend is evident, contributing the least to the overall variation in the data. The pattern also reveals annual seasonality, with noticeable peaks during the summer and winter months. The residuals appear random, though there is a noticeable spike in December 2013, which marks the end of the observation period.
resident_load |>
gg_season(KWH, labels = "both") +
labs(title = "Seasonal plot: Residential Power Usage")resident_load |>
gg_subseries(KWH) +
labs(title = "Residential Power Usage")Modeling and Forecasting
Comment: A Box-Cox transformation will be applied to certain models and their performance will be compared to models that were not transformed.
#lambda for box cox transformation
lambda <- resident_load |>
features(KWH, features = guerrero) |>
pull(lambda_guerrero)
res_fit <- resident_load |>
model(
additive = ETS(KWH ~ error("A") + trend("A") + season("A")),
multiplicative = ETS(KWH ~ error("M") + trend("A") + season("M")),
damped = ETS(KWH ~ error("A") + trend("Ad") + season("A")),
snaive = SNAIVE(KWH),
additive_bc = ETS(box_cox(KWH,lambda) ~ error("A") + trend("A") + season("A")),
multiplicative_bc = ETS(box_cox(KWH,lambda) ~ error("M") + trend("A") + season("M")),
damped_bc = ETS(box_cox(KWH,lambda) ~ error("A") + trend("Ad") + season("A")),
snaive_bc = SNAIVE(box_cox(KWH,lambda)),
ARIMA = ARIMA(box_cox(KWH,lambda))
)
# stats for the models
left_join(glance(res_fit) |>
select(.model:BIC),
accuracy(res_fit) |>
select(.model, RMSE)) |>
arrange(AICc)
# A tibble: 9 × 7
.model sigma2 log_lik AIC AICc BIC RMSE
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 ARIMA 1.51e- 5 752. -1493. -1493. -1478. 627673.
2 additive_bc 1.40e- 5 577. -1120. -1116. -1064. 610956.
3 damped_bc 1.41e- 5 576. -1116. -1112. -1058. 618831.
4 multiplicative_bc 6.51e- 7 575. -1115. -1112. -1060. 625349.
5 multiplicative 9.00e- 3 -3053. 6140. 6144. 6196. 613581.
6 additive 4.19e+11 -3065. 6165. 6168. 6220. 619703.
7 damped 4.25e+11 -3066. 6169. 6173. 6227. 622538.
8 snaive 6.51e+11 NA NA NA NA 809926.
9 snaive_bc 2.25e- 5 NA NA NA NA 809926.
Comment: Unlike the other datasets, this one yielded negative values for AIC, AICc, and BIC; these arise because the likelihood is evaluated on the Box-Cox-transformed scale, so the information criteria are directly comparable only among models fitted on that same scale. Although the additive exponential smoothing model with a Box-Cox transformation achieved the lowest RMSE, the ARIMA model had the most favorable AIC, AICc, and BIC scores. Therefore, the ARIMA model was selected as the most suitable choice.
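Since RMSE and the information criteria again disagree, an out-of-sample comparison on a 12-month holdout could supplement the choice; a sketch, with the hypothetical names res_train and res_test:
# sketch: hold out 2013 and compare out-of-sample accuracy of the two leading models
res_train <- resident_load |> filter(Month < yearmonth("2013 Jan"))
res_test <- resident_load |> filter(Month >= yearmonth("2013 Jan"))
res_train |>
model(
ARIMA = ARIMA(box_cox(KWH, lambda)),
additive_bc = ETS(box_cox(KWH, lambda) ~ error("A") + trend("A") + season("A"))
) |>
forecast(h = 12) |>
accuracy(res_test)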
# report of the arima model
res_fit |>
select(.model = "ARIMA") |>
report()
Series: KWH
Model: ARIMA(0,0,1)(2,1,0)[12] w/ drift
Transformation: box_cox(KWH, lambda)
Coefficients:
ma1 sar1 sar2 constant
0.2462 -0.7202 -0.3546 0.0013
s.e. 0.0833 0.0751 0.0769 0.0004
sigma^2 estimated as 1.505e-05: log likelihood=751.74
AIC=-1493.47 AICc=-1493.13 BIC=-1477.51
Comment: ARIMA(0,0,1)(2,1,0)[12] with drift will be used to model the residential power usage.
# forecasting
fc_res <- res_fit |>
forecast(h = 12) |>
filter(.model=='ARIMA')
# forecasted plot
fc_res |>
autoplot(resident_load) +
ggtitle(TeX(paste0("ATM 1 Forcasted with $(0,0,1)(2,1,0)_{12}$ and $\\lambda$ = ",
round(lambda,2))))# residuals
res_fit |>
select(ARIMA) |>
gg_tsresiduals(lag = 24) +
ggtitle(TeX(paste0("Residuals for $(0,0,1)(2,1,0)_{12}$ with $\\lambda$ = ",
round(lambda,2))))#Box-Pierce test, ℓ=2m for seasonal data, m=12
res_fit |>
select(.model = "ARIMA") |>
augment() |>
features(.innov, box_pierce, lag = 24, dof = 0)
# A tibble: 1 × 3
.model bp_stat bp_pvalue
<chr> <dbl> <dbl>
1 .model 29.9 0.187
Comment: The residuals appear to resemble white noise, as indicated by both the diagnostic plots and the results of the Box-Pierce test. Interestingly, the initial residuals seem to remain constant for a short period.
Final Forecasting using the ARIMA model:
fc_res <- fc_res |>
as.data.frame() |>
select(Month, .mean) |>
rename(KWH = .mean) |>
# adding back the case sequence
mutate(CaseSequence = 925:936) |>
# rearranging order
relocate(CaseSequence)
fc_res
   CaseSequence    Month      KWH
1 925 2014 Jan 10297282
2 926 2014 Feb 8531331
3 927 2014 Mar 6650696
4 928 2014 Apr 5974089
5 929 2014 May 5942940
6 930 2014 Jun 8291370
7 931 2014 Jul 9539978
8 932 2014 Aug 10113059
9 933 2014 Sep 8474487
10 934 2014 Oct 5849784
11 935 2014 Nov 6112937
12 936 2014 Dec 8253088
library(openxlsx)
# export file
fc_res |>
write.xlsx("Residential_KWH_forecasts.xlsx")