library(fpp3)
library(dplyr)
library(ggplot2)
library(readxl)
library(tsibble)
library(psych)
library(tidyr)
library(forecast)
This project consists of 3 parts, two required and one bonus, and is worth 15% of your grade. The project is due at 11:59 PM on Sunday, Apr 11. I will accept late submissions with a penalty until the meetup after that, when we review some projects.
ATM Forecast ATM624Data.xlsx
In part A, I want you to forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable 'Cash' is provided in hundreds of dollars; other than that it is straightforward. I am being somewhat ambiguous on purpose, to give this a little more of a business feel. Explain and demonstrate your process, techniques used and not used, and your actual forecast. I am giving you data via an Excel file. Please provide your written report (findings, visuals, discussion, and your R code) via an RPubs link along with the actual .rmd file. Also please submit the forecast, which you will put in an Excel-readable file.
Forecasting Power ResidentialCustomerForecastLoad-624.xlsx
Part B consists of a simple dataset of residential power usage from January 1998 until December 2013. Your assignment is to model these data and provide a monthly forecast for 2014. The data is given in a single file. The variable 'KWH' is power consumption in kilowatt hours; the rest is straightforward. Add this to your existing files above.
BONUS, optional (part or all), Waterflow_Pipe1.xlsx and Waterflow_Pipe2.xlsx
Part C consists of two datasets. These are simple two-column sets; however, they have different time stamps. Your optional assignment is to time-base sequence the data and aggregate by hour (for multiple recordings within an hour, take the mean). Then determine whether the data is stationary and whether it can be forecast. If so, provide a one-week-forward forecast and present the results via RPubs and the .rmd file, with the forecast in an Excel-readable file.
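Although I ultimately omit Part C (see below), a minimal sketch of the requested hourly aggregation would look something like this, assuming each workbook has a timestamp column named Date Time and a reading column named WaterFlow (both column names are assumptions on my part):
# Hypothetical Part C aggregation: stack both pipes, floor each timestamp
# to the hour, and average multiple readings within the same hour
library(lubridate)
wp_hourly <- bind_rows(WP1_df, WP2_df) %>%
  mutate(Hour = floor_date(`Date Time`, "hour")) %>%
  group_by(Hour) %>%
  summarise(WaterFlow = mean(WaterFlow), .groups = "drop")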
https://github.com/GitableGabe/Data624_Data/raw/main/ATM624Data.xlsx
atm_coltype<-c("date","text","numeric")
atm_import<-read_xlsx('ATM624Data.xlsx', col_types = atm_coltype)
# Omitting the extra credit (Part C) as I won't be working on it
# WP1_df<-read_xlsx('Waterflow_Pipe1.xlsx')
# WP2_df<-read_xlsx('Waterflow_Pipe2.xlsx')
power_raw<-read_xlsx('ResidentialCustomerForecastLoad-624.xlsx')
head(atm_import%>%
filter(ATM=="ATM4"))
atm_range<-range(atm_import$DATE)
atm_range[1]
[1] "2009-05-01 UTC"
atm_range[2]
[1] "2010-05-14 UTC"
sapply(atm_import, function(x) sum(is.na(x)))
DATE ATM Cash
0 14 19
data.frame(atm_import$DATE[is.na(atm_import$Cash)])
atm_import %>%
filter(DATE < "2010-05-01", !is.na(ATM)) %>%
ggplot(aes(x = Cash)) +
geom_histogram(bins = 30, color= "blue") +
facet_wrap(~ ATM, ncol = 2, scales = "free")
(atm_df <- atm_import %>%
mutate(DATE = as.Date(DATE)) %>%
filter(DATE<"2010-05-01")%>%
pivot_wider(names_from=ATM, values_from = Cash))
atm_df<-atm_df%>%
as_tsibble(index=DATE)
head(atm_df)
summary(atm_df)
DATE ATM1 ATM2 ATM3 ATM4
Min. :2009-05-01 Min. : 1.00 Min. : 0.00 Min. : 0.0000 Min. : 1.563
1st Qu.:2009-07-31 1st Qu.: 73.00 1st Qu.: 25.50 1st Qu.: 0.0000 1st Qu.: 124.334
Median :2009-10-30 Median : 91.00 Median : 67.00 Median : 0.0000 Median : 403.839
Mean :2009-10-30 Mean : 83.89 Mean : 62.58 Mean : 0.7206 Mean : 474.043
3rd Qu.:2010-01-29 3rd Qu.:108.00 3rd Qu.: 93.00 3rd Qu.: 0.0000 3rd Qu.: 704.507
Max. :2010-04-30 Max. :180.00 Max. :147.00 Max. :96.0000 Max. :10919.762
NA's :3 NA's :2
atm_df[!complete.cases(atm_df), ]
atm_df%>%
select(DATE,ATM3)%>%
filter(ATM3>0)
Converting DATE into a Date value made sense, as the original POSIXct type may cause issues later.
# seasonality
atm_import %>%
filter(DATE < "2010-05-01", !is.na(ATM)) %>%
ggplot(aes(x = DATE, y = Cash, col = ATM)) +
geom_line() +
facet_wrap(~ ATM, ncol = 2, scales = "free_y")+
labs(title = "Seasonality Plot", x = "Date", y = "Cash") +
theme_minimal()
# Impute the missing ATM1 and ATM2 values with each series' median
median_value <- median(atm_df[["ATM1"]], na.rm = TRUE)
atm_df[["ATM1"]][is.na(atm_df[["ATM1"]])] <- median_value
median_value <- median(atm_df[["ATM2"]], na.rm = TRUE)
atm_df[["ATM2"]][is.na(atm_df[["ATM2"]])] <- median_value
atm_df[!complete.cases(atm_df), ]
The seasonality plot did not show a long-term trend, but a better assessment at the weekly interval is likely needed. Following Rob J Hyndman and George Athanasopoulos, Forecasting: Principles and Practice (3rd ed.), Section 3.6, I will perform an STL ("Seasonal and Trend decomposition using Loess") decomposition of the series. To capture the weekly pattern I will set trend(window = 7) and season(window = "periodic") to impose a fixed seasonal pattern across the days of the week. My reference comes directly from that chapter.
us_retail_employment |>
model(
STL(Employed ~ trend(window = 7) +
season(window = "periodic"),
robust = TRUE)) |>
components() |>
autoplot()
atm1_df <- atm_df %>%
dplyr::select(DATE, ATM1)
atm1_df %>%
model(
STL(ATM1 ~ trend(window = 7) +
season(window = "periodic"),
robust = TRUE)) %>%
components() %>%
autoplot()
ndiffs(atm1_df$ATM1)
[1] 0
atm1_df %>%
ACF(ATM1, lag_max = 30) %>%
autoplot()
The STL decomposition was not as telling as I would have liked; however, the ACF plot shows significant spikes at lags 2, 5, and 7. Given that the week starts on Sunday, these correspond to Monday, Thursday, and Saturday, with lag 7 the most significant. The gradual decline in the ACF could suggest the data is non-stationary and would require differencing, but the small value of \(r_1\) and the result of ndiffs(), which reports zero differences needed, negate that suspicion.
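As a cross-check on stationarity (an addition of mine, not part of the textbook recipe), a KPSS unit-root test, whose null hypothesis is stationarity, can corroborate the ndiffs() result:
# KPSS test: a p-value above 0.05 fails to reject stationarity,
# supporting the decision not to difference ATM1
atm1_df %>%
  features(ATM1, unitroot_kpss)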
The seasonal naive method was my preferred choice given the seasonality, since it uses the prior period's withdrawals to conduct the forecast, but I also like to default to Auto ARIMA for an optimized selection. I assume ETS and ARIMA won't perform as well, but will wait for the comparisons. Below we split off a training set, holding out April, so the candidate models can be compared before forecasting May.
# train
atm1_train <- atm1_df %>%
filter(DATE <= "2010-04-01")
atm1_fit <- atm1_train %>%
model(
SNAIVE = SNAIVE(ATM1),
ETS = ETS(ATM1),
ARIMA = ARIMA(ATM1),
`Auto ARIMA` = ARIMA(ATM1, stepwise = FALSE, approx = FALSE)
)
# forecast April
atm1_forecast <- atm1_fit %>%
forecast(h = 30)
#plot
atm1_forecast %>%
autoplot(atm1_df, level = NULL)+
facet_wrap( ~ .model, scales = "free_y") +
guides(colour = guide_legend(title = "Forecast"))+
labs(title= "ATM1 Forecasts | April") +
xlab("Date") +
ylab("$$$ (In Hundreds)")
# RMSE
accuracy(atm1_forecast, atm1_df) %>%
select(.model, RMSE:MAPE)
When interpreting the results, the model with the lowest RMSE and MAE, and with MPE and MAPE closest to zero, is the best performing. ETS leads on all of these measures, indicating it is the best model here.
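As a small convenience (my addition), the same accuracy table can be sorted so the best model by RMSE surfaces first:
# rank candidate models by out-of-sample RMSE, best first
accuracy(atm1_forecast, atm1_df) %>%
  select(.model, RMSE:MAPE) %>%
  arrange(RMSE)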
**Reference**
aus_economy |>
model(ETS(Population)) |>
forecast(h = "5 years") |>
autoplot(aus_economy |> filter(Year >= 2000)) +
labs(title = "Australian population",
y = "People (millions)")
# refit the chosen ETS model on the full series, following the textbook example above
atm1_fit_ets <- atm1_df %>%
model(ETS = ETS(ATM1))
atm1_forecast_ets <- atm1_fit_ets %>%
forecast(h=30)
atm1_forecast_ets %>%
autoplot(atm1_df) +
labs(title = "ATM1 Forecast (ETS) | May",
y = "$$$ (in Hundreds)")
(atm1_forecast_results <-
as.data.frame(atm1_forecast_ets) %>%
select(DATE, .mean) %>%
rename(Date = DATE, Cash = .mean)%>%
mutate(Cash=round(Cash,2)))
atm2_df <- atm_df %>%
dplyr::select(DATE, ATM2)
atm2_df %>%
model(
STL(ATM2 ~ trend(window = 7) +
season(window = "periodic"),
robust = TRUE)) %>%
components() %>%
autoplot()
ndiffs(atm2_df$ATM2)
[1] 1
unitroot_ndiffs(atm2_df$ATM2)
ndiffs
1
atm2_df %>%
ACF(ATM2, lag_max = 30) %>%
autoplot()
The approach for ATM2 is rinse and repeat, but in this case differencing is needed; it is achieved with the code below.
atm2_df <- atm2_df %>%
mutate(diff_ATM2= difference(ATM2))
Below we again hold out the last month of data and identify our best model, fitting models on both the differenced and non-differenced data.
atm2_train <- atm2_df %>%
filter(DATE <= "2010-04-01")
# run seasonal models on the non-differenced data
atm2_fit_nondiff <- atm2_train %>%
  model(
    SNAIVE = SNAIVE(ATM2),
    ETS = ETS(ATM2)
  )
# run models on the differenced data; slice(2:336) drops the leading NA created by differencing
atm2_fit_diff <- atm2_train %>%
  slice(2:336) %>%
model(
ETS_diff = ETS(diff_ATM2),
ARIMA = ARIMA(diff_ATM2),
`Auto ARIMA` = ARIMA(diff_ATM2, stepwise = FALSE, approx = FALSE)
)
# forecast ATM2 | April (non-differenced models)
atm2_forecast_nondiff <- atm2_fit_nondiff %>%
  forecast(h = 30)
# forecast ATM2 | April (differenced models)
atm2_forecast_diff <- atm2_fit_diff %>%
  forecast(h = 30)
#plot
atm2_forecast_nondiff %>%
autoplot(atm2_df, level = NULL)+
facet_wrap( ~ .model, scales = "free_y") +
guides(colour = guide_legend(title = "Forecast"))+
labs(title= "ATM2 Forecasts | April") +
xlab("Date") +
ylab("$$$ (In Hundreds)")
#plot 2
atm2_forecast_diff %>%
autoplot(atm2_df, level = NULL)+
facet_wrap( ~ .model, scales = "free_y") +
guides(colour = guide_legend(title = "Forecast"))+
labs(title= "ATM2 Forecasts | April") +
xlab("Date") +
ylab("Cash")
accuracy(atm2_forecast_nondiff, atm2_df) %>%
select(.model, RMSE:MAPE)
accuracy(atm2_forecast_diff, atm2_df) %>%
select(.model, RMSE:MAPE)
Among the results, the non-differenced ETS model had the lowest RMSE & MAE, and MPE & MAPE closest to zero, making it the optimal choice.
atm2_fit_ets <- atm2_df %>%
model(
ETS = ETS(ATM2))
#generate the values
atm2_forecast_ets <- atm2_fit_ets %>%
forecast(h=30)
#plot
atm2_forecast_ets %>%
autoplot(atm2_df) +
labs(title = "ATM2 - ETS Forecast | May 2010",
y = "$$$ (In Hundreds)")
(atm2_forecast_results <-
as.data.frame(atm2_forecast_ets) %>%
select(DATE, .mean) %>%
rename(Date = DATE, Cash = .mean)%>%
mutate(Cash=round(Cash,2)))
ATM3 was ultimately omitted: nearly all of its values are zero, leaving too little usable history to model. It can be revisited when more data is provided.
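If a placeholder number for ATM3 were still required, one simple option (sketched here, not part of my forecast) would be the mean of its few non-zero days:
# hypothetical fallback for ATM3: average the non-zero observations,
# treating the long run of zeros as an inactive machine
atm3_nonzero_mean <- atm_df %>%
  as_tibble() %>%
  filter(ATM3 > 0) %>%
  summarise(Cash = mean(ATM3)) %>%
  pull(Cash)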
atm4_df <- atm_df %>%
select(DATE, ATM4)
atm4_df %>%
model(
STL(ATM4 ~ trend(window = 7) +
season(window = "periodic"),
robust = TRUE)) %>%
components() %>%
autoplot()
Considering the variance in the time series, I decided to transform the data with a Box-Cox transformation before forecasting.
**Reference**
Forecasting: Principles and Practice (3rd ed.)
lambda <- aus_production |>
features(Gas, features = guerrero) |>
pull(lambda_guerrero)
aus_production |>
autoplot(box_cox(Gas, lambda)) +
labs(y = "",
title = latex2exp::TeX(paste0(
"Transformed gas production with $\\lambda$ = ",
round(lambda,2))))
(atm4_lambda <- atm4_df %>%
features(ATM4, features = guerrero) %>%
pull(lambda_guerrero))
[1] -0.0737252
atm4_transformed <- BoxCox(atm4_df$ATM4, lambda = atm4_lambda)
# attach the transformed series as a new column
atm4_df$ATM4_T<-atm4_transformed
#plot
atm4_df%>%
autoplot(ATM4_T)
ndiffs(atm4_df$ATM4)
[1] 0
ndiffs(atm4_df$ATM4_T)
[1] 0
atm4_df %>%
ACF(ATM4_T, lag_max = 28) %>%
autoplot()
Using ndiffs() we find no formal need for differencing. The ACF plot above, however, shows a consistent spike at lag 7 (and at its multiples), plus two other significant lags in different periods. Despite ndiffs() returning 0, I believe the transformed data does warrant differencing.
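To visually corroborate the weekly signal before committing to differencing, a season plot over the week is a quick check (my addition; gg_season() is loaded with fpp3):
# overlay the transformed series week by week; aligned day-of-week
# peaks support the weekly seasonality suggested by the ACF
atm4_df %>%
  gg_season(ATM4_T, period = "week")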
atm4_df <- atm4_df %>%
mutate(diff_ATM4= difference(ATM4_T))
atm4_train <- atm4_df %>%
filter(DATE <= "2010-04-01")
# run seasonal models on the non-differenced data
atm4_fit_nondiff <- atm4_train %>%
  model(
    SNAIVE = SNAIVE(ATM4_T),
    ETS = ETS(ATM4_T)
  )
# run models on the differenced data; slice(2:336) drops the leading NA created by differencing
atm4_fit_diff <- atm4_train %>%
  slice(2:336) %>%
model(
ETS_diff = ETS(diff_ATM4),
ARIMA = ARIMA(diff_ATM4),
`Auto ARIMA` = ARIMA(diff_ATM4, stepwise = FALSE, approx = FALSE)
)
# forecast ATM4 | April (non-differenced models)
atm4_forecast_nondiff <- atm4_fit_nondiff %>%
  forecast(h = 30)
# forecast ATM4 | April (differenced models)
atm4_forecast_diff <- atm4_fit_diff %>%
  forecast(h = 30)
#plot
atm4_forecast_nondiff %>%
autoplot(atm4_df, level = NULL)+
facet_wrap( ~ .model, scales = "free_y") +
guides(colour = guide_legend(title = "Forecast"))+
labs(title= "ATM4 Forecasts | April") +
xlab("Date") +
ylab("$$$ (In Hundreds)")
#plot 2
atm4_forecast_diff %>%
autoplot(atm4_df, level = NULL)+
facet_wrap( ~ .model, scales = "free_y") +
guides(colour = guide_legend(title = "Forecast"))+
labs(title= "ATM4 Forecasts | April") +
xlab("Date") +
ylab("Cash")
accuracy(atm4_forecast_nondiff, atm4_df) %>%
select(.model, RMSE:MAPE)
accuracy(atm4_forecast_diff, atm4_df) %>%
select(.model, RMSE:MAPE)
Of the models, SNAIVE on the non-differenced (Box-Cox transformed) data was the most accurate, so I will proceed with it and back-transform the resulting forecasts to the original scale for reporting.
atm4_fit_snaive <- atm4_df %>%
model(
SNAIVE = SNAIVE(ATM4_T))
#generate the values
atm4_forecast_snaive <- atm4_fit_snaive %>%
forecast(h=30)
#plot
atm4_forecast_snaive %>%
autoplot(atm4_df) +
labs(title = "ATM2 - SNAIVE Forecast | May 2010",
y = "$$$ (In Hundreds)")
(atm4_forecast_results <-
  as.data.frame(atm4_forecast_snaive) %>%
  select(DATE, .mean) %>%
  rename(Date = DATE, Cash = .mean) %>%
  # back-transform from the Box-Cox scale to the original units (hundreds of dollars)
  mutate(Cash = round(InvBoxCox(Cash, lambda = atm4_lambda), 2)))
str(power_raw)
tibble [192 × 3] (S3: tbl_df/tbl/data.frame)
$ CaseSequence: num [1:192] 733 734 735 736 737 738 739 740 741 742 ...
$ YYYY-MMM : chr [1:192] "1998-Jan" "1998-Feb" "1998-Mar" "1998-Apr" ...
$ KWH : num [1:192] 6862583 5838198 5420658 5010364 4665377 ...
describe(power_raw)
data.frame(power_raw$`YYYY-MMM`[is.na(power_raw$KWH)])
I renamed YYYY-MMM to DATE as a matter of preference. The cleanup changes the type of DATE, removes CaseSequence (it does not help our model), and rescales KWH to thousands for ease of analysis. As before, we also index by DATE.
# change DATE type, drop CaseSequence, rescale KWH to thousands, index by DATE
power_df <- power_raw %>%
  mutate(DATE = yearmonth(`YYYY-MMM`), KWH = KWH/1000) %>%
  select(-CaseSequence, -`YYYY-MMM`) %>%
  as_tsibble(index = DATE)
head(power_df)
ggplot(power_df, aes(x=KWH))+
geom_histogram(bins=40)+
labs(title = "Monthly Distributions Residential Power Usage | Jan '98 - Dec '13")
power_df %>%
autoplot(KWH) +
labs(title = "Monthly Distributions Residential Power Usage | Jan '98 - Dec '13")
The data is right-skewed with an apparent outlier; both features show in the plots above, and the outlier occurs sometime after January 2010. There is also a missing KWH value to address.
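To pinpoint the rows in question before treating them, a quick filter helps (my addition):
# locate the outlier (the series minimum) and the missing observation
power_df %>%
  filter(is.na(KWH) | KWH == min(KWH, na.rm = TRUE))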
# work on a copy of the data
power_df2 <- power_df
# interpolate the missing KWH value
power_df2$KWH <- na.interp(power_df2$KWH)
# replace the outlier (the series minimum) with the median
power_df2$KWH <- replace(power_df2$KWH, power_df2$KWH == min(power_df2$KWH),
                         median(power_df2$KWH))
Considering the distribution, I again thought it best to replace the missing value with the median, but since I use that method to address the outlier, I decided on na.interp() for the missing value instead; it is a tool from the forecast package maintained on Rob J Hyndman's GitHub, our textbook's author. Regardless, the plots below show the data is still right-skewed but exhibits seasonality with an upward trend.
ggplot(power_df2, aes(x=KWH))+
geom_histogram()+
labs(title = "Monthly Distributions Residential Power Usage | Jan '98 - Dec '13")
#summary
summary(power_df2$KWH)
Min. 1st Qu. Median Mean 3rd Qu. Max.
4313 5444 6330 6532 7609 10656
#ts plot
power_df2 %>%
autoplot(KWH) +
labs(title = "Monthly Distributions Residential Power Usage | Jan '98 - Dec '13")+
ylab(label= "KWH (Thousands)")
Before forecasting I will transform the data using a Box-Cox transformation.
#get lambda
(power_lambda <- power_df2 %>%
features(KWH, features = guerrero) %>%
pull(lambda_guerrero))
[1] -0.2130548
power_df2 <- power_df2 %>%
mutate(KWH_bc = box_cox(KWH, power_lambda))
summary(power_df2$KWH_bc)
Min. 1st Qu. Median Mean 3rd Qu. Max.
3.905 3.943 3.967 3.967 3.994 4.043
ggplot(power_df2, aes(x=KWH_bc))+
geom_histogram()+
labs(title = "Monthly Distributions Residential Power Usage | Jan '98 - Dec '13")
power_df2 %>%
autoplot(KWH_bc) +
labs(y = "KWH in Thousands",
title = "Transformed KWH with Lambda = -0.2130548")
power_df2 %>%
model(
STL(KWH_bc ~ trend(window = 13) +
season(window = "periodic"),
robust = TRUE)) %>%
components() %>%
autoplot()
ndiffs(power_df2$KWH_bc)
[1] 1
power_df2 %>%
ACF(KWH_bc, lag_max = 36) %>%
autoplot()
With ndiffs() returning 1, differencing is needed.
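To quantify the trend and seasonality visible in the STL panels (another check I am adding), feat_stl() reports strength measures on a 0-1 scale:
# strength measures from the STL features; values near 1 indicate
# strong trend and strong yearly seasonality
power_df2 %>%
  features(KWH_bc, feat_stl)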
diff_power <- power_df2 %>%
mutate(diff_KWH= difference(KWH), diff_KWH_bc = difference(KWH_bc))
diff_power<-diff_power%>%
select(-KWH, -KWH_bc)%>%
slice(-1)
ndiffs(diff_power$diff_KWH_bc)
[1] 0
#Differenced data for arima
#split
power_train_diff <- diff_power %>%
filter(year(DATE) < 2013)
#models
power_fit_diff <- power_train_diff %>%
model(
ARIMA = ARIMA(diff_KWH),
`Auto ARIMA` = ARIMA(diff_KWH, stepwise = FALSE, approx = FALSE)
)
#forecast of 2013
power_forecast_diff <- power_fit_diff %>%
forecast(h = "1 year")
#plot
power_forecast_diff %>%
autoplot(diff_power, level = NULL)+
facet_wrap( ~ .model, scales = "free_y") +
guides(colour = guide_legend(title = "Forecast"))+
labs(title= "KWH Forecasts | Jan '13 - Dec '13")+
xlab("Month") +
ylab("KWH in Thousands")
#split
power_train <- power_df2 %>%
filter(year(DATE) < 2013)
#models
power_fit <- power_train %>%
model(
ETS = ETS(KWH),
`Additive ETS` = ETS(KWH ~ error("A") + trend("A") + season("A")),
SNAIVE = SNAIVE(KWH)
)
#forecast of 2013
power_forecast <- power_fit %>%
forecast(h = "1 year")
#plot
power_forecast %>%
autoplot(power_df2, level = NULL)+
facet_wrap( ~ .model, scales = "free_y") +
guides(colour = guide_legend(title = "Forecast"))+
labs(title= "KWH Forecasts | Jan '13 - Dec '13")+
xlab("Month") +
ylab("KWH in Thousands")
#find ARIMA RMSE, MAE
accuracy(power_forecast_diff, diff_power) %>%
select(.model, RMSE:MAE)
#find other RMSE, MAE
accuracy(power_forecast, power_df2) %>%
select(.model, RMSE:MAE)
Additive ETS is the best model based on RMSE and MAE.
# refit the chosen model using the full dataset
power_ETS_fit <- power_df2 %>%
model(`Additive ETS` = ETS(KWH ~ error("A") + trend("A") + season("A")))
#generate the values
power_ETS_forecast <- power_ETS_fit %>%
forecast(h=12)
#plot
power_ETS_forecast %>%
autoplot(power_df2) +
labs(title = "Monthly Residential Power Usage (Additive ETS |2024)",
y = "KWH in Thousands")
(power_forecast_results <-
as.data.frame(power_ETS_forecast) %>%
select(DATE, .mean) %>%
rename('KWH Forecast' = .mean))
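Finally, the assignment asks for the forecasts in an Excel-readable file. A minimal export, assuming the writexl package is installed (it is not loaded above) and using a file name of my choosing:
# one sheet per series; convert the yearmonth index to Date so Excel can read it
writexl::write_xlsx(
  list(ATM1 = atm1_forecast_results,
       ATM2 = atm2_forecast_results,
       ATM4 = atm4_forecast_results,
       Power = power_forecast_results %>% mutate(DATE = as.Date(DATE))),
  "forecasts_624.xlsx")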