In part A, I want you to forecast how much cash is taken out of 4 different atm machines for May 2010. The data is given in a single file. The variable ‘Cash’ is provided in hundreds of dollars, other than that it is straight forward. I am being somewhat ambiguous on purpose to make this have a little more business feeling. Explain and demonstrate your process, techniques used and not used, and your actual forecast. I am giving you data via an excel file, please provide your written report on your findings, visuals, discussion and your R code via an RPubs link along with the actual.rmd file Also please submit the forecast which you will put in an Excel readable file.
To begin, we can start by importing the excel file. Afterwards, we can clean the file contents
atm_read <- read_excel("C:\\Users\\yu_we\\Downloads\\ATM624Data.xlsx")
In RStudio, the dates are read as plain integers, so instead of 5/21/2009, it is read as a number like 39934. I experienced this problem when using lubridate package on its own as it will convert the dates to 2004-2005 instead of the intended 2009-2010 years. To resolve this issue, I aligned RStudio dates.
atm_read$DATE <- as.Date(atm_read$DATE, origin = "1899-12-30")
atm_read$DATE <- lubridate::ymd(atm_read$DATE)
atm_read <- atm_read[complete.cases(atm_read), ]
#atm3_data <- subset(atm_read, ATM == "ATM3")
#max(atm3_data$Cash, na.rm = TRUE)
#min(atm3_data$Cash, na.rm = TRUE)
From this graph, we can see that ATM3 is made up of mostly 0s and contains NAs. Many of the other ATMs also have NAs that will need to be removed as well. In addition, there are outliers in ATM4 that caused a spike in the data which will need to be investigated. It can be noted that there is some seasonality in ATM1 and ATM2.
ggplot(atm_read, aes(x = DATE, y = Cash)) +
geom_line(color = "purple") +
facet_wrap(vars(ATM), ncol = 1, scales = "free_y") +
labs(
title = "Withdraws from ATMs",
x = NULL,
y = ""
) +
theme_minimal()
I separated the ATMs into wide format, using pivot_wider. This is to make forecasting the ATM data easier
atm_read <- atm_read %>%
pivot_wider(
names_from = ATM,
values_from = Cash
)
From this summary of the ATM data, we can see that ATM3 has only recently opened. As the quarters show that it mostly have values of 0.
atm_read %>%
select(-DATE) %>%
summary()
## ATM1 ATM2 ATM3 ATM4
## Min. : 1.00 Min. : 0.00 Min. : 0.0000 Min. : 1.563
## 1st Qu.: 73.00 1st Qu.: 25.50 1st Qu.: 0.0000 1st Qu.: 124.334
## Median : 91.00 Median : 67.00 Median : 0.0000 Median : 403.839
## Mean : 83.89 Mean : 62.58 Mean : 0.7205 Mean : 474.043
## 3rd Qu.:108.00 3rd Qu.: 93.00 3rd Qu.: 0.0000 3rd Qu.: 704.507
## Max. :180.00 Max. :147.00 Max. :96.0000 Max. :10919.762
## NA's :3 NA's :2
I decided to replace the NAs in the ATM cols with the median to get a better sense of the data when forecasting.
#replace NAs in all ATM cols with their col medians
atm_read <- atm_read %>%
mutate(across(starts_with("ATM"),
~ replace(., is.na(.), median(., na.rm = TRUE))))
#replace the maximum val in ATM4 with its median
atm_read <- atm_read %>%
mutate(ATM4 = replace(ATM4,
ATM4 == max(ATM4, na.rm = TRUE),
median(ATM4, na.rm = TRUE)))
After importing and cleaning the data, replacing NAs with their medians, we can now convert the data into timeseries.
atm_ts <- atm_read %>%
as_tsibble(index = DATE)
Tsibble check is true.
is_tsibble(atm_ts)
## [1] TRUE
Before fitting the baseline models, I want to know the STL decomposition of ATM1. From the STL decomposition, we can see that there is a weekly seasonal trend, while the trend and remainder seem to be partially random.
atm_ts %>%
select(DATE, ATM1) %>%
model(
STL(ATM1 ~ season(window = "periodic"), robust = TRUE)
) %>%
components() %>%
autoplot() +
labs(title = "ATM1 STL Decomposition")
Selecting the date from ATM1 into a separate dataframe, I also split the date before/during Feb 28 and after Feb 28 into a train and test dataset, to be able to compare predictions and existing values that will be forecasted.
atm1 <- atm_ts %>%
select(DATE, ATM1)
train <- atm1 %>%
filter(DATE <= as.Date("2010-02-28"))
test <- atm1 %>%
filter(DATE > as.Date("2010-02-28"))
We can now begin by fitting all the models to the train dataset. Using naive and snaive to establish baseline forecast, while fitting a holts winters additive damped model with ETS and an automatic ETS model itself because the seasonal variation is approximately the same through the series. I also wanted to fit an auto ARIMA model for the purposes of comparing all models’ forecast on accuracy and well-behaved residuals.
fit_all <- train %>%
model(
naive = NAIVE(ATM1),
snaive = SNAIVE(ATM1 ~ lag("week")),
hw_add_damped = ETS(ATM1 ~ error("A") + trend("Ad") + season("A")),
ets_auto = ETS(ATM1),
arima_auto = ARIMA(ATM1)
)
We forecast all the models and use the accuracy function to see which model is most appropriate. The seasonal naïve model achieved the lowest RMSE (24.46), significantly outperforming ARIMA (43.64) and the ETS models (~51.5).
fc_all <- fit_all %>%
forecast(h = nrow(test))
accuracy(fc_all, test)
Although my SNAIVE model show some autocorrelation and I reject the null hypothesis since the residuals are not white noise, the RSME shows significant accuracy over the other models.
fit_best <- fit_all %>%select(snaive)
gg_tsresiduals(fit_best)
## Warning: `gg_tsresiduals()` was deprecated in feasts 0.4.2.
## ℹ Please use `ggtime::gg_tsresiduals()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: Removed 7 rows containing missing values or values outside the scale range
## (`geom_line()`).
## Warning: Removed 7 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 7 rows containing non-finite outside the scale range
## (`stat_bin()`).
## Warning: Removed 7 rows containing missing values or values outside the scale range
## (`geom_rug()`).
augment(fit_best) %>%
features(.innov, ljung_box, lag = 24, dof = 5)
I considered testing the models on other metrics such as AIC, AICc, and BIC. However, because most of my models aren’t within the same class, I figured that it wouldn’t be appropriate comparing an ARIMA model to ETS or SNAIVE. Because of so, I ultimately decided on SNAIVE model since it had the lowest RSME.
#extract ATM1 series
atm1 <- atm_ts |>
select(DATE, ATM1)
#fit final seasonal naive model with full data
fit_final <- atm1 |>
model(
snaive = SNAIVE(ATM1 ~ lag("week"))
)
#forecast next 30 days
fc_final <- fit_final |>
forecast(h = 30)
autoplot(fc_final, atm1) +
ggtitle("ATM1 – Seasonal Naive Model") +
labs(y = "Withdrawals", x = "Date")
We now do the same structure for ATM2 and ATM4 as done with ATM1.
ATM2 shows a steady weekly cycle, with withdrawals dropping sharply and then rebounding each week. Overall activity trends slightly downward, with a brief spike and quick decline during that same period.
atm_ts |>
select(DATE, ATM2) |>
model(
STL(ATM2 ~ season(window = "periodic"), robust = TRUE)
) |>
components() |>
autoplot() +
labs(title = "ATM2 STL Decomposition")
atm2 <- atm_ts |>
select(DATE, ATM2)
train2 <- atm2 |>
filter(DATE <= as.Date("2010-02-28"))
test2 <- atm2 |>
filter(DATE > as.Date("2010-02-28"))
fit_all2 <- train2 |>
model(
naive = NAIVE(ATM2),
snaive = SNAIVE(ATM2 ~ lag("week")),
hw_add_damped = ETS(ATM2 ~ error("A") + trend("Ad") + season("A")),
ets_auto = ETS(ATM2),
arima_auto = ARIMA(ATM2)
)
fc_all2 <- fit_all2 |>
forecast(h = nrow(test2))
Interestingly, ATM2 behaves similarly to ATM1 perhaps due to the steady weekly seasonality cycle. The SNAIVE model (~21.3) outperforms the other models when it comes to RSME, ARIMA (~43.9) and ETS (~48.4).
accuracy(fc_all2, test2)
Similarly to ATM1, ATM2’s SNAIVE model as fails to reject the null hypothesis as the residuals do show autocorrelation so the SNAIVE model’s residuals are not white noise. However, like ATM1, I will be choosing the SNAIVE model over the other models as the RSME is significantly lower, meaning more accuracy. The Ljung–Box test for the SNAVIE model showed significant autocorrelation in the residuals (p < 0.001), which is common for simple seasonal benchmarks, but the model still outperforms the other models in out‑of‑sample accuracy. So like ATM1, I will be selecting the SNAIVE model as the forecasting method for ATM2.
fit_best2 <- fit_all2 |> select(snaive)
gg_tsresiduals(fit_best2)
## Warning: Removed 7 rows containing missing values or values outside the scale range
## (`geom_line()`).
## Warning: Removed 7 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 7 rows containing non-finite outside the scale range
## (`stat_bin()`).
## Warning: Removed 7 rows containing missing values or values outside the scale range
## (`geom_rug()`).
augment(fit_best2) |>
features(.innov, ljung_box, lag = 24, dof = 5)
atm2 <- atm_ts |>
select(DATE, ATM2)
fit_final2 <- atm2 |>
model(
snaive = SNAIVE(ATM2 ~ lag("week"))
)
fc_final2 <- fit_final2 |>
forecast(h = 30)
autoplot(fc_final2, atm2) +
ggtitle("ATM2 – Seasonal Naive Model") +
labs(y = "Withdrawals", x = "Date")
Because ATM3 only has 3 recorded data points, we are unable to accurately forecast the next month. Instead, we can try the mean to create a simplified forecast of the data until more is collected.
atm3 <- atm_ts |>
select(DATE, ATM3)
fit3 <- atm3 |>
model(
mean = MEAN(ATM3)
)
fc3 <- fit3 |>
forecast(h = 30)
autoplot(fc3, atm3) +
ggtitle("ATM3 – Mean Forecast Model") +
labs(y = "Withdrawals", x = "Date")
Because we replaced the outlier when cleaning the dataset,the STL decomposition for ATM4 shows a weekly seasonality pattern. Large amounts of cash is also withdrawn at once, before steadily dropping.
atm_ts |>
select(DATE, ATM4) |>
model(
STL(ATM4 ~ season(window = "periodic"), robust = TRUE)
) |>
components() |>
autoplot() +
labs(title = "ATM4 STL Decomposition")
atm4 <- atm_ts |>
select(DATE, ATM4)
train4 <- atm4 |>
filter(DATE <= as.Date("2010-02-28"))
test4 <- atm4 |>
filter(DATE > as.Date("2010-02-28"))
fit_all4 <- train4 |>
model(
naive = NAIVE(ATM4),
snaive = SNAIVE(ATM4 ~ lag("week")),
hw_add_damped = ETS(ATM4 ~ error("A") + trend("Ad") + season("A")),
ets_auto = ETS(ATM4),
arima_auto = ARIMA(ATM4)
)
fc_all4 <- fit_all4 |>
forecast(h = nrow(test4))
Interestingly, the ARIMA model performs best (293.5) compared to other modes such as ETS (344.2) and Holts-Winters (341.9). Unlike ATM1 and ATM2, ATM3 significantly outperforms the SNAIVE and NAIVE models, which makes sense as ATM4 had a huge outlier that breaks the weekly pattern, in addition to large amounts of cash withdrawn before spiking downward.
accuracy(fc_all4, test4)
From performing the ljung box test, we see that the p-value is 0.3078 (> 0.05). So we fail to reject the null hypothesis, meaning that the residuals do not show significant autocorrelation and that the ARIMA model’s residuals behave like white noise. This means that our model is statistically adequate for ATM4. So we decided on selecting this model to forecast ATM4.
fit_best4 <- fit_all4 |> select(arima_auto)
gg_tsresiduals(fit_best4)
augment(fit_best4) |>
features(.innov, ljung_box, lag = 24, dof = 5)
atm4 <- atm_ts |>
select(DATE, ATM4)
fit_final4 <- atm4 |>
model(
arima = ARIMA(ATM4)
)
fc_final4 <- fit_final4 |>
forecast(h = 30)
autoplot(fc_final4, atm4) +
ggtitle("ATM4 – ARIMA Model") +
labs(y = "Withdrawals", x = "Date")
We can now export the forecasts of each ATM using our selected
model.
# ATM1 sNAIVE
atm1 <- atm_ts |> select(DATE, ATM1)
fit1 <- atm1 |> model(snaive = SNAIVE(ATM1 ~ lag("week")))
fc1 <- fit1 |> forecast(h = 30) |> mutate(ATM = "ATM1")
#ATM2 SNAIVE
atm2 <- atm_ts |> select(DATE, ATM2)
fit2 <- atm2 |> model(snaive = SNAIVE(ATM2 ~ lag("week")))
fc2 <- fit2 |> forecast(h = 30) |> mutate(ATM = "ATM2")
#ATM3 MEAN
atm3 <- atm_ts |> select(DATE, ATM3)
fit3 <- atm3 |> model(mean = MEAN(ATM3))
fc3 <- fit3 |> forecast(h = 30) |> mutate(ATM = "ATM3")
#ATM4 ARIMA
atm4 <- atm_ts |> select(DATE, ATM4)
fit4 <- atm4 |> model(arima = ARIMA(ATM4))
fc4 <- fit4 |> forecast(h = 30) |> mutate(ATM = "ATM4")
clean_fc <- function(fc) {
if (".distribution" %in% names(fc)) {
fc |> select(-.distribution)
} else {
fc
}
}
standardize_fc <- function(fc) {
fc |>
hilo() |>
as_tibble() |>
select(DATE, .mean, ATM)
}
fc1_std <- standardize_fc(fc1)
fc2_std <- standardize_fc(fc2)
fc3_std <- standardize_fc(fc3)
fc4_std <- standardize_fc(fc4)
all_fc <- bind_rows(fc1_std, fc2_std, fc3_std, fc4_std)
#write_xlsx(all_fc , "ATM_JY_all_forecasts.xlsx")
Part B consists of a simple dataset of residential power usage for January 1998 until December 2013. Your assignment is to model these data and a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in Kilowatt hours, the rest is straight forward. Add this to your existing files above.
We first load the data.
res_pow_read <- read_excel("C:\\Users\\yu_we\\Downloads\\ResidentialCustomerForecastLoad-624.xlsx")
Because the data is only the month and year, we can convert it into R format using yearmonth function.
res_pow_read <- res_pow_read |>
mutate(
date = yearmonth(`YYYY-MMM`) |> as.Date()
)
res_pow_ts <- res_pow_read |>
as_tsibble(index = date)
From the timeseries plot, we can see that there is an outlier around 2010 mark.
res_pow_ts |>
autoplot(KWH) +
labs(title = "January 1998 to December 2013 - Residential Power Usage",
x = "",
y = "KWH") +
theme_minimal()
In addition, we see that in 2008-09-01, there is an NA in the KWH col.
res_pow_read |>
filter(is.na(KWH))
The outlier is from the date: 2010-07-01.
res_pow_read |>
filter(KWH == min(KWH, na.rm = TRUE))
I changed the date col back to year month and remove the YYYY-MMM col as they served the same purpose. Additionally, changing it back to yearmonth allows us to view a periodic trend of the data when we do a STL decomposition. We can replace the outlier with a mean to regard for seasonality. We will also interpolate missing values with the mean. I wanted to interpolate and impute the outlier and NA with ARIMA, however I wasn’t able to get that to work. But if it did, it would’ve interpolated the NA and outlier to account for trend and seasonality, not smoothing or averaging like with the mean.
res_pow_ts <- res_pow_read |>
mutate(date = yearmonth(date)) |>
as_tsibble(index = date)
res_pow_filled <- res_pow_ts |>
fill_gaps()
res_pow_simple <- res_pow_filled |>
mutate(
KWH = case_when(
date == as.Date("2008-09-01") ~ mean(c(
KWH[date == as.Date("2008-08-01")],
KWH[date == as.Date("2008-10-01")]
), na.rm = TRUE),
TRUE ~ KWH
)
)
## Warning: There were 3 warnings in `mutate()`.
## The first warning was:
## ℹ In argument: `KWH = case_when(...)`.
## Caused by warning:
## ! Incompatible methods ("==.vctrs_vctr", "==.Date") for "=="
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 2 remaining warnings.
res_pow_simple <- res_pow_simple |>
mutate(
KWH = case_when(
date == as.Date("2010-07-01") ~ mean(c(
KWH[date == as.Date("2010-06-01")],
KWH[date == as.Date("2010-08-01")]
), na.rm = TRUE),
TRUE ~ KWH
)
)
## Warning: There were 3 warnings in `mutate()`.
## The first warning was:
## ℹ In argument: `KWH = case_when(...)`.
## Caused by warning:
## ! Incompatible methods ("==.vctrs_vctr", "==.Date") for "=="
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 2 remaining warnings.
res_pow_simple <- res_pow_simple |>
select(-`YYYY-MMM`) #remove col
# res_pow_filled <- res_pow_ts |>
# fill_gaps()
#
#
# res_pow_miss <- res_pow_filled |>
# mutate(
# KWH = case_when(
# date == as.Date("2008-09-01") ~ NA_real_,
# date == as.Date("2010-07-01") ~ NA_real_,
# TRUE ~ KWH
# )
# )
Now that the outlier and NA is replaced with a mean, we can explore the STL decomposition of the resident power usage data. The residuals/remainder appears to be random, but spikes around the end of December 2013. We can see that there is an increase over time of the trend, before it began to decrease slightly. In addition, we can also observe that there seems to be seasonality yearly, with peaks around the summer and winter time, possibly because of higher heating or air conditioning usage during those times.
res_pow_simple |>
model(
stl = STL(KWH ~ season(window = "periodic"), robust = TRUE)
) |>
components() |>
autoplot() +
labs(title = "STL Decomposition - Residential Power Usage",
x = "",
y = "")
For the resident power usage dataset, I will be using the same models and approach from the first part and using a test and train dataset, where the test data will be the remaining year, while the train data will be all months besides the last year. Using naive and snaive to establish baseline forecast, and an automatic ETS model itself because the seasonal variation is approximately the same through the series. I also wanted to fit an auto ARIMA model for the purposes of comparing all models’ forecast on accuracy and well-behaved residuals.
train_pow <- res_pow_simple |>
filter(date <= yearmonth("2012 Dec"))
test_pow <- res_pow_simple |>
filter(date > yearmonth("2012 Dec"))
fit_all_pow <- train_pow |>
model(
naive = NAIVE(KWH),
snaive = SNAIVE(KWH ~ lag("year")), #yearly seasonality
ets_auto = ETS(KWH),
arima_auto = ARIMA(KWH)
)
fc_all_pow <- fit_all_pow |>
forecast(h = nrow(test_pow))
From the table, we can see that ARIMA has the most accuracy out of the models. (RSME 959735.3) while other models (RSME > 1000000).
accuracy(fc_all_pow, test_pow)
fit_best_pow <- fit_all_pow |> select(arima_auto)
The residual plot of the ARIMA model for the resident power usage data is normally distributed around zero. The LJung Box test reveals that the p-value is 0.066, meaning that we fail to reject the null hypothesis. So the residuals do not show significant autocorrelation, meaning that the model captured all the predictable structure in the residential power data and that the remaining residuals are random noise. So the ARIMA model is an appropriate choice to forecasting residential power consumption, which is why it will be selected.
gg_tsresiduals(fit_best_pow)
augment(fit_best_pow) |>
features(.innov, ljung_box, lag = 24, dof = 5)
Conducting a final fit of the ARIMA model on the full dataset.
fit_final_pow <- res_pow_simple |>
model(
final = ARIMA(KWH)
)
Forecasting each month of next year (2014).
fc_final_pow <- fit_final_pow |>
forecast(h = 12)
Final Forecast of Residential Power Usage
autoplot(fc_final_pow, res_pow_simple) +
ggtitle("Residential Power Usage –Forecast") +
labs(y = "KWH", x = "Date")
Exporting final forecast to excel
fc_excel <- fc_final_pow |>
as.data.frame()
fc_excel <- tibble(
CaseSequence = 925:936,
Month = fc_final_pow$date,
KWH = fc_final_pow$.mean
)
#write_xlsx(fc_excel , "jy_residential_power_forecast.xlsx")