library(tidyverse)
## Warning: package 'ggplot2' was built under R version 4.3.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readxl)
## Warning: package 'readxl' was built under R version 4.3.3
library(fpp3)
## Warning: package 'fpp3' was built under R version 4.3.3
## Registered S3 method overwritten by 'tsibble':
## method from
## as_tibble.grouped_df dplyr
## ── Attaching packages ──────────────────────────────────────────── fpp3 1.0.0 ──
## ✔ tsibble 1.1.5 ✔ fable 0.3.4
## ✔ tsibbledata 0.4.1 ✔ fabletools 0.4.2
## ✔ feasts 0.3.2
## Warning: package 'tsibble' was built under R version 4.3.3
## Warning: package 'tsibbledata' was built under R version 4.3.3
## Warning: package 'feasts' was built under R version 4.3.3
## Warning: package 'fabletools' was built under R version 4.3.3
## Warning: package 'fable' was built under R version 4.3.3
## ── Conflicts ───────────────────────────────────────────────── fpp3_conflicts ──
## ✖ lubridate::date() masks base::date()
## ✖ dplyr::filter() masks stats::filter()
## ✖ tsibble::intersect() masks base::intersect()
## ✖ tsibble::interval() masks lubridate::interval()
## ✖ dplyr::lag() masks stats::lag()
## ✖ tsibble::setdiff() masks base::setdiff()
## ✖ tsibble::union() masks base::union()
library(fable)
library(purrr)
library(zoo)
##
## Attaching package: 'zoo'
##
## The following object is masked from 'package:tsibble':
##
## index
##
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
library(openxlsx)
## Warning: package 'openxlsx' was built under R version 4.3.3
In part A, I want you to forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable ‘Cash’ is provided in hundreds of dollars, other than that it is straight forward. I am being somewhat ambiguous on purpose to make this have a little more business feeling. Explain and demonstrate your process, techniques used and not used, and your actual forecast. I am giving you data via an excel file, please provide your written report on your findings, visuals, discussion and your R code via an RPubs link along with the actual.rmd file Also please submit the forecast which you will put in an Excel readable file.
ATM_df <- read_excel("C:/Users/Owner/Downloads/ATM624Data.xlsx")
# Converting Date from Numeric to YYYY-MM-DD Format
ATM_df$DATE <- as.Date(ATM_df$DATE, origin = "1899-12-30")
# Information
summary(ATM_df)
## DATE ATM Cash
## Min. :2009-05-01 Length:1474 Min. : 0.0
## 1st Qu.:2009-08-01 Class :character 1st Qu.: 0.5
## Median :2009-11-01 Mode :character Median : 73.0
## Mean :2009-10-31 Mean : 155.6
## 3rd Qu.:2010-02-01 3rd Qu.: 114.0
## Max. :2010-05-14 Max. :10919.8
## NA's :19
As shown below, the ATM data set contains 19 missing entries. Of these, 14 lack both ATM and Cash values, so I will be removing them entirely as there isn’t sufficient information to support meaningful data filling. The remaining 5 missing entries are associated with an ATM category. Since we’re working on time series forecasting, we’ll use an ARIMA model to impute these values. This approach is preferred because ARIMA accounts for various dependencies and patterns in the data, helping to estimate missing values based on trends in surrounding data points.
# Searching for rows that consist NA's
ATM_df[which(is.na(ATM_df$Cash)), ]
# Deleting any NA's that don't have an ATM
ATM_df <- ATM_df[!is.na(ATM_df$ATM), ]
ATM_df <- ATM_df %>%
arrange(ATM)
ATM_df[which(is.na(ATM_df$Cash)), ]
missing_rows <- which(is.na(ATM_df$Cash))
missing_rows
## [1] 44 47 53 414 420
# Converting ATM Dataframe to Tsibble
atm_ts <- as_tsibble(ATM_df, index = DATE, key = ATM)
atm_ts <- atm_ts %>%
# Fit ARIMA model to data consisting missing values
model(ARIMA(Cash)) %>%
# Estimate Cash for the missing values
interpolate(atm_ts)
atm_ts[missing_rows, ]
For the ATMs listed below, we are splitting them into groups and applying several forecasting models, primarily including Random Walk, Naive, Naive-Drift, ETS, and ARIMA. For ATMs 1, 2, and 4, we’re using the ARIMA method, as there is ample historical data available, and it provides the best performance for these cases. For ATM 3, however, ETS has proven to be the most effective model, as this ATM had minimal withdrawal activity for an extended period. Using ARIMA in this case would result in a skewed forecast, underestimating the cash required.
# Removing Scientific Notation
options(scipen = 999)
aggregate(Cash ~ ATM, data = atm_ts, summary)
# Time Series Graph of ATM
ggplot(atm_ts, aes(x = DATE, y = Cash, color = ATM)) +
geom_line() +
labs(title = "ATM Cash Transactions", x = "Date", y = "Cash") +
theme_minimal()
atm_ts %>%
autoplot(Cash) +
facet_wrap(~ATM, scales = "free", nrow = 4) +
labs(title = "ATM Cash Transactions")
# ATM 1
atm_ts_1 <- atm_ts %>%
filter(ATM == 'ATM1')
atm_ts_1 %>%
model(STL(Cash ~ trend(window = 7) +
season(window = "periodic"),
robust = TRUE)) %>%
components() %>%
autoplot()
atm_ts_1 %>% model(
RW = RW(Cash),
ETS = ETS(Cash),
`Naïve` = NAIVE(Cash),
Drift = NAIVE(Cash ~ drift()),
ARIMA = ARIMA(Cash)) %>%
forecast(h = 30) %>%
autoplot(atm_ts_1) +
labs(title = "Forecast of Cash Transactions for ATM 1 Using Various Predictive Models")
arima_model <- atm_ts_1 %>% model(ARIMA(Cash))
ets_model <- atm_ts_1 %>% model(ETS(Cash))
naive_model <- atm_ts_1 %>% model(NAIVE(Cash))
models <- list(arima = arima_model, ets = ets_model, naive = naive_model)
# Accuracy Metrics
accuracy_table <- map_df(models, accuracy, .id = "model")
print(accuracy_table %>% select(-.type, -.model, -ME))
## # A tibble: 3 × 9
## model ATM RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 arima ATM1 23.2 14.4 -102. 117. 0.822 0.840 -0.00838
## 2 ets ATM1 23.7 15.0 -107. 122. 0.853 0.859 0.142
## 3 naive ATM1 50.0 37.4 -132. 167. 2.13 1.81 -0.358
# Forecast for 30 days
forecast_results <- arima_model %>% forecast(h = 30)
a1 = c("Atm1", sum(forecast_results$.mean) * 100)
# ATM 2
atm_ts_2 <- atm_ts %>%
filter(ATM == 'ATM2')
atm_ts_2 %>%
model(STL(Cash ~ trend(window = 7) +
season(window = "periodic"),
robust = TRUE)) %>%
components() %>%
autoplot()
atm_ts_2 %>% model(
RW = RW(Cash),
ETS = ETS(Cash),
`Naïve` = NAIVE(Cash),
Drift = NAIVE(Cash ~ drift()),
ARIMA = ARIMA(Cash)) %>%
forecast(h = 30) %>%
autoplot(atm_ts_2) +
labs(title = "Forecast of Cash Transactions for ATM 2 Using Various Predictive Models")
arima_model <- atm_ts_2 %>% model(ARIMA(Cash))
ets_model <- atm_ts_2 %>% model(ETS(Cash))
naive_model <- atm_ts_2 %>% model(NAIVE(Cash))
models <- list(arima = arima_model, ets = ets_model, naive = naive_model)
# Accuracy Metrics
accuracy_table <- map_df(models, accuracy, .id = "model")
print(accuracy_table %>% select(-.type, -.model, -ME))
## # A tibble: 3 × 9
## model ATM RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 arima ATM2 23.8 16.7 -Inf Inf 0.823 0.803 -0.00268
## 2 ets ATM2 24.6 17.2 -Inf Inf 0.847 0.831 0.0119
## 3 naive ATM2 54.4 42.6 -Inf Inf 2.10 1.84 -0.306
# Forecast for 30 days
forecast_results <- arima_model %>% forecast(h = 30)
a2 = c("Atm2", sum(forecast_results$.mean) * 100)
# ATM 3
atm_ts_3 <- atm_ts %>%
filter(ATM == 'ATM3')
atm_ts_3 %>%
model(STL(Cash ~ trend(window = 7) +
season(window = "periodic"),
robust = TRUE)) %>%
components() %>%
autoplot()
`
atm_ts_3 %>% model(
RW = RW(Cash),
ETS = ETS(Cash),
`Naïve` = NAIVE(Cash),
Drift = NAIVE(Cash ~ drift()),
ARIMA = ARIMA(Cash)) %>%
forecast(h = 30) %>%
autoplot(atm_ts_3) +
labs(title = "Forecast of Cash Transactions for ATM 3 Using Various Predictive Models")
arima_model <- atm_ts_3 %>% model(ARIMA(Cash))
ets_model <- atm_ts_3 %>% model(ETS(Cash))
naive_model <- atm_ts_3 %>% model(NAIVE(Cash))
models <- list(arima = arima_model, ets = ets_model, naive = naive_model)
# Accuracy Metrics
accuracy_table <- map_df(models, accuracy, .id = "model")
print(accuracy_table %>% select(-.type, -.model, -ME))
## # A tibble: 3 × 9
## model ATM RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 arima ATM3 5.03 0.271 34.6 34.6 0.370 0.625 0.0124
## 2 ets ATM3 5.03 0.273 Inf Inf 0.371 0.625 -0.00736
## 3 naive ATM3 5.09 0.310 28.8 40.2 0.423 0.632 -0.149
# Forecast for 30 days
forecast_results <- ets_model %>% forecast(h = 30)
a3 = c("Atm3", sum(forecast_results$.mean) * 100)
An ATM can hold a maximum of $200,000, or 2,000 $100 bills, as per sources from Google Search. Therefore, any daily ATM withdrawals exceeding this amount should be treated as anomalies or reporting errors. Since each data point represents the number of $100 bills withdrawn each day, we can reasonably assume that any value over 2,000 bills in a single day is an error. Realistically, ATMs are rarely restocked within a single day, let alone multiple times. For forecasting ATM4’s cash withdrawals, I opted to filter out these outliers before applying the different predictive models.
# ATM 4
atm_ts %>%
filter(ATM == 'ATM4') %>%
model(STL(Cash ~ trend(window = 7) +
season(window = "periodic"),
robust = TRUE)) %>%
components() %>%
autoplot()
atm_ts_4 <- atm_ts %>%
filter(ATM == 'ATM4') %>%
filter(Cash <= 2000 )
atm_ts_4 <- atm_ts_4 %>% fill_gaps()
atm_ts_4 %>%
autoplot(Cash) +
facet_wrap(~ATM, scales = "free", nrow = 4) +
labs(title = "ATM 4 Cash Transactions(Outlier Removed)")
atm_ts_4 %>% model(
RW = RW(Cash),
ETS = ETS(Cash),
`Naïve` = NAIVE(Cash),
Drift = NAIVE(Cash ~ drift()),
ARIMA = ARIMA(Cash)) %>%
forecast(h = 30) %>%
autoplot(atm_ts_4) +
labs(title = "Forecast of Cash Transactions for ATM 4 Using Various Predictive Models")
## Warning: 1 error encountered for ETS
## [1] ETS does not support missing values.
## Warning in max(ids, na.rm = TRUE): no non-missing arguments to max; returning
## -Inf
## Warning in max(ids, na.rm = TRUE): no non-missing arguments to max; returning
## -Inf
## Warning: Removed 30 rows containing missing values or values outside the scale range
## (`geom_line()`).
arima_model <- atm_ts_4 %>% model(ARIMA(Cash))
ets_model <- atm_ts_4 %>% model(ETS(Cash))
## Warning: 1 error encountered for ETS(Cash)
## [1] ETS does not support missing values.
naive_model <- atm_ts_4 %>% model(NAIVE(Cash))
models <- list(arima = arima_model, ets = ets_model, naive = naive_model)
# Accuracy Metrics
accuracy_table <- map_df(models, accuracy, .id = "model")
print(accuracy_table %>% select(-.type, -.model, -ME))
## # A tibble: 3 × 9
## model ATM RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 arima ATM4 344. 284. -535. 567. 0.822 0.760 0.0818
## 2 ets ATM4 NaN NaN NaN NaN NaN NaN NA
## 3 naive ATM4 478. 387. -489. 549. 1.12 1.06 -0.465
# Forecast for 30 days
forecast_results <- arima_model %>% forecast(h = 30)
a4 = c("Atm4", sum(forecast_results$.mean) * 100)
df1 <- data.frame(matrix(c(a1, a2, a3, a4), nrow = 4, byrow = TRUE))
colnames(df1) <- c("ATM", "Total Dollars")
print(df1)
## ATM Total Dollars
## 1 Atm1 235330.63538969
## 2 Atm2 177724.633390371
## 3 Atm3 253752.392779698
## 4 Atm4 1280071.35746589
write.xlsx(df1, file = "Project1_Data_624_PartA.xlsx", sheetName = "Part A")
Part B consists of a simple dataset of residential power usage for January 1998 until December 2013. Your assignment is to model these data and a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in Kilowatt hours, the rest is straight forward. Add this to your existing files above.
From the Residential Power Usage Excel file, I made two key adjustments to prepare the data. First, I converted the date column, initially formatted as a text string, into a month/year index. This was done by reformatting the dates to a yearmonth() structure. Then, I transformed the data frame into a tsibble, setting it up for time series analysis, as shown in the following sections.
# Importing Excel to df
ResidentialCustomerForecastLoad_624 <- read_excel("C:/Users/Owner/Downloads/ResidentialCustomerForecastLoad-624.xlsx")
# Formatting Time to Month/Year Index
ResidentialCustomerForecastLoad_624 <- rename(ResidentialCustomerForecastLoad_624, date = `YYYY-MMM`)
power_ts <- ResidentialCustomerForecastLoad_624 %>%
mutate(Month = yearmonth(date))%>%
select(-CaseSequence, -date) %>%
tsibble(index= Month)
The second step involved cleaning the dataset by addressing the missing data. There was only one missing entry, specifically in the KWH column. I chose mean imputation for this value, as it best maintains the dataset’s seasonality and consistency with minimal alteration.
power_ts[which(is.na(power_ts$KWH)), ]
# Calculating Mean
mean_KWH <- mean(power_ts$KWH, na.rm = TRUE)
# Impute missing values with Mean
power_ts$KWH[which(is.na(power_ts$KWH))] <- mean_KWH
summary(power_ts$KWH)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 770523 5434539 6314472 6502475 7608792 10655730
power_ts %>%
autoplot(KWH) +
labs(title = "Residential Electricity Consumption in Kilowatt Hours")
power_ts %>%
gg_tsdisplay(KWH, plot_type='partial') +
labs(title = "Residential Power Consumption Prior to Transformation")
# Calculating Lambda
lambda <- power_ts %>%
features(KWH, features = guerrero) %>%
pull(lambda_guerrero)
# Unit Root Test KPSS
power_ts %>%
features(box_cox(KWH,lambda), unitroot_ndiffs)
# Box-Cox Transformation
power_ts %>%
gg_tsdisplay(difference(box_cox(KWH, lambda)), plot_type = 'partial') +
labs(title = paste("Residential Power Usage with lambda = ", round(lambda, 2)))
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).
In comparing various ARIMA models, the ARIMA(1,1,1) model consistently demonstrates the best performance across several evaluation criteria. With lower values in the Akaike Information Criterion (AIC), corrected AIC (AICc), and Bayesian Information Criterion (BIC), we can conclude that ARIMA(1,1,1) is the most suitable model. This outcome is expected, as simpler models with fewer parameters are generally less prone to overfitting and exhibit lower noise, making them preferable in this case.
# Comparison of different ARIMA models
power_fit <- power_ts %>%
model(arima110 = ARIMA(box_cox(KWH,lambda) ~ pdq(1,1,0)),
arima120 = ARIMA(box_cox(KWH,lambda) ~ pdq(1,2,0)),
arima210 = ARIMA(box_cox(KWH,lambda) ~ pdq(2,1,0)),
arima212 = ARIMA(box_cox(KWH,lambda) ~ pdq(2,1,2)),
arima111 = ARIMA(box_cox(KWH,lambda) ~ pdq(1,1,1)))
# Comparison Results
glance(power_fit) %>% arrange(AICc) %>% select(.model:BIC)
In our forecasting analysis, we observe several residual outliers, and the ACF graph shows peaks that extend beyond the defined boundaries, indicating significant thresholds. Additionally, the prominent outlier highlighted on the graph corresponds to the value past -10 on the residual plot. Overall, the distribution appears centered around zero, suggesting that the data exhibits normality.
# Using Best Fit ARIMA Model
power_fit %>%
select(arima111) %>%
gg_tsresiduals() +
ggtitle("Residuals for ARIMA(1,1,0) Model")
# Forecast of 2014 Monthly Utilizing ARIMA(1,1,1)
df_three <- power_ts %>%
model(arima111 = ARIMA(box_cox(KWH,lambda) ~ pdq(1,1,1))) %>%
forecast(h=12)
# Time Series of Forecast
df_three %>%
autoplot() +
labs(title="Monthly Forecast of Residential Power Consumption for 2014")
write.xlsx(df_three, file = "Project1_Data_624_PartB.xlsx", sheetName = "Part B", append = TRUE)
## Warning in FUN(X[[i]], ...): argument is not an atomic vector; coercing
## Warning in FUN(X[[i]], ...): argument is not an atomic vector; coercing
## Warning in FUN(X[[i]], ...): argument is not an atomic vector; coercing
## Warning in FUN(X[[i]], ...): argument is not an atomic vector; coercing
## Warning in FUN(X[[i]], ...): argument is not an atomic vector; coercing
## Warning in FUN(X[[i]], ...): argument is not an atomic vector; coercing
## Warning in FUN(X[[i]], ...): argument is not an atomic vector; coercing
## Warning in FUN(X[[i]], ...): argument is not an atomic vector; coercing
## Warning in FUN(X[[i]], ...): argument is not an atomic vector; coercing
## Warning in FUN(X[[i]], ...): argument is not an atomic vector; coercing
## Warning in FUN(X[[i]], ...): argument is not an atomic vector; coercing
## Warning in FUN(X[[i]], ...): argument is not an atomic vector; coercing