DATA624Project1

Import Libraries

library(tidyverse)

## Warning: package 'ggplot2' was built under R version 4.3.3

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(readxl)

## Warning: package 'readxl' was built under R version 4.3.3

library(fpp3)

## Warning: package 'fpp3' was built under R version 4.3.3

## Registered S3 method overwritten by 'tsibble':
##   method               from 
##   as_tibble.grouped_df dplyr
## ── Attaching packages ──────────────────────────────────────────── fpp3 1.0.0 ──
## ✔ tsibble     1.1.5     ✔ fable       0.3.4
## ✔ tsibbledata 0.4.1     ✔ fabletools  0.4.2
## ✔ feasts      0.3.2

## Warning: package 'tsibble' was built under R version 4.3.3

## Warning: package 'tsibbledata' was built under R version 4.3.3

## Warning: package 'feasts' was built under R version 4.3.3

## Warning: package 'fabletools' was built under R version 4.3.3

## Warning: package 'fable' was built under R version 4.3.3

## ── Conflicts ───────────────────────────────────────────────── fpp3_conflicts ──
## ✖ lubridate::date()    masks base::date()
## ✖ dplyr::filter()      masks stats::filter()
## ✖ tsibble::intersect() masks base::intersect()
## ✖ tsibble::interval()  masks lubridate::interval()
## ✖ dplyr::lag()         masks stats::lag()
## ✖ tsibble::setdiff()   masks base::setdiff()
## ✖ tsibble::union()     masks base::union()

library(fable)
library(purrr)
library(zoo)

## 
## Attaching package: 'zoo'
## 
## The following object is masked from 'package:tsibble':
## 
##     index
## 
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

library(openxlsx)

## Warning: package 'openxlsx' was built under R version 4.3.3

Part A – ATM Forecast, ATM624Data.xlsx

In part A, I want you to forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable ‘Cash’ is provided in hundreds of dollars, other than that it is straight forward. I am being somewhat ambiguous on purpose to make this have a little more business feeling. Explain and demonstrate your process, techniques used and not used, and your actual forecast. I am giving you data via an excel file, please provide your written report on your findings, visuals, discussion and your R code via an RPubs link along with the actual.rmd file Also please submit the forecast which you will put in an Excel readable file.

ATM_df <- read_excel("C:/Users/Owner/Downloads/ATM624Data.xlsx")

# Converting Date from Numeric to YYYY-MM-DD Format
ATM_df$DATE <- as.Date(ATM_df$DATE, origin = "1899-12-30") 

# Information
summary(ATM_df)

##       DATE                ATM                 Cash        
##  Min.   :2009-05-01   Length:1474        Min.   :    0.0  
##  1st Qu.:2009-08-01   Class :character   1st Qu.:    0.5  
##  Median :2009-11-01   Mode  :character   Median :   73.0  
##  Mean   :2009-10-31                      Mean   :  155.6  
##  3rd Qu.:2010-02-01                      3rd Qu.:  114.0  
##  Max.   :2010-05-14                      Max.   :10919.8  
##                                          NA's   :19

Tackling Missing Data

As shown below, the ATM data set contains 19 missing entries. Of these, 14 lack both ATM and Cash values, so I will be removing them entirely as there isn’t sufficient information to support meaningful data filling. The remaining 5 missing entries are associated with an ATM category. Since we’re working on time series forecasting, we’ll use an ARIMA model to impute these values. This approach is preferred because ARIMA accounts for various dependencies and patterns in the data, helping to estimate missing values based on trends in surrounding data points.

# Searching for rows that consist NA's
ATM_df[which(is.na(ATM_df$Cash)), ]

# Deleting any NA's that don't have an ATM
ATM_df <- ATM_df[!is.na(ATM_df$ATM), ]

ATM_df <- ATM_df %>%
  arrange(ATM)
ATM_df[which(is.na(ATM_df$Cash)), ]

missing_rows <- which(is.na(ATM_df$Cash))
missing_rows

## [1]  44  47  53 414 420

# Converting ATM Dataframe to Tsibble 
atm_ts <- as_tsibble(ATM_df, index = DATE, key = ATM)

atm_ts <- atm_ts %>%
  # Fit ARIMA model to data consisting missing values
  model(ARIMA(Cash)) %>%
  # Estimate Cash for the missing values
  interpolate(atm_ts)



atm_ts[missing_rows, ]

Time Series Graph

For the ATMs listed below, we are splitting them into groups and applying several forecasting models, primarily including Random Walk, Naive, Naive-Drift, ETS, and ARIMA. For ATMs 1, 2, and 4, we’re using the ARIMA method, as there is ample historical data available, and it provides the best performance for these cases. For ATM 3, however, ETS has proven to be the most effective model, as this ATM had minimal withdrawal activity for an extended period. Using ARIMA in this case would result in a skewed forecast, underestimating the cash required.

# Removing Scientific Notation
options(scipen = 999)
aggregate(Cash ~ ATM, data = atm_ts, summary)

# Time Series Graph of ATM
ggplot(atm_ts, aes(x = DATE, y = Cash, color = ATM)) +
  geom_line() +
  labs(title = "ATM Cash Transactions", x = "Date", y = "Cash") +
  theme_minimal()

atm_ts %>%
  autoplot(Cash) +
  facet_wrap(~ATM, scales = "free", nrow = 4) +
  labs(title = "ATM Cash Transactions")

ATM 1

# ATM 1
atm_ts_1 <- atm_ts %>%
  filter(ATM == 'ATM1') 
atm_ts_1 %>%
  model(STL(Cash ~ trend(window = 7) +
        season(window = "periodic"),
        robust = TRUE))  %>%
  components()  %>%
  autoplot()

atm_ts_1 %>% model(
  RW = RW(Cash),
  ETS = ETS(Cash),
  `Naïve` = NAIVE(Cash),
  Drift = NAIVE(Cash ~ drift()),
  ARIMA = ARIMA(Cash)) %>%
  forecast(h = 30) %>%
  autoplot(atm_ts_1) +
  labs(title = "Forecast of Cash Transactions for ATM 1 Using Various Predictive Models")

arima_model <- atm_ts_1 %>% model(ARIMA(Cash))
ets_model <- atm_ts_1 %>% model(ETS(Cash))
naive_model <- atm_ts_1 %>% model(NAIVE(Cash))
models <- list(arima = arima_model, ets = ets_model, naive = naive_model)

# Accuracy Metrics
accuracy_table <- map_df(models, accuracy, .id = "model")
print(accuracy_table %>% select(-.type, -.model, -ME))

## # A tibble: 3 × 9
##   model ATM    RMSE   MAE   MPE  MAPE  MASE RMSSE     ACF1
##   <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>    <dbl>
## 1 arima ATM1   23.2  14.4 -102.  117. 0.822 0.840 -0.00838
## 2 ets   ATM1   23.7  15.0 -107.  122. 0.853 0.859  0.142  
## 3 naive ATM1   50.0  37.4 -132.  167. 2.13  1.81  -0.358

# Forecast for 30 days
forecast_results <- arima_model %>% forecast(h = 30)

a1 = c("Atm1", sum(forecast_results$.mean) * 100)

ATM 2

# ATM 2
atm_ts_2 <- atm_ts %>%
  filter(ATM == 'ATM2') 
atm_ts_2 %>%
  model(STL(Cash ~ trend(window = 7) +
        season(window = "periodic"),
        robust = TRUE))  %>%
  components()  %>%
  autoplot()

atm_ts_2 %>% model(
  RW = RW(Cash),
  ETS = ETS(Cash),
  `Naïve` = NAIVE(Cash),
  Drift = NAIVE(Cash ~ drift()),
  ARIMA = ARIMA(Cash)) %>%
  forecast(h = 30) %>%
  autoplot(atm_ts_2) +
  labs(title = "Forecast of Cash Transactions for ATM 2 Using Various Predictive Models")

arima_model <- atm_ts_2 %>% model(ARIMA(Cash))
ets_model <- atm_ts_2 %>% model(ETS(Cash))
naive_model <- atm_ts_2 %>% model(NAIVE(Cash))
models <- list(arima = arima_model, ets = ets_model, naive = naive_model)

# Accuracy Metrics
accuracy_table <- map_df(models, accuracy, .id = "model")
print(accuracy_table %>% select(-.type, -.model, -ME))

## # A tibble: 3 × 9
##   model ATM    RMSE   MAE   MPE  MAPE  MASE RMSSE     ACF1
##   <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>    <dbl>
## 1 arima ATM2   23.8  16.7  -Inf   Inf 0.823 0.803 -0.00268
## 2 ets   ATM2   24.6  17.2  -Inf   Inf 0.847 0.831  0.0119 
## 3 naive ATM2   54.4  42.6  -Inf   Inf 2.10  1.84  -0.306

# Forecast for 30 days
forecast_results <- arima_model %>% forecast(h = 30)

a2 = c("Atm2", sum(forecast_results$.mean) * 100)

ATM 3

# ATM 3
atm_ts_3 <- atm_ts %>%
  filter(ATM == 'ATM3') 
atm_ts_3 %>%
  model(STL(Cash ~ trend(window = 7) +
        season(window = "periodic"),
        robust = TRUE))  %>%
  components()  %>%
  autoplot()

atm_ts_3 %>% model(
  RW = RW(Cash),
  ETS = ETS(Cash),
  `Naïve` = NAIVE(Cash),
  Drift = NAIVE(Cash ~ drift()),
  ARIMA = ARIMA(Cash)) %>%
  forecast(h = 30) %>%
  autoplot(atm_ts_3) +
  labs(title = "Forecast of Cash Transactions for ATM 3 Using Various Predictive Models")

arima_model <- atm_ts_3 %>% model(ARIMA(Cash))
ets_model <- atm_ts_3 %>% model(ETS(Cash))
naive_model <- atm_ts_3 %>% model(NAIVE(Cash))
models <- list(arima = arima_model, ets = ets_model, naive = naive_model)

# Accuracy Metrics
accuracy_table <- map_df(models, accuracy, .id = "model")
print(accuracy_table %>% select(-.type, -.model, -ME))

## # A tibble: 3 × 9
##   model ATM    RMSE   MAE   MPE  MAPE  MASE RMSSE     ACF1
##   <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>    <dbl>
## 1 arima ATM3   5.03 0.271  34.6  34.6 0.370 0.625  0.0124 
## 2 ets   ATM3   5.03 0.273 Inf   Inf   0.371 0.625 -0.00736
## 3 naive ATM3   5.09 0.310  28.8  40.2 0.423 0.632 -0.149

# Forecast for 30 days
forecast_results <- ets_model %>% forecast(h = 30)

a3 = c("Atm3", sum(forecast_results$.mean) * 100)

ATM 4

An ATM can hold a maximum of $200,000, or 2,000 $100 bills, as per sources from Google Search. Therefore, any daily ATM withdrawals exceeding this amount should be treated as anomalies or reporting errors. Since each data point represents the number of $100 bills withdrawn each day, we can reasonably assume that any value over 2,000 bills in a single day is an error. Realistically, ATMs are rarely restocked within a single day, let alone multiple times. For forecasting ATM4’s cash withdrawals, I opted to filter out these outliers before applying the different predictive models.

# ATM 4
atm_ts %>%
  filter(ATM == 'ATM4') %>%
  model(STL(Cash ~ trend(window = 7) +
        season(window = "periodic"),
        robust = TRUE))  %>%
  components()  %>%
  autoplot()

atm_ts_4 <- atm_ts %>%
  filter(ATM == 'ATM4') %>%
  filter(Cash <= 2000 ) 


atm_ts_4 <- atm_ts_4 %>% fill_gaps()

atm_ts_4 %>%
  autoplot(Cash) +
  facet_wrap(~ATM, scales = "free", nrow = 4) +
  labs(title = "ATM 4 Cash Transactions(Outlier Removed)")

atm_ts_4 %>% model(
  RW = RW(Cash),
  ETS = ETS(Cash),
  `Naïve` = NAIVE(Cash),
  Drift = NAIVE(Cash ~ drift()),
  ARIMA = ARIMA(Cash)) %>%
  forecast(h = 30) %>%
  autoplot(atm_ts_4) +
  labs(title = "Forecast of Cash Transactions for ATM 4 Using Various Predictive Models")

## Warning: 1 error encountered for ETS
## [1] ETS does not support missing values.

## Warning in max(ids, na.rm = TRUE): no non-missing arguments to max; returning
## -Inf

## Warning in max(ids, na.rm = TRUE): no non-missing arguments to max; returning
## -Inf

## Warning: Removed 30 rows containing missing values or values outside the scale range
## (`geom_line()`).

arima_model <- atm_ts_4 %>% model(ARIMA(Cash))
ets_model <- atm_ts_4 %>% model(ETS(Cash))

## Warning: 1 error encountered for ETS(Cash)
## [1] ETS does not support missing values.

naive_model <- atm_ts_4 %>% model(NAIVE(Cash))
models <- list(arima = arima_model, ets = ets_model, naive = naive_model)

# Accuracy Metrics
accuracy_table <- map_df(models, accuracy, .id = "model")
print(accuracy_table %>% select(-.type, -.model, -ME))

## # A tibble: 3 × 9
##   model ATM    RMSE   MAE   MPE  MAPE    MASE   RMSSE    ACF1
##   <chr> <chr> <dbl> <dbl> <dbl> <dbl>   <dbl>   <dbl>   <dbl>
## 1 arima ATM4   344.  284. -535.  567.   0.822   0.760  0.0818
## 2 ets   ATM4   NaN   NaN   NaN   NaN  NaN     NaN     NA     
## 3 naive ATM4   478.  387. -489.  549.   1.12    1.06  -0.465

# Forecast for 30 days
forecast_results <- arima_model %>% forecast(h = 30)

a4 = c("Atm4", sum(forecast_results$.mean) * 100)

Excel File

df1 <- data.frame(matrix(c(a1, a2, a3, a4), nrow = 4, byrow = TRUE))
colnames(df1) <- c("ATM", "Total Dollars")
print(df1)

##    ATM    Total Dollars
## 1 Atm1  235330.63538969
## 2 Atm2 177724.633390371
## 3 Atm3 253752.392779698
## 4 Atm4 1280071.35746589

write.xlsx(df1, file = "Project1_Data_624_PartA.xlsx", sheetName = "Part A")

Part B

Part B consists of a simple dataset of residential power usage for January 1998 until December 2013. Your assignment is to model these data and a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in Kilowatt hours, the rest is straight forward. Add this to your existing files above.

Manipulating Data

From the Residential Power Usage Excel file, I made two key adjustments to prepare the data. First, I converted the date column, initially formatted as a text string, into a month/year index. This was done by reformatting the dates to a yearmonth() structure. Then, I transformed the data frame into a tsibble, setting it up for time series analysis, as shown in the following sections.

# Importing Excel to df
ResidentialCustomerForecastLoad_624 <- read_excel("C:/Users/Owner/Downloads/ResidentialCustomerForecastLoad-624.xlsx")
# Formatting Time to Month/Year Index
ResidentialCustomerForecastLoad_624 <- rename(ResidentialCustomerForecastLoad_624, date = `YYYY-MMM`)
power_ts <- ResidentialCustomerForecastLoad_624 %>% 
  mutate(Month = yearmonth(date))%>%
  select(-CaseSequence, -date) %>% 
  tsibble(index= Month)

Working with Missing Data

The second step involved cleaning the dataset by addressing the missing data. There was only one missing entry, specifically in the KWH column. I chose mean imputation for this value, as it best maintains the dataset’s seasonality and consistency with minimal alteration.

power_ts[which(is.na(power_ts$KWH)), ]

# Calculating Mean
mean_KWH <- mean(power_ts$KWH, na.rm = TRUE)

# Impute missing values with Mean
power_ts$KWH[which(is.na(power_ts$KWH))] <- mean_KWH

summary(power_ts$KWH)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##   770523  5434539  6314472  6502475  7608792 10655730

Time Series Visualization

power_ts %>%
  autoplot(KWH) +
  labs(title = "Residential Electricity Consumption in Kilowatt Hours")

power_ts %>%
  gg_tsdisplay(KWH, plot_type='partial') +
  labs(title = "Residential Power Consumption Prior to Transformation")

# Calculating Lambda
lambda <- power_ts %>%
  features(KWH, features = guerrero) %>%
  pull(lambda_guerrero)


# Unit Root Test KPSS
power_ts %>%
  features(box_cox(KWH,lambda), unitroot_ndiffs)

# Box-Cox Transformation
power_ts %>%
  gg_tsdisplay(difference(box_cox(KWH, lambda)), plot_type = 'partial') +
  labs(title = paste("Residential Power Usage with lambda = ", round(lambda, 2)))

## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).

## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).

Comparison of Models

In comparing various ARIMA models, the ARIMA(1,1,1) model consistently demonstrates the best performance across several evaluation criteria. With lower values in the Akaike Information Criterion (AIC), corrected AIC (AICc), and Bayesian Information Criterion (BIC), we can conclude that ARIMA(1,1,1) is the most suitable model. This outcome is expected, as simpler models with fewer parameters are generally less prone to overfitting and exhibit lower noise, making them preferable in this case.

# Comparison of different ARIMA models
power_fit <- power_ts %>%
  model(arima110 = ARIMA(box_cox(KWH,lambda) ~ pdq(1,1,0)),
        arima120 = ARIMA(box_cox(KWH,lambda) ~ pdq(1,2,0)),
        arima210 = ARIMA(box_cox(KWH,lambda) ~ pdq(2,1,0)),
        arima212 = ARIMA(box_cox(KWH,lambda) ~ pdq(2,1,2)),
        arima111 = ARIMA(box_cox(KWH,lambda) ~ pdq(1,1,1)))

# Comparison Results
glance(power_fit) %>% arrange(AICc) %>% select(.model:BIC)

Forecast of 2014

In our forecasting analysis, we observe several residual outliers, and the ACF graph shows peaks that extend beyond the defined boundaries, indicating significant thresholds. Additionally, the prominent outlier highlighted on the graph corresponds to the value past -10 on the residual plot. Overall, the distribution appears centered around zero, suggesting that the data exhibits normality.

# Using Best Fit ARIMA Model
power_fit %>%
  select(arima111) %>%
  gg_tsresiduals() +
  ggtitle("Residuals for ARIMA(1,1,0) Model")

# Forecast of 2014 Monthly Utilizing ARIMA(1,1,1)
df_three <- power_ts %>%
  model(arima111 = ARIMA(box_cox(KWH,lambda) ~ pdq(1,1,1))) %>%
  forecast(h=12)

# Time Series of Forecast
df_three %>%
  autoplot() +
  labs(title="Monthly Forecast of Residential Power Consumption for 2014")

Excel

write.xlsx(df_three, file = "Project1_Data_624_PartB.xlsx", sheetName = "Part B", append = TRUE)

## Warning in FUN(X[[i]], ...): argument is not an atomic vector; coercing

## Warning in FUN(X[[i]], ...): argument is not an atomic vector; coercing

## Warning in FUN(X[[i]], ...): argument is not an atomic vector; coercing

## Warning in FUN(X[[i]], ...): argument is not an atomic vector; coercing

## Warning in FUN(X[[i]], ...): argument is not an atomic vector; coercing

## Warning in FUN(X[[i]], ...): argument is not an atomic vector; coercing

## Warning in FUN(X[[i]], ...): argument is not an atomic vector; coercing

## Warning in FUN(X[[i]], ...): argument is not an atomic vector; coercing

## Warning in FUN(X[[i]], ...): argument is not an atomic vector; coercing

## Warning in FUN(X[[i]], ...): argument is not an atomic vector; coercing

## Warning in FUN(X[[i]], ...): argument is not an atomic vector; coercing

## Warning in FUN(X[[i]], ...): argument is not an atomic vector; coercing