Project 1 Part A Overview:

In Part A, I want you to forecast how much cash is taken out of four different ATM machines for May 2010. The data is given in a single file. The variable ‘Cash’ is provided in hundreds of dollars; other than that it is straightforward. I am being somewhat ambiguous on purpose to give this a little more business feeling. Explain and demonstrate your process, techniques used and not used, and your actual forecast. I am giving you data via an Excel file. Please provide your written report on your findings, visuals, discussion, and your R code via an RPubs link along with the actual .rmd file. Also please submit the forecast, which you will put in an Excel-readable file.

Part A: Deliverables

  • Explanation and breakdown of processes (written report)
  • Visuals
  • Discussion
  • R-Code in RPubs and physical rmd copy
  • Copy of Forecast data csv

Part A: Data Pull

I made these CSVs and uploaded them to GitHub for reproducibility purposes.

partA_data <- read.csv("https://raw.githubusercontent.com/jonburns2454/DATA-624/refs/heads/main/Project%201%20Data/ATM624Data.csv")

EDA - ATM

Necessary libraries

library(tidyverse)
library(naniar)
library(readxl)
library(fpp3)
library(forecast)
library(ggfortify)
library(gridExtra)

Preliminary EDA

# Parse DATE strings (m/d/y h:m:s) into POSIXct datetimes
partA_data$DATE <- mdy_hms(partA_data$DATE)

# Summarize the data
summary(partA_data)
##       DATE                            ATM                 Cash        
##  Min.   :2009-05-01 00:00:00.00   Length:1474        Min.   :    0.0  
##  1st Qu.:2009-08-01 00:00:00.00   Class :character   1st Qu.:    0.5  
##  Median :2009-11-01 00:00:00.00   Mode  :character   Median :   73.0  
##  Mean   :2009-10-31 19:11:48.27                      Mean   :  155.6  
##  3rd Qu.:2010-02-01 00:00:00.00                      3rd Qu.:  114.0  
##  Max.   :2010-05-14 00:00:00.00                      Max.   :10920.0  
##                                                      NA's   :19
# Plot the data
# 1. Cash distribution by ATM
ggplot(partA_data, aes(x = ATM, y = Cash)) +
  geom_boxplot() +
  labs(title = "Cash Distribution by ATM", x = "ATM", y = "Cash")

# 2. Cash over time
ggplot(partA_data, aes(x = DATE, y = Cash, color = ATM)) +
  geom_line() +
  labs(title = "Cash Over Time", x = "Date", y = "Cash")

# 3. Total cash per ATM
total_cash <- partA_data %>%
  group_by(ATM) %>%
  summarize(Total_Cash = sum(Cash))

ggplot(total_cash, aes(x = ATM, y = Total_Cash, fill = ATM)) +
  geom_bar(stat = "identity") +
  labs(title = "Total Cash per ATM", x = "ATM", y = "Total Cash")

Missingness

Missingness across the labeled ATMs (ATM1-ATM4) is minimal, as the table and visualizations show; the majority of the missingness (about 74%) comes from rows with an unlabeled ATM. Cash is the only column with missing values.

gg_miss_var(partA_data) +
  labs(title = "Missing Data ATM dataset")

gg_miss_var(partA_data, facet = ATM) +
  labs(title = "Missing Data by ATM Category")

missing_by_atm <- partA_data %>%
  group_by(ATM) %>%
  summarise(across(everything(), ~ sum(is.na(.)), .names = "missing_{col}"))

total_missing <- colSums(is.na(partA_data))
missing_by_atm <- missing_by_atm %>%
  mutate(across(starts_with("missing_"), ~ . / total_missing[match(sub("missing_", "", cur_column()), names(total_missing))] * 100, .names = "percent_{col}"))
missing_by_atm
## # A tibble: 5 × 5
##   ATM    missing_DATE missing_Cash percent_missing_DATE percent_missing_Cash
##   <chr>         <int>        <int>                <dbl>                <dbl>
## 1 ""                0           14                  NaN                 73.7
## 2 "ATM1"            0            3                  NaN                 15.8
## 3 "ATM2"            0            2                  NaN                 10.5
## 4 "ATM3"            0            0                  NaN                  0  
## 5 "ATM4"            0            0                  NaN                  0

Data Preparation:

Ensuring DATE and Cash are correctly formatted so all of the autoplot functions below run smoothly.

atm_data <- partA_data %>% 
    mutate(DATE = as_date(DATE),      # drop the time-of-day component
           Cash = as.integer(Cash))   # note: integer conversion truncates fractional Cash values

Dealing with missingness

To address the missingness I will first filter the dataset to dates prior to 2010-05-01, because Cash is completely empty from that date on. Next I will use mean imputation (grouped by ATM) to fill in the remaining five missing Cash values. Lastly I will convert the data.frame to a tsibble, with ATM as the key and DATE as the index.

atm_data <- atm_data %>% 
    filter(DATE < "2010-05-01") %>% 
    group_by(ATM) %>%
    mutate(Cash = ifelse(is.na(Cash), mean(Cash, na.rm = TRUE), Cash)) %>% 
    ungroup()
atm_data <- as_tsibble(atm_data, index = DATE, key = ATM)
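As a quick sanity check, the date filter and group-mean imputation should have left no missing Cash values:

# Sanity check: confirm no missing Cash values remain after filtering and imputation
atm_data %>% 
  as_tibble() %>% 
  summarise(missing_cash = sum(is.na(Cash)))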

Evaluate autoplots for each ATM

ATM1:

The trend component of the STL decomposition shows that there really is no trend in the data, while the seasonal component shows a strong weekly season with consistent dips in cash withdrawals. The remainder shows seemingly random variation through most of the year-long period, until around March 2010 when the variation increases and stays elevated.

atm1_auto <-atm_data %>% 
    filter(ATM == "ATM1") %>% 
    autoplot(Cash) +
    ggtitle("Pre-Transformation Autoplot: ATM1")

atm1_stl <- atm_data %>% 
    filter(ATM == "ATM1") %>%
    model(STL(Cash ~ season(window = "periodic"), robust = TRUE)) %>%
    components() %>%
    autoplot() +
    labs(title = "STL Decomposition for ATM1")
grid.arrange(atm1_auto, atm1_stl, ncol = 2)

ATM2:

In a similar fashion to ATM1, this ATM isn't showing a clear trend either; however, if you look closely, cash does seem to fall ever so slightly over the period until an eventual spike in March 2010, followed by a massive dip and recovery. This is also reflected in the remainder, which shows a sustained increase, then a dip, and lastly a recovery.

atm2_auto <- atm_data %>% 
    filter(ATM == "ATM2") %>% 
    autoplot(Cash) +
    ggtitle("Pre-Transformation Autoplot: ATM2")

atm2_stl <- atm_data %>% 
    filter(ATM == "ATM2") %>%
    model(STL(Cash ~ season(window = "periodic"), robust = TRUE)) %>%
    components() %>%
    autoplot() +
    labs(title = "STL Decomposition for ATM2")
grid.arrange(atm2_auto, atm2_stl, ncol = 2)

ATM3:

ATM3 seems to be an outlier ATM: there are only three instances of this ATM dispensing cash, and they all fall at the very end of the period.

atm3_auto <- atm_data %>% 
    filter(ATM == "ATM3") %>% 
    autoplot(Cash) +
    ggtitle("Pre-Transformation Autoplot: ATM3")

atm3_stl <- atm_data %>% 
    filter(ATM == "ATM3") %>%
    model(STL(Cash ~ season(window = "periodic"), robust = TRUE)) %>%
    components() %>%
    autoplot() +
    labs(title = "STL Decomposition for ATM3")
grid.arrange(atm3_auto, atm3_stl, ncol = 2)

ATM4:

ATM4 also presents an issue: there is a massive outlier on February 9, 2010 that would entirely throw off a forecast. Since there isn't much more context to this dataset (i.e., no reason why the value would be this high), I think it best to replace this point by imputing the group mean instead.

atm4_auto <- atm_data %>% 
    filter(ATM == "ATM4") %>% 
    autoplot(Cash) +
    ggtitle("Pre-Transformation Autoplot: ATM4")

atm4_stl <- atm_data %>% 
    filter(ATM == "ATM4") %>%
    model(STL(Cash ~ season(window = "periodic"), robust = TRUE)) %>%
    components() %>%
    autoplot() +
    labs(title = "STL Decomposition for ATM4")
grid.arrange(atm4_auto, atm4_stl, ncol = 2)

Removed Outliers: Autoplots

outlier_date <- as.Date('2010-02-09')
outlier_value <- atm_data %>% filter(DATE == outlier_date & ATM == "ATM4") %>% pull(Cash)

# Calculate the mean of the grouped category excluding the outlier
grouped_mean <- atm_data %>%
  filter(ATM == 'ATM4' & DATE != outlier_date) %>%
  summarise(mean_value = mean(Cash)) %>%
  pull(mean_value)

# Impute the outlier value with the new mean of ATM4 on 02-09-2010
atm_data_fixed_1 <- atm_data %>%
  mutate(Cash = ifelse(DATE == outlier_date & ATM == "ATM4", grouped_mean, Cash))
atm1_auto_fx <- atm_data_fixed_1 %>% 
    filter(ATM == "ATM1") %>% 
    autoplot(Cash) +
    ggtitle("Outlier-Adjusted Autoplot: ATM1")

atm2_auto_fx <- atm_data_fixed_1 %>% 
    filter(ATM == "ATM2") %>% 
    autoplot(Cash) +
    ggtitle("Outlier-Adjusted Autoplot: ATM2")

atm3_auto_fx <- atm_data_fixed_1 %>% 
    filter(ATM == "ATM3") %>% 
    autoplot(Cash) +
    ggtitle("Outlier-Adjusted Autoplot: ATM3")

atm4_auto_fx <- atm_data_fixed_1 %>% 
    filter(ATM == "ATM4") %>% 
    autoplot(Cash) +
    ggtitle("Outlier-Adjusted Autoplot: ATM4")
grid.arrange(atm1_auto_fx, atm2_auto_fx,atm3_auto_fx, atm4_auto_fx,ncol = 2)
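As a quick check that the imputation took, the stored outlier_value can be compared against the new ATM4 value on 2010-02-09:

# Compare the original outlier against its imputed replacement
tibble(
  original_value = outlier_value,
  imputed_value  = atm_data_fixed_1 %>% 
    filter(ATM == "ATM4", DATE == outlier_date) %>% 
    pull(Cash)
)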

Data Transformation (Pre-Forecasting) and choosing the proper model

Due to the seasonality observed across all of the ATMs, I believe ETS and ARIMA would be the best options, but they need to be compared before choosing one. I will use ATM2 to test these different models out.

lambda <- atm_data_fixed_1 %>%
  filter(ATM == "ATM2") %>%
  features(Cash, features = guerrero) %>%
  pull(lambda_guerrero)
accuracy_test <- atm_data_fixed_1 %>% 
  filter(ATM == "ATM2") %>% 
  model(
    mult_monthly = ETS(box_cox(Cash, lambda) ~ error ("A") + trend("N") + season("M")),
    ARIMA = ARIMA(box_cox(Cash, lambda))
  )

# Join the information criteria (from glance) with in-sample RMSE (from accuracy)
left_join(glance(accuracy_test) %>% select(.model:BIC), 
          accuracy(accuracy_test) %>% select(.model, RMSE)) %>%
  arrange(RMSE)
## Joining with `by = join_by(.model)`
## # A tibble: 2 × 7
##   .model       sigma2 log_lik   AIC  AICc   BIC  RMSE
##   <chr>         <dbl>   <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ARIMA          67.6  -1262. 2536. 2536. 2559.  24.4
## 2 mult_monthly   71.3  -1851. 3722. 3722. 3761.  25.2

Looking across AIC, AICc, and BIC, the ARIMA model had the lowest values, and it also posted the lower in-sample RMSE. Thus, ARIMA will be the model used to forecast this data.
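Before locking ARIMA in, a quick look at the residuals of the ATM2 ARIMA fit is worthwhile; a short sketch using the model table above:

# Residual diagnostics for the chosen ARIMA fit on ATM2
accuracy_test %>% 
  select(ARIMA) %>% 
  gg_tsresiduals() +
  ggtitle("ARIMA Residual Diagnostics: ATM2")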

A short function might make this part easier:

Function Setup + forecasted df

atm_arima_fc <- function(data, lambda) {
  # Fit an ARIMA on the Box-Cox-transformed series, then forecast the 31 days of May 2010
  arima_fit <- data %>% model(ARIMA(box_cox(Cash, lambda)))
  arima_fit %>% forecast(h = 31)
}

# Note: lambda here is the single value estimated from ATM2 above, reused for every ATM
ATM1_fc <- atm_data_fixed_1 %>% 
  filter(ATM == "ATM1") %>% 
  atm_arima_fc(lambda)

ATM2_fc <- atm_data_fixed_1 %>% 
  filter(ATM == "ATM2") %>% 
  atm_arima_fc(lambda)

ATM3_fc <- atm_data_fixed_1 %>% 
  filter(ATM == "ATM3") %>% 
  atm_arima_fc(lambda)

ATM4_fc <- atm_data_fixed_1 %>% 
  filter(ATM == "ATM4") %>% 
  atm_arima_fc(lambda)
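Since atm_data_fixed_1 is a keyed tsibble, fable could also fit and forecast all four ATMs in a single call, making the helper function optional; a sketch (still reusing the single ATM2-based lambda):

# One-shot alternative: fable fits one ARIMA per key (ATM) automatically
all_atm_fc <- atm_data_fixed_1 %>% 
  model(ARIMA(box_cox(Cash, lambda))) %>% 
  forecast(h = 31)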

Write out csv

full_forecast_df <- bind_rows(ATM1_fc, ATM2_fc, ATM3_fc, ATM4_fc) %>% 
  as_tibble() %>% 
  select(DATE, ATM, .mean)

# Write out the forecasted data
#full_forecast_df %>% 
  #write.csv("Part_A_Forecasts_Combined_File.csv")

Autoplot Forecast Visualizations:

This final graphic shows the forecasted Cash series for each ATM through May 2010. The ATM1 and ATM2 forecasts carry forward the weekly seasonality of the raw data, along with the scale of its variability. ATM3's forecast, by contrast, reflects the nearly year-long stretch of zero withdrawals and largely looks past the recent uptick, while ATM4 shows a tighter spread through the month of May.

full_forecast_df %>% 
  rename(Cash = .mean) %>% 
  as_tsibble(index = DATE, key = ATM) %>% 
  autoplot(Cash)  +
  autolayer(atm_data_fixed_1, Cash, color = "grey")+ 
  facet_wrap(~ATM, scales = "free", nrow = 4)+
  labs(title = "Forecasted ATM Cash Balance Through May 2010")

Part B Overview:

Part B consists of a simple dataset of residential power usage from January 1998 through December 2013. Your assignment is to model these data and provide a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in kilowatt hours; the rest is straightforward. Add this to your existing files above.

KWH_import <- read.csv("https://raw.githubusercontent.com/jonburns2454/DATA-624/refs/heads/main/Project%201%20Data/ResidentialCustomerForecastLoad-624.csv")

KWH_ts <- KWH_import %>% 
  mutate(YYYY.MMM = yearmonth(YYYY.MMM), KWH = as.numeric(KWH)) %>% 
  as_tsibble(index = YYYY.MMM)
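Before modeling, it is worth confirming the monthly index has no implicit gaps; a one-line sketch:

# Check for implicit gaps in the monthly index
has_gaps(KWH_ts)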

EDA and Formatting

There is currently only one missing KWH value, and it is for September 2008. Before any further EDA this will need to be addressed.

KWH_import[!complete.cases(KWH_import), ]
##     CaseSequence YYYY.MMM KWH
## 129          861 2008-Sep  NA

Using ARIMA to address the missing value

Unlike Part A, where I could group by ATM and impute with the mean, here I fit an ARIMA model and use interpolate() to fill in the missing value.

KWH_interpolate <- KWH_ts %>%
  model(ARIMA(KWH)) %>% 
  interpolate(KWH_ts)
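To see what value the ARIMA-based interpolation filled in for September 2008, a quick sketch:

# Inspect the interpolated September 2008 value
KWH_interpolate %>% 
  filter(YYYY.MMM == yearmonth("2008 Sep"))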

Auto and STL Decomp

The seasonality and upward trend are the most prevalent features within the STL decomposition and autoplot. However, it is not yet clear which model is best for this data, so some more investigation is needed.

KWH_interpolate %>%
  autoplot(KWH) + 
  ggtitle("Residential Power Usage (Monthly)")

KWH_interpolate %>%
  model(STL(KWH ~ season(window = "periodic"), robust = TRUE)) %>%
  components() %>%
  autoplot() +
  labs(title = "STL Decomp")

KWH_interpolate %>% 
  gg_subseries(KWH)+
  ggtitle('Subseries Monthly Trends')

Testing differing models:

Similar to Part A, I will test both ETS with a Box-Cox transformation and ARIMA. Note this is very similar to the code from Part A, just adjusted for this problem.

lambda <- KWH_interpolate %>%
  features(KWH, features = guerrero) %>%
  pull(lambda_guerrero)
accuracy_test <- KWH_interpolate %>% 
  model(
    mult_monthly = ETS(box_cox(KWH, lambda) ~ error ("M") + trend("A") + season("M")),
    ARIMA = ARIMA(box_cox(KWH, lambda))
  )

# Join the information criteria (from glance) with in-sample RMSE (from accuracy)
left_join(glance(accuracy_test) %>% select(.model:BIC), 
          accuracy(accuracy_test) %>% select(.model, RMSE)) %>%
  arrange(RMSE)
## Joining with `by = join_by(.model)`
## # A tibble: 2 × 7
##   .model         sigma2 log_lik   AIC  AICc   BIC     RMSE
##   <chr>           <dbl>   <dbl> <dbl> <dbl> <dbl>    <dbl>
## 1 mult_monthly 0.000676   -534. 1102. 1105. 1157.  852417.
## 2 ARIMA        2.19       -346.  701.  702.  717. 1194982.

While the ETS model boasts the lower in-sample RMSE, the ARIMA model's AIC, AICc, and BIC are all much lower than those of the Box-Cox ETS. Since information criteria are not strictly comparable across model classes, a holdout comparison is sketched below as a cross-check, but I will be sticking with ARIMA again for this forecast.
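A minimal holdout sketch of that cross-check, assuming the last 24 months (2012-2013) as the test window:

# Refit both models on data through 2011, then compare accuracy on the 2012-2013 holdout
train <- KWH_interpolate %>% 
  filter(YYYY.MMM < yearmonth("2012 Jan"))

train %>% 
  model(
    ets   = ETS(box_cox(KWH, lambda) ~ error("M") + trend("A") + season("M")),
    arima = ARIMA(box_cox(KWH, lambda))
  ) %>% 
  forecast(h = 24) %>% 
  accuracy(KWH_interpolate)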

Forecasting with ARIMA

Applying the ARIMA forecast from the last section

Now that the ARIMA model has been fit and the forecast produced, the residuals appear to be mostly white noise and mostly centered around zero.

KWH_forecast <- accuracy_test %>% 
  forecast(h = 12) %>% 
  filter(.model =="ARIMA")

KWH_forecast %>% 
  autoplot(KWH_interpolate)+
  ggtitle('Residential Power Usage (Monthly): Forecasting 2014')

accuracy_test %>%
  select(ARIMA) %>%
  gg_tsresiduals(lag = 24) +
  ggtitle("Redisuals for Forecast through 2014")

Pull out forecasted data

KWH_forecast_data <- KWH_forecast %>% 
  as.data.frame() %>% 
  select(YYYY.MMM, .mean) %>% 
  rename(Month_Year = YYYY.MMM, KWH = .mean)

#KWH_forecast_data %>% 
  #write.csv("Part_B_Forecasted_Data.csv")

KWH_forecast_data
##    Month_Year     KWH
## 1    2014 Jan 8629876
## 2    2014 Feb 7143118
## 3    2014 Mar 6824015
## 4    2014 Apr 6674699
## 5    2014 May 6696593
## 6    2014 Jun 7266722
## 7    2014 Jul 7579630
## 8    2014 Aug 7602538
## 9    2014 Sep 7167714
## 10   2014 Oct 6590928
## 11   2014 Nov 6583600
## 12   2014 Dec 7410337