PART A

Introduction

This request comes from DATA 624 Predictive Analytics at CUNY SPS.

The goal of part A is to forecast how much cash is withdrawn from four different ATMs during May 2010. The data was provided to the class in the project section of our Brightspace. I took the data and uploaded it to my personal GitHub so this RMD can be run on any device: https://raw.githubusercontent.com/evanskaylie/DATA624/refs/heads/main/ATM624DataDates.csv.

Note about this data

ATM 4 contains an outlier value that is more than 6 times the second highest value.

The case for omitting the outlier:

  1. Based on what I can see in this data, the value looks like a data-entry error rather than a day of genuinely extreme usage. ATM 4 does vary widely and is used more heavily than the others overall; however, this single value falls far outside its normal range of variation.
  2. Alternatively, this could be a genuine withdrawal total. Even if it is, it is a one-off occurrence and should not be allowed to skew the forecast upward.

Based on the above logic, this forecast will omit the outlier. More on specifics in the EDA section.

Results Summary

ATM 1 Predictions:

  • This ATM is forecasted (average of forecast) to have daily withdrawals between $540 and $11,030 in the month of May 2010.

  • The lowest withdrawals are expected on Tuesdays.

  • The highest days for withdrawals are expected to be on Sundays and Wednesdays.

  • The total withdrawal amount expected from this ATM is around $259,263.

ATM 2 Predictions:

  • This ATM is forecasted (average of forecast) to have daily withdrawals between $879 and $9,978 in the month of May 2010.
  • The lowest withdrawals are expected on Tuesdays and Mondays.

  • The highest days for withdrawals are expected to be on Wednesdays and Thursdays.

  • The total withdrawal amount expected for ATM 2 is around $185,441.

ATM 3 Predictions:

  • This ATM is forecasted to have daily withdrawals around $8,458. The data from this ATM is limited to only a few days towards the end of the available data. Because of this, the range of possible values is high.

  • Expectations within the week are even, as no weekly seasonality can be found.

  • With low confidence due to lack of data, the total withdrawal amount expected for ATM 3 is around $262,211.

ATM 4 Predictions:

  • This ATM is forecasted (average of forecast) to have daily withdrawals between $24,278 and $51,027 in the month of May 2010.

  • This forecast does not show strong weekly seasonality and has high variation.

  • The total withdrawal amount expected for ATM 4 is around $1,283,766.

Load Libraries

library(tidyverse)
library(tsibble)
library(feasts)
library(ggplot2)
library(forecast)
library(fable)
library(paletteer)
library(dplyr)
library(tidyr)
library(scales)
library(glue)
library(writexl)

EDA

First, load the data. I have the data saved as a .csv file in my GitHub for reproducibility of this code outside of my personal machine.

Then take a look at the data.

# Load the csv from GitHub
atm_df <- read.csv("https://raw.githubusercontent.com/evanskaylie/DATA624/refs/heads/main/ATM624DataDates.csv")

head(atm_df)
##         DATE  ATM Cash
## 1 2009-05-01 ATM1   96
## 2 2009-05-01 ATM2  107
## 3 2009-05-02 ATM1   82
## 4 2009-05-02 ATM2   89
## 5 2009-05-03 ATM1   85
## 6 2009-05-03 ATM2   90

The date field should be converted to a date data type rather than the character data type. Also, the data will need to be in a tsibble format for the visualizations.

# Converting the date column to a date datatype
atm_df <- atm_df |>
  mutate(
    DATE = as.Date(DATE, origin = "1899-12-30")
  )

# Convert to a tsibble
atm_tb <- atm_df |>
  as_tsibble(index = DATE, key = ATM)

# Take another look
summary(atm_tb)
##       DATE                ATM                 Cash        
##  Min.   :2009-05-01   Length:1474        Min.   :    0.0  
##  1st Qu.:2009-08-01   Class :character   1st Qu.:    0.5  
##  Median :2009-11-01   Mode  :character   Median :   73.0  
##  Mean   :2009-10-31                      Mean   :  155.6  
##  3rd Qu.:2010-02-01                      3rd Qu.:  114.0  
##  Max.   :2010-05-14                      Max.   :10920.0  
##                                          NA's   :19

There are several NA’s. Drop those NA rows.

# Drop the null ATM values and null Cash values
atm_tb <- atm_tb |>
  filter(ATM != "") |>
  drop_na(Cash)

# Take another look
summary(atm_tb)
##       DATE                ATM                 Cash        
##  Min.   :2009-05-01   Length:1455        Min.   :    0.0  
##  1st Qu.:2009-08-01   Class :character   1st Qu.:    0.5  
##  Median :2009-10-31   Mode  :character   Median :   73.0  
##  Mean   :2009-10-30                      Mean   :  155.6  
##  3rd Qu.:2010-01-29                      3rd Qu.:  114.0  
##  Max.   :2010-04-30                      Max.   :10920.0

Now to visualize the data.

# Autoplot the data
atm_tb |> 
  autoplot(Cash) +
  facet_wrap(~ATM, scales = "free", nrow = 4) +
  labs(title = "ATM Withdrawals: May 2009 - April 2010",
       x = "Date",
       y = "Cash Withdrawals (hundreds of dollars)") +
  theme_minimal() + 
  guides(col="none")

The visualizations above show the data distributions for the different ATMs.

  • ATM1: Strong variation patterns. Some changes in variance; variance decreases around October 2009 and then increases again before 2010.

  • ATM2: Strong variation patterns, though less consistent than ATM1. Relatively constant variance.

  • ATM3: Almost no usage until the last few days of the series (late April 2010). Even with that late increase, the scale is similar to ATM1 and ATM2, so this data will not be omitted.

  • ATM4: Extreme high usage in February 2010 for a single day. This could be an input error, as that point (10920 on Feb 9 2010) is about 6 times the second highest value (1712 on Sept 22 2009); a quick check of that ratio is sketched below.
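
A small sketch to verify that ratio, using the atm_tb tsibble built above (slice_max is from dplyr):

# Check the ATM4 outlier: the largest withdrawal vs the runner-up
atm_tb |>
  filter(ATM == "ATM4") |>
  as_tibble() |>
  slice_max(Cash, n = 2)   # the maximum is roughly 6 times the second-highest value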

What does the data look like with this potential error removed?

# Take out the outlier in ATM4
atm_tb_no_outl <- atm_tb |>
  filter(Cash < 2500)

# Autoplot the updated data
atm_tb_no_outl |> 
  autoplot(Cash) +
  facet_wrap(~ATM, scales = "free", nrow = 4) +
  labs(title = "ATM Withdrawals: May 2009 - April 2010",
       x = "Date",
       y = "Cash Withdrawals (hundreds of dollars)") +
  theme_minimal() + 
  guides(col="none")

The visualization above shows the updated distributions with the ATM4 outlier removed.

  • ATM4: The scales for ATM4 are still much higher than any of the other ATMs. Without the outlier, the data shows variation changes across time.

Time Series Decomposition

To further understand the breakdown of the time series, a time series decomposition will be helpful. The decomposition used here is additive because the magnitude of the seasonal fluctuations does not appear to change with the level of the series.

# Function to decompose and plot using ggplot2
decompose_and_plot_gg <- function(atm_name) {
  # Filter for specific ATM
  atm_subset <- atm_tb_no_outl |> 
    filter(ATM == atm_name) |>  
    arrange(DATE) |>  
    mutate(Cash = as.numeric(Cash))  

  # Convert to ts object
  atm_ts <- ts(atm_subset$Cash, 
               start = c(2009, as.numeric(format(min(atm_subset$DATE), "%j"))), 
               frequency = 7)  

  # Classical decomposition
  decomp <- decompose(atm_ts, type = "additive")  

  # Convert to dataframe for ggplot
  decomp_df <- data.frame(
    DATE = atm_subset$DATE,
    Observed = decomp$x,
    Trend = decomp$trend,
    Seasonal = decomp$seasonal,
    Remainder = decomp$random
  ) |> 
    pivot_longer(-DATE, names_to = "Component", values_to = "Value")

  # Plot using ggplot2
  ggplot(decomp_df, aes(x = DATE, y = Value, color = Component)) +
    geom_line() +
    facet_wrap(~Component, scales = "free_y", ncol = 1) +
    labs(title = paste("Classical Time Series Decomposition -", atm_name),
         x = "Time", y = "Value") +
    theme_minimal()
}

# Get unique ATM names
atm_list <- unique(atm_tb_no_outl$ATM)

# Apply the function to each ATM
plots <- lapply(atm_list, decompose_and_plot_gg)

# Visualize the decompositions
plots
## [[1]]
## Warning: Removed 12 rows containing missing values or values outside the scale range
## (`geom_line()`).

## 
## [[2]]
## Warning: Removed 12 rows containing missing values or values outside the scale range
## (`geom_line()`).

## 
## [[3]]
## Warning: Removed 12 rows containing missing values or values outside the scale range
## (`geom_line()`).

## 
## [[4]]
## Warning: Removed 12 rows containing missing values or values outside the scale range
## (`geom_line()`).

The visualizations above show the time series decompositions for the different ATMs. Each breakdown will inform what kind of algorithm will be most effective for that ATM forecast.

Data Preprocessing

This section will address any additional preprocessing the data needs before forecasting is possible.

What features can be added to make the predictions more accurate? What is the cadence of the seasonality in the ATMs?

# Subset to October 2009 to inspect within-month patterns
atm_explore_oct09 <- atm_tb_no_outl |>
  filter(DATE >= "2009-10-01") |>
  filter(DATE <= "2009-10-31")

# Plot the updated data
atm_explore_oct09 |> 
  ggplot(aes(x = DATE, y = Cash)) +
  geom_line() +  
  geom_point(color = "salmon", size = 1.5) + 
  facet_wrap(~ATM, scales = "free", nrow = 4) +
  labs(title = "ATM Withdrawals: October 2009",
       x = "Date",
       y = "Cash Withdrawals (hundreds of dollars)") +
  theme_minimal() + 
  guides(col="none")

The first two ATMs show a 7-day seasonal cycle within the filtered month of October 2009.
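
One way to check that weekly cycle more formally is an ACF plot (a hedged sketch for ATM1 only, reusing the gap-filling and mean imputation applied later in the modelling section):

# ACF of daily ATM1 withdrawals; spikes at lags 7, 14, and 21 support a weekly cycle
atm_tb_no_outl |>
  filter(ATM == "ATM1") |>
  fill_gaps() |>
  mutate(Cash = replace_na(Cash, round(mean(Cash, na.rm = TRUE)))) |>
  ACF(Cash, lag_max = 21) |>
  autoplot() +
  labs(title = "ACF of Daily ATM1 Withdrawals")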

There seems to be lots of variation within a month that looks somewhat regular for different ATMs. Adding additional context columns may help with the predictions.

# Widen data set with contextual features
atm_ext <- atm_tb_no_outl |>
  mutate(
    week_day = wday(DATE, label = TRUE),
    month = month(DATE, label = TRUE)
    )

# View the wider dataset with more features
head(atm_ext)
## # A tsibble: 6 x 5 [1D]
## # Key:       ATM [1]
##   DATE       ATM    Cash week_day month
##   <date>     <chr> <int> <ord>    <ord>
## 1 2009-05-01 ATM1     96 Fri      May  
## 2 2009-05-02 ATM1     82 Sat      May  
## 3 2009-05-03 ATM1     85 Sun      May  
## 4 2009-05-04 ATM1     90 Mon      May  
## 5 2009-05-05 ATM1     99 Tue      May  
## 6 2009-05-06 ATM1     88 Wed      May

Visualizing Different Distributions

Visualize the distribution within each week.

# Visualize week day distributions
atm_ext |>
  ggplot(aes(x = week_day, y = Cash, fill = week_day)) +
  geom_boxplot() +
  scale_fill_paletteer_d("colorBlindness::Blue2Orange8Steps") +
  facet_wrap(~ATM, scales = "free_y") +
  labs(title = "Weekly Distribution of ATM Cash Withdrawals",
       x = "Day of Week", 
       y = "Cash Withdrawals (hundreds of dollars)") +
  theme_minimal() + 
  theme(legend.position = "none")

Daily distributions within a week do seem to have a pattern. There also appears to be some weekly seasonality within ATM4, with Thursdays looking much lower overall.

Visualize the distribution across each month.

# Visualize month distributions
atm_ext |>
  ggplot(aes(x = month, y = Cash, fill = month)) +
  geom_boxplot() +
  scale_fill_paletteer_d("colorBlindness::Blue2DarkRed12Steps") +
  facet_wrap(~ATM, scales = "free_y") +
  labs(title = "Monthly Distribution of ATM Cash Withdrawals",
       x = "Month", 
       y = "Cash Withdrawals (hundreds of dollars)") +
  theme_minimal() + 
  theme(legend.position = "none")

Month-over-month distributions also seem to have some seasonality to them.

The four ATMs should be split into distinct data frames.

# Create new data frames from subset data
atm1_df <- atm_ext |>
  filter(ATM == "ATM1")

atm2_df <- atm_ext |>
  filter(ATM == "ATM2")

atm3_df <- atm_ext |>
  filter(ATM == "ATM3")

atm4_df <- atm_ext |>
  filter(ATM == "ATM4")

Now the data is ready to be forecasted.

Forecasting the Data

Each ATM will be forecasted.

Find the Right Model

ATM1

# Handle missing values and fill gaps
atm1_df <- atm1_df |>
  fill_gaps() |>
  mutate(Cash = replace_na(Cash, round(mean(Cash, na.rm = TRUE))))  # Impute missing values with mean

# Calculate lambda using Guerrero's method
atm1_lambda <- atm1_df |>
  features(Cash, features = guerrero) |>
  pull(lambda_guerrero)

# Fit the models
atm1_fit <- atm1_df |>
  model(
    "ARIMA" = ARIMA(box_cox(Cash,atm1_lambda)),
    "Additive ETS" = ETS(Cash ~ error("A") + trend("N") + season("A")),
    "Multiplicative ETS" = ETS(Cash ~ error("M") + trend("N") + season("M")),
    "SNAIVE" = SNAIVE(Cash),
    "Transformed Additive" = ETS(box_cox(Cash,atm1_lambda) ~ error("A") + trend("N") + season("A")),
    "Transformed Multiplicative" = ETS(box_cox(Cash,atm1_lambda) ~ error("M") + trend("N") + season("M")),
    "Transformed SNAIVE" = SNAIVE(box_cox(Cash,atm1_lambda))
  )

# Combine model evaluation metrics (RMSE, MAE, etc.)
atm1_model_results <- left_join(
  glance(atm1_fit) |> select(.model:BIC),
  accuracy(atm1_fit) |> select(.model, RMSE, MAE), 
  by = ".model"
)

# Sort by RMSE for easy model comparison
atm1_sorted_model_results <- atm1_model_results |>
  arrange(RMSE)  

# Print the results to see which model has the lowest RMSE
print(atm1_sorted_model_results)
## # A tibble: 7 × 8
##   .model                       sigma2 log_lik   AIC  AICc   BIC  RMSE   MAE
##   <chr>                         <dbl>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Transformed Multiplicative   0.0367  -1188. 2396. 2396. 2435.  23.5  15.4
## 2 Additive ETS               581.      -2234. 4487. 4488. 4526.  23.8  15.1
## 3 ARIMA                        1.50     -581. 1169. 1169. 1185.  25.1  16.2
## 4 Transformed Additive         1.53    -1150. 2320. 2320. 2359.  25.1  16.2
## 5 Multiplicative ETS           0.134   -2273. 4567. 4567. 4606.  26.3  16.3
## 6 SNAIVE                     779.         NA    NA    NA    NA   27.9  17.8
## 7 Transformed SNAIVE           2.12       NA    NA    NA    NA   27.9  17.8

Based on these results, the best model is the ARIMA model. It has the lowest AIC, AICc, and BIC, and its RMSE and MAE are about average.
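
To see exactly which ARIMA specification was selected, report() from fabletools prints the chosen orders and coefficients (a brief sketch):

# Print the automatically selected ARIMA orders and coefficients for ATM1
atm1_fit |>
  select(ARIMA) |>
  report()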

ATM2

# Handle missing values and fill gaps
atm2_df <- atm2_df |>
  fill_gaps() |>
  mutate(Cash = replace_na(Cash, round(mean(Cash, na.rm = TRUE))))  # Impute missing values with mean

# Calculate lambda using Guerrero's method
atm2_lambda <- atm2_df |>
  features(Cash, features = guerrero) |>
  pull(lambda_guerrero)

# Fit the models
atm2_fit <- atm2_df |>
  model(
    "ARIMA" = ARIMA(box_cox(Cash,atm2_lambda)),
    "Additive ETS" = ETS(Cash ~ error("A") + trend("N") + season("A")),
    "Multiplicative ETS" = ETS(Cash ~ error("M") + trend("N") + season("M")),
    "SNAIVE" = SNAIVE(Cash),
    "Transformed Additive" = ETS(box_cox(Cash,atm2_lambda) ~ error("A") + trend("N") + season("A")),
    "Transformed Multiplicative" = ETS(box_cox(Cash,atm2_lambda) ~ error("M") + trend("N") + season("M")),
    "Transformed SNAIVE" = SNAIVE(box_cox(Cash,atm2_lambda))
  )

# Combine model evaluation metrics (RMSE, MAE, etc.)
atm2_model_results <- left_join(
  glance(atm2_fit) |> select(.model:BIC),
  accuracy(atm2_fit) |> select(.model, RMSE, MAE), 
  by = ".model"
)

# Sort by RMSE for easy model comparison
atm2_sorted_model_results <- atm2_model_results |>
  arrange(RMSE)  

# Print the results to see which model has the lowest RMSE
print(atm2_sorted_model_results)
## # A tibble: 7 × 8
##   .model                      sigma2 log_lik   AIC  AICc   BIC  RMSE   MAE
##   <chr>                        <dbl>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ARIMA                       67.8    -1262. 2537. 2537. 2560.  24.4  17.1
## 2 Additive ETS               638.     -2251. 4522. 4523. 4561.  25.0  17.7
## 3 Transformed Additive        71.9    -1852. 3725. 3726. 3764.  25.3  17.7
## 4 Transformed SNAIVE          99.7       NA    NA    NA    NA   30.1  20.8
## 5 SNAIVE                     906.        NA    NA    NA    NA   30.1  20.8
## 6 Multiplicative ETS           0.390  -2390. 4801. 4801. 4840.  33.8  28.2
## 7 Transformed Multiplicative   0.308  -2012. 4044. 4044. 4083.  34.0  27.0

Based on these results, the best model is again the ARIMA model. It has the lowest AIC, AICc, BIC, RMSE, and MAE.

ATM3

The data for ATM 3 has no non-zero values until the last 3 points in the series. Because of this, the model-comparison approach used for the other ATMs is not meaningful here.

Instead, a simple ETS model will be fit to this short series in the ATM 3 modelling section below. The quick check that follows confirms how sparse the series is.
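
A hedged sketch to verify that sparsity, using the atm3_df subset created above:

# Count the non-zero observations in the ATM3 series
atm3_df |>
  as_tibble() |>
  summarise(non_zero_days = sum(Cash > 0, na.rm = TRUE), total_days = n())

# Show the dates of the non-zero withdrawals
atm3_df |>
  filter(Cash > 0)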

ATM4

# Handle missing values and fill gaps
atm4_df <- atm4_df |>
  fill_gaps() |>
  mutate(Cash = replace_na(Cash, round(mean(Cash, na.rm = TRUE))))  # Impute missing values with mean

# Calculate lambda using Guerrero's method
atm4_lambda <- atm4_df |>
  features(Cash, features = guerrero) |>
  pull(lambda_guerrero)

# Fit the models
atm4_fit <- atm4_df |>
  model(
    "ARIMA" = ARIMA(box_cox(Cash,atm4_lambda)),
    "Additive ETS" = ETS(Cash ~ error("A") + trend("N") + season("A")),
    "Multiplicative ETS" = ETS(Cash ~ error("M") + trend("N") + season("M")),
    "SNAIVE" = SNAIVE(Cash),
    "Transformed Additive" = ETS(box_cox(Cash,atm4_lambda) ~ error("A") + trend("N") + season("A")),
    "Transformed Multiplicative" = ETS(box_cox(Cash,atm4_lambda) ~ error("M") + trend("N") + season("M")),
    "Transformed SNAIVE" = SNAIVE(box_cox(Cash,atm4_lambda))
  )

# Combine model evaluation metrics (RMSE, MAE, etc.)
atm4_model_results <- left_join(
  glance(atm4_fit) |> select(.model:BIC),
  accuracy(atm4_fit) |> select(.model, RMSE, MAE), 
  by = ".model"
)

# Sort by RMSE for easy model comparison
atm4_sorted_model_results <- atm4_model_results |>
  arrange(RMSE)  

# Print the results to see which model has the lowest RMSE
print(atm4_sorted_model_results)
## # A tibble: 7 × 8
##   .model                         sigma2 log_lik   AIC  AICc   BIC  RMSE   MAE
##   <chr>                           <dbl>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Additive ETS               110734.     -3192. 6404. 6404. 6443.  329.  264.
## 2 Transformed Additive          174.     -2013. 4047. 4048. 4086.  338.  260.
## 3 Transformed Multiplicative      0.222  -2009. 4039. 4039. 4078.  345.  264.
## 4 ARIMA                         183.     -1468. 2945. 2945. 2965.  352.  274.
## 5 Multiplicative ETS              0.727  -3192. 6405. 6405. 6444.  354.  276.
## 6 SNAIVE                     205239.        NA    NA    NA    NA   452.  346.
## 7 Transformed SNAIVE            300.        NA    NA    NA    NA   452.  346.

Based on these results, the ARIMA model is chosen. It has the lowest AIC, AICc, and BIC, and its RMSE and MAE are close to those of the best-scoring models.

ATM 1 ARIMA Modelling

# Forecast model
forecast_atm1 <- atm1_fit |>
  forecast(h = 31) |>
  filter(.model=='ARIMA')

# Visualize the ARIMA predictions
autoplot(atm1_df, Cash) + 
  autolayer(forecast_atm1) + 
  ggtitle("ATM1 Cash Withdrawal Forecast") + 
  ylab("Cash Withdrawals (hundreds of dollars)") + 
  xlab("Date")

This looks pretty good! Zooming in on 2010 may help see more detail.

atm1_2010 <- atm1_df |>
  filter(DATE >= "2010-01-01")

# Visualize the ARIMA predictions
autoplot(atm1_2010, Cash) + 
  autolayer(forecast_atm1) + 
  ggtitle("ATM1 Cash Withdrawal Forecast 2010") + 
  ylab("Cash Withdrawals (hundreds of dollars)") + 
  xlab("Date")

atm1_total_pred_value <- forecast_atm1 |> 
  pull(.mean) |> 
  sum(na.rm = TRUE)

print(paste("The total withdrawals expected for ATM 1: around", dollar(atm1_total_pred_value * 100)))
## [1] "The total withdrawals expected for ATM 1: around $259,263"

ATM 1 Predictions:

  • This ATM is forecasted (average of forecast) to have withdrawals between $540 and $11,030 in the month of May 2010.

  • The lowest withdrawals are expected on Tuesdays.

  • The highest days for withdrawals are expected to be on Sundays and Wednesdays.

  • The total withdrawals expected for ATM 1: around $259,263.

ATM 2 ARIMA Modelling

# Forecast model
forecast_atm2 <- atm2_fit |>
  forecast(h = 31) |>
  filter(.model=='ARIMA')

# Visualize the ARIMA predictions
autoplot(atm2_df, Cash) + 
  autolayer(forecast_atm2) + 
  ggtitle("ATM2 Cash Withdrawal Forecast") + 
  ylab("Cash Withdrawals (hundreds of dollars)") + 
  xlab("Date")

This looks pretty good! Zooming in on 2010 may help see more detail.

atm2_2010 <- atm2_df |>
  filter(DATE >= "2010-01-01")

# Visualize the ARIMA predictions
autoplot(atm2_2010, Cash) + 
  autolayer(forecast_atm2) + 
  ggtitle("ATM2 Cash Withdrawal Forecast 2010") + 
  ylab("Cash Withdrawals (hundreds of dollars)") + 
  xlab("Date")

atm2_total_pred_value <- forecast_atm2 |> 
  pull(.mean) |> 
  sum(na.rm = TRUE)

print(paste("The total withdrawals expected for ATM 2: around", dollar(atm2_total_pred_value * 100)))
## [1] "The total withdrawals expected for ATM 2: around $185,441"

ATM 2 Predictions:

  • This ATM is forecasted (average of forecast) to have withdrawals between $879 and $9,978 in the month of May 2010.
  • The lowest withdrawals are expected on Tuesdays and Mondays.

  • The highest days for withdrawals are expected to be on Wednesdays and Thursdays.

  • The total withdrawals expected for ATM 2: around $185,441.

ATM 3 Additive ETS Modelling

# Ensure Cash is numeric and the data is ordered by date
atm3_df$Cash <- as.numeric(atm3_df$Cash)
atm3_df <- atm3_df |> arrange(DATE)

# Convert to tsibble and fill gaps
atm3_tsibble <- atm3_df |> 
  as_tsibble(index = DATE) |> 
  fill_gaps() |>  # Add missing dates
  mutate(Cash = replace_na(Cash, mean(Cash, na.rm = TRUE)))  # Fill missing values

# Fit ETS model
atm3_ets_model <- atm3_tsibble |> 
  model(ETS(Cash))

# Forecast for 31 days
forecast_atm3 <- atm3_ets_model |> 
  forecast(h = "31 days")

# Plot the forecast
forecast_atm3 |>
  autoplot(atm3_tsibble) +
  labs(title = "ATM3 Cash Withdrawal Forecast", 
       x = "Date", 
       y = "Cash Withdrawals (hundreds of dollars)") +
  theme_minimal()

This model is an additive ETS model fit to a series that only recently started showing withdrawals. The forecast range is wide and shows little detail because there is not enough data to confidently estimate any trend or seasonality.
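
To confirm which ETS specification ETS(Cash) settled on, report() prints the selected error/trend/season form (a small sketch):

# Print the ETS specification chosen for ATM3
report(atm3_ets_model)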

atm3_total_pred_value <- forecast_atm3 |> 
  pull(.mean) |> 
  sum(na.rm = TRUE)

print(paste("The total withdrawals expected for ATM 3: around", dollar(atm3_total_pred_value * 100)))
## [1] "The total withdrawals expected for ATM 3: around $262,211"

ATM 3 Predictions:

  • This ATM is forecasted to have withdrawals around $8,458. The data from this ATM is limited to only a few days towards the end of the available data. Because of this, the range of possible values is high.

  • Expectations within the week are even, as no weekly seasonality can be found.

  • With low confidence, the total withdrawals expected for ATM 3: around $262,211.

ATM 4 ARIMA Modelling

# Forecast model
forecast_atm4 <- atm4_fit |>
  forecast(h = 31) |>
  filter(.model=='ARIMA')

# Visualize the ARIMA predictions
autoplot(atm4_df, Cash) + 
  autolayer(forecast_atm4) + 
  ggtitle("ATM4 Cash Withdrawal Forecast") + 
  ylab("Cash Withdrawals (hundreds of dollars)") + 
  xlab("Date")

This looks pretty loose for a prediction. Zooming in on 2010 may help see more detail.

atm4_2010 <- atm4_df |>
  filter(DATE >= "2010-01-01")

# Visualize the ARIMA predictions
autoplot(atm4_2010, Cash) + 
  autolayer(forecast_atm4) + 
  ggtitle("ATM4 Cash Withdrawal Forecast 2010") + 
  ylab("Cash Withdrawals (hundreds of dollars)") + 
  xlab("Date")

atm4_total_pred_value <- forecast_atm4 |> 
  pull(.mean) |> 
  sum(na.rm = TRUE)

print(paste("The total withdrawals expected for ATM 4: around", dollar(atm4_total_pred_value * 100)))
## [1] "The total withdrawals expected for ATM 4: around $1,283,766"

ATM 4 Predictions:

  • This ATM is forecasted (average of forecast) to have withdrawals between $24,278 and $51,027 in the month of May 2010.

  • This forecast does not show strong weekly seasonality and has high variation.

  • The total withdrawals expected for ATM 4: around $1,283,766.

Export the Data Forecast

# Combine all forecasts into one data frame
combined_forecast <- bind_rows(forecast_atm1, forecast_atm2, forecast_atm3, forecast_atm4)

# Write to excel
write_xlsx(combined_forecast, "ATMPredictions.xlsx")
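
write_xlsx expects plain atomic columns, so if the combined fable's distribution column causes trouble on export, a flatter file with just the point forecasts is a possible workaround (a sketch; the file name ATMPredictionsMeans.xlsx is my own choice):

# Flatter export: keep only the key, date, model, and point-forecast columns
combined_forecast |>
  as_tibble() |>
  select(any_of(c("ATM", "DATE", ".model", ".mean"))) |>
  write_xlsx("ATMPredictionsMeans.xlsx")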

PART B

Introduction

This request comes from DATA 624 Predictive Analytics at CUNY SPS.

The goal of part B is to model the provided data and produce a monthly forecast for 2014. The data was provided to the class in the project section of our Brightspace. It contains residential power usage from January 1998 through December 2013. I took the data and uploaded it to my personal GitHub so this RMD can be run on any device: https://raw.githubusercontent.com/evanskaylie/DATA624/refs/heads/main/ResidentialCustomerForecastLoad-624.csv.

Results Summary

Residential Power Usage Predictions:

  • The data follows strong seasonality, and the prediction follows this trend.

  • The highest consumption is expected in January (around 10.2 million KWH) and August (around 10.0 million KWH). The lowest consumption is expected in October, April, and May (around 5.76, 5.91, and 5.94 million KWH, respectively).

  • See the section titled Power Forecast for a visualization.

Load Libraries

library(zoo)

EDA

First, load the data. I have the data saved as a .csv file in my GitHub for reproducibility of this code outside of my personal machine.

Then take a look at the data.

# Load the csv from GitHub
pwr_df <- read.csv("https://raw.githubusercontent.com/evanskaylie/DATA624/refs/heads/main/ResidentialCustomerForecastLoad-624.csv")

head(pwr_df)
##   CaseSequence YYYY.MMM     KWH
## 1          733 1998-Jan 6862583
## 2          734 1998-Feb 5838198
## 3          735 1998-Mar 5420658
## 4          736 1998-Apr 5010364
## 5          737 1998-May 4665377
## 6          738 1998-Jun 6467147

Reformat the feature data types as needed.

# Convert KWH to a monthly time series starting January 1998
pwr_ts <- ts(pwr_df$KWH, start = c(1998, 1), frequency = 12)

# Preview the data
head(pwr_ts)
##          Jan     Feb     Mar     Apr     May     Jun
## 1998 6862583 5838198 5420658 5010364 4665377 6467147
# Another format
summary(pwr_ts)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
##   770523  5429912  6283324  6502475  7620524 10655730        1

Visualize the data in a time series graph.

# Plot the data
autoplot(pwr_ts) +
  labs(title = paste("Residential Power Usage"), 
       x = "Date", 
       y = "Kilowatt hours") +
  theme_minimal()

Replace the NA value by interpolation.

# Fill null value
pwr_ts <- na.approx(pwr_ts)

# Plot the filled data
autoplot(pwr_ts) +
  labs(title = "Residential Power Usage", 
       x = "Date", 
       y = "Kilowatt hours") +
  theme_minimal()

Raw Time Series Decomposition

Try out the time series decomposition to get a better feel for the data.

# Decompose the time series
pwr_decomp <- decompose(pwr_ts)

# Plot the decomposed time series
autoplot(pwr_decomp) +
  labs(title = "Decomposition of Residential Power Usage",
       x = "Date") +
  theme_minimal()

What does the seasonality look like? Looking at a couple of years may help specify the cadence.

# Get 2005 and 2006 data
pwr_0506 <- window(pwr_ts, start = c(2005, 1), end = c(2007, 1))

# Plot the filled data
autoplot(pwr_0506) +
  labs(title = "Residential Power Usage in 2005 and 2006", 
       x = "Date", 
       y = "Kilowatt hours") +
  theme_minimal()

Looks like the seasonal cycle is 12 months. More yearly analysis below.
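
As a hedged cross-check of that annual cycle, a subseries plot from the forecast package groups the observations by calendar month:

# Subseries plot: one panel per calendar month highlights the annual pattern
ggsubseriesplot(pwr_ts) +
  ylab("Kilowatt hours") +
  ggtitle("Residential Power Usage: Monthly Subseries") +
  theme_minimal()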

Data Preprocessing

The outlier for 2010-Jul skews the data. I would assume this outlier comes from some sort of blackout. It is clear in the remainder graph of the Time Series Decomposition that this outlier stands out from the rest of the data and does not follow the same trends.

Remove July 2010 Outlier

Remove the outlier. Visualize the data without it.

# Convert ts to a tibble
pwr_trimmed <- as_tibble(as.data.frame(pwr_ts), rownames = "Date")

# Remove the outlier for July 2010 (observation 151: (2010 - 1998) * 12 + 7)
pwr_trimmed <- pwr_trimmed |> 
  mutate(KWH = ifelse(Date == "151", NA, x))  # the ts values sit in a column named x after conversion

# Convert back to a time series
pwr_trimmed <- ts(pwr_trimmed$KWH, start = start(pwr_ts), frequency = frequency(pwr_ts))

# Fill null value
pwr_trimmed <- na.approx(pwr_trimmed)

# Plot the filled data
autoplot(pwr_trimmed) +
  labs(title = "Residential Power Usage", 
       x = "Date", 
       y = "Kilowatt hours") +
  theme_minimal()

Forecasting the Data

Time Series Decomposition Without Outlier

Try out the time series decomposition without that outlier.

# Decompose the time series
pwr_tr_decomp <- decompose(pwr_trimmed)

# Plot the decomposed time series
autoplot(pwr_tr_decomp) +
  labs(title = "Decomposition of Trimmed Residential Power Usage",
       x = "Date") +
  theme_minimal()

Alright, this data shows high seasonality.

Monthly Visualization

# Plot the consumption per month, year over year
ggseasonplot(pwr_trimmed,
              year.labels=TRUE) +
  ylab("Kilowatt hours") +
  ggtitle("Residential Power Usage by Month") +
  theme_minimal()

The data looks to have some strong monthly trends. Months correlate with each other year over year. The model will need to have seasonality accounted for.

Power Forecast

Build the model with an ARIMA forecast with seasonality. Visualize the result.

# Fit the trimmed power data to a seasonal ARIMA model
fit_pwr <- auto.arima(pwr_trimmed, seasonal = TRUE)

# Forecast on that model for 1 year
forecast_pwr <- forecast(fit_pwr, h = 12)

# Plot the forecast
autoplot(forecast_pwr) +
  labs(title = "Residential Power Usage Forecast", 
       x = "Date", 
       y = "Kilowatt hours") +
  theme_minimal()

This looks pretty good! Zooming in on 2010-2014 may help see more detail.

# Get historical data from 2010 onward
pwr_hist_10_13 <- window(pwr_trimmed, start = c(2010, 1))

# Plot historical data + forecast
autoplot(pwr_hist_10_13, series = "Historical Data") + 
  autolayer(forecast_pwr, series = "Forecast", PI = TRUE) + 
  labs(title = "Residential Power Usage (2010+) and Forecast (2014)", 
       x = "Date", 
       y = "Kilowatt hours") +
  theme_minimal() +
  scale_color_manual(values = c("Historical Data" = "black", "Forecast" = "blue"))

Overall, it looks like the model captures significant seasonality. The prediction visually matches the pattern of previous months. I would be fairly confident that the prediction intervals shown here will contain the actual values at their respective levels.
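
As a sanity check on those intervals, a quick residual diagnostic (a sketch using forecast::checkresiduals) shows whether the ARIMA residuals look like white noise:

# Residual diagnostics for the fitted seasonal ARIMA model:
# time plot, ACF, histogram, and a Ljung-Box test
checkresiduals(fit_pwr)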

Export the Data Forecast

# Save the forecast as a data frame for excel
pwr_fcst_export <- data.frame(
  Date = seq(as.Date("2014-01-01"), by = "month", length.out = 12),
  Power_Forecast_KWH = as.numeric(forecast_pwr$mean)
)

# Write to excel
write_xlsx(pwr_fcst_export, "PowerPredictions.xlsx")

PART C

Introduction

The assignment for part C is to time-base sequence the data and aggregate it by hour. Then it should be determined whether the data is stationary and whether it can be forecast.

For this submission, I will only be doing part of this for partial extra credit: I fix the time stamp inconsistencies, aggregate the data by hour, and determine that it is stationary.

The data was provided to the class in the project section of our Brightspace. I took the data and uploaded it to my personal GitHub so this RMD can be run on any device: https://raw.githubusercontent.com/evanskaylie/DATA624/refs/heads/main/Waterflow_Pipe1.csv and https://raw.githubusercontent.com/evanskaylie/DATA624/refs/heads/main/Waterflow_Pipe2.csv.

Load Libraries

library(lubridate)
library(aTSA)

EDA

Load Data

First, load the data. I have the data saved as a .csv file in my GitHub for reproducibility of this code outside of my personal machine.

Then take a look at the data.

# Load the csv from GitHub
pipe1_raw <- read.csv("https://raw.githubusercontent.com/evanskaylie/DATA624/refs/heads/main/Waterflow_Pipe1.csv")

head(pipe1_raw)
##           Date.Time WaterFlow
## 1 10/23/15 12:24 AM 23.369599
## 2 10/23/15 12:40 AM 28.002881
## 3 10/23/15 12:53 AM 23.065895
## 4 10/23/15 12:55 AM 29.972809
## 5  10/23/15 1:19 AM  5.997953
## 6  10/23/15 1:23 AM 15.935223
# Load the csv from GitHub
pipe2_raw <- read.csv("https://raw.githubusercontent.com/evanskaylie/DATA624/refs/heads/main/Waterflow_Pipe2.csv")

head(pipe2_raw)
##          Date.Time WaterFlow
## 1 10/23/15 1:00 AM  18.81079
## 2 10/23/15 2:00 AM  43.08703
## 3 10/23/15 3:00 AM  37.98770
## 4 10/23/15 4:00 AM  36.12038
## 5 10/23/15 5:00 AM  31.85126
## 6 10/23/15 6:00 AM  28.23809

Aggregate Times

The data needs to be aggregated. To start, the Date.Time feature has to be parsed as a date-time data type. Then the time stamps will be floored to the hour.

# Create new data frames to adjust
pipe1_df <- pipe1_raw
pipe2_df <- pipe2_raw

# Convert Date.Time to datetime (handling AM/PM format correctly)
pipe1_df$Date.Time <- parse_date_time(pipe1_df$Date.Time, orders = "mdy IMp")
pipe2_df$Date.Time <- parse_date_time(pipe2_df$Date.Time, orders = "mdy IMp")

# Floor the time stamps to the hour
pipe1_df$Date.Time <- floor_date(pipe1_df$Date.Time, "hour")
pipe2_df$Date.Time <- floor_date(pipe2_df$Date.Time, "hour")

# Plot for Pipe 1 
plot(pipe1_df$Date.Time, pipe1_df$WaterFlow, 
     type = "p", 
     col = "salmon", 
     main = "WaterFlow in Pipe 1", 
     xlab = "Date Time", 
     ylab = "Water Flow")

# Plot for Pipe 2 
plot(pipe2_df$Date.Time, pipe2_df$WaterFlow, 
     type = "p", 
     col = "lightblue", 
     main = "WaterFlow in Pipe 2", 
     xlab = "Date Time", 
     ylab = "Water Flow")

This is where the times are actually aggregated. Take the average water flow amount within each hour. Then merge the data sets on hour.

# Average rows within each hour for pipe 1
pipe1_agg <- pipe1_df |>
  group_by(Date.Time) |>
  summarise(WaterFlow = mean(WaterFlow, na.rm = TRUE))

# Average rows within each hour for pipe 2
pipe2_agg <- pipe2_df |>
  group_by(Date.Time) |>
  summarise(WaterFlow = mean(WaterFlow, na.rm = TRUE))

# Merge datasets based on Hour
merged_pipes_df <- full_join(pipe1_agg, pipe2_agg, by = "Date.Time", suffix = c("_pipe1", "_pipe2"))

# Reshape data into long format for ggplot
merged_long <- merged_pipes_df |>
  pivot_longer(cols = starts_with("WaterFlow"),
               names_to = "Pipe",
               values_to = "WaterFlow") |>
  mutate(Pipe = ifelse(Pipe == "WaterFlow_pipe1", "Pipe 1", "Pipe 2"))

# Plot using ggplot2 with different colors for each pipe
ggplot(merged_long, aes(x = Date.Time, y = WaterFlow, color = Pipe)) +
  geom_point(alpha = 0.3) +
  labs(title = "WaterFlow in Pipes", x = "Date Time", y = "Water Flow") +
  scale_color_manual(values = c("Pipe 1" = "salmon", "Pipe 2" = "lightblue")) +
  theme_minimal()
## Warning: Removed 766 rows containing missing values or values outside the scale range
## (`geom_point()`).

Zoom in on a specific day to make sure the data is actually hourly.

# Filter to keep only data from November 1st
merged_df_nov1 <- merged_pipes_df |>
  filter(month(Date.Time) == 11 & day(Date.Time) == 1)

# Reshape data into long format for ggplot
merged_long_nov1 <- merged_df_nov1 |>
  pivot_longer(cols = starts_with("WaterFlow"),
               names_to = "Pipe",
               values_to = "WaterFlow") |>
  mutate(Pipe = ifelse(Pipe == "WaterFlow_pipe1", "Pipe 1", "Pipe 2"))

# Plot using ggplot2 with different colors for each pipe
ggplot(merged_long_nov1, aes(x = Date.Time, y = WaterFlow, color = Pipe)) +
  geom_line() +  
  labs(title = "WaterFlow in Pipes (November 1st)", x = "Date Time", y = "Water Flow") +
  scale_color_manual(values = c("Pipe 1" = "salmon", "Pipe 2" = "lightblue")) +
  theme_minimal()

There are some gaps in the data. Otherwise, the data now shows one value per hour per pipe.
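
A rough way to quantify those gaps (a sketch using the merged hourly table created above):

# Count the hours with no reading for each pipe after aggregation
merged_pipes_df |>
  summarise(
    total_hours = n(),
    missing_pipe1 = sum(is.na(WaterFlow_pipe1)),
    missing_pipe2 = sum(is.na(WaterFlow_pipe2))
  )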

Check for Stationarity

Check for stationarity with the Augmented Dickey-Fuller Test.

# ADF test for pipe 1
print("ADF Test for Pipe 1:")
## [1] "ADF Test for Pipe 1:"
pipe1_adf <- adf.test(na.omit(pipe1_agg$WaterFlow))
## Augmented Dickey-Fuller Test 
## alternative: stationary 
##  
## Type 1: no drift no trend 
##      lag    ADF p.value
## [1,]   0 -2.328  0.0212
## [2,]   1 -1.365  0.1908
## [3,]   2 -0.856  0.3729
## [4,]   3 -0.745  0.4124
## [5,]   4 -0.559  0.4791
## Type 2: with drift no trend 
##      lag    ADF p.value
## [1,]   0 -15.02    0.01
## [2,]   1 -11.01    0.01
## [3,]   2  -8.43    0.01
## [4,]   3  -7.65    0.01
## [5,]   4  -6.83    0.01
## Type 3: with drift and trend 
##      lag    ADF p.value
## [1,]   0 -15.12    0.01
## [2,]   1 -11.14    0.01
## [3,]   2  -8.55    0.01
## [4,]   3  -7.80    0.01
## [5,]   4  -6.97    0.01
## ---- 
## Note: in fact, p.value = 0.01 means p.value <= 0.01

Since the p-values are very small (below 0.05), the null hypothesis that the series is non-stationary can be rejected, which indicates that Pipe 1 is stationary.

# ADF test for pipe 2
print("ADF Test for Pipe 2:")
## [1] "ADF Test for Pipe 2:"
pipe2_adf <- adf.test(na.omit(pipe2_agg$WaterFlow))
## Augmented Dickey-Fuller Test 
## alternative: stationary 
##  
## Type 1: no drift no trend 
##      lag   ADF p.value
## [1,]   0 -8.75  0.0100
## [2,]   1 -4.92  0.0100
## [3,]   2 -3.39  0.0100
## [4,]   3 -2.55  0.0112
## [5,]   4 -2.13  0.0342
## [6,]   5 -1.76  0.0784
## [7,]   6 -1.46  0.1583
## Type 2: with drift no trend 
##      lag   ADF p.value
## [1,]   0 -32.4    0.01
## [2,]   1 -22.0    0.01
## [3,]   2 -17.2    0.01
## [4,]   3 -14.4    0.01
## [5,]   4 -13.1    0.01
## [6,]   5 -11.9    0.01
## [7,]   6 -11.0    0.01
## Type 3: with drift and trend 
##      lag   ADF p.value
## [1,]   0 -32.4    0.01
## [2,]   1 -22.0    0.01
## [3,]   2 -17.3    0.01
## [4,]   3 -14.4    0.01
## [5,]   4 -13.2    0.01
## [6,]   5 -12.0    0.01
## [7,]   6 -11.0    0.01
## ---- 
## Note: in fact, p.value = 0.01 means p.value <= 0.01

Since the p-values are very small (below 0.05), the null hypothesis that the series is non-stationary can be rejected, which indicates that Pipe 2 is stationary.

Forecast [No Forecast]

This is where I would do an ARIMA forecast on the data.
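
As a rough illustration of what that could look like (not run for this submission; pipe1_ts, pipe1_fit, and pipe1_fcst are hypothetical names, and the gaps noted above would need filling before a 24-hour seasonal period is really justified):

# Hypothetical sketch: fit an ARIMA model to the hourly Pipe 1 series
# and forecast the next day of hourly flows
pipe1_ts <- ts(pipe1_agg$WaterFlow, frequency = 24)   # assumes a regular hourly series
pipe1_fit <- auto.arima(pipe1_ts, seasonal = TRUE)
pipe1_fcst <- forecast(pipe1_fit, h = 24)

autoplot(pipe1_fcst) +
  labs(title = "Pipe 1 WaterFlow Forecast (Hypothetical)",
       x = "Time (days since start)",
       y = "Water Flow")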