In Part A, I want you to forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable ‘Cash’ is provided in hundreds of dollars; other than that it is straightforward. I am being somewhat ambiguous on purpose to give this a little more of a business feel. Explain and demonstrate your process, techniques used and not used, and your actual forecast. I am giving you data via an Excel file. Please provide your written report on your findings, visuals, discussion, and your R code via an RPubs link along with the actual .rmd file. Also, please submit the forecast in an Excel-readable file.
I made these CSVs and uploaded them to GitHub for reproducibility purposes.
partA_data <- read.csv("https://raw.githubusercontent.com/jonburns2454/DATA-624/refs/heads/main/Project%201%20Data/ATM624Data.csv")
library(tidyverse)
library(naniar)
library(readxl)
library(fpp3)
library(forecast)
library(ggfortify)
library(gridExtra)
# Convert DATE to Date type
partA_data$DATE <- mdy_hms(partA_data$DATE)
# Summarize the data
summary(partA_data)
## DATE ATM Cash
## Min. :2009-05-01 00:00:00.00 Length:1474 Min. : 0.0
## 1st Qu.:2009-08-01 00:00:00.00 Class :character 1st Qu.: 0.5
## Median :2009-11-01 00:00:00.00 Mode :character Median : 73.0
## Mean :2009-10-31 19:11:48.27 Mean : 155.6
## 3rd Qu.:2010-02-01 00:00:00.00 3rd Qu.: 114.0
## Max. :2010-05-14 00:00:00.00 Max. :10920.0
## NA's :19
# Plot the data
# 1. Cash distribution by ATM
ggplot(partA_data, aes(x = ATM, y = Cash)) +
geom_boxplot() +
labs(title = "Cash Distribution by ATM", x = "ATM", y = "Cash")
# 2. Cash over time
ggplot(partA_data, aes(x = DATE, y = Cash, color = ATM)) +
geom_line() +
labs(title = "Cash Over Time", x = "Date", y = "Cash")
# 3. Total cash per ATM
total_cash <- partA_data %>%
group_by(ATM) %>%
summarize(Total_Cash = sum(Cash))
ggplot(total_cash, aes(x = ATM, y = Total_Cash, fill = ATM)) +
geom_bar(stat = "identity") +
labs(title = "Total Cash per ATM", x = "ATM", y = "Total Cash")
The amount of missingness across the labeled ATMs (ATM1-ATM4) is minimal, as the table and visualizations below show; the majority of the missing values (about 74%) come from rows with an unlabeled ATM. Cash is the only column with any missingness.
gg_miss_var(partA_data) +
labs(title = "Missing Data ATM dataset")
gg_miss_var(partA_data, facet = ATM) +
labs(title = "Missing Data by ATM Category")
missing_by_atm <- partA_data %>%
group_by(ATM) %>%
summarise(across(everything(), ~ sum(is.na(.)), .names = "missing_{col}"))
total_missing <- colSums(is.na(partA_data))
missing_by_atm <- missing_by_atm %>%
mutate(across(starts_with("missing_"), ~ . / total_missing[match(sub("missing_", "", cur_column()), names(total_missing))] * 100, .names = "percent_{col}"))
missing_by_atm
## # A tibble: 5 × 5
## ATM missing_DATE missing_Cash percent_missing_DATE percent_missing_Cash
## <chr> <int> <int> <dbl> <dbl>
## 1 "" 0 14 NaN 73.7
## 2 "ATM1" 0 3 NaN 15.8
## 3 "ATM2" 0 2 NaN 10.5
## 4 "ATM3" 0 0 NaN 0
## 5 "ATM4" 0 0 NaN 0
Next, I ensure DATE and Cash are correctly formatted so all of the autoplot functions can run smoothly.
atm_data <- partA_data %>%
mutate(DATE = as_date(DATE), Cash = as.integer(Cash))
To combat the missingness, I will filter the dataset to dates prior to 2010-05-01, because the Cash data is completely empty from that date onward. Next, I will use mean imputation (grouped by ATM) to fill in the remaining five missing Cash values. Lastly, I will convert this data.frame object to a tsibble with ATM as the key and DATE as the index.
atm_data <- atm_data %>%
filter(DATE < "2010-05-01") %>%
group_by(ATM) %>%
mutate(Cash = ifelse(is.na(Cash), mean(Cash, na.rm = TRUE), Cash)) %>%
ungroup()
atm_data <- as_tsibble(atm_data, index = DATE, key = ATM)
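As a quick sanity check (a minimal sketch, not part of the original workflow), tsibble's has_gaps() can flag any implicit gaps in the daily index for each ATM:
# Sketch: check each ATM's daily index for implicit gaps before modelling
atm_data %>%
  has_gaps()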
The trend component of the STL decomposition shows that there really is no trend in the data, while the seasonal component shows a strong weekly pattern with consistent dips in the cash taken out. The remainder shows seemingly random variation through most of the one-year period, until around March 2010 when the variation becomes consistently larger.
atm1_auto <-atm_data %>%
filter(ATM == "ATM1") %>%
autoplot(Cash) +
ggtitle("Pre-Transformation Autoplot: ATM1")
atm1_stl <- atm_data %>%
filter(ATM == "ATM1") %>%
model(STL(Cash ~ season(window = "periodic"), robust = TRUE)) %>%
components() %>%
autoplot() +
labs(title = "STL Decomposition for ATM1")
grid.arrange(atm1_auto, atm1_stl, ncol = 2)
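One way to corroborate the weekly seasonality described above (a supplementary sketch, not part of the original workflow) is to inspect the autocorrelation function for ATM1, which should spike at multiples of 7 days if the pattern is present:
# Sketch: ACF of daily withdrawals for ATM1; weekly seasonality shows up as spikes at lags 7, 14, 21, ...
atm_data %>%
  filter(ATM == "ATM1") %>%
  ACF(Cash, lag_max = 28) %>%
  autoplot() +
  labs(title = "ACF of Daily Cash: ATM1")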
In a similar fashion to ATM1, this ATM also isn't showing a clear trend; however, if you look closely, cash does seem to be falling ever so slightly through the period until an eventual spike in March 2010, followed by a massive dip and recovery. This is also reflected in the remainder, which shows a sustained increase, then a dip, and lastly a recovery.
atm2_auto <- atm_data %>%
filter(ATM == "ATM2") %>%
autoplot(Cash) +
ggtitle("Pre-Transformation Autoplot: ATM2")
atm2_stl <- atm_data %>%
filter(ATM == "ATM2") %>%
model(STL(Cash ~ season(window = "periodic"), robust = TRUE)) %>%
components() %>%
autoplot() +
labs(title = "STL Decomposition for ATM2")
grid.arrange(atm2_auto, atm2_stl, ncol = 2)
ATM3 appears to be an outlier among the ATMs: there are only three instances of this ATM recording any cash, and they all fall at the very end of the period.
atm3_auto <- atm_data %>%
filter(ATM == "ATM3") %>%
autoplot(Cash) +
ggtitle("Pre-Transformation Autoplot: ATM3")
atm3_stl <- atm_data %>%
filter(ATM == "ATM3") %>%
model(STL(Cash ~ season(window = "periodic"), robust = TRUE)) %>%
components() %>%
autoplot() +
labs(title = "STL Decomposition for ATM3")
grid.arrange(atm3_auto, atm3_stl, ncol = 2)
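A quick filter (a minimal sketch) makes this explicit by listing the only days on which ATM3 recorded a non-zero withdrawal:
# Sketch: list the non-zero observations for ATM3
atm_data %>%
  filter(ATM == "ATM3", Cash > 0)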
ATM4 also presents an issue: there is a massive outlier on February 9, 2010 that would entirely throw off a forecast. Since there is not much more context for this dataset (i.e., no reason why the value would be this high), I think it best to remove this point by imputing it with the group mean instead.
atm4_auto <- atm_data %>%
filter(ATM == "ATM4") %>%
autoplot(Cash) +
ggtitle("Pre-Transformation Autoplot: ATM4")
atm4_stl <- atm_data %>%
filter(ATM == "ATM4") %>%
model(STL(Cash ~ season(window = "periodic"), robust = TRUE)) %>%
components() %>%
autoplot() +
labs(title = "STL Decomposition for ATM4")
grid.arrange(atm4_auto, atm4_stl, ncol = 2)
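The outlier date is hard-coded in the next chunk; as a quick confirmation (a minimal sketch), the most extreme ATM4 observation can be pulled programmatically:
# Sketch: confirm which date holds the extreme ATM4 value before hard-coding it below
atm_data %>%
  filter(ATM == "ATM4") %>%
  as_tibble() %>%
  slice_max(Cash, n = 1)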
outlier_date <- as.Date('2010-02-09')
outlier_value <- atm_data %>% filter(DATE == outlier_date & ATM == "ATM4") %>% pull(Cash)
# Calculate the mean of the grouped category excluding the outlier
grouped_mean <- atm_data %>%
filter(ATM == 'ATM4' & DATE != outlier_date) %>%
summarise(mean_value = mean(Cash)) %>%
pull(mean_value)
# Impute the outlier value with the new mean of ATM4 on 02-09-2010
atm_data_fixed_1 <- atm_data %>%
mutate(Cash = ifelse(DATE == outlier_date & ATM == "ATM4", grouped_mean, Cash))
atm1_auto_fx <-atm_data_fixed_1 %>%
filter(ATM == "ATM1") %>%
autoplot(Cash) +
ggtitle("Pre-Transformation Autoplot: ATM1")
atm2_auto_fx <- atm_data_fixed_1 %>%
filter(ATM == "ATM2") %>%
autoplot(Cash) +
ggtitle("Pre-Transformation Autoplot: ATM2")
atm3_auto_fx <- atm_data_fixed_1 %>%
filter(ATM == "ATM3") %>%
autoplot(Cash) +
ggtitle("Pre-Transformation Autoplot: ATM3")
atm4_auto_fx <- atm_data_fixed_1 %>%
filter(ATM == "ATM4") %>%
autoplot(Cash) +
ggtitle("Pre-Transformation Autoplot: ATM4")
grid.arrange(atm1_auto_fx, atm2_auto_fx,atm3_auto_fx, atm4_auto_fx,ncol = 2)
Due to the seasonality observed across all of the ATMs, I believe ETS and ARIMA would be the best options, but they need to be compared before choosing one. I will use ATM2 to test these models.
lambda <- atm_data_fixed_1 %>%
filter(ATM == "ATM2") %>%
features(Cash, features = guerrero) %>%
pull(lambda_guerrero)
accuracy_test <- atm_data_fixed_1 %>%
filter(ATM == "ATM2") %>%
model(
mult_monthly = ETS(box_cox(Cash, lambda) ~ error ("A") + trend("N") + season("M")),
ARIMA = ARIMA(box_cox(Cash, lambda))
)
#Pulling in .model(BIC and RMSE)
left_join(glance(accuracy_test) %>% select(.model:BIC),
accuracy(accuracy_test) %>% select(.model, RMSE)) %>%
arrange(RMSE)
## Joining with `by = join_by(.model)`
## # A tibble: 2 × 7
## .model sigma2 log_lik AIC AICc BIC RMSE
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ARIMA 67.6 -1262. 2536. 2536. 2559. 24.4
## 2 mult_monthly 71.3 -1851. 3722. 3722. 3761. 25.2
Looking across AIC, AICc, and BIC (and RMSE as well), the ARIMA model had the lowest values. Thus, ARIMA will be the model used to forecast this data.
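Before generalizing the approach, it can be worth inspecting which ARIMA specification was automatically selected for ATM2 (a minimal sketch using report() from fabletools):
# Sketch: print the automatically selected ARIMA order for ATM2
accuracy_test %>%
  select(ARIMA) %>%
  report()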
A short function might make this part easier:
atm_arima_fc <- function(data) {
  # Fit an ARIMA model to the Box-Cox transformed Cash series;
  # note `lambda` is reused from the ATM2 Guerrero estimate computed above
  arima_fit <- data %>% model(ARIMA(box_cox(Cash, lambda)))
  # Forecast the 31 days of May 2010
  fc_cash <- arima_fit %>% forecast(h = 31)
  return(fc_cash)
}
ATM1_fc <- atm_data_fixed_1 %>%
filter(ATM == "ATM1") %>%
atm_arima_fc()
ATM2_fc <- atm_data_fixed_1 %>%
filter(ATM == "ATM2") %>%
atm_arima_fc()
ATM3_fc <- atm_data_fixed_1 %>%
filter(ATM == "ATM3") %>%
atm_arima_fc()
ATM4_fc <- atm_data_fixed_1 %>%
filter(ATM == "ATM4") %>%
atm_arima_fc()
full_forecast_df <- bind_rows(ATM1_fc, ATM2_fc, ATM3_fc, ATM4_fc) %>%
  as_tibble() %>%
  select(DATE, ATM, .mean)
#write out the forecasted data
#full_forecast_df %>%
#write.csv("Part_A_Forecasts_Combined_File.csv")
This final graphic shows the forecasted cash withdrawals for each ATM through the month of May 2010. The forecasts for ATM1 and ATM2 keep pace with the weekly seasonality of the raw data as well as the scale of its variability. Meanwhile, the ATM3 forecast is reacting both to the long prior stretch of no withdrawals and to the recent increase. ATM4 shows a tighter spread through the month of May.
full_forecast_df %>%
rename(Cash = .mean) %>%
as_tsibble(index = DATE, key = ATM) %>%
autoplot(Cash) +
autolayer(atm_data_fixed_1, Cash, color = "grey")+
facet_wrap(~ATM, scales = "free", nrow = 4)+
labs(title = "Forecasted ATM Cash Balance Through May 2010")
Part B consists of a simple dataset of residential power usage from January 1998 until December 2013. Your assignment is to model these data and produce a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in kilowatt hours; the rest is straightforward. Add this to your existing files above.
KWH_import <- read.csv("https://raw.githubusercontent.com/jonburns2454/DATA-624/refs/heads/main/Project%201%20Data/ResidentialCustomerForecastLoad-624.csv")
KWH_ts <- KWH_import %>%
mutate(YYYY.MMM = yearmonth(YYYY.MMM), KWH = as.numeric(KWH)) %>%
as_tsibble(index = YYYY.MMM)
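As a quick structural check (a minimal sketch, not part of the original workflow), the index interval and any implicit gaps in the monthly series can be inspected before moving on:
# Sketch: confirm the series is indexed monthly and has no implicit gaps
tsibble::interval(KWH_ts)
has_gaps(KWH_ts)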
There is currently only one missing KWH value, and it is for September 2008. Before any further EDA, this will need to be addressed.
KWH_import[!complete.cases(KWH_import), ]
## CaseSequence YYYY.MMM KWH
## 129 861 2008-Sep NA
Unlike Part A, where I could group by ATM and address the gaps with a mean, here I will fit an ARIMA model to the series and use it to interpolate the single missing value.
KWH_interpolate <- KWH_ts %>%
model(ARIMA(KWH)) %>%
interpolate(KWH_ts)
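As a quick check that the interpolation worked (a minimal sketch), the September 2008 row can be pulled from the interpolated series:
# Sketch: the September 2008 KWH value should now be filled in
KWH_interpolate %>%
  filter(YYYY.MMM == yearmonth("2008 Sep"))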
The seasonality and upward trend are the most prevalent features within the STL decomposition and the autoplot. However, it is not clear which model is best for this data, so some more investigation is needed.
KWH_interpolate %>%
autoplot(KWH) +
ggtitle("Residential Power Usage (Monthly)")
KWH_interpolate %>%
model(STL(KWH ~ season(window = "periodic"), robust = TRUE)) %>%
components() %>%
autoplot() +
labs(title = "STL Decomp")
KWH_interpolate %>%
gg_subseries(KWH)+
ggtitle('Subseries Monthly Trends')
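The visual impression of trend and seasonality can also be quantified with the STL-based feature set (a minimal sketch; values near 1 indicate a strong component):
# Sketch: strength of trend and yearly seasonality for the KWH series
KWH_interpolate %>%
  features(KWH, feat_stl)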
Similar to Part A, I will test both ETS with a Box-Cox transformation and ARIMA. Note this is very similar to the code from Part A, just adjusted for this problem.
lambda <- KWH_interpolate %>%
features(KWH, features = guerrero) %>%
pull(lambda_guerrero)
accuracy_test <- KWH_interpolate %>%
model(
mult_monthly = ETS(box_cox(KWH, lambda) ~ error ("M") + trend("A") + season("M")),
ARIMA = ARIMA(box_cox(KWH, lambda))
)
#Pulling in .model(BIC and RMSE)
left_join(glance(accuracy_test) %>% select(.model:BIC),
accuracy(accuracy_test) %>% select(.model, RMSE)) %>%
arrange(RMSE)
## Joining with `by = join_by(.model)`
## # A tibble: 2 × 7
## .model sigma2 log_lik AIC AICc BIC RMSE
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 mult_monthly 0.000676 -534. 1102. 1105. 1157. 852417.
## 2 ARIMA 2.19 -346. 701. 702. 717. 1194982.
While the ETS model boasts a lower RMSE, the ARIMA model's AIC, AICc, and BIC were all much lower than those of the Box-Cox ETS model. Thus, I will be sticking with ARIMA again for this forecast.
Now that the ARIMA model has been fit, the residuals appear to be mostly white noise and are centered around zero.
KWH_forecast <- accuracy_test %>%
forecast(h = 12) %>%
filter(.model =="ARIMA")
KWH_forecast %>%
autoplot(KWH_interpolate)+
ggtitle('Residential Power Usage (Monthly): Forecasting 2014')
accuracy_test %>%
select(ARIMA) %>%
gg_tsresiduals(lag = 24) +
ggtitle("Redisuals for Forecast through 2014")
KWH_forecast_data <- KWH_forecast %>%
as.data.frame() %>%
select(YYYY.MMM, .mean) %>%
rename(Month_Year = YYYY.MMM, KWH = .mean)
#KWH_forecast_data %>%
#write.csv("Part_B_Forecasted_Data.csv")
KWH_forecast_data
## Month_Year KWH
## 1 2014 Jan 8629876
## 2 2014 Feb 7143118
## 3 2014 Mar 6824015
## 4 2014 Apr 6674699
## 5 2014 May 6696593
## 6 2014 Jun 7266722
## 7 2014 Jul 7579630
## 8 2014 Aug 7602538
## 9 2014 Sep 7167714
## 10 2014 Oct 6590928
## 11 2014 Nov 6583600
## 12 2014 Dec 7410337