This request comes from DATA 624 Predictive Analytics at CUNY SPS.
The goal of part A is to forecast how much cash is taken out of 4 different ATM machines for May 2010. The data was provided to the class in the project section of our Brightspace. I uploaded it to my personal GitHub so this RMD can be run on any device: https://raw.githubusercontent.com/evanskaylie/DATA624/refs/heads/main/ATM624DataDates.csv.
ATM 4 contains an outlier value that is more than 6 times the second highest value.
The case for omitting the outlier: the value is roughly six times the next-highest observation in the series and is likely an input error rather than genuine withdrawal activity.
Based on that reasoning, this forecast omits the outlier. More on the specifics in the EDA section.
ATM 1 Predictions:
This ATM is forecasted (average of forecast) to have daily withdrawals between $540 and $11,030 in the month of May 2010.
The lowest withdrawals are expected on Tuesdays.
The highest days for withdrawals are expected to be on Sundays and Wednesdays.
The total withdrawal amount expected from this ATM is around $259,263.
ATM 2 Predictions:
The lowest withdrawals are expected on Tuesdays and Mondays.
The highest days for withdrawals are expected to be on Wednesdays and Thursdays.
The total withdrawal amount expected for ATM 2 is around $185,441.
ATM 3 Predictions:
This ATM is forecasted to have daily withdrawals around $8,458. The data from this ATM is limited to only a few days towards the end of the available data. Because of this, the range of possible values is high.
Expectations within the week are even, as no weekly seasonality can be found.
With low confidence due to lack of data, the total withdrawal amount expected for ATM 3 is around $262,211.
ATM 4 Predictions:
This ATM is forecasted (average of forecast) to have daily withdrawals between $24,278 and $51,027 in the month of May 2010.
This forecast does not show strong weekly seasonality and has high variation.
The total withdrawal amount expected for ATM 4 is around $1,283,766.
library(tidyverse)
library(tsibble)
library(feasts)
library(ggplot2)
library(forecast)
library(fable)
library(paletteer)
library(dplyr)
library(tidyr)
library(scales)
library(glue)
library(writexl)
First, load the data. I have the data saved as a .csv file in my GitHub for reproducibility of this code outside of my personal machine.
Then take a look at the data.
# Load the csv from GitHub
atm_df <- read.csv("https://raw.githubusercontent.com/evanskaylie/DATA624/refs/heads/main/ATM624DataDates.csv")
head(atm_df)
## DATE ATM Cash
## 1 2009-05-01 ATM1 96
## 2 2009-05-01 ATM2 107
## 3 2009-05-02 ATM1 82
## 4 2009-05-02 ATM2 89
## 5 2009-05-03 ATM1 85
## 6 2009-05-03 ATM2 90
The date field should be converted to a date data type rather than the character data type. Also, the data will need to be in a tsibble format for the visualizations.
# Converting the date column to a date datatype
atm_df <- atm_df |>
mutate(
DATE = as.Date(DATE, origin = "1899-12-30")
)
# Convert to a tsibble
atm_tb <- atm_df |>
as_tsibble(index = DATE, key = ATM)
# Take another look
summary(atm_tb)
## DATE ATM Cash
## Min. :2009-05-01 Length:1474 Min. : 0.0
## 1st Qu.:2009-08-01 Class :character 1st Qu.: 0.5
## Median :2009-11-01 Mode :character Median : 73.0
## Mean :2009-10-31 Mean : 155.6
## 3rd Qu.:2010-02-01 3rd Qu.: 114.0
## Max. :2010-05-14 Max. :10920.0
## NA's :19
There are several NAs. Drop the rows with missing Cash values, along with the rows that have a blank ATM label.
# Drop the null ATM values and null Cash values
atm_tb <- atm_tb |>
filter(ATM != "") |>
drop_na(Cash)
# Take another look
summary(atm_tb)
## DATE ATM Cash
## Min. :2009-05-01 Length:1455 Min. : 0.0
## 1st Qu.:2009-08-01 Class :character 1st Qu.: 0.5
## Median :2009-10-31 Mode :character Median : 73.0
## Mean :2009-10-30 Mean : 155.6
## 3rd Qu.:2010-01-29 3rd Qu.: 114.0
## Max. :2010-04-30 Max. :10920.0
Now to visualize the data.
# Autoplot the data
atm_tb |>
autoplot(Cash) +
facet_wrap(~ATM, scales = "free", nrow = 4) +
labs(title = "ATM Withdrawals: May 2009 - April 2010",
x = "Date",
y = "Cash Withdrawals (hundreds of dollars)") +
theme_minimal() +
guides(col="none")
The visualizations above show the data distributions for the different ATMs.
ATM1: Strong variation patterns. Some changes in variance; variance decreases around October 2009 and then increases again before 2010.
ATM2: Strong variation patterns, though less consistent than ATM1. Variance is relatively constant.
ATM3: Very low usage until the last few days of the series (late April 2010). Even with that increase at the end, the scale is similar to ATM1 and ATM2, so this data will not be omitted.
ATM4: Extremely high usage on a single day in February 2010. This could be an input error, as that point (10920 on Feb 9 2010) is about 6 times the second highest value (1712 on Sept 22 2009).
What does the data look like with this potential error removed?
# Take out the outlier in ATM4
atm_tb_no_outl <- atm_tb |>
filter(Cash < 2500)
# Autoplot the updated data
atm_tb_no_outl |>
autoplot(Cash) +
facet_wrap(~ATM, scales = "free", nrow = 4) +
labs(title = "ATM Withdrawals: May 2009 - April 2010",
x = "Date",
y = "Cash Withdrawals (hundreds of dollars)") +
theme_minimal() +
guides(col="none")
The visualizations above show the distributions for the different ATMs with the ATM4 outlier removed.
To further understand the breakdown of the time series, a time series decomposition will be helpful. The decomposition used here is additive because the magnitude of the seasonal fluctuations does not change with the level of the series.
# Function to decompose and plot using ggplot2
decompose_and_plot_gg <- function(atm_name) {
# Filter for specific ATM
atm_subset <- atm_tb_no_outl |>
filter(ATM == atm_name) |>
arrange(DATE) |>
mutate(Cash = as.numeric(Cash))
# Convert to ts object
atm_ts <- ts(atm_subset$Cash,
start = c(2009, as.numeric(format(min(atm_subset$DATE), "%j"))),
frequency = 7)
# Classical decomposition
decomp <- decompose(atm_ts, type = "additive")
# Convert to dataframe for ggplot
decomp_df <- data.frame(
DATE = atm_subset$DATE,
Observed = decomp$x,
Trend = decomp$trend,
Seasonal = decomp$seasonal,
Remainder = decomp$random
) |>
pivot_longer(-DATE, names_to = "Component", values_to = "Value")
# Plot using ggplot2
ggplot(decomp_df, aes(x = DATE, y = Value, color = Component)) +
geom_line() +
facet_wrap(~Component, scales = "free_y", ncol = 1) +
labs(title = paste("Classical Time Series Decomposition -", atm_name),
x = "Time", y = "Value") +
theme_minimal()
}
# Get unique ATM names
atm_list <- unique(atm_tb_no_outl$ATM)
# Apply the function to each ATM
plots <- lapply(atm_list, decompose_and_plot_gg)
# Visualize the decompositions
plots
## [[1]]
## Warning: Removed 12 rows containing missing values or values outside the scale range
## (`geom_line()`).
##
## [[2]]
## Warning: Removed 12 rows containing missing values or values outside the scale range
## (`geom_line()`).
##
## [[3]]
## Warning: Removed 12 rows containing missing values or values outside the scale range
## (`geom_line()`).
##
## [[4]]
## Warning: Removed 12 rows containing missing values or values outside the scale range
## (`geom_line()`).
The visualizations above show the time series decompositions for the different ATMs. Each breakdown will inform what kind of algorithm will be most effective for that ATM forecast.
This section will address any additional preprocessing the data needs before forecasting is possible.
What features can be added to make predictions accurate? What is the cadence of the seasonality in the ATMs?
# Filter to October 2009 to examine the within-month pattern
atm_explore_oct09 <- atm_tb_no_outl |>
filter(DATE >= "2009-10-01") |>
filter(DATE <= "2009-10-31")
# Plot the updated data
atm_explore_oct09 |>
ggplot(aes(x = DATE, y = Cash)) +
geom_line() +
geom_point(color = "salmon", size = 1.5) +
facet_wrap(~ATM, scales = "free", nrow = 4) +
labs(title = "ATM Withdrawals: October 2009",
x = "Date",
y = "Cash Withdrawals (hundreds of dollars)") +
theme_minimal() +
guides(col="none")
The first 2 ATMs show a 7-day seasonal cycle within the filtered month of October 2009.
There is a lot of variation within each month, but it looks fairly regular across the ATMs. Adding contextual columns (day of week and month) may help with the predictions.
# Widen data set with contextual features
atm_ext <- atm_tb_no_outl |>
mutate(
week_day = wday(DATE, label = TRUE),
month = month(DATE, label = TRUE)
)
# View the wider dataset with more features
head(atm_ext)
## # A tsibble: 6 x 5 [1D]
## # Key: ATM [1]
## DATE ATM Cash week_day month
## <date> <chr> <int> <ord> <ord>
## 1 2009-05-01 ATM1 96 Fri May
## 2 2009-05-02 ATM1 82 Sat May
## 3 2009-05-03 ATM1 85 Sun May
## 4 2009-05-04 ATM1 90 Mon May
## 5 2009-05-05 ATM1 99 Tue May
## 6 2009-05-06 ATM1 88 Wed May
Visualize the distribution within each week.
# Visualize week day distributions
atm_ext |>
ggplot(aes(x = week_day, y = Cash, fill = week_day)) +
geom_boxplot() +
scale_fill_paletteer_d("colorBlindness::Blue2Orange8Steps") +
facet_wrap(~ATM, scales = "free_y") +
labs(title = "Weekly Distribution of ATM Cash Withdrawals",
x = "Day of Week",
y = "Cash Withdrawals (hundreds of dollars)") +
theme_minimal() +
theme(legend.position = "none")
Daily distributions within a week do seem to have a pattern. There also appears to be some weekly seasonality in ATM4, with Thursdays looking much lower overall.
Visualize the distribution across each month.
# Visualize month distributions
atm_ext |>
ggplot(aes(x = month, y = Cash, fill = month)) +
geom_boxplot() +
scale_fill_paletteer_d("colorBlindness::Blue2DarkRed12Steps") +
facet_wrap(~ATM, scales = "free_y") +
labs(title = "Monthly Distribution of ATM Cash Withdrawals",
x = "Month",
y = "Cash Withdrawals (hundreds of dollars)") +
theme_minimal() +
theme(legend.position = "none")
Month-over-month distributions also seem to have some seasonality to them.
The four ATMs should be split into distinct data frames.
# Create new data frames from subset data
atm1_df <- atm_ext |>
filter(ATM == "ATM1")
atm2_df <- atm_ext |>
filter(ATM == "ATM2")
atm3_df <- atm_ext |>
filter(ATM == "ATM3")
atm4_df <- atm_ext |>
filter(ATM == "ATM4")
Now the data is ready for forecasting. Each ATM will be forecasted separately.
# Handle missing values and fill gaps
atm1_df <- atm1_df |>
fill_gaps() |>
mutate(Cash = replace_na(Cash, round(mean(Cash, na.rm = TRUE)))) # Impute missing values with mean
# Calculate lambda using Guerrero's method
atm1_lambda <- atm1_df |>
features(Cash, features = guerrero) |>
pull(lambda_guerrero)
# Fit the models
atm1_fit <- atm1_df |>
model(
"ARIMA" = ARIMA(box_cox(Cash,atm1_lambda)),
"Additive ETS" = ETS(Cash ~ error("A") + trend("N") + season("A")),
"Multiplicative ETS" = ETS(Cash ~ error("M") + trend("N") + season("M")),
"SNAIVE" = SNAIVE(Cash),
"Transformed Additive" = ETS(box_cox(Cash,atm1_lambda) ~ error("A") + trend("N") + season("A")),
"Transformed Multiplicative" = ETS(box_cox(Cash,atm1_lambda) ~ error("M") + trend("N") + season("M")),
"Transformed SNAIVE" = SNAIVE(box_cox(Cash,atm1_lambda))
)
# Combine model evaluation metrics (RMSE, MAE, etc.)
atm1_model_results <- left_join(
glance(atm1_fit) |> select(.model:BIC),
accuracy(atm1_fit) |> select(.model, RMSE, MAE),
by = ".model"
)
# Sort by RMSE for easy model comparison
atm1_sorted_model_results <- atm1_model_results |>
arrange(RMSE)
# Print the results to see which model has the lowest RMSE
print(atm1_sorted_model_results)
## # A tibble: 7 × 8
## .model sigma2 log_lik AIC AICc BIC RMSE MAE
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Transformed Multiplicative 0.0367 -1188. 2396. 2396. 2435. 23.5 15.4
## 2 Additive ETS 581. -2234. 4487. 4488. 4526. 23.8 15.1
## 3 ARIMA 1.50 -581. 1169. 1169. 1185. 25.1 16.2
## 4 Transformed Additive 1.53 -1150. 2320. 2320. 2359. 25.1 16.2
## 5 Multiplicative ETS 0.134 -2273. 4567. 4567. 4606. 26.3 16.3
## 6 SNAIVE 779. NA NA NA NA 27.9 17.8
## 7 Transformed SNAIVE 2.12 NA NA NA NA 27.9 17.8
Based on these results, the best model is the ARIMA model: it is lowest in AIC, AICc, and BIC, and its RMSE and MAE are about average across the candidates.
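As a quick follow-up, the specification that ARIMA() selected can be printed with report() from fabletools (loaded with fable). This is just an inspection step, not part of the comparison above.
# Inspect the automatically selected ARIMA specification for ATM 1
atm1_fit |>
  select(ARIMA) |>
  report()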
# Handle missing values and fill gaps
atm2_df <- atm2_df |>
fill_gaps() |>
mutate(Cash = replace_na(Cash, round(mean(Cash, na.rm = TRUE)))) # Impute missing values with mean
# Calculate lambda using Guerrero's method
atm2_lambda <- atm2_df |>
features(Cash, features = guerrero) |>
pull(lambda_guerrero)
# Fit the models
atm2_fit <- atm2_df |>
model(
"ARIMA" = ARIMA(box_cox(Cash,atm2_lambda)),
"Additive ETS" = ETS(Cash ~ error("A") + trend("N") + season("A")),
"Multiplicative ETS" = ETS(Cash ~ error("M") + trend("N") + season("M")),
"SNAIVE" = SNAIVE(Cash),
"Transformed Additive" = ETS(box_cox(Cash,atm2_lambda) ~ error("A") + trend("N") + season("A")),
"Transformed Multiplicative" = ETS(box_cox(Cash,atm2_lambda) ~ error("M") + trend("N") + season("M")),
"Transformed SNAIVE" = SNAIVE(box_cox(Cash,atm2_lambda))
)
# Combine model evaluation metrics (RMSE, MAE, etc.)
atm2_model_results <- left_join(
glance(atm2_fit) |> select(.model:BIC),
accuracy(atm2_fit) |> select(.model, RMSE, MAE),
by = ".model"
)
# Sort by RMSE for easy model comparison
atm2_sorted_model_results <- atm2_model_results |>
arrange(RMSE)
# Print the results to see which model has the lowest RMSE
print(atm2_sorted_model_results)
## # A tibble: 7 × 8
## .model sigma2 log_lik AIC AICc BIC RMSE MAE
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ARIMA 67.8 -1262. 2537. 2537. 2560. 24.4 17.1
## 2 Additive ETS 638. -2251. 4522. 4523. 4561. 25.0 17.7
## 3 Transformed Additive 71.9 -1852. 3725. 3726. 3764. 25.3 17.7
## 4 Transformed SNAIVE 99.7 NA NA NA NA 30.1 20.8
## 5 SNAIVE 906. NA NA NA NA 30.1 20.8
## 6 Multiplicative ETS 0.390 -2390. 4801. 4801. 4840. 33.8 28.2
## 7 Transformed Multiplicative 0.308 -2012. 4044. 4044. 4083. 34.0 27.0
Based on these results, the best model is the ARIMA model: it is lowest in AIC, AICc, BIC, RMSE, and MAE.
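As an optional sanity check on the chosen ATM 2 model, the residuals can be examined with gg_tsresiduals() from feasts; roughly white-noise residuals would support sticking with ARIMA. This is a hedged sketch, not part of the original comparison.
# Residual diagnostics for the ATM 2 ARIMA model (time plot, ACF, histogram)
atm2_fit |>
  select(ARIMA) |>
  gg_tsresiduals()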
The data for ATM 3 doesn't have any non-zero values until the last 3 points in the series. Because of this, the model-comparison approach used for the other ATMs isn't informative here; a simple automatically selected ETS model will be fit instead in the forecasting section. A quick check of the sparsity is shown below.
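This sketch counts the non-zero observations in the atm3_df subset created earlier; the column names (DATE, Cash) come from the data above.
# How many non-zero observations does ATM 3 actually have, and when do they occur?
atm3_df |>
  as_tibble() |>
  filter(Cash > 0) |>
  summarise(
    nonzero_days = n(),
    first_nonzero = min(DATE),
    last_nonzero = max(DATE)
  )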
# Handle missing values and fill gaps
atm4_df <- atm4_df |>
fill_gaps() |>
mutate(Cash = replace_na(Cash, round(mean(Cash, na.rm = TRUE)))) # Impute missing values with mean
# Calculate lambda using Guerrero's method
atm4_lambda <- atm4_df |>
features(Cash, features = guerrero) |>
pull(lambda_guerrero)
# Fit the models
atm4_fit <- atm4_df |>
model(
"ARIMA" = ARIMA(box_cox(Cash,atm4_lambda)),
"Additive ETS" = ETS(Cash ~ error("A") + trend("N") + season("A")),
"Multiplicative ETS" = ETS(Cash ~ error("M") + trend("N") + season("M")),
"SNAIVE" = SNAIVE(Cash),
"Transformed Additive" = ETS(box_cox(Cash,atm4_lambda) ~ error("A") + trend("N") + season("A")),
"Transformed Multiplicative" = ETS(box_cox(Cash,atm4_lambda) ~ error("M") + trend("N") + season("M")),
"Transformed SNAIVE" = SNAIVE(box_cox(Cash,atm4_lambda))
)
# Combine model evaluation metrics (RMSE, MAE, etc.)
atm4_model_results <- left_join(
glance(atm4_fit) |> select(.model:BIC),
accuracy(atm4_fit) |> select(.model, RMSE, MAE),
by = ".model"
)
# Sort by RMSE for easy model comparison
atm4_sorted_model_results <- atm4_model_results |>
arrange(RMSE)
# Print the results to see which model has the lowest RMSE
print(atm4_sorted_model_results)
## # A tibble: 7 × 8
## .model sigma2 log_lik AIC AICc BIC RMSE MAE
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Additive ETS 110734. -3192. 6404. 6404. 6443. 329. 264.
## 2 Transformed Additive 174. -2013. 4047. 4048. 4086. 338. 260.
## 3 Transformed Multiplicative 0.222 -2009. 4039. 4039. 4078. 345. 264.
## 4 ARIMA 183. -1468. 2945. 2945. 2965. 352. 274.
## 5 Multiplicative ETS 0.727 -3192. 6405. 6405. 6444. 354. 276.
## 6 SNAIVE 205239. NA NA NA NA 452. 346.
## 7 Transformed SNAIVE 300. NA NA NA NA 452. 346.
Based on these results, the best model is the ARIMA model: it is lowest in AIC, AICc, and BIC, and its RMSE and MAE are close to those of the best-scoring models.
# Forecast model
forecast_atm1 <- atm1_fit |>
forecast(h = 31) |>
filter(.model=='ARIMA')
# Visualize the ARIMA predictions
autoplot(atm1_df, Cash) +
autolayer(forecast_atm1) +
ggtitle("ATM1 Cash Withdrawal Forecast") +
ylab("Cash Withdrawals (hundreds of dollars)") +
xlab("Date")
This looks pretty good! Zooming in on 2010 may help see more detail.
atm1_2010 <- atm1_df |>
filter(DATE >= "2010-01-01")
# Visualize the ARIMA predictions
autoplot(atm1_2010, Cash) +
autolayer(forecast_atm1) +
ggtitle("ATM1 Cash Withdrawal Forecast 2010") +
ylab("Cash Withdrawals (hundreds of dollars)") +
xlab("Date")
atm1_total_pred_value <- forecast_atm1 |>
pull(.mean) |>
sum(na.rm = TRUE)
print(paste("The total withdrawals expected for ATM 1: around", dollar(atm1_total_pred_value * 100)))
## [1] "The total withdrawals expected for ATM 1: around $259,263"
ATM 1 Predictions:
This ATM is forecasted (average of forecast) to have daily withdrawals between $540 and $11,030 in the month of May 2010.
The lowest withdrawals are expected on Tuesdays.
The highest days for withdrawals are expected to be on Sundays and Wednesdays.
The total withdrawal amount expected for ATM 1 is around $259,263.
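The day-of-week statements above can be checked directly from the forecast means. This is a rough sketch that assumes the fable columns DATE and .mean and converts from hundreds of dollars to dollars.
# Average forecast withdrawal by day of week for ATM 1, converted to dollars
forecast_atm1 |>
  as_tibble() |>
  mutate(week_day = wday(DATE, label = TRUE)) |>
  group_by(week_day) |>
  summarise(avg_withdrawal_dollars = mean(.mean) * 100) |>
  arrange(avg_withdrawal_dollars)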
# Forecast model
forecast_atm2 <- atm2_fit |>
forecast(h = 31) |>
filter(.model=='ARIMA')
# Visualize the ARIMA predictions
autoplot(atm2_df, Cash) +
autolayer(forecast_atm2) +
ggtitle("ATM2 Cash Withdrawal Forecast") +
ylab("Cash Withdrawals (hundreds of dollars)") +
xlab("Date")
This looks pretty good! Zooming in on 2010 may help see more detail.
atm2_2010 <- atm2_df |>
filter(DATE >= "2010-01-01")
# Visualize the ARIMA predictions
autoplot(atm2_2010, Cash) +
autolayer(forecast_atm2) +
ggtitle("ATM2 Cash Withdrawal Forecast 2010") +
ylab("Cash Withdrawals (hundreds of dollars)") +
xlab("Date")
atm2_total_pred_value <- forecast_atm2 |>
pull(.mean) |>
sum(na.rm = TRUE)
print(paste("The total withdrawals expected for ATM 2: around", dollar(atm2_total_pred_value * 100)))
## [1] "The total withdrawals expected for ATM 2: around $185,441"
ATM 2 Predictions:
The lowest withdrawals are expected on Tuesdays and Mondays.
The highest days for withdrawals are expected to be on Wednesdays and Thursdays.
The total withdrawal amount expected for ATM 2 is around $185,441.
# Prepare the ATM 3 data: ensure Cash is numeric and sort by date
atm3_df$Cash <- as.numeric(atm3_df$Cash)
atm3_df <- atm3_df |> arrange(DATE)
# Convert to tsibble and fill gaps
atm3_tsibble <- atm3_df |>
as_tsibble(index = DATE) |>
fill_gaps() |> # Add missing dates
mutate(Cash = replace_na(Cash, mean(Cash, na.rm = TRUE))) # Fill missing values
# Fit ETS model
atm3_ets_model <- atm3_tsibble |>
model(ETS(Cash))
# Forecast for 31 days
forecast_atm3 <- atm3_ets_model |>
forecast(h = "31 days")
# Plot the forecast
forecast_atm3 |>
autoplot(atm3_tsibble) +
labs(title = "ATM3 Cash Withdrawal Forecast 2010",
x = "Date",
y = "Cash Withdrawals (hundreds of dollars)") +
theme_minimal()
This model is an additive ETS model fitted to the ATM 3 series, which only recently began recording non-zero withdrawals. The prediction range is very wide, with little detail, because there is not enough data to confidently identify any trend or seasonality.
atm3_total_pred_value <- forecast_atm3 |>
pull(.mean) |>
sum(na.rm = TRUE)
print(paste("The total withdrawals expected for ATM 3: around", dollar(atm3_total_pred_value * 100)))
## [1] "The total withdrawals expected for ATM 3: around $262,211"
ATM 3 Predictions:
This ATM is forecasted to have daily withdrawals around $8,458. The data from this ATM is limited to only a few days toward the end of the available data. Because of this, the range of possible values is wide.
Expectations within the week are even, as no weekly seasonality can be found.
With low confidence due to the lack of data, the total withdrawal amount expected for ATM 3 is around $262,211.
# Forecast model
forecast_atm4 <- atm4_fit |>
forecast(h = 31) |>
filter(.model=='ARIMA')
# Visualize the ARIMA predictions
autoplot(atm4_df, Cash) +
autolayer(forecast_atm4) +
ggtitle("ATM4 Cash Withdrawal Forecast") +
ylab("Cash Withdrawals (hundreds of dollars)") +
xlab("Date")
The prediction intervals here are quite wide. Zooming in on 2010 may help see more detail.
atm4_2010 <- atm4_df |>
filter(DATE >= "2010-01-01")
# Visualize the ARIMA predictions
autoplot(atm4_2010, Cash) +
autolayer(forecast_atm4) +
ggtitle("ATM4 Cash Withdrawal Forecast 2010") +
ylab("Cash Withdrawals (hundreds of dollars)") +
xlab("Date")
atm4_total_pred_value <- forecast_atm4 |>
pull(.mean) |>
sum(na.rm = TRUE)
print(paste("The total withdrawals expected for ATM 4: around", dollar(atm4_total_pred_value * 100)))
## [1] "The total withdrawals expected for ATM 4: around $1,283,766"
ATM 4 Predictions:
This ATM is forecasted (average of forecast) to have daily withdrawals between $24,278 and $51,027 in the month of May 2010.
This forecast does not show strong weekly seasonality and has high variation.
The total withdrawal amount expected for ATM 4 is around $1,283,766.
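To put a number on that variation, the 95% prediction intervals can be pulled from the fable with hilo() from fabletools. A small sketch, assuming the default column names:
# 95% prediction intervals for the ATM 4 forecast (values in hundreds of dollars)
forecast_atm4 |>
  hilo(level = 95) |>
  as_tibble() |>
  select(DATE, .mean, `95%`) |>
  head()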
# Combine all forecasts into one data frame
combined_forecast <- bind_rows(forecast_atm1, forecast_atm2, forecast_atm3, forecast_atm4)
# Write to excel
write_xlsx(combined_forecast, "ATMPredictions.xlsx")
This request comes from DATA 624 Predictive Analytics at CUNY SPS.
The goal of part B is to model the provided data and produce a monthly forecast for 2014. The data was provided to the class in the project section of our Brightspace. It contains residential power usage from January 1998 through December 2013. I uploaded it to my personal GitHub so this RMD can be run on any device: https://raw.githubusercontent.com/evanskaylie/DATA624/refs/heads/main/ResidentialCustomerForecastLoad-624.csv.
Residential Power Usage Predictions:
The data shows strong seasonality, and the prediction follows the same pattern.
The highest consumption is expected in January (around 10.2 million KWH) and August (around 10.0 million KWH). The lowest consumption is expected in October, April, and May (around 5.76, 5.91, and 5.94 million KWH, respectively).
See the section titled Power Forecast for the visualization.
library(zoo)
First, load the data. I have the data saved as a .csv file in my GitHub for reproducibility of this code outside of my personal machine.
Then take a look at the data.
# Load the csv from GitHub
pwr_df <- read.csv("https://raw.githubusercontent.com/evanskaylie/DATA624/refs/heads/main/ResidentialCustomerForecastLoad-624.csv")
head(pwr_df)
## CaseSequence YYYY.MMM KWH
## 1 733 1998-Jan 6862583
## 2 734 1998-Feb 5838198
## 3 735 1998-Mar 5420658
## 4 736 1998-Apr 5010364
## 5 737 1998-May 4665377
## 6 738 1998-Jun 6467147
Reformat the feature data types as needed.
# Convert date format
pwr_ts <- ts(pwr_df$KWH, start = c(1998, 1), frequency = 12)
# Preview the data
head(pwr_ts)
## Jan Feb Mar Apr May Jun
## 1998 6862583 5838198 5420658 5010364 4665377 6467147
# Another format
summary(pwr_ts)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 770523 5429912 6283324 6502475 7620524 10655730 1
Visualize the data in a time series graph.
# Plot the data
autoplot(pwr_ts) +
labs(title = paste("Residential Power Usage"),
x = "Date",
y = "Kilowatt hours") +
theme_minimal()
Replace the NA value by interpolation.
# Fill null value
pwr_ts <- na.approx(pwr_ts)
# Plot the filled data
autoplot(pwr_ts) +
labs(title = "Residential Power Usage",
x = "Date",
y = "Kilowatt hours") +
theme_minimal()
Try out the time series decomposition to get a better feel for the data.
# Decompose the time series
pwr_decomp <- decompose(pwr_ts)
# Plot the decomposed time series
autoplot(pwr_decomp) +
labs(title = "Decomposition of Residential Power Usage",
x = "Date") +
theme_minimal()
What does the seasonality look like? Looking at a single year may help specify the cadence.
# Get only 2005 data
pwr_0506 <- window(pwr_ts, start = c(2005, 1), end = c(2007, 1))
# Plot the filled data
autoplot(pwr_0506) +
labs(title = "Residential Power Usage in 2005 and 2006",
x = "Date",
y = "Kilowatt hours") +
theme_minimal()
It looks like the seasonal period is 12 months (annual seasonality). More yearly analysis below.
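To back up the 12-month cadence, the autocorrelation function is worth a look. A quick sketch using ggAcf() from the already-loaded forecast package:
# ACF of the monthly series; spikes at lags 12, 24, and 36 indicate annual seasonality
ggAcf(pwr_ts, lag.max = 36) +
  ggtitle("ACF of Residential Power Usage") +
  theme_minimal()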
The outlier at 2010-Jul skews the data. I would assume it comes from some sort of blackout. It is clear in the remainder panel of the time series decomposition that this point stands out from the rest of the data and does not follow the same patterns.
Remove the outlier. Visualize the data without it.
# Convert ts to a tibble
pwr_trimmed <- as_tibble(as.data.frame(pwr_ts), rownames = "Date")
# Remove the outlier for July 2010 (row 151 of the monthly series; the ts values column is named x)
pwr_trimmed <- pwr_trimmed |>
mutate(KWH = ifelse(Date == "151", NA, x))
# Convert back to a time series
pwr_trimmed <- ts(pwr_trimmed$KWH, start = start(pwr_ts), frequency = frequency(pwr_ts))
# Fill null value
pwr_trimmed <- na.approx(pwr_trimmed)
# Plot the filled data
autoplot(pwr_trimmed) +
labs(title = "Residential Power Usage",
x = "Date",
y = "Kilowatt hours") +
theme_minimal()
Try out the time series decomposition without that outlier.
# Decompose the time series
pwr_tr_decomp <- decompose(pwr_trimmed)
# Plot the decomposed time series
autoplot(pwr_tr_decomp) +
labs(title = "Decomposition of Trimmed Residential Power Usage",
x = "Date") +
theme_minimal()
Alright, this data shows high seasonality.
# Plot the consumption per month, year over year
ggseasonplot(pwr_trimmed,
year.labels=TRUE) +
ylab("Kilowatt hours") +
ggtitle("Residential Power Usage by Month") +
theme_minimal()
The data looks to have some strong monthly trends. Months correlate with each other year over year. The model will need to have seasonality accounted for.
Build the model with an ARIMA forecast with seasonality. Visualize the result.
# Fit the trimmed power data to a seasonal ARIMA model
fit_pwr <- auto.arima(pwr_trimmed, seasonal = TRUE)
# Forecast on that model for 1 year
forecast_pwr <- forecast(fit_pwr, h = 12)
# Plot the forecast
autoplot(forecast_pwr) +
labs(title = "Residential Power Usage Forecast",
x = "Date",
y = "Kilowatt hours") +
theme_minimal()
This looks pretty good! Zooming in on 2010-2014 may help see more detail.
# Get historical data from 2010 onward
pwr_hist_10_13 <- window(pwr_trimmed, start = c(2010, 1))
# Plot historical data + forecast
autoplot(pwr_hist_10_13, series = "Historical Data") +
autolayer(forecast_pwr, series = "Forecast", PI = TRUE) +
labs(title = "Residential Power Usage (2010+) and Forecast (2014)",
x = "Date",
y = "Kilowatt hours") +
theme_minimal() +
scale_color_manual(values = c("Historical Data" = "black", "Forecast" = "blue"))
Overall, the model captures the strong seasonality, and the prediction visually matches the pattern of previous years. I would be confident that the prediction intervals shown here will contain the actual values at their respective levels.
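To tie the forecast back to the summary statements about the highest and lowest months, the point forecasts and 95% intervals can be tabulated. A small sketch assuming the default 80/95% levels from forecast():
# Monthly point forecasts with 95% prediction intervals
pwr_fcst_table <- data.frame(
  Month = format(seq(as.Date("2014-01-01"), by = "month", length.out = 12), "%Y-%b"),
  Point_Forecast_KWH = as.numeric(forecast_pwr$mean),
  Lo95_KWH = as.numeric(forecast_pwr$lower[, "95%"]),
  Hi95_KWH = as.numeric(forecast_pwr$upper[, "95%"])
)
pwr_fcst_table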
# Save the forecast as a data frame for excel
pwr_fcst_export <- data.frame(
Date = seq(as.Date("2014-01-01"), by = "month", length.out = 12),
Power_Forecast_KWH = as.numeric(forecast_pwr$mean)
)
# Write to excel
write_xlsx(pwr_fcst_export, "PowerPredictions.xlsx")
The assignment for part C is to time-base sequence the data and aggregate it based on hour. Then it should be determined whether the data is stationary and whether it can be forecast.
For this submission, I will only be doing part of this for partial extra credit: I fix the timestamp inconsistencies, aggregate the data by hour, and determine that it is stationary.
The data was provided to the class in the project section of our Brightspace. I uploaded it to my personal GitHub so this RMD can be run on any device: https://raw.githubusercontent.com/evanskaylie/DATA624/refs/heads/main/Waterflow_Pipe1.csv and https://raw.githubusercontent.com/evanskaylie/DATA624/refs/heads/main/Waterflow_Pipe2.csv.
library(lubridate)
library(aTSA)
First, load the data. I have the data saved as a .csv file in my GitHub for reproducibility of this code outside of my personal machine.
Then take a look at the data.
# Load the csv from GitHub
pipe1_raw <- read.csv("https://raw.githubusercontent.com/evanskaylie/DATA624/refs/heads/main/Waterflow_Pipe1.csv")
head(pipe1_raw)
## Date.Time WaterFlow
## 1 10/23/15 12:24 AM 23.369599
## 2 10/23/15 12:40 AM 28.002881
## 3 10/23/15 12:53 AM 23.065895
## 4 10/23/15 12:55 AM 29.972809
## 5 10/23/15 1:19 AM 5.997953
## 6 10/23/15 1:23 AM 15.935223
# Load the csv from GitHub
pipe2_raw <- read.csv("https://raw.githubusercontent.com/evanskaylie/DATA624/refs/heads/main/Waterflow_Pipe2.csv")
head(pipe2_raw)
## Date.Time WaterFlow
## 1 10/23/15 1:00 AM 18.81079
## 2 10/23/15 2:00 AM 43.08703
## 3 10/23/15 3:00 AM 37.98770
## 4 10/23/15 4:00 AM 36.12038
## 5 10/23/15 5:00 AM 31.85126
## 6 10/23/15 6:00 AM 28.23809
The data needs to be aggregated. To start, the Date.Time feature has to be parsed into a datetime data type. Then each timestamp will be floored to the start of its hour.
# Create new data frames to adjust
pipe1_df <- pipe1_raw
pipe2_df <- pipe2_raw
# Convert Date.Time to datetime (handling AM/PM format correctly)
pipe1_df$Date.Time <- parse_date_time(pipe1_df$Date.Time, orders = "mdy IMp")
pipe2_df$Date.Time <- parse_date_time(pipe2_df$Date.Time, orders = "mdy IMp")
# Floor each timestamp to the start of its hour
pipe1_df$Date.Time <- floor_date(pipe1_df$Date.Time, "hour")
pipe2_df$Date.Time <- floor_date(pipe2_df$Date.Time, "hour")
# Plot for Pipe 1
plot(pipe1_df$Date.Time, pipe1_df$WaterFlow,
type = "p",
col = "salmon",
main = "WaterFlow in Pipe 1",
xlab = "Date Time",
ylab = "Water Flow")
# Plot for Pipe 2
plot(pipe2_df$Date.Time, pipe2_df$WaterFlow,
type = "p",
col = "lightblue",
main = "WaterFlow in Pipe 2",
xlab = "Date Time",
ylab = "Water Flow")
This is where the times are actually aggregated. Take the average water flow amount within each hour. Then merge the data sets on hour.
# Average rows within each hour for pipe 1
pipe1_agg <- pipe1_df |>
group_by(Date.Time) |>
summarise(WaterFlow = mean(WaterFlow, na.rm = TRUE))
# Average rows within each hour for pipe 2
pipe2_agg <- pipe2_df |>
group_by(Date.Time) |>
summarise(WaterFlow = mean(WaterFlow, na.rm = TRUE))
# Merge datasets based on Hour
merged_pipes_df <- full_join(pipe1_agg, pipe2_agg, by = "Date.Time", suffix = c("_pipe1", "_pipe2"))
# Reshape data into long format for ggplot
merged_long <- merged_pipes_df |>
pivot_longer(cols = starts_with("WaterFlow"),
names_to = "Pipe",
values_to = "WaterFlow") |>
mutate(Pipe = ifelse(Pipe == "WaterFlow_pipe1", "Pipe 1", "Pipe 2"))
# Plot using ggplot2 with different colors for each pipe
ggplot(merged_long, aes(x = Date.Time, y = WaterFlow, color = Pipe)) +
geom_point(alpha = 0.3) +
labs(title = "WaterFlow in Pipes", x = "Date Time", y = "Water Flow") +
scale_color_manual(values = c("Pipe 1" = "salmon", "Pipe 2" = "lightblue")) +
theme_minimal()
## Warning: Removed 766 rows containing missing values or values outside the scale range
## (`geom_point()`).
Zoom in on a specific day to make sure the data is actually hourly.
# Filter to keep only data from November 1st
merged_df_nov1 <- merged_pipes_df |>
filter(month(Date.Time) == 11 & day(Date.Time) == 1)
# Reshape data into long format for ggplot
merged_long_nov1 <- merged_df_nov1 |>
pivot_longer(cols = starts_with("WaterFlow"),
names_to = "Pipe",
values_to = "WaterFlow") |>
mutate(Pipe = ifelse(Pipe == "WaterFlow_pipe1", "Pipe 1", "Pipe 2"))
# Plot using ggplot2 with different colors for each pipe
ggplot(merged_long_nov1, aes(x = Date.Time, y = WaterFlow, color = Pipe)) +
geom_line() +
labs(title = "WaterFlow in Pipes (November 1st)", x = "Date Time", y = "Water Flow") +
scale_color_manual(values = c("Pipe 1" = "salmon", "Pipe 2" = "lightblue")) +
theme_minimal()
There are some gaps in the data. Otherwise, the data now shows one value per hour per pipe.
Check for stationarity with the Augmented Dickey-Fuller Test.
# ADF test for pipe 1
print("ADF Test for Pipe 1:")
## [1] "ADF Test for Pipe 1:"
pipe1_adf <- adf.test(na.omit(pipe1_agg$WaterFlow))
## Augmented Dickey-Fuller Test
## alternative: stationary
##
## Type 1: no drift no trend
## lag ADF p.value
## [1,] 0 -2.328 0.0212
## [2,] 1 -1.365 0.1908
## [3,] 2 -0.856 0.3729
## [4,] 3 -0.745 0.4124
## [5,] 4 -0.559 0.4791
## Type 2: with drift no trend
## lag ADF p.value
## [1,] 0 -15.02 0.01
## [2,] 1 -11.01 0.01
## [3,] 2 -8.43 0.01
## [4,] 3 -7.65 0.01
## [5,] 4 -6.83 0.01
## Type 3: with drift and trend
## lag ADF p.value
## [1,] 0 -15.12 0.01
## [2,] 1 -11.14 0.01
## [3,] 2 -8.55 0.01
## [4,] 3 -7.80 0.01
## [5,] 4 -6.97 0.01
## ----
## Note: in fact, p.value = 0.01 means p.value <= 0.01
Since the p-values are very small (below 0.05), the null hypothesis that the series is non-stationary can be rejected, which indicates that Pipe 1 is stationary.
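As an optional cross-check, a KPSS test can be run with kpss.test() from the already-loaded aTSA package. Note that its null hypothesis is the opposite of the ADF test's: the null is stationarity, so large p-values point to the same conclusion.
# KPSS test for Pipe 1: the null hypothesis here is stationarity,
# so p-values above 0.05 are consistent with the ADF result
kpss.test(na.omit(pipe1_agg$WaterFlow))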
# ADF test for pipe 2
print("ADF Test for Pipe 2:")
## [1] "ADF Test for Pipe 2:"
pipe2_adf <- adf.test(na.omit(pipe2_agg$WaterFlow))
## Augmented Dickey-Fuller Test
## alternative: stationary
##
## Type 1: no drift no trend
## lag ADF p.value
## [1,] 0 -8.75 0.0100
## [2,] 1 -4.92 0.0100
## [3,] 2 -3.39 0.0100
## [4,] 3 -2.55 0.0112
## [5,] 4 -2.13 0.0342
## [6,] 5 -1.76 0.0784
## [7,] 6 -1.46 0.1583
## Type 2: with drift no trend
## lag ADF p.value
## [1,] 0 -32.4 0.01
## [2,] 1 -22.0 0.01
## [3,] 2 -17.2 0.01
## [4,] 3 -14.4 0.01
## [5,] 4 -13.1 0.01
## [6,] 5 -11.9 0.01
## [7,] 6 -11.0 0.01
## Type 3: with drift and trend
## lag ADF p.value
## [1,] 0 -32.4 0.01
## [2,] 1 -22.0 0.01
## [3,] 2 -17.3 0.01
## [4,] 3 -14.4 0.01
## [5,] 4 -13.2 0.01
## [6,] 5 -12.0 0.01
## [7,] 6 -11.0 0.01
## ----
## Note: in fact, p.value = 0.01 means p.value <= 0.01
Since the p-values are very small (below 0.05), the null hypothesis that the series is non-stationary can be rejected, which indicates that Pipe 2 is stationary.
This is where I would do an ARIMA forecast on the data.
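For completeness, here is a minimal sketch of how that forecast could look. It assumes the hourly Pipe 1 series can be treated as regular with a 24-hour cycle and uses a one-week horizon; forecast:: is written explicitly because aTSA also exports a forecast() function.
# Sketch only: treat the hourly Pipe 1 series as regular with a 24-hour cycle
# and forecast one week (168 hours) ahead; gaps in the record are ignored here
pipe1_ts <- ts(pipe1_agg$WaterFlow, frequency = 24)
pipe1_arima <- forecast::auto.arima(pipe1_ts, seasonal = TRUE)
pipe1_forecast <- forecast::forecast(pipe1_arima, h = 168)
autoplot(pipe1_forecast) +
  labs(title = "Pipe 1 Hourly Water Flow Forecast (1 week ahead)",
       x = "Time (days since start of series)",
       y = "Water Flow") +
  theme_minimal()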