This project analyzes and forecasts time series data across two distinct domains: cash withdrawals at ATMs (Part A) and residential power usage (Part B). Each part introduces specific objectives, applies various time series forecasting methods, and discusses the outcomes based on model performance and forecast results.
In Part A, we focus on forecasting the amount of cash withdrawn from four different ATMs for the month of May 2010. The provided dataset (ATM624Data.xlsx) includes a Cash variable, recorded in hundreds of dollars (so a recorded value of 96 corresponds to $9,600), which we will use to model cash withdrawal trends.
Objective: To develop and present a forecast for ATM withdrawals by applying suitable forecasting methods and justifying each step of the process.
Approach: I will begin by exploring the dataset and identifying relevant forecasting techniques. Visualizations and discussions will accompany the analysis to convey insights.
The first step is to explore the dataset and summarize key features, such as the structure, variable types, and initial data distributions. This preliminary analysis provides an overview and helps identify any initial anomalies or missing data patterns. For example, we previously found that the DATE column wasn't formatted as a date when we initially loaded the data, so we convert it with as.Date() in our cleaning process.
# Load the data
atm_data <- read_excel("ATM624Data.xlsx")
atm_data$DATE <- as.Date(atm_data$DATE)
# Preview the data
#head(atm_data)
# Checking the structure of the dataset
str(atm_data)
## tibble [1,474 × 3] (S3: tbl_df/tbl/data.frame)
## $ DATE: Date[1:1474], format: "2009-05-01" "2009-05-01" ...
## $ ATM : chr [1:1474] "ATM1" "ATM2" "ATM1" "ATM2" ...
## $ Cash: num [1:1474] 96 107 82 89 85 90 90 55 99 79 ...
# Summary statistics for each ATM
summary(atm_data)
## DATE ATM Cash
## Min. :2009-05-01 Length:1474 Min. : 0.0
## 1st Qu.:2009-08-01 Class :character 1st Qu.: 0.5
## Median :2009-11-01 Mode :character Median : 73.0
## Mean :2009-10-31 Mean : 155.6
## 3rd Qu.:2010-02-01 3rd Qu.: 114.0
## Max. :2010-05-14 Max. :10919.8
## NA's :19
We further investigated the missing data to understand its pattern, examining the atm_data dataset to determine the extent and distribution of missing values. The following analysis helps determine whether we need to impute missing values before proceeding with forecasting. To check whether any specific ATM has a disproportionate share of missing data, we also summarized missing values by the ATM variable.
# Summary of missing data by column
missing_data_summary <- colSums(is.na(atm_data))
missing_data_summary
## DATE ATM Cash
## 0 14 19
# Checking missing data by ATM
missing_by_atm <- atm_data %>%
group_by(ATM) %>%
summarise_all(~sum(is.na(.)))
missing_by_atm
## # A tibble: 5 × 3
## ATM DATE Cash
## <chr> <int> <int>
## 1 ATM1 0 3
## 2 ATM2 0 2
## 3 ATM3 0 0
## 4 ATM4 0 0
## 5 <NA> 0 14
The results indicate that missing Cash values are limited to ATM1 (3 entries) and ATM2 (2 entries), for a total of 5 missing entries among the identified ATMs. This distribution suggests that no single ATM has a significantly higher proportion of missing values, so a general imputation approach can be applied.
We also noted that there are 14 instances of missing ATM categories with no associated cash withdrawals. These rows lack an identifier, making them difficult to use for analysis. Therefore, we will remove these rows to maintain the focus on the four defined ATMs.
Removing Rows with Missing ATM Values
Since rows with missing ATM values cannot be associated with any of the defined ATMs, we will remove these rows from the dataset.
# Remove rows where ATM is NA
atm_data <- atm_data %>%
filter(!is.na(ATM))
# Verify removal
summary(atm_data)
## DATE ATM Cash
## Min. :2009-05-01 Length:1460 Min. : 0.0
## 1st Qu.:2009-07-31 Class :character 1st Qu.: 0.5
## Median :2009-10-30 Mode :character Median : 73.0
## Mean :2009-10-30 Mean : 155.6
## 3rd Qu.:2010-01-29 3rd Qu.: 114.0
## Max. :2010-04-30 Max. :10919.8
## NA's :5
Imputing Missing Values in the Cash Variable
To handle missing data in the Cash variable, we applied median imputation, grouped by ATM, to replace missing values. Median imputation was selected because it is less sensitive to outliers than mean imputation, preserving the central tendency of the data without being skewed by extreme values.
# Applying median imputation to fill missing values in the Cash variable
atm_data <- atm_data %>%
group_by(ATM) %>%
mutate(Cash = ifelse(is.na(Cash), median(Cash, na.rm = TRUE), Cash))
# Verify missing values are handled
sum(is.na(atm_data))
## [1] 0
To better understand the withdrawal patterns for each ATM without one dominating the scale, we will use separate plots for each ATM.
## Visualizing Cash Withdrawals for Each ATM
# Faceted plot by ATM
ggplot(atm_data, aes(x = DATE, y = Cash)) +
geom_line() +
facet_wrap(~ ATM, scales = "free_y") +
labs(title = "Cash Withdrawals by Each ATM Over Time",
x = "Date",
y = "Cash (in hundreds of dollars)") +
theme_minimal() +
theme(axis.title.x = element_text(size = 14, face = 'bold'),
axis.title.y = element_text(size = 14, face = 'bold'),
plot.title = element_text(hjust = 0.5, size = 14, face = 'bold'),
axis.text.x = element_text(size = 8),
axis.text.y = element_text(size = 8)
)
ATM3: Out of 365 observations, cash was withdrawn at ATM3 only three times; nearly the entire series is 0, indicating no transactions.
ATM4: ATM4 has an extreme Cash outlier of 10,919.8 which, since Cash is recorded in hundreds of dollars, exceeds $1,000,000.
Given that ATM3 has almost no transactions, it will be excluded from this visualization to focus on the active ATMs (ATM1, ATM2, and ATM4).
# Filter out ATM3 for the visualization
atm_data_filtered <- atm_data %>%
filter(ATM != "ATM3")
# Faceted plot by ATM
ggplot(atm_data_filtered, aes(x = DATE, y = Cash)) +
geom_line() +
facet_wrap(~ ATM, scales = "free_y") +
labs(title = "Cash Withdrawals by Each ATM Over Time",
x = "Date",
y = "Cash (in hundreds of dollars)") +
theme_minimal() +
theme(axis.title.x = element_text(size = 14, face = 'bold'),
axis.title.y = element_text(size = 14, face = 'bold'),
plot.title = element_text(hjust = 0.5, size = 14, face = 'bold'),
axis.text.x = element_text(size = 6.3),
axis.text.y = element_text(size = 10)
)
To identify the presence of outliers in the cash withdrawals for ATM1, ATM2, and ATM4, we use box plots for each ATM. Box plots help in visualizing the distribution and highlight any values that fall outside the typical range (1.5 times the IQR), which are considered outliers. This step allows us to assess whether any extreme values may influence the forecast model.
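As a numeric complement to the box plots, the 1.5 × IQR fences can be computed directly to count how many observations each ATM has outside the typical range. This is a small sketch using dplyr (already loaded for the pipeline above); the column names lower_fence, upper_fence, and n_outliers are illustrative.
# Count observations outside the 1.5 * IQR fences for each ATM (sketch)
atm_data_filtered %>%
  group_by(ATM) %>%
  summarise(
    lower_fence = quantile(Cash, 0.25) - 1.5 * IQR(Cash),
    upper_fence = quantile(Cash, 0.75) + 1.5 * IQR(Cash),
    n_outliers  = sum(Cash < lower_fence | Cash > upper_fence)
  )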
library(ggplot2)
library(gridExtra)
# Filter data for each ATM
atm1_data <- subset(atm_data_filtered, ATM == "ATM1")
atm2_data <- subset(atm_data_filtered, ATM == "ATM2")
atm4_data <- subset(atm_data_filtered, ATM == "ATM4")
# Create box plots for each ATM with adjusted x-axis labels
plot_atm1 <- ggplot(atm1_data, aes(x = ATM, y = Cash)) +
geom_boxplot() +
labs(title = "Cash Withdrawals for ATM1", x = "", y = "Cash (in hundreds of dollars)") +
theme_minimal() +
theme(
legend.position = "none",
axis.title.x = element_blank(),
axis.text.x = element_blank(), # Remove x-axis text
axis.title.y = element_text(size = 10, face = 'bold'),
plot.title = element_text(hjust = 0.5, size = 10, face = 'bold'),
axis.text.y = element_text(size = 9)
)
plot_atm2 <- ggplot(atm2_data, aes(x = ATM, y = Cash)) +
geom_boxplot() +
labs(title = "Cash Withdrawals for ATM2", x = "", y = "") +
theme_minimal() +
theme(
legend.position = "none",
axis.title.x = element_blank(),
axis.text.x = element_blank(), # Remove x-axis text
plot.title = element_text(hjust = 0.5, size = 10, face = 'bold'),
axis.text.y = element_text(size = 9)
)
plot_atm4 <- ggplot(atm4_data, aes(x = ATM, y = Cash)) +
geom_boxplot() +
labs(title = "Cash Withdrawals for ATM4", x = "", y = "") +
theme_minimal() +
theme(
legend.position = "none",
axis.title.x = element_blank(),
axis.text.x = element_blank(), # Remove x-axis text
plot.title = element_text(hjust = 0.5, size = 10, face = 'bold'),
axis.text.y = element_text(size = 9)
)
# Arrange the plots using gridExtra
grid.arrange(plot_atm1, plot_atm2, plot_atm4, nrow = 1)
Observations
ATM1 and ATM2: Both ATMs display relatively consistent withdrawal patterns, with some mild outliers but nothing extreme. The values in these ATMs fall within a narrow range, indicating steady usage.
ATM4: This ATM shows a large outlier far above the typical range, as seen in the box plot. Such an extreme value could skew the forecast model, so we’ll handle it by capping the outlier in the following step. This will allow us to preserve data continuity without letting one value disproportionately impact the forecast.
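Before capping, it is worth sizing the gap between the extreme value and the rest of the series. A quick sketch (using dplyr's slice_max) to display the largest ATM4 observations:
# Inspect the largest Cash values for ATM4 to gauge the outlier's magnitude
atm_data_filtered %>%
  filter(ATM == "ATM4") %>%
  slice_max(Cash, n = 5)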
The histogram plot for each ATM’s cash withdrawals helps us understand the distribution of cash amounts withdrawn over time. By examining the shape and spread of these distributions, we can identify any patterns, central tendencies, or irregularities that might affect the forecasting model.
# Univariate plots for cash withdrawal distribution by each ATM
ggplot(atm_data_filtered, aes(x = Cash)) +
geom_histogram(bins = 50, fill = 'gray', color = "black") +
facet_wrap(~ATM, scales = "free") +
theme_bw() +
labs(title = "Distribution of Cash Withdrawals at Each ATM",
x = "Cash (in hundreds of dollars)",
y = "Frequency") +
theme(
legend.position = "none",
axis.title.x = element_text(size = 12, face = 'bold'),
axis.title.y = element_text(size = 12, face = 'bold'),
plot.title = element_text(hjust = 0.5, size = 12, face = 'bold'),
axis.text.y = element_text(size = 9)
)
ATM1 and ATM2: Both ATMs exhibit a relatively compact distribution, with most withdrawals clustered at lower values and peaking near a Cash value of 100 (about $10,000, since Cash is recorded in hundreds of dollars). This suggests consistent, modest withdrawal amounts, which should make these series easier to forecast.
ATM4: ATM4 shows a broader distribution, with Cash values reaching roughly 1,200 (about $120,000). This wider spread indicates more variability in withdrawal amounts, which may present challenges for forecasting; the higher values could reflect occasional high-volume days, though these are infrequent.
While the histograms reveal some low-value outliers, particularly for ATM4, we have opted to retain these values in the dataset. These low values, while less common, could represent genuine transaction behavior at these ATMs. Removing or adjusting them may obscure real patterns in the data. Instead, we’ll proceed with modeling the data as is, allowing the forecasting methods to account for the full range of transaction behaviors.
ATM4 displays an extreme spike in cash withdrawals, which could distort the forecast model if left unaddressed. Since removing this observation would break the continuity of the time series, we will cap it at the 95th percentile of the remaining ATM4 values. This preserves the continuity of the data while reducing the outlier's influence.
# Identify the extreme value and cap it to a specific threshold
extreme_value <- max(atm_data_filtered$Cash)
cap_threshold <- quantile(atm_data_filtered$Cash[atm_data_filtered$ATM == "ATM4" & atm_data_filtered$Cash != extreme_value], 0.95, na.rm = TRUE)
# Cap only the extreme outlier for ATM4
atm_data_filtered <- atm_data_filtered %>%
mutate(Cash = ifelse(ATM == "ATM4" & Cash == extreme_value, cap_threshold, Cash))
# Generate plots without y-axis labels
library(grid) # textGrob() and gpar() below come from the grid package
plot_atm1 <- ggplot(subset(atm_data_filtered, ATM == "ATM1"), aes(x = DATE, y = Cash)) +
geom_line() +
labs(title = "Cash Withdrawals for ATM1", x = "Date") +
theme_minimal() +
theme(axis.title.y = element_blank())
plot_atm2 <- ggplot(subset(atm_data_filtered, ATM == "ATM2"), aes(x = DATE, y = Cash)) +
geom_line() +
labs(title = "Cash Withdrawals for ATM2", x = "Date") +
theme_minimal() +
theme(axis.title.y = element_blank())
plot_atm4 <- ggplot(subset(atm_data_filtered, ATM == "ATM4"), aes(x = DATE, y = Cash)) +
geom_line() +
labs(title = "Cash Withdrawals for ATM4 (Capped Outlier)", x = "Date") +
theme_minimal() +
theme(axis.title.y = element_blank())
# Arrange plots with a single y-axis label
grid.arrange(plot_atm1, plot_atm2, plot_atm4, nrow = 3,
left = textGrob("Cash (in hundreds of dollars)", rot = 90, vjust = 1, gp = gpar(fontsize = 12)))
ATM1 and ATM2: Both ATMs display cash withdrawal patterns with relatively low volatility and a consistent range, suggesting stable demand over time. The time series for these ATMs exhibit moderate short-term fluctuations but lack significant seasonality or trends. The data oscillates within a defined range, with minor peaks and troughs, indicating potential stationarity in these series. The absence of extreme spikes implies that these ATMs maintain steady usage patterns, which could make forecasting simpler and more reliable.
ATM4: Unlike ATM1 and ATM2, ATM4 shows a considerably broader range of cash withdrawals, reaching Cash values of around 1,000 (in hundreds of dollars, roughly $100,000) after capping the extreme outlier. This variability indicates higher volatility and possibly irregular demand, with less frequent but larger transactions. The series may exhibit non-stationary behavior due to its pronounced fluctuations, suggesting irregular cycles or outliers rather than clear seasonality. The large swings in withdrawals may reflect unique usage patterns, such as occasional bulk withdrawals.
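Since stationarity bears directly on ARIMA modeling, these visual impressions can be checked with a KPSS unit-root test per ATM. A sketch using unitroot_kpss from the feasts package (attached via fpp3); a kpss_pvalue above 0.05 is consistent with stationarity:
# KPSS test per ATM; small p-values suggest the series is non-stationary
atm_data_filtered %>%
  ungroup() %>%
  as_tsibble(index = DATE, key = ATM) %>%
  features(Cash, unitroot_kpss)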
Based on the observations from the time series plots, we will use three forecasting techniques to predict future cash withdrawals for each ATM: ARIMA, ETS, and a simple benchmark (Naive for ATM1 and ATM2, Drift for ATM4). These methods capture different data patterns and provide a robust basis for model comparison and selection:
# Convert the dataset to a tsibble for time series analysis
atm_data_tsibble <- atm_data_filtered %>%
as_tsibble(index = DATE, key = ATM)
# Step 1: Backtesting (training on data up to March 2010, testing on April 2010)
# Train models for ATM1 up to March 2010
atm1_model_march <- atm_data_tsibble %>%
filter(ATM == "ATM1", DATE <= as.Date("2010-03-31")) %>%
model(
ARIMA = ARIMA(Cash),
ETS = ETS(Cash),
Naive = NAIVE(Cash)
)
# Train models for ATM2 up to March 2010
atm2_model_march <- atm_data_tsibble %>%
filter(ATM == "ATM2", DATE <= as.Date("2010-03-31")) %>%
model(
ARIMA = ARIMA(Cash),
ETS = ETS(Cash),
Naive = NAIVE(Cash)
)
# Train models for ATM4 up to March 2010
atm4_model_march <- atm_data_tsibble %>%
filter(ATM == "ATM4", DATE <= as.Date("2010-03-31")) %>%
model(
ARIMA = ARIMA(Cash),
ETS = ETS(Cash),
Drift = NAIVE(Cash ~ drift())
)
# Generate forecasts for April 2010 to evaluate model performance
atm1_backtest <- forecast(atm1_model_march, h = "1 month")
atm2_backtest <- forecast(atm2_model_march, h = "1 month")
atm4_backtest <- forecast(atm4_model_march, h = "1 month")
# Calculate accuracy by comparing to actual April 2010 data
atm1_backtest_accuracy <- accuracy(atm1_backtest, atm_data_tsibble %>% filter(ATM == "ATM1", DATE >= as.Date("2010-04-01")))
atm2_backtest_accuracy <- accuracy(atm2_backtest, atm_data_tsibble %>% filter(ATM == "ATM2", DATE >= as.Date("2010-04-01")))
atm4_backtest_accuracy <- accuracy(atm4_backtest, atm_data_tsibble %>% filter(ATM == "ATM4", DATE >= as.Date("2010-04-01")))
# Display backtest accuracy
list(
`ATM1 Backtest Accuracy` = atm1_backtest_accuracy,
`ATM2 Backtest Accuracy` = atm2_backtest_accuracy,
`ATM4 Backtest Accuracy` = atm4_backtest_accuracy
)
## $`ATM1 Backtest Accuracy`
## # A tibble: 3 × 11
## .model ATM .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ARIMA ATM1 Test -6.81 12.3 9.68 -82.9 86.0 NaN NaN -0.282
## 2 ETS ATM1 Test -5.80 11.6 9.13 -61.5 65.1 NaN NaN -0.255
## 3 Naive ATM1 Test -50.3 59.0 50.3 -522. 522. NaN NaN -0.147
##
## $`ATM2 Backtest Accuracy`
## # A tibble: 3 × 11
## .model ATM .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ARIMA ATM2 Test 8.63 21.1 16.3 -45.0 74.0 NaN NaN 0.0940
## 2 ETS ATM2 Test 9.51 19.4 14.3 -26.3 58.2 NaN NaN -0.0932
## 3 Naive ATM2 Test -40.6 55.0 41.3 -1142. 1143. NaN NaN 0.136
##
## $`ATM4 Backtest Accuracy`
## # A tibble: 3 × 11
## .model ATM .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ARIMA ATM4 Test -59.1 281. 240. -1045. 1069. NaN NaN -0.00233
## 2 Drift ATM4 Test -84.5 291. 248. -1218. 1240. NaN NaN -0.0168
## 3 ETS ATM4 Test -56.1 312. 264. -1257. 1290. NaN NaN 0.119
# Step 2: Final Forecasting for May 2010 using full data up to April 2010
# Train models for ATM1 on data up to April 2010
atm1_model_full <- atm_data_tsibble %>%
filter(ATM == "ATM1", DATE <= as.Date("2010-04-30")) %>%
model(
ARIMA = ARIMA(Cash),
ETS = ETS(Cash),
Naive = NAIVE(Cash)
)
# Train models for ATM2 on data up to April 2010
atm2_model_full <- atm_data_tsibble %>%
filter(ATM == "ATM2", DATE <= as.Date("2010-04-30")) %>%
model(
ARIMA = ARIMA(Cash),
ETS = ETS(Cash),
Naive = NAIVE(Cash)
)
# Train models for ATM4 on data up to April 2010
atm4_model_full <- atm_data_tsibble %>%
filter(ATM == "ATM4", DATE <= as.Date("2010-04-30")) %>%
model(
ARIMA = ARIMA(Cash),
ETS = ETS(Cash),
Drift = NAIVE(Cash ~ drift())
)
# Generate final forecasts for May 2010
atm1_forecast_may <- forecast(atm1_model_full, h = "1 month")
atm2_forecast_may <- forecast(atm2_model_full, h = "1 month")
atm4_forecast_may <- forecast(atm4_model_full, h = "1 month")
# Plot the forecasts with original data
plot_atm1 <- autoplot(atm1_forecast_may, atm_data_tsibble %>% filter(ATM == "ATM1")) +
labs(title = "Forecast for ATM1", y = "Cash (in hundreds of dollars)")
plot_atm2 <- autoplot(atm2_forecast_may, atm_data_tsibble %>% filter(ATM == "ATM2")) +
labs(title = "Forecast for ATM2", y = "Cash (in hundreds of dollars)")
plot_atm4 <- autoplot(atm4_forecast_may, atm_data_tsibble %>% filter(ATM == "ATM4")) +
labs(title = "Forecast for ATM4", y = "Cash (in hundreds of dollars)")
# Arrange the plots in a grid
grid.arrange(plot_atm1, plot_atm2, plot_atm4, nrow = 3)
Observations from Forecasts
The forecast plots above give a qualitative picture; we can now evaluate each model's accuracy quantitatively, comparing metrics such as Mean Absolute Error (MAE) and Root Mean Square Error (RMSE) to determine the best model for each ATM.
Based on the Backtesting results, we can analyze each ATM’s forecast accuracy across different models using the metrics provided (ME, RMSE, MAE, MAPE, etc.). Below is a summary of observations for each ATM.
The backtest metrics for ATM1 show both ARIMA (RMSE 12.3, MAE 9.68) and ETS (RMSE 11.6, MAE 9.13) tracking the April 2010 test data far better than the Naive benchmark (RMSE 59.0, MAE 50.3).
Conclusion: For ATM1, both ARIMA and ETS provide accurate forecasts, with ETS slightly outperforming ARIMA in terms of RMSE and MAPE.
The forecast accuracy for ATM2 displays a similar pattern: ETS (RMSE 19.4, MAE 14.3) edges out ARIMA (RMSE 21.1, MAE 16.3), and both are far ahead of Naive (RMSE 55.0).
Conclusion: ETS is the preferred model for ATM2, marginally outperforming ARIMA in accuracy metrics.
For ATM4, which shows more variability, the errors are much larger across the board: ARIMA (RMSE 281, MAE 240) narrowly beats Drift (RMSE 291, MAE 248) and ETS (RMSE 312, MAE 264).
Conclusion: While none of the models fit ATM4 particularly well due to its high variability, ARIMA and ETS are relatively comparable, with ARIMA showing slightly better performance in terms of RMSE and MAE.
Given these results, ETS is recommended for forecasting ATM1 and ATM2, while ARIMA might be a preferable choice for ATM4, though additional analysis or different modeling approaches (e.g., machine learning models) might be warranted for highly variable series like ATM4.
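To make this selection reproducible rather than read off the tables by eye, the backtest accuracy tibbles computed above can be combined and the lowest-RMSE model extracted per ATM. A small sketch:
# Pick the lowest-RMSE model per ATM from the backtest accuracy tables
bind_rows(atm1_backtest_accuracy,
          atm2_backtest_accuracy,
          atm4_backtest_accuracy) %>%
  group_by(ATM) %>%
  slice_min(RMSE, n = 1) %>%
  select(ATM, .model, RMSE, MAE, MAPE)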
With these insights, we proceed with the selected model for each ATM to forecast May 2010 withdrawals.
# Load required libraries
library(fpp3)
library(openxlsx)
# Convert the dataset to a tsibble for time series analysis
atm_data_tsibble <- atm_data_filtered %>%
as_tsibble(index = DATE, key = ATM)
# Select the best model for each ATM based on previous accuracy analysis
# ATM1 and ATM2: ETS model, ATM4: ARIMA model
# Train final models on the full dataset (up to April 2010)
atm1_final_model <- atm_data_tsibble %>%
filter(ATM == "ATM1") %>%
model(ETS = ETS(Cash))
atm2_final_model <- atm_data_tsibble %>%
filter(ATM == "ATM2") %>%
model(ETS = ETS(Cash))
atm4_final_model <- atm_data_tsibble %>%
filter(ATM == "ATM4") %>%
model(ARIMA = ARIMA(Cash))
# Generate forecasts for May 2010 (next month)
atm1_forecast <- forecast(atm1_final_model, h = "1 month") %>% as_tibble()
atm2_forecast <- forecast(atm2_final_model, h = "1 month") %>% as_tibble()
atm4_forecast <- forecast(atm4_final_model, h = "1 month") %>% as_tibble()
# Combine the forecasts for exporting
combined_forecasts <- bind_rows(
atm1_forecast %>% mutate(ATM = "ATM1"),
atm2_forecast %>% mutate(ATM = "ATM2"),
atm4_forecast %>% mutate(ATM = "ATM4")
)
# Select relevant columns for export (Date, ATM, model, and point forecast)
forecast_export <- combined_forecasts %>%
select(DATE, ATM, .model, .mean) %>%
rename(Forecast = .mean)
# Check the forecast data
#print(forecast_export)
# Export to Excel
write.xlsx(forecast_export, file = "/Users/souleymanedoumbia/Library/Mobile Documents/com~apple~CloudDocs/CUNY SPS CLASSES/MSDS CLASSES/Data 624 Fall 2024/Week9/Project1/Project1_work/ATM_May_2010_Forecast.xlsx")
In Part B, the task is to model residential power usage using the ResidentialCustomerForecastLoad-624.xlsx dataset, spanning from January 1998 to December 2013. This dataset contains a KWH variable, representing power consumption in kilowatt-hours. The objective is to create a monthly forecast for the year 2014 based on historical patterns.
Objective: To build a predictive model for residential power usage, focusing on seasonal and trend components, and validate its effectiveness.
Approach: Using various time series techniques, I will analyze historical trends in power consumption, produce forecasts, and compare methods to ensure accurate results. Findings will be documented, accompanied by visualizations and code.
# Import dataset
power_data <- read_excel("ResidentialCustomerForecastLoad-624.xlsx")
summary(power_data)
## CaseSequence YYYY-MMM KWH
## Min. :733.0 Length:192 Min. : 770523
## 1st Qu.:780.8 Class :character 1st Qu.: 5429912
## Median :828.5 Mode :character Median : 6283324
## Mean :828.5 Mean : 6502475
## 3rd Qu.:876.2 3rd Qu.: 7620524
## Max. :924.0 Max. :10655730
## NA's :1
#power_data<-power_data%>% filter(KWH > 770523 & KWH <9000000 )
# Histogram of the distribution of KWH values
ggplot(power_data, aes(x = KWH)) +
geom_histogram(bins = 50, fill = 'gray', color = "black") +
theme_bw() +
labs(title = "Distribution of Residential Power Usage",
x = "KWH",
y = "Frequency") +
theme(
legend.position = "none",
axis.title.x = element_text(size = 12, face = 'bold'),
axis.title.y = element_text(size = 12, face = 'bold'),
plot.title = element_text(hjust = 0.5, size = 12, face = 'bold'),
axis.text.y = element_text(size = 9)
)
# Create a single box plot for the distribution of KWH values
ggplot(power_data, aes(y = KWH)) +
geom_boxplot() +
labs(
title = "Box Plot of KWH Values",
x = "",
y = "KWH"
) +
theme_minimal() +
theme(
legend.position = "none",
axis.title.x = element_text(size = 12, face = 'bold'),
axis.title.y = element_text(size = 12, face = 'bold'),
plot.title = element_text(hjust = 0.5, size = 12, face = 'bold'),
axis.text.y = element_text(size = 9)
)
# Extract the year from 'YYYY-MMM' and add it as a new column
power_data <- power_data %>%
mutate(Year = substr(`YYYY-MMM`, 1, 4))
# Create a box plot for KWH by year
ggplot(power_data, aes(x = Year, y = KWH)) +
geom_boxplot() +
labs(
title = "Box Plot of KWH by Year",
x = "Year",
y = "KWH"
) +
theme_minimal() +
theme(legend.position = "none",
axis.title.x = element_text(size = 12, face = 'bold'),
axis.title.y = element_text(size = 12, face = 'bold'),
plot.title = element_text(hjust = 0.5, size = 12, face = 'bold'),
axis.text.y = element_text(size = 8)
)
# Extract the month from 'YYYY-MMM' and order it chronologically for plotting
power_data <- power_data %>%
mutate(Month = factor(substr(`YYYY-MMM`, 6, 8), levels = month.abb))
# Create a box plot for KWH by month
ggplot(power_data, aes(x = Month, y = KWH)) +
geom_boxplot() +
labs(
title = "Box Plot of KWH by Month",
x = "Month",
y = "KWH"
) +
theme_minimal() +
theme(
legend.position = "none",
axis.title.x = element_text(size = 12, face = 'bold'),
axis.title.y = element_text(size = 12, face = 'bold'),
plot.title = element_text(hjust = 0.5, size = 12, face = 'bold'),
axis.text.y = element_text(size = 9)
)
The distribution of KWH values highlights the presence of an extreme low-value outlier in the dataset.
This is confirmed by the box plot of the KWH variable, which isolates this value as the only notable outlier.
There is one missing value in the KWH variable, which will need to be addressed to ensure a complete and accurate analysis.
The YYYY-MMM column, which represents the date, is currently stored as character type (Class: character, Mode: character). For compatibility with the tsibble framework, this column needs to be converted to a proper Date format to allow accurate indexing and time-based operations.
The summary statistics for the KWH variable indicate a skewed distribution, with most values concentrated on the left side. Although the mean (6,502,475) and median (6,283,324) are relatively close, the slight skewness suggests that using the median is a robust approach for imputing the missing value. The median is less affected by potential extreme values and better represents the central tendency in such distributions.
To convert the YYYY-MMM column for time series analysis, we will extract the year and month from the YYYY-MMM string, convert it to a date object using the lubridate::ymd() function or similar parsing functions in R, and ensure compatibility with the tsibble framework for time series modeling.
This section focuses on preparing the power dataset for time series analysis by handling missing values, treating outliers, and transforming the date column. These steps are crucial to ensure the data's integrity and compatibility with time series modeling frameworks such as tsibble.
The dataset contains a missing value in the KWH column. Given the slight skewness observed in the distribution, the median is chosen as the most robust imputation method: it better represents the central tendency of the data and is less sensitive to extreme values.
# Impute missing values in the KWH column using the median
median_kwh <- median(power_data$KWH, na.rm = TRUE)
power_data <- power_data %>%
mutate(KWH = ifelse(is.na(KWH), median_kwh, KWH))
# Verify the missing value is handled
sum(is.na(power_data$KWH)) # Should be 0
## [1] 0
The missing value in the KWH column was imputed using the median of the data. This choice ensures robustness to extreme values and preserves the integrity of the dataset.
The outlier in July 2010 accounts for only about 1% of the total power usage across all July observations, deviating significantly from the typical 7% average contribution of the other July values in the KWH series. To address this discrepancy, the outlier will be replaced with the mean power usage of all other July observations, ensuring the adjusted value aligns with the overall trend and maintains consistency within the dataset.
# Calculate the mean July power excluding the outlier
mean_july_power <- power_data %>%
filter(KWH != min(KWH) & Month == 'Jul') %>%
summarize(mean_july = mean(KWH, na.rm = TRUE)) %>%
pull(mean_july) # Extract the numeric value
# Replace the outlier value for July 2010
power_data <- power_data %>%
mutate(KWH = ifelse(`YYYY-MMM` == "2010-Jul" & KWH == min(KWH),
mean_july_power, KWH))
# Verify the adjustment
#power_data %>%
# filter(Month == "Jul")
The YYYY-MMM column, stored as character, needs to be converted to a Date format for time series analysis. This step ensures compatibility with the tsibble framework and enables efficient time-based operations.
# Convert the YYYY-MMM column to a monthly (yearmonth) time index
power_data <- power_data %>%
mutate(MonthYear = yearmonth(`YYYY-MMM`)) # Use yearmonth for year and month granularity
# Verify the transformation
str(power_data$MonthYear)
## mth [1:192] 1998 Jan, 1998 Feb, 1998 Mar, 1998 Apr, 1998 May, 1998 Jun, 19...
#summary(power_data)
Here, the transformed data is converted to a tsibble with MonthYear as the time index. This step ensures that the time series structure is consistent, complete, and ordered. The objective is to perform an exploratory analysis of the dataset to identify trends, seasonality, and potential patterns that could inform the selection of appropriate forecasting models. To do so, multiple visualizations and decomposition techniques are applied below:
Time Plot: Displays the overall trend and variability in power usage over time.
Seasonal Plot: Highlights recurring patterns across months within each year.
STL Decomposition: Separates the data into trend, seasonal, and remainder components.
# Convert power_data to a tsibble with a regular monthly index
power_tsibble <- power_data %>%
select(MonthYear, KWH) %>%
arrange(MonthYear) %>%
as_tsibble(index = MonthYear)
# Time plot to observe overall trends
autoplot(power_tsibble, KWH) +
labs(
title = "Time Plot of Power Usage (KWH)",
x = "Date",
y = "KWH"
) +
theme_minimal()
# Seasonal plot to uncover recurring patterns
gg_season(power_tsibble, KWH, period = "year") +
labs(
title = "Seasonal Plot of Power Usage (KWH)",
x = "Month",
y = "KWH"
)
# Decomposition to separate components
decomp <- power_tsibble %>%
model(STL(KWH ~ season(window = "periodic"))) %>%
components()
autoplot(decomp) +
labs(title = "STL Decomposition of Power Usage")
The time plot illustrates the overall behavior of power usage over the years; the seasonal plot provides a closer look at how usage varies across months in different years; and the STL decomposition separates the series into trend, seasonal, and remainder components for a more detailed examination of the underlying patterns.
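These visual impressions can also be quantified. A sketch using feat_stl from the feasts package (attached via fpp3), where trend and seasonal strength values close to 1 indicate strong components:
# Quantify trend and seasonal strength from an STL decomposition
power_tsibble %>%
  features(KWH, feat_stl)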
Based on the insights obtained from the exploratory analysis, including the time plot, seasonal plot, and STL decomposition, the following forecasting models will be fitted: an automatically selected seasonal ARIMA model, an ETS model, and Box-Cox-transformed variants of both.
# Automatically fit an ARIMA model with seasonal adjustments
arima_model <- power_tsibble %>%
model(ARIMA = ARIMA(KWH ~ season()))
# Fit the ETS model
ets_model <- power_tsibble %>%
model(ETS = ETS(KWH))
# Apply Box-Cox Transformation
lambda <- BoxCox.lambda(power_tsibble$KWH) # Find the optimal lambda
transformed_data <- power_tsibble %>%
mutate(KWH_transformed = BoxCox(KWH, lambda)) # Apply the transformation
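# Note: the Box-Cox transform is w = (y^lambda - 1) / lambda for lambda != 0
# (and w = log(y) when lambda = 0); BoxCox.lambda() chooses the lambda that
# best stabilizes the variance of the series before modeling.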
# To Ensure transformed_data is in a tsibble format
transformed_data <- transformed_data %>%
as_tsibble(index = MonthYear) # Ensure `MonthYear` is the index column
# Time series plot of the transformed data
autoplot(transformed_data, KWH_transformed) +
labs(title = "Box-Cox Transformed Power Usage", y = "Transformed KWH")
# Check for missing values
if (any(is.na(transformed_data$KWH_transformed))) {
stop("Missing values detected in the transformed data. Please address them before proceeding.")
}
# Fit ARIMA Model on Transformed Data
arima_model_tr <- transformed_data %>%
model(ARIMA = ARIMA(KWH_transformed)) # Auto ARIMA, no hardcoding
# Fit ETS Model on Transformed Data
ets_model_tr <- transformed_data %>%
model(ETS = ETS(KWH_transformed)) # Auto ETS, no hardcoding
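To see which specifications the automatic searches settled on, the fitted models can be inspected with report(). A brief sketch (each mable here holds a single model, which is what report() expects):
# Print the automatically selected model specifications
report(arima_model_tr) # the chosen ARIMA(p,d,q)(P,D,Q)[12] order
report(ets_model_tr)   # the chosen ETS(error, trend, season) form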
Time series cross-validation splits the data into multiple training and testing sets, evaluating model performance at each split. For this dataset, a rolling-origin strategy will be used:
We will:
Use data up to December 2011 (the first 168 months) for initial training.
Incrementally expand the training set, leaving one-step or multi-step forecasts for validation.
suppressWarnings({
# Create rolling time series cross-validation dataset
cv <- power_tsibble %>%
stretch_tsibble(.init = 168, .step = 12) # Initial 168 months (Jan 1998 - Dec 2011)
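# With 192 monthly observations, .init = 168 and .step = 12 yield three
# expanding training windows ending Dec 2011, Dec 2012, and Dec 2013.
# Forecasts from the last window extend beyond the observed data, so
# accuracy() drops them with a warning -- hence suppressWarnings() above.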
# Evaluate ARIMA model with cross-validation
cv_arima <- cv %>%
model(ARIMA = ARIMA(KWH ~ season())) %>%
forecast(h = 12) %>%
accuracy(data = power_tsibble) # Compare against actual data
# Evaluate ETS model with cross-validation
cv_ets <- cv %>%
model(ETS = ETS(KWH)) %>%
forecast(h = 12) %>%
accuracy(data = power_tsibble) # Compare against actual data
# Evaluate Box-Cox ARIMA model with cross-validation
cv_arima_tr <- cv %>%
mutate(KWH_transformed = BoxCox(KWH, lambda)) %>% # Include transformed data in cross-validation
model(ARIMA = ARIMA(KWH_transformed)) %>%
forecast(h = 12) %>%
accuracy(data = power_tsibble %>%
mutate(KWH_transformed = BoxCox(KWH, lambda))) # Compare transformed forecasts to transformed actuals
# Evaluate Box-Cox ETS model with cross-validation
cv_ets_tr <- cv %>%
mutate(KWH_transformed = BoxCox(KWH, lambda)) %>% # Include transformed data in cross-validation
model(ETS = ETS(KWH_transformed)) %>%
forecast(h = 12) %>%
accuracy(data = power_tsibble %>%
mutate(KWH_transformed = BoxCox(KWH, lambda))) # Compare transformed forecasts to transformed actuals
# Combine results for comparison
cv_results <- bind_rows(
cv_arima %>% mutate(Model = "ARIMA"),
cv_ets %>% mutate(Model = "ETS"),
cv_arima_tr %>% mutate(Model = "Box-Cox ARIMA"),
cv_ets_tr %>% mutate(Model = "Box-Cox ETS")
)
# Display cross-validation results
cv_results %>%
select(Model, RMSE, MAE, MAPE) %>%
arrange(RMSE)
})
## # A tibble: 4 × 4
## Model RMSE MAE MAPE
## <chr> <dbl> <dbl> <dbl>
## 1 Box-Cox ARIMA 0.00269 0.00214 0.0498
## 2 Box-Cox ETS 0.00340 0.00251 0.0584
## 3 ARIMA 851068. 551037. 7.23
## 4 ETS 882272. 579381. 7.67
Discussion on Cross-Validation Results
The results from the cross-validation process provide valuable insights into the performance of the four models: Box-Cox ARIMA, Box-Cox ETS, ARIMA, and ETS. The metrics considered are RMSE (Root Mean Square Error), MAE (Mean Absolute Error), and MAPE (Mean Absolute Percentage Error). One caveat: the Box-Cox models are scored on the transformed scale, so their error magnitudes are not directly comparable to those of the untransformed models. Within each scale, however, the ARIMA specification outperforms its ETS counterpart, which supports selecting the Box-Cox ARIMA model.
Using the Box-Cox ARIMA model, we will proceed to forecast the power usage for the year 2014.
# Fit the Box-Cox ARIMA model
lambda <- BoxCox.lambda(power_tsibble$KWH)
power_transformed <- power_tsibble %>%
mutate(KWH_transformed = BoxCox(KWH, lambda))
box_cox_arima <- power_transformed %>%
model(BoxCox_ARIMA = ARIMA(KWH_transformed))
# Forecast for 2014
forecast_2014 <- box_cox_arima %>%
forecast(h = "12 months") %>%
mutate(Forecast = InvBoxCox(.mean, lambda)) # Back-transform forecasts
# Export forecast results to Excel
forecast_2014_export <- forecast_2014 %>%
as_tibble() %>%
select(MonthYear, Forecast)
# Write to Excel file
write.xlsx(forecast_2014_export, file = "ResidentialPowerForecast_2014.xlsx", overwrite = TRUE)
# Display a message confirming the file export
cat("The forecast for 2014 has been successfully saved as 'ResidentialPowerForecast_2014.xlsx'.")
## The forecast for 2014 has been successfully saved as 'ResidentialPowerForecast_2014.xlsx'.
# Note: the fable's response is KWH_transformed, so this plot is on the Box-Cox scale
autoplot(forecast_2014, power_transformed) +
labs(title = "Forecast for Residential Power Usage (2014, Box-Cox Scale)",
y = "Transformed KWH",
x = "Month") +
theme_minimal() +
theme(
axis.title.x = element_text(size = 12, face = 'bold'),
axis.title.y = element_text(size = 12, face = 'bold'),
plot.title = element_text(hjust = 0.5, size = 12, face = 'bold'),
axis.text.x = element_text(size = 10),
axis.text.y = element_text(size = 10)
)
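Because the fable above is on the Box-Cox scale, it is also useful to view the point forecasts back-transformed to kilowatt-hours. A minimal sketch (point forecasts only; the prediction intervals would need their own back-transformation):
# Plot the back-transformed point forecasts against the original series
forecast_2014 %>%
  as_tibble() %>%
  ggplot(aes(x = MonthYear, y = Forecast)) +
  geom_line(data = as_tibble(power_tsibble), aes(x = MonthYear, y = KWH)) +
  geom_line(color = "blue") +
  labs(title = "2014 Forecast Back-Transformed to KWH",
       x = "Month", y = "KWH") +
  theme_minimal()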