Introduction

This project analyzes and forecasts time series data across two distinct domains: cash withdrawals at ATMs and residential power usage. The analysis is divided into two parts:

  • Part A focuses on ATM cash withdrawals, exploring patterns and trends, applying multiple forecasting models, selecting the best-performing model, and exporting the forecast data (May 2010 for each ATM) into an Excel file.
  • Part B centers on residential power usage, following a similar process of analyzing historical trends, applying and validating forecasting models, and exporting the final forecast for 2014 into an Excel file.

Each part introduces specific objectives, applies various time series forecasting methods, and discusses the outcomes based on model performance and forecast results.


Part A: Forecasting ATM Cash Withdrawals

In Part A, we focus on forecasting the amount of cash withdrawn from four different ATMs for the month of May 2010. The provided dataset (ATM624Data.xlsx) includes a Cash variable, recorded in hundreds of dollars, which we will use to model cash withdrawal trends.

Objective: To develop and present a forecast for ATM withdrawals by applying suitable forecasting methods and justifying each step of the process.

Approach: I will begin by exploring the dataset and identifying relevant forecasting techniques. Visualizations and discussions will accompany the analysis to convey insights.

1. Data Exploration

The first step is to explore the dataset and summarize key features, such as the structure, variable types, and initial data distributions. This preliminary analysis provides an overview and helps identify any initial anomalies or missing-data patterns. For example, when the data was first loaded, the DATE column was not stored as a date, so we convert it with as.Date() as part of the cleaning process.

# Load the data
library(readxl)  # read_excel() for importing the .xlsx file
atm_data <- read_excel("ATM624Data.xlsx")
atm_data$DATE <- as.Date(atm_data$DATE)

# Preview the data
#head(atm_data)

# Checking the structure of the dataset
str(atm_data)
## tibble [1,474 × 3] (S3: tbl_df/tbl/data.frame)
##  $ DATE: Date[1:1474], format: "2009-05-01" "2009-05-01" ...
##  $ ATM : chr [1:1474] "ATM1" "ATM2" "ATM1" "ATM2" ...
##  $ Cash: num [1:1474] 96 107 82 89 85 90 90 55 99 79 ...
# Summary statistics for each ATM
summary(atm_data)
##       DATE                ATM                 Cash        
##  Min.   :2009-05-01   Length:1474        Min.   :    0.0  
##  1st Qu.:2009-08-01   Class :character   1st Qu.:    0.5  
##  Median :2009-11-01   Mode  :character   Median :   73.0  
##  Mean   :2009-10-31                      Mean   :  155.6  
##  3rd Qu.:2010-02-01                      3rd Qu.:  114.0  
##  Max.   :2010-05-14                      Max.   :10919.8  
##                                          NA's   :19
  • The summary statistics revealed a total of 19 missing values in the Cash variable. These missing values will be addressed in the data cleaning step to ensure continuity in our time series.

2. Investigating Missing Data

We next investigated the missing data to understand its extent and pattern, which will help determine whether values need to be imputed before forecasting. To check whether any particular ATM accounts for a disproportionate share of the missing data, we also summarized missing values by the ATM variable.

# Summary of missing data by column
missing_data_summary <- colSums(is.na(atm_data))
missing_data_summary
## DATE  ATM Cash 
##    0   14   19
# Checking missing data by ATM
missing_by_atm <- atm_data %>%
  group_by(ATM) %>%
  summarise_all(~sum(is.na(.))) 
missing_by_atm
## # A tibble: 5 × 3
##   ATM    DATE  Cash
##   <chr> <int> <int>
## 1 ATM1      0     3
## 2 ATM2      0     2
## 3 ATM3      0     0
## 4 ATM4      0     0
## 5 <NA>      0    14
  • Among the four labeled ATMs, missing data is limited to the Cash variable, with 5 missing entries (3 for ATM1 and 2 for ATM2). No single ATM has a markedly higher proportion of missing values, so a general imputation approach can be applied.

  • We also noted 14 rows with a missing ATM label, all of which also lack a Cash value. These rows cannot be attributed to any machine, so we will remove them to keep the focus on the four defined ATMs.

3. Data Cleaning and Handling Missing Values

3.a. Removing Rows with Missing ATM Values

Since rows with missing ATM values cannot be associated with any of the defined ATMs, we will remove these rows from the dataset.

# Remove rows where ATM is NA
atm_data <- atm_data %>%
  filter(!is.na(ATM))

# Verify removal
summary(atm_data)
##       DATE                ATM                 Cash        
##  Min.   :2009-05-01   Length:1460        Min.   :    0.0  
##  1st Qu.:2009-07-31   Class :character   1st Qu.:    0.5  
##  Median :2009-10-30   Mode  :character   Median :   73.0  
##  Mean   :2009-10-30                      Mean   :  155.6  
##  3rd Qu.:2010-01-29                      3rd Qu.:  114.0  
##  Max.   :2010-04-30                      Max.   :10919.8  
##                                          NA's   :5
  • After removing these rows, we are left with 1,460 observations corresponding to the four ATMs (365 daily observations per ATM).

3.b. Handling Missing Values in Cash Variable

To handle missing data in the Cash variable, we applied median imputation to replace missing values. Median imputation was selected because it is less sensitive to outliers compared to mean imputation, thus preserving the central tendency of the data without being skewed by extreme values.

# Applying median imputation to fill missing values in the Cash variable
atm_data <- atm_data %>%
  group_by(ATM) %>%
  mutate(Cash = ifelse(is.na(Cash), median(Cash, na.rm = TRUE), Cash))

# Verify missing values are handled
sum(is.na(atm_data))
## [1] 0
  • After imputation, there are no remaining missing values in the dataset. This ensures that our time series is complete, allowing us to proceed with forecasting without interruptions.

4. Faceted Plot for Each ATM

To better understand the withdrawal patterns for each ATM without one dominating the scale, we will use separate plots for each ATM.

## Visualizing Cash Withdrawals for Each ATM

# Faceted plot by ATM
ggplot(atm_data, aes(x = DATE, y = Cash)) +
  geom_line() +
  facet_wrap(~ ATM, scales = "free_y") +
  labs(title = "Cash Withdrawals by Each ATM Over Time",
       x = "Date",
       y = "Cash (in hundreds of dollars)") +
  theme_minimal() +
  theme(axis.title.x = element_text(size = 14, face = 'bold'),
        axis.title.y = element_text(size = 14, face = 'bold'),
        plot.title = element_text(hjust = 0.5, size = 14, face = 'bold'),
        axis.text.x = element_text(size = 8),
        axis.text.y = element_text(size = 8)
        )

  • ATM3: of its 365 daily observations, ATM3 recorded a withdrawal on only three days; nearly every other value is 0, i.e., no transactions.

  • ATM4: ATM4 has an extreme cash outlier of roughly 10,920; since Cash is recorded in hundreds of dollars, this corresponds to more than $1,000,000 (about $1.09 million).

5. Visualizing Cash Withdrawals for ATMs (Excluding ATM3)

Given that ATM3 has almost no transactions, it will be excluded from this visualization to focus on the active ATMs (ATM1, ATM2, and ATM4).

# Filter out ATM3 for the visualization
atm_data_filtered <- atm_data %>%
  filter(ATM != "ATM3")

# Faceted plot by ATM
ggplot(atm_data_filtered, aes(x = DATE, y = Cash)) +
  geom_line() +
  facet_wrap(~ ATM, scales = "free_y") +
  labs(title = "Cash Withdrawals by Each ATM Over Time",
       x = "Date",
       y = "Cash (in hundreds of dollars)") +
  theme_minimal() +
  theme(axis.title.x = element_text(size = 14, face = 'bold'),
        axis.title.y = element_text(size = 14, face = 'bold'),
        plot.title = element_text(hjust = 0.5, size = 14, face = 'bold'),
        axis.text.x = element_text(size = 6.3),
        axis.text.y = element_text(size = 10)
        )

  • Justification for Excluding ATM3: Since ATM3 has only three non-zero values among its 365 observations, it does not provide sufficient data for reliable forecasting and could skew the analysis if included. Excluding ATM3 allows us to focus on ATMs with consistent activity, leading to a more meaningful analysis of cash withdrawal trends.
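As a quick check on this justification, per-ATM activity can be tabulated directly from the cleaned data (a small sketch reusing the existing atm_data object and the same dplyr verbs used elsewhere in this analysis):

# Total observed days and days with a non-zero withdrawal for each ATM
atm_data %>%
  group_by(ATM) %>%
  summarise(days = n(), nonzero_days = sum(Cash > 0))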

6. Examining Outliers in Cash Withdrawals for Each ATM

To identify the presence of outliers in the cash withdrawals for ATM1, ATM2, and ATM4, we use box plots for each ATM. Box plots help in visualizing the distribution and highlight any values that fall outside the typical range (1.5 times the IQR), which are considered outliers. This step allows us to assess whether any extreme values may influence the forecast model.
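Before plotting, the same 1.5 times IQR rule can be applied numerically to count potential outliers for each ATM; a minimal sketch (thresholds are computed per ATM from the quartiles):

# Flag values outside Q1 - 1.5*IQR and Q3 + 1.5*IQR within each ATM
atm_data_filtered %>%
  group_by(ATM) %>%
  mutate(lower = quantile(Cash, 0.25) - 1.5 * IQR(Cash),
         upper = quantile(Cash, 0.75) + 1.5 * IQR(Cash)) %>%
  summarise(outliers = sum(Cash < lower | Cash > upper))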

library(ggplot2)
library(gridExtra)

# Filter data for each ATM
atm1_data <- subset(atm_data_filtered, ATM == "ATM1")
atm2_data <- subset(atm_data_filtered, ATM == "ATM2")
atm4_data <- subset(atm_data_filtered, ATM == "ATM4")

# Create box plots for each ATM with adjusted x-axis labels
plot_atm1 <- ggplot(atm1_data, aes(x = ATM, y = Cash)) +
  geom_boxplot() +
  labs(title = "Cash Withdrawals for ATM1", x = "", y = "Cash (in hundreds of dollars)") +
  theme_minimal() +
  theme(
    legend.position = "none",
    axis.title.x = element_blank(),
    axis.text.x = element_blank(), # Remove x-axis text
    axis.title.y = element_text(size = 10, face = 'bold'),
    plot.title = element_text(hjust = 0.5, size = 10, face = 'bold'),
    axis.text.y = element_text(size = 9)
  )

plot_atm2 <- ggplot(atm2_data, aes(x = ATM, y = Cash)) +
  geom_boxplot() +
  labs(title = "Cash Withdrawals for ATM2", x = "", y = "") +
  theme_minimal() +
  theme(
    legend.position = "none",
    axis.title.x = element_blank(),
    axis.text.x = element_blank(), # Remove x-axis text
    plot.title = element_text(hjust = 0.5, size = 10, face = 'bold'),
    axis.text.y = element_text(size = 9)
  )

plot_atm4 <- ggplot(atm4_data, aes(x = ATM, y = Cash)) +
  geom_boxplot() +
  labs(title = "Cash Withdrawals for ATM4", x = "", y = "") +
  theme_minimal() +
  theme(
    legend.position = "none",
    axis.title.x = element_blank(),
    axis.text.x = element_blank(), # Remove x-axis text
    plot.title = element_text(hjust = 0.5, size = 10, face = 'bold'),
    axis.text.y = element_text(size = 9)
  )

# Arrange the plots using gridExtra
grid.arrange(plot_atm1, plot_atm2, plot_atm4, nrow = 1)

Observations

  • ATM1 and ATM2: Both ATMs display relatively consistent withdrawal patterns, with some mild outliers but nothing extreme. The values in these ATMs fall within a narrow range, indicating steady usage.

  • ATM4: This ATM shows a large outlier far above the typical range, as seen in the box plot. Such an extreme value could skew the forecast model, so we’ll handle it by capping the outlier in the following step. This will allow us to preserve data continuity without letting one value disproportionately impact the forecast.

7. Distribution Analysis of Cash Withdrawals

The histogram plot for each ATM’s cash withdrawals helps us understand the distribution of cash amounts withdrawn over time. By examining the shape and spread of these distributions, we can identify any patterns, central tendencies, or irregularities that might affect the forecasting model.

# Univariate plots for cash withdrawal distribution by each ATM
ggplot(atm_data_filtered, aes(x = Cash)) + 
  geom_histogram(bins = 50, fill = 'gray', color = "black") + 
  facet_wrap(~ATM, scales = "free") + 
  theme_bw() + 
  labs(title = "Distribution of Cash Withdrawals at Each ATM",
       x = "Cash (in hundreds of dollars)",
       y = "Frequency") +
  theme(
    legend.position = "none",
    axis.title.x = element_text(size = 12, face = 'bold'),
    axis.title.y = element_text(size = 12, face = 'bold'),
    plot.title = element_text(hjust = 0.5, size = 12, face = 'bold'),
    axis.text.y = element_text(size = 9)
  )

7.a. Understanding the Cash Withdrawal Distribution

  • ATM1 and ATM2: Both ATMs exhibit a relatively compact distribution, with most cash withdrawals clustered around lower values, peaking near the 100 mark (about $10,000, since Cash is recorded in hundreds of dollars). This suggests that these ATMs experience more consistent and modest transaction amounts, which could make their time series easier to forecast.

  • ATM4: ATM4 shows a broader distribution, with cash withdrawals ranging significantly higher, reaching values of roughly 1,200 (about $120,000). This wider spread indicates more variability in transaction amounts, which might present challenges for forecasting. The higher values could reflect occasional high-volume transactions, though these are not frequent.

7.b. Consideration of Low-Value Outliers

While the histograms reveal some low-value outliers, particularly for ATM4, we have opted to retain these values in the dataset. These low values, while less common, could represent genuine transaction behavior at these ATMs. Removing or adjusting them may obscure real patterns in the data. Instead, we’ll proceed with modeling the data as is, allowing the forecasting methods to account for the full range of transaction behaviors.

8. Dealing with Outliers (extreme cash value) for ATM4

ATM4 displays an extreme spike in cash withdrawals, which could distort the forecast model if left unaddressed. Since removing this outlier would break the continuity of the time series, we will cap the outlier at a defined threshold based on a reasonable percentile of the data. This will preserve the continuity of the data and reduce the outlier’s influence.

# Identify the extreme value and cap it to a specific threshold
extreme_value <- max(atm_data_filtered$Cash)
cap_threshold <- quantile(atm_data_filtered$Cash[atm_data_filtered$ATM == "ATM4" & atm_data_filtered$Cash != extreme_value], 0.95, na.rm = TRUE)

# Cap only the extreme outlier for ATM4
atm_data_filtered <- atm_data_filtered %>%
  mutate(Cash = ifelse(ATM == "ATM4" & Cash == extreme_value, cap_threshold, Cash))


library(grid)  # textGrob() and gpar() used below come from the grid package

# Generate plots without y-axis labels
plot_atm1 <- ggplot(subset(atm_data_filtered, ATM == "ATM1"), aes(x = DATE, y = Cash)) +
  geom_line() +
  labs(title = "Cash Withdrawals for ATM1", x = "Date") +
  theme_minimal() +
  theme(axis.title.y = element_blank())

plot_atm2 <- ggplot(subset(atm_data_filtered, ATM == "ATM2"), aes(x = DATE, y = Cash)) +
  geom_line() +
  labs(title = "Cash Withdrawals for ATM2", x = "Date") +
  theme_minimal() +
  theme(axis.title.y = element_blank())

plot_atm4 <- ggplot(subset(atm_data_filtered, ATM == "ATM4"), aes(x = DATE, y = Cash)) +
  geom_line() +
  labs(title = "Cash Withdrawals for ATM4 (Capped Outlier)", x = "Date") +
  theme_minimal() +
  theme(axis.title.y = element_blank())

# Arrange plots with a single y-axis label
grid.arrange(plot_atm1, plot_atm2, plot_atm4, nrow = 3,
             left = textGrob("Cash (in hundreds of dollars)", rot = 90, vjust = 1, gp = gpar(fontsize = 12)))

8.a. Outlier Capping Approach

  • Capping the outlier values for ATM4 prevents one extreme value from disproportionately impacting the forecast model. By setting a threshold at the 95th percentile, we retain the time series continuity while limiting the outlier’s influence. This approach is particularly useful in time series forecasting where continuity is essential, as it avoids the disruptions that can arise from removing data points altogether.
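For reference, the size of this adjustment can be printed from the objects computed above (a small sketch; values are in hundreds of dollars):

# Original ATM4 spike versus the 95th-percentile cap applied to it
c(original_extreme = extreme_value,
  cap_95th_percentile = unname(cap_threshold))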

8.b. Observations from the Time Series Plots

  • ATM1 and ATM2: Both ATMs display cash withdrawal patterns with relatively low volatility and a consistent range, suggesting stable demand over time. The time series for these ATMs exhibit moderate short-term fluctuations but lack significant seasonality or trends. The data oscillates within a defined range, with minor peaks and troughs, indicating potential stationarity in these series. The absence of extreme spikes implies that these ATMs maintain steady usage patterns, which could make forecasting simpler and more reliable.

  • ATM4: Unlike ATM1 and ATM2, ATM4 shows a considerably broader range of cash withdrawals, reaching values up to around 1,000 (roughly $100,000) after capping the extreme outlier. This variability indicates higher volatility and possibly irregular demand, with less frequent but larger transactions. The time series for ATM4 may exhibit non-stationary behavior due to its pronounced fluctuations, suggesting the presence of irregular cycles or outliers rather than clear seasonality. The large swings in cash withdrawals may indicate unique usage patterns, possibly driven by factors like higher demand or occasional bulk withdrawals.
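As a quick, hedged check on these stationarity impressions, a KPSS unit-root test can be run for each ATM (a sketch; it assumes the feasts package is available, as it is later loaded via fpp3):

# KPSS test per ATM: a kpss_pvalue above 0.05 is consistent with stationarity
atm_data_filtered %>%
  as_tsibble(index = DATE, key = ATM) %>%
  features(Cash, unitroot_kpss)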

8.c. Introduction to Modeling Techniques

Based on the observations from the time series plots, we will use three forecasting techniques to predict future cash withdrawals for each ATM. These methods will allow us to capture different data patterns and provide a robust comparison for model selection:

  1. ARIMA (Auto-Regressive Integrated Moving Average):
    • Suitable for data with a trend or seasonal component, especially when the series can be made stationary through differencing.
    • ARIMA is commonly used for its flexibility in modeling complex time series data with both autoregressive and moving average terms.
  2. Exponential Smoothing (ETS):
    • Ideal for data with trend or seasonality. This model can capture levels, trends, and seasonal patterns based on exponential weights.
    • It’s straightforward to interpret and effective for data with relatively stable patterns, making it potentially suitable for ATM1 and ATM2.
  3. Naïve and Drift Methods:
    • For ATM1 and ATM2, we will use the Naïve Method as a baseline. This method assumes that future values will equal the last observed value, making it effective for stable series without strong trends or seasonal fluctuations.
    • For ATM4, given its variability, we will employ the Drift Method, which allows the forecast to increase or decrease based on the average change over time. This method will capture gradual shifts, providing a suitable baseline for ATM4’s broader range of withdrawals.
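For reference, the two baseline methods have simple closed-form forecasts (standard definitions, stated here for a series $y_1, \dots, y_T$ and forecast horizon $h$):

$$
\hat{y}_{T+h \mid T} = y_T \quad \text{(naïve)}, \qquad
\hat{y}_{T+h \mid T} = y_T + h \cdot \frac{y_T - y_1}{T - 1} \quad \text{(drift)}
$$

The drift forecast simply extrapolates the average historical change per period, which is why it can pick up the gradual shifts expected for ATM4.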

9. Modeling Cash Withdrawals with Backtesting for ATM1, ATM2, and ATM4

# Convert the dataset to a tsibble for time series analysis
atm_data_tsibble <- atm_data_filtered %>%
  as_tsibble(index = DATE, key = ATM)

# Step 1: Backtesting (training on data up to March 2010, testing on April 2010)

# Train models for ATM1 up to March 2010
atm1_model_march <- atm_data_tsibble %>%
  filter(ATM == "ATM1", DATE <= as.Date("2010-03-31")) %>%
  model(
    ARIMA = ARIMA(Cash),
    ETS = ETS(Cash),
    Naive = NAIVE(Cash)
  )

# Train models for ATM2 up to March 2010
atm2_model_march <- atm_data_tsibble %>%
  filter(ATM == "ATM2", DATE <= as.Date("2010-03-31")) %>%
  model(
    ARIMA = ARIMA(Cash),
    ETS = ETS(Cash),
    Naive = NAIVE(Cash)
  )

# Train models for ATM4 up to March 2010
atm4_model_march <- atm_data_tsibble %>%
  filter(ATM == "ATM4", DATE <= as.Date("2010-03-31")) %>%
  model(
    ARIMA = ARIMA(Cash),
    ETS = ETS(Cash),
    Drift = NAIVE(Cash ~ drift())
  )

# Generate forecasts for April 2010 to evaluate model performance
atm1_backtest <- forecast(atm1_model_march, h = "1 month")
atm2_backtest <- forecast(atm2_model_march, h = "1 month")
atm4_backtest <- forecast(atm4_model_march, h = "1 month")

# Calculate accuracy by comparing to actual April 2010 data
atm1_backtest_accuracy <- accuracy(atm1_backtest, atm_data_tsibble %>% filter(ATM == "ATM1", DATE >= as.Date("2010-04-01")))
atm2_backtest_accuracy <- accuracy(atm2_backtest, atm_data_tsibble %>% filter(ATM == "ATM2", DATE >= as.Date("2010-04-01")))
atm4_backtest_accuracy <- accuracy(atm4_backtest, atm_data_tsibble %>% filter(ATM == "ATM4", DATE >= as.Date("2010-04-01")))

# Display backtest accuracy
list(
  `ATM1 Backtest Accuracy` = atm1_backtest_accuracy,
  `ATM2 Backtest Accuracy` = atm2_backtest_accuracy,
  `ATM4 Backtest Accuracy` = atm4_backtest_accuracy
)
## $`ATM1 Backtest Accuracy`
## # A tibble: 3 × 11
##   .model ATM   .type     ME  RMSE   MAE    MPE  MAPE  MASE RMSSE   ACF1
##   <chr>  <chr> <chr>  <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl>  <dbl>
## 1 ARIMA  ATM1  Test   -6.81  12.3  9.68  -82.9  86.0   NaN   NaN -0.282
## 2 ETS    ATM1  Test   -5.80  11.6  9.13  -61.5  65.1   NaN   NaN -0.255
## 3 Naive  ATM1  Test  -50.3   59.0 50.3  -522.  522.    NaN   NaN -0.147
## 
## $`ATM2 Backtest Accuracy`
## # A tibble: 3 × 11
##   .model ATM   .type     ME  RMSE   MAE     MPE   MAPE  MASE RMSSE    ACF1
##   <chr>  <chr> <chr>  <dbl> <dbl> <dbl>   <dbl>  <dbl> <dbl> <dbl>   <dbl>
## 1 ARIMA  ATM2  Test    8.63  21.1  16.3   -45.0   74.0   NaN   NaN  0.0940
## 2 ETS    ATM2  Test    9.51  19.4  14.3   -26.3   58.2   NaN   NaN -0.0932
## 3 Naive  ATM2  Test  -40.6   55.0  41.3 -1142.  1143.    NaN   NaN  0.136 
## 
## $`ATM4 Backtest Accuracy`
## # A tibble: 3 × 11
##   .model ATM   .type    ME  RMSE   MAE    MPE  MAPE  MASE RMSSE     ACF1
##   <chr>  <chr> <chr> <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl>    <dbl>
## 1 ARIMA  ATM4  Test  -59.1  281.  240. -1045. 1069.   NaN   NaN -0.00233
## 2 Drift  ATM4  Test  -84.5  291.  248. -1218. 1240.   NaN   NaN -0.0168 
## 3 ETS    ATM4  Test  -56.1  312.  264. -1257. 1290.   NaN   NaN  0.119
# Step 2: Final Forecasting for May 2010 using full data up to April 2010

# Train models for ATM1 on data up to April 2010
atm1_model_full <- atm_data_tsibble %>%
  filter(ATM == "ATM1", DATE <= as.Date("2010-04-30")) %>%
  model(
    ARIMA = ARIMA(Cash),
    ETS = ETS(Cash),
    Naive = NAIVE(Cash)
  )

# Train models for ATM2 on data up to April 2010
atm2_model_full <- atm_data_tsibble %>%
  filter(ATM == "ATM2", DATE <= as.Date("2010-04-30")) %>%
  model(
    ARIMA = ARIMA(Cash),
    ETS = ETS(Cash),
    Naive = NAIVE(Cash)
  )

# Train models for ATM4 on data up to April 2010
atm4_model_full <- atm_data_tsibble %>%
  filter(ATM == "ATM4", DATE <= as.Date("2010-04-30")) %>%
  model(
    ARIMA = ARIMA(Cash),
    ETS = ETS(Cash),
    Drift = NAIVE(Cash ~ drift())
  )

# Generate final forecasts for May 2010
atm1_forecast_may <- forecast(atm1_model_full, h = "1 month")
atm2_forecast_may <- forecast(atm2_model_full, h = "1 month")
atm4_forecast_may <- forecast(atm4_model_full, h = "1 month")

# Plot the forecasts with original data
plot_atm1 <- autoplot(atm1_forecast_may, atm_data_tsibble %>% filter(ATM == "ATM1")) +
  labs(title = "Forecast for ATM1", y = "Cash (in hundreds of dollars)")
plot_atm2 <- autoplot(atm2_forecast_may, atm_data_tsibble %>% filter(ATM == "ATM2")) +
  labs(title = "Forecast for ATM2", y = "Cash (in hundreds of dollars)")
plot_atm4 <- autoplot(atm4_forecast_may, atm_data_tsibble %>% filter(ATM == "ATM4")) +
  labs(title = "Forecast for ATM4", y = "Cash (in hundreds of dollars)")
# Arrange the plots in a grid
grid.arrange(plot_atm1, plot_atm2, plot_atm4, nrow = 3)

Observations from Forecasts

  • ATM1
    • The forecasts appear stable with narrow prediction intervals, reflecting the consistent and relatively low variability observed in the cash withdrawals for this ATM.
    • Both ARIMA and ETS provide forecasts within a close range, and the Naive model aligns well, showing that the series is stable with minimal trend or seasonality.
  • ATM2
    • Similar to ATM1, ATM2’s forecasts show narrow prediction intervals, which indicates low volatility and a stable withdrawal pattern.
    • ETS and ARIMA align closely, while the Naive model is nearly identical, reinforcing the lack of a strong trend or seasonality.
  • ATM4
    • The forecast for ATM4 has wider prediction intervals, particularly in the Drift model, which shows an increasing trend. This likely stems from the larger observed variability and the presence of outliers in its cash withdrawals.
    • The ARIMA and ETS models both predict stable values but capture some level of variation, making them likely candidates for forecasting.

Given these initial observations, we can proceed to evaluate each model’s accuracy quantitatively. This will involve comparing forecast accuracy metrics such as Mean Absolute Error (MAE) and Root Mean Square Error (RMSE) to determine the best model for each ATM.

10. Model Evaluation and Summary of Results

Based on the Backtesting results, we can analyze each ATM’s forecast accuracy across different models using the metrics provided (ME, RMSE, MAE, MAPE, etc.). Below is a summary of observations for each ATM.
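For transparency, these metrics can be reproduced by hand from the backtest errors. The sketch below does this for ATM1's ARIMA forecasts, reusing the atm1_backtest and atm_data_tsibble objects created above (the join by DATE aligns forecasts with the April 2010 actuals):

# Align April 2010 forecasts with actual withdrawals, then compute the metrics
atm1_arima_eval <- atm1_backtest %>%
  filter(.model == "ARIMA") %>%
  as_tibble() %>%
  select(DATE, .mean) %>%
  inner_join(atm_data_tsibble %>%
               filter(ATM == "ATM1", DATE >= as.Date("2010-04-01")) %>%
               as_tibble() %>%
               select(DATE, Cash),
             by = "DATE")

err <- atm1_arima_eval$Cash - atm1_arima_eval$.mean  # error = actual - forecast
c(ME   = mean(err),
  RMSE = sqrt(mean(err^2)),
  MAE  = mean(abs(err)),
  MAPE = mean(abs(err / atm1_arima_eval$Cash)) * 100)  # undefined if an actual is 0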

10.a. ATM1

The forecasts for ATM1 demonstrate the following:

  • ARIMA Model:
    • Mean Error (ME): -6.81, Root Mean Square Error (RMSE): 12.35
    • The ARIMA model exhibits relatively low error metrics, indicating a strong fit to the data.
    • Mean Absolute Error (MAE) and Mean Absolute Percentage Error (MAPE) are also low, which suggests that the model performs well for the stable cash withdrawal series of ATM1.
  • ETS Model:
    • ME: -5.80, RMSE: 11.63
    • The ETS model provides a similar accuracy level to ARIMA, with slightly lower RMSE and MAE values.
    • The MAPE (65.12%), the lowest among the three candidate models, confirms that ETS captures the pattern in ATM1’s cash withdrawals more closely than the alternatives.
  • Naive Model:
    • ME: -50.33, RMSE: 59.04
    • The Naive model performs notably worse than ARIMA and ETS, with higher error values across all metrics.
    • This result indicates that a simple forecast using the last observed value is insufficient for even this stable series.

Conclusion: For ATM1, both ARIMA and ETS provide accurate forecasts, with ETS slightly outperforming ARIMA in terms of RMSE and MAPE.

10.b. ATM2

The forecast accuracy for ATM2 displays similar trends:

  • ARIMA Model:
    • ME: 8.63, RMSE: 21.05
    • The ARIMA model demonstrates good accuracy, with low RMSE and MAE values.
    • MAPE is 74.03%, somewhat higher than the corresponding ETS value, indicating a slightly looser fit to the actual values.
  • ETS Model:
    • ME: 9.51, RMSE: 19.37
    • ETS model outperforms ARIMA slightly, with a lower RMSE and MAE.
    • MAPE is 58.19%, suggesting that ETS provides a closer fit to the observed cash withdrawals, making it a reliable model for ATM2.
  • Naive Model:
    • ME: -40.60, RMSE: 54.99
    • The Naive model again has the highest error metrics, confirming that it is less suitable for accurate forecasts in this stable dataset.

Conclusion: ETS is the preferred model for ATM2, marginally outperforming ARIMA in accuracy metrics.

10.c. ATM4

For ATM4, which shows more variability, we observe the following:

  • ARIMA Model:
    • ME: -59.09, RMSE: 280.85
    • Despite the model capturing some variability, the higher error metrics reflect the increased difficulty in forecasting due to the volatility in ATM4’s cash withdrawals.
    • MAPE of 1068.70% suggests significant deviation from actual values, highlighting the challenging nature of this series.
  • Drift Model:
    • ME: -84.47, RMSE: 291.44
    • The Drift model shows slightly higher RMSE and MAE than ARIMA, indicating that this model’s tendency to forecast an increasing trend may not align well with the actual data.
    • A high MAPE (1239.86%) further supports the conclusion that Drift is not the best fit for this volatile series.
  • ETS Model:
    • ME: -56.10, RMSE: 312.07
    • ETS performs comparably to ARIMA, with slightly lower ME but higher RMSE and MAPE.
    • The error metrics are still high, reflecting the complexity of capturing variability in ATM4, though ETS does capture some of the series’ structure.

Conclusion: While none of the models perfectly fit ATM4 due to its high variability, ARIMA and ETS are relatively comparable, with ARIMA showing slightly better performance in terms of RMSE and MAE.

10.d. Overall Summary

  • For ATM1 and ATM2, the ETS model generally provides the most accurate forecasts, closely followed by ARIMA. These models capture the stable cash withdrawal patterns well.
  • For ATM4, where higher variability is present, ARIMA shows a slight edge in capturing the variability in the data, though error metrics are high across all models.

Given these results, ETS is recommended for forecasting ATM1 and ATM2, while ARIMA might be a preferable choice for ATM4, though additional analysis or different modeling approaches (e.g., machine learning models) might be warranted for highly variable series like ATM4.

With these insights, we can proceed with the selected models for each ATM to forecast May 2010 withdrawals.

11. Generate May 2010 Forecasts Based on Best Models

# Load required libraries
library(fpp3)
library(openxlsx)

# Convert the dataset to a tsibble for time series analysis
atm_data_tsibble <- atm_data_filtered %>%
  as_tsibble(index = DATE, key = ATM)

# Select the best model for each ATM based on previous accuracy analysis
# ATM1 and ATM2: ETS model, ATM4: ARIMA model

# Train final models on the full dataset (up to April 2010)
atm1_final_model <- atm_data_tsibble %>%
  filter(ATM == "ATM1") %>%
  model(ETS = ETS(Cash))

atm2_final_model <- atm_data_tsibble %>%
  filter(ATM == "ATM2") %>%
  model(ETS = ETS(Cash))

atm4_final_model <- atm_data_tsibble %>%
  filter(ATM == "ATM4") %>%
  model(ARIMA = ARIMA(Cash))

# Generate forecasts for May 2010 (next month)
atm1_forecast <- forecast(atm1_final_model, h = "1 month") %>% as_tibble()
atm2_forecast <- forecast(atm2_final_model, h = "1 month") %>% as_tibble()
atm4_forecast <- forecast(atm4_final_model, h = "1 month") %>% as_tibble()

# Combine the forecasts for exporting
combined_forecasts <- bind_rows(
  atm1_forecast %>% mutate(ATM = "ATM1"),
  atm2_forecast %>% mutate(ATM = "ATM2"),
  atm4_forecast %>% mutate(ATM = "ATM4")
)

# Select relevant columns for export (Date, ATM, model, and point forecast)
forecast_export <- combined_forecasts %>%
  select(DATE, ATM, .model, .mean) %>%
  rename(Forecast = .mean)

# Check the forecast data
#print(forecast_export)


# Export to Excel
write.xlsx(forecast_export, file = "/Users/souleymanedoumbia/Library/Mobile Documents/com~apple~CloudDocs/CUNY SPS CLASSES/MSDS CLASSES/Data 624 Fall 2024/Week9/Project1/Project1_work/ATM_May_2010_Forecast.xlsx")

Part B: Forecasting Residential Power Usage

In Part B, the task is to model residential power usage using the ResidentialCustomerForecastLoad-624.xlsx dataset, spanning from January 1998 to December 2013. This dataset contains a KWH variable, representing power consumption in kilowatt-hours. The objective is to create a monthly forecast for the year 2014 based on historical patterns.

  • Objective: To build a predictive model for residential power usage, focusing on seasonal and trend components, and validate its effectiveness.

  • Approach: Using various time series techniques, I will analyze historical trends in power consumption, produce forecasts, and compare methods to ensure accurate results. Findings will be documented, accompanied by visualizations and code.

1. Data Import and Initial Exploration

# Import dataset
power_data <- read_excel("ResidentialCustomerForecastLoad-624.xlsx")
summary(power_data)
##   CaseSequence     YYYY-MMM              KWH          
##  Min.   :733.0   Length:192         Min.   :  770523  
##  1st Qu.:780.8   Class :character   1st Qu.: 5429912  
##  Median :828.5   Mode  :character   Median : 6283324  
##  Mean   :828.5                      Mean   : 6502475  
##  3rd Qu.:876.2                      3rd Qu.: 7620524  
##  Max.   :924.0                      Max.   :10655730  
##                                     NA's   :1
#power_data<-power_data%>% filter(KWH > 770523 & KWH <9000000 )

# Histogram of monthly power usage (KWH)
ggplot(power_data, aes(x = KWH)) + 
  geom_histogram(bins = 50, fill = 'gray', color = "black") + 
  theme_bw() + 
  labs(title = "Distribution of Monthly Power Usage (KWH)",
       x = "KWH",
       y = "Frequency")  +
  theme(
    legend.position = "none",
    axis.title.x = element_text(size = 12, face = 'bold'),
    axis.title.y = element_text(size = 12, face = 'bold'),
    plot.title = element_text(hjust = 0.5, size = 12, face = 'bold'),
    axis.text.y = element_text(size = 9)
  )

# Create a single box plot for the distribution of KWH values
ggplot(power_data, aes(y = KWH)) +
  geom_boxplot() +
  labs(
    title = "Box Plot of KWH Values",
    x = "",
    y = "KWH"
  ) +
  theme_minimal()  +
  theme(
    legend.position = "none",
    axis.title.x = element_text(size = 12, face = 'bold'),
    axis.title.y = element_text(size = 12, face = 'bold'),
    plot.title = element_text(hjust = 0.5, size = 12, face = 'bold'),
    axis.text.y = element_text(size = 9)
  )

# Extract the year from 'YYYY-MMM' and add it as a new column
power_data <- power_data %>%
  mutate(Year = substr(`YYYY-MMM`, 1, 4))

# Create a box plot for KWH by year
ggplot(power_data, aes(x = Year, y = KWH)) +
  geom_boxplot() +
  labs(
    title = "Box Plot of KWH by Year",
    x = "Year",
    y = "KWH"
  ) +
  theme_minimal() +
  theme(legend.position = "none",
    axis.title.x = element_text(size = 12, face = 'bold'),
    axis.title.y = element_text(size = 12, face = 'bold'),
    plot.title = element_text(hjust = 0.5, size = 12, face = 'bold'),
    axis.text.y = element_text(size = 8)
  )

# Extract the month from 'YYYY-MMM' and add it as a new column
power_data <- power_data %>%
  mutate(Month = substr(`YYYY-MMM`, 6, 8))

# Create a box plot for KWH by month
ggplot(power_data, aes(x = Month, y = KWH)) +
  geom_boxplot() +
  labs(
    title = "Box Plot of KWH by Month",
    x = "Month",
    y = "KWH"
  ) +
  theme_minimal()  +
  theme(
    legend.position = "none",
    axis.title.x = element_text(size = 12, face = 'bold'),
    axis.title.y = element_text(size = 12, face = 'bold'),
    plot.title = element_text(hjust = 0.5, size = 12, face = 'bold'),
    axis.text.y = element_text(size = 9)
  )

1.a. Initial Insights

  1. Presence of an Extreme Outlier:
    • The histogram of KWH values highlights the presence of an extreme low-value outlier in the dataset.
    • This observation is reinforced by the overall box plot of the KWH variable, which isolates this value as the only notable outlier.
  2. Identification of the Outlier:
    • The annual box plot shows that this outlier was recorded in 2010.
    • The monthly box plot further pinpoints this event to July 2010, providing a clear temporal reference for the anomaly.
  3. Missing Value:
    • The summary statistics for the dataset reveal one missing value in the KWH variable, which will need to be addressed to ensure a complete and accurate analysis.
  4. Date Column Format:
    • The YYYY-MMM column, which represents the date, is currently stored as character type (Class: character, Mode: character).
    • For time series purposes, particularly when using the tsibble framework, this column needs to be converted to a proper Date format to allow accurate indexing and time-based operations.

1.b. Implications

  • The extreme low-value outlier could skew any time series analysis or forecasting efforts. It will need to be addressed appropriately to ensure robust modeling.
  • The missing value could disrupt the continuity of the time series, emphasizing the need for careful imputation or handling.
  • The improper format of the date column may lead to errors or inefficiencies in time series processing and needs to be resolved in the cleaning phase.

1.c. Data Cleaning Steps

  1. Handling Missing Values:
    • The overall box plot and histogram of the KWH variable indicate a skewed distribution, with most values concentrated on the left side. Although the mean (6,502,475) and median (6,283,324) are relatively close, the slight skewness suggests that using the median is a robust approach for imputing the missing value. The median is less affected by potential extreme values and better represents the central tendency in such distributions.
  2. Treating the Outlier:
    • The extreme low-value outlier in July 2010 accounts for only 1% of the total power for all July observations, whereas the average contribution of other July values is approximately 7% of the total sum of power for July.
    • Given the overall distribution and context, it is reasonable to assume this value is an error. We will replace it with the mean of the other July observations (which corresponds to the typical ~7% share of total July power), so the corrected value aligns with the usual July pattern and maintains consistency in the dataset.
  3. Date Column Transformation:
    • The YYYY-MMM column, currently stored as character, needs to be converted into a proper Date format for time series analysis.
    • To achieve this:
      • We will parse the year and month from the YYYY-MMM string.
      • Create a valid month index using tsibble::yearmonth() (or equivalent lubridate parsing functions).
      • This transformation will enable accurate indexing and facilitate seamless integration with the tsibble framework for time series modeling.

2. Data Cleaning

This section focuses on preparing the power dataset for time series analysis by handling missing values, treating outliers, and transforming the date column. These steps are crucial to ensure the data’s integrity and compatibility with time series modeling frameworks such as tsibble.

2.a. Handling Missing Values

The dataset contains a missing value in the KWH column. Given the slight skewness observed in the distribution, the median is chosen as the most robust imputation method. The median better represents the central tendency of the data and is less sensitive to extreme values.

# Impute missing values in the KWH column using the median
median_kwh <- median(power_data$KWH, na.rm = TRUE)
power_data <- power_data %>%
  mutate(KWH = ifelse(is.na(KWH), median_kwh, KWH))

# Verify the missing value is handled
sum(is.na(power_data$KWH))  # Should be 0
## [1] 0
  • The missing value in the KWH column was imputed using the median of the data. This choice ensures robustness to extreme values and preserves the integrity of the dataset.

2.b. Treating the Outlier

The outlier in July 2010 accounts for only 1% of the total power usage for all July observations, significantly deviating from the typical 7% average contribution across other July values in the KWH dataset. To address this discrepancy, the outlier will be replaced with the mean power usage of all other July observations, ensuring the adjusted value aligns with the overall trend and maintains consistency within the dataset.

# Calculate the mean July power excluding the outlier
mean_july_power <- power_data %>%
  filter(KWH != min(KWH) & Month == 'Jul') %>%
  summarize(mean_july = mean(KWH, na.rm = TRUE)) %>%
  pull(mean_july)  # Extract the numeric value

# Replace the outlier value for July 2010
power_data <- power_data %>%
  mutate(KWH = ifelse(`YYYY-MMM` == "2010-Jul" & KWH == min(KWH),
                      mean_july_power, KWH))

# Verify the adjustment
#power_data %>%
 # filter(Month == "Jul")
  • The extreme low outlier in July 2010 was adjusted to align with the expected contribution of 7% to the total July power. This correction ensures consistency with historical trends and mitigates the risk of skewing the analysis.

2.c. Date Column Transformation

The YYYY-MMM column, stored as character, needs to be converted to a Date format for time series analysis. This step ensures compatibility with the tsibble framework and enables efficient time-based operations.

# Convert the YYYY-MMM column to a proper Date format
power_data <- power_data %>%
  mutate(MonthYear = yearmonth(`YYYY-MMM`))  # Use yearmonth for year and month granularity

# Verify the transformation
str(power_data$MonthYear)
##  mth [1:192] 1998 Jan, 1998 Feb, 1998 Mar, 1998 Apr, 1998 May, 1998 Jun, 19...
#summary(power_data)
  • The YYYY-MMM column was successfully transformed into a proper Date object, enabling time-based indexing and compatibility with the tsibble framework.

3. Exploratory Data Analysis (EDA)

Here, the transformed data is converted to a tsibble with MonthYear as the time index. This step ensures that the time series structure is consistent, complete, and ordered. The objective is to perform an exploratory analysis of the dataset to identify trends, seasonality, and potential patterns that could inform the selection of appropriate forecasting models. To do so, several visualizations and a decomposition technique are applied below:

  • Time Plot: Displays the overall trend and variability in power usage over time.

  • Seasonal Plot: Highlights recurring patterns across months within each year.

  • STL Decomposition: Separates the data into trend, seasonal, and remainder components.

# Convert power_data to a tsibble and fill gaps
power_tsibble <- power_data %>%
  select(MonthYear, KWH) %>%
  arrange(MonthYear) %>%
  as_tsibble(index = MonthYear)

# Time plot to observe overall trends
autoplot(power_tsibble, KWH) +
  labs(
    title = "Time Plot of Power Usage (KWH)",
    x = "Date",
    y = "KWH"
  ) +
  theme_minimal()

# Seasonal plot to uncover recurring patterns
gg_season(power_tsibble, KWH, period = "year") +
  labs(
    title = "Seasonal Plot of Power Usage (KWH)",
    x = "Month",
    y = "KWH"
  )

# Decomposition to separate components
decomp <- power_tsibble %>%
  model(STL(KWH ~ season(window = "periodic"))) %>%
  components()

autoplot(decomp) +
  labs(title = "STL Decomposition of Power Usage")

3.a. Time Plot

The time plot illustrates the overall behavior of power usage over the years:

  • Seasonal Patterns: Power usage peaks in August and December, likely corresponding to increased cooling demand in the summer (August) and heating or holiday-related consumption in the winter (December). There are noticeable troughs in May and November, reflecting lower usage during these transitional months.
  • Year-to-Year Variation: While the seasonal pattern remains consistent, some years (e.g., 2013) exhibit higher overall power usage compared to earlier years, suggesting growth in demand over time.

3.b. Seasonal Plot

The seasonal plot provides a closer look at how power usage varies across months in different years:

  • Seasonal Patterns: There are consistent peaks in power usage during the winter and summer months (likely driven by heating and cooling demand), with noticeable troughs during the transitional spring and fall months.
  • Year-to-Year Variation: While the seasonal pattern is consistent, some years exhibit higher overall power usage compared to others.

3.c. STL Decomposition

The STL decomposition breaks the time series into its components, allowing a detailed examination of the underlying trends and patterns:

  • Trend Component: Shows a steady increase in power usage over the years, reflecting long-term growth in demand.
  • Seasonal Component: Demonstrates regular, repeating seasonal variations that align with the patterns observed in the seasonal plot.
  • Remainder Component: Represents the irregular, unexplained variations in power usage, which may result from external factors or anomalies.
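The strength of these components can also be quantified (a sketch using the feat_stl feature set from the feasts package; trend_strength and seasonal_strength_year both range from 0 to 1, with values near 1 indicating strong components):

# STL-based strength measures for the trend and annual seasonality
power_tsibble %>%
  features(KWH, feat_stl)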

4. Model Selection for Forecasting

Based on the insights obtained from the exploratory analysis, including the time plot, seasonal plot, and STL decomposition, the following forecasting models will be selected:

4.a. ARIMA (AutoRegressive Integrated Moving Average)

  • The upward trend in the time series and the presence of seasonal patterns (peaks in August and December, troughs in May and November) make the ARIMA model a suitable choice.
  • ARIMA can incorporate trend and seasonality through differencing and seasonal terms, which align well with the observed data characteristics.
  • The STL decomposition results show periodic, predictable seasonality, and ARIMA’s seasonal extension effectively handles such patterns.
# Automatically fit an ARIMA model with seasonal adjustments
arima_model <- power_tsibble %>%
  model(ARIMA = ARIMA(KWH ~ season()))

4.b. Exponential Smoothing State Space Model (ETS)

  • The ETS model (Error, Trend, Seasonal) is effective for capturing time series with a combination of trend and seasonality.
  • It can handle the smooth growth trend observed in the decomposition’s trend component, as well as the consistent seasonal peaks and troughs.
  • ETS works well for time series where patterns are expected to persist in the future without abrupt structural changes.
# Fit the ETS model
ets_model <- power_tsibble %>%
  model(ETS = ETS(KWH))

4.c. Box-Cox Transformation

  • A Box-Cox Transformation is well-suited for stabilizing variance and addressing potential non-linear patterns in the data.
  • Given the presence of seasonality and an upward trend, the transformation can be applied as a preprocessing step to improve the performance of other forecasting models (e.g., ARIMA or ETS).
  • The Box-Cox Transformation ensures that the series conforms to the assumptions of linearity and stationarity, enhancing the accuracy of the forecasts.
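For reference, the one-parameter Box-Cox transformation applied here to the strictly positive KWH series, and the inverse used to back-transform forecasts, are (standard definitions):

$$
w_t =
\begin{cases}
\dfrac{y_t^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\
\log(y_t), & \lambda = 0
\end{cases}
\qquad
y_t =
\begin{cases}
(\lambda w_t + 1)^{1/\lambda}, & \lambda \neq 0 \\
\exp(w_t), & \lambda = 0
\end{cases}
$$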
# Apply Box-Cox Transformation
lambda <- BoxCox.lambda(power_tsibble$KWH) # Find the optimal lambda
transformed_data <- power_tsibble %>%
  mutate(KWH_transformed = BoxCox(KWH, lambda)) # Apply the transformation

# To Ensure transformed_data is in a tsibble format
transformed_data <- transformed_data %>% 
  as_tsibble(index = MonthYear) # Ensure `MonthYear` is the index column

# Time series plot of the transformed data
autoplot(transformed_data, KWH_transformed) +
  labs(title = "Box-Cox Transformed Power Usage", y = "Transformed KWH")

# Check for missing values
if (any(is.na(transformed_data$KWH_transformed))) {
  stop("Missing values detected in the transformed data. Please address them before proceeding.")
}

# Fit ARIMA Model on Transformed Data
arima_model_tr <- transformed_data %>%
  model(ARIMA = ARIMA(KWH_transformed)) # Auto ARIMA, no hardcoding

# Fit ETS Model on Transformed Data
ets_model_tr <- transformed_data %>%
  model(ETS = ETS(KWH_transformed)) # Auto ETS, no hardcoding

5. Model Evaluation using Time Series Cross-Validation

Time series cross-validation splits the data into multiple training and testing sets, evaluating model performance at each split. For this dataset, a rolling-origin strategy will be used:

We will:

  1. Split the data:
  • Use data up to December 2011 (the first 168 months) for initial training.

  • Incrementally expand the training set, leaving one-step or multi-step forecasts for validation.

  2. Measure forecast accuracy using metrics such as RMSE, MAE, and MAPE.
suppressWarnings({
# Create rolling time series cross-validation dataset
cv <- power_tsibble %>%
  stretch_tsibble(.init = 168, .step = 12) # Initial 168 months (Jan 1998 - Dec 2011)

# Evaluate ARIMA model with cross-validation
cv_arima <- cv %>%
  model(ARIMA = ARIMA(KWH ~ season())) %>%
  forecast(h = 12) %>%
  accuracy(data = power_tsibble) # Compare against actual data

# Evaluate ETS model with cross-validation
cv_ets <- cv %>%
  model(ETS = ETS(KWH)) %>%
  forecast(h = 12) %>%
  accuracy(data = power_tsibble) # Compare against actual data

# Evaluate Box-Cox ARIMA model with cross-validation
cv_arima_tr <- cv %>%
  mutate(KWH_transformed = BoxCox(KWH, lambda)) %>% # Include transformed data in cross-validation
  model(ARIMA = ARIMA(KWH_transformed)) %>%
  forecast(h = 12) %>%
  accuracy(data = power_tsibble %>%
             mutate(KWH_transformed = BoxCox(KWH, lambda))) # Compare transformed forecasts to transformed actuals

# Evaluate Box-Cox ETS model with cross-validation
cv_ets_tr <- cv %>%
  mutate(KWH_transformed = BoxCox(KWH, lambda)) %>% # Include transformed data in cross-validation
  model(ETS = ETS(KWH_transformed)) %>%
  forecast(h = 12) %>%
  accuracy(data = power_tsibble %>%
             mutate(KWH_transformed = BoxCox(KWH, lambda))) # Compare transformed forecasts to transformed actuals

# Combine results for comparison
cv_results <- bind_rows(
  cv_arima %>% mutate(Model = "ARIMA"),
  cv_ets %>% mutate(Model = "ETS"),
  cv_arima_tr %>% mutate(Model = "Box-Cox ARIMA"),
  cv_ets_tr %>% mutate(Model = "Box-Cox ETS")
)

# Display cross-validation results
cv_results %>%
  select(Model, RMSE, MAE, MAPE) %>%
  arrange(RMSE)
})
## # A tibble: 4 × 4
##   Model                 RMSE          MAE   MAPE
##   <chr>                <dbl>        <dbl>  <dbl>
## 1 Box-Cox ARIMA      0.00269      0.00214 0.0498
## 2 Box-Cox ETS        0.00340      0.00251 0.0584
## 3 ARIMA         851068.      551037.      7.23  
## 4 ETS           882272.      579381.      7.67
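As a sanity check on the rolling-origin setup, the fold structure produced by stretch_tsibble() can be inspected directly (a sketch; .id labels each fold):

# With .init = 168 and .step = 12 on 192 monthly observations, the folds
# contain 168, 180, and 192 training months; the last fold has no hold-out
# data, which is likely why the block above is wrapped in suppressWarnings().
cv %>%
  as_tibble() %>%
  count(.id, name = "training_months")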

Discussion on Cross-Validation Results

The results from the cross-validation process provide valuable insights into the performance of the four models: Box-Cox ARIMA, Box-Cox ETS, ARIMA, and ETS. The metrics considered are RMSE (Root Mean Square Error), MAE (Mean Absolute Error), and MAPE (Mean Absolute Percentage Error), which assess the accuracy of the models during the validation process.

5.a. Box-Cox ARIMA

  • RMSE: 0.0027 (lowest among all models)
  • MAE: 0.0021 (lowest among all models)
  • MAPE: 0.0498 (lowest among all models)
  • Analysis:
    • The Box-Cox ARIMA model performs exceptionally well, achieving the lowest error metrics. This indicates that the combination of the Box-Cox transformation with ARIMA effectively stabilizes the variance and captures both trend and seasonality in the data.
    • The low MAPE shows the model’s robustness in providing accurate percentage forecasts relative to the true values.

5.b. Box-Cox ETS

  • RMSE: 0.0034
  • MAE: 0.0025
  • MAPE: 0.0584
  • Analysis:
    • The Box-Cox ETS model also performs well, though slightly worse than the Box-Cox ARIMA model.
    • The higher RMSE and MAE compared to Box-Cox ARIMA indicate that the ETS model struggles slightly with capturing the data’s dynamics despite the variance stabilization provided by the Box-Cox transformation.

5.c. ARIMA (without Box-Cox Transformation)

  • RMSE: 851067.57
  • MAE: 551037.25
  • MAPE: 7.23
  • Analysis:
    • The ARIMA model without the Box-Cox transformation shows significantly higher error metrics compared to the transformed models. This suggests that the original dataset’s variance and non-linearity negatively impact ARIMA’s forecasting performance.

5.d. ETS (without Box-Cox Transformation)

  • RMSE: 882272.00
  • MAE: 579381.30
  • MAPE: 7.67
  • Analysis:
    • Similar to the untransformed ARIMA model, the ETS model without the Box-Cox transformation performs poorly, with the highest RMSE and MAE values.
    • This model is less effective in capturing the variance and seasonality, highlighting the importance of preprocessing steps like variance stabilization.

5.e. Conclusion from Cross-Validation

  1. Box-Cox ARIMA is the best-performing model, with the lowest RMSE, MAE, and MAPE. It is the most suitable choice for forecasting residential power usage in 2014.
  2. Box-Cox ETS is the second-best model, offering decent accuracy but falling short compared to Box-Cox ARIMA.
  3. The untransformed models (ARIMA and ETS) perform significantly worse and are not recommended for forecasting.

Using the Box-Cox ARIMA model, we will proceed to forecast the power usage for the year 2014.

6. Forecasting Power Usage for 2014 with Box-Cox ARIMA Model

# Fit the Box-Cox ARIMA model
lambda <- BoxCox.lambda(power_tsibble$KWH)
power_transformed <- power_tsibble %>%
  mutate(KWH_transformed = BoxCox(KWH, lambda))

box_cox_arima <- power_transformed %>%
  model(BoxCox_ARIMA = ARIMA(KWH_transformed))

# Forecast for 2014
forecast_2014 <- box_cox_arima %>%
  forecast(h = "12 months") %>%
  mutate(Forecast = InvBoxCox(.mean, lambda)) # Back-transform forecasts

# Export forecast results to Excel
forecast_2014_export <- forecast_2014 %>%
  as_tibble() %>%
  select(MonthYear, Forecast)

# Write to Excel file
write.xlsx(forecast_2014_export, file = "ResidentialPowerForecast_2014.xlsx", overwrite = TRUE)

# Display a message confirming the file export
cat("The forecast for 2014 has been successfully saved as 'ResidentialPowerForecast_2014.xlsx'.")
## The forecast for 2014 has been successfully saved as 'ResidentialPowerForecast_2014.xlsx'.

6.a. Plotting the forecast results (2014)

autoplot(forecast_2014, power_transformed) +
  labs(title = "Forecast for Residential Power Usage (2014)",
       y = "Power Usage (KWH)",
       x = "Month") +
  theme_minimal()  +
  theme(
    axis.title.x = element_text(size = 12, face = 'bold'),
    axis.title.y = element_text(size = 12, face = 'bold'),
    plot.title = element_text(hjust = 0.5, size = 12, face = 'bold'),
    axis.text.x = element_text(size = 10),
    axis.text.y = element_text(size = 10)
  )

6.b. Discussion of 2014 Forecast for Residential Power Usage

  1. Seasonal Patterns
  • The forecast for 2014 captures the consistent seasonal fluctuations present in the historical data.
  • These fluctuations are evident in the peaks during summer months, which can be associated with increased energy demand for cooling, and troughs during spring and fall months, reflecting lower overall power usage.
  2. Trend
  • The forecast exhibits a continuation of the upward long-term trend observed in the historical data.
  • This suggests the model successfully captured the underlying trend in residential power usage over the years.
  3. Uncertainty
  • The widening of the prediction intervals as we move further into the forecast period highlights the uncertainty in predictions.
  • The 80% and 95% prediction intervals demonstrate the range within which the actual power usage values are expected to fall, with higher certainty closer to the start of 2014.
  4. Model Behavior
  • The Box-Cox ARIMA model performed well in stabilizing the variance through the Box-Cox transformation and capturing both the trend and seasonality in the data.
  • However, as with any forecasting model, its ability to predict depends heavily on the assumption that future patterns remain consistent with historical ones.