Introduction

This project applies time series forecasting techniques to two datasets:

  • Part A: Forecasting cash withdrawals for 4 ATM machines for May 2010
  • Part B: Forecasting monthly residential power consumption for 2014

The methods used include time series decomposition, exponential smoothing, and ARIMA models.

# Load required libraries
library(tidyverse)
library(readxl)
library(writexl)
library(fpp3)

Part A: ATM Cash Withdrawal Forecasting

1. Data Loading and Exploration

# Load the ATM data
atm_data <- read_excel("ATM624Data.xlsx")

# Convert DATE to proper Date format based on Excel file
atm_data <- atm_data %>%
  mutate(DATE = as.Date(DATE, origin = "1899-12-30"))

# Display structure
str(atm_data)
## tibble [1,474 × 3] (S3: tbl_df/tbl/data.frame)
##  $ DATE: Date[1:1474], format: "2009-05-01" "2009-05-01" ...
##  $ ATM : chr [1:1474] "ATM1" "ATM2" "ATM1" "ATM2" ...
##  $ Cash: num [1:1474] 96 107 82 89 85 90 90 55 99 79 ...
# Preview data
head(atm_data, 10)
## # A tibble: 10 × 3
##    DATE       ATM    Cash
##    <date>     <chr> <dbl>
##  1 2009-05-01 ATM1     96
##  2 2009-05-01 ATM2    107
##  3 2009-05-02 ATM1     82
##  4 2009-05-02 ATM2     89
##  5 2009-05-03 ATM1     85
##  6 2009-05-03 ATM2     90
##  7 2009-05-04 ATM1     90
##  8 2009-05-04 ATM2     55
##  9 2009-05-05 ATM1     99
## 10 2009-05-05 ATM2     79
tail(atm_data, 10)
## # A tibble: 10 × 3
##    DATE       ATM     Cash
##    <date>     <chr>  <dbl>
##  1 2010-04-21 ATM4  557.  
##  2 2010-04-22 ATM4  386.  
##  3 2010-04-23 ATM4  165.  
##  4 2010-04-24 ATM4    5.45
##  5 2010-04-25 ATM4  542.  
##  6 2010-04-26 ATM4  404.  
##  7 2010-04-27 ATM4   13.7 
##  8 2010-04-28 ATM4  348.  
##  9 2010-04-29 ATM4   44.2 
## 10 2010-04-30 ATM4  482.
# Summary
summary(atm_data)
##       DATE                ATM                 Cash        
##  Min.   :2009-05-01   Length:1474        Min.   :    0.0  
##  1st Qu.:2009-08-01   Class :character   1st Qu.:    0.5  
##  Median :2009-11-01   Mode  :character   Median :   73.0  
##  Mean   :2009-10-31                      Mean   :  155.6  
##  3rd Qu.:2010-02-01                      3rd Qu.:  114.0  
##  Max.   :2010-05-14                      Max.   :10919.8  
##                                          NA's   :19

Missing Values Analysis

# Check for missing values
cat("Missing values by column:\n")
## Missing values by column:
colSums(is.na(atm_data))
## DATE  ATM Cash 
##    0   14   19
# Missing values by ATM
atm_data %>%
  group_by(ATM) %>%
  summarise(
    Total_Records = n(),
    Missing_Cash = sum(is.na(Cash)),
    Pct_Missing = round(100 * Missing_Cash / Total_Records, 2)
  ) %>%
  knitr::kable(caption = "Missing Data Summary by ATM")
Missing Data Summary by ATM
ATM Total_Records Missing_Cash Pct_Missing
ATM1 365 3 0.82
ATM2 365 2 0.55
ATM3 365 0 0.00
ATM4 365 0 0.00
NA 14 14 100.00

2. Data Cleaning

I’ll handle missing values using linear interpolation. For ATM data, this is appropriate because withdrawal patterns are relatively stable over short periods. With only 1% of values missing and gaps appearing to be isolated individual days, interpolating between adjacent observations provides reasonable estimates without introducing significant bias into the analysis.

# Remove rows with missing ATM identifiers
atm_clean <- atm_data %>%
  filter(!is.na(ATM)) %>%
  mutate(DATE = as.Date(DATE))

# For each ATM, fill in missing cash values using linear interpolation
# Using tidyr::fill and manual interpolation
atm_clean <- atm_clean %>%
  group_by(ATM) %>%
  arrange(DATE) %>%
  mutate(
    # Mark positions of non-NA values
    Cash = approx(x = seq_along(Cash), y = Cash, xout = seq_along(Cash), 
                  method = "linear", rule = 2)$y
  ) %>%
  ungroup()

# Check date coverage by ATM
atm_clean %>%
  group_by(ATM) %>%
  summarise(
    Start_Date = format(as.Date(min(DATE)), "%Y-%m-%d"),
    End_Date = format(as.Date(max(DATE)), "%Y-%m-%d"),
    N_Days = n(),
    Mean_Cash = round(mean(Cash, na.rm = TRUE), 2),
    SD_Cash = round(sd(Cash, na.rm = TRUE), 2)
  ) %>%
  knitr::kable(caption = "Date Coverage and Summary Statistics by ATM")
Date Coverage and Summary Statistics by ATM
ATM Start_Date End_Date N_Days Mean_Cash SD_Cash
ATM1 2009-05-01 2010-04-30 365 84.15 36.64
ATM2 2009-05-01 2010-04-30 365 62.59 38.81
ATM3 2009-05-01 2010-04-30 365 0.72 7.94
ATM4 2009-05-01 2010-04-30 365 474.04 650.94

3. Time Series Visualization

# Time series plot for all ATMs
ggplot(atm_clean, aes(x = DATE, y = Cash, color = ATM)) +
  geom_line(alpha = 0.7) +
  facet_wrap(~ATM, ncol = 2, scales = "free_y") +
  labs(
    title = "ATM Cash Withdrawals Over Time",
    subtitle = "May 2009 - May 2010",
    x = "Date",
    y = "Cash (hundreds of dollars)"
  ) +
  theme_minimal() +
  theme(
    legend.position = "none",
    plot.title = element_text(hjust = 0.5),
    plot.subtitle = element_text(hjust = 0.5)
  )

# Box plots by day of week
atm_clean %>%
  mutate(DayOfWeek = wday(DATE, label = TRUE)) %>%
  ggplot(aes(x = DayOfWeek, y = Cash, fill = ATM)) +
  geom_boxplot(alpha = 0.7) +
  facet_wrap(~ATM, scales = "free_y") +
  labs(
    title = "Cash Withdrawal Patterns by Day of Week",
    x = "Day of Week",
    y = "Cash (hundreds of dollars)"
  ) +
  theme_minimal() +
  theme(
    legend.position = "none",
    plot.title = element_text(hjust = 0.5)
  )

Looking at the time series plots, ATM1 and ATM2 behave like one might expect from a typical consumer ATM. Withdrawals range from $0-$180 (hundreds of dollars) with predictable weekly patterns throughout the year. Interestingly, both show a consistent dip on Thursdays. Maybe more people get paid Monday or Friday, withdrawing shortly after that. The box plots confirm the weekly cycles are real, with activity picking up on weekends when people are out spending. This predictability makes them perfect candidates for ARIMA forecasting.

ATM3 and ATM4 are a different story altogether. ATM3 is basically flat at zero for nearly the entire year until one random spike around April/May 2010. Something seems off here—the machine might be broken, or it’s in an inaccessible location. ATM4, on the other hand, is operating in a whole different zone We’re talking withdrawals up to $10,000+ (hundreds of dollars)—about 100 times what the consumer ATMs see. There’s no weekly pattern here; instead, we see irregular spikes that suggest “business transactions” rather than everyday banking. This machine is likely serving commercial clients rather than everyday consumers.

4. Time Series Decomposition

I’ll decompose the time series for each ATM to understand the trend, seasonal, and irregular components.

# Convert to tsibble format 
atm_tsibble <- atm_clean %>%
  as_tsibble(key = ATM, index = DATE)

# STL decomposition for each ATM
atm_decomp <- atm_tsibble %>%
  model(
    stl = STL(Cash ~ season(window = 7))
  ) %>%
  components()

# Plot decomposition for all ATMs
atm_decomp %>%
  autoplot() +
  labs(title = "STL Decomposition by ATM") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5))

# Individual decomposition plots
for (atm_name in unique(atm_clean$ATM)) {
  p <- atm_decomp %>%
    filter(ATM == atm_name) %>%
    autoplot() +
    labs(title = paste("STL Decomposition -", atm_name)) +
    theme_minimal() +
    theme(plot.title = element_text(hjust = 0.5))
  print(p)
}

The STL decomposition breaks down each ATM’s cash withdrawal pattern into three components: trend, weekly seasonality, and remainder (irregular fluctuations). This analysis helps us understand what’s really driving the patterns we see in the raw data.

For ATM1 and ATM2, the decomposition reveals some interesting dynamics. Both machines show a gradually declining trend over the year, with ATM1 dropping from around 100 to about 75 (hundreds of dollars) and ATM2 following a similar pattern from roughly 80 down to 60. This suggests these locations might be experiencing reduced foot traffic or changing customer behavior over time. The seasonal component clearly shows the weekly cycle we noticed earlier—those regular peaks and valleys that repeat every seven days. What’s particularly useful here is seeing how consistent the weekly pattern is throughout the entire year. The remainder component (the random noise) is relatively small for both machines, which is exactly what we want to see. It means most of the variation in the data can be explained by the trend and weekly patterns, making these specific ATMs highly predictable for forecasting purposes.

ATM3’s decomposition tells the story of a machine that’s been sitting idle, like we saw in the earlier visualization. The trend line is flat at near-zero for the entire period until that sudden jump at the end. There’s no meaningful seasonal pattern because there’s no activity to create one. The remainder component only shows action at the very end when the machine finally comes to life. This decomposition confirms what we suspected—this machine wasn’t operational for most of the observation period.

ATM4’s trend component sits around 500-700 (hundreds of dollars) for most of the year, then shows a huge spike in late March 2010 where it jumps to over 1,500. The seasonal component is relatively small compared to the overall scale of transactions. This makes sense, given that this machine’s activity seems to be driven by business needs rather than weekly consumer cycles. The remainder component stands out for ATM4—it’s highly variable, showing irregular large withdrawals. The fact that the irregular component is so much larger than the seasonal component tells us that trying to forecast this ATM based on patterns alone will be challenging. We’d probably get better results if we had additional information about the type of business clients using this machine. If we knew data such as scheduled payroll dates and known large transaction requests, it might inform better forecasting.

5. Model Building and Forecasting

I’ll apply both Exponential Smoothing (ETS) and ARIMA models to each ATM using the fable framework.

5.1 ATM1 Forecasting

# Prepare ATM1 data
atm1_tsibble <- atm_tsibble %>%
  filter(ATM == "ATM1")

# Fit multiple models
atm1_fit <- atm1_tsibble %>%
  model(
    ets = ETS(Cash),
    arima = ARIMA(Cash)
  )

# Model summaries
cat("ATM1 - Model Details Below:\n\n")
## ATM1 - Model Details Below:
report(atm1_fit %>% select(ets))
## Series: Cash 
## Model: ETS(A,N,A) 
##   Smoothing parameters:
##     alpha = 0.02126635 
##     gamma = 0.3298864 
## 
##   Initial states:
##      l[0]      s[0]    s[-1]    s[-2]    s[-3]    s[-4]    s[-5]   s[-6]
##  77.02956 -55.27235 2.531658 16.15413 4.529656 11.40245 12.92571 7.72875
## 
##   sigma^2:  580.2586
## 
##      AIC     AICc      BIC 
## 4487.018 4487.639 4526.017
report(atm1_fit %>% select(arima))
## Series: Cash 
## Model: ARIMA(0,0,1)(0,1,2)[7] 
## 
## Coefficients:
##          ma1     sma1     sma2
##       0.1950  -0.5806  -0.1037
## s.e.  0.0546   0.0506   0.0494
## 
## sigma^2 estimated as 556.4:  log likelihood=-1640.01
## AIC=3288.03   AICc=3288.14   BIC=3303.55
# Generate forecasts (31 days for May 2010)
atm1_fc <- atm1_fit %>%
  forecast(h = 31)

# Check accuracy
cat("\nModel Accuracy Comparison:\n")
## 
## Model Accuracy Comparison:
accuracy(atm1_fit) %>%
  arrange(RMSE) %>%
  knitr::kable(digits = 3)
ATM .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
ATM1 arima Training -0.074 23.262 14.54 -102.259 117.148 0.822 0.838 -0.009
ATM1 ets Training -0.048 23.790 15.08 -106.835 121.865 0.852 0.857 0.144
# Plot forecasts
atm1_fc %>%
  autoplot(atm1_tsibble, level = NULL) +
  labs(
    title = "ATM1: 31-Day Forecast Comparison",
    x = "Date",
    y = "Cash (hundreds of dollars)"
  ) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5))

# Plot ARIMA forecast with intervals
atm1_fc %>%
  filter(.model == "arima") %>%
  autoplot(atm1_tsibble) +
  labs(
    title = "ATM1: May 2010 Forecast (ARIMA)",
    x = "Date",
    y = "Cash (hundreds of dollars)"
  ) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5))

# Residual diagnostics
atm1_fit %>%
  select(arima) %>%
  gg_tsresiduals()

ATM1 Model Interpretation

The ARIMA model slightly outperforms ETS for ATM1, with an RMSE of 23.262 compared to 23.790. The selected ARIMA(0,0,1)(0,1,2)[7] specification handles the weekly seasonality through seasonal moving average terms at a 7-day lag, while the seasonal differencing (the middle “1” in the seasonal part) accounts for the gradual declining trend we saw in the decomposition.

Looking at the forecast comparison plot, both models produce very similar predictions for May 2010, which makes sense given their comparable accuracy metrics. The ARIMA forecast maintains the weekly cyclical pattern we’ve observed throughout the historical data, with the confidence intervals widening as we move further into the forecast period—this is expected and reflects increasing uncertainty about future values. The forecast appears reasonable, continuing the patterns we’ve seen historically, with the typical weekly oscillations between low and high usage days.

The residual diagnostics look solid. The innovation residuals scatter randomly around zero without any obvious patterns, which is what we want to see. The ACF plot confirms this—nearly all the lags stay within those blue dashed lines, which means there’s no leftover correlation in the residuals. The histogram of residuals is roughly bell-shaped and centered at zero, suggesting the residuals are approximately normally distributed. A few larger outliers appear in the residual plot, but these are relatively rare and don’t seem to follow any systematic pattern. OOverall, the diagnostics tell us that we can be fairly confident in these May 2010 forecasts.

5.2 ATM2 Forecasting

# Prepare ATM2 data
atm2_tsibble <- atm_tsibble %>%
  filter(ATM == "ATM2")

# Fit models
atm2_fit <- atm2_tsibble %>%
  model(
    ets = ETS(Cash),
    arima = ARIMA(Cash)
  )

cat("ATM2 - Selected ARIMA Model Below:\n")
## ATM2 - Selected ARIMA Model Below:
report(atm2_fit %>% select(arima))
## Series: Cash 
## Model: ARIMA(2,0,2)(0,1,1)[7] 
## 
## Coefficients:
##           ar1      ar2     ma1     ma2     sma1
##       -0.4320  -0.9130  0.4773  0.8048  -0.7547
## s.e.   0.0553   0.0407  0.0861  0.0556   0.0381
## 
## sigma^2 estimated as 602.5:  log likelihood=-1653.67
## AIC=3319.33   AICc=3319.57   BIC=3342.61
# Generate forecast
atm2_fc <- atm2_fit %>%
  forecast(h = 31)

# Plot ARIMA forecast
atm2_fc %>%
  filter(.model == "arima") %>%
  autoplot(atm2_tsibble) +
  labs(
    title = "ATM2: May 2010 Forecast (ARIMA)",
    x = "Date",
    y = "Cash (hundreds of dollars)"
  ) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5))

# Residual diagnostics
atm2_fit %>%
  select(arima) %>%
  gg_tsresiduals()

ATM2 Model Interpretation

The ARIMA model for ATM2 handles the weekly patterns and declining trend we saw in the decomposition, similar to what we did with ATM1. The automatic selection process picked a model that best fits ATM2’s particular usage patterns.

The May 2010 forecast continues the weekly up-and-down pattern we’ve seen throughout the year. The forecast looks reasonable and maintains the typical range of variation we’ve observed in the historical data.

The residuals look good—they scatter around zero without any obvious patterns, which tells us the model captured what it needed to from the data. The ACF plot backs this up, with nearly all the lags staying within those bounds. The histogram shows the residuals are roughly normal and centered at zero. There are a few bigger outliers scattered throughout, but they don’t seem to follow any particular pattern or cluster in one time period. Overall, the diagnostics give us good reason to trust these May forecasts.

5.3 ATM3 Forecasting

atm3_tsibble <- atm_tsibble %>%
  filter(ATM == "ATM3")

atm3_fit <- atm3_tsibble %>%
  model(
    arima = ARIMA(Cash)
  )

cat("ATM3 - Selected ARIMA Model Below:\n")
## ATM3 - Selected ARIMA Model Below:
report(atm3_fit)
## Series: Cash 
## Model: ARIMA(0,0,2) 
## 
## Coefficients:
##          ma1     ma2
##       0.8392  0.8557
## s.e.  0.0496  0.0611
## 
## sigma^2 estimated as 25.4:  log likelihood=-1108.69
## AIC=2223.39   AICc=2223.46   BIC=2235.09
atm3_fc <- atm3_fit %>%
  forecast(h = 31)

atm3_fc %>%
  autoplot(atm3_tsibble) +
  labs(
    title = "ATM3: May 2010 Forecast (ARIMA)",
    x = "Date",
    y = "Cash (hundreds of dollars)"
  ) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5))

atm3_fit %>%
  gg_tsresiduals()

ATM3 presents a difficult forecasting situation. With nearly 11 months of zero activity followed by a brief spike in late April, there’s not enough meaningful pattern for the model to learn from. The data is essentially a flat line at zero with one outlier, which makes it challenging to identify any reliable trends or seasonal patterns.

The May 2010 forecast predicts near-zero values, continuing the baseline that dominated the historical period. While this makes sense given what the model saw, it raises real questions about usefulness. If the machine was recently repaired and brought online (as that late April spike suggests), the historical data won’t reflect its future behavior at all. We’d be predicting based on a broken machine’s performance. On the other hand, if it’s still non-functional, we’re forecasting something that won’t generate any cash flow anyway. Before putting any trust into these predictions, ATM3 needs operational investigation to determine if it will actually be functioning in May 2010.

5.4 ATM4 Forecasting

atm4_tsibble <- atm_tsibble %>%
  filter(ATM == "ATM4")

atm4_fit <- atm4_tsibble %>%
  model(
    arima = ARIMA(Cash)
  )

cat("ATM4 - Selected ARIMA Model:\n")
## ATM4 - Selected ARIMA Model:
report(atm4_fit)
## Series: Cash 
## Model: ARIMA(0,0,0) w/ mean 
## 
## Coefficients:
##       constant
##       474.0433
## s.e.   34.0248
## 
## sigma^2 estimated as 423718:  log likelihood=-2882.03
## AIC=5768.06   AICc=5768.1   BIC=5775.86
atm4_fc <- atm4_fit %>%
  forecast(h = 31)

atm4_fc %>%
  autoplot(atm4_tsibble) +
  labs(
    title = "ATM4: May 2010 Forecast (ARIMA)",
    x = "Date",
    y = "Cash (hundreds of dollars)"
  ) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5))

atm4_fit %>%
  gg_tsresiduals()

ATM4 is completely different from the other ATMs, as we discussed before, and the model selection reflects that. The automatic ARIMA process selected ARIMA(0,0,0) with mean, essentially just forecasting the average value of $474 (hundreds of dollars). This might seem overly simplistic, but it makes sense given ATM4’s behavior. With huge irregular spikes reaching $10,000+ and no consistent pattern, there’s no predictable structure for a more complex model to capture.

The May 2010 forecast is a flat line at the historical mean of around $474. Given the extreme variability in the transaction history (re: those huge spikes we saw earlier) this simple average-based forecast isn’t particularly useful for cash management planning. The residual diagnostics show that one spike dominates everything, with the histogram heavily concentrated near zero but with extreme outliers. For ATM4, traditional time series forecasting has limited value—better forecasts would require business intelligence about planned large withdrawals from the commercial clients using this machine.

6. Export Part A Forecasts

# Combine all forecasts using ARIMA model
atm_all_forecasts <- bind_rows(
  atm1_fc %>% filter(.model == "arima") %>% as_tibble() %>% mutate(ATM = "ATM1"),
  atm2_fc %>% filter(.model == "arima") %>% as_tibble() %>% mutate(ATM = "ATM2"),
  atm3_fc %>% as_tibble() %>% mutate(ATM = "ATM3"),
  atm4_fc %>% as_tibble() %>% mutate(ATM = "ATM4")
)

# Filter for May 2010 only
atm_forecasts <- atm_all_forecasts %>%
  filter(month(DATE) == 5, year(DATE) == 2010) %>%
  select(DATE, ATM, Cash = .mean) %>%
  mutate(Cash = round(Cash, 2)) %>%
  arrange(ATM, DATE)

# Display forecast summary
atm_forecasts %>%
  group_by(ATM) %>%
  summarise(
    N_Days = n(),
    Mean_Forecast = round(mean(Cash), 2),
    Min_Forecast = round(min(Cash), 2),
    Max_Forecast = round(max(Cash), 2)
  ) %>%
  knitr::kable(caption = "May 2010 Forecast Summary by ATM")
May 2010 Forecast Summary by ATM
ATM N_Days Mean_Forecast Min_Forecast Max_Forecast
ATM1 31 78.27 8.03 104.06
ATM2 31 57.71 5.29 97.66
ATM3 31 0.13 0.00 2.61
ATM4 31 474.04 474.04 474.04
# Export to Excel
write_xlsx(atm_forecasts, "ATM_Forecast_May2010.xlsx")
cat("\n✓ ATM forecasts exported to: ATM_Forecast_May2010.xlsx\n")
## 
## ✓ ATM forecasts exported to: ATM_Forecast_May2010.xlsx
# Display first few rows
head(atm_forecasts, 20) %>%
  knitr::kable(caption = "Sample of ATM Forecasts")
Sample of ATM Forecasts
DATE ATM Cash
2010-05-01 ATM1 87.17
2010-05-02 ATM1 104.06
2010-05-03 ATM1 73.15
2010-05-04 ATM1 8.03
2010-05-05 ATM1 100.01
2010-05-06 ATM1 80.86
2010-05-07 ATM1 86.67
2010-05-08 ATM1 87.96
2010-05-09 ATM1 102.75
2010-05-10 ATM1 73.39
2010-05-11 ATM1 8.59
2010-05-12 ATM1 100.65
2010-05-13 ATM1 80.39
2010-05-14 ATM1 86.93
2010-05-15 ATM1 88.03
2010-05-16 ATM1 102.75
2010-05-17 ATM1 73.39
2010-05-18 ATM1 8.59
2010-05-19 ATM1 100.65
2010-05-20 ATM1 80.39

7. Part A Discussion

Model Selection

I applied both Exponential Smoothing (ETS) and ARIMA models to each ATM, allowing automatic selection of optimal specifications. For ATM1 and ATM2, ARIMA models were selected as they provided better accuracy metrics (lower RMSE), captured weekly seasonality effectively, and passed residual diagnostic tests. These consumer-oriented ATMs showed predictable patterns that the models handled well.

ATM3 and ATM4 presented us with different challenges. ATM3 seemed unused for most of the observation period, with about 11 months of zero activity followed by a brief spike. This left limited meaningful data for forecasting. ATM4’s extreme variability and lack of consistent patterns led to selection of a simple ARIMA(0,0,0) with mean specification, basically forecasting the historical average.

Key Findings

ATM1 and ATM2 handle typical consumer banking with withdrawals in the $0-180 range (hundreds of dollars) and clear weekly patterns. Both show a consistent Thursday dip that probably reflects factors like reduced foot traffic or business closures. These predictable cycles make them well-suited for standard time series forecasting.

ATM4 operates on a completely different scale, with transactions reaching $10,000+ (hundreds of dollars)—roughly 100 times larger than the consumer ATMs. The irregular spikes suggest business-driven activity from commercial clients rather than everyday banking patterns. Traditional forecasting based on historical patterns has limited value for this ATM’s data; better predictions would need more data about commercial clients/business uses.

ATM3’s near-zero activity throughout the year suggests operational issues. This requires investigation before any forecasting can be meaningful.


Part B: Residential Power Consumption Forecasting

1. Data Loading and Exploration

# Load power data
power_data <- read_excel("ResidentialCustomerForecastLoad624.xlsx")

str(power_data)
## tibble [192 × 3] (S3: tbl_df/tbl/data.frame)
##  $ CaseSequence: num [1:192] 733 734 735 736 737 738 739 740 741 742 ...
##  $ YYYY-MMM    : chr [1:192] "1998-Jan" "1998-Feb" "1998-Mar" "1998-Apr" ...
##  $ KWH         : num [1:192] 6862583 5838198 5420658 5010364 4665377 ...
head(power_data, 12)
## # A tibble: 12 × 3
##    CaseSequence `YYYY-MMM`     KWH
##           <dbl> <chr>        <dbl>
##  1          733 1998-Jan   6862583
##  2          734 1998-Feb   5838198
##  3          735 1998-Mar   5420658
##  4          736 1998-Apr   5010364
##  5          737 1998-May   4665377
##  6          738 1998-Jun   6467147
##  7          739 1998-Jul   8914755
##  8          740 1998-Aug   8607428
##  9          741 1998-Sep   6989888
## 10          742 1998-Oct   6345620
## 11          743 1998-Nov   4640410
## 12          744 1998-Dec   4693479
summary(power_data)
##   CaseSequence     YYYY-MMM              KWH          
##  Min.   :733.0   Length:192         Min.   :  770523  
##  1st Qu.:780.8   Class :character   1st Qu.: 5429912  
##  Median :828.5   Mode  :character   Median : 6283324  
##  Mean   :828.5                      Mean   : 6502475  
##  3rd Qu.:876.2                      3rd Qu.: 7620524  
##  Max.   :924.0                      Max.   :10655730  
##                                     NA's   :1
# Check for missing values
cat("\nMissing values:\n")
## 
## Missing values:
colSums(is.na(power_data))
## CaseSequence     YYYY-MMM          KWH 
##            0            0            1

2. Data Preparation

There’s only one missing value in this dataset. Linear interpolation works well here too - monthly power consumption follows seasonal patterns and usually doesn’t have sudden spikes, so estimating the missing month based on the values before and after it makes sense.

# Parse date and handle missing values
power_clean <- power_data %>%
  mutate(Date = yearmonth(as.Date(paste0(`YYYY-MMM`, "-01"), format = "%Y-%b-%d"))) %>%
  arrange(Date)

# Impute single missing value using linear interpolation
power_clean <- power_clean %>%
  mutate(
    KWH = approx(x = seq_along(KWH), y = KWH, xout = seq_along(KWH), 
                 method = "linear", rule = 2)$y
  )

# Verify no missing values remain
cat("Missing values after imputation:", sum(is.na(power_clean$KWH)), "\n\n")
## Missing values after imputation: 0
# Convert to tsibble
power_tsibble <- power_clean %>%
  select(Date, KWH) %>%
  as_tsibble(index = Date)

# Summary statistics
power_tsibble %>%
  summarise(
    Start_Date = format(min(Date), "%Y %b"),
    End_Date = format(max(Date), "%Y %b"),
    N_Months = n(),
    Mean_KWH = round(mean(KWH), 0),
    SD_KWH = round(sd(KWH), 0),
    Min_KWH = round(min(KWH), 0),
    Max_KWH = round(max(KWH), 0)
  ) %>%
  knitr::kable(caption = "Power Consumption Summary", format.args = list(big.mark = ","))
Power Consumption Summary
Date Start_Date End_Date N_Months Mean_KWH SD_KWH Min_KWH Max_KWH
1998 Jan 1998 Jan 1998 Jan 1 6,862,583 NA 6,862,583 6,862,583
1998 Feb 1998 Feb 1998 Feb 1 5,838,198 NA 5,838,198 5,838,198
1998 Mar 1998 Mar 1998 Mar 1 5,420,658 NA 5,420,658 5,420,658
1998 Apr 1998 Apr 1998 Apr 1 5,010,364 NA 5,010,364 5,010,364
1998 May 1998 May 1998 May 1 4,665,377 NA 4,665,377 4,665,377
1998 Jun 1998 Jun 1998 Jun 1 6,467,147 NA 6,467,147 6,467,147
1998 Jul 1998 Jul 1998 Jul 1 8,914,755 NA 8,914,755 8,914,755
1998 Aug 1998 Aug 1998 Aug 1 8,607,428 NA 8,607,428 8,607,428
1998 Sep 1998 Sep 1998 Sep 1 6,989,888 NA 6,989,888 6,989,888
1998 Oct 1998 Oct 1998 Oct 1 6,345,620 NA 6,345,620 6,345,620
1998 Nov 1998 Nov 1998 Nov 1 4,640,410 NA 4,640,410 4,640,410
1998 Dec 1998 Dec 1998 Dec 1 4,693,479 NA 4,693,479 4,693,479
1999 Jan 1999 Jan 1999 Jan 1 7,183,759 NA 7,183,759 7,183,759
1999 Feb 1999 Feb 1999 Feb 1 5,759,262 NA 5,759,262 5,759,262
1999 Mar 1999 Mar 1999 Mar 1 4,847,656 NA 4,847,656 4,847,656
1999 Apr 1999 Apr 1999 Apr 1 5,306,592 NA 5,306,592 5,306,592
1999 May 1999 May 1999 May 1 4,426,794 NA 4,426,794 4,426,794
1999 Jun 1999 Jun 1999 Jun 1 5,500,901 NA 5,500,901 5,500,901
1999 Jul 1999 Jul 1999 Jul 1 7,444,416 NA 7,444,416 7,444,416
1999 Aug 1999 Aug 1999 Aug 1 7,564,391 NA 7,564,391 7,564,391
1999 Sep 1999 Sep 1999 Sep 1 7,899,368 NA 7,899,368 7,899,368
1999 Oct 1999 Oct 1999 Oct 1 5,358,314 NA 5,358,314 5,358,314
1999 Nov 1999 Nov 1999 Nov 1 4,436,269 NA 4,436,269 4,436,269
1999 Dec 1999 Dec 1999 Dec 1 4,419,229 NA 4,419,229 4,419,229
2000 Jan 2000 Jan 2000 Jan 1 7,068,296 NA 7,068,296 7,068,296
2000 Feb 2000 Feb 2000 Feb 1 5,876,083 NA 5,876,083 5,876,083
2000 Mar 2000 Mar 2000 Mar 1 4,807,961 NA 4,807,961 4,807,961
2000 Apr 2000 Apr 2000 Apr 1 4,873,080 NA 4,873,080 4,873,080
2000 May 2000 May 2000 May 1 5,050,891 NA 5,050,891 5,050,891
2000 Jun 2000 Jun 2000 Jun 1 7,092,865 NA 7,092,865 7,092,865
2000 Jul 2000 Jul 2000 Jul 1 6,862,662 NA 6,862,662 6,862,662
2000 Aug 2000 Aug 2000 Aug 1 7,517,830 NA 7,517,830 7,517,830
2000 Sep 2000 Sep 2000 Sep 1 8,912,169 NA 8,912,169 8,912,169
2000 Oct 2000 Oct 2000 Oct 1 5,844,352 NA 5,844,352 5,844,352
2000 Nov 2000 Nov 2000 Nov 1 5,041,769 NA 5,041,769 5,041,769
2000 Dec 2000 Dec 2000 Dec 1 6,220,334 NA 6,220,334 6,220,334
2001 Jan 2001 Jan 2001 Jan 1 7,538,529 NA 7,538,529 7,538,529
2001 Feb 2001 Feb 2001 Feb 1 6,602,448 NA 6,602,448 6,602,448
2001 Mar 2001 Mar 2001 Mar 1 5,779,180 NA 5,779,180 5,779,180
2001 Apr 2001 Apr 2001 Apr 1 4,835,210 NA 4,835,210 4,835,210
2001 May 2001 May 2001 May 1 4,787,904 NA 4,787,904 4,787,904
2001 Jun 2001 Jun 2001 Jun 1 6,283,324 NA 6,283,324 6,283,324
2001 Jul 2001 Jul 2001 Jul 1 7,855,129 NA 7,855,129 7,855,129
2001 Aug 2001 Aug 2001 Aug 1 8,450,717 NA 8,450,717 8,450,717
2001 Sep 2001 Sep 2001 Sep 1 7,112,069 NA 7,112,069 7,112,069
2001 Oct 2001 Oct 2001 Oct 1 5,242,535 NA 5,242,535 5,242,535
2001 Nov 2001 Nov 2001 Nov 1 4,461,979 NA 4,461,979 4,461,979
2001 Dec 2001 Dec 2001 Dec 1 5,240,995 NA 5,240,995 5,240,995
2002 Jan 2002 Jan 2002 Jan 1 7,099,063 NA 7,099,063 7,099,063
2002 Feb 2002 Feb 2002 Feb 1 6,413,429 NA 6,413,429 6,413,429
2002 Mar 2002 Mar 2002 Mar 1 5,839,514 NA 5,839,514 5,839,514
2002 Apr 2002 Apr 2002 Apr 1 5,371,604 NA 5,371,604 5,371,604
2002 May 2002 May 2002 May 1 5,439,166 NA 5,439,166 5,439,166
2002 Jun 2002 Jun 2002 Jun 1 5,850,383 NA 5,850,383 5,850,383
2002 Jul 2002 Jul 2002 Jul 1 7,039,702 NA 7,039,702 7,039,702
2002 Aug 2002 Aug 2002 Aug 1 8,058,748 NA 8,058,748 8,058,748
2002 Sep 2002 Sep 2002 Sep 1 8,245,227 NA 8,245,227 8,245,227
2002 Oct 2002 Oct 2002 Oct 1 5,865,014 NA 5,865,014 5,865,014
2002 Nov 2002 Nov 2002 Nov 1 4,908,979 NA 4,908,979 4,908,979
2002 Dec 2002 Dec 2002 Dec 1 5,779,958 NA 5,779,958 5,779,958
2003 Jan 2003 Jan 2003 Jan 1 7,256,079 NA 7,256,079 7,256,079
2003 Feb 2003 Feb 2003 Feb 1 6,190,517 NA 6,190,517 6,190,517
2003 Mar 2003 Mar 2003 Mar 1 6,120,626 NA 6,120,626 6,120,626
2003 Apr 2003 Apr 2003 Apr 1 4,885,643 NA 4,885,643 4,885,643
2003 May 2003 May 2003 May 1 5,296,096 NA 5,296,096 5,296,096
2003 Jun 2003 Jun 2003 Jun 1 6,051,571 NA 6,051,571 6,051,571
2003 Jul 2003 Jul 2003 Jul 1 6,900,676 NA 6,900,676 6,900,676
2003 Aug 2003 Aug 2003 Aug 1 8,476,499 NA 8,476,499 8,476,499
2003 Sep 2003 Sep 2003 Sep 1 7,791,791 NA 7,791,791 7,791,791
2003 Oct 2003 Oct 2003 Oct 1 5,344,613 NA 5,344,613 5,344,613
2003 Nov 2003 Nov 2003 Nov 1 4,913,707 NA 4,913,707 4,913,707
2003 Dec 2003 Dec 2003 Dec 1 5,756,193 NA 5,756,193 5,756,193
2004 Jan 2004 Jan 2004 Jan 1 7,584,596 NA 7,584,596 7,584,596
2004 Feb 2004 Feb 2004 Feb 1 6,560,742 NA 6,560,742 6,560,742
2004 Mar 2004 Mar 2004 Mar 1 6,526,586 NA 6,526,586 6,526,586
2004 Apr 2004 Apr 2004 Apr 1 4,831,688 NA 4,831,688 4,831,688
2004 May 2004 May 2004 May 1 4,878,262 NA 4,878,262 4,878,262
2004 Jun 2004 Jun 2004 Jun 1 6,421,614 NA 6,421,614 6,421,614
2004 Jul 2004 Jul 2004 Jul 1 7,307,931 NA 7,307,931 7,307,931
2004 Aug 2004 Aug 2004 Aug 1 7,309,774 NA 7,309,774 7,309,774
2004 Sep 2004 Sep 2004 Sep 1 6,690,366 NA 6,690,366 6,690,366
2004 Oct 2004 Oct 2004 Oct 1 5,444,948 NA 5,444,948 5,444,948
2004 Nov 2004 Nov 2004 Nov 1 4,824,940 NA 4,824,940 4,824,940
2004 Dec 2004 Dec 2004 Dec 1 5,791,208 NA 5,791,208 5,791,208
2005 Jan 2005 Jan 2005 Jan 1 8,225,477 NA 8,225,477 8,225,477
2005 Feb 2005 Feb 2005 Feb 1 6,564,338 NA 6,564,338 6,564,338
2005 Mar 2005 Mar 2005 Mar 1 5,581,725 NA 5,581,725 5,581,725
2005 Apr 2005 Apr 2005 Apr 1 5,563,071 NA 5,563,071 5,563,071
2005 May 2005 May 2005 May 1 4,453,983 NA 4,453,983 4,453,983
2005 Jun 2005 Jun 2005 Jun 1 5,900,212 NA 5,900,212 5,900,212
2005 Jul 2005 Jul 2005 Jul 1 8,337,998 NA 8,337,998 8,337,998
2005 Aug 2005 Aug 2005 Aug 1 7,786,659 NA 7,786,659 7,786,659
2005 Sep 2005 Sep 2005 Sep 1 7,057,213 NA 7,057,213 7,057,213
2005 Oct 2005 Oct 2005 Oct 1 6,694,523 NA 6,694,523 6,694,523
2005 Nov 2005 Nov 2005 Nov 1 4,313,019 NA 4,313,019 4,313,019
2005 Dec 2005 Dec 2005 Dec 1 6,181,548 NA 6,181,548 6,181,548
2006 Jan 2006 Jan 2006 Jan 1 7,793,358 NA 7,793,358 7,793,358
2006 Feb 2006 Feb 2006 Feb 1 5,914,945 NA 5,914,945 5,914,945
2006 Mar 2006 Mar 2006 Mar 1 5,819,734 NA 5,819,734 5,819,734
2006 Apr 2006 Apr 2006 Apr 1 5,255,988 NA 5,255,988 5,255,988
2006 May 2006 May 2006 May 1 4,740,588 NA 4,740,588 4,740,588
2006 Jun 2006 Jun 2006 Jun 1 7,052,275 NA 7,052,275 7,052,275
2006 Jul 2006 Jul 2006 Jul 1 7,945,564 NA 7,945,564 7,945,564
2006 Aug 2006 Aug 2006 Aug 1 8,241,110 NA 8,241,110 8,241,110
2006 Sep 2006 Sep 2006 Sep 1 7,296,355 NA 7,296,355 7,296,355
2006 Oct 2006 Oct 2006 Oct 1 5,104,799 NA 5,104,799 5,104,799
2006 Nov 2006 Nov 2006 Nov 1 4,458,429 NA 4,458,429 4,458,429
2006 Dec 2006 Dec 2006 Dec 1 6,226,214 NA 6,226,214 6,226,214
2007 Jan 2007 Jan 2007 Jan 1 8,031,295 NA 8,031,295 8,031,295
2007 Feb 2007 Feb 2007 Feb 1 7,928,337 NA 7,928,337 7,928,337
2007 Mar 2007 Mar 2007 Mar 1 6,443,170 NA 6,443,170 6,443,170
2007 Apr 2007 Apr 2007 Apr 1 4,841,979 NA 4,841,979 4,841,979
2007 May 2007 May 2007 May 1 4,862,847 NA 4,862,847 4,862,847
2007 Jun 2007 Jun 2007 Jun 1 5,022,647 NA 5,022,647 5,022,647
2007 Jul 2007 Jul 2007 Jul 1 6,426,220 NA 6,426,220 6,426,220
2007 Aug 2007 Aug 2007 Aug 1 7,447,146 NA 7,447,146 7,447,146
2007 Sep 2007 Sep 2007 Sep 1 7,666,970 NA 7,666,970 7,666,970
2007 Oct 2007 Oct 2007 Oct 1 5,785,964 NA 5,785,964 5,785,964
2007 Nov 2007 Nov 2007 Nov 1 4,907,057 NA 4,907,057 4,907,057
2007 Dec 2007 Dec 2007 Dec 1 6,047,292 NA 6,047,292 6,047,292
2008 Jan 2008 Jan 2008 Jan 1 7,964,293 NA 7,964,293 7,964,293
2008 Feb 2008 Feb 2008 Feb 1 7,597,060 NA 7,597,060 7,597,060
2008 Mar 2008 Mar 2008 Mar 1 6,085,644 NA 6,085,644 6,085,644
2008 Apr 2008 Apr 2008 Apr 1 5,352,359 NA 5,352,359 5,352,359
2008 May 2008 May 2008 May 1 4,608,528 NA 4,608,528 4,608,528
2008 Jun 2008 Jun 2008 Jun 1 6,548,439 NA 6,548,439 6,548,439
2008 Jul 2008 Jul 2008 Jul 1 7,643,987 NA 7,643,987 7,643,987
2008 Aug 2008 Aug 2008 Aug 1 8,037,137 NA 8,037,137 8,037,137
2008 Sep 2008 Sep 2008 Sep 1 6,569,470 NA 6,569,470 6,569,470
2008 Oct 2008 Oct 2008 Oct 1 5,101,803 NA 5,101,803 5,101,803
2008 Nov 2008 Nov 2008 Nov 1 4,555,602 NA 4,555,602 4,555,602
2008 Dec 2008 Dec 2008 Dec 1 6,442,746 NA 6,442,746 6,442,746
2009 Jan 2009 Jan 2009 Jan 1 8,072,330 NA 8,072,330 8,072,330
2009 Feb 2009 Feb 2009 Feb 1 6,976,800 NA 6,976,800 6,976,800
2009 Mar 2009 Mar 2009 Mar 1 5,691,452 NA 5,691,452 5,691,452
2009 Apr 2009 Apr 2009 Apr 1 5,531,616 NA 5,531,616 5,531,616
2009 May 2009 May 2009 May 1 5,264,439 NA 5,264,439 5,264,439
2009 Jun 2009 Jun 2009 Jun 1 5,804,433 NA 5,804,433 5,804,433
2009 Jul 2009 Jul 2009 Jul 1 7,713,260 NA 7,713,260 7,713,260
2009 Aug 2009 Aug 2009 Aug 1 8,350,517 NA 8,350,517 8,350,517
2009 Sep 2009 Sep 2009 Sep 1 7,583,146 NA 7,583,146 7,583,146
2009 Oct 2009 Oct 2009 Oct 1 5,566,075 NA 5,566,075 5,566,075
2009 Nov 2009 Nov 2009 Nov 1 5,339,890 NA 5,339,890 5,339,890
2009 Dec 2009 Dec 2009 Dec 1 7,089,880 NA 7,089,880 7,089,880
2010 Jan 2010 Jan 2010 Jan 1 9,397,357 NA 9,397,357 9,397,357
2010 Feb 2010 Feb 2010 Feb 1 8,390,677 NA 8,390,677 8,390,677
2010 Mar 2010 Mar 2010 Mar 1 7,347,915 NA 7,347,915 7,347,915
2010 Apr 2010 Apr 2010 Apr 1 5,776,131 NA 5,776,131 5,776,131
2010 May 2010 May 2010 May 1 4,919,289 NA 4,919,289 4,919,289
2010 Jun 2010 Jun 2010 Jun 1 6,696,292 NA 6,696,292 6,696,292
2010 Jul 2010 Jul 2010 Jul 1 770,523 NA 770,523 770,523
2010 Aug 2010 Aug 2010 Aug 1 7,922,701 NA 7,922,701 7,922,701
2010 Sep 2010 Sep 2010 Sep 1 7,819,472 NA 7,819,472 7,819,472
2010 Oct 2010 Oct 2010 Oct 1 5,875,917 NA 5,875,917 5,875,917
2010 Nov 2010 Nov 2010 Nov 1 4,800,733 NA 4,800,733 4,800,733
2010 Dec 2010 Dec 2010 Dec 1 6,152,583 NA 6,152,583 6,152,583
2011 Jan 2011 Jan 2011 Jan 1 8,394,747 NA 8,394,747 8,394,747
2011 Feb 2011 Feb 2011 Feb 1 8,898,062 NA 8,898,062 8,898,062
2011 Mar 2011 Mar 2011 Mar 1 6,356,903 NA 6,356,903 6,356,903
2011 Apr 2011 Apr 2011 Apr 1 5,685,227 NA 5,685,227 5,685,227
2011 May 2011 May 2011 May 1 5,506,308 NA 5,506,308 5,506,308
2011 Jun 2011 Jun 2011 Jun 1 8,037,779 NA 8,037,779 8,037,779
2011 Jul 2011 Jul 2011 Jul 1 10,093,343 NA 10,093,343 10,093,343
2011 Aug 2011 Aug 2011 Aug 1 10,308,076 NA 10,308,076 10,308,076
2011 Sep 2011 Sep 2011 Sep 1 8,943,599 NA 8,943,599 8,943,599
2011 Oct 2011 Oct 2011 Oct 1 5,603,920 NA 5,603,920 5,603,920
2011 Nov 2011 Nov 2011 Nov 1 6,154,138 NA 6,154,138 6,154,138
2011 Dec 2011 Dec 2011 Dec 1 8,273,142 NA 8,273,142 8,273,142
2012 Jan 2012 Jan 2012 Jan 1 8,991,267 NA 8,991,267 8,991,267
2012 Feb 2012 Feb 2012 Feb 1 7,952,204 NA 7,952,204 7,952,204
2012 Mar 2012 Mar 2012 Mar 1 6,356,961 NA 6,356,961 6,356,961
2012 Apr 2012 Apr 2012 Apr 1 5,569,828 NA 5,569,828 5,569,828
2012 May 2012 May 2012 May 1 5,783,598 NA 5,783,598 5,783,598
2012 Jun 2012 Jun 2012 Jun 1 7,926,956 NA 7,926,956 7,926,956
2012 Jul 2012 Jul 2012 Jul 1 8,886,851 NA 8,886,851 8,886,851
2012 Aug 2012 Aug 2012 Aug 1 9,612,423 NA 9,612,423 9,612,423
2012 Sep 2012 Sep 2012 Sep 1 7,559,148 NA 7,559,148 7,559,148
2012 Oct 2012 Oct 2012 Oct 1 5,576,852 NA 5,576,852 5,576,852
2012 Nov 2012 Nov 2012 Nov 1 5,731,899 NA 5,731,899 5,731,899
2012 Dec 2012 Dec 2012 Dec 1 6,609,694 NA 6,609,694 6,609,694
2013 Jan 2013 Jan 2013 Jan 1 10,655,730 NA 10,655,730 10,655,730
2013 Feb 2013 Feb 2013 Feb 1 7,681,798 NA 7,681,798 7,681,798
2013 Mar 2013 Mar 2013 Mar 1 6,517,514 NA 6,517,514 6,517,514
2013 Apr 2013 Apr 2013 Apr 1 6,105,359 NA 6,105,359 6,105,359
2013 May 2013 May 2013 May 1 5,940,475 NA 5,940,475 5,940,475
2013 Jun 2013 Jun 2013 Jun 1 7,920,627 NA 7,920,627 7,920,627
2013 Jul 2013 Jul 2013 Jul 1 8,415,321 NA 8,415,321 8,415,321
2013 Aug 2013 Aug 2013 Aug 1 9,080,226 NA 9,080,226 9,080,226
2013 Sep 2013 Sep 2013 Sep 1 7,968,220 NA 7,968,220 7,968,220
2013 Oct 2013 Oct 2013 Oct 1 5,759,367 NA 5,759,367 5,759,367
2013 Nov 2013 Nov 2013 Nov 1 5,769,083 NA 5,769,083 5,769,083
2013 Dec 2013 Dec 2013 Dec 1 9,606,304 NA 9,606,304 9,606,304

3. Time Series Visualization

# Time series plot
power_tsibble %>%
  autoplot(KWH) +
  labs(
    title = "Residential Power Consumption (1998-2013)",
    x = "Date",
    y = "Power Consumption (KWH)"
  ) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_y_continuous(labels = scales::comma)

# Seasonal plot
power_tsibble %>%
  gg_season(KWH, labels = "both") +
  labs(
    title = "Seasonal Plot - Power Consumption",
    x = "Month",
    y = "KWH"
  ) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_y_continuous(labels = scales::comma)

# Subseries plot
power_tsibble %>%
  gg_subseries(KWH) +
  labs(
    title = "Seasonal Subseries Plot",
    subtitle = "Blue line shows the mean for each month",
    x = "Year",
    y = "KWH"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5),
    plot.subtitle = element_text(hjust = 0.5),
    axis.text.x = element_text(angle = 45, hjust = 1, size = 8)  # Angle the labels
  ) +
  scale_y_continuous(labels = scales::comma)

The residential power consumption data from 1998-2013 shows a clear and consistent seasonal pattern. The time series plot reveals regular oscillations throughout the year, with consumption ranging from about 4-5 million KWH in the spring/fall months up to 8-10 million KWH during peak summer and winter periods. There’s a notable anomaly in July 2010 where consumption dropped dramatically to around 770,000 KWH. This appears to be a data recording error or system outage rather than actual consumption, since it’s completely out of line with every other July in the dataset.

The seasonal plot shows that every year follows a similar pattern. Summer months (June-August) consistently show the highest consumption, peaking in July and August when air conditioning demand drives usage up. There’s a secondary peak in winter (December-January) from heating needs, though it’s typically lower than the summer peak. Spring and Fall show the lowest consumption, with April and May being particularly low, as temperatures are mild and central air/heat is not as necessary. The more recent years (2010-2013) in the dataset show slightly higher peaks than previous years, suggesting population growth, increased housing built, and/or changing consumption patterns for other factors. Sometimes power companies raise their prices as well, that could play a role.

The subseries plot breaks down each month across all years, making it easy to spot trends within specific months. August shows the highest average consumption, with January close behind—both averaging around 8.5 million KWH. Both months also display upward trends over time, with August showing considerable year-to-year variability, while January shows a steady increase in recent winters. July also shows high consumption with substantial variability. The spring months (March-May) remain relatively stable across the entire period. The July 2010 outlier is particularly noticeable: it drops dramatically to around 2 million KWH.

4. Time Series Decomposition

# STL decomposition
power_decomp <- power_tsibble %>%
  model(
    stl = STL(KWH ~ season(window = "periodic"))
  ) %>%
  components()

# Plot decomposition
power_decomp %>%
  autoplot() +
  labs(title = "STL Decomposition of Power Consumption") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5))

# Calculate seasonal strength
power_features <- power_tsibble %>%
  features(KWH, feat_stl)

cat("Seasonal Strength:", round(power_features$seasonal_strength_year, 3), "\n")
## Seasonal Strength: 0.757
cat("Trend Strength:", round(power_features$trend_strength, 3), "\n")
## Trend Strength: 0.394
cat("Interpretation: Strong seasonal pattern present\n")
## Interpretation: Strong seasonal pattern present

The STL decomposition breaks the power consumption into three components: trend, seasonal pattern, and remainder. The trend component shows a gradual increase from around 6 million KWH in the late 1990s to about 7.5 million KWH by 2013. The increase isn’t linear though. There’s a relatively flat period through most of the 2000s, then a noticeable uptick starting around 2010. This could potentially reflect economic recovery after the 2008 recession.

The seasonal component reveals a consistent pattern that repeats year after year. With a seasonal strength of 0.757, seasonality accounts for approximately 76% of the variation in the data, quite a strong influence. The trend strength of 0.394 is more moderate, indicating that while a long-term trend exists, it’s not as dominant as the seasonal pattern.

The remainder component shows the irregular fluctuations that aren’t captured by trend or seasonality. For most of the series, the remainders are relatively small and randomly scattered around zero, which is exactly what we want to see. This tells us the trend and seasonal components are doing a good job explaining the data. The massive negative spike in July 2010 dominates the remainder plot. This suggests it’s an outlier that doesn’t fit the normal patterns. Besides that one observation, the remainders look okay, with no obvious patterns or clustering.

5. Model Building and Forecasting

5.1 Benchmark Methods

# Fit benchmark models
power_bench <- power_tsibble %>%
  model(
    naive = NAIVE(KWH),
    snaive = SNAIVE(KWH),
    drift = RW(KWH ~ drift())
  )

# Generate forecasts
power_bench_fc <- power_bench %>%
  forecast(h = 12)

# Plot benchmark methods
power_bench_fc %>%
  autoplot(power_tsibble, level = NULL) +
  labs(
    title = "Benchmark Forecasting Methods",
    x = "Year",
    y = "KWH"
  ) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_y_continuous(labels = scales::comma)

5.2 Exponential Smoothing

# Fit ETS model
power_ets <- power_tsibble %>%
  model(
    ets = ETS(KWH)
  )

cat("Selected ETS Model:\n")
## Selected ETS Model:
report(power_ets)
## Series: KWH 
## Model: ETS(M,N,M) 
##   Smoothing parameters:
##     alpha = 0.1137195 
##     gamma = 0.0001001086 
## 
##   Initial states:
##     l[0]      s[0]     s[-1]     s[-2]    s[-3]    s[-4]    s[-5]     s[-6]
##  6134773 0.9547463 0.7533319 0.8602685 1.191176 1.255577 1.199206 0.9982427
##      s[-7]     s[-8]     s[-9]   s[-10]   s[-11]
##  0.7683657 0.8160779 0.9174797 1.061846 1.223683
## 
##   sigma^2:  0.0144
## 
##      AIC     AICc      BIC 
## 6224.057 6226.784 6272.919
# Generate forecast
power_ets_fc <- power_ets %>%
  forecast(h = 12)

# Plot
power_ets_fc %>%
  autoplot(power_tsibble) +
  labs(
    title = "2014 Power Forecast - ETS Model",
    x = "Year",
    y = "KWH"
  ) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_y_continuous(labels = scales::comma)

# Check residuals
power_ets %>%
  gg_tsresiduals()

The ETS model automatically selected ETS(M,N,M), which stands for multiplicative errors, no trend, and multiplicative seasonality. This makes sense given the data shows strong seasonal swings but a relatively weak long-term trend. The model uses a low alpha (0.114) for the level, meaning it changes slowly and smooths out short-term fluctuations. The gamma parameter for seasonality is extremely small (0.0001), indicating that the seasonal pattern is very stable and doesn’t need much updating from year to year.

The residuals show that the model accurately captures most of the systematic patterns in the data. The innovation residuals are mostly randomly scattered around zero. The ACF plot shows good behavior at most lags, with nearly all autocorrelations falling within the confidence bounds. The histogram of residuals shows approximate normality with some deviation in the tails, which means the prediction intervals might be slightly optimistic but the model is generally solid for this dataset.

5.3 ARIMA Model

# Fit ARIMA model
power_arima <- power_tsibble %>%
  model(
    arima = ARIMA(KWH)
  )

cat("Selected ARIMA Model Below:\n")
## Selected ARIMA Model Below:
report(power_arima)
## Series: KWH 
## Model: ARIMA(0,0,1)(1,1,1)[12] w/ drift 
## 
## Coefficients:
##          ma1     sar1     sma1   constant
##       0.2439  -0.1518  -0.7311  105259.65
## s.e.  0.0723   0.0963   0.0821   27000.49
## 
## sigma^2 estimated as 7.466e+11:  log likelihood=-2719.89
## AIC=5449.79   AICc=5450.13   BIC=5465.75
# Generate forecast
power_arima_fc <- power_arima %>%
  forecast(h = 12)

# Plot
power_arima_fc %>%
  autoplot(power_tsibble) +
  labs(
    title = "2014 Power Forecast - ARIMA Model",
    x = "Year",
    y = "KWH"
  ) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_y_continuous(labels = scales::comma)

# Check residuals
power_arima %>%
  gg_tsresiduals()

The ARIMA model specification chosen by automatic selection is ARIMA(0,0,1)(1,1,2)[12] with drift. The seasonal differencing (the middle “1” in the seasonal part) handles the growing trend by looking at year-over-year changes rather than raw levels. The seasonal MA terms capture the repeating annual pattern. The drift term adds that gradual upward trend we see in the decomposition. This combination of features allows the model to adapt to both the long-term growth and the strong seasonal cycles.

The residuals look solid overall. They scatter randomly around zero without any obvious patterns, which suggests the model has extracted the systematic information from the data. The ACF plot is particularly reassuring. Nearly all the autocorrelation lags fall within the confidence bounds, with only a couple of minor exceptions. This indicates the residuals are essentially white noise with no leftover structure that the model missed.

The histogram of residuals is roughly bell-shaped and centered near zero, though it shows some slight departures from perfect normality. There’s a bit of a tail on the positive side, and the distribution looks slightly wider than a normal curve would be. This isn’t a major concern for forecasting purposes, but it does mean the prediction intervals might be slightly optimistic. The extreme outlier from July 2010 appears in the residual plot as a massive negative spike, but it’s an isolated incident that doesn’t affect the overall model quality.

5.4 Model Comparison

# Combine all models
power_all_models <- power_tsibble %>%
  model(
    snaive = SNAIVE(KWH),
    ets = ETS(KWH),
    arima = ARIMA(KWH)
  )

# Compare accuracy metrics
cat("Model Accuracy Comparison:\n\n")
## Model Accuracy Comparison:
accuracy(power_all_models) %>%
  arrange(RMSE) %>%
  select(.model, RMSE, MAE, MAPE) %>%
  knitr::kable(digits = 2, caption = "Model Performance Comparison")
Model Performance Comparison
.model RMSE MAE MAPE
arima 827254.2 493305.9 11.68
ets 835107.0 503972.8 12.04
snaive 1178104.4 696708.7 14.62
# Generate forecasts for all models
power_all_fc <- power_all_models %>%
  forecast(h = 12)

# Plot comparison
power_all_fc %>%
  autoplot(power_tsibble, level = NULL) +
  labs(
    title = "2014 Forecast Comparison",
    x = "Year",
    y = "KWH"
  ) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_y_continuous(labels = scales::comma)

The model comparison reveals that ARIMA significantly outperforms the benchmark methods. The ARIMA model achieves an RMSE of 827,254 KWH and a MAPE of 11.68%, making it the most accurate of the three approaches tested. The ETS model comes in a close second with an RMSE of 835,107 KWH and MAPE of 12.04%. These two sophisticated models perform similarly because they’re both capturing the seasonal dynamics effectively, just through slightly different methods.

The seasonal naive model, which simply uses last year’s value for each month as the forecast, performs noticeably worse with an RMSE of 1,178,104 KWH and MAPE of 14.62%. This makes sense given that there’s an upward trend in the data. Just repeating last year’s values will systematically under-forecast as consumption grows. The fact that seasonal naive still performs somewhat well shows how strong and consistent the seasonal pattern is. Even this simple approach captures much of the variation.

The forecast comparison plot shows all three models projected into 2014. The ARIMA (red) and ETS (green) forecasts are nearly identical and closely overlap, both continuing the seasonal pattern while trending slightly upward. The seasonal naive forecast (blue) also captures the seasonal pattern but sits noticeably higher than the other two models, particularly in the later months of 2014. This occurs because the seasonal naive method simply repeats the values from 2013, which included some exceptionally high consumption months, without adjusting for trend. Unlike ARIMA and ETS models which can incorporate damping to gradually flatten trends over longer forecast horizons (making predictions more conservative as uncertainty increases), the seasonal naive approach mechanically repeats past patterns without any such adjustment, which can lead to overly optimistic or pessimistic forecasts depending on the recent data.

6. Export Part B Forecasts

# Extract ARIMA forecasts
power_forecast <- power_arima_fc %>%
  hilo(level = c(80, 95)) %>%
  unpack_hilo(c("80%", "95%")) %>%
  as_tibble() %>%
  mutate(
    Date = as.Date(Date),
    `YYYY-MMM` = format(Date, "%Y-%b"),
    KWH_Forecast = round(.mean, 0),
    Lower_80 = round(`80%_lower`, 0),
    Upper_80 = round(`80%_upper`, 0),
    Lower_95 = round(`95%_lower`, 0),
    Upper_95 = round(`95%_upper`, 0)
  ) %>%
  select(Date, `YYYY-MMM`, KWH_Forecast, Lower_80, Upper_80, Lower_95, Upper_95)

# Display forecast
power_forecast %>%
  knitr::kable(
    caption = "2014 Monthly Power Consumption Forecast (ARIMA)",
    format.args = list(big.mark = ",")
  )
2014 Monthly Power Consumption Forecast (ARIMA)
Date YYYY-MMM KWH_Forecast Lower_80 Upper_80 Lower_95 Upper_95
2014-01-01 2014-Jan 9,714,361 8,607,037 10,821,686 8,020,854 11,407,868
2014-02-01 2014-Feb 8,181,495 7,041,708 9,321,281 6,438,342 9,924,647
2014-03-01 2014-Mar 6,737,578 5,597,792 7,877,365 4,994,426 8,480,731
2014-04-01 2014-Apr 5,957,336 4,817,549 7,097,122 4,214,183 7,700,488
2014-05-01 2014-May 5,727,983 4,588,197 6,867,770 3,984,831 7,471,136
2014-06-01 2014-Jun 7,534,009 6,394,222 8,673,795 5,790,856 9,277,162
2014-07-01 2014-Jul 7,921,051 6,781,265 9,060,837 6,177,898 9,664,204
2014-08-01 2014-Aug 9,292,410 8,152,624 10,432,196 7,549,257 11,035,563
2014-09-01 2014-Sep 8,182,948 7,043,162 9,322,734 6,439,795 9,926,101
2014-10-01 2014-Oct 6,015,721 4,875,934 7,155,507 4,272,568 7,758,873
2014-11-01 2014-Nov 5,769,225 4,629,439 6,909,011 4,026,072 7,512,378
2014-12-01 2014-Dec 7,484,127 6,344,341 8,623,913 5,740,975 9,227,279
# Export to Excel
write_xlsx(power_forecast, "Power_Forecast_2014.xlsx")
cat("\n✓ Power forecasts exported to: Power_Forecast_2014.xlsx\n")
## 
## ✓ Power forecasts exported to: Power_Forecast_2014.xlsx

The 2014 ARIMA forecast continues the established seasonal pattern while incorporating the gradual upward trend. The model predicts January 2014 around 9.7 million KWH, which is lower than January 2013’s exceptionally high 10.7 million KWH but still elevated compared to earlier years, suggesting the model is applying some dampening to extreme recent values. Summer peaks in August reach approximately 9.3 million KWH, consistent with recent summer consumption patterns.

The spring months show the expected seasonal dip, with April and May forecasted around 5.7-5.9 million KWH. Fall and winter months follow the typical seasonal pattern, with December forecasted around 7.5 million KWH. The forecasts generally trend upward from historical averages while avoiding simply extrapolating the most extreme recent values.

The confidence intervals provide important context for these point forecasts. The 80% intervals range from about ±1 million KWH around the forecasts in early months, widening slightly as we move further into 2014. The 95% intervals span roughly ±1.5 million KWH. These intervals reflect the model’s uncertainty and remind us that while the seasonal pattern is reliable, weather extremes, economic changes, or other factors could easily push actual consumption outside the predicted range for any given month.

7. Part B Discussion

Model Selection

I compared three forecasting approaches: benchmark methods (seasonal naive), ETS (automatically selected model with seasonal components), and ARIMA (selected model to handle both trend and seasonality). The ARIMA model was chosen for the final forecast because it achieved the lowest RMSE, passed residual diagnostic tests with residuals approximating white noise, and successfully captured both the seasonal pattern and underlying trend.

Key Findings

The data exhibits strong seasonality with clear peaks in summer (July-August) for cooling and winter (December-January) for heating. The overall consumption trend is gradually increasing across the 16-year period, though it remained relatively stable through most of the 2000s before picking up around 2010. The ARIMA model successfully captured these seasonal dynamics while accounting for the upward trend. The confidence intervals provide realistic uncertainty bounds that reflect inherent variability in power consumption.

Forecast Interpretation

The 2014 forecasts show expected summer peaks around 9.3 million KWH, winter peaks in January near 9.7 million KWH, and lower consumption in spring and fall around 5.7-6.0 million KWH. These patterns are consistent with historical data but adjusted upward based on recent trends. Time series forecasting works particularly well for residential power consumption because the seasonal patterns are strong and driven by predictable factors like weather and human behavior. However, these forecasts assume conditions similar to recent history. Unusual weather patterns, economic shifts, or changes in energy efficiency could cause actual consumption to diverge from our predictions.