This project applies time series forecasting techniques to two datasets:
The methods used include time series decomposition, exponential smoothing, and ARIMA models.
# Load the ATM data
atm_data <- read_excel("ATM624Data.xlsx")
# Convert DATE to proper Date format based on Excel file
atm_data <- atm_data %>%
mutate(DATE = as.Date(DATE, origin = "1899-12-30"))
# Display structure
str(atm_data)## tibble [1,474 × 3] (S3: tbl_df/tbl/data.frame)
## $ DATE: Date[1:1474], format: "2009-05-01" "2009-05-01" ...
## $ ATM : chr [1:1474] "ATM1" "ATM2" "ATM1" "ATM2" ...
## $ Cash: num [1:1474] 96 107 82 89 85 90 90 55 99 79 ...
## # A tibble: 10 × 3
## DATE ATM Cash
## <date> <chr> <dbl>
## 1 2009-05-01 ATM1 96
## 2 2009-05-01 ATM2 107
## 3 2009-05-02 ATM1 82
## 4 2009-05-02 ATM2 89
## 5 2009-05-03 ATM1 85
## 6 2009-05-03 ATM2 90
## 7 2009-05-04 ATM1 90
## 8 2009-05-04 ATM2 55
## 9 2009-05-05 ATM1 99
## 10 2009-05-05 ATM2 79
## # A tibble: 10 × 3
## DATE ATM Cash
## <date> <chr> <dbl>
## 1 2010-04-21 ATM4 557.
## 2 2010-04-22 ATM4 386.
## 3 2010-04-23 ATM4 165.
## 4 2010-04-24 ATM4 5.45
## 5 2010-04-25 ATM4 542.
## 6 2010-04-26 ATM4 404.
## 7 2010-04-27 ATM4 13.7
## 8 2010-04-28 ATM4 348.
## 9 2010-04-29 ATM4 44.2
## 10 2010-04-30 ATM4 482.
## DATE ATM Cash
## Min. :2009-05-01 Length:1474 Min. : 0.0
## 1st Qu.:2009-08-01 Class :character 1st Qu.: 0.5
## Median :2009-11-01 Mode :character Median : 73.0
## Mean :2009-10-31 Mean : 155.6
## 3rd Qu.:2010-02-01 3rd Qu.: 114.0
## Max. :2010-05-14 Max. :10919.8
## NA's :19
## Missing values by column:
## DATE ATM Cash
## 0 14 19
# Missing values by ATM
atm_data %>%
group_by(ATM) %>%
summarise(
Total_Records = n(),
Missing_Cash = sum(is.na(Cash)),
Pct_Missing = round(100 * Missing_Cash / Total_Records, 2)
) %>%
knitr::kable(caption = "Missing Data Summary by ATM")| ATM | Total_Records | Missing_Cash | Pct_Missing |
|---|---|---|---|
| ATM1 | 365 | 3 | 0.82 |
| ATM2 | 365 | 2 | 0.55 |
| ATM3 | 365 | 0 | 0.00 |
| ATM4 | 365 | 0 | 0.00 |
| NA | 14 | 14 | 100.00 |
I’ll handle missing values using linear interpolation. For ATM data, this is appropriate because withdrawal patterns are relatively stable over short periods. With only 1% of values missing and gaps appearing to be isolated individual days, interpolating between adjacent observations provides reasonable estimates without introducing significant bias into the analysis.
# Remove rows with missing ATM identifiers
atm_clean <- atm_data %>%
filter(!is.na(ATM)) %>%
mutate(DATE = as.Date(DATE))
# For each ATM, fill in missing cash values using linear interpolation
# Using tidyr::fill and manual interpolation
atm_clean <- atm_clean %>%
group_by(ATM) %>%
arrange(DATE) %>%
mutate(
# Mark positions of non-NA values
Cash = approx(x = seq_along(Cash), y = Cash, xout = seq_along(Cash),
method = "linear", rule = 2)$y
) %>%
ungroup()
# Check date coverage by ATM
atm_clean %>%
group_by(ATM) %>%
summarise(
Start_Date = format(as.Date(min(DATE)), "%Y-%m-%d"),
End_Date = format(as.Date(max(DATE)), "%Y-%m-%d"),
N_Days = n(),
Mean_Cash = round(mean(Cash, na.rm = TRUE), 2),
SD_Cash = round(sd(Cash, na.rm = TRUE), 2)
) %>%
knitr::kable(caption = "Date Coverage and Summary Statistics by ATM")| ATM | Start_Date | End_Date | N_Days | Mean_Cash | SD_Cash |
|---|---|---|---|---|---|
| ATM1 | 2009-05-01 | 2010-04-30 | 365 | 84.15 | 36.64 |
| ATM2 | 2009-05-01 | 2010-04-30 | 365 | 62.59 | 38.81 |
| ATM3 | 2009-05-01 | 2010-04-30 | 365 | 0.72 | 7.94 |
| ATM4 | 2009-05-01 | 2010-04-30 | 365 | 474.04 | 650.94 |
# Time series plot for all ATMs
ggplot(atm_clean, aes(x = DATE, y = Cash, color = ATM)) +
geom_line(alpha = 0.7) +
facet_wrap(~ATM, ncol = 2, scales = "free_y") +
labs(
title = "ATM Cash Withdrawals Over Time",
subtitle = "May 2009 - May 2010",
x = "Date",
y = "Cash (hundreds of dollars)"
) +
theme_minimal() +
theme(
legend.position = "none",
plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5)
)# Box plots by day of week
atm_clean %>%
mutate(DayOfWeek = wday(DATE, label = TRUE)) %>%
ggplot(aes(x = DayOfWeek, y = Cash, fill = ATM)) +
geom_boxplot(alpha = 0.7) +
facet_wrap(~ATM, scales = "free_y") +
labs(
title = "Cash Withdrawal Patterns by Day of Week",
x = "Day of Week",
y = "Cash (hundreds of dollars)"
) +
theme_minimal() +
theme(
legend.position = "none",
plot.title = element_text(hjust = 0.5)
)Looking at the time series plots, ATM1 and ATM2 behave like one might expect from a typical consumer ATM. Withdrawals range from $0-$180 (hundreds of dollars) with predictable weekly patterns throughout the year. Interestingly, both show a consistent dip on Thursdays. Maybe more people get paid Monday or Friday, withdrawing shortly after that. The box plots confirm the weekly cycles are real, with activity picking up on weekends when people are out spending. This predictability makes them perfect candidates for ARIMA forecasting.
ATM3 and ATM4 are a different story altogether. ATM3 is basically flat at zero for nearly the entire year until one random spike around April/May 2010. Something seems off here—the machine might be broken, or it’s in an inaccessible location. ATM4, on the other hand, is operating in a whole different zone We’re talking withdrawals up to $10,000+ (hundreds of dollars)—about 100 times what the consumer ATMs see. There’s no weekly pattern here; instead, we see irregular spikes that suggest “business transactions” rather than everyday banking. This machine is likely serving commercial clients rather than everyday consumers.
I’ll decompose the time series for each ATM to understand the trend, seasonal, and irregular components.
# Convert to tsibble format
atm_tsibble <- atm_clean %>%
as_tsibble(key = ATM, index = DATE)
# STL decomposition for each ATM
atm_decomp <- atm_tsibble %>%
model(
stl = STL(Cash ~ season(window = 7))
) %>%
components()
# Plot decomposition for all ATMs
atm_decomp %>%
autoplot() +
labs(title = "STL Decomposition by ATM") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5))# Individual decomposition plots
for (atm_name in unique(atm_clean$ATM)) {
p <- atm_decomp %>%
filter(ATM == atm_name) %>%
autoplot() +
labs(title = paste("STL Decomposition -", atm_name)) +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5))
print(p)
}The STL decomposition breaks down each ATM’s cash withdrawal pattern into three components: trend, weekly seasonality, and remainder (irregular fluctuations). This analysis helps us understand what’s really driving the patterns we see in the raw data.
For ATM1 and ATM2, the decomposition reveals some interesting dynamics. Both machines show a gradually declining trend over the year, with ATM1 dropping from around 100 to about 75 (hundreds of dollars) and ATM2 following a similar pattern from roughly 80 down to 60. This suggests these locations might be experiencing reduced foot traffic or changing customer behavior over time. The seasonal component clearly shows the weekly cycle we noticed earlier—those regular peaks and valleys that repeat every seven days. What’s particularly useful here is seeing how consistent the weekly pattern is throughout the entire year. The remainder component (the random noise) is relatively small for both machines, which is exactly what we want to see. It means most of the variation in the data can be explained by the trend and weekly patterns, making these specific ATMs highly predictable for forecasting purposes.
ATM3’s decomposition tells the story of a machine that’s been sitting idle, like we saw in the earlier visualization. The trend line is flat at near-zero for the entire period until that sudden jump at the end. There’s no meaningful seasonal pattern because there’s no activity to create one. The remainder component only shows action at the very end when the machine finally comes to life. This decomposition confirms what we suspected—this machine wasn’t operational for most of the observation period.
ATM4’s trend component sits around 500-700 (hundreds of dollars) for most of the year, then shows a huge spike in late March 2010 where it jumps to over 1,500. The seasonal component is relatively small compared to the overall scale of transactions. This makes sense, given that this machine’s activity seems to be driven by business needs rather than weekly consumer cycles. The remainder component stands out for ATM4—it’s highly variable, showing irregular large withdrawals. The fact that the irregular component is so much larger than the seasonal component tells us that trying to forecast this ATM based on patterns alone will be challenging. We’d probably get better results if we had additional information about the type of business clients using this machine. If we knew data such as scheduled payroll dates and known large transaction requests, it might inform better forecasting.
I’ll apply both Exponential Smoothing (ETS) and ARIMA models to each ATM using the fable framework.
# Prepare ATM1 data
atm1_tsibble <- atm_tsibble %>%
filter(ATM == "ATM1")
# Fit multiple models
atm1_fit <- atm1_tsibble %>%
model(
ets = ETS(Cash),
arima = ARIMA(Cash)
)
# Model summaries
cat("ATM1 - Model Details Below:\n\n")## ATM1 - Model Details Below:
## Series: Cash
## Model: ETS(A,N,A)
## Smoothing parameters:
## alpha = 0.02126635
## gamma = 0.3298864
##
## Initial states:
## l[0] s[0] s[-1] s[-2] s[-3] s[-4] s[-5] s[-6]
## 77.02956 -55.27235 2.531658 16.15413 4.529656 11.40245 12.92571 7.72875
##
## sigma^2: 580.2586
##
## AIC AICc BIC
## 4487.018 4487.639 4526.017
## Series: Cash
## Model: ARIMA(0,0,1)(0,1,2)[7]
##
## Coefficients:
## ma1 sma1 sma2
## 0.1950 -0.5806 -0.1037
## s.e. 0.0546 0.0506 0.0494
##
## sigma^2 estimated as 556.4: log likelihood=-1640.01
## AIC=3288.03 AICc=3288.14 BIC=3303.55
# Generate forecasts (31 days for May 2010)
atm1_fc <- atm1_fit %>%
forecast(h = 31)
# Check accuracy
cat("\nModel Accuracy Comparison:\n")##
## Model Accuracy Comparison:
| ATM | .model | .type | ME | RMSE | MAE | MPE | MAPE | MASE | RMSSE | ACF1 |
|---|---|---|---|---|---|---|---|---|---|---|
| ATM1 | arima | Training | -0.074 | 23.262 | 14.54 | -102.259 | 117.148 | 0.822 | 0.838 | -0.009 |
| ATM1 | ets | Training | -0.048 | 23.790 | 15.08 | -106.835 | 121.865 | 0.852 | 0.857 | 0.144 |
# Plot forecasts
atm1_fc %>%
autoplot(atm1_tsibble, level = NULL) +
labs(
title = "ATM1: 31-Day Forecast Comparison",
x = "Date",
y = "Cash (hundreds of dollars)"
) +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5))# Plot ARIMA forecast with intervals
atm1_fc %>%
filter(.model == "arima") %>%
autoplot(atm1_tsibble) +
labs(
title = "ATM1: May 2010 Forecast (ARIMA)",
x = "Date",
y = "Cash (hundreds of dollars)"
) +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5))ATM1 Model Interpretation
The ARIMA model slightly outperforms ETS for ATM1, with an RMSE of 23.262 compared to 23.790. The selected ARIMA(0,0,1)(0,1,2)[7] specification handles the weekly seasonality through seasonal moving average terms at a 7-day lag, while the seasonal differencing (the middle “1” in the seasonal part) accounts for the gradual declining trend we saw in the decomposition.
Looking at the forecast comparison plot, both models produce very similar predictions for May 2010, which makes sense given their comparable accuracy metrics. The ARIMA forecast maintains the weekly cyclical pattern we’ve observed throughout the historical data, with the confidence intervals widening as we move further into the forecast period—this is expected and reflects increasing uncertainty about future values. The forecast appears reasonable, continuing the patterns we’ve seen historically, with the typical weekly oscillations between low and high usage days.
The residual diagnostics look solid. The innovation residuals scatter randomly around zero without any obvious patterns, which is what we want to see. The ACF plot confirms this—nearly all the lags stay within those blue dashed lines, which means there’s no leftover correlation in the residuals. The histogram of residuals is roughly bell-shaped and centered at zero, suggesting the residuals are approximately normally distributed. A few larger outliers appear in the residual plot, but these are relatively rare and don’t seem to follow any systematic pattern. OOverall, the diagnostics tell us that we can be fairly confident in these May 2010 forecasts.
# Prepare ATM2 data
atm2_tsibble <- atm_tsibble %>%
filter(ATM == "ATM2")
# Fit models
atm2_fit <- atm2_tsibble %>%
model(
ets = ETS(Cash),
arima = ARIMA(Cash)
)
cat("ATM2 - Selected ARIMA Model Below:\n")## ATM2 - Selected ARIMA Model Below:
## Series: Cash
## Model: ARIMA(2,0,2)(0,1,1)[7]
##
## Coefficients:
## ar1 ar2 ma1 ma2 sma1
## -0.4320 -0.9130 0.4773 0.8048 -0.7547
## s.e. 0.0553 0.0407 0.0861 0.0556 0.0381
##
## sigma^2 estimated as 602.5: log likelihood=-1653.67
## AIC=3319.33 AICc=3319.57 BIC=3342.61
# Generate forecast
atm2_fc <- atm2_fit %>%
forecast(h = 31)
# Plot ARIMA forecast
atm2_fc %>%
filter(.model == "arima") %>%
autoplot(atm2_tsibble) +
labs(
title = "ATM2: May 2010 Forecast (ARIMA)",
x = "Date",
y = "Cash (hundreds of dollars)"
) +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5))ATM2 Model Interpretation
The ARIMA model for ATM2 handles the weekly patterns and declining trend we saw in the decomposition, similar to what we did with ATM1. The automatic selection process picked a model that best fits ATM2’s particular usage patterns.
The May 2010 forecast continues the weekly up-and-down pattern we’ve seen throughout the year. The forecast looks reasonable and maintains the typical range of variation we’ve observed in the historical data.
The residuals look good—they scatter around zero without any obvious patterns, which tells us the model captured what it needed to from the data. The ACF plot backs this up, with nearly all the lags staying within those bounds. The histogram shows the residuals are roughly normal and centered at zero. There are a few bigger outliers scattered throughout, but they don’t seem to follow any particular pattern or cluster in one time period. Overall, the diagnostics give us good reason to trust these May forecasts.
atm3_tsibble <- atm_tsibble %>%
filter(ATM == "ATM3")
atm3_fit <- atm3_tsibble %>%
model(
arima = ARIMA(Cash)
)
cat("ATM3 - Selected ARIMA Model Below:\n")## ATM3 - Selected ARIMA Model Below:
## Series: Cash
## Model: ARIMA(0,0,2)
##
## Coefficients:
## ma1 ma2
## 0.8392 0.8557
## s.e. 0.0496 0.0611
##
## sigma^2 estimated as 25.4: log likelihood=-1108.69
## AIC=2223.39 AICc=2223.46 BIC=2235.09
atm3_fc <- atm3_fit %>%
forecast(h = 31)
atm3_fc %>%
autoplot(atm3_tsibble) +
labs(
title = "ATM3: May 2010 Forecast (ARIMA)",
x = "Date",
y = "Cash (hundreds of dollars)"
) +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5))ATM3 presents a difficult forecasting situation. With nearly 11 months of zero activity followed by a brief spike in late April, there’s not enough meaningful pattern for the model to learn from. The data is essentially a flat line at zero with one outlier, which makes it challenging to identify any reliable trends or seasonal patterns.
The May 2010 forecast predicts near-zero values, continuing the baseline that dominated the historical period. While this makes sense given what the model saw, it raises real questions about usefulness. If the machine was recently repaired and brought online (as that late April spike suggests), the historical data won’t reflect its future behavior at all. We’d be predicting based on a broken machine’s performance. On the other hand, if it’s still non-functional, we’re forecasting something that won’t generate any cash flow anyway. Before putting any trust into these predictions, ATM3 needs operational investigation to determine if it will actually be functioning in May 2010.
atm4_tsibble <- atm_tsibble %>%
filter(ATM == "ATM4")
atm4_fit <- atm4_tsibble %>%
model(
arima = ARIMA(Cash)
)
cat("ATM4 - Selected ARIMA Model:\n")## ATM4 - Selected ARIMA Model:
## Series: Cash
## Model: ARIMA(0,0,0) w/ mean
##
## Coefficients:
## constant
## 474.0433
## s.e. 34.0248
##
## sigma^2 estimated as 423718: log likelihood=-2882.03
## AIC=5768.06 AICc=5768.1 BIC=5775.86
atm4_fc <- atm4_fit %>%
forecast(h = 31)
atm4_fc %>%
autoplot(atm4_tsibble) +
labs(
title = "ATM4: May 2010 Forecast (ARIMA)",
x = "Date",
y = "Cash (hundreds of dollars)"
) +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5))ATM4 is completely different from the other ATMs, as we discussed before, and the model selection reflects that. The automatic ARIMA process selected ARIMA(0,0,0) with mean, essentially just forecasting the average value of $474 (hundreds of dollars). This might seem overly simplistic, but it makes sense given ATM4’s behavior. With huge irregular spikes reaching $10,000+ and no consistent pattern, there’s no predictable structure for a more complex model to capture.
The May 2010 forecast is a flat line at the historical mean of around $474. Given the extreme variability in the transaction history (re: those huge spikes we saw earlier) this simple average-based forecast isn’t particularly useful for cash management planning. The residual diagnostics show that one spike dominates everything, with the histogram heavily concentrated near zero but with extreme outliers. For ATM4, traditional time series forecasting has limited value—better forecasts would require business intelligence about planned large withdrawals from the commercial clients using this machine.
# Combine all forecasts using ARIMA model
atm_all_forecasts <- bind_rows(
atm1_fc %>% filter(.model == "arima") %>% as_tibble() %>% mutate(ATM = "ATM1"),
atm2_fc %>% filter(.model == "arima") %>% as_tibble() %>% mutate(ATM = "ATM2"),
atm3_fc %>% as_tibble() %>% mutate(ATM = "ATM3"),
atm4_fc %>% as_tibble() %>% mutate(ATM = "ATM4")
)
# Filter for May 2010 only
atm_forecasts <- atm_all_forecasts %>%
filter(month(DATE) == 5, year(DATE) == 2010) %>%
select(DATE, ATM, Cash = .mean) %>%
mutate(Cash = round(Cash, 2)) %>%
arrange(ATM, DATE)
# Display forecast summary
atm_forecasts %>%
group_by(ATM) %>%
summarise(
N_Days = n(),
Mean_Forecast = round(mean(Cash), 2),
Min_Forecast = round(min(Cash), 2),
Max_Forecast = round(max(Cash), 2)
) %>%
knitr::kable(caption = "May 2010 Forecast Summary by ATM")| ATM | N_Days | Mean_Forecast | Min_Forecast | Max_Forecast |
|---|---|---|---|---|
| ATM1 | 31 | 78.27 | 8.03 | 104.06 |
| ATM2 | 31 | 57.71 | 5.29 | 97.66 |
| ATM3 | 31 | 0.13 | 0.00 | 2.61 |
| ATM4 | 31 | 474.04 | 474.04 | 474.04 |
# Export to Excel
write_xlsx(atm_forecasts, "ATM_Forecast_May2010.xlsx")
cat("\n✓ ATM forecasts exported to: ATM_Forecast_May2010.xlsx\n")##
## ✓ ATM forecasts exported to: ATM_Forecast_May2010.xlsx
# Display first few rows
head(atm_forecasts, 20) %>%
knitr::kable(caption = "Sample of ATM Forecasts")| DATE | ATM | Cash |
|---|---|---|
| 2010-05-01 | ATM1 | 87.17 |
| 2010-05-02 | ATM1 | 104.06 |
| 2010-05-03 | ATM1 | 73.15 |
| 2010-05-04 | ATM1 | 8.03 |
| 2010-05-05 | ATM1 | 100.01 |
| 2010-05-06 | ATM1 | 80.86 |
| 2010-05-07 | ATM1 | 86.67 |
| 2010-05-08 | ATM1 | 87.96 |
| 2010-05-09 | ATM1 | 102.75 |
| 2010-05-10 | ATM1 | 73.39 |
| 2010-05-11 | ATM1 | 8.59 |
| 2010-05-12 | ATM1 | 100.65 |
| 2010-05-13 | ATM1 | 80.39 |
| 2010-05-14 | ATM1 | 86.93 |
| 2010-05-15 | ATM1 | 88.03 |
| 2010-05-16 | ATM1 | 102.75 |
| 2010-05-17 | ATM1 | 73.39 |
| 2010-05-18 | ATM1 | 8.59 |
| 2010-05-19 | ATM1 | 100.65 |
| 2010-05-20 | ATM1 | 80.39 |
I applied both Exponential Smoothing (ETS) and ARIMA models to each ATM, allowing automatic selection of optimal specifications. For ATM1 and ATM2, ARIMA models were selected as they provided better accuracy metrics (lower RMSE), captured weekly seasonality effectively, and passed residual diagnostic tests. These consumer-oriented ATMs showed predictable patterns that the models handled well.
ATM3 and ATM4 presented us with different challenges. ATM3 seemed unused for most of the observation period, with about 11 months of zero activity followed by a brief spike. This left limited meaningful data for forecasting. ATM4’s extreme variability and lack of consistent patterns led to selection of a simple ARIMA(0,0,0) with mean specification, basically forecasting the historical average.
ATM1 and ATM2 handle typical consumer banking with withdrawals in the $0-180 range (hundreds of dollars) and clear weekly patterns. Both show a consistent Thursday dip that probably reflects factors like reduced foot traffic or business closures. These predictable cycles make them well-suited for standard time series forecasting.
ATM4 operates on a completely different scale, with transactions reaching $10,000+ (hundreds of dollars)—roughly 100 times larger than the consumer ATMs. The irregular spikes suggest business-driven activity from commercial clients rather than everyday banking patterns. Traditional forecasting based on historical patterns has limited value for this ATM’s data; better predictions would need more data about commercial clients/business uses.
ATM3’s near-zero activity throughout the year suggests operational issues. This requires investigation before any forecasting can be meaningful.
# Load power data
power_data <- read_excel("ResidentialCustomerForecastLoad624.xlsx")
str(power_data)## tibble [192 × 3] (S3: tbl_df/tbl/data.frame)
## $ CaseSequence: num [1:192] 733 734 735 736 737 738 739 740 741 742 ...
## $ YYYY-MMM : chr [1:192] "1998-Jan" "1998-Feb" "1998-Mar" "1998-Apr" ...
## $ KWH : num [1:192] 6862583 5838198 5420658 5010364 4665377 ...
## # A tibble: 12 × 3
## CaseSequence `YYYY-MMM` KWH
## <dbl> <chr> <dbl>
## 1 733 1998-Jan 6862583
## 2 734 1998-Feb 5838198
## 3 735 1998-Mar 5420658
## 4 736 1998-Apr 5010364
## 5 737 1998-May 4665377
## 6 738 1998-Jun 6467147
## 7 739 1998-Jul 8914755
## 8 740 1998-Aug 8607428
## 9 741 1998-Sep 6989888
## 10 742 1998-Oct 6345620
## 11 743 1998-Nov 4640410
## 12 744 1998-Dec 4693479
## CaseSequence YYYY-MMM KWH
## Min. :733.0 Length:192 Min. : 770523
## 1st Qu.:780.8 Class :character 1st Qu.: 5429912
## Median :828.5 Mode :character Median : 6283324
## Mean :828.5 Mean : 6502475
## 3rd Qu.:876.2 3rd Qu.: 7620524
## Max. :924.0 Max. :10655730
## NA's :1
##
## Missing values:
## CaseSequence YYYY-MMM KWH
## 0 0 1
There’s only one missing value in this dataset. Linear interpolation works well here too - monthly power consumption follows seasonal patterns and usually doesn’t have sudden spikes, so estimating the missing month based on the values before and after it makes sense.
# Parse date and handle missing values
power_clean <- power_data %>%
mutate(Date = yearmonth(as.Date(paste0(`YYYY-MMM`, "-01"), format = "%Y-%b-%d"))) %>%
arrange(Date)
# Impute single missing value using linear interpolation
power_clean <- power_clean %>%
mutate(
KWH = approx(x = seq_along(KWH), y = KWH, xout = seq_along(KWH),
method = "linear", rule = 2)$y
)
# Verify no missing values remain
cat("Missing values after imputation:", sum(is.na(power_clean$KWH)), "\n\n")## Missing values after imputation: 0
# Convert to tsibble
power_tsibble <- power_clean %>%
select(Date, KWH) %>%
as_tsibble(index = Date)
# Summary statistics
power_tsibble %>%
summarise(
Start_Date = format(min(Date), "%Y %b"),
End_Date = format(max(Date), "%Y %b"),
N_Months = n(),
Mean_KWH = round(mean(KWH), 0),
SD_KWH = round(sd(KWH), 0),
Min_KWH = round(min(KWH), 0),
Max_KWH = round(max(KWH), 0)
) %>%
knitr::kable(caption = "Power Consumption Summary", format.args = list(big.mark = ","))| Date | Start_Date | End_Date | N_Months | Mean_KWH | SD_KWH | Min_KWH | Max_KWH |
|---|---|---|---|---|---|---|---|
| 1998 Jan | 1998 Jan | 1998 Jan | 1 | 6,862,583 | NA | 6,862,583 | 6,862,583 |
| 1998 Feb | 1998 Feb | 1998 Feb | 1 | 5,838,198 | NA | 5,838,198 | 5,838,198 |
| 1998 Mar | 1998 Mar | 1998 Mar | 1 | 5,420,658 | NA | 5,420,658 | 5,420,658 |
| 1998 Apr | 1998 Apr | 1998 Apr | 1 | 5,010,364 | NA | 5,010,364 | 5,010,364 |
| 1998 May | 1998 May | 1998 May | 1 | 4,665,377 | NA | 4,665,377 | 4,665,377 |
| 1998 Jun | 1998 Jun | 1998 Jun | 1 | 6,467,147 | NA | 6,467,147 | 6,467,147 |
| 1998 Jul | 1998 Jul | 1998 Jul | 1 | 8,914,755 | NA | 8,914,755 | 8,914,755 |
| 1998 Aug | 1998 Aug | 1998 Aug | 1 | 8,607,428 | NA | 8,607,428 | 8,607,428 |
| 1998 Sep | 1998 Sep | 1998 Sep | 1 | 6,989,888 | NA | 6,989,888 | 6,989,888 |
| 1998 Oct | 1998 Oct | 1998 Oct | 1 | 6,345,620 | NA | 6,345,620 | 6,345,620 |
| 1998 Nov | 1998 Nov | 1998 Nov | 1 | 4,640,410 | NA | 4,640,410 | 4,640,410 |
| 1998 Dec | 1998 Dec | 1998 Dec | 1 | 4,693,479 | NA | 4,693,479 | 4,693,479 |
| 1999 Jan | 1999 Jan | 1999 Jan | 1 | 7,183,759 | NA | 7,183,759 | 7,183,759 |
| 1999 Feb | 1999 Feb | 1999 Feb | 1 | 5,759,262 | NA | 5,759,262 | 5,759,262 |
| 1999 Mar | 1999 Mar | 1999 Mar | 1 | 4,847,656 | NA | 4,847,656 | 4,847,656 |
| 1999 Apr | 1999 Apr | 1999 Apr | 1 | 5,306,592 | NA | 5,306,592 | 5,306,592 |
| 1999 May | 1999 May | 1999 May | 1 | 4,426,794 | NA | 4,426,794 | 4,426,794 |
| 1999 Jun | 1999 Jun | 1999 Jun | 1 | 5,500,901 | NA | 5,500,901 | 5,500,901 |
| 1999 Jul | 1999 Jul | 1999 Jul | 1 | 7,444,416 | NA | 7,444,416 | 7,444,416 |
| 1999 Aug | 1999 Aug | 1999 Aug | 1 | 7,564,391 | NA | 7,564,391 | 7,564,391 |
| 1999 Sep | 1999 Sep | 1999 Sep | 1 | 7,899,368 | NA | 7,899,368 | 7,899,368 |
| 1999 Oct | 1999 Oct | 1999 Oct | 1 | 5,358,314 | NA | 5,358,314 | 5,358,314 |
| 1999 Nov | 1999 Nov | 1999 Nov | 1 | 4,436,269 | NA | 4,436,269 | 4,436,269 |
| 1999 Dec | 1999 Dec | 1999 Dec | 1 | 4,419,229 | NA | 4,419,229 | 4,419,229 |
| 2000 Jan | 2000 Jan | 2000 Jan | 1 | 7,068,296 | NA | 7,068,296 | 7,068,296 |
| 2000 Feb | 2000 Feb | 2000 Feb | 1 | 5,876,083 | NA | 5,876,083 | 5,876,083 |
| 2000 Mar | 2000 Mar | 2000 Mar | 1 | 4,807,961 | NA | 4,807,961 | 4,807,961 |
| 2000 Apr | 2000 Apr | 2000 Apr | 1 | 4,873,080 | NA | 4,873,080 | 4,873,080 |
| 2000 May | 2000 May | 2000 May | 1 | 5,050,891 | NA | 5,050,891 | 5,050,891 |
| 2000 Jun | 2000 Jun | 2000 Jun | 1 | 7,092,865 | NA | 7,092,865 | 7,092,865 |
| 2000 Jul | 2000 Jul | 2000 Jul | 1 | 6,862,662 | NA | 6,862,662 | 6,862,662 |
| 2000 Aug | 2000 Aug | 2000 Aug | 1 | 7,517,830 | NA | 7,517,830 | 7,517,830 |
| 2000 Sep | 2000 Sep | 2000 Sep | 1 | 8,912,169 | NA | 8,912,169 | 8,912,169 |
| 2000 Oct | 2000 Oct | 2000 Oct | 1 | 5,844,352 | NA | 5,844,352 | 5,844,352 |
| 2000 Nov | 2000 Nov | 2000 Nov | 1 | 5,041,769 | NA | 5,041,769 | 5,041,769 |
| 2000 Dec | 2000 Dec | 2000 Dec | 1 | 6,220,334 | NA | 6,220,334 | 6,220,334 |
| 2001 Jan | 2001 Jan | 2001 Jan | 1 | 7,538,529 | NA | 7,538,529 | 7,538,529 |
| 2001 Feb | 2001 Feb | 2001 Feb | 1 | 6,602,448 | NA | 6,602,448 | 6,602,448 |
| 2001 Mar | 2001 Mar | 2001 Mar | 1 | 5,779,180 | NA | 5,779,180 | 5,779,180 |
| 2001 Apr | 2001 Apr | 2001 Apr | 1 | 4,835,210 | NA | 4,835,210 | 4,835,210 |
| 2001 May | 2001 May | 2001 May | 1 | 4,787,904 | NA | 4,787,904 | 4,787,904 |
| 2001 Jun | 2001 Jun | 2001 Jun | 1 | 6,283,324 | NA | 6,283,324 | 6,283,324 |
| 2001 Jul | 2001 Jul | 2001 Jul | 1 | 7,855,129 | NA | 7,855,129 | 7,855,129 |
| 2001 Aug | 2001 Aug | 2001 Aug | 1 | 8,450,717 | NA | 8,450,717 | 8,450,717 |
| 2001 Sep | 2001 Sep | 2001 Sep | 1 | 7,112,069 | NA | 7,112,069 | 7,112,069 |
| 2001 Oct | 2001 Oct | 2001 Oct | 1 | 5,242,535 | NA | 5,242,535 | 5,242,535 |
| 2001 Nov | 2001 Nov | 2001 Nov | 1 | 4,461,979 | NA | 4,461,979 | 4,461,979 |
| 2001 Dec | 2001 Dec | 2001 Dec | 1 | 5,240,995 | NA | 5,240,995 | 5,240,995 |
| 2002 Jan | 2002 Jan | 2002 Jan | 1 | 7,099,063 | NA | 7,099,063 | 7,099,063 |
| 2002 Feb | 2002 Feb | 2002 Feb | 1 | 6,413,429 | NA | 6,413,429 | 6,413,429 |
| 2002 Mar | 2002 Mar | 2002 Mar | 1 | 5,839,514 | NA | 5,839,514 | 5,839,514 |
| 2002 Apr | 2002 Apr | 2002 Apr | 1 | 5,371,604 | NA | 5,371,604 | 5,371,604 |
| 2002 May | 2002 May | 2002 May | 1 | 5,439,166 | NA | 5,439,166 | 5,439,166 |
| 2002 Jun | 2002 Jun | 2002 Jun | 1 | 5,850,383 | NA | 5,850,383 | 5,850,383 |
| 2002 Jul | 2002 Jul | 2002 Jul | 1 | 7,039,702 | NA | 7,039,702 | 7,039,702 |
| 2002 Aug | 2002 Aug | 2002 Aug | 1 | 8,058,748 | NA | 8,058,748 | 8,058,748 |
| 2002 Sep | 2002 Sep | 2002 Sep | 1 | 8,245,227 | NA | 8,245,227 | 8,245,227 |
| 2002 Oct | 2002 Oct | 2002 Oct | 1 | 5,865,014 | NA | 5,865,014 | 5,865,014 |
| 2002 Nov | 2002 Nov | 2002 Nov | 1 | 4,908,979 | NA | 4,908,979 | 4,908,979 |
| 2002 Dec | 2002 Dec | 2002 Dec | 1 | 5,779,958 | NA | 5,779,958 | 5,779,958 |
| 2003 Jan | 2003 Jan | 2003 Jan | 1 | 7,256,079 | NA | 7,256,079 | 7,256,079 |
| 2003 Feb | 2003 Feb | 2003 Feb | 1 | 6,190,517 | NA | 6,190,517 | 6,190,517 |
| 2003 Mar | 2003 Mar | 2003 Mar | 1 | 6,120,626 | NA | 6,120,626 | 6,120,626 |
| 2003 Apr | 2003 Apr | 2003 Apr | 1 | 4,885,643 | NA | 4,885,643 | 4,885,643 |
| 2003 May | 2003 May | 2003 May | 1 | 5,296,096 | NA | 5,296,096 | 5,296,096 |
| 2003 Jun | 2003 Jun | 2003 Jun | 1 | 6,051,571 | NA | 6,051,571 | 6,051,571 |
| 2003 Jul | 2003 Jul | 2003 Jul | 1 | 6,900,676 | NA | 6,900,676 | 6,900,676 |
| 2003 Aug | 2003 Aug | 2003 Aug | 1 | 8,476,499 | NA | 8,476,499 | 8,476,499 |
| 2003 Sep | 2003 Sep | 2003 Sep | 1 | 7,791,791 | NA | 7,791,791 | 7,791,791 |
| 2003 Oct | 2003 Oct | 2003 Oct | 1 | 5,344,613 | NA | 5,344,613 | 5,344,613 |
| 2003 Nov | 2003 Nov | 2003 Nov | 1 | 4,913,707 | NA | 4,913,707 | 4,913,707 |
| 2003 Dec | 2003 Dec | 2003 Dec | 1 | 5,756,193 | NA | 5,756,193 | 5,756,193 |
| 2004 Jan | 2004 Jan | 2004 Jan | 1 | 7,584,596 | NA | 7,584,596 | 7,584,596 |
| 2004 Feb | 2004 Feb | 2004 Feb | 1 | 6,560,742 | NA | 6,560,742 | 6,560,742 |
| 2004 Mar | 2004 Mar | 2004 Mar | 1 | 6,526,586 | NA | 6,526,586 | 6,526,586 |
| 2004 Apr | 2004 Apr | 2004 Apr | 1 | 4,831,688 | NA | 4,831,688 | 4,831,688 |
| 2004 May | 2004 May | 2004 May | 1 | 4,878,262 | NA | 4,878,262 | 4,878,262 |
| 2004 Jun | 2004 Jun | 2004 Jun | 1 | 6,421,614 | NA | 6,421,614 | 6,421,614 |
| 2004 Jul | 2004 Jul | 2004 Jul | 1 | 7,307,931 | NA | 7,307,931 | 7,307,931 |
| 2004 Aug | 2004 Aug | 2004 Aug | 1 | 7,309,774 | NA | 7,309,774 | 7,309,774 |
| 2004 Sep | 2004 Sep | 2004 Sep | 1 | 6,690,366 | NA | 6,690,366 | 6,690,366 |
| 2004 Oct | 2004 Oct | 2004 Oct | 1 | 5,444,948 | NA | 5,444,948 | 5,444,948 |
| 2004 Nov | 2004 Nov | 2004 Nov | 1 | 4,824,940 | NA | 4,824,940 | 4,824,940 |
| 2004 Dec | 2004 Dec | 2004 Dec | 1 | 5,791,208 | NA | 5,791,208 | 5,791,208 |
| 2005 Jan | 2005 Jan | 2005 Jan | 1 | 8,225,477 | NA | 8,225,477 | 8,225,477 |
| 2005 Feb | 2005 Feb | 2005 Feb | 1 | 6,564,338 | NA | 6,564,338 | 6,564,338 |
| 2005 Mar | 2005 Mar | 2005 Mar | 1 | 5,581,725 | NA | 5,581,725 | 5,581,725 |
| 2005 Apr | 2005 Apr | 2005 Apr | 1 | 5,563,071 | NA | 5,563,071 | 5,563,071 |
| 2005 May | 2005 May | 2005 May | 1 | 4,453,983 | NA | 4,453,983 | 4,453,983 |
| 2005 Jun | 2005 Jun | 2005 Jun | 1 | 5,900,212 | NA | 5,900,212 | 5,900,212 |
| 2005 Jul | 2005 Jul | 2005 Jul | 1 | 8,337,998 | NA | 8,337,998 | 8,337,998 |
| 2005 Aug | 2005 Aug | 2005 Aug | 1 | 7,786,659 | NA | 7,786,659 | 7,786,659 |
| 2005 Sep | 2005 Sep | 2005 Sep | 1 | 7,057,213 | NA | 7,057,213 | 7,057,213 |
| 2005 Oct | 2005 Oct | 2005 Oct | 1 | 6,694,523 | NA | 6,694,523 | 6,694,523 |
| 2005 Nov | 2005 Nov | 2005 Nov | 1 | 4,313,019 | NA | 4,313,019 | 4,313,019 |
| 2005 Dec | 2005 Dec | 2005 Dec | 1 | 6,181,548 | NA | 6,181,548 | 6,181,548 |
| 2006 Jan | 2006 Jan | 2006 Jan | 1 | 7,793,358 | NA | 7,793,358 | 7,793,358 |
| 2006 Feb | 2006 Feb | 2006 Feb | 1 | 5,914,945 | NA | 5,914,945 | 5,914,945 |
| 2006 Mar | 2006 Mar | 2006 Mar | 1 | 5,819,734 | NA | 5,819,734 | 5,819,734 |
| 2006 Apr | 2006 Apr | 2006 Apr | 1 | 5,255,988 | NA | 5,255,988 | 5,255,988 |
| 2006 May | 2006 May | 2006 May | 1 | 4,740,588 | NA | 4,740,588 | 4,740,588 |
| 2006 Jun | 2006 Jun | 2006 Jun | 1 | 7,052,275 | NA | 7,052,275 | 7,052,275 |
| 2006 Jul | 2006 Jul | 2006 Jul | 1 | 7,945,564 | NA | 7,945,564 | 7,945,564 |
| 2006 Aug | 2006 Aug | 2006 Aug | 1 | 8,241,110 | NA | 8,241,110 | 8,241,110 |
| 2006 Sep | 2006 Sep | 2006 Sep | 1 | 7,296,355 | NA | 7,296,355 | 7,296,355 |
| 2006 Oct | 2006 Oct | 2006 Oct | 1 | 5,104,799 | NA | 5,104,799 | 5,104,799 |
| 2006 Nov | 2006 Nov | 2006 Nov | 1 | 4,458,429 | NA | 4,458,429 | 4,458,429 |
| 2006 Dec | 2006 Dec | 2006 Dec | 1 | 6,226,214 | NA | 6,226,214 | 6,226,214 |
| 2007 Jan | 2007 Jan | 2007 Jan | 1 | 8,031,295 | NA | 8,031,295 | 8,031,295 |
| 2007 Feb | 2007 Feb | 2007 Feb | 1 | 7,928,337 | NA | 7,928,337 | 7,928,337 |
| 2007 Mar | 2007 Mar | 2007 Mar | 1 | 6,443,170 | NA | 6,443,170 | 6,443,170 |
| 2007 Apr | 2007 Apr | 2007 Apr | 1 | 4,841,979 | NA | 4,841,979 | 4,841,979 |
| 2007 May | 2007 May | 2007 May | 1 | 4,862,847 | NA | 4,862,847 | 4,862,847 |
| 2007 Jun | 2007 Jun | 2007 Jun | 1 | 5,022,647 | NA | 5,022,647 | 5,022,647 |
| 2007 Jul | 2007 Jul | 2007 Jul | 1 | 6,426,220 | NA | 6,426,220 | 6,426,220 |
| 2007 Aug | 2007 Aug | 2007 Aug | 1 | 7,447,146 | NA | 7,447,146 | 7,447,146 |
| 2007 Sep | 2007 Sep | 2007 Sep | 1 | 7,666,970 | NA | 7,666,970 | 7,666,970 |
| 2007 Oct | 2007 Oct | 2007 Oct | 1 | 5,785,964 | NA | 5,785,964 | 5,785,964 |
| 2007 Nov | 2007 Nov | 2007 Nov | 1 | 4,907,057 | NA | 4,907,057 | 4,907,057 |
| 2007 Dec | 2007 Dec | 2007 Dec | 1 | 6,047,292 | NA | 6,047,292 | 6,047,292 |
| 2008 Jan | 2008 Jan | 2008 Jan | 1 | 7,964,293 | NA | 7,964,293 | 7,964,293 |
| 2008 Feb | 2008 Feb | 2008 Feb | 1 | 7,597,060 | NA | 7,597,060 | 7,597,060 |
| 2008 Mar | 2008 Mar | 2008 Mar | 1 | 6,085,644 | NA | 6,085,644 | 6,085,644 |
| 2008 Apr | 2008 Apr | 2008 Apr | 1 | 5,352,359 | NA | 5,352,359 | 5,352,359 |
| 2008 May | 2008 May | 2008 May | 1 | 4,608,528 | NA | 4,608,528 | 4,608,528 |
| 2008 Jun | 2008 Jun | 2008 Jun | 1 | 6,548,439 | NA | 6,548,439 | 6,548,439 |
| 2008 Jul | 2008 Jul | 2008 Jul | 1 | 7,643,987 | NA | 7,643,987 | 7,643,987 |
| 2008 Aug | 2008 Aug | 2008 Aug | 1 | 8,037,137 | NA | 8,037,137 | 8,037,137 |
| 2008 Sep | 2008 Sep | 2008 Sep | 1 | 6,569,470 | NA | 6,569,470 | 6,569,470 |
| 2008 Oct | 2008 Oct | 2008 Oct | 1 | 5,101,803 | NA | 5,101,803 | 5,101,803 |
| 2008 Nov | 2008 Nov | 2008 Nov | 1 | 4,555,602 | NA | 4,555,602 | 4,555,602 |
| 2008 Dec | 2008 Dec | 2008 Dec | 1 | 6,442,746 | NA | 6,442,746 | 6,442,746 |
| 2009 Jan | 2009 Jan | 2009 Jan | 1 | 8,072,330 | NA | 8,072,330 | 8,072,330 |
| 2009 Feb | 2009 Feb | 2009 Feb | 1 | 6,976,800 | NA | 6,976,800 | 6,976,800 |
| 2009 Mar | 2009 Mar | 2009 Mar | 1 | 5,691,452 | NA | 5,691,452 | 5,691,452 |
| 2009 Apr | 2009 Apr | 2009 Apr | 1 | 5,531,616 | NA | 5,531,616 | 5,531,616 |
| 2009 May | 2009 May | 2009 May | 1 | 5,264,439 | NA | 5,264,439 | 5,264,439 |
| 2009 Jun | 2009 Jun | 2009 Jun | 1 | 5,804,433 | NA | 5,804,433 | 5,804,433 |
| 2009 Jul | 2009 Jul | 2009 Jul | 1 | 7,713,260 | NA | 7,713,260 | 7,713,260 |
| 2009 Aug | 2009 Aug | 2009 Aug | 1 | 8,350,517 | NA | 8,350,517 | 8,350,517 |
| 2009 Sep | 2009 Sep | 2009 Sep | 1 | 7,583,146 | NA | 7,583,146 | 7,583,146 |
| 2009 Oct | 2009 Oct | 2009 Oct | 1 | 5,566,075 | NA | 5,566,075 | 5,566,075 |
| 2009 Nov | 2009 Nov | 2009 Nov | 1 | 5,339,890 | NA | 5,339,890 | 5,339,890 |
| 2009 Dec | 2009 Dec | 2009 Dec | 1 | 7,089,880 | NA | 7,089,880 | 7,089,880 |
| 2010 Jan | 2010 Jan | 2010 Jan | 1 | 9,397,357 | NA | 9,397,357 | 9,397,357 |
| 2010 Feb | 2010 Feb | 2010 Feb | 1 | 8,390,677 | NA | 8,390,677 | 8,390,677 |
| 2010 Mar | 2010 Mar | 2010 Mar | 1 | 7,347,915 | NA | 7,347,915 | 7,347,915 |
| 2010 Apr | 2010 Apr | 2010 Apr | 1 | 5,776,131 | NA | 5,776,131 | 5,776,131 |
| 2010 May | 2010 May | 2010 May | 1 | 4,919,289 | NA | 4,919,289 | 4,919,289 |
| 2010 Jun | 2010 Jun | 2010 Jun | 1 | 6,696,292 | NA | 6,696,292 | 6,696,292 |
| 2010 Jul | 2010 Jul | 2010 Jul | 1 | 770,523 | NA | 770,523 | 770,523 |
| 2010 Aug | 2010 Aug | 2010 Aug | 1 | 7,922,701 | NA | 7,922,701 | 7,922,701 |
| 2010 Sep | 2010 Sep | 2010 Sep | 1 | 7,819,472 | NA | 7,819,472 | 7,819,472 |
| 2010 Oct | 2010 Oct | 2010 Oct | 1 | 5,875,917 | NA | 5,875,917 | 5,875,917 |
| 2010 Nov | 2010 Nov | 2010 Nov | 1 | 4,800,733 | NA | 4,800,733 | 4,800,733 |
| 2010 Dec | 2010 Dec | 2010 Dec | 1 | 6,152,583 | NA | 6,152,583 | 6,152,583 |
| 2011 Jan | 2011 Jan | 2011 Jan | 1 | 8,394,747 | NA | 8,394,747 | 8,394,747 |
| 2011 Feb | 2011 Feb | 2011 Feb | 1 | 8,898,062 | NA | 8,898,062 | 8,898,062 |
| 2011 Mar | 2011 Mar | 2011 Mar | 1 | 6,356,903 | NA | 6,356,903 | 6,356,903 |
| 2011 Apr | 2011 Apr | 2011 Apr | 1 | 5,685,227 | NA | 5,685,227 | 5,685,227 |
| 2011 May | 2011 May | 2011 May | 1 | 5,506,308 | NA | 5,506,308 | 5,506,308 |
| 2011 Jun | 2011 Jun | 2011 Jun | 1 | 8,037,779 | NA | 8,037,779 | 8,037,779 |
| 2011 Jul | 2011 Jul | 2011 Jul | 1 | 10,093,343 | NA | 10,093,343 | 10,093,343 |
| 2011 Aug | 2011 Aug | 2011 Aug | 1 | 10,308,076 | NA | 10,308,076 | 10,308,076 |
| 2011 Sep | 2011 Sep | 2011 Sep | 1 | 8,943,599 | NA | 8,943,599 | 8,943,599 |
| 2011 Oct | 2011 Oct | 2011 Oct | 1 | 5,603,920 | NA | 5,603,920 | 5,603,920 |
| 2011 Nov | 2011 Nov | 2011 Nov | 1 | 6,154,138 | NA | 6,154,138 | 6,154,138 |
| 2011 Dec | 2011 Dec | 2011 Dec | 1 | 8,273,142 | NA | 8,273,142 | 8,273,142 |
| 2012 Jan | 2012 Jan | 2012 Jan | 1 | 8,991,267 | NA | 8,991,267 | 8,991,267 |
| 2012 Feb | 2012 Feb | 2012 Feb | 1 | 7,952,204 | NA | 7,952,204 | 7,952,204 |
| 2012 Mar | 2012 Mar | 2012 Mar | 1 | 6,356,961 | NA | 6,356,961 | 6,356,961 |
| 2012 Apr | 2012 Apr | 2012 Apr | 1 | 5,569,828 | NA | 5,569,828 | 5,569,828 |
| 2012 May | 2012 May | 2012 May | 1 | 5,783,598 | NA | 5,783,598 | 5,783,598 |
| 2012 Jun | 2012 Jun | 2012 Jun | 1 | 7,926,956 | NA | 7,926,956 | 7,926,956 |
| 2012 Jul | 2012 Jul | 2012 Jul | 1 | 8,886,851 | NA | 8,886,851 | 8,886,851 |
| 2012 Aug | 2012 Aug | 2012 Aug | 1 | 9,612,423 | NA | 9,612,423 | 9,612,423 |
| 2012 Sep | 2012 Sep | 2012 Sep | 1 | 7,559,148 | NA | 7,559,148 | 7,559,148 |
| 2012 Oct | 2012 Oct | 2012 Oct | 1 | 5,576,852 | NA | 5,576,852 | 5,576,852 |
| 2012 Nov | 2012 Nov | 2012 Nov | 1 | 5,731,899 | NA | 5,731,899 | 5,731,899 |
| 2012 Dec | 2012 Dec | 2012 Dec | 1 | 6,609,694 | NA | 6,609,694 | 6,609,694 |
| 2013 Jan | 2013 Jan | 2013 Jan | 1 | 10,655,730 | NA | 10,655,730 | 10,655,730 |
| 2013 Feb | 2013 Feb | 2013 Feb | 1 | 7,681,798 | NA | 7,681,798 | 7,681,798 |
| 2013 Mar | 2013 Mar | 2013 Mar | 1 | 6,517,514 | NA | 6,517,514 | 6,517,514 |
| 2013 Apr | 2013 Apr | 2013 Apr | 1 | 6,105,359 | NA | 6,105,359 | 6,105,359 |
| 2013 May | 2013 May | 2013 May | 1 | 5,940,475 | NA | 5,940,475 | 5,940,475 |
| 2013 Jun | 2013 Jun | 2013 Jun | 1 | 7,920,627 | NA | 7,920,627 | 7,920,627 |
| 2013 Jul | 2013 Jul | 2013 Jul | 1 | 8,415,321 | NA | 8,415,321 | 8,415,321 |
| 2013 Aug | 2013 Aug | 2013 Aug | 1 | 9,080,226 | NA | 9,080,226 | 9,080,226 |
| 2013 Sep | 2013 Sep | 2013 Sep | 1 | 7,968,220 | NA | 7,968,220 | 7,968,220 |
| 2013 Oct | 2013 Oct | 2013 Oct | 1 | 5,759,367 | NA | 5,759,367 | 5,759,367 |
| 2013 Nov | 2013 Nov | 2013 Nov | 1 | 5,769,083 | NA | 5,769,083 | 5,769,083 |
| 2013 Dec | 2013 Dec | 2013 Dec | 1 | 9,606,304 | NA | 9,606,304 | 9,606,304 |
# Time series plot
power_tsibble %>%
autoplot(KWH) +
labs(
title = "Residential Power Consumption (1998-2013)",
x = "Date",
y = "Power Consumption (KWH)"
) +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5)) +
scale_y_continuous(labels = scales::comma)# Seasonal plot
power_tsibble %>%
gg_season(KWH, labels = "both") +
labs(
title = "Seasonal Plot - Power Consumption",
x = "Month",
y = "KWH"
) +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5)) +
scale_y_continuous(labels = scales::comma)# Subseries plot
power_tsibble %>%
gg_subseries(KWH) +
labs(
title = "Seasonal Subseries Plot",
subtitle = "Blue line shows the mean for each month",
x = "Year",
y = "KWH"
) +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5),
axis.text.x = element_text(angle = 45, hjust = 1, size = 8) # Angle the labels
) +
scale_y_continuous(labels = scales::comma)The residential power consumption data from 1998-2013 shows a clear and consistent seasonal pattern. The time series plot reveals regular oscillations throughout the year, with consumption ranging from about 4-5 million KWH in the spring/fall months up to 8-10 million KWH during peak summer and winter periods. There’s a notable anomaly in July 2010 where consumption dropped dramatically to around 770,000 KWH. This appears to be a data recording error or system outage rather than actual consumption, since it’s completely out of line with every other July in the dataset.
The seasonal plot shows that every year follows a similar pattern. Summer months (June-August) consistently show the highest consumption, peaking in July and August when air conditioning demand drives usage up. There’s a secondary peak in winter (December-January) from heating needs, though it’s typically lower than the summer peak. Spring and Fall show the lowest consumption, with April and May being particularly low, as temperatures are mild and central air/heat is not as necessary. The more recent years (2010-2013) in the dataset show slightly higher peaks than previous years, suggesting population growth, increased housing built, and/or changing consumption patterns for other factors. Sometimes power companies raise their prices as well, that could play a role.
The subseries plot breaks down each month across all years, making it easy to spot trends within specific months. August shows the highest average consumption, with January close behind—both averaging around 8.5 million KWH. Both months also display upward trends over time, with August showing considerable year-to-year variability, while January shows a steady increase in recent winters. July also shows high consumption with substantial variability. The spring months (March-May) remain relatively stable across the entire period. The July 2010 outlier is particularly noticeable: it drops dramatically to around 2 million KWH.
# STL decomposition
power_decomp <- power_tsibble %>%
model(
stl = STL(KWH ~ season(window = "periodic"))
) %>%
components()
# Plot decomposition
power_decomp %>%
autoplot() +
labs(title = "STL Decomposition of Power Consumption") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5))# Calculate seasonal strength
power_features <- power_tsibble %>%
features(KWH, feat_stl)
cat("Seasonal Strength:", round(power_features$seasonal_strength_year, 3), "\n")## Seasonal Strength: 0.757
## Trend Strength: 0.394
## Interpretation: Strong seasonal pattern present
The STL decomposition breaks the power consumption into three components: trend, seasonal pattern, and remainder. The trend component shows a gradual increase from around 6 million KWH in the late 1990s to about 7.5 million KWH by 2013. The increase isn’t linear though. There’s a relatively flat period through most of the 2000s, then a noticeable uptick starting around 2010. This could potentially reflect economic recovery after the 2008 recession.
The seasonal component reveals a consistent pattern that repeats year after year. With a seasonal strength of 0.757, seasonality accounts for approximately 76% of the variation in the data, quite a strong influence. The trend strength of 0.394 is more moderate, indicating that while a long-term trend exists, it’s not as dominant as the seasonal pattern.
The remainder component shows the irregular fluctuations that aren’t captured by trend or seasonality. For most of the series, the remainders are relatively small and randomly scattered around zero, which is exactly what we want to see. This tells us the trend and seasonal components are doing a good job explaining the data. The massive negative spike in July 2010 dominates the remainder plot. This suggests it’s an outlier that doesn’t fit the normal patterns. Besides that one observation, the remainders look okay, with no obvious patterns or clustering.
# Fit benchmark models
power_bench <- power_tsibble %>%
model(
naive = NAIVE(KWH),
snaive = SNAIVE(KWH),
drift = RW(KWH ~ drift())
)
# Generate forecasts
power_bench_fc <- power_bench %>%
forecast(h = 12)
# Plot benchmark methods
power_bench_fc %>%
autoplot(power_tsibble, level = NULL) +
labs(
title = "Benchmark Forecasting Methods",
x = "Year",
y = "KWH"
) +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5)) +
scale_y_continuous(labels = scales::comma)## Selected ETS Model:
## Series: KWH
## Model: ETS(M,N,M)
## Smoothing parameters:
## alpha = 0.1137195
## gamma = 0.0001001086
##
## Initial states:
## l[0] s[0] s[-1] s[-2] s[-3] s[-4] s[-5] s[-6]
## 6134773 0.9547463 0.7533319 0.8602685 1.191176 1.255577 1.199206 0.9982427
## s[-7] s[-8] s[-9] s[-10] s[-11]
## 0.7683657 0.8160779 0.9174797 1.061846 1.223683
##
## sigma^2: 0.0144
##
## AIC AICc BIC
## 6224.057 6226.784 6272.919
# Generate forecast
power_ets_fc <- power_ets %>%
forecast(h = 12)
# Plot
power_ets_fc %>%
autoplot(power_tsibble) +
labs(
title = "2014 Power Forecast - ETS Model",
x = "Year",
y = "KWH"
) +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5)) +
scale_y_continuous(labels = scales::comma)The ETS model automatically selected ETS(M,N,M), which stands for multiplicative errors, no trend, and multiplicative seasonality. This makes sense given the data shows strong seasonal swings but a relatively weak long-term trend. The model uses a low alpha (0.114) for the level, meaning it changes slowly and smooths out short-term fluctuations. The gamma parameter for seasonality is extremely small (0.0001), indicating that the seasonal pattern is very stable and doesn’t need much updating from year to year.
The residuals show that the model accurately captures most of the systematic patterns in the data. The innovation residuals are mostly randomly scattered around zero. The ACF plot shows good behavior at most lags, with nearly all autocorrelations falling within the confidence bounds. The histogram of residuals shows approximate normality with some deviation in the tails, which means the prediction intervals might be slightly optimistic but the model is generally solid for this dataset.
# Fit ARIMA model
power_arima <- power_tsibble %>%
model(
arima = ARIMA(KWH)
)
cat("Selected ARIMA Model Below:\n")## Selected ARIMA Model Below:
## Series: KWH
## Model: ARIMA(0,0,1)(1,1,1)[12] w/ drift
##
## Coefficients:
## ma1 sar1 sma1 constant
## 0.2439 -0.1518 -0.7311 105259.65
## s.e. 0.0723 0.0963 0.0821 27000.49
##
## sigma^2 estimated as 7.466e+11: log likelihood=-2719.89
## AIC=5449.79 AICc=5450.13 BIC=5465.75
# Generate forecast
power_arima_fc <- power_arima %>%
forecast(h = 12)
# Plot
power_arima_fc %>%
autoplot(power_tsibble) +
labs(
title = "2014 Power Forecast - ARIMA Model",
x = "Year",
y = "KWH"
) +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5)) +
scale_y_continuous(labels = scales::comma)The ARIMA model specification chosen by automatic selection is ARIMA(0,0,1)(1,1,2)[12] with drift. The seasonal differencing (the middle “1” in the seasonal part) handles the growing trend by looking at year-over-year changes rather than raw levels. The seasonal MA terms capture the repeating annual pattern. The drift term adds that gradual upward trend we see in the decomposition. This combination of features allows the model to adapt to both the long-term growth and the strong seasonal cycles.
The residuals look solid overall. They scatter randomly around zero without any obvious patterns, which suggests the model has extracted the systematic information from the data. The ACF plot is particularly reassuring. Nearly all the autocorrelation lags fall within the confidence bounds, with only a couple of minor exceptions. This indicates the residuals are essentially white noise with no leftover structure that the model missed.
The histogram of residuals is roughly bell-shaped and centered near zero, though it shows some slight departures from perfect normality. There’s a bit of a tail on the positive side, and the distribution looks slightly wider than a normal curve would be. This isn’t a major concern for forecasting purposes, but it does mean the prediction intervals might be slightly optimistic. The extreme outlier from July 2010 appears in the residual plot as a massive negative spike, but it’s an isolated incident that doesn’t affect the overall model quality.
# Combine all models
power_all_models <- power_tsibble %>%
model(
snaive = SNAIVE(KWH),
ets = ETS(KWH),
arima = ARIMA(KWH)
)
# Compare accuracy metrics
cat("Model Accuracy Comparison:\n\n")## Model Accuracy Comparison:
accuracy(power_all_models) %>%
arrange(RMSE) %>%
select(.model, RMSE, MAE, MAPE) %>%
knitr::kable(digits = 2, caption = "Model Performance Comparison")| .model | RMSE | MAE | MAPE |
|---|---|---|---|
| arima | 827254.2 | 493305.9 | 11.68 |
| ets | 835107.0 | 503972.8 | 12.04 |
| snaive | 1178104.4 | 696708.7 | 14.62 |
# Generate forecasts for all models
power_all_fc <- power_all_models %>%
forecast(h = 12)
# Plot comparison
power_all_fc %>%
autoplot(power_tsibble, level = NULL) +
labs(
title = "2014 Forecast Comparison",
x = "Year",
y = "KWH"
) +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5)) +
scale_y_continuous(labels = scales::comma)The model comparison reveals that ARIMA significantly outperforms the benchmark methods. The ARIMA model achieves an RMSE of 827,254 KWH and a MAPE of 11.68%, making it the most accurate of the three approaches tested. The ETS model comes in a close second with an RMSE of 835,107 KWH and MAPE of 12.04%. These two sophisticated models perform similarly because they’re both capturing the seasonal dynamics effectively, just through slightly different methods.
The seasonal naive model, which simply uses last year’s value for each month as the forecast, performs noticeably worse with an RMSE of 1,178,104 KWH and MAPE of 14.62%. This makes sense given that there’s an upward trend in the data. Just repeating last year’s values will systematically under-forecast as consumption grows. The fact that seasonal naive still performs somewhat well shows how strong and consistent the seasonal pattern is. Even this simple approach captures much of the variation.
The forecast comparison plot shows all three models projected into 2014. The ARIMA (red) and ETS (green) forecasts are nearly identical and closely overlap, both continuing the seasonal pattern while trending slightly upward. The seasonal naive forecast (blue) also captures the seasonal pattern but sits noticeably higher than the other two models, particularly in the later months of 2014. This occurs because the seasonal naive method simply repeats the values from 2013, which included some exceptionally high consumption months, without adjusting for trend. Unlike ARIMA and ETS models which can incorporate damping to gradually flatten trends over longer forecast horizons (making predictions more conservative as uncertainty increases), the seasonal naive approach mechanically repeats past patterns without any such adjustment, which can lead to overly optimistic or pessimistic forecasts depending on the recent data.
# Extract ARIMA forecasts
power_forecast <- power_arima_fc %>%
hilo(level = c(80, 95)) %>%
unpack_hilo(c("80%", "95%")) %>%
as_tibble() %>%
mutate(
Date = as.Date(Date),
`YYYY-MMM` = format(Date, "%Y-%b"),
KWH_Forecast = round(.mean, 0),
Lower_80 = round(`80%_lower`, 0),
Upper_80 = round(`80%_upper`, 0),
Lower_95 = round(`95%_lower`, 0),
Upper_95 = round(`95%_upper`, 0)
) %>%
select(Date, `YYYY-MMM`, KWH_Forecast, Lower_80, Upper_80, Lower_95, Upper_95)
# Display forecast
power_forecast %>%
knitr::kable(
caption = "2014 Monthly Power Consumption Forecast (ARIMA)",
format.args = list(big.mark = ",")
)| Date | YYYY-MMM | KWH_Forecast | Lower_80 | Upper_80 | Lower_95 | Upper_95 |
|---|---|---|---|---|---|---|
| 2014-01-01 | 2014-Jan | 9,714,361 | 8,607,037 | 10,821,686 | 8,020,854 | 11,407,868 |
| 2014-02-01 | 2014-Feb | 8,181,495 | 7,041,708 | 9,321,281 | 6,438,342 | 9,924,647 |
| 2014-03-01 | 2014-Mar | 6,737,578 | 5,597,792 | 7,877,365 | 4,994,426 | 8,480,731 |
| 2014-04-01 | 2014-Apr | 5,957,336 | 4,817,549 | 7,097,122 | 4,214,183 | 7,700,488 |
| 2014-05-01 | 2014-May | 5,727,983 | 4,588,197 | 6,867,770 | 3,984,831 | 7,471,136 |
| 2014-06-01 | 2014-Jun | 7,534,009 | 6,394,222 | 8,673,795 | 5,790,856 | 9,277,162 |
| 2014-07-01 | 2014-Jul | 7,921,051 | 6,781,265 | 9,060,837 | 6,177,898 | 9,664,204 |
| 2014-08-01 | 2014-Aug | 9,292,410 | 8,152,624 | 10,432,196 | 7,549,257 | 11,035,563 |
| 2014-09-01 | 2014-Sep | 8,182,948 | 7,043,162 | 9,322,734 | 6,439,795 | 9,926,101 |
| 2014-10-01 | 2014-Oct | 6,015,721 | 4,875,934 | 7,155,507 | 4,272,568 | 7,758,873 |
| 2014-11-01 | 2014-Nov | 5,769,225 | 4,629,439 | 6,909,011 | 4,026,072 | 7,512,378 |
| 2014-12-01 | 2014-Dec | 7,484,127 | 6,344,341 | 8,623,913 | 5,740,975 | 9,227,279 |
# Export to Excel
write_xlsx(power_forecast, "Power_Forecast_2014.xlsx")
cat("\n✓ Power forecasts exported to: Power_Forecast_2014.xlsx\n")##
## ✓ Power forecasts exported to: Power_Forecast_2014.xlsx
The 2014 ARIMA forecast continues the established seasonal pattern while incorporating the gradual upward trend. The model predicts January 2014 around 9.7 million KWH, which is lower than January 2013’s exceptionally high 10.7 million KWH but still elevated compared to earlier years, suggesting the model is applying some dampening to extreme recent values. Summer peaks in August reach approximately 9.3 million KWH, consistent with recent summer consumption patterns.
The spring months show the expected seasonal dip, with April and May forecasted around 5.7-5.9 million KWH. Fall and winter months follow the typical seasonal pattern, with December forecasted around 7.5 million KWH. The forecasts generally trend upward from historical averages while avoiding simply extrapolating the most extreme recent values.
The confidence intervals provide important context for these point forecasts. The 80% intervals range from about ±1 million KWH around the forecasts in early months, widening slightly as we move further into 2014. The 95% intervals span roughly ±1.5 million KWH. These intervals reflect the model’s uncertainty and remind us that while the seasonal pattern is reliable, weather extremes, economic changes, or other factors could easily push actual consumption outside the predicted range for any given month.
I compared three forecasting approaches: benchmark methods (seasonal naive), ETS (automatically selected model with seasonal components), and ARIMA (selected model to handle both trend and seasonality). The ARIMA model was chosen for the final forecast because it achieved the lowest RMSE, passed residual diagnostic tests with residuals approximating white noise, and successfully captured both the seasonal pattern and underlying trend.
The data exhibits strong seasonality with clear peaks in summer (July-August) for cooling and winter (December-January) for heating. The overall consumption trend is gradually increasing across the 16-year period, though it remained relatively stable through most of the 2000s before picking up around 2010. The ARIMA model successfully captured these seasonal dynamics while accounting for the upward trend. The confidence intervals provide realistic uncertainty bounds that reflect inherent variability in power consumption.
The 2014 forecasts show expected summer peaks around 9.3 million KWH, winter peaks in January near 9.7 million KWH, and lower consumption in spring and fall around 5.7-6.0 million KWH. These patterns are consistent with historical data but adjusted upward based on recent trends. Time series forecasting works particularly well for residential power consumption because the seasonal patterns are strong and driven by predictable factors like weather and human behavior. However, these forecasts assume conditions similar to recent history. Unusual weather patterns, economic shifts, or changes in energy efficiency could cause actual consumption to diverge from our predictions.