The data has been loaded. I need to convert the date column to a Date type, transform the atm_data data frame into a tsibble indexed by date and keyed by ATM, and rename the date column to be more descriptive.
atm_data <- read_excel("ATM624Data.xlsx", col_types = c("date", "text", "numeric"))|>
mutate(DATE = as_date(DATE)) |>
rename(Date = DATE) |>
# converting to tsibble
as_tsibble(index = Date, key = ATM) |>
# selecting from May 2009 to April 2010
filter(Date < "2010-05-01")
head(atm_data)
## # A tsibble: 6 x 3 [1D]
## # Key: ATM [1]
## Date ATM Cash
## <date> <chr> <dbl>
## 1 2009-05-01 ATM1 96
## 2 2009-05-02 ATM1 82
## 3 2009-05-03 ATM1 85
## 4 2009-05-04 ATM1 90
## 5 2009-05-05 ATM1 99
## 6 2009-05-06 ATM1 88
summary(atm_data)
## Date ATM Cash
## Min. :2009-05-01 Length:1460 Min. : 0.0
## 1st Qu.:2009-07-31 Class :character 1st Qu.: 0.5
## Median :2009-10-30 Mode :character Median : 73.0
## Mean :2009-10-30 Mean : 155.6
## 3rd Qu.:2010-01-29 3rd Qu.: 114.0
## Max. :2010-04-30 Max. :10919.8
## NA's :5
atm_data |>
  as_tibble() |>  # drop the tsibble structure so summarise() works per ATM
  group_by(ATM) |>
  summarise(`Missing Values` = sum(is.na(Cash)))
## # A tibble: 4 × 2
## ATM `Missing Values`
## <chr> <int>
## 1 ATM1 3
## 2 ATM2 2
## 3 ATM3 0
## 4 ATM4 0
ATM1 and ATM2 have 3 and 2 missing Cash values, respectively. Additionally, ATM3 has 362 rows with a Cash value of 0, which may indicate either that no withdrawals occurred on those days or that the zeros are placeholders for missing data.
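The 362 zero-value days can be verified with a quick count; a minimal sketch, assuming the atm_data tsibble built above:
# Sketch: count ATM3's zero-cash days (expecting 362)
atm_data |>
  as_tibble() |>
  filter(ATM == "ATM3", Cash == 0) |>
  nrow()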
atm_data |>
  autoplot(Cash) +  # autoplot() already draws a line per series
  facet_wrap(~ATM, scales = "free_y", ncol = 1) + # facet by ATM, one panel per row
  labs(
    title = "ATM Cash Dispensed Over Time",
    y = "Cash Dispensed in Hundreds",
    x = "Time in Days"
  ) +
  theme_minimal()
The plot for ATM4 shows a dramatic spike in February, where the recorded value reaches about 10,920 (with Cash measured in hundreds, roughly $1.09 million), drastically higher than the next recorded high. This suggests a data error, as it is unlikely the ATM would have been loaded with that much cash when surrounding days show far lower values.
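To pin down the date of the spike before deciding how to treat it, a small check like the following could be used (a sketch; slice_max() is from dplyr and assumes the tsibble above):
# Sketch: inspect the largest Cash observations for ATM4
atm_data |>
  as_tibble() |>
  filter(ATM == "ATM4") |>
  slice_max(Cash, n = 3)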
The first step is to handle the missing data. Because the missing entries are isolated single days, I will apply neighbor interpolation: each missing value is replaced by the average of the last observed and next observed values.
# Replace each interior NA with the mean of its immediate neighbors.
# If one neighbor is itself NA, na.rm = TRUE falls back to the other;
# NAs at the very start or end of the series are left untouched.
impute_neighbors <- function(x) {
  for (i in which(is.na(x))) {
    if (i > 1 && i < length(x)) {
      x[i] <- mean(c(x[i - 1], x[i + 1]), na.rm = TRUE)
    }
  }
  return(x)
}
atm_data |>
  group_by(ATM) |>
  filter(is.na(Cash)) |>
  print()
## # A tsibble: 5 x 3 [1D]
## # Key: ATM [2]
## # Groups: ATM [2]
## Date ATM Cash
## <date> <chr> <dbl>
## 1 2009-06-13 ATM1 NA
## 2 2009-06-16 ATM1 NA
## 3 2009-06-22 ATM1 NA
## 4 2009-06-18 ATM2 NA
## 5 2009-06-24 ATM2 NA
atm_data |>
  filter(ATM == "ATM1") |>
  summary()
## Date ATM Cash
## Min. :2009-05-01 Length:365 Min. : 1.00
## 1st Qu.:2009-07-31 Class :character 1st Qu.: 73.00
## Median :2009-10-30 Mode :character Median : 91.00
## Mean :2009-10-30 Mean : 83.89
## 3rd Qu.:2010-01-29 3rd Qu.:108.00
## Max. :2010-04-30 Max. :180.00
## NA's :3
atm_1 <- atm_data |>
  filter(ATM == "ATM1")
atm_1_no_na <- atm_1 |>
  mutate(Cash = impute_neighbors(Cash))
atm_1_no_na[43:53, ] |> print()
## # A tsibble: 11 x 3 [1D]
## # Key: ATM [1]
## Date ATM Cash
## <date> <chr> <dbl>
## 1 2009-06-12 ATM1 142
## 2 2009-06-13 ATM1 131
## 3 2009-06-14 ATM1 120
## 4 2009-06-15 ATM1 106
## 5 2009-06-16 ATM1 107
## 6 2009-06-17 ATM1 108
## 7 2009-06-18 ATM1 21
## 8 2009-06-19 ATM1 140
## 9 2009-06-20 ATM1 110
## 10 2009-06-21 ATM1 115
## 11 2009-06-22 ATM1 112.
The function worked as intended, and we now have calculated values for the previously missing cash entries.
stl_model_1 <- atm_1_no_na |>
model(STL(Cash ~ season(window = "periodic")))
# Plot the components of the STL decomposition
stl_model_1 |>
components() |>
autoplot() +
labs(title = "STL Decomposition of Cash for ATM1",
y = "Cash Dispensed",
x = "Date")
atm_1_no_na |>
gg_tsdisplay(Cash, plot_type = "partial") +
labs(title = "ACF and PACF of Cash for ATM1",
y = "Cash Dispensed",
x = "Lag")
atm_1_no_na |>
  features(Cash, c(unitroot_ndiffs, unitroot_nsdiffs, unitroot_kpss))
## # A tibble: 1 × 5
## ATM ndiffs nsdiffs kpss_stat kpss_pvalue
## <chr> <int> <int> <dbl> <dbl>
## 1 ATM1 1 1 0.509 0.0395
The trend component does not show a clear, sustained increase or decrease; instead, it fluctuates, suggesting a lack of a strong long-term trend in cash withdrawals for ATM1. The seasonal component, however, is more prominent, displaying a consistent weekly pattern that highlights predictable cycles in ATM usage. The residuals capture the remaining variability, with occasional spikes likely due to random events or anomalies outside the usual weekly patterns.
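To back these visual impressions with numbers, STL-based features can quantify how strong the trend and weekly seasonality are. A minimal sketch using feat_stl() from the feasts package (strength values near 1 indicate a dominant component):
# Sketch: trend and seasonal strength measures for ATM1
atm_1_no_na |>
  features(Cash, feat_stl)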
Given that the ACF and PACF plots indicate strong weekly seasonality (with significant lags at 7, 14, and 21 days), applying seasonal differencing with a 7-day lag can help remove this repetitive pattern.
Seasonal differencing with a 7-day lag transforms the series by subtracting from each value its counterpart 7 days earlier (y[t] - y[t-7]), which removes the weekly pattern and makes the series more stationary.
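As a diagnostic sketch only (the ARIMA() call below selects the seasonal difference automatically), the lag-7 difference can be computed with tsibble's difference() and re-tested for stationarity:
# Sketch: lag-7 seasonal difference, then re-run the KPSS test
atm_1_no_na |>
  mutate(Cash_sdiff = difference(Cash, lag = 7)) |>
  filter(!is.na(Cash_sdiff)) |>  # drop the first week of NAs
  features(Cash_sdiff, unitroot_kpss)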
atm1_lambda <- atm_1_no_na |>
  features(Cash, features = guerrero) |>
  pull(lambda_guerrero) |>
  round(4)
print(atm1_lambda)
## [1] 0.2616
atm1_fit <- atm_1_no_na |>
  model(ARIMA(Cash))

atm1_fitbc <- atm_1_no_na |>
  model(auto = ARIMA(box_cox(Cash, atm1_lambda), stepwise = FALSE, approx = FALSE))
report(atm1_fit)
## Series: Cash
## Model: ARIMA(0,0,1)(0,1,2)[7]
##
## Coefficients:
## ma1 sma1 sma2
## 0.1950 -0.5806 -0.1037
## s.e. 0.0546 0.0506 0.0494
##
## sigma^2 estimated as 556.4: log likelihood=-1640.01
## AIC=3288.03 AICc=3288.14 BIC=3303.55
report(atm1_fitbc)
## Series: Cash
## Model: ARIMA(0,0,2)(0,1,1)[7]
## Transformation: box_cox(Cash, atm1_lambda)
##
## Coefficients:
## ma1 ma2 sma1
## 0.1126 -0.1094 -0.6418
## s.e. 0.0524 0.0520 0.0432
##
## sigma^2 estimated as 1.765: log likelihood=-610.02
## AIC=1228.05 AICc=1228.16 BIC=1243.57
Model 2 (ARIMA(0,0,2)(0,1,1)[7] with the Box-Cox transformation) appears to be the better model. Its AIC and sigma^2 are computed on the transformed scale, so they are not directly comparable to the untransformed model's values; the preference rests on the transformation stabilizing the variance and on the residual diagnostics shown below. This model should yield more reliable and accurate forecasts for cash withdrawals at ATM1.
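Because training accuracy from accuracy() is reported on the original Cash scale (fable back-transforms the fitted values), it should offer a fairer side-by-side comparison than the information criteria. A minimal sketch, assuming the two mables above:
# Sketch: compare training errors on the original Cash scale
bind_rows(
  accuracy(atm1_fit),
  accuracy(atm1_fitbc)
)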
# Generate the May 2010 forecast (31 days)
may_forecast_1 <- atm1_fitbc |>
  forecast(h = "31 days")
may_forecast_1
## # A fable: 31 x 5 [1D]
## # Key: ATM, .model [1]
## ATM .model Date Cash .mean
## <chr> <chr> <date> <dist> <dbl>
## 1 ATM1 auto 2010-05-01 t(N(8.5, 1.8)) 92.1
## 2 ATM1 auto 2010-05-02 t(N(8.9, 1.8)) 107.
## 3 ATM1 auto 2010-05-03 t(N(8, 1.8)) 78.9
## 4 ATM1 auto 2010-05-04 t(N(1.8, 1.8)) 5.56
## 5 ATM1 auto 2010-05-05 t(N(8.9, 1.8)) 106.
## 6 ATM1 auto 2010-05-06 t(N(8.2, 1.8)) 84.7
## 7 ATM1 auto 2010-05-07 t(N(8.4, 1.8)) 91.3
## 8 ATM1 auto 2010-05-08 t(N(8.5, 2)) 93.5
## 9 ATM1 auto 2010-05-09 t(N(8.9, 2)) 107.
## 10 ATM1 auto 2010-05-10 t(N(8, 2)) 79.6
## # ℹ 21 more rows
may_forecast_1 |>
  autoplot(atm_1_no_na, level = 95) +
  labs(
    title = "May 2010 forecast using lambda = 0.2616 and h = 31",
    y = "Cash in Hundreds"
  )
atm1_fitbc |>
  gg_tsresiduals()
The chosen model, ARIMA(0,0,2)(0,1,1)[7] on the Box-Cox scale, effectively removes the seasonal lag spikes observed in the earlier ACF/PACF plots. The residuals' distribution suggests they are mostly white noise, indicating a well-fitted model with minimal autocorrelation in the errors.
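A portmanteau test can formalize this impression. A sketch, using dof = 3 for the three estimated coefficients (ma1, ma2, sma1); the lag choice of 14 (two seasonal periods) is an assumed convention:
# Sketch: Ljung-Box test on the innovation residuals for ATM1
atm1_fitbc |>
  augment() |>
  features(.innov, ljung_box, lag = 14, dof = 3)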
We have two missing values for ATM2. To handle these, we apply the same neighbor interpolation used for ATM1, replacing each missing value with the average of the previous and next observed values.

Date       | ATM  | Cash
2009-06-18 | ATM2 | NA
2009-06-24 | ATM2 | NA
atm_2 <- atm_data |>
  filter(ATM == "ATM2")
atm_2_no_na <- atm_2 |>
  mutate(Cash = impute_neighbors(Cash))
atm_2_no_na[48:56, ] |> print()
## # A tsibble: 9 x 3 [1D]
## # Key: ATM [1]
## Date ATM Cash
## <date> <chr> <dbl>
## 1 2009-06-17 ATM2 24
## 2 2009-06-18 ATM2 79
## 3 2009-06-19 ATM2 134
## 4 2009-06-20 ATM2 95
## 5 2009-06-21 ATM2 82
## 6 2009-06-22 ATM2 90
## 7 2009-06-23 ATM2 99
## 8 2009-06-24 ATM2 51
## 9 2009-06-25 ATM2 3
atm_2_no_na |>
  autoplot(Cash)
As with ATM1, the interpolation produced sensible values for the previously missing cash entries.
stl_model_2 <- atm_2_no_na |>
model(STL(Cash ~ season(window = "periodic")))
# Plot the components of ATM2 STL decomposition
stl_model_2 |>
components() |>
autoplot() +
labs(title = "STL Decomposition of Cash for ATM2",
y = "Cash Dispensed",
x = "Date")
atm_2_no_na |>
gg_tsdisplay(Cash, plot_type = "partial") +
labs(title = "ACF and PACF of Cash for ATM2",
y = "Cash Dispensed",
x = "Lag")
atm_2_no_na |>
  features(Cash, c(unitroot_ndiffs, unitroot_nsdiffs, unitroot_kpss))
## # A tibble: 1 × 5
## ATM ndiffs nsdiffs kpss_stat kpss_pvalue
## <chr> <int> <int> <dbl> <dbl>
## 1 ATM2 1 1 2.15 0.01
Overall, the STL decomposition provides valuable insights:
- Weekly seasonality is strong and consistent, and should drive the forecasting approach (it can also be inspected directly, as sketched below).
- The trend component shows moderate fluctuations without a strong long-term direction.
- The remainder includes some irregular spikes, indicating occasional deviations from the seasonal pattern.
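The weekly pattern can be viewed directly with gg_season() from feasts; a sketch, where setting period to a week is an assumption appropriate for daily data:
# Sketch: overlay weeks to show the day-of-week pattern for ATM2
atm_2_no_na |>
  gg_season(Cash, period = "week") +
  labs(title = "ATM2: day-of-week pattern", y = "Cash in Hundreds")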
atm2_lambda <- atm_2_no_na |>
  features(Cash, features = guerrero) |>
  pull(lambda_guerrero) |>
  round(4)
print(atm2_lambda)
## [1] 0.7243
atm2_fit <- atm_2_no_na |>
  model(ARIMA(Cash))

atm2_fitbc <- atm_2_no_na |>
  model(auto = ARIMA(box_cox(Cash, atm2_lambda), stepwise = FALSE, approx = FALSE))
report(atm2_fit)
## Series: Cash
## Model: ARIMA(2,0,2)(0,1,1)[7]
##
## Coefficients:
## ar1 ar2 ma1 ma2 sma1
## -0.4320 -0.9130 0.4773 0.8048 -0.7547
## s.e. 0.0553 0.0407 0.0861 0.0556 0.0381
##
## sigma^2 estimated as 602.5: log likelihood=-1653.67
## AIC=3319.33 AICc=3319.57 BIC=3342.61
report(atm2_fitbc)
## Series: Cash
## Model: ARIMA(2,0,2)(0,1,1)[7]
## Transformation: box_cox(Cash, atm2_lambda)
##
## Coefficients:
## ar1 ar2 ma1 ma2 sma1
## -0.5997 -0.8396 0.7154 0.8221 -0.6941
## s.e. 0.0879 0.0795 0.0862 0.0918 0.0447
##
## sigma^2 estimated as 55.58: log likelihood=-1226.66
## AIC=2465.32 AICc=2465.56 BIC=2488.6
- Model fit: the Box-Cox transformed model (Model 2) shows far lower AIC, AICc, and BIC values and a much smaller residual variance (sigma^2), though these are computed on the transformed scale and so are not directly comparable to the untransformed model's values.
- Variance stability: the Box-Cox transformation stabilizes the variance, reducing noise in the residuals and enhancing model accuracy.
- Prediction reliability: with stabilized variance and cleaner residuals, the transformed model is more likely to yield reliable forecasts.
atm2_fitbc |>
  gg_tsresiduals() +
  labs(title = "Residuals of ATM2\nARIMA(2,0,2)(0,1,1)[7] model\nwith Box-Cox lambda = 0.7243",
       y = "Cash Dispensed",
       x = "Lag")
# Forecast May 2010 (31 days) for ATM2
may_forecast_2 <- atm2_fitbc |>
  forecast(h = "31 days")
# Plot the forecast along with the original data for ATM2
autoplot(atm_2_no_na, Cash) +
  autolayer(may_forecast_2, level = 95) +
  labs(
    title = "May 2010 forecast for ATM2 using lambda = 0.7243 and h = 31",
    y = "Cash in Hundreds",
    x = "Date"
  )
atm2_fitbc |>
  augment() |>
  features(.innov, box_pierce, lag = 7, dof = 2)
## # A tibble: 1 × 4
## ATM .model bp_stat bp_pvalue
## <chr> <chr> <dbl> <dbl>
## 1 ATM1 auto 5.56 0.351
The Box-Pierce test result (p-value = 0.351) is comfortably above 0.05, indicating no significant autocorrelation in the residuals and suggesting the model is capturing the data's patterns effectively.
For ATM3, only the last 3 days of the series contain nonzero observations; everything before them is zero. With effectively 3 data points, building a statistical predictive model isn't feasible. A practical alternative is to use the average of those 3 observations as a simple estimate of future values.
atm_3 <- atm_data |>
  filter(ATM == "ATM3")
summary(atm_3)
## Date ATM Cash
## Min. :2009-05-01 Length:365 Min. : 0.0000
## 1st Qu.:2009-07-31 Class :character 1st Qu.: 0.0000
## Median :2009-10-30 Mode :character Median : 0.0000
## Mean :2009-10-30 Mean : 0.7206
## 3rd Qu.:2010-01-29 3rd Qu.: 0.0000
## Max. :2010-04-30 Max. :96.0000
atm_3_mfit <- atm_3 |>
  filter(Cash > 0) |>  # keep only the three nonzero observations
  model(MEAN(Cash))
may_forecast_3 <- atm_3_mfit |>
  forecast(h = 31)
may_forecast_3 |>
  autoplot(atm_3, level = 95) +
  labs(title = "Forecast for ATM3 Cash", y = "Cash", x = "Date")
The earlier plots of each ATM's original time series showed at least one clear outlier in ATM4, so we now identify all outliers for this series.
atm_4 <- atm_data |>
  filter(ATM == "ATM4")
summary(atm_4)
## Date ATM Cash
## Min. :2009-05-01 Length:365 Min. : 1.563
## 1st Qu.:2009-07-31 Class :character 1st Qu.: 124.334
## Median :2009-10-30 Mode :character Median : 403.839
## Mean :2009-10-30 Mean : 474.043
## 3rd Qu.:2010-01-29 3rd Qu.: 704.507
## Max. :2010-04-30 Max. :10919.762
atm_4 |>
  autoplot(Cash)
# Calculate IQR bounds for outliers (avoid shadowing base R's IQR())
q1 <- quantile(atm_4$Cash, 0.25)
q3 <- quantile(atm_4$Cash, 0.75)
iqr <- q3 - q1
# Define bounds, using a 2.5 multiplier (wider than the usual 1.5 rule)
lower_bound <- q1 - 2.5 * iqr
upper_bound <- q3 + 2.5 * iqr
# Identify outliers
outliers <- atm_4 |>
  filter(Cash < lower_bound | Cash > upper_bound)
outliers
## # A tsibble: 1 x 3 [1D]
## # Key: ATM [1]
## Date ATM Cash
## <date> <chr> <dbl>
## 1 2010-02-09 ATM4 10920.
Winsorize (cap): we replace the outlier with the upper limit defined by the IQR-based threshold (Q3 + 2.5 × IQR). This retains the point as a high value while reducing its influence on model fitting.
# Cap the outlier at the upper bound computed above
atm_4 <- atm_4 |>
  mutate(Cash = if_else(Cash > upper_bound, upper_bound, Cash))
atm_4 |>
  filter(Date == "2010-02-09")
## # A tsibble: 1 x 3 [1D]
## # Key: ATM [1]
## Date ATM Cash
## <date> <chr> <dbl>
## 1 2010-02-09 ATM4 2155.
# STL decomposition for ATM4
stl_model_4 <- atm_4 |>
  model(STL(Cash ~ season(window = "periodic")))
# Plot the components of ATM 4 STL decomposition
stl_model_4 |>
components() |>
autoplot() +
labs(title = "STL Decomposition of Cash for ATM4",
y = "Cash Dispensed",
x = "Date")
# ACF and PACF of Cash for ATM 4
atm_4 |>
gg_tsdisplay(Cash, plot_type = "partial") +
labs(title = "ACF and PACF of Cash for ATM4",
y = "Cash Dispensed",
x = "Lag")
# Test for differencing needs
atm_4 |>
features(Cash, c(unitroot_ndiffs, unitroot_nsdiffs, unitroot_kpss))
## # A tibble: 1 × 5
## ATM ndiffs nsdiffs kpss_stat kpss_pvalue
## <chr> <int> <int> <dbl> <dbl>
## 1 ATM4 0 0 0.0863 0.1
The ACF suggests the series is close to stationary: only the lag-7 spike rises slightly above the significance threshold, and it is small. This agrees with the unit-root tests, which indicate no differencing is needed (ndiffs = 0, nsdiffs = 0).
atm4_lambda <- atm_4 |>
features(Cash, features = guerrero) |>
pull(lambda_guerrero) |>
round(4)
print(atm4_lambda)
## [1] 0.409
atm4_fit <- atm_4 |>
model(
ARIMA(Cash)
)
atm4_fitbc <- atm_4 |>
model(
auto = ARIMA(box_cox(Cash, atm4_lambda), stepwise = FALSE, approx = FALSE)
)
report(atm4_fit)
## Series: Cash
## Model: ARIMA(0,0,0)(1,0,0)[7] w/ mean
##
## Coefficients:
## sar1 constant
## 0.1667 374.3727
## s.e. 0.0519 18.5923
##
## sigma^2 estimated as 127821: log likelihood=-2662.91
## AIC=5331.83 AICc=5331.9 BIC=5343.53
report(atm4_fitbc)
## Series: Cash
## Model: ARIMA(0,0,0)(2,0,0)[7] w/ mean
## Transformation: box_cox(Cash, atm4_lambda)
##
## Coefficients:
## sar1 sar2 constant
## 0.2200 0.1805 14.4595
## s.e. 0.0518 0.0522 0.5495
##
## sigma^2 estimated as 115.2: log likelihood=-1383.18
## AIC=2774.36 AICc=2774.48 BIC=2789.96
The Box-Cox transformed ARIMA(0,0,0)(2,0,0)[7] model provides the better fit. Its residual variance (115.2 on the transformed scale, versus 127,821 on the original scale) and its AIC (2774.36 vs. 5331.83) and BIC (2789.96 vs. 5343.53) are much lower, though, as noted earlier, these criteria are computed on different scales and so favor the transformed series by construction. The stabilized variance nonetheless suggests the transformed model captures the underlying patterns more reliably.
# Forecasting 31 days ahead for ATM 4
may_forecast_4 <- atm4_fitbc |>
forecast(h = 31)
# View the forecast results
print(may_forecast_4)
## # A fable: 31 x 5 [1D]
## # Key: ATM, .model [1]
## ATM .model Date Cash .mean
## <chr> <chr> <date> <dist> <dbl>
## 1 ATM4 auto 2010-05-01 t(N(20, 115)) 307.
## 2 ATM4 auto 2010-05-02 t(N(24, 115)) 448.
## 3 ATM4 auto 2010-05-03 t(N(26, 115)) 513.
## 4 ATM4 auto 2010-05-04 t(N(17, 115)) 241.
## 5 ATM4 auto 2010-05-05 t(N(25, 115)) 477.
## 6 ATM4 auto 2010-05-06 t(N(21, 115)) 346.
## 7 ATM4 auto 2010-05-07 t(N(24, 115)) 428.
## 8 ATM4 auto 2010-05-08 t(N(19, 121)) 301.
## 9 ATM4 auto 2010-05-09 t(N(25, 121)) 480.
## 10 ATM4 auto 2010-05-10 t(N(25, 121)) 472.
## # ℹ 21 more rows
autoplot(atm_4, Cash) +
  autolayer(may_forecast_4, level = 95) +
  labs(title = "31-Day Forecast for ATM4 Cash with Winsorized Outlier",
       y = "Cash", x = "Date") +
  theme_minimal()
The 31-day forecast for ATM4 shows the forecast variance growing as the horizon extends (from roughly 115 in the first week to 121 by day ten), so the prediction intervals widen for dates further into the period, reflecting increasing uncertainty. The point forecasts themselves settle into the familiar weekly pattern.
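The widening is visible numerically as well; extracting the 95% intervals with hilo() (a sketch, assuming the fable above) shows the bounds spreading with the horizon:
# Sketch: pull the 95% prediction intervals from the fable
may_forecast_4 |>
  hilo(level = 95) |>
  select(Date, .mean, `95%`) |>
  head()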
# Convert each forecast to a tibble to remove the tsibble structure
forecasts <- bind_rows(
as_tibble(may_forecast_1),
as_tibble(may_forecast_2),
as_tibble(may_forecast_3),
as_tibble(may_forecast_4)
) |>
select(Date, ATM, .mean) |>
rename(Cash = .mean)
# Round the Cash column to zero decimal places
forecasts$Cash <- round(forecasts$Cash, 0)
head(forecasts)
## # A tibble: 6 × 3
## Date ATM Cash
## <date> <chr> <dbl>
## 1 2010-05-01 ATM1 92
## 2 2010-05-02 ATM1 107
## 3 2010-05-03 ATM1 79
## 4 2010-05-04 ATM1 6
## 5 2010-05-05 ATM1 106
## 6 2010-05-06 ATM1 85
write_xlsx(forecasts, path = "project1atmsfc.xlsx")
power_data <- read_xlsx(
  "ResidentialCustomerForecastLoad-624.xlsx",
  col_types = c("text", "text", "numeric")
) |>
  # Parse the `YYYY-MMM` strings (e.g., "1998-Jan") into a yearmonth
  mutate(DATE = yearmonth(`YYYY-MMM`)) |>
  select(-`YYYY-MMM`)
# Display the first few rows to verify
head(power_data)
## # A tibble: 6 × 3
## CaseSequence KWH DATE
## <chr> <dbl> <mth>
## 1 733 6862583 1998 Jan
## 2 734 5838198 1998 Feb
## 3 735 5420658 1998 Mar
## 4 736 5010364 1998 Apr
## 5 737 4665377 1998 May
## 6 738 6467147 1998 Jun
power_data <- power_data |> as_tsibble(index = DATE)
power_data
## # A tsibble: 192 x 3 [1M]
## CaseSequence KWH DATE
## <chr> <dbl> <mth>
## 1 733 6862583 1998 Jan
## 2 734 5838198 1998 Feb
## 3 735 5420658 1998 Mar
## 4 736 5010364 1998 Apr
## 5 737 4665377 1998 May
## 6 738 6467147 1998 Jun
## 7 739 8914755 1998 Jul
## 8 740 8607428 1998 Aug
## 9 741 6989888 1998 Sep
## 10 742 6345620 1998 Oct
## # ℹ 182 more rows
summary(power_data$KWH)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 770523 5429912 6283324 6502475 7620524 10655730 1
autoplot(power_data, KWH)
After exploring and plotting the data, we observed a missing value in September 2008. Given the series' regular seasonal oscillation, the median of 6,283,324 KWH is a reasonable fill value that keeps the imputed point consistent with the observed pattern.
# Check for missing values
power_data |>
filter(is.na(KWH))
## # A tsibble: 1 x 3 [1M]
## CaseSequence KWH DATE
## <chr> <dbl> <mth>
## 1 861 NA 2008 Sep
# Fill NA in KWH with the median
power_data$KWH[is.na(power_data$KWH)] <- median(power_data$KWH, na.rm = TRUE)
summary(power_data$KWH)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 770523 5434539 6283324 6501333 7608792 10655730
power_data |>
model(STL(KWH ~ season(window = "periodic"))) |>
components() |>
autoplot()
The STL decomposition reveals a clear upward trend in KWH usage, with a strong seasonal pattern that repeats each year. Most of the variability is well explained by the trend and seasonal components, except for an unusual spike in the remainder in July 2010, corresponding to a sharp drop in KWH usage. This anomaly warrants further investigation, as it could be due to an external factor affecting energy consumption during that period.
We will explore several models to determine the best fit. Given the observed trend and seasonality, an ARIMA model appears to be a promising choice; however, we will also fit alternative models to ensure we select the most accurate option. I will use an 80/20 train/test split, which should work well since we have 16 years of monthly data.
split_point <- floor(0.8 * 192)                            # 153
train_data <- power_data |> slice(1:split_point)           # first 153 rows
test_data  <- power_data |> slice((split_point + 1):n())   # remaining 39 rows
power_models <- train_data |>
model(
ARIMA = ARIMA(KWH),
ETS = ETS(KWH),
STL_Naive = decomposition_model(
STL(KWH ~ season(window = "periodic")),
NAIVE(season_adjust)
)
)
print(power_models)
## # A mable: 1 x 3
## ARIMA ETS STL_Naive
## <model> <model> <model>
## 1 <ARIMA(1,0,1)(0,1,2)[12]> <ETS(M,N,M)> <STL decomposition model>
model_accuracy <- power_models |>
accuracy()|>
arrange(RMSE)
model_accuracy
## # A tibble: 3 × 10
## .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ARIMA Training 32723. 760559. 450434. -4.77 11.9 0.720 0.810 -0.0607
## 2 ETS Training 20126. 803310. 461807. -5.78 12.7 0.738 0.856 0.163
## 3 STL_Naive Training 9850. 991277. 581872. -5.84 14.7 0.930 1.06 -0.444
forecasts <- power_models |>
forecast(new_data = test_data)
forecasts |>
  autoplot(power_data)
forecasts |>
  autoplot(test_data, level = NULL) +
  labs(title = "Model Forecasts vs Actual KWH (Test Period)",
       x = "Date",
       y = "KWH",
       color = "Model")
power_models |>
select(ARIMA) |>
gg_tsresiduals()
The ARIMA(1,0,1)(0,1,2)[12] model was chosen because it effectively captures both short-term patterns and monthly seasonality, as indicated by its strong performance across key accuracy metrics. Its configuration (one autoregressive term, one moving-average term, a single seasonal difference, and two seasonal moving-average terms) makes it well suited to the cyclical and seasonal nature of the data.
Despite the outlier in July 2010, the ARIMA model still performs well overall. The residuals are generally centered around zero with minimal autocorrelation, indicating that the model has captured the underlying patterns effectively. While the anomaly introduces a slight deviation, it doesn’t significantly affect the model’s stability or its ability to produce accurate forecasts for the majority of the data.
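A Ljung-Box test provides a formal check of that claim. A sketch, with dof = 4 for the four estimated ARMA coefficients (ar1, ma1, sma1, sma2) and a 24-month lag window as an assumed convention:
# Sketch: portmanteau test on the ARIMA residuals
power_models |>
  select(ARIMA) |>
  augment() |>
  features(.innov, ljung_box, lag = 24, dof = 4)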
forecasts_2014 <- power_models |>
  select(ARIMA) |>
  forecast(h = 51)  # 51 months from Oct 2010 reaches Dec 2014
forecasts_2014 <- tail(forecasts_2014, 12)  # keep only the 12 months of 2014
forecasts_2014 |>
  autoplot(power_data, level = 95) +
  labs(
    title = "2014 KWH Forecast using ARIMA(1,0,1)(0,1,2)[12] (train-split model)"
  )
final_arima_model <- power_data |>
model(
ARIMA = ARIMA(KWH ~ pdq(1,0,1) + PDQ(0,1,2))
)
report(final_arima_model)
## Series: KWH
## Model: ARIMA(1,0,1)(0,1,2)[12] w/ drift
##
## Coefficients:
## ar1 ma1 sma1 sma2 constant
## 0.4044 -0.1540 -0.8972 0.1339 54993.10
## s.e. 0.2902 0.3156 0.0823 0.0821 15811.29
##
## sigma^2 estimated as 7.477e+11: log likelihood=-2719.65
## AIC=5451.3 AICc=5451.79 BIC=5470.46
accuracy(final_arima_model)
## # A tibble: 1 × 10
## .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ARIMA Training -21280. 825523. 499648. -5.43 11.7 0.714 0.699 0.00302
future_final_forecast <- final_arima_model |>
  forecast(h = 12)

autoplot(test_data, KWH) +
  autolayer(future_final_forecast, level = 95, color = "blue") +
  autolayer(forecasts_2014, level = NULL, color = "red") +
  labs(title = "Forecast Comparison: Actual KWH, Full-Data 2014 Forecast (blue), Train-Split 2014 Forecast (red)",
       x = "Date", y = "KWH",
       color = "Legend") +
  theme_minimal()
The two lines above represent the same specification: the red line is the model trained on 80% of the data, while the blue line is the same model re-estimated on the entire dataset. The fully trained model aligns well in both seasonality and trend, likely because the additional training data dilutes the impact of the July 2010 outlier.
print(future_final_forecast)
## # A fable: 12 x 4 [1M]
## # Key: .model [1]
## .model DATE KWH .mean
## <chr> <mth> <dist> <dbl>
## 1 ARIMA 2014 Jan N(9730543, 7.5e+11) 9730543.
## 2 ARIMA 2014 Feb N(8468331, 7.9e+11) 8468331.
## 3 ARIMA 2014 Mar N(6851880, 8e+11) 6851880.
## 4 ARIMA 2014 Apr N(6e+06, 8e+11) 5999883.
## 5 ARIMA 2014 May N(5731409, 8e+11) 5731409.
## 6 ARIMA 2014 Jun N(7528743, 8e+11) 7528743.
## 7 ARIMA 2014 Jul N(7852173, 8e+11) 7852173.
## 8 ARIMA 2014 Aug N(9297497, 8e+11) 9297497.
## 9 ARIMA 2014 Sep N(8179597, 8e+11) 8179597.
## 10 ARIMA 2014 Oct N(6e+06, 8e+11) 6014848.
## 11 ARIMA 2014 Nov N(5770802, 8e+11) 5770802.
## 12 ARIMA 2014 Dec N(7477616, 8e+11) 7477616.
future_final_forecastl <- future_final_forecast |>
as_tibble() |>
select(.model, DATE, .mean) |>
rename(Model = .model, KWH = .mean) |>
mutate(DATE = format(DATE, "%Y-%m-%d")) # Convert DATE to ISO 8601 format
head(future_final_forecastl)
## # A tibble: 6 × 3
## Model DATE KWH
## <chr> <chr> <dbl>
## 1 ARIMA 2014-01-01 9730543.
## 2 ARIMA 2014-02-01 8468331.
## 3 ARIMA 2014-03-01 6851880.
## 4 ARIMA 2014-04-01 5999883.
## 5 ARIMA 2014-05-01 5731409.
## 6 ARIMA 2014-06-01 7528743.
write_xlsx(future_final_forecastl, path = "project1powerfc.xlsx")