Taking a look at the data:
atm_data <- read_excel("/Users/aelsaeyed/Downloads/ATM624Data.xlsx")
head(atm_data)
## # A tibble: 6 × 3
## DATE ATM Cash
## <dbl> <chr> <dbl>
## 1 39934 ATM1 96
## 2 39934 ATM2 107
## 3 39935 ATM1 82
## 4 39935 ATM2 89
## 5 39936 ATM1 85
## 6 39936 ATM2 90
Modifying the date data to be human readable and removing NAs. Also arranging the data to be grouped by ATMs. Im using fill_gaps here because I found out later on after debugging some errors that the data contained implicit gaps that weren’t caught by simply removing the NAs. Filling the gaps will add in those dates and populate the rows with NA.
atm_ts <- atm_data %>%
mutate(DATE = as.Date(DATE, origin = "1899-12-30")) %>%
as_tsibble(index = DATE, key = ATM) %>%
fill_gaps()
atm_ts
## # A tsibble: 1,474 x 3 [1D]
## # Key: ATM [5]
## DATE ATM Cash
## <date> <chr> <dbl>
## 1 2009-05-01 ATM1 96
## 2 2009-05-02 ATM1 82
## 3 2009-05-03 ATM1 85
## 4 2009-05-04 ATM1 90
## 5 2009-05-05 ATM1 99
## 6 2009-05-06 ATM1 88
## 7 2009-05-07 ATM1 8
## 8 2009-05-08 ATM1 104
## 9 2009-05-09 ATM1 87
## 10 2009-05-10 ATM1 93
## # ℹ 1,464 more rows
summary(atm_ts)
## DATE ATM Cash
## Min. :2009-05-01 Length:1474 Min. : 0.0
## 1st Qu.:2009-08-01 Class :character 1st Qu.: 0.5
## Median :2009-11-01 Mode :character Median : 73.0
## Mean :2009-10-31 Mean : 155.6
## 3rd Qu.:2010-02-01 3rd Qu.: 114.0
## Max. :2010-05-14 Max. :10919.8
## NA's :19
Check for NA:
rows_with_na <- atm_ts[!complete.cases(atm_ts), ]
print(rows_with_na)
## # A tsibble: 19 x 3 [1D]
## # Key: ATM [3]
## DATE ATM Cash
## <date> <chr> <dbl>
## 1 2009-06-13 ATM1 NA
## 2 2009-06-16 ATM1 NA
## 3 2009-06-22 ATM1 NA
## 4 2009-06-18 ATM2 NA
## 5 2009-06-24 ATM2 NA
## 6 2010-05-01 <NA> NA
## 7 2010-05-02 <NA> NA
## 8 2010-05-03 <NA> NA
## 9 2010-05-04 <NA> NA
## 10 2010-05-05 <NA> NA
## 11 2010-05-06 <NA> NA
## 12 2010-05-07 <NA> NA
## 13 2010-05-08 <NA> NA
## 14 2010-05-09 <NA> NA
## 15 2010-05-10 <NA> NA
## 16 2010-05-11 <NA> NA
## 17 2010-05-12 <NA> NA
## 18 2010-05-13 <NA> NA
## 19 2010-05-14 <NA> NA
It looks like there are some rows that are missing both ATM and Cash data. There isn’t much we can do about those missing both so I will remove them. There are also some rows which are missing just the Cash value. For those we can interpolate using ARIMA.
Remove the rows where both cash and atm are NA:
atm_ts <- atm_ts %>% filter(!(is.na(ATM) & is.na(Cash)))
rows_with_na <- atm_ts[!complete.cases(atm_ts), ]
print(rows_with_na)
## # A tsibble: 5 x 3 [1D]
## # Key: ATM [2]
## DATE ATM Cash
## <date> <chr> <dbl>
## 1 2009-06-13 ATM1 NA
## 2 2009-06-16 ATM1 NA
## 3 2009-06-22 ATM1 NA
## 4 2009-06-18 ATM2 NA
## 5 2009-06-24 ATM2 NA
Use ARIMA to interpolate the remaining missing cash data
atm_ts <- atm_ts %>%
model(ARIMA(Cash)) %>%
interpolate(atm_ts)
rows_with_na <- atm_ts[!complete.cases(atm_ts), ]
print(rows_with_na)
## # A tsibble: 0 x 3 [?]
## # Key: ATM [0]
## # ℹ 3 variables: ATM <chr>, DATE <date>, Cash <dbl>
Now the data should be clean as well as fully available. Taking a look at the time series data for each atm:
ggplot(atm_ts, aes(x = DATE, y = Cash, color = ATM)) +
geom_line() +
labs(title = "Cash Trends Over Time by ATM", x = "Date", y = "Cash Balance") +
theme_minimal() +
facet_wrap(~ ATM, scales = "free_y")
ATM3 looks like it was unused for most of the year.
There is a pretty large outlier in ATM4. At first I tried to run the forecasting without removing it and the numbers looked very wrong. Removing it got me results that made more sense.
I will be replacing it with mean of surrounding values:
atm_ts <- atm_ts %>%
group_by(ATM) %>%
mutate(
Cash = if_else(
Cash > 3000,
(lag(Cash, default = first(Cash)) + lead(Cash, default = last(Cash))) / 2,
Cash
)
) %>%
ungroup()
The graphs look much better now:
ggplot(atm_ts, aes(x = DATE, y = Cash, color = ATM)) +
geom_line() +
labs(title = "Cash Trends Over Time by ATM", x = "Date", y = "Cash Balance") +
theme_minimal() +
facet_wrap(~ ATM, scales = "free_y")
Decomposition:
We apply decomposition to expose patterns in the time series data.
for (atm in unique(atm_ts$ATM)) {
atm_ts_filtered <- atm_ts %>% filter(ATM == atm)
decomposition <- atm_ts_filtered %>%
model(STL(Cash ~ trend(window = 7) + season(window = "periodic"), robust = TRUE)) %>%
components() %>%
autoplot() +
labs(title = paste(atm, "STL Decomposition 7 Day"))
print(decomposition)
}
ATMs 1,2, and 4 show strong seasonality. Since we know ATM3 is mostly zeroes we can ignore its seasonality. There is no obvious trend.
I will first use ARIMA and see what models it chooses for these ATMs:
forecasts <- atm_ts %>%
model(ARIMA(Cash)) %>%
forecast(h = "1 month")
for (atm in unique(forecasts$ATM)) {
atm_forecast <- forecasts %>% filter(ATM == atm)
forecast_plot <- atm_forecast %>%
autoplot(atm_ts) +
labs(title = paste("ARIMA Forecast for", atm, "Cash Withdrawals (May 2010)"),
y = "Cash Withdrawals (in hundreds of dollars)", x = "Date")
print(forecast_plot)
write_xlsx(forecasts %>% as_tibble(), "/Users/aelsaeyed/Downloads/ATM_Cash_Forecast_May2010.xlsx")
}
We can print out the components of the models and take a look at what ARIMA thought was appropriate for each.
fitted_models <- atm_ts %>% model(fit = ARIMA(Cash))
fitted_models %>% tidy()
## # A tibble: 17 × 7
## ATM .model term estimate std.error statistic p.value
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 ATM1 fit ma1 0.190 0.0547 3.46 5.95e- 4
## 2 ATM1 fit sma1 -0.578 0.0506 -11.4 6.28e-26
## 3 ATM1 fit sma2 -0.107 0.0495 -2.16 3.17e- 2
## 4 ATM2 fit ar1 -0.424 0.0547 -7.75 9.35e-14
## 5 ATM2 fit ar2 -0.913 0.0420 -21.8 1.85e-67
## 6 ATM2 fit ma1 0.460 0.0849 5.42 1.07e- 7
## 7 ATM2 fit ma2 0.801 0.0572 14.0 8.42e-36
## 8 ATM2 fit sma1 -0.749 0.0388 -19.3 2.24e-57
## 9 ATM3 fit ma1 0.839 0.0496 16.9 8.38e-48
## 10 ATM3 fit ma2 0.856 0.0611 14.0 5.20e-36
## 11 ATM4 fit ar1 -0.878 0.100 -8.75 8.16e-17
## 12 ATM4 fit ar2 -0.677 0.119 -5.71 2.38e- 8
## 13 ATM4 fit ar3 0.170 0.0564 3.01 2.77e- 3
## 14 ATM4 fit ma1 0.963 0.0906 10.6 3.38e-23
## 15 ATM4 fit ma2 0.807 0.108 7.47 5.88e-13
## 16 ATM4 fit sar1 0.248 0.0556 4.47 1.06e- 5
## 17 ATM4 fit constant 795. 48.9 16.3 4.19e-45
ATM1: sma1 and sma2 indicate a seasonal component in the moving average ATM2: sma1, same as above ATM4: sar1 shows a seasonal autoregressive component
ATM3 is the outlier since a lot of its cash withdrawals are 0. ARIMA did not detect any seasonal structure and so only a standard ARIMA model was used- no consideration for seasonality.
The forecast for ATM3 didnt look accurate, so I decided to see if I would get better results with another model given the lack of data.
I wanted to use ETS to forecast as well, and knowing the above I switched things up so that the ATMs with obvious seasonality were treated as such and not to bother with the one that didn’t have seasonality.
ets_forecasts <- list()
for (atm in unique(atm_data$ATM)) {
atm_ts_filtered <- atm_ts %>%
filter(ATM == atm)
ets_model <- tryCatch({
if (atm == "ATM3") {
atm_ts_filtered %>%
model(ETS(Cash ~ error("A") + trend("A") + season("N")))
} else {
atm_ts_filtered %>%
model(ETS(Cash ~ error("A") + trend("A") + season("A")))
}
}, error = function(e) {
message(paste("Error in ETS model for", atm, ":", e$message))
NULL
})
if (!is.null(ets_model)) {
ets_forecast <- ets_model %>%
forecast(h = 31)
ets_forecasts[[atm]] <- ets_forecast
ets_plot <- ets_forecast %>%
autoplot(atm_ts_filtered) +
labs(title = paste("ETS Forecast for", atm), y = "Cash Withdrawals") +
theme_minimal()
print(ets_plot)
}
}
The forecasts for ATM3 and 4 look a little better using ETS. The ETS forecast for 4 captures the seasonality while the ARIMA one does not.
The ETS forecast for 3- while working with very little data- does capture the fact that the average withdrawal will be higher unlike the ARIMA forecast.
Overall I think the ETS forecasts are a little better than ARIMA so I will submit those.
ets_forecast_df <- bind_rows(
lapply(names(ets_forecasts), function(atm) {
ets_forecasts[[atm]] %>%
as_tibble() %>%
select(ATM = .model, DATE, Forecast = .mean) %>%
mutate(ATM = atm)
})
)
write_xlsx(ets_forecast_df, "/Users/aelsaeyed/Downloads/ATM_Cash_Forecast_May2010_ETS_UPDATED.xlsx")
print("Forecasts have been successfully exported to ATM_Cash_Forecast_May2010_ETS_UPDATED.xlsx")
## [1] "Forecasts have been successfully exported to ATM_Cash_Forecast_May2010_ETS_UPDATED.xlsx"
Taking a look at the data:
power_data <- read_excel("/Users/aelsaeyed/Downloads/ResidentialCustomerForecastLoad-624.xlsx")
head(power_data)
## # A tibble: 6 × 3
## CaseSequence `YYYY-MMM` KWH
## <dbl> <chr> <dbl>
## 1 733 1998-Jan 6862583
## 2 734 1998-Feb 5838198
## 3 735 1998-Mar 5420658
## 4 736 1998-Apr 5010364
## 5 737 1998-May 4665377
## 6 738 1998-Jun 6467147
The power usage data is collected once a month. I will proceed to clean up the date column and check for implicit and explicitly missing data off the bat.
# Convert YYYY-MMM to Date format and keep only monthly entries
power_data <- power_data %>%
mutate(DATE = yearmonth(paste0(`YYYY-MMM`, "-01"))) %>%
select(-`YYYY-MMM`) %>%
rename(KWH = KWH, CaseSequence = CaseSequence)
power_ts <- power_data %>%
as_tsibble(index = DATE) %>%
select(DATE, KWH) %>%
fill_gaps()
missing_data <- power_ts[!complete.cases(power_ts), ]
missing_data
## # A tsibble: 1 x 2 [1M]
## DATE KWH
## <mth> <dbl>
## 1 2008 Sep NA
It looks like there is a row with missing KWH data. I will interpolate using ARIMA:
# Interpolate missing KWH values using ARIMA
power_ts <- power_ts %>%
model(ARIMA(KWH)) %>%
interpolate(power_ts)
missing_data_after_interpolation <- power_ts[!complete.cases(power_ts), ]
missing_data_after_interpolation
## # A tsibble: 0 x 2 [?]
## # ℹ 2 variables: DATE <mth>, KWH <dbl>
Now we can take a look at the time series data:
# Plot the time series data
ggplot(power_ts, aes(x = DATE, y = KWH)) +
geom_line(color = "blue") +
labs(title = "Monthly Residential Power Usage (KWH)",
x = "Date", y = "Power Consumption (KWH)") +
theme_minimal()
At first glance there seems to be an upward trend as well as some seasonality. We will make sure using decomposition:
# Decompose the time series using STL
decomposition <- power_ts %>%
model(STL(KWH ~ trend(window = 13) + season(window = "periodic"), robust = TRUE)) %>%
components()
# Plot the decomposition
autoplot(decomposition) +
labs(title = "STL Decomposition of Monthly Residential Power Usage",
y = "Power Consumption (KWH)") +
theme_minimal()
There is a clear upward trend as well as yearly seasonality. There is also relatively low randomness except for a downward spike near 2010 (blackout?) Ill address this before moving forward. Since it only seems to be one month, Ill identify the minimum value and replace it with the mean of the surrounding values.
power_ts <- power_ts %>%
mutate(KWH = if_else(KWH < 3e+06,
(lag(KWH, default = NA) + lead(KWH, default = NA)) / 2,
KWH))
# Display any remaining values below 3e+06 to verify
power_ts %>% filter(KWH < 3e+06)
## # A tsibble: 0 x 2 [?]
## # ℹ 2 variables: DATE <mth>, KWH <dbl>
Plotting the data again:
ggplot(power_ts, aes(x = DATE, y = KWH)) +
geom_line(color = "blue") +
labs(title = "Monthly Residential Power Usage (KWH)",
x = "Date", y = "Power Consumption (KWH)") +
theme_minimal()
Now the decomposition should be cleaner:
# Decompose the time series using STL
decomposition <- power_ts %>%
model(STL(KWH ~ trend(window = 13) + season(window = "periodic"), robust = TRUE)) %>%
components()
# Plot the decomposition
autoplot(decomposition) +
labs(title = "STL Decomposition of Monthly Residential Power Usage",
y = "Power Consumption (KWH)") +
theme_minimal()
Now we can move on to forecasting. First, ARIMA:
# ARIMA Forecast
arima_model <- power_ts %>%
model(ARIMA(KWH ~ season()))
# Generate ARIMA forecast for 2014
arima_forecast <- arima_model %>% forecast(h = "1 year")
# Plot ARIMA forecast
autoplot(arima_forecast) +
labs(title = "ARIMA Forecast for 2014", y = "Power Consumption (KWH)", x = "Date") +
theme_minimal()
# Plot ARIMA forecast with original data
autoplot(power_ts, KWH) +
autolayer(arima_forecast, series = "ARIMA Forecast", PI = TRUE) +
labs(title = "ARIMA Forecast for 2014 with Original Data",
y = "Power Consumption (KWH)", x = "Date") +
theme_minimal()
Next: ETS
# ETS Forecast
ets_model <- power_ts %>%
model(ETS(KWH))
# Generate ETS forecast for 2014
ets_forecast <- ets_model %>% forecast(h = "1 year")
# Plot ETS forecast
autoplot(ets_forecast) +
labs(title = "ETS Forecast for 2014", y = "Power Consumption (KWH)", x = "Date") +
theme_minimal()
# Plot ETS forecast with original data
autoplot(power_ts, KWH) +
autolayer(ets_forecast, series = "ETS Forecast", PI = TRUE) +
labs(title = "ETS Forecast for 2014 with Original Data",
y = "Power Consumption (KWH)", x = "Date") +
theme_minimal()
Both forecasts seem to overlap.
# Plot original data with both ARIMA and ETS forecasts
autoplot(power_ts, KWH) +
autolayer(arima_forecast, series = "ARIMA Forecast", PI = FALSE, color = "blue") +
autolayer(ets_forecast, series = "ETS Forecast", PI = FALSE, color = "red") +
labs(title = "Comparison of ARIMA and ETS Forecasts for 2014",
y = "Power Consumption (KWH)", x = "Date") +
scale_color_manual(values = c("ARIMA Forecast" = "blue", "ETS Forecast" = "red")) +
theme_minimal() +
theme(legend.position = "bottom")
We can now interrogate the ARIMA model to see what we can glean:
# Display the ARIMA model coefficients
arima_coefs <- arima_model %>% tidy()
arima_coefs
## # A tibble: 16 × 6
## .model term estimate std.error statistic p.value
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 ARIMA(KWH ~ season()) ma1 -5.68e-1 8.46e-2 -6.71 2.18e-10
## 2 ARIMA(KWH ~ season()) ma2 -3.34e-1 8.09e-2 -4.13 5.48e- 5
## 3 ARIMA(KWH ~ season()) sar1 4.95e-1 3.15e-1 1.57 1.18e- 1
## 4 ARIMA(KWH ~ season()) sar2 8.01e-2 9.32e-2 0.860 3.91e- 1
## 5 ARIMA(KWH ~ season()) sma1 -4.55e-1 3.07e-1 -1.48 1.40e- 1
## 6 ARIMA(KWH ~ season()) season()year2 -1.15e+6 2.20e+5 -5.25 3.98e- 7
## 7 ARIMA(KWH ~ season()) season()year3 -2.15e+6 2.66e+5 -8.06 8.29e-14
## 8 ARIMA(KWH ~ season()) season()year4 -2.80e+6 2.65e+5 -10.6 6.73e-21
## 9 ARIMA(KWH ~ season()) season()year5 -3.07e+6 2.66e+5 -11.6 8.53e-24
## 10 ARIMA(KWH ~ season()) season()year6 -1.56e+6 2.66e+5 -5.87 1.91e- 8
## 11 ARIMA(KWH ~ season()) season()year7 -2.79e+5 2.66e+5 -1.05 2.95e- 1
## 12 ARIMA(KWH ~ season()) season()year8 1.83e+5 2.66e+5 0.688 4.92e- 1
## 13 ARIMA(KWH ~ season()) season()year9 -4.51e+5 2.66e+5 -1.69 9.19e- 2
## 14 ARIMA(KWH ~ season()) season()year10 -2.50e+6 2.66e+5 -9.41 1.66e-17
## 15 ARIMA(KWH ~ season()) season()year11 -3.20e+6 2.65e+5 -12.1 2.37e-25
## 16 ARIMA(KWH ~ season()) season()year12 -1.89e+6 2.21e+5 -8.57 3.49e-15
We learn from this that the model chosen by arima has strong moving average components: ma1 and ma2 with coefficients -0.568 and -0.334. This indicates that despite removing the large outlier around 2010, there may be some other shocks to the trend.
The SAR terms sar1 and sar2 have coefficients of 0.495 and 0.0801, and fairly low p-values. I interpret this as meaning the seasonal autoregressivity is low- previous season values don’t affect the upcoming seasons very much.
I will be submitting the ARIMA forecast:
# Select only the date and mean forecast columns for export
arima_forecast_clean <- arima_forecast %>%
as_tibble() %>%
select(DATE, Forecasted_KWH = .mean) # Rename .mean to Forecasted_KWH for clarity
# Export the cleaned ARIMA forecast to an Excel file
write_xlsx(arima_forecast_clean, "/Users/aelsaeyed/Downloads/ARIMA_Forecast_Cleaned_2014.xlsx")
arima_forecast_clean
## # A tibble: 12 × 2
## DATE Forecasted_KWH
## <mth> <dbl>
## 1 2014 Jan 10049051.
## 2 2014 Feb 8103359.
## 3 2014 Mar 6968739.
## 4 2014 Apr 6292544.
## 5 2014 May 6066301.
## 6 2014 Jun 7701120.
## 7 2014 Jul 8944551.
## 8 2014 Aug 9434329.
## 9 2014 Sep 8620525.
## 10 2014 Oct 6517466.
## 11 2014 Nov 5981076.
## 12 2014 Dec 7391869.