Part 1: ATM Data and Forecasting
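
The code in this report assumes the usual time-series stack is already loaded in a setup chunk (not shown); a minimal sketch of the libraries used:

library(readxl)   # read_excel()
library(fpp3)     # loads tsibble, fable, feasts, fabletools, dplyr, ggplot2
library(writexl)  # write_xlsx()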

Taking a look at the data:

atm_data <- read_excel("/Users/aelsaeyed/Downloads/ATM624Data.xlsx")

head(atm_data)
## # A tibble: 6 × 3
##    DATE ATM    Cash
##   <dbl> <chr> <dbl>
## 1 39934 ATM1     96
## 2 39934 ATM2    107
## 3 39935 ATM1     82
## 4 39935 ATM2     89
## 5 39936 ATM1     85
## 6 39936 ATM2     90

Next I make the dates human-readable and restructure the data as a tsibble keyed by ATM. I'm using fill_gaps() here because, after debugging some errors later on, I found that the data contains implicit gaps that aren't caught by simply removing NAs; filling the gaps adds those missing dates back in as rows with NA values.

atm_ts <- atm_data %>% 
  mutate(DATE = as.Date(DATE, origin = "1899-12-30")) %>%
  as_tsibble(index = DATE, key = ATM) %>%
  fill_gaps()

atm_ts
## # A tsibble: 1,474 x 3 [1D]
## # Key:       ATM [5]
##    DATE       ATM    Cash
##    <date>     <chr> <dbl>
##  1 2009-05-01 ATM1     96
##  2 2009-05-02 ATM1     82
##  3 2009-05-03 ATM1     85
##  4 2009-05-04 ATM1     90
##  5 2009-05-05 ATM1     99
##  6 2009-05-06 ATM1     88
##  7 2009-05-07 ATM1      8
##  8 2009-05-08 ATM1    104
##  9 2009-05-09 ATM1     87
## 10 2009-05-10 ATM1     93
## # ℹ 1,464 more rows
summary(atm_ts)
##       DATE                ATM                 Cash        
##  Min.   :2009-05-01   Length:1474        Min.   :    0.0  
##  1st Qu.:2009-08-01   Class :character   1st Qu.:    0.5  
##  Median :2009-11-01   Mode  :character   Median :   73.0  
##  Mean   :2009-10-31                      Mean   :  155.6  
##  3rd Qu.:2010-02-01                      3rd Qu.:  114.0  
##  Max.   :2010-05-14                      Max.   :10919.8  
##                                          NA's   :19

Check for NAs:

rows_with_na <- atm_ts[!complete.cases(atm_ts), ]
print(rows_with_na)
## # A tsibble: 19 x 3 [1D]
## # Key:       ATM [3]
##    DATE       ATM    Cash
##    <date>     <chr> <dbl>
##  1 2009-06-13 ATM1     NA
##  2 2009-06-16 ATM1     NA
##  3 2009-06-22 ATM1     NA
##  4 2009-06-18 ATM2     NA
##  5 2009-06-24 ATM2     NA
##  6 2010-05-01 <NA>     NA
##  7 2010-05-02 <NA>     NA
##  8 2010-05-03 <NA>     NA
##  9 2010-05-04 <NA>     NA
## 10 2010-05-05 <NA>     NA
## 11 2010-05-06 <NA>     NA
## 12 2010-05-07 <NA>     NA
## 13 2010-05-08 <NA>     NA
## 14 2010-05-09 <NA>     NA
## 15 2010-05-10 <NA>     NA
## 16 2010-05-11 <NA>     NA
## 17 2010-05-12 <NA>     NA
## 18 2010-05-13 <NA>     NA
## 19 2010-05-14 <NA>     NA

It looks like some rows are missing both the ATM label and the Cash value. There isn't much we can do with those, so I will remove them. Other rows are missing only the Cash value; for those we can interpolate using ARIMA.

Remove the rows where both cash and atm are NA:

atm_ts <- atm_ts %>% filter(!(is.na(ATM) & is.na(Cash)))
rows_with_na <- atm_ts[!complete.cases(atm_ts), ]
print(rows_with_na)
## # A tsibble: 5 x 3 [1D]
## # Key:       ATM [2]
##   DATE       ATM    Cash
##   <date>     <chr> <dbl>
## 1 2009-06-13 ATM1     NA
## 2 2009-06-16 ATM1     NA
## 3 2009-06-22 ATM1     NA
## 4 2009-06-18 ATM2     NA
## 5 2009-06-24 ATM2     NA

Use ARIMA to interpolate the remaining missing Cash values:

atm_ts <- atm_ts %>%
  model(ARIMA(Cash)) %>%
  interpolate(atm_ts)

rows_with_na <- atm_ts[!complete.cases(atm_ts), ]
print(rows_with_na)
## # A tsibble: 0 x 3 [?]
## # Key:       ATM [0]
## # ℹ 3 variables: ATM <chr>, DATE <date>, Cash <dbl>

Now the data should be clean and complete. Taking a look at the time series for each ATM:

ggplot(atm_ts, aes(x = DATE, y = Cash, color = ATM)) +
  geom_line() +
  labs(title = "Cash Trends Over Time by ATM", x = "Date", y = "Cash Balance") +
  theme_minimal() +
  facet_wrap(~ ATM, scales = "free_y")

ATM3 looks like it was unused for most of the year.
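
A quick tally of zero-withdrawal days per ATM backs this up; a small optional check, not needed for the forecasting below:

# Count and share of days with zero recorded withdrawals for each ATM
atm_ts %>%
  as_tibble() %>%
  group_by(ATM) %>%
  summarise(zero_days = sum(Cash == 0), prop_zero = mean(Cash == 0))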

There is a very large outlier in ATM4. I first tried running the forecast without removing it and the results looked clearly wrong; removing it gave results that made far more sense.

I will replace it with the mean of the surrounding values:

# Replace the extreme ATM4 value (Cash > 3000) with the mean of its
# neighbouring observations within that ATM's series
atm_ts <- atm_ts %>%
  group_by(ATM) %>%
  mutate(
    Cash = if_else(
      Cash > 3000,  
      (lag(Cash, default = first(Cash)) + lead(Cash, default = last(Cash))) / 2, 
      Cash  
    )
  ) %>%
  ungroup()

The graphs look much better now:

ggplot(atm_ts, aes(x = DATE, y = Cash, color = ATM)) +
  geom_line() +
  labs(title = "Cash Trends Over Time by ATM", x = "Date", y = "Cash Balance") +
  theme_minimal() +
  facet_wrap(~ ATM, scales = "free_y")

Decomposition:

We apply decomposition to expose patterns in the time series data.

for (atm in unique(atm_ts$ATM)) {
  atm_ts_filtered <- atm_ts %>% filter(ATM == atm)
  
  decomposition <- atm_ts_filtered %>%
    model(STL(Cash ~ trend(window = 7) + season(window = "periodic"), robust = TRUE)) %>%
    components() %>%
    autoplot() +
    labs(title = paste(atm, "STL Decomposition 7 Day"))
  
  print(decomposition)
}

ATMs 1, 2, and 4 show strong weekly seasonality. Since we know ATM3 is mostly zeroes, we can disregard its seasonal component. There is no obvious trend in any of the series.
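
As a quick cross-check of that weekly pattern before modelling, an ACF plot per ATM should show spikes at lags 7, 14, and 21; a small optional sketch:

# ACF of daily withdrawals, faceted by ATM; spikes at multiples of 7
# would confirm the weekly seasonality seen in the STL components
atm_ts %>%
  ACF(Cash, lag_max = 28) %>%
  autoplot() +
  labs(title = "ACF of Daily Cash Withdrawals by ATM")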

I will first use ARIMA and see what models it chooses for these ATMs:

# Fit an ARIMA model to each ATM automatically and forecast the next month
forecasts <- atm_ts %>%
  model(ARIMA(Cash)) %>%
  forecast(h = "1 month")

# Plot each ATM's forecast against its history
for (atm in unique(forecasts$ATM)) {
  atm_forecast <- forecasts %>% filter(ATM == atm)
  forecast_plot <- atm_forecast %>%
    autoplot(atm_ts) +
    labs(title = paste("ARIMA Forecast for", atm, "Cash Withdrawals (May 2010)"),
         y = "Cash Withdrawals (in hundreds of dollars)", x = "Date")
  
  print(forecast_plot)
}

# Export the combined forecasts once, outside the plotting loop
write_xlsx(forecasts %>% as_tibble(), "/Users/aelsaeyed/Downloads/ATM_Cash_Forecast_May2010.xlsx")

We can print out the estimated coefficients and take a look at what ARIMA considered appropriate for each ATM.

fitted_models <- atm_ts %>% model(fit = ARIMA(Cash))

fitted_models %>% tidy()
## # A tibble: 17 × 7
##    ATM   .model term     estimate std.error statistic  p.value
##    <chr> <chr>  <chr>       <dbl>     <dbl>     <dbl>    <dbl>
##  1 ATM1  fit    ma1         0.190    0.0547      3.46 5.95e- 4
##  2 ATM1  fit    sma1       -0.578    0.0506    -11.4  6.28e-26
##  3 ATM1  fit    sma2       -0.107    0.0495     -2.16 3.17e- 2
##  4 ATM2  fit    ar1        -0.424    0.0547     -7.75 9.35e-14
##  5 ATM2  fit    ar2        -0.913    0.0420    -21.8  1.85e-67
##  6 ATM2  fit    ma1         0.460    0.0849      5.42 1.07e- 7
##  7 ATM2  fit    ma2         0.801    0.0572     14.0  8.42e-36
##  8 ATM2  fit    sma1       -0.749    0.0388    -19.3  2.24e-57
##  9 ATM3  fit    ma1         0.839    0.0496     16.9  8.38e-48
## 10 ATM3  fit    ma2         0.856    0.0611     14.0  5.20e-36
## 11 ATM4  fit    ar1        -0.878    0.100      -8.75 8.16e-17
## 12 ATM4  fit    ar2        -0.677    0.119      -5.71 2.38e- 8
## 13 ATM4  fit    ar3         0.170    0.0564      3.01 2.77e- 3
## 14 ATM4  fit    ma1         0.963    0.0906     10.6  3.38e-23
## 15 ATM4  fit    ma2         0.807    0.108       7.47 5.88e-13
## 16 ATM4  fit    sar1        0.248    0.0556      4.47 1.06e- 5
## 17 ATM4  fit    constant  795.      48.9        16.3  4.19e-45
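
The tidy() output shows the estimated coefficients but not the selected orders themselves. Printing the mable (or calling glance()) would list the full specification chosen for each ATM along with fit statistics; a quick sketch:

# Print the mable to see each ATM's selected ARIMA orders, and glance() for AICc/BIC
fitted_models
fitted_models %>% glance() %>% select(ATM, .model, AICc, BIC)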

ATM1: the sma1 and sma2 terms indicate a seasonal component in the moving average.
ATM2: sma1, same as above.
ATM4: sar1 shows a seasonal autoregressive component.

ATM3 is the outlier, since most of its cash withdrawals are 0. ARIMA did not detect any seasonal structure there, so a plain non-seasonal ARIMA model was used.

The forecast for ATM3 didn't look accurate, so I decided to see whether another model would do better given how little usable data there is.

I also wanted to forecast with ETS, and knowing the above I set things up so that the ATMs with obvious seasonality get a seasonal ETS specification, while ATM3, which lacks one, gets a non-seasonal specification.

ets_forecasts <- list()

for (atm in unique(atm_ts$ATM)) {
  atm_ts_filtered <- atm_ts %>%
    filter(ATM == atm)
  
  # ATM3 gets a non-seasonal ETS specification; the others get additive seasonality
  ets_model <- tryCatch({
    if (atm == "ATM3") {
      atm_ts_filtered %>%
        model(ETS(Cash ~ error("A") + trend("A") + season("N")))
    } else {
      atm_ts_filtered %>%
        model(ETS(Cash ~ error("A") + trend("A") + season("A")))
    }
  }, error = function(e) {
    message(paste("Error in ETS model for", atm, ":", e$message))
    NULL
  })
  
  if (!is.null(ets_model)) {
    # Forecast the 31 days of May 2010, store it, and plot it against history
    ets_forecast <- ets_model %>%
      forecast(h = 31)
    ets_forecasts[[atm]] <- ets_forecast
    
    ets_plot <- ets_forecast %>%
      autoplot(atm_ts_filtered) +
      labs(title = paste("ETS Forecast for", atm), y = "Cash Withdrawals") +
      theme_minimal()
    print(ets_plot)
  }
}

The forecasts for ATM3 and ATM4 look a little better with ETS. The ETS forecast for ATM4 captures the seasonality, while the ARIMA one does not.

The ETS forecast for ATM3, while working with very little data, does capture the fact that the average withdrawal will be higher, unlike the ARIMA forecast.

Overall I think the ETS forecasts are a little better than the ARIMA ones, so those are the forecasts I will submit.
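
To put that judgment on slightly firmer footing, one option is to hold out the last month of observed data and compare ARIMA and ETS accuracy per ATM. A rough sketch, assuming a single 31-day split and automatic model selection for both classes (not something run in the original workflow):

# Hold out the final 31 observed days and compare point-forecast accuracy by ATM
atm_train <- atm_ts %>% filter(DATE <= max(DATE) - 31)

atm_train %>%
  model(arima = ARIMA(Cash), ets = ETS(Cash)) %>%
  forecast(h = 31) %>%
  accuracy(atm_ts) %>%
  select(ATM, .model, RMSE, MAE) %>%
  arrange(ATM, RMSE)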

# Combine the per-ATM ETS forecasts into a single tibble for export
ets_forecast_df <- bind_rows(
  lapply(names(ets_forecasts), function(atm) {
    ets_forecasts[[atm]] %>%
      as_tibble() %>%
      select(ATM, DATE, Forecast = .mean)
  })
)

write_xlsx(ets_forecast_df, "/Users/aelsaeyed/Downloads/ATM_Cash_Forecast_May2010_ETS_UPDATED.xlsx")

print("Forecasts have been successfully exported to ATM_Cash_Forecast_May2010_ETS_UPDATED.xlsx")
## [1] "Forecasts have been successfully exported to ATM_Cash_Forecast_May2010_ETS_UPDATED.xlsx"

Part 2: Power Consumption

Taking a look at the data:

power_data <- read_excel("/Users/aelsaeyed/Downloads/ResidentialCustomerForecastLoad-624.xlsx")

head(power_data)
## # A tibble: 6 × 3
##   CaseSequence `YYYY-MMM`     KWH
##          <dbl> <chr>        <dbl>
## 1          733 1998-Jan   6862583
## 2          734 1998-Feb   5838198
## 3          735 1998-Mar   5420658
## 4          736 1998-Apr   5010364
## 5          737 1998-May   4665377
## 6          738 1998-Jun   6467147

The power usage data is collected once a month. I will clean up the date column and check right away for both implicitly and explicitly missing data.

# Convert the YYYY-MMM text column into a yearmonth index
power_data <- power_data %>%
  mutate(DATE = yearmonth(paste0(`YYYY-MMM`, "-01"))) %>%
  select(-`YYYY-MMM`)

power_ts <- power_data %>%
  as_tsibble(index = DATE) %>%
  select(DATE, KWH) %>%
  fill_gaps()

missing_data <- power_ts[!complete.cases(power_ts), ]
missing_data
## # A tsibble: 1 x 2 [1M]
##       DATE   KWH
##      <mth> <dbl>
## 1 2008 Sep    NA

It looks like there is a row with missing KWH data. I will interpolate using ARIMA:

# Interpolate missing KWH values using ARIMA
power_ts <- power_ts %>%
  model(ARIMA(KWH)) %>%
  interpolate(power_ts)

missing_data_after_interpolation <- power_ts[!complete.cases(power_ts), ]
missing_data_after_interpolation
## # A tsibble: 0 x 2 [?]
## # ℹ 2 variables: DATE <mth>, KWH <dbl>

Now we can take a look at the time series data:

# Plot the time series data
ggplot(power_ts, aes(x = DATE, y = KWH)) +
  geom_line(color = "blue") +
  labs(title = "Monthly Residential Power Usage (KWH)", 
       x = "Date", y = "Power Consumption (KWH)") +
  theme_minimal()

At first glance there seems to be an upward trend as well as some seasonality. We can confirm this with an STL decomposition:

# Decompose the time series using STL
decomposition <- power_ts %>%
  model(STL(KWH ~ trend(window = 13) + season(window = "periodic"), robust = TRUE)) %>%
  components()

# Plot the decomposition
autoplot(decomposition) +
  labs(title = "STL Decomposition of Monthly Residential Power Usage",
       y = "Power Consumption (KWH)") +
  theme_minimal()

There is a clear upward trend as well as yearly seasonality. The remainder component is relatively small, except for a sharp downward spike near 2010 (a blackout, perhaps?). I'll address this before moving forward: since it only affects a single month, I'll identify the anomalously low value and replace it with the mean of the surrounding values.

# Replace the single anomalously low month (< 3e+06 KWH) with the mean of its neighbours
power_ts <- power_ts %>%
  mutate(KWH = if_else(KWH < 3e+06, 
                       (lag(KWH, default = NA) + lead(KWH, default = NA)) / 2, 
                       KWH))

# Display any remaining values below 3e+06 to verify
power_ts %>% filter(KWH < 3e+06)
## # A tsibble: 0 x 2 [?]
## # ℹ 2 variables: DATE <mth>, KWH <dbl>

Plotting the data again:

ggplot(power_ts, aes(x = DATE, y = KWH)) +
  geom_line(color = "blue") +
  labs(title = "Monthly Residential Power Usage (KWH)", 
       x = "Date", y = "Power Consumption (KWH)") +
  theme_minimal()

Now the decomposition should be cleaner:

# Decompose the time series using STL
decomposition <- power_ts %>%
  model(STL(KWH ~ trend(window = 13) + season(window = "periodic"), robust = TRUE)) %>%
  components()

# Plot the decomposition
autoplot(decomposition) +
  labs(title = "STL Decomposition of Monthly Residential Power Usage",
       y = "Power Consumption (KWH)") +
  theme_minimal()

Now we can move on to forecasting. First, ARIMA:

# ARIMA Forecast (with seasonal dummies via the season() special)
arima_model <- power_ts %>%
  model(ARIMA(KWH ~ season()))

# Generate ARIMA forecast for 2014
arima_forecast <- arima_model %>% forecast(h = "1 year")

# Plot ARIMA forecast
autoplot(arima_forecast) +
  labs(title = "ARIMA Forecast for 2014", y = "Power Consumption (KWH)", x = "Date") +
  theme_minimal()

# Plot ARIMA forecast with original data (prediction intervals shown by default)
autoplot(power_ts, KWH) +
  autolayer(arima_forecast) +
  labs(title = "ARIMA Forecast for 2014 with Original Data", 
       y = "Power Consumption (KWH)", x = "Date") +
  theme_minimal()

Next: ETS

# ETS Forecast (specification chosen automatically)
ets_model <- power_ts %>%
  model(ETS(KWH))

# Generate ETS forecast for 2014
ets_forecast <- ets_model %>% forecast(h = "1 year")

# Plot ETS forecast
autoplot(ets_forecast) +
  labs(title = "ETS Forecast for 2014", y = "Power Consumption (KWH)", x = "Date") +
  theme_minimal()

# Plot ETS forecast with original data (prediction intervals shown by default)
autoplot(power_ts, KWH) +
  autolayer(ets_forecast) +
  labs(title = "ETS Forecast for 2014 with Original Data", 
       y = "Power Consumption (KWH)", x = "Date") +
  theme_minimal()

The two forecasts largely overlap. Plotting them together makes the comparison easier:

# Refit both models in a single mable so the forecasts are coloured and labelled by model
comparison_fc <- power_ts %>%
  model(`ARIMA Forecast` = ARIMA(KWH ~ season()),
        `ETS Forecast` = ETS(KWH)) %>%
  forecast(h = "1 year")

# Plot original data with both sets of point forecasts (intervals suppressed)
comparison_fc %>%
  autoplot(power_ts, level = NULL) +
  labs(title = "Comparison of ARIMA and ETS Forecasts for 2014",
       y = "Power Consumption (KWH)", x = "Date") +
  theme_minimal() +
  theme(legend.position = "bottom")
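
Since the two point forecasts are so close, a quick hold-out comparison can help decide between them. A rough sketch, assuming a single split with the last 12 observed months as the test set (this was not part of the original workflow):

# Hold out the final year and compare out-of-sample accuracy of the two models
power_train <- power_ts %>% filter(DATE <= max(DATE) - 12)

power_train %>%
  model(arima = ARIMA(KWH ~ season()), ets = ETS(KWH)) %>%
  forecast(h = "1 year") %>%
  accuracy(power_ts) %>%
  select(.model, RMSE, MAE, MAPE)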

We can now interrogate the ARIMA model to see what we can glean:

# Display the ARIMA model coefficients
arima_coefs <- arima_model %>% tidy()  
arima_coefs
## # A tibble: 16 × 6
##    .model                term              estimate std.error statistic  p.value
##    <chr>                 <chr>                <dbl>     <dbl>     <dbl>    <dbl>
##  1 ARIMA(KWH ~ season()) ma1               -5.68e-1   8.46e-2    -6.71  2.18e-10
##  2 ARIMA(KWH ~ season()) ma2               -3.34e-1   8.09e-2    -4.13  5.48e- 5
##  3 ARIMA(KWH ~ season()) sar1               4.95e-1   3.15e-1     1.57  1.18e- 1
##  4 ARIMA(KWH ~ season()) sar2               8.01e-2   9.32e-2     0.860 3.91e- 1
##  5 ARIMA(KWH ~ season()) sma1              -4.55e-1   3.07e-1    -1.48  1.40e- 1
##  6 ARIMA(KWH ~ season()) season()year2     -1.15e+6   2.20e+5    -5.25  3.98e- 7
##  7 ARIMA(KWH ~ season()) season()year3     -2.15e+6   2.66e+5    -8.06  8.29e-14
##  8 ARIMA(KWH ~ season()) season()year4     -2.80e+6   2.65e+5   -10.6   6.73e-21
##  9 ARIMA(KWH ~ season()) season()year5     -3.07e+6   2.66e+5   -11.6   8.53e-24
## 10 ARIMA(KWH ~ season()) season()year6     -1.56e+6   2.66e+5    -5.87  1.91e- 8
## 11 ARIMA(KWH ~ season()) season()year7     -2.79e+5   2.66e+5    -1.05  2.95e- 1
## 12 ARIMA(KWH ~ season()) season()year8      1.83e+5   2.66e+5     0.688 4.92e- 1
## 13 ARIMA(KWH ~ season()) season()year9     -4.51e+5   2.66e+5    -1.69  9.19e- 2
## 14 ARIMA(KWH ~ season()) season()year10    -2.50e+6   2.66e+5    -9.41  1.66e-17
## 15 ARIMA(KWH ~ season()) season()year11    -3.20e+6   2.65e+5   -12.1   2.37e-25
## 16 ARIMA(KWH ~ season()) season()year12    -1.89e+6   2.21e+5    -8.57  3.49e-15

We learn from this that the model ARIMA chose has strong moving-average components: ma1 and ma2, with coefficients of -0.568 and -0.334. This suggests that, even after removing the large outlier around 2010, there may be other shocks around the trend.

The seasonal AR terms sar1 and sar2 have coefficients of 0.495 and 0.0801, but their p-values (0.118 and 0.391) are fairly high, so neither is statistically significant. I interpret this as the seasonal autoregressive effect being weak: values from previous seasons don't affect the upcoming seasons very much.
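
Before submitting, a residual check on the ARIMA fit is also worth doing. A sketch: gg_tsresiduals() plots the residual series, its ACF, and a histogram, and a Ljung-Box test with a large p-value would suggest the residuals are close to white noise (the dof of 5 below matches the five AR/MA coefficients above and is an assumption about the fitted order):

# Residual diagnostics for the fitted ARIMA model
arima_model %>% gg_tsresiduals()

# Ljung-Box test on the innovation residuals
arima_model %>%
  augment() %>%
  features(.innov, ljung_box, lag = 24, dof = 5)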

I will be submitting the ARIMA forecast:

# Select only the date and mean forecast columns for export
arima_forecast_clean <- arima_forecast %>%
  as_tibble() %>%
  select(DATE, Forecasted_KWH = .mean)  # Rename .mean to Forecasted_KWH for clarity

# Export the cleaned ARIMA forecast to an Excel file
write_xlsx(arima_forecast_clean, "/Users/aelsaeyed/Downloads/ARIMA_Forecast_Cleaned_2014.xlsx")
arima_forecast_clean
## # A tibble: 12 × 2
##        DATE Forecasted_KWH
##       <mth>          <dbl>
##  1 2014 Jan      10049051.
##  2 2014 Feb       8103359.
##  3 2014 Mar       6968739.
##  4 2014 Apr       6292544.
##  5 2014 May       6066301.
##  6 2014 Jun       7701120.
##  7 2014 Jul       8944551.
##  8 2014 Aug       9434329.
##  9 2014 Sep       8620525.
## 10 2014 Oct       6517466.
## 11 2014 Nov       5981076.
## 12 2014 Dec       7391869.