The data has been loaded. I need to convert the date column to a Date type, transform the atm_data data frame into a tsibble indexed by date and keyed by ATM, and rename the date column to be more descriptive.
atm_data <- read_excel("ATM624Data.xlsx", col_types = c("date", "text", "numeric"))|>
mutate(DATE = as_date(DATE)) |>
rename(Date = DATE) |>
# converting to tsibble
as_tsibble(index = Date, key = ATM) |>
# selecting from May 2009 to April 2010
filter(Date < "2010-05-01")
head(atm_data)
## # A tsibble: 6 x 3 [1D]
## # Key: ATM [1]
## Date ATM Cash
## <date> <chr> <dbl>
## 1 2009-05-01 ATM1 96
## 2 2009-05-02 ATM1 82
## 3 2009-05-03 ATM1 85
## 4 2009-05-04 ATM1 90
## 5 2009-05-05 ATM1 99
## 6 2009-05-06 ATM1 88
summary(atm_data)
## Date ATM Cash
## Min. :2009-05-01 Length:1460 Min. : 0.0
## 1st Qu.:2009-07-31 Class :character 1st Qu.: 0.5
## Median :2009-10-30 Mode :character Median : 73.0
## Mean :2009-10-30 Mean : 155.6
## 3rd Qu.:2010-01-29 3rd Qu.: 114.0
## Max. :2010-04-30 Max. :10919.8
## NA's :5
atm_data |>
  as_tibble() |>  # drop the tsibble structure so summarise() works per ATM
  group_by(ATM) |>
  summarise(`Missing Values` = sum(is.na(Cash)))
## # A tibble: 4 × 2
## ATM `Missing Values`
## <chr> <int>
## 1 ATM1 3
## 2 ATM2 2
## 3 ATM3 0
## 4 ATM4 0
ATM1 and ATM2 have 3 and 2 missing Cash values, respectively. Additionally, ATM3 has 362 rows with a Cash value of 0, which may indicate either that no withdrawals occurred on those days or that the zeros are placeholders for missing data.
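The 362 zero-value days can be verified with a quick count; a minimal sketch, assuming the atm_data tsibble built above:
# Sketch: count ATM3's zero-cash days (expecting 362)
atm_data |>
  as_tibble() |>
  filter(ATM == "ATM3", Cash == 0) |>
  nrow()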
atm_data |>
  autoplot(Cash) +  # autoplot() already draws a line per series
  facet_wrap(~ATM, scales = "free_y", ncol = 1) + # facet by ATM, one panel per row
  labs(
    title = "ATM Cash Dispensed Over Time",
    y = "Cash Dispensed in Hundreds",
    x = "Time in Days"
  ) +
  theme_minimal()
The plot for ATM4 shows a dramatic spike in February, where the recorded value reaches about 10,920 (with Cash measured in hundreds, roughly $1.09 million), drastically higher than the next recorded high. This suggests a data error, as it is unlikely the ATM would have been loaded with that much cash when surrounding days show far lower values.
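To pin down the date of the spike before deciding how to treat it, a small check like the following could be used (a sketch; slice_max() is from dplyr and assumes the tsibble above):
# Sketch: inspect the largest Cash observations for ATM4
atm_data |>
  as_tibble() |>
  filter(ATM == "ATM4") |>
  slice_max(Cash, n = 3)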
The first step is to handle the missing data. Because the missing entries are isolated single days, I will apply neighbor interpolation: each missing value is replaced by the average of the last observed and next observed values.
# Replace each interior NA with the mean of its immediate neighbors.
# If one neighbor is itself NA, na.rm = TRUE falls back to the other;
# NAs at the very start or end of the series are left untouched.
impute_neighbors <- function(x) {
  for (i in which(is.na(x))) {
    if (i > 1 && i < length(x)) {
      x[i] <- mean(c(x[i - 1], x[i + 1]), na.rm = TRUE)
    }
  }
  return(x)
}
atm_data |>
  group_by(ATM) |>
  filter(is.na(Cash)) |>
  print()
## # A tsibble: 5 x 3 [1D]
## # Key: ATM [2]
## # Groups: ATM [2]
## Date ATM Cash
## <date> <chr> <dbl>
## 1 2009-06-13 ATM1 NA
## 2 2009-06-16 ATM1 NA
## 3 2009-06-22 ATM1 NA
## 4 2009-06-18 ATM2 NA
## 5 2009-06-24 ATM2 NA
atm_data |>
  filter(ATM == "ATM1") |>
  summary()
## Date ATM Cash
## Min. :2009-05-01 Length:365 Min. : 1.00
## 1st Qu.:2009-07-31 Class :character 1st Qu.: 73.00
## Median :2009-10-30 Mode :character Median : 91.00
## Mean :2009-10-30 Mean : 83.89
## 3rd Qu.:2010-01-29 3rd Qu.:108.00
## Max. :2010-04-30 Max. :180.00
## NA's :3
atm_1 <- atm_data |>
  filter(ATM == "ATM1")
atm_1_no_na <- atm_1 |>
  mutate(Cash = impute_neighbors(Cash))
atm_1_no_na[43:53, ] |> print()
## # A tsibble: 11 x 3 [1D]
## # Key: ATM [1]
## Date ATM Cash
## <date> <chr> <dbl>
## 1 2009-06-12 ATM1 142
## 2 2009-06-13 ATM1 131
## 3 2009-06-14 ATM1 120
## 4 2009-06-15 ATM1 106
## 5 2009-06-16 ATM1 107
## 6 2009-06-17 ATM1 108
## 7 2009-06-18 ATM1 21
## 8 2009-06-19 ATM1 140
## 9 2009-06-20 ATM1 110
## 10 2009-06-21 ATM1 115
## 11 2009-06-22 ATM1 112.
The function worked as intended, and we now have calculated values for the previously missing cash entries.
stl_model_1 <- atm_1_no_na |>
model(STL(Cash ~ season(window = "periodic")))
# Plot the components of the STL decomposition
stl_model_1 |>
components() |>
autoplot() +
labs(title = "STL Decomposition of Cash for ATM1",
y = "Cash Dispensed",
x = "Date")
atm_1_no_na |>
gg_tsdisplay(Cash, plot_type = "partial") +
labs(title = "ACF and PACF of Cash for ATM1",
y = "Cash Dispensed",
x = "Lag")
atm_1_no_na |>
  features(Cash, c(unitroot_ndiffs, unitroot_nsdiffs, unitroot_kpss))
## # A tibble: 1 × 5
## ATM ndiffs nsdiffs kpss_stat kpss_pvalue
## <chr> <int> <int> <dbl> <dbl>
## 1 ATM1 1 1 0.509 0.0395
The trend component does not show a clear, sustained increase or decrease; instead, it fluctuates, suggesting a lack of a strong long-term trend in cash withdrawals for ATM1. The seasonal component, however, is more prominent, displaying a consistent weekly pattern that highlights predictable cycles in ATM usage. The residuals capture the remaining variability, with occasional spikes likely due to random events or anomalies outside the usual weekly patterns.
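To back these visual impressions with numbers, STL-based features can quantify how strong the trend and weekly seasonality are. A minimal sketch using feat_stl() from the feasts package (strength values near 1 indicate a dominant component):
# Sketch: trend and seasonal strength measures for ATM1
atm_1_no_na |>
  features(Cash, feat_stl)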
Given that the ACF and PACF plots indicate strong weekly seasonality (with significant lags at 7, 14, and 21 days), applying seasonal differencing with a 7-day lag can help remove this repetitive pattern.
Seasonal differencing with a 7-day lag transforms the series by subtracting from each value its counterpart 7 days earlier (y[t] - y[t-7]), which removes the weekly pattern and makes the series more stationary.
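As a diagnostic sketch only (the ARIMA() call below selects the seasonal difference automatically), the lag-7 difference can be computed with tsibble's difference() and re-tested for stationarity:
# Sketch: lag-7 seasonal difference, then re-run the KPSS test
atm_1_no_na |>
  mutate(Cash_sdiff = difference(Cash, lag = 7)) |>
  filter(!is.na(Cash_sdiff)) |>  # drop the first week of NAs
  features(Cash_sdiff, unitroot_kpss)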
atm1_lambda <- atm_1_no_na |>
  features(Cash, features = guerrero) |>
  pull(lambda_guerrero) |>
  round(4)
print(atm1_lambda)
## [1] 0.2616
atm1_fit <- atm_1_no_na |>
  model(ARIMA(Cash))

atm1_fitbc <- atm_1_no_na |>
  model(auto = ARIMA(box_cox(Cash, atm1_lambda), stepwise = FALSE, approx = FALSE))
report(atm1_fit)
## Series: Cash
## Model: ARIMA(0,0,1)(0,1,2)[7]
##
## Coefficients:
## ma1 sma1 sma2
## 0.1950 -0.5806 -0.1037
## s.e. 0.0546 0.0506 0.0494
##
## sigma^2 estimated as 556.4: log likelihood=-1640.01
## AIC=3288.03 AICc=3288.14 BIC=3303.55
report(atm1_fitbc)
## Series: Cash
## Model: ARIMA(0,0,2)(0,1,1)[7]
## Transformation: box_cox(Cash, atm1_lambda)
##
## Coefficients:
## ma1 ma2 sma1
## 0.1126 -0.1094 -0.6418
## s.e. 0.0524 0.0520 0.0432
##
## sigma^2 estimated as 1.765: log likelihood=-610.02
## AIC=1228.05 AICc=1228.16 BIC=1243.57
Model 2 (ARIMA(0,0,2)(0,1,1)[7] with the Box-Cox transformation) appears to be the better model. Its AIC and sigma^2 are computed on the transformed scale, so they are not directly comparable to the untransformed model's values; the preference rests on the transformation stabilizing the variance and on the residual diagnostics shown below. This model should yield more reliable and accurate forecasts for cash withdrawals at ATM1.
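Because training accuracy from accuracy() is reported on the original Cash scale (fable back-transforms the fitted values), it should offer a fairer side-by-side comparison than the information criteria. A minimal sketch, assuming the two mables above:
# Sketch: compare training errors on the original Cash scale
bind_rows(
  accuracy(atm1_fit),
  accuracy(atm1_fitbc)
)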
# Generate the May 2010 forecast (31 days)
may_forecast_1 <- atm1_fitbc |>
  forecast(h = "31 days")
may_forecast_1
## # A fable: 31 x 5 [1D]
## # Key: ATM, .model [1]
## ATM .model Date Cash .mean
## <chr> <chr> <date> <dist> <dbl>
## 1 ATM1 auto 2010-05-01 t(N(8.5, 1.8)) 92.1
## 2 ATM1 auto 2010-05-02 t(N(8.9, 1.8)) 107.
## 3 ATM1 auto 2010-05-03 t(N(8, 1.8)) 78.9
## 4 ATM1 auto 2010-05-04 t(N(1.8, 1.8)) 5.56
## 5 ATM1 auto 2010-05-05 t(N(8.9, 1.8)) 106.
## 6 ATM1 auto 2010-05-06 t(N(8.2, 1.8)) 84.7
## 7 ATM1 auto 2010-05-07 t(N(8.4, 1.8)) 91.3
## 8 ATM1 auto 2010-05-08 t(N(8.5, 2)) 93.5
## 9 ATM1 auto 2010-05-09 t(N(8.9, 2)) 107.
## 10 ATM1 auto 2010-05-10 t(N(8, 2)) 79.6
## # ℹ 21 more rows
may_forecast_1 |>
  autoplot(atm_1_no_na, level = 95) +
  labs(
    title = "May 2010 forecast using lambda = 0.2616 and h = 31",
    y = "Cash in Hundreds"
  )
atm1_fitbc |>
  gg_tsresiduals()
The chosen model, ARIMA(0,0,2)(0,1,1)[7] on the Box-Cox scale, effectively removes the seasonal lag spikes observed in the earlier ACF/PACF plots. The residuals' distribution suggests they are mostly white noise, indicating a well-fitted model with minimal autocorrelation in the errors.
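A portmanteau test can formalize this impression. A sketch, using dof = 3 for the three estimated coefficients (ma1, ma2, sma1); the lag choice of 14 (two seasonal periods) is an assumed convention:
# Sketch: Ljung-Box test on the innovation residuals for ATM1
atm1_fitbc |>
  augment() |>
  features(.innov, ljung_box, lag = 14, dof = 3)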
We have two missing values for ATM2. To handle these, we apply the same neighbor interpolation used for ATM1, replacing each missing value with the average of the previous and next observed values.

Date       | ATM  | Cash
2009-06-18 | ATM2 | NA
2009-06-24 | ATM2 | NA
atm_2 <- atm_data |>
  filter(ATM == "ATM2")
atm_2_no_na <- atm_2 |>
  mutate(Cash = impute_neighbors(Cash))
atm_2_no_na[48:56, ] |> print()
## # A tsibble: 9 x 3 [1D]
## # Key: ATM [1]
## Date ATM Cash
## <date> <chr> <dbl>
## 1 2009-06-17 ATM2 24
## 2 2009-06-18 ATM2 79
## 3 2009-06-19 ATM2 134
## 4 2009-06-20 ATM2 95
## 5 2009-06-21 ATM2 82
## 6 2009-06-22 ATM2 90
## 7 2009-06-23 ATM2 99
## 8 2009-06-24 ATM2 51
## 9 2009-06-25 ATM2 3
atm_2_no_na |>
  autoplot(Cash)
As with ATM1, the interpolation produced sensible values for the previously missing cash entries.
stl_model_2 <- atm_2_no_na |>
model(STL(Cash ~ season(window = "periodic")))
# Plot the components of ATM2 STL decomposition
stl_model_2 |>
components() |>
autoplot() +
labs(title = "STL Decomposition of Cash for ATM2",
y = "Cash Dispensed",
x = "Date")
atm_2_no_na |>
gg_tsdisplay(Cash, plot_type = "partial") +
labs(title = "ACF and PACF of Cash for ATM2",
y = "Cash Dispensed",
x = "Lag")
atm_2_no_na |>
  features(Cash, c(unitroot_ndiffs, unitroot_nsdiffs, unitroot_kpss))
## # A tibble: 1 × 5
## ATM ndiffs nsdiffs kpss_stat kpss_pvalue
## <chr> <int> <int> <dbl> <dbl>
## 1 ATM2 1 1 2.15 0.01
Overall, the STL decomposition provides valuable insights:
- Weekly seasonality is strong and consistent, and should drive the forecasting approach (it can also be inspected directly, as sketched below).
- The trend component shows moderate fluctuations without a strong long-term direction.
- The remainder includes some irregular spikes, indicating occasional deviations from the seasonal pattern.
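The weekly pattern can be viewed directly with gg_season() from feasts; a sketch, where setting period to a week is an assumption appropriate for daily data:
# Sketch: overlay weeks to show the day-of-week pattern for ATM2
atm_2_no_na |>
  gg_season(Cash, period = "week") +
  labs(title = "ATM2: day-of-week pattern", y = "Cash in Hundreds")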
atm2_lambda <- atm_2_no_na |>
  features(Cash, features = guerrero) |>
  pull(lambda_guerrero) |>
  round(4)
print(atm2_lambda)
## [1] 0.7243
atm2_fit <- atm_2_no_na |>
  model(ARIMA(Cash))

atm2_fitbc <- atm_2_no_na |>
  model(auto = ARIMA(box_cox(Cash, atm2_lambda), stepwise = FALSE, approx = FALSE))
report(atm2_fit)
## Series: Cash
## Model: ARIMA(2,0,2)(0,1,1)[7]
##
## Coefficients:
## ar1 ar2 ma1 ma2 sma1
## -0.4320 -0.9130 0.4773 0.8048 -0.7547
## s.e. 0.0553 0.0407 0.0861 0.0556 0.0381
##
## sigma^2 estimated as 602.5: log likelihood=-1653.67
## AIC=3319.33 AICc=3319.57 BIC=3342.61
report(atm2_fitbc)
## Series: Cash
## Model: ARIMA(2,0,2)(0,1,1)[7]
## Transformation: box_cox(Cash, atm2_lambda)
##
## Coefficients:
## ar1 ar2 ma1 ma2 sma1
## -0.5997 -0.8396 0.7154 0.8221 -0.6941
## s.e. 0.0879 0.0795 0.0862 0.0918 0.0447
##
## sigma^2 estimated as 55.58: log likelihood=-1226.66
## AIC=2465.32 AICc=2465.56 BIC=2488.6
- Model fit: the Box-Cox transformed model (Model 2) shows far lower AIC, AICc, and BIC values and a much smaller residual variance (sigma^2), though these are computed on the transformed scale and so are not directly comparable to the untransformed model's values.
- Variance stability: the Box-Cox transformation stabilizes the variance, reducing noise in the residuals and enhancing model accuracy.
- Prediction reliability: with stabilized variance and cleaner residuals, the transformed model is more likely to yield reliable forecasts.
atm2_fitbc |>
  gg_tsresiduals() +
  labs(title = "Residuals of ATM2\nARIMA(2,0,2)(0,1,1)[7] model\nwith Box-Cox lambda = 0.7243",
       y = "Cash Dispensed",
       x = "Lag")
# Forecast May 2010 (31 days) for ATM2
may_forecast_2 <- atm2_fitbc |>
  forecast(h = "31 days")
# Plot the forecast along with the original data for ATM2
autoplot(atm_2_no_na, Cash) +
  autolayer(may_forecast_2, level = 95) +
  labs(
    title = "May 2010 forecast for ATM2 using lambda = 0.7243 and h = 31",
    y = "Cash in Hundreds",
    x = "Date"
  )
atm2_fitbc |>
  augment() |>
  features(.innov, box_pierce, lag = 7, dof = 2)
## # A tibble: 1 × 4
## ATM .model bp_stat bp_pvalue
## <chr> <chr> <dbl> <dbl>
## 1 ATM1 auto 5.56 0.351
The Box-Pierce test result (p-value = 0.351) is comfortably above 0.05, indicating no significant autocorrelation in the residuals and suggesting the model is capturing the data's patterns effectively.
For ATM3, only the last 3 days of the series contain nonzero observations; everything before them is zero. With effectively 3 data points, building a statistical predictive model isn't feasible. A practical alternative is to use the average of those 3 observations as a simple estimate of future values.
atm_3 <- atm_data |>
  filter(ATM == "ATM3")
summary(atm_3)
## Date ATM Cash
## Min. :2009-05-01 Length:365 Min. : 0.0000
## 1st Qu.:2009-07-31 Class :character 1st Qu.: 0.0000
## Median :2009-10-30 Mode :character Median : 0.0000
## Mean :2009-10-30 Mean : 0.7206
## 3rd Qu.:2010-01-29 3rd Qu.: 0.0000
## Max. :2010-04-30 Max. :96.0000
atm_3_mfit <- atm_3 |>
  filter(Cash > 0) |>  # keep only the three nonzero observations
  model(MEAN(Cash))
may_forecast_3 <- atm_3_mfit |>
  forecast(h = 31)
may_forecast_3 |>
  autoplot(atm_3, level = 95) +
  labs(title = "Forecast for ATM3 Cash", y = "Cash", x = "Date")
The earlier plots of each ATM's original time series showed at least one clear outlier in ATM4, so we now identify all outliers for this series.
atm_4 <- atm_data |>
  filter(ATM == "ATM4")
summary(atm_4)
## Date ATM Cash
## Min. :2009-05-01 Length:365 Min. : 1.563
## 1st Qu.:2009-07-31 Class :character 1st Qu.: 124.334
## Median :2009-10-30 Mode :character Median : 403.839
## Mean :2009-10-30 Mean : 474.043
## 3rd Qu.:2010-01-29 3rd Qu.: 704.507
## Max. :2010-04-30 Max. :10919.762
atm_4 |>
  autoplot(Cash)
# Calculate IQR bounds for outliers (avoid shadowing base R's IQR())
q1 <- quantile(atm_4$Cash, 0.25)
q3 <- quantile(atm_4$Cash, 0.75)
iqr <- q3 - q1
# Define bounds, using a 2.5 multiplier (wider than the usual 1.5 rule)
lower_bound <- q1 - 2.5 * iqr
upper_bound <- q3 + 2.5 * iqr
# Identify outliers
outliers <- atm_4 |>
  filter(Cash < lower_bound | Cash > upper_bound)
outliers
## # A tsibble: 1 x 3 [1D]
## # Key: ATM [1]
## Date ATM Cash
## <date> <chr> <dbl>
## 1 2010-02-09 ATM4 10920.
Winsorize (cap): we replace the outlier with the upper limit defined by the IQR-based threshold (Q3 + 2.5 × IQR). This retains the point as a high value while reducing its influence on model fitting.
# Cap the outlier at the upper bound computed above
atm_4 <- atm_4 |>
  mutate(Cash = if_else(Cash > upper_bound, upper_bound, Cash))
atm_4 |>
  filter(Date == "2010-02-09")
## # A tsibble: 1 x 3 [1D]
## # Key: ATM [1]
## Date ATM Cash
## <date> <chr> <dbl>
## 1 2010-02-09 ATM4 2155.
# STL decomposition for ATM4
stl_model_4 <- atm_4 |>
  model(STL(Cash ~ season(window = "periodic")))
# Plot the components of ATM 4 STL decomposition
stl_model_4 |>
components() |>
autoplot() +
labs(title = "STL Decomposition of Cash for ATM4",
y = "Cash Dispensed",
x = "Date")
# ACF and PACF of Cash for ATM 4
atm_4 |>
gg_tsdisplay(Cash, plot_type = "partial") +
labs(title = "ACF and PACF of Cash for ATM4",
y = "Cash Dispensed",
x = "Lag")
# Test for differencing needs
atm_4 |>
features(Cash, c(unitroot_ndiffs, unitroot_nsdiffs, unitroot_kpss))
## # A tibble: 1 × 5
## ATM ndiffs nsdiffs kpss_stat kpss_pvalue
## <chr> <int> <int> <dbl> <dbl>
## 1 ATM4 0 0 0.0863 0.1
The ACF suggests the series is close to stationary: only the lag-7 spike rises slightly above the significance threshold, and it is small. This agrees with the unit-root tests, which indicate no differencing is needed (ndiffs = 0, nsdiffs = 0).
atm4_lambda <- atm_4 |>
features(Cash, features = guerrero) |>
pull(lambda_guerrero) |>
round(4)
print(atm4_lambda)
## [1] 0.409
atm4_fit <- atm_4 |>
model(
ARIMA(Cash)
)
atm4_fitbc <- atm_4 |>
model(
auto = ARIMA(box_cox(Cash, atm4_lambda), stepwise = FALSE, approx = FALSE)
)
report(atm4_fit)
## Series: Cash
## Model: ARIMA(0,0,0)(1,0,0)[7] w/ mean
##
## Coefficients:
## sar1 constant
## 0.1667 374.3727
## s.e. 0.0519 18.5923
##
## sigma^2 estimated as 127821: log likelihood=-2662.91
## AIC=5331.83 AICc=5331.9 BIC=5343.53
report(atm4_fitbc)
## Series: Cash
## Model: ARIMA(0,0,0)(2,0,0)[7] w/ mean
## Transformation: box_cox(Cash, atm4_lambda)
##
## Coefficients:
## sar1 sar2 constant
## 0.2200 0.1805 14.4595
## s.e. 0.0518 0.0522 0.5495
##
## sigma^2 estimated as 115.2: log likelihood=-1383.18
## AIC=2774.36 AICc=2774.48 BIC=2789.96
The Box-Cox transformed ARIMA(0,0,0)(2,0,0)[7] model provides the better fit. Its residual variance (115.2 on the transformed scale, versus 127,821 on the original scale) and its AIC (2774.36 vs. 5331.83) and BIC (2789.96 vs. 5343.53) are much lower, though, as noted earlier, these criteria are computed on different scales and so favor the transformed series by construction. The stabilized variance nonetheless suggests the transformed model captures the underlying patterns more reliably.
# Forecasting 31 days ahead for ATM 4
may_forecast_4 <- atm4_fitbc |>
forecast(h = 31)
# View the forecast results
print(may_forecast_4)
## # A fable: 31 x 5 [1D]
## # Key: ATM, .model [1]
## ATM .model Date Cash .mean
## <chr> <chr> <date> <dist> <dbl>
## 1 ATM4 auto 2010-05-01 t(N(20, 115)) 307.
## 2 ATM4 auto 2010-05-02 t(N(24, 115)) 448.
## 3 ATM4 auto 2010-05-03 t(N(26, 115)) 513.
## 4 ATM4 auto 2010-05-04 t(N(17, 115)) 241.
## 5 ATM4 auto 2010-05-05 t(N(25, 115)) 477.
## 6 ATM4 auto 2010-05-06 t(N(21, 115)) 346.
## 7 ATM4 auto 2010-05-07 t(N(24, 115)) 428.
## 8 ATM4 auto 2010-05-08 t(N(19, 121)) 301.
## 9 ATM4 auto 2010-05-09 t(N(25, 121)) 480.
## 10 ATM4 auto 2010-05-10 t(N(25, 121)) 472.
## # ℹ 21 more rows
autoplot(atm_4, Cash) +
  autolayer(may_forecast_4, level = 95) +
  labs(title = "31-Day Forecast for ATM4 Cash with Winsorized Outlier",
       y = "Cash", x = "Date") +
  theme_minimal()
The 31-day forecast for ATM4 shows the forecast variance growing as the horizon extends (from roughly 115 in the first week to 121 by day ten), so the prediction intervals widen for dates further into the period, reflecting increasing uncertainty. The point forecasts themselves settle into the familiar weekly pattern.
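The widening is visible numerically as well; extracting the 95% intervals with hilo() (a sketch, assuming the fable above) shows the bounds spreading with the horizon:
# Sketch: pull the 95% prediction intervals from the fable
may_forecast_4 |>
  hilo(level = 95) |>
  select(Date, .mean, `95%`) |>
  head()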
# Convert each forecast to a tibble to remove the tsibble structure
forecasts <- bind_rows(
as_tibble(may_forecast_1),
as_tibble(may_forecast_2),
as_tibble(may_forecast_3),
as_tibble(may_forecast_4)
) |>
select(Date, ATM, .mean) |>
rename(Cash = .mean)
# Round the Cash column to zero decimal places
forecasts$Cash <- round(forecasts$Cash, 0)
head(forecasts)
## # A tibble: 6 × 3
## Date ATM Cash
## <date> <chr> <dbl>
## 1 2010-05-01 ATM1 92
## 2 2010-05-02 ATM1 107
## 3 2010-05-03 ATM1 79
## 4 2010-05-04 ATM1 6
## 5 2010-05-05 ATM1 106
## 6 2010-05-06 ATM1 85
write_xlsx(forecasts, path = "project1atmsfc.xlsx")
power_data <- read_xlsx(
  "ResidentialCustomerForecastLoad-624.xlsx",
  col_types = c("text", "text", "numeric")
) |>
  # Parse the `YYYY-MMM` strings (e.g., "1998-Jan") into a yearmonth
  mutate(DATE = yearmonth(`YYYY-MMM`)) |>
  select(-`YYYY-MMM`)
# Display the first few rows to verify
head(power_data)
## # A tibble: 6 × 3
## CaseSequence KWH DATE
## <chr> <dbl> <mth>
## 1 733 6862583 1998 Jan
## 2 734 5838198 1998 Feb
## 3 735 5420658 1998 Mar
## 4 736 5010364 1998 Apr
## 5 737 4665377 1998 May
## 6 738 6467147 1998 Jun
power_data <- power_data |> as_tsibble(index = DATE)
power_data
## # A tsibble: 192 x 3 [1M]
## CaseSequence KWH DATE
## <chr> <dbl> <mth>
## 1 733 6862583 1998 Jan
## 2 734 5838198 1998 Feb
## 3 735 5420658 1998 Mar
## 4 736 5010364 1998 Apr
## 5 737 4665377 1998 May
## 6 738 6467147 1998 Jun
## 7 739 8914755 1998 Jul
## 8 740 8607428 1998 Aug
## 9 741 6989888 1998 Sep
## 10 742 6345620 1998 Oct
## # ℹ 182 more rows
summary(power_data$KWH)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 770523 5429912 6283324 6502475 7620524 10655730 1
autoplot(power_data, KWH)
After exploring and plotting the data, we observed a missing value in September 2008. Given the series' regular seasonal oscillation, the median of 6,283,324 KWH is a reasonable fill value that keeps the imputed point consistent with the observed pattern.
# Check for missing values
power_data |>
filter(is.na(KWH))
## # A tsibble: 1 x 3 [1M]
## CaseSequence KWH DATE
## <chr> <dbl> <mth>
## 1 861 NA 2008 Sep
# Fill NA in KWH with the median
power_data$KWH[is.na(power_data$KWH)] <- median(power_data$KWH, na.rm = TRUE)
summary(power_data$KWH)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 770523 5434539 6283324 6501333 7608792 10655730
power_data |>
model(STL(KWH ~ season(window = "periodic"))) |>
components() |>
autoplot()
The STL decomposition reveals a clear upward trend in KWH usage, with a strong seasonal pattern that repeats each year. Most of the variability is well explained by the trend and seasonal components, except for an unusual spike in the remainder in July 2010, corresponding to a sharp drop in KWH usage. This anomaly warrants further investigation, as it could be due to an external factor affecting energy consumption during that period.
We will explore several models to determine the best fit. Given the observed trend and seasonality, an ARIMA model appears to be a promising choice; however, we will also fit alternative models to ensure we select the most accurate option. I will use an 80/20 train/test split, which should work well since we have 16 years of monthly data.
split_point <- floor(0.8 * 192)                            # 153
train_data <- power_data |> slice(1:split_point)           # first 153 rows
test_data  <- power_data |> slice((split_point + 1):n())   # remaining 39 rows
power_models <- train_data |>
model(
ARIMA = ARIMA(KWH),
ETS = ETS(KWH),
STL_Naive = decomposition_model(
STL(KWH ~ season(window = "periodic")),
NAIVE(season_adjust)
)
)
print(power_models)
## # A mable: 1 x 3
## ARIMA ETS STL_Naive
## <model> <model> <model>
## 1 <ARIMA(1,0,1)(0,1,2)[12]> <ETS(M,N,M)> <STL decomposition model>
model_accuracy <- power_models |>
accuracy()|>
arrange(RMSE)
model_accuracy
## # A tibble: 3 × 10
## .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ARIMA Training 32723. 760559. 450434. -4.77 11.9 0.720 0.810 -0.0607
## 2 ETS Training 20126. 803310. 461807. -5.78 12.7 0.738 0.856 0.163
## 3 STL_Naive Training 9850. 991277. 581872. -5.84 14.7 0.930 1.06 -0.444
forecasts <- power_models |>
forecast(new_data = test_data)
forecasts |>
  autoplot(power_data)
forecasts |>
  autoplot(test_data, level = NULL) +
  labs(title = "Model Forecasts vs Actual KWH (Test Period)",
       x = "Date",
       y = "KWH",
       color = "Model")
power_models |>
select(ARIMA) |>
gg_tsresiduals()
The ARIMA(1,0,1)(0,1,2)[12] model was chosen because it effectively captures both short-term patterns and monthly seasonality, as indicated by its strong performance across key accuracy metrics. Its configuration (one autoregressive term, one moving-average term, a single seasonal difference, and two seasonal moving-average terms) makes it well suited to the cyclical and seasonal nature of the data.
Despite the outlier in July 2010, the ARIMA model still performs well overall. The residuals are generally centered around zero with minimal autocorrelation, indicating that the model has captured the underlying patterns effectively. While the anomaly introduces a slight deviation, it doesn’t significantly affect the model’s stability or its ability to produce accurate forecasts for the majority of the data.
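A Ljung-Box test provides a formal check of that claim. A sketch, with dof = 4 for the four estimated ARMA coefficients (ar1, ma1, sma1, sma2) and a 24-month lag window as an assumed convention:
# Sketch: portmanteau test on the ARIMA residuals
power_models |>
  select(ARIMA) |>
  augment() |>
  features(.innov, ljung_box, lag = 24, dof = 4)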
forecasts_2014 <- power_models |>
  select(ARIMA) |>
  forecast(h = 51)  # 51 months from Oct 2010 reaches Dec 2014
forecasts_2014 <- tail(forecasts_2014, 12)  # keep only the 12 months of 2014
forecasts_2014 |>
  autoplot(power_data, level = 95) +
  labs(
    title = "2014 KWH Forecast using ARIMA(1,0,1)(0,1,2)[12] (train-split model)"
  )
final_arima_model <- power_data |>
model(
ARIMA = ARIMA(KWH ~ pdq(1,0,1) + PDQ(0,1,2))
)
report(final_arima_model)
## Series: KWH
## Model: ARIMA(1,0,1)(0,1,2)[12] w/ drift
##
## Coefficients:
## ar1 ma1 sma1 sma2 constant
## 0.4044 -0.1540 -0.8972 0.1339 54993.10
## s.e. 0.2902 0.3156 0.0823 0.0821 15811.29
##
## sigma^2 estimated as 7.477e+11: log likelihood=-2719.65
## AIC=5451.3 AICc=5451.79 BIC=5470.46
accuracy(final_arima_model)
## # A tibble: 1 × 10
## .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ARIMA Training -21280. 825523. 499648. -5.43 11.7 0.714 0.699 0.00302
future_final_forecast <- final_arima_model |>
  forecast(h = 12)

autoplot(test_data, KWH) +
  autolayer(future_final_forecast, level = 95, color = "blue") +
  autolayer(forecasts_2014, level = NULL, color = "red") +
  labs(title = "Forecast Comparison: Actual KWH, Full-Data 2014 Forecast (blue), Train-Split 2014 Forecast (red)",
       x = "Date", y = "KWH",
       color = "Legend") +
  theme_minimal()
The two lines above represent the same specification: the red line is the model trained on 80% of the data, while the blue line is the same model re-estimated on the entire dataset. The fully trained model aligns well in both seasonality and trend, likely because the additional training data dilutes the impact of the July 2010 outlier.
print(future_final_forecast)
## # A fable: 12 x 4 [1M]
## # Key: .model [1]
## .model DATE KWH .mean
## <chr> <mth> <dist> <dbl>
## 1 ARIMA 2014 Jan N(9730543, 7.5e+11) 9730543.
## 2 ARIMA 2014 Feb N(8468331, 7.9e+11) 8468331.
## 3 ARIMA 2014 Mar N(6851880, 8e+11) 6851880.
## 4 ARIMA 2014 Apr N(6e+06, 8e+11) 5999883.
## 5 ARIMA 2014 May N(5731409, 8e+11) 5731409.
## 6 ARIMA 2014 Jun N(7528743, 8e+11) 7528743.
## 7 ARIMA 2014 Jul N(7852173, 8e+11) 7852173.
## 8 ARIMA 2014 Aug N(9297497, 8e+11) 9297497.
## 9 ARIMA 2014 Sep N(8179597, 8e+11) 8179597.
## 10 ARIMA 2014 Oct N(6e+06, 8e+11) 6014848.
## 11 ARIMA 2014 Nov N(5770802, 8e+11) 5770802.
## 12 ARIMA 2014 Dec N(7477616, 8e+11) 7477616.
future_final_forecastl <- future_final_forecast |>
as_tibble() |>
select(.model, DATE, .mean) |>
rename(Model = .model, KWH = .mean) |>
mutate(DATE = format(DATE, "%Y-%m-%d")) # Convert DATE to ISO 8601 format
head(future_final_forecastl)
## # A tibble: 6 × 3
## Model DATE KWH
## <chr> <chr> <dbl>
## 1 ARIMA 2014-01-01 9730543.
## 2 ARIMA 2014-02-01 8468331.
## 3 ARIMA 2014-03-01 6851880.
## 4 ARIMA 2014-04-01 5999883.
## 5 ARIMA 2014-05-01 5731409.
## 6 ARIMA 2014-06-01 7528743.
write_xlsx(future_final_forecastl, path = "project1powerfc.xlsx")