library(readxl)
library(tidyverse)
library(lubridate)
library(fpp3)
library(forecast)
library(ggplot2)
library(imputeTS)
library(openxlsx2)
Part A – ATM Forecast, ATM624Data.xlsx
This analysis forecasts the cash withdrawals from four different ATMs for May 2010 using historical data. The process involves cleaning, exploring, and modeling the data to generate accurate predictions, followed by presenting the results with a detailed explanation and visualizations.
First, we will load the data, which is saved in a local folder, and explore it.
data <- read_excel("ATM624Data.xlsx", col_names = T, col_types = c('date', 'guess', 'guess'))
Let’s take a glimpse of the data:
glimpse(data)
## Rows: 1,474
## Columns: 3
## $ DATE <dttm> 2009-05-01, 2009-05-01, 2009-05-02, 2009-05-02, 2009-05-03, 2009…
## $ ATM <chr> "ATM1", "ATM2", "ATM1", "ATM2", "ATM1", "ATM2", "ATM1", "ATM2", "…
## $ Cash <dbl> 96, 107, 82, 89, 85, 90, 90, 55, 99, 79, 88, 19, 8, 2, 104, 103, …
head(data)
## # A tibble: 6 × 3
## DATE ATM Cash
## <dttm> <chr> <dbl>
## 1 2009-05-01 00:00:00 ATM1 96
## 2 2009-05-01 00:00:00 ATM2 107
## 3 2009-05-02 00:00:00 ATM1 82
## 4 2009-05-02 00:00:00 ATM2 89
## 5 2009-05-03 00:00:00 ATM1 85
## 6 2009-05-03 00:00:00 ATM2 90
From the output we can see that:
There are 1,474 rows and 3 columns: DATE, ATM, and Cash.
DATE is a POSIXct date-time.
The data contains daily cash withdrawal records for four different ATMs over time.
We will create a separate dataset and convert DATE to the Date class. To ensure accurate forecasting for May 2010, we select data from May 2009 to April 2010; this 12-month period captures recent trends and seasonality, providing a reliable basis for the forecast.
data_ts <- data |>
mutate(DATE = as.Date(DATE)) |>
rename(Date = DATE) |>
filter(Date >= "2009-05-01" & Date < "2010-05-01")
head(data_ts)
## # A tibble: 6 × 3
## Date ATM Cash
## <date> <chr> <dbl>
## 1 2009-05-01 ATM1 96
## 2 2009-05-01 ATM2 107
## 3 2009-05-02 ATM1 82
## 4 2009-05-02 ATM2 89
## 5 2009-05-03 ATM1 85
## 6 2009-05-03 ATM2 90
As we can see, our time series, which is still a data frame object, now has a proper date column courtesy of the as.Date() function. Now let’s take a look at the summary of our time series.
summary(data_ts)
## Date ATM Cash
## Min. :2009-05-01 Length:1460 Min. : 0.0
## 1st Qu.:2009-07-31 Class :character 1st Qu.: 0.5
## Median :2009-10-30 Mode :character Median : 73.0
## Mean :2009-10-30 Mean : 155.6
## 3rd Qu.:2010-01-29 3rd Qu.: 114.0
## Max. :2010-04-30 Max. :10919.8
## NA's :5
The max value under the “Cash” variable is 10,919.8. Since the figures are in hundreds of dollars, multiplying by 100 gives $1,091,980, and it is quite unlikely that any ATM could dispense that much cash. This is most likely an error, and we will address it accordingly. We can also see that there are 5 NA’s in the Cash column.
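A quick check (a sketch using the columns above) shows which ATM and date carry this extreme value:
# Locate the row with the maximum Cash value
data_ts |>
  filter(Cash == max(Cash, na.rm = TRUE))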
Let’s check for duplicates
# Check for duplicate rows
sum(duplicated(data_ts))
## [1] 0
Now let’s check if we have any missing values in our time series.
colSums(is.na(data_ts))
## Date ATM Cash
## 0 0 5
We can see that our time series has 5 missing values in Cash column.
Handling Missing Values:
Since our ATM time series has four different ATMs information, let’s check which ATM is missing these values.
data_ts|>
filter(is.na(Cash))|>
count(ATM)
## # A tibble: 2 × 2
## ATM n
## <chr> <int>
## 1 ATM1 3
## 2 ATM2 2
We can see that there are three missing values for ATM1 and two for ATM2. To address these five missing values, we fit an ARIMA model to the dataset containing the gaps and use it to interpolate the absent observations. Before interpolating, we have to convert our series to a tsibble object.
Converting to tsibble and Handling Missing Values Using ARIMA:
# Convert to tsibble
data_ts <- data_ts |>
as_tsibble(index = Date, key = ATM)
We will interpolate missing values.
# Fit ARIMA model to the data containing missing values
fit_model <- data_ts |>
model(ARIMA(Cash))
# Apply interpolation using the fitted model
data_ts <- fit_model |>
interpolate(data_ts)
# Check if any missing values remain
sum(is.na(data_ts$Cash))
## [1] 0
Now that our data is ready, let’s visualize it:
# Plot time series using autoplot after interpolation
data_ts |>
ggplot2::autoplot(Cash) +
labs(title = "ATM Cash Withdrawals Over Time",
y = "Cash Withdrawals (in dollars)",
x = "Date") +
facet_wrap(~ ATM, scales = "free_y", nrow = 4)+
theme_minimal(base_size = 14) +
theme(panel.grid.major = element_line(color = "grey90"),
panel.grid.minor = element_blank())
Since the ATM dataset contains four different time series, each representing cash withdrawals from a unique ATM, we will build and evaluate models for each ATM separately. This approach ensures that we capture the distinct trends, seasonality, and patterns associated with each ATM. To achieve this, we will fit multiple models, including ARIMA and ETS, and compare their performance using appropriate accuracy metrics. The best-performing model will be used to generate forecasts for May 2010, ensuring that each ATM’s forecast is tailored to its individual withdrawal patterns.
ATM1 Time Series Analysis & Data Preparation: We’ll start by isolating the data for ATM1 to build the models.
# Filter data for ATM1
atm1_data <- data_ts |>
filter(ATM == "ATM1")
We visualize ATM1 Data to verify and explore the ATM1 data before modeling:
# Plot ATM1 data
atm1_data |>
autoplot(Cash) +
labs(title = "ATM1 Cash Withdrawals Over Time",
y = "Cash Withdrawals (in dollars)",
x = "Date") +
theme_minimal(base_size = 12)
Analysis of ATM1 Plot: The plot of ATM1 cash withdrawals over time shows noticeable fluctuations, with some variability in the withdrawal amounts. While there is no clear upward or downward trend, the data exhibits some seasonal patterns and occasional spikes. The variance appears relatively stable over time, suggesting that a transformation may not be necessary, but further analysis will confirm whether a Box-Cox transformation is required.
# Check if a transformation is needed for ATM1
lambda1 <- atm1_data |>
  features(Cash, features = guerrero) |> # Run guerrero feature
  pull(lambda_guerrero) # Pull the lambda value
lambda1
## [1] 0.2622969
Lambda is approximately 0.26. This suggests that the variance is not stable, and a Box-Cox transformation is recommended.
Box-Cox Transformation to Stabilize Variance:
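For reference, the classic Box-Cox transformation for positive values $y_t$ is
$$
w_t = \begin{cases} \log(y_t) & \text{if } \lambda = 0 \\ \left(y_t^{\lambda} - 1\right)/\lambda & \text{otherwise,} \end{cases}
$$
so the estimated $\lambda \approx 0.26$ sits between a log transformation ($\lambda = 0$) and no transformation ($\lambda = 1$). Note that box_cox() in fabletools implements a sign-adjusted variant of this formula that also tolerates non-positive values.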
# Apply Box-Cox using the estimated lambda
atm1_data <- atm1_data |>
  mutate(Cash_trans = box_cox(Cash, lambda = lambda1))
# Plot transformed data for ATM1
atm1_data |>
autoplot(Cash_trans) +
labs(title = "Transformed ATM1 Cash Withdrawals Over Time",
y = "Transformed Cash Withdrawals",
x = "Date") +
theme_minimal(base_size = 12)
Observations: The variance looks much more stable after applying the Box-Cox transformation, and the fluctuations appear more consistent across time, indicating that the transformation successfully reduced variability. Although there are still noticeable spikes, they are less extreme than in the original plot. Next, we will decompose the time series.
STL Decomposition:
# STL decomposition of transformed data for ATM1
atm1_data |>
model(STL(Cash_trans ~ season(window = "periodic"))) |>
components() |>
autoplot() +
labs(title = "STL Decomposition of ATM1 Cash Withdrawals",
y = "Values",
x = "Date")
Observation from STL Decomposition: The STL decomposition of ATM1 cash withdrawals reveals strong weekly seasonality and a relatively stable trend, confirming that seasonal differencing and seasonal components are necessary for accurate modeling.
# Check unit root test
atm1_data |>
mutate(Cash_diff = difference(Cash_trans, 7)) |> # Seasonal differencing of the Box-Cox transformed series
features(Cash_diff, unitroot_ndiffs)
## # A tibble: 1 × 2
## ATM ndiffs
## <chr> <int>
## 1 ATM1 0
ACF/PACF and Manual ARIMA Refinement: Now that the series for ATM1 is stationary after the Box-Cox transformation and seasonal differencing, we will analyze the ACF and PACF plots to identify possible values for the AR (Auto-Regressive) and MA (Moving Average) terms.
Generate ACF and PACF Plots:
# Check ACF and PACF after Box-Cox and seasonal differencing
atm1_data |>
mutate(Cash_diff = difference(Cash_trans, 7)) |> # Seasonal differencing with lag 7
gg_tsdisplay(Cash_diff, plot_type = "partial", lag = 30) +
labs(title = "Seasonally Differenced ACF/PACF for ATM1",
y = "")
The ACF plot shows a significant spike at lag 7, indicating strong weekly seasonality, while the PACF plot shows only minor spikes at the early lags, pointing to low-order non-seasonal terms.
Model Fitting and Selection:
Fit Models for ATM1:
# Fit multiple models for ATM1
atm1_fit <- atm1_data |>
model(
# Additive ETS model
ets_ana = ETS(Cash_trans ~ error("A") + trend("N") + season("A")),
# Multiplicative ETS model
ets_mnm = ETS(Cash_trans ~ error("M") + trend("N") + season("M")),
# SNaive model
snaive = SNAIVE(Cash_trans),
# Auto ARIMA with Box-Cox transformation
arima_auto = ARIMA(Cash_trans),
# Manually tuned ARIMA model (optional for refinement later)
arima_manual = ARIMA(Cash_trans ~ pdq(0,0,2) + PDQ(0,1,1))
)
Compare Model Performance: We compare model performance using AICc, RMSE, and MAE.
atm1_results <- atm1_fit |>
glance() |>
select(.model, AICc, BIC) |>
left_join(accuracy(atm1_fit), by = ".model") |>
select(.model, AICc, RMSE, MAE, MAPE) |>
arrange(AICc)
atm1_results
## # A tibble: 5 × 5
## .model AICc RMSE MAE MAPE
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 arima_auto 1229. 1.31 0.720 Inf
## 2 arima_manual 1229. 1.31 0.720 Inf
## 3 ets_ana 2381. 1.33 0.721 Inf
## 4 ets_mnm 2460. 1.15 0.683 16.3
## 5 snaive NA 1.58 0.822 Inf
Based on the results, arima_auto and arima_manual have the lowest AICc and nearly identical RMSE and MAE values, so we will proceed with arima_auto. Before forecasting, let’s check the model report.
atm1_fit |>
select(arima_auto) |>
report()
## Series: Cash_trans
## Model: ARIMA(0,0,2)(0,1,1)[7]
##
## Coefficients:
## ma1 ma2 sma1
## 0.1105 -0.1088 -0.6419
## s.e. 0.0524 0.0521 0.0431
##
## sigma^2 estimated as 1.771: log likelihood=-610.69
## AIC=1229.38 AICc=1229.49 BIC=1244.9
Multiple ARIMA models to refine results: Now that we’ve compared initial results, we refine by fitting additional ARIMA models with different parameters.
# Fit multiple ARIMA models for ATM1 using Box-Cox transformed data
fit <- atm1_data |>
model(
# ARIMA(0,0,6)(0,1,1)[7] - Based on refined ACF/PACF analysis
arima006011 = ARIMA(Cash_trans ~ 0 + pdq(0,0,6) + PDQ(0,1,1)),
# ARIMA(6,0,0)(4,1,0)[7] - Exploring another seasonal/non-seasonal config
arima600410 = ARIMA(Cash_trans ~ 0 + pdq(6,0,0) + PDQ(4,1,0)),
# ARIMA(1,0,2)(1,1,2)[7] - Testing different AR and MA lags
arima102112 = ARIMA(Cash_trans ~ 0 + pdq(1,0,2) + PDQ(1,1,2)),
# ARIMA(0,0,3)(1,1,1)[7] - A model with more seasonal terms
arima003111 = ARIMA(Cash_trans ~ 0 + pdq(0,0,3) + PDQ(1,1,1)),
# Auto-selected ARIMA with Box-Cox applied, without stepwise search
auto = ARIMA(Cash_trans, stepwise = FALSE)
)
# Compare model performance using AICc and RMSE for ATM1
model_results <- left_join(
glance(fit) |> select(.model, AICc, BIC),
accuracy(fit) |> select(.model, RMSE),
by = ".model"
) |>
arrange(AICc)
model_results
## # A tibble: 5 × 4
## .model AICc BIC RMSE
## <chr> <dbl> <dbl> <dbl>
## 1 auto 1229. 1245. 1.31
## 2 arima006011 1231. 1262. 1.30
## 3 arima003111 1233. 1256. 1.31
## 4 arima102112 1235. 1262. 1.31
## 5 arima600410 1238. 1280. 1.30
The refined comparison shows that the auto model, which matches arima_auto from the earlier comparison, still has the lowest AICc with a comparable RMSE. Therefore, we will proceed with arima_auto to generate the forecast for May 2010.
Forecast for May 2010 Using the Best Model: Since our goal is to forecast ATM withdrawals for May 2010, and May has 31 days, we will forecast for the next 31 values after the end of April 2010.
# Generate 31-day forecast for ATM1
atm1_forecast <- atm1_fit |>
select(arima_auto) |> # Use the best model (arima_auto)
forecast(h = 31) # Forecast 31 days for May 2010
# Historical transformed data for plotting (Cash_trans was created above)
atm1_data_trans <- atm1_data |>
  select(Date, Cash_trans)
atm1_forecast |>
autoplot(atm1_data_trans) +
ggtitle(latex2exp::TeX(paste0("ATM 1 Forecasted with $ARIMA(0,0,2)(0,1,1)_7$ and $\\lambda$ = ",
round(lambda1, 2)))) +
labs(y = "Transformed Cash Withdrawals", x = "Date") +
theme_minimal(base_size = 12)
After generating the forecast for ATM1 using the ARIMA(0,0,2)(0,1,1)[7] model with a Box-Cox transformation (λ = 0.262), it is essential to analyze the residuals to ensure that the model has captured all the available information and that the remaining noise behaves like white noise.
# Check residuals of the ARIMA model for ATM1
atm1_fit |>
select(arima_auto) |>
gg_tsresiduals() +
labs(title = expression("Residuals Analysis for ATM1 Forecast Using ARIMA(0,0,2)(0,1,1)[7] and " ~ lambda ~ "=" ~ 0.262))
Residual Analysis of ATM1 Forecast: The residual analysis indicates that the residuals are randomly scattered around zero with no discernible pattern, suggesting that the model captures most of the variability in the data. The ACF plot of the residuals shows no significant autocorrelation, confirming that the residuals resemble white noise. The histogram of the residuals also indicates a roughly normal distribution, further supporting the adequacy of the selected ARIMA model.
# Perform Ljung-Box test on residuals
atm1_fit |>
select(arima_auto) |>
augment() |>
features(.innov, ljung_box, lag = 14, dof = 4)
## # A tibble: 1 × 3
## .model lb_stat lb_pvalue
## <chr> <dbl> <dbl>
## 1 arima_auto 9.93 0.447
The Ljung-Box test gives lb_stat = 9.93 with a p-value of 0.45. Since the p-value > 0.05, we fail to reject the null hypothesis, meaning there is no significant autocorrelation left in the residuals.
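For reference, the Ljung-Box statistic computed above is
$$
Q^* = T(T+2) \sum_{k=1}^{\ell} \frac{\hat{r}_k^2}{T-k},
$$
where $T$ is the length of the series, $\ell$ is the number of lags tested (14 here), and $\hat{r}_k$ is the residual autocorrelation at lag $k$. The dof argument reduces the degrees of freedom of the reference chi-squared distribution by the number of estimated model coefficients.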
ATM2 Time Series Analysis & Data Preparation: In this section we will look at ATM2 from our ATM time series. We’ll begin by filtering the data for ATM2 and plotting the raw series to identify patterns, trends, and seasonality.
# Filter data for ATM2
atm2_data <- data_ts |>
filter(ATM == "ATM2")
atm2_data |>
autoplot(Cash) +
labs(title = "ATM2 Cash Withdrawals Over Time",
y = "Cash Withdrawals (in dollars)",
x = "Date") +
theme_minimal(base_size = 12)
The plot shows fluctuations in cash withdrawals over time. Similar to ATM1, there are spikes and dips, but the variance appears less stable. There are no clear upward or downward trends, but seasonal patterns are visible, suggesting weekly or periodic withdrawal behavior.
Check if a Transformation is Needed: We’ll now check if a Box-Cox transformation is necessary to stabilize the variance.
# Check if a transformation is needed for ATM2
lambda2 <- atm2_data |>
features(Cash, features = guerrero) |> # Generates a tibble
pull(lambda_guerrero) # Extracts the value
lambda2
## [1] 0.6746523
Lambda is approximately 0.67, which suggests that a Box-Cox transformation is likely needed to stabilize the variance before modeling. Since the value is closer to 1, the transformation will be less dramatic but will still help reduce variability.
Box-Cox Transformation to Stabilize Variance:
# Apply Box-Cox transformation to stabilize variance for ATM2
atm2_data <- atm2_data |>
mutate(Cash_trans = box_cox(Cash, lambda = lambda2))
# Plot transformed data for ATM2
atm2_data |>
autoplot(Cash_trans) +
labs(title = "Transformed ATM2 Cash Withdrawals Over Time",
y = "Transformed Cash Withdrawals",
x = "Date") +
theme_minimal(base_size = 12)
Observations: The Box-Cox transformation has reduced the fluctuations in ATM2 cash withdrawals, making the data more consistent. Although some variation remains, the transformation has improved the stability of the series.
STL Decomposition:
# STL decomposition of transformed data for ATM2
atm2_data |>
model(STL(Cash_trans ~ season(window = "periodic"))) |>
components() |>
autoplot() +
labs(title = "STL Decomposition of ATM2 Cash Withdrawals",
y = "Values",
x = "Date")
Observation from STL Decomposition: The STL decomposition of ATM2 shows a strong weekly seasonality, a relatively stable trend, and a slightly fluctuating remainder, indicating consistent seasonal patterns over time.
Next we will check the unit root test to determine if any further differencing is required.
# Check unit root test after Box-Cox and seasonal differencing
atm2_data |>
mutate(Cash_diff = difference(Cash_trans, 7)) |> # Seasonal differencing of the Box-Cox transformed series
features(Cash_diff, unitroot_ndiffs)
## # A tibble: 1 × 2
## ATM ndiffs
## <chr> <int>
## 1 ATM2 0
Since the unit root test result for ATM2 shows ndiffs = 0, this means no further differencing is required after applying the Box-Cox transformation and seasonal differencing.
We can now move on to generating the ACF and PACF plots to analyze the correlation patterns and identify potential AR and MA terms for model refinement.
ACF/PACF and Manual ARIMA Refinement: Now that the series for ATM2 is stationary after the Box-Cox transformation and seasonal differencing, we will analyze the ACF and PACF plots to identify possible values for the AR (Auto-Regressive) and MA (Moving Average) terms.
Generate ACF and PACF Plots:
# Generate ACF and PACF plots for ATM2
atm2_data |>
mutate(Cash_diff = difference(Cash_trans, 7)) |> # Seasonal differencing with lag 7
gg_tsdisplay(Cash_diff, plot_type = "partial", lag = 30) +
labs(title = "Seasonally Differenced ACF/PACF for ATM2",
y = "")
The ACF plot shows a significant spike at lag 7, indicating strong weekly seasonality. The two significant spikes at the early lags suggest a possible MA(2) component, together with a seasonal MA(1) term. Based on these observations, a reasonable initial model to try is ARIMA(0,0,2)(0,1,1)[7].
Model Fitting and Selection:
Fit Models for ATM2:
# Fit multiple models for ATM2
atm2_fit <- atm2_data |>
model(
# Additive ETS model
ets_ana = ETS(Cash_trans ~ error("A") + trend("N") + season("A")),
# Multiplicative ETS model
ets_mnm = ETS(Cash_trans ~ error("M") + trend("N") + season("M")),
# SNaive model
snaive = SNAIVE(Cash_trans),
# Auto ARIMA with Box-Cox transformation
arima_auto2 = ARIMA(Cash_trans),
# Manually tuned ARIMA model (optional for refinement later)
arima_manual2 = ARIMA(Cash_trans ~ pdq(0,0,2) + PDQ(0,1,1))
)
Now, we will compare the model performance using AICc, RMSE, and MAE to identify the best model.
# Compare model performance for ATM2 using AICc, RMSE, and MAE
atm2_results <- atm2_fit |>
glance() |>
select(.model, AICc, BIC) |>
left_join(accuracy(atm2_fit), by = ".model") |>
select(.model, AICc, RMSE, MAE, MAPE) |>
arrange(AICc)
atm2_results
## # A tibble: 5 × 5
## .model AICc RMSE MAE MAPE
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 arima_auto2 2390. 6.59 4.61 Inf
## 2 arima_manual2 2403. 6.76 4.72 Inf
## 3 ets_ana 3574. 6.80 4.70 Inf
## 4 ets_mnm 3886. 9.62 7.91 151.
## 5 snaive NA 8.07 5.50 Inf
Based on the results for ATM2, we can clearly see that arima_auto2 has the lowest AICc (≈ 2390) and the best RMSE and MAE of the candidate models.
atm2_fit |>
select(arima_auto2) |>
report()
## Series: Cash_trans
## Model: ARIMA(2,0,2)(0,1,1)[7]
##
## Coefficients:
## ar1 ar2 ma1 ma2 sma1
## -0.4180 -0.9012 0.4579 0.7889 -0.7129
## s.e. 0.0592 0.0475 0.0888 0.0619 0.0419
##
## sigma^2 estimated as 44.97: log likelihood=-1188.76
## AIC=2389.51 AICc=2389.75 BIC=2412.8
Multiple ARIMA Models to Refine Results: Now, as we did for ATM1, let’s fit multiple ARIMA models to refine the results.
# Fit multiple ARIMA models for ATM2 using Box-Cox transformed data
fit2 <- atm2_data |>
model(
# ARIMA(0,0,6)(0,1,1)[7] - Based on refined ACF/PACF analysis
arima006011 = ARIMA(Cash_trans ~ 0 + pdq(0,0,6) + PDQ(0,1,1)),
# ARIMA(6,0,0)(4,1,0)[7] - Exploring another seasonal/non-seasonal config
arima600410 = ARIMA(Cash_trans ~ 0 + pdq(6,0,0) + PDQ(4,1,0)),
# ARIMA(1,0,2)(1,1,2)[7] - Testing different AR and MA lags
arima102112 = ARIMA(Cash_trans ~ 0 + pdq(1,0,2) + PDQ(1,1,2)),
# ARIMA(0,0,3)(1,1,1)[7] - A model with more seasonal terms
arima003111 = ARIMA(Cash_trans ~ 0 + pdq(0,0,3) + PDQ(1,1,1)),
# Auto-selected ARIMA with Box-Cox applied, without stepwise search
auto = ARIMA(Cash_trans, stepwise = FALSE)
)
# Compare model performance using AICc and RMSE for ATM2
model_results2 <- left_join(
glance(fit2) |> select(.model, AICc, BIC),
accuracy(fit2) |> select(.model, RMSE),
by = ".model"
) |>
arrange(AICc)
model_results2
## # A tibble: 5 × 4
## .model AICc BIC RMSE
## <chr> <dbl> <dbl> <dbl>
## 1 auto 2391. 2418. 6.58
## 2 arima006011 2394. 2425. 6.60
## 3 arima600410 2395. 2437. 6.54
## 4 arima102112 2403. 2430. 6.71
## 5 arima003111 2406. 2429. 6.75
The auto ARIMA model has the lowest AICc and a relatively low RMSE, indicating that it is the best-performing candidate. Although arima600410 has a slightly lower RMSE, its AICc is higher, so arima_auto2 remains the preferred model based on AICc. Here is our model report.
# Generate model report for ARIMA auto model for ATM2
atm2_fit |>
select(arima_auto2) |>
report()
## Series: Cash_trans
## Model: ARIMA(2,0,2)(0,1,1)[7]
##
## Coefficients:
## ar1 ar2 ma1 ma2 sma1
## -0.4180 -0.9012 0.4579 0.7889 -0.7129
## s.e. 0.0592 0.0475 0.0888 0.0619 0.0419
##
## sigma^2 estimated as 44.97: log likelihood=-1188.76
## AIC=2389.51 AICc=2389.75 BIC=2412.8
Now that our model has been selected and validated, we can proceed to generate the forecast for ATM2 cash withdrawals for the month of May 2010.
Forecast for May 2010 Using the Best Model:
# Generate 31-day forecast for ATM2
atm2_forecast <- atm2_fit |>
select(arima_auto2) |> # Use the best model (arima_auto2)
forecast(h = 31) # Forecast 31 days for May 2010
# Plot forecast with historical data after transformation
atm2_data_trans <- atm2_data |>
mutate(Cash_trans = box_cox(Cash, lambda = lambda2)) |> # Apply Box-Cox for ATM2
select(Date, Cash_trans) # Keep only transformed column and Date
atm2_forecast |>
autoplot(atm2_data_trans) +
labs(
title = latex2exp::TeX(paste0("ATM 2 Forecasted with $ARIMA(2,0,2)(0,1,1)_7$ and $\\lambda$ = ",
round(lambda2, 2))), # Use lambda2 dynamically
y = "Transformed Cash Withdrawals",
x = "Date"
) +
theme_minimal(base_size = 12)
The forecast for ATM2 cash withdrawals using ARIMA(2,0,2)(0,1,1)[7] with lambda = 0.67 shows a reasonable extension of the historical trend, capturing the weekly seasonality with consistent variability.
# Check residuals of the ARIMA model for ATM2
atm2_fit |>
select(arima_auto2) |> # Use the best model (arima_auto2)
gg_tsresiduals() +
labs(
title = expression("Residuals Analysis for ATM2 Forecast Using ARIMA(2,0,2)(0,1,1)"[7] ~ "and" ~ lambda ~ "=" ~ 0.67)
)
The residuals from the ARIMA(2,0,2)(0,1,1)[7] model for ATM2 appear to
be uncorrelated (white noise), indicating that the model has captured
most of the underlying structure in the data.
# Ljung-Box test with lag 14
atm2_fit |>
select(arima_auto2) |>
augment() |>
features(.innov, ljung_box, lag = 14, dof = 5)
## # A tibble: 1 × 3
## .model lb_stat lb_pvalue
## <chr> <dbl> <dbl>
## 1 arima_auto2 9.98 0.352
The Ljung-Box test gives lb_stat = 9.98 with a p-value of 0.35. Since the p-value > 0.05, we fail to reject the null hypothesis, meaning there is no significant autocorrelation left in the residuals.
ATM3 Time Series Analysis: For ATM3, the situation is unique because all of the values are zero except for the last three. Fitting an ARIMA model to these three points is possible but highly unreliable, as ARIMA models require sufficient data to estimate parameters accurately and produce meaningful forecasts. The same applies to ETS models, which also need enough data for reliable predictions.
When considering benchmark techniques such as Mean, Naive, Seasonal Naive, and Drift, using the Mean method on the last three values seems to be the best option. Let’s quickly review the plot.
# Filter data for ATM3
atm3_data <- data_ts |>
filter(ATM == "ATM3")
# Plot ATM3 data
atm3_data |>
autoplot(Cash) +
labs(title = "ATM3 Cash Withdrawals Over Time",
y = "Cash Withdrawals (in dollars)",
x = "Date") +
theme_minimal(base_size = 12)
The plot shows that ATM 3 had zero withdrawals for most of the period, with only the last three data points showing non-zero values.
Model Fitting:
# Fit a MEAN model for ATM3 by filtering non-zero values
atm3_fit <- data_ts |>
filter(ATM == "ATM3",
Cash != 0) |>
model(MEAN(Cash))
Now let’s forecast for the month of May 2010.
Forecast:
# Forecast for 31 days in May 2010
atm3_forecast <- atm3_fit |>
forecast(h = 31)
atm3_forecast |>
autoplot(data_ts) +
ggtitle("ATM 3 Forecasted with the MEAN() Model")
The forecast for ATM3 using the MEAN() model shows consistent cash withdrawal predictions based on the last three non-zero values, with a narrow confidence interval indicating minimal variability. Also, since ATM3 was modeled using the MEAN() model, there’s no need for residual analysis or Ljung-Box tests here because we didn’t apply ARIMA or ETS.
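As a quick sanity check (a sketch), the MEAN() point forecast should simply equal the average of the three non-zero observations:
# The MEAN() point forecast equals the average of the non-zero Cash values
data_ts |>
  as_tibble() |>
  filter(ATM == "ATM3", Cash != 0) |>
  summarise(mean_cash = mean(Cash))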
ATM4 Time Series Analysis & Data Preparation: In this section we will look at ATM4 from our ATM time series. We’ll begin by filtering the data for ATM4 and plotting the raw series to identify patterns, trends, and seasonality.
# Filter data for ATM4
atm4_data <- data_ts |>
filter(ATM == "ATM4")
# Plot ATM4 data
atm4_data |>
autoplot(Cash) +
labs(title = "ATM4 Cash Withdrawals Over Time",
y = "Cash Withdrawals (in dollars)",
x = "Date") +
theme_minimal(base_size = 12)
The plot for ATM4 cash withdrawals shows frequent and sharp fluctuations, suggesting high variability in the data. Before considering a transformation, we will run a robust STL decomposition to screen for outliers using the remainder component.
# Robust STL decomposition to expose outliers in the remainder
atm4_cmp <- data_ts |>
  filter(ATM == "ATM4") |>
  model(STL(Cash ~ season(window = "periodic"), robust = TRUE)) |>
  components()
ATM4 contains extreme values that should be handled before forecasting; after removal, interpolation can fill the resulting gaps. Using a criterion where observations whose remainder lies more than 3 interquartile ranges (IQR) beyond the quartiles are considered outliers, we identified two such points, shown below.
outliers <- atm4_cmp |>
  filter(remainder < quantile(remainder, 0.25) - 3*IQR(remainder) |
         remainder > quantile(remainder, 0.75) + 3*IQR(remainder))
outliers
## # A dable: 2 x 8 [1D]
## # Key: ATM, .model [1]
## # : Cash = trend + season_week + remainder
## ATM .model Date Cash trend season_week remainder season_adjust
## <chr> <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ATM4 "STL(Cash ~… 2009-09-22 1712. 373. -71.5 1411. 1784.
## 2 ATM4 "STL(Cash ~… 2010-02-09 10920. 279. -71.5 10712. 10991.
With the outliers identified, we can replace the extreme value with NA and interpolate using ARIMA. Since the outlier on February 9 is far larger than the one on September 22, we will only address the February 9 value.
# Replace the identified outlier for ATM4 with NA
atm4_missing <- data_ts |>
mutate(Cash = replace(Cash, ATM == "ATM4" & Date == "2010-02-09", NA))
# Fit ARIMA model to handle missing values and interpolate
data_ts <- atm4_missing |>
model(ARIMA(Cash)) |>
interpolate(atm4_missing)
The outlier has been replaced, and we can now check the plot of the time series alongside its components.
data_ts |>
filter(ATM == "ATM4") |>
autoplot(Cash) +
labs(
title = "ATM4 Cash Withdrawals After Interpolation",
y = "Cash Withdrawals (in dollars)",
x = "Date"
) +
theme_minimal(base_size = 12)
STL Decomposition:
data_ts |>
filter(ATM == "ATM4") |>
model(STL(Cash ~ season(window = "periodic"), robust = TRUE)) |>
components() |>
autoplot() +
labs(title = "STL Decomposition for ATM4 without Outlier")
Observation from STL Decomposition: The STL decomposition of ATM4 cash withdrawals (after outlier removal) shows a stable trend, strong weekly seasonality, and reduced noise in the remainder.
lambda4 <- data_ts |>
filter(ATM == "ATM4") |>
features(Cash, features = guerrero) |>
pull(lambda_guerrero)
lambda4
## [1] 0.4495623
Lambda is approximately 0.45, so a moderate Box-Cox transformation is applied inside the models below.
Model Fitting and Selection:
Fit Models for ATM4:
# Apply Box-Cox transformation and fit models for ATM4
atm4_fit <- data_ts |>
filter(ATM == "ATM4") |>
model(
# Additive ETS with Box-Cox transformation
additive_ts = ETS(box_cox(Cash, lambda4) ~ error("A") + trend("N") + season("A")),
# Multiplicative ETS with Box-Cox transformation
multiplicative_ts = ETS(box_cox(Cash, lambda4) ~ error("M") + trend("N") + season("M")),
# ARIMA with Box-Cox transformation
arima_ts = ARIMA(box_cox(Cash, lambda4)),
# SNaive with Box-Cox transformation
snaive_ts = SNAIVE(box_cox(Cash, lambda4))
)
# Compare model performance for ATM4
atm4_results <- left_join(
glance(atm4_fit) |> select(.model, AICc, BIC),
accuracy(atm4_fit) |> select(.model, RMSE),
by = ".model"
) |>
arrange(AICc)
atm4_results
## # A tibble: 4 × 4
## .model AICc BIC RMSE
## <chr> <dbl> <dbl> <dbl>
## 1 arima_ts 2930. 2950. 352.
## 2 multiplicative_ts 4026. 4064. 345.
## 3 additive_ts 4032. 4071. 338.
## 4 snaive_ts NA NA 453.
The ARIMA model (arima_ts) was selected for forecasting ATM4 as it has the lowest AICc and BIC, indicating the best overall model fit despite additive_ts having a slightly lower RMSE.
# Report for Auto ARIMA model for ATM4
atm4_fit |>
select(arima_ts) |>
report()
## Series: Cash
## Model: ARIMA(0,0,1)(2,0,0)[7] w/ mean
## Transformation: box_cox(Cash, lambda4)
##
## Coefficients:
## ma1 sar1 sar2 constant
## 0.0792 0.2076 0.2026 16.8725
## s.e. 0.0527 0.0516 0.0524 0.7311
##
## sigma^2 estimated as 176.1: log likelihood=-1460.16
## AIC=2930.31 AICc=2930.48 BIC=2949.81
Forecast for May 2010 Using the Best Model:
# Generate 31-day forecast for ATM4 and keep the ARIMA model
atm4_forecast <- atm4_fit |>
forecast(h = 31) |>
filter(.model == "arima_ts")
# Plot with Correct Historical Data
atm4_forecast |>
autoplot(data_ts |> filter(ATM == "ATM4")) +
ggtitle(latex2exp::TeX(paste0("ATM 4 Forecasted with $ARIMA(0,0,1)(2,0,0)_7$ and $\\lambda$ = ",
round(lambda4, 2)))) +
labs(y = "Transformed Cash Withdrawals", x = "Date") +
theme_minimal(base_size = 12)
# Residual Analysis
atm4_fit |>
select(arima_ts) |>
gg_tsresiduals() +
ggtitle(latex2exp::TeX(paste0("Residuals for $ARIMA(0,0,1)(2,0,0)_7$ with $\\lambda$ = ",
round(lambda4, 2))))
# Ljung-Box Test for Residuals
atm4_fit |>
select(arima_ts) |>
augment() |>
features(.innov, ljung_box, lag = 14, dof = 2)
## # A tibble: 1 × 3
## .model lb_stat lb_pvalue
## <chr> <dbl> <dbl>
## 1 arima_ts 16.7 0.160
Since the p-value (0.16) is greater than 0.05, we fail to reject the null hypothesis: the residuals show no significant remaining autocorrelation.
Forecasted ATM Data:
Here are the final forecasts for all four ATMs. The ATM1 and ATM2 forecasts capture the seasonal patterns and fluctuations in the data, ATM3 maintains a steady forecast reflecting the average of its non-zero values, and ATM4 tends to revert toward the mean as the forecast progresses through May.
# Convert all forecasts to data frames and select consistent columns
fc_combined <- bind_rows(
atm1_forecast |> as.data.frame() |> select(Date, .mean) |> mutate(ATM = "ATM1"),
atm2_forecast |> as.data.frame() |> select(Date, .mean) |> mutate(ATM = "ATM2"),
atm3_forecast |> as.data.frame() |> select(Date, .mean) |> mutate(ATM = "ATM3"),
atm4_forecast |> as.data.frame() |> select(Date, .mean) |> mutate(ATM = "ATM4")
) |>
rename(Cash = .mean) # Rename .mean to Cash for consistency
# Export the forecasted data to CSV
write.csv(fc_combined, "ATM_forecasts.csv", row.names = FALSE)
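One caveat: the ATM1 and ATM2 models were fit to the pre-transformed Cash_trans column, so their forecast means in fc_combined are on the Box-Cox scale (hence the single-digit ATM1 values visible in the glimpse further below), while ATM3 and ATM4 are on the original scale. A minimal sketch of the back-transformation to the original units, using fabletools’ inv_box_cox() and the lambdas estimated earlier:
# Back-transform ATM1 and ATM2 point forecasts to the original scale
fc_dollars <- fc_combined |>
  mutate(Cash = case_when(
    ATM == "ATM1" ~ inv_box_cox(Cash, lambda1),
    ATM == "ATM2" ~ inv_box_cox(Cash, lambda2),
    TRUE ~ Cash
  ))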
Let’s bring all the forecast plots together for a comprehensive comparison.
# Plot forecasted data for all ATMs with facet wrap
fc_combined |>
as_tsibble(index = Date, key = ATM) |>
autoplot(Cash) +
facet_wrap(~ATM, scales = "free", nrow = 4) +
labs(
title = "Forecasted ATM Withdrawals in May 2010",
y = "Cash",
x = "Date [1D]"
) +
theme_minimal(base_size = 12)
Historical + Forecasted Plot:
# Plot historical data and forecasts for all ATMs
fc_combined|>
as_tsibble(index = Date, key = ATM)|>
autoplot(Cash) +
autolayer(data_ts, Cash, colour = "black") + # Add historical data in black
facet_wrap(~ATM, scales = "free", nrow = 4) +
labs(
title = "ATM Withdrawals: Historical and Forecasted",
y = "Cash",
x = "Date [1D]"
) +
theme_minimal(base_size = 12)
Part B – Forecasting Power, ResidentialCustomerForecastLoad-624.xlsx
In this section, we will analyze residential power consumption data. The dataset contains monthly power usage (measured in kilowatt-hours, KWH) from January 1998 to December 2013. Our objective is to build a suitable forecasting model and generate monthly predictions of power consumption for 2014. The insights gained from this forecast will help assess future energy usage trends.
We will start by loading the dataset, which is stored in a local folder, and then explore its structure.
data2 <- read_excel("ResidentialCustomerForecastLoad-624.xlsx", col_names = T)
head(data2)
## # A tibble: 6 × 3
## CaseSequence `YYYY-MMM` KWH
## <dbl> <chr> <dbl>
## 1 733 1998-Jan 6862583
## 2 734 1998-Feb 5838198
## 3 735 1998-Mar 5420658
## 4 736 1998-Apr 5010364
## 5 737 1998-May 4665377
## 6 738 1998-Jun 6467147
The dataset contains three variables:
CaseSequence – Index of the records
YYYY-MMM – Date in year-month format (character)
KWH – Power consumption in kilowatt-hours
Since the “YYYY-MMM” column is stored as a character variable, we need to convert it to a date format before proceeding. After that, we will transform the dataset into a tsibble object to facilitate time series analysis.
colnames(data2)
## [1] "CaseSequence" "YYYY-MMM" "KWH"
# Convert 'YYYY-MMM' to date format and transform data into tsibble
load <- data2 |>
rename(Month = 'YYYY-MMM') |> # Rename the date column for consistency
mutate(Month = yearmonth(Month)) |> # Convert to date format
as_tsibble(index = Month) # Convert to tsibble format
summary(data2)
## CaseSequence YYYY-MMM KWH
## Min. :733.0 Length:192 Min. : 770523
## 1st Qu.:780.8 Class :character 1st Qu.: 5429912
## Median :828.5 Mode :character Median : 6283324
## Mean :828.5 Mean : 6502475
## 3rd Qu.:876.2 3rd Qu.: 7620524
## Max. :924.0 Max. :10655730
## NA's :1
We can also see that the max value of the KWH variable is 10,655,730 and the min is 770,523. We’ll explore the data further to determine whether either of these values is an outlier. The summary also shows 1 NA; let’s confirm the missing value.
colSums(is.na(load))
## CaseSequence Month KWH
## 0 0 1
To address the missing value in the dataset, we fit an ARIMA model and use it to interpolate the missing KWH observation, mirroring the approach used for the ATM series.
# Fit ARIMA model to the series containing the missing value
# and interpolate the missing KWH observation
load <- load |>
  model(ARIMA(KWH)) |>
  interpolate(load)
Exploring the Dataset:
# Plot the interpolated KWH data
load |>
autoplot(KWH) +
labs(
title = "Residential Power Consumption",
y = "Power Consumption (KWH)",
x = "Month"
) +
theme_minimal(base_size = 12)
We can see there is an obvious outlier. Let’s identify and display the outlier in KWH.
load |>
filter(KWH == min(KWH, na.rm = TRUE)) # Filter the row with minimum KWH value
## # A tsibble: 1 x 3 [1M]
## CaseSequence Month KWH
## <dbl> <mth> <dbl>
## 1 883 2010 Jul 770523
# Plot
load |>
ggplot(aes(x = Month, y = KWH)) +
geom_line(color = "black") +
geom_point(data = load |> filter(KWH == min(KWH, na.rm = TRUE)),
aes(x = Month, y = KWH), color = "red", size = 3) +
labs(title = "Residential Power Consumption with Outlier Highlighted",
y = "Power Consumption (KWH)",
x = "Month") +
theme_minimal(base_size = 12)
The plot reveals a consistent seasonal pattern in residential power consumption. However, there is a dramatic drop in July 2010 (CaseSequence 883), where the recorded consumption was only 770,523 KWH, indicating a likely anomaly or data error. The most appropriate course of action is to remove this outlier and interpolate to fill the gap.
# Replace outlier (770523) with NA in KWH
load_miss <- load|>
mutate(KWH = replace(KWH, KWH == 770523, NA))
# Fit ARIMA model to handle missing values and interpolate
load <- load_miss|>
model(ARIMA(KWH))|>
interpolate(load_miss)
# Plot the data after interpolation
load|>
autoplot(KWH) +
ggtitle("Residential Power Consumption without Outlier")
The updated plot shows a smooth and consistent seasonal pattern after successfully removing the outlier (770523) and interpolating the missing value.
STL Decomposition:
load|>
model(STL(KWH ~ season(window = "periodic"), robust = TRUE))|>
components()|>
autoplot() +
labs(title = "STL Decomposition of Residential Power Consumption without Outlier",
y = "KWH",
x = "Month")
After removing the outlier, the plot becomes more refined and interpretable. A clear upward trend emerges, showing minimal variation compared to the other components. Additionally, there is evident annual seasonality, with noticeable peaks during the summer and winter months. The remainder appears random, although a significant spike is observed in December 2013.
Here is a seasonal plot showing the pattern for Residential Power Usage.
# Plot seasonal
load |>
gg_season(KWH, labels = "both") +
labs(
title = "Seasonal Plot for Residential Power Consumption ",
y = "KWH",
x = "Month"
)
To gain deeper insights into the seasonal patterns of residential power usage, we will now explore the subseries plot, which highlights variations across different months.
# Plot subseries pattern for Residential Power Usage
load |>
gg_subseries(KWH) +
labs(
title = "Residential Power Consumption ",
y = "KWH",
x = "Month"
)
The subseries plot reveals a clear seasonal pattern, with consistent peaks in power consumption during the summer and winter months. This pattern suggests a higher demand for energy during extreme temperatures. To further refine our analysis, we will now determine the optimal lambda for a Box-Cox transformation and proceed to develop models for both transformed and non-transformed time series.
Predictive Modeling: Residential Power Consumption
# Calculate optimal lambda for Box-Cox transformation
lambdaR <- load|>
features(KWH, features = guerrero)|>
pull(lambda_guerrero)
lambdaR
## [1] -0.2189058
Lambda is approximately -0.22, slightly beyond a log transformation (λ = 0), indicating that a fairly strong transformation is needed to stabilize the variance.
# Fit models for power consumption data (consistent with ATM models)
load_fit <- load|>
model(
# Additive ETS with Box-Cox transformation
additive_bc = ETS(box_cox(KWH, lambdaR) ~ error("A") + trend("N") + season("A")),
# Multiplicative ETS with Box-Cox transformation
multiplicative_bc = ETS(box_cox(KWH, lambdaR) ~ error("M") + trend("N") + season("M")),
# ARIMA with Box-Cox transformation
arima_bc = ARIMA(box_cox(KWH, lambdaR)),
# SNAIVE with Box-Cox transformation
snaive_bc = SNAIVE(box_cox(KWH, lambdaR))
)
# Model statistics for comparison
left_join(
  glance(load_fit)|> select(.model:BIC),
  accuracy(load_fit)|> select(.model, RMSE),
  by = ".model"
)|>
  arrange(AICc)
## # A tibble: 4 × 7
## .model sigma2 log_lik AIC AICc BIC RMSE
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 arima_bc 0.0000104 788. -1566. -1565. -1550. 652115.
## 2 additive_bc 0.00000939 614. -1198. -1195. -1149. 639391.
## 3 multiplicative_bc 0.000000482 614. -1197. -1195. -1148. 639491.
## 4 snaive_bc 0.0000148 NA NA NA NA 825253.
The ARIMA model with Box-Cox transformation generated the best results, with the lowest AIC, AICc, and BIC values, suggesting the most appropriate fit for the data. Although the additive and multiplicative ETS models also performed reasonably well, ARIMA remains the optimal choice. We will now proceed to generate forecasts using the selected ARIMA model.
load_fit |>
select(arima_bc) |>
report()
## Series: KWH
## Model: ARIMA(0,0,1)(2,1,0)[12] w/ drift
## Transformation: box_cox(KWH, lambdaR)
##
## Coefficients:
## ma1 sar1 sar2 constant
## 0.2097 -0.7155 -0.3486 1e-03
## s.e. 0.0838 0.0750 0.0763 3e-04
##
## sigma^2 estimated as 1.041e-05: log likelihood=787.87
## AIC=-1565.75 AICc=-1565.4 BIC=-1549.78
Forecast: Residential Power Consumption
# Forecast Residential Power Consumption for 2014
fc_load <- load_fit|>
forecast(h = 12)|>
filter(.model == "arima_bc")
# Plot forecast with historical data
fc_load|>
autoplot(load) +
ggtitle(latex2exp::TeX(paste0("Forecast of Power Usage: $ARIMA(0,0,1)(2,1,0)_{12}$, $\\lambda$ = ",
format(round(lambdaR, 2), nsmall = 2)))) +
labs(y = "KWH", x = "Month") +
theme_minimal(base_size = 12)
Residual Analysis: Residential Power Consumption
# Plot residuals
load_fit |>
select(arima_bc) |> # Select the correct model
gg_tsresiduals(lag = 24) +
ggtitle(latex2exp::TeX(paste0("Residuals for ARIMA(0,0,1)(2,1,0)_{12} with $\\lambda$ = ",
round(lambdaR, 2))))
load_fit|>
  select(arima_bc)|>
  augment()|>
  features(.innov, box_pierce, lag = 24, dof = 0)
## # A tibble: 1 × 3
##   .model   bp_stat bp_pvalue
##   <chr>      <dbl>     <dbl>
## 1 arima_bc    30.0     0.186
Observation on Residual Analysis and Box-Pierce Test:
The residuals from the ARIMA model (ARIMA(0,0,1)(2,1,0)[12] with Box-Cox transformation) exhibit characteristics of white noise, as they are randomly distributed around zero without any discernible pattern. The ACF plot of the residuals shows no significant spikes, indicating that the model has effectively captured the underlying structure of the data.
Additionally, the Box-Pierce test returned a p-value of 0.19, which is greater than 0.05. This suggests that we fail to reject the null hypothesis, implying that the residuals are independently distributed and behave like white noise.
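For reference, the Box-Pierce statistic is the simpler precursor of the Ljung-Box statistic used in Part A:
$$
Q = T \sum_{k=1}^{\ell} \hat{r}_k^2,
$$
with the same chi-squared reference distribution.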
# Final forecasted data using the ARIMA model
fc_load <- fc_load|>
as.data.frame()|>
select(Month, .mean)|>
rename(KWH = .mean)|>
mutate(CaseSequence = 925:936)|>
relocate(CaseSequence)
fc_load
## CaseSequence Month KWH
## 1 925 2014 Jan 10207318
## 2 926 2014 Feb 8526971
## 3 927 2014 Mar 6652600
## 4 928 2014 Apr 5976444
## 5 929 2014 May 5945957
## 6 930 2014 Jun 8293464
## 7 931 2014 Jul 9534133
## 8 932 2014 Aug 10111017
## 9 933 2014 Sep 8470724
## 10 934 2014 Oct 5850947
## 11 935 2014 Nov 6111091
## 12 936 2014 Dec 8257887
Merging with the existing ATM forecast file:
# Preview the data
glimpse(fc_combined)
## Rows: 124
## Columns: 3
## $ Date <date> 2010-05-01, 2010-05-02, 2010-05-03, 2010-05-04, 2010-05-05, 2010…
## $ Cash <dbl> 8.476213, 8.965110, 7.965366, 1.752859, 8.951395, 8.194925, 8.441…
## $ ATM <chr> "ATM1", "ATM1", "ATM1", "ATM1", "ATM1", "ATM1", "ATM1", "ATM1", "…
glimpse(fc_load)
## Rows: 12
## Columns: 3
## $ CaseSequence <int> 925, 926, 927, 928, 929, 930, 931, 932, 933, 934, 935, 936
## $ Month <mth> 2014 Jan, 2014 Feb, 2014 Mar, 2014 Apr, 2014 May, 2014 Ju…
## $ KWH <dbl> 10207318, 8526971, 6652600, 5976444, 5945957, 8293464, 95…
# Check column names to verify consistency
colnames(fc_combined)
## [1] "Date" "Cash" "ATM"
colnames(fc_load)
## [1] "CaseSequence" "Month" "KWH"
# Rename 'Month' to 'Date' in fc_load to match with fc_combined
fc_load <- fc_load|>
rename(Date = Month)
# Convert 'Date' to character in both datasets to avoid conflicts
fc_combined <- fc_combined |>
mutate(Date = as.character(Date))
fc_load <- fc_load |>
mutate(Date = as.character(Date))
# Combine ATM (fc_combined) and Residential Power forecasts (fc_load)
combined_fc <- bind_rows(fc_combined, fc_load)
# Save the combined forecast as a CSV
write.csv(combined_fc, "Final_Forecast.csv", row.names = FALSE)
# Read the combined forecast file
final_data <- read.csv("Final_Forecast.csv")
# View the first few rows to confirm the structure
head(final_data)
## Date Cash ATM CaseSequence KWH
## 1 2010-05-01 8.476213 ATM1 NA NA
## 2 2010-05-02 8.965110 ATM1 NA NA
## 3 2010-05-03 7.965366 ATM1 NA NA
## 4 2010-05-04 1.752859 ATM1 NA NA
## 5 2010-05-05 8.951395 ATM1 NA NA
## 6 2010-05-06 8.194925 ATM1 NA NA
# Check the structure of the combined data
str(final_data)
## 'data.frame': 136 obs. of 5 variables:
## $ Date : chr "2010-05-01" "2010-05-02" "2010-05-03" "2010-05-04" ...
## $ Cash : num 8.48 8.97 7.97 1.75 8.95 ...
## $ ATM : chr "ATM1" "ATM1" "ATM1" "ATM1" ...
## $ CaseSequence: int NA NA NA NA NA NA NA NA NA NA ...
## $ KWH : num NA NA NA NA NA NA NA NA NA NA ...
tail(final_data, 15) # Check the last 15 rows
## Date Cash ATM CaseSequence KWH
## 122 2010-05-29 424.1583 ATM4 NA NA
## 123 2010-05-30 448.1962 ATM4 NA NA
## 124 2010-05-31 450.9739 ATM4 NA NA
## 125 2014 Jan NA <NA> 925 10207318
## 126 2014 Feb NA <NA> 926 8526971
## 127 2014 Mar NA <NA> 927 6652600
## 128 2014 Apr NA <NA> 928 5976444
## 129 2014 May NA <NA> 929 5945957
## 130 2014 Jun NA <NA> 930 8293464
## 131 2014 Jul NA <NA> 931 9534133
## 132 2014 Aug NA <NA> 932 10111017
## 133 2014 Sep NA <NA> 933 8470724
## 134 2014 Oct NA <NA> 934 5850947
## 135 2014 Nov NA <NA> 935 6111091
## 136 2014 Dec NA <NA> 936 8257887
Part C – Waterflow Forecast, Waterflow_Pipe1.xlsx and Waterflow_Pipe2.xlsx
Introduction:
Water Flow Forecasting involves analyzing two datasets, Waterflow_Pipe1.xlsx and Waterflow_Pipe2.xlsx, each containing two columns with different time stamps. The task requires aligning these datasets by aggregating the data to hourly intervals and calculating the mean for multiple recordings within each hour. After time-base sequencing, we will determine whether the aggregated data is stationary and assess its suitability for forecasting. If the data is deemed fit for forecasting, we will develop a model to produce a one-week forecast. The final results will be presented via RPubs, an .Rmd file, and an Excel-readable forecast file.
Waterflow Pipe 1:
In this section we will look at the Pipe 1 flow.
In this dataset, the Date Time column is read in as a POSIXct date-time (“Date Time: POSIXct[1:1000]”) with multiple readings per hour. We will make the date-time format explicit and then aggregate to hourly intervals, taking the mean whenever an hour has several recordings.
pipe1_data <- read_excel("Waterflow_Pipe1.xlsx", col_types = c('date', 'numeric')) %>%
  mutate(`Date Time` = as_datetime(`Date Time`)) %>%
  rename(DateTime = `Date Time`) %>%
  mutate(date = as.Date(DateTime),
         hour = paste0(format(DateTime, format = "%H"), ":00:00"))
# Create Pipe1 by grouping and aggregating by hourly intervals
Pipe1 <- pipe1_data %>%
mutate(DateTime = ymd(date) + hms(hour)) %>%
group_by(DateTime) %>%
mutate(WaterFlow = mean(WaterFlow)) %>%
distinct(DateTime, WaterFlow) %>%
as_tsibble(index = DateTime)
head(Pipe1)
## # A tsibble: 6 x 2 [1h] <UTC>
## # Groups: @ DateTime [6]
## DateTime WaterFlow
## <dttm> <dbl>
## 1 2015-10-23 00:00:00 26.1
## 2 2015-10-23 01:00:00 18.9
## 3 2015-10-23 02:00:00 15.2
## 4 2015-10-23 03:00:00 23.1
## 5 2015-10-23 04:00:00 15.5
## 6 2015-10-23 05:00:00 22.7
# Fill gaps to ensure no missing time intervals
Pipe1 <- fill_gaps(Pipe1)
# Check for any newly filled gaps
miss <- Pipe1 %>%
filter(is.na(WaterFlow))
miss
## # A tsibble: 4 x 2 [1h] <UTC>
## # Groups: @ DateTime [4]
## DateTime WaterFlow
## <dttm> <dbl>
## 1 2015-10-27 17:00:00 NA
## 2 2015-10-28 00:00:00 NA
## 3 2015-11-01 05:00:00 NA
## 4 2015-11-01 09:00:00 NA
# Interpolate the missing hours and merge the filled values back in
pipe1_data <- pipe1_data %>%
  select(DateTime, WaterFlow) %>%
  rbind(., miss) %>%              # Append the missing hourly stamps
  as_tsibble(index = DateTime) %>%
  na_interpolation(.)             # Interpolate the NAs (imputeTS)
Pipe1 <- left_join(Pipe1, pipe1_data, by = "DateTime") %>%
  mutate(WaterFlow = coalesce(WaterFlow.x, WaterFlow.y)) %>% # Keep observed values, fill gaps with interpolated ones
  select(DateTime, WaterFlow)
Pipe1 %>%
gg_tsdisplay(WaterFlow, plot_type='partial') +
labs(title = "Water Flow of Pipe 1")
# Remove the residual grouping left over from group_by()
Pipe1 <- Pipe1 %>% ungroup()
groups(Pipe1)
## list()
Pipe1 %>%
features(WaterFlow, unitroot_kpss)
## # A tibble: 1 × 2
## kpss_stat kpss_pvalue
## <dbl> <dbl>
## 1 0.269 0.1
Pipe1 %>%
model(STL(WaterFlow ~ season(window = "periodic"), robust = TRUE)) %>%
components() %>%
autoplot() +
labs(title = "STL Decomposition")
The KPSS test (p-value = 0.1) indicates the series can be treated as stationary. In the STL decomposition, the trend and seasonal components are modest relative to the remainder, consistent with the simple model selected below.
Model: Water Flow Pipe 1:
# Estimate Box-Cox transformation parameter using Guerrero's method
lambda <- Pipe1 %>%
features(WaterFlow, features = guerrero) %>%
pull(lambda_guerrero)
# Fit ARIMA models with and without Box-Cox transformation
Pipe1_fit <- Pipe1 %>%
model(
ARIMA_bc = ARIMA(box_cox(WaterFlow, lambda)), # Box-Cox transformed model
ARIMA = ARIMA(WaterFlow) # Standard ARIMA model
)
# Compare models using AICc and BIC to select the best
glance(Pipe1_fit) %>%
arrange(AICc) %>%
select(.model:BIC)
## # A tibble: 2 × 6
## .model sigma2 log_lik AIC AICc BIC
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ARIMA 18.2 -688. 1380. 1380. 1387.
## 2 ARIMA_bc 57.1 -825. 1655. 1655. 1662.
Pipe1_fit %>% select(ARIMA_bc) %>% report()
## Series: WaterFlow
## Model: ARIMA(0,0,0) w/ mean
## Transformation: box_cox(WaterFlow, lambda)
##
## Coefficients:
## constant
## 29.0075
## s.e. 0.4866
##
## sigma^2 estimated as 57.07: log likelihood=-825.36
## AIC=1654.71 AICc=1654.76 BIC=1661.67
Note that the AICc values above are not directly comparable between the transformed and untransformed models, because their likelihoods are computed on different scales.
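Because fable back-transforms fitted values automatically, accuracy() reports errors on the original WaterFlow scale for both models, making them directly comparable (a sketch):
# Training accuracy is reported on the original WaterFlow scale,
# so RMSE and MAE are comparable across the transformed and untransformed models
accuracy(Pipe1_fit) %>%
  select(.model, RMSE, MAE)
With ARIMA_bc selected and reviewed, we move on to forecasting.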
Forecast: Water Flow Pipe 1
# Generate a 7-day (168 hours) forecast using ARIMA_bc
Pipe1_fc <- Pipe1_fit %>%
forecast(h = 168) %>% # Forecast for 168 hours (7 days)
filter(.model == "ARIMA_bc") # Filter ARIMA_bc model for forecast
# Plot the forecast with appropriate title using LaTeX formatting
Pipe1_fc %>%
autoplot(Pipe1) +
ggtitle(latex2exp::TeX(paste0("Pipe 1 Forecasted with ARIMA $(0,0,0)$ with mean and $\\lambda$ = ",
round(lambda, 2))))
Residuals for Pipe 1:
Pipe1_fit %>%
  select(ARIMA_bc) %>%
  gg_tsresiduals() +
  ggtitle("Residuals for Pipe 1 | ARIMA(0,0,0) with mean")
The residuals from the ARIMA(0,0,0) model with mean show that the model captures the overall level but leaves some autocorrelation around lag 12, suggesting possible seasonal effects. While the residuals appear approximately normally distributed, a more complex model might better account for these patterns.
Waterflow Pipe 2:
# reading in excel file
pipe2_data <- read_excel("Waterflow_Pipe2.xlsx", col_types = c('date', 'numeric')) %>%
# converting into date format
mutate(`Date Time` = as_datetime(`Date Time`)) %>%
# renaming column name
rename(DateTime = `Date Time`) %>%
# converting to tsibble
as_tsibble(index = DateTime)
head(pipe2_data)
## # A tsibble: 6 x 2 [1h] <UTC>
## DateTime WaterFlow
## <dttm> <dbl>
## 1 2015-10-23 01:00:00 18.8
## 2 2015-10-23 02:00:00 43.1
## 3 2015-10-23 03:00:00 38.0
## 4 2015-10-23 04:00:00 36.1
## 5 2015-10-23 05:00:00 31.9
## 6 2015-10-23 06:00:00 28.2
The dates were successfully parsed for Waterflow Pipe 2, and the data was converted into a tsibble. There appear to be no missing values or missing hours, which we will confirm below.
summary(pipe2_data)
## DateTime WaterFlow
## Min. :2015-10-23 01:00:00 Min. : 1.885
## 1st Qu.:2015-11-02 10:45:00 1st Qu.:28.140
## Median :2015-11-12 20:30:00 Median :39.682
## Mean :2015-11-12 20:30:00 Mean :39.556
## 3rd Qu.:2015-11-23 06:15:00 3rd Qu.:50.782
## Max. :2015-12-03 16:00:00 Max. :78.303
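The summary shows no missing WaterFlow values; to confirm that there are also no missing hours in the index, tsibble’s count_gaps() can be used (a sketch):
# List implicit gaps in the hourly index (zero rows means no missing hours)
pipe2_data %>%
  count_gaps()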
Explore Dataset:
pipe2_data %>%
gg_tsdisplay(WaterFlow, plot_type = "partial") +
labs(title = "Water Flow Analysis for Pipe 2")
The time series plot for Pipe 2 shows fluctuating water flow with minimal autocorrelation, but there are small spikes in the ACF and PACF plots, suggesting possible weak seasonal or lagged effects.
STL Decomposition of Pipe 2 Water Flow:
pipe2_data %>%
model(STL(WaterFlow ~ season(window = "periodic"), robust = TRUE)) %>%
components() %>%
autoplot() +
labs(title = "STL Decomposition of Pipe 2 Water Flow")
The STL decomposition of Pipe 2 shows a stable trend with notable weekly
and daily seasonal patterns, while the remainder exhibits random
fluctuations.
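Before modeling, the same KPSS unit-root test used for Pipe 1 can be applied to confirm that the aggregated series is stationary (a sketch):
# KPSS unit root test for Pipe 2
pipe2_data %>%
  features(WaterFlow, unitroot_kpss)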
Model: Water Flow Pipe 2:
# Estimate Box-Cox transformation parameter using Guerrero's method for Pipe 2
lambda <- pipe2_data %>%
features(WaterFlow, features = guerrero) %>%
pull(lambda_guerrero)
# Fit ARIMA models
Pipe2_fit <- pipe2_data %>%
model(
ARIMA_bc = ARIMA(box_cox(WaterFlow, lambda)), # Box-Cox transformed model
ARIMA = ARIMA(WaterFlow) # Standard ARIMA model
)
# Compare models using AICc and BIC to select the best
glance(Pipe2_fit) %>%
arrange(AICc) %>%
select(.model:BIC)
## # A tibble: 2 × 6
## .model sigma2 log_lik AIC AICc BIC
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ARIMA_bc 172. -3992. 7990. 7990. 8005.
## 2 ARIMA 256. -4191. 8387. 8387. 8402.
ARIMA_bc has the lowest AICc and BIC; as noted for Pipe 1, these values are not directly comparable across different transformations, but we will proceed with the Box-Cox model.
Pipe2_fit %>%
  select(ARIMA_bc) %>%
  report()
## Series: WaterFlow
## Model: ARIMA(0,0,0)(0,0,1)[24] w/ mean
## Transformation: box_cox(WaterFlow, lambda)
##
## Coefficients:
## sma1 constant
## 0.0844 32.9694
## s.e. 0.0314 0.4486
##
## sigma^2 estimated as 172.1: log likelihood=-3992.1
## AIC=7990.2 AICc=7990.23 BIC=8004.93
Forecast: Water Flow Pipe 2
# Forecasting the data using the selected Box-Cox ARIMA model
Pipe2_fc <- Pipe2_fit %>%
  forecast(h = 168) %>%          # Forecast for 168 hours (7 days)
  filter(.model == "ARIMA_bc")   # Keep the ARIMA_bc model, matching the selection above
# Plotting the forecast with LaTeX-styled title
Pipe2_fc %>%
autoplot(pipe2_data) + # Plot forecast against original data
ggtitle(latex2exp::TeX(paste0("Pipe 2 Forecasted with ARIMA $(0,0,0)(0,0,1)[24]$ with mean and $\\lambda$ = ",
round(lambda, 2))))
# Extract and plot residuals for the ARIMA_bc model from Pipe2_fit
Pipe2_fit %>%
  select(ARIMA_bc) %>%
  gg_tsresiduals() +
ggtitle("Residuals for Pipe 2 | ARIMA(0,0,0)(0,0,1)[24] with mean")
The ARIMA(0,0,0)(0,0,1)[24] model with mean performs reasonably well for Pipe 2, although slight autocorrelation and heavier tails suggest that a more refined model might further improve the fit.
Final Results - Water Flow:
Pipe1_fc <- Pipe1_fc %>%
as.data.frame() %>% # Convert to data frame
select(DateTime, .mean) %>% # Select DateTime and forecasted mean
rename(WaterFlow = .mean) # Rename .mean to WaterFlow
Pipe2_fc <- Pipe2_fc %>%
as.data.frame() %>% # Convert to data frame
select(DateTime, .mean) %>% # Select DateTime and forecasted mean
rename(WaterFlow = .mean) # Rename .mean to WaterFlow
wb <- wb_workbook()
# Add Pipe1 data to a worksheet
wb <- wb_add_worksheet(wb, "Pipe1")
wb <- wb_add_data(wb, "Pipe1", Pipe1_fc)
# Add Pipe2 data to a worksheet
wb <- wb_add_worksheet(wb, "Pipe2")
wb <- wb_add_data(wb, "Pipe2", Pipe2_fc)
# Save the workbook as 'pipes.xlsx'
wb_save(wb, file = "pipes.xlsx")