624 - Project 1

Author

Luis Munoz Grass

ATM Forecast

The goal of this analysis is to forecast how much cash is withdrawn from four different ATM machines during May 2010.  The data is given in a single file.  The variable ‘Cash’ is provided in hundreds of dollars.

Prepping the Data

#loading data and libraries
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(lubridate)
library(tsibble)
Registered S3 method overwritten by 'tsibble':
  method               from 
  as_tibble.grouped_df dplyr

Attaching package: 'tsibble'

The following object is masked from 'package:lubridate':

    interval

The following objects are masked from 'package:base':

    intersect, setdiff, union
library(fable)
Loading required package: fabletools
library(fabletools)
library(tidyr)
library(feasts)
library(zoo)

Attaching package: 'zoo'

The following object is masked from 'package:tsibble':

    index

The following objects are masked from 'package:base':

    as.Date, as.Date.numeric
# I uploaded the provided file into GitHub for reproducibility
atm_data <- read_csv("https://raw.githubusercontent.com/Lfirenzeg/msds624labs/refs/heads/main/Project%201/ATM624Data.csv")
Rows: 1474 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): DATE, ATM
dbl (1): Cash

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Now we explore the data and look for missing values

# check structure and first few rows
glimpse(atm_data)
Rows: 1,474
Columns: 3
$ DATE <chr> "5/1/2009 12:00:00 AM", "5/1/2009 12:00:00 AM", "5/2/2009 12:00:0…
$ ATM  <chr> "ATM1", "ATM2", "ATM1", "ATM2", "ATM1", "ATM2", "ATM1", "ATM2", "…
$ Cash <dbl> 96, 107, 82, 89, 85, 90, 90, 55, 99, 79, 88, 19, 8, 2, 104, 103, …
head(atm_data)
# A tibble: 6 × 3
  DATE                 ATM    Cash
  <chr>                <chr> <dbl>
1 5/1/2009 12:00:00 AM ATM1     96
2 5/1/2009 12:00:00 AM ATM2    107
3 5/2/2009 12:00:00 AM ATM1     82
4 5/2/2009 12:00:00 AM ATM2     89
5 5/3/2009 12:00:00 AM ATM1     85
6 5/3/2009 12:00:00 AM ATM2     90
# Check for missing values in each column
colSums(is.na(atm_data))
DATE  ATM Cash 
   0   14   19 
# Count the number of unique ATM identifiers
unique(atm_data$ATM)
[1] "ATM1" "ATM2" NA     "ATM3" "ATM4"

The ATM and Cash columns have missing values (14 missing ATM labels and 19 missing Cash values). Each ATM is expected to have consistent daily records, so we can drop the rows with a missing ATM or Cash value, since the proportion affected is small relative to almost 1,500 entries.
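As a quick sanity check (a small sketch assuming the atm_data object loaded above; this count was not part of the original output), we can confirm how many rows would be dropped:

# Count rows with a missing ATM label or missing Cash value
atm_data %>%
  filter(is.na(ATM) | is.na(Cash)) %>%
  nrow()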

# dropping rows with missing ATM or Cash values
atm_clean <- atm_data %>%
  filter(!is.na(ATM), !is.na(Cash))

Next we want to convert the DATE column to a proper date-time format

# DATE column using lubridate
atm_clean <- atm_clean %>%
  mutate(DATE = mdy_hms(DATE))
summary(atm_clean$DATE)
                      Min.                    1st Qu. 
"2009-05-01 00:00:00.0000" "2009-08-01 00:00:00.0000" 
                    Median                       Mean 
"2009-10-31 00:00:00.0000" "2009-10-30 11:00:07.4226" 
                   3rd Qu.                       Max. 
"2010-01-29 12:00:00.0000" "2010-04-30 00:00:00.0000" 
colSums(is.na(atm_clean))
DATE  ATM Cash 
   0    0    0 

At this point there are no missing values.

First Data Exploration

At this point it would be beneficial to visually explore the data

ggplot(atm_clean, aes(x = DATE, y = Cash, color = ATM)) +
  geom_line() +
    facet_wrap(~ATM, scales = "free", nrow = 4)+
  labs(title = "Cash Withdrawals per ATM (Raw data)", x = "Date", y = "Cash (hundreds of $)") +
  theme_minimal()

We can see that ATM1 and ATM2 have consistent daily activity, with withdrawals falling within a small, predictable range. ATM4, however, shows quite erratic behavior and a major outlier around January/February 2010, which could be a data error or some unusual event. ATM3 appears to have been either new or out of service for most of the period, since it records 0 withdrawals for most entries.

In any case, the first three ATMs look like good candidates for either ARIMA or ETS forecasting, while for ATM4 we may need to remove the outlier or apply an STL decomposition that can isolate the anomaly. At a minimum, the forecast intervals will reflect the higher uncertainty.

Reshaping the Data

We will group the data so that each date-ATM pair forms a group (in case there are multiple rows per ATM per day) and sum the cash withdrawals so that we have a single cash value per ATM per day. We then check for NAs again.

# Aggregate and reshape
atm_timeseries <- atm_clean %>%
  group_by(DATE, ATM) %>%
  summarise(Cash = sum(Cash), .groups = "drop") %>%
  pivot_wider(names_from = ATM, values_from = Cash)
colSums(is.na(atm_timeseries))
DATE ATM1 ATM2 ATM3 ATM4 
   0    3    2    0    0 
summary(atm_timeseries)
      DATE                 ATM1             ATM2             ATM3        
 Min.   :2009-05-01   Min.   :  1.00   Min.   :  0.00   Min.   : 0.0000  
 1st Qu.:2009-07-31   1st Qu.: 73.00   1st Qu.: 25.50   1st Qu.: 0.0000  
 Median :2009-10-30   Median : 91.00   Median : 67.00   Median : 0.0000  
 Mean   :2009-10-30   Mean   : 83.89   Mean   : 62.58   Mean   : 0.7206  
 3rd Qu.:2010-01-29   3rd Qu.:108.00   3rd Qu.: 93.00   3rd Qu.: 0.0000  
 Max.   :2010-04-30   Max.   :180.00   Max.   :147.00   Max.   :96.0000  
                      NA's   :3        NA's   :2                         
      ATM4      
 Min.   :    2  
 1st Qu.:  124  
 Median :  404  
 Mean   :  474  
 3rd Qu.:  705  
 Max.   :10920  
                

We now have 3 NAs for ATM1 and 2 for ATM2: those ATMs were missing data on specific dates, so those date-ATM combinations did not appear as rows in the long format, and pivot_wider() inserted NA for them.

However, since we now have full date coverage, we can fill the remaining missing values using a rolling median (safer than a single overall median or a plain forward fill).

atm_timeseries <- atm_timeseries %>%
  mutate(across(
    starts_with("ATM"),
    ~ {
      filled <- ifelse(is.na(.), rollmedian(., 7, fill = NA, align = "center"), .)
      filled <- zoo::na.locf(filled, na.rm = FALSE)  # forward fill
      filled <- zoo::na.locf(filled, fromLast = TRUE, na.rm = FALSE)  # backward fill
      return(filled)
    }
  ))
colSums(is.na(atm_timeseries))
DATE ATM1 ATM2 ATM3 ATM4 
   0    0    0    0    0 
summary(atm_timeseries)
      DATE                 ATM1             ATM2             ATM3        
 Min.   :2009-05-01   Min.   :  1.00   Min.   :  0.00   Min.   : 0.0000  
 1st Qu.:2009-07-31   1st Qu.: 73.00   1st Qu.: 25.00   1st Qu.: 0.0000  
 Median :2009-10-30   Median : 91.00   Median : 67.00   Median : 0.0000  
 Mean   :2009-10-30   Mean   : 84.19   Mean   : 62.57   Mean   : 0.7206  
 3rd Qu.:2010-01-29   3rd Qu.:108.00   3rd Qu.: 93.00   3rd Qu.: 0.0000  
 Max.   :2010-04-30   Max.   :180.00   Max.   :147.00   Max.   :96.0000  
      ATM4      
 Min.   :    2  
 1st Qu.:  124  
 Median :  404  
 Mean   :  474  
 3rd Qu.:  705  
 Max.   :10920  

We have successfully eliminated the NAs, and the summary statistics before and after imputation are almost identical.

# Convert to long format for plotting
atm_long <- atm_timeseries %>%
  pivot_longer(cols = -DATE, names_to = "ATM", values_to = "Cash")

# Plot cleaned data
ggplot(atm_long, aes(x = DATE, y = Cash, color = ATM)) +
  geom_line() +
  facet_wrap(~ATM, scales = "free", nrow = 4)+
  labs(title = "Cleaned Daily Withdrawals per ATM", x = "Date", y = "Cash (in dollars)") +
  theme_minimal()

Decomposing ATM1 Data

We can now use a tidy model approach and decompose the data for each ATM.

atm_long <- atm_long %>%
  mutate(DATE = as.Date(DATE)) %>%
  as_tsibble(index = DATE, key = ATM)

ATM_tsibble <- atm_long %>%
  filter(DATE <= as.Date("2010-04-30")) %>%
  as_tsibble(index = DATE, key = ATM)

And now decompose:

ATM_tsibble %>%
  filter(ATM == "ATM1") %>%
  autoplot(Cash) +
  ggtitle("Non-tranformed ATM1")

# STL decomposition
ATM_tsibble %>%
  filter(ATM == "ATM1") %>%
  model(STL(Cash ~ season(window = "periodic"), robust = TRUE)) %>%
  components() %>%
  autoplot() +
  labs(title = "STL Decomposition for ATM1")

We can see strong weekly seasonality and some variation in the trend for ATM1. A transformation may not be needed here, but this can be confirmed with the Box-Cox (Guerrero) lambda estimate that follows.

Comparing Models ATM1

lambda1 <- ATM_tsibble %>%
  filter(ATM == "ATM1") %>%
  features(Cash, features = guerrero) %>%
  pull(lambda_guerrero)

atm1_data <- ATM_tsibble %>%
  filter(ATM == "ATM1")

atm1_fit <- atm1_data %>%
  model(
    # Additive ETS model
    additive = ETS(Cash ~ error("A") + trend("N") + season("A")),
    # Transformed additive ETS model
    additive_ts = ETS(box_cox(Cash,lambda1) ~ error("A") + trend("N") + season("A")),
    # Multiplicative ETS model
    multiplicative = ETS(Cash ~ error("M") + trend("N") + season("M")),
    # Transformed multiplicative ETS model
    multiplicative_ts = ETS(box_cox(Cash,lambda1) ~ error("M") + trend("N") + season("M")),
    # SNAIVE model
    snaive = SNAIVE(Cash),
    # ARIMA model
    ARIMA = ARIMA(box_cox(Cash,lambda1))
  )

left_join(
  glance(atm1_fit) %>% select(.model:BIC), 
  accuracy(atm1_fit) %>% select(.model, RMSE)
) %>%
  arrange(RMSE)
Joining with `by = join_by(.model)`
# A tibble: 6 × 7
  .model              sigma2 log_lik   AIC  AICc   BIC  RMSE
  <chr>                <dbl>   <dbl> <dbl> <dbl> <dbl> <dbl>
1 multiplicative_ts   0.0385  -1217. 2454. 2454. 2493.  23.7
2 additive          583.      -2234. 4489. 4489. 4528.  23.8
3 ARIMA               1.73     -606. 1221. 1221. 1236.  25.0
4 additive_ts         1.77    -1176. 2372. 2373. 2411.  25.0
5 multiplicative      0.134   -2273. 4567. 4567. 4606.  26.3
6 snaive            777.         NA    NA    NA    NA   27.8

Although the multiplicative_ts model achieved the lowest RMSE, indicating the best in-sample accuracy, the ARIMA model achieved by far the best AIC along with a relatively good RMSE, making it the simplest model with high accuracy.

But which ARIMA specification was selected?

# Finding the best ARIMA model with report()
atm1_fit %>% select(.model = "ARIMA") %>% report()
Series: Cash 
Model: ARIMA(0,0,2)(0,1,1)[7] 
Transformation: box_cox(Cash, lambda1) 

Coefficients:
         ma1     ma2     sma1
      0.1138  -0.109  -0.6419
s.e.  0.0524   0.052   0.0432

sigma^2 estimated as 1.728:  log likelihood=-606.28
AIC=1220.55   AICc=1220.67   BIC=1236.07

So we find the ARIMA model to be ARIMA(0,0,2)(0,1,1)[7]

Forecasting ATM1

We will fit the proposed ARIMA model, extract the residuals, and run a Box-Pierce test using a lag of 14 (2 × 7), which is appropriate for weekly seasonal data, and dof = 0 since we are already modeling the seasonality with the seasonal ARIMA.

#Forecasting the model
forecast_atm1 <- atm1_fit %>%
  forecast(h = 31) %>%
  filter(.model=='ARIMA')

# Plotting the forecast for ATM1
forecast_atm1 %>%
  autoplot(atm1_data) +
  ggtitle(paste0("ATM 1 Forecasted with ARIMA(0,0,2)(0,1,1)_7 and lambda= ",
         round(lambda1,2)))

# Plotting residuals
atm1_fit %>%
  select(ARIMA) %>%
  gg_tsresiduals() +
  ggtitle(paste0("Residuals for ARIMA(0,0,2)(0,1,1)7 with lambda = ",
         round(lambda1,2)))

#Box-Pierce test
atm1_fit %>% 
  select(.model = "ARIMA") %>%
  augment() %>% 
  features(.innov, box_pierce, lag = 14, dof = 0)
# A tibble: 1 × 3
  .model bp_stat bp_pvalue
  <chr>    <dbl>     <dbl>
1 .model    9.47     0.800

The model captures the seasonality well, and the forecast intervals widen over time while maintaining the previously seen weekly cycles.

The residuals show no apparent autocorrelation and no major skewness. This is reinforced by the Box-Pierce test: with a p-value of 0.80 we fail to reject the null hypothesis of independent residuals, which supports the model fit.

Decomposing ATM2

ATM_tsibble %>%
  filter(ATM == "ATM2") %>%
  autoplot(Cash) +
  ggtitle("Non-transformed ATM2")

ATM_tsibble %>%
  filter(ATM == "ATM2") %>%
  model(STL(Cash ~ season(window = "periodic"), robust = TRUE)) %>%
  components() %>%
  autoplot() +
  labs(title = "STL Decomposition for ATM2")

Similar to ATM1, we see a clear weekly seasonality, with maybe some larger peaks at the beginning of each year. We can anticipate that ARIMA or ETS would be good models as well.

Comparing Models ATM2

lambda2 <- ATM_tsibble %>%
  filter(ATM == "ATM2") %>%
  features(Cash, features = guerrero) %>%
  pull(lambda_guerrero)

atm2_data <- ATM_tsibble %>%
  filter(ATM == "ATM2")

atm2_fit <- atm2_data %>%
  model(
    # Additive ETS model
    additive = ETS(Cash ~ error("A") + trend("N") + season("A")),
    # Transformed Additive ETS model
    additive_ts = ETS(box_cox(Cash,lambda2) ~ error("A") + trend("N") + season("A")),
    # Multiplicative ETS model
    multiplicative = ETS(Cash ~ error("M") + trend("N") + season("M")),
    # Transformed Multiplicative ETS model
    multiplicative_ts = ETS(box_cox(Cash,lambda2) ~ error("M") + trend("N") + season("M")),
    # SNAIVE model
    snaive = SNAIVE(Cash),
    # Transformed SNAIVE model
    snaive_ts = SNAIVE(box_cox(Cash,lambda2)),
    # ARIMA model
    ARIMA = ARIMA(box_cox(Cash,lambda2))
  )

left_join(
  glance(atm2_fit) %>% select(.model:BIC), 
  accuracy(atm2_fit) %>% select(.model, RMSE)
) %>%
  arrange(RMSE)
Joining with `by = join_by(.model)`
# A tibble: 7 × 7
  .model             sigma2 log_lik   AIC  AICc   BIC  RMSE
  <chr>               <dbl>   <dbl> <dbl> <dbl> <dbl> <dbl>
1 ARIMA              62.0    -1246. 2505. 2505. 2528.  24.5
2 additive          643.     -2252. 4525. 4525. 4564.  25.0
3 additive_ts        65.7    -1836. 3692. 3693. 3731.  25.4
4 snaive            910.        NA    NA    NA    NA   30.1
5 snaive_ts          90.3       NA    NA    NA    NA   30.1
6 multiplicative_ts   0.313  -2001. 4022. 4022. 4061.  33.9
7 multiplicative      0.764  -2451. 4921. 4922. 4960.  74.1

This time ARIMA achieves both the lowest AICc and the lowest RMSE, so it is the clear model to choose. We can run report() again to see the selected ARIMA specification.

# Finding the best ARIMA model with report()
atm2_fit %>% select(.model = "ARIMA") %>% report()
Series: Cash 
Model: ARIMA(2,0,2)(0,1,1)[7] 
Transformation: box_cox(Cash, lambda2) 

Coefficients:
          ar1      ar2     ma1     ma2     sma1
      -0.4227  -0.9111  0.4611  0.8081  -0.7212
s.e.   0.0589   0.0431  0.0897  0.0575   0.0418

sigma^2 estimated as 61.97:  log likelihood=-1246.25
AIC=2504.5   AICc=2504.74   BIC=2527.78

With a lambda of roughly 0.7 we find non-seasonal AR order p = 2 and MA order q = 2, plus a seasonal MA term Q = 1 with seasonal differencing D = 1.

Forecasting ATM2

#Forecasting the model for ATM2
forecast_atm2 <- atm2_fit %>%
  forecast(h = 31) %>%
  filter(.model=='ARIMA')

# Plotting the forecast for ATM2
forecast_atm2 %>%
  autoplot(atm2_data) +
  ggtitle(paste0("ATM 2 Forecasted with ARIMA(2,0,2)(0,1,1)_7 and lambda= ",
         round(lambda2,2)))

# Plotting residuals
atm2_fit %>%
  select(ARIMA) %>%
  gg_tsresiduals() +
  ggtitle(paste0("Residuals for ARIMA(2,0,2)(0,1,1)7 with lambda = ",
         round(lambda2,2)))

#Box-Pierce test
atm2_fit %>% 
  select(.model = "ARIMA") %>%
  augment() %>% 
  features(.innov, box_pierce, lag = 14, dof = 0)
# A tibble: 1 × 3
  .model bp_stat bp_pvalue
  <chr>    <dbl>     <dbl>
1 .model    9.79     0.777

At first glance the forecast continues the weekly pattern, and the prediction intervals behave as expected, with uncertainty growing over time.
The residuals are centered around 0 and most ACF values fall inside the bands, suggesting no significant autocorrelation. The histogram has a slight right skew but still looks fairly normal. Again, the p-value (0.777) is well above 0.05, so we fail to reject the null hypothesis and the residuals pass the independence test.

Decomposing ATM3

ATM_tsibble %>%
  filter(ATM == "ATM3") %>%
  autoplot(Cash) +
  ggtitle("Non-transformed ATM3")

ATM_tsibble %>%
  filter(ATM == "ATM3") %>%
  model(STL(Cash ~ season(window = "periodic"), robust = TRUE)) %>%
  components() %>%
  autoplot() +
  labs(title = "STL Decomposition for ATM3")

As mentioned previously, ATM3 shows 0 withdrawals for most of the reporting period, with only the final few days of April 2010 showing any activity. The ATM may not have been installed or active earlier, or it may have been recorded as operational but simply unused. In any case, there is not enough historical data to detect seasonality, trend, or cyclical patterns.

Comparing Models ATM3

ATM3 is interesting in that it offers very little data, so comparing models would be done merely as an academic exercise.

lambda3 <- ATM_tsibble %>%
  filter(ATM == "ATM3") %>%
  features(Cash, features = guerrero) %>%
  pull(lambda_guerrero)
Warning in optimise(lambda_coef_var, c(lower, upper), x = x, .period =
max(.period, : NA/Inf replaced by maximum positive value
Warning in optimise(lambda_coef_var, c(lower, upper), x = x, .period =
max(.period, : NA/Inf replaced by maximum positive value
atm3_data <- ATM_tsibble %>%
  filter(ATM == "ATM3")

atm3_fit <- atm3_data %>%
  model(
    # Additive ETS model
    additive = ETS(Cash ~ error("A") + trend("N") + season("A")),
    # Multiplicative ETS model
    multiplicative = ETS(Cash ~ error("M") + trend("N") + season("M")),
    # Transformed Multiplicative ETS model
    multiplicative_ts = ETS(box_cox(Cash,lambda3) ~ error("M") + trend("N") + season("M")),
    # SNAIVE model
    snaive = SNAIVE(Cash),
    # Mean model
    MEAN(Cash),
    # ARIMA model
    ARIMA = ARIMA(box_cox(Cash,lambda3))
  )
Warning: 1 error encountered for multiplicative
[1] missing value where TRUE/FALSE needed
left_join(
  glance(atm3_fit) %>% select(.model:BIC), 
  accuracy(atm3_fit) %>% select(.model, RMSE)
) %>%
  arrange(RMSE)
Joining with `by = join_by(.model)`
# A tibble: 5 × 7
  .model            sigma2 log_lik   AIC  AICc   BIC  RMSE
  <chr>              <dbl>   <dbl> <dbl> <dbl> <dbl> <dbl>
1 additive           25.6   -1664. 3348. 3348. 3387.  4.99
2 ARIMA             382.    -1603. 3213. 3213. 3224.  5.02
3 MEAN(Cash)         63.1      NA    NA    NA    NA   7.93
4 snaive             64.3      NA    NA    NA    NA   8.04
5 multiplicative_ts   8.49  -1618. 3256. 3257. 3295.  8.71

Even with this limited data, the model comparison shows ARIMA with the lowest AICc. However, due to the very short period of ATM3 activity, forecasts are inherently uncertain. While ARIMA(0,0,2) provided the best fit by AICc among the tested models, a simple mean forecast may offer comparable practical value for short-term planning.

Forecasting ATM3

#Fitting the MEAN model
atm3_fit <- ATM_tsibble %>%
  filter(ATM == "ATM3",
         Cash != 0) %>%
  model(MEAN(Cash))

#Forecasting the model for ATM3
forecast_atm3 <- atm3_fit %>%
  forecast(h = 31) 

# Plotting the forecast for ATM3
forecast_atm3 %>%
  autoplot(atm3_data) +
  ggtitle("ATM 3 Forecasted with MEAN() model")

ATM3 was inactive for most of the year and only started showing consistent withdrawals in April 2010. Given the limited historical data, a MEAN() model was applied. This forecast assumes the daily average seen in late April will continue through May, with moderate prediction intervals reflecting the short active window.

Decomposing ATM4

atm4_tsibble<- ATM_tsibble %>%
  filter(ATM == "ATM4") 

atm4_tsibble %>% autoplot(Cash) +
  ggtitle("Non-transformed ATM4")

ATM_tsibble %>%
  filter(ATM == "ATM4") %>%
  model(STL(Cash ~ season(window = "periodic"), robust = TRUE)) %>%
  components() %>%
  autoplot() +
  labs(title = "STL Decomposition for ATM4")

ATM4 shows stable daily activity with strong weekly seasonality, but includes at least one extreme outlier in early 2010. Given the impact such a spike can have on modeling, we apply a robust STL decomposition to inspect the components and then remove the outliers before fitting any forecasting model, preserving the underlying structure without distortion from the spike.

Detect and Remove Outliers

We will first identify outliers

# Use Interquartile Range (IQR) to flag outliers
atm4_outliers <- atm4_tsibble %>%
  mutate(
    is_outlier = Cash > quantile(Cash, 0.75) + 1.5 * IQR(Cash) |
                 Cash < quantile(Cash, 0.25) - 1.5 * IQR(Cash)
  )

# View how many outliers we have
atm4_outliers%>%filter(is_outlier)
# A tsibble: 2 x 4 [1D]
# Key:       ATM [1]
  DATE       ATM    Cash is_outlier
  <date>     <chr> <dbl> <lgl>     
1 2009-09-22 ATM4   1712 TRUE      
2 2010-02-09 ATM4  10920 TRUE      

So, defining outliers as values more than 1.5 IQRs below the first quartile or above the third quartile, we identify September 22nd, 2009 and February 9th, 2010 as the only two outliers (the numeric cutoffs are computed below for reference). Now we can proceed to replace them.
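For reference, the cutoffs implied by the 1.5 × IQR rule can be computed directly (a small sketch using the atm4_tsibble object above; this output was not part of the original document):

# Lower and upper IQR-based cutoffs for ATM4 daily withdrawals
atm4_tsibble %>%
  as_tibble() %>%
  summarise(
    lower = quantile(Cash, 0.25) - 1.5 * IQR(Cash),
    upper = quantile(Cash, 0.75) + 1.5 * IQR(Cash)
  )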

atm4_clean <- atm4_outliers %>%
  mutate(
    Cash = if_else(is_outlier, NA_real_, Cash)
  )

atm4_interpolated <- atm4_clean %>%
  model(ARIMA(Cash)) %>%
  interpolate(atm4_clean)

And now we can explore the data again to see what we have

atm4_interpolated %>% autoplot(Cash) +
  ggtitle("ATM4 without Outliers")

atm4_interpolated %>%
  filter(ATM == "ATM4") %>%
  model(STL(Cash ~ season(window = "periodic"), robust = TRUE)) %>%
  components() %>%
  autoplot() +
  labs(title = "STL Decomposition for ATM4 with no outliers")

Comparing Models for ATM4

ATM4 exhibited significant volatility in cash withdrawals, including major outliers—particularly on September 22, 2009, and February 9, 2010. These extreme values heavily skewed the data and could bias forecasting models. To address this, outliers were systematically identified using the Interquartile Range (IQR) method and excluded from the dataset before fitting any model.

After removing outliers, we performed STL decomposition to understand the underlying components of the time series. Clear weekly seasonality and a slightly declining trend were observed, with reduced residual noise compared to the original data.

# Estimate lambda for Box-Cox transformation
lambda4 <- atm4_interpolated %>%
  features(Cash, features = guerrero) %>%
  pull(lambda_guerrero)

# Fit models including ARIMA
atm4_fit <- atm4_interpolated %>%
  model(
    # Additive ETS model
    additive = ETS(Cash ~ error("A") + trend("N") + season("A")),
    # Transformed Additive ETS model
    additive_ts = ETS(box_cox(Cash,lambda4) ~ error("A") + trend("N") + season("A")),
    # Transformed Additive ETS model, no seasonality
    additive_ts_no_s = ETS(box_cox(Cash,lambda4) ~ error("A") + trend("N") + season("N")),
    # Multiplicative ETS model
    multiplicative = ETS(Cash ~ error("M") + trend("N") + season("M")),
    # Transformed Multiplicative ETS model
    multiplicative_ts = ETS(box_cox(Cash,lambda4) ~ error("M") + trend("N") + season("M")),
    # Transformed Multiplicative ETS model, no seasonality
    multiplicative_ts_no_s = ETS(box_cox(Cash,lambda4) ~ error("M") + trend("N") + season("N")),
    # SNAIVE model
    snaive = SNAIVE(Cash),
    # ARIMA model
    ARIMA = ARIMA(box_cox(Cash,lambda4))
  )

left_join(glance(atm4_fit) %>% select(.model:BIC), 
          accuracy(atm4_fit) %>% select(.model, RMSE)) %>%
  arrange(RMSE)
Joining with `by = join_by(.model)`
# A tibble: 8 × 7
  .model                     sigma2 log_lik   AIC  AICc   BIC  RMSE
  <chr>                       <dbl>   <dbl> <dbl> <dbl> <dbl> <dbl>
1 additive               106305.     -3184. 6389. 6390. 6428.  322.
2 additive_ts               226.     -2062. 4143. 4144. 4182.  332.
3 multiplicative_ts           0.236  -2059. 4137. 4138. 4176.  337.
4 ARIMA                     239.     -1516. 3041. 3042. 3061.  345.
5 multiplicative              0.724  -3189. 6397. 6398. 6436.  349.
6 multiplicative_ts_no_s      0.252  -2094. 4195. 4195. 4206.  356.
7 additive_ts_no_s          265.     -2094. 4195. 4195. 4206.  356.
8 snaive                 203761.        NA    NA    NA    NA   451.

A Box-Cox transformation was applied to stabilize variance, and multiple time series models were tested. The ARIMA(0,0,1)(2,0,0)[7] model with a Box-Cox lambda of 0.48 emerged as one of the most appropriate based on AIC, BIC, and residual analysis.

# report of the arima model
atm4_fit %>% select(.model = "ARIMA") %>% report()
Series: Cash 
Model: ARIMA(0,0,1)(2,0,0)[7] w/ mean 
Transformation: box_cox(Cash, lambda4) 

Coefficients:
         ma1    sar1    sar2  constant
      0.0773  0.1894  0.2197   19.0903
s.e.  0.0526  0.0514  0.0524    0.8494

sigma^2 estimated as 238.8:  log likelihood=-1515.74
AIC=3041.48   AICc=3041.65   BIC=3060.98

So we see an MA(1) component in the non-seasonal part and a seasonal AR component of order 2 with a weekly period (seasonality of 7 days), with the model including a non-zero mean.

Forecasting ATM4

#Forecasting the model for ATM4
forecast_atm4 <- atm4_fit %>%
  forecast(h = 31) %>%
  filter(.model=='ARIMA')

# Plotting the forecast for ATM4
forecast_atm4 %>%
  autoplot(atm4_interpolated) +
  ggtitle(paste0("ATM 4 Forecasted with ARIMA(2,0,2)(0,1,1)_7 and lambda= ",
         round(lambda4,2)))

# Plotting residuals
atm4_fit %>%
  select(ARIMA) %>%
  gg_tsresiduals() +
  ggtitle(paste0("Residuals for ARIMA(0,0,1)(2,0,0)7 with lambda = ",
         round(lambda4,2)))

#Box-Pierce test
atm4_fit %>% 
  select(.model = "ARIMA") %>%
  augment() %>% 
  features(.innov, box_pierce, lag = 14, dof = 0)
# A tibble: 1 × 3
  .model bp_stat bp_pvalue
  <chr>    <dbl>     <dbl>
1 .model    21.0     0.102

The results indicate a continuation of the weekly seasonal pattern observed historically. While volatility remains, the forecast no longer includes the erratic spikes seen in the raw data.

After cleaning the data, the ARIMA(0,0,1)(2,0,0)[7] model with a Box-Cox transformation (λ = 0.48) provided a robust forecast of daily withdrawals. While ATM4 still exhibits more variability than other machines, the outlier-adjusted model is statistically sound and should provide reliable estimates for cash flow planning during May 2010.

Exporting the data

# We can now combine all 4 forecasts to export
atms_forecast <- bind_rows(forecast_atm1, forecast_atm2, forecast_atm3, forecast_atm4) %>%
  as.data.frame() %>%
  select(DATE, ATM, .mean) %>%
  rename(Cash = .mean)

# export file
atms_forecast %>% write.csv("ATM_Forecasts.csv")

Conclusions

atms_forecast  %>%
  as_tsibble(index = DATE, key = ATM) %>%
  autoplot(Cash) +
  facet_wrap(~ATM, scales = "free", nrow = 4) +
  labs(title = "Forecast for ATM Withdrawls in May 2010")

# Combined forecast plot, with historical data as black line
atms_forecast %>%
  as_tsibble(index = DATE, key = ATM) %>%
  autoplot(Cash) +
  autolayer(ATM_tsibble, Cash, colour = "black") +
  facet_wrap(~ATM, scales = "free", nrow = 4) +
  labs(title = "ATM Withdrawals and Forecasts",
       y = "Cash (in dollars)", x = "Date") +
  theme_minimal()

This analysis presents a comprehensive time series forecast of daily cash withdrawals for four ATMs (ATM1–ATM4) for the month of May 2010. After cleaning and preparing the dataset, we applied tailored modeling techniques to each ATM based on its individual usage pattern, seasonality, and data quality.

  • ATM1 and ATM2 demonstrated highly consistent, seasonally-driven withdrawal patterns. Both were successfully modeled using seasonal ARIMA approaches that captured weekly trends with high accuracy. Model diagnostics confirmed well-behaved residuals and strong statistical fit.

    • Business implication: Cash needs for these ATMs can be reliably anticipated on a weekly cycle, enabling efficient scheduling of replenishments and minimizing idle cash.
  • ATM3 had very limited historical data, as it was inactive or recorded zero withdrawals for most of the observed period. Given this, we opted for a MEAN model, which assumes the recent average withdrawals will continue into May.

    • Business implication: Forecasts for ATM3 should be treated with caution due to data limitations. Continued monitoring is recommended to reassess the model as more usage data becomes available.
  • ATM4 presented extreme outliers likely due to unusual events or recording errors. These anomalies were identified and removed using IQR-based filtering, followed by interpolation. After cleaning, we applied a Box-Cox transformed ARIMA model that accounted for volatility while preserving the core weekly patterns.

    • Business implication: With outliers removed, ATM4 displays more stable behavior. Still, its variability remains higher than the others, suggesting a more flexible cash reserve strategy may be prudent.

Power Forecast

The goal of this analysis is to model residential power usage from January 1998 through December 2013 and forecast 2014. The data is given in a single file.  The variable ‘KWH’ is power consumption in kilowatt hours.

Prepping the Data

library(tidyverse)
library(lubridate)
library(ggplot2)
library(forecast)
Registered S3 method overwritten by 'quantmod':
  method            from
  as.zoo.data.frame zoo 
library(imputeTS)

Attaching package: 'imputeTS'
The following object is masked from 'package:zoo':

    na.locf
library(tseries)

Attaching package: 'tseries'
The following object is masked from 'package:imputeTS':

    na.remove
library(ggplot2)
library(dplyr)

# Loading the data
power_data <- read_csv("https://raw.githubusercontent.com/Lfirenzeg/msds624labs/refs/heads/main/Project%201/ResidentialCustomerForecastLoad-624.csv")
Rows: 192 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): YYYY-MMM
dbl (2): CaseSequence, KWH

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#rename columns for easier handling
power_data <- power_data %>%
  rename(Month = `YYYY-MMM`)

First Data Exploration

# Convert the 'Month' column to Date format
power_data <- power_data %>%
  mutate(Month = ym(Month))  

# Check structure again
summary(power_data)
  CaseSequence       Month                 KWH          
 Min.   :733.0   Min.   :1998-01-01   Min.   :  770523  
 1st Qu.:780.8   1st Qu.:2001-12-24   1st Qu.: 5429912  
 Median :828.5   Median :2005-12-16   Median : 6283324  
 Mean   :828.5   Mean   :2005-12-15   Mean   : 6502475  
 3rd Qu.:876.2   3rd Qu.:2009-12-08   3rd Qu.: 7620524  
 Max.   :924.0   Max.   :2013-12-01   Max.   :10655730  
                                      NA's   :1         
# Plot a time series for early exploration
ggplot(power_data, aes(x = Month, y = KWH)) +
  geom_line(color = "blue") +
  labs(title = "Residential Power Usage (KWH) Over Time", x = "Date", y = "KWH")

There is a clear pattern repeating every year, indicating strong seasonal behavior, which is expected for energy usage driven by the weather.
There also seems to be a gradual upward trend, more evident after 2005, perhaps related to an increase in population.

However, it’s worth noting that there is a significant dip around 2010. There is also an NA in the KWH column that needs to be addressed.

Outliers and Missing Data

We can start with the outlier. It corresponds to July 2010 and, at 770523, looks like it might be missing a digit when compared to every other entry in the data. So we will remove it and interpolate a replacement.

# First we can calculate IQR bounds, and then replace any outliers exceeding those bounds with NA
Q1 <- quantile(power_data$KWH, 0.25, na.rm = TRUE)
Q3 <- quantile(power_data$KWH, 0.75, na.rm = TRUE)
IQR <- Q3 - Q1
lower_bound <- Q1 - 1.5 * IQR
upper_bound <- Q3 + 1.5 * IQR

power_data <- power_data %>%
  mutate(KWH = ifelse(KWH < lower_bound | KWH > upper_bound, NA, KWH))

At this point there should be no remaining outliers, so we can proceed to create a time series and fit an ARIMA model with the current data to later interpolate missing values.

# We need to create a monthly time series starting from Jan 1998 in order to 
# address the missing values.
kwh_ts <- ts(power_data$KWH, start = c(1998, 1), frequency = 12)

# Now fit ARIMA model (which handles missing values automatically)
arima_model <- auto.arima(kwh_ts, seasonal = TRUE, stepwise = FALSE, approximation = FALSE)

# Summary of the model
summary(arima_model)
Series: kwh_ts 
ARIMA(0,0,3)(2,1,0)[12] with drift 

Coefficients:
         ma1     ma2     ma3     sar1     sar2     drift
      0.3411  0.0308  0.2307  -0.6985  -0.4150  8998.942
s.e.  0.0792  0.0897  0.0749   0.0805   0.0795  3006.595

sigma^2 = 3.832e+11:  log likelihood = -2627.93
AIC=5269.87   AICc=5270.52   BIC=5292.22

Training set error measures:
                    ME     RMSE      MAE        MPE     MAPE      MASE
Training set -8119.868 589004.5 428520.6 -0.7969903 6.535785 0.7046532
                    ACF1
Training set -0.01306416
# Fill in missing values using Kalman smoothing
kwh_filled <- na_kalman(kwh_ts, model = "auto.arima")

# Compare original vs filled (optional)
cbind(original = kwh_ts, filled = kwh_filled)[which(is.na(kwh_ts)), ]
     original  filled
[1,]       NA 7549357
[2,]       NA 7684639

We can see that the 2 NAs were replaced with values consistent with that point of the time series.
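To see which months were imputed, we can index the original series (a quick check assuming the power_data and kwh_ts objects above; this output was not part of the original document):

# Months corresponding to the imputed observations (the original NA plus the removed outlier)
power_data$Month[which(is.na(kwh_ts))]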

# Plot a time series for new exploration
autoplot(kwh_filled) +
  labs(title = "Power Usage with Outliers and Missing Values Filled via ARIMA Kalman", 
       x = "Date", y = "KWH") +
  theme_minimal()

We see that the extreme dip in 2010 was replaced with a more plausible value that aligns with the seasonal pattern. The series is now continuous, with no NAs remaining.

STL Decomposition

We can now move to STL decomposition, to visualize more clearly each component.

power_ts <- tibble(
  Month = yearmonth(seq(as.Date("1998-01-01"), by = "month", length.out = length(kwh_filled))),
  KWH = as.numeric(kwh_filled)
) %>%
  as_tsibble(index = Month)

# Perform STL decomposition
power_ts %>%
  model(STL(KWH ~ season(window = "periodic"), robust = TRUE)) %>%
  components() %>%
  autoplot() +
  labs(title = "STL Decomposition of Residential Power Usage")

The plot makes the trend more evident: a steady increase, with a slight decline after 2012. We also confirm a very strong and regular seasonal pattern, with peaks corresponding to winter and summer months, and the seasonality is stable over the years.

power_ts %>% 
  gg_subseries(KWH) +
  labs(title = "Residential Power Usage by Month")

The Residential Power Usage by Month plot confirms that the months that see more power usage are August, January, July and September.

At this point we can move ahead and compare different models for the series.

Comparing Models for Power Usage Data

lambda_p <- power_ts %>%
  features(KWH, features = guerrero) %>%
  pull(lambda_guerrero)

power_fit <- power_ts %>% 
  model(
    # Additive ETS model 
    additive = ETS(KWH ~ error("A") + trend("A") + season("A")),
    # Transformed Additive ETS model
    additive_bc = ETS(box_cox(KWH,lambda_p) ~ error("A") + trend("A") + season("A")),
    # Multiplicative ETS model
    multiplicative = ETS(KWH ~ error("M") + trend("A") + season("M")),
    # Transformed Multiplicative ETS model
    multiplicative_bc = ETS(box_cox(KWH,lambda_p) ~ error("M") + trend("A") + season("M")),
    # Additive damped model
    damped = ETS(KWH ~ error("A") + trend("Ad") + season("A")),
    # Transformed additive damped model
    damped_bc = ETS(box_cox(KWH,lambda_p) ~ error("A") + trend("Ad") + season("A")),
    # SNAIVE model
    snaive = SNAIVE(KWH),
    # ARIMA without transformation
    ARIMA_raw = ARIMA(KWH),
    # ARIMA with transformation 
    ARIMA = ARIMA(box_cox(KWH,lambda_p))
  )

# stats for the models
left_join(glance(power_fit) %>% select(.model:BIC), 
          accuracy(power_fit) %>% select(.model, RMSE)) %>%
  arrange(AICc)
Joining with `by = join_by(.model)`
# A tibble: 9 × 7
  .model              sigma2 log_lik    AIC   AICc    BIC    RMSE
  <chr>                <dbl>   <dbl>  <dbl>  <dbl>  <dbl>   <dbl>
1 ARIMA             1.52e- 5    751. -1492. -1492. -1476. 627467.
2 additive_bc       1.41e- 5    576. -1117. -1114. -1062. 612060.
3 damped_bc         1.42e- 5    576. -1116. -1112. -1057. 616439.
4 multiplicative_bc 6.53e- 7    574. -1114. -1111. -1059. 624595.
5 ARIMA_raw         3.79e+11  -2656.  5327.  5328.  5353. 584377.
6 multiplicative    9.02e- 3  -3053.  6141.  6144.  6196. 614008.
7 additive          4.19e+11  -3065.  6165.  6169.  6220. 619946.
8 damped            4.26e+11  -3066.  6169.  6173.  6227. 622806.
9 snaive            6.52e+11     NA     NA     NA     NA  810680.

We see that some models obtained negative AICc values. These are the models fitted on the Box-Cox transformed scale: their log-likelihood (and hence AIC, AICc, and BIC) is computed on the transformed data, so it is large and positive and not directly comparable with the information criteria of the untransformed models. RMSE, which is reported on the original scale for all models, remains comparable across the table.

Among the transformed candidates, ARIMA comes out with the best (lowest) AICc and a competitive RMSE; the transformed additive ETS model also performs well, but we will move forward with the ARIMA model.

power_fit %>% select(.model = "ARIMA") %>% report()
Series: KWH 
Model: ARIMA(0,0,1)(2,1,0)[12] w/ drift 
Transformation: box_cox(KWH, lambda_p) 

Coefficients:
         ma1     sar1     sar2  constant
      0.2472  -0.7211  -0.3554    0.0013
s.e.  0.0832   0.0751   0.0769    0.0004

sigma^2 estimated as 1.517e-05:  log likelihood=750.96
AIC=-1491.91   AICc=-1491.57   BIC=-1475.95

The log-likelihood of about 751 (on the transformed scale) is consistent with a good fit, so we can continue with the forecast.

Forecasting for Power Usage Data

# Forecast for all 12 months of 2014
power_forecast <- power_fit %>%
  select(ARIMA) %>%
  forecast(h = "12 months")

# Plotting the forecast
autoplot(power_forecast, power_ts) +
  labs(
    title = "Forecast of Residential Power Usage (2014)",
    y = "KWH",
    x = "Month"
  ) +
  theme_minimal()

# Plotting residual diagnostics
power_fit %>%
  select(ARIMA) %>%
  gg_tsresiduals()

The residuals hover around zero with no visible patterns, the ACF shows most values falling within the confidence bounds, and the histogram is close to a normal distribution.

The forecast follows a clear seasonal pattern, consistent with previous years, indicating that power usage is driven by recurring seasonal behaviors. The relatively narrow width of the confidence intervals suggests high confidence in the model’s predictions. Notably, no unusual spikes or drops are predicted, supporting operational stability for electricity demand in 2014.

Forecasted demand aligns well with seasonal trends from previous years, which would enable energy providers to optimize generation schedules, maintenance windows, and staffing. Additionally these projections can be used to estimate power procurement costs and revenue expectations.

Exporting the data

# Creating a data frame with the desired columns and adding the Case Sequence numbers
power_forecast_exp <- power_forecast %>%
  as.data.frame() %>%
  select(Month, .mean) %>%
  rename(KWH = .mean) %>%  
  mutate(CaseSequence = 925:936) %>%
  relocate(CaseSequence)

# Export file to CSV
power_forecast_exp %>% write.csv("Power_forecast.csv")

Water Pipes

The final dataset consists of 2 files, each containing a Date Time column and a WaterFlow column, one file per pipe. The goal of this final analysis is to time-base sequence the data and aggregate it by hour, taking the mean when there are multiple readings within an hour. We then determine whether the data is stationary and whether it can be forecast; if so, we provide a week-ahead forecast.

library(urca)
library(tseries)

pipe1 <- read_csv("https://raw.githubusercontent.com/Lfirenzeg/msds624labs/refs/heads/main/Project%201/Waterflow_Pipe1.csv")
Rows: 1000 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): Date Time
dbl (1): WaterFlow

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
pipe2 <- read_csv("https://raw.githubusercontent.com/Lfirenzeg/msds624labs/refs/heads/main/Project%201/Waterflow_Pipe2.csv")
Rows: 1000 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): Date Time
dbl (1): WaterFlow

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(pipe1)
# A tibble: 6 × 2
  `Date Time`       WaterFlow
  <chr>                 <dbl>
1 10/23/15 12:24 AM     23.4 
2 10/23/15 12:40 AM     28.0 
3 10/23/15 12:53 AM     23.1 
4 10/23/15 12:55 AM     30.0 
5 10/23/15 1:19 AM       6.00
6 10/23/15 1:23 AM      15.9 
head(pipe2)
# A tibble: 6 × 2
  `Date Time`      WaterFlow
  <chr>                <dbl>
1 10/23/15 1:00 AM      18.8
2 10/23/15 2:00 AM      43.1
3 10/23/15 3:00 AM      38.0
4 10/23/15 4:00 AM      36.1
5 10/23/15 5:00 AM      31.9
6 10/23/15 6:00 AM      28.2

Prepping the Data

When we loaded the data we saw that Pipe 1 has several entries per hour, and the instructions tell us to consolidate them into one entry per hour using the mean. Pipe 2 is already close to the correct hourly format and has unique hourly readings (I verified this with code, not shown here to avoid clutter; a quick check along those lines is sketched below).
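A check of this kind would confirm one reading per hour for Pipe 2 (a small sketch assuming the pipe2 object loaded above and reusing the same mdy_hm()/floor_date() parsing applied to Pipe 1 below; this output was not part of the original document):

# Number of hourly timestamps in Pipe 2 that appear more than once (0 means one reading per hour)
pipe2 %>%
  mutate(DateTime = floor_date(mdy_hm(`Date Time`), unit = "hour")) %>%
  count(DateTime) %>%
  filter(n > 1) %>%
  nrow()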

# First we will parse date and aggregate using mean per hour
pipe1_clean <- pipe1 %>%
  #Replacing the DateTime with round hour units
  mutate(DateTime = mdy_hm(`Date Time`)) %>%
  mutate(DateTime = floor_date(DateTime, unit = "hour")) %>%
  group_by(DateTime) %>%
  # Aggregate by mean per hour
  summarise(WaterFlow = mean(WaterFlow, na.rm = TRUE)) %>%
  ungroup()

Changing the format of pipe 2 so both are consistent.

pipe2_clean <- pipe2 %>%
  mutate(DateTime = mdy_hm(`Date Time`)) %>%
  select(DateTime, WaterFlow)

Before combining the datasets we need to ensure each entry is correctly labeled.

pipe1_labeled <- pipe1_clean %>%
  mutate(Pipe = "Pipe1")

pipe2_labeled <- pipe2_clean %>%
  mutate(Pipe = "Pipe2")

Now we can merge the data for both pipes into one tidy structure.

# Combining the two datasets into one
combined_pipes <- bind_rows(pipe1_labeled, pipe2_labeled)

# Arranging by time 
combined_pipes <- combined_pipes %>%
  arrange(DateTime)

head(combined_pipes)
# A tibble: 6 × 3
  DateTime            WaterFlow Pipe 
  <dttm>                  <dbl> <chr>
1 2015-10-23 00:00:00      26.1 Pipe1
2 2015-10-23 01:00:00      18.9 Pipe1
3 2015-10-23 01:00:00      18.8 Pipe2
4 2015-10-23 02:00:00      15.2 Pipe1
5 2015-10-23 02:00:00      43.1 Pipe2
6 2015-10-23 03:00:00      23.1 Pipe1

First Data Exploration

We can visualize this more easily with a plot.

ggplot(combined_pipes, aes(x = DateTime, y = WaterFlow)) +
  geom_line(color = "steelblue") +
  facet_wrap(~ Pipe, scales = "free_y", ncol = 1) +
  labs(
    title = "Hourly Water Flow by Pipe",
    x = "Date and Time",
    y = "Water Flow (units)"
  ) +
  theme_minimal()

summary(combined_pipes)
    DateTime                       WaterFlow          Pipe          
 Min.   :2015-10-23 00:00:00.0   Min.   : 1.885   Length:1236       
 1st Qu.:2015-10-29 11:45:00.0   1st Qu.:21.577   Class :character  
 Median :2015-11-07 22:30:00.0   Median :34.353   Mode  :character  
 Mean   :2015-11-09 19:35:23.2   Mean   :35.801                     
 3rd Qu.:2015-11-20 19:15:00.0   3rd Qu.:47.806                     
 Max.   :2015-12-03 16:00:00.0   Max.   :78.303                     

No NAs are evident at this point, but that can be quickly confirmed:

# Checking for NAs in the combined dataset
combined_pipes %>%
  summarise(
    NA_DateTime = sum(is.na(DateTime)),
    NA_WaterFlow = sum(is.na(WaterFlow)),
    NA_Pipe = sum(is.na(Pipe))
  )
# A tibble: 1 × 3
  NA_DateTime NA_WaterFlow NA_Pipe
        <int>        <int>   <int>
1           0            0       0

Now for outliers:

# Identify potential outliers per pipe using IQR method
combined_pipes %>%
  group_by(Pipe) %>%
  mutate(
    Q1 = quantile(WaterFlow, 0.25),
    Q3 = quantile(WaterFlow, 0.75),
    IQR = Q3 - Q1,
    is_outlier = WaterFlow < (Q1 - 1.5 * IQR) | WaterFlow > (Q3 + 1.5 * IQR)
  ) %>%
  summarise(
    Outlier_Count = sum(is_outlier),
    Total = n(),
    Percent_Outliers = round(100 * sum(is_outlier) / n(), 2)
  )
# A tibble: 2 × 4
  Pipe  Outlier_Count Total Percent_Outliers
  <chr>         <int> <int>            <dbl>
1 Pipe1             1   236             0.42
2 Pipe2             0  1000             0   

We find only 1 outlier in Pipe 1, while Pipe 2 has 0 outliers.

ggplot(combined_pipes, aes(x = Pipe, y = WaterFlow)) +
  geom_boxplot(fill = "skyblue") +
  labs(title = "Boxplot of Water Flow per Pipe", y = "Water Flow", x = "Pipe") +
  theme_minimal()

The boxplot illustrates that Pipe1 experiences stable, narrow-range flow, whereas Pipe2 has a much broader and more variable distribution. A single statistical outlier was detected in Pipe1, but Pipe2, despite its wide range, shows no extreme outliers by the IQR rule. These patterns suggest different operational behaviors or environmental conditions for each pipe.

# Filter and display any outlier rows from Pipe1
pipe1_outliers <- combined_pipes %>%
  filter(Pipe == "Pipe1") %>%
  mutate(
    Q1 = quantile(WaterFlow, 0.25),
    Q3 = quantile(WaterFlow, 0.75),
    IQR = Q3 - Q1,
    is_outlier = WaterFlow < (Q1 - 1.5 * IQR) | WaterFlow > (Q3 + 1.5 * IQR)
  ) %>%
  filter(is_outlier) %>%
  select(DateTime, WaterFlow, Pipe)

pipe1_outliers
# A tibble: 1 × 3
  DateTime            WaterFlow Pipe 
  <dttm>                  <dbl> <chr>
1 2015-10-31 07:00:00      31.7 Pipe1
# Split Pipe1 and complete hourly sequence
pipe1_ts <- combined_pipes %>%
  filter(Pipe == "Pipe1") %>%
  select(DateTime, WaterFlow) %>%
  as_tsibble(index = DateTime) %>%
  fill_gaps(.full = TRUE)

# Split Pipe2 and complete hourly sequence
pipe2_ts <- combined_pipes %>%
  filter(Pipe == "Pipe2") %>%
  select(DateTime, WaterFlow) %>%
  as_tsibble(index = DateTime) %>%
  fill_gaps(.full = TRUE)

After filling the gaps, Pipe1 now has 4 NAs (a quick check is sketched below). This is a good chance to handle those and the outlier at the same time, so we will start by replacing the outlier with NA.
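This is a small check assuming the pipe1_ts object created above (the count was not shown in the original output):

# Hours introduced by fill_gaps() that have no reading
sum(is.na(pipe1_ts$WaterFlow))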

pipe1_ts_cleaned <- pipe1_ts %>%
  mutate(
    Q1 = quantile(WaterFlow, 0.25, na.rm = TRUE),
    Q3 = quantile(WaterFlow, 0.75, na.rm = TRUE),
    IQR = Q3 - Q1,
    is_outlier = WaterFlow < (Q1 - 1.5 * IQR) | WaterFlow > (Q3 + 1.5 * IQR),
    WaterFlow = if_else(is_outlier, NA_real_, WaterFlow)
  ) %>%
  select(-Q1, -Q3, -IQR, -is_outlier)

And now we can impute all missing values using Kalman smoothing.

# Impute values
pipe1_ts_cleaned <- pipe1_ts_cleaned %>%
  mutate(WaterFlow = na_kalman(WaterFlow, model = "StructTS"))

sum(is.na(pipe1_ts_cleaned$WaterFlow))  #If no NAs will show 0
[1] 0

Stationarity Testing

# ADF test for Pipe1
adf_pipe1 <- ur.df(pipe1_ts_cleaned$WaterFlow, type = "drift", lags = 24)
summary(adf_pipe1)

############################################### 
# Augmented Dickey-Fuller Test Unit Root Test # 
############################################### 

Test regression drift 


Call:
lm(formula = z.diff ~ z.lag.1 + 1 + z.diff.lag)

Residuals:
    Min      1Q  Median      3Q     Max 
-8.4630 -2.6921 -0.0609  2.1951 12.0169 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)   
(Intercept)  21.1029944  7.8337763   2.694  0.00770 **
z.lag.1      -1.0635269  0.3951935  -2.691  0.00776 **
z.diff.lag1   0.1462703  0.3867905   0.378  0.70573   
z.diff.lag2   0.1568702  0.3777901   0.415  0.67844   
z.diff.lag3   0.1132589  0.3683528   0.307  0.75882   
z.diff.lag4   0.0609648  0.3596542   0.170  0.86558   
z.diff.lag5   0.0942252  0.3496387   0.269  0.78784   
z.diff.lag6  -0.0006116  0.3388741  -0.002  0.99856   
z.diff.lag7  -0.0150923  0.3288952  -0.046  0.96345   
z.diff.lag8  -0.0338833  0.3189246  -0.106  0.91550   
z.diff.lag9  -0.0959608  0.3083753  -0.311  0.75601   
z.diff.lag10 -0.0022002  0.2981411  -0.007  0.99412   
z.diff.lag11 -0.1314844  0.2874100  -0.457  0.64785   
z.diff.lag12 -0.2171717  0.2749954  -0.790  0.43068   
z.diff.lag13 -0.0827250  0.2593617  -0.319  0.75011   
z.diff.lag14 -0.0376040  0.2451315  -0.153  0.87824   
z.diff.lag15 -0.0260120  0.2321601  -0.112  0.91091   
z.diff.lag16 -0.0565532  0.2167959  -0.261  0.79448   
z.diff.lag17 -0.0154743  0.2018249  -0.077  0.93897   
z.diff.lag18 -0.0500280  0.1861915  -0.269  0.78846   
z.diff.lag19 -0.1132906  0.1692406  -0.669  0.50405   
z.diff.lag20 -0.1038979  0.1543836  -0.673  0.50178   
z.diff.lag21 -0.0160015  0.1362635  -0.117  0.90664   
z.diff.lag22 -0.1027555  0.1179364  -0.871  0.38471   
z.diff.lag23 -0.0948589  0.0975556  -0.972  0.33212   
z.diff.lag24 -0.0418539  0.0710840  -0.589  0.55670   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.224 on 189 degrees of freedom
Multiple R-squared:  0.5228,    Adjusted R-squared:  0.4597 
F-statistic: 8.284 on 25 and 189 DF,  p-value: < 2.2e-16


Value of test-statistic is: -2.6912 3.6287 

Critical values for test statistics: 
      1pct  5pct 10pct
tau2 -3.46 -2.88 -2.57
phi1  6.52  4.63  3.81
# ADF test for Pipe2
adf_pipe2 <- ur.df(pipe2_ts$WaterFlow, type = "drift", lags = 24)
summary(adf_pipe2)

############################################### 
# Augmented Dickey-Fuller Test Unit Root Test # 
############################################### 

Test regression drift 


Call:
lm(formula = z.diff ~ z.lag.1 + 1 + z.diff.lag)

Residuals:
    Min      1Q  Median      3Q     Max 
-37.974 -11.481   0.341  10.220  40.803 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  31.24192    5.90638   5.290 1.52e-07 ***
z.lag.1      -0.78982    0.14908  -5.298 1.46e-07 ***
z.diff.lag1  -0.24404    0.14681  -1.662   0.0968 .  
z.diff.lag2  -0.21713    0.14396  -1.508   0.1318    
z.diff.lag3  -0.15879    0.14096  -1.126   0.2603    
z.diff.lag4  -0.11155    0.13762  -0.811   0.4178    
z.diff.lag5  -0.11520    0.13423  -0.858   0.3910    
z.diff.lag6  -0.09255    0.13059  -0.709   0.4787    
z.diff.lag7  -0.07503    0.12684  -0.592   0.5543    
z.diff.lag8  -0.07603    0.12364  -0.615   0.5388    
z.diff.lag9  -0.07313    0.12032  -0.608   0.5434    
z.diff.lag10 -0.08178    0.11711  -0.698   0.4851    
z.diff.lag11 -0.05614    0.11310  -0.496   0.6198    
z.diff.lag12 -0.06196    0.10921  -0.567   0.5706    
z.diff.lag13 -0.11599    0.10551  -1.099   0.2719    
z.diff.lag14 -0.16128    0.10155  -1.588   0.1126    
z.diff.lag15 -0.06711    0.09713  -0.691   0.4898    
z.diff.lag16 -0.07688    0.09249  -0.831   0.4060    
z.diff.lag17 -0.05374    0.08763  -0.613   0.5398    
z.diff.lag18 -0.12621    0.08269  -1.526   0.1273    
z.diff.lag19 -0.12517    0.07742  -1.617   0.1062    
z.diff.lag20 -0.11829    0.07150  -1.654   0.0984 .  
z.diff.lag21 -0.12190    0.06506  -1.874   0.0613 .  
z.diff.lag22 -0.10989    0.05705  -1.926   0.0544 .  
z.diff.lag23 -0.10537    0.04668  -2.257   0.0242 *  
z.diff.lag24 -0.01513    0.03258  -0.465   0.6424    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 15.98 on 949 degrees of freedom
Multiple R-squared:  0.5339,    Adjusted R-squared:  0.5216 
F-statistic: 43.48 on 25 and 949 DF,  p-value: < 2.2e-16


Value of test-statistic is: -5.2979 14.0426 

Critical values for test statistics: 
      1pct  5pct 10pct
tau2 -3.43 -2.86 -2.57
phi1  6.43  4.59  3.78

For Pipe 1 the ADF test statistic (tau2) of -2.6912 is greater than the 5% critical value of -2.88, so we fail to reject the null hypothesis of a unit root. Pipe 1 is therefore not stationary, and differencing is required before modeling.

For Pipe 2 the test statistic of -5.2979 is well below the 1% critical value of -3.43, so we can reject the null hypothesis. Pipe 2 is stationary and can be modeled as is.

# First-order differencing
pipe1_diff <- pipe1_ts_cleaned %>%
  mutate(WaterFlow = difference(WaterFlow, lag = 1)) %>%
  filter(!is.na(WaterFlow))
# ADF test after differencing
adf_pipe1_diff <- ur.df(pipe1_diff$WaterFlow, type = "drift", lags = 24)
summary(adf_pipe1_diff)

############################################### 
# Augmented Dickey-Fuller Test Unit Root Test # 
############################################### 

Test regression drift 


Call:
lm(formula = z.diff ~ z.lag.1 + 1 + z.diff.lag)

Residuals:
     Min       1Q   Median       3Q      Max 
-10.1535  -2.3899  -0.1534   2.3326  12.2765 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)    0.03707    0.29462   0.126 0.899994    
z.lag.1      -15.29365    2.84876  -5.369 2.32e-07 ***
z.diff.lag1   13.40957    2.82266   4.751 4.01e-06 ***
z.diff.lag2   12.57048    2.77721   4.526 1.06e-05 ***
z.diff.lag3   11.72339    2.71613   4.316 2.56e-05 ***
z.diff.lag4   10.86209    2.64128   4.112 5.85e-05 ***
z.diff.lag5   10.07008    2.55158   3.947 0.000112 ***
z.diff.lag6    9.22067    2.45134   3.761 0.000226 ***
z.diff.lag7    8.40061    2.34084   3.589 0.000424 ***
z.diff.lag8    7.60185    2.21996   3.424 0.000757 ***
z.diff.lag9    6.78242    2.08909   3.247 0.001383 ** 
z.diff.lag10   6.10127    1.94906   3.130 0.002025 ** 
z.diff.lag11   5.33149    1.80182   2.959 0.003484 ** 
z.diff.lag12   4.51774    1.64722   2.743 0.006684 ** 
z.diff.lag13   3.88516    1.48691   2.613 0.009704 ** 
z.diff.lag14   3.35111    1.33051   2.519 0.012614 *  
z.diff.lag15   2.87328    1.17666   2.442 0.015536 *  
z.diff.lag16   2.40458    1.02418   2.348 0.019924 *  
z.diff.lag17   2.02145    0.87656   2.306 0.022195 *  
z.diff.lag18   1.64439    0.73489   2.238 0.026420 *  
z.diff.lag19   1.24458    0.59992   2.075 0.039387 *  
z.diff.lag20   0.90027    0.47437   1.898 0.059252 .  
z.diff.lag21   0.68055    0.35516   1.916 0.056864 .  
z.diff.lag22   0.41342    0.24838   1.665 0.097678 .  
z.diff.lag23   0.19083    0.15302   1.247 0.213931    
z.diff.lag24   0.05682    0.07144   0.795 0.427405    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.308 on 188 degrees of freedom
Multiple R-squared:  0.8309,    Adjusted R-squared:  0.8084 
F-statistic: 36.94 on 25 and 188 DF,  p-value: < 2.2e-16


Value of test-statistic is: -5.3685 14.4298 

Critical values for test statistics: 
      1pct  5pct 10pct
tau2 -3.46 -2.88 -2.57
phi1  6.52  4.63  3.81

So first-order differencing was enough to make Pipe 1 stationary, since the test statistic of -5.3685 is well below the 1% critical value.

Forecasting Pipe 1

We’ll use automatic ARIMA selection with AICc minimization.

# Fit ARIMA model to Pipe 1
pipe1_model <- pipe1_ts_cleaned %>%
  model(ARIMA = ARIMA(WaterFlow))

report(pipe1_model)
Series: WaterFlow 
Model: ARIMA(0,0,0) w/ mean 

Coefficients:
      constant
       19.8413
s.e.    0.2663

sigma^2 estimated as 17.09:  log likelihood=-680.68
AIC=1365.35   AICc=1365.4   BIC=1372.31
pipe1_forecast <- pipe1_model %>%
  forecast(h = "168 hours")  # 7 days * 24 hours
pipe1_forecast %>%
  autoplot(pipe1_ts_cleaned) +
  labs(
    title = "Forecast of Water Flow (Pipe 1)",
    y = "Hourly Water Flow",
    x = "Time"
  ) +
  theme_minimal()

pipe1_model %>%
  gg_tsresiduals()

The ARIMA model projects a stable average water flow for Pipe 1 over the next week, with moderate uncertainty reflected in the forecast intervals.

The model suggests that no strong upward or downward trend is expected. Flow will likely remain within the 15 to 30 units/hour range, with a central forecast around 20 units/hour.

Forecasting Pipe 2

pipe2_model <- pipe2_ts %>%
  model(ARIMA = ARIMA(WaterFlow))

report(pipe2_model)
Series: WaterFlow 
Model: ARIMA(0,0,0)(0,0,1)[24] w/ mean 

Coefficients:
        sma1  constant
      0.0846   39.5732
s.e.  0.0314    0.5472

sigma^2 estimated as 256:  log likelihood=-4190.54
AIC=8387.08   AICc=8387.1   BIC=8401.8
pipe2_forecast <- pipe2_model %>%
  forecast(h = "168 hours")  # 7 days * 24 hours
pipe2_forecast %>%
  autoplot(pipe2_ts) +
  labs(
    title = "Forecast of Water Flow (Pipe 2)",
    y = "Hourly Water Flow",
    x = "Time"
  ) +
  theme_minimal()

pipe2_model %>%
  gg_tsresiduals()

The ARIMA model projects that hourly water flow in Pipe 2 will continue to fluctuate around its long-term average of approximately 40 units/hour.

Forecast intervals are relatively tight and stable, reflecting the model’s confidence in its predictions despite the system’s natural variability.

Exporting the Data

# Add label to each forecast
pipe1_forecast <- pipe1_forecast %>%
  mutate(Pipe = "Pipe1")

pipe2_forecast <- pipe2_forecast %>%
  mutate(Pipe = "Pipe2")
# We can now combine both forecasts to export
pipes_forecast <- bind_rows(pipe1_forecast, pipe2_forecast) %>%
  as.data.frame() %>%
  select(DateTime, Pipe, .mean) %>%
  rename(Waterflow = .mean)

# export file
pipes_forecast %>% write.csv("Pipes_Forecasts.csv", row.names = FALSE)