PART A – ATM Forecast

Prompt:

  • Forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file.
  • The variable Cash is provided in hundreds of dollars
  • Explain and demonstrate your process, techniques used and not used, and your actual forecast.
  • Provide a written report on your findings, visuals, discussion

Dataset: ATM624Data.xlsx


Import and Visualize Data

Import the data from Github and view the raw data

# import data from github
github_url = "https://github.com/okazevedo90/CUNY_GITHUB_submissions/raw/refs/heads/main/DATA624/project_1/ATM624Data.xlsx"
data = import(github_url)

# view data head
knit_table(head(data), "View Raw Data")
View Raw Data
DATE ATM Cash
2009-05-01 ATM1 96
2009-05-01 ATM2 107
2009-05-02 ATM1 82
2009-05-02 ATM2 89
2009-05-03 ATM1 85
2009-05-03 ATM2 90

The data consists of three columns:

  • DATE: date of the ATM cash withdraws between May 2009 and May 2010
  • ATM: ATM number (ATM1, ATM2, ATM3, ATM4)
  • CASH: total cash withdrew in hundreds of dollars

Visualize the time series overtime by ATM

group_colors <- c("ATM1" = "#C77CFF", "ATM2" = "darkolivegreen3", "ATM3" = "coral1", "ATM4" = "#00BFC4")

all_cash_plot = data |>
  ggplot(aes(x=DATE, y=Cash, color=ATM)) +
  geom_line() +
  scale_color_manual(values = group_colors) +
  scale_x_date(
    date_breaks = "2 month", # Set breaks at 2-month intervals
    date_labels = "%b %y"
  ) +
  labs(title="Withdrawn Cash by ATM Overtime", x='Date', y='Cash (hundreds)') +
  theme_minimal()
all_cash_plot

all_cash_plot +
  facet_grid(ATM~., scales = "free_y") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

View dataset summary metrics

# transform date column to date type
df = data |>
  mutate(DATE = as.Date(DATE)) |>
  arrange(DATE, ATM)
# rename columns to all naming follows same casing
colnames(df) = c("DATE", "ATM", "CASH")

skim_df = skim(df)
skim_df |>
  summary()
Data summary
Name df
Number of rows 1474
Number of columns 3
_______________________
Column type frequency:
character 1
Date 1
numeric 1
________________________
Group variables None
knit_table(yank(skim_df, "Date"), 'Column: DATE')
Column: DATE
skim_variable n_missing complete_rate min max median n_unique
DATE 0 1 2009-05-01 2010-05-14 2009-11-01 379
knit_table(yank(skim_df, "character"), 'Column: ATM')
Column: ATM
skim_variable n_missing complete_rate min max empty n_unique whitespace
ATM 14 0.990502 4 4 0 4 0
knit_table(yank(skim_df, "numeric"), 'Column: CASH')
Column: CASH
skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
CASH 19 0.9871099 155.582 376.4568 0 0.5 73 114 10919.76 ▇▁▁▁▁

Data Handling

As shown in the data summary metrics within the tables above, there are 1,474 total rows in the dataset; however, 14 of these rows (5/1/10 - 5/14/10) have null ATM and CASH values. These dates represent the dates being forecasted and will be removed for modeling.

# create subset df for null dates that will be forecasted
df_na_dates = df |> filter(is.na(ATM))

# filter out null dates from data
df = df |> filter(!is.na(ATM))

knit_table(df_na_dates, 'Dates to be Forecasted: Rows with no ATM or CASH values')
Dates to be Forecasted: Rows with no ATM or CASH values
DATE ATM CASH
2010-05-01 NA NA
2010-05-02 NA NA
2010-05-03 NA NA
2010-05-04 NA NA
2010-05-05 NA NA
2010-05-06 NA NA
2010-05-07 NA NA
2010-05-08 NA NA
2010-05-09 NA NA
2010-05-10 NA NA
2010-05-11 NA NA
2010-05-12 NA NA
2010-05-13 NA NA
2010-05-14 NA NA

Missing Values

As seen in the CASH and DATE column summary metrics below grouped for by ATM, after removing those 14 rows, five null CASH values still remain. Three of these values are associated with ATM1 and 2 of them are associated with ATM2. All five null values occur in June 2009.

skim_atm = df |>
  group_by(ATM) |>
  skim(DATE, CASH)

knit_table(yank(skim_atm, "Date"), 'Column: DATE - Grouped by ATM')
Column: DATE - Grouped by ATM
skim_variable ATM n_missing complete_rate min max median n_unique
DATE ATM1 0 1 2009-05-01 2010-04-30 2009-10-30 365
DATE ATM2 0 1 2009-05-01 2010-04-30 2009-10-30 365
DATE ATM3 0 1 2009-05-01 2010-04-30 2009-10-30 365
DATE ATM4 0 1 2009-05-01 2010-04-30 2009-10-30 365
knit_table(yank(skim_atm, "numeric"), 'Column: CASH - Grouped by ATM')
Column: CASH - Grouped by ATM
skim_variable ATM n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
CASH ATM1 3 0.9917808 83.8867403 36.656539 1.00000 73.0000 91.0000 108.000 180.00 ▂▁▇▃▁
CASH ATM2 2 0.9945205 62.5785124 38.898455 0.00000 25.5000 67.0000 93.000 147.00 ▇▅▇▇▂
CASH ATM3 0 1.0000000 0.7205479 7.944778 0.00000 0.0000 0.0000 0.000 96.00 ▇▁▁▁▁
CASH ATM4 0 1.0000000 474.0433449 650.935992 1.56326 124.3344 403.8393 704.507 10919.76 ▇▁▁▁▁

The missing June 2009 cash values are highlighted in the the time series data visualization below for both ATM1 and ATM2. The missing values for both ATMs occur in short, localized intervals.

# view data with cash null values
null_df = df |> filter(is.na(CASH))

# create df filtered to missing value time range
na_filt_df = df |>
  filter(
    ATM %in% c('ATM1', 'ATM2'),
    DATE >= as.Date('2009-05-01'),
    DATE < as.Date('2009-08-01'),
    )
# create df to be used to create the timerange missing value highlights 
rects = null_df |>
  mutate(
    END_DATE = DATE + 1,
    START_DATE = DATE - 1,)

# plot and highlight missing values in time series
na_filt_df |>
  ggplot(aes(x=DATE, y=CASH)) +
  geom_line() +
  facet_wrap(~ATM, scales = "free_y") +
  geom_rect(data=rects,
    aes(xmin=START_DATE, xmax=END_DATE, ymin=0, ymax=max(na_filt_df$CASH, na.rm = TRUE),
    group=ATM, fill="Null Date Range"), alpha=0.4) +
  scale_fill_manual(values="orange", labels=c("Null Date Range")) +
  labs(title="ATM1 & ATM2 Null Cash Value Date Ranges (May-August 2009)", x='Date', y='Cash (hundreds)', fill=NULL) +
  theme_minimal() +
  theme(legend.position="bottom") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

To identify the best type of method to handle the missing CASH values, it is beneficial to first identify what type of missing data is present.

Little’s MCAR Test

Little’s MCAR test was performed and resulted in a p-value less than 0.05; thus, the data is not Missing Completely at Random (MCAR). This suggests the missing values could be Missing at Random (MAR).

knit_table(as.data.frame(mcar_test(df)), 'Little’s MCAR Test Results')
Little’s MCAR Test Results
statistic df p.value missing.patterns
12.89875 2 0.0015815 2

Imputation

Given that the missing values are potentially consistent with MAR qualities in and occur within a continuous time series, linear interpolation provides an unbiased and structurally consistent method for imputation by assuming that the missing value lies between neighboring observations in a smooth way. This approach preserves local trends, maintains temporal continuity, and avoids introducing artificial distortions, making it the most appropriate approach for this dataset.

From the business perspective, ATM withdrawals typically change incrementally, not abruptly. Linear interpolation reflects realistic customer behavior between days.

Lastly, linear interpolation was applied consistently across ATM1 and ATM2. While ATM1’s stable behavior makes interpolation highly accurate, ATM2’s slightly higher variability introduces minor smoothing effects. However, given that missingness is consistent with MAR and gaps are short, linear interpolation remains an appropriate and unbiased method for both series.

The plot below displays the time series plot with the imputed values using linear interpolation.

df = df |>
  group_by(ATM) |>
  mutate(CASH = na.approx(CASH, maxgap = 4, rule = 2)) |>
  ungroup()

df |> filter(
    ATM %in% c('ATM1', 'ATM2'),
    DATE >= as.Date('2009-05-01'),
    DATE < as.Date('2009-08-01'),
    ) |>
  ggplot(aes(x=DATE, y=CASH)) +
  geom_line() +
  facet_wrap(~ATM, scales = "free_y") +
  geom_rect(data=rects,
    aes(xmin=START_DATE, xmax=END_DATE, ymin=0, ymax=max(na_filt_df$CASH, na.rm = TRUE),
    group=ATM, fill="Imputed Date Range"), alpha=0.4) +
  scale_fill_manual(values="orange", labels=c("Imputated Date Range")) +
  labs(title="ATM1 & ATM2 Imputed Cash Value Date Ranges (May-August 2009)", x='Date', y='Cash (hundreds)', fill=NULL) +
  theme_minimal() +
  theme(legend.position="bottom") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))


Outliers

As seen in the time series plot, there is an extreme outlier of $1,091,976.16 associated with ATM4 on 2/9/2010 that is skewing the time series. This is apparent in the ATM4 time series plot and is also supported by the outlier remaining prominent in the remainder of the STL Decomposition. Due to this, this outlier will be replaced with NA and treated as missing value as it does not reflect the underlying demand pattern and would otherwise bias forecast estimates. Consistent with the principles discussed in the Forecasting: Principles and Practice textbook, the goal is to model typical behavior rather than one-time anomalies, ensuring that forecasts remain accurate and operationally meaningful.

# subset data to only atm4
atm4 = df |>
  filter(ATM == "ATM4") |>
  as_tsibble(index = DATE)

# plot time series
atm4 |>
  autoplot(CASH) +
  labs(title = "ATM4 Time Series", x="Date", y='Cash (hundreds)') +
  theme_minimal()

# STL decomposition
atm4 |>
  model(stl = STL(CASH)) |>
  components() |>
  autoplot() +
  labs(title = "STL Decomposition - ATM4", x="Date", y='Cash (hundreds)') +
  theme_minimal()

After being replaced with NA. values, linear interpolation was applied to impute a value consistent with neighboring observations. This reconstructs a value consistent with the local level of the series and preserves both continuity and statistical integrity.

# replace outlier with NA and impute
df = df |>
  mutate(CASH = if_else(CASH < 10919, CASH, NA_real_)) |>
  mutate(CASH = na.approx(CASH, na.rm = FALSE))

The plot below displays the updated time series with the outlier removed and imputed with linear interpolation.

df |>
  ggplot(aes(x=DATE, y=CASH, color=ATM)) +
  geom_line() +
  scale_color_manual(values = group_colors) +
  scale_x_date(
    date_breaks = "2 month", # Set breaks at 2-month intervals
    date_labels = "%b %y"
  ) +
  labs(title="Withdrawn Cash by ATM Overtime", x='Date', y='Cash ($USD, Hundreds)') +
  theme_minimal()


Modeling

To determine the best modeling approach for forecasting ATM1 values, we first need to review the trends and features of the time series.

ATM1

# subset data to only atm1
atm1 = df |>
  filter(ATM == "ATM1") |>
  as_tsibble(index = DATE)

# plot time series
atm1 |>
  autoplot(CASH) +
  labs(title = "ATM1 Time Series", x="Date", y='Cash (hundreds)') +
  theme_minimal()

# STL decomposition
atm1 |>
  model(stl = STL(CASH)) |>
  components() |>
  autoplot() +
  labs(title = "STL Decomposition - ATM1", x="Date", y='Cash (hundreds)') +
  theme_minimal()

atm1 |>
  ACF(CASH) |>
  autoplot() +
  labs(title = "ACF Plot - ATM1") +
  theme_minimal()

There does not appear to be any overall trend in the time series for ATM1. There does appear to be a weekly seasonal pattern as supported by the seasonal plot of the STL decomposition and the ACF plot which has high values at lags 7, 14, and 21. ATM1 withdraws follow a weekly seasonal pattern with high withdraws at the beginning of the week followed by a sharp decline.

To perform the model selection process, the data was split into a training and test sets. The training set includes all data before April 2010 (about 92% of the data).

Based on the features of the ATM1 time series, the following models were considered:

  • Exponential Smoothing (Holt-Winters’ additive method): due to there being a clear seasonal pattern and the seasonal variations being roughly constant through the series
  • Seasonal Native: due to the data being highly seasonal and not having any clear trends
  • ARIMA: due to having automated differencing determination via repeated KPSS tests, this model can be applied and included in the assessment
# split the data into test and train sets
atm1_train <- atm1 |>
  filter(DATE < as.Date("2010-04-01"))
atm1_test <- atm1 |>
  filter(DATE >= as.Date("2010-04-01"))

# define the box cox transformation lambda
lambda <- atm1 |>
  features(CASH, features = guerrero) |>
  pull(lambda_guerrero)

# fit the models on the training data
atm1_fit = atm1_train |>
  model(
    "Additive ETS" = ETS(CASH ~ error("A") + trend("N") + season("A")),
    "Auto ETS" = ETS(CASH),
    "SNAIVE" = SNAIVE(CASH),
    "Transformed Additive ETS" = ETS(box_cox(CASH,lambda) ~ error("A") + trend("N") + season("A")),
    "ARIMA" = ARIMA(box_cox(CASH,lambda))
  )

# view the test set forecasts 
atm1_fc_final = atm1_fit |>
  forecast(h = nrow(atm1_test))

atm1_fc_final |>
  autoplot(atm1_test, level=NULL) +
  labs(title = "ATM1 Test Forecasets", x="Date", y='Cash (hundreds)') +
  theme_minimal()

# evaluate model performance on test set
knit_table(
left_join(
  glance(atm1_fit) |> select(.model, AIC, AICc, BIC), 
  atm1_fc_final |> accuracy(atm1_test) |> select(.model, RMSE, MAPE),
  by = join_by(.model)
  ) |>
  arrange(RMSE, AICc)
)
.model AIC AICc BIC RMSE MAPE
Additive ETS 4114.365 4115.044 4152.506 11.61487 64.73164
Auto ETS 4114.365 4115.044 4152.506 11.61487 64.73164
ARIMA 1150.052 1150.176 1165.224 14.21138 45.64292
Transformed Additive ETS 2181.670 2182.349 2219.811 14.41856 46.02107
SNAIVE NA NA NA 16.03746 73.39914

The testing results show that the Additive ETS model and Auto ETS model have the same metrics which suggests the additive ETS model was chosen as the best performing ETS model. In addition, the additive ETS model has the lowest RMSE, followed by the the ARIMA model and transformed additive ETS. Overall, the difference in RMSE value between these model is minimal. The ARIMA model has a significantly lower AICc value and a lower MAPE than the other models. Due to this, the ARIMA model is the best if the best fit for the ATM1 time series.

# view the selected model report
atm1_fit |>
  select(.model = "ARIMA") |>
  report()
## Series: CASH 
## Model: ARIMA(0,0,2)(0,1,1)[7] 
## Transformation: box_cox(CASH, lambda) 
## 
## Coefficients:
##          ma1      ma2     sma1
##       0.1139  -0.1197  -0.6512
## s.e.  0.0547   0.0549   0.0473
## 
## sigma^2 estimated as 1.899:  log likelihood=-571.03
## AIC=1150.05   AICc=1150.18   BIC=1165.22
# fit the model on all the data
atm1_fit_final <- atm1 |>
  model("ARIMA" = ARIMA(box_cox(CASH,lambda) ~ pdq(0, 0, 2) + PDQ(0, 1, 1, period = 7)))
# forecast values for May 2010
atm1_fc_final <- atm1_fit_final |>
  forecast(h = 31)

The selected ARIMA model has the following components:

  • Non-seasonal part (0,0,2): No Autoregressive (p=0) or Integration (d=0) terms. It includes a Moving Average of order 2 (q=2).
  • Seasonal part (0,1,1): No seasonal AR (P=0). It uses seasonal differencing with a period of 7, combined with a seasonal Moving Average of order 1

This model was fit to the entire ATM1 dataset.

atm1_fit_final |>
  select(.model = "ARIMA") |>
  gg_tsresiduals() +
  labs(
    title = "ARIMA(0,0,2)(0,1,1)[7] Model of ATM1 Transformed CASH residuals", x="Date", y='Cash (hundreds)') +
  theme_minimal()
## Warning: `gg_tsresiduals()` was deprecated in feasts 0.4.2.
## ℹ Please use `ggtime::gg_tsresiduals()` instead.
## This warning is displayed once per session.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

# Applying the Ljung-Box test
augment(atm1_fit_final) |> features(.innov, ljung_box, lag=10)
## # A tibble: 1 × 3
##   .model lb_stat lb_pvalue
##   <chr>    <dbl>     <dbl>
## 1 ARIMA     6.50     0.772
# Applying the Box-Pierce test
augment(atm1_fit_final) |> features(.innov, box_pierce, lag = 10)
## # A tibble: 1 × 3
##   .model bp_stat bp_pvalue
##   <chr>    <dbl>     <dbl>
## 1 ARIMA     6.37     0.784

The residuals of the fit model appear be white noise and normally distributed with no autocorrelations exceed the confidence interval limit. This is supported by the Ljung-Box and Box Pierce tests having p-values greater than 0.05, meaning the residuals are not distinguishable from white noise.

atm1_fc_final |>
  autoplot(atm1) +
  labs(title="ATM1 May 2010 Forecasts", subtitle = "ARIMA(0,0,2)(0,1,1)[7] Model") +
  theme_minimal()


ATM2

# subset data to only atm2
atm2 = df |>
  filter(ATM == "ATM2") |>
  as_tsibble(index = DATE)

# plot time series
atm2 |>
  autoplot(CASH) +
  labs(title = "ATM2 Time Series", x="Date", y='Cash (hundreds)') +
  theme_minimal()

# STL decomposition
atm2 |>
  model(stl = STL(CASH)) |>
  components() |>
  autoplot() +
  labs(title = "STL Decomposition - ATM2", x="Date", y='Cash (hundreds)') +
  theme_minimal()

atm2 |>
  ACF(CASH) |>
  autoplot() +
  labs(title = "ACF Plot - ATM2")

The time series for ATM2 appears to be very similar to ATM1 in that there does not appear to be a strong overall trend and there appears to be a weekly seasonal pattern. This is supported by the decomposition and the ACF plot which has high values at lags 7, 14, and 21. ATM2 withdraws also follow a weekly seasonal pattern have high withdraws at the beginning of the week followed by a sharp decline.

To perform the model selection process, the data was split into a training and test set. The training set includes all data before April 2010 (about 92% of the data).

Based on the features of the ATM2 time series, the following models were considered:

  • Exponential Smoothing (Holt-Winters’ additive method): due to there being a clear seasonal pattern and the seasonal variations being roughly constant through the series
  • Seasonal Native: due to the data being highly seasonal and not having any clear trends
  • ARIMA: due to having automated differencing determination via repeated KPSS tests, this model can be applied and included in the assessment
# split the data into test and train sets
atm2_train <- atm2 |>
  filter(DATE < as.Date("2010-04-01"))
atm2_test <- atm2 |>
  filter(DATE >= as.Date("2010-04-01"))

# define the box cox transformation lambda
lambda <- atm2 |>
  features(CASH, features = guerrero) |>
  pull(lambda_guerrero)

# fit the models on the training data
atm2_fit = atm2_train |>
  model(
    "Additive ETS" = ETS(CASH ~ error("A") + trend("N") + season("A")),
    "SNAIVE" = SNAIVE(CASH),
    "Transformed Additive ETS" = ETS(box_cox(CASH,lambda) ~ error("A") + trend("N") + season("A")),
    "ARIMA" = ARIMA(box_cox(CASH,lambda))
  )

# view the test set forecasts 
atm2_fc_final = atm2_fit |>
  forecast(h = nrow(atm2_test))

atm2_fc_final |>
  autoplot(atm2_test, level=NULL)+
  labs(title = "ATM2 Test Forecasets", x="Date", y='Cash (hundreds)') +
  theme_minimal()

# evaluate model performance on test set
knit_table(
left_join(
  glance(atm2_fit) |> select(.model, AIC, AICc, BIC), 
  atm2_fc_final |> accuracy(atm2_test) |> select(.model, RMSE, MAPE, MASE),
  by = join_by(.model)
  ) |>
  arrange(RMSE, AICc)
)
.model AIC AICc BIC RMSE MAPE MASE
Additive ETS 4141.324 4142.003 4179.465 19.37845 57.41915 NaN
Transformed Additive ETS 3410.339 3411.018 3448.481 19.66788 118.82328 NaN
ARIMA 2341.451 2341.713 2364.210 21.16529 119.83911 NaN
SNAIVE NA NA NA 26.16231 45.13794 NaN

The testing results show that the ARIMA model has the lowest AIC, AICc, and BIC, indicating superior model fit. While ETS provides a slightly lower RMSE and SNAIVE has a lower MAPE, ARIMA offers the best balance between fit and forecasting performance. Thus, the ARIMA model will be used to forecast May 2010 values.

# view the selected model report
atm2_fit |>
  select(.model = "ARIMA") |>
  report()
## Series: CASH 
## Model: ARIMA(2,0,2)(0,1,1)[7] 
## Transformation: box_cox(CASH, lambda) 
## 
## Coefficients:
##           ar1      ar2     ma1     ma2     sma1
##       -0.4251  -0.9187  0.4732  0.8028  -0.7474
## s.e.   0.0511   0.0441  0.0819  0.0578   0.0460
## 
## sigma^2 estimated as 71.19:  log likelihood=-1164.73
## AIC=2341.45   AICc=2341.71   BIC=2364.21
# fit the model on all the data
atm2_fit_final <- atm2 |>
  model("ARIMA" = ARIMA(box_cox(CASH,lambda) ~ pdq(2, 0, 2) + PDQ(0, 1, 1, period = 7)))
# forecast values for May 2010
atm2_fc_final <- atm2_fit_final |>
  forecast(h = 31)

The selected ARIMA model has the following components:

  • Non-seasonal part (2,0,2): The model uses the last two values to forecast values by incorporating two autoregressive parameters. There are no Integration (d=0) terms.It includes a Moving Average of order 2 (q=2).
  • Seasonal part (0,1,1): No seasonal AR (P=0). It uses seasonal differencing with a period of 7, combined with a seasonal Moving Average of order 1

This model was fit to the entire ATM2 dataset.

atm2_fit_final |>
  select(.model = "ARIMA") |>
  gg_tsresiduals() +
  labs(
    title = "ARIMA(2,0,2)(0,1,1)[7] Model of ATM2 Transformed CASH residuals"
  )

# Applying the Ljung-Box test
augment(atm2_fit_final) |> features(.innov, ljung_box, lag=10)
## # A tibble: 1 × 3
##   .model lb_stat lb_pvalue
##   <chr>    <dbl>     <dbl>
## 1 ARIMA     9.06     0.527
# Applying the Box-Pierce test
augment(atm2_fit_final) |> features(.innov, box_pierce, lag = 10)
## # A tibble: 1 × 3
##   .model bp_stat bp_pvalue
##   <chr>    <dbl>     <dbl>
## 1 ARIMA     8.88     0.544

While there are a few autocorrelation lag values (5 and 25) that exceed the acf confidence intervals, the residuals of the fit model appear be mostly white noise and normally distributed. This is supported by the Ljung-Box and Box Pierce tests having p-values greater than 0.05, meaning the residuals are not distinguishable from white noise.

atm2_fc_final |>
  autoplot(atm2) +
  labs(title="ATM2 May 2010 Forecasts", subtitle = "ARIMA(2,0,2)(0,1,1)[7] Model", x="Date", y='Cash (hundreds)') +
  theme_minimal()


ATM3

# subset data to only atm3
atm3 = df |>
  filter(ATM == "ATM3") |>
  as_tsibble(index = DATE)

# plot time series
atm3 |>
  autoplot(CASH) +
  labs(title = "ATM3 Time Series", x="Date", y='Cash (hundreds)') +
  theme_minimal()

atm3_data = atm3 |>
  filter(CASH > 0)

atm3_data |>
  autoplot(CASH) +
  labs(title = "ATM3 Time Series - Greater than $0", x="Date", y='Cash (hundreds)') +
  theme_minimal()

# STL decomposition
atm3 |>
  model(stl = STL(CASH)) |>
  components() |>
  autoplot() +
  labs(title = "STL Decomposition - ATM3", x="Date", y='Cash (hundreds)') +
  theme_minimal()

atm3 |>
  ACF(CASH) |>
  autoplot() +
  labs(title = "ACF Plot - ATM3")

The time series for ATM3 only has withdraw cash values greater than $0 for the most recent three days. Due to this, a simpler model approach will be needed due to not having adequate historical data to develop a more sophisticated model.

Based on the features of the ATM3 time series, the following models were considered: - Mean - Native

atm3_fit = atm3_data |>
  model(
    "Mean" = MEAN(CASH),
    "Naïve" = NAIVE(CASH),
  )

atm3_fcs = atm3_fit |>
  forecast(h = 31)

atm3_fcs |>
  autoplot(atm3_data, level=NULL) +
  labs(title='ATM3 Cash Forecasts', x="Date", y='Cash (hundreds)') +
  theme_minimal()

atm3_fcs |>
  autoplot(atm3, level=NULL) +
  labs(title='ATM3 Cash Forecasts', x="Date", y='Cash (hundreds)') +
  theme_minimal()

accuracy(atm3_fit)
## # A tibble: 2 × 10
##   .model .type           ME  RMSE   MAE    MPE  MAPE  MASE RMSSE   ACF1
##   <chr>  <chr>        <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl>  <dbl>
## 1 Mean   Training -4.74e-15  6.02  5.56 -0.456  6.24   NaN   NaN -0.296
## 2 Naïve  Training -5.5 e+ 0 10.1   8.5  -6.77  10.3    NaN   NaN -0.5
atm3_fit |>
  forecast(h = 1) |>
  hilo(level = 95)
## # A tsibble: 2 x 5 [1D]
## # Key:       .model [2]
##   .model DATE      
##   <chr>  <date>    
## 1 Mean   2010-05-01
## 2 Naïve  2010-05-01
## # ℹ 3 more variables: CASH <dist>, .mean <dbl>, `95%` <hilo>

95% Confidence Interval: - Mean: [70.98457, 104.3488], Range = 33.36 - Native: [65.15688, 104.8431], Range = 39.69

Due to the MEAN() model method having a smaller 95% confidence interval range than the NATIVE() model for the three data points available, the MEAN() forecasting approach will be used to forecast values for May 2010.

# view the selected model report
atm3_fit_final = atm3_fit |>
  select(.model = "Mean")

# forecast values for May 2010
atm3_fc_final <- atm3_fit_final |>
  forecast(h = 31)
atm3_fit_final |>
  gg_tsresiduals() +
  labs(
    title = "Mean Model of ATM3 Residuals"
  )

No conclusions can be made on the model fit based upon the residuals due to the lack of available data.

atm3_fc_final |>
  autoplot(atm3) +
  labs(title="ATM3 May 2010 Forecasts", subtitle = "Mean Model", x="Date", y='Cash (hundreds)') +
  theme_minimal()


ATM4

# subset data to only atm4
atm4 = df |>
  filter(ATM == "ATM4") |>
  as_tsibble(index = DATE)

# plot time series
atm4 |>
  autoplot(CASH) +
  labs(title = "ATM4 Time Series", x="Date", y='Cash (hundreds)') +
  theme_minimal()

# STL decomposition
atm4 |>
  model(stl = STL(CASH)) |>
  components() |>
  autoplot() +
  labs(title = "STL Decomposition - ATM4", x="Date", y='Cash (hundreds)') +
  theme_minimal()

atm4 |>
  ACF(CASH) |>
  autoplot() +
  labs(title = "ACF Plot - ATM4")

Here we can see that this data doesn’t really seem to have much trend and is highly seasonal with a seasonal window of one week, just as we saw with ATM1. With that, we can follow a similar process as we did with ATM1 here:

The time series for ATM4 similar to ATM1 in that there does not appear to be a strong overall trend and there appears to be a weekly seasonal pattern. This is supported by the STL decomposition and the ACF plot which has high values at lags 7, 14, and 21.

To perform the model selection process, the data was split into a training and test set. The training set includes all data before April 2010 (about 92% of the data).

Based on the features of the ATM4 time series, the following models were considered:

  • Exponential Smoothing (Holt-Winters’ additive method): due to there being a clear seasonal pattern and the seasonal variations being roughly constant through the series
  • Seasonal Native: due to the data being highly seasonal and not having any clear trends
  • ARIMA: due to having automated differencing determination via repeated KPSS tests, this model can be applied and included in the assessment
# split the data into test and train sets
atm4_train <- atm4 |>
  filter(DATE < as.Date("2010-04-01"))
atm4_test <- atm4 |>
  filter(DATE >= as.Date("2010-04-01"))

# define the box cox transformation lambda
lambda <- atm4 |>
  features(CASH, features = guerrero) |>
  pull(lambda_guerrero)

# fit the models on the training data
atm4_fit = atm4_train |>
  model(
    "Additive ETS" = ETS(CASH ~ error("A") + trend("N") + season("A")),
    "SNAIVE" = SNAIVE(CASH),
    "Transformed Additive ETS" = ETS(box_cox(CASH,lambda) ~ error("A") + trend("N") + season("A")),
    "ARIMA" = ARIMA(box_cox(CASH,lambda))
  )

# view the test set forecasts 
atm4_fc_final = atm4_fit |>
  forecast(h = nrow(atm4_test))

atm4_fc_final |>
  autoplot(atm4_test, level=NULL) +
  labs(title = "ATM4 Test Forecasets", x="Date", y='Cash (hundreds)') +
  theme_minimal()

# evaluate model performance on test set
knit_table(
left_join(
  glance(atm4_fit) |> select(.model, AIC, AICc, BIC), 
  atm4_fc_final |> accuracy(atm4_test) |> select(.model, RMSE, MAPE),
  by = join_by(.model)
  ) |>
  arrange(RMSE, AICc)
)
.model AIC AICc BIC RMSE MAPE
ARIMA 2698.459 2698.641 2717.529 258.9685 857.8227
Transformed Additive ETS 3673.410 3674.089 3711.551 300.2535 1185.6613
SNAIVE NA NA NA 300.8483 375.0538
Additive ETS 5856.301 5856.980 5894.443 306.8014 1242.6338

The testing results show that the ARIMA model is the best overall choice as it achieves the lowest AIC, AICc, and BIC, indicating superior model fit, and also provides the lowest RMSE. While the seasonal naïve model has a lower MAPE, this metric can be unstable in the presence of small values, making ARIMA the more robust and reliable forecasting model.

# view the selected model report
atm4_fit |>
  select(.model = "ARIMA") |>
  report()
## Series: CASH 
## Model: ARIMA(0,0,1)(2,0,0)[7] w/ mean 
## Transformation: box_cox(CASH, lambda) 
## 
## Coefficients:
##          ma1    sar1    sar2  constant
##       0.0844  0.1965  0.1890   17.6950
## s.e.  0.0542  0.0537  0.0546    0.7764
## 
## sigma^2 estimated as 180.7:  log likelihood=-1344.23
## AIC=2698.46   AICc=2698.64   BIC=2717.53
# fit the model on all the data
atm4_fit_final <- atm4 |>
  model("ARIMA" = ARIMA(box_cox(CASH,lambda) ~ pdq(0, 0, 1) + PDQ(2, 0, 0, period = 7)))
# forecast values for May 2010
atm4_fc_final <- atm4_fit_final |>
  forecast(h = 31)

The selected ARIMA model has the following components:

  • Non-seasonal part (0,0,1): Includes one moving average term, meaning the model uses the previous forecast error (shock) to improve predictions
  • Seasonal part (2,0,0): Includes two seasonal autoregressive terms with a seasonal period of 7, and a constant mean is included, indicating the series reverts to a non-zero

This model was fit to the entire ATM4 dataset.

atm4_fit_final |>
  select(.model = "ARIMA") |>
  gg_tsresiduals() +
  labs(
    title = "ARIMA(0,0,1)(2,0,0)[7] Model of ATM4 Transformed Residuals"
  )

# Applying the Ljung-Box test
augment(atm4_fit_final) |> features(.innov, ljung_box, lag=10)
## # A tibble: 1 × 3
##   .model lb_stat lb_pvalue
##   <chr>    <dbl>     <dbl>
## 1 ARIMA     16.0    0.0986
# Applying the Box-Pierce test
augment(atm4_fit_final) |> features(.innov, box_pierce, lag = 10)
## # A tibble: 1 × 3
##   .model bp_stat bp_pvalue
##   <chr>    <dbl>     <dbl>
## 1 ARIMA     15.6     0.111

While there is one autocorrelation lag value (10) slightly over the ACF confidence interval threshold, the residuals of the fit model appear be mostly white noise and normally distributed. This is supported. by the Ljung-Box and Box Pierce tests having p-values greater than 0.05, meaning the residuals are not distinguishable from white noise.

atm4_fc_final |>
  autoplot(atm4) +
  labs(title="ATM4 May 2010 Forecasts", subtitle = "ARIMA(0,0,1)(2,0,0)[7] Model", x="Date", y='Cash (hundreds)') +
  theme_minimal()


PART B – Forecasting Power

Prompt: Model the residential power usage data and a monthly forecast for 2014

Dataset: ResidentialCustomerForecastLoad-624.xlsx


Import and Visualize Data

Import the data from Github and view the raw data

# import data from github
github_url_b = 'https://github.com/okazevedo90/CUNY_GITHUB_submissions/raw/refs/heads/main/DATA624/project_1/ResidentialCustomerForecastLoad-624.xlsx'
data_b = import(github_url_b)

# view data head
knit_table(head(data_b), "View Raw Data")
View Raw Data
CaseSequence YYYY-MMM KWH
733 1998-Jan 6862583
734 1998-Feb 5838198
735 1998-Mar 5420658
736 1998-Apr 5010364
737 1998-May 4665377
738 1998-Jun 6467147

This dataset includes residential power usage from January 1998 until December 2013. There are 192 rows with power usage tracked at the monthly level.

The data consists of three columns:

  • CaseSequence: unique identifier for monthly residential power usage
  • YYYY-MM: Year and month of power consumption
  • KWH: power consumption in Kilowatt hours

Data Handling

power = data_b
# rename columns to all naming follows same casing
colnames(power) = c("CaseSequence", "YR_MONTH", "KWH")


skim_power = skim(power)
# |> select(YR_MONTH, KWH))
# yank(skim_power, "character")

skim_power |> summary()
Data summary
Name power
Number of rows 192
Number of columns 3
_______________________
Column type frequency:
character 1
numeric 2
________________________
Group variables None
# knit_table(yank(skim_df, "Date"), 'Column: YR_MONTH')
knit_table(yank(skim_power, "numeric"), 'Column: KWH')
Column: KWH
skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
CaseSequence 0 1.0000000 828.5 5.556978e+01 733 780.75 828.5 876.25 924 ▇▇▇▇▇
KWH 1 0.9947917 6502474.6 1.447571e+06 770523 5429912.00 6283324.0 7620523.50 10655730 ▁▁▇▅▁
knit_table(yank(skim_power, "character"), 'Column: YR_MONTH')
Column: YR_MONTH
skim_variable n_missing complete_rate min max empty n_unique whitespace
YR_MONTH 0 1 8 8 0 192 0
# transform into tsibble
power = power |>
  mutate(YR_MONTH = yearmonth(YR_MONTH)) |>
  as_tsibble(index = YR_MONTH)

Visualize the time series overtime

power |>
  autoplot(KWH) +
  labs(title="Residential Power Usage Time Series", x="Year Month", y="KWH (Kilowatt hours)") +
  theme_minimal()

Data Handling

Outliers

Identify outliers with the IQR method.

outlier_flag <- function(x) {
  # Calculate first and third quartiles
  Q1 <- quantile(x, 0.25, na.rm = TRUE)
  Q3 <- quantile(x, 0.75, na.rm = TRUE)
  # Calculate the Interquartile Range
  IQR_val <- IQR(x, na.rm = TRUE)

  # Define lower and upper bounds (fences)
  lower_bound <- Q1 - (IQR_val * 1.5)
  upper_bound <- Q3 + (IQR_val * 1.5)

  # Flag values outside the bounds as TRUE
  is_outlier <- (x < lower_bound) | (x > upper_bound)
  return(is_outlier)
}

power = power |>
  mutate(outlier_flag = outlier_flag(KWH))

knit_table(
  power |>
    filter(outlier_flag==TRUE),
  "Identified Outliers"
)
Identified Outliers
CaseSequence YR_MONTH KWH outlier_flag
883 2010 Jul 770523 TRUE
# replace outlier with NA and impute
power = power |>
  mutate(KWH = if_else(outlier_flag == TRUE, NA_real_, KWH)) |>
  mutate(
    KWH = na.approx(KWH, na.rm = FALSE)
  )
power |>
  autoplot(KWH) +
  labs(title="Residential Power Usage Time Series", x="Year Month", y="KWH (Kilowatt hours)") +
  theme_minimal()


Modeling

power = power |>
  select(YR_MONTH, KWH)

# STL decomposition
power |>
  model(stl = STL(KWH ~ trend(window=Inf) + season(period=12, window="periodic"))) |>
  components() |>
  autoplot() +
  labs(title = "STL Decomposition - KWH")

power |>
  gg_subseries() +
  labs(
    y = "KWH",
    title = "Seasonal Subseries - KWH"
  )
## Warning: `gg_subseries()` was deprecated in feasts 0.4.2.
## ℹ Please use `ggtime::gg_subseries()` instead.
## This warning is displayed once per session.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Plot variable not specified, automatically selected `y = KWH`

power |>
  ACF(KWH) |>
  autoplot() +
  labs(title = "ACF Plot - KWH")

The time series for residential power usage appears to have an overall increasing trend with a more rapid increase starting around 2005. There also appears to be a strong yearly seasonal trend with as seen in the STL decomposition and the Seasonal subseries plot. At the begining of the year, power usage declines from January till May and then starts to steadily increase through the summer and then decrease through November. At the end of the year (December into January) power usage starts to increase again into the new year. Overall the trend in power usage seems to shift about every three months with the seasons. This is supported by the ACF plot which has high values every third lag (3, 6, 9, 12, 15, 18, 21).

To perform the model selection process, the data was split into a training and test set. The training set includes all data before January 2012 (about 88% of the data).

# split the data into test and train sets
power_train <- power |>
  filter(year(YR_MONTH) < 2012)
power_test <- power |>
  filter(year(YR_MONTH) >= 2012)

# define the box cox transformation lambda
lambda <- power |>
  features(KWH, features = guerrero) |>
  pull(lambda_guerrero)

# fit the models on the training data
power_fit = power_train |>
  model(
    "Additive ETS" = ETS(KWH ~ error("A") + trend("A") + season("A")),
    "Transformed Additive ETS" = ETS(box_cox(KWH,lambda) ~ error("A") + trend("A") + season("A")),
    "Transformed Multiplicative ETS" = ETS(box_cox(KWH,lambda) ~ error("M") + trend("A") + season("M")),
    "Multiplicative ETS" = ETS(KWH ~ error("M") + trend("A") + season("M")),
    "T Auto ETS" = ETS(box_cox(KWH,lambda)),
    "Transformed HW" = ETS(box_cox(KWH,lambda) ~ error("M") + trend("Ad") + season("M")),
    "Transformed Additive ETS" = ETS(box_cox(KWH,lambda) ~ error("A") + trend("N") + season("A")),
    "Transformed SNAIVE" = SNAIVE(box_cox(KWH,lambda)),
    "ARIMA" = ARIMA(KWH),
    "Transformed ARIMA" = ARIMA(box_cox(KWH,lambda))
  )

# view the test set forecasts 
power_fc_final = power_fit |>
  forecast(h = nrow(power_test))
power_fc_final |>
  autoplot(power_test, level=NULL)

# evaluate model performance on test set
knit_table(
left_join(
  glance(power_fit) |> select(.model, AIC, AICc, BIC), 
  power_fc_final |> accuracy(power_test) |> select(.model, RMSE, MAPE),
  by = join_by(.model)) |>
  arrange(AICc, RMSE)
)
.model AIC AICc BIC RMSE MAPE
Transformed ARIMA -1304.721 -1303.964 -1283.3716 772888.0 7.913079
Transformed Multiplicative ETS -1009.378 -1005.298 -956.2709 1119967.0 13.075044
T Auto ETS -1009.378 -1005.298 -956.2709 1119967.0 13.075044
Transformed HW -1007.029 -1002.438 -950.7977 1064521.2 11.800401
Transformed Additive ETS -1004.387 -1001.229 -957.5273 824731.4 9.000613
ARIMA 4612.814 4613.214 4628.0629 732388.6 7.019935
Multiplicative ETS 5343.893 5347.973 5397.0003 1043069.9 12.096480
Additive ETS 5361.507 5365.587 5414.6143 952109.5 11.238595
Transformed SNAIVE NA NA NA 1012071.6 9.817695

The testing results show that the Transformed ARIMA model is the best overall model, as it achieves substantially lower AIC, AICc, and BIC values, indicating a superior fit to the data. While the untransformed ARIMA shows slightly better RMSE and MAPE, these metrics are not directly comparable due to differences in scale. The transformation improves model stability and better captures the underlying data structure.

# view the selected model report
power_fit |>
  select(.model = "Transformed ARIMA") |>
  report()
## Series: KWH 
## Model: ARIMA(0,0,3)(2,1,0)[12] w/ drift 
## Transformation: box_cox(KWH, lambda) 
## 
## Coefficients:
##          ma1     ma2     ma3     sar1     sar2  constant
##       0.2561  0.0869  0.2588  -0.7917  -0.4339    0.0013
## s.e.  0.0802  0.0809  0.0725   0.0820   0.0833    0.0005
## 
## sigma^2 estimated as 1.393e-05:  log likelihood=659.36
## AIC=-1304.72   AICc=-1303.96   BIC=-1283.37
# fit the model on all the data
power_fit_final <- power |>
  model("ARIMA" = ARIMA(box_cox(KWH,lambda) ~ pdq(0, 0,3) + PDQ(2, 1, 0, period = 12)))
# forecast values for May 2010
power_fc_final <- power_fit_final |>
  forecast(h = 12)

The selected ARIMA model has the following components:

  • Non-seasonal part (0,0,3): The model includes three moving average terms, using past forecast errors to predict current values.
  • Seasonal part (2,1,0): Two seasonal autoregressive terms are used, meaning the current value depends on the same month from the previous two years. One seasonal difference is applied with a period of 12.
  • The inclusion of drift means the model includes a constant term

This model was fit to the entire power dataset.

power_fit_final |>
  select(.model = "ARIMA") |>
  gg_tsresiduals() +
  labs(
    title = "ARIMA(0,0,3)(2,1,0)[12] Model of KWH Transformed residuals"
  )

# Applying the Ljung-Box test
augment(power_fit_final) |> features(.innov, ljung_box, lag=10)
## # A tibble: 1 × 3
##   .model lb_stat lb_pvalue
##   <chr>    <dbl>     <dbl>
## 1 ARIMA     12.4     0.258
# Applying the Box-Pierce test
augment(power_fit_final) |> features(.innov, box_pierce, lag = 10)
## # A tibble: 1 × 3
##   .model bp_stat bp_pvalue
##   <chr>    <dbl>     <dbl>
## 1 ARIMA     12.0     0.287

While there is one autocorrelation lag value (6) slightly over the ACF confidence interval threshold, the residuals of the fit model appear be mostly white noise and normally distributed. This is supported. by the Ljung-Box and Box Pierce tests having p-values greater than 0.05, meaning the residuals are not distinguishable from white noise.

power_fc_final |>
  autoplot(power) +
  labs(title="ATM2 May 2010 Forecasts", subtitle = "ARIMA(0,0,3)(2,1,0)[12] Model")


Export Results (Part A & B)

Export forecasts into an Excel readable file

# gather part a forecasts to output to excel
atm1_fc_xslx = atm1_fc_final |>
  as.data.frame() |>
  mutate(ATM1_Forecasts = .mean) |>
  select(DATE, ATM1_Forecasts)

atm2_fc_xslx = atm2_fc_final |>
  as.data.frame() |>
  mutate(ATM2_Forecasts = .mean) |>
  select(DATE, ATM2_Forecasts)

atm3_fc_xslx = atm3_fc_final |>
  as.data.frame() |>
  mutate(ATM3_Forecasts = .mean) |>
  select(DATE, ATM3_Forecasts)

atm4_fc_xslx = atm4_fc_final |>
  as.data.frame() |>
  mutate(ATM4_Forecasts = .mean) |>
  select(DATE, ATM4_Forecasts)

part_a_fc = left_join(
  left_join(
    left_join(
      atm1_fc_xslx, atm2_fc_xslx, by = join_by(DATE)),
    atm3_fc_xslx, by = join_by(DATE)),
  atm4_fc_xslx, by = join_by(DATE))

# gather part b forecasts to output to excel
part_b_fc = power_fc_final |>
  as.data.frame() |>
  mutate(
    Power_Forecasts = .mean,
    YR_MONTH = as.character(YR_MONTH)
    ) |>
  select(YR_MONTH, Power_Forecasts)

# view forecasts
knit_table(part_a_fc, 'Part A Forecasts')
Part A Forecasts
DATE ATM1_Forecasts ATM2_Forecasts ATM3_Forecasts ATM4_Forecasts
2010-05-01 92.128015 67.057749 87.66667 326.5391
2010-05-02 106.518327 73.066960 87.66667 433.4118
2010-05-03 78.907120 15.091609 87.66667 512.2953
2010-05-04 5.556122 8.836430 87.66667 241.1186
2010-05-05 106.163566 99.090250 87.66667 474.7541
2010-05-06 84.720401 89.863263 87.66667 351.1941
2010-05-07 91.309504 66.721573 87.66667 414.4358
2010-05-08 93.491214 67.301326 87.66667 295.0792
2010-05-09 107.162081 73.321607 87.66667 475.2268
2010-05-10 79.572741 15.602155 87.66667 465.2695
2010-05-11 5.726440 9.814201 87.66667 287.1311
2010-05-12 106.933940 99.309032 87.66667 446.4479
2010-05-13 85.409789 90.080118 87.66667 331.2565
2010-05-14 92.024818 66.943764 87.66667 460.4210
2010-05-15 94.212208 67.540347 87.66667 388.3167
2010-05-16 107.933298 73.565606 87.66667 450.5642
2010-05-17 80.238362 16.106499 87.66667 464.3841
2010-05-18 5.896759 10.778081 87.66667 365.0836
2010-05-19 107.704315 99.519491 87.66667 453.2276
2010-05-20 86.099177 90.291614 87.66667 402.3680
2010-05-21 92.740132 67.167822 87.66667 443.6369
2010-05-22 94.933202 67.776145 87.66667 400.3265
2010-05-23 108.704515 73.802141 87.66667 453.0699
2010-05-24 80.903984 16.606507 87.66667 453.8763
2010-05-25 6.067077 11.732015 87.66667 393.4039
2010-05-26 108.474690 99.724074 87.66667 447.7969
2010-05-27 86.788565 90.499379 87.66667 412.0704
2010-05-28 93.455446 67.393207 87.66667 448.6861
2010-05-29 95.654196 68.009653 87.66667 423.6912
2010-05-30 109.475732 74.033433 87.66667 447.7701
2010-05-31 81.569605 17.103467 87.66667 450.7500
knit_table(part_b_fc, 'Part B Forecasts')
Part B Forecasts
YR_MONTH Power_Forecasts
2014 Jan 10393275
2014 Feb 8860839
2014 Mar 7064897
2014 Apr 5978216
2014 May 5938635
2014 Jun 8314016
2014 Jul 9623971
2014 Aug 10178785
2014 Sep 8541825
2014 Oct 5857955
2014 Nov 6142157
2014 Dec 8277646
# export
list_of_fc_dfs <- list("Part A Forecasts" = part_a_fc, "Part B Forecasts" = part_b_fc)
write_xlsx(list_of_fc_dfs, path = "DATA624_Project1_Forecasts_Azevedo_20260329.xlsx")

PART C – Bonus

Prompt:

  • Your optional assignment is to time-base sequence the data and aggregate based on hour
    • Note for multiple recordings within an hour, take the mean
  • Determine if the data is stationary and can it be forecast
    • If so, provide a week forward forecast and present results

Datasets:


Import Data

Import the data from Github and view the raw data

# import data from github
github_url_ec1 = 'https://github.com/okazevedo90/CUNY_GITHUB_submissions/raw/refs/heads/main/DATA624/project_1/Waterflow_Pipe1.xlsx'
github_url_ec2 = 'https://github.com/okazevedo90/CUNY_GITHUB_submissions/raw/refs/heads/main/DATA624/project_1/Waterflow_Pipe2.xlsx'
data_ec1 = import(github_url_ec1)
data_ec2 = import(github_url_ec2)

Data Handling

Time-base sequence the data and aggregate based on date and hour.

# aggregate data by date and hour
data_ec1 = data_ec1 |>
  mutate(
    Hour = hour(`Date Time`),
    Date = date(`Date Time`),
    ) |>
  mutate(DateTime = ymd(Date) + hours(Hour))
data_ec1_agg = data_ec1 |>
  group_by(DateTime) |>
  summarise(WaterFlow = mean(WaterFlow)) |>
  as_tsibble(index=DateTime)

# aggregate data by date and hour
data_ec2 = data_ec2 |>
  mutate(
    Hour = hour(`Date Time`),
    Date = date(`Date Time`),
    ) |>
  mutate(DateTime = ymd(Date) + hours(Hour))
data_ec2_agg = data_ec2 |>
  group_by(DateTime) |>
  summarise(WaterFlow = mean(WaterFlow)) |>
  as_tsibble(index=DateTime)

Visualize Data

data_ec1_agg |>
  autoplot(WaterFlow) +
  labs(title='Pipe 1 Time Series by Hour', x='Date Time', y='Water Flow')

data_ec2_agg |>
  autoplot(WaterFlow) +
  labs(title='Pipe 2 Time Series by Hour', x='Date Time', y='Water Flow')


Stationary

knit_table(
  data_ec1_agg |>features(WaterFlow, unitroot_kpss),
  "Pipe 1 KPSS Test"
)
Pipe 1 KPSS Test
kpss_stat kpss_pvalue
0.2465587 0.1
knit_table(
  data_ec2_agg |>features(WaterFlow, unitroot_kpss),
  "Pipe 2 KPSS Test"
)
Pipe 2 KPSS Test
kpss_stat kpss_pvalue
0.1053662 0.1

To test if the time series are stationary, a the KPSS unit root test was performed on each series. Both the Pipe 1 and Pipe 2 time series had p-values of 0.1. Since this is greater than 0.05 we fail to reject the null hypothesis; thus, both of the series are stationary.

Due to time constraints, this was the extent of what I was able to get done for the bonus section.