PART A – ATM Forecast

Prompt:

Forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file.
The variable Cash is provided in hundreds of dollars
Explain and demonstrate your process, techniques used and not used, and your actual forecast.
Provide a written report on your findings, visuals, discussion

Import and Visualize Data

Import the data from Github and view the raw data

# import data from github
github_url = "https://github.com/okazevedo90/CUNY_GITHUB_submissions/raw/refs/heads/main/DATA624/project_1/ATM624Data.xlsx"
data = import(github_url)

# view data head
knit_table(head(data), "View Raw Data")

View Raw Data
DATE	ATM	Cash
2009-05-01	ATM1	96
2009-05-01	ATM2	107
2009-05-02	ATM1	82
2009-05-02	ATM2	89
2009-05-03	ATM1	85
2009-05-03	ATM2	90

The data consists of three columns:

DATE: date of the ATM cash withdraws between May 2009 and May 2010
ATM: ATM number (ATM1, ATM2, ATM3, ATM4)
CASH: total cash withdrew in hundreds of dollars

Visualize the time series overtime by ATM

group_colors <- c("ATM1" = "#C77CFF", "ATM2" = "darkolivegreen3", "ATM3" = "coral1", "ATM4" = "#00BFC4")

all_cash_plot = data |>
  ggplot(aes(x=DATE, y=Cash, color=ATM)) +
  geom_line() +
  scale_color_manual(values = group_colors) +
  scale_x_date(
    date_breaks = "2 month", # Set breaks at 2-month intervals
    date_labels = "%b %y"
  ) +
  labs(title="Withdrawn Cash by ATM Overtime", x='Date', y='Cash (hundreds)') +
  theme_minimal()
all_cash_plot

all_cash_plot +
  facet_grid(ATM~., scales = "free_y") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

View dataset summary metrics

# transform date column to date type
df = data |>
  mutate(DATE = as.Date(DATE)) |>
  arrange(DATE, ATM)
# rename columns to all naming follows same casing
colnames(df) = c("DATE", "ATM", "CASH")

skim_df = skim(df)
skim_df |>
  summary()

Data summary
Name	df
Number of rows	1474
Number of columns	3
_______________________
Column type frequency:
character	1
Date	1
numeric	1
________________________
Group variables	None

knit_table(yank(skim_df, "Date"), 'Column: DATE')

Column: DATE
skim_variable	n_missing	complete_rate	min	max	median	n_unique
DATE	0	1	2009-05-01	2010-05-14	2009-11-01	379

knit_table(yank(skim_df, "character"), 'Column: ATM')

Column: ATM
skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
ATM	14	0.990502	4	4	0	4	0

knit_table(yank(skim_df, "numeric"), 'Column: CASH')

Column: CASH
skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
CASH	19	0.9871099	155.582	376.4568	0	0.5	73	114	10919.76	▇▁▁▁▁

Data Handling

As shown in the data summary metrics within the tables above, there are 1,474 total rows in the dataset; however, 14 of these rows (5/1/10 - 5/14/10) have null ATM and CASH values. These dates represent the dates being forecasted and will be removed for modeling.

# create subset df for null dates that will be forecasted
df_na_dates = df |> filter(is.na(ATM))

# filter out null dates from data
df = df |> filter(!is.na(ATM))

knit_table(df_na_dates, 'Dates to be Forecasted: Rows with no ATM or CASH values')

Dates to be Forecasted: Rows with no ATM or CASH values
DATE	ATM	CASH
2010-05-01	NA	NA
2010-05-02	NA	NA
2010-05-03	NA	NA
2010-05-04	NA	NA
2010-05-05	NA	NA
2010-05-06	NA	NA
2010-05-07	NA	NA
2010-05-08	NA	NA
2010-05-09	NA	NA
2010-05-10	NA	NA
2010-05-11	NA	NA
2010-05-12	NA	NA
2010-05-13	NA	NA
2010-05-14	NA	NA

Missing Values

As seen in the CASH and DATE column summary metrics below grouped for by ATM, after removing those 14 rows, five null CASH values still remain. Three of these values are associated with ATM1 and 2 of them are associated with ATM2. All five null values occur in June 2009.

skim_atm = df |>
  group_by(ATM) |>
  skim(DATE, CASH)

knit_table(yank(skim_atm, "Date"), 'Column: DATE - Grouped by ATM')

Column: DATE - Grouped by ATM
skim_variable	ATM	complete_rate	min	max	median	n_unique
DATE	ATM1	1	2009-05-01	2010-04-30	2009-10-30	365
DATE	ATM2	1	2009-05-01	2010-04-30	2009-10-30	365
DATE	ATM3	1	2009-05-01	2010-04-30	2009-10-30	365
DATE	ATM4	1	2009-05-01	2010-04-30	2009-10-30	365

knit_table(yank(skim_atm, "numeric"), 'Column: CASH - Grouped by ATM')

Column: CASH - Grouped by ATM
skim_variable	ATM	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
CASH	ATM1	3	0.9917808	83.8867403	36.656539	1.00000	73.0000	91.0000	108.000	180.00	▂▁▇▃▁
CASH	ATM2	2	0.9945205	62.5785124	38.898455	0.00000	25.5000	67.0000	93.000	147.00	▇▅▇▇▂
CASH	ATM3	0	1.0000000	0.7205479	7.944778	0.00000	0.0000	0.0000	0.000	96.00	▇▁▁▁▁
CASH	ATM4	0	1.0000000	474.0433449	650.935992	1.56326	124.3344	403.8393	704.507	10919.76	▇▁▁▁▁

The missing June 2009 cash values are highlighted in the the time series data visualization below for both ATM1 and ATM2. The missing values for both ATMs occur in short, localized intervals.

# view data with cash null values
null_df = df |> filter(is.na(CASH))

# create df filtered to missing value time range
na_filt_df = df |>
  filter(
    ATM %in% c('ATM1', 'ATM2'),
    DATE >= as.Date('2009-05-01'),
    DATE < as.Date('2009-08-01'),
    )
# create df to be used to create the timerange missing value highlights 
rects = null_df |>
  mutate(
    END_DATE = DATE + 1,
    START_DATE = DATE - 1,)

# plot and highlight missing values in time series
na_filt_df |>
  ggplot(aes(x=DATE, y=CASH)) +
  geom_line() +
  facet_wrap(~ATM, scales = "free_y") +
  geom_rect(data=rects,
    aes(xmin=START_DATE, xmax=END_DATE, ymin=0, ymax=max(na_filt_df$CASH, na.rm = TRUE),
    group=ATM, fill="Null Date Range"), alpha=0.4) +
  scale_fill_manual(values="orange", labels=c("Null Date Range")) +
  labs(title="ATM1 & ATM2 Null Cash Value Date Ranges (May-August 2009)", x='Date', y='Cash (hundreds)', fill=NULL) +
  theme_minimal() +
  theme(legend.position="bottom") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

To identify the best type of method to handle the missing CASH values, it is beneficial to first identify what type of missing data is present.

Little’s MCAR Test

Little’s MCAR test was performed and resulted in a p-value less than 0.05; thus, the data is not Missing Completely at Random (MCAR). This suggests the missing values could be Missing at Random (MAR).

knit_table(as.data.frame(mcar_test(df)), 'Little’s MCAR Test Results')

Little’s MCAR Test Results
statistic	df	p.value	missing.patterns
12.89875	2	0.0015815	2

Imputation

Given that the missing values are potentially consistent with MAR qualities in and occur within a continuous time series, linear interpolation provides an unbiased and structurally consistent method for imputation by assuming that the missing value lies between neighboring observations in a smooth way. This approach preserves local trends, maintains temporal continuity, and avoids introducing artificial distortions, making it the most appropriate approach for this dataset.

From the business perspective, ATM withdrawals typically change incrementally, not abruptly. Linear interpolation reflects realistic customer behavior between days.

Lastly, linear interpolation was applied consistently across ATM1 and ATM2. While ATM1’s stable behavior makes interpolation highly accurate, ATM2’s slightly higher variability introduces minor smoothing effects. However, given that missingness is consistent with MAR and gaps are short, linear interpolation remains an appropriate and unbiased method for both series.

The plot below displays the time series plot with the imputed values using linear interpolation.

df = df |>
  group_by(ATM) |>
  mutate(CASH = na.approx(CASH, maxgap = 4, rule = 2)) |>
  ungroup()

df |> filter(
    ATM %in% c('ATM1', 'ATM2'),
    DATE >= as.Date('2009-05-01'),
    DATE < as.Date('2009-08-01'),
    ) |>
  ggplot(aes(x=DATE, y=CASH)) +
  geom_line() +
  facet_wrap(~ATM, scales = "free_y") +
  geom_rect(data=rects,
    aes(xmin=START_DATE, xmax=END_DATE, ymin=0, ymax=max(na_filt_df$CASH, na.rm = TRUE),
    group=ATM, fill="Imputed Date Range"), alpha=0.4) +
  scale_fill_manual(values="orange", labels=c("Imputated Date Range")) +
  labs(title="ATM1 & ATM2 Imputed Cash Value Date Ranges (May-August 2009)", x='Date', y='Cash (hundreds)', fill=NULL) +
  theme_minimal() +
  theme(legend.position="bottom") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Outliers

As seen in the time series plot, there is an extreme outlier of $1,091,976.16 associated with ATM4 on 2/9/2010 that is skewing the time series. This is apparent in the ATM4 time series plot and is also supported by the outlier remaining prominent in the remainder of the STL Decomposition. Due to this, this outlier will be replaced with NA and treated as missing value as it does not reflect the underlying demand pattern and would otherwise bias forecast estimates. Consistent with the principles discussed in the Forecasting: Principles and Practice textbook, the goal is to model typical behavior rather than one-time anomalies, ensuring that forecasts remain accurate and operationally meaningful.

# subset data to only atm4
atm4 = df |>
  filter(ATM == "ATM4") |>
  as_tsibble(index = DATE)

# plot time series
atm4 |>
  autoplot(CASH) +
  labs(title = "ATM4 Time Series", x="Date", y='Cash (hundreds)') +
  theme_minimal()

# STL decomposition
atm4 |>
  model(stl = STL(CASH)) |>
  components() |>
  autoplot() +
  labs(title = "STL Decomposition - ATM4", x="Date", y='Cash (hundreds)') +
  theme_minimal()

After being replaced with NA. values, linear interpolation was applied to impute a value consistent with neighboring observations. This reconstructs a value consistent with the local level of the series and preserves both continuity and statistical integrity.

# replace outlier with NA and impute
df = df |>
  mutate(CASH = if_else(CASH < 10919, CASH, NA_real_)) |>
  mutate(CASH = na.approx(CASH, na.rm = FALSE))

The plot below displays the updated time series with the outlier removed and imputed with linear interpolation.

df |>
  ggplot(aes(x=DATE, y=CASH, color=ATM)) +
  geom_line() +
  scale_color_manual(values = group_colors) +
  scale_x_date(
    date_breaks = "2 month", # Set breaks at 2-month intervals
    date_labels = "%b %y"
  ) +
  labs(title="Withdrawn Cash by ATM Overtime", x='Date', y='Cash ($USD, Hundreds)') +
  theme_minimal()

Modeling

To determine the best modeling approach for forecasting ATM1 values, we first need to review the trends and features of the time series.

ATM1

# subset data to only atm1
atm1 = df |>
  filter(ATM == "ATM1") |>
  as_tsibble(index = DATE)

# plot time series
atm1 |>
  autoplot(CASH) +
  labs(title = "ATM1 Time Series", x="Date", y='Cash (hundreds)') +
  theme_minimal()

# STL decomposition
atm1 |>
  model(stl = STL(CASH)) |>
  components() |>
  autoplot() +
  labs(title = "STL Decomposition - ATM1", x="Date", y='Cash (hundreds)') +
  theme_minimal()

atm1 |>
  ACF(CASH) |>
  autoplot() +
  labs(title = "ACF Plot - ATM1") +
  theme_minimal()

There does not appear to be any overall trend in the time series for ATM1. There does appear to be a weekly seasonal pattern as supported by the seasonal plot of the STL decomposition and the ACF plot which has high values at lags 7, 14, and 21. ATM1 withdraws follow a weekly seasonal pattern with high withdraws at the beginning of the week followed by a sharp decline.

To perform the model selection process, the data was split into a training and test sets. The training set includes all data before April 2010 (about 92% of the data).

Based on the features of the ATM1 time series, the following models were considered:

Exponential Smoothing (Holt-Winters’ additive method): due to there being a clear seasonal pattern and the seasonal variations being roughly constant through the series
Seasonal Native: due to the data being highly seasonal and not having any clear trends
ARIMA: due to having automated differencing determination via repeated KPSS tests, this model can be applied and included in the assessment

# split the data into test and train sets
atm1_train <- atm1 |>
  filter(DATE < as.Date("2010-04-01"))
atm1_test <- atm1 |>
  filter(DATE >= as.Date("2010-04-01"))

# define the box cox transformation lambda
lambda <- atm1 |>
  features(CASH, features = guerrero) |>
  pull(lambda_guerrero)

# fit the models on the training data
atm1_fit = atm1_train |>
  model(
    "Additive ETS" = ETS(CASH ~ error("A") + trend("N") + season("A")),
    "Auto ETS" = ETS(CASH),
    "SNAIVE" = SNAIVE(CASH),
    "Transformed Additive ETS" = ETS(box_cox(CASH,lambda) ~ error("A") + trend("N") + season("A")),
    "ARIMA" = ARIMA(box_cox(CASH,lambda))
  )

# view the test set forecasts 
atm1_fc_final = atm1_fit |>
  forecast(h = nrow(atm1_test))

atm1_fc_final |>
  autoplot(atm1_test, level=NULL) +
  labs(title = "ATM1 Test Forecasets", x="Date", y='Cash (hundreds)') +
  theme_minimal()

# evaluate model performance on test set
knit_table(
left_join(
  glance(atm1_fit) |> select(.model, AIC, AICc, BIC), 
  atm1_fc_final |> accuracy(atm1_test) |> select(.model, RMSE, MAPE),
  by = join_by(.model)
  ) |>
  arrange(RMSE, AICc)
)


.model	AIC	AICc	BIC	RMSE	MAPE
Additive ETS	4114.365	4115.044	4152.506	11.61487	64.73164
Auto ETS	4114.365	4115.044	4152.506	11.61487	64.73164
ARIMA	1150.052	1150.176	1165.224	14.21138	45.64292
Transformed Additive ETS	2181.670	2182.349	2219.811	14.41856	46.02107
SNAIVE	NA	NA	NA	16.03746	73.39914

The testing results show that the Additive ETS model and Auto ETS model have the same metrics which suggests the additive ETS model was chosen as the best performing ETS model. In addition, the additive ETS model has the lowest RMSE, followed by the the ARIMA model and transformed additive ETS. Overall, the difference in RMSE value between these model is minimal. The ARIMA model has a significantly lower AICc value and a lower MAPE than the other models. Due to this, the ARIMA model is the best if the best fit for the ATM1 time series.

# view the selected model report
atm1_fit |>
  select(.model = "ARIMA") |>
  report()

## Series: CASH 
## Model: ARIMA(0,0,2)(0,1,1)[7] 
## Transformation: box_cox(CASH, lambda) 
## 
## Coefficients:
##          ma1      ma2     sma1
##       0.1139  -0.1197  -0.6512
## s.e.  0.0547   0.0549   0.0473
## 
## sigma^2 estimated as 1.899:  log likelihood=-571.03
## AIC=1150.05   AICc=1150.18   BIC=1165.22

# fit the model on all the data
atm1_fit_final <- atm1 |>
  model("ARIMA" = ARIMA(box_cox(CASH,lambda) ~ pdq(0, 0, 2) + PDQ(0, 1, 1, period = 7)))
# forecast values for May 2010
atm1_fc_final <- atm1_fit_final |>
  forecast(h = 31)

The selected ARIMA model has the following components:

Non-seasonal part (0,0,2): No Autoregressive (p=0) or Integration (d=0) terms. It includes a Moving Average of order 2 (q=2).
Seasonal part (0,1,1): No seasonal AR (P=0). It uses seasonal differencing with a period of 7, combined with a seasonal Moving Average of order 1

This model was fit to the entire ATM1 dataset.

atm1_fit_final |>
  select(.model = "ARIMA") |>
  gg_tsresiduals() +
  labs(
    title = "ARIMA(0,0,2)(0,1,1)[7] Model of ATM1 Transformed CASH residuals", x="Date", y='Cash (hundreds)') +
  theme_minimal()

## Warning: `gg_tsresiduals()` was deprecated in feasts 0.4.2.
## ℹ Please use `ggtime::gg_tsresiduals()` instead.
## This warning is displayed once per session.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

# Applying the Ljung-Box test
augment(atm1_fit_final) |> features(.innov, ljung_box, lag=10)

## # A tibble: 1 × 3
##   .model lb_stat lb_pvalue
##   <chr>    <dbl>     <dbl>
## 1 ARIMA     6.50     0.772

# Applying the Box-Pierce test
augment(atm1_fit_final) |> features(.innov, box_pierce, lag = 10)

## # A tibble: 1 × 3
##   .model bp_stat bp_pvalue
##   <chr>    <dbl>     <dbl>
## 1 ARIMA     6.37     0.784

The residuals of the fit model appear be white noise and normally distributed with no autocorrelations exceed the confidence interval limit. This is supported by the Ljung-Box and Box Pierce tests having p-values greater than 0.05, meaning the residuals are not distinguishable from white noise.

atm1_fc_final |>
  autoplot(atm1) +
  labs(title="ATM1 May 2010 Forecasts", subtitle = "ARIMA(0,0,2)(0,1,1)[7] Model") +
  theme_minimal()

ATM2

# subset data to only atm2
atm2 = df |>
  filter(ATM == "ATM2") |>
  as_tsibble(index = DATE)

# plot time series
atm2 |>
  autoplot(CASH) +
  labs(title = "ATM2 Time Series", x="Date", y='Cash (hundreds)') +
  theme_minimal()

# STL decomposition
atm2 |>
  model(stl = STL(CASH)) |>
  components() |>
  autoplot() +
  labs(title = "STL Decomposition - ATM2", x="Date", y='Cash (hundreds)') +
  theme_minimal()

atm2 |>
  ACF(CASH) |>
  autoplot() +
  labs(title = "ACF Plot - ATM2")

The time series for ATM2 appears to be very similar to ATM1 in that there does not appear to be a strong overall trend and there appears to be a weekly seasonal pattern. This is supported by the decomposition and the ACF plot which has high values at lags 7, 14, and 21. ATM2 withdraws also follow a weekly seasonal pattern have high withdraws at the beginning of the week followed by a sharp decline.

To perform the model selection process, the data was split into a training and test set. The training set includes all data before April 2010 (about 92% of the data).

Based on the features of the ATM2 time series, the following models were considered:

Exponential Smoothing (Holt-Winters’ additive method): due to there being a clear seasonal pattern and the seasonal variations being roughly constant through the series
Seasonal Native: due to the data being highly seasonal and not having any clear trends
ARIMA: due to having automated differencing determination via repeated KPSS tests, this model can be applied and included in the assessment

# split the data into test and train sets
atm2_train <- atm2 |>
  filter(DATE < as.Date("2010-04-01"))
atm2_test <- atm2 |>
  filter(DATE >= as.Date("2010-04-01"))

# define the box cox transformation lambda
lambda <- atm2 |>
  features(CASH, features = guerrero) |>
  pull(lambda_guerrero)

# fit the models on the training data
atm2_fit = atm2_train |>
  model(
    "Additive ETS" = ETS(CASH ~ error("A") + trend("N") + season("A")),
    "SNAIVE" = SNAIVE(CASH),
    "Transformed Additive ETS" = ETS(box_cox(CASH,lambda) ~ error("A") + trend("N") + season("A")),
    "ARIMA" = ARIMA(box_cox(CASH,lambda))
  )

# view the test set forecasts 
atm2_fc_final = atm2_fit |>
  forecast(h = nrow(atm2_test))

atm2_fc_final |>
  autoplot(atm2_test, level=NULL)+
  labs(title = "ATM2 Test Forecasets", x="Date", y='Cash (hundreds)') +
  theme_minimal()

# evaluate model performance on test set
knit_table(
left_join(
  glance(atm2_fit) |> select(.model, AIC, AICc, BIC), 
  atm2_fc_final |> accuracy(atm2_test) |> select(.model, RMSE, MAPE, MASE),
  by = join_by(.model)
  ) |>
  arrange(RMSE, AICc)
)


.model	AIC	AICc	BIC	RMSE	MAPE	MASE
Additive ETS	4141.324	4142.003	4179.465	19.37845	57.41915	NaN
Transformed Additive ETS	3410.339	3411.018	3448.481	19.66788	118.82328	NaN
ARIMA	2341.451	2341.713	2364.210	21.16529	119.83911	NaN
SNAIVE	NA	NA	NA	26.16231	45.13794	NaN

The testing results show that the ARIMA model has the lowest AIC, AICc, and BIC, indicating superior model fit. While ETS provides a slightly lower RMSE and SNAIVE has a lower MAPE, ARIMA offers the best balance between fit and forecasting performance. Thus, the ARIMA model will be used to forecast May 2010 values.

# view the selected model report
atm2_fit |>
  select(.model = "ARIMA") |>
  report()

## Series: CASH 
## Model: ARIMA(2,0,2)(0,1,1)[7] 
## Transformation: box_cox(CASH, lambda) 
## 
## Coefficients:
##           ar1      ar2     ma1     ma2     sma1
##       -0.4251  -0.9187  0.4732  0.8028  -0.7474
## s.e.   0.0511   0.0441  0.0819  0.0578   0.0460
## 
## sigma^2 estimated as 71.19:  log likelihood=-1164.73
## AIC=2341.45   AICc=2341.71   BIC=2364.21

# fit the model on all the data
atm2_fit_final <- atm2 |>
  model("ARIMA" = ARIMA(box_cox(CASH,lambda) ~ pdq(2, 0, 2) + PDQ(0, 1, 1, period = 7)))
# forecast values for May 2010
atm2_fc_final <- atm2_fit_final |>
  forecast(h = 31)

The selected ARIMA model has the following components:

Non-seasonal part (2,0,2): The model uses the last two values to forecast values by incorporating two autoregressive parameters. There are no Integration (d=0) terms.It includes a Moving Average of order 2 (q=2).
Seasonal part (0,1,1): No seasonal AR (P=0). It uses seasonal differencing with a period of 7, combined with a seasonal Moving Average of order 1

This model was fit to the entire ATM2 dataset.

atm2_fit_final |>
  select(.model = "ARIMA") |>
  gg_tsresiduals() +
  labs(
    title = "ARIMA(2,0,2)(0,1,1)[7] Model of ATM2 Transformed CASH residuals"
  )

# Applying the Ljung-Box test
augment(atm2_fit_final) |> features(.innov, ljung_box, lag=10)

## # A tibble: 1 × 3
##   .model lb_stat lb_pvalue
##   <chr>    <dbl>     <dbl>
## 1 ARIMA     9.06     0.527

# Applying the Box-Pierce test
augment(atm2_fit_final) |> features(.innov, box_pierce, lag = 10)

## # A tibble: 1 × 3
##   .model bp_stat bp_pvalue
##   <chr>    <dbl>     <dbl>
## 1 ARIMA     8.88     0.544

While there are a few autocorrelation lag values (5 and 25) that exceed the acf confidence intervals, the residuals of the fit model appear be mostly white noise and normally distributed. This is supported by the Ljung-Box and Box Pierce tests having p-values greater than 0.05, meaning the residuals are not distinguishable from white noise.

atm2_fc_final |>
  autoplot(atm2) +
  labs(title="ATM2 May 2010 Forecasts", subtitle = "ARIMA(2,0,2)(0,1,1)[7] Model", x="Date", y='Cash (hundreds)') +
  theme_minimal()

ATM3

# subset data to only atm3
atm3 = df |>
  filter(ATM == "ATM3") |>
  as_tsibble(index = DATE)

# plot time series
atm3 |>
  autoplot(CASH) +
  labs(title = "ATM3 Time Series", x="Date", y='Cash (hundreds)') +
  theme_minimal()

atm3_data = atm3 |>
  filter(CASH > 0)

atm3_data |>
  autoplot(CASH) +
  labs(title = "ATM3 Time Series - Greater than $0", x="Date", y='Cash (hundreds)') +
  theme_minimal()

# STL decomposition
atm3 |>
  model(stl = STL(CASH)) |>
  components() |>
  autoplot() +
  labs(title = "STL Decomposition - ATM3", x="Date", y='Cash (hundreds)') +
  theme_minimal()

atm3 |>
  ACF(CASH) |>
  autoplot() +
  labs(title = "ACF Plot - ATM3")

The time series for ATM3 only has withdraw cash values greater than $0 for the most recent three days. Due to this, a simpler model approach will be needed due to not having adequate historical data to develop a more sophisticated model.

Based on the features of the ATM3 time series, the following models were considered: - Mean - Native

atm3_fit = atm3_data |>
  model(
    "Mean" = MEAN(CASH),
    "Naïve" = NAIVE(CASH),
  )

atm3_fcs = atm3_fit |>
  forecast(h = 31)

atm3_fcs |>
  autoplot(atm3_data, level=NULL) +
  labs(title='ATM3 Cash Forecasts', x="Date", y='Cash (hundreds)') +
  theme_minimal()

atm3_fcs |>
  autoplot(atm3, level=NULL) +
  labs(title='ATM3 Cash Forecasts', x="Date", y='Cash (hundreds)') +
  theme_minimal()

accuracy(atm3_fit)

## # A tibble: 2 × 10
##   .model .type           ME  RMSE   MAE    MPE  MAPE  MASE RMSSE   ACF1
##   <chr>  <chr>        <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl>  <dbl>
## 1 Mean   Training -4.74e-15  6.02  5.56 -0.456  6.24   NaN   NaN -0.296
## 2 Naïve  Training -5.5 e+ 0 10.1   8.5  -6.77  10.3    NaN   NaN -0.5

atm3_fit |>
  forecast(h = 1) |>
  hilo(level = 95)

## # A tsibble: 2 x 5 [1D]
## # Key:       .model [2]
##   .model DATE      
##   <chr>  <date>    
## 1 Mean   2010-05-01
## 2 Naïve  2010-05-01
## # ℹ 3 more variables: CASH <dist>, .mean <dbl>, `95%` <hilo>

95% Confidence Interval: - Mean: [70.98457, 104.3488], Range = 33.36 - Native: [65.15688, 104.8431], Range = 39.69

Due to the MEAN() model method having a smaller 95% confidence interval range than the NATIVE() model for the three data points available, the MEAN() forecasting approach will be used to forecast values for May 2010.

# view the selected model report
atm3_fit_final = atm3_fit |>
  select(.model = "Mean")

# forecast values for May 2010
atm3_fc_final <- atm3_fit_final |>
  forecast(h = 31)

atm3_fit_final |>
  gg_tsresiduals() +
  labs(
    title = "Mean Model of ATM3 Residuals"
  )

No conclusions can be made on the model fit based upon the residuals due to the lack of available data.

atm3_fc_final |>
  autoplot(atm3) +
  labs(title="ATM3 May 2010 Forecasts", subtitle = "Mean Model", x="Date", y='Cash (hundreds)') +
  theme_minimal()

ATM4

# subset data to only atm4
atm4 = df |>
  filter(ATM == "ATM4") |>
  as_tsibble(index = DATE)

# plot time series
atm4 |>
  autoplot(CASH) +
  labs(title = "ATM4 Time Series", x="Date", y='Cash (hundreds)') +
  theme_minimal()

# STL decomposition
atm4 |>
  model(stl = STL(CASH)) |>
  components() |>
  autoplot() +
  labs(title = "STL Decomposition - ATM4", x="Date", y='Cash (hundreds)') +
  theme_minimal()

atm4 |>
  ACF(CASH) |>
  autoplot() +
  labs(title = "ACF Plot - ATM4")

Here we can see that this data doesn’t really seem to have much trend and is highly seasonal with a seasonal window of one week, just as we saw with ATM1. With that, we can follow a similar process as we did with ATM1 here:

The time series for ATM4 similar to ATM1 in that there does not appear to be a strong overall trend and there appears to be a weekly seasonal pattern. This is supported by the STL decomposition and the ACF plot which has high values at lags 7, 14, and 21.

To perform the model selection process, the data was split into a training and test set. The training set includes all data before April 2010 (about 92% of the data).

Based on the features of the ATM4 time series, the following models were considered:

Exponential Smoothing (Holt-Winters’ additive method): due to there being a clear seasonal pattern and the seasonal variations being roughly constant through the series
Seasonal Native: due to the data being highly seasonal and not having any clear trends
ARIMA: due to having automated differencing determination via repeated KPSS tests, this model can be applied and included in the assessment

# split the data into test and train sets
atm4_train <- atm4 |>
  filter(DATE < as.Date("2010-04-01"))
atm4_test <- atm4 |>
  filter(DATE >= as.Date("2010-04-01"))

# define the box cox transformation lambda
lambda <- atm4 |>
  features(CASH, features = guerrero) |>
  pull(lambda_guerrero)

# fit the models on the training data
atm4_fit = atm4_train |>
  model(
    "Additive ETS" = ETS(CASH ~ error("A") + trend("N") + season("A")),
    "SNAIVE" = SNAIVE(CASH),
    "Transformed Additive ETS" = ETS(box_cox(CASH,lambda) ~ error("A") + trend("N") + season("A")),
    "ARIMA" = ARIMA(box_cox(CASH,lambda))
  )

# view the test set forecasts 
atm4_fc_final = atm4_fit |>
  forecast(h = nrow(atm4_test))

atm4_fc_final |>
  autoplot(atm4_test, level=NULL) +
  labs(title = "ATM4 Test Forecasets", x="Date", y='Cash (hundreds)') +
  theme_minimal()

# evaluate model performance on test set
knit_table(
left_join(
  glance(atm4_fit) |> select(.model, AIC, AICc, BIC), 
  atm4_fc_final |> accuracy(atm4_test) |> select(.model, RMSE, MAPE),
  by = join_by(.model)
  ) |>
  arrange(RMSE, AICc)
)


.model	AIC	AICc	BIC	RMSE	MAPE
ARIMA	2698.459	2698.641	2717.529	258.9685	857.8227
Transformed Additive ETS	3673.410	3674.089	3711.551	300.2535	1185.6613
SNAIVE	NA	NA	NA	300.8483	375.0538
Additive ETS	5856.301	5856.980	5894.443	306.8014	1242.6338

The testing results show that the ARIMA model is the best overall choice as it achieves the lowest AIC, AICc, and BIC, indicating superior model fit, and also provides the lowest RMSE. While the seasonal naïve model has a lower MAPE, this metric can be unstable in the presence of small values, making ARIMA the more robust and reliable forecasting model.

# view the selected model report
atm4_fit |>
  select(.model = "ARIMA") |>
  report()

## Series: CASH 
## Model: ARIMA(0,0,1)(2,0,0)[7] w/ mean 
## Transformation: box_cox(CASH, lambda) 
## 
## Coefficients:
##          ma1    sar1    sar2  constant
##       0.0844  0.1965  0.1890   17.6950
## s.e.  0.0542  0.0537  0.0546    0.7764
## 
## sigma^2 estimated as 180.7:  log likelihood=-1344.23
## AIC=2698.46   AICc=2698.64   BIC=2717.53

# fit the model on all the data
atm4_fit_final <- atm4 |>
  model("ARIMA" = ARIMA(box_cox(CASH,lambda) ~ pdq(0, 0, 1) + PDQ(2, 0, 0, period = 7)))
# forecast values for May 2010
atm4_fc_final <- atm4_fit_final |>
  forecast(h = 31)

The selected ARIMA model has the following components:

Non-seasonal part (0,0,1): Includes one moving average term, meaning the model uses the previous forecast error (shock) to improve predictions
Seasonal part (2,0,0): Includes two seasonal autoregressive terms with a seasonal period of 7, and a constant mean is included, indicating the series reverts to a non-zero

This model was fit to the entire ATM4 dataset.

atm4_fit_final |>
  select(.model = "ARIMA") |>
  gg_tsresiduals() +
  labs(
    title = "ARIMA(0,0,1)(2,0,0)[7] Model of ATM4 Transformed Residuals"
  )

# Applying the Ljung-Box test
augment(atm4_fit_final) |> features(.innov, ljung_box, lag=10)

## # A tibble: 1 × 3
##   .model lb_stat lb_pvalue
##   <chr>    <dbl>     <dbl>
## 1 ARIMA     16.0    0.0986

# Applying the Box-Pierce test
augment(atm4_fit_final) |> features(.innov, box_pierce, lag = 10)

## # A tibble: 1 × 3
##   .model bp_stat bp_pvalue
##   <chr>    <dbl>     <dbl>
## 1 ARIMA     15.6     0.111

While there is one autocorrelation lag value (10) slightly over the ACF confidence interval threshold, the residuals of the fit model appear be mostly white noise and normally distributed. This is supported. by the Ljung-Box and Box Pierce tests having p-values greater than 0.05, meaning the residuals are not distinguishable from white noise.

atm4_fc_final |>
  autoplot(atm4) +
  labs(title="ATM4 May 2010 Forecasts", subtitle = "ARIMA(0,0,1)(2,0,0)[7] Model", x="Date", y='Cash (hundreds)') +
  theme_minimal()

PART B – Forecasting Power

Prompt: Model the residential power usage data and a monthly forecast for 2014

Dataset: ResidentialCustomerForecastLoad-624.xlsx

Import and Visualize Data

Import the data from Github and view the raw data

# import data from github
github_url_b = 'https://github.com/okazevedo90/CUNY_GITHUB_submissions/raw/refs/heads/main/DATA624/project_1/ResidentialCustomerForecastLoad-624.xlsx'
data_b = import(github_url_b)

# view data head
knit_table(head(data_b), "View Raw Data")

View Raw Data
CaseSequence	YYYY-MMM	KWH
733	1998-Jan	6862583
734	1998-Feb	5838198
735	1998-Mar	5420658
736	1998-Apr	5010364
737	1998-May	4665377
738	1998-Jun	6467147

This dataset includes residential power usage from January 1998 until December 2013. There are 192 rows with power usage tracked at the monthly level.

The data consists of three columns:

CaseSequence: unique identifier for monthly residential power usage
YYYY-MM: Year and month of power consumption
KWH: power consumption in Kilowatt hours

Data Handling

power = data_b
# rename columns to all naming follows same casing
colnames(power) = c("CaseSequence", "YR_MONTH", "KWH")


skim_power = skim(power)
# |> select(YR_MONTH, KWH))
# yank(skim_power, "character")

skim_power |> summary()

Data summary
Name	power
Number of rows	192
Number of columns	3
_______________________
Column type frequency:
character	1
numeric	2
________________________
Group variables	None

# knit_table(yank(skim_df, "Date"), 'Column: YR_MONTH')
knit_table(yank(skim_power, "numeric"), 'Column: KWH')

Column: KWH
skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
CaseSequence	0	1.0000000	828.5	5.556978e+01	733	780.75	828.5	876.25	924	▇▇▇▇▇
KWH	1	0.9947917	6502474.6	1.447571e+06	770523	5429912.00	6283324.0	7620523.50	10655730	▁▁▇▅▁

knit_table(yank(skim_power, "character"), 'Column: YR_MONTH')

Column: YR_MONTH
skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
YR_MONTH	0	1	8	8	0	192	0

# transform into tsibble
power = power |>
  mutate(YR_MONTH = yearmonth(YR_MONTH)) |>
  as_tsibble(index = YR_MONTH)

Visualize the time series overtime

power |>
  autoplot(KWH) +
  labs(title="Residential Power Usage Time Series", x="Year Month", y="KWH (Kilowatt hours)") +
  theme_minimal()

Data Handling

Outliers

Identify outliers with the IQR method.

outlier_flag <- function(x) {
  # Calculate first and third quartiles
  Q1 <- quantile(x, 0.25, na.rm = TRUE)
  Q3 <- quantile(x, 0.75, na.rm = TRUE)
  # Calculate the Interquartile Range
  IQR_val <- IQR(x, na.rm = TRUE)

  # Define lower and upper bounds (fences)
  lower_bound <- Q1 - (IQR_val * 1.5)
  upper_bound <- Q3 + (IQR_val * 1.5)

  # Flag values outside the bounds as TRUE
  is_outlier <- (x < lower_bound) | (x > upper_bound)
  return(is_outlier)
}

power = power |>
  mutate(outlier_flag = outlier_flag(KWH))

knit_table(
  power |>
    filter(outlier_flag==TRUE),
  "Identified Outliers"
)

Identified Outliers
CaseSequence	YR_MONTH	KWH	outlier_flag
883	2010 Jul	770523	TRUE

# replace outlier with NA and impute
power = power |>
  mutate(KWH = if_else(outlier_flag == TRUE, NA_real_, KWH)) |>
  mutate(
    KWH = na.approx(KWH, na.rm = FALSE)
  )

power |>
  autoplot(KWH) +
  labs(title="Residential Power Usage Time Series", x="Year Month", y="KWH (Kilowatt hours)") +
  theme_minimal()

Modeling

power = power |>
  select(YR_MONTH, KWH)

# STL decomposition
power |>
  model(stl = STL(KWH ~ trend(window=Inf) + season(period=12, window="periodic"))) |>
  components() |>
  autoplot() +
  labs(title = "STL Decomposition - KWH")

power |>
  gg_subseries() +
  labs(
    y = "KWH",
    title = "Seasonal Subseries - KWH"
  )

## Warning: `gg_subseries()` was deprecated in feasts 0.4.2.
## ℹ Please use `ggtime::gg_subseries()` instead.
## This warning is displayed once per session.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

## Plot variable not specified, automatically selected `y = KWH`

power |>
  ACF(KWH) |>
  autoplot() +
  labs(title = "ACF Plot - KWH")

The time series for residential power usage appears to have an overall increasing trend with a more rapid increase starting around 2005. There also appears to be a strong yearly seasonal trend with as seen in the STL decomposition and the Seasonal subseries plot. At the begining of the year, power usage declines from January till May and then starts to steadily increase through the summer and then decrease through November. At the end of the year (December into January) power usage starts to increase again into the new year. Overall the trend in power usage seems to shift about every three months with the seasons. This is supported by the ACF plot which has high values every third lag (3, 6, 9, 12, 15, 18, 21).

To perform the model selection process, the data was split into a training and test set. The training set includes all data before January 2012 (about 88% of the data).

# split the data into test and train sets
power_train <- power |>
  filter(year(YR_MONTH) < 2012)
power_test <- power |>
  filter(year(YR_MONTH) >= 2012)

# define the box cox transformation lambda
lambda <- power |>
  features(KWH, features = guerrero) |>
  pull(lambda_guerrero)

# fit the models on the training data
power_fit = power_train |>
  model(
    "Additive ETS" = ETS(KWH ~ error("A") + trend("A") + season("A")),
    "Transformed Additive ETS" = ETS(box_cox(KWH,lambda) ~ error("A") + trend("A") + season("A")),
    "Transformed Multiplicative ETS" = ETS(box_cox(KWH,lambda) ~ error("M") + trend("A") + season("M")),
    "Multiplicative ETS" = ETS(KWH ~ error("M") + trend("A") + season("M")),
    "T Auto ETS" = ETS(box_cox(KWH,lambda)),
    "Transformed HW" = ETS(box_cox(KWH,lambda) ~ error("M") + trend("Ad") + season("M")),
    "Transformed Additive ETS" = ETS(box_cox(KWH,lambda) ~ error("A") + trend("N") + season("A")),
    "Transformed SNAIVE" = SNAIVE(box_cox(KWH,lambda)),
    "ARIMA" = ARIMA(KWH),
    "Transformed ARIMA" = ARIMA(box_cox(KWH,lambda))
  )

# view the test set forecasts 
power_fc_final = power_fit |>
  forecast(h = nrow(power_test))
power_fc_final |>
  autoplot(power_test, level=NULL)

# evaluate model performance on test set
knit_table(
left_join(
  glance(power_fit) |> select(.model, AIC, AICc, BIC), 
  power_fc_final |> accuracy(power_test) |> select(.model, RMSE, MAPE),
  by = join_by(.model)) |>
  arrange(AICc, RMSE)
)


.model	AIC	AICc	BIC	RMSE	MAPE
Transformed ARIMA	-1304.721	-1303.964	-1283.3716	772888.0	7.913079
Transformed Multiplicative ETS	-1009.378	-1005.298	-956.2709	1119967.0	13.075044
T Auto ETS	-1009.378	-1005.298	-956.2709	1119967.0	13.075044
Transformed HW	-1007.029	-1002.438	-950.7977	1064521.2	11.800401
Transformed Additive ETS	-1004.387	-1001.229	-957.5273	824731.4	9.000613
ARIMA	4612.814	4613.214	4628.0629	732388.6	7.019935
Multiplicative ETS	5343.893	5347.973	5397.0003	1043069.9	12.096480
Additive ETS	5361.507	5365.587	5414.6143	952109.5	11.238595
Transformed SNAIVE	NA	NA	NA	1012071.6	9.817695

The testing results show that the Transformed ARIMA model is the best overall model, as it achieves substantially lower AIC, AICc, and BIC values, indicating a superior fit to the data. While the untransformed ARIMA shows slightly better RMSE and MAPE, these metrics are not directly comparable due to differences in scale. The transformation improves model stability and better captures the underlying data structure.

# view the selected model report
power_fit |>
  select(.model = "Transformed ARIMA") |>
  report()

## Series: KWH 
## Model: ARIMA(0,0,3)(2,1,0)[12] w/ drift 
## Transformation: box_cox(KWH, lambda) 
## 
## Coefficients:
##          ma1     ma2     ma3     sar1     sar2  constant
##       0.2561  0.0869  0.2588  -0.7917  -0.4339    0.0013
## s.e.  0.0802  0.0809  0.0725   0.0820   0.0833    0.0005
## 
## sigma^2 estimated as 1.393e-05:  log likelihood=659.36
## AIC=-1304.72   AICc=-1303.96   BIC=-1283.37

# fit the model on all the data
power_fit_final <- power |>
  model("ARIMA" = ARIMA(box_cox(KWH,lambda) ~ pdq(0, 0,3) + PDQ(2, 1, 0, period = 12)))
# forecast values for May 2010
power_fc_final <- power_fit_final |>
  forecast(h = 12)

The selected ARIMA model has the following components:

Non-seasonal part (0,0,3): The model includes three moving average terms, using past forecast errors to predict current values.
Seasonal part (2,1,0): Two seasonal autoregressive terms are used, meaning the current value depends on the same month from the previous two years. One seasonal difference is applied with a period of 12.
The inclusion of drift means the model includes a constant term

This model was fit to the entire power dataset.

power_fit_final |>
  select(.model = "ARIMA") |>
  gg_tsresiduals() +
  labs(
    title = "ARIMA(0,0,3)(2,1,0)[12] Model of KWH Transformed residuals"
  )

# Applying the Ljung-Box test
augment(power_fit_final) |> features(.innov, ljung_box, lag=10)

## # A tibble: 1 × 3
##   .model lb_stat lb_pvalue
##   <chr>    <dbl>     <dbl>
## 1 ARIMA     12.4     0.258

# Applying the Box-Pierce test
augment(power_fit_final) |> features(.innov, box_pierce, lag = 10)

## # A tibble: 1 × 3
##   .model bp_stat bp_pvalue
##   <chr>    <dbl>     <dbl>
## 1 ARIMA     12.0     0.287

While there is one autocorrelation lag value (6) slightly over the ACF confidence interval threshold, the residuals of the fit model appear be mostly white noise and normally distributed. This is supported. by the Ljung-Box and Box Pierce tests having p-values greater than 0.05, meaning the residuals are not distinguishable from white noise.

power_fc_final |>
  autoplot(power) +
  labs(title="ATM2 May 2010 Forecasts", subtitle = "ARIMA(0,0,3)(2,1,0)[12] Model")

Export Results (Part A & B)

Export forecasts into an Excel readable file

# gather part a forecasts to output to excel
atm1_fc_xslx = atm1_fc_final |>
  as.data.frame() |>
  mutate(ATM1_Forecasts = .mean) |>
  select(DATE, ATM1_Forecasts)

atm2_fc_xslx = atm2_fc_final |>
  as.data.frame() |>
  mutate(ATM2_Forecasts = .mean) |>
  select(DATE, ATM2_Forecasts)

atm3_fc_xslx = atm3_fc_final |>
  as.data.frame() |>
  mutate(ATM3_Forecasts = .mean) |>
  select(DATE, ATM3_Forecasts)

atm4_fc_xslx = atm4_fc_final |>
  as.data.frame() |>
  mutate(ATM4_Forecasts = .mean) |>
  select(DATE, ATM4_Forecasts)

part_a_fc = left_join(
  left_join(
    left_join(
      atm1_fc_xslx, atm2_fc_xslx, by = join_by(DATE)),
    atm3_fc_xslx, by = join_by(DATE)),
  atm4_fc_xslx, by = join_by(DATE))

# gather part b forecasts to output to excel
part_b_fc = power_fc_final |>
  as.data.frame() |>
  mutate(
    Power_Forecasts = .mean,
    YR_MONTH = as.character(YR_MONTH)
    ) |>
  select(YR_MONTH, Power_Forecasts)

# view forecasts
knit_table(part_a_fc, 'Part A Forecasts')

Part A Forecasts
DATE	ATM1_Forecasts	ATM2_Forecasts	ATM3_Forecasts	ATM4_Forecasts
2010-05-01	92.128015	67.057749	87.66667	326.5391
2010-05-02	106.518327	73.066960	87.66667	433.4118
2010-05-03	78.907120	15.091609	87.66667	512.2953
2010-05-04	5.556122	8.836430	87.66667	241.1186
2010-05-05	106.163566	99.090250	87.66667	474.7541
2010-05-06	84.720401	89.863263	87.66667	351.1941
2010-05-07	91.309504	66.721573	87.66667	414.4358
2010-05-08	93.491214	67.301326	87.66667	295.0792
2010-05-09	107.162081	73.321607	87.66667	475.2268
2010-05-10	79.572741	15.602155	87.66667	465.2695
2010-05-11	5.726440	9.814201	87.66667	287.1311
2010-05-12	106.933940	99.309032	87.66667	446.4479
2010-05-13	85.409789	90.080118	87.66667	331.2565
2010-05-14	92.024818	66.943764	87.66667	460.4210
2010-05-15	94.212208	67.540347	87.66667	388.3167
2010-05-16	107.933298	73.565606	87.66667	450.5642
2010-05-17	80.238362	16.106499	87.66667	464.3841
2010-05-18	5.896759	10.778081	87.66667	365.0836
2010-05-19	107.704315	99.519491	87.66667	453.2276
2010-05-20	86.099177	90.291614	87.66667	402.3680
2010-05-21	92.740132	67.167822	87.66667	443.6369
2010-05-22	94.933202	67.776145	87.66667	400.3265
2010-05-23	108.704515	73.802141	87.66667	453.0699
2010-05-24	80.903984	16.606507	87.66667	453.8763
2010-05-25	6.067077	11.732015	87.66667	393.4039
2010-05-26	108.474690	99.724074	87.66667	447.7969
2010-05-27	86.788565	90.499379	87.66667	412.0704
2010-05-28	93.455446	67.393207	87.66667	448.6861
2010-05-29	95.654196	68.009653	87.66667	423.6912
2010-05-30	109.475732	74.033433	87.66667	447.7701
2010-05-31	81.569605	17.103467	87.66667	450.7500

knit_table(part_b_fc, 'Part B Forecasts')

Part B Forecasts
YR_MONTH	Power_Forecasts
2014 Jan	10393275
2014 Feb	8860839
2014 Mar	7064897
2014 Apr	5978216
2014 May	5938635
2014 Jun	8314016
2014 Jul	9623971
2014 Aug	10178785
2014 Sep	8541825
2014 Oct	5857955
2014 Nov	6142157
2014 Dec	8277646

# export
list_of_fc_dfs <- list("Part A Forecasts" = part_a_fc, "Part B Forecasts" = part_b_fc)
write_xlsx(list_of_fc_dfs, path = "DATA624_Project1_Forecasts_Azevedo_20260329.xlsx")

PART C – Bonus

Prompt:

Your optional assignment is to time-base sequence the data and aggregate based on hour
- Note for multiple recordings within an hour, take the mean
Determine if the data is stationary and can it be forecast
- If so, provide a week forward forecast and present results

Datasets:

Import Data

Import the data from Github and view the raw data

# import data from github
github_url_ec1 = 'https://github.com/okazevedo90/CUNY_GITHUB_submissions/raw/refs/heads/main/DATA624/project_1/Waterflow_Pipe1.xlsx'
github_url_ec2 = 'https://github.com/okazevedo90/CUNY_GITHUB_submissions/raw/refs/heads/main/DATA624/project_1/Waterflow_Pipe2.xlsx'
data_ec1 = import(github_url_ec1)
data_ec2 = import(github_url_ec2)

Data Handling

Time-base sequence the data and aggregate based on date and hour.

# aggregate data by date and hour
data_ec1 = data_ec1 |>
  mutate(
    Hour = hour(`Date Time`),
    Date = date(`Date Time`),
    ) |>
  mutate(DateTime = ymd(Date) + hours(Hour))
data_ec1_agg = data_ec1 |>
  group_by(DateTime) |>
  summarise(WaterFlow = mean(WaterFlow)) |>
  as_tsibble(index=DateTime)

# aggregate data by date and hour
data_ec2 = data_ec2 |>
  mutate(
    Hour = hour(`Date Time`),
    Date = date(`Date Time`),
    ) |>
  mutate(DateTime = ymd(Date) + hours(Hour))
data_ec2_agg = data_ec2 |>
  group_by(DateTime) |>
  summarise(WaterFlow = mean(WaterFlow)) |>
  as_tsibble(index=DateTime)

Visualize Data

data_ec1_agg |>
  autoplot(WaterFlow) +
  labs(title='Pipe 1 Time Series by Hour', x='Date Time', y='Water Flow')

data_ec2_agg |>
  autoplot(WaterFlow) +
  labs(title='Pipe 2 Time Series by Hour', x='Date Time', y='Water Flow')

Stationary

knit_table(
  data_ec1_agg |>features(WaterFlow, unitroot_kpss),
  "Pipe 1 KPSS Test"
)

Pipe 1 KPSS Test
kpss_stat	kpss_pvalue
0.2465587	0.1

knit_table(
  data_ec2_agg |>features(WaterFlow, unitroot_kpss),
  "Pipe 2 KPSS Test"
)

Pipe 2 KPSS Test
kpss_stat	kpss_pvalue
0.1053662	0.1

To test if the time series are stationary, a the KPSS unit root test was performed on each series. Both the Pipe 1 and Pipe 2 time series had p-values of 0.1. Since this is greater than 0.05 we fail to reject the null hypothesis; thus, both of the series are stationary.

Due to time constraints, this was the extent of what I was able to get done for the bonus section.

DATA624 Project 1

Olivia Azevedo

2026-03-29

PART A – ATM Forecast

Import and Visualize Data

Data Handling

Missing Values

Little’s MCAR Test

Imputation

Outliers

Modeling

ATM1

ATM2

ATM3

ATM4

PART B – Forecasting Power

Import and Visualize Data

Data Handling

Data Handling

Outliers

Modeling

Export Results (Part A & B)

PART C – Bonus

Import Data

Data Handling

Visualize Data

Stationary