Prompt:
Cash is provided in hundreds of
dollarsDataset: ATM624Data.xlsx
Import the data from Github and view the raw data
# import data from github
github_url = "https://github.com/okazevedo90/CUNY_GITHUB_submissions/raw/refs/heads/main/DATA624/project_1/ATM624Data.xlsx"
data = import(github_url)
# view data head
knit_table(head(data), "View Raw Data")
| DATE | ATM | Cash |
|---|---|---|
| 2009-05-01 | ATM1 | 96 |
| 2009-05-01 | ATM2 | 107 |
| 2009-05-02 | ATM1 | 82 |
| 2009-05-02 | ATM2 | 89 |
| 2009-05-03 | ATM1 | 85 |
| 2009-05-03 | ATM2 | 90 |
The data consists of three columns:
Visualize the time series overtime by ATM
group_colors <- c("ATM1" = "#C77CFF", "ATM2" = "darkolivegreen3", "ATM3" = "coral1", "ATM4" = "#00BFC4")
all_cash_plot = data |>
ggplot(aes(x=DATE, y=Cash, color=ATM)) +
geom_line() +
scale_color_manual(values = group_colors) +
scale_x_date(
date_breaks = "2 month", # Set breaks at 2-month intervals
date_labels = "%b %y"
) +
labs(title="Withdrawn Cash by ATM Overtime", x='Date', y='Cash (hundreds)') +
theme_minimal()
all_cash_plot
all_cash_plot +
facet_grid(ATM~., scales = "free_y") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
View dataset summary metrics
# transform date column to date type
df = data |>
mutate(DATE = as.Date(DATE)) |>
arrange(DATE, ATM)
# rename columns to all naming follows same casing
colnames(df) = c("DATE", "ATM", "CASH")
skim_df = skim(df)
skim_df |>
summary()
| Name | df |
| Number of rows | 1474 |
| Number of columns | 3 |
| _______________________ | |
| Column type frequency: | |
| character | 1 |
| Date | 1 |
| numeric | 1 |
| ________________________ | |
| Group variables | None |
knit_table(yank(skim_df, "Date"), 'Column: DATE')
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| DATE | 0 | 1 | 2009-05-01 | 2010-05-14 | 2009-11-01 | 379 |
knit_table(yank(skim_df, "character"), 'Column: ATM')
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| ATM | 14 | 0.990502 | 4 | 4 | 0 | 4 | 0 |
knit_table(yank(skim_df, "numeric"), 'Column: CASH')
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| CASH | 19 | 0.9871099 | 155.582 | 376.4568 | 0 | 0.5 | 73 | 114 | 10919.76 | ▇▁▁▁▁ |
As shown in the data summary metrics within the tables above, there
are 1,474 total rows in the dataset; however, 14 of these rows (5/1/10 -
5/14/10) have null ATM and CASH values. These
dates represent the dates being forecasted and will be removed for
modeling.
# create subset df for null dates that will be forecasted
df_na_dates = df |> filter(is.na(ATM))
# filter out null dates from data
df = df |> filter(!is.na(ATM))
knit_table(df_na_dates, 'Dates to be Forecasted: Rows with no ATM or CASH values')
| DATE | ATM | CASH |
|---|---|---|
| 2010-05-01 | NA | NA |
| 2010-05-02 | NA | NA |
| 2010-05-03 | NA | NA |
| 2010-05-04 | NA | NA |
| 2010-05-05 | NA | NA |
| 2010-05-06 | NA | NA |
| 2010-05-07 | NA | NA |
| 2010-05-08 | NA | NA |
| 2010-05-09 | NA | NA |
| 2010-05-10 | NA | NA |
| 2010-05-11 | NA | NA |
| 2010-05-12 | NA | NA |
| 2010-05-13 | NA | NA |
| 2010-05-14 | NA | NA |
As seen in the CASH and DATE column summary
metrics below grouped for by ATM, after removing those 14
rows, five null CASH values still remain. Three of these
values are associated with ATM1 and 2 of them are associated with ATM2.
All five null values occur in June 2009.
skim_atm = df |>
group_by(ATM) |>
skim(DATE, CASH)
knit_table(yank(skim_atm, "Date"), 'Column: DATE - Grouped by ATM')
| skim_variable | ATM | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|---|
| DATE | ATM1 | 0 | 1 | 2009-05-01 | 2010-04-30 | 2009-10-30 | 365 |
| DATE | ATM2 | 0 | 1 | 2009-05-01 | 2010-04-30 | 2009-10-30 | 365 |
| DATE | ATM3 | 0 | 1 | 2009-05-01 | 2010-04-30 | 2009-10-30 | 365 |
| DATE | ATM4 | 0 | 1 | 2009-05-01 | 2010-04-30 | 2009-10-30 | 365 |
knit_table(yank(skim_atm, "numeric"), 'Column: CASH - Grouped by ATM')
| skim_variable | ATM | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|---|
| CASH | ATM1 | 3 | 0.9917808 | 83.8867403 | 36.656539 | 1.00000 | 73.0000 | 91.0000 | 108.000 | 180.00 | ▂▁▇▃▁ |
| CASH | ATM2 | 2 | 0.9945205 | 62.5785124 | 38.898455 | 0.00000 | 25.5000 | 67.0000 | 93.000 | 147.00 | ▇▅▇▇▂ |
| CASH | ATM3 | 0 | 1.0000000 | 0.7205479 | 7.944778 | 0.00000 | 0.0000 | 0.0000 | 0.000 | 96.00 | ▇▁▁▁▁ |
| CASH | ATM4 | 0 | 1.0000000 | 474.0433449 | 650.935992 | 1.56326 | 124.3344 | 403.8393 | 704.507 | 10919.76 | ▇▁▁▁▁ |
The missing June 2009 cash values are highlighted in the the time series data visualization below for both ATM1 and ATM2. The missing values for both ATMs occur in short, localized intervals.
# view data with cash null values
null_df = df |> filter(is.na(CASH))
# create df filtered to missing value time range
na_filt_df = df |>
filter(
ATM %in% c('ATM1', 'ATM2'),
DATE >= as.Date('2009-05-01'),
DATE < as.Date('2009-08-01'),
)
# create df to be used to create the timerange missing value highlights
rects = null_df |>
mutate(
END_DATE = DATE + 1,
START_DATE = DATE - 1,)
# plot and highlight missing values in time series
na_filt_df |>
ggplot(aes(x=DATE, y=CASH)) +
geom_line() +
facet_wrap(~ATM, scales = "free_y") +
geom_rect(data=rects,
aes(xmin=START_DATE, xmax=END_DATE, ymin=0, ymax=max(na_filt_df$CASH, na.rm = TRUE),
group=ATM, fill="Null Date Range"), alpha=0.4) +
scale_fill_manual(values="orange", labels=c("Null Date Range")) +
labs(title="ATM1 & ATM2 Null Cash Value Date Ranges (May-August 2009)", x='Date', y='Cash (hundreds)', fill=NULL) +
theme_minimal() +
theme(legend.position="bottom") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
To identify the best type of method to handle the missing
CASH values, it is beneficial to first identify what type
of missing data is present.
Little’s MCAR test was performed and resulted in a p-value less than 0.05; thus, the data is not Missing Completely at Random (MCAR). This suggests the missing values could be Missing at Random (MAR).
knit_table(as.data.frame(mcar_test(df)), 'Little’s MCAR Test Results')
| statistic | df | p.value | missing.patterns |
|---|---|---|---|
| 12.89875 | 2 | 0.0015815 | 2 |
Given that the missing values are potentially consistent with MAR qualities in and occur within a continuous time series, linear interpolation provides an unbiased and structurally consistent method for imputation by assuming that the missing value lies between neighboring observations in a smooth way. This approach preserves local trends, maintains temporal continuity, and avoids introducing artificial distortions, making it the most appropriate approach for this dataset.
From the business perspective, ATM withdrawals typically change incrementally, not abruptly. Linear interpolation reflects realistic customer behavior between days.
Lastly, linear interpolation was applied consistently across ATM1 and ATM2. While ATM1’s stable behavior makes interpolation highly accurate, ATM2’s slightly higher variability introduces minor smoothing effects. However, given that missingness is consistent with MAR and gaps are short, linear interpolation remains an appropriate and unbiased method for both series.
The plot below displays the time series plot with the imputed values using linear interpolation.
df = df |>
group_by(ATM) |>
mutate(CASH = na.approx(CASH, maxgap = 4, rule = 2)) |>
ungroup()
df |> filter(
ATM %in% c('ATM1', 'ATM2'),
DATE >= as.Date('2009-05-01'),
DATE < as.Date('2009-08-01'),
) |>
ggplot(aes(x=DATE, y=CASH)) +
geom_line() +
facet_wrap(~ATM, scales = "free_y") +
geom_rect(data=rects,
aes(xmin=START_DATE, xmax=END_DATE, ymin=0, ymax=max(na_filt_df$CASH, na.rm = TRUE),
group=ATM, fill="Imputed Date Range"), alpha=0.4) +
scale_fill_manual(values="orange", labels=c("Imputated Date Range")) +
labs(title="ATM1 & ATM2 Imputed Cash Value Date Ranges (May-August 2009)", x='Date', y='Cash (hundreds)', fill=NULL) +
theme_minimal() +
theme(legend.position="bottom") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
As seen in the time series plot, there is an extreme outlier of $1,091,976.16 associated with ATM4 on 2/9/2010 that is skewing the time series. This is apparent in the ATM4 time series plot and is also supported by the outlier remaining prominent in the remainder of the STL Decomposition. Due to this, this outlier will be replaced with NA and treated as missing value as it does not reflect the underlying demand pattern and would otherwise bias forecast estimates. Consistent with the principles discussed in the Forecasting: Principles and Practice textbook, the goal is to model typical behavior rather than one-time anomalies, ensuring that forecasts remain accurate and operationally meaningful.
# subset data to only atm4
atm4 = df |>
filter(ATM == "ATM4") |>
as_tsibble(index = DATE)
# plot time series
atm4 |>
autoplot(CASH) +
labs(title = "ATM4 Time Series", x="Date", y='Cash (hundreds)') +
theme_minimal()
# STL decomposition
atm4 |>
model(stl = STL(CASH)) |>
components() |>
autoplot() +
labs(title = "STL Decomposition - ATM4", x="Date", y='Cash (hundreds)') +
theme_minimal()
After being replaced with NA. values, linear interpolation was applied to impute a value consistent with neighboring observations. This reconstructs a value consistent with the local level of the series and preserves both continuity and statistical integrity.
# replace outlier with NA and impute
df = df |>
mutate(CASH = if_else(CASH < 10919, CASH, NA_real_)) |>
mutate(CASH = na.approx(CASH, na.rm = FALSE))
The plot below displays the updated time series with the outlier removed and imputed with linear interpolation.
df |>
ggplot(aes(x=DATE, y=CASH, color=ATM)) +
geom_line() +
scale_color_manual(values = group_colors) +
scale_x_date(
date_breaks = "2 month", # Set breaks at 2-month intervals
date_labels = "%b %y"
) +
labs(title="Withdrawn Cash by ATM Overtime", x='Date', y='Cash ($USD, Hundreds)') +
theme_minimal()
To determine the best modeling approach for forecasting ATM1 values, we first need to review the trends and features of the time series.
# subset data to only atm1
atm1 = df |>
filter(ATM == "ATM1") |>
as_tsibble(index = DATE)
# plot time series
atm1 |>
autoplot(CASH) +
labs(title = "ATM1 Time Series", x="Date", y='Cash (hundreds)') +
theme_minimal()
# STL decomposition
atm1 |>
model(stl = STL(CASH)) |>
components() |>
autoplot() +
labs(title = "STL Decomposition - ATM1", x="Date", y='Cash (hundreds)') +
theme_minimal()
atm1 |>
ACF(CASH) |>
autoplot() +
labs(title = "ACF Plot - ATM1") +
theme_minimal()
There does not appear to be any overall trend in the time series for ATM1. There does appear to be a weekly seasonal pattern as supported by the seasonal plot of the STL decomposition and the ACF plot which has high values at lags 7, 14, and 21. ATM1 withdraws follow a weekly seasonal pattern with high withdraws at the beginning of the week followed by a sharp decline.
To perform the model selection process, the data was split into a training and test sets. The training set includes all data before April 2010 (about 92% of the data).
Based on the features of the ATM1 time series, the following models were considered:
# split the data into test and train sets
atm1_train <- atm1 |>
filter(DATE < as.Date("2010-04-01"))
atm1_test <- atm1 |>
filter(DATE >= as.Date("2010-04-01"))
# define the box cox transformation lambda
lambda <- atm1 |>
features(CASH, features = guerrero) |>
pull(lambda_guerrero)
# fit the models on the training data
atm1_fit = atm1_train |>
model(
"Additive ETS" = ETS(CASH ~ error("A") + trend("N") + season("A")),
"Auto ETS" = ETS(CASH),
"SNAIVE" = SNAIVE(CASH),
"Transformed Additive ETS" = ETS(box_cox(CASH,lambda) ~ error("A") + trend("N") + season("A")),
"ARIMA" = ARIMA(box_cox(CASH,lambda))
)
# view the test set forecasts
atm1_fc_final = atm1_fit |>
forecast(h = nrow(atm1_test))
atm1_fc_final |>
autoplot(atm1_test, level=NULL) +
labs(title = "ATM1 Test Forecasets", x="Date", y='Cash (hundreds)') +
theme_minimal()
# evaluate model performance on test set
knit_table(
left_join(
glance(atm1_fit) |> select(.model, AIC, AICc, BIC),
atm1_fc_final |> accuracy(atm1_test) |> select(.model, RMSE, MAPE),
by = join_by(.model)
) |>
arrange(RMSE, AICc)
)
| .model | AIC | AICc | BIC | RMSE | MAPE |
|---|---|---|---|---|---|
| Additive ETS | 4114.365 | 4115.044 | 4152.506 | 11.61487 | 64.73164 |
| Auto ETS | 4114.365 | 4115.044 | 4152.506 | 11.61487 | 64.73164 |
| ARIMA | 1150.052 | 1150.176 | 1165.224 | 14.21138 | 45.64292 |
| Transformed Additive ETS | 2181.670 | 2182.349 | 2219.811 | 14.41856 | 46.02107 |
| SNAIVE | NA | NA | NA | 16.03746 | 73.39914 |
The testing results show that the Additive ETS model and Auto ETS model have the same metrics which suggests the additive ETS model was chosen as the best performing ETS model. In addition, the additive ETS model has the lowest RMSE, followed by the the ARIMA model and transformed additive ETS. Overall, the difference in RMSE value between these model is minimal. The ARIMA model has a significantly lower AICc value and a lower MAPE than the other models. Due to this, the ARIMA model is the best if the best fit for the ATM1 time series.
# view the selected model report
atm1_fit |>
select(.model = "ARIMA") |>
report()
## Series: CASH
## Model: ARIMA(0,0,2)(0,1,1)[7]
## Transformation: box_cox(CASH, lambda)
##
## Coefficients:
## ma1 ma2 sma1
## 0.1139 -0.1197 -0.6512
## s.e. 0.0547 0.0549 0.0473
##
## sigma^2 estimated as 1.899: log likelihood=-571.03
## AIC=1150.05 AICc=1150.18 BIC=1165.22
# fit the model on all the data
atm1_fit_final <- atm1 |>
model("ARIMA" = ARIMA(box_cox(CASH,lambda) ~ pdq(0, 0, 2) + PDQ(0, 1, 1, period = 7)))
# forecast values for May 2010
atm1_fc_final <- atm1_fit_final |>
forecast(h = 31)
The selected ARIMA model has the following components:
This model was fit to the entire ATM1 dataset.
atm1_fit_final |>
select(.model = "ARIMA") |>
gg_tsresiduals() +
labs(
title = "ARIMA(0,0,2)(0,1,1)[7] Model of ATM1 Transformed CASH residuals", x="Date", y='Cash (hundreds)') +
theme_minimal()
## Warning: `gg_tsresiduals()` was deprecated in feasts 0.4.2.
## ℹ Please use `ggtime::gg_tsresiduals()` instead.
## This warning is displayed once per session.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
# Applying the Ljung-Box test
augment(atm1_fit_final) |> features(.innov, ljung_box, lag=10)
## # A tibble: 1 × 3
## .model lb_stat lb_pvalue
## <chr> <dbl> <dbl>
## 1 ARIMA 6.50 0.772
# Applying the Box-Pierce test
augment(atm1_fit_final) |> features(.innov, box_pierce, lag = 10)
## # A tibble: 1 × 3
## .model bp_stat bp_pvalue
## <chr> <dbl> <dbl>
## 1 ARIMA 6.37 0.784
The residuals of the fit model appear be white noise and normally distributed with no autocorrelations exceed the confidence interval limit. This is supported by the Ljung-Box and Box Pierce tests having p-values greater than 0.05, meaning the residuals are not distinguishable from white noise.
atm1_fc_final |>
autoplot(atm1) +
labs(title="ATM1 May 2010 Forecasts", subtitle = "ARIMA(0,0,2)(0,1,1)[7] Model") +
theme_minimal()
# subset data to only atm2
atm2 = df |>
filter(ATM == "ATM2") |>
as_tsibble(index = DATE)
# plot time series
atm2 |>
autoplot(CASH) +
labs(title = "ATM2 Time Series", x="Date", y='Cash (hundreds)') +
theme_minimal()
# STL decomposition
atm2 |>
model(stl = STL(CASH)) |>
components() |>
autoplot() +
labs(title = "STL Decomposition - ATM2", x="Date", y='Cash (hundreds)') +
theme_minimal()
atm2 |>
ACF(CASH) |>
autoplot() +
labs(title = "ACF Plot - ATM2")
The time series for ATM2 appears to be very similar to ATM1 in that there does not appear to be a strong overall trend and there appears to be a weekly seasonal pattern. This is supported by the decomposition and the ACF plot which has high values at lags 7, 14, and 21. ATM2 withdraws also follow a weekly seasonal pattern have high withdraws at the beginning of the week followed by a sharp decline.
To perform the model selection process, the data was split into a training and test set. The training set includes all data before April 2010 (about 92% of the data).
Based on the features of the ATM2 time series, the following models were considered:
# split the data into test and train sets
atm2_train <- atm2 |>
filter(DATE < as.Date("2010-04-01"))
atm2_test <- atm2 |>
filter(DATE >= as.Date("2010-04-01"))
# define the box cox transformation lambda
lambda <- atm2 |>
features(CASH, features = guerrero) |>
pull(lambda_guerrero)
# fit the models on the training data
atm2_fit = atm2_train |>
model(
"Additive ETS" = ETS(CASH ~ error("A") + trend("N") + season("A")),
"SNAIVE" = SNAIVE(CASH),
"Transformed Additive ETS" = ETS(box_cox(CASH,lambda) ~ error("A") + trend("N") + season("A")),
"ARIMA" = ARIMA(box_cox(CASH,lambda))
)
# view the test set forecasts
atm2_fc_final = atm2_fit |>
forecast(h = nrow(atm2_test))
atm2_fc_final |>
autoplot(atm2_test, level=NULL)+
labs(title = "ATM2 Test Forecasets", x="Date", y='Cash (hundreds)') +
theme_minimal()
# evaluate model performance on test set
knit_table(
left_join(
glance(atm2_fit) |> select(.model, AIC, AICc, BIC),
atm2_fc_final |> accuracy(atm2_test) |> select(.model, RMSE, MAPE, MASE),
by = join_by(.model)
) |>
arrange(RMSE, AICc)
)
| .model | AIC | AICc | BIC | RMSE | MAPE | MASE |
|---|---|---|---|---|---|---|
| Additive ETS | 4141.324 | 4142.003 | 4179.465 | 19.37845 | 57.41915 | NaN |
| Transformed Additive ETS | 3410.339 | 3411.018 | 3448.481 | 19.66788 | 118.82328 | NaN |
| ARIMA | 2341.451 | 2341.713 | 2364.210 | 21.16529 | 119.83911 | NaN |
| SNAIVE | NA | NA | NA | 26.16231 | 45.13794 | NaN |
The testing results show that the ARIMA model has the lowest AIC, AICc, and BIC, indicating superior model fit. While ETS provides a slightly lower RMSE and SNAIVE has a lower MAPE, ARIMA offers the best balance between fit and forecasting performance. Thus, the ARIMA model will be used to forecast May 2010 values.
# view the selected model report
atm2_fit |>
select(.model = "ARIMA") |>
report()
## Series: CASH
## Model: ARIMA(2,0,2)(0,1,1)[7]
## Transformation: box_cox(CASH, lambda)
##
## Coefficients:
## ar1 ar2 ma1 ma2 sma1
## -0.4251 -0.9187 0.4732 0.8028 -0.7474
## s.e. 0.0511 0.0441 0.0819 0.0578 0.0460
##
## sigma^2 estimated as 71.19: log likelihood=-1164.73
## AIC=2341.45 AICc=2341.71 BIC=2364.21
# fit the model on all the data
atm2_fit_final <- atm2 |>
model("ARIMA" = ARIMA(box_cox(CASH,lambda) ~ pdq(2, 0, 2) + PDQ(0, 1, 1, period = 7)))
# forecast values for May 2010
atm2_fc_final <- atm2_fit_final |>
forecast(h = 31)
The selected ARIMA model has the following components:
This model was fit to the entire ATM2 dataset.
atm2_fit_final |>
select(.model = "ARIMA") |>
gg_tsresiduals() +
labs(
title = "ARIMA(2,0,2)(0,1,1)[7] Model of ATM2 Transformed CASH residuals"
)
# Applying the Ljung-Box test
augment(atm2_fit_final) |> features(.innov, ljung_box, lag=10)
## # A tibble: 1 × 3
## .model lb_stat lb_pvalue
## <chr> <dbl> <dbl>
## 1 ARIMA 9.06 0.527
# Applying the Box-Pierce test
augment(atm2_fit_final) |> features(.innov, box_pierce, lag = 10)
## # A tibble: 1 × 3
## .model bp_stat bp_pvalue
## <chr> <dbl> <dbl>
## 1 ARIMA 8.88 0.544
While there are a few autocorrelation lag values (5 and 25) that exceed the acf confidence intervals, the residuals of the fit model appear be mostly white noise and normally distributed. This is supported by the Ljung-Box and Box Pierce tests having p-values greater than 0.05, meaning the residuals are not distinguishable from white noise.
atm2_fc_final |>
autoplot(atm2) +
labs(title="ATM2 May 2010 Forecasts", subtitle = "ARIMA(2,0,2)(0,1,1)[7] Model", x="Date", y='Cash (hundreds)') +
theme_minimal()
# subset data to only atm3
atm3 = df |>
filter(ATM == "ATM3") |>
as_tsibble(index = DATE)
# plot time series
atm3 |>
autoplot(CASH) +
labs(title = "ATM3 Time Series", x="Date", y='Cash (hundreds)') +
theme_minimal()
atm3_data = atm3 |>
filter(CASH > 0)
atm3_data |>
autoplot(CASH) +
labs(title = "ATM3 Time Series - Greater than $0", x="Date", y='Cash (hundreds)') +
theme_minimal()
# STL decomposition
atm3 |>
model(stl = STL(CASH)) |>
components() |>
autoplot() +
labs(title = "STL Decomposition - ATM3", x="Date", y='Cash (hundreds)') +
theme_minimal()
atm3 |>
ACF(CASH) |>
autoplot() +
labs(title = "ACF Plot - ATM3")
The time series for ATM3 only has withdraw cash values greater than $0 for the most recent three days. Due to this, a simpler model approach will be needed due to not having adequate historical data to develop a more sophisticated model.
Based on the features of the ATM3 time series, the following models were considered: - Mean - Native
atm3_fit = atm3_data |>
model(
"Mean" = MEAN(CASH),
"Naïve" = NAIVE(CASH),
)
atm3_fcs = atm3_fit |>
forecast(h = 31)
atm3_fcs |>
autoplot(atm3_data, level=NULL) +
labs(title='ATM3 Cash Forecasts', x="Date", y='Cash (hundreds)') +
theme_minimal()
atm3_fcs |>
autoplot(atm3, level=NULL) +
labs(title='ATM3 Cash Forecasts', x="Date", y='Cash (hundreds)') +
theme_minimal()
accuracy(atm3_fit)
## # A tibble: 2 × 10
## .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Mean Training -4.74e-15 6.02 5.56 -0.456 6.24 NaN NaN -0.296
## 2 Naïve Training -5.5 e+ 0 10.1 8.5 -6.77 10.3 NaN NaN -0.5
atm3_fit |>
forecast(h = 1) |>
hilo(level = 95)
## # A tsibble: 2 x 5 [1D]
## # Key: .model [2]
## .model DATE
## <chr> <date>
## 1 Mean 2010-05-01
## 2 Naïve 2010-05-01
## # ℹ 3 more variables: CASH <dist>, .mean <dbl>, `95%` <hilo>
95% Confidence Interval: - Mean: [70.98457, 104.3488], Range = 33.36 - Native: [65.15688, 104.8431], Range = 39.69
Due to the MEAN() model method having a smaller 95% confidence interval range than the NATIVE() model for the three data points available, the MEAN() forecasting approach will be used to forecast values for May 2010.
# view the selected model report
atm3_fit_final = atm3_fit |>
select(.model = "Mean")
# forecast values for May 2010
atm3_fc_final <- atm3_fit_final |>
forecast(h = 31)
atm3_fit_final |>
gg_tsresiduals() +
labs(
title = "Mean Model of ATM3 Residuals"
)
No conclusions can be made on the model fit based upon the residuals due to the lack of available data.
atm3_fc_final |>
autoplot(atm3) +
labs(title="ATM3 May 2010 Forecasts", subtitle = "Mean Model", x="Date", y='Cash (hundreds)') +
theme_minimal()
# subset data to only atm4
atm4 = df |>
filter(ATM == "ATM4") |>
as_tsibble(index = DATE)
# plot time series
atm4 |>
autoplot(CASH) +
labs(title = "ATM4 Time Series", x="Date", y='Cash (hundreds)') +
theme_minimal()
# STL decomposition
atm4 |>
model(stl = STL(CASH)) |>
components() |>
autoplot() +
labs(title = "STL Decomposition - ATM4", x="Date", y='Cash (hundreds)') +
theme_minimal()
atm4 |>
ACF(CASH) |>
autoplot() +
labs(title = "ACF Plot - ATM4")
Here we can see that this data doesn’t really seem to have much trend and is highly seasonal with a seasonal window of one week, just as we saw with ATM1. With that, we can follow a similar process as we did with ATM1 here:
The time series for ATM4 similar to ATM1 in that there does not appear to be a strong overall trend and there appears to be a weekly seasonal pattern. This is supported by the STL decomposition and the ACF plot which has high values at lags 7, 14, and 21.
To perform the model selection process, the data was split into a training and test set. The training set includes all data before April 2010 (about 92% of the data).
Based on the features of the ATM4 time series, the following models were considered:
# split the data into test and train sets
atm4_train <- atm4 |>
filter(DATE < as.Date("2010-04-01"))
atm4_test <- atm4 |>
filter(DATE >= as.Date("2010-04-01"))
# define the box cox transformation lambda
lambda <- atm4 |>
features(CASH, features = guerrero) |>
pull(lambda_guerrero)
# fit the models on the training data
atm4_fit = atm4_train |>
model(
"Additive ETS" = ETS(CASH ~ error("A") + trend("N") + season("A")),
"SNAIVE" = SNAIVE(CASH),
"Transformed Additive ETS" = ETS(box_cox(CASH,lambda) ~ error("A") + trend("N") + season("A")),
"ARIMA" = ARIMA(box_cox(CASH,lambda))
)
# view the test set forecasts
atm4_fc_final = atm4_fit |>
forecast(h = nrow(atm4_test))
atm4_fc_final |>
autoplot(atm4_test, level=NULL) +
labs(title = "ATM4 Test Forecasets", x="Date", y='Cash (hundreds)') +
theme_minimal()
# evaluate model performance on test set
knit_table(
left_join(
glance(atm4_fit) |> select(.model, AIC, AICc, BIC),
atm4_fc_final |> accuracy(atm4_test) |> select(.model, RMSE, MAPE),
by = join_by(.model)
) |>
arrange(RMSE, AICc)
)
| .model | AIC | AICc | BIC | RMSE | MAPE |
|---|---|---|---|---|---|
| ARIMA | 2698.459 | 2698.641 | 2717.529 | 258.9685 | 857.8227 |
| Transformed Additive ETS | 3673.410 | 3674.089 | 3711.551 | 300.2535 | 1185.6613 |
| SNAIVE | NA | NA | NA | 300.8483 | 375.0538 |
| Additive ETS | 5856.301 | 5856.980 | 5894.443 | 306.8014 | 1242.6338 |
The testing results show that the ARIMA model is the best overall choice as it achieves the lowest AIC, AICc, and BIC, indicating superior model fit, and also provides the lowest RMSE. While the seasonal naïve model has a lower MAPE, this metric can be unstable in the presence of small values, making ARIMA the more robust and reliable forecasting model.
# view the selected model report
atm4_fit |>
select(.model = "ARIMA") |>
report()
## Series: CASH
## Model: ARIMA(0,0,1)(2,0,0)[7] w/ mean
## Transformation: box_cox(CASH, lambda)
##
## Coefficients:
## ma1 sar1 sar2 constant
## 0.0844 0.1965 0.1890 17.6950
## s.e. 0.0542 0.0537 0.0546 0.7764
##
## sigma^2 estimated as 180.7: log likelihood=-1344.23
## AIC=2698.46 AICc=2698.64 BIC=2717.53
# fit the model on all the data
atm4_fit_final <- atm4 |>
model("ARIMA" = ARIMA(box_cox(CASH,lambda) ~ pdq(0, 0, 1) + PDQ(2, 0, 0, period = 7)))
# forecast values for May 2010
atm4_fc_final <- atm4_fit_final |>
forecast(h = 31)
The selected ARIMA model has the following components:
This model was fit to the entire ATM4 dataset.
atm4_fit_final |>
select(.model = "ARIMA") |>
gg_tsresiduals() +
labs(
title = "ARIMA(0,0,1)(2,0,0)[7] Model of ATM4 Transformed Residuals"
)
# Applying the Ljung-Box test
augment(atm4_fit_final) |> features(.innov, ljung_box, lag=10)
## # A tibble: 1 × 3
## .model lb_stat lb_pvalue
## <chr> <dbl> <dbl>
## 1 ARIMA 16.0 0.0986
# Applying the Box-Pierce test
augment(atm4_fit_final) |> features(.innov, box_pierce, lag = 10)
## # A tibble: 1 × 3
## .model bp_stat bp_pvalue
## <chr> <dbl> <dbl>
## 1 ARIMA 15.6 0.111
While there is one autocorrelation lag value (10) slightly over the ACF confidence interval threshold, the residuals of the fit model appear be mostly white noise and normally distributed. This is supported. by the Ljung-Box and Box Pierce tests having p-values greater than 0.05, meaning the residuals are not distinguishable from white noise.
atm4_fc_final |>
autoplot(atm4) +
labs(title="ATM4 May 2010 Forecasts", subtitle = "ARIMA(0,0,1)(2,0,0)[7] Model", x="Date", y='Cash (hundreds)') +
theme_minimal()
Prompt: Model the residential power usage data and a monthly forecast for 2014
Dataset: ResidentialCustomerForecastLoad-624.xlsx
Import the data from Github and view the raw data
# import data from github
github_url_b = 'https://github.com/okazevedo90/CUNY_GITHUB_submissions/raw/refs/heads/main/DATA624/project_1/ResidentialCustomerForecastLoad-624.xlsx'
data_b = import(github_url_b)
# view data head
knit_table(head(data_b), "View Raw Data")
| CaseSequence | YYYY-MMM | KWH |
|---|---|---|
| 733 | 1998-Jan | 6862583 |
| 734 | 1998-Feb | 5838198 |
| 735 | 1998-Mar | 5420658 |
| 736 | 1998-Apr | 5010364 |
| 737 | 1998-May | 4665377 |
| 738 | 1998-Jun | 6467147 |
This dataset includes residential power usage from January 1998 until December 2013. There are 192 rows with power usage tracked at the monthly level.
The data consists of three columns:
power = data_b
# rename columns to all naming follows same casing
colnames(power) = c("CaseSequence", "YR_MONTH", "KWH")
skim_power = skim(power)
# |> select(YR_MONTH, KWH))
# yank(skim_power, "character")
skim_power |> summary()
| Name | power |
| Number of rows | 192 |
| Number of columns | 3 |
| _______________________ | |
| Column type frequency: | |
| character | 1 |
| numeric | 2 |
| ________________________ | |
| Group variables | None |
# knit_table(yank(skim_df, "Date"), 'Column: YR_MONTH')
knit_table(yank(skim_power, "numeric"), 'Column: KWH')
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| CaseSequence | 0 | 1.0000000 | 828.5 | 5.556978e+01 | 733 | 780.75 | 828.5 | 876.25 | 924 | ▇▇▇▇▇ |
| KWH | 1 | 0.9947917 | 6502474.6 | 1.447571e+06 | 770523 | 5429912.00 | 6283324.0 | 7620523.50 | 10655730 | ▁▁▇▅▁ |
knit_table(yank(skim_power, "character"), 'Column: YR_MONTH')
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| YR_MONTH | 0 | 1 | 8 | 8 | 0 | 192 | 0 |
# transform into tsibble
power = power |>
mutate(YR_MONTH = yearmonth(YR_MONTH)) |>
as_tsibble(index = YR_MONTH)
Visualize the time series overtime
power |>
autoplot(KWH) +
labs(title="Residential Power Usage Time Series", x="Year Month", y="KWH (Kilowatt hours)") +
theme_minimal()
Identify outliers with the IQR method.
outlier_flag <- function(x) {
# Calculate first and third quartiles
Q1 <- quantile(x, 0.25, na.rm = TRUE)
Q3 <- quantile(x, 0.75, na.rm = TRUE)
# Calculate the Interquartile Range
IQR_val <- IQR(x, na.rm = TRUE)
# Define lower and upper bounds (fences)
lower_bound <- Q1 - (IQR_val * 1.5)
upper_bound <- Q3 + (IQR_val * 1.5)
# Flag values outside the bounds as TRUE
is_outlier <- (x < lower_bound) | (x > upper_bound)
return(is_outlier)
}
power = power |>
mutate(outlier_flag = outlier_flag(KWH))
knit_table(
power |>
filter(outlier_flag==TRUE),
"Identified Outliers"
)
| CaseSequence | YR_MONTH | KWH | outlier_flag |
|---|---|---|---|
| 883 | 2010 Jul | 770523 | TRUE |
# replace outlier with NA and impute
power = power |>
mutate(KWH = if_else(outlier_flag == TRUE, NA_real_, KWH)) |>
mutate(
KWH = na.approx(KWH, na.rm = FALSE)
)
power |>
autoplot(KWH) +
labs(title="Residential Power Usage Time Series", x="Year Month", y="KWH (Kilowatt hours)") +
theme_minimal()
power = power |>
select(YR_MONTH, KWH)
# STL decomposition
power |>
model(stl = STL(KWH ~ trend(window=Inf) + season(period=12, window="periodic"))) |>
components() |>
autoplot() +
labs(title = "STL Decomposition - KWH")
power |>
gg_subseries() +
labs(
y = "KWH",
title = "Seasonal Subseries - KWH"
)
## Warning: `gg_subseries()` was deprecated in feasts 0.4.2.
## ℹ Please use `ggtime::gg_subseries()` instead.
## This warning is displayed once per session.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Plot variable not specified, automatically selected `y = KWH`
power |>
ACF(KWH) |>
autoplot() +
labs(title = "ACF Plot - KWH")
The time series for residential power usage appears to have an overall increasing trend with a more rapid increase starting around 2005. There also appears to be a strong yearly seasonal trend with as seen in the STL decomposition and the Seasonal subseries plot. At the begining of the year, power usage declines from January till May and then starts to steadily increase through the summer and then decrease through November. At the end of the year (December into January) power usage starts to increase again into the new year. Overall the trend in power usage seems to shift about every three months with the seasons. This is supported by the ACF plot which has high values every third lag (3, 6, 9, 12, 15, 18, 21).
To perform the model selection process, the data was split into a training and test set. The training set includes all data before January 2012 (about 88% of the data).
# split the data into test and train sets
power_train <- power |>
filter(year(YR_MONTH) < 2012)
power_test <- power |>
filter(year(YR_MONTH) >= 2012)
# define the box cox transformation lambda
lambda <- power |>
features(KWH, features = guerrero) |>
pull(lambda_guerrero)
# fit the models on the training data
power_fit = power_train |>
model(
"Additive ETS" = ETS(KWH ~ error("A") + trend("A") + season("A")),
"Transformed Additive ETS" = ETS(box_cox(KWH,lambda) ~ error("A") + trend("A") + season("A")),
"Transformed Multiplicative ETS" = ETS(box_cox(KWH,lambda) ~ error("M") + trend("A") + season("M")),
"Multiplicative ETS" = ETS(KWH ~ error("M") + trend("A") + season("M")),
"T Auto ETS" = ETS(box_cox(KWH,lambda)),
"Transformed HW" = ETS(box_cox(KWH,lambda) ~ error("M") + trend("Ad") + season("M")),
"Transformed Additive ETS" = ETS(box_cox(KWH,lambda) ~ error("A") + trend("N") + season("A")),
"Transformed SNAIVE" = SNAIVE(box_cox(KWH,lambda)),
"ARIMA" = ARIMA(KWH),
"Transformed ARIMA" = ARIMA(box_cox(KWH,lambda))
)
# view the test set forecasts
power_fc_final = power_fit |>
forecast(h = nrow(power_test))
power_fc_final |>
autoplot(power_test, level=NULL)
# evaluate model performance on test set
knit_table(
left_join(
glance(power_fit) |> select(.model, AIC, AICc, BIC),
power_fc_final |> accuracy(power_test) |> select(.model, RMSE, MAPE),
by = join_by(.model)) |>
arrange(AICc, RMSE)
)
| .model | AIC | AICc | BIC | RMSE | MAPE |
|---|---|---|---|---|---|
| Transformed ARIMA | -1304.721 | -1303.964 | -1283.3716 | 772888.0 | 7.913079 |
| Transformed Multiplicative ETS | -1009.378 | -1005.298 | -956.2709 | 1119967.0 | 13.075044 |
| T Auto ETS | -1009.378 | -1005.298 | -956.2709 | 1119967.0 | 13.075044 |
| Transformed HW | -1007.029 | -1002.438 | -950.7977 | 1064521.2 | 11.800401 |
| Transformed Additive ETS | -1004.387 | -1001.229 | -957.5273 | 824731.4 | 9.000613 |
| ARIMA | 4612.814 | 4613.214 | 4628.0629 | 732388.6 | 7.019935 |
| Multiplicative ETS | 5343.893 | 5347.973 | 5397.0003 | 1043069.9 | 12.096480 |
| Additive ETS | 5361.507 | 5365.587 | 5414.6143 | 952109.5 | 11.238595 |
| Transformed SNAIVE | NA | NA | NA | 1012071.6 | 9.817695 |
The testing results show that the Transformed ARIMA model is the best overall model, as it achieves substantially lower AIC, AICc, and BIC values, indicating a superior fit to the data. While the untransformed ARIMA shows slightly better RMSE and MAPE, these metrics are not directly comparable due to differences in scale. The transformation improves model stability and better captures the underlying data structure.
# view the selected model report
power_fit |>
select(.model = "Transformed ARIMA") |>
report()
## Series: KWH
## Model: ARIMA(0,0,3)(2,1,0)[12] w/ drift
## Transformation: box_cox(KWH, lambda)
##
## Coefficients:
## ma1 ma2 ma3 sar1 sar2 constant
## 0.2561 0.0869 0.2588 -0.7917 -0.4339 0.0013
## s.e. 0.0802 0.0809 0.0725 0.0820 0.0833 0.0005
##
## sigma^2 estimated as 1.393e-05: log likelihood=659.36
## AIC=-1304.72 AICc=-1303.96 BIC=-1283.37
# fit the model on all the data
power_fit_final <- power |>
model("ARIMA" = ARIMA(box_cox(KWH,lambda) ~ pdq(0, 0,3) + PDQ(2, 1, 0, period = 12)))
# forecast values for May 2010
power_fc_final <- power_fit_final |>
forecast(h = 12)
The selected ARIMA model has the following components:
This model was fit to the entire power dataset.
power_fit_final |>
select(.model = "ARIMA") |>
gg_tsresiduals() +
labs(
title = "ARIMA(0,0,3)(2,1,0)[12] Model of KWH Transformed residuals"
)
# Applying the Ljung-Box test
augment(power_fit_final) |> features(.innov, ljung_box, lag=10)
## # A tibble: 1 × 3
## .model lb_stat lb_pvalue
## <chr> <dbl> <dbl>
## 1 ARIMA 12.4 0.258
# Applying the Box-Pierce test
augment(power_fit_final) |> features(.innov, box_pierce, lag = 10)
## # A tibble: 1 × 3
## .model bp_stat bp_pvalue
## <chr> <dbl> <dbl>
## 1 ARIMA 12.0 0.287
While there is one autocorrelation lag value (6) slightly over the ACF confidence interval threshold, the residuals of the fit model appear be mostly white noise and normally distributed. This is supported. by the Ljung-Box and Box Pierce tests having p-values greater than 0.05, meaning the residuals are not distinguishable from white noise.
power_fc_final |>
autoplot(power) +
labs(title="ATM2 May 2010 Forecasts", subtitle = "ARIMA(0,0,3)(2,1,0)[12] Model")
Export forecasts into an Excel readable file
# gather part a forecasts to output to excel
atm1_fc_xslx = atm1_fc_final |>
as.data.frame() |>
mutate(ATM1_Forecasts = .mean) |>
select(DATE, ATM1_Forecasts)
atm2_fc_xslx = atm2_fc_final |>
as.data.frame() |>
mutate(ATM2_Forecasts = .mean) |>
select(DATE, ATM2_Forecasts)
atm3_fc_xslx = atm3_fc_final |>
as.data.frame() |>
mutate(ATM3_Forecasts = .mean) |>
select(DATE, ATM3_Forecasts)
atm4_fc_xslx = atm4_fc_final |>
as.data.frame() |>
mutate(ATM4_Forecasts = .mean) |>
select(DATE, ATM4_Forecasts)
part_a_fc = left_join(
left_join(
left_join(
atm1_fc_xslx, atm2_fc_xslx, by = join_by(DATE)),
atm3_fc_xslx, by = join_by(DATE)),
atm4_fc_xslx, by = join_by(DATE))
# gather part b forecasts to output to excel
part_b_fc = power_fc_final |>
as.data.frame() |>
mutate(
Power_Forecasts = .mean,
YR_MONTH = as.character(YR_MONTH)
) |>
select(YR_MONTH, Power_Forecasts)
# view forecasts
knit_table(part_a_fc, 'Part A Forecasts')
| DATE | ATM1_Forecasts | ATM2_Forecasts | ATM3_Forecasts | ATM4_Forecasts |
|---|---|---|---|---|
| 2010-05-01 | 92.128015 | 67.057749 | 87.66667 | 326.5391 |
| 2010-05-02 | 106.518327 | 73.066960 | 87.66667 | 433.4118 |
| 2010-05-03 | 78.907120 | 15.091609 | 87.66667 | 512.2953 |
| 2010-05-04 | 5.556122 | 8.836430 | 87.66667 | 241.1186 |
| 2010-05-05 | 106.163566 | 99.090250 | 87.66667 | 474.7541 |
| 2010-05-06 | 84.720401 | 89.863263 | 87.66667 | 351.1941 |
| 2010-05-07 | 91.309504 | 66.721573 | 87.66667 | 414.4358 |
| 2010-05-08 | 93.491214 | 67.301326 | 87.66667 | 295.0792 |
| 2010-05-09 | 107.162081 | 73.321607 | 87.66667 | 475.2268 |
| 2010-05-10 | 79.572741 | 15.602155 | 87.66667 | 465.2695 |
| 2010-05-11 | 5.726440 | 9.814201 | 87.66667 | 287.1311 |
| 2010-05-12 | 106.933940 | 99.309032 | 87.66667 | 446.4479 |
| 2010-05-13 | 85.409789 | 90.080118 | 87.66667 | 331.2565 |
| 2010-05-14 | 92.024818 | 66.943764 | 87.66667 | 460.4210 |
| 2010-05-15 | 94.212208 | 67.540347 | 87.66667 | 388.3167 |
| 2010-05-16 | 107.933298 | 73.565606 | 87.66667 | 450.5642 |
| 2010-05-17 | 80.238362 | 16.106499 | 87.66667 | 464.3841 |
| 2010-05-18 | 5.896759 | 10.778081 | 87.66667 | 365.0836 |
| 2010-05-19 | 107.704315 | 99.519491 | 87.66667 | 453.2276 |
| 2010-05-20 | 86.099177 | 90.291614 | 87.66667 | 402.3680 |
| 2010-05-21 | 92.740132 | 67.167822 | 87.66667 | 443.6369 |
| 2010-05-22 | 94.933202 | 67.776145 | 87.66667 | 400.3265 |
| 2010-05-23 | 108.704515 | 73.802141 | 87.66667 | 453.0699 |
| 2010-05-24 | 80.903984 | 16.606507 | 87.66667 | 453.8763 |
| 2010-05-25 | 6.067077 | 11.732015 | 87.66667 | 393.4039 |
| 2010-05-26 | 108.474690 | 99.724074 | 87.66667 | 447.7969 |
| 2010-05-27 | 86.788565 | 90.499379 | 87.66667 | 412.0704 |
| 2010-05-28 | 93.455446 | 67.393207 | 87.66667 | 448.6861 |
| 2010-05-29 | 95.654196 | 68.009653 | 87.66667 | 423.6912 |
| 2010-05-30 | 109.475732 | 74.033433 | 87.66667 | 447.7701 |
| 2010-05-31 | 81.569605 | 17.103467 | 87.66667 | 450.7500 |
knit_table(part_b_fc, 'Part B Forecasts')
| YR_MONTH | Power_Forecasts |
|---|---|
| 2014 Jan | 10393275 |
| 2014 Feb | 8860839 |
| 2014 Mar | 7064897 |
| 2014 Apr | 5978216 |
| 2014 May | 5938635 |
| 2014 Jun | 8314016 |
| 2014 Jul | 9623971 |
| 2014 Aug | 10178785 |
| 2014 Sep | 8541825 |
| 2014 Oct | 5857955 |
| 2014 Nov | 6142157 |
| 2014 Dec | 8277646 |
# export
list_of_fc_dfs <- list("Part A Forecasts" = part_a_fc, "Part B Forecasts" = part_b_fc)
write_xlsx(list_of_fc_dfs, path = "DATA624_Project1_Forecasts_Azevedo_20260329.xlsx")
Prompt:
Datasets:
Import the data from Github and view the raw data
# import data from github
github_url_ec1 = 'https://github.com/okazevedo90/CUNY_GITHUB_submissions/raw/refs/heads/main/DATA624/project_1/Waterflow_Pipe1.xlsx'
github_url_ec2 = 'https://github.com/okazevedo90/CUNY_GITHUB_submissions/raw/refs/heads/main/DATA624/project_1/Waterflow_Pipe2.xlsx'
data_ec1 = import(github_url_ec1)
data_ec2 = import(github_url_ec2)
Time-base sequence the data and aggregate based on date and hour.
# aggregate data by date and hour
data_ec1 = data_ec1 |>
mutate(
Hour = hour(`Date Time`),
Date = date(`Date Time`),
) |>
mutate(DateTime = ymd(Date) + hours(Hour))
data_ec1_agg = data_ec1 |>
group_by(DateTime) |>
summarise(WaterFlow = mean(WaterFlow)) |>
as_tsibble(index=DateTime)
# aggregate data by date and hour
data_ec2 = data_ec2 |>
mutate(
Hour = hour(`Date Time`),
Date = date(`Date Time`),
) |>
mutate(DateTime = ymd(Date) + hours(Hour))
data_ec2_agg = data_ec2 |>
group_by(DateTime) |>
summarise(WaterFlow = mean(WaterFlow)) |>
as_tsibble(index=DateTime)
data_ec1_agg |>
autoplot(WaterFlow) +
labs(title='Pipe 1 Time Series by Hour', x='Date Time', y='Water Flow')
data_ec2_agg |>
autoplot(WaterFlow) +
labs(title='Pipe 2 Time Series by Hour', x='Date Time', y='Water Flow')
knit_table(
data_ec1_agg |>features(WaterFlow, unitroot_kpss),
"Pipe 1 KPSS Test"
)
| kpss_stat | kpss_pvalue |
|---|---|
| 0.2465587 | 0.1 |
knit_table(
data_ec2_agg |>features(WaterFlow, unitroot_kpss),
"Pipe 2 KPSS Test"
)
| kpss_stat | kpss_pvalue |
|---|---|
| 0.1053662 | 0.1 |
To test if the time series are stationary, a the KPSS unit root test was performed on each series. Both the Pipe 1 and Pipe 2 time series had p-values of 0.1. Since this is greater than 0.05 we fail to reject the null hypothesis; thus, both of the series are stationary.
Due to time constraints, this was the extent of what I was able to get done for the bonus section.