In part A, I want you to forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable 'Cash' is provided in hundreds of dollars; other than that it is straightforward. I am being somewhat ambiguous on purpose to give this a more business-like feel. Explain and demonstrate your process, techniques used and not used, and your actual forecast. I am giving you data via an Excel file; please provide your written report on your findings, visuals, discussion and your R code via an RPubs link along with the actual .rmd file. Also please submit the forecast in an Excel-readable file.
Data ends on 4/30/2010, and May 2010 has 31 days, so the forecast horizon should be h = 31.
Because the task asks how much is taken out of 4 different ATMs, I plan to build a model for each. As I continue coding, we will likely see different patterns for each ATM.
# load libraries
library(readr)
library(readxl)
library(fpp3)
## Registered S3 method overwritten by 'tsibble':
## method from
## as_tibble.grouped_df dplyr
## ── Attaching packages ──────────────────────────────────────────── fpp3 1.0.1 ──
## ✔ tibble 3.2.1 ✔ tsibble 1.1.5
## ✔ dplyr 1.1.4 ✔ tsibbledata 0.4.1
## ✔ tidyr 1.3.1 ✔ feasts 0.4.1
## ✔ lubridate 1.9.3 ✔ fable 0.4.0
## ✔ ggplot2 3.5.1
## ── Conflicts ───────────────────────────────────────────────── fpp3_conflicts ──
## ✖ lubridate::date() masks base::date()
## ✖ dplyr::filter() masks stats::filter()
## ✖ tsibble::intersect() masks base::intersect()
## ✖ tsibble::interval() masks lubridate::interval()
## ✖ dplyr::lag() masks stats::lag()
## ✖ tsibble::setdiff() masks base::setdiff()
## ✖ tsibble::union() masks base::union()
library(ggplot2)
library(zoo)
##
## Attaching package: 'zoo'
## The following object is masked from 'package:tsibble':
##
## index
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
library(forecast)
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
library(tsoutliers)
##
## Attaching package: 'tsoutliers'
## The following object is masked from 'package:fabletools':
##
## outliers
library(patchwork)
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
library(urca)
library(rsample)
First, I need to verify the structure is appropriately saved before any forecasting work. Next I will check for missing data and address it, if needed. Then I will look at the distribution of the data.
atm <- read_xlsx("ATM624Data.xlsx")
atm$DATE <- as.Date(atm$DATE, origin = "1899-12-30")
str(atm)
## tibble [1,474 × 3] (S3: tbl_df/tbl/data.frame)
## $ DATE: Date[1:1474], format: "2009-05-01" "2009-05-01" ...
## $ ATM : chr [1:1474] "ATM1" "ATM2" "ATM1" "ATM2" ...
## $ Cash: num [1:1474] 96 107 82 89 85 90 90 55 99 79 ...
summary(atm)
## DATE ATM Cash
## Min. :2009-05-01 Length:1474 Min. : 0.0
## 1st Qu.:2009-08-01 Class :character 1st Qu.: 0.5
## Median :2009-11-01 Mode :character Median : 73.0
## Mean :2009-10-31 Mean : 155.6
## 3rd Qu.:2010-02-01 3rd Qu.: 114.0
## Max. :2010-05-14 Max. :10919.8
## NA's :19
The spread function is used to reshape the data so each ATM has its own column. I used box plots to confirm the ATMs have different patterns and should be modeled separately. This is also confirmed by the different scales, and we can see ATM4 has a large outlier, probably the max value from above, 10919.8. The box plots also show that ATM3 appears to have very little cash taken out, so I will look deeper at this. Note: atm_wide is only used for these box plots.
atm_wide <- atm %>% spread(ATM, Cash)
# Box plot per ATM
par(mfrow=c(1,4))
for (i in 2:5){
boxplot(atm_wide[i],
main = sprintf("%s", names(atm_wide)[i]))
}
With 19 values missing, I want to check which variables they fall under. Some of the NAs are in the ATM column and correspond to May 2010 placeholder rows, so it makes sense to remove those rows. Although the proportion of remaining NAs, now in the Cash values, is small in a dataset of 1,474 rows, these points can matter for continuity in a time series. Interpolation relies on ordered data, so I will use it to fill the missing Cash values for ATM1 (3 values) and ATM2 (2 values).
missing_rows <- atm[rowSums(is.na(atm)) > 0, ]
print(missing_rows)
## # A tibble: 19 × 3
## DATE ATM Cash
## <date> <chr> <dbl>
## 1 2009-06-13 ATM1 NA
## 2 2009-06-16 ATM1 NA
## 3 2009-06-18 ATM2 NA
## 4 2009-06-22 ATM1 NA
## 5 2009-06-24 ATM2 NA
## 6 2010-05-01 <NA> NA
## 7 2010-05-02 <NA> NA
## 8 2010-05-03 <NA> NA
## 9 2010-05-04 <NA> NA
## 10 2010-05-05 <NA> NA
## 11 2010-05-06 <NA> NA
## 12 2010-05-07 <NA> NA
## 13 2010-05-08 <NA> NA
## 14 2010-05-09 <NA> NA
## 15 2010-05-10 <NA> NA
## 16 2010-05-11 <NA> NA
## 17 2010-05-12 <NA> NA
## 18 2010-05-13 <NA> NA
## 19 2010-05-14 <NA> NA
atm <- atm %>% filter(!is.na(ATM)) #remove NA in ATM
# Check again only NA remaining are in Cash column
missing_rows_after_filter <- atm[rowSums(is.na(atm)) > 0, ]
print(missing_rows_after_filter)
## # A tibble: 5 × 3
## DATE ATM Cash
## <date> <chr> <dbl>
## 1 2009-06-13 ATM1 NA
## 2 2009-06-16 ATM1 NA
## 3 2009-06-18 ATM2 NA
## 4 2009-06-22 ATM1 NA
## 5 2009-06-24 ATM2 NA
#Interpolation
atm1_data <- atm %>% filter(ATM == "ATM1")
atm1_data$Cash <- zoo::na.approx(atm1_data$Cash, na.rm = FALSE) #na is retained
atm2_data <- atm %>% filter(ATM == "ATM2")
atm2_data$Cash <- zoo::na.approx(atm2_data$Cash, na.rm = FALSE)
imputed_data <- bind_rows(atm1_data, atm2_data)
other_atm_data <- atm %>% filter(!ATM %in% c("ATM1", "ATM2"))
atm <- bind_rows(other_atm_data, imputed_data) %>%
arrange(DATE, ATM)
atm_tsibble <- atm %>%
as_tsibble(index = DATE, key = ATM)
# Check
sum(is.na(atm_tsibble$Cash))
## [1] 0
With the data now in a tsibble, I visualize each series over time. The autoplot confirms that ATM4's outlier looks like a spike, and that ATM3 only has data for a few recent time points.
atm_tsibble %>%
autoplot(Cash) +
facet_wrap(~ATM, scales = "free", nrow = 4) +
labs(title = "ATM Before Outlier Removal", x = "Date", y = "Cash (in hundreds)")
I have to consider whether the outlier in ATM4 reflects a real-world event or an error. It is possible some event occurred during this time. However, the value of 10919.8 is in hundreds of dollars, so it would mean roughly $1,091,980 was withdrawn from one ATM in one day. It would be best to address this outlier.
atm4_data <- atm %>% filter(ATM == "ATM4")
atm4_tsib <- ts(atm4_data$Cash, frequency = 365)
# Identify outliers using tsoutliers
outlier_results <- tso(atm4_tsib )
## Warning in locate.outliers.iloop(resid = resid, pars = pars, cval = cval, :
## stopped when 'maxit.iloop' was reached
## Warning in locate.outliers.oloop(y = y, fit = fit, types = types, cval = cval,
## : stopped when 'maxit.oloop = 4' was reached
print(outlier_results)
## Series: atm4_tsib
## Regression with ARIMA(0,0,0) errors
##
## Coefficients:
## intercept AO285
## 445.3463 10474.4153
## s.e. 18.3671 350.9031
##
## sigma^2 = 123472: log likelihood = -2656.5
## AIC=5319 AICc=5319.06 BIC=5330.7
##
## Outliers:
## type ind time coefhat tstat
## 1 AO 285 1:285 10474 29.85
outlier_index <- 285
original_index <- which(atm_tsibble$ATM == "ATM4")[outlier_index]
replacement_value <- median(atm4_data$Cash[-outlier_index], na.rm = TRUE)
atm4_data$Cash[outlier_index] <- replacement_value
# Replace the value in the original dataset
atm_tsibble$Cash[original_index] <- replacement_value
print(atm_tsibble[original_index, ])
## # A tsibble: 1 x 3 [1D]
## # Key: ATM [1]
## DATE ATM Cash
## <date> <chr> <dbl>
## 1 2010-02-09 ATM4 403.
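As an aside, tso() also returns an outlier-adjusted version of the input series, so a sketch of an alternative to the median replacement would be:
# Alternative sketch: use the adjusted series returned by tso()
# (yadj is the input series with the estimated outlier effect removed)
atm4_adjusted <- outlier_results$yadj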
Next, I want to look at ATM3 in more depth. It seems there were only three days where the withdrawal was non-zero. For those same dates, ATM1 has identical withdrawal amounts: 2010-04-28: 96, 2010-04-29: 82, 2010-04-30: 85. Since ATM3 has very limited data, I will remove it and use ATM1 to guide the forecast amount needed for ATM3 later on.
atm3_dates <- atm_tsibble %>%
filter(ATM == "ATM3" & Cash > 0) %>% # where cash values greater than 0 to filter out other rows
select(DATE) %>%
distinct()
# Get all ATM values for the same dates
all_atm_values <- atm_tsibble %>%
filter(DATE %in% atm3_dates$DATE)
print(all_atm_values)
## # A tsibble: 12 x 3 [1D]
## # Key: ATM [4]
## DATE ATM Cash
## <date> <chr> <dbl>
## 1 2010-04-28 ATM1 96
## 2 2010-04-29 ATM1 82
## 3 2010-04-30 ATM1 85
## 4 2010-04-28 ATM2 107
## 5 2010-04-29 ATM2 89
## 6 2010-04-30 ATM2 90
## 7 2010-04-28 ATM3 96
## 8 2010-04-29 ATM3 82
## 9 2010-04-30 ATM3 85
## 10 2010-04-28 ATM4 348.
## 11 2010-04-29 ATM4 44.2
## 12 2010-04-30 ATM4 482.
# Remove ATM3
atm_tsibble <- atm_tsibble %>%
filter(ATM != "ATM3")
Finally, before I can move into modelling, I want to verify that no dates are missing from the data. Great, this is all set.
full_dates <- seq(min(atm_tsibble$DATE), max(atm_tsibble$DATE), by = "day")
missing_dates <- setdiff(full_dates, atm_tsibble$DATE)
if (length(missing_dates) == 0) {
print("All days are present in the data.")
} else {
print("Missing dates:")
print(missing_dates)
}
## [1] "All days are present in the data."
The autoplots look a lot better than initially, now that the outlier is removed and the missing values are accounted for. Next I will decompose each ATM series to better understand whether further transformation is needed and what type of model might be best.
atm_tsibble %>%
autoplot(Cash) +
facet_wrap(~ATM, scales = "free", nrow = 4) +
labs(title = "ATM Before Outlier Removal", x = "Date", y = "Cash (in hundreds)")
The components of all three series show a weekly seasonal pattern. They all show a changing trend, indicating the series are not stationary. The remainder for ATM1 and ATM2 seems random up to a point, then there is more variation; ATM4's remainder has a lot of variation.
sum(is.na(atm_tsibble$Cash))
## [1] 0
atm1_ts <- as_tsibble(atm_tsibble %>% filter(ATM == "ATM1") , index = DATE)
atm1_stl <- atm1_ts %>% model(STL(Cash ~ season(window = "periodic")))
atm1_components <- atm1_stl %>% components()
autoplot(atm1_components)
atm2_ts <- as_tsibble(atm_tsibble %>% filter(ATM == "ATM2") , index = DATE)
atm2_stl <- atm2_ts |> model(STL(Cash ~ season(window = "periodic")))
atm2_components <- atm2_stl %>% components()
autoplot(atm2_components)
atm4_ts <- as_tsibble(atm_tsibble %>% filter(ATM == "ATM4") , index = DATE)
atm4_stl <- atm4_ts |> model(STL(Cash ~ season(window = "periodic")))
atm4_components <- atm4_stl %>% components()
autoplot(atm4_components)
I apply a Box-Cox transformation to each series to stabilize the variance; this is not done automatically when modelling. Next, I use the gg_tsdisplay function to view the transformed series along with their ACF and PACF plots. ATM1, ATM2 and ATM4 fluctuate around a constant mean with no trend or volatility (there is a bit more fluctuation for ATM2 and ATM4). The ACFs for ATM1, ATM2 and ATM4 show similar results: autocorrelations drop off after a few lags, with significant spikes around lags 7 and 21. The PACF shows a significant spike around lag 7 for all three, pointing to weekly seasonality.
lambda1 <- atm_tsibble %>%
filter(ATM == "ATM1") %>%
features(Cash, features = guerrero) %>%
pull(lambda_guerrero)
atm1_tsibble <- atm_tsibble %>%
filter(ATM == "ATM1") %>%
mutate(Cash = box_cox(Cash + 2, lambda1))
gg_display <- atm1_tsibble %>%
gg_tsdisplay(Cash, plot_type = "partial") +
labs(title = "ATM1: Transformed Cash ", y = "Transformed Cash")
print(gg_display)
lambda2 <- atm_tsibble %>%
filter(ATM == "ATM2") %>%
features(Cash, features = guerrero) %>%
pull(lambda_guerrero)
atm2_tsibble <- atm_tsibble %>%
filter(ATM == "ATM2") %>%
mutate(Cash = box_cox(Cash + 2, lambda2))
gg_display_atm2 <- atm2_tsibble %>%
gg_tsdisplay(Cash, plot_type = "partial") +
labs(title = "ATM2: Transformed Cash", y = "Transformed Cash")
print(gg_display_atm2)
lambda4 <- atm_tsibble %>%
filter(ATM == "ATM4") %>%
features(Cash, features = guerrero) %>%
pull(lambda_guerrero)
atm4_tsibble <- atm_tsibble %>%
filter(ATM == "ATM4") %>%
mutate(Cash = box_cox(Cash + 2, lambda4))
gg_display_atm4 <- atm4_tsibble %>%
gg_tsdisplay(Cash, plot_type = "partial") +
labs(title = "ATM4: Transformed Cash", y = "Transformed Cash")
print(gg_display_atm4)
## Modelling
Now I want to finally model the data. I selected ETS, which captures error, trend, and seasonality; ARIMA (with a seasonal component), since it is versatile and handles seasonality; and SNAIVE, which is also a good baseline for seasonal data. Remember the spike at lag 7 indicates weekly seasonality, so I want to make sure I account for it.
fit_compare <- function(atm_tsibble) {
ets_model <- atm_tsibble %>% model(ETS(Cash))
arima_model <- atm_tsibble %>% model(ARIMA(Cash ~ season(), stepwise = FALSE)) # accounts for seasonality
snaive_model <- atm_tsibble %>% model(SNAIVE(Cash))
models <- list(ets = ets_model, arima = arima_model, snaive = snaive_model)
accuracy_table <- purrr::map_df(models, accuracy, .id = "model")
return(accuracy_table)
}
combined_results <- rbind(
mutate(fit_compare(atm1_tsibble), ATM = "ATM1"),
mutate(fit_compare(atm2_tsibble), ATM = "ATM2"),
mutate(fit_compare(atm4_tsibble), ATM = "ATM4")
)
print(combined_results |> select(-.type, -.model, -ME))
## # A tibble: 9 × 9
## model ATM RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ets ATM1 1.22 0.689 -7.04 15.1 0.884 0.849 0.102
## 2 arima ATM1 1.22 0.717 -7.50 15.7 0.919 0.848 -0.000428
## 3 snaive ATM1 1.44 0.779 -6.03 15.7 1 1 0.129
## 4 ets ATM2 8.27 5.79 -38.2 55.6 0.858 0.839 0.0296
## 5 arima ATM2 8.56 6.46 -51.1 68.7 0.958 0.869 -0.0265
## 6 snaive ATM2 9.85 6.74 -29.6 53.9 1 1 0.0553
## 7 ets ATM4 13.0 10.3 -47.9 73.6 0.761 0.758 0.0824
## 8 arima ATM4 12.7 10.3 -52.0 76.0 0.762 0.743 -0.00227
## 9 snaive ATM4 17.1 13.5 -45.3 82.5 1 1 0.0439
The RMSE values are largest for SNAIVE, so I will not select it. The ARIMA and ETS models have comparable RMSE values: they tie for ATM1 at the displayed precision, ETS is lower for ATM2, and ARIMA is slightly lower for ATM4. Considering that ETS is simpler and easier to explain than ARIMA, and is at least as good on two of the three ATMs, I will move forward with forecasting using ETS. Note: the MPE and MAPE initially had 'Inf' values for ATM2 due to near-zero values; I avoided this by adding an offset (first +1, then +2) to the series before the Box-Cox transformation, which corrected the values.
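For a stricter comparison than training accuracy, a time-series cross-validation sketch (not run here; the .init and .step choices are arbitrary) could use stretch_tsibble() to build expanding training windows:
# Sketch: cross-validated one-week-ahead accuracy for ATM1
cv_accuracy <- atm1_tsibble %>%
  stretch_tsibble(.init = 300, .step = 7) %>%  # expanding windows of the series
  model(ets = ETS(Cash), arima = ARIMA(Cash), snaive = SNAIVE(Cash)) %>%
  forecast(h = 7) %>%  # the final windows run past the data and are dropped with a warning
  accuracy(atm1_tsibble)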
Next I applied the ETS model manually to each ATM tsibble to better understand what I will be using to forecast. The selected model for ATM1 and ATM2 is ETS(A,N,A): additive error, no trend, additive seasonality. For ATM4 it is ETS(M,N,A): multiplicative error, no trend, additive seasonality. I reviewed the residuals, which all appear to be white noise, and the histograms look roughly normal.
ets1_model <- atm1_tsibble %>%
model(ETS(Cash))
ets2_model <- atm2_tsibble %>%
model(ETS(Cash))
ets4_model <- atm4_tsibble %>%
model(ETS(Cash))
create_combined_ets_diagnostics <- function(atm_data_list) {
par(mfrow = c(3, 3))
for (atm_info in atm_data_list) {
atm_tsibble <- atm_info$tsibble
model_name <- atm_info$name
ets_model <- atm_tsibble %>% model(ETS(Cash))
print(report(ets_model))
residuals <- ets_model %>% augment() %>% pull(.resid)
plot(residuals, main = paste("Residuals of", model_name), ylab = "Residuals", xlab = "Index", type = "l")
abline(h = 0, col = "red")
hist(residuals, main = paste("Histogram of Residuals for", model_name), xlab = "Residuals", breaks = 20)
Acf(residuals, main = paste("ACF of Residuals for", model_name))
Pacf(residuals, main = paste("PACF of Residuals for", model_name))
}
par(mfrow = c(1, 1))
}
atm_data_list <- list(
list(tsibble = atm1_tsibble, name = "ATM1"),
list(tsibble = atm2_tsibble, name = "ATM2"),
list(tsibble = atm4_tsibble, name = "ATM4")
)
create_combined_ets_diagnostics(atm_data_list)
## Series: Cash
## Model: ETS(A,N,A)
## Smoothing parameters:
## alpha = 0.0001000065
## gamma = 0.3662404
##
## Initial states:
## l[0] s[0] s[-1] s[-2] s[-3] s[-4] s[-5] s[-6]
## 7.90113 -4.179114 0.5827181 0.4285257 0.5334693 0.9818783 1.062594 0.5899285
##
## sigma^2: 1.535
##
## AIC AICc BIC
## 2320.775 2321.396 2359.774
## # A mable: 1 x 2
## # Key: ATM [1]
## ATM `ETS(Cash)`
## <chr> <model>
## 1 ATM1 <ETS(A,N,A)>
## Series: Cash
## Model: ETS(A,N,A)
## Smoothing parameters:
## alpha = 0.0001000123
## gamma = 0.3821918
##
## Initial states:
## l[0] s[0] s[-1] s[-2] s[-3] s[-4] s[-5] s[-6]
## 30.06249 -20.56225 -9.114558 4.425845 1.801073 6.680938 10.08389 6.685066
##
## sigma^2: 70.0432
##
## AIC AICc BIC
## 3715.276 3715.897 3754.275
## # A mable: 1 x 2
## # Key: ATM [1]
## ATM `ETS(Cash)`
## <chr> <model>
## 1 ATM2 <ETS(A,N,A)>
## Series: Cash
## Model: ETS(M,N,A)
## Smoothing parameters:
## alpha = 0.01543598
## gamma = 0.1731552
##
## Initial states:
## l[0] s[0] s[-1] s[-2] s[-3] s[-4] s[-5] s[-6]
## 26.87038 -19.01183 -6.330076 4.945182 3.246947 7.593668 4.364441 5.191662
##
## sigma^2: 0.2148
##
## AIC AICc BIC
## 4024.047 4024.669 4063.046
## # A mable: 1 x 2
## # Key: ATM [1]
## ATM `ETS(Cash)`
## <chr> <model>
## 1 ATM4 <ETS(M,N,A)>
# When I tried the simpler route, atm4_tsibble %>% model(ETS(Cash)) %>% gg_tsresiduals(), I got blank space, so I built the diagnostics manually above
Now I use the models to make forecasts (the plot titles are a reminder that these values are still on the transformed scale). May has 31 days, so I make sure the h value is set accordingly. The forecast plots start in March so I can see two months of historical data next to the forecasted values.
# forecasts for May 2010 (31 days)
forecast1 <- ets1_model %>% forecast(h = 31)
forecast2 <- ets2_model %>% forecast(h = 31)
forecast4 <- ets4_model %>% forecast(h = 31)
p1 <- autoplot(atm1_tsibble, Cash) +
autolayer(forecast1, series = "Forecast ATM1", PI = FALSE) +
labs(title = "Cash Withdrawals for ATM1 (transformed))",
x = "Date",
y = "Cash Withdrawals") +
xlim(as.Date("2010-03-01"), as.Date("2010-05-31")) +
theme_minimal()
## Warning in ggdist::geom_lineribbon(without(intvl_mapping, "colour_ramp"), :
## Ignoring unknown parameters: `series` and `PI`
## Warning in geom_line(mapping = without(mapping, "shape"), data =
## unpack_data(object[single_row[["FALSE"]], : Ignoring unknown parameters:
## `series` and `PI`
p2 <- autoplot(atm2_tsibble, Cash) +
autolayer(forecast2, series = "Forecast ATM2", PI = FALSE) +
labs(title = "Cash Withdrawals for ATM2 (transformed))",
x = "Date",
y = "Cash Withdrawals") +
xlim(as.Date("2010-03-01"), as.Date("2010-05-31")) +
theme_minimal()
## Warning in ggdist::geom_lineribbon(without(intvl_mapping, "colour_ramp"), : Ignoring unknown parameters: `series` and `PI`
## Ignoring unknown parameters: `series` and `PI`
p3 <- autoplot(atm4_tsibble, Cash) +
autolayer(forecast4, series = "Forecast ATM4", PI = FALSE) +
labs(title = "Cash Withdrawals for ATM4 (transformed))",
x = "Date",
y = "Cash Withdrawals") +
xlim(as.Date("2010-03-01"), as.Date("2010-05-31")) +
theme_minimal()
## Warning in ggdist::geom_lineribbon(without(intvl_mapping, "colour_ramp"), : Ignoring unknown parameters: `series` and `PI`
## Ignoring unknown parameters: `series` and `PI`
print(p1)
## Warning: Removed 304 rows containing missing values or values outside the scale range
## (`geom_line()`).
print(p2)
## Warning: Removed 304 rows containing missing values or values outside the scale range
## (`geom_line()`).
print(p3)
## Warning: Removed 304 rows containing missing values or values outside the scale range
## (`geom_line()`).
Next, I want the forecasts to be directly interpretable in the same units as the original data, which means I need to invert the transformation; I do this on the forecasted mean values.
inverse_box_cox <- function(y_prime, lambda) {
  # manual inverse of the Box-Cox, kept for reference; fabletools provides
  # the equivalent inv_box_cox(), which is what is actually called below
  if (lambda == 0) {
    return(exp(y_prime))
  } else {
    return((lambda * y_prime + 1)^(1/lambda))
  }
}
# Note: the series were transformed as box_cox(Cash + 2, lambda), so an exact
# back-transform would also subtract the +2 offset from these means afterwards.
forecast1$.mean <- inv_box_cox(forecast1$.mean, lambda = lambda1)
forecast2$.mean <- inv_box_cox(forecast2$.mean, lambda = lambda2)
forecast4$.mean <- inv_box_cox(forecast4$.mean, lambda = lambda4)
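A cleaner alternative (a sketch, not what produced the numbers below): fable can invert the transformation automatically, with bias adjustment of the mean, if the transformation is written into the model formula.
# Sketch: let fable handle the Box-Cox inversion (and the +2 offset) itself
ets1_bt <- atm_tsibble %>%
  filter(ATM == "ATM1") %>%
  model(ETS(box_cox(Cash + 2, lambda1)))
forecast1_bt <- ets1_bt %>% forecast(h = 31)  # .mean already on the original scale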
To find the forecasted ATM totals for May, I sum the 31 back-transformed daily forecast values for each ATM.
sum_forecast1 <- sum(forecast1$.mean, na.rm = TRUE)
sum_forecast2 <- sum(forecast2$.mean, na.rm = TRUE)
sum_forecast4 <- sum(forecast4$.mean, na.rm = TRUE)
print(paste("Total forecasted cash withdrawals for ATM1 in May 2010:", sum_forecast1))
## [1] "Total forecasted cash withdrawals for ATM1 in May 2010: 2446.96678918692"
print(paste("Total forecasted cash withdrawals for ATM2 in May 2010:", sum_forecast2))
## [1] "Total forecasted cash withdrawals for ATM2 in May 2010: 1882.04906194075"
print(paste("Total forecasted cash withdrawals for ATM4 in May 2010:", sum_forecast4))
## [1] "Total forecasted cash withdrawals for ATM4 in May 2010: 10226.8861647894"
#Sanity check to see if values are similar to previous month
april_sums <- atm %>%
filter(DATE >= as.Date("2010-04-01") & DATE <= as.Date("2010-04-30")) %>%
group_by(ATM) %>%
summarise(Total_Cash = sum(Cash, na.rm = TRUE))
print(april_sums)
## # A tibble: 4 × 2
## ATM Total_Cash
## <chr> <dbl>
## 1 ATM1 2300
## 2 ATM2 1842
## 3 ATM3 263
## 4 ATM4 11858.
Finally, we have the forecasted values for each ATM for May 2010; these sums are in **hundreds of dollars**. For the month of May 2010 the amount needed in ATM1 is about $244,697, in ATM2 about $188,205 (rounding up to make sure the ATM holds enough), and in ATM4 about $1,022,689 (rounding up as well). ATM3 can be expected to follow ATM1, based on the limited data available, where its withdrawals matched ATM1 exactly. These totals are comparable to the April 2010 withdrawals at each ATM.
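Since ATM3 matched ATM1 exactly on its only non-zero days, a hedged proxy for ATM3 (an assumption, not a fitted model) is to carry ATM1's back-transformed daily means over to ATM3; this could be added as another sheet in the write_xlsx() call below.
# Hypothetical ATM3 proxy: reuse ATM1's back-transformed daily forecasts
forecast3_df <- data.frame(
  DATE = seq(as.Date("2010-05-01"), as.Date("2010-05-31"), by = "day"),
  ATM = "ATM3",
  Cash = forecast1$.mean  # assumption: ATM3 mirrors ATM1
)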
library(writexl)
# Convert forecasts to data frames
forecast1_df <- as.data.frame(forecast1)
forecast2_df <- as.data.frame(forecast2)
forecast4_df <- as.data.frame(forecast4)
write_xlsx(
list(
"Forecast_ATM1" = forecast1_df,
"Forecast_ATM2" = forecast2_df,
"Forecast_ATM4" = forecast4_df
),
path = "ATM_Forecasts_May2010.xlsx"
)
Part B consists of a simple dataset of residential power usage from January 1998 until December 2013. Your assignment is to model these data and provide a monthly forecast for 2014. The data is given in a single file. The variable 'KWH' is power consumption in kilowatt hours; the rest is straightforward. Add this to your existing files above.
# load libraries
library(readr)
library(readxl)
library(fpp3)
library(ggplot2)
library(zoo)
library(forecast)
library(tsoutliers)
library(patchwork)
library(gridExtra)
After reading in the data, I replaced the NA with the median value and removed CaseSequence, as it is not needed.
rcf <- read_xlsx("ResidentialCustomerForecastLoad-624.xlsx")
summary(rcf)
## CaseSequence YYYY-MMM KWH
## Min. :733.0 Length:192 Min. : 770523
## 1st Qu.:780.8 Class :character 1st Qu.: 5429912
## Median :828.5 Mode :character Median : 6283324
## Mean :828.5 Mean : 6502475
## 3rd Qu.:876.2 3rd Qu.: 7620524
## Max. :924.0 Max. :10655730
## NA's :1
rcf$KWH[is.na(rcf$KWH)] = median(rcf$KWH, na.rm = TRUE)
any(is.na(rcf))
## [1] FALSE
rcf <- rcf %>%
select(-CaseSequence)
tail(rcf)
## # A tibble: 6 × 2
## `YYYY-MMM` KWH
## <chr> <dbl>
## 1 2013-Jul 8415321
## 2 2013-Aug 9080226
## 3 2013-Sep 7968220
## 4 2013-Oct 5759367
## 5 2013-Nov 5769083
## 6 2013-Dec 9606304
I create the rcf tsibble, making sure to convert the date column to yearmonth format so the tsibble reflects monthly data.
rcf_tsibble <- rcf %>%
rename(Date = `YYYY-MMM`) %>%
mutate(Date = yearmonth(Date)) %>%
as_tsibble(index = Date)
head(rcf_tsibble)
## # A tsibble: 6 x 2 [1M]
## Date KWH
## <mth> <dbl>
## 1 1998 Jan 6862583
## 2 1998 Feb 5838198
## 3 1998 Mar 5420658
## 4 1998 Apr 5010364
## 5 1998 May 4665377
## 6 1998 Jun 6467147
I decompose the series and extract the components to better understand them. The autoplot shows some fluctuations and one substantial dip (an outlier); there seems to be an upward trend, and there is a clear seasonal pattern with peaks in several months, one of which appears to be January.
rcf_stl <- rcf_tsibble %>%
model(STL(KWH ~ season(window = "periodic")))
rcf_components <- rcf_stl %>% components()
autoplot(rcf_components)
I use a centered 5-point rolling mean to smooth the series, which dampens the outlier, and visualize the autoplot again. Note this smooths every observation, not just the outlier, and leaves NAs at both ends of the series; I fill those with the median below.
rcf_tsibble$KWH <- rollmean(rcf_tsibble$KWH, k = 5, fill = NA, align = "center")
autoplot(rcf_tsibble, KWH) +
labs(title = "After Addressing Outliers", x = "Time", y = "KWH") +
theme_minimal()
## Warning: Removed 4 rows containing missing values or values outside the scale range
## (`geom_line()`).
na_rows <- rcf_tsibble[is.na(rcf_tsibble$KWH), ]
print(na_rows)
## # A tsibble: 4 x 2 [1M]
## Date KWH
## <mth> <dbl>
## 1 1998 Jan NA
## 2 1998 Feb NA
## 3 2013 Nov NA
## 4 2013 Dec NA
rcf_tsibble$KWH[is.na(rcf_tsibble$KWH)] <- median(rcf_tsibble$KWH, na.rm = TRUE)
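A more targeted alternative (a sketch, not what I ran) is forecast::tsclean(), which interpolates only flagged outliers and missing values instead of smoothing every observation with a rolling mean.
# Sketch: outlier-only cleaning on the raw monthly series
kwh_ts <- ts(rcf$KWH, start = c(1998, 1), frequency = 12)
kwh_clean <- forecast::tsclean(kwh_ts)  # replaces identified outliers/NAs via interpolation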
The seasonal pattern is clear, so I use gg_season to look at it more in depth, as it may be helpful later for modeling. The plot shows a peak during the summer and a trough during the spring, and this pattern holds across most of the years shown below.
gg_season(rcf_tsibble, KWH) +
labs(title = "Seaonsal Plot over 12 Months", x = "Time", y = "KWH") +
theme_minimal()
I find the optimal lambda value for the Box-Cox transformation using the Guerrero method. I apply the transformation to the KWH variable in rcf_tsibble and save the result as trcf_tsibble, the transformed tsibble. The gg_tsdisplay function then produces the time series plot, ACF plot, and PACF plot to confirm the variance is stabilized.
lambda5 <- rcf_tsibble %>%
features(KWH, features = guerrero) %>%
pull(lambda_guerrero)
trcf_tsibble <- rcf_tsibble %>%
mutate(KWH = box_cox(KWH, lambda5))
gg_display_tt <- trcf_tsibble %>%
gg_tsdisplay(KWH, plot_type = "partial")
print(gg_display_tt)
Now I will start looking for the best fitting model; trcf_tsibble is the transformed tsibble.
ets_model <- trcf_tsibble %>%
model(
ets = ETS(KWH)
)
future_forecast <- ets_model %>%
forecast(h = "12 months")
accuracy(ets_model)
## # A tibble: 1 × 10
## .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ets Training 972. 147136. 94839. -0.0465 2.45 0.372 0.380 0.169
autoplot(future_forecast, trcf_tsibble)
trcf_tsibble %>%
features(KWH, unitroot_ndiffs)
## # A tibble: 1 × 1
## ndiffs
## <int>
## 1 1
trcf_tsibble %>%
features(KWH, unitroot_nsdiffs)
## # A tibble: 1 × 1
## nsdiffs
## <int>
## 1 1
After applying the seasonal and regular orders of differencing, the plot shows the data oscillating around a mean of zero. It looks like it could be stationary, but the ACF and PACF don't drop off as quickly as they normally would for a stationary series, so I will run formal tests to confirm.
seasonal_diff_tsibble <- trcf_tsibble %>%
mutate(KWH_seasonal_diff = difference(KWH, lag = 12)) #seasonal
combined_diff_tsibble <- seasonal_diff_tsibble %>%
mutate(KWH_combined_diff = difference(KWH_seasonal_diff)) #additional order of differencing
combined_ts <- ts(combined_diff_tsibble$KWH_combined_diff, frequency = 12, start = c(1998, 1))
ggtsdisplay(combined_ts)
## Warning: Removed 13 rows containing missing values or values outside the scale range
## (`geom_point()`).
The ADF test confirms stationarity with a p-value below 0.01, which is less than 0.05, and the KPSS p-value is above 0.05; both results indicate the differenced series is stationary.
library(tseries)
trc_clean <- na.omit(combined_ts)
check_stationarity <- function(ts) {
adf_result <- adf.test(ts)
kpss_result <- kpss.test(ts)
list(
ADF = list(statistic = adf_result$statistic, p.value = adf_result$p.value),
KPSS = list(statistic = kpss_result$statistic, p.value = kpss_result$p.value)
)
}
pipe_2_stationarity <- check_stationarity(trc_clean)
## Warning in adf.test(ts): p-value smaller than printed p-value
## Warning in kpss.test(ts): p-value greater than printed p-value
print(pipe_2_stationarity)
## $ADF
## $ADF$statistic
## Dickey-Fuller
## -6.245781
##
## $ADF$p.value
## [1] 0.01
##
##
## $KPSS
## $KPSS$statistic
## KPSS Level
## 0.04745011
##
## $KPSS$p.value
## [1] 0.1
So I now know the ideal model will include one order of seasonal differencing and one regular order of differencing. I go back to the transformed tsibble, trcf_tsibble (rather than the differenced values), and check the auto ARIMA model first. Note that auto.arima receives the bare vector trcf_tsibble$KWH here, so the monthly frequency is lost; the consequences are discussed after the output.
model_auto <- auto.arima(trcf_tsibble$KWH, seasonal = TRUE, stepwise = FALSE)
summary(model_auto)
## Series: trcf_tsibble$KWH
## ARIMA(0,1,5)
##
## Coefficients:
## ma1 ma2 ma3 ma4 ma5
## 0.3480 -0.0396 -0.2668 -0.1393 -0.7040
## s.e. 0.1026 0.1161 0.1755 0.0500 0.1917
##
## sigma^2 = 2.633e+10: log likelihood = -2563.1
## AIC=5138.2 AICc=5138.66 BIC=5157.71
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 13144.51 159703.8 118652.1 0.2180583 3.01978 0.7291728 0.0397053
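The auto model has no seasonal component because trcf_tsibble$KWH is a plain numeric vector with no frequency attribute, so auto.arima cannot consider seasonal terms even with seasonal = TRUE. A sketch that would let it do so passes a monthly ts instead:
# Sketch: supply the monthly frequency so seasonal terms can be searched
kwh_monthly <- ts(trcf_tsibble$KWH, start = c(1998, 1), frequency = 12)
model_auto_s <- auto.arima(kwh_monthly, seasonal = TRUE, stepwise = FALSE)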
model_b <- Arima(trcf_tsibble$KWH, order = c(0, 1, 5), seasonal = list(order = c(1, 1, 1), period = 12))
model_c <- Arima(trcf_tsibble$KWH, order = c(0, 1, 5), seasonal = list(order = c(1, 1, 2), period = 12))
model_d <- Arima(trcf_tsibble$KWH, order = c(0, 1, 5), seasonal = list(order = c(0, 1, 1), period = 12))
model_e <- Arima(trcf_tsibble$KWH, order = c(0, 1, 5), seasonal = list(order = c(0, 1, 2), period = 12))
models2v <- list(model_auto = model_auto, model_b = model_b, model_c = model_c, model_d = model_d, model_e = model_e)
aic_values <- sapply(models2v, AIC)
print(aic_values)
## model_auto model_b model_c model_d model_e
## 5138.201 4739.690 4741.362 4741.621 4739.408
summary(model_e)
## Series: trcf_tsibble$KWH
## ARIMA(0,1,5)(0,1,2)[12]
##
## Coefficients:
## ma1 ma2 ma3 ma4 ma5 sma1 sma2
## 0.0887 0.0102 0.1268 -0.0197 -0.9114 -0.9973 0.2060
## s.e. 0.0569 0.0675 0.0598 0.0552 0.0633 0.0929 0.0934
##
## sigma^2 = 1.512e+10: log likelihood = -2361.7
## AIC=4739.41 AICc=4740.26 BIC=4764.91
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 13347.66 116379.2 71519.77 0.2930489 1.825928 0.4395227 0.04993144
I attempted cross-validation of the models, specifically model_auto (auto.arima's selection), model_e (the best fit based on the lowest AIC), and ets_model (R's automatic ETS selection). However, there were several errors, and model_e could not be validated: the commented-out code below passes forecast::Arima-style order/seasonal arguments to fable's ARIMA(), which does not accept them.
splits <- initial_time_split(trcf_tsibble, prop = 0.8) # 80% train, 20% test
evaluate_models <- function(split) {
train_data <- training(split)
test_data <- testing(split)
ets_model <- train_data %>%
model(ETS = ETS(KWH))
auto_model <- train_data %>%
model(Auto_ARIMA = ARIMA(KWH))
# model_e <- train_data %>%
#   model(Model_E = ARIMA(KWH ~ 0 + pdq(0, 1, 5) + PDQ(0, 1, 2)))  # fable syntax; period inferred from the monthly index
ets_forecast <- ets_model %>%
forecast(h = nrow(test_data))
auto_forecast <- auto_model %>%
forecast(h = nrow(test_data))
ets_accuracy <- accuracy(ets_forecast, test_data)
auto_accuracy <- accuracy(auto_forecast, test_data)
return(list(ets_accuracy = ets_accuracy, auto_accuracy = auto_accuracy))
}
results <- evaluate_models(splits)
print(results$ets_accuracy)
## # A tibble: 1 × 10
## .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ETS Test 1335746. 1353881. 1335746. 29.7 29.7 NaN NaN 0.634
print(results$auto_accuracy)
## # A tibble: 1 × 10
## .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Auto_ARIMA Test 638614. 727569. 641524. 14.0 14.0 NaN NaN 0.713
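For reference, a sketch of the fable syntax that would have let model_e join this comparison (the seasonal period is inferred from the monthly index):
# Sketch: model_e in fable syntax, evaluated on the same train/test split
model_e_cv <- training(splits) %>%
  model(Model_E = ARIMA(KWH ~ 0 + pdq(0, 1, 5) + PDQ(0, 1, 2)))
model_e_cv %>%
  forecast(h = nrow(testing(splits))) %>%
  accuracy(testing(splits))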
The best model is model_e, ARIMA(0,1,5)(0,1,2)[12], with an AIC of 4739.41, the lowest among the ARIMA candidates compared above. I will continue with this model, check its residuals, and forecast with it.
The Ljung-Box test shows a p-value of 0.196, indicating no significant autocorrelation in the residuals; they behave like white noise. gg_tsresiduals() errors on this object (see the note in the code), but based on the residual plot and the histogram, white-noise residuals are confirmed.
model_e <- Arima(trcf_tsibble$KWH, order = c(0, 1, 5), seasonal = list(order = c(0, 1, 2), period = 12))
checkresiduals(model_e)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,1,5)(0,1,2)[12]
## Q* = 4.6908, df = 3, p-value = 0.1959
##
## Model df: 7. Total lags used: 10
# gg_tsresiduals(model_e) errors because model_e is a forecast::Arima object, not a fable mable; the checkresiduals() output above looks like white noise
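gg_tsresiduals() works on fable mables rather than forecast::Arima objects, so a refit in fable (a sketch) would produce the residual plots directly:
# Sketch: refit model_e in fable so gg_tsresiduals() works
trcf_tsibble %>%
  model(ARIMA(KWH ~ 0 + pdq(0, 1, 5) + PDQ(0, 1, 2))) %>%
  gg_tsresiduals()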
The forecast below comes up blank: PDQ(0, 1, 2)[12] is not valid fable syntax (it should be PDQ(0, 1, 2, period = 12), or just PDQ(0, 1, 2) with the period inferred from the monthly index), so the model errors and the accuracy values are NaN.
train_size <- floor(0.7 * nrow(trcf_tsibble))
train_data <- trcf_tsibble %>% slice(1:train_size)
test_data <- trcf_tsibble %>% slice((train_size + 1):nrow(trcf_tsibble))
model_e <- train_data %>%
model(
Model_E = ARIMA(KWH ~ pdq(0, 1, 5) + PDQ(0, 1, 2)[12])
)
## Warning: 1 error encountered for Model_E
## [1] invalid type (list) for variable 'PDQ(0, 1, 2)[12]'
model_e_forecast <- model_e %>%
forecast(h = nrow(test_data))
model_e_accuracy <- accuracy(model_e_forecast, test_data)
print(model_e_accuracy)
## # A tibble: 1 × 10
## .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Model_E Test NaN NaN NaN NaN NaN NaN NaN NA
autoplot(model_e_forecast, color = "blue") +
autolayer(test_data, KWH, color = "red", series = "Actual") +
labs(title = "Model E Forecast vs Actual", y = "KWH") +
theme_minimal()
## Warning in geom_line(eval_tidy(expr(aes(!!!aes_spec))), data = object, ..., :
## Ignoring unknown parameters: `series`
## Warning in max(ids, na.rm = TRUE): no non-missing arguments to max; returning
## -Inf
## Warning in max(ids, na.rm = TRUE): no non-missing arguments to max; returning
## -Inf
## Warning: Removed 58 rows containing missing values or values outside the scale range
## (`geom_line()`).
**This code chunk causes an error and model_e becomes null; ARIMA() in fable does not accept the order/seasonal arguments of forecast::Arima(), so I will move on.**
model_e <- train_data %>%
  model(model_e = ARIMA(KWH, order = c(0, 1, 5), seasonal = list(order = c(0, 1, 2), period = 12)))
model_e_forecast <- model_e %>% forecast(h = "12 months")
model_e_accuracy <- accuracy(model_e_forecast, test_data)
autoplot(model_e_forecast) +
  autolayer(trcf_tsibble, KWH, series = "Actual") +
  labs(title = "Model E Forecast", y = "KWH") +
  theme_minimal()
Something still seems wrong with the forecasts, so I rebuild them below.
evaluate_models <- function(split) {
train_data <- training(split)
auto_model <- train_data %>%
model(Auto_ARIMA = ARIMA(KWH))
ets_model <- train_data %>%
model(ETS = ETS(KWH))
auto_forecast <- auto_model %>%
forecast(h = 12) %>%
mutate(model = "Auto ARIMA")
ets_forecast <- ets_model %>%
forecast(h = 12) %>%
mutate(model = "ETS")
combined_forecast <- bind_rows(auto_forecast, ets_forecast)
return(list(auto_forecast = auto_forecast, ets_forecast = ets_forecast, combined_forecast = combined_forecast))
}
results <- evaluate_models(splits)
autoplot(rcf_tsibble, KWH) +
autolayer(results$auto_forecast, KWH, color = "blue", series = "Auto ARIMA") +
autolayer(results$ets_forecast, KWH, color = "purple", series = "ETS") +
labs(title = "Forecasts from Auto ARIMA and ETS Models",
y = "KWH",
x = "Date") +
theme_minimal() +
facet_wrap(~model, scales = "free_y")
## Warning in ggdist::geom_lineribbon(without(intvl_mapping, "colour_ramp"), :
## Ignoring unknown parameters: `series`
## Warning in geom_line(mapping = without(mapping, "shape"), data =
## unpack_data(object[single_row[["FALSE"]], : Ignoring unknown parameters:
## `series`
## Warning in ggdist::geom_lineribbon(without(intvl_mapping, "colour_ramp"), :
## Ignoring unknown parameters: `series`
## Warning in geom_line(mapping = without(mapping, "shape"), data =
## unpack_data(object[single_row[["FALSE"]], : Ignoring unknown parameters:
## `series`
## Scale for fill_ramp is already present.
## Adding another scale for fill_ramp, which will replace the existing scale.
evaluate_models <- function(data, new_data) {
auto_model <- data %>%
model(Auto_ARIMA = ARIMA(KWH))
ets_model <- data %>%
model(ETS = ETS(KWH))
auto_forecast <- auto_model %>%
forecast(new_data = new_data) %>%
mutate(model = "Auto ARIMA")
ets_forecast <- ets_model %>%
forecast(new_data = new_data) %>%
mutate(model = "ETS")
combined_forecast <- bind_rows(auto_forecast, ets_forecast)
return(combined_forecast)
}
forecast_period <- new_data(rcf_tsibble, n = 24) %>%
mutate(index = yearmonth("2014 Jan") + 0:23)
results <- evaluate_models(rcf_tsibble, forecast_period)
results %>%
autoplot(rcf_tsibble, level = NULL) +
labs(title = "Forecasts from Auto ARIMA and ETS Models",
y = "KWH",
x = "Date") +
theme_minimal() +
facet_wrap(~ .model, scales = "free_y")
The auto ARIMA model had the lower RMSE value (I admit neither of these models is the 'best'), and the visual comparison also suggests picking ARIMA. I try again and create a forecast for 2014 (12 months): I fit the ARIMA model and forecast, then apply the inverse Box-Cox transformation to the forecasted values using lambda5, but I receive an error when displaying them because the output becomes null. However, the environment shows values for this forecast, which I captured in the image below. The values in the original dataset are in the millions, so maybe this isn't completely off (although I suspect it is not correct either).
forecast_period <- new_data(trcf_tsibble, n = 12) %>%
mutate(index = yearmonth("2014 Jan") + 0:11)
evaluate_arima <- function(data, new_data) {
auto_model <- data %>%
model(Auto_ARIMA = ARIMA(KWH))
auto_forecast <- auto_model %>%
forecast(new_data = new_data) %>%
mutate(model = "Auto ARIMA")
return(auto_forecast)
}
results <- evaluate_arima(trcf_tsibble, forecast_period)
results <- results %>%
  mutate(predicted_KWH = inv_box_cox(.mean, lambda5))  # inv_box_cox() already handles lambda == 0
# Displaying the forecasted KWH values for each month in 2014 errored, so a screenshot from the environment is included instead
knitr::include_graphics("arima results.png")
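As a workaround (a sketch; column names follow the objects above), the back-transformed 2014 forecasts can be tabulated and written to an Excel-readable file with writexl, which was loaded in Part A:
# Sketch: tabulate the 2014 forecasts and export them
results %>%
  as_tibble() %>%
  transmute(Date = as.Date(Date), predicted_KWH) %>%
  writexl::write_xlsx("KWH_Forecast_2014.xlsx")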
Part C consists of two data sets. These are simple 2-column sets, however they have different time stamps. Your optional assignment is to time-base sequence the data and aggregate based on hour (an example of what this looks like follows). Note: for multiple recordings within an hour, take the mean. Then determine whether the data is stationary and whether it can be forecast. If so, provide a week-forward forecast and present results via RPubs and .rmd, with the forecast in an Excel-readable file.
Load the data and convert the Excel serial date-times to POSIXct.
pipe_1 <- read_xlsx("Waterflow_Pipe1.xlsx") %>% mutate(`Date Time` = as.POSIXct(`Date Time` * (60 * 60 * 24), origin = "1899-12-30", tz = "GMT"))
pipe_2 <- read_xlsx("Waterflow_Pipe2.xlsx") %>%
mutate(`Date Time` = as.POSIXct(`Date Time` * (60 * 60 * 24), origin = "1899-12-30", tz = "GMT"))
Visualize pipe 1
ggplot(pipe_1, aes(`Date Time`, WaterFlow)) +
geom_line(color = "blue") +
ggtitle("Water Flow for Pipe 1") +
xlab("Date Time") +
ylab("Water Flow")
Data cleaning on pipe 1, and ensuring that multiple entries within the
same hour are aggregated.
pipe_1_clean <- pipe_1 %>%
mutate(Date = as.Date(`Date Time`),
Time = format(`Date Time`, "%H:00:00")) %>%
group_by(Date, Time) %>%
summarise(WaterFlow = mean(WaterFlow), .groups = "drop") %>%
ungroup() %>%
mutate(`Date Time` = as.POSIXct(paste(Date, Time))) %>%
select(`Date Time`, WaterFlow)
Create a time series object and view the STL decomposition.
pipe_1_ts <- ts(pipe_1_clean$WaterFlow, start = c(2015, 10, 23), frequency = 24)
pipe_1_decomp <- stl(pipe_1_ts, s.window = "periodic")
plot(pipe_1_decomp)
It looks stationary as is, but I double-check using the ADF and KPSS tests. The ADF p-value is below 0.05, meaning it is stationary, and the KPSS p-value is above 0.05, which also indicates stationarity.
check_stationarity <- function(ts) {
adf_result <- adf.test(ts)
kpss_result <- kpss.test(ts)
list(
ADF = list(statistic = adf_result$statistic, p.value = adf_result$p.value),
KPSS = list(statistic = kpss_result$statistic, p.value = kpss_result$p.value)
)
}
pipe_1_stationarity <- check_stationarity(pipe_1_ts)
## Warning in adf.test(ts): p-value smaller than printed p-value
## Warning in kpss.test(ts): p-value greater than printed p-value
print(pipe_1_stationarity)
## $ADF
## $ADF$statistic
## Dickey-Fuller
## -6.264898
##
## $ADF$p.value
## [1] 0.01
##
##
## $KPSS
## $KPSS$statistic
## KPSS Level
## 0.2465587
##
## $KPSS$p.value
## [1] 0.1
acf_plot <- ggAcf(pipe_1_ts) + ggtitle("ACF") + theme_minimal()
pacf_plot <- ggPacf(pipe_1_ts) + ggtitle("PACF") + theme_minimal()
combined_plot <- acf_plot / pacf_plot
print(combined_plot)
View the aggregated hourly series for pipe 1.
autoplot(pipe_1_ts)
Visualize pipe 2
ggplot(pipe_2, aes(`Date Time`, WaterFlow)) +
geom_line(color = "blue") +
ggtitle("Water Flow for Pipe 2") +
xlab("Date Time") +
ylab("Water Flow")
Data cleaning on pipe 2
pipe_2_clean <- pipe_2 %>%
mutate(Date = as.Date(`Date Time`),
Time = format(`Date Time`, "%H:00:00")) %>%
group_by(Date, Time) %>%
summarise(WaterFlow = mean(WaterFlow), .groups = "drop") %>%
ungroup() %>%
mutate(`Date Time` = as.POSIXct(paste(Date, Time))) %>%
select(`Date Time`, WaterFlow)
Create a time series object and view the STL decomposition.
pipe_2_ts <- ts(pipe_2_clean$WaterFlow, start = c(2015, 10, 23), frequency = 24)
pipe_2_decomp <- stl(pipe_2_ts, s.window = "periodic")
# Plot the decomposition
plot(pipe_2_decomp)
It looks stationary as is, but I double-check using the ADF and KPSS tests. The ADF p-value is below 0.05, meaning it is stationary, and the KPSS p-value is above 0.05, which also indicates stationarity.
# check_stationarity() is already defined above and is reused here
pipe_2_stationarity <- check_stationarity(pipe_2_ts)
## Warning in adf.test(ts): p-value smaller than printed p-value
print(pipe_2_stationarity)
## $ADF
## $ADF$statistic
## Dickey-Fuller
## -8.658365
##
## $ADF$p.value
## [1] 0.01
##
##
## $KPSS
## $KPSS$statistic
## KPSS Level
## 0.4204869
##
## $KPSS$p.value
## [1] 0.06832461
acf_plot <- ggAcf(pipe_2_ts) + ggtitle("ACF") + theme_minimal()
pacf_plot <- ggPacf(pipe_2_ts) + ggtitle("PACF") + theme_minimal()
combined_plot <- acf_plot / pacf_plot
print(combined_plot)
View the aggregated hourly series for pipe 2.
autoplot(pipe_2_ts)
## Conclusion
Both pipe 1 and pipe 2 are stationary. When it came to forecasting the next 7 days (168 hours), my first attempt failed: the models errored and the plots did not show the correct data.
The code below originally produced these errors and warnings:
Warning: 1 error encountered for ARIMA(WaterFlow) [1] .data contains implicit gaps in time. You should check your data and convert implicit gaps into explicit missing values using tsibble::fill_gaps() if required.
Warning: 1 error encountered for ETS(WaterFlow) [1] (same implicit-gaps message)
Warning: Ignoring unknown parameters: series and PI (repeated four times)
I did not have time to correct this originally; the fix (a sketch) is to build a tsibble from the hourly data, fill the implicit gaps, and drop the series/PI arguments that fable's plotting functions do not accept:
pipe_1_tsibble <- pipe_1_clean %>%
  as_tsibble(index = `Date Time`) %>%
  fill_gaps() %>%  # convert implicit gaps into explicit missing values
  mutate(WaterFlow = zoo::na.approx(WaterFlow, na.rm = FALSE))  # interpolate interior gaps
arima_model <- pipe_1_tsibble %>% model(ARIMA(WaterFlow))
ets_model <- pipe_1_tsibble %>% model(ETS(WaterFlow))
arima_forecast <- arima_model %>% forecast(h = "168 hours")  # 7 days
ets_forecast <- ets_model %>% forecast(h = "168 hours")
arima_forecast %>% autoplot(pipe_1_tsibble) +
  ggtitle("Auto ARIMA Forecast for Water Flow") + xlab("Date Time") + ylab("Water Flow")
ets_forecast %>% autoplot(pipe_1_tsibble) +
  ggtitle("Auto ETS Forecast for Water Flow") + xlab("Date Time") + ylab("Water Flow")
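To complete the assignment's deliverable, a sketch that exports the week-ahead point forecasts to an Excel-readable file (the file name is my choice):
# Sketch: export the 168-hour ARIMA point forecasts for pipe 1
arima_forecast %>%
  as_tibble() %>%
  select(`Date Time`, .mean) %>%
  writexl::write_xlsx("Waterflow_Pipe1_Forecast.xlsx")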