In part A, I want you to forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable ‘Cash’ is provided in hundreds of dollars; other than that it is straightforward. I am being somewhat ambiguous on purpose to give this a little more business feeling. Explain and demonstrate your process, techniques used and not used, and your actual forecast. I am giving you data via an Excel file; please provide your written report on your findings, visuals, discussion, and your R code via an RPubs link along with the actual .rmd file. Also please submit the forecast in an Excel-readable file.
Based on the request, the main goal is to forecast cash withdrawals from 4 different ATM machines for May 2010 using the provided dataset.
Let's start by loading the necessary libraries and reading the data from the Excel file.
library(fpp3)
library(readxl)
library(tidyr)
library(dplyr)
library(readr)
atm_raw <- read_excel("data/ATM624Data.xlsx")
str(atm_raw)
## tibble [1,474 × 3] (S3: tbl_df/tbl/data.frame)
## $ DATE: num [1:1474] 39934 39934 39935 39935 39936 ...
## $ ATM : chr [1:1474] "ATM1" "ATM2" "ATM1" "ATM2" ...
## $ Cash: num [1:1474] 96 107 82 89 85 90 90 55 99 79 ...
From the output we can see that the dataset contains the following columns:
DATE: The date of the cash withdrawal.
ATM: The identifier for each ATM machine.
Cash: The amount of cash withdrawn in hundreds of dollars.
Next, we will convert the ‘DATE’ column to a Date object and create a time series object for each ATM machine. The ‘DATE’ column is currently stored as an Excel serial number, so we need to convert it properly. Rows with missing Cash values will be removed.
atm_data <- atm_raw |>
mutate(DATE = as.Date(DATE, origin = "1899-12-30")) |>
as_tsibble(index = DATE, key = ATM) |>
filter(!is.na(Cash))
str(atm_data)
## tbl_ts [1,455 × 3] (S3: tbl_ts/tbl_df/tbl/data.frame)
## $ DATE: Date[1:1455], format: "2009-05-01" "2009-05-02" ...
## $ ATM : chr [1:1455] "ATM1" "ATM1" "ATM1" "ATM1" ...
## $ Cash: num [1:1455] 96 82 85 90 99 88 8 104 87 93 ...
## - attr(*, "key")= tibble [4 × 2] (S3: tbl_df/tbl/data.frame)
## ..$ ATM : chr [1:4] "ATM1" "ATM2" "ATM3" "ATM4"
## ..$ .rows: list<int> [1:4]
## .. ..$ : int [1:362] 1 2 3 4 5 6 7 8 9 10 ...
## .. ..$ : int [1:363] 363 364 365 366 367 368 369 370 371 372 ...
## .. ..$ : int [1:365] 726 727 728 729 730 731 732 733 734 735 ...
## .. ..$ : int [1:365] 1091 1092 1093 1094 1095 1096 1097 1098 1099 1100 ...
## .. ..@ ptype: int(0)
## ..- attr(*, ".drop")= logi TRUE
## - attr(*, "index")= chr "DATE"
## ..- attr(*, "ordered")= logi TRUE
## - attr(*, "index2")= chr "DATE"
## - attr(*, "interval")= interval [1:1] 1D
## ..@ .regular: logi TRUE
Now that we have the data in a tsibble format, let’s take a look at the summary statistics to understand the data better.
summary(atm_data)
## DATE ATM Cash
## Min. :2009-05-01 Length:1455 Min. : 0.0
## 1st Qu.:2009-08-01 Class :character 1st Qu.: 0.5
## Median :2009-10-31 Mode :character Median : 73.0
## Mean :2009-10-30 Mean : 155.6
## 3rd Qu.:2010-01-29 3rd Qu.: 114.0
## Max. :2010-04-30 Max. :10919.8
From the summary, we can see the range of dates, the number of observations per ATM, and the distribution of cash withdrawals.
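Before plotting, a quick tally confirms how many daily observations each ATM contributes (the key structure printed above suggests ATM1 and ATM2 are missing a few days):
# Observations per ATM after removing missing Cash values
atm_data |>
  as_tibble() |>
  count(ATM)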
First, I will create a plot to visualize the daily withdrawals.
atm_data |>
autoplot(Cash, alpha = 0.6) +
geom_smooth(method = "loess", se = FALSE, color = "darkorange", linetype = "dashed") +
facet_wrap(~ATM, scales = "free_y") +
labs(title = "Daily ATM Cash Withdrawals", y = "Hundreds of dollars")+
theme_minimal(base_size = 10)
The plot above shows the daily cash withdrawals for each ATM machine over time, along with a smoothed trend line to help visualize the overall trend. The x-axis represents the date, while the y-axis represents the cash withdrawn in hundreds of dollars. For each ATM, we can say:
ATM1 looks fairly stable overall; it doesn’t change much over time, so it’s close to stationary. There’s a clear bump in withdrawals around July and August, which might be due to seasonal demand.
ATM2 seems to drop a bit after October 2009, then slightly picks up again around April 2010. It’s got a gentle downward trend.
ATM3 hardly shows any activity through most of the year, then suddenly spikes at the end of April 2010. That could mean the ATM was new, reactivated, or just had one big withdrawal event.
ATM4 is more unpredictable; it’s mostly low and steady, but there’s a huge spike in February 2010, which could’ve been an unusual one-time event.
In conclusion, ATM1 and ATM2 show repeating weekly patterns, while ATM3 and ATM4 exhibit more irregular behavior; the weekly seasonal plot below makes this easier to see.
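To confirm that impression, a weekly seasonal plot overlays the weeks of data for each machine. A quick sketch (gaps are filled first so the index is regular):
# Overlay each week of data to expose the day-of-week pattern per ATM
atm_data |>
  fill_gaps(.full = TRUE) |>
  gg_season(Cash, period = "week") +
  labs(title = "Weekly pattern of ATM withdrawals", y = "Hundreds of dollars")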
Next, we will analyze the autocorrelation of the daily cash withdrawals for each ATM machine.
atm_filled <- atm_data |> group_by_key() |> fill_gaps(.full = TRUE)
atm_filled |>
ACF(Cash) |>
autoplot() +
facet_wrap(~ATM)+
theme_minimal(base_size = 15)
The ACF plots above show the autocorrelation of daily cash withdrawals for each ATM machine. We can observe the following:
ATM1 & ATM2: Both show strong, clear weekly seasonality, evidenced by significant spikes at lags 7, 14, and 21. This implies a highly predictable weekly pattern for cash withdrawals.
ATM3: Shows a significant correlation only at lag 1. This indicates a relationship with the previous day’s withdrawals but a complete lack of any weekly pattern.
ATM4: The series appears to be white noise (random). None of the autocorrelations are statistically significant, suggesting there is no predictable pattern or seasonality in the data.
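To back the white-noise reading for ATM4 with a formal test, a Ljung-Box test on the raw series is a quick check; a sketch, using lag 14 to cover two weekly cycles (a large p-value for ATM4 would corroborate the white-noise impression):
# Ljung-Box test on the raw daily series, one row per ATM;
# a large p-value means no significant autocorrelation
atm_data |>
  features(Cash, ljung_box, lag = 14)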
Based on the analysis above, we can choose appropriate forecasting models for each ATM machine by comparing several candidates: a seasonal naïve benchmark, ARIMA with and without an explicit seasonal specification, and ETS.
atm_models <- atm_filled |>
group_by_key() |>
model(
SNAIVE_week = SNAIVE(Cash ~ lag("week")),
ARIMA_auto = ARIMA(Cash),
ARIMA_seasonal = ARIMA(Cash ~ pdq(0,1,1) + PDQ(0,1,1,7)),
ETS_auto = ETS(Cash)
)
model_compare <- atm_models |> glance() |> arrange(ATM, AICc) |> filter(!is.na(AICc))
model_compare
## # A tibble: 10 × 12
## ATM .model sigma2 log_lik AIC AICc BIC ar_roots ma_roots MSE
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <list> <list> <dbl>
## 1 ATM1 ARIMA_seaso… 5.78e2 -1634. 3275. 3275. 3286. <cpl> <cpl> NA
## 2 ATM1 ARIMA_auto 6.02e2 -1641. 3290. 3290. 3306. <cpl> <cpl> NA
## 3 ATM2 ARIMA_seaso… 6.21e2 -1651. 3308. 3308. 3320. <cpl> <cpl> NA
## 4 ATM2 ARIMA_auto 6.49e2 -1659. 3325. 3325. 3336. <cpl> <cpl> NA
## 5 ATM3 ARIMA_seaso… 2.60e1 -1087. 2180. 2180. 2192. <cpl> <cpl> NA
## 6 ATM3 ARIMA_auto 2.54e1 -1109. 2223. 2223. 2235. <cpl> <cpl> NA
## 7 ATM3 ETS_auto 2.54e1 -1666. 3338. 3338. 3350. <NULL> <NULL> 2.53e1
## 8 ATM4 ARIMA_seaso… 4.14e5 -2834. 5673. 5673. 5685. <cpl> <cpl> NA
## 9 ATM4 ARIMA_auto 4.24e5 -2882. 5768. 5768. 5776. <cpl> <cpl> NA
## 10 ATM4 ETS_auto 1.66e0 -3335. 6691. 6691. 6730. <NULL> <NULL> 4.16e5
## # ℹ 2 more variables: AMSE <dbl>, MAE <dbl>
Let's select the best model for each ATM based on the lowest AICc values. Strictly speaking, AICc is only directly comparable between models of the same class (and, for ARIMA, the same order of differencing), so this comparison is best read as ranking the ARIMA variants against each other.
best_models <- model_compare |>
group_by(ATM) |>
slice_min(AICc, n = 1)
best_models
## # A tibble: 4 × 12
## # Groups: ATM [4]
## ATM .model sigma2 log_lik AIC AICc BIC ar_roots ma_roots MSE AMSE
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <list> <list> <dbl> <dbl>
## 1 ATM1 ARIMA_se… 5.78e2 -1634. 3275. 3275. 3286. <cpl> <cpl> NA NA
## 2 ATM2 ARIMA_se… 6.21e2 -1651. 3308. 3308. 3320. <cpl> <cpl> NA NA
## 3 ATM3 ARIMA_se… 2.60e1 -1087. 2180. 2180. 2192. <cpl> <cpl> NA NA
## 4 ATM4 ARIMA_se… 4.14e5 -2834. 5673. 5673. 5685. <cpl> <cpl> NA NA
## # ℹ 1 more variable: MAE <dbl>
The best models for each ATM based on AICc values are as follows:
ATM1: ARIMA_seasonal
ATM2: ARIMA_seasonal
ATM3: ARIMA_seasonal
ATM4: ARIMA_seasonal
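To see exactly which orders were fitted, report() prints the specification and coefficients; shown here for ATM1 (the same call works for the other ATMs):
# Print the fitted seasonal ARIMA orders and coefficients for ATM1
atm_models |>
  filter(ATM == "ATM1") |>
  select(ARIMA_seasonal) |>
  report()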
The residual diagnostics for each selected model will be checked to ensure the models are appropriate.
atm_models |>
filter(ATM == "ATM1") |>
select(ARIMA_seasonal) |>
gg_tsresiduals()
atm_models |>
filter(ATM == "ATM2") |>
select(ARIMA_seasonal) |>
gg_tsresiduals()
atm_models |>
filter(ATM == "ATM3") |>
select(ARIMA_seasonal) |>
gg_tsresiduals()
atm_models |>
filter(ATM == "ATM4") |>
select(ARIMA_seasonal) |>
gg_tsresiduals()
ATM1: There is no evidence of remaining autocorrelation, so the selected seasonal ARIMA model provides an adequate representation of the daily cash withdrawals for ATM1. It can therefore be used confidently for forecasting May 2010.
ATM2: There is no evidence of remaining autocorrelation or non-constant variance, and the distribution is approximately normal. Therefore, the selected seasonal ARIMA model provides a good fit for the ATM2 cash withdrawal data and is appropriate for forecasting.
ATM3: The selected model for ATM3 was a Seasonal ARIMA, chosen automatically based on the lowest AICc value. However, the data for ATM3 show very limited activity throughout most of the observed period, with nearly all withdrawal values close to zero and a single extreme spike near the end of April 2010.
As a result, while the model formally includes seasonal components, the residual diagnostics reveal that there is no genuine seasonal or trend pattern in this series. The residuals are mostly zero, uncorrelated, and dominated by that one large positive outlier.
This behavior indicates that the ATM was inactive for much of the time and only experienced a rare, isolated withdrawal event, rather than regular usage. Consequently, the ARIMA model provides a statistically valid fit but has limited forecasting power, as there is little structure for the model to capture.
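The visual diagnostics can be complemented with a formal Ljung-Box test on the residuals of all four selected models at once; a sketch, where dof = 2 reflects the two MA coefficients of the (0,1,1)(0,1,1)[7] specification used here:
# Ljung-Box test on model residuals; p-values above 0.05 indicate
# the residuals are indistinguishable from white noise
augment(atm_models) |>
  filter(.model == "ARIMA_seasonal") |>
  features(.innov, ljung_box, lag = 14, dof = 2)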
Let's generate forecasts for May 2010 using the selected models.
atm_daily <- atm_data |>
group_by_key() |>
fill_gaps(.full = TRUE) |>
arrange(ATM, DATE)
best_models <- atm_models |>
select(ARIMA_seasonal)
atm_forecasts <- best_models |>
forecast(h = "31 days")
atm_forecasts |> filter(ATM == "ATM1") |> autoplot(atm_daily) + labs(title = "ATM1 Forecast",
subtitle = "Using ARIMA seasonal model for each ATM",
y = "Cash (hundreds of $)",
x = "Date")+
theme_minimal(base_size = 10)
atm_forecasts |> filter(ATM == "ATM2") |> autoplot(atm_daily) + labs(title = "ATM2 Forecast",
subtitle = "Using ARIMA seasonal model for each ATM",
y = "Cash (hundreds of $)",
x = "Date")+
theme_minimal(base_size = 10)
atm_forecasts |> filter(ATM == "ATM3") |> autoplot(atm_daily) + labs(title = "ATM3 Forecast",
subtitle = "Using ARIMA seasonal model for each ATM",
y = "Cash (hundreds of $)",
x = "Date")+
theme_minimal(base_size = 10)
atm_forecasts |> filter(ATM == "ATM4") |> autoplot(atm_daily) + labs(title = "ATM4 Forecast",
subtitle = "Using ARIMA seasonal model for each ATM",
y = "Cash (hundreds of $)",
x = "Date")+
theme_minimal(base_size = 10)
The seasonal ARIMA models were used to forecast cash withdrawals for May 2010 across the four ATMs. The forecasts and their 80% and 95% prediction intervals are shown in the plots above.
ATM1: Withdrawals are relatively consistent, showing a stable weekly seasonal pattern with moderate fluctuations. The forecast for May 2010 suggests continued steady demand, with no major changes expected.
ATM2: Displays a similar seasonal pattern but with a slightly decreasing trend through early 2010. The forecast maintains this weekly rhythm, indicating regular customer behavior with minor variability.
ATM3: This ATM was mostly inactive for much of the year, followed by a sharp rise in withdrawals starting around April 2010. The model projects this increase continuing into May, although the wide prediction intervals reflect higher uncertainty due to limited past activity.
ATM4: Withdrawals are volatile, with one very large spike earlier in the series, likely an outlier. The model forecasts steady average activity for May, but the large residual variance left by that spike makes its prediction intervals the widest of the four machines.
Overall, the forecasts reflect distinct usage patterns at each ATM. Machines 1 and 2 show strong, regular seasonal behavior, while ATMs 3 and 4 exhibit more irregular or exceptional patterns. The ARIMA seasonal models successfully capture these differences, providing reasonable forecasts for short term cash demand planning.
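For planning purposes, the point forecasts can also be summed into a May 2010 total per ATM; a quick summary of the forecast object (values remain in hundreds of dollars):
# Total point forecast for May 2010 per ATM (hundreds of dollars)
atm_forecasts |>
  as_tibble() |>
  group_by(ATM) |>
  summarise(May_total = sum(.mean))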
Finally, we will extract the forecasted values for May 2010 and save them to an Excel file.
library(openxlsx)
atm_forecasts |>
as_tibble() |>
write.xlsx("Forecasts_ATM.xlsx")
Part B consists of a simple dataset of residential power usage for January 1998 until December 2013. Your assignment is to model these data and provide a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in Kilowatt hours; the rest is straightforward. Add this to your existing files above.
I will load the data and examine its structure.
library(readxl)
fp_data <- read_excel("data/ResidentialCustomerForecastLoad-624.xlsx")
str(fp_data)
## tibble [192 × 3] (S3: tbl_df/tbl/data.frame)
## $ CaseSequence: num [1:192] 733 734 735 736 737 738 739 740 741 742 ...
## $ YYYY-MMM : chr [1:192] "1998-Jan" "1998-Feb" "1998-Mar" "1998-Apr" ...
## $ KWH : num [1:192] 6862583 5838198 5420658 5010364 4665377 ...
The dataset contains the following columns:
CaseSequence: A sequential identifier for each observation.
YYYY-MMM: The year and month of the power consumption record.
KWH: The power consumption in Kilowatt hours.
Next, we will convert the ‘YYYY-MMM’ column to a proper monthly index and create a time series (tsibble) object.
library(fpp3)
fp_data <- fp_data |>
mutate(
Date = yearmonth(`YYYY-MMM`)
) |>
select(Date, KWH) |>
as_tsibble(index = Date)
str(fp_data)
## tbl_ts [192 × 2] (S3: tbl_ts/tbl_df/tbl/data.frame)
## $ Date: mth [1:192] 1998 Jan, 1998 Feb, 1998 Mar, 1998 Apr, 1998 May, 1998 Jun...
## $ KWH : num [1:192] 6862583 5838198 5420658 5010364 4665377 ...
## - attr(*, "key")= tibble [1 × 1] (S3: tbl_df/tbl/data.frame)
## ..$ .rows: list<int> [1:1]
## .. ..$ : int [1:192] 1 2 3 4 5 6 7 8 9 10 ...
## .. ..@ ptype: int(0)
## - attr(*, "index")= chr "Date"
## ..- attr(*, "ordered")= logi TRUE
## - attr(*, "index2")= chr "Date"
## - attr(*, "interval")= interval [1:1] 1M
## ..@ .regular: logi TRUE
Now let's visualize the time series data to understand its patterns.
fp_data |>
autoplot(KWH) +
labs(title = "Residential Power Usage (Jan 1998 – Dec 2013)",
y = "Kilowatt-hours (KWH)", x = "Year") +
theme_minimal(base_size = 14)
The series shows a clear upward trend over time with recurring seasonal patterns: higher usage in certain months, probably winter or summer peaks. This suggests the data are non-stationary and contain strong annual seasonality.
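An STL decomposition can make the trend and annual seasonal components explicit. A minimal sketch, assuming any missing KWH values are first interpolated with an ARIMA model (STL cannot handle NAs):
# Interpolate any missing KWH values via an ARIMA fit, then
# decompose the series into trend, seasonal, and remainder parts
fp_filled <- fp_data |>
  model(ARIMA(KWH)) |>
  interpolate(fp_data)

fp_filled |>
  model(STL(KWH ~ season(window = "periodic"))) |>
  components() |>
  autoplot()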
gg_season(fp_data, KWH) +
labs(title = "Seasonal Plot of Residential Power Usage",
y = "Kilowatt-hours (KWH)", x = "Month") +
theme_minimal(base_size = 14)
The seasonal plot confirms the presence of strong annual seasonality in power usage, with peaks typically occurring in the winter months (December to February) and again in mid-summer (July and August), and troughs in the shoulder months of spring and fall, consistent with heating and cooling demand.
ACF(fp_data, KWH) |> autoplot()
The ACF plot shows significant autocorrelations at multiple lags, particularly at lag 12, confirming the strong annual seasonality in the data.
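Unit-root based heuristics can quantify how much differencing the modeling step should expect; a sketch (the filter drops any missing KWH values before testing):
# Suggested number of ordinary and seasonal differences
fp_data |>
  filter(!is.na(KWH)) |>
  features(KWH, list(unitroot_ndiffs, unitroot_nsdiffs))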
fp_models <- fp_data |>
model(
ETS_auto = ETS(KWH),
ARIMA_auto = ARIMA(KWH),
ARIMA_seasonal = ARIMA(KWH ~ pdq(0,1,1) + PDQ(0,1,1,7)) # note: period 7 is carried over from Part A; monthly data has a seasonal period of 12
)
glance(fp_models)
## # A tibble: 2 × 8
## .model sigma2 log_lik AIC AICc BIC ar_roots ma_roots
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <list> <list>
## 1 ARIMA_auto 7.80e11 -2707. 5426. 5427. 5445. <cpl [24]> <cpl [2]>
## 2 ARIMA_seasonal 2.45e12 -2881. 5768. 5769. 5778. <cpl [0]> <cpl [8]>
Based on the AICc values from the model comparison, the ARIMA_auto model has the lowest AICc value, indicating it is the best fit for the data (the ETS model does not appear in the glance output, and AICc is in any case not directly comparable across model classes). I will proceed to evaluate the residuals.
best_name <- fp_models |> glance() |> slice_min(AICc) |> pull(.model)
best_fit <- fp_models |> select(!!best_name)
gg_tsresiduals(best_fit)
augment(fp_models) |>
group_by(.model) |>
features(.innov, ljung_box, lag = 24, dof = 3)
## # A tibble: 3 × 3
## .model lb_stat lb_pvalue
## <chr> <dbl> <dbl>
## 1 ARIMA_auto 14.0 0.868
## 2 ARIMA_seasonal 369. 0
## 3 ETS_auto NA NA
The residual diagnostics indicate that the ARIMA_auto model provides an excellent fit for the residential power consumption data. The residuals fluctuate randomly around zero with no visible trend or autocorrelation, and their distribution appears approximately normal. The Ljung-Box test (p = 0.868) confirms that the residuals behave like white noise, meaning the model has successfully captured the underlying trend and seasonality in the data. In contrast, the manually specified seasonal ARIMA model showed significant autocorrelation (p < 0.001), which is unsurprising given that its seasonal period of 7 does not match the monthly seasonality of 12. Overall, the ARIMA_auto model is statistically sound and well suited for forecasting monthly electricity usage for 2014.
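For completeness, report() reveals the exact specification the automatic search settled on:
# Print the automatically selected ARIMA orders and coefficients
fp_models |>
  select(ARIMA_auto) |>
  report()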
Finally, we will generate a 12-month forecast for 2014 using the selected ARIMA_auto model.
fp_forecast <- fp_models |>
select(ARIMA_auto) |>
forecast(h = "12 months")
autoplot(fp_forecast, fp_data) +
labs(title = "Forecast of Residential Power Usage for 2014 (ARIMA model)",
y = "KWH", x = "Year") +
theme_minimal(base_size = 14)
The ARIMA model forecast for 2014 shows that residential power usage keeps its familiar seasonal rhythm: demand peaks during the colder months and again around June, likely due to increased air conditioning use in the summer.
Overall consumption continues to trend slightly upward, suggesting steady growth in household electricity use over time. This could be driven by more homes being added to the grid or higher energy needs per household.
The shaded areas represent the 80% and 95% prediction intervals, showing that while the model’s predictions remain reliable, uncertainty naturally grows toward the end of the forecast period.
Overall, the ARIMA model captures both the winter and summer peaks quite well and provides a realistic outlook for residential electricity demand in 2014.
Finally, we will extract the forecasted values for 2014 and save them to an Excel file.
fp_forecast |>
as_tibble() |>
write.xlsx("Forecast_ResidentialPower.xlsx")
Part C consists of two data sets. These are simple 2 columns sets, however they have different time stamps. Your optional assignment is to time-base sequence the data and aggregate based on hour (example of what this looks like, follows). Note for multiple recordings within an hour, take the mean. Then to determine if the data is stationary and can it be forecast. If so, provide a week forward forecast and present results via Rpubs and .rmd and the forecast in an Excel readable file.
The Excel files will be loaded using the same libraries as before.
library(readxl)
wp_data1 <- read_excel("data/Waterflow_Pipe1.xlsx")
str(wp_data1)
## tibble [1,000 × 2] (S3: tbl_df/tbl/data.frame)
## $ Date Time: num [1:1000] 42300 42300 42300 42300 42300 ...
## $ WaterFlow: num [1:1000] 23.4 28 23.1 30 6 ...
wp_data2 <- read_excel("data/Waterflow_Pipe2.xlsx")
str(wp_data2)
## tibble [1,000 × 2] (S3: tbl_df/tbl/data.frame)
## $ Date Time: num [1:1000] 42300 42300 42300 42300 42300 ...
## $ WaterFlow: num [1:1000] 18.8 43.1 38 36.1 31.9 ...
Both Excel files use Excel’s date-time numeric format (days since 1899-12-30), so we’ll convert and rename the columns to make them easier to handle.
wp_data1 <- wp_data1|>
rename(DateTime = `Date Time`) |>
mutate(DateTime = as.POSIXct((DateTime - 25569) * 86400, origin = "1970-01-01"))
wp_data2 <- wp_data2|>
rename(DateTime = `Date Time`) |>
mutate(DateTime = as.POSIXct((DateTime - 25569) * 86400, origin = "1970-01-01"))
str(wp_data1)
## tibble [1,000 × 2] (S3: tbl_df/tbl/data.frame)
## $ DateTime : POSIXct[1:1000], format: "2015-10-22 20:24:06" "2015-10-22 20:40:02" ...
## $ WaterFlow: num [1:1000] 23.4 28 23.1 30 6 ...
str(wp_data2)
## tibble [1,000 × 2] (S3: tbl_df/tbl/data.frame)
## $ DateTime : POSIXct[1:1000], format: "2015-10-22 21:00:00" "2015-10-22 21:59:59" ...
## $ WaterFlow: num [1:1000] 18.8 43.1 38 36.1 31.9 ...
Next, we will aggregate the data to hourly frequency by taking the mean of multiple recordings within each hour. This step ensures that we have a consistent time base for analysis.
wp_hourly1 <- wp_data1 |>
mutate(Hour = lubridate::floor_date(DateTime, "hour")) |>
group_by(Hour) |>
summarise(WaterFlow = mean(WaterFlow, na.rm = TRUE), .groups="drop") |>
as_tsibble(index = Hour)
wp_hourly2 <- wp_data2 |>
mutate(Hour = lubridate::floor_date(DateTime, "hour")) |>
group_by(Hour) |>
summarise(WaterFlow = mean(WaterFlow, na.rm = TRUE), .groups="drop") |>
as_tsibble(index = Hour)
str(wp_hourly1)
## tbl_ts [236 × 2] (S3: tbl_ts/tbl_df/tbl/data.frame)
## $ Hour : POSIXct[1:236], format: "2015-10-22 20:00:00" "2015-10-22 21:00:00" ...
## $ WaterFlow: num [1:236] 26.1 18.9 15.2 23.1 15.5 ...
## - attr(*, "key")= tibble [1 × 1] (S3: tbl_df/tbl/data.frame)
## ..$ .rows: list<int> [1:1]
## .. ..$ : int [1:236] 1 2 3 4 5 6 7 8 9 10 ...
## .. ..@ ptype: int(0)
## - attr(*, "index")= chr "Hour"
## ..- attr(*, "ordered")= logi TRUE
## - attr(*, "index2")= chr "Hour"
## - attr(*, "interval")= interval [1:1] 1h
## ..@ .regular: logi TRUE
str(wp_hourly2)
## tbl_ts [667 × 2] (S3: tbl_ts/tbl_df/tbl/data.frame)
## $ Hour : POSIXct[1:667], format: "2015-10-22 21:00:00" "2015-10-22 23:00:00" ...
## $ WaterFlow: num [1:667] 30.9 38 34 28.2 18.3 ...
## - attr(*, "key")= tibble [1 × 1] (S3: tbl_df/tbl/data.frame)
## ..$ .rows: list<int> [1:1]
## .. ..$ : int [1:667] 1 2 3 4 5 6 7 8 9 10 ...
## .. ..@ ptype: int(0)
## - attr(*, "index")= chr "Hour"
## ..- attr(*, "ordered")= logi TRUE
## - attr(*, "index2")= chr "Hour"
## - attr(*, "interval")= interval [1:1] 1h
## ..@ .regular: logi TRUE
Next, the stationarity of the resulting hourly time series will be validated.
wp_hourly1 |> features(WaterFlow, unitroot_kpss)
## # A tibble: 1 × 2
## kpss_stat kpss_pvalue
## <dbl> <dbl>
## 1 0.247 0.1
wp_hourly2 |> features(WaterFlow, unitroot_kpss)
## # A tibble: 1 × 2
## kpss_stat kpss_pvalue
## <dbl> <dbl>
## 1 0.420 0.0683
Before building the forecasting models, I tested both water flow series for stationarity using the KPSS test. The goal was to check whether the data fluctuates around a constant mean without showing strong trends or changing variance over time.
The results show that Pipe 1 has a KPSS statistic of 0.25 (p = 0.10; KPSS p-values are truncated at 0.10) and Pipe 2 has 0.42 (p = 0.068). Since both p-values are above 0.05, we fail to reject the null hypothesis of stationarity, so both series can be treated as stationary and are reasonable candidates for forecasting.
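As a cross-check, the unit-root differencing heuristic should suggest zero differences if the series are indeed stationary:
# 0 suggested differences would corroborate the KPSS conclusion
wp_hourly1 |> features(WaterFlow, unitroot_ndiffs)
wp_hourly2 |> features(WaterFlow, unitroot_ndiffs)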
Now I will create plots to visualize the hourly water flow data for both pipes.
wp_hourly1 |>
autoplot(WaterFlow) + labs(title="Pipe 1 – Hourly flow", y="Flow", x="Time")
wp_hourly2 |>
autoplot(WaterFlow) + labs(title="Pipe 2 – Hourly flow", y="Flow", x="Time")
Pipe 1:
The flow stays fairly consistent, usually between 10 and 30 units. There’s a clear daily rhythm: flow tends to rise and fall around the same times each day, likely reflecting normal usage patterns.
Pipe 2: This pipe shows higher and more variable flow, generally between 20 and 60 units. It also follows a daily pattern but with more ups and downs compared to Pipe 1.
Next, I will build forecasting models for both water flow series using ARIMA, ETS, and seasonal naïve methods. Before modeling, any missing hourly values are filled using a combination of Last Observation Carried Forward (LOCF) and Next Observation Carried Backward (NOCB) techniques to maintain a continuous time series and prevent gaps in the data.
wp1_reg <- wp_hourly1 |>
fill_gaps(.full = TRUE) |>
arrange(Hour) |>
fill(WaterFlow, .direction = "down") |>
fill(WaterFlow, .direction = "up")
wp2_reg <- wp_hourly2 |>
fill_gaps(.full = TRUE) |>
arrange(Hour) |>
fill(WaterFlow, .direction = "down") |>
fill(WaterFlow, .direction = "up")
wp1_models <- wp1_reg |>
model(
ARIMA_auto = ARIMA(WaterFlow, stepwise = FALSE, approximation = FALSE),
ETS_auto = ETS(WaterFlow),
SNAIVE24 = SNAIVE(WaterFlow ~ lag("day"))
)
wp2_models <- wp2_reg |>
model(
ARIMA_auto = ARIMA(WaterFlow, stepwise = FALSE, approximation = FALSE),
ETS_auto = ETS(WaterFlow),
SNAIVE24 = SNAIVE(WaterFlow ~ lag("day"))
)
From the model comparison based on AICc values, ARIMA_auto is the best model for both pipes, as shown below:
glance(wp1_models) |> arrange(AICc)
## # A tibble: 3 × 11
## .model sigma2 log_lik AIC AICc BIC ar_roots ma_roots MSE AMSE MAE
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <list> <list> <dbl> <dbl> <dbl>
## 1 ARIMA_… 18.1 -688. 1380. 1380. 1387. <cpl> <cpl> NA NA NA
## 2 ETS_au… 0.0462 -1005. 2016. 2016. 2026. <NULL> <NULL> 18.1 18.0 0.169
## 3 SNAIVE… 36.2 NA NA NA NA <NULL> <NULL> NA NA NA
glance(wp2_models) |> arrange(AICc)
## # A tibble: 3 × 11
## .model sigma2 log_lik AIC AICc BIC ar_roots ma_roots MSE AMSE
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <list> <list> <dbl> <dbl>
## 1 ARIMA_auto 161. -3957. 7925. 7926. 7955. <cpl> <cpl [0]> NA NA
## 2 ETS_auto 0.111 -6034. 12073. 12073. 12088. <NULL> <NULL> 174. 174.
## 3 SNAIVE24 307. NA NA NA NA <NULL> <NULL> NA NA
## # ℹ 1 more variable: MAE <dbl>
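The comparison confirms ARIMA_auto has the lowest AICc for both pipes. A minimal sketch of the remaining step, the week-forward (168-hour) forecast and its export to an Excel-readable file, might look like this (the output file name is my own choice):
# Week-ahead forecasts from the best (lowest-AICc) model per pipe
wp1_fc <- wp1_models |> select(ARIMA_auto) |> forecast(h = "7 days")
wp2_fc <- wp2_models |> select(ARIMA_auto) |> forecast(h = "7 days")

# Combine the point forecasts and write them to one Excel file
fc_all <- bind_rows(
  wp1_fc |> as_tibble() |> select(Hour, Forecast = .mean) |> mutate(Pipe = "Pipe1"),
  wp2_fc |> as_tibble() |> select(Hour, Forecast = .mean) |> mutate(Pipe = "Pipe2")
)
write.xlsx(fc_all, "Forecast_WaterFlow.xlsx")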