This project consists of three forecasting tasks. Part A involves predicting the cash withdrawals from four ATMs for May 2010. Part B focuses on forecasting residential power consumption for the year 2014 using monthly data from 1998–2013. Part C is an optional bonus involving the hourly aggregation and forecasting of water flow data from two different pipes.
In this project, I’ll analyze historical ATM withdrawal data to forecast how much cash will be taken out of four different ATM machines for May 2010. I’ll start by exploring and preprocessing the data, then apply time series analysis techniques to build and evaluate forecasting models. Along the way, I’ll decide which methods work best for the data and explain why.
atm_data <- read_excel("C:/Users/Admin/Downloads/ATM624Data.xlsx")
head(atm_data)
## # A tibble: 6 × 3
## DATE ATM Cash
## <dbl> <chr> <dbl>
## 1 39934 ATM1 96
## 2 39934 ATM2 107
## 3 39935 ATM1 82
## 4 39935 ATM2 89
## 5 39936 ATM1 85
## 6 39936 ATM2 90
atm_data[!complete.cases(atm_data),]
## # A tibble: 19 × 3
## DATE ATM Cash
## <dbl> <chr> <dbl>
## 1 39977 ATM1 NA
## 2 39980 ATM1 NA
## 3 39982 ATM2 NA
## 4 39986 ATM1 NA
## 5 39988 ATM2 NA
## 6 40299 <NA> NA
## 7 40300 <NA> NA
## 8 40301 <NA> NA
## 9 40302 <NA> NA
## 10 40303 <NA> NA
## 11 40304 <NA> NA
## 12 40305 <NA> NA
## 13 40306 <NA> NA
## 14 40307 <NA> NA
## 15 40308 <NA> NA
## 16 40309 <NA> NA
## 17 40310 <NA> NA
## 18 40311 <NA> NA
## 19 40312 <NA> NA
atm_data$DATE<-as.Date(atm_data$DATE, origin = "1899-12-30")
atm <- atm_data %>%
mutate(DATE = as_date(DATE), Cash = as.integer(Cash))
str(atm)
## tibble [1,474 × 3] (S3: tbl_df/tbl/data.frame)
## $ DATE: Date[1:1474], format: "2009-05-01" "2009-05-01" ...
## $ ATM : chr [1:1474] "ATM1" "ATM2" "ATM1" "ATM2" ...
## $ Cash: int [1:1474] 96 107 82 89 85 90 90 55 99 79 ...
I’ll create a separate dataset to preserve the original data and perform all necessary transformations on the copy. First, I’ll convert the “DATE” variable to the appropriate date format. Since ATMs don’t dispense coins, I’ll also convert the “Cash” variable to an integer.
atm <- atm_data %>%
mutate(DATE = as_date(DATE), Cash = as.integer(Cash))
str(atm)
## tibble [1,474 × 3] (S3: tbl_df/tbl/data.frame)
## $ DATE: Date[1:1474], format: "2009-05-01" "2009-05-01" ...
## $ ATM : chr [1:1474] "ATM1" "ATM2" "ATM1" "ATM2" ...
## $ Cash: int [1:1474] 96 107 82 89 85 90 90 55 99 79 ...
Since we are looking at 4 separate ATMs and the data does not provide their location, they may have varying amounts of cash withdrawn at varying different days. We will work with each ATM separately and each will have its own forecast for the month of May in 2010.
# Prepare the data
atm1 <- atm %>%
filter(ATM == "ATM1") %>%
as_tsibble(index = DATE)
ggplot(atm1, aes(x = DATE, y = Cash)) +
geom_line(color = "hotpink2") +
labs(
title = "ATM1",
subtitle = "Cash withdrawals per day",
x = "DATE [1D]",
y = "Hundreds USD"
)
# Prepare the data
atm2 <- atm %>%
filter(ATM == "ATM2") %>%
as_tsibble(index = DATE)
# Generate the plot using ggplot2
ggplot(atm1, aes(x = DATE, y = Cash)) +
geom_line(color = "hotpink2") +
labs(
title = "ATM2",
subtitle = "Cash withdrawals per day",
x = "DATE [1D]",
y = "Hundreds USD"
)
# Prepare the data
atm3 <- atm %>%
filter(ATM == "ATM1") %>%
as_tsibble(index = DATE)
# Generate the plot using ggplot2
ggplot(atm3, aes(x = DATE, y = Cash)) +
geom_line(color = "hotpink2") +
labs(
title = "ATM3",
subtitle = "Cash withdrawals per day",
x = "DATE [1D]",
y = "Hundreds USD"
)
# Prepare the data
atm4 <- atm %>%
filter(ATM == "ATM4") %>%
as_tsibble(index = DATE)
ggplot(atm4, aes(x = DATE, y = Cash)) +
geom_line(color = "hotpink2") +
labs(
title = "ATM4",
subtitle = "Cash withdrawals per day",
x = "DATE [1D]",
y = "Hundreds USD"
)
Based on the time series plots, ATM1 and ATM2 show consistent variability with no clear trend and regular daily fluctuations, suggesting potential seasonality to explore through decomposition. Their patterns indicate steady usage over time. ATM3 also shows consistent activity, similar to ATM1, and does not appear limited to just the final days of data as initially suggested. Its pattern suggests a comparable usage profile. ATM4 stands out due to a major outlier, a sharp spike in withdrawals, while the rest of the data remains flat and low. This spike may be a data error or unusual event and should be investigated before modeling.
which(atm4$Cash %in% 10919)
## [1] 285
The outlier appears in row 285 of the ATM4 time series. Since it’s clearly an error, we’ll replace it using the average value for imputation.
After this adjustment, the time series now closely resembles the patterns seen in ATM1 and ATM2.
atm4_clean <- atm4 %>%
mutate(Cash = if_else(row_number() == 285, round(mean(Cash, na.rm = TRUE), 0), Cash))
# Convert to tsibble if needed
atm4_clean <- as_tsibble(atm4_clean, index = DATE)
# Plot the updated series
ggplot(atm4_clean, aes(x = DATE, y = Cash)) +
geom_line(color = "hotpink") +
labs(
title = "ATM4",
subtitle = "Cash withdrawals per day",
y = "Hundreds USD"
)
There also appear to be some missing values in the “Cash” column for both “ATM1” and “ATM2”.
sum(is.na(atm1$Cash))
## [1] 3
sum(is.na(atm2$Cash))
## [1] 2
sum(is.na(atm3$Cash))
## [1] 3
sum(is.na(atm4$Cash))
## [1] 0
The missing values are located in the following rows for each corresponding ATM.
which(is.na(atm1$Cash))
## [1] 44 47 53
which(is.na(atm2$Cash))
## [1] 49 55
hist(atm1$Cash,
col = "pink",
border = "black",
main = "Histogram of atm1$Cash",
xlab = "atm1$Cash",
ylab = "Frequency")
hist(atm2$Cash,
col = "pink",
border = "black",
main = "Histogram of atm2$Cash",
xlab = "atm2$Cash",
ylab = "Frequency")
Since the distribution of “Cash” for both ATMs appears fairly symmetrical with no strong skewness, the median is a suitable choice for imputation. It’s a more robust measure than the mean in the presence of outliers, helping maintain the overall integrity of the data. For this reason, I will use the median to fill in the missing values.
# Impute missing values with the median
atm1[44, 3] <- round(median(atm1$Cash, na.rm = TRUE), 0)
atm1[47, 3] <- round(median(atm1$Cash, na.rm = TRUE), 0)
atm1[53, 3] <- round(median(atm1$Cash, na.rm = TRUE), 0)
atm2[49, 3] <- round(median(atm2$Cash, na.rm = TRUE), 0)
atm2[55, 3] <- round(median(atm2$Cash, na.rm = TRUE), 0)
# Print updated rows for verification
atm1[c(44, 47, 53), ]
## # A tsibble: 3 x 3 [1D]
## DATE ATM Cash
## <date> <chr> <int>
## 1 2009-06-13 ATM1 91
## 2 2009-06-16 ATM1 91
## 3 2009-06-22 ATM1 91
atm2[c(49, 55), ]
## # A tsibble: 2 x 3 [1D]
## DATE ATM Cash
## <date> <chr> <int>
## 1 2009-06-18 ATM2 67
## 2 2009-06-24 ATM2 67
Additionally, we can apply a Box-Cox transformation to each series to help stabilize variance and simplify the structure of the data.
# ATM1
lambda1 <- atm1 %>%
features(Cash, features = guerrero) %>%
pull(lambda_guerrero)
plot_trans1 <- atm1 %>%
mutate(TransCash = box_cox(Cash, lambda1)) %>%
ggplot(aes(x = DATE, y = TransCash)) +
geom_line(color = "hotpink") +
labs(title = "ATM1 TRANSFORMED", subtitle = "Cash withdrawals per day", y = "USD")
# ATM2
lambda2 <- atm2 %>%
features(Cash, features = guerrero) %>%
pull(lambda_guerrero)
plot_trans2 <- atm2 %>%
mutate(TransCash = box_cox(Cash, lambda2)) %>%
ggplot(aes(x = DATE, y = TransCash)) +
geom_line(color = "hotpink") +
labs(title = "ATM2 TRANSFORMED", subtitle = "Cash withdrawals per day", y = "USD")
# ATM3
lambda3 <- atm3 %>%
features(Cash, features = guerrero) %>%
pull(lambda_guerrero)
plot_trans3 <- atm3 %>%
mutate(TransCash = box_cox(Cash, lambda3)) %>%
ggplot(aes(x = DATE, y = TransCash)) +
geom_line(color = "hotpink") +
labs(title = "ATM3 TRANSFORMED", subtitle = "Cash withdrawals per day", y = "USD")
# ATM4
lambda4 <- atm4 %>%
features(Cash, features = guerrero) %>%
pull(lambda_guerrero)
plot_trans4 <- atm4 %>%
mutate(TransCash = box_cox(Cash, lambda4)) %>%
ggplot(aes(x = DATE, y = TransCash)) +
geom_line(color = "hotpink") +
labs(title = "ATM4 TRANSFORMED", subtitle = "Cash withdrawals per day", y = "USD")
# Arrange plots
grid.arrange(plot_trans1, plot_trans2, plot_trans3, plot_trans4, nrow = 2)
The Box-Cox transformed time series plots for each ATM reveal several key insights. ATM1 and ATM3 show consistent patterns with stabilized variance after transformation. Both series exhibit clear weekly seasonality, where regular dips, likely corresponding to weekends, suggest predictable fluctuations in cash withdrawals. The transformation effectively reduced variability while preserving seasonal trends. ATM2 also reflects strong weekly seasonality but with higher volatility compared to ATM1 and ATM3. The Box-Cox method has smoothed this volatility to some extent, making the data more suitable for modeling. In contrast, ATM4 has lower overall withdrawal values and a more stable pattern. While the seasonality is less pronounced, it still exists, suggesting a steadier usage pattern for this machine. Overall, the transformations have successfully prepared each series for decomposition and forecasting by reducing skewness and stabilizing variance.
Next, we’ll examine the decomposition of each series to identify any strong seasonal patterns and assess whether differencing might be necessary for modeling. Since the magnitude of the seasonal components remains relatively constant over time, we can conclude that the series exhibit additive behavior.
atm1%>%
model(classical_decomposition(box_cox(Cash, lambda1), type="additive")) %>%
components () %>%
autoplot() +
labs(title="Classical additive decomposition of ATM1")
## Warning: Removed 3 rows containing missing values or values outside the scale range
## (`geom_line()`).
The decomposition of ATM1 shows a clear and consistent weekly seasonal pattern, along with a mildly fluctuating trend that rises mid-series and dips slightly toward the end. The residuals are stable, indicating that most of the variation is explained by the trend and seasonality. This supports the use of an additive model.
atm2%>%
model(classical_decomposition(box_cox(Cash, lambda2), type="additive")) %>%
components () %>%
autoplot() +
labs(title="Classical additive decomposition of ATM2")
## Warning: Removed 3 rows containing missing values or values outside the scale range
## (`geom_line()`).
ATM2’s decomposition shows a clear weekly seasonal pattern and a moderately stable trend with some fluctuations. The residuals are a bit more volatile than ATM1 but still support an additive model.
# Convert to time series object (assuming daily data)
cash_ts <- ts(na_interpolation(atm3$Cash), frequency = 7) # weekly seasonality
# Apply classical decomposition
cash_decomp <- decompose(cash_ts)
# Plot
plot(cash_decomp)
This decomposition plot shows ATM3’s time series has a strong seasonal
structure and a noticeable trend component. The residuals are relatively
stable, supporting the use of an additive model for forecasting.
atm4%>%
model(classical_decomposition(box_cox(Cash, lambda4), type="additive")) %>%
components () %>%
autoplot() +
labs(title="Classical additive decomposition of ATM4")
## Warning: Removed 3 rows containing missing values or values outside the scale range
## (`geom_line()`).
The decomposition for ATM4 indicates a stable seasonal component and a fluctuating but generally declining trend toward the end. The residuals show some variability, but the pattern is largely consistent, supporting the validity of the additive decomposition.
plot_acf1 <- atm1 %>%
ACF(box_cox(Cash, lambda1)) %>%
autoplot() + labs(title="Autocorrelation of Cash ATM1")
plot_acf2 <- atm2 %>%
ACF(box_cox(Cash, lambda2)) %>%
autoplot() + labs(title="Autocorrelation of Cash ATM2")
plot_acf3 <- atm3 %>%
ACF(box_cox(Cash, lambda3)) %>%
autoplot() + labs(title="Autocorrelation of Cash ATM3")
plot_acf4 <- atm4 %>%
ACF(box_cox(Cash, lambda4)) %>%
autoplot() + labs(title="Autocorrelation of Cash ATM4")
grid.arrange(plot_acf1, plot_acf2, plot_acf3, plot_acf4, nrow = 2)
The autocorrelation plots reveal clear weekly seasonality patterns for ATM1, ATM2, and ATM3, with noticeable spikes at lags 7, 14, and 21. This supports the need for seasonal differencing in these series. ATM4, on the other hand, shows weaker autocorrelation, suggesting either lower seasonality or more randomness in its daily withdrawal pattern.
atm1 %>%
features(box_cox(Cash, lambda1), unitroot_nsdiffs)
## # A tibble: 1 × 1
## nsdiffs
## <int>
## 1 1
atm2 %>%
features(box_cox(Cash, lambda1), unitroot_nsdiffs)
## # A tibble: 1 × 1
## nsdiffs
## <int>
## 1 1
atm3 %>%
features(box_cox(Cash, lambda1), unitroot_nsdiffs)
## # A tibble: 1 × 1
## nsdiffs
## <int>
## 1 1
atm4 %>%
features(box_cox(Cash, lambda1), unitroot_nsdiffs)
## # A tibble: 1 × 1
## nsdiffs
## <int>
## 1 0
Based on the output of the function, seasonal differencing is required for ATM1 and ATM2. Next, we’ll examine whether any further differencing is necessary.
atm1 %>%
features(difference(box_cox(Cash, lambda1), 7), unitroot_ndiffs)
## # A tibble: 1 × 1
## ndiffs
## <int>
## 1 0
atm2 %>%
features(difference(box_cox(Cash, lambda2), 7), unitroot_ndiffs)
## # A tibble: 1 × 1
## ndiffs
## <int>
## 1 0
atm3 %>%
features(difference(box_cox(Cash, lambda3), 7), unitroot_ndiffs)
## # A tibble: 1 × 1
## ndiffs
## <int>
## 1 0
atm4 %>%
features(difference(box_cox(Cash, lambda4), 7), unitroot_ndiffs)
## # A tibble: 1 × 1
## ndiffs
## <int>
## 1 0
No further differencing appears to be necessary. Let’s now review the ACF plots after applying seasonal differencing to ATM1 and ATM2.
atm1 %>%
ACF(difference(box_cox(Cash, lambda1), 7)) %>%
autoplot() + labs(title="Autocorrelation of Cash ATM1")
The ACF plot for ATM1 after differencing shows that most
autocorrelations fall within the confidence bounds, indicating that
seasonality has been adequately removed. The sharp negative spike at lag
7 suggests that weekly seasonality was successfully addressed through
differencing, and the residuals now resemble white noise, which is ideal
for ARIMA modeling.
atm2 %>%
ACF(difference(box_cox(Cash, lambda2), 7)) %>%
autoplot() + labs(title="Autocorrelation of Cash ATM2")
The ACF plot for ATM2 after differencing shows a significant drop at lag 7, indicating that weekly seasonality has been successfully accounted for. Most of the autocorrelations now fall within the confidence bands, suggesting that the series behaves more like white noise. This supports the idea that the data is now stationary and ready for ARIMA modeling.
atm1_fit <- atm1 %>%
model(ARIMA(box_cox(Cash, lambda1)))
report(atm1_fit)
## Series: Cash
## Model: ARIMA(0,0,2)(0,1,1)[7]
## Transformation: box_cox(Cash, lambda1)
##
## Coefficients:
## ma1 ma2 sma1
## 0.0994 -0.1141 -0.6456
## s.e. 0.0524 0.0527 0.0426
##
## sigma^2 estimated as 1.576: log likelihood=-589.77
## AIC=1187.54 AICc=1187.66 BIC=1203.07
atm1_fit %>%
gg_tsresiduals()
The residual diagnostics for ATM1 suggest a well-performing ARIMA model.
The residuals appear randomly scattered around zero with no visible
patterns, and the ACF plot shows no significant autocorrelation
remaining—indicating white noise. Additionally, the histogram of
residuals shows a distribution that is approximately normal, which
supports the model’s validity. Overall, this is a strong sign that the
ARIMA model is a good fit for ATM1’s cash withdrawal series.
atm2_fit <- atm2 %>%
model(ARIMA(box_cox(Cash, lambda2)))
report(atm2_fit)
## Series: Cash
## Model: ARIMA(2,0,2)(0,1,1)[7]
## Transformation: box_cox(Cash, lambda2)
##
## Coefficients:
## ar1 ar2 ma1 ma2 sma1
## -0.4247 -0.9074 0.468 0.8004 -0.7262
## s.e. 0.0587 0.0438 0.089 0.0584 0.0409
##
## sigma^2 estimated as 69.93: log likelihood=-1267.92
## AIC=2547.85 AICc=2548.09 BIC=2571.13
atm2_fit %>%
gg_tsresiduals()
The residual diagnostics for ATM2 indicate that the ARIMA model is
performing reasonably well. The residuals are centered around zero, and
the ACF plot shows no significant spikes beyond the confidence bounds,
implying that most autocorrelation has been removed. The histogram also
resembles a normal distribution, supporting the model’s assumption of
normally distributed errors. While there are some large residuals,
overall the model seems appropriate for forecasting ATM2.
atm3_fit <- atm3 %>%
model(NAIVE(box_cox(Cash, lambda3)))
report(atm3_fit)
## Series: Cash
## Model: NAIVE
## Transformation: box_cox(Cash, lambda3)
##
## sigma^2: 6.4317
atm3_fit %>%
gg_tsresiduals()
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).
## Warning: Removed 4 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 4 rows containing non-finite outside the scale range
## (`stat_bin()`).
The residual diagnostics for ATM3 show clear issues with the model fit. The top panel reveals a strong pattern in the residuals rather than random noise, indicating underfitting. The ACF plot also shows high autocorrelation at regular lags, suggesting that key seasonality or trend components remain. Additionally, the histogram of residuals shows a noticeable bimodal distribution, further pointing to model misspecification. This confirms that the naïve model used for ATM3 may not be suitable despite the limited data.
atm4_fit <- atm4 %>%
model(ARIMA(box_cox(Cash, lambda4)))
report(atm4_fit)
## Series: Cash
## Model: ARIMA(0,0,0)(2,0,0)[7] w/ mean
## Transformation: box_cox(Cash, lambda4)
##
## Coefficients:
## sar1 sar2 constant
## 0.2459 0.1925 2.5248
## s.e. 0.0521 0.0525 0.0480
##
## sigma^2 estimated as 0.8828: log likelihood=-494.26
## AIC=996.52 AICc=996.63 BIC=1012.12
atm4_fit %>%
gg_tsresiduals()
The residual diagnostics for ATM4 suggest a strong model fit. The residuals over time appear randomly scattered around zero, showing no obvious trends or patterns. The ACF plot displays minimal autocorrelation, with most values well within the confidence bounds, indicating the residuals resemble white noise. The histogram shows an approximately normal distribution centered around zero. Overall, these diagnostics suggest the model for ATM4 is well specified and captures the underlying structure of the data effectively.
Part B involves modeling a dataset of residential power usage from January 1998 to December 2013. My task is to analyze the data and generate monthly forecasts for the year 2014. The data is provided in a single file, with the variable ‘KWH’ representing power consumption in kilowatt-hours.
RCFL <- read_excel("C:/Users/Admin/Downloads/ResidentialCustomerForecastLoad-624.xlsx")
RCFL$Date <- as.yearmon(RCFL$`YYYY-MMM`, format = "%Y-%b")
ts_power <- ts(RCFL$KWH, start = c(1998, 1), frequency = 12)
autoplot(ts_power) + ggtitle("Residential Power Usage") + ylab("KWH")
This time series plot of residential power usage from January 1998 to
December 2013 reveals a strong seasonal pattern with yearly peaks and
troughs, typical of electricity consumption. There’s an evident upward
trend beginning in the late 2000s, suggesting increasing energy demand
over time. One notable anomaly occurs around 2010, where there’s a sharp
drop in usage—this could indicate missing data, a recording error, or a
real event worth further investigation. The series appears
non-stationary, making it a candidate for differencing or transformation
before forecasting.
head(RCFL)
## # A tibble: 6 × 4
## CaseSequence `YYYY-MMM` KWH Date
## <dbl> <chr> <dbl> <yearmon>
## 1 733 1998-Jan 6862583 Jan 1998
## 2 734 1998-Feb 5838198 Feb 1998
## 3 735 1998-Mar 5420658 Mar 1998
## 4 736 1998-Apr 5010364 Apr 1998
## 5 737 1998-May 4665377 May 1998
## 6 738 1998-Jun 6467147 Jun 1998
summary(RCFL)
## CaseSequence YYYY-MMM KWH Date
## Min. :733.0 Length:192 Min. : 770523 Min. :1998
## 1st Qu.:780.8 Class :character 1st Qu.: 5429912 1st Qu.:2002
## Median :828.5 Mode :character Median : 6283324 Median :2006
## Mean :828.5 Mean : 6502475 Mean :2006
## 3rd Qu.:876.2 3rd Qu.: 7620524 3rd Qu.:2010
## Max. :924.0 Max. :10655730 Max. :2014
## NA's :1
which(is.na(RCFL), arr.ind=TRUE)
## row col
## [1,] 129 3
slice(RCFL,c(127:132))
## # A tibble: 6 × 4
## CaseSequence `YYYY-MMM` KWH Date
## <dbl> <chr> <dbl> <yearmon>
## 1 859 2008-Jul 7643987 Jul 2008
## 2 860 2008-Aug 8037137 Aug 2008
## 3 861 2008-Sep NA Sep 2008
## 4 862 2008-Oct 5101803 Oct 2008
## 5 863 2008-Nov 4555602 Nov 2008
## 6 864 2008-Dec 6442746 Dec 2008
There appears to be a gap in the dataset around September 2008, indicating missing data for that period. Additionally, there are a few data points present near the year 2010, which may require further investigation to understand their context within the overall time series.
RCFL_data <- RCFL %>%
rename(Formatted_Date = 'YYYY-MMM') %>%
mutate(Formatted_Date = as.Date(paste0('01-', Formatted_Date), '%d-%Y-%b'))
min(RCFL_data$KWH, na.rm = TRUE)
## [1] 770523
RCFL_2 <-ts(RCFL[, "KWH"], start = c(1998, 1), frequency = 12)
ggseasonplot(RCFL_2)+ggtitle('USAGE BY YEAR FOR RESIDENTIAL POWER')
This monthly breakdown of residential power usage by year highlights a consistent seasonal trend: usage dips in the spring (April–May) and peaks in the summer (July–August) and winter (December–January). The visual also confirms the 2010 anomaly, with an unusually sharp drop in July, likely due to missing or erroneous data. Overall, there’s a clear upward shift in power usage in more recent years, particularly 2012 and 2013, supporting the notion of growing residential demand over time.
RCFL_data<- RCFL_data[-c(129,151),]
#Get average by month
RCFL_data$Month <- months(RCFL_data$Date)
aggregate(KWH ~ Month, RCFL_data, mean)
## Month KWH
## 1 April 5299734
## 2 August 8298211
## 3 December 6283175
## 4 February 6946556
## 5 January 8007422
## 6 July 7852521
## 7 June 6536092
## 8 March 5971450
## 9 May 5039034
## 10 November 4953619
## 11 October 5657164
## 12 September 7702333
RCFL$KWH[is.na(RCFL$KWH)] = median(RCFL$KWH, na.rm=TRUE)
summary(RCFL)
## CaseSequence YYYY-MMM KWH Date
## Min. :733.0 Length:192 Min. : 770523 Min. :1998
## 1st Qu.:780.8 Class :character 1st Qu.: 5434539 1st Qu.:2002
## Median :828.5 Mode :character Median : 6283324 Median :2006
## Mean :828.5 Mean : 6501333 Mean :2006
## 3rd Qu.:876.2 3rd Qu.: 7608792 3rd Qu.:2010
## Max. :924.0 Max. :10655730 Max. :2014
RCFL_ts <- ts(RCFL$KWH, start=c(1998,1), frequency = 12)
RCFL_ts
## Jan Feb Mar Apr May Jun Jul Aug
## 1998 6862583 5838198 5420658 5010364 4665377 6467147 8914755 8607428
## 1999 7183759 5759262 4847656 5306592 4426794 5500901 7444416 7564391
## 2000 7068296 5876083 4807961 4873080 5050891 7092865 6862662 7517830
## 2001 7538529 6602448 5779180 4835210 4787904 6283324 7855129 8450717
## 2002 7099063 6413429 5839514 5371604 5439166 5850383 7039702 8058748
## 2003 7256079 6190517 6120626 4885643 5296096 6051571 6900676 8476499
## 2004 7584596 6560742 6526586 4831688 4878262 6421614 7307931 7309774
## 2005 8225477 6564338 5581725 5563071 4453983 5900212 8337998 7786659
## 2006 7793358 5914945 5819734 5255988 4740588 7052275 7945564 8241110
## 2007 8031295 7928337 6443170 4841979 4862847 5022647 6426220 7447146
## 2008 7964293 7597060 6085644 5352359 4608528 6548439 7643987 8037137
## 2009 8072330 6976800 5691452 5531616 5264439 5804433 7713260 8350517
## 2010 9397357 8390677 7347915 5776131 4919289 6696292 770523 7922701
## 2011 8394747 8898062 6356903 5685227 5506308 8037779 10093343 10308076
## 2012 8991267 7952204 6356961 5569828 5783598 7926956 8886851 9612423
## 2013 10655730 7681798 6517514 6105359 5940475 7920627 8415321 9080226
## Sep Oct Nov Dec
## 1998 6989888 6345620 4640410 4693479
## 1999 7899368 5358314 4436269 4419229
## 2000 8912169 5844352 5041769 6220334
## 2001 7112069 5242535 4461979 5240995
## 2002 8245227 5865014 4908979 5779958
## 2003 7791791 5344613 4913707 5756193
## 2004 6690366 5444948 4824940 5791208
## 2005 7057213 6694523 4313019 6181548
## 2006 7296355 5104799 4458429 6226214
## 2007 7666970 5785964 4907057 6047292
## 2008 6283324 5101803 4555602 6442746
## 2009 7583146 5566075 5339890 7089880
## 2010 7819472 5875917 4800733 6152583
## 2011 8943599 5603920 6154138 8273142
## 2012 7559148 5576852 5731899 6609694
## 2013 7968220 5759367 5769083 9606304
ggtsdisplay(RCFL_ts, main="Monthly Power Consumption before transform")
The time series plot clearly shows seasonal peaks and troughs in monthly
residential power consumption, with a noticeable spike in the summer and
winter months each year. There’s also a visible dip in 2010, likely due
to data issues or an outlier. The ACF plot shows strong autocorrelation
at lag 12 and its multiples, confirming a strong yearly seasonal
pattern. The PACF plot highlights significant lags in the first few
months, suggesting both autoregressive and seasonal components. These
patterns indicate the need for seasonal differencing and support the use
of a seasonal ARIMA model to forecast future consumption.
RCFLS_BXCX <- RCFL_ts %>% BoxCox(lambda= 'auto')
ggtsdisplay(RCFLS_BXCX, main='MONTHLY POWER CONSUMER BXCX')
After applying the Box-Cox transformation, the variance in the power
consumption time series appears more stable across time. The overall
seasonal pattern remains visible, with peaks and troughs repeating
consistently each year.
In the ACF plot, we still see significant autocorrelation at seasonal lags (multiples of 12), suggesting persistent seasonality. The PACF plot shows a few significant spikes at lower lags, indicating short-term dependencies that could be captured by an AR or MA component. This transformation sets a strong foundation for fitting a seasonal ARIMA model with better residual diagnostics.
ggseasonplot(RCFLS_BXCX)
This seasonal plot highlights the strong seasonal fluctuations in
residential power usage, with notable peaks around the middle of the
year (July) and lower consumption in winter months. The pattern appears
consistent across different years, with a clear seasonal cycle
influencing power consumption trends.
summary(RCFLS_BXCX)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 29.80 38.69 39.42 39.47 40.41 42.20
ggsubseriesplot(RCFLS_BXCX)
This plot depicts the seasonality in the transformed residential power
consumption data. The horizontal blue lines represent the seasonal
components, indicating a repeating pattern of high and low usage over
the months. It is evident that there is a significant peak around July,
followed by a sharp decline, and then a rise again towards the end of
the year, confirming the strong seasonal nature of the power
consumption. The overall trend seems to exhibit an upward movement over
time as well.
ggAcf(RCFLS_BXCX)
To take a closer look at the residuals and assess whether they exhibit autocorrelation, we’ll apply the Box test. This will help us determine if the residuals behave like white noise, which is a key assumption for validating our time series models.
Box.test(RCFLS_BXCX, type = c("Ljung-Box"))
##
## Box-Ljung test
##
## data: RCFLS_BXCX
## X-squared = 16.556, df = 1, p-value = 4.723e-05
summary(RCFLS_BXCX)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 29.80 38.69 39.42 39.47 40.41 42.20
boxplot(RCFLS_BXCX~cycle(RCFLS_BXCX))
The box plot above displays the distribution of residential power usage
data (RCFLS_BXCX) across the months. The boxes represent the
interquartile range (IQR), where the middle 50% of the data lies, while
the whiskers show the range of the data. The plot reveals significant
seasonal variation in power usage, with some months exhibiting higher
consumption than others. There are notable outliers in certain months,
such as June and December, which may indicate exceptional consumption
patterns or anomalies. The general trend shows that power consumption
tends to increase during the summer months (June to August) and decrease
towards the winter months (December).
print(paste0("Suggested # of diff: ", ndiffs(RCFLS_BXCX)))
## [1] "Suggested # of diff: 1"
print(paste0("DIFF REQUIRED (SEASIONAL): ", ndiffs(diff(RCFLS_BXCX, lag=12))))
## [1] "DIFF REQUIRED (SEASIONAL): 0"
RCFL_PWR_DIFF <- RCFLS_BXCX %>% diff(lag = 12)
ggtsdisplay(RCFL_PWR_DIFF, main= "Monthly power consumption BXCX AND DIFF")
The plot above shows the monthly power consumption after differencing. This transformation seems to remove some of the long-term trend and cyclical patterns, making the data appear more stationary. The ACF (Autocorrelation Function) and PACF (Partial Autocorrelation Function) plots show that the data now exhibit significant spikes at specific lags, especially at lag 12, indicating a yearly seasonality. This suggests that seasonal differencing might be required to better model the data. The transformation and the resulting ACF/PACF plots indicate that further steps like seasonal differencing could improve the model.
ggseasonplot(RCFL_PWR_DIFF,polar = TRUE)+ggtitle('Residential Power Usage by Year')
The radial plot above shows the monthly power usage by year,
highlighting the seasonal trend in power consumption. Each line
represents a different year, and the pattern clearly illustrates the
cyclical nature of power usage, with peaks typically occurring in the
summer months (around July). The seasonal fluctuations in usage are
evident, with a dip in consumption during the winter months and a rise
during the warmer months. This visualization helps in identifying the
seasonal behavior and trends across different years.
plot(RCFL_PWR_DIFF)
The plot you have shared represents the differenced residential power usage data, which shows how the data behaves after seasonal adjustments. The sharp spike around 2010 indicates an anomaly in the dataset, which may be related to a significant event or change during that period. After the spike, the data seems to return to a more stable pattern, suggesting that the differencing effectively handled the seasonal trend, making it easier to detect unusual events.
autoplot(RCFL_PWR_DIFF, series="Data")+
autolayer(ma(RCFL_PWR_DIFF, 12), series = "12 MTH Moving Avg")+ ggtitle("2014 MVING AVG")
## Warning: Removed 12 rows containing missing values or values outside the scale range
## (`geom_line()`).
The plot displays the 12-month moving average of the differenced residential power consumption data for 2014. The moving average (shown in red) smooths out fluctuations in the original data (shown in blue), making the underlying trend more visible. The sharp spike around 2010 in the original data is still present but less pronounced in the moving average, highlighting how the longer-term trend stabilizes despite short-term anomalies.
#stlf - etsmodel
RCFLS_STL <- stlf(RCFL_PWR_DIFF, damped=FALSE, s.window = "periodic", robust=TRUE, h = 12)
# forecast plot
autoplot(RCFLS_STL) + autolayer(fitted(RCFLS_STL))
The plot shows the forecast for residential power consumption after
applying the STL (Seasonal and Trend decomposition using Loess) method
combined with an ETS (Exponential Smoothing State Space Model) model.
The black line represents the original differenced data, while the red
line shows the fitted values from the combined STL + ETS model. The
shaded area highlights the forecasted values, showing the expected range
for future values in 2014. The model successfully captures the trend
while smoothing out fluctuations. However, the sharp anomaly in 2010 is
still visible in the forecast, suggesting it may need additional
adjustment or consideration.
#stlf - etsmodel estimation --- M, Ad, N is chosen.
RCFL_STL_DP <- stlf(RCFL_PWR_DIFF, damped=TRUE, s.window = "periodic", robust=TRUE, h = 12)
# forecast plot
autoplot(RCFL_STL_DP) + autolayer(fitted(RCFL_STL_DP))
The updated forecast plot shows the results from the STL (Seasonal and
Trend decomposition using Loess) method combined with an ETS
(Exponential Smoothing State Space Model) with additive errors, an
additive trend, and no seasonal dampening (ETS(A,Ad,N)). The black line
represents the original differenced data, and the red line shows the
fitted values from the combined model. The forecasted values for 2014
are represented by the shaded blue area.
This model seems to better capture the long-term trend while smoothing out fluctuations compared to the previous model. However, the sharp anomaly around 2010 remains prominent, indicating that the model may still need adjustment for extreme outliers or structural changes.
# auto.arima
arima_model <- auto.arima(RCFL_PWR_DIFF)
# forecast values
arima_model <- forecast(arima_model, h=20)
# forecast plot
autoplot(arima_model) + autolayer(fitted(arima_model))
The forecast plot above displays the results of applying an
ARIMA(0,0,1)(0,0,2)[12] model with a non-zero mean to the data. In this
case, the black line represents the original differenced data, and the
red line shows the fitted values from the ARIMA model. The shaded blue
area indicates the forecasted values for 2014.
This ARIMA model captures both the seasonal and non-seasonal components of the data well, particularly in the 2014 forecast range. While the sharp anomaly around 2010 remains visible, the forecasted values appear smoother and less volatile compared to earlier models.
RCFL_ETS<- ets(RCFL_PWR_DIFF)
# forecast plot
autoplot(forecast(RCFL_ETS
, h=12)) + autolayer(fitted(RCFL_ETS
))
It seems you have uploaded another image. If you’d like me to interpret
or help with any specific analysis, feel free to let me know what you’re
looking for.
RCFL_FCST_PWR_S <- ses(RCFL_PWR_DIFF, h=12)
autoplot(RCFL_FCST_PWR_S)+
autolayer(fitted(RCFL_FCST_PWR_S), series="Fitted")
accuracy(RCFLS_STL)
## ME RMSE MAE MPE MAPE MASE
## Training set 0.0005455413 1.348221 0.5931651 -1360.432 1605.179 0.5404898
## ACF1
## Training set 0.1194029
checkresiduals(RCFLS_STL)
##
## Ljung-Box test
##
## data: Residuals from STL + ETS(A,N,N)
## Q* = 65.794, df = 24, p-value = 9.303e-06
##
## Model df: 0. Total lags used: 24
The residuals from the STL + ETS(A,N,N) model are centered around zero, with no significant autocorrelation or patterns, indicating a good model fit. The residuals appear to be nearly normally distributed, confirming that the model effectively captures the trend and seasonality in the data.
summary(RCFLS_STL)
##
## Forecast method: STL + ETS(A,N,N)
##
## Model Information:
## ETS(A,N,N)
##
## Call:
## ets(y = na.interp(x), model = etsmodel, damped = FALSE, allow.multiplicative.trend = allow.multiplicative.trend)
##
## Smoothing parameters:
## alpha = 1e-04
##
## Initial states:
## l = 0.0713
##
## sigma: 1.3558
##
## AIC AICc BIC
## 1048.295 1048.432 1057.874
##
## Error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 0.0005455413 1.348221 0.5931651 -1360.432 1605.179 0.5404898
## ACF1
## Training set 0.1194029
##
## Forecasts:
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## Jan 2014 0.17305758 -1.564437 1.910552 -2.484211 2.830326
## Feb 2014 0.06130592 -1.676189 1.798801 -2.595963 2.718575
## Mar 2014 0.12236178 -1.615133 1.859857 -2.534907 2.779631
## Apr 2014 0.08210098 -1.655394 1.819596 -2.575168 2.739370
## May 2014 0.12746350 -1.610031 1.864958 -2.529805 2.784732
## Jun 2014 0.14679543 -1.590699 1.884290 -2.510474 2.804064
## Jul 2014 -0.19208644 -1.929581 1.545408 -2.849355 2.465183
## Aug 2014 -0.02245609 -1.759951 1.715039 -2.679725 2.634813
## Sep 2014 0.16034813 -1.577147 1.897843 -2.496921 2.817617
## Oct 2014 0.05745299 -1.680042 1.794948 -2.599816 2.714722
## Nov 2014 0.08070004 -1.656795 1.818195 -2.576569 2.737969
## Dec 2014 0.05898262 -1.678512 1.796477 -2.598286 2.716252
accuracy(RCFL_STL_DP)
## ME RMSE MAE MPE MAPE MASE
## Training set -0.007629573 1.349456 0.5947813 -1328.026 1577.668 0.5419625
## ACF1
## Training set 0.120852
checkresiduals(RCFL_STL_DP)
##
## Ljung-Box test
##
## data: Residuals from STL + ETS(A,Ad,N)
## Q* = 65.735, df = 24, p-value = 9.492e-06
##
## Model df: 0. Total lags used: 24
The residuals from the STL + ETS(A,Ad,N) model show that the data is centered around zero, indicating a good model fit. However, there is a slight non-normality in the residuals’ distribution with some skewness. The autocorrelation plot suggests some remaining seasonality in the residuals, especially with significant spikes at specific lags. This suggests that the model could potentially be improved further by addressing the seasonal components more explicitly.
summary(RCFL_STL_DP)
##
## Forecast method: STL + ETS(A,Ad,N)
##
## Model Information:
## ETS(A,Ad,N)
##
## Call:
## ets(y = na.interp(x), model = etsmodel, damped = TRUE, allow.multiplicative.trend = allow.multiplicative.trend)
##
## Smoothing parameters:
## alpha = 1e-04
## beta = 1e-04
## phi = 0.9622
##
## Initial states:
## l = 0.1056
## b = -6e-04
##
## sigma: 1.3686
##
## AIC AICc BIC
## 1054.625 1055.110 1073.783
##
## Error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -0.007629573 1.349456 0.5947813 -1328.026 1577.668 0.5419625
## ACF1
## Training set 0.120852
##
## Forecasts:
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## Jan 2014 0.17900301 -1.574926 1.932932 -2.503400 2.861406
## Feb 2014 0.06761419 -1.686315 1.821544 -2.614789 2.750018
## Mar 2014 0.12901915 -1.624910 1.882949 -2.553385 2.811423
## Apr 2014 0.08909425 -1.664835 1.843024 -2.593310 2.771498
## May 2014 0.13477996 -1.619150 1.888710 -2.547624 2.817184
## Jun 2014 0.15442285 -1.599507 1.908353 -2.527982 2.836828
## Jul 2014 -0.18415983 -1.938091 1.569771 -2.866565 2.498246
## Aug 2014 -0.01424159 -1.768173 1.739690 -2.696648 2.668165
## Sep 2014 0.16883961 -1.585092 1.922772 -2.513568 2.851247
## Oct 2014 0.06621097 -1.687722 1.820144 -2.616198 2.748620
## Nov 2014 0.08971444 -1.664219 1.843648 -2.592696 2.772125
## Dec 2014 0.06824374 -1.685691 1.822179 -2.614168 2.750656
accuracy(arima_model)
## ME RMSE MAE MPE MAPE MASE
## Training set -0.02554754 0.9666583 0.4541852 -1920.696 2160.894 0.4138519
## ACF1
## Training set 0.006469743
checkresiduals(arima_model)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,0,1)(0,0,2)[12] with non-zero mean
## Q* = 6.8185, df = 21, p-value = 0.9985
##
## Model df: 3. Total lags used: 24
The residuals from the ARIMA(0,0,1)(0,0,2)[12] model with non-zero mean show that the model is not perfect. The residual plot reveals some outliers and a potential structural break around 2010. The autocorrelation function (ACF) suggests that there is still some autocorrelation at different lags, particularly at the seasonal lag, indicating that the model may not fully capture the seasonality. Additionally, the histogram of residuals shows some deviation from normality, especially with a long tail. These observations suggest that the model might need further adjustments to improve its fit.
summary(arima_model)
##
## Forecast method: ARIMA(0,0,1)(0,0,2)[12] with non-zero mean
##
## Model Information:
## Series: RCFL_PWR_DIFF
## ARIMA(0,0,1)(0,0,2)[12] with non-zero mean
##
## Coefficients:
## ma1 sma1 sma2 mean
## 0.1195 -0.9264 0.0890 0.0616
## s.e. 0.0712 0.0810 0.0811 0.0198
##
## sigma^2 = 0.9557: log likelihood = -257.11
## AIC=524.23 AICc=524.57 BIC=540.19
##
## Error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -0.02554754 0.9666583 0.4541852 -1920.696 2160.894 0.4138519
## ACF1
## Training set 0.006469743
##
## Forecasts:
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## Jan 2014 -0.71170867 -1.964897 0.5414796 -2.628295 1.2048774
## Feb 2014 0.20176196 -1.060338 1.4638615 -1.728453 2.1319766
## Mar 2014 0.10559912 -1.156500 1.3676986 -1.824616 2.0358137
## Apr 2014 -0.21986269 -1.481962 1.0422368 -2.150077 1.7103519
## May 2014 -0.33880539 -1.600905 0.9232941 -2.269020 1.5914092
## Jun 2014 -0.43704168 -1.699141 0.8250578 -2.367256 1.4931729
## Jul 2014 -1.07257955 -2.334679 0.1895200 -3.002794 0.8576351
## Aug 2014 0.06430436 -1.197795 1.3264039 -1.865910 1.9945190
## Sep 2014 0.14938779 -1.112712 1.4114873 -1.780827 2.0796024
## Oct 2014 0.22911025 -1.032989 1.4912097 -1.701104 2.1593249
## Nov 2014 -0.19797910 -1.460079 1.0641204 -2.128194 1.7322355
## Dec 2014 -1.54532103 -2.807419 -0.2832232 -3.475533 0.3848910
## Jan 2015 -0.02932622 -1.743669 1.6850167 -2.651187 2.5925349
## Feb 2015 0.05006537 -1.669873 1.7700036 -2.580353 2.6804839
## Mar 2015 0.05625769 -1.663681 1.7761959 -2.574161 2.6866762
## Apr 2015 0.08773329 -1.632205 1.8076715 -2.542685 2.7181518
## May 2015 0.10404245 -1.615896 1.8239807 -2.526376 2.7344610
## Jun 2015 0.11673378 -1.603204 1.8366720 -2.513685 2.7471523
## Jul 2015 0.18863975 -1.531298 1.9085780 -2.441779 2.8190583
## Aug 2015 0.06605819 -1.653880 1.7859964 -2.564360 2.6964767
accuracy(RCFL_ETS)
## ME RMSE MAE MPE MAPE MASE
## Training set -1.229798e-05 1.346981 0.6002461 -768.1263 994.3076 0.546942
## ACF1
## Training set 0.1202904
checkresiduals(RCFL_ETS)
##
## Ljung-Box test
##
## data: Residuals from ETS(A,N,N)
## Q* = 66.674, df = 24, p-value = 6.893e-06
##
## Model df: 0. Total lags used: 24
The residuals from the ETS(A,N,N) model also show some patterns that suggest room for improvement. The residuals plot shows that the series has some sharp fluctuations, especially around 2010, indicating that the model doesn’t fully capture the underlying dynamics of the data. The autocorrelation function (ACF) of the residuals suggests the presence of seasonality, as there are clear correlations at seasonal lags, which implies that the model could not account for these seasonal patterns effectively. Additionally, the histogram of the residuals suggests a non-normal distribution, with a long tail, indicating that the model may not be fully capturing all aspects of the data’s structure. Thus, further model refinement may be necessary.
summary(RCFL_ETS)
## ETS(A,N,N)
##
## Call:
## ets(y = RCFL_PWR_DIFF)
##
## Smoothing parameters:
## alpha = 1e-04
##
## Initial states:
## l = 0.0719
##
## sigma: 1.3545
##
## AIC AICc BIC
## 1047.964 1048.100 1057.543
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -1.229798e-05 1.346981 0.6002461 -768.1263 994.3076 0.546942
## ACF1
## Training set 0.1202904
accuracy(RCFL_FCST_PWR_S)
## ME RMSE MAE MPE MAPE MASE
## Training set 0.0002402688 1.346981 0.6002376 -765.1024 991.1566 0.5469343
## ACF1
## Training set 0.1202903
checkresiduals(RCFL_FCST_PWR_S)
##
## Ljung-Box test
##
## data: Residuals from Simple exponential smoothing
## Q* = 66.674, df = 24, p-value = 6.893e-06
##
## Model df: 0. Total lags used: 24
The residuals from the Simple Exponential Smoothing (SES) model show some fluctuations around zero, with noticeable peaks around 2010. The ACF plot reveals autocorrelation at lag 12, indicating residual seasonality. The histogram is right-skewed, and the Q-Q plot shows deviations from normality, suggesting that the SES model doesn’t fully capture all patterns in the data, particularly seasonal or irregular variations. This implies that while the SES model is useful, it may not be sufficient for accurate long-term forecasting.
summary(RCFL_FCST_PWR_S)
##
## Forecast method: Simple exponential smoothing
##
## Model Information:
## Simple exponential smoothing
##
## Call:
## ses(y = RCFL_PWR_DIFF, h = 12)
##
## Smoothing parameters:
## alpha = 1e-04
##
## Initial states:
## l = 0.0716
##
## sigma: 1.3545
##
## AIC AICc BIC
## 1047.964 1048.100 1057.543
##
## Error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 0.0002402688 1.346981 0.6002376 -765.1024 991.1566 0.5469343
## ACF1
## Training set 0.1202903
##
## Forecasts:
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## Jan 2014 0.07162989 -1.664267 1.807526 -2.583195 2.726454
## Feb 2014 0.07162989 -1.664267 1.807526 -2.583195 2.726454
## Mar 2014 0.07162989 -1.664267 1.807526 -2.583195 2.726454
## Apr 2014 0.07162989 -1.664267 1.807526 -2.583195 2.726454
## May 2014 0.07162989 -1.664267 1.807526 -2.583195 2.726454
## Jun 2014 0.07162989 -1.664267 1.807526 -2.583195 2.726454
## Jul 2014 0.07162989 -1.664267 1.807526 -2.583195 2.726454
## Aug 2014 0.07162989 -1.664267 1.807526 -2.583195 2.726454
## Sep 2014 0.07162989 -1.664267 1.807526 -2.583195 2.726454
## Oct 2014 0.07162989 -1.664267 1.807526 -2.583195 2.726455
## Nov 2014 0.07162989 -1.664267 1.807526 -2.583195 2.726455
## Dec 2014 0.07162989 -1.664267 1.807526 -2.583195 2.726455
Based on the evaluation metrics, the ARIMA model appears to deliver the best results. The AIC dropped to 524, and the BIC decreased to 540. Additionally, the RMSE improved significantly, falling from 1.347 to 0.966. Given these improvements, I’ve decided to proceed with the ARIMA model. I’ll now generate the forecast and export the predicted values to a CSV file, as I’m confident in ARIMA’s performance.
Part C includes two simple two-column datasets with different time stamps. My optional task is to align the data based on time by creating a time-based sequence and aggregating the values by hour—taking the mean when there are multiple recordings within the same hour. I’ll then assess whether the data is stationary and determine if it can be reliably forecasted. If so, I’ll produce a one-week-ahead forecast and present the results through an RPubs report, the accompanying .Rmd file, and an Excel-readable file containing the forecast.
# Define a custom function to convert Excel date serial numbers to POSIXct
convert_excel_date <- function(serial_date) {
as_datetime(serial_date * 86400, origin = "1899-12-30", tz = "UTC")
}
# Read data
pipe1_data <- read_excel("C:/Users/Admin/Downloads/Waterflow_Pipe1.xlsx")
pipe2_data <- read_excel("C:/Users/Admin/Downloads/Waterflow_Pipe2.xlsx")
# Convert date columns using the custom function
pipe1_data$`Date Time` <- sapply(pipe1_data$`Date Time`, convert_excel_date)
pipe2_data$`Date Time` <- sapply(pipe2_data$`Date Time`, convert_excel_date)
# Calculate number of rows
num_rows_pipe1 <- nrow(pipe1_data)
num_rows_pipe2 <- nrow(pipe2_data)
# Generate summaries
pipe1_summary <- summary(pipe1_data)
pipe2_summary <- summary(pipe2_data)
# Output summaries in kable format
kable(pipe1_summary, caption = paste("Pipe1 Data Summary - Rows:", num_rows_pipe1)) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Date Time | WaterFlow | |
---|---|---|
Min. :1.446e+09 | Min. : 1.067 | |
1st Qu.:1.446e+09 | 1st Qu.:13.683 | |
Median :1.446e+09 | Median :19.880 | |
Mean :1.446e+09 | Mean :19.897 | |
3rd Qu.:1.446e+09 | 3rd Qu.:26.159 | |
Max. :1.446e+09 | Max. :38.913 |
kable(pipe2_summary, caption = paste("Pipe2 Data Summary - Rows:", num_rows_pipe2)) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Date Time | WaterFlow | |
---|---|---|
Min. :1.446e+09 | Min. : 1.885 | |
1st Qu.:1.446e+09 | 1st Qu.:28.140 | |
Median :1.447e+09 | Median :39.682 | |
Mean :1.447e+09 | Mean :39.556 | |
3rd Qu.:1.448e+09 | 3rd Qu.:50.782 | |
Max. :1.449e+09 | Max. :78.303 |
pipe1_data <- pipe1_data %>%
drop_na(`Date Time`)
pipe2_data <- pipe2_data %>%
drop_na(`Date Time`)
# Custom function to convert Excel serial dates to POSIXct
convert_excel_date <- function(serial_date) {
as_datetime(serial_date * 86400, origin = "1899-12-30", tz = "UTC")
}
# Apply the conversion function to the "Date Time" column
pipe1_data$`Date Time` <- sapply(pipe1_data$`Date Time`, convert_excel_date)
pipe2_data$`Date Time` <- sapply(pipe2_data$`Date Time`, convert_excel_date)
kable(pipe2_summary, caption = paste("Pipe2 Data Summary - Rows:", num_rows_pipe2)) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Date Time | WaterFlow | |
---|---|---|
Min. :1.446e+09 | Min. : 1.885 | |
1st Qu.:1.446e+09 | 1st Qu.:28.140 | |
Median :1.447e+09 | Median :39.682 | |
Mean :1.447e+09 | Mean :39.556 | |
3rd Qu.:1.448e+09 | 3rd Qu.:50.782 | |
Max. :1.449e+09 | Max. :78.303 |
The time series plots for Water Flow - Pipe 1 and Pipe 2 highlight distinct flow behaviors and variability.
The clear difference between the two—Pipe 1’s stability versus Pipe 2’s variability—hints at differing usage patterns or system demands. This initial analysis lays the groundwork for deeper investigation and targeted forecasting for each pipe.
# Plot for Pipe 1
pipe1_plot <- ggplot(pipe1_data, aes(x = `Date Time`, y = WaterFlow)) +
geom_line(color = "yellow") +
labs(title = "Water Flow - Pipe 1",
x = "Time",
y = "Water Flow")
# Plot for Pipe 2
pipe2_plot <- ggplot(pipe2_data, aes(x = `Date Time`, y = WaterFlow)) +
geom_line(color = "skyblue") +
labs(title = "Water Flow - Pipe 2",
x = "Time",
y = "Water Flow")
# Arrange the plots vertically
grid.arrange(pipe1_plot, pipe2_plot , ncol = 1)
Convert timestamps to hourly data by rounding down to the nearest hour and calculating the mean for each hour.
The table and plots below offer insight into the data collection frequency and the transformations applied to the water flow measurements for each pipe:
Pipe 1:
Initial data points: 1,000
Hourly mean data points after aggregation: 236
Average interval between measurements: approximately 14.37 minutes
Median interval: 10.02 minutes
Pipe 2:
Initial data points: 1,000
Hourly mean data points after aggregation: 667
Average interval between measurements: 60 minutes
Median interval: 60 minutes
These results show that Pipe 1 was measured more frequently than Pipe 2, resulting in fewer aggregated hourly data points. The higher measurement frequency in Pipe 1 offers more detailed, granular data, while Pipe 2’s consistent hourly intervals reflect a lower-frequency collection pattern.
# Load necessary libraries
library(dplyr)
library(lubridate)
library(knitr)
library(kableExtra)
# Function to convert Excel serial date numbers to POSIXct
convert_excel_date <- function(serial_date) {
as_datetime(serial_date * 86400, origin = "1899-12-30", tz = "UTC")
}
# Read data
pipe1_data <- read_excel("C:/Users/Admin/Downloads/Waterflow_Pipe1.xlsx")
pipe2_data <- read_excel("C:/Users/Admin/Downloads/Waterflow_Pipe2.xlsx")
# Convert 'Date Time' column from Excel serial date to POSIXct
pipe1_data$`Date Time` <- convert_excel_date(pipe1_data$`Date Time`)
pipe2_data$`Date Time` <- convert_excel_date(pipe2_data$`Date Time`)
# Convert 'WaterFlow' to numeric if necessary (it should already be numeric, but just in case)
pipe1_data$WaterFlow <- as.numeric(pipe1_data$WaterFlow)
pipe2_data$WaterFlow <- as.numeric(pipe2_data$WaterFlow)
# Convert 'Date Time' to hourly timestamps and calculate the hourly mean of 'WaterFlow'
pipe1_hourly <- pipe1_data %>%
mutate(Timestamp = floor_date(`Date Time`, "hour")) %>% # Floor to the nearest hour
group_by(Timestamp) %>%
summarize(Mean_WaterFlow = mean(WaterFlow, na.rm = TRUE)) %>% # Calculate the mean of 'WaterFlow'
ungroup() # Remove grouping
# Check if pipe1_hourly was created correctly
head(pipe1_hourly)
## # A tibble: 6 × 2
## Timestamp Mean_WaterFlow
## <dttm> <dbl>
## 1 2015-10-23 00:00:00 26.1
## 2 2015-10-23 01:00:00 18.9
## 3 2015-10-23 02:00:00 15.2
## 4 2015-10-23 03:00:00 23.1
## 5 2015-10-23 04:00:00 15.5
## 6 2015-10-23 05:00:00 22.7
pipe2_hourly <- pipe2_data %>%
mutate(Timestamp = floor_date(`Date Time`, "hour")) %>%
group_by(Timestamp) %>%
summarize(Mean_WaterFlow = mean(WaterFlow, na.rm = TRUE)) %>%
ungroup()
# Check if pipe2_hourly was created correctly
head(pipe2_hourly)
## # A tibble: 6 × 2
## Timestamp Mean_WaterFlow
## <dttm> <dbl>
## 1 2015-10-23 01:00:00 30.9
## 2 2015-10-23 03:00:00 38.0
## 3 2015-10-23 04:00:00 34.0
## 4 2015-10-23 06:00:00 28.2
## 5 2015-10-23 07:00:00 18.3
## 6 2015-10-23 09:00:00 55.8
# Calculate the number of rows before and after the mean calculation
num_points_before_pipe1 <- nrow(pipe1_data)
num_points_before_pipe2 <- nrow(pipe2_data)
num_points_after_pipe1 <- nrow(pipe1_hourly)
num_points_after_pipe2 <- nrow(pipe2_hourly)
# Calculate time intervals for Pipe 1
pipe1_intervals <- pipe1_data %>%
arrange(`Date Time`) %>%
mutate(Interval = difftime(`Date Time`, lag(`Date Time`), units = "mins")) %>%
filter(!is.na(Interval)) # Remove the first NA caused by lag
# Ensure the 'Interval' column is numeric
pipe1_intervals$Interval <- as.numeric(pipe1_intervals$Interval)
# Calculate time intervals for Pipe 2
pipe2_intervals <- pipe2_data %>%
arrange(`Date Time`) %>%
mutate(Interval = difftime(`Date Time`, lag(`Date Time`), units = "mins")) %>%
filter(!is.na(Interval)) # Remove the first NA caused by lag
# Ensure the 'Interval' column is numeric
pipe2_intervals$Interval <- as.numeric(pipe2_intervals$Interval)
# Create a data frame to display the counts and intervals for each pipe
data_points_df <- data.frame(
Pipe = c("Pipe 1", "Pipe 2"),
Data_Points_Before = c(num_points_before_pipe1, num_points_before_pipe2),
Data_Points_After = c(num_points_after_pipe1, num_points_after_pipe2),
Avg_Interval = c(mean(pipe1_intervals$Interval), mean(pipe2_intervals$Interval)),
Median_Interval = c(median(pipe1_intervals$Interval), median(pipe2_intervals$Interval))
)
# Display the data frame with formatted table
data_points_df %>%
kable(caption = "Summary of Data Points and Time Intervals Between Measurements for Each Pipe") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Pipe | Data_Points_Before | Data_Points_After | Avg_Interval | Median_Interval |
---|---|---|---|---|
Pipe 1 | 1000 | 236 | 14.36598 | 10.01989 |
Pipe 2 | 1000 | 667 | 60.00000 | 60.00000 |
# Plot for Pipe 1
pipe1_hourly_plot <- ggplot(pipe1_hourly, aes(x = Timestamp, y = Mean_WaterFlow)) +
geom_line(color = "yellow") +
labs(title = "Hourly Mean Water Flow - Pipe 1",
x = "Timestamp",
y = "Mean Water Flow")
# Plot for Pipe 2
pipe2_hourly_plot <- ggplot(pipe2_hourly, aes(x = Timestamp, y = Mean_WaterFlow)) +
geom_line(color = "skyblue") +
labs(title = "Hourly Mean Water Flow - Pipe 2",
x = "Timestamp",
y = "Mean Water Flow")
# Arrange the plots vertically
grid.arrange(pipe1_hourly_plot, pipe2_hourly_plot , ncol = 1)
After converting the merged dataset to a tsibble
format,
I checked each time series for stationarity using the Augmented
Dickey-Fuller (ADF) test. Below is the interpretation of the results for
each pipe:
Pipe 1
- Test Statistic: The ADF statistic is strongly
negative, suggesting a high likelihood that the series is
stationary.
- Critical Values: The test statistic falls below the
critical values at the 1%, 5%, and 10% significance levels, leading to
rejection of the null hypothesis of a unit root.
- Conclusion: The time series for Pipe 1 is
stationary, making it appropriate for ARIMA
modeling.
Pipe 2
- Test Statistic: Similar to Pipe 1, Pipe 2’s ADF test
statistic is significantly negative.
- Critical Values: The result is again lower than all
critical thresholds, which supports rejecting the null hypothesis.
- Conclusion: Pipe 2’s series is also
stationary, and suitable for ARIMA modeling.
These findings confirm that both time series are stationary and can be used for forecasting without the need for additional differencing.
# Step 3: Convert to tsibble format for time series analysis
pipe1_tsibble <- as_tsibble(pipe1_hourly, index = Timestamp) %>%
fill_gaps()
pipe2_tsibble <- as_tsibble(pipe2_hourly, index = Timestamp) %>%
fill_gaps()
# Step 4: Test for stationarity using ADF test
adf_test_pipe1 <- ur.df(pipe1_hourly$Mean_WaterFlow, type = "drift")
adf_test_pipe1 <- ur.df(pipe2_hourly$Mean_WaterFlow, type = "drift")
summary(adf_test_pipe1)
##
## ###############################################
## # Augmented Dickey-Fuller Test Unit Root Test #
## ###############################################
##
## Test regression drift
##
##
## Call:
## lm(formula = z.diff ~ z.lag.1 + 1 + z.diff.lag)
##
## Residuals:
## Min 1Q Median 3Q Max
## -37.732 -9.229 0.298 9.370 37.233
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 38.55674 2.29084 16.831 <2e-16 ***
## z.lag.1 -0.96884 0.05603 -17.292 <2e-16 ***
## z.diff.lag -0.06832 0.03888 -1.757 0.0793 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.83 on 662 degrees of freedom
## Multiple R-squared: 0.5207, Adjusted R-squared: 0.5193
## F-statistic: 359.6 on 2 and 662 DF, p-value: < 2.2e-16
##
##
## Value of test-statistic is: -17.2918 149.5061
##
## Critical values for test statistics:
## 1pct 5pct 10pct
## tau2 -3.43 -2.86 -2.57
## phi1 6.43 4.59 3.78
summary(adf_test_pipe1)
##
## ###############################################
## # Augmented Dickey-Fuller Test Unit Root Test #
## ###############################################
##
## Test regression drift
##
##
## Call:
## lm(formula = z.diff ~ z.lag.1 + 1 + z.diff.lag)
##
## Residuals:
## Min 1Q Median 3Q Max
## -37.732 -9.229 0.298 9.370 37.233
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 38.55674 2.29084 16.831 <2e-16 ***
## z.lag.1 -0.96884 0.05603 -17.292 <2e-16 ***
## z.diff.lag -0.06832 0.03888 -1.757 0.0793 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.83 on 662 degrees of freedom
## Multiple R-squared: 0.5207, Adjusted R-squared: 0.5193
## F-statistic: 359.6 on 2 and 662 DF, p-value: < 2.2e-16
##
##
## Value of test-statistic is: -17.2918 149.5061
##
## Critical values for test statistics:
## 1pct 5pct 10pct
## tau2 -3.43 -2.86 -2.57
## phi1 6.43 4.59 3.78
The top plot displays the Box-Cox transformed flow data for Pipe 2, which helps stabilize variance. The bottom plot shows the completed series after filling in any missing values, ensuring consistency for accurate forecasting. These preprocessing steps are essential for preparing the data for reliable model fitting.
pipe1_tsibble <- pipe1_tsibble |>
mutate(Mean_WaterFlow) # Adding 1 to avoid zero or negative values
# Calculate the Box-Cox transformation using the adjusted cash values
box_cox_lambda <- pipe1_tsibble |>
BoxCox.lambda(Mean_WaterFlow) # Calculate lambda for Box-Cox
# Apply transformations and differences
pipe1_tsibble <- pipe1_tsibble |>
mutate(
box_cox_flow = BoxCox(Mean_WaterFlow, lambda = box_cox_lambda), # Apply Box-Cox transformation
)
# Fill missing values in the Mean_WaterFlow column using linear interpolation
pipe1_tsibble_filled <- pipe1_tsibble %>%
mutate(Mean_WaterFlow = na_interpolation(Mean_WaterFlow))
# Step 3: Time Series Plot for Box-Cox Transformed Data
p1 <- pipe1_tsibble |>
ggplot(aes(x = Timestamp, y = box_cox_flow)) +
geom_line() +
labs(title = "Box-Cox Transformed Pipe 1 Flow",
x = "Time", y = "Pipe 1 Flow")
# Step : Time Series Plot for Box-Cox Transformed Data
p2 <- pipe1_tsibble_filled |>
ggplot(aes(x = Timestamp, y = Mean_WaterFlow)) +
geom_line() +
labs(title = "Gap Filled Pipe 1 Flow",
x = "Time", y = "Pipe 1 Flow")
# Combine the plots using gridExtra with 3 columns and add a title
grid.arrange(
p1, p2,
ncol = 1
)
### Decomposing the Time Series to Identify Seasonality and Trend
The STL decomposition of Pipe 1’s water flow data reveals the following components:
Overall, this decomposition captures the core structure of the data well, indicating that a forecasting model based on these components could accurately predict future values.
# Step 1: Decompose the series using STL
pipe1_stl <- pipe1_tsibble_filled %>%
model(STL(Mean_WaterFlow ~ trend(window = 7) + season(window = "periodic"), robust = TRUE)) %>%
components()
# Step 2: Plot the STL components
autoplot(pipe1_stl) +
labs(title = "STL Decomposition of Pipe 1 Water Flow")
### Model Fitting for Pipe 1
I applied both ARIMA and various ETS models—including ETS(M, A, M), ETS(A, A, M), and ETS(M, Ad, M)—to the Pipe 1 time series data. This allowed for model comparison using performance criteria such as AIC, AICc, BIC, MSE, and MAE to identify the most suitable forecasting approach.
To find the best model for forecasting Pipe 1 water flow, I evaluated
each model based on error metrics and information criteria. Although the
ARIMA models selected through stepwise
and
search
methods showed lower AICc values, their narrow
confidence intervals suggest they may underrepresent the true
variability in the data—potentially leading to underfitting and
unreliable forecasts in dynamic conditions.
On the other hand, ETS models—particularly
ets_mam
—produced wider, more realistic confidence
intervals, capturing the series’ fluctuations more effectively and
offering more reliable forecasting.
Model Comparison Summary:
Best Model: ETS(MAM)
Chosen for its strong balance of low error metrics and reliable
forecasting performance.
# Fit multiple ETS models
pipe1_fit <- pipe1_tsibble_filled %>%
model(
ets_mam = ETS(Mean_WaterFlow ~ error("M") + trend("A") + season("M")), # ETS(M, A, M)
ets_aam = ETS(Mean_WaterFlow ~ error("A") + trend("A") + season("M")), # ETS(A, A, M)
ets_madm = ETS(Mean_WaterFlow ~ error("M") + trend("Ad") + season("M")), # ETS(M, Ad, M)
stepwise = ARIMA(Mean_WaterFlow), # Automatic ARIMA using stepwise search
search = ARIMA(Mean_WaterFlow, stepwise = FALSE) # Automatic ARIMA using exhaustive search
)
# Extract model summary and select the best model based on AIC or AICc
model_summary <- glance(pipe1_fit)
# Display model details to compare AIC, AICc, and BIC
model_summary %>%
select(-ar_roots, -ma_roots) %>%
arrange(AICc) %>%
kable(caption = "ETS Model Comparison Based on AICc") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
.model | sigma2 | log_lik | AIC | AICc | BIC | MSE | AMSE | MAE |
---|---|---|---|---|---|---|---|---|
stepwise | 17.8346691 | -685.7815 | 1375.563 | 1375.614 | 1382.524 | NA | NA | NA |
search | 17.9552502 | -685.6035 | 1379.207 | 1379.377 | 1393.129 | NA | NA | NA |
ets_mam | 0.0475971 | -995.0379 | 2048.076 | 2056.361 | 2149.014 | 16.82212 | 16.86331 | 0.1618203 |
ets_aam | 18.9938721 | -996.0843 | 2050.169 | 2058.454 | 2151.107 | 16.77792 | 16.81792 | 3.2571771 |
ets_madm | 0.0503575 | -999.7983 | 2059.597 | 2068.496 | 2164.016 | 17.24243 | 17.24774 | 0.1666085 |
# Forecast with each ETS model
pipe1_forecasts <- pipe1_fit %>%
forecast(h = "1 week")
# Plot the forecasts with different models shown in separate rows
autoplot(pipe1_forecasts, pipe1_tsibble_filled) +
labs(title = "1-Week Forecast for Pipe 1 Water Flow Using ETS Models",
y = "Mean Water Flow", x = "Timestamp") +
facet_wrap(~ .model, scales = "free_y", ncol = 1) + # Arrange in a single column with each model in a row
theme_minimal()
### Forecasting with the Best Model
The forecast for Pipe 1 Water Flow, generated using the best-performing ETS model—ETS(M, A, M)—offers projections with well-defined confidence intervals. This model, which incorporates multiplicative error, an additive trend, and multiplicative seasonality, fits the data well and effectively captures its seasonal patterns.
The widening of the confidence intervals over time highlights the natural uncertainty involved in long-term forecasting. Due to the strong seasonal behavior and consistent trend in the data, this model is highly appropriate for representing its structure and producing reliable future forecasts within expected variability.
# Filter the forecasts to include only the best model (ets_mam)
best_forecast <- pipe1_forecasts %>%
filter(.model == "ets_mam")
# Plot the forecast for the best model (ets_mam)
autoplot(best_forecast, pipe1_tsibble_filled) +
labs(
title = "1-Week Forecast for Pipe 1 Water Flow Using Best Model (ETS(M, A, M))",
y = "Mean Water Flow",
x = "Timestamp"
)
head(best_forecast) %>%
kable(caption = "1-Week Forecasts for Pipe 1") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
.model | Timestamp | Mean_WaterFlow | .mean |
---|---|---|---|
ets_mam | 2015-11-02 00:00:00 | N(21, 21) | 20.89573 |
ets_mam | 2015-11-02 01:00:00 | N(22, 22) | 21.58641 |
ets_mam | 2015-11-02 02:00:00 | N(21, 21) | 20.83362 |
ets_mam | 2015-11-02 03:00:00 | N(19, 17) | 18.66852 |
ets_mam | 2015-11-02 04:00:00 | N(22, 23) | 22.00720 |
ets_mam | 2015-11-02 05:00:00 | N(19, 17) | 18.94728 |
The top plot displays the Box-Cox transformed Pipe 2 water flow data, which helps stabilize variance. The bottom plot shows the completed time series after filling in missing values, ensuring data continuity for accurate forecasting. These preprocessing steps are essential for preparing the data for effective model fitting.
pipe2_tsibble <- pipe2_tsibble |>
mutate(Mean_WaterFlow) # Adding 1 to avoid zero or negative values
# Calculate the Box-Cox transformation using the adjusted cash values
box_cox_lambda <- pipe2_tsibble |>
BoxCox.lambda(Mean_WaterFlow) # Calculate lambda for Box-Cox
# Apply transformations and differences
pipe2_tsibble <- pipe2_tsibble |>
mutate(
box_cox_flow = BoxCox(Mean_WaterFlow, lambda = box_cox_lambda), # Apply Box-Cox transformation
)
# Fill missing values in the Mean_WaterFlow column using linear interpolation
pipe2_tsibble_filled <- pipe2_tsibble %>%
mutate(Mean_WaterFlow = na_interpolation(Mean_WaterFlow))
# Step 3: Time Series Plot for Box-Cox Transformed Data
p1 <- pipe2_tsibble |>
ggplot(aes(x = Timestamp, y = box_cox_flow)) +
geom_line() +
labs(title = "Box-Cox Transformed Pipe 2 Flow",
x = "Time", y = "Pipe 2 Flow")
# Step : Time Series Plot for Box-Cox Transformed Data
p2 <- pipe2_tsibble_filled |>
ggplot(aes(x = Timestamp, y = Mean_WaterFlow)) +
geom_line() +
labs(title = "Gap Filled Pipe 2 Flow",
x = "Time", y = "Pipe 2 Flow")
# Combine the plots using gridExtra with 3 columns and add a title
grid.arrange(
p1, p2,
ncol = 1
)
### Decomposing the Time Series to Identify Trend and Seasonality
The STL decomposition of Pipe 2 Water Flow reveals the following components:
Overall, the decomposition reveals a stable trend combined with strong daily seasonality and some short-term irregularities. This structure can guide model selection, suggesting that any effective forecasting model should incorporate both trend and daily seasonal components.
# Step 1: Decompose the series using STL
pipe2_stl <- pipe1_tsibble_filled %>%
model(STL(Mean_WaterFlow ~ trend(window = 7) + season(window = "periodic"), robust = TRUE)) %>%
components()
# Step 2: Plot the STL components
autoplot(pipe2_stl) +
labs(title = "STL Decomposition of Pipe 2 Water Flow")
### Model Fitting for Pipe 2
I applied ARIMA and multiple ETS configurations—including ETS(M, A, M), ETS(A, A, M), and ETS(M, Ad, M)—to the Pipe 2 time series data. This approach enables a comparison of models based on criteria such as AIC, BIC, MSE, and MAE to identify the most effective forecasting method.
To determine the best model for forecasting Pipe 2 Water Flow, I evaluated key performance metrics including AICc, BIC, MSE, and MAE.
The ETS(M, Ad, M) model (ets_madm
) outperformed the
others, showing the lowest AICc, BIC, MSE, and MAE values. While
search
and stepwise
ARIMA models yielded
slightly lower AICc scores, they also produced much higher sigma² and
log-likelihood values—indicating a poor model fit and potential
stability issues.
As a result, ets_madm
was selected as the most reliable
model, offering a good balance of forecast accuracy and stability.
# Fit multiple models
pipe2_fit <- pipe2_tsibble_filled %>%
model(
ets_mam = ETS(Mean_WaterFlow ~ error("M") + trend("A") + season("M")), # ETS(M, A, M)
ets_aam = ETS(Mean_WaterFlow ~ error("A") + trend("A") + season("M")), # ETS(A, A, M)
ets_madm = ETS(Mean_WaterFlow ~ error("M") + trend("Ad") + season("M")), # ETS(M, Ad, M)
stepwise = ARIMA(Mean_WaterFlow), # Automatic ARIMA using stepwise search
search = ARIMA(Mean_WaterFlow, stepwise = FALSE) # Automatic ARIMA using exhaustive search
)
# Extract model summary and select the best model based on AIC or AICc
model_summary <- glance(pipe2_fit)
# Display model details to compare AIC, AICc, and BIC
model_summary %>%
select(-ar_roots, -ma_roots) %>%
arrange(AICc) %>%
kable(caption = "ETS Model Comparison Based on AICc") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
.model | sigma2 | log_lik | AIC | AICc | BIC | MSE | AMSE | MAE |
---|---|---|---|---|---|---|---|---|
search | 145.5595822 | -3902.861 | 7819.723 | 7819.836 | 7854.070 | NA | NA | NA |
stepwise | 156.7908794 | -3942.633 | 7891.266 | 7891.290 | 7905.986 | NA | NA | NA |
ets_madm | 0.1017133 | -5980.418 | 12020.837 | 12022.756 | 12168.070 | 155.3642 | 158.8355 | 0.2533285 |
ets_aam | 163.1633975 | -5987.054 | 12032.108 | 12033.902 | 12174.433 | 158.5948 | 176.0376 | 10.0992134 |
ets_mam | 0.1019389 | -5987.697 | 12033.394 | 12035.188 | 12175.719 | 158.5184 | 162.9869 | 0.2536301 |
# Forecast with each ETS model
pipe2_forecasts <- pipe2_fit %>%
forecast(h = "1 week")
# Plot the forecasts with different models shown in separate rows
autoplot(pipe2_forecasts, pipe2_tsibble_filled) +
labs(title = "1-Week Forecast for Pipe 2 Water Flow",
y = "Mean Water Flow", x = "Timestamp") +
facet_wrap(~ .model, scales = "free_y", ncol = 1) + # Arrange in a single column with each model in a row
theme_minimal()
The ETS(M, Ad, M) model generates a one-week forecast for Pipe 2 Water Flow, effectively capturing multiplicative error and seasonality along with an additive damped trend. The forecast reflects the high variability seen in the historical data and shows increasing uncertainty over time, illustrated by the widening confidence intervals. This model is well-suited for representing the complex and fluctuating water flow patterns in Pipe 2, accurately accounting for both daily and seasonal variations.
# Filter the forecasts to include only the best model (ets_mam)
best_forecast <- pipe2_forecasts %>%
filter(.model == "ets_madm")
# Plot the forecast for the best model (ets_mam)
autoplot(best_forecast, pipe2_tsibble_filled) +
labs(
title = "1-Week Forecast for Pipe 2 Water Flow (ETS(M, Ad, M))",
y = "Mean Water Flow",
x = "Timestamp"
) +
theme_minimal()