1 DATA 624 Forecasting Project – ATM, Power Usage, and Water Flow Analysis

1.1 Introduction

This project consists of three forecasting tasks. Part A involves predicting the cash withdrawals from four ATMs for May 2010. Part B focuses on forecasting residential power consumption for the year 2014 using monthly data from 1998–2013. Part C is an optional bonus involving the hourly aggregation and forecasting of water flow data from two different pipes.


1.2 Part A – ATM Forecast

1.2.1 Overview

In this project, I’ll analyze historical ATM withdrawal data to forecast how much cash will be taken out of four different ATM machines for May 2010. I’ll start by exploring and preprocessing the data, then apply time series analysis techniques to build and evaluate forecasting models. Along the way, I’ll decide which methods work best for the data and explain why.

1.2.2 Data Summary and Cleaning

atm_data <- read_excel("C:/Users/Admin/Downloads/ATM624Data.xlsx")
head(atm_data)
## # A tibble: 6 × 3
##    DATE ATM    Cash
##   <dbl> <chr> <dbl>
## 1 39934 ATM1     96
## 2 39934 ATM2    107
## 3 39935 ATM1     82
## 4 39935 ATM2     89
## 5 39936 ATM1     85
## 6 39936 ATM2     90
atm_data[!complete.cases(atm_data),]
## # A tibble: 19 × 3
##     DATE ATM    Cash
##    <dbl> <chr> <dbl>
##  1 39977 ATM1     NA
##  2 39980 ATM1     NA
##  3 39982 ATM2     NA
##  4 39986 ATM1     NA
##  5 39988 ATM2     NA
##  6 40299 <NA>     NA
##  7 40300 <NA>     NA
##  8 40301 <NA>     NA
##  9 40302 <NA>     NA
## 10 40303 <NA>     NA
## 11 40304 <NA>     NA
## 12 40305 <NA>     NA
## 13 40306 <NA>     NA
## 14 40307 <NA>     NA
## 15 40308 <NA>     NA
## 16 40309 <NA>     NA
## 17 40310 <NA>     NA
## 18 40311 <NA>     NA
## 19 40312 <NA>     NA
atm_data$DATE<-as.Date(atm_data$DATE, origin = "1899-12-30")
atm <- atm_data %>%
  mutate(DATE = as_date(DATE), Cash = as.integer(Cash))
str(atm)
## tibble [1,474 × 3] (S3: tbl_df/tbl/data.frame)
##  $ DATE: Date[1:1474], format: "2009-05-01" "2009-05-01" ...
##  $ ATM : chr [1:1474] "ATM1" "ATM2" "ATM1" "ATM2" ...
##  $ Cash: int [1:1474] 96 107 82 89 85 90 90 55 99 79 ...

1.2.3 Forecasting by ATM

I’ll create a separate dataset to preserve the original data and perform all necessary transformations on the copy. First, I’ll convert the “DATE” variable to the appropriate date format. Since ATMs don’t dispense coins, I’ll also convert the “Cash” variable to an integer.

atm <- atm_data %>%
  mutate(DATE = as_date(DATE), Cash = as.integer(Cash))
str(atm)
## tibble [1,474 × 3] (S3: tbl_df/tbl/data.frame)
##  $ DATE: Date[1:1474], format: "2009-05-01" "2009-05-01" ...
##  $ ATM : chr [1:1474] "ATM1" "ATM2" "ATM1" "ATM2" ...
##  $ Cash: int [1:1474] 96 107 82 89 85 90 90 55 99 79 ...

Since we are looking at 4 separate ATMs and the data does not provide their location, they may have varying amounts of cash withdrawn at varying different days. We will work with each ATM separately and each will have its own forecast for the month of May in 2010.

# Prepare the data
atm1 <- atm %>%
  filter(ATM == "ATM1") %>%
  as_tsibble(index = DATE)

ggplot(atm1, aes(x = DATE, y = Cash)) +
  geom_line(color = "hotpink2") +
  labs(
    title = "ATM1",
    subtitle = "Cash withdrawals per day",
    x = "DATE [1D]",
    y = "Hundreds USD"
  )

# Prepare the data
atm2 <- atm %>%
  filter(ATM == "ATM2") %>%
  as_tsibble(index = DATE)

# Generate the plot using ggplot2
ggplot(atm1, aes(x = DATE, y = Cash)) +
  geom_line(color = "hotpink2") +
  labs(
    title = "ATM2",
    subtitle = "Cash withdrawals per day",
    x = "DATE [1D]",
    y = "Hundreds USD"
  )

# Prepare the data
atm3 <- atm %>%
  filter(ATM == "ATM1") %>%
  as_tsibble(index = DATE)

# Generate the plot using ggplot2
ggplot(atm3, aes(x = DATE, y = Cash)) +
  geom_line(color = "hotpink2") +
  labs(
    title = "ATM3",
    subtitle = "Cash withdrawals per day",
    x = "DATE [1D]",
    y = "Hundreds USD"
  )

# Prepare the data
atm4 <- atm %>%
  filter(ATM == "ATM4") %>%
  as_tsibble(index = DATE)

ggplot(atm4, aes(x = DATE, y = Cash)) +
  geom_line(color = "hotpink2") +
  labs(
    title = "ATM4",
    subtitle = "Cash withdrawals per day",
    x = "DATE [1D]",
    y = "Hundreds USD"
  )

Based on the time series plots, ATM1 and ATM2 show consistent variability with no clear trend and regular daily fluctuations, suggesting potential seasonality to explore through decomposition. Their patterns indicate steady usage over time. ATM3 also shows consistent activity, similar to ATM1, and does not appear limited to just the final days of data as initially suggested. Its pattern suggests a comparable usage profile. ATM4 stands out due to a major outlier, a sharp spike in withdrawals, while the rest of the data remains flat and low. This spike may be a data error or unusual event and should be investigated before modeling.

which(atm4$Cash %in% 10919)
## [1] 285

The outlier appears in row 285 of the ATM4 time series. Since it’s clearly an error, we’ll replace it using the average value for imputation.

After this adjustment, the time series now closely resembles the patterns seen in ATM1 and ATM2.

atm4_clean <- atm4 %>%
  mutate(Cash = if_else(row_number() == 285, round(mean(Cash, na.rm = TRUE), 0), Cash))

# Convert to tsibble if needed
atm4_clean <- as_tsibble(atm4_clean, index = DATE)

# Plot the updated series
ggplot(atm4_clean, aes(x = DATE, y = Cash)) +
  geom_line(color = "hotpink") +
  labs(
    title = "ATM4",
    subtitle = "Cash withdrawals per day",
    y = "Hundreds USD"
  )

There also appear to be some missing values in the “Cash” column for both “ATM1” and “ATM2”.

sum(is.na(atm1$Cash))
## [1] 3
sum(is.na(atm2$Cash))
## [1] 2
sum(is.na(atm3$Cash))
## [1] 3
sum(is.na(atm4$Cash))
## [1] 0

The missing values are located in the following rows for each corresponding ATM.

which(is.na(atm1$Cash))
## [1] 44 47 53
which(is.na(atm2$Cash))
## [1] 49 55
hist(atm1$Cash, 
     col = "pink", 
     border = "black", 
     main = "Histogram of atm1$Cash", 
     xlab = "atm1$Cash", 
     ylab = "Frequency")

hist(atm2$Cash, 
     col = "pink", 
     border = "black", 
     main = "Histogram of atm2$Cash", 
     xlab = "atm2$Cash", 
     ylab = "Frequency")

Since the distribution of “Cash” for both ATMs appears fairly symmetrical with no strong skewness, the median is a suitable choice for imputation. It’s a more robust measure than the mean in the presence of outliers, helping maintain the overall integrity of the data. For this reason, I will use the median to fill in the missing values.

# Impute missing values with the median
atm1[44, 3] <- round(median(atm1$Cash, na.rm = TRUE), 0)
atm1[47, 3] <- round(median(atm1$Cash, na.rm = TRUE), 0)
atm1[53, 3] <- round(median(atm1$Cash, na.rm = TRUE), 0)
atm2[49, 3] <- round(median(atm2$Cash, na.rm = TRUE), 0)
atm2[55, 3] <- round(median(atm2$Cash, na.rm = TRUE), 0)

# Print updated rows for verification
atm1[c(44, 47, 53), ]
## # A tsibble: 3 x 3 [1D]
##   DATE       ATM    Cash
##   <date>     <chr> <int>
## 1 2009-06-13 ATM1     91
## 2 2009-06-16 ATM1     91
## 3 2009-06-22 ATM1     91
atm2[c(49, 55), ]
## # A tsibble: 2 x 3 [1D]
##   DATE       ATM    Cash
##   <date>     <chr> <int>
## 1 2009-06-18 ATM2     67
## 2 2009-06-24 ATM2     67

Additionally, we can apply a Box-Cox transformation to each series to help stabilize variance and simplify the structure of the data.

# ATM1
lambda1 <- atm1 %>%
  features(Cash, features = guerrero) %>%
  pull(lambda_guerrero)

plot_trans1 <- atm1 %>%
  mutate(TransCash = box_cox(Cash, lambda1)) %>%
  ggplot(aes(x = DATE, y = TransCash)) +
  geom_line(color = "hotpink") +
  labs(title = "ATM1 TRANSFORMED", subtitle = "Cash withdrawals per day", y = "USD")

# ATM2
lambda2 <- atm2 %>%
  features(Cash, features = guerrero) %>%
  pull(lambda_guerrero)

plot_trans2 <- atm2 %>%
  mutate(TransCash = box_cox(Cash, lambda2)) %>%
  ggplot(aes(x = DATE, y = TransCash)) +
  geom_line(color = "hotpink") +
  labs(title = "ATM2 TRANSFORMED", subtitle = "Cash withdrawals per day", y = "USD")

# ATM3
lambda3 <- atm3 %>%
  features(Cash, features = guerrero) %>%
  pull(lambda_guerrero)

plot_trans3 <- atm3 %>%
  mutate(TransCash = box_cox(Cash, lambda3)) %>%
  ggplot(aes(x = DATE, y = TransCash)) +
  geom_line(color = "hotpink") +
  labs(title = "ATM3 TRANSFORMED", subtitle = "Cash withdrawals per day", y = "USD")

# ATM4
lambda4 <- atm4 %>%
  features(Cash, features = guerrero) %>%
  pull(lambda_guerrero)

plot_trans4 <- atm4 %>%
  mutate(TransCash = box_cox(Cash, lambda4)) %>%
  ggplot(aes(x = DATE, y = TransCash)) +
  geom_line(color = "hotpink") +
  labs(title = "ATM4 TRANSFORMED", subtitle = "Cash withdrawals per day", y = "USD")

# Arrange plots
grid.arrange(plot_trans1, plot_trans2, plot_trans3, plot_trans4, nrow = 2)

The Box-Cox transformed time series plots for each ATM reveal several key insights. ATM1 and ATM3 show consistent patterns with stabilized variance after transformation. Both series exhibit clear weekly seasonality, where regular dips, likely corresponding to weekends, suggest predictable fluctuations in cash withdrawals. The transformation effectively reduced variability while preserving seasonal trends. ATM2 also reflects strong weekly seasonality but with higher volatility compared to ATM1 and ATM3. The Box-Cox method has smoothed this volatility to some extent, making the data more suitable for modeling. In contrast, ATM4 has lower overall withdrawal values and a more stable pattern. While the seasonality is less pronounced, it still exists, suggesting a steadier usage pattern for this machine. Overall, the transformations have successfully prepared each series for decomposition and forecasting by reducing skewness and stabilizing variance.

1.3 Build Models

Next, we’ll examine the decomposition of each series to identify any strong seasonal patterns and assess whether differencing might be necessary for modeling. Since the magnitude of the seasonal components remains relatively constant over time, we can conclude that the series exhibit additive behavior.

atm1%>%
  model(classical_decomposition(box_cox(Cash, lambda1), type="additive")) %>%
  components () %>%
  autoplot() + 
  labs(title="Classical additive decomposition of ATM1")
## Warning: Removed 3 rows containing missing values or values outside the scale range
## (`geom_line()`).

The decomposition of ATM1 shows a clear and consistent weekly seasonal pattern, along with a mildly fluctuating trend that rises mid-series and dips slightly toward the end. The residuals are stable, indicating that most of the variation is explained by the trend and seasonality. This supports the use of an additive model.

atm2%>%
  model(classical_decomposition(box_cox(Cash, lambda2), type="additive")) %>%
  components () %>%
  autoplot() + 
  labs(title="Classical additive decomposition of ATM2")
## Warning: Removed 3 rows containing missing values or values outside the scale range
## (`geom_line()`).

ATM2’s decomposition shows a clear weekly seasonal pattern and a moderately stable trend with some fluctuations. The residuals are a bit more volatile than ATM1 but still support an additive model.

# Convert to time series object (assuming daily data)
cash_ts <- ts(na_interpolation(atm3$Cash), frequency = 7)  # weekly seasonality

# Apply classical decomposition
cash_decomp <- decompose(cash_ts)

# Plot
plot(cash_decomp)

This decomposition plot shows ATM3’s time series has a strong seasonal structure and a noticeable trend component. The residuals are relatively stable, supporting the use of an additive model for forecasting.

atm4%>%
  model(classical_decomposition(box_cox(Cash, lambda4), type="additive")) %>%
  components () %>%
  autoplot() + 
  labs(title="Classical additive decomposition of ATM4")
## Warning: Removed 3 rows containing missing values or values outside the scale range
## (`geom_line()`).

The decomposition for ATM4 indicates a stable seasonal component and a fluctuating but generally declining trend toward the end. The residuals show some variability, but the pattern is largely consistent, supporting the validity of the additive decomposition.

plot_acf1 <- atm1 %>%
  ACF(box_cox(Cash, lambda1)) %>%
  autoplot() + labs(title="Autocorrelation of Cash ATM1")

plot_acf2 <- atm2 %>%
  ACF(box_cox(Cash, lambda2)) %>%
  autoplot() + labs(title="Autocorrelation of Cash ATM2")

plot_acf3 <- atm3 %>%
  ACF(box_cox(Cash, lambda3)) %>%
  autoplot() + labs(title="Autocorrelation of Cash ATM3")

plot_acf4 <- atm4 %>%
  ACF(box_cox(Cash, lambda4)) %>%
  autoplot() + labs(title="Autocorrelation of Cash ATM4")

grid.arrange(plot_acf1, plot_acf2, plot_acf3, plot_acf4, nrow = 2)

The autocorrelation plots reveal clear weekly seasonality patterns for ATM1, ATM2, and ATM3, with noticeable spikes at lags 7, 14, and 21. This supports the need for seasonal differencing in these series. ATM4, on the other hand, shows weaker autocorrelation, suggesting either lower seasonality or more randomness in its daily withdrawal pattern.

atm1 %>%
  features(box_cox(Cash, lambda1), unitroot_nsdiffs)
## # A tibble: 1 × 1
##   nsdiffs
##     <int>
## 1       1
atm2 %>%
  features(box_cox(Cash, lambda1), unitroot_nsdiffs)
## # A tibble: 1 × 1
##   nsdiffs
##     <int>
## 1       1
atm3 %>%
  features(box_cox(Cash, lambda1), unitroot_nsdiffs)
## # A tibble: 1 × 1
##   nsdiffs
##     <int>
## 1       1
atm4 %>%
  features(box_cox(Cash, lambda1), unitroot_nsdiffs)
## # A tibble: 1 × 1
##   nsdiffs
##     <int>
## 1       0

Based on the output of the function, seasonal differencing is required for ATM1 and ATM2. Next, we’ll examine whether any further differencing is necessary.

atm1 %>%
  features(difference(box_cox(Cash, lambda1), 7), unitroot_ndiffs)
## # A tibble: 1 × 1
##   ndiffs
##    <int>
## 1      0
atm2 %>%
  features(difference(box_cox(Cash, lambda2), 7), unitroot_ndiffs)
## # A tibble: 1 × 1
##   ndiffs
##    <int>
## 1      0
atm3 %>%
  features(difference(box_cox(Cash, lambda3), 7), unitroot_ndiffs)
## # A tibble: 1 × 1
##   ndiffs
##    <int>
## 1      0
atm4 %>%
  features(difference(box_cox(Cash, lambda4), 7), unitroot_ndiffs)
## # A tibble: 1 × 1
##   ndiffs
##    <int>
## 1      0

No further differencing appears to be necessary. Let’s now review the ACF plots after applying seasonal differencing to ATM1 and ATM2.

atm1 %>%
  ACF(difference(box_cox(Cash, lambda1), 7)) %>%
  autoplot() + labs(title="Autocorrelation of Cash ATM1")

The ACF plot for ATM1 after differencing shows that most autocorrelations fall within the confidence bounds, indicating that seasonality has been adequately removed. The sharp negative spike at lag 7 suggests that weekly seasonality was successfully addressed through differencing, and the residuals now resemble white noise, which is ideal for ARIMA modeling.

atm2 %>%
  ACF(difference(box_cox(Cash, lambda2), 7)) %>%
  autoplot() + labs(title="Autocorrelation of Cash ATM2")

The ACF plot for ATM2 after differencing shows a significant drop at lag 7, indicating that weekly seasonality has been successfully accounted for. Most of the autocorrelations now fall within the confidence bands, suggesting that the series behaves more like white noise. This supports the idea that the data is now stationary and ready for ARIMA modeling.

atm1_fit <- atm1 %>%
  model(ARIMA(box_cox(Cash, lambda1)))

report(atm1_fit)
## Series: Cash 
## Model: ARIMA(0,0,2)(0,1,1)[7] 
## Transformation: box_cox(Cash, lambda1) 
## 
## Coefficients:
##          ma1      ma2     sma1
##       0.0994  -0.1141  -0.6456
## s.e.  0.0524   0.0527   0.0426
## 
## sigma^2 estimated as 1.576:  log likelihood=-589.77
## AIC=1187.54   AICc=1187.66   BIC=1203.07
atm1_fit %>%
  gg_tsresiduals()

The residual diagnostics for ATM1 suggest a well-performing ARIMA model. The residuals appear randomly scattered around zero with no visible patterns, and the ACF plot shows no significant autocorrelation remaining—indicating white noise. Additionally, the histogram of residuals shows a distribution that is approximately normal, which supports the model’s validity. Overall, this is a strong sign that the ARIMA model is a good fit for ATM1’s cash withdrawal series.

atm2_fit <- atm2 %>%
  model(ARIMA(box_cox(Cash, lambda2)))

report(atm2_fit)
## Series: Cash 
## Model: ARIMA(2,0,2)(0,1,1)[7] 
## Transformation: box_cox(Cash, lambda2) 
## 
## Coefficients:
##           ar1      ar2    ma1     ma2     sma1
##       -0.4247  -0.9074  0.468  0.8004  -0.7262
## s.e.   0.0587   0.0438  0.089  0.0584   0.0409
## 
## sigma^2 estimated as 69.93:  log likelihood=-1267.92
## AIC=2547.85   AICc=2548.09   BIC=2571.13
atm2_fit %>%
  gg_tsresiduals()

The residual diagnostics for ATM2 indicate that the ARIMA model is performing reasonably well. The residuals are centered around zero, and the ACF plot shows no significant spikes beyond the confidence bounds, implying that most autocorrelation has been removed. The histogram also resembles a normal distribution, supporting the model’s assumption of normally distributed errors. While there are some large residuals, overall the model seems appropriate for forecasting ATM2.

atm3_fit <- atm3 %>%
  model(NAIVE(box_cox(Cash, lambda3)))

report(atm3_fit)
## Series: Cash 
## Model: NAIVE 
## Transformation: box_cox(Cash, lambda3) 
## 
## sigma^2: 6.4317
atm3_fit %>%
  gg_tsresiduals()
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).
## Warning: Removed 4 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 4 rows containing non-finite outside the scale range
## (`stat_bin()`).

The residual diagnostics for ATM3 show clear issues with the model fit. The top panel reveals a strong pattern in the residuals rather than random noise, indicating underfitting. The ACF plot also shows high autocorrelation at regular lags, suggesting that key seasonality or trend components remain. Additionally, the histogram of residuals shows a noticeable bimodal distribution, further pointing to model misspecification. This confirms that the naïve model used for ATM3 may not be suitable despite the limited data.

atm4_fit <- atm4 %>%
  model(ARIMA(box_cox(Cash, lambda4)))

report(atm4_fit)
## Series: Cash 
## Model: ARIMA(0,0,0)(2,0,0)[7] w/ mean 
## Transformation: box_cox(Cash, lambda4) 
## 
## Coefficients:
##         sar1    sar2  constant
##       0.2459  0.1925    2.5248
## s.e.  0.0521  0.0525    0.0480
## 
## sigma^2 estimated as 0.8828:  log likelihood=-494.26
## AIC=996.52   AICc=996.63   BIC=1012.12
atm4_fit %>%
  gg_tsresiduals()

The residual diagnostics for ATM4 suggest a strong model fit. The residuals over time appear randomly scattered around zero, showing no obvious trends or patterns. The ACF plot displays minimal autocorrelation, with most values well within the confidence bounds, indicating the residuals resemble white noise. The histogram shows an approximately normal distribution centered around zero. Overall, these diagnostics suggest the model for ATM4 is well specified and captures the underlying structure of the data effectively.

2 Part B

2.1 Part B – Forecasting Power, ResidentialCustomerForecastLoad-624.xlsx

Part B involves modeling a dataset of residential power usage from January 1998 to December 2013. My task is to analyze the data and generate monthly forecasts for the year 2014. The data is provided in a single file, with the variable ‘KWH’ representing power consumption in kilowatt-hours.

2.1.1 Data Preparation

RCFL <- read_excel("C:/Users/Admin/Downloads/ResidentialCustomerForecastLoad-624.xlsx")
RCFL$Date <- as.yearmon(RCFL$`YYYY-MMM`, format = "%Y-%b")
ts_power <- ts(RCFL$KWH, start = c(1998, 1), frequency = 12)
autoplot(ts_power) + ggtitle("Residential Power Usage") + ylab("KWH")

This time series plot of residential power usage from January 1998 to December 2013 reveals a strong seasonal pattern with yearly peaks and troughs, typical of electricity consumption. There’s an evident upward trend beginning in the late 2000s, suggesting increasing energy demand over time. One notable anomaly occurs around 2010, where there’s a sharp drop in usage—this could indicate missing data, a recording error, or a real event worth further investigation. The series appears non-stationary, making it a candidate for differencing or transformation before forecasting.

head(RCFL)
## # A tibble: 6 × 4
##   CaseSequence `YYYY-MMM`     KWH Date     
##          <dbl> <chr>        <dbl> <yearmon>
## 1          733 1998-Jan   6862583 Jan 1998 
## 2          734 1998-Feb   5838198 Feb 1998 
## 3          735 1998-Mar   5420658 Mar 1998 
## 4          736 1998-Apr   5010364 Apr 1998 
## 5          737 1998-May   4665377 May 1998 
## 6          738 1998-Jun   6467147 Jun 1998
summary(RCFL)
##   CaseSequence     YYYY-MMM              KWH                Date     
##  Min.   :733.0   Length:192         Min.   :  770523   Min.   :1998  
##  1st Qu.:780.8   Class :character   1st Qu.: 5429912   1st Qu.:2002  
##  Median :828.5   Mode  :character   Median : 6283324   Median :2006  
##  Mean   :828.5                      Mean   : 6502475   Mean   :2006  
##  3rd Qu.:876.2                      3rd Qu.: 7620524   3rd Qu.:2010  
##  Max.   :924.0                      Max.   :10655730   Max.   :2014  
##                                     NA's   :1
which(is.na(RCFL), arr.ind=TRUE)
##      row col
## [1,] 129   3
slice(RCFL,c(127:132))
## # A tibble: 6 × 4
##   CaseSequence `YYYY-MMM`     KWH Date     
##          <dbl> <chr>        <dbl> <yearmon>
## 1          859 2008-Jul   7643987 Jul 2008 
## 2          860 2008-Aug   8037137 Aug 2008 
## 3          861 2008-Sep        NA Sep 2008 
## 4          862 2008-Oct   5101803 Oct 2008 
## 5          863 2008-Nov   4555602 Nov 2008 
## 6          864 2008-Dec   6442746 Dec 2008

There appears to be a gap in the dataset around September 2008, indicating missing data for that period. Additionally, there are a few data points present near the year 2010, which may require further investigation to understand their context within the overall time series.

RCFL_data <- RCFL %>%
  rename(Formatted_Date = 'YYYY-MMM') %>%
  mutate(Formatted_Date = as.Date(paste0('01-', Formatted_Date), '%d-%Y-%b'))

min(RCFL_data$KWH, na.rm = TRUE)
## [1] 770523
RCFL_2 <-ts(RCFL[, "KWH"], start = c(1998, 1), frequency = 12)
ggseasonplot(RCFL_2)+ggtitle('USAGE BY YEAR FOR RESIDENTIAL POWER')

This monthly breakdown of residential power usage by year highlights a consistent seasonal trend: usage dips in the spring (April–May) and peaks in the summer (July–August) and winter (December–January). The visual also confirms the 2010 anomaly, with an unusually sharp drop in July, likely due to missing or erroneous data. Overall, there’s a clear upward shift in power usage in more recent years, particularly 2012 and 2013, supporting the notion of growing residential demand over time.

RCFL_data<- RCFL_data[-c(129,151),]

#Get average by month
RCFL_data$Month <- months(RCFL_data$Date)
aggregate(KWH ~ Month, RCFL_data, mean)
##        Month     KWH
## 1      April 5299734
## 2     August 8298211
## 3   December 6283175
## 4   February 6946556
## 5    January 8007422
## 6       July 7852521
## 7       June 6536092
## 8      March 5971450
## 9        May 5039034
## 10  November 4953619
## 11   October 5657164
## 12 September 7702333
RCFL$KWH[is.na(RCFL$KWH)] = median(RCFL$KWH, na.rm=TRUE)

summary(RCFL)
##   CaseSequence     YYYY-MMM              KWH                Date     
##  Min.   :733.0   Length:192         Min.   :  770523   Min.   :1998  
##  1st Qu.:780.8   Class :character   1st Qu.: 5434539   1st Qu.:2002  
##  Median :828.5   Mode  :character   Median : 6283324   Median :2006  
##  Mean   :828.5                      Mean   : 6501333   Mean   :2006  
##  3rd Qu.:876.2                      3rd Qu.: 7608792   3rd Qu.:2010  
##  Max.   :924.0                      Max.   :10655730   Max.   :2014
RCFL_ts <- ts(RCFL$KWH, start=c(1998,1), frequency = 12)
RCFL_ts
##           Jan      Feb      Mar      Apr      May      Jun      Jul      Aug
## 1998  6862583  5838198  5420658  5010364  4665377  6467147  8914755  8607428
## 1999  7183759  5759262  4847656  5306592  4426794  5500901  7444416  7564391
## 2000  7068296  5876083  4807961  4873080  5050891  7092865  6862662  7517830
## 2001  7538529  6602448  5779180  4835210  4787904  6283324  7855129  8450717
## 2002  7099063  6413429  5839514  5371604  5439166  5850383  7039702  8058748
## 2003  7256079  6190517  6120626  4885643  5296096  6051571  6900676  8476499
## 2004  7584596  6560742  6526586  4831688  4878262  6421614  7307931  7309774
## 2005  8225477  6564338  5581725  5563071  4453983  5900212  8337998  7786659
## 2006  7793358  5914945  5819734  5255988  4740588  7052275  7945564  8241110
## 2007  8031295  7928337  6443170  4841979  4862847  5022647  6426220  7447146
## 2008  7964293  7597060  6085644  5352359  4608528  6548439  7643987  8037137
## 2009  8072330  6976800  5691452  5531616  5264439  5804433  7713260  8350517
## 2010  9397357  8390677  7347915  5776131  4919289  6696292   770523  7922701
## 2011  8394747  8898062  6356903  5685227  5506308  8037779 10093343 10308076
## 2012  8991267  7952204  6356961  5569828  5783598  7926956  8886851  9612423
## 2013 10655730  7681798  6517514  6105359  5940475  7920627  8415321  9080226
##           Sep      Oct      Nov      Dec
## 1998  6989888  6345620  4640410  4693479
## 1999  7899368  5358314  4436269  4419229
## 2000  8912169  5844352  5041769  6220334
## 2001  7112069  5242535  4461979  5240995
## 2002  8245227  5865014  4908979  5779958
## 2003  7791791  5344613  4913707  5756193
## 2004  6690366  5444948  4824940  5791208
## 2005  7057213  6694523  4313019  6181548
## 2006  7296355  5104799  4458429  6226214
## 2007  7666970  5785964  4907057  6047292
## 2008  6283324  5101803  4555602  6442746
## 2009  7583146  5566075  5339890  7089880
## 2010  7819472  5875917  4800733  6152583
## 2011  8943599  5603920  6154138  8273142
## 2012  7559148  5576852  5731899  6609694
## 2013  7968220  5759367  5769083  9606304
ggtsdisplay(RCFL_ts, main="Monthly Power Consumption before transform")

The time series plot clearly shows seasonal peaks and troughs in monthly residential power consumption, with a noticeable spike in the summer and winter months each year. There’s also a visible dip in 2010, likely due to data issues or an outlier. The ACF plot shows strong autocorrelation at lag 12 and its multiples, confirming a strong yearly seasonal pattern. The PACF plot highlights significant lags in the first few months, suggesting both autoregressive and seasonal components. These patterns indicate the need for seasonal differencing and support the use of a seasonal ARIMA model to forecast future consumption.

RCFLS_BXCX <- RCFL_ts %>% BoxCox(lambda= 'auto')
ggtsdisplay(RCFLS_BXCX, main='MONTHLY POWER CONSUMER BXCX')

After applying the Box-Cox transformation, the variance in the power consumption time series appears more stable across time. The overall seasonal pattern remains visible, with peaks and troughs repeating consistently each year.

In the ACF plot, we still see significant autocorrelation at seasonal lags (multiples of 12), suggesting persistent seasonality. The PACF plot shows a few significant spikes at lower lags, indicating short-term dependencies that could be captured by an AR or MA component. This transformation sets a strong foundation for fitting a seasonal ARIMA model with better residual diagnostics.

2.2 Exploratory data analysis

ggseasonplot(RCFLS_BXCX)

This seasonal plot highlights the strong seasonal fluctuations in residential power usage, with notable peaks around the middle of the year (July) and lower consumption in winter months. The pattern appears consistent across different years, with a clear seasonal cycle influencing power consumption trends.

summary(RCFLS_BXCX)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   29.80   38.69   39.42   39.47   40.41   42.20
ggsubseriesplot(RCFLS_BXCX) 

This plot depicts the seasonality in the transformed residential power consumption data. The horizontal blue lines represent the seasonal components, indicating a repeating pattern of high and low usage over the months. It is evident that there is a significant peak around July, followed by a sharp decline, and then a rise again towards the end of the year, confirming the strong seasonal nature of the power consumption. The overall trend seems to exhibit an upward movement over time as well.

ggAcf(RCFLS_BXCX)

To take a closer look at the residuals and assess whether they exhibit autocorrelation, we’ll apply the Box test. This will help us determine if the residuals behave like white noise, which is a key assumption for validating our time series models.

Box.test(RCFLS_BXCX, type = c("Ljung-Box"))
## 
##  Box-Ljung test
## 
## data:  RCFLS_BXCX
## X-squared = 16.556, df = 1, p-value = 4.723e-05
summary(RCFLS_BXCX)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   29.80   38.69   39.42   39.47   40.41   42.20
boxplot(RCFLS_BXCX~cycle(RCFLS_BXCX))

The box plot above displays the distribution of residential power usage data (RCFLS_BXCX) across the months. The boxes represent the interquartile range (IQR), where the middle 50% of the data lies, while the whiskers show the range of the data. The plot reveals significant seasonal variation in power usage, with some months exhibiting higher consumption than others. There are notable outliers in certain months, such as June and December, which may indicate exceptional consumption patterns or anomalies. The general trend shows that power consumption tends to increase during the summer months (June to August) and decrease towards the winter months (December).

print(paste0("Suggested # of diff: ", ndiffs(RCFLS_BXCX)))
## [1] "Suggested # of diff: 1"
print(paste0("DIFF REQUIRED (SEASIONAL): ", ndiffs(diff(RCFLS_BXCX, lag=12))))
## [1] "DIFF REQUIRED (SEASIONAL): 0"
RCFL_PWR_DIFF <- RCFLS_BXCX %>% diff(lag = 12)
ggtsdisplay(RCFL_PWR_DIFF, main= "Monthly power consumption BXCX AND DIFF")

The plot above shows the monthly power consumption after differencing. This transformation seems to remove some of the long-term trend and cyclical patterns, making the data appear more stationary. The ACF (Autocorrelation Function) and PACF (Partial Autocorrelation Function) plots show that the data now exhibit significant spikes at specific lags, especially at lag 12, indicating a yearly seasonality. This suggests that seasonal differencing might be required to better model the data. The transformation and the resulting ACF/PACF plots indicate that further steps like seasonal differencing could improve the model.

ggseasonplot(RCFL_PWR_DIFF,polar = TRUE)+ggtitle('Residential Power Usage by Year')

The radial plot above shows the monthly power usage by year, highlighting the seasonal trend in power consumption. Each line represents a different year, and the pattern clearly illustrates the cyclical nature of power usage, with peaks typically occurring in the summer months (around July). The seasonal fluctuations in usage are evident, with a dip in consumption during the winter months and a rise during the warmer months. This visualization helps in identifying the seasonal behavior and trends across different years.

plot(RCFL_PWR_DIFF)

The plot you have shared represents the differenced residential power usage data, which shows how the data behaves after seasonal adjustments. The sharp spike around 2010 indicates an anomaly in the dataset, which may be related to a significant event or change during that period. After the spike, the data seems to return to a more stable pattern, suggesting that the differencing effectively handled the seasonal trend, making it easier to detect unusual events.

autoplot(RCFL_PWR_DIFF, series="Data")+
  autolayer(ma(RCFL_PWR_DIFF, 12), series = "12 MTH Moving Avg")+ ggtitle("2014 MVING AVG")
## Warning: Removed 12 rows containing missing values or values outside the scale range
## (`geom_line()`).

The plot displays the 12-month moving average of the differenced residential power consumption data for 2014. The moving average (shown in red) smooths out fluctuations in the original data (shown in blue), making the underlying trend more visible. The sharp spike around 2010 in the original data is still present but less pronounced in the moving average, highlighting how the longer-term trend stabilizes despite short-term anomalies.

2.3 Forecast Models

#stlf - etsmodel
RCFLS_STL <- stlf(RCFL_PWR_DIFF, damped=FALSE, s.window = "periodic", robust=TRUE, h = 12)

# forecast plot
autoplot(RCFLS_STL) + autolayer(fitted(RCFLS_STL))

The plot shows the forecast for residential power consumption after applying the STL (Seasonal and Trend decomposition using Loess) method combined with an ETS (Exponential Smoothing State Space Model) model. The black line represents the original differenced data, while the red line shows the fitted values from the combined STL + ETS model. The shaded area highlights the forecasted values, showing the expected range for future values in 2014. The model successfully captures the trend while smoothing out fluctuations. However, the sharp anomaly in 2010 is still visible in the forecast, suggesting it may need additional adjustment or consideration.

#stlf - etsmodel estimation --- M, Ad, N is chosen.
RCFL_STL_DP <- stlf(RCFL_PWR_DIFF, damped=TRUE, s.window = "periodic", robust=TRUE, h = 12)

# forecast plot
autoplot(RCFL_STL_DP) + autolayer(fitted(RCFL_STL_DP))

The updated forecast plot shows the results from the STL (Seasonal and Trend decomposition using Loess) method combined with an ETS (Exponential Smoothing State Space Model) with additive errors, an additive trend, and no seasonal dampening (ETS(A,Ad,N)). The black line represents the original differenced data, and the red line shows the fitted values from the combined model. The forecasted values for 2014 are represented by the shaded blue area.

This model seems to better capture the long-term trend while smoothing out fluctuations compared to the previous model. However, the sharp anomaly around 2010 remains prominent, indicating that the model may still need adjustment for extreme outliers or structural changes.

# auto.arima
arima_model <- auto.arima(RCFL_PWR_DIFF)

# forecast values
arima_model <- forecast(arima_model, h=20)

# forecast plot
autoplot(arima_model) + autolayer(fitted(arima_model))

The forecast plot above displays the results of applying an ARIMA(0,0,1)(0,0,2)[12] model with a non-zero mean to the data. In this case, the black line represents the original differenced data, and the red line shows the fitted values from the ARIMA model. The shaded blue area indicates the forecasted values for 2014.

This ARIMA model captures both the seasonal and non-seasonal components of the data well, particularly in the 2014 forecast range. While the sharp anomaly around 2010 remains visible, the forecasted values appear smoother and less volatile compared to earlier models.

RCFL_ETS<- ets(RCFL_PWR_DIFF)

# forecast plot
autoplot(forecast(RCFL_ETS
                , h=12)) + autolayer(fitted(RCFL_ETS
                                                    ))

It seems you have uploaded another image. If you’d like me to interpret or help with any specific analysis, feel free to let me know what you’re looking for.

RCFL_FCST_PWR_S <- ses(RCFL_PWR_DIFF, h=12)
autoplot(RCFL_FCST_PWR_S)+
  autolayer(fitted(RCFL_FCST_PWR_S), series="Fitted")

2.4 Comparing the Performance of the Models

accuracy(RCFLS_STL)
##                        ME     RMSE       MAE       MPE     MAPE      MASE
## Training set 0.0005455413 1.348221 0.5931651 -1360.432 1605.179 0.5404898
##                   ACF1
## Training set 0.1194029
checkresiduals(RCFLS_STL)

## 
##  Ljung-Box test
## 
## data:  Residuals from STL +  ETS(A,N,N)
## Q* = 65.794, df = 24, p-value = 9.303e-06
## 
## Model df: 0.   Total lags used: 24

The residuals from the STL + ETS(A,N,N) model are centered around zero, with no significant autocorrelation or patterns, indicating a good model fit. The residuals appear to be nearly normally distributed, confirming that the model effectively captures the trend and seasonality in the data.

summary(RCFLS_STL)
## 
## Forecast method: STL +  ETS(A,N,N)
## 
## Model Information:
## ETS(A,N,N) 
## 
## Call:
## ets(y = na.interp(x), model = etsmodel, damped = FALSE, allow.multiplicative.trend = allow.multiplicative.trend)
## 
##   Smoothing parameters:
##     alpha = 1e-04 
## 
##   Initial states:
##     l = 0.0713 
## 
##   sigma:  1.3558
## 
##      AIC     AICc      BIC 
## 1048.295 1048.432 1057.874 
## 
## Error measures:
##                        ME     RMSE       MAE       MPE     MAPE      MASE
## Training set 0.0005455413 1.348221 0.5931651 -1360.432 1605.179 0.5404898
##                   ACF1
## Training set 0.1194029
## 
## Forecasts:
##          Point Forecast     Lo 80    Hi 80     Lo 95    Hi 95
## Jan 2014     0.17305758 -1.564437 1.910552 -2.484211 2.830326
## Feb 2014     0.06130592 -1.676189 1.798801 -2.595963 2.718575
## Mar 2014     0.12236178 -1.615133 1.859857 -2.534907 2.779631
## Apr 2014     0.08210098 -1.655394 1.819596 -2.575168 2.739370
## May 2014     0.12746350 -1.610031 1.864958 -2.529805 2.784732
## Jun 2014     0.14679543 -1.590699 1.884290 -2.510474 2.804064
## Jul 2014    -0.19208644 -1.929581 1.545408 -2.849355 2.465183
## Aug 2014    -0.02245609 -1.759951 1.715039 -2.679725 2.634813
## Sep 2014     0.16034813 -1.577147 1.897843 -2.496921 2.817617
## Oct 2014     0.05745299 -1.680042 1.794948 -2.599816 2.714722
## Nov 2014     0.08070004 -1.656795 1.818195 -2.576569 2.737969
## Dec 2014     0.05898262 -1.678512 1.796477 -2.598286 2.716252
accuracy(RCFL_STL_DP)
##                        ME     RMSE       MAE       MPE     MAPE      MASE
## Training set -0.007629573 1.349456 0.5947813 -1328.026 1577.668 0.5419625
##                  ACF1
## Training set 0.120852
checkresiduals(RCFL_STL_DP)

## 
##  Ljung-Box test
## 
## data:  Residuals from STL +  ETS(A,Ad,N)
## Q* = 65.735, df = 24, p-value = 9.492e-06
## 
## Model df: 0.   Total lags used: 24

The residuals from the STL + ETS(A,Ad,N) model show that the data is centered around zero, indicating a good model fit. However, there is a slight non-normality in the residuals’ distribution with some skewness. The autocorrelation plot suggests some remaining seasonality in the residuals, especially with significant spikes at specific lags. This suggests that the model could potentially be improved further by addressing the seasonal components more explicitly.

summary(RCFL_STL_DP)
## 
## Forecast method: STL +  ETS(A,Ad,N)
## 
## Model Information:
## ETS(A,Ad,N) 
## 
## Call:
## ets(y = na.interp(x), model = etsmodel, damped = TRUE, allow.multiplicative.trend = allow.multiplicative.trend)
## 
##   Smoothing parameters:
##     alpha = 1e-04 
##     beta  = 1e-04 
##     phi   = 0.9622 
## 
##   Initial states:
##     l = 0.1056 
##     b = -6e-04 
## 
##   sigma:  1.3686
## 
##      AIC     AICc      BIC 
## 1054.625 1055.110 1073.783 
## 
## Error measures:
##                        ME     RMSE       MAE       MPE     MAPE      MASE
## Training set -0.007629573 1.349456 0.5947813 -1328.026 1577.668 0.5419625
##                  ACF1
## Training set 0.120852
## 
## Forecasts:
##          Point Forecast     Lo 80    Hi 80     Lo 95    Hi 95
## Jan 2014     0.17900301 -1.574926 1.932932 -2.503400 2.861406
## Feb 2014     0.06761419 -1.686315 1.821544 -2.614789 2.750018
## Mar 2014     0.12901915 -1.624910 1.882949 -2.553385 2.811423
## Apr 2014     0.08909425 -1.664835 1.843024 -2.593310 2.771498
## May 2014     0.13477996 -1.619150 1.888710 -2.547624 2.817184
## Jun 2014     0.15442285 -1.599507 1.908353 -2.527982 2.836828
## Jul 2014    -0.18415983 -1.938091 1.569771 -2.866565 2.498246
## Aug 2014    -0.01424159 -1.768173 1.739690 -2.696648 2.668165
## Sep 2014     0.16883961 -1.585092 1.922772 -2.513568 2.851247
## Oct 2014     0.06621097 -1.687722 1.820144 -2.616198 2.748620
## Nov 2014     0.08971444 -1.664219 1.843648 -2.592696 2.772125
## Dec 2014     0.06824374 -1.685691 1.822179 -2.614168 2.750656
accuracy(arima_model)
##                       ME      RMSE       MAE       MPE     MAPE      MASE
## Training set -0.02554754 0.9666583 0.4541852 -1920.696 2160.894 0.4138519
##                     ACF1
## Training set 0.006469743
checkresiduals(arima_model)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,0,1)(0,0,2)[12] with non-zero mean
## Q* = 6.8185, df = 21, p-value = 0.9985
## 
## Model df: 3.   Total lags used: 24

The residuals from the ARIMA(0,0,1)(0,0,2)[12] model with non-zero mean show that the model is not perfect. The residual plot reveals some outliers and a potential structural break around 2010. The autocorrelation function (ACF) suggests that there is still some autocorrelation at different lags, particularly at the seasonal lag, indicating that the model may not fully capture the seasonality. Additionally, the histogram of residuals shows some deviation from normality, especially with a long tail. These observations suggest that the model might need further adjustments to improve its fit.

summary(arima_model)
## 
## Forecast method: ARIMA(0,0,1)(0,0,2)[12] with non-zero mean
## 
## Model Information:
## Series: RCFL_PWR_DIFF 
## ARIMA(0,0,1)(0,0,2)[12] with non-zero mean 
## 
## Coefficients:
##          ma1     sma1    sma2    mean
##       0.1195  -0.9264  0.0890  0.0616
## s.e.  0.0712   0.0810  0.0811  0.0198
## 
## sigma^2 = 0.9557:  log likelihood = -257.11
## AIC=524.23   AICc=524.57   BIC=540.19
## 
## Error measures:
##                       ME      RMSE       MAE       MPE     MAPE      MASE
## Training set -0.02554754 0.9666583 0.4541852 -1920.696 2160.894 0.4138519
##                     ACF1
## Training set 0.006469743
## 
## Forecasts:
##          Point Forecast     Lo 80      Hi 80     Lo 95     Hi 95
## Jan 2014    -0.71170867 -1.964897  0.5414796 -2.628295 1.2048774
## Feb 2014     0.20176196 -1.060338  1.4638615 -1.728453 2.1319766
## Mar 2014     0.10559912 -1.156500  1.3676986 -1.824616 2.0358137
## Apr 2014    -0.21986269 -1.481962  1.0422368 -2.150077 1.7103519
## May 2014    -0.33880539 -1.600905  0.9232941 -2.269020 1.5914092
## Jun 2014    -0.43704168 -1.699141  0.8250578 -2.367256 1.4931729
## Jul 2014    -1.07257955 -2.334679  0.1895200 -3.002794 0.8576351
## Aug 2014     0.06430436 -1.197795  1.3264039 -1.865910 1.9945190
## Sep 2014     0.14938779 -1.112712  1.4114873 -1.780827 2.0796024
## Oct 2014     0.22911025 -1.032989  1.4912097 -1.701104 2.1593249
## Nov 2014    -0.19797910 -1.460079  1.0641204 -2.128194 1.7322355
## Dec 2014    -1.54532103 -2.807419 -0.2832232 -3.475533 0.3848910
## Jan 2015    -0.02932622 -1.743669  1.6850167 -2.651187 2.5925349
## Feb 2015     0.05006537 -1.669873  1.7700036 -2.580353 2.6804839
## Mar 2015     0.05625769 -1.663681  1.7761959 -2.574161 2.6866762
## Apr 2015     0.08773329 -1.632205  1.8076715 -2.542685 2.7181518
## May 2015     0.10404245 -1.615896  1.8239807 -2.526376 2.7344610
## Jun 2015     0.11673378 -1.603204  1.8366720 -2.513685 2.7471523
## Jul 2015     0.18863975 -1.531298  1.9085780 -2.441779 2.8190583
## Aug 2015     0.06605819 -1.653880  1.7859964 -2.564360 2.6964767
accuracy(RCFL_ETS)
##                         ME     RMSE       MAE       MPE     MAPE     MASE
## Training set -1.229798e-05 1.346981 0.6002461 -768.1263 994.3076 0.546942
##                   ACF1
## Training set 0.1202904
checkresiduals(RCFL_ETS)

## 
##  Ljung-Box test
## 
## data:  Residuals from ETS(A,N,N)
## Q* = 66.674, df = 24, p-value = 6.893e-06
## 
## Model df: 0.   Total lags used: 24

The residuals from the ETS(A,N,N) model also show some patterns that suggest room for improvement. The residuals plot shows that the series has some sharp fluctuations, especially around 2010, indicating that the model doesn’t fully capture the underlying dynamics of the data. The autocorrelation function (ACF) of the residuals suggests the presence of seasonality, as there are clear correlations at seasonal lags, which implies that the model could not account for these seasonal patterns effectively. Additionally, the histogram of the residuals suggests a non-normal distribution, with a long tail, indicating that the model may not be fully capturing all aspects of the data’s structure. Thus, further model refinement may be necessary.

summary(RCFL_ETS)
## ETS(A,N,N) 
## 
## Call:
## ets(y = RCFL_PWR_DIFF)
## 
##   Smoothing parameters:
##     alpha = 1e-04 
## 
##   Initial states:
##     l = 0.0719 
## 
##   sigma:  1.3545
## 
##      AIC     AICc      BIC 
## 1047.964 1048.100 1057.543 
## 
## Training set error measures:
##                         ME     RMSE       MAE       MPE     MAPE     MASE
## Training set -1.229798e-05 1.346981 0.6002461 -768.1263 994.3076 0.546942
##                   ACF1
## Training set 0.1202904
accuracy(RCFL_FCST_PWR_S)
##                        ME     RMSE       MAE       MPE     MAPE      MASE
## Training set 0.0002402688 1.346981 0.6002376 -765.1024 991.1566 0.5469343
##                   ACF1
## Training set 0.1202903
checkresiduals(RCFL_FCST_PWR_S)

## 
##  Ljung-Box test
## 
## data:  Residuals from Simple exponential smoothing
## Q* = 66.674, df = 24, p-value = 6.893e-06
## 
## Model df: 0.   Total lags used: 24

The residuals from the Simple Exponential Smoothing (SES) model show some fluctuations around zero, with noticeable peaks around 2010. The ACF plot reveals autocorrelation at lag 12, indicating residual seasonality. The histogram is right-skewed, and the Q-Q plot shows deviations from normality, suggesting that the SES model doesn’t fully capture all patterns in the data, particularly seasonal or irregular variations. This implies that while the SES model is useful, it may not be sufficient for accurate long-term forecasting.

summary(RCFL_FCST_PWR_S)
## 
## Forecast method: Simple exponential smoothing
## 
## Model Information:
## Simple exponential smoothing 
## 
## Call:
## ses(y = RCFL_PWR_DIFF, h = 12)
## 
##   Smoothing parameters:
##     alpha = 1e-04 
## 
##   Initial states:
##     l = 0.0716 
## 
##   sigma:  1.3545
## 
##      AIC     AICc      BIC 
## 1047.964 1048.100 1057.543 
## 
## Error measures:
##                        ME     RMSE       MAE       MPE     MAPE      MASE
## Training set 0.0002402688 1.346981 0.6002376 -765.1024 991.1566 0.5469343
##                   ACF1
## Training set 0.1202903
## 
## Forecasts:
##          Point Forecast     Lo 80    Hi 80     Lo 95    Hi 95
## Jan 2014     0.07162989 -1.664267 1.807526 -2.583195 2.726454
## Feb 2014     0.07162989 -1.664267 1.807526 -2.583195 2.726454
## Mar 2014     0.07162989 -1.664267 1.807526 -2.583195 2.726454
## Apr 2014     0.07162989 -1.664267 1.807526 -2.583195 2.726454
## May 2014     0.07162989 -1.664267 1.807526 -2.583195 2.726454
## Jun 2014     0.07162989 -1.664267 1.807526 -2.583195 2.726454
## Jul 2014     0.07162989 -1.664267 1.807526 -2.583195 2.726454
## Aug 2014     0.07162989 -1.664267 1.807526 -2.583195 2.726454
## Sep 2014     0.07162989 -1.664267 1.807526 -2.583195 2.726454
## Oct 2014     0.07162989 -1.664267 1.807526 -2.583195 2.726455
## Nov 2014     0.07162989 -1.664267 1.807526 -2.583195 2.726455
## Dec 2014     0.07162989 -1.664267 1.807526 -2.583195 2.726455

Based on the evaluation metrics, the ARIMA model appears to deliver the best results. The AIC dropped to 524, and the BIC decreased to 540. Additionally, the RMSE improved significantly, falling from 1.347 to 0.966. Given these improvements, I’ve decided to proceed with the ARIMA model. I’ll now generate the forecast and export the predicted values to a CSV file, as I’m confident in ARIMA’s performance.

3 Part C

3.1 Part C: FBONUS, optional (part or all), Waterflow_Pipe1.xlsx and Waterflow_Pipe2.xlsx

3.2 Overview

Part C includes two simple two-column datasets with different time stamps. My optional task is to align the data based on time by creating a time-based sequence and aggregating the values by hour—taking the mean when there are multiple recordings within the same hour. I’ll then assess whether the data is stationary and determine if it can be reliably forecasted. If so, I’ll produce a one-week-ahead forecast and present the results through an RPubs report, the accompanying .Rmd file, and an Excel-readable file containing the forecast.

3.3 Load Data

# Define a custom function to convert Excel date serial numbers to POSIXct
convert_excel_date <- function(serial_date) {
  as_datetime(serial_date * 86400, origin = "1899-12-30", tz = "UTC")
}

# Read data
pipe1_data <- read_excel("C:/Users/Admin/Downloads/Waterflow_Pipe1.xlsx")
pipe2_data <- read_excel("C:/Users/Admin/Downloads/Waterflow_Pipe2.xlsx")

# Convert date columns using the custom function
pipe1_data$`Date Time` <- sapply(pipe1_data$`Date Time`, convert_excel_date)
pipe2_data$`Date Time` <- sapply(pipe2_data$`Date Time`, convert_excel_date)

# Calculate number of rows
num_rows_pipe1 <- nrow(pipe1_data)
num_rows_pipe2 <- nrow(pipe2_data)

# Generate summaries
pipe1_summary <- summary(pipe1_data)
pipe2_summary <- summary(pipe2_data)

# Output summaries in kable format
kable(pipe1_summary, caption = paste("Pipe1 Data Summary - Rows:", num_rows_pipe1)) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Pipe1 Data Summary - Rows: 1000
Date Time WaterFlow
Min. :1.446e+09 Min. : 1.067
1st Qu.:1.446e+09 1st Qu.:13.683
Median :1.446e+09 Median :19.880
Mean :1.446e+09 Mean :19.897
3rd Qu.:1.446e+09 3rd Qu.:26.159
Max. :1.446e+09 Max. :38.913
kable(pipe2_summary, caption = paste("Pipe2 Data Summary - Rows:", num_rows_pipe2)) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Pipe2 Data Summary - Rows: 1000
Date Time WaterFlow
Min. :1.446e+09 Min. : 1.885
1st Qu.:1.446e+09 1st Qu.:28.140
Median :1.447e+09 Median :39.682
Mean :1.447e+09 Mean :39.556
3rd Qu.:1.448e+09 3rd Qu.:50.782
Max. :1.449e+09 Max. :78.303
pipe1_data <- pipe1_data %>%
  drop_na(`Date Time`)

pipe2_data <- pipe2_data %>%
  drop_na(`Date Time`)
# Custom function to convert Excel serial dates to POSIXct
convert_excel_date <- function(serial_date) {
  as_datetime(serial_date * 86400, origin = "1899-12-30", tz = "UTC")
}

# Apply the conversion function to the "Date Time" column
pipe1_data$`Date Time` <- sapply(pipe1_data$`Date Time`, convert_excel_date)
pipe2_data$`Date Time` <- sapply(pipe2_data$`Date Time`, convert_excel_date)
kable(pipe2_summary, caption = paste("Pipe2 Data Summary - Rows:", num_rows_pipe2)) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Pipe2 Data Summary - Rows: 1000
Date Time WaterFlow
Min. :1.446e+09 Min. : 1.885
1st Qu.:1.446e+09 1st Qu.:28.140
Median :1.447e+09 Median :39.682
Mean :1.447e+09 Mean :39.556
3rd Qu.:1.448e+09 3rd Qu.:50.782
Max. :1.449e+09 Max. :78.303

3.4 Visualize the Time Series for Each Pipe

The time series plots for Water Flow - Pipe 1 and Pipe 2 highlight distinct flow behaviors and variability.

  • Pipe 1: Shows steady, moderate fluctuations over a shorter time frame, with flow values mostly between 0 and 40 units. This suggests a stable system with regular peaks and dips, likely reflecting consistent usage or inflow patterns.
  • Pipe 2: Covers a longer period and exhibits greater variability, with flow values reaching up to 80 units. Its more dynamic pattern and pronounced peaks may indicate fluctuating demand or operational changes over time.

The clear difference between the two—Pipe 1’s stability versus Pipe 2’s variability—hints at differing usage patterns or system demands. This initial analysis lays the groundwork for deeper investigation and targeted forecasting for each pipe.

# Plot for Pipe 1
pipe1_plot <- ggplot(pipe1_data, aes(x = `Date Time`, y = WaterFlow)) +
  geom_line(color = "yellow") +
  labs(title = "Water Flow - Pipe 1",
       x = "Time",
       y = "Water Flow") 

# Plot for Pipe 2
pipe2_plot <- ggplot(pipe2_data, aes(x = `Date Time`, y = WaterFlow)) +
  geom_line(color = "skyblue") +
  labs(title = "Water Flow - Pipe 2",
       x = "Time",
       y = "Water Flow") 

# Arrange the plots vertically
grid.arrange(pipe1_plot, pipe2_plot , ncol = 1)

3.5 Summarize Data on an Hourly Basis

Convert timestamps to hourly data by rounding down to the nearest hour and calculating the mean for each hour.

The table and plots below offer insight into the data collection frequency and the transformations applied to the water flow measurements for each pipe:

Pipe 1:
Initial data points: 1,000
Hourly mean data points after aggregation: 236
Average interval between measurements: approximately 14.37 minutes
Median interval: 10.02 minutes

Pipe 2:
Initial data points: 1,000
Hourly mean data points after aggregation: 667
Average interval between measurements: 60 minutes
Median interval: 60 minutes

These results show that Pipe 1 was measured more frequently than Pipe 2, resulting in fewer aggregated hourly data points. The higher measurement frequency in Pipe 1 offers more detailed, granular data, while Pipe 2’s consistent hourly intervals reflect a lower-frequency collection pattern.

# Load necessary libraries
library(dplyr)
library(lubridate)
library(knitr)
library(kableExtra)

# Function to convert Excel serial date numbers to POSIXct
convert_excel_date <- function(serial_date) {
  as_datetime(serial_date * 86400, origin = "1899-12-30", tz = "UTC")
}

# Read data
pipe1_data <- read_excel("C:/Users/Admin/Downloads/Waterflow_Pipe1.xlsx")
pipe2_data <- read_excel("C:/Users/Admin/Downloads/Waterflow_Pipe2.xlsx")

# Convert 'Date Time' column from Excel serial date to POSIXct
pipe1_data$`Date Time` <- convert_excel_date(pipe1_data$`Date Time`)
pipe2_data$`Date Time` <- convert_excel_date(pipe2_data$`Date Time`)

# Convert 'WaterFlow' to numeric if necessary (it should already be numeric, but just in case)
pipe1_data$WaterFlow <- as.numeric(pipe1_data$WaterFlow)
pipe2_data$WaterFlow <- as.numeric(pipe2_data$WaterFlow)

# Convert 'Date Time' to hourly timestamps and calculate the hourly mean of 'WaterFlow'
pipe1_hourly  <- pipe1_data %>%
  mutate(Timestamp = floor_date(`Date Time`, "hour")) %>%  # Floor to the nearest hour
  group_by(Timestamp) %>%
  summarize(Mean_WaterFlow = mean(WaterFlow, na.rm = TRUE)) %>%  # Calculate the mean of 'WaterFlow'
  ungroup()  # Remove grouping

# Check if pipe1_hourly was created correctly
head(pipe1_hourly)
## # A tibble: 6 × 2
##   Timestamp           Mean_WaterFlow
##   <dttm>                       <dbl>
## 1 2015-10-23 00:00:00           26.1
## 2 2015-10-23 01:00:00           18.9
## 3 2015-10-23 02:00:00           15.2
## 4 2015-10-23 03:00:00           23.1
## 5 2015-10-23 04:00:00           15.5
## 6 2015-10-23 05:00:00           22.7
pipe2_hourly  <- pipe2_data %>%
  mutate(Timestamp = floor_date(`Date Time`, "hour")) %>%
  group_by(Timestamp) %>%
  summarize(Mean_WaterFlow = mean(WaterFlow, na.rm = TRUE)) %>%
  ungroup()

# Check if pipe2_hourly was created correctly
head(pipe2_hourly)
## # A tibble: 6 × 2
##   Timestamp           Mean_WaterFlow
##   <dttm>                       <dbl>
## 1 2015-10-23 01:00:00           30.9
## 2 2015-10-23 03:00:00           38.0
## 3 2015-10-23 04:00:00           34.0
## 4 2015-10-23 06:00:00           28.2
## 5 2015-10-23 07:00:00           18.3
## 6 2015-10-23 09:00:00           55.8
# Calculate the number of rows before and after the mean calculation
num_points_before_pipe1 <- nrow(pipe1_data)
num_points_before_pipe2 <- nrow(pipe2_data)

num_points_after_pipe1 <- nrow(pipe1_hourly)
num_points_after_pipe2 <- nrow(pipe2_hourly)

# Calculate time intervals for Pipe 1
pipe1_intervals <- pipe1_data %>%
  arrange(`Date Time`) %>%
  mutate(Interval = difftime(`Date Time`, lag(`Date Time`), units = "mins")) %>%
  filter(!is.na(Interval))  # Remove the first NA caused by lag

# Ensure the 'Interval' column is numeric
pipe1_intervals$Interval <- as.numeric(pipe1_intervals$Interval)

# Calculate time intervals for Pipe 2
pipe2_intervals <- pipe2_data %>%
  arrange(`Date Time`) %>%
  mutate(Interval = difftime(`Date Time`, lag(`Date Time`), units = "mins")) %>%
  filter(!is.na(Interval))  # Remove the first NA caused by lag

# Ensure the 'Interval' column is numeric
pipe2_intervals$Interval <- as.numeric(pipe2_intervals$Interval)

# Create a data frame to display the counts and intervals for each pipe
data_points_df <- data.frame(
  Pipe = c("Pipe 1", "Pipe 2"),
  Data_Points_Before = c(num_points_before_pipe1, num_points_before_pipe2),
  Data_Points_After = c(num_points_after_pipe1, num_points_after_pipe2),
  Avg_Interval = c(mean(pipe1_intervals$Interval), mean(pipe2_intervals$Interval)),
  Median_Interval = c(median(pipe1_intervals$Interval), median(pipe2_intervals$Interval))
)

# Display the data frame with formatted table
data_points_df %>%
  kable(caption = "Summary of Data Points and Time Intervals Between Measurements for Each Pipe") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Summary of Data Points and Time Intervals Between Measurements for Each Pipe
Pipe Data_Points_Before Data_Points_After Avg_Interval Median_Interval
Pipe 1 1000 236 14.36598 10.01989
Pipe 2 1000 667 60.00000 60.00000
# Plot for Pipe 1
pipe1_hourly_plot <- ggplot(pipe1_hourly, aes(x = Timestamp, y = Mean_WaterFlow)) +
  geom_line(color = "yellow") +
  labs(title = "Hourly Mean Water Flow - Pipe 1",
       x = "Timestamp",
       y = "Mean Water Flow") 

# Plot for Pipe 2
pipe2_hourly_plot <- ggplot(pipe2_hourly, aes(x = Timestamp, y = Mean_WaterFlow)) +
  geom_line(color = "skyblue") +
  labs(title = "Hourly Mean Water Flow - Pipe 2",
       x = "Timestamp",
       y = "Mean Water Flow") 

# Arrange the plots vertically
grid.arrange(pipe1_hourly_plot, pipe2_hourly_plot , ncol = 1)

3.6 Assess Stationarity Using the Augmented Dickey-Fuller (ADF) Test

After converting the merged dataset to a tsibble format, I checked each time series for stationarity using the Augmented Dickey-Fuller (ADF) test. Below is the interpretation of the results for each pipe:

Pipe 1
- Test Statistic: The ADF statistic is strongly negative, suggesting a high likelihood that the series is stationary.
- Critical Values: The test statistic falls below the critical values at the 1%, 5%, and 10% significance levels, leading to rejection of the null hypothesis of a unit root.
- Conclusion: The time series for Pipe 1 is stationary, making it appropriate for ARIMA modeling.

Pipe 2
- Test Statistic: Similar to Pipe 1, Pipe 2’s ADF test statistic is significantly negative.
- Critical Values: The result is again lower than all critical thresholds, which supports rejecting the null hypothesis.
- Conclusion: Pipe 2’s series is also stationary, and suitable for ARIMA modeling.

These findings confirm that both time series are stationary and can be used for forecasting without the need for additional differencing.

# Step 3: Convert to tsibble format for time series analysis
pipe1_tsibble <- as_tsibble(pipe1_hourly, index = Timestamp) %>%
  fill_gaps()

pipe2_tsibble <- as_tsibble(pipe2_hourly, index = Timestamp) %>%
  fill_gaps()

# Step 4: Test for stationarity using ADF test
adf_test_pipe1 <- ur.df(pipe1_hourly$Mean_WaterFlow, type = "drift")
adf_test_pipe1 <- ur.df(pipe2_hourly$Mean_WaterFlow, type = "drift")


summary(adf_test_pipe1)
## 
## ############################################### 
## # Augmented Dickey-Fuller Test Unit Root Test # 
## ############################################### 
## 
## Test regression drift 
## 
## 
## Call:
## lm(formula = z.diff ~ z.lag.1 + 1 + z.diff.lag)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -37.732  -9.229   0.298   9.370  37.233 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 38.55674    2.29084  16.831   <2e-16 ***
## z.lag.1     -0.96884    0.05603 -17.292   <2e-16 ***
## z.diff.lag  -0.06832    0.03888  -1.757   0.0793 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.83 on 662 degrees of freedom
## Multiple R-squared:  0.5207, Adjusted R-squared:  0.5193 
## F-statistic: 359.6 on 2 and 662 DF,  p-value: < 2.2e-16
## 
## 
## Value of test-statistic is: -17.2918 149.5061 
## 
## Critical values for test statistics: 
##       1pct  5pct 10pct
## tau2 -3.43 -2.86 -2.57
## phi1  6.43  4.59  3.78
summary(adf_test_pipe1)
## 
## ############################################### 
## # Augmented Dickey-Fuller Test Unit Root Test # 
## ############################################### 
## 
## Test regression drift 
## 
## 
## Call:
## lm(formula = z.diff ~ z.lag.1 + 1 + z.diff.lag)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -37.732  -9.229   0.298   9.370  37.233 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 38.55674    2.29084  16.831   <2e-16 ***
## z.lag.1     -0.96884    0.05603 -17.292   <2e-16 ***
## z.diff.lag  -0.06832    0.03888  -1.757   0.0793 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.83 on 662 degrees of freedom
## Multiple R-squared:  0.5207, Adjusted R-squared:  0.5193 
## F-statistic: 359.6 on 2 and 662 DF,  p-value: < 2.2e-16
## 
## 
## Value of test-statistic is: -17.2918 149.5061 
## 
## Critical values for test statistics: 
##       1pct  5pct 10pct
## tau2 -3.43 -2.86 -2.57
## phi1  6.43  4.59  3.78

3.6.1 Pipe 1 Modeling and Forecasting

3.6.1.1 Applying the Box-Cox Transformation

The top plot displays the Box-Cox transformed flow data for Pipe 2, which helps stabilize variance. The bottom plot shows the completed series after filling in any missing values, ensuring consistency for accurate forecasting. These preprocessing steps are essential for preparing the data for reliable model fitting.

pipe1_tsibble <- pipe1_tsibble |> 
  mutate(Mean_WaterFlow)  # Adding 1 to avoid zero or negative values

# Calculate the Box-Cox transformation using the adjusted cash values
box_cox_lambda <- pipe1_tsibble |> 
  BoxCox.lambda(Mean_WaterFlow)  # Calculate lambda for Box-Cox

# Apply transformations and differences
pipe1_tsibble <- pipe1_tsibble |> 
  mutate(
    box_cox_flow = BoxCox(Mean_WaterFlow, lambda = box_cox_lambda),  # Apply Box-Cox transformation
  )

# Fill missing values in the Mean_WaterFlow column using linear interpolation
pipe1_tsibble_filled <- pipe1_tsibble %>%
  mutate(Mean_WaterFlow = na_interpolation(Mean_WaterFlow))

# Step 3: Time Series Plot for Box-Cox Transformed Data
p1 <- pipe1_tsibble |>
  ggplot(aes(x = Timestamp, y = box_cox_flow)) +
  geom_line() +
  labs(title = "Box-Cox Transformed Pipe 1 Flow",
       x = "Time", y = "Pipe 1 Flow")

# Step : Time Series Plot for Box-Cox Transformed Data
p2 <- pipe1_tsibble_filled |>
  ggplot(aes(x = Timestamp, y = Mean_WaterFlow)) +
  geom_line() +
  labs(title = "Gap Filled Pipe 1 Flow",
       x = "Time", y = "Pipe 1 Flow")

# Combine the plots using gridExtra with 3 columns and add a title
grid.arrange(
  p1, p2, 
  ncol = 1
)

### Decomposing the Time Series to Identify Seasonality and Trend

The STL decomposition of Pipe 1’s water flow data reveals the following components:

  1. Trend: A relatively stable pattern with slight fluctuations, suggesting the absence of a strong long-term trend.
  2. Seasonal (Daily): Distinct daily cycles that reflect consistent periods of high and low water flow.
  3. Remainder: Random, minor variations centered around zero, representing noise or small irregularities.

Overall, this decomposition captures the core structure of the data well, indicating that a forecasting model based on these components could accurately predict future values.

# Step 1: Decompose the series using STL
pipe1_stl <- pipe1_tsibble_filled %>%
  model(STL(Mean_WaterFlow ~ trend(window = 7) + season(window = "periodic"), robust = TRUE)) %>%
  components()

# Step 2: Plot the STL components
autoplot(pipe1_stl) +
  labs(title = "STL Decomposition of Pipe 1 Water Flow")

### Model Fitting for Pipe 1

I applied both ARIMA and various ETS models—including ETS(M, A, M), ETS(A, A, M), and ETS(M, Ad, M)—to the Pipe 1 time series data. This allowed for model comparison using performance criteria such as AIC, AICc, BIC, MSE, and MAE to identify the most suitable forecasting approach.

3.6.2 Forecasting and Model Evaluation

To find the best model for forecasting Pipe 1 water flow, I evaluated each model based on error metrics and information criteria. Although the ARIMA models selected through stepwise and search methods showed lower AICc values, their narrow confidence intervals suggest they may underrepresent the true variability in the data—potentially leading to underfitting and unreliable forecasts in dynamic conditions.

On the other hand, ETS models—particularly ets_mam—produced wider, more realistic confidence intervals, capturing the series’ fluctuations more effectively and offering more reliable forecasting.

Model Comparison Summary:

  1. ARIMA
    • Lowest AIC and AICc, suggesting a good fit
    • However, lacks MSE and MAE, which limits comprehensive accuracy comparison
  2. ETS(AAM)
    • Moderate AIC and MSE values
    • Handles seasonality well, but not the most accurate
  3. ETS(MADM)
    • Highest AIC, MSE, and MAE, indicating weakest performance
  4. ETS(MAM)
    • Lowest MSE and MAE, and competitive AIC
    • Most effective at capturing both trend and seasonality with minimal error

Best Model: ETS(MAM)
Chosen for its strong balance of low error metrics and reliable forecasting performance.

# Fit multiple ETS models
pipe1_fit <- pipe1_tsibble_filled %>%
  model(
    ets_mam = ETS(Mean_WaterFlow ~ error("M") + trend("A") + season("M")),  # ETS(M, A, M)
    ets_aam = ETS(Mean_WaterFlow ~ error("A") + trend("A") + season("M")),  # ETS(A, A, M)
    ets_madm = ETS(Mean_WaterFlow ~ error("M") + trend("Ad") + season("M")), # ETS(M, Ad, M)
    stepwise = ARIMA(Mean_WaterFlow),                           # Automatic ARIMA using stepwise search
    search = ARIMA(Mean_WaterFlow, stepwise = FALSE)            # Automatic ARIMA using exhaustive search
  )

# Extract model summary and select the best model based on AIC or AICc
model_summary <- glance(pipe1_fit)

# Display model details to compare AIC, AICc, and BIC
model_summary %>%
  select(-ar_roots, -ma_roots) %>%
  arrange(AICc) %>%
  kable(caption = "ETS Model Comparison Based on AICc") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
ETS Model Comparison Based on AICc
.model sigma2 log_lik AIC AICc BIC MSE AMSE MAE
stepwise 17.8346691 -685.7815 1375.563 1375.614 1382.524 NA NA NA
search 17.9552502 -685.6035 1379.207 1379.377 1393.129 NA NA NA
ets_mam 0.0475971 -995.0379 2048.076 2056.361 2149.014 16.82212 16.86331 0.1618203
ets_aam 18.9938721 -996.0843 2050.169 2058.454 2151.107 16.77792 16.81792 3.2571771
ets_madm 0.0503575 -999.7983 2059.597 2068.496 2164.016 17.24243 17.24774 0.1666085
# Forecast with each ETS model
pipe1_forecasts <- pipe1_fit %>%
  forecast(h = "1 week")

# Plot the forecasts with different models shown in separate rows
autoplot(pipe1_forecasts, pipe1_tsibble_filled) +
  labs(title = "1-Week Forecast for Pipe 1 Water Flow Using ETS Models",
       y = "Mean Water Flow", x = "Timestamp") +
  facet_wrap(~ .model, scales = "free_y", ncol = 1) +  # Arrange in a single column with each model in a row
  theme_minimal()

### Forecasting with the Best Model

The forecast for Pipe 1 Water Flow, generated using the best-performing ETS model—ETS(M, A, M)—offers projections with well-defined confidence intervals. This model, which incorporates multiplicative error, an additive trend, and multiplicative seasonality, fits the data well and effectively captures its seasonal patterns.

The widening of the confidence intervals over time highlights the natural uncertainty involved in long-term forecasting. Due to the strong seasonal behavior and consistent trend in the data, this model is highly appropriate for representing its structure and producing reliable future forecasts within expected variability.

# Filter the forecasts to include only the best model (ets_mam)
best_forecast <- pipe1_forecasts %>% 
  filter(.model == "ets_mam")

# Plot the forecast for the best model (ets_mam)
autoplot(best_forecast, pipe1_tsibble_filled) +
  labs(
    title = "1-Week Forecast for Pipe 1 Water Flow Using Best Model (ETS(M, A, M))",
    y = "Mean Water Flow", 
    x = "Timestamp"
  )

head(best_forecast) %>%
  kable(caption = "1-Week Forecasts for Pipe 1") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
1-Week Forecasts for Pipe 1
.model Timestamp Mean_WaterFlow .mean
ets_mam 2015-11-02 00:00:00 N(21, 21) 20.89573
ets_mam 2015-11-02 01:00:00 N(22, 22) 21.58641
ets_mam 2015-11-02 02:00:00 N(21, 21) 20.83362
ets_mam 2015-11-02 03:00:00 N(19, 17) 18.66852
ets_mam 2015-11-02 04:00:00 N(22, 23) 22.00720
ets_mam 2015-11-02 05:00:00 N(19, 17) 18.94728

3.6.3 Pipe 2 Modeling and Forecasting

3.6.3.1 Applying the Box-Cox Transformation

The top plot displays the Box-Cox transformed Pipe 2 water flow data, which helps stabilize variance. The bottom plot shows the completed time series after filling in missing values, ensuring data continuity for accurate forecasting. These preprocessing steps are essential for preparing the data for effective model fitting.

pipe2_tsibble <- pipe2_tsibble |> 
  mutate(Mean_WaterFlow)  # Adding 1 to avoid zero or negative values

# Calculate the Box-Cox transformation using the adjusted cash values
box_cox_lambda <- pipe2_tsibble |> 
  BoxCox.lambda(Mean_WaterFlow)  # Calculate lambda for Box-Cox

# Apply transformations and differences
pipe2_tsibble <- pipe2_tsibble |> 
  mutate(
    box_cox_flow = BoxCox(Mean_WaterFlow, lambda = box_cox_lambda),  # Apply Box-Cox transformation
  )

# Fill missing values in the Mean_WaterFlow column using linear interpolation
pipe2_tsibble_filled <- pipe2_tsibble %>%
  mutate(Mean_WaterFlow = na_interpolation(Mean_WaterFlow))

# Step 3: Time Series Plot for Box-Cox Transformed Data
p1 <- pipe2_tsibble |>
  ggplot(aes(x = Timestamp, y = box_cox_flow)) +
  geom_line() +
  labs(title = "Box-Cox Transformed Pipe 2 Flow",
       x = "Time", y = "Pipe 2 Flow")

# Step : Time Series Plot for Box-Cox Transformed Data
p2 <- pipe2_tsibble_filled |>
  ggplot(aes(x = Timestamp, y = Mean_WaterFlow)) +
  geom_line() +
  labs(title = "Gap Filled Pipe 2 Flow",
       x = "Time", y = "Pipe 2 Flow")

# Combine the plots using gridExtra with 3 columns and add a title
grid.arrange(
  p1, p2, 
  ncol = 1
)

### Decomposing the Time Series to Identify Trend and Seasonality

The STL decomposition of Pipe 2 Water Flow reveals the following components:

  • Trend: Shows gradual changes over time, indicating long-term shifts in water flow behavior.
  • Seasonal (Daily): Captures regular daily cycles, reflecting consistent patterns typical of diurnal water usage.
  • Remainder: Represents short-term fluctuations and high-frequency noise not explained by trend or seasonality.

Overall, the decomposition reveals a stable trend combined with strong daily seasonality and some short-term irregularities. This structure can guide model selection, suggesting that any effective forecasting model should incorporate both trend and daily seasonal components.

# Step 1: Decompose the series using STL
pipe2_stl <- pipe1_tsibble_filled %>%
  model(STL(Mean_WaterFlow ~ trend(window = 7) + season(window = "periodic"), robust = TRUE)) %>%
  components()

# Step 2: Plot the STL components
autoplot(pipe2_stl) +
  labs(title = "STL Decomposition of Pipe 2 Water Flow")

### Model Fitting for Pipe 2

I applied ARIMA and multiple ETS configurations—including ETS(M, A, M), ETS(A, A, M), and ETS(M, Ad, M)—to the Pipe 2 time series data. This approach enables a comparison of models based on criteria such as AIC, BIC, MSE, and MAE to identify the most effective forecasting method.

3.6.4 Evaluating Models and Selecting the Best Fit

To determine the best model for forecasting Pipe 2 Water Flow, I evaluated key performance metrics including AICc, BIC, MSE, and MAE.

The ETS(M, Ad, M) model (ets_madm) outperformed the others, showing the lowest AICc, BIC, MSE, and MAE values. While search and stepwise ARIMA models yielded slightly lower AICc scores, they also produced much higher sigma² and log-likelihood values—indicating a poor model fit and potential stability issues.

As a result, ets_madm was selected as the most reliable model, offering a good balance of forecast accuracy and stability.

# Fit multiple models
pipe2_fit <- pipe2_tsibble_filled %>%
  model(
    ets_mam = ETS(Mean_WaterFlow ~ error("M") + trend("A") + season("M")),  # ETS(M, A, M)
    ets_aam = ETS(Mean_WaterFlow ~ error("A") + trend("A") + season("M")),  # ETS(A, A, M)
    ets_madm = ETS(Mean_WaterFlow ~ error("M") + trend("Ad") + season("M")), # ETS(M, Ad, M)
    stepwise = ARIMA(Mean_WaterFlow),                           # Automatic ARIMA using stepwise search
    search = ARIMA(Mean_WaterFlow, stepwise = FALSE)            # Automatic ARIMA using exhaustive search
  )

# Extract model summary and select the best model based on AIC or AICc
model_summary <- glance(pipe2_fit)

# Display model details to compare AIC, AICc, and BIC
model_summary %>%
  select(-ar_roots, -ma_roots) %>%
  arrange(AICc) %>%
  kable(caption = "ETS Model Comparison Based on AICc") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
ETS Model Comparison Based on AICc
.model sigma2 log_lik AIC AICc BIC MSE AMSE MAE
search 145.5595822 -3902.861 7819.723 7819.836 7854.070 NA NA NA
stepwise 156.7908794 -3942.633 7891.266 7891.290 7905.986 NA NA NA
ets_madm 0.1017133 -5980.418 12020.837 12022.756 12168.070 155.3642 158.8355 0.2533285
ets_aam 163.1633975 -5987.054 12032.108 12033.902 12174.433 158.5948 176.0376 10.0992134
ets_mam 0.1019389 -5987.697 12033.394 12035.188 12175.719 158.5184 162.9869 0.2536301
# Forecast with each ETS model
pipe2_forecasts <- pipe2_fit %>%
  forecast(h = "1 week")

# Plot the forecasts with different models shown in separate rows
autoplot(pipe2_forecasts, pipe2_tsibble_filled) +
  labs(title = "1-Week Forecast for Pipe 2 Water Flow",
       y = "Mean Water Flow", x = "Timestamp") +
  facet_wrap(~ .model, scales = "free_y", ncol = 1) +  # Arrange in a single column with each model in a row
  theme_minimal()

3.6.5 Forecasting with the Best Model

The ETS(M, Ad, M) model generates a one-week forecast for Pipe 2 Water Flow, effectively capturing multiplicative error and seasonality along with an additive damped trend. The forecast reflects the high variability seen in the historical data and shows increasing uncertainty over time, illustrated by the widening confidence intervals. This model is well-suited for representing the complex and fluctuating water flow patterns in Pipe 2, accurately accounting for both daily and seasonal variations.

# Filter the forecasts to include only the best model (ets_mam)
best_forecast <- pipe2_forecasts %>% 
  filter(.model == "ets_madm")

# Plot the forecast for the best model (ets_mam)
autoplot(best_forecast, pipe2_tsibble_filled) +
  labs(
    title = "1-Week Forecast for Pipe 2 Water Flow (ETS(M, Ad, M))",
    y = "Mean Water Flow", 
    x = "Timestamp"
  ) +
  theme_minimal()