Load Packages

library(tidyverse)
library(lubridate)
library(ggplot2)
library(forecast)
library(fpp3)

Load Dataset

For reproducibility of the RMD file, I uploaded a CSV file of each dataset to my GitHub.

atm_raw_data <- read.csv("https://raw.githubusercontent.com/suswong/DATA-624/refs/heads/main/ATM624Data.csv")

residential_raw_data <- read.csv("https://raw.githubusercontent.com/suswong/DATA-624/refs/heads/main/ResidentialCustomerForecastLoad-624.csv")

pipe1_raw_data <- read.csv("https://raw.githubusercontent.com/suswong/DATA-624/refs/heads/main/Waterflow_Pipe1.csv")

pipe2_raw_data <- read.csv("https://raw.githubusercontent.com/suswong/DATA-624/refs/heads/main/Waterflow_Pipe2.csv")

ATM DATA

Goal: Forecast ATM cash withdrawal from each ATM for May 2010.

Data Exploration & Wrangling

This time series contains 1,474 observations of ATM cash withdrawals from May 2009 to May 2010. Looking at the raw data, we need to convert DATE from character to datetime and convert ATM to a factor.

There are 19 observations with missing values in the amount of cash withdrawn. The maximum cash withdrawal of 10,920 (equivalent to $1,092,000, since cash is recorded in hundreds of dollars) seems to be an outlier compared to the mean ($15,560) and median ($7,300) withdrawal.

summary(atm_raw_data)
##      DATE               ATM                 Cash        
##  Length:1474        Length:1474        Min.   :    0.0  
##  Class :character   Class :character   1st Qu.:    0.5  
##  Mode  :character   Mode  :character   Median :   73.0  
##                                        Mean   :  155.6  
##                                        3rd Qu.:  114.0  
##                                        Max.   :10920.0  
##                                        NA's   :19
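For reference, we can pull out the observation behind that maximum directly:

atm_raw_data %>%
  filter(Cash == max(Cash, na.rm = TRUE))   # locate the extreme withdrawal flagged by the summary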

There are 14 missing values in the ATM column (stored as blank strings) and 19 missing values in the Cash column.
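Because summary() only counts NAs in the numeric Cash column, a quick tally verifies both counts:

atm_raw_data %>%
  summarise(missing_atm = sum(is.na(ATM) | ATM == ""),   # blank ATM labels
            missing_cash = sum(is.na(Cash)))             # missing withdrawal amounts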

atm_data <- atm_raw_data %>%
  mutate(DATE = mdy_hms(DATE))

atm_data <- atm_data %>%
  mutate(DATE = as_date(DATE),
         ATM = as.factor(ATM),
         Cash = as.numeric(Cash)) %>%
  as_tsibble(index = DATE, key = ATM)

#summary(atm_data)

Missing Data

There are 19 observations with missing data in the amount of cash that was withdrawn.

14 of these missing values occur in May 2010, the month we need to forecast, so we filter those observations out of the training data.

The remaining 5 observations are missing cash values in June 2009: three from ATM1 and two from ATM2.

atm_data %>% 
  filter(is.na(Cash))
## # A tsibble: 19 x 3 [1D]
## # Key:       ATM [3]
##    DATE       ATM     Cash
##    <date>     <fct>  <dbl>
##  1 2010-05-01 ""        NA
##  2 2010-05-02 ""        NA
##  3 2010-05-03 ""        NA
##  4 2010-05-04 ""        NA
##  5 2010-05-05 ""        NA
##  6 2010-05-06 ""        NA
##  7 2010-05-07 ""        NA
##  8 2010-05-08 ""        NA
##  9 2010-05-09 ""        NA
## 10 2010-05-10 ""        NA
## 11 2010-05-11 ""        NA
## 12 2010-05-12 ""        NA
## 13 2010-05-13 ""        NA
## 14 2010-05-14 ""        NA
## 15 2009-06-13 "ATM1"    NA
## 16 2009-06-16 "ATM1"    NA
## 17 2009-06-22 "ATM1"    NA
## 18 2009-06-18 "ATM2"    NA
## 19 2009-06-24 "ATM2"    NA

Filter out the 14 observations in May 2010.

atm_filter <- atm_data %>%
  filter(!(is.na(Cash) & month(DATE) == 5 & year(DATE) == 2010))

atm_filter  %>% 
  filter(is.na(Cash))
## # A tsibble: 5 x 3 [1D]
## # Key:       ATM [2]
##   DATE       ATM    Cash
##   <date>     <fct> <dbl>
## 1 2009-06-13 ATM1     NA
## 2 2009-06-16 ATM1     NA
## 3 2009-06-22 ATM1     NA
## 4 2009-06-18 ATM2     NA
## 5 2009-06-24 ATM2     NA

Since ATM cash withdrawals are likely seasonal by day of the week, we can use na.interp() to fill the missing cash values, as it preserves the trend and seasonality.

Initially, I thought of replacing the missing values with the mean cash of the relevant ATM. However, doing so would ignore the trend and seasonality. Filling the missing values with a simple “0” would misrepresent them as days with zero withdrawals.

# Source: https://otexts.com/fpp2/missing-outliers.html
atm_final <- atm_filter %>%
  group_by_key() %>%
  mutate(Cash = na.interp(Cash)) %>%
  ungroup()
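A quick check confirms no missing cash values remain after interpolation:

atm_final %>%
  filter(is.na(Cash))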

Outliers

Plotting all ATMs in one plot, we can see that ATM4 generally has higher cash withdrawals than the other three ATMs. However, this makes it difficult to see the trends of the other ATMs, so we should plot them separately.

atm_final %>%
  autoplot(Cash) +
  labs(title = "ATM Cash Withdrawals from May 2009 to April 2010", 
       x = "Date",
       y = "Cash (hundreds in dollars)")

With the ATMs plotted separately, we can see that ATM3’s values are zero for almost the entire period; only three days in late April 2010 contain nonzero cash values. There could be various explanations for this:

  • This is a new ATM, so there are no prior cash withdrawals.

  • The data records daily cash withdrawals in hundreds of dollars. Perhaps the amounts were rounded: withdrawals under $50 would round to 0 when rounded to the nearest hundred.

Regardless, forecasting this ATM will not be meaningful, as there is no clear pattern in the three days that contain nonzero cash values. We can either exclude this ATM or fit a simple naive model.
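For reference, the handful of nonzero ATM3 observations can be inspected directly:

# The only nonzero ATM3 withdrawals, all in late April 2010
atm_final %>%
  filter(ATM == "ATM3", Cash > 0)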

ATM4 has a large outlier on February 9th, 2010, with an unusual cash withdrawal of $1,092,000. I searched online to see if there were any events occurring in early February that might cause such a large withdrawal. I found:

  • February 5th-6th: A major snowstorm in the eastern United States caused flight cancellations and power outages. Since people may have been snowed in, the storm could lead to unusual cash withdrawals in the days before and after it.

  • February 12th: The XXI Olympic Winter Games opened in Vancouver, Canada. Big events like this attract large crowds and higher spending near the area.

Even so, there is no available information from the dataset that connects the February 9th spike to the events above.

atm_final %>%
  autoplot(Cash) +
  facet_wrap(~ATM, scales = "free_y") +
  labs(title = "ATM Cash Withdrawals from May 2009 to April 2010", 
       x = "Date",
       y = "Cash (hundreds in dollars)")

There are outliers present in ATM1 and ATM4. ATM3 appears to have three outlier points; recall that ATM3 has essentially no cash withdrawals prior to late April 2010. Due to the extreme value on the right, ATM4 is right skewed.

atm_final %>%
  ggplot(aes(x = Cash, fill = ATM)) + 
  geom_boxplot() + 
  facet_wrap(~ATM, scales = "free_x") +
  labs(title = "Boxplot of ATM Cash Withdrawals from May 2009 to April 2010", 
       x = "Cash (hundreds in dollars)")


There are a variety of ways to deal with outliers. We can use tsclean() or tsoutliers() to impute the outliers. Both functions can help maintain the trend and seasonality while imputing the outliers.

atm_final <- atm_final %>%
  group_by_key() %>%
  mutate(
    Cash = {
      # Identify outliers per ATM and replace them with the values
      # suggested by tsoutliers(), leaving the rest of the series untouched.
      series <- ts(Cash)
      out <- tsoutliers(series)
      if (length(out$index) > 0) {
        series[out$index] <- out$replacements
      }
      as.numeric(series)
    }
  ) %>%
  ungroup()

Plotting the cash withdrawals again, we see the extreme outliers are no longer there.

atm_final %>%
  autoplot(Cash) +
  facet_wrap(~ATM, scales = "free_y") +
  labs(title = "ATM Cash Withdrawals from May 2009 to April 2010", 
       x = "Date",
       y = "Cash (hundreds in dollars)")

Distribution

The histogram of cash withdrawal can help determine if transformation of the data will be needed.

ATM1 and ATM2 appear roughly bimodal but close to normal, so transformation may not be needed.

ATM3 is heavily right skewed, as the dataset contains zero cash withdrawals except for the three recent dates. Even with a transformation, there is not enough information to produce a meaningful forecast.

ATM4 appears right skewed. We will need a transformation to reduce skewness and stabilize the variance.

atm_raw_data %>%
  ggplot(aes(x = Cash, fill = ATM)) + 
  geom_histogram(bins=40) + 
  facet_wrap(~ATM, scales = "free") +
  labs(title = "Histogram of ATM Cash Withdrawals from Orginal Dataset", 
       x = "Cash (hundreds in dollars)")
## Warning: Removed 19 rows containing non-finite outside the scale range
## (`stat_bin()`).

Imputing the missing values and outliers did help with the visualization of the data. The histograms are smoother, especially for ATM4.

atm_final %>%
  ggplot(aes(x = Cash, fill = ATM)) + 
  geom_histogram(bins=40) + 
  facet_wrap(~ATM, scales = "free") +
  labs(title = "Histogram of ATM Cash Withdrawals (after dealing with missing values and outliers)", 
       x = "Cash (hundreds in dollars)")

lambda_df <- atm_final %>% 
  group_by(ATM) %>% 
  features(Cash, features = guerrero) %>% 
  select(ATM, lambda_guerrero)
## Warning in optimise(lambda_coef_var, c(lower, upper), x = x, .period =
## max(.period, : NA/NaN replaced by maximum positive value
## Warning in optimise(lambda_coef_var, c(lower, upper), x = x, .period =
## max(.period, : NA/NaN replaced by maximum positive value
atm_transformed <- atm_final %>%
  left_join(lambda_df, by= "ATM") %>%
  mutate(transformed_cash = box_cox(Cash, lambda_guerrero))

Modeling

Stationary or Non-stationary

Below we apply the KPSS unit root test to check whether each series is stationary. ATM1 and ATM2 have p-values below 0.05, so we reject the null hypothesis of stationarity for them; ATM3 and ATM4 appear stationary.

atm_final %>%
  features(Cash, unitroot_kpss)
## # A tibble: 4 × 3
##   ATM   kpss_stat kpss_pvalue
##   <fct>     <dbl>       <dbl>
## 1 ATM1      0.509      0.0395
## 2 ATM2      2.15       0.01  
## 3 ATM3      0.389      0.0818
## 4 ATM4      0.122      0.1

One seasonal difference is required to make ATM1 and ATM2 stationary. No further regular difference is needed.

atm_final %>%
  features(Cash, unitroot_nsdiffs)
## # A tibble: 4 × 2
##   ATM   nsdiffs
##   <fct>   <int>
## 1 ATM1        1
## 2 ATM2        1
## 3 ATM3        0
## 4 ATM4        0
atm_final %>%
  features(difference(Cash, 7), unitroot_ndiffs)
## # A tibble: 4 × 2
##   ATM   ndiffs
##   <fct>  <int>
## 1 ATM1       0
## 2 ATM2       0
## 3 ATM3       0
## 4 ATM4       0

ATM 1

The chosen model for ATM1 is ARIMA(0,0,1)(0,1,2)[7], which consists of one non-seasonal moving average term, one seasonal difference, and two seasonal moving average terms.

atm1_fit <- atm_final %>%
  filter(ATM == "ATM1")%>%
  model(ARIMA(Cash))


report(atm1_fit)
## Series: Cash 
## Model: ARIMA(0,0,1)(0,1,2)[7] 
## 
## Coefficients:
##          ma1     sma1     sma2
##       0.1950  -0.5806  -0.1037
## s.e.  0.0546   0.0506   0.0494
## 
## sigma^2 estimated as 556.4:  log likelihood=-1640.01
## AIC=3288.03   AICc=3288.14   BIC=3303.55

Below are the residual diagnostics of our chosen model. The top panel shows the innovation residuals over time: they are centered around zero with no particular trend or pattern. The ACF plot shows all spikes within the confidence bands, indicating no significant autocorrelation. The residuals appear approximately normal and behave like white noise. All are indications of a good fit.

atm1_fit %>%
  filter(ATM == "ATM1")%>%
  gg_tsresiduals()
## Warning: `gg_tsresiduals()` was deprecated in feasts 0.4.2.
## ℹ Please use `ggtime::gg_tsresiduals()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
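To back up the visual diagnostics, a Ljung-Box test can be run on the residuals (a sketch; lag 14 covers two weekly cycles):

atm1_fit %>%
  augment() %>%
  features(.innov, ljung_box, lag = 14)

A large p-value would indicate the residuals are indistinguishable from white noise. The same check applies to the other ATM models below.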

ATM 2

The chosen model for ATM2 is ARIMA(2,0,2)(0,1,1)[7], which consists of two non-seasonal autoregressive terms, two non-seasonal moving average terms, one seasonal difference, and one seasonal moving average term.

atm2_fit <- atm_final %>%
  filter(ATM == "ATM2")%>%
  model(ARIMA(Cash))


report(atm2_fit)
## Series: Cash 
## Model: ARIMA(2,0,2)(0,1,1)[7] 
## 
## Coefficients:
##           ar1      ar2     ma1     ma2     sma1
##       -0.4320  -0.9130  0.4773  0.8048  -0.7547
## s.e.   0.0553   0.0407  0.0861  0.0556   0.0381
## 
## sigma^2 estimated as 602.5:  log likelihood=-1653.67
## AIC=3319.33   AICc=3319.57   BIC=3342.61

As with ATM1, the residual diagnostics below look good: the innovation residuals are centered around zero with no particular pattern, the ACF spikes are all within the confidence bands, and the residuals appear approximately normal, behaving like white noise.

atm2_fit %>%
  filter(ATM == "ATM2")%>%
  gg_tsresiduals()

ATM 3

Since ATM3 has only three nonzero observations, there is no trend or seasonality to model, so we fit a simple naive model that carries the last observed value forward.

atm3_fit <- atm_final %>%
  filter(ATM == "ATM3") %>%
  model(NAIVE(Cash))


atm3_fit
## # A mable: 1 x 2
## # Key:     ATM [1]
##   ATM   `NAIVE(Cash)`
##   <fct>       <model>
## 1 ATM3        <NAIVE>
report(atm3_fit)
## Series: Cash 
## Model: NAIVE 
## 
## sigma^2: 25.8985
atm3_fit %>%
  filter(ATM == "ATM3")%>%
  gg_tsresiduals()
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 1 row containing non-finite outside the scale range
## (`stat_bin()`).
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_rug()`).

ATM 4

The chosen model for ATM4 is ARIMA(0,0,1)(2,0,0)[7] with a mean, fitted to the Box-Cox transformed series. It consists of one non-seasonal moving average term and two seasonal autoregressive terms.

atm4_fit <- atm_transformed %>%
  filter(ATM == "ATM4")%>%
  model(ARIMA(transformed_cash))

atm4_fit
## # A mable: 1 x 2
## # Key:     ATM [1]
##   ATM          `ARIMA(transformed_cash)`
##   <fct>                          <model>
## 1 ATM4  <ARIMA(0,0,1)(2,0,0)[7] w/ mean>
report(atm4_fit)
## Series: transformed_cash 
## Model: ARIMA(0,0,1)(2,0,0)[7] w/ mean 
## 
## Coefficients:
##          ma1    sar1    sar2  constant
##       0.0789  0.2078  0.2022   16.8843
## s.e.  0.0527  0.0516  0.0525    0.7312
## 
## sigma^2 estimated as 176.2:  log likelihood=-1460.27
## AIC=2930.53   AICc=2930.7   BIC=2950.03

The residual diagnostics below again look reasonable: the innovation residuals are centered around zero with no clear pattern, the ACF spikes are within the confidence bands, and the residuals appear approximately normal, behaving like white noise.

atm4_fit %>%
  filter(ATM == "ATM4")%>%
  gg_tsresiduals()

Forecasting

All fits are reasonable except for ATM3, where there are not enough observations to forecast any trend or seasonality.

atm1_fit %>%
  forecast(h = 31) %>%
  autoplot(atm_final)

atm1_fc <- atm1_fit %>%
  forecast(h = 31)
atm1_fc %>%
  write_csv("~/Downloads/atm1_fc_data.csv")

atm2_fit %>%
  forecast(h = 31) %>%
  autoplot(atm_final)

atm2_fc <- atm2_fit %>%
  forecast(h = 31)
atm2_fc %>%
  write_csv("~/Downloads/atm2_fc_data.csv")

atm3_fit %>%
  forecast(h = 31) %>%
  autoplot(atm_final)

atm3_fc <- atm3_fit %>%
  forecast(h = 31)
atm3_fc %>%
  write_csv("~/Downloads/atm3_fc_data.csv")

atm4_fit %>%
  forecast(h = 31) %>%
  autoplot(atm_transformed)

atm4_fc <- atm4_fit %>%
  forecast(h = 31)
atm4_fc %>%
  write_csv("~/Downloads/atm4_fc_data.csv")

Residential Power Consumption

Any utility data is likely to be seasonal. People tend to use more electricity on weekends, when they are at home rather than at work, and during warmer months (AC usage) or colder months (heating usage).

Data Exploration & Wrangling

This time series contains 192 monthly observations of power consumption from January 1998 to December 2013. Exploratory analysis reveals seasonal patterns, missing data, and outliers.

summary(residential_raw_data)
##   CaseSequence     YYYY.MMM              KWH          
##  Min.   :733.0   Length:192         Min.   :  770523  
##  1st Qu.:780.8   Class :character   1st Qu.: 5429912  
##  Median :828.5   Mode  :character   Median : 6283324  
##  Mean   :828.5                      Mean   : 6502475  
##  3rd Qu.:876.2                      3rd Qu.: 7620524  
##  Max.   :924.0                      Max.   :10655730  
##                                     NA's   :1
residential_data <- residential_raw_data %>%
  rename(date = "YYYY.MMM" ) %>%
  mutate(date = yearmonth(date)) %>%
  select(-CaseSequence) %>%
  as_tsibble(index = date)
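Before modeling, a seasonal plot makes the within-year pattern visible (a quick sketch):

# Each line is one year of monthly consumption
residential_data %>%
  gg_season(KWH) +
  labs(title = "Seasonal Plot of Residential Power Consumption", y = "KWH")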

Missing Data

There is one missing value for KWH, in September 2008.

residential_data %>% 
  filter(is.na(KWH))
## # A tsibble: 1 x 2 [1M]
##       date   KWH
##      <mth> <int>
## 1 2008 Sep    NA

Since residential power consumption is likely seasonal by month, we can use na.interp() to fill the missing value, as it preserves the trend and seasonality. Simply replacing the missing value with the mean or median would ignore the trend and seasonality, and filling it with a simple “0” would misrepresent it as zero power usage.

clean_data <- residential_data %>%
  mutate(KWH = as.numeric(na.interp(ts(KWH, frequency = 12))))

clean_data %>% 
  filter(is.na(KWH))
## # A tsibble: 0 x 2 [?]
## # ℹ 2 variables: date <mth>, KWH <dbl>

Outlier

There appears to be an extreme low outlier in July 2010 of 770,523 KWH. I searched online for events in July 2010 that might explain the unusually low consumption, but found nothing conclusive, and there is no information in the dataset itself that explains the dip.

clean_data %>%
  autoplot(KWH) +
  labs(title = "Residential Power Consumption", 
       x = "Date",
       y = "KWH") + 
  geom_smooth(method = "loess")
## `geom_smooth()` using formula = 'y ~ x'

clean_data %>%
  ggplot(aes(x = KWH)) + 
  geom_boxplot(fill='lightblue') + 
  labs(title = "Boxplot of Residential Power Consumption", 
       x = "KWH")

There are a variety of ways to deal with outliers. We can use tsclean() or tsoutliers() to impute the outliers. Both functions can help maintain the trend and seasonality while imputing the outliers.

clean_data <- clean_data %>%
  mutate(
    KWH = {
      # Replace any detected outliers with the values suggested by tsoutliers().
      series <- ts(KWH, frequency = 12)
      out <- tsoutliers(series)
      if (length(out$index) > 0) {
        series[out$index] <- out$replacements
      }
      as.numeric(series)
    }
  )

Plotting the power consumption again, we see the extreme outlier is no longer there. We also see an increasing overall trend in power consumption; since a trend is present, this time series is non-stationary.

clean_data %>%
  autoplot(KWH) +
  labs(title = "Residential Power Consumption", 
       x = "Date",
       y = "KWH") + 
  geom_smooth(method = "loess")
## `geom_smooth()` using formula = 'y ~ x'

Distribution

The distribution appears right skewed. We should use a Box-Cox transformation to reduce the skewness and stabilize the variance.

clean_data %>%
  ggplot(aes(x = KWH)) + 
  geom_histogram( fill = 'lightblue') +
  labs(title = "Histogram of Residential Power Consumption", 
       subtitle = "Missing Values and Outlier Imputed",
       x = "KWH")
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.

The estimated lambda value was -0.1455, which is close to zero, so the transformation behaves much like a log transformation. After applying it, the data is more symmetric.

lambda <- clean_data %>% 
  features(KWH, features = guerrero) %>% 
  pull(lambda_guerrero)

clean_data <- clean_data %>%
  mutate(transformed_KWH = box_cox(KWH, lambda))
clean_data %>%
  ggplot(aes(x = transformed_KWH)) + 
  geom_histogram( fill = 'lightblue') +
  labs(title = "Histogram of Residential Power Consumption", 
       subtitle = "Missing Values and Outlier Imputed",
       x = "KWH")
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.

Modeling

Using the KPSS unit root test, we test whether the time series is stationary. The KPSS statistic is 1.5569 with a reported p-value of 0.01 (the smallest value the test reports). Since the statistic is large and the p-value is below 0.05, we reject the null hypothesis that the series is stationary, confirming that this power consumption series is non-stationary.

clean_data %>%
  features(transformed_KWH, unitroot_kpss)
## # A tibble: 1 × 2
##   kpss_stat kpss_pvalue
##       <dbl>       <dbl>
## 1      1.56        0.01

One seasonal difference is required to make this time series stationary. No further regular difference is needed.

clean_data %>%
  features(transformed_KWH, unitroot_nsdiffs)
## # A tibble: 1 × 1
##   nsdiffs
##     <int>
## 1       1
clean_data %>%
  features(difference(transformed_KWH, 12), unitroot_ndiffs)
## # A tibble: 1 × 1
##   ndiffs
##    <int>
## 1      0

We can try different ARIMA model variations to find the best model. All models must have D = 1, since the series requires one seasonal difference with period 12, and d = 0, since no further first-order difference is needed. Note that AIC values are not comparable between the ETS and ARIMA classes. Among the ARIMA candidates, the automatically selected ARIMA(0,0,3)(2,1,0)[12] with drift (labeled model) has the lowest AIC and AICc, although arima100011 has a slightly lower BIC; we proceed with the automatically selected model.

fit <- clean_data %>%
  model(
    ets = ETS(transformed_KWH ~ error("A") + trend("A") + season("A")),
    model = ARIMA(transformed_KWH),
    arima100110 = ARIMA(transformed_KWH ~ pdq(1,0,0) + PDQ(1,1,0)),
    arima001011 = ARIMA(transformed_KWH ~ pdq(0,0,1) + PDQ(0,1,1)),
    arima100011 = ARIMA(transformed_KWH ~ pdq(1,0,0) + PDQ(0,1,1)),
    arima001110 = ARIMA(transformed_KWH ~ pdq(0,0,1) + PDQ(1,1,0)),
    arima003110 = ARIMA(transformed_KWH ~ pdq(0,0,3) + PDQ(1,1,0))
  )

fit
## # A mable: 1 x 7
##            ets                              model
##        <model>                            <model>
## 1 <ETS(A,A,A)> <ARIMA(0,0,3)(2,1,0)[12] w/ drift>
## # ℹ 5 more variables: arima100110 <model>, arima001011 <model>,
## #   arima100011 <model>, arima001110 <model>, arima003110 <model>
report(fit)
## Warning in report.mdl_df(fit): Model reporting is only supported for individual
## models, so a glance will be shown. To see the report for a specific model, use
## `select()` and `filter()` to identify a single model.
## # A tibble: 7 × 11
##   .model         sigma2 log_lik    AIC   AICc    BIC       MSE     AMSE      MAE
##   <chr>           <dbl>   <dbl>  <dbl>  <dbl>  <dbl>     <dbl>    <dbl>    <dbl>
## 1 ets         0.0000855    403.  -772.  -768.  -716.   7.84e-5  8.80e-5  0.00708
## 2 model       0.0000810    594. -1175. -1174. -1152.  NA       NA       NA      
## 3 arima100110 0.0000985    577. -1146. -1146. -1133.  NA       NA       NA      
## 4 arima001011 0.0000844    589. -1169. -1169. -1156.  NA       NA       NA      
## 5 arima100011 0.0000840    589. -1170. -1169. -1157.  NA       NA       NA      
## 6 arima001110 0.0000982    577. -1147. -1146. -1134.  NA       NA       NA      
## 7 arima003110 0.0000956    581. -1150. -1149. -1131.  NA       NA       NA      
## # ℹ 2 more variables: ar_roots <list>, ma_roots <list>
fit %>%
  select(arima100011) %>%
  gg_tsresiduals() +
  labs(title = "Residual Diagnostic for Model: ARIMA(1,0,0)(0,1,1)[12]")

fit %>%
  select(model) %>%
  augment() %>%
  features(.innov, ljung_box, lag = 12)
## # A tibble: 1 × 3
##   .model lb_stat lb_pvalue
##   <chr>    <dbl>     <dbl>
## 1 model     5.51     0.939
fit %>%
  select(model) %>%
  augment() %>%
  features(.innov, box_pierce, lag = 12)
## # A tibble: 1 × 3
##   .model bp_stat bp_pvalue
##   <chr>    <dbl>     <dbl>
## 1 model     5.27     0.948
fit %>%
  select(ets) %>%
  gg_tsresiduals()

fit %>%
  select(model) %>%
  forecast(h = "1 year") %>%
  autoplot(clean_data, level = NULL) + 
  labs(title = "Forecasting 1 Year of Power Consumption", 
       y = "Box-Cox transformed KWH")

fc <- fit %>%
  select(model) %>%
  forecast(h = "1 year")
fc %>%
  write_csv("~/Downloads/residential_fc_data.csv")

Extra Credit

As we do not have information on the pipe locations, pipe types, and so on, forecasting each water pipe separately preserves its individual behavior. Each dataset initially contains 1,000 observations of water flow.

Below is the summary of water pipe 1. We need to convert Date.Time to a proper datetime format for forecasting, as it is currently stored as character.

summary(pipe1_raw_data)
##   Date.Time           WaterFlow     
##  Length:1000        Min.   : 1.067  
##  Class :character   1st Qu.:13.683  
##  Mode  :character   Median :19.880  
##                     Mean   :19.897  
##                     3rd Qu.:26.159  
##                     Max.   :38.913

Below is the summary of water pipe 2, whose Date.Time column needs the same conversion.

Comparing the water flow statistics of the two pipes, we see they have different mean flows, which may indicate different types of pipes. It is best not to combine the two datasets into one.

summary(pipe2_raw_data)
##   Date.Time           WaterFlow     
##  Length:1000        Min.   : 1.885  
##  Class :character   1st Qu.:28.140  
##  Mode  :character   Median :39.682  
##                     Mean   :39.556  
##                     3rd Qu.:50.782  
##                     Max.   :78.303

Pipe 1 contains water flow readings from Oct 23, 2015 to Nov 1, 2015; Pipe 2 covers about one month more. The readings arrive at irregular sub-hourly times, so we aggregate them to hourly means.

pipe1_data <- pipe1_raw_data %>%
  rename(datetime = "Date.Time") %>%
  mutate(datetime = mdy_hm(datetime),
         date = floor_date(datetime, "hour"))   # floor each reading to its hour

pipe1_data <- pipe1_data %>%
  group_by(date) %>%
  summarise(avg_water = mean(WaterFlow)) %>%   # hourly mean water flow
  ungroup() %>%
  select(date, avg_water) %>%
  as_tsibble(index = date) %>%
  fill_gaps()   # make implicit missing hours explicit
  
summary(pipe1_data)
##       date                       avg_water     
##  Min.   :2015-10-23 00:00:00   Min.   : 8.923  
##  1st Qu.:2015-10-25 11:45:00   1st Qu.:17.033  
##  Median :2015-10-27 23:30:00   Median :19.784  
##  Mean   :2015-10-27 23:30:00   Mean   :19.893  
##  3rd Qu.:2015-10-30 11:15:00   3rd Qu.:22.789  
##  Max.   :2015-11-01 23:00:00   Max.   :31.730  
##                                NA's   :4

Pipe 2 contains water flow readings from Oct 23, 2015 to Dec 3, 2015, about one month more than pipe 1.

pipe2_data <- pipe2_raw_data %>%
  rename(datetime = "Date.Time") %>%
  mutate(datetime = mdy_hm(datetime),
         date = floor_date(datetime, "hour"))

pipe2_data <- pipe2_data %>%
  group_by(date) %>%
  summarise(avg_water = mean(WaterFlow)) %>%
  ungroup() %>%
  select(date, avg_water) %>%
  as_tsibble(index = date) %>%
  fill_gaps()   # make implicit missing hours explicit

summary(pipe2_data)
##       date                       avg_water     
##  Min.   :2015-10-23 01:00:00   Min.   : 1.885  
##  1st Qu.:2015-11-02 10:45:00   1st Qu.:28.140  
##  Median :2015-11-12 20:30:00   Median :39.682  
##  Mean   :2015-11-12 20:30:00   Mean   :39.556  
##  3rd Qu.:2015-11-23 06:15:00   3rd Qu.:50.782  
##  Max.   :2015-12-03 16:00:00   Max.   :78.303

Missing Values

There are 4 hourly observations with missing values in pipe 1.

pipe1_data %>%
  filter(is.na(avg_water))
## # A tsibble: 4 x 2 [1h] <UTC>
##   date                avg_water
##   <dttm>                  <dbl>
## 1 2015-10-27 17:00:00        NA
## 2 2015-10-28 00:00:00        NA
## 3 2015-11-01 05:00:00        NA
## 4 2015-11-01 09:00:00        NA

Since water flow is likely seasonal by hour of the day, we can use na.interp() to fill the missing values, as it preserves the trend and seasonality. Simply replacing the missing values with the mean or median would ignore the trend and seasonality, and filling them with a simple “0” would misrepresent them as hours with zero water flow.

clean_pipe1 <- pipe1_data %>%
  mutate(avg_water = as.numeric(na.interp(ts(avg_water, frequency = 24))))

clean_pipe1 %>% 
  filter(is.na(avg_water))
## # A tsibble: 0 x 2 [?] <UTC>
## # ℹ 2 variables: date <dttm>, avg_water <dbl>

Pipe 2 has no missing data.

pipe2_data %>%
  filter(is.na(avg_water))
## # A tsibble: 0 x 2 [?] <UTC>
## # ℹ 2 variables: date <dttm>, avg_water <dbl>

Autoplot

There is no obvious upward or downward trend in the pipe 1 data.

clean_pipe1 %>%
  autoplot() +
  labs(title = "Average Hourly Pipe 1 Water Flow", 
       x = "Date") + 
  geom_smooth(method = "loess")
## Plot variable not specified, automatically selected `.vars = avg_water`
## `geom_smooth()` using formula = 'y ~ x'

There is no obvious upward or downward trend in the pipe 2 data.

pipe2_data %>%
  autoplot() +
  labs(title = "Average Hourly Pipe 2 Water Flow", 
       x = "Date") + 
  geom_smooth(method = "loess", color = "green")
## Plot variable not specified, automatically selected `.vars = avg_water`
## `geom_smooth()` using formula = 'y ~ x'

Outlier

There is one outlier in the pipe 1 dataset. We can impute it below using tsoutliers().

clean_pipe1 %>%
  ggplot(aes(x = avg_water)) + 
  geom_boxplot(fill='lightblue') + 
  labs(title = "Boxplot of Pipe 1 Water Flow")

No outlier is present in pipe 2.

pipe2_data %>%
  ggplot(aes(x = avg_water)) + 
  geom_boxplot(fill='lightgreen') + 
  labs(title = "Boxplot of Pipe 2 Water Flow")

clean_pipe1 <- pipe1_data %>%
  mutate(
    avg_water = {
      # Replace any detected outliers with the values suggested by tsoutliers().
      series <- ts(avg_water)
      out <- tsoutliers(series)
      if (length(out$index) > 0) {
        series[out$index] <- out$replacements
      }
      as.numeric(series)
    }
  )

clean_pipe1 <- clean_pipe1 %>%
  mutate(avg_water = as.numeric(na.interp(ts(avg_water, frequency = 24))))

clean_pipe1 %>% 
  filter(is.na(avg_water))
## # A tsibble: 0 x 2 [?] <UTC>
## # ℹ 2 variables: date <dttm>, avg_water <dbl>

Distribution

Both series appear approximately normally distributed. A Box-Cox transformation is usually recommended only when there is skewness, so it is likely unnecessary here.

clean_pipe1 %>%
  ggplot(aes(x = avg_water)) + 
  geom_histogram( fill = 'lightblue') +
  labs(title = "Histogram of Pipe 1 Water Flow")
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.

pipe2_data %>%
  ggplot(aes(x = avg_water)) + 
  geom_histogram( fill = 'lightgreen') +
  labs(title = "Histogram of Pipe 2 Water Flow")
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.
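As a numeric check on these histograms, we can estimate the Box-Cox lambda for each pipe with the Guerrero method; an estimate near 1 supports leaving the series untransformed (a quick sketch):

clean_pipe1 %>%
  features(avg_water, features = guerrero)

pipe2_data %>%
  features(avg_water, features = guerrero)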

Modeling

Stationary or Non-stationary

Below we apply the KPSS unit root test to each series. Both p-values are at the 0.1 upper bound, so we fail to reject the null hypothesis of stationarity: both time series appear stationary.

clean_pipe1 %>%
  features(avg_water, unitroot_kpss)
## # A tibble: 1 × 2
##   kpss_stat kpss_pvalue
##       <dbl>       <dbl>
## 1     0.191         0.1
pipe2_data %>%
  features(avg_water, unitroot_kpss)
## # A tibble: 1 × 2
##   kpss_stat kpss_pvalue
##       <dbl>       <dbl>
## 1     0.105         0.1

As both time series are stationary, neither a seasonal difference nor a first-order difference is needed.

clean_pipe1 %>%
  features(avg_water, unitroot_nsdiffs)
## # A tibble: 1 × 1
##   nsdiffs
##     <int>
## 1       0
clean_pipe1 %>%
  features(avg_water, unitroot_ndiffs)
## # A tibble: 1 × 1
##   ndiffs
##    <int>
## 1      0
pipe2_data %>%
  features(avg_water, unitroot_nsdiffs)
## # A tibble: 1 × 1
##   nsdiffs
##     <int>
## 1       0
pipe2_data %>%
  features(avg_water, unitroot_ndiffs)
## # A tibble: 1 × 1
##   ndiffs
##    <int>
## 1      0

Water Pipe 1

Comparing the candidates, ETS(M,N,N) shows the smallest sigma2 and MAE among the ETS variants, but those are on a relative scale (multiplicative errors) and are not directly comparable to the additive-error models; by MSE, ETS(A,N,A) is in fact slightly better. Among the ARIMA models, ARIMA(0,0,0)(0,1,1), which differences away the daily cycle, has the lowest AIC (1293 versus 1376 for the automatically selected model). AIC is not comparable between the ETS and ARIMA classes.

fit <- clean_pipe1 %>%
  model(
    ets_model = ETS(avg_water),
    ets = ETS(avg_water ~ error("A") + trend("A") + season("A")),
    etsANA = ETS(avg_water ~ error("A") + trend("N") + season("A")),
    etsAAdA = ETS(avg_water ~ error("A") + trend("Ad") + season("A")),
    model = ARIMA(avg_water),
    arima000011 = ARIMA(avg_water ~ 0 + pdq(0,0,0) + PDQ(0,1,1))
  )

fit
## # A mable: 1 x 6
##      ets_model          ets       etsANA       etsAAdA                  model
##        <model>      <model>      <model>       <model>                <model>
## 1 <ETS(M,N,N)> <ETS(A,A,A)> <ETS(A,N,A)> <ETS(A,Ad,A)> <ARIMA(0,0,0) w/ mean>
## # ℹ 1 more variable: arima000011 <model>
report(fit)
## Warning in report.mdl_df(fit): Model reporting is only supported for individual
## models, so a glance will be shown. To see the report for a specific model, use
## `select()` and `filter()` to identify a single model.
## # A tibble: 6 × 11
##   .model   sigma2 log_lik   AIC  AICc   BIC   MSE  AMSE    MAE ar_roots ma_roots
##   <chr>     <dbl>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl> <list>   <list>  
## 1 ets_mo…  0.0454  -1003. 2012. 2012. 2023.  17.8  17.8  0.167 <NULL>   <NULL>  
## 2 ets     18.8      -995. 2047. 2056. 2148.  16.6  16.6  3.22  <NULL>   <NULL>  
## 3 etsANA  18.0      -991. 2036. 2043. 2130.  16.1  16.1  3.18  <NULL>   <NULL>  
## 4 etsAAdA 18.5      -992. 2044. 2053. 2149.  16.2  16.2  3.18  <NULL>   <NULL>  
## 5 model   17.9      -686. 1376. 1376. 1383.  NA    NA   NA     <cpl>    <cpl>   
## 6 arima0… 20.1      -645. 1293. 1294. 1300.  NA    NA   NA     <cpl>    <cpl>
fit %>%
  select(ets_model) %>%
  gg_tsresiduals() +
  labs(title = "Residual Diagnostic for Model: ETS(M,N,N)")

fit%>%
  select(arima000011) %>%
  gg_tsresiduals() +
  labs(title = "Residual Diagnostic for Model: ARIMA (0,0,0) (0,1,1) model" )

Best model: ARIMA(0,0,0)(0,1,1).

fit %>%
  select(ets_model) %>%
  forecast(h = "1 week") %>%
  autoplot(clean_pipe1) +
  labs(title = "Forecasting 1 Week on Pipe 1 (ETS)")

fit %>%
  select(arima000011) %>%
  forecast(h = "1 week") %>%
  autoplot(clean_pipe1) +
  labs(title = "Forecasting 1 Week on Pipe 1 (ARIMA)")

Water Pipe 2

As with pipe 1, ETS(M,N,N) has the smallest sigma2 and MAE among the ETS variants (again on a relative, multiplicative-error scale), and ARIMA(0,0,0)(0,1,1), which differences away the daily cycle, has the lowest AIC among the ARIMA models (8277 versus 8387 for the automatically selected model). AIC is not comparable between the ETS and ARIMA classes.

fit <- pipe2_data %>%
  model(
    ets_model = ETS(avg_water),
    ets = ETS(avg_water ~ error("M") + trend("A") + season("M")),
    etsANA = ETS(avg_water ~ error("A") + trend("N") + season("A")),
    etsAAdA = ETS(avg_water ~ error("A") + trend("Ad") + season("A")),
    model = ARIMA(avg_water),
    arima000011 = ARIMA(avg_water ~ 0 + pdq(0,0,0) + PDQ(0,1,1))
  )

fit
## # A mable: 1 x 6
##      ets_model          ets       etsANA       etsAAdA
##        <model>      <model>      <model>       <model>
## 1 <ETS(M,N,N)> <ETS(M,A,M)> <ETS(A,N,A)> <ETS(A,Ad,A)>
## # ℹ 2 more variables: model <model>, arima000011 <model>
report(fit)
## Warning in report.mdl_df(fit): Model reporting is only supported for individual
## models, so a glance will be shown. To see the report for a specific model, use
## `select()` and `filter()` to identify a single model.
## # A tibble: 6 × 11
##   .model       sigma2 log_lik    AIC   AICc    BIC   MSE  AMSE    MAE ar_roots 
##   <chr>         <dbl>   <dbl>  <dbl>  <dbl>  <dbl> <dbl> <dbl>  <dbl> <list>   
## 1 ets_model     0.165  -6229. 12464. 12464. 12479.  257.  257.  0.331 <NULL>   
## 2 ets           0.171  -6228. 12514. 12516. 12656.  255.  255.  0.333 <NULL>   
## 3 etsANA      259.     -6219. 12493. 12494. 12625.  252.  252. 13.0   <NULL>   
## 4 etsAAdA     260.     -6219. 12497. 12499. 12645.  252.  252. 13.0   <NULL>   
## 5 model       256.     -4191.  8387.  8387.  8402.   NA    NA  NA     <cpl [0]>
## 6 arima000011 265.     -4137.  8277.  8277.  8287.   NA    NA  NA     <cpl [0]>
## # ℹ 1 more variable: ma_roots <list>

The ACF plot shows that not all spikes are within the confidence bands, indicating some significant autocorrelation remains. The residuals appear approximately normal.

fit %>%
  select(ets) %>%
  gg_tsresiduals() +
  labs(title = "Residual Diagnostic for Model: ETS(M,A,M)")

Again, the ACF plot shows that not all spikes are within the confidence bands, so some significant autocorrelation remains. The residuals appear approximately normal.

fit %>%
  select(arima000011) %>%
  gg_tsresiduals() +
  labs(title = "Residual Diagnostic for Model: ARIMA(0,0,0)(0,1,1)")

Best model: ARIMA(0,0,0)(0,1,1).

fit %>%
  select(ets) %>%
  forecast(h = "1 week") %>%
  autoplot(pipe2_data) +
  labs(title = "Forecasting 1 Week on Pipe 2 (ETS)")
fit %>%
  select(arima000011) %>%
  forecast(h = "1 week") %>%
  autoplot(pipe2_data) +
  labs(title = "Forecasting 1 Week on Pipe 2 (ARIMA)")

The pipe 1 dataset is much smaller than the pipe 2 dataset, and the smaller sample can affect the fit. The ETS model does follow the upward movement toward December that we saw in the STL decomposition. However, I feel there is still room for improvement in the fitted models, judging by how the forecasts track the series.

Source:

https://robjhyndman.com/hyndsight/tsoutliers/