Overview

In Part B of this project, we have a dataset of monthly residential power usage from January 1998 through December 2013. The goal is to develop a model for these data and provide a monthly forecast for 2014.

The general process will be the same as outlined in Part A.

Import data and convert to tsibble

We begin by importing the data. Once the data are imported, we will parse the YYYY-MMM field into a proper date field and then create the necessary tsibble.
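
The setup chunk is not shown in this report; the code below assumes a set of libraries along these lines (the package-to-function mapping is inferred, so treat this as a sketch):

library(readxl)    # read_xlsx
library(dplyr)     # mutate, select, filter
library(lubridate) # ym, as_date
library(tsibble)   # yearmonth, as_tsibble
library(fable)     # ETS, ARIMA
library(feasts)    # STL, ACF, gg_tsdisplay, gg_tsresiduals
library(ggplot2)   # plotting
library(caret)     # BoxCoxTrans
library(xlsx)      # write.xlsx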

raw_data <- read_xlsx('../data/ResidentialCustomerForecastLoad-624.xlsx')


raw_data <- raw_data %>% mutate(date_field = as_date(ym(`YYYY-MMM`)))

raw_data
## # A tibble: 192 × 4
##    CaseSequence `YYYY-MMM`     KWH date_field
##           <dbl> <chr>        <dbl> <date>    
##  1          733 1998-Jan   6862583 1998-01-01
##  2          734 1998-Feb   5838198 1998-02-01
##  3          735 1998-Mar   5420658 1998-03-01
##  4          736 1998-Apr   5010364 1998-04-01
##  5          737 1998-May   4665377 1998-05-01
##  6          738 1998-Jun   6467147 1998-06-01
##  7          739 1998-Jul   8914755 1998-07-01
##  8          740 1998-Aug   8607428 1998-08-01
##  9          741 1998-Sep   6989888 1998-09-01
## 10          742 1998-Oct   6345620 1998-10-01
## # ℹ 182 more rows
data_ts <- raw_data %>% 
  select(-`YYYY-MMM`) %>% 
  mutate(Month = yearmonth(date_field)) %>%
  as_tsibble(index = Month)
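
As an optional sanity check, tsibble's has_gaps() can confirm the monthly index is complete:

data_ts %>% has_gaps()   # expect .gaps = FALSE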

data_ts %>% autoplot(KWH)

Evaluate the data

There is a value missing for CaseSequence 861. As we did in Part A, we will impute the value by building a time series model on the data before the gap and forecasting one step ahead.

data_ts %>% filter(is.na(KWH))
## # A tsibble: 1 x 4 [1M]
##   CaseSequence   KWH date_field    Month
##          <dbl> <dbl> <date>        <mth>
## 1          861    NA 2008-09-01 2008 Sep
early_data <- data_ts %>% 
  filter(CaseSequence < 861)


early_data %>%
  autoplot(KWH)

early_data %>% 
  model(stl = STL(KWH)) %>%
  components() %>%
  autoplot()

From the plot and the STL decomposition, there appears to be a slight positive trend along with clear seasonality. So we will explore exponential smoothing (ETS) models, trying different combinations of components.

fit_early <- early_data %>%
  model(m1 = ETS(KWH ~ error("A") + trend("A") + season("A")),
        m2 = ETS(KWH ~ error("A") + trend("A") + season("M")),
        m3 = ETS(KWH ~ error("A") + trend("M") + season("M")),
        m4 = ETS(KWH))

glance(fit_early) %>% arrange(AICc)
## # A tibble: 4 × 9
##   .model   sigma2 log_lik   AIC  AICc   BIC           MSE          AMSE      MAE
##   <chr>     <dbl>   <dbl> <dbl> <dbl> <dbl>         <dbl>         <dbl>    <dbl>
## 1 m4     7.50e- 3  -1990. 4010. 4014. 4053. 266724313447. 268119581517.  6.53e-2
## 2 m1     3.06e+11  -1994. 4023. 4029. 4071. 267385294351. 268018613863.  4.10e+5
## 3 m2     3.07e+11  -1995. 4023. 4029. 4072. 268249229600. 269690812211.  4.08e+5
## 4 m3     3.15e+11  -1996. 4027. 4032. 4075. 275309122826. 284114223083.  4.10e+5
fit_early
## # A mable: 1 x 4
##             m1           m2           m3           m4
##        <model>      <model>      <model>      <model>
## 1 <ETS(A,A,A)> <ETS(A,A,M)> <ETS(A,M,M)> <ETS(M,N,M)>

The AICc for the automatically selected model, ETS(M,N,M), was the lowest, so that is the model we will use.
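
The fitted smoothing parameters of the selected specification can be inspected before refitting (optional):

fit_early %>% select(m4) %>% report()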

fit_early <- early_data %>% model(ETS(KWH ~ error("M") + trend("N") + season("M")))

aug <- fit_early %>% augment() 

aug %>% 
  ggplot(aes(x=.innov)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

aug %>%
  ACF(.innov, lag_max = 36) %>%
  autoplot()

Looking at the ACF of the innovation residuals, we see significant remaining autocorrelation, so the ETS model is not capturing all of the structure. We will try some differencing and focus on an ARIMA model instead.
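
To quantify this, a Ljung-Box test can be run on the innovation residuals (a sketch; the choice of 24 lags is an assumption):

aug %>% features(.innov, ljung_box, lag = 24)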

early_data %>%
  gg_tsdisplay(KWH, plot_type = 'partial')

We see strong evidence of seasonality in the lags, so we will both transform the response variable and seasonally difference the data to remove the seasonality.

bc <- BoxCoxTrans(early_data$KWH)  # estimate the Box-Cox lambda with caret
lambda <- bc$lambda
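
Alternatively, lambda could be estimated with the guerrero feature from feasts; the variable name lambda_alt below is just an illustrative placeholder, and this alternative is not used in what follows:

lambda_alt <- early_data %>%
  features(KWH, features = guerrero) %>%
  pull(lambda_guerrero)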


early_data %>%
  autoplot((KWH^lambda - 1)/lambda)

early_data %>%
  gg_tsdisplay(difference((KWH^lambda - 1)/lambda, 12), plot_type = 'partial')
## Warning: Removed 12 rows containing missing values (`geom_line()`).
## Warning: Removed 12 rows containing missing values (`geom_point()`).

Based on these plots, we will try the following seasonal ARIMA models, suggested by the ACF and PACF charts:

  1. pdq(4,0,0) + PDQ(1,1,0), i.e. ARIMA(4,0,0)(1,1,0)[12]
  2. pdq(0,0,4) + PDQ(0,1,1), i.e. ARIMA(0,0,4)(0,1,1)[12]

fit_early <- early_data %>%
  model(m1 = ARIMA((KWH^lambda - 1)/lambda ~ pdq(4,0,0) + PDQ(1,1,0)),
        m2 = ARIMA((KWH^lambda - 1)/lambda ~ pdq(0,0,4) + PDQ(0,1,1)))


glance(fit_early) %>% arrange(AICc)
## # A tibble: 2 × 8
##   .model    sigma2 log_lik    AIC   AICc    BIC ar_roots   ma_roots  
##   <chr>      <dbl>   <dbl>  <dbl>  <dbl>  <dbl> <list>     <list>    
## 1 m2     0.0000146    566. -1118. -1117. -1099. <cpl [0]>  <cpl [16]>
## 2 m1     0.0000151    559. -1107. -1106. -1090. <cpl [16]> <cpl [0]>

Based on the AICc, we will go with the second model to generate our one-step-ahead forecast, and then plug the imputed value into the original time series to replace the missing value.

fit_early <- early_data %>%
  model(ARIMA((KWH^lambda - 1)/lambda ~ pdq(0,0,4) + PDQ(0,1,1)))

fc_early <- fit_early %>% forecast(h=1)



data_ts[data_ts$CaseSequence == 861,]$KWH <- fc_early$.mean
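
A quick filter confirms the missing value has been replaced:

data_ts %>% filter(CaseSequence == 861)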

Now that we have imputed the missing value, we will build models on the complete series and forecast from the best of them.

data_ts %>% model(STL(KWH)) %>%
  components() %>%
  autoplot()

fit <- data_ts %>%
  model(m1 = ETS(KWH ~ error("A") + trend("N") + season("A")),
        m2 = ETS(KWH ~ error("A") + trend("N") + season("M")),
        m3 = ARIMA((KWH^lambda - 1)/lambda ~ 1 + pdq(4,0,0) + PDQ(1,1,0)),
        m4 = ARIMA((KWH^lambda - 1)/lambda ~ 1 + pdq(0,0,4) + PDQ(0,1,1)),
        m5 = ARIMA((KWH^lambda - 1)/lambda))


glance(fit) %>% arrange(AICc)
## # A tibble: 5 × 11
##   .model   sigma2 log_lik    AIC   AICc    BIC           MSE        AMSE     MAE
##   <chr>     <dbl>   <dbl>  <dbl>  <dbl>  <dbl>         <dbl>       <dbl>   <dbl>
## 1 m5     2.78e- 5    733. -1456. -1456. -1440.           NA     NA           NA 
## 2 m4     2.86e- 5    707. -1399. -1398. -1373.           NA     NA           NA 
## 3 m3     3.72e- 5    683. -1352. -1351. -1330.           NA     NA           NA 
## 4 m2     7.30e+11  -3120.  6270.  6272.  6319. 676896866919.     7.06e11 516061.
## 5 m1     7.36e+11  -3121.  6271.  6274.  6320. 682203380453.     7.05e11 514104.
## # ℹ 2 more variables: ar_roots <list>, ma_roots <list>
fit
## # A mable: 1 x 5
##             m1           m2                                 m3
##        <model>      <model>                            <model>
## 1 <ETS(A,N,A)> <ETS(A,N,M)> <ARIMA(4,0,0)(1,1,0)[12] w/ drift>
## # ℹ 2 more variables: m4 <model>, m5 <model>

Based on the AICc, the automatically selected ARIMA model (m5), a nonseasonal MA(1) with a seasonal MA(2) after a seasonal difference, i.e. ARIMA(0,0,1)(0,1,2)[12] on the transformed series, is the best model. So we will refit it explicitly and use it to forecast the twelve months of 2014.

fit_final <- data_ts %>%
  model(ARIMA((KWH^lambda - 1)/lambda ~ pdq(0,0,1) + PDQ(0,1,2)))
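
Before forecasting, it is worth a quick look at the residual diagnostics of the final model (a sketch using feasts):

fit_final %>% gg_tsresiduals()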


partb_fcst <- fit_final %>% forecast(h=12)



partb_fcst
## # A fable: 12 x 4 [1M]
## # Key:     .model [1]
##    .model                                         Month               KWH  .mean
##    <chr>                                          <mth>            <dist>  <dbl>
##  1 ARIMA(KWH^lambda - 1/lambda ~ pdq(0, 0, 1)… 2014 Jan t(N(10, 2.8e-05)) 9.55e6
##  2 ARIMA(KWH^lambda - 1/lambda ~ pdq(0, 0, 1)… 2014 Feb t(N(10, 2.8e-05)) 8.19e6
##  3 ARIMA(KWH^lambda - 1/lambda ~ pdq(0, 0, 1)… 2014 Mar t(N(10, 2.8e-05)) 6.85e6
##  4 ARIMA(KWH^lambda - 1/lambda ~ pdq(0, 0, 1)… 2014 Apr t(N(10, 2.8e-05)) 6.02e6
##  5 ARIMA(KWH^lambda - 1/lambda ~ pdq(0, 0, 1)… 2014 May t(N(10, 2.8e-05)) 5.70e6
##  6 ARIMA(KWH^lambda - 1/lambda ~ pdq(0, 0, 1)… 2014 Jun t(N(10, 2.8e-05)) 7.48e6
##  7 ARIMA(KWH^lambda - 1/lambda ~ pdq(0, 0, 1)… 2014 Jul t(N(10, 2.8e-05)) 6.80e6
##  8 ARIMA(KWH^lambda - 1/lambda ~ pdq(0, 0, 1)… 2014 Aug t(N(10, 2.8e-05)) 9.52e6
##  9 ARIMA(KWH^lambda - 1/lambda ~ pdq(0, 0, 1)… 2014 Sep t(N(10, 2.8e-05)) 8.69e6
## 10 ARIMA(KWH^lambda - 1/lambda ~ pdq(0, 0, 1)… 2014 Oct t(N(10, 2.8e-05)) 6.25e6
## 11 ARIMA(KWH^lambda - 1/lambda ~ pdq(0, 0, 1)… 2014 Nov t(N(10, 2.8e-05)) 5.67e6
## 12 ARIMA(KWH^lambda - 1/lambda ~ pdq(0, 0, 1)… 2014 Dec t(N(10, 2.8e-05)) 7.31e6
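
The forecast can also be plotted against the historical series (a sketch; the labels are illustrative):

partb_fcst %>%
  autoplot(data_ts) +
  labs(y = "KWH", title = "Monthly residential power usage forecast for 2014")
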
Finally, we export the forecast. Because the fable's KWH column holds distributions rather than atomic vectors (which write.xlsx cannot serialize), we keep only the month and the point forecast before writing:

partb_fcst %>%
  as_tibble() %>%
  transmute(Month = as.character(Month), KWH = .mean) %>%
  write.xlsx('../data/project1_forecast.xlsx', sheetName = 'partb_fcst', append = TRUE)