library(fpp2)
library(ggplot2)
library(dplyr)

Objective

The analyzed data represents residential power usage for January 1998 until December 2013. The objective is to model these data and produce a monthly forecast for 2014.

Data Preparation

First load the data and check summary and loaded values

pwrdata <- readxl::read_excel("ResidentialCustomerForecastLoad-624.xlsx")

summary(pwrdata)
##   CaseSequence     YYYY-MMM              KWH          
##  Min.   :733.0   Length:192         Min.   :  770523  
##  1st Qu.:780.8   Class :character   1st Qu.: 5429912  
##  Median :828.5   Mode  :character   Median : 6283324  
##  Mean   :828.5                      Mean   : 6502475  
##  3rd Qu.:876.2                      3rd Qu.: 7620524  
##  Max.   :924.0                      Max.   :10655730  
##                                     NA's   :1
head(pwrdata)
## # A tibble: 6 x 3
##   CaseSequence `YYYY-MMM`     KWH
##          <dbl> <chr>        <dbl>
## 1          733 1998-Jan   6862583
## 2          734 1998-Feb   5838198
## 3          735 1998-Mar   5420658
## 4          736 1998-Apr   5010364
## 5          737 1998-May   4665377
## 6          738 1998-Jun   6467147

The dataset contains one missing value for 2008-Sep. A good strategy to fill the missing value would be to take an average value of the year before [2007-Sep] and of the year after [2009-Sep].

pwrdata[is.na(pwrdata$KWH),]
## # A tibble: 1 x 3
##   CaseSequence `YYYY-MMM`   KWH
##          <dbl> <chr>      <dbl>
## 1          861 2008-Sep      NA
pwrdata[pwrdata$`YYYY-MMM` == "2008-Sep", "KWH"] <- round(
  (pwrdata[pwrdata$`YYYY-MMM` == "2007-Sep","KWH"] +
     pwrdata[pwrdata$`YYYY-MMM` == "2009-Sep","KWH"]) / 2)
pwrdata[pwrdata$`YYYY-MMM` == "2008-Sep",]
## # A tibble: 1 x 3
##   CaseSequence `YYYY-MMM`     KWH
##          <dbl> <chr>        <dbl>
## 1          861 2008-Sep   7625058

Let’s check for any outliers and then deal with them.

boxplot(pwrdata$KWH)

The boxplot reveals that a minimum KWH value is way outside of the rest of the values. The values from the year before and after are much higher and therefore such a small value is likely to be invalid for the month of July in any year.

pwrdata[which.min(pwrdata$KWH),]
## # A tibble: 1 x 3
##   CaseSequence `YYYY-MMM`    KWH
##          <dbl> <chr>       <dbl>
## 1          883 2010-Jul   770523
pwrdata[pwrdata$`YYYY-MMM` == "2009-Jul",]
## # A tibble: 1 x 3
##   CaseSequence `YYYY-MMM`     KWH
##          <dbl> <chr>        <dbl>
## 1          871 2009-Jul   7713260
pwrdata[pwrdata$`YYYY-MMM` == "2011-Jul",]
## # A tibble: 1 x 3
##   CaseSequence `YYYY-MMM`      KWH
##          <dbl> <chr>         <dbl>
## 1          895 2011-Jul   10093343

It is therefore a good idea to replace the outlier [invalid] value with an average value of the year before [2009-Jul] and of the year after [2011-Jul].

pwrdata[pwrdata$`YYYY-MMM` == "2010-Jul", "KWH"] <- round(
  (pwrdata[pwrdata$`YYYY-MMM` == "2009-Jul","KWH"] +
     pwrdata[pwrdata$`YYYY-MMM` == "2011-Jul","KWH"]) / 2)
pwrdata[pwrdata$`YYYY-MMM` == "2010-Jul",]
## # A tibble: 1 x 3
##   CaseSequence `YYYY-MMM`     KWH
##          <dbl> <chr>        <dbl>
## 1          883 2010-Jul   8903302

It should now be safe to create the time series object which can then be used for modelling.

tspwr <- ts(pwrdata$KWH, frequency = 12, start = c(1998,1))
ggtsdisplay(tspwr)

Modelling

As an initial step in modelling, let’s check if Box-Cox transformation is needed to stabilize the variance. Based on the graph, it’s not clear if variance stays somewhat constant or not as the time series progresses.

(lambda <- tspwr %>% BoxCox.lambda())
## [1] -0.2399937
tspwr %>% BoxCox(lambda = lambda) %>% ggtsdisplay()

Based on the graph of the transformed data, there isn’t much improvement and therefore I would recommend not to do any transformation.

Now, let’s see if unit-root tests would suggest to difference the data in order to make it look more stationary.

tspwr %>% ndiffs()
## [1] 1
tspwr %>% diff(lag = 12) %>% Box.test()
## 
##  Box-Pierce test
## 
## data:  .
## X-squared = 12.518, df = 1, p-value = 0.0004031
tspwr %>% diff(lag = 12) %>% ggtsdisplay()

Unfortunately, there is no clear-cut conclusion regarding whether or not to perform differencing. On the one hand the unit-root test ndiffs suggests that 1 level of seasonal differencing would be better. However, the Box.test did not confirm it, by reporting a very small p-value.

Also, by looking at ACF and PACF graph patterns, it is also not clear-cut conclusive as what parameters should be in selecting the best, seasonal ARIMA model manually. Therefore, it would be prudent to start off with auto.arima() and see if makes a good determination by examining the residuals of the selected model.

(fit <- auto.arima(tspwr))
## Series: tspwr 
## ARIMA(0,0,3)(2,1,0)[12] with drift 
## 
## Coefficients:
##          ma1     ma2     ma3     sar1     sar2     drift
##       0.3230  0.0173  0.2446  -0.6550  -0.3837  8974.271
## s.e.  0.0785  0.0875  0.0762   0.0786   0.0795  3111.023
## 
## sigma^2 estimated as 3.918e+11:  log likelihood=-2658.43
## AIC=5330.87   AICc=5331.52   BIC=5353.22
checkresiduals(fit)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,0,3)(2,1,0)[12] with drift
## Q* = 15.875, df = 18, p-value = 0.6013
## 
## Model df: 6.   Total lags used: 24

Indeed, the automatically selected model, ARIMA(0,0,3)(2,1,0)[12] with drift, has near normally distributed and “white-noise” looking residuals. These are good indicators of a good model, which can be used for forecasting.

Forecasting

f <- forecast(fit, h = 12)
autoplot(tspwr) + 
  autolayer(f)

predict(f)
##          Point Forecast   Lo 80    Hi 80   Lo 95    Hi 95
## Jan 2014       10410760 9608603 11212918 9183966 11637554
## Feb 2014        8498785 7655832  9341737 7209599  9787970
## Mar 2014        7250962 6407895  8094030 5961602  8540323
## Apr 2014        6018419 5152823  6884016 4694604  7342235
## May 2014        5950870 5085273  6816466 4627054  7274685
## Jun 2014        8186852 7321255  9052448 6863036  9510667
## Jul 2014        9406691 8541095 10272288 8082876 10730507
## Aug 2014        9915311 9049714 10780908 8591495 11239127
## Sep 2014        8451080 7585484  9316677 7127265  9774896
## Oct 2014        5869759 5004163  6735356 4545943  7193575
## Nov 2014        6126305 5260709  6991902 4802489  7450121
## Dec 2014        8501376 7635779  9366972 7177560  9825191
fd <- seq(as.Date("2014-01-01"), as.Date("2014-12-01"), by="months")
df <- data.frame("YYYY-MMM" = format(fd, "%Y-%b"), KWH = round(f$mean))
colnames(df) <- c("YYYY-MMM", "KWH")
head(df, 12)
##    YYYY-MMM      KWH
## 1  2014-Jan 10410760
## 2  2014-Feb  8498785
## 3  2014-Mar  7250962
## 4  2014-Apr  6018419
## 5  2014-May  5950870
## 6  2014-Jun  8186852
## 7  2014-Jul  9406691
## 8  2014-Aug  9915311
## 9  2014-Sep  8451080
## 10 2014-Oct  5869759
## 11 2014-Nov  6126305
## 12 2014-Dec  8501376
write.csv(df, file = "PowerForecast.csv", row.names = FALSE)