library(fpp2)
library(ggplot2)
library(dplyr)
The analyzed data represents residential power usage for January 1998 until December 2013. The objective is to model these data and produce a monthly forecast for 2014.
First load the data and check summary and loaded values
pwrdata <- readxl::read_excel("ResidentialCustomerForecastLoad-624.xlsx")
summary(pwrdata)
## CaseSequence YYYY-MMM KWH
## Min. :733.0 Length:192 Min. : 770523
## 1st Qu.:780.8 Class :character 1st Qu.: 5429912
## Median :828.5 Mode :character Median : 6283324
## Mean :828.5 Mean : 6502475
## 3rd Qu.:876.2 3rd Qu.: 7620524
## Max. :924.0 Max. :10655730
## NA's :1
head(pwrdata)
## # A tibble: 6 x 3
## CaseSequence `YYYY-MMM` KWH
## <dbl> <chr> <dbl>
## 1 733 1998-Jan 6862583
## 2 734 1998-Feb 5838198
## 3 735 1998-Mar 5420658
## 4 736 1998-Apr 5010364
## 5 737 1998-May 4665377
## 6 738 1998-Jun 6467147
The dataset contains one missing value for 2008-Sep. A good strategy to fill the missing value would be to take an average value of the year before [2007-Sep] and of the year after [2009-Sep].
pwrdata[is.na(pwrdata$KWH),]
## # A tibble: 1 x 3
## CaseSequence `YYYY-MMM` KWH
## <dbl> <chr> <dbl>
## 1 861 2008-Sep NA
pwrdata[pwrdata$`YYYY-MMM` == "2008-Sep", "KWH"] <- round(
(pwrdata[pwrdata$`YYYY-MMM` == "2007-Sep","KWH"] +
pwrdata[pwrdata$`YYYY-MMM` == "2009-Sep","KWH"]) / 2)
pwrdata[pwrdata$`YYYY-MMM` == "2008-Sep",]
## # A tibble: 1 x 3
## CaseSequence `YYYY-MMM` KWH
## <dbl> <chr> <dbl>
## 1 861 2008-Sep 7625058
Let’s check for any outliers and then deal with them.
boxplot(pwrdata$KWH)
The boxplot reveals that a minimum KWH value is way outside of the rest of the values. The values from the year before and after are much higher and therefore such a small value is likely to be invalid for the month of July in any year.
pwrdata[which.min(pwrdata$KWH),]
## # A tibble: 1 x 3
## CaseSequence `YYYY-MMM` KWH
## <dbl> <chr> <dbl>
## 1 883 2010-Jul 770523
pwrdata[pwrdata$`YYYY-MMM` == "2009-Jul",]
## # A tibble: 1 x 3
## CaseSequence `YYYY-MMM` KWH
## <dbl> <chr> <dbl>
## 1 871 2009-Jul 7713260
pwrdata[pwrdata$`YYYY-MMM` == "2011-Jul",]
## # A tibble: 1 x 3
## CaseSequence `YYYY-MMM` KWH
## <dbl> <chr> <dbl>
## 1 895 2011-Jul 10093343
It is therefore a good idea to replace the outlier [invalid] value with an average value of the year before [2009-Jul] and of the year after [2011-Jul].
pwrdata[pwrdata$`YYYY-MMM` == "2010-Jul", "KWH"] <- round(
(pwrdata[pwrdata$`YYYY-MMM` == "2009-Jul","KWH"] +
pwrdata[pwrdata$`YYYY-MMM` == "2011-Jul","KWH"]) / 2)
pwrdata[pwrdata$`YYYY-MMM` == "2010-Jul",]
## # A tibble: 1 x 3
## CaseSequence `YYYY-MMM` KWH
## <dbl> <chr> <dbl>
## 1 883 2010-Jul 8903302
It should now be safe to create the time series object which can then be used for modelling.
tspwr <- ts(pwrdata$KWH, frequency = 12, start = c(1998,1))
ggtsdisplay(tspwr)
As an initial step in modelling, let’s check if Box-Cox transformation is needed to stabilize the variance. Based on the graph, it’s not clear if variance stays somewhat constant or not as the time series progresses.
(lambda <- tspwr %>% BoxCox.lambda())
## [1] -0.2399937
tspwr %>% BoxCox(lambda = lambda) %>% ggtsdisplay()
Based on the graph of the transformed data, there isn’t much improvement and therefore I would recommend not to do any transformation.
Now, let’s see if unit-root tests would suggest to difference the data in order to make it look more stationary.
tspwr %>% ndiffs()
## [1] 1
tspwr %>% diff(lag = 12) %>% Box.test()
##
## Box-Pierce test
##
## data: .
## X-squared = 12.518, df = 1, p-value = 0.0004031
tspwr %>% diff(lag = 12) %>% ggtsdisplay()
Unfortunately, there is no clear-cut conclusion regarding whether or not to perform differencing. On the one hand the unit-root test ndiffs suggests that 1 level of seasonal differencing would be better. However, the Box.test did not confirm it, by reporting a very small p-value.
Also, by looking at ACF and PACF graph patterns, it is also not clear-cut conclusive as what parameters should be in selecting the best, seasonal ARIMA model manually. Therefore, it would be prudent to start off with auto.arima() and see if makes a good determination by examining the residuals of the selected model.
(fit <- auto.arima(tspwr))
## Series: tspwr
## ARIMA(0,0,3)(2,1,0)[12] with drift
##
## Coefficients:
## ma1 ma2 ma3 sar1 sar2 drift
## 0.3230 0.0173 0.2446 -0.6550 -0.3837 8974.271
## s.e. 0.0785 0.0875 0.0762 0.0786 0.0795 3111.023
##
## sigma^2 estimated as 3.918e+11: log likelihood=-2658.43
## AIC=5330.87 AICc=5331.52 BIC=5353.22
checkresiduals(fit)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,0,3)(2,1,0)[12] with drift
## Q* = 15.875, df = 18, p-value = 0.6013
##
## Model df: 6. Total lags used: 24
Indeed, the automatically selected model, ARIMA(0,0,3)(2,1,0)[12] with drift, has near normally distributed and “white-noise” looking residuals. These are good indicators of a good model, which can be used for forecasting.
f <- forecast(fit, h = 12)
autoplot(tspwr) +
autolayer(f)
predict(f)
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## Jan 2014 10410760 9608603 11212918 9183966 11637554
## Feb 2014 8498785 7655832 9341737 7209599 9787970
## Mar 2014 7250962 6407895 8094030 5961602 8540323
## Apr 2014 6018419 5152823 6884016 4694604 7342235
## May 2014 5950870 5085273 6816466 4627054 7274685
## Jun 2014 8186852 7321255 9052448 6863036 9510667
## Jul 2014 9406691 8541095 10272288 8082876 10730507
## Aug 2014 9915311 9049714 10780908 8591495 11239127
## Sep 2014 8451080 7585484 9316677 7127265 9774896
## Oct 2014 5869759 5004163 6735356 4545943 7193575
## Nov 2014 6126305 5260709 6991902 4802489 7450121
## Dec 2014 8501376 7635779 9366972 7177560 9825191
fd <- seq(as.Date("2014-01-01"), as.Date("2014-12-01"), by="months")
df <- data.frame("YYYY-MMM" = format(fd, "%Y-%b"), KWH = round(f$mean))
colnames(df) <- c("YYYY-MMM", "KWH")
head(df, 12)
## YYYY-MMM KWH
## 1 2014-Jan 10410760
## 2 2014-Feb 8498785
## 3 2014-Mar 7250962
## 4 2014-Apr 6018419
## 5 2014-May 5950870
## 6 2014-Jun 8186852
## 7 2014-Jul 9406691
## 8 2014-Aug 9915311
## 9 2014-Sep 8451080
## 10 2014-Oct 5869759
## 11 2014-Nov 6126305
## 12 2014-Dec 8501376
write.csv(df, file = "PowerForecast.csv", row.names = FALSE)