US Electricity Production: Consider the total net generation of electricity (in billion kilowatt hours) by the U.S. electric industry (monthly for the period January 1973 – June 2013). In general there are two peaks per year: in mid-summer and mid-winter. This analysis uses several methods to evaluate and model this electricity data.

Based on Forecasting: Principles and Practice (Rob J Hyndman and George Athanasopoulos).

load needed packages

library(fpp2)
library(urca)
library(forecast)
library(fBasics)

plot time-series

autoplot(usmelec, ylab = "net generation of electricity (in billion kilowatt hours)", 
    xlab = "monthly for the period January 1973 – June 2013", 
    main = "total net generation of electricity (in billion kilowatt hours) by the U.S.")

Exploratory Analysis

12-month moving average

Examine the 12-month moving average of this series to see what kind of trend is involved.

plot(usmelec, col = "grey")
lines(ma(usmelec, order = 12), col = "red")

Taking the 12-month moving average makes it easier to see a clear trend. Aside from a noticable dip in the early-mid 1980s there is a consistent increasing linear trend until the late 2000s where it dips and flattens to the end of the data in 2013.

Transform Data

Check if any transformation is necessary.

un-transformed

evaluate un-transformed data

normalTest(usmelec, method = c("jb"))
## 
## Title:
##  Jarque - Bera Normalality Test
## 
## Test Results:
##   STATISTIC:
##     X-squared: 22.0343
##   P VALUE:
##     Asymptotic p Value: 1.642e-05 
## 
## Description:
##  Mon May 11 20:27:08 2020 by user:
## 4 moments
qqnorm(usmelec)
qqline(usmelec, col = 2)

skewness(usmelec)
## [1] 0.1444914
## attr(,"method")
## [1] "moment"
kurtosis(usmelec)
## [1] -1.010254
## attr(,"method")
## [1] "excess"

box-cox transformed

evaluate box-cox transformed data

## apply Box-Cox transform with - lambda ='auto'
usmelec_box <- BoxCox(usmelec, lambda = "auto")
normalTest(usmelec_box, method = c("jb"))
## 
## Title:
##  Jarque - Bera Normalality Test
## 
## Test Results:
##   STATISTIC:
##     X-squared: 33.152
##   P VALUE:
##     Asymptotic p Value: 6.326e-08 
## 
## Description:
##  Mon May 11 20:27:08 2020 by user:
qqnorm(usmelec_box)
qqline(usmelec_box, col = 2)

skewness(usmelec_box)
## [1] -0.4223738
## attr(,"method")
## [1] "moment"
kurtosis(usmelec_box)
## [1] -0.9670972
## attr(,"method")
## [1] "excess"

transformation conclusion

These pass the Jarque-Bera (JB) test for a normal distribution, with a p-value of 0.16e-05. Looking at the Q-Q plot appears to be tightly fit to the line. Skewness is also reduced significantly, to 0.14 - a slight right skew (mean slightly more than the median). The kurtosis of the transformed data is to -1.01 which normally indicates less “peakedness” than a normal distribution or could also mean with negative excess kurtosis the outlier character (as measured by large |Z|-values) of the distribution is less extreme than that of a normal distribution. Looking at the Box-cox and log transformation do not result in better kurtosis or skewness. I will move forward with the non-transformed data.

stationary

Check to see if data are stationary. If needed, find an appropriate differencing which yields stationary data.

acf(usmelec)

pacf(usmelec)

## check number of differences
ndiffs(usmelec, alpha = 0.05)
## [1] 1

stationarity conclusion

Looking at the ACF plot for 1st diff lagged correlation. The PACF also appears to have significant residual variation. Checking the ndiffs() tells us 1st differencing should be enough to remove stationarity.

ARIMA Models

An autoregressive integrated moving average model is a form of regression analysis that gauges the strength of one dependent variable relative to other changing variables. The model’s goal is to predict future securities or financial market moves by examining the differences between values in the series instead of through actual values.

3 components of ARIMA:

Autoregression (AR) - refers to a model that shows a changing variable that regresses on its own lagged, or prior, values.

Integrated (I) represents the differencing of raw observations to allow for the time series to become stationary, i.e., data values are replaced by the difference between the data values and the previous values.

Moving average (MA) incorporates the dependency between an observation and a residual error from a moving average model applied to lagged observations.

ARIMA #1

plot(decompose(usmelec))

fit.manual2 <- Arima(usmelec, order = c(12, 1, 0), seasonal = c(1, 
    0, 0), lambda = 0)
fit.manual2
## Series: usmelec 
## ARIMA(12,1,0)(1,0,0)[12] 
## Box Cox transformation: lambda= 0 
## 
## Coefficients:
##           ar1      ar2      ar3      ar4      ar5      ar6      ar7
##       -0.4196  -0.3817  -0.2097  -0.2566  -0.2020  -0.0813  -0.0862
## s.e.   0.0436   0.0474   0.0506   0.0517   0.0523   0.0529   0.0526
##           ar8     ar9    ar10    ar11     ar12    sar1
##       -0.0655  0.0091  0.1031  0.1144  -0.2935  0.9610
## s.e.   0.0523  0.0518  0.0508  0.0477   0.0440  0.0107
## 
## sigma^2 estimated as 0.0009182:  log likelihood=1000.95
## AIC=-1973.91   AICc=-1973.02   BIC=-1915.33

This method (ARIMA(0,1,0)(1,0,0)[12]) uses 12 lagged observations ,with single differencing and a single lag on the residual variation. The AICc is -1973.9, RMSE of 8.24.

ARIMA #2

fit.manual3 <- Arima(usmelec, order = c(6, 1, 0), seasonal = c(3, 
    1, 0), lambda = 0)
fit.manual3
## Series: usmelec 
## ARIMA(6,1,0)(3,1,0)[12] 
## Box Cox transformation: lambda= 0 
## 
## Coefficients:
##           ar1      ar2      ar3      ar4      ar5      ar6     sar1
##       -0.4448  -0.4031  -0.2257  -0.2341  -0.1611  -0.0692  -0.6594
## s.e.   0.0460   0.0500   0.0522   0.0521   0.0500   0.0461   0.0453
##          sar2     sar3
##       -0.4746  -0.2839
## s.e.   0.0506   0.0457
## 
## sigma^2 estimated as 0.0007708:  log likelihood=1024.58
## AIC=-2029.16   AICc=-2028.68   BIC=-1987.57

This method (ARIMA(6,1,6)(1,0,0)[12]) uses 6 lagged observations ,with single differencing and a 6 lagged forecast errors. The AICc is -2028.7, RMSE of 7.53.

Auto ARIMA

usmelec_auto_arima <- auto.arima(usmelec, seasonal = TRUE, stepwise = FALSE, 
    approximation = FALSE, lambda = "auto")
usmelec_auto_arima
## Series: usmelec 
## ARIMA(1,1,1)(2,1,1)[12] 
## Box Cox transformation: lambda= -0.5738331 
## 
## Coefficients:
##          ar1      ma1    sar1     sar2     sma1
##       0.3888  -0.8265  0.0403  -0.0958  -0.8471
## s.e.  0.0630   0.0375  0.0555   0.0531   0.0341
## 
## sigma^2 estimated as 1.274e-06:  log likelihood=2547.36
## AIC=-5082.72   AICc=-5082.54   BIC=-5057.76

Auto.arima method (ARIMA(1,1,1)(2,1,1)[12]) uses 1 lag ,with single differencing and single moving average. On the residual component, it uses 2 lags, with single differencing and 1 period moving average. The AICc is -5802.5, and RMSE of 7.23.

ARIMA results

The best model, according to AIC is the model (AICc of -5802.5) that uses auto.arima() - (ARIMA(1,1,1)(2,1,1)[12])

ARIMA model residuals

Estimate the parameters of your best model and do diagnostic testing on the residuals. Check to make sure the residuals resemble white noise.

usmelec_auto_arima
## Series: usmelec 
## ARIMA(1,1,1)(2,1,1)[12] 
## Box Cox transformation: lambda= -0.5738331 
## 
## Coefficients:
##          ar1      ma1    sar1     sar2     sma1
##       0.3888  -0.8265  0.0403  -0.0958  -0.8471
## s.e.  0.0630   0.0375  0.0555   0.0531   0.0341
## 
## sigma^2 estimated as 1.274e-06:  log likelihood=2547.36
## AIC=-5082.72   AICc=-5082.54   BIC=-5057.76
checkresiduals(usmelec_auto_arima)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(1,1,1)(2,1,1)[12]
## Q* = 28.012, df = 19, p-value = 0.08319
## 
## Model df: 5.   Total lags used: 24
summary(usmelec_auto_arima)
## Series: usmelec 
## ARIMA(1,1,1)(2,1,1)[12] 
## Box Cox transformation: lambda= -0.5738331 
## 
## Coefficients:
##          ar1      ma1    sar1     sar2     sma1
##       0.3888  -0.8265  0.0403  -0.0958  -0.8471
## s.e.  0.0630   0.0375  0.0555   0.0531   0.0341
## 
## sigma^2 estimated as 1.274e-06:  log likelihood=2547.36
## AIC=-5082.72   AICc=-5082.54   BIC=-5057.76
## 
## Training set error measures:
##                      ME     RMSE      MAE        MPE     MAPE     MASE
## Training set -0.5001989 7.235656 5.302245 -0.1959073 2.002005 0.588828
##                     ACF1
## Training set -0.02860273

conlusion on residuals

Auto.arima method (ARIMA(1,1,1)(2,1,1)[12]) uses 1 lag ,with single differencing and single moving average. On the residual component, it uses 2 lags, with single differencing and 1 period moving average. The AICc is -5802.5, and RMSE of 7.23. The residual ACF plot and distrivution of residuals look pass the eye test and is slightly above the p>0.5 theshold (at 0.83). Based on the residual ACF plot - the oscillating nature of these lags, only a few falling outside of the confidence interval for significance, and normal distribution of the residuals - I believe this to be white noise and acceptable.

Forecast (vs validation data)

Forecast the next 15 years of electricity generation by the U.S. electric industry. Get the latest figures from the EIA to check the accuracy of your forecasts. Output results.

plot(forecast(usmelec_auto_arima, h = 180), ylab = "total net generation of electricity (in billion kilowatt hours) by the U.S.")
lines(ma(usmelec, 12), col = "red")

## export predictions to csv for comparison to actuals (from
## EIA) library(data.table) forecast_usmelec <-
## as.data.table(forecast(usmelec_auto_arima, h=180))
## forecast_usmelec write.csv(forecast_usmelec, file
## ='forecast_usmelec.csv')

Forecast vs Actuals

Conclusion

Based on my comparison, these values are still useful 6 years (July 2019) after the dataset we trained from ended (June 2013) - as you can see from the chart the average absolute percentage error was only 2.17 percent and an RMSE of 9.71, which is only slightly below what we observed in the best training model (RMSE = 7.23). Some decay in the model should be expected, but these predictions are clearly still useful and with continued retraining could be even better than what we see 6 years out. This was attained by comparing actual values from EIA to the predicted values between January 2017 and July 2019.