US Electricity Production: Consider the total net generation of electricity (in billion kilowatt hours) by the U.S. electric industry (monthly for the period January 1973 – June 2013). In general there are two peaks per year: in mid-summer and mid-winter. This analysis uses several methods to evaluate and model this electricity data.
Based on Forecasting: Principles and Practice (Rob J Hyndman and George Athanasopoulos).
library(fpp2)
library(urca)
library(forecast)
library(fBasics)autoplot(usmelec, ylab = "net generation of electricity (in billion kilowatt hours)",
xlab = "monthly for the period January 1973 – June 2013",
main = "total net generation of electricity (in billion kilowatt hours) by the U.S.")Examine the 12-month moving average of this series to see what kind of trend is involved.
plot(usmelec, col = "grey")
lines(ma(usmelec, order = 12), col = "red")Taking the 12-month moving average makes it easier to see a clear trend. Aside from a noticable dip in the early-mid 1980s there is a consistent increasing linear trend until the late 2000s where it dips and flattens to the end of the data in 2013.
Check if any transformation is necessary.
evaluate un-transformed data
normalTest(usmelec, method = c("jb"))##
## Title:
## Jarque - Bera Normalality Test
##
## Test Results:
## STATISTIC:
## X-squared: 22.0343
## P VALUE:
## Asymptotic p Value: 1.642e-05
##
## Description:
## Mon May 11 20:27:08 2020 by user:
## 4 moments
qqnorm(usmelec)
qqline(usmelec, col = 2)skewness(usmelec)## [1] 0.1444914
## attr(,"method")
## [1] "moment"
kurtosis(usmelec)## [1] -1.010254
## attr(,"method")
## [1] "excess"
evaluate box-cox transformed data
## apply Box-Cox transform with - lambda ='auto'
usmelec_box <- BoxCox(usmelec, lambda = "auto")
normalTest(usmelec_box, method = c("jb"))##
## Title:
## Jarque - Bera Normalality Test
##
## Test Results:
## STATISTIC:
## X-squared: 33.152
## P VALUE:
## Asymptotic p Value: 6.326e-08
##
## Description:
## Mon May 11 20:27:08 2020 by user:
qqnorm(usmelec_box)
qqline(usmelec_box, col = 2)skewness(usmelec_box)## [1] -0.4223738
## attr(,"method")
## [1] "moment"
kurtosis(usmelec_box)## [1] -0.9670972
## attr(,"method")
## [1] "excess"
These pass the Jarque-Bera (JB) test for a normal distribution, with a p-value of 0.16e-05. Looking at the Q-Q plot appears to be tightly fit to the line. Skewness is also reduced significantly, to 0.14 - a slight right skew (mean slightly more than the median). The kurtosis of the transformed data is to -1.01 which normally indicates less “peakedness” than a normal distribution or could also mean with negative excess kurtosis the outlier character (as measured by large |Z|-values) of the distribution is less extreme than that of a normal distribution. Looking at the Box-cox and log transformation do not result in better kurtosis or skewness. I will move forward with the non-transformed data.
Check to see if data are stationary. If needed, find an appropriate differencing which yields stationary data.
acf(usmelec)pacf(usmelec)## check number of differences
ndiffs(usmelec, alpha = 0.05)## [1] 1
Looking at the ACF plot for 1st diff lagged correlation. The PACF also appears to have significant residual variation. Checking the ndiffs() tells us 1st differencing should be enough to remove stationarity.
An autoregressive integrated moving average model is a form of regression analysis that gauges the strength of one dependent variable relative to other changing variables. The model’s goal is to predict future securities or financial market moves by examining the differences between values in the series instead of through actual values.
Autoregression (AR) - refers to a model that shows a changing variable that regresses on its own lagged, or prior, values.
Integrated (I) represents the differencing of raw observations to allow for the time series to become stationary, i.e., data values are replaced by the difference between the data values and the previous values.
Moving average (MA) incorporates the dependency between an observation and a residual error from a moving average model applied to lagged observations.
plot(decompose(usmelec))fit.manual2 <- Arima(usmelec, order = c(12, 1, 0), seasonal = c(1,
0, 0), lambda = 0)
fit.manual2## Series: usmelec
## ARIMA(12,1,0)(1,0,0)[12]
## Box Cox transformation: lambda= 0
##
## Coefficients:
## ar1 ar2 ar3 ar4 ar5 ar6 ar7
## -0.4196 -0.3817 -0.2097 -0.2566 -0.2020 -0.0813 -0.0862
## s.e. 0.0436 0.0474 0.0506 0.0517 0.0523 0.0529 0.0526
## ar8 ar9 ar10 ar11 ar12 sar1
## -0.0655 0.0091 0.1031 0.1144 -0.2935 0.9610
## s.e. 0.0523 0.0518 0.0508 0.0477 0.0440 0.0107
##
## sigma^2 estimated as 0.0009182: log likelihood=1000.95
## AIC=-1973.91 AICc=-1973.02 BIC=-1915.33
This method (ARIMA(0,1,0)(1,0,0)[12]) uses 12 lagged observations ,with single differencing and a single lag on the residual variation. The AICc is -1973.9, RMSE of 8.24.
fit.manual3 <- Arima(usmelec, order = c(6, 1, 0), seasonal = c(3,
1, 0), lambda = 0)
fit.manual3## Series: usmelec
## ARIMA(6,1,0)(3,1,0)[12]
## Box Cox transformation: lambda= 0
##
## Coefficients:
## ar1 ar2 ar3 ar4 ar5 ar6 sar1
## -0.4448 -0.4031 -0.2257 -0.2341 -0.1611 -0.0692 -0.6594
## s.e. 0.0460 0.0500 0.0522 0.0521 0.0500 0.0461 0.0453
## sar2 sar3
## -0.4746 -0.2839
## s.e. 0.0506 0.0457
##
## sigma^2 estimated as 0.0007708: log likelihood=1024.58
## AIC=-2029.16 AICc=-2028.68 BIC=-1987.57
This method (ARIMA(6,1,6)(1,0,0)[12]) uses 6 lagged observations ,with single differencing and a 6 lagged forecast errors. The AICc is -2028.7, RMSE of 7.53.
usmelec_auto_arima <- auto.arima(usmelec, seasonal = TRUE, stepwise = FALSE,
approximation = FALSE, lambda = "auto")
usmelec_auto_arima## Series: usmelec
## ARIMA(1,1,1)(2,1,1)[12]
## Box Cox transformation: lambda= -0.5738331
##
## Coefficients:
## ar1 ma1 sar1 sar2 sma1
## 0.3888 -0.8265 0.0403 -0.0958 -0.8471
## s.e. 0.0630 0.0375 0.0555 0.0531 0.0341
##
## sigma^2 estimated as 1.274e-06: log likelihood=2547.36
## AIC=-5082.72 AICc=-5082.54 BIC=-5057.76
Auto.arima method (ARIMA(1,1,1)(2,1,1)[12]) uses 1 lag ,with single differencing and single moving average. On the residual component, it uses 2 lags, with single differencing and 1 period moving average. The AICc is -5802.5, and RMSE of 7.23.
The best model, according to AIC is the model (AICc of -5802.5) that uses auto.arima() - (ARIMA(1,1,1)(2,1,1)[12])
Estimate the parameters of your best model and do diagnostic testing on the residuals. Check to make sure the residuals resemble white noise.
usmelec_auto_arima## Series: usmelec
## ARIMA(1,1,1)(2,1,1)[12]
## Box Cox transformation: lambda= -0.5738331
##
## Coefficients:
## ar1 ma1 sar1 sar2 sma1
## 0.3888 -0.8265 0.0403 -0.0958 -0.8471
## s.e. 0.0630 0.0375 0.0555 0.0531 0.0341
##
## sigma^2 estimated as 1.274e-06: log likelihood=2547.36
## AIC=-5082.72 AICc=-5082.54 BIC=-5057.76
checkresiduals(usmelec_auto_arima)##
## Ljung-Box test
##
## data: Residuals from ARIMA(1,1,1)(2,1,1)[12]
## Q* = 28.012, df = 19, p-value = 0.08319
##
## Model df: 5. Total lags used: 24
summary(usmelec_auto_arima)## Series: usmelec
## ARIMA(1,1,1)(2,1,1)[12]
## Box Cox transformation: lambda= -0.5738331
##
## Coefficients:
## ar1 ma1 sar1 sar2 sma1
## 0.3888 -0.8265 0.0403 -0.0958 -0.8471
## s.e. 0.0630 0.0375 0.0555 0.0531 0.0341
##
## sigma^2 estimated as 1.274e-06: log likelihood=2547.36
## AIC=-5082.72 AICc=-5082.54 BIC=-5057.76
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -0.5001989 7.235656 5.302245 -0.1959073 2.002005 0.588828
## ACF1
## Training set -0.02860273
Auto.arima method (ARIMA(1,1,1)(2,1,1)[12]) uses 1 lag ,with single differencing and single moving average. On the residual component, it uses 2 lags, with single differencing and 1 period moving average. The AICc is -5802.5, and RMSE of 7.23. The residual ACF plot and distrivution of residuals look pass the eye test and is slightly above the p>0.5 theshold (at 0.83). Based on the residual ACF plot - the oscillating nature of these lags, only a few falling outside of the confidence interval for significance, and normal distribution of the residuals - I believe this to be white noise and acceptable.
Forecast the next 15 years of electricity generation by the U.S. electric industry. Get the latest figures from the EIA to check the accuracy of your forecasts. Output results.
plot(forecast(usmelec_auto_arima, h = 180), ylab = "total net generation of electricity (in billion kilowatt hours) by the U.S.")
lines(ma(usmelec, 12), col = "red")## export predictions to csv for comparison to actuals (from
## EIA) library(data.table) forecast_usmelec <-
## as.data.table(forecast(usmelec_auto_arima, h=180))
## forecast_usmelec write.csv(forecast_usmelec, file
## ='forecast_usmelec.csv')Based on my comparison, these values are still useful 6 years (July 2019) after the dataset we trained from ended (June 2013) - as you can see from the chart the average absolute percentage error was only 2.17 percent and an RMSE of 9.71, which is only slightly below what we observed in the best training model (RMSE = 7.23). Some decay in the model should be expected, but these predictions are clearly still useful and with continued retraining could be even better than what we see 6 years out. This was attained by comparing actual values from EIA to the predicted values between January 2017 and July 2019.