In the United States, electricity and gas are two of the most significant forms of energy delivered for public use, and the output of electric and gas utilities reflects residents' energy consumption. In this analysis, we examine US electric and gas utility production using time series methods.
This project uses the Industrial Production: Electric and Gas Utilities dataset [1] for the United States, covering January 1985 to January 2018 at a monthly frequency.
A forecast of industrial production of electric and gas utilities for the next 10 months will be produced using the model identified as the best from the analysis.
Below are the characteristics of the time series plot of our dataset:
Trend: The time series plot shows a clear upward trend, which suggests the series is non-stationary.
Changing variation: There appears to be a slight change in variation within the seasonal pattern, but this is hard to judge from the time series plot alone.
Seasonality: Seasonality is evident from the repeating pattern in the plot, as expected for monthly observations.
Autocorrelation structure: Successive observations stay close to one another, suggesting autocorrelation; this is examined with the lag-1 scatter plot below.
Intervention point: There is no intervention point, since the plot shows no sudden change in the pattern of the series.
## [1] 0.8684431
A strong correlation can be seen in the scatter plot of the series against its first lag (Figure 1).
This is supported by the sample lag-1 correlation of 0.87, which indicates strong positive autocorrelation.
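A minimal sketch of this check in R, using the `utilities.ts` series constructed in the appendix code:
library(TSA)                 # for zlag()
y <- utilities.ts
x <- zlag(utilities.ts)      # series lagged by one month
index <- 2:length(x)         # drop the leading NA created by the lag
cor(y[index], x[index])      # sample lag-1 correlation, about 0.87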
Observations from the autocorrelation structure of the time series:
The effect of the trend and the seasonality is visible in the ACF plot; the pattern of the ACF indicates seasonality.
This indicates that the seasonality needs to be removed before building the model. We will use the residual approach: fit a model for the seasonal component first and then examine its residuals for the remaining structure.
From the time series plot it is clear that the series has a stochastic trend with seasonality.
Since the seasonal trend is obvious, we start with seasonal differencing: a model with seasonal differencing (D = 1) and a white noise ordinary part is fitted.
The data are fitted to obtain the residuals.
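A minimal sketch of this step in R, using the `utilities.ts` series from the appendix code:
# Seasonal differencing only: D = 1 at lag 12, white noise ordinary part
m1 <- arima(utilities.ts, order = c(0, 0, 0),
            seasonal = list(order = c(0, 1, 0), period = 12))
res_m1 <- residuals(m1)   # the seasonally differenced series
plot(res_m1, type = 'l')  # check that the seasonality has been removed
acf(res_m1, lag.max = 36); pacf(res_m1, lag.max = 36)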
From the time series plot of the residuals, we can see that the seasonality is largely filtered out.
The P and Q values for the seasonal part are determined from the ACF and PACF plots of these residuals: three significant lags appear in the PACF and two in the ACF at the seasonal lags of 12, 24 and 36 months. Therefore P = 3 and Q = 2 are taken for the seasonal part.
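A sketch of the resulting fit for the seasonal part, whose residuals are examined next:
# Seasonal ARMA(3,2) with D = 1; the ordinary part is still left as white noise
m3 <- arima(utilities.ts, order = c(0, 0, 0),
            seasonal = list(order = c(3, 1, 2), period = 12))
res_m3 <- residuals(m3)
plot(res_m3, type = 'l'); acf(res_m3, lag.max = 36); pacf(res_m3, lag.max = 36)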
From the time series plot of these residuals, we can see that there is no remaining seasonality.
From the ACF and PACF plots we can see that the seasonal correlation has been removed: the residuals are centred around zero with almost constant variance and no significant seasonal lags. We can conclude that the seasonal part of the model is adequate.
Our specification with P = 3, D = 1, Q = 2 is adequate for the seasonal part.
For the seasonal part, our model will therefore be SARMA(3,2).
After applying a log transformation, the time series plot shows that the changing variance is stabilised.
The ACF plot no longer shows a seasonal pattern, but the lags decay only slowly, which indicates that the ordinary part is still non-stationary.
This can be corrected by differencing the ordinary (non-seasonal) part of the model.
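A minimal sketch of this step, continuing from the appendix code:
# Log transform to stabilise the variance, then difference the ordinary part (d = 1)
log.utilities.ts <- log(utilities.ts)
m5 <- arima(log.utilities.ts, order = c(0, 1, 0),
            seasonal = list(order = c(3, 1, 2), period = 12))
res_m5 <- residuals(m5)   # residuals used for the ADF test and ACF/PACF below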
##
## 0.01
The ADF test applied to the residuals after differencing with d = 1 gives a p-value below 0.01, so the null hypothesis of non-stationarity is rejected and the series can be treated as stationary.
The same can be seen in the ACF plot: the lags no longer decay slowly.
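The stationarity check sketched in R, using adfTest() from the fUnitRoots package:
library(fUnitRoots)
# Choose the lag order from an AR fit to the differenced residuals, then run the ADF test
k <- ar(diff(res_m5))$order
adfTest(res_m5, lags = k)@test$p.value   # about 0.01, so non-stationarity is rejected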
The ACF is related to the MA component: three significant lags cross the blue dotted confidence bounds.
The PACF is related to the AR component: four significant lags cross the blue dotted confidence bounds.
SARIMA(4,1,3)X(3,1,2)_12 is therefore the model suggested by the ACF and PACF plots.
## AR/MA
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13
## 0 x x o o o o o o o x o o o o
## 1 x x o o o o o o o o o o o o
## 2 x x o o o o o o o o o o o o
## 3 x o x o o o o o o o o o o o
## 4 x x x x o o o o o o o o o o
## 5 x x x o x o o o o o o o o o
## 6 x o o x x x o o o o o o o o
## 7 x x o x x o o o o o o o o o
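Reading the EACF table, the vertex of the triangle of 'o' symbols sits at AR = 0, MA = 2, which suggests ARMA(0,2) for the ordinary part, with its neighbours ARMA(1,2) and ARMA(0,3) as further candidates.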
## Reordering variables and trying again:
The MA(4) and MA(5) components appear significant in the BIC plot, so the candidate ARMA specifications from BIC are ARMA(0,4) and ARMA(0,5) for the ordinary part.
We do not consider AR(10), following the principle of parsimony.
Combining the ACF, PACF, EACF and BIC results, our candidate models are SARIMA(0,1,2)X(3,1,2)_12, SARIMA(1,1,2)X(3,1,2)_12, SARIMA(0,1,3)X(3,1,2)_12, SARIMA(0,1,4)X(3,1,2)_12 and SARIMA(0,1,5)X(3,1,2)_12.
All parameters should be statistically significant at the 5% level of significance.
Parameters that are insignificant are candidates to be dropped from the model.
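Each candidate is fitted to the log-transformed series and its coefficients are tested with coeftest() from the lmtest package; a minimal sketch for one candidate:
library(lmtest)
# Fit SARIMA(0,1,3)X(3,1,2)_12 on the log-transformed series and test its coefficients
m013 <- arima(log.utilities.ts, order = c(0, 1, 3),
              seasonal = list(order = c(3, 1, 2), period = 12))
coeftest(m013)   # z test for each estimated parameter
The parameter estimates for SARIMA(0,1,2)X(3,1,2)_12 are shown first.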
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ma1 -0.430246 0.050921 -8.4492 < 2.2e-16 ***
## ma2 -0.332135 0.057116 -5.8151 6.059e-09 ***
## sar1 0.341335 0.636699 0.5361 0.5918888
## sar2 -0.256797 0.069326 -3.7042 0.0002121 ***
## sar3 0.023448 0.165915 0.1413 0.8876155
## sma1 -1.107127 0.634977 -1.7436 0.0812341 .
## sma2 0.310893 0.477178 0.6515 0.5147087
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
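Parameter estimates for SARIMA(1,1,2)X(3,1,2)_12: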
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ar1 0.492009 0.122405 4.0195 5.832e-05 ***
## ma1 -0.908725 0.140302 -6.4769 9.362e-11 ***
## ma2 -0.016074 0.118620 -0.1355 0.8922099
## sar1 0.487073 0.353599 1.3775 0.1683658
## sar2 -0.261586 0.068549 -3.8160 0.0001356 ***
## sar3 0.047681 0.114671 0.4158 0.6775548
## sma1 -1.259078 0.348365 -3.6142 0.0003012 ***
## sma2 0.429343 0.263607 1.6287 0.1033724
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
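Parameter estimates for SARIMA(0,1,3)X(3,1,2)_12: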
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ma1 -0.416220 0.052475 -7.9317 2.161e-15 ***
## ma2 -0.289413 0.059922 -4.8298 1.366e-06 ***
## ma3 -0.113121 0.048058 -2.3539 0.0185799 *
## sar1 0.445792 0.419500 1.0627 0.2879300
## sar2 -0.258217 0.068797 -3.7533 0.0001745 ***
## sar3 0.054092 0.128605 0.4206 0.6740412
## sma1 -1.228258 0.416713 -2.9475 0.0032036 **
## sma2 0.398270 0.314299 1.2672 0.2050943
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
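Parameter estimates for SARIMA(0,1,4)X(3,1,2)_12: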
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ma1 -0.417338 0.051639 -8.0818 6.383e-16 ***
## ma2 -0.254462 0.057037 -4.4613 8.146e-06 ***
## ma3 -0.058738 0.053604 -1.0958 0.273178
## ma4 -0.108068 0.052755 -2.0485 0.040510 *
## sar1 -0.512741 NaN NaN NaN
## sar2 -0.248224 0.085551 -2.9015 0.003714 **
## sar3 -0.145795 NaN NaN NaN
## sma1 -0.251091 NaN NaN NaN
## sma2 -0.345411 NaN NaN NaN
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
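Parameter estimates for SARIMA(0,1,5)X(3,1,2)_12: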
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ma1 -0.402210 0.051860 -7.7557 8.788e-15 ***
## ma2 -0.258219 0.056701 -4.5541 5.262e-06 ***
## ma3 -0.034358 0.056770 -0.6052 0.545037
## ma4 -0.068483 0.055958 -1.2238 0.221016
## ma5 -0.088020 0.048894 -1.8002 0.071824 .
## sar1 -0.521673 NaN NaN NaN
## sar2 -0.248848 0.085293 -2.9176 0.003528 **
## sar3 -0.151252 NaN NaN NaN
## sma1 -0.230361 NaN NaN NaN
## sma2 -0.352123 NaN NaN NaN
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
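In SARIMA(1,1,2)X(3,1,2)_12 the ma2 term is clearly insignificant, and in SARIMA(0,1,4)X(3,1,2)_12 and SARIMA(0,1,5)X(3,1,2)_12 several seasonal coefficients return NaN standard errors, so these models are set aside. Residual analysis is therefore carried out for SARIMA(0,1,2)X(3,1,2)_12 and SARIMA(0,1,3)X(3,1,2)_12; the diagnostics for SARIMA(0,1,2)X(3,1,2)_12 follow.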
##
## Shapiro-Wilk normality test
##
## data: res.model
## W = 0.99419, p-value = 0.1358
##
##
## Box-Ljung test
##
## data: res.model
## X-squared = 17.842, df = 10, p-value = 0.05769
From the ACF plot we can see that two lags cross the blue dotted lines, which suggests the residuals are not quite white noise.
However, these lags do not extend far beyond the bounds, and the Ljung-Box p-value of 0.058 does not reject independence at the 5% level. Instead of rejecting this model, we will compare it with the residual analysis of the other candidate models.
The Shapiro-Wilk test indicates that the residuals are normally distributed, with a p-value of 0.1358 (> 0.05). This is also supported by the QQ plot, in which the points lie close to the reference line.
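The residual diagnostics for SARIMA(0,1,3)X(3,1,2)_12 follow.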
##
## Shapiro-Wilk normality test
##
## data: res.model
## W = 0.99407, p-value = 0.1252
##
##
## Box-Ljung test
##
## data: res.model
## X-squared = 12.023, df = 10, p-value = 0.2835
From the ACF plot we can see that only one lag crosses the blue dotted lines, which suggests the residuals are close to, but not exactly, white noise.
This lag is not far beyond the bounds, and the Ljung-Box p-value of 0.28 indicates no significant residual autocorrelation. Instead of rejecting this model, we will compare it with the residual analysis of the other candidate models.
The Shapiro-Wilk test indicates that the residuals are normally distributed, with a p-value of 0.1252 (> 0.05). This is also supported by the QQ plot, in which the points lie close to the reference line.
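A minimal sketch of the diagnostic checks applied to each candidate's residuals (the fuller residual.analysis() helper is listed in the appendix):
# Residual diagnostics for a fitted SARIMA model
res <- residuals(m013)
acf(res, lag.max = 36)                        # remaining autocorrelation
qqnorm(res); qqline(res, col = 2)             # normality check via QQ plot
shapiro.test(res)                             # Shapiro-Wilk normality test
Box.test(res, lag = 10, type = "Ljung-Box")   # Ljung-Box test for independence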
With only one lag crossing the blue dotted lines, the residual ACF of SARIMA(0,1,3)X(3,1,2)_12 looks better than that of SARIMA(0,1,2)X(3,1,2)_12.
With SARIMA(0,1,3)X(3,1,2)_12 as the most eligible model, SARIMA(0,1,4)X(3,1,2)_12 and SARIMA(1,1,3)X(3,1,2)_12 are fitted as its overfitted versions.
The analysis for SARIMA(0,1,4)X(3,1,2)_12 has already been done above; the analysis for SARIMA(1,1,3)X(3,1,2)_12 is given below.
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ar1 0.6153409 0.1602113 3.8408 0.0001226 ***
## ma1 -1.0274595 0.1682366 -6.1072 1.014e-09 ***
## ma2 0.0078804 0.1026841 0.0767 0.9388270
## ma3 0.0746371 0.0868249 0.8596 0.3899943
## sar1 0.4724080 0.3801575 1.2427 0.2139917
## sar2 -0.2641270 0.0690086 -3.8275 0.0001295 ***
## sar3 0.0425761 0.1180051 0.3608 0.7182500
## sma1 -1.2328369 0.3759513 -3.2792 0.0010408 **
## sma2 0.4122796 0.2831803 1.4559 0.1454228
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Shapiro-Wilk normality test
##
## data: res.model
## W = 0.99435, p-value = 0.1499
##
##
## Box-Ljung test
##
## data: res.model
## X-squared = 8.6409, df = 10, p-value = 0.5665
From the ACF plot we can see two lags crossing the blue dotted lines, which suggests the residuals are not quite white noise.
Several coefficients of the SARIMA(1,1,3)X(3,1,2)_12 model are not significant (p > 0.05), so the overfitted models do not improve on SARIMA(0,1,3)X(3,1,2)_12.
## df AIC
## m013 9 -1729.342
## m012 8 -1726.043
## df BIC
## m012 8 -1694.438
## m013 9 -1693.787
We obtain different rankings from the AIC and BIC comparisons of the two shortlisted models.
We base the choice on AIC, which balances goodness of fit against model complexity.
According to AIC, SARIMA(0,1,3)X(3,1,2)_12 is the better model for forecasting our data.
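The comparison sketched in R, using the fitted objects m012 and m013 from the appendix code:
# Compare the two shortlisted models by information criteria (smaller is better)
AIC(m012, m013)
BIC(m012, m013)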
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## Feb 2018 4.710677 4.679119 4.742234 4.662414 4.758940
## Mar 2018 4.629216 4.592674 4.665757 4.573330 4.685101
## Apr 2018 4.489901 4.452197 4.527605 4.432238 4.547564
## May 2018 4.514993 4.476858 4.553128 4.456670 4.573315
## Jun 2018 4.624210 4.585649 4.662772 4.565235 4.683185
## Jul 2018 4.713796 4.674813 4.752780 4.654176 4.773416
## Aug 2018 4.698027 4.658626 4.737428 4.637769 4.758286
## Sep 2018 4.599235 4.559421 4.639049 4.538344 4.660125
## Oct 2018 4.517877 4.477655 4.558100 4.456362 4.579393
## Nov 2018 4.563416 4.522789 4.604043 4.501282 4.625550
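Note that the model is fitted to the log-transformed series, so the point forecasts and prediction intervals above are on the log scale. Exponentiating returns them to the original production-index scale; a minimal sketch, using the finalModel object from the appendix code:
library(forecast)
# Back-transform the log-scale forecasts to the original scale of the index
pred <- forecast(finalModel, h = 10)
exp(pred$mean)                     # point forecasts on the original scale
exp(pred$lower); exp(pred$upper)   # corresponding prediction interval bounds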
Several candidate models emerged from the analysis. Residual analysis and parameter estimation showed that the residuals of SARIMA(0,1,3)X(3,1,2)_12 approximately follow white noise and are close to normally distributed, which indicates a good fit.
From the forecast we can see that electric and gas utility production over the next 10 months will not differ much from previous years.
[1] Data source: https://fred.stlouisfed.org/series/IPG2211A2N
[2] Class notes: https://rmit.instructure.com/courses/67182/files/12231943?module_item_id=2391148
# Load the required libraries
library(TSA)
library(readr)
library(tseries)
library(fUnitRoots)
library(lmtest)
library(FSAdata)
library(forecast)
#Read the dataset and convert the dataframe to time series data
utilities <- read.csv('C:/Users/DELL/Downloads/Electric_Production.csv', header = TRUE)
names(utilities) <- c("date", "production")
rownames(utilities) <- utilities[,1]
utilities <- utilities[,2]
utilities.ts <- ts(as.vector(utilities), start=c(1985,1), end=c(2018,1), frequency=12)
# plotting time series
plot(utilities.ts, type='l',ylab='Electricity Production' )
points(y=utilities.ts,x=time(utilities.ts))
# Scatter plot of the series against its first lag (Figure 1)
par(mfrow=c(1,1) , cex.lab = 0.7, cex.main=0.7, cex.axis = 0.7)
plot(y=utilities.ts,x=zlag(utilities.ts), ylab='Electricity Production', xlab = 'Year', main = 'Figure-1')
# Sample correlation between the series and its first lag
y = utilities.ts
x = zlag(utilities.ts) # Generate first lag
index = 2:length(x) # Index that drops the leading NA introduced by the lag
cor(y[index],x[index])
#ACF and PACF plots of original time series data
par(mfrow=c(2,1))
acf(utilities.ts, lag.max = 36,main="The sample ACF of production of electric and gas utilities series")
pacf(utilities.ts, lag.max = 36,main="The sample PACF of production of electric and gas utilities series")
#Seasonal differencing with D = 1 and fit a white noise model
m1 = arima(utilities.ts,order=c(0,0,0),seasonal=list(order=c(0,1,0), period=12))
res_m1 = residuals(m1)
#Plot Time series graph and ACF and PACF plot
par(mfrow=c(3,1))
plot(res_m1, type='l', ylab='Electric and gas utilities production', main = 'Time series graph of residuals' )
acf(res_m1,lag.max=36)
pacf(res_m1, lag.max=36)
par(mfrow=c(1,1))
#Adding SARMA(3,2) to the model
m3 = arima(utilities.ts,order=c(0,0,0),seasonal=list(order=c(3,1,2), period=12))
res_m3 = residuals(m3)
# plotting time series and ACF and PACF plot
par(mfrow=c(3,1))
plot(res_m3,type='l',ylab='Electric and gas utilities production',main = 'Time series graph of residuals' )
acf(res_m3,lag.max=36)
pacf(res_m3, lag.max=36)
par(mfrow=c(1,1))
#Log Transformation
log.utilities.ts = log(utilities.ts)
# Fit the seasonal SARMA(3,2) model (D = 1) to the log-transformed series
m4 = arima(log.utilities.ts,order=c(0,0,0),seasonal=list(order=c(3,1,2), period=12))
res_m4 = residuals(m4)
#Plotting ACF and PACF
par(mfrow=c(3,1))
plot(res_m4,type='l',ylab='Electric and gas utilities production',main = 'Time series graph of residuals' )
acf(res_m4,lag.max=36)
pacf(res_m4, lag.max=36)
par(mfrow=c(1,1))
# Differencing the ordinary (non-seasonal) part of the model (d = 1)
m5 = arima(log.utilities.ts,order=c(0,1,0),seasonal=list(order=c(3,1,2), period=12))
res_m5 = residuals(m5)
#ADF test for non-stationarity
order = ar(diff(res_m5))$order
adfTest(res_m5, lags = order, title = NULL,
description = NULL)@test$p.value
# ACF and PACF plots
par(mfrow=c(2,1))
acf(res_m5,lag.max=36)
pacf(res_m5, lag.max=36)
par(mfrow=c(1,1))
#Plot EACF graph
eacf(res_m5)
#Plotting BIC Graph
par(mfrow=c(1,1) , cex.lab = 0.7, cex.main=0.7, cex.axis = 0.7)
res4 = armasubsets(y=res_m5,nar=12,nma=12,y.name='test',ar.method='ols')
plot(res4)
mtext("BIC Table", outer=TRUE, cex=1, line=-1.25)
#SARIMA(0,1,2)X(3,1,2)_12
m012 = arima(log.utilities.ts,order=c(0,1,2),seasonal=list(order=c(3,1,2), period=12))
coeftest(m012)
#SARIMA(1,1,2)X(3,1,2)_12
m112 = arima(log.utilities.ts,order=c(1,1,2),seasonal=list(order=c(3,1,2), period=12))
coeftest(m112)
#SARIMA(0,1,3)X(3,1,2)_12
m013 = arima(log.utilities.ts,order=c(0,1,3),seasonal=list(order=c(3,1,2), period=12))
coeftest(m013)
#SARIMA(0,1,4)X(3,1,2)_12
m014 = arima(log.utilities.ts,order=c(0,1,4),seasonal=list(order=c(3,1,2), period=12))
coeftest(m014)
#SARIMA(0,1,5)X(3,1,2)_12
m015 = arima(log.utilities.ts,order=c(0,1,5),seasonal=list(order=c(3,1,2), period=12))
coeftest(m015)
# Function for residual analysis of a fitted model: diagnostic plots and tests
residual.analysis <- function(model, std = TRUE, start = 2,
                              class = c("ARIMA","GARCH","ARMA-GARCH")[1]){
  library(TSA)
  library(FitAR)
  # Extract (standardised) residuals according to the model class
  if (class == "ARIMA"){
    if (std == TRUE){
      res.model = rstandard(model)   # standardised residuals (TSA::rstandard.Arima)
    }else{
      res.model = residuals(model)
    }
  }else if (class == "GARCH"){
    res.model = model$residuals[start:model$n.used]
  }else if (class == "ARMA-GARCH"){
    res.model = model@fit$residuals
  }else {
    stop("The argument 'class' must be either 'ARIMA' or 'GARCH' ")
  }
  # Diagnostic plots: time series, histogram, ACF/PACF and QQ plot of residuals
  par(mfrow=c(3,2))
  plot(res.model, type='o', ylab='Standardised residuals',
       main="Time series plot of standardised residuals")
  abline(h=0)
  hist(res.model, main="Histogram of standardised residuals")
  acf(res.model, main="ACF of standardised residuals")
  pacf(res.model, main="PACF of standardised residuals")
  qqnorm(res.model, main="QQ plot of standardised residuals")
  qqline(res.model, col = 2)
  # Formal tests: Shapiro-Wilk for normality, Ljung-Box for independence
  print(shapiro.test(res.model))
  print(Box.test(res.model, lag=10, type="Ljung-Box"))
  k=0
  LBQPlot(res.model, lag.max = length(model$residuals)-1, StartLag = k + 1,
          k = 0, SquaredQ = FALSE)
}
#residual analysis of SARIMA(0,1,2)X(3,1,2)_12
residual.analysis(m012)
#residual analysis of SARIMA(0,1,3)X(3,1,2)_12
residual.analysis(m013)
#Parameter estimate and residual analysis of SARIMA(1,1,3)X(3,1,2)_12
m113 = arima(log.utilities.ts,order=c(1,1,3),seasonal=list(order=c(3,1,2), period=12))
coeftest(m113)
residual.analysis(m113)
# Function for sorting AIC and BIC scores
sort.score <- function(x, score = c("bic", "aic")){
  if (score == "aic"){
    x[with(x, order(AIC)),]
  } else if (score == "bic") {
    x[with(x, order(BIC)),]
  } else {
    warning('score = "x" only accepts valid arguments ("aic","bic")')
  }
}
# Calculate AIC and BIC scores of the model
sc.AIC=AIC(m012, m013)
sc.BIC=BIC(m012, m013)
#sort AIC and BIC scores of the candidate models
sort.score(sc.AIC, score = "aic")
sort.score(sc.BIC, score = "bic")
#Plotting Time series graph with forecasting for next 10 months
finalModel = Arima(log.utilities.ts,order=c(0,1,3),seasonal=list(order=c(3,1,2), period=12))
pred = forecast(finalModel, h=10)
plot(pred)
#Forecasted values for next 10 months
pred