1. Introduction

Electricity and gas are basic necessities of day-to-day life, and the consumption of these utilities will keep increasing in the future. The data record monthly electricity production in the United States from 1985 to 2018. We will explore the data and, using model fitting and diagnostic techniques, forecast electricity production for the next 10 units (months).

Dataset URL : https://www.kaggle.com/shenba/electricity-production

2. Required Libraries

library(TSA)       # season(), zlag(), eacf(), rstandard() for Arima objects
library(tseries)   # adf.test()
library(lmtest)    # coeftest()
library(forecast)  # Arima(), forecast()
library(FitAR)     # LBQPlot(); archived on CRAN, may need installation from the archive

3. Model Specification

3.1 Reading data

# loading the dataset.
data = read.csv("Electric_Production.csv")
head(data)
##       DATE IPG2211A2N
## 1 1/1/1985    72.5052
## 2 2/1/1985    70.6720
## 3 3/1/1985    62.4502
## 4 4/1/1985    57.4714
## 5 5/1/1985    55.3151
## 6 6/1/1985    58.0904

The dataset contains two columns: the date and the electricity production. We will rename the columns for better understanding and convert the data into a time series.

# changing the column names.
names(data) <- c("date", "electricity_produced")
# extracting the production values as a plain vector (397 monthly observations).
data.ts <- as.vector(data$electricity_produced)

3.2 Transforming to Time Series

# monthly time series from January 1985 to January 2018.
data.ts <- ts(data.ts, start= c(1985,1), end= c(2018,1), frequency= 12)
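A quick sanity check (optional) confirms that the object has the expected length and monthly frequency:

# sanity checks on the time series object
length(data.ts)    # 397 monthly observations
frequency(data.ts) # 12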

3.3 Analysing the data

# creating a plot to analyze the time series for the 5 important characteristics.
plot(data.ts, type= 'l', ylab= 'Electricity Production', main= "Monthly Electricity Production (1985-2018)") 
points(y= data.ts, x= time(data.ts), pch= as.vector(season(data.ts)))

The month labels are plotted on the points to help inspect seasonality. From the above time series plot, we can identify the five important characteristics (a supporting decomposition sketch follows the list):

  1. Trend - an upward trend is visible.

  2. Seasonality - repeating yearly patterns are observed, so seasonality is present.

  3. Changing variance - a slight change in variance is present.

  4. Behaviour - autoregressive behaviour is observed, as successive points follow one another.

  5. Change point - no change point is observed in the series.
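As a supporting check (a sketch, not part of the original modelling), an STL decomposition separates the trend and seasonal components identified above:

# STL decomposition of the monthly series; the trend and seasonal panels
# should reflect the upward trend and yearly pattern noted above.
plot(stl(data.ts, s.window = "periodic"))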

3.4 Correlation

# Investigation of Correlation in the time series with one lag.
lag0 = data.ts
lag1 = zlag(data.ts)
index = 2: length(lag1)
correlation <- cor(lag0[index], lag1[index])
print(correlation)
## [1] 0.8717309

We observe a strong positive correlation of 0.87 between the series and its first lag.

3.5 Analysing scatter plot

# scatter plot for correlation of time series with 1 Z-Lag.
plot(y = data.ts, x = zlag(data.ts), ylab = 'Electricity Production', 
     xlab = 'Previous Month Electricity Production', main= "Fig-1 : Plot to Investigate Correlation with 1 Z-lag")

We can observe an upward-moving trend, and the correlation between electricity production and the previous month's production is clearly visible.

We will now create the ACF and PACF plots for the time series.

3.6 Creating ACF and PACF plots.

par(mfrow=c(2,1), mar = c(5, 4, 4, 2) + 0.1)
acf(data.ts, lag.max= 24, main= "ACF Plot") 
pacf(data.ts, lag.max= 24, main= "PACF Plot")

The repeating pattern in the ACF plot indicates seasonality. In the PACF plot, the high correlation at the first lag indicates the trend present in the data.
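As a quick cross-check (a sketch using the forecast package's unit-root heuristics; not part of the original analysis), we can ask how many seasonal and ordinary differences the series appears to need:

# suggested number of seasonal differences (D) and ordinary differences (d)
nsdiffs(data.ts)
ndiffs(data.ts)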

4. Model Selection

To remove the seasonal component, we will apply first-order seasonal differencing (D = 1) so that the seasonality is reduced. The differenced series can also be previewed directly, as sketched below.
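Seasonal differencing at lag 12 can be inspected with diff() before fitting any model (a quick sketch, not part of the original analysis):

# seasonal difference at lag 12; the yearly pattern should largely disappear
diff12 <- diff(data.ts, lag = 12)
plot(diff12, ylab = 'Seasonally Differenced Series', main = "Seasonal Difference (lag 12)")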

# for the first model, we will use D=1 in the order list for seasonality.
model_1 = arima(data.ts, order= c(0,0,0), seasonal= list(order= c(0,1,0), period= 12))
res.model_1 = residuals(model_1)
par(mfrow=c(3,1), mar = c(5, 4, 4, 2) + 0.1)
plot(res.model_1, xlab= 'Time', ylab= 'Residuals',main= "Model 1 : Time Series Plot of Residuals")
acf(res.model_1, lag.max= 24, main= "Model 1 : ACF of Residuals")
pacf(res.model_1, lag.max= 24, main= "Model 1 : PACF of Residuals")

From the observations of the above plots, we can determine two different candidate values of P:

P = 0, if we treat the pattern in the PACF as decaying, or

P = 3, from the significant seasonal lags in the PACF plot.

From the ACF plot, we can determine Q = 2. We will now use these values to create our models accordingly.

# creating a new model with P=0, Q=2 and D=1. 
model_2 = arima(data.ts, order= c(0,0,0), seasonal= list(order= c(0,1,2), period= 12))
res.model_2 = residuals(model_2)
par(mfrow=c(3,1), mar = c(5, 4, 4, 2) + 0.1)
plot(res.model_2, xlab= 'Time', ylab= 'Residuals', main= "Model 2 : Time Series Plot of Residuals")
acf(res.model_2, lag.max= 24, main= "Model 2 : ACF of residuals")
pacf(res.model_2, lag.max= 24, main= "Model 2 : PACF of Residuals")

With D = 1, P = 0 and Q = 2, the residual plots show that this model fails to capture the significant lags; therefore, we reject it.

# now creating a new model with P=3, Q=2 and D=1.
model_3 = arima(data.ts, order= c(0,0,0), seasonal= list(order= c(3,1,2), period= 12))
res.model_3 = residuals(model_3)
par(mfrow=c(3,1), mar = c(5, 4, 4, 2) + 0.1)
plot(res.model_3, xlab= 'Time', ylab= 'Residuals', main= "Model 3 : Time Series Plot of Residuals")
acf(res.model_3, lag.max= 24, main= "Model 3 : ACF of residuals")
pacf(res.model_3, lag.max= 24, main= "Model 3 : PACF of Residuals")

In the above model, we were able to capture the significant seasonal lags, and the residuals begin to resemble white noise; however, the trend and changing variance still need to be addressed.

5. Model Transformation

Here we apply a log transformation to the original series to stabilise the changing variance, and add ordinary differencing (d = 1) to address the trend.
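As a visual check (a sketch, not part of the original analysis), the log-transformed series should show a more stable seasonal amplitude than the raw series:

# compare raw and log-transformed series; the log scale should damp the
# growth in seasonal amplitude over time
par(mfrow = c(2,1))
plot(data.ts, ylab = 'Raw Series')
plot(log(data.ts), ylab = 'Log-Transformed Series')
par(mfrow = c(1,1))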

model_4 = arima(log(data.ts), order= c(0,1,0), seasonal= list(order= c(3,1,2), period= 12))
res.model_4 = residuals(model_4)
par(mfrow=c(3,1), mar = c(5, 4, 4, 2) + 0.1)
plot(res.model_4, xlab= 'Time', ylab= 'Residuals', main= "Model 4 : Time Series Plot of Residuals")
acf(res.model_4, lag.max= 24, main= "Model 4 : ACF of residuals")
pacf(res.model_4, lag.max= 24, main = "Model 4 : PACF of Residuals")

We observe a decaying pattern in the ACF plot, and in the PACF plot there is a high correlation at the first lag, implying that significant autocorrelation remains to be modelled.

Therefore, we will create a new model on the log-transformed series using p = 4, d = 1 and q = 3.

model_5 = arima(log(data.ts), order= c(4,1,3), seasonal= list(order= c(3,1,2), period= 12))
par(mfrow=c(3,1), mar = c(5, 4, 4, 2) + 0.1)
res.model_5 = residuals(model_5)
plot(res.model_5, xlab= 'Time', ylab= 'Residuals', main= "Model 5 : Time Series Plot of Residuals")
acf(res.model_5, lag.max= 24, main= "Model 5 : ACF of residuals")
pacf(res.model_5, lag.max= 24, main= "Model 5 : PACF of Residuals")

With p = 4, d = 1 and q = 3, the ACF and PACF plots show that the model captures all the significant lags and that no trend remains in the series.

From all the above plots, we observe that model_5 captures the most information and is the best of the five models.

We will now perform an Augmented Dickey-Fuller (ADF) test on the residuals of model_5 to check whether the series is stationary.

# suppress warnings (adf.test warns when the p-value is smaller than the printed value)
options(warn = -1)

adf.test(res.model_5)
## 
##  Augmented Dickey-Fuller Test
## 
## data:  res.model_5
## Dickey-Fuller = -7.8177, Lag order = 7, p-value = 0.01
## alternative hypothesis: stationary

With a p-value of 0.01, the ADF test rejects the null hypothesis of non-stationarity; we can determine that the residual series is stationary.
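As an optional complementary check (a sketch, not part of the original analysis), the KPSS test from the tseries package reverses the hypotheses, with stationarity as the null, so a large p-value here would also support stationarity:

# KPSS test: null hypothesis is (level) stationarity
kpss.test(res.model_5)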

5.1 EACF

Now, we will apply the EACF to the residuals of model_5 to determine the candidate models.

eacf(res.model_5)
## AR/MA
##   0 1 2 3 4 5 6 7 8 9 10 11 12 13
## 0 o o o o o o o o o o o  o  o  o 
## 1 x o o o o o o o o o o  o  o  o 
## 2 o x o o o o o o o o o  o  o  o 
## 3 o x x o o o o o o o o  o  o  o 
## 4 x x x x o o o o o o o  o  o  o 
## 5 x x x x o o o o o o o  o  o  o 
## 6 o o x o o x o o o o o  o  o  o 
## 7 o x x o x x x o o o o  o  o  o

Based on the EACF output, we can determine our candidate models for further analysis.

SARIMA(4,1,3)*(3,1,2)_12

SARIMA(0,1,0)*(3,1,2)_12

SARIMA(0,1,1)*(3,1,2)_12

SARIMA(1,1,1)*(3,1,2)_12

Now, we will perform model fitting and residual analysis on the candidate models to determine the best model for prediction.

First, we perform coefficient testing on the selected models.

model_413 = arima(log(data.ts),order= c(4,1,3), seasonal= list(order= c(3,1,2), period= 12))
coeftest(model_413)
## 
## z test of coefficients:
## 
##        Estimate Std. Error   z value Pr(>|z|)    
## ar1  -1.1845455  0.0032403 -365.5684   <2e-16 ***
## ar2   0.0687750  0.0036054   19.0758   <2e-16 ***
## ar3   0.3237710  0.0075557   42.8511   <2e-16 ***
## ar4   0.0504187  0.0042970   11.7334   <2e-16 ***
## ma1   0.7899160  0.0333949   23.6538   <2e-16 ***
## ma2  -0.8298200  0.0111382  -74.5019   <2e-16 ***
## ma3  -0.6660315  0.0311446  -21.3852   <2e-16 ***
## sar1 -0.2077070  0.0074228  -27.9824   <2e-16 ***
## sar2 -0.2530304  0.0031872  -79.3902   <2e-16 ***
## sar3 -0.1154513         NA        NA       NA    
## sma1 -0.6130960         NA        NA       NA    
## sma2 -0.0867169  0.0624830   -1.3878   0.1652    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

After the test, we obtain nine significant coefficients for model_413, along with two coefficients whose standard errors are returned as NA, which typically indicates numerical problems in estimating the covariance matrix.

model_010 = arima(log(data.ts), order= c(0,1,0), seasonal= list(order= c(3,1,2), period= 12))
coeftest(model_010)
## 
## z test of coefficients:
## 
##       Estimate Std. Error z value Pr(>|z|)   
## sar1 -0.237361         NA      NA       NA   
## sar2 -0.190652   0.064222 -2.9687 0.002991 **
## sar3 -0.062712         NA      NA       NA   
## sma1 -0.553206         NA      NA       NA   
## sma2 -0.190620         NA      NA       NA   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

For model_010, only one significant coefficient is observed, along with a large number of NA standard errors. This suggests the model is not good enough.

model_011 = arima(log(data.ts), order= c(0,1,1), seasonal= list(order= c(3,1,2), period= 12))
coeftest(model_011)
## 
## z test of coefficients:
## 
##       Estimate Std. Error z value  Pr(>|z|)    
## ma1  -0.448606   0.069355 -6.4683 9.913e-11 ***
## sar1  0.404387   0.447169  0.9043  0.365822    
## sar2 -0.261358   0.064010 -4.0831 4.444e-05 ***
## sar3 -0.018447   0.124752 -0.1479  0.882444    
## sma1 -1.183910   0.444520 -2.6633  0.007737 ** 
## sma2  0.374207   0.348250  1.0745  0.282583    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

After the test, we observe three significant coefficients for model_011.

model_111 = arima(log(data.ts), order= c(1,1,1), seasonal= list(order= c(3,1,2), period= 12))
coeftest(model_111)
## 
## z test of coefficients:
## 
##       Estimate Std. Error  z value  Pr(>|z|)    
## ar1   0.502651   0.054785   9.1749 < 2.2e-16 ***
## ma1  -0.925915   0.023949 -38.6624 < 2.2e-16 ***
## sar1  0.400443   0.499051   0.8024 0.4223155    
## sar2 -0.245524   0.067396  -3.6430 0.0002695 ***
## sar3  0.022966   0.135667   0.1693 0.8655766    
## sma1 -1.176519   0.494836  -2.3776 0.0174261 *  
## sma2  0.358718   0.376917   0.9517 0.3412404    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

For model_111, we observe four significant coefficients.

After coefficient testing on the candidate models, we observe that model_413 and model_111 have the highest numbers of significant coefficients.

6. Residual Analysis of Models

For the residual analysis, we will create a custom function that plots residual diagnostics and performs the Shapiro-Wilk and Ljung-Box tests on each candidate model. Reference: StackOverflow (see References).

resAnalysis <- function(modelName, sd = TRUE){
  # use standardised residuals by default, raw residuals otherwise
  if (sd == TRUE){
    res.modelName = rstandard(modelName)
  }
  else{
    res.modelName = residuals(modelName)
  }
  par(mfrow = c(3,2))
  # time series plot of the residuals
  plot(res.modelName, type = 'o', ylab = 'Standardised Residuals', 
       main = "Time Series Plot - Standardised Residuals")
  abline(h = 0)
  # histogram and QQ plot to assess normality
  hist(res.modelName, main = "Histogram - Standardised Residuals")
  qqnorm(res.modelName, main = "QQ Plot - Standardised Residuals")
  qqline(res.modelName, col = 2)
  # ACF to check for remaining autocorrelation
  acf(res.modelName, main = "ACF - Standardised Residuals")
  print(shapiro.test(res.modelName))
  # Ljung-Box statistics across lags (LBQPlot is from the FitAR package)
  k = 0
  LBQPlot(res.modelName, lag.max = length(modelName$residuals) - 1,
          StartLag = k + 1, k = 0, SquaredQ = FALSE)
  print(Box.test(res.modelName, lag = 1, type = "Ljung-Box"))
  par(mfrow = c(1,1))
}
resAnalysis(modelName = model_413)
## 
##  Shapiro-Wilk normality test
## 
## data:  res.modelName
## W = 0.99409, p-value = 0.1272

## 
##  Box-Ljung test
## 
## data:  res.modelName
## X-squared = 0.0010572, df = 1, p-value = 0.9741

From the residual analysis, we observe that model_413 passes both the Shapiro-Wilk test and the Ljung-Box test. The histogram shows that the residuals are approximately normally distributed, and the Q-Q plot supports the normality assumption.

resAnalysis(modelName = model_010)
## 
##  Shapiro-Wilk normality test
## 
## data:  res.modelName
## W = 0.99278, p-value = 0.05282

## 
##  Box-Ljung test
## 
## data:  res.modelName
## X-squared = 19.893, df = 1, p-value = 8.19e-06

From the residual analysis, we observe that model_010 passes the Shapiro-Wilk normality test but fails the Ljung-Box test. Therefore, we will not consider this model.

resAnalysis(modelName = model_011)
## 
##  Shapiro-Wilk normality test
## 
## data:  res.modelName
## W = 0.99495, p-value = 0.2211

## 
##  Box-Ljung test
## 
## data:  res.modelName
## X-squared = 3.5279, df = 1, p-value = 0.06034

From the residual analysis of model_011, we observe that the model passes both the Shapiro-Wilk normality test and the Ljung-Box test. The histogram shows approximately normally distributed residuals, and the Q-Q plot supports the normality assumption.

resAnalysis(modelName = model_111)
## 
##  Shapiro-Wilk normality test
## 
## data:  res.modelName
## W = 0.99456, p-value = 0.1725

## 
##  Box-Ljung test
## 
## data:  res.modelName
## X-squared = 9.2733e-05, df = 1, p-value = 0.9923

From the residual analysis of model_111, we observe that the model passes both the Shapiro-Wilk normality test and the Ljung-Box test, with approximately normally distributed residuals in the histogram and a satisfactory Q-Q plot.

We will now choose the best model between model_413 and model_111, as both satisfy the diagnostic tests and have the highest numbers of significant coefficients.

7. Model Selection using AIC and BIC

We will now use AIC and BIC to select the best model between model_413 and model_111.

We will define a custom function sortScores() to compare the AIC and BIC scores of the selected models.

sortScores <- function(scores, score = c("aic", "bic")){
  score <- match.arg(score)
  # Note: AIC() always labels its score column "AIC", even when k = log(n)
  # turns the values into BIC scores, so we sort on that column in both cases.
  scores[with(scores, order(AIC)), ]
}
scoreAIC = AIC(model_111, model_413)
# BIC is AIC with penalty k = log(n)
scoreBIC = AIC(model_111, model_413, k = log(length(data.ts)))
sortScores(scoreAIC, score = "aic")
##           df       AIC
## model_413 13 -1742.852
## model_111  8 -1742.747
# the scoreBIC column is still labelled "AIC" (see note above), so it sorts the same way
sortScores(scoreBIC, score = "aic")
##           df       AIC
## model_111  8 -1710.876
## model_413 13 -1691.060

We observe that AIC favours model_413 over model_111, while BIC favours model_111 over model_413 (note that the second table reports BIC values under the column name AIC). However, we will select model_413 as the final model for forecasting, since it has the highest number of significant coefficients.
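As a cross-check (a sketch, not part of the original analysis), stats::BIC() applies the log(n) penalty itself; its values may differ slightly from scoreBIC above, since BIC() penalises by the effective number of observations after differencing rather than the full series length used in scoreBIC:

# direct BIC comparison of the two candidates
BIC(model_111, model_413)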

8. Model Overfitting

To check for overfitting, we increase the values of p and q by one, one at a time, in the final selected model.

overfitting_sort = arima(log(data.ts),order=c(4,1,4),seasonal=list(order=c(3,1,2), period=12))
coeftest(overfitting_sort)
## 
## z test of coefficients:
## 
##       Estimate Std. Error z value  Pr(>|z|)    
## ar1   1.119046   0.140548  7.9620 1.692e-15 ***
## ar2  -0.654142   0.225120 -2.9057 0.0036638 ** 
## ar3   1.011625   0.196038  5.1603 2.465e-07 ***
## ar4  -0.480668   0.086483 -5.5579 2.730e-08 ***
## ma1  -1.540400   0.154477 -9.9717 < 2.2e-16 ***
## ma2   0.881814   0.322881  2.7311 0.0063127 ** 
## ma3  -1.045487   0.290938 -3.5935 0.0003263 ***
## ma4   0.705970   0.138638  5.0922 3.540e-07 ***
## sar1 -0.304787         NA      NA        NA    
## sar2 -0.201493   0.037963 -5.3077 1.110e-07 ***
## sar3 -0.101100         NA      NA        NA    
## sma1 -0.433971         NA      NA        NA    
## sma2 -0.252303         NA      NA        NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
overfitting_sort_1 = arima(log(data.ts),order=c(5,1,3),seasonal=list(order=c(3,1,2), period=12), method = "CSS")
coeftest(overfitting_sort_1)
## 
## z test of coefficients:
## 
##        Estimate Std. Error  z value  Pr(>|z|)    
## ar1  -0.5317614  0.5143455  -1.0339  0.301201    
## ar2   0.3558874  0.2180612   1.6321  0.102668    
## ar3   0.0700612  0.2193638   0.3194  0.749436    
## ar4   0.0030268  0.0658601   0.0460  0.963344    
## ar5  -0.1279511  0.0674909  -1.8958  0.057983 .  
## ma1   0.1374075  0.5228661   0.2628  0.792707    
## ma2  -0.8505764  0.0376869 -22.5696 < 2.2e-16 ***
## ma3  -0.1108110  0.4487002  -0.2470  0.804939    
## sar1 -0.2585594  0.1976034  -1.3085  0.190712    
## sar2 -0.2942767  0.0672703  -4.3745 1.217e-05 ***
## sar3 -0.1584560  0.0686226  -2.3091  0.020938 *  
## sma1 -0.5650115  0.1980556  -2.8528  0.004334 ** 
## sma2 -0.0860874  0.1683842  -0.5113  0.609172    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We observe that the overfitted models have fewer significant coefficients than the original model_413. Therefore, we reject these new models and proceed with model_413 as the final model for forecasting.

9. Forecasting

For forecasting, we will use model_413 to predict electricity production for the next 10 months.

forecasting = Arima(log(data.ts),order=c(4,1,3),seasonal=list(order=c(3,1,2), period=12))
prediction1 = forecast(forecasting, h = 10)
plot(prediction1)

The plot shows the forecast of (log-transformed) electricity production for the next 10 months, together with the prediction intervals from our final selected model.
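Since the model was fitted to log(data.ts), the forecasts above are on the log scale. A sketch of an alternative (assuming the same orders) is to pass lambda = 0 to forecast::Arima(), which applies the log transform internally and back-transforms the point forecasts and intervals to the original units:

# lambda = 0 fits on the log scale; forecast() then back-transforms
# the predictions to the original scale automatically
forecasting_bt = Arima(data.ts, order = c(4,1,3),
                       seasonal = list(order = c(3,1,2), period = 12), lambda = 0)
plot(forecast(forecasting_bt, h = 10))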

10. Conclusion

Using the final selected model, we were able to forecast US electricity production for the next 10 months. The forecast depicts an increase in electricity production and reflects the seasonality of the series, with widening prediction intervals showing the growing uncertainty of the forecast.

11. References

  1. Dr. Haydar Demirhan, Time Series Analysis, RMIT University, https://rmit.instructure.com/courses/79716/modules.
  2. StackOverflow (May 10, 2021), "What is the best way to fit multiple residual analysis plots?", https://stackoverflow.com/questions/61912395/what-is-the-best-way-to-fit-multiple-residual-analysis-plots