Predicting Electric and Gas Utilities in US

Arya Andhika

September 11, 2019

Background

In this time series analysis, we will going to take a look data of industrial production of electric and gas utilities in the United States, from the years 1 February 1939 – 1 July 2019. This data measures the real output of all relevant establishments located in the United States, regardless of their ownership, but not those located in U.S. territories. We will name this dataset as utilities.

Exploratory Data Analysis

Data Structures and Summaries

Lets take a look closer to the structure and summaries of utilities.

## 'data.frame':    966 obs. of  2 variables:
##  $ DATE          : Factor w/ 966 levels "1939-02-01","1939-03-01",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ IPG2211A2N_PCA: num  9.54 30.95 9.25 0 19.16 ...
##          DATE     IPG2211A2N_PCA    
##  1939-02-01:  1   Min.   : -89.109  
##  1939-03-01:  1   1st Qu.: -35.145  
##  1939-04-01:  1   Median :   5.915  
##  1939-05-01:  1   Mean   :  42.434  
##  1939-06-01:  1   3rd Qu.:  63.564  
##  1939-07-01:  1   Max.   :1214.388  
##  (Other)   :960
##         DATE IPG2211A2N_PCA
## 1 1939-02-01        9.54191
## 2 1939-03-01       30.95378
## 3 1939-04-01        9.24768
## 4 1939-05-01        0.00000
## 5 1939-06-01       19.16030
## 6 1939-07-01        0.00000
##           DATE IPG2211A2N_PCA
## 961 2019-02-01      -66.51429
## 962 2019-03-01      -49.90490
## 963 2019-04-01      -89.10847
## 964 2019-05-01       53.95068
## 965 2019-06-01      149.60239
## 966 2019-07-01      359.91940

The explanation of the dataset are:
DATE: The date of the data recorded, start from 1 February 1939 – 1 July 2019.
IPG2211A2N_PCA: The number of industrial production of electric and gas utilities.

Check Missing Dates

Let us see wheterher there is some missing months in the utilities.

## Date of length 0

Apparently there is no missing months in the data, all completed. And based on summary check, we also know that there is no NA values. We are good to go.

Plot and Decompose the Data

By looking at the plot, we can see that the utilities dataset seems like don’t have any trend. And even at some points the data are peaked, but it doesn’t consistently occured. Therefore, we well assume it is a additive type.

Because the pattern is multiplicative, we will set our decomposition with additive pattern. There are some interesting finding here:
- Even it seems that the data showing an uptrend pattern, but it doesn’t goes smoothly. It going up and down along the time. Therefore we can say that it has no trend.
- It seems that the dataset have seasonality trend.
- The Random (error) seems don’t have any particular seasonal pattern, therefore we can assume that the decomposition able to catch the seasonality pattern.

Split data train dan test

We will going to split the dataset into two part, the train_utilities will take the older dataset, and the test_utilities will use newer dataset.

Modelling

Exponential Smoothing, with Additive Error, No Trend, and Additive Seasonality

Visualizing the Forecast Result

Exponential Smoothing Forecast

Auto ARIMA Forecast

Model Selection

Testing Errors for Exponential Smoothing Model

##                      ME      RMSE      MAE       MPE     MAPE      MASE
## Training set   4.828703  69.42198 33.22901       NaN      Inf 0.9094243
## Test set     -21.645755 121.98837 61.32874 -28.80335 203.0252 1.6784687
##                    ACF1 Theil's U
## Training set -0.1116085        NA
## Test set     -0.3687706 0.1522723

Testing Errors for ARIMA Model

##                       ME      RMSE      MAE       MPE     MAPE     MASE
## Training set  -0.1283751  64.78098 30.95246       NaN      Inf 0.847119
## Test set     -13.6283854 101.83847 56.00255 -110.6149 342.5998 1.532699
##                      ACF1 Theil's U
## Training set  0.000403391        NA
## Test set     -0.359575420 0.1066083

Checking AIC

## [1] 14551.02
## [1] 10494.61

Since there are some observations with 0 value, we will going to look at RMSE and MAE value only. And by looking at the Error and AIC results, we will use ARIMA Model on the next analysis because it gives lower errors and AIC than Exponential Smoothing Model.

Fine Tuning

Checking Stationerity

## Warning in adf.test(utilities_ts): p-value smaller than printed p-value
## 
##  Augmented Dickey-Fuller Test
## 
## data:  utilities_ts
## Dickey-Fuller = -17.861, Lag order = 9, p-value = 0.01
## alternative hypothesis: stationary

As we can see, based on the stationerity test, the p-value < alpha, therefore we can conclude the dataset is stationer.

Display PACF

As we can see, there are some line that outside the blue line, Lag 1 and Lag 12 (or Seasonal 1). Let us try to modify the p,d,q value.

Still found some lags that out of the blue line. Let’s do some fine tuning more.

Check the AIC Value

## [1] 10494.61
## [1] 10950.52
## [1] 10495.67

Check the Error Value

##                       ME      RMSE      MAE       MPE     MAPE     MASE
## Training set  -0.1283751  64.78098 30.95246       NaN      Inf 0.847119
## Test set     -13.6283854 101.83847 56.00255 -110.6149 342.5998 1.532699
##                      ACF1 Theil's U
## Training set  0.000403391        NA
## Test set     -0.359575420 0.1066083
##                      ME     RMSE      MAE       MPE     MAPE      MASE
## Training set   4.115695 64.38707 30.60154       NaN      Inf 0.8375149
## Test set     -11.502577 97.34349 63.44205 -522.8775 788.9096 1.7363068
##                      ACF1 Theil's U
## Training set -0.003987652        NA
## Test set     -0.399787759 0.1377803

The AIC value of arima2 model is not better than arima_utilities, and also the Errors. Therefore, we will still use the arima_utilites model for forecasting.

Key Assumptions

  1. Normality Test
## 
##  Shapiro-Wilk normality test
## 
## data:  arima_utilities$residuals
## W = 0.62787, p-value < 0.00000000000000022

Unfortunately, the normality test showing result that the residuals is not following normal distribution.

  1. Non-autocorrelation test
## 
##  Box-Pierce test
## 
## data:  arima_utilities$residuals
## X-squared = 0.0001541, df = 1, p-value = 0.9901

Sadly, the Non auto correlation test giving result that the residuals are autocorrelated. But we will continue to use the model since this is the best what we can get for now.

Plot the final forecast result

This graphic shown the forecast for the next 19 months. From the graphic we can see the forecasted utilities not giving so much different than the 2019. The model seems quite having difficulties when dealing with extreme peaks, which may cause it not so accurate.