Background

In this time series analysis, we will going to take a look data of industrial production of electric and gas utilities in the United States, from the years 1 February 1939 – 1 July 2019. This data measures the real output of all relevant establishments located in the United States, regardless of their ownership, but not those located in U.S. territories. We will name this dataset as utilities.

utilities <- read.csv("IPG2211A2N.csv")

Exploratory Data Analysis

Data Structures and Summaries

Lets take a look closer to the structure and summaries of utilities.

str(utilities)

## 'data.frame':    966 obs. of  2 variables:
##  $ DATE          : Factor w/ 966 levels "1939-02-01","1939-03-01",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ IPG2211A2N_PCA: num  9.54 30.95 9.25 0 19.16 ...

summary(utilities)

##          DATE     IPG2211A2N_PCA    
##  1939-02-01:  1   Min.   : -89.109  
##  1939-03-01:  1   1st Qu.: -35.145  
##  1939-04-01:  1   Median :   5.915  
##  1939-05-01:  1   Mean   :  42.434  
##  1939-06-01:  1   3rd Qu.:  63.564  
##  1939-07-01:  1   Max.   :1214.388  
##  (Other)   :960

head(utilities)

##         DATE IPG2211A2N_PCA
## 1 1939-02-01        9.54191
## 2 1939-03-01       30.95378
## 3 1939-04-01        9.24768
## 4 1939-05-01        0.00000
## 5 1939-06-01       19.16030
## 6 1939-07-01        0.00000

tail(utilities)

##           DATE IPG2211A2N_PCA
## 961 2019-02-01      -66.51429
## 962 2019-03-01      -49.90490
## 963 2019-04-01      -89.10847
## 964 2019-05-01       53.95068
## 965 2019-06-01      149.60239
## 966 2019-07-01      359.91940

The explanation of the dataset are:
DATE: The date of the data recorded, start from 1 February 1939 – 1 July 2019.
IPG2211A2N_PCA: The number of industrial production of electric and gas utilities.

Check Missing Dates

Let us see wheterher there is some missing months in the utilities.

d <- as.Date(utilities$DATE)
date_range <- seq(min(d), max(d), by = "1 month") 
date_range[!date_range %in% d]

## Date of length 0

Apparently there is no missing months in the data, all completed. And based on summary check, we also know that there is no NA values. We are good to go.

Create ts object

utilities_ts <- ts(utilities$IPG2211A2N_PCA, start = c(1939,2), frequency = 12)

Plot and Decompose the Data

ts_plot(utilities_ts, 
        title = "Electric and Gas Utilities in US",
        color = "firebrick",
        slider = T)

By looking at the plot, we can see that the utilities dataset seems like don’t have any trend. And even at some points the data are peaked, but it doesn’t consistently occured. Therefore, we well assume it is a additive type.

utilities_ts %>% 
                ts_decompose()

Because the pattern is multiplicative, we will set our decomposition with additive pattern. There are some interesting finding here:
- Even it seems that the data showing an uptrend pattern, but it doesn’t goes smoothly. It going up and down along the time. Therefore we can say that it has no trend.
- It seems that the dataset have seasonality trend.
- The Random (error) seems don’t have any particular seasonal pattern, therefore we can assume that the decomposition able to catch the seasonality pattern.

Split data train dan test

We will going to split the dataset into two part, the train_utilities will take the older dataset, and the test_utilities will use newer dataset.

train_utilities <- utilities_ts %>% window(start= c(1939,2), end = c(2017,12))
test_utilities <- utilities_ts %>% window(start= c(2018,1), end = c(2019,7))

Modelling

Exponential Smoothing, with Additive Error, No Trend, and Additive Seasonality

ets_utilities <- ets(train_utilities, model = "ANA")

Auto ARIMA with Seasonality

arima_utilities <- auto.arima(train_utilities, seasonal = T)

Forecasting the Model

for_ets <- forecast(ets_utilities, h = 19)

for_arima <- forecast(arima_utilities, h = 19)

Visualizing the Forecast Result

Exponential Smoothing Forecast

test_forecast(actual = utilities_ts, 
              forecast.obj = for_ets,
              train = train_utilities, 
              test = test_utilities)

Auto ARIMA Forecast

test_forecast(actual = utilities_ts,
              forecast.obj = for_arima,
              train = train_utilities,
              test = test_utilities)

Model Selection

Testing Errors for Exponential Smoothing Model

accuracy(for_ets, test_utilities)

##                      ME      RMSE      MAE       MPE     MAPE      MASE
## Training set   4.828703  69.42198 33.22901       NaN      Inf 0.9094243
## Test set     -21.645755 121.98837 61.32874 -28.80335 203.0252 1.6784687
##                    ACF1 Theil's U
## Training set -0.1116085        NA
## Test set     -0.3687706 0.1522723

Testing Errors for ARIMA Model

accuracy(for_arima, test_utilities)

##                       ME      RMSE      MAE       MPE     MAPE     MASE
## Training set  -0.1283751  64.78098 30.95246       NaN      Inf 0.847119
## Test set     -13.6283854 101.83847 56.00255 -110.6149 342.5998 1.532699
##                      ACF1 Theil's U
## Training set  0.000403391        NA
## Test set     -0.359575420 0.1066083

Checking AIC

ets_utilities$aic

## [1] 14551.02

arima_utilities$aic

## [1] 10494.61

Since there are some observations with 0 value, we will going to look at RMSE and MAE value only. And by looking at the Error and AIC results, we will use ARIMA Model on the next analysis because it gives lower errors and AIC than Exponential Smoothing Model.

Fine Tuning

Checking Stationerity

adf.test(utilities_ts)

## Warning in adf.test(utilities_ts): p-value smaller than printed p-value

## 
##  Augmented Dickey-Fuller Test
## 
## data:  utilities_ts
## Dickey-Fuller = -17.861, Lag order = 9, p-value = 0.01
## alternative hypothesis: stationary

As we can see, based on the stationerity test, the p-value < alpha, therefore we can conclude the dataset is stationer.

Display PACF

ggtsdisplay(utilities_ts)

As we can see, there are some line that outside the blue line, Lag 1 and Lag 12 (or Seasonal 1). Let us try to modify the p,d,q value.

arima1 <- arima(train_utilities, order = c(1,0,0), seasonal = c(1,0,0))
ggtsdisplay(arima1$residuals)

Still found some lags that out of the blue line. Let’s do some fine tuning more.

arima2 <- arima(train_utilities, order = c(7,0,2), seasonal = c(2,1,2))
ggtsdisplay(arima2$residuals)

Check the AIC Value

arima_utilities$aic

## [1] 10494.61

arima1$aic

## [1] 10950.52

arima2$aic

## [1] 10495.67

Check the Error Value

for_arima2 <- forecast(arima2, h = 19)
accuracy(for_arima, test_utilities)

##                       ME      RMSE      MAE       MPE     MAPE     MASE
## Training set  -0.1283751  64.78098 30.95246       NaN      Inf 0.847119
## Test set     -13.6283854 101.83847 56.00255 -110.6149 342.5998 1.532699
##                      ACF1 Theil's U
## Training set  0.000403391        NA
## Test set     -0.359575420 0.1066083

accuracy(for_arima2, test_utilities)

##                      ME     RMSE      MAE       MPE     MAPE      MASE
## Training set   4.115695 64.38707 30.60154       NaN      Inf 0.8375149
## Test set     -11.502577 97.34349 63.44205 -522.8775 788.9096 1.7363068
##                      ACF1 Theil's U
## Training set -0.003987652        NA
## Test set     -0.399787759 0.1377803

The AIC value of arima2 model is not better than arima_utilities, and also the Errors. Therefore, we will still use the arima_utilites model for forecasting.

Key Assumptions

Normality Test

shapiro.test(arima_utilities$residuals)

## 
##  Shapiro-Wilk normality test
## 
## data:  arima_utilities$residuals
## W = 0.62787, p-value < 0.00000000000000022

Unfortunately, the normality test showing result that the residuals is not following normal distribution.

Non-autocorrelation test

Box.test(arima_utilities$residuals)

## 
##  Box-Pierce test
## 
## data:  arima_utilities$residuals
## X-squared = 0.0001541, df = 1, p-value = 0.9901

Sadly, the Non auto correlation test giving result that the residuals are autocorrelated. But we will continue to use the model since this is the best what we can get for now.

Plot the final forecast result

arima.fin <- auto.arima(utilities_ts, seasonal = T)
for_fin <- forecast(arima.fin, h = 12)

plot_forecast(for_fin,
              title = "12 Months Forecast",
              Ytitle = "Electric & Gas Utilities in US", Xtitle = "Year")

This graphic shown the forecast for the next 19 months. From the graphic we can see the forecasted utilities not giving so much different than the 2019. The model seems quite having difficulties when dealing with extreme peaks, which may cause it not so accurate.

Predicting Electric and Gas Utilities in US

Arya Andhika

September 11, 2019