Background
In this time series analysis, we will going to take a look data of industrial production of electric and gas utilities in the United States, from the years 1 February 1939 – 1 July 2019. This data measures the real output of all relevant establishments located in the United States, regardless of their ownership, but not those located in U.S. territories. We will name this dataset as utilities.
Exploratory Data Analysis
Data Structures and Summaries
Lets take a look closer to the structure and summaries of utilities.
## 'data.frame': 966 obs. of 2 variables:
## $ DATE : Factor w/ 966 levels "1939-02-01","1939-03-01",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ IPG2211A2N_PCA: num 9.54 30.95 9.25 0 19.16 ...
## DATE IPG2211A2N_PCA
## 1939-02-01: 1 Min. : -89.109
## 1939-03-01: 1 1st Qu.: -35.145
## 1939-04-01: 1 Median : 5.915
## 1939-05-01: 1 Mean : 42.434
## 1939-06-01: 1 3rd Qu.: 63.564
## 1939-07-01: 1 Max. :1214.388
## (Other) :960
## DATE IPG2211A2N_PCA
## 1 1939-02-01 9.54191
## 2 1939-03-01 30.95378
## 3 1939-04-01 9.24768
## 4 1939-05-01 0.00000
## 5 1939-06-01 19.16030
## 6 1939-07-01 0.00000
## DATE IPG2211A2N_PCA
## 961 2019-02-01 -66.51429
## 962 2019-03-01 -49.90490
## 963 2019-04-01 -89.10847
## 964 2019-05-01 53.95068
## 965 2019-06-01 149.60239
## 966 2019-07-01 359.91940
The explanation of the dataset are:
DATE: The date of the data recorded, start from 1 February 1939 – 1 July 2019.
IPG2211A2N_PCA: The number of industrial production of electric and gas utilities.
Check Missing Dates
Let us see wheterher there is some missing months in the utilities.
d <- as.Date(utilities$DATE)
date_range <- seq(min(d), max(d), by = "1 month")
date_range[!date_range %in% d]## Date of length 0
Apparently there is no missing months in the data, all completed. And based on summary check, we also know that there is no NA values. We are good to go.
Plot and Decompose the Data
By looking at the plot, we can see that the utilities dataset seems like don’t have any trend. And even at some points the data are peaked, but it doesn’t consistently occured. Therefore, we well assume it is a additive type.
Because the pattern is multiplicative, we will set our decomposition with additive pattern. There are some interesting finding here:
- Even it seems that the data showing an uptrend pattern, but it doesn’t goes smoothly. It going up and down along the time. Therefore we can say that it has no trend.
- It seems that the dataset have seasonality trend.
- The Random (error) seems don’t have any particular seasonal pattern, therefore we can assume that the decomposition able to catch the seasonality pattern.
Split data train dan test
We will going to split the dataset into two part, the train_utilities will take the older dataset, and the test_utilities will use newer dataset.
Modelling
Exponential Smoothing, with Additive Error, No Trend, and Additive Seasonality
Auto ARIMA with Seasonality
Forecasting the Model
Visualizing the Forecast Result
Exponential Smoothing Forecast
Model Selection
Testing Errors for Exponential Smoothing Model
## ME RMSE MAE MPE MAPE MASE
## Training set 4.828703 69.42198 33.22901 NaN Inf 0.9094243
## Test set -21.645755 121.98837 61.32874 -28.80335 203.0252 1.6784687
## ACF1 Theil's U
## Training set -0.1116085 NA
## Test set -0.3687706 0.1522723
Testing Errors for ARIMA Model
## ME RMSE MAE MPE MAPE MASE
## Training set -0.1283751 64.78098 30.95246 NaN Inf 0.847119
## Test set -13.6283854 101.83847 56.00255 -110.6149 342.5998 1.532699
## ACF1 Theil's U
## Training set 0.000403391 NA
## Test set -0.359575420 0.1066083
Checking AIC
## [1] 14551.02
## [1] 10494.61
Since there are some observations with 0 value, we will going to look at RMSE and MAE value only. And by looking at the Error and AIC results, we will use ARIMA Model on the next analysis because it gives lower errors and AIC than Exponential Smoothing Model.
Fine Tuning
Checking Stationerity
## Warning in adf.test(utilities_ts): p-value smaller than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: utilities_ts
## Dickey-Fuller = -17.861, Lag order = 9, p-value = 0.01
## alternative hypothesis: stationary
As we can see, based on the stationerity test, the p-value < alpha, therefore we can conclude the dataset is stationer.
Display PACF
As we can see, there are some line that outside the blue line, Lag 1 and Lag 12 (or Seasonal 1). Let us try to modify the p,d,q value.
arima1 <- arima(train_utilities, order = c(1,0,0), seasonal = c(1,0,0))
ggtsdisplay(arima1$residuals) Still found some lags that out of the blue line. Let’s do some fine tuning more.
arima2 <- arima(train_utilities, order = c(7,0,2), seasonal = c(2,1,2))
ggtsdisplay(arima2$residuals)Check the AIC Value
## [1] 10494.61
## [1] 10950.52
## [1] 10495.67
Check the Error Value
## ME RMSE MAE MPE MAPE MASE
## Training set -0.1283751 64.78098 30.95246 NaN Inf 0.847119
## Test set -13.6283854 101.83847 56.00255 -110.6149 342.5998 1.532699
## ACF1 Theil's U
## Training set 0.000403391 NA
## Test set -0.359575420 0.1066083
## ME RMSE MAE MPE MAPE MASE
## Training set 4.115695 64.38707 30.60154 NaN Inf 0.8375149
## Test set -11.502577 97.34349 63.44205 -522.8775 788.9096 1.7363068
## ACF1 Theil's U
## Training set -0.003987652 NA
## Test set -0.399787759 0.1377803
The AIC value of arima2 model is not better than arima_utilities, and also the Errors. Therefore, we will still use the arima_utilites model for forecasting.
Key Assumptions
- Normality Test
##
## Shapiro-Wilk normality test
##
## data: arima_utilities$residuals
## W = 0.62787, p-value < 0.00000000000000022
Unfortunately, the normality test showing result that the residuals is not following normal distribution.
- Non-autocorrelation test
##
## Box-Pierce test
##
## data: arima_utilities$residuals
## X-squared = 0.0001541, df = 1, p-value = 0.9901
Sadly, the Non auto correlation test giving result that the residuals are autocorrelated. But we will continue to use the model since this is the best what we can get for now.
Plot the final forecast result
arima.fin <- auto.arima(utilities_ts, seasonal = T)
for_fin <- forecast(arima.fin, h = 12)
plot_forecast(for_fin,
title = "12 Months Forecast",
Ytitle = "Electric & Gas Utilities in US", Xtitle = "Year")This graphic shown the forecast for the next 19 months. From the graphic we can see the forecasted utilities not giving so much different than the 2019. The model seems quite having difficulties when dealing with extreme peaks, which may cause it not so accurate.