Learning Log 23: Seasonal ARIMA Models

Today’s learning log will review constructing seasonal ARIMA models and hypothesis testing for autoregressive and moving average terms as well as examing diagnostic plots.

library(forecast)
## Warning: package 'forecast' was built under R version 3.3.3
library(quantmod)
## Warning: package 'quantmod' was built under R version 3.3.3
## Warning: package 'xts' was built under R version 3.3.3
## Warning: package 'zoo' was built under R version 3.3.3
## Warning: package 'TTR' was built under R version 3.3.3
library(tseries)
## Warning: package 'tseries' was built under R version 3.3.3
library(timeSeries)
## Warning: package 'timeSeries' was built under R version 3.3.3
## Warning: package 'timeDate' was built under R version 3.3.3
library(forecast)
library(xts)

We will be using the gas data set which is a time series of Australian monthly gas production from 1956 to 1995. We’ll start with some quick EDA to familarize ourself with the data and see if we need to make any transformations on the data.

plot.ts(gas)

Since we have heteroscedasticity, we will want to perform a transformation on the data before creating the ARIMA model.

Step 0: Transform the Data

We’ll do a log tranformation to level our bounces in the later data.

gasT <- log(gas)
plot.ts(gasT)

Now our variation in the gas production over the 50 years has been transformed.

Step 1: Plot to See Seasonality

As we construct our ARIMA seasonal model, we will first plot the data. Since our data is monthly, we look at our seasonality in terms of months.

par(mfrow=c(1,1))
(wgts <- c(.5, rep(1,11), .5)/12)
##  [1] 0.04166667 0.08333333 0.08333333 0.08333333 0.08333333 0.08333333
##  [7] 0.08333333 0.08333333 0.08333333 0.08333333 0.08333333 0.08333333
## [13] 0.04166667
plot.ts(filter(gasT, sides=2, filter = wgts))

Now it looks like we’ve removed most of the seasonality, but they could be some smaller seasonality bumps, we we’ll try adding a first order difference to remove the seasonality.

Step 2: Data Differences (d,D)

Since we see trend, but no seasonality, we will have S=1 for our lag. d=1, D=0

diffG1 <- diff(gasT, lag=6)
plot.ts(diffG1)

adf.test(diffG1)
## Warning in adf.test(diffG1): p-value smaller than printed p-value
## 
##  Augmented Dickey-Fuller Test
## 
## data:  diffG1
## Dickey-Fuller = -4.0655, Lag order = 7, p-value = 0.01
## alternative hypothesis: stationary

Since our ADF test concluded stationary, we will proceed with selection our autoregressive and moving average terms.

Step 3: Preliminary AR(p,P) and MA(q,Q) Terms

ACF

The Acf shows us highly correlated lag values that we can account for using our MA() q terms. We’ll start by looking at the values right after 0 to assess the how many q terms to add

Acf(diffG1)

It looks like we have spikes at 1 and 2 so q >= 1 We also see this pattern holding at our lag=6 values, so we will have Q >= 1 as well

PACF

The Pacf shows us highly correlated values that are not right next to each other and controls for the correlation between values. This also us to assess our AR() pth order

Pacf(diffG1)

Again, we see spikes in the first seven values, so we will look at p >= 1 When we look at our lag=6 terms, we also see spikes, so we will test P >= 1 as well

p >= 1, q >= 1 / P >= 1 , Q >= 1 / S=6

Step 4: Check Model Using ARIMA

ModARIMA <- arima(gasT, order = c(1,1,1), season=list(order = c(1,0,1), period=6))
ModARIMA
## 
## Call:
## arima(x = gasT, order = c(1, 1, 1), seasonal = list(order = c(1, 0, 1), period = 6))
## 
## Coefficients:
##           ar1      ma1     sar1    sma1
##       -0.0689  -0.1945  -0.9992  0.9242
## s.e.   0.2877   0.2900   0.0008  0.0285
## 
## sigma^2 estimated as 0.002941:  log likelihood = 700.78,  aic = -1391.55

Here we see that AR(1) and MA(1) are not significant, so we will remove them from the model. We will also explore additional season P and Q terms.

We’ll try adding an addition seasonal MA term and an additional AR term now.

ModARIMA2 <-arima(gasT, order = c(0,1,0), season=list(order = c(2,0,2), period=6))
ModARIMA2
## 
## Call:
## arima(x = gasT, order = c(0, 1, 0), seasonal = list(order = c(2, 0, 2), period = 6))
## 
## Coefficients:
##          sar1    sar2     sma1     sma2
##       -0.0024  0.9947  -0.0503  -0.8786
## s.e.   0.0057  0.0058   0.0408   0.0425
## 
## sigma^2 estimated as 0.002747:  log likelihood = 714.93,  aic = -1419.87

Adding the additional seasonal terms decreased the model’s AIC by twenty points which is significant to us, so we will keep the additional terms for our final ARIMA model.

Step 5: Compare Diagnostic Plots Using Box.test()

Box.test(x, lag = 1, type = c(“Box-Pierce”, “Ljung-Box”), fitdf = 0)

Box.test(ModARIMA2$residuals, lag =6, type = c("Ljung-Box"))
## 
##  Box-Ljung test
## 
## data:  ModARIMA2$residuals
## X-squared = 44.592, df = 6, p-value = 5.639e-08

Since our Box-Ljung test produced a very small p-value, we believe that our model is inadequate and that we should do further testing.

As a cheat code, we will see what auto.arima() used in constructing its best model.

Mod4 <- auto.arima(gasT)
Mod4
## Series: gasT 
## ARIMA(1,1,2)(1,0,0)[12] 
## 
## Coefficients:
##          ar1      ma1     ma2    sar1
##       0.9023  -1.2318  0.2584  0.8641
## s.e.  0.0513   0.0812  0.0662  0.0264
## 
## sigma^2 estimated as 0.003519:  log likelihood=661.9
## AIC=-1313.81   AICc=-1313.68   BIC=-1292.99

Auto-ARIMA used a different lag term, which makes it hard for us to directly compare models.

Forecasting

We’ll use our model vs the auto.arima model to predict the gas production in 1996.

plot(forecast(Mod4, h = 12), main = "auto.arima")

plot(forecast(ModARIMA2, h = 12), main = "Jimmy Model")

We can see that the models are relatively similar with some slight differences in the final months of 1996.