Intro

Today we learned about decomposing time series that exhibit trend and seasonal effects, and how to use an ARIMA model when a seasonal component is present in a time series.

Decomposing Time Series

For an example of decomposing a time series, we will look at birth data from the state of New York.

births <- scan("http://robjhyndman.com/tsdldata/data/nybirths.dat")  # read the monthly births data
birthstimeseries <- ts(births, frequency=12, start=c(1946,1))        # monthly series starting January 1946

Now that we have our time series data set, we can use the ‘decompose’ function to separate the series into its trend, seasonal, and irregular components. We will then plot the decomposition to see the original data alongside the estimated trend and seasonal components.

TScomp <- decompose(birthstimeseries)
plot(TScomp)

From the four plots we see that there is a positive trend and a seasonal component to the births time series data. We can now get rid of the seasonal component.

noseason <- birthstimeseries - TScomp$seasonal
plot(noseason)

Now that the seasonal component has been subtracted, we see only the trend and irregular components. There is an obvious increasing trend in births as time goes on.
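As a quick sanity check (not part of the original walkthrough), we could quantify this upward trend by regressing the seasonally adjusted series on its time index; the slope estimates the average change in births per year.

# hedged sketch: fit a straight line to the seasonally adjusted series
# time(noseason) returns the time index in years (1946.000, 1946.083, ...)
trendfit <- lm(noseason ~ time(noseason))
coef(trendfit)  # slope = estimated average change in births per year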

Seasonal ARIMA

We now add a layer to the ARIMA model we were introduced to last class, so that we can build ARIMA models when there is an obvious seasonal pattern in the time series data. We learned the parameters p, d, and q last time, but now we also have P, D, and Q: P is the seasonal auto-regressive order, D is the number of seasonal differences, and Q is the seasonal moving average order. The standard notation for seasonal ARIMA is (p,d,q)x(P,D,Q)S, where S indicates the length of the season (e.g. S=12 for monthly data, S=4 for quarterly). Now we will walk through the steps of creating a seasonal ARIMA model and compare it to the model that R creates.
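For reference (this compact form is standard, though we did not write it out in class), a seasonal ARIMA (p,d,q)x(P,D,Q)S model can be written with the backshift operator \(B\) as \[\phi_p(B)\,\Phi_P(B^S)\,(1-B)^d\,(1-B^S)^D\,y_t = \theta_q(B)\,\Theta_Q(B^S)\,\varepsilon_t,\] where \(\phi_p\) and \(\theta_q\) are the non-seasonal AR and MA polynomials and \(\Phi_P\) and \(\Theta_Q\) are their seasonal counterparts. The models we fit below are exactly this form with S=12.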

Step 1

First we will plot the data to see what type of seasonality is present.

library(forecast)
library(quantmod)
library(tseries)
library(timeSeries)
library(xts)
births <- scan("http://robjhyndman.com/tsdldata/data/nybirths.dat")
births <- births[-c(1:24)]  # drop the first two years of observations
births <- ts(births, frequency=12, start=c(1948,1))  # series now starts January 1948
par(mfrow=c(2,1))
plot.ts(births)  # raw monthly series
plot.ts(filter(births, sides=2, filter=rep(1/3,3)))  # centered 3-month moving average

par(mfrow=c(1,1))
(wgts <- c(.5, rep(1,11), .5)/12)  # symmetric weights for a centered 12-month moving average
##  [1] 0.04166667 0.08333333 0.08333333 0.08333333 0.08333333 0.08333333
##  [7] 0.08333333 0.08333333 0.08333333 0.08333333 0.08333333 0.08333333
## [13] 0.04166667
plot.ts(filter(births, sides=2, filter = wgts))

The smoothed plots make the repeating annual pattern in this monthly data clear, so we will need to use seasonal differencing to eliminate that pattern.

Step 2

First we take seasonal differences to remove the seasonal component. For monthly data the seasonal difference is \(y_t-y_{t-12}\). Since we know that this is monthly data, we will use lag=12 to indicate this.

diff1 <- diff(births, lag = 12)
plot.ts(diff1) 

adf.test(diff1)
## 
##  Augmented Dickey-Fuller Test
## 
## data:  diff1
## Dickey-Fuller = -3.4166, Lag order = 5, p-value = 0.05488
## alternative hypothesis: stationary

Looking at the plot, it appears that we have eliminated the seasonal pattern that was present in the original data. The ADF test p-value of about 0.055 is borderline, but close enough to rejecting a unit root that we will treat the differenced series as stationary. The next step would normally be to remove any remaining trend with ordinary (non-seasonal) differencing, but in this case the seasonal differencing appears to have eliminated the positive trend as well, so that step is not necessary. Our difference values are D=1 and d=0.
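As a cross-check (not part of the original walkthrough), the forecast package can estimate the differencing orders directly: nsdiffs() suggests the number of seasonal differences and ndiffs() the number of ordinary differences remaining afterward.

# hedged sketch: automatic estimates of D and d
nsdiffs(births)                 # suggested number of seasonal differences (D)
ndiffs(diff(births, lag = 12))  # ordinary differences still needed afterward (d)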

Step 3

Now that we have our difference values, we will make preliminary choices for p, q, P, and Q by looking at the autocorrelation function (Acf) plot and the partial autocorrelation function (Pacf) plot. First we look at the Acf plot, which helps us choose values for both the seasonal and non-seasonal moving average orders.

Acf(diff1)

For the non-seasonal value, we look for spikes at low lags, meaning lines that extend past the dotted blue significance bound. In our case the first two lags are spikes, with the second one much closer to the blue line than the first, so for now we will say that q>=1. For the seasonal moving average order, we look for spikes at lags that are multiples of our S, which is 12, so we check lags 12 and 24. There is a spike at 12 but not at 24, so we conclude that Q=1. Now we will look at the Pacf plot to choose our auto-regressive orders.

Pacf(diff1)

As with the Acf, for the non-seasonal AR order we look for spikes at low lags. We see one spike, so we will choose p=1. For the seasonal AR order we look at lags that are multiples of S=12. We see one spike at 12 and one just before 24, so we will choose our P value to be >=1.
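The visual reading above can also be checked numerically (a sketch, not shown in class): the dotted blue bound that Acf and Pacf draw is approximately \(\pm 1.96/\sqrt{n}\), so we can list which lags exceed it.

# hedged sketch: which lags exceed the approximate 95% significance bound?
n <- length(diff1)
bound <- qnorm(0.975)/sqrt(n)
which(abs(acf(diff1, lag.max = 24, plot = FALSE)$acf[-1]) > bound)  # ACF spikes
which(abs(pacf(diff1, lag.max = 24, plot = FALSE)$acf) > bound)     # PACF spikes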

Step 4

Now we will set up an ARIMA model with the preliminary values we found in the previous steps. For this first trial we will use (1,0,1)x(1,1,1)12.

mod1 <- arima(births, order = c(1,0,1), seasonal = list(order = c(1,1,1), period = 12))

Step 5

Now we will examine our model and compare it to others. If the model is adequate, the residuals will look like white noise and the Ljung-Box statistic will be small (its p-value large). For the Ljung-Box test, the null hypothesis is that the model is adequate (no remaining autocorrelation in the residuals), and the alternative is that the model is inadequate.

summary(mod1) 
## 
## Call:
## arima(x = births, order = c(1, 0, 1), seasonal = list(order = c(1, 1, 1), period = 12))
## 
## Coefficients:
##          ar1      ma1     sar1     sma1
##       0.9884  -0.1564  -0.2199  -0.8820
## s.e.  0.0287   0.1747   0.1004   0.1498
## 
## sigma^2 estimated as 0.3924:  log likelihood = -136.81,  aic = 283.61
## 
## Training set error measures:
##                      ME      RMSE       MAE       MPE     MAPE      MASE
## Training set 0.06656964 0.5998188 0.4452406 0.2299257 1.727419 0.3675936
##                   ACF1
## Training set 0.0410978
Box.test(mod1$residuals, type = "Ljung")
## 
##  Box-Ljung test
## 
## data:  mod1$residuals
## X-squared = 0.24832, df = 1, p-value = 0.6183

With a p-value of 0.6183, we do not have significant evidence to conclude our model is inadequate, but we will try one more model for comparison. To test whether a particular component is significant, we divide its coefficient by its standard error to get a test statistic, then use the t distribution to find the p-value and decide whether the component is worth including. We will look at ma1, because it appears to have a small test statistic.

teststat <- -.1564/.1747  # ma1 coefficient divided by its standard error
pval <- 2*pt(abs(teststat), lower.tail = FALSE, df=140)  # two-sided p-value
pval
## [1] 0.3721906
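A quicker way to screen every coefficient at once (a normal-approximation variant of the same idea, not shown in class) uses the coefficient covariance matrix stored in the fitted model.

# hedged sketch: z-statistics and two-sided p-values for all coefficients
zstats <- coef(mod1)/sqrt(diag(mod1$var.coef))
round(2*pnorm(abs(zstats), lower.tail = FALSE), 4)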

At a p-value of 0.372, the ma1 component is not worth keeping. Now we will try a model with the values (1,0,0)x(1,1,1)12.

mod2 <- arima(births, order = c(1,0,0), seasonal = list(order = c(1,1,1), period = 12))
Box.test(mod2$residuals, type = "Ljung")  # large p-value: no evidence of residual autocorrelation
## 
##  Box-Ljung test
## 
## data:  mod2$residuals
## X-squared = 0.62922, df = 1, p-value = 0.4276
summary(mod2)
## 
## Call:
## arima(x = births, order = c(1, 0, 0), seasonal = list(order = c(1, 1, 1), period = 12))
## 
## Coefficients:
##          ar1     sar1     sma1
##       0.9750  -0.2263  -0.8595
## s.e.  0.0326   0.0998   0.1280
## 
## sigma^2 estimated as 0.3986:  log likelihood = -137.27,  aic = 282.55
## 
## Training set error measures:
##                      ME      RMSE      MAE     MPE     MAPE      MASE
## Training set 0.07991759 0.6045416 0.450681 0.28466 1.748639 0.3720852
##                     ACF1
## Training set -0.06542041

All of the remaining components appear to be significant to the model, so we will now compare the two models using AIC.

AIC(mod1);AIC(mod2)
## [1] 283.6117
## [1] 282.5484

Our mod2 has a lower AIC by about one, so of the models we created, we conclude that mod2 is the better one. One way to see how we did is to give R the same time series data and let the auto.arima command come up with its own ARIMA model to compare against ours.

automod <- auto.arima(births)
automod 
## Series: births 
## ARIMA(1,0,0)(2,1,1)[12] with drift 
## 
## Coefficients:
##          ar1     sar1     sar2     sma1   drift
##       0.6988  -0.4160  -0.3503  -0.6846  0.0447
## s.e.  0.0631   0.1222   0.1093   0.1281  0.0033
## 
## sigma^2 estimated as 0.3287:  log likelihood=-122.05
## AIC=256.09   AICc=256.77   BIC=273.39

Our mod2 differs from the auto.arima model in the seasonal AR order, where auto.arima chose a value of two instead of one, and in auto.arima's inclusion of a drift term.
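For completeness (an extra check, not in the original notes), the same Ljung-Box diagnostic can be applied to the auto.arima fit to confirm its residuals also look like white noise.

# hedged sketch: residual diagnostic for the automatically selected model
Box.test(residuals(automod), type = "Ljung")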

Forecasting

Now we will use both our model and the auto.arima model to forecast births 12 months ahead.

par(mfrow=c(2,1))
plot(forecast(automod, h = 12), main = "auto.arima")
plot(forecast(mod2, h = 12), main = "home-cooked")

The forecasts look similar, but the auto.arima model has a much narrower interval at the same confidence level, which is pretty convincing evidence that the auto.arima model provides a more precise forecast.
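Narrow intervals alone do not guarantee accuracy, so a more direct comparison (a sketch, not performed in class) would hold out the final 12 months, refit both models on the remaining data, and score their forecasts against the held-out year.

# hedged sketch: out-of-sample comparison on the last year of data
train <- window(births, end = c(1958, 12))
test  <- window(births, start = c(1959, 1))
fit_auto <- auto.arima(train)
fit_ours <- arima(train, order = c(1,0,0),
                  seasonal = list(order = c(1,1,1), period = 12))
accuracy(forecast(fit_auto, h = 12), test)  # test-set rows give out-of-sample error
accuracy(forecast(fit_ours, h = 12), test)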