For seasonal ARIMA, we need to get the seasonal and the non-seasonal piece.
ARIMA (p,d,q)*(P,D,Q)S
(p,d,q) = non-seasonal piece
(P,D,Q) = seasonal piece
S = number of seasons (12 for months or 4 for quarters)
Steps:
plot the data to identifiy seasonality
do differences (D = seasonal, d = non-seasonal)
preliminarily choose p,q,P,Q by looking at ACF (q and Q) and PACF (p and P)
ARIMA (p,d,q)*(P,D,Q)S to estimate the model
look at diagnostic (Box.test()) and compare models with AIC
For this example, I used AirPassengers dataset.
library(forecast)
## Warning: package 'forecast' was built under R version 3.4.4
library(quantmod)
## Warning: package 'quantmod' was built under R version 3.4.4
## Warning: package 'xts' was built under R version 3.4.4
## Warning: package 'zoo' was built under R version 3.4.4
library(tseries)
## Warning: package 'tseries' was built under R version 3.4.4
library(timeSeries)
## Warning: package 'timeSeries' was built under R version 3.4.4
## Warning: package 'timeDate' was built under R version 3.4.4
library(forecast)
library(xts)
data(AirPassengers)
We need to look at the data to see if there is seasonality.
plot.ts(AirPassengers)
We can see that the little bumps in the plot are getting larger and larger. This means that we have increasing seasonal variance. We don’t want this so we need to transform the data until we get constant seasonal variance.
We can try log first.
data = log(AirPassengers)
plot.ts(data)
The little bumps evened out after the transformation. This data is good but it obviously shows seasonality.
Now we will do differences. We can see it looks like there is seasonality and an upward trend. To start off, we will address the seasonality. To do this, we take the differences with lag S (I picked 4):
\[{y_t} - {y_(t-4)}\]
We will plot the differences and check with the Augmented Dickey-Fuller test.
diff1 = diff(data, lag = 4)
plot.ts(diff1)
adf.test(diff1)
## Warning in adf.test(diff1): p-value smaller than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: diff1
## Dickey-Fuller = -8.4949, Lag order = 5, p-value = 0.01
## alternative hypothesis: stationary
The plot looks good and has gotten rid of the upward trend as well. This means that we won’t have to take the differences for the upward trend. The p-value from the Augmented Dickey-Fuller Test is also very small.
From this, we have:
D = 1
d = 0
Autocorrelation Function (ACF) calculates the correlation of points separated by a certain amount of time. From the plot, we can choose q and Q.
Acf(diff1)
We can see that there is a spike (above the blue line) at both 1 and 2. This shows that q should probably be 1 or 2.
We can also see that there is a spike at 4, which is our S. This means that Q should at least be 1.
So, now we have:
D = 1
d = 0
q >= 1
Q >= 1
Partical autocorrelation function (PACF) calculates correlation of observations separated by a certain amount of time but controls for measurements in between. From this plot, we can choose p and P.
Pacf(diff1)
We can see that there is a spike at 1 and 2, even up until 4. This shows that p should be 1 or greater.
We can also see a spike at 4. This means P should be at least 1.
Now, we have:
D = 1
d = 0
q >= 1
Q >= 1
p >= 1
P >= 1
Now, we can put it all together and create a model. It’s similar to last time but this time we have a seasonal piece at the end.
We’ll start with the smallest numbers and add as we see fit.
mod1 <- arima(data, order = c(1,0,1), season = list( order = c( 1,1,1), period=4))
mod1
##
## Call:
## arima(x = data, order = c(1, 0, 1), seasonal = list(order = c(1, 1, 1), period = 4))
##
## Coefficients:
## ar1 ma1 sar1 sma1
## 0.9999 0.2113 -0.4185 -0.9938
## s.e. 0.0010 0.0857 0.0781 0.0245
##
## sigma^2 estimated as 0.008222: log likelihood = 129.73, aic = -249.45
Box.test(mod1$residuals, type = "Ljung")
##
## Box-Ljung test
##
## data: mod1$residuals
## X-squared = 0.013116, df = 1, p-value = 0.9088
The first thing we want to look at from this is the test stats. We can see the test stats from the model. To find the test stats, you take the top row and divide it by the standard error (bottom row).
We don’t want any test stats to be under 2. If it is under 2, we want to drop that right away. If it is between 2 and 3, we’ll want to use the pt command to get the p-value.
We can see that none of the test stats look too small so we don’t have to check any p-values.
The p-value from the Box-Ljung Test shows if the model is adequate. If the p-value is large, it means the model is adequate. If the p-value is small, then it means the model is inadequate.
Our p-value is pretty big so our model is pretty adequate.
Let’s try increasing p by one because earlier we said p had to be at least 1.
mod2 <- arima(data, order = c(2,0,1), season = list( order = c( 1,1,1), period=4))
mod2
##
## Call:
## arima(x = data, order = c(2, 0, 1), seasonal = list(order = c(1, 1, 1), period = 4))
##
## Coefficients:
## ar1 ar2 ma1 sar1 sma1
## 0.4645 0.5354 0.7201 -0.4008 -0.9941
## s.e. 0.4023 0.4023 0.3552 0.0836 0.0225
##
## sigma^2 estimated as 0.008176: log likelihood = 129.87, aic = -247.74
Box.test(mod2$residuals, type = "Ljung")
##
## Box-Ljung test
##
## data: mod2$residuals
## X-squared = 0.1398, df = 1, p-value = 0.7085
The test stat for ar2, the one we added, is less than 2 so we can take it out right away. Also, the Box-Ljung test got smaller so that’s not great either.
This means we should go back to our first model. But we can try to increase q instead.
mod2 <- arima(data, order = c(1,0,2), season = list( order = c( 1,1,1), period=4))
mod2
##
## Call:
## arima(x = data, order = c(1, 0, 2), seasonal = list(order = c(1, 1, 1), period = 4))
##
## Coefficients:
## ar1 ma1 ma2 sar1 sma1
## 0.9999 0.1998 -0.0479 -0.4169 -0.9947
## s.e. 0.0006 0.0889 0.1349 0.0787 0.0200
##
## sigma^2 estimated as 0.008206: log likelihood = 129.79, aic = -247.59
Box.test(mod2$residuals, type = "Ljung")
##
## Box-Ljung test
##
## data: mod2$residuals
## X-squared = 7.081e-05, df = 1, p-value = 0.9933
The test stat for ma2, the one we added, is less than 2. The Box-Ljung p-value did increase but since the test stat is small, we will stick with the simpler model.
We’re going to go back to the first model again. You can do the same for P and Q but they didn’t help too much so I skipped those.
The best model is the first model. Let’s look at the Box.test one more time.
Box.test(mod1$residuals, type = "Ljung")
##
## Box-Ljung test
##
## data: mod1$residuals
## X-squared = 0.013116, df = 1, p-value = 0.9088
It is a large p-value so the model is pretty adequate.
Let’s compare auto.arima with our model to see which one is better.
automod <- auto.arima(data)
automod
## Series: data
## ARIMA(0,1,1)(0,1,1)[12]
##
## Coefficients:
## ma1 sma1
## -0.4018 -0.5569
## s.e. 0.0896 0.0731
##
## sigma^2 estimated as 0.001371: log likelihood=244.7
## AIC=-483.4 AICc=-483.21 BIC=-474.77
AIC(automod); AIC(mod1)
## [1] -483.3991
## [1] -249.4528
As you can see, our model AIC is actually lower.
Now, we can use the models to forecast into the future.
par(mfrow=c(2,1))
plot(forecast(automod, h = 12), main = "auto model")
plot(forecast(mod1, h = 12), main = "our model")
We can see that the interval for our model is larger. Ours also doesn’t follow the trend very well.
We can see that our model had a smaller AIC but the auto model actually forecasts beter.
Last time we learned about auto.arima. But now we can make a model on our own.