The concept we discussed in class today was building an ARIMA model with a seasonal component. In general, the only difference between fitting an ARIMA model with and without a seasonal component is that the seasonal ARIMA model includes autoregressive, differencing and moving average parameters for the seasonal aspect of the time series.
Generally, we would use this technique if we found a seasonal pattern in addition to the overall trend in our data set. To illustrate this technique and the various functions needed to do it in R, I will use an example.
The data set I will use for this example is the \(\texttt{AirPassengers}\) data set. Let’s first plot the data to determine if we have seasonality.
plot(AirPassengers)
We do seem to have a seasonal component in our data. We can decompose the data into its different components by using the function \(\texttt{decompose(dataset)}\). For the decomposition to work well, the seasonal fluctuations should be roughly constant in size, neither increasing nor decreasing over time. We see the seasonal swings growing along with the level of our series, so we'll transform it using a logarithmic transformation before we do the decomposition. Once we get our decomposed values, we will plot them.
decomp <- decompose(log(AirPassengers))
plot(decomp)
Our decomposition shows an overall increasing trend with a strong seasonal component, so we'll need to account for that in our ARIMA model. The order we will use for our ARIMA model is \((p, d, q)\times(P, D, Q)_S\), where p, d, q are for the non-seasonal aspects and P, D, Q are for the seasonal aspects. S is the length of the seasonal period; for monthly data like ours, S = 12.
The very first step in creating our ARIMA model is to determine our order of differencing. To do this, we'll use the \(\texttt{diff(dataset, lag)}\) function. \(\texttt{lag}\) indicates how far back in the series we go to take the difference, so a lag of 12, which we will look at first for our model since we have monthly data, takes the difference between \(y_t\) and \(y_{t-12}\). We can plot this using \(\texttt{plot(differences)}\) and determine if our data is stationary after this differencing using \(\texttt{adf.test(differences)}\) from the \(\texttt{tseries}\) package.
library(tseries)  # provides adf.test
diffseas <- diff(log(AirPassengers), lag = 12)
plot(diffseas)
adf.test(diffseas)
##
## Augmented Dickey-Fuller Test
##
## data: diffseas
## Dickey-Fuller = -3.1899, Lag order = 5, p-value = 0.09265
## alternative hypothesis: stationary
The p-value of our ADF test is below 0.10 and the plot of the differenced data looks relatively stationary, neither increasing nor decreasing, so we can say that the data is likely stationary enough. This means we do not have to do any non-seasonal differencing, so d = 0 and D = 1 (the one seasonal difference we already took).
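As an optional cross-check (an aside, not part of the class workflow), the \(\texttt{forecast}\) package can estimate the differencing orders directly; its suggestions can be compared with the choice of d = 0 and D = 1 made above.
forecast::nsdiffs(log(AirPassengers))                  # estimated number of seasonal differences D
forecast::ndiffs(diff(log(AirPassengers), lag = 12))   # estimated non-seasonal differences d after seasonal differencing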
To determine the order of the moving average and autoregressive portions of our seasonal and non-seasonal components for our preliminary model, we need to look at ACF and PACF plots of our differenced values. We do this using the \(\texttt{Acf(differences)}\) and \(\texttt{Pacf(differences)}\) commands from the \(\texttt{forecast}\) package. The ACF plot will allow us to determine q and Q, while the PACF plot will allow us to determine p and P.
library(forecast)  # provides Acf and Pacf
Acf(diffseas)
Our spikes are fairly high for the first 4 lags, but remain above the blue line until about the 8th lag. So we’ll try a q greater than or equal to 4. We also see a single spike at 12 so we will require a Q of 1. Next, we’ll look at the PACF plot.
Pacf(diffseas)
Our spike is high for the first lag, but still above the blue line for the second, so we’ll need a p of at least 1. Then, we don’t really see a spike around 12, so we will have P equal 0.
Our first model will be an ARIMA\((1, 0, 4)\times(0, 1, 1)_{12}\). We can fit this using \(\texttt{arima(dataset, order, seasonal = list(order, period))}\), where the first order is our (p, d, q), the second order is our (P, D, Q) and period is our S.
mod1 <- arima(log(AirPassengers), order = c(1, 0, 4), seasonal = list(order = c(0, 1, 1), period = 12))
summary(mod1)
##
## Call:
## arima(x = log(AirPassengers), order = c(1, 0, 4), seasonal = list(order = c(0,
## 1, 1), period = 12))
##
## Coefficients:
## ar1 ma1 ma2 ma3 ma4 sma1
## 0.9981 -0.3776 0.0565 -0.1641 -0.0615 -0.5815
## s.e. 0.0029 0.0933 0.1022 0.1013 0.0976 0.0803
##
## sigma^2 estimated as 0.001302: log likelihood = 247.39, aic = -480.77
##
## Training set error measures:
## ME RMSE MAE MPE MAPE
## Training set 0.002452187 0.03457656 0.02584615 0.04772145 0.4685735
## MASE ACF1
## Training set 0.285323 -0.00072597
Now, we need to look at the significance of each of our parameters. We'll need to run a t-test for each parameter. Our test statistic is \(\dfrac{\text{parameter}}{SE(\text{parameter})}\), and then we can use the command \(\texttt{2*pt(abs(test.statistic), df, lower.tail=FALSE)}\). Our degrees of freedom are 138 (\(n - \text{parameters} = 144 - 6 = 138\)).
teststats <- coef(mod1)/sqrt(diag(vcov(mod1)))
2*pt(abs(teststats), df=138, lower.tail=FALSE)
## ar1 ma1 ma2 ma3 ma4
## 2.446512e-205 8.629884e-05 5.811566e-01 1.075862e-01 5.293569e-01
## sma1
## 2.854064e-11
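As an aside (not something we did in class), the \(\texttt{coeftest}\) function from the \(\texttt{lmtest}\) package performs essentially the same calculation in one call; it uses a normal approximation instead of the t distribution, so its p-values will differ slightly from the ones above.
library(lmtest)  # provides coeftest
coeftest(mod1)   # z-tests built from coef(mod1) and vcov(mod1)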
Our ma4 parameter has a fairly high p-value, so let’s try removing this parameter from the model and check the significance of the parameters of our new model.
mod2 <- arima(log(AirPassengers), order = c(1, 0, 3), seasonal = list(order = c(0, 1, 1), period = 12))
teststats <- coef(mod2)/sqrt(diag(vcov(mod2)))
2*pt(abs(teststats), df=139, lower.tail=FALSE)
## ar1 ma1 ma2 ma3 sma1
## 1.860866e-195 6.004233e-06 5.588079e-01 9.455535e-02 1.097780e-11
Our ma3 parameter still has a fairly high p-value, so let’s remove it from our model and check the parameter significance again.
mod3 <- arima(log(AirPassengers), order = c(1, 0, 2), seasonal = list(order = c(0, 1, 1), period = 12))
teststats <- coef(mod3)/sqrt(diag(vcov(mod3)))
2*pt(abs(teststats), df=140, lower.tail=FALSE)
## ar1 ma1 ma2 sma1
## 6.438845e-179 1.029967e-05 6.655677e-01 7.113644e-12
Our ma2 parameter's p-value is still fairly high. Let's remove it and check the significance of the model parameters again.
mod4 <- arima(log(AirPassengers), order = c(1, 0, 1), seasonal = list(order = c(0, 1, 1), period = 12))
teststats <- coef(mod4)/sqrt(diag(vcov(mod4)))
2*pt(abs(teststats), df=141, lower.tail=FALSE)
## ar1 ma1 sma1
## 5.239338e-175 1.763166e-05 8.069311e-12
Now, all of our parameters are significant. We saw in our PACF plot that p might also be 2, so let's try changing this order and see if it improves the model.
mod5 <- arima(log(AirPassengers), order = c(2, 0, 1), seasonal = list(order = c(0, 1, 1), period = 12))
teststats <- coef(mod5)/sqrt(diag(vcov(mod5)))
2*pt(abs(teststats), df=140, lower.tail=FALSE)
## ar1 ar2 ma1 sma1
## 5.748015e-06 4.565041e-01 9.799382e-03 1.031412e-11
Our ar2 parameter is not significant, so we will not keep it in our model. Now we should check the Ljung-Box statistic to make sure that our model is adequate.
Box.test(mod4$residuals, type = "Ljung")
##
## Box-Ljung test
##
## data: mod4$residuals
## X-squared = 0.045108, df = 1, p-value = 0.8318
Our p-value is high, so there is no evidence of remaining autocorrelation in the residuals and our model appears adequate. Thus, our final model is ARIMA\((1, 0, 1)\times(0, 1, 1)_{12}\).
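As a quick usage sketch (an aside, not part of the class discussion), the fitted model can be passed to R's \(\texttt{predict}\) function to forecast the next 12 months; since the model was fit on the log scale, the forecasts are exponentiated to get back to passenger counts.
fc <- predict(mod4, n.ahead = 12)   # forecasts on the log scale
exp(fc$pred)                        # approximate point forecasts in passengers
exp(fc$pred + 1.96 * fc$se)         # approximate upper 95% limits
exp(fc$pred - 1.96 * fc$se)         # approximate lower 95% limits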
Last Learning Log, we discussed how to create an ARIMA model using the \(\texttt{auto.arima}\) function. This concept is closely related to that function because what we did here is simply a manual way of building an optimal ARIMA model, whereas \(\texttt{auto.arima}\) does it automatically.
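For comparison (an aside; we did not run this in class), the automated search can be applied to the same log-transformed series, and its chosen orders may or may not match the manual model above.
auto.arima(log(AirPassengers))  # automated order selection from the forecast package (loaded earlier)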