The concept we discussed in class today was building an ARIMA model with a seasonal component. In general, the only difference between fitting an ARIMA model with and without a seasonal component is that the seasonal ARIMA model includes autoregressive, differencing and moving average parameters for the seasonal aspect of the time series.
Generally, we would use this technique if we found a seasonal pattern in addition to the overall trend in our data set. To illustrate this technique and the various functions needed to do it in R, I will use an example.
The data set I will use for this example is the \(\texttt{AirPassengers}\) data set. Let’s first plot the data to determine if we have seasonality.
plot(AirPassengers)
We do seem to have a seasonal component in our data. We can decompose the data into its different components by using the function \(\texttt{decompose(dataset)}\). For the decomposition to work well, the seasonal fluctuations should be roughly constant in size, neither increasing nor decreasing over time. We see the seasonal swings growing along with the level of our series, so we'll transform it using a logarithmic transformation before we do the decomposition. Once we get our decomposed values, we will plot them.
decomp <- decompose(log(AirPassengers))
plot(decomp)
Our decomposition shows an overall increasing trend with a strong seasonal component, so we'll need to account for that in our ARIMA model. The order we will use for our ARIMA model is \((p, d, q)\times(P, D, Q)_S\), where p, d, q are for the non-seasonal aspects and P, D, Q are for the seasonal aspects. S is the length of the seasonal period; for monthly data like ours, S = 12.
The very first step in creating our ARIMA model is to determine our order of differencing. To do this, we'll use the \(\texttt{diff(dataset, lag)}\) function. \(\texttt{lag}\) indicates how far back in the series we go to take the difference, so a lag of 12, which we will look at first for our model since we have monthly data, takes the difference between \(y_t\) and \(y_{t-12}\). We can plot this using \(\texttt{plot(differences)}\) and determine if our data is stationary after this differencing using \(\texttt{adf.test(differences)}\) from the \(\texttt{tseries}\) package.
library(tseries)  # provides adf.test
diffseas <- diff(log(AirPassengers), lag = 12)
plot(diffseas)
adf.test(diffseas)
##
## Augmented Dickey-Fuller Test
##
## data: diffseas
## Dickey-Fuller = -3.1899, Lag order = 5, p-value = 0.09265
## alternative hypothesis: stationary
The p-value of our ADF test is below 0.10 and the plot of the differenced data looks relatively stationary, neither increasing nor decreasing, so we can say that the data is likely stationary enough. This means we do not have to do any non-seasonal differencing, so d = 0 and D = 1 (the one seasonal difference we already took).
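As an optional cross-check (an aside, not part of the class workflow), the \(\texttt{forecast}\) package can estimate the differencing orders directly; its suggestions can be compared with the choice of d = 0 and D = 1 made above.
forecast::nsdiffs(log(AirPassengers))                  # estimated number of seasonal differences D
forecast::ndiffs(diff(log(AirPassengers), lag = 12))   # estimated non-seasonal differences d after seasonal differencing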
To determine the order of the moving average and autoregressive portions of our seasonal and non-seasonal components for our preliminary model, we need to look at ACF and PACF plots of our differenced values. We do this using the \(\texttt{Acf(differences)}\) and \(\texttt{Pacf(differences)}\) commands from the \(\texttt{forecast}\) package. The ACF plot will allow us to determine q and Q, while the PACF plot will allow us to determine p and P.
library(forecast)  # provides Acf and Pacf
Acf(diffseas)
Our spikes are fairly high for the first 4 lags, but remain above the blue line until about the 8th lag. So we’ll try a q greater than or equal to 4. We also see a single spike at 12 so we will require a Q of 1. Next, we’ll look at the PACF plot.
Pacf(diffseas)
Our spike is high for the first lag, but still above the blue line for the second, so we’ll need a p of at least 1. Then, we don’t really see a spike around 12, so we will have P equal 0.
Our first model will be an ARIMA\((1, 0, 4)\times(0, 1, 1)_{12}\). We can fit this using \(\texttt{arima(dataset, order, seasonal = list(order, period))}\), where the first order is our (p, d, q), the second order is our (P, D, Q) and period is our S.
mod1 <- arima(log(AirPassengers), order = c(1, 0, 4), seasonal = list(order = c(0, 1, 1), period = 12))
summary(mod1)
##
## Call:
## arima(x = log(AirPassengers), order = c(1, 0, 4), seasonal = list(order = c(0,
## 1, 1), period = 12))
##
## Coefficients:
## ar1 ma1 ma2 ma3 ma4 sma1
## 0.9981 -0.3776 0.0565 -0.1641 -0.0615 -0.5815
## s.e. 0.0029 0.0933 0.1022 0.1013 0.0976 0.0803
##
## sigma^2 estimated as 0.001302: log likelihood = 247.39, aic = -480.77
##
## Training set error measures:
## ME RMSE MAE MPE MAPE
## Training set 0.002452187 0.03457656 0.02584615 0.04772145 0.4685735
## MASE ACF1
## Training set 0.285323 -0.00072597
Now, we need to look at the significance of each of our parameters. We'll need to run a t-test for each parameter. Our test statistic is \(\dfrac{\text{parameter}}{SE(\text{parameter})}\), and then we can use the command \(\texttt{2*pt(abs(test.statistic), df, lower.tail=FALSE)}\). Our degrees of freedom are 138 (\(n - \text{parameters} = 144 - 6 = 138\)).
teststats <- coef(mod1)/sqrt(diag(vcov(mod1)))
2*pt(abs(teststats), df=138, lower.tail=FALSE)
## ar1 ma1 ma2 ma3 ma4
## 2.446512e-205 8.629884e-05 5.811566e-01 1.075862e-01 5.293569e-01
## sma1
## 2.854064e-11
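As an aside (not something we did in class), the \(\texttt{coeftest}\) function from the \(\texttt{lmtest}\) package performs essentially the same calculation in one call; it uses a normal approximation instead of the t distribution, so its p-values will differ slightly from the ones above.
library(lmtest)  # provides coeftest
coeftest(mod1)   # z-tests built from coef(mod1) and vcov(mod1)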
Our ma4 parameter has a fairly high p-value, so let’s try removing this parameter from the model and check the significance of the parameters of our new model.
mod2 <- arima(log(AirPassengers), order = c(1, 0, 3), seasonal = list(order = c(0, 1, 1), period = 12))
teststats <- coef(mod2)/sqrt(diag(vcov(mod2)))
2*pt(abs(teststats), df=139, lower.tail=FALSE)
## ar1 ma1 ma2 ma3 sma1
## 1.860866e-195 6.004233e-06 5.588079e-01 9.455535e-02 1.097780e-11
Our ma3 parameter still has a fairly high p-value, so let’s remove it from our model and check the parameter significance again.
mod3 <- arima(log(AirPassengers), order = c(1, 0, 2), seasonal = list(order = c(0, 1, 1), period = 12))
teststats <- coef(mod3)/sqrt(diag(vcov(mod3)))
2*pt(abs(teststats), df=140, lower.tail=FALSE)
## ar1 ma1 ma2 sma1
## 6.438845e-179 1.029967e-05 6.655677e-01 7.113644e-12
Our ma2 parameter's p-value is still fairly high. Let's remove it and check the significance of the model parameters again.
mod4 <- arima(log(AirPassengers), order = c(1, 0, 1), seasonal = list(order = c(0, 1, 1), period = 12))
teststats <- coef(mod4)/sqrt(diag(vcov(mod4)))
2*pt(abs(teststats), df=141, lower.tail=FALSE)
## ar1 ma1 sma1
## 5.239338e-175 1.763166e-05 8.069311e-12
Now, all of our parameters are significant. We saw in our PACF plot that p might also be 2, so let's try changing this order and see if it improves the model.
mod5 <- arima(log(AirPassengers), order = c(2, 0, 1), seasonal = list(order = c(0, 1, 1), period = 12))
teststats <- coef(mod5)/sqrt(diag(vcov(mod5)))
2*pt(abs(teststats), df=140, lower.tail=FALSE)
## ar1 ar2 ma1 sma1
## 5.748015e-06 4.565041e-01 9.799382e-03 1.031412e-11
Our ar2 parameter is not significant, so we will not keep it in our model. Now we should check the Ljung-Box statistic to make sure that our model is adequate.
Box.test(mod4$residuals, type = "Ljung")
##
## Box-Ljung test
##
## data: mod4$residuals
## X-squared = 0.045108, df = 1, p-value = 0.8318
Our p-value is high, so there is no evidence of remaining autocorrelation in the residuals and our model appears adequate. Thus, our final model is ARIMA\((1, 0, 1)\times(0, 1, 1)_{12}\).
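As a quick usage sketch (an aside, not part of the class discussion), the fitted model can be passed to R's \(\texttt{predict}\) function to forecast the next 12 months; since the model was fit on the log scale, the forecasts are exponentiated to get back to passenger counts.
fc <- predict(mod4, n.ahead = 12)   # forecasts on the log scale
exp(fc$pred)                        # approximate point forecasts in passengers
exp(fc$pred + 1.96 * fc$se)         # approximate upper 95% limits
exp(fc$pred - 1.96 * fc$se)         # approximate lower 95% limits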
Last Learning Log, we discussed how to create an ARIMA model using the \(\texttt{auto.arima}\) function. This concept is closely related to that function because what we did here is simply a manual way of building an optimal ARIMA model, whereas \(\texttt{auto.arima}\) does it automatically.
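For comparison (an aside; we did not run this in class), the automated search can be applied to the same log-transformed series, and its chosen orders may or may not match the manual model above.
auto.arima(log(AirPassengers))  # automated order selection from the forecast package (loaded earlier)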