Example

First we need to install the following packages to use all of the necessary functions.

library(forecast)

## Warning: package 'forecast' was built under R version 3.4.4

## Warning in as.POSIXlt.POSIXct(Sys.time()): unknown timezone 'zone/tz/2018c.
## 1.0/zoneinfo/America/Chicago'

library(quantmod)

## Warning: package 'quantmod' was built under R version 3.4.4

## Loading required package: xts

## Warning: package 'xts' was built under R version 3.4.4

## Loading required package: zoo

## Warning: package 'zoo' was built under R version 3.4.3

## 
## Attaching package: 'zoo'

## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

## Loading required package: TTR

## Warning: package 'TTR' was built under R version 3.4.3

## Version 0.4-0 included new data defaults. See ?getSymbols.

library(tseries)

## Warning: package 'tseries' was built under R version 3.4.4

library(timeSeries)

## Warning: package 'timeSeries' was built under R version 3.4.2

## Loading required package: timeDate

## Warning: package 'timeDate' was built under R version 3.4.3

## 
## Attaching package: 'timeSeries'

## The following object is masked from 'package:zoo':
## 
##     time<-

library(forecast)
library(xts)

Step 1: Plot the Data

For this document we will be working with the AirPassengers data, so let’s call and attach our data. Then let’s plot our data and look for seasonality and trend.

data(AirPassengers)
plot(AirPassengers)

From our output we can see that seasonality exists, along with an upward trend. However, we also see that the seasonal fluctuations in our data are increasing over time; thus our model doesn’t have equal variance. Therefore, we need to transform our model.

From a prior document we learned that trasforming the AirPassengers data by taking the log of the data best resolves the heteroscedasticity. Lets transform our data and plot it now.

plot(log(AirPassengers))

Now we see that there is equal variance in our data, so we can continue with our ARIMA model development. Overall, our model has seasonality and an upward trend.

Step 1.5: Determine S

From data exploration we know that our data is collected monthly, so S=12 for our model.

Step 2: Differences

Next, let’s determine the differences for our model. As mentioned, our data has seasonality and trend, so we will use differences to remove the seasonality first. Let’s call our differences and plot them. Then, use the adf.test command to see if our data is stationary. The adf.test command runs a hypothesis test where the null hypothesis states that the data is not stationary, and the alternative hypothesis states that the data is stationary. For this hypothesis test we will use a significance level of 0.05.

diff1 <- diff(log(AirPassengers), lag = 12)
plot.ts(diff1)

adf.test(diff1)

## 
##  Augmented Dickey-Fuller Test
## 
## data:  diff1
## Dickey-Fuller = -3.1899, Lag order = 5, p-value = 0.09265
## alternative hypothesis: stationary

Our plot shows us that there does not seem to be any trend in our new data (diff1), but it is difficult to tell if our data is stationary. The output from the adf.test command gives us the p-value from the hypothesis test. The resulting p-value is 0.09265, which is not less that our significance level, 0.05, so we cannot reject the null hypothesis. Thus, we conclude that our data is not stationary, so we will need to take the differences of our new data.

Use the diff command again to take the differences of our first order differences. Again, let’s plot the data and use the adf.test command.

diff2 <- diff(diff1, lag = 12)
plot.ts(diff2)

adf.test(diff2)

## 
##  Augmented Dickey-Fuller Test
## 
## data:  diff2
## Dickey-Fuller = -3.5215, Lag order = 4, p-value = 0.04333
## alternative hypothesis: stationary

By looking at this plot we see that there isn’t a trend in our new data, and the adf.test output give us the p-value from our hypothesis test. This time our p-value is 0.04333, which is below 0.05. Therefore, we can reject the null hypothesis in favor of the alternative and conclude that our new data is stationary.

Therefore, D=2, d=0.

Step 3: Prelimiary Values for p, q, P, and Q

Next, we need to choose preliminary values for p, q, P, and Q, which we will do by analyzing the autocorrelation function (ACF) and partial autocorrelation function (PACF) of our new data (diff2). The ACF tells us the correlation of two points separated by a specific lag, and the PACF does the same, except it is conditioned on values of shorter lag.

We will determine the preliminary values for p, q, P, and Q in the following ways:

q: To deptermine our value for q, we will look at the spikes for low lags in the ACF plot.

Q: To deptermine our value for Q, we will look at the spikes for multiples of our S value in the ACF plot. Since S=12 for our data, we will look at the spikes for lag=12 and lag=24.

p: To deptermine our value for p, we will look at the spikes for low lags in the PACF plot.

P: To deptermine our value for P, we will look at the spikes for multiples of our S value in the PACF plot. Since S=12 for our data, we will look at the spikes for lag=12 and lag=24.

First, let’s plot our ACF for our data.

Acf(diff2)

To determine q we look at the spikes for the low lags. From our plot we see that the spikes for the fist 5 lags are above the blue line, which means that all of these lags could be significant. Also, the spike for lag=6 is right on the edge of being significant or insignificant. It’s not clear which value we should use for q, so we can say that q>=1.

To determine Q we look at the spike for lag=12 and lag=24. We see that the spike for lag=12 is very significant, and the spike for lag=24 is insignificant. Thus, we can feel confident that Q=1.

Now, let’s plot our PACF for our data to determine p and P.

Pacf(diff2)

Looking at the lower lags to determine p, we see that the first lag is super significant, but the second lag is barely significant. Thus we can say that p>=1.

Also, we see that the spike for lag=12 is significant, and the spike for lag=24 is barely significant. Therefore we can say that P>=1.

Step 4: Estimate Our Model

Based on our ACF and PACF, we will estimate our ARIMA model. To start, let’s make the simplest model possible. Based on steps 1, 2, and 3, our simplest model will be in the following form: ARIMA (1, 0, 1) x (1, 2, 1) 12.

Let’s create our model, and use the summary command to look at what our coefficients are. Then let’s use the Box.test command to see if our model is adequate for our data. The Box.test command conducts a hypothesis test where the null hypothesis states that our model is adequate for our data, and the alternative hypothesis says that our model is not adequate. The “Ljung” box test is an improved version of the original box test, so we input the argument “type =”Ljung“”.

mod1 <- arima(log(AirPassengers), order = c(1,0,1), season = list( order = c( 1,2,1), period=12))
summary(mod1)

## 
## Call:
## arima(x = log(AirPassengers), order = c(1, 0, 1), seasonal = list(order = c(1, 
##     2, 1), period = 12))
## 
## Coefficients:
##          ar1      ma1     sar1     sma1
##       0.9303  -0.4175  -0.4505  -1.0000
## s.e.  0.0448   0.0991   0.0861   0.1038
## 
## sigma^2 estimated as 0.001415:  log likelihood = 203.55,  aic = -397.1
## 
## Training set error measures:
##                        ME       RMSE        MAE         MPE      MAPE
## Training set -0.001134586 0.03439361 0.02451314 -0.01830164 0.4413303
##                   MASE        ACF1
## Training set 0.2706075 -0.02443055

Box.test(mod1$residuals, type = "Ljung")

## 
##  Box-Ljung test
## 
## data:  mod1$residuals
## X-squared = 0.08775, df = 1, p-value = 0.7671

The first row under the variable names in the “Coefficient”" section of the summary output gives us our coefficients. Also, the row below the coefficients give us the standard error for each of our terms. In addition, the output from the Ljung box test gives us a p-value expressing the adequacy of our model. For this specific model the p-value is 0.7671. This p-value is very large, so we can conlude that our model is adequate for our data!

Now, let’s see if all of the terms in our model are significant. We can divide the coefficient of each term by its standard deviation to get our test stat for each term. Then we use the pt function to get the p-value for our test stat for the t-distribution. Notice that the degrees of freedom for the distribution is (144-4) becasue there are 144 observations in our original data, and there are 4 terms in our model.

teststat11 <- 0.9303/0.0448
2*pt(abs(teststat11), df = 144-4, lower.tail = FALSE)

## [1] 1.384713e-44

teststat22 <- -0.4175/0.0991
2*pt(abs(teststat22), df = 144-4, lower.tail = FALSE)

## [1] 4.491466e-05

teststat33 <- -0.4505/0.0861
2*pt(abs(teststat33), df = 144-4, lower.tail = FALSE)

## [1] 5.985426e-07

teststat44 <- -1/0.1038
2*pt(abs(teststat44), df = 144-4, lower.tail = FALSE)

## [1] 3.646338e-17

The output shows that all of the p-values are nearly zero, so all of our terms are significant! Therefore, we can be happy using the model ARIMA (1,0,1) x (1,2,1) 12.

Step 5

Now if we wanted to, we could do diagnostis, hypothesis tests, or try and improve our model. However, these topics will not be covered in this R guide.

Compare Our Model to `auto.arima`

Lastly, let’s compare the model that we made with the model that the auto.arima command creates.

automod <- auto.arima(log(AirPassengers))
automod

## Series: log(AirPassengers) 
## ARIMA(0,1,1)(0,1,1)[12] 
## 
## Coefficients:
##           ma1     sma1
##       -0.4018  -0.5569
## s.e.   0.0896   0.0731
## 
## sigma^2 estimated as 0.001371:  log likelihood=244.7
## AIC=-483.4   AICc=-483.21   BIC=-474.77

From the output we see that the ARIMA model that R creates from our data is in the following form: ARIMA (0,1,1) x (0,1,1) 12. CLearly this is different than our model, ARIMA (1,0,1) x (1,2,1) 12; our model has four terms, but the auto.arima model only has two terms. Notice that we used D=2, d=0, and auto.arima used D=1, d=1. These variations in differences could cause our other numbers to be different.

Let’s compare our model to the ARIMA model using AIC.

AIC(mod1); AIC(automod)

## [1] -397.1033

## [1] -483.3991

From the output we see that the model created by auto.arima has an AIC that is about 86 units less than our model, which means that this model is quite a bit better than our model.

Finally, let’s do some forecasting to see how much better the auto.arima model is compared ot our “home-cooked” model.

par(mfrow=c(2,1))
plot(forecast(automod, h = 12), main = "auto.arima")
plot(forecast(mod1, h = 12), main = "home-cooked")

Overall, our forecasts look pretty similar, although it looks like the peak on our model is higher than the one on the automod.

ARIMA (Day 2) Learning Log

Emily Eberspacher

4/26/2018

ARIMA Models: Testing and Diagnostics

Seasonal ARIMA Models

Home-Cooked ARIMA Models

Example

Step 1: Plot the Data

Step 1.5: Determine S

Step 2: Differences

Step 3: Prelimiary Values for p, q, P, and Q

Step 4: Estimate Our Model

Step 5

Compare Our Model to `auto.arima`

Conclusion

ARIMA (Day 2) Learning Log

Emily Eberspacher

4/26/2018

ARIMA Models: Testing and Diagnostics

Seasonal ARIMA Models

Home-Cooked ARIMA Models

Example

Step 1: Plot the Data

Step 1.5: Determine S

Step 2: Differences

Step 3: Prelimiary Values for p, q, P, and Q

Step 4: Estimate Our Model

Step 5

Compare Our Model to auto.arima

Conclusion

Compare Our Model to `auto.arima`