Before you begin

You’ll need to install the following packages:

install.packages("tseries", dependencies = TRUE)
install.packages("astsa", dependencies = TRUE)
install.packages("forecast", dependencies = TRUE)

Process

Box and Jenkins defined the process of setting up an ARIMA(p, d, q) model as:
Identify model parameters -> Estimate the Model -> Validate -> Forecast

In practice, we have two alternatives:

Alternative 1

  1. Plot the data (looking for outliers, stationarity, or the need for transformation)
  2. Difference until the data are stationary (find d) & possibly log transform data
  3. Use differenced series to find p & q
  4. Fit the ARIMA(p, d, q) to the original data
  5. Verify that you have a good (or the best available) model
  6. Forecast or evaluate intervention effects or add comparison vectors…

Alternative 2

  1. Employ an iterative automated fitting algorithm (try out as many different models as possible and then use AIC to evaluate which one(s) is best).
  2. Fit the ARIMA(p, d, q) to the original data
  3. Verify that you have a good (or the best available) model - just to be sure
  4. Forecast or evaluate intervention effects or add comparison vectors…

For this practicum, we will primarily be using Alternative 1 to fit an ARIMA(p, d, q) model, though I have also included a quick look at auto.arima() for Alternative 2.

We will therefore start by importing data and looking at it:

For our example today, let’s use a migration-related data set. On the course website, look for the data: “TS Permanent & long-term migration to and from Australia.csv”.

For an explanation, see: “Permanent & Long-term Migration to and from Australia.docx”


Alternative 1: The Manual (Traditional) Approach


1) Load and plot the data

nzm <- read.csv(file.choose(), header=TRUE)
head(nzm)
##      Date TOTALARRIVALS TOTALDEPARTURES TOTALALLAGES
## 1 1978M04           988            3320        -2332
## 2 1978M05           912            3320        -2408
## 3 1978M06           840            3136        -2296
## 4 1978M07           804            3048        -2244
## 5 1978M08           952            3248        -2296
## 6 1978M09           908            2816        -1908
You can see that the data start at month 4 (April) in 1978, so you define it as:

For more information on this, look at the simple time series walkthrough practicum.

arr <- ts(nzm[,2], frequency=12, start=c(1978, 4))  # Arrivals from Australia to NZ
dep <- ts(nzm[,3], frequency=12, start=c(1978, 4))  # Departures from NZ to Australia
dif <- ts(nzm[,4], frequency=12, start=c(1978, 4))  # Difference between arr and dep

You are welcome to look at any or all of these three variables. In fact, go ahead and try it with the tools from the simple time series walkthrough.

For the sake of this particular exercise, we’ll just use the dif variable.


Now plot dif, the difference between arrivals to New Zealand from Australia and departures from New Zealand to Australia.

For the sake of comparison, stack the three plots in one window.

The ylim= argument is added to the ACF and PACF plots to keep them similar to what you would see in the readings. Feel free to try them without those parameters. They will convey the same information either way.

layout(1:3)  # This will show all three plots in the same window
plot.ts(dif)
acf(dif, ylim=c(-1, 1))
pacf(dif, ylim=c(-1, 1))

layout(1)    # Change the setting back to one graph per window

This is just an early look. You haven’t differenced the data yet. So, p and q will be difficult to estimate at this stage.

I just wanted you to have an initial sense of what the data look like.


For additional insight, decompose the series
dec.dif <- decompose(dif)
plot(dec.dif)
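
If you want another view of the seasonal structure (optional, and not required for this exercise), base R’s stl() gives a loess-based decomposition of the same series:

dec.stl <- stl(dif, s.window = "periodic")  # loess-based seasonal decomposition
plot(dec.stl)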


2) Difference until the series is stationary

Start out by differencing once or twice and check the results.

d.dif <- diff(dif, differences=1)  # First-order differencing
d.dif2 <- diff(dif, differences=2) # Second-order differencing

Check the results for each.

layout(1:2)
acf(d.dif, lag.max=20, ylim=c(-1, 1))  # Plot first-order differencing
pacf(d.dif, lag.max=20, ylim=c(-1, 1))

layout(1)


layout(1:2)
acf(d.dif2, lag.max=20, ylim=c(-1, 1)) # Plot second-order differencing
pacf(d.dif2, lag.max=20, ylim=c(-1, 1))

layout(1)

This is the first step in selecting the “d” in the ARIMA(p, d, q) model.

Compare the two and decide for yourself which differencing has done the best job of quieting the autocorrelation plots.


Choose the one with the better fit.

For a test that will help you to decide, you can also use the “augmented Dickey-Fuller test”.

library(tseries)  # You may need to install this package the first time you use it.
adf.test(d.dif, alternative = "stationary")
## 
##  Augmented Dickey-Fuller Test
## 
## data:  d.dif
## Dickey-Fuller = -9.8543, Lag order = 7, p-value = 0.01
## alternative hypothesis: stationary
adf.test(d.dif2, alternative = "stationary")
## 
##  Augmented Dickey-Fuller Test
## 
## data:  d.dif2
## Dickey-Fuller = -13.282, Lag order = 7, p-value = 0.01
## alternative hypothesis: stationary

Notice that the results of the test remind you that the alternative hypothesis is that the series is stationary. That means that the null is that the series is not stationary.

H0: series is not stationary

This is an important point, since the null is all that you are testing.

By the way, the other alternative that you can test is “explosive”. I am personally not sure what that would mean. But, I am sure Freud may have something to say about it.
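
As an optional cross-check on the choice of d (my addition, not part of the original workflow), the forecast package installed above provides ndiffs(), which estimates how many differences are needed for stationarity:

library(forecast)
ndiffs(dif)   # suggested number of differences; compare with your reading of the plots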


3) Use the differenced series to estimate p & q

My own choice was the first differencing (d=1), so I am using the following ACF and PACF plots from step 2, above:

layout(1:2)
acf(d.dif, lag.max=20, ylim=c(-1, 1))
pacf(d.dif, lag.max=20, ylim=c(-1, 1))

layout(1)

To interpret, you can use your notes or the guide to ACF and PACF plots: http://tinyurl.com/afmeygy

Some possible solutions, from most to least likely: ARIMA(0,1,1), ARIMA(1,1,1), ARIMA(0,1,2)


4) Fit the ARIMA(p, d, q) model to the ORIGINAL data

Because we have three candidates, let’s try all three.

ar1 <- arima(dif, order=c(0, 1, 1))
ar2 <- arima(dif, order=c(1, 1, 1))
ar3 <- arima(dif, order=c(0, 1, 2))


5) Verify that you have a good (or the best available) model

First, compare AIC from each model.
ar1
## 
## Call:
## arima(x = dif, order = c(0, 1, 1))
## 
## Coefficients:
##           ma1
##       -0.3369
## s.e.   0.0503
## 
## sigma^2 estimated as 258366:  log likelihood = -3128.91,  aic = 6261.82
ar2
## 
## Call:
## arima(x = dif, order = c(1, 1, 1))
## 
## Coefficients:
##          ar1      ma1
##       0.3977  -0.7045
## s.e.  0.1383   0.1116
## 
## sigma^2 estimated as 256611:  log likelihood = -3127.56,  aic = 6261.12
ar3
## 
## Call:
## arima(x = dif, order = c(0, 1, 2))
## 
## Coefficients:
##           ma1      ma2
##       -0.3291  -0.0551
## s.e.   0.0493   0.0584
## 
## sigma^2 estimated as 257791:  log likelihood = -3128.46,  aic = 6262.93
# Or, to see only the aic for each

ar1$aic
## [1] 6261.825
ar2$aic
## [1] 6261.116
ar3$aic
## [1] 6262.927

Remember, lower values of AIC are better. Here, though, the differences are small, so you’ll probably want something additional for evaluation and comparison.

Next, compare how well the residuals from each model behave (they should look like white noise).
layout(1:2)  
acf(ar1$residuals, lag.max=20, ylim=c(-1, 1))
pacf(ar1$residuals, lag.max=20, ylim=c(-1, 1))

acf(ar2$residuals, lag.max=20, ylim=c(-1, 1))
pacf(ar2$residuals, lag.max=20, ylim=c(-1, 1))

acf(ar3$residuals, lag.max=20, ylim=c(-1, 1))
pacf(ar3$residuals, lag.max=20, ylim=c(-1, 1))

layout(1)

Keep in mind that you will not always be able to remove every significant residual bar completely. A lone bar may be due to seasonality. So, don’t look for perfection at this point; look for the best of the fits you have.
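
If you would like a single diagnostic panel per model, base R’s tsdiag() plots the standardized residuals, the residual ACF, and Ljung-Box p-values at several lags in one window (the sarima() output shown later is similar). A minimal sketch:

tsdiag(ar1)   # standardized residuals, residual ACF, Ljung-Box p-values
tsdiag(ar2)
tsdiag(ar3)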

Testing residuals: Is what the model does not cover essentially random?

Here, we are using the Ljung-Box test (a modification of the earlier test by Box and Pierce) to assess whether the residuals from the model are random noise. This is essentially checking that the error terms are independent, i.e., that no autocorrelation remains in the residuals.

The null hypothesis says that the residuals are no different from “white noise”. So, large p-values are consistent with independent residuals; small p-values indicate that the model has left structure behind.

H0: residuals are randomly distributed

Here, we run the test on the residuals from each of the three models fit above. If any of these models rejects the null, we should consider that model problematic.

Box.test(ar1$residuals, type="Ljung-Box")
## 
##  Box-Ljung test
## 
## data:  ar1$residuals
## X-squared = 0.066121, df = 1, p-value = 0.7971
Box.test(ar2$residuals, type="Ljung-Box")
## 
##  Box-Ljung test
## 
## data:  ar2$residuals
## X-squared = 0.42171, df = 1, p-value = 0.5161
Box.test(ar3$residuals, type="Ljung-Box")
## 
##  Box-Ljung test
## 
## data:  ar3$residuals
## X-squared = 0.0014606, df = 1, p-value = 0.9695

In each case, the p-value is far from significant. We, therefore, fail to reject the null that the residuals are randomly distributed and conclude that the ARIMA model, as specified, is a decent fit.
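
A small refinement you may see elsewhere (optional, and not part of the output above): Box.test() lets you test at a longer lag and subtract the number of estimated ARMA parameters via the fitdf= argument. For example, for the ARIMA(1,1,1) model, p + q = 2:

Box.test(ar2$residuals, lag = 12, type = "Ljung-Box", fitdf = 2)  # fitdf = p + q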


Alternate: use the astsa package to check residual fit.

For a more fine-grained look at model fit, consider the sarima() function from the astsa package. This will mimic the basic diagnostic plots that we use to assess regression models.

library(astsa)
sarima(dif, 0, 1, 1)
## initial  value 6.280831 
## iter   2 value 6.232351
## iter   3 value 6.231063
## iter   4 value 6.231041
## iter   4 value 6.231041
## iter   4 value 6.231041
## final  value 6.231041 
## converged
## initial  value 6.231189 
## iter   2 value 6.231188
## iter   2 value 6.231188
## iter   2 value 6.231188
## final  value 6.231188 
## converged

## $fit
## 
## Call:
## stats::arima(x = xdata, order = c(p, d, q), seasonal = list(order = c(P, D, 
##     Q), period = S), xreg = constant, transform.pars = trans, fixed = fixed, 
##     optim.control = list(trace = trc, REPORT = 1, reltol = tol))
## 
## Coefficients:
##           ma1  constant
##       -0.3370   -2.4175
## s.e.   0.0503   16.6833
## 
## sigma^2 estimated as 258353:  log likelihood = -3128.9,  aic = 6263.8
## 
## $degrees_of_freedom
## [1] 407
## 
## $ttable
##          Estimate      SE t.value p.value
## ma1       -0.3370  0.0503 -6.6963  0.0000
## constant  -2.4175 16.6833 -0.1449  0.8849
## 
## $AIC
## [1] 15.31492
## 
## $AICc
## [1] 15.315
## 
## $BIC
## [1] 15.34436
sarima(dif, 1, 1, 1)
## initial  value 6.282031 
## iter   2 value 6.262993
## iter   3 value 6.235084
## iter   4 value 6.234827
## iter   5 value 6.231217
## iter   6 value 6.229404
## iter   7 value 6.229257
## iter   8 value 6.228977
## iter   9 value 6.228966
## iter  10 value 6.228964
## iter  10 value 6.228964
## iter  10 value 6.228964
## final  value 6.228964 
## converged
## initial  value 6.227843 
## iter   2 value 6.227842
## iter   3 value 6.227842
## iter   3 value 6.227842
## iter   3 value 6.227842
## final  value 6.227842 
## converged

## $fit
## 
## Call:
## stats::arima(x = xdata, order = c(p, d, q), seasonal = list(order = c(P, D, 
##     Q), period = S), xreg = constant, transform.pars = trans, fixed = fixed, 
##     optim.control = list(trace = trc, REPORT = 1, reltol = tol))
## 
## Coefficients:
##          ar1      ma1  constant
##       0.3983  -0.7051   -2.7383
## s.e.  0.1377   0.1111   12.3253
## 
## sigma^2 estimated as 256579:  log likelihood = -3127.53,  aic = 6263.07
## 
## $degrees_of_freedom
## [1] 406
## 
## $ttable
##          Estimate      SE t.value p.value
## ar1        0.3983  0.1377  2.8914  0.0040
## ma1       -0.7051  0.1111 -6.3459  0.0000
## constant  -2.7383 12.3253 -0.2222  0.8243
## 
## $AIC
## [1] 15.31312
## 
## $AICc
## [1] 15.31327
## 
## $BIC
## [1] 15.35238
sarima(dif, 0, 1, 2)
## initial  value 6.280831 
## iter   2 value 6.231571
## iter   3 value 6.230073
## iter   4 value 6.229920
## iter   5 value 6.229917
## iter   6 value 6.229917
## iter   6 value 6.229917
## iter   6 value 6.229917
## final  value 6.229917 
## converged
## initial  value 6.230084 
## iter   2 value 6.230083
## iter   2 value 6.230083
## iter   2 value 6.230083
## final  value 6.230083 
## converged

## $fit
## 
## Call:
## stats::arima(x = xdata, order = c(p, d, q), seasonal = list(order = c(P, D, 
##     Q), period = S), xreg = constant, transform.pars = trans, fixed = fixed, 
##     optim.control = list(trace = trc, REPORT = 1, reltol = tol))
## 
## Coefficients:
##           ma1      ma2  constant
##       -0.3292  -0.0553   -2.5583
## s.e.   0.0493   0.0585   15.4803
## 
## sigma^2 estimated as 257773:  log likelihood = -3128.45,  aic = 6264.9
## 
## $degrees_of_freedom
## [1] 406
## 
## $ttable
##          Estimate      SE t.value p.value
## ma1       -0.3292  0.0493 -6.6782  0.0000
## ma2       -0.0553  0.0585 -0.9455  0.3450
## constant  -2.5583 15.4803 -0.1653  0.8688
## 
## $AIC
## [1] 15.3176
## 
## $AICc
## [1] 15.31775
## 
## $BIC
## [1] 15.35686

Where do you see the best fit?

In this case, let’s go with ar2, the ARIMA(1,1,1) model, as it had the fewest transgressions.


6) Forecast or evaluate intervention effects or add comparison vectors…

To forecast and plot:

forecast <- predict(ar2, n.ahead=60)
upper <- forecast$pred + 1.96*forecast$se
lower <- forecast$pred - 1.96*forecast$se
ts.plot(dif,forecast$pred, col=c("black","red"))
lines(upper, col="blue", lty="solid")
lines(lower, col="blue", lty="solid")
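
As an alternative (assuming the forecast package installed above loads cleanly), forecast() will compute and plot the same 60-month horizon with 80% and 95% intervals in a single call. (The forecast object created above does not interfere, since R looks up the function when it is called.)

library(forecast)
plot(forecast(ar2, h = 60))   # point forecasts plus 80% and 95% intervals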

To see more detail, you can limit which portion of the graph you see:

You can truncate the plot by changing the limits of the x and y axes (xlim= and ylim=). Below, I have changed the graph to show only the years 2000 through 2015 (xlim=c(2000, 2015)). I also wanted to zoom in a little more, so I narrowed the y axis (ylim=c(-5000, -500)).

ts.plot(dif,forecast$pred, col=c("black","red"), xlim=c(2000, 2015), ylim=c(-5000, -500))
lines(upper, col="blue", lty="solid")
lines(lower, col="blue", lty="solid")

See what I meant about the value of ARIMA being mainly in the short term?

In this case, we are missing a major piece of information: Seasonality.


Building a Seasonal Model

Looking back at the decomposition, it is easy to see that there is seasonality present in these data. This is part of the reason why the forecast looks so unimpressive.

To add seasonality, try:

ar4 <- arima(dif, order=c(0, 1, 1), seasonal=list(order=c(0,0,1), period=12))

forecast <- predict(ar4, n.ahead=60)
upper <- forecast$pred + 1.96*forecast$se
lower <- forecast$pred - 1.96*forecast$se
ts.plot(dif,forecast$pred, col=c("black","red"))
lines(upper, col="blue", lty="solid")
lines(lower, col="blue", lty="solid")

To zoom in, as we did above:

ts.plot(dif, forecast$pred, col=c("black","red"), xlim=c(2000, 2015), ylim=c(-5000, -500))
lines(upper, col="blue", lty="solid")
lines(lower, col="blue", lty="solid")
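
To check whether the seasonal term actually helped, here is a quick sketch; interpret as before, with lower AIC better and a large Ljung-Box p-value meaning the residuals still look like white noise.

ar4$aic        # seasonal model
ar2$aic        # earlier non-seasonal choice, for comparison
Box.test(ar4$residuals, type = "Ljung-Box")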



Alternative 2: the Automated Approach

As above…


0) Load and plot the data

nzm <- read.csv(file.choose(), header=TRUE)
head(nzm)


arr <- ts(nzm[,2], frequency=12, start=c(1978, 4))  # Arrivals from Australia to NZ
dep <- ts(nzm[,3], frequency=12, start=c(1978, 4))  # Departures from NZ to Australia
dif <- ts(nzm[,4], frequency=12, start=c(1978, 4))  # Difference between arr and dep

layout(1:3)  # This will show all three plots in the same window
plot.ts(dif)
acf(dif, ylim=c(-1, 1))
pacf(dif, ylim=c(-1, 1))
layout(1)    # Change the setting back to one graph per window

I numbered this “zero” since the plotting is not actually necessary when starting out using automated techniques.

It is, of course, still a good idea.

1) Employ an iterative automated fitting algorithm

Here, we’ll load the forecast package to take advantage of the auto.arima() function. auto.arima() tries out a variety of different models and then uses AIC to evaluate which one(s) fit best.

For more information about the function, try the help section ?forecast::auto.arima, or google the function’s name.

library(forecast)
## 
## Attaching package: 'forecast'
## The following object is masked from 'package:astsa':
## 
##     gas
Auto.Fit <- auto.arima(dif)


2) Fit the ARIMA(p, d, q) to the original data

library(forecast)

Auto.Fit <- auto.arima(dif)

Next, get the details on ARIMA(p, d, q)(P, D, Q).

Auto.Fit
## Series: dif 
## ARIMA(2,0,3)(1,1,1)[12] 
## 
## Coefficients:
##          ar1     ar2     ma1      ma2     ma3    sar1     sma1
##       0.1515  0.7440  0.5133  -0.1515  0.2119  0.1003  -0.7853
## s.e.  0.0709  0.0683  0.0800   0.0641  0.0548  0.0775   0.0528
## 
## sigma^2 estimated as 129466:  log likelihood=-2909.5
## AIC=5834.99   AICc=5835.36   BIC=5866.89

In this case, you may notice that the model is not elegant. Recall that p + q should not be greater than 3 if you can help it, and neither p nor q should be greater than 2.

Although these are not necessarily terminal transgressions, the rule of thumb is that they should be avoided if possible.
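
If you would like auto.arima() to look harder, or to respect the rule of thumb above, two optional tweaks are worth knowing. (The object names here are just mine, and the exhaustive search can be slow.)

# Search all candidate models rather than stepping through them
Auto.Fit2 <- auto.arima(dif, stepwise = FALSE, approximation = FALSE)

# Or cap the non-seasonal orders at 2, per the rule of thumb
Auto.Fit3 <- auto.arima(dif, max.p = 2, max.q = 2)

Auto.Fit2
Auto.Fit3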


3) Verify that you have a good (or the best available) model - just to be sure

We can do a few things here.

First, let’s check the model fit using AIC. We can also compare it to the fit we got in the earlier work.

Auto.Fit$aic # Auto.ARIMA model
## [1] 5834.995
ar1$aic      # Earlier models, done manually
## [1] 6261.825
ar2$aic
## [1] 6261.116
ar3$aic
## [1] 6262.927

Remember, lower values of AIC are better.

Then, run Box.test() on the residuals, as above. (Note that without type="Ljung-Box", Box.test() defaults to the Box-Pierce version, which is what the output below shows.)

Box.test(Auto.Fit$residuals)
## 
##  Box-Pierce test
## 
## data:  Auto.Fit$residuals
## X-squared = 0.0079705, df = 1, p-value = 0.9289

You can also look at ACF and PACF plots.

layout(1:2)  
acf(Auto.Fit$residuals, lag.max=20, ylim=c(-1, 1))
pacf(Auto.Fit$residuals, lag.max=20, ylim=c(-1, 1))

layout(1)


4) Forecast

plot(forecast(Auto.Fit))

It is over so quickly. (Bit of a letdown, eh?)

auto.arima() is a good option, but it will not always give you the best model. It is good practice to try both approaches to be sure.