You’ll need to install the following packages:
install.packages("tseries", dependencies = TRUE)
install.packages("astsa", dependencies = TRUE)
install.packages("forecast", dependencies = TRUE)
Box and Jenkins defined the process of setting up an ARIMA(p, d, q) model as:
Identify the model -> Estimate the model -> Validate -> Forecast
In practice, we have two alternatives: identifying, estimating, and validating the model manually through the Box-Jenkins steps (Alternative 1), or handing the selection over to an automated routine such as auto.arima() (Alternative 2).
For this practicum, we will primarily use Alternative 1 to fit an ARIMA(p, d, q) model, though I have included a quick demonstration of auto.arima() for Alternative 2.
We will therefore start by importing data and looking at it:
For our example today, let’s use a migration-related data set. On the course website, look for the data: “TS Permanent & long-term migration to and from Australia.csv”.
For an explanation, see: “Permanent & Long-term Migration to and from Australia.docx”
nzm <- read.csv(file.choose(), header=TRUE)
head(nzm)
## Date TOTALARRIVALS TOTALDEPARTURES TOTALALLAGES
## 1 1978M04 988 3320 -2332
## 2 1978M05 912 3320 -2408
## 3 1978M06 840 3136 -2296
## 4 1978M07 804 3048 -2244
## 5 1978M08 952 3248 -2296
## 6 1978M09 908 2816 -1908
For more information on this, look at the simple time series walkthrough practicum.
arr <- ts(nzm[,2], frequency=12, start=c(1978, 4)) # Arrivals from Australia to NZ
dep <- ts(nzm[,3], frequency=12, start=c(1978, 4)) # Departures from NZ to Australia
dif <- ts(nzm[,4], frequency=12, start=c(1978, 4)) # Difference between arr and dep
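If you want to make sure the series were set up correctly, a quick sanity check helps (a small sketch of my own, using base R functions):
frequency(dif) # Should be 12 (monthly data)
start(dif)     # Should be 1978, period 4 (April)
window(dif, start=c(1978, 4), end=c(1979, 3)) # First twelve months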
You are welcome to look at any or all of these three variables. In fact, go ahead and try them with the tools in the simple time series walkthrough.
For this particular exercise, we’ll just use the dif variable.
For the sake of comparison, stack the three plots in one window.
The ylim= argument is added to the ACF and PACF plots to keep them similar to what you would see in the readings. Feel free to try them without those parameters. They will convey the same information either way.
layout(1:3) # This will show all three plots in the same window
plot.ts(dif)
acf(dif, ylim=c(-1, 1))
pacf(dif, ylim=c(-1, 1))
layout(1) # Change the setting back to one graph per window
This is just an early look. You haven’t differenced the data yet. So, p and q will be difficult to estimate at this stage.
I just wanted you to have an initial sense of what the data look like.
dec.dif <- decompose(dif)
plot(dec.dif)
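The components are stored inside the decomposed object, so you can also inspect them directly (an optional sketch of my own):
dec.dif$figure          # The twelve estimated monthly seasonal effects
plot(dec.dif$seasonal)  # Plot the seasonal component on its own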
Start out by differencing once or twice and check the results.
d.dif <- diff(dif, differences=1) # First-order differencing
d.dif2 <- diff(dif, differences=2) # Second-order differencing
Check the results for each.
layout(1:2)
acf(d.dif, lag.max=20, ylim=c(-1, 1)) # Plot first-order differencing
pacf(d.dif, lag.max=20, ylim=c(-1, 1))
layout(1)
layout(1:2)
acf(d.dif2, lag.max=20, ylim=c(-1, 1)) # Plot second-order differencing
pacf(d.dif2, lag.max=20, ylim=c(-1, 1))
layout(1)
This is the first step in selecting the “d” in the ARIMA(p, d, q) model.
Compare the two and decide for yourself which differencing has done the best job of quieting the autocorrelation plots.
For a test that will help you to decide, you can also use the “augmented Dickey-Fuller test”.
library(tseries) # You may need to install this package the first time you use it.
adf.test(d.dif, alternative = "stationary")
##
## Augmented Dickey-Fuller Test
##
## data: d.dif
## Dickey-Fuller = -9.8543, Lag order = 7, p-value = 0.01
## alternative hypothesis: stationary
adf.test(d.dif2, alternative = "stationary")
##
## Augmented Dickey-Fuller Test
##
## data: d.dif2
## Dickey-Fuller = -13.282, Lag order = 7, p-value = 0.01
## alternative hypothesis: stationary
Notice that the results of the test remind you that the alternative hypothesis is that the series is stationary. That means that the null is that the series is not stationary.
H0: series is not stationary
This is an important point, since the null is what the test actually evaluates: a small p-value rejects non-stationarity. (Also note that tseries truncates the printed p-value at 0.01, so “p-value = 0.01” means 0.01 or smaller.)
By the way, the other alternative that you can test is “explosive”. I am personally not sure what that would mean. But, I am sure Freud would have something to say about it.
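If you want a second opinion on d, the forecast package has helper functions that apply unit-root tests automatically (an optional shortcut, not part of the Box-Jenkins recipe above):
library(forecast)
ndiffs(dif)  # Suggested number of regular differences
nsdiffs(dif) # Suggested number of seasonal differences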
My own choice was the first differencing (d=1), so I am using the following ACF and PACF plots from step 2, above:
layout(1:2)
acf(d.dif, lag.max=20, ylim=c(-1, 1))
pacf(d.dif, lag.max=20, ylim=c(-1, 1))
layout(1)
To interpret, you can use your notes or the guide to ACF and PACF plots: http://tinyurl.com/afmeygy
Some possible solutions, from most to least likely: ARIMA(0,1,1), ARIMA(1,1,1), ARIMA(0,1,2).
Because we have three candidates, let’s try all three.
ar1 <- arima(dif, order=c(0, 1, 1))
ar2 <- arima(dif, order=c(1, 1, 1))
ar3 <- arima(dif, order=c(0, 1, 2))
ar1
##
## Call:
## arima(x = dif, order = c(0, 1, 1))
##
## Coefficients:
## ma1
## -0.3369
## s.e. 0.0503
##
## sigma^2 estimated as 258366: log likelihood = -3128.91, aic = 6261.82
ar2
##
## Call:
## arima(x = dif, order = c(1, 1, 1))
##
## Coefficients:
## ar1 ma1
## 0.3977 -0.7045
## s.e. 0.1383 0.1116
##
## sigma^2 estimated as 256611: log likelihood = -3127.56, aic = 6261.12
ar3
##
## Call:
## arima(x = dif, order = c(0, 1, 2))
##
## Coefficients:
## ma1 ma2
## -0.3291 -0.0551
## s.e. 0.0493 0.0584
##
## sigma^2 estimated as 257791: log likelihood = -3128.46, aic = 6262.93
# Or, to see only the aic for each
ar1$aic
## [1] 6261.825
ar2$aic
## [1] 6261.116
ar3$aic
## [1] 6262.927
Remember, lower values of AIC are better. Here, though, the differences are trivial, so you’ll probably want something additional for evaluation and comparison.
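To see the criteria side by side, you can also tabulate them (a small convenience sketch of my own; AIC() and BIC() are standard stats functions, and BIC penalizes extra terms more heavily):
data.frame(model = c("ARIMA(0,1,1)", "ARIMA(1,1,1)", "ARIMA(0,1,2)"),
           AIC = c(AIC(ar1), AIC(ar2), AIC(ar3)),
           BIC = c(BIC(ar1), BIC(ar2), BIC(ar3)))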
layout(1:2)
acf(ar1$residuals, lag.max=20, ylim=c(-1, 1))
pacf(ar1$residuals, lag.max=20, ylim=c(-1, 1))
acf(ar2$residuals, lag.max=20, ylim=c(-1, 1))
pacf(ar2$residuals, lag.max=20, ylim=c(-1, 1))
acf(ar3$residuals, lag.max=20, ylim=c(-1, 1))
pacf(ar3$residuals, lag.max=20, ylim=c(-1, 1))
layout(1)
Keep in mind that sometimes you will not be able to completely remove the significant residual bars. A lone significant bar may be due to seasonality. So, don’t look for perfect at this point; look for the best of the fits you have.
Here, we are using the Ljung-Box test (a refinement of the earlier test by Box and Pierce) to assess whether the residuals from the model behave like random noise. In effect, it checks that no autocorrelation remains in the residuals.
The null hypothesis says that the residuals are no different from “white noise”. So, small p-values indicate dependent residuals, and large p-values are consistent with independence.
H0: residuals are randomly distributed
Here, we run the test on each of the three candidate models fit above. If any of these models rejects the null, we should consider it problematic.
Box.test(ar1$residuals, type="Ljung-Box")
##
## Box-Ljung test
##
## data: ar1$residuals
## X-squared = 0.066121, df = 1, p-value = 0.7971
Box.test(ar2$residuals, type="Ljung-Box")
##
## Box-Ljung test
##
## data: ar2$residuals
## X-squared = 0.42171, df = 1, p-value = 0.5161
Box.test(ar3$residuals, type="Ljung-Box")
##
## Box-Ljung test
##
## data: ar3$residuals
## X-squared = 0.0014606, df = 1, p-value = 0.9695
In each case, the p-value is far from significant. We, therefore, fail to reject the null that the residuals are randomly distributed and conclude that the ARIMA model, as specified, is a decent fit.
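By default, Box.test() only examines the first lag. A stricter, commonly used variant tests many lags at once and adjusts the degrees of freedom for the number of fitted ARMA coefficients (a sketch; fitdf = p + q, so 2 for the ARIMA(1,1,1) model):
Box.test(ar2$residuals, lag=20, type="Ljung-Box", fitdf=2)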
For a more fine-grained look into model fit, consider the sarima() function from the astsa package. It mimics the basic diagnostic plots that we use to assess regression models.
library(astsa)
sarima(dif, 0, 1, 1)
## initial value 6.280831
## iter 2 value 6.232351
## iter 3 value 6.231063
## iter 4 value 6.231041
## iter 4 value 6.231041
## iter 4 value 6.231041
## final value 6.231041
## converged
## initial value 6.231189
## iter 2 value 6.231188
## iter 2 value 6.231188
## iter 2 value 6.231188
## final value 6.231188
## converged
## $fit
##
## Call:
## stats::arima(x = xdata, order = c(p, d, q), seasonal = list(order = c(P, D,
## Q), period = S), xreg = constant, transform.pars = trans, fixed = fixed,
## optim.control = list(trace = trc, REPORT = 1, reltol = tol))
##
## Coefficients:
## ma1 constant
## -0.3370 -2.4175
## s.e. 0.0503 16.6833
##
## sigma^2 estimated as 258353: log likelihood = -3128.9, aic = 6263.8
##
## $degrees_of_freedom
## [1] 407
##
## $ttable
## Estimate SE t.value p.value
## ma1 -0.3370 0.0503 -6.6963 0.0000
## constant -2.4175 16.6833 -0.1449 0.8849
##
## $AIC
## [1] 15.31492
##
## $AICc
## [1] 15.315
##
## $BIC
## [1] 15.34436
sarima(dif, 1, 1, 1)
## initial value 6.282031
## iter 2 value 6.262993
## iter 3 value 6.235084
## iter 4 value 6.234827
## iter 5 value 6.231217
## iter 6 value 6.229404
## iter 7 value 6.229257
## iter 8 value 6.228977
## iter 9 value 6.228966
## iter 10 value 6.228964
## iter 10 value 6.228964
## iter 10 value 6.228964
## final value 6.228964
## converged
## initial value 6.227843
## iter 2 value 6.227842
## iter 3 value 6.227842
## iter 3 value 6.227842
## iter 3 value 6.227842
## final value 6.227842
## converged
## $fit
##
## Call:
## stats::arima(x = xdata, order = c(p, d, q), seasonal = list(order = c(P, D,
## Q), period = S), xreg = constant, transform.pars = trans, fixed = fixed,
## optim.control = list(trace = trc, REPORT = 1, reltol = tol))
##
## Coefficients:
## ar1 ma1 constant
## 0.3983 -0.7051 -2.7383
## s.e. 0.1377 0.1111 12.3253
##
## sigma^2 estimated as 256579: log likelihood = -3127.53, aic = 6263.07
##
## $degrees_of_freedom
## [1] 406
##
## $ttable
## Estimate SE t.value p.value
## ar1 0.3983 0.1377 2.8914 0.0040
## ma1 -0.7051 0.1111 -6.3459 0.0000
## constant -2.7383 12.3253 -0.2222 0.8243
##
## $AIC
## [1] 15.31312
##
## $AICc
## [1] 15.31327
##
## $BIC
## [1] 15.35238
sarima(dif, 0, 1, 2)
## initial value 6.280831
## iter 2 value 6.231571
## iter 3 value 6.230073
## iter 4 value 6.229920
## iter 5 value 6.229917
## iter 6 value 6.229917
## iter 6 value 6.229917
## iter 6 value 6.229917
## final value 6.229917
## converged
## initial value 6.230084
## iter 2 value 6.230083
## iter 2 value 6.230083
## iter 2 value 6.230083
## final value 6.230083
## converged
## $fit
##
## Call:
## stats::arima(x = xdata, order = c(p, d, q), seasonal = list(order = c(P, D,
## Q), period = S), xreg = constant, transform.pars = trans, fixed = fixed,
## optim.control = list(trace = trc, REPORT = 1, reltol = tol))
##
## Coefficients:
## ma1 ma2 constant
## -0.3292 -0.0553 -2.5583
## s.e. 0.0493 0.0585 15.4803
##
## sigma^2 estimated as 257773: log likelihood = -3128.45, aic = 6264.9
##
## $degrees_of_freedom
## [1] 406
##
## $ttable
## Estimate SE t.value p.value
## ma1 -0.3292 0.0493 -6.6782 0.0000
## ma2 -0.0553 0.0585 -0.9455 0.3450
## constant -2.5583 15.4803 -0.1653 0.8688
##
## $AIC
## [1] 15.3176
##
## $AICc
## [1] 15.31775
##
## $BIC
## [1] 15.35686
Where do you see the best fit?
In this case, let’s go with ar2, the ARIMA(1,1,1) model, since it had the fewest transgressions.
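If you would rather not load astsa, base R offers a similar one-shot diagnostic panel (my suggestion, not part of the original workflow):
tsdiag(ar2) # Standardized residuals, residual ACF, and Ljung-Box p-values by lag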
To forecast and plot:
forecast <- predict(ar2, n.ahead=60)
upper <- forecast$pred + 1.96*forecast$se
lower <- forecast$pred - 1.96*forecast$se
ts.plot(dif,forecast$pred, col=c("black","red"))
lines(upper, col="blue", lty="solid")
lines(lower, col="blue", lty="solid")
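Optionally, you can label the series so the plot is self-explanatory (my addition):
legend("topleft", legend=c("Observed", "Forecast", "95% interval"),
       col=c("black", "red", "blue"), lty=1, bty="n")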
You can truncate the plot by changing the limits of the x and y axes (xlim= and ylim=). Below, I have changed the graph to show only the years 2000 through 2015 (xlim=c(2000, 2015)). I also wanted to zoom in a little more, so I reduced the range of the y axis (ylim=c(-5000, -500)).
ts.plot(dif,forecast$pred, col=c("black","red"), xlim=c(2000, 2015), ylim=c(-5000, -500))
lines(upper, col="blue", lty="solid")
lines(lower, col="blue", lty="solid")
See what I meant about the value of ARIMA being mainly in the short term?
In this case, we are missing a major piece of information: Seasonality.
Looking back at the decomposition, it is easy to see that there is seasonality present in these data. This is part of the reason why the forecast looks so unimpressive.
To add seasonality, try:
ar4 <- arima(dif, order=c(0, 1, 1), seasonal=list(order=c(0,0,1), period=12))
forecast <- predict(ar4, n.ahead=60)
upper <- forecast$pred + 1.96*forecast$se
lower <- forecast$pred - 1.96*forecast$se
ts.plot(dif,forecast$pred, col=c("black","red"))
lines(upper, col="blue", lty="solid")
lines(lower, col="blue", lty="solid")
To zoom in, as we did above:
ts.plot(dif, forecast$pred, col=c("black","red"), xlim=c(2000, 2015), ylim=c(-5000, -500))
lines(upper, col="blue", lty="solid")
lines(lower, col="blue", lty="solid")
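To check whether the seasonal MA term earned its keep, you can rerun the earlier comparisons on ar4 (my addition, following the same steps as above):
ar4$aic # Compare with ar1 through ar3
Box.test(ar4$residuals, type="Ljung-Box")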
As above, start by importing the data and setting up the three series:
nzm <- read.csv(file.choose(), header=TRUE)
head(nzm)
arr <- ts(nzm[,2], frequency=12, start=c(1978, 4)) # Arrivals from Australia to NZ
dep <- ts(nzm[,3], frequency=12, start=c(1978, 4)) # Departures from NZ to Australia
dif <- ts(nzm[,4], frequency=12, start=c(1978, 4)) # Difference between arr and dep
layout(1:3) # This will show all three plots in the same window
plot.ts(dif)
acf(dif, ylim=c(-1, 1))
pacf(dif, ylim=c(-1, 1))
layout(1) # Change the setting back to one graph per window
I numbered this step “zero” since the plotting is not strictly necessary when you start out with automated techniques.
It is, of course, still a good idea.
Here, we’ll load the forecast package to take advantage of the auto.arima() function. auto.arima() tries out a variety of candidate models and uses an information criterion (AICc, by default) to evaluate which one is best.
For more information about the function, try the help section ?forecast::auto.arima, or google the function’s name.
library(forecast)
##
## Attaching package: 'forecast'
## The following object is masked from 'package:astsa':
##
## gas
Auto.Fit <- auto.arima(dif)
Next, get the details on ARIMA(p, d, q)(P, D, Q).
Auto.Fit
## Series: dif
## ARIMA(2,0,3)(1,1,1)[12]
##
## Coefficients:
## ar1 ar2 ma1 ma2 ma3 sar1 sma1
## 0.1515 0.7440 0.5133 -0.1515 0.2119 0.1003 -0.7853
## s.e. 0.0709 0.0683 0.0800 0.0641 0.0548 0.0775 0.0528
##
## sigma^2 estimated as 129466: log likelihood=-2909.5
## AIC=5834.99 AICc=5835.36 BIC=5866.89
In this case, you may notice that the model is not elegant. Recall that p + q should not be greater than 3 if you can help it, and neither p nor q should be greater than 2.
Although these are not necessarily terminal transgressions, the rule of thumb is that they should be avoided if possible.
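If the inelegant order bothers you, one remedy is to constrain the automated search (a sketch; max.p= and max.q= are real auto.arima() arguments, but the constrained refit and the name Auto.Fit.small are my own additions):
Auto.Fit.small <- auto.arima(dif, max.p=2, max.q=2) # Cap the non-seasonal orders
Auto.Fit.small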
Beyond that, we can evaluate the model as it stands.
First, let’s check the model fit using AIC and compare it to the fit we got in the earlier work. (Strictly speaking, AIC values are only comparable between models fit to the same differenced series, so treat this comparison as rough guidance.)
Auto.Fit$aic # Auto.ARIMA model
## [1] 5834.995
ar1$aic # Earlier models, done manually
## [1] 6261.825
ar2$aic
## [1] 6261.116
ar3$aic
## [1] 6262.927
Remember, lower values of AIC are better.
Then, run the portmanteau test, as above. (Without the type= argument, Box.test() defaults to the original Box-Pierce version, which is what appears in the output below.)
Box.test(Auto.Fit$residuals)
##
## Box-Pierce test
##
## data: Auto.Fit$residuals
## X-squared = 0.0079705, df = 1, p-value = 0.9289
You can also look at ACF and PACF plots.
layout(1:2)
acf(Auto.Fit$residuals, lag.max=20, ylim=c(-1, 1))
pacf(Auto.Fit$residuals, lag.max=20, ylim=c(-1, 1))
layout(1)
plot(forecast(Auto.Fit))
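By default, forecast() picks its own horizon (two seasonal cycles for a seasonal monthly model). To match the 60-month horizon used in the manual forecasts above (my addition):
plot(forecast(Auto.Fit, h=60))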
It is over so quickly. (Bit of a letdown, eh?)
auto.arima() is a good tool, but it will not always find the best model. It is good practice to try both approaches to be sure.
Some useful resources:
You may like the Global Temperature Data from 1750 to present, or you may just download the csv that I prepared from these data in 2020.
Global temperature reconstruction
http://www.statmethods.net/advstats/timeseries.html
http://a-little-book-of-r-for-time-series.readthedocs.org/en/latest/src/timeseries.html#arima-models