Overview

Today in class we introduced ARIMA (autoregressive integrated moving average) models, fit them to several data sets, and did some forecasting with them.

For more information on ARIMA models, please see sections 9.1, 9.3, 9.4, and 10.3 from the book.

We need to call these libraries so that we can use different functions later on.

library(quantmod)
library(tseries)
library(timeSeries)
library(forecast)
library(xts)

The Models

Autoregressive Model (AR)

The first model corresponds to the first two letters of ARIMA: AR. This is an autoregressive model of order p, in which \(y_t\) depends on the previous \(p\) observations, \(y_{t-1}, y_{t-2}, \dots, y_{t-p}\). The AR(p) model has the following form: \[y_t = \phi_1 y_{t-1} + \phi_2 y_{t-2} + \dots + \phi_p y_{t-p} + \epsilon_t\] where the errors are independent and normally distributed, \(\phi_p \neq 0\), \(\phi_1, \dots, \phi_p\) are constants, and \(y_t\) has mean 0, constant variance, and no seasonality.

AR(1) is an autoregressive model with p equal to 1, so each observation depends on the single previous observation. If we choose \(\phi_1=0.9\), the equation becomes \(y_t=0.9y_{t-1} + \epsilon_t\). Because \(\phi_1\) is positive, two consecutive observations are positively correlated.

If we instead choose \(\phi_1=-0.9\), the equation becomes \(y_t=-0.9y_{t-1} + \epsilon_t\). Because \(\phi_1\) is negative, two consecutive observations are negatively correlated. This makes the plot more “wiggly” (more back and forth).
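As a quick numeric check of these sign claims, we can compute the lag-1 sample autocorrelation of simulated series with acf (the seed and sample size here are arbitrary):

set.seed(1)
pos <- arima.sim(list(order = c(1,0,0), ar =  .9), n = 500)
neg <- arima.sim(list(order = c(1,0,0), ar = -.9), n = 500)
acf(pos, plot = FALSE)$acf[2] # lag-1 autocorrelation, close to +0.9
acf(neg, plot = FALSE)$acf[2] # close to -0.9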

Now let’s plot both of these AR(1) models by simulating fake data.

We simulate the data with the arima.sim() command, specifying the AR(1) model via order = c(1,0,0); the first entry is the AR order p, which we set to 1. We specify \(\phi_1\) with the ar argument.

par(mfrow=c(2,1))
plot(arima.sim(list(order=c(1,0,0), ar=.9), n=200), ylab = "y", main = expression(AR(1) ~~~ phi == +.9)) # for phi = 0.9
abline(0,0) # horizontal reference line at the mean of 0
plot(arima.sim(list(order=c(1,0,0), ar=-.9), n=200), ylab = "y", main = expression(AR(1) ~~~ phi == -.9)) # for phi = -0.9
abline(0,0)

Let’s now fit a model to simulated AR(1) data using the auto.arima function. This is similar to our lm function, but for ARIMA models!

set.seed(2115)
fakedata <- arima.sim(list(order=c(1,0,0),ar=.9) , n=200)
plot(fakedata)

mod1 <- auto.arima(fakedata)
mod1  
## Series: fakedata 
## ARIMA(1,0,0) with zero mean 
## 
## Coefficients:
##          ar1
##       0.9266
## s.e.  0.0285
## 
## sigma^2 estimated as 1.004:  log likelihood=-284.66
## AIC=573.32   AICc=573.38   BIC=579.92

Looking at the top of the output, we did get AR(1) as our model: ARIMA(1,0,0) with zero mean. The zero mean makes sense because we don’t really see any trend in our data. The output gives a point estimate of 0.9266 for \(\phi_1\), with its standard error right below it.

Our fitted equation is \(y_t=0.9266y_{t-1} + \epsilon_t\).
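We can also pull that estimate straight out of the fitted model object with coef, which works on the Arima objects that auto.arima returns:

coef(mod1) # named vector of estimated coefficients; here just ar1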

Now let’s forecast using the model we just created. We’ll forecast one day into the future. We can also plot the prediction, as well as a prediction interval.

(mycast1 <- forecast(mod1, h=1)) # includes prediction intervals
##     Point Forecast     Lo 80     Hi 80     Lo 95     Hi 95
## 201      -6.557273 -7.841365 -5.273181 -8.521123 -4.593423
plot(mycast1)

We can also plot the forecast for a ten-day prediction.

plot(forecast(mod1, h=10)) # 10 days into the future
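If we want the forecast numbers themselves rather than a plot, the forecast object stores its pieces in named components (a short sketch; fc10 is just a name chosen here):

fc10 <- forecast(mod1, h = 10)
fc10$mean  # point forecasts for the next 10 steps
fc10$lower # lower 80% and 95% prediction bounds
fc10$upper # upper 80% and 95% prediction bounds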

Moving Average Model (MA)

The second model corresponds to the last two letters of ARIMA: MA. This is a moving average model of order q, in which \(y_t\) depends on the previous \(q\) error terms, \(\epsilon_{t-1}, \epsilon_{t-2}, \dots, \epsilon_{t-q}\). The MA(q) model has the following form: \[y_t = \theta_1\epsilon_{t-1} + \theta_2\epsilon_{t-2} + \dots + \theta_q\epsilon_{t-q} + \epsilon_t\] where the errors are independent and normally distributed, \(\theta_q \neq 0\), \(\theta_1, \dots, \theta_q\) are constants, and \(y_t\) has mean 0, constant variance, and no seasonality.

While the overall mean is 0, if we take the average of one point and the previous point, that average is not zero at each point - so the local average changes over time (moving average!).
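One way to see this lag-to-lag dependence is the sample autocorrelation function: for an MA(1) series, only the lag-1 autocorrelation should stand out. A quick sketch (seed and sample size arbitrary):

set.seed(1)
ma1 <- arima.sim(list(order = c(0,0,1), ma = .9), n = 500)
acf(ma1) # expect one clear spike at lag 1, then roughly nothing after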

Let’s simulate fake data again to create our model.

par(mfrow = c(2,1))
plot(arima.sim(list(order=c(0,0,1), ma=.9), n = 200) , ylab = "y", main = (expression(MA(1) ~~~ theta ==  +.9))) #positive correlation
plot(arima.sim(list(order=c(0,0,1), ma=-.9), n = 200) , ylab = "y", main = (expression(MA(1) ~~~ theta ==  -.9))) #negative correlation

Now let’s fit a model using simulated MA(1) data.

Notice now that the order is c(0,0,1), as the last number corresponds to the value of q. Remember that the first number corresponds to the value of p.

set.seed(2.718)
ma.data <- arima.sim(list(order=c(0,0,1), ma=.9), n = 200)
### last number is q in order, first number is p
mod2 <- auto.arima(ma.data)
mod2
## Series: ma.data 
## ARIMA(1,0,2) with zero mean 
## 
## Coefficients:
##           ar1    ma1     ma2
##       -0.8354  1.723  0.7496
## s.e.   0.2317  0.237  0.2105
## 
## sigma^2 estimated as 1.143:  log likelihood=-296.35
## AIC=600.71   AICc=600.91   BIC=613.9

What’s interesting about these results is that we got an ARIMA(1,0,2) even though we simulated the data from an MA(1), i.e. ARIMA(0,0,1): auto.arima picks whichever orders best optimize its information criterion, which need not be the true ones.
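If we instead want to force the true MA(1) structure rather than letting auto.arima search, we can fit a fixed order with the forecast package’s Arima function and compare fits (a sketch; the AIC values depend on the simulated data):

mod2.fixed <- Arima(ma.data, order = c(0,0,1)) # force ARIMA(0,0,1), i.e. MA(1)
mod2.fixed$aic # compare with mod2$aic from the automatic search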

Let’s forecast using the model we just created.

par(mfrow=c(2,1))
plot(forecast(mod2, h=1)) # 1 day into the future
forecast(mod2, h=1)
##     Point Forecast     Lo 80    Hi 80     Lo 95    Hi 95
## 201      0.3159447 -1.054301 1.686191 -1.779666 2.411555
plot(forecast(mod2, h=10)) # 10 days into the future

Autoregressive Moving Average Model (ARMA)

The penultimate model we went over is the autoregressive moving average model, or ARMA (now all we’re missing is I!). This is a combination of both AR and MA, so… mind your p’s and q’s!

The ARMA(p, q) model has the following form, under the same assumptions as both the AR and the MA models: \[y_t = \phi_1 y_{t-1} + \phi_2 y_{t-2} + \dots + \phi_p y_{t-p} + \theta_1\epsilon_{t-1} + \theta_2\epsilon_{t-2} + \dots + \theta_q\epsilon_{t-q} + \epsilon_t\]

ARMA(1,1) would have the following form: \[y_t = \phi_1 y_{t-1} + \theta_1\epsilon_{t-1} + \epsilon_t\]

We can simulate data, create our model, and forecast just like above.

set.seed(1234)
arma.data <- arima.sim(list(order=c(1,0,1), ma=.9, ar = .9), n = 200)
mod3 <- auto.arima(arma.data)
mod3
## Series: arma.data 
## ARIMA(4,0,4) with non-zero mean 
## 
## Coefficients:
##          ar1      ar2     ar3      ar4     ma1     ma2      ma3      ma4
##       1.1663  -0.5789  0.8841  -0.5949  0.7172  0.0696  -0.4784  -0.6420
## s.e.  0.3502   0.5705  0.3427   0.1475  0.3347  0.2136   0.2604   0.1482
##        mean
##       1.989
## s.e.  0.372
## 
## sigma^2 estimated as 0.9662:  log likelihood=-278.04
## AIC=576.09   AICc=577.25   BIC=609.07
plot(forecast(mod3, h=1)) # 1 day into the future

forecast(mod3, h=1)
##     Point Forecast    Lo 80    Hi 80     Lo 95    Hi 95
## 201       2.299482 1.039752 3.559213 0.3728907 4.226074
plot(forecast(mod3, h=10)) # 10 days into the future

Again, auto.arima selected something other than what we simulated: ARIMA(4,0,4) instead of the ARIMA(1,0,1) we used to generate the data.

ARIMA!!!

Finally, we get to the ARIMA model, where the “I” stands for integrated. We want our time series to be stationary: constant mean (here, 0) and constant variance over time. If we don’t have this, we can take the difference between consecutive observations (d=1). If that yields a stationary time series, we can stop there. If not, we can use second-order differences, i.e. the difference of the differences (d=2).

d is the middle number in the ARIMA(p, d, q) specification.
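To see differencing in action, we can simulate a random walk, which is not stationary, and difference it once; the forecast package’s ndiffs function also estimates d directly. A sketch with arbitrary simulated data:

set.seed(42)
rw <- ts(cumsum(rnorm(200))) # a random walk: wanders, so not stationary
ndiffs(rw)                   # typically suggests d = 1
plot(diff(rw))               # the first differences look stationary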

An Example

To demonstrate ARIMA models, I will be using the gas dataset from the forecast package. This dataset includes Australian monthly gas production from 1956 to 1995.

library(forecast)
data(gas)

We’ll first plot the data.

plot(gas)

We can now create a model using the auto.arima function, which automatically searches for the best ARIMA model for a data set. (Note that the fit below uses WWWusage, a base R time series of the number of internet users connected each minute, rather than the gas series plotted above.)

mod4 <- auto.arima(WWWusage)
mod4
## Series: WWWusage 
## ARIMA(1,1,1) 
## 
## Coefficients:
##          ar1     ma1
##       0.6504  0.5256
## s.e.  0.0842  0.0896
## 
## sigma^2 estimated as 9.995:  log likelihood=-254.15
## AIC=514.3   AICc=514.55   BIC=522.08

We can see that R created an ARIMA(1,1,1) model: first-order differencing (d=1) with one autoregressive term and one moving average term.
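Writing \(y'_t = y_t - y_{t-1}\) for the first-differenced series, the fitted equation is approximately \[y'_t = 0.6504\,y'_{t-1} + 0.5256\,\epsilon_{t-1} + \epsilon_t\]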

We can now forecast and plot the forecast.

mycast <- forecast(mod4,h=20)
plot(mycast)
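Before trusting these forecasts, it’s worth checking that the residuals behave like white noise. The checkresiduals function (available in recent versions of the forecast package) gives a residual plot, an ACF, and a Ljung-Box test in one call:

checkresiduals(mod4) # residuals should look uncorrelated, roughly normal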