1. Introduction

In this section, we will introduce the ARIMA model for forecasting.

ARIMA stands for Autoregressive Integrated Moving Average. ARIMA is often considered the most advanced of the traditional forecasting models.

Like exponential smoothing, ARIMA is the generic name for a family of forecasting models. Because of its complexity, it is best to explain ARIMA step by step.

In this section, we will explain:

  1. The concept of stationary series, or processes
  2. Random Walk processes
  3. Autoregressive processes
  4. Moving Average processes
  5. Model identification

2. Stationary Processes

One of the main assumptions of the ARIMA family of models is that the series follows a stationary process. A stationary process, in a nutshell, implies that we are analyzing:

  1. A series with a constant mean, and
  2. A series with a constant variance.

Most time series, however, are not stationary!

If the time series is not stationary, then we can often transform the series to make it stationary.

Although seasonality also violates stationarity, this is usually explicitly incorporated into the time series model. In short, the main criteria to conclude that our series is stationary are a constant mean and a constant variance.
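For reference, these criteria can be stated more formally: a weakly stationary process has a constant mean and variance, and autocovariances that depend only on the lag k, not on the time t:

\[ E(Y_t) = \mu, \qquad Var(Y_t) = \sigma^2, \qquad Cov(Y_t, Y_{t-k}) = \gamma_k \quad \text{for all } t \]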

We can use the arima.sim() function of the stats package to generate a stationary series.

The function simulates time series based on the three components of an ARIMA process.

  1. The Autoregressive process describes the relation between an observation and its past p lags.
  2. The Integrated process describes the process of differencing the series with its past d lags, in order to achieve stationarity.
  3. The moving average process describes the relationship between an error term and its past q lags.

In most texts, the three parameters are indicated by p, d, and q.

The parameters signify the number of autoregressive lags, the level of differencing, and the number of moving average lags. They also indicate the order of the ARIMA process.

For example, an ARIMA(1,0,0) process is equivalent to an AR(1) model, in which the series is regressed on its first lag. Other basic models are the ARIMA(0,1,0) and ARIMA(0,0,1) models. But for many real time series, better forecasts are based on more complex models of higher order, e.g., ARIMA(2,1,1) models. The order of the three parameters follows the logic of the name of the model: first AR, then I, and then MA!

We can simulate an ARIMA(1,0,0) process as follows. The series has a length of 200, and one autoregressive lag. The one and only autoregressive parameter (ar) is 0.5.

set.seed(12345)
ar1 <- arima.sim(model = list(order = c(1,0,0), ar = 0.5), n = 200)
library(TSstudio)
ts_plot(ar1, 
        title = "Stationary Time Series",
        Ytitle = "Value",
        Xtitle = "Index",
        slider=TRUE)

Likewise, we can simulate an AR(2) model.

set.seed(12345)
ar2 <- arima.sim(model = list(order = c(2,0,0), ar = c(0.5,0.3)), n = 200)
library(TSstudio)
ts_plot(ar2, 
        title = "Stationary Time Series",
        Ytitle = "Value",
        Xtitle = "Index",
        slider=TRUE)

For the simulated process to be stationary, the AR coefficients must satisfy the stationarity conditions; in particular, for an AR(2) process the sum of the coefficients must be less than 1. If this is not the case, arima.sim() returns an error. For example:

ar_nonstationary <- arima.sim(model = list(order = c(2,0,0), ar = c(0.3,0.8)), n = 200)

will return an error message, since \(0.3 + 0.8 > 1\).
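If you want to see the error without stopping your script, a minimal sketch is to wrap the call in tryCatch():

# Wrapping the call in tryCatch() lets us inspect the error message
# without halting the script.
tryCatch(
  arima.sim(model = list(order = c(2,0,0), ar = c(0.3,0.8)), n = 200),
  error = function(e) message("arima.sim() error: ", conditionMessage(e))
)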

3. Random Walk

A random walk refers to a probability process with the following traits:

  1. The starting value is known
  2. The walk (or pattern) of the series is defined by the following equation:

\[ Y_t = Y_{t-1} + \epsilon_t \]

In this formula, \(Y_t\) and \(Y_{t-1}\) are the values of the series at times t and t-1, and \(\epsilon_t\) is a random number referred to as white noise.

Intuitively, people tend to assume that random walks generate stationary series, but this is not the case! This can be easily deduced from rearranging the formula:

\[ Y_t - Y_{t-1} = \epsilon_t \]

Since \(\epsilon_t\) is defined as random error or white noise, we can deduce that the first difference of a random walk process generates white noise.

Random walk models are often used for non-stationary data, particularly in economics.

Random walks typically have:

  1. Long periods of apparent trends upwards or downwards
  2. Sudden, unpredictable changes in direction

We can use arima.sim() to generate a random walk model. For reproducibility, we use the set.seed() command.

Experiment for yourself with the following code:

set.seed(888)
rw <- arima.sim(model = list(order = c(0, 1, 0)), n = 500)
ts_plot(rw,
        title = "Random Walk",
        Xtitle = "t",
        Ytitle = "Value")
ts_plot(diff(rw),
        title = "First-Difference Random Walk",
        Xtitle = "t",
        Ytitle = "Value")
library(plotly)
p1 <- plot_ly() # initiate plotly graph
p2 <- plot_ly() # initiate plotly graph

for(i in 1:5){  # generate 5 random walks
  rw <- arima.sim(model = list(order = c(0, 1, 0)), n = 200)
  p1 <- p1 %>% add_lines(x = time(rw), y = as.numeric(rw))
  p2 <- p2 %>% add_lines(x = time(diff(rw)), y = as.numeric(diff(rw)))
}
# 5 random walks
p1 %>% layout(title = "Simulate Random Walk",
              yaxis = list(title = "Value"),
              xaxis = list(title = "Index")) %>% hide_legend()
# stationary series of the first differences of the random walks
p2 %>% layout(title = "Simulate Random Walk with First-Order Differencing",
              yaxis = list(title = "Value"),
              xaxis = list(title = "Index")) %>% hide_legend()

Is my series stationary?

In practice, of course, we have to work the other way around: we don't want to generate stationary series; rather, we have to assess whether a given series is stationary.

We can do so either by looking at the (transformed) time series, and/or by using a statistical test.

For illustration, let’s use the AirPassengers data set available in R.

data(AirPassengers)
ts_plot(AirPassengers,
        title = "Monthly Airline Passenger Numbers 1949-1960",
        Ytitle = "Thousands of Passengers",
        Xtitle = "Year")

It is easy to see that the criteria for stationarity are violated. There is a clear upward trend, so the mean of the series is not constant, and the seasonal swings grow over time, so the variance is not constant either.

We can transform the data to get a stationary series. If the trend in the series is linear, then we can compute the first differences, using the diff() function from the stats package.

In order to deal with the increasing variance, we can use a log-transformation.

# First differences
ts_plot(diff(AirPassengers),
        title = "Monthly Airline Passenger Numbers 1949-1960",
        Ytitle = "Thousands of Passengers",
        Xtitle = "Year")
# Log-transformation
ts_plot(log(AirPassengers),
        title = "Monthly Airline Passenger Numbers 1949-1960",
        Ytitle = "Thousands of Passengers",
        Xtitle = "Year")

As we can tell from the figures above, differencing and log-transformation by themselves each solve only part of the problem. After differencing, the mean of the series is constant, but the variance isn't. After taking the log of the series, the variance is constant but the mean isn't.

It makes sense to use both: we take the first differences of the log of the data.

# First differences and log-transformation
ts_plot(diff(log(AirPassengers)),
        title = "Monthly Airline Passenger Numbers 1949-1960",
        Ytitle = "Thousands of Passengers",
        Xtitle = "Year")

The plot of the time series now looks stationary!

But is there a way to test it, rather than conclude it from visual inspection only?

You can use a simple statistical test to check whether the number of air passengers in the first half of the series (1949-1954) is significantly lower than in the second half (1955-1960).
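As a rough illustration (not a formal stationarity test), we could split the series into two halves with window() and compare their means with a t-test:

# Split the series into two halves and compare their means;
# a clearly different mean is evidence against stationarity.
first_half  <- window(AirPassengers, end = c(1954, 12))
second_half <- window(AirPassengers, start = c(1955, 1))
t.test(first_half, second_half)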

But in time series analysis we prefer using specific tests which are based on the so-called unit root.

What do we mean by unit root? In simple words, if a time series can be made stationary by differencing, it is said to contain a unit root. In essence, this means that the current value of a series \(y_t\) is equal to its last value \(y_{t-1}\) plus an error \(\epsilon_t\).

There are several tests for stationarity, and related functions in R packages to apply these tests.

The most often used tests are the Augmented Dickey-Fuller (ADF) test and the Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test.
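For completeness, a minimal sketch of the ADF test, using the ur.df() function from the urca package (note that, unlike the KPSS test, its null hypothesis is that the series contains a unit root, i.e. is non-stationary):

library(urca)
# ADF test on the transformed series; the null hypothesis is a unit root
adf <- ur.df(diff(log(AirPassengers)), type = "drift")
summary(adf)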

Below, we will use the ur.kpss() function from the urca package. In this test, the null hypothesis is that the data are stationary, and we test for evidence that the null hypothesis is not true.

Since the plot of the raw air passenger data (before transformation) indicates a non-constant mean and variance, we expect the KPSS test to reject the null hypothesis that the data are stationary. After transformation (taking the first differences of the log of the data), our hope is that the data are stationary - although a seasonal pattern remains.

library(urca)
m1 <- ur.kpss(AirPassengers); summary(m1)             # Untransformed data
## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 4 lags. 
## 
## Value of test-statistic is: 2.7395 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739
m2 <- ur.kpss(diff(log(AirPassengers))); summary(m2)  # Transformed data
## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 4 lags. 
## 
## Value of test-statistic is: 0.0282 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739

For the untransformed data, the value of the test statistic (2.7395) is much higher than the critical value at the 1% significance level (0.739), leading us to reject the null hypothesis that the series is stationary.

After transformation of the data, the test statistic (0.0282) is well below even the 10% critical value, and we do not reject the null hypothesis that the series is stationary. We can proceed to the next steps!

4. Model Identification

We now know the basics of ARIMA models. We have introduced the three types of processes and parameters (AR; Integration, or differencing; and MA). And we have emphasized the importance of starting with stationary series. We have also discussed the basics of using transformations of your data to make non-stationary series stationary.

We will now take a closer look at the AR and MA processes. We have learned to simulate AR processes, and we can simulate MA processes in a similar manner.

We will go back to a simulated AR process, and see how we can identify these processes.

AR models

An AR(p) process follows the following equation:

\[ AR(p): Y_t = c + \sum_{i = 1}^{p}{\phi_i*Y_{t-i}} + \epsilon_t \]

In words, an observation Y at time t is a linear combination of its p lags. The constant c is called the drift. \(\phi_i\) is the coefficient of the i-th lag of the series.

For p = 1, we get the AR(1) process with drift c. Note that the random walk of the previous section is the special case \(\phi_1 = 1\) (with c = 0); for a stationary AR(1) process we require \(-1 < \phi_1 < 1\):

\[ AR(1): Y_t = c + {\phi_1*Y_{t-1}} + \epsilon_t \] \[-1 < \phi_1 < 1\]

Likewise, the formula for an AR(2) process reads:

\[ AR(2): Y_t = c + {\phi_1*Y_{t-1}} + {\phi_2*Y_{t-2}} + \epsilon_t \] \[-1 < \phi_1 + \phi_2 < 1\]

Below, we will first simulate a time series of 500 observations that follows an AR(2) process, and then estimate the model.

set.seed(121)
ar2 <- arima.sim(model = list(order = c(2,0,0), ar = c(0.9, -0.3)), n = 500) 
ts_plot(ar2, 
        title = "Simulate AR(2) Series",
        Ytitle = "Value",
        Xtitle = "Index")

Since \(\phi_1 + \phi_2 = 0.9 - 0.3 = 0.6\), which lies between -1 and 1, the series is stationary.

Of course, in practice we do not know which ARIMA(p,d,q) has generated the data, and we have to find a strategy for identifying the parameters p, d and q that fit the data.

To that end, we will look at the autocorrelation and partial autocorrelation functions (ACF and PACF). These functions have a typical shape for AR(p) and MA(q) processes.

For an AR(2) process, the ACF and PACF look like:

acf(ar2)

pacf(ar2)

If the ACF tails off, and the PACF cuts off at lag p, then we are dealing with an AR(p) process. Indeed, the PACF is outside of the blue confidence bounds at lags 1 and 2, and the ACF decreases gradually with the number of lags, suggesting an AR(2) process.

We can then fit the model using the ar() function from the stats package. We get the following result:

model_ar2_1 <- ar(ar2)
model_ar2_1
## 
## Call:
## ar(x = ar2)
## 
## Coefficients:
##       1        2        3        4  
##  0.8469  -0.2741  -0.0698   0.0921  
## 
## Order selected 4  sigma^2 estimated as  0.9605
model_ar2_2 <- ar(ar2, order.max = 2)
model_ar2_2
## 
## Call:
## ar(x = ar2, order.max = 2)
## 
## Coefficients:
##       1        2  
##  0.8452  -0.2949  
## 
## Order selected 2  sigma^2 estimated as  0.9649

The ar() function by default selects the order itself, picking the order with the lowest AIC (up to order.max). In our example, an AR(4) model fits the data slightly better than the AR(2) model, and the function therefore automatically selects AR(4) - even though the ACF and PACF give no reason to do so. Based on the ACF and PACF, we can opt for a simpler model and set the order of the model to 2, for an AR(2) process. The estimated residual variance (\(\sigma^2\) in the output) then increases slightly, i.e. the fit is slightly worse.

This is always the case with regression analysis: the more explanatory variables (here: the higher the number of lags), the better the model fit will be. The main question, however, is whether the more complex model also leads to better forecasts. Overfitting typically leads to models that better fit the data, but don't perform any better (or even worse) when it comes to forecasting. In general, it is better to select simpler models (with fewer lags).
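One way to make this trade-off explicit is to fit both candidate models with arima(), which reports the AIC; a minimal sketch:

# Fit the AR(2) and AR(4) candidates with arima() and compare their AIC values;
# the lower the AIC, the better the balance between fit and complexity.
fit_ar2 <- arima(ar2, order = c(2, 0, 0))
fit_ar4 <- arima(ar2, order = c(4, 0, 0))
AIC(fit_ar2, fit_ar4)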

MA models

We can do something similar, for MA(q) models.

The formula for MA(q) models is as follows.

\[ MA(q): Y_t = \mu + \epsilon_t + \sum_{i = 1}^{q}{\theta_i*\epsilon_{t-i}} \]

In this formula, \(\mu\) (pronounced as mu) represents the mean of the series. \(\theta_i\) (pronounced as theta) is the regression coefficient for the error term at lag i.

For an MA(2) we have:

\[ MA(2): Y_t = \mu + \epsilon_t + {\theta_1*\epsilon_{t-1}} + {\theta_2*\epsilon_{t-2}} \]

In a similar manner as we have done for an AR(2) process, we will simulate an MA(2) process.

set.seed(434)
ma2 <- arima.sim(model = list(order = c(0,0,2), ma = c(0.5, -0.3)), n = 500) 
ts_plot(ma2, 
        title = "Simulate MA(2) Series",
        Ytitle = "Value",
        Xtitle = "Index")

Obviously, it is hard to tell from the graph alone what kind of process (AR(p), MA(q), or both) generated the data. As for AR processes, we can inspect the ACF and PACF. For an MA(q) process the pattern is reversed: the ACF cuts off after lag q, while the PACF tails off.

acf(ma2)

pacf(ma2)
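There is no ma() counterpart to the ar() function in the stats package, but as a sketch we can fit the MA(2) model with arima(), specifying the order explicitly:

# Fit an MA(2) model to the simulated series; the estimated coefficients
# should be close to the values used in the simulation (0.5 and -0.3).
model_ma2 <- arima(ma2, order = c(0, 0, 2))
model_ma2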
