The main purpose of this workshop is to illustrate the foundations of Autoregressive Integrated Moving Average Seasonal models (ARIMA/SARIMA). We will learn by doing. We will work with a model for a macro-economic index, the IGAE (Índice General de la Actividad Económica), published by INEGI.
1 General directions for this Workshop
You will work in RStudio. Create an R Notebook document (File -> New File -> R Notebook), where you have to write whatever is asked in this workshop.
At the beginning of the R Notebook write Workshop 3 - Time Series and your name (as we did in previous workshop).
You have to replicate all the steps explained in this workshop, and ALSO you have to do whatever is asked. Any QUESTION or any STEP you need to do will be written in CAPITAL LETTERS. For ANY QUESTION, you have to RESPOND IN CAPITAL LETTERS right after the question. It is STRONGLY RECOMMENDED that you write your OWN NOTES as if this were your notebook. Your own workshop/notebook will be very helpful for your further study.
You have to keep saving your .Rmd file, and ONLY SUBMIT the .html version of your .Rmd file. Pay attention in class to know how to generate an html file from your .Rmd.
2 Introduction
The main purpose of this workshop is to illustrate what is the an Autoregressive Integrated Moving Average Seasonal model (ARIMA/SARIMA model) and how we can estimate a basic model.
This methodology is part of the Box-Jenkins models for time series, which provide a series of tools to understand past behaviors and predict future values of time series variables.
Given a time series Y_{t}, the ARIMA/SARIMA model can be be denoted as ARIMA(p,d,q) SARIMA(P,D,Q,#annual periods). We will see later what are these parameters of the model (p,d,q,P,D,Q,#annual periods).
In an ARIMA/SARIMA model we try to understand current values of a time-series variable using its own past values and past random shocks; the model does not include other explanatory variables (X’s variables) like in a traditional linear regression model. If we add external explanatory variables to an ARIMA/SARIMA model, then this model is usually called ARIMAX.
Then, the Y_t time-series variable in an ARIMA/SARIMA model is explained by:
Its own past values; these values are called LAGvalues or just LAGS. The value of last period (t-1) is called LAG 1; the value of 2 periods ago (t-2) is called LAG 2, and so forth. These LAGS are called AR(p) terms, where p is number of previous LAGS to be included. For example, AR(2) indicates that we include LAG 1 and LAG 2 values as explanatory variables.
Past random shocks; these values are called LAG errors or LAG shocks. The shock of last period (t-1) is called LAG 1 error or LAG 1 shock; the shock of 2 periods ago (t-2) is called LAG 2 error or LAG 2 shock, and so forth. These LAG errors are called MA(q) terms, where q is the # of previous LAG shocks to be included. For example, MA(2) indicates that we include LAG 1 error and LAG 2 error as explanatory variables.
Its own seasonal values; A season refers to 1 year. These values are called seasonal LAG values. If we have monthly data (12 periods in 1 year), then the value of last year(t-12) is called seasonal LAG 1; the value of 2 years ago (t-24) is called seasonal LAG 2, and so forth. If we have quarterly data, then we have to go back 4 periods for last year (t-4), and so forth. These LAGS are called Seasonal AR(P) terms, where P is the number of previous seasonal LAGS to be included.
Past seasonal random shocks; these values are called seasonal LAG errors or seasonal LAG shocks. If we have monthly data (12 periods in 1 year), the shock of last year (t-12) is called seasonal LAG 1 error or seasonal LAG 1 shock; the shock of 2 years ago (t-24) is called seasonal LAG 2 error or seasonal LAG 2 shock, and so forth. These seasonal LAG errors are called Seasonal MA(Q) terms, where Q is the # of previous seasonal LAG shocks to be included.
Then, the following parameters of an ARIMA/SARIMA model need to be estimated:
p: # of AR terms to be included as explanatory variables
d: # of first differences applied to the original variable
q: # of MA terms to be included as explanatory variables
P: # of Seasonal AR terms to be included as explanatory variables
D: # of Seasonal difference applied to the variable
Q: # of Seasonal MA terms to be included as explanatory variables
The d parameter refers to how many first differences we apply to the original time-series variables to make the series stationary.
The D parameter refers to how many seasonal differences we decide to apply to the variable.
Let’s start with models with only AR terms, AR(p) models, that can be named ARIMA(p,0,0), or ARIMA(p,1,0).
3 The AR(p) model - ARIMA(p,d,0)
In general the autoregressive models AR(p) assume that there is some kind of autocorrelation phenomenon in the time series. The concept of autocorrelation refers to the correlation of the variable today at time t with its past values (t-n). Autocorrelation is the building block behind these models.
In time-series Econometrics we can talk about an autocorrelation in the dependent variable (also called, endogeneous variable). This means the current or future values of an economic or financial variable can be explained somehow by its previous values. This is the foundation of what econometricians have called an Autoregressive Model also denoted as AR(p) model or ARIMA(p,d,0) model.
The AR(p) model for the Y_t variable can be expressed as:
Where \phi_{1},...\phi_{p} are the parameters of the model, \epsilon_{t} refers to the error term or random shock. We can express the same equation as follows:
In order to apply this model we have to prove the series Y_{t} is stationary. We can use the Dicky-Fuller test to decide whether the series is stationary. If the original series is not stationary, we can calculate the first difference of the series, and most of the time the resulting series will become stationary.
In linear regression models, the method to estimate the regression coefficients is the Ordinary Least Squares (OLS). For ARIMA models, the best method to estimate the autoregressive coefficients is the Maximum Likelihood (ML) method. The ML method works with iterations, while OLS has an analytic solution (in othe words, a formula). Fortunately, R can estimate this for you. In this model, we assume that the errors are not correlated among them. If that is the case, that means that we can include another autoregressive term to have a better model.
Now I will explain what is a moving average model.
4 The MA(q) model - ARIMA(0,d,q)
The AR(p) model is a process with a long-term memory. Why? this is because a shock today will impact in the future forever. If you read the mathematical equation, this can be interpreted as follows. The value of the series tomorrow will depend on the value of today, after applying the filter \phi_1. Then, the value of tomorrow already has information about the value of today. And the value of tomorrow will impact the value of 2 days from now. Then, the value of today will have an impact in the whole future of the series.
However, the impact of the value of today will be diminishing exponentially according to (\phi_1)^N, where N is the number of periods in the future. Since |\phi1|<1, then the bigger the exponent N -which refers to a further future period- the smaller the effect.
If we want to model a series with a short-term memory, then we have to think in a different equation:
This is called a Moving-Average model with q terms. In this case, the value of today is not directly affected by its previous own value, but the past random shocks. In the case of q=1, this process does not have long memory since the value of today is only affected by the shock of yesterday.
The more q terms included into the model, then the longer the memory of the model. Most of the cases, real-world series only include very few q terms.
In the MA(q) model, \theta_{0}, … \theta_{q} are the parameters of the model, \epsilon_{t-k} refers to past external errors/shocks, and the current shock is defined as \epsilon_{t}, so:
We can interpret a pure MA model as a random process where the current value of the series is influenced by some percentages of previous random q shocks/errors and the current random shock. The percentages are actually the values of the coefficients \theta.
In Finance, when we want to model the behavior of market returns, we know that there are thousands of events that affect how the market returns are moving today. These thousands of events are basically summarized in a “random shock” that nobody knows what will be its magnitude and direction (positive or negative effect).
Now I will put together both AR and MA model into one: the ARIMA(p,d,q) model.
5 ARIMA(p,d,q) model
Now we can combine both AR(p) terms and MA(q) terms into ONE model, that is called ARMA(p,q):
We combine AR and MA terms to model Y in case Y can be modeled including short-term and long-term memory. The long-term memory is modeled with the AR terms, while the short-term memory is modeled with the MA terms. We will practice with an example to illustrate this theoretical model.
6 Hand-on activity
We will run a first version of an ARIMA-SARIMA model for the IGAE index (Índice General de Actividad Económica) developed by INEGI (Instituto Nacional de Estadística y Geografía). Since 1993 INEGI calculates this index based on the Mexican general economy and different industry sectors.
6.1 Data collection and data management
For practical purposes, I already downloaded the IGAE dataset in Excel. Below I show you how to get this igae.xls file from Canvas, and also from the INEGI site (for getting updated data in the future).
6.2 Downloading from Canvas
Go to Canvas/Week 3 and download the inegi.xls file and copy into the folder where you saved this workshop. Then, go to the subsection Importing the Excel file to continue.
6.3 Downloading from the INEGI site.
If you want to download the most recent IGAE data, follow directions to generate the inegi.xls file:
Go to INEGI and download the IGAE series. Go to the following link:
https://www.inegi.org.mx/temas/igae/#Herramientas
Click in “Banco de Información Económica (BIE)” and then select:
Indicador glogal de la actividad económica, base 2013
Expand the option: “Series originales” and, then expand the option “Índice de volumen físico”.
Check-mark the option “Total (Índice base 2013=100)”
Click the button “Consulta”. You will see tabulated data for the IGAE.
Select oldest year (1993 in “Periodo desde”). Finally click the download button and select the Excel format. The file be downloaded to your computer.
Open the Excel file and drop the first rows before the tabluated data, and only leave the period column and the IGAE Total index. RENAME the second column as IGAE. Review the Period column, and edit if there are some dates with notes.
Save the Excel file as igae.xls in the same folder of your Workshop.
6.4 Importing the Excel file
library(readxl)library(xts)library(zoo)library(tseries)library(forecast)library(astsa)library(lmtest)igae <-read_excel("igae.xlsx")# I convert the igae data frame into a time series R dataset.# This is convenient when working with ARIMA-SARIMA models:igae<-ts(coredata(igae$igae),start=c(1993,1),frequency=12)# The ts function transform the dataset from a data frame to a ts object# I can see the content of IGAE in a table format:igae
Plot the IGAE index to see how it has grown over time
plot(igae)
What do you observe? Compared to previous crisis in Mexico during 1994-1995 and 2008-2009, how does the COVID crisis look like?
We see that the decline of the index for the COVID crisis was much profound compared with previous crisis (2008 mortgage crisis and the internal 1994 crisis)
8 Stationary vs Non-stationary
We first have to find a version of the IGAE series that is a stationary series. For any economic or financial series I strongly recommend to start calculating the logarithm of the variable:
logigae =log(igae)
I can plot the log of the IGAE to have an idea whether it is stationary:
plot(logigae)
It is clear that the log series is not stationary, so I do not need to run a Dicky-Fuller test to check stationarity for the log. I have to do a difference of the series and then do the Dicky-Fuller test.
For financial and economic monthly data I strongly recommend to start calculating the seasonal log difference and check whether this series is stationary. If it is, then use this version for your model.
In case that the seasonal log difference is not stationary, then you can calculate the first difference, and most of the time this first difference will be stationary.
Then, we start checking whether the seasonal log of the IGAE is a stationary series. We start with a plot of this variable, and then run the the corresponding Dicky-Fuller test:
The seasonal log difference of the series is actually the annual % change month by month of the IGAE (in continuously compounded percentages).
Let’s see why the seasonal log difference is the annual % change of the series:
diff(logigae,12) = logigae - lag(logigae,12)
lag(logigae,12) is the value of the log IGAE, but 12 months ago!
In this case the diff function gets the difference from a current value of the log minus the log value 12 months ago. This is actually a % annual change since we are taking the difference with 12 months ago.
It is hard to our eyes to see whether the seasonal log series is stationary. We need to run the Dicky-Fuller test:
adf.test(seasonaldiflog,k=1)
Augmented Dickey-Fuller Test
data: seasonaldiflog
Dickey-Fuller = -5.7989, Lag order = 1, p-value = 0.01
alternative hypothesis: stationary
Since the p-value of the Dicky-Fuller test is 0.01<0.05, then we can reject the null hypothesis and accept the alternative hypothesis. Then we can conclude that the seasonal log of the IGAE is a stationary series. Then we will use this version of the variable to do our model.
9 Autocorrelations (AC) and Partial-autocorrelations (PAC)
Once we determine which version of the variable is stationary, then we need to examine autocorrelations and partial autocorrelations of this variable. We can examine this with a correlogram.
In time series analysis, a correlogram is an chart of correlation statistics also known as an autocorrelation plot. This is a plot of the sample autocorrelations of the dependent variable Y_t and the time lags. In R, the function to compute the autocorrelations is acf2() from the astsa package, which gives you both, the autocorrelation (ACF) figures and the partial-autocorrelations (PACF), as well.
Let’s examine the autocorrelations of the seasonal log of the IGAE.
You need to install the astsa package to do auto-correlation plots.
The acf2() function gives you the autocorrelogram, which is the autocorrelations in a graph where it is easy to identify which lags of the variable have a significant autocorrelation.
The X axis represents the lag numbers, while the Y axis is the level of autocorrelation. The vertical lines are the levels of autocorrelations between the current value of Y (Y_t) and its own lags (Y_{t-1}, Y_{t-2}...).
The dotted, blue line delimits the 95% confidence interval for the autocorrelations. Then, if the autocorrelation goes out of the 95% C.I., then the autocorrelation of the variable Y_t with its corresponding lag Y_{t-1} (AutoCorr(Y_t, Y_{t-1})) is statistically different than zero.
We can see that the LAG 1 (Y_{t-1}) is significantly autocorrelated (positively) with the current value of the variable (Y_t) since this autocorrelation is significantly greater than zero.
What is the meaning of the first significant LAG? In this case, this means that the value of Y at any point in time t has an average autocorrelation about 0.80 with its own value at t-1. We can see this 0.80 as the height of the first vertical line of the ACF plot.
The autocorrelation of Y_t with its previous value Y_{t-1} can be calculated manually as the correlation, which is equal to their covariance divided by the product of their standard deviation.
Remembering the formula for correlation between 2 variables Y and X:
CORR(Y,X)=\frac{COV(Y,X)}{SD(Y)*SD(X)}
In the case of “Autocorrelation”, we only have one variable Y, so we are looking at the correlation of Y_t with its own previous value Y_{t-1}. Then, applying the correlation formula between Y_t and Y_{t-1} (instead of X), we can calculate this autocorrelation as:
AUTOCORR(Y_t,Y_{t-1})=\frac{AUTOCOV(Y_t,Y_{t-1})}{SD(Y_t)*SD(Y_{t-1})} Since Y is supposed to be stationary series, then the standard deviation of Y_t has to be the same as the standard deviation of Y_{t-1}. Then:
Fortunately, R does the calculation of all the autocorrelations of Y_t with its own lags, and these autocorrelations are plotted using the acf2 command (as seen above).
Unlike the ACF plot, the partial-autocorrelation (PACF) plot shows the autocorrelation of the lag with the variable after considering the effect of the lower-order autocorrelations.
The autocorrelation and partial autocorrelation between Y_{t} and its LAG 1 (Y_{t-1}) are the same, since there is no lower-order autocorrelations for LAG 1.
The autocorrelation between Y_{t} and its LAG2 (Y_{t-2}) and the partial autocorrelation between Y_{t} and its LAG2 (Y_{t-2}) will be different since the partial autocorrelation is calculated AFTER the autocorrelation between Y_{t} and Y_{t-1} is taken into consideration.
In other words, the PACF plot shows the amount of autocorrelation between Y_{t} and its lag K (Y_{t-k}) that is not explained by all the lower-order autocorrelations (Y_{t} with (Y_{t-k-n}).
The ACF plot shows that only the LAG 1 autocorrelation is positive and significant, and the following LAG autocorrelations are also positive but not significant.
We see a fast decay in the magnitude of the LAG autocorrelations in the PACF, and a low decay in the magnitud of the LAG autocorrelations in the ACF plot.
In the partial autocorrelation (PACF) we can see that ONLY the autocorrelation of LAG 1 is significant and positive, and immediately the rest of the autocorrelation go close to zero suddenly. This is a classical AR(1) signature with 1 AR terms. We do not need to include other terms further than LAG 1.
Then we can see that the partial autocorrelation plot (PACF) helps us to identify the # of AR terms. Then, according to this PACF we see that only LAG 1 is autocorrelated with the current value AFTER considering the effect of the lower-level autocorreations of lags 2,3,4, etc. In other words, the annual %change in the IGAE of the current month seems to be positively correlated with its own annual % changes of the previous month.
10 Calibrating the ARMIA-SARIMA model
The calibration process of an ARIMA-SARIMA is not straight-forward. It is recommended to learn this process, but for this course we will use the automated algorithm auto.arima, which is part of the forecast library.
The objective of the calibration of an ARIMA-SARIMA model is to find the right values for p, q, P and Q. It is recommended to start indicating the values of d and D after we found the stationary version of the variable. In this case, we found that the seasonal difference of the log of IGAE is stationary, so the parameters for d and D are:
d=0; D=1
If D=1 it means that we are doing a seasonal difference (annual % growth), and d=0 indicates that we are NOT doing a first difference (monthly % growth).
The values for p,q, P, Q are rarely greater than 2. We need to set the auto.arima function with maximum values of these parameters. We recommend to set p=2 as maximum, and the rest q, P, Q=1:
You need to install the forecast package to run the auto.arima function:
library(forecast)model1 <-auto.arima(igae,lambda =0, # this means that the model will transform the variable with its natural logarithm d=0,D=1, # we are modeling the annual % growth month by month (not the monthly %growth)max.p=2, max.q=1, # setting the maximum values for p and qmax.P =1,max.Q =1# setting the max. values for P and Q )coeftest(model1)
Where Y_t is the annual % growth of the IGAE (the seasonal difference of the log).
The sum of the \phi_{i} coefficients must always be less than 1. If you find that the sum of these coefficients is bigger than 1, then you need to change the calibration. Most of the time the auto.arima algorithm always ends up with a good model.
Then this model is an ARIMA(2,0,1)-SARIMA(0,1,1,12) model.
Remember the parameters of these models:
ARIMA(p,d,q) SARIMA(P,D,Q,#periods in a year)
The option lambda = 0 indicates that we will use the natural logarithm of the series. The D=1 indicates that the seasonal difference has to be made before estimating the model.
According to the result, we can see that the coefficients \phi_1 is positive and significant and \phi_2 is negative and significant (since their p-values are <0.05). Then we can say that the IGAE annual % change is positively and significantly related with its own annual % change of 1 month ago, and it is negatively and significantly related with its own annual % growth 2 months ago.
Also, the \theta_{1} coefficient (the MA coefficient) is negative and significant. This means that the annual growth of the IGAE is negatively related with the error (shock) of 1 month ago. The \theta_{2} coefficient (the seasonal MA coefficient) is negative and very significant. This means that the annual growth of the IGAE is negatively related with the error (shock) of 12 months ago.
Now we can use this model to do predictions.
If I want to do a forecast for 21 months (the end of 2024), I will use the forecast() function:
forecast_igae <-forecast(model1, h =21)autoplot(forecast_igae)
11 CHALLENGE 1
Check whether the residuals of model 1 is white-noise series. ALL calibrated models must have the residuals as white noise. Explain what is a white-noise variable and respond whether the residuals look like a white-noise.
12 CHALLENGE 2
Interpret the calibrated ARIMA-SARIMA model. Be sure to interpret the sign and magnitude of the coefficients! Pay attention in class to understand HOW TO INTERPRET the coefficients!
13 Reading
Go to the course site and carefully re-read the Notes a) Introduction to time-series financial models and b) re-read the theory explained in this workshop about the Introduction to ARIMA-SARIMA Models.
14 W3 submission
The grade of this Workshop will be the following:
Complete (100%): If you submit an ORIGINAL and COMPLETE HTML file with all the activities, with your notes, and with your OWN RESPONSES to questions
Incomplete (75%): If you submit an ORIGINAL HTML file with ALL the activities but you did NOT RESPOND to the questions and/or you did not do all activities and respond to some of the questions.
Very Incomplete (10%-70%): If you complete from 10% to 75% of the workshop or you completed more but parts of your work is a copy-paste from other workshops.
Not submitted (0%)
Remember that you have to submit your .html file through Canvas BEFORE NEXT CLASS.