Workshop 3, Time Series Models

Author

Alberto Dorantes

Published

May 19, 2025

Abstract
The main purpose of this workshop is to illustrate the foundations of seasonal Autoregressive Integrated Moving Average models (ARIMA/SARIMA). We will learn by doing: we will build a model for a macroeconomic index, the IGAE (Indicador Global de la Actividad Económica), published by INEGI.

0.1 General Directions for each workshop

You have to work on Google Colab for all your workshops. In Google Colab, you MUST LOG IN with your @tec.mx account and then create a Google Colab document for each workshop.

You must share each Colab document (workshop) with the following account:

  • cdorante@tec.mx

You must give Edit privileges to this account.

In Google Colab you can work with Python or R notebooks. The default is Python notebooks.

Your Notebook will have a default name like “Untitled2.ipynb”. Click on this name and change it to “W3-TimeSeries-YourFirstName-YourLastname”.

In your Workshop Notebook you have to:

  • Write and run Python code to do the exercises of this Workshop.

  • Do whatever is asked in the workshop. This can be: responses to specific questions and/or an exercise/challenge.

For ANY QUESTION or INTERPRETATION, you have to RESPOND IN CAPITAL LETTERS right after the question.

  • It is STRONGLY RECOMMENDED that you write your OWN NOTES as if this were your personal notebook to study for the FINAL EXAM. Your own workshop/notebook will be very helpful for your further study.

Once you finish your workshop, make sure that you RUN ALL CHUNKS. You can run each code chunk by clicking on the “Run” button located in the top-left section of each chunk. You can also run all the chunks in one-shot with Ctrl-F9. You have to submit to Canvas the web link of your Google Colab workshop.

1 Introduction

The main purpose of this workshop is to illustrate what an Autoregressive Integrated Moving Average Seasonal model (ARIMA/SARIMA model) is and how we can estimate a basic version of it.

This methodology is part of the Box-Jenkins models for time series, which provide a series of tools to understand past behaviors and predict future values of time series variables.

Given a time series Y_{t}, the ARIMA/SARIMA model can be denoted as ARIMA(p,d,q) SARIMA(P,D,Q,#annual periods). We will see later what these parameters of the model (p,d,q,P,D,Q,#annual periods) mean.

In an ARIMA/SARIMA model we try to explain the current values of a time-series variable using its own past values and past random shocks; the model does not include other explanatory variables (X variables) as in a traditional linear regression model. If we add external explanatory variables to an ARIMA/SARIMA model, then the model is usually called ARIMAX.

Then, the Y_t time-series variable in an ARIMA/SARIMA model is explained by:

  • Its own past values; these values are called LAG values or just LAGS. The value of last period (t-1) is called LAG 1; the value of 2 periods ago (t-2) is called LAG 2, and so forth. These LAGS are called AR(p) terms, where p is the number of previous LAGS to be included. For example, AR(2) indicates that we include the LAG 1 and LAG 2 values as explanatory variables (a small pandas sketch after this list shows how to build these lags).

  • Past random shocks; these values are called LAG errors or LAG shocks. The shock of last period (t-1) is called LAG 1 error or LAG 1 shock; the shock of 2 periods ago (t-2) is called LAG 2 error or LAG 2 shock, and so forth. These LAG errors are called MA(q) terms, where q is the # of previous LAG shocks to be included. For example, MA(2) indicates that we include LAG 1 error and LAG 2 error as explanatory variables.

  • Its own seasonal values; a season here refers to 1 year. These values are called seasonal LAG values. If we have monthly data (12 periods in 1 year), then the value of last year (t-12) is called seasonal LAG 1; the value of 2 years ago (t-24) is called seasonal LAG 2, and so forth. If we have quarterly data, then we have to go back 4 periods for last year (t-4), and so forth. These LAGS are called Seasonal AR(P) terms, where P is the number of previous seasonal LAGS to be included.

  • Past seasonal random shocks; these values are called seasonal LAG errors or seasonal LAG shocks. If we have monthly data (12 periods in 1 year), the shock of last year (t-12) is called seasonal LAG 1 error or seasonal LAG 1 shock; the shock of 2 years ago (t-24) is called seasonal LAG 2 error or seasonal LAG 2 shock, and so forth. These seasonal LAG errors are called Seasonal MA(Q) terms, where Q is the # of previous seasonal LAG shocks to be included.
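To make the LAG notation concrete, here is a minimal pandas sketch (with a hypothetical toy series, not the workshop data) that builds regular and seasonal lags with the shift() function:

import pandas as pd

# A hypothetical monthly series just to illustrate the lag mechanics:
df = pd.DataFrame({'y': range(1, 25)},
                  index=pd.date_range('2023-01-01', periods=24, freq='MS'))
df['lag1'] = df['y'].shift(1)    # LAG 1: the value of last month (t-1)
df['lag2'] = df['y'].shift(2)    # LAG 2: the value of 2 months ago (t-2)
df['slag1'] = df['y'].shift(12)  # seasonal LAG 1: the value of 12 months ago (t-12)
print(df.head(14))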

Then, the following parameters of an ARIMA/SARIMA model need to be estimated:

  • p: # of AR terms to be included as explanatory variables

  • d: # of first differences applied to the original variable

  • q: # of MA terms to be included as explanatory variables

  • P: # of Seasonal AR terms to be included as explanatory variables

  • D: # of Seasonal differences applied to the variable

  • Q: # of Seasonal MA terms to be included as explanatory variables

The d parameter refers to how many first differences we apply to the original time-series variable to make the series stationary.

The D parameter refers to how many seasonal differences we decide to apply to the variable.

Let’s start with models that have only AR terms, the AR(p) models, which can be denoted as ARIMA(p,0,0), or ARIMA(p,1,0) if a first difference is applied.

2 The AR(p) model - ARIMA(p,d,0)

In general, autoregressive AR(p) models assume that there is some kind of autocorrelation phenomenon in the time series. The concept of autocorrelation refers to the correlation of the variable today, at time t, with its past values (t-n). Autocorrelation is the building block behind these models.

In time-series Econometrics we can talk about autocorrelation in the dependent variable (also called the endogenous variable). This means that current or future values of an economic or financial variable can be explained, to some degree, by its previous values. This is the foundation of what econometricians have called an Autoregressive Model, also denoted as AR(p) model or ARIMA(p,d,0) model.

The AR(p) model for the Y_t variable can be expressed as:

Y_{t}=\phi_{0}+\phi_{1}Y_{t-1}+\phi_{2}Y_{t-2}+...+\phi_{p}Y_{t-p}+\epsilon_{t}

Where \phi_{0},\phi_{1},...,\phi_{p} are the parameters of the model and \epsilon_{t} refers to the error term or random shock. We can express the same equation as follows:

Y_{t}=\phi_{0}+\sum_{k=1}^{p}\phi_{k}Y_{t-k}+\epsilon_{t}
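As an illustration (a minimal sketch with arbitrary parameter values, not the workshop data), we can simulate an AR(1) series that follows this equation:

import numpy as np

np.random.seed(42)
n, phi0, phi1 = 500, 0.5, 0.8      # illustrative values; note |phi1| < 1
eps = np.random.normal(0, 1, n)    # the random shocks epsilon_t
y = np.zeros(n)
for t in range(1, n):
    y[t] = phi0 + phi1 * y[t-1] + eps[t]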

In order to apply this model we have to check that the series Y_{t} is stationary. We can use the Dickey-Fuller test to decide whether the series is stationary. If the original series is not stationary, we can calculate the first difference of the series, and most of the time the resulting series will be stationary.

In linear regression models, the method to estimate the regression coefficients is Ordinary Least Squares (OLS). For ARIMA models, the best method to estimate the autoregressive coefficients is the Maximum Likelihood (ML) method. The ML method works with iterations, while OLS has an analytic solution (in other words, a formula). Fortunately, Python can estimate this for you. In this model we assume that the errors are not correlated with each other; if the residuals turn out to be autocorrelated, we can include additional terms to get a better model.
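Continuing the simulation sketch above, we can let statsmodels estimate the AR coefficient by Maximum Likelihood and check that it recovers a value close to the true \phi_1 = 0.8:

from statsmodels.tsa.arima.model import ARIMA

# Fit an AR(1), i.e. ARIMA(1,0,0), to the simulated series by ML:
fit = ARIMA(y, order=(1, 0, 0)).fit()
print(fit.params)  # the 'ar.L1' estimate should be close to 0.8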

Now I will explain what a moving average model is.

3 The MA(q) model - ARIMA(0,d,q)

The AR(p) model is a process with a long-term memory. Why? Because a shock today will impact all future values of the series. Reading the mathematical equation, this can be interpreted as follows: the value of the series tomorrow depends on the value of today, after applying the filter \phi_1. Then the value of tomorrow already contains information about the value of today, and the value of tomorrow will impact the value of 2 days from now. Therefore, the value of today has an impact on the whole future of the series.

However, the impact of today's value diminishes exponentially according to (\phi_1)^N, where N is the number of periods into the future. Since |\phi_1|<1, the bigger the exponent N (which refers to a further future period), the smaller the effect.
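A quick numeric sketch of this exponential decay, assuming for example \phi_1 = 0.8:

phi1 = 0.8                  # hypothetical AR(1) coefficient
for N in [1, 3, 6, 12, 24]:
    # impact of a shock N periods later: phi1**N
    print(N, round(phi1 ** N, 4))   # 0.8, 0.512, 0.2621, 0.0687, 0.0047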

If we want to model a series with a short-term memory, then we have to think of a different equation:

Y_{t}=\theta_{0}+\theta_{1}\epsilon_{t-1}+\theta_{2}\epsilon_{t-2}+...+\theta_{q}\epsilon_{t-q}+\epsilon_{t}

This is called a Moving-Average model with q terms. In this case, the value of today is not directly affected by its own previous values, but by the past random shocks. In the case of q=1, the process does not have long memory since the value of today is only affected by the shock of yesterday.

The more q terms included in the model, the longer the memory of the model. In most cases, real-world series only need very few q terms.

In the MA(q) model, \theta_{0}, … \theta_{q} are the parameters of the model, \epsilon_{t-k} refers to past external errors/shocks, and the current shock is defined as \epsilon_{t}, so:

Y_{t}=\theta_{0}+\epsilon_{t}+\sum_{k=1}^{q}\theta_{k}\epsilon_{t-k}
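A minimal simulation sketch of an MA(1) process (arbitrary illustrative values) that shows the short memory: each shock \epsilon_t influences only Y_t and Y_{t+1}:

import numpy as np

np.random.seed(7)
n, theta0, theta1 = 500, 0.0, 0.6   # illustrative values
eps = np.random.normal(0, 1, n)     # the random shocks
y = np.empty(n)
y[0] = theta0 + eps[0]
for t in range(1, n):
    y[t] = theta0 + theta1 * eps[t-1] + eps[t]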

We can interpret a pure MA model as a random process where the current value of the series is influenced by certain percentages of the previous q random shocks/errors plus the current random shock. The percentages are actually the values of the coefficients \theta_k.

In Finance, when we want to model the behavior of market returns, we know that there are thousands of events that affect how market returns move today. These thousands of events are basically summarized in a “random shock” for which nobody knows the magnitude and direction (positive or negative effect).

Now I will put together both the AR and the MA model into one: the ARIMA(p,d,q) model.

4 ARIMA(p,d,q) model

Now we can combine both AR(p) terms and MA(q) terms into ONE model, which is called ARMA(p,q):

Y_{t}=\phi_{0}+\sum_{k=1}^{p}\phi_{k}Y_{t-k}+ \sum_{k=1}^{q}\theta_{k}\epsilon_{t-k} + \epsilon_{t}

We combine AR and MA terms to model Y in case Y can be modeled including short-term and long-term memory. The long-term memory is modeled with the AR terms, while the short-term memory is modeled with the MA terms. We will practice with an example to illustrate this theoretical model.
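Before moving to real data, a quick sketch: statsmodels can generate an ARMA(1,1) sample (note the lag-polynomial sign convention used by arma_generate_sample), and an ARMA(1,1) fit should recover the coefficients:

from statsmodels.tsa.arima_process import arma_generate_sample
from statsmodels.tsa.arima.model import ARIMA

# ARMA(1,1) with phi1 = 0.7 and theta1 = 0.4; the polynomials are [1, -phi1] and [1, theta1]:
y = arma_generate_sample(ar=[1, -0.7], ma=[1, 0.4], nsample=500)
fit = ARIMA(y, order=(1, 0, 1)).fit()
print(fit.params)  # 'ar.L1' and 'ma.L1' should be close to 0.7 and 0.4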

5 Hands-on activity

We will run a first version of an ARIMA-SARIMA model for the IGAE index (Indicador Global de la Actividad Económica) developed by INEGI (Instituto Nacional de Estadística y Geografía). INEGI has calculated this index since 1993 for the Mexican general economy and its different industry sectors.

5.1 Data collection and data management

IGAE is an indicator of economic activity in Mexico. Although Mexico's GDP is a very close measure of all products and services produced in Mexico, the IGAE is a good indicator of Mexico's economic development. Unlike the GDP, which is published each quarter, the IGAE is published each month, and it is usually released much sooner than the GDP.

You can download the IGAE directly from the INEGI web site in different formats such as Excel format. Also, INEGI offers an automatic way to download most of its historical indicators using an API that can be accessed with different computer languages.

Here is the Python code to download the monthly IGAE up to the most recent month:

import requests
import pandas as pd
# Code to download the IGAE index from the INEGI site:
token = 'c4559227-83d3-aeb4-1a8d-bee73f627140'

igae_id = '737121'

url =  f'https://www.inegi.org.mx/app/api/indicadores/desarrolladores/jsonxml/INDICATOR/{igae_id}/es/0700/false/BIE/2.0/{token}?type=json'

response = requests.get(url)
data = response.json()
series = data.get('Series', [])
observations = series[0].get('OBSERVATIONS', [])
igaedf = pd.DataFrame(observations)
# Convert the month to a date type variable: 
igaedf['TIME_PERIOD'] = pd.to_datetime(igaedf['TIME_PERIOD'], format='%Y/%m')
# Convert the IGAE index to a numeric variable: 
igaedf['OBS_VALUE'] = pd.to_numeric(igaedf['OBS_VALUE'], errors='coerce')
# Keep only the columns we need
igaedf = igaedf[['TIME_PERIOD', 'OBS_VALUE']]
# Rename the columns:
igaedf.columns = ['Month', 'IGAE']
# Setting the month as the index of the series:
igaedf = igaedf.set_index('Month')
# Sorting the data according to the index:
igaedf = igaedf.sort_index()
# Display the first few rows
print(igaedf.head())
                 IGAE
Month                
1993-01-01  55.434736
1993-02-01  56.456971
1993-03-01  58.900549
1993-04-01  57.135844
1993-05-01  57.891853

We can plot the index over time to have an idea about the Mexico economic development over time:

# prompt: plot the IGAE from df, and trace a vertical line in Dec 2018 and Dec 2024

import pandas as pd
import matplotlib.pyplot as plt

plt.figure(figsize=(12, 6))
plt.plot(igaedf.index, igaedf['IGAE'])
plt.title('IGAE over Time')
plt.xlabel('Date')
plt.ylabel('IGAE Value')
plt.grid(True)

# Trace vertical lines to identify the 6-year government periods
plt.axvline(pd.to_datetime('1994-12-01'), color='green', linestyle='--')
plt.axvline(pd.to_datetime('2000-12-01'), color='green',linestyle='--')
plt.axvline(pd.to_datetime('2006-12-01'), color='green',linestyle='--')
plt.axvline(pd.to_datetime('2012-12-01'), color='green',linestyle='--')
plt.axvline(pd.to_datetime('2018-12-01'), color='green',linestyle='--')
plt.axvline(pd.to_datetime('2024-12-01'), color='green',linestyle='--')

plt.show()

What do you observe? Compared to the previous crises in Mexico during 1994-1995 and 2008-2009, what does the COVID crisis look like?

We see that the decline of the index during the COVID crisis looks much more profound compared with previous crises (the 2008 mortgage crisis and the domestic 1994 crisis). It is possible that for very old periods the decline looks smaller in magnitude compared to the decline of a recent period, simply because the index level was lower back then. What matters for understanding economic development is the percentage change of the economic output. The best way to visually identify which recession period had the worst % decline is by looking at the log values of the index.

6 Stationary vs Non-stationary

We first have to find a version of the IGAE series that is stationary. For any economic or financial series it is recommended to take the logarithm and then a difference to get a measure of % change:

import numpy as np 
igaedf['logIGAE'] = np.log(igaedf['IGAE'])

plt.figure(figsize=(12, 6))
plt.plot(igaedf.index, igaedf['logIGAE'])
plt.title('log IGAE over Time')
plt.xlabel('Date')
plt.ylabel('log IGAE Value')
plt.grid(True)
plt.show()

With the log of the index we can clearly appreciate that the COVID crisis/recession period had the most pronounced decline compared to other periods. During the COVID period we can see that the log of the index moved from about 4.6 to about 4.35. Taking the difference between these values (4.60 - 4.35 = 0.25) gives a quick idea of the % decline of the economy during the COVID period: about -25% from Jan 2020 to March-April 2020!

It is clear that the log series is not stationary, so we do not need to run a Dickey-Fuller test on the log itself. We have to difference the series and then run the Dickey-Fuller test.

For financial and economic monthly data it is recommended to start by calculating the seasonal (annual) log difference and checking whether this series is stationary. If it is, then use this version for the model.

In case that the seasonal log difference is not stationary, then you can calculate the first difference, and most of the time this first difference will be stationary.

Then, we start by checking whether the seasonal log difference of the IGAE is a stationary series. We start with a plot of this variable, and then run the corresponding Dickey-Fuller test:

igaedf['annualgrowth'] = igaedf['logIGAE'] - igaedf['logIGAE'].shift(12)
# We can also do this annual growth by using the diff function:
# igaedf['annualgrowth2'] = igaedf['logIGAE'].diff(12)

plt.figure(figsize=(12, 6))
plt.plot(igaedf.index, 100*igaedf['annualgrowth'])
plt.title('IGAE Annual %Growth month by month')
plt.xlabel('Date')
plt.ylabel('IGAE % Growth')
plt.grid(True)
plt.show()

The seasonal log difference of the series is actually the annual % change month by month of the IGAE (in continuously compounded percentages).

Let’s see why the seasonal log difference is the annual % change of the series:

logIGAE.diff(12) = logIGAE_{t} - logIGAE_{t-12}

logIGAE_{t-12} (the shift(12) value in pandas) is the value of the log IGAE, but 12 months ago!

In this case the diff function takes the difference between the current value of the log and the log value of 12 months ago. This is actually an annual % change since we are taking the difference with respect to 12 months ago.
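A quick sketch to verify in pandas that both computations produce the same series:

import numpy as np
# diff(12) and the shift(12) subtraction should agree:
check = igaedf['logIGAE'].diff(12)
manual = igaedf['logIGAE'] - igaedf['logIGAE'].shift(12)
print(np.allclose(check.dropna(), manual.dropna()))  # expected: True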

It is hard for our eyes to see whether the seasonal log difference series is stationary. We need to run the Dickey-Fuller test:

# !pip install statsmodels
from statsmodels.tsa.stattools import adfuller

dftest = adfuller(igaedf['annualgrowth'].dropna())

# Print the results
print('ADF Statistic: %f' % dftest[0])
print('p-value: %f' % dftest[1])
print('Critical Values:')
for key, value in dftest[4].items():
  print('\t%s: %.3f' % (key, value))

ADF Statistic: -3.957525
p-value: 0.001650
Critical Values:
    1%: -3.449
    5%: -2.870
    10%: -2.571

Since the p-value of the Dickey-Fuller test is 0.0016 < 0.05, we can reject the null hypothesis (\phi_1 = 1; the series is NOT STATIONARY) and accept the alternative hypothesis (the series is STATIONARY). Then we can conclude that the annual % growth (the seasonal log difference of the IGAE) is a stationary series, so we will use this version of the variable for our model.

7 Autocorrelations (AC) and Partial-autocorrelations (PAC)

Once we determine which version of the variable is stationary, then we need to examine autocorrelations and partial autocorrelations of this variable. We can examine this with a correlogram.

In time series analysis, a correlogram is a chart of correlation statistics, also known as an autocorrelation plot. This is a plot of the sample autocorrelations of the dependent variable Y_t against the time lags. In Python, the functions plot_acf() and plot_pacf() from statsmodels.graphics.tsaplots give you the autocorrelation (ACF) and the partial-autocorrelation (PACF) figures, respectively.

Let’s examine the autocorrelations of the seasonal log difference of the IGAE.

# prompt: Using igaedf, Do an autocorrelation plot and a partial autocorrelation plot of 'annualgrowth'

from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

# Drop NaN values created by the differencing
annual_growth = igaedf.dropna(subset=['annualgrowth'])

# Autocorrelation plot
plt.figure(figsize=(12, 6))
plot_acf(annual_growth['annualgrowth'], lags=40, zero=False, ax=plt.gca())
plt.title('Autocorrelation Plot of annual growth')
plt.xlabel('Lags')
plt.ylabel('Autocorrelation')
plt.grid(True)
plt.show()

# Partial autocorrelation plot
plt.figure(figsize=(12, 6))
plot_pacf(annual_growth['annualgrowth'], zero=False, lags=40, ax=plt.gca())
plt.title('Partial Autocorrelation Plot of annual growth')
plt.xlabel('Lags')
plt.ylabel('Partial Autocorrelation')
plt.grid(True)
plt.show()

The plot_acf() function gives you the correlogram, a graph of the autocorrelations where it is easy to identify which lags of the variable have a significant autocorrelation.

The X axis represents the lag numbers, while the Y axis is the level of autocorrelation. The vertical lines are the levels of autocorrelations between the current value of Y (Y_t) and its own lags (Y_{t-1}, Y_{t-2}...).

The shaded blue band delimits the 95% confidence interval for the autocorrelations. If an autocorrelation goes outside the 95% C.I., then the autocorrelation of the variable Y_t with the corresponding lag Y_{t-k} (AutoCorr(Y_t, Y_{t-k})) is statistically different from zero.

We can see that the LAG 1 (Y_{t-1}) is significantly autocorrelated (positively) with the current value of the variable (Y_t) since this autocorrelation is significantly greater than zero.

What is the meaning of the first significant LAG? In this case, it means that the value of Y at any point in time t has an average autocorrelation of about 0.80 with its own value at t-1. We can see this 0.80 as the height of the first vertical line of the ACF plot.

The autocorrelation of Y_t with its previous value Y_{t-1} can be calculated manually as a correlation, which is equal to their covariance divided by the product of their standard deviations.

Remembering the formula for correlation between 2 variables Y and X:

CORR(Y,X)=\frac{COV(Y,X)}{SD(Y)*SD(X)}

In the case of “Autocorrelation”, we only have one variable Y, so we are looking at the correlation of Y_t with its own previous value Y_{t-1}. Then, applying the correlation formula between Y_t and Y_{t-1} (instead of X), we can calculate this autocorrelation as:

AUTOCORR(Y_t,Y_{t-1})=\frac{AUTOCOV(Y_t,Y_{t-1})}{SD(Y_t)*SD(Y_{t-1})}

Since Y is supposed to be a stationary series, the standard deviation of Y_t has to be the same as the standard deviation of Y_{t-1}. Then:

AUTOCORR(Y_t,Y_{t-1})=\frac{AUTOCOV(Y_t,Y_{t-1})}{VAR(Y_t)} = \frac{\gamma\left(1\right)}{\sigma^{2}_y}

Fortunately, Python calculates all the autocorrelations of Y_t with its own lags; these are the values plotted by the plot_acf() and plot_pacf() functions (as seen above).
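As a sketch, we can compute the LAG 1 autocorrelation of our series manually and compare it with the acf() function from statsmodels (using the annual_growth data defined above):

from statsmodels.tsa.stattools import acf

y = annual_growth['annualgrowth']
dev = y - y.mean()
# Lag-1 autocovariance divided by the variance (written as plain sums, so the n's cancel):
manual_rho1 = (dev * dev.shift(1)).sum() / (dev ** 2).sum()
print(manual_rho1)
print(acf(y, nlags=1)[1])  # should match the manual value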

Unlike the ACF plot, the partial-autocorrelation (PACF) plot shows the autocorrelation between the variable and each lag after removing the effect of the lower-order autocorrelations.

The autocorrelation and the partial autocorrelation between Y_{t} and its LAG 1 (Y_{t-1}) are the same, since there are no lower-order autocorrelations for LAG 1.

The autocorrelation and the partial autocorrelation between Y_{t} and its LAG 2 (Y_{t-2}) will be different, since the partial autocorrelation is calculated AFTER the autocorrelation between Y_{t} and Y_{t-1} is taken into consideration.

In other words, the PACF plot shows the amount of autocorrelation between Y_{t} and its lag k (Y_{t-k}) that is not explained by the lower-order autocorrelations (Y_{t} with Y_{t-1}, ..., Y_{t-(k-1)}).

The ACF plot shows positive and significant autocorrelations for the first lags, and these autocorrelations decay slowly as the lag number grows.

In other words, we see a fast decay in the magnitude of the LAG autocorrelations in the PACF plot, and a slow decay in the magnitude of the LAG autocorrelations in the ACF plot.

In the partial autocorrelation plot (PACF) we can see that ONLY the autocorrelation of LAG 1 is significant and positive, and the rest of the partial autocorrelations suddenly drop close to zero. This is a classical AR(1) signature with 1 AR term. We do not need to include terms further than LAG 1.

Then we can see that the partial autocorrelation plot (PACF) helps us identify the # of AR terms. According to this PACF, only LAG 1 is autocorrelated with the current value AFTER considering the effect of the lower-order autocorrelations. In other words, the annual % change of the IGAE in the current month seems to be positively correlated with its own annual % change of the previous month.

8 Calibrating the ARIMA-SARIMA model

The objective of the calibration of an ARIMA-SARIMA model is to find the right values for p, d, q, P, D and Q.

The values for d and D are already defined when we found the transformation of the variable that became stationary. In this case, we found that the seasonal difference of the log of IGAE is stationary, so the parameters for d and D are:

d=0; D=1

If D=1 it means that we are doing a seasonal difference (annual % growth), and d=0 indicates that we are NOT doing a first difference (monthly % growth).

The next step is to define the parameters p and q. The p parameter refers to the number of AR terms, while the q parameter refers to the number of MA terms in the model.

There are mainly 2 sets of possible initial values for p and q:

  1. p=1, 2 or 3, and q=0: This is when we observe an AR pattern (also called AR signature) in the AC and the PAC Plots.

An AR signature can be observed when the AC Plot shows positive autocorrelations for the first lags that gradually decay over time, and the PAC Plot shows 1, 2 or 3 positive and significant autocorrelations that drastically decay after a few lags. In our plots we saw this AR signature.

When we have an AR signature, the best configuration is p = the number of significant autocorrelations in the PAC Plot, with q set to zero (q=0).

  2. p=0, and q=1, 2 or 3: This happens when we observe an MA pattern (also called MA signature) in the AC and the PAC Plots.

An MA signature can be observed when the AC Plot and the PAC Plot show very few negative significant autocorrelations (usually only 1), and the PAC autocorrelations decay more gradually than the AC autocorrelations.

When we have an MA signature, the best configuration is q = number of significant autocorrelations in the AC Plot, and set the value of p = 0.

Then, in our example we see an AR signature, so we set p = 1 and q = 0. We keep P and Q equal to zero at this point. Let’s run the ARIMA-SARIMA model with these values:

from statsmodels.tsa.statespace.sarimax import SARIMAX

igae_log = igaedf['logIGAE'].dropna()

# Fit SARIMA model ARIMA(p=1,d=0,q=0) SARIMA(P=0,D=1,Q=0,#periods=12)
model = SARIMAX(igae_log, order=(1, 0, 0), 
                seasonal_order=(0, 1, 0, 12),
                trend='c') # Include drift/constant term
result = model.fit()
print(result.summary())
                                     SARIMAX Results                                      
==========================================================================================
Dep. Variable:                            logIGAE   No. Observations:                  387
Model:             SARIMAX(1, 0, 0)x(0, 1, 0, 12)   Log Likelihood                 789.554
Date:                          jue., 29 may. 2025   AIC                          -1573.107
Time:                                    16:37:31   BIC                          -1561.326
Sample:                                01-01-1993   HQIC                         -1568.430
                                     - 03-01-2025                                         
Covariance Type:                              opg                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
intercept      0.0045      0.002      2.916      0.004       0.001       0.007
ar.L1          0.7699      0.021     36.154      0.000       0.728       0.812
sigma2         0.0009   2.21e-05     39.216      0.000       0.001       0.001
===================================================================================
Ljung-Box (L1) (Q):                   1.63   Jarque-Bera (JB):              3805.87
Prob(Q):                              0.20   Prob(JB):                         0.00
Heteroskedasticity (H):               2.14   Skew:                            -0.50
Prob(H) (two-sided):                  0.00   Kurtosis:                        18.57
===================================================================================

Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).

We can save the coefficients of the model in variables:

phi0 = result.params['intercept']
phi1 = result.params['ar.L1']

print('phi0 = intercept = %f' % phi0)
print('phi1 = %f' % phi1)

phi0 = intercept = 0.004462
phi1 = 0.769948

The ARIMA-SARIMA model estimated \phi_1 = 0.7699 and \phi_0 = 0.0045, and both are significantly positive since their p-values are much less than 0.05. The next step is to check whether the errors (residuals) of this model look like white noise. A white noise is a time series that has no significant autocorrelation with any of its past values (lags). Then, we check the AC Plot and the PAC Plot of the model residuals:

# Autocorrelation plot
plt.figure(figsize=(12, 6))
plot_acf(result.resid.iloc[12:], lags=40, zero=False, ax=plt.gca())
plt.title('Autocorrelation Plot of errors / residuals')
plt.xlabel('Lags')
plt.ylabel('Autocorrelation')
plt.grid(True)
plt.show()

# Partial autocorrelation plot
plt.figure(figsize=(12, 6))
plot_pacf(result.resid.iloc[12:], zero=False, lags=40, ax=plt.gca())
plt.title('Partial Autocorrelation Plot of residuals')
plt.xlabel('Lags')
plt.ylabel('Partial Autocorrelation')
plt.grid(True)
plt.show()

We see in both plots, the AC and the PAC plot, that the autocorrelation of the residuals with lag 12 is negative and significant. This means that we can incorporate a seasonal term (P=1 or Q=1) into the model to account for this significant autocorrelation.

As a rule of thumb, when this happens and the significant autocorrelation is negative in both plots, we start by selecting a seasonal MA term and NO seasonal AR terms. In other words, we start setting the values for P and Q as:

P = 0

Q = 1

We re-run the model with this change to the values of P and Q:

igae_log = igaedf['logIGAE'].dropna()

# Fit SARIMA model ARIMA(p=1,d=0,q=0) SARIMA(P=0,D=1,Q=1,#periods=12)
model = SARIMAX(igae_log, order=(1, 0, 0), 
                seasonal_order=(0, 1, 1, 12),
                trend='c') # Include drift/constant term
result = model.fit()
print(result.summary())
                                      SARIMAX Results                                       
============================================================================================
Dep. Variable:                              logIGAE   No. Observations:                  387
Model:             SARIMAX(1, 0, 0)x(0, 1, [1], 12)   Log Likelihood                 851.355
Date:                            jue., 29 may. 2025   AIC                          -1694.710
Time:                                      16:37:33   BIC                          -1679.003
Sample:                                  01-01-1993   HQIC                         -1688.474
                                       - 03-01-2025                                         
Covariance Type:                                opg                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
intercept      0.0026      0.000      6.742      0.000       0.002       0.003
ar.L1          0.8612      0.019     45.339      0.000       0.824       0.898
ma.S.L12      -0.8189      0.038    -21.564      0.000      -0.893      -0.744
sigma2         0.0006   1.49e-05     40.249      0.000       0.001       0.001
===================================================================================
Ljung-Box (L1) (Q):                   7.95   Jarque-Bera (JB):              7573.88
Prob(Q):                              0.00   Prob(JB):                         0.00
Heteroskedasticity (H):               2.01   Skew:                            -2.10
Prob(H) (two-sided):                  0.00   Kurtosis:                        24.61
===================================================================================

Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).

We can save the coefficients of the model:

phi0 = result.params['intercept']
phi1 = result.params['ar.L1']
theta1 = result.params['ma.S.L12']

print('phi0 = intercept = %f' % phi0)
print('phi1 = %f' % phi1)
print('theta1 = %f' % theta1)

phi0 = intercept = 0.002610
phi1 = 0.861202
theta1 = -0.818914

The seasonal MA term was -0.8189 and it is statistically significant since its p-value is much less than 0.05. Then we keep it in the model.

Now we check again whether the residuals of this second model have significant autocorrelation with their own lags:

# Autocorrelation plot
plt.figure(figsize=(12, 6))
plot_acf(result.resid.iloc[12:], lags=40, zero=False, ax=plt.gca())
plt.title('Autocorrelation Plot of errors / residuals')
plt.xlabel('Lags')
plt.ylabel('Autocorrelation')
plt.grid(True)
plt.show()

# Partial autocorrelation plot
plt.figure(figsize=(12, 6))
plot_pacf(result.resid.iloc[12:], zero=False, lags=40, ax=plt.gca())
plt.title('Partial Autocorrelation Plot of residuals')
plt.xlabel('Lags')
plt.ylabel('Partial Autocorrelation')
plt.grid(True)
plt.show()

We see that both the AC and the PAC plots show a negative and significant autocorrelation with lag 1. Then, we change q from 0 to 1 (we include an MA(1) term) to account for this autocorrelation. We do this since this autocorrelation is negative and we had already included an AR(1) term.

# Fit SARIMA model ARIMA(p=1,d=0,q=1) SARIMA(P=0,D=1,Q=1,#periods=12)
model = SARIMAX(igae_log, order=(1, 0, 1), 
                seasonal_order=(0, 1, 1, 12),
                trend='c') # Include drift/constant term
result = model.fit()
print(result.summary())
                                     SARIMAX Results                                      
==========================================================================================
Dep. Variable:                            logIGAE   No. Observations:                  387
Model:             SARIMAX(1, 0, 1)x(0, 1, 1, 12)   Log Likelihood                 858.591
Date:                          jue., 29 may. 2025   AIC                          -1707.183
Time:                                    16:37:36   BIC                          -1687.548
Sample:                                01-01-1993   HQIC                         -1699.388
                                     - 03-01-2025                                         
Covariance Type:                              opg                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
intercept      0.0013      0.000      3.159      0.002       0.001       0.002
ar.L1          0.9300      0.021     44.398      0.000       0.889       0.971
ma.L1         -0.2556      0.035     -7.221      0.000      -0.325      -0.186
ma.S.L12      -0.8322      0.037    -22.595      0.000      -0.904      -0.760
sigma2         0.0006    1.4e-05     41.113      0.000       0.001       0.001
===================================================================================
Ljung-Box (L1) (Q):                   0.07   Jarque-Bera (JB):             10896.00
Prob(Q):                              0.79   Prob(JB):                         0.00
Heteroskedasticity (H):               2.36   Skew:                            -2.62
Prob(H) (two-sided):                  0.00   Kurtosis:                        28.88
===================================================================================

Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).

We can save the coefficients of the model:

phi0 = result.params['intercept']
phi1 = result.params['ar.L1']
theta1 = result.params['ma.L1']
theta2 = result.params['ma.S.L12']

print('phi0 = intercept = %f' % phi0)
print('phi1 = %f' % phi1)
print('theta1 = %f' % theta1)
print('theta2 = %f' % theta2)

phi0 = intercept = 0.001324
phi1 = 0.929958
theta1 = -0.255613
theta2 = -0.832166

We see that all coefficients of the model are statistically significant since their p-values < 0.05. Then, we keep them in the model.

We again check whether the model residuals show autocorrelation with their own lags:

# Autocorrelation plot
plt.figure(figsize=(12, 6))
plot_acf(result.resid.iloc[12:], lags=40, zero=False, ax=plt.gca())
plt.title('Autocorrelation Plot of errors / residuals')
plt.xlabel('Lags')
plt.ylabel('Autocorrelation')
plt.grid(True)
plt.show()

# Partial autocorrelation plot
plt.figure(figsize=(12, 6))
plot_pacf(result.resid.iloc[12:], zero=False, lags=40, ax=plt.gca())
plt.title('Partial Autocorrelation Plot of residuals')
plt.xlabel('Lags')
plt.ylabel('Partial Autocorrelation')
plt.grid(True)
plt.show()

Now we see that the autocorrelations of the residuals with their own past values (lags) are, in general, not significant. In both the AC and PAC plots there are only some barely significant autocorrelations at lags 7 and 10.

We could include AR or MA terms to account for these lags. However, in my own experience when working with monthly economic/financial series, I recommend ONLY including terms for lags 1 (last month), 2 (two months ago), and 6 (last semester). Then, I accept the model errors as white noise, and I stop the calibration process here.
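As a complementary, more formal check of the white-noise assumption, we can run a Ljung-Box test on the residuals; this is an optional sketch, where p-values above 0.05 are consistent with no remaining autocorrelation:

from statsmodels.stats.diagnostic import acorr_ljungbox

# Test several lag horizons, skipping the first 12 residuals lost to the seasonal difference:
lb = acorr_ljungbox(result.resid.iloc[12:], lags=[6, 12, 24], return_df=True)
print(lb)  # high p-values support treating the residuals as white noise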

Then the final calibration of the model ended up with the following parameters:

p = 1, d = 0, q = 1

P = 0, D = 1, Q = 1

In other words, we have an ARIMA(p=1, d=0, q=1) SARIMA(P=0, D=1, Q=1, #periods=12) model.

We can express the results of this model following the ARIMA-SARIMA equation:

Y_{t}=\phi_{0}+\phi_{1}Y_{t-1}+\theta_{1}\epsilon_{t-1}+\theta_{2}\epsilon_{t-12}+\epsilon_{t}

Where Y_t is the annual % growth of the IGAE (the seasonal difference of the log).

The sum of the \phi_{i} coefficients must always be less than 1. If you find that the sum of these coefficients is bigger than 1, then you need to review the calibration process.

According to the results, we can see that the coefficient \phi_1 is positive and significant (since its p-value is <0.05). Then we can say that the IGAE annual % change is positively and significantly related to its own annual % change of 1 month ago.

Also, the \theta_{1} coefficient (the MA coefficient) is negative and significant. This means that the annual % growth of the IGAE is negatively related with the error (shock) of 1 month ago. The \theta_{2} coefficient (the seasonal MA coefficient) is negative and very significant. This means that the annual % growth of the IGAE is negatively related with the error (shock) of 12 months ago.

Now we can use this model to do predictions:

# I will forecast 21 months in the future:
forecast_steps = 21
forecast_result = result.get_forecast(steps=forecast_steps)
# I get the mean forecast:
forecast_log = forecast_result.predicted_mean
# Get the 95% confidence interval of the forecast: 
# alpha = 1 - confidence interval
forecast_ci = forecast_result.conf_int(alpha=0.05)

# Convert log forecast values back to original scale of IGAE
forecast_values = np.exp(forecast_log)

# Plot historical and forecast values
plt.figure(figsize=(14, 7))
plt.plot(igaedf.index, igaedf['IGAE'], label='Historical IGAE')
plt.plot(forecast_values.index, forecast_values, label='Forecast IGAE', color='red')
plt.fill_between(forecast_values.index, np.exp(forecast_ci.iloc[:, 0]), np.exp(forecast_ci.iloc[:, 1]), color='pink', alpha=0.3) # Plot confidence interval

plt.title('IGAE Historical and Forecast')
plt.xlabel('Date')
plt.ylabel('IGAE Value')
plt.grid(True)
plt.legend()
plt.show()
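Since the model works on the seasonal difference of the log, we can also express the forecast as annual % growth. A sketch reusing igae_log and forecast_log from above:

# Append the log forecast to the historical log series and take the seasonal difference:
full_log = pd.concat([igae_log, forecast_log])
forecast_growth = 100 * full_log.diff(12).loc[forecast_log.index]
print(forecast_growth.round(2))  # forecast annual % growth, month by month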

9 CHALLENGE 1

Interpret the calibrated ARIMA-SARIMA model. Be sure to interpret the sign and magnitude of the coefficients! Pay attention in class to understand HOW TO INTERPRET the coefficients!

10 CHALLENGE 2

Download the Real Gross Domestic Product for Mexico from the FRED online site (US Federal Reserve Bank). Use the ticker NGDPRNSAXDCMXQ. You have to:

  1. Calibrate an ARIMA-SARIMA model to forecast the Mexican real GDP for the rest of 2025 and all quarters of 2026.

  2. You have to explain your calibration steps, and

  3. Interpret the final model.

11 W3 submission

  • Complete (100%): If you submit an ORIGINAL and COMPLETE work with all the code and challenges, with your notes, and with your OWN RESPONSES to questions
  • Incomplete (75%): If you submit an ORIGINAL work with ALL the activities but you did NOT RESPOND to the challenges/questions and/or you did not do all activities and respond to some of the questions.
  • Very Incomplete (10%-70%): If you complete from 10% to 75% of the workshop, or you completed more but parts of your work are a copy-paste from other workshops.
  • Not submitted (0%)

Remember that you have to submit your Google Colab link through Canvas, and also grant EDIT privileges to cdorante@tec.mx