Workshop 2 Solution, Time Series Models

Author

Alberto Dorantes

Published

May 19, 2025

Abstract
In this workshop we learn the concepts of stationarity, unit root, and cointegration of time-series variables.

1 Stationarity of time series variables

1.1 Introduction

In our previous workshop we learned about the random walk model applied to finance. We learned that the logarithm of stock prices behaves similarly to a random walk model with a positive drift (\phi_0>0).

As we learned, a random walk model can be expressed in the following equation:

Y_t = \phi_0 + Y_{t−1} + ε_t

Following this model, we ran a Monte Carlo simulation of the logarithm of the S&P500, and we found that most of the time the simulated random walk behaves similarly to the real log of the S&P500 index.

We learned that if \phi_0>0, then the series will have a growing trend over time. We also learned that the higher the standard deviation of the daily shock, the more the series will swing up and down. The standard deviation of the shock is actually the daily volatility of the series' daily returns.

An important insight about the random walk model is that it is a non-stationary series. What does this mean?

In short, a stationary series is one whose mean is about the same in any time period (no matter which periods we select), and whose standard deviation is also about the same in any time period.

Then, the random-walk model will always behave as a non-stationary series.

Let’s analyze which part of the random walk equation determines whether the series is stationary. We can re-write the previous random walk equation as follows:

Y_t = \phi_0 + \phi_1 Y_{t−1} + ε_t

If the coefficient \phi_{1}=1, then this equation becomes the random walk equation. When \phi_{1}=1 the Y_t series becomes non-stationary.

Interestingly, if \phi_{1}<1, then Y_t becomes a stationary series.

1.2 Simulating an AR(1) process

In-class exercise

Simulate an AR(1) process with \phi_0=0.01, standard deviation of the error = 0.01, and \phi_1=0.98.

Re-run the same AR(1) simulation, but now change \phi_1 = 0.10

Re-run the same AR(1) simulation, but now change \phi_1 = 0.999

SOLUTION

Simulating an AR(1) process with \phi_1=0.98:

# prompt Gemini: Simulate Y with 1000 values as an Autoregressive Model AR(1) with phi0=0.01, phi1=0.98 and standard deviation of random shocks = 0.01

import numpy as np

phi0 = 0.01
sigma_error = 0.01
phi1 = 0.98
N = 1000

Y = np.zeros(N)
Y[0] = 0 # Initial value, could be anything

for t in range(1, N):
    shock = np.random.normal(0, sigma_error)
    Y[t] = phi0 + phi1 * Y[t-1] + shock

Plotting the simulated AR(1) series:

#prompt Gemini: Plot Y over time
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 6))
plt.plot(Y)
plt.title('Simulated AR(1) with phi1=0.98')
plt.xlabel('Time Period')
plt.ylabel('Y')
plt.grid(True)
plt.show()

Simulating an AR(1) process with \phi_1=0.10:

import numpy as np

phi0 = 0.01
sigma_error = 0.01
phi1 = 0.10
N = 1000

Y2 = np.zeros(N)
Y2[0] = 0 # Initial value, could be anything

for t in range(1, N):
    shock = np.random.normal(0, sigma_error)
    Y2[t] = phi0 + phi1 * Y2[t-1] + shock

plt.figure(figsize=(10, 6))
plt.plot(Y2)
plt.title('Simulated AR(1) with phi1=0.10')
plt.xlabel('Time Period')
plt.ylabel('Y')
plt.grid(True)
plt.show()

Simulating an AR(1) process with \phi_1=0.999:

phi0 = 0.01
sigma_error = 0.01
phi1 = 0.999
N = 1000

Y3 = np.zeros(N)
Y3[0] = 0 # Initial value, could be anything

for t in range(1, N):
    shock = np.random.normal(0, sigma_error)
    Y3[t] = phi0 + phi1 * Y3[t-1] + shock

plt.figure(figsize=(10, 6))
plt.plot(Y3)
plt.title('Simulated AR(1) with phi1=0.999')
plt.xlabel('Time Period')
plt.ylabel('Y')
plt.grid(True)
plt.show()

What differences do you observe among the 3 simulations?

Y: AR(1) with \phi_1=0.98

Y2: AR(1) with \phi_1=0.1

Y3: AR(1) with \phi_1=0.999

When \phi_1=0.98 (close to 1), the series starts with a growing trend since its \phi_0 is positive. However, after some periods, the series starts to behave like a stationary series, moving up and down around a value of about 0.50.

When \phi_1=0.10 (close to 0), the series immediately behaves like a stationary variable. Unlike the previous series, which grows for several periods before coming back down, this series moves up and down almost every period.

Y moves in a much wider range, from 0 to a bit less than 0.60, while Y2 moves in a smaller range, from about -0.02 to +0.05.

Y3 shows a clear positive trend and does not look like a stationary series since its mean grows over time.
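To make the visual comparison concrete, here is a quick informal check (a sketch that assumes the arrays Y, Y2 and Y3 simulated above are still in memory): compare the mean and standard deviation of each series over its first and second half. For a stationary series, both halves should look similar.

# Informal stationarity check: compare mean and std of each half of each simulated series
for name, s in [('Y (phi1=0.98)', Y), ('Y2 (phi1=0.10)', Y2), ('Y3 (phi1=0.999)', Y3)]:
    half = len(s) // 2
    print('%s: mean %.3f vs %.3f, std %.3f vs %.3f' %
          (name, s[:half].mean(), s[half:].mean(), s[:half].std(), s[half:].std()))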

According to theory, any AR(1) series with \phi_1<1 will behave like a stationary series. It seems that our experiment contradicts the theory. However, if we expand the number of periods from 1,000 to 100,000, we can see something interesting:

phi0 = 0.01
sigma_error = 0.01
phi1 = 0.999
N = 100000

Y3 = np.zeros(N)
Y3[0] = 0 # Initial value, could be anything

for t in range(1, N):
    shock = np.random.normal(0, sigma_error)
    Y3[t] = phi0 + phi1 * Y3[t-1] + shock

plt.figure(figsize=(10, 6))
plt.plot(Y3)
plt.title('Simulated AR(1) with phi1=0.999')
plt.xlabel('Time Period')
plt.ylabel('Y')
plt.grid(True)
plt.show()

Now we see that, sooner or later, the AR(1) series behaves like a stationary series, but it takes many periods before it shows this stationary behavior! Why does this happen? The main reason is that the series is strongly positively auto-correlated, since its \phi_1 is very close to 1, and its initial value is far away from the average value around which the series eventually becomes stable.
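We can quantify this with the long-run mean of a stationary AR(1), \phi_0/(1-\phi_1) (derived above). A quick computation for the three values of \phi_1 we simulated (a sketch reusing the phi0 value defined above):

# Long-run (unconditional) mean of a stationary AR(1): phi0 / (1 - phi1)
for p1 in [0.10, 0.98, 0.999]:
    print('phi1 = %.3f -> long-run mean = %.2f' % (p1, phi0 / (1 - p1)))

With \phi_1=0.98 the long-run mean is 0.50, which matches the level around which the first simulation settles. With \phi_1=0.999 the long-run mean is 10, so a series that starts at 0 and moves by small shocks needs many thousands of periods just to get close to the level around which it will finally fluctuate.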

If we zoom into this plot to check only a few values in a time period after t = 20,000, we see something interesting.

Let’s zoom in to see only 30 values starting at period t = 20,000:

plt.figure(figsize=(10, 6))
plt.plot(Y3[20000:20030])
plt.title('Zoom of Simulated AR(1) with phi1=0.999')
plt.xlabel('Time Period')
plt.ylabel('Y')
plt.grid(True)
plt.show()

We see that when the series starts to grow, it keeps growing for several periods. And when it changes direction and goes down, it also keeps going down for several periods. This means that the series is persistent, with a positive and high auto-correlation. This makes sense since its \phi_1=0.999 is precisely a measure of auto-correlation!

What happens when \phi_1 is very small, and when it is close to 1?

When \phi_1 is close to +1 the series becomes more persistent and highly positively auto-correlated. When \phi_1 is very close to 0 the series becomes very jagged and much less persistent; if the series moves above its mean in period t, it will very likely move back toward its mean in the following periods.
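We can check this persistence numerically by computing the lag-1 sample autocorrelation of each simulated series (a sketch that reuses the Y, Y2 and Y3 arrays from above). It should be high for \phi_1 close to 1 and low for \phi_1 close to 0; note that the trending start of the simulations with \phi_1 close to 1 can push the sample value even closer to 1.

# Lag-1 sample autocorrelation: correlation between the series and its 1-period lag
for name, s in [('Y (phi1=0.98)', Y), ('Y2 (phi1=0.10)', Y2), ('Y3 (phi1=0.999)', Y3)]:
    rho1 = np.corrcoef(s[:-1], s[1:])[0, 1]
    print('%s: lag-1 autocorrelation = %.3f' % (name, rho1))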

Let’s work with an example using sales data.

1.3 Data collection and data management

We will work with an online dataset that contains sales data of different consumer products.

We download a dataset from an online Excel file. Type:

#pip install openpyxl
import pandas as pd
import requests
from io import BytesIO

# URL of the online Excel file
url = "http://www.apradie.com/ec2004/salesbrands3.xlsx"

# Define headers to simulate a browser request
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}

# Fetch the file from the URL with headers
response = requests.get(url, headers=headers)
response.raise_for_status()  # Raise an error for bad status codes

# Load the Excel file into a pandas DataFrame
sales_brand = pd.read_excel(BytesIO(response.content), sheet_name='Resultado')  # Adjust sheet_name if needed

# Display first few rows
print(sales_brand.head())
        date  qbrand1tons  salesbrand1  ...  salesbrand6  tonstotal  salestotal
0 2015-01-01       106.62     60780.72  ...     55561.26    1148.41   302762.85
1 2015-02-01       105.82     60411.47  ...     51469.74    1274.85   327076.83
2 2015-03-01        88.82     53174.89  ...     50395.48    1085.59   298694.21
3 2015-04-01        86.69     50861.69  ...     49124.61     961.22   267792.19
4 2015-05-01        87.23     51014.23  ...     46906.63     997.47   266576.37

[5 rows x 15 columns]

As you can see, we have imported a monthly dataset where the variable date shows the first day of the month in a yyyy-mm-dd format. However, this column is in text format.

When working with time series datasets it is highly recommended to set a date column as the index of the data frame. With this, Python will know that we have a monthly time-series dataset.

Then, we convert the date column to datetime type and set it as the index of the data frame:

import pandas as pd
sales_brand['date'] = pd.to_datetime(sales_brand['date'], errors='coerce')
sales_brand.set_index('date', inplace=True)
sales_brand.head()
            qbrand1tons  salesbrand1  ...  tonstotal  salestotal
date                                  ...                       
2015-01-01       106.62     60780.72  ...    1148.41   302762.85
2015-02-01       105.82     60411.47  ...    1274.85   327076.83
2015-03-01        88.82     53174.89  ...    1085.59   298694.21
2015-04-01        86.69     50861.69  ...     961.22   267792.19
2015-05-01        87.23     51014.23  ...     997.47   266576.37

[5 rows x 14 columns]

This dataset has sales information of several brands of consumer products that belong to one category (for example, the category can be Cereals or Candies). The content of this dataset is very similar to a real-world category. Consumer products usually show some seasonality and trend over time. We will learn how to handle these features later.

1.4 Stationary vs Non-stationary

We can get an overview of the behavior of sales of these consumer products using the plot command. Let’s focus on brands 2 and 6. To see volume sales (the number of tons sold) of brand 2:

import matplotlib.pyplot as plt
plt.figure(figsize=(12, 6))
plt.plot(sales_brand.index, sales_brand['qbrand2tons'])
plt.xlabel('Date')
plt.ylabel('Volume of qbrand2 (in tons)')
plt.title('Sales volume of brand2')
plt.grid(True)
plt.show()

It seems that this product has a growing trend over time, except for the last year.

We can plot sales of brand 6:

plt.figure(figsize=(12, 6))
plt.plot(sales_brand.index, sales_brand['qbrand6tons'])
plt.xlabel('Date')
plt.ylabel('Volume of qbrand6 (in tons)')
plt.title('Sales volume of brand6')
plt.grid(True)
plt.show()

This brand has a clear growing trend over time. As we can see, many consumer products have a growing or a declining trend. Other products might seem stationary over time, with sales fluctuating around an average.

A series with a growing or declining trend is non-stationary. Most variables that represent sales of a product can be treated as non-stationary. Then, a rule of thumb when we work with business variables such as volume sales, value sales, stock prices, cost of goods sold, etc., is to apply a transformation to the series so that the transformed series becomes stationary.

When working with monthly sales data, if we apply the natural logarithm to volume (or value) sales and then take the first difference of the series, we get the monthly % change (continuously compounded). This first difference of the log will be a stationary series most of the time when analyzing sales data.
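In symbols, if S_t is the sales level in month t, this transformation gives the continuously compounded monthly growth:

r_t = ln(S_t) - ln(S_{t-1}) = ln(S_t / S_{t-1}) \approx (S_t - S_{t-1}) / S_{t-1}

where the approximation holds for small monthly changes.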

Let’s calculate the natural log of the volume sales for all products:

import numpy as np

ln_sales = np.log(sales_brand)
ln_sales.head()
            qbrand1tons  salesbrand1  ...  tonstotal  salestotal
date                                  ...                       
2015-01-01     4.669271    11.015028  ...   7.046134   12.620705
2015-02-01     4.661740    11.008934  ...   7.150584   12.697950
2015-03-01     4.486612    10.881342  ...   6.989879   12.607176
2015-04-01     4.462339    10.836865  ...   6.868203   12.497967
2015-05-01     4.468548    10.839860  ...   6.905222   12.493416

[5 rows x 14 columns]

Now we create a dataset with the monthly % change (continuously compounded) by calculating the first difference of the log values:

msales_growth = ln_sales.diff(1).dropna()
# I apply the dropna function at the end to delete the first row since it is not possible to calculate the % change for the first month
msales_growth.head()
            qbrand1tons  salesbrand1  ...  tonstotal  salestotal
date                                  ...                       
2015-02-01    -0.007532    -0.006094  ...   0.104450    0.077245
2015-03-01    -0.175128    -0.127593  ...  -0.160705   -0.090775
2015-04-01    -0.024273    -0.044476  ...  -0.121676   -0.109209
2015-05-01     0.006210     0.002995  ...   0.037019   -0.004551
2015-06-01    -0.019330    -0.049208  ...  -0.080772   -0.073313

[5 rows x 14 columns]

The diff(1) function applied to ln_sales computes the first difference of the log values row by row, so it calculates the % monthly change for all months and for all columns.

The diff function calculates the first difference for all rows as follows:

First Difference of a column at t = column value at t minus the previous column value (at t-1)
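Equivalently, diff(1) is the series minus its own value shifted one row down. A quick check (a sketch that reuses the ln_sales and msales_growth data frames defined above):

# diff(1) is the same as subtracting the 1-period lag (shift) of each column
manual_diff = (ln_sales - ln_sales.shift(1)).dropna()
print(np.allclose(manual_diff, msales_growth))  # expected to print True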

Let’s see how the monthly % growth looks for brands 2 and 6:

plt.figure(figsize=(12, 6))
plt.plot(msales_growth.index, msales_growth['qbrand2tons'], label='qbrand2tons')
plt.plot(msales_growth.index, msales_growth['qbrand6tons'], label='qbrand6tons')
plt.xlabel('Date')
plt.ylabel('Monthly % growth')
plt.title('Monthly % growth in qbrand2tons and qbrand6tons')
plt.legend()
plt.grid(True)
plt.show()

We can see that the 2 series look stationary since their means tend to stay at the same level in any time period.

Another transformation with monthly data is annual percentage change, which can be calculated as the difference between the log value of the series and its log value 12 months ago. We can calculate the annual % change of sales month by month as follows:

ysales_growth = ln_sales.diff(12).dropna()
ysales_growth.qbrand2tons.head()
date
2016-01-01   -0.143270
2016-02-01   -0.103516
2016-03-01   -0.202297
2016-04-01   -0.052669
2016-05-01   -0.161421
Name: qbrand2tons, dtype: float64
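In symbols, the annual growth in month t is ln(S_t) - ln(S_{t-12}), the continuously compounded % change with respect to the same month of the previous year.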

We plot the annual % change of volume sales for brand 2 and brand 6:

plt.figure(figsize=(12, 6))
plt.plot(ysales_growth.index, ysales_growth['qbrand2tons'], label='qbrand2tons')
plt.plot(ysales_growth.index, ysales_growth['qbrand6tons'], label='qbrand6tons')
plt.xlabel('Date')
plt.ylabel('Annual % growth')
plt.title('Annual % growth in qbrand2tons and qbrand6tons')
plt.legend()
plt.grid(True)
plt.show()

In this case, it is difficult to evaluate graphically whether both series are stationary. We need to do a statistical test to check whether a series can be treated as stationary or not. Some series might not look stationary to our eyes, but we need to run statistical tests to decide whether to treat them as stationary or not.

There is a statistical way to evaluate whether these series are stationary. We need a statistical confirmation in addition to the visual tools. The Dickey-Fuller test is one of the most widely used statistical tests for stationarity. This test is called a unit root test. The Dickey-Fuller test assumes, under its null hypothesis, that the series is non-stationary. In other words, it assumes that the series follows a random walk with a unit root (\phi_1=1).

The Dickey-Fuller test actually tests whether \phi_1=1. The null hypothesis is that \phi_1=1, which means that the series is non-stationary. If the p-value of the test is less than 0.05, we can reject the null hypothesis and conclude that the series is stationary with at least 95% confidence.
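A brief note on how the test regression is constructed: subtracting Y_{t-1} from both sides of the AR(1) equation gives

\Delta Y_t = \phi_0 + (\phi_1 - 1) Y_{t-1} + ε_t = \phi_0 + \gamma Y_{t-1} + ε_t

The test estimates \gamma = \phi_1 - 1 and tests the null hypothesis \gamma=0 (a unit root, so the series is non-stationary) against the alternative \gamma<0 (the series is stationary). The augmented version of the test (ADF), which is the one implemented in the adfuller function used below, adds lags of \Delta Y_t to this regression to account for autocorrelation in the errors.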

We can run the Dickey-Fuller test for the annual % growth of sales of brand 2 as follows:

# !pip install statsmodels
from statsmodels.tsa.stattools import adfuller

dftest = adfuller(ysales_growth['qbrand2tons'], regression='c')
# regression = 'c' indicates to consider a possible constant or intercept

# Print the results
print('ADF Statistic: %f' % dftest[0])
ADF Statistic: -2.488449
print('p-value: %f' % dftest[1])
p-value: 0.118310
print('Critical Values:')
Critical Values:
for key, value in dftest[4].items():
  print('\t%s: %.3f' % (key, value))
    1%: -3.616
    5%: -2.941
    10%: -2.609

The null hypothesis of the Dickey-Fuller test is that the series is non-stationary. More precisely, the null hypothesis of this test is that \phi_1=1, which implies that the series is non-stationary. This is the reason why the Dickey-Fuller test is called a unit-root test.

As with any statistical test, if the p-value of the test is less than 0.05, then we have statistical evidence (at the 95% confidence level) to REJECT the null hypothesis.

In this case, however, the Dickey-Fuller test reports a p-value of about 0.118, which is greater than 0.05 (and the ADF statistic of -2.49 is above the 5% critical value of -2.94). We therefore cannot reject the null hypothesis of a unit root (\phi_1 = 1) for this series. In other words, based on this test there is not enough statistical evidence to treat the annual % change of brand 2 as stationary.

2 CHALLENGE 1

RUN AND INTERPRET A DICKEY-FULLER TEST FOR THE ANNUAL % DIFFERENCE OF BRAND 6, AND PROVIDE AN INTERPRETATION

CHALLENGE 1 SOLUTION

from statsmodels.tsa.stattools import adfuller

dftest = adfuller(ysales_growth['qbrand6tons'], regression='c')
# regression = 'c' indicates to consider a possible constant or intercept

# Print the results
print('ADF Statistic: %f' % dftest[0])
ADF Statistic: -3.030689
print('p-value: %f' % dftest[1])
p-value: 0.032126
print('Critical Values:')
Critical Values:
for key, value in dftest[4].items():
  print('\t%s: %.3f' % (key, value))
    1%: -3.606
    5%: -2.937
    10%: -2.607

Since the p-value (about 0.032) is less than 0.05, we can say at the 95% confidence level that the annual % change of volume sales of brand 6 is a stationary variable.

3 Spurious regression

When we want to examine the relationship between two non-stationary variables by running a regression model, we run the risk of ending up with an invalid, spurious regression. Before we explain why a regression model can be spurious, we start with an example using 2 real-world variables.

Install the wbdata package. This package provides functions to download World Bank data for all countries around the world.

We will download the infant mortality rate and the merchandise exports index for Mexico. These variables are supposed to have nothing in common, so we would not expect a significant relationship.

#!pip install wbdata

import wbdata
import datetime

# Define the series and country
series = {"SP.DYN.IMRT.IN": "infant_mortality", "TX.VAL.MRCH.XD.WD": "merchandise_exports"}
country = "MEX"

# Define the dates
data_date = (datetime.datetime(1980, 1, 1), datetime.datetime(2022, 1, 1))

# Download the data
data = wbdata.get_dataframe(series, country=country, date=data_date)

# Order the dataset by date (the date is the index of the returned data frame)
data.sort_index(inplace=True)

# Display the first few rows
print(data.head())
      infant_mortality  merchandise_exports
date                                       
1980              58.2             4.738168
1981              55.8             6.124695
1982              53.5             6.321201
1983              51.3             6.819955
1984              49.1             7.647000

We plot both variables over time:

plt.figure(figsize=(12, 6))
plt.plot(data.index, data['infant_mortality'], label='Infant Mortality')
plt.plot(data.index, data['merchandise_exports'], label='Merchandise Exports')
plt.xlabel('Date')
plt.ylabel('Value')
plt.title('Infant Mortality and Merchandise Exports over Time')
plt.legend()
plt.xticks(rotation=90)
plt.grid(True)
plt.show()

Now run a regression using these series. Report the result of the regression. Did you find a significant relationship? Is your result what you expected?

import statsmodels.formula.api as smf

# Perform the regression
model = smf.ols('merchandise_exports  ~ infant_mortality', data=data).fit(missing='drop')

# Print the regression results
print(model.summary())
                             OLS Regression Results                            
===============================================================================
Dep. Variable:     merchandise_exports   R-squared:                       0.765
Model:                             OLS   Adj. R-squared:                  0.759
Method:                  Least Squares   F-statistic:                     130.3
Date:               mar., 27 may. 2025   Prob (F-statistic):           3.73e-14
Time:                         11:20:15   Log-Likelihood:                -184.31
No. Observations:                   42   AIC:                             372.6
Df Residuals:                       40   BIC:                             376.1
Df Model:                            1                                         
Covariance Type:             nonrobust                                         
====================================================================================
                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------
Intercept          119.9454      6.892     17.403      0.000     106.016     133.875
infant_mortality    -2.5541      0.224    -11.414      0.000      -3.006      -2.102
==============================================================================
Omnibus:                       12.461   Durbin-Watson:                   0.122
Prob(Omnibus):                  0.002   Jarque-Bera (JB):                3.613
Skew:                           0.343   Prob(JB):                        0.164
Kurtosis:                       1.738   Cond. No.                         69.0
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

We see that the b_1 coefficient is negative and very, very significant! This suggests very strong statistical evidence that infant mortality in Mexico is negatively related to merchandise exports! In other words, when infant mortality goes down, it is very likely that exports will go up. This sounds weird… this sounds spurious.

ACCORDING TO THIS REGRESSION, THE INFANT MORTALITY INDEX IS SIGNIFICANTLY RELATED TO MERCHANDISE EXPORTS IN MEXICO. THIS DOES NOT MAKE SENSE SINCE BOTH PHENOMENA ARE NOT RELATED AT ALL. THIS REGRESSION SUGGESTS THAT WHEN EXPORTS GO UP, THE NUMBER OF INFANT DEATHS IN MEXICO DECREASES SIGNIFICANTLY.

WHEN YOU RUN A REGRESSION WHERE THE DEPENDENT AND THE INDEPENDENT VARIABLES ARE NON-STATIONARY, MOST OF THE TIME THE RESULT WILL SHOW THAT BOTH ARE STRONGLY AND SIGNIFICANTLY RELATED (NEGATIVE OR POSITIVE). HOWEVER, THIS RELATIONSHIP MIGHT NOT BE TRUE!! IT MIGHT BE SPURIOUS.

THE ONLY CASE IN WHICH THIS TYPE OF REGRESSION IS NOT SPURIOUS, BUT VALID, IS WHEN BOTH VARIABLES ARE COINTEGRATED. ONLY IF THE 2 NON-STATIONARY SERIES ARE COINTEGRATED WILL THE REGRESSION BETWEEN THEM BE VALID. IF THEY ARE NOT COINTEGRATED, THEN THE REGRESSION WILL BE SPURIOUS.

THEN, WHAT IS COINTEGRATION??????

3.1 CHALLENGE 2 - QUESTION:

RESEARCH SPURIOUS REGRESSION. In your own words, briefly explain in which cases you can end up with a spurious regression.

4 Cointegration

When two non-stationary series are strongly related in the long run, then we say that both series are cointegrated. The question is how strong this long-term relationship needs to be in order to consider 2 non-stationary series as cointegrated.

To do a cointegration statistical test of 2 non-stationary series, we need to do the following:

  1. Run a linear regression of the 2 non-stationary series

  2. Get the residuals of this regression

  3. Do a Dickey-Fuller test to check whether the residuals (errors) of the previous regression are a stationary series. If the p-value<0.05, then we can say that there is statistical evidence to consider the 2 non-stationary series as cointegrated.

If 2 non-stationary series are NOT cointegrated, then a linear regression between them is spurious. In other words, when 2 non-stationary series are not cointegrated, the result of a linear regression using these 2 variables will not be reliable or valid.
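As a cross-check on this manual two-step (Engle-Granger) procedure, statsmodels also provides a built-in cointegration test, statsmodels.tsa.stattools.coint, which runs essentially the same steps but uses critical values adapted for residual-based testing. A minimal sketch (assuming the data frame data with the two Mexico series loaded above):

# Engle-Granger cointegration test; null hypothesis: no cointegration
from statsmodels.tsa.stattools import coint

clean = data.dropna()  # the test does not accept missing values
coint_t, coint_pvalue, coint_crit = coint(clean['merchandise_exports'], clean['infant_mortality'])
print('Cointegration t-statistic: %f' % coint_t)
print('p-value: %f' % coint_pvalue)
# A p-value below 0.05 would support treating the two series as cointegrated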

Are the infant mortality and export series cointegrated? Is the regression spurious or valid? Run and interpret the corresponding test.

SOLUTION

# prompt: Do an Augmented Dicky-Fuller test to check whether the residual of the previous regression is stationary
#!pip install statsmodels

from statsmodels.tsa.stattools import adfuller

# Get the residuals from the regression model
residuals = model.resid

# Perform the Augmented Dickey-Fuller test
adf_test = adfuller(residuals)
# The parameter regression='c' is the default value, so the regression is run with intercept

# Extract the results
adf_statistic = adf_test[0]
p_value = adf_test[1]
critical_values = adf_test[4]

# Print the results
print('ADF Statistic: %f' % adf_statistic)
ADF Statistic: -1.540577
print('p-value: %f' % p_value)
p-value: 0.513352
print('Critical Values:')
Critical Values:
for key, value in critical_values.items():
  print('\t%s: %.3f' % (key, value))
    1%: -3.646
    5%: -2.954
    10%: -2.616
# Interpret the results
if p_value <= 0.05:
  print("Result: The residual is likely stationary (reject the null hypothesis).")
else:
  print("Result: The residual is likely non-stationary (fail to reject the null hypothesis).")
Result: The residual is likely non-stationary (fail to reject the null hypothesis).

SINCE THE P-VALUE OF THE TEST IS >0.05, WE CANNOT REJECT THE NULL HYPOTHESIS THAT THE ERRORS ARE A NON-STATIONARY SERIES. SINCE THE ERRORS ARE NOT STATIONARY, WE CONCLUDE THAT THESE 2 NON-STATIONARY SERIES ARE NOT COINTEGRATED. AND SINCE THEY ARE NOT COINTEGRATED, THE REGRESSION BETWEEN THE MORTALITY RATE AND EXPORTS IS NOT VALID; IT IS A SPURIOUS REGRESSION!

5 CHALLENGE 3: Cointegration between Financial series

Using daily data of the Mexico IPCyC market index and the S&P 500, examine whether the two series are cointegrated. Generate an index for each instrument. To build these indexes, create a variable that represents how 1.00 peso or 1.00 dollar invested in each instrument would change over time.

  1. From Jan 1, 2015 to Oct 2, 2017.

  2. From Oct 3, 2017 to Feb 28, 2018

For both time periods, run a cointegration test and INTERPRET your results.

SOLUTION

I download both indexes from Yahoo Finance:

# prompt: Download the ^GSPC and ^MXX index from Yahoo from Jan 1, 2015 to Oct 2, 2017
import yfinance as yf

# Define the tickers and date range
tickers = ["^GSPC", "^MXX"]
start_date = "2015-01-01"
end_date = "2017-10-02"

# Download the data
indexes = yf.download(tickers, start=start_date, end=end_date)['Close']
# Rename the columns to avoid the special characters such as ^
indexes.columns = ['GSPC', 'MXX']

# Display the first few rows of the downloaded data
print(indexes.head())
                   GSPC           MXX
Date                                 
2015-01-02  2058.199951  42115.468750
2015-01-05  2020.579956  41099.371094
2015-01-06  2002.609985  41329.410156
2015-01-07  2025.900024  41813.929688
2015-01-08  2062.139893  42402.308594

I create an indexed dataset so that both indexes start with 1.0 for the first day:

# prompt: Using indexes, create a new dataset with the indexed version of the columns starting with 1.0 for the first period for both columns 

# Create a copy of the dataset:
indexed_data = indexes.copy()

# Calculate the value at the first period for each index
first_period_values = indexed_data.iloc[0]

# Index the data by dividing each value by the value at the first period
indexed_data = indexed_data.div(first_period_values)

# Display the first few rows of the indexed data
print("\nIndexed Data (Starting from 1.0 for the first period):")

Indexed Data (Starting from 1.0 for the first period):
print(indexed_data.head())
                GSPC       MXX
Date                          
2015-01-02  1.000000  1.000000
2015-01-05  0.981722  0.975874
2015-01-06  0.972991  0.981336
2015-01-07  0.984307  0.992840
2015-01-08  1.001914  1.006811

I plot both indexes:

#prompt: Using the indexed_data, Plot both columns
plt.figure(figsize=(12, 6))
plt.plot(indexed_data.index, indexed_data['GSPC'], label='S&P 500 (GSPC)')
plt.plot(indexed_data.index, indexed_data['MXX'], label='IPC (MXX)')
plt.xlabel('Date')
plt.ylabel('Indexed Value (Starting from 1.0)')
plt.title('Indexed S&P 500 and IPC Prices (2015-2017)')
plt.legend()
plt.xticks(rotation=90)
plt.grid(True)
plt.show()

I see that both indexes move very closely together.

Now I run a regression between these 2 indexes, which are non-stationary:

# prompt: Using indexed_data, run a regression where Y = ^MXX and x=^GSPC

# Run the regression of ^MXX on ^GSPC
model_indexed = smf.ols('MXX ~ GSPC', data=indexed_data).fit(missing='drop')

# Print the regression results
print(model_indexed.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                    MXX   R-squared:                       0.809
Model:                            OLS   Adj. R-squared:                  0.808
Method:                 Least Squares   F-statistic:                     2840.
Date:              mar., 27 may. 2025   Prob (F-statistic):          1.74e-243
Time:                        11:20:16   Log-Likelihood:                 1478.7
No. Observations:                 674   AIC:                            -2953.
Df Residuals:                     672   BIC:                            -2944.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      0.3567      0.014     25.791      0.000       0.330       0.384
GSPC           0.6985      0.013     53.288      0.000       0.673       0.724
==============================================================================
Omnibus:                        6.435   Durbin-Watson:                   0.065
Prob(Omnibus):                  0.040   Jarque-Bera (JB):                4.968
Skew:                          -0.101   Prob(JB):                       0.0834
Kurtosis:                       2.631   Cond. No.                         26.6
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

I can save the beta coefficients of the model:

b0 = model_indexed.params['Intercept']
b1 = model_indexed.params['GSPC']
print('beta0: %f' % b0)
beta0: 0.356680
print('beta1: %f' % b1)
beta1: 0.698516

We will now run the cointegration test. But first, I will examine the errors (residuals) of this regression according to the regression equation:

  • Errors = (Real MXX - Predicted MXX)
  • Predicted MXX = 0.3567+ 0.6985 x (Real GSPC)
  • Errors = (Real MXX - 0.3567 - 0.6985 x (Real GSPC))

We can see that Errors is basically a “scaled” distance from the Mexican index to the US index: Errors equals the real Y (MXX) minus beta0, minus beta1 times the real X (GSPC). Then, if this “scaled” distance between the series stays within a certain range over time, we can say that both series are cointegrated. The condition is that the errors series must be a STATIONARY series in order to conclude that both NON-STATIONARY series are cointegrated.
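To make this concrete, we can reproduce the errors by hand from the saved coefficients and verify that they match the residuals stored in the fitted model (a quick sketch that reuses b0, b1 and indexed_data from above):

# Errors = real MXX minus predicted MXX (predicted from the regression line)
manual_errors = (indexed_data['MXX'] - (b0 + b1 * indexed_data['GSPC'])).dropna()

# They should coincide with the residuals reported by statsmodels
print(np.allclose(manual_errors, model_indexed.resid))  # expected to print True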

We can see the errors over time:

# prompt: Plot the model_indexed residuals
# Get the residuals from the indexed regression model
model_indexed_residuals = model_indexed.resid

# Plot the residuals
plt.figure(figsize=(12, 6))
plt.plot(model_indexed_residuals.index, model_indexed_residuals)
plt.xlabel('Date')
plt.ylabel('Residuals')
plt.title('Residuals from Regression of Indexed MXX on Indexed GSPC')
plt.grid(True)
plt.show()

Now we run the test for cointegration:

# prompt: Run the augmented Dicky-Fuller test for model_indexed_residuals

# Perform the Augmented Dickey-Fuller test on the indexed model residuals
adf_test_indexed = adfuller(model_indexed_residuals)

# Extract the results for the indexed residuals test
adf_statistic_indexed = adf_test_indexed[0]
p_value_indexed = adf_test_indexed[1]
critical_values_indexed = adf_test_indexed[4]

# Print the results for the indexed residuals test
print('ADF Statistic (Indexed Residuals): %f' % adf_statistic_indexed)
ADF Statistic (Indexed Residuals): -4.006068
print('p-value (Indexed Residuals): %f' % p_value_indexed)
p-value (Indexed Residuals): 0.001379
print('Critical Values (Indexed Residuals):')
Critical Values (Indexed Residuals):
for key, value in critical_values_indexed.items():
    print('\t%s: %.3f' % (key, value))
    1%: -3.440
    5%: -2.866
    10%: -2.569
# Interpret the results for the indexed residuals test
if p_value_indexed <= 0.05:
    print("Result: The indexed residual is likely stationary (reject the null hypothesis).")
else:
    print("Result: The indexed residual is likely non-stationary (fail to reject the null hypothesis).")
Result: The indexed residual is likely stationary (reject the null hypothesis).

Since the p-value<0.05, we conclude that the residuals are stationary, so both indexes are COINTEGRATED in this time period; we can say that between Jan 2015 and Oct 2017 the Mexican financial market was COINTEGRATED with the US financial market.

Then, we can rely on the regression results. We can say that for each $1.00 gained by one dollar invested in the US index, one peso invested in the Mexican index gains on average about 0.70 pesos (beta1 = 0.6985). There is a positive and significant relationship between the US and the Mexican market.

NOW I DO THE SAME FOR THE SECOND PERIOD:

  1. From Oct 3, 2017 to Feb 28, 2018
# prompt: Download the ^GSPC and ^MXX index from Yahoo from Oct 3, 2017 to Feb 28, 2018
import yfinance as yf
# Define the tickers and date range
tickers = ["^GSPC", "^MXX"]
start_date = "2017-10-03"
end_date = "2018-02-28"
# Download the data
indexes = yf.download(tickers, start=start_date, end=end_date)['Close']

# Rename the columns to avoid the special characters such as ^
indexes.columns = ['GSPC', 'MXX']
# Display the first few rows of the downloaded data
print(indexes.head())
                   GSPC           MXX
Date                                 
2017-10-03  2534.580078  50615.289062
2017-10-04  2537.739990  50565.289062
2017-10-05  2552.070068  50480.921875
2017-10-06  2549.330078  50302.960938
2017-10-09  2544.729980  50071.941406

I create an indexed dataset so that both indexes start with 1.0 for the first day:

# prompt: Using indexes, create a new dataset with the indexed version of the columns starting with 1.0 for the first period for both columns 

# Create a copy of the dataset:
indexed_data = indexes.copy()

# Calculate the value at the first period for each index
first_period_values = indexed_data.iloc[0]

# Index the data by dividing each value by the value at the first period
indexed_data = indexed_data.div(first_period_values)

# Display the first few rows of the indexed data
print("\nIndexed Data (Starting from 1.0 for the first period):")

Indexed Data (Starting from 1.0 for the first period):
print(indexed_data.head())
                GSPC       MXX
Date                          
2017-10-03  1.000000  1.000000
2017-10-04  1.001247  0.999012
2017-10-05  1.006901  0.997345
2017-10-06  1.005820  0.993829
2017-10-09  1.004005  0.989265

I plot both indexes:

#prompt: Using the indexed_data, Plot both columns
plt.figure(figsize=(12, 6))
plt.plot(indexed_data.index, indexed_data['GSPC'], label='S&P 500 (GSPC)')
plt.plot(indexed_data.index, indexed_data['MXX'], label='IPC (MXX)')
plt.xlabel('Date')
plt.ylabel('Indexed Value (Starting from 1.0)')
plt.title('Indexed S&P 500 and IPC Prices (Oct 2017 - Feb 2018)')
plt.legend()
plt.xticks(rotation=90)
plt.grid(True)
plt.show()

I see that in this period the two indexes do not move together with a clear relationship.

Now I run a regression between these 2 indexes, which are non-stationary:

# prompt: Using indexed_data, run a regression where Y = ^MXX and x=^GSPC

# Run the regression of ^MXX on ^GSPC
model_indexed = smf.ols('MXX ~ GSPC', data=indexed_data).fit(missing='drop')

# Print the regression results
print(model_indexed.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                    MXX   R-squared:                       0.080
Model:                            OLS   Adj. R-squared:                  0.070
Method:                 Least Squares   F-statistic:                     8.257
Date:              mar., 27 may. 2025   Prob (F-statistic):            0.00501
Time:                        11:20:18   Log-Likelihood:                 242.55
No. Observations:                  97   AIC:                            -481.1
Df Residuals:                      95   BIC:                            -476.0
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      0.7972      0.059     13.447      0.000       0.679       0.915
GSPC           0.1621      0.056      2.874      0.005       0.050       0.274
==============================================================================
Omnibus:                       10.439   Durbin-Watson:                   0.114
Prob(Omnibus):                  0.005   Jarque-Bera (JB):                3.804
Skew:                           0.130   Prob(JB):                        0.149
Kurtosis:                       2.065   Cond. No.                         58.3
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

We can see the errors over time:

# prompt: Plot the model_indexed residuals
# Get the residuals from the indexed regression model
model_indexed_residuals = model_indexed.resid

# Plot the residuals
plt.figure(figsize=(12, 6))
plt.plot(model_indexed_residuals.index, model_indexed_residuals)
plt.xlabel('Date')
plt.ylabel('Residuals')
plt.title('Residuals from Regression of Indexed MXX on Indexed GSPC')
plt.grid(True)
plt.show()

Now we run the test for cointegration:

# prompt: Run the augmented Dicky-Fuller test for model_indexed_residuals

# Perform the Augmented Dickey-Fuller test on the indexed model residuals
adf_test_indexed = adfuller(model_indexed_residuals)

# Extract the results for the indexed residuals test
adf_statistic_indexed = adf_test_indexed[0]
p_value_indexed = adf_test_indexed[1]
critical_values_indexed = adf_test_indexed[4]

# Print the results for the indexed residuals test
print('ADF Statistic (Indexed Residuals): %f' % adf_statistic_indexed)
ADF Statistic (Indexed Residuals): -2.089553
print('p-value (Indexed Residuals): %f' % p_value_indexed)
p-value (Indexed Residuals): 0.248736
print('Critical Values (Indexed Residuals):')
Critical Values (Indexed Residuals):
for key, value in critical_values_indexed.items():
    print('\t%s: %.3f' % (key, value))
    1%: -3.500
    5%: -2.892
    10%: -2.583
# Interpret the results for the indexed residuals test
if p_value_indexed <= 0.05:
    print("Result: The indexed residual is likely stationary (reject the null hypothesis).")
else:
    print("Result: The indexed residual is likely non-stationary (fail to reject the null hypothesis).")
Result: The indexed residual is likely non-stationary (fail to reject the null hypothesis).

Since the p-value of the test is >0.05, we cannot reject the null hypothesis that states that the residuals are NON-stationary. Since the residuals are non-stationary, for this period the Mexican market index is NOT COINTEGRATED with the US market. Also, for this period, the regression result between both indexes is not reliable!

In cases where 2 non-stationary variables are NOT COINTEGRATED, and I want to know what type of relationship exists between them, I can first convert the variables into STATIONARY variables, and then RUN THE REGRESSION using the stationary transformed variables.

In Finance and Economics, the vast majority of non-stationary variables such as prices and indexes can easily be transformed into stationary series by taking the first difference of their natural logarithm.

Remember that the first difference of the log price is the % change (return) of the variable in continuously compounded terms!

Then, for this 2nd period, if I want to see how the Mexican market is related to the US financial market, I can run the regression using the daily returns for both indexes:

# prompt: Create a new dataset with log returns of the indexes columns 

# Calculate log returns for each column
returns = np.log(indexes / indexes.shift(1)).dropna()

# Display the first few rows of the log returns
print("\nLog Returns:")

Log Returns:
print(returns.head())
                GSPC       MXX
Date                          
2017-10-04  0.001246 -0.000988
2017-10-05  0.005631 -0.001670
2017-10-06 -0.001074 -0.003532
2017-10-09 -0.001806 -0.004603
2017-10-10  0.002320 -0.001779
# prompt: Using returns, run a regression where Y=MXX and x=GSPC

# Run the regression of MXX log returns on GSPC log returns
model_returns = smf.ols('MXX ~ GSPC', data=returns).fit(missing='drop')

# Print the regression results
print(model_returns.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                    MXX   R-squared:                       0.181
Model:                            OLS   Adj. R-squared:                  0.171
Method:                 Least Squares   F-statistic:                     19.17
Date:              mar., 27 may. 2025   Prob (F-statistic):           3.31e-05
Time:                        11:20:18   Log-Likelihood:                 322.20
No. Observations:                  89   AIC:                            -640.4
Df Residuals:                      87   BIC:                            -635.4
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -0.0011      0.001     -1.509      0.135      -0.002       0.000
GSPC           0.4206      0.096      4.379      0.000       0.230       0.612
==============================================================================
Omnibus:                        8.916   Durbin-Watson:                   2.207
Prob(Omnibus):                  0.012   Jarque-Bera (JB):                9.626
Skew:                          -0.566   Prob(JB):                      0.00812
Kurtosis:                       4.146   Cond. No.                         138.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
b0 = model_returns.params['Intercept']
b1 = model_returns.params['GSPC']
print('beta0: %f' % b0)
beta0: -0.001061
print('beta1: %f' % b1)
beta1: 0.420623

Since I ran a linear regression using returns, and the returns of financial series are (almost always) stationary, the regression result will be VALID!
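We can quickly back up this claim with a Dickey-Fuller test on each daily return series (a sketch that reuses the returns data frame and the adfuller function imported earlier); very small p-values would confirm that the daily returns can be treated as stationary.

# ADF test on the daily returns of each index
for col in ['GSPC', 'MXX']:
    stat, pval = adfuller(returns[col])[:2]
    print('%s returns: ADF statistic = %.3f, p-value = %.4f' % (col, stat, pval))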

I can interpret this regression as follows:

The daily returns of the Mexican index are significantly and positively related to the daily returns of the S&P500 index. For +1% change in the S&P500 daily returns, it is expected that the daily return of the Mexican index will move in about 0.4206%.