Workshop 2, Time Series Models

Author

Alberto Dorantes

Published

May 19, 2025

Abstract
In this workshop we learn the concepts of stationarity, unit root and cointegration of time-series variables.

0.1 General Directions for each workshop

You have to work on Google Colab for all your workshops. In Google Colab, you MUST LOG IN with your @tec.mx account and then create a Google Colab document for each workshop.

You must share each Colab document (workshop) with the following account:

  • cdorante@tec.mx

You must give Edit privileges to this account.

In Google Colab you can work with Python or R notebooks. The default is Python notebooks.

Your Notebook will have a default name like “Untitled2.ipynb”. Click on this name and change it to “W2-TimeSeries-YourFirstName-YourLastname”.

In your Workshop Notebook you have to:

  • You have to write and run Python code to do the exercises of this Workshop

  • You have to do whatever is asked in the workshop: respond to specific questions and/or complete the exercises/challenges.

For ANY QUESTION or INTERPRETATION, you have to RESPOND IN CAPITAL LETTERS right after the question.

  • It is STRONGLY RECOMMENDED that you write your OWN NOTES as if this were your personal notebook to study for the FINAL EXAM. Your own workshop/notebook will be very helpful for your further study.

Once you finish your workshop, make sure that you RUN ALL CHUNKS. You can run each code chunk by clicking on the “Run” button located in the top-left section of each chunk. You can also run all the chunks in one-shot with Ctrl-F9. You have to submit to Canvas the web link of your Google Colab workshop.

1 Stationarity of time series variables

1.1 Introduction

In our previous workshop we learned about the random walk model applied to finance. We learned that the logarithm of a stock price behaves similarly to a random walk model with a positive drift (\phi_0>0).

As we learned, a random walk model can be expressed in the following equation:

Y_t = \phi_0 + Y_{t−1} + ε_t

Following this model, we ran a Monte Carlo simulation for the logarithm of the S&P500, and we found that, most of the time, the simulated random walk behaves similarly to the real log of the S&P500 index.

We learned that if \phi_0>0, then the series will have a growing trend over time. We also learned that the higher the standard deviation of the daily shock, the more likely that the series will radically move up and down. The standard deviation of the shock is actually the daily volatility of the daily returns of the series.

An important insight about the random walk model is that it is a non-stationary series. What does this mean?

In short, a stationary series is a series whose mean is about the same in any time period (no matter which time periods we select), and whose standard deviation is also about the same for any time period.

Then, the random-walk model will always behave as a non-stationary series.

Let’s analyze which part of the random walk equation makes the series non-stationary. We can re-write the previous random walk equation as follows:

Y_t = \phi_0 + \phi_1*Y_{t−1} + ε_t

If the coefficient \phi_{1}=1, then this equation becomes the random walk equation. When \phi_{1}=1 the Y_t series becomes non-stationary.

Interestingly, if \phi_{1}<1, then Y_t becomes a stationary series.
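The intuition is that, when |\phi_1|<1, the AR(1) process has a constant long-run mean and a constant variance (a standard result for this model):

E[Y_t] = \phi_0 / (1 − \phi_1),    Var(Y_t) = \sigma_\epsilon^2 / (1 − \phi_1^2)

When \phi_1=1 these expressions are not defined; in that case the variance of the series keeps growing over time, which is exactly what makes the random walk non-stationary.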

1.2 Simulating an AR(1) process

In-class exercise

Simulate an AR(1) process with \phi_0=0.01, standard deviation of the error = 0.01, and \phi_1=0.98.

Re-run the same AR(1) simulation, but now change \phi_1 = 0.10

Re-run the same AR(1) simulation, but now change \phi_1 = 0.999

What differences do you observe with the 3 simulations?

What happens when \phi_1 is very small, and when it is close to 1?
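As a starting point, here is a minimal sketch of how this simulation could be coded (the number of periods, 200, and the random seed are arbitrary choices that you can change):

import numpy as np
import matplotlib.pyplot as plt

np.random.seed(2025)   # arbitrary seed so the simulation can be reproduced
T = 200                # number of simulated periods (arbitrary choice)
phi0 = 0.01            # constant (drift)
phi1 = 0.98            # autoregressive coefficient; re-run with 0.10 and 0.999
sigma = 0.01           # standard deviation of the error (shock)

e = np.random.normal(0, sigma, T)   # random shocks
y = np.zeros(T)
for t in range(1, T):
    # AR(1) equation: Y_t = phi0 + phi1*Y_{t-1} + e_t
    y[t] = phi0 + phi1 * y[t-1] + e[t]

plt.plot(y)
plt.title('Simulated AR(1) with phi1 = ' + str(phi1))
plt.show()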

Let’s work with an example using sales data.

1.3 Data collection and data management

We will work with an online dataset that contains sales data of different consumer products.

We download a dataset from an online Excel file. Type:

#pip install openpyxl
import pandas as pd
import requests
from io import BytesIO

# URL of the online Excel file
url = "http://www.apradie.com/ec2004/salesbrands3.xlsx"

# Define headers to simulate a browser request
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}

# Fetch the file from the URL with headers
response = requests.get(url, headers=headers)
response.raise_for_status()  # Raise an error for bad status codes

# Load the Excel file into a pandas DataFrame
sales_brand = pd.read_excel(BytesIO(response.content), sheet_name='Resultado')  # Adjust sheet_name if needed

# Display first few rows
print(sales_brand.head())
        date  qbrand1tons  salesbrand1  ...  salesbrand6  tonstotal  salestotal
0 2015-01-01       106.62     60780.72  ...     55561.26    1148.41   302762.85
1 2015-02-01       105.82     60411.47  ...     51469.74    1274.85   327076.83
2 2015-03-01        88.82     53174.89  ...     50395.48    1085.59   298694.21
3 2015-04-01        86.69     50861.69  ...     49124.61     961.22   267792.19
4 2015-05-01        87.23     51014.23  ...     46906.63     997.47   266576.37

[5 rows x 15 columns]

As you can see, we have imported a monthly dataset where the variable date shows the first day of the month in a yyyy-mm-dd format. However, this column is in text format.

When working with time-series datasets it is strongly recommended to set a date column as the index of the data frame. With this, Python will know that we have a monthly time-series dataset.

Then, we convert the date column to Date type and set this column as the index of the data frame:

import pandas as pd
sales_brand['date'] = pd.to_datetime(sales_brand['date'], errors='coerce')
sales_brand.set_index('date', inplace=True)
sales_brand.head()
            qbrand1tons  salesbrand1  ...  tonstotal  salestotal
date                                  ...                       
2015-01-01       106.62     60780.72  ...    1148.41   302762.85
2015-02-01       105.82     60411.47  ...    1274.85   327076.83
2015-03-01        88.82     53174.89  ...    1085.59   298694.21
2015-04-01        86.69     50861.69  ...     961.22   267792.19
2015-05-01        87.23     51014.23  ...     997.47   266576.37

[5 rows x 14 columns]

This dataset has sales information of several brands of consumer products that belong to one category (for example, the category can be Cereals or Candies). The content of this dataset is very similar to a real-world category. Consumer products usually show some seasonality and trend over time. We will learn how to handle these features later.

1.4 Stationary vs Non-stationary

We can get an overview of the sales behavior of these consumer products using the plot command. Let’s focus on brands 2 and 6. To see volume sales (the number of tons sold) of brand 2:

import matplotlib.pyplot as plt
plt.figure(figsize=(12, 6))
plt.plot(sales_brand.index, sales_brand['qbrand2tons'])
plt.xlabel('Date')
plt.ylabel('Volume of qbrand2 (in tons)')
plt.title('Sales volume of brand2')
plt.grid(True)
plt.show()

It seems that this product has a growing trend over time, except for the last year.

We can plot sales of brand 6:

plt.figure(figsize=(12, 6))
plt.plot(sales_brand.index, sales_brand['qbrand6tons'])
plt.xlabel('Date')
plt.ylabel('Volume of qbrand6 (in tons)')
plt.title('Sales volume of brand6')
plt.grid(True)
plt.show()

This brand has a clear growing trend over time. As we can see, many consumer products have a growing or a declining trend. Other products might seem stationary over time, with sales fluctuating around an average level.

A series with a growing or declining trend is non-stationary. Most of the variables that represent sales of any product can be treated as non-stationary. Then, a rule of thumb when we work with business variables such as volume sales, value sales, stock prices, cost of goods sold, etc. is to transform the series into a stationary series.

When you work with sales data, I strongly recommend transforming the sales data by applying the natural logarithm. If we then take the first difference of the log series, we are calculating the monthly % change (continuously compounded). This first difference of the log will be a stationary series most of the time when you analyze sales data.

Then we calculate the logarithm of volume sales for the whole dataset:

import numpy as np

ln_sales = np.log(sales_brand)
ln_sales.head()
            qbrand1tons  salesbrand1  ...  tonstotal  salestotal
date                                  ...                       
2015-01-01     4.669271    11.015028  ...   7.046134   12.620705
2015-02-01     4.661740    11.008934  ...   7.150584   12.697950
2015-03-01     4.486612    10.881342  ...   6.989879   12.607176
2015-04-01     4.462339    10.836865  ...   6.868203   12.497967
2015-05-01     4.468548    10.839860  ...   6.905222   12.493416

[5 rows x 14 columns]

Now we create a dataset with the monthly % change (continuously compounded) by calculating the first difference of the log values:

msales_growth = ln_sales.diff(1).dropna()
# I apply the dropna function at the end to delete the first row since it is not possible to calculate the % change for the first month
msales_growth.head()
            qbrand1tons  salesbrand1  ...  tonstotal  salestotal
date                                  ...                       
2015-02-01    -0.007532    -0.006094  ...   0.104450    0.077245
2015-03-01    -0.175128    -0.127593  ...  -0.160705   -0.090775
2015-04-01    -0.024273    -0.044476  ...  -0.121676   -0.109209
2015-05-01     0.006210     0.002995  ...   0.037019   -0.004551
2015-06-01    -0.019330    -0.049208  ...  -0.080772   -0.073313

[5 rows x 14 columns]

The diff(1) function applied to ln_sales computes the first difference of the log values row by row, so it calculates the % monthly change for all months and for all columns.

The diff function calculates the first difference for all rows as follows:

First Difference of a column at t = column value at t minus the previous column value (at t-1)
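For example, the first value of msales_growth for qbrand1tons, -0.007532, is (up to rounding) the log value of 2015-02-01 minus the log value of 2015-01-01 (4.661740 − 4.669271). As a quick check, not required for the workshop, we can reproduce the same result manually with the shift function, which moves every value one row down:

# shift(1) moves the values down one row, so row t holds the value of t-1
manual_growth = ln_sales - ln_sales.shift(1)   # value at t minus value at t-1
print(manual_growth.dropna().head())           # should match msales_growth.head()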

Let’s see how the monthly % growth looks for brands 2 and 6:

plt.figure(figsize=(12, 6))
plt.plot(msales_growth.index, msales_growth['qbrand2tons'], label='qbrand2tons')
plt.plot(msales_growth.index, msales_growth['qbrand6tons'], label='qbrand6tons')
plt.xlabel('Date')
plt.ylabel('Monthly % growth')
plt.title('Monthly % growth in qbrand2tons and qbrand6tons')
plt.legend()
plt.grid(True)
plt.show()

We can see that the 2 series look stationary since their means tend to stay at the same level for any time period.

Another transformation with monthly data is annual percentage change, which can be calculated as the difference between the log value of the series and its log value 12 months ago. We can calculate the annual % change of sales month by month as follows:

ysales_growth = ln_sales.diff(12).dropna()
ysales_growth.qbrand2tons.head()
date
2016-01-01   -0.143270
2016-02-01   -0.103516
2016-03-01   -0.202297
2016-04-01   -0.052669
2016-05-01   -0.161421
Name: qbrand2tons, dtype: float64

We plot the annual % change of volume sales for brand 2 and brand 6:

plt.figure(figsize=(12, 6))
plt.plot(ysales_growth.index, ysales_growth['qbrand2tons'], label='qbrand2tons')
plt.plot(ysales_growth.index, ysales_growth['qbrand6tons'], label='qbrand6tons')
plt.xlabel('Date')
plt.ylabel('Annual % growth')
plt.title('Annual % growth in qbrand2tons and qbrand6tons')
plt.legend()
plt.grid(True)
plt.show()

In this case, it is difficult to evaluate graphically whether both series are stationary. Some series might not look stationary to our eyes, so we need to run a statistical test to decide whether to treat the series as stationary or not.

There is a statistical way to evaluate whether these series are stationary; we need this statistical confirmation in addition to the visual tools. The Dickey-Fuller test is one of the most widely used statistical tests for stationarity. This test is called a unit-root test. The Dickey-Fuller test assumes that the series is non-stationary. In other words, it assumes that the series follows a random walk with a unit root (\phi_1=1).

We can run the Dickey-Fuller test for the annual % growth of sales of brand 2 as follows:

# !pip install statsmodels
from statsmodels.tsa.stattools import adfuller

dftest = adfuller(ysales_growth['qbrand2tons'], regression='ct')
# regression='ct' includes a constant (intercept) and a linear trend in the test regression

# Print the results
print('ADF Statistic: %f' % dftest[0])
print('p-value: %f' % dftest[1])
print('Critical Values:')
for key, value in dftest[4].items():
    print('\t%s: %.3f' % (key, value))

ADF Statistic: -3.507864
p-value: 0.038557
Critical Values:
    1%: -4.198
    5%: -3.524
    10%: -3.193

The null hypothesis of the Dickey-Fuller test is that the series is non-stationary. More precisely, the null hypothesis of this test is that \phi_1=1, which implies that the series is non-stationary. This is why the Dickey-Fuller test is called a unit-root test.

As with any statistical test, if the p-value of the test is less than 0.05, then we have statistical evidence (at the 95% confidence level) to REJECT the null hypothesis.

The Dickey-Fuller test reports a p-value less than 0.05, indicating that we can reject the null hypothesis of a unit root for this series (\phi_1 = 1). In other words, since the p-value is less than 0.05, there is statistical evidence to conclude that the annual % change of sales of brand 2 can be treated as stationary.
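For the challenges that follow, it can be convenient to wrap the test in a small helper function that prints a readable conclusion. This is just a convenience sketch (the function name adf_report and the 0.05 threshold are my own choices):

from statsmodels.tsa.stattools import adfuller

def adf_report(series, regression='ct', alpha=0.05):
    # Run the (augmented) Dickey-Fuller test and print an interpretation
    result = adfuller(series.dropna(), regression=regression)
    print('ADF Statistic: %f' % result[0])
    print('p-value: %f' % result[1])
    if result[1] < alpha:
        print('p-value < %.2f: we REJECT the null hypothesis of a unit root; treat the series as stationary.' % alpha)
    else:
        print('p-value >= %.2f: we CANNOT reject the null hypothesis; treat the series as non-stationary.' % alpha)
    return result

# Example: adf_report(ysales_growth['qbrand6tons'])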

2 CHALLENGE 1

RUN AND INTERPRET A DICKEY-FULLER TEST FOR THE ANNUAL % DIFFERENCE OF BRAND 6, AND PROVIDE AN INTERPRETATION

3 Spurious regression

When we want to examine the relationship between two non-stationary variables by running a regression model, we run the risk of ending up with a non-valid (spurious) regression. Before we discuss why a regression model can be spurious, we start with an example using 2 real-world variables.

Install the wbdata package. This package provides a simple interface to the World Bank databases, and it has functions to download data for countries around the world.

We will download the infant mortality rate and the merchandise exports index for Mexico. These variables are supposed to have nothing in common, so we would not expect a significant relationship.

#!pip install wbdata

import wbdata
import datetime

# Define the series and country
series = {"SP.DYN.IMRT.IN": "infant_mortality", "TX.VAL.MRCH.XD.WD": "merchandise_exports"}
country = "MEX"

# Define the dates
data_date = (datetime.datetime(1980, 1, 1), datetime.datetime(2022, 1, 1))

# Download the data
data = wbdata.get_dataframe(series, country=country, date=data_date)

# Order the dataset by date (the date is the index of the returned data frame)
data.sort_index(inplace=True)

# Display the first few rows
print(data.head())
      infant_mortality  merchandise_exports
date                                       
1980              58.2             4.738168
1981              55.8             6.124695
1982              53.5             6.321201
1983              51.3             6.819955
1984              49.1             7.647000

We plot both variables over time:

plt.figure(figsize=(12, 6))
plt.plot(data.index, data['infant_mortality'], label='Infant Mortality')
plt.plot(data.index, data['merchandise_exports'], label='Merchandise Exports')
plt.xlabel('Date')
plt.ylabel('Value')
plt.title('Infant Mortality and Merchandise Exports over Time')
plt.legend()
plt.xticks(rotation=90)
plt.grid(True)
plt.show()

Now run a regression using these series and report the result. Did you find a significant relationship? Is your result what you expected?

import statsmodels.formula.api as smf

# Perform the regression
model = smf.ols('merchandise_exports  ~ infant_mortality', data=data).fit(missing='drop')

# Print the regression results
print(model.summary())
                             OLS Regression Results                            
===============================================================================
Dep. Variable:     merchandise_exports   R-squared:                       0.765
Model:                             OLS   Adj. R-squared:                  0.759
Method:                  Least Squares   F-statistic:                     130.3
Date:                 Tue, 20 May 2025   Prob (F-statistic):           3.73e-14
Time:                         08:13:23   Log-Likelihood:                -184.31
No. Observations:                   42   AIC:                             372.6
Df Residuals:                       40   BIC:                             376.1
Df Model:                            1                                         
Covariance Type:             nonrobust                                         
====================================================================================
                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------
Intercept          119.9454      6.892     17.403      0.000     106.016     133.875
infant_mortality    -2.5541      0.224    -11.414      0.000      -3.006      -2.102
==============================================================================
Omnibus:                       12.461   Durbin-Watson:                   0.122
Prob(Omnibus):                  0.002   Jarque-Bera (JB):                3.613
Skew:                           0.343   Prob(JB):                        0.164
Kurtosis:                       1.738   Cond. No.                         69.0
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

3.1 CHALLENGE 2 - QUESTION:

RESEARCH ABOUT SPURIOUS REGRESSION. In your own words, briefly explain in which cases you can end up with a spurious regression.

4 Cointegration

When two non-stationary series are strongly related in the long run, then we say that both series are cointegrated. The question is how strong this long-term relationship needs to be in order to consider 2 non-stationary series as cointegrated.

To do a cointegration statistical test of 2 non-stationary series, we need to do the following:

  1. Run a linear regression of the 2 non-stationary series

  2. Get the residuals of this regression

  3. Do a Dickey-Fuller test to check whether the residuals (errors) of the previous regression are a stationary series. If the p-value < 0.05, then we can say that there is statistical evidence to consider the 2 non-stationary series as cointegrated.

If 2 non-stationary series are NOT cointegrated, then their linear regression becomes a spurious regression. In other words, when 2 non-stationary series are not cointegrated, the result of a linear regression using these 2 variables will not be reliable or valid.
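A minimal sketch of these 3 steps in Python, assuming y and x are two non-stationary pandas Series (the function name is my own choice; note that statsmodels also provides coint() in statsmodels.tsa.stattools, which runs this residual-based test with appropriate critical values):

import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.tsa.stattools import adfuller

def cointegration_test(y, x):
    df = pd.DataFrame({'y': y, 'x': x}).dropna()
    # Step 1: run a linear regression of one non-stationary series on the other
    reg = smf.ols('y ~ x', data=df).fit()
    # Step 2: get the residuals of this regression
    residuals = reg.resid
    # Step 3: Dickey-Fuller test on the residuals
    dftest = adfuller(residuals)
    print('ADF Statistic on residuals: %f' % dftest[0])
    print('p-value: %f' % dftest[1])
    return dftest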

Are the infant mortality and export series cointegrated? Is the regression spurious or valid? Run and interpret the corresponding test

5 CHALLENGE 3: Cointegration between Financial series

Using daily data of the Mexican IPCyC market index and the S&P 500, examine whether the two series are cointegrated. Generate an index for each instrument. To build these indexes, create a variable that represents how 1.00 peso or 1.00 dollar invested in each instrument would change over time (see the sketch after the periods below).

  1. From Jan 1, 2015 to Oct 2, 2017.

  2. From Oct 3, 2017 to Feb 28, 2018

For both time periods, run a cointegration test and INTERPRET your results.
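A minimal sketch of how such an index can be built, assuming prices is a pandas Series of daily closing prices for one instrument (the variable names are placeholders; how you download the price data is up to you):

# 'prices' is assumed to be a pandas Series of daily closing prices (placeholder name)
investment_index = prices / prices.iloc[0]
# The index starts at 1.00 (1.00 peso or 1.00 dollar invested on the first day)
# and shows how that investment changes over time.
# Build one index per instrument and run the cointegration test on the two indexes for each period.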

6 W2 submission

  • Complete (100%): If you submit an ORIGINAL and COMPLETE work with all the code and challenges, with your notes, and with your OWN RESPONSES to questions
  • Incomplete (75%): If you submit an ORIGINAL work with ALL the activities but you did NOT RESPOND to the challenges/questions and/or you did not do all activities and respond to some of the questions.
  • Very Incomplete (10%-70%): If you complete from 10% to 75% of the workshop or you completed more but parts of your work is a copy-paste from other workshops.
  • Not submitted (0%)

Remember that you have to submit your Google Colab link through Canvas, and also grant EDIT privileges to cdorante@tec.mx