Workshop 7, Advanced AI - Statistics Module

Author

Alberto Dorantes D., Ph.D.

Abstract

In this workshop we learn the basics of time series regression models.

1 Introduction to Time Series

A time series is a variable that measures an attribute of a subject over time. For example, the daily stock price of a firm from 2010 to date.

We can observe several patterns or components of a time series:

if the series has a clear growing or declining trend (trend),
if the series has systematic changes in seasons of the year (seasonality),
if the series shows cycles longer than a year (cycles), and
if the series has irregular or idiosyncratic movements (irregular) that makes the series unpredictable.

The first 3 components can be predicted statistically with a time-series econometric model. However, the irregularity component is unpredictable, but it can be modeled as a random “shock” that behaves similar to a normal distributed variable with a specific standard deviation and a mean equal to zero. The random shock has random negative and positive movements that makes the series move in an unpredictable way.

2 Non stationary variables - The Random Walk model for stock prices

The random walk hypothesis in Finance (Fama, 1965) states that the natural logarithm of stock prices behaves like a random walk with a drift. A random walk is a series (or variable) that cannot be predicted. Imagine that \(Y_t\) is the log price of a stock for today (t). The value of Y for tomorrow (\(Y_{t+1}\)) will be equal to its today’s value (\(Y_t\)) plus a constant value (\(φ_0\)) plus a random shock. This shock is a pure random value that follows a normal distribution with mean=0 and a specific standard deviation \(σ_ε\). The process is supposed to be the same for all future periods. In mathematical terms, the random walk model is the following:

\[ Y_t = φ_0 + Y_{t−1} + ε_t \]

The \(ε_t\) is a random shock for each day, which is the result of the log price movement due to all news (external and internal to the stock) that influence the price. \(φ_0\) refers as the drift of the series. If \(|φ_0|\) > 0 we say that the series is a random walk with a drift. If \(φ_0\) is positive, then the variable will have a positive trend over time; if it is negative, the series will have a negative trend.

If we want to simulate a random walk, we need the values of the following parameters/variables:

\(Y_0\), the first value of the series
\(φ_0\), the drift of the series
\(σ_ε\), the standard deviation (volatility) of the random shock

3 Monte Carlo simulation for the random walk model

Let’s go and run a MonteCarlo simulation for a random walk of the S&P 500. We will use real values of the S&P500 to estimate the previous 3 parameters.

3.1 Downloading data for the S&P500

We download the S&P500 historical daily data from Yahoo Finance from 2009 to date (Sep 30, 2022).

I download the S&P500 index from Yahoo Finance!

import pandas as pd
import yfinance as yf
from datetime import date
import numpy as np
import statistics as st

sp500=yf.download(tickers="^GSPC", start="2008-01-01",interval="1d")
# I keep the adj close column:
sp500=sp500['Adj Close']
sp500

[*********************100%***********************]  1 of 1 completed

Date
2008-01-02    1447.160034
2008-01-03    1447.160034
2008-01-04    1411.630005
2008-01-07    1416.180054
2008-01-08    1390.189941
                 ...     
2024-09-17    5634.580078
2024-09-18    5618.259766
2024-09-19    5713.640137
2024-09-20    5702.549805
2024-09-23    5713.419922
Name: Adj Close, Length: 4210, dtype: float64

Now we generate the log of the S&P index using the closing price/quotation, and create a variable N for the number of days in the dataset:

lnsp500 = np.log(sp500)
lnsp500
# N will be the # of days in the series
N = len(lnsp500)
N

Now we will simulate 2 random walk series estimating the 3 parameters from this log series of the S&P500:

random walk with a drift (name it rw1), and
random walk with no drift (name it rw2).

3.2 Estimating the parameters of the random walk model

We have to consider the mathematical definition of a random walk and estimate its parameters (initial value, phi0, volatility of the random shock) from the real daily S&P500 data.

Now, we create a variable for a random walk with a drift trying to model the log of the S&P500.

Reviewing the random walk equation again:

\[ Y_t = φ_0 + Y_{t−1} + ε_t \] The \(ε_t\) is the random shock of each day, which represents the overall average perception of all market participants after learning the news of the day (internal and external news announced to the market).

Remember that \(\varepsilon_{t}\) behaves like a random normal distributed variable with mean=0 and with a specific standard deviation \(\sigma_{\varepsilon}\).

For the simulation of the random walk, you need to estimate the values of

\(y_{0}\), the first value of the series, which is the log S&P500 index of the first day
\(\phi_{0}\)
\(\sigma_{\varepsilon}\)

You have to estimate \(\phi_{0}\) using the last and the first real values of the series following the equation of the random walk. Here you can see possible values of a random walk over time:

\[ Y_{0} = Initial value \]

\[ Y_{1} = \phi_{0} + Y_{0} + \varepsilon_{1} \]

\[ Y_{2} = \phi_{0} + Y_{1} + \varepsilon_{2} \]

Substituting \(Y_{1}\) with its corresponding equation:

\[ Y_{2} = \phi_{0} + \phi_{0} + Y_{0} + \varepsilon_{1} + \varepsilon_{2} \] Re-arranging the terms:

\[ Y_{2} = 2\phi_{0} + Y_{0} + \varepsilon_{1} + \varepsilon_{2} \]

If you continue doing the same until the last N value, you can get:

\[ Y_{N} = N\phi_{0} + Y_{0} + \sum_{t=1}^{N}\varepsilon_{t} \]

This mathematical result is kind of intuitive. The value of a random walk at time N will be equal to its initial value plus N times phi0 plus the sum of ALL random shocks from 1 to N.

Since the mean of the shocks is assumed to be zero, then the expected value of the sum of the shocks will also be zero. Then:

\[ E[Y_{N}] = N\phi_{0} + Y_{0} \] From this equation we see that \(phi_{0}\) can be estimated as:

\[ \phi_{0} = \frac{(E[Y_{N}] - Y_{0})}{N} \]

Since \(E[Y_N] = Y_N\),

\[ \phi_{0} = \frac{(Y_{N} - Y_{0})}{N} \]

Then, \(\phi_{0}\) = (last value - first value) / # of days.

I use scalars to calculate these coefficients for the simulation. A Stata scalar is a temporal variable to save a number.

I calculate \(\phi_{0}\) following this formula:

phi0 = (lnsp500[-1] - lnsp500[0]) / N
# The -1 location of the lnsp500 is actually the last value of the lnsp500 series 
phi0

0.0003261792734327422

Remember that N is the total # of observations, so lnsp500[N-1] has last daily value of the log of the S&P500 since the first element is in the 0 location.

Now we need to estimate sigma, which is the standard deviation of the shocks. We can start estimating its variance first. It is known that the variance of a random walk cannot be determined unless we consider a specific number of periods.

Then, let’s consider the equation of the random walk series for the last value (\(Y_N\)), and then estimate its variance from there:

\[ Y_{N} = N\phi_{0} + Y_{0} + \sum_{t=1}^{N}\varepsilon_{t} \]

Using this equation, we calculate the variance of \(Y_N\) :

\[ Var(Y_{N}) = Var(N\phi_{0}) + Var(Y_{0}) + \sum_{t=1}^{N}Var(\varepsilon_{t}) \]

The variance of a constant is zero, so the first two terms are equal to zero.

Now I analyze the variance of the shock:

Since it is supposed that the volatility (standard deviation) of the shocks is about the same over time, then:

\[ Var(\varepsilon_{1}) = Var(\varepsilon_{2}) = Var(\varepsilon_{N}) = \sigma_{\varepsilon}^2 \]

Then the sum of the variances of all shocks is actually the variance of the shock times N. Then the variance of all the shocks is actually the variance of \(Y_N\).

Then we can write the variance of \(Y_N\) as:

\[ Var(Y_{N}) = N Var(\varepsilon)= N\sigma_{\varepsilon}^2 \]

To get the standard deviation of \(Y_N\) we take the square root of the variance of \(Y_N\):

\[ SD(Y_{N}) = \sqrt{N}SD(\varepsilon) \]

We use sigma character for standard deviations:

\[ \sigma_{Y} = \sqrt{N}*\sigma_{\varepsilon} \]

Finally we express the volatility of the shock (\(\sigma_{\varepsilon}\)) in terms of the volatility of \(Y_N\) (\(\sigma_{Y}\)):

\[ \sigma_{\varepsilon} = \frac{\sigma_{Y}}{\sqrt{N}} \]

Then we can estimate sigma as: sigma = StDev(lnsp) / sqrt(N). Let’s do it:

sigma = np.std(lnsp500)/np.sqrt(N)
sigma

0.007846166593129526

The volatility for the shock is python sigma

The volatility for lnsp500 series is python np.std(lnsp500).

3.2.1 Simulating the random walk with drift

Now you are ready to start the simulation of random walk using rw1:

\[ rw1_{t} = \phi_{0} + rw1_{t-1} + \varepsilon_{t} \]

The \(\phi_{0}\) coefficient is also drift of the random walk.

We will create a new column in the lnsp R dataset for the random walk with the name rw1.

I start assigning the first value of the random walk to be equal to the first value of the log of the S&P500:

# I initialize the random walk array with zeros:
rw1 = np.zeros(N)
# I set the first value equal to the real log value:
rw1[0] = lnsp500[0]

Now assign random values from day 2 to the last day following the random walk. For each day, we create the random shock using the function random.normal. We create this shock with standard deviation equal to the volatility of the shock we calculated above (the sigma). We indicate that the mean = 0:

shock = np.random.normal(0,sigma,N)

Now we are ready to start the simulation of random walk. Then we fill the values for rw1. Remembering the formula for the random walk process:

\[ rw1_{t} = \phi_{0} + rw1_{t-1} + \varepsilon_{t} \]

We start the random walk with the first value of the log of the S&P500. Then, from day 2 we do the simulation according to the previous formula and using the random shock just created:

for k in range(1,N):
  rw1[k] = rw1[k-1] + phi0 + shock[k]

# Another way to do the same using the np.cumsum function instead of a for loop: 
i = np.arange(0,N,1)
# i is an array from 0 to N-1
# I set the first shock to avoid summing this shock with the cumsum function
shock[0] = 0
rw2= lnsp500[0] + i*phi0 + np.cumsum(shock)
# The result is the same as rw1

I plot the simulated random walk and the real log of the S&P500:

import matplotlib
from matplotlib.pyplot import*
#I turn the log of the S&P500 variable into a dataframe
dflnsp500=pd.DataFrame(lnsp500)
clf()
#I add the random walk values in a new column
dflnsp500['rw1']=rw1
dflnsp500

plot(dflnsp500['Adj Close'], color='r')
plot(dflnsp500['rw1'], color='b')
legend(['original log s&p500 price', 'Random walk with a drift'], loc='upper left')
show()

I can plot the variables in the original scale (not log scale) by getting the exponential of both series:

dfsp500=pd.DataFrame(sp500)

#I add the exponential of the random walk values in a new column
dfsp500['rw1']= np.exp(rw1)
dfsp500
clf()
plot(dfsp500['Adj Close'], color='r')
plot(dfsp500['rw1'], color='b')
legend(['Original s&p500 price', 'Random walk with a drift'], loc='upper left')
show()