We introduce the concept of a time-series variable and the random walk model.
0.1 General Directions for each workshop
You have to work on Google Colab for all your workshops. In Google Colab, you MUST LOG IN with your @tec.mx account and then create a Google Colab document for each workshop.
You must share each Colab document (workshop) with the following account:
cdorante@tec.mx
You must give Edit privileges to this account.
In Google Colab you can work with Python or R notebooks. The default is Python notebooks.
Your Notebook will have a default name like “Untitled2.ipynb”. Click on this name and change it to “W1-TimeSeries-YourFirstName-YourLastname”.
Pay attention in class to learn how to write text and Python code in your Notebook.
In your Workshop Notebook you have to:
Write and run Python code to do the exercises of this Workshop.
Do whatever is asked in the workshop. It can be responding to specific questions and/or doing an exercise/challenge.
For ANY QUESTION or INTERPRETATION, RESPOND IN CAPITAL LETTERS right after the question.
It is STRONGLY RECOMMENDED that you write your OWN NOTES as if this were your personal notebook to study for the FINAL EXAM. Your own workshop/notebook will be very helpful for your further study.
Once you finish your workshop, make sure that you RUN ALL CHUNKS. You can run each code chunk by clicking on the “Run” button located in the top-left section of each chunk. You can also run all the chunks in one-shot with Ctrl-F9. You have to submit to Canvas the web link of your Google Colab workshop.
1 Introduction to Time Series
A time series is a variable that measures an attribute of a subject over time. For example, the daily stock price of a firm from 2010 to date.
We can observe several patterns or components of a time series:
Trend - if the series has a clear growing or declining trend
Seasonality - if the series has systematic changes in seasons of the year
Business cycle - if the series shows cycles longer than a year
Variability / irregularity - if the series has irregular or idiosyncratic movements that make the series unpredictable
The first 3 components can be modeled and predicted statistically with a time-series econometric model. The irregularity component, however, is unpredictable, but it can be modeled as a random “shock” that behaves like a normally distributed variable with a mean equal to zero and a specific standard deviation. The random shock has random negative and positive movements that make the series move in an unpredictable way.
In this section we will download time-series variables from Yahoo Finance and explore the time-series datasets.
1.1 Downloading the IPyC and the S&P500 market indexes
We will work with the Mexican market index, the IPyC, and the US market index, the S&P500.
What is a market index?
A market index in a financial market is a virtual portfolio composed of a group of public firms that issue shares in that market. There are different market indexes in each financial market. The most important market index in Mexico is the Índice de Precios y Cotizaciones, or IPyC. In the US financial market, one of the most important market indexes is the Standard & Poor’s 500 index, or S&P500.
The IPyC index tries to emulate a virtual portfolio with the biggest 35 public firms in the Mexican market. The weight (%) assigned to each firm in the virtual portfolio is based on the market size of the firm. Then, the bigger the firm, the higher its weight in the IPyC portfolio.
The S&P500 index tries to emulate a virtual portfolio with the biggest 500 public firms in the US market. The weight (%) assigned to each firm in the virtual portfolio is based on the market size of the firm. Then, the bigger the firm, the higher its weight in the S&P500 portfolio.
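As a rough illustration of how size-based weights work, the sketch below computes the weight of each firm as its market capitalization divided by the total. The firm names and market-cap figures are made up for this example; they are not actual IPyC or S&P500 constituents.

```python
import pandas as pd

# Hypothetical market capitalizations (in billions); not real index data
market_cap = pd.Series({"FirmA": 500.0, "FirmB": 300.0, "FirmC": 200.0})

# Each firm's weight is its market cap divided by the total market cap
weights = market_cap / market_cap.sum()
print(weights)
```

The weights always add up to 1 (100%), and the biggest firm gets the highest weight.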
Both indexes are tracked every day in their respective financial markets, so we can get access to their current and historical daily values from different sources.
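We download monthly data for both indexes from Yahoo Finance from 2011 to date:
import pandas as pd
import yfinance as yf
from datetime import date
import numpy as np

tickers = ['^GSPC', '^MXX']
start = "2011-01-01"
end = "2025-04-30"
# Download the monthly closing values of both indexes
prices = yf.download(tickers=tickers, start=start, end=end, interval="1mo")['Close']
# Rename the columns: 'sp500' is the name used later in this workshop; 'ipyc' is a label chosen here
prices = prices.rename(columns={'^GSPC': 'sp500', '^MXX': 'ipyc'})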
YF.download() has changed argument auto_adjust default to True
In Yahoo Finance the unique identifier (also called ticker) for the IPyC is ^MXX, and the ticker for the S&P500 is ^GSPC (General S&P Composite).
For each period, Yahoo Finance keeps track of the open, high, low, close (OHLC) and adjusted prices. Also, it keeps track of volume that was traded (# of shares traded) in every specific period. The adjusted prices are used for stocks, not for market indexes. Adjusted prices consider dividend payments and stock splits. For the case of market indexes, the adjusted prices are always equal to the close prices.
For each index, do a graph to visualize how the index moves over time. You have to write Python code to do the plots (you can ask Gemini AI for this code).
Respond to the following QUESTION ABOUT THE CHARTS:
WHAT CAN YOU SAY ABOUT THE TREND OF BOTH MARKET INDEXES? IS IT CONSTANTLY GROWING, OR DECLINING, OR IS THERE NO CLEAR TREND? BRIEFLY EXPLAIN
2 Return calculation
2.1 Simple and continuously compounded (cc) return
A financial simple return for a stock in period t (R_{t}) is usually calculated as the closing stock price at t plus any dividend payment at t, divided by the previous closing price, minus 1. It is the percentage change from the previous period (t-1) to the present period (t):
R_{t}=\frac{price_{t}+dividend_{t}}{price_{t-1}}-1
When a stock pays dividends or does a stock split, the financial exchange makes an adjustment to the historical stock prices. This adjustment is made so that we do not need to use dividends or splits to calculate simple stock returns. Then, it is always recommended to use adjusted prices to calculate stock returns, unless you have information about all dividends paid in the past.
Then, with adjusted prices the formula for simple returns is easier:
R_{t}=\frac{Adjprice_{t}}{Adjprice_{t-1}}-1
For example, if the adjusted price of a stock at the end of January 2021 was $100.00, and its previous (December 2020) adjusted price was $80.00, then the monthly simple return of the stock in January 2021 will be:
R_{Jan2021}=\frac{100}{80}-1=0.25
We can use returns in decimal or in percentage (multiplying by 100). We will keep using decimals.
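The January 2021 example can be checked with a couple of lines of Python. This is only a minimal sketch; the variable names are chosen here for illustration.

```python
# Adjusted prices from the example in the text
price_dec_2020 = 80.00
price_jan_2021 = 100.00

# Simple return: percentage change from the previous period
R_jan_2021 = price_jan_2021 / price_dec_2020 - 1

print(R_jan_2021)        # 0.25 in decimal form
print(R_jan_2021 * 100)  # 25.0 in percentage form
```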
Although the arithmetic mean of simple returns R gives us an idea of average past return, in the case of multi-period average return, this method of calculation can be misleading. Let’s see why this is the case.
Imagine you have only 2 periods and you want to calculate the average return of an investment per period:
Returns over time

| Period | Investment value (at the end of the period) | Simple period Return (R) |
|---|---|---|
| 0 | $100 | NA |
| 1 | $50 | -0.50 |
| 2 | $75 | +0.50 |
Calculating the average simple return of this investment:
\bar{R}=\frac{-0.5+0.5}{2}=0\%
Then, the simple average return is 0%, yet I end up with $75, having lost 25% of my initial investment ($100) over the 2 periods. If I lost 25% of my initial investment over 2 periods, then the mean return per period must lie somewhere between 0% and -25%. The accurate mean return of an investment over time (multi-periods) is the “Geometric Mean” return.
The total return of the investment over the whole period, also called the holding-period return (HPR), can be calculated as:
HPR=\frac{75}{100}-1=-0.25
The geometric mean return per period is then:
GeomMean=\left(1+HPR\right)^{1/2}-1=\sqrt{0.75}-1\approx-0.134
Then, the right average return per period is about -13.4% and the HPR for the 2 periods is -25%.
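The difference between the arithmetic mean, the HPR and the geometric mean can be verified numerically. The sketch below uses the three investment values of the example; the variable names are chosen here for illustration.

```python
import numpy as np

# Investment value at the end of periods 0, 1 and 2 (from the example)
values = np.array([100.0, 50.0, 75.0])

# Simple return of each period
R = values[1:] / values[:-1] - 1

# Arithmetic mean of the simple returns (misleading for multi-period returns)
arithmetic_mean = R.mean()

# Holding-period return: total return from the first to the last value
hpr = values[-1] / values[0] - 1

# Geometric mean return per period
n_periods = len(R)
geometric_mean = (1 + hpr) ** (1 / n_periods) - 1

print(arithmetic_mean, hpr, geometric_mean)
```

The arithmetic mean is 0, while the geometric mean is about -0.134, which is consistent with ending up with $75 after 2 periods.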
However, if we use continuously compounded returns (r) instead of simple returns (R), then the arithmetic mean of r is an accurate measure that can be converted to a simple return to get the geometric mean, which is the accurate mean return. Let’s do the same example using continuously compounded returns:
Continuously compounded returns

| Period | Investment value (at the end) | Continuously compounded return (r) |
|---|---|---|
| 0 | $100 | NA |
| 1 | $50 | log(50)-log(100) = -0.6931 |
| 2 | $75 | log(75)-log(50) = +0.4055 |

The arithmetic mean of the cc returns, converted back to a simple return, matches the geometric mean:
\bar{r}=\frac{-0.6931+0.4055}{2}=-0.1438
e^{\bar{r}}-1=e^{-0.1438}-1\approx-0.134
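The conversion from the mean cc return to the geometric mean return can be checked numerically. This is a minimal sketch using the same three investment values; the variable names are chosen here for illustration.

```python
import numpy as np

# Investment value at the end of periods 0, 1 and 2
values = np.array([100.0, 50.0, 75.0])

# Continuously compounded return: difference of the log values
r = np.diff(np.log(values))

# Arithmetic mean of the cc returns
mean_r = r.mean()

# Convert the mean cc return back to a simple return
geometric_mean = np.exp(mean_r) - 1

print(r)               # approximately [-0.6931, 0.4055]
print(mean_r)          # approximately -0.1438
print(geometric_mean)  # approximately -0.1340
```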
In Finance it is highly recommended to calculate continuously compounded returns (cc returns) and to use them instead of simple returns for data analysis, statistics and econometric models.
One way to calculate cc returns is to subtract the natural log of the previous adjusted price (at t-1) from the natural log of the current adjusted price (at t):
r_{t}=log(Adjprice_{t})-log(Adjprice_{t-1})
This is also called the difference of the log of the price.
We can also calculate cc returns as the log of the current adjusted price (at t) divided by the previous adjusted price (at t-1):
r_{t}=log\left(\frac{Adjprice_{t}}{Adjprice_{t-1}}\right)
As you see, when we apply a mathematical function (in this case, the log function) to a data frame, the function is calculated to all rows of all columns of the data frame.
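A small sketch of this element-by-element behavior, using a made-up data frame (the column names and prices are hypothetical). It also checks that the two ways of computing cc returns described above give the same result.

```python
import numpy as np
import pandas as pd

# Hypothetical prices for two made-up series
toy_prices = pd.DataFrame({"a": [100.0, 110.0, 121.0],
                           "b": [50.0, 45.0, 54.0]})

# np.log is applied element by element to every row of every column
toy_lnprices = np.log(toy_prices)

# Two equivalent ways to get cc returns
r_diff = toy_lnprices - toy_lnprices.shift(1)        # difference of the logs
r_ratio = np.log(toy_prices / toy_prices.shift(1))   # log of the price ratio

print(toy_lnprices)
print(np.allclose(r_diff.dropna(), r_ratio.dropna()))
```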
Let’s see the plot of both indexes:
import matplotlib.pyplot as plt
prices.plot(subplots=True, figsize=(10, 6))
plt.show()
Let’s see the plot of the S&P500 index and its logarithm:
plt.figure(figsize=(12, 6))
plt.plot(prices['sp500'])
plt.title('Index (price) of the SP500')
plt.grid(True)
plt.show()
# lnprices holds the natural log of every index value
lnprices = np.log(prices)
plt.figure(figsize=(12, 6))
plt.plot(lnprices['sp500'])
plt.title('Log of the SP500 index')
plt.grid(True)
plt.show()
As you see, the log index has a much smaller scale (from about 7 to 8.75). Remember that the natural log is the exponent to which we raise the number e to get the index value.
With this plot, we can better appreciate the % growth per period over time. For example, from Jan 2020 to March 2020 the log price decreased from about 8.1 to 7.85. The difference, 7.85 - 8.1 = -0.25, is the approximate % change in the index. This means that the decline of the S&P500 index at the start of the COVID-19 pandemic was about 25%!
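The log difference is an approximation of the percentage change, and the approximation gets less precise for large movements. The sketch below converts the log difference of this example into the exact simple return; the two log values are the approximate readings mentioned above, not exact data.

```python
import numpy as np

# Approximate log index values read from the plot
log_jan_2020 = 8.10
log_mar_2020 = 7.85

# The cc return is the difference of the logs
cc_return = log_mar_2020 - log_jan_2020

# Exact simple (percentage) return implied by that cc return
simple_return = np.exp(cc_return) - 1

print(round(cc_return, 4))      # -0.25
print(round(simple_return, 4))  # -0.2212
```

So a log change of -0.25 corresponds to a simple return of about -22%, close to but not exactly -25%.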
What is a natural logarithm?
The natural logarithm of a number is the exponent to which the number e (=2.71…) must be raised to get that number. For example, let’s name x=natural logarithm of a stock price p. Then:
e^x = p
The way to get the value of x that satisfies this equality is actually getting the natural log of p:
x = log_e(p)
Then, we have to remember that the natural logarithm is the exponent to which you need to raise the number e to get a specific number.
The natural log is the logarithm of base e (=2.71…). The number e is an irrational number (it cannot be expressed as a division of 2 integers), and it is also called Euler’s number. Leonhard Euler (1707-1783) took the idea of the logarithm from the great mathematician Jacob Bernoulli, and discovered astonishing features of the number e. Euler is considered the most productive mathematician of all time. Some historians believe that Jacob Bernoulli discovered the number e around 1690 when he was playing with calculations to know how an amount of money grows over time with an interest rate.
How is e related to the growth of financial amounts over time?
Here is a simple example:
If I invest $100.00 with an annual interest rate of 50%, then the end balance of my investment at the end of the first year (at the beginning of year 2) will be:
I_2=100*(1+0.50)^1=150
If the interest rate is 100%, then I would get:
I_2=100*(1+1)^1=200
Then, the general formula to get the final amount of my investment at the beginning of year 2, for any interest rate R can be:
I_2=I_1*(1+R)^1
The (1+R) is the growth factor of my investment.
In Finance, the investment amount is called principal. If the interest is calculated (compounded) each month instead of each year, then I would end up with a higher amount at the end of the year.
Monthly compounding means that a monthly interest rate is applied to the amount to get the interest of the month, and then the interest of the month is added to the investment (principal). Then, for month 2 the principal will be higher than the initial investment. At the end of month 2 the interest will be calculated using the updated principal amount. Putting in simple math terms, the final balance of an investment at the beginning of year 2 when doing monthly compounding will be:
I_2=I_1*\left(1+\frac{R}{N}\right)^{1*N}
For monthly compounding, N=12, so the monthly interest rate is equal to the annual interest rate R divided by N (R/N). Then, with an annual rate of 100% and monthly compounding (N=12):
I_2=100*\left(1+\frac{1}{12}\right)^{1*12}=261.30
In this case, the growth factor is (1+1/12)^{12}, which is equal to 2.613.
If, instead of compounding each month, the compounding happens every moment, then we have a continuously compounded rate.
If we do continuous compounding for the previous example, then the growth factor for one year becomes the astonishing Euler’s number e:
\lim_{N\to\infty}\left(1+\frac{1}{N}\right)^{N}=e\approx2.71828
Let’s do an example compounding every second (1 year has 31,536,000 seconds). The investment at the end of year 1 (or at the beginning of year 2) will be:
I_2=100*\left(1+\frac{1}{31536000}\right)^{1*31536000}\approx271.83
Now we see that e^1 is the GROWTH FACTOR after 1 year if we do the compounding of the interests every moment!
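To see how the growth factor approaches e as the compounding becomes more frequent, the sketch below computes (1+R/N)^N for several values of N with R=100%. The set of frequencies is chosen here for illustration.

```python
import math

R = 1.0  # nominal annual rate of 100%

# Number of compounding periods per year
frequencies = {"yearly": 1, "monthly": 12, "daily": 365, "every second": 31_536_000}

# Growth factor after 1 year for each compounding frequency
growth = {name: (1 + R / n) ** n for name, n in frequencies.items()}

for name, factor in growth.items():
    print(f"{name:>12}: {factor:.6f}")
print(f"{'e':>12}: {math.e:.6f}")
```

The growth factor rises from 2 (yearly) to about 2.613 (monthly) and gets very close to e when compounding every second.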
We can generalize to any other annual interest rate R, so that e^R is the growth factor for an annual nominal rate R when the interest is compounded every moment.
When compounding every instant, we use small r instead of R for the interest rate. Then, the growth factor will be: e^r
Then we can establish a relationship between this growth factor and an equivalent effective rate:
\left(1+EffectiveRate\right)=e^{r}
If we apply the natural logarithm to both sides of the equation:
ln\left(1+EffectiveRate\right)=ln\left(e^r\right)
Since the natural logarithm function is the inverse of the exponential function, then:
ln\left(1+EffectiveRate\right)=r
In the previous example with a nominal rate of 100%, when doing continuous compounding (r=1), the effective rate will be:
\left(1+EffectiveRate\right)=e^{r}=e^{1}\approx2.7183
EffectiveRate=e^{r}-1
Doing the calculation of the effective rate for this example:
EffectiveRate=e^{1}-1\approx1.7183
Then, when compounding every moment, starting with a nominal annual interest rate of 100%, the actual effective annual rate would be about 171.83%!
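The relationship between the continuously compounded rate and the effective rate works in both directions. The sketch below applies it to the 100% example; the variable names are chosen here for illustration.

```python
import numpy as np

r = 1.0  # continuously compounded annual rate (100%)

# Effective annual rate equivalent to the cc rate r
effective_rate = np.exp(r) - 1
print(round(effective_rate, 4))  # 1.7183

# Going back: the cc rate is the natural log of (1 + effective rate)
r_back = np.log(1 + effective_rate)
print(round(r_back, 4))  # 1.0
```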
2.3 Return calculation
We have historical monthly adjusted prices for each index. We can easily calculate simple returns for both indexes:
R = prices / prices.shift(1) - 1
# I delete the NA values located at the first period:
R = R.dropna()
The shift(1) function gets the previous price value (1 period ago), and it does so for all periods.
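To see what shift(1) does, here is a minimal sketch with a made-up price series (the dates and prices are hypothetical). It also compares the result with the pandas pct_change() method, which computes the same percentage change.

```python
import pandas as pd

# Hypothetical monthly prices, for illustration only
toy = pd.DataFrame(
    {"price": [80.0, 100.0, 90.0]},
    index=pd.to_datetime(["2020-12-01", "2021-01-01", "2021-02-01"]),
)

# shift(1) moves every value one row down, so each row holds the previous price
toy["previous"] = toy["price"].shift(1)

# Simple return: current price divided by the previous price, minus 1
toy["R"] = toy["price"] / toy["previous"] - 1
print(toy)

# pct_change() gives the same simple returns
print(toy["price"].pct_change())
```

The first row has no previous price, so its return is NA; this is the row removed with dropna().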
We can also calculate continuously compounded returns (r) as follows:
r = np.log(prices) - np.log(prices.shift(1))
# I delete the NA values located at the first period:
r = r.dropna()
We plot the cc returns of the SP500:
plt.figure(figsize=(12, 6))
plt.plot(r['sp500'])
plt.title('log returns of the SP500')
plt.show()
3 CHALLENGE 2
BRIEFLY RESPOND TO THE FOLLOWING AFTER LOOKING AT THE PREVIOUS PLOT:
(a) DOES THIS SERIES HAVE ABOUT THE SAME MEAN FOR ALL TIME PERIODS?
(b) DOES IT HAVE THE SAME STANDARD DEVIATION (VOLATILITY) FOR ALL TIME PERIODS?
4 Non stationary variables - The Random Walk model for stock prices
The random walk hypothesis in Finance (Fama, 1965) states that the natural logarithm of stock prices behaves like a random walk with a drift. A random walk is a series (or variable) that cannot be predicted. Imagine that Y_t is the log price of a stock for today (t). The value of Y for tomorrow (Y_{t+1}) will be equal to today’s value (Y_t) plus a constant value (φ_0) plus a random shock. This shock is a purely random value that follows a normal distribution with mean=0 and a specific standard deviation σ_ε. The process is supposed to be the same for all future periods. In mathematical terms, the random walk model is the following:
Y_t = φ_0 + Y_{t−1} + ε_t
The ε_t is a random shock for each day, which is the log price movement caused by all the news (external and internal to the stock) that influence the price. φ_0 is referred to as the drift of the series. If |φ_0| > 0 we say that the series is a random walk with a drift. If φ_0 is positive, then the variable will have a positive trend over time; if it is negative, the series will have a negative trend.
If we want to simulate a random walk, we need the values of the following parameters/variables:
Y_0, the first value of the series
φ_0, the drift of the series
σ_ε, the standard deviation (volatility) of the random shock
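As a minimal sketch of how these 3 values drive a simulation, the code below builds a random walk with a drift step by step from the model equation. The parameter values are assumptions chosen for illustration, not estimates from real data, and the fixed seed only serves to make the run reproducible.

```python
import numpy as np

rng = np.random.default_rng(seed=123)  # fixed seed so the run is reproducible

# Illustrative parameter values (assumptions, not estimates from real data)
y0 = 7.0          # first value of the series (a log price)
phi0 = 0.0004     # drift per period
sigma = 0.01      # standard deviation of the random shock
n_periods = 1000

# One random shock per period, drawn from a normal distribution with mean 0
shocks = rng.normal(loc=0.0, scale=sigma, size=n_periods)

# Build the series step by step: Y_t = phi0 + Y_{t-1} + shock_t
y = np.empty(n_periods + 1)
y[0] = y0
for t in range(1, n_periods + 1):
    y[t] = phi0 + y[t - 1] + shocks[t - 1]

print(y[:5])
```

Changing the seed gives a different path each time, which is the idea behind a Monte Carlo simulation.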
5 Monte Carlo simulation for the random walk model
Let’s run a Monte Carlo simulation for a random walk of the S&P 500. We will use real values of the S&P500 to estimate the previous 3 parameters.
5.1 Downloading data for the S&P500
We download the historical daily S&P500 index from Yahoo Finance, from 2009 to date.
Now we generate the log of the S&P index using the closing price/quotation, and create a variable N for the number of days in the dataset:
lnsp500 = np.log(sp500)
# I name the column logprice
lnsp500.columns = ['logprice']
# N will be the # of days in the series
N = len(lnsp500)
N
4106
Now we will simulate 2 random walk series estimating the 3 parameters from this log series of the S&P500:
random walk with a drift (name it rw1), and
random walk with no drift (name it rw2).
5.2 Estimating the parameters of the random walk model
We have to consider the mathematical definition of a random walk and estimate its parameters (initial value, phi0, volatility of the random shock) from the real daily S&P500 data.
Now, we create a variable for a random walk with a drift trying to model the log of the S&P500.
Reviewing the random walk equation again:
Y_t = φ_0 + Y_{t−1} + ε_t
The ε_t is the random shock of each day, which represents the overall average perception of all market participants after learning the news of the day (internal and external news announced to the market).
Remember that \varepsilon_{t} behaves like a normally distributed random variable with mean=0 and a specific standard deviation \sigma_{\varepsilon}.
For the simulation of the random walk, you need to estimate the values of
y_{0}, the first value of the series, which is the log S&P500 index of the first day
\phi_{0}
\sigma_{\varepsilon}
You have to estimate \phi_{0} using the last and the first real values of the series following the equation of the random walk. Here you can see possible values of a random walk over time:
Y_{0} = Initial value
Y_{1} = \phi_{0} + Y_{0} + \varepsilon_{1}
Y_{2} = \phi_{0} + Y_{1} + \varepsilon_{2}
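Substituting Y_{1} with its corresponding equation:
Y_{2} = \phi_{0} + \left(\phi_{0} + Y_{0} + \varepsilon_{1}\right) + \varepsilon_{2} = 2*\phi_{0} + Y_{0} + \varepsilon_{1} + \varepsilon_{2}
Continuing the same substitution up to period N:
Y_{N} = N*\phi_{0} + Y_{0} + \sum_{t=1}^{N}\varepsilon_{t}
This mathematical result is quite intuitive. The value of a random walk at time N will be equal to its initial value plus N times phi0 plus the sum of ALL random shocks from 1 to N.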
Since the mean of the shocks is assumed to be zero, then the expected value of the sum of the shocks will also be zero. Then:
E[Y_{N}] = N*\phi_{0} + Y_{0}
From this equation we see that phi_{0} can be estimated as:
\phi_{0} = \frac{(Y_{N} - Y_{0})}{N}
Then, \phi_{0} = (last value - first value) / # of days.
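The estimate can be tried on a simulated series where the true drift is known. This is only a sketch under assumed values (the drift, volatility and number of days are made up); the names true_phi0 and phi0_hat are chosen here for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # fixed seed so the run is reproducible

# Simulated log-price series with a known drift (illustrative values)
true_phi0 = 0.0005
n_days = 4000
shocks = rng.normal(loc=0.0, scale=0.01, size=n_days)
y = 7.0 + np.cumsum(true_phi0 + shocks)

# Drift estimate: (last value - first value) / number of days
n_obs = len(y)
phi0_hat = (y[-1] - y[0]) / n_obs

print(true_phi0, phi0_hat)
```

The estimate is close to the true drift but not identical, because the shocks of one particular sample do not add up to exactly zero.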
I use scalars to calculate these coefficients for the simulation. A scalar is a variable that stores a single number.
I calculate \phi_{0} (the drift of the series) following this formula:
# The -1 location of lnsp500 is the LAST value of the series, while the 0 location is the FIRST value
phi0 = (lnsp500.iloc[-1]['logprice'] - lnsp500.iloc[0]['logprice']) / N
# An alternative way to do the same:
# phi0 = (lnsp500.iloc[-1,0] - lnsp500.iloc[0,0]) / N
phi0
np.float64(0.000435066849597146)
Remember that N is the total # of observations, so lnsp500.iloc[N-1] holds the last daily value of the log of the S&P500, since the first element is at location 0.
Now we need to estimate sigma, which is the standard deviation of the shocks. We can start estimating its variance first. It is known that the variance of a random walk cannot be determined unless we consider a specific number of periods.
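Then, let’s consider the equation of the random walk series for the last value (Y_N), and then estimate its variance from there:
Y_{N} = N*\phi_{0} + Y_{0} + \sum_{t=1}^{N}\varepsilon_{t}
Var(Y_{N}) = Var(N*\phi_{0}) + Var(Y_{0}) + \sum_{t=1}^{N}Var(\varepsilon_{t})
Since N*\phi_{0} and Y_{0} are constants, their variance is zero. Assuming the shocks are independent of each other and all have the same variance, the sum of the variances of all shocks is the variance of the shock times N, and this sum is the variance of Y_N.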
Then we can write the variance of Y_N as:
Var(Y_{N}) = N * Var(\varepsilon)= N*\sigma_{\varepsilon}^2
To get the standard deviation of Y_N we take the square root of the variance of Y_N:
SD(Y_{N}) = \sqrt{N}*SD(\varepsilon)
We use sigma character for standard deviations:
\sigma_{Y} = \sqrt{N}*\sigma_{\varepsilon}
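This square-root-of-N relationship can be checked with a small Monte Carlo experiment: simulate many random walks with no drift and compare the standard deviation of their final values with the theoretical one. The values of sigma_eps, n_steps and n_paths are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=7)  # fixed seed so the run is reproducible

sigma_eps = 0.01   # volatility of the shock (illustrative value)
n_steps = 250      # number of periods in each random walk
n_paths = 10_000   # number of simulated random walks

# Each row holds the shocks of one path; the row sum is Y_N - Y_0 with no drift
shocks = rng.normal(loc=0.0, scale=sigma_eps, size=(n_paths, n_steps))
y_final = shocks.sum(axis=1)

simulated_sd = y_final.std()
theoretical_sd = np.sqrt(n_steps) * sigma_eps

print(simulated_sd, theoretical_sd)
```

The two numbers should be close to each other, and they get closer as n_paths increases.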
Finally we express the volatility of the shock (\sigma_{\varepsilon}) in terms of the volatility of Y_N (\sigma_{Y}):
\sigma_{\varepsilon} = \frac{\sigma_{Y}}{\sqrt{N}}
For each day, we create a random shock using the random.normal function. We create this shock with standard deviation equal to the volatility of the shock we calculated above (the sigma). We indicate that the mean = 0:
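A hedged sketch of this step is shown below. It assumes that \sigma_{Y} is measured as the sample standard deviation of the log price series, which is one possible reading of the formula, and it uses a simulated stand-in series instead of the real S&P500 data. The names toy_logprice, toy_sigma and toy_shock are hypothetical, and the NumPy Generator method normal() is used here (with a fixed seed) in place of np.random.normal so that the run is reproducible.

```python
import numpy as np

rng = np.random.default_rng(seed=2024)  # fixed seed so the run is reproducible

# Stand-in for the log price series (simulated here; not the real S&P500 data)
n_toy = 4000
toy_logprice = 6.8 + np.cumsum(rng.normal(loc=0.0004, scale=0.01, size=n_toy))

# Volatility of the shock, following sigma_eps = sigma_Y / sqrt(N)
sigma_y = toy_logprice.std()
toy_sigma = sigma_y / np.sqrt(n_toy)

# One random shock per day, with mean 0 and standard deviation toy_sigma
toy_shock = rng.normal(loc=0, scale=toy_sigma, size=n_toy)

print(toy_sigma, toy_shock.mean(), toy_shock.std())
```

With the workshop data, the same steps would use lnsp500['logprice'] in place of toy_logprice and N in place of n_toy.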
We can also see whether the shock behaves like a normal distribution by plotting its histogram:
plt.figure(figsize=(12, 6))
plt.hist(shock, bins=30)  # Adjust the number of bins as needed
plt.xlabel("Histogram of random shocks")
plt.ylabel("Frequency")
plt.title("Histogram of Shocks")
plt.show()
As expected, the shock variable behaves similar to a normal-distributed variable.
Now you are ready to start the simulation of the random walk.
Remember that we can express a random walk as its initial value plus N times the drift (\phi_{0}) plus the sum of all random shocks from the first day up to a specific period N:
Y_{N} = Y_{0} + N*\phi_{0} + \sum_{t=1}^{N}\varepsilon_{t}
We can calculate the values for this random walk process as follows:
# Create day as a sequence from 0 to N - 1:
day = np.arange(N)
# Create rw1 as a random walk according to the YN equation:
firstlogprice = lnsp500.iloc[0].logprice
# cumsum calculates the cumulative sum of the shocks from the first day up to each day
rw1 = firstlogprice + day * phi0 + np.cumsum(shock)
# Save rw1 as a column of lnsp500 so it can be plotted next to the real series
lnsp500['rw1'] = rw1
rw1

plt.plot(lnsp500['logprice'], color='r')
plt.plot(lnsp500['rw1'], color='b')
plt.legend(['original s&p500 logprice', 'Random walk with a drift'], loc='upper left')
plt.show()
5.2.2 Simulating a random walk with no drift
Now we can do another simulation, this time without the drift. In this case, the \phi_{0} coefficient must be zero.
We will use rw2 for this series. You can follow the logic we used for rw1, but now \phi_{0} will be equal to zero, so we do not include it in the equation:
# Create rw2 as a random walk with NO drift, so phi0 = 0
# cumsum calculates the cumulative sum of the shocks from the first day up to each day
rw2 = firstlogprice + np.cumsum(shock)
# Save rw2 as a column of lnsp500 so it can be plotted next to the real series
lnsp500['rw2'] = rw2
rw2

plt.plot(lnsp500['logprice'], color='r')
plt.plot(lnsp500['rw2'], color='b')
plt.legend(['original s&p500 logprice', 'Random walk with NO drift'], loc='upper left')
plt.show()
6 CHALLENGE 3
RESPOND TO THE FOLLOWING QUESTION:
A) WHAT DO YOU OBSERVE with the previous plot? EXPLAIN WITH YOUR WORDS.
7 Reading
Read/skim the note: “Introduction to time series”. In your own words:
8 CHALLENGE 4
RESPOND TO THE FOLLOWING:
EXPLAIN WHAT A STATIONARY SERIES IS.
WHAT CONDITIONS MUST A SERIES MEET TO BE CONSIDERED A STATIONARY SERIES?
9 W1 submission
Complete (100%): If you submit an ORIGINAL and COMPLETE work with all the code and challenges, with your notes, and with your OWN RESPONSES to questions
Incomplete (75%): If you submit an ORIGINAL work but you did NOT RESPOND to all the challenges/questions and/or you did not complete all the activities.
Very Incomplete (10%-70%): If you complete from 10% to 75% of the workshop, or you completed more but parts of your work are a copy-paste from other workshops.
Not submitted (0%)
Remember that you have to submit your Google Colab link through Canvas, and also grant EDIT privileges to cdorante@tec.mx