Abstract
We introduce what a time-series variable is, explain what xts-zoo objects are, and show how to calculate returns with xts-zoo objects.
You will work in RStudio. You can go to the RStudio web site and follow the instructions to download the R software and the free version of RStudio. It is strongly recommended to have the latest version of R and RStudio. Once you are in RStudio, do the following.
Create an R Notebook document (File -> New File -> R Notebook), where you have to write whatever is asked in this workshop. More specifically, you have to:
Replicate all the R Code along with its output.
You have to respond to the challenges of the workshop and any other question asked in the workshop.
Any QUESTION or any INTERPRETATION you need to do will be written in CAPITAL LETTERS. For ANY QUESTION or INTERPRETATION, you have to RESPOND IN CAPITAL LETTERS right after the question.
You have to keep saving your .Rmd file, and ONLY SUBMIT the .html version of your .Rmd file. Pay attention in class to know how to generate an html file from your .Rmd.
Setup title and name of your Workshop
Once you have created a new R Notebook, you will see a sample R Notebook document. You must DELETE all the lines of this sample document except the first lines related to title and output. As title, write the workshop # and course, and add a new line with your name. You have to end up with something like:
title: "Workshop 1, Econometric models"
author: YourName
output: html_notebook
Now you are ready to continue writing your first R Notebook.
You can start writing your own notes/explanations of what we cover in this workshop. When you need to write lines of R Code, you need to click Insert at the top of the RStudio Window and select R. Immediately a chunk of R code will be set up to start writing your R code. You can execute this piece of code by clicking the play button (green triangle).
Note that you can open and edit several R Notebooks, which will appear as tabs at the top of the window. You can visualize the output (results) of your code in the console, located at the bottom of the window. Also, the created variables are listed in the environment, located in the top-right pane. The bottom-right pane shows the files, plots, installed packages, help, and viewer tabs.
Save your R Notebook file as W1-YourName.Rmd. Go to the File menu and select Save As.
To generate the .html file, you have to knit your R Notebook. Pay attention to how to do this in class.
In this section we will download time-series variables from Yahoo Finance and explore the time-series datasets. More specifically, we explore the xts-zoo R objects. xts stands for “Extensible time series”, and zoo is an R class for general time-series datasets.
We start by clearing our R environment:
rm(list=ls())
# To avoid scientific notation for numbers:
options(scipen=999)
In order to import and manage financial data in R, the quantmod package must be installed. This package contains the getSymbols() function, which creates an xts (extensible time series) object in the environment with the downloaded data from the Internet. In order to install packages in R, go to the Packages tab in the bottom-right section of RStudio, select Install, type quantmod, and click the Install button.
Once you install a package, it stays on your computer, so it is not necessary to install it again. You might re-install a package in case there is a new version of it.
However, every time you want to use an installed package, you have to load it using the library() function:
library(quantmod)
The getSymbols() function lets users download up-to-date financial data, such as stock prices, market indexes, ETF prices, interest rates, exchange rates, etc. getSymbols() can download this data from Yahoo Finance, Oanda, and FRED. These sources have thousands of financial and economic data series from many market exchanges, as well as macroeconomic variables from around the world.
We will work with the Mexican market index, the IPyC, and the US market index, the S&P500.
What is a market index?
A market index in a financial market is a virtual portfolio composed of a group of public firms that issue shares in that market. There are different market indexes in each financial market. The most important market index in Mexico is the Índice de Precios y Cotizaciones, or IPyC. In the US financial market, one of the most important market indexes is the Standard & Poor’s 500 index, or S&P500.
The IPyC index tries to emulate a virtual portfolio with the biggest 35 public firms in the Mexican market. The weight (%) assigned to each firm in the virtual portfolio is based on the market size of each firm. Then, the bigger the firm, the higher its weight in the IPyC portfolio.
The S&P500 index tries to emulate a virtual portfolio with the biggest 500 public firms in the US market. The weight (%) assigned to each firm in the virtual portfolio is based on the market size of each firm. Then, the bigger the firm, the higher its weight in the S&P500 portfolio.
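As a rough illustration of weighting by market size, the sketch below computes the weights of three hypothetical firms. The firm names and market values are invented for the example, and real indexes may apply further adjustments that are not modeled here.

```r
# Hypothetical market capitalizations (in billions); not real data
market_cap <- c(FirmA = 500, FirmB = 300, FirmC = 200)

# Each firm's weight is its share of the total market capitalization
weights <- market_cap / sum(market_cap)

# The weights are 0.5, 0.3 and 0.2, and they add up to 1
print(weights)
print(sum(weights))
```

The biggest firm (FirmA) receives the highest weight, which is the idea behind both indexes.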
Both indexes are tracked every day in their respective financial markets, so we can get access to their current and historical daily values from different sources.
We download the monthly IPyC and S&P500 from Yahoo Finance from January 2011 to May 12, 2023:
getSymbols(Symbols= c("^MXX", "^GSPC"), from="2011-01-01", to="2023-05-12",
periodicity = "monthly", src = "yahoo")
## [1] "^MXX" "^GSPC"
In Yahoo Finance the unique identifier (also called ticker) for the IPyC is ^MXX, and the ticker for the S&P500 is ^GSPC (General S&P Composite).
In this case the getSymbols function creates 2 datasets with the historical data of these 2 indexes. Each dataset is an xts-zoo R object with historical monthly quotations in a chronological order. These xts-zoo R objects are actually datasets with the values for the market indexes, and with a time index.
Each R object has a specific class. In this case, the class of these datasets is called xts-zoo. xts stands for extensible time-series. An xts-zoo object is designed to easily manipulate time series data.
For each period, Yahoo Finance keeps track of the open, high, low, close (OHLC) and adjusted prices. Also, it keeps track of volume that was traded (# of shares traded) in every specific period. The adjusted prices are used for stocks, not for market indexes. Adjusted prices consider dividend payments and stock splits. For the case of market indexes, the adjusted prices are always equal to the close prices.
We can integrate xts-zoo R objects into one xts-zoo dataset using the merge function. In this case, we only use the adjusted price, so we can also use the Ad function:
# We merge the datasets into a new R object called prices:
prices = merge(MXX,GSPC)
# We only keep the adjusted price columns:
prices = Ad(prices)
# We rename the columns with simpler names:
names(prices) = c("MXX","GSPC")
For each index we do a graph to visualize how the index moves over time. We can use the chartSeries function from the quantmod package:
chartSeries(MXX, theme=("white"))
chartSeries(GSPC, theme=("white"))
Respond to the following QUESTION ABOUT THE PREVIOUS CHARTS:
WHAT CAN YOU SAY ABOUT THE TREND OF BOTH MARKET INDEXES? IS IT CONSTANTLY GROWING OR DECLINING, OR IS THERE NO CLEAR TREND? BRIEFLY EXPLAIN.
Generate a new dataset with the natural logarithm (log) of the indexes:
lnprices = log(prices)
Now do a time plot for the natural log price of the MXX:
plot(lnprices$MXX, main = "Log of the Mexican Index over time")
What is a natural logarithm?
The natural logarithm of a number is the exponent to which the number e (=2.71…) needs to be raised to get that number. For example, let’s name x=natural logarithm of a stock price p. Then:
\[ e^x = p \] The way to get the value of x that satisfies this equality is actually getting the natural log of p:
\[ x = log_e(p) \] Then, we have to remember that the natural logarithm is actually an exponent that you need to raise the number e to get a specific number.
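A minimal sketch of this relationship in R, using a hypothetical price of 100 (in R, log() is the natural logarithm and exp() raises e to a power):

```r
p <- 100      # hypothetical stock price
x <- log(p)   # natural logarithm (base e) of p
print(x)

# Raising e to the exponent x recovers the original price
print(exp(x))
```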
The natural log is the logarithm of base \(e\) (=2.71…). The number \(e\) is an irrational number (it cannot be expressed as a ratio of 2 integers), and it is also called Euler’s number. Leonhard Euler (1707-1783) took the idea of the logarithm from the great mathematician Jacob Bernoulli, and discovered very astonishing features of the number \(e\). Euler is considered the most productive mathematician of all time. Some historians believe that Jacob Bernoulli discovered the number \(e\) around 1690 when he was playing with calculations to know how an amount of money grows over time with an interest rate.
How is \(e\) related to the growth of financial amounts over time?
Here is a simple example:
If I invest $100.00 with an annual interest rate of 50%, then the end balance of my investment at the end of the first year (at the beginning of year 2) will be:
\[ I_2=100*(1+0.50)^1 \]
If the interest rate is 100%, then I would get:
\[ I_2=100*(1+1)^1=200 \] Then, the general formula to get the final amount of my investment at the beginning of year 2, for any interest rate R can be:
\[ I_2=I_1*(1+R)^1 \] The (1+R) is the growth factor of my investment.
In Finance, the investment amount is called principal. If the interest is calculated (compounded) each month instead of each year, then I would end up with a higher amount at the end of the year.
Monthly compounding means that a monthly interest rate is applied to the amount to get the interest of the month, and then the interest of the month is added to the investment (principal). Then, for month 2 the principal will be higher than the initial investment. At the end of month 2 the interest will be calculated using the updated principal amount. Put in simple math terms, the final balance of an investment at the beginning of year 2 when doing monthly compounding will be:
\[ I_2=I_1*\left(1+\frac{R}{N}\right)^{1*N} \]
For monthly compounding, N=12, so the monthly interest rate is equal to the annual interest rate R divided by N (R/N). Then, with an annual rate of 100% and monthly compounding (N=12):
\[ I_2=100*\left(1+\frac{1}{12}\right)^{1*12}=100*(2.613..) \]
In this case, the growth factor is \((1+1/12)^{12}\), which is equal to 2.613.
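The monthly compounding described above can be sketched step by step and compared with the formula. The variable names below are chosen only for this illustration:

```r
principal <- 100   # initial investment
R <- 1             # annual nominal rate of 100%
N <- 12            # compounding periods per year (monthly)

# Month by month: compute the interest and add it to the principal
balance <- principal
for (month in 1:N) {
  interest <- balance * (R / N)
  balance <- balance + interest
}

# Closed-form growth factor for monthly compounding
growth_factor <- (1 + R / N)^N

print(balance)                    # final balance from the loop
print(principal * growth_factor)  # same balance from the formula
```

Both calculations give the same final balance, about 261.30.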
If the compounding happens at every moment instead of each month, then we are using a continuously compounded rate.
If we apply continuous compounding to the previous example, then the growth factor for one year becomes the astonishing Euler’s number e:
Let’s do an example with compounding every second (a 365-day year has 31,536,000 seconds). The investment at the end of year 1 (or at the beginning of year 2) will be:
\[ I_2=100*\left(1+\frac{1}{31536000}\right)^{1*31536000}=100*(2.718282..)\cong100*e^1 \]
Now we see that \(e^1\) is the GROWTH FACTOR after 1 year if interest is compounded every moment!
We can generalize to any other annual interest rate R, so that \(e^R\) is the growth factor for an annual nominal rate R when interest is compounded every moment.
When compounding every instant, we use small r instead of R for the interest rate. Then, the growth factor will be: \(e^r\)
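To see how the growth factor approaches \(e^R\) as compounding becomes more frequent, here is a small sketch for R = 100%. The set of frequencies is just an illustrative choice:

```r
R <- 1   # annual nominal rate of 100%

# Compounding periods per year: yearly, monthly, daily, every second
periods <- c(yearly = 1, monthly = 12, daily = 365, every_second = 31536000)

# Growth factor (1 + R/N)^N for each compounding frequency
growth <- (1 + R / periods)^periods

# Growth factor under continuous compounding
continuous <- exp(R)

print(round(growth, 6))
print(round(continuous, 6))
```

The growth factor increases with the compounding frequency and gets closer and closer to exp(R).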
Then we can relate this growth factor to an equivalent effective rate:
\[ \left(1+EffectiveRate\right)=e^{r} \]
If we apply the natural logarithm to both sides of the equation:
\[ ln\left(1+EffectiveRate\right)=ln\left(e^r\right) \]
Since the natural logarithm function is the inverse of the exponential function, then:
\[ ln\left(1+EffectiveRate\right)=r \] In the previous example with a nominal rate of 100%, under continuous compounding the effective rate will be:
\[ \left(1+EffectiveRate\right)=e^{r}=2.7182 \]
\[ EffectiveRate=e^{r}-1 \] Doing the calculation of the effective rate for this example:
\[ EffectiveRate=e^{1}-1 = 2.7182.. - 1 = 1.7182 = 171.82\% \]
Then, when compounding every moment with a nominal annual interest rate of 100%, the effective annual rate would be 171.82%!
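The conversion between the continuously compounded rate and the effective rate can be checked in R. This is a minimal sketch with illustrative variable names:

```r
r <- 1                                   # continuously compounded annual rate (100%)
effective_rate <- exp(r) - 1             # effective annual rate
r_recovered <- log(1 + effective_rate)   # going back to the cc rate

print(effective_rate)   # about 1.7183, that is, 171.83% before truncation
print(r_recovered)
```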
A financial simple return for a stock (\(R_{t}\)) is calculated as a percentage change of price from the previous period (t-1) to the present period (t):
\[ R_{t}=\frac{\left(Adjprice_{t}-Adjprice_{t-1}\right)}{Adjprice_{t-1}}=\frac{Adjprice_{t}}{Adjprice_{t-1}}-1 \]
For example, if the adjusted price of a stock at the end of January 2021 was $100.00, and its previous (December 2020) adjusted price was $80.00, then the monthly simple return of the stock in January 2021 will be:
\[ R_{Jan2021}=\frac{Adjprice_{Jan2021}}{Adjprice_{Dec2020}}-1=\frac{100}{80}-1=0.25 \]
We can use returns in decimal or in percentage (multiplying by 100). We will keep using decimals.
In Finance it is strongly recommended to calculate continuously compounded returns (cc returns) and to use cc returns instead of simple returns for data analysis, statistics and econometric models. cc returns are also called log returns.
One way to calculate cc returns is by subtracting the log of the previous adjusted price (at t-1) from the log of the current adjusted price (at t):
\[ r_{t}=log(Adjprice_{t})-log(Adjprice_{t-1}) \]
This is also called the difference of the log of the price.
We can also calculate cc returns as the log of the current adjusted price (at t) divided by the previous adjusted price (at t-1):
\[ r_{t}=log\left(\frac{Adjprice_{t}}{Adjprice_{t-1}}\right) \]
cc returns are usually represented by small r, while simple returns are represented by capital R.
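The following sketch applies both formulas to three hypothetical month-end prices (80, 100, 90) stored in a plain numeric vector, not an xts object, and shows how simple and cc returns are linked:

```r
# Hypothetical month-end adjusted prices (December, January, February)
price <- c(80, 100, 90)
n <- length(price)

# Simple returns: P_t / P_{t-1} - 1
R <- price[-1] / price[-n] - 1

# cc (log) returns: log(P_t) - log(P_{t-1})
r <- log(price[-1]) - log(price[-n])

print(R)   # 0.25 and -0.10
print(r)

# The two types of return are linked by r = log(1 + R)
print(all.equal(r, log(1 + R)))

# cc returns add up over time: their sum is the log of the total growth factor
print(all.equal(sum(r), log(price[n] / price[1])))
```

The additivity of cc returns over time is one reason they are convenient for statistical analysis.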
It is recommended to always use adjusted prices to calculate financial returns. In this example, since we work with market indexes, the adjusted price is exactly the same as the closing price because market indexes have neither stock splits nor dividend payments.
We can use the lag function to get past (lagged) values of a time-series dataset (or column). With this function we can get the price of the previous period to calculate the simple return. Let’s create a new dataset for the simple monthly returns of both indexes:
# For xts objects, lag() takes the number of periods in the argument k:
R = prices / lag(prices, k=1) - 1
Let’s create a new variable for the cc return for both indexes. Remember that the continuously compounded returns can be calculated as the difference between the log of the price of today minus the log of the price of the previous period:
r = log(prices) - lag(log(prices), k=1)
We can also use the function diff, which calculates the first difference of any time-series variable:
r = diff(log(prices))
We get the same result using any of these 2 calculations, but it seems easier to use the diff function.
Now do a time plot for the cc returns of the Mexican index:
plot(r$MXX, col = "darkblue",
main = "cc return for the MXX index")
BRIEFLY RESPOND TO THE FOLLOWING AFTER LOOKING AT THE PREVIOUS PLOT:
(a) DOES THIS SERIES HAVE ABOUT THE SAME MEAN FOR ALL TIME PERIODS?
(b) DOES IT HAVE THE SAME STANDARD DEVIATION (VOLATILITY) FOR ALL TIME PERIODS?
The random walk hypothesis in Finance (Fama, 1965) states that the natural logarithm of daily stock prices behaves like a random walk with a drift. A random walk is a series (or variable) whose movements cannot be predicted. Imagine that \(Y_t\) is the log price of a stock for today (t). The value of Y for tomorrow (\(Y_{t+1}\)) will be equal to today’s value (\(Y_t\)) plus a constant value (\(\phi_0\)) plus a random shock. This shock is a purely random value that follows a normal distribution with mean=0 and a specific standard deviation \(\sigma_{\varepsilon}\). The process is supposed to be the same for all future periods. In mathematical terms, the random walk model is the following:
\[ Y_t = \phi_0 + Y_{t-1} + \varepsilon_t \]
The \(\varepsilon_t\) is a random shock for each day, which follows a normal probability distribution with mean=0 and with a specific standard deviation \(\sigma_{\varepsilon}\):
\(\varepsilon_{t}\sim N\left(\mu=0,\sigma=\sigma_{\varepsilon}\right)\)
The \(\varepsilon_t\), or random shock for day t, is the result of all the news (external and internal to the stock) of that day t. This shock directly influences the % change in the stock price on day t.
\(\phi_0\) is called the drift of the series. If \(\phi_0 \neq 0\) we say that the series is a random walk with a drift. If \(\phi_0\) is positive, then the variable will have a positive trend over time; if it is negative, the series will have a negative trend.
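If we want to simulate a random walk, we need the values of the following parameters/variables:
\(Y_0\), the initial value of the series
\(\phi_0\), the drift
\(\sigma_{\varepsilon}\), the standard deviation (volatility) of the random shock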
Let’s go and do a Monte Carlo simulation for a random walk with drift that tries to behave like the S&P 500 index. We will use real historical daily values of the S&P500 to estimate the previous 3 parameters.
We need to download daily data for the S&P500 index (instead of monthly). We will download data since 2011:
getSymbols("^GSPC", from="2011-01-01")
## [1] "^GSPC"
Since we did not specify the frequency, we downloaded daily data.
Now we generate the log of the S&P index using the closing price/quotation, and create a variable N for the number of days in the dataset:
lnsp <- log(Ad(GSPC))
# I assign a name for the column:
names(lnsp) <- c("lnsp")
# We get the # of rows of the lnsp series:
N <- nrow(lnsp)
Now we will simulate 2 random walk series:
a random walk with a drift (name it rw1), and
a random walk with no drift (name it rw2).
We have to consider the mathematical definition of a random walk and estimate its parameters (initial value, phi0, volatility of the random shock) from the real daily S&P500 data.
Reviewing the random walk equation again:
\[ Y_t = \phi_0 + Y_{t-1} + \varepsilon_t \]
The \(\varepsilon_t\) is the random shock of each day, which represents the overall average perception of all market participants after learning the news of the day (internal and external news announced to the market).
Remember that \(\varepsilon_{t}\) behaves like a random variable with normal probability distribution with mean=0 and with a specific standard deviation \(\sigma_{\varepsilon}\).
For the simulation of the random walk with a drift (rw1), we need to estimate the values of:
\(y_{0}\), the first value of the series, which is the log S&P500 index of the first day
\(\phi_{0}\), the drift
\(\sigma_{\varepsilon}\), the standard deviation (volatility) of the random shock
We have to estimate \(\phi_{0}\) using the last and the first real values of the series following the equation of the random walk.
Let’s calculate the first values of Y according to the random walk equation. We have to start with an initial value today (period t=0):
\[ Y_{0} = \text{Initial value} \]
Now, for tomorrow (t=1), and the next day (t=2), we apply the random walk equation. Then for t=1:
\[ Y_{1} = \phi_{0} + Y_{0} + \varepsilon_{1} \]
For t=2:
\[ Y_{2} = \phi_{0} + Y_{1} + \varepsilon_{2} \]
Substituting \(Y_{1}\) in the \(Y_2\) equation:
\[ Y_{2} = \phi_{0} + \phi_{0} + Y_{0} + \varepsilon_{1} + \varepsilon_{2} \]
Re-arranging the terms:
\[ Y_{2} = 2*\phi_{0} + Y_{0} + \varepsilon_{1} + \varepsilon_{2} \]
If we continue doing the same until the last day N, we get:
\[ Y_{N} = N*\phi_{0} + Y_{0} + \sum_{t=1}^{N}\varepsilon_{t} \]
This mathematical result is fairly intuitive. The value of a random walk on any day N will be equal to its initial value plus N times \(\phi_0\) plus the sum of ALL random shocks from day 1 to day N.
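The equivalence between the day-by-day equation and this closed form can be verified with a small sketch. All parameter values below are hypothetical and chosen only for illustration:

```r
set.seed(1)          # fixed seed so the sketch is reproducible
n_days <- 5          # small hypothetical number of days
phi0 <- 0.001        # hypothetical drift
sigma <- 0.01        # hypothetical volatility of the shock
y0 <- log(100)       # hypothetical initial value

eps <- rnorm(n_days, mean = 0, sd = sigma)   # random shocks

# Recursive form: Y_t = phi0 + Y_{t-1} + eps_t
y_recursive <- numeric(n_days)
y_prev <- y0
for (day in 1:n_days) {
  y_recursive[day] <- phi0 + y_prev + eps[day]
  y_prev <- y_recursive[day]
}

# Closed form: Y_t = t*phi0 + Y_0 + sum of the shocks up to day t
y_closed <- (1:n_days) * phi0 + y0 + cumsum(eps)

print(all.equal(y_recursive, y_closed))
```

Both approaches produce the same series, so the closed form can be used to simulate the whole random walk without a loop.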
The mean of all daily shocks is expected to be zero, so some days we will have negative shocks due to bad news, and some days we will have positive shocks due to good news.
Since the mean of the shocks is assumed to be zero, the expected value of the sum of all the shocks must also be zero. Then:
\[E\left[\sum_{t=1}^{N}\varepsilon_{t}\right]=0\] We simplify the expected value of the last day \(Y_N\) as:
\[E[Y_{N}] = N*\phi_{0} + Y_{0}\]
Doing simple algebra, we see that \(\phi_{0}\) can be estimated as:
\[\phi_{0} = \frac{(E[Y_{N}] - Y_{0})}{N}\]
To calculate \(\phi_0\) for a random walk that behaves like the log of the S&P, we replace \(E[Y_N]\) by the actual last value \(Y_N\) of the data:
\[\phi_{0} = \frac{(Y_{N} - Y_{0})}{N}\]
Then, \(\phi_{0}\) = (last value - first value) / # of days.
In R we can calculate \(\phi_{0}\) as follows:
lastprice = as.numeric(lnsp$lnsp[N])
# lnsp$lnsp[N] refers to the last row N of the vector lnsp$lnsp
firstprice = as.numeric(lnsp$lnsp[1])
# lnsp$lnsp[1] refers to the row 1 of the vector lnsp$lnsp
phi0 <- (lastprice - firstprice) / N
cat("The value for phi0 is ",phi0)
## The value for phi0 is 0.0003781275
# The cat instruction prints the text (in "") and the value of the variable, in this case phi0
Remember that N is the total # of days in the dataset, so lnsp[N] has the last daily value of the log of the S&P500.
To create the simulated random shocks for all days, we need to estimate sigma, which is the standard deviation of the shocks.
We can start estimating its variance first. It is known that the variance of a random walk cannot be determined unless we consider a specific number of periods.
Then, let’s consider the equation of the random walk series for the last value (\(Y_N\)), and then estimate its variance from there:
\[ Y_{N} = N*\phi_{0} + Y_{0} + \sum_{t=1}^{N}\varepsilon_{t} \]
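Using this equation, and assuming that the shocks are independent of each other (so the variance of their sum is the sum of their variances), we can calculate the variance of \(Y_N\):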
\[ Var(Y_{N}) = Var(N*\phi_{0}) + Var(Y_{0}) + \sum_{t=1}^{N}Var(\varepsilon_{t}) \]
The variance of a constant is zero. \(N*\phi_0\) and \(Y_0\) are constants, so the first two terms are equal to zero.
Now we analyze the variance of the shock:
We assume that the variance of the shock is constant over time. In other words, the standard deviation of the shocks is about the same over time, so:
\[ Var(\varepsilon_{1}) = Var(\varepsilon_{2}) = \dots = Var(\varepsilon_{N}) = \sigma_{\varepsilon}^2 \]
Then the sum of the variances of all shocks is the variance of the shock times N, and this quantity is the variance of \(Y_N\).
Then we can write the variance of \(Y_N\) as:
\[ Var(Y_{N}) = 0 + 0 + \sum_{t=1}^{N}Var(\varepsilon_{t}) \]
\[ Var(Y_{N}) = N * Var(\varepsilon) \] \[ \sigma_{Y}^{2} = N*\sigma_{\varepsilon}^2 \]
To get the standard deviation of \(Y_N\) we take the square root of the variance of \(Y_N\):
\[ \sqrt{Var(Y_{N})}=\sqrt{N*\sigma_{\varepsilon}^{2}} \] Then:
\[ SD(Y_{N}) = \sqrt{N}*SD(\varepsilon) \]
We use the sigma character for standard deviations:
\[
\sigma_{Y} = \sqrt{N}*\sigma_{\varepsilon}
\]
We call the standard deviation of the shocks the volatility.
Finally we express the volatility of the shock (\(\sigma_{\varepsilon}\)) in terms of the volatility of \(Y_N\) (\(\sigma_{Y}\)):
\[ \sigma_{\varepsilon} = \frac{\sigma_{Y}}{\sqrt{N}} \]
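As a quick sanity check of the relationship between \(\sigma_{Y}\) and \(\sigma_{\varepsilon}\), the sketch below simulates many random walks and looks at the mean and standard deviation of their final values \(Y_N\) across simulations. The parameter values and the number of simulations are hypothetical and chosen only for illustration:

```r
set.seed(123)        # fixed seed for reproducibility
n_days <- 250        # hypothetical number of days
n_sims <- 2000       # number of simulated random walks
phi0 <- 0.0004       # hypothetical drift
sigma <- 0.01        # hypothetical volatility of the daily shock
y0 <- log(100)       # hypothetical initial value

# Final value Y_N of each simulated random walk
y_final <- replicate(n_sims, y0 + n_days * phi0 + sum(rnorm(n_days, 0, sigma)))

# Simulated versus theoretical mean of Y_N
print(c(simulated = mean(y_final), theoretical = y0 + n_days * phi0))

# Simulated versus theoretical standard deviation of Y_N
print(c(simulated = sd(y_final), theoretical = sqrt(n_days) * sigma))
```

Across simulations, the standard deviation of \(Y_N\) should be close to \(\sqrt{N}\) times the volatility of the shock, and the mean should be close to \(N*\phi_0 + Y_0\).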
Then we can estimate sigma as: sigma = StDev(lnsp) / sqrt(N). Let’s do it in R:
sigma <- sd(lnsp$lnsp) / sqrt(N)
cat("The standard deviation of the log is = ",sd(lnsp$lnsp),"\n")
## The standard deviation of the log is = 0.3855358
cat("The standard deviation (volatility) for the shock is = ",sigma)
## The standard deviation (volatility) for the shock is = 0.006912174
For each day, we create a random shock using the function rnorm. We create this shock with standard deviation equal to the volatility of the shock we calculated above (the sigma). We indicate that the mean =0:
shock <- rnorm(n=N,mean=0,sd=sigma)
lnsp$shock <- shock
We can see the shock over time:
plot(shock, type="l", col="blue")
We can also see whether the shock behaves like a normal distribution by doing its histogram:
hist(lnsp$shock)
As expected, the shock behaves similarly to a normally distributed variable.
Now we are ready to start the simulation of the random walk.
Remember that we can express a random walk as its initial value plus N times the drift (\(\phi_0\)) plus the sum of all random shocks from the first day up to a specific period N:
\[ Y_{N} = N*\phi_{0} + Y_{0} + \sum_{t=1}^{N}\varepsilon_{t} \]
Now we rename \(Y_N\) as rw1.
We can calculate the values for this random walk process in R as follows:
# I create a column for the day #:
lnsp$day = seq(1,N)
# The seq function returns consecutive numbers from 1 to N
lnsp$rw1 = firstprice + lnsp$day*phi0 + cumsum(lnsp$shock)
# cumsum refers to the cumulative sum of the shocks from 1 to N
The cumsum function gets the cumulative sum of a variable from the first value up to each of the values of the variable.
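A tiny example of cumsum with a made-up vector:

```r
x <- c(1, 2, 3, 4)

# Each element is the sum of all values of x up to that position
cumsum(x)
# 1 3 6 10
```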
I plot the simulated random walk and the real log of the S&P500:
ts.plot(lnsp$lnsp)
lines(seq(1,N),lnsp$rw1, col="blue")
Now we can do another simulation, this time without the drift. In this case, the \(\phi_{0}\) coefficient must be zero.
We will use rw2 for this series. You can follow the logic we did for rw1, but now \(\phi_{0}\) will be equal to zero, so we do not include it into the equation:
lnsp$rw2 = firstprice + cumsum(lnsp$shock)
I plot this random walk and the log of the S&P in one plot:
ts.plot(lnsp$lnsp)
# I plot both lines to compare
lines(x=seq(1,N),y=lnsp$rw2, col="green")
RESPOND TO THE FOLLOWING QUESTIONS:
A) WHAT DO YOU OBSERVE with the previous plot? EXPLAIN WITH YOUR WORDS.
Visualizing the log of the S&P and the rw1 again:
# I plot the natural log of the S&P500
ts.plot(lnsp$lnsp)
lines(seq(1,N),lnsp$rw1, col="blue")
B) AFTER COMPARING THE SIMULATED RANDOM WALK WITH THE S&P500, DOES THE LOG OF THE S&P500 LOOK LIKE A RANDOM WALK? WHY OR WHY NOT?
C) DO YOU THINK THAT WE CAN USE THIS TYPE OF SIMULATION TO PREDICT STOCK PRICES OR INDEXES? WHY OR WHY NOT?
Read/skim the note: “Introduction to time series”. In your own words:
RESPOND TO THE FOLLOWING:
EXPLAIN WHAT A STATIONARY SERIES IS.
WHAT CONDITIONS MUST A SERIES MEET TO BE CONSIDERED A STATIONARY SERIES?
Remember that you have to submit your .html file through Canvas BEFORE NEXT CLASS.