Abstract
This is a solution for Workshop 2. Not all of the workshop is shown; only the sections where students needed to work on an exercise or answer questions. We start by clearing our R environment:
rm(list=ls())
# To avoid scientific notation for numbers:
options(scipen=999)
Using the getSymbols function, download MONTHLY prices of the Mexican market index (^MXX), the IPyC (Índice de Precios y Cotizaciones), and also download the S&P500 index from the US market (^GSPC). For both indexes download data from January 2000 to date. Do the following:
The getSymbols function retrieves price (quotation) data for financial instruments.
# Load the packages I need for this workshop:
library(quantmod)
#library(dplyr)
# Downloading the historical quotation data for both indexes:
getSymbols(c("^MXX", "^GSPC"), from="2000-01-01", to="2021-08-24",
src="yahoo", periodicity="monthly")
## [1] "^MXX" "^GSPC"
In your environment you now have the MXX and the GSPC xts datasets with historical data for the Mexican and the US financial market indexes.
Remember that xts datasets are time-series datasets with a time index.
Both datasets have columns for open, high, low, close, and adjusted prices. As we have mentioned, it is recommended to use adjusted prices to calculate returns. In the case of indexes, close prices will generally be equal to adjusted prices, since an index has no dividends or splits to adjust for.
Before calculating returns, we can create an integrated dataset with both indexes:
prices = merge(MXX,GSPC)
By doing this, we can easily calculate returns for both indexes using the integrated dataset.
We select only adjusted prices and rename the columns with meaningful names for the indexes:
prices = Ad(prices)
names(prices) = c("MXX","GSPC")
With adjusted prices we calculate continuously compounded returns:
r = diff(log(prices))
Remember that continuously compounded returns (r) can be calculated as the first difference of the natural log of prices (or of the index level). The first difference is equal to the log of the price at period t minus the log of the price at period (t-1).
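As a small illustration of this definition, here is a minimal sketch on a made-up price vector (the numbers in p are invented for the example and are not market data). It only uses base R, and the names p, r_toy and r_explicit are hypothetical.

```r
# Toy price series (invented numbers, not downloaded data)
p <- c(100, 105, 102.9, 108.045)

# Log return: log(price at t) minus log(price at t-1)
r_toy <- diff(log(p))

# The same calculation written explicitly as log(p_t / p_{t-1})
r_explicit <- log(p[-1] / p[-length(p)])
print(all.equal(r_toy, r_explicit))

# A useful property: log returns add up across periods
print(sum(r_toy))
print(log(p[length(p)] / p[1]))
```

The last two printed values should coincide, which is one reason log returns are convenient when aggregating over several periods.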
We visualize the first historical returns:
head(r)
##                    MXX        GSPC
## 2000-01-01          NA          NA
## 2000-02-01  0.11232485 -0.02031300
## 2000-03-01  0.01410906  0.09232375
## 2000-04-01 -0.11811558 -0.03127991
## 2000-05-01 -0.10795263 -0.02215875
## 2000-06-01  0.15323959  0.02365163
We see that the first value is NA since it is not possible to calculate returns for the first month. We can delete any NA values with the na.omit function:
r = na.omit(r)
Now we calculate simple returns:
R = na.omit(prices / lag(prices,k=1) - 1)
In this case I applied na.omit directly when creating the simple returns.
Remember that simple returns are calculated as the percentage change between the current price and the previous price.
lag is a function that can be used with xts datasets to get previous values of a variable. In this case, we indicate k=1 to get the price one period back. (For xts objects the argument is named k; n is the name used by dplyr's lag function.)
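To make the link between simple and continuously compounded returns concrete, the following sketch uses a made-up price vector in base R (no xts needed). The names p, R_toy and r_toy are hypothetical and the prices are invented.

```r
# Toy price series (invented numbers, not downloaded data)
p <- c(100, 105, 102.9, 108.045)

# Simple return: current price over previous price, minus 1
R_toy <- p[-1] / p[-length(p)] - 1
print(R_toy)

# Continuously compounded return for the same prices
r_toy <- diff(log(p))

# The two are linked by R = exp(r) - 1 (equivalently r = log(1 + R))
print(all.equal(R_toy, exp(r_toy) - 1))
```

For small returns the two measures are close to each other, but they are not identical, so it is worth being explicit about which one is used.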
Do a histogram of the simple return of the Mexican index:
hist(R$MXX, main="Histogram of IPC monthly returns",
xlab="Simple returns", col="dark blue")
INTERPRET this histogram with your words.
THE Y AXIS OF THIS HISTOGRAM REPRESENTS THE FREQUENCY, AND THE X AXIS IS THE RANGE OF THE IPC MONTHLY RETURNS. WE CAN SEE IN HOW MANY MONTHS OF ITS HISTORY THE IPC HAS OFFERED RETURNS IN SPECIFIC RANGES.
WE CAN SEE THAT MOST OF THE MONTHLY RETURNS OF THE MEXICAN IPC HAVE BEEN BETWEEN -10% AND 10% (ABOUT 95% OF THE MONTHS). THE MOST FREQUENT RANGE OF RETURNS IS BETWEEN 0 AND 5% WITH ABOUT 100 MONTHS. WE CAN SEE THAT THE IPC HAS OFFERED FEW EXTREME NEGATIVE RETURNS BETWEEN -20% AND -10% (ABOUT 2 OR 3 MONTHS). HOWEVER, IT HAS ALSO OFFERED VERY GOOD RETURNS BETWEEN +10% AND +20% IN ABOUT 2 MONTHS OF ITS HISTORY.
WE CAN ALSO SEE THAT THE MONTHLY IPC RETURNS LOOK LIKE A NORMALLY DISTRIBUTED VARIABLE. IF THIS IS THE CASE, THEN THE STANDARD DEVIATION OF THE IPC RETURNS SHOULD BE AROUND 5%, SINCE ABOUT 95% OF THE TIME THE IPC HAS OFFERED RETURNS BETWEEN -10% AND +10%. REMEMBER THAT IN A NORMALLY DISTRIBUTED VARIABLE, IF YOU TAKE THE RANGE OF ABOUT 2 STANDARD DEVIATIONS TO THE LEFT AND 2 STANDARD DEVIATIONS TO THE RIGHT OF THE MEAN, YOU COVER ABOUT 95% OF THE CASES.
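The 95% rule used in this answer can be checked with the normal cumulative distribution function. The sketch below is only a back-of-the-envelope check; the variable names are hypothetical and the -10% to +10% interval is the one read off the histogram above, with the mean taken as roughly zero.

```r
# Probability that a normal variable falls within 2 standard deviations of its mean
p_within_2sd <- pnorm(2) - pnorm(-2)
print(p_within_2sd)   # roughly 0.954

# If about 95% of the returns fall between -10% and +10%,
# that interval spans about 4 standard deviations in total
lower <- -0.10
upper <- 0.10
sd_guess <- (upper - lower) / 4
print(sd_guess)
```

This gives a rough guess of 5% for the monthly standard deviation, which can later be compared with the value computed from the data.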
Do a histogram of the simple return of the S&P500 index.
hist(R$GSPC, main="Histogram of S&P500 monthly returns",
xlab="Simple returns", col="blue")
INTERPRET this histogram with your words.
THE S&P MONTHLY RETURNS LOOK LIKE A NORMALLY DISTRIBUTED VARIABLE, BUT THEY ARE A LITTLE BIT SKEWED TO THE LEFT. IN OTHER WORDS, THERE ARE MORE EXTREME VALUES ON THE NEGATIVE SIDE COMPARED TO THE POSITIVE SIDE.
WE CAN SEE THAT THE MOST FREQUENT RANGE OF RETURNS IS ALSO BETWEEN 0 AND 5% WITH MORE THAN 100 MONTHS.
IT SEEMS THAT IN ABOUT 97% OF THE MONTHS THE RETURNS HAVE BEEN BETWEEN -10% AND +10%. HOWEVER, WE CAN SEE THAT THE MEXICAN IPC HAS HAD MORE EXTREME NEGATIVE RETURNS THAN THE S&P500. IN THIS CASE, IT SEEMS THAT IN ONLY 1 OR 2 MONTHS THE S&P500 HAS OFFERED RETURNS BETWEEN -20% AND -15%.
WE CAN ROUGHLY ESTIMATE A STANDARD DEVIATION OF A LITTLE BIT LESS THAN 5%, SINCE A LITTLE BIT MORE THAN 95% OF THE MONTHS FALL IN THE RANGE BETWEEN -10% AND +10%.
COMPARED TO THE MEXICAN IPC RETURNS, THE S&P HAS NOT OFFERED MONTHLY RETURNS BETWEEN 15% TO 20%.
Here is a graph that shows both histograms together. To better appreciate the histograms, I reduced the width of each bar to 2 percentage points.
The S&P returns are represented in blue; the IPyC returns are represented in yellow; the gray-yellow is the area shared by both histograms.
LOOK CAREFULLY AT THIS PLOT WITH BOTH HISTOGRAMS. WHICH INSTRUMENT IS RISKIER? EXPLAIN
R: THE IPC LOOKS RISKIER SINCE WE CAN SEE MORE EXTREME VALUES TO THE LEFT COMPARED TO THE S&P500 RETURNS. IN ADDITION, IT SEEMS THAT THE IPC HAS HAD MORE NEGATIVE RETURNS IF WE LOOK AT THE AREAS COVERED BY EACH INSTRUMENT BELOW ZERO.
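The code for the combined plot is not shown in this solution. One possible way to draw two overlaid histograms with bars 2 percentage points wide is sketched below. It is an assumption about how such a plot could be built, not the code actually used, and it runs on synthetic stand-in data (sim_ipc and sim_spx are simulated, hypothetical series, not the downloaded returns).

```r
set.seed(123)
# Synthetic stand-ins for the two return series (NOT the real data)
sim_ipc <- rnorm(259, mean = 0.0094, sd = 0.0520)
sim_spx <- rnorm(259, mean = 0.0055, sd = 0.0433)

# Common breaks, each bar 2 percentage points wide, covering both series
rng <- range(c(sim_ipc, sim_spx))
breaks <- seq(floor(rng[1] * 50), ceiling(rng[2] * 50)) / 50

# Semi-transparent colors so the overlapping area remains visible
col_spx <- rgb(0, 0, 1, 0.5)
col_ipc <- rgb(1, 1, 0, 0.5)

hist(sim_spx, breaks = breaks, col = col_spx,
     main = "Histograms of monthly returns (simulated data)",
     xlab = "Simple returns")
hist(sim_ipc, breaks = breaks, col = col_ipc, add = TRUE)
legend("topright", legend = c("S&P500", "IPyC"), fill = c(col_spx, col_ipc))
```

With the real data, sim_spx and sim_ipc would be replaced by as.numeric(R$GSPC) and as.numeric(R$MXX). Using the same breaks for both calls is what makes the two histograms directly comparable.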
CALCULATE the mean and standard deviation of monthly returns for both market indexes. Hint: you can check how we did this in Workshop 1
I LOAD THE PerformanceAnalytics LIBRARY:
library(PerformanceAnalytics)
NOW I GET DESCRIPTIVE STATISTICS OF RETURNS USING THE table.Stats FUNCTION FROM THIS LIBRARY:
table.Stats(R)
##                      MXX     GSPC
## Observations    259.0000 259.0000
## NAs               0.0000   0.0000
## Minimum          -0.1785  -0.1694
## Quartile 1       -0.0200  -0.0175
## Median            0.0107   0.0107
## Arithmetic Mean   0.0094   0.0055
## Geometric Mean    0.0080   0.0045
## Quartile 3        0.0429   0.0308
## Maximum           0.1656   0.1268
## SE Mean           0.0032   0.0027
## LCL Mean (0.95)   0.0030   0.0002
## UCL Mean (0.95)   0.0157   0.0108
## Variance          0.0027   0.0019
## Stdev             0.0520   0.0433
## Skewness         -0.3260  -0.5488
## Kurtosis          1.0475   1.1711
SINCE R HAS 2 COLUMNS, THEN WE GET THE DESCRIPTIVE STATISTICS FOR BOTH RETURNS
WE CAN ALSO USE THE mean AND sd FUNCTIONS TO GET WHAT WE WANT:
mean_R_MXX = mean(R$MXX)
cat("Mean of monthly return of the IPC =", mean_R_MXX,"\n")
## Mean of monthly return of the IPC = 0.009370433
mean_R_GSPC = mean(R$GSPC)
cat("Mean of monthly return of the S&P500 =", mean_R_GSPC,"\n")
## Mean of monthly return of the S&P500 = 0.005462128
sd_R_MXX = sd(R$MXX)
cat("Standard deviation of monthly return of the IPC =", sd_R_MXX,"\n")
## Standard deviation of monthly return of the IPC = 0.05200855
sd_R_GSPC = sd(R$GSPC)
cat("Standard deviation of monthly return of the S&P500 =", sd_R_GSPC,"\n")
## Standard deviation of monthly return of the S&P500 = 0.04328685
A COMMON WAY TO REFER TO STANDARD DEVIATION OF RETURNS IS VOLATILITY
Just by looking at the mean and standard deviation of each instrument returns, WHICH INSTRUMENT LOOKS MORE ATTRACTIVE TO INVEST? EXPLAIN
R: THE VOLATILITY OF THE IPC IS HIGHER THAN THAT OF THE S&P500 (5.2008555% vs 4.3286846%).
THE MEAN RETURN FOR THE IPC IS ALSO HIGHER THAN THAT OF THE S&P500 (0.9370433% vs 0.5462128%).
LOOKING ONLY AT THE MEAN AND VOLATILITY, IT IS HARD TO DECIDE WHICH INSTRUMENT I WOULD SELECT TO INVEST IN. IF MY PURPOSE IS TO HAVE A LOW-RISK INVESTMENT, I WOULD SELECT THE S&P500 SINCE ITS VOLATILITY IS LOWER. HOWEVER, IF I TOLERATE HIGH RISK AND LOOK FOR HIGH RETURNS, MAYBE I WOULD SELECT THE IPC. STILL, I WOULD NEED MORE ANALYSIS TO MAKE A DECISION. AT LEAST, I WOULD LIKE TO CALCULATE THE HOLDING-PERIOD RETURN FOR BOTH INSTRUMENTS OVER THE WHOLE HISTORY.
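One simple way to put the two statistics side by side is to look at the mean return per unit of volatility. This ratio is not part of the original workshop answer; it is only a rough sketch, using the rounded figures from the table.Stats output above, and the vector names are hypothetical.

```r
# Figures copied (rounded) from the table.Stats output above
mean_ret <- c(MXX = 0.0094, GSPC = 0.0055)
vol      <- c(MXX = 0.0520, GSPC = 0.0433)

# Mean monthly return earned per unit of monthly volatility
ratio <- mean_ret / vol
print(round(ratio, 3))
```

On these numbers the IPC shows the higher ratio, but this ignores currency differences and any risk-free rate, so it does not settle the question on its own.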
If you had invested in the Mexican Index $1 peso in Jan 2000, WHICH WOULD BE THE VALUE OF YOUR INVESTMENT TODAY?
If you had invested in the S&P500 index $1 USD in Jan 2000, WHICH WOULD BE THE VALUE OF YOUR INVESTMENT TODAY?
R: WE CAN CALCULATE THE HOLDING RETURN FOR THE WHOLE PERIOD (THE HPR), AND THEN MULTIPLY THE $1 TIMES (1+HPR) TO GET THE ANSWER.
TO CALCULATE THE HOLDING-PERIOD RETURN (HPR) WE CAN DIVIDE THE LAST VALUE BY THE FIRST VALUE OF THE PRICES AND SUBTRACT 1:
last_prices <- as.numeric(tail(prices,1))
# You can also do this by using brackets to get the last row:
last_prices2 <- as.numeric(prices[nrow(prices),])
# as.numeric(x) transforms x into a number (numeric format).
# Recall that prices is an xts object.
# prices[x] subsets prices according to the criteria x.
# nrow(x) obtains the number of rows of an xts object x.
# Now, get the first value of the prices
first_prices <- as.numeric(head(prices,1))
# Calculate HPR for both instruments:
HPR <- (last_prices / first_prices) - 1
# Calculate the value of $1 invested in both instruments
my_inv <- 1 * (1 + HPR)
# The final investment of the IPC is located in position 1 of the
# vector my_inv
cat("The value of $1 invested in the IPC is $", my_inv[1], " \n")
## The value of $1 invested in the IPC is $ 7.904229
# The final investment of the S&P500 is located in position 2 of the
# vector my_inv
cat("The value of $1 invested in the S&P500 is $", my_inv[2])
## The value of $1 invested in the S&P500 is $ 3.212376
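The holding-period return can also be obtained by compounding the simple returns of every period, which is a useful cross-check. The sketch below shows this on a made-up price vector in base R; the numbers and the names p, R_toy, hpr_from_prices, hpr_from_returns and growth_path are invented for illustration.

```r
# Toy price series (invented numbers, not downloaded data)
p <- c(100, 105, 102.9, 108.045)
R_toy <- p[-1] / p[-length(p)] - 1

# HPR from the first and last price
hpr_from_prices <- p[length(p)] / p[1] - 1

# HPR from compounding the period-by-period simple returns
hpr_from_returns <- prod(1 + R_toy) - 1
print(c(hpr_from_prices, hpr_from_returns))

# Value of $1 after each period
growth_path <- cumprod(1 + R_toy)
print(growth_path)
```

Both calculations give the same holding-period return, and the last element of growth_path equals 1 + HPR.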
I CAN ALSO CALCULATE THE VALUE OF THE INVESTMENT FOR EVERY MONTH OF THE HISTORY, AND DO A PLOT TO SEE HOW THE INVESTMENT WOULD HAVE MOVED OVER TIME:
HPR_MXX_MONTHS = prices$MXX / first_prices[1]
HPR_GSPC_MONTHS = prices$GSPC / first_prices[2]
plot.xts(HPR_MXX_MONTHS,col="green",ylim=c(0,8))
plot.xts(HPR_GSPC_MONTHS,col="blue")
# Putting both plots into one plot:
plot.xts(HPR_MXX_MONTHS,col="green",ylim=c(0,8))
lines(HPR_GSPC_MONTHS,col="blue")
The Central Limit Theorem is one of the most important discoveries in mathematics and statistics. Actually, thanks to this discovery, the field of Statistics was developed at the beginning of the 20th century.
We will do an exercise using simulated numbers. I hope that you understand what the Central Limit Theorem is about.
Let’s do the following.
Create the x variable as a random variable with normal distribution, with mean=20 and standard deviation=40. Create 100,000 observations:
x <- rnorm(n=100000, mean = 20, sd = 40)
Create the variable y as a random variable with uniform distribution in the range [0,60]:
y <- runif(n=100000, min = 0, max = 60)
Learn about the variance of the uniform distribution. HOW CAN YOU ESTIMATE THE VARIANCE OF A UNIFORM DISTRIBUTED VARIABLE?
The probability density function for a uniform variable is the following:
\(f(x)=\frac{1}{(b-a)}\) for a<=x<b,
= 0 for any other value of x outside the range between a and b
In this case, a=0 and b=60
The mean or expected value of this distribution is the following:
\(E(x)=\frac{(a+b)}{2}\)
Then, the theoretical mean in this case is:
\(E(x)= \frac{(0+60)}{2} = 30\)
The variance of a uniformly distributed variable is:
\(Var(x) = \frac{(b-a)^2}{12}\)
In our case the theoretical variance should be:
\(Var(x) = \frac{(60-0)^2}{12} = \frac{3600}{12} = 300\)
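The mean and variance formulas for the uniform distribution can be verified numerically by integrating against the density. This is a minimal sketch that takes a=0 and b=60, the limits passed to runif above; the helper dens and the other names are hypothetical.

```r
# Limits of the uniform distribution used in runif(min = 0, max = 60)
a <- 0
b <- 60

# Uniform density on [a, b], written so that it works on vectors
dens <- function(x) rep(1 / (b - a), length(x))

# Mean: integral of x * f(x); variance: integral of (x - mean)^2 * f(x)
mean_num <- integrate(function(x) x * dens(x), a, b)$value
var_num  <- integrate(function(x) (x - mean_num)^2 * dens(x), a, b)$value

# Compare the numerical integrals with the closed-form formulas
print(c(numerical = mean_num, formula = (a + b) / 2))
print(c(numerical = var_num,  formula = (b - a)^2 / 12))
```

The numerical results should match the closed-form values of 30 for the mean and 300 for the variance.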
Run a histogram for y and another histogram for x.
hist(x, main="Histogram of x",
xlab="x values", col="dark blue")
hist(y, main="Histogram of y",
xlab="y values", col="green")
WHAT DO YOU SEE? BRIEFLY EXPLAIN
IN THE HISTOGRAM OF X I CAN SEE THAT THE DISTRIBUTION LOOKS LIKE A NORMAL DISTRIBUTION WITH MEAN=20 AND STANDARD DEVIATION=40. ABOUT 95% OF THE DATA IS BETWEEN 2 STANDARD DEVIATIONS BELOW THE MEAN AND 2 STANDARD DEVIATIONS ABOVE THE MEAN (A RANGE BETWEEN -60 AND 100). IN THE HISTOGRAM OF Y I SEE THAT THE DISTRIBUTION IS NOT EXACTLY LIKE THE PERFECT UNIFORM DISTRIBUTION, BUT IT IS VERY SIMILAR. THIS HAPPENS SINCE THE DATA WAS CREATED RANDOMLY. BUT MOST RANGES OF Y VALUES HAVE ABOUT THE SAME FREQUENCY OF APPEARANCE, AS IS EXPECTED FOR A UNIFORM DISTRIBUTION.
Calculate the mean of x and y and save them as xbar and ybar:
xbar = mean(x)
ybar = mean(y)
Now we will manually calculate the variance of x and y. Remember that the VARIANCE of a variable is the AVERAGE OF ITS SQUARED DEVIATIONS.
Calculate first the squared deviations of x and y:
xdesv2 = (x-xbar)^2
ydesv2 = (y-ybar)^2
Now we just calculate the mean of the squared deviations to get the variance:
varx = mean(xdesv2)
varx
## [1] 1603.285
vary = mean(ydesv2)
vary
## [1] 299.4701
Compare the variance you computed with the theoretical variance of each variable (the variances we used to generate the simulated values for x and y). You will see that the computed variances varx and vary will be very similar to the theoretical values.
IN MY RANDOM DATA, THE VARIANCE OF X WAS 1603.2852339. THE THEORETICAL VARIANCE SHOULD BE \(40^{2}\), WHICH IS 1,600. MY EMPIRICAL VARIANCE CALCULATED USING THE RANDOMLY GENERATED NUMBERS IS REALLY CLOSE TO THE THEORETICAL VARIANCE.
FOR THE Y RANDOM VARIABLE I GOT A VARIANCE OF 299.470106. THE THEORETICAL VARIANCE SHOULD BE:
\(VAR(Y) = \frac{1}{12}*(60-0)^2 = 300\)
I SEE THAT MY EMPIRICAL VARIANCE WAS REALLY CLOSE TO THE THEORETICAL VARIANCE.
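Note that the manual calculation above divides by n, while R's built-in var function divides by n-1. With 100,000 observations the difference is negligible, as the following sketch illustrates. It uses its own simulated vector z (a hypothetical name) with the same mean and standard deviation as x, so the exact numbers will differ from those shown above.

```r
set.seed(2021)
# Simulated data with the same parameters as x (new draw, different values)
z <- rnorm(100000, mean = 20, sd = 40)
n <- length(z)

var_pop <- mean((z - mean(z))^2)   # average squared deviation (divides by n)
var_smp <- var(z)                  # built-in sample variance (divides by n - 1)

print(c(var_pop = var_pop, var_smp = var_smp))

# The two differ only by the factor (n - 1) / n
print(all.equal(var_pop, var_smp * (n - 1) / n))
```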
Create a data frame with x and y as columns
dataset <- cbind(x,y)
dataset <- as.data.frame(dataset)
Now assign a group number to each observation. We will create 4,000 groups of 25 observations each. Each group will be labeled from 1 to 4,000. You can use the function rep and seq:
# I create a column called group where each group will have 25
# observations:
dataset$group <- rep(seq(1:4000),each=25)
With the group_by() and summarize() functions, compute the sample mean of x and of y for EACH group.
Before the following code, you must INSTALL the PACKAGE dplyr. You can do this with the Package menu in RStudio.
# Now I group the observations by the group number previously assigned
# to them and compute the mean of x and y within each group:
library(dplyr)
group_means <- dataset %>%
  group_by(group) %>%
  summarise(x_mean = mean(x),
            y_mean = mean(y))
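If dplyr is not available, the same group means can be computed with base R. The sketch below is an alternative under that assumption, not the approach required by the workshop; x_sim is a freshly simulated, hypothetical vector, so its values differ from the x used above.

```r
set.seed(2021)
# Simulated data with the same parameters as x (new draw)
x_sim <- rnorm(100000, mean = 20, sd = 40)
group <- rep(1:4000, each = 25)

# Option 1: tapply applies mean() to the values of each group
means_tapply <- as.numeric(tapply(x_sim, group, mean))

# Option 2: since groups are consecutive blocks of 25 values, a 25 x 4000
# matrix (filled column by column) has one group per column
means_matrix <- colMeans(matrix(x_sim, nrow = 25))

print(all.equal(means_tapply, means_matrix))
print(head(means_tapply))
```

The matrix version relies on the groups being consecutive and of equal size; tapply works for any grouping.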
Now do a histogram of mean of x and another for the mean of y:
hist(group_means$x_mean, main="Histogram of mean of X",
xlab="Mean of X ", col="green")
hist(group_means$y_mean, main="Histogram of mean of Y",
xlab="Mean of Y", col="dark blue")
LOOKING AT THE HISTOGRAM OF THE SAMPLE MEAN OF Y (y_mean), HOW DIFFERENT IS IT FROM THE ORIGINAL HISTOGRAM OF Y (y)? EXPLAIN
NOW I SEE THAT BOTH HISTOGRAMS FOLLOW A DISTRIBUTION SIMILAR TO THE NORMAL DISTRIBUTION. THIS IS QUITE SURPRISING FOR THE MEAN OF Y VARIABLE SINCE THE ORIGINAL Y VARIABLE WAS CREATED AS UNIFORM, NOT NORMAL.
CALCULATE THE MEAN AND STANDARD DEVIATION OF BOTH SAMPLE MEANS (COLUMNS x_mean AND y_mean). HINT: YOU CAN USE THE mean and sd functions
new_xmean <- mean(group_means$x_mean)
new_ymean <- mean(group_means$y_mean)
new_varx <- var(group_means$x_mean)
cat("Variance of the mean of x = ",new_varx,"\n")
## Variance of the mean of x = 63.85405
new_vary <- var(group_means$y_mean)
cat("Variance of the mean of y = ",new_vary,"\n")
## Variance of the mean of y = 11.9846
IS THE VARIANCE OF THE RANDOM SAMPLE MEANS EQUAL TO THE VARIANCE OF THE ORIGINAL RANDOM VARIABLES? BRIEFLY EXPLAIN
I see something interesting with the variance of the mean of x and y. The variance of the mean of x (new_varx) is 63.8540503. The variance of x I had originally obtained was 1603.2852339.
This is interesting since the mean of x now has much less variability than the original variable. The new variance of the mean of x is about 1/25 of the original variance of x. Why is this the case?
There is a simple explanation using intuition, and we can also use math to provide a more convincing explanation. Let’s start with intuition.
When you take groups and then take the mean of each group, the extreme values in each group tend to cancel out when you average. So it is expected that the variance of the group means will be much less than the variance of the variable. But how much less?
In our case, we got a variance of the mean of x of 63.8540503, and the variance of x was about 1600, so the variance of the variable is about 25 times bigger than the variance of the mean. Note that 25 is exactly the number of observations in each group.
Now let’s use simple math and probability theory to examine this relationship between these variances:
LET’S DEFINE A RANDOM VARIABLE X AS THE WEIGHT OF STUDENTS, WITH OBSERVATIONS \(X_1, X_2, ..., X_N\). NOW THE MEAN OF THESE N OBSERVATIONS IS THE VARIABLE:
\[ \bar{X}=\frac{1}{N}\left(X_{1}+X_{2}+...+X_{N}\right) \]
WE CAN ESTIMATE THE VARIANCE OF THIS MEAN AS FOLLOWS:
\[ VAR\left(\bar{X}\right)=VAR\left(\frac{1}{N}\left(X_{1}+X_{2}+...+X_{N}\right)\right) \]
APPLYING BASIC PROBABILITY RULES (A CONSTANT FACTOR COMES OUT OF THE VARIANCE SQUARED), I CAN EXPRESS THIS VARIANCE AS:
\[ VAR\left(\bar{X}\right)=\left(\frac{1}{N}\right)^{2}VAR\left(X_{1}+X_{2}+...+X_{N}\right) \]
ASSUMING THE OBSERVATIONS ARE INDEPENDENT, THE VARIANCE OF THE SUM IS THE SUM OF THE VARIANCES:
\[ VAR\left(\bar{X}\right)=\left(\frac{1}{N}\right)^{2}\left[VAR\left(X_{1}\right)+VAR\left(X_{2}\right)+...+VAR\left(X_{N}\right)\right] \]
SINCE THE VARIANCE OF \(X_1\) IS THE SAME AS THE VARIANCE OF \(X_2\), AND IS ALSO THE SAME FOR ANY \(X_N\), THEN:
\[ VAR\left(\bar{X}\right)=\left(\frac{1}{N}\right)^{2}N\left[VAR\left(X\right)\right] \]
THEN WE CAN EXPRESS THE VARIANCE OF THE MEAN AS:
\[ VAR\left(\bar{X}\right)=\left(\frac{1}{N}\right)\left[VAR\left(X\right)\right]\]
HERE I CAN SEE THAT THE EXPECTED VARIANCE OF THE MEAN OF A RANDOM VARIABLE IS EQUAL TO THE ORIGINAL VARIANCE OF THE VARIABLE DIVIDED BY N.
NOW I CAN GET THE STANDARD DEVIATION BY TAKING THE SQUARE ROOT:
\[ SD(\bar{X})=\sqrt{\frac{1}{N}}\left[SD(X)\right] \]
\[ SD(\bar{X})=\frac{SD(X)}{\sqrt{N}} \]
I can see that the expected standard deviation of the mean of a random variable is equal to the original standard deviation of the variable divided by the square root of N.
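The result of this derivation can be checked by simulation. The sketch below redraws the data with the same parameters as in the exercise (so the numbers will not match the output shown earlier exactly) and compares the variance and standard deviation of the group means with the theoretical values. All variable names are hypothetical.

```r
set.seed(2021)
n_groups <- 4000   # number of groups
N <- 25            # observations per group

# New draws with the same parameters as x and y in the exercise
x_sim <- rnorm(n_groups * N, mean = 20, sd = 40)
y_sim <- runif(n_groups * N, min = 0, max = 60)

# One group per column, then the mean of each column
x_means <- colMeans(matrix(x_sim, nrow = N))
y_means <- colMeans(matrix(y_sim, nrow = N))

# Variance of the mean versus VAR(X) / N
print(c(simulated = var(x_means), theory = 40^2 / N))
print(c(simulated = var(y_means), theory = (60^2 / 12) / N))

# Standard deviation of the mean versus SD(X) / sqrt(N)
print(c(simulated = sd(x_means), theory = 40 / sqrt(N)))
```

The theoretical values are 64 and 12 for the variances and 8 for the standard deviation of the mean of x, in line with the 63.85 and 11.98 obtained above.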
DO SOME RESEARCH ABOUT THE CENTRAL LIMIT THEOREM. IN YOUR OWN WORDS, EXPLAIN WHAT THE CENTRAL LIMIT THEOREM IS
THE CENTRAL LIMIT THEOREM SAYS THAT FOR A RANDOM VARIABLE WITH ANY PROBABILITY DISTRIBUTION (AS LONG AS ITS VARIANCE IS FINITE), WHEN YOU TAKE GROUPS OF INDEPENDENT OBSERVATIONS OF THIS VARIABLE AND COMPUTE THE MEAN OF EACH GROUP, THE PROBABILITY DISTRIBUTION OF THESE MEANS WILL HAVE THE FOLLOWING CHARACTERISTICS:
THE DISTRIBUTION OF THE MEANS WILL BE CLOSE TO A NORMAL DISTRIBUTION WHEN EACH GROUP IS LARGE ENOUGH (A COMMON RULE OF THUMB IS AT LEAST 30 OBSERVATIONS PER GROUP, ALTHOUGH IN THIS EXERCISE GROUPS OF 25 ALREADY LOOKED CLOSE TO NORMAL). ACTUALLY, THIS HAPPENS NOT ONLY WITH THE MEAN OF THE OBSERVATIONS, BUT ALSO WITH OTHER LINEAR COMBINATIONS SUCH AS THE SUM OR A WEIGHTED AVERAGE.
THE STANDARD DEVIATION OF THE MEANS WILL BE MUCH LESS THAN THE STANDARD DEVIATION OF THE INDIVIDUAL OBSERVATIONS. MORE SPECIFICALLY, THE STANDARD DEVIATION OF THE MEAN WILL SHRINK BY A FACTOR OF 1/(SQUARE ROOT OF THE SIZE OF THE GROUP, N).
THEN, IN CONCLUSION, THE CENTRAL LIMIT THEOREM SAYS THAT, NO MATTER THE ORIGINAL PROBABILITY DISTRIBUTION OF A RANDOM VARIABLE, IF WE TAKE GROUPS OF THIS VARIABLE, A) THE MEANS OF THESE GROUPS WILL HAVE A PROBABILITY DISTRIBUTION CLOSE TO THE NORMAL DISTRIBUTION, AND B) THE STANDARD DEVIATION OF THE MEAN WILL SHRINK ACCORDING TO THE NUMBER OF ELEMENTS IN EACH GROUP.
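To see the theorem at work on a clearly non-normal variable, the sketch below repeats the grouping exercise with an exponential distribution, which is strongly skewed to the right. This example is an addition for illustration and is not part of the original workshop; the variable names and the simple skew helper are hypothetical.

```r
set.seed(2021)
N <- 25            # observations per group
n_groups <- 4000   # number of groups

# Exponential variable with rate 1: mean 1, standard deviation 1, right-skewed
w <- rexp(n_groups * N, rate = 1)
w_means <- colMeans(matrix(w, nrow = N))

# Simple skewness measure: zero for a symmetric distribution
skew <- function(v) mean((v - mean(v))^3) / sd(v)^3
print(c(raw = skew(w), group_means = skew(w_means)))

# Standard deviation of the means versus SD(W) / sqrt(N) = 1 / 5
print(c(simulated = sd(w_means), theory = 1 / sqrt(N)))

hist(w, main = "Histogram of w", xlab = "w values", col = "green")
hist(w_means, main = "Histogram of mean of w", xlab = "Mean of w", col = "dark blue")
```

The skewness of the group means should be much closer to zero than that of the raw variable, and their histogram should look roughly bell-shaped even though the original variable does not.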