1 Q Analyzing Histograms of returns

We start by clearing our R environment:

rm(list=ls())
# To avoid scientific notation for numbers: 
options(scipen=999)

Using the getSymbols function, download MONTHLY prices of the Mexican market index (^MXX), the IPyC (Índice de Precios y Cotizaciones), and the S&P500 index from the US market (^GSPC). For both indexes, download data from January 2000 to date. Do the following:

1.1 Data collection with getSymbols

The getSymbols function downloads price (quotation) data for financial instruments.

# Load the packages I need for this workshop:

library(quantmod)
#library(dplyr)

# Downloading the historical quotation data for both indexes:
getSymbols(c("^MXX", "^GSPC"), from="2000-01-01", to="2021-08-24", 
           src="yahoo", periodicity="monthly")
## [1] "^MXX"  "^GSPC"

In your environment you now have the MXX and GSPC xts datasets with historical data for the Mexican and US financial market indexes.

Remember that xts datasets are time-series datasets with a time index.

Both datasets have columns for open, high, low, close, and adjusted prices, plus volume. As we have mentioned, it is recommended to use adjusted prices to calculate returns. In the case of indexes, close prices will always be equal to adjusted prices.

1.2 Return calculation

Before calculating returns, we can create an integrated dataset with both indexes:

prices = merge(MXX,GSPC)

By doing this, we can easily calculate returns for both indexes using the integrated dataset.

We select only adjusted prices and rename the columns with meaningful names for the indexes:

prices = Ad(prices)
names(prices) = c("MXX","GSPC")

With adjusted prices we calculate continuously compounded returns:

r = diff(log(prices))

Remember that continuously compounded returns (r) can be calculated as the first difference of the natural log of prices (or of the index). The first difference is equal to the log of the price at period t minus the log of the price at period (t-1).
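As a small illustration of this definition, the following sketch uses a toy price vector (hypothetical values, not market data) to show that the first difference of log prices is the same as the log of the price ratio. Note that on a plain numeric vector diff drops the first element, while on an xts object it keeps an NA in the first row, as seen in this workshop.

```r
# Toy price series (hypothetical values, not market data)
p <- c(100, 105, 102, 110)

# Log returns as the first difference of log prices
r_diff <- diff(log(p))

# The same returns written as log(P_t / P_{t-1})
r_ratio <- log(p[-1] / p[-length(p)])

print(r_diff)
print(all.equal(r_diff, r_ratio))
```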

We visualize the first historical returns:

head(r)
##                    MXX        GSPC
## 2000-01-01          NA          NA
## 2000-02-01  0.11232485 -0.02031300
## 2000-03-01  0.01410906  0.09232375
## 2000-04-01 -0.11811558 -0.03127991
## 2000-05-01 -0.10795263 -0.02215875
## 2000-06-01  0.15323959  0.02365163

We see that the first value is NA since it is not possible to calculate returns for the first month. We can delete any NA values with the na.omit function:

r = na.omit(r)

Now we calculate simple returns:

R =  na.omit(prices / lag(prices, k=1) - 1)

In this case I applied na.omit directly when creating the simple returns.

Remember that simple returns are calculated as the percentage change between the current price and the previous price.

lag is a function that can be used with xts datasets to get previous values of a variable. For xts objects the argument is called k; here we indicate k=1 to get the price of the previous period.
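To make the link between the two kinds of returns concrete, here is a minimal sketch with a toy price vector (hypothetical values). It builds the previous price by hand instead of using lag, so it runs without any package, and it checks that a simple return equals exp(log return) - 1.

```r
# Toy price series (hypothetical values, not market data)
p <- c(100, 105, 102, 110)

# Previous-period price, built by hand (plays the role of a one-period lag)
p_prev <- c(NA, p[-length(p)])

# Simple returns: percentage change with respect to the previous price
R_simple <- p / p_prev - 1

# Log returns, padded with NA so both vectors have the same length
r_log <- c(NA, diff(log(p)))

print(R_simple)
# Simple return = exp(log return) - 1
print(all.equal(R_simple, exp(r_log) - 1))
```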

1.3 Q Histograms

Do a histogram of the simple return of the Mexican index:

hist(R$MXX, main="Histogram of IPC monthly returns", 
     xlab="Simple returns", col="dark blue")

INTERPRET this histogram with your words.

THE Y AXIS OF THIS HISTOGRAM REPRESENTS THE FREQUENCY, AND THE X AXIS IS THE RANGE OF THE IPC MONTHLY RETURNS. WE CAN SEE HOW MANY MONTHS IN THE HISTORY THE IPC HAS OFFERED RETURNS IN SPECIFIC RANGES

WE CAN SEE THAT MOST OF THE MONTHLY RETURNS OF THE MEXICAN IPC HAVE BEEN BETWEEN -10% AND 10% (ABOUT 95% OF THE MONTHS). THE MOST FREQUENT RANGE OF RETURNS IS BETWEEN 0 AND 5% WITH ABOUT 100 MONTHS. WE CAN SEE THAT THE IPC HAS OFFERED FEW EXTREME NEGATIVE RETURNS BETWEEN -20% AND -10% (ABOUT 2 OR 3 MONTHS). HOWEVER, IT HAS ALSO OFFERED VERY GOOD RETURNS BETWEEN +10% AND +20% IN ABOUT 2 MONTHS IN THE HISTORY.

WE CAN ALSO SEE THAT THE MONTHLY IPC RETURNS LOOK LIKE A NORMALLY DISTRIBUTED VARIABLE. IF THIS IS THE CASE, THEN THE STANDARD DEVIATION OF THE IPC RETURNS SHOULD BE AROUND 5% SINCE ABOUT 95% OF THE TIME THE IPC HAS OFFERED RETURNS BETWEEN -10% AND +10%. REMEMBER THAT IN A NORMALLY DISTRIBUTED VARIABLE, IF YOU TAKE THE RANGE OF ABOUT 2 STANDARD DEVIATIONS TO THE LEFT AND 2 STANDARD DEVIATIONS TO THE RIGHT OF THE MEAN, YOU COVER ABOUT 95% OF THE CASES.
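The 2-standard-deviation rule used in this interpretation can be checked with the normal distribution functions in base R. This is only a sketch under the assumption that returns are normal with a mean close to zero; the helper name coverage is introduced here for illustration.

```r
# Probability of falling within k standard deviations of the mean
# for a normally distributed variable
coverage <- function(k) pnorm(k) - pnorm(-k)
print(round(coverage(1:3), 4))

# Standard deviation implied by "95% of returns lie within +/- 10%",
# assuming normality and a mean close to zero
implied_sd <- 0.10 / qnorm(0.975)
print(round(implied_sd, 4))
```

The implied value is close to 5%, which is in line with the rough estimate read from the histogram.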

Do a histogram of the simple return of the S&P500 index.

hist(R$GSPC, main="Histogram of S&P500 monthly returns", 
     xlab="Simple returns", col="blue")

INTERPRET this histogram with your words.

THE S&P MONTHLY RETURNS LOOK LIKE A NORMAL-DISTRIBUTED VARIABLE, BUT IT IS A LITTLE BIT SKEWED TO THE LEFT. IN OTHER WORDS, THERE ARE MORE EXTREME VALUES IN THE NEGATIVE SIDE COMPARED TO THE POSITIVE SIDE.

WE CAN SEE THAT THE MOST FREQUENT RANGE OF RETURNS IS ALSO BETWEEN 0 AND 5% WITH MORE THAN 100 MONTHS

IT SEEMS THAT IN ABOUT 97% OF THE MONTHS THE RETURNS HAVE BEEN BETWEEN -10% AND +10%. HOWEVER, WE CAN SEE THAT THE MEXICAN IPC HAS HAD MORE EXTREME NEGATIVE RETURNS THAN THE S&P500. IN THIS CASE, IT SEEMS THAT IN ONLY 1 OR 2 MONTHS THE S&P500 HAS OFFERED RETURNS BETWEEN -20% AND -15%.

WE CAN ROUGHLY ESTIMATE A STANDARD DEVIATION OF A LITTLE LESS THAN 5% SINCE A LITTLE MORE THAN 95% OF THE MONTHS FALL IN THE RANGE BETWEEN -10% AND +10%.

COMPARED TO THE MEXICAN IPC RETURNS, THE S&P HAS NOT OFFERED MONTHLY RETURNS BETWEEN 15% TO 20%.

1.4 Q Appreciating risk by looking at histograms.

Here is a graph that shows both histograms together. To better appreciate the histograms, I reduced the width of each bar to 2 percentage points.

The S&P returns are represented in blue; the IPyC returns are represented in yellow; the gray-yellow area is shared by both histograms.
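The code for this combined plot is not shown in the workshop. One possible way to draw two overlaid histograms with bars 2 percentage points wide is sketched below. The two series ret_a and ret_b are simulated stand-ins (assumed normal, with made-up parameters), not the actual index returns, and the colors are only an approximation of the ones described.

```r
set.seed(123)
# Simulated stand-ins for the two return series (NOT the real index data)
ret_a <- rnorm(259, mean = 0.009, sd = 0.052)   # plays the role of the IPyC
ret_b <- rnorm(259, mean = 0.005, sd = 0.043)   # plays the role of the S&P500

# Common bin edges, 2 percentage points wide, covering both series
k_lo <- floor(min(ret_a, ret_b) / 0.02)
k_hi <- ceiling(max(ret_a, ret_b) / 0.02)
bins <- (k_lo:k_hi) * 0.02

# Compute both histograms first so the y axis can fit both
h_a <- hist(ret_a, breaks = bins, plot = FALSE)
h_b <- hist(ret_b, breaks = bins, plot = FALSE)
y_max <- max(h_a$counts, h_b$counts)

# Semi-transparent colors let the shared area show through
plot(h_b, col = rgb(0, 0, 1, 0.5), ylim = c(0, y_max),
     main = "Overlaid histograms (simulated data)", xlab = "Simple returns")
plot(h_a, col = rgb(1, 1, 0, 0.5), add = TRUE)
legend("topleft", legend = c("Series B", "Series A"),
       fill = c(rgb(0, 0, 1, 0.5), rgb(1, 1, 0, 0.5)))
```

Using the same breaks for both series is what makes the bars directly comparable.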

LOOK CAREFULLY AT THIS PLOT WITH BOTH HISTOGRAMS. WHICH INSTRUMENT IS RISKIER? EXPLAIN

R: THE IPC LOOKS RISKIER SINCE WE CAN SEE MORE EXTREME VALUES TO THE LEFT COMPARED TO THE S&P500 RETURNS. IN ADDITION, IT SEEMS THAT THE IPC HAS HAD MORE NEGATIVE RETURNS IF WE LOOK AT THE AREA COVERED BY EACH INSTRUMENT BELOW ZERO.

1.5 Q Calculate the mean and standard deviation

CALCULATE the mean and standard deviation of monthly returns for both market indexes. Hint: you can check how we did this in Workshop 1.

I LOAD THE PerformanceAnalytics LIBRARY:

library(PerformanceAnalytics)

NOW I GET DESCRIPTIVE STATISTICS OF RETURNS USING THE table.Stats FUNCTION FROM THIS LIBRARY:

table.Stats(R)
##                      MXX     GSPC
## Observations    259.0000 259.0000
## NAs               0.0000   0.0000
## Minimum          -0.1785  -0.1694
## Quartile 1       -0.0200  -0.0175
## Median            0.0107   0.0107
## Arithmetic Mean   0.0094   0.0055
## Geometric Mean    0.0080   0.0045
## Quartile 3        0.0429   0.0308
## Maximum           0.1656   0.1268
## SE Mean           0.0032   0.0027
## LCL Mean (0.95)   0.0030   0.0002
## UCL Mean (0.95)   0.0157   0.0108
## Variance          0.0027   0.0019
## Stdev             0.0520   0.0433
## Skewness         -0.3260  -0.5488
## Kurtosis          1.0475   1.1711

SINCE R HAS 2 COLUMNS, THEN WE GET THE DESCRIPTIVE STATISTICS FOR BOTH RETURNS

WE CAN ALSO USE THE mean AND sd FUNCTIONS TO GET WHAT WE WANT:

mean_R_MXX = mean(R$MXX)
cat("Mean of monthly return of the IPC =", mean_R_MXX,"\n")
## Mean of monthly return of the IPC = 0.009370433
mean_R_GSPC =mean(R$GSPC)
cat("Mean of monthly return of the S&P500 =", mean_R_GSPC,"\n")
## Mean of monthly return of the S&P500 = 0.005462128
sd_R_MXX = sd(R$MXX)
cat("Standard deviation of monthly return of the IPC =", sd_R_MXX,"\n")
## Standard deviation of monthly return of the IPC = 0.05200855
sd_R_GSPC =sd(R$GSPC)
cat("Standard deviation of monthly return of the S&P500 =", sd_R_GSPC,"\n")
## Standard deviation of monthly return of the S&P500 = 0.04328685

A COMMON WAY TO REFER TO STANDARD DEVIATION OF RETURNS IS VOLATILITY

Just by looking at the mean and standard deviation of each instrument returns, WHICH INSTRUMENT LOOKS MORE ATTRACTIVE TO INVEST? EXPLAIN

R: THE MEAN RETURN OF THE IPC IS HIGHER THAN THAT OF THE S&P500 (0.9370433% vs 0.5462128%).

THE VOLATILITY OF THE IPC IS ALSO HIGHER THAN THAT OF THE S&P500 (5.2008555% vs 4.3286846%).

LOOKING ONLY AT THE MEAN AND VOLATILITY, IT IS HARD TO DECIDE WHICH INSTRUMENT I WOULD SELECT TO INVEST IN. IF MY PURPOSE IS TO HAVE A LOW-RISK INVESTMENT, I WOULD SELECT THE S&P500 SINCE ITS VOLATILITY IS LOWER. HOWEVER, IF I CAN TOLERATE HIGH RISK AND I AM LOOKING FOR HIGH RETURNS, MAYBE I WOULD SELECT THE IPC. IN ANY CASE, I WOULD NEED MORE ANALYSIS TO MAKE A DECISION. AT LEAST, I WOULD LIKE TO CALCULATE THE HOLDING-PERIOD RETURN FOR BOTH INSTRUMENTS OVER THE WHOLE HISTORY.

1.6 Q (OPTIONAL) Calculating the holding-period return

If you had invested in the Mexican Index $1 peso in Jan 2000, WHICH WOULD BE THE VALUE OF YOUR INVESTMENT TODAY?

If you had invested in the S&P500 index $1 USD in Jan 2000, WHICH WOULD BE THE VALUE OF YOUR INVESTMENT TODAY?

R: WE CAN CALCULATE THE HOLDING RETURN FOR THE WHOLE PERIOD (THE HPR), AND THEN MULTIPLY THE $1 TIMES (1+HPR) TO GET THE ANSWER.

TO CALCULATE THE HOLDING-PERIOD RETURN (HPR) WE CAN DIVIDE THE LAST VALUE BY THE FIRST VALUE OF THE PRICES AND SUBTRACT 1:

last_prices <- as.numeric(tail(prices,1))

# You can also do this by using brackets to get the last row: 
last_prices2 <- as.numeric(prices[nrow(prices),])
  # as.numeric(x) transforms x into a number (numeric format).
  # Recall that prices is an xts object.
  # prices[i, ] selects row i of prices.
  # nrow(x) obtains the number of rows of an xts object x.

# Now, get the first value of the prices
first_prices <- as.numeric(head(prices,1))



# Calculate HPR for both instruments:
HPR <- (last_prices / first_prices) - 1

# Calculate the value of $1 invested in both instruments
my_inv <- 1 * (1 + HPR)
# The final investment of the IPC is located in position 1 of the 
#   vector my_inv
cat("The value of $1 invested in the IPC is $", my_inv[1], " \n")
## The value of $1 invested in the IPC is $ 7.904229
# The final investment of the S&P500 is located in position 2 of the 
#   vector my_inv
cat("The value of $1 invested in the S&P500 is $", my_inv[2])
## The value of $1 invested in the S&P500 is $ 3.212376
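The HPR obtained from the first and last prices can be cross-checked against the period returns: it should equal the compounded simple returns minus 1, and also exp of the sum of the log returns minus 1. The following sketch verifies this on a toy price vector (hypothetical values, not the index data).

```r
# Toy price series (hypothetical values, not market data)
p <- c(100, 105, 102, 110)

# HPR from the first and last prices
hpr_prices <- p[length(p)] / p[1] - 1

# HPR from compounding the simple period returns
R_simple <- p[-1] / p[-length(p)] - 1
hpr_simple <- prod(1 + R_simple) - 1

# HPR from the sum of the log returns
r_log <- diff(log(p))
hpr_log <- exp(sum(r_log)) - 1

print(c(hpr_prices, hpr_simple, hpr_log))
```

All three numbers agree because the intermediate prices cancel out when period returns are compounded.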

I CAN ALSO CALCULATE THE VALUE OF THE INVESTMENT FOR EVERY MONTH OF THE HISTORY, AND DO A PLOT TO SEE HOW THE INVESTMENT WOULD HAVE MOVED OVER TIME:

HPR_MXX_MONTHS = prices$MXX / first_prices[1]
HPR_GSPC_MONTHS = prices$GSPC / first_prices[2]

plot.xts(HPR_MXX_MONTHS,col="green",ylim=c(0,8))

plot.xts(HPR_GSPC_MONTHS,col="blue")

# Putting both plots into one plot:
plot.xts(HPR_MXX_MONTHS,col="green",ylim=c(0,8))

lines(HPR_GSPC_MONTHS,col="blue")

2 Q The Central Limit Theorem

The Central Limit Theorem is one of the most important discoveries in mathematics and statistics. Actually, thanks to this discovery, the field of Statistics was developed at the beginning of the 20th century.

We will do an exercise using simulated numbers. I hope it helps you understand what the Central Limit Theorem is about.

Let’s do the following.

2.1 Q Monte Carlo simulation to create variables

Create the x variable as a random variable with normal distribution, with mean=20 and standard deviation=40. Create 100,000 observations:

x <- rnorm(n=100000, mean = 20, sd = 40)

Create the variable y as a random variable with uniform distribution in the range [0,60]:

y <- runif(n=100000, min = 0, max = 60)

Learn about the variance of the uniform distribution. HOW CAN YOU ESTIMATE THE VARIANCE OF A UNIFORM DISTRIBUTED VARIABLE?

The probability density function for a uniform variable is the following:

\(f(x)=\frac{1}{(b-a)}\) for a<=x<b,

= 0 for any other value of x outside the range between a and b

In this case, a=0 and b=60

The mean or expected value of this function is the following:

\(E(x)=\frac{(a+b)}{2}\)

Then, the theoretical mean in this case is:

\(E(x)= \frac{(0+60)}{2} = 30\)

The variance of a uniformly distributed variable is:

\(Var(x) = \frac{(b-a)^2}{12}\)

In our case the theoretical variance should be:

\(Var(x) = \frac{(60-0)^2}{12} = \frac{3600}{12} = 300\)
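As a quick check of the formulas for the mean and variance of a uniform variable, the sketch below compares the theoretical values with those of a simulated sample on the range [0,60] used for y. The seed and the variable names are chosen here only for illustration.

```r
set.seed(1)
a <- 0
b <- 60

# Theoretical mean and variance of a uniform variable on [a, b]
theo_mean <- (a + b) / 2
theo_var  <- (b - a)^2 / 12

# Simulated sample on the same range
u <- runif(100000, min = a, max = b)

print(c(theoretical = theo_mean, simulated = mean(u)))
print(c(theoretical = theo_var, simulated = var(u)))
```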

2.2 Q Histograms of x and y

Run a histogram for y and another histogram for x.

hist(x, main="Histogram of x", 
     xlab="x values", col="dark blue")

hist(y, main="Histogram of y", 
     xlab="y values", col="green")

WHAT DO YOU SEE? BRIEFLY EXPLAIN

IN THE HISTOGRAM OF X I CAN SEE THAT THE DISTRIBUTION LOOKS LIKE A NORMAL DISTRIBUTION WITH MEAN=20 AND STANDARD DEVIATION=40. ABOUT 95% OF THE DATA IS BETWEEN 2 STANDARD DEVIATIONS BELOW THE MEAN AND 2 STANDARD DEVIATIONS ABOVE THE MEAN (A RANGE BETWEEN -60 AND 100). IN THE HISTOGRAM OF Y I SEE THAT THE DISTRIBUTION IS NOT EXACTLY LIKE THE PERFECT UNIFORM DISTRIBUTION, BUT IT IS VERY SIMILAR. THIS HAPPENS SINCE THE DATA WAS CREATED RANDOMLY. STILL, MOST OF THE Y VALUES HAVE ABOUT THE SAME FREQUENCY OF APPEARANCE, AS IS EXPECTED FOR A UNIFORM DISTRIBUTION.

2.3 Calculating standard deviation and variance

Calculate the mean of x and y and save them as xbar and ybar:

xbar= mean(x)
ybar= mean(y)

Now we will manually calculate the variance of x and y. Remember that the VARIANCE of a variable is the AVERAGE OF ITS SQUARED DEVIATIONS from its mean.

Calculate first the squared deviations of x and y:

xdesv2=(x-xbar)^2
ydesv2=(y-ybar)^2

Now we just calculate the mean of the squared deviations to get the variance:

varx=mean(xdesv2)
varx
## [1] 1603.285
vary=mean(ydesv2)
vary
## [1] 299.4701

Compare the variance you computed with the theoretical variance of each variable (the variances we used to generate the simulated values for x and y). You will see that the computed variances varx and vary will be very similar to the theoretical values.

IN MY RANDOM DATA, THE VARIANCE OF X WAS 1603.2852339. THE THEORETICAL VARIANCE SHOULD BE \(40^{2}\), WHICH IS 1,600. MY EMPIRICAL VARIANCE CALCULATED USING THE RANDOM GENERATED NUMBERS IS REALLY CLOSE TO THE THEORETICAL VARIANCE.

FOR THE Y RANDOM VARIABLE I GOT A VARIANCE OF 299.470106. THE THEORETICAL VARIANCE SHOULD BE:

\(VAR(Y) = \frac{1}{12}*(60-0)^2 = 300\)

I SEE THAT MY EMPIRICAL VARIANCE WAS REALLY CLOSE TO THE THEORETICAL VARIANCE.
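One detail worth keeping in mind: the manual calculation above divides by n (the mean of the squared deviations), while R's built-in var function divides by n-1. With 100,000 observations the difference is negligible, but it is visible in small samples. A minimal sketch with a small simulated vector (names and seed chosen here for illustration):

```r
set.seed(1)
z <- rnorm(10, mean = 20, sd = 40)
n <- length(z)

# Variance as the mean of squared deviations (divides by n)
var_n <- mean((z - mean(z))^2)

# Built-in sample variance (divides by n - 1)
var_n1 <- var(z)

print(c(divide_by_n = var_n, divide_by_n_minus_1 = var_n1))

# The two are linked by the factor (n - 1) / n
print(all.equal(var_n, var_n1 * (n - 1) / n))
```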

2.4 Calculating mean of groups for x and y

Create a data frame with x and y as columns

dataset <- cbind(x,y)
dataset <- as.data.frame(dataset)

Now assign a group number to each observation. We will create 4,000 groups of 25 observations each. Each group will be labeled from 1 to 4,000. You can use the function rep and seq:

# I create a column called group where each group will have 25 
#   observations:

dataset$group <- rep(seq(1, 4000), each=25)

With the group_by() and summarize() functions, compute the sample mean of x and of y for EACH group.

Before the following code, you must INSTALL the PACKAGE dplyr. You can do this with the Package menu in RStudio.

# Now I group the observations by the group number previously assigned to them:
library(dplyr)
group_means <- dataset %>%
   group_by(group) %>%
   summarise(x_mean = mean(x),
             y_mean = mean(y))
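If dplyr is not available, the same group means can be obtained with base R. The sketch below uses a small simulated data frame (4 groups of 25 observations; names and seed are chosen here for illustration) and the aggregate function from the stats package.

```r
set.seed(1)
# Small simulated data frame: 4 groups of 25 observations each
df <- data.frame(x = rnorm(100, mean = 20, sd = 40),
                 y = runif(100, min = 0, max = 60))
df$group <- rep(1:4, each = 25)

# Mean of x and y for each group using base R
group_means_base <- aggregate(cbind(x, y) ~ group, data = df, FUN = mean)
names(group_means_base) <- c("group", "x_mean", "y_mean")

print(group_means_base)
```

The result has one row per group, like the output of group_by() followed by summarise().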

Now do a histogram of mean of x and another for the mean of y:

hist(group_means$x_mean, main="Histogram of mean of X", 
     xlab="Mean of X ", col="green")

hist(group_means$y_mean, main="Histogram of mean of Y", 
     xlab="Mean of Y", col="dark blue")

LOOKING AT THE HISTOGRAM OF THE SAMPLE MEAN OF Y (y_mean), HOW DIFFERENT IS IT FROM THE ORIGINAL HISTOGRAM OF Y (y)? EXPLAIN

NOW I SEE THAT BOTH HISTOGRAMS FOLLOW A DISTRIBUTION SIMILAR TO THE NORMAL DISTRIBUTION. THIS IS QUITE SURPRISING FOR THE MEAN OF Y VARIABLE SINCE THE ORIGINAL Y VARIABLE WAS CREATED AS UNIFORM, NOT NORMAL.

CALCULATE THE MEAN AND STANDARD DEVIATION OF BOTH SAMPLE MEANS (COLUMNS x_mean AND y_mean). HINT: YOU CAN USE THE mean and sd functions

new_xmean <- mean(group_means$x_mean)
new_ymean <- mean(group_means$y_mean)

new_varx <- var(group_means$x_mean)
cat("Variance of the mean of x = ",new_varx,"\n")
## Variance of the mean of x =  63.85405
new_vary <- var(group_means$y_mean)
cat("Variance of the mean of y = ",new_vary,"\n")
## Variance of the mean of y =  11.9846

IS THE VARIANCE OF THE RANDOM SAMPLE MEANS EQUAL TO THE VARIANCE OF THE ORIGINAL RANDOM VARIABLES? BRIEFLY EXPLAIN

I see something interesting with the variance of the mean of x and y. The variance of the mean of x (new_varx) is 63.8540503. The original variance of x I had obtained was 1603.2852339.

This is interesting since the mean of x now has much less variability than the original variable. The new variance of the mean of x is about 1/25 of the original variance of x. Why is this the case?

There is a simple explanation using intuition, and we can also use math to provide a more convincing explanation. Let’s start with intuition.

When you take groups and then take the mean of each group, then extreme values that you could have in each group will cancel out when you take the average of the group. Then, it is expected that the variance of the mean of the group will be much less than variance of the variable. But how much less?

In our case, we got a variance of the mean of x of 63.8540503, and the variance of x was about 1600, then it seems that the variance of the variable is about 25 times bigger than the variance of the mean.

Now let’s use simple math and probability theory to examine this relationship between these variances:

LET’S DEFINE A RANDOM VARIABLE X AS THE WEIGHT OF STUDENTS, WITH OBSERVATIONS X1, X2, … XN. NOW THE SAMPLE MEAN OF THIS VARIABLE IS:

\[ \bar{X}=\frac{1}{N}\left(X_{1}+X_{2}+...+X_{N}\right) \]

WE CAN ESTIMATE THE VARIANCE OF THIS MEAN AS FOLLOWS:

\[ VAR\left(\bar{X}\right)=VAR\left(\frac{1}{N}\left(X_{1}+X_{2}+...+X_{N}\right)\right) \]

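APPLYING BASIC PROBABILITY RULES (A CONSTANT COMES OUT OF THE VARIANCE SQUARED AND, SINCE THE OBSERVATIONS ARE INDEPENDENT, THE VARIANCE OF A SUM IS THE SUM OF THE VARIANCES), I CAN EXPRESS THIS VARIANCE AS: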

\[ VAR\left(\bar{X}\right)=\left(\frac{1}{N}\right)^{2}VAR\left(X_{1}+X_{2}+...+X_{N}\right) \]

\[ VAR\left(\bar{X}\right)=\left(\frac{1}{N}\right)^{2}\left[VAR\left(X_{1}\right)+VAR\left(X_{2}\right)+...+VAR\left(X_{N}\right)\right] \]

SINCE THE VARIANCE OF \(X_1\) IS THE SAME AS THE VARIANCE OF \(X_2\), AND IS ALSO THE SAME FOR ANY \(X_N\), THEN:

\[ VAR\left(\bar{X}\right)=\left(\frac{1}{N}\right)^{2}N\left[VAR\left(X\right)\right] \]

THEN WE CAN EXPRESS THE VARIANCE OF THE MEAN AS:

\[ VAR\left(\bar{X}\right)=\left(\frac{1}{N}\right)\left[VAR\left(X\right)\right]\]

HERE I CAN SEE THAT THE EXPECTED VARIANCE OF THE MEAN OF A RANDOM VARIABLE IS EQUAL TO THE ORIGINAL VARIANCE OF THE VARIABLE DIVIDED BY N.

NOW I CAN GET THE STANDARD DEVIATION BY TAKING THE SQUARE ROOT:

\[ SD(\bar{X})=\sqrt{\frac{1}{N}}\left[SD(X)\right] \]

\[ SD(\bar{X})=\frac{SD(X)}{\sqrt{N}} \]

I can see that the expected standard deviation of the mean of a random variable is equal to the original standard deviation of the variable divided by the square root of N.
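The relationship derived above can be checked numerically. The following sketch simulates 4,000 groups of 25 normal observations with standard deviation 40 (the same design as in this exercise, but with its own seed and variable names) and compares the standard deviation of the group means with SD(X) divided by the square root of N.

```r
set.seed(42)
n_groups <- 4000
n <- 25

# Simulate all observations and arrange them so each row is one group
x_sim <- rnorm(n_groups * n, mean = 20, sd = 40)
x_mat <- matrix(x_sim, nrow = n_groups, ncol = n, byrow = TRUE)

# Mean of each group
x_means <- rowMeans(x_mat)

# Empirical SD of the group means vs the theoretical value SD(X) / sqrt(N)
sd_empirical <- sd(x_means)
sd_theoretical <- 40 / sqrt(n)

print(c(empirical = sd_empirical, theoretical = sd_theoretical))
```

With N = 25 the theoretical value is 40/5 = 8, so the variance of the means is about 64, in line with the value of new_varx obtained earlier.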

DO SOME RESEARCH ABOUT THE CENTRAL LIMIT THEOREM. IN YOUR OWN WORDS, EXPLAIN WHAT THE CENTRAL LIMIT THEOREM IS

THE CENTRAL LIMIT THEOREM SAYS THAT FOR ANY RANDOM VARIABLE WITH ANY PROBABILITY DISTRIBUTION (AS LONG AS ITS VARIANCE IS FINITE), WHEN YOU TAKE THE MEANS OF INDEPENDENT GROUPS OF THIS VARIABLE, THE PROBABILITY DISTRIBUTION OF THESE MEANS WILL HAVE THE FOLLOWING CHARACTERISTICS:

  1. THE DISTRIBUTION OF THE MEANS WILL BE CLOSE TO THE NORMAL DISTRIBUTION WHEN EACH GROUP IS LARGE ENOUGH (A COMMON RULE OF THUMB IS ABOUT 30 OBSERVATIONS PER GROUP). ACTUALLY, THIS HAPPENS NOT ONLY WITH THE MEAN OF THE VARIABLE, BUT ALSO WITH ANY LINEAR COMBINATION OF THE VARIABLE SUCH AS THE SUM OR WEIGHTED AVERAGE OF THE VARIABLE.

  2. THE STANDARD DEVIATION OF THE MEANS WILL BE MUCH LESS THAN THE STANDARD DEVIATION OF THE INDIVIDUALS. MORE SPECIFICALLY, THE STANDARD DEVIATION OF THE MEAN WILL SHRINK BY A FACTOR OF 1/(SQUARE ROOT OF THE SIZE OF THE GROUP (N)).

THEN, IN CONCLUSION, THE CENTRAL LIMIT THEOREM SAYS THAT, NO MATTER THE ORIGINAL PROBABILITY DISTRIBUTION OF ANY RANDOM VARIABLE, IF WE TAKE GROUPS OF THIS VARIABLE, A) THE MEANS OF THESE GROUPS WILL HAVE A PROBABILITY DISTRIBUTION CLOSE TO THE NORMAL DISTRIBUTION, AND B) THE STANDARD DEVIATION OF THE MEAN WILL SHRINK ACCORDING TO THE NUMBER OF ELEMENTS OF EACH GROUP.
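To see this conclusion with a distribution that is clearly not normal, the sketch below repeats the experiment with an exponential variable, which is strongly skewed to the right. It measures the skewness of the group means for different group sizes. The helper functions skew and group_mean_skew, the seed, and the choice of the exponential distribution are assumptions made here for illustration; they are not part of the original exercise.

```r
set.seed(2021)

# Simple skewness measure: third central moment over sd cubed
skew <- function(v) mean((v - mean(v))^3) / sd(v)^3

# Skewness of the means of n_groups groups, each with n exponential draws
group_mean_skew <- function(n, n_groups = 4000) {
  draws <- matrix(rexp(n_groups * n, rate = 1), nrow = n_groups)
  skew(rowMeans(draws))
}

# Group sizes to compare (size 1 is the original variable itself)
sizes <- c(1, 5, 25, 100)
result <- sapply(sizes, group_mean_skew)
names(result) <- sizes

print(round(result, 2))
```

For the exponential distribution the skewness of the mean of n observations is 2 divided by the square root of n, so the printed values should shrink toward zero (the skewness of a normal distribution) as the group size grows.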