Workshop 2 Analyzing Histograms of returns

First, clear the R environment:

rm(list=ls())
# To avoid scientific notation for numbers: 
options(scipen=999)

Download monthly prices of the Mexican market index, the IPyC (^MXX), and of the S&P500 index from the US market (^GSPC) using the getSymbols function. For both, download data from January 2000 to date.

#2.1 Data collection with getSymbols

# Load the packages:

library(quantmod)

## Loading required package: xts

## Loading required package: zoo

## 
## Attaching package: 'zoo'

## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

## Loading required package: TTR

## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo

#library(dplyr)

# Downloading the historical quotation data for both indexes:
getSymbols(c("^MXX", "^GSPC"), from="2000-01-01",  src="yahoo", periodicity="monthly")

## 'getSymbols' currently uses auto.assign=TRUE by default, but will
## use auto.assign=FALSE in 0.5-0. You will still be able to use
## 'loadSymbols' to automatically load data. getOption("getSymbols.env")
## and getOption("getSymbols.auto.assign") will still be checked for
## alternate defaults.
## 
## This message is shown once per session and may be disabled by setting 
## options("getSymbols.warning4.0"=FALSE). See ?getSymbols for details.

## [1] "^MXX"  "^GSPC"

Now the environment contains two xts datasets, MXX and GSPC, with historical data for the Mexican and the US financial market indexes.

Both datasets have columns for open, high, low, close, and adjusted prices. For indexes, close prices are always equal to adjusted prices.

#2.2 Return calculation

Before calculating returns, we can create an integrated dataset with both indexes:

prices = merge(MXX,GSPC)

With this integrated dataset we can calculate returns for both indexes.

Select only adjusted prices and rename the columns with meaningful names for the indexes:

prices = Ad(prices)
names(prices) = c("MXX","GSPC")

With adjusted prices we calculate continuously compounded returns:

r = diff(log(prices))

The continuously compounded returns (r) can be calculated as the first difference of the natural log of prices (or of the index level). The first difference is the log of the price at period t minus the log of the price at period (t-1), as we saw in the bank return example in class.
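As a small illustration of the log-difference calculation, here is a sketch on a made-up price vector. The prices and the names p, r_toy, r_ratio and total_log_return are assumptions for this example only; they are not part of the workshop data.

```r
# Hypothetical price series, for illustration only (not market data)
p <- c(100, 110, 99, 104)

# Log return: r_t = log(p_t) - log(p_{t-1})
r_toy <- diff(log(p))

# The same values written explicitly as the log of the price ratio
r_ratio <- log(p[-1] / p[-length(p)])

# Log returns add up over time: their sum is the log return
# for the whole period, log(last price / first price)
total_log_return <- sum(r_toy)
```

Both r_toy and r_ratio hold three returns for four prices, which is why the first row of the return series is NA when the same calculation is applied to the xts price data.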

The first historical returns:

head(r)

##                    MXX        GSPC
## 2000-01-01          NA          NA
## 2000-02-01  0.11232485 -0.02031300
## 2000-03-01  0.01410906  0.09232375
## 2000-04-01 -0.11811558 -0.03127991
## 2000-05-01 -0.10795263 -0.02215875
## 2000-06-01  0.15323959  0.02365163

The first value is NA since it is not possible to calculate returns for the first month. We can delete any NA values with the na.omit function:

r = na.omit(r)

Now we calculate simple returns:

R =  na.omit(prices / lag(prices,k=1) - 1)

Simple returns are calculated as the percentage change between the current price and the previous price.

lag is a function that can be used with xts datasets to get previous values of a variable. For xts objects the argument is k; k=1 means take the price from one period earlier.
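To make the link between simple and continuously compounded returns concrete, here is a minimal sketch on a made-up price vector. The vector p and all other names are assumptions for this example; the previous-period price is built by hand instead of with the xts lag function.

```r
# Hypothetical prices, for illustration only
p <- c(100, 110, 99, 104)
n <- length(p)

# Previous-period price, built by hand: NA, 100, 110, 99
p_prev <- c(NA, p[-n])

# Simple return: R_t = p_t / p_{t-1} - 1 (drop the leading NA)
R_toy <- (p / p_prev - 1)[-1]

# Continuously compounded return for the same periods
r_toy <- diff(log(p))

# The two definitions are linked by R = exp(r) - 1 and r = log(1 + R)
R_from_r <- exp(r_toy) - 1
r_from_R <- log(1 + R_toy)
```

For small monthly changes the two measures are close, which is why the head(r) and head(R) tables show similar but not identical numbers.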

We can inspect the simple returns by printing the oldest and the most recent values:

head(R)

##                    MXX        GSPC
## 2000-02-01  0.11887627 -0.02010808
## 2000-03-01  0.01420906  0.09671983
## 2000-04-01 -0.11140666 -0.03079576
## 2000-05-01 -0.10232989 -0.02191505
## 2000-06-01  0.16560422  0.02393355
## 2000-07-01 -0.06247834 -0.01634128

tail(R)

##                    MXX        GSPC
## 2020-10-01 -0.01256937 -0.02766579
## 2020-11-01  0.12952931  0.10754564
## 2020-12-01  0.05476471  0.03712146
## 2021-01-01 -0.02453426 -0.01113666
## 2021-02-01  0.03738868  0.02609145
## 2021-03-01  0.04633628  0.03468512

In this case there are no NA values, but it is still a good idea to apply the na.omit function in case there is an NA value for any month of either index:

R <- na.omit(R)

#2.3 Q Histograms

Do a histogram of the simple return of the Mexican index:

hist(R$MXX, main="Histogram of IPC monthly returns", 
     xlab="Simple returns", col="dark blue")

INTERPRET this histogram with your words.

# THIS HISTOGRAM SHOWS THAT THE IPC MONTHLY RETURNS ARE ROUGHLY BELL-SHAPED, CLOSE TO A NORMAL DISTRIBUTION. MOST MONTHLY SIMPLE RETURNS FALL BETWEEN -0.1 AND 0.1 (-10% TO 10%).

Do a histogram of the simple return of the S&P500 index.

hist(R$GSPC, main="Histogram of S&P500 monthly returns", 
     xlab="Simple returns", col="orange")

INTERPRET this histogram with your words.

# ANSWER TO QUESTION 2
# THE S&P500 MONTHLY RETURNS ALSO APPEAR TO FOLLOW AN APPROXIMATELY NORMAL DISTRIBUTION, BUT HERE RETURNS VARY A LITTLE LESS THAN FOR THE PREVIOUS INDEX. THE AVERAGE RETURN IS ABOVE 0, MEANING THAT ON AVERAGE INVESTORS DO GET POSITIVE RETURNS.

Here is a graph that shows both histograms together. To better appreciate the histograms, I set the width of each bar to 2 percentage points.

library(ggplot2)

R_GSPC <- as.data.frame(R$GSPC)

R_MXX <- as.data.frame(R$MXX)

names(R_GSPC)<-names(R_MXX)<-c("returns")

R_GSPC$INDEX <- "GSPC_ret"

R_MXX$INDEX <- "MXX_ret"

# Stack both data frames with rbind

Lenghts <- rbind(R_GSPC, R_MXX)

#ggplot(Lenghts, aes(returns, fill = INDEX)) + geom_histogram(alpha = 0.5, binwidth = 0.02)

ggplot(Lenghts, aes(returns)) +
  geom_histogram(data=subset(Lenghts,INDEX=="GSPC_ret"), aes(fill=INDEX), alpha=0.5, binwidth=0.02) +
  geom_histogram(data=subset(Lenghts,INDEX=="MXX_ret"), aes(fill=INDEX), alpha=0.5, binwidth=0.02) +
  scale_fill_manual(name="INDEX", values=c("blue","yellow"), labels=c("S&P500","IPyC"))

The S&P500 returns are shown in blue and the IPyC returns in yellow; the gray-yellow area is where both histograms overlap.

LOOK CAREFULLY AT THIS PLOT WITH BOTH HISTOGRAMS. WHICH INSTRUMENT IS RISKIER? EXPLAIN

# BY INTERPRETING THE HISTOGRAMS I WOULD SAY THAT THE RISKIER INDEX IS THE IPC: ITS RETURNS ARE MORE SPREAD OUT, WITH MORE OBSERVATIONS TOWARDS THE NEGATIVE RETURNS, WHILE THE S&P500 RETURNS ARE MORE CONCENTRATED AROUND THEIR AVERAGE.

#2.5 Q Calculate the mean and standard deviation

mean_MXX_R<-mean(R$MXX,na.rm =TRUE)
cat("Mean_MXX=",mean_MXX_R)

## Mean_MXX= 0.009112593

sd_MXX_R<-sd(R$MXX,na.rm = TRUE)
cat("Standard Deviation_MXX=",sd_MXX_R)

## Standard Deviation_MXX= 0.05235384

# Mean of the S&P500 simple returns
mean_SP500_R<-mean(R$GSPC,na.rm = TRUE)
cat("Mean_SP500=",mean_SP500_R)

# Standard deviation of the S&P500 simple returns (computed from R)
sd_SP500_R<-sd(R$GSPC,na.rm = TRUE)
cat("Standard Deviation_SP500=",sd_SP500_R)

Just by looking at the mean and standard deviation of each instrument's returns, WHICH INSTRUMENT LOOKS MORE ATTRACTIVE TO INVEST IN? EXPLAIN

# THE SD IS SMALLER FOR THE S&P500, SO ITS RETURNS FALL CLOSER TO THE MEAN MORE OFTEN. FOR A SIMILAR MEAN RETURN, THIS MAKES THE S&P500 THE MORE ATTRACTIVE INSTRUMENT TO INVEST IN; IF ONE INDEX HAS THE HIGHER MEAN, THAT EXTRA RETURN HAS TO BE WEIGHED AGAINST ITS EXTRA RISK.

#2.6 Q (OPTIONAL) Calculating the holding-period return

If you had invested 1 peso in the Mexican index in Jan 2000, WHAT WOULD BE THE VALUE OF YOUR INVESTMENT TODAY?

If you had invested 1 USD in the S&P500 index in Jan 2000, WHAT WOULD BE THE VALUE OF YOUR INVESTMENT TODAY?
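One possible way to approach the holding-period question is to compound the monthly returns. The sketch below uses a made-up vector of three simple returns; R_toy and the other names are assumptions for this example, not results from the workshop data.

```r
# Hypothetical monthly simple returns, for illustration only
R_toy <- c(0.10, -0.05, 0.02)

# Value of 1 unit of currency after compounding every simple return:
# 1.10 * 0.95 * 1.02 = 1.0659
value_simple <- prod(1 + R_toy)

# The same value from continuously compounded returns:
# log returns add up, so exponentiate their sum
r_toy <- log(1 + R_toy)
value_log <- exp(sum(r_toy))

# Holding-period return for the whole period
hpr <- value_simple - 1

# With the workshop objects, the analogous call would be something like
# prod(1 + as.numeric(R$MXX)), assuming R holds the monthly simple returns
```

Both routes give the same final value, so either the simple returns R or the log returns r can be used to answer the question.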

3 Q The Central Limit Theorem

The Central Limit Theorem is one of the most important results in probability and statistics. Much of the statistical inference developed at the beginning of the 20th century rests on it.

#3.1 Q Monte Carlo simulation to create variables

Create the x variable as a random variable with normal distribution, with mean=20 and standard deviation=40. Create 100,000 observations:

x <- rnorm(n=100000, mean = 20, sd = 40)

Create the variable y as a random variable with uniform distribution in the range [0,60]:

y <- runif(n=100000, min = 0, max = 60)

Learn about the variance of the uniform distribution. HOW CAN YOU ESTIMATE THE VARIANCE OF A UNIFORMLY DISTRIBUTED VARIABLE?

# I LOOKED UP THE FORMULA, AND FOR THIS VARIABLE IT IS PRETTY SIMPLE: TAKE THE DIFFERENCE BETWEEN THE MAX AND MIN VALUES, SQUARE IT, AND DIVIDE BY TWELVE: (MAX - MIN)^2 / 12.
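The uniform variance formula can be checked numerically. This is a minimal sketch: the seed and the names a, b, var_theory, sd_theory, y_sim and var_sim are assumptions introduced for the example.

```r
# Limits of the uniform distribution used in the workshop
a <- 0
b <- 60

# Theoretical variance of a uniform variable on [a, b]: (b - a)^2 / 12
var_theory <- (b - a)^2 / 12      # 3600 / 12 = 300
sd_theory <- sqrt(var_theory)

# Simulated check; the seed is arbitrary and only makes the draw repeatable
set.seed(123)
y_sim <- runif(100000, min = a, max = b)
var_sim <- var(y_sim)
```

With 100,000 draws, var_sim should land within a few units of 300.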

#3.2 Q Histograms of x and y

hist(x, main="Histogram of x", 
     xlab="x values", col="dark blue")

hist(y, main="Histogram of y", 
     xlab="y values", col="green")

WHAT DO YOU SEE? BRIEFLY EXPLAIN

# THE HISTOGRAM OF X HAS A VERY CLEAR NORMAL (BELL) SHAPE, WHILE THE HISTOGRAM OF Y IS FLAT, AS EXPECTED FOR A UNIFORM DISTRIBUTION.

#3.3 Calculating standard deviation and variance

Calculate the mean of x and y and save them as xbar and ybar:

xbar= mean(x)
ybar= mean(y)

Now we will manually calculate the variance of x and y. Remember that the VARIANCE of a variable is the AVERAGE OF ITS SQUARED DEVIATIONS.

Calculate first the squared deviations of x and y:

xdesv2=(x-xbar)^2
ydesv2=(y-ybar)^2

Now we just calculate the mean of the squared deviations to get the variance:

varx=mean(xdesv2)
varx

## [1] 1599.981

vary=mean(ydesv2)
vary

## [1] 299.9409

Compare the variances you computed with the theoretical variance of each variable, implied by the parameters used to generate x and y: 40^2 = 1600 for x and (60 - 0)^2 / 12 = 300 for y. The computed variances varx and vary are very close to these theoretical values.
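One detail worth keeping in mind: the manual calculation above divides by n, while R's built-in var function divides by n - 1. The sketch below shows how the two relate; the seed, the smaller sample size and the names x_sim, var_pop and var_samp are assumptions for this example.

```r
# Small simulated sample; seed and size are arbitrary choices for the example
set.seed(1)
x_sim <- rnorm(1000, mean = 20, sd = 40)
n <- length(x_sim)

# Average of the squared deviations (divides by n), as in the manual calculation
var_pop <- mean((x_sim - mean(x_sim))^2)

# Built-in sample variance (divides by n - 1)
var_samp <- var(x_sim)

# The two differ only by the factor (n - 1) / n
same_after_rescaling <- isTRUE(all.equal(var_pop, var_samp * (n - 1) / n))
```

With 100,000 observations the factor (n - 1) / n is so close to 1 that both versions give practically the same number.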

#3.4 Calculating mean of groups for x and y

Create a data frame with x and y as columns:

dataset <- cbind(x,y)
dataset <- as.data.frame(dataset)

Now assign a group number to each observation. We will create 4,000 groups of 25 observations each. Each group will be labeled from 1 to 4,000. You can use the function rep and seq:

# I create a column called group where each group will have 25
#   observations:

dataset$group <- rep(1:4000, each=25)

With the group_by() and summarise() functions, compute the sample mean of x and of y for EACH group.

#Now I group the observations by the group number previously assigned to them:
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:xts':
## 
##     first, last

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

group_means <- dataset %>%
   group_by(group) %>%
   summarise(x_mean = mean(x),
             y_mean = mean(y))

Now do a histogram of the mean of x and another of the mean of y:

hist(group_means$x_mean, main="Histogram of mean of X", 
     xlab="Mean of X ", col="green")

hist(group_means$y_mean, main="Histogram of mean of Y", 
     xlab="Mean of Y", col="dark blue")

LOOKING AT THE HISTOGRAM OF THE SAMPLE MEAN OF Y (y_mean), HOW DIFFERENT IS IT FROM THE ORIGINAL HISTOGRAM OF Y (y)? EXPLAIN

# THE HISTOGRAM OF Y FOLLOWED A UNIFORM DISTRIBUTION, WHILE THE HISTOGRAM OF Y_MEAN HAS A VERY CLEAR NORMAL (BELL) SHAPE, CENTERED NEAR 30 AND MUCH NARROWER.

CALCULATE THE MEAN AND STANDARD DEVIATION OF BOTH SAMPLE MEANS (COLUMNS x_mean AND y_mean). HINT: YOU CAN USE THE mean and sd functions

#Arithmetic mean x_mean
answermean_x <-mean(group_means$x_mean,na.rm = TRUE)
cat("ANSWER X MEAN =", answermean_x)

## ANSWER X MEAN = 19.97274

#standard deviation x_mean (theory: 40/sqrt(25) = 8)
answersd_x <-sd(group_means$x_mean, na.rm = TRUE)
cat("ANSWER X SD =", answersd_x)

#Arithmetic mean y_mean
answermean_y <-mean(group_means$y_mean,na.rm = TRUE)
cat("ANSWER Y MEAN =", answermean_y)

## ANSWER Y MEAN = 29.97823

#standard deviation y_mean (theory: sqrt(300/25), about 3.46)
answersd_y <-sd(group_means$y_mean, na.rm = TRUE)
cat("ANSWER Y SD =", answersd_y)

IS THE VARIANCE OF THE RANDOM SAMPLE MEANS EQUAL TO THE VARIANCE OF THE ORIGINAL RANDOM VARIABLES? BRIEFLY EXPLAIN

# NO, THEY ARE NOT EQUAL. THE VARIANCE OF THE SAMPLE MEANS IS MUCH SMALLER: AVERAGING 25 OBSERVATIONS PER GROUP DIVIDES THE VARIANCE BY 25 (AND THE STANDARD DEVIATION BY 5), BECAUSE HIGH AND LOW VALUES WITHIN A GROUP PARTLY CANCEL OUT.

DO SOME RESEARCH ABOUT THE CENTRAL LIMIT THEOREM. IN YOUR OWN WORDS, EXPLAIN WHAT THE CENTRAL LIMIT THEOREM IS

# THE CENTRAL LIMIT THEOREM STATES THAT IF YOU TAKE SUFFICIENTLY LARGE RANDOM SAMPLES FROM A POPULATION THAT HAS A MEAN μ AND A STANDARD DEVIATION σ, THE DISTRIBUTION OF THE SAMPLE MEANS TENDS TO A NORMAL DISTRIBUTION WITH MEAN μ AND STANDARD DEVIATION σ/√n, WHATEVER THE SHAPE OF THE ORIGINAL DISTRIBUTION. ESSENTIALLY, AS THE SAMPLE SIZE n INCREASES, THE DISTRIBUTION OF THE SAMPLE MEAN LOOKS MORE AND MORE LIKE A NORMAL DISTRIBUTION.
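The σ/√n scaling can be checked with a short simulation. This is a sketch under stated assumptions: the seed and the names n_groups, group_size, y_sim, y_means, sd_theory and sd_observed are introduced for the example, and the groups are formed with a matrix in base R instead of dplyr.

```r
# Same design as the workshop: 4,000 groups of 25 uniform draws on [0, 60]
set.seed(42)                      # arbitrary seed, only for repeatability
n_groups <- 4000
group_size <- 25
y_sim <- runif(n_groups * group_size, min = 0, max = 60)

# One row per group of 25 consecutive draws, then the mean of each row
y_means <- rowMeans(matrix(y_sim, nrow = n_groups, ncol = group_size,
                           byrow = TRUE))

# Standard deviation of the sample mean predicted by theory:
# sigma / sqrt(n) = sqrt(300) / sqrt(25), about 3.46
sd_theory <- sqrt(60^2 / 12) / sqrt(group_size)

# Standard deviation actually observed across the 4,000 group means
sd_observed <- sd(y_means)
```

Calling hist(y_means) on this vector gives a bell-shaped histogram centered near 30, even though each individual draw is uniform.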

Workshop 2 Econometrics I

Andrea Contel
