#Q Analyzing Histograms of returns
rm(list=ls())
options(scipen=999)
Using the getSymbols function, download MONTHLY prices of the Mexican market index, the IPC (Índice de Precios y Cotizaciones, ^MXX), and of the S&P 500 index from the US market (^GSPC). For both indices, download data from January 2000 to date. Do the following:
#Data collection with getSymbols
library(quantmod)
getSymbols(c("^MXX", "^GSPC"), from="2000-01-01", src="yahoo", periodicity="monthly")
[1] "^MXX" "^GSPC"
#Return calculation
prices = merge(MXX,GSPC)
prices = Ad(prices)
names(prices) = c("MXX","GSPC")
r = diff(log(prices))
head(r)
MXX GSPC
2000-01-01 NA NA
2000-02-01 0.11232485 -0.02031300
2000-03-01 0.01410906 0.09232375
2000-04-01 -0.11811558 -0.03127991
2000-05-01 -0.10795263 -0.02215875
2000-06-01 0.15323959 0.02365163
r = na.omit(r)
# lag() on an xts object takes the number of periods as k, not n
R = na.omit(prices / lag(prices, k = 1) - 1)
Oldest Returns
head(R)
MXX GSPC
2000-02-01 0.11887627 -0.02010808
2000-03-01 0.01420906 0.09671983
2000-04-01 -0.11140666 -0.03079576
2000-05-01 -0.10232989 -0.02191505
2000-06-01 0.16560422 0.02393355
2000-07-01 -0.06247834 -0.01634128
Most Recent Returns
tail(R)
MXX GSPC
2021-03-01 0.05950165 0.042438633
2021-04-01 0.01615910 0.052425321
2021-05-01 0.05990934 0.005486489
2021-06-01 -0.01171638 0.022214010
2021-07-01 0.01150474 0.022748055
2021-08-01 0.02332140 0.019172933
R <- na.omit(R)
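The log returns (r) and simple returns (R) computed above are linked by R = exp(r) - 1. A minimal sketch of that identity follows; `toy_prices`, `toy_log` and `toy_simple` are hypothetical names with made-up values, not the downloaded data, so the example runs without a network connection.

```r
# Hypothetical month-end prices, for illustration only
toy_prices <- c(100, 112, 113.5, 101, 90.5)

# Continuously compounded (log) returns
toy_log <- diff(log(toy_prices))

# Simple returns: P_t / P_{t-1} - 1
toy_simple <- toy_prices[-1] / toy_prices[-length(toy_prices)] - 1

# The two definitions agree through R = exp(r) - 1
all.equal(toy_simple, exp(toy_log) - 1)
```

For small returns the two measures are close, which is consistent with the first rows of head(r) and head(R) above.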
#Q Histograms
Do a histogram of the monthly simple returns of the IPC.
hist(R$MXX, main="Histogram of IPC monthly returns",
xlab="Simple returns", col="dark blue")

INTERPRET this histogram with your words.
THE HISTOGRAM SHOWS THE DISTRIBUTION OF THE MONTHLY SIMPLE RETURNS OF THE IPC (ÍNDICE DE PRECIOS Y COTIZACIONES), THE MEXICAN STOCK MARKET INDEX, NOT TO BE CONFUSED WITH A CONSUMER PRICE INDEX. EACH BAR COUNTS THE NUMBER OF MONTHS IN WHICH THE RETURN FELL WITHIN THAT BRACKET.
IN THIS CASE, THE HISTOGRAM IS UNIMODAL AND ROUGHLY SYMMETRIC, WITH A SLIGHT SKEW TO THE RIGHT. THE RETURNS CONCENTRATE IN THE 0.00 - 0.05 BRACKET, WHICH SHOWS A FREQUENCY OF OVER 100, AND THEN DROP ABRUPTLY PAST THE 0.1 MARK, WHERE THE FREQUENCY IS LOWER THAN 10.
Do a histogram of the simple return of the S&P500 index.
hist(R$GSPC, main="Histogram of S&P500 monthly returns",
xlab="Simple returns", col="blue")

INTERPRET this histogram with your words.
THE HISTOGRAM SHOWS THE DISTRIBUTION OF THE MONTHLY SIMPLE RETURNS OF THE STANDARD AND POOR’S 500, A STOCK MARKET INDEX TRACKING THE PERFORMANCE OF 500 LARGE COMPANIES LISTED ON THE STOCK EXCHANGES IN THE UNITED STATES.
EACH OBSERVATION IS ONE MONTHLY RETURN OF THE INDEX AS A WHOLE, NOT THE RETURN OF AN INDIVIDUAL COMPANY, SO THE FREQUENCY COUNTS THE MONTHS THAT FALL IN EACH RETURN BRACKET. IN THIS CASE, OVER 120 MONTHS CAN BE FOUND IN THE 0.00 - 0.05 SIMPLE RETURN BRACKET. THE HISTOGRAM IS ONCE AGAIN UNIMODAL AND ROUGHLY SYMMETRIC, WITH A SLIGHT SKEW TO THE RIGHT, AND FEWER THAN 5 MONTHS SIT IN THE 0.1 - 0.15 BRACKET.
#Q Appreciating risk by looking at histograms
LOOK CAREFULLY AT THIS PLOT WITH BOTH HISTOGRAMS. WHICH INSTRUMENT IS RISKIER? EXPLAIN
SIMPLY BY LOOKING AT THE HISTOGRAMS OF BOTH THE IPC AND THE SP500, ONE CAN TELL THAT THE IPC HAS THE HIGHER RISK. WHY? THE IPC RETURNS ARE MORE SPREAD OUT: THE YELLOW BARS PAST THE 0.1 BRACKET SHOW THAT THE IPC REACHES HIGHER SIMPLE RETURNS THAN THE SP500, ALTHOUGH WITH A MUCH LOWER FREQUENCY. NOW THAT BRINGS US TO TALK ABOUT RISK. THE RETURNS MAY BE HIGHER, BUT THE VOLATILITY MUST BE HIGHER TOO, AND THAT IS VISIBLE IN THE FREQUENCIES: THE SP500 RETURNS PILE UP IN THE 0.00 - 0.05 BRACKET, WHEREAS THE IPC RETURNS ARE DISTRIBUTED MORE EVENLY ACROSS THE BRACKETS.
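The combined plot discussed above is not produced by any code in this document. One possible way to overlay two histograms in base R is sketched below. The two series are simulated stand-ins, and all names (`sim_narrow`, `sim_wide`, `brk`) and parameter values are assumptions for illustration, not the downloaded returns.

```r
set.seed(1)

# Simulated stand-ins for two return series (NOT the real IPC / S&P 500 data);
# the means and standard deviations are assumed values
sim_narrow <- rnorm(1000, mean = 0.005, sd = 0.043)
sim_wide   <- rnorm(1000, mean = 0.009, sd = 0.052)

# Shared breaks so both histograms use identical brackets
brk <- pretty(range(c(sim_narrow, sim_wide)), n = 20)

# Semi-transparent colours keep both shapes visible
col_narrow <- rgb(0, 0, 1, alpha = 0.5)
col_wide   <- rgb(1, 1, 0, alpha = 0.5)

hist(sim_narrow, breaks = brk, col = col_narrow,
     main = "Overlaid histograms (simulated data)", xlab = "Simple returns")
hist(sim_wide, breaks = brk, col = col_wide, add = TRUE)
legend("topright", legend = c("narrower series", "wider series"),
       fill = c(col_narrow, col_wide))

# A numeric summary of the spread that the plot shows visually
spread <- c(narrow = sd(sim_narrow), wide = sd(sim_wide))
spread
```

The series with the larger standard deviation shows lower bars near the center and more mass in the outer brackets, which is the visual cue for higher risk.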
#Q Calculate the mean and standard deviation
mean(R$MXX)
[1] 0.009370433
mean(R$GSPC)
[1] 0.005462128
sd(R$MXX)
[1] 0.05200855
sd(R$GSPC)
[1] 0.04328685
Just by looking at the mean and standard deviation of each instrument's returns, WHICH INSTRUMENT LOOKS MORE ATTRACTIVE TO INVEST IN? EXPLAIN
I WOULD GO FOR THE SP500 INDEX, SINCE IT HAS THE LOWER STANDARD DEVIATION AND THUS, I WOULD INFER, THE LOWER RISK. HAVING SAID THAT, IF I WERE TO INVEST AGGRESSIVELY AND DIDN’T CARE AS MUCH ABOUT RISK, I’D GO FOR THE IPC, SINCE BOTH ITS MEAN AND ITS SD ARE HIGHER.
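One simple way to put mean and standard deviation on a common footing is the mean return per unit of standard deviation. The sketch below uses the values printed by mean() and sd() above. The variable names are hypothetical, and no risk-free rate is subtracted, so this is a rough indicator, not a full Sharpe ratio.

```r
# Values copied from the mean() and sd() output above
mean_mxx  <- 0.009370433
sd_mxx    <- 0.05200855
mean_gspc <- 0.005462128
sd_gspc   <- 0.04328685

# Mean monthly return earned per unit of standard deviation
ratio_mxx  <- mean_mxx / sd_mxx
ratio_gspc <- mean_gspc / sd_gspc

round(c(MXX = ratio_mxx, GSPC = ratio_gspc), 3)
```

On these figures the IPC earns roughly 0.18 of mean return per unit of standard deviation against roughly 0.13 for the S&P 500, so the choice depends on whether one looks at risk alone or at return relative to risk.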
#Q The Central Limit Theorem
The Central Limit Theorem is one of the most important discoveries in mathematics and statistics. Actually, thanks to this discovery, the field of Statistics was developed at the beginning of the 20th century. We will do an exercise using simulated numbers. I hope that you understand what the Central Limit Theorem is about. Let’s do the following.
#Q Monte Carlo simulation to create variables
x <- rnorm(n=100000, mean = 20, sd = 40)
y <- runif(n=100000, min = 0, max = 60)
HOW CAN YOU ESTIMATE THE VARIANCE OF A UNIFORM DISTRIBUTED VARIABLE?
TAKING IT STEP BY STEP, VARIANCE IS THE EXPECTATION OF THE SQUARED DEVIATION OF A RANDOM VARIABLE FROM ITS MEAN. IN STATISTICS, A UNIFORM DISTRIBUTION IS A PROBABILITY DISTRIBUTION IN WHICH ALL OUTCOMES BETWEEN A MINIMUM a AND A MAXIMUM b ARE EQUALLY LIKELY. SUCH A VARIABLE CAN REPRESENT SOMETHING FOR WHICH ONLY THE POSSIBLE RANGE IS KNOWN, FOR INSTANCE, THE DEMAND FOR A PRODUCT THAT HASN’T ENTERED THE MARKET.
FIRST I WOULD IDENTIFY THE LIMITS a AND b BETWEEN WHICH THE VARIABLE IS UNIFORMLY DISTRIBUTED. IN A GRAPH WITH f(x) ON THE VERTICAL AXIS, THE DENSITY IS FLAT BETWEEN a AND b. THEN I WOULD CALCULATE THE VARIANCE USING THE FORMULA, WHERE m_1 IS THE MEAN AND m_2 IS THE MEAN OF THE SQUARES: $m_2 - m_1^2 = \frac{(b - a)^2}{12}$. FOR y, WITH a = 0 AND b = 60, THIS GIVES $\frac{(60 - 0)^2}{12} = 300$.
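The uniform variance formula can be checked against a simulation. The sketch below is a minimal check with hypothetical names (`unif_a`, `unif_b`, `unif_draws`) and its own seed, so it does not reproduce the exact draws used elsewhere in this document.

```r
# Limits of the uniform distribution used for y
unif_a <- 0
unif_b <- 60

# Theoretical variance of Uniform(a, b): (b - a)^2 / 12
theo_var_unif <- (unif_b - unif_a)^2 / 12
theo_var_unif

# Simulated counterpart; the seed is an arbitrary choice for reproducibility
set.seed(123)
unif_draws <- runif(100000, min = unif_a, max = unif_b)
var(unif_draws)
```

With 100,000 draws the simulated variance should land close to the theoretical value of 300.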
#Q Histograms of x and y
hist(x, main="Histogram of x",
xlab="x values", col="dark blue")

hist(y, main="Histogram of y",
xlab="y values", col="green")

WHAT DO YOU SEE? BRIEFLY EXPLAIN
THE HISTOGRAM OF X IS UNIMODAL AND SYMMETRIC, CENTERED ON ITS MEAN OF 20, AND ITS SPREAD REFLECTS THE STANDARD DEVIATION OF 40. THESE TWO PARAMETERS GIVE THE HISTOGRAM ITS BELL-SHAPED APPEARANCE.
ON THE OTHER HAND, THE HISTOGRAM OF Y IS ALMOST COMPLETELY FLAT, GIVING IT A BLOCKY APPEARANCE. THIS IS BECAUSE Y WAS GENERATED AS A RANDOM VARIABLE WITH A UNIFORM DISTRIBUTION BETWEEN 0 AND 60, SO EVERY BRACKET IS ABOUT EQUALLY LIKELY.
#Calculating standard deviation and variance
xbar= mean(x)
ybar= mean(y)
Now we will manually calculate the variance of x and y. Remember that the VARIANCE of a variable is the AVERAGE OF ITS SQUARED DEVIATIONS.
xdesv2=(x-xbar)^2
ydesv2=(y-ybar)^2
Now we just calculate the mean of the squared deviations to get the variance:
varx=mean(xdesv2)
varx
[1] 1597.627
vary=mean(ydesv2)
vary
[1] 300.1732
THE COMPUTED VARIANCES varx AND vary ARE VERY CLOSE TO THE THEORETICAL VALUES OF 40^2 = 1600 FOR X AND (60 - 0)^2/12 = 300 FOR Y.
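The manual calculation above divides by n, whereas R's built-in var() divides by n - 1. A short sketch of how the two relate follows; the sample `v` and the other names are hypothetical and unrelated to the x and y simulated above.

```r
set.seed(42)

# Hypothetical sample, for illustration only
v   <- rnorm(1000, mean = 20, sd = 40)
n_v <- length(v)

# Average of squared deviations (divisor n), as in the manual calculation
pop_var <- mean((v - mean(v))^2)

# Built-in sample variance (divisor n - 1)
smp_var <- var(v)

# The two differ only by the factor (n - 1) / n
all.equal(pop_var, smp_var * (n_v - 1) / n_v)
```

With 100,000 observations the factor (n - 1) / n is so close to 1 that either version gives practically the same number.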
#Calculating mean of groups for x and y
Create a data frame with x and y as columns
dataset <- cbind(x,y)
dataset <- as.data.frame(dataset)
Now assign a group number to each observation. We will create 4,000 groups of 25 observations each. Each group will be labeled from 1 to 4,000.
dataset$group <- rep(1:4000, each=25)
With the group_by() and summarize() functions, compute the sample mean of x and of y for EACH group.
library(dplyr)
group_means <- dataset %>%
  group_by(group) %>%
  summarise(x_mean = mean(x),
            y_mean = mean(y))
hist(group_means$x_mean, main="Histogram of mean of X",
xlab="Mean of X ", col="dark blue")

hist(group_means$y_mean, main="Histogram of mean of Y",
xlab="Mean of Y", col="GREEN")
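The group means computed with group_by() above can also be obtained in base R, which gives a quick cross-check. The sketch below uses a small hypothetical data set (`demo_vals`, 4 groups of 25) instead of the full 100,000 observations.

```r
set.seed(7)

# Hypothetical data: 100 uniform draws split into 4 consecutive groups of 25
demo_vals  <- runif(100, min = 0, max = 60)
demo_group <- rep(1:4, each = 25)

# Group means with tapply()
means_tapply <- as.numeric(tapply(demo_vals, demo_group, mean))

# Same result with a matrix: values fill column by column,
# so each column holds one group of 25 consecutive observations
means_matrix <- colMeans(matrix(demo_vals, nrow = 25))

all.equal(means_tapply, means_matrix)
```

The matrix version only works because the groups are consecutive blocks of equal size, which matches how the group labels are assigned in this exercise.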

LOOKING AT THE HISTOGRAM OF THE SAMPLE MEAN OF Y (y_mean), HOW DIFFERENT IS IT FROM THE ORIGINAL HISTOGRAM OF Y (y)? EXPLAIN
THERE IS A HUGE DIFFERENCE BETWEEN THE TWO HISTOGRAMS. THE FIRST HISTOGRAM IS UNIFORM AND BLOCKY, WHEREAS THE SECOND HISTOGRAM IS UNIMODAL AND SYMMETRICAL. THIS SHOULDN’T COME AS A SHOCK, SINCE EACH GROUP MEAN AVERAGES 25 OBSERVATIONS AND THEREFORE TENDS TO FALL CLOSE TO THE MEAN OF Y, WHICH SITS AT THE 30 MARK.
CALCULATE THE MEAN AND STANDARD DEVIATION OF BOTH SAMPLE MEANS (COLUMNS x_mean AND y_mean). HINT: YOU CAN USE THE mean AND sd FUNCTIONS
sd(group_means$x_mean)
[1] 8.06485
mean(group_means$x_mean)
[1] 20.09545
sd(group_means$y_mean)
[1] 3.414815
mean(group_means$y_mean)
[1] 30.05961
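IS THE VARIANCE OF THE RANDOM SAMPLE MEANS EQUAL TO THE VARIANCE OF THE ORIGINAL RANDOM VARIABLES? BRIEFLY EXPLAIN
NO. THE SAMPLE MEANS VARY MUCH LESS THAN THE ORIGINAL VARIABLES, BECAUSE AVERAGING 25 OBSERVATIONS CANCELS OUT MUCH OF THE INDIVIDUAL VARIATION. THE STANDARD DEVIATION OF x_mean IS 8.06 AGAINST 40 FOR x, AND THAT OF y_mean IS 3.41 AGAINST ABOUT 17.3 FOR y. IN BOTH CASES THAT IS ROUGHLY ONE FIFTH, THAT IS, THE ORIGINAL STANDARD DEVIATION DIVIDED BY THE SQUARE ROOT OF THE GROUP SIZE. THIS IS ALSO WHY THE HISTOGRAMS OF THE MEANS LOOK SO MUCH NARROWER.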
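The standard deviations of x_mean and y_mean reported above can be compared with the theoretical standard error, sigma / sqrt(n). The sketch below works through the arithmetic with hypothetical variable names; the parameters (sd of 40 for x, limits 0 and 60 for y, groups of 25) are the ones used in the simulation.

```r
# Number of observations averaged in each group
group_size <- 25

# x was simulated with a standard deviation of 40
se_x_theory <- 40 / sqrt(group_size)

# y is Uniform(0, 60), so its standard deviation is sqrt((60 - 0)^2 / 12)
se_y_theory <- sqrt((60 - 0)^2 / 12) / sqrt(group_size)

round(c(x = se_x_theory, y = se_y_theory), 3)
```

The theoretical values of 8 and about 3.46 sit close to the simulated 8.06 and 3.41.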
DO SOME RESEARCH ABOUT THE CENTRAL LIMIT THEOREM. IN YOUR OWN WORDS, EXPLAIN WHAT THE CENTRAL LIMIT THEOREM IS
THE CENTRAL LIMIT THEOREM MAKES IT POSSIBLE TO MEASURE HOW MUCH THE MEANS OF VARIOUS SAMPLES WILL VARY WITHOUT HAVING TO TAKE OTHER SAMPLES TO USE AS A COMPARISON. IT STATES THAT, EVEN FOR NON-NORMAL DATA, THE DISTRIBUTION OF THE SAMPLE MEANS IS APPROXIMATELY NORMAL NO MATTER WHAT THE ORIGINAL DATA LOOK LIKE, AS LONG AS THE SAMPLE SIZE IS LARGE ENOUGH AND THE ORIGINAL VARIABLE HAS A FINITE VARIANCE.
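The theorem can also be illustrated with a clearly non-normal variable. The sketch below is one possible demonstration and not part of the original exercise: it draws from an exponential distribution, which is strongly right-skewed, and compares the skewness of the raw draws with that of means of 25 observations. The helper `skew()` and all other names are hypothetical.

```r
set.seed(2021)

# Simple moment-based skewness (hypothetical helper; 0 for a symmetric shape)
skew <- function(v) mean((v - mean(v))^3) / sd(v)^3

# Strongly right-skewed raw data: exponential with mean 1 and sd 1
raw_draws <- rexp(100000, rate = 1)

# 4,000 sample means, each over 25 consecutive draws
mean_of_25 <- colMeans(matrix(raw_draws, nrow = 25))

# Skewness shrinks once the draws are averaged
round(c(raw = skew(raw_draws), means = skew(mean_of_25)), 2)

# Spread of the means, to compare with sigma / sqrt(n) = 1 / 5
sd(mean_of_25)
```

The means are far less skewed than the raw draws and their spread is close to 0.2, in line with what the theorem predicts for groups of 25.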
#Datacamp exercises CHAPTER 3 CENTRAL LIMIT THEOREM
household_income <- rnorm(n=200)
hist(household_income, main="Histogram of Household Income",
xlab="Income", col="blue")

NOTE - THE SUBCHAPTER "MORE MEANS" WASN’T AVAILABLE IN THE FREE VERSION OF THE WEBSITE AND REQUIRED A PAID SUBSCRIPTION. I HOPE THIS DOESN’T IMPACT MY GRADE.
THANK YOU!
