In this lab exercise, you will learn how to:
Clear all objects from the workspace: rm(list=ls())
Generate random variables from different distributions.
Obtain summary statistics: summary().
Plot
We will do two in-class exercises:
The distribution of monthly stock returns (S&P 500): normal? heavy tail? symmetric?
The sampling distribution of the sample mean: a Monte Carlo simulation experiment
In this example, we will investigate the empirical distribution of the monthly stock returns using S&P 500 data ranging from January 1952 to December 2019 from Center for Research in Security Press (CRSP) month-end values. Stock returns are the continuously compounded returns on the S&P 500 index,including dividends.
When starting a new project, we first clear all objects from the workspace.
rm(list=ls())
Next, you might also find it handy to know where your working directory is set at the moment:
getwd()
And you might consider changing the path to the folder in which you have stored your data set:
setwd("<location of your dataset>")
setwd("/Users/Econometrics/Lab") # example
Import data from Google Drive.
id <- "1Fkyi2uioF8xKOaKKbzbtjkZ1JQuvKjm0"
monthly <- read.csv(sprintf("https://docs.google.com/uc?id=%s&export=download", id))
You might also download the monthly return data from here, and save it to your working directory. Then import your local CSV files into R using the following command:
monthly <- read.csv("/Users/Econometrics/Lab/monthly_1952-2019.csv")
Once the data is imported, you might want to first view the data and display the structure of the data.
View(monthly) # view data
str(monthly) # display the structure of the data
## 'data.frame': 816 obs. of 32 variables:
## $ yyyymm : int 195201 195202 195203 195204 195205 195206 195207 195208 195209 195210 ...
## $ Index : num 24.1 23.3 24.4 23.3 23.9 ...
## $ D12 : num 1.41 1.42 1.42 1.43 1.44 ...
## $ E12 : num 2.43 2.41 2.4 2.38 2.36 ...
## $ b.m : num 0.717 0.747 0.752 0.786 0.771 ...
## $ tbl : num 0.0157 0.0154 0.0159 0.0157 0.0167 0.017 0.0181 0.0183 0.0171 0.0174 ...
## $ AAA : num 0.0298 0.0293 0.0296 0.0293 0.0293 0.0294 0.0295 0.0294 0.0295 0.0301 ...
## $ BAA : num 0.0359 0.0353 0.0351 0.035 0.0349 0.035 0.035 0.0351 0.0352 0.0354 ...
## $ lty : num 0.0268 0.0269 0.0263 0.0254 0.0257 0.0259 0.0261 0.0267 0.0277 0.0269 ...
## $ ntis : num 0.0352 0.0349 0.0321 0.0332 0.0321 ...
## $ Rfree : num 0.00131 0.00128 0.00133 0.00131 0.00139 ...
## $ infl : num 0 -0.00755 0 0.0038 0 ...
## $ ltr : num 0.0028 0.0014 0.0111 0.0171 -0.0033 0.0003 -0.002 -0.007 -0.013 0.0148 ...
## $ corpr : num 0.0199 -0.0085 0.0076 -0.0004 0.0031 0.0016 0.0016 0.0063 -0.0018 0.0039 ...
## $ svar : num 0.000538 0.000932 0.000632 0.000808 0.000526 ...
## $ csp : num 0.00352 0.00364 0.00332 0.00384 0.00451 ...
## $ CRSP_SPvw : num 0.0178 -0.0252 0.0528 -0.0414 0.0337 ...
## $ CRSP_SPvwx: num 0.0163 -0.0357 0.0507 -0.0433 0.0233 ...
## $ IndexDiv : num 25.6 24.7 25.8 24.8 25.3 ...
## $ dp : num -2.84 -2.8 -2.84 -2.79 -2.81 ...
## $ ep : num -2.3 -2.27 -2.32 -2.28 -2.31 ...
## $ dy : num -2.82 -2.84 -2.8 -2.84 -2.78 ...
## $ logretdiv : num 0.0723 0.022 0.1033 0.0155 0.0815 ...
## $ logRfree : num 0.00131 0.00128 0.00132 0.00131 0.00139 ...
## $ rp_div : num 0.071 0.0207 0.1019 0.0142 0.0801 ...
## $ logret : num 0.0154 -0.0371 0.0466 -0.044 0.0229 ...
## $ rp_nodiv : num 0.0141 -0.0384 0.0453 -0.0453 0.0215 ...
## $ tms : num 0.0111 0.0115 0.0104 0.0097 0.009 0.0089 0.008 0.0084 0.0106 0.0095 ...
## $ de : num -0.541 -0.533 -0.525 -0.509 -0.494 ...
## $ bm : num 0.717 0.747 0.752 0.786 0.771 ...
## $ dfy : num 0.0061 0.006 0.0055 0.0057 0.0056 0.0056 0.0055 0.0057 0.0057 0.0053 ...
## $ dfr : num 0.0171 -0.0099 -0.0035 -0.0175 0.0064 0.0013 0.0036 0.0133 0.0112 -0.0109 ...
rp_div
Now, let’s take a look at the monthly stock returns,
rp_div
, and get its summary statistics such as mean and
median.
Select the variable rp_div
(monthly returns including
dividends) from the data set monthly
:
mrt <- monthly$rp_div # select the return variable
ts_mrt <- ts(mrt, start=c(1952,1), end=c(2019,12), frequency=12) # convert returns to a time series
plot(ts_mrt, main="The S&P 500 monthly returns: 1952.1-2019.12", ylab="monthly returns") # time series plot
mrt
Recall, the CDF of a random variable \(X\) is the function \(F_X(x) = \Pr(X \leq x)\). The
empirical CDF
of a data sample \(\{x_t\}_{t=1}^T\) is the function that
counts the fraction of observations less than or equal to \(x\): \[\hat{F}_X(x) = \frac{1}{T}(\#x_t \leq
x).\]
The R function ecdf
can be used to compute and plot the
empirical CDF of the monthly returns:
plot(ecdf(mrt), col="blue")
We also can compute the empirical CDF of monthly returns at a given point, say \(x=0\):
Fhat.0 = sum(mrt <= 0)/length(mrt)
Fhat.0
## [1] 0.1911765
Here, the expression mrt <= 0
creates a logical
vector the same length as mrt
that is equal to
TRUE
when returns are less than or equal to 0 and
FALSE
when returns are greater than zero. Then
Fhat.0
is equal to the fraction of TRUE
values
(returns less than or equal to zero) in the data, which gives the
empirical CDF evaluated at zero, i.e., \(\hat{F}_X(x) = \frac{1}{T}(\#x_t \leq
0)\).
mrt
Let’s get the summary statistics of mrt
, including min
and max values, mean, median, variance and standard deviation:
summary(mrt) # get summary statistics of mrt
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.21652 0.01123 0.03469 0.03292 0.05859 0.19232
var(mrt) # variance
## [1] 0.001796266
sd(mrt) # standard deviation
## [1] 0.04238238
The summary statistics above are based on the monthly returns of
S&P 500 over 816 months. The average return is given by the
mean and is equal to 0.03292
, or about 3.3%
.
The median return is 3.47%
, which means there are
816 / 2 = 408
months with monthly returns of less than
3.47%
and 408 months with returns of more than
3.47%
. Both the mean and the median give us an idea about
the centrality of stock returns. In other words, they inform us about
the average performance of the stock over the sample period.
Both standard deviation and variance capture how dispersed the return observations are around their mean. Specifically, a tight distribution around the mean would have low standard deviation and variance. Conversely, if there were many returns observations far from the average, then the standard deviation and variance would both be high. Note that variance and standard deviation are often used to calculate return volatility, which often refers to the amount of uncertainty or risk related to the size of changes in a security’s value. A higher volatility means that a security’s value can potentially be spread out over a larger range of values. This means that the price of the security can change dramatically over a short time period in either direction. A lower volatility means that a security’s value does not fluctuate dramatically, and tends to be more steady.
In finance, we often say that risk and return are the two sides of a coin. Therefore, metrics such as average return are helpful when evaluating the past performance of an investment but are not sufficient on their own without proper consideration of risk.
The histogram of the monthly returns is:
h <- hist(mrt, breaks = 24, col="lightblue",
main="The S&P 500 monthly returns: 1952.1 - 2019.12",
xlab="monthly returns",
freq = FALSE) # histogram of mrt
lines(density(mrt), col="blue") # add density plot (pdf) of mrt
We compare the pdf of mrt
to the pdf of a normal random
variable \(x \sim N(\mu_{mrt},
\sigma_{mrt}^2)\), which has the same mean and variance with
mrt
:
h <- hist(mrt, breaks = 24, col="lightblue",
main="The S&P 500 monthly returns: 1952.1 - 2019.12",
xlab="monthly returns",
freq = FALSE) # histogram of mrt
lines(density(mrt), col="blue") # add density plot (pdf) of mrt
x <- seq(-0.3, 0.3, length=100)
hx <- dnorm(x, mean=mean(mrt), sd=sd(mrt))
lines(x, hx, type="l", xlab="x-value", ylab="density", col="red") # add the density of a normal r.v.
legend(-0.2, 10, legend=c("Density of mrt", "Density of x"), # add legend
col=c("blue", "red"), lty=1, cex=0.8)
Compute the skewness and kurtosis of mrt
:
#install.packages("moments") # install R package moments
library(moments) # load the package
skewness(mrt) # compute skewness of mrt
## [1] -0.5564852
kurtosis(mrt) # compute kurtosis of mrt
## [1] 5.387466
The value of skewness suggests that the distribution of
mrt
is asymmetric. In fact, it is common for stock return
distributions to be asymmetric. A negative skewness suggests a long tail
to the left (left-skewed). In this case, highly negative returns would
be observed below the average return.
Kurtosis is all about the heaviness of the tails of a
distribution. The normal distribution has a kurtosis of 3. In this case,
the return distribution has a kurtosis of 5.39
, it is said
to have an excess kurtosis of 5.39 − 3 = 2.39
.
Distributions that have positive excess kurtosis are considered to have
fat tails (also known as leptokurtic distributions). This implies that
extreme observations are more common in such distributions relative to
otherwise similar normal distributions. Stock return distributions tend
to be fat-tailed. Therefore, investors should be wary of the fact that
they are likely to experience extreme return observations more often
than they might like to believe.
In this example, we will explore the sampling distribution of the sample mean. Generate data from a uniform distribution. The Monte Carlo simulation results show that mean and variance of the sample mean of a uniform random variable converge to \(\mu\) and \(\sigma^2/n\), and that the sampling distribution of the sample mean converges to a normal distribution as sample size increases. These results are consistent with LLN and CLT.
rm(list=ls()) # clear all memories
set.seed(123)
Tip: Use set.seed
function for any Monte Carlo
simulation to make sure all results are reproducible.
Suppose \(x\sim U(0,2)\). So \[E(x)=\frac{1}{2}(2-0)=1\] and \[var(x)=\frac{1}{12}(2-0)^2=0.33.\] Draw an iid sample from \(U(0,2)\) with sample size \(100\).
N <- 100 # Set sample size
x <- runif(N, min=0, max=2) # Generate N random numbers from U(0,2)
Note: you could also draw samples from a normal distribution or a
t-distribution using rnorm
or rt
. (Use
built-in Help for more details.)
The summary statistics of \(x\):
summary(x)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.001249 0.490941 0.932742 0.997118 1.510943 1.988540
mean(x)
## [1] 0.997118
sd(x)
## [1] 0.5699905
var(x)
## [1] 0.3248892
Nsim <- 1000 # the number of simulations
N <- 100 # sample size
x_bar <- rep(NA, Nsim) # create a vector for sample mean
sigma2_x <- rep(NA, Nsim) # create a vector for sample variance
for (i in 1:Nsim){ # write a loop
x.data <- runif(N, min=0, max=2) # generate a sample with size N
x_bar[i] <- mean(x.data) # record the sample mean for sample i
sigma2_x[i] <- var(x.data) # record the sample variance for sample i
}
The histogram of the sample average is:
hist(x_bar, breaks=24, freq = FALSE)
We could also add a normal distribution curve to the histogram.
## Define a function
hist.plot = function(x){
hist(x, breaks=24, freq = FALSE)
x.norm <- seq(min(x), max(x), length=100)
hx.norm <- dnorm(x.norm, mean=mean(x), sd=sd(x))
lines(x.norm, hx.norm, type="l", xlab="x-value", ylab="density", col="red") # add the density of a normal r.v.
}
## Use the defined function to draw both histogram and normal curve
hist.plot(x_bar)
Thus, the (sample) mean of the sample average, \(E(\bar{x})\), is
mean(x_bar)
## [1] 0.9986263
and the (sample) variance of the sample average, \(var(\bar{x})\), is
var(x_bar)
## [1] 0.003256107
The sample variance of the sample average decreases as sample size \(N\) increases. Try a simulation experiment with sample size \(N=10\), what is the variance of the sample average? How about the mean of the sample average?