L2 (Lab) - Review of Probability

In this lab exercise, you will learn how to:

Clear all objects from the workspace: rm(list=ls())
Generate random variables from different distributions.
- Use the set.seed() function to ensure all results are reproducible.
- Normal distribution: rnorm
- Student t distribution: rt
- Uniform distribution: runif
Obtain summary statistics: summary().
Plot
- histograms: hist()
- probability density functions (pdf): plot(density())

We will do two in-class exercises:

The distribution of monthly stock returns (S&P 500): normal? heavy tail? symmetric?
The sampling distribution of the sample mean: a Monte Carlo simulation experiment
- LLN and CLT

Exercise 1: The S&P 500 Monthly Stock Returns

In this example, we will investigate the empirical distribution of the monthly stock returns using S&P 500 data ranging from January 1952 to December 2019 from Center for Research in Security Press (CRSP) month-end values. Stock returns are the continuously compounded returns on the S&P 500 index,including dividends.

Preparing Your R Workspace

When starting a new project, we first clear all objects from the workspace.

rm(list=ls())

Next, you might also find it handy to know where your working directory is set at the moment:

getwd()

And you might consider changing the path to the folder in which you have stored your data set:

setwd("<location of your dataset>")
setwd("/Users/Econometrics/Lab")    # example

Reading CSV Files into R

Import data from Google Drive.

id <- "1Fkyi2uioF8xKOaKKbzbtjkZ1JQuvKjm0"
monthly <- read.csv(sprintf("https://docs.google.com/uc?id=%s&export=download", id))

You might also download the monthly return data from here, and save it to your working directory. Then import your local CSV files into R using the following command:

monthly <- read.csv("/Users/Econometrics/Lab/monthly_1952-2019.csv")

The structure of the data

Once the data is imported, you might want to first view the data and display the structure of the data.

View(monthly)   # view data

str(monthly)    # display the structure of the data

## 'data.frame':    816 obs. of  32 variables:
##  $ yyyymm    : int  195201 195202 195203 195204 195205 195206 195207 195208 195209 195210 ...
##  $ Index     : num  24.1 23.3 24.4 23.3 23.9 ...
##  $ D12       : num  1.41 1.42 1.42 1.43 1.44 ...
##  $ E12       : num  2.43 2.41 2.4 2.38 2.36 ...
##  $ b.m       : num  0.717 0.747 0.752 0.786 0.771 ...
##  $ tbl       : num  0.0157 0.0154 0.0159 0.0157 0.0167 0.017 0.0181 0.0183 0.0171 0.0174 ...
##  $ AAA       : num  0.0298 0.0293 0.0296 0.0293 0.0293 0.0294 0.0295 0.0294 0.0295 0.0301 ...
##  $ BAA       : num  0.0359 0.0353 0.0351 0.035 0.0349 0.035 0.035 0.0351 0.0352 0.0354 ...
##  $ lty       : num  0.0268 0.0269 0.0263 0.0254 0.0257 0.0259 0.0261 0.0267 0.0277 0.0269 ...
##  $ ntis      : num  0.0352 0.0349 0.0321 0.0332 0.0321 ...
##  $ Rfree     : num  0.00131 0.00128 0.00133 0.00131 0.00139 ...
##  $ infl      : num  0 -0.00755 0 0.0038 0 ...
##  $ ltr       : num  0.0028 0.0014 0.0111 0.0171 -0.0033 0.0003 -0.002 -0.007 -0.013 0.0148 ...
##  $ corpr     : num  0.0199 -0.0085 0.0076 -0.0004 0.0031 0.0016 0.0016 0.0063 -0.0018 0.0039 ...
##  $ svar      : num  0.000538 0.000932 0.000632 0.000808 0.000526 ...
##  $ csp       : num  0.00352 0.00364 0.00332 0.00384 0.00451 ...
##  $ CRSP_SPvw : num  0.0178 -0.0252 0.0528 -0.0414 0.0337 ...
##  $ CRSP_SPvwx: num  0.0163 -0.0357 0.0507 -0.0433 0.0233 ...
##  $ IndexDiv  : num  25.6 24.7 25.8 24.8 25.3 ...
##  $ dp        : num  -2.84 -2.8 -2.84 -2.79 -2.81 ...
##  $ ep        : num  -2.3 -2.27 -2.32 -2.28 -2.31 ...
##  $ dy        : num  -2.82 -2.84 -2.8 -2.84 -2.78 ...
##  $ logretdiv : num  0.0723 0.022 0.1033 0.0155 0.0815 ...
##  $ logRfree  : num  0.00131 0.00128 0.00132 0.00131 0.00139 ...
##  $ rp_div    : num  0.071 0.0207 0.1019 0.0142 0.0801 ...
##  $ logret    : num  0.0154 -0.0371 0.0466 -0.044 0.0229 ...
##  $ rp_nodiv  : num  0.0141 -0.0384 0.0453 -0.0453 0.0215 ...
##  $ tms       : num  0.0111 0.0115 0.0104 0.0097 0.009 0.0089 0.008 0.0084 0.0106 0.0095 ...
##  $ de        : num  -0.541 -0.533 -0.525 -0.509 -0.494 ...
##  $ bm        : num  0.717 0.747 0.752 0.786 0.771 ...
##  $ dfy       : num  0.0061 0.006 0.0055 0.0057 0.0056 0.0056 0.0055 0.0057 0.0057 0.0053 ...
##  $ dfr       : num  0.0171 -0.0099 -0.0035 -0.0175 0.0064 0.0013 0.0036 0.0133 0.0112 -0.0109 ...

Monthly returns: `rp_div`

Now, let’s take a look at the monthly stock returns, rp_div, and get its summary statistics such as mean and median.

Select the variable rp_div (monthly returns including dividends) from the data set monthly:

mrt <- monthly$rp_div     # select the return variable 
ts_mrt <- ts(mrt, start=c(1952,1), end=c(2019,12), frequency=12) # convert returns to a time series
plot(ts_mrt, main="The S&P 500 monthly returns: 1952.1-2019.12", ylab="monthly returns")     # time series plot

Empirical CDF of `mrt`

Recall, the CDF of a random variable \(X\) is the function \(F_X(x) = \Pr(X \leq x)\). The empirical CDF of a data sample \(\{x_t\}_{t=1}^T\) is the function that counts the fraction of observations less than or equal to \(x\): \[\hat{F}_X(x) = \frac{1}{T}(\#x_t \leq x).\]

The R function ecdf can be used to compute and plot the empirical CDF of the monthly returns:

plot(ecdf(mrt), col="blue")

We also can compute the empirical CDF of monthly returns at a given point, say \(x=0\):

Fhat.0 = sum(mrt <= 0)/length(mrt)
Fhat.0

## [1] 0.1911765

Here, the expression mrt <= 0 creates a logical vector the same length as mrt that is equal to TRUE when returns are less than or equal to 0 and FALSE when returns are greater than zero. Then Fhat.0 is equal to the fraction of TRUE values (returns less than or equal to zero) in the data, which gives the empirical CDF evaluated at zero, i.e., \(\hat{F}_X(x) = \frac{1}{T}(\#x_t \leq 0)\).

Summary statistics of `mrt`

Let’s get the summary statistics of mrt, including min and max values, mean, median, variance and standard deviation:

summary(mrt)              # get summary statistics of mrt

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -0.21652  0.01123  0.03469  0.03292  0.05859  0.19232

var(mrt)                  # variance

## [1] 0.001796266

sd(mrt)                   # standard deviation

## [1] 0.04238238

The summary statistics above are based on the monthly returns of S&P 500 over 816 months. The average return is given by the mean and is equal to 0.03292, or about 3.3%. The median return is 3.47%, which means there are 816 / 2 = 408 months with monthly returns of less than 3.47% and 408 months with returns of more than 3.47%. Both the mean and the median give us an idea about the centrality of stock returns. In other words, they inform us about the average performance of the stock over the sample period.

Both standard deviation and variance capture how dispersed the return observations are around their mean. Specifically, a tight distribution around the mean would have low standard deviation and variance. Conversely, if there were many returns observations far from the average, then the standard deviation and variance would both be high. Note that variance and standard deviation are often used to calculate return volatility, which often refers to the amount of uncertainty or risk related to the size of changes in a security’s value. A higher volatility means that a security’s value can potentially be spread out over a larger range of values. This means that the price of the security can change dramatically over a short time period in either direction. A lower volatility means that a security’s value does not fluctuate dramatically, and tends to be more steady.

In finance, we often say that risk and return are the two sides of a coin. Therefore, metrics such as average return are helpful when evaluating the past performance of an investment but are not sufficient on their own without proper consideration of risk.

The histogram and kernel density (pdf) plot of the monthly returns

The histogram of the monthly returns is:

h <- hist(mrt, breaks = 24, col="lightblue",
          main="The S&P 500 monthly returns: 1952.1 - 2019.12",
          xlab="monthly returns",
          freq = FALSE)    # histogram of mrt

lines(density(mrt), col="blue")  # add density plot (pdf) of mrt

We compare the pdf of mrt to the pdf of a normal random variable \(x \sim N(\mu_{mrt}, \sigma_{mrt}^2)\), which has the same mean and variance with mrt:

h <- hist(mrt, breaks = 24, col="lightblue",
          main="The S&P 500 monthly returns: 1952.1 - 2019.12",
          xlab="monthly returns",
          freq = FALSE)    # histogram of mrt

lines(density(mrt), col="blue")  # add density plot (pdf) of mrt

x <- seq(-0.3, 0.3, length=100)     
hx <- dnorm(x, mean=mean(mrt), sd=sd(mrt))
lines(x, hx, type="l", xlab="x-value", ylab="density", col="red")  # add the density of a normal r.v.

legend(-0.2, 10, legend=c("Density of mrt", "Density of x"),      # add legend
       col=c("blue", "red"), lty=1, cex=0.8)

Compute the skewness and kurtosis of mrt:

#install.packages("moments")  # install R package moments
library(moments)              # load the package
skewness(mrt)                 # compute skewness of mrt

## [1] -0.5564852

kurtosis(mrt)                 # compute kurtosis of mrt

## [1] 5.387466

The value of skewness suggests that the distribution of mrt is asymmetric. In fact, it is common for stock return distributions to be asymmetric. A negative skewness suggests a long tail to the left (left-skewed). In this case, highly negative returns would be observed below the average return.

Kurtosis is all about the heaviness of the tails of a distribution. The normal distribution has a kurtosis of 3. In this case, the return distribution has a kurtosis of 5.39, it is said to have an excess kurtosis of 5.39 − 3 = 2.39. Distributions that have positive excess kurtosis are considered to have fat tails (also known as leptokurtic distributions). This implies that extreme observations are more common in such distributions relative to otherwise similar normal distributions. Stock return distributions tend to be fat-tailed. Therefore, investors should be wary of the fact that they are likely to experience extreme return observations more often than they might like to believe.

Exercise 2: The Sampling Distribution of the Sample Mean

In this example, we will explore the sampling distribution of the sample mean. Generate data from a uniform distribution. The Monte Carlo simulation results show that mean and variance of the sample mean of a uniform random variable converge to \(\mu\) and \(\sigma^2/n\), and that the sampling distribution of the sample mean converges to a normal distribution as sample size increases. These results are consistent with LLN and CLT.

Preparing your R workspace

rm(list=ls())                     # clear all memories 
set.seed(123)

Tip: Use set.seed function for any Monte Carlo simulation to make sure all results are reproducible.

Draw iid (independent and identically distributed) samples from a uniform distribution

Suppose \(x\sim U(0,2)\). So \[E(x)=\frac{1}{2}(2-0)=1\] and \[var(x)=\frac{1}{12}(2-0)^2=0.33.\] Draw an iid sample from \(U(0,2)\) with sample size \(100\).

N <- 100              # Set sample size
x <- runif(N, min=0, max=2)   # Generate N random numbers from U(0,2)

Note: you could also draw samples from a normal distribution or a t-distribution using rnorm or rt. (Use built-in Help for more details.)

The summary statistics of \(x\):

summary(x)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.001249 0.490941 0.932742 0.997118 1.510943 1.988540

mean(x)

## [1] 0.997118

sd(x)

## [1] 0.5699905

var(x)

## [1] 0.3248892

Repeat the above process 1000 times

Nsim <- 1000               # the number of simulations
N <- 100                   # sample size
x_bar <- rep(NA, Nsim)     # create a vector for sample mean
sigma2_x <- rep(NA, Nsim)  # create a vector for sample variance

for (i in 1:Nsim){                      # write a loop
  x.data <- runif(N, min=0, max=2)      # generate a sample with size N
  x_bar[i] <- mean(x.data)              # record the sample mean for sample i
  sigma2_x[i] <- var(x.data)            # record the sample variance for sample i
}

The histogram of the sample average is:

hist(x_bar, breaks=24, freq = FALSE)

We could also add a normal distribution curve to the histogram.

## Define a function
hist.plot = function(x){
  hist(x, breaks=24, freq = FALSE)
  x.norm <- seq(min(x), max(x), length=100)
  hx.norm <- dnorm(x.norm, mean=mean(x), sd=sd(x))
  lines(x.norm, hx.norm, type="l", xlab="x-value", ylab="density", col="red")  # add the density of a normal r.v.
}

## Use the defined function to draw both histogram and normal curve
hist.plot(x_bar)

Thus, the (sample) mean of the sample average, \(E(\bar{x})\), is

mean(x_bar)

## [1] 0.9986263

and the (sample) variance of the sample average, \(var(\bar{x})\), is

var(x_bar)

## [1] 0.003256107

The sample variance of the sample average decreases as sample size \(N\) increases. Try a simulation experiment with sample size \(N=10\), what is the variance of the sample average? How about the mean of the sample average?