(Instructor: Nishant Panda) (Additional Reference (IMC): Introducing Monte Carlo Methods with R, Christian P. Robert & George Casella, Springer)
Simulation is an important technique in applied mathematics. It typically means using computers to conduct experiments! In statistics, it is a very useful way to check whether a statistical model has the necessary properties, and this involves sampling from probability distributions. In most real-life situations, analytical derivation of the statistical properties of your model is not possible, and one usually works with "large sample approximations". To test properties in finite samples, simulation studies are crucial!
The core step in simulation is drawing samples from probability distributions. This will be the topic of interest in this lecture.
How does R generate samples from a "named" distribution? For example, rnorm generates samples from a normal distribution.
# Generates 10,000 samples from a normal distribution
# with mean 2 and standard deviation 0.5
norm.samples <- rnorm(10000, mean = 2, sd = 0.5)
hist(norm.samples, breaks = 21, col = "steelblue")
# Plot the "density"
plot(density(norm.samples))
polygon(density(norm.samples), col="steelblue", border="red")
As it turns out, all that is needed to generate samples from any distribution is a uniform number generator. Let us assume that this is God-given (i.e., God wrote runif).
unif.samples <- runif(100000, 0, 2)
plot(density(unif.samples))
polygon(density(unif.samples), col="steelblue", border="red")
Note that there is no such thing as a "random" number generator! The computer generates numbers deterministically that satisfy all the properties of the distribution. This can be seen through the set.seed function in R.
set.seed(1)
runif(5)
## [1] 0.2655087 0.3721239 0.5728534 0.9082078 0.2016819
set.seed(1)
runif(5)
## [1] 0.2655087 0.3721239 0.5728534 0.9082078 0.2016819
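Under the hood, R's default generator is the Mersenne Twister, but the classic illustration of deterministic "random" numbers is a linear congruential generator. Below is a toy sketch (my addition, using the "minimal standard" constants; this is NOT R's actual generator).
# Toy linear congruential generator (NOT R's actual generator):
# each number is a deterministic function of the previous one,
# so the same seed always reproduces the same stream.
lcg <- function(n, seed = 1) {
  m <- 2^31 - 1   # modulus
  a <- 16807      # multiplier ("minimal standard" constants)
  x <- numeric(n)
  state <- seed
  for (i in 1:n) {
    state <- (a * state) %% m
    x[i] <- state / m
  }
  x
}
lcg(5)   # the same five numbers on every call with the same seed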
Methods that produce an endless supply of random samples from distributions are termed Monte Carlo methods.
For a nice overview see https://en.wikipedia.org/wiki/Monte_Carlo_method
The inverse transform theorem is a technical result in probability theory which states that if \(X\) is a random variable with C.D.F \(F(x)\), and \(u\) is a random sample from the Uniform(0,1) distribution, then \(F^{-1}(u)\) is a random sample of \(X\).
(Technically, \(F^{-1}\) is NOT the inverse of the C.D.F, but for the purpose of this course thinking of it as an inverse function will suffice!)
If we want to generate random samples of \(X\), what this means is the following (a short R sketch follows the steps):
1. First generate random samples \(u_i\) from uniform \(U(0,1)\).
2. Let \(x_i = F^{-1}(u_i)\). Then \(x_i\)’s are random samples from \(X\).
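In R, this recipe is just a couple of lines; here is a minimal generic sketch (the helper name inv.transform is my own).
# Generic inverse transform sketch: Finv is the (generalized)
# inverse C.D.F of the target distribution.
inv.transform <- function(n, Finv) {
  u <- runif(n)   # Step 1: uniform samples
  Finv(u)         # Step 2: transform them
}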
As a concrete example, let us generate samples from the Exponential(1) distribution, whose C.D.F is \[ F(x) = 1 - e^{-x} \]
First we need to get the inverse function \(F^{-1}\). This means that if \(F(x) = y\), then we solve for \(x\). From Home Assignment 1 in this Lecture, we know that \(F(x) = 1 - e^{-x}\). Thus, solving for \(x\), \[ \begin{aligned} &y = 1 - e^{-x} \\ & \implies e^{-x} = 1 - y \\ & \implies -x = \log(1-y)\\ & \implies x = -\log(1-y) \end{aligned} \] Hence, the inverse function is given by \(F^{-1}(y) = -\log(1-y)\).
# Step 1: generate n samples from uniform random variable
n <- 10000
u <- runif(n)
# Step 2: compute F^-1(u)
my.exp.samples <- sapply(u, function(ui) -log(1 - ui))
Let us test this visually with a plot.
plot(my.exp.samples)
Oops! What happened? This doesn't look like an exponential distribution, does it? That is because we are plotting the samples themselves, not the distribution! Let us look at the histogram.
# Plot histogram
hist(my.exp.samples, breaks = 21, col = "steelblue", main = "Histogram of Exp(1) samples")
# Plot density
plot(density(my.exp.samples), xlim = c(0.5, 5), main = "Density plot for Exp(1)")
polygon(density(my.exp.samples), col="steelblue", border="red")
Let us look at the R version.
# Plot density
plot(density(rexp(10000)), xlim = c(0.5, 5), main = "Density plot for Exp(1) from rexp")
polygon(density(rexp(10000)), col="steelblue", border="red")
(From IMC) "The generation of uniform random variables is therefore a key determinant of the behavior of simulation methods for other probability distributions since those distributions can be represented as a deterministic transformation of uniform random variables."
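The two density plots look essentially identical. As a quick numerical check (my addition, not part of the original notes), a Kolmogorov-Smirnov test compares our samples against the Exp(1) C.D.F.
# Compare the inverse-transform samples with the Exp(1) C.D.F;
# a large p-value is consistent with the samples coming from Exp(1).
ks.test(my.exp.samples, "pexp", 1)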
R uses the inverse transform method to generate samples from the Normal distribution. Can we use the inverse transform method to generate samples from discrete distributions like the Binomial? Yes, we can! But we won't delve further into this in lecture.
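For a discrete distribution, \(F^{-1}(u)\) is the smallest value \(k\) with \(F(k) \geq u\). Here is a minimal sketch for the Binomial (my addition; the helper name rbinom.inv is my own, built on R's pbinom).
# Inverse transform for a discrete distribution: return the
# smallest k with F(k) >= u, here for Binomial(size, prob).
rbinom.inv <- function(n, size, prob) {
  cdf <- pbinom(0:size, size, prob)   # C.D.F on the support 0, ..., size
  u <- runif(n)
  sapply(u, function(ui) which(ui <= cdf)[1] - 1)
}
hist(rbinom.inv(10000, size = 10, prob = 0.3), breaks = -0.5:10.5)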
Typically, a C.D.F won't come in a nice closed-form expression for us to invert! When we come up against such a wall, the next simulation method of choice is almost always going to be the Accept-Reject method. Accept-Reject methods fall under a large class of simulation methods called indirect methods.
Again, for a great read see https://en.wikipedia.org/wiki/Rejection_sampling
With indirect methods, we generate samples from a (simpler) candidate random variable and only accept them as samples of our target random variable if they pass a test.
Notation: let \(f\) be the density of our target random variable \(X\), let \(g\) be the density of a candidate random variable \(Y\) that we know how to sample from, and let \(e(x) = Mg(x)\) be an envelope function satisfying \(e(x) \geq f(x)\) on the support of \(f\), where \(M \geq 1\) is a constant.
Goal: Draw a sample of \(X\), i.e., simulate from \(f\).
The accept-reject algorithm can be compactly stated as follows (a generic R sketch follows the steps):
1. Generate \(Y\sim g\) and \(U\sim U(0,1)\).
2. Set \(X = Y\) if \[ U < \frac{f(Y)}{e(Y)}, \]
3. else, go back to 1.
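In R, this loop can be written once and reused; here is a generic sketch (my addition; the argument names rg and dg, for the candidate's sampler and density, are my own).
# Generic accept-reject sketch: rg draws from the candidate g,
# dg evaluates its density, and M satisfies M * g(x) >= f(x).
accept.reject <- function(n, f, rg, dg, M) {
  out <- numeric(n)
  count <- 0
  while (count < n) {
    y <- rg(1)                      # Step 1: Y ~ g
    u <- runif(1)                   # Step 1: U ~ U(0,1)
    if (u < f(y) / (M * dg(y))) {   # Step 2: test u < f(y)/e(y)
      count <- count + 1
      out[count] <- y               # Step 2: set X = Y
    }                               # Step 3: otherwise loop again
  }
  out
}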
We see that picking an envelope (i.e., picking \(M\) and a candidate density) is crucial for this method to work. Let us illustrate this with the following example.
We know that if \(X \sim Beta(\alpha, \beta)\), then its density \(f\) is given by, \[ f(x\lvert \alpha,\beta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}x^{\alpha - 1}(1-x)^{\beta - 1} \]
Plugging in \(\alpha = 4\) and \(\beta = 3\), we get \[ f(x) = 60x^3(1-x)^2 \]
What is the support of \(f\)?
Any \(Beta\) distribution is supported on \(\left[0, 1\right]\). Let us plot this density.
x.seq <- seq(from = 0, to = 1, length.out = 101)
f <- sapply(x.seq, function(x) 60*x^3*(1-x)^2)
library(ggplot2)
# put the x and y values in a data frame
plot_data <- data.frame("x" = x.seq, "f" = f)
# create the base layer of ggplot with geom_Line
plot_f <- ggplot(plot_data, aes(x = x, y = f)) +
  geom_line(color = "red", size = 1.5) +
  labs(x = "x", y = "f(x| 4,3)")
plot_f
Now, our candidate density function \(g\) should have the same support as \(f\) and should be a valid density. Let's take \(g\) to be the uniform density \(U(0,1)\).
A naive choice for the envelope \(e(x) = Mg(x)\) would be to take \(M = \max(f(x))\). Thus, we can guarantee that \(e(x) \geq f(x)\). This naive choice works whenever \(f(x)\) has compact support and we can find its global maximum. Let's find \(M\). We can use the optimize function in R to find the maximum!
max_p <- optimize(function(x) 60*x^3*(1-x)^2, c(0.05, 0.95), maximum = TRUE)
round(max_p$maximum, 4)
## [1] 0.6
Thus, \(M = f(0.6)\).
# Let's create our beta density f
beta.f <- function(x) {
  60*x^3*(1-x)^2
}
M <- beta.f(0.6)
print(M)
## [1] 2.0736
Our candidate density function is \(g(x) = 1\) for any \(x\) in \(\left[0, 1\right]\), because we chose it to be Uniform(0,1). Hence, our envelope function is \(e(y) = M\). Now, let us write a simple accept-reject algorithm for this example.
num.samp <- 1000
beta.samples <- rep(0, num.samp)
count.samp <- 0
# keep running the accept-reject method until you have drawn 1000 samples
while(count.samp < num.samp) {
  # Step 1: generate a sample y from g and u from unif(0,1)
  y <- runif(1) # g is U(0,1)
  u <- runif(1)
  # Step 2: check the test
  if (u < beta.f(y)/M) {
    # increase the counter
    count.samp <- count.samp + 1
    # set X = Y
    beta.samples[count.samp] <- y
  }
}
Let us check if this worked.
hist(beta.samples, prob = TRUE, ylab = "f(x)", xlab = "x",
     ylim = c(0, max(beta.f(x.seq))),
     main = "Histogram of draws from Beta(4,3)")
lines(x.seq, f, lty = 2, col = "red")
Let us visualize which samples were accepted and which were rejected.
# Total number of samples
num.sim <- 3000
# testing condition : u < f/e and hence ue < f
# e is just M, so ue is u * M
ue <- M*runif(num.sim)
# y is a sample from g and g is U(0,1)
y <- runif(num.sim)
# create all the pairs y and ue
mat <- cbind(y, ue)
# put them in a data frame
plot_data <- data.frame(mat)
colnames(plot_data) <- c("y", "ue")
# create a factor variable to see which were accepted
plot_data$Accepted <- ue < beta.f(y)
# show the first 5 pairs
head(plot_data, 5)
##           y        ue Accepted
## 1 0.5521227 0.6982361     TRUE
## 2 0.8818690 1.0105790    FALSE
## 3 0.9236169 1.3849691    FALSE
## 4 0.2549313 0.9364871    FALSE
## 5 0.1082588 0.5137490    FALSE
# we will create a scatter plot
# "x" values will be y (i.e samples from g)
# "y" values will be ue (u * e)
# If (u*e < f(y)) then samples are accepted otherwise rejected
# Hence, in scatter plot we will color by the Accepted variable
plot_ar <- ggplot(plot_data, aes(x=y, y=ue)) +
  geom_point(shape=20, aes(color=Accepted)) +
  stat_function(fun = beta.f) +
  geom_vline(xintercept = 0.0) +
  geom_vline(xintercept = 1.0) +
  geom_hline(yintercept = M) +
  geom_hline(yintercept = 0)
# beautify : Increase font size in axes
plot_ar + theme(axis.text.x = element_text(face = "bold", size = 12),
                axis.text.y = element_text(face = "bold", size = 12),
                axis.line = element_line(colour = "black", size = 1,
                                         linetype = "solid"))
Let us check what proportion of the samples were accepted.
round(sum(plot_data$Accepted)/nrow(plot_data), 2)
## [1] 0.48
round(1/M, 2)
## [1] 0.48
This is not a coincidence! \(\frac{1}{M}\) is the expected proportion of candidates that are accepted. If we find a smaller \(M\) (i.e., a tighter envelope), we will on average have fewer rejections.
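To see why (a one-line derivation, not spelled out in the notes), condition on the candidate \(Y \sim g\): \[ P(\text{accept}) = P\left(U < \frac{f(Y)}{Mg(Y)}\right) = \int \frac{f(y)}{Mg(y)}\, g(y)\, dy = \frac{1}{M}\int f(y)\, dy = \frac{1}{M}. \]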
A good envelope should have the following properties:
1. It must dominate the target: \(e(x) \geq f(x)\) for every \(x\) in the support of \(f\).
2. The candidate density \(g\) should be easy to sample from.
3. It should hug \(f\) as tightly as possible, i.e., \(M\) should be as small as possible, so that the acceptance rate \(\frac{1}{M}\) is high.
If \(f\) is supported on a set \(\left[a, b\right]\), then a naive but good choice is to take the candidate density to be the uniform \(U(a,b)\) and \(M = (b-a)\max(f(x))\), so that the envelope is the constant \(e(x) = Mg(x) = \max(f(x))\). Will this method work if \(f\) is the density of a \(N(\mu,\sigma)\) random variable? No: the normal density has unbounded support, so no uniform candidate can cover it. We need a candidate with matching support instead. As an illustration, consider the target below, which we envelope with a scaled \(e^{-x^2/2}\).
# target "density", known only up to a normalizing constant
f <- function(x){
  exp(-x^2/2)*(sin(6*x)^2 + 3*cos(x)^2 * sin(4*x)^2 + 1)
}
opt.f <- optimize(f, c(-2,2), maximum = TRUE)
# unnormalized N(0,1) density, used as the shape of the envelope
norm <- function(x){
  exp(-x^2/2)
}
M <- 1.05*f(opt.f$maximum)   # a 5% cushion above the maximum of f
x.seq <- seq(from = -10, to = 10, length.out = 201)
f.seq <- sapply(x.seq, f)
norm.seq <- sapply(x.seq, norm)
plot(x.seq, f.seq, type = "l")
lines(x.seq, M*norm.seq)   # the envelope e(x) = M * exp(-x^2/2)
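The notes stop at the plot, but we can complete the example with a sampler (a sketch of my own, not from the lecture). Since \(\sin^2\) and \(\cos^2\) are at most 1, the factor multiplying \(e^{-x^2/2}\) in \(f\) is at most \(1 + 3 + 1 = 5\), so \(e(x) = 5e^{-x^2/2}\) is a guaranteed envelope with a standard normal candidate, and the acceptance test needs no normalizing constants. (The names M.env, n.samp and f.samples are my own.)
# Accept-reject for f using a N(0,1) candidate. The factor
# multiplying exp(-x^2/2) in f is at most 5, so
# e(x) = 5 * exp(-x^2/2) dominates f on the whole real line.
M.env <- 5
n.samp <- 5000
f.samples <- numeric(n.samp)
count <- 0
while (count < n.samp) {
  y <- rnorm(1)                        # candidate draw from N(0,1)
  u <- runif(1)
  if (u < f(y) / (M.env * norm(y))) {  # norm(y) = exp(-y^2/2)
    count <- count + 1
    f.samples[count] <- y
  }
}
hist(f.samples, prob = TRUE, breaks = 41, col = "steelblue",
     main = "Accept-reject samples from f")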