(Instructor : Nishant Panda)
Additional References
(IMC) : Introducing Monte Carlo Methods with R, Christian P. Robert & George Casella, Springer
(MMT) : Monte Carlo Methods, Robert & Casella
Last lecture we saw how to generate random samples from probability distributions. In this chapter, we will use that tool to approximate integrals. Most statistical properties are expressed as an expectation, which is an integral! In most cases, it is not possible to compute such an integral exactly. In this lecture we will explore a stochastic technique for evaluating integrals called Monte Carlo integration. As the name suggests, it involves sampling from probability distributions. Note that there are also deterministic approaches to approximating integrals, studied in numerical analysis; such methods generally scale poorly as the dimension of the integral grows. We will not study them here, even though they remain an important part of the modern statistician's toolbox.
Let \(X \sim f\) be a random variable (could be multidimensional!) and consider a real valued function \(g\) and the corresponding random variable \(g(X)\). Then, the expectation of \(g(X)\) is given by, \[ \mathbb{E}\left[g(X)\right] = \int\limits_{\mathcal{X}}g(x)f(x)dx, \] where \(\mathcal{X}\) is the range of \(X\) (support of \(f(x)\)). Two common examples of \(g\) are \(g(x) = x\) and \(g(x) = (x - \mathbb{E}\left[X\right])^2\). In these cases, \(\mathbb{E}\left[g(X)\right]\) is the expectation of \(X\) and the variance of \(X\) respectively.
Let \(\theta = \mathbb{E}\left[g(X)\right]\). The Monte Carlo algorithm to compute an approximation to \(\theta\), denoted by \(\widehat{\theta_{MC}}\) is as follows:
- Generate \(n\) samples from \(f(x)\): \(X_1, X_2, \dots, X_n\). Then,
- the Monte Carlo approximation is given by, \[\widehat{\theta_{MC}} = \frac{1}{n}\sum\limits_{i = 1}^{n}g(X_i).\] A generic R sketch of this recipe is given below.
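As a generic illustration, the two steps above can be wrapped in a small R function. This is only a sketch; the helper name mc.estimate and its arguments are ours and do not appear in the lecture code.

# minimal sketch of the Monte Carlo recipe
mc.estimate <- function(g, sampler, n){
  # step 1: draw n i.i.d. samples X_1, ..., X_n from f
  x <- sampler(n)
  # step 2: average g over the samples
  mean(sapply(x, g))
}

For example, once the function g below is defined, mc.estimate(g, runif, 1000) reproduces the uniform example that follows.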
Example 1: Consider the integral \[ \int\limits_{0}^{1}\left(\cos(50x) + \sin(20x)\right)^2 dx. \] A deterministic approximation (a very good one!) to this integral can be computed using the integrate function in R:
g <- function(x){
  (cos(50*x) + sin(20*x))^2
}
integrate(g, 0, 1)
## 0.9652009 with absolute error < 1.9e-10
Note that we are not explicitly given a density here! First realize that \[ \int\limits_{0}^{1}g(x)dx = \int\limits_{0}^{1}g(x)f(x) dx, \]
where \(f(x) = 1\) is the uniform density on \((0,1)\)! Thus, we can view the integral above as the expectation \(\mathbb{E}\left[g(X)\right]\), where \(X \sim U(0,1)\). The Monte Carlo approximation can then be computed as follows:
n.sam <- 1000
# step 1: generate n i.i.d samples from f (in this case uniform(0,1))
x.sam <- runif(n.sam)
# compute the MC approximation
theta.mc <- sum(sapply(x.sam, g))/n.sam
print(theta.mc)
## [1] 0.9692121
Does increasing \(n.sam\) improve our approximation?
n.sam <- 10000
# step 1: generate n i.i.d samples from f (in this case uniform(0,1))
x.sam <- runif(n.sam)
# compute the MC approximation
theta.mc <- sum(sapply(x.sam, g))/n.sam
print(theta.mc)
## [1] 0.960424
Definitely! In fact, by the strong law of large numbers, \[ \widehat{\theta_{MC}} \to \theta \text{ a.s.} \] as \(n \to \infty\).
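To see this convergence in practice, we can plot the running estimate as samples accumulate; a minimal sketch (the object name running.mean is ours):

# sketch: running Monte Carlo estimate as the sample size grows
x.sam <- runif(10000)
running.mean <- cumsum(sapply(x.sam, g))/seq_along(x.sam)
plot(running.mean, type = "l", xlab = "n", ylab = "MC estimate")
# reference line at the integrate answer 0.9652
abline(h = 0.9652, lty = 2)

The running estimate should settle near 0.9652 as \(n\) grows.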
Example 2: Let \(X \sim N(\mu, \sigma^2)\) and suppose we want to compute \(P(X < t)\) for some threshold \(t\). Note that \(P(X < t)\) is given by the integral \[ P(X < t) = \int\limits_{-\infty}^{t} \frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(x-\mu)^2}{2\sigma^2}}dx. \]
Again, we do not actually need Monte Carlo integration here, since this is just pnorm in R; with \(\mu = 5\), \(\sigma = 1.25\) and \(t = 7.1\):
pnorm(7.1, mean = 5, sd = 1.25)
## [1] 0.9535213
How do we use the Monte Carlo integration method here? We are not given a function \(g\) here. First note that we can write \[ P(X < t) = \int\limits_{-\infty}^{t} 1\cdot\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(x-\mu)^2}{2\sigma^2}}dx + \int\limits_{t}^{\infty} 0\cdot\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(x-\mu)^2}{2\sigma^2}}dx \] This gives us the following: Let \(g = \mathbb{I}_{(X < t)}\) be the indicator function for the set \(\lbrace X < t \rbrace\) and let \(f(x)\) be the normal density \(f(x) = \frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(x-\mu)^2}{2\sigma^2}}\). Then, \[ P(X < t) = \int\limits_{\mathbb{R}} g(x)f(x) dx = \mathbb{E}\left[g(X)\right] \]
Thus our Monte Carlo approximation is given by \[ \widehat{\theta_{MC}} = \frac{1}{n}\sum\limits_{i = 1}^{n}\mathbb{I}_{(X_i < t)}. \] This is the empirical distribution function evaluated at \(t\) (STAA 572)! Let us run this in R:
g.ind <- function(x, t){
  if (x < t) return(1)
  else return(0)
}

n.sam <- 1000
# Step 1: Sample of size n from the density f (in this case N(5, 1.25))
x.sam <- rnorm(n.sam, mean = 5, sd = 1.25)
# Step 2: get the monte carlo estimate
theta.mc <- sum(sapply(x.sam, g.ind, t = 7.1))/n.sam
print(theta.mc)
## [1] 0.955
Very close to the answer given by pnorm.
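As an aside, because comparisons in R are vectorized, the same estimate can be computed in one line without sapply (a sketch using the x.sam drawn above):

# proportion of samples below t = 7.1
mean(x.sam < 7.1)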
Example 3: Let \(A\) be the disc of radius \(1\) centered at the origin, and suppose we want to compute its area. Since \(A\) is a circle of radius \(1\), we know that the answer is \(\pi\)! The area is an integral, given by \[ Area(A) = \int\limits_{A}dxdy. \] How do we use Monte Carlo integration? We can think of \(X \sim U(-1, 1)\) and \(Y \sim U(-1,1)\) as independent random variables with joint density \[ f_{X,Y}(x,y) = \frac{1}{4}\hspace{0.05in} \text{for } -1\leq x,y \leq 1. \] Let \(B\) be the square \(-1\leq x,y \leq 1\). Then we can write \(Area(A)\) as \[ Area(A) = 4P\left((X,Y) \in A\right) = 4\int_B \mathbb{I}_{A}\,f_{X,Y}\,dxdy = 4\,\mathbb{E}\left[\mathbb{I}_A\right]. \] Thus we need to sample from \(f_{X,Y}\), and our function \(g\) is just the indicator of the set \(A\), i.e. \(g = \mathbb{I}_A\). Let us write the Monte Carlo method:
x.sam <- runif(10000, min = -1, max = 1)
y.sam <- runif(10000, min = -1, max = 1)
joint.sam <- cbind(x.sam, y.sam)
g.indA <- function(point){
  if ((point[1]^2 + point[2]^2) <= 1) return(1)
  else return(0)
}
theta.mc <- sum(apply(joint.sam, 1, g.indA))/nrow(joint.sam)
4*theta.mc
## [1] 3.1612
Let us visualize this!
joint.data <- data.frame(joint.sam)
colnames(joint.data) <- c("x", "y")
# create a logical variable indicating which points fall inside the circle
joint.data$Accepted <- (joint.data$x^2 + joint.data$y^2 < 1)

library(ggplot2)
# we will create a scatter plot of the sampled (x, y) points
# a point lies inside the circle when x^2 + y^2 < 1
# hence, in the scatter plot we color by the Accepted variable
plot_ar <- ggplot(joint.data, aes(x=x, y=y)) +
geom_point(shape=20, aes(color=Accepted)) +
stat_function(fun = function(x) sqrt(1-x^2), size=1.5) +
stat_function(fun = function(x) -1*sqrt(1-x^2), size=1.5) +
coord_fixed()
# beautify : Increase font size in axes
plot_ar + theme(axis.text.x =
element_text(face = "bold",size = 12),
axis.text.y =
element_text(face = "bold", size = 12),
axis.line = element_line(colour = "black",
size = 1, linetype = "solid"))

Note that \(\widehat{\theta_{MC}}\) is an estimator for \(\theta\). What is its expected value? \[ \mathbb{E}\left[\widehat{\theta_{MC}}\right] = \frac{1}{n}n\mathbb{E}\left[g(X)\right] = \theta. \] Thus, the Monte Carlo estimator is unbiased. In order to talk about how good an estimate this is for a given \(n\), we need to work out the variance of our Monte Carlo estimator. Assuming that \(g(X)^2\) has finite expectation, let \(var\left[g(X)\right] = \sigma^2\). Then the variance of \(\widehat{\theta_{MC}}\) is given by, \[ var\left[\widehat{\theta_{MC}}\right] = \frac{1}{n^2}n\cdot var\left[g(X)\right] = \frac{\sigma^2}{n}. \] We can estimate \(\sigma^2\) from the (Monte Carlo) sample using the sample variance, i.e. \[ S^2 = \frac{1}{n-1}\sum\limits_{i = 1}^{n}\left(g(X_i) - \widehat{\theta_{MC}}\right)^2. \] Thus, the standard error (SE) of our Monte Carlo estimate is approximately \[ SE = \sqrt{var\left[\widehat{\theta_{MC}}\right]} \approx \frac{S}{\sqrt{n}}. \]

By the CLT, we can now build a confidence interval about the Monte Carlo estimate with a half-width proportional to \[ \frac{S}{\sqrt{n}}. \] Thus, the larger the Monte Carlo sample, the smaller the error around our Monte Carlo estimate. However, note that this error is proportional to \(n^{-\frac{1}{2}}\): in order to cut the error in half, we need to quadruple the number of Monte Carlo samples!
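To make this concrete, here is a minimal sketch (variable names are ours) that computes the standard error and an approximate 95% confidence interval for the integral in Example 1, using the usual \(\widehat{\theta_{MC}} \pm z_{0.975}\,S/\sqrt{n}\) interval:

# sketch: SE and an approximate 95% CLT confidence interval for Example 1
n.sam <- 10000
x.sam <- runif(n.sam)
g.vals <- sapply(x.sam, g)
theta.mc <- mean(g.vals)
S <- sd(g.vals)        # sample standard deviation of g(X_i)
SE <- S/sqrt(n.sam)    # standard error of the Monte Carlo estimate
c(theta.mc - qnorm(0.975)*SE, theta.mc + qnorm(0.975)*SE)

For this sample size, the interval should typically cover the integrate answer 0.9652.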
Can we do better than the Monte Carlo estimate? To be precise, a “better” estimate should also be unbiased (or asymptotically unbiased) but should have lower variance than the Monte Carlo estimator. First, let us look at Example 2 again and compute the probability \(P(X > 9.1)\) using a Monte Carlo approximation.
g.ind2 <- function(x, t){
  if (x > t) return(1)
  else return(0)
}

n.sam <- 1000
# Step 1: Sample of size n from the density f (in this case N(5, 1.25))
x.sam <- rnorm(n.sam, mean = 5, sd = 1.25)
# Step 2: get the monte carlo estimate
theta.mc <- sum(sapply(x.sam, g.ind2, t = 9.1))/n.sam
print(theta.mc)
## [1] 0.001

For comparison, the exact answer is

1 - pnorm(9.1, mean = 5, sd = 1.25)
## [1] 0.0005190354
Let us also estimate the standard error of this Monte Carlo estimate.

var.fun <- function(x, t, theta){
  # squared deviation of the indicator from the estimate theta
  if (x > t) return((1 - theta)^2)
  else return(theta^2)
}
sample.var <- sum(sapply(x.sam, var.fun, t = 9.1, theta = theta.mc))/(n.sam - 1)
SE <- round(sqrt(sample.var/n.sam), 4)
print(SE)
## [1] 0.001
Oops! With \(1000\) samples only a single draw landed in this rare event; the estimate is roughly double the true value, and the standard error is as large as the estimate itself. Let us see if quadrupling the sample size helps.
n.sam <- 4000
# Step 1: Sample of size n from the density f (in this case N(5, 1.25))
x.sam <- rnorm(n.sam, mean = 5, sd = 1.25)
# Step 2: get the monte carlo estimate
theta.mc <- sum(sapply(x.sam, g.ind2, t = 9.1))/n.sam
print(theta.mc)
## [1] 0.00075
var.fun <- function(x, t, theta){
  # squared deviation of the indicator from the estimate theta
  if (x > t) return((1 - theta)^2)
  else return(theta^2)
}
sample.var <- sum(sapply(x.sam, var.fun, t = 9.1, theta = theta.mc))/(n.sam - 1)
SE <- round(sqrt(sample.var/n.sam), 4)
print(SE)
## [1] 4e-04
Because this rare event is hardly ever sampled until \(n\) is large, we do not get a stable estimate of the Monte Carlo variance. And even when we do, we need to increase \(n\) by a factor of \(4\) to reduce the error by a factor of \(2\). Another way to reduce the error in our estimate is to reduce \(\sigma\) (and hence \(S\)): if we can reduce \(\sigma\) by a factor of \(2\), our error is reduced by a factor of \(2\) for the same \(n\). Monte Carlo techniques that reduce variance are called variance reduction techniques. One of the most popular of these is importance sampling.
First, some notation. Let \(f(x)\) be a probability density. We make a slight change in our notation for expectation to emphasize the corresponding density.
If \(X \sim f\), then we denote the expectation of \(X\) by \(\mathbb{E}_{f}\left[X\right]\) to emphasize the density is \(f\).
The main idea is to draw samples from a different distribution so as to reduce the variance of the estimator. Consider the following motivating example (from MMT, page 90): let \(X\) be a standard Cauchy random variable with density \(f(x) = \frac{1}{\pi(1+x^2)}\), and suppose we want to compute \(P(X > 2)\).
First way: Standard Monte Carlo (see Example 2) \[ P(X > 2) = \mathbb{E}_f\left[\mathbb{I}_{X > 2}\right]. \]
Our function \(g(X) = \mathbb{I}_{X > 2}\) and the variance of the estimator is \[ \frac{var\left[g(X)\right]}{n}. \]
Note that \(g(X)\) is a Bernoulli random variable with parameter \(p = P(X > 2)\), so its variance is \(p(1-p)\). In a real problem we would not know \(p\) and would estimate the variance from the sample. However, because this is a toy example, let us cheat and compute it exactly in R:
p <- 1 - pcauchy(2)
p*(1 - p)
## [1] 0.1258027
Hence, the variance of the usual Monte Carlo estimator is \(\frac{0.126}{n}\). Note that the event of interest, \(\lbrace X > 2\rbrace\), lies in the tail \((2, \infty)\), so relatively few of our samples land there. Let us evaluate \(P(X > 2)\) in a different way, using the fact that the Cauchy distribution is symmetric about \(0\): \[ P(X > 2) = \frac{1}{2} - \int\limits_{0}^{2}\frac{1}{\pi(1+x^2)}dx. \] Note that we can write this as \[ P(X > 2) = \frac{1}{2} - \int\limits_{0}^{2}\frac{2}{\pi(1+x^2)}h(x)dx, \] where \(h(x) = \frac{1}{2}\) is the density of \(U(0, 2)\). Thus, \[ P(X > 2) = \frac{1}{2} - \mathbb{E}_h\left[g(X)\right], \]
where \(g(x) = \frac{2}{\pi(1+x^2)}\). Now, let us use Monte Carlo with this expectation! The variance of this estimator is \[ \frac{var\left[\frac{2}{\pi(1+X^2)}\right]}{n}, \] where now \(X \sim U(0,2)\). This can be shown to be equal to \(\frac{0.0285}{n}\).
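Here is a minimal sketch of this second estimator in R (variable names are ours), together with a numerical check of its variance against the stated value of about 0.0285:

# sketch: estimate P(X > 2) by sampling from U(0, 2)
n.sam <- 10000
u.sam <- runif(n.sam, min = 0, max = 2)
g.vals <- 2/(pi*(1 + u.sam^2))
theta.mc2 <- 0.5 - mean(g.vals)
# compare with the exact value 1 - pcauchy(2)
print(theta.mc2)
# sample variance of g(U); should be close to 0.0285
var(g.vals)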
Hence, we have reduced the variance by a factor of \(\frac{0.126}{0.0285} \approx 4.4\), and thus our error is reduced by a factor of roughly \(2\).
This will be our motivation when we delve deeper into importance sampling in the next lecture.