In probability and statistics (and lots of other areas of math), some expressions are difficult or impossible to compute analytically. As an example, which we’ll dive into more below, think back to Calculus 2, where you may have run into integrals that you could not compute using the Fundamental Theorem of Calculus because no formula for the antiderivative exists in terms of the functions you knew. The integral below, whose integrand is (up to a constant) the pdf of a normal distribution, is one such integral. \[ \int_0^3 e^{-x^2}\,dx.\] What do we do in these cases?
Monte Carlo methods are one such approach. While the name sounds fancy (it comes from a famous casino in Monaco), it boils down to taking random samples from a distribution and then using those random samples to approximate the expression you want. In many cases, these approximations can be quite good (or at least good enough), and can be done quickly and easily on modern computers.
In each part below, we’ll look at examples of how Monte Carlo methods can be applied in R to answer a variety of questions.
A note on reproducibility. When we talk about reproducibility in statistics and related fields, we mean the ability of others to exactly reproduce the results of your computations. Since Monte Carlo methods involve taking random samples, this seems like it wouldn’t be possible! But the random number generators that R and other programming languages use are not truly random. Rather, they are functions known as pseudo-random number generators. They produce lists of numbers that are (usually) sufficiently random for the purposes of simulation while also allowing for reproducibility. In R, this is controlled through the set.seed() function. By choosing the same seed value, two people who then take a random sample will get the same list of values.
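For example (the seed value 42 here is arbitrary, chosen just for illustration), setting the same seed before each call produces identical “random” draws:
set.seed(42)
runif(3)
set.seed(42)
runif(3)  # identical to the three values above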
Consider the case of the definite integral from the introduction, \[ \int_0^3 e^{-x^2}\,dx.\] A graph of the function is shown below, and the integral computes the area under the curve.
library(ggplot2)
# the integrand from the introduction
g <- function(x) {exp(-(x^2))}
# plot g on the interval [0, 3]
ggplot() +
  xlim(c(0, 3)) +
  geom_function(fun = g,
                color = "blue")
We can recall that if \(X\) is a random variable, then a function of \(X\), \(g(X)\), has expected value \[E[g(X)]=\int_{\textrm{support}}g(x)f(x)\,dx,\] where \(X\) has pdf \(f(x)\). If we choose \(X\) to have a uniform distribution on the interval over which we integrate, \([0,3]\), and if we choose \(g(x)=e^{-x^2}\), then \[\begin{align*} E[g(X)] &= \int_0^3 e^{-x^2}\frac{1}{3}\,dx \\ E[g(X)] &= \frac{1}{3}\int_0^3 e^{-x^2}\,dx \\ 3E[g(X)] &= \int_0^3 e^{-x^2}\,dx \end{align*}\]
The wonderful observation that makes the Monte Carlo method work is our interpretation of expected value as an average. In particular, we can think of \(E[g(X)]\) as the mean of \(g(X)\). This is something we can simulate! We start by drawing many samples, \(X_i\), from \(X\), compute \(Y_i=g(X_i)\) for each sample, and then take the sample mean \[\overline{Y}=\frac{1}{n}\sum_{i=1}^n Y_i.\] This gives an approximation \(\overline{Y}\approx E[g(X)]\), and we can then approximate our integral by \[3\overline{Y} \approx \int_0^3 e^{-x^2}\,dx.\] We’ll carry out this simulation below using 10,000 samples from the uniform distribution. (You can check the result using a different integral calculator!)
set.seed(1860)
xi_vec <- runif(10000, min = 0, max = 3)  # draw X_i from Unif(0, 3)
yi_vec <- g(xi_vec)                       # compute Y_i = g(X_i)
3 * mean(yi_vec)                          # 3 * Ybar approximates the integral
## [1] 0.8886948
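As a quick check (this is not part of the original simulation), we can compare against R’s built-in numerical integrator, and we can also estimate the Monte Carlo standard error of our approximation, which shrinks like \(1/\sqrt{n}\) as the number of samples grows:
integrate(g, lower = 0, upper = 3)   # deterministic numerical approximation of the same integral
3 * sd(yi_vec) / sqrt(length(yi_vec))  # rough Monte Carlo standard error of 3 * mean(yi_vec)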
Below are links to two web apps that compute approximations of \(\pi\) using different Monte Carlo methods. These are interactive, and you can see how the approximations change as you change the simulation settings!
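If you’d also like to experiment in R yourself, here is a minimal sketch of the classic “darts in a square” estimator (the sample size below is an arbitrary choice): sample points uniformly in the unit square and count the fraction that land inside the quarter circle of radius 1, which has area \(\pi/4\).
set.seed(1860)
n <- 100000
x <- runif(n)
y <- runif(n)
inside <- (x^2 + y^2 <= 1)  # TRUE when the point falls in the quarter circle
4 * mean(inside)            # approximates pi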
In this course, Monte Carlo methods will be an important tool for building intuition for statistical theorems. As an example, we’ll see how Monte Carlo methods can be used to build intuition for the Central Limit Theorem.
Recall
Central Limit Theorem. Let \(Y_1,Y_2,\ldots,Y_n\) be independent and identically distributed random variables with \(E(Y_i)=\mu\) and \(V(Y_i)=\sigma^2<\infty\). Define \[U_n = \frac{\overline{Y}-\mu}{\sigma/\sqrt{n}}.\] Then the distribution function of \(U_n\) converges to the standard normal distribution function as \(n\rightarrow\infty\). That is, \[\lim_{n\to\infty}P(U_n\le u)=F_Z(u) \textrm{ where } Z\sim N(0,1)\]
In this example, we’ll assume that \(Y\sim Unif(2,7)\), the uniform distribution on the interval \([2,7]\). As we did in the integration section above, we could find a random sample of points from \(Y\) and take their sample mean.
set.seed(1860)
min <- 2
max <- 7
samples1 <- runif(10, min = min, max = max)  # one sample of 10 points from Unif(2, 7)
mean(samples1)
## [1] 3.540456
However, running only one sample does not give us a lot of insight into how the sample mean behaves in general. Let’s run another sample and see what happens. Note that we don’t reset the seed, since we want to get a different sample!
samples2 <- runif(10, min = min, max = max)
mean(samples2)
## [1] 3.948109
As expected, these sample means are not the same. This suggests an approach: draw a bunch of different samples, take the sample mean of each, and look at a histogram of all of those sample means. This gives an approximation of the distribution of the sample mean, \(\overline{Y}\). We’ll start by taking \(n=10\) sample points and \(N=100\) simulated samples.
set.seed(1860)
N <- 100  # number of simulated samples
sims10 <- data.frame(replicate(N, runif(10, min = min, max = max)))  # one column per sample
The structure of this data frame is illustrated in the figure below.
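A quick way to confirm this structure directly (not shown in the original) is to check the dimensions: each of the \(N=100\) columns holds one simulated sample of \(n=10\) points.
dim(sims10)  # should report 10 rows and 100 columns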
We will now find the sample mean of each simulated sample and reshape the data frame to make the sample means a column.
library(dplyr)
library(tidyr)
sims10_means <- summarize(sims10, across(everything(), mean))  # mean of each simulated sample
sims10_means <- gather(sims10_means, value = mean)             # reshape the means into one column
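As a side note (an equivalent alternative, not what the code above uses), newer versions of tidyr supersede gather() with pivot_longer(); the reshape above could instead be written as:
# alternative to the gather() call above -- run one or the other, not both
sims10_means <- pivot_longer(sims10_means, cols = everything(),
                             names_to = "key", values_to = "mean")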
We can now plot a histogram of the sample means.
plot10 <- ggplot(sims10_means, aes(x = mean)) +
xlim(2,7) +
geom_histogram(bins = 30, color = "black", fill = "blue") +
ggtitle("Sample Mean with n=10")
plot10
This is great! It shows that the sample mean, \(\overline{Y}\), almost certainly does not follow a uniform distribution like \(Y\), and it seems reasonably close to a normal distribution. But the best part of the Central Limit Theorem is that it tells us that this approximation gets better as the number of sample points, \(n\), grows! We can visualize this by running the Monte Carlo simulations for several different values of \(n\).
These plots clearly show that the variance of the sample mean decreases as \(n\) gets larger. To really see whether the distributions become more normal as \(n\) increases, we can standardize the sample means as described in the Central Limit Theorem, taking \[\frac{\overline{Y}-\mu}{\sigma/\sqrt{n}}\] for each. We plot the density of each in blue, and the pdf of the standard normal is overlaid in red. We can see (roughly) that the distribution becomes closer to standard normal as the number of sample points increases.
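If you would like to reproduce a picture along these lines yourself, here is a minimal sketch (the value of \(N\) and the sample sizes below are arbitrary choices, not the settings used for the plots above): it simulates \(N\) sample means for each \(n\), standardizes them using \(\mu=4.5\) and \(\sigma^2=25/12\) for the \(Unif(2,7)\) distribution, and overlays the standard normal pdf in red.
set.seed(1860)
mu <- (2 + 7) / 2               # mean of Unif(2, 7)
sigma <- sqrt((7 - 2)^2 / 12)   # standard deviation of Unif(2, 7)
N <- 1000                       # simulated samples per value of n
n_values <- c(5, 10, 30, 100)   # sample sizes to compare
standardized <- do.call(rbind, lapply(n_values, function(n) {
  ybars <- replicate(N, mean(runif(n, min = 2, max = 7)))
  data.frame(n = n, u = (ybars - mu) / (sigma / sqrt(n)))
}))
ggplot(standardized, aes(x = u)) +
  geom_density(color = "blue") +               # density of the standardized sample means
  geom_function(fun = dnorm, color = "red") +  # standard normal pdf
  facet_wrap(~ n, labeller = label_both)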