Laplace Approximation

Introduction

Bayesian statistical computation often requires us to compute a quantity in integral form \[ \mathbb{E}_{f}\,g(X) = \int g(x)f(x)\,dx \] where \(X\) is a random variable with probability density function \(f\). Similarly, the normalizing constant of the posterior distribution of a parameter of interest, \(\int p(y \mid \theta)\,\pi(\theta)\, d\theta\), is another essential quantity that needs to be computed. For tractable problems, these integrals can be evaluated exactly by recognizing a specific distribution that matches them. In most cases, however, no closed form exists and we must resort to approximation. In this blog, we will introduce an approximation method proposed by Laplace. The idea is that any smooth probability density that is sharply peaked around its maximum can be approximated by a normal density. In other words, we explore a class of alternative functions that are easy to work with and pick the member that approximates the true function with the smallest error.

Laplace Approximation

Technically, the Laplace approximation works quite well for functions in the \(\mathfrak{L}^2\) space, the space of real- or complex-valued measurable functions for which the integral of the square of the absolute value is finite: \[ g : \mathbb{R} \to \mathbb{C} \text{ square integrable } \Leftrightarrow \int_{-\infty}^{+\infty} |g(x)|^2\,dx < \infty \]

\(\mathfrak{L}^2\) (the space of square-integrable functions) is a class of functions that forms a complete metric space under the metric induced by the inner product. In other words, it is a Hilbert space, so it inherits all the familiar operations of a generalized Euclidean space.

The general idea of the Laplace approximation is to take a well-behaved unimodal function and approximate it with a Gaussian, which is a well-understood distribution and easy to handle.

Suppose we have a function \(g(x) \in \mathfrak{L}^{2}\) which attains its maximum at \(x_{0}\). We want to compute \[ \int_{a}^{b} g(x) d x \]

The Laplace approximation is nothing more than a simple two-term Taylor expansion of the \(\log\) p.d.f. Shall we feast our eyes on this magical trick. Let \(h(x)=\log g(x)\). Since \(\log\) is a monotonic function, \(h\) attains its maximum at the same point \(x_{0}\) as \(g\) does. We have \[ \int_{a}^{b} g(x)\, d x=\int_{a}^{b} \exp (h(x))\, d x \] From here we can take a Taylor series approximation of \(h(x)\) around the point \(x_{0}\) to give us \[ \int_{a}^{b} \exp (h(x))\, d x \approx \int_{a}^{b} \exp \left(h\left(x_{0}\right)+h^{\prime}\left(x_{0}\right)\left(x-x_{0}\right)+\frac{1}{2} h^{\prime \prime}\left(x_{0}\right)\left(x-x_{0}\right)^{2}\right) d x \] Because we assumed \(h(x)\) achieves its maximum at \(x_{0}\), we know \(h^{\prime}\left(x_{0}\right)=0\). Therefore, we can simplify the above expression to \[ =\int_{a}^{b} \exp \left(h\left(x_{0}\right)+\frac{1}{2} h^{\prime \prime}\left(x_{0}\right)\left(x-x_{0}\right)^{2}\right) d x \] Given that \(h\left(x_{0}\right)\) is a constant that doesn't depend on \(x\), we can pull it outside the integral. In addition, we can rearrange some of the terms to give us \[ =\exp \left(h\left(x_{0}\right)\right) \int_{a}^{b} \exp \left(-\frac{1}{2} \frac{\left(x-x_{0}\right)^{2}}{-h^{\prime \prime}\left(x_{0}\right)^{-1}}\right) d x \] You can see what this is similar to, right? Inside the integral we have a quantity that is proportional to a Gaussian density with mean \(x_{0}\) and variance \(-h^{\prime \prime}\left(x_{0}\right)^{-1}\) (note that \(h^{\prime\prime}(x_{0}) < 0\) at a maximum, so the variance is positive). At this point we are just one call to the pnorm() function in R away from approximating our integral; all that remains is to supply the normalizing constant. If we let \(\Phi\left(x \mid \mu, \sigma^{2}\right)\) be the cumulative distribution function of the Gaussian distribution with mean \(\mu\) and variance \(\sigma^{2}\) (and \(\varphi\) its density function), then we can write the above expression as \[ \begin{aligned} &=\exp \left(h\left(x_{0}\right)\right) \sqrt{\frac{2 \pi}{-h^{\prime \prime}\left(x_{0}\right)}} \int_{a}^{b} \varphi\left(x \mid x_{0},-h^{\prime \prime}\left(x_{0}\right)^{-1}\right) d x \\ &=\exp \left(h\left(x_{0}\right)\right) \sqrt{\frac{2 \pi}{-h^{\prime \prime}\left(x_{0}\right)}}\left[\Phi\left(b \mid x_{0},-h^{\prime \prime}\left(x_{0}\right)^{-1}\right)-\Phi\left(a \mid x_{0},-h^{\prime \prime}\left(x_{0}\right)^{-1}\right)\right] \end{aligned} \] Recall that \(\exp \left(h\left(x_{0}\right)\right)=g\left(x_{0}\right)\). If \(b=\infty\) and \(a=-\infty\), as is commonly the case, then the term in the square brackets equals 1, making the Laplace approximation equal to the value of the function \(g(x)\) at its mode multiplied by a constant that depends on the curvature of \(h\).
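To make the recipe concrete, here is a minimal sketch in R (the toy integrand and all variable names are illustrative, not from any package) that applies the formula above and compares the result with R's built-in integrate():

## A minimal sketch: Laplace approximation of the integral of
## g(x) = x^3 * exp(-x) over (0, Inf), whose exact value is Gamma(4) = 6
g  <- function(x) x^3 * exp(-x)
h  <- function(x) log(g(x))                            # h = log g
x0 <- optimize(h, c(0.1, 20), maximum = TRUE)$maximum  # mode of h (analytically x0 = 3)
hpp <- -3 / x0^2                                       # h''(x) = -3/x^2 at the mode (negative)

lo <- 0; hi <- Inf
laplace <- g(x0) * sqrt(2 * pi / -hpp) *
        (pnorm(hi, mean = x0, sd = sqrt(-1 / hpp)) -
         pnorm(lo, mean = x0, sd = sqrt(-1 / hpp)))
laplace                     # ~5.59
integrate(g, lo, hi)$value  # 6, the exact value, for comparison

The approximation is not exact here because the integrand is skewed, but it is already in the right neighborhood, and it only required the mode and the curvature at the mode.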

Laplace’s approximation is simple and elegant: all one needs is that the log-pdf is smooth and sharply peaked at its maximum (so that it falls off quickly from the peak to the tails), which makes the quadratic approximation a good one. To make it operational, we only need the location of the maximum (in our notation, \(x_{0}\)) and the curvature \(-h^{\prime \prime}\left(x_{0}\right)\) at that point. Hence, our original problem of computing a complex integral is transformed into a maximization problem. In order to compute the Laplace approximation, we have to find the location of the mode using an optimization technique such as Newton’s method. Suppose our parameter of interest is \(\theta\). In Newton’s method, you start from an initial guess \(\theta = \theta_0\) and iterate as follows: \[ \theta_{t+1} = \theta_{t} - \frac{h^{\prime}(\theta_{t})}{h^{\prime \prime}(\theta_{t})}, \quad t = 0, 1, 2, \dots \]

Then, after enough iterations, the sequence converges to the solution \(\hat{\theta}\). This is typically much faster and computationally cheaper than direct numerical integration.
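As a sketch of what this looks like in practice (the helper name newton_mode and the example function are ours, purely for illustration), here is a bare-bones Newton iteration in R, tried on \(h(x) = 3\log x - x\), whose maximum is at \(x = 3\):

## Bare-bones Newton's method for maximizing h; `hp` and `hpp` are the first
## and second derivatives of h
newton_mode <- function(hp, hpp, theta0, tol = 1e-8, maxit = 100) {
        theta <- theta0
        for (i in seq_len(maxit)) {
                step  <- hp(theta) / hpp(theta)
                theta <- theta - step
                if (abs(step) < tol) break
        }
        theta
}
hp  <- function(x) 3 / x - 1       # h'(x)  for h(x) = 3*log(x) - x
hpp <- function(x) -3 / x^2        # h''(x)
newton_mode(hp, hpp, theta0 = 1)   # converges to 3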

Posterior Mean Computation

In Bayesian computations, we often want to compute the posterior mean of a parameter given the observed data. Suppose \(y\) represents the data we observe, \(y\) comes from the distribution \(f(y \mid \theta)\) with parameter \(\theta\), and \(\theta\) has a prior distribution \(\pi(\theta)\). Then we usually want to compute the posterior distribution \(p(\theta \mid y)\) and its mean, \[ \mathbb{E}_{p}[\theta] = \int \theta\, p(\theta \mid y)\, d \theta \] We can then write \[ \begin{aligned} \int \theta\, p(\theta \mid y)\, d \theta &=\frac{\int \theta f(y \mid \theta) \pi(\theta)\, d \theta}{\int f(y \mid \theta) \pi(\theta)\, d \theta} \\ &=\frac{\int \theta \exp (\log f(y \mid \theta) \pi(\theta)) \,d \theta}{\int \exp (\log f(y \mid \theta) \pi(\theta))\, d \theta} \end{aligned} \] Here, we have used the exponential-log trick to produce a form to which we can apply the Laplace approximation.

If we let \(h(\theta) = \log f(y \mid \theta) \pi(\theta)\), then we can use the same Laplace approximation procedure described in the previous section. However, in order to do that we must know where \(h(\theta)\) achieves its maximum. Because \(h(\theta)\) is simply a monotonic transformation of a function proportional to the posterior density, we know that \(h(\theta)\) achieves its maximum at the posterior mode. Let \(\hat{\theta}\) be the posterior mode of \(p(\theta \mid y)\). Then we have

\[\begin{aligned} \int \theta\, p(\theta \mid y)\, d \theta & \approx \frac{\int \theta \exp \left(h(\hat{\theta})+\frac{1}{2} h^{\prime \prime}(\hat{\theta})(\theta-\hat{\theta})^{2}\right) d \theta}{\int \exp \left(h(\hat{\theta})+\frac{1}{2} h^{\prime \prime}(\hat{\theta})(\theta-\hat{\theta})^{2}\right) d \theta} \\ & = \frac{\int \theta \exp \left(\frac{1}{2} h^{\prime \prime}(\hat{\theta})(\theta-\hat{\theta})^{2}\right) d \theta}{\int \exp \left(\frac{1}{2} h^{\prime \prime}(\hat{\theta})(\theta-\hat{\theta})^{2}\right) d \theta} \\ & = \frac{\int \theta \sqrt{\frac{2 \pi}{-h^{\prime \prime}(\hat{\theta})}}\, \varphi\left(\theta \mid \hat{\theta}, -h^{\prime \prime}(\hat{\theta})^{-1}\right) d \theta}{\int \sqrt{\frac{2 \pi}{-h^{\prime \prime}(\hat{\theta})}}\, \varphi\left(\theta \mid \hat{\theta}, -h^{\prime \prime}(\hat{\theta})^{-1}\right) d \theta} \\ & = \hat{\theta} \end{aligned}\]

Hence, the Laplace approximation to the posterior mean is simply the posterior mode. This approximation is likely to work well when the posterior is unimodal and relatively symmetric around the mode. Furthermore, the more concentrated the posterior is around \(\hat{\theta}\), the better.

Example: Gamma-Poisson model

In this simple example, we will use one of the three most popular conjugate families in Bayesian statistics, the Gamma-Poisson model. In this model, the data/observations come from a Poisson distribution, and the prior and posterior are both Gamma distributions. For anyone who has not come across these conjugate families before, we will have a post to introduce them for you!!! Assume that \[ \begin{aligned} Y \mid \mu & \sim \operatorname{Poisson}(\mu) \\ \mu & \sim \operatorname{Gamma}(a, b) \end{aligned} \] where the Gamma density, written in terms of the gamma function \(\Gamma(z) = \int_{0}^{\infty} x^{z-1} e^{-x}\, dx\), with shape parameter \(a\) and scale parameter \(b\), is
\[ f(\mu)=\frac{1}{b^{a} \Gamma(a)} \mu^{a-1} e^{-\frac{\mu}{b}} \] Another common parameterization uses \(\alpha = a\) and the rate parameter \(\beta = 1/b\):

\[ f(\mu)=\frac{\beta^{\alpha}}{ \Gamma(\alpha)} \mu^{\alpha-1} e^{-\mu \beta} \]
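As a quick sanity check (the chosen shape, scale, and evaluation point below are arbitrary), R's dgamma() accepts either parameterization, via its scale and rate arguments, and the two give identical densities:

## dgamma() takes either `scale` or `rate` (= 1/scale); both give the same density
all.equal(dgamma(2.5, shape = 3, scale = 3),
          dgamma(2.5, shape = 3, rate = 1/3))
## [1] TRUE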

In this example, we will use the \((a,b)\) parameterization, but you should always check which parameterization is being used!!! (e.g., R's dgamma() defaults to the \((\alpha, \beta)\) rate parameterization, which is often more convenient in more complex models). Then, using Bayes' rule, we recognize the kernel of a Gamma distribution in the Gamma-Poisson family. The posterior is Gamma(\(a^\ast, b^\ast\)), where

\[a^\ast = a + \sum_{i=1}^{n} y_{i} \hspace{0.5cm} \text{and} \hspace{0.5cm} b^\ast = \frac{b}{nb + 1}, \hspace{0.5cm} \text{for observations } y_1, y_2, \dots, y_n.\]
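For completeness, here is the short derivation behind these formulas: multiplying the Poisson likelihood by the Gamma prior gives \[ p(\mu \mid y_1, \dots, y_n) \propto \left( \prod_{i=1}^{n} e^{-\mu} \mu^{y_i} \right) \mu^{a-1} e^{-\mu/b} = \mu^{\,a + \sum_{i} y_i - 1}\, e^{-\mu (n + 1/b)}, \] which is exactly the kernel of a Gamma density with shape \(a^\ast = a + \sum_{i} y_i\) and scale \(b^\ast = (n + 1/b)^{-1} = b/(nb+1)\).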

Given \(n = 1\) (we observe only one data point), the posterior simply has \(a^\ast = y + a\) and \(b^\ast = \frac{b}{1+b} = \frac{1}{1 + 1/b}\). If we observe \(Y = 2\), i.e. \(y = 2\), we can derive the posterior as follows, assuming the shape and scale of the Gamma prior are both \(3\):

posterior <- function(y, shape, scale) {
        ## Posterior quantities for the Gamma-Poisson model with one observation y,
        ## prior shape `shape` and prior scale `scale`
        if (is.null(y))
                stop("an observation 'y' is required")
        post.scale <- 1 / (1 + 1 / scale)          # b* = b / (1 + b)
        pdf <- dgamma(y, shape = y + shape,        # posterior density evaluated at y
                      scale = post.scale)
        post.mode <- (y + shape - 1) * post.scale  # (a* - 1) * b*
        post.mean <- (y + shape) * post.scale      # a* * b*
        return(list("pdf" = pdf, "mode" = post.mode, "mean" = post.mean))
}

## Returns the posterior density p(mu | y) as a function of mu
make_post <- function(y, shape, scale) {
        function(x) {
                dgamma(x, shape = y + shape,
                       scale = 1 / (1 + 1 / scale))
        }
}

set.seed(28)
y <- 2
prior.shape <- 3
prior.scale <- 3
# p <- posterior(y, prior.shape, prior.scale)$pdf
p <- make_post(y, prior.shape, prior.scale)
curve(p, 0, 12, n = 1000, lwd = 3, xlab = expression(mu),
      ylab = expression(paste("p(", mu, " | y)")))
curve(dgamma(x, shape = prior.shape, scale = prior.scale), add = TRUE,
      lty = 2)
legend("topright", legend = c("Posterior", "Prior"), lty = c(1, 2), lwd = c(3, 1), bty = "n")

Because the posterior in the figure above is right-skewed, it is clear that the mean and the mode will not coincide.

Given that the posterior is a Gamma distribution, it is easy to derive the posterior mode and mean using the following basic facts.

Let \(\mu \sim \operatorname{Gamma}(a, b)\), with shape \(a\) and scale \(b\); then

\[\begin{aligned} \mathbb{E}(\mu) & = ab \\ \operatorname{Mode}(\mu) & = (a-1)\,b \\ \operatorname{Var}(\mu) & = ab^2 \end{aligned}\]

Hence, we have closed-form expressions for the posterior mode and posterior mean:

pmode <- posterior(y, prior.shape, prior.scale)$mode
pmode
## [1] 3
pmean <- posterior(y, prior.shape, prior.scale)$mean
pmean
## [1] 3.75
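To connect this with the optimization step described earlier, we can also locate the posterior mode numerically from the unnormalized log posterior; here is a small sketch using optimize() (the helper name log.post is ours, and Newton's method as above would give the same answer):

## Unnormalized log posterior: (y + a - 1) * log(mu) - mu * (1 + 1/b)
log.post <- function(mu) (y + prior.shape - 1) * log(mu) - mu * (1 + 1 / prior.scale)
optimize(log.post, interval = c(0, 20), maximum = TRUE)$maximum  # ~3, matching pmode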

We will now examine what the Laplace approximation to the posterior looks like in this case. First, we can compute the gradient and Hessian of the Gamma density.

a <- prior.shape
b <- prior.scale
## Value, gradient and Hessian of the posterior Gamma(y + a, scale = 1/(1 + 1/b)) density
fhat <- deriv3(~ mu^(y + a - 1) * exp(-mu * (1 + 1/b)) / ((1/(1+1/b))^(y+a) * gamma(y + a)),
               "mu", function.arg = TRUE)

Then we can compute the quadratic approximation to the density via the lapprox() function below.

post.shape <- y + prior.shape                    # a* = y + a
post.scale <- 1 / (length(y) + 1 / prior.scale)  # b* = 1 / (n + 1/b)
lapprox <- Vectorize(function(mu, mu0 = pmode) {
        deriv <- fhat(mu0)
        grad <- attr(deriv, "gradient")          # f'(mu0)
        hess <- drop(attr(deriv, "hessian"))     # f''(mu0)
        f <- function(x) dgamma(x, shape = post.shape, scale = post.scale)
        ## h''(mu0) for h = log f, via (f'' f - (f')^2) / f^2
        hpp <- (hess * f(mu0) - grad^2) / f(mu0)^2
        ## quadratic (Laplace) approximation on the log scale around mu0
        exp(log(f(mu0)) + 0.5 * hpp * (mu - mu0)^2)
}, "mu")
curve(p, 0, 12, n = 1000, lwd = 3, col = 2, xlab = expression(mu),
      ylab = expression(paste("p(", mu, " | y)")))
curve(dgamma(x, shape = prior.shape, scale = prior.scale), add = TRUE,
      lty = 2)
legend("topright", 
       legend = c("Posterior Density", "Prior Density", "Laplace Approx"), 
       lty = c(1, 2, 1), lwd = c(3, 1, 1), col = c(2, 1, 3), bty = "n")
curve(lapprox, 0.001, 12, n = 1000, add = TRUE, col = 3, lwd = 2)

The solid green and red curves are the Laplace approximation and the true posterior Gamma(5, 0.75), respectively. We can see that in the neighborhood of the mode the approximation is reasonable. However, as we move farther from the mode, the approximation deteriorates because the right tail of the Gamma is heavier than that of the Gaussian.

Obviously, this Laplace approximation is done with only a single observation. One would expect the approximation to improve as the sample size increases. In this case, with respect to the posterior mode as an approximation to the posterior mean, we can see that the difference between the two is simply \[ \hat{\theta}_{mean} - \hat{\theta}_{mode} = \frac{1}{n + 1/b} = \frac{b}{bn + 1} \]

which converges to zero as \(n \to \infty\). For our example with \(b = 3\) and \(n = 1\), this difference is \(3/4 = 0.75\), exactly the gap between the posterior mean (3.75) and the posterior mode (3) computed above.

It is worth noting that Laplace's technique only gives a way to obtain a bell curve that approximates the posterior pdf; it says nothing by itself about the quality of the approximation. However, there are general guarantees that such an approximation is actually very good when the model is a "regular" one, the prior pdf is smooth, and the sample size \(n\) is large. We state the following result (without technical details, since they require advanced measure theory), which is the analogue of the asymptotic normality result for the MLE.

(Bernstein-von Mises Theorem) Consider the model \(X_{1}, \dots, X_{n} \stackrel{\text{iid}}{\sim} g\left(x_{i} \mid \theta\right)\), \(\theta \in \Theta\). Under some regularity conditions on the pdfs/pmfs \(g(\cdot \mid \theta)\), including that all of them have the same support and that for each \(x_{i}\), \(\theta \mapsto \log g\left(x_{i} \mid \theta\right)\) is twice continuously differentiable, we have that for any prior \(\xi(\theta)\) which is positive, bounded and twice differentiable over \(\Theta\), \[ \sup _{z}\left|P(\theta \leq z \mid X=x)-\Phi\left(\{-\ddot{q}(\hat{\theta})\}^{1 / 2}(z-\hat{\theta})\right)\right| \approx 0 \] for all large \(n\), where \(q(\theta)\) denotes the log posterior density and \(\hat{\theta}\) its maximizer (the posterior mode).

Under the same regularity conditions it turns out that \(\hat{\theta} \approx \hat{\theta}_{\mathrm{MLE}}(x)\) and that \(-\ddot{q}(\hat{\theta}) \approx I_{\mathrm{OBS}}(x)\), the observed Fisher information.
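As an aside, we can see this at work in our Gamma-Poisson example. The sketch below (the simulated data and the variable names are purely illustrative) overlays the exact Gamma posterior based on a larger sample on a normal curve centred at the posterior mode with variance \(-\ddot{q}(\hat{\theta})^{-1}\); with \(n = 50\) observations the two are already hard to tell apart.

## Bernstein-von Mises in action for the Gamma-Poisson model: with more data the
## Gamma posterior is closely matched by a normal centred at the posterior mode
set.seed(28)
n  <- 50
yy <- rpois(n, lambda = 3)                    # simulated counts (illustrative)
a.star <- prior.shape + sum(yy)               # posterior shape
b.star <- 1 / (n + 1 / prior.scale)           # posterior scale
mode.star <- (a.star - 1) * b.star            # posterior mode
var.star  <- (a.star - 1) * b.star^2          # -1 / q''(mode) for the Gamma log-density
curve(dgamma(x, shape = a.star, scale = b.star), 2, 4.5, n = 1000, lwd = 3,
      xlab = expression(mu), ylab = expression(paste("p(", mu, " | y)")))
curve(dnorm(x, mean = mode.star, sd = sqrt(var.star)), add = TRUE, col = 3, lwd = 2)
legend("topright", legend = c("Posterior (n = 50)", "Normal approximation"),
       lwd = c(3, 2), col = c(1, 3), bty = "n")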