This document provides a comprehensive explanation of how INLA (Integrated Nested Laplace Approximation) works, based on a textbook example of normally distributed data with independent prior distributions.
INLA is a computational method for Bayesian inference that avoids Markov Chain Monte Carlo (MCMC) by using:
The key advantage is speed: INLA can be hundreds or thousands of times faster than MCMC for certain classes of models.
The Laplace approximation is fundamentally a second-order Taylor expansion around the mode of a log-posterior (or log-likelihood), which yields a Gaussian approximation.
For a probability density function \(p(\theta)\), we approximate \(\log p(\theta)\) using a Taylor series around the mode \(\hat{\theta}\):
\[\log p(\theta) \approx \log p(\hat{\theta}) + \underbrace{\frac{d}{d\theta}\log p(\theta)\bigg|_{\hat{\theta}}}_{=0 \text{ at mode}} + \frac{1}{2}\frac{d^2}{d\theta^2}\log p(\theta)\bigg|_{\hat{\theta}}(\theta - \hat{\theta})^2\]
The first derivative term vanishes because we’re at the maximum (mode). This leaves:
\[\log p(\theta) \approx \log p(\hat{\theta}) - \frac{1}{2}|H|(\theta - \hat{\theta})^2\]
Where \(|H| = -\frac{d^2}{d\theta^2}\log p(\theta)\big|_{\hat{\theta}}\) is the negative Hessian (observed Fisher information).
Exponentiating gives:
\[p(\theta) \approx p(\hat{\theta}) \exp\left(-\frac{1}{2}|H|(\theta - \hat{\theta})^2\right)\]
Which is proportional to a Normal distribution with: - Mean = \(\hat{\theta}\) (the mode) - Variance = \(1/|H|\) (inverse of the negative second derivative)
INLA takes this further with nested approximations:
The key insight from this document is that INLA chooses \(\theta = \theta_n\) (the mode of \(p(\theta|\psi,y)\)) to evaluate \(p(\psi|y)\) efficiently, using the fact that the denominator \(p(\theta_n|\psi,y)\) has a simple Normal form.
This is why in the R code later,
den[h] <- dnorm(theta.n[h], theta.n[h], sd = sqrt(sigma2.n[h]))
works - it’s evaluating the normalizing constant of the
Laplace-approximated conditional posterior!
We have normally distributed data:
\[y_i \mid \mu, \sigma^2 \sim \text{Normal}(\mu, \sigma^2), \quad i = 1, \dots, n\]
Where:
We choose independent priors:
Note: INLA typically works with precision (\(\psi = 1/\sigma^2\)) rather than variance because precisions have nicer mathematical properties in Gaussian models.
We want the marginal posterior distributions:
\[p(\psi \mid \mathbf{y}) = \int p(\theta, \psi \mid \mathbf{y}) d\theta\]
\[p(\theta \mid \mathbf{y}) = \int p(\theta, \psi \mid \mathbf{y}) d\psi\]
Direct integration is computationally challenging. INLA solves this with a clever trick.
The joint posterior is:
\[p(\theta, \psi \mid \mathbf{y}) \propto p(\mathbf{y} \mid \theta, \psi) \times p(\theta) \times p(\psi)\]
INLA uses the relationship:
\[p(\psi \mid \mathbf{y}) = \frac{p(\theta, \psi \mid \mathbf{y})}{p(\theta \mid \psi, \mathbf{y})} \propto \frac{p(\mathbf{y} \mid \theta, \psi) \times p(\theta) \times p(\psi)}{p(\theta \mid \psi, \mathbf{y})}\]
This holds for any value of \(\theta\)!
For normal data with normal prior, the conditional posterior is:
\[\theta \mid \psi, \mathbf{y} \sim \text{Normal}(\theta_n, \sigma_n^2)\]
Where:
\[\theta_n = \frac{\psi \sum_{i=1}^n y_i + \mu_0 / \sigma_0^2}{n\psi + 1/\sigma_0^2}\]
\[\sigma_n^2 = \frac{1}{n\psi + 1/\sigma_0^2}\]
By choosing \(\theta = \theta_n\) (the conditional posterior mode/mean), the denominator becomes:
\[p(\theta_n \mid \psi, \mathbf{y}) = \frac{1}{\sqrt{2\pi \sigma_n^2}}\]
\[p(\psi \mid \mathbf{y}) \propto \left. \frac{p(\mathbf{y} \mid \theta, \psi) p(\theta) p(\psi)}{p(\theta \mid \psi, \mathbf{y})} \right|_{\theta = \theta_n} = \frac{p(\mathbf{y} \mid \theta_n, \psi) \times p(\theta_n) \times p(\psi)}{1 / \sqrt{2\pi \sigma_n^2}}\]
The derivation of \(\theta_n\) comes directly from Bayes’ theorem:
\[p(\theta \mid \psi, \mathbf{y}) \propto p(\mathbf{y} \mid \theta, \psi) \times p(\theta)\]
\[p(\mathbf{y} \mid \theta, \psi) = \prod_{i=1}^n \sqrt{\frac{\psi}{2\pi}} \exp\left(-\frac{\psi}{2}(y_i - \theta)^2\right)\]
\[p(\theta) = \frac{1}{\sqrt{2\pi \sigma_0^2}} \exp\left(-\frac{1}{2\sigma_0^2}(\theta - \mu_0)^2\right)\]
Multiplying and completing the square gives:
\[p(\theta \mid \psi, \mathbf{y}) \propto \exp\left(-\frac{1}{2}\left[(n\psi + 1/\sigma_0^2)\theta^2 - 2\theta(\psi \sum y_i + \mu_0 / \sigma_0^2)\right]\right)\]
This is a Normal distribution with:
Mean (\(\theta_n\)): \[\theta_n = \frac{\psi \sum_{i=1}^n y_i + \mu_0 / \sigma_0^2}{n\psi + 1/\sigma_0^2}\]
Variance (\(\sigma_n^2\)): \[\sigma_n^2 = \frac{1}{n\psi + 1/\sigma_0^2}\]
\(\theta_n\) is a weighted average of:
\[\theta_n = \frac{(n\psi) \cdot \bar{y} + (1/\sigma_0^2) \cdot \mu_0}{n\psi + 1/\sigma_0^2}\]
Answer: \(\theta_n\) is the conditional posterior mode/mean of \(\theta\) given \(\psi\) and the data.
\[\theta_n = \frac{\psi \sum_{i=1}^n y_i + \mu_0 / \sigma_0^2}{n\psi + 1/\sigma_0^2}\]
Intuition: \(\theta_n\) combines two sources of information: - The data (through \(\bar{y}\)) - The prior (through \(\mu_0\))
Examples: - When \(\psi\) is large (precise data), \(\theta_n \approx \bar{y}\) (data dominates) - When \(\sigma_0^2\) is large (vague prior), \(\theta_n \approx \bar{y}\) (prior is weak) - When \(\sigma_0^2\) is small (informative prior), \(\theta_n\) is pulled toward \(\mu_0\)
Answer: This is Bayes’ theorem applied conditionally on \(\psi\).
Step-by-step derivation:
Answer: We discretize \(\psi\) to perform numerical integration. This does NOT mean we know \(\psi\)’s values!
Analogy: Think of it like a treasure hunt: - You don’t know where the treasure is (the true \(\psi\)) - But you decide to search at specific locations (grid points) - At each location, you measure “how likely” the treasure is there - Then you use these measurements to infer where the treasure probably is
Why we need a grid: - The integral \(\int p(\theta, \psi \mid \mathbf{y}) d\theta\) has no closed-form solution - Numerical integration requires evaluating the function at specific points
Grid considerations: - Too few points: May miss important features - Too many points: Computationally expensive - Wrong range: May miss the posterior entirely
Answer: The unnormalized posterior
post.psi is proportional to the true density:
\[\text{post.psi} = c \times p(\psi \mid \mathbf{y})\]
Where \(c\) is an unknown constant, and the true density satisfies \(\int p(\psi \mid \mathbf{y}) d\psi = 1\).
Therefore:
\[\int \text{post.psi} \, d\psi = \int c \times p(\psi \mid \mathbf{y}) d\psi = c \times 1 = c\]
So the definite integral of the unnormalized density equals the normalizing constant! Then we can recover the true density by division: \(\frac{\text{post.psi}}{c} = p(\psi \mid \mathbf{y})\).
Answer: This converts the density to a probability mass function for numerical integration.
Explanation: We’re approximating the continuous integral:
\[p(\theta \mid \mathbf{y}) = \int p(\theta \mid \psi, \mathbf{y}) p(\psi \mid \mathbf{y}) d\psi\]
As a Riemann sum:
\[p(\theta \mid \mathbf{y}) \approx \sum_j p(\theta \mid \psi^{(j)}, \mathbf{y}) \times [p(\psi^{(j)} \mid \mathbf{y}) \times \Delta \psi_j]\]
The code combines steps: - post.psi is the density \(p(\psi^{(j)} \mid \mathbf{y})\) -
Multiplying by Delta = 1/sum(post.psi) converts it to
probability mass
| Method | Type | Role |
|---|---|---|
| INLA Package | Professional software | Reference implementation |
| Manual (Step-by-step) | Our implementation | Validates our understanding |
| MCMC/OpenBUGS | Gold standard | Validates both above |
If all three match (which they should for this model), we can be confident that: - Our understanding of INLA is correct - Our manual implementation is correct - The INLA approximation is accurate
# Load the data
y <- c(1.2697, 7.7637, 2.2532, 3.4557, 4.1776, 6.4320, -3.6623, 7.7567,
5.9032, 7.2671, -2.3447, 8.0160, 3.5013, 2.8495, 0.6467, 3.2371,
5.8573, -3.3749, 4.1507, 4.3092, 11.7327, 2.6174, 9.4942, -2.7639,
-1.5859, 3.6986, 2.4544, -0.3294, 0.2329, 5.2846)
n <- length(y)
ybar <- mean(y)
# Prior parameters
mu0 <- -3 # prior mean for μ
sigma2_0 <- 4 # prior variance for μ
a <- 1.6 # gamma shape
b <- 0.4 # gamma rate
# Grid parameters
H <- 25 # Number of grid points
psi.min <- 0.001 # Minimum precision
psi.max <- 0.3 # Maximum precision
# Create grid
psi.grid <- seq(psi.min, psi.max, length.out = H)
# Prior for psi at grid points
hprior <- dgamma(psi.grid, shape = a, rate = b)
# Initialize vectors
theta.n <- numeric(H)
sigma2.n <- numeric(H)
prior <- numeric(H)
lik <- numeric(H)
num <- numeric(H)
den <- numeric(H)
for (h in 1:H) {
# Step 1: Compute θₙ (conditional posterior mode)
theta.n[h] <- (psi.grid[h] * n * ybar + mu0 / sigma2_0) /
(psi.grid[h] * n + 1 / sigma2_0)
# Step 2: Compute σₙ² (conditional posterior variance)
sigma2.n[h] <- 1 / (n * psi.grid[h] + 1 / sigma2_0)
# Step 3: Prior for θ at θₙ
prior[h] <- dnorm(theta.n[h], mu0, sd = sqrt(sigma2_0))
# Step 4: Likelihood of data given θₙ and ψ
lik[h] <- prod(dnorm(y, theta.n[h], sd = 1 / sqrt(psi.grid[h])))
# Step 5: Numerator = prior(ψ) × prior(θ) × likelihood
num[h] <- hprior[h] * prior[h] * lik[h]
# Step 6: Denominator = p(θₙ|ψ,y) at mode = 1/√(2πσₙ²)
den[h] <- dnorm(theta.n[h], theta.n[h], sd = sqrt(sigma2.n[h]))
}
# Unnormalized posterior for ψ
post.psi <- num / den
# Create interpolation function
f.psi <- approxfun(psi.grid, post.psi,
yleft = min(psi.grid),
yright = max(psi.grid))
# Integrate to find normalizing constant
const <- integrate(f.psi, min(psi.grid), max(psi.grid))
# Normalize
post.psi <- post.psi / const$value
cat("Normalizing constant:", const$value, "\n")
## Normalizing constant: 9.169797e-42
cat("Posterior integrates to:", sum(post.psi) * diff(psi.grid)[1], "\n")
## Posterior integrates to: 0.9902876
# Create grid for θ
J <- 50
theta.grid <- seq(-8, 5, length.out = J)
# Full conditional distributions p(θ|ψ,y)
full.cond.theta <- matrix(NA, J, H)
for (j in 1:J) {
for (h in 1:H) {
full.cond.theta[j, h] <- dnorm(theta.grid[j],
theta.n[h],
sd = sqrt(sigma2.n[h]))
}
}
# Convert density to probability mass for numerical integration
Delta <- 1 / sum(post.psi)
# Weighted joint posterior
joint.post.theta.psi <- matrix(NA, J, H)
for (h in 1:H) {
joint.post.theta.psi[, h] <- full.cond.theta[, h] * post.psi[h] * Delta
}
# Marginal posterior for θ (sum over ψ)
marg.post.theta <- rowSums(joint.post.theta.psi)
# Normalize
f.theta <- approxfun(theta.grid, marg.post.theta,
yleft = min(theta.grid),
yright = max(theta.grid))
const_theta <- integrate(f.theta, min(theta.grid), max(theta.grid))
marg.post.theta <- marg.post.theta / const_theta$value
plot(psi.grid, post.psi, type = "l", lwd = 2,
xlab = expression(psi), ylab = "Density",
main = "Marginal Posterior of ψ",
col = "blue")
grid()
Marginal Posterior of ψ
plot(theta.grid, full.cond.theta[, 1], type = "l", lwd = 0.7,
xlab = expression(theta), ylab = "Density",
main = "Full Conditional Distributions p(θ | ψ, y)",
ylim = range(full.cond.theta))
for (h in 1:H) {
lines(theta.grid, full.cond.theta[, h], lwd = 0.7)
}
grid()
Full Conditional Distributions p(θ | ψ, y)
plot(theta.grid, marg.post.theta, type = "l", lwd = 2,
xlab = expression(theta), ylab = "Density",
main = "Marginal Posterior of θ",
col = "darkred")
grid()
Marginal Posterior of θ
# Load INLA (uncomment to install)
# install.packages("INLA", repos = "https://inla.r-inla-download.org/R/stable")
library(INLA)
# Run INLA
formula <- y ~ 1
inla.output <- inla(formula,
data = data.frame(y = y),
control.family = list(
hyper = list(
prec = list(prior = "loggamma",
param = c(a, b))
)
),
control.fixed = list(
mean.intercept = mu0,
prec.intercept = 1 / sigma2_0
),
control.compute = list(config = TRUE),
verbose = FALSE)
# Extract results
inla_psi <- inla.output$marginals.hyperpar[[1]]
inla_theta <- inla.output$marginals.fixed$"(Intercept)"
# Interpolate INLA results to manual grid
inla_psi_interp <- approx(inla_psi[, 1], inla_psi[, 2],
xout = psi.grid)$y
inla_theta_interp <- approx(inla_theta[, 1], inla_theta[, 2],
xout = theta.grid)$y
par(mfrow = c(1, 2))
# ψ posterior comparison
plot(psi.grid, post.psi, type = "l", lwd = 2, col = "blue",
xlab = expression(psi), ylab = "Density",
main = expression("Posterior of " * psi),
ylim = range(c(post.psi, inla_psi_interp), na.rm = TRUE))
lines(psi.grid, inla_psi_interp, col = "red", lwd = 2, lty = 2)
legend("topright",
legend = c("Manual", "INLA Package"),
col = c("blue", "red"), lty = c(1, 2), lwd = 2)
# θ posterior comparison
plot(theta.grid, marg.post.theta, type = "l", lwd = 2, col = "blue",
xlab = expression(theta), ylab = "Density",
main = expression("Posterior of " * theta),
ylim = range(c(marg.post.theta, inla_theta_interp), na.rm = TRUE))
lines(theta.grid, inla_theta_interp, col = "red", lwd = 2, lty = 2)
legend("topright",
legend = c("Manual", "INLA Package"),
col = c("blue", "red"), lty = c(1, 2), lwd = 2)
par(mfrow = c(1, 1))
# Error metrics for ψ
mae_psi <- mean(abs(post.psi - inla_psi_interp), na.rm = TRUE)
rmse_psi <- sqrt(mean((post.psi - inla_psi_interp)^2, na.rm = TRUE))
cor_psi <- cor(post.psi, inla_psi_interp, use = "complete.obs")
# Error metrics for θ
mae_theta <- mean(abs(marg.post.theta - inla_theta_interp), na.rm = TRUE)
rmse_theta <- sqrt(mean((marg.post.theta - inla_theta_interp)^2, na.rm = TRUE))
cor_theta <- cor(marg.post.theta, inla_theta_interp, use = "complete.obs")
# Summary table
comparison <- data.frame(
Parameter = c("ψ (precision)", "θ (mean)"),
MAE = c(mae_psi, mae_theta),
RMSE = c(rmse_psi, rmse_theta),
Correlation = c(cor_psi, cor_theta)
)
knitr::kable(comparison, digits = 6,
caption = "Comparison between Manual Implementation and INLA Package")
| Parameter | MAE | RMSE | Correlation |
|---|---|---|---|
| ψ (precision) | 0.136572 | 0.162064 | 0.999973 |
| θ (mean) | 0.002716 | 0.003712 | 0.999899 |
Save this as “modelWorkingINLA_notcancel.txt”:
model {
# Likelihood: Normal distribution for the data
for (i in 1:n) {
y[i] ~ dnorm(theta, psi)
}
# Prior for theta (mean)
theta ~ dnorm(mu0, tau0)
# Prior for psi (precision)
psi ~ dgamma(a, b)
# Optional: Transform psi to variance for easier interpretation
sigma2 <- 1/psi
}
library(R2OpenBUGS)
working.dir <- getwd()
filein <- "modelWorkingINLA_notcancel.txt"
params <- c("theta","psi")
dataJags <- list(y=y,mu0=mu0,tau0=1/sigma2_0,a=a,b=b,n=n)
inits <- function(){
list(theta=rnorm(1),psi=runif(1))
}
n.iter <- 10000
n.burnin <- 2000
n.thin <- floor((n.iter-n.burnin)/500)
mcmc <- bugs(dataJags, inits, params, model.file=filein,
n.chains=2, n.iter, n.burnin, n.thin,
DIC=TRUE, working.directory=working.dir, debug=FALSE)
psi.mcmc <- mcmc$sims.list$psi
theta.mcmc <- mcmc$sims.list$theta
# ***
plot(inla.output$marginals.hyperpar[[1]], t="l",lwd=2, xlab=expression(sigma^2), ylab="",
ylim=range(inla.output$marginals.hyperpar[[1]]))
lines(spline(psi.grid,post.psi),lty=2) #step by step
lines(density(psi.mcmc),lty=3,lwd=3,col='red') #MCMC
legend("topright",c("INLA","Step-by-step","MCMC"),lty=c(1,2,3),lwd=c(2,1,3),cex=2,bty="n")
rg <- range(inla.output$marginals.fixed$"(Intercept)"[,2])
plot(inla.output$marginals.fixed$"(Intercept)",t="l",xlab=expression(mu), ylab="",ylim=rg,lwd=2)
lines(theta.grid,apply(joint.post.theta.psi,1,sum),lty=2)#Step by step
lines(density(theta.mcmc),lty=3,lwd=3,col='red')#MCMC
legend("topright",c("INLA","Step-by-step","MCMC"),lty=c(1,2,3), lwd=c(2,1,3), cex=2,bty="n")
`
par(mfrow = c(1, 2))
# ψ comparison
plot(inla_psi, type = "l", lwd = 2,
xlab = expression(psi), ylab = "Density",
main = expression("Posterior of " * psi),
ylim = range(c(post.psi, inla_psi[, 2], density(psi.mcmc)$y)))
lines(spline(psi.grid, post.psi), lty = 2, lwd = 2)
lines(density(psi.mcmc), lty = 3, lwd = 2)
legend("topright",
legend = c("INLA", "Manual", "MCMC"),
lty = c(1, 2, 3), lwd = 2)
# θ comparison
plot(inla_theta, type = "l", lwd = 2,
xlab = expression(theta), ylab = "Density",
main = expression("Posterior of " * theta),
ylim = range(c(marg.post.theta, inla_theta[, 2], density(theta.mcmc)$y)))
lines(theta.grid, marg.post.theta, lty = 2, lwd = 2)
lines(density(theta.mcmc), lty = 3, lwd = 2)
legend("topright",
legend = c("INLA", "Manual", "MCMC"),
lty = c(1, 2, 3), lwd = 2)
| Property | Value |
|---|---|
| Full name | Conditional posterior mode/mean of \(\theta\) |
| What it represents | Best estimate of \(\mu\) given \(\psi\) and data |
| Formula | Weighted average of \(\bar{y}\) and \(\mu_0\) |
| Weights | Precisions (\(n\psi\) for data, \(1/\sigma_0^2\) for prior) |
| Role in INLA | Convenient point to evaluate the posterior |
| Aspect | Explanation |
|---|---|
| Purpose | Approximate continuous integration with discrete points |
| Why needed | No closed-form solution for \(p(\psi \mid \mathbf{y})\) |
| How it works | Evaluate posterior at H strategic points |
| Key insight | Grid points are hypotheses, not known values! |
| Step | What it does | Why it works |
|---|---|---|
| Compute post.psi | Gets unnormalized density \(c \times p(\psi \mid \mathbf{y})\) | INLA gives relative probabilities |
| Integrate | Finds \(\int \text{post.psi} d\psi = c\) | Integral of unnormalized density = normalizing constant |
| Divide | Computes \(\text{post.psi} / c = p(\psi \mid \mathbf{y})\) | Division yields true probability density |