Background

Example:

  • Suppose one performs \(n\) Bernoulli trials, each with probability \(\theta\) of success. Given the outcomes of the \(n\) trials, we are interested in the probability that \(\theta\) lies between \(a\) and \(b\).

  • Here \(Y_{i} \mid \theta \stackrel{iid}\sim \mathtt{Bernoulli}(\theta)\), \(i=1, \ldots,n\). Let the observations be \(y_{1}, \ldots, y_{n}\).

  • Suppose a \(\mathtt{Uniform}(0,1)\) prior is placed on \(\theta\), i.e., \(\pi(\theta)=1\) for \(0<\theta<1\). Then \[P(\theta \in (a,b) \mid y_{1}, \ldots, y_{n})=\{\mathtt{Beta}(x+1, n-x+1) \}^{-1} \int_{a}^{b} \theta^{x} (1-\theta)^{n-x}d\theta, \tag{**} \] where \(x=\sum_{i=1}^{n} y_{i}\) and \(\mathtt{Beta}(x+1, n-x+1)= \int_{0}^{1} \theta^{x} (1-\theta)^{n-x}d\theta\). A quick numerical check of (**) is sketched below.
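With a \(\mathtt{Uniform}(0,1)\) prior the posterior of \(\theta\) is \(\mathtt{Beta}(x+1, n-x+1)\), so (**) is simply a difference of Beta CDFs. A minimal sketch (the values of \(n\), \(x\), \(a\), \(b\) below are illustrative, not taken from these notes):

```python
# Numerical check of (**): the posterior of theta is Beta(x + 1, n - x + 1), so the
# interval probability is a difference of Beta CDFs. n, x, a, b are illustrative.
from scipy.stats import beta

n, x = 20, 12        # number of trials and number of successes
a, b = 0.4, 0.7      # interval of interest for theta

posterior = beta(x + 1, n - x + 1)
print(f"P({a} < theta < {b} | data) = {posterior.cdf(b) - posterior.cdf(a):.4f}")
```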


The Problem of Computation


The Computational Solutions

Numerical Integration: In order to compute \[E_{\boldsymbol{\theta} \mid {\bf y}}\left[g(\boldsymbol{\theta}) \right]=\int g(\boldsymbol{\theta}) \pi(\boldsymbol{\theta}\mid {\bf y}) d\boldsymbol{\theta}, \tag{I}\] one uses a numerical approximation method to evaluate the integral. Such methods are mainly based on splitting the range of \(\boldsymbol{\theta}\) into \(L\) grid points and estimating the integral by a weighted sum of the integrand evaluated at these \(L\) points; a grid-based sketch for the Bernoulli example is given below.
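A minimal grid-based sketch of (I) for the Bernoulli example, taking \(g(\theta)=\theta\) so that the exact answer \((x+1)/(n+2)\) is available for comparison (the choices of \(n\), \(x\) and \(L\) are illustrative):

```python
# Grid-based numerical integration of (I) for the Bernoulli example:
# E[g(theta) | y] is approximated by a weighted (trapezoidal) sum over L grid points.
import numpy as np
from scipy.special import betaln

n, x, L = 20, 12, 200
theta = np.linspace(1e-6, 1 - 1e-6, L)               # L grid points over the range of theta
log_post = x * np.log(theta) + (n - x) * np.log(1 - theta) - betaln(x + 1, n - x + 1)
post = np.exp(log_post)                              # Beta(x+1, n-x+1) posterior density

g = theta                                            # g(theta) = theta, i.e. the posterior mean
estimate = np.trapz(g * post, theta)                 # weighted sum of the L grid values
print(f"grid estimate = {estimate:.4f}, exact = {(x + 1) / (n + 2):.4f}")
```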

Simulation Methods: These methods generate \(L\) draws of \(g(\boldsymbol{\theta})\) from the posterior and use a (possibly weighted) mean of these draws as an estimate of \(E_{\boldsymbol{\theta} \mid {\bf y}}\left[g(\boldsymbol{\theta}) \right]\); a minimal Monte Carlo sketch is given below.
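A minimal simulation counterpart for the same quantity, drawing directly from the \(\mathtt{Beta}(x+1, n-x+1)\) posterior and using an unweighted mean (\(n\), \(x\), \(L\) are illustrative):

```python
# Monte Carlo estimate of E[theta | y]: draw L posterior samples and average them.
import numpy as np

rng = np.random.default_rng(0)
n, x, L = 20, 12, 10_000
draws = rng.beta(x + 1, n - x + 1, size=L)    # L posterior draws of theta
print(f"Monte Carlo estimate = {draws.mean():.4f}")
```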

Deterministic Approximations: These methods replace the integrand in \(E_{\boldsymbol{\theta} \mid {\bf y}}\left[g(\boldsymbol{\theta}) \right]\) by an approximation of some sort. The replacement is done in such a way that the approximation error is small (at least asymptotically) and the resulting integral can be computed.


A Short Overview of Other Methods


Why Is Deterministic Approximation Relevant?


Some Deterministic Approximations


I. Asymptotic normality of the posterior distribution under a finite-dimensional setup:

The Bernstein-von Mises theorem.

Suppose there exists a true value \(\boldsymbol{\theta}_{0}\) from which the data are generated. Let \(l({\bf y}, \boldsymbol{\theta})= \log p({\bf y}\mid \boldsymbol{\theta})\) be the log-likelihood function, \({\bf I}(\boldsymbol{\theta}_{0})\) be the Fisher information matrix evaluated at \(\boldsymbol{\theta}_{0}\), and \(\widehat{\bf I}_{n}(\widehat{\boldsymbol{\theta}}_{n})=- n^{-1} \sum_{i=1}^{n}\nabla_{\boldsymbol{\theta}}^{2} l({\bf y}_{i}, \boldsymbol{\theta}) |_{\boldsymbol{\theta}=\widehat{\boldsymbol{\theta}}_{n}}\) be the empirical Fisher information matrix evaluated at the MLE \(\widehat{\boldsymbol{\theta}}_{n}\).

Consider the following regularity conditions:

(A1) The parameter space \(\Theta\subseteq \mathbb{R}^{D}\).

(A2) The support \(\{{\bf y}: p({\bf y} \mid \boldsymbol{\theta})>0 \}\) does not depend on \(\boldsymbol{\theta}\).

(A3) There exists a fixed neighborhood of \(\boldsymbol{\theta}_{0}\) within which the log-likelihood \(l({\bf y}, \boldsymbol{\theta})= \log p({\bf y}\mid \boldsymbol{\theta})\) is twice continuously differentiable w.r.t. \(\boldsymbol{\theta}\).

(A4) The prior density \(\pi(\boldsymbol{\theta})\) is continuous and positive at \(\boldsymbol{\theta}_{0}\).

(A5) The MLE \(\widehat{\boldsymbol{\theta}}_{n}\) of \(\boldsymbol{\theta}\) is a strongly consistent estimator of \(\boldsymbol{\theta}_{0}\).

(A6) Differentiation w.r.t. \(\boldsymbol{\theta}\) and integration w.r.t. the density \(p({\bf y}\mid \boldsymbol{\theta})\) can be interchanged. Further, the Fisher information matrix \({\bf I}(\boldsymbol{\theta}_{0})\) is positive definite.

Theorem: Under the regularity conditions (A1)–(A6), the total variation or L\(_{1}\) distance between the posterior distribution of \(t=[n\widehat{\bf I}_{n}(\widehat{\boldsymbol{\theta}}_{n})]^{1/2}(\boldsymbol{\theta}-\widehat{\boldsymbol{\theta}}_{n})\) given \({\bf y}_{1}, \ldots, {\bf y}_{n}\) and the standard Gaussian distribution converges to \(0\) as \(n\rightarrow\infty\).

Example:

Let us consider the Bernoulli-trials example provided earlier.

  • Recall that \(n\) outcomes are available from Bernoulli trials with probability of success \(\theta\). Let the true probability of success be \(\theta_{0}=0.5\); the MLE of \(\theta\) is \(\widehat{\theta}_{n}=\sum_{i=1}^{n} y_{i}/n = x/n\).

  • It is easy to see that the regularity conditions (A1)-(A6) are satisfied for this setup.

  • The following sequence of figures (code attached) shows the posterior distribution of \(t\) given \(n\) observations together with the standard Gaussian distribution, as \(n\) increases.

  • Thus, the probability \(P(a\leq \theta \leq b)\), equivalently, \(P\left[\widehat{\sigma}(a- \widehat{\theta}_{n}) \leq t \leq \widehat{\sigma}(b- \widehat{\theta}_{n})\right]\) where \(\widehat{\sigma}=\sqrt{n \widehat{\bf I}_{n}(\widehat{\theta}_{n})}\), can be approximated from the asymptotic normal distribution; a numerical comparison is sketched after this list.
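A hedged sketch of this approximation for simulated Bernoulli data (here \(\widehat{\bf I}_{n}(\widehat{\theta}_{n}) = \{\widehat{\theta}_{n}(1-\widehat{\theta}_{n})\}^{-1}\); the interval \((a,b)\), the value of \(\theta_{0}\) and the sample sizes are illustrative):

```python
# Bernstein-von Mises in action for the Bernoulli example: compare the normal
# approximation of P(a <= theta <= b | y) with the exact Beta posterior probability.
import numpy as np
from scipy.stats import beta, norm

rng = np.random.default_rng(1)
theta0, a, b = 0.5, 0.45, 0.55
for n in (20, 200, 2000):
    y = rng.binomial(1, theta0, size=n)
    x = y.sum()
    theta_hat = x / n
    sigma_hat = np.sqrt(n / (theta_hat * (1 - theta_hat)))   # sqrt(n * I_hat(theta_hat))
    approx = norm.cdf(sigma_hat * (b - theta_hat)) - norm.cdf(sigma_hat * (a - theta_hat))
    exact = beta(x + 1, n - x + 1).cdf(b) - beta(x + 1, n - x + 1).cdf(a)
    print(f"n={n:5d}: normal approximation = {approx:.4f}, exact Beta posterior = {exact:.4f}")
```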


II. Laplace Approximation:

The Laplace approximation first finds a mode of the posterior distribution and then constructs a Gaussian approximation using a second-order Taylor expansion. An outline of the Laplace approximation is given below:

  1. Recall that \(\pi(\boldsymbol{\theta}\mid {\bf y}) = \pi^{*}(\boldsymbol{\theta}\mid {\bf y})/ p({\bf y})\) where \(p({\bf y})\), the normalizing constant, may be unknown. At a typical mode \(\boldsymbol{\theta}_{n}\) of \(\pi(\boldsymbol{\theta}\mid {\bf y})\), \(\nabla \ln \pi(\boldsymbol{\theta}\mid {\bf y})\) (as well as \(\nabla \ln \pi^{*}(\boldsymbol{\theta}\mid {\bf y})\)) vanishes, and the Hessian matrix \({\bf H}(\boldsymbol{\theta}_{n})\) of \(\ln \pi(\boldsymbol{\theta}\mid {\bf y})\) evaluated at \(\boldsymbol{\theta}_{n}\) is negative definite.

  2. Using a second-order Taylor expansion of \(\ln \pi(\boldsymbol{\theta}\mid {\bf y})\) around the mode \(\boldsymbol{\theta}_{n}\) (where the first-order term vanishes), we get: \[\ln \pi(\boldsymbol{\theta}\mid {\bf y}) \approx \ln \pi(\boldsymbol{\theta}_{n}\mid {\bf y}) - \frac{1}{2} (\boldsymbol{\theta} - \boldsymbol{\theta}_{n})^{\top} [-{\bf H}(\boldsymbol{\theta}_{n})] (\boldsymbol{\theta} - \boldsymbol{\theta}_{n}) . \]

  3. This in turn implies that \[\pi(\boldsymbol{\theta}\mid {\bf y}) \approx \pi(\boldsymbol{\theta}_{n}\mid {\bf y}) \exp \left\{ - \frac{1}{2} (\boldsymbol{\theta} - \boldsymbol{\theta}_{n})^{\top} [-{\bf H}(\boldsymbol{\theta}_{n})] (\boldsymbol{\theta} - \boldsymbol{\theta}_{n}) \right\}\]

  4. Thus the approximate distribution of \(\boldsymbol{\theta} \mid {\bf y}\) is Gaussian with mean \(\boldsymbol{\theta}_{n}\) and variance-covariance matrix \([-{\bf H}(\boldsymbol{\theta}_{n})]^{-1}\). A minimal numerical sketch of these steps is given after this list.
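A minimal sketch of steps 1-4, finding the mode numerically and using a finite-difference Hessian; the unnormalized log posterior below (Bernoulli likelihood with a \(\mathtt{Uniform}(0,1)\) prior) and the values of \(n\), \(x\) are illustrative choices, not part of these notes:

```python
# Laplace approximation in one dimension: locate the posterior mode numerically,
# approximate the Hessian of the log posterior there, and report N(mode, [-H]^{-1}).
import numpy as np
from scipy.optimize import minimize_scalar

n, x = 20, 12

def log_post_unnorm(theta):          # ln pi*(theta | y), up to an additive constant
    return x * np.log(theta) + (n - x) * np.log(1 - theta)

res = minimize_scalar(lambda t: -log_post_unnorm(t), bounds=(1e-6, 1 - 1e-6), method="bounded")
mode = res.x
h = 1e-5                             # finite-difference second derivative at the mode
hess = (log_post_unnorm(mode + h) - 2 * log_post_unnorm(mode) + log_post_unnorm(mode - h)) / h**2

print(f"Laplace approximation: N({mode:.4f}, {-1.0 / hess:.5f})")
print(f"exact Beta(x+1, n-x+1) mode = {x / n:.4f}, variance = {(x + 1) * (n - x + 1) / ((n + 2) ** 2 * (n + 3)):.5f}")
```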


Remarks:

  1. The conclusion of the Laplace approximation is asymptotically the same as that of the Bernstein-von Mises theorem, since the posterior mode is asymptotically equivalent to the MLE under any reasonable prior that is continuous in a neighborhood of the true value \(\boldsymbol{\theta}_{0}\).

  2. Numerical optimization methods are usually employed to find the mode of \(\pi^{*}(\boldsymbol{\theta} \mid {\bf y})\).

  3. The Laplace approximation produces good results in a neighborhood of \(\boldsymbol{\theta}_{n}\), but can fail to capture the global shape of the posterior.


Examples:

  • Consider the model \(Y\mid \mu \sim \mathtt{Poisson} (\mu)\) and the prior \(\mu \sim \mathtt{Gamma}(\alpha,\beta)\).

  • A similar question about the probability \(P(a\leq \mu \leq b)\) given \(n\) observations is answered simply by an application of the Laplace approximation.

  • Let \(n\) data points \(y_{1},\ldots,y_{n}\) be observed on \(Y\).

  • Then the posterior of \(\mu \mid y_{1},\ldots,y_{n}\) is also \(\mathtt{Gamma}\) with shape parameter \(\sum_{i} y_{i}+\alpha\) and scale parameter \((n+1/\beta)^{-1}\).

  • The posterior mode in this case has an analytic expression \(\widehat{\mu}_{n}=(\sum_{i} y_{i}+\alpha-1)/(n+1/\beta)\), and the Hessian evaluated at \(\widehat{\mu}_{n}\) is \(-(\sum_{i} y_{i}+\alpha-1)/\widehat{\mu}_{n}^{2}\).

  • Thus the posterior distribution is approximated by the Gaussian distribution with mean \(\widehat{\mu}_{n}\) and variance \(\widehat{\mu}_{n}^{2}/(\sum_{i} y_{i}+\alpha-1)\).

  • The following figure (code attached) shows the accuracy of the Laplace approximation as \(n\) grows, when \(\mu=3\), \(\alpha=5\) and \(\beta=0.25\); a sketch of this comparison is given after this list.
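A hedged sketch of that comparison, using the closed-form posterior mode and Hessian stated above and simulated Poisson data (the sample sizes and the comparison grid are illustrative):

```python
# Poisson-Gamma example: compare the exact Gamma posterior of mu with its Laplace
# (Gaussian) approximation N(mu_hat, mu_hat^2 / (sum(y) + alpha - 1)) as n grows.
import numpy as np
from scipy.stats import gamma, norm

rng = np.random.default_rng(2)
mu_true, alpha, beta_ = 3.0, 5.0, 0.25

for n in (10, 100, 1000):
    y = rng.poisson(mu_true, size=n)
    shape, rate = y.sum() + alpha, n + 1.0 / beta_
    mode = (shape - 1) / rate                     # posterior mode mu_hat_n
    var = mode**2 / (shape - 1)                   # [-H(mu_hat_n)]^{-1}
    grid = np.linspace(mu_true - 1, mu_true + 1, 5)
    exact = gamma(a=shape, scale=1.0 / rate).cdf(grid)
    approx = norm(loc=mode, scale=np.sqrt(var)).cdf(grid)
    print(f"n={n:5d}: max CDF discrepancy on grid = {np.abs(exact - approx).max():.4f}")
```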


III. An Application of Laplace Approximation to Latent Gaussian Models: Integrated Nested Laplace Approximation (INLA):

  • INLA is applicable to models with a latent Gaussian structure when \(n\) is extremely large and MCMC is unsuitable. The assumed structure is \[\mathrm{[Likelihood]}\quad {\bf y}\mid {\bf z}, \boldsymbol{\theta} \sim \prod_{i=1}^{n} p(y_{i} \mid z_{i}, \boldsymbol{\theta}); \]

\[\mathrm{[Latent ~Field]}\quad {\bf z} \mid \boldsymbol{\theta} \sim N(0, {\bf Q}^{-1} (\boldsymbol{\theta})); \] \[\mathrm{[Hyperpriors]}\quad \boldsymbol{\theta} \sim \pi(\boldsymbol{\theta}),\]

where \({\bf Q} (\boldsymbol{\theta})\) is the precision matrix of the latent Gaussian model.

  • The goals are to approximate the marginal posteriors \(\pi(\boldsymbol{\theta} \mid {\bf y})\) and \(p(z_{i}\mid {\bf y})\).

  • Observe that \[\pi(\boldsymbol{\theta} \mid {\bf y})=\frac{p({\bf z}, \boldsymbol{\theta} \mid {\bf y})}{p({\bf z} \mid \boldsymbol{\theta} , {\bf y})} \propto \frac{p({\bf z}, \boldsymbol{\theta} ,{\bf y})}{p({\bf z}\mid \boldsymbol{\theta} , {\bf y})}= \frac{p({\bf y}\mid {\bf z},\boldsymbol{\theta}) p({\bf z} \mid \boldsymbol{\theta}) \pi(\boldsymbol{\theta})}{p({\bf z} \mid \boldsymbol{\theta}, {\bf y})}.\]

  • Next, \(\pi(\boldsymbol{\theta} \mid {\bf y})\) is approximated by \[\tilde{\pi}(\boldsymbol{\theta} \mid {\bf y}) \propto \frac{p({\bf y}\mid \widehat{\bf z}(\boldsymbol{\theta}),\boldsymbol{\theta}) p(\widehat{\bf z}(\boldsymbol{\theta}) \mid \boldsymbol{\theta}) \pi(\boldsymbol{\theta})}{\tilde{p}(\widehat{\bf z}(\boldsymbol{\theta}) \mid \boldsymbol{\theta}, {\bf y})}, \] where \(\widehat{\bf z}(\boldsymbol{\theta})\) is a mode of the distribution of \({\bf z}\) given \((\boldsymbol{\theta}, {\bf y})\), and \(\tilde{p}\) is a Gaussian approximation (with mean \(\widehat{\bf z}(\boldsymbol{\theta})\) and variance-covariance matrix \(\boldsymbol{\Sigma}_{\boldsymbol{\theta}}\)) to the distribution of \({\bf z}\mid \boldsymbol{\theta}, {\bf y}\). Here \(\boldsymbol{\Sigma}_{\boldsymbol{\theta}}\) is the inverse of the Hessian of \(-\ln p({\bf z},\boldsymbol{\theta}, {\bf y})\) w.r.t. \({\bf z}\) evaluated at \(\widehat{\bf z}(\boldsymbol{\theta})\).

  • Thus, the above expression can be simplified as \[\tilde{\pi}(\boldsymbol{\theta} \mid {\bf y}) \propto p({\bf y}\mid \widehat{\bf z}(\boldsymbol{\theta}),\boldsymbol{\theta}) p(\widehat{\bf z}(\boldsymbol{\theta}) \mid \boldsymbol{\theta}) \pi(\boldsymbol{\theta}) |\boldsymbol{\Sigma}_{\boldsymbol{\theta}} |^{1/2}. \tag{1}\]

  • Towards the second goal, the marginal posterior is \[p(z_{i} \mid {\bf y})= \int p( z_{i} \mid {\bf y}, \boldsymbol{\theta}) \pi(\boldsymbol{\theta} \mid {\bf y}) d \boldsymbol{\theta}, \tag{2}\] with \[ p(z_{i} \mid \boldsymbol{\theta}, {\bf y}) \propto \frac{p({\bf z},\boldsymbol{\theta} \mid {\bf y})}{p({\bf z}_{-i}\mid z_{i}, \boldsymbol{\theta}, {\bf y})} \propto \frac{p({\bf y}\mid {\bf z},\boldsymbol{\theta}) p({\bf z} \mid \boldsymbol{\theta}) \pi(\boldsymbol{\theta})}{p({\bf z}_{-i}\mid z_{i}, \boldsymbol{\theta}, {\bf y})} . \]

  • Using a similar Gaussian approximation to the distribution of \({\bf z}_{-i}\mid z_{i}, \boldsymbol{\theta}, {\bf y}\), with appropriate parameters, provides \[p(z_{i} \mid \boldsymbol{\theta}, {\bf y}) \approx \tilde{p}(z_{i} \mid \boldsymbol{\theta}, {\bf y}) \propto p({\bf y}\mid \widehat{\bf z}_{-i}(\boldsymbol{\theta},z_{i}),\boldsymbol{\theta}) p(\widehat{\bf z}_{-i}(\boldsymbol{\theta},z_{i}) \mid \boldsymbol{\theta}) \pi(\boldsymbol{\theta}) |\boldsymbol{\Sigma}_{\boldsymbol{\theta},z_{i}}|^{1/2}, \tag{3} \] where \({\bf z}_{-i}\) denotes all elements of \({\bf z}\) except \(z_{i}\), \(\widehat{\bf z}_{-i}(\boldsymbol{\theta},z_{i})\) denotes a mode of \(p({\bf z},\boldsymbol{\theta},{\bf y})\) w.r.t. \({\bf z}_{-i}\) given \((z_{i},\boldsymbol{\theta})\), and \(\boldsymbol{\Sigma}_{\boldsymbol{\theta},z_{i}}\) is the inverse of the Hessian matrix of \(-\ln p({\bf z}, \boldsymbol{\theta},{\bf y})\) w.r.t. \({\bf z}_{-i}\) evaluated at \(\widehat{\bf z}_{-i}(\boldsymbol{\theta},z_{i})\).

  • Thus, (2) is approximated by a deterministic numerical integration scheme over a grid of values of \(\boldsymbol{\theta}\), approximating \(p( z_{i} \mid {\bf y}, \boldsymbol{\theta})\) as in (3) and \(\pi(\boldsymbol{\theta} \mid {\bf y})\) as in (1). A toy illustration of (1) is sketched after this list.
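A hedged toy illustration of (1), not the general R-INLA machinery: the model below (independent latent effects \(z_{i}\) with \(y_{i}\mid z_{i} \sim \mathtt{Poisson}(e^{z_{i}})\), \({\bf z}\mid \theta \sim N({\bf 0}, \theta^{-1}{\bf I})\), and a \(\mathtt{Gamma}(1,1)\) hyperprior on the precision \(\theta\)) is an assumption made purely for illustration, chosen so that the inner Gaussian approximation has a diagonal Hessian.

```python
# Toy version of (1): for each grid value of theta, find z_hat(theta) by Newton
# steps, form the Gaussian approximation of z | theta, y, and evaluate
# pi_tilde(theta | y) up to a normalizing constant.
import numpy as np
from scipy.special import gammaln

rng = np.random.default_rng(3)
n, theta_true = 200, 4.0
z = rng.normal(0.0, 1.0 / np.sqrt(theta_true), size=n)
y = rng.poisson(np.exp(z))

def log_post_theta(theta):
    z_hat = np.zeros(n)
    for _ in range(50):                              # Newton iterations for the mode z_hat(theta)
        grad = y - np.exp(z_hat) - theta * z_hat
        hess = -np.exp(z_hat) - theta                # diagonal Hessian of ln p(z, theta, y) in z
        z_hat -= grad / hess
    log_lik = np.sum(y * z_hat - np.exp(z_hat) - gammaln(y + 1))
    log_prior_z = 0.5 * n * np.log(theta) - 0.5 * theta * np.sum(z_hat**2)
    log_prior_theta = -theta                         # Gamma(1, 1) hyperprior, up to a constant
    half_log_det_sigma = -0.5 * np.sum(np.log(np.exp(z_hat) + theta))   # ln |Sigma_theta|^{1/2}
    return log_lik + log_prior_z + log_prior_theta + half_log_det_sigma

grid = np.linspace(0.5, 15.0, 80)
log_vals = np.array([log_post_theta(t) for t in grid])
post = np.exp(log_vals - log_vals.max())
post /= np.trapz(post, grid)                         # normalized pi_tilde(theta | y) on the grid
print(f"approximate posterior mean of theta = {np.trapz(grid * post, grid):.3f} (true value {theta_true})")
```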

Remarks:

  1. The Gaussian approximations in each case can be asymptotically justified using second order Taylor series expansion as in Laplace approximation.

  2. The approximation in (3) can be computationally expensive when \(n\) is large, and therefore some further simplifications are incorporated.


IV. Assumed Density Filtering (ADF):

ADF minimizes the KL divergence between the true posterior and a density from an approximating class. The main steps of ADF are described as follows:

  • Let \(p(\boldsymbol{\theta})\) be a fixed distribution and \(q(\boldsymbol{\theta})\) be an approximating distribution from the exponential family with natural parameter \(\boldsymbol{\eta}\), i.e., \[q(\boldsymbol{\theta})= h(\boldsymbol{\theta}) g(\boldsymbol{\eta})\exp\{\boldsymbol{\eta}^{\top} u(\boldsymbol{\theta}) \},\] where \(u(\boldsymbol{\theta})\) is the sufficient statistics for the family \(q\).

  • The KL-divergence of \(q\) from \(p\) can be expressed as \[KL(p||q) = - \ln g(\boldsymbol{\eta}) - \boldsymbol{\eta}^{\top} E_{p(\boldsymbol{\theta})}(u(\boldsymbol{\theta}))+\mbox{terms free of }\boldsymbol{\eta}. \]

  • The first order condition towards minimization of \(KL(p||q)\) provides \[E_{p(\boldsymbol{\theta})}(u(\boldsymbol{\theta})) = - \nabla \ln g(\boldsymbol{\eta}) . \]

  • As \(E_{q(\boldsymbol{\theta})}(u(\boldsymbol{\theta})) = - \nabla \ln g(\boldsymbol{\eta})\) by the properties of the exponential family, we have \[E_{p(\boldsymbol{\theta})}(u(\boldsymbol{\theta})) = E_{q(\boldsymbol{\theta})}(u(\boldsymbol{\theta})). \] Therefore, the expected sufficient statistics under the optimal approximating density must match those under \(p\); a small Gaussian instance is worked out after this list.
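As a concrete instance of the matching condition (a small worked case; the general-covariance Gaussian parameterization below is written out for illustration and is not part of the original notes), take \[ q(\boldsymbol{\theta}) = \mathrm{N}(\boldsymbol{\theta};\, {\bf m}, {\bf S}), \qquad u(\boldsymbol{\theta}) = \left(\boldsymbol{\theta},\, \boldsymbol{\theta}\boldsymbol{\theta}^{\top}\right), \qquad E_{q(\boldsymbol{\theta})}\left[u(\boldsymbol{\theta})\right] = \left({\bf m},\, {\bf S} + {\bf m}{\bf m}^{\top}\right). \] The matching condition \(E_{p(\boldsymbol{\theta})}[u(\boldsymbol{\theta})] = E_{q(\boldsymbol{\theta})}[u(\boldsymbol{\theta})]\) then forces \({\bf m} = E_{p}[\boldsymbol{\theta}]\) and \({\bf S} = \mathrm{Cov}_{p}[\boldsymbol{\theta}]\): the optimal Gaussian simply matches the mean and covariance of \(p\).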

Remarks:

  1. ADF requires that the moments of the true posterior can be computed relatively efficiently. If this is not the case, ADF cannot be applied in a straightforward manner.

  2. ADF is conceptually similar to variational Bayes (VB) method. However, in VB the reverse KL divergence \(KL(q||p)\) is minimized.


Example:

  • Consider the mixture model \(p({\bf y} \mid \boldsymbol{\theta}) = (1-w) \mathrm{N}({\bf y}; \boldsymbol{\theta}, {\bf I})+ w \mathrm{N}({\bf y}; {\bf 0}, 10 \times {\bf I})\). Here \(\boldsymbol{\theta}\) is the parameter of interest and \(w\) is the known proportion of the noise component. A Gaussian prior with mean \({\bf 0}\) and variance \(\sigma^{2}{\bf I}\) is placed on \(\boldsymbol{\theta}\).

  • Given \(n\) observations \({\bf y}_{1},\ldots, {\bf y}_{n}\) we have \[\pi(\boldsymbol{\theta} \mid {\bf y}_{1}, \ldots, {\bf y}_{n}) \propto \pi(\boldsymbol{\theta}) \prod_{i=1}^{n} p({\bf y}_{i} \mid \boldsymbol{\theta})= \prod_{j=0}^{n} t_{j}, \] where \(t_{0}=\pi(\boldsymbol{\theta})\) and \(t_{i}=p({\bf y}_{i} \mid \boldsymbol{\theta})\), \(i=1, \ldots,n\).

  • Suppose we choose the approximating density \(q(\boldsymbol{\theta})\) to be Gaussian with mean \(\boldsymbol{\mu}_{\boldsymbol{\theta}}\) and variance \(s_{\boldsymbol{\theta}}^{2} {\bf I}\).

  • Clearly the sufficient statistics of the approximating density are \(u(\boldsymbol{\theta})=(\boldsymbol{\theta}, \boldsymbol{\theta}^{\top}\boldsymbol{\theta})\) with expectation \(E_{q(\boldsymbol{\theta})}(u(\boldsymbol{\theta}))= (\boldsymbol{\mu}_{\boldsymbol{\theta}}, \boldsymbol{\mu}_{\boldsymbol{\theta}}^{\top} \boldsymbol{\mu}_{\boldsymbol{\theta}} + D s_{\boldsymbol{\theta}}^{2})\), where \(D\) is the dimension of \(\boldsymbol{\theta}\). However, we do not know \(E_{p(\boldsymbol{\theta})}(u(\boldsymbol{\theta}))\). We therefore build up an estimate \(\widehat{p}(\boldsymbol{\theta})\) of \(p\) sequentially, one observation at a time.

  • We start with the initialization \(\widehat{p}(\boldsymbol{\theta})=q(\boldsymbol{\theta})=\pi(\boldsymbol{\theta})\). After the first observation we update \[\widehat{p}(\boldsymbol{\theta}) = \frac{q(\boldsymbol{\theta}) t_{1}(\boldsymbol{\theta})}{\int q(\boldsymbol{\theta}) t_{1} (\boldsymbol{\theta}) d\boldsymbol{\theta}}, \] project \(\widehat{p}\) back onto the Gaussian family by matching \(E[u(\boldsymbol{\theta})]\), and repeat with the next observation using the updated \(q\).

  • Now, to compute the required expectations consider the following relations: \[Z(\boldsymbol{\mu}_{\boldsymbol{\theta}},s_{\boldsymbol{\theta}}^{2})=\int q(\boldsymbol{\theta}) t_{1} (\boldsymbol{\theta})d\boldsymbol{\theta}= \int \frac{t_{1}(\boldsymbol{\theta})}{(2\pi s_{\boldsymbol{\theta}}^{2})^{D/2}} \exp\left\{ -\frac{1}{2 s_{\boldsymbol{\theta}}^{2}} \left(\boldsymbol{\theta}- \boldsymbol{\mu}_{\boldsymbol{\theta}}\right)^{\top}\left(\boldsymbol{\theta}- \boldsymbol{\mu}_{\boldsymbol{\theta}}\right)\right\} d\boldsymbol{\theta}, \] \[\nabla_{\mu} \ln Z(\boldsymbol{\mu}_{\boldsymbol{\theta}},s_{\boldsymbol{\theta}}^{2}) = \frac{1}{s_{\boldsymbol{\theta}}^{2}} \left\{ E_{\widehat{p}}(\boldsymbol{\theta}) - \boldsymbol{\mu}_{\boldsymbol{\theta}} \right\} , \] and \[\nabla_{s_{\boldsymbol{\theta}}^{2}} \ln Z(\boldsymbol{\mu}_{\boldsymbol{\theta}},s_{\boldsymbol{\theta}}^{2}) = \frac{1}{2 s_{\boldsymbol{\theta}}^{4} } E_{\widehat{p}}\left\{(\boldsymbol{\theta}-\boldsymbol{\mu}_{\boldsymbol{\theta}} )^{\top}(\boldsymbol{\theta}-\boldsymbol{\mu}_{\boldsymbol{\theta}} ) \right\} - \frac{D}{2 s_{\boldsymbol{\theta}}^{2}}.\]

  • This in turn provides the relation \[E_{\widehat{p}(\boldsymbol{\theta})}(\boldsymbol{\theta}) = \boldsymbol{\mu}_{\boldsymbol{\theta}} + s_{\boldsymbol{\theta}}^{2} \nabla_{\mu} \ln Z(\boldsymbol{\mu}_{\boldsymbol{\theta}},s_{\boldsymbol{\theta}}^{2}),\] and \[E_{\widehat{p}(\boldsymbol{\theta})}(\boldsymbol{\theta}^{\top} \boldsymbol{\theta}) - E_{\widehat{p}(\boldsymbol{\theta})}(\boldsymbol{\theta})^{\top} E_{\widehat{p}(\boldsymbol{\theta})}(\boldsymbol{\theta}) = D s_{\boldsymbol{\theta}}^{2} -s_{\boldsymbol{\theta}}^{4} \left\{ \left\|\nabla_{\mu} \ln Z(\boldsymbol{\mu}_{\boldsymbol{\theta}},s_{\boldsymbol{\theta}}^{2})\right\|^{2} - 2\nabla_{s_{\boldsymbol{\theta}}^{2}} \ln Z(\boldsymbol{\mu}_{\boldsymbol{\theta}},s_{\boldsymbol{\theta}}^{2})\right\}.\]

  • Using the above relations in the example we get \[Z(\boldsymbol{\mu}_{\boldsymbol{\theta}},s_{\boldsymbol{\theta}}^{2})=(1-w)\phi_{(s_{\boldsymbol{\theta}}^{2}+1){\bf I}}(\boldsymbol{\mu}_{\boldsymbol{\theta}}-{\bf y}_{1})+w \phi_{10\times {\bf I}}({\bf y}_{1})= z_{1}+z_{2};\qquad r=\frac{z_{1}}{z_{1}+z_{2}}, \] where \(\phi_{\boldsymbol{\Sigma}}(\cdot)\) denotes the Gaussian density with mean \({\bf 0}\) and variance-covariance matrix \(\boldsymbol{\Sigma}\).

  • Differentiating w.r.t. \(\boldsymbol{\mu}_{\boldsymbol{\theta}}\) and \(s_{\boldsymbol{\theta}}^{2}\) we get: \[\nabla_{\mu} \ln Z(\boldsymbol{\mu}_{\boldsymbol{\theta}},s_{\boldsymbol{\theta}}^{2}) = r \frac{{\bf y}_{1}-\boldsymbol{\mu}_{\boldsymbol{\theta}}}{s_{\boldsymbol{\theta}}^{2}+1},\quad \mbox{and}\quad \nabla_{s_{\boldsymbol{\theta}}^{2}} \ln Z(\boldsymbol{\mu}_{\boldsymbol{\theta}},s_{\boldsymbol{\theta}}^{2}) = -\frac{rD}{2(s_{\boldsymbol{\theta}}^{2}+1)}+ r \frac{({\bf y}_{1}-\boldsymbol{\mu}_{\boldsymbol{\theta}})^{\top}({\bf y}_{1}-\boldsymbol{\mu}_{\boldsymbol{\theta}})}{2(s_{\boldsymbol{\theta}}^{2}+1)^{2}}.\]

  • Thus the parameters are updated by the following steps, processing one observation at a time:

    1. Initialize with the prior \(\boldsymbol{\mu}_{\boldsymbol{\theta}}={\bf 0}\) and \(s_{\boldsymbol{\theta}}^{2}=\sigma^{2}\).
    2. After observing the \(i\)-th data point, compute \(r_{i}\) from the current \((\boldsymbol{\mu}_{\boldsymbol{\theta}}, s_{\boldsymbol{\theta}}^{2})\) as above (with \({\bf y}_{1}\) replaced by \({\bf y}_{i}\)) and update

\[ \boldsymbol{\mu}_{\boldsymbol{\theta}}^{\mathrm{new}}=\boldsymbol{\mu}_{\boldsymbol{\theta}} + s_{\boldsymbol{\theta}}^{2} r_{i} \frac{{\bf y}_{i}-\boldsymbol{\mu}_{\boldsymbol{\theta}}}{s_{\boldsymbol{\theta}}^{2}+1},\] and \[s_{\boldsymbol{\theta}}^{2~\mathrm{new}} = s_{\boldsymbol{\theta}}^{2} - r_{i} \frac{s_{\boldsymbol{\theta}}^{4}}{s_{\boldsymbol{\theta}}^{2}+1} + r_{i} (1-r_{i})\frac{s_{\boldsymbol{\theta}}^{4}({\bf y}_{i}-\boldsymbol{\mu}_{\boldsymbol{\theta}})^{\top}({\bf y}_{i}-\boldsymbol{\mu}_{\boldsymbol{\theta}})}{D(s_{\boldsymbol{\theta}}^{2}+1)^{2}} . \]

The figure (code attached) for \(\sigma^{2}=10\) is given below; a sketch of these updates is also given after it.
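A hedged sketch of these updates for a one-dimensional version of the clutter model, with \(\sigma^{2}=10\) as in the figure; the value of \(w\), the true \(\theta\), and the simulated data are illustrative assumptions:

```python
# ADF (one pass over the data) for the clutter model:
# p(y | theta) = (1 - w) N(y; theta, 1) + w N(y; 0, 10), prior theta ~ N(0, sigma^2).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)
D, w, theta_true, n = 1, 0.25, 2.0, 200
clutter = rng.random(n) < w
y = np.where(clutter, rng.normal(0.0, np.sqrt(10.0), n), rng.normal(theta_true, 1.0, n))

mu, s2 = 0.0, 10.0                                   # prior: N(0, sigma^2) with sigma^2 = 10
for yi in y:
    z1 = (1 - w) * norm.pdf(yi, loc=mu, scale=np.sqrt(s2 + 1.0))
    z2 = w * norm.pdf(yi, loc=0.0, scale=np.sqrt(10.0))
    r = z1 / (z1 + z2)                               # responsibility r_i of the signal component
    mu_new = mu + s2 * r * (yi - mu) / (s2 + 1.0)
    s2_new = s2 - r * s2**2 / (s2 + 1.0) + r * (1 - r) * s2**2 * (yi - mu) ** 2 / (D * (s2 + 1.0) ** 2)
    mu, s2 = mu_new, s2_new

print(f"ADF approximation: theta | y approx N({mu:.3f}, {s2:.4f})   (true theta = {theta_true})")
```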


V. Variational Bayesian (VB) Methods:

To be covered.


VI. Approximate Bayesian Computation (ABC):

To be covered.


References:

  1. Computing Bayes: Bayesian computation from 1763 to the 21st Century, by Gael M. Martin, David T. Frazier and Christian P. Robert. (Link)

  2. A review of deterministic approximate inference techniques for Bayesian machine learning, by Shiliang Sun. (Link)

  3. A family of algorithms for approximate Bayesian inference, by Thomas P. Minka. (Link)

  4. Spatial modeling with R-INLA: A review, by Haakon Bakka, Håvard Rue, Geir-Arne Fuglstad, Andrea Riebler, David Bolin, Janine Illian, Elias Krainski, Daniel Simpson and Finn Lindgren. (Link)
