Suppose one performs \(n\) Bernoulli trials, each with probability \(\theta\) of success. Given the outcomes of the \(n\) trials, we are interested in the probability that \(\theta\) lies between \(a\) and \(b\).
Here \(Y_{i} \mid \theta \stackrel{iid}\sim \mathtt{Bernoulli}(\theta)\), \(i=1, \ldots,n\). Let the observations be \(y_{1}, \ldots, y_{n}\).
Suppose a \(\mathtt{Uniform}(0,1)\) prior is placed on \(\theta\), i.e., \(\pi(\theta)=1\) for \(0<\theta<1\). Then \[P(\theta \in (a,b) \mid y_{1}, \ldots, y_{n})=\{\mathtt{Beta}(x+1, n-x+1) \}^{-1} \int_{a}^{b} \theta^{x} (1-\theta)^{n-x}d\theta, \tag{**} \] where \(x=\sum_{i=1}^{n} y_{i}\) and \(\mathtt{Beta}(x+1, n-x+1)= \int_{0}^{1} \theta^{x} (1-\theta)^{n-x}d\theta\) is the Beta function.
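Since the posterior under this prior is the \(\mathtt{Beta}(x+1, n-x+1)\) distribution, the probability in (**) can be evaluated numerically from its CDF. A minimal R sketch, with hypothetical values of \(n\), \(x\), \(a\) and \(b\) chosen purely for illustration:

```r
# P(a < theta < b | y) under a Uniform(0,1) prior: the posterior is Beta(x+1, n-x+1),
# so the probability in (**) is a difference of the (numerically computed) Beta CDF.
n <- 20; x <- 12        # hypothetical data: 12 successes in 20 trials
a <- 0.4; b <- 0.6      # interval of interest (illustrative choice)
pbeta(b, x + 1, n - x + 1) - pbeta(a, x + 1, n - x + 1)
```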
\[\\[.2 in]\]
The question in the previous example can be rephrased as computing \(E_{\theta \mid {\bf y}}\left(I_{[a,b]}(\theta) \right)\), the posterior expectation of the indicator of \([a,b]\).
More generally, most Bayesian queries of interest are of the form \(E_{\boldsymbol{\theta} \mid {\bf y}}\left[g(\boldsymbol{\theta}) \right]\), where \(g\) is some function. Examples include
posterior moments such as the variance;
marginal pdfs such as \(p(\theta_{1}^{*} \mid {\bf y})\) (here \(g(\boldsymbol{\theta})=\pi(\theta_{1}^{*} \mid \boldsymbol{\theta}_{-1},{\bf y})\));
loss functions \(L(\boldsymbol{\theta},d)\), etc.
Often an analytic solution to \(E_{\boldsymbol{\theta} \mid {\bf y}}\left[g(\boldsymbol{\theta}) \right]\) is not available. For example, the integral in (**) does not have an analytic solution.
Even the posterior in (*) is often known only up to a constant of proportionality, as \(p({\bf y})\) cannot be evaluated.
Therefore computational or approximate solutions are required.
\[\\[.2 in]\]
Numerical Integration: In order to compute \[E_{\boldsymbol{\theta} \mid {\bf y}}\left[g(\boldsymbol{\theta}) \right]=\int g(\boldsymbol{\theta}) \pi(\boldsymbol{\theta}\mid {\bf y}) d\boldsymbol{\theta}, \tag{I}\] one uses numerical approximation methods to evaluate the integral. These are mainly based on splitting the range of \(\boldsymbol{\theta}\) into \(L\) grid points and estimating the integral by a weighted sum of the integrand values at these \(L\) points (a sketch is given after this list).
Simulation Methods: These methods generate \(L\) posterior draws of \(\boldsymbol{\theta}\), evaluate \(g\) at each draw, and use a (possibly weighted) mean of these values as an estimate of \(E_{\boldsymbol{\theta} \mid {\bf y}}\left[g(\boldsymbol{\theta}) \right]\) (a sketch is given after this list).
Deterministic Approximations: These methods replace the integrand in \(E_{\boldsymbol{\theta} \mid {\bf y}}\left[g(\boldsymbol{\theta}) \right]\) by an approximation of some sort. The replacement is chosen so that the resulting error is small (at least asymptotically) and the integral can be computed.
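For the Bernoulli example, the first two ideas can be illustrated with short R sketches (the same hypothetical values of \(n\), \(x\), \(a\), \(b\) as before). A midpoint-rule approximation of the integral in (**) over a grid of \(L\) points:

```r
# Numerical integration: midpoint rule on a grid of L points over (a, b).
n <- 20; x <- 12; a <- 0.4; b <- 0.6          # hypothetical values for illustration
L <- 1000
h <- (b - a) / L
grid <- a + (seq_len(L) - 0.5) * h            # midpoints of the L sub-intervals
num <- sum(grid^x * (1 - grid)^(n - x)) * h   # integral over (a, b)
den <- beta(x + 1, n - x + 1)                 # normalizing constant Beta(x+1, n-x+1)
num / den
```

And a Monte Carlo estimate based on independent posterior draws (available here because the posterior is a known Beta distribution, so an unweighted mean suffices):

```r
# Simulation: Monte Carlo estimate of E[ I_[a,b](theta) | y ] from posterior draws.
set.seed(1)
n <- 20; x <- 12; a <- 0.4; b <- 0.6      # hypothetical values for illustration
L <- 1e5
draws <- rbeta(L, x + 1, n - x + 1)       # exact posterior draws (conjugate case)
mean(draws > a & draws < b)
```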
\[\\[.2 in]\]
Recall the integral in (I), where \(\boldsymbol{\theta}\) is of dimension \(D\), with \(D\) large, and \(\pi(\boldsymbol{\theta}\mid {\bf y})\) is known only up to a normalizing constant, through \(\pi^{*}(\boldsymbol{\theta}\mid {\bf y}) = \pi(\boldsymbol{\theta})p({\bf y}\mid\boldsymbol{\theta})\).
Standard numerical integration methods are infeasible, as the number of grid points increases exponentially with the dimension of \(\boldsymbol{\theta}\), and the computational burden grows linearly with the number of grid points.
Further, the approximation error decreases at a rate of only \(O(L^{-1/D})\), where \(L\) is the total number of grid points.
If samples cannot be generated directly from the distribution of \(\boldsymbol{\theta}\mid {\bf y}\), the usual Monte Carlo estimate of (I) is not available.
The basic idea of Markov chain Monte Carlo (MCMC) is to generate a Markov chain \(\boldsymbol{\theta}^{(i)}\), \(i=1,\ldots, M\), with stationary distribution \(\pi(\boldsymbol{\theta}\mid {\bf y})\). Given the \(i\)-th iterate \(\boldsymbol{\theta}^{(i)}\), a new candidate \(\boldsymbol{\theta}^{c}\) is obtained from a suitable proposal distribution with conditional density \(q\), and it is accepted with probability \[\alpha =\min \left\{\frac{\pi^{*}(\boldsymbol{\theta}^{c}\mid {\bf y})/ q(\boldsymbol{\theta}^{(i)}\mid \boldsymbol{\theta}^{c}, {\bf y} ) }{\pi^{*}(\boldsymbol{\theta}^{(i)}\mid {\bf y})/ q(\boldsymbol{\theta}^{c}\mid \boldsymbol{\theta}^{(i)}, {\bf y}) }, 1 \right\}.\]
The algorithm based on the proposal density \(q\) and acceptance probability \(\alpha\) defines a transition kernel with stationary distribution \(\pi(\boldsymbol{\theta}\mid {\bf y})\).
Hence, subject to convergence to \(\pi(\boldsymbol{\theta}\mid {\bf y})\), these draws can be used to obtain a Monte Carlo type estimate of (I); a minimal sketch for the Bernoulli example is given after these remarks.
The issue of dimension is tackled by updating one element (or a few elements) of \(\boldsymbol{\theta}\) at a time, conditional on all remaining elements.
Further, the idea of introducing latent variables \({\bf z}\) and working with the augmented vector \((\boldsymbol{\theta}, {\bf z})\), which yields the conditionals \(\pi ( \boldsymbol{\theta} \mid {\bf z}, {\bf y})\) and \(p({\bf z} \mid \boldsymbol{\theta} , {\bf y})\), facilitates the generation of samples from \(\pi(\boldsymbol{\theta}\mid {\bf y})\).
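As an illustration of the general recipe, here is a minimal random-walk Metropolis sketch for the Bernoulli example, using only the unnormalized posterior \(\pi^{*}(\theta\mid {\bf y}) \propto \theta^{x}(1-\theta)^{n-x}\); the data, proposal scale, run length and burn-in are illustrative choices:

```r
# Random-walk Metropolis targeting pi*(theta | y) propto theta^x (1 - theta)^(n - x).
set.seed(1)
n <- 20; x <- 12                                  # hypothetical data
log_post <- function(th) {
  if (th <= 0 || th >= 1) return(-Inf)            # outside (0,1): always rejected
  x * log(th) + (n - x) * log(1 - th)             # flat prior, so log pi* up to a constant
}
M <- 20000
theta <- numeric(M)
theta[1] <- 0.5
for (i in 2:M) {
  cand <- theta[i - 1] + rnorm(1, 0, 0.1)         # symmetric proposal, so q cancels in alpha
  log_alpha <- log_post(cand) - log_post(theta[i - 1])
  theta[i] <- if (log(runif(1)) < log_alpha) cand else theta[i - 1]
}
keep <- theta[-(1:2000)]                          # discard burn-in
mean(keep > 0.4 & keep < 0.6)                     # MCMC estimate of P(0.4 < theta < 0.6 | y)
```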
\[\\[.2 in]\]
The traditional simulation and numerical methods still have the following limitations:
The MCMC methods still require evaluation of \(p({\bf y} \mid \boldsymbol{\theta})\), which is not always available in closed form. In many problems, \(p({\bf y} \mid \boldsymbol{\theta})\) is itself known only up to a constant of proportionality; such problems are called doubly intractable, and traditional MCMC methods fail for them.
The evaluation of \(p({\bf y} \mid \boldsymbol{\theta})\) at any point \(\boldsymbol{\theta}\) entails an \(O(n)\) computational burden.
The increasing dimension of the latent variables entails an increasing computational burden, making MCMC methods, even when feasible, not scalable.
The effectiveness of MCMC methods depends on the mixing time, i.e., the time for the Markov chain to converge to the stationary distribution, which is unknown.
Finally, taken as a whole, simulation and numerical methods are computationally demanding and effectively impractical for large-scale problems, and convergence is hard to assess.
\[\\[0.01 in]\]
Deterministic approximations can either fully replace or complement Monte Carlo techniques. Some of them are applicable to large-scale problems as well.
\[\\[.1 in]\]
\[\\[.2 in]\]
Suppose there exists a true value \(\boldsymbol{\theta}_{0}\) from which the data are generated. Let \(l({\bf y}, \boldsymbol{\theta})= \log p({\bf y}\mid \boldsymbol{\theta})\) be the log likelihood function, \({\bf I}(\boldsymbol{\theta}_{0})\) be the Fisher information matrix evaluated at \(\boldsymbol{\theta}_{0}\), and \(\widehat{\bf I}_{n}(\widehat{\boldsymbol{\theta}}_{n})=- n^{-1} \sum_{i=1}^{n}\nabla_{\boldsymbol{\theta}}^{2} l({\bf y}_{i}, \boldsymbol{\theta}) |_{\boldsymbol{\theta}=\widehat{\boldsymbol{\theta}}_{n}}\) be the empirical Fisher information matrix evaluated at the MLE \(\widehat{\boldsymbol{\theta}}_{n}\).
Consider the following regularity conditions:
(A1) The parameter space \(\Theta\subseteq \mathbb{R}^{D}\).
(A2) The support \(\{{\bf y}: p({\bf y} \mid \boldsymbol{\theta})>0 \}\) does not depend on \(\boldsymbol{\theta}\).
(A3) There exists a fixed neighborhood of \(\boldsymbol{\theta}_{0}\) within which the log likelihood \(l({\bf y}, \boldsymbol{\theta})= \log p({\bf y}\mid \boldsymbol{\theta})\) is twice continuously differentiable w.r.t. \(\boldsymbol{\theta}\).
(A4) The prior density \(\pi(\boldsymbol{\theta})\) is continuous and positive at \(\boldsymbol{\theta}_{0}\).
(A5) The MLE of \(\boldsymbol{\theta}\), \(\widehat{\boldsymbol{\theta}}_{n}\) is a strongly consistent estimator of \(\boldsymbol{\theta}_{0}\).
(A6) The order of differentiation w.r.t. \(\boldsymbol{\theta}\) and integration w.r.t. the density \(p({\bf y}\mid \boldsymbol{\theta})\) can be interchanged. Further, the Fisher information matrix \({\bf I}(\boldsymbol{\theta}_{0})\) is positive definite.
Theorem: Under the regularity conditions (A1)–(A6), the total variation or L\(_{1}\) distance between the posterior distribution of \(t=[n\widehat{\bf I}_{n}(\widehat{\boldsymbol{\theta}}_{n})]^{1/2}(\boldsymbol{\theta}-\widehat{\boldsymbol{\theta}}_{n})\) given \({\bf y}_{1}, \ldots, {\bf y}_{n}\) and the standard Gaussian distribution converges to \(0\) as \(n\rightarrow\infty\).
Let us consider the example provided in the Introduction of Bernoulli trials.
Recall that there are \(n\) outcomes available from Bernoulli trials with probability of success \(\theta\). Let the true probability of success be \(\theta_{0}=0.5\); the MLE of \(\theta\) is \(\widehat{\theta}_{n}=\sum_{i=1}^{n} y_{i}/n = x/n\).
It is easy to see that the regularity conditions (A1)-(A6) are satisfied for this setup.
The following sequence of figures (code attached) shows the posterior distribution of \(t\) given \(n\) observations together with the standard Gaussian density, as \(n\) increases.
Thus the probability \(P(a\leq \theta \leq b)\), equivalently \(P\left[\widehat{\sigma}(a- \widehat{\theta}_{n}) \leq t \leq \widehat{\sigma}(b- \widehat{\theta}_{n})\right]\) with \(\widehat{\sigma}=\sqrt{n \widehat{\bf I}_{n}(\widehat{\theta}_{n})}\), can be approximated from the asymptotic normal distribution.
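The figures themselves are not reproduced here; a minimal sketch of the kind of R code that could generate one such comparison, with simulated data (\(\theta_{0}=0.5\), an illustrative sample size, and \(a=0.4\), \(b=0.6\)), is:

```r
# Compare the posterior density of t = sqrt(n * I_hat) * (theta - theta_hat)
# with the standard normal density, and approximate P(a <= theta <= b | y).
set.seed(1)
theta0 <- 0.5; a <- 0.4; b <- 0.6
n <- 200                                        # illustrative sample size
y <- rbinom(n, 1, theta0); x <- sum(y)
theta_hat <- x / n
sigma_hat <- sqrt(n / (theta_hat * (1 - theta_hat)))   # sqrt(n * I_hat_n(theta_hat))
tt <- seq(-4, 4, length.out = 400)
post_t <- dbeta(theta_hat + tt / sigma_hat, x + 1, n - x + 1) / sigma_hat  # posterior of t
plot(tt, post_t, type = "l", xlab = "t", ylab = "density")
lines(tt, dnorm(tt), lty = 2)                   # standard Gaussian reference
# Normal approximation to P(a <= theta <= b | y):
pnorm(sigma_hat * (b - theta_hat)) - pnorm(sigma_hat * (a - theta_hat))
# Exact posterior probability, for comparison:
pbeta(b, x + 1, n - x + 1) - pbeta(a, x + 1, n - x + 1)
```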
\[\\[.2 in]\]
The Laplace approximation first finds a mode of the posterior distribution and then constructs a Gaussian approximation using a second-order Taylor expansion. An outline of the Laplace approximation is given below:
Recall that \(\pi(\boldsymbol{\theta}\mid {\bf y}) = \pi^{*}(\boldsymbol{\theta}\mid {\bf y})/ p({\bf y})\), where \(p({\bf y})\), the normalizing constant, may be unknown. At a typical mode \(\boldsymbol{\theta}_{n}\) of \(\pi(\boldsymbol{\theta}\mid {\bf y})\), \(\nabla \ln \pi(\boldsymbol{\theta}\mid {\bf y})\) (as well as \(\nabla \ln \pi^{*}(\boldsymbol{\theta}\mid {\bf y})\)) vanishes, and the Hessian matrix \({\bf H}(\boldsymbol{\theta}_{n})\) of \(\ln \pi(\boldsymbol{\theta}\mid {\bf y})\) evaluated at \(\boldsymbol{\theta}_{n}\) is negative definite.
Using a second-order Taylor expansion of \(\ln \pi(\boldsymbol{\theta}\mid {\bf y})\) around \(\boldsymbol{\theta}_{n}\) we get: \[\ln \pi(\boldsymbol{\theta}\mid {\bf y}) \approx \ln \pi(\boldsymbol{\theta}_{n}\mid {\bf y}) - \frac{1}{2} (\boldsymbol{\theta} - \boldsymbol{\theta}_{n})^{\top} [-{\bf H}(\boldsymbol{\theta}_{n})] (\boldsymbol{\theta} - \boldsymbol{\theta}_{n}) . \]
This in turn implies that \[\pi(\boldsymbol{\theta}\mid {\bf y}) \approx \pi(\boldsymbol{\theta}_{n}\mid {\bf y}) \exp \left\{ - \frac{1}{2} (\boldsymbol{\theta} - \boldsymbol{\theta}_{n})^{\top} [-{\bf H}(\boldsymbol{\theta}_{n})] (\boldsymbol{\theta} - \boldsymbol{\theta}_{n}) \right\}\]
Thus the approximate distribution of \(\boldsymbol{\theta} \mid {\bf y}\) is Gaussian with mean \(\boldsymbol{\theta}_{n}\) and variance covariance matrix \([-{\bf H}(\boldsymbol{\theta}_{n})]^{-1}\).
\[\\[.1 in]\]
The conclusion of the Laplace approximation is asymptotically the same as that of the Bernstein–von Mises theorem, as the posterior mode is asymptotically the same as the MLE under any reasonable prior that is continuous and positive in a neighborhood of the true value \(\boldsymbol{\theta}_{0}\).
Numerical optimization methods are usually employed to find the mode of \(\pi^{*}(\boldsymbol{\theta} \mid {\bf y})\).
The Laplace approximation produces good results in a neighborhood of \(\boldsymbol{\theta}_{n}\), but can fail to capture the global behaviour of the posterior.
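In line with the remark on numerical optimization, a generic recipe is to minimize \(-\ln \pi^{*}(\boldsymbol{\theta}\mid{\bf y})\) with a standard optimizer and reuse the Hessian it returns. A minimal R sketch for the Bernoulli example (hypothetical data as before, flat prior):

```r
# Laplace approximation via numerical optimization: posterior mode and
# inverse of the negative Hessian of the log (unnormalized) posterior.
n <- 20; x <- 12                                       # hypothetical data
neg_log_post <- function(th) -(x * log(th) + (n - x) * log(1 - th))   # flat prior
fit <- optim(0.5, neg_log_post, method = "L-BFGS-B",
             lower = 1e-6, upper = 1 - 1e-6, hessian = TRUE)
mode_hat <- fit$par
var_hat  <- 1 / fit$hessian[1, 1]                      # [-H(mode)]^{-1}
c(mean = mode_hat, sd = sqrt(var_hat))                 # Gaussian approximation N(mode, var)
```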
\[\\[.2 in]\]
Consider the model \(Y\mid \mu \sim \mathtt{Poisson} (\mu)\) and the prior \(\mu \sim \mathtt{Gamma}(\alpha,\beta)\).
A simple answer to the analogous question about the probability \(P(a\leq \mu \leq b)\), given \(n\) observations, is obtained by applying the Laplace approximation.
Let \(y_{1},\ldots,y_{n}\) be \(n\) observed data points on \(Y\).
Then the posterior of \(\mu \mid y_{1},\ldots,y_{n}\) is also \(\mathtt{Gamma}\) with shape parameter \(\sum_{i} y_{i}+\alpha\) and scale parameter \((n+1/\beta)^{-1}\).
The posterior mode in this case has an analytic expression \(\widehat{\mu}_{n}=(\sum_{i} y_{i}+\alpha-1)/(n+1/\beta)\), and the Hessian evaluated at \(\widehat{\mu}_{n}\) is \(-(\sum_{i} y_{i}+\alpha-1)/\widehat{\mu}_{n}^{2}\).
Thus the posterior distribution is approximated by the Gaussian distribution with mean \(\widehat{\mu}_{n}\) and variance \(\widehat{\mu}_{n}^{2}/(\sum_{i} y_{i}+\alpha-1)\).
The following figure (code attached) shows the accuracy of the Laplace approximation as \(n\) grows, when \(\mu=3\), \(\alpha=5\) and \(\beta=0.25\).
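The figure is not reproduced here; a minimal R sketch of the kind of code that could produce such a comparison, using the stated values \(\mu=3\), \(\alpha=5\), \(\beta=0.25\) and an illustrative sample size, is:

```r
# Exact Gamma posterior vs. its Laplace (Gaussian) approximation for Poisson data.
set.seed(1)
mu0 <- 3; alpha <- 5; beta <- 0.25
n <- 50                                              # illustrative sample size
y <- rpois(n, mu0)
shape <- sum(y) + alpha; rate <- n + 1 / beta        # exact posterior: Gamma(shape, rate)
mode_hat <- (shape - 1) / rate                       # posterior mode
var_hat  <- mode_hat^2 / (shape - 1)                 # [-H(mode)]^{-1}
mu_grid <- seq(mu0 - 1.5, mu0 + 1.5, length.out = 400)
plot(mu_grid, dgamma(mu_grid, shape = shape, rate = rate), type = "l",
     xlab = expression(mu), ylab = "density")
lines(mu_grid, dnorm(mu_grid, mode_hat, sqrt(var_hat)), lty = 2)   # Laplace approximation
```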
\[\\[.2 in]\]
\[\mathrm{[Latent ~Field]}\quad {\bf z} \mid \boldsymbol{\theta} \sim N(0, {\bf Q}^{-1} (\boldsymbol{\theta})); \] \[\mathrm{[Hyperpriors]}\quad \boldsymbol{\theta} \sim \pi(\boldsymbol{\theta}),\]
where \({\bf Q} (\boldsymbol{\theta})\) is the precision matrix of the latent Gaussian field, and the observations \({\bf y}\) enter through a likelihood \(p({\bf y}\mid {\bf z},\boldsymbol{\theta})\).
The goal is to approximate the marginal posteriors \(\pi(\boldsymbol{\theta} \mid {\bf y})\) and \(p(z_{i}\mid {\bf y})\).
Observe that \[\pi(\boldsymbol{\theta} \mid {\bf y})=\frac{p({\bf z}, \boldsymbol{\theta} \mid {\bf y})}{p({\bf z} \mid \boldsymbol{\theta} , {\bf y})} \propto \frac{p({\bf z}, \boldsymbol{\theta} ,{\bf y})}{p({\bf z}\mid \boldsymbol{\theta} , {\bf y})}= \frac{p({\bf y}\mid {\bf z},\boldsymbol{\theta}) p({\bf z} \mid \boldsymbol{\theta}) \pi(\boldsymbol{\theta})}{p({\bf z} \mid \boldsymbol{\theta}, {\bf y})}.\]
Next, \(\pi(\boldsymbol{\theta} \mid {\bf y})\) is approximated by \[\tilde{\pi}(\boldsymbol{\theta} \mid {\bf y}) \propto \frac{p({\bf y}\mid \widehat{\bf z}(\boldsymbol{\theta}),\boldsymbol{\theta}) p(\widehat{\bf z}(\boldsymbol{\theta}) \mid \boldsymbol{\theta}) \pi(\boldsymbol{\theta})}{\tilde{p}(\widehat{\bf z}(\boldsymbol{\theta}) \mid \boldsymbol{\theta}, {\bf y})}, \] where \(\widehat{\bf z}(\boldsymbol{\theta})\) is a mode of \(p({\bf z}\mid \boldsymbol{\theta}, {\bf y})\) and \(\tilde{p}\) is a Gaussian approximation (with mean \(\widehat{\bf z}(\boldsymbol{\theta})\) and variance covariance matrix \(\boldsymbol{\Sigma}_{\boldsymbol{\theta}}\)) to the distribution of \({\bf z}\mid \boldsymbol{\theta}, {\bf y}\). Here \(\boldsymbol{\Sigma}_{\boldsymbol{\theta}}\) is the inverse of the Hessian of \(-\ln p({\bf z},\boldsymbol{\theta}, {\bf y})\) w.r.t. \({\bf z}\), evaluated at \(\widehat{\bf z}(\boldsymbol{\theta})\).
Thus, the above expression can be simplified as \[\tilde{\pi}(\boldsymbol{\theta} \mid {\bf y}) \propto p({\bf y}\mid \widehat{\bf z}(\boldsymbol{\theta}),\boldsymbol{\theta}) p(\widehat{\bf z}(\boldsymbol{\theta}) \mid \boldsymbol{\theta}) \pi(\boldsymbol{\theta}) |\boldsymbol{\Sigma}_{\boldsymbol{\theta}} |^{1/2}. \tag{1}\]
Towards the second goal, the marginal posterior is \[p(z_{i} \mid {\bf y})= \int p( z_{i} \mid {\bf y}, \boldsymbol{\theta}) \pi(\boldsymbol{\theta} \mid {\bf y}) d \boldsymbol{\theta}, \tag{2}\] with \[ p(z_{i} \mid \boldsymbol{\theta}, {\bf y}) \propto \frac{p({\bf z},\boldsymbol{\theta} \mid {\bf y})}{p({\bf z}_{-i}\mid z_{i}, \boldsymbol{\theta}, {\bf y})} \propto \frac{p({\bf y}\mid {\bf z},\boldsymbol{\theta}) p({\bf z} \mid \boldsymbol{\theta}) \pi(\boldsymbol{\theta})}{p({\bf z}_{-i}\mid z_{i}, \boldsymbol{\theta}, {\bf y})} . \]
Using a similar Gaussian approximation to the distribution of \({\bf z}_{-i}\mid z_{i}, \boldsymbol{\theta}, {\bf y}\), with appropriate parameters, provides \[p(z_{i} \mid \boldsymbol{\theta}, {\bf y}) \approx \tilde{p}(z_{i} \mid \boldsymbol{\theta}, {\bf y}) \propto p({\bf y}\mid \widehat{\bf z}_{-i}(\boldsymbol{\theta},z_{i}),\boldsymbol{\theta}) p(\widehat{\bf z}_{-i}(\boldsymbol{\theta},z_{i}) \mid \boldsymbol{\theta}) \pi(\boldsymbol{\theta}) |\boldsymbol{\Sigma}_{\boldsymbol{\theta},z_{i}}|^{1/2}, \tag{3} \] where \({\bf z}_{-i}\) denotes all elements of \({\bf z}\) except \(z_{i}\), \(\widehat{\bf z}_{-i}(\boldsymbol{\theta},z_{i})\) denotes a mode of \(p({\bf z},\boldsymbol{\theta},{\bf y})\) w.r.t. \({\bf z}_{-i}\) given \((z_{i},\boldsymbol{\theta})\), and \(\boldsymbol{\Sigma}_{\boldsymbol{\theta},z_{i}}\) is the inverse of the Hessian matrix of \(-\ln p({\bf z}, \boldsymbol{\theta},{\bf y})\) w.r.t. \({\bf z}_{-i}\), evaluated at \(\widehat{\bf z}_{-i}(\boldsymbol{\theta},z_{i})\).
Thus, (2) is approximated by a deterministic numerical integration scheme defined over a grid of values of \(\boldsymbol{\theta}\), by approximating \(p( z_{i} \mid {\bf y}, \boldsymbol{\theta})\) as in (3), and \(\pi(\boldsymbol{\theta} \mid {\bf y})\) as in (1).
\[\\[.05 in]\] Remarks:
The Gaussian approximations in each case can be asymptotically justified using second order Taylor series expansion as in Laplace approximation.
The approximation in (3) can be computationally expensive when \(n\) is large, and therefore some further simplifications are incorporated.
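To make the scheme in (1) concrete, here is a toy R sketch (not R-INLA itself) for an assumed illustrative model: \(y_{i} \mid z_{i} \sim \mathtt{Poisson}(e^{z_{i}})\) independently, \({\bf z} \mid \theta \sim N({\bf 0}, \theta^{-1}{\bf I})\) with a scalar precision \(\theta\), and an \(\mathtt{Exponential}(1)\) prior on \(\theta\). The approximation (1) is evaluated on a grid of \(\theta\) values, with the mode \(\widehat{\bf z}(\theta)\) found by Newton iterations:

```r
# Toy version of approximation (1): grid over the hyperparameter theta (a precision),
# Gaussian/Laplace approximation to z | theta, y at each grid value.
set.seed(1)
n <- 30
z_true <- rnorm(n, 0, 1)
y <- rpois(n, exp(z_true))                       # illustrative simulated data

log_post_theta <- function(theta) {
  # Newton iterations for the mode z_hat(theta) of log p(y | z) + log p(z | theta)
  z <- log(y + 0.5)                              # rough starting value
  for (k in 1:50) {
    grad <- y - exp(z) - theta * z
    hess_diag <- exp(z) + theta                  # negative Hessian is diagonal here
    z <- z + grad / hess_diag
  }
  # log of (1): log p(y | z_hat) + log p(z_hat | theta) + log pi(theta) + 0.5 * log|Sigma_theta|
  sum(dpois(y, exp(z), log = TRUE)) +
    sum(dnorm(z, 0, sqrt(1 / theta), log = TRUE)) +
    dexp(theta, 1, log = TRUE) -
    0.5 * sum(log(exp(z) + theta))
}

theta_grid <- seq(0.05, 5, length.out = 80)
log_w <- sapply(theta_grid, log_post_theta)
w <- exp(log_w - max(log_w))
post_theta <- w / sum(w * diff(theta_grid)[1])   # normalized tilde-pi(theta | y) on the grid
plot(theta_grid, post_theta, type = "l", xlab = expression(theta), ylab = "approx. posterior")
```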
\[\\[.2 in]\]
ADF minimizes the KL divergence between the true posterior and a member of an approximating class of densities. The main steps of ADF are described as follows:
Let \(p(\boldsymbol{\theta})\) be a fixed distribution and \(q(\boldsymbol{\theta})\) be an approximating distribution from the exponential family with natural parameter \(\boldsymbol{\eta}\), i.e., \[q(\boldsymbol{\theta})= h(\boldsymbol{\theta}) g(\boldsymbol{\eta})\exp\{\boldsymbol{\eta}^{\top} u(\boldsymbol{\theta}) \},\] where \(u(\boldsymbol{\theta})\) is the vector of sufficient statistics for the family \(q\).
The KL-divergence of \(q\) from \(p\) can be expressed as \[KL(p||q) = - \ln g(\boldsymbol{\eta}) - \boldsymbol{\eta}^{\top} E_{p(\boldsymbol{\theta})}(u(\boldsymbol{\theta}))+\mbox{terms free of }\boldsymbol{\eta}. \]
The first order condition towards minimization of \(KL(p||q)\) provides \[E_{p(\boldsymbol{\theta})}(u(\boldsymbol{\theta})) = - \nabla \ln g(\boldsymbol{\eta}) . \]
As \(E_{q(\boldsymbol{\theta})}(u(\boldsymbol{\theta})) = - \nabla \ln g(\boldsymbol{\eta})\) by the properties of exponential family, we have \[E_{p(\boldsymbol{\theta})}(u(\boldsymbol{\theta})) = E_{q(\boldsymbol{\theta})}(u(\boldsymbol{\theta})). \] Therefore, the expectation of the sufficient statistics for the optimal approximating density must match with that under \(p\).
\[\\[.1 in]\] Remarks:
ADF requires that the moments of the true posterior be calculated relatively efficiently. If this is not the case, then ADF cannot be applied in a straightforward manner.
ADF is conceptually similar to variational Bayes (VB) method. However, in VB the reverse KL divergence \(KL(q||p)\) is minimized.
\[\\[.1 in]\]
Consider the mixture model \(p({\bf y} \mid \boldsymbol{\theta}) = (1-w) \mathrm{N}({\bf y}; \boldsymbol{\theta}, {\bf I})+ w \mathrm{N}({\bf y}; {\bf 0}, 10 \times {\bf I})\). Here \(\boldsymbol{\theta}\) is the parameter of interest and \(w\) is the known proportion of the noise component. A Gaussian prior with mean \({\bf 0}\) and variance \(\sigma^{2}{\bf I}\) is placed on \(\boldsymbol{\theta}\).
Given \(n\) observations \({\bf y}_{1},\ldots, {\bf y}_{n}\) we have \[\pi(\boldsymbol{\theta} \mid {\bf y}_{1}, \ldots, {\bf y}_{n}) \propto \pi(\boldsymbol{\theta}) \prod_{i=1}^{n} p({\bf y}_{i} \mid \boldsymbol{\theta})= \prod_{j=0}^{n} t_{j}, \] where \(t_{0}=\pi(\boldsymbol{\theta})\) and \(t_{i}=p({\bf y}_{i} \mid \boldsymbol{\theta})\), \(i=1, \ldots,n\).
Suppose we choose the approximating density to be Gaussian with mean \(\boldsymbol{\mu}_{\boldsymbol{\theta}}\) and variance covariance matrix \(s_{\boldsymbol{\theta}}^{2} {\bf I}\).
The sufficient statistics of the approximating density are \(u(\boldsymbol{\theta})=(\boldsymbol{\theta}, -\boldsymbol{\theta}^{\top}\boldsymbol{\theta})\), with expectation \(E_{q(\boldsymbol{\theta})}(u(\boldsymbol{\theta}))= (\boldsymbol{\mu}_{\boldsymbol{\theta}}, -\boldsymbol{\mu}_{\boldsymbol{\theta}}^{\top} \boldsymbol{\mu}_{\boldsymbol{\theta}} - D s_{\boldsymbol{\theta}}^{2})\), where \(D\) is the dimension of \(\boldsymbol{\theta}\). However, we do not know \(E_{p(\boldsymbol{\theta})}(u(\boldsymbol{\theta}))\). We therefore build up an estimate \(\widehat{p}(\boldsymbol{\theta})\) of \(p\) sequentially.
We start with the initialization \(\widehat{p}(\boldsymbol{\theta})=q(\boldsymbol{\theta})=\pi(\boldsymbol{\theta})\). After the first observation we update \[\widehat{p}(\boldsymbol{\theta}) = \frac{q(\boldsymbol{\theta}) t_{1}(\boldsymbol{\theta})}{\int q(\boldsymbol{\theta}) t_{1} (\boldsymbol{\theta}) d\boldsymbol{\theta}}. \]
Now, to compute the expectations, consider the following relations: \[Z(\boldsymbol{\mu}_{\boldsymbol{\theta}},s_{\boldsymbol{\theta}}^{2})=\int q(\boldsymbol{\theta}) t_{1} (\boldsymbol{\theta})d\boldsymbol{\theta}= \int \frac{t_{1}(\boldsymbol{\theta})}{(2\pi s_{\boldsymbol{\theta}}^{2})^{D/2}} \exp\left\{ -\frac{1}{2 s_{\boldsymbol{\theta}}^{2}} \left(\boldsymbol{\theta}- \boldsymbol{\mu}_{\boldsymbol{\theta}}\right)^{\top}\left(\boldsymbol{\theta}- \boldsymbol{\mu}_{\boldsymbol{\theta}}\right)\right\} d\boldsymbol{\theta}, \] \[\nabla_{\mu} \ln Z(\boldsymbol{\mu}_{\boldsymbol{\theta}},s_{\boldsymbol{\theta}}^{2}) = \frac{1}{s_{\boldsymbol{\theta}}^{2}} \left\{ E_{\widehat{p}}(\boldsymbol{\theta}) - \boldsymbol{\mu}_{\boldsymbol{\theta}} \right\} , \] and \[\nabla_{s_{\boldsymbol{\theta}}^{2}} \ln Z(\boldsymbol{\mu}_{\boldsymbol{\theta}},s_{\boldsymbol{\theta}}^{2}) = \frac{1}{2 s_{\boldsymbol{\theta}}^{4} } E_{\widehat{p}}\left\{(\boldsymbol{\theta}-\boldsymbol{\mu}_{\boldsymbol{\theta}} )^{\top}(\boldsymbol{\theta}-\boldsymbol{\mu}_{\boldsymbol{\theta}} ) \right\} - \frac{D}{2 s_{\boldsymbol{\theta}}^{2}}.\]
This in turn provides the relation \[E_{\widehat{p}(\boldsymbol{\theta})}(\boldsymbol{\theta}) = \boldsymbol{\mu}_{\boldsymbol{\theta}} + s_{\boldsymbol{\theta}}^{2} \nabla_{\mu} \ln Z(\boldsymbol{\mu}_{\boldsymbol{\theta}},s_{\boldsymbol{\theta}}^{2}),\] and \[E_{\widehat{p}(\boldsymbol{\theta})}(\boldsymbol{\theta}^{\top} \boldsymbol{\theta}) - E_{\widehat{p}(\boldsymbol{\theta})}(\boldsymbol{\theta})^{\top} E_{\widehat{p}(\boldsymbol{\theta})}(\boldsymbol{\theta}) = D s_{\boldsymbol{\theta}}^{2} -s_{\boldsymbol{\theta}}^{4} \left\{ \left\|\nabla_{\mu} \ln Z(\boldsymbol{\mu}_{\boldsymbol{\theta}},s_{\boldsymbol{\theta}}^{2})\right\|^{2} - 2\nabla_{s_{\boldsymbol{\theta}}^{2}} \ln Z(\boldsymbol{\mu}_{\boldsymbol{\theta}},s_{\boldsymbol{\theta}}^{2})\right\}.\]
Using the above relations in the example we get \[Z(\boldsymbol{\mu}_{\boldsymbol{\theta}},s_{\boldsymbol{\theta}}^{2})=(1-w)\phi_{(s_{\boldsymbol{\theta}}^{2}+1){\bf I}}(\boldsymbol{\mu}_{\boldsymbol{\theta}}-{\bf y}_{1})+w \phi_{10\times {\bf I}}({\bf y}_{1})= z_{1}+z_{2};\qquad r=\frac{z_{1}}{z_{1}+z_{2}}, \] where \(\phi_{\boldsymbol{\Sigma}}(\cdot)\) denotes the \(\mathrm{N}({\bf 0},\boldsymbol{\Sigma})\) density.
Differentiating w.r.t. \(\boldsymbol{\mu}_{\boldsymbol{\theta}}\) and \(s_{\boldsymbol{\theta}}^{2}\) we get: \[\nabla_{\mu} \ln Z(\boldsymbol{\mu}_{\boldsymbol{\theta}},s_{\boldsymbol{\theta}}^{2}) = r \frac{{\bf y}_{1}-\boldsymbol{\mu}_{\boldsymbol{\theta}}}{s_{\boldsymbol{\theta}}^{2}+1},\quad \mbox{and}\quad \nabla_{s_{\boldsymbol{\theta}}^{2}} \ln Z(\boldsymbol{\mu}_{\boldsymbol{\theta}},s_{\boldsymbol{\theta}}^{2}) = -\frac{rD}{2(s_{\boldsymbol{\theta}}^{2}+1)}+ r \frac{({\bf y}_{1}-\boldsymbol{\mu}_{\boldsymbol{\theta}})^{\top}({\bf y}_{1}-\boldsymbol{\mu}_{\boldsymbol{\theta}})}{2(s_{\boldsymbol{\theta}}^{2}+1)^{2}}.\]
Thus, after absorbing each observation \({\bf y}_{i}\), the parameters are updated by the following steps, where \(r_{i}\) is computed as above with \({\bf y}_{i}\) in place of \({\bf y}_{1}\):
\[ \boldsymbol{\mu}_{\boldsymbol{\theta}}^{\mathrm{new}}=\boldsymbol{\mu}_{\boldsymbol{\theta}} + s_{\boldsymbol{\theta}}^{2} r_{i} \frac{{\bf y}_{i}-\boldsymbol{\mu}_{\boldsymbol{\theta}}}{s_{\boldsymbol{\theta}}^{2}+1},\] and \[s_{\boldsymbol{\theta}}^{2~\mathrm{new}} = s_{\boldsymbol{\theta}}^{2} - r_{i} \frac{s_{\boldsymbol{\theta}}^{4}}{s_{\boldsymbol{\theta}}^{2}+1} + r_{i} (1-r_{i})\frac{s_{\boldsymbol{\theta}}^{4}({\bf y}_{i}-\boldsymbol{\mu}_{\boldsymbol{\theta}})^{\top}({\bf y}_{i}-\boldsymbol{\mu}_{\boldsymbol{\theta}})}{D(s_{\boldsymbol{\theta}}^{2}+1)^{2}} . \]
The figure (code attached) for \(\sigma^{2}=10\) is given below.
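The figure is not reproduced here; a minimal R sketch of the sequential ADF updates for this mixture (clutter) model, with illustrative simulated data in \(D=2\) dimensions, \(w=0.25\) and \(\sigma^{2}=10\), is:

```r
# ADF for the mixture (clutter) model: sequentially absorb each likelihood term
# and match the mean and the isotropic variance of the approximating Gaussian.
set.seed(1)
D <- 2; w <- 0.25; sigma2 <- 10                      # prior variance sigma^2 = 10
theta_true <- c(2, 2)                                # illustrative true parameter
n <- 100
signal  <- matrix(rnorm(n * D, mean = rep(theta_true, each = n)), n, D)
clutter <- matrix(rnorm(n * D, sd = sqrt(10)), n, D)
y <- ifelse(matrix(runif(n) < w, n, D), clutter, signal)   # simulated data

mu <- rep(0, D); s2 <- sigma2                        # initialize q(theta) with the prior
for (i in 1:n) {
  yi <- y[i, ]
  z1 <- (1 - w) * prod(dnorm(yi, mu, sqrt(s2 + 1)))  # signal component of Z
  z2 <- w * prod(dnorm(yi, 0, sqrt(10)))             # clutter component of Z
  r  <- z1 / (z1 + z2)
  d2 <- sum((yi - mu)^2)
  mu_new <- mu + s2 * r * (yi - mu) / (s2 + 1)
  s2_new <- s2 - r * s2^2 / (s2 + 1) + r * (1 - r) * s2^2 * d2 / (D * (s2 + 1)^2)
  mu <- mu_new; s2 <- s2_new
}
list(mean = mu, sd = sqrt(s2))                       # approximate posterior mean and scale
```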
\[\\[.2 in]\]
To be covered.
\[\\[.2 in]\]
To be covered.
\[\\[.2 in]\]
Computing Bayes: Bayesian computation from 1763 to the 21st Century, by Gael M. Martin, David T. Frazier and Christian P. Robert. (Link)
A review of deterministic approximate inference techniques for Bayesian machine learning, by Shiliang Sun. (Link)
A family of algorithms for approximate Bayesian inference, by Thomas P. Minka. (Link)
Spatial modeling with R-INLA: A review, by Haakon Bakka, Håvard Rue, Geir-Arne Fuglstad, Andrea Riebler, David Bolin, Janine Illian, Elias Krainski, Daniel Simpson and Finn Lindgren. (Link)
\[\\[.2 in]\]