Approximate Bayesian Inference
1 Bayesian inference in general
1.1 Bayesian formula
parameter(s): \(\boldsymbol{\theta} \in \boldsymbol{\Theta} \subseteq \mathbb{R}^{p}\)
observed data: \(\boldsymbol{y}\)
prior of \(\boldsymbol{\theta}\): \(\pi(\boldsymbol{\theta})\)
likelihood function: \(p(\boldsymbol{y} \mid \boldsymbol{\theta})\)
posterior distribution:
\[ \pi(\boldsymbol{\theta} \mid \boldsymbol{y})=\frac{p(\boldsymbol{y} \mid \boldsymbol{\theta}) \pi(\boldsymbol{\theta})}{\int_{\Theta} p(\boldsymbol{y} \mid \boldsymbol{\theta}) \pi(\boldsymbol{\theta}) \mathrm{d} \boldsymbol{\theta}} \propto p(\boldsymbol{y} \mid \boldsymbol{\theta}) \pi(\boldsymbol{\theta}) . \]
- marginal distribution of the data: \[ p(\boldsymbol{y})=\int_{\Theta} p(\boldsymbol{y} \mid \boldsymbol{\theta}) \pi(\boldsymbol{\theta}) \mathrm{d} \boldsymbol{\theta}. \]
1.2 Methods of Bayesian computation
- iid sampling
- inverse CDF
- importance sampling
- rejection sampling
- adaptive rejection sampling
- MCMC
- MH algorithm (random walk MH, independence MH)
- Gibbs sampling
- HMC (no-U-turn)
- slice sampling (Neal, 2003)
- sequential Monte Carlo (SMC)
- Problem of sampling-based methods: in sampling methods like MCMC, a likelihood estimator with large variance can cause the Markov chain to get stuck and reduce efficiency (Doucet et al., 2015).
- Pros: theoretically accurate and universally applicable, especially for complex models.
- Cons: slow convergence for high-dimensional models; time-consuming convergence diagnostics and burn-in detection.
1.3 Approximate Bayesian methods
Approximate Bayesian computation (ABC) (Sisson et al., 2019)
Bayesian synthetic likelihood (BSL) (Wood, 2010; Price et al., 2018).
Variational Bayes (VB) (Jaakkola and Jordan, 2000)
Integrated nested Laplace approximations (INLA): approximate Bayesian inference for latent Gaussian models (Rue et al., 2009)
Comparison of ABC and BSL: common points
- both are likelihood-free methods;
- both eschew calculation of the intractable likelihood via simulation from the model that is assumed to have generated the data, and are applicable to any model in which data can be reliably, and cheaply, simulated;
- both are based on a choice of summary statistics that are used to represent the data.
Comparison of ABC and BSL: differences
- ABC uses a nonparametric approach to estimate the intractable likelihood (see Blum, 2010 for details), while BSL employs a parametric assumption;
- ABC can scale poorly with the dimension of the summary statistics, while BSL can easily handle summary statistics of much larger dimension;
- ABC requires the user to select values for various tuning parameters, while BSL requires very little tuning.
2 Likelihood-free inference
2.1 Motivation
In complex models, the likelihood function can be intractable or very expensive to evaluate.
Instead, special inference techniques are needed that do not require direct evaluation of the likelihood function, but rely on model simulations to approximate the likelihood.
Common likelihood-free methods
pseudo-marginal methods (Beaumont, 2003; Andrieu and Roberts, 2009; Doucet et al., 2015) (ABC can be viewed as a pseudo-marginal method.)
pseudo-marginal MCMC synthetic likelihood method (Price et al., 2016)
indirect inference (Gourieroux et al., 1993)
approximate Bayesian computation (ABC) (Sisson et al., 2018)
Bayesian synthetic likelihood (BSL) (Wood, 2010; Price et al., 2018)
2.2 Quick look at approximate Bayesian computation (ABC)
ABC essentially estimates the intractable likelihood non-parametrically at \(\boldsymbol{\theta} \in \Theta \subseteq \mathbb{R}^{p}\) with a simulation \(\boldsymbol{x} \sim p(\cdot \mid \boldsymbol{\theta})\).
The raw dataset is usually reduced to summaries with a carefully chosen summary statistic function \(\boldsymbol{S}(\cdot): \mathbb{R}^{m} \mapsto \mathbb{R}^{d}\), where \(m\) and \(d\) are the dimensions of the raw data and the summary statistics, respectively (\(d\ge p\)).
Observed summary statistics: \(\boldsymbol{s_{y}}=\boldsymbol{S}(\boldsymbol{y})\)
simulated summary statistics: \(\boldsymbol{s_{x}}=\boldsymbol{S}(\boldsymbol{x})\)
The ABC likelihood function: \[ p_{\epsilon}\left(\boldsymbol{s}_{\boldsymbol{y}} \mid \boldsymbol{\theta}\right)=\int_{\mathcal{Y}} K_{\epsilon}\left(\rho\left(\boldsymbol{s}_{\boldsymbol{y}}, \boldsymbol{s}_{\boldsymbol{x}}\right)\right) p(\boldsymbol{x} \mid \boldsymbol{\theta}) \mathrm{d} \boldsymbol{x} \] where \(\rho(\cdot)\) is a discrepancy function, which measures the distance between the observed and simulated summary statistics, \(K_{\epsilon}(\cdot)\) is a kernel weighting function, usually an indicator function \(I(\cdot<\epsilon)\) for convenience, which links distance and tolerance \(\epsilon\).
\(\epsilon\) is used to trade off between the bias and variance of the likelihood estimator.
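To make the kernel-weighting idea concrete, here is a minimal Monte Carlo sketch of the ABC likelihood with an indicator kernel and Euclidean discrepancy; the `simulate` and `summarise` helpers are hypothetical placeholders for the model simulator and summary statistic function.

```python
import numpy as np

def abc_likelihood_estimate(theta, s_y, simulate, summarise, eps, n=100, rng=None):
    """Monte Carlo estimate of p_eps(s_y | theta) with K_eps an indicator kernel."""
    rng = np.random.default_rng(rng)
    accepted = 0
    for _ in range(n):
        x = simulate(theta, rng)                      # draw x ~ p(. | theta)
        rho = np.linalg.norm(summarise(x) - s_y)      # discrepancy between summaries
        accepted += rho < eps                         # indicator kernel I(rho < eps)
    return accepted / n
```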
Remarks:
- non-parametric nature of the ABC likelihood estimate
- inefficient when the summary statistic is high dimensional (Blum and Francois, 2010), which is often referred to as the curse of dimensionality (of both the parameter and summary statistic).
2.3 Quick look at Bayesian synthetic likelihood estimator
Two pioneering works:
- Wood, S. N. (2010). Statistical inference for noisy nonlinear ecological dynamic systems. Nature 466, 1102–1107. It introduced the synthetic likelihood and used it for approximate (non-Bayesian) inference.
- Price, L. F., C. C. Drovandi, A. C. Lee, and D. J. Nott (2018). Bayesian synthetic likelihood. Journal of Computational and Graphical Statistics 27 (1), 1–11. It discussed Bayesian implementations focusing on efficient computational methods.
Aim of BSL: approximate the likelihood \(p(\boldsymbol{s_y}\mid \boldsymbol{\theta})\) with a so-called SL estimator, an auxiliary likelihood function given by a parametric density approximation \[ p_{A}\left(\boldsymbol{s}_{\boldsymbol{y}} \mid \boldsymbol{\phi}\right). \]
BSL arises when the auxiliary likelihood is combined with a prior distribution on the parameter and used to make Bayesian inference for \(\boldsymbol{\theta}\).
The posterior of \(\boldsymbol{\theta}\), called SL posterior or partial posterior, becomes
\[p_{A}\left(\boldsymbol{\theta} \mid \boldsymbol{s}_{\boldsymbol{y}}\right) \propto p_{A}\left(\boldsymbol{s}_{\boldsymbol{y}} \mid \boldsymbol{\phi}\right) \pi(\boldsymbol{\theta}). \]
Remarks:
- the mapping from \(\boldsymbol{\theta}\) to \(\boldsymbol{\phi}\) is typically unknown.
- \(\boldsymbol{\phi}\) can be estimated with simulations.
Key to SL estimators:
- form of \(p_A\)
- type of estimator of \(\boldsymbol{\phi}\)
2.4 Compatibility of simulated approximate Bayesian methods
The true model is so complex that we cannot easily access its DGP and must instead resort to an approximate inference approach.
While complicated, highly-structured models allow us to explain critical features of the observed data, it is unlikely that any modeler will be able to construct an entirely accurate model that captures all features of the data. In short, “all models are wrong [and] the scientist cannot obtain a ‘correct’ one by excessive elaboration” (Box, 1976).
Approximate Bayesian inference methods based on summary statistics implicitly assume that the model being used (assumed model/DGP) to generate the simulated summary statistics can replicate the behavior of the observed summary statistics. That is, BSL is not required to match every aspect of the data, but, rather, only those features of the data that are captured via the summary statistics.
Both ABC and BSL generally degrade the data down to a vector of summary statistics that are used to represent the data and then perform posterior inference on the unknown \(\boldsymbol{\theta}\), now conditional only on this vector of summary statistics, by selecting values of \(\boldsymbol{\theta}\) that yield a close match between the summary statistics calculated from the observed data and those calculated under the model.
The goal of the approach is to target simulated draws from the so-called ‘partial-posterior’ based on these statistics.
MCMC BSL: using independent simulations from the assumed data generating process (DGP), the mean and covariance of the summary statistics are estimated and used to construct a (simulated) Gaussian likelihood function that can be directly inserted into standard Markov chain Monte Carlo (MCMC) algorithms.
3 Bayesian synthetic likelihood estimators
3.1 Standard Bayesian synthetic likelihood (sBSL)
Key points:
Strong assumption: an approximately Gaussian summary statistic for the data, informative about the parameters, is available.
It uses a multivariate normal distribution to approximate the likelihood function for the summary statistic, with plug-in mean and covariance matrix obtained by Monte Carlo simulation from the model.
MCMC is used to explore the parameter space.
It can be extended to other reasonable parametric approximations of the likelihood (Drovandi et al., 2015).
Procedure:
(As in ABC) obtain \(\boldsymbol{x}_{1}, \ldots, \boldsymbol{x}_{n} \stackrel{\text{iid}}{\sim} p(\cdot \mid \boldsymbol{\theta})\): a collection of \(n\) simulations from the model at a proposed parameter value \(\boldsymbol{\theta}\).
Calculate the simulated summary statistics \(\boldsymbol{s_1}, \boldsymbol{s_2}, \dots, \boldsymbol{s_n}\), where \(\boldsymbol{s}_{i}=\boldsymbol{S}\left(\boldsymbol{x}_{i}\right) \in \mathcal{S} \subseteq \mathbb{R}^{d}\).
Get the auxiliary likelihood (Wood, 2010) \[p_{sBSL}\left(\boldsymbol{s_y} \mid \boldsymbol{\phi}_{sBSL}\right)=\mathcal{N}\left(\boldsymbol{s_y} \mid \boldsymbol{\phi}_{sBSL}\right), \] where \(\boldsymbol{\phi}_{sBSL}=(\boldsymbol{\mu}(\boldsymbol{\theta}), \boldsymbol{\Sigma}(\boldsymbol{\theta}))\), called the auxiliary parameters, is estimated with the sample mean and covariance \(\hat{\boldsymbol{\phi}}_{sBSL}=\left(\boldsymbol{\mu}_{n}(\boldsymbol{\theta}), \boldsymbol{\Sigma}_{n}(\boldsymbol{\theta})\right)\), \[ \begin{aligned} \boldsymbol{\mu}_{n}(\boldsymbol{\theta}) &=\frac{1}{n} \sum_{i=1}^{n} \boldsymbol{s}_{i} \\ \boldsymbol{\Sigma}_{n}(\boldsymbol{\theta}) &=\frac{1}{n-1} \sum_{i=1}^{n}\left(\boldsymbol{s}_{i}-\boldsymbol{\mu}_{n}(\boldsymbol{\theta})\right)\left(\boldsymbol{s}_{i}-\boldsymbol{\mu}_{n}(\boldsymbol{\theta})\right)^{\top}. \end{aligned} \] This gives the estimated auxiliary likelihood \(\mathcal{N}\left(\boldsymbol{s}_{y} ; \boldsymbol{\mu}_{n}(\boldsymbol{\theta}), \boldsymbol{\Sigma}_{n}(\boldsymbol{\theta})\right)\).
Get the partial (BSL) posterior of \(\boldsymbol{\theta}\) \[ p_{sBSL, n}\left(\boldsymbol{\theta} \mid \boldsymbol{s}_{y}\right) \propto p_{sBSL, n}\left(\boldsymbol{s}_{y} \mid \boldsymbol{\theta}\right) \pi(\boldsymbol{\theta}), \] where \[ p_{sBSL,n}(\boldsymbol{s_y} \mid \boldsymbol{\theta})=\int_{S^n} \mathcal{N}\left(\boldsymbol{s_y}; \boldsymbol{\mu_{n}(\theta)}, \boldsymbol{\Sigma_{n}(\theta)}\right)\prod_{i=1}^{n} p(\boldsymbol{s_{i}}\mid \boldsymbol{\theta}) \mathrm{d} \boldsymbol{s_1}\dots \mathrm{d} \boldsymbol{s_n} \]
Cf. the ideal BSL posterior
\[ \begin{aligned} p_{sBSL}\left(\boldsymbol{\theta} \mid \boldsymbol{s}_{y}\right) &= p_{sBSL}\left(\boldsymbol{s}_{y} \mid \boldsymbol{\theta}\right) \pi(\boldsymbol{\theta})\\ &\propto \mathcal{N}\left(\boldsymbol{s}_{y} ; \boldsymbol{\mu}(\boldsymbol{\theta}), \boldsymbol{\Sigma}(\boldsymbol{\theta})\right) \pi(\boldsymbol{\theta}) . \end{aligned} \]
Samples from \(p_{sBSL, n}\left(\boldsymbol{\theta} \mid \boldsymbol{s}_{y}\right)\) are obtained with an MCMC algorithm (see Algorithm 1 below), a pseudo-marginal-style algorithm (Andrieu and Roberts, 2009) that would target \(p_{sBSL}\left(\boldsymbol{\theta} \mid \boldsymbol{s}_{y}\right)\) if the synthetic likelihood estimator were unbiased. The result is called MCMC BSL.
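As a minimal sketch of the plug-in synthetic log-likelihood at a proposed \(\boldsymbol{\theta}\) (the quantity an MCMC BSL sampler evaluates at each iteration), assuming hypothetical `simulate` and `summarise` helpers for the model simulator and summary function:

```python
import numpy as np
from scipy.stats import multivariate_normal

def sbsl_loglik(theta, s_y, simulate, summarise, n=50, rng=None):
    """Standard synthetic log-likelihood: plug the sample mean and covariance of
    n simulated summaries into a Gaussian density evaluated at the observed summary."""
    rng = np.random.default_rng(rng)
    sims = np.array([summarise(simulate(theta, rng)) for _ in range(n)])
    mu_n = sims.mean(axis=0)
    Sigma_n = np.cov(sims, rowvar=False)          # unbiased sample covariance
    return multivariate_normal.logpdf(s_y, mean=mu_n, cov=Sigma_n)
```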
Two issues of BSL
sensitivity to the choice of \(n\)
robustness toward departures from normality of the summary statistic
Remarks:
- Although \(\hat{\phi}_{sBSL}\) is an unbiased estimator of \(\phi_{sBSL}\), \(\mathcal{N}\left(\boldsymbol{s}_{\boldsymbol{y}} \mid \hat{\boldsymbol{\phi}}_{sBSL}\right)\) is not an unbiased estimator of \(\mathcal{N}\left(\boldsymbol{s}_{\boldsymbol{y}} \mid \boldsymbol{\phi}_{sBSL}\right)\).
- The approximate posterior exhibits very weak dependence on \(n\) (Price et al., 2018). Nevertheless, if \(n\) is too small, there can be significant Monte Carlo error in the MCMC SL estimator.
Results:
- BSL can outperform ABC even when the model summary statistics show small departures from normality.
- BSL requires less tuning and can be accelerated with parallel computing.
- BSL is able to deal with very high-dimensional summary statistics.
- Asymptotic properties of BSL have been derived under various assumptions. Frazier and Drovandi (2019) develop a result for posterior concentration and Nott et al. (2019) show that the BSL posterior mean is consistent and asymptotically normally distributed.
Drawback of sBSL: large number of simulations \(n\) for estimating a high-dimensional covariance matrix in the SL.
3.2 Unbiased BSL Likelihood Estimator (uBSL)
There is an exactly unbiased, nonnegative estimator of a normal density function (Ghurye and Olkin, 1969, sec. 3.4), referred to here as uBSL, which results in a valid pseudo-marginal algorithm targeting \(p_{uBSL}\left(\boldsymbol{\theta} \mid \boldsymbol{s}_{y}\right)\).
Ghurye and Olkin (1969) provide an unbiased estimator for the multivariate normal density based on independent simulations from it.
uBSL requires \(n>d+3\) for unbiasedness to hold.
The auxiliary parameter \(\boldsymbol{\phi}\) is estimated by \[\hat{\boldsymbol{\phi}}_{go}=\left(\boldsymbol{\mu}_{n}(\boldsymbol{\theta}), \boldsymbol{M}_{n}(\boldsymbol{\theta})\right),\] where \(\boldsymbol{M}_{n}(\boldsymbol{\theta})=(n-1) \boldsymbol{\Sigma}_{n}(\boldsymbol{\theta})\).
The unbiased SL estimator is given by \[ \begin{aligned} p_{go}\left(\boldsymbol{s_{y}} \mid \boldsymbol{\phi}_{go}\right)=&(2 \pi)^{-d / 2} \frac{c(d, n-2)}{c(d, n-1)(1-1 / n)^{d / 2}}\left|\boldsymbol{M}_{\boldsymbol{n}}(\boldsymbol{\theta})\right|^{-(n-d-2) / 2} \\ & \Psi\left(\boldsymbol{M}_{\boldsymbol{n}}(\boldsymbol{\theta})-\frac{\left(\boldsymbol{s}_{\boldsymbol{y}}-\boldsymbol{\mu}_{n}(\boldsymbol{\theta})\right)\left(\boldsymbol{s}_{\boldsymbol{y}}-\boldsymbol{\mu}_{n}(\boldsymbol{\theta})\right)^{\top}}{1-1 / n}\right)^{(n-d-3) / 2} \end{aligned} \]
where \[ c(k, v)=\frac{2^{-k v / 2} \pi^{-k(k-1) / 4}}{\prod_{i=1}^{k} \Gamma\left(\frac{v-i+1}{2}\right)} \] and \[ \Psi(\boldsymbol{A})=\left\{\begin{array}{ll} |\boldsymbol{A}| & \text { if } \boldsymbol{A}>0 \\ 0 & \text { otherwise } \end{array}\right. \]
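A minimal sketch of this estimator on the log scale, assuming the simulated summaries are supplied as an \(n \times d\) array (names are illustrative):

```python
import numpy as np
from scipy.special import gammaln

def log_c(k, v):
    """log of the normalising constant c(k, v) in Ghurye and Olkin (1969)."""
    i = np.arange(1, k + 1)
    return (-0.5 * k * v * np.log(2) - 0.25 * k * (k - 1) * np.log(np.pi)
            - gammaln((v - i + 1) / 2).sum())

def log_ubsl(s_y, sims):
    """Log of the unbiased synthetic likelihood estimator p_go(s_y | phi_go);
    sims is an (n, d) array of simulated summaries, and n > d + 3 is required."""
    n, d = sims.shape
    mu_n = sims.mean(axis=0)
    M = (sims - mu_n).T @ (sims - mu_n)              # M_n = (n - 1) * sample covariance
    A = M - np.outer(s_y - mu_n, s_y - mu_n) / (1 - 1 / n)
    eig_A = np.linalg.eigvalsh(A)
    if np.any(eig_A <= 0):                           # Psi(A) = 0 unless A is positive definite
        return -np.inf
    _, logdet_M = np.linalg.slogdet(M)
    return (-0.5 * d * np.log(2 * np.pi)
            + log_c(d, n - 2) - log_c(d, n - 1) - 0.5 * d * np.log(1 - 1 / n)
            - 0.5 * (n - d - 2) * logdet_M
            + 0.5 * (n - d - 3) * np.log(eig_A).sum())
```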
Results and comments:
Replacing \(p_{sBSL}\left(\boldsymbol{s}_{y} \mid \boldsymbol{\theta}\right)=\mathcal{N}\left(\boldsymbol{s}_{y} ; \boldsymbol{\mu}_{n}(\boldsymbol{\theta}), \boldsymbol{\Sigma}_{n}(\boldsymbol{\theta})\right)\) with \(p_{go}\left(\boldsymbol{s_{y}} \mid \boldsymbol{\phi_{go}}\right)\) creates the novel MCMC uBSL algorithm
\(E\left(p_{go}\left(\boldsymbol{s_{y}} \mid \hat{\boldsymbol{\phi}}_{go}\right)\right)=\mathcal{N}\left(\boldsymbol{s_y} ; \boldsymbol{\mu}(\boldsymbol{\theta}), \boldsymbol{\Sigma}(\boldsymbol{\theta})\right)\) if the summary statistic \(\boldsymbol{S}(\boldsymbol{y})\) is Gaussian, provided that \(n > d+3\).
Although the normality assumption rarely holds exactly in practice, the approximate posterior distributions obtained by uBSL show less dependence on \(n\) when compared to BSL (Price et al., 2018).
The value of \(n\) should be chosen such that the log SL at some \(\theta\) with high (BSL) posterior support is estimated with a standard deviation of roughly 1 (Doucet et al., 2015).
Even though the distribution targeted by such a pseudo-marginal algorithm does not depend on \(n\), the mixing of the algorithm can be very poor unless \(n\) is chosen large enough to control the variance of the likelihood estimate. Doucet et al. (2015) suggest fixing the variance of the log likelihood estimate to be around 1 for pseudo-marginal Metropolis–Hastings algorithms, to achieve an optimal trade off between computational cost and precision.
Alternative: stochastic gradient variational inference (Tran et al., 2015)
4 Robust BSL
Issue in sBSL and uBSL: normality assumption.
Aim: find a new method which maintains the efficiency gains of BSL but enhances its robustness.
Some robustifying BSL methods:
- Semi-parametric BSL (semiBSL) (An et al., 2020)
- Bayesian synthetic likelihood with mean adjustment (Frazier and Drovandi, 2019)
- Bayesian synthetic likelihood with variance adjustment (Frazier and Drovandi, 2019)
- An extended empirical saddlepoint (EES) estimation method (Fasiolo et al., 2018)
4.1 Semi-Parametric BSL Likelihood Estimator (semiBSL)
Breakdown may happen for sBSL when the marginal summary statistic distributions deviate greatly from Gaussianity:
- the posterior distribution will fail to adequately approximate the true partial posterior;
- the variance of the log synthetic likelihood estimator may be so large that the MCMC chain will become stuck within only a few iterations, and no discernible posterior distribution may be recovered.
semiBSL provides additional robustness for a non-Gaussian distributed summary statistic.
Copula models can be used to approximate the joint distribution of the model summary statistic(s). Two aspects of semiBSL:
non-parametrically estimating the marginal distributions using kernel density estimation (KDE)
parametrically estimating the dependence structure using a Gaussian copula
An et al. (2020) show empirically that semiBSL performs favourably relative to sBSL when summary statistics are distributed irregularly.
semiBSL maintains many of the attractive properties of sBSL, including its scalability to a high dimensional summary statistic and ease of tuning.
The marginal density of the \(j\)-th summary statistic, \(g_{j}(s)\), is approximated using the kernel density estimate (KDE): \[ \hat{g}_{j}(s)=\frac{1}{n} \sum_{i=1}^{n} K_{h}\left(s-s_i^{j}\right), \] where \(s_i^{j}\) is the \(j\)-th component of \(\boldsymbol{s}_i=\boldsymbol{S}(\boldsymbol{x}_i)\) and \(K_h(u)=h^{-1}K(u/h)\) is a non-negative kernel function with bandwidth parameter \(h\).
An et al. (2018) select the Gaussian kernel \(K(u)=\frac{1}{\sqrt{2\pi}}\exp\{-u^2/2\}\) for simplicity and for its unbounded support, and, as suggested by Silverman (2018), set \(h=0.9\, n^{-0.2}\min(\sigma, \text{interquartile range}/1.34)\), where \(\sigma\) is the standard deviation of the distribution and can be estimated empirically.
By using KDE, the semiBSL estimator can accommodate non-normal marginal summary statistics.
semiBSL uses a Gaussian copula to model dependence between summaries. \[ c_{\boldsymbol{R}}(\boldsymbol{u})=\frac{1}{\sqrt{|\boldsymbol{R}|}} \exp \left\{-\frac{1}{2} \boldsymbol{\eta}^{\top}\left(\boldsymbol{R}^{-1}-\boldsymbol{I}_{d}\right) \boldsymbol{\eta}\right\}, \] where \(\boldsymbol{R}\) is the correlation matrix, \(\boldsymbol{\eta}=\left(\Phi^{-1}\left(u_{1}\right), \ldots, \Phi^{-1}\left(u_{d}\right)\right)^{\top}\), \(\Phi^{-1}(\cdot)\) is the inverse CDF of a standard normal distribution, and \(\boldsymbol{I}_{d}\) is the \(d\)-dimensional identity matrix. The main appeal of the Gaussian copula here is that the correlation matrix is fast to estimate.
The semi-parametric synthetic likelihood estimator is \(p_{\text {semiBSL}}\left(s_{y} \mid \boldsymbol{\phi}\right)\), where \[\boldsymbol{\phi}=\left(\boldsymbol{R}, u_{1}, \ldots, u_{d}, g_{1}\left(s_{y}^{1}\right), \ldots, g_{d}\left(s_{y}^{d}\right)\right)\] is estimated by \[\hat{\boldsymbol{\phi}}_{semiBSL}= \left(\hat{\boldsymbol{R}}, \hat{u}_{1}, \ldots, \hat{u}_{d}, \hat{g}_{1}\left(s_{y}^{1}\right), \ldots, \hat{g}_{d}\left(s_{y}^{d}\right)\right),\] where \(\hat{G}_{j}\) and \(\hat{g}_{j}\) are the KDE estimates of the CDF and pdf of the \(j\)-th summary statistic, and \(\hat{u}_{j}=\hat{G}_{j}\left(s_{y}^{j}\right)\).
- \(\boldsymbol{R}\) can be estimated by MLE or the Gaussian rank correlation (GRC)(Boudt et al., 2012; An et al., 2018).
- The GRC estimator is known to be more robust. The \((i, j)\)-th component of the GRC estimate is given by \[ \hat{\rho}_{i, j}^{\mathrm{grc}}=\frac{\sum_{k=1}^{n} \Phi^{-1}\left(\frac{r\left(s_{k}^{i}\right)}{n+1}\right) \Phi^{-1}\left(\frac{r\left(s_{k}^{j}\right)}{n+1}\right)}{\sum_{k=1}^{n} \Phi^{-1}\left(\frac{k}{n+1}\right)^{2}}, \] where \(r(\cdot): \mathbb{R} \rightarrow \mathcal{A}\), where \(\mathcal{A} \equiv\{1, \ldots, n\}\), is the rank function.
The full semiBSL estimator is given by
\[ p_{semiBSL}\left(s_{y} \mid \hat{\phi}_{semiBSL}\right)=\frac{1}{\sqrt{|\hat{R}|}} \exp \left\{-\frac{1}{2} \hat{\eta}_{s_{y}}^{\top}\left(\hat{R}^{-1}-I_{d}\right) \hat{\eta}_{s_{y}}\right\} \prod_{j=1}^{d} \hat{g}_{j}\left(s_{y}^{j}\right) \]
The partial semiBSL posterior becomes \[ p_{\operatorname{semiBSL,n}}\left(\boldsymbol{\theta}\mid \boldsymbol{s}_{\boldsymbol{y}}\right) \propto p_{semiBSL, n}\left(\boldsymbol{s}_{y} \mid \boldsymbol{\theta}\right) \pi(\boldsymbol{\theta}), \] where \[ p_{\operatorname{semiBSL,n}}\left(\boldsymbol{s}_{\boldsymbol{y}}\mid \boldsymbol{\theta} \right) \propto \int_{S^n}p_{\operatorname{semiBSL,n}}\left(\boldsymbol{s}_{\boldsymbol{y}} \mid \hat{\phi}_{semiBSL}\right) \prod_{i=1}^{n} p(\boldsymbol{s_{i}}\mid \boldsymbol{\theta}) \mathrm{d} \boldsymbol{s_1}\dots \mathrm{d} \boldsymbol{s_n}. \]
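A minimal sketch of the semiBSL log-likelihood estimate, using scipy's default Gaussian KDE bandwidth rather than the Silverman rule above, and the GRC estimate of \(\boldsymbol{R}\); `sims` is a hypothetical \(n \times d\) array of simulated summaries.

```python
import numpy as np
from scipy.stats import gaussian_kde, norm

def grc(sims):
    """Gaussian rank correlation estimate of R from the (n, d) summaries."""
    n, d = sims.shape
    ranks = np.argsort(np.argsort(sims, axis=0), axis=0) + 1      # ranks 1..n per column
    eta = norm.ppf(ranks / (n + 1))
    denom = np.sum(norm.ppf(np.arange(1, n + 1) / (n + 1)) ** 2)
    return eta.T @ eta / denom

def semibsl_loglik(s_y, sims):
    """Semi-parametric synthetic log-likelihood: KDE marginals + Gaussian copula."""
    n, d = sims.shape
    log_marg, eta_y = np.zeros(d), np.zeros(d)
    for j in range(d):
        kde = gaussian_kde(sims[:, j])                  # KDE of the j-th marginal
        log_marg[j] = np.log(kde.evaluate(s_y[j])[0])
        u_j = kde.integrate_box_1d(-np.inf, s_y[j])     # KDE estimate of the CDF at s_y^j
        eta_y[j] = norm.ppf(np.clip(u_j, 1e-12, 1 - 1e-12))
    R = grc(sims)
    _, logdet = np.linalg.slogdet(R)
    return (-0.5 * logdet
            - 0.5 * eta_y @ (np.linalg.inv(R) - np.eye(d)) @ eta_y
            + log_marg.sum())
```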
Properties from examples:
- semiBSL does not require a larger \(n\) compared to BSL.
- semiBSL can provide robustness without incurring additional model simulations.
Two main limitations of semiBSL
- The number of model simulations required to accurately estimate \(R\) scales poorly with \(d\).
- the KDE is unreliable for distributions with extremely heavy tails, which may induce unduly high variance in the \(p_{semiBSL,n}(\boldsymbol{s_y}\mid \boldsymbol{\theta})\) estimator and cause semiBSL to fail.
4.2 Robust BSL with mean and variance adjustment
Implicit BSL assumption: the data generating process (DGP) can generate simulated summary statistics that capture the behaviour of the observed summary statistics.
If the assumed DGP (\(P_\theta\)) differs from the true DGP (\(P_0\)), model compatibility may no longer be satisfied and BSL can give unreliable inferences. This is the problem of model incompatibility between the observed and simulated summary statistics.
Two possible strategies can be used for conducting inference using BSL when the assumed model (assumed DGP) and summary statistics (from true DGP) are incompatible. See Frazier and Drovandi (2019).
augments the mean of the simulated summaries with additional free parameters, when the observed statistics \(\boldsymbol{S}(\boldsymbol{y})\) are not in the support of the simulated mean \(\boldsymbol{\mu}_{n}(\boldsymbol{\theta})\), for any \(\boldsymbol{\theta} \in \Theta\), with high probability.
augments the variance of the simulated summaries with additional free parameters.
Robust versions of BSL: adjusting the simulated summaries to ensure a form of compatibility between the assumed model and the chosen summary statistics. They should give reliable inference when the summaries are not compatible and are capable of detecting precisely which summaries, if any, are incompatible.
4.2.1 Mean adjustment
Incompatibility implies that the observed statistics \(\boldsymbol{S(y)}\) are not in the support of the simulated mean \(\boldsymbol{\mu_n(\theta)}\), for any \(\boldsymbol{\theta} \in \Theta\subset \mathbb{R}^{p}\), with high probability.
One way to create a BSL procedure that is robust to this incompatibility issue is to adjust the vector of simulated means by adding to \(\boldsymbol{\mu}_{n}(\boldsymbol{\theta})\) an additional free parameter \(\boldsymbol{\Gamma}\in G \subset \mathbb{R}^{q}\) so that \(\boldsymbol{S}(\boldsymbol{y})\) is always in the support of the new simulated mean.
The vector of means for use in robust BSL-mean: \[ \phi_{n}(\boldsymbol{\theta},\boldsymbol{\Gamma})=\boldsymbol{\mu_{n}(\theta)}+\operatorname{diag}\left(\boldsymbol{\Sigma_{n}^{1 / 2}(\theta)}\right) \Gamma. \]
Get the augmented partial (BSL) posterior (robust BSL-mean (R-BSL-M) posterior) of \(\boldsymbol{\zeta}=(\boldsymbol{\theta}^T,\boldsymbol{\Gamma}^T)^T\) \[ p_{rBSL-m, n}\left(\boldsymbol{\zeta} \mid \boldsymbol{s}_{y}\right) \propto p_{rBSL-m, n}\left(\boldsymbol{s}_{y} \mid \boldsymbol{\zeta}\right) p(\boldsymbol{\zeta}), \] where \[ p_{rBSL-m,n}(\mathbf{s_y} \mid \boldsymbol{\zeta})=\int_{S^n} \mathcal{N}\left(\mathbf{s_y}; \phi_{n}(\zeta), \Sigma_{n}(\theta)\right)\prod_{i=1}^{n} p(s_{i}\mid \boldsymbol{\theta}) \mathrm{d} \boldsymbol{s_{1}}\dots \mathrm{d} \boldsymbol{s_{n}} \]
Prior Choice: Laplace Prior \[ \pi(\Gamma):=\prod_{j=1}^{q} \lambda e^{-\lambda\left|\gamma_{j}\right|}=\lambda^{q} e^{-\lambda \sum_{j=1}^{q}\left|\gamma_{j}\right|} \] \[ \pi(\zeta):=\pi(\theta) \pi(\Gamma) \]
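A minimal sketch of the mean-adjusted synthetic log-likelihood and the Laplace log-prior, reading \(\operatorname{diag}\left(\boldsymbol{\Sigma_{n}^{1 / 2}(\theta)}\right)\) as the vector of marginal standard deviations of the simulated summaries (an assumption); `sims` is an \(n \times d\) array of summaries simulated at \(\boldsymbol{\theta}\).

```python
import numpy as np
from scipy.stats import multivariate_normal

def rbsl_mean_loglik(gamma, s_y, sims):
    """R-BSL-M synthetic log-likelihood: the simulated mean is shifted by
    the marginal standard deviations times the free parameters gamma."""
    mu_n = sims.mean(axis=0)
    Sigma_n = np.cov(sims, rowvar=False)
    phi_n = mu_n + np.sqrt(np.diag(Sigma_n)) * gamma      # adjusted mean vector
    return multivariate_normal.logpdf(s_y, mean=phi_n, cov=Sigma_n)

def log_laplace_prior(gamma, lam=0.5):
    """Laplace log-prior on the adjustment parameters, log pi(Gamma)."""
    return gamma.size * np.log(lam) - lam * np.abs(gamma).sum()
```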
4.2.2 Variance adjustment
An alternative approach is to inflate the variance of the simulated summaries so that \(\boldsymbol{S}(\boldsymbol{y})\) always lies well within the effective support of the adjusted synthetic likelihood. \[ V_{n}(\zeta):=\Sigma_{n}(\theta)+\left(\begin{array}{cccc} {\left[\Sigma_{n}(\theta)\right]_{11} \gamma_{1}^{2}} & 0 & \ldots & 0 \\ 0 & {\left[\Sigma_{n}(\theta)\right]_{22} \gamma_{2}^{2}} & \ldots & 0 \\ \vdots & \ldots & \ddots & \vdots \\ 0 & \cdots & \cdots & {\left[\Sigma_{n}(\theta)\right]_{q q} \gamma_{q}^{2}} \end{array}\right) \]
The robust BSL-variance (R-BSL-V) posterior is \[ p_{rBSL-v, n}\left(\boldsymbol{\zeta} \mid \boldsymbol{s}_{y}\right) \propto p_{rBSL-v, n}\left(\boldsymbol{s}_{y} \mid \boldsymbol{\zeta}\right) p(\boldsymbol{\zeta}), \] where \[ p_{rBSL-v,n}(\mathbf{s_y} \mid \boldsymbol{\zeta})=\int_{S^n} \mathcal{N}\left(\mathbf{s_y}; \mu_{n}(\theta), V_{n}(\zeta)\right)\prod_{i=1}^{n} p(s_{i}\mid \boldsymbol{\theta}) \mathrm{d} \boldsymbol{s_{1:n}} \]
Prior Choice: Exponential Prior \[ \pi(\zeta):=\pi(\theta) \pi(\Gamma), \] where \[ \pi(\Gamma):=\prod_{i=1}^{q} \lambda e^{-\lambda \gamma_{i}}=\lambda^{q} e^{-\lambda \sum_{i=1}^{q} \gamma_{i}} \]
Note: To sample the target posterior distribution of \(\zeta\) we use a component-wise MCMC algorithm that updates, in turn, \(\Gamma\) conditional on \(\theta\) and then \(\theta\) conditional on \(\Gamma\). The update for \(\theta\) is the same as in standard BSL (see the MCMC BSL algorithm in Section 8). The full conditional distribution for \(\gamma_j\) is given by \[ \begin{aligned} \pi\left(\gamma_{j}^{*} \mid \theta, \mu_{n}(\theta), \Sigma_{n}(\theta), \gamma_{/ j}\right) &\propto \mathcal{N}\left[s(\mathbf{y}) ; \phi_{n}\left(\zeta^{*}\right), \Sigma_{n}(\theta)\right] \pi\left(\gamma_{j}^{*}\right) \text{ (mean adjustment)}\\ \pi\left(\gamma_{j}^{*} \mid \theta, \mu_{n}(\theta), \Sigma_{n}(\theta), \gamma_{/ j}\right) &\propto \mathcal{N}\left[s(\mathbf{y}) ; \mu_{n}(\theta), V_{n}\left(\zeta^{*}\right)\right] \pi\left(\gamma_{j}^{*}\right) \text{ (variance inflation)}, \end{aligned} \] for which the slice sampling method of Neal (2003) can be used.
5 Efficient BSL
In practice, \(\boldsymbol{\Sigma_n(\theta)}\) is not the most efficient estimator of \(\boldsymbol{\Sigma(\theta)}\) in BSL.
Shrinkage and whitening strategies are two common ways to improve efficiency within the BSL context.
- With shrinkage, the primary aim is to estimate the covariance matrix \(\boldsymbol{\Sigma(\theta)}\) with as few model simulations as possible such that the performance of a BSL sampler is efficient.
5.1 Accelerated likelihood estimator with shrinkage estimation (for sBSL and semiBSL)
The graphical lasso can be used to estimate a penalised covariance (An et al., 2019; Friedman et al., 2008); the R package ‘glasso’ (Friedman et al., 2018) implements it.
- The computational gains can be achieved by assuming that the inverse covariance matrix, or precision matrix, \(\boldsymbol{\Theta}=\boldsymbol{\Sigma}^{-1}\), is sparse.
- It aims to find a sparse precision matrix \(\boldsymbol{\Theta}\) by maximising the following \(l_{1}\) regularised log-likelihood \[ \log p(\boldsymbol{s_{1:n}}|\boldsymbol{\Theta})=c+\log |\boldsymbol{\Theta}|-\operatorname{tr}(\boldsymbol{\Theta} \boldsymbol{\Sigma_n})-\lambda\|\boldsymbol{\Theta}\|_{1} \] over the space of all positive-definite matrices, where \(\boldsymbol{\Sigma_n}\) is the sample covariance, \(c\) is a constant independent of \(\boldsymbol{\Theta}\), \(\|\cdot\|_{1}\) is the \(l_{1}\) norm, and \(\lambda\) is the penalty parameter.
The problem is solved with convex optimisation. The penalty parameter \(\lambda\) controls the sparsity in \(\boldsymbol{\Theta}\), with increasing \(\lambda\) leading to more sparsity. Incorporating the glasso estimator in MCMC BSL algorithm requires different considerations for choosing \(\lambda\).
- Algorithm: Procedure to select the penalty value \(\lambda\) for use within MCMC BSLasso

Warton’s (shrinkage) estimator
- Warton (2008) introduces a straightforward shrinkage estimator that is extremely fast to compute but still performs well.
- Define \(\hat{\boldsymbol{\Sigma}}_d\) as the diagonal matrix containing the diagonal of the sample covariance matrix \(\boldsymbol{\Sigma_n}\); then the sample correlation matrix \(\boldsymbol{R}\) can be computed as \[\hat{\boldsymbol{R}}=\hat{\boldsymbol{\Sigma}}_d^{-1/2} \boldsymbol{\Sigma_n} \hat{\boldsymbol{\Sigma}}_d^{-1 / 2} \]
- Warton’s shrinkage (ridge) estimator for the correlation matrix \(\boldsymbol{R}\) is given by \[ \hat{\boldsymbol{R}}_{\gamma}=\gamma \hat{\boldsymbol{R}}+(1-\gamma) \boldsymbol{I}_{d}, \] where \(\gamma \in[0,1]\) is the shrinkage parameter and \(\boldsymbol{I}_{d}\) is the identity matrix. The correlation matrix degenerates to the identity matrix as \(\gamma\) approaches 0.
- With a correlation-to-covariance conversion, we get Warton’s covariance estimator for \(\boldsymbol{\Sigma}\) (a code sketch follows) \[ \hat{\boldsymbol{\Sigma}}_{n,\gamma}=\hat{\boldsymbol{\Sigma}}_d^{1/2} \hat{\boldsymbol{R}}_{\gamma} \hat{\boldsymbol{\Sigma}}_d^{1/2}. \]
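A minimal sketch of Warton’s shrinkage estimator applied to an \(n \times d\) array of simulated summaries (array name illustrative):

```python
import numpy as np

def warton_shrinkage_cov(sims, gamma):
    """Warton (2008) ridge shrinkage: shrink the sample correlation matrix of the
    summaries towards the identity by gamma in [0, 1], then convert back to a covariance."""
    Sigma_n = np.cov(sims, rowvar=False)
    D = np.sqrt(np.diag(Sigma_n))                         # marginal standard deviations
    R_hat = Sigma_n / np.outer(D, D)                      # sample correlation matrix
    R_gamma = gamma * R_hat + (1 - gamma) * np.eye(len(D))
    return np.outer(D, D) * R_gamma                       # Sigma_d^{1/2} R_gamma Sigma_d^{1/2}
```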
Justification of efficiency
- The smaller the value of \(\gamma\), the closer \(\hat{\boldsymbol{\Sigma}}_{n,\gamma}\) comes to being a diagonal matrix.
- In the context of BSL, a smaller \(\gamma\) reduces the variance of the synthetic likelihood estimator \(\hat p_{A, n}(\boldsymbol{s_y}\mid \boldsymbol{\theta})\) for a given number of simulations, \(n\). This implies that fewer model simulations are required to achieve the same acceptance rate (as a measure of sampler performance) within BSL.
5.2 BSL with whitening transformations (wBSL)
Limitation of standard shrinkage estimator for BSL
- Shrinkage on its own works well in cases where there is a low degree of correlation between summaries (Ong et al., 2018a), but performs poorly for small \(\gamma\) when there is significant correlation between summaries.
- Compared with wBSL, using the standard ridge shrinkage estimator alone produces far less accurate posterior approximations.
Theory: computational efficiency of a BSL algorithm based on the variance of the synthetic log-likelihood estimator
Standard BSL likelihood based on correlated summaries, with simulated iid summaries \(\boldsymbol{s}_{i}(\boldsymbol{\theta}) \sim \mathcal{N}(\boldsymbol{\mu}(\boldsymbol{\theta}), \boldsymbol{\Sigma}(\boldsymbol{\theta}))\): \[ p_{A, n}\left(\boldsymbol{s}_{\boldsymbol{y}} \mid \boldsymbol{\theta}\right)=\mathcal{N}\left(\boldsymbol{s}_{\boldsymbol{y}} \mid \boldsymbol{\mu}_{n}(\boldsymbol{\theta}), \boldsymbol{\Sigma}_{n}(\boldsymbol{\theta})\right) . \]
BSL likelihood based on uncorrelated summaries, with simulated iid summaries \(\boldsymbol{\zeta}_{i}(\boldsymbol{\theta}) \sim \mathcal{N}(\boldsymbol{\zeta}(\boldsymbol{\theta}), \boldsymbol{\Omega}(\boldsymbol{\theta}))\), where \(\boldsymbol{\Omega}(\boldsymbol{\theta})=\operatorname{diag}(\omega_{11},\dots, \omega_{dd})\) (summaries are uncorrelated across the \(d\) dimensions): \[ p_{A, n, w}\left(\boldsymbol{s_{y}} \mid \boldsymbol{\theta}\right)=\mathcal{N}\left(\boldsymbol{s_{y}} \mid \boldsymbol{\zeta_{n}(\theta)}, \boldsymbol{\Omega_{n}(\theta)}\right) \]
Main result: For \(d\) and \(n\) large, but finite, with \(n > d + 4\), \[ \begin{aligned} \operatorname{Var}\left[\log \left(p_{A, n}\left(s_{y} \mid \theta\right)\right)\right]&=\mathcal{O}\left(\frac{d^{2} \cdot n^{2}}{(n-d)^{3}}\right),\\ \operatorname{Var}\left[\log \left(p_{A, n, w}\left(s_{y} \mid \theta\right)\right)\right]&=\mathcal{O}\left(\frac{d \cdot n^{2}}{(n-d)^{3}}\right) . \end{aligned} \]
For the standard synthetic likelihood estimator with a full covariance matrix, \(n\) must scale quadratically with \(d\) to control the variance of the synthetic log-likelihood.
For the synthetic likelihood with uncorrelated summaries, \(n\) must scale linearly with \(d\) to control the variance of the synthetic log-likelihood.
The main idea of wBSL is that an approximate whitening or decorrelation transformation may be applied to the summary statistic at each algorithm iteration.
In practice, a set of uncorrelated summary statistics that are informative about the model parameters is not usually available.
BSL with whitening transformations to approximately decorrelate the correlated summary statistics (Priddle, Sisson and Frazier, 2020; Priddle and Drovandi, 2020)
- a whitening transformation converts the random vector \(s\) of arbitrary distribution with covariance matrix \(\operatorname{Var}(s)=\Sigma\), into a new set of variables \[ \tilde{\boldsymbol{s}}=\boldsymbol{W} \boldsymbol{s} \] for some \(d \times d\) whitening matrix \(\boldsymbol{W}\), such that the covariance \(\operatorname{Var}(\tilde{\boldsymbol{s}})=\boldsymbol{I}_{d}\) is the identity matrix of dimension \(d\).
- The only requirement for the whitening matrix \(\boldsymbol{W}\) is \(\boldsymbol{W^T W}=\boldsymbol{\Sigma}^{-1}\).
Five natural whitening procedures outlined by Kessy et al. (2018) (a code sketch follows this list):
- zero-phase component analysis whitening (ZCA): \(\boldsymbol{W}_{ZCA}=\boldsymbol{\Sigma}^{-1/2}\)
- ZCA correlation whitening (ZCA-cor): \(\boldsymbol{W}_{ZCA\text{-}cor}=\boldsymbol{P}^{-1/2}\boldsymbol{V}^{-1/2}\), where \(\boldsymbol{\Sigma}\) is decomposed as \(\boldsymbol{V}^{1/2}\boldsymbol{P}\boldsymbol{V}^{1/2}\), \(\boldsymbol{P}\) is the correlation matrix and \(\boldsymbol{V}\) is the diagonal matrix of variances.
- principal component analysis whitening (PCA): \(\boldsymbol{W}_{PCA}=\boldsymbol{\Lambda}^{-1/2}\boldsymbol{U}^{\top}\), where the eigendecomposition of the covariance matrix \(\boldsymbol{\Sigma}=\boldsymbol{U}\boldsymbol{\Lambda}\boldsymbol{U}^{\top}\) is used, with \(\boldsymbol{U}\) the matrix of eigenvectors and \(\boldsymbol{\Lambda}\) the diagonal matrix of eigenvalues.
- PCA correlation whitening (PCA-cor): \(\boldsymbol{W}_{PCA-cor}=\boldsymbol{\Xi}\boldsymbol{G}^{\top}\boldsymbol{V}^{-1/2}\), where the eigendecomposition of the correlation matrix \(\boldsymbol{P}\) is used \(\boldsymbol{P}=\boldsymbol{G} \boldsymbol{\Xi} \boldsymbol{G}^{\top}\), where \(\boldsymbol{G}\) is the eigenvector matrix and \(\boldsymbol{\Xi}\) is the diagonal matrix of eigenvalues.
- Cholesky whitening: \(\boldsymbol{W}_{Cholesky}=\boldsymbol{L}^{\top}\), where the Cholesky decomposition of the precision matrix \(\boldsymbol{\Sigma}^{-1}=\boldsymbol{L} \boldsymbol{L}^{\top}\) is used, with \(\boldsymbol{L}\) the unique lower triangular factor.
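A minimal sketch of three of these whitening constructions (eigendecompositions computed with numpy); each returned \(\boldsymbol{W}\) satisfies \(\boldsymbol{W}^{\top}\boldsymbol{W}=\boldsymbol{\Sigma}^{-1}\):

```python
import numpy as np

def whitening_matrix(Sigma, method="PCA"):
    """Whitening matrix W such that W Sigma W^T = I (equivalently W^T W = Sigma^{-1})."""
    if method == "ZCA":                          # zero-phase component analysis: Sigma^{-1/2}
        lam, U = np.linalg.eigh(Sigma)
        return U @ np.diag(lam ** -0.5) @ U.T
    if method == "PCA":                          # principal component analysis: Lambda^{-1/2} U^T
        lam, U = np.linalg.eigh(Sigma)
        return np.diag(lam ** -0.5) @ U.T
    if method == "Cholesky":                     # L^T with Sigma^{-1} = L L^T
        L = np.linalg.cholesky(np.linalg.inv(Sigma))
        return L.T
    raise ValueError("unknown whitening method")

# the transformed summaries s_tilde = W @ s then have (approximately) identity covariance
```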
Algorithm: MCMC wBSL

Result from examples
- shrinkage covariance estimation of Warton (2008) can be much more effective when a whitening transformation is applied, producing an accurate, low variance estimate of the likelihood function for a relatively small number of model simulations.
- The PCA-based whitening methods are most effective and provide the most accurate posterior approximation (provide posterior approximations closest to standard BSL.)
- Relative to the other whitening transformations, for PCA-based whitening, the covariance deviation (excluding variances) does not increase as rapidly as \(\boldsymbol{\theta}\) moves from \(\boldsymbol{\theta_{true}}\).
- Correlations between the transformed summary statistics are far less sensitive to changing \(\boldsymbol{\theta}\) for PCA-based whitening compared to ZCA-based or Cholesky whitening.
- The wBSL posterior has some robustness to the value of \(\boldsymbol{\theta_0}\), the point estimate at which the whitening matrix is computed.
5.3 semiBSL with whitening transformations (wsemiBSL)
Gaussian approximation of the summary statistic likelihood of \(\boldsymbol{s_y}\) in terms of the covariance matrix \(\boldsymbol{\Sigma}\) (dependence on \(\boldsymbol{\theta}\) is suppressed, as for \(\boldsymbol{\mu}\)): \[ \mathcal{N}\left(s_{y} ; \boldsymbol{\mu}, \boldsymbol{\Sigma}\right)=\frac{1}{\sqrt{(2 \pi)^{d} \operatorname{det}(\boldsymbol{\Sigma})}} \exp \left\{-\frac{1}{2}\left(s_{y}-\boldsymbol{\mu}\right)^{\top} \boldsymbol{\Sigma}^{-1}\left(\boldsymbol{s}_{\boldsymbol{y}}-\boldsymbol{\mu}\right)\right\}, \]
with \(\boldsymbol{R}\): \(\boldsymbol{\Sigma}=\boldsymbol{\Sigma}_{d}^{1 / 2} \boldsymbol{R} \boldsymbol{\Sigma}_{d}^{1 / 2}\) and \(\boldsymbol{\Sigma}_{d}=\operatorname{diag}\left(\sigma_{1}^{2}, \ldots, \sigma_{d}^{2}\right)\), we have \[ \mathcal{N}\left(s_{y} ; \boldsymbol{\mu}, \boldsymbol{\Sigma}\right) \propto \frac{1}{\sqrt{\operatorname{det}(\boldsymbol{R})}} \exp \left\{-\frac{1}{2} \boldsymbol{\eta}_{\boldsymbol{s}_{y}}^{\top} \boldsymbol{R}^{-1} \boldsymbol{\eta}_{\boldsymbol{s}_{y}}\right\} \prod_{j=1}^{d} \frac{\mathcal{N}\left(\eta_{y}^{j} ; 0, \sigma_{j}^{2}\right)}{\phi\left(\hat{\eta}_{y}^{j}\right)} \]
In wsemiBSL the whitening transformation is applied to the standard Gaussian quantiles, and not directly to the summary statistics as in wBSL.
Aim: find a whitening transformation to convert the random vector \(\boldsymbol{\eta}\) of arbitrary distribution with covariance matrix \(Var(\boldsymbol{\eta}) = \boldsymbol{R}\) into the transformed vector \(\boldsymbol{\tilde\eta}=\boldsymbol{W}\boldsymbol{\eta}\) for some \(d\times d\) whitening matrix \(\boldsymbol{W}\), such that the covariance \(Var(\boldsymbol{\tilde\eta}) = \boldsymbol{I_d}\).
6 Variational Bayes with intractable likelihood
6.1 Stochastic gradient variational Bayes
About classical Variational Bayes (VB)
VB is widely used as a computationally effective method for approximating the posterior distribution \(\pi(\boldsymbol{\theta}\mid y)\) (Bishop 2006).
VB approximates the posterior by a distribution \(q(\boldsymbol{\theta})\) within some tractable class, such as an exponential family, chosen to minimize the Kullback-Leibler divergence between \(q(\theta)\) and \(\pi(\theta\mid y)\).
Most of the VB algorithms require that the likelihood \(p(y \mid \boldsymbol{\theta})\) can be computed analytically for any \(\boldsymbol{\theta}\).
Difficulties in applying VB: it is not workable when the likelihood is intractable, in the sense that it is infeasible to compute \(p(y\mid \boldsymbol{\theta})\) exactly at each value of \(\boldsymbol{\theta}\). Examples with intractable likelihood: ABC settings, state–space models, time series models.
Theory of stochastic gradient variational Bayes
consider a parametric family with typical element \(q_\lambda(\theta)\), where \(\lambda\) is a variational parameter to be chosen. The Kullback–Leibler divergence from \(q_\lambda(\theta)\) to \(p(\theta\mid y)\) is given by \[\mathrm{KL}(\lambda)=\mathrm{KL}\left(q_{\lambda}(\theta) \| p(\theta \mid y)\right)=\int \log \frac{q_{\lambda}(\theta)}{p(\theta \mid y)} q_{\lambda}(\theta) \mathrm{d} \theta. \]
Minimising \(\mathrm{KL}(\lambda)\) with respect to \(\lambda\) is equivalent to maximising \[\mathcal{L}(\lambda)=\int \log \frac{p(\theta) p(y \mid \theta)}{q_{\lambda}(\theta)} q_{\lambda}(\theta) \mathrm{d} \theta \]
\(\mathcal{L}(\lambda)\) is a lower bound on the log marginal likelihood \(\log p(y)\).
Writing \(h(\theta)=p(\theta) p(y \mid \theta)\), and using the “log derivative trick” identity \(\nabla_{\lambda} q_{\lambda}(\theta)=q_{\lambda}(\theta) \nabla_{\lambda} \log q_{\lambda}(\theta)\) together with \(E(\nabla_{\lambda} \log q_{\lambda}(\theta))=0\), we have \[ \begin{aligned} \nabla_{\lambda} \mathcal{L}(\lambda)=& \nabla_{\lambda} \int\left\{\log h(\theta)-\log q_{\lambda}(\theta)\right\} q_{\lambda}(\theta) \mathrm{d} \theta \\ =& \int \log h(\theta) \nabla_{\lambda} \log q_{\lambda}(\theta) q_{\lambda}(\theta) \mathrm{d} \theta \\ &-\int \log q_{\lambda}(\theta) \nabla_{\lambda} \log q_{\lambda}(\theta) q_{\lambda}(\theta) \mathrm{d} \theta \\ =& \int \nabla_{\lambda} \log q_{\lambda}(\theta)\left\{\log h(\theta)-\log q_{\lambda}(\theta)\right\} q_{\lambda}(\theta) \mathrm{d} \theta \end{aligned}. \]
The last expression is easily estimated unbiasedly if we can simulate from \(q_{\lambda}(\theta)\).
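A minimal sketch of this score-function gradient estimate for a one-dimensional Gaussian variational family \(q_\lambda = \mathcal{N}(\mu, \sigma^2)\) with \(\lambda=(\mu, \log\sigma)\); `log_h` is a user-supplied function returning \(\log p(\theta) + \log p(y\mid\theta)\) (or an estimate of it):

```python
import numpy as np
from scipy.stats import norm

def elbo_gradient_estimate(lam, log_h, n_samples=200, rng=None):
    """Score-function ("log derivative trick") Monte Carlo estimate of grad_lambda L(lambda)."""
    rng = np.random.default_rng(rng)
    mu, log_sigma = lam
    sigma = np.exp(log_sigma)
    theta = rng.normal(mu, sigma, size=n_samples)             # draws from q_lambda
    log_q = norm.logpdf(theta, mu, sigma)
    score_mu = (theta - mu) / sigma ** 2                       # d log q / d mu
    score_logsig = (theta - mu) ** 2 / sigma ** 2 - 1.0        # d log q / d log_sigma
    weights = np.array([log_h(t) for t in theta]) - log_q
    return np.array([np.mean(score_mu * weights), np.mean(score_logsig * weights)])
```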
It is well known that gradient estimates obtained by the log derivative trick are highly variable. A variety of additional methods for variance reduction have also been considered. See for example, Salimans and Knowles (2013), Titsias and Lázaro-Gredilla (2014, 2015).
6.2 Variational Bayes with intractable likelihood (VBIL)
6.2.1 Data augmentation in VB
A latent variable \(\alpha\) can be introduced such that the joint density \(p(y, \alpha \mid \theta)\) is tractable.
Inference is then done with the joint posterior \(\pi(\theta, \alpha \mid y) \propto \pi(\theta) p(y, \alpha \mid \theta)\) rather than the marginal posterior of interest \(\pi(\theta\mid y)\).
Issues left:
- Classical VB algorithms approximate the joint posterior \(p(\theta, \alpha \mid y)\) by a factorized distribution \(q(\theta\mid y) q(\alpha\mid y)\), and then use \(q(\theta\mid y)\) as an approximation to \(\pi(\theta\mid y)\).
- posterior dependence between \(\theta\) and \(\alpha\) is ignored, which might lead to a poor VB approximation (Neville, Ormerod, and Wand 2014).
Note: Tran et al. (2017) instead integrate out the latent variable \(\alpha\); the resulting approach is called variational Bayes with intractable likelihood (VBIL). See the next section.
6.2.2 Variational Bayes with intractable likelihood (VBIL)
About VBIL (Gunawan et al., 2016; Tran et al., 2015)
VBIL is the first attempt to apply stochastic gradient variational inference methods to a class of problems that includes likelihood-free inference, and it uses black-box variational inference methods.
The VBIL approach works with an unbiased estimate of the likelihood, denoted as \(\hat p_n(y\mid \theta)\), where \(n\) is the number of simulated iid samples in BSL (or an algorithmic parameter controlling the accuracy of the approximation, such as the number of Monte Carlo samples used).
VBIL transforms the problem of approximating the posterior \(\pi(\theta\mid y)\) into a stochastic optimization problem using a noisy gradient.
It is essential for the success of stochastic optimization algorithms that the noisy gradient estimator has sufficiently small variance.
Several techniques are given by Tran et al. (2017) to reduce the variance of the estimated gradient: control variates, the natural gradient, a factorized variational distribution and randomized quasi-Monte Carlo.
VBIL provides an estimate of the marginal likelihood, which is useful for model choice. VBIL provides an attractive way to improve the accuracy of VB approximations of marginal posteriors.
The asymptotic variance of VBIL estimators increases linearly with the variance of the log-likelihood estimator.
Applications: generalized linear mixed models, state–space models, and ABC
Extension combining VBIL with the unbiased synthetic likelihood (uSL):
- In principle, we can use the unbiased synthetic likelihood of Ghurye and Olkin (1969) to give a synthetic likelihood version of the VBIL method.
- This may be beneficial compared to unbiased estimation with ABC.
Theory of VBIL
Introduce the auxiliary variable: \(z=\log \hat{p}_{n}(y \mid \theta)-\log p(y \mid \theta)\), and let \(g_{n}(z \mid \theta)\) be the distribution of \(z\) given \(\theta\).
As \(\hat{p}_{n}(y \mid \theta)\) is unbiased, we have
\[\int \exp (z) g_{n}(z \mid \theta) \mathrm{d} z=1\]
Tran et al. (2015) consider implementing \(\mathrm{VB}\) in the augmented space \((\theta, z)\), with target distribution \[ p_{n}(\theta, z)=p(\theta \mid y) \exp (z) g_{n}(z \mid \theta) \] The \(\theta\) marginal of \(p_{n}(\theta, z)\) is the posterior distribution of interest, \(p(\theta \mid y)\).
Consider a family of approximating distributions of the form \(q_{\lambda}(\theta, z)=q_{\lambda}(\theta) g_{n}(z \mid \theta)\), where \(\lambda\) is a variational parameter to be chosen. The \(\theta\) marginal of \(q_{\lambda}(\theta, z)\) is \(q_{\lambda}(\theta) .\)
Objective: Performing the \(\mathrm{VB}\) optimisation in the augmented space, by choosing \(\lambda\) to minimise \(\operatorname{KL}\left(q_{\lambda}(\theta, z) \| p_{n}(\theta, z)\right)\), then the gradient of the objective function can be shown to be \[E\left(\nabla_{\lambda} \log q_{\lambda}(\theta)\left(\log \left(p(\theta) \hat{p}_{n}(y \mid \theta)\right)-\log q_{\lambda}(\theta)\right)\right) \] where the expectation is with respect to \(q_{\lambda}(\theta, z)\).
Remarks:
This expression is easily obtained as before and is easily approximated by simulation, since all that is required is simulation of \(\theta\) from \(q_{\lambda}(\theta)\) and calculation of the likelihood estimate \(\hat{p}_{n}(y \mid \theta) .\) Knowledge of \(z\), which depends on the unknown \(p(y \mid \theta)\), is not required.
Minimisation of \(\operatorname{KL}\left(q_{\lambda}(\theta, z) \| p_{n}(\theta, z)\right)\) is not the same in general as minimisation of \(\mathrm{KL}(\lambda)\). However, Tran et al. (2015) show that if two conditions are met, then the minimisers of \(\operatorname{KL}\left(q_{\lambda}(\theta, z) \| p_{n}(\theta, z)\right)\) and \(\mathrm{KL}(\lambda)\) correspond.
The lower bound in the augmented space is
\[\begin{aligned} \mathcal{L}_{a}(\lambda) &=\int \log \frac{p(\theta) p(y \mid \theta) \exp (z) g_{n}(z \mid \theta)}{q_{\lambda}(\theta) g_{n}(z \mid \theta)} q_{\lambda}(\theta, z) \mathrm{d} z \mathrm{~d} \theta \\ &=\mathcal{L}(\lambda)+\int z g_{n}(z \mid \theta) q_{\lambda}(\theta) \mathrm{d} \theta \end{aligned}\]
If the log-likelihood estimator is asymptotically normal, so that \(z\) is normal, this implies that asymptotically \[z \mid \theta \sim N(E(z \mid \theta),-2 E(z \mid \theta)) \] by the unbiasedness condition.
Hence, tuning \(E(z \mid \theta)\) to not depend on \(\theta\) is equivalent to tuning the variance of the log-likelihood estimator to not depend on \(\theta\) in this case. The resulting lower bound in the augmented space is \[ \mathcal{L}_{a}(\lambda)=\mathcal{L}(\lambda)-\frac{\tau^{2}}{2}, \] where \(\tau^{2}\) is the targeted variance for the log-likelihood estimator.
Crucial to the VBIL method is the use of variance reduction methods in the gradient estimates in the stochastic gradient procedure.
The VBIL method of Tran et al. (2015) is useful in a number of settings, such as state space models, random effects models, ABC, etc.
7 Variational Bayes with synthetic likelihood
Two kinds of unbiased estimators related to the likelihood:
- The VBIL method using an unbiased likelihood estimate is not easy to apply in some ABC problems, as the user needs to tune the variance of the log-likelihood estimator to be constant across the parameter space.
- For implementing stochastic gradient variational Bayes (VB) methods, it is much more convenient to work with unbiased estimates of the log-likelihood function.
- Assuming that the summary statistic is Gaussian, unbiased estimates of the log likelihood are available in the synthetic likelihood context. This makes the implementation of stochastic gradient VB methods very easy.
Variational Bayes With Synthetic Likelihood (VBSL) uses stochastic gradient variational inference methods for posterior approximation in the synthetic likelihood context, employing unbiased estimates of the log likelihood.
VBSL modifies the VBIL methodology to work with unbiased log likelihood estimates in the synthetic likelihood framework.
VBSL reduces computational overheads.
VBSL improves on VBIL (Tran et al., 2017) by considering certain reduced-variance gradient estimates, adaptive learning rates and alternative parametrisations.
- The most important advantage of the VBSL algorithm, however, is that its tuning parameters are much easier to set than for VBIL.
Tran et al. (2017) developed an adaptive method for determining the algorithm learning rates, and reparametrisations that may be helpful in cases where ensuring the positive definiteness of the variational posterior covariance matrix is difficult.
Parametrisation of the Gaussian variational posterior covariance uses the Cholesky factor of the precision matrix with a natural gradient algorithm.
Unbiased estimate of the log of a normal density (Ripley, 1996, p. 56)
When the summary statistics are normally distributed, an unbiased estimate of the log of a normal density \(\log \phi(s ; \mu(\theta), \Sigma(\theta))\) based on a random sample of size \(n\) from it leading to sample mean and covariance matrix \(\boldsymbol{\mu}_{n}(\boldsymbol{\theta})\) and \(\boldsymbol{\Sigma}_{n}(\boldsymbol{\theta})\), respectively, is \[ \begin{aligned} \hat{l}_{n}^{U}(s \mid \theta)=&-\frac{d}{2} \log 2 \pi \\ &-\frac{1}{2}\left\{\log |\boldsymbol{\Sigma}_{n}(\boldsymbol{\theta})|+d \log \left(\frac{n-1}{2}\right)-\sum_{i=1}^{d} \psi\left(\frac{n-i}{2}\right)\right\} \\ &-\frac{1}{2}\left\{\frac{n-d-2}{n-1}(s-\boldsymbol{\mu}_{n}(\boldsymbol{\theta}))^{\mathrm{T}} \boldsymbol{\Sigma}_{n}(\boldsymbol{\theta})^{-1}(s-\boldsymbol{\mu}_{n}(\boldsymbol{\theta}))-\frac{d}{n}\right\}, \end{aligned} \] provided that \(n>d+2\), where \(\psi(\cdot)\) denotes the digamma function.
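A minimal sketch of this unbiased log-density estimator, with the simulated summaries supplied as an \(n \times d\) array (name illustrative):

```python
import numpy as np
from scipy.special import digamma

def unbiased_gaussian_logpdf(s, sims):
    """Unbiased estimate of log N(s; mu(theta), Sigma(theta)) from n simulated
    summaries (Ripley, 1996); requires n > d + 2."""
    n, d = sims.shape
    mu_n = sims.mean(axis=0)
    Sigma_n = np.cov(sims, rowvar=False)
    _, logdet = np.linalg.slogdet(Sigma_n)
    quad = (s - mu_n) @ np.linalg.solve(Sigma_n, s - mu_n)
    i = np.arange(1, d + 1)
    return (-0.5 * d * np.log(2 * np.pi)
            - 0.5 * (logdet + d * np.log((n - 1) / 2) - digamma((n - i) / 2).sum())
            - 0.5 * ((n - d - 2) / (n - 1) * quad - d / n))
```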
Algorithm: to implement a stochastic gradient VB algorithm for approximation of the posterior, the only change required in VBIL Algorithm 1 is to replace \(\log(\hat p_n(y\mid\theta))\) with \(\hat{l}_{n}^{U}(y \mid \theta)\) wherever it appears in the expression above.
8 Algorithms and R package for BSL
8.1 R packages:
sl: package for R
BSL (An, South, Drovandi, 2019)
EasyABC: Performing efficient approximate Bayesian computation sampling schemes using R
Algorithm 1: MCMC BSL algorithm for sBSL, uBSL, semiBSL.

- Algorithm 2: Procedure to select the penalty value \(\lambda\) for BSL or semiBSL with shrinkage estimation.

8.1.1 Algorithm 3 of VBIL

9 Case studies and applications
9.1 Simulated example with tractable likelihood: MA(2) (Ong et al., 2018)
DGP: \[ y_{t}=z_{t}+\theta_{1} z_{t-1}+\theta_{2} z_{t-2} \] for \(t=1, \ldots, T\), where \(z_{t} \sim N(0,1)\) for \(t=-1,0, \ldots, T\), \(\theta=\left(\theta_{1}, \theta_{2}\right)^{T}\) and \(y=\left(y_{1}, \ldots, y_{T}\right)^{T}\).
The likelihood function \(p(y \mid \theta)\) is multivariate normal with a mean vector of zeros, \(\operatorname{Var}\left[y_{t}\right]=1+\theta_{1}^{2}+\theta_{2}^{2}\), \(\operatorname{Cov}\left(y_{t}, y_{t-1}\right)=\theta_{1}+\theta_{1} \theta_{2}\) for \(t=2, \ldots, T\), \(\operatorname{Cov}\left(y_{t}, y_{t-2}\right)=\theta_{2}\) for \(t=3, \ldots, T\), and \(\operatorname{Cov}\left(y_{i}, y_{j}\right)=0\) otherwise.
The prior distribution is set to be uniform over the allowable parameter space: \(\Theta \equiv\left\{(\theta_1,\theta_2)\in\mathbb{R}^{2}:-1<\theta_{2}<1,\ \theta_{1}+\theta_{2}>-1,\ \theta_{1}-\theta_{2}<1\right\}.\)
We take the summary statistic to consist of all \(T=100\) observations. The dataset used for analysis is simulated with \(\theta=(0.6, 0.2)\).
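A minimal sketch of the simulator for this example (the summary statistic is simply the full series):

```python
import numpy as np

def simulate_ma2(theta, T=100, rng=None):
    """Simulate y_1..y_T from the MA(2) model y_t = z_t + theta1*z_{t-1} + theta2*z_{t-2}."""
    rng = np.random.default_rng(rng)
    theta1, theta2 = theta
    z = rng.standard_normal(T + 2)                   # z_{-1}, z_0, z_1, ..., z_T
    return z[2:] + theta1 * z[1:-1] + theta2 * z[:-2]

y_obs = simulate_ma2((0.6, 0.2), T=100, rng=1)       # observed data / summary statistic
```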
9.2 Multivariate g-and-k example (Drovandi and Pettitt, 2011; Ong et al., 2018)
DGP: The multivariate model is constructed by joining \(q\) univariate \(g\)-and-\(k\) distributions for the marginals (Rayner and MacGillivray, 2002). A univariate \(g\)-and-\(k\) density has no closed form, but provides considerable modelling flexibility for univariate data. The \(g\)-and-\(k\) distribution can be defined through its quantile function, \(Q(p), p \in(0,1)\), where \[ Q(p)=A+B\left[1+c \frac{1-\exp (-g z(p))}{1+\exp (-g z(p))}\right]\left(1+z(p)^{2}\right)^{k} z(p) \] where \(z(p)=\Phi^{-1}(p)\) and \(\Phi(\cdot)\) is the standard normal distribution function. The parameters \(A\), \(B>0\), \(g\) and \(k\) control the location, scale, skewness and kurtosis, respectively. The additional parameter \(c\) is conventionally fixed at \(0.8\), and with this constraint \(k>-0.5\).
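A minimal sketch of simulation from a univariate \(g\)-and-\(k\) distribution by pushing uniform draws through the quantile function:

```python
import numpy as np
from scipy.stats import norm

def gandk_quantile(p, A, B, g, k, c=0.8):
    """Quantile function Q(p) of the univariate g-and-k distribution."""
    z = norm.ppf(p)
    return A + B * (1 + c * (1 - np.exp(-g * z)) / (1 + np.exp(-g * z))) * (1 + z ** 2) ** k * z

def simulate_gandk(n, A, B, g, k, rng=None):
    """Draw n iid g-and-k variates via the inverse-CDF (quantile) method."""
    rng = np.random.default_rng(rng)
    return gandk_quantile(rng.uniform(size=n), A, B, g, k)
```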
The multivariate version of the model we consider was suggested by Drovandi and Pettitt (2011). Suppose we have independent and identically distributed multivariate observations \(y_{1}, \ldots, y_{n}\), where \(y_{i}=\left(y_{i 1}, \ldots, y_{i q}\right)^{\top}\). Each \(y_{i r}\) follows a univariate \(g\)-and-\(k\) distribution marginally, \(F\left(x ; \theta_{r}\right)\) say, with parameters \(\theta_{r}=\left(A_{r}, B_{r}, g_{r}, k_{r}\right)^{\top}, r=1, \ldots, q\). We write the density function and quantile function corresponding to \(F\left(x ; \theta_{r}\right)\) as \(f\left(x ; \theta_{r}\right)\) and \(Q\left(p ; \theta_{r}\right)\) respectively. In order to link these univariate \(g\)-and-\(k\) distributions together, Drovandi and Pettitt (2011) modelled the dependency between components of \(y_{i}\) using a Gaussian copula with a \(q \times q\) correlation matrix \(\rho=\left[\rho_{i j}\right]\). The density of \(y_{i}\) is \[ f\left(y_{i} ; \theta\right)=|\rho|^{-1 / 2} \exp \left(-\frac{1}{2}\eta_{i}^{\top}\left(\rho^{-1}-I\right) \eta_{i}\right) \prod_{j=1}^{q} f\left(y_{i j} ; \theta_{j}\right) \] where \(\eta_{i}=\left(\eta_{i 1}, \ldots, \eta_{i q}\right)^{\top}\) with \(\eta_{i r}=\Phi^{-1}\left(F\left(y_{i r} ; \theta_{r}\right)\right).\)
For summary statistics, we follow Drovandi and Pettitt (2011) and Ong et al. (2018b) and use \[ S_{A_{r}}=E_{4}^{(r)}, S_{B_{r}}=E_{6}^{(r)}-E_{2}^{(r)}, S_{g_{r}}=\frac{E_{7}^{(r)}-E_{5}^{(r)}+E_{3}^{(r)}-E_{1}^{(r)}}{S_{B_{r}}}, S_{k_{r}}=\frac{E_{6}^{(r)}+E_{2}^{(r)}-2 E_{4}^{(r)}}{S_{B_{r}}} \] where \(E_{j}^{(r)}\) is the jth octile of the data \(\left(y_{1 r}, \ldots, y_{n r}\right)\), in addition to robust normal scores correlation coefficients (Fisher and Yates, 1948 ) between all pairs of components.
9.3 Ricker model (Wood, 2010)
DGP: (observed random variables)
\[Y_{t} \sim \mathcal{P}\left(\phi N_{t}\right)\]
\(N_{t}\) are unobserved:
\[N_{t+1}=r N_{t} e^{-N_{t}+e_{t}}, e_{t} \stackrel{\mathrm{iid}}{\sim} \mathcal{N}\left(0, \sigma_{e}^{2}\right)\]
- Model parameters: \(\boldsymbol{\theta}=\left(\log r, \phi, \sigma_{e}\right)\)
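A minimal sketch of a simulator for this model; the series length \(T\) and initial population \(N_0\) are illustrative choices not specified above:

```python
import numpy as np

def simulate_ricker(theta, T=50, N0=1.0, rng=None):
    """Simulate T observations from the Ricker model; theta = (log_r, phi, sigma_e)."""
    rng = np.random.default_rng(rng)
    log_r, phi, sigma_e = theta
    N, y = N0, np.empty(T, dtype=int)
    for t in range(T):
        e = rng.normal(0.0, sigma_e)
        N = np.exp(log_r) * N * np.exp(-N + e)   # latent population dynamics N_{t+1}
        y[t] = rng.poisson(phi * N)              # observed count Y_t ~ Poisson(phi * N_t)
    return y
```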
9.4 Cell Biology Model: cell motility and proliferation (Johnston et al., 2014; Price et al., 2018; Ong et al., 2018)
Background
Cell motility causes random movement, which, together with proliferation or reproduction, can cause tumors to spread (Swanson et al. 2003) or wounds to heal (Dale, Maini, and Sherratt 1994; Zahm et al. 1997).
stochastic models for collective cell spreading do not possess a tractable likelihood function
the observed data are typically available as sequences of images
A common method of collecting information about cell diffusivity and proliferation is the scratch assay
often the images are then reduced to summary statistics.
Data abstraction
By using all \(d=145\) images rather than a small subset of them, valuable information about the rates of motility and proliferation can be attained.
To create the observed data, the cells can be placed on a two-dimensional discrete lattice using image analysis software and some manual processes.
Let \(X_{x, y}^{t} \in\{0,1\}\) be an indicator that defines whether a cell is present at position \((x, y)\) for \(x \in\{1, \ldots, R\}\), \(y \in\{1, \ldots, C\}\) at time index \(t \in\{0,1, \ldots, 144\} .\) Here \(R\) and \(C\) are the number of rows and columns in the lattice, respectively. Denote the matrix of indicators at time index \(t\) as \(\boldsymbol{X}^{t}\).
Summary statistics
One possibly informative summary statistic regarding motility is the Hamming distance between \(\boldsymbol{X}^{t}\) and \(\boldsymbol{X}^{t-1}\) \[ s_{t}=\sum_{x=1}^{R} \sum_{y=1}^{C}\left|X_{x, y}^{t}-X_{x, y}^{t-1}\right| . \]
The summary statistic we use to provide information regarding the proliferation is the total number of cells at the end of the experiment, which we denote as \(K\) for some simulated dataset. Thus, the simulated summary statistic is given by \(\boldsymbol{s}=\left(s_{1}, \ldots, s_{144}, K\right)\) and is of dimension 145 .
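A minimal sketch of these summaries, assuming the images are stacked into a binary array `X` of shape \((145, R, C)\) (an assumed layout):

```python
import numpy as np

def cell_summaries(X):
    """145-dimensional summary: the 144 Hamming distances between consecutive
    lattice images plus the total cell count in the final image."""
    hamming = np.abs(np.diff(X.astype(int), axis=0)).sum(axis=(1, 2))
    K = X[-1].sum()                      # number of cells at the end of the experiment
    return np.append(hamming, K)
```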
Model parameters
- The outcomes of this biological process are determined solely by the cell motility and proliferation so only two parameters are required in the random walk. While the parameters of interest are the diffusivity \(D\) and the proliferation rate \(\lambda\), it is simple to just work with the probabilities of motility and proliferation, \(P_{m} \in[0,1]\) and \(P_{p} \in[0,1] .\) Conversion back to the biological parameters is done by using the formulas in Johnston et al. (2014). It is also possible to include additional parameters for cell-to-cell adhesion and cell-to-substrate adhesion depending on the type of cell under consideration. For more details on the random walk simulation model, the reader is referred to Johnston et al. (2014).
Data simulation
- The data are simulated with \(P_{m}=0.35\) and \(P_{p}=0.001\) with \(R=27\) and \(C=36 .\) Initially, \(N(0)=110\) cells are placed randomly in the rectangle with positions \(x \in\{1,2, \ldots, 13\}\) and \(y \in\{1,2, \ldots, 36\} .\)
9.5 α-stable model
- DGP:
\(\alpha\)-stable models are a convenient family of heavy-tailed distributions used in a number of applications. Inference is challenging, since for distributions in this family there is no closed-form expression for the density function. The most common parametrisation of these distributions is in terms of a parameter \(\theta=(\alpha, \beta, \gamma, \delta)^{\top}\), where \(\alpha\) controls the tail behaviour, \(\beta\) controls skewness, \(\gamma\) is a scale parameter and \(\delta\) is a location parameter. The characteristic function is \[\phi(t)=\left\{\begin{array}{ll}\exp \left\{i \delta t-\gamma^{\alpha}|t|^{\alpha}\left(1+i \beta \tan \frac{\pi \alpha}{2} \operatorname{sgn}(t)\left(|\gamma t|^{1-\alpha}-1\right)\right)\right\}, & \alpha \neq 1, \\ \exp \left\{i \delta t-\gamma|t|\left(1+i \beta \frac{2}{\pi} \operatorname{sgn}(t) \log (\gamma|t|)\right)\right\}, & \alpha=1,\end{array}\right.\] where \(\operatorname{sgn}(t)\) is the sign function, equal to 1 if \(t>0\), 0 if \(t=0\) and \(-1\) if \(t<0\).
- For summary statistics, we consider the quantile-based point estimator of \(\theta=(\alpha, \beta, \gamma, \delta)^{\top}\) due to McCulloch (1986) and then transform this point estimator into an estimator of the transformed parameter \(\tilde{\theta}\).
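Although the density is unavailable, sampling from the \(\alpha\)-stable family is straightforward. For instance, with SciPy (an assumption of these notes) one could write the following, bearing in mind that `levy_stable` has its own parametrisation conventions, which should be matched to the characteristic function above before serious use:

```python
import numpy as np
from scipy.stats import levy_stable

def simulate_alpha_stable(theta, n=200, rng=None):
    """Draw n observations from an alpha-stable distribution.

    theta = (alpha, beta, gamma, delta): tail index, skewness, scale, location.
    Note: SciPy's parametrisation may differ from the one written above.
    """
    alpha, beta, gamma, delta = theta
    return levy_stable.rvs(alpha, beta, loc=delta, scale=gamma, size=n,
                           random_state=rng)
```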
9.6 M/G/1 model (Ziwen An, 2020)
The M/G/1 queuing model is a single-server queuing system with Poisson arrivals and general service times.
The observed data, \(\mathbf{y}_{1: 50}\), are 50 inter-departure times from 51 customers. In this model, the likelihood is cumbersome to calculate whereas simulation is trivial. The distribution of the service time is uniform \(\mathrm{U}\left(\theta_{1}, \theta_{2}\right)\), and the distribution of the inter-arrival time is exponential with rate parameter \(\theta_{3}\). We take \(\boldsymbol{\theta}=\left(\theta_{1}, \theta_{2}, \theta_{3}\right)\) as the parameter of interest and put a uniform prior \(\mathrm{U}(0,10) \times \mathrm{U}(0,10) \times \mathrm{U}(0,0.5)\) on \(\left(\theta_{1}, \theta_{2}-\theta_{1}, \theta_{3}\right)\).
The observed data are generated from the model with true parameter \(\boldsymbol{\theta}=(1,5,0.2)\).
Summary statistic: the log of the inter-departure times.
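A minimal sketch of simulating the 50 log inter-departure times (assuming NumPy; the single-server recursion below is the standard queueing construction, not code from An, 2020):

```python
import numpy as np

def simulate_mg1_summaries(theta, n_customers=51, rng=None):
    """Simulate an M/G/1 queue and return the log inter-departure times.

    theta = (theta1, theta2, theta3): service times ~ U(theta1, theta2),
    inter-arrival times ~ Exp(rate = theta3). With 51 customers this gives
    50 inter-departure times, matching the observed data described above.
    """
    rng = np.random.default_rng(rng)
    t1, t2, t3 = theta
    arrivals = np.cumsum(rng.exponential(scale=1.0 / t3, size=n_customers))
    services = rng.uniform(t1, t2, size=n_customers)
    departures = np.empty(n_customers)
    prev = 0.0
    for i in range(n_customers):
        prev = max(prev, arrivals[i]) + services[i]  # service starts when both the customer and the server are available
        departures[i] = prev
    return np.log(np.diff(departures))  # 50-dimensional summary statistic
```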
9.7 Stereological extremes (Ziwen An, 2020)
During the process of steel production, the occurrence of microscopic particles, called inclusions, is a critical measure of the quality of steel. It is desirable to keep the inclusions below a certain threshold, since steel fatigue is believed to start from the largest inclusion within a block. Direct observation of the three-dimensional inclusions is not possible, so inclusion analysis typically relies on examining two-dimensional planar slices. Anderson and Coles (2002) establish a mathematical model for the relationship between the observed cross-sectional samples, \(S\), and the true diameters of the inclusions, \(V\), assuming that the inclusions are spherical. The inclusions are assumed mutually independent, and their locations follow a homogeneous Poisson process with rate parameter \(\lambda\).
While the spherical model possesses a likelihood function that is easily computable, the spherical assumption itself may be inappropriate. This motivates the ellipsoidal model proposed by Bortot et al. (2007).
DGP: The ellipsoidal model assumes that inclusions are ellipsoidal with principal diameters \((V_1, V_2, V_3)\), where, without loss of generality, \(V_3\) is the largest diameter. Here \(V_1=U_1 V_3\) and \(V_2=U_2 V_3\), where \(U_1\) and \(U_2\) are independent \(\mathrm{U}(0,1)\) random variables. The observed value \(S\) is the largest principal diameter of the ellipse formed in the two-dimensional cross-section.
Here, we consider the ellipsoidal model with parameter of interest \(\boldsymbol{\theta}=(\lambda, \sigma, \xi)\). The prior distribution is \(\mathrm{U}(30,200) \times \mathrm{U}(0,15) \times \mathrm{U}(-3,3)\). We denote the observed samples as \(\mathbf{S}\).
Summary statistics: the number of inclusions, \(\log (\min (\mathbf{S})), \log (\operatorname{mean}(\mathbf{S})), \log (\max (\mathbf{S}))\).
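Given simulated or observed cross-section diameters \(\mathbf{S}\), the summary statistic is trivial to compute; a sketch (assuming NumPy):

```python
import numpy as np

def stereo_summaries(S):
    """Summary statistics for the stereological extremes example:
    number of observed inclusions and log of min/mean/max cross-section diameter."""
    S = np.asarray(S, dtype=float)
    return np.array([len(S), np.log(S.min()), np.log(S.mean()), np.log(S.max())])
```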
9.8 Fowler’s toads (Ziwen An, 2020)
Background: Movements of amphibians exhibit patterns of site fidelity and long-distance dispersal at the same time. Modelling such patterns helps in understanding amphibians' travel behaviour and contributes to amphibian conservation. Marchand et al. (2017) develop an individual-based model for the species Fowler's toads (Anaxyrus fowleri) and collect data via radiotracking in Ontario, Canada. The comprehensive experimental and modelling details are given in the original paper.
The model assumes that a toad hides in its refuge site in the daytime and moves to a randomly chosen foraging place at night. After its geographical position is collected via a transmitter, the toad either takes refuge at the current location or returns to one of its previous sites. For simplicity, the refuge locations are projected onto a single axis and can thus be represented by a one-dimensional spatial process. GPS location data are collected on \(n_{t}\) toads for \(n_{d}\) days, i.e. the observation matrix \(\mathbf{Y}\) is of dimension \(n_{d} \times n_{t}\). For the synthetic data used here, \(n_{t}=66\), \(n_{d}=63\), and missingness is not considered. \(\mathbf{Y}\) is then summarised down to four sets comprising the relative moving distances for time lags of 1, 2, 4 and 8 days. For instance, \(\mathbf{y}_{1}\) consists of the displacement information at lag 1 day, \(\mathbf{y}_{1}=\left\{|\Delta y|=\left|\mathbf{Y}_{i, j}-\mathbf{Y}_{i+1, j}\right| ;\ 1 \leq i \leq n_{d}-1,\ 1 \leq j \leq n_{t}\right\}\).

Simulation from the model involves two distinct processes. For each toad, we first generate an overnight displacement, \(\Delta y\), and then mimic the returning behaviour with a simplified model. The overnight displacement is assumed to have heavy tails and to belong to the Lévy alpha-stable distribution family, with stability parameter \(\alpha\) and scale parameter \(\gamma\). This distribution has no closed-form density, but simulation from it is straightforward (Chambers et al. 1976), making simulation-based approaches appealing.

The original paper provides three returning models of differing complexity. We adopt the random return model here, as it has the best performance among the three and is the easiest to simulate. The total returning probability is a constant \(p_{0}\), and if a return occurs on day \(1 \leq i \leq m\), then the return site is the same as the refuge site on day \(i\), where \(i\) is selected randomly from \(1,2, \ldots, m\) with equal probability. Here, we take the observed data as synthetically generated with true parameter \(\boldsymbol{\theta}=\left(\alpha, \gamma, p_{0}\right)=(1.7,35,0.6)\). We use a uniform prior over \((1,2) \times(0,100) \times(0,0.9)\).
We fit a four-component Gaussian mixture model to each set of \(\log (|\Delta y|)\). Figure 5 shows the distributions of \(\log (|\Delta y|)\) for lags of 1, 2, 4 and 8 days. As the summary statistic we use the 11-dimensional score of this fitted auxiliary model (corresponding to three component weights, four means and four standard deviations). This corresponds to the indirect inference approach to selecting summary statistics (see Drovandi et al. 2011; Gleim and Pigorsch 2013; Drovandi et al. 2015). Accommodating the four different lags, there are 44 summary statistics in total.
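A rough sketch of constructing these summaries (assuming NumPy and scikit-learn; note that the cited approach uses the score of the fitted auxiliary mixture model, whereas this sketch simply collects the fitted mixture parameters, which have the same dimension):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def lag_displacements(Y, lag):
    """Absolute displacements |Y[i+lag, j] - Y[i, j]| over all toads j (columns)."""
    Y = np.asarray(Y, dtype=float)
    return np.abs(Y[lag:, :] - Y[:-lag, :]).ravel()

def toad_summaries(Y, lags=(1, 2, 4, 8), seed=0):
    """Fit a four-component Gaussian mixture to log|dy| for each lag and collect
    3 free component weights, 4 means and 4 standard deviations per lag (44 total)."""
    stats = []
    for lag in lags:
        z = np.log(lag_displacements(Y, lag) + 1e-12)  # small offset guards log(0); an assumption of this sketch
        gm = GaussianMixture(n_components=4, random_state=seed).fit(z.reshape(-1, 1))
        stats.extend(gm.weights_[:3])                   # 3 free component weights
        stats.extend(gm.means_.ravel())                 # 4 component means
        stats.extend(np.sqrt(gm.covariances_.ravel()))  # 4 component standard deviations
    return np.array(stats)
```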
9.9 Simple recruitment, boom and bust (Ziwen An, 2020)
Background: The simple recruitment, boom and bust model was used in Fasiolo et al. (2018) to investigate the performance of the saddlepoint approximation to a non-normal summary statistic. It is a discrete stochastic temporal model that can represent the fluctuation of the population size of a group over time. Given the population size \(N_{t}\) and parameter \(\boldsymbol{\theta}=(r, \kappa, \alpha, \beta)\), the next value \(N_{t+1}\) follows \[ N_{t+1} \sim\left\{\begin{array}{ll}\operatorname{Poisson}\left(N_{t}(1+r)\right)+\epsilon_{t}, & \text { if } N_{t} \leq \kappa, \\ \operatorname{Binom}\left(N_{t}, \alpha\right)+\epsilon_{t}, & \text { if } N_{t}>\kappa,\end{array}\right. \] where \(\epsilon_{t} \sim \operatorname{Pois}(\beta)\) is a stochastic term. The population oscillates between high- and low-level population sizes over several cycles. The true parameters are \(r=0.4\), \(\kappa=50\), \(\alpha=0.09\) and \(\beta=0.05\), and the prior distribution is \(\mathrm{U}(0,1) \times \mathrm{U}(10,80) \times \mathrm{U}(0,1) \times \mathrm{U}(0,1)\). There are 250 values in the observed data. We use 50 burn-in values to remove the transient phase of the process.
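A minimal simulation sketch of this model (assuming NumPy; the initial population size \(N_0=10\) is an assumption of this sketch, not a value fixed by the notes):

```python
import numpy as np

def simulate_boom_bust(theta, T=250, burn_in=50, N0=10, rng=None):
    """Simulate the recruitment, boom and bust model and discard the burn-in."""
    rng = np.random.default_rng(rng)
    r, kappa, alpha, beta = theta
    N = N0
    out = np.empty(T + burn_in, dtype=int)
    for t in range(T + burn_in):
        eps = rng.poisson(beta)                   # stochastic immigration term
        if N <= kappa:
            N = rng.poisson(N * (1 + r)) + eps    # boom regime
        else:
            N = rng.binomial(N, alpha) + eps      # bust regime
        out[t] = N
    return out[burn_in:]                          # 250 observed values
```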
Summary statistics: For a dataset \(\mathbf{x}\), define the differences and ratios as \(\mathbf{d}_{\mathbf{x}}=\left\{x_{i}-x_{i-1} ; i=2, \ldots, 250\right\}\) and \(\mathbf{r}_{\mathbf{x}}=\left\{x_{i} / x_{i-1} ; i=2, \ldots, 250\right\}\), respectively. We use the sample mean, variance, skewness and kurtosis of \(\mathbf{x}\), \(\mathbf{d}_{\mathbf{x}}\) and \(\mathbf{r}_{\mathbf{x}}\) as our summary statistic, \(\mathbf{s}_{\mathbf{x}}\). We also tested the statistics used in Fasiolo et al. (2018), but found our choice to be more informative about the model parameters. The parameter \(\beta\) has a strong impact on the distribution of the model statistic: small values of \(\beta\) tend to generate statistics that are highly non-normal, so we consider such a case here.
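And the corresponding 12-dimensional summary statistic \(\mathbf{s}_{\mathbf{x}}\) (assuming NumPy/SciPy; the ratio term assumes no zero population sizes in the series):

```python
import numpy as np
from scipy.stats import skew, kurtosis

def boom_bust_summaries(x):
    """Sample mean, variance, skewness and kurtosis of x, its differences and
    its ratios (12 statistics in total)."""
    x = np.asarray(x, dtype=float)
    d = x[1:] - x[:-1]
    ratio = x[1:] / x[:-1]  # assumes x contains no zeros
    def moments(v):
        # kurtosis() returns excess kurtosis by default
        return [np.mean(v), np.var(v), skew(v), kurtosis(v)]
    return np.array(moments(x) + moments(d) + moments(ratio))
```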