This RMarkdown consists of two parts: First, we'll go through the general case of closed-form Bayesian updating of our beliefs when trying to estimate the mean and standard deviation of a Gaussian process. Then we'll apply it to our specific case in which the Gaussian process is corrupted by Gaussian noise. Note that we are limiting ourselves to the univariate case for now.
The conjugate prior for a Gaussian distribution with unknown mean and variance is the Normal-inverse gamma (NIG) distribution, which has four parameters: \(m\), \(V\), \(a\) and \(b\). Note we could also use the Normal gamma as a conjugate prior for a normal distribution parameterized with mean and precision, but we're avoiding this here to not get into trouble later when adjusting for the noisy sampling. The NIG gives a probability distribution over possible values of \(\mu\) and \(\sigma\) for our concept.
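Concretely, one common parameterization (the one in Murphy's note linked further down, which the update equations below follow) factorizes the prior as \[ p(\mu, \sigma^2) = N(\mu \mid m, \sigma^2 V)\,\mathrm{InvGamma}(\sigma^2 \mid a, b), \] so \(m\) and \(V\) describe our beliefs about the mean while \(a\) and \(b\) describe our beliefs about the variance.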
# parameters for prior
m_0 = 1
V_0 = 1 # 1/sigma^2
a_0 = 1
b_0 = 1
# mean of the inverse gamma is b/(a-1) (for a > 1)
# prior for the mean: a normal distribution with mean m and sd sqrt(1/V)
plotNormal(m_0, sqrt(1/V_0))
# prior for the variance: an inverse gamma distribution with shape a and scale b
plotInvGamma(a_0,b_0)
# the joint prior over both is a normal inverse gamma distribution with all four parameters
plotNormalInvGamma(m_0, 1/V_0, a_0, b_0)
The standard conjugate updates (see the Murphy reference linked below) are: \[\begin{aligned} V_n^{-1} &= V_0^{-1} + n \\ m_n &= V_n\left(V_0^{-1} m_0 + n\bar{x}\right) \\ a_n &= a_0 + \frac{n}{2} \\ b_n &= b_0 + \frac{1}{2}\left(m_0^2 V_0^{-1} + \sum_i x_i^2 - m_n^2 V_n^{-1}\right) \end{aligned}\] where \(n\) is the number of samples taken and \(\bar{x}\) is the sample mean.
In code, after some re-arranging, for a sample X:
set.seed(1)
X = rnorm(100, mean = 5, sd = 3)
# simulate 100 observations of the process we want to estimate
x_bar = mean(X)
n = length(X)
V_n = 1 / (1/V_0 + n)
m_n = V_n * (1/V_0 * m_0 + n*x_bar)
a_n = a_0 + n/2
b_n = b_0 + 1/2 * (m_0^2 * 1/V_0 + sum(X^2) - m_n^2 * 1/V_n)
# Visualize update
# Mean is pulled toward the observations (was 1, now around 5.3, close to the sample mean, since 100 observations outweigh the single pseudo-observation in the prior)
plotNormal(m_n, sqrt(1/V_n))
# variance estimate is shifted upward by the spread of the observations (the prior InvGamma(1, 1) was concentrated near zero; the posterior mean b_n/(a_n-1) is around 7.4, close to the sample variance of X)
plotInvGamma(a_n, b_n)
# combined normal inverse gamma distribution (don't know why it gets cut off)
plotNormalInvGamma(m_n, 1/V_n, a_n, b_n)
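As a quick check (not part of the original output), the posterior mean of the variance, \(b_n/(a_n-1)\), should land close to the sample variance of X:
# compare the posterior expected variance with the sample variance of the data
b_n / (a_n - 1)
var(X)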
The posterior predictive for a new observation under the normal-inverse-gamma posterior is a Student's t density with \(df = 2a_n\), \(\mu = m_n\), and squared scale \(\sigma^2 = \frac{b_n(1+V_n)}{a_n}\); see https://www.cs.ubc.ca/~murphyk/Papers/bayesGauss.pdf. (Note this quantity is the squared scale in Murphy's parameterization; if dstudent below expects the unsquared scale, a square root is needed.)
normal_inverse_gamma_post_pred <- function(x, m_n, V_n, a_n, b_n){
# compute parameters of student's t dist
df = 2*a_n
mu = m_n
sigma = b_n*(1+V_n) / (a_n)
# get posterior density of x
dstudent(x, df, mu, sigma, log = FALSE)
}
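As a cross-check, assuming Murphy's parameterization in which \(b_n(1+V_n)/a_n\) is the squared scale, the same density can be computed with base R's dt() via the usual location-scale shift (the evaluation point x0 is an arbitrary choice for illustration):
# evaluate the posterior predictive at a single point using only base R
x0 = m_n
scale = sqrt(b_n * (1 + V_n) / a_n) # square root turns the squared scale into a scale
dt((x0 - m_n) / scale, df = 2 * a_n) / scale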
The KL divergence between two normal gammas was derived here. The definition shouldn't change for a normal inverse gamma (similar to how it doesn't change between the gamma and inverse gamma distributions, see here).
The (multivariate) KL is defined as:
\[\begin{aligned} \mathrm{KL}[P \| Q] &=\frac{1}{2} \frac{a_{1}}{b_{1}}\left[\left(\mu_{2}-\mu_{1}\right)^{T} V_{2}\left(\mu_{2}-\mu_{1}\right)\right]+\frac{1}{2} \operatorname{tr}\left(V_{2} V_{1}^{-1}\right)-\frac{1}{2} \ln \frac{\left|V_{2}\right|}{\left|V_{1}\right|}-\frac{k}{2} \\ &+a_{2} \ln \frac{b_{1}}{b_{2}}-\ln \frac{\Gamma\left(a_{1}\right)}{\Gamma\left(a_{2}\right)}+\left(a_{1}-a_{2}\right) \psi\left(a_{1}\right)-\left(b_{1}-b_{2}\right) \frac{a_{1}}{b_{1}} \end{aligned}\]where “1” subscripts refer to parameters of distribution P, and “2” subscripts refer to parameters of distribution Q, \(\Gamma\) is the gamma function, and \(\psi\) is the digamma function. In code:
# slightly awkward way of writing it to be more modifiable for a multivariate KL
normal_inverse_gamma_KL <- function(m_p, m_q, V_p, V_q, a_p, a_q, b_p, b_q) {
  KL = 1/2 * (a_p/b_p) * ((m_q - m_p) * V_q * (m_q - m_p)) +
    1/2 * (V_q/V_p) - 1/2 * log(V_q/V_p) - 1/2 +
    a_q * log(b_p/b_q) - log(gamma(a_p)/gamma(a_q)) +
    (a_p - a_q) * digamma(a_p) - (b_p - b_q) * a_p/b_p
  KL
}
#parameters for P
m_p = 5
V_p = 5
a_p = 5
b_p = 5
#parameters for Q
m_q = 5
V_q = 5
a_q = 5
b_q = 5
# should be 0 when the distributions are identical
print(normal_inverse_gamma_KL(m_p, m_q, V_p, V_q, a_p, a_q, b_p, b_q))
## [1] 0
# should be nonzero when the distributions differ
print(normal_inverse_gamma_KL(m_p+5, m_q, V_p, V_q, a_p, a_q, b_p+2, b_q))
## [1] 44.89665
The expected information gain (EIG) would then be a grid over possible observations, taking the product of the outputs of normal_inverse_gamma_post_pred and normal_inverse_gamma_KL for each candidate observation, as sketched below.
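A minimal sketch of that computation for a single future observation, reusing the two functions above (the function name, grid range, and grid spacing are illustrative assumptions, not part of the original):
# expected information gain for one new observation, approximated on a grid
expected_information_gain <- function(m_0, V_0, a_0, b_0,
                                      grid = seq(-10, 10, length.out = 201)) {
  step = diff(grid)[1]
  eig_terms = sapply(grid, function(x) {
    # posterior after a single hypothetical observation x (n = 1 update)
    V_n = 1 / (1/V_0 + 1)
    m_n = V_n * (1/V_0 * m_0 + x)
    a_n = a_0 + 1/2
    b_n = b_0 + 1/2 * (m_0^2 / V_0 + x^2 - m_n^2 / V_n)
    # predictive probability of x under current beliefs, times KL(posterior || prior)
    pred = normal_inverse_gamma_post_pred(x, m_0, V_0, a_0, b_0)
    kl = normal_inverse_gamma_KL(m_n, m_0, V_n, V_0, a_n, a_0, b_n, b_0)
    pred * kl
  })
  sum(eig_terms) * step # approximate the integral over possible observations
}
expected_information_gain(m_0, V_0, a_0, b_0)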
Question: We want to learn the \(\mu\) and \(\sigma\) of a Gaussian random variable \(y\) via Bayesian inference. We have some prior beliefs \(P(\mu)\) and \(P(\sigma)\) about the distribution of \(y\), which we want to update from observations. However, we cannot observe \(y\) directly. Instead, we can only take noisy samples \(z = y + \epsilon\), where \(\epsilon\) is Gaussian noise, which itself has \(\mu_{\epsilon}=0\) and a (Gaussian) distribution \(P(\sigma_{\epsilon})\) with known, fixed parameters (i.e. we are not trying to estimate this noise).
Given a noisy sample \(z\), is there a closed form to compute \(P(\mu|z)\) and \(P(\sigma|z)\)?
Answer: The key to this is to recognize that \(z\) itself is a Gaussian random variable whose parameters we can derive from the distribution that generated \(y\).
Since the sum of two independent Gaussians is another Gaussian whose means and variances (not standard deviations) add, the parameters of \(z\) will be: \[\begin{eqnarray} \mu_z &= \mu_y, \\ \sigma_z^2 &= \sigma_y^2 + \sigma_{\epsilon}^2 \\ \end{eqnarray}\]
From this, to get our likelihood, we can write:
\[\begin{eqnarray} P(z|\mu, \sigma) = Normal(z; \mu_z, \sigma_z) \end{eqnarray}\]
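As a quick numerical sanity check (not in the original) that variances rather than standard deviations add for independent Gaussians, with arbitrary illustrative values:
# empirical sd of y + noise vs. sqrt(sigma_y^2 + sigma_eps^2)
set.seed(123)
sigma_y = 2
sigma_eps = 0.5
z_sim = rnorm(1e5, mean = 0, sd = sigma_y) + rnorm(1e5, mean = 0, sd = sigma_eps)
sd(z_sim) # approximately 2.06
sqrt(sigma_y^2 + sigma_eps^2) # 2.0616; adding the sds instead would give 2.5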
When doing the analytic updating, we should only be working in the space of the z-distribution. So the prior will be distributed \(N(\mu_0, \sqrt{\sigma_0^2 + \sigma_e^2})\), and the ultimate posterior over \(\mu_n\) and \(\sigma_n\) will describe the posterior over the z-distribution; but we can then work backwards to get the parameterization of the distribution of interest, \(y\).
In code:
# Prior parameters (without noise)
m_0 = 0
V_0 = 1
a_0 = 1
b_0 = 1
# Noise contribution
m_e = 0
sigma_e = 0.1
# plot prior over mean and SD
plotNormal(m_0, sqrt(1/V_0))
plotInvGamma(a_0, b_0)
plotNormalInvGamma(m_0, 1/V_0, a_0, b_0)
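One simple (assumed, not from the original) way to run the noisy update and then work backwards, as described above: update on the z-scale exactly as in the noiseless case, then subtract the known noise variance from the posterior estimate of \(\sigma_z^2\) to get an estimate of \(\sigma_y^2\). The simulated values and the single-observation update are illustrative choices:
# simulate one latent draw y and the noisy observation z that we actually see
set.seed(2)
y = rnorm(1, mean = 2, sd = 1)
z = y + rnorm(1, mean = m_e, sd = sigma_e)
# conjugate update with n = 1, identical to the noiseless case but applied to z
V_n = 1 / (1/V_0 + 1)
m_n = V_n * (1/V_0 * m_0 + z)
a_n = a_0 + 1/2
b_n = b_0 + 1/2 * (m_0^2 / V_0 + z^2 - m_n^2 / V_n)
# working backwards: posterior expected variance of z, minus the known noise variance
sigma_z2_hat = b_n / (a_n - 1)
sigma_y2_hat = max(sigma_z2_hat - sigma_e^2, 0)
sigma_y2_hat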
Of interest: there are different ways to think about what the noise is here:
- We can define it intuitively as perturbations to the input, with a given mean and standard deviation. That would make it so that we have to scale the inverse gamma part of the distribution by \(\sigma_e^2\). That's okay in terms of updating, but I think it might mess up the KL calculation, or at least we will no longer be able to use the closed form, since the mean of the inverse gamma isn't a standalone parameter there that we could modify. This might be fine, since we can compute the KL numerically.
- We can define noise as perturbations directly to a_0 and b_0, such that the PDF of the inverse gamma would shift a bit. Note that it's not possible to do this without also changing the shape of the curve, so this wouldn't be 'pure noise'; it would be a more complicated change in our beliefs about the SD of the feature.
- There is a three-parameter gamma distribution that includes a location parameter, i.e. a parameter that allows us to shift the mean of the distribution without touching the shape of the curve. I'm not sure how to use this parameterization while retaining the definition of the parameter updates and KL. I suspect it's hard, but it might not be.
Note: interpretation of parameters of normal gamma distribution, from Wikipedia page:
The interpretation of parameters in terms of pseudo-observations is as follows:
- The new mean takes a weighted average of the old pseudo-mean and the observed mean, weighted by the number of associated (pseudo-)observations.
- The precision was estimated from \(2\alpha\) pseudo-observations (i.e. possibly a different number of pseudo-observations, to allow the variance of the mean and precision to be controlled separately) with sample mean \(\mu\) and sample variance \(\frac{\beta}{\alpha}\) (i.e. with sum of squared deviations \(2\beta\)).
- The posterior updates the number of pseudo-observations (\(\lambda_0\)) simply by adding the corresponding number of new observations (\(n\)).
- The new sum of squared deviations is computed by adding the previous respective sums of squared deviations. However, a third "interaction term" is needed because the two sets of squared deviations were computed with respect to different means, and hence the sum of the two underestimates the actual total squared deviation.
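For concreteness, the standard normal-gamma posterior updates (as given on that Wikipedia page) are: \[\begin{aligned} \mu_n &= \frac{\lambda_0 \mu_0 + n\bar{x}}{\lambda_0 + n}, \qquad \lambda_n = \lambda_0 + n, \\ \alpha_n &= \alpha_0 + \frac{n}{2}, \qquad \beta_n = \beta_0 + \frac{1}{2}\sum_{i=1}^n (x_i - \bar{x})^2 + \frac{\lambda_0 n (\bar{x} - \mu_0)^2}{2(\lambda_0 + n)} \end{aligned}\] where the last term of the \(\beta_n\) update is the "interaction term" described above.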