Chapter 3

Fundamental Idea of Bayesian Inference

Let \(x\) be an observed data point (or vector) in the sample space \(\chi\). Let \(\mu\) be an unobserved parameter in the parameter space \(\Omega\).

A family of probability densities can be defined as follows:
\(F = \{f_\mu(x); x\in\chi, \mu \in \Omega\}\).

The fundamental idea for both frequentists and Bayesians is that we observe \(x\) from \(f_\mu(x)\) and then we try to infer the value of \(\mu\).

The fundamental idea for Bayesian inference is the additional assumption that \(\mu\) follows a prior density (or distribution). In other words, \(\mu \sim g(\mu), \mu \in \Omega\).

Bayes’ Rule

Bayes’ Theorem is a rule for combining the prior knowledge in \(g(\mu)\) with our current data \(x\). Remember, our goal is to estimate \(\mu\) (whether you are a Frequentist or a Bayesian). The only thing that has changed is that \(\mu\) is now treated as a random variable. Thus, as with any random variable, we would like to know its distribution.

Define \(g(\mu|x)\) as the posterior density of \(\mu\). Bayes’ Rule states the following:

\(g(\mu|x) = g(\mu) \frac{f_\mu(x)}{f(x)}, \mu \in \Omega\)

where \(f(x)\) is the marginal density of \(x\), that is:

\(f(x) = \int_\Omega f_\mu(x)g(\mu)d\mu\)

An important structure to note is that we can view the marginal density as a normalizing constant (a function of our data \(x\)) for the posterior density \(g(\mu|x)\). That is,

\(g(\mu|x) = c_x L_x(\mu)g(\mu)\)

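Here \(L_x(\mu) = f_\mu(x)\) is the likelihood, that is, the density viewed as a function of \(\mu\) with \(x\) held fixed, and \(c_x = 1/f(x)\). In other words, the posterior distribution for given observed data \(x\) is proportional to (up to the normalizing constant \(c_x\)) the likelihood \(L_x(\mu)\) times the prior \(g(\mu)\).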

Two Examples (just one really)

(I.) The Physicist’s Twins: Thanks to sonograms, a physicist found out she was going to have twin boys. The physicist asked “What is the probability my twins will be Identical, rather than Fraternal?”. The doctor answered that one-third of twin births were Identicals and two-thirds were Fraternals.

In this example the physicist seeks the posterior distribution of \(\mu\) (Identical vs Fraternal) in light of the data \(x\) that her twins are both boys, i.e. the same sex. Note that \(\mu\) can be thought of as a Bernoulli random variable with prior \(g(\text{Identical}) = 1/3\) and \(g(\text{Fraternal}) = 2/3\). Using Bayes’ Rule we have:

\(g(\text{Identical} | \text{Same Sex}) = \frac{g(\text{Identical}) f_{\text{Identical}}(\text{Same Sex})}{g(\text{Identical}) f_{\text{Identical}}(\text{Same Sex}) + g(\text{Fraternal}) f_{\text{Fraternal}}(\text{Same Sex})}\)

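Identical twins are always the same sex, so \(f_{\text{Identical}}(\text{Same Sex}) = 1\), while fraternal twins are the same sex half of the time, so \(f_{\text{Fraternal}}(\text{Same Sex}) = 1/2\). Plugging in, we get \(g(\text{Identical} | \text{Same Sex}) = \frac{(1/3)(1)}{(1/3)(1) + (2/3)(1/2)} = \frac{1}{2}\). Thus, while our prior distribution stated that \(g(\text{Identical}) = \frac{1}{3}\), our posterior given the data \(x=\text{Same Sex}\) increased to \(g(\text{Identical} | x=\text{Same Sex}) = \frac{1}{2}\).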
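The arithmetic can be checked with a few lines of Python. This is only a minimal sketch: the function name `posterior_identical` is made up for these notes, and the default likelihoods encode the assumption that identical twins are always the same sex while fraternal twins are the same sex half of the time.

```python
from fractions import Fraction

def posterior_identical(prior_identical,
                        lik_identical=Fraction(1),
                        lik_fraternal=Fraction(1, 2)):
    """Posterior probability of Identical given Same Sex, by Bayes' Rule."""
    prior_fraternal = 1 - prior_identical
    numerator = prior_identical * lik_identical
    # Marginal f(x): sum of prior * likelihood over both values of mu.
    marginal = numerator + prior_fraternal * lik_fraternal
    return numerator / marginal

# Doctor's prior: one-third of twin births are Identical.
print(posterior_identical(Fraction(1, 3)))
```

Calling the function with a different `prior_identical` shows directly how much the answer depends on the prior.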

In this example our posterior calculation was highly dependent on the prior distribution \(g(\mu)\). For example, if we had gone with an equally likely prior (\(g(\text{Identical}) = g(\text{Fraternal}) = 1/2\)), then the posterior would have increased further to \(g(\text{Identical} | x=\text{Same Sex}) = \frac{2}{3}\). Thankfully, we have historical hospital records that cleanly and simply suggest our original prior was accurate and reasonable. This will not always be the case. Thus we devote some time to discussion of Uninformative Prior Distributions.

Uninformative Prior Distributions

Useful prior information in day-to-day scientific applications is scarce. Thus, methods have been proposed to construct “priors” that permit the use of Bayes’ rule in the absence of relevant experience or information. One such approach is the method of uninformative priors.

3.2: Uninformative Priors

An uninformative prior is one that is intended not to bias the inference toward any particular parameter value. A common example is a uniform distribution over the unknown parameters.

One of the most widely used uninformative priors is the Jeffreys prior.

\(g^{Jeff}(\mu) = {I_{\mu}}^{1/2}\)

where \(I_{\mu}\) is the Fisher Information for the parameter \(\mu\).

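The Jeffreys prior does in fact transform correctly under parameter changes: deriving it in a new parameterization gives the same prior as applying the change-of-variables rule to the original one. In other words, the choice of parameterization does not affect the result of Bayes’ Rule. For example, credible intervals computed under the Jeffreys prior map onto each other when the parameter is transformed.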
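As a concrete illustration, take a single Bernoulli observation with success probability \(p\) (a model chosen here for convenience, not one used in these notes). Its Fisher Information is \(1/(p(1-p))\), so the prior is proportional to \(p^{-1/2}(1-p)^{-1/2}\). The sketch below computes the Fisher Information as the expected squared score; the function names are hypothetical.

```python
import math

def fisher_information_bernoulli(p):
    """Expected squared score of one Bernoulli(p) observation."""
    # Score: d/dp log f_p(x) = x/p - (1 - x)/(1 - p), for x in {0, 1}.
    score_if_one = 1.0 / p
    score_if_zero = -1.0 / (1.0 - p)
    # Average the squared score over the two outcomes.
    return p * score_if_one**2 + (1.0 - p) * score_if_zero**2

def jeffreys_prior_unnormalized(p):
    """Square root of the Fisher Information (not normalized to integrate to 1)."""
    return math.sqrt(fisher_information_bernoulli(p))

for p in (0.1, 0.5, 0.9):
    print(p, jeffreys_prior_unnormalized(p))
```

The resulting prior puts more weight near 0 and 1 than a uniform prior does, so "uninformative" here does not mean "flat".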

3.3 and 3.4: Pros/Cons of Bayesian vs Frequentist

A significant advantage of the Bayesian approach is that the posterior distribution \(g(\mu|x)\) depends only on the observed data \(x\) and the prior \(g(\mu)\). It does not depend on, for example, other potential data sets \(X\) that might have been seen but were not. Frequentists, by contrast, have to worry about the construction and choice of their estimators, because an estimator must perform well across all potential data sets \(X\).

A good example of this can be found on pages 30 and 31.

The simplicity of the Bayesian approach is also possibly its biggest flaw. As we stated above, the posterior \(g(\mu|x)\) depends only on the observed data and the prior \(g(\mu)\). Thus, if the prior is flawed (for example, a completely wrong distribution was chosen), then the posterior distribution will be flawed as well.

A challenge of Frequentism, as we stated earlier, is that an estimator can be difficult to construct or to choose. A test statistic, for example, can be difficult to construct from first principles, and determining its distribution can be even more challenging.

Chapter 4: Fisher Inference and Maximum Likelihood Estimation

4.1: Likelihood and Maximum Likelihood

The log-likelihood function is

\(\ell_x(\mu) = \log f_\mu(x)\).

We treat the observed data vector \(x\) as fixed and maximize this function over all possible values of \(\mu\) in the parameter space \(\Omega\). We call the maximizing value \(\hat{\mu}_{MLE}\).
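A minimal numerical sketch of this maximization, assuming an i.i.d. Poisson model with mean \(\mu\) and made-up counts (neither comes from these notes). The helper `maximize_1d` is a plain golden-section search written for this example; for the Poisson model the result should agree with the sample mean.

```python
import math

def poisson_log_likelihood(mu, data):
    """Log-likelihood of i.i.d. Poisson(mu) counts."""
    return sum(x * math.log(mu) - mu - math.lgamma(x + 1) for x in data)

def maximize_1d(func, lo, hi, tol=1e-8):
    """Golden-section search for the maximizer of a unimodal function on [lo, hi]."""
    ratio = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    while b - a > tol:
        c = b - ratio * (b - a)
        d = a + ratio * (b - a)
        if func(c) > func(d):
            b = d  # the maximum lies in [a, d]
        else:
            a = c  # the maximum lies in [c, b]
    return (a + b) / 2

data = [2, 4, 3, 0, 5, 1, 3]  # made-up counts for illustration
mu_hat = maximize_1d(lambda mu: poisson_log_likelihood(mu, data), 0.01, 20.0)
print(mu_hat, sum(data) / len(data))  # numerical MLE vs. sample mean
```

Golden-section search only works because this log-likelihood has a single peak; a multimodal likelihood would need a different optimizer or several starting intervals.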

While the MLE algorithm was once the dominant estimation practice, in recent years it has become less popular. However, it is still in common use and has several advantages.

1.) The MLE algorithm is automatic. The problem statement is clear.

2.) In large-sample situations, MLE estimates tend to be nearly unbiased with nearly the least possible variance. In other words, they approximately minimize the mean squared error.

3.) The MLE also maximizes the posterior density in a Bayesian framework if the prior is flat, i.e. \(g(\mu)\) is constant in \(\mu\). Looking at Bayes’ Rule (posterior \(\propto\) prior * likelihood)

\(g(\mu|x) = c_x g(\mu) e^{\ell_x(\mu)}\)

we see that maximizing this posterior density amounts to maximizing the likelihood function. Thus, in the case where the prior is flat (analogous to the Frequentist approach), the two approaches agree on the estimate \(\hat{\mu}\).
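The agreement can be seen on a grid. This sketch assumes a Binomial model with made-up data (7 successes in 10 trials) and compares the maximizer of the likelihood with the posterior mode under a flat prior and under a non-flat prior proportional to \(\mu(1-\mu)\). All names are hypothetical.

```python
import math

def log_likelihood(mu, successes, trials):
    """Binomial log-likelihood, dropping the constant binomial coefficient."""
    return successes * math.log(mu) + (trials - successes) * math.log(1 - mu)

def argmax_on_grid(log_density, grid):
    """Grid point with the largest value of log_density."""
    return max(grid, key=log_density)

successes, trials = 7, 10  # made-up data
grid = [i / 1000 for i in range(1, 1000)]  # mu in (0, 1)

mle = argmax_on_grid(lambda mu: log_likelihood(mu, successes, trials), grid)

# Flat prior: log g(mu) is a constant, so it cannot move the maximizer.
map_flat = argmax_on_grid(
    lambda mu: math.log(1.0) + log_likelihood(mu, successes, trials), grid
)

# Non-flat prior g(mu) proportional to mu * (1 - mu): the maximizer shifts.
map_nonflat = argmax_on_grid(
    lambda mu: math.log(mu * (1 - mu)) + log_likelihood(mu, successes, trials), grid
)

print(mle, map_flat, map_nonflat)
```

The normalizing constant \(c_x\) is omitted throughout because it does not depend on \(\mu\) and so does not change where the maximum is.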

One downside of the MLE is that it can be computationally difficult, especially in the multivariate situation. For example, it may be reasonably easy to maximize coordinate-wise for a multidimensional \(\theta\), yet the resulting estimate may be far from the global maximizer \(\hat{\theta}\).

4.2: Fisher Information and the MLE (Some Results)

The MLE \(\hat{\theta}\) has an approximately Normal distribution with mean \(\theta\) and variance \(\frac{1}{I_{\theta}}\), where \(I_{\theta}\) is the Fisher Information of the full sample, and no “nearly unbiased” estimator of \(\theta\) can do better.

In other words, bigger Fisher Information implies smaller variance for the MLE.
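A small simulation makes this concrete. It is only a sketch, assuming \(n\) i.i.d. Bernoulli(\(p\)) observations, for which the MLE is the sample proportion and the Fisher Information of the full sample is \(n/(p(1-p))\). The function name and the settings (\(p = 0.3\), 2000 replications, fixed seed) are illustrative choices.

```python
import random
import statistics

def simulate_mle_variance(p, n, replications, seed=0):
    """Empirical variance of the Bernoulli MLE (the sample proportion)."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(replications):
        successes = sum(1 for _ in range(n) if rng.random() < p)
        estimates.append(successes / n)
    return statistics.variance(estimates)

p = 0.3
for n in (50, 200):
    total_information = n / (p * (1 - p))  # Fisher Information of the full sample
    empirical = simulate_mle_variance(p, n, replications=2000)
    print(n, empirical, 1 / total_information)
```

Quadrupling \(n\) quadruples the information, and the simulated variance of the MLE should shrink by roughly the same factor.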

The Cramer-Rao lower bound states the following:

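Suppose \(\tilde{\theta}\) is any unbiased estimator of \(\theta\). Then the Cramer-Rao lower bound states that its variance is at least \(\frac{1}{I_{\theta}}\), which is the approximate variance of the MLE. In other words, no unbiased estimator can have variance smaller than the large-sample variance of the MLE.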

4.3: Conditional Inference - Estimating the Fisher Information from the observed data

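Rather than using the plug-in version and stating that the asymptotic variance of \(\hat{\theta}\) is \(\frac{1}{nI_{\hat{\theta}}}\) (with \(I_{\hat{\theta}}\) here being the Fisher Information of a single observation and \(n\) the sample size), Fisher suggested using a quantity he called the observed Fisher Information.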

\(I(x) = -\frac{\partial^2}{\partial\theta^2}\ell_x(\theta)\) evaluated at \(\theta = \hat{\theta}_{MLE}\)

Compare this to the standard Fisher Information:

\(\iota(\theta) = -E_{\theta}[\frac{\partial^2}{\partial\theta^2} \ell_x(\theta)]\)

As we can see, the observed Fisher Information has no expected value operator to carry out, which simplifies calculations: it can be obtained by numerical second differentiation of the log-likelihood at the MLE. Unlike \(\iota(\theta)\), which averages over all possible data sets and so depends on \(\theta\) rather than on the observed \(x\), no probability calculations (i.e. expected values) are required.
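The numerical second differentiation can be sketched as follows, assuming an i.i.d. Poisson model with made-up counts. For this particular model the observed information at the MLE has the closed form \(n/\hat{\theta}\), which gives a convenient check. The function names and the step size are illustrative choices.

```python
import math

def log_likelihood(theta, data):
    """Poisson log-likelihood for i.i.d. counts (an example model)."""
    return sum(x * math.log(theta) - theta - math.lgamma(x + 1) for x in data)

def observed_information(loglik, theta_hat, step=1e-4):
    """Negative central second difference of loglik at theta_hat."""
    second_derivative = (
        loglik(theta_hat + step) - 2 * loglik(theta_hat) + loglik(theta_hat - step)
    ) / step**2
    return -second_derivative

data = [2, 4, 3, 0, 5, 1, 3]       # made-up counts
theta_hat = sum(data) / len(data)  # the Poisson MLE is the sample mean
info = observed_information(lambda t: log_likelihood(t, data), theta_hat)

print(info, len(data) / theta_hat)  # numerical value vs. closed form n / theta_hat
print(1 / info)                     # approximate variance of the MLE
```

Only evaluations of the log-likelihood at the observed data are needed; no expectation over hypothetical data sets appears anywhere.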