The model

We assume that the data have a binomial distribution with a probability parameter \(\theta\), and we will observe \(N=10\) binomial trials: \[ y \sim \mbox{Binomial}(N,\theta) \] which implies that \[ p(y\mid\theta) = \binom{N}{y}\theta^y(1-\theta)^{N-y} \]

This function tells us the probability of various outcomes, assuming we knew the true \(\theta\); for instance, the figure below shows the probability of each possible outcome from \(y=0\) to \(y=10\), assuming that \(\theta=0.19\):
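These probabilities are easy to reproduce with dbinom(), which implements the formula above; a minimal sketch, assuming the same \(N=10\) and \(\theta=0.19\) as the figure:

## Probability of each outcome y = 0, ..., 10, assuming theta = 0.19
N = 10
round(dbinom(0:10, size = N, prob = 0.19), 3)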

Later on, it will be helpful to imagine that this is one of several possible sets of probabilities for \(y\), embedded in a bivariate space with \(y\) on one axis and \(\theta\) on the other. The following 3D plot shows this (try rotating and zooming!):


Every value of \(\theta\) corresponds to a new set of probabilities for the outcomes. Low values of \(\theta\) will yield higher probabilities for smaller numbers of successes, while high values of \(\theta\) will yield higher probabilities for the larger numbers of successes. We can visualize this with the following video:
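The same pattern can be tabulated directly; the sketch below is my own illustration, with 0.1, 0.5, and 0.9 chosen arbitrarily as a low, middle, and high value of \(\theta\):

## Binomial probabilities (rows: y = 0, ..., 10) for three values of theta
## (0.1, 0.5, and 0.9 are arbitrary choices for illustration)
probs = sapply(c(0.1, 0.5, 0.9), function(th) dbinom(0:10, size = N, prob = th))
rownames(probs) = 0:10
colnames(probs) = c(0.1, 0.5, 0.9)
round(probs, 3)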

Of interest to us is how to make an inference about \(\theta\) after we observe \(y\) successes out of \(N=10\) trials.

Bayes’ Theorem

Every Bayesian analysis begins with Bayes’ theorem. Since joint probability will be important to us, it will be helpful to think of Bayes’ theorem as a direct consequence of the definition of conditional probability: \[ p(\theta\mid y) = \frac{p(\theta, y)}{p(y)}. \] This definition implies that \[ p(\theta\mid y)p(y) = p(\theta, y) = p(y \mid \theta)p(\theta), \] which, of course, implies that \[ p(\theta\mid y) = \frac{p(y \mid \theta)p(\theta)}{p(y)}. \] This is called Bayes’ theorem. We begin with a “prior” distribution \(p(\theta)\) that quantifies a “reasonable belief” about \(\theta\) (in some sense) before seeing the data, and we arrive at a “posterior” distribution \(p(\theta\mid y)\) that quantifies the “reasonable belief” we have about \(\theta\) after observing the data.

For demonstration, I will use the following prior distribution for \(\theta\):

This quantifies a belief that \(\theta\) is probably somewhere in the middle of its range and discounts extreme values, but not heavily so. It is a Beta(2, 2) distribution, but that detail is not important; what matters is the shape of the curve.
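For reference, a minimal sketch of this prior in R; the names a and b match the ones referenced in the sampling code later on, and the grid of \(\theta\) values is my own choice for plotting:

## Beta(2, 2) prior parameters (the a and b referenced in the code below)
a = 2
b = 2

## Prior density over a grid of theta values
theta.grid = seq(0, 1, length.out = 101)
prior = dbeta(theta.grid, a, b)
plot(theta.grid, prior, type = "l", xlab = "theta", ylab = "Prior density")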

If we look at the numerator on the right-hand side of Bayes’ theorem, we see the term \(p(y\mid\theta)p(\theta)\). As we pointed out, this is simply the joint probability of \(y\) and \(\theta\), \(p(\theta, y)\). Notice that this is just the set of probabilities for all the outcomes, assuming a particular \(\theta\), multiplied by the prior density of that \(\theta\). We can see how this works in the video below. The faint points are the probabilities from the previous animation, before they have been multiplied by the prior; the solid points show the new values after multiplication by the prior. Notice that where the prior density is low, the points are pushed down toward 0; where the prior density is high, the points are raised.
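A minimal sketch of that multiplication, reusing N, theta.grid, and prior from the sketches above (the grid itself is only for illustration):

## Likelihood p(y | theta) for every outcome y = 0, ..., 10 (rows)
## and every theta on the grid (columns)
likelihood = sapply(theta.grid, function(th) dbinom(0:10, size = N, prob = th))

## Multiply each column by the prior density of its theta:
## a grid approximation to the joint p(y, theta)
joint = sweep(likelihood, 2, prior, "*")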

If we take all the points from the animation above – that is, for all values of \(\theta\) – and put them into a bivariate graph with \(\theta\) and \(y\) on two axes, it looks like the following (try rotating and zooming!):


This is the joint distribution of \(y\) and \(\theta\), which is notated \(p(y, \theta)\) or \(p(\theta, y)\).

The effect of an observation

By Bayes’ theorem, the effect of observing \(y=1\) is to select a slice from the joint distribution above – in particular, the red curve where \(y=1\). Selecting only this slice yields a curve over the possible values of \(\theta\) (try rotating and zooming!):


This curve is the joint distribution evaluated at all the values where \(y=1\). It can be more easily shown as a standard 2D plot:

Notice, however, that this curve is a slice through the joint distribution; it does not integrate to 1. Remember that the definition of conditional probability (and Bayes’ theorem) specifies that we need to divide by \(p(y)\) to get the posterior distribution: \[ p(\theta\mid y) = \frac{p(y,\theta)}{p(y)} \]
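This is easy to check numerically; a small sketch using the a, b, and N defined above (the function name slice is mine):

## The slice of the joint at y = 1, as a function of theta
slice = function(theta) dbinom(1, size = N, prob = theta) * dbeta(theta, a, b)

## It integrates to roughly 0.07, not to 1
integrate(slice, lower = 0, upper = 1)$value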

What is \(p(y)\)? It is the constant that makes the curve integrate to 1 (that is, it makes the curve into a probability distribution). We can also think of it as the weighted average (weighted over all the possible \(\theta\) values) probability of obtaining \(y=1\). In notation, \[ \begin{eqnarray} p(y)&=&\int_0^1 p(y\mid\theta)p(\theta)\,d\theta\\ &=&E_{p(\theta)}\left[p(y\mid\theta)\right] \end{eqnarray} \] For our particular observation, \(y=1\), it is \[ \begin{eqnarray} p(y)&=&\int_0^1 \binom{10}{1}\theta^{1}(1-\theta)^{9}\frac{\theta^{2-1}(1-\theta)^{2-1}}{Be(2,2)}\,d\theta\\ &=& \int_0^1 10\times6\times\theta^{1+2-1}(1-\theta)^{9+2-1}\,d\theta\\ &=& 60\,Be(1+2,\,9+2)\\ &=&0.0699301 \end{eqnarray} \]
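The same value can be computed in R from the beta function directly (a small check, reusing a and b from above):

## The integrand is an unnormalized beta density, so the integral is
## choose(10, 1) * Be(1 + 2, 9 + 2) / Be(2, 2)
choose(10, 1) * beta(1 + a, 9 + b) / beta(a, b)
## matches the exact value 0.0699301 given above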

This can also be easily understood through sampling in R.

## Number of simulations to do
M = 100000

## Sample from the prior; a and b were defined before
theta.sample = rbeta(M, a, b)

## Sample data: one y for each sampled theta
## N was defined before
y.sample = rbinom(M, N, theta.sample)

## What is the *marginal* probability of observing the y we observed?
## ystar (the observed y = 1) was defined before
mean(y.sample == ystar)
## [1] 0.06964

The simulated estimate may differ slightly from the exact value computed above, due to sampling variability.
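The expectation form of \(p(y)\) given earlier suggests a closely related estimate from the same prior draws; a minimal sketch, reusing theta.sample, N, and ystar from the code above:

## Alternative estimate: average the binomial probability of the observed
## ystar over the prior draws, i.e. a Monte Carlo version of
## E_{p(theta)}[ p(y | theta) ]
mean(dbinom(ystar, size = N, prob = theta.sample))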

The posterior distribution

Now that we have our normalizing constant \(p(y)\), we can simply take our curve and divide, which will yield a new curve that is a probability distribution:

The curve above represents the posterior distribution for \(\theta\), after having observed \(y=1\).
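As a final check, here is a minimal sketch of that division in R, reusing the slice() function and the beta-function expression for \(p(y)\) from the sketches above (p.y and posterior are my own names):

## Marginal probability of the observed y = 1, computed earlier
p.y = choose(10, 1) * beta(1 + a, 9 + b) / beta(a, b)

## Posterior density: the slice of the joint at y = 1, divided by p(y)
posterior = function(theta) slice(theta) / p.y

## Sanity check: the rescaled curve now integrates to 1
integrate(posterior, lower = 0, upper = 1)$value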