We assume that the data have a binomial distribution with a probability parameter \(\theta\), and we will observe \(N=10\) binomial trials: \[
y \sim \mbox{Binomial}(N,\theta)
\] which implies that \[
p(y\mid\theta) = \binom{N}{y}\theta^y(1-\theta)^{N-y}
\]

This function tells us the probability of various outcomes, assuming we knew the true \(\theta\); for instance, the figure below shows the probability of each possible outcome from \(y=0\) to \(y=10\), assuming that \(\theta=0.19\):
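The same probabilities can be computed directly from the formula above. The following is a minimal sketch in Python using only the standard library; the helper name `binomial_pmf` is introduced here for illustration and is not part of the original text.

```python
from math import comb

def binomial_pmf(y, n, theta):
    """Probability of y successes in n trials, each with success probability theta."""
    return comb(n, y) * theta**y * (1 - theta)**(n - y)

N = 10
theta = 0.19

# One probability for each possible outcome y = 0, 1, ..., N.
probs = [binomial_pmf(y, N, theta) for y in range(N + 1)]

for y, p in enumerate(probs):
    print(f"y = {y:2d}   p(y | theta) = {p:.4f}")
```

Because these are the probabilities of all possible outcomes for a fixed \(\theta\), they sum to 1.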

Later on, it will be helpful to imagine that this is one of several possible sets of probabilities for \(y\), embedded in a bivariate space with \(y\) on one axis and \(\theta\) on the other. The following 3D plot shows this (try rotating and zooming!):


Every value of \(\theta\) corresponds to a new set of probabilities for the outcomes. Low values of \(\theta\) will yield higher probabilities for smaller numbers of successes, while high values of \(\theta\) will yield higher probabilities for the larger numbers of successes. We can visualize this with the following video:
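As a rough numerical counterpart, the sketch below tabulates \(p(y\mid\theta)\) for a handful of \(\theta\) values and reports the most probable outcome under each. The particular \(\theta\) values and the names `thetas`, `table`, and `modes` are illustrative choices, not taken from the text.

```python
from math import comb

N = 10

def binomial_pmf(y, n, theta):
    """Probability of y successes in n trials, each with success probability theta."""
    return comb(n, y) * theta**y * (1 - theta)**(n - y)

# Illustrative values of theta (an assumption; any values in [0, 1] would do).
thetas = [0.1, 0.3, 0.5, 0.7, 0.9]

# table[i][y] holds p(y | thetas[i]); each row is one full set of outcome probabilities.
table = [[binomial_pmf(y, N, t) for y in range(N + 1)] for t in thetas]

# The most probable number of successes under each theta.
modes = [max(range(N + 1), key=lambda y: row[y]) for row in table]

for t, m in zip(thetas, modes):
    print(f"theta = {t:.1f}: most probable y = {m}")
```

Each row of `table` is a separate probability distribution over \(y\), and the most probable outcome moves toward larger \(y\) as \(\theta\) increases.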

Of interest to us is how to make an inference about \(\theta\) after we observe \(y\) successes out of \(N=10\) trials.

Bayes’ Theorem

Every Bayesian analysis begins with Bayes’ theorem. Since joint probability will be important to us, it will be helpful to think of Bayes’ theorem as a direct consequence of the definition of conditional probability: \[
p(\theta\mid y) = \frac{p(\theta, y)}{p(y)}.
\] Multiplying both sides by \(p(y)\), and applying the same definition to \(p(y\mid\theta)\), gives \[
p(\theta\mid y)p(y) = p(\theta, y) = p(y \mid \theta)p(\theta)
\] which, of course, implies that \[
p(\theta\mid y) = \frac{p(y \mid \theta)p(\theta)}{p(y)}
\] This is called Bayes’ theorem. We begin with a “prior” distribution \(p(\theta)\) that quantifies a “reasonable belief” about \(\theta\) (in some sense) before observing the data, and then arrive at a “posterior” distribution \(p(\theta\mid y)\) that quantifies the “reasonable belief” we have about \(\theta\) after observing the data.
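To make the mechanics of the theorem concrete, here is a minimal sketch that applies it on a discrete grid of \(\theta\) values. Everything specific in it is an assumption made for illustration: the observed count `y_obs = 3`, the 101-point grid, and the uniform prior over that grid, which is not necessarily the prior used in this text.

```python
from math import comb

N = 10
y_obs = 3  # hypothetical observation, chosen only for illustration

# Discretize theta so that p(y) becomes a sum rather than an integral.
grid = [i / 100 for i in range(101)]

# Assumed prior: equal weight on every grid point (not necessarily the prior in the text).
prior = [1 / len(grid)] * len(grid)

# Likelihood p(y_obs | theta) at each grid point.
likelihood = [comb(N, y_obs) * t**y_obs * (1 - t)**(N - y_obs) for t in grid]

# Joint p(theta, y_obs) = p(y_obs | theta) p(theta).
joint = [lik * pr for lik, pr in zip(likelihood, prior)]

# Marginal p(y_obs): sum the joint over all theta values.
p_y = sum(joint)

# Posterior p(theta | y_obs) = p(theta, y_obs) / p(y_obs).
posterior = [j / p_y for j in joint]

best = grid[posterior.index(max(posterior))]
print(f"most probable theta on the grid: {best}")
```

The division by `p_y` is what turns the joint probabilities into a distribution over \(\theta\) that sums to 1; under the uniform prior assumed here, the posterior simply follows the shape of the likelihood.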

For demonstration, I will use the following prior distribution for \(\theta\):