# define where plots should be saved 
knitr::opts_chunk$set(fig.path = 'figures/')

When I tell people I am learning Bayesian statistics, I tend to get one of two responses: either people look at me blankly—“What’s Bayesian statistics?”—or I get scorned for using such “loose” methods—“Bayesian analysis is too subjective!”[1] This latter “concern” arises from (what I believe to be) a misunderstanding of the prior: Bayesian analysis requires one to state one’s prior belief about a certain effect, and then to combine this with the data observed (i.e., the likelihood) to update one’s belief (the posterior).

On the face of it, it might seem odd for a scientific method to include “subjectivity” in its analysis. I certainly had this doubt when I first started learning it. (And, to be honest with myself, I still struggle with it sometimes.) But the more I read, the more I think this concern is unwarranted: the prior is not really “subjectivity” in the strictest sense of the word at all; it is based on our current understanding of the effect we are interested in, which in turn is (often) based on data we have seen before. Yes, sometimes the prior can be a guess if we[2] have no other information to go on, but then we would express the uncertainty of that guess in the prior itself.

The more I understand Bayesian statistics, the more I appreciate that the prior is essential. One understated side-effect of having priors is that they can protect you from dubious findings. For example, I have a very strong prior against UFO claims; therefore, you are going to have to present me with a lot more evidence than some shaky video footage to convince me otherwise. You would not have to provide me with much evidence, however, if you claimed to have had roast beef last night. Extraordinary claims require extraordinary evidence.

But, during my more sceptical hours, I often succumbed to the the-prior-is-nothing-but-subjectivity-poisoning-your-analysis story. However, I now believe that even if one is sceptical of the use of a prior, there are a few things worth noting.

These are the points I mention to anti-Bayesians I encounter, and in this post I just want to walk through some of them with examples. This is selfish; it’s not really for your education (there really are better educators out there: my recommendation is Alex Etz’s excellent “Understanding Bayes” series). I just want somewhere with all of this written down, so that next time someone criticises my interest in Bayesian analysis I can just reply: “Read my blog!”

As some readers might not be massively familiar with these issues, I try to highlight some of the characteristics of the prior below. In all of these examples, I will use the standard Bayesian “introductory tool” of assessing the degree of bias in a coin by observing a series of flips.

A Fair Coin

If a coin is unbiased, it should produce roughly equal numbers of heads and tails. However, often we don’t know whether a coin is biased or not. We wish to estimate the bias in the coin (denoted \(\theta\)) by collecting some data (i.e., by flipping the coin); a fair coin has \(\theta\) = 0.5. Based on these data, we can calculate the likelihood of various \(\theta\) values. Below is the likelihood function for a fair coin.

# set up how fine the resolution should be
binWidth <- 0.005

# define parameter space
x <- seq(from = binWidth / 2, to = 1 - binWidth / 2, by = binWidth)

# define the data (number of heads, number of tails)
data <- c(50, 50)

# calculate the likelihood of the observed data
# (as in the later chunks, this is plotted as a beta(1 + heads, 1 + tails) density)
likelihood <- dbeta(x, 1 + data[1], 1 + data[2])

# do the plot
plot(x, likelihood, type = "l", lwd = 3, col = "skyblue",
     xlab = bquote(theta), ylab = "Density", las = 1)

In this example, we flipped the coin 100 times, and observed 50 heads and 50 tails. Note how the peak of the likelihood is centered on \(\theta\) = 0.5. A biased coin would have a true \(\theta\) not equal to 0.5; \(\theta\) closer to zero would reflect a bias towards tails, and a \(\theta\) closer to 1 would reflect a bias towards heads. The animation below demonstrates how the likelihood changes as the number of observed heads (out of 100 flips) increases:

Click on this Gif

So, the likelihood contains the information provided by our sample about the true value for \(\theta\).
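To make “the likelihood of various \(\theta\) values” concrete, here is a minimal sketch (my addition, not code from the original post) that evaluates the binomial likelihood of the 50-heads-in-100-flips data at a few candidate values of \(\theta\); the particular candidate values are just illustrative.

# likelihood of observing 50 heads in 100 flips under a few candidate biases
candidateTheta <- c(0.3, 0.5, 0.7)
dbinom(50, size = 100, prob = candidateTheta)
# theta = 0.5 gives by far the highest likelihood for these data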

The Prior

Before collecting data, Bayesian analysts would specify what their prior belief was about \(\theta\). Below I present various priors a Bayesian might hold, using the beta distribution (which has two parameters: a and b):

# set up how fine the resolution should be
binWidth <- 0.005

# define parameter space
x <- seq(from = binWidth / 2, to = 1 - binWidth / 2, by = binWidth)

# do the plots

par(mfrow = c(2, 2))

beta <- dbeta(x, 4, 4)
plot(x, beta, type = "l", main = "a = 4, b = 4",
     lty = 2, col = "gray48", xlab = bquote(theta), ylab = "Density", las = 1)

beta <- dbeta(x, 1, 1)
plot(x, beta, type = "l", main = "a = 1, b = 1",
     lty = 2, col = "gray48", xlab = bquote(theta), ylab = "Density", las = 1)

beta <- dbeta(x, 0.5, 0.5)
plot(x, beta, type = "l", main = "a = 0.5, b = 0.5",
     lty = 2, col = "gray48", xlab = bquote(theta), ylab = "Density", las = 1)

beta <- dbeta(x, 1, 0.5)
plot(x, beta, type = "l", main = "a = 1, b = 0.5",
     lty = 2, col = "gray48", xlab = bquote(theta), ylab = "Density", las = 1)

The upper left plot reflects a prior belief that the coin is fair (i.e., the peak of the distribution is centered over \(\theta\) = 0.5); however, there is some uncertainty in this prior as the distribution has some spread. The upper right plot reflects total uncertainty in a prior belief: that is, the prior holds that any value of \(\theta\) is equally likely. The lower two plots reflect prior beliefs that the coin is biased. Maybe the researcher had obtained the coin from a known con-artist. The lower left plot reflects a prior for a biased coin, but uncertainty about which side the coin is biased towards (that is, it could be biased heads or tails); the lower right plot reflects a prior that the coin is biased towards heads.
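One way to get a feel for how “strong” each of these priors is, is to look at how spread out they are. The sketch below (my addition, not part of the original post) uses qbeta() to compute a central 95% interval for each prior; a wider interval means a weaker, more uncertain prior.

# central 95% intervals for the four priors shown above
priors <- list(c(4, 4), c(1, 1), c(0.5, 0.5), c(1, 0.5))
for (p in priors) {
  interval <- qbeta(c(0.025, 0.975), p[1], p[2])
  cat(sprintf("a = %.1f, b = %.1f: 95%% interval [%.2f, %.2f]\n",
              p[1], p[2], interval[1], interval[2]))
}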

The effect of the prior

I stated above that one of the benefits of the prior is that it allows protection (somewhat) from spurious findings. If I have a really strong prior belief that the coin is fair, 9/10 heads isn’t going to be all that convincing evidence that it is not fair. However, if I have a weak prior that the coin is fair, then I will be quite convinced by the data.

This is illustrated below. Both priors below reflect the belief that the coin is fair; what differs between the two is the strength in this belief. The prior on the left is quite a weak belief, as the distribution (although peaked at 0.5) is quite spread out. The prior on the right is a stronger belief that the coin is fair.

In both cases, the likelihood is the result of observing 9/10 heads.

# set up how fine the resolution should be
binWidth <- 0.005

# define parameter space
x <- seq(from = binWidth / 2, to = 1 - binWidth / 2, by = binWidth)

# define the prior with parameters a and b for weak & strong prior
weakPrior <- c(2, 2)
weakPrior1 <- dbeta(x, weakPrior[1], weakPrior[2])

strongPrior <- c(50, 50)
strongPrior1 <- dbeta(x, strongPrior[1], strongPrior[2])

# define the data (number of heads, number of tails)
data <- c(9, 1)

# calculate the likelihood of observed data.
# following Alex Etz's blog, I use a prior beta(1, 1) for this
likelihood <- dbeta(x, 1 + data[1], 1 + data[2])

# calculate the posterior
weakPosterior <- dbeta(x, weakPrior[1] + data[1], weakPrior[2] + data[2])
strongPosterior <- dbeta(x, strongPrior[1] + data[1], strongPrior[2] + data[2])

# do the plot
par(mfrow = c(1, 2))
plot(x, weakPrior1, type = "l", main = "Weak Prior",
     ylim = c(0, 1 * max(c(weakPrior1, strongPrior1, 
                              weakPosterior, strongPosterior, 
                              likelihood))), lty = 2, 
     col = "gray48", xlab = bquote(theta), ylab = "Density", las = 1)
lines(x, likelihood, type = "l", col = "red", lty = 2)
lines(x, weakPosterior, type = "l", col = "skyblue", lwd = 4)
legend("topleft", c("Prior", "Likelihood", "Posterior"), lty = c(2, 2, 1), 
       col = c("gray48", "red", "skyblue"), lwd = c(1, 1, 4), bty = "n")

plot(x, strongPrior1, type = "l", main = "Strong Prior",
     ylim = c(0, 1 * max(c(weakPrior1, strongPrior1, 
                              weakPosterior, strongPosterior, 
                              likelihood))), lty = 2, 
     col = "gray48", xlab = bquote(theta), ylab = "Density", las = 1)
lines(x, likelihood, type = "l", col = "red", lty = 2)
lines(x, strongPosterior, type = "l", col = "skyblue", lwd = 4)

You can see that when the prior is a weak belief, the posterior is very similar to the likelihood; that is, the posterior belief is almost entirely dictated by the data. However, when we have a strong prior belief, our beliefs are not altered much by observing just 9/10 heads.
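As a quick numerical check of this (my own back-of-the-envelope addition, using the standard beta-binomial update in which a beta(a, b) prior combined with h heads and t tails gives a beta(a + h, b + t) posterior), we can compare the posterior means under the two priors:

# quick check: beta(a, b) prior + (heads, tails) -> beta(a + heads, b + tails)
data <- c(9, 1)

weakMean   <- (2 + data[1]) / (2 + data[1] + 2 + data[2])     # beta(11, 3):  ~0.79
strongMean <- (50 + data[1]) / (50 + data[1] + 50 + data[2])  # beta(59, 51): ~0.54
c(weak = weakMean, strong = strongMean)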

Now, I imagine that this is the anti-Bayesian’s point: “Even with clear data you haven’t changed your mind.” True. Is this a negative? Well, imagine instead this study was assessing the existence of UFOs rather than simple coin flips. If I showed you 9 YouTube videos of UFO “evidence”, and 1 video showing little (if any) evidence, would you be convinced of UFOs? I doubt it. You were the right-hand plot in this case. (I know, I know, the theta distribution doesn’t make sense in this case, but ignore that!)

What if the prior is “wrong”?

Worried that your prior is wrong[3], or that you cannot justify it completely? Throw more data at it. (When is this ever a bad idea?) Below are the same priors, but now we flip the coin 1,000 times and observe 900 heads. (Note that the proportion of heads is the same as in the previous example.) Now even our strong prior belief has to be updated considerably in light of these data. With enough data, even a mis-specified prior has little effect on inference.

# set up how fine the resolution should be
binWidth <- 0.005

# define parameter space
x <- seq(from = binWidth / 2, to = 1 - binWidth / 2, by = binWidth)

# define the prior with parameters a and b for weak & strong prior
weakPrior <- c(2, 2)
weakPrior1 <- dbeta(x, weakPrior[1], weakPrior[2])

strongPrior <- c(50, 50)
strongPrior1 <- dbeta(x, strongPrior[1], strongPrior[2])

# define the data (number of heads, number of tails)
data <- c(900, 100)

# calculate the likelihood of observed data.
# following Alex Etz's blog, I use a prior beta(1, 1) for this
likelihood <- dbeta(x, 1 + data[1], 1 + data[2])

# calculate the posterior
weakPosterior <- dbeta(x, weakPrior[1] + data[1], weakPrior[2] + data[2])
strongPosterior <- dbeta(x, strongPrior[1] + data[1], strongPrior[2] + data[2])

# do the plot
par(mfrow = c(1, 2))
plot(x, weakPrior1, type = "l", main = "Weak Prior",
     ylim = c(0, 1 * max(c(weakPrior1, strongPrior1, 
                              weakPosterior, strongPosterior, 
                              likelihood))), lty = 2, 
     col = "gray48", xlab = bquote(theta), ylab = "Density", las = 1)
lines(x, likelihood, type = "l", col = "red", lty = 2)
lines(x, weakPosterior, type = "l", col = "skyblue", lwd = 4)
legend("topleft", c("Prior", "Likelihood", "Posterior"), lty = c(2, 2, 1), 
       col = c("gray48", "red", "skyblue"), lwd = c(1, 1, 4), bty = "n")

plot(x, strongPrior1, type = "l", main = "Strong Prior",
     ylim = c(0, 1 * max(c(weakPrior1, strongPrior1, 
                              weakPosterior, strongPosterior, 
                              likelihood))), lty = 2, 
     col = "gray48", xlab = bquote(theta), ylab = "Density", las = 1)
lines(x, likelihood, type = "l", col = "red", lty = 2)
lines(x, strongPosterior, type = "l", col = "skyblue", lwd = 4)
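Again as a rough numerical check (my addition): with 900 heads and 100 tails, the “strong” beta(50, 50) prior is updated to a beta(950, 150) posterior, whose mean already sits close to the likelihood.

# posterior mean under the strong prior after 900 heads / 100 tails
strongPosteriorMean <- (50 + 900) / (50 + 900 + 50 + 100)  # 950 / 1100
strongPosteriorMean  # ~0.86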

To get an idea of how sample size influences the effect of the prior on the posterior, I created the gif animation below. In it, we have a relatively strong (although not insanely so) prior belief that the coin is biased towards heads. Then we start flipping the coin, and update the posterior after each flip. In fact, this coin is fair, so our prior is not in accord with (unobservable) “reality”. As the number of flips increases, though, our posterior starts to match the likelihood of the data. So, “wrong” priors aren’t really a problem. Just throw more data at it.

Click on this Gif
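For anyone who cannot view the animation, here is a minimal sketch of the same idea (my own illustration, not code from the post): start from a heads-biased prior, simulate flips of a genuinely fair coin, and update the beta parameters one flip at a time.

set.seed(123)

# a fairly strong prior that the coin is biased towards heads
a <- 20; b <- 5

nFlips <- 500
flips <- rbinom(nFlips, size = 1, prob = 0.5)  # the coin is actually fair

for (flip in flips) {
  a <- a + flip        # add a head
  b <- b + (1 - flip)  # or add a tail
}

# the posterior mean drifts back towards 0.5 as the data accumulate
a / (a + b)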

“Today’s posterior is tomorrow’s prior” — Lindley (1970)

After collecting some data and updating your prior, you now have a posterior belief of something. If you wish to collect more data, you do not use your original prior (because it no longer reflects your belief), but you instead use the posterior from your previous study as the prior for your current one. Then, you collect some data, update your priors into your posteriors…and so on.

In this sense, Bayesian analysis is ultimately “self-correcting”: as you collect more and more data, even horrendously-specified priors won’t matter.

In the example below, we have a fairly loose idea that the coin is fair—i.e., \(\theta\) = 0.5. We flip a coin 20 times and observe 18 heads. We then update to our posterior, which suggests the true value for \(\theta\) is around 0.7. But then we wish to run a second “study”; we use the posterior from study 1 as our prior for study 2. We again observe 18 heads out of 20 flips, and update accordingly.

par(mfrow = c(1, 2))

# set up how fine the resolution should be
binWidth <- 0.005

# define parameter space
x <- seq(from = binWidth / 2, to = 1 - binWidth / 2, by = binWidth)

# define the prior with parameters a and b
priorParameters <- c(10, 10)
prior <- dbeta(x, priorParameters[1], priorParameters[2])

# define the data (number of heads, number of tails)
data <- c(18, 2)

# calculate the likelihood of observed data.
likelihood <- dbeta(x, 1 + data[1], 1 + data[2])

# calculate the posterior
posterior <- dbeta(x, priorParameters[1] + data[1], priorParameters[2] + data[2])

# do the plot
plot(x, prior, type = "l", 
     ylim = c(0, 8), lty = 2, main = "First Study",
     col = "gray48", xlab = bquote(theta), ylab = "Density", las = 1)
lines(x, likelihood, type = "l", col = "red", lty = 2)
lines(x, posterior, type = "l", col = "skyblue", lwd = 4)
legend("topleft", c("Prior", "Likelihood", "Posterior"), lty = c(2, 2, 1), 
       col = c("gray48", "red", "skyblue"), lwd = c(1, 1, 4), bty = "n")



# define the prior with parameters a and b
priorParameters <- c(priorParameters[1] + data[1], priorParameters[2] + data[2])
prior <- dbeta(x, priorParameters[1], priorParameters[2])

# define more data (number of heads, number of tails)
data <- c(18, 2)

# calculate the likelihood of observed data.
likelihood <- dbeta(x, 1 + data[1], 1 + data[2])

# calculate the posterior
posterior <- dbeta(x, priorParameters[1] + data[1], priorParameters[2] + data[2])

# do the plot
plot(x, prior, type = "l", 
     ylim = c(0, 8), lty = 2, main = "Second Study",
     col = "gray48", xlab = bquote(theta), ylab = "Density", las = 1)
lines(x, likelihood, type = "l", col = "red", lty = 2)
lines(x, posterior, type = "l", col = "skyblue", lwd = 4)
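A nice property of this sequential updating (a small check I have added, not part of the original post) is that it gives exactly the same answer as analysing all of the data at once: the beta(10, 10) prior combined with the pooled 36 heads and 4 tails from both studies lands on the same beta(46, 14) posterior as the two-step update above.

# two studies sequentially vs. all data at once
prior <- c(10, 10)

# sequential: study 1 (18 heads, 2 tails), then study 2 (18 heads, 2 tails)
afterStudy1 <- prior + c(18, 2)
afterStudy2 <- afterStudy1 + c(18, 2)

# pooled: both studies analysed together
pooled <- prior + c(36, 4)

afterStudy2  # 46 14
pooled       # 46 14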

Conclusion

One of the nicest things about Bayesian analysis is that the way our beliefs should be updated in the face of incoming data is clearly (and logically) specified. Many people’s concerns surround the prior. I hope I have shed some light on why I do not consider this to be a problem. Even if the prior isn’t something that needs to be “overcome” with lots of data, it is reassuring for the anti-Bayesian to know that, with sufficient data, the prior doesn’t really matter much.

So, stop whining about Bayesian analysis, and go collect more data. Always, more data.


  1. Occasionally (although this is happening more and more) I get a slow, wise, agreeing nod. I like those.

  2. I really wanted to avoid the term “we”, as it implies I am part of the “in-group”: those experts of Bayesian analysis who truly appreciate all of its beauty, and are able to apply it to all of their experimental data. I most certainly do not fall into this camp, but I am trying.

  3. Technically, it cannot be “wrong”, because it is your belief. If that belief is justifiable, then it’s all fair game. You may have to update your prior, though, if considerable data contradict it. But bear with me.