Bayes isn't magic

Phillip M. Alday
28 September 2017 (Nijmegen)

Stats is hard

And if it's not hard, then it's probably not worth a whole presentation. So I probably will say something infelicitous at some point.

If you catch it (either now or later), do let me know.

I'm the one-eyed guy with bad cataracts in the land of the blind. I trip a little bit less often, but I still trip.

(Sorry, SI, I lied about there being lots of pictures.)

Bayes isn't about Bayes' Theorem

Bayes' Theorem works just as well for frequentists as it does for Bayesians.

By definition: \[ P(A \cap B) = P(A)P(B|A) \]

(The probability of A and B is the probability of A times the conditional probability of B given A.)

But \[ P(A \cap B) = P(B \cap A) = P(B)P(A|B) \]

\[ P(A)P(B|A) = P(B)P(A|B) \]

and thus

\[ P(A|B) = \frac{P(A)P(B|A)}{P(B)} \]

Notice that this depends only on how we calculate “joint (= and)” probabilities.
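
A quick numerical sanity check of the identity above, as a minimal sketch. The joint probabilities are made up for illustration; any four non-negative numbers that sum to 1 would do.

```python
# Hypothetical joint distribution over two binary events A and B.
# Keys are (A happened, B happened); the values are invented and sum to 1.
joint = {
    (True, True): 0.20,
    (True, False): 0.10,
    (False, True): 0.20,
    (False, False): 0.50,
}

p_a = sum(p for (a, _), p in joint.items() if a)  # P(A)
p_b = sum(p for (_, b), p in joint.items() if b)  # P(B)
p_a_and_b = joint[(True, True)]                   # P(A and B)

p_b_given_a = p_a_and_b / p_a                     # P(B|A), by definition
p_a_given_b = p_a_and_b / p_b                     # P(A|B), by definition

# Both factorizations give back the same joint probability.
lhs = p_a * p_b_given_a
rhs = p_b * p_a_given_b

# Rearranging the equality yields P(A|B) without touching the joint directly.
p_a_given_b_via_bayes = p_a * p_b_given_a / p_b

print(lhs, rhs)
print(p_a_given_b, p_a_given_b_via_bayes)
```

Nothing here depends on what the probabilities mean, only on the product rule, which is why frequentists and Bayesians can both use it.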

The applied form of Bayes' Theorem.

If we call our data \( D \) and our hypothesis \( H \), then we have:

\[ P(H|D) = \frac{P(D|H) P(H)}{P(D)} \]

The parts are usually called:

\( P(H|D) \): posterior probability
\( P(D|H) \): likelihood
\( P(D) \): marginal likelihood (because it's what happens if you “marginalize” out / “aggregate” across all possible models)
\( P(H) \): prior probability

Sometimes this is written with \( M \) for model or \( \theta \) for collection of parameters / model settings instead of \( H \), but it's the same idea.

Bayes' Theorem in practice

\[ P(H|D) = \frac{P(D|H) P(H)}{P(D)} \]

Often we care more about the probability of our hypothesis \( H \) / our model \( M \) / some set of parameters \( \theta \) given our data \( D \) than about anything else.

But there are two catches here:

“Classical” statistics are often based on the maximum likelihood (\( P(D|H) \))
“Classical” statistics often use a frequentist notion of probability, which means we can't even assign probabilities to singular “events” like hypotheses (so our likelihood is actually more like \( P_H(D) \), but that's a level of detail for the philosophers and mathematicians to argue about)

(There's also some fine print about how we construct valid probability models, regardless of which notion of probability we use, but that's for later on.)

Likelihood

(Technical meanings applied to everyday words are one of many ways stats are hard.)

You need to remember that “likelihood” is a technical term. The likelihood of \( H \), \( Pr(O|H) \), and the posterior probability of \( H \), \( Pr(H|O) \), are different quantities and they can have different values. The likelihood of \( H \) is the probability that \( H \) confers on \( O \), not the probability that \( O \) confers on \( H \). Suppose you hear a noise coming from the attic of your house. You consider the hypothesis that there are gremlins up there bowling. The likelihood of this hypothesis is very high, since if there are gremlins bowling in the attic, there probably will be noise. But surely you don’t think that the noise makes it very probable that there are gremlins up there bowling. In this example, \( Pr(O|H) \) is high and \( Pr(H|O) \) is low. The gremlin hypothesis has a high likelihood (in the technical sense) but a low probability.

Sober, E. (2008). Evidence and Evolution: the Logic Behind the Science. Cambridge University Press.
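
To put rough numbers on the gremlin example, here is a small sketch. All three input probabilities are invented for illustration (they are not from Sober); the point is only how a high likelihood and a tiny prior combine.

```python
# All numbers below are hypothetical, chosen only to illustrate the argument.
p_noise_given_gremlins = 0.95     # likelihood Pr(O|H): bowling gremlins are noisy
p_gremlins = 1e-6                 # prior Pr(H): gremlins are implausible
p_noise_given_no_gremlins = 0.05  # attics make noise for other reasons too

# Marginal probability of the observation, summing over both hypotheses.
p_noise = (p_noise_given_gremlins * p_gremlins
           + p_noise_given_no_gremlins * (1.0 - p_gremlins))

# Posterior Pr(H|O) via Bayes' Theorem.
p_gremlins_given_noise = p_noise_given_gremlins * p_gremlins / p_noise

print("likelihood Pr(O|H):", p_noise_given_gremlins)
print("posterior  Pr(H|O):", p_gremlins_given_noise)
```

Under these assumed numbers the likelihood is close to 1 while the posterior stays far below one in a thousand.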

Maximum likelihood estimation (MLE)

find the coefficients / parameters that maximize the likelihood
i.e. find the quantitative hypothesis which has the best chance of generating the observed data
many numerical, iterative techniques for doing this (implemented in R with glm())
some classes of models have a closed-form / easily / directly computable solution (e.g. ordinary least-squares, implemented in R with lm())

Most of the methods you know are based on this technique.
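
A minimal sketch of the two routes to an MLE, using a binomial model because its closed-form answer is simply successes divided by trials. The data (7 successes in 20 trials) are hypothetical, and the ternary search is just one simple stand-in for the iterative optimizers used in practice; it is not what glm() does internally.

```python
import math


def log_likelihood(theta, k, n):
    """Binomial log-likelihood of success probability theta (constant term dropped)."""
    return k * math.log(theta) + (n - k) * math.log(1.0 - theta)


def mle_iterative(k, n, tol=1e-8):
    """Ternary search for the maximum of the log-likelihood, which is concave for 0 < k < n."""
    lo, hi = 1e-9, 1.0 - 1e-9
    while hi - lo > tol:
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if log_likelihood(m1, k, n) < log_likelihood(m2, k, n):
            lo = m1  # the maximum cannot lie to the left of m1
        else:
            hi = m2  # the maximum cannot lie to the right of m2
    return (lo + hi) / 2.0


k, n = 7, 20                     # hypothetical data: 7 successes in 20 trials
theta_closed_form = k / n        # directly computable solution
theta_iterative = mle_iterative(k, n)

print(theta_closed_form, theta_iterative)
```

Both routes land on the same estimate (up to numerical tolerance); the iterative one is what you fall back on when no formula exists.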

Combining Bayes' Theorem with MLE:

MLE makes sense both under the Bayesian and Frequentist notions of probability, although it's rather boring for Bayesians.
You can even combine MLE with a prior to compute the maximum a posteriori (MAP) estimate, which is the “usual” Bayesian application of MLE.
This second step doesn't quite make sense in the Frequentist framework ….
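
As a sketch of what "MLE plus a prior" looks like, take a binomial likelihood with a Beta(a, b) prior on the success probability. The posterior is then Beta(a + k, b + n - k), and the MAP estimate is its mode. The data and prior settings below are hypothetical.

```python
def map_estimate(k, n, a, b):
    """Mode of the Beta(a + k, b + n - k) posterior.

    Valid when a + k > 1 and b + n - k > 1, which holds for the values used here.
    """
    return (a + k - 1) / (a + b + n - 2)


k, n = 7, 20                                     # hypothetical data
mle = k / n                                      # maximizes the likelihood alone
map_flat = map_estimate(k, n, a=1, b=1)          # flat prior: MAP equals MLE
map_informative = map_estimate(k, n, a=5, b=5)   # prior centred on 0.5

print(mle, map_flat, map_informative)
```

With a flat prior the MAP estimate coincides with the MLE; with the informative prior it is pulled from the MLE toward the prior's centre at 0.5.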

How Bayes and Frequentism actually differ

HUGE source of confusion
traditional explanation:
- Frequentists define probabilities in terms of long-run occurrences, i.e. as frequencies.
- Bayesians define probabilities as states of plausibility, i.e. as beliefs.
(note that calling the Bayesian perspective “subjective” is somewhat misleading as “observed frequencies” are also somehow subjective)
better way to think of it: where you find distributions

Frequentism: Fixed values and sampling distributions

Parameters have single, true fixed values.
Uncertainty arises because we cannot directly measure these values but must instead sample from populations.
Probability is a description of the properties of repeated samples.
It does not make sense to talk about the probability of a parameter having a certain value (i.e. the probability of a hypothesis), because the value is actually fixed and not recurring.
This is why confidence intervals can't be said to “contain the true value with a certain probability”.
Instead, confidence intervals constructed over repeated samples will contain the true value with a certain frequency.

We have sampling distributions but fixed parameters. Fundamentally, our inferences are points and collections thereof (intervals), although we can sometimes “simulate” distributions (e.g. with the bootstrap).

Example: Heights from people of a certain population often have a normal-like distribution. However, we cannot know the actual population mean and standard deviation without complete measurement (generally not practical or even possible). There is nonetheless a True Population Mean and a True Population Standard Deviation as single, fixed values.
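
The "frequency over repeated samples" reading of a confidence interval can be simulated directly. This is a rough sketch: the population values (mean 170 cm, SD 10 cm), the sample size and the number of repetitions are all invented, and the SD is treated as known so that a simple normal-based interval suffices.

```python
import math
import random
from statistics import NormalDist

TRUE_MEAN, TRUE_SD = 170.0, 10.0   # hypothetical fixed population values (cm)
N, REPS = 25, 5000                 # sample size and number of repeated samples
Z = NormalDist().inv_cdf(0.975)    # roughly 1.96 for a 95% interval

rng = random.Random(2017)          # fixed seed so the sketch is reproducible
half_width = Z * TRUE_SD / math.sqrt(N)

hits = 0
for _ in range(REPS):
    sample = [rng.gauss(TRUE_MEAN, TRUE_SD) for _ in range(N)]
    sample_mean = sum(sample) / N
    # Each repetition yields a different interval; the true mean never moves.
    if sample_mean - half_width <= TRUE_MEAN <= sample_mean + half_width:
        hits += 1

coverage = hits / REPS
print("share of intervals containing the true mean:", coverage)
```

The share should come out close to 0.95. Any single interval either contains the true mean or it doesn't; the 95% describes the procedure across repetitions.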

Bayes: Distributions of parameters

Parameters themselves are distributions.
Uncertainty is a fundamental property – potentially epistemological.
Probability is a description of “knowledge”, “belief” or “credibility”.
We can talk about the probability of a given hypothesis, or better yet, the allocation of credibility and hence probability of various parameter values.
Thus credible intervals can be said to “contain the true value with a certain probability”, although this inference is conditioned on the data we observed (as all our inferences are).

We have parameter distributions, which update from an initial state (prior) conditional on our data model (likelihood) to an inferential final state (posterior). Fundamentally, our inferences are distributions.

Example: Heights from people of a certain population often have a normal-like distribution. However, we cannot know the actual population mean and standard deviation without complete measurement (generally not practical or even possible). The credibility of a given value for the True Population Mean is nonetheless something we can determine based on observation.
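
The prior-to-posterior update can be sketched for the height example with the simplest conjugate case: a normal prior on the population mean and a normal likelihood with known SD. The measurements, the prior and the assumed SD below are all hypothetical.

```python
import math
from statistics import NormalDist, mean

# Hypothetical height measurements in cm.
heights = [168.2, 174.5, 181.0, 165.3, 172.8, 177.1, 169.9, 175.4]

SIGMA = 7.0                           # data SD, assumed known to keep the sketch short
PRIOR_MEAN, PRIOR_SD = 170.0, 10.0    # hypothetical prior on the population mean

n = len(heights)
prior_precision = 1.0 / PRIOR_SD ** 2
data_precision = n / SIGMA ** 2

# Normal-normal conjugate update: precisions add, means are precision-weighted.
post_precision = prior_precision + data_precision
post_mean = (prior_precision * PRIOR_MEAN
             + data_precision * mean(heights)) / post_precision
post_sd = math.sqrt(1.0 / post_precision)

# The inference is a whole distribution; an interval is just one summary of it.
posterior = NormalDist(post_mean, post_sd)
ci_low, ci_high = posterior.inv_cdf(0.025), posterior.inv_cdf(0.975)

print("posterior mean and sd:", post_mean, post_sd)
print("95% credible interval:", ci_low, ci_high)
```

The posterior mean sits between the prior mean and the sample mean, and the posterior is narrower than the prior because the data have added information.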

Sidebar

Don't let anybody tell you in absolute terms one notion of probability is better than the other.

It is possible to construct nonsensical straw men for both types; popular examples include:

a Bayesian model of a coin toss where you never arrive at the correct answer no matter how many flips you have (related to Stone's Paradox)
a Frequentist model of non-replicable events with large amounts of prior knowledge (often involving a submarine)

Both interpretations of probability must obey the same set of mathematical rules (called the Kolmogorov axioms), based on a mathematical notion of “size”/“distance” (length, area, volume or more generally, measure).

It is also possible to do Bayesian interpretations of Frequentist methods and to examine the Frequentist properties of Bayesian methods.

And as it turns out, the combined order actually doesn't matter in terms of final inferences (Fubini's theorem), but the in-between steps might be quite different.

Still … for many of the things we do, I find Bayes particularly nice ….

Priors aren't a bad thing -- and you're already using them

As experienced scientists, we already have a notion of “plausible values”:
- a 10µV effect is probably impossible in EEG experiments with language
- a 5µV difference between frontal and parietal sites is large
- a reaction time faster than 100ms is exceedingly unlikely for anything beyond a conditioned button press
- etc.
These are our priors!
By including them in our model, we inject information that we are already using into the model in a way that helps the model.
This extra information can help with many things including collinearity, overfitting, variable selection and more …
There is a deep relationship between many types of priors and certain advanced “penalized” Frequentist models
- Laplace (double-exponential) prior on coefficients \( \Leftrightarrow \) LASSO
- Normal prior \( \Leftrightarrow \) ridge regression
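
The normal-prior / ridge correspondence can be checked numerically in the smallest possible setting: one predictor, no intercept. This is only a sketch. The data, the noise SD and the prior SD are invented; under those assumptions the ridge penalty equals the noise variance divided by the prior variance.

```python
# Hypothetical data for a one-predictor regression through the origin.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.2, 1.9, 3.2, 3.8, 5.1]

SIGMA = 1.0   # assumed residual SD of the Gaussian likelihood
TAU = 0.5     # assumed SD of the Normal(0, TAU^2) prior on the slope


def log_posterior(beta):
    """Gaussian log-likelihood plus normal log-prior, up to additive constants."""
    log_lik = -sum((y - beta * x) ** 2 for x, y in zip(xs, ys)) / (2 * SIGMA ** 2)
    log_prior = -beta ** 2 / (2 * TAU ** 2)
    return log_lik + log_prior


def argmax_concave(f, lo, hi, tol=1e-8):
    """Ternary search for the maximizer of a concave function on [lo, hi]."""
    while hi - lo > tol:
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if f(m1) < f(m2):
            lo = m1
        else:
            hi = m2
    return (lo + hi) / 2.0


sxx = sum(x * x for x in xs)
sxy = sum(x * y for x, y in zip(xs, ys))
lam = SIGMA ** 2 / TAU ** 2             # ridge penalty implied by the prior

beta_ols = sxy / sxx                    # no prior, no penalty
beta_ridge = sxy / (sxx + lam)          # closed-form ridge estimate
beta_map = argmax_concave(log_posterior, -10.0, 10.0)  # posterior mode

print(beta_ols, beta_ridge, beta_map)
```

The posterior mode and the ridge estimate agree, and both are shrunk toward zero relative to the unpenalized estimate. A tighter prior (smaller TAU) means a larger penalty.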

There's also nothing "objective" about the likelihood

The likelihood is in many ways an assumption about the shape of the data – in some sense, a prior on the data! (cf. McElreath 2015).
For historical-computational reasons, we almost always use a Gaussian likelihood, although that's a strong assumption as most bell-shaped distributions are not Gaussian (Tukey) and the mean (the estimate matching a Gaussian likelihood) is horribly sensitive to outliers (Wilcox).
We can think of the likelihood, i.e. the conditional distribution of the data, as the distribution of the errors, so if we assume an error distribution with heavier tails, i.e. with more “outliers”, then our model can deal with these without being led astray.
Changing the error distribution isn't so weird – we already do it for certain types of data, such as logistic regression for binary responses (binomial likelihood).
There is a deep relationship between many types of likelihoods and certain Frequentist models
- obvious: GLM (logistic, probit, Poisson, etc. regression)
- less obvious: \( t \)-likelihood for robust regression
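
A small sketch of why a heavier-tailed likelihood is more robust: estimate a single location parameter from data containing one wild value, once under a Gaussian likelihood and once under a \( t \)-likelihood. The data, the degrees of freedom and the scale are all invented, and the grid search is only a simple stand-in for a real optimizer (the \( t \) log-likelihood can have several local maxima, so a plain hill-climb is not safe).

```python
import math

# Hypothetical measurements: five values near 10 and one outlier.
data = [9.8, 10.1, 10.0, 9.9, 10.2, 25.0]

DF = 3.0      # assumed degrees of freedom of the t-likelihood
SCALE = 1.0   # assumed scale of the t-likelihood


def t_log_likelihood(mu):
    """Student-t log-likelihood of location mu, up to an additive constant."""
    return sum(
        -(DF + 1.0) / 2.0 * math.log(1.0 + ((y - mu) / SCALE) ** 2 / DF)
        for y in data
    )


# Gaussian likelihood: the maximum-likelihood location is the sample mean.
mu_gaussian = sum(data) / len(data)

# t-likelihood: evaluate on a fine grid spanning the data and take the best point.
step = 0.001
lo = min(data)
n_steps = int(round((max(data) - lo) / step))
grid = [lo + i * step for i in range(n_steps + 1)]
mu_t = max(grid, key=t_log_likelihood)

print("Gaussian estimate:", mu_gaussian)
print("t estimate:       ", mu_t)
```

The Gaussian estimate is dragged toward the outlier, while the \( t \) estimate stays near the bulk of the data, because the heavy tails make one extreme value unsurprising.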

So what do I (not) like about Bayes?

Yay!

Distributions instead of intervals for expressing uncertainty
Ease of piecing together models with complex properties
- robust regression through \( t \)-likelihood
- regularization with normal priors
Formal meaning of probability statements more in line with my intuitions

Nay!

Bayes factors (Frequentists have likelihood ratios)
MCMC often still slow
Even the faster (but less accurate) approximations are generally still slower than likelihood-based methods
Distributions instead of intervals

Final (abstract-ish) thing: What do those Markov chains actually do?

They simulate the posterior distribution!

But: They do this asymptotically, so you have to run your chains long enough to make sure they reach their stationary distribution. There are various tests and diagnostics for this and different methods of constructing the chains do this in fewer or more steps (and require more or less time per step).

Once your chains have converged, you have a decent simulation of your posterior distribution. Drawing more samples from a converged chain generally won't help shrink your credible intervals, because you're not acquiring more data to condition your model on; after a certain point you've simulated your posterior well enough that a longer simulation doesn't give you more information.
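
To make this concrete, here is a minimal random-walk Metropolis sampler, one of the simplest ways of constructing such a chain (not necessarily what your software uses). The target is the posterior for a binomial success probability under a flat Beta(1, 1) prior; the data, step size, burn-in and seed are all hypothetical choices.

```python
import math
import random

K, N = 7, 20       # hypothetical data: 7 successes in 20 trials
A, B = 1.0, 1.0    # flat Beta(1, 1) prior


def log_posterior(theta):
    """Unnormalized log-posterior; minus infinity outside (0, 1)."""
    if not 0.0 < theta < 1.0:
        return float("-inf")
    return (A + K - 1) * math.log(theta) + (B + N - K - 1) * math.log(1.0 - theta)


def metropolis(n_draws, burn_in=1000, step=0.1, seed=1):
    """Random-walk Metropolis; returns the draws kept after burn-in."""
    rng = random.Random(seed)
    theta = 0.5
    lp = log_posterior(theta)
    draws = []
    for i in range(burn_in + n_draws):
        proposal = theta + rng.gauss(0.0, step)
        lp_new = log_posterior(proposal)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, lp_new - lp)):
            theta, lp = proposal, lp_new
        if i >= burn_in:
            draws.append(theta)
    return draws


def interval_95(draws):
    """Empirical central 95% interval of the draws."""
    ordered = sorted(draws)
    return ordered[int(0.025 * len(ordered))], ordered[int(0.975 * len(ordered))]


short_chain = metropolis(5000)
long_chain = metropolis(50000)   # same seed: the same chain, run ten times longer

width_short = interval_95(short_chain)[1] - interval_95(short_chain)[0]
width_long = interval_95(long_chain)[1] - interval_95(long_chain)[0]
mean_long = sum(long_chain) / len(long_chain)

print("interval width, short vs long chain:", width_short, width_long)
print("posterior mean, simulated vs exact:", mean_long, (A + K) / (A + B + N))
```

The longer run pins down the posterior more precisely, but the interval width settles at a value fixed by the data and the prior rather than shrinking with more draws. A narrower interval requires more data, not more samples.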