Phillip M. Alday
28 September 2017 (Nijmegen)
And if it's not hard, then it's probably not worth a whole presentation. So I probably will say something infelicitous at some point.
If you catch it (either now or later), do let me know.
I'm the one-eyed guy with bad cataracts in the land of the blind. I trip a little bit less often, but I still trip.
(Sorry, SI, I lied about there being lots of pictures.)
Bayes' Theorem works just as well for frequentists as it does for Bayesians.
By definition: \[ P(A \cap B) = P(A)P(B|A) \]
(The probability of A and B is the probability of A times the conditional probability of B given A.)
But \[ P(A \cap B) = P(B \cap A) = P(B)P(A|B) \]
so
\[ P(A)P(B|A) = P(B)P(A|B) \]
and thus
\[ P(A|B) = \frac{P(A)P(B|A)}{P(B)} \]
Notice that this depends only on how we calculate “joint (= and)” probabilities.
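To make the algebra concrete, here is a minimal numeric sketch in R. All the numbers are hypothetical: a test with 95% sensitivity, a 10% false-positive rate, and a 1% base rate.

```r
p_a          <- 0.01   # P(A): prior probability of having the condition
p_b_given_a  <- 0.95   # P(B|A): probability of a positive test if you have it
p_b_given_na <- 0.10   # P(B|not A): probability of a positive test if you don't

# P(B) via the law of total probability
p_b <- p_b_given_a * p_a + p_b_given_na * (1 - p_a)

# Bayes' theorem: P(A|B) = P(A) * P(B|A) / P(B)
p_a_given_b <- p_a * p_b_given_a / p_b
p_a_given_b  # roughly 0.088: a positive test is far from proof
```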
If we call our data \( D \) and our hypothesis \( H \), then we have:
\[ P(H|D) = \frac{P(D|H) P(H)}{P(D)} \]
The parts are usually called: \( P(H|D) \) the posterior, \( P(D|H) \) the likelihood, \( P(H) \) the prior, and \( P(D) \) the evidence (or marginal likelihood).
Sometimes this is written with \( M \) for model or \( \theta \) for a collection of parameters / model settings instead of \( H \), but it's the same idea.
\[ P(H|D) = \frac{P(D|H) P(H)}{P(D)} \]
Often we care more about the probability of our hypothesis \( H \) / our model \( M \) / some set of parameters \( \theta \) given our data \( D \) than about anything else.
But there are two catches here:
(There's also some fine print about how we construct valid probability models, regardless of which notion of probability we use, but that's for later on.)
(Technical meanings applied to everyday words are one of many ways stats are hard.)
You need to remember that “likelihood” is a technical term. The likelihood of \( H \), \( Pr(O|H) \), and the posterior probability of \( H \), \( Pr(H|O) \), are different quantities and they can have different values. The likelihood of \( H \) is the probability that \( H \) confers on \( O \), not the probability that \( O \) confers on \( H \). Suppose you hear a noise coming from the attic of your house. You consider the hypothesis that there are gremlins up there bowling. The likelihood of this hypothesis is very high, since if there are gremlins bowling in the attic, there probably will be noise. But surely you don’t think that the noise makes it very probable that there are gremlins up there bowling. In this example, \( Pr(O|H) \) is high and \( Pr(H|O) \) is low. The gremlin hypothesis has a high likelihood (in the technical sense) but a low probability.
Sober, E. (2008). Evidence and Evolution: the Logic Behind the Science. Cambridge University Press.
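Plugging made-up numbers into Bayes' theorem shows how \( Pr(O|H) \) can be high while \( Pr(H|O) \) is low, exactly as in Sober's example (all three values below are invented for illustration):

```r
p_noise_given_gremlins <- 0.99   # likelihood Pr(O|H): bowling gremlins are noisy
p_gremlins             <- 1e-6   # prior Pr(H): gremlins are not very plausible
p_noise                <- 0.05   # Pr(O): attics make noises for many reasons

p_gremlins_given_noise <- p_noise_given_gremlins * p_gremlins / p_noise
p_gremlins_given_noise  # about 2e-5: high likelihood, low posterior probability
```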
Most of the methods you know (e.g. `lm()`, `glm()`) are based on this technique.
We have sampling distributions but fixed parameters. Fundamentally, our inferences are points and collections thereof (intervals), although we can sometimes “simulate” distributions (e.g. with the bootstrap).
Example: Heights of people from a certain population often have a normal-like distribution. However, we cannot know the actual population mean and standard deviation without measuring everyone (generally not practical or even possible). There is nonetheless a True Population Mean and a True Population Standard Deviation as single, fixed values.
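As a sketch of the frequentist workflow in R (the heights are simulated, and 170 cm / 10 cm are hypothetical population values): the inferences are a point estimate and an interval, and the bootstrap lets us “simulate” the sampling distribution:

```r
set.seed(42)
heights <- rnorm(100, mean = 170, sd = 10)  # pretend: measured heights in cm

# point estimate and 95% confidence interval for the fixed, unknown true mean
mean(heights)
t.test(heights)$conf.int

# "simulating" the sampling distribution of the mean with the bootstrap
boot_means <- replicate(1000, mean(sample(heights, replace = TRUE)))
quantile(boot_means, c(0.025, 0.975))
```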
We have parameter distributions, which update from an initial state (prior) conditional on our data model (likelihood) to an inferential final state (posterior). Fundamentally, our inferences are distributions.
Example: Heights of people from a certain population often have a normal-like distribution. However, we cannot know the actual population mean and standard deviation without measuring everyone (generally not practical or even possible). The credibility of a given value for the True Population Mean is nonetheless something we can determine based on observation.
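For contrast, a minimal Bayesian sketch of the same problem in R, using a grid approximation and treating the standard deviation as known for simplicity (the data, prior, and grid are all hypothetical): the inference is a whole distribution over candidate values of the mean:

```r
set.seed(42)
heights <- rnorm(100, mean = 170, sd = 10)  # pretend: measured heights in cm

mu_grid   <- seq(160, 180, length.out = 1000)  # candidate values for the true mean
log_prior <- dnorm(mu_grid, mean = 170, sd = 20, log = TRUE)  # weakly informative prior
log_lik   <- sapply(mu_grid,
                    function(mu) sum(dnorm(heights, mean = mu, sd = 10, log = TRUE)))

log_post <- log_prior + log_lik
post     <- exp(log_post - max(log_post))  # exponentiate safely (avoid underflow)
post     <- post / sum(post)               # normalize over the grid

# a 95% credible interval read off the posterior distribution
cdf <- cumsum(post)
c(lower = mu_grid[which.min(abs(cdf - 0.025))],
  upper = mu_grid[which.min(abs(cdf - 0.975))])
```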
Don't let anybody tell you in absolute terms that one notion of probability is better than the other.
It is possible to construct nonsensical straw men for both types; popular examples include:
Both interpretations of probability must obey the same set of mathematical rules (the Kolmogorov axioms), based on a mathematical notion of “size” (length, area, volume, or more generally, measure).
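For reference, the axioms for a probability measure \( P \) on a sample space \( \Omega \):

\[ P(E) \geq 0 \text{ for every event } E, \qquad P(\Omega) = 1, \]
\[ P\left(\bigcup_{i=1}^{\infty} E_i\right) = \sum_{i=1}^{\infty} P(E_i) \text{ for pairwise disjoint events } E_i. \]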
It is also possible to do Bayesian interpretations of Frequentist methods and to examine the Frequentist properties of Bayesian methods.
And as it turns out, the order of integration doesn't actually matter for the final inferences (Fubini's theorem), but the intermediate steps might be quite different.
Still … for many of the things we do, I find Bayes particularly nice ….
Yay!
Nay!
They simulate the posterior distribution!
But: they do this only asymptotically, so you have to run your chains long enough to make sure they reach their stationary distribution. There are various tests and diagnostics for this, and different methods of constructing the chains get there in fewer or more steps (and require more or less time per step).
Once your chains have converged, you have a decent simulation of your posterior distribution. Drawing more samples from a converged chain generally won't shrink your credible intervals: you're not acquiring more data to condition your inferences on, and past a certain point the simulation already captures the posterior well enough that running it longer adds no new information.
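To make the chains concrete, here is a deliberately minimal random-walk Metropolis sketch in R for the height example above (data, prior, starting value, and proposal width are all hypothetical; a real analysis would use an established sampler such as Stan or JAGS):

```r
set.seed(42)
heights <- rnorm(100, mean = 170, sd = 10)  # pretend: measured heights in cm

# log posterior: normal likelihood (sd treated as known) + normal prior on the mean
log_post <- function(mu) {
  sum(dnorm(heights, mean = mu, sd = 10, log = TRUE)) +
    dnorm(mu, mean = 170, sd = 20, log = TRUE)
}

n_steps  <- 5000
chain    <- numeric(n_steps)
chain[1] <- 150  # deliberately poor starting value

for (i in 2:n_steps) {
  proposal <- chain[i - 1] + rnorm(1, sd = 1)  # random-walk proposal
  # accept with probability min(1, posterior ratio); computed on the log scale
  if (log(runif(1)) < log_post(proposal) - log_post(chain[i - 1])) {
    chain[i] <- proposal
  } else {
    chain[i] <- chain[i - 1]  # reject: stay put
  }
}

# discard the warm-up steps taken before the chain reached its stationary distribution
posterior_draws <- chain[-(1:1000)]
quantile(posterior_draws, c(0.025, 0.975))  # credible interval from the simulation
```

In practice you would run several chains from different starting values and check convergence with diagnostics such as the Gelman–Rubin statistic (e.g. `gelman.diag()` in the `coda` package).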