Comparison: Frequentist and Bayesian Inference

Author: D. McCabe

18.05 Class 20, Comparison of Frequentist and Bayesian Inference, Spring 2022

Critiques and defenses

Critique of Bayesian inference

  1. The main critique of Bayesian inference is that a subjective prior is, well, subjective. There is no single method for choosing a prior, so different people will produce different priors and may therefore arrive at different posteriors and conclusions.
  2. Furthermore, there are philosophical objections to assigning probabilities to hypotheses, as hypotheses do not constitute outcomes of repeatable experiments in which one can measure long-term frequency. Rather, a hypothesis is either true or false, regardless of whether one knows which is the case. A coin is either fair or unfair; treatment 1 is either better or worse than treatment 2; the sun will or will not come up tomorrow.

Defense of Bayesian inference

  1. The probability of hypotheses is exactly what we need to make decisions. When the doctor tells me a screening test came back positive, I want to know the probability that I’m sick. That is, I want to know the probability of the hypothesis “I’m sick”.
  2. Using Bayes’ theorem is logically rigorous. Once we have a prior, all our calculations have the certainty of deductive logic.
  3. By trying different priors we can see how sensitive our results are to the choice of prior (a short sketch follows this list).
  4. It is easy to communicate a result framed in terms of probabilities of hypotheses.
  5. Even though the prior may be subjective, one can specify the assumptions used to arrive at it, which allows other people to challenge it or try other priors.
  6. The evidence derived from the data is independent of notions about ‘data more extreme’ that depend on the exact experimental setup (see “The Likelihood Principle” in the appendix below).
  7. Data can be used as it comes in. There is no requirement that every contingency be planned for ahead of time.
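
Here is the kind of sensitivity check point 3 has in mind: a minimal sketch in Python, assuming the standard Beta-Binomial conjugate update and illustrative data of 7 heads in 10 tosses (the priors and data are invented for illustration).

```python
# Prior sensitivity via the Beta-Binomial conjugate update:
# with data (heads, tails), a Beta(a, b) prior updates to Beta(a + heads, b + tails).
heads, tails = 7, 3
priors = {
    "flat Beta(1,1)":     (1, 1),
    "weak Beta(2,2)":     (2, 2),
    "strong Beta(10,10)": (10, 10),
}
for name, (a, b) in priors.items():
    a_post, b_post = a + heads, b + tails
    mean = a_post / (a_post + b_post)      # posterior mean of theta
    print(f"{name}: posterior mean = {mean:.3f}")
# Output: 0.667, 0.643, 0.567 -- the stronger the prior pulls toward 0.5,
# the less the same 10 tosses move the estimate.
```

If the conclusion barely moves across reasonable priors, the subjectivity of the prior matters little; if it swings wildly, that is worth knowing too.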

Critique of frequentist inference

  1. It is ad hoc and does not carry the force of deductive logic. Notions like ‘data more extreme’ are not well defined. The p-value depends on the exact experimental setup (see “The Likelihood Principle” in the appendix below).
  2. Experiments must be fully specified ahead of time. This can lead to paradoxical-seeming results. See the “Voltmeter Story” in the appendix and in: https://en.wikipedia.org/wiki/Likelihood_principle
  3. The p-value and significance level are notoriously prone to misinterpretation. Careful statisticians know that a significance level of 0.05 means the probability of a type I error is 5%: if the null hypothesis is true, then 5% of the time it will be rejected due to randomness. Many (perhaps most) other people erroneously think a p-value of 0.05 means that the probability of the null hypothesis is 5%. Strictly speaking this is a critique of popular ignorance rather than of frequentist inference itself, but the subtlety of the ideas certainly contributes to the problem. (The simulation sketch after this list illustrates the correct reading.)
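
The correct reading of the significance level in point 3 is easy to check by simulation. A minimal sketch in Python with numpy (a one-sample z-test with known \(\sigma\); all numbers are illustrative):

```python
import numpy as np

# Simulate many experiments in which H0 is true (mean 0, known sigma = 1),
# and count how often a two-sided z-test at alpha = 0.05 rejects H0.
rng = np.random.default_rng(0)
n_experiments, n_samples = 100_000, 30
data = rng.normal(0.0, 1.0, size=(n_experiments, n_samples))
z = data.mean(axis=1) * np.sqrt(n_samples)   # z-statistic of each experiment
reject = np.abs(z) > 1.96                    # reject iff |z| > 1.96, i.e. p < 0.05
print(reject.mean())                         # ~0.05: type I errors under a true H0
```

The rejection rate hovers near 5%, which is the statement the significance level actually makes; it says nothing about the probability that the null hypothesis is true.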

Defense of frequentist inference

  1. It is objective: all statisticians will agree on the p-value. Any individual can then decide if the p-value warrants rejecting the null hypothesis.
  2. Frequentist significance testing is used throughout the statistical analysis of scientific investigations to evaluate the strength of the evidence against a null hypothesis. The interpretation of the results is left to the user of the tests, and different users may apply different significance levels for determining statistical significance. Frequentist statistics does not pretend to provide a way to choose the significance level; rather, it explicitly describes the trade-off between type I and type II errors.
  3. Frequentist experimental design demands a careful description of the experiment and methods of analysis before starting. This helps control for experimenter bias.
  4. The frequentist approach has been used for over 100 years and we have seen tremendous scientific progress. Although frequentists themselves would not put a probability on the belief that frequentist methods are valuable, shouldn’t this history give the Bayesian a strong prior belief in their utility?

Appendix

Deductive versus inductive reasoning

Deductive logic starts with general premises and guarantees conclusions if those premises are true. For example, “All swans are white; this is a swan; therefore, it is white.” The reasoning is certain and truth-preserving. Inductive reasoning, by contrast, draws generalizations from specific observations. Seeing many white swans might lead you to conclude that all swans are white—but this conclusion could be overturned by a single black swan. Induction is probabilistic and prone to revision.

In Bayesian probability theory, the process of updating beliefs is deductive. Once you accept a prior belief and observe new evidence, Bayes’ theorem provides a logically necessary way to update your belief—the conclusion (posterior) follows deductively from the premises (prior and likelihood). So while priors may be formed inductively, the mechanism of inference in Bayesian reasoning is a form of deductive logic under uncertainty.
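
A small worked example makes the update mechanical (the numbers here are invented for illustration). Let \(H\) be “the coin is fair” (\(\theta = 0.5\)) and \(H^c\) be “the coin is biased with \(\theta = 0.7\)”, each with prior probability 0.5, and let the data \(D\) be 3 heads in 3 tosses. Bayes’ theorem then forces the posterior:

\[
P(H \mid D) = \frac{P(D \mid H)\,P(H)}{P(D \mid H)\,P(H) + P(D \mid H^c)\,P(H^c)}
            = \frac{(0.5)^3(0.5)}{(0.5)^3(0.5) + (0.7)^3(0.5)} \approx 0.27.
\]

Anyone who accepts this prior and these likelihoods must accept this posterior; the judgment calls all live in the premises.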

The Likelihood Principle

The Likelihood Principle states that all the information about a parameter contained in the data is summarised by the likelihood function. Once you have observed the data, inferences should depend only on the likelihood, not on how the data could have turned out (e.g. sampling plans or unobserved outcomes). Implication: If two experiments yield proportional likelihoods for a parameter, they should lead to the same inference, even if the experiments had different designs or stopping rules. Frequentist methods (like p-values or confidence intervals) violate the likelihood principle because they depend on the sampling distribution—i.e., data that could have occurred but didn’t.


Example: Estimating the probability \(\theta\) of heads for a coin

Consider two scenarios:

  1. Fixed number of tosses: toss the coin 10 times; you get 7 heads.
  2. Fixed number of heads: keep tossing until you get 7 heads; it takes 10 tosses.

Both scenarios produce the same outcome:

  • observed data: 7 heads in 10 tosses.

In both cases, the likelihood function (up to proportionality) is \(\mathcal{L}(\theta)\propto\theta^7(1-\theta)^3\). According to the Likelihood Principle, your inference about \(\theta\) should therefore be the same, because the likelihood is the same. A frequentist analysis, however, treats the two scenarios differently:

  1. In the fixed-tosses case, you use the binomial distribution.
  2. In the fixed-heads case, you use the negative binomial distribution.

This leads to different p-values and confidence intervals even though the observed data and likelihood are the same!
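
This can be checked numerically. Below is a minimal sketch in Python (scipy assumed available): it verifies that the two likelihoods are proportional, then computes exact two-sided p-values for \(H_0: \theta = 0.5\) under the common convention of summing the probabilities of all outcomes no more probable than the one observed. (With these numbers the one-sided p-values in the heads direction happen to coincide; the two-sided ones do not.)

```python
from scipy.stats import binom, nbinom

heads, tails, n = 7, 3, 10

# The likelihoods differ only by a constant factor (here 120/84 = 10/7).
for theta in (0.3, 0.5, 0.7):
    ratio = binom.pmf(heads, n, theta) / nbinom.pmf(tails, heads, theta)
    print(f"theta = {theta}: likelihood ratio = {ratio:.4f}")

# Two-sided exact p-values: sum P(outcome) over outcomes no more probable
# than the observed one.
obs = binom.pmf(heads, n, 0.5)
p_binom = sum(binom.pmf(k, n, 0.5) for k in range(n + 1)
              if binom.pmf(k, n, 0.5) <= obs + 1e-12)

# scipy's nbinom counts failures (tails) before the 7th success (head);
# the support is infinite, so truncate where the mass is negligible.
obs_nb = nbinom.pmf(tails, heads, 0.5)
p_nbinom = sum(nbinom.pmf(y, heads, 0.5) for y in range(1000)
               if nbinom.pmf(y, heads, 0.5) <= obs_nb + 1e-12)

print(f"binomial two-sided p-value:          {p_binom:.3f}")   # ~0.344
print(f"negative-binomial two-sided p-value: {p_nbinom:.3f}")  # ~0.476
```

Same data, same likelihood, different p-values: the difference comes entirely from which unobserved outcomes count as “more extreme” under each design.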


Extreme Data: The Infamous Sally Clark Case

It’s not always clear what counts as “more extreme”, especially for complex or multi-dimensional data. The definition can change depending on arbitrary or technical choices (e.g. critical value, test direction, or tail choice).

An expert witness claimed the probability of two cases of Sudden Infant Death Syndrome (SIDS) in one family was about 1 in 73 million, treating the two deaths as independent events. Based on that, the deaths seemed “too extreme” to be natural, so guilt was inferred. Three things went wrong (the arithmetic is sketched below):

  • Incorrect assumptions: the expert assumed the two deaths were independent, which is wrong, since genetic, environmental, or maternal health factors could link them.
  • Extremeness misinterpreted: the “1 in 73 million” figure made the outcome look impossibly rare, but that rarity depends entirely on the model chosen.
  • Prosecutor’s fallacy: the court interpreted the low probability of this outcome under innocence as strong evidence of guilt, ignoring the base rate of wrongful convictions and the possibility of shared natural causes.
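
To see the numbers, here is a minimal sketch in Python. The 1-in-8543 figure is the per-family SIDS probability quoted at trial; the dependence factor and the double-murder base rate below are hypothetical placeholders for illustration, not figures from the case.

```python
# The trial figure: squaring 1/8543 under the (incorrect) independence assumption.
p_sids = 1 / 8543
p_two_indep = p_sids ** 2
print(f"1 in {1 / p_two_indep:,.0f}")        # 1 in 72,982,849 -- "1 in 73 million"

# Prosecutor's fallacy: P(evidence | innocence) is not P(innocence | evidence).
# Compare the two rare explanations via their ratio (posterior odds under a
# flat prior). Both numbers below are hypothetical placeholders.
p_second_given_first = 1 / 100               # second SIDS death given a first:
                                             # far more likely if causes are shared
p_two_sids = p_sids * p_second_given_first
p_two_murders = 1 / 1_000_000                # base rate of double infant homicide
print(f"natural causes : murder = {p_two_sids / p_two_murders:.1f} : 1")
```

Even with rough placeholder numbers, the two rare explanations come out comparable, so the 1-in-73-million figure by itself says almost nothing about guilt.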

Voltmeter Story

This is a classic joke about the likelihood principle.

An engineer draws a random sample of electron tubes and measures their voltages. The measurements range from 75 to 99 Volts. A statistician computes the sample mean and a confidence interval for the true mean. Later the statistician discovers that the voltmeter reads only as far as 100 Volts, so technically, the population appears to be “censored”. If the statistician is orthodox this necessitates a new analysis. However, the engineer says he has another meter reading to 1000 Volts, which he would have used if any voltage had been over 100. This is a relief to the statistician, because it means the population was effectively uncensored after all. But later, the statistician discovers that the second meter had not been working when the measurements were taken. The engineer informs the statistician that he would not have held up the original measurements until the second meter was fixed, and the statistician informs him that new measurements are required. The engineer is astounded. “Next you’ll be asking about my oscilloscope!”

So the statistician is concerned with what could have happened rather than what did happen. The punchline is that the data were never actually censored: no reading exceeded 100 Volts, so nothing that could have censored the data actually did. By the likelihood principle, the inference should depend only on the measurements obtained, not on which meters happened to be working.