In traditional (Frequentist) statistics, we treat parameters (like a mean \(\mu\)) as fixed, unknown constants. In the Bayesian Framework, we treat everything unknown as a random variable.
| Feature | Frequentist Approach | Bayesian Approach |
|---|---|---|
| Definition of Probability | Long-run frequency of repeatable events | Degree of belief or certainty in a proposition |
| Parameters (θ) | Fixed constants; we just don’t know them | Random variables; they have a distribution |
| Data (y) | Random; one of many possible outcomes | Fixed; it is the only evidence we actually have |
| Result | A point estimate and a p-value | A full probability distribution (the posterior) |
The Paradigm Shift: In traditional stats, you ask: “If the truth were X, how likely is my data?” In Bayesian stats, you flip it and ask: “Given the data I just saw, what is the most likely truth?”
Note: In the frequentist (traditional) paradigm, the parameter \(\theta\) is fixed while the data are random: running the experiment again would give us different data. In the Bayesian paradigm, the parameter \(\theta\) is random, and the data we collected are the only objective evidence we have.
Probability as a State of Knowledge
In the frequentist paradigm, probability is defined by long-run frequency. If you flip a coin, “50% heads” means that if you flipped it a million times, about half would land heads.
In the Bayesian Paradigm, probability is a measure of certainty.
It allows us to assign probabilities to one-time events.
Example: “There is a 70% chance of rain tomorrow.”
Frequentist: Struggles with this because “tomorrow” only happens once. You can’t flip “tomorrow” 1,000 times.
Bayesian: This makes perfect sense. It represents our current uncertainty based on satellite data (Likelihood) and historical patterns (Prior).
The “Flow” of Information
In the traditional paradigm, each study is an island. You start from zero, calculate a p-value, and reach a conclusion.
In the Bayesian Paradigm, science is a cumulative process.
You start with what you know (Prior).
You look at new evidence (Likelihood).
You update your knowledge (Posterior).
The kicker: Tomorrow, that Posterior becomes your new Prior.
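As a small sketch of this cycle (the counts below are hypothetical, chosen only for illustration), a Beta-Binomial model makes the “posterior becomes tomorrow’s prior” step explicit:

```r
# Sketch: yesterday's posterior is today's prior (Beta-Binomial conjugacy).
# All counts are made up for illustration.

# Day 1: start from a flat Beta(1, 1) prior, observe 7 successes in 10 trials
a0 <- 1; b0 <- 1
post_a <- a0 + 7
post_b <- b0 + (10 - 7)

# Day 2: yesterday's posterior Beta(8, 4) is today's prior;
# observe 12 successes in 20 new trials
post_a2 <- post_a + 12
post_b2 <- post_b + (20 - 12)

c(day1_mean = post_a / (post_a + post_b),
  day2_mean = post_a2 / (post_a2 + post_b2))
```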
Why this Paradigm is powerful
This framework is more intuitive for human decision-making.
It handles small data better: If you only have 3 data points, a Frequentist model might break or give a “non-significant” result. A Bayesian model says, “Well, based on these 3 points and what we knew before, here is our best (though still uncertain) guess.”
It gives direct answers: Instead of the often-misinterpreted confidence interval, it gives you a Credible Interval. You can actually say, “There is a 95% probability the drug works,” which is what most people actually want to know.
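A minimal sketch of both points in R, with invented data: even three observations produce a full posterior and a direct 95% credible interval.

```r
# Sketch: a credible interval from only 3 data points (hypothetical values).
# Suppose 2 "successes" in 3 trials, combined with a flat Beta(1, 1) prior.
post_a <- 1 + 2
post_b <- 1 + 1

# 95% equal-tailed credible interval for the success probability:
# "There is a 95% probability the parameter lies in this range."
qbeta(c(0.025, 0.975), post_a, post_b)

# Posterior probability that the success rate exceeds 0.5
1 - pbeta(0.5, post_a, post_b)
```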
Illustration: Testing a New Heart Medicine
Imagine you are a researcher testing a new drug. You give it to 100 people and observe how much their blood pressure drops.
The Frequentist Answer (The Traditional Paradigm)
The Frequentist starts by assuming the drug does nothing (this is the “Null Hypothesis”).
The Question: “If this drug actually does nothing, how likely am I to see a result as extreme as the one I just got?”
The Tool: The p-value.
The Result: “The p-value is 0.03. Since this is less than 0.05, I reject the idea that the drug does nothing. I conclude the drug works.”
The Flaw: This doesn’t actually tell you the probability that the drug works; it only tells you how “weird” your data would be if the drug were useless.
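A minimal sketch of this logic in R, using simulated blood-pressure drops rather than real trial data:

```r
# Sketch: the frequentist calculation with simulated data (not real trial results).
set.seed(1)
bp_drop <- rnorm(100, mean = 4, sd = 10)  # hypothetical drops in blood pressure

# One-sample t-test of the null hypothesis "the drug does nothing" (mean drop = 0)
t.test(bp_drop, mu = 0, alternative = "greater")$p.value
```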
The Bayesian Answer (The New Paradigm)
The Bayesian starts with a Prior (perhaps based on previous chemical studies or similar drugs).
The Question: “Given the data I just collected and what I already knew, what is the probability that this drug reduces blood pressure by at least 5 points?”
The Tool: The Posterior Distribution.
The Result: “There is a 94% probability that the drug reduces blood pressure by 5–10 points.”
The Benefit: This is a direct answer to the doctor’s question. It provides a range of likely effects and the certainty of those effects.
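If the posterior for the mean reduction were approximately Normal, such probability statements are one-line computations. The posterior mean and standard deviation below are assumed purely for illustration:

```r
# Sketch: probability statements from an assumed Normal posterior for the
# mean blood-pressure reduction (posterior mean 7 and sd 1.5 are made up).
post_mean <- 7
post_sd   <- 1.5

# P(reduction is between 5 and 10 points)
pnorm(10, post_mean, post_sd) - pnorm(5, post_mean, post_sd)

# P(reduction is at least 5 points)
1 - pnorm(5, post_mean, post_sd)
```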
Bayes’ Rule is the mathematical bridge that allows us to update our initial beliefs (the Prior) with new evidence (the Likelihood) to arrive at an updated belief (the Posterior).
The Mathematical Formula
For a parameter \(\theta\) and observed data \(y\):
\[P(\theta | y) = \frac{P(y | \theta) P(\theta)}{P(y)}\]
\(P(\theta | y)\) (Posterior): Our belief about \(\theta\) after seeing the data.
\(P(y | \theta)\) (Likelihood): How likely the data is, given a specific \(\theta\).
\(P(\theta)\) (Prior): Our belief about \(\theta\) before seeing the data.
\(P(y)\) (Evidence): The total probability of the data, acting as a normalizing constant.
The relationship among these concepts is illustrated in the graph below:
The blue curve is the prior distribution. It represents what we believed about the parameter \(\theta\) before seeing any data. Notice that it is wide and flat, reflecting high uncertainty.
The green curve is the likelihood. It summarizes what the data are telling us. Because the data are informative, the likelihood is sharply peaked around the observed value.
The red curve is the posterior distribution. This is the result of applying Bayes’ Rule.
Notice that the posterior lies between the prior and the likelihood. It is a compromise between what we believed before and what the data suggest.
This red curve is the main output of Bayesian inference—it represents our updated belief about \(\theta\) after observing the data.
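For reference, the code below uses the standard Normal-Normal conjugate update: with a prior \(\theta \sim N(\mu_0, \sigma_0^2)\) and an observed data summary \(y\) with known standard deviation \(\sigma\), the posterior is Normal with

\[
\sigma_{\text{post}}^2 = \left(\frac{1}{\sigma_0^2} + \frac{1}{\sigma^2}\right)^{-1},
\qquad
\mu_{\text{post}} = \sigma_{\text{post}}^2 \left(\frac{\mu_0}{\sigma_0^2} + \frac{y}{\sigma^2}\right).
\]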
# ----------------------------
# Bayes' Rule: Prior, Likelihood, Posterior
# ----------------------------
# Grid of theta values
theta <- seq(-5, 5, length.out = 1000)
# PRIOR: vague belief (wide uncertainty)
prior_mean <- 0
prior_sd <- 2
prior <- dnorm(theta, mean = prior_mean, sd = prior_sd)
# LIKELIHOOD: data-driven (sharp peak)
data_mean <- 1.5
lik_sd <- 0.5
likelihood <- dnorm(theta, mean = data_mean, sd = lik_sd)
# POSTERIOR: Normal-Normal conjugacy
post_var <- 1 / (1/prior_sd^2 + 1/lik_sd^2)
post_mean <- post_var * (prior_mean/prior_sd^2 + data_mean/lik_sd^2)
post_sd <- sqrt(post_var)
posterior <- dnorm(theta, mean = post_mean, sd = post_sd)
# Plot
plot(theta, prior, type = "l", lwd = 2, col = "blue",
ylim = c(0, max(prior, likelihood, posterior)),
xlab = expression(theta),
ylab = "Density",
main = "Bayesian Updating: Prior, Likelihood, and Posterior")
lines(theta, likelihood, lwd = 2, col = "darkgreen")
lines(theta, posterior, lwd = 2, col = "red")
legend("topleft",
legend = c("Prior (Belief Before Data)",
"Likelihood (Data Evidence)",
"Posterior (Updated Belief)"),
col = c("blue", "darkgreen", "red"),
lwd = 2)

One of the biggest debates in statistics is how we choose our Prior.
Subjective Probability (The Expert)
This view argues that probability represents a person’s personal degree of belief.
Strength: It allows us to incorporate expert knowledge, previous studies, or scientific intuition into our model.
Example: A doctor’s prior belief that a patient has a rare disease based on 20 years of clinical experience.
Objective Probability (The Skeptic)
This view seeks to minimize the influence of the researcher’s personal opinions.
Uninformative Priors: We use “flat” priors that give equal weight to all possible values so the data “speaks for itself.”
Jeffreys’ Prior: A mathematical construction that produces a prior whose implied inferences stay the same even if we change the scale (parameterization) of our measurements.
Illustration Concept
Scenario: Predicting the probability of finding oil in different locations (A, B, C, D).
Subjective Prior (Expert Knowledge)
The geologist thinks Location B is most promising based on experience.
Prior probabilities reflect this belief: Location B receives the largest share, with the remaining probability spread over A, C, and D.
Objective Prior (Flat / Non-informative)
No prior knowledge; all locations are equally likely.
Prior probabilities: each location gets \(P = 0.25\).
Note: A flat prior is the simplest approach—it assumes all possibilities are equally likely. But flat priors aren’t always appropriate, especially when the parameter isn’t naturally bounded or when transformations matter. That’s where the Jeffreys prior comes in; it will be discussed in detail in the succeeding lectures.
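A small sketch in R contrasting the two kinds of prior. All of the probabilities and the survey “likelihood” values below are hypothetical, chosen only to show how Bayes’ rule combines each prior with the same evidence:

```r
# Sketch: subjective vs. flat priors over four locations (all numbers hypothetical).
locations <- c("A", "B", "C", "D")

prior_subjective <- c(0.10, 0.60, 0.20, 0.10)  # geologist favours Location B
prior_flat       <- rep(0.25, 4)               # no prior preference

# Hypothetical survey evidence: likelihood of the survey reading at each location
likelihood <- c(0.30, 0.50, 0.70, 0.20)

posterior <- function(prior, lik) {
  unnorm <- prior * lik
  unnorm / sum(unnorm)   # Bayes' rule: normalize prior x likelihood
}

res <- rbind(subjective = posterior(prior_subjective, likelihood),
             flat       = posterior(prior_flat, likelihood))
colnames(res) <- locations
res
```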
To illustrate how the framework works, let’s walk through a classic example: a disease affects 1 in 1,000 people (\(P(\text{Sick}) = 0.001\)). A test detects the disease 99% of the time when it is present (\(P(+|\text{Sick}) = 0.99\)), but it also returns a false positive 5% of the time (\(P(+|\text{Healthy}) = 0.05\)). If you test positive, what is the probability that you are actually sick?
Solution
In Bayes theorem,
\[P(\text{Sick} | +) = \frac{P(+|\text{Sick}) \times P(\text{Sick})}{P(+)}\] Now, compute the marginal probability of positive test: \[P(+)= P(+|\text{Sick}) \times P(\text{Sick})+P(+|\text{Healthy}) \times P(\text{Healthy})\] where:
\(P(\text{Sick})=0.001\)
\(P(\text{Healthy})=1-P(\text{Sick})=0.999\)
\(P(+|\text{Sick})=0.99\)
\(P(+|\text{Healthy})=0.05\)
Hence, \[P(+)= (0.99)(0.001)+(0.05)(0.999)=0.05094.\] Therefore,
\[P(\text{Sick} | +) = \frac{P(+|\text{Sick}) \times P(\text{Sick})}{P(+)}=\frac{(0.99)(0.001)}{0.05094}\approx 0.0194 = 1.94\%.\] Even though the test is 99% accurate, the chance you are actually sick given a positive result is only about 1.94%.
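The same arithmetic can be checked in a few lines of R:

```r
# Sketch: checking the disease-testing arithmetic from the example above
p_sick        <- 0.001          # prevalence: P(Sick)
p_healthy     <- 1 - p_sick     # P(Healthy)
p_pos_sick    <- 0.99           # sensitivity: P(+ | Sick)
p_pos_healthy <- 0.05           # false-positive rate: P(+ | Healthy)

# Marginal probability of a positive test
p_pos <- p_pos_sick * p_sick + p_pos_healthy * p_healthy   # 0.05094

# Posterior probability of being sick given a positive test
p_pos_sick * p_sick / p_pos                                # ~0.0194
```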