This textbook-style R Markdown note introduces Bayesian statistics from the beginning. It is written for students who have basic knowledge of descriptive statistics and probability but are new to Bayesian thinking.
The aim is to teach Bayesian statistics using three connected layers:
This note is suitable for first-year master’s students in statistics, data science, epidemiology, business analytics, and applied research methods.
After completing this note, students should be able to:
Statistics is about learning from data. In many real-world problems, we do not know the true value of a parameter. For example:
In classical or frequentist statistics, unknown parameters are usually treated as fixed but unknown quantities. In Bayesian statistics, unknown parameters are treated as uncertain quantities, and this uncertainty is described using probability distributions.
Let \(\theta\) be an unknown parameter.
In the frequentist approach:
\[ \theta = \text{fixed but unknown} \]
In the Bayesian approach:
\[ \theta \sim p(\theta) \]
This means \(\theta\) has a probability distribution that represents our uncertainty.
Suppose you want to know whether it will rain today.
Before looking outside, you may think:
\[ P(\text{Rain}) = 0.20 \]
Then you see dark clouds. Your belief changes:
\[ P(\text{Rain} \mid \text{Dark clouds}) > 0.20 \]
This is Bayesian thinking: update your belief when new evidence arrives.
Bayesian learning can be summarized as:
\[ \text{Posterior belief} = \text{Prior belief} + \text{Information from data} \]
More formally:
\[ \text{Posterior} \propto \text{Likelihood} \times \text{Prior} \]
Probability measures uncertainty. If \(A\) is an event, then:
\[ 0 \leq P(A) \leq 1 \]
where:
The sample space is the set of all possible outcomes.
Example: rolling a fair die.
\[ S = \{1,2,3,4,5,6\} \]
The probability of rolling a six is:
\[ P(6)=\frac{1}{6} \]
If \(A^c\) is the complement of event \(A\), then:
\[ P(A^c)=1-P(A) \]
Example:
If \(P(\text{Rain})=0.30\), then:
\[ P(\text{No rain})=1-0.30=0.70 \]
For two events \(A\) and \(B\):
\[ P(A \cup B)=P(A)+P(B)-P(A\cap B) \]
If \(A\) and \(B\) are mutually exclusive:
\[ P(A \cup B)=P(A)+P(B) \]
Conditional probability is the probability of one event given another event has occurred.
\[ P(A \mid B)=\frac{P(A\cap B)}{P(B)}, \quad P(B)>0 \]
Example:
\[ P(\text{Pass} \mid \text{Attend class}) \]
means probability of passing given the student attended class.
We create a small dataset of students.
students <- data.frame(
Student_ID = 1:20,
Attendance = c("High", "High", "Low", "High", "Low", "High", "High", "Low", "High", "Low",
"High", "Low", "High", "High", "Low", "High", "Low", "High", "High", "Low"),
Result = c("Pass", "Pass", "Fail", "Pass", "Fail", "Pass", "Pass", "Fail", "Pass", "Fail",
"Pass", "Fail", "Pass", "Pass", "Fail", "Pass", "Fail", "Pass", "Pass", "Fail")
)
kable(students)
| Student_ID | Attendance | Result |
|---|---|---|
| 1 | High | Pass |
| 2 | High | Pass |
| 3 | Low | Fail |
| 4 | High | Pass |
| 5 | Low | Fail |
| 6 | High | Pass |
| 7 | High | Pass |
| 8 | Low | Fail |
| 9 | High | Pass |
| 10 | Low | Fail |
| 11 | High | Pass |
| 12 | Low | Fail |
| 13 | High | Pass |
| 14 | High | Pass |
| 15 | Low | Fail |
| 16 | High | Pass |
| 17 | Low | Fail |
| 18 | High | Pass |
| 19 | High | Pass |
| 20 | Low | Fail |
Now calculate:
\[ P(\text{Pass} \mid \text{High Attendance}) \]
high_attendance <- students %>% filter(Attendance == "High")
prob_pass_given_high <- mean(high_attendance$Result == "Pass")
prob_pass_given_high
## [1] 1
Interpretation: Among students with high attendance, the estimated probability of passing is 1.
Bayes’ theorem is the foundation of Bayesian statistics.
\[ P(A \mid B)=\frac{P(B \mid A)P(A)}{P(B)} \]
where:
In statistics, we usually write Bayes’ theorem as:
\[ p(\theta \mid y)=\frac{p(y \mid \theta)p(\theta)}{p(y)} \]
where:
Since \(p(y)\) is often a normalizing constant, we commonly write:
\[ p(\theta \mid y) \propto p(y \mid \theta)p(\theta) \]
Suppose a disease is rare.
We want:
\[ P(D \mid +) \]
Using Bayes’ theorem:
\[ P(D \mid +)=\frac{P(+ \mid D)P(D)}{P(+ \mid D)P(D)+P(+ \mid D^c)P(D^c)} \]
p_disease <- 0.01
sensitivity <- 0.95
false_positive <- 0.05
p_no_disease <- 1 - p_disease
posterior_disease <- (sensitivity * p_disease) /
(sensitivity * p_disease + false_positive * p_no_disease)
posterior_disease
## [1] 0.1610169
The probability of actually having the disease after a positive test is approximately:
round(posterior_disease * 100, 2)
## [1] 16.1
percent.
This is much lower than 95% because the disease is rare.
set.seed(123)
n <- 100000
disease <- rbinom(n, size = 1, prob = p_disease)
test_positive <- ifelse(
disease == 1,
rbinom(n, size = 1, prob = sensitivity),
rbinom(n, size = 1, prob = false_positive)
)
medical_df <- data.frame(
disease = factor(disease, labels = c("No disease", "Disease")),
test = factor(test_positive, labels = c("Negative", "Positive"))
)
posterior_simulated <- mean(medical_df$disease == "Disease" & medical_df$test == "Positive") /
mean(medical_df$test == "Positive")
posterior_simulated
## [1] 0.1619473
ggplot(medical_df, aes(x = test, fill = disease)) +
geom_bar(position = "fill") +
labs(
title = "Medical Test Result by Disease Status",
x = "Test Result",
y = "Proportion",
fill = "True Status"
) +
scale_y_continuous(labels = percent_format()) +
theme_minimal()
A prior distribution represents what we believe about a parameter before observing the data.
If \(\theta\) is a parameter, the prior is:
\[ p(\theta) \]
Examples:
\[ \theta \sim Uniform(0,1) \]
\[ \theta \sim Beta(\alpha,\beta) \]
\[ \theta \sim Normal(\mu,\sigma^2) \]
If we know very little about a probability parameter \(p\), we may use:
\[ p \sim Uniform(0,1) \]
This gives equal prior weight to all values between 0 and 1.
p_grid <- seq(0, 1, length.out = 1000)
uniform_df <- data.frame(p = p_grid, density = dunif(p_grid, 0, 1))
ggplot(uniform_df, aes(x = p, y = density)) +
geom_line(linewidth = 1) +
labs(title = "Uniform Prior Distribution", x = "p", y = "Density") +
theme_minimal()
The Beta distribution is very important in Bayesian statistics because it is used as a prior for probabilities.
\[ p \sim Beta(\alpha,\beta) \]
The density is:
\[ f(p)=\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}p^{\alpha-1}(1-p)^{\beta-1}, \quad 0<p<1 \]
where \(\alpha>0\) and \(\beta>0\).
beta_df <- data.frame(p = rep(p_grid, 4),
density = c(dbeta(p_grid, 1, 1),
dbeta(p_grid, 2, 2),
dbeta(p_grid, 8, 2),
dbeta(p_grid, 2, 8)),
Prior = rep(c("Beta(1,1): Uniform",
"Beta(2,2): Balanced",
"Beta(8,2): High values likely",
"Beta(2,8): Low values likely"), each = length(p_grid)))
ggplot(beta_df, aes(x = p, y = density, linetype = Prior)) +
geom_line(linewidth = 1) +
labs(title = "Different Beta Prior Distributions", x = "Probability p", y = "Density") +
theme_minimal()
For \(p \sim Beta(\alpha,\beta)\):
\[ E(p)=\frac{\alpha}{\alpha+\beta} \]
\[ Var(p)=\frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)} \]
beta_summary <- data.frame(
Prior = c("Beta(1,1)", "Beta(2,2)", "Beta(8,2)", "Beta(2,8)", "Beta(20,20)"),
alpha = c(1, 2, 8, 2, 20),
beta = c(1, 2, 2, 8, 20)
)
beta_summary <- beta_summary %>%
mutate(
Mean = alpha / (alpha + beta),
Variance = alpha * beta / ((alpha + beta)^2 * (alpha + beta + 1))
)
kable(beta_summary, digits = 4)
| Prior | alpha | beta | Mean | Variance |
|---|---|---|---|---|
| Beta(1,1) | 1 | 1 | 0.5 | 0.0833 |
| Beta(2,2) | 2 | 2 | 0.5 | 0.0500 |
| Beta(8,2) | 8 | 2 | 0.8 | 0.0145 |
| Beta(2,8) | 2 | 8 | 0.2 | 0.0145 |
| Beta(20,20) | 20 | 20 | 0.5 | 0.0061 |
The likelihood measures how well different parameter values explain the observed data.
Suppose:
\[ X \sim Binomial(n,p) \]
The probability mass function is:
\[ P(X=x \mid p)=\binom{n}{x}p^x(1-p)^{n-x} \]
As a likelihood function of \(p\), we write:
\[ L(p) \propto p^x(1-p)^{n-x} \]
Suppose we toss a coin 10 times and observe 7 heads.
\[ n=10, \quad x=7 \]
The likelihood is:
\[ L(p) \propto p^7(1-p)^3 \]
x <- 7
n <- 10
p_grid <- seq(0.001, 0.999, length.out = 1000)
likelihood <- dbinom(x, size = n, prob = p_grid)
likelihood_df <- data.frame(p = p_grid, likelihood = likelihood)
ggplot(likelihood_df, aes(x = p, y = likelihood)) +
geom_line(linewidth = 1) +
labs(title = "Likelihood Curve for 7 Heads in 10 Tosses", x = "p", y = "Likelihood") +
theme_minimal()
The likelihood is highest near:
\[ \hat{p}=\frac{x}{n}=\frac{7}{10}=0.70 \]
The most important Bayesian relationship is:
\[ p(p \mid x) \propto p(x \mid p)p(p) \]
For the Beta-Binomial model:
Prior:
\[ p \sim Beta(\alpha,\beta) \]
Likelihood:
\[ X \sim Binomial(n,p) \]
Posterior:
\[ p \mid x \sim Beta(\alpha+x,\beta+n-x) \]
Suppose:
\[ p \sim Beta(2,2) \]
and we observe:
\[ x=7, \quad n=10 \]
Then:
\[ p \mid x \sim Beta(2+7,2+10-7)=Beta(9,5) \]
alpha_prior <- 2
beta_prior <- 2
x <- 7
n <- 10
alpha_post <- alpha_prior + x
beta_post <- beta_prior + n - x
posterior_df <- data.frame(
p = rep(p_grid, 2),
density = c(dbeta(p_grid, alpha_prior, beta_prior),
dbeta(p_grid, alpha_post, beta_post)),
Distribution = rep(c("Prior: Beta(2,2)", "Posterior: Beta(9,5)"), each = length(p_grid))
)
ggplot(posterior_df, aes(x = p, y = density, linetype = Distribution)) +
geom_line(linewidth = 1) +
labs(title = "Prior and Posterior Distribution", x = "p", y = "Density") +
theme_minimal()
For visualization, we rescale likelihood to compare shapes.
prior_density <- dbeta(p_grid, alpha_prior, beta_prior)
likelihood_scaled <- likelihood / max(likelihood) * max(prior_density)
posterior_density <- dbeta(p_grid, alpha_post, beta_post)
plp_df <- data.frame(
p = rep(p_grid, 3),
value = c(prior_density, likelihood_scaled, posterior_density),
Component = rep(c("Prior", "Likelihood (scaled)", "Posterior"), each = length(p_grid))
)
ggplot(plp_df, aes(x = p, y = value, linetype = Component)) +
geom_line(linewidth = 1) +
labs(title = "Prior, Likelihood and Posterior", x = "p", y = "Value") +
theme_minimal()
For:
\[ p \mid x \sim Beta(\alpha+x,\beta+n-x) \]
The posterior mean is:
\[ E(p \mid x)=\frac{\alpha+x}{\alpha+\beta+n} \]
\[ Var(p \mid x)=\frac{(\alpha+x)(\beta+n-x)}{(\alpha+\beta+n)^2(\alpha+\beta+n+1)} \]
If \(\alpha+x>1\) and \(\beta+n-x>1\), the posterior mode is:
\[ \frac{\alpha+x-1}{\alpha+\beta+n-2} \]
posterior_mean <- alpha_post / (alpha_post + beta_post)
posterior_variance <- alpha_post * beta_post / ((alpha_post + beta_post)^2 * (alpha_post + beta_post + 1))
posterior_sd <- sqrt(posterior_variance)
posterior_median <- qbeta(0.5, alpha_post, beta_post)
posterior_mode <- (alpha_post - 1) / (alpha_post + beta_post - 2)
credible_interval <- qbeta(c(0.025, 0.975), alpha_post, beta_post)
summary_table <- data.frame(
Statistic = c("Posterior Mean", "Posterior Median", "Posterior Mode", "Posterior SD", "2.5%", "97.5%"),
Value = c(posterior_mean, posterior_median, posterior_mode, posterior_sd, credible_interval[1], credible_interval[2])
)
kable(summary_table, digits = 4)
| Statistic | Value |
|---|---|
| Posterior Mean | 0.6429 |
| Posterior Median | 0.6498 |
| Posterior Mode | 0.6667 |
| Posterior SD | 0.1237 |
| 2.5% | 0.3857 |
| 97.5% | 0.8614 |
A 95% credible interval is an interval \([a,b]\) such that:
\[ P(a < p < b \mid x)=0.95 \]
This means that, given the data and prior, there is 95% posterior probability that \(p\) lies between \(a\) and \(b\).
ci_lower <- credible_interval[1]
ci_upper <- credible_interval[2]
credible_df <- data.frame(p = p_grid, density = dbeta(p_grid, alpha_post, beta_post))
credible_df$inside <- credible_df$p >= ci_lower & credible_df$p <= ci_upper
ggplot(credible_df, aes(x = p, y = density)) +
geom_line(linewidth = 1) +
geom_area(data = subset(credible_df, inside), aes(x = p, y = density), alpha = 0.4) +
geom_vline(xintercept = c(ci_lower, ci_upper), linetype = "dashed") +
labs(title = "95% Bayesian Credible Interval", x = "p", y = "Posterior Density") +
theme_minimal()
A Bayesian credible interval allows a direct probability statement about the parameter:
\[ P(a<p<b \mid x)=0.95 \]
A frequentist confidence interval does not say there is 95% probability that the fixed parameter lies in a specific observed interval. Instead, it describes the long-run performance of the procedure.
Many Bayesian summaries can be calculated by drawing random samples from the posterior distribution.
If:
\[ p \mid x \sim Beta(9,5) \]
we can simulate:
\[ p^{(1)},p^{(2)},...,p^{(S)} \sim Beta(9,5) \]
set.seed(123)
S <- 10000
posterior_samples <- rbeta(S, alpha_post, beta_post)
simulation_summary <- data.frame(
Statistic = c("Mean", "Median", "SD", "2.5%", "97.5%"),
Value = c(mean(posterior_samples), median(posterior_samples), sd(posterior_samples),
quantile(posterior_samples, 0.025), quantile(posterior_samples, 0.975))
)
kable(simulation_summary, digits = 4)
| Statistic | Value |
|---|---|
| Mean | 0.6429 |
| Median | 0.6505 |
| SD | 0.1244 |
| 2.5% | 0.3869 |
| 97.5% | 0.8649 |
posterior_samples_df <- data.frame(p = posterior_samples)
ggplot(posterior_samples_df, aes(x = p)) +
geom_histogram(bins = 40, alpha = 0.7) +
geom_vline(xintercept = mean(posterior_samples), linetype = "dashed") +
labs(title = "Posterior Samples from Beta(9,5)", x = "p", y = "Frequency") +
theme_minimal()
Bayesian statistics naturally supports prediction. The posterior predictive distribution is:
\[ p(y_{new}\mid y)=\int p(y_{new}\mid \theta)p(\theta\mid y)d\theta \]
This combines parameter uncertainty and future observation uncertainty.
If \(p\) is the probability of heads, then:
\[ Y_{new}\mid p \sim Bernoulli(p) \]
We simulate:
set.seed(123)
future_heads <- rbinom(S, size = 1, prob = posterior_samples)
mean_future_heads <- mean(future_heads)
mean_future_heads
## [1] 0.647
The posterior predictive probability of heads in the next toss is approximately 0.647.
Suppose we want to predict the number of heads in 20 future tosses.
set.seed(123)
future_20_heads <- rbinom(S, size = 20, prob = posterior_samples)
future_df <- data.frame(heads = future_20_heads)
ggplot(future_df, aes(x = heads)) +
geom_bar(alpha = 0.7) +
labs(title = "Posterior Predictive Distribution: Heads in 20 Future Tosses", x = "Number of Heads", y = "Frequency") +
theme_minimal()
future_summary <- data.frame(
Statistic = c("Mean predicted heads", "Median predicted heads", "2.5%", "97.5%"),
Value = c(mean(future_20_heads), median(future_20_heads),
quantile(future_20_heads, 0.025), quantile(future_20_heads, 0.975))
)
kable(future_summary, digits = 2)
| Statistic | Value |
|---|---|
| Mean predicted heads | 12.87 |
| Median predicted heads | 13.00 |
| 2.5% | 6.00 |
| 97.5% | 19.00 |
Bayesian results depend on both the prior and the data. It is good practice to check whether conclusions change under different reasonable priors.
Suppose the observed data remain:
\[ x=7, \quad n=10 \]
We compare four priors:
\[ Beta(1,1), Beta(2,2), Beta(10,10), Beta(20,5) \]
priors <- data.frame(
Prior = c("Beta(1,1)", "Beta(2,2)", "Beta(10,10)", "Beta(20,5)"),
alpha = c(1, 2, 10, 20),
beta = c(1, 2, 10, 5)
)
sensitivity_results <- priors %>%
mutate(
alpha_post = alpha + x,
beta_post = beta + n - x,
posterior_mean = alpha_post / (alpha_post + beta_post),
lower_95 = qbeta(0.025, alpha_post, beta_post),
upper_95 = qbeta(0.975, alpha_post, beta_post)
)
kable(sensitivity_results, digits = 4)
| Prior | alpha | beta | alpha_post | beta_post | posterior_mean | lower_95 | upper_95 |
|---|---|---|---|---|---|---|---|
| Beta(1,1) | 1 | 1 | 8 | 4 | 0.6667 | 0.3903 | 0.8907 |
| Beta(2,2) | 2 | 2 | 9 | 5 | 0.6429 | 0.3857 | 0.8614 |
| Beta(10,10) | 10 | 10 | 17 | 13 | 0.5667 | 0.3894 | 0.7355 |
| Beta(20,5) | 20 | 5 | 27 | 8 | 0.7714 | 0.6210 | 0.8925 |
sens_plot_df <- data.frame()
for (i in 1:nrow(priors)) {
temp <- data.frame(
p = p_grid,
density = dbeta(p_grid, priors$alpha[i] + x, priors$beta[i] + n - x),
Prior = priors$Prior[i]
)
sens_plot_df <- rbind(sens_plot_df, temp)
}
ggplot(sens_plot_df, aes(x = p, y = density, linetype = Prior)) +
geom_line(linewidth = 1) +
labs(title = "Posterior Distributions Under Different Priors", x = "p", y = "Density") +
theme_minimal()
Suppose we observe continuous data:
\[ y_1,y_2,...,y_n \]
Assume:
\[ y_i \sim Normal(\mu,\sigma^2) \]
For simplicity, assume \(\sigma^2\) is known.
We place a Normal prior on \(\mu\):
\[ \mu \sim Normal(\mu_0,\tau_0^2) \]
The posterior is also Normal:
\[ \mu \mid y \sim Normal(\mu_n,\tau_n^2) \]
where:
\[ \tau_n^2=\left(\frac{n}{\sigma^2}+\frac{1}{\tau_0^2}\right)^{-1} \]
\[ \mu_n=\tau_n^2\left(\frac{n\bar{y}}{\sigma^2}+\frac{\mu_0}{\tau_0^2}\right) \]
set.seed(123)
n_normal <- 50
true_mu <- 70
sigma_known <- 10
y <- rnorm(n_normal, mean = true_mu, sd = sigma_known)
normal_df <- data.frame(score = y)
kable(head(normal_df, 10), digits = 2)
| score |
|---|
| 64.40 |
| 67.70 |
| 85.59 |
| 70.71 |
| 71.29 |
| 87.15 |
| 74.61 |
| 57.35 |
| 63.13 |
| 65.54 |
ggplot(normal_df, aes(x = score)) +
geom_histogram(bins = 15, alpha = 0.7) +
geom_vline(xintercept = mean(y), linetype = "dashed") +
labs(title = "Simulated Student Scores", x = "Score", y = "Frequency") +
theme_minimal()
mu0 <- 65
tau0 <- 15
sample_mean <- mean(y)
posterior_variance_mu <- 1 / (n_normal / sigma_known^2 + 1 / tau0^2)
posterior_mean_mu <- posterior_variance_mu * (n_normal * sample_mean / sigma_known^2 + mu0 / tau0^2)
posterior_sd_mu <- sqrt(posterior_variance_mu)
normal_posterior_summary <- data.frame(
Statistic = c("Sample mean", "Prior mean", "Posterior mean", "Posterior SD", "95% lower", "95% upper"),
Value = c(sample_mean, mu0, posterior_mean_mu, posterior_sd_mu,
posterior_mean_mu - 1.96 * posterior_sd_mu,
posterior_mean_mu + 1.96 * posterior_sd_mu)
)
kable(normal_posterior_summary, digits = 3)
| Statistic | Value |
|---|---|
| Sample mean | 70.344 |
| Prior mean | 65.000 |
| Posterior mean | 70.297 |
| Posterior SD | 1.408 |
| 95% lower | 67.537 |
| 95% upper | 73.057 |
mu_grid <- seq(55, 85, length.out = 1000)
normal_pp_df <- data.frame(
mu = rep(mu_grid, 2),
density = c(dnorm(mu_grid, mean = mu0, sd = tau0),
dnorm(mu_grid, mean = posterior_mean_mu, sd = posterior_sd_mu)),
Distribution = rep(c("Prior", "Posterior"), each = length(mu_grid))
)
ggplot(normal_pp_df, aes(x = mu, y = density, linetype = Distribution)) +
geom_line(linewidth = 1) +
labs(title = "Prior and Posterior for Normal Mean", x = expression(mu), y = "Density") +
theme_minimal()
In linear regression, we model a continuous outcome \(Y\) using predictor \(X\):
\[ Y_i=\beta_0+\beta_1X_i+\epsilon_i \]
where:
\[ \epsilon_i \sim Normal(0,\sigma^2) \]
In Bayesian regression, we assign priors:
\[ \beta_0 \sim Normal(0,100^2) \]
\[ \beta_1 \sim Normal(0,100^2) \]
\[ \sigma \sim HalfNormal(0,10) \]
set.seed(123)
n_reg <- 150
study_hours <- runif(n_reg, min = 0, max = 10)
exam_score <- 50 + 4 * study_hours + rnorm(n_reg, mean = 0, sd = 8)
reg_df <- data.frame(
study_hours = study_hours,
exam_score = exam_score
)
kable(head(reg_df, 10), digits = 2)
| study_hours | exam_score |
|---|---|
| 2.88 | 69.71 |
| 7.88 | 79.25 |
| 4.09 | 56.59 |
| 8.83 | 86.77 |
| 9.40 | 86.51 |
| 0.46 | 51.87 |
| 5.28 | 74.21 |
| 8.92 | 82.73 |
| 5.51 | 77.21 |
| 4.57 | 66.50 |
ggplot(reg_df, aes(x = study_hours, y = exam_score)) +
geom_point(alpha = 0.7) +
geom_smooth(method = "lm", se = TRUE) +
labs(title = "Relationship Between Study Hours and Exam Score", x = "Study Hours", y = "Exam Score") +
theme_minimal()
For beginner-level teaching, we use a normal approximation around the classical regression estimates. This is not full MCMC, but it gives intuition about posterior uncertainty.
lm_fit <- lm(exam_score ~ study_hours, data = reg_df)
summary(lm_fit)
##
## Call:
## lm(formula = exam_score ~ study_hours, data = reg_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.2890 -4.8644 -0.9938 4.7232 25.7358
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 49.6221 1.2653 39.22 <2e-16 ***
## study_hours 4.0644 0.2183 18.62 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.65 on 148 degrees of freedom
## Multiple R-squared: 0.7009, Adjusted R-squared: 0.6989
## F-statistic: 346.8 on 1 and 148 DF, p-value: < 2.2e-16
set.seed(123)
coef_est <- coef(lm_fit)
vcov_est <- vcov(lm_fit)
# Simulate approximate posterior samples using multivariate normal logic
# Cholesky decomposition without extra packages
S <- 5000
z <- matrix(rnorm(S * 2), ncol = 2)
L <- chol(vcov_est)
beta_samples <- sweep(z %*% L, 2, coef_est, "+")
beta_df <- data.frame(
beta0 = beta_samples[, 1],
beta1 = beta_samples[, 2]
)
beta_summary <- data.frame(
Parameter = c("Intercept", "Slope"),
Mean = c(mean(beta_df$beta0), mean(beta_df$beta1)),
Lower_95 = c(quantile(beta_df$beta0, 0.025), quantile(beta_df$beta1, 0.025)),
Upper_95 = c(quantile(beta_df$beta0, 0.975), quantile(beta_df$beta1, 0.975))
)
kable(beta_summary, digits = 3)
| Parameter | Mean | Lower_95 | Upper_95 |
|---|---|---|---|
| Intercept | 49.621 | 47.113 | 52.096 |
| Slope | 4.064 | 3.637 | 4.492 |
ggplot(beta_df, aes(x = beta1)) +
geom_histogram(bins = 40, alpha = 0.7) +
geom_vline(xintercept = mean(beta_df$beta1), linetype = "dashed") +
labs(title = "Approximate Posterior Distribution of Regression Slope", x = expression(beta[1]), y = "Frequency") +
theme_minimal()
set.seed(123)
selected <- sample(1:S, 100)
x_seq <- seq(0, 10, length.out = 100)
line_df <- data.frame()
for (i in selected) {
temp <- data.frame(
study_hours = x_seq,
exam_score = beta_df$beta0[i] + beta_df$beta1[i] * x_seq,
draw = i
)
line_df <- rbind(line_df, temp)
}
ggplot(reg_df, aes(x = study_hours, y = exam_score)) +
geom_point(alpha = 0.5) +
geom_line(data = line_df, aes(x = study_hours, y = exam_score, group = draw), alpha = 0.15) +
labs(title = "Posterior Regression Lines", x = "Study Hours", y = "Exam Score") +
theme_minimal()
In simple Bayesian models, we can calculate the posterior directly. But in realistic models, the posterior may be too complicated.
Bayesian posterior:
\[ p(\theta\mid y)=\frac{p(y\mid \theta)p(\theta)}{p(y)} \]
The denominator is:
\[ p(y)=\int p(y\mid \theta)p(\theta)d\theta \]
This integral can be difficult or impossible to solve analytically.
MCMC methods allow us to generate samples from the posterior distribution without solving the integral directly.
MCMC creates a chain:
\[ \theta^{(1)},\theta^{(2)},... ,\theta^{(S)} \]
After convergence, these values behave like samples from the posterior distribution.
We estimate the posterior of \(p\) for a coin toss model using a random walk proposal.
log_posterior <- function(p, x, n, alpha, beta) {
if (p <= 0 || p >= 1) return(-Inf)
dbinom(x, size = n, prob = p, log = TRUE) + dbeta(p, alpha, beta, log = TRUE)
}
metropolis_coin <- function(iter = 10000, start = 0.5, proposal_sd = 0.05,
x = 7, n = 10, alpha = 2, beta = 2) {
chain <- numeric(iter)
chain[1] <- start
accepted <- 0
for (i in 2:iter) {
current <- chain[i - 1]
proposal <- rnorm(1, mean = current, sd = proposal_sd)
log_ratio <- log_posterior(proposal, x, n, alpha, beta) -
log_posterior(current, x, n, alpha, beta)
if (log(runif(1)) < log_ratio) {
chain[i] <- proposal
accepted <- accepted + 1
} else {
chain[i] <- current
}
}
list(chain = chain, acceptance_rate = accepted / (iter - 1))
}
set.seed(123)
mcmc_result <- metropolis_coin()
mcmc_result$acceptance_rate
## [1] 0.8736874
mcmc_df <- data.frame(iteration = 1:length(mcmc_result$chain), p = mcmc_result$chain)
ggplot(mcmc_df, aes(x = iteration, y = p)) +
geom_line(alpha = 0.8) +
labs(title = "MCMC Trace Plot", x = "Iteration", y = "p") +
theme_minimal()
burnin <- 1000
mcmc_samples <- mcmc_result$chain[-(1:burnin)]
mcmc_samples_df <- data.frame(p = mcmc_samples)
ggplot(mcmc_samples_df, aes(x = p)) +
geom_histogram(bins = 40, alpha = 0.7) +
stat_function(fun = dbeta, args = list(shape1 = alpha_post, shape2 = beta_post), linewidth = 1) +
labs(title = "MCMC Samples Compared with Exact Posterior", x = "p", y = "Frequency / Density") +
theme_minimal()
A standard Bayesian analysis usually follows these steps:
A Bayesian result should report:
Example statement:
“Using a Beta(2,2) prior and observing 7 successes out of 10 trials, the posterior distribution was Beta(9,5). The posterior mean was 0.643, and the 95% credible interval was approximately [0.39, 0.86].”
Suppose a lecturer wants to estimate the probability that students pass a statistics course after using a new interactive teaching method.
Observed data:
We estimate the pass probability \(p\).
Likelihood:
\[ X \sim Binomial(n,p) \]
Prior:
\[ p \sim Beta(5,5) \]
Posterior:
\[ p \mid x \sim Beta(5+62,5+80-62)=Beta(67,23) \]
n_course <- 80
x_pass <- 62
alpha_course <- 5
beta_course <- 5
alpha_course_post <- alpha_course + x_pass
beta_course_post <- beta_course + n_course - x_pass
course_summary <- data.frame(
Quantity = c("Observed pass rate", "Posterior mean", "Posterior median", "95% lower", "95% upper"),
Value = c(x_pass / n_course,
alpha_course_post / (alpha_course_post + beta_course_post),
qbeta(0.5, alpha_course_post, beta_course_post),
qbeta(0.025, alpha_course_post, beta_course_post),
qbeta(0.975, alpha_course_post, beta_course_post))
)
kable(course_summary, digits = 4)
| Quantity | Value |
|---|---|
| Observed pass rate | 0.7750 |
| Posterior mean | 0.7444 |
| Posterior median | 0.7463 |
| 95% lower | 0.6500 |
| 95% upper | 0.8286 |
prob_above_70 <- 1 - pbeta(0.70, alpha_course_post, beta_course_post)
prob_above_70
## [1] 0.8341849
Interpretation: The posterior probability that the true pass probability exceeds 70% is 0.834.
course_df <- data.frame(
p = p_grid,
density = dbeta(p_grid, alpha_course_post, beta_course_post)
)
ggplot(course_df, aes(x = p, y = density)) +
geom_line(linewidth = 1) +
geom_vline(xintercept = 0.70, linetype = "dashed") +
geom_area(data = subset(course_df, p > 0.70), aes(x = p, y = density), alpha = 0.4) +
labs(title = "Posterior Probability That Pass Rate Exceeds 70%", x = "Pass Probability", y = "Posterior Density") +
theme_minimal()
| Term | Meaning |
|---|---|
| Prior | Belief before observing data |
| Likelihood | Information from data |
| Posterior | Updated belief after observing data |
| Evidence | Marginal probability of data |
| Credible interval | Bayesian uncertainty interval |
| Posterior mean | Average of posterior distribution |
| Posterior predictive | Distribution for future data |
| MCMC | Simulation method for posterior sampling |
| Conjugate prior | Prior producing posterior in same family |
A coin is tossed 20 times and gives 14 heads. Use a \(Beta(1,1)\) prior.
Repeat Exercise 1 using \(Beta(10,10)\). Compare the posterior with Exercise 1.
A disease affects 2% of a population. A test has sensitivity 90% and false positive rate 8%.
Calculate:
\[ P(Disease \mid Positive) \]
using R.
Simulate a dataset of 100 students where the true pass probability is 0.75. Use a \(Beta(2,2)\) prior and estimate the posterior distribution.
Generate a regression dataset where:
\[ Y=10+3X+\epsilon \]
where \(\epsilon \sim Normal(0,4^2)\). Fit a linear regression and simulate approximate posterior samples for the slope.
x_ex1 <- 14
n_ex1 <- 20
alpha_ex1 <- 1
beta_ex1 <- 1
alpha_ex1_post <- alpha_ex1 + x_ex1
beta_ex1_post <- beta_ex1 + n_ex1 - x_ex1
ex1_summary <- data.frame(
Statistic = c("Posterior distribution", "Posterior mean", "2.5%", "97.5%"),
Value = c(paste0("Beta(", alpha_ex1_post, ",", beta_ex1_post, ")"),
round(alpha_ex1_post / (alpha_ex1_post + beta_ex1_post), 4),
round(qbeta(0.025, alpha_ex1_post, beta_ex1_post), 4),
round(qbeta(0.975, alpha_ex1_post, beta_ex1_post), 4))
)
kable(ex1_summary)
| Statistic | Value |
|---|---|
| Posterior distribution | Beta(15,7) |
| Posterior mean | 0.6818 |
| 2.5% | 0.4782 |
| 97.5% | 0.8541 |
ex1_df <- data.frame(
p = p_grid,
density = dbeta(p_grid, alpha_ex1_post, beta_ex1_post)
)
ggplot(ex1_df, aes(x = p, y = density)) +
geom_line(linewidth = 1) +
labs(title = "Exercise 1 Posterior Distribution", x = "p", y = "Density") +
theme_minimal()
\[ P(A \mid B)=\frac{P(B \mid A)P(A)}{P(B)} \]
\[ p(\theta \mid y)=\frac{p(y \mid \theta)p(\theta)}{p(y)} \]
\[ p(\theta \mid y) \propto p(y \mid \theta)p(\theta) \]
\[ P(X=x \mid p)=\binom{n}{x}p^x(1-p)^{n-x} \]
\[ p \sim Beta(\alpha,\beta) \]
\[ p \mid x \sim Beta(\alpha+x,\beta+n-x) \]
\[ E(p)=\frac{\alpha}{\alpha+\beta} \]
\[ Var(p)=\frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)} \]
\[ p(y_{new}\mid y)=\int p(y_{new}\mid \theta)p(\theta\mid y)d\theta \]
| Function | Purpose |
|---|---|
dbeta() |
Beta density |
pbeta() |
Beta cumulative probability |
qbeta() |
Beta quantile |
rbeta() |
Random Beta samples |
dbinom() |
Binomial probability |
rbinom() |
Simulate Binomial data |
dnorm() |
Normal density |
rnorm() |
Simulate Normal data |
lm() |
Linear regression |
ggplot() |
Graphing |
Bayesian statistics is not only a collection of formulas. It is a way of learning from evidence. The central idea is simple:
\[ \text{New belief} = \text{Old belief} + \text{Data evidence} \]
In mathematical form:
\[ \text{Posterior} \propto \text{Likelihood} \times \text{Prior} \]
Once this idea is clear, students can gradually move toward advanced topics such as hierarchical models, MCMC, Bayesian regression, Bayesian disease modelling, Bayesian machine learning, and decision theory.