#| label: setup
#| include: false
library(tidyverse)
library(brms)
library(bayesplot)
library(tidybayes)
# GLOBAL chunk options
knitr::opts_chunk$set(
message = FALSE, # hide brms / package messages
warning = FALSE, # hide warnings
results = "hide" # hide all printed text output by default
)
# (Optional) shorten any accidental printed tibbles if they sneak through
options(
dplyr.print_min = 5,
dplyr.print_max = 5
)
Intro to Data Science: Assignment 5
Problem 1: Conjugate Priors and Posterior Inference (10 points)
Part A: Beta-Binomial Model (5 points)
A field researcher is studying the occupancy of a rare bird species. They conduct surveys at 50 sites and observe the species at 18 of them.
Question 1a (2 points)
Using a Beta(2, 2) prior (reflecting weak prior belief that occupancy is around 0.5), what is the posterior distribution and its parameters?
Bird spotted: y = 18, bird not spotted: 50 − 18 = 32. Prior: Beta(2, 2).
Posterior = Beta(2 + 18, 2 + 32) = Beta(20, 34)
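A quick numeric sketch of the conjugate update using the survey counts above:

```r
# Beta-Binomial conjugate update: add successes to alpha, failures to beta
alpha_prior <- 2
beta_prior  <- 2
y <- 18   # sites where the species was observed
n <- 50   # sites surveyed
c(alpha_post = alpha_prior + y,
  beta_post  = beta_prior + (n - y))   # Beta(20, 34)
```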
Question 1b (2 points)
Calculate the posterior probability that the true occupancy rate exceeds 40%. Show your calculation in R using pbeta().
# Posterior parameters
alpha_post <- 20
beta_post <- 34
# P(theta > 0.4 | data)
post_prob_gt_0.4 <- 1 - pbeta(0.4, shape1 = alpha_post, shape2 = beta_post)
post_prob_gt_0.4
# post_prob_gt_0.4 = 0.32, therefore the probability that the true occupancy rate exceeds 40% is ~0.32
Question 1c (1 point)
Create a visualization showing the prior and posterior distributions on the same plot.
library(tidyverse)
# Parameters
alpha_prior <- 2
beta_prior <- 2
alpha_post <- 20
beta_post <- 34
theta <- seq(0, 1, length.out = 500)
plot_df <- tibble(
theta = theta,
prior = dbeta(theta, alpha_prior, beta_prior),
posterior = dbeta(theta, alpha_post, beta_post)
) |>
pivot_longer(
cols = c(prior, posterior),
names_to = "dist",
values_to = "density"
)
ggplot(plot_df, aes(x = theta, y = density, color = dist)) +
geom_line(linewidth = 1) +
labs(
x = expression(theta),
y = "Density",
color = "Distribution",
title = "Prior vs Posterior for Occupancy Probability"
) +
theme_minimal()
Part B: Normal-Normal Conjugate Model (5 points)
A water quality dataset shows that historical dissolved oxygen (DO) measurements have a known standard deviation of 1.5 mg/L. You collect 20 new measurements with sample mean 6.8 mg/L.
Question 1d (2 points)
Using a Normal(7, 1) prior (mean = 7, sd = 1) for the true mean DO level, state the posterior mean and standard deviation.
Posterior precision = 1/τ² + n/σ² = 1/1² + 20/1.5² = 1 + 8.889 = 9.889
Posterior mean = (1 × 7 + 8.889 × 6.8) / 9.889 ≈ 6.82 mg/L
Posterior standard deviation = √(1/9.889) = √0.1011 ≈ 0.318 mg/L
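A quick numeric check of this precision-weighted update (a small sketch; the same quantities reappear in the code for the next question):

```r
# Check the Normal-Normal conjugate update by hand
sigma <- 1.5; tau <- 1; n <- 20; xbar <- 6.8; mu0 <- 7
prec_post <- 1 / tau^2 + n / sigma^2                        # 1 + 8.889 = 9.889
mu_post   <- (mu0 / tau^2 + n * xbar / sigma^2) / prec_post
sd_post   <- sqrt(1 / prec_post)
c(mu_post = mu_post, sd_post = sd_post)                     # ~6.82 and ~0.318
```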
Question 1e (2 points)
Interpret: How much did the data update your prior belief? Calculate the prior predictive standard error and posterior predictive standard error.
**Tip:** The prior predictive standard error can be calculated as:
$$SE_{\text{prior-pred}} = \sqrt{\sigma^2 + \tau^2}$$
where $\sigma$ is the known data standard deviation and $\tau$ is the prior standard deviation.
The posterior predictive standard error can be calculated as:
$$SE_{\text{post-pred}} = \sqrt{\sigma^2 + \tau_{\text{post}}^2}$$
sigma <- 1.5
tau <- 1
tau2 <- tau^2
n <- 20
xbar <- 6.8
mu0 <- 7
# posterior variance and sd of mu
tau_post2 <- 1 / (1/tau2 + n/sigma^2)
tau_post <- sqrt(tau_post2)
mu_post <- tau_post2 * (mu0/tau2 + n * xbar / sigma^2)
# predictive SEs
SE_prior_pred <- sqrt(sigma^2 + tau2)
SE_post_pred <- sqrt(sigma^2 + tau_post2)
c(SE_prior_pred = SE_prior_pred,
SE_post_pred = SE_post_pred)
Use these results to discuss how much the data updated your prior.
The prior predictive SE is √(1.5² + 1²) ≈ 1.80 mg/L, while the posterior predictive SE is √(1.5² + 0.101) ≈ 1.53 mg/L. The data sharply reduced uncertainty about the mean (prior sd 1 → posterior sd ≈ 0.32), but the predictive SE drops only modestly because most of the predictive uncertainty comes from the known measurement sd of 1.5 mg/L rather than from uncertainty about the mean.
Question 1f (1 point)
Would a Normal(8, 0.5) prior (more confident that \(\mu = 8\)) substantially change your posterior? Explain conceptually (no calculations needed).
Yes, a Normal(8, 0.5) prior would noticeably change the posterior. Because it is more concentrated (sd = 0.5) and centered further from the data (at 8 instead of 7), it would exert a stronger pull away from the sample mean of 6.8. The posterior mean would end up higher than 6.82 and the posterior sd for μ would be slightly smaller, reflecting a compromise between a strong, confident prior and moderately informative data (n = 20).
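Although the question asks only for a conceptual answer, a quick numeric sketch (reusing sigma, n, and xbar defined above) confirms the direction and size of the shift:

```r
# Same precision-weighted update with a Normal(8, 0.5) prior
tau_alt  <- 0.5
prec_alt <- 1 / tau_alt^2 + n / sigma^2              # 4 + 8.889 = 12.889
mu_alt   <- (8 / tau_alt^2 + n * xbar / sigma^2) / prec_alt
c(mu_alt = mu_alt, sd_alt = sqrt(1 / prec_alt))      # ~7.17 and ~0.28
```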
Problem 2: Prior Specification and Sensitivity Analysis (10 points)
You’re designing a clinical trial for a new treatment. Historical data suggests a response rate of approximately 60%, but there’s substantial uncertainty in this estimate.
Question 2a (3 points)
Compare three different Beta priors for the response rate:
- Prior 1: Beta(1, 1) — uniform prior
- Prior 2: Beta(6, 4) — optimistic prior (centered at 0.6)
- Prior 3: Beta(3, 3) — weakly informative prior
For each prior, generate 1000 samples and create density plots. Describe how each prior reflects different domain knowledge.
n_sims <- 1000
# Draw samples from each prior
samples <- tibble(
prior = factor(
rep(c("Beta(1, 1)", "Beta(6, 4)", "Beta(3, 3)"), each = n_sims),
levels = c("Beta(1, 1)", "Beta(6, 4)", "Beta(3, 3)")
),
theta = c(
rbeta(n_sims, 1, 1),
rbeta(n_sims, 6, 4),
rbeta(n_sims, 3, 3)
)
)
# Density plots for each prior
ggplot(samples, aes(x = theta)) +
geom_density(fill = "grey80") +
facet_wrap(~ prior, ncol = 1) +
labs(
x = expression(theta),
y = "Density",
title = "Prior distributions for treatment response rate"
) +
xlim(0, 1) +
theme_minimal()
Prior 1: Beta(1, 1) (uniform prior) This prior places equal density on all values between 0 and 1. It represents prior ignorance: before seeing any data, we are not favoring any particular response rate. There is no built-in preference for any value.
Prior 2: Beta(6, 4) (optimistic prior, centered at 0.6) This prior is centered at 0.6 and is more concentrated than the other priors, reflecting that we expect the true response rate to be around 60%, based on historical data. Interpreting the Beta parameters as prior counts, this corresponds roughly to having previously seen about 10 patients with 6 successes and 4 failures. It encodes stronger prior belief that the treatment is reasonably effective, and shrinks inferences toward 0.6 unless the new data strongly disagree.
Prior 3: Beta(3, 3) (weakly informative prior) This prior is symmetric around 0.5, with mild concentration in the middle and downweighting extreme values near 0 or 1. It reflects weak prior information: we believe extremely low or extremely high response rates are somewhat unlikely, but we’re not strongly committed to any particular value within the middle range. Compared with Beta(6, 4), it is less confident and will let the data play a larger role.
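A small numeric companion to these descriptions: the prior mean α/(α + β) and the effective prior sample size α + β make the pseudo-count interpretation explicit.

```r
# Prior mean and "pseudo-observation" count for each candidate prior
tibble(
  prior = c("Beta(1, 1)", "Beta(6, 4)", "Beta(3, 3)"),
  alpha = c(1, 6, 3),
  beta  = c(1, 4, 3)
) |>
  mutate(
    prior_mean = alpha / (alpha + beta),   # 0.50, 0.60, 0.50
    prior_n    = alpha + beta              # 2, 10, 6 pseudo-observations
  )
```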
Question 2b (4 points)
Suppose you observe 45 responses out of 100 patients. For each of the three priors above:
- Compute the posterior distribution
- Calculate the posterior probability that the response rate exceeds 50%
- Create a table showing posterior mean and posterior standard deviation (for each prior) (use R to perform these calculations by simulating from the posterior distribution)
# Observed data
y <- 45 # number of responses
n <- 100 # total patients
n_sims <- 10000
# Prior parameters
priors <- tibble(
prior = c("Beta(1,1) uniform",
"Beta(6,4) optimistic",
"Beta(3,3) weakly informative"),
alpha = c(1, 6, 3),
beta = c(1, 4, 3)
) |>
mutate(
post_alpha = alpha + y,
post_beta = beta + (n - y)
)
# Simulate from each posterior
post_draws <- priors |>
rowwise() |>
mutate(theta = list(rbeta(n_sims, post_alpha, post_beta))) |>
unnest(theta)
# Posterior summaries and P(theta > 0.5)
post_summary <- post_draws |>
group_by(prior) |>
summarise(
post_mean = mean(theta),
post_sd = sd(theta),
prob_theta_gt_0.5 = mean(theta > 0.5),
.groups = "drop"
)
post_summary
Which prior had the most influence on the posterior? Why?
The Beta(6,4) optimistic prior has the most influence on the posterior. It is more informative (equivalent to about 10 prior “pseudo-observations”) and centered at 0.6, so it pulls the posterior mean upward relative to the other priors and yields the highest posterior probability that θ > 0.5.
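As a cross-check on the Monte Carlo summaries, the same quantities are available in closed form from the updated Beta parameters (a sketch reusing the `priors` table defined above):

```r
# Exact posterior summaries from the Beta(post_alpha, post_beta) posteriors
priors |>
  mutate(
    exact_mean = post_alpha / (post_alpha + post_beta),
    exact_sd   = sqrt(post_alpha * post_beta /
                        ((post_alpha + post_beta)^2 * (post_alpha + post_beta + 1))),
    exact_prob_gt_0.5 = 1 - pbeta(0.5, post_alpha, post_beta)
  ) |>
  select(prior, exact_mean, exact_sd, exact_prob_gt_0.5)
```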
Question 2c (3 points)
Perform a prior predictive check for Prior 2 (Beta(6, 4)). Generate 1000 samples of hypothetical response counts (out of 100) from the prior predictive distribution. Does this prior seem reasonable given your domain knowledge? Explain.
n_sims <- 1000
n <- 100
# 1. Sample theta from the prior Beta(6, 4)
theta_samples <- rbeta(n_sims, shape1 = 6, shape2 = 4)
# 2. For each theta, sample a hypothetical response count out of 100
response_counts <- rbinom(n_sims, size = n, prob = theta_samples)
prior_pred <- tibble(
theta = theta_samples,
responses = response_counts
)
# Plot prior predictive distribution of response counts
ggplot(prior_pred, aes(x = responses)) +
geom_histogram(binwidth = 1, boundary = 0, closed = "left", fill = "grey80") +
labs(
title = "Prior predictive distribution under Beta(6, 4)",
x = "Number of responses out of 100",
y = "Frequency"
) +
theme_minimal()
# Numeric summary
prior_pred_summary <- prior_pred |>
summarise(
mean_responses = mean(responses),
sd_responses = sd(responses),
q2_5 = quantile(responses, 0.025),
q97_5 = quantile(responses, 0.975)
)
prior_pred_summary
TIP: The prior predictive distribution for a Binomial likelihood with a Beta prior can be simulated by:
1. Sample θ from the Beta prior.
2. Sample response counts from a Binomial distribution using the sampled θ.
Example in R:
theta_samples <- rbeta(1000, 6, 4)
response_counts <- rbinom(1000, size = 100, prob = theta_samples)
Does this prior seem reasonable given your understanding?
Overall, this prior seems reasonable given the domain knowledge: it encodes optimism around a 60% response rate while still allowing a wide range of possible outcomes in the prior predictive distribution. It does not unrealistically concentrate all prior mass near 60%, but it also doesn’t treat all response rates as equally plausible.
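One way to make "reasonable" concrete is to check what fraction of the prior predictive draws fall in a broadly plausible band of response counts; the 40–80 band below is an arbitrary, illustrative choice.

```r
# Fraction of prior predictive draws in an (illustrative) plausible band
mean(prior_pred$responses >= 40 & prior_pred$responses <= 80)
```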
Problem 3: Posterior Predictive Checks (10 points)
You’ve fit a Bayesian regression model to predict species abundance based on environmental covariates using brms. Here’s the setup:
Question 3a (3 points)
Perform posterior predictive checks using at least three different test statistics:
pp_check(fit, type = "dens_overlay")pp_check(fit, type = "stat", stat = "mean")pp_check(fit, type = "error_binned")For each check, describe what you observe and whether the model appears to fit the data well.
Figure 1 is a density overlay showing the observed data distribution (dark line) overlaid with distributions from multiple posterior predictive samples (light lines). The posterior predictive distributions closely match the shape and location of the observed data distribution, so the model appears to fit well: it captures the overall distributional shape of the abundance data.
Figure 2: The histogram shows the distribution of means from the posterior predictive samples (light bars) compared to the mean of the observed data (dark line). The observed mean falls very close to the center of the predictive distribution, with the dark line roughly centered on the x-axis. This indicates a good fit: the model's predictions are well calibrated to the central tendency of the data.
Figure 3: This shows the average prediction errors within bins of predicted values, with 2-SE bounds. Errors are scattered around zero across all panels and appear randomly distributed above and below zero. This again points to a good fit, with no apparent systematic biases in predictions across the range of fitted values.
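A numeric companion to the `stat = "mean"` check in Figure 2 is the posterior predictive p-value for the mean; this is a sketch that assumes the fitted object `fit` and a data frame `df` with an `abundance` column, as used in the model setup.

```r
# Posterior predictive p-value for the mean test statistic
y_rep     <- posterior_predict(fit)       # draws x observations matrix
rep_means <- apply(y_rep, 1, mean)        # mean of each replicated data set
mean(rep_means >= mean(df$abundance))     # values near 0.5 indicate good calibration
```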
Question 3b (4 points)
Create a custom posterior predictive check that computes the proportion of observations in different abundance ranges (e.g., <15, 15–25, >25) and compares observed vs. predicted proportions.
library(tidyverse)
library(bayesplot)
library(posterior)
# Define abundance ranges
abundance_breaks <- c(-Inf, 15, 25, Inf)
abundance_labels <- c("<15", "15-25", ">25")
# Extract posterior predictive samples
y_rep <- posterior_predict(fit)
# Function to calculate proportions in each range
calc_proportions <- function(y) {
cut(y, breaks = abundance_breaks, labels = abundance_labels) %>%
table() %>%
prop.table() %>%
as.numeric()
}
# Calculate observed proportions
y_obs <- df$abundance
obs_props <- calc_proportions(y_obs)
# Calculate predicted proportions for each posterior draw
pred_props <- t(apply(y_rep, 1, calc_proportions))
colnames(pred_props) <- abundance_labels
# Create a dataframe for plotting
prop_df <- as_tibble(pred_props) %>%
pivot_longer(everything(), names_to = "range", values_to = "proportion") %>%
mutate(range = factor(range, levels = abundance_labels))
obs_df <- tibble(
range = factor(abundance_labels, levels = abundance_labels),
proportion = obs_props
)
# Plot the comparison
ggplot(prop_df, aes(x = proportion)) +
geom_histogram(bins = 30, fill = "skyblue", alpha = 0.7) +
geom_vline(data = obs_df, aes(xintercept = proportion),
color = "darkblue", linewidth = 1.5) +
facet_wrap(~range, scales = "free") +
labs(
title = "Posterior Predictive Check: Proportions by Abundance Range",
x = "Proportion",
y = "Count"
) +
theme_minimal()
What does this reveal about model fit?
This model does not fit well in any of the three abundance ranges: it underpredicts the proportion of observations below 15 and above 25, and overpredicts the proportion between 15 and 25.
Question 3c (3 points)
If you noticed systematic discrepancies in the posterior predictive checks, suggest two model modifications that might improve fit. For each, explain why it would address the problem.
- Use a different likelihood, such as a log-normal or Gamma distribution. Abundance data in ecology are often strictly positive and right-skewed, frequently following a log-normal or Gamma distribution, so changing the family may improve the fit of this model (see the sketch below).
- Add non-linear effects or interactions among the environmental covariates, since species abundance often has a non-linear relationship with environmental variables.
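A minimal sketch of what these two modifications could look like in brms, assuming the same data frame `df` as above and a hypothetical covariate `temperature` standing in for the actual environmental predictors:

```r
# 1. Gamma likelihood with a log link for strictly positive, right-skewed abundances
fit_gamma <- brm(
  abundance ~ temperature,
  data = df,
  family = Gamma(link = "log")
)

# 2. Smooth (non-linear) effect of the covariate via a spline term
fit_smooth <- brm(
  abundance ~ s(temperature),
  data = df,
  family = Gamma(link = "log")
)
```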
Problem 4: Bayesian Hypothesis Testing with Bayes Factors (10 points)
You want to evaluate whether a conservation intervention changed species richness. You have richness data from 40 sites, with 20 in control and 20 in treatment groups.
set.seed(99)
control_richness <- rnorm(20, mean = 15, sd = 3)
treatment_richness <- rnorm(20, mean = 17, sd = 3)
Question 4a (3 points)
Frame two competing hypotheses:
- H0: Treatment has no effect (\(\mu_{\text{treatment}} = \mu_{\text{control}}\))
- H1: Treatment increases richness (\(\mu_{\text{treatment}} > \mu_{\text{control}}\))
Fit two Bayesian models using brms:
Model 0:
brm(richness ~ group, data = data_combined, ...)
Model 1:
brm(richness ~ group, data = data_combined, prior = c(prior(normal(2, 1), class = "b")), ...)
#| message: false
#| warning: false
#| include: false
library(brms)
library(tibble)
set.seed(99)
# 1. Simulate data
control_richness <- rnorm(20, mean = 15, sd = 3)
treatment_richness <- rnorm(20, mean = 17, sd = 3)
data_combined <- tibble(
richness = c(control_richness, treatment_richness),
group = factor(rep(c("control", "treatment"), each = 20))
)
# 2. Model 0: no special prior on the treatment effect
# (this represents H0: no clear expected effect)
model0 <- brm(
richness ~ group,
data = data_combined
)
# 3. Model 1: prior that treatment increases richness by about 2 units
# (this represents H1: treatment > control)
model1 <- brm(
richness ~ group,
data = data_combined,
prior = prior(normal(2, 1), class = "b", coef = "grouptreatment")
)
Question 4b (4 points)
Calculate the Bayes factor comparing these models using bayes_factor() or manually.
#| message: false
#| warning: false
#| include: false
library(brms)
library(tibble)
set.seed(99)
# Data (same as before)
control_richness <- rnorm(20, mean = 15, sd = 3)
treatment_richness <- rnorm(20, mean = 17, sd = 3)
data_combined <- tibble(
richness = c(control_richness, treatment_richness),
group = factor(rep(c("control", "treatment"), each = 20))
)
# Fit models with save_pars = save_pars(all = TRUE) so bayes_factor() works
model0 <- brm(
richness ~ group,
data = data_combined,
save_pars = save_pars(all = TRUE)
)
model1 <- brm(
richness ~ group,
data = data_combined,
prior = prior(normal(2, 1), class = "b", coef = "grouptreatment"),
save_pars = save_pars(all = TRUE)
)
# Bayes factor: evidence for Model 1 vs Model 0
bf_10 <- bayes_factor(model1, model0)
bf_10
Interpret the Bayes factor according to the Kass & Raftery scale provided in lecture slides.
The estimated Bayes factor in favor of model1 over model0 was BF₁₀ ≈ 0.20, which corresponds to BF₀₁ ≈ 5 in favor of model0. On the Kass & Raftery scale this is positive evidence for the no-effect model (H0) over the model that assumes a positive treatment effect (H1): the data moderately support the hypothesis that the intervention did not change species richness.
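For reference, converting BF₁₀ into BF₀₁ and (assuming equal prior odds for the two models) into a posterior model probability makes the strength of evidence explicit:

```r
# Evidence for Model 0, using the BF_10 estimate reported above
bf10 <- 0.20018
bf01 <- 1 / bf10                     # ~5: positive evidence for Model 0
post_prob_m0 <- bf01 / (1 + bf01)    # ~0.83 under equal prior odds
c(BF_01 = bf01, P_M0 = post_prob_m0)
```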
Question 4c (3 points)
In a frequentist hypothesis test, we assume the null hypothesis is true and ask: “If there were really no effect, how surprising would data like mine be just by chance?” The p-value only tells us how unusual the data are under the null; it does not tell us how likely the null or the alternative is.
In contrast, a Bayesian test with a Bayes factor compares how well both hypotheses explain the data. It answers the question: “Are these data more likely under the model with an effect or the model with no effect, and by how much?” This lets us talk about evidence for or against each hypothesis, something NHST cannot do.
Problem 5: Integration of Bayesian Inference and Prediction (10 points)
Combine Bayesian inference with predictive modeling using a realistic scenario.
Scenario: You’re modeling water quality (pH) across multiple monitoring stations. You have:
- Historical prior information: mean pH typically 7.2, with sd = 0.5
- New observational data from 5 stations with pH measurements
- Goal: Make predictions for unsampled stations while quantifying uncertainty
Question 5a (4 points)
Fit a Bayesian hierarchical model where stations are nested within regions. Use brms to model:
pH ~ 1 + (1 | station)
Extract posterior distributions for each station’s mean pH and visualize using tidybayes::stat_pointinterval().
#| message: false
#| warning: false
#| include: false
library(brms)
library(tidybayes)
library(dplyr)
library(tidyr)
library(ggplot2)
set.seed(123)
## 1. Simulate data for 5 stations
stations <- paste0("S", 1:5)
# One random offset per station
station_effects <- rnorm(length(stations), mean = 0, sd = 0.2)
ph_data <- expand_grid(
station = stations,
rep = 1:4
) |>
mutate(
station = factor(station),
true_station_offset = station_effects[match(station, stations)],
pH = rnorm(n(), mean = 7.2 + true_station_offset, sd = 0.2)
)
## 2. Fit Bayesian hierarchical model
priors_ph <- c(
prior(normal(7.2, 0.5), class = "Intercept"),
prior(exponential(1), class = "sd")
)
fit_ph <- brm(
pH ~ 1 + (1 | station),
data = ph_data,
prior = priors_ph,
chains = 2,
iter = 1000,
seed = 123,
refresh = 0
)
## 3. Extract posterior station means
station_draws <- fit_ph |>
spread_draws(b_Intercept, r_station[station,Intercept]) |>
mutate(station_mean = b_Intercept + r_station)
## 4.
ggplot(station_draws, aes(x = station_mean, y = station)) +
stat_pointinterval() +
labs(
title = "Posterior mean pH for each station",
x = "Mean pH",
y = "Station"
) +
theme_minimal()
Question 5b (3 points)
Generate posterior predictive samples for a new, unsampled station. Create a visualization showing 50%, 80%, and 95% credible intervals for predicted pH and compare to the posterior mean estimates from 5a.
library(dplyr)
library(tidybayes)
library(ggplot2)
## 1. Define a new, unsampled station
newdata <- tibble(
station = "New_station"  # new level not in the training data; handled by allow_new_levels = TRUE
)
## 2. Posterior predictive draws for the new station
# Each row is one posterior draw of a future pH observation
new_y <- posterior_predict(
fit_ph,
newdata = newdata,
allow_new_levels = TRUE
)
new_draws <- tibble(
station = "New station",
.draw = 1:nrow(new_y),
.prediction = as.numeric(new_y[, 1])
)
## 3. Posterior mean pH for each existing station (from 5a)
station_means <- station_draws |>
group_by(station) |>
summarise(
mean_pH = mean(station_mean),
.groups = "drop"
)
## 4. Plot: predictive intervals + comparison to station means
ggplot(new_draws, aes(x = .prediction, y = station)) +
# 50%, 80%, 95% predictive intervals for the new station
stat_interval(.width = c(0.5, 0.8, 0.95)) +
# Rug shows posterior mean pH for existing stations (from 5a)
geom_rug(
data = station_means,
inherit.aes = FALSE,
aes(x = mean_pH),
sides = "b"
) +
labs(
title = "Posterior predictive pH for a new station",
subtitle = "Bands = 50%, 80%, 95% intervals",
x = "pH",
y = NULL
) +
theme_minimal()
Compare these predictions conceptually to the station estimates from 5a.
Conceptually, the new-station prediction is what you’d expect for a typical station drawn from the same population as the ones in 5a. Because we have no data for this station, its predictive intervals are pulled strongly toward the overall mean pH (~7.2) and are fairly wide, since they include both uncertainty in the station effect and the residual measurement noise.
Question 5c (3 points)
How does the credible interval for a new prediction compare to the credible interval for the posterior mean of an observed station? Explain why they differ in width.
For a new station, the credible interval is for a future observation of pH. That interval has to include: uncertainty about the overall mean and station effect plus the random measurement/noise around that mean.
For an observed station, the interval for the posterior mean pH only reflects uncertainty in the mean after seeing data from that station; it does not include the extra spread from measurement noise, and the data shrink the uncertainty further.
Because a predictive interval = “uncertainty in the mean” + “extra variability of individual measurements,” it is always wider than the credible interval for the posterior mean of a station we have data for.
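A quick numeric illustration of the width difference, reusing `station_draws` from 5a and `new_draws` from 5b (station "S1" is just one example):

```r
# 95% interval widths: posterior mean of an observed station vs. a new prediction
width_station <- station_draws |>
  filter(station == "S1") |>
  summarise(width = diff(quantile(station_mean, c(0.025, 0.975)))) |>
  pull(width)

width_new <- diff(quantile(new_draws$.prediction, c(0.025, 0.975)))

c(observed_station_mean = width_station, new_station_prediction = width_new)
```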
Problem 6: Reflection and Synthesis (10 points)
Question 6a (5 points)
Compare Bayesian and frequentist approaches across these dimensions:
| Dimension | Bayesian | Frequentist |
|---|---|---|
| Parameter interpretation | Parameters are random variables with a posterior distribution. | Parameters are fixed but unknown; probabilities are assigned to data, not to parameter values. |
| Use of prior information | Explicitly uses priors to combine previous knowledge with the data. | Does not use priors; relies only on the information in the current sample. |
| Type of uncertainty quantification | Credible intervals: ranges in which the parameter lies with a given posterior probability. | Confidence intervals: intervals that would contain the true parameter in a given percentage of repeated samples. |
| Primary question answered | “How probable are different parameter values or models, given my data and prior?” | “How unusual would my data be if a particular hypothesis were true?” |
| Strength for small sample sizes | Can perform well by borrowing strength from priors; often more stable with small n. | Can be unstable with small n; relies more heavily on large-sample approximations. |
Question 6b (5 points)
Describe a real ecological or environmental research scenario where you would recommend a Bayesian approach over frequentist methods. Explain:
- Why the research question benefits from Bayesian framing
- What prior information would be relevant
- How posterior predictive checks would inform model adequacy
- What decision(s) the posterior distribution would inform
One good example is setting harvest limits for a fish population in a lake where we only have a few years of noisy survey data. Lake managers want to know: “What is the likely true size of the fish population, and what harvest level keeps it from collapsing?” A Bayesian approach works well here because it lets us combine the new survey data with prior information from older studies on the same lake, nearby lakes, or similar species (for example, typical growth rates, carrying capacity, or past population estimates). That prior information is important when the current sample size is small and the data on their own are not very informative.
After fitting a Bayesian model, we can use posterior predictive checks to see whether the model can reproduce patterns we actually see in the data, such as the number of zero-catch surveys or the spread of catches across years. If the simulated data look very different from the real data, we know the model is missing something and should be improved. Once the model passes these checks, the posterior distribution can directly answer management questions like “What is the probability the population is below 30% of its historical level?” or “What harvest quota keeps the chance of collapse below 5%?” Those probability statements are exactly what decision-makers need to choose a cautious, defensible harvest policy.
Submission Guidelines
Create a Quarto file that reproduces all analyses
For each problem, include:
Clear code with comments
Statistical output and visualizations
Written interpretation of results
Save as week-05-solutions.qmd
Render to HTML and submit both the .qmd and .html files