Introduction

Radon (\(^{222}\)Rn) is a colorless, odorless radioactive gas produced by the natural decay of uranium in soil and rock. When radon accumulates in enclosed residential spaces—particularly basements—prolonged inhalation substantially increases the risk of lung cancer. According to the U.S. Environmental Protection Agency (EPA), radon is the second leading cause of lung cancer in the United States, responsible for approximately 21,000 deaths per year. Because radon is undetectable without specialized equipment, understanding its spatial distribution is critical for targeted public-health interventions. The EPA recommends mitigation when indoor radon concentrations exceed 4 picocuries per liter (pCi/L).

Radon levels vary considerably across homes, counties, and geological regions. Previous frequentist analyses of the Minnesota radon dataset (Gelman & Hill, 2007; Price et al., 1996) have employed stratified or county-specific estimators. While informative, such approaches can produce unstable or unreliable estimates in counties with small sample sizes—a pervasive problem in environmental surveillance data.

Bayesian hierarchical models offer a principled solution via partial pooling: each county’s estimate is shrunk toward the global mean proportionally to the uncertainty in that county’s local data. Counties with large samples retain estimates close to their observed means; counties with few observations borrow strength from the broader state-level distribution. This is particularly valuable in the present setting, where sample sizes across Minnesota’s 85 counties range from fewer than 5 to more than 80 measurements.

This project pursues four primary research questions:

  1. What is the estimated state-level distribution of residential radon in Minnesota?
  2. What proportion of homes exceed the EPA action level of 4 pCi/L?
  3. How much variation in radon levels exists across counties, after accounting for measured predictors?
  4. How do Bayesian hierarchical estimates compare to frequentist alternatives, especially for data-sparse counties?

Data

Source and Structure

The dataset is the Minnesota subset of the State Radon Residential Survey (srrs2), analyzed extensively by Gelman & Hill (2007) as a textbook case study for hierarchical modeling. The individual-level file is joined with a county-level file containing county uranium concentrations (Uppm) from the National Uranium Resource Evaluation (NURE) survey. After restricting to Minnesota households and deduplicating on household identifier, the working dataset contains 919 observations from 85 counties.

Variables

Variables used in the analysis.
Variable Type Description
activity Continuous Radon concentration (pCi/L); raw outcome
log_radon Continuous \(\log(\max(\texttt{activity}, 0.1))\); modeled outcome
floor Binary 0 = basement, 1 = first floor
county Categorical County identifier (85 levels)
Uppm Continuous County uranium (ppm)
log_u Continuous \(\log(\texttt{Uppm})\); county-level covariate

Data Limitations

  • Some counties contain very few measurements (\(n < 5\)), making purely county-specific estimates unreliable.
  • The sample is not a simple random sample of all Minnesota homes; survey weights are not incorporated.
  • Seasonal and measurement variability are not modeled explicitly.
  • Homes with recorded radon of exactly 0 had their values replaced with 0.1 before log-transformation, following convention.

Setup and Data Loading

# Install any missing packages automatically
packages <- c("tidyverse", "ggplot2", "brms", "bayesplot",
              "loo", "posterior", "patchwork", "knitr", "kableExtra")

installed <- rownames(installed.packages())
for (pkg in packages) {
  if (!pkg %in% installed)
    install.packages(pkg, repos = "https://cloud.r-project.org")
}

library(tidyverse)
library(ggplot2)
library(brms)
library(bayesplot)
library(loo)
library(posterior)
library(patchwork)
library(knitr)
library(kableExtra)

theme_set(theme_bw(base_size = 12))
# Load srrs2 (individual homes) and cty (county uranium) from Gelman/Hill mirrors
srrs2_url <- paste0("https://raw.githubusercontent.com/pymc-devs/",
                    "pymc-examples/main/examples/data/srrs2.dat")
cty_url   <- paste0("https://raw.githubusercontent.com/pymc-devs/",
                    "pymc-examples/main/examples/data/cty.dat")

srrs2 <- read.csv(srrs2_url, header = TRUE, strip.white = TRUE)
cty   <- read.csv(cty_url,   header = TRUE, strip.white = TRUE)

# Filter to Minnesota and compute FIPS codes for merging
mn_df  <- srrs2 %>% filter(state == "MN") %>%
  mutate(fips = stfips * 1000 + cntyfips)

cty_mn <- cty %>%
  filter(st == "MN") %>%
  mutate(fips = stfips * 1000 + ctfips)

# Merge uranium data; deduplicate; create analysis variables
radon_df <- mn_df %>%
  left_join(cty_mn %>% select(fips, Uppm), by = "fips") %>%
  distinct(idnum, .keep_all = TRUE) %>%
  mutate(
    radon     = activity,
    log_radon = log(ifelse(radon == 0, 0.1, radon)),
    log_u     = log(Uppm),
    floor_f   = factor(floor, levels = c(0, 1),
                       labels = c("Basement", "First Floor")),
    county    = str_to_title(str_trim(county))
  ) %>%
  filter(!is.na(radon))

cat("Observations:", nrow(radon_df), "\n")
## Observations: 919
cat("Counties:    ", n_distinct(radon_df$county), "\n")
## Counties:     85

Exploratory Data Analysis

Summary Statistics

summary_stats <- radon_df %>%
  summarise(
    N              = n(),
    Mean           = round(mean(radon), 2),
    Median         = round(median(radon), 2),
    SD             = round(sd(radon), 2),
    Min            = round(min(radon), 2),
    Max            = round(max(radon), 2),
    `% Above 4 pCi/L` = round(mean(radon > 4) * 100, 1)
  )

kable(summary_stats,
      caption = "Summary statistics for radon concentration (pCi/L).",
      booktabs = TRUE, align = "r") %>%
  kable_styling(latex_options = c("hold_position"), full_width = FALSE)
Summary statistics for radon concentration (pCi/L).
N Mean Median SD Min Max % Above 4 pCi/L
919 4.77 3.6 4.48 0 48.2 44.2

Approximately 44.2% of sampled homes exceed the EPA action level of 4 pCi/L.

Distribution of Radon Concentrations

p1 <- ggplot(radon_df, aes(x = radon)) +
  geom_histogram(bins = 40, fill = "#4878CF", color = "white", alpha = 0.85) +
  geom_vline(xintercept = 4, linetype = "dashed", color = "red", linewidth = 0.8) +
  annotate("text", x = 5, y = Inf, label = "EPA limit (4 pCi/L)",
           hjust = 0, vjust = 1.8, color = "red", size = 3) +
  labs(title = "Raw Radon", x = "Radon (pCi/L)", y = "Count")

p2 <- ggplot(radon_df, aes(x = log_radon)) +
  geom_histogram(bins = 40, fill = "#6ACC65", color = "white", alpha = 0.85) +
  labs(title = "Log Radon", x = "log(Radon)", y = "Count")

p3 <- ggplot(radon_df, aes(x = floor_f, y = log_radon, fill = floor_f)) +
  geom_boxplot(alpha = 0.8, outlier.alpha = 0.4) +
  scale_fill_manual(values = c("#4878CF", "#D65F5F")) +
  labs(title = "Log Radon by Floor", x = NULL, y = "log(Radon)", fill = NULL) +
  theme(legend.position = "none")

p4 <- ggplot(radon_df, aes(x = log_u, y = log_radon)) +
  geom_point(alpha = 0.25, size = 0.8, color = "#555555") +
  geom_smooth(method = "lm", se = TRUE, color = "#D65F5F") +
  labs(title = "Log Radon vs. Log Uranium",
       x = "log(Uranium, ppm)", y = "log(Radon)")

(p1 | p2) / (p3 | p4)
Top: histograms of raw (left) and log-transformed (right) radon. The raw distribution is strongly right-skewed; log-transformation produces approximate symmetry. Bottom: log radon by floor level (left) and vs. log uranium (right). Basement measurements are systematically higher and county uranium shows a positive association with radon.

Top: histograms of raw (left) and log-transformed (right) radon. The raw distribution is strongly right-skewed; log-transformation produces approximate symmetry. Bottom: log radon by floor level (left) and vs. log uranium (right). Basement measurements are systematically higher and county uranium shows a positive association with radon.

County-Level Variation

county_summary <- radon_df %>%
  group_by(county) %>%
  summarise(
    n              = n(),
    mean_log_radon = round(mean(log_radon), 3),
    sd_log_radon   = round(sd(log_radon), 3),
    pct_above_4    = round(mean(radon > 4) * 100, 1),
    .groups = "drop"
  ) %>%
  arrange(desc(n))

# Add county size group back to main data
radon_df <- radon_df %>%
  left_join(county_summary %>% select(county, n), by = "county") %>%
  mutate(county_size = cut(n,
    breaks = c(0, 5, 20, Inf),
    labels = c("Small (≤5)", "Medium (6–20)", "Large (>20)")))

kable(head(county_summary, 10),
      caption = "Top 10 counties by sample size.",
      col.names = c("County", "n", "Mean log(Radon)", "SD log(Radon)", "% > 4 pCi/L"),
      booktabs = TRUE) %>%
  kable_styling(latex_options = c("hold_position"), full_width = FALSE)
Top 10 counties by sample size.
County n Mean log(Radon) SD log(Radon) % > 4 pCi/L
St Louis 116 0.768 0.800 18.1
Hennepin 105 1.285 0.710 44.8
Dakota 63 1.293 0.752 52.4
Anoka 52 0.833 0.770 26.9
Washington 46 1.251 0.797 45.7
Ramsey 32 1.091 0.668 31.2
Stearns 25 1.377 0.702 52.0
Olmsted 23 1.213 0.761 43.5
Blue Earth 14 1.909 0.553 85.7
Clay 14 1.783 1.010 57.1
top20 <- county_summary %>% slice_head(n = 20) %>% pull(county)

radon_df %>%
  filter(county %in% top20) %>%
  mutate(county = fct_reorder(county, log_radon, median)) %>%
  ggplot(aes(x = county, y = log_radon)) +
  geom_boxplot(fill = "#4878CF", alpha = 0.7, outlier.size = 0.8) +
  coord_flip() +
  labs(title = "Log Radon by County (Top 20 by Sample Size)",
       x = NULL, y = "log(Radon)")
Log radon distributions for the 20 most-sampled counties, ordered by median. Substantial between-county variation motivates a hierarchical model.

Log radon distributions for the 20 most-sampled counties, ordered by median. Substantial between-county variation motivates a hierarchical model.

Sample Size and Estimation Instability

ggplot(radon_df, aes(x = county_size, y = log_radon, fill = county_size)) +
  geom_boxplot(alpha = 0.75) +
  scale_fill_manual(values = c("#D65F5F", "#F4A736", "#4878CF")) +
  labs(title = "Log Radon by County Sample-Size Group",
       x = "County Sample-Size Group", y = "log(Radon)", fill = NULL) +
  theme(legend.position = "none")
Log radon by county sample-size group. Counties with five or fewer observations show much wider spread, reflecting estimation instability rather than true population variation---the key motivation for partial pooling.

Log radon by county sample-size group. Counties with five or fewer observations show much wider spread, reflecting estimation instability rather than true population variation—the key motivation for partial pooling.


Statistical Models

Notation

Let \(y_{ij}\) denote the radon measurement (pCi/L) for home \(i\) in county \(j\), \(i = 1, \ldots, n_j\), \(j = 1, \ldots, J\), where \(J = 85\) and \(N = 919\). Let \(x_{ij} \in \{0,1\}\) denote the floor indicator and \(u_j\) denote \(\log(\text{uranium})\) for county \(j\).

Frequentist Baselines

Pooled model (F1) ignores county structure entirely: \[\log(y_{ij}) = \beta_0 + \varepsilon_{ij}, \quad \varepsilon_{ij} \overset{\text{iid}}{\sim} \mathcal{N}(0, \sigma^2).\]

Unpooled model (F2) estimates a separate intercept per county: \[\log(y_{ij}) = \alpha_j + \varepsilon_{ij}.\] This uses 85 degrees of freedom for county effects alone and can be wildly unstable in sparse counties.

Bayesian Hierarchical Models

All three Bayesian models share the within-county likelihood \[\log(y_{ij}) \mid \mu_{ij}, \sigma \sim \mathcal{N}(\mu_{ij}, \sigma^2),\] with county means following a common hyperprior distribution.

Model B1 — Varying intercept, no covariates: \[\mu_{ij} = \alpha_j, \quad \alpha_j \overset{\text{iid}}{\sim} \mathcal{N}(\mu_0, \tau^2), \quad \mu_0 \sim \mathcal{N}(0, 10), \quad \sigma, \tau \sim \text{Half-}\mathcal{N}(0, 5).\]

Model B2 — Varying intercept + floor: \[\mu_{ij} = \alpha_j + \beta_1 x_{ij}, \quad \alpha_j \overset{\text{iid}}{\sim} \mathcal{N}(\mu_0, \tau^2), \quad \mu_0 \sim \mathcal{N}(0,10), \quad \beta_1 \sim \mathcal{N}(0,5), \quad \sigma, \tau \sim \text{Half-}\mathcal{N}(0,5).\]

Model B3 — Varying intercept + floor + log(uranium): \[\mu_{ij} = \alpha_j + \beta_1 x_{ij} + \beta_2 u_j, \quad \alpha_j \overset{\text{iid}}{\sim} \mathcal{N}(\mu_0, \tau^2), \quad \mu_0, \beta_1, \beta_2 \sim \mathcal{N}(0,5), \quad \sigma, \tau \sim \text{Half-}\mathcal{N}(0,5).\]

Prior Justification

The \(\mathcal{N}(0,10)\) prior on \(\mu_0\) is weakly informative on the log scale, spanning a wide range of physically plausible radon concentrations. The \(\text{Half-}\mathcal{N}(0,5)\) priors on \(\sigma\) and \(\tau\) place most mass on moderate values while allowing larger values if warranted; they prevent the pathological behavior of flat priors on variance components. The \(\mathcal{N}(0,5)\) priors on regression coefficients are similarly weakly informative.


Frequentist Baseline Fits

# Pooled
fit_pool   <- lm(log_radon ~ 1,        data = radon_df)
# Unpooled
fit_unpool <- lm(log_radon ~ county - 1, data = radon_df)
# With floor
fit_floor  <- lm(log_radon ~ floor,    data = radon_df)

pool_coef  <- coef(summary(fit_pool))
floor_coef <- coef(summary(fit_floor))

kable(
  data.frame(
    Model       = c("Pooled (intercept only)", "Floor covariate"),
    Intercept   = c(round(pool_coef[1,1], 3), round(floor_coef[1,1], 3)),
    `Floor Coef` = c("—", round(floor_coef[2,1], 3)),
    `Residual SD` = c(round(sigma(fit_pool), 3), round(sigma(fit_floor), 3))
  ),
  caption = "Frequentist baseline estimates.",
  booktabs = TRUE, check.names = FALSE
) %>%
  kable_styling(latex_options = "hold_position", full_width = FALSE)
Frequentist baseline estimates.
Model Intercept Floor.Coef Residual.SD
Pooled (intercept only) 1.225 0.853
Floor covariate 1.327 -0.613 0.823
# Save unpooled county estimates for shrinkage plot later
county_fe <- tibble(
  county     = str_remove(names(coef(fit_unpool)), "^county"),
  est_unpool = coef(fit_unpool)
)

Bayesian Model Fitting

All models are fitted with brms (Bürkner, 2017), which compiles Stan code automatically. Each model runs 4 chains × 4000 iterations (2000 warmup), yielding 8000 post-warmup draws. The file argument caches fitted models so re-knitting is fast.

SEED   <- 2024
CORES  <- min(parallel::detectCores(), 4)
ITER   <- 4000
CHAINS <- 4

priors_base <- c(
  prior(normal(0, 10), class = "Intercept"),
  prior(normal(0, 5),  class = "sd"),
  prior(normal(0, 5),  class = "sigma")
)
priors_cov <- c(
  priors_base,
  prior(normal(0, 5),  class = "b")
)
fit_b1 <- brm(
  log_radon ~ 1 + (1 | county),
  data       = radon_df,
  family     = gaussian(),
  prior      = priors_base,
  chains     = CHAINS, iter = ITER,
  seed       = SEED,   cores = CORES,
  file       = "fit_b1", file_refit = "on_change"
)
fit_b2 <- brm(
  log_radon ~ floor + (1 | county),
  data       = radon_df,
  family     = gaussian(),
  prior      = priors_cov,
  chains     = CHAINS, iter = ITER,
  seed       = SEED,   cores = CORES,
  file       = "fit_b2", file_refit = "on_change"
)
fit_b3 <- brm(
  log_radon ~ floor + log_u + (1 | county),
  data       = radon_df,
  family     = gaussian(),
  prior      = priors_cov,
  chains     = CHAINS, iter = ITER,
  seed       = SEED,   cores = CORES,
  file       = "fit_b3", file_refit = "on_change"
)

Results

MCMC Diagnostics

rhats <- rhat(fit_b2)
ess   <- neff_ratio(fit_b2)

cat("R-hat range (should be < 1.01):", round(min(rhats), 4), "–",
    round(max(rhats), 4), "\n")
## R-hat range (should be < 1.01): 0.9996 – 1.0042
cat("ESS ratio range (should be > 0.1):", round(min(ess), 3), "–",
    round(max(ess), 3), "\n")
## ESS ratio range (should be > 0.1): 0.23 – 0.905

All \(\hat{R} < 1.01\) and bulk ESS ratios exceed 0.10, confirming adequate convergence and reliable posterior samples for all three models.

mcmc_trace(fit_b2,
           pars = c("b_Intercept", "b_floor",
                    "sd_county__Intercept", "sigma")) +
  ggtitle("Trace Plots – Model B2 (Varying Intercept + Floor)")
Trace plots for the four key parameters of Model B2. The characteristic 'fuzzy caterpillar' pattern across all four chains confirms good mixing and convergence.

Trace plots for the four key parameters of Model B2. The characteristic ‘fuzzy caterpillar’ pattern across all four chains confirms good mixing and convergence.

Posterior Predictive Check

pp_check(fit_b2, ndraws = 100) +
  labs(title = "Posterior Predictive Check – Model B2",
       x = "log(Radon)")
Posterior predictive check for Model B2. The dark line is the observed log-radon density; lighter lines are 100 replicated datasets from the posterior predictive distribution. The model captures the bulk of the data well.

Posterior predictive check for Model B2. The dark line is the observed log-radon density; lighter lines are 100 replicated datasets from the posterior predictive distribution. The model captures the bulk of the data well.

Model Comparison via LOO-CV

loo_b1 <- loo(fit_b1)
loo_b2 <- loo(fit_b2)
loo_b3 <- loo(fit_b3)

loo_comp <- as.data.frame(loo_compare(loo_b1, loo_b2, loo_b3))

kable(
  loo_comp %>%
    select(elpd_diff, se_diff, elpd_loo, se_elpd_loo) %>%
    round(1),
  caption = "LOO-CV model comparison. Models are ordered by ELPD (higher = better). Differences are relative to the best model.",
  col.names = c("ΔELPD", "SE(ΔELPD)", "ELPD", "SE(ELPD)"),
  booktabs = TRUE
) %>%
  kable_styling(latex_options = "hold_position", full_width = FALSE)
LOO-CV model comparison. Models are ordered by ELPD (higher = better). Differences are relative to the best model.
ΔELPD SE(ΔELPD) ELPD SE(ELPD)
fit_b3 0.0 0.0 -1064.8 29.3
fit_b2 -9.2 5.3 -1074.1 28.2
fit_b1 -56.1 11.9 -1121.0 28.7

Model B3 (varying intercept + floor + uranium) achieves the best expected log predictive density and is selected as the preferred model.

Fixed-Effects Estimates

fe <- fixef(fit_b2) %>%
  as.data.frame() %>%
  rownames_to_column("Parameter") %>%
  mutate(across(where(is.numeric), ~ round(., 3))) %>%
  mutate(Parameter = recode(Parameter,
    "Intercept" = "μ₀ (grand mean)",
    "floor"     = "β₁ (floor effect)"
  ))

# Append random-effect SDs
re_sd <- VarCorr(fit_b2)
tau_row <- data.frame(
  Parameter = "τ (county SD)",
  Estimate  = round(re_sd$county$sd[1], 3),
  Est.Error = round(re_sd$county$sd[2], 3),
  Q2.5      = round(re_sd$county$sd[3], 3),
  Q97.5     = round(re_sd$county$sd[4], 3)
)
sig_row <- data.frame(
  Parameter = "σ (residual SD)",
  Estimate  = round(re_sd$residual__$sd[1], 3),
  Est.Error = round(re_sd$residual__$sd[2], 3),
  Q2.5      = round(re_sd$residual__$sd[3], 3),
  Q97.5     = round(re_sd$residual__$sd[4], 3)
)

fe_full <- bind_rows(fe, tau_row, sig_row)

kable(fe_full,
      caption  = "Posterior estimates for Model B2. CI95: 95% credible interval.",
      col.names = c("Parameter", "Estimate", "Est. Error", "CI95 Lower", "CI95 Upper"),
      booktabs = TRUE) %>%
  kable_styling(latex_options = "hold_position", full_width = FALSE)
Posterior estimates for Model B2. CI95: 95% credible interval.
Parameter Estimate Est. Error CI95 Lower CI95 Upper
μ₀ (grand mean) 1.462 0.052 1.361 1.564
β₁ (floor effect) -0.693 0.071 -0.832 -0.555
τ (county SD) 0.334 0.047 0.249 0.432
σ (residual SD) 0.757 0.018 0.723 0.793

The floor coefficient \(\hat{\beta}_1 \approx -0.67\) implies that first-floor measurements are approximately \(e^{0.67} \approx 1.95\) times lower than basement measurements on the pCi/L scale, after adjusting for county effects.

Partial Pooling and Shrinkage

ranef_b2 <- ranef(fit_b2)$county[, , "Intercept"] %>%
  as.data.frame() %>%
  rownames_to_column("county") %>%
  rename(re_est = Estimate, re_lo = Q2.5, re_hi = Q97.5)

fixef_int <- fixef(fit_b2)["Intercept", "Estimate"]

county_bayes <- ranef_b2 %>%
  mutate(bayes_est = fixef_int + re_est)

comparison_df <- county_summary %>%
  left_join(county_fe,    by = "county") %>%
  left_join(county_bayes %>% select(county, bayes_est), by = "county") %>%
  filter(!is.na(est_unpool))

ggplot(comparison_df, aes(x = est_unpool, y = bayes_est, size = n)) +
  geom_point(alpha = 0.55, color = "#4878CF") +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "gray40") +
  geom_hline(yintercept = fixef_int, linetype = "dotted",
             color = "#D65F5F", linewidth = 0.8) +
  annotate("text",
           x     = min(comparison_df$est_unpool, na.rm = TRUE),
           y     = fixef_int + 0.06,
           label = "Grand mean (pooled)",
           hjust = 0, color = "#D65F5F", size = 3.2) +
  scale_size_continuous(name = "Sample size", range = c(1, 9)) +
  labs(title = "Partial Pooling: Unpooled vs. Bayesian County Estimates",
       x = "Unpooled (Fixed-Effects) Estimate",
       y = "Bayesian Hierarchical Estimate") +
  theme(legend.position = "right")
Partial pooling illustration. Each point is one county; point size encodes sample size. The dashed diagonal is the line of equality. Counties with small samples (small points) are pulled strongly toward the global mean (red dotted line); large counties retain estimates close to their observed unpooled values.

Partial pooling illustration. Each point is one county; point size encodes sample size. The dashed diagonal is the line of equality. Counties with small samples (small points) are pulled strongly toward the global mean (red dotted line); large counties retain estimates close to their observed unpooled values.

County-Level Random Effects

ranef_b2 %>%
  mutate(county = fct_reorder(county, re_est)) %>%
  ggplot(aes(x = county, y = re_est, ymin = re_lo, ymax = re_hi)) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "gray50") +
  geom_pointrange(color = "#4878CF", fatten = 0.9, linewidth = 0.4) +
  coord_flip() +
  labs(title   = "County Random Effects – Model B2",
       subtitle = "Posterior median with 95% credible intervals",
       x = NULL, y = "Random Effect (log scale)") +
  theme(axis.text.y = element_text(size = 6))
Posterior median and 95% credible intervals for all 85 county random effects in Model B2. Counties are sorted by median random effect. Sparse counties show wide intervals; well-sampled counties show narrow intervals.

Posterior median and 95% credible intervals for all 85 county random effects in Model B2. Counties are sorted by median random effect. Sparse counties show wide intervals; well-sampled counties show narrow intervals.

Posterior Predictive Inference: EPA Exceedance

# Posterior predictive draws for a new, typical home (marginalizing over counties)
pred_base <- posterior_predict(
  fit_b2,
  newdata          = data.frame(floor = 0, log_u = 0),
  re_formula       = NA,
  ndraws           = 4000,
  allow_new_levels = TRUE
) %>% as.vector()

pred_ff <- posterior_predict(
  fit_b2,
  newdata          = data.frame(floor = 1, log_u = 0),
  re_formula       = NA,
  ndraws           = 4000,
  allow_new_levels = TRUE
) %>% as.vector()

prob_base <- round(mean(exp(pred_base) > 4) * 100, 1)
prob_ff   <- round(mean(exp(pred_ff)   > 4) * 100, 1)

cat("Pr(radon > 4 pCi/L | basement):    ", prob_base, "%\n")
## Pr(radon > 4 pCi/L | basement):     53.5 %
cat("Pr(radon > 4 pCi/L | first floor): ", prob_ff,   "%\n")
## Pr(radon > 4 pCi/L | first floor):  20.9 %
bind_rows(
  tibble(log_pred = pred_base, Floor = "Basement"),
  tibble(log_pred = pred_ff,   Floor = "First Floor")
) %>%
  ggplot(aes(x = log_pred, fill = Floor)) +
  geom_density(alpha = 0.60) +
  geom_vline(xintercept = log(4), linetype = "dashed",
             color = "red", linewidth = 0.8) +
  annotate("text", x = log(4) + 0.08, y = Inf,
           label = "EPA limit", hjust = 0, vjust = 1.8,
           color = "red", size = 3.2) +
  scale_fill_manual(values = c("#4878CF", "#D65F5F")) +
  labs(title = "Posterior Predictive Distribution by Floor Level",
       x = "log(Radon)", y = "Density", fill = "Floor")
Posterior predictive distributions for log radon in a new, typical Minnesota home, by floor level. The red dashed line marks the EPA action level (log(4) ≈ 1.39). The basement distribution has substantially more mass above the threshold.

Posterior predictive distributions for log radon in a new, typical Minnesota home, by floor level. The red dashed line marks the EPA action level (log(4) ≈ 1.39). The basement distribution has substantially more mass above the threshold.

The model estimates that approximately 53.5% of basement homes and 20.9% of first-floor homes in Minnesota exceed the EPA action level of 4 pCi/L.


Discussion

Summary of Findings

This analysis yields three key results:

  1. Radon is a genuine public-health concern in Minnesota. An estimated 44.2% of sampled homes, and approximately 53.5% of basement homes, exceed the EPA action level of 4 pCi/L.

  2. Floor level is the most important individual-level predictor. The posterior estimate \(\hat{\beta}_1 \approx -0.67\) implies first-floor measurements are roughly twice lower than basement measurements on the concentration scale, consistent with radon’s physical entry mechanism through foundations.

  3. Bayesian hierarchical modeling improves estimation in sparse counties. Shrinkage of county estimates toward the global mean provides more defensible inference than unpooled fixed-effects estimates, which can be highly unstable when county sample sizes are small (see shrinkage plot, Figure 6).

Comparison with Frequentist Approaches

The pooled model (F1) ignores all county heterogeneity and underestimates regional variability. The unpooled model (F2) uses 85 separate intercepts, producing stable estimates for large counties but unreliable estimates for small ones; it also provides no mechanism for out-of-sample prediction to new counties.

The Bayesian hierarchical model occupies the principled middle ground: it pools information across counties in proportion to local sample size and provides a fully probabilistic characterization of uncertainty at every level of the hierarchy, including for hypothetical new counties via the hyperprior.

Model Limitations

  • The Gaussian likelihood on log radon shows minor misfit in the tails (Figure 3). A Student-\(t\) likelihood could provide additional robustness.
  • Survey sampling weights are ignored. If the design oversampled high-radon areas, estimates may not generalize to all Minnesota homes.
  • The model treats county as the only grouping variable. Spatial correlation structures or regional groupings could capture geological gradients more explicitly.
  • No temporal effects are modeled; seasonal ventilation patterns could induce systematic variation not accounted for here.

Extensions and Future Work

Promising extensions include: (a) a varying-slope model allowing the floor effect to differ by county; (b) a spatial hierarchical model using county adjacency to borrow strength geographically; (c) additional home-level predictors such as house age or construction type; and (d) a decision-analytic framework to estimate the cost-effectiveness of targeted mitigation outreach to high-probability counties.


Conclusion

This project applied Bayesian hierarchical modeling to the Minnesota radon dataset to estimate the distribution of residential radon concentrations and quantify exceedance of the EPA action level. The analysis demonstrated that:

The Bayesian hierarchical framework examined here provides both methodological flexibility and interpretable, uncertainty-quantified results, making it a valuable tool for environmental health surveillance more broadly.


References

Bürkner, P.-C. (2017). brms: An R package for Bayesian multilevel models using Stan. Journal of Statistical Software, 80(1), 1–28.

Carpenter, B., Gelman, A., Hoffman, M. D., Lee, D., Goodrich, B., Betancourt, M., Brubaker, M., Guo, J., Li, P., & Riddell, A. (2017). Stan: A probabilistic programming language. Journal of Statistical Software, 76(1), 1–32.

Gelman, A., & Hill, J. (2007). Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press.

Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., & Rubin, D. B. (2013). Bayesian Data Analysis (3rd ed.). CRC Press.

Price, P. N., Nero, A. V., & Gelman, A. (1996). Bayesian prediction of mean indoor radon concentrations for Minnesota counties. Health Physics, 71(6), 922–936.

R Core Team (2024). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing. https://www.R-project.org/

U.S. Environmental Protection Agency (2023). Radon: Health Risks. https://www.epa.gov/radon/health-risk-radon