Radon (\(^{222}\)Rn) is a colorless, odorless radioactive gas produced by the natural decay of uranium in soil and rock. When radon accumulates in enclosed residential spaces—particularly basements—prolonged inhalation substantially increases the risk of lung cancer. According to the U.S. Environmental Protection Agency (EPA), radon is the second leading cause of lung cancer in the United States, responsible for approximately 21,000 deaths per year. Because radon is undetectable without specialized equipment, understanding its spatial distribution is critical for targeted public-health interventions. The EPA recommends mitigation when indoor radon concentrations exceed 4 picocuries per liter (pCi/L).
Radon levels vary considerably across homes, counties, and geological regions. Previous frequentist analyses of the Minnesota radon dataset (Gelman & Hill, 2007; Price et al., 1996) have employed stratified or county-specific estimators. While informative, such approaches can produce unstable or unreliable estimates in counties with small sample sizes—a pervasive problem in environmental surveillance data.
Bayesian hierarchical models offer a principled solution via partial pooling: each county’s estimate is shrunk toward the global mean proportionally to the uncertainty in that county’s local data. Counties with large samples retain estimates close to their observed means; counties with few observations borrow strength from the broader state-level distribution. This is particularly valuable in the present setting, where sample sizes across Minnesota’s 85 counties range from fewer than 5 to more than 80 measurements.
This project pursues four primary research questions:
The dataset is the Minnesota subset of the State Radon Residential
Survey (srrs2), analyzed extensively by Gelman & Hill
(2007) as a textbook case study for hierarchical modeling. The
individual-level file is joined with a county-level file containing
county uranium concentrations (Uppm) from the National
Uranium Resource Evaluation (NURE) survey. After restricting to
Minnesota households and deduplicating on household identifier, the
working dataset contains 919 observations from
85 counties.
| Variable | Type | Description |
|---|---|---|
activity |
Continuous | Radon concentration (pCi/L); raw outcome |
log_radon |
Continuous | \(\log(\max(\texttt{activity}, 0.1))\); modeled outcome |
floor |
Binary | 0 = basement, 1 = first floor |
county |
Categorical | County identifier (85 levels) |
Uppm |
Continuous | County uranium (ppm) |
log_u |
Continuous | \(\log(\texttt{Uppm})\); county-level covariate |
# Install any missing packages automatically
packages <- c("tidyverse", "ggplot2", "brms", "bayesplot",
"loo", "posterior", "patchwork", "knitr", "kableExtra")
installed <- rownames(installed.packages())
for (pkg in packages) {
if (!pkg %in% installed)
install.packages(pkg, repos = "https://cloud.r-project.org")
}
library(tidyverse)
library(ggplot2)
library(brms)
library(bayesplot)
library(loo)
library(posterior)
library(patchwork)
library(knitr)
library(kableExtra)
theme_set(theme_bw(base_size = 12))
# Load srrs2 (individual homes) and cty (county uranium) from Gelman/Hill mirrors
srrs2_url <- paste0("https://raw.githubusercontent.com/pymc-devs/",
"pymc-examples/main/examples/data/srrs2.dat")
cty_url <- paste0("https://raw.githubusercontent.com/pymc-devs/",
"pymc-examples/main/examples/data/cty.dat")
srrs2 <- read.csv(srrs2_url, header = TRUE, strip.white = TRUE)
cty <- read.csv(cty_url, header = TRUE, strip.white = TRUE)
# Filter to Minnesota and compute FIPS codes for merging
mn_df <- srrs2 %>% filter(state == "MN") %>%
mutate(fips = stfips * 1000 + cntyfips)
cty_mn <- cty %>%
filter(st == "MN") %>%
mutate(fips = stfips * 1000 + ctfips)
# Merge uranium data; deduplicate; create analysis variables
radon_df <- mn_df %>%
left_join(cty_mn %>% select(fips, Uppm), by = "fips") %>%
distinct(idnum, .keep_all = TRUE) %>%
mutate(
radon = activity,
log_radon = log(ifelse(radon == 0, 0.1, radon)),
log_u = log(Uppm),
floor_f = factor(floor, levels = c(0, 1),
labels = c("Basement", "First Floor")),
county = str_to_title(str_trim(county))
) %>%
filter(!is.na(radon))
cat("Observations:", nrow(radon_df), "\n")
## Observations: 919
cat("Counties: ", n_distinct(radon_df$county), "\n")
## Counties: 85
summary_stats <- radon_df %>%
summarise(
N = n(),
Mean = round(mean(radon), 2),
Median = round(median(radon), 2),
SD = round(sd(radon), 2),
Min = round(min(radon), 2),
Max = round(max(radon), 2),
`% Above 4 pCi/L` = round(mean(radon > 4) * 100, 1)
)
kable(summary_stats,
caption = "Summary statistics for radon concentration (pCi/L).",
booktabs = TRUE, align = "r") %>%
kable_styling(latex_options = c("hold_position"), full_width = FALSE)
| N | Mean | Median | SD | Min | Max | % Above 4 pCi/L |
|---|---|---|---|---|---|---|
| 919 | 4.77 | 3.6 | 4.48 | 0 | 48.2 | 44.2 |
Approximately 44.2% of sampled homes exceed the EPA action level of 4 pCi/L.
p1 <- ggplot(radon_df, aes(x = radon)) +
geom_histogram(bins = 40, fill = "#4878CF", color = "white", alpha = 0.85) +
geom_vline(xintercept = 4, linetype = "dashed", color = "red", linewidth = 0.8) +
annotate("text", x = 5, y = Inf, label = "EPA limit (4 pCi/L)",
hjust = 0, vjust = 1.8, color = "red", size = 3) +
labs(title = "Raw Radon", x = "Radon (pCi/L)", y = "Count")
p2 <- ggplot(radon_df, aes(x = log_radon)) +
geom_histogram(bins = 40, fill = "#6ACC65", color = "white", alpha = 0.85) +
labs(title = "Log Radon", x = "log(Radon)", y = "Count")
p3 <- ggplot(radon_df, aes(x = floor_f, y = log_radon, fill = floor_f)) +
geom_boxplot(alpha = 0.8, outlier.alpha = 0.4) +
scale_fill_manual(values = c("#4878CF", "#D65F5F")) +
labs(title = "Log Radon by Floor", x = NULL, y = "log(Radon)", fill = NULL) +
theme(legend.position = "none")
p4 <- ggplot(radon_df, aes(x = log_u, y = log_radon)) +
geom_point(alpha = 0.25, size = 0.8, color = "#555555") +
geom_smooth(method = "lm", se = TRUE, color = "#D65F5F") +
labs(title = "Log Radon vs. Log Uranium",
x = "log(Uranium, ppm)", y = "log(Radon)")
(p1 | p2) / (p3 | p4)
Top: histograms of raw (left) and log-transformed (right) radon. The raw distribution is strongly right-skewed; log-transformation produces approximate symmetry. Bottom: log radon by floor level (left) and vs. log uranium (right). Basement measurements are systematically higher and county uranium shows a positive association with radon.
county_summary <- radon_df %>%
group_by(county) %>%
summarise(
n = n(),
mean_log_radon = round(mean(log_radon), 3),
sd_log_radon = round(sd(log_radon), 3),
pct_above_4 = round(mean(radon > 4) * 100, 1),
.groups = "drop"
) %>%
arrange(desc(n))
# Add county size group back to main data
radon_df <- radon_df %>%
left_join(county_summary %>% select(county, n), by = "county") %>%
mutate(county_size = cut(n,
breaks = c(0, 5, 20, Inf),
labels = c("Small (≤5)", "Medium (6–20)", "Large (>20)")))
kable(head(county_summary, 10),
caption = "Top 10 counties by sample size.",
col.names = c("County", "n", "Mean log(Radon)", "SD log(Radon)", "% > 4 pCi/L"),
booktabs = TRUE) %>%
kable_styling(latex_options = c("hold_position"), full_width = FALSE)
| County | n | Mean log(Radon) | SD log(Radon) | % > 4 pCi/L |
|---|---|---|---|---|
| St Louis | 116 | 0.768 | 0.800 | 18.1 |
| Hennepin | 105 | 1.285 | 0.710 | 44.8 |
| Dakota | 63 | 1.293 | 0.752 | 52.4 |
| Anoka | 52 | 0.833 | 0.770 | 26.9 |
| Washington | 46 | 1.251 | 0.797 | 45.7 |
| Ramsey | 32 | 1.091 | 0.668 | 31.2 |
| Stearns | 25 | 1.377 | 0.702 | 52.0 |
| Olmsted | 23 | 1.213 | 0.761 | 43.5 |
| Blue Earth | 14 | 1.909 | 0.553 | 85.7 |
| Clay | 14 | 1.783 | 1.010 | 57.1 |
top20 <- county_summary %>% slice_head(n = 20) %>% pull(county)
radon_df %>%
filter(county %in% top20) %>%
mutate(county = fct_reorder(county, log_radon, median)) %>%
ggplot(aes(x = county, y = log_radon)) +
geom_boxplot(fill = "#4878CF", alpha = 0.7, outlier.size = 0.8) +
coord_flip() +
labs(title = "Log Radon by County (Top 20 by Sample Size)",
x = NULL, y = "log(Radon)")
Log radon distributions for the 20 most-sampled counties, ordered by median. Substantial between-county variation motivates a hierarchical model.
ggplot(radon_df, aes(x = county_size, y = log_radon, fill = county_size)) +
geom_boxplot(alpha = 0.75) +
scale_fill_manual(values = c("#D65F5F", "#F4A736", "#4878CF")) +
labs(title = "Log Radon by County Sample-Size Group",
x = "County Sample-Size Group", y = "log(Radon)", fill = NULL) +
theme(legend.position = "none")
Log radon by county sample-size group. Counties with five or fewer observations show much wider spread, reflecting estimation instability rather than true population variation—the key motivation for partial pooling.
Let \(y_{ij}\) denote the radon measurement (pCi/L) for home \(i\) in county \(j\), \(i = 1, \ldots, n_j\), \(j = 1, \ldots, J\), where \(J = 85\) and \(N = 919\). Let \(x_{ij} \in \{0,1\}\) denote the floor indicator and \(u_j\) denote \(\log(\text{uranium})\) for county \(j\).
Pooled model (F1) ignores county structure entirely: \[\log(y_{ij}) = \beta_0 + \varepsilon_{ij}, \quad \varepsilon_{ij} \overset{\text{iid}}{\sim} \mathcal{N}(0, \sigma^2).\]
Unpooled model (F2) estimates a separate intercept per county: \[\log(y_{ij}) = \alpha_j + \varepsilon_{ij}.\] This uses 85 degrees of freedom for county effects alone and can be wildly unstable in sparse counties.
All three Bayesian models share the within-county likelihood \[\log(y_{ij}) \mid \mu_{ij}, \sigma \sim \mathcal{N}(\mu_{ij}, \sigma^2),\] with county means following a common hyperprior distribution.
Model B1 — Varying intercept, no covariates: \[\mu_{ij} = \alpha_j, \quad \alpha_j \overset{\text{iid}}{\sim} \mathcal{N}(\mu_0, \tau^2), \quad \mu_0 \sim \mathcal{N}(0, 10), \quad \sigma, \tau \sim \text{Half-}\mathcal{N}(0, 5).\]
Model B2 — Varying intercept + floor: \[\mu_{ij} = \alpha_j + \beta_1 x_{ij}, \quad \alpha_j \overset{\text{iid}}{\sim} \mathcal{N}(\mu_0, \tau^2), \quad \mu_0 \sim \mathcal{N}(0,10), \quad \beta_1 \sim \mathcal{N}(0,5), \quad \sigma, \tau \sim \text{Half-}\mathcal{N}(0,5).\]
Model B3 — Varying intercept + floor + log(uranium): \[\mu_{ij} = \alpha_j + \beta_1 x_{ij} + \beta_2 u_j, \quad \alpha_j \overset{\text{iid}}{\sim} \mathcal{N}(\mu_0, \tau^2), \quad \mu_0, \beta_1, \beta_2 \sim \mathcal{N}(0,5), \quad \sigma, \tau \sim \text{Half-}\mathcal{N}(0,5).\]
The \(\mathcal{N}(0,10)\) prior on \(\mu_0\) is weakly informative on the log scale, spanning a wide range of physically plausible radon concentrations. The \(\text{Half-}\mathcal{N}(0,5)\) priors on \(\sigma\) and \(\tau\) place most mass on moderate values while allowing larger values if warranted; they prevent the pathological behavior of flat priors on variance components. The \(\mathcal{N}(0,5)\) priors on regression coefficients are similarly weakly informative.
# Pooled
fit_pool <- lm(log_radon ~ 1, data = radon_df)
# Unpooled
fit_unpool <- lm(log_radon ~ county - 1, data = radon_df)
# With floor
fit_floor <- lm(log_radon ~ floor, data = radon_df)
pool_coef <- coef(summary(fit_pool))
floor_coef <- coef(summary(fit_floor))
kable(
data.frame(
Model = c("Pooled (intercept only)", "Floor covariate"),
Intercept = c(round(pool_coef[1,1], 3), round(floor_coef[1,1], 3)),
`Floor Coef` = c("—", round(floor_coef[2,1], 3)),
`Residual SD` = c(round(sigma(fit_pool), 3), round(sigma(fit_floor), 3))
),
caption = "Frequentist baseline estimates.",
booktabs = TRUE, check.names = FALSE
) %>%
kable_styling(latex_options = "hold_position", full_width = FALSE)
| Model | Intercept | Floor.Coef | Residual.SD |
|---|---|---|---|
| Pooled (intercept only) | 1.225 | — | 0.853 |
| Floor covariate | 1.327 | -0.613 | 0.823 |
# Save unpooled county estimates for shrinkage plot later
county_fe <- tibble(
county = str_remove(names(coef(fit_unpool)), "^county"),
est_unpool = coef(fit_unpool)
)
All models are fitted with brms (Bürkner, 2017), which
compiles Stan code automatically. Each model runs 4 chains ×
4000 iterations (2000 warmup), yielding 8000 post-warmup draws.
The file argument caches fitted models so re-knitting is
fast.
SEED <- 2024
CORES <- min(parallel::detectCores(), 4)
ITER <- 4000
CHAINS <- 4
priors_base <- c(
prior(normal(0, 10), class = "Intercept"),
prior(normal(0, 5), class = "sd"),
prior(normal(0, 5), class = "sigma")
)
priors_cov <- c(
priors_base,
prior(normal(0, 5), class = "b")
)
fit_b1 <- brm(
log_radon ~ 1 + (1 | county),
data = radon_df,
family = gaussian(),
prior = priors_base,
chains = CHAINS, iter = ITER,
seed = SEED, cores = CORES,
file = "fit_b1", file_refit = "on_change"
)
fit_b2 <- brm(
log_radon ~ floor + (1 | county),
data = radon_df,
family = gaussian(),
prior = priors_cov,
chains = CHAINS, iter = ITER,
seed = SEED, cores = CORES,
file = "fit_b2", file_refit = "on_change"
)
fit_b3 <- brm(
log_radon ~ floor + log_u + (1 | county),
data = radon_df,
family = gaussian(),
prior = priors_cov,
chains = CHAINS, iter = ITER,
seed = SEED, cores = CORES,
file = "fit_b3", file_refit = "on_change"
)
rhats <- rhat(fit_b2)
ess <- neff_ratio(fit_b2)
cat("R-hat range (should be < 1.01):", round(min(rhats), 4), "–",
round(max(rhats), 4), "\n")
## R-hat range (should be < 1.01): 0.9996 – 1.0042
cat("ESS ratio range (should be > 0.1):", round(min(ess), 3), "–",
round(max(ess), 3), "\n")
## ESS ratio range (should be > 0.1): 0.23 – 0.905
All \(\hat{R} < 1.01\) and bulk ESS ratios exceed 0.10, confirming adequate convergence and reliable posterior samples for all three models.
mcmc_trace(fit_b2,
pars = c("b_Intercept", "b_floor",
"sd_county__Intercept", "sigma")) +
ggtitle("Trace Plots – Model B2 (Varying Intercept + Floor)")
Trace plots for the four key parameters of Model B2. The characteristic ‘fuzzy caterpillar’ pattern across all four chains confirms good mixing and convergence.
pp_check(fit_b2, ndraws = 100) +
labs(title = "Posterior Predictive Check – Model B2",
x = "log(Radon)")
Posterior predictive check for Model B2. The dark line is the observed log-radon density; lighter lines are 100 replicated datasets from the posterior predictive distribution. The model captures the bulk of the data well.
loo_b1 <- loo(fit_b1)
loo_b2 <- loo(fit_b2)
loo_b3 <- loo(fit_b3)
loo_comp <- as.data.frame(loo_compare(loo_b1, loo_b2, loo_b3))
kable(
loo_comp %>%
select(elpd_diff, se_diff, elpd_loo, se_elpd_loo) %>%
round(1),
caption = "LOO-CV model comparison. Models are ordered by ELPD (higher = better). Differences are relative to the best model.",
col.names = c("ΔELPD", "SE(ΔELPD)", "ELPD", "SE(ELPD)"),
booktabs = TRUE
) %>%
kable_styling(latex_options = "hold_position", full_width = FALSE)
| ΔELPD | SE(ΔELPD) | ELPD | SE(ELPD) | |
|---|---|---|---|---|
| fit_b3 | 0.0 | 0.0 | -1064.8 | 29.3 |
| fit_b2 | -9.2 | 5.3 | -1074.1 | 28.2 |
| fit_b1 | -56.1 | 11.9 | -1121.0 | 28.7 |
Model B3 (varying intercept + floor + uranium) achieves the best expected log predictive density and is selected as the preferred model.
fe <- fixef(fit_b2) %>%
as.data.frame() %>%
rownames_to_column("Parameter") %>%
mutate(across(where(is.numeric), ~ round(., 3))) %>%
mutate(Parameter = recode(Parameter,
"Intercept" = "μ₀ (grand mean)",
"floor" = "β₁ (floor effect)"
))
# Append random-effect SDs
re_sd <- VarCorr(fit_b2)
tau_row <- data.frame(
Parameter = "τ (county SD)",
Estimate = round(re_sd$county$sd[1], 3),
Est.Error = round(re_sd$county$sd[2], 3),
Q2.5 = round(re_sd$county$sd[3], 3),
Q97.5 = round(re_sd$county$sd[4], 3)
)
sig_row <- data.frame(
Parameter = "σ (residual SD)",
Estimate = round(re_sd$residual__$sd[1], 3),
Est.Error = round(re_sd$residual__$sd[2], 3),
Q2.5 = round(re_sd$residual__$sd[3], 3),
Q97.5 = round(re_sd$residual__$sd[4], 3)
)
fe_full <- bind_rows(fe, tau_row, sig_row)
kable(fe_full,
caption = "Posterior estimates for Model B2. CI95: 95% credible interval.",
col.names = c("Parameter", "Estimate", "Est. Error", "CI95 Lower", "CI95 Upper"),
booktabs = TRUE) %>%
kable_styling(latex_options = "hold_position", full_width = FALSE)
| Parameter | Estimate | Est. Error | CI95 Lower | CI95 Upper |
|---|---|---|---|---|
| μ₀ (grand mean) | 1.462 | 0.052 | 1.361 | 1.564 |
| β₁ (floor effect) | -0.693 | 0.071 | -0.832 | -0.555 |
| τ (county SD) | 0.334 | 0.047 | 0.249 | 0.432 |
| σ (residual SD) | 0.757 | 0.018 | 0.723 | 0.793 |
The floor coefficient \(\hat{\beta}_1 \approx -0.67\) implies that first-floor measurements are approximately \(e^{0.67} \approx 1.95\) times lower than basement measurements on the pCi/L scale, after adjusting for county effects.
ranef_b2 <- ranef(fit_b2)$county[, , "Intercept"] %>%
as.data.frame() %>%
rownames_to_column("county") %>%
rename(re_est = Estimate, re_lo = Q2.5, re_hi = Q97.5)
fixef_int <- fixef(fit_b2)["Intercept", "Estimate"]
county_bayes <- ranef_b2 %>%
mutate(bayes_est = fixef_int + re_est)
comparison_df <- county_summary %>%
left_join(county_fe, by = "county") %>%
left_join(county_bayes %>% select(county, bayes_est), by = "county") %>%
filter(!is.na(est_unpool))
ggplot(comparison_df, aes(x = est_unpool, y = bayes_est, size = n)) +
geom_point(alpha = 0.55, color = "#4878CF") +
geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "gray40") +
geom_hline(yintercept = fixef_int, linetype = "dotted",
color = "#D65F5F", linewidth = 0.8) +
annotate("text",
x = min(comparison_df$est_unpool, na.rm = TRUE),
y = fixef_int + 0.06,
label = "Grand mean (pooled)",
hjust = 0, color = "#D65F5F", size = 3.2) +
scale_size_continuous(name = "Sample size", range = c(1, 9)) +
labs(title = "Partial Pooling: Unpooled vs. Bayesian County Estimates",
x = "Unpooled (Fixed-Effects) Estimate",
y = "Bayesian Hierarchical Estimate") +
theme(legend.position = "right")
Partial pooling illustration. Each point is one county; point size encodes sample size. The dashed diagonal is the line of equality. Counties with small samples (small points) are pulled strongly toward the global mean (red dotted line); large counties retain estimates close to their observed unpooled values.
ranef_b2 %>%
mutate(county = fct_reorder(county, re_est)) %>%
ggplot(aes(x = county, y = re_est, ymin = re_lo, ymax = re_hi)) +
geom_hline(yintercept = 0, linetype = "dashed", color = "gray50") +
geom_pointrange(color = "#4878CF", fatten = 0.9, linewidth = 0.4) +
coord_flip() +
labs(title = "County Random Effects – Model B2",
subtitle = "Posterior median with 95% credible intervals",
x = NULL, y = "Random Effect (log scale)") +
theme(axis.text.y = element_text(size = 6))
Posterior median and 95% credible intervals for all 85 county random effects in Model B2. Counties are sorted by median random effect. Sparse counties show wide intervals; well-sampled counties show narrow intervals.
# Posterior predictive draws for a new, typical home (marginalizing over counties)
pred_base <- posterior_predict(
fit_b2,
newdata = data.frame(floor = 0, log_u = 0),
re_formula = NA,
ndraws = 4000,
allow_new_levels = TRUE
) %>% as.vector()
pred_ff <- posterior_predict(
fit_b2,
newdata = data.frame(floor = 1, log_u = 0),
re_formula = NA,
ndraws = 4000,
allow_new_levels = TRUE
) %>% as.vector()
prob_base <- round(mean(exp(pred_base) > 4) * 100, 1)
prob_ff <- round(mean(exp(pred_ff) > 4) * 100, 1)
cat("Pr(radon > 4 pCi/L | basement): ", prob_base, "%\n")
## Pr(radon > 4 pCi/L | basement): 53.5 %
cat("Pr(radon > 4 pCi/L | first floor): ", prob_ff, "%\n")
## Pr(radon > 4 pCi/L | first floor): 20.9 %
bind_rows(
tibble(log_pred = pred_base, Floor = "Basement"),
tibble(log_pred = pred_ff, Floor = "First Floor")
) %>%
ggplot(aes(x = log_pred, fill = Floor)) +
geom_density(alpha = 0.60) +
geom_vline(xintercept = log(4), linetype = "dashed",
color = "red", linewidth = 0.8) +
annotate("text", x = log(4) + 0.08, y = Inf,
label = "EPA limit", hjust = 0, vjust = 1.8,
color = "red", size = 3.2) +
scale_fill_manual(values = c("#4878CF", "#D65F5F")) +
labs(title = "Posterior Predictive Distribution by Floor Level",
x = "log(Radon)", y = "Density", fill = "Floor")
Posterior predictive distributions for log radon in a new, typical Minnesota home, by floor level. The red dashed line marks the EPA action level (log(4) ≈ 1.39). The basement distribution has substantially more mass above the threshold.
The model estimates that approximately 53.5% of basement homes and 20.9% of first-floor homes in Minnesota exceed the EPA action level of 4 pCi/L.
This analysis yields three key results:
Radon is a genuine public-health concern in Minnesota. An estimated 44.2% of sampled homes, and approximately 53.5% of basement homes, exceed the EPA action level of 4 pCi/L.
Floor level is the most important individual-level predictor. The posterior estimate \(\hat{\beta}_1 \approx -0.67\) implies first-floor measurements are roughly twice lower than basement measurements on the concentration scale, consistent with radon’s physical entry mechanism through foundations.
Bayesian hierarchical modeling improves estimation in sparse counties. Shrinkage of county estimates toward the global mean provides more defensible inference than unpooled fixed-effects estimates, which can be highly unstable when county sample sizes are small (see shrinkage plot, Figure 6).
The pooled model (F1) ignores all county heterogeneity and underestimates regional variability. The unpooled model (F2) uses 85 separate intercepts, producing stable estimates for large counties but unreliable estimates for small ones; it also provides no mechanism for out-of-sample prediction to new counties.
The Bayesian hierarchical model occupies the principled middle ground: it pools information across counties in proportion to local sample size and provides a fully probabilistic characterization of uncertainty at every level of the hierarchy, including for hypothetical new counties via the hyperprior.
Promising extensions include: (a) a varying-slope model allowing the floor effect to differ by county; (b) a spatial hierarchical model using county adjacency to borrow strength geographically; (c) additional home-level predictors such as house age or construction type; and (d) a decision-analytic framework to estimate the cost-effectiveness of targeted mitigation outreach to high-probability counties.
This project applied Bayesian hierarchical modeling to the Minnesota radon dataset to estimate the distribution of residential radon concentrations and quantify exceedance of the EPA action level. The analysis demonstrated that:
The Bayesian hierarchical framework examined here provides both methodological flexibility and interpretable, uncertainty-quantified results, making it a valuable tool for environmental health surveillance more broadly.
Bürkner, P.-C. (2017). brms: An R package for Bayesian multilevel models using Stan. Journal of Statistical Software, 80(1), 1–28.
Carpenter, B., Gelman, A., Hoffman, M. D., Lee, D., Goodrich, B., Betancourt, M., Brubaker, M., Guo, J., Li, P., & Riddell, A. (2017). Stan: A probabilistic programming language. Journal of Statistical Software, 76(1), 1–32.
Gelman, A., & Hill, J. (2007). Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press.
Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., & Rubin, D. B. (2013). Bayesian Data Analysis (3rd ed.). CRC Press.
Price, P. N., Nero, A. V., & Gelman, A. (1996). Bayesian prediction of mean indoor radon concentrations for Minnesota counties. Health Physics, 71(6), 922–936.
R Core Team (2024). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing. https://www.R-project.org/
U.S. Environmental Protection Agency (2023). Radon: Health Risks. https://www.epa.gov/radon/health-risk-radon