This document reproduces the Bayesian model from Mitchell et
al. (2019) to estimate the probability of Zika virus infection
(P_Z) and the symptomatic proportion (P_SZ)
from serosurvey data. The study used data from three locations (Yap,
French Polynesia, Puerto Rico) and modeled the impact of test
sensitivity (sen), specificity (spe), and
symptom overlap (gamma). Two cases are implemented:
sen and spe
values (0.7 to 1.0, step 0.05), with gamma = 1 (allowing
symptom overlap).sen and
spe with beta priors, as in the study’s “imperfect test
with symptom overlap” scenario.The model uses Bayesian inference to estimate
parameters from observed serosurvey counts (T+S+,
T+S-, T-S+, T-S-), accounting for
imperfect testing and background symptoms (P_SB). The
likelihood is multinomial, with probabilities adjusted
for sen, spe, and gamma
(interaction between Zika symptoms and background symptoms).
Markov Chain Monte Carlo (MCMC) via JAGS generates
posterior samples, and the coda package computes medians
and 95% credible intervals (CrIs), which represent the range containing
95% of the posterior probability.
JAGS (Just Another Gibbs Sampler): JAGS is a software tool that performs MCMC simulations, specifically using Gibbs sampling (a type of MCMC where parameters are updated one at a time, conditioned on the others). The study uses JAGS to run MCMC simulations, implemented through R with the rjags package.
Coda Package: The coda package in R is used to analyze the output of MCMC simulations, summarizing results (e.g., medians, credible intervals) and checking convergence (ensuring the MCMC chains have stabilized).
Key Parameters: - P_Z: Probability of
Zika infection. - P_SZ: Proportion of infected individuals
who are symptomatic. - P_SB: Proportion with background
symptoms (not Zika-related). - sen, spe: Test
sensitivity and specificity. - gamma: 1 for symptom
overlap, 0 for mutually exclusive symptoms.
We define the serosurvey data and parameter grids for Case 1 and 2.
The datasets contains observed counts of T+S+,
T+S-, T-S+, T-S-, respectively
(extracted from Table 1 of the study).
# Observed data from serosurveys
datasets <- list(
Yap = c(156, 258, 27, 116), # T+S+, T+S-, T-S+, T-S-
"French Polynesia" = c(55, 42, 27, 72), # French Polynesia
"Puerto Rico" = c(48, 49, 35, 235) # Puerto Rico
)
This chunk defines the run_model function for Case 1,
using fixed sen and spe values.
# Sensitivity and specificity arrays for Case 1 (grid search) (paragraph 1 of Results and Figure 1)
sensitivities <- seq(0.7, 1, by = 0.05)
specificities <- seq(0.7, 1, by = 0.05)
# model
run_model <- function(data_vector, gamma_val, sen_val, spe_val) {
jags_data <- list(
x = data_vector,
n = sum(data_vector),
gamma = gamma_val,
sen = sen_val,
spe = spe_val
)
model_string <- "
model {
x[1:4] ~ dmultinom(p[1:4], n)
p1 <- sen * P_Z * (P_SZ + P_SB - gamma * P_SZ * P_SB) + (1 - spe) * (1 - P_Z) * P_SB
p2 <- sen * P_Z * (1 - P_SZ - P_SB + gamma * P_SZ * P_SB) + (1 - spe) * (1 - P_Z) * (1 - P_SB)
p3 <- spe * (1 - P_Z) * P_SB + (1 - sen) * P_Z * (P_SZ + P_SB - gamma * P_SZ * P_SB)
p4 <- spe * (1 - P_Z) * (1 - P_SB) + (1 - sen) * P_Z * (1 - P_SZ - P_SB + gamma * P_SZ * P_SB)
total <- p1 + p2 + p3 + p4
p[1] <- p1 / total
p[2] <- p2 / total
p[3] <- p3 / total
p[4] <- p4 / total
P_Z ~ dbeta(1, 1)
P_SZ ~ dbeta(1, 1)
P_SB ~ dbeta(1, 1)
}
"
inits <- function() list(P_Z = runif(1), P_SZ = runif(1), P_SB = runif(1))
model <- jags.model(textConnection(model_string), data = jags_data, inits = inits, n.chains = 3, quiet = TRUE)
update(model, 2000)
samples <- coda.samples(model, c("P_Z", "P_SZ", "P_SB"), n.iter = 5000)
summary_stats <- summary(samples)
estimates <- summary_stats$quantiles[, c("50%", "2.5%", "97.5%")]
colnames(estimates) <- c("Median", "CrI_Lower", "CrI_Upper")
out <- c(estimates["P_Z", ], estimates["P_SZ", ], estimates["P_SB", ])
names(out) <- c("P_Z_Median", "P_Z_CrI_Lower", "P_Z_CrI_Upper",
"P_SZ_Median", "P_SZ_CrI_Lower", "P_SZ_CrI_Upper",
"P_SB_Median", "P_SB_CrI_Lower", "P_SB_CrI_Upper")
return(out)
}
Explanation: - We test 49 combinations of
sen and spe (7 × 7 grid) per location,
matching the study’s exploration of test performance impact. -
Data Input: jags_data includes the
observed counts (x), total sample size (n),
gamma (1 for symptom overlap), and fixed sen
and spe. - Model String: Defines a
multinomial likelihood where x[1:4] (counts for
T+S+, T+S-, T-S+, T-S-) follows probabilities
p[1:4], adjusted for test errors and symptom overlap.
Priors are uniform (dbeta(1,1)) for P_Z,
P_SZ, P_SB. dbeta(1,1) denotes
beta distribution and when its parameters
(alpha, beta) = (1,1), it becomes uniform distribution
between 0 and 1. - MCMC
Setup: Uses jags.model to compile the model,
update for a 2,000-iteration burn-in, and
coda.samples to generate 5,000 samples per chain (15,000
total samples per parameter). Burn-in means we forget initial iterations
to remove bias towards intial condition we start our iterations
with.Then, there are 3 independent chains of iterations, and we take
median (or average) of all the three chains. - Output:
Returns a named vector with medians and 95% CrIs for P_Z,
P_SZ, P_SB, computed using
summary(samples)$quantiles. - Theory: The
Bayesian approach combines the likelihood with priors to form the
posterior. MCMC (Gibbs sampling via JAGS) approximates the posterior by
sampling from conditional distributions. The 95% CrI is the 2.5th to
97.5th percentiles of the posterior samples, reflecting uncertainty.
This chunk runs the model for all combinations of locations, sensitivities, and specificities.
results_case1 <- expand.grid(
Location = names(datasets),
Sensitivity = sensitivities,
Specificity = specificities,
stringsAsFactors = FALSE
) %>%
rowwise() %>%
mutate(
model_output = list(
tryCatch(
run_model(
data_vector = datasets[[Location]],
gamma_val = 1,
sen_val = Sensitivity,
spe_val = Specificity
),
error = function(e) {
rep(NA, 9) %>% setNames(
c("P_Z_Median", "P_Z_CrI_Lower", "P_Z_CrI_Upper",
"P_SZ_Median", "P_SZ_CrI_Lower", "P_SZ_CrI_Upper",
"P_SB_Median", "P_SB_CrI_Lower", "P_SB_CrI_Upper")
)
}
)
)
) %>%
unnest_wider(model_output) %>%
filter(!is.na(P_Z_Median) & !is.na(P_SZ_Median) & !is.na(P_SB_Median)) %>%
mutate(Location = factor(Location, levels = c("Yap", "French Polynesia", "Puerto Rico")))
Explanation: - Grid Creation:
expand.grid generates a dataframe with 147 rows (3
locations × 7 sensitivities × 7 specificities), covering all
combinations. - Model Execution:
rowwise() %>% mutate applies run_model to
each row, storing results in the model_output list-column.
- Error Handling: tryCatch returns NA
values if run_model fails (e.g., due to convergence
issues), ensuring the script continues. - Data
Expansion: unnest_wider converts
model_output into columns (P_Z_Median, etc.),
creating a clean dataframe. - Filtering: Removes rows
with NA results to ensure valid outputs. - Factor
Conversion: Sets Location as a factor for
consistent plotting. - Theory: This grid search mirrors
the study’s approach of testing multiple sen and
spe values to assess their impact on P_Z and
P_SZ, accounting for test imperfections.
This plot visualizes how P_SZ varies with
Specificity for different Sensitivity values
across locations.
ggplot(results_case1, aes(x = Specificity, y = P_SZ_Median, color = as.factor(Sensitivity))) +
geom_line(size = 1.1) +
facet_wrap(~ Location, scales = "free_y") +
scale_y_continuous(limits = c(0, 1), breaks = seq(0, 1, by = 0.2)) +
labs(
title = "Estimated Symptomatic Proportion (P_SZ) for Case 1",
x = "Specificity",
y = expression(P["SZ"]),
color = "Sensitivity",
fill = "Sensitivity"
) +
theme_minimal()
Compare it with the top row of the following figure:
This plot shows P_Z estimates across
Specificity and Sensitivity.
ggplot(results_case1, aes(x = Specificity, y = P_Z_Median, color = as.factor(Sensitivity))) +
geom_line(size = 1.1) +
facet_wrap(~ Location, scales = "free_y") +
scale_y_continuous(limits = c(0, 1), breaks = seq(0, 1, by = 0.2)) +
labs(
title = "Estimated Infection Prevalence (P_Z) for Case 1",
x = "Specificity",
y = expression(P["Z"]),
color = "Sensitivity",
fill = "Sensitivity"
) +
theme_minimal()
Compare the results with the bottom row of the Figure 1 of Mitchell et al..
This chunk defines the run_model_case2 function for Case
2, using beta priors for sen and spe.
run_model_case2 <- function(data_vector, gamma_val) {
jags_data <- list(
x = data_vector,
n = sum(data_vector),
gamma = gamma_val
)
model_string <- "
model {
x[1:4] ~ dmultinom(p[1:4], n)
p1 <- sen * P_Z * (P_SZ + P_SB - gamma * P_SZ * P_SB) + (1 - spe) * (1 - P_Z) * P_SB
p2 <- sen * P_Z * (1 - P_SZ - P_SB + gamma * P_SZ * P_SB) + (1 - spe) * (1 - P_Z) * (1 - P_SB)
p3 <- spe * (1 - P_Z) * P_SB + (1 - sen) * P_Z * (P_SZ + P_SB - gamma * P_SZ * P_SB)
p4 <- spe * (1 - P_Z) * (1 - P_SB) + (1 - sen) * P_Z * (1 - P_SZ - P_SB + gamma * P_SZ * P_SB)
total <- p1 + p2 + p3 + p4
p[1] <- p1 / total
p[2] <- p2 / total
p[3] <- p3 / total
p[4] <- p4 / total
P_Z ~ dbeta(1, 1)
P_SZ ~ dbeta(1, 1)
P_SB ~ dbeta(1, 1)
sen ~ dbeta(38.4, 2) # Median ~0.95, 95% CrI ~[0.85, 0.99]
spe ~ dbeta(45.6, 2) # Median ~0.96, 95% CrI ~[0.80, 1.00]
}
"
inits <- function() list(P_Z = runif(1), P_SZ = runif(1), P_SB = runif(1), sen = runif(1, 0.85, 0.99), spe = runif(1, 0.80, 1.0))
model <- jags.model(textConnection(model_string), data = jags_data, inits = inits, n.chains = 3, quiet = TRUE)
update(model, 2000)
samples <- coda.samples(model, c("P_Z", "P_SZ", "P_SB", "sen", "spe"), n.iter = 5000)
summary_stats <- summary(samples)
estimates <- summary_stats$quantiles[, c("50%", "2.5%", "97.5%")]
colnames(estimates) <- c("Median", "CrI_Lower", "CrI_Upper")
out <- c(estimates["P_Z", ], estimates["P_SZ", ], estimates["P_SB", ], estimates["sen", ], estimates["spe", ])
names(out) <- c("P_Z_Median", "P_Z_CrI_Lower", "P_Z_CrI_Upper",
"P_SZ_Median", "P_SZ_CrI_Lower", "P_SZ_CrI_Upper",
"P_SB_Median", "P_SB_CrI_Lower", "P_SB_CrI_Upper",
"sen_Median", "sen_CrI_Lower", "sen_CrI_Upper",
"spe_Median", "spe_CrI_Lower", "spe_CrI_Upper")
return(out)
}
Explanation: - Data Input: Excludes
sen_val and spe_val, as they are now random
variables. - Model String: Adds
sen ~ dbeta(38.4, 2) (equivalent to median ~ 0.95 , 95% CrI
~[0.85, 0.99], given in paragraph 3 of Results section of the study) and
spe ~ dbeta(45.6, 2) (equivalent to median ~0.96, 95% CrI
~[0.80, 1.00], given in paragraph 3 of Results section of the study) to
model test uncertainty, matching the study’s priors. -
MCMC: Samples sen and spe
alongside P_Z, P_SZ, P_SB,
producing 15,000 samples per parameter. - Output:
Includes medians and CrIs for all parameters. - Theory:
Incorporating priors for sen and spe increases
uncertainty in P_Z and P_SZ, as seen in the
study’s wider CrIs (e.g., P_SZ = 0.27 [0.15, 0.37] for
Yap).
This chunk runs the model for each location, as sen and
spe are now distributed.
results_case2 <- expand.grid(
Location = names(datasets),
stringsAsFactors = FALSE
) %>%
rowwise() %>%
mutate(
model_output = list(
tryCatch(
run_model_case2(
data_vector = datasets[[Location]],
gamma_val = 1
),
error = function(e) {
rep(NA, 15) %>% setNames(
c("P_Z_Median", "P_Z_CrI_Lower", "P_Z_CrI_Upper",
"P_SZ_Median", "P_SZ_CrI_Lower", "P_SZ_CrI_Upper",
"P_SB_Median", "P_SB_CrI_Lower", "P_SB_CrI_Upper",
"sen_Median", "sen_CrI_Lower", "sen_CrI_Upper",
"spe_Median", "spe_CrI_Lower", "spe_CrI_Upper")
)
}
)
)
) %>%
unnest_wider(model_output) %>%
filter(!is.na(P_Z_Median) & !is.na(P_SZ_Median) & !is.na(P_SB_Median)) %>%
mutate(Location = factor(Location, levels = c("Yap", "French Polynesia", "Puerto Rico")))
Explanation: - Grid: Only iterates
over Location (3 rows), as sen and
spe are modeled with priors. - Execution and
Output: Similar to Case 1, but includes sen and
spe estimates. - Theory: This case
reflects the study’s “imperfect test with symptom overlap” scenario,
capturing real-world test uncertainty.
library(tibble)
library(knitr)
# Mitchell et al. published values
mitchell_psz <- c(0.27, 0.44, 0.50)
mitchell_lower <- c(0.15, 0.26, 0.34)
mitchell_upper <- c(0.37, 0.66, 0.92)
# Your estimated values
our_psz <- round(results_case2$P_SZ_Median, 3)
our_lower <- round(results_case2$P_SZ_CrI_Lower, 3)
our_upper <- round(results_case2$P_SZ_CrI_Upper, 3)
comparison <- tibble(
Location = c("Yap", "French Polynesia", "Puerto Rico"),
`Mitchell et al. (Median)` = mitchell_psz,
`Mitchell 95% CrI` = paste0("(", mitchell_lower, ", ", mitchell_upper, ")"),
`Our Estimate (Median)` = our_psz,
`Our 95% CrI` = paste0("(", our_lower, ", ", our_upper, ")")
)
kable(comparison, caption = "Comparison of Estimated Symptomatic Proportion (P_SZ)")
| Location | Mitchell et al. (Median) | Mitchell 95% CrI | Our Estimate (Median) | Our 95% CrI |
|---|---|---|---|---|
| Yap | 0.27 | (0.15, 0.37) | 0.261 | (0.153, 0.367) |
| French Polynesia | 0.44 | (0.26, 0.66) | 0.430 | (0.247, 0.588) |
| Puerto Rico | 0.50 | (0.34, 0.92) | 0.486 | (0.336, 0.743) |
This analysis successfully reproduces Mitchell et al. (2019),
estimating P_Z and P_SZ for three locations
under fixed and distributed sen and spe. Case
1 shows the impact of test performance, while Case 2 aligns with the
study’s main results, incorporating test uncertainty.
To increase accuracy, I have considered using 1,000,000 iterations with thinning (as in the study) instead of 5,000 iterations. Not much improvement has been observed with that.