Reproducing Mitchell et al. (2019): Estimating Zika Infection and Symptomatic Proportion

Introduction

This document reproduces the Bayesian model from Mitchell et al. (2019) to estimate the probability of Zika virus infection (P_Z) and the symptomatic proportion (P_SZ) from serosurvey data. The study used data from three locations (Yap, French Polynesia, Puerto Rico) and modeled the impact of test sensitivity (sen), specificity (spe), and symptom overlap (gamma). Two cases are implemented:

Case 1: Fixed sen and spe values (0.7 to 1.0, step 0.05), with gamma = 1 (allowing symptom overlap).
Case 2: Distributed sen and spe with beta priors, as in the study’s “imperfect test with symptom overlap” scenario.

Theoretical Background

The model uses Bayesian inference to estimate parameters from observed serosurvey counts (T+S+, T+S-, T-S+, T-S-), accounting for imperfect testing and background symptoms (P_SB). The likelihood is multinomial, with probabilities adjusted for sen, spe, and gamma (interaction between Zika symptoms and background symptoms). Markov Chain Monte Carlo (MCMC) via JAGS generates posterior samples, and the coda package computes medians and 95% credible intervals (CrIs), which represent the range containing 95% of the posterior probability.

JAGS (Just Another Gibbs Sampler): JAGS is a software tool that performs MCMC simulations, specifically using Gibbs sampling (a type of MCMC where parameters are updated one at a time, conditioned on the others). The study uses JAGS to run MCMC simulations, implemented through R with the rjags package.

Coda Package: The coda package in R is used to analyze the output of MCMC simulations, summarizing results (e.g., medians, credible intervals) and checking convergence (ensuring the MCMC chains have stabilized).

Key Parameters: - P_Z: Probability of Zika infection. - P_SZ: Proportion of infected individuals who are symptomatic. - P_SB: Proportion with background symptoms (not Zika-related). - sen, spe: Test sensitivity and specificity. - gamma: 1 for symptom overlap, 0 for mutually exclusive symptoms.

Data Preparation

We define the serosurvey data and parameter grids for Case 1 and 2. The datasets contains observed counts of T+S+, T+S-, T-S+, T-S-, respectively (extracted from Table 1 of the study).

# Observed data from serosurveys
datasets <- list(
  Yap = c(156, 258, 27, 116),               # T+S+, T+S-, T-S+, T-S-
  "French Polynesia" = c(55, 42, 27, 72),   # French Polynesia
  "Puerto Rico" = c(48, 49, 35, 235)        # Puerto Rico
)

Case 1: Fixed Sensitivity and Specificity

Model Setup

This chunk defines the run_model function for Case 1, using fixed sen and spe values.

# Sensitivity and specificity arrays for Case 1 (grid search) (paragraph 1 of Results and Figure 1)
sensitivities <- seq(0.7, 1, by = 0.05)
specificities <- seq(0.7, 1, by = 0.05)

# model
run_model <- function(data_vector, gamma_val, sen_val, spe_val) {
  jags_data <- list(
    x = data_vector,
    n = sum(data_vector),
    gamma = gamma_val,
    sen = sen_val,
    spe = spe_val
  )

  model_string <- "
  model {
    x[1:4] ~ dmultinom(p[1:4], n)
    p1 <- sen * P_Z * (P_SZ + P_SB - gamma * P_SZ * P_SB) + (1 - spe) * (1 - P_Z) * P_SB
    p2 <- sen * P_Z * (1 - P_SZ - P_SB + gamma * P_SZ * P_SB) + (1 - spe) * (1 - P_Z) * (1 - P_SB)
    p3 <- spe * (1 - P_Z) * P_SB + (1 - sen) * P_Z * (P_SZ + P_SB - gamma * P_SZ * P_SB)
    p4 <- spe * (1 - P_Z) * (1 - P_SB) + (1 - sen) * P_Z * (1 - P_SZ - P_SB + gamma * P_SZ * P_SB)
    total <- p1 + p2 + p3 + p4
    p[1] <- p1 / total
    p[2] <- p2 / total
    p[3] <- p3 / total
    p[4] <- p4 / total
    P_Z ~ dbeta(1, 1)
    P_SZ ~ dbeta(1, 1)
    P_SB ~ dbeta(1, 1)
  }
  "

  inits <- function() list(P_Z = runif(1), P_SZ = runif(1), P_SB = runif(1))
  model <- jags.model(textConnection(model_string), data = jags_data, inits = inits, n.chains = 3, quiet = TRUE)
  update(model, 2000)
  samples <- coda.samples(model, c("P_Z", "P_SZ", "P_SB"), n.iter = 5000)
  summary_stats <- summary(samples)
  estimates <- summary_stats$quantiles[, c("50%", "2.5%", "97.5%")]
  colnames(estimates) <- c("Median", "CrI_Lower", "CrI_Upper")

  out <- c(estimates["P_Z", ], estimates["P_SZ", ], estimates["P_SB", ])
  names(out) <- c("P_Z_Median", "P_Z_CrI_Lower", "P_Z_CrI_Upper",
                  "P_SZ_Median", "P_SZ_CrI_Lower", "P_SZ_CrI_Upper",
                  "P_SB_Median", "P_SB_CrI_Lower", "P_SB_CrI_Upper")

  return(out)
}

Explanation: - We test 49 combinations of sen and spe (7 × 7 grid) per location, matching the study’s exploration of test performance impact. - Data Input: jags_data includes the observed counts (x), total sample size (n), gamma (1 for symptom overlap), and fixed sen and spe. - Model String: Defines a multinomial likelihood where x[1:4] (counts for T+S+, T+S-, T-S+, T-S-) follows probabilities p[1:4], adjusted for test errors and symptom overlap. Priors are uniform (dbeta(1,1)) for P_Z, P_SZ, P_SB. dbeta(1,1) denotes beta distribution and when its parameters (alpha, beta) = (1,1), it becomes uniform distribution between 0 and 1. - MCMC Setup: Uses jags.model to compile the model, update for a 2,000-iteration burn-in, and coda.samples to generate 5,000 samples per chain (15,000 total samples per parameter). Burn-in means we forget initial iterations to remove bias towards intial condition we start our iterations with.Then, there are 3 independent chains of iterations, and we take median (or average) of all the three chains. - Output: Returns a named vector with medians and 95% CrIs for P_Z, P_SZ, P_SB, computed using summary(samples)$quantiles. - Theory: The Bayesian approach combines the likelihood with priors to form the posterior. MCMC (Gibbs sampling via JAGS) approximates the posterior by sampling from conditional distributions. The 95% CrI is the 2.5th to 97.5th percentiles of the posterior samples, reflecting uncertainty.

Parameter Grid Search

This chunk runs the model for all combinations of locations, sensitivities, and specificities.

results_case1 <- expand.grid(
  Location = names(datasets),
  Sensitivity = sensitivities,
  Specificity = specificities,
  stringsAsFactors = FALSE
) %>%
  rowwise() %>%
  mutate(
    model_output = list(
      tryCatch(
        run_model(
          data_vector = datasets[[Location]],
          gamma_val = 1,
          sen_val = Sensitivity,
          spe_val = Specificity
        ),
        error = function(e) {
          rep(NA, 9) %>% setNames(
            c("P_Z_Median", "P_Z_CrI_Lower", "P_Z_CrI_Upper",
              "P_SZ_Median", "P_SZ_CrI_Lower", "P_SZ_CrI_Upper",
              "P_SB_Median", "P_SB_CrI_Lower", "P_SB_CrI_Upper")
          )
        }
      )
    )
  ) %>%
  unnest_wider(model_output) %>%
  filter(!is.na(P_Z_Median) & !is.na(P_SZ_Median) & !is.na(P_SB_Median)) %>%
  mutate(Location = factor(Location, levels = c("Yap", "French Polynesia", "Puerto Rico")))

Explanation: - Grid Creation: expand.grid generates a dataframe with 147 rows (3 locations × 7 sensitivities × 7 specificities), covering all combinations. - Model Execution: rowwise() %>% mutate applies run_model to each row, storing results in the model_output list-column. - Error Handling: tryCatch returns NA values if run_model fails (e.g., due to convergence issues), ensuring the script continues. - Data Expansion: unnest_wider converts model_output into columns (P_Z_Median, etc.), creating a clean dataframe. - Filtering: Removes rows with NA results to ensure valid outputs. - Factor Conversion: Sets Location as a factor for consistent plotting. - Theory: This grid search mirrors the study’s approach of testing multiple sen and spe values to assess their impact on P_Z and P_SZ, accounting for test imperfections.

Plotting Results for Case 1

Symptomatic Proportion (P_SZ)

This plot visualizes how P_SZ varies with Specificity for different Sensitivity values across locations.

ggplot(results_case1, aes(x = Specificity, y = P_SZ_Median, color = as.factor(Sensitivity))) +
  geom_line(size = 1.1) +
  facet_wrap(~ Location, scales = "free_y") +
  scale_y_continuous(limits = c(0, 1), breaks = seq(0, 1, by = 0.2)) +
  labs(
    title = "Estimated Symptomatic Proportion (P_SZ) for Case 1",
    x = "Specificity",
    y = expression(P["SZ"]),
    color = "Sensitivity",
    fill = "Sensitivity"
  ) +
  theme_minimal()

Compare it with the top row of the following figure:

Figure 1 of Mitchell et al.

Infection Prevalence (P_Z)

This plot shows P_Z estimates across Specificity and Sensitivity.

ggplot(results_case1, aes(x = Specificity, y = P_Z_Median, color = as.factor(Sensitivity))) +
  geom_line(size = 1.1) +
  facet_wrap(~ Location, scales = "free_y") +
  scale_y_continuous(limits = c(0, 1), breaks = seq(0, 1, by = 0.2)) +
  labs(
    title = "Estimated Infection Prevalence (P_Z) for Case 1",
    x = "Specificity",
    y = expression(P["Z"]),
    color = "Sensitivity",
    fill = "Sensitivity"
  ) +
  theme_minimal()

Compare the results with the bottom row of the Figure 1 of Mitchell et al..

Case 2: Distributed Sensitivity and Specificity

Model Setup

This chunk defines the run_model_case2 function for Case 2, using beta priors for sen and spe.

run_model_case2 <- function(data_vector, gamma_val) {
  jags_data <- list(
    x = data_vector,
    n = sum(data_vector),
    gamma = gamma_val
  )

  model_string <- "
  model {
    x[1:4] ~ dmultinom(p[1:4], n)
    p1 <- sen * P_Z * (P_SZ + P_SB - gamma * P_SZ * P_SB) + (1 - spe) * (1 - P_Z) * P_SB
    p2 <- sen * P_Z * (1 - P_SZ - P_SB + gamma * P_SZ * P_SB) + (1 - spe) * (1 - P_Z) * (1 - P_SB)
    p3 <- spe * (1 - P_Z) * P_SB + (1 - sen) * P_Z * (P_SZ + P_SB - gamma * P_SZ * P_SB)
    p4 <- spe * (1 - P_Z) * (1 - P_SB) + (1 - sen) * P_Z * (1 - P_SZ - P_SB + gamma * P_SZ * P_SB)
    total <- p1 + p2 + p3 + p4
    p[1] <- p1 / total
    p[2] <- p2 / total
    p[3] <- p3 / total
    p[4] <- p4 / total
    P_Z ~ dbeta(1, 1)
    P_SZ ~ dbeta(1, 1)
    P_SB ~ dbeta(1, 1)
    sen ~ dbeta(38.4, 2)  # Median ~0.95, 95% CrI ~[0.85, 0.99]
    spe ~ dbeta(45.6, 2)  # Median ~0.96, 95% CrI ~[0.80, 1.00]
  }
  "

  inits <- function() list(P_Z = runif(1), P_SZ = runif(1), P_SB = runif(1), sen = runif(1, 0.85, 0.99), spe = runif(1, 0.80, 1.0))
  model <- jags.model(textConnection(model_string), data = jags_data, inits = inits, n.chains = 3, quiet = TRUE)
  update(model, 2000)
  samples <- coda.samples(model, c("P_Z", "P_SZ", "P_SB", "sen", "spe"), n.iter = 5000)
  summary_stats <- summary(samples)
  estimates <- summary_stats$quantiles[, c("50%", "2.5%", "97.5%")]
  colnames(estimates) <- c("Median", "CrI_Lower", "CrI_Upper")

  out <- c(estimates["P_Z", ], estimates["P_SZ", ], estimates["P_SB", ], estimates["sen", ], estimates["spe", ])
  names(out) <- c("P_Z_Median", "P_Z_CrI_Lower", "P_Z_CrI_Upper",
                  "P_SZ_Median", "P_SZ_CrI_Lower", "P_SZ_CrI_Upper",
                  "P_SB_Median", "P_SB_CrI_Lower", "P_SB_CrI_Upper",
                  "sen_Median", "sen_CrI_Lower", "sen_CrI_Upper",
                  "spe_Median", "spe_CrI_Lower", "spe_CrI_Upper")

  return(out)
}

Explanation: - Data Input: Excludes sen_val and spe_val, as they are now random variables. - Model String: Adds sen ~ dbeta(38.4, 2) (equivalent to median ~ 0.95 , 95% CrI ~[0.85, 0.99], given in paragraph 3 of Results section of the study) and spe ~ dbeta(45.6, 2) (equivalent to median ~0.96, 95% CrI ~[0.80, 1.00], given in paragraph 3 of Results section of the study) to model test uncertainty, matching the study’s priors. - MCMC: Samples sen and spe alongside P_Z, P_SZ, P_SB, producing 15,000 samples per parameter. - Output: Includes medians and CrIs for all parameters. - Theory: Incorporating priors for sen and spe increases uncertainty in P_Z and P_SZ, as seen in the study’s wider CrIs (e.g., P_SZ = 0.27 [0.15, 0.37] for Yap).

Parameter Grid Search

This chunk runs the model for each location, as sen and spe are now distributed.

results_case2 <- expand.grid(
  Location = names(datasets),
  stringsAsFactors = FALSE
) %>%
  rowwise() %>%
  mutate(
    model_output = list(
      tryCatch(
        run_model_case2(
          data_vector = datasets[[Location]],
          gamma_val = 1
        ),
        error = function(e) {
          rep(NA, 15) %>% setNames(
            c("P_Z_Median", "P_Z_CrI_Lower", "P_Z_CrI_Upper",
              "P_SZ_Median", "P_SZ_CrI_Lower", "P_SZ_CrI_Upper",
              "P_SB_Median", "P_SB_CrI_Lower", "P_SB_CrI_Upper",
              "sen_Median", "sen_CrI_Lower", "sen_CrI_Upper",
              "spe_Median", "spe_CrI_Lower", "spe_CrI_Upper")
          )
        }
      )
    )
  ) %>%
  unnest_wider(model_output) %>%
  filter(!is.na(P_Z_Median) & !is.na(P_SZ_Median) & !is.na(P_SB_Median)) %>%
  mutate(Location = factor(Location, levels = c("Yap", "French Polynesia", "Puerto Rico")))

Explanation: - Grid: Only iterates over Location (3 rows), as sen and spe are modeled with priors. - Execution and Output: Similar to Case 1, but includes sen and spe estimates. - Theory: This case reflects the study’s “imperfect test with symptom overlap” scenario, capturing real-world test uncertainty.

Summarising Results for Case 2

library(tibble)
library(knitr)

# Mitchell et al. published values
mitchell_psz <- c(0.27, 0.44, 0.50)
mitchell_lower <- c(0.15, 0.26, 0.34)
mitchell_upper <- c(0.37, 0.66, 0.92)

# Your estimated values
our_psz <- round(results_case2$P_SZ_Median, 3)
our_lower <- round(results_case2$P_SZ_CrI_Lower, 3)
our_upper <- round(results_case2$P_SZ_CrI_Upper, 3)

comparison <- tibble(
  Location = c("Yap", "French Polynesia", "Puerto Rico"),
  `Mitchell et al. (Median)` = mitchell_psz,
  `Mitchell 95% CrI` = paste0("(", mitchell_lower, ", ", mitchell_upper, ")"),
  `Our Estimate (Median)` = our_psz,
  `Our 95% CrI` = paste0("(", our_lower, ", ", our_upper, ")")
)

kable(comparison, caption = "Comparison of Estimated Symptomatic Proportion (P_SZ)")

Comparison of Estimated Symptomatic Proportion (P_SZ)
Location	Mitchell et al. (Median)	Mitchell 95% CrI	Our Estimate (Median)	Our 95% CrI
Yap	0.27	(0.15, 0.37)	0.261	(0.153, 0.367)
French Polynesia	0.44	(0.26, 0.66)	0.430	(0.247, 0.588)
Puerto Rico	0.50	(0.34, 0.92)	0.486	(0.336, 0.743)

Conclusion

This analysis successfully reproduces Mitchell et al. (2019), estimating P_Z and P_SZ for three locations under fixed and distributed sen and spe. Case 1 shows the impact of test performance, while Case 2 aligns with the study’s main results, incorporating test uncertainty.

To increase accuracy, I have considered using 1,000,000 iterations with thinning (as in the study) instead of 5,000 iterations. Not much improvement has been observed with that.

Reproducing Mitchell et al. (2019): Estimating Zika Infection and Symptomatic Proportion

2025-06-07

Introduction

Theoretical Background

Data Preparation

Case 1: Fixed Sensitivity and Specificity

Model Setup

Parameter Grid Search

Plotting Results for Case 1

Symptomatic Proportion (P_SZ)

Infection Prevalence (P_Z)

Case 2: Distributed Sensitivity and Specificity

Model Setup

Parameter Grid Search

Summarising Results for Case 2

Conclusion