Hui-Walter models

CA18208 HARMONY Zurich Training School - https://harmony-net.eu/

Sonja Hartnack
Valerie Hungerbühler
Eleftherios Meletis

2022-07-14

Hui-Walter models

Se - Sp estimation - Recap

When the true infectious status is known Se-Sp of test 1 can be estimated:

Hui-Walter models (I)

Hui-Walter models (II)

In the Hui-Walter paradigm a condition that connects \(S\) the number of populations (P) and \(R\) the number of tests (T) is desribed:

\[ S \geq \frac {R}{(2\textsuperscript{R-1} - 1)}\]

If this condition is fulfilled, any combination of \(S\) and \(R\) may allow to estimate \(Se\) and \(Sp\), e.g. (2T, 2P), (3T,1P), (4T,1P), …

Hui-Walter models

Two_test - One_population setting

Model Specification (‘hw_definition’)

model{
  Cross_Classified_Data ~ dmulti(prob, N)
  
  # Test1+ Test2+
    prob[1] <- (prev * ((se[1])*(se[2]))) + ((1-prev) * ((1-sp[1])*(1-sp[2])))
  
  # Test1+ Test2-
    prob[2] <- (prev * ((se[1])*(1-se[2]))) + ((1-prev) * ((1-sp[1])*(sp[2])))

  # Test1- Test2+
    prob[3] <- (prev * ((1-se[1])*(se[2]))) + ((1-prev) * ((sp[1])*(1-sp[2])))

  # Test1- Test2-
    prob[4] <- (prev * ((1-se[1])*(1-se[2]))) + ((1-prev) * ((sp[1])*(sp[2])))

  prev ~ dbeta(1, 1)
  se[1] ~ dbeta(1, 1)
  sp[1] ~ dbeta(1, 1)
  se[2] ~ dbeta(1, 1)
  sp[2] ~ dbeta(1, 1)

  #data# Cross_Classified_Data, N
  #monitor# prev, prob, se, sp
  #inits# prev, se, sp
}
twoXtwo <- matrix(c(36, 4, 12, 48), ncol=2, nrow=2)
twoXtwo
##      [,1] [,2]
## [1,]   36   12
## [2,]    4   48
library('runjags')

Cross_Classified_Data <- as.numeric(twoXtwo)
N <- sum(Cross_Classified_Data)

prev <- list(chain1=0.05, chain2=0.95)
se <- list(chain1=c(0.01,0.99), chain2=c(0.99,0.01))
sp <- list(chain1=c(0.01,0.99), chain2=c(0.99,0.01))

results <- run.jags('basic_hw.txt', n.chains=2)

Remember to check convergence and effective sample size!

Recap - psrf: potential scale reduction factor

Recap - SSeff: effective sample size

summary(results)
Lower95 Median Upper95 SSeff psrf
prev 0.328 0.501 0.669 3954 2.275
prob[1] 0.255 0.344 0.437 13822 1.000
prob[2] 0.017 0.055 0.102 9640 1.000
prob[3] 0.073 0.132 0.201 14050 1.000
prob[4] 0.366 0.462 0.559 13257 1.000
se[1] 0.000 0.387 0.967 4459 13.597
se[2] 0.030 0.502 1.000 4683 15.382
sp[1] 0.033 0.577 1.000 4987 13.730
sp[2] 0.000 0.456 0.971 4423 15.081
plot(results)

Label Switching

How to interpret a test with Se=0% and Sp=0%?

We can force Se+Sp >= 1.

Jouden’s Index Se + Sp -1 >= 0:

  se[1] ~ dbeta(1, 1)
  sp[1] ~ dbeta(1, 1)T(1-se[1], )

Or:

  se[1] ~ dbeta(1, 1)T(1-sp[1], )
  sp[1] ~ dbeta(1, 1)

This allows the test to be useless, but not worse than useless.

Alternatively we can have the weakly informative priors:

  se[1] ~ dbeta(2, 1)
  sp[1] ~ dbeta(2, 1)

To give the model some information that we expect the test characteristics to be closer to 100% than 0%.

Or we can use stronger priors for one or both tests.

Priors

A quick way to see the distribution of a prior:

curve(dbeta(x, 1, 1), from=0, to=1)

qbeta(c(0.025,0.975), shape1=1, shape2=1)
## [1] 0.025 0.975

This was minimally informative, but how does that compare to a weakly informative prior for e.g. sensitivity?

curve(dbeta(x, 2, 1), from=0, to=1)

qbeta(c(0.025,0.975), shape1=2, shape2=1)
## [1] 0.1581139 0.9874209

Choosing a prior

What we want is e.g. Beta(20,1)

But typically we have median and 95% confidence intervals from a paper, e.g.:

“The median (95% CI) estimates of the sensitivity and specificity of the shiny new test were 94% (92-96%) and 99% (97-100%) respectively”

The PriorGen package

Median (95% CI) estimates of Se and Sp were 94% (92-96%) and 99% (97-100%)

library("PriorGen")
## Loading required package: rootSolve
findbeta(themedian = 0.94, percentile=0.95, percentile.value = 0.92)
## [1] "The desired Beta distribution that satisfies the specified conditions is: Beta( 429.95 27.76 )"
## [1] "Here is a plot of the specified distribution."
## [1] "Descriptive statistics for this distribution are:"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.8923  0.9324  0.9399  0.9394  0.9472  0.9747 
## [1] "Verification: The percentile value 0.92 corresponds to the 0.05 th percentile"

Note: themedian could also be themean

curve(dbeta(x, shape1=429.95, shape2=27.76))

Initial values

Part of the problem before was also that we were specifying extreme initial values:

se <- list(chain1=c(0.01,0.99), chain2=c(0.99,0.01))
sp <- list(chain1=c(0.01,0.99), chain2=c(0.99,0.01))

Let’s change these to:

se <- list(chain1=c(0.5,0.99), chain2=c(0.99,0.5))
sp <- list(chain1=c(0.5,0.99), chain2=c(0.99,0.5))

Exercise 1

Run the hw_definition model under the following different scenarios and interpret the results in each case.

  1. Change the priors for Se and Sp and try Beta(2,1).

  2. Estimate the Beta parameters for the Se and Sp of the shiny new test described.

  3. Run the model using the beta priors for Se and Sp from Step 2.

  4. Try to run the model with different initial values.

  5. Force Se + Sp > 1. Be careful to specify initial values that are within the restricted parameter space.

Multi-population Hui-Walter models

Hui-Walter models with multiple populations

Different prevalence in different populations (1st assumption)

What changes?

Model specification

model{
  for(p in 1:Populations){
    Tally[1:4, p] ~ dmulti(prob[1:4, p], TotalTests[p])
    # Test1- Test2- Pop1
    prob[1, p] <- (prev[p] * ((1-se[1])*(1-se[2])) + ((1-prev[p]) * (sp[1]*sp[2]))
    ## snip ##
      
    prev[p] ~ dbeta(1, 1)
  }

  se[1] ~ dbeta(se_prior[1,1], se_prior[1,2])T(1-sp[1], )
  sp[1] ~ dbeta(sp_prior[1,1], sp_prior[1,2])
  se[2] ~ dbeta(se_prior[2,1], se_prior[2,2])T(1-sp[2], )
  sp[2] ~ dbeta(sp_prior[2,1], sp_prior[2,2])

  #data# Tally, TotalTests, Populations, se_prior, sp_prior
  #monitor# prev, prob, se, sp
  #inits# prev, se, sp
}

Multiple populations: 2nd assumption

Multiple populations: special cases

Initial values

We have to be careful to make sure that the length of initial values for prev in each chain is equal to the number of populations

For example with 5 populations we need:

prev <- list(chain1=c(0.1, 0.1, 0.1, 0.9, 0.9), chain2=c(0.9, 0.9, 0.9, 0.1, 0.1))

The values you choose for different populations in the same chain can be the same - just make sure you pick different values for the same population between chains (i.e. over-dispersed initial values)

Note: specifying your own initial values is technically optional with JAGS, but it is always a good idea (for now at least)!!!

Incorporating populations with known prevalence

Up to now prevalence has been a parameter, but it can also be (partially) observed:

model{
  for(p in 1:Populations){
    Tally[1:4, p] ~ dmulti(prob[1:4, p], TotalTests[p])
    # Test1- Test2- Pop1
    prob[1, p] <- (prev[p] * ((1-se[1])*(1-se[2])) + ((1-prev[p]) * (sp[1]*sp[2]))
    ## snip ##
      
    prev[p] ~ dbeta(1, 1)
  }

  ## snip ##

  #data# Tally, TotalTests, Populations, se_prior, sp_prior, prev
  #monitor# prev, prob, se, sp
  #inits# prev, se, sp
}

To fix the prevalence of population 1 we could do:

Populations <- 5
prev <- rep(NA, Populations)
prev[1] <- 0
prev
## [1]  0 NA NA NA NA

But you also need to account for this in the initial values:

prev <- list(chain1=c(NA, 0.1, 0.1, 0.9, 0.9), chain2=c(NA, 0.9, 0.9, 0.1, 0.1))

Note: we now have two definitions of prev in R!

Data and initial value lists

There are actually multiple ways to specify data and initial values to runjags, including via the data and inits arguments

We will use these to keep separate lists of data and initial values (these could also be data frames, or environments)

data <- list(
  Tally = Tally,
  TotalTests = apply(Tally, 2, sum),
  Populations = dim(Tally, 2),
  prev = rep(NA, Populations),
  se_prior = matrix(1, ncol=2, nrow=2),
  sp_prior = matrix(1, ncol=2, nrow=2)
)
data$prev[1] <- 0
inits <- list(
  chain1 = list(
    prev = c(NA, 0.1, 0.1, 0.9, 0.9),
    se = c(0.5, 0.99),
    sp = c(0.5, 0.99)
  ),
  chain2 = list(
    prev = c(NA, 0.9, 0.9, 0.1, 0.1),
    se = c(0.99, 0.5),
    sp = c(0.99, 0.5)
  )
)

results <- run.jags(..., data = data, inits = inits)

See the help file for ?run.jags for more details

Let’s simulate a dataset to work on…

# Set a random seed so that the data are reproducible:
set.seed(2022-07-14)

sensitivity <- c(0.9, 0.6)
specificity <- c(0.95, 0.9)
N <- 1000

# Change the number of populations here:
Populations <- 5
# Change the variation in prevalence here:
(prevalence <- runif(Populations, min=0.1, max=0.9))

data <- tibble(Population = sample(seq_len(Populations), N, replace=TRUE)) %>%
  mutate(Status = rbinom(N, 1, prevalence[Population])) %>%
  mutate(Test1 = rbinom(N, 1, sensitivity[1]*Status + (1-specificity[1])*(1-Status))) %>%
  mutate(Test2 = rbinom(N, 1, sensitivity[2]*Status + (1-specificity[2])*(1-Status)))

(twoXtwoXpop <- with(data, table(Test1, Test2, Population)))
(Tally <- matrix(twoXtwoXpop, ncol=Populations))
(TotalTests <- apply(Tally, 2, sum))
Tally
##      [,1] [,2] [,3] [,4] [,5]
## [1,]   47   68  117  141  142
## [2,]   48   47   24   34   22
## [3,]   19   15   13   23   14
## [4,]   76   68   38   30   14

Exercise 2

Start with 5 populations and analyse the data using the independent prevalence model.

Now try with 2, 3, and 10 populations.

Now change the simulated prevalence so that it varies between 0.4-0.6 rather than 0.1-0.9.

Solution 2

This is what the model should look like:


model{
  for(p in 1:Populations){
    Tally[1:4, p] ~ dmulti(prob[1:4, p], TotalTests[p])
  
    # Test1- Test2-
          prob[1,p] <- (prev[p] * ((1-se[1])*(1-se[2]))) + ((1-prev[p]) * ((sp[1])*(sp[2])))
    # Test1+ Test2-
    prob[2,p] <- (prev[p] * ((se[1])*(1-se[2]))) + ((1-prev[p]) * ((1-sp[1])*(sp[2])))
    # Test1- Test2+
    prob[3,p] <- (prev[p] * ((1-se[1])*(se[2]))) + ((1-prev[p]) * ((sp[1])*(1-sp[2])))
     # Test1+ Test2+
     prob[4,p] <- (prev[p] * ((se[1])*(se[2]))) + ((1-prev[p]) * ((1-sp[1])*(1-sp[2])))

    prev[p] ~ dbeta(1, 1)
  }
  se[1] ~ dbeta(se_prior[1,1], se_prior[1,2])T(1-sp[1], )
  sp[1] ~ dbeta(sp_prior[1,1], sp_prior[1,2])
  se[2] ~ dbeta(se_prior[2,1], se_prior[2,2])T(1-sp[2], )
  sp[2] ~ dbeta(sp_prior[2,1], sp_prior[2,2])

  #data# Tally, TotalTests, Populations, se_prior, sp_prior
  #monitor# prev, se, sp
  #inits# prev, se, sp
  #module# lecuyer
}

Here is the R code to run the model:

set.seed(2022-07-14)

# Set up the se_prior and sp_prior variables (optional):
se_prior <- matrix(1, nrow=2, ncol=2)
sp_prior <- matrix(1, nrow=2, ncol=2)

# Set up initial values for 5 populations:
se <- list(chain1=c(0.5,0.99), chain2=c(0.99,0.5))
sp <- list(chain1=c(0.5,0.99), chain2=c(0.99,0.5))
prev <- list(chain1=c(0.1, 0.1, 0.1, 0.9, 0.9), chain2=c(0.9, 0.9, 0.9, 0.1, 0.1))

# And run the model:
results_5p <- run.jags("multipopulation.txt", n.chains=2)

# Remember to check convergence!
# plot(results_5p)
summary(results_5p)
##            Lower95    Median   Upper95      Mean         SD Mode        MCerr
## prev[1] 0.64871206 0.7499185 0.8412375 0.7484544 0.04931864   NA 0.0008185605
## prev[2] 0.53243100 0.6354872 0.7299415 0.6342936 0.05051519   NA 0.0008191104
## prev[3] 0.22950999 0.3195574 0.4127857 0.3206858 0.04706803   NA 0.0007504695
## prev[4] 0.17078981 0.2612663 0.3600515 0.2632483 0.04879474   NA 0.0009273518
## prev[5] 0.06213104 0.1335139 0.2182322 0.1367285 0.04055939   NA 0.0007190464
## se[1]   0.76954762 0.8453448 0.9206063 0.8451176 0.03901526   NA 0.0007539896
## se[2]   0.55886277 0.6298138 0.7003766 0.6301646 0.03633644   NA 0.0006306248
## sp[1]   0.85893515 0.9125015 0.9699363 0.9130937 0.02836927   NA 0.0005538714
## sp[2]   0.87601852 0.9155925 0.9551532 0.9153571 0.02010692   NA 0.0003505415
##         MC%ofSD SSeff      AC.10     psrf
## prev[1]     1.7  3630 0.07007367 1.000097
## prev[2]     1.6  3803 0.06272667 1.000599
## prev[3]     1.6  3934 0.06963468 1.000236
## prev[4]     1.9  2769 0.10733603 1.000262
## prev[5]     1.8  3182 0.08766467 1.000951
## se[1]       1.9  2678 0.08612754 1.000194
## se[2]       1.7  3320 0.08199093 1.000313
## sp[1]       2.0  2623 0.11689246 1.000297
## sp[2]       1.7  3290 0.06616639 1.000099

To change the number of populations and range of prevalence you just need to modify the simulation code, for example 3 populations with prevalence between 0.4-0.6 can be obtained using:

# Change the number of populations here:
Populations <- 3
# Change the variation in prevalence here:
(prevalence <- runif(Populations, min=0.4, max=0.6))

data <- tibble(Population = sample(seq_len(Populations), N, replace=TRUE)) %>%
  mutate(Status = rbinom(N, 1, prevalence[Population])) %>%
  mutate(Test1 = rbinom(N, 1, sensitivity[1]*Status + (1-specificity[1])*(1-Status))) %>%
  mutate(Test2 = rbinom(N, 1, sensitivity[2]*Status + (1-specificity[2])*(1-Status)))

(twoXtwoXpop <- with(data, table(Test1, Test2, Population)))
(Tally <- matrix(twoXtwoXpop, ncol=Populations))
(TotalTests <- apply(Tally, 2, sum))
# Adjust initial values for 3 populations:
se <- list(chain1=c(0.5,0.99), chain2=c(0.99,0.5))
sp <- list(chain1=c(0.5,0.99), chain2=c(0.99,0.5))
prev <- list(chain1=c(0.1, 0.1, 0.9), chain2=c(0.9, 0.9, 0.1))
# And run the model:
results_3p <- run.jags("multipopulation.txt", n.chains=2)
# Remember to check convergence!
# plot(results_3p)
# summary(results_3p)

Note that when the effective sample size is not enough - you either need to run the model for longer in the first place, or extend it to get more samples:

# Extend the model:
results_3p <- extend.jags(results_3p, sample=50000)

# Remember to check convergence!
# plot(results_3p)
# summary(results_3p)

As a general rule, the more populations you have, and the more the prevalence varies between them, the better. However, this is conditional on having a consistent sensitivity and specificity between your populations!!!

Time for another break