This script simulates binary observations of an imperfectly observed
data generating process (i.e. our measurements are made with error). It
also provides Stan code for estimating parameters of the
model in a Bayesian framework. The true infection status \(z\) is assumed to be drawn from a Bernoulli
distribution with probability \(p\):
\[z\sim \text{Bernoulli}(p)\]
A linear predictor is modeled on the \(\text{logit}\) scale to capture effects of covariates on \(p\):
\[\text{logit}(p) = \alpha + \boldsymbol{X}\boldsymbol{\beta}\]
\(\alpha\) is the intercept, representing the average infection probability (on the logit scale) when all predictor variables are set to their means, and \(\boldsymbol{\beta}\) is the set of regression coefficients describing associations between the predictor variables in the model design matrix \(\boldsymbol{X}\) and \(\text{logit}(p)\).

Unfortunately, as in most diagnostic test scenarios, our measurements are made with error. A classic example would be a Kato-Katz faecal smear to detect eggs of soil-transmitted helminths. We know these tests are not perfect. It is easy to miss a parasite egg when the smear quality is low, or when parasite eggs are not being shed, or simply because eggs are damn small and hard to see. Assume \(\hat{SN}\) is our test's sensitivity (the proportion of subjects with the condition who tend to have a positive test) and \(\hat{SP}\) is the test's specificity (the proportion of subjects without the condition who tend to have a negative test). Observations \(\boldsymbol{Y}\) are assumed to be conditional on the latent, unobserved true infection status \(z\) through the following set of equations:
\[\boldsymbol{Y} \sim \text{Bernoulli}(\hat{SN}), \quad \text{if } z = 1\]
\[\boldsymbol{Y} \sim \text{Bernoulli}(1 - \hat{SP}), \quad \text{if } z = 0\]
Ignoring this detection error is bad. It
can bias our estimates and lead to incorrect inferences. But there are
several challenges with estimating such a model. First, we often do not
know with precision what the test’s specificity and sensitivity are. So
we need to estimate them. The second issue is that the \(z\) represent latent discrete parameters.
We do not know what the true infection status is. We could try to sample
it, but it turns out that sampling
these latent values is not possible with gradient-based MCMC routines such as Stan's Hamiltonian Monte Carlo algorithm, which requires all parameters to be continuous. Other types of samplers (e.g. the Gibbs or Metropolis-Hastings steps used in JAGS or BUGS) will allow latent discrete parameters, but they tend to explore them inefficiently. Fortunately, probability theory
tells us how we can account for the possibility that the true status
\(z\) was either a 1 or a
0 using a marginal
likelihood. The unconditional probability of the observed infection
status \(\boldsymbol{Y}\) is the sum of
the probability in each unknown state (truly infected or truly
uninfected) times the chance of being in each unknown state. Marginalising in this way typically gives more efficient estimates because the exact expectation over both possible states enters the likelihood directly, rather than being approximated by sampling a latent discrete parameter.
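Written out for a single observation \(i\), with \(p_i\) denoting the modelled infection probability, this marginal likelihood is \[\Pr(Y_i = y_i) = \Pr(Y_i = y_i \mid z_i = 1)\,p_i + \Pr(Y_i = y_i \mid z_i = 0)\,(1 - p_i)\] so that, for example, a positive observation contributes \(\Pr(Y_i = 1) = \hat{SN}\,p_i + (1 - \hat{SP})(1 - p_i)\) to the likelihood.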
First we will define the Stan model code for estimating
parameters under the model outlined above. This code uses a logical flag
to allow us to tell the sampler whether or not to account for detection
error. This will be useful for investigating the consequences of model
misspecification.
model_code <-
"data {
int<lower=0> N; // number of observations
int<lower=0> P; // number of (standardised) predictor variables
array[N] int<lower=0, upper=1> y; // binary infection observations
matrix[N, P] x; // predictor design matrix
real<lower=-1,upper=1> sensitivity_prior; // interpretable prior score on sensitivity
real<lower=-1,upper=1> specificity_prior; // interpretable prior score on specificity
int<lower=0,upper=1> estimate_accuracy; // logical (include detection error or not)
}
transformed data {
// Convert diagnostic hyperpriors into values to feed to the dirichlet
real<lower=0> sens_hyp;
real<lower=0> spec_hyp;
real<lower=0> remainder_hyp;
sens_hyp = 1 / exp(sensitivity_prior);
spec_hyp = 1 / exp(specificity_prior);
remainder_hyp = exp(sensitivity_prior) + exp(specificity_prior);
// QR decomposition to deal with (possibly) correlated predictor variables
// https://betanalpha.github.io/assets/case_studies/qr_regression.html
matrix[N, P] Q_ast;
matrix[P, P] R_ast;
matrix[P, P] R_ast_inverse;
Q_ast = qr_thin_Q(x) * sqrt(N - 1);
R_ast = qr_thin_R(x) / sqrt(N - 1);
R_ast_inverse = inverse(R_ast);
}
parameters {
vector[P] theta; // QR decomposed regression coefficients
real alpha; // intercept coefficient
vector<lower=1>[3] theta_acc; // hyperprior for test accuracy
simplex[3] accuracy; // sum-to-1 constraint for misclassification error
}
transformed parameters {
// test sensitivity and specificity
real<lower=0, upper=1> sens;
real<lower=0, upper=1> spec;
// set both to 1 if we don't want to estimate detection error
if (estimate_accuracy)
sens = 1 - accuracy[1];
else
sens = 1;
if (estimate_accuracy)
spec = 1 - accuracy[2];
else
spec = 1;
// linear predictor for the logistic regression; here we can also include
// terms for spatial, temporal, and spatiotemporal effects; some examples found here:
// https://mc-stan.org/docs/stan-users-guide/fit-gp.html
// https://betanalpha.github.io/assets/case_studies/gp_part1/part1.html#1_gaussian_processes
// https://github.com/nicholasjclark/mvgam
vector[N] logit_z = alpha + Q_ast * theta;
}
model {
// informative hyperpriors on 1 - sensitivity and 1 - specificity
theta_acc[1] ~ normal(sens_hyp, 0.25);
theta_acc[2] ~ normal(spec_hyp, 0.25);
// hyperprior for sum-to-one constraint
theta_acc[3] ~ normal(remainder_hyp, 0.25);
accuracy ~ dirichlet(theta_acc);
// priors for regression coefficients
// we assume the disease is a bit rare, so use an appropriate
// prior on the intercept alpha
alpha ~ normal(-0.75, 1);
theta ~ std_normal();
// marginal Bernoulli likelihood
// the unconditional probability of the observed infection status is the sum of the
// probability in each unknown state (truly infected or truly uninfected)
// times the chance of being in each unknown state
// for each observation, we calculate the probability that the 'true' status was a 1
// and the probability that the 'true' status was a 0. Summing these probabilities on
// the log scale is the same as multiplying them on the untransformed scale; this
// allows us to average over (i.e. marginalise out) the latent infection status
for (n in 1:N) {
target += log_sum_exp(bernoulli_lpmf(y[n] | sens) + bernoulli_logit_lpmf(1 | logit_z[n]),
bernoulli_lpmf(1 - y[n] | spec) + bernoulli_logit_lpmf(0 | logit_z[n]));
}
}
generated quantities {
vector<lower=0, upper=1>[N] prob;
array[N] int<lower=0, upper=1> ypred;
vector[P] beta;
// compute the posterior probability that each individual is truly infected,
// conditional on their observed test result (i.e. Pr(z = 1 | y))
for (n in 1:N) {
prob[n] = softmax([bernoulli_lpmf(y[n] | sens) + bernoulli_logit_lpmf(1 | logit_z[n]),
bernoulli_lpmf(1 - y[n] | spec) + bernoulli_logit_lpmf(0 | logit_z[n])]')[1];
}
// compute predicted infection statuses (i.e. posterior predictions)
// for each individual in the dataset
ypred = bernoulli_rng(prob);
// compute the regression coefficients for the original
// (back-transformed) predictor variables
beta = R_ast_inverse * theta;
}"
There is a lot going on in this model. It accepts interpretable
hyperprior information for the test diagnostics (more on that below). It
also uses a QR
decomposition to handle possible collinearity among predictors. And
of course, it marginalises over discrete latent infection status.
Details of each of these procedures are not the main aim of this
tutorial, but please see the relevant hyperlinks for more information.
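To build some intuition for the QR step, here is a minimal base-R sketch (my own illustration using a small, hypothetical mean-centred design matrix; none of these objects are used elsewhere) showing that the reparameterised coefficients \(\theta = R^\ast\boldsymbol{\beta}\) give exactly the same linear predictor, and that multiplying by the inverse of \(R^\ast\) recovers \(\boldsymbol{\beta}\), just as the generated quantities block does
set.seed(1)
n <- 100; p <- 2
x <- scale(matrix(rnorm(n * p), n, p), scale = FALSE) # mean-centred predictors
beta <- c(0.5, -1.2)
# thin QR decomposition, scaled as in the Stan transformed data block
qr_x <- qr(x)
Q_ast <- qr.Q(qr_x) * sqrt(n - 1)
R_ast <- qr.R(qr_x) / sqrt(n - 1)
theta <- R_ast %*% beta
# the two parameterisations give identical linear predictors (difference ~0)
max(abs(x %*% beta - Q_ast %*% theta))
# and back-transforming recovers beta, as in the generated quantities block
solve(R_ast) %*% theta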
Load cmdstanr and compile the model to C++
code. See information
here for troubleshooting advice when installing Stan and
CmdStan
library(cmdstanr)
mod <- cmdstan_model(write_stan_file(model_code),
stanc_options = list('canonicalize=deprecations,braces,parentheses'))
## This is cmdstanr version 0.5.3
## - CmdStanR documentation and vignettes: mc-stan.org/cmdstanr
## - CmdStan path: C:/Users/Nick/Documents/.cmdstan/cmdstan-2.31.0
## - CmdStan version: 2.31.0
Now that the model is defined in the Stan language, we need to simulate data to explore the model. Here we define a function to generate simulated binary outcomes with user-specified sensitivity and specificity. This function was modified from a very helpful Stan discourse post
sim_det_error <- function(n = 200, # sample size
alpha = -0.5, # intercept (on logit scale)
p = 2, # number of predictors
sn = 0.85, # sensitivity of test
sp = 0.7) # specificity of test
{
# generate random predictors; note that predictors should be mean-centred
# prior to modelling for the QR decomposition to be effective
x <- rnorm(n*p, mean = 0, sd = 1)
x <- matrix(x, nrow = n, ncol = p)
colnames(x) <- paste0('predictor_', 1:p)
# generate beta coefficients
c <- rnorm(p, mean = 0, sd = 1)
# design matrix and linear predictor
x_design <- cbind(rep(1, n), x)
c_full <- c(alpha, c)
log_odds <- drop(x_design %*% c_full)
# convert the linear predictor (log-odds) to the probability scale via the inverse-logit
pr <- 1 / (1 + exp(-log_odds))
# generate true infection data
truth <- rbinom(n = n, size = 1, prob = pr)
# generate observed data with imperfect detection
y <- integer(n)
for ( i in 1:n ) {
y[i] <- ifelse(truth[i] == 1, rbinom(n = 1, size = 1, prob = sn),
rbinom(n = 1, size = 1, prob = 1 - sp))
}
return(list(N = n,
P = p,
y = y,
x = x,
true_status = truth,
true_sens = sn,
true_spec = sp,
true_alpha = alpha,
true_betas = c))
}
Simulate some data with two predictors. Here sensitivity (\(\hat{SN}\)) = 0.75 and
specificity (\(\hat{SP}\)) =
0.85. Again, sensitivity is a measure of how well a test
can identify true positives and specificity is a measure of how well a
test can identify true negatives. These values are not unreasonable
given our knowledge of how difficult it can be to detect parasite eggs
in faecal smears
dat <- sim_det_error(n = 250, alpha = -0.85,
p = 2, sn = 0.75, sp = 0.85)
The true (latent) prevalence is
sum(dat$true_status) / length(dat$true_status)
## [1] 0.324
The observed prevalence is
sum(dat$y) / length(dat$y)
## [1] 0.316
Add interpretable ‘prior’ information about test accuracy to the
data, which will be used to structure the prior distributions for test
sensitivity and specificity. Here, we use a scale of
-1 to 1 to define any prior information we may have about
performance of a diagnostic test. A score of -1 means we
are fairly certain the test is bad (low accuracy). A score of
1 implies we are fairly certain the test is good. A score
of 0 means we are uncertain. Let’s assume the test gives
reasonable accuracy because it is a commercial diagnostic.
dat$sensitivity_prior <- 0.3
dat$specificity_prior <- 0.3
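To get a feel for what these scores imply, below is a rough prior-predictive sketch written in base R (my own approximation of the Stan model's transformed data and prior blocks; the truncated-normal and Dirichlet samplers here are illustrative stand-ins, not part of the original script)
set.seed(123)
n_draws <- 5000
# same transformation the Stan model applies to the prior scores
sens_hyp <- 1 / exp(dat$sensitivity_prior)
spec_hyp <- 1 / exp(dat$specificity_prior)
remainder_hyp <- exp(dat$sensitivity_prior) + exp(dat$specificity_prior)
# normal(mean, 0.25) draws truncated below at 1, via the inverse-CDF method
rnorm_trunc1 <- function(n, mean, sd = 0.25) {
  qnorm(runif(n, pnorm(1, mean, sd), 1), mean, sd)
}
theta_acc <- cbind(rnorm_trunc1(n_draws, sens_hyp),
                   rnorm_trunc1(n_draws, spec_hyp),
                   rnorm_trunc1(n_draws, remainder_hyp))
# Dirichlet draws built from independent Gamma variates
gammas <- matrix(rgamma(n_draws * 3, shape = theta_acc, rate = 1), ncol = 3)
accuracy <- gammas / rowSums(gammas)
# implied prior distributions for sensitivity and specificity
quantile(1 - accuracy[, 1], probs = c(0.1, 0.5, 0.9)) # prior sensitivity
quantile(1 - accuracy[, 2], probs = c(0.1, 0.5, 0.9)) # prior specificity
These scores should translate into moderately informative priors that favour a reasonably accurate test while still leaving room for poor performance.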
We can now assess whether it is problematic to ignore detection error when fitting models. We'll start with a basic logistic regression that ignores this error, setting the logical flag to tell the sampler not to account for detection error
dat$estimate_accuracy <- 0
Now condition the model on the observed data using Stan
with the CmdStan backend
fit_noerror <- mod$sample(
data = dat,
chains = 4,
parallel_chains = 4,
refresh = 100)
Next we fit an equivalent model that does estimate detection error
(i.e. we change the logical flag for estimate_accuracy from
0 to 1)
dat$estimate_accuracy <- 1
fit_error <- mod$sample(
data = dat,
chains = 4,
parallel_chains = 4,
refresh = 100)
Inspect sampler and convergence diagnostics. Any warnings of divergences or low energy should be taken seriously and inspected further. Fortunately our models fit without issue
fit_noerror$diagnostic_summary()
## $num_divergent
## [1] 0 0 0 0
##
## $num_max_treedepth
## [1] 0 0 0 0
##
## $ebfmi
## [1] 1.0062877 0.9766290 0.9720911 0.9598392
fit_error$diagnostic_summary()
## $num_divergent
## [1] 0 0 0 0
##
## $num_max_treedepth
## [1] 0 0 0 0
##
## $ebfmi
## [1] 0.8258260 0.8637398 0.7749403 0.8422436
Compute the usual MCMC summaries, which show that the Hamiltonian Monte Carlo chains have converged nicely
fit_noerror$summary(variables = c("alpha", "beta"))
## # A tibble: 3 × 10
## variable mean median sd mad q5 q95 rhat ess_bulk ess_tail
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 alpha -0.868 -0.866 0.149 0.150 -1.11 -0.630 1.00 4719. 3041.
## 2 beta[1] 0.345 0.343 0.158 0.161 0.0899 0.602 1.00 4760. 2940.
## 3 beta[2] 0.393 0.391 0.144 0.144 0.162 0.635 1.00 4922. 3215.
fit_error$summary(variables = c("alpha", "beta", "sens", "spec"))
## # A tibble: 5 × 10
## variable mean median sd mad q5 q95 rhat ess_bulk ess_tail
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 alpha -1.15 -1.14 0.713 0.689 -2.33 -0.000715 1.00 2173. 2601.
## 2 beta[1] 0.808 0.736 0.480 0.413 0.189 1.73 1.00 1944. 1779.
## 3 beta[2] 0.828 0.772 0.423 0.388 0.239 1.63 1.00 1846. 1930.
## 4 sens 0.713 0.711 0.154 0.180 0.467 0.963 1.00 1434. 1441.
## 5 spec 0.849 0.846 0.0655 0.0681 0.745 0.965 1.00 1620. 1045.
Visualising the estimated coefficients for each of the models will allow us to ask whether ignoring detection error leads to biased inferences. First, define some colours for nice plots
c_light <- c("#DCBCBC")
c_light_highlight <- c("#C79999")
c_mid <- c("#B97C7C")
c_mid_highlight <- c("#A25050")
c_dark <- c("#8F2727")
c_dark_highlight <- c("#7C0000")
Plot the estimates for the intercept parameter \(\alpha\) and overlay the true simulated value in black
alphas_noerror <- fit_noerror$draws("alpha", format = "draws_matrix")
alphas_error <- fit_error$draws("alpha", format = "draws_matrix")
layout(matrix(1:2, ncol = 1))
par(mar = c(4, 1, 1.5, 1))
hist(alphas_noerror,
col = 'grey', border = 'white', lwd = 2,
yaxt = 'n',
ylab = '', xlab = '',
xlim = range(c(alphas_noerror, alphas_error)),
breaks = seq(min(c(alphas_noerror, alphas_error)),
max(c(alphas_noerror, alphas_error)),
length.out = 50),
main = 'No detection error')
abline(v = dat$true_alpha, lwd = 3.5, col = 'white')
abline(v = dat$true_alpha, lwd = 3, col = 'black')
hist(alphas_error,
col = c_mid_highlight, border = 'white', lwd = 2,
main = 'Detection error',
xlim = range(c(alphas_noerror, alphas_error)),
breaks = seq(min(c(alphas_noerror, alphas_error)),
max(c(alphas_noerror, alphas_error)),
length.out = 50),
yaxt = 'n', ylab = '', xlab = expression(alpha))
abline(v = dat$true_alpha, lwd = 3.5, col = 'white')
abline(v = dat$true_alpha, lwd = 3, col = 'black')
The detection error model struggles to estimate \(\alpha\) with much precision, though the estimates are centred around the true simulated value. This parameter is a bit difficult to interpret in detection error models, so we will come back to it later. Our primary interest, though, is in inferences about the effects of covariates. Next, plot the estimated \(\beta\) coefficients for the two predictors, and again overlay the true simulated values in black
betas_noerror <- fit_noerror$draws("beta", format = "draws_matrix")
betas_error <- fit_error$draws("beta", format = "draws_matrix")
layout(matrix(1:4, ncol = 2, byrow = FALSE))
hist(betas_noerror[,1],
col = 'grey', border = 'white', lwd = 2,
xlim = range(c(betas_noerror[,1], betas_error[,1])),
breaks = seq(min(c(betas_noerror[,1], betas_error[,1])),
max(c(betas_noerror[,1], betas_error[,1])),
length.out = 50),
main = 'No detection error',
yaxt = 'n', ylab = '', xlab = '')
abline(v = dat$true_betas[1], lwd = 3.5, col = 'white')
abline(v = dat$true_betas[1], lwd = 3, col = 'black')
hist(betas_error[,1],
col = c_mid_highlight, border = 'white', lwd = 2,
xlim = range(c(betas_noerror[,1], betas_error[,1])),
breaks = seq(min(c(betas_noerror[,1], betas_error[,1])),
max(c(betas_noerror[,1], betas_error[,1])),
length.out = 50),
main = 'Detection error', yaxt = 'n', ylab = '', xlab = expression(beta[1]))
abline(v = dat$true_betas[1], lwd = 3.5, col = 'white')
abline(v = dat$true_betas[1], lwd = 3, col = 'black')
hist(betas_noerror[,2],
col = 'grey', border = 'white', lwd = 2,
xlim = range(c(betas_noerror[,2], betas_error[,2])),
main = 'No detection error',
breaks = seq(min(c(betas_noerror[,2], betas_error[,2])),
max(c(betas_noerror[,2], betas_error[,2])),
length.out = 50),
yaxt = 'n', ylab = '', xlab = '')
abline(v = dat$true_betas[2], lwd = 3.5, col = 'white')
abline(v = dat$true_betas[2], lwd = 3, col = 'black')
hist(betas_error[,2],
col = c_mid_highlight, border = 'white', lwd = 2,
xlim = range(c(betas_noerror[,2], betas_error[,2])),
breaks = seq(min(c(betas_noerror[,2], betas_error[,2])),
max(c(betas_noerror[,2], betas_error[,2])),
length.out = 50),
main = 'Detection error', yaxt = 'n', ylab = '', xlab = expression(beta[2]))
abline(v = dat$true_betas[2], lwd = 3.5, col = 'white')
abline(v = dat$true_betas[2], lwd = 3, col = 'black')
Here it is immediately clear why we should account for imperfect detection when possible. The detection error model recovers both coefficients far better than the model that ignores detection error. If we had to make decisions based on these parameters and did not account for detection error, our conclusions could be badly misleading. But why is there such large uncertainty in the detection error model's estimates? To answer this, we need to plot the estimates for the test diagnostics
layout(matrix(1:2, ncol = 1))
par(mar = c(4, 1, 1.5, 1))
posterior_sens <- fit_error$draws("sens", format = "draws_matrix")
hist(posterior_sens,
xlim = c(0, 1),
breaks = seq(0, 1, length.out = 30),
col = c_mid_highlight, border = 'white', lwd = 2,
main = '', yaxt = 'n', ylab = '', xlab = 'Sensitivity')
abline(v = dat$true_sens, lwd = 3.5, col = 'white')
abline(v = dat$true_sens, lwd = 3, col = 'black')
posterior_spec <- fit_error$draws("spec", format = "draws_matrix")
hist(posterior_spec,
xlim = c(0, 1),
breaks = seq(0, 1, length.out = 30),
col = c_mid_highlight, border = 'white', lwd = 2,
main = '', yaxt = 'n', ylab = '', xlab = 'Specificity')
abline(v = dat$true_spec, lwd = 3.5, col = 'white')
abline(v = dat$true_spec, lwd = 3, col = 'black')
Further investigation helps to diagnose why \(\alpha\) is difficult to identify for the detection error model.
layout(matrix(1:2, ncol = 1))
par(mar = c(4, 4, 1.5, 1))
posterior_sens <- fit_error$draws("sens", format = "draws_matrix")
plot(alphas_error, posterior_sens, xlab = '',
ylab = 'Sensitivity', bty = 'l', pch = 16, col = c_mid_highlight)
box(bty = 'l', lwd = 2)
abline(h = dat$true_sens, lwd = 3.5, col = 'white')
abline(h = dat$true_sens, lwd = 3, col = 'black')
abline(v = dat$true_alpha, lwd = 3.5, col = 'white')
abline(v = dat$true_alpha, lwd = 3, col = 'black')
posterior_spec <- fit_error$draws("spec", format = "draws_matrix")
plot(alphas_error, posterior_spec, xlab = expression(alpha),
ylab = 'Specificity', bty = 'l', pch = 16, col = c_mid_highlight)
box(bty = 'l', lwd = 2)
abline(h = dat$true_spec, lwd = 3.5, col = 'white')
abline(h = dat$true_spec, lwd = 3, col = 'black')
abline(v = dat$true_alpha, lwd = 3.5, col = 'white')
abline(v = dat$true_alpha, lwd = 3, col = 'black')
There are fundamental posterior correlations induced between \(\alpha\) and sensitivity and between \(\alpha\) and specificity. This model cannot estimate sensitivity or specificity with much precision, resulting in wide estimates of \(\alpha\). There is nothing inherently wrong here; these uncertainties simply reflect reality: we do not know the true infection status, and we do not know the true test performance. Without knowing these, how can we precisely estimate the average infection probability \(\alpha\)? Stronger priors on the diagnostic accuracies could help, but these uncertainties will not entirely go away without sampling many more individuals.
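We can put a number on these correlations directly from the posterior draws we have already extracted (a quick check I've added here, not part of the original workflow)
cor(as.numeric(alphas_error), as.numeric(posterior_sens))
cor(as.numeric(alphas_error), as.numeric(posterior_spec))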
A very useful question raised by A/Prof Joerg
Henning relates to translating what these results might mean for a
practicing clinician. Often we want to know how likely it is that a
person is truly infected if they’ve returned a positive diagnostic test.
A famous (and very topical) problem in probability theory is the base rate
fallacy. This occurs when we ignore the background rate (or
prevalence of disease, in our case) when interpreting test results. No
test is perfect, so every test will sometimes return false positives. But if
there are very few true positives in the population (because the disease
is rare), this can mean that a high proportion of people testing
positive are actually uninfected. To quote from Wikipedia:
“It is especially counter-intuitive when interpreting a positive result
in a test on a low-prevalence population after having dealt with
positive results drawn from a high-prevalence population. If the false
positive rate of the test is higher than the proportion of the new
population with the condition, then a test administrator whose
experience has been drawn from testing in a high-prevalence population
may conclude from experience that a positive test result usually
indicates a positive subject, when in fact a false positive is far more
likely to have occurred.”
We can use sampling from our model's joint posterior distribution to calculate the probability that someone returning a positive test is actually infected, which requires the following application of Bayes' theorem: \[\Pr(\text{truly infected} \mid \text{positive test}) = \frac{\Pr(\text{positive test} \mid \text{truly infected})\,\Pr(\text{infected})}{\Pr(\text{positive test} \mid \text{truly infected})\,\Pr(\text{infected}) + \Pr(\text{positive test} \mid \text{truly uninfected})\,\Pr(\text{uninfected})}\]
For details of this derivation, please read here and here.
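To see the base rate fallacy in action before worrying about posterior uncertainty, here is a quick plug-in calculation (my own illustration) using the simulation's true sensitivity and specificity together with a hypothetically rare disease
sn <- dat$true_sens # 0.75
sp <- dat$true_sp <- dat$true_spec # 0.85
prev <- 0.02 # hypothetical low prevalence, chosen only for illustration
(sn * prev) / (sn * prev + (1 - sp) * (1 - prev))
At 2% prevalence, fewer than one in ten positive tests would come from a truly infected person, even though the test itself is reasonably accurate.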
Because I’m no wizard in probability theory, I’ll use brute force
sampling to calculate this quantity. First, create a sample sequence to
generate many posterior simulations. We do this because we want many
thousands of samples, but we only have 4000 in our
posterior.
N_samples <- 25000
sample_seq <- sample(1:NROW(posterior_sens), N_samples, replace = TRUE)
Next, sample latent (true) infection status for 25000
individuals using the logistic regression equation estimated by our
model
true_inf <- matrix(NA, nrow = N_samples, ncol = NROW(dat$x))
for(i in 1:N_samples){
true_inf[i,] <- rbinom(NROW(dat$x),
size = 1,
prob = boot::inv.logit(alphas_error[sample_seq[i]] +
dat$x %*% as.vector(betas_error[sample_seq[i],])))
}
Sample observed test results given our model's estimates for sensitivity and specificity
test_results <- matrix(NA, nrow = N_samples, ncol = NROW(dat$x))
for(i in 1:N_samples){
for(j in 1:NCOL(test_results)){
test_results[i,j] <- ifelse(true_inf[i,j] == 1,
rbinom(n = 1,
size = 1,
prob = posterior_sens[sample_seq[i]]),
rbinom(n = 1,
size = 1,
prob = 1 - posterior_spec[sample_seq[i]]))
}
}
Now we have all the samples we need to calculate relevant probabilistic quantities. Calculate the probability that an individual returns a positive test if they are actually infected. We would hope this value would be high, but that is not always the case
`Pr(positive test|truly infected)` <- vector(length = N_samples)
for(i in 1:N_samples){
`Pr(positive test|truly infected)`[i] = sum(test_results[i, which(true_inf[i,] == 1)]) /
length(which(true_inf[i,] == 1))
}
quantile(`Pr(positive test|truly infected)`, probs = c(0.1, 0.25, 0.5, 0.75, 0.9))
## 10% 25% 50% 75% 90%
## 0.4921875 0.5867769 0.7115385 0.8444444 0.9411765
Next calculate the probability that any individual in the population is infected (prevalence)
`Pr(infected)` <- vector(length = N_samples)
for(i in 1:N_samples){
`Pr(infected)`[i] = sum(true_inf[i,]) / length(true_inf[i,])
}
quantile(`Pr(infected)`, probs = c(0.1, 0.25, 0.5, 0.75, 0.9))
## 10% 25% 50% 75% 90%
## 0.124 0.184 0.260 0.352 0.456
Now we need the probability of returning a positive test if someone is truly uninfected. We would hope this value would be quite low
`Pr(positive test|truly uninfected)` <- vector(length = N_samples)
for(i in 1:N_samples){
`Pr(positive test|truly uninfected)`[i] = sum(test_results[i, which(true_inf[i,] == 0)]) /
length(which(true_inf[i,] == 0))
}
quantile(`Pr(positive test|truly uninfected)`, probs = c(0.1, 0.25, 0.5, 0.75, 0.9))
## 10% 25% 50% 75% 90%
## 0.05487805 0.10169492 0.15294118 0.20095694 0.24242424
Now the probability that any individual is uninfected, which is
1 - Pr(infected)
`Pr(uninfected)` <- vector(length = N_samples)
for(i in 1:N_samples){
`Pr(uninfected)`[i] = 1 - `Pr(infected)`[i]
}
quantile(`Pr(uninfected)`, probs = c(0.1, 0.25, 0.5, 0.75, 0.9))
## 10% 25% 50% 75% 90%
## 0.544 0.648 0.740 0.816 0.876
Finally, we can calculate our quantity of interest: the probability that an individual is actually infected if they return a positive test
`Pr(truly infected|positive test)` <- vector(length = N_samples)
for(i in 1:N_samples){
`Pr(truly infected|positive test)`[i] = `Pr(positive test|truly infected)`[i] * `Pr(infected)`[i] /
(`Pr(positive test|truly infected)`[i] * `Pr(infected)`[i] +
`Pr(positive test|truly uninfected)`[i] * `Pr(uninfected)`[i])
}
hist(`Pr(truly infected|positive test)`,
xlim = c(0, 1),
breaks = seq(0, 1, length.out = 30),
col = c_mid_highlight, border = 'white', lwd = 2,
main = '', yaxt = 'n', ylab = '',
xlab = 'Pr(truly infected|positive test)')
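As a rough cross-check (again, my own addition rather than part of the original analysis), we can also plug the posterior draws of sensitivity, specificity and prevalence straight into the analytic form of Bayes' theorem above, bypassing the simulated test results, and compare against the brute-force version
sens_draws <- as.numeric(posterior_sens)[sample_seq]
spec_draws <- as.numeric(posterior_spec)[sample_seq]
ppv_analytic <- (sens_draws * `Pr(infected)`) /
  (sens_draws * `Pr(infected)` + (1 - spec_draws) * `Pr(uninfected)`)
# the two approaches should give broadly similar distributions
quantile(ppv_analytic, probs = c(0.1, 0.5, 0.9))
quantile(`Pr(truly infected|positive test)`, probs = c(0.1, 0.5, 0.9))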
With imperfect tests and somewhat rare infections (~30%
in this example), we can’t be too sure that a positive test means a true
infection. All of this changes, of course, if we have other reasons to believe a person is infected. For example, if we only test people who show some symptoms, we are sampling a very different population compared to the full population of healthy and unhealthy individuals. Hopefully
this example illustrates why simple logistic regressions are often not
enough to tackle the questions we are interested in.