Disease X represents a hypothetical, yet plausible, future infectious disease outbreak. Coined by the World Health Organization (WHO), it serves as a catch-all term for an unknown pathogen that could cause a serious international epidemic. This concept emphasizes the unpredictable nature of infectious disease emergence and underscores the necessity for global preparedness and rapid response mechanisms. Disease X symbolizes the potential for an unexpected, rapidly spreading disease, and highlights the need for flexible, adaptive health systems and research capabilities to identify, understand, and combat novel pathogens.
In this practical, we are going to learn how to estimate epidemiological delays, the time between two epidemiological events, using a simulated dataset of Disease X .
Disease X is caused by an unknown pathogen, and it is directly transmitted from person-to-person. Specifically, we will focus on estimating the incubation period and serial interval.
Comprehend the key concepts of epidemiological delay distributions for Disease X.
Comprehend the data structures and tools for analysis of contact tracing data.
Learn how to adjust estimates of the serial interval and the incubation period of Disease X for interval censoring using a Bayesian framework.
Learn how to use these quantities to inform control strategies for an outbreak of an unknown pathogen.
In epidemiology, delay distributions refer to the time lags inherent in from two key dates during an outbreak. For example: the time between the symptoms onset and the diagnostics, the time between symptoms and death, among many others.
Today we will be focusing on two key delays known as the incubation period and serial interval. Both are crucial for informing public health response.
The incubation period is the time between infection and symptom onset.
The serial interval is the time between symptom onset in a primary and secondary case pair.
The relationship between these quantities have an impact on whether the disease transmits before (Pre-symptomatic transmission) or after symptoms (Symptomatic transmission) have developed in the primary case (Figure 1).
Figure 1. Relationship between the incubation period and serial interval on the timing of transmission (Adapted from Nishiura et al. 2020)
When estimating epidemiological delays, it is important to consider potential biases:
Censoring means that we know an event happened, but we do not know exactly when it happened. Most epidemiological data are “doubly censored” because there is uncertainty surrounding both primary and secondary event times. Not accounting for censoring can lead to biased estimates of the delay’s standard deviation (Park et al. in progress).
Right truncation is a type of sampling bias related to the data collection process. It arises because only cases that have been reported can be observed. Not accounting for right truncation during the growth phase of an epidemic can lead to underestimation of the mean delay (Park et al. in progress).
Dynamical (or epidemic phase) bias is another type of sampling bias. It affects backward-looking data and is related to the phase of the epidemic: during the exponential growth phase, cases that developed symptoms recently are over-represented in the observed data, while during the declining phase, these cases are underrepresented, leading to the estimation of shorter and longer delay intervals, respectively (Park et al. in progress).
Three common probability distributions used to characterize delays in infectious disease epidemiology include the following (Table 1).
Distribution | Parameters |
---|---|
weibull | shape and scale |
gamma | shape and scale |
log normal | log mean and
log standard deviation |
We will use the following R
packages in this
practical:
dplyr
for data handling
epicontacts
to visualize contact tracing
data
ggplot2
and patchwork
for
plotting
incidence
to visualize epidemic curves
rstan
to estimate the incubation period
coarseDataTools
via EpiEstim
to
estimate the serial interval
Installation instructions for packages:
To load the packages, type:
library(dplyr)
library(epicontacts)
library(incidence)
library(coarseDataTools)
library(EpiEstim)
library(ggplot2)
library(loo)
library(patchwork)
library(rstan)
We are going to split into two groups for this practical to tackle two unknown diseases with different modes of transmission.
Let’s load the simulated data, which is saved as an .RDS
file.
There are two items of interest:
linelist
, a file that contains one case of Disease X
per row.
contacts
, a file with contact tracing data which
contains information about primary and secondary case pairs.
# Group 1
dat <- readRDS("Data/practical_data_group1.RDS")
linelist <- dat$linelist
contacts <- dat$contacts
# Group 2
#dat <- readRDS("Data/practical_data_group2.RDS")
#linelist <- dat$linelist
#contacts <- dat$contacts
linelist
dataLet’s start with the linelist
. These data were collected
as part of routine epidemiological surveillance. Each row represents one
case of Disease X, and there are 7 variables:
id
: unique case id number
date_onset
: the patient’s symptom onset
date
sex
: M = male; F = female
age
: the patient’s age in years
exposure
: information about how the patient might
have been exposed
exposure_start
: earliest date that the patient was
exposed
exposure_end
: latest date that the patient was
exposed
💡 Questions (1)
|
# Inspect data
head(linelist)
## id date_onset sex age exposure exposure_start exposure_end
## 1 1 2023-10-01 M 34 Close, skin-to-skin contact <NA> <NA>
## 2 2 2023-10-03 F 38 Close, skin-to-skin contact 2023-09-29 2023-09-29
## 3 3 2023-10-06 F 38 Close, skin-to-skin contact 2023-09-28 2023-09-28
## 4 4 2023-10-10 F 37 <NA> 2023-09-25 2023-09-27
## 5 5 2023-10-11 F 33 <NA> 2023-10-05 2023-10-05
## 6 6 2023-10-12 F 34 Close, skin-to-skin contact 2023-10-10 2023-10-10
# Q1
nrow(linelist)
## [1] 166
# Q2
table(linelist$sex)[2]/nrow(linelist)
## M
## 0.6144578
#Q3
summary(linelist$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 22.00 33.00 36.00 35.51 38.00 47.00
#Q4
table(linelist$exposure, exclude = F)[1]/nrow(linelist)
## Close, skin-to-skin contact
## 0.6626506
💡 Discuss Why do you think the exposure information is missing for some cases? Discuss. |
Now let’s plot the epidemic curve. Where in the outbreak do you think we are (beginning, middle, end)?
i <- incidence(linelist$date_onset)
plot(i) +
theme_classic() +
scale_fill_manual(values = "purple") +
theme(legend.position = "none")
It looks like the epidemic might still be growing.
contact tracing
dataNow let’s look at the contact tracing data, which were obtained through patient interviews. Patients were asked about recent activities and interactions to identify possible sources of infection. Primary and secondary case pairs were matched if the secondary case named the primary case as a contact. We only have information on a subset of the cases because not all patients could be contacted for an interview.
Note that for this exercise, we will assume that the secondary cases only had one possible infector. In reality, the potential for a case to have multiple possible infectors needs to be assessed.
The contact tracing data has 4 variables:
primary_case_id
: unique id number for the primary
case (infector)
secondary_case_id
: unique id number for the
secondary case (infectee)
primary_onset_date
: symptom onset date of the
primary case
secondary_onset_date
: symptom onset date of the
secondary case
x <- make_epicontacts(linelist = linelist,
contacts = contacts,
from = "primary_case_id",
to = "secondary_case_id",
directed = TRUE) # Don´t remember what directed means
plot(x)
💡 Questions (2)
|
_______Wrap-up 1 _________
Now let’s focus on the incubation period. We will use the
linelist
for this part. We need both time of symptom onset
and time of possible exposure. Notice that there are two dates for
exposure, a start and end, in the data. Sometimes the exact date of
exposure is unknown and exposure window is provided during case
interviews instead.
💡 Questions (3)
For how many cases do we have data on both dates of symptom onset and exposure?
Calculate the exposure windows. How many cases have a single date of exposure?
ip <- filter(linelist, !is.na(exposure_start) &
!is.na(exposure_end))
nrow(ip)
## [1] 50
ip$exposure_window <- as.numeric(ip$exposure_end - ip$exposure_start)
table(ip$exposure_window)
##
## 0 1 2
## 34 11 5
Let’s start by calculating a naive estimate of the incubation period.
# Max incubation period
ip$max_ip <- ip$date_onset - ip$exposure_start
summary(as.numeric(ip$max_ip))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 3.00 4.50 6.38 7.75 20.00
# Minimum incubation period
ip$min_ip <- ip$date_onset - ip$exposure_end
summary(as.numeric(ip$min_ip))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 2.00 4.00 5.96 7.75 19.00
Now let’s fit three probability distributions to the incubation
period data accounting for double censoring. We will adapt some
stan
code that was published by Miura et al. during the
2022 mpox outbreak. This method does not account for right truncation or
dynamical bias.
Remember we are mainly interested in considering three probability distributions: weibull, gamma, lognormal (See Table 1).
Stan
is a software program that implements the
Hamiltonian Monte Carlo (HMC) algorithm. HMC is a Markov chain Monte
Carlo (MCMC) method for fitting complex models to data using Bayesian
statistics.
We will fit all three distributions in this code chunk.
# Prepare data
earliest_exposure <- as.Date(min(ip$exposure_start))
ip <- ip |>
mutate(date_onset = as.numeric(date_onset - earliest_exposure),
exposure_start = as.numeric(exposure_start - earliest_exposure),
exposure_end = as.numeric(exposure_end - earliest_exposure)) |>
select(id, date_onset, exposure_start, exposure_end)
# Setting some options to run the MCMC chains in parallel
# Running the MCMC chains in parallel means that we will run multiple MCMC chains at the same time by using multiple cores on your computer
options(mc.cores=parallel::detectCores())
input_data <- list(N = length(ip$exposure_start), # Number of observations
tStartExposure = ip$exposure_start,
tEndExposure = ip$exposure_end,
tSymptomOnset = ip$date_onset)
# We are going to fit three probbility distributions
distributions <- c("weibull", "gamma", "lognormal") # three distributions to fit
# Stan code
code <- sprintf("
data{
int<lower=1> N;
vector[N] tStartExposure;
vector[N] tEndExposure;
vector[N] tSymptomOnset;
}
parameters{
real<lower=0> par[2];
vector<lower=0, upper=1>[N] uE; // Uniform value for sampling between start and end exposure
}
transformed parameters{
vector[N] tE; // infection moment
tE = tStartExposure + uE .* (tEndExposure - tStartExposure);
}
model{
// Contribution to likelihood of incubation period
target += %s_lpdf(tSymptomOnset - tE | par[1], par[2]);
}
generated quantities {
// likelihood for calculation of looIC
vector[N] log_lik;
for (i in 1:N) {
log_lik[i] = %s_lpdf(tSymptomOnset[i] - tE[i] | par[1], par[2]);
}
}
", distributions, distributions)
names(code) <- distributions
# Next line takes ~1.5 min
models <- mapply(stan_model, model_code = code)
# Takes ~40 sec.
fit <- mapply(sampling, models, list(input_data),
iter=3000, # Number of iterations (length of MCMC chains)
warmup=1000, # Number of samples to discard at the beginning of MCMC
chain=4) # Number of chains to run MCMC
pos <- mapply(function(z) rstan::extract(z)$par, fit, SIMPLIFY=FALSE) # Posterior samples
Now we will check for model convergence. We will look at
r-hat
values, effective sample sizes, and the MCMC traces.
R-hat compares the between- and within-chain estimates for model
parameters; values near 1 tell us that chains have mixed well (Vehtari
et al. 2021). The effective sample size estimates the number of
independent samples after accounting for dependence in the MCMC chains
(Lambert 2018). For a model with 4 MCMC chains, a total effective sample
size of at least 400 is recommended (Vehtari et al. 2021).
For each model of different fitted distributions:
💡 Questions (4)
Are the R-hat values close < 1.1.?
Do the 4 MCMC chains generally lay on top of each other and stay around the same values (fuzzy caterpillars)?
print(fit$gamma, digits = 3, pars = c("par[1]","par[2]"))
## Inference for Stan model: 152fdcd1bc10f11f54191ff282b4b6de.
## 4 chains, each with iter=3000; warmup=1000; thin=1;
## post-warmup draws per chain=2000, total post-warmup draws=8000.
##
## mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
## par[1] 1.974 0.003 0.360 1.340 1.721 1.950 2.207 2.735 10871 1
## par[2] 0.323 0.001 0.067 0.207 0.277 0.318 0.365 0.468 11684 1
##
## Samples were drawn using NUTS(diag_e) at Thu Nov 23 21:58:20 2023.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
rstan::traceplot(fit$gamma, pars = c("par[1]","par[2]"))
print(fit$lognormal, digits = 3, pars = c("par[1]","par[2]"))
## Inference for Stan model: c6aece6be405f0991ed71af1901b826b.
## 4 chains, each with iter=3000; warmup=1000; thin=1;
## post-warmup draws per chain=2000, total post-warmup draws=8000.
##
## mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
## par[1] 1.528 0.001 0.116 1.299 1.452 1.526 1.604 1.758 9134 1
## par[2] 0.801 0.001 0.085 0.652 0.743 0.794 0.851 0.991 9119 1
##
## Samples were drawn using NUTS(diag_e) at Thu Nov 23 21:58:24 2023.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
rstan::traceplot(fit$lognormal, pars = c("par[1]","par[2]"))
print(fit$weibull, digits = 3, pars = c("par[1]","par[2]"))
## Inference for Stan model: d35ce4e4b78b193107f404dc40cbc8ef.
## 4 chains, each with iter=3000; warmup=1000; thin=1;
## post-warmup draws per chain=2000, total post-warmup draws=8000.
##
## mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
## par[1] 1.373 0.002 0.145 1.100 1.273 1.370 1.468 1.661 8192 1
## par[2] 6.947 0.008 0.774 5.535 6.421 6.914 7.440 8.566 8971 1
##
## Samples were drawn using NUTS(diag_e) at Thu Nov 23 21:58:14 2023.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
rstan::traceplot(fit$weibull, pars = c("par[1]","par[2]"))
We can compare the models using comparison metrics, such as:
The widely applicable information criterion (WAIC), and
Leave-one-out-information criterion (LOOIC)
The best fitting model is the model with the lowest WAIC or LOOIC. We will also summarize the distributions and make some plots.
💡 Questions (5)
# Compute WAIC for all three models
waic <- mapply(function(z) waic(extract_log_lik(z))$estimates[3,], fit)
waic
## weibull gamma lognormal
## Estimate 278.05437 276.07397 272.91143
## SE 11.96441 13.35754 13.90588
# For looic, we need to provide the relative effective sample sizes
# when calling loo. This step leads to better estimates of the PSIS effective
# sample sizes and Monte Carlo error
# Extract pointwise log-likelihood for the Weibull distribution
loglik <- extract_log_lik(fit$weibull, merge_chains = FALSE)
# Get the relative effective sample sizes
r_eff <- relative_eff(exp(loglik), cores = 2)
# Compute LOOIC
loo_w <- loo(loglik, r_eff = r_eff, cores = 2)$estimates[3,]
# Print the results
loo_w[1]
## Estimate
## 278.0724
# Extract pointwise log-likelihood for the gamma distribution
loglik <- extract_log_lik(fit$gamma, merge_chains = FALSE)
r_eff <- relative_eff(exp(loglik), cores = 2)
loo_g <- loo(loglik, r_eff = r_eff, cores = 2)$estimates[3,]
loo_g[1]
## Estimate
## 276.0931
# Extract pointwise log-likelihood for the log normal distribution
loglik <- extract_log_lik(fit$lognormal, merge_chains = FALSE)
r_eff <- relative_eff(exp(loglik), cores = 2)
loo_l <- loo(loglik, r_eff = r_eff, cores = 2)$estimates[3,]
loo_l[1]
## Estimate
## 272.9262
The right tail of the incubation period distribution is important for designing control strategies (e.g. quarantine), the 25th-75th percentiles tells us about the most likely time symptom onset could occur, and the full distribution can be used as an input in mathematical or statistical models, like for forecasting (Lessler et al. 2009).
Let’s get the summary statistics
# We need to convert the parameters of the distributions to the mean and standard deviation delay
# In Stan, the parameters of the distributions are:
# Weibull: shape and scale
# Gamma: shape and inverse scale (aka rate)
# Log normal: mu and sigma
# Reference: https://mc-stan.org/docs/2_21/functions-reference/positive-continuous-distributions.html
means <- cbind(pos$weibull[,2]*gamma(1+1/pos$weibull[,1]), # mean of Weibull
pos$gamma[,1] / pos$gamma[,2], # mean of gamma
exp(pos$lognormal[,1]+pos$lognormal[,2]^2/2)) # mean of log normal
standard_deviations <- cbind(sqrt(pos$weibull[,2]^2*(gamma(1+2/pos$weibull[,1])-(gamma(1+1/pos$weibull[,1]))^2)),
sqrt(pos$gamma[,1]/(pos$gamma[,2]^2)),
sqrt((exp(pos$lognormal[,2]^2)-1) * (exp(2*pos$lognormal[,1] + pos$lognormal[,2]^2))))
# Print the mean delays and 95% credible intervals
probs <- c(0.025, 0.5, 0.975)
res_means <- apply(means, 2, quantile, probs)
colnames(res_means) <- colnames(waic)
res_means
## weibull gamma lognormal
## 2.5% 5.138844 5.064270 4.984058
## 50% 6.337632 6.139863 6.314455
## 97.5% 7.853057 7.528840 8.471714
res_sds <- apply(standard_deviations, 2, quantile, probs)
colnames(res_sds) <- colnames(waic)
res_sds
## weibull gamma lognormal
## 2.5% 3.705945 3.418342 3.929640
## 50% 4.652873 4.379483 5.909772
## 97.5% 6.355591 5.859244 10.257078
# Report the median and 95% credible intervals for the quantiles of each distribution
quantiles_to_report <- c(0.025, 0.05, 0.5, 0.95, 0.975, 0.99)
# Weibull
cens_w_percentiles <- sapply(quantiles_to_report, function(p) quantile(qweibull(p = p, shape = pos$weibull[,1], scale = pos$weibull[,2]), probs = probs))
colnames(cens_w_percentiles) <- quantiles_to_report
print(cens_w_percentiles)
## 0.025 0.05 0.5 0.95 0.975 0.99
## 2.5% 0.2241456 0.4231455 4.105487 12.56941 14.51359 16.78929
## 50% 0.4710275 0.7888226 5.293635 15.37022 17.88626 21.03089
## 97.5% 0.8414230 1.2892020 6.616211 20.05848 23.88578 28.85104
# Gamma
cens_g_percentiles <- sapply(quantiles_to_report, function(p) quantile(qgamma(p = p, shape = pos$gamma[,1], rate = pos$gamma[,2]), probs = probs))
colnames(cens_g_percentiles) <- quantiles_to_report
print(cens_g_percentiles)
## 0.025 0.05 0.5 0.95 0.975 0.99
## 2.5% 0.3384874 0.577708 4.119598 11.84247 13.77393 16.25629
## 50% 0.7147845 1.055581 5.114399 14.64894 17.22380 20.54755
## 97.5% 1.1838746 1.609993 6.294281 18.80301 22.42376 27.18235
# Log normal
cens_ln_percentiles <- sapply(quantiles_to_report, function(p) quantile(qlnorm(p = p, meanlog = pos$lognormal[,1], sdlog= pos$lognormal[,2]), probs = probs))
colnames(cens_ln_percentiles) <- quantiles_to_report
print(cens_ln_percentiles)
## 0.025 0.05 0.5 0.95 0.975 0.99
## 2.5% 0.6089193 0.8271572 3.667261 12.47162 15.46233 19.81833
## 50% 0.9727520 1.2503304 4.598263 16.98066 21.80485 29.18082
## 97.5% 1.3731980 1.7047432 5.802374 25.38650 34.36038 48.98636
For each model, find these items for the estimated incubation period in the output above and write them below.
Mean and 95% credible interval
Standard deviation and 95% credible interval
Percentiles (e.g., 2.5, 5, 25, 50, 75, 95, 97.5, 99)
The parameters of the fitted distributions (e.g., shape and scale for gamma distribution)
# Prepare results for plotting
df <- data.frame(
#Take mean values to draw empirical cumulative density function
inc_day = ((input_data$tSymptomOnset-input_data$tEndExposure)+(input_data$tSymptomOnset-input_data$tStartExposure))/2
)
x_plot <- seq(0, 30, by=0.1) # This sets the range of the x axis (number of days)
Gam_plot <- as.data.frame(list(dose= x_plot,
pred= sapply(x_plot, function(q) quantile(pgamma(q = q, shape = pos$gamma[,1], rate = pos$gamma[,2]), probs = c(0.5))),
low = sapply(x_plot, function(q) quantile(pgamma(q = q, shape = pos$gamma[,1], rate = pos$gamma[,2]), probs = c(0.025))),
upp = sapply(x_plot, function(q) quantile(pgamma(q = q, shape = pos$gamma[,1], rate = pos$gamma[,2]), probs = c(0.975)))
))
Wei_plot <- as.data.frame(list(dose= x_plot,
pred= sapply(x_plot, function(q) quantile(pweibull(q = q, shape = pos$weibull[,1], scale = pos$weibull[,2]), probs = c(0.5))),
low = sapply(x_plot, function(q) quantile(pweibull(q = q, shape = pos$weibull[,1], scale = pos$weibull[,2]), probs = c(0.025))),
upp = sapply(x_plot, function(q) quantile(pweibull(q = q, shape = pos$weibull[,1], scale = pos$weibull[,2]), probs = c(0.975)))
))
ln_plot <- as.data.frame(list(dose= x_plot,
pred= sapply(x_plot, function(q) quantile(plnorm(q = q, meanlog = pos$lognormal[,1], sdlog= pos$lognormal[,2]), probs = c(0.5))),
low = sapply(x_plot, function(q) quantile(plnorm(q = q, meanlog = pos$lognormal[,1], sdlog= pos$lognormal[,2]), probs = c(0.025))),
upp = sapply(x_plot, function(q) quantile(plnorm(q = q, meanlog = pos$lognormal[,1], sdlog= pos$lognormal[,2]), probs = c(0.975)))
))
# Plot cumulative distribution curves
gamma_ggplot <- ggplot(df, aes(x=inc_day)) +
stat_ecdf(geom = "step")+
xlim(c(0, 30))+
geom_line(data=Gam_plot, aes(x=x_plot, y=pred), color=RColorBrewer::brewer.pal(11, "RdBu")[11], linewidth=1) +
geom_ribbon(data=Gam_plot, aes(x=x_plot,ymin=low,ymax=upp), fill = RColorBrewer::brewer.pal(11, "RdBu")[11], alpha=0.1) +
theme_bw(base_size = 11)+
labs(x="Incubation period (days)", y = "Proportion")+
ggtitle("Gamma")
weibul_ggplot <- ggplot(df, aes(x=inc_day)) +
stat_ecdf(geom = "step")+
xlim(c(0, 30))+
geom_line(data=Wei_plot, aes(x=x_plot, y=pred), color=RColorBrewer::brewer.pal(11, "RdBu")[11], linewidth=1) +
geom_ribbon(data=Wei_plot, aes(x=x_plot,ymin=low,ymax=upp), fill = RColorBrewer::brewer.pal(11, "RdBu")[11], alpha=0.1) +
theme_bw(base_size = 11)+
labs(x="Incubation period (days)", y = "Proportion")+
ggtitle("Weibull")
lognorm_ggplot <- ggplot(df, aes(x=inc_day)) +
stat_ecdf(geom = "step")+
xlim(c(0, 30))+
geom_line(data=ln_plot, aes(x=x_plot, y=pred), color=RColorBrewer::brewer.pal(11, "RdBu")[11], linewidth=1) +
geom_ribbon(data=ln_plot, aes(x=x_plot,ymin=low,ymax=upp), fill = RColorBrewer::brewer.pal(11, "RdBu")[11], alpha=0.1) +
theme_bw(base_size = 11)+
labs(x="Incubation period (days)", y = "Proportion")+
ggtitle("Log normal")
(lognorm_ggplot|gamma_ggplot|weibul_ggplot) + plot_annotation(tag_levels = 'A')
In the plots above, the black line is the empirical cumulative distribution (the data), while the blue curve is the fitted probability distribution with the 95% credible intervals. Make sure that the blue curve lays on top of the black line.
💡 Questions (6)
_______Wrap-up 2 _________
Now we will estimate the serial interval. Again, we’ll do a naive estimate first by calculating the difference between the onset times of primary and secondary case pairs.
Are there any instances of negative serial intervals in the data (e.g., symptom onset in the secondary case came before symptom onset in primary case)?
Report the median serial interval as well as the minimum and maximum.
Plot the distribution of the serial interval.
contacts$diff <- as.numeric(contacts$secondary_onset_date - contacts$primary_onset_date)
summary(contacts$diff)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 4.000 6.000 7.625 10.000 23.000
hist(contacts$diff, xlab = "Serial interval (days)", breaks = 25, main = "", col = "pink")
Now we will estimate the serial interval using an implementation of
the courseDataTools
package within the
EpiEstim
R package. This method accounts for double
censoring and allows different probability distributions to be compared,
but it does not adjust for right truncation or dynamical bias.
We will consider three probability distributions and select the one that best fits the data using WAIC or LOOIC. Recall that the best-fitting distribution will have the lowest WAIC or LOOIC.
Note that in coarseDataTools
, the parameters for the
distributions are slightly different than in rstan. Here, the parameters
for the gamma distribution are shape and scale (https://cran.r-project.org/web/packages/coarseDataTools/coarseDataTools.pdf).
We will only run one MCMC chain for each distribution in the interest of time, but you should run more than one chain in practice to make sure the MCMC converges on the target distribution. We will use the default prior distributions, which can be found in the package documentation (see ‘details’ for the dic.fit.mcmc function here: https://cran.r-project.org/web/packages/coarseDataTools/coarseDataTools.pdf).
# Format interval-censored serial interval data
# Each line represents a transmission event
# EL/ER show the lower/upper bound of the symptoms onset date in the primary case (infector)
# SL/SR show the same for the secondary case (infectee)
# type has entries 0 corresponding to doubly interval-censored data
# (see Reich et al. Statist. Med. 2009)
si_data <- contacts |>
select(-primary_case_id, -secondary_case_id, -primary_onset_date, -secondary_onset_date,) |>
rename(SL = diff) |>
mutate(type = 0, EL = 0, ER = 1, SR = SL + 1) |>
select(EL, ER, SL, SR, type)
Let’s fit the gamma distribution to the serial interval data first.
overall_seed <- 3 # seed for the random number generator for MCMC
MCMC_seed <- 007
# We will run the model for 4000 iterations with the first 1000 samples discarded as burnin
n_mcmc_samples <- 3000 # number of samples to draw from the posterior (after the burnin)
params = list(
dist = "G", # Fitting a Gamma distribution for the SI
method = "MCMC", # MCMC using coarsedatatools
burnin = 1000, # number of burnin samples (samples discarded at the beginning of MCMC)
n1 = 50, # n1 is the number of pairs of mean and sd of the SI that are drawn
n2 = 50) # n2 is the size of the posterior sample drawn for each pair of mean, sd of SI
mcmc_control <- make_mcmc_control(
seed = MCMC_seed,
burnin = params$burnin)
dist <- params$dist
config <- make_config(
list(
si_parametric_distr = dist,
mcmc_control = mcmc_control,
seed = overall_seed,
n1 = params$n1,
n2 = params$n2))
# Fitting the SI
si_fit <- coarseDataTools::dic.fit.mcmc(
dat = si_data,
dist = dist,
init.pars = init_mcmc_params(si_data, dist),
burnin = mcmc_control$burnin,
n.samples = n_mcmc_samples,
seed = mcmc_control$seed)
## Running 4000 MCMC iterations
## MCMCmetrop1R iteration 1 of 4000
## function value = -201.95281
## theta =
## 1.04012
## 0.99131
## Metropolis acceptance rate = 0.00000
##
## MCMCmetrop1R iteration 1001 of 4000
## function value = -202.42902
## theta =
## 0.95294
## 1.05017
## Metropolis acceptance rate = 0.55445
##
## MCMCmetrop1R iteration 2001 of 4000
## function value = -201.88547
## theta =
## 1.18355
## 0.82124
## Metropolis acceptance rate = 0.56522
##
## MCMCmetrop1R iteration 3001 of 4000
## function value = -202.11109
## theta =
## 1.26323
## 0.77290
## Metropolis acceptance rate = 0.55148
##
##
##
## @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
## The Metropolis acceptance rate was 0.55500
## @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
Let’s look at the results.
# Check convergence of MCMC chains
converg_diag <- check_cdt_samples_convergence(si_fit@samples)
converg_diag
# Estimated serial interval results
si_fit
# Save these for later
si_fit_gamma <- si_fit
# Save MCMC samples in dataframe
si_samples <- data.frame(
type = 'Symptom onset',
shape = si_fit@samples$var1,
scale = si_fit@samples$var2,
p50 = qgamma(
p = 0.5,
shape = si_fit@samples$var1,
scale = si_fit@samples$var2)) |>
mutate( # Equation for conversion is here https://en.wikipedia.org/wiki/Gamma_distribution
mean = shape*scale,
sd = sqrt(shape*scale^2)
)
# Get the mean, SD, and 95% CrI
si_samples |>
summarise(
mean_mean = mean(mean),
mean_l_ci = quantile(mean,probs=.025),
mean_u_ci = quantile(mean,probs=.975),
sd_mean = mean(sd),
sd_l_ci = quantile(sd,probs=.025),
sd_u_ci = quantile(sd,probs=.975)
)
# Get the same summary statistics for the parameters of the distribution
si_samples |>
summarise(
shape_mean = mean(shape),
shape_l_ci = quantile(shape, probs=.025),
shape_u_ci = quantile(shape, probs=.975),
scale_mean = mean(scale),
scale_l_ci = quantile(scale, probs=.025),
scale_u_ci = quantile(scale, probs=.975)
)
# We will need these for plotting later
gamma_shape <- si_fit@ests['shape',][1]
gamma_rate <- 1 / si_fit@ests['scale',][1]
Now let’s fit a log normal distribution to the serial interval data.
# We will run the model for 4000 iterations with the first 1000 samples discarded as burnin
n_mcmc_samples <- 3000 # number of samples to draw from the posterior (after the burnin)
params = list(
dist = "L", # Fitting a log normal distribution for the SI
method = "MCMC", # MCMC using coarsedatatools
burnin = 1000, # number of burnin samples (samples discarded at the beginning of MCMC)
n1 = 50, # n1 is the number of pairs of mean and sd of the SI that are drawn
n2 = 50) # n2 is the size of the posterior sample drawn for each pair of mean, sd of SI
mcmc_control <- make_mcmc_control(
seed = MCMC_seed,
burnin = params$burnin)
dist <- params$dist
config <- make_config(
list(
si_parametric_distr = dist,
mcmc_control = mcmc_control,
seed = overall_seed,
n1 = params$n1,
n2 = params$n2))
# Fitting the serial interval
si_fit <- coarseDataTools::dic.fit.mcmc(
dat = si_data,
dist = dist,
init.pars = init_mcmc_params(si_data, dist),
burnin = mcmc_control$burnin,
n.samples = n_mcmc_samples,
seed = mcmc_control$seed)
## Running 4000 MCMC iterations
## MCMCmetrop1R iteration 1 of 4000
## function value = -216.54581
## theta =
## 1.94255
## -0.47988
## Metropolis acceptance rate = 1.00000
##
## MCMCmetrop1R iteration 1001 of 4000
## function value = -216.21549
## theta =
## 1.81272
## -0.47418
## Metropolis acceptance rate = 0.56643
##
## MCMCmetrop1R iteration 2001 of 4000
## function value = -215.86791
## theta =
## 1.85433
## -0.52118
## Metropolis acceptance rate = 0.57221
##
## MCMCmetrop1R iteration 3001 of 4000
## function value = -216.32347
## theta =
## 1.93196
## -0.52205
## Metropolis acceptance rate = 0.55948
##
##
##
## @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
## The Metropolis acceptance rate was 0.56875
## @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
Let’s check the results.
# Check convergence of MCMC chains
converg_diag <- check_cdt_samples_convergence(si_fit@samples)
converg_diag
# Serial interval results
si_fit
# Save these for later
si_fit_lnorm <- si_fit
# Save MCMC samples in a dataframe
si_samples <- data.frame(
type = 'Symptom onset',
meanlog = si_fit@samples$var1,
sdlog = si_fit@samples$var2,
p50 = qlnorm(
p = 0.5,
meanlog = si_fit@samples$var1,
sdlog = si_fit@samples$var2)) |>
mutate( # Equation for conversion is here https://en.wikipedia.org/wiki/Log-normal_distribution
mean = exp(meanlog + (sdlog^2/2)),
sd = sqrt((exp(sdlog^2)-1) * (exp(2*meanlog + sdlog^2)))
)
# Get the mean, SD, and 95% CrI
si_samples %>%
summarise(
mean_mean = mean(mean),
mean_l_ci = quantile(mean,probs=.025),
mean_u_ci = quantile(mean,probs=.975),
sd_mean = mean(sd),
sd_l_ci = quantile(sd,probs=.025),
sd_u_ci = quantile(sd,probs=.975)
)
# Now get the summary statistics for the parameters of the distribution
si_samples |>
summarise(
meanlog_mean = mean(meanlog),
meanlog_l_ci = quantile(meanlog, probs=.025),
meanlog_u_ci = quantile(meanlog, probs=.975),
sdlog_mean = mean(sdlog),
sdlog_l_ci = quantile(sdlog, probs=.025),
sdlog_u_ci = quantile(sdlog, probs=.975)
)
lognorm_meanlog <- si_fit@ests['meanlog',][1]
lognorm_sdlog <- si_fit@ests['sdlog',][1]
Finally, let’s fit a Weibull distribution to the serial interval data.
# We will run the model for 4000 iterations with the first 1000 samples discarded as burnin
n_mcmc_samples <- 3000 # number of samples to draw from the posterior (after the burnin)
params = list(
dist = "W", # Fitting a weibull distribution for the SI
method = "MCMC", # MCMC using coarsedatatools
burnin = 1000, # number of burnin samples (samples discarded at the beginning of MCMC)
n1 = 50, # n1 is the number of pairs of mean and sd of the SI that are drawn
n2 = 50) # n2 is the size of the posterior sample drawn for each pair of mean, sd of SI
mcmc_control <- make_mcmc_control(
seed = MCMC_seed,
burnin = params$burnin)
dist <- params$dist
config <- make_config(
list(
si_parametric_distr = dist,
mcmc_control = mcmc_control,
seed = overall_seed,
n1 = params$n1,
n2 = params$n2))
# Fitting the serial interval
si_fit <- coarseDataTools::dic.fit.mcmc(
dat = si_data,
dist = dist,
init.pars = init_mcmc_params(si_data, dist),
burnin = mcmc_control$burnin,
n.samples = n_mcmc_samples,
seed = mcmc_control$seed)
Now we’ll check the results.
# Check convergence
converg_diag <- check_cdt_samples_convergence(si_fit@samples)
##
## Gelman-Rubin MCMC convergence diagnostic was successful.
converg_diag
## [1] TRUE
# Serial interval results
si_fit
## Coarse Data Model Parameter and Quantile Estimates:
## est CIlow CIhigh
## shape 1.791 1.508 2.090
## scale 8.631 7.481 9.943
## p5 1.647 1.102 2.241
## p50 7.025 5.967 8.158
## p95 15.878 13.957 18.927
## p99 20.191 17.560 24.681
## Note: please check that the MCMC converged on the target distribution by running multiple chains. MCMC samples are available in the mcmc slot (e.g. my.fit@mcmc)
# Save these for later
si_fit_weibull <- si_fit
# Save MCMC samples in dataframe
si_samples <- data.frame(
type = 'Symptom onset',
shape = si_fit@samples$var1,
scale = si_fit@samples$var2,
p50 = qweibull(
p = 0.5,
shape = si_fit@samples$var1,
scale = si_fit@samples$var2)) |>
mutate( # Equation for conversion is here https://en.wikipedia.org/wiki/Weibull_distribution
mean = scale*gamma(1+1/shape),
sd = sqrt(scale^2*(gamma(1+2/shape)-(gamma(1+1/shape))^2))
)
# Get the summary statistics
si_samples %>%
summarise(
mean_mean = mean(mean),
mean_l_ci = quantile(mean,probs=.025),
mean_u_ci = quantile(mean,probs=.975),
sd_mean = mean(sd),
sd_l_ci = quantile(sd, probs=.025),
sd_u_ci = quantile(sd, probs=.975)
)
## mean_mean mean_l_ci mean_u_ci sd_mean sd_l_ci sd_u_ci
## 1 7.693916 6.677548 8.809715 4.463549 3.797158 5.456492
# Now get the summary statistics for the parameters of the distribution
si_samples |>
summarise(
shape_mean = mean(shape),
shape_l_ci = quantile(shape, probs=.025),
shape_u_ci = quantile(shape, probs=.975),
scale_mean = mean(scale),
scale_l_ci = quantile(scale, probs=.025),
scale_u_ci = quantile(scale, probs=.975)
)
## shape_mean shape_l_ci shape_u_ci scale_mean scale_l_ci scale_u_ci
## 1 1.792317 1.50812 2.090071 8.638298 7.481234 9.943215
weibull_shape <- si_fit@ests['shape',][1]
weibull_scale <- si_fit@ests['scale',][1]
Now we will plot the three fitted distributions. Make sure the distributions fit the serial interval data well.
ggplot()+
theme_classic()+
geom_bar(data = contacts, aes(x=diff), fill = "#FFAD05") +
scale_y_continuous(limits = c(0,14), breaks = c(0,2,4,6,8,10,12,14), expand = c(0,0)) +
stat_function(
linewidth=.8,
fun = function(z, shape, rate)(dgamma(z, shape, rate) * length(contacts$diff)),
args = list(shape = gamma_shape, rate = gamma_rate),
aes(linetype = 'Gamma')
) +
stat_function(
linewidth=.8,
fun = function(z, meanlog, sdlog)(dlnorm(z, meanlog, sdlog) * length(contacts$diff)),
args = list(meanlog = lognorm_meanlog, sdlog = lognorm_sdlog),
aes(linetype ='Log normal')
) +
stat_function(
linewidth=.8,
fun = function(z, shape, scale)(dweibull(z, shape, scale) * length(contacts$diff)),
args = list(shape = weibull_shape, scale = weibull_scale),
aes(linetype ='Weibull')
) +
scale_linetype_manual(values = c('solid','twodash','dotted')) +
labs(x = "Days", y = "Number of case pairs") +
theme(
legend.position = c(0.75, 0.75),
plot.margin = margin(.2,5.2,.2,.2, "cm"),
legend.title = element_blank(),
)
Let’s calculate the WAIC
and LOOIC
. The
library coarseDataTools
does not have a built-in way to do
this, so we need to calculate the likelihood from the MCMC chains and
use the loo
R package.
# Load likelihood functions from coarseDataTools
source("Functions/likelihood_function.R")
compare_gamma <- calc_looic_waic(symp = si_data, symp_si = si_fit_gamma, dist = "G")
## Warning: Relative effective sample sizes ('r_eff' argument) not specified.
## For models fit with MCMC, the reported PSIS effective sample sizes and
## MCSE estimates will be over-optimistic.
compare_lnorm <- calc_looic_waic(symp = si_data, symp_si = si_fit_lnorm, dist = "L")
## Warning: Relative effective sample sizes ('r_eff' argument) not specified.
## For models fit with MCMC, the reported PSIS effective sample sizes and
## MCSE estimates will be over-optimistic.
## Warning: Some Pareto k diagnostic values are slightly high. See help('pareto-k-diagnostic') for details.
compare_weibull <- calc_looic_waic(symp = si_data, symp_si = si_fit_weibull, dist = "W")
## Warning:
## 2 (2.8%) p_waic estimates greater than 0.4. We recommend trying loo instead.
## Warning: Relative effective sample sizes ('r_eff' argument) not specified.
## For models fit with MCMC, the reported PSIS effective sample sizes and
## MCSE estimates will be over-optimistic.
# Print results
compare_gamma[["waic"]]$estimates
## Estimate SE
## elpd_waic -202.009546 7.0118463
## p_waic 1.936991 0.4409454
## waic 404.019093 14.0236925
compare_lnorm[["waic"]]$estimates
## Estimate SE
## elpd_waic -202.215546 7.0178113
## p_waic 1.874728 0.4106866
## waic 404.431092 14.0356226
compare_weibull[["waic"]]$estimates
## Estimate SE
## elpd_waic -203.941362 6.9020056
## p_waic 2.083623 0.6633186
## waic 407.882723 13.8040112
compare_gamma[["looic"]]$estimates
## Estimate SE
## elpd_loo -202.015856 7.0144149
## p_loo 1.943301 0.4437589
## looic 404.031713 14.0288297
compare_lnorm[["looic"]]$estimates
## Estimate SE
## elpd_loo -202.218954 7.0182826
## p_loo 1.878136 0.4114664
## looic 404.437909 14.0365651
compare_weibull[["looic"]]$estimates
## Estimate SE
## elpd_loo -203.956018 6.9097884
## p_loo 2.098279 0.6725387
## looic 407.912036 13.8195768
Include the following when reporting the serial interval:
Mean and 95% credible interval
Standard deviation and 95% credible interval
The parameters of the fitted distribution (e.g., shape and scale for gamma distribution)
💡 Questions (7)
_______Wrap-up 3 _________
You have now finisiled the statistical analysis 🥳
Compare the incubation period and the serial interval.
Which one is longer?
Does pre-symptomatic transmission occur with Disease X?
Will isolating symptomatic individuals and tracing and quarantining their contacts be effective measures?
If not, which other measures can be implemented to slow the outbreak?
Based on the estimated delays as well as the demographics of cases, design a contact tracing strategy for the outbreak, including a communications plan.
The true values used to simulate the epidemics were:
Distribution | Mean (days) | Standard deviation (days) | |
Group 1 | Gamma | 8.4 | 4.9 |
Group 2 | Gamma | 4.8 | 3.3 |
Distribution | Mean (days) | Standard deviation (days) | |
Group 1 | Log normal | 5.7 | 4.6 |
Group 2 | Weibull | 7.1 | 3.7 |
How do your estimates compare with the true values? Discuss possible reasons for any differences.
In this practical, we largely ignored biases due to right truncation
and dynamical bias, partly due to a lack of readily available tools that
implement all best practices. To learn more about how these biases could
affect the estimation of epidemiological delay distributions in
real-time, we recommend a tutorial on the
dynamicaltruncation
R package by Sam Abbott and Sang Woo
Park (https://github.com/parksw3/epidist-paper).
Kelly Charniga, Zachary Madewell, Zulma M. Cucunubá