1. Introduction

Disease X represents a hypothetical, yet plausible, future infectious disease outbreak. Coined by the World Health Organization (WHO), it serves as a catch-all term for an unknown pathogen that could cause a serious international epidemic. This concept emphasizes the unpredictable nature of infectious disease emergence and underscores the necessity for global preparedness and rapid response mechanisms. Disease X symbolizes the potential for an unexpected, rapidly spreading disease, and highlights the need for flexible, adaptive health systems and research capabilities to identify, understand, and combat novel pathogens.

In this practical, we are going to learn how to estimate epidemiological delays, the time between two epidemiological events, using a simulated dataset of Disease X .

Disease X is caused by an unknown pathogen, and it is directly transmitted from person-to-person. Specifically, we will focus on estimating the incubation period and serial interval.

2. Learning objectives

Comprehend the key concepts of epidemiological delay distributions for Disease X.
Comprehend the data structures and tools for analysis of contact tracing data.
Learn how to adjust estimates of the serial interval and the incubation period of Disease X for interval censoring using a Bayesian framework.
Learn how to use these quantities to inform control strategies for an outbreak of an unknown pathogen.

3. Key Concepts

3.1. Epidemiological delays: incubation period and serial interval

In epidemiology, delay distributions refer to the time lags inherent in from two key dates during an outbreak. For example: the time between the symptoms onset and the diagnostics, the time between symptoms and death, among many others.

Today we will be focusing on two key delays known as the incubation period and serial interval. Both are crucial for informing public health response.

The incubation period is the time between infection and symptom onset.

The serial interval is the time between symptom onset in a primary and secondary case pair.

The relationship between these quantities have an impact on whether the disease transmits before (Pre-symptomatic transmission) or after symptoms (Symptomatic transmission) have developed in the primary case (Figure 1).

Figure 1. Relationship between the incubation period and serial interval on the timing of transmission (Adapted from Nishiura et al. 2020)

3.2. Potential biases and adjustments of epidemiological delays

2.2.1 Potential biases

When estimating epidemiological delays, it is important to consider potential biases:

Censoring means that we know an event happened, but we do not know exactly when it happened. Most epidemiological data are “doubly censored” because there is uncertainty surrounding both primary and secondary event times. Not accounting for censoring can lead to biased estimates of the delay’s standard deviation (Park et al. in progress).

Right truncation is a type of sampling bias related to the data collection process. It arises because only cases that have been reported can be observed. Not accounting for right truncation during the growth phase of an epidemic can lead to underestimation of the mean delay (Park et al. in progress).

Dynamical (or epidemic phase) bias is another type of sampling bias. It affects backward-looking data and is related to the phase of the epidemic: during the exponential growth phase, cases that developed symptoms recently are over-represented in the observed data, while during the declining phase, these cases are underrepresented, leading to the estimation of shorter and longer delay intervals, respectively (Park et al. in progress).

2.2.2 Delay distributions

Three common probability distributions used to characterize delays in infectious disease epidemiology include the following (Table 1).

Table 1. Three of the most common probability distributions for epidemiological delays.
Distribution	Parameters
weibull	`shape` and `scale`
gamma	`shape` and `scale`
log normal	`log mean` and `log standard deviation`

4. R packages for the practical

We will use the following R packages in this practical:

dplyr for data handling
epicontacts to visualize contact tracing data
ggplot2 and patchwork for plotting
incidence to visualize epidemic curves
rstan to estimate the incubation period
coarseDataTools via EpiEstim to estimate the serial interval

Installation instructions for packages:

To load the packages, type:

library(dplyr)
library(epicontacts)
library(incidence)
library(coarseDataTools)
library(EpiEstim)
library(ggplot2)
library(loo)
library(patchwork)
library(rstan)

5. Data

We are going to split into two groups for this practical to tackle two unknown diseases with different modes of transmission.

Let’s load the simulated data, which is saved as an .RDS file.

There are two items of interest:

linelist, a file that contains one case of Disease X per row.
contacts, a file with contact tracing data which contains information about primary and secondary case pairs.

# Group 1
dat <- readRDS("Data/practical_data_group1.RDS")
linelist <- dat$linelist
contacts <- dat$contacts

# Group 2
#dat <- readRDS("Data/practical_data_group2.RDS")
#linelist <- dat$linelist
#contacts <- dat$contacts

6. Data exploration

6.1. Exploring `linelist` data

Let’s start with the linelist. These data were collected as part of routine epidemiological surveillance. Each row represents one case of Disease X, and there are 7 variables:

id: unique case id number
date_onset: the patient’s symptom onset date
sex: M = male; F = female
age: the patient’s age in years
exposure: information about how the patient might have been exposed
exposure_start: earliest date that the patient was exposed
exposure_end: latest date that the patient was exposed

💡 Questions (1)

How many cases are in the linelist dataset?
What proportion of the cases are female?
What is the age distribution of cases?
What type of exposure information is available?

# Inspect data
head(linelist)

##   id date_onset sex age                    exposure exposure_start exposure_end
## 1  1 2023-10-01   M  34 Close, skin-to-skin contact           <NA>         <NA>
## 2  2 2023-10-03   F  38 Close, skin-to-skin contact     2023-09-29   2023-09-29
## 3  3 2023-10-06   F  38 Close, skin-to-skin contact     2023-09-28   2023-09-28
## 4  4 2023-10-10   F  37                        <NA>     2023-09-25   2023-09-27
## 5  5 2023-10-11   F  33                        <NA>     2023-10-05   2023-10-05
## 6  6 2023-10-12   F  34 Close, skin-to-skin contact     2023-10-10   2023-10-10

# Q1
nrow(linelist)

## [1] 166

# Q2
table(linelist$sex)[2]/nrow(linelist)

##         M 
## 0.6144578

#Q3
summary(linelist$age)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   22.00   33.00   36.00   35.51   38.00   47.00

#Q4
table(linelist$exposure, exclude = F)[1]/nrow(linelist)

## Close, skin-to-skin contact 
##                   0.6626506

💡 Discuss

Why do you think the exposure information is missing for some cases? Discuss.

Now let’s plot the epidemic curve. Where in the outbreak do you think we are (beginning, middle, end)?

i <- incidence(linelist$date_onset)
plot(i) + 
  theme_classic() + 
  scale_fill_manual(values = "purple") +
  theme(legend.position = "none")

It looks like the epidemic might still be growing.

6.1. Exploring `contact tracing` data

Now let’s look at the contact tracing data, which were obtained through patient interviews. Patients were asked about recent activities and interactions to identify possible sources of infection. Primary and secondary case pairs were matched if the secondary case named the primary case as a contact. We only have information on a subset of the cases because not all patients could be contacted for an interview.

Note that for this exercise, we will assume that the secondary cases only had one possible infector. In reality, the potential for a case to have multiple possible infectors needs to be assessed.

The contact tracing data has 4 variables:

primary_case_id: unique id number for the primary case (infector)
secondary_case_id: unique id number for the secondary case (infectee)
primary_onset_date: symptom onset date of the primary case
secondary_onset_date: symptom onset date of the secondary case

x <- make_epicontacts(linelist = linelist,
                       contacts = contacts,
                       from = "primary_case_id",
                       to = "secondary_case_id",
                       directed = TRUE) # Don´t remember what directed means

plot(x)

💡 Questions (2)

Describe the clusters.
Do you see any potential superspreading events (where one cases spreads the pathogen to many other cases)?

`_Wrap-up 1 ___`

7. Estimating the incubation period

Now let’s focus on the incubation period. We will use the linelist for this part. We need both time of symptom onset and time of possible exposure. Notice that there are two dates for exposure, a start and end, in the data. Sometimes the exact date of exposure is unknown and exposure window is provided during case interviews instead.

💡 Questions (3)

For how many cases do we have data on both dates of symptom onset and exposure?
Calculate the exposure windows. How many cases have a single date of exposure?

ip <- filter(linelist, !is.na(exposure_start) &
               !is.na(exposure_end))
nrow(ip)

## [1] 50

ip$exposure_window <- as.numeric(ip$exposure_end - ip$exposure_start)

table(ip$exposure_window)

## 
##  0  1  2 
## 34 11  5

7.1. Naive estimate of the incubation period

Let’s start by calculating a naive estimate of the incubation period.

# Max incubation period 
ip$max_ip <- ip$date_onset - ip$exposure_start
summary(as.numeric(ip$max_ip))

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    3.00    4.50    6.38    7.75   20.00

# Minimum incubation period
ip$min_ip <- ip$date_onset - ip$exposure_end
summary(as.numeric(ip$min_ip))

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    2.00    4.00    5.96    7.75   19.00

7.2. Censoring adjusted estimate of the incubation period

Now let’s fit three probability distributions to the incubation period data accounting for double censoring. We will adapt some stan code that was published by Miura et al. during the 2022 mpox outbreak. This method does not account for right truncation or dynamical bias.

Remember we are mainly interested in considering three probability distributions: weibull, gamma, lognormal (See Table 1).

Stan is a software program that implements the Hamiltonian Monte Carlo (HMC) algorithm. HMC is a Markov chain Monte Carlo (MCMC) method for fitting complex models to data using Bayesian statistics.

7.1.1. Running the model in Stan

We will fit all three distributions in this code chunk.

# Prepare data
earliest_exposure <- as.Date(min(ip$exposure_start))

ip <- ip |>
  mutate(date_onset = as.numeric(date_onset - earliest_exposure),
         exposure_start = as.numeric(exposure_start - earliest_exposure),
         exposure_end = as.numeric(exposure_end - earliest_exposure)) |>
  select(id, date_onset, exposure_start, exposure_end)

# Setting some options to run the MCMC chains in parallel
# Running the MCMC chains in parallel means that we will run multiple MCMC chains at the same time by using multiple cores on your computer
options(mc.cores=parallel::detectCores())

input_data <- list(N = length(ip$exposure_start), # Number of observations
              tStartExposure = ip$exposure_start,
              tEndExposure = ip$exposure_end,
              tSymptomOnset = ip$date_onset)

# We are going to fit three probbility distributions
distributions <- c("weibull", "gamma", "lognormal") # three distributions to fit

# Stan code
code <- sprintf("
  data{
    int<lower=1> N;
    vector[N] tStartExposure;
    vector[N] tEndExposure;
    vector[N] tSymptomOnset;
  }
  parameters{
    real<lower=0> par[2];
    vector<lower=0, upper=1>[N] uE; // Uniform value for sampling between start and end exposure
  }
  transformed parameters{
    vector[N] tE;   // infection moment
    tE = tStartExposure + uE .* (tEndExposure - tStartExposure);
  }
model{
    // Contribution to likelihood of incubation period
    target += %s_lpdf(tSymptomOnset -  tE  | par[1], par[2]);
  }
  generated quantities {
    // likelihood for calculation of looIC
    vector[N] log_lik;
    for (i in 1:N) {
      log_lik[i] = %s_lpdf(tSymptomOnset[i] -  tE[i]  | par[1], par[2]);
    }
  }
", distributions, distributions)
names(code) <- distributions

# Next line takes ~1.5 min
models <- mapply(stan_model, model_code = code)

# Takes ~40 sec.
fit <- mapply(sampling, models, list(input_data), 
              iter=3000, # Number of iterations (length of MCMC chains)
              warmup=1000, # Number of samples to discard at the beginning of MCMC
              chain=4) # Number of chains to run MCMC

pos <- mapply(function(z) rstan::extract(z)$par, fit, SIMPLIFY=FALSE) # Posterior samples

7.1.2. Checking for convergence

Now we will check for model convergence. We will look at r-hat values, effective sample sizes, and the MCMC traces. R-hat compares the between- and within-chain estimates for model parameters; values near 1 tell us that chains have mixed well (Vehtari et al. 2021). The effective sample size estimates the number of independent samples after accounting for dependence in the MCMC chains (Lambert 2018). For a model with 4 MCMC chains, a total effective sample size of at least 400 is recommended (Vehtari et al. 2021).

For each model of different fitted distributions:

💡 Questions (4)

Are the R-hat values close < 1.1.?
Do the 4 MCMC chains generally lay on top of each other and stay around the same values (fuzzy caterpillars)?

7.1.2.1. Convergence for Gamma

print(fit$gamma, digits = 3, pars = c("par[1]","par[2]"))

## Inference for Stan model: 152fdcd1bc10f11f54191ff282b4b6de.
## 4 chains, each with iter=3000; warmup=1000; thin=1; 
## post-warmup draws per chain=2000, total post-warmup draws=8000.
## 
##         mean se_mean    sd  2.5%   25%   50%   75% 97.5% n_eff Rhat
## par[1] 1.974   0.003 0.360 1.340 1.721 1.950 2.207 2.735 10871    1
## par[2] 0.323   0.001 0.067 0.207 0.277 0.318 0.365 0.468 11684    1
## 
## Samples were drawn using NUTS(diag_e) at Thu Nov 23 21:58:20 2023.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).

rstan::traceplot(fit$gamma, pars = c("par[1]","par[2]"))

7.1.2.2. Convergence for log normal

print(fit$lognormal, digits = 3, pars = c("par[1]","par[2]"))

## Inference for Stan model: c6aece6be405f0991ed71af1901b826b.
## 4 chains, each with iter=3000; warmup=1000; thin=1; 
## post-warmup draws per chain=2000, total post-warmup draws=8000.
## 
##         mean se_mean    sd  2.5%   25%   50%   75% 97.5% n_eff Rhat
## par[1] 1.528   0.001 0.116 1.299 1.452 1.526 1.604 1.758  9134    1
## par[2] 0.801   0.001 0.085 0.652 0.743 0.794 0.851 0.991  9119    1
## 
## Samples were drawn using NUTS(diag_e) at Thu Nov 23 21:58:24 2023.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).

rstan::traceplot(fit$lognormal, pars = c("par[1]","par[2]"))

7.1.2.3. Convergence for Weibull

print(fit$weibull, digits = 3, pars = c("par[1]","par[2]"))

## Inference for Stan model: d35ce4e4b78b193107f404dc40cbc8ef.
## 4 chains, each with iter=3000; warmup=1000; thin=1; 
## post-warmup draws per chain=2000, total post-warmup draws=8000.
## 
##         mean se_mean    sd  2.5%   25%   50%   75% 97.5% n_eff Rhat
## par[1] 1.373   0.002 0.145 1.100 1.273 1.370 1.468 1.661  8192    1
## par[2] 6.947   0.008 0.774 5.535 6.421 6.914 7.440 8.566  8971    1
## 
## Samples were drawn using NUTS(diag_e) at Thu Nov 23 21:58:14 2023.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).

rstan::traceplot(fit$weibull, pars = c("par[1]","par[2]"))

7.1.3. Compute model comparison criteria

We can compare the models using comparison metrics, such as:

The widely applicable information criterion (WAIC), and
Leave-one-out-information criterion (LOOIC)

The best fitting model is the model with the lowest WAIC or LOOIC. We will also summarize the distributions and make some plots.

💡 Questions (5)

Which model has the best fit?

# Compute WAIC for all three models
waic <- mapply(function(z) waic(extract_log_lik(z))$estimates[3,], fit)
waic

##            weibull     gamma lognormal
## Estimate 278.05437 276.07397 272.91143
## SE        11.96441  13.35754  13.90588

# For looic, we need to provide the relative effective sample sizes
# when calling loo. This step leads to better estimates of the PSIS effective
# sample sizes and Monte Carlo error

# Extract pointwise log-likelihood for the Weibull distribution
loglik <- extract_log_lik(fit$weibull, merge_chains = FALSE)
# Get the relative effective sample sizes
r_eff <- relative_eff(exp(loglik), cores = 2)
# Compute LOOIC
loo_w <- loo(loglik, r_eff = r_eff, cores = 2)$estimates[3,]
# Print the results
loo_w[1]

## Estimate 
## 278.0724

# Extract pointwise log-likelihood for the gamma distribution
loglik <- extract_log_lik(fit$gamma, merge_chains = FALSE)
r_eff <- relative_eff(exp(loglik), cores = 2)
loo_g <- loo(loglik, r_eff = r_eff, cores = 2)$estimates[3,]
loo_g[1]

## Estimate 
## 276.0931

# Extract pointwise log-likelihood for the log normal distribution
loglik <- extract_log_lik(fit$lognormal, merge_chains = FALSE)
r_eff <- relative_eff(exp(loglik), cores = 2)
loo_l <- loo(loglik, r_eff = r_eff, cores = 2)$estimates[3,]
loo_l[1]

## Estimate 
## 272.9262

7.1.4. Report the results

The right tail of the incubation period distribution is important for designing control strategies (e.g. quarantine), the 25th-75th percentiles tells us about the most likely time symptom onset could occur, and the full distribution can be used as an input in mathematical or statistical models, like for forecasting (Lessler et al. 2009).

Let’s get the summary statistics

# We need to convert the parameters of the distributions to the mean and standard deviation delay

# In Stan, the parameters of the distributions are:
# Weibull: shape and scale
# Gamma: shape and inverse scale (aka rate)
# Log normal: mu and sigma
# Reference: https://mc-stan.org/docs/2_21/functions-reference/positive-continuous-distributions.html
means <- cbind(pos$weibull[,2]*gamma(1+1/pos$weibull[,1]), # mean of Weibull
               pos$gamma[,1] / pos$gamma[,2], # mean of gamma
               exp(pos$lognormal[,1]+pos$lognormal[,2]^2/2)) # mean of log normal

standard_deviations <- cbind(sqrt(pos$weibull[,2]^2*(gamma(1+2/pos$weibull[,1])-(gamma(1+1/pos$weibull[,1]))^2)),
                             sqrt(pos$gamma[,1]/(pos$gamma[,2]^2)),
                             sqrt((exp(pos$lognormal[,2]^2)-1) * (exp(2*pos$lognormal[,1] + pos$lognormal[,2]^2))))

# Print the mean delays and 95% credible intervals
probs <- c(0.025, 0.5, 0.975)

res_means <- apply(means, 2, quantile, probs)
colnames(res_means) <- colnames(waic)
res_means

##        weibull    gamma lognormal
## 2.5%  5.138844 5.064270  4.984058
## 50%   6.337632 6.139863  6.314455
## 97.5% 7.853057 7.528840  8.471714

res_sds <- apply(standard_deviations, 2, quantile, probs)
colnames(res_sds) <- colnames(waic)
res_sds

##        weibull    gamma lognormal
## 2.5%  3.705945 3.418342  3.929640
## 50%   4.652873 4.379483  5.909772
## 97.5% 6.355591 5.859244 10.257078

# Report the median and 95% credible intervals for the quantiles of each distribution
quantiles_to_report <- c(0.025, 0.05, 0.5, 0.95, 0.975, 0.99)

# Weibull
cens_w_percentiles <- sapply(quantiles_to_report, function(p) quantile(qweibull(p = p, shape = pos$weibull[,1], scale = pos$weibull[,2]), probs = probs))
colnames(cens_w_percentiles) <- quantiles_to_report
print(cens_w_percentiles)

##           0.025      0.05      0.5     0.95    0.975     0.99
## 2.5%  0.2241456 0.4231455 4.105487 12.56941 14.51359 16.78929
## 50%   0.4710275 0.7888226 5.293635 15.37022 17.88626 21.03089
## 97.5% 0.8414230 1.2892020 6.616211 20.05848 23.88578 28.85104

# Gamma
cens_g_percentiles <- sapply(quantiles_to_report, function(p) quantile(qgamma(p = p, shape = pos$gamma[,1], rate = pos$gamma[,2]), probs = probs))
colnames(cens_g_percentiles) <- quantiles_to_report
print(cens_g_percentiles)

##           0.025     0.05      0.5     0.95    0.975     0.99
## 2.5%  0.3384874 0.577708 4.119598 11.84247 13.77393 16.25629
## 50%   0.7147845 1.055581 5.114399 14.64894 17.22380 20.54755
## 97.5% 1.1838746 1.609993 6.294281 18.80301 22.42376 27.18235

# Log normal
cens_ln_percentiles <- sapply(quantiles_to_report, function(p) quantile(qlnorm(p = p, meanlog = pos$lognormal[,1], sdlog= pos$lognormal[,2]), probs = probs))
colnames(cens_ln_percentiles) <- quantiles_to_report
print(cens_ln_percentiles)

##           0.025      0.05      0.5     0.95    0.975     0.99
## 2.5%  0.6089193 0.8271572 3.667261 12.47162 15.46233 19.81833
## 50%   0.9727520 1.2503304 4.598263 16.98066 21.80485 29.18082
## 97.5% 1.3731980 1.7047432 5.802374 25.38650 34.36038 48.98636

For each model, find these items for the estimated incubation period in the output above and write them below.

Mean and 95% credible interval
Standard deviation and 95% credible interval
Percentiles (e.g., 2.5, 5, 25, 50, 75, 95, 97.5, 99)
The parameters of the fitted distributions (e.g., shape and scale for gamma distribution)

7.1.5. Plot the results

# Prepare results for plotting
df <- data.frame(
#Take mean values to draw empirical cumulative density function
  inc_day = ((input_data$tSymptomOnset-input_data$tEndExposure)+(input_data$tSymptomOnset-input_data$tStartExposure))/2
)

x_plot <- seq(0, 30, by=0.1) # This sets the range of the x axis (number of days)

Gam_plot <- as.data.frame(list(dose= x_plot, 
                               pred= sapply(x_plot, function(q) quantile(pgamma(q = q, shape = pos$gamma[,1], rate = pos$gamma[,2]), probs = c(0.5))),
                               low = sapply(x_plot, function(q) quantile(pgamma(q = q, shape = pos$gamma[,1], rate = pos$gamma[,2]), probs = c(0.025))),
                               upp = sapply(x_plot, function(q) quantile(pgamma(q = q, shape = pos$gamma[,1], rate = pos$gamma[,2]), probs = c(0.975)))
))

Wei_plot <- as.data.frame(list(dose= x_plot, 
                               pred= sapply(x_plot, function(q) quantile(pweibull(q = q, shape = pos$weibull[,1], scale = pos$weibull[,2]), probs = c(0.5))),
                               low = sapply(x_plot, function(q) quantile(pweibull(q = q, shape = pos$weibull[,1], scale = pos$weibull[,2]), probs = c(0.025))),
                               upp = sapply(x_plot, function(q) quantile(pweibull(q = q, shape = pos$weibull[,1], scale = pos$weibull[,2]), probs = c(0.975)))
))

ln_plot <- as.data.frame(list(dose= x_plot, 
                              pred= sapply(x_plot, function(q) quantile(plnorm(q = q, meanlog = pos$lognormal[,1], sdlog= pos$lognormal[,2]), probs = c(0.5))),
                              low = sapply(x_plot, function(q) quantile(plnorm(q = q, meanlog = pos$lognormal[,1], sdlog= pos$lognormal[,2]), probs = c(0.025))),
                              upp = sapply(x_plot, function(q) quantile(plnorm(q = q, meanlog = pos$lognormal[,1], sdlog= pos$lognormal[,2]), probs = c(0.975)))
))

# Plot cumulative distribution curves
gamma_ggplot <- ggplot(df, aes(x=inc_day)) +
  stat_ecdf(geom = "step")+ 
  xlim(c(0, 30))+
  geom_line(data=Gam_plot, aes(x=x_plot, y=pred), color=RColorBrewer::brewer.pal(11, "RdBu")[11], linewidth=1) +
  geom_ribbon(data=Gam_plot, aes(x=x_plot,ymin=low,ymax=upp), fill = RColorBrewer::brewer.pal(11, "RdBu")[11], alpha=0.1) +
  theme_bw(base_size = 11)+
  labs(x="Incubation period (days)", y = "Proportion")+
  ggtitle("Gamma")

weibul_ggplot <- ggplot(df, aes(x=inc_day)) +
  stat_ecdf(geom = "step")+ 
  xlim(c(0, 30))+
  geom_line(data=Wei_plot, aes(x=x_plot, y=pred), color=RColorBrewer::brewer.pal(11, "RdBu")[11], linewidth=1) +
  geom_ribbon(data=Wei_plot, aes(x=x_plot,ymin=low,ymax=upp), fill = RColorBrewer::brewer.pal(11, "RdBu")[11], alpha=0.1) +
  theme_bw(base_size = 11)+
  labs(x="Incubation period (days)", y = "Proportion")+
  ggtitle("Weibull")

lognorm_ggplot <- ggplot(df, aes(x=inc_day)) +
  stat_ecdf(geom = "step")+ 
  xlim(c(0, 30))+
  geom_line(data=ln_plot, aes(x=x_plot, y=pred), color=RColorBrewer::brewer.pal(11, "RdBu")[11], linewidth=1) +
  geom_ribbon(data=ln_plot, aes(x=x_plot,ymin=low,ymax=upp), fill = RColorBrewer::brewer.pal(11, "RdBu")[11], alpha=0.1) +
  theme_bw(base_size = 11)+
  labs(x="Incubation period (days)", y = "Proportion")+
  ggtitle("Log normal")

(lognorm_ggplot|gamma_ggplot|weibul_ggplot) + plot_annotation(tag_levels = 'A')

In the plots above, the black line is the empirical cumulative distribution (the data), while the blue curve is the fitted probability distribution with the 95% credible intervals. Make sure that the blue curve lays on top of the black line.

💡 Questions (6)

Are the fitting of the distributions what you’d expect?

`_Wrap-up 2 ___`

8. Estimating the serial interval

Now we will estimate the serial interval. Again, we’ll do a naive estimate first by calculating the difference between the onset times of primary and secondary case pairs.

Are there any instances of negative serial intervals in the data (e.g., symptom onset in the secondary case came before symptom onset in primary case)?
Report the median serial interval as well as the minimum and maximum.
Plot the distribution of the serial interval.

8.1. Naive estimate

contacts$diff <- as.numeric(contacts$secondary_onset_date - contacts$primary_onset_date)
summary(contacts$diff)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   4.000   6.000   7.625  10.000  23.000

hist(contacts$diff, xlab = "Serial interval (days)", breaks = 25, main = "", col = "pink")

8.2. Censoring adjusted estimate

Now we will estimate the serial interval using an implementation of the courseDataTools package within the EpiEstim R package. This method accounts for double censoring and allows different probability distributions to be compared, but it does not adjust for right truncation or dynamical bias.

We will consider three probability distributions and select the one that best fits the data using WAIC or LOOIC. Recall that the best-fitting distribution will have the lowest WAIC or LOOIC.

Note that in coarseDataTools, the parameters for the distributions are slightly different than in rstan. Here, the parameters for the gamma distribution are shape and scale (https://cran.r-project.org/web/packages/coarseDataTools/coarseDataTools.pdf).

We will only run one MCMC chain for each distribution in the interest of time, but you should run more than one chain in practice to make sure the MCMC converges on the target distribution. We will use the default prior distributions, which can be found in the package documentation (see ‘details’ for the dic.fit.mcmc function here: https://cran.r-project.org/web/packages/coarseDataTools/coarseDataTools.pdf).

8.2.1. Preparing the data

# Format interval-censored serial interval data

# Each line represents a transmission event 
# EL/ER show the lower/upper bound of the symptoms onset date in the primary case (infector)
# SL/SR show the same for the secondary case (infectee)
# type has entries 0 corresponding to doubly interval-censored data
# (see Reich et al. Statist. Med. 2009)
si_data <- contacts |>
  select(-primary_case_id, -secondary_case_id, -primary_onset_date, -secondary_onset_date,) |>
  rename(SL = diff) |>
  mutate(type = 0, EL = 0, ER = 1, SR = SL + 1) |>
  select(EL, ER, SL, SR, type)

8.2.2. Fitting a gamma distribution

Let’s fit the gamma distribution to the serial interval data first.

overall_seed <- 3 # seed for the random number generator for MCMC
MCMC_seed <- 007

# We will run the model for 4000 iterations with the first 1000 samples discarded as burnin
n_mcmc_samples <- 3000 # number of samples to draw from the posterior (after the burnin)

params = list(
dist = "G", # Fitting a Gamma distribution for the SI
method = "MCMC", # MCMC using coarsedatatools
burnin = 1000, # number of burnin samples (samples discarded at the beginning of MCMC) 
n1 = 50, # n1 is the number of pairs of mean and sd of the SI that are drawn
n2 = 50) # n2 is the size of the posterior sample drawn for each pair of mean, sd of SI

mcmc_control <- make_mcmc_control(
seed = MCMC_seed,
burnin = params$burnin)

dist <- params$dist

config <- make_config(
list(
si_parametric_distr = dist,
mcmc_control = mcmc_control,
seed = overall_seed,
n1 = params$n1,
n2 = params$n2))

# Fitting the SI
si_fit <- coarseDataTools::dic.fit.mcmc(
dat = si_data,
dist = dist,
init.pars = init_mcmc_params(si_data, dist),
burnin = mcmc_control$burnin,
n.samples = n_mcmc_samples,
seed = mcmc_control$seed)

## Running 4000 MCMC iterations 
## MCMCmetrop1R iteration 1 of 4000 
## function value = -201.95281
## theta = 
##    1.04012
##    0.99131
## Metropolis acceptance rate = 0.00000
## 
## MCMCmetrop1R iteration 1001 of 4000 
## function value = -202.42902
## theta = 
##    0.95294
##    1.05017
## Metropolis acceptance rate = 0.55445
## 
## MCMCmetrop1R iteration 2001 of 4000 
## function value = -201.88547
## theta = 
##    1.18355
##    0.82124
## Metropolis acceptance rate = 0.56522
## 
## MCMCmetrop1R iteration 3001 of 4000 
## function value = -202.11109
## theta = 
##    1.26323
##    0.77290
## Metropolis acceptance rate = 0.55148
## 
## 
## 
## @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
## The Metropolis acceptance rate was 0.55500
## @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

Let’s look at the results.

# Check convergence of MCMC chains 
converg_diag <- check_cdt_samples_convergence(si_fit@samples)
converg_diag

# Estimated serial interval results
si_fit

# Save these for later
si_fit_gamma <- si_fit

# Save MCMC samples in dataframe
si_samples <- data.frame(
type = 'Symptom onset',
shape = si_fit@samples$var1,
scale = si_fit@samples$var2,
p50 = qgamma(
p = 0.5, 
shape = si_fit@samples$var1, 
scale = si_fit@samples$var2)) |>
mutate( # Equation for conversion is here https://en.wikipedia.org/wiki/Gamma_distribution
mean = shape*scale,
sd = sqrt(shape*scale^2)
) 

# Get the mean, SD, and 95% CrI
si_samples |>
summarise(
mean_mean = mean(mean),
mean_l_ci = quantile(mean,probs=.025),
mean_u_ci = quantile(mean,probs=.975),
sd_mean = mean(sd),
sd_l_ci = quantile(sd,probs=.025),
sd_u_ci = quantile(sd,probs=.975)
)

# Get the same summary statistics for the parameters of the distribution
si_samples |>
summarise(
shape_mean = mean(shape),
shape_l_ci = quantile(shape, probs=.025),
shape_u_ci = quantile(shape, probs=.975),
scale_mean = mean(scale),
scale_l_ci = quantile(scale, probs=.025),
scale_u_ci = quantile(scale, probs=.975)
)

# We will need these for plotting later
gamma_shape <- si_fit@ests['shape',][1]
gamma_rate <- 1 / si_fit@ests['scale',][1]

8.2.3. Fitting a log normal distribution

Now let’s fit a log normal distribution to the serial interval data.

# We will run the model for 4000 iterations with the first 1000 samples discarded as burnin
n_mcmc_samples <- 3000 # number of samples to draw from the posterior (after the burnin)

params = list(
dist = "L", # Fitting a log normal distribution for the SI
method = "MCMC", # MCMC using coarsedatatools
burnin = 1000, # number of burnin samples (samples discarded at the beginning of MCMC) 
n1 = 50, # n1 is the number of pairs of mean and sd of the SI that are drawn
n2 = 50) # n2 is the size of the posterior sample drawn for each pair of mean, sd of SI

mcmc_control <- make_mcmc_control(
seed = MCMC_seed,
burnin = params$burnin)

dist <- params$dist

config <- make_config(
list(
si_parametric_distr = dist,
mcmc_control = mcmc_control,
seed = overall_seed,
n1 = params$n1,
n2 = params$n2))

# Fitting the serial interval
si_fit <- coarseDataTools::dic.fit.mcmc(
dat = si_data,
dist = dist,
init.pars = init_mcmc_params(si_data, dist),
burnin = mcmc_control$burnin,
n.samples = n_mcmc_samples,
seed = mcmc_control$seed)

## Running 4000 MCMC iterations 
## MCMCmetrop1R iteration 1 of 4000 
## function value = -216.54581
## theta = 
##    1.94255
##   -0.47988
## Metropolis acceptance rate = 1.00000
## 
## MCMCmetrop1R iteration 1001 of 4000 
## function value = -216.21549
## theta = 
##    1.81272
##   -0.47418
## Metropolis acceptance rate = 0.56643
## 
## MCMCmetrop1R iteration 2001 of 4000 
## function value = -215.86791
## theta = 
##    1.85433
##   -0.52118
## Metropolis acceptance rate = 0.57221
## 
## MCMCmetrop1R iteration 3001 of 4000 
## function value = -216.32347
## theta = 
##    1.93196
##   -0.52205
## Metropolis acceptance rate = 0.55948
## 
## 
## 
## @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
## The Metropolis acceptance rate was 0.56875
## @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

Let’s check the results.

# Check convergence of MCMC chains
converg_diag <- check_cdt_samples_convergence(si_fit@samples)
converg_diag

# Serial interval results
si_fit

# Save these for later
si_fit_lnorm <- si_fit

# Save MCMC samples in a dataframe
si_samples <- data.frame(
type = 'Symptom onset',
meanlog = si_fit@samples$var1,
sdlog = si_fit@samples$var2,
p50 = qlnorm(
p = 0.5, 
meanlog = si_fit@samples$var1, 
sdlog = si_fit@samples$var2)) |>
mutate( # Equation for conversion is here https://en.wikipedia.org/wiki/Log-normal_distribution
mean = exp(meanlog + (sdlog^2/2)), 
sd = sqrt((exp(sdlog^2)-1) * (exp(2*meanlog + sdlog^2)))
)

# Get the mean, SD, and 95% CrI
si_samples %>%
summarise(
mean_mean = mean(mean),
mean_l_ci = quantile(mean,probs=.025),
mean_u_ci = quantile(mean,probs=.975),
sd_mean = mean(sd),
sd_l_ci = quantile(sd,probs=.025),
sd_u_ci = quantile(sd,probs=.975)
)

# Now get the summary statistics for the parameters of the distribution
si_samples |>
summarise(
meanlog_mean = mean(meanlog),
meanlog_l_ci = quantile(meanlog, probs=.025),
meanlog_u_ci = quantile(meanlog, probs=.975),
sdlog_mean = mean(sdlog),
sdlog_l_ci = quantile(sdlog, probs=.025),
sdlog_u_ci = quantile(sdlog, probs=.975)
)

lognorm_meanlog <- si_fit@ests['meanlog',][1]
lognorm_sdlog <- si_fit@ests['sdlog',][1]

8.2.4. Fitting a Weibull distribution

Finally, let’s fit a Weibull distribution to the serial interval data.

# We will run the model for 4000 iterations with the first 1000 samples discarded as burnin
n_mcmc_samples <- 3000 # number of samples to draw from the posterior (after the burnin)

params = list(
dist = "W", # Fitting a weibull distribution for the SI
method = "MCMC", # MCMC using coarsedatatools
burnin = 1000, # number of burnin samples (samples discarded at the beginning of MCMC) 
n1 = 50, # n1 is the number of pairs of mean and sd of the SI that are drawn
n2 = 50) # n2 is the size of the posterior sample drawn for each pair of mean, sd of SI

mcmc_control <- make_mcmc_control(
seed = MCMC_seed,
burnin = params$burnin)

dist <- params$dist

config <- make_config(
list(
si_parametric_distr = dist,
mcmc_control = mcmc_control,
seed = overall_seed,
n1 = params$n1,
n2 = params$n2))

# Fitting the serial interval 
si_fit <- coarseDataTools::dic.fit.mcmc(
dat = si_data,
dist = dist,
init.pars = init_mcmc_params(si_data, dist),
burnin = mcmc_control$burnin,
n.samples = n_mcmc_samples,
seed = mcmc_control$seed)

Now we’ll check the results.

# Check convergence
converg_diag <- check_cdt_samples_convergence(si_fit@samples)

## 
## Gelman-Rubin MCMC convergence diagnostic was successful.

converg_diag

## [1] TRUE

# Serial interval results
si_fit

## Coarse Data Model Parameter and Quantile Estimates: 
##          est  CIlow CIhigh
## shape  1.791  1.508  2.090
## scale  8.631  7.481  9.943
## p5     1.647  1.102  2.241
## p50    7.025  5.967  8.158
## p95   15.878 13.957 18.927
## p99   20.191 17.560 24.681
## Note: please check that the MCMC converged on the target distribution by running multiple chains. MCMC samples are available in the mcmc slot (e.g. my.fit@mcmc)

# Save these for later
si_fit_weibull <- si_fit

# Save MCMC samples in dataframe
si_samples <- data.frame(
type = 'Symptom onset',
shape = si_fit@samples$var1,
scale = si_fit@samples$var2,
p50 = qweibull(
p = 0.5, 
shape = si_fit@samples$var1, 
scale = si_fit@samples$var2)) |>
mutate( # Equation for conversion is here https://en.wikipedia.org/wiki/Weibull_distribution
mean = scale*gamma(1+1/shape),
sd = sqrt(scale^2*(gamma(1+2/shape)-(gamma(1+1/shape))^2))
) 

# Get the summary statistics
si_samples %>%
summarise(
mean_mean = mean(mean),
mean_l_ci = quantile(mean,probs=.025),
mean_u_ci = quantile(mean,probs=.975),
sd_mean = mean(sd),
sd_l_ci = quantile(sd, probs=.025),
sd_u_ci = quantile(sd, probs=.975)
)

##   mean_mean mean_l_ci mean_u_ci  sd_mean  sd_l_ci  sd_u_ci
## 1  7.693916  6.677548  8.809715 4.463549 3.797158 5.456492

# Now get the summary statistics for the parameters of the distribution
si_samples |>
summarise(
shape_mean = mean(shape),
shape_l_ci = quantile(shape, probs=.025),
shape_u_ci = quantile(shape, probs=.975),
scale_mean = mean(scale),
scale_l_ci = quantile(scale, probs=.025),
scale_u_ci = quantile(scale, probs=.975)
)

##   shape_mean shape_l_ci shape_u_ci scale_mean scale_l_ci scale_u_ci
## 1   1.792317    1.50812   2.090071   8.638298   7.481234   9.943215

weibull_shape <- si_fit@ests['shape',][1]
weibull_scale <- si_fit@ests['scale',][1]

8.2.5. Plot the results

Now we will plot the three fitted distributions. Make sure the distributions fit the serial interval data well.

ggplot()+
    theme_classic()+
    geom_bar(data = contacts, aes(x=diff), fill = "#FFAD05") +
    scale_y_continuous(limits = c(0,14), breaks = c(0,2,4,6,8,10,12,14), expand = c(0,0)) +
    stat_function(
        linewidth=.8,
        fun = function(z, shape, rate)(dgamma(z, shape, rate) * length(contacts$diff)),
        args = list(shape = gamma_shape, rate = gamma_rate),
        aes(linetype  = 'Gamma')
    ) +
    stat_function(
        linewidth=.8,
        fun = function(z, meanlog, sdlog)(dlnorm(z, meanlog, sdlog) * length(contacts$diff)),
        args = list(meanlog = lognorm_meanlog, sdlog = lognorm_sdlog),
        aes(linetype ='Log normal')
    ) +
    stat_function(
        linewidth=.8,
        fun = function(z, shape, scale)(dweibull(z, shape, scale) * length(contacts$diff)),
        args = list(shape = weibull_shape, scale = weibull_scale),
        aes(linetype ='Weibull')
    ) +
    scale_linetype_manual(values = c('solid','twodash','dotted')) +
    labs(x = "Days", y = "Number of case pairs") + 
    theme(
        legend.position = c(0.75, 0.75),
        plot.margin = margin(.2,5.2,.2,.2, "cm"),
        legend.title = element_blank(),
    )

Let’s calculate the WAIC and LOOIC. The library coarseDataTools does not have a built-in way to do this, so we need to calculate the likelihood from the MCMC chains and use the loo R package.

# Load likelihood functions from coarseDataTools
source("Functions/likelihood_function.R")

compare_gamma <- calc_looic_waic(symp = si_data, symp_si = si_fit_gamma, dist = "G")

## Warning: Relative effective sample sizes ('r_eff' argument) not specified.
## For models fit with MCMC, the reported PSIS effective sample sizes and 
## MCSE estimates will be over-optimistic.

compare_lnorm <- calc_looic_waic(symp = si_data, symp_si = si_fit_lnorm, dist = "L")

## Warning: Relative effective sample sizes ('r_eff' argument) not specified.
## For models fit with MCMC, the reported PSIS effective sample sizes and 
## MCSE estimates will be over-optimistic.

## Warning: Some Pareto k diagnostic values are slightly high. See help('pareto-k-diagnostic') for details.

compare_weibull <- calc_looic_waic(symp = si_data, symp_si = si_fit_weibull, dist = "W")

## Warning: 
## 2 (2.8%) p_waic estimates greater than 0.4. We recommend trying loo instead.

## Warning: Relative effective sample sizes ('r_eff' argument) not specified.
## For models fit with MCMC, the reported PSIS effective sample sizes and 
## MCSE estimates will be over-optimistic.

# Print results
compare_gamma[["waic"]]$estimates

##              Estimate         SE
## elpd_waic -202.009546  7.0118463
## p_waic       1.936991  0.4409454
## waic       404.019093 14.0236925

compare_lnorm[["waic"]]$estimates

##              Estimate         SE
## elpd_waic -202.215546  7.0178113
## p_waic       1.874728  0.4106866
## waic       404.431092 14.0356226

compare_weibull[["waic"]]$estimates

##              Estimate         SE
## elpd_waic -203.941362  6.9020056
## p_waic       2.083623  0.6633186
## waic       407.882723 13.8040112

compare_gamma[["looic"]]$estimates

##             Estimate         SE
## elpd_loo -202.015856  7.0144149
## p_loo       1.943301  0.4437589
## looic     404.031713 14.0288297

compare_lnorm[["looic"]]$estimates

##             Estimate         SE
## elpd_loo -202.218954  7.0182826
## p_loo       1.878136  0.4114664
## looic     404.437909 14.0365651

compare_weibull[["looic"]]$estimates

##             Estimate         SE
## elpd_loo -203.956018  6.9097884
## p_loo       2.098279  0.6725387
## looic     407.912036 13.8195768

Include the following when reporting the serial interval:

Mean and 95% credible interval
Standard deviation and 95% credible interval
The parameters of the fitted distribution (e.g., shape and scale for gamma distribution)

💡 Questions (7)

Which distribution(s) have the lowest WAIC and LOOIC??

`_Wrap-up 3 ___`

9. Control measures

9.1 Analyse the results altogether

You have now finisiled the statistical analysis 🥳

Compare the incubation period and the serial interval.
Which one is longer?
Does pre-symptomatic transmission occur with Disease X?
Will isolating symptomatic individuals and tracing and quarantining their contacts be effective measures?
If not, which other measures can be implemented to slow the outbreak?

9.2 Design a contact tracing strategy

Based on the estimated delays as well as the demographics of cases, design a contact tracing strategy for the outbreak, including a communications plan.

10. True values

The true values used to simulate the epidemics were:

Table 2. True values of the incubation period.
	Distribution	Mean (days)	Standard deviation (days)
Group 1	Gamma	8.4	4.9
Group 2	Gamma	4.8	3.3

Table 3. True values of the serial interval.
	Distribution	Mean (days)	Standard deviation (days)
Group 1	Log normal	5.7	4.6
Group 2	Weibull	7.1	3.7

How do your estimates compare with the true values? Discuss possible reasons for any differences.

11. Further resources

In this practical, we largely ignored biases due to right truncation and dynamical bias, partly due to a lack of readily available tools that implement all best practices. To learn more about how these biases could affect the estimation of epidemiological delay distributions in real-time, we recommend a tutorial on the dynamicaltruncation R package by Sam Abbott and Sang Woo Park (https://github.com/parksw3/epidist-paper).

12. Contributors

Kelly Charniga, Zachary Madewell, Zulma M. Cucunubá

13. References

Reich NG et al. Estimating incubation period distributions with coarse data. Stat Med. 2009;28:2769–84. PubMed https://doi.org/10.1002/sim.3659
Miura F et al. Estimated incubation period for monkeypox cases confirmed in the Netherlands, May 2022. Euro Surveill. 2022;27(24):pii=2200448. https://doi.org/10.2807/1560-7917.ES.2022.27.24.2200448
Abbott S, Park Sang W. Adjusting for common biases in infectious disease data when estimating distributions. 2022 [cited 7 November 2023]. https://github.com/parksw3/dynamicaltruncation
Lessler J et al. Incubation periods of acute respiratory viral infections: a systematic review, The Lancet Infectious Diseases. 2009;9(5):291-300. https://doi.org/10.1016/S1473-3099(09)70069-6.
Cori A et al. Estimate Time Varying Reproduction Numbers from Epidemic Curves. 2022 [cited 7 November 2023]. https://github.com/mrc-ide/EpiEstim
Lambert B. A Student’s Guide to Bayesian Statistics. Los Angeles, London, New Delhi, Singapore, Washington DC, Melbourne: SAGE, 2018.
Vehtari A et al. Rank-normalization, folding, and localization: An improved R-hat for assessing convergence of MCMC. Bayesian Analysis 2021: Advance publication 1-28. https://doi.org/10.1214/20-BA1221
Nishiura H et al. Serial interval of novel coronavirus (COVID-19) infections. Int J Infect Dis. 2020;93:284-286. https://doi.org/10.1016/j.ijid.2020.02.060

Estimating epidemiological delay distributions: Disease X

Kelly Charniga, PhD MPH & Zulma M. Cucunubá MD, PhD

2023-11-23