Practical 5: Disease Transmission Modelling in Stan
The first part of this tutorial is based on the paper by Ginsztajn et al. and the original can be found here. This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
In this tutorial, we demonstrate how to formulate, fit, and diagnose compartmental model like the ones we considered this morning in Stan using the R interface. (See yesterday’s notes for why we are interested in using Stan). We consider two different types of routinely collected data to fit our models to: prevalence data (number / proportion of infected people in a population) and incidence data (newly infected people in a population).
In this example, we examine an outbreak of influenza A (H1N1) in 1978 at a British boarding school. The data consists of the daily number of students in bed (prevalence data), spanning over a time interval of 14 days. There were 763 male students who were mostly full boarders and 512 of them became ill. The outbreak lasted from the 22nd of January to the 4th of February. It is reported that one infected boy started the epidemic, which spread rapidly in the relatively closed community of the boarding school. The data are freely available in the R package outbreaks, maintained as part of the R Epidemics Consortium, and is a great resource for trying to fit different models yourself after the course. Since there was no publicly available incidence data associated with this outbreak, we provide some simulated data created to mimic the behaviour of the outbreak.
You may need to run the following code block to install the packages necessary for this practical. You will either need to copy the code into your console or change eval = TRUE the first time you run this block.
Now we can load the necessary packages.
To start, we can take a look at what the data looks like. This command loads the first 5 lines of the data
## date in_bed convalescent
## 1 1978-01-22 3 0
## 2 1978-01-23 8 0
## 3 1978-01-24 26 0
## 4 1978-01-25 76 0
## 5 1978-01-26 225 9
## 6 1978-01-27 298 17
and here is a plot of the full timeseries.
theme_set(theme_bw())
ggplot(data = influenza_england_1978_school) +
geom_point(mapping = aes(x = date, y = in_bed)) +
labs(y = "Number of students in bed")The simplest model we can consider is the Susceptible-Infected-Recovered (SIR) model. This splits the population in three time-dependent compartments: the susceptible (\(S\)), the infected (and infectious) (\(I\)), and the recovered (and not infectious) compartments (\(R\)). When a susceptible individual comes into contact with an infectious individual, the former can become infected for some time, and then recover and become immune.
Figure 1.1: Figure 1: Diagram of SIR model
From Figure 1, the rate of progression between the S and I compartments is \(\beta * I / N\) where \(\beta\) is the contact rate, \(I\) is the number of infected people and \(N\) is the total population. The rate of progression between the \(I\) and \(R\) compartments is \(\sigma\) the recovery rate or 1 divided by the duration of infection.
Q1) Define the equations for an SIR model.
Q2) What are the assumptions of this model?
We now define the statistical inference model that we will use to fit our transmission model to data. Here we introduce the likelihood or sampling distribution for our model \[ p(y|\theta), \] which tells us, given model parameters \(\theta\), how to generate data \(y\). In a Bayesian framework, the set of plausible values is characterised by the posterior distribution, \[ p(\theta | y). \]
Bayes’ rule teaches us that \[ p(\theta | y) \sim p(y | \theta) p(\theta) \] where \(p(\theta | y)\) is our posterior distribution and this is proportional to \(p(y|\theta)\) our likelihood multiplied by \(p(θ)\) our prior distribution. The prior encodes information or our belief about the parameters we have before observing the data.
This is the part of the model that we used to link the model estimates of the number of infected students, \(I_{ODE}(t)\), for given parameters (\(\beta\) and \(\sigma\)) to the observed data, i.e the number of students in bed, \(I_{obs}(t)\). We choose to model the number of students in bed with a count distribution – the Negative Binomial. This distribution allows us to use \(I_{ODE}(t)\) as the expected value and account for over-dispersion, through the parameter \(\phi\): \[ I_{obs}(t)∼NegBin(I_{ODE}(t), \phi) \] Therefore our likelihood is \(p(y∣\theta)\), with \(\theta = (\beta, \sigma, \phi)\) as the parameters of the model.
We need to specify a prior for each of our three parameter \(\theta = (\beta, \sigma, \phi)\). One advantage of the Bayesian approach is that we can formally incorporate prior knowledge about the parameters into the model. We can change this prior if more information becomes available, constraining our parameter estimation more tightly or, on the contrary, increasing its variance. Choosing priors is one of the tricky parts of Bayesian modelling but comes with practice.
Once we fit the model and obtain a posterior distribution for \(\theta\), we can derive additional quantities of interests. The posterior distribution of predictions: \[ p(y_{pred}∣y)=\int p(y_{pred}|\theta) p(\theta | y)d\theta. \]
We will need the following libraries that you should have already installed and options to fit the compartmental model in Stan efficiently.
library(rstan)
library(gridExtra)
rstan_options(auto_write = TRUE) # saves stan model to hardisk
options(mc.cores = parallel::detectCores()) # tells R how many cores you have available for computation.An ODE takes the form \[ \frac{dz}{dt}=f(t,z) \] where \(z\) are the states of the compartmental model (\(z=(S,I,R)\) for our first example) and \(t\) is time. We also need to provide initial conditions \(z_0\) at \(t_0\) and the times, \(τ\), at which we evaluate the solution.
To specify an ODE in Stan, we first code \(f\) in the functions block. This function must observe a strict signature:
with
In our example, the ODEs for the SIR model is defined with the following structure.
Q3) You need to fill in the equations for the differential equations for S, I and R:
functions {
real[] sir(real t, real[] y, real[] theta,
real[] x_r, int[] x_i) {
real S = y[1];
real I = y[2];
real R = y[3];
real N = x_i[1];
real beta = theta[1];
real sigma = theta[2];
#TODO: fill it in.
real dS_dt =
real dI_dt =
real dR_dt =
return {dS_dt, dI_dt, dR_dt};
}
}We evaluate the solution numerically by using one of Stan’s numerical integrators. Here we choose the Runge-Kutta 4th / 5th order and a call to the integrator looks as follows:
with
We now have all the ingredients to solve our ODE.
We next code the model in Stan, working through the following coding blocks.
The functions block specifies the SIR model.
Fixed data is declared in the data block. This is things like the initial conditions, times and number of days.
We code transforms of the data in the transformed data block. In this example, we transform our data to match the signature of sir (with x_r being of length 0 because we have nothing to put in it). We need these as arguments for the numerical integration scheme.
We next declare the model parameters. If you want some parameter to be bounded, and it is not already guaranteed by it’s prior, you need to specify <lower=a, upper=b> when declaring this parameter. Note that this is how you put a truncated prior distribution on a parameter and here we choose to define a prior on the inverse of \(\phi\).
And then transforms of the parameters
transformed parameters{
real y[n_days, 3];
real phi = 1. / phi_inv;
{
real theta[2]; // model parameters
theta[1] = beta;
theta[2] = sigma;
y = integrate_ode_rk45(sir, y0, t0, ts, theta, x_r, x_i);
}
}Our model consists of the prior and likelihood as described above. Here we assume some priors for our parameters and define our likelihood.
model {
//priors
beta ~ normal(2, 1); //truncated at 0
sigma ~ normal(0.4, 0.25); //truncated at 0
phi_inv ~ exponential(5);
//likelihood
//col(matrix x, int n) - The n-th column of matrix x. Here the number of infected people
cases ~ neg_binomial_2(col(to_matrix(y), 2), phi);
}Untangled from the inference, we predict the number of cases and recovery time in the generated quantities block:
Here we write the code to fit the model in R. You need to copy all the model parts (functions, data, transformed data, parameters, transformed parameters, model and generated quantities from section 1b) into a file called sir_model.stan in a directory called stan_models.
# time series of cases
cases <- influenza_england_1978_school$in_bed # Number of students in bed
# total count
N <- 763;
# times
n_days <- length(cases)
t <- seq(1, n_days, by = 1)
t0 = 0
#initial conditions
i0 <- 1
s0 <- N - i0
r0 <- 0
y0 = c(S = s0, I = i0, R = r0)
# data for Stan
data_sir <- list(n_days = n_days, y0 = y0, t0 = t0, ts = t, N = N, cases = cases)
# number of MCMC steps
niter <- 2000Next we compile the model that you will have created, saved in the file stan_models/sir_prevalence.stan,
and run MCMC. For this problem, it suffices to use Stan’s defaults. Note that, as is standard practice, we run 4 Markov chains.
A good place to start is with a summary table of the results, which displays the posterior mean, standard error, quantiles, and some useful diagnostics.
## Inference for Stan model: anon_model.
## 4 chains, each with iter=2000; warmup=1000; thin=1;
## post-warmup draws per chain=1000, total post-warmup draws=4000.
##
## mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
## beta 1.74 0 0.05 1.64 1.70 1.74 1.77 1.86 1859 1
## sigma 0.54 0 0.04 0.46 0.51 0.54 0.57 0.63 2668 1
## recovery_time 1.87 0 0.15 1.59 1.76 1.86 1.96 2.19 2578 1
##
## Samples were drawn using NUTS(diag_e) at Sat Feb 24 15:01:00 2024.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
Stan gives us lots of information to evaluate whether the inference is reliable:
We can furthermore plot the marginal posterior densities and confirm the Markov chains are in agreement with one another.
Now that we trust our inference, we can check the utility of our model. Utility is problem specific and can include the precise estimation of a quantity or predicting future behaviors. In general, it is good to check if our model, once fitted, produces simulations that are consistent with the observed data. This is the idea behind posterior predictive checks.
We sample predictions, \(y_{pred}\), from \(p(y_{pred}∣y)\) and use these samples to construct a fitted curve for students in bed, together with the uncertainty (90% interval, meaning observed data is expected to fall outside of this interval one in ten times). This posterior predictive check allows us to verify if the model captures the structure of the data. Here we see that the model gives a satisfying fit to the data, and that the model uncertainty is able to capture the variation of the data.
smr_pred <- cbind(as.data.frame(summary(
fit_sir_negbin, pars = "pred_cases", probs = c(0.05, 0.5, 0.95))$summary), t, cases)
colnames(smr_pred) <- make.names(colnames(smr_pred)) # to remove % in the col names
c_posterior = "lightgoldenrod"
ggplot(smr_pred, mapping = aes(x = t)) +
geom_ribbon(aes(ymin = X5., ymax = X95.), fill = c_posterior, alpha = 0.35) +
geom_line(mapping = aes(x = t, y = X50.), color = c_posterior) +
geom_point(mapping = aes(y = cases)) +
labs(x = "Day", y = "Number of students in bed")Maybe we also want to access the true number of infected people in the \(I\) compartment at each time, and not just the number of students in bed. This is a latent variable for which we have an estimation in our model.
params <- lapply(t, function(i){sprintf("y[%s,2]", i)}) #number of infected for each day
smr_y <- as.data.frame(summary(fit_sir_negbin,
pars = params, probs = c(0.05, 0.5, 0.95))$summary)
colnames(smr_y) <- make.names(colnames(smr_y)) # to remove % in the col names
ggplot(smr_y, mapping = aes(x = t)) +
geom_ribbon(aes(ymin = X5., ymax = X95.), fill = c_posterior, alpha = 0.35) +
geom_line(mapping = aes(x = t, y = X50.), color = c_posterior) +
labs(x = "Day", y = "Number of infected students")A person who is infected with flu is not immediately infectious so it makes sense to include an “Exposed” compartment in our model. This should improve the fit to our model (although we don’t evaluate that here in this practical).
Figure 2.1: Figure 2: Diagram of SEIR model
The notation in Figure 2 is similar to the SIR model but we have a new parameter \(\gamma\) which is the rate of progression from the exposed to infected compartment. It is equal to 1 divided by the incubation period.
Q4) What are the equations for an SEIR model?
Q5) Edit the stan model file to turn the SIR model in an SEIR model. Save this in the stan_model directory under a new file called seir_prevalence.stan. Since we have a new parameter in our model \(\sigma\) we need to define a prior for this. From the literature the incubation period for flu is around 2 days so a prior for sigma of \(N(0.5, 0.25)\) would be suitable.
Hint: You will need to change the following parts of the model.
Similar to before, the SEIR model can be run as follows. Note we also need to add an initial condition for our exposed compartment.
cases <- influenza_england_1978_school$in_bed
# total count
N <- 763;
# times
n_days <- length(cases)
t <- seq(1, n_days, by = 1)
t0 = 0
#initial conditions
i0 <- 1
s0 <- N - i0
e0 <- 0
r0 <- 0
y0 = c(S = s0, E = e0, I = i0, R = r0)
# data for Stan
data_seir <- list(n_days = n_days, y0 = y0, t0 = t0, ts = t, N = N, cases = cases)
# number of MCMC steps
niter <- 7500
model <- stan_model("stan_models/seir_prevalence.stan")## recompiling to avoid crashing R session
## Warning: There were 35 divergent transitions after warmup. See
## https://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
## to find out why this is a problem and how to eliminate them.
## Warning: Examine the pairs() plot to diagnose sampling problems
Again we can print a summary for our results:
pars=c('beta', 'gamma', 'sigma', 'recovery_time', 'incubation_period', 'phi_inv')
print(fit_seir_negbin, pars = pars)## Inference for Stan model: anon_model.
## 4 chains, each with iter=7500; warmup=3750; thin=1;
## post-warmup draws per chain=3750, total post-warmup draws=15000.
##
## mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
## beta 4.15 0.01 0.51 3.27 3.78 4.13 4.49 5.21 3008 1
## gamma 0.53 0.00 0.03 0.47 0.51 0.53 0.55 0.60 5468 1
## sigma 0.37 0.00 0.19 0.05 0.23 0.36 0.50 0.76 3587 1
## recovery_time 7.49 1.57 113.12 1.31 2.01 2.74 4.30 20.89 5164 1
## incubation_period 1.90 0.00 0.12 1.66 1.82 1.90 1.97 2.12 5702 1
## phi_inv 0.07 0.00 0.04 0.02 0.04 0.06 0.08 0.17 4646 1
##
## Samples were drawn using NUTS(diag_e) at Sat Feb 24 15:13:36 2024.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
and produce plots for our sample predictions of the number of students in bed.
smr_pred <- cbind(as.data.frame(summary(
fit_seir_negbin, pars = "pred_cases", probs = c(0.05, 0.5, 0.95))$summary), t, cases)
colnames(smr_pred) <- make.names(colnames(smr_pred)) # to remove % in the col names
c_posterior = "lightgoldenrod"
print(ggplot(smr_pred, mapping = aes(x = t)) +
geom_ribbon(aes(ymin = X5., ymax = X95.), fill = c_posterior, alpha = 0.35) +
geom_line(mapping = aes(x = t, y = X50.), color = c_posterior) +
geom_point(mapping = aes(y = cases)) +
labs(x = "Day", y = "Number of students in bed"))The number of infected people in a population is not the only type of data collected during an outbreak. Often we get a line list or a list of the times at which people were infected (or showed symptoms). We have simulated some incidence data for a flu like outbreak with similar parameters to the school data and this can be found in the file incidence_seir.rds. (Do speak to one of us if you are interested in how we did this!).
We now need to change our model to fit the incidence from our model to the data rather than the prevalence. First we need to calculate the incidence, which is equal to the number of people leaving the exposed component or \(\gamma E\).
Q6) Make a new copy of our SEIR model by re-saving seir_prevalence.stan as seir_incidence.stan. The incidence can be calculated by changing the transformed parameters block to be as follows (note only the for loop around the incidence should have changed).
transformed parameters{
real y[n_days, 3];
real phi = 1. / phi_inv;
real theta[2]; // model parameters
real incidence[n_days];
theta[1] = beta;
theta[2] = sigma;
y = integrate_ode_rk45(sir, y0, t0, ts, theta, x_r, x_i);
for (i in 1:n_days)
incidence[i] = gamma * y[i, 2];
}Then we need to update the likelihood in the model block to the following since we are now saying that our incidence is negatively binomially distributed instead of the total number of cases.
We can now use similar code to fit our model and print a summary of our parameters.
# total count
N <- 763
# times
n_days <- length(incidence)
t <- seq(1, n_days, by = 1)
t0 = 0
#initial conditions
i0 <- 1
s0 <- N - i0
e0 <- 0
r0 <- 0
y0 = c(S = s0, E = e0, I = i0, R = r0)
# data for Stan
data_sir <- list(n_days = n_days, y0 = y0, t0 = t0,
ts = t, N = N, cases = incidence)
# number of MCMC steps
niter <- 2000
model <- stan_model("stan_models/seir_incidence.stan")
fit_sir_negbin <- sampling(model,
data = data_sir,
iter = niter,
chains = 4,
seed = 0)
pars=c('beta', 'gamma', 'sigma', "recovery_time")
print(fit_sir_negbin, pars = pars)## Inference for Stan model: anon_model.
## 4 chains, each with iter=2000; warmup=1000; thin=1;
## post-warmup draws per chain=1000, total post-warmup draws=4000.
##
## mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
## beta 4.01 0.01 0.30 3.43 3.80 4.00 4.21 4.61 2297 1
## gamma 0.50 0.00 0.04 0.43 0.47 0.50 0.53 0.60 2047 1
## sigma 0.40 0.00 0.09 0.22 0.34 0.41 0.46 0.58 2120 1
## recovery_time 2.63 0.02 0.73 1.71 2.16 2.47 2.92 4.55 1547 1
##
## Samples were drawn using NUTS(diag_e) at Sat Feb 24 15:19:36 2024.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
Q7) Note that if you have left the same priors as before, there will be some warnings about your model. Investigate changing the priors and number of iterations to see if you can get rid of the warnings.
We can see how well our model fits the incidence data.
params_inc <- lapply(t, function(i){sprintf("incidence[%s]", i)})
smr_inc <- as.data.frame(summary(fit_sir_negbin,
pars = params_inc, probs = c(0.05, 0.5, 0.95))$summary)
colnames(smr_inc) <- make.names(colnames(smr_inc)) # to remove % in the col names
c_posterior = "lightgoldenrod"
print(ggplot(smr_inc, mapping = aes(x = t)) +
geom_ribbon(aes(ymin = X5., ymax = X95.), fill = c_posterior, alpha = 0.35) +
geom_line(mapping = aes(x = t, y = X50.), color = c_posterior) +
geom_point(mapping = aes(y = incidence)) +
labs(x = "Day", y = "Daily incidence") + theme_bw())If we also load in the simulated prevalence data we can see that, as expected, our model estimates the number of infected individuals well.
infected <- readRDS("data/infected_seir.RDS")
params <- lapply(t, function(i){sprintf("y[%s,3]", i)}) #total number infected
smr_y <- as.data.frame(summary(fit_sir_negbin,
pars = params, probs = c(0.05, 0.5, 0.95))$summary)
colnames(smr_y) <- make.names(colnames(smr_y)) # to remove % in the col names
print(ggplot(smr_y, mapping = aes(x = t)) +
geom_ribbon(aes(ymin = X5., ymax = X95.), fill = c_posterior, alpha = 0.35) +
geom_line(mapping = aes(x = t, y = X50.), color = c_posterior) +
geom_point(mapping = aes(y = infected)) +
labs(x = "Day", y = "Number of infected") + theme_bw())