The first part of this tutorial is based on the paper by Ginsztajn et al. and the original can be found here. This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

In this tutorial, we demonstrate how to formulate, fit, and diagnose compartmental model like the ones we considered this morning in Stan using the R interface. (See yesterday’s notes for why we are interested in using Stan). We consider two different types of routinely collected data to fit our models to: prevalence data (number / proportion of infected people in a population) and incidence data (newly infected people in a population).

In this example, we examine an outbreak of influenza A (H1N1) in 1978 at a British boarding school. The data consists of the daily number of students in bed (prevalence data), spanning over a time interval of 14 days. There were 763 male students who were mostly full boarders and 512 of them became ill. The outbreak lasted from the 22nd of January to the 4th of February. It is reported that one infected boy started the epidemic, which spread rapidly in the relatively closed community of the boarding school. The data are freely available in the R package outbreaks, maintained as part of the R Epidemics Consortium, and is a great resource for trying to fit different models yourself after the course. Since there was no publicly available incidence data associated with this outbreak, we provide some simulated data created to mimic the behaviour of the outbreak.

You may need to run the following code block to install the packages necessary for this practical. You will either need to copy the code into your console or change eval = TRUE the first time you run this block.

install.packages(c("outbreaks", "tidyverse"))

Now we can load the necessary packages.

library(outbreaks)
library(tidyverse)

To start, we can take a look at what the data looks like. This command loads the first 5 lines of the data

head(influenza_england_1978_school)

##         date in_bed convalescent
## 1 1978-01-22      3            0
## 2 1978-01-23      8            0
## 3 1978-01-24     26            0
## 4 1978-01-25     76            0
## 5 1978-01-26    225            9
## 6 1978-01-27    298           17

and here is a plot of the full timeseries.

theme_set(theme_bw())
ggplot(data = influenza_england_1978_school) + 
  geom_point(mapping = aes(x = date, y = in_bed)) + 
  labs(y = "Number of students in bed")

1 Fitting an SIR (transmission) model to prevalence data

The simplest model we can consider is the Susceptible-Infected-Recovered (SIR) model. This splits the population in three time-dependent compartments: the susceptible (\(S\)), the infected (and infectious) (\(I\)), and the recovered (and not infectious) compartments (\(R\)). When a susceptible individual comes into contact with an infectious individual, the former can become infected for some time, and then recover and become immune.

Figure 1.1: Figure 1: Diagram of SIR model

From Figure 1, the rate of progression between the S and I compartments is \(\beta * I / N\) where \(\beta\) is the contact rate, \(I\) is the number of infected people and \(N\) is the total population. The rate of progression between the \(I\) and \(R\) compartments is \(\sigma\) the recovery rate or 1 divided by the duration of infection.

Q1) Define the equations for an SIR model.

Q2) What are the assumptions of this model?

1.1 Defining the statistical model

We now define the statistical inference model that we will use to fit our transmission model to data. Here we introduce the likelihood or sampling distribution for our model \[ p(y|\theta), \] which tells us, given model parameters \(\theta\), how to generate data \(y\). In a Bayesian framework, the set of plausible values is characterised by the posterior distribution, \[ p(\theta | y). \]

Bayes’ rule teaches us that \[ p(\theta | y) \sim p(y | \theta) p(\theta) \] where \(p(\theta | y)\) is our posterior distribution and this is proportional to \(p(y|\theta)\) our likelihood multiplied by \(p(θ)\) our prior distribution. The prior encodes information or our belief about the parameters we have before observing the data.

1.1.1 The likelihood

This is the part of the model that we used to link the model estimates of the number of infected students, \(I_{ODE}(t)\), for given parameters (\(\beta\) and \(\sigma\)) to the observed data, i.e the number of students in bed, \(I_{obs}(t)\). We choose to model the number of students in bed with a count distribution – the Negative Binomial. This distribution allows us to use \(I_{ODE}(t)\) as the expected value and account for over-dispersion, through the parameter \(\phi\): \[ I_{obs}(t)∼NegBin(I_{ODE}(t), \phi) \] Therefore our likelihood is \(p(y∣\theta)\), with \(\theta = (\beta, \sigma, \phi)\) as the parameters of the model.

1.1.2 Priors

We need to specify a prior for each of our three parameter \(\theta = (\beta, \sigma, \phi)\). One advantage of the Bayesian approach is that we can formally incorporate prior knowledge about the parameters into the model. We can change this prior if more information becomes available, constraining our parameter estimation more tightly or, on the contrary, increasing its variance. Choosing priors is one of the tricky parts of Bayesian modelling but comes with practice.

1.1.3 Predictions and derived quantities

Once we fit the model and obtain a posterior distribution for \(\theta\), we can derive additional quantities of interests. The posterior distribution of predictions: \[ p(y_{pred}∣y)=\int p(y_{pred}|\theta) p(\theta | y)d\theta. \]

1.2 Coding the model in Stan

We will need the following libraries that you should have already installed and options to fit the compartmental model in Stan efficiently.

library(rstan)
library(gridExtra)
rstan_options(auto_write = TRUE) # saves stan model to hardisk
options(mc.cores = parallel::detectCores()) # tells R how many cores you have available for computation.

1.2.1 Coding the ODE

An ODE takes the form \[ \frac{dz}{dt}=f(t,z) \] where \(z\) are the states of the compartmental model (\(z=(S,I,R)\) for our first example) and \(t\) is time. We also need to provide initial conditions \(z_0\) at \(t_0\) and the times, \(τ\), at which we evaluate the solution.

To specify an ODE in Stan, we first code \(f\) in the functions block. This function must observe a strict signature:

real[] f(real time, real[] state, real[] theta,
         real[] x_r, int[] x_i)

with

time, \(t\);
state, the volumes in each compartment, \(z\);
theta, variables used to compute \(f\), which depend on the model parameters;
x_r, real variables used to evaluate \(f\), which only depend on fixed data;
x_i, integer values used to evaluate \(f\), which only depend on fixed data.

In our example, the ODEs for the SIR model is defined with the following structure.

Q3) You need to fill in the equations for the differential equations for S, I and R:

functions {
  real[] sir(real t, real[] y, real[] theta, 
             real[] x_r, int[] x_i) {

      real S = y[1];
      real I = y[2];
      real R = y[3];
      real N = x_i[1];
      
      real beta = theta[1];
      real sigma = theta[2];
      
      #TODO: fill it in.
      real dS_dt = 
      real dI_dt =  
      real dR_dt =  
      
      return {dS_dt, dI_dt, dR_dt};
  }
}

We evaluate the solution numerically by using one of Stan’s numerical integrators. Here we choose the Runge-Kutta 4th / 5th order and a call to the integrator looks as follows:

y = integrate_ode_rk45(sir, y0, t0, ts, theta, x_r, x_i);

with

sir, the name of the function that returns the derivatives, f;
y0, the initial condition;
t0, the time of the initial condition;
ts, the times at which we require the solution to be evaluated;
theta, x_r, x_i, arguments to be passed to f.

We now have all the ingredients to solve our ODE.

1.2.2 Building the model

We next code the model in Stan, working through the following coding blocks.

The functions block specifies the SIR model.

functions {
  real[] sir (...) {...} //copy code from above
}

Fixed data is declared in the data block. This is things like the initial conditions, times and number of days.

data {
  int<lower=1> n_days;
  real y0[3];
  real t0;
  real ts[n_days];
  int N;
  int cases[n_days];
}

We code transforms of the data in the transformed data block. In this example, we transform our data to match the signature of sir (with x_r being of length 0 because we have nothing to put in it). We need these as arguments for the numerical integration scheme.

transformed data {
  real x_r[0];
  int x_i[1] = {N};
}

We next declare the model parameters. If you want some parameter to be bounded, and it is not already guaranteed by it’s prior, you need to specify <lower=a, upper=b> when declaring this parameter. Note that this is how you put a truncated prior distribution on a parameter and here we choose to define a prior on the inverse of \(\phi\).

parameters {
  real<lower=0> sigma;
  real<lower=0> beta;
  real<lower=0> phi_inv;
}

And then transforms of the parameters

transformed parameters{
  real y[n_days, 3];
  real phi = 1. / phi_inv;
  {
    real theta[2]; // model parameters
    theta[1] = beta;
    theta[2] = sigma;

    y = integrate_ode_rk45(sir, y0, t0, ts, theta, x_r, x_i);
  }
}

Our model consists of the prior and likelihood as described above. Here we assume some priors for our parameters and define our likelihood.

model {
  //priors
  beta ~ normal(2, 1); //truncated at 0
  sigma ~ normal(0.4, 0.25); //truncated at 0
  phi_inv ~ exponential(5);
  
  //likelihood
  //col(matrix x, int n) - The n-th column of matrix x. Here the number of infected people
  cases ~ neg_binomial_2(col(to_matrix(y), 2), phi);
}

Untangled from the inference, we predict the number of cases and recovery time in the generated quantities block:

generated quantities {
  real recovery_time = 1 / sigma;
  real pred_cases[n_days];
  pred_cases = neg_binomial_2_rng(col(to_matrix(y), 2) + 1e-5, phi);
}

1.3 Fitting the model in R

Here we write the code to fit the model in R. You need to copy all the model parts (functions, data, transformed data, parameters, transformed parameters, model and generated quantities from section 1b) into a file called sir_model.stan in a directory called stan_models.

# time series of cases
cases <- influenza_england_1978_school$in_bed  # Number of students in bed

# total count
N <- 763;

# times
n_days <- length(cases) 
t <- seq(1, n_days, by = 1)
t0 = 0 

#initial conditions
i0 <- 1
s0 <- N - i0
r0 <- 0
y0 = c(S = s0, I = i0, R = r0)

# data for Stan
data_sir <- list(n_days = n_days, y0 = y0, t0 = t0, ts = t, N = N, cases = cases)

# number of MCMC steps
niter <- 2000

Next we compile the model that you will have created, saved in the file stan_models/sir_prevalence.stan,

model <- stan_model("stan_models/sir_prevalence.stan")

and run MCMC. For this problem, it suffices to use Stan’s defaults. Note that, as is standard practice, we run 4 Markov chains.

fit_sir_negbin <- sampling(model,
                data = data_sir,
                iter = niter,
                chains = 4, 
                seed = 0)

1.4 Checking the model

1.4.1 Inference

A good place to start is with a summary table of the results, which displays the posterior mean, standard error, quantiles, and some useful diagnostics.

pars=c('beta', 'sigma', "recovery_time")
print(fit_sir_negbin, pars = pars)

## Inference for Stan model: anon_model.
## 4 chains, each with iter=2000; warmup=1000; thin=1; 
## post-warmup draws per chain=1000, total post-warmup draws=4000.
## 
##               mean se_mean   sd 2.5%  25%  50%  75% 97.5% n_eff Rhat
## beta          1.74       0 0.05 1.64 1.70 1.74 1.77  1.86  1859    1
## sigma         0.54       0 0.04 0.46 0.51 0.54 0.57  0.63  2668    1
## recovery_time 1.87       0 0.15 1.59 1.76 1.86 1.96  2.19  2578    1
## 
## Samples were drawn using NUTS(diag_e) at Sat Feb 24 15:01:00 2024.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).

Stan gives us lots of information to evaluate whether the inference is reliable:

During sampling, warnings can tell us if something is wrong (here we have no warnings but you should check yours).
In the summary table, several quantities are available to check inference.
- Here we note that \(\hat R\) is close to 1 (< 1.01), indicating the 4 Markov chains are in close agreement with one another.
- The effective samples size, n_eff, is large (> 1007), which means the Markov chains were able to cohesively explore the parameter space.
- Conversely, large \(\hat{R}\) and low n_eff would indicate that the Markov chains are poorly mixing. Apart from fixing coding errors, improving the mixing of the Markov chains almost always requires tweaking the model specification, for example with a reparameterisation or stronger priors.

We can furthermore plot the marginal posterior densities and confirm the Markov chains are in agreement with one another.

stan_dens(fit_sir_negbin, pars = pars, separate_chains = TRUE)

1.4.2 Model

Now that we trust our inference, we can check the utility of our model. Utility is problem specific and can include the precise estimation of a quantity or predicting future behaviors. In general, it is good to check if our model, once fitted, produces simulations that are consistent with the observed data. This is the idea behind posterior predictive checks.

We sample predictions, \(y_{pred}\), from \(p(y_{pred}∣y)\) and use these samples to construct a fitted curve for students in bed, together with the uncertainty (90% interval, meaning observed data is expected to fall outside of this interval one in ten times). This posterior predictive check allows us to verify if the model captures the structure of the data. Here we see that the model gives a satisfying fit to the data, and that the model uncertainty is able to capture the variation of the data.

smr_pred <- cbind(as.data.frame(summary(
  fit_sir_negbin, pars = "pred_cases", probs = c(0.05, 0.5, 0.95))$summary), t, cases)
colnames(smr_pred) <- make.names(colnames(smr_pred)) # to remove % in the col names
c_posterior = "lightgoldenrod"
ggplot(smr_pred, mapping = aes(x = t)) +
  geom_ribbon(aes(ymin = X5., ymax = X95.), fill = c_posterior, alpha = 0.35) +
  geom_line(mapping = aes(x = t, y = X50.), color = c_posterior) + 
  geom_point(mapping = aes(y = cases)) +
  labs(x = "Day", y = "Number of students in bed")

Maybe we also want to access the true number of infected people in the \(I\) compartment at each time, and not just the number of students in bed. This is a latent variable for which we have an estimation in our model.

params <- lapply(t, function(i){sprintf("y[%s,2]", i)}) #number of infected for each day
smr_y <- as.data.frame(summary(fit_sir_negbin, 
                               pars = params, probs = c(0.05, 0.5, 0.95))$summary)
colnames(smr_y) <- make.names(colnames(smr_y)) # to remove % in the col names
ggplot(smr_y, mapping = aes(x = t)) +
  geom_ribbon(aes(ymin = X5., ymax = X95.), fill = c_posterior, alpha = 0.35) +
  geom_line(mapping = aes(x = t, y = X50.), color = c_posterior) + 
  labs(x = "Day", y = "Number of infected students")

2 Fitting an SEIR model to prevelance data

A person who is infected with flu is not immediately infectious so it makes sense to include an “Exposed” compartment in our model. This should improve the fit to our model (although we don’t evaluate that here in this practical).

Figure 2.1: Figure 2: Diagram of SEIR model

The notation in Figure 2 is similar to the SIR model but we have a new parameter \(\gamma\) which is the rate of progression from the exposed to infected compartment. It is equal to 1 divided by the incubation period.

Q4) What are the equations for an SEIR model?

Q5) Edit the stan model file to turn the SIR model in an SEIR model. Save this in the stan_model directory under a new file called seir_prevalence.stan. Since we have a new parameter in our model \(\sigma\) we need to define a prior for this. From the literature the incubation period for flu is around 2 days so a prior for sigma of \(N(0.5, 0.25)\) would be suitable.

Hint: You will need to change the following parts of the model.

The functions block to encode the new ODE model,
The data block so that y[0] has dimensions of 4,
The parameter block so the sigma parameter is added,
The transformed parameter block so theta[3] is now sigma,
The model block needs the prior for sigma adding and,
The generated quantity block can be used to calculate the incubation period (1/sigma).

Similar to before, the SEIR model can be run as follows. Note we also need to add an initial condition for our exposed compartment.

cases <- influenza_england_1978_school$in_bed 

# total count
N <- 763;

# times
n_days <- length(cases)
t <- seq(1, n_days, by = 1)
t0 = 0 

#initial conditions
i0 <- 1
s0 <- N - i0
e0 <- 0
r0 <- 0
y0 = c(S = s0, E = e0, I = i0, R = r0)

# data for Stan
data_seir <- list(n_days = n_days, y0 = y0, t0 = t0, ts = t, N = N, cases = cases)

# number of MCMC steps
niter <- 7500

model <- stan_model("stan_models/seir_prevalence.stan")

## recompiling to avoid crashing R session

fit_seir_negbin <- sampling(model,
                data = data_seir,
                iter = niter,
                chains = 4, 
                seed = 0)

## Warning: There were 35 divergent transitions after warmup. See
## https://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
## to find out why this is a problem and how to eliminate them.

## Warning: Examine the pairs() plot to diagnose sampling problems

Again we can print a summary for our results:

pars=c('beta', 'gamma', 'sigma', 'recovery_time', 'incubation_period', 'phi_inv')
print(fit_seir_negbin, pars = pars)

## Inference for Stan model: anon_model.
## 4 chains, each with iter=7500; warmup=3750; thin=1; 
## post-warmup draws per chain=3750, total post-warmup draws=15000.
## 
##                   mean se_mean     sd 2.5%  25%  50%  75% 97.5% n_eff Rhat
## beta              4.15    0.01   0.51 3.27 3.78 4.13 4.49  5.21  3008    1
## gamma             0.53    0.00   0.03 0.47 0.51 0.53 0.55  0.60  5468    1
## sigma             0.37    0.00   0.19 0.05 0.23 0.36 0.50  0.76  3587    1
## recovery_time     7.49    1.57 113.12 1.31 2.01 2.74 4.30 20.89  5164    1
## incubation_period 1.90    0.00   0.12 1.66 1.82 1.90 1.97  2.12  5702    1
## phi_inv           0.07    0.00   0.04 0.02 0.04 0.06 0.08  0.17  4646    1
## 
## Samples were drawn using NUTS(diag_e) at Sat Feb 24 15:13:36 2024.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).

and produce plots for our sample predictions of the number of students in bed.

smr_pred <- cbind(as.data.frame(summary(
  fit_seir_negbin, pars = "pred_cases", probs = c(0.05, 0.5, 0.95))$summary), t, cases)
colnames(smr_pred) <- make.names(colnames(smr_pred)) # to remove % in the col names
c_posterior = "lightgoldenrod"
print(ggplot(smr_pred, mapping = aes(x = t)) +
  geom_ribbon(aes(ymin = X5., ymax = X95.), fill = c_posterior, alpha = 0.35) +
  geom_line(mapping = aes(x = t, y = X50.), color = c_posterior) + 
  geom_point(mapping = aes(y = cases)) +
  labs(x = "Day", y = "Number of students in bed"))

3 Fitting an SEIR model to incidence data

The number of infected people in a population is not the only type of data collected during an outbreak. Often we get a line list or a list of the times at which people were infected (or showed symptoms). We have simulated some incidence data for a flu like outbreak with similar parameters to the school data and this can be found in the file incidence_seir.rds. (Do speak to one of us if you are interested in how we did this!).

incidence = readRDS("data/incidence_seir.RDS")

We now need to change our model to fit the incidence from our model to the data rather than the prevalence. First we need to calculate the incidence, which is equal to the number of people leaving the exposed component or \(\gamma E\).

Q6) Make a new copy of our SEIR model by re-saving seir_prevalence.stan as seir_incidence.stan. The incidence can be calculated by changing the transformed parameters block to be as follows (note only the for loop around the incidence should have changed).

transformed parameters{
  real y[n_days, 3];
  real phi = 1. / phi_inv;
  real theta[2]; // model parameters
  real incidence[n_days];

  theta[1] = beta;
  theta[2] = sigma;

  y = integrate_ode_rk45(sir, y0, t0, ts, theta, x_r, x_i);
  for (i in 1:n_days)
    incidence[i] = gamma * y[i, 2];
}

Then we need to update the likelihood in the model block to the following since we are now saying that our incidence is negatively binomially distributed instead of the total number of cases.

  //likelihood
  cases ~ neg_binomial_2(incidence, phi);

We can now use similar code to fit our model and print a summary of our parameters.

# total count
N <- 763

# times
n_days <- length(incidence)
t <- seq(1, n_days, by = 1)
t0 = 0

#initial conditions
i0 <- 1
s0 <- N - i0
e0 <- 0
r0 <- 0
y0 = c(S = s0, E = e0, I = i0, R = r0)

# data for Stan
data_sir <- list(n_days = n_days, y0 = y0, t0 = t0,
                 ts = t, N = N, cases = incidence)

# number of MCMC steps
niter <- 2000

model <- stan_model("stan_models/seir_incidence.stan")

fit_sir_negbin <- sampling(model,
                           data = data_sir,
                           iter = niter,
                           chains = 4,
                           seed = 0)

pars=c('beta', 'gamma', 'sigma', "recovery_time")
print(fit_sir_negbin, pars = pars)

## Inference for Stan model: anon_model.
## 4 chains, each with iter=2000; warmup=1000; thin=1; 
## post-warmup draws per chain=1000, total post-warmup draws=4000.
## 
##               mean se_mean   sd 2.5%  25%  50%  75% 97.5% n_eff Rhat
## beta          4.01    0.01 0.30 3.43 3.80 4.00 4.21  4.61  2297    1
## gamma         0.50    0.00 0.04 0.43 0.47 0.50 0.53  0.60  2047    1
## sigma         0.40    0.00 0.09 0.22 0.34 0.41 0.46  0.58  2120    1
## recovery_time 2.63    0.02 0.73 1.71 2.16 2.47 2.92  4.55  1547    1
## 
## Samples were drawn using NUTS(diag_e) at Sat Feb 24 15:19:36 2024.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).

Q7) Note that if you have left the same priors as before, there will be some warnings about your model. Investigate changing the priors and number of iterations to see if you can get rid of the warnings.

We can see how well our model fits the incidence data.

params_inc <- lapply(t, function(i){sprintf("incidence[%s]", i)})
smr_inc <- as.data.frame(summary(fit_sir_negbin,
                               pars = params_inc, probs = c(0.05, 0.5, 0.95))$summary)
colnames(smr_inc) <- make.names(colnames(smr_inc)) # to remove % in the col names
c_posterior = "lightgoldenrod"
print(ggplot(smr_inc, mapping = aes(x = t)) +
  geom_ribbon(aes(ymin = X5., ymax = X95.), fill = c_posterior, alpha = 0.35) +
  geom_line(mapping = aes(x = t, y = X50.), color = c_posterior) +
  geom_point(mapping = aes(y = incidence)) +
  labs(x = "Day", y = "Daily incidence") + theme_bw())

If we also load in the simulated prevalence data we can see that, as expected, our model estimates the number of infected individuals well.

infected <- readRDS("data/infected_seir.RDS")
params <- lapply(t, function(i){sprintf("y[%s,3]", i)}) #total number infected
smr_y <- as.data.frame(summary(fit_sir_negbin,
                               pars = params, probs = c(0.05, 0.5, 0.95))$summary)
colnames(smr_y) <- make.names(colnames(smr_y)) # to remove % in the col names
print(ggplot(smr_y, mapping = aes(x = t)) +
  geom_ribbon(aes(ymin = X5., ymax = X95.), fill = c_posterior, alpha = 0.35) +
  geom_line(mapping = aes(x = t, y = X50.), color = c_posterior) +
  geom_point(mapping = aes(y = infected)) +
  labs(x = "Day", y = "Number of infected") + theme_bw())

Practical 5: Disease Transmission Modelling in Stan

North-South-AIMS-Imperial: modern statistics and global health

Ettie Unwin

2024-02-23