Approximate Bayesian Computation with EpiModel

library(EpiModel)
library(EasyABC)

Epidemic Parameter Estimation

This example shows to estimate two parameters of an SIS epidemic over a network: the infection probability per contact and rate of recovery back into the susceptible state. The question is: what combination of infection probability and recovery rate is most likely given an endemic diseaes prevalence of 25%.

First, we estimate a simple ERGM with a mean degree of 0.75.

n <- 1000
nw <- network.initialize(n = n, directed = FALSE)
formation <- ~edges
target.stats <- 0.75*(n/2)
coef.diss <- dissolution_coefs(dissolution = ~offset(edges), duration = 20)
est <- netest(nw, formation, target.stats, coef.diss, verbose = FALSE)
save(est, file = "temp2.est.rda")

The EasyABC package require that one input an R function containing a single argument, x, that may contain a vector of parameters to be estimated. The parallel computation version requires a seed to be set as shown below. The function must output a summary statistic, here the prevalence from the epidemic model, to be compared against an expected value.

myfunc <- function(x) {
  set.seed(x[1])
  require(EpiModel)
  load("temp2.est.rda")
  param <- param.net(inf.prob = x[2], rec.rate = x[3])
  init <- init.net(i.num = 50, status.rand = FALSE)
  control <- control.net(type = "SIS", nsteps = 250, nsims = 1, verbose = FALSE)
  mod <- netsim(est, param, init, control)
  df <- as.data.frame(mod)
  out <- mean(tail(df$i.num/df$num, control$nsteps/10))
  return(out)
}

Here is how you define your prior distributions for the parameters, in the order in which they are referenced in the function. For each parameter, one must list the statistical distribution and the parameters for that distribution. Here we show the prior distributions for inf.prob and rec.rate as unformly distributed between the values shown. Any distribution that can be calculated in R can be used for specifying a prior distribution. In addition to the priors, we must also specify the target statistic to which the simulations are compared. Here the target is an epidemic prevalence of 25%. The number of target statistics and their order must correspond to the output of the function defined above: for this case, there is a clear link between the function output and this target.

priors <- list(c("unif", 0.2, 0.4), c("unif", 0.05, 0.25))
prev.targ <- 0.25

Next we use the ABC_sequential function to run an ABC-SMC model in which the posterior distribution of the two epidemic parameters are to be estimated. The EasyABC package includes several varieties of sequential ABC methods, as well as other ABC implementations, including ABC-MCMC. This function specifies that we want a certain number of output simulations, and the p_acc_min argument specifies the size of the tolerance threshold that is acceptable between the Euclidean distance calculation comparing the simulations to the prevalence target statistic. Note, we used a parallel implementation of this on a 16-core computer for computational efficiency.

a <- ABC_sequential(method = "Lenormand",
                    model = myfunc,
                    prior = priors,
                    nb_simul = 200,
                    summary_stat_target = prev.targ,
                    p_acc_min = 0.05,
                    progress_bar = TRUE,
                    n_cluster = 16,
                    use_seed = TRUE)
save(a, file = "sisfit.rda")

This is the empirical density of the prevalence target, which varies around 25%.

load("sisfit.rda")
plot(density(a$stats))

These are the posterior distributions of the two parameters estimated in the model.

par(mfrow = c(1,2))
plot(density(a$param[, 1]))
plot(density(a$param[, 2]))

This uses two dimensional kernel density estimation to show the bivariate distribution of the two parameters. Logically, there is a positive correlation bewteen the two parameters, since as the infection probability goes up, the recovery rate may also rise given a fixed prevalence target in an SIS epidemic. This provides inference of the two parameters to be estimated.

library(MASS)
kde1 <- kde2d(a$param[,1], a$param[,2], n = 250)
par(mar = c(3,3,2,1), mgp = c(2,1,0))
image(kde1, col = heat.colors(100), ylab = "Recovery Rate",
      xlab = "Infection Probability")

This is the model selection component to address the question: what is the best fitting model to be used in epidemic simulations. This provides one calculation of that.

kde1.max <- kde1$z == max(kde1$z)
kde1.row <- which.max(rowSums(kde1.max))
kde1.col <- which.max(colSums(kde1.max))
bestfit1 <- c(kde1$x[kde1.row], kde1$y[kde1.col])
bestfit1

## [1] 0.3584195 0.1216473

Carnegie Correction Bias Correction

These methods may also be used for other purposes in our traditional workflow of epidemic modeling. Here we show how to use ABC to estimate the ERGM coefficient adjusted for edge dissolution. The “Carnegie” edges dissolution approximation is used to adjust the edges coefficient in a static ERGM to account for edges dissolving, and is used as a mechanism to avoid having to fit a full STERGM because of the computational difficulties in doing so, especially on large networks. However, the approximation does not perform well in certain conditions, including when the network density is high and the average edge duration is short. ABC may be used therefore, to estimate the adjusted cofficient in a more robust manner than the analytical approximate method.

Here is one example of that, with an average edge duration of 10 and average mean degree of one.

n <- 1000
nw <- network.initialize(n = n, directed = FALSE)
formation <- ~edges
target.stats <- 1*(n/2)
coef.diss <- dissolution_coefs(dissolution = ~offset(edges), duration = 10)
est <- netest(nw, formation, target.stats, coef.diss, verbose = FALSE)
save("est", file = "temp.est.rda")

Diagnosing that model shows a large bias between the target and dynamic simulations, which is entirely due to the poor performance of the approximation.

dx <- netdx(est, nsims = 10, nsteps = 500, ncores = 4, verbose = FALSE)
dx

## EpiModel Network Diagnostics
## =======================
## Diagnostic Method: Dynamic
## Simulations: 10
## Time Steps per Sim: 500
## 
## Formation Diagnostics
## ----------------------- 
##       Target Sim Mean Pct Diff Sim SD
## edges    500  553.457    0.107  23.52
## 
## Dissolution Diagnostics
## ----------------------- 
##                Target Sim Mean Pct Diff Sim SD
## Edge Duration    10.0    9.817   -0.018  9.311
## Pct Edges Diss    0.1    0.100    0.000  0.013

plot(dx)

So the ABC approach will be as follows: given the model fit above, estimate the adjustment to the coefficient needed to preserve the target statistic (that is an overall edges statistic of 500) in dynamic simulations. The test statistic will be calculated by simulating from the network using netdx in EpiModel, and then averaging over the final 50 timesteps in the simulation. This will return back one number to compare against 500.

myfunc <- function(x) {
  set.seed(x[1])
  require(EpiModel)
  load("temp.est.rda")
  est$coef.form <- est$coef.form + x[2]
  dx <- netdx(est, nsims = 1, nsteps = 200, verbose = FALSE)
  out <- mean(tail(dx$stats[[1]][, "edges"], 50))
  return(out)
}

This is the definition of the priors and the target statistic. The prior here was chosen based on the fact that result of the coefficient was to overestimate the number of edges; therefore, we specify a prior that adjusts that downwards (this was through some trial and error).

priors <- list(c("unif", -0.2, 0))
prev.targ <- 500

We again use the ABC sequential function to simulate from the model and conduct the tests against the target statistic.

a <- ABC_sequential(method = "Lenormand",
                    model = myfunc,
                    prior = priors,
                    nb_simul = 200,
                    summary_stat_target = prev.targ,
                    p_acc_min = 0.02,
                    progress_bar = TRUE,
                    verbose = FALSE,
                    n_cluster = 16,
                    use_seed = TRUE)

a
save(a, file = "carnegie.rda")

This shows the posterior distributions of the parameter to be estimated and the test statistic.

load("carnegie.rda")
par(mfrow = c(1, 2))
plot(density(a$param))
plot(density(a$stats))

The best fit in a one-parameter estimation is the sum of the weighted average of the distribution.

coef.adj <- sum(a$param * a$weights)
coef.adj

## [1] -0.1064373

That adjusted coefficient is now fed back into EpiModel, and the network diagnostics are simulated again. The simulations are perfectly on target with 500 edges. Voila.

est$coef.form <- est$coef.form + coef.adj
dx <- netdx(est, nsims = 10, nsteps = 500, ncores = 4, verbose = FALSE)
dx

## EpiModel Network Diagnostics
## =======================
## Diagnostic Method: Dynamic
## Simulations: 10
## Time Steps per Sim: 500
## 
## Formation Diagnostics
## ----------------------- 
##       Target Sim Mean Pct Diff Sim SD
## edges    500  499.581   -0.001 21.663
## 
## Dissolution Diagnostics
## ----------------------- 
##                Target Sim Mean Pct Diff Sim SD
## Edge Duration    10.0    9.832   -0.017  9.326
## Pct Edges Diss    0.1    0.100   -0.002  0.013

par(mfrow = c(1,1))
plot(dx)

Approximate Bayesian Computation with EpiModel

Samuel M. Jenness

September 14, 2015

Epidemic Parameter Estimation

Carnegie Correction Bias Correction