Age-mixing tutorial Bruges 2015

Background
Scope of this tutorial
Research questions
Getting ready
Setting up a model with Simpact
Exploring model output Part 1
Fitting the model to data (summary statistics)
Exploring model output Part 2
Addressing our research question
Notes on model fitting

Background

It is well known that people do not form sexual relationships at random with respect to the age of their partners. We can say that there exists an assortative age-mixing pattern. That is, most people form relationships with partners who are more or less in the same age group as themselves. However, the distribution of age gaps between sexual partners is typically not centred around zero. Rather, the average age gap is usually about 2 or 3 years (i.e. the man is 2-3 years older than the woman).

Scope of this tutorial

In this tutorial, we are going to use an agent-based model to explore how the age-mixing pattern may influence the epidemiology of HIV, loosely inspired by previously published data on age-mixing and HIV epidemic trajectories. Before diving into the actual research question, we are first going to construct a few toy models to get familiar with the general work flow of setting up models with Simpact and fitting them to data.

Next, we will develop a number of age-structured agent-based models, specifically designed to address our research question. By “age-structed” we explicitly mean that the rules for forming sexual relationships depend on the age(-gap) of the individuals in the model population. We will use these models to simulate dynamic sexual networks. After some “warm up” period, during which the sexual network reaches a dynamic equilibrium, we will introduce HIV into the model population, and simulate how HIV is transmitted across the network over time. We will summarise the simulation output with a small set of indicators of the epidemic trajectory, and we will perform a simple statistical analysis to capture the association between the age-mixing pattern and epidemic trajectory.

To summarise the age-mixing pattern, we will compute the average age difference across all relationships (AAD) and the variance of these age differences (VAD). We will also decompose the variance of age differences into a within-subject (WVAD) and between-subject (BVAD) part. Our main indicator of the HIV epidemic is the incidence of HIV. We could also look at HIV prevalence, but in the absence of temporal chances in the hazard of HIV-related mortality, HIV prevalence does not tell us anything extra we couldn’t learn from studying HIV incidence. Besides computing the overall, cumulative HIV incidence in the population, we are also interested in the age- and gender-specific HIV incidence over the relevant simulation period (from introduction of HIV until end of the simulation).

The emphasis of this modelling exercise is not on trying to create the model that reproduces observed age-mixing patterns and HIV epidemic trends most faithfully. Rather, we purposefully keep our models overly simple, like caricatures, to facilitate the analysis and remain focused on the key questions.

Research questions

Our central research question is: How does the time-age-gender-specific HIV incidence change as a function of AAD, WVAD and BVAD?

To give a (preliminary) answer to this question, we will construct X models that each produce distinct values of AAD, WVAD and BVAD. In the process, we will constrain these models such that the total number of relationships that are formed per unit of time during the dynamic equilibrium phase is similar across all models.

Getting ready

If you haven’t already done so, you need to first install the core SimpactCyan C++ program. MS-Windows users also need to install the Visual Studio 2013 redistributable package. Next, you should install a few R packages. Make sure you have a recent version of R installed (not older than version 3.1.1.).

install.packages("RJSONIO", repos="https://cran.rstudio.com")
install.packages("findpython", repos="https://cran.rstudio.com")
install.packages("rPithon", repos="http://research.edm.uhasselt.be/jori")
install.packages("RSimpactCyan", repos="http://research.edm.uhasselt.be/jori")

Setting up a model with Simpact

Assuming that you have successfully installed both the SimpactCyan core package and the required R packages, load the RSimpactCyan package and check out what functions are available. We are also going to need some other packages in this tutorial. You may need to install these first, before you can load them with the library() function. And we are going to load some additional helperfunctions.

library(RSimpactCyan)
library(EasyABC)
library(data.table)
library(nlme)
library(ggplot2)
library(RColorBrewer)
library(shape)
library(reshape2)
library(MASS)
library(grid)
library(stringr)
help(package="RSimpactCyan")
source(file="/Users/wimdelva/Dropbox (Personal)/FWO Age Mixing/Bruges2015/Tutorial/codechecking/postsimscript.R")
#source(file="/Users/wimdelva/GoogleDrive/Bruges2015/Tutorial/src/postsim.R")

To run a Simpact model with the default model parameters, we run the run.simpact() function, and give it an empty list as input. We will store the model output in directory /tmp/simpacttest, which will be created on the fly. Note that this directory will be erased next time we open a new R session. Lastly, there are additional arguments we can optionally specify. We can define an identifier of the output files that we will create and store. We are only running 1 test simulation, so we could call it today’s date and attach the number 1 to it. We can also set the seed of the random number generator, which can be useful if being able to reproduce the simulation output is important.

cfg <- list()
DestDir <- "/Users/wimdelva/GoogleDrive/Bruges2015/Tutorial/simoutput"
simID <- 1
identifier <- paste0("%T-%y-%m-%d-%H-%M-%S-", simID, "-")

testrun1 <- simpact.run(cfg,
                       DestDir,
                       identifierFormat = identifier,
                       seed = 1234)

The simulation output is stored in several files in the destination directory. The output.txt file contains “meta data” such as the version of Simpact and seed of the random number generator that were used, the total number of events that took place and the size of the model population at the start and end of the simulation. The simpact-cyan-my-identifier-eventlog.csv file stores each event in a separate row, the simpact-cyan-my-identifier-personlog.csv file stores key information for each person in a separate row, and the simpact-cyan-my-identifier-relationlog.csv file stores each relationship separately. The simpact-cyan-my-identifier-treatmentlog.csv file keeps track of episodes of antiretroviral treatment. Lastly, the simpact-cyan-my-identifier-config.txt file stores the model parameter values that were used for the simulation.

simpact.showconfig(NULL) lets you see all the model parameters that can be modified, what options you have when modifying them, and what the default values are. As you can see, there are many parameters that you can change, and not all parameters are used. For instance, there are two types of hazard function for the event of relationship formation that can be specified: formation.hazard.type = "simple" or formation.hazard.type = "agegap", with the latter being the default.

We can edit the model configuration by making changes to the cfg list (which was empty thus far). Let’s shorten the simulation time to 3 years and start with 500 women and 500 men, and then run the updated model.

simpact.showconfig(NULL)

cfg["population.simtime"] <- 3
cfg["population.numwomen"] <- 500
cfg["population.nummen"] <- 500
cfg["periodiclogging.interval"] <- 1 # Yearly keeping track of population size and ART programme size
simID <- 2
agedist.data.frame <- agedist.create(shape = 5, scale = 65)
identifier <- paste0("%T-%y-%m-%d-%H-%M-%S-", simID, "-")

testrun2 <- simpact.run(cfg,
                       DestDir,
                       identifierFormat = identifier,
                       agedist = agedist.data.frame,
                       seed = 1234)

Exploring model output Part 1

datalist <- readthedata(modeloutput = testrun2)
names(datalist)

## [1] "ptable" "rtable" "etable" "ttable" "ltable"

str(datalist)

## List of 5
##  $ ptable:Classes 'data.table' and 'data.frame': 1017 obs. of  16 variables:
##   ..$ ID          : int [1:1017] 1 2 3 4 5 6 7 8 9 10 ...
##   ..$ Gender      : int [1:1017] 0 0 0 0 0 0 0 0 0 0 ...
##   ..$ TOB         : num [1:1017] -11.9 -30.3 -38 -51.7 -26.7 ...
##   ..$ TOD         : num [1:1017] Inf Inf Inf Inf Inf ...
##   ..$ IDF         : int [1:1017] -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
##   ..$ IDM         : int [1:1017] -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
##   ..$ TODebut     : num [1:1017] Inf 0 0 0 0 ...
##   ..$ FormEag     : num [1:1017] 0 0 0 0 0 0 0 0 0 0 ...
##   ..$ InfectTime  : num [1:1017] Inf Inf Inf Inf Inf ...
##   ..$ InfectOrigID: int [1:1017] -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
##   ..$ InfectType  : int [1:1017] -1 -1 -1 -1 -1 -1 -1 -1 -1 0 ...
##   ..$ log10SPVL   : num [1:1017] -Inf -Inf -Inf -Inf -Inf ...
##   ..$ TreatTime   : num [1:1017] Inf Inf Inf Inf Inf ...
##   ..$ XCoord      : num [1:1017] 0 0 0 0 0 0 0 0 0 0 ...
##   ..$ YCoord      : num [1:1017] 0 0 0 0 0 0 0 0 0 0 ...
##   ..$ AIDSDeath   : int [1:1017] -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
##   ..- attr(*, ".internal.selfref")=<externalptr> 
##  $ rtable:Classes 'data.table' and 'data.frame': 967 obs. of  5 variables:
##   ..$ IDm     : int [1:967] 140 104 44 185 351 87 416 37 293 195 ...
##   ..$ IDw     : int [1:967] 720 714 803 841 771 825 908 658 542 918 ...
##   ..$ FormTime: num [1:967] 0.0004 0.0858 0.115 0.1487 0.1674 ...
##   ..$ DisTime : num [1:967] 0.0492 0.086 0.1452 0.1505 0.1737 ...
##   ..$ AgeGap  : num [1:967] 29.95 17.35 13.39 2.03 13.7 ...
##   ..- attr(*, ".internal.selfref")=<externalptr> 
##  $ etable:Classes 'data.table' and 'data.frame': 3706 obs. of  10 variables:
##   ..$ eventtime: num [1:3706] 0 0.0004 0.00162 0.00303 0.00327 ...
##   ..$ eventname: chr [1:3706] "HIV seeding" "formation" "formation" "formation" ...
##   ..$ p1name   : chr [1:3706] "(none)" "man_140" "man_193" "man_466" ...
##   ..$ p1ID     : int [1:3706] -1 140 193 466 302 302 765 765 832 832 ...
##   ..$ p1gender : int [1:3706] -1 0 0 0 0 0 1 1 1 1 ...
##   ..$ p1age    : num [1:3706] -1 57.8 32.7 78.7 23.3 ...
##   ..$ p2name   : chr [1:3706] "(none)" "woman_720" "woman_593" "woman_730" ...
##   ..$ p2ID     : int [1:3706] -1 720 593 730 -1 -1 -1 -1 -1 -1 ...
##   ..$ p2gender : int [1:3706] -1 1 1 1 -1 -1 -1 -1 -1 -1 ...
##   ..$ p2age    : num [1:3706] -1 27.9 37 22.3 -1 ...
##   ..- attr(*, ".internal.selfref")=<externalptr> 
##  $ ttable:Classes 'data.table' and 'data.frame': 14 obs. of  5 variables:
##   ..$ ID     : int [1:14] 903 248 832 11 487 137 317 332 501 556 ...
##   ..$ Gender : int [1:14] 1 0 1 0 0 0 0 0 1 1 ...
##   ..$ TStart : num [1:14] 1.03 1.57 2.01 2.69 2.69 ...
##   ..$ TEnd   : num [1:14] 1.87 2.38 2.4 Inf Inf ...
##   ..$ DiedNow: int [1:14] 1 1 0 0 0 0 0 0 0 0 ...
##   ..- attr(*, ".internal.selfref")=<externalptr> 
##  $ ltable:Classes 'data.table' and 'data.frame': 3 obs. of  3 variables:
##   ..$ Time       : num [1:3] 1 2 3
##   ..$ PopSize    : int [1:3] 986 979 966
##   ..$ InTreatment: int [1:3] 0 4 11
##   ..- attr(*, ".internal.selfref")=<externalptr>

Visualising the age-mixing scatter

# ptable$TOB is the time of birth.
# If you were 20 years old at the start of the simulation (time=0), then your
# TOB is -20
agemixingdata <- agemixing(testrun2) #(datalist$ptable, datalist$rtable)
names(agemixingdata)

##  [1] "AAD"                "VAD"                "men.lme"           
##  [4] "slope"              "powerm"             "within.var.base"   
##  [7] "within.var.weights" "WVAD"               "Pred"              
## [10] "BVAD"               "agescatterdata"

names(agemixingdata$agemixingdata)

## NULL

names(agemixingdata$agescatterdata)

##  [1] "IDm"             "IDw"             "FormTime"       
##  [4] "DisTime"         "AgeGap"          "Gender"         
##  [7] "TOB"             "TOD"             "IDF"            
## [10] "IDM"             "TODebut"         "FormEag"        
## [13] "InfectTime"      "InfectOrigID"    "InfectType"     
## [16] "log10SPVL"       "TreatTime"       "XCoord"         
## [19] "YCoord"          "AIDSDeath"       "AgeMaleatForm"  
## [22] "AgeMaleatForm0"  "AgeFemaleatForm" "pred"

agescatter <- agemixingdata$agescatterdata
scatterplot <- ggplot(data = agescatter, aes(x = AgeMaleatForm, y = AgeFemaleatForm)) +
  geom_point(alpha = 0.5) +
  geom_abline(intercept=0, slope=1, size = 1, aes(color = "Diagonal (x = y)")) +
  stat_smooth(method="lm", se=FALSE, aes(x = AgeMaleatForm, y = AgeFemaleatForm, colour="Linear fit"), size = 1) +
  scale_y_continuous(name="Age woman") +
  scale_color_manual('', values = c("black", "blue")) +
  xlab("Age man") +
  theme_bw(base_size = 18, base_family = "")

scatterplot

The black line is the diagonal (y = x). This scatter plot looks nothing like the empirical data that has been collected, for instance on Likoma Island in Malawi, shown below.

Age scatterplot of Likoma Island Network Study

To understand what’s going on here, let’s look at the hazard function for relationship formation, and its default parameter values.

\[ \begin{array}{lll} {\rm hazard} &=& F \times \exp\left( \alpha_{\rm baseline} + \alpha_{\rm numrel,man} P_{\rm man} + \alpha_{\rm numrel,woman} P_{\rm woman} \right. \\ &+& \alpha_{\rm numrel,diff}|P_{\rm man} - P_{\rm woman}| \\ &+& \alpha_{\rm meanage} \left(\frac{A_{\rm man}(t)+A_{\rm woman}(t)}{2}\right) \\ &+& \alpha_{\rm eagerness,sum}(E_{\rm man} + E_{\rm woman}) + \alpha_{\rm eagerness,diff}|E_{\rm man} - E_{\rm woman}| \\ &+& \alpha_{\rm gap,factor,man} |A_{\rm man}(t)-A_{\rm woman}(t)-D_{p,{\rm man}}-\alpha_{\rm gap,agescale,man} A_{\rm man}(t)| \\ &+& \alpha_{\rm gap,factor,woman} |A_{\rm man}(t)-A_{\rm woman}(t)-D_{p,{\rm woman}}-\alpha_{\rm gap,agescale,woman} A_{\rm woman}(t)| \\ &+& \left. \beta (t-t_{\rm ref}) \right) \end{array} \]

cfgFull <- simpact.getconfig(cfg)

formation.hazard.param.indices1 <- grep("formation.hazard", names(cfgFull))
formation.hazard.param.indices2 <- grep("person.agegap", names(cfgFull))
cfgFull[c(formation.hazard.param.indices1, formation.hazard.param.indices2)]

## $formation.hazard.type
## [1] "agegap"
## 
## $formation.hazard.agegap.baseline
## [1] 0.1
## 
## $formation.hazard.agegap.numrel_man
## [1] 0
## 
## $formation.hazard.agegap.numrel_woman
## [1] 0
## 
## $formation.hazard.agegap.numrel_diff
## [1] 0
## 
## $formation.hazard.agegap.meanage
## [1] 0
## 
## $formation.hazard.agegap.eagerness_sum
## [1] 0
## 
## $formation.hazard.agegap.eagerness_diff
## [1] 0
## 
## $formation.hazard.agegap.gap_factor_man
## [1] 0
## 
## $formation.hazard.agegap.gap_agescale_man
## [1] 0
## 
## $formation.hazard.agegap.gap_factor_woman
## [1] 0
## 
## $formation.hazard.agegap.gap_agescale_woman
## [1] 0
## 
## $formation.hazard.agegap.beta
## [1] 0
## 
## $formation.hazard.agegap.t_max
## [1] 200
## 
## $person.agegap.man.dist.type
## [1] "fixed"
## 
## $person.agegap.man.dist.fixed.value
## [1] 0
## 
## $person.agegap.woman.dist.type
## [1] "fixed"
## 
## $person.agegap.woman.dist.fixed.value
## [1] 0

All parameters of the formation hazard function are all set to zero, except for the baseline parameter. This means that none of the variables on the left hand side of the equation influence the hazard of relationship formation between a particular man i and woman j. Of particular importance for this tutorial are the gap,factor and gap,agescale parameters and the D_p variable, which is sampled from a distribution that can be set with the person.agegap parameters. A_man and A_woman are the ages of the man and woman. D_p,man and D_p,woman are the age gaps that are preferred by the man and woman respectively. The values of gap,factor are typically negative, such that they express how quickly the hazard (~likelihood) of relationship formation decreases as a function of how far (in either direction) the age gap between the candidate partners deviates from the preferred age gap. The values of gap,agescale allow us to simulate a situation in which the preferred age gap increases (or decreases) with age.

Fitting the model to data (summary statistics)

Fitting complex stochastic models to data is not a trivial task, and in fact an active field of research in statistics and computer science. For classical statistical models, it is often possible to write down likelihood functions: the likelihood of a certain parameter value, given the outcomes we have observed (data) is equal to the probability of the observing the data, given the parameter value. We can then implement algorithms to find the parameter value that maximises this likelihood. For complex stochastic models, however, the likelihood is “intractable”, i.e. we cannot write down a closed form equation that we can solve. Approximate Bayesian Computation (ABC) is in this case a helpful method. Instead of comparing the full data that comes out of the model with the full data that we observe, we only compare summary statistics computed from the model output data with the analogue summary statistics computed from the observed data. The task now becomes: which parameter values lead to the best match (= smallest difference) between observed summary statistics and model output summary statistics. Which summary statistics to compare mainly depends on the research question of interest, and the available data. For epidemiological modelling work, relevant summary statistics could include: HIV prevalence at certain points in time or in certain population strata, the age distribution of the population at a certain point in time, the (scaled) population size at a certain point in time, distribution of the number of sexual partners in the past year, fraction of the HIV+ people that are alive and on ART, distribution of set point viral load among HIV+ people not on ART.

Here we merely want to provide a simple example of how ABC can be used to fit the model to summary statistics. Let’s say we wanted to model an age-mixing pattern with a mean age gap of 3 years (i.e. the man is 3 years older than the women) and a standard deviation of 2 years, so that 95% of the age gaps would fall in the range - 1 to 7 years age gap. Moreover, we want to control the overall intensity of relationship formation, such that people form 1 new relationship every year, on average. In other words, we have 3 target statistics: mean age gap = 3 years; standard deviation in age gaps = 2 years; average number of relationships per person per year = 1.

Making sure that the mean age gap is about 3 years is straightforward, since this can be achieved by modifying the cfgFull object directly:

cfgFull["person.agegap.man.dist.type"] <- "fixed"
cfgFull["person.agegap.woman.dist.type"] <- "fixed"
cfgFull["person.agegap.man.dist.fixed.value"] <- 3
cfgFull["person.agegap.woman.dist.fixed.value"] <- 3

Controlling the standard deviation in age gap and the average number of new relationships per person per year, is less straightforward. From the relationship formation hazard function, we know that we can use the gap,factor parameters to tune how quickly the hazard drops as candidate couples have age gaps that deviate from 3 years. But we don’t know what the best value for these parameters is to achieve a standard deviation of 2 years. Similarly, we know that we can use the baseline parameter to tune the rate of relationship formation, but again we don’t know what the best value is.

The EasyABC package has convenient, user-friendly functions to fit custom-built models to summary statistics, using ABC. The vignette of the EasyABC package describes the parameters that need to be specified when using ABC_sequential and other ABC algorithms that are implemented in EasyABC.

vignette("EasyABC")

First, we need to create a function that runs the model based on an “inputvector”, and returns an “outputvector”. The input vector will be the vector of parameter values that is sampled from the distribution of possible/likely parameter values. In the first iteration of the ABC algorithm, this distribution is the prior distribution that we specified, but in later iterations, it is the updated distribution, based on how good or how bad the output from the first iteration approximated the target statistics.

ABC_DestDir <- "/tmp/ABC/"
#ABC_identifier <- "ABC"

simpact4ABC <- function(inputvector){
  cfg <- cfgFull
  cfg["formation.hazard.agegap.baseline"] <- inputvector[1]
  cfg["formation.hazard.agegap.gap_factor_man"] <- inputvector[2]
  cfg["formation.hazard.agegap.gap_factor_woman"] <- inputvector[2]
  results <- simpact.run(cfg, ABC_DestDir) #simpact.run(cfg, ABC_DestDir, identifierFormat = ABC_identifier)  
  datalist <- readthedata(results)
  relsperpersonperyear <- nrow(datalist$rtable) / (nrow(datalist$ptable)/2) / cfg$population.simtime
  agegapsd <- sd(datalist$rtable$AgeGap)
  outputvector <- c(relsperpersonperyear, agegapsd)
  return(outputvector)
}


#We specify the prior distributions for the input parameters
# The relationship formation rate must roughly double: hF' = 2hF
# hF = exp(a0 + a1X1 + a2X2 + .... anXn)
# 2hF = exp(a0 + a1X1 + a2X2 + .... anXn) * 2
# 2hF = exp(a0 + a1X1 + a2X2 + .... anXn) * exp(log(2))
# 2hF = exp(a0 + log(2) + a1X1 + a2X2 + .... anXn)
# So we would naively expect that the baseline parameter (0.1) should be increased by log(2) ~ 0.7 to 0.8
# However, we are also adjusting the "gap factors" and making relationships with large age gaps less
# likely will result in an overall decrease in the number of relationships formed per time unit.

simpact_prior <- list(c("unif", 0.8, 5), c("unif", -1, 0))
# Lastly, we specify the target summary statistic
sum_stat_obs <- c(1, 2)

# Now we try to run a sequential ABC scheme, according to the method proposed by Lenormand et al. 2013
# Maxime Lenormand, Franck Jabot and Guillaume Deffuant. Adaptive approximate Bayesian computation for complex models. Comput Stat (2013) 28:2777–2796 DOI 10.1007/s00180-013-0428-3


# Initial number of simulations
n_init <- 100 #40
alpha <- 0.25 #0.5 # This is the proportion of particles kept at each step
pacc <- 0.2 #0.5 # This is the stopping criterion of the algorithm: a small number ensures a better convergence of the algorithm, but at a cost in computing time. Must be 0 < p_acc_min < 1. The smaller, the more strict the criterion.

ABC_LenormandResult0 <- ABC_sequential(method="Lenormand",
                                      model=simpact4ABC,
                                      prior=simpact_prior,
                                      nb_simul=n_init,
                                      summary_stat_target=sum_stat_obs,
                                      alpha=alpha,
                                      p_acc_min=pacc,
                                      verbose=FALSE)

# Time to get a coffee and a biscuit, this will take a while.

ABC_LenormandResult0

## $param
##           [,1]       [,2]
##  [1,] 3.719308 -0.3615570
##  [2,] 3.680660 -0.3596033
##  [3,] 3.618090 -0.3573498
##  [4,] 3.631450 -0.3265796
##  [5,] 3.646929 -0.3522323
##  [6,] 3.724457 -0.3583781
##  [7,] 3.580670 -0.3381771
##  [8,] 3.663326 -0.3332427
##  [9,] 3.673916 -0.3578550
## [10,] 3.665741 -0.3532111
## [11,] 3.690356 -0.3572989
## [12,] 3.518765 -0.3425055
## [13,] 3.694512 -0.3690919
## [14,] 3.698520 -0.3568035
## [15,] 3.642166 -0.3476434
## [16,] 3.605246 -0.3309599
## [17,] 3.753966 -0.3484978
## [18,] 3.735884 -0.3465867
## [19,] 3.647670 -0.3484618
## [20,] 3.635540 -0.3297906
## [21,] 3.652411 -0.3368527
## [22,] 3.751142 -0.3864177
## [23,] 3.634541 -0.3507282
## [24,] 3.607990 -0.3180469
## [25,] 3.674129 -0.3532666
## 
## $stats
##            [,1]     [,2]
##  [1,] 0.9990319 1.986278
##  [2,] 1.0022414 2.027980
##  [3,] 1.0134875 2.001553
##  [4,] 1.0096899 2.022819
##  [5,] 1.0180180 1.981528
##  [6,] 0.9948420 1.953484
##  [7,] 1.0171023 2.036161
##  [8,] 1.0081301 2.054098
##  [9,] 1.0246433 2.000294
## [10,] 1.0003221 1.931279
## [11,] 1.0060490 1.931702
## [12,] 1.0234375 2.044389
## [13,] 0.9942085 1.930223
## [14,] 1.0148291 1.935023
## [15,] 1.0209340 1.942546
## [16,] 1.0109395 2.071558
## [17,] 1.0318226 2.018502
## [18,] 0.9993502 2.077788
## [19,] 0.9734112 2.049129
## [20,] 1.0116279 2.075666
## [21,] 1.0073647 2.079073
## [22,] 1.0305393 1.958614
## [23,] 1.0307742 1.958398
## [24,] 1.0235105 2.066650
## [25,] 1.0373230 1.996853
## 
## $weights
##  [1] 0.1598097096 0.1598097096 0.1598097096 0.1598097096 0.1598097096
##  [6] 0.1598097096 0.0147409245 0.0140356815 0.0005971082 0.0010014017
## [11] 0.0006251785 0.0010573867 0.0005420240 0.0006006904 0.0006219232
## [16] 0.0005774136 0.0005937619 0.0003705791 0.0003570454 0.0021204756
## [21] 0.0012810841 0.0008028342 0.0004679794 0.0003848190 0.0003634309
## 
## $stats_normalization
## [1] 1.552219 3.619545
## 
## $epsilon
## [1] 0.0005789159
## 
## $nsim
## [1] 550
## 
## $computime
## [1] 635.5605

hist(ABC_LenormandResult0$param[,1])

hist(ABC_LenormandResult0$param[,2])

plot(ABC_LenormandResult0$param[,1], ABC_LenormandResult0$param[,2])

plot(ABC_LenormandResult0$stats)

kde <- kde2d(ABC_LenormandResult0$param[,1], ABC_LenormandResult0$param[,2])
image(kde, col = terrain.colors(22))

kde.max <- kde$z==max(kde$z) # max at 14th row (x value) and 12th column (y value)
kde.row <- which.max(rowSums(kde.max))
kde.col <- which.max(colSums(kde.max))
bestfit <- c(kde$x[kde.row], kde$y[kde.col]) # parameter combination with highest estimated joint density
bestfit

## [1]  3.675566 -0.355081

Let’s update our model, using the best fitting parameter values, and have a look at the age-mixing pattern again.

simID <- 3
identifier <- paste0("%T-%y-%m-%d-%H-%M-%S-", simID, "-")

cfgFull["formation.hazard.agegap.baseline"] <- bestfit[1]
cfgFull["formation.hazard.agegap.gap_factor_man"] <- bestfit[2]
cfgFull["formation.hazard.agegap.gap_factor_woman"] <- bestfit[2]

testrun3 <- simpact.run(cfgFull,
                        DestDir,
                        identifierFormat = identifier,
                        seed = 1234)

datalist <- readthedata(modeloutput = testrun3)
agemixingdata <- agemixing(modeloutput = testrun3)
agescatter <- agemixingdata$agescatterdata
scatterplot <- ggplot(data = agescatter, aes(x = AgeMaleatForm, y = AgeFemaleatForm)) +
  geom_point(alpha = 0.5) +
  geom_abline(intercept=0, slope=1, size = 1, aes(color = "Diagonal (x = y)")) +
  stat_smooth(method="lm", se=FALSE, aes(x = AgeMaleatForm, y = AgeFemaleatForm, colour="Linear fit"), size = 1) +
  scale_y_continuous(name="Age woman") +
  scale_color_manual('', values = c("black", "blue")) +
  xlab("Age man") +
  theme_bw(base_size = 18, base_family = "")
scatterplot

sd(agescatter$AgeGap)

## [1] 1.970024

nrow(agescatter) / (nrow(datalist$ptable)/2) / cfgFull$population.simtime

## [1] 1.101196

#Bootstrap estimate of standard error of standard deviation
resamples <- lapply(1:1000, function(i){
  sample(agescatter$AgeGap, replace = T)
})
r.sd <- sapply(resamples, sd)
st.err.sd <- sd(r.sd)
sd.95ci <- c(sd(agescatter$AgeGap) - qnorm(0.975)*st.err.sd,
             sd(agescatter$AgeGap) + qnorm(0.975)*st.err.sd)
sd.95ci

## [1] 1.856996 2.083051

Exploring model output Part 2

Let’s now look at the HIV epidemic that emerged in our model population. Time-specific HIV prevalence and HIV incidence are arguably the most meaningful summary statistics of evolving HIV epidemics. Besides the time dimension, we also want to see how these metrics differed accros age and gender groups. By default, HIV is introduced in the population at the start of the simulation, by randomly infecting 20% of the model population.

hivseed.params <- grep("hivseed", names(cfgFull))
cfgFull[hivseed.params]

## $hivseed.time
## [1] 0
## 
## $hivseed.type
## [1] "fraction"
## 
## $hivseed.age.min
## [1] 0
## 
## $hivseed.age.max
## [1] 1000
## 
## $hivseed.fraction
## [1] 0.2

Let’s override these settings and introduce HIV by infecting 1% of the population after 10 years of simulation time. In this way, the population has had time to reach a dynamic equilibrium with respect to the sexual network configuration. We will further constrain the initial HIV “seed infections” to the age group 20-25 year olds, so that we can easily see how the epidemic spreads from this birth cohort to other birth cohorts. Lastly, let’s run the model for 40 years, i.e. 30 years of transmission after HIV is introduced. For simplicity, we are assuming that nobody will start ART during the simulations.

Note: The default value of the hivseed.type parameter is “fraction”, which means that each individual has a probability equal to this fraction of becoming infected with HIV at the time that the virus is introduced in the population. For small populations, this means that the actual fraction of HIV positive people at the time of introduction of HIV may be (substantially) lower or higher than the value of hivseed.fraction. Alternatively, you can set If you want precise control over the number of people getting infected at the time of HIV introduction, you can set hivseed.type to “amount”, and assign the desired value to hivseed.amount.

cfgFull <- simpact.getconfig(cfgFull)
cfgFull$hivseed.fraction <- NULL
cfgFull["population.numwomen"] <- 5000
cfgFull["population.nummen"] <- 5000
cfgFull["population.eyecap.fraction"] <- 0.1 #1
cfgFull["syncpopstats.interval"] <- 1

simID <- 4
identifier <- paste0("%T-%y-%m-%d-", simID, "-")

cfgFull["population.simtime"] <- 110 # 110 is 10 years of "burn in", followed by 100 years of HIV transmission
cfgFull["hivseed.time"] <- 10
cfgFull["hivseed.type"] <- "amount"
cfgFull["hivseed.amount"] <- 100     # 10 out of 1000 is 1% of the population
# HIV is introduced at an age that the individual is likely already sexually
# active, in a relationship, and still has lots of time to infect others
cfgFull["hivseed.age.min"] <- 20
cfgFull["hivseed.age.max"] <- 25

# Nobody will start ART during the simulations
cfgFull["diagnosis.baseline"] <- -100 # This will result in timing of HIV diagnosis way beyond the simulation period.

# We are inflating the transmission parameters "for effect"
cfgFull["transmission.param.a"] <- -0.3
cfgFull["transmission.param.b"] <- -8  
cfgFull["transmission.param.c"] <- 0.1649

testrun4 <- simpact.run.wrapper(cfgFull,
                                DestDir,
                                identifierFormat = identifier,
                                seed = 1234,
                                parallel = TRUE)

Let’s look at HIV prevalence and incidence now. The helper functions in postsim.R calculate HIV prevalence and HIV incidence by age group, gender and time point, record the number of people that each infected individual infects (R), the age at which people transmit HIV, and the age gaps between partners (as we have seen in part 1).

datalist <- readthedata(modeloutput = testrun4)

## 
Read 92.2% of 2148540 rows
Read 2148540 rows and 10 (of 10) columns from 0.218 GB file in 00:00:03

allPrevalencetimestepsdata <- prevalenceheatmapdata(modeloutput = testrun4) #datalist$ptable, cfgFull)
allIncidencedatalist <- incidenceheatmapdata(modeloutput = testrun4) #datalist$ptable, datalist$etable, cfgFull)
#Incidencetimestepdata <- allIncidencetimestepsdata[ ,IncidenceTime_i := sum(eventname.incident=="transmission", na.rm = TRUE) / sum(PY),by = "time_i"]
allIncidencetimestepsdata <- allIncidencedatalist$output
#Rtimestepsdata <- allRtimestepsdata(modeloutput = testrun4) #datalist$ptable, datalist$etable, cfgFull)
ageandtimeattransmissiondata <- transmissionmap(modeloutput = testrun4) #datalist$ptable, datalist$etable, cfgFull)
agemixingdata <- agemixing(modeloutput = testrun4) #datalist$ptable, datalist$rtable)

# Now let's visualise the HIV prevalence model output
my.pal <- c("#FFFFFF", brewer.pal(9,"YlOrRd"))


# allPrevalencetimestepsdataM <- allPrevalencetimestepsdata[Gender==0, ]
# p <- ggplot(allPrevalencetimestepsdataM, aes(time_i, Age)) +
#   geom_tile(aes(fill = Prevalence), colour = "white") 
# p + scale_fill_gradient(low = "white", high = "steelblue")
# 
# # And now we look at HIV prevalence over time and by age, in women:
# allPrevalencetimestepsdataW <- allPrevalencetimestepsdata[Gender==1, ]
# p <- ggplot(allPrevalencetimestepsdataW, aes(time_i, Age)) +
#   geom_tile(aes(fill = Prevalence), colour = "white") 
# p + scale_fill_gradient(low = "white", high = "darkred")

# We can create a 2-panel plot to look at HIV prevalence in men and women
allPrevalencetimestepsdata$Gender <- as.factor(allPrevalencetimestepsdata$Gender)
levels(allPrevalencetimestepsdata$Gender) <- c("Men", "Women")

p <- ggplot(allPrevalencetimestepsdata, aes(time_i, Age, Prevalence)) +
  geom_tile(aes(fill = Prevalence))
p + scale_fill_gradientn(colours = my.pal) +
  facet_wrap(~Gender, nrow=1) +
  theme(panel.grid = element_blank()) +
  xlab("Simulation time")

# Now let's visualise the HIV incidence model output
allIncidencetimestepsdata$Gender <- as.factor(allIncidencetimestepsdata$Gender)
levels(allIncidencetimestepsdata$Gender) <- c("Men", "Women")
# We don't want weirdly high incidence values for strata where the number of PY
# exposure time was very small

p <- ggplot(allIncidencetimestepsdata[allIncidencetimestepsdata$PY_G_A>=5, ], aes(time_i, Age, Incidence_G_A)) +
  geom_tile(aes(fill = Incidence_G_A))
p + scale_fill_gradientn(colours = my.pal) +
  facet_wrap(~Gender, nrow=1) +
  theme(panel.grid = element_blank()) +
  xlab("Simulation time")

# But we are also interested in the population-average HIV incidence over time
p <- ggplot(allIncidencetimestepsdata, aes(x=time_i, y=Incidence_G, group=1)) +
  geom_line() +
  xlab("Simulation time") +
  ylab(" HIV incidence") +
  facet_wrap(~Gender, nrow=1)
print(p)

# In our analysis, we are going to use the cumulative incidence as the main summary statistic
# for characterising the HIV epidemics. cumul incid = Int_0_to_sim_end(1 - exp(-incid(t)))dt
allIncidencedatalist$cumul.incidence

## [1] 0.2101746

# VISUALISING THE HIV TRANSMISSION NETWORK
# Since we keep track of all the events, we can visualise when exactly HIV was
# transmitted and how old the infectors and newly infecteds were at the time of
# transmission.

agetrans <- ageandtimeattransmissiondata$transmissiondata

plot(c(0, 115), c(0, 95), type="n",
     xlab="Simulation time",
     ylab="Ages of transmission pairs")

Arrows(agetrans$timeattransm,
       agetrans$ageoftransmitter,
       agetrans$timeattransm,
       agetrans$ageofreceptor,
       arr.length = 0.1, arr.width = 0.1, code = 2, lwd = 1.5,
       arr.type = "triangle", col = - agetrans$genderoftransmitter + 3)
legend("topleft",
       legend = c("male-to-female", "female-to-male"),
       fill = c(3,2), bty = "n")

       #border = c(3,2))

Addressing our research question

Now that we know how to set up a model in Simpact, explore it’s simulation output, and calculate the metrics to summarise the simulated age-mixing pattern and evolving HIV epidemic, we are ready for the actual modelling study design. We will first use ABC to estimate parameters for 5 models, each representing a different age-mixing pattern. While the code below could have been written in a more compact form, e.g. using a loop, the expanded form shown here makes it easier to see exactly what the target statistics are and which parameters are calibrated for each of the models.

Calibrating the five models

targetwindow.start <- 10
targetwindow.end <- 13
targetwindow.duration <- targetwindow.end - targetwindow.start
targetwindow.midpoint <- targetwindow.start + targetwindow.duration/2

cfg <- list()
cfg["population.simtime"] <- targetwindow.end # We calibrate based on the period 10-13 years
cfg["hivseed.time"] <- 105 # So there is no HIV in this simulation
cfg["population.numwomen"] <- 100
cfg["population.nummen"] <- 100

# Adding random noise to survival with be enabled from version 0.19
#cfg["person.survtime.logoffset.dist.type"] = "normal"
#cfg["person.survtime.logoffset.dist.normal.mu"] = 0
#cfg["person.survtime.logoffset.dist.normal.sigma"] = 0.1

#cfg["person.art.accept.threshold.dist.type"] = "fixed"
#cfg["person.art.accept.threshold.dist.fixed.value"] = 0 # Make sure a person never accepts treatment

# MODEL 1: Everybody has the same preferred age gap (0 years) and the sd in age gaps is 2 years
cfg["person.agegap.man.dist.type"] <- "fixed"
cfg["person.agegap.woman.dist.type"] <- "fixed"
cfg["person.agegap.man.dist.fixed.value"] <- 0
cfg["person.agegap.woman.dist.fixed.value"] <- 0

cfg1 <- cfg

ABC_DestDir <- "/tmp/ABC/"
#ABC_identifier <- "ABC"

simpact4ABC1 <- function(inputvector){
  cfgABC <- cfg1
  cfgABC["formation.hazard.agegap.baseline"] <- inputvector[1]
  cfgABC["formation.hazard.agegap.gap_factor_man"] <- inputvector[2]
  cfgABC["formation.hazard.agegap.gap_factor_woman"] <- inputvector[2]
  results1 <- simpact.run(cfgABC, ABC_DestDir)
  datalist1 <- readthedata(results1)
  # Number of relationships formed in the 3-year target window,
  # divided by half of the people that were alive at the midpoint of target period,
  # divided by the duration of the targe window (3)
  targetwindow.rels <- subset(datalist1$rtable, FormTime >=targetwindow.start & FormTime < targetwindow.end)
  nrel <- nrow(targetwindow.rels)
  halfnpeople <- nrow(subset(datalist1$ptable, TOB <= targetwindow.midpoint & TOD > targetwindow.midpoint))/2
  relsperpersonperyear <- nrel / halfnpeople / targetwindow.duration
  agegapsd <- sd(targetwindow.rels$AgeGap)
  outputvector <- c(relsperpersonperyear, agegapsd)
  return(outputvector)
}


simpact_prior <- list(c("unif", 0.8, 5), c("unif", -1, 0))
# Lastly, we specify the target summary statistic
sum_stat_obs <- c(1, 2)

# Now we try to run a sequential ABC scheme, according to the method proposed by Lenormand et al. 2013
# Maxime Lenormand, Franck Jabot and Guillaume Deffuant. Adaptive approximate Bayesian computation for complex models. Comput Stat (2013) 28:2777–2796 DOI 10.1007/s00180-013-0428-3


# Initial number of simulations
n_init <- 20 #100
alpha <- 0.5 # This is the proportion of particles kept at each step
pacc <- 0.5 #0.2 # This is the stopping criterion of the algorithm: a small number ensures a better convergence of the algorithm, but at a cost in computing time. Must be 0 < p_acc_min < 1. The smaller, the more strict the criterion.

ABC_LenormandResult1 <- ABC_sequential(method="Lenormand",
                                      model=simpact4ABC1,
                                      prior=simpact_prior,
                                      nb_simul=n_init,
                                      summary_stat_target=sum_stat_obs,
                                      alpha=alpha,
                                      p_acc_min=pacc,
                                      verbose=FALSE)

ABC_LenormandResult1

# MODEL 2: All men prefer women that are 10 years younger then themselves, sd of age gaps is still 2 years
cfg2 <- cfg1 # We use cfg1 as the starting point
cfg2["person.agegap.man.dist.type"] <- "fixed"
cfg2["person.agegap.woman.dist.type"] <- "fixed"
cfg2["person.agegap.man.dist.fixed.value"] <- 10
cfg2["person.agegap.woman.dist.fixed.value"] <- 10

ABC_DestDir <- "/tmp/ABC/"
#ABC_identifier <- "ABC"

simpact4ABC2 <- function(inputvector){
  cfgABC <- cfg2
  cfgABC["formation.hazard.agegap.baseline"] <- inputvector[1]
  cfgABC["formation.hazard.agegap.gap_factor_man"] <- inputvector[2]
  cfgABC["formation.hazard.agegap.gap_factor_woman"] <- inputvector[2]
  results2 <- simpact.run(cfgABC, ABC_DestDir)
  datalist2 <- readthedata(results2)
  # Number of relationships formed in the 3-year target window,
  # divided by half of the people that were alive at the midpoint of target period,
  # divided by the duration of the targe window (3)
  targetwindow.rels <- subset(datalist2$rtable, FormTime >=targetwindow.start & FormTime < targetwindow.end)
  nrel <- nrow(targetwindow.rels)
  halfnpeople <- nrow(subset(datalist2$ptable, TOB <= targetwindow.midpoint & TOD > targetwindow.midpoint))/2
  relsperpersonperyear <- nrel / halfnpeople / targetwindow.duration
  agegapsd <- sd(targetwindow.rels$AgeGap)
  outputvector <- c(relsperpersonperyear, agegapsd)
  return(outputvector)
}


simpact_prior <- list(c("unif", 0.8, 5), c("unif", -1, 0))
# Lastly, we specify the target summary statistic
sum_stat_obs <- c(1, 2)

# Now we try to run a sequential ABC scheme, according to the method proposed by Lenormand et al. 2013
# Maxime Lenormand, Franck Jabot and Guillaume Deffuant. Adaptive approximate Bayesian computation for complex models. Comput Stat (2013) 28:2777–2796 DOI 10.1007/s00180-013-0428-3


# Initial number of simulations
n_init <- 100
alpha <- 0.5 # This is the proportion of particles kept at each step
pacc <- 0.2 # This is the stopping criterion of the algorithm: a small number ensures a better convergence of the algorithm, but at a cost in computing time. Must be 0 < p_acc_min < 1. The smaller, the more strict the criterion.

ABC_LenormandResult2 <- ABC_sequential(method="Lenormand",
                                       model=simpact4ABC2,
                                       prior=simpact_prior,
                                       nb_simul=n_init,
                                       summary_stat_target=sum_stat_obs,
                                       alpha=alpha,
                                       p_acc_min=pacc,
                                       verbose=FALSE)

ABC_LenormandResult2

# MODEL 3: Young men prefer 0 years age gaps, but the preferred age gap grows as they grow older; sd of age gaps is still 2 years
cfg3 <- cfg1 # We use cfg1 as the starting point
cfg3["person.agegap.man.dist.type"] <- "fixed"
cfg3["person.agegap.woman.dist.type"] <- "fixed"
cfg3["person.agegap.man.dist.fixed.value"] <- 0
cfg3["person.agegap.woman.dist.fixed.value"] <- 0
cfg3["formation.hazard.agegap.gap_agescale_man"] <- 0.25
cfg3["formation.hazard.agegap.gap_agescale_woman"] <- 0.25

ABC_DestDir <- "/tmp/ABC/"
#ABC_identifier <- "ABC"

simpact4ABC3 <- function(inputvector){
  cfgABC <- cfg3
  cfgABC["formation.hazard.agegap.baseline"] <- inputvector[1]
  cfgABC["formation.hazard.agegap.gap_factor_man"] <- inputvector[2]
  cfgABC["formation.hazard.agegap.gap_factor_woman"] <- inputvector[2]
  results3 <- simpact.run(cfgABC, ABC_DestDir)
  datalist3 <- readthedata(results3)
  # Number of relationships formed in the 3-year target window,
  # divided by half of the people that were alive at the midpoint of target period,
  # divided by the duration of the targe window (3)
  targetwindow.rels <- subset(datalist3$rtable, FormTime >=targetwindow.start & FormTime < targetwindow.end)
  nrel <- nrow(targetwindow.rels)
  halfnpeople <- nrow(subset(datalist3$ptable, TOB <= targetwindow.midpoint & TOD > targetwindow.midpoint))/2
  relsperpersonperyear <- nrel / halfnpeople / targetwindow.duration
  agegapsd <- sd(targetwindow.rels$AgeGap)
  outputvector <- c(relsperpersonperyear, agegapsd)
  return(outputvector)
}


simpact_prior <- list(c("unif", 6, 13), c("unif", -3.5, -1.5))#list(c("unif", 5, 10), c("unif", -2.5, -1))
# Lastly, we specify the target summary statistic
sum_stat_obs <- c(1, 2)

# Now we try to run a sequential ABC scheme, according to the method proposed by Lenormand et al. 2013
# Maxime Lenormand, Franck Jabot and Guillaume Deffuant. Adaptive approximate Bayesian computation for complex models. Comput Stat (2013) 28:2777–2796 DOI 10.1007/s00180-013-0428-3


# Initial number of simulations
n_init <- 80 #100
alpha <- 0.5 # This is the proportion of particles kept at each step
pacc <- 0.2 #0.2 # This is the stopping criterion of the algorithm: a small number ensures a better convergence of the algorithm, but at a cost in computing time. Must be 0 < p_acc_min < 1. The smaller, the more strict the criterion.

ABC_LenormandResult3 <- ABC_sequential(method="Lenormand",
                                       model=simpact4ABC3,
                                       prior=simpact_prior,
                                       nb_simul=n_init,
                                       summary_stat_target=sum_stat_obs,
                                       alpha=alpha,
                                       p_acc_min=pacc,
                                       verbose=FALSE)

ABC_LenormandResult3
output3 <- data.frame(base.param = ABC_LenormandResult3$param[,1],
                      age.param = ABC_LenormandResult3$param[,2],
                      turnover = ABC_LenormandResult3$stats[,1],
                      sdage = ABC_LenormandResult3$stats[,2])
#View(output3[output3$turnover>0.95 & output3$turnover<1.05, ])
  
plot(ABC_LenormandResult3$param[,1], ABC_LenormandResult3$param[,2])

# MODEL 4: The preferred age gap varies (a lot) between men, but they are very reluctant to have partners that deviate
# far from their personal preferred age gap. Sd of age gaps is increased to 4.
cfg4 <- cfg1 # We use cfg1 as the starting point
cfg4["person.agegap.man.dist.type"] <- "normal"
cfg4["person.agegap.woman.dist.type"] <- "normal"
cfg4["person.agegap.man.dist.normal.mu"] <- 0
cfg4["formation.hazard.agegap.gap_factor_man"] <- -2 # Previously this was estimated (at around -0.5), now it's predefined
#cfg4["person.agegap.man.dist.normal.sigma"] <- # This is now going to be estimated
cfg4["person.agegap.woman.dist.normal.mu"] <- 0
cfg4["formation.hazard.agegap.gap_factor_woman"] <- -2 # Previously this was estimated (at around -0.5), now it's predefined
#cfg4["person.agegap.woman.dist.normal.sigma"] <- # This is now going to be estimated
cfg4["formation.hazard.agegap.gap_agescale_man"] <- 0
cfg4["formation.hazard.agegap.gap_agescale_woman"] <- 0
cfg4["person.agegap.woman.dist.fixed.value"] <- NULL
cfg4["person.agegap.man.dist.fixed.value"] <- NULL

ABC_DestDir <- "/tmp/ABC/"
#ABC_identifier <- "ABC"

simpact4ABC4 <- function(inputvector){
  cfgABC <- cfg4
  cfgABC["formation.hazard.agegap.baseline"] <- inputvector[1]
  cfgABC["person.agegap.man.dist.normal.sigma"] <- inputvector[2]
  cfgABC["person.agegap.woman.dist.normal.sigma"] <- inputvector[2]
  results4 <- simpact.run(cfgABC, ABC_DestDir)
  datalist4 <- readthedata(results4)
  # Number of relationships formed in the 3-year target window,
  # divided by half of the people that were alive at the midpoint of target period,
  # divided by the duration of the targe window (3)
  targetwindow.rels <- subset(datalist4$rtable, FormTime >=targetwindow.start & FormTime < targetwindow.end)
  nrel <- nrow(targetwindow.rels)
  halfnpeople <- nrow(subset(datalist4$ptable, TOB <= targetwindow.midpoint & TOD > targetwindow.midpoint))/2
  relsperpersonperyear <- nrel / halfnpeople / targetwindow.duration
  agegapsd <- sd(targetwindow.rels$AgeGap)
  outputvector <- c(relsperpersonperyear, agegapsd)
  return(outputvector)
}


simpact_prior <- list(c("unif", 8, 12), c("unif", 4, 8))
# Lastly, we specify the target summary statistic
sum_stat_obs <- c(1, 4)  # Much wider age gap variation this time

# Now we try to run a sequential ABC scheme, according to the method proposed by Lenormand et al. 2013
# Maxime Lenormand, Franck Jabot and Guillaume Deffuant. Adaptive approximate Bayesian computation for complex models. Comput Stat (2013) 28:2777–2796 DOI 10.1007/s00180-013-0428-3


# Initial number of simulations
n_init <- 100
alpha <- 0.5 # This is the proportion of particles kept at each step
pacc <- 0.2 # This is the stopping criterion of the algorithm: a small number ensures a better convergence of the algorithm, but at a cost in computing time. Must be 0 < p_acc_min < 1. The smaller, the more strict the criterion.

ABC_LenormandResult4 <- ABC_sequential(method="Lenormand",
                                       model=simpact4ABC4,
                                       prior=simpact_prior,
                                       nb_simul=n_init,
                                       summary_stat_target=sum_stat_obs,
                                       alpha=alpha,
                                       p_acc_min=pacc,
                                       verbose=FALSE)

ABC_LenormandResult4

# MODEL 5: The preferred age gap is fixed at 0 for all men, but the hazard of relationship formation does not decrease
# as quickly as in MODEL 1 when the candidate partner has an age gap that deviates from this preferred age gap;
# Sd of age gaps is 4.
cfg5 <- cfg1 # We use cfg1 as the starting point
cfg5["person.agegap.man.dist.type"] <- "fixed"
cfg5["person.agegap.woman.dist.type"] <- "fixed"
cfg5["person.agegap.man.dist.fixed.value"] <- 0
cfg5["person.agegap.woman.dist.fixed.value"] <- 0
cfg5["formation.hazard.agegap.gap_agescale_man"] <- 0
cfg5["formation.hazard.agegap.gap_agescale_woman"] <- 0

ABC_DestDir <- "/tmp/ABC/"
#ABC_identifier <- "ABC"

simpact4ABC5 <- function(inputvector){
  cfgABC <- cfg5
  cfgABC["formation.hazard.agegap.baseline"] <- inputvector[1]
  cfgABC["formation.hazard.agegap.gap_factor_man"] <- inputvector[2]
  cfgABC["formation.hazard.agegap.gap_factor_woman"] <- inputvector[2]
  results5 <- simpact.run(cfgABC, ABC_DestDir)
  datalist5 <- readthedata(results5)
  # Number of relationships formed in the 3-year target window,
  # divided by half of the people that were alive at the midpoint of target period,
  # divided by the duration of the targe window (3)
  targetwindow.rels <- subset(datalist5$rtable, FormTime >=targetwindow.start & FormTime < targetwindow.end)
  nrel <- nrow(targetwindow.rels)
  halfnpeople <- nrow(subset(datalist5$ptable, TOB <= targetwindow.midpoint & TOD > targetwindow.midpoint))/2
  relsperpersonperyear <- nrel / halfnpeople / targetwindow.duration
  agegapsd <- sd(targetwindow.rels$AgeGap)
  outputvector <- c(relsperpersonperyear, agegapsd)
  return(outputvector)
}


simpact_prior <- list(c("unif", 0.8, 5), c("unif", -0.3, 0))
# Lastly, we specify the target summary statistic
sum_stat_obs <- c(1, 4)  # Wider age gap variation, same as in model 4

# Now we try to run a sequential ABC scheme, according to the method proposed by Lenormand et al. 2013
# Maxime Lenormand, Franck Jabot and Guillaume Deffuant. Adaptive approximate Bayesian computation for complex models. Comput Stat (2013) 28:2777–2796 DOI 10.1007/s00180-013-0428-3


# Initial number of simulations
n_init <- 100
alpha <- 0.5 # This is the proportion of particles kept at each step
pacc <- 0.2 # This is the stopping criterion of the algorithm: a small number ensures a better convergence of the algorithm, but at a cost in computing time. Must be 0 < p_acc_min < 1. The smaller, the more strict the criterion.

ABC_LenormandResult5 <- ABC_sequential(method="Lenormand",
                                       model=simpact4ABC5,
                                       prior=simpact_prior,
                                       nb_simul=n_init,
                                       summary_stat_target=sum_stat_obs,
                                       alpha=alpha,
                                       p_acc_min=pacc,
                                       verbose=FALSE)

ABC_LenormandResult5

Exploring results of calibration

We can use two-dimensional kernel density estimation to find the combination of parameters that yielded the highest joint density.

kde1 <- kde2d(ABC_LenormandResult1$param[,1], ABC_LenormandResult1$param[,2])
plot(ABC_LenormandResult1$param[,1], ABC_LenormandResult1$param[,2],
     col=rgb(0,0,0,50,maxColorValue=255), pch=16)

image(kde1, col = terrain.colors(22))

kde1.max <- kde1$z==max(kde1$z)
kde1.row <- which.max(rowSums(kde1.max))
kde1.col <- which.max(colSums(kde1.max))
bestfit1 <- c(kde1$x[kde1.row], kde1$y[kde1.col]) # parameter combination with highest estimated joint density
bestfit1

## [1]  3.3666986 -0.3388669

kde2 <- kde2d(ABC_LenormandResult2$param[,1], ABC_LenormandResult2$param[,2])
plot(ABC_LenormandResult2$param[,1], ABC_LenormandResult2$param[,2],
     col=rgb(0,0,0,50,maxColorValue=255), pch=16)

image(kde2, col = terrain.colors(22))

kde2.max <- kde2$z==max(kde2$z)
kde2.row <- which.max(rowSums(kde2.max))
kde2.col <- which.max(colSums(kde2.max))
bestfit2 <- c(kde2$x[kde2.row], kde2$y[kde2.col]) # parameter combination with highest estimated joint density
bestfit2

## [1]  3.6163702 -0.3573781

kde3 <- kde2d(ABC_LenormandResult3$param[,1], ABC_LenormandResult3$param[,2])
plot(ABC_LenormandResult3$param[,1], ABC_LenormandResult3$param[,2],
     col=rgb(0,0,0,50,maxColorValue=255), pch=16)

image(kde3, col = terrain.colors(22))

kde3.max <- kde3$z==max(kde3$z)
kde3.row <- which.max(rowSums(kde3.max))
kde3.col <- which.max(colSums(kde3.max))
bestfit3 <- c(kde3$x[kde3.row], kde3$y[kde3.col]) # parameter combination with highest estimated joint density
bestfit3

## [1]  6.513468 -1.573958

kde4 <- kde2d(ABC_LenormandResult4$param[,1], ABC_LenormandResult4$param[,2])
plot(ABC_LenormandResult4$param[,1], ABC_LenormandResult4$param[,2],
     col=rgb(0,0,0,50,maxColorValue=255), pch=16)

image(kde4, col = terrain.colors(22))

kde4.max <- kde4$z==max(kde4$z)
kde4.row <- which.max(rowSums(kde4.max))
kde4.col <- which.max(colSums(kde4.max))
bestfit4 <- c(kde4$x[kde4.row], kde4$y[kde4.col]) # parameter combination with highest estimated joint density
bestfit4

## [1] 8.301350 6.003404

kde5 <- kde2d(ABC_LenormandResult5$param[,1], ABC_LenormandResult5$param[,2])
plot(ABC_LenormandResult5$param[,1], ABC_LenormandResult5$param[,2],
     col=rgb(0,0,0,50,maxColorValue=255), pch=16)

image(kde5, col = terrain.colors(22))

kde5.max <- kde5$z==max(kde5$z)
kde5.row <- which.max(rowSums(kde5.max))
kde5.col <- which.max(colSums(kde5.max))
bestfit5 <- c(kde5$x[kde5.row], kde5$y[kde5.col]) # parameter combination with highest estimated joint density
bestfit5

## [1]  2.6821044 -0.1674192

Running the five calibrated models

Next we run each of the five models, using the ABC-calibrated parameters. We will run each model for 110 years. Ten years into the simulation time, we will introduce HIV as before. Instead of running each model just once, we will run them several times (let’s start with 5 repeats), because we expect some stochastic variation in model output. For each model run, we want to save the key metrics for our analysis in a dataframe.

# Preparing the list that will hold the key metrics
model1.metrics <- list()

DestDir <- "/Users/wimdelva/GoogleDrive/Bruges2015/Tutorial/simoutput"
modelreps <- 5

model1.metrics[["am"]] <- vector(mode = "list", length = modelreps)
model1.metrics[["inc"]] <- vector(mode = "list", length = modelreps)

modelrepvect <- 1:modelreps

# Running model 1 "modelreps" times
cfgFull <- cfg1
cfgFull["population.simtime"] <- 110 # 40 is 10 years of "burn in", followed by 100 years of HIV transmission
cfgFull["hivseed.time"] <- 10
cfgFull["hivseed.type"] <- "amount"
cfgFull["hivseed.amount"] <- 10     # 10 out of 1000 is 1% of the population
# HIV is introduced at an age that the individual is likely already sexually
# active, in a relationship, and still has lots of time to infect others
cfgFull["hivseed.age.min"] <- 20
cfgFull["hivseed.age.max"] <- 25
cfgFull["person.art.accept.threshold.dist.type"] = "fixed"
cfgFull["person.art.accept.threshold.dist.fixed.value"] = 0 # Make sure a person never accepts treatment
# We are inflating the transmission parameters "for effect"
cfgFull["transmission.param.a"] <- -0.3
cfgFull["transmission.param.b"] <- -8  
cfgFull["transmission.param.c"] <- 0.1649
# We plug in our best estimates for the parameters that we calibrated
cfgFull["formation.hazard.agegap.baseline"] <- bestfit1[1]
cfgFull["formation.hazard.agegap.gap_factor_man"] <- bestfit1[2]
cfgFull["formation.hazard.agegap.gap_factor_woman"] <- bestfit1[2]

modelnr <- "model1."

for (modelrep in modelrepvect){
  simID <- paste0(modelnr, modelrep)
  identifier <- paste0("%T-%y-%m-%d-%H-%M-%S-", simID, "-")
  seedid <- 1000 + modelrep
  model1run <- simpact.run.wrapper(cfgFull,
                                DestDir,
                                identifierFormat = identifier,
                                seed = seedid)
  
  model1.metrics[["am"]][[modelrep]] <- agemixing(modeloutput = model1run) # storing ALL age mixing data
  model1.metrics[["inc"]][[modelrep]] <- incidenceheatmapdata(modeloutput = model1run) # storing ALL HIV incidence data
}
# Let's have a brief look at the age-mixing scatter for the last run of the model
agemixingdata <- agemixing(modeloutput = model1run)
agescatter <- agemixingdata$agescatterdata
scatterplot <- ggplot(data = agescatter, aes(x = AgeMaleatForm, y = AgeFemaleatForm)) +
  geom_point(alpha = 0.5) +
  geom_abline(intercept=0, slope=1, size = 1, aes(color = "Diagonal (x = y)")) +
  stat_smooth(method="lm", se=FALSE, aes(x = AgeMaleatForm, y = AgeFemaleatForm, colour="Linear fit"), size = 1) +
  scale_y_continuous(name="Age woman") +
  scale_color_manual('', values = c("black", "blue")) +
  xlab("Age man") +
  theme_bw(base_size = 18, base_family = "")
scatterplot

sd(agescatter$AgeGap)

## [1] 2.02196

# Preparing the list that will hold the key metrics
model2.metrics <- list()

DestDir <- "/Users/wimdelva/GoogleDrive/Bruges2015/Tutorial/simoutput"
modelreps <- 5

model2.metrics[["am"]] <- vector(mode = "list", length = modelreps)
model2.metrics[["inc"]] <- vector(mode = "list", length = modelreps)

modelrepvect <- 1:modelreps

# Running model 2 "modelreps" times
cfgFull <- cfg2
cfgFull["population.simtime"] <- 110 # 40 is 10 years of "burn in", followed by 100 years of HIV transmission
cfgFull["hivseed.time"] <- 10
cfgFull["hivseed.type"] <- "amount"
cfgFull["hivseed.amount"] <- 10     # 10 out of 1000 is 1% of the population
# HIV is introduced at an age that the individual is likely already sexually
# active, in a relationship, and still has lots of time to infect others
cfgFull["hivseed.age.min"] <- 20
cfgFull["hivseed.age.max"] <- 25
cfgFull["person.art.accept.threshold.dist.type"] = "fixed"
cfgFull["person.art.accept.threshold.dist.fixed.value"] = 0 # Make sure a person never accepts treatment
# We are inflating the transmission parameters "for effect"
cfgFull["transmission.param.a"] <- -0.3
cfgFull["transmission.param.b"] <- -8  
cfgFull["transmission.param.c"] <- 0.1649
# We plug in our best estimates for the parameters that we calibrated
cfgFull["formation.hazard.agegap.baseline"] <- bestfit2[1]
cfgFull["formation.hazard.agegap.gap_factor_man"] <- bestfit2[2]
cfgFull["formation.hazard.agegap.gap_factor_woman"] <- bestfit2[2]

modelnr <- "model2."

for (modelrep in modelrepvect){
  simID <- paste0(modelnr, modelrep)
  identifier <- paste0("%T-%y-%m-%d-%H-%M-%S-", simID, "-")
  seedid <- 2000 + modelrep
  model2run <- simpact.run.wrapper(cfgFull,
                                DestDir,
                                identifierFormat = identifier,
                                seed = seedid)
  
  model2.metrics[["am"]][[modelrep]] <- agemixing(modeloutput = model2run) # storing ALL age mixing data
  model2.metrics[["inc"]][[modelrep]] <- incidenceheatmapdata(modeloutput = model2run) # storing ALL HIV incidence data
}
# Let's have a brief look at the age-mixing scatter for the last run of the model
agemixingdata <- agemixing(modeloutput = model2run)
agescatter <- agemixingdata$agescatterdata
scatterplot <- ggplot(data = agescatter, aes(x = AgeMaleatForm, y = AgeFemaleatForm)) +
  geom_point(alpha = 0.5) +
  geom_abline(intercept=0, slope=1, size = 1, aes(color = "Diagonal (x = y)")) +
  stat_smooth(method="lm", se=FALSE, aes(x = AgeMaleatForm, y = AgeFemaleatForm, colour="Linear fit"), size = 1) +
  scale_y_continuous(name="Age woman") +
  scale_color_manual('', values = c("black", "blue")) +
  xlab("Age man") +
  theme_bw(base_size = 18, base_family = "")
scatterplot

sd(agescatter$AgeGap)

## [1] 1.987615

# Preparing the list that will hold the key metrics
model3.metrics <- list()

DestDir <- "/Users/wimdelva/GoogleDrive/Bruges2015/Tutorial/simoutput"
modelreps <- 5

model3.metrics[["am"]] <- vector(mode = "list", length = modelreps)
model3.metrics[["inc"]] <- vector(mode = "list", length = modelreps)

modelrepvect <- 1:modelreps

# Running model 3 "modelreps" times
cfg3 <- list()
bestfit3 <- c(6.728921, -1.589603)
cfg3["person.agegap.man.dist.type"] <- "fixed"
cfg3["person.agegap.woman.dist.type"] <- "fixed"
cfg3["person.agegap.man.dist.fixed.value"] <- 0
cfg3["person.agegap.woman.dist.fixed.value"] <- 0
cfg3["formation.hazard.agegap.gap_agescale_man"] <- 0.25
cfg3["formation.hazard.agegap.gap_agescale_woman"] <- 0.25
cfgFull <- cfg3
cfgFull["population.numwomen"] <- 100
cfgFull["population.nummen"] <- 100
cfgFull["population.simtime"] <- 110 # 40 is 10 years of "burn in", followed by 100 years of HIV transmission
cfgFull["hivseed.time"] <- 10
cfgFull["hivseed.type"] <- "amount"
cfgFull["hivseed.amount"] <- 10     # 10 out of 1000 is 1% of the population
# HIV is introduced at an age that the individual is likely already sexually
# active, in a relationship, and still has lots of time to infect others
cfgFull["hivseed.age.min"] <- 20
cfgFull["hivseed.age.max"] <- 25
cfgFull["person.art.accept.threshold.dist.type"] = "fixed"
cfgFull["person.art.accept.threshold.dist.fixed.value"] = 0 # Make sure a person never accepts treatment
# We are inflating the transmission parameters "for effect"
cfgFull["transmission.param.a"] <- -0.3
cfgFull["transmission.param.b"] <- -8  
cfgFull["transmission.param.c"] <- 0.1649
# We plug in our best estimates for the parameters that we calibrated
cfgFull["formation.hazard.agegap.baseline"] <- bestfit3[1]
cfgFull["formation.hazard.agegap.gap_factor_man"] <- bestfit3[2]
cfgFull["formation.hazard.agegap.gap_factor_woman"] <- bestfit3[2]

modelnr <- "model3."

for (modelrep in modelrepvect){
  simID <- paste0(modelnr, modelrep)
  identifier <- paste0("%T-%y-%m-%d-%H-%M-%S-", simID, "-")
  seedid <- 3000 + modelrep
  model3run <- simpact.run.wrapper(cfgFull,
                                DestDir,
                                identifierFormat = identifier,
                                seed = seedid)
  
  model3.metrics[["am"]][[modelrep]] <- agemixing(modeloutput = model3run) # storing ALL age mixing data
  model3.metrics[["inc"]][[modelrep]] <- incidenceheatmapdata(modeloutput = model3run) # storing ALL HIV incidence data
}
# Let's have a brief look at the age-mixing scatter for the last run of the model
agemixingdata <- agemixing(modeloutput = model3run)
agescatter <- agemixingdata$agescatterdata
scatterplot <- ggplot(data = agescatter, aes(x = AgeMaleatForm, y = AgeFemaleatForm)) +
  geom_point(alpha = 0.5) +
  geom_abline(intercept=0, slope=1, size = 1, aes(color = "Diagonal (x = y)")) +
  stat_smooth(method="lm", se=FALSE, aes(x = AgeMaleatForm, y = AgeFemaleatForm, colour="Linear fit"), size = 1) +
  scale_y_continuous(name="Age woman") +
  scale_color_manual('', values = c("black", "blue")) +
  xlab("Age man") +
  theme_bw(base_size = 18, base_family = "")
scatterplot

sd(agescatter$AgeGap)

## [1] 2.326573

# Preparing the list that will hold the key metrics
model4.metrics <- list()

DestDir <- "/Users/wimdelva/GoogleDrive/Bruges2015/Tutorial/simoutput"
modelreps <- 5

model4.metrics[["am"]] <- vector(mode = "list", length = modelreps)
model4.metrics[["inc"]] <- vector(mode = "list", length = modelreps)

modelrepvect <- 1:modelreps

# Running model 4 "modelreps" times
cfgFull <- cfg4
cfgFull["population.simtime"] <- 110 # 40 is 10 years of "burn in", followed by 100 years of HIV transmission
cfgFull["hivseed.time"] <- 10
cfgFull["hivseed.type"] <- "amount"
cfgFull["hivseed.amount"] <- 10     # 10 out of 1000 is 1% of the population
# HIV is introduced at an age that the individual is likely already sexually
# active, in a relationship, and still has lots of time to infect others
cfgFull["hivseed.age.min"] <- 20
cfgFull["hivseed.age.max"] <- 25
cfgFull["person.art.accept.threshold.dist.type"] = "fixed"
cfgFull["person.art.accept.threshold.dist.fixed.value"] = 0 # Make sure a person never accepts treatment
# We are inflating the transmission parameters "for effect"
cfgFull["transmission.param.a"] <- -0.3
cfgFull["transmission.param.b"] <- -8  
cfgFull["transmission.param.c"] <- 0.1649
# We plug in our best estimates for the parameters that we calibrated
cfgFull["formation.hazard.agegap.baseline"] <- bestfit4[1]
cfgFull["person.agegap.man.dist.normal.sigma"] <- bestfit4[2]
cfgFull["person.agegap.woman.dist.normal.sigma"] <- bestfit4[2]

modelnr <- "model4."

for (modelrep in modelrepvect){
  simID <- paste0(modelnr, modelrep)
  identifier <- paste0("%T-%y-%m-%d-%H-%M-%S-", simID, "-")
  seedid <- 4000 + modelrep
  model4run <- simpact.run.wrapper(cfgFull,
                                DestDir,
                                identifierFormat = identifier,
                                seed = seedid)
  
  model4.metrics[["am"]][[modelrep]] <- agemixing(modeloutput = model4run) # storing ALL age mixing data
  model4.metrics[["inc"]][[modelrep]] <- incidenceheatmapdata(modeloutput = model4run) # storing ALL HIV incidence data
}
# Let's have a brief look at the age-mixing scatter for the last run of the model
agemixingdata <- agemixing(modeloutput = model4run)
agescatter <- agemixingdata$agescatterdata
scatterplot <- ggplot(data = agescatter, aes(x = AgeMaleatForm, y = AgeFemaleatForm)) +
  geom_point(alpha = 0.5) +
  geom_abline(intercept=0, slope=1, size = 1, aes(color = "Diagonal (x = y)")) +
  stat_smooth(method="lm", se=FALSE, aes(x = AgeMaleatForm, y = AgeFemaleatForm, colour="Linear fit"), size = 1) +
  scale_y_continuous(name="Age woman") +
  scale_color_manual('', values = c("black", "blue")) +
  xlab("Age man") +
  theme_bw(base_size = 18, base_family = "")
scatterplot

sd(agescatter$AgeGap)

## [1] 4.21833

# Preparing the list that will hold the key metrics
model5.metrics <- list()

DestDir <- "/Users/wimdelva/GoogleDrive/Bruges2015/Tutorial/simoutput"
modelreps <- 5

model5.metrics[["am"]] <- vector(mode = "list", length = modelreps)
model5.metrics[["inc"]] <- vector(mode = "list", length = modelreps)

modelrepvect <- 1:modelreps

# Running model 5 "modelreps" times
cfgFull <- cfg5
cfgFull["population.simtime"] <- 110 # 40 is 10 years of "burn in", followed by 100 years of HIV transmission
cfgFull["hivseed.time"] <- 10
cfgFull["hivseed.type"] <- "amount"
cfgFull["hivseed.amount"] <- 10     # 10 out of 1000 is 1% of the population
# HIV is introduced at an age that the individual is likely already sexually
# active, in a relationship, and still has lots of time to infect others
cfgFull["hivseed.age.min"] <- 20
cfgFull["hivseed.age.max"] <- 25
cfgFull["person.art.accept.threshold.dist.type"] = "fixed"
cfgFull["person.art.accept.threshold.dist.fixed.value"] = 0 # Make sure a person never accepts treatment
# We are inflating the transmission parameters "for effect"
cfgFull["transmission.param.a"] <- -0.3
cfgFull["transmission.param.b"] <- -8  
cfgFull["transmission.param.c"] <- 0.1649
# We plug in our best estimates for the parameters that we calibrated
cfgFull["formation.hazard.agegap.baseline"] <- bestfit5[1]
cfgFull["formation.hazard.agegap.gap_factor_man"] <- bestfit5[2]
cfgFull["formation.hazard.agegap.gap_factor_woman"] <- bestfit5[2]

modelnr <- "model5."

for (modelrep in modelrepvect){
  simID <- paste0(modelnr, modelrep)
  identifier <- paste0("%T-%y-%m-%d-%H-%M-%S-", simID, "-")
  seedid <- 5000 + modelrep
  model5run <- simpact.run.wrapper(cfgFull,
                                DestDir,
                                identifierFormat = identifier,
                                seed = seedid)
  
  model5.metrics[["am"]][[modelrep]] <- agemixing(modeloutput = model5run) # storing ALL age mixing data
  model5.metrics[["inc"]][[modelrep]] <- incidenceheatmapdata(modeloutput = model5run) # storing ALL HIV incidence data
}
# Let's have a brief look at the age-mixing scatter for the last run of the model
agemixingdata <- agemixing(modeloutput = model5run)
agescatter <- agemixingdata$agescatterdata
scatterplot <- ggplot(data = agescatter, aes(x = AgeMaleatForm, y = AgeFemaleatForm)) +
  geom_point(alpha = 0.5) +
  geom_abline(intercept=0, slope=1, size = 1, aes(color = "Diagonal (x = y)")) +
  stat_smooth(method="lm", se=FALSE, aes(x = AgeMaleatForm, y = AgeFemaleatForm, colour="Linear fit"), size = 1) +
  scale_y_continuous(name="Age woman") +
  scale_color_manual('', values = c("black", "blue")) +
  xlab("Age man") +
  theme_bw(base_size = 18, base_family = "")
scatterplot

sd(agescatter$AgeGap)

## [1] 3.98521

Creating and exploring the dataset for model analysis

Let’s make a dataframe that holds all the summary statistics that we want to use in our post-simulation analysis.

master.df <- data.frame()
for (m in 1:5){
  modelname.metrics <- paste0("model", m, ".metrics")
  metrics.list <- eval(parse(text=modelname.metrics))
  
  for (modelrep in 1:modelreps){
    metrics.df <- data.frame(
      model=m,
      run=modelrep,
      aad=metrics.list[["am"]][[modelrep]]$AAD,
      vad=metrics.list[["am"]][[modelrep]]$VAD,
      bvad=metrics.list[["am"]][[modelrep]]$BVAD,
      powerm=metrics.list[["am"]][[modelrep]]$powerm,
      within.var.base=metrics.list[["am"]][[modelrep]]$within.var.base,
      wvad.min=min(metrics.list[["am"]][[modelrep]]$WVAD),
      wvad.max=max(metrics.list[["am"]][[modelrep]]$WVAD),
      wvad.mean=mean(metrics.list[["am"]][[modelrep]]$WVAD),
      cumul.incid=metrics.list[["inc"]][[modelrep]]$cumul.incidence,
      year=metrics.list[["inc"]][[modelrep]]$incid.table$time_i,
      incid=metrics.list[["inc"]][[modelrep]]$incid.table$Incidence_All,
      py=metrics.list[["inc"]][[modelrep]]$incid.table$PY_All,
      cases=metrics.list[["inc"]][[modelrep]]$incid.table$cases)
    master.df <- rbind(master.df, metrics.df)
  }
}

names(master.df)

##  [1] "model"           "run"             "aad"            
##  [4] "vad"             "bvad"            "powerm"         
##  [7] "within.var.base" "wvad.min"        "wvad.max"       
## [10] "wvad.mean"       "cumul.incid"     "year"           
## [13] "incid"           "py"              "cases"

dim(master.df)

## [1] 2500   15

Let’s look at how our key metrics co-vary.

pairs(master.df[c(11, 1, 3:5, 8:10)])

dim(unique(master.df))

## [1] 2500   15

And what do the incidence curves look like?

master.df$model.run <- master.df$model*1000 + master.df$run
p <- ggplot(master.df, aes(year, incid, group = model.run, colour = as.factor(model))) +
  geom_line()
plot(p)

A more robust post-simulation analysis

Let’s read in the data from 5 models times 50 simulations runs per model, which have been executed previously, so we can do a more robust analysis.

load(file = paste0(DestDir,"/master.df.RData"))
# master.df$model.run <- paste(master.df$model, master.df$run, sep = ".")
n <- 5
cols <- gg_color_hue(n, l=65)
darkcols <- gg_color_hue(n, l=40)
master.df$smoother <- factor(paste0("smoother",as.numeric(master.df$model)))

p <- ggplot(master.df, aes(year, incid, group = model.run, colour = model)) +
  geom_line() +
  facet_wrap(~ model) +
  stat_smooth(se=FALSE, 
              method="loess",
              span=0.25,
              aes(year, incid, colour=smoother)) +
#  scale_colour_hue(l = c(rep(65, 5), rep(10, 5))) +
  scale_color_manual(values = c(model1=cols[1],
                                model2=cols[2],
                                model3=cols[3],
                                model4=cols[4],
                                model5=cols[5],
                                smoother1=darkcols[1],
                                smoother2=darkcols[2],
                                smoother3=darkcols[4],
                                smoother4=darkcols[5],
                                smoother5=darkcols[3]),
                     guide = FALSE)
plot(p)

pairs(master.df[c(14, 1, 6:8, 11:13)])

# p <- ggplot(master.df, aes(year, incid, group = model.run, colour = model)) +
#   geom_line() +
#   facet_wrap(~ model) +
#   stat_smooth(se=FALSE, aes(year, incid, colour="loess smoother"), colour=) +
#   scale_colour_hue("loess smoother", l=c(10, rep(65, 5)))
# plot(p)

p <- ggplot(master.df, aes(year, popsize,
                           group = model.run,
                           colour = cumul.incid)) +
  geom_line() +
  scale_colour_gradientn(colours=heat.colors(22)) +
  #geom_point(size=0.8, alpha=1) +
  facet_wrap(~ model)
plot(p)

pp <- ggplot(master.df, aes(year, pointprevalence,
                           group = model.run,
                           colour = model)) +
  geom_line() +
  #geom_point(size=0.8, alpha=1) +
  facet_wrap(~ model) +
  stat_smooth(se=FALSE, 
              method="loess",
              span=0.25,
              aes(year, pointprevalence, group = model, colour=smoother)) +
  scale_color_manual(values = c(model1=cols[1],
                                model2=cols[2],
                                model3=cols[3],
                                model4=cols[4],
                                model5=cols[5],
                                smoother1=darkcols[1],
                                smoother2=darkcols[2],
                                smoother3=darkcols[4],
                                smoother4=darkcols[5],
                                smoother5=darkcols[3]),
                     guide = FALSE)
plot(pp)

master.df$bsdad <- sqrt(master.df$bvad) # between-subject standard deviation of age differences
master.df$wsdad <- sqrt(master.df$wvad.mean) # an (incorrect) approx because sqrt(mean())!=mean(sqrt())  
reg <- lm(cumul.incid ~ aad + bvad + wvad.mean, data = master.df)
reg2 <- lm(cumul.incid ~ aad + bsdad + wsdad, data = master.df)
reg3 <- lm(cumul.incid ~ model, data = master.df)
summary(reg)

## 
## Call:
## lm(formula = cumul.incid ~ aad + bvad + wvad.mean, data = master.df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.32517 -0.04215 -0.00382  0.04310  0.33673 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.0885639  0.0017088   51.83   <2e-16 ***
## aad          0.0032389  0.0001961   16.52   <2e-16 ***
## bvad        -0.0010641  0.0000739  -14.40   <2e-16 ***
## wvad.mean    0.0340179  0.0002017  168.68   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.09583 on 23496 degrees of freedom
##   (1500 observations deleted due to missingness)
## Multiple R-squared:  0.6842, Adjusted R-squared:  0.6842 
## F-statistic: 1.697e+04 on 3 and 23496 DF,  p-value: < 2.2e-16

summary(reg2)

## 
## Call:
## lm(formula = cumul.incid ~ aad + bsdad + wsdad, data = master.df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.28016 -0.07276 -0.00231  0.05558  0.36737 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.0343557  0.0023986   14.32   <2e-16 ***
## aad          0.0002183  0.0002140    1.02    0.308    
## bsdad       -0.0097863  0.0004429  -22.09   <2e-16 ***
## wsdad        0.1197872  0.0008553  140.05   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1066 on 23496 degrees of freedom
##   (1500 observations deleted due to missingness)
## Multiple R-squared:  0.609,  Adjusted R-squared:  0.609 
## F-statistic: 1.22e+04 on 3 and 23496 DF,  p-value: < 2.2e-16

summary(reg3)

## 
## Call:
## lm(formula = cumul.incid ~ model, data = master.df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.32916 -0.03215  0.00125  0.02834  0.30300 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.173555   0.001276  136.06   <2e-16 ***
## modelmodel2  0.070169   0.001804   38.90   <2e-16 ***
## modelmodel4 -0.091616   0.001804  -50.79   <2e-16 ***
## modelmodel5  0.315699   0.001804  175.01   <2e-16 ***
## modelmodel3 -0.021926   0.001804  -12.15   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.09019 on 24995 degrees of freedom
## Multiple R-squared:  0.7081, Adjusted R-squared:  0.708 
## F-statistic: 1.516e+04 on 4 and 24995 DF,  p-value: < 2.2e-16

Notes on model fitting

Fitting Simpact models to data can be a very computationally intensive process, for a number of compounding reasons: large population size, long calendar time period and a highly active population all contribute to a long run time for one simulation run. On top of that, we will need to do a lot of simulation runs if we want to fit many parameters, and if the prior distributions are very wide (we don’t know in what approximate region we should look for good parameter values). One approach to deal with this challenge, is to split up the problem in smaller, more manageable chuncks:

Start with a small population size (order ~ 100-1000) and run over a short calendar period (~ 1-10 years) and fit model to a small set of target statistics.
Use posterior distribution of parameter values to narrow down prior distributions in subsequent fitting procedures.
Gradually increase the number of parameters to be fitted.
Once you have a fair idea of reasonable parameter regions for a small population, refit the model with a larger population and over the desired calendar period.