Expected Modelling outcomes

The aim of this tutorial is to introduce individual-based modelling for infectious diseases in R. The dynamics of the stochastic model we target here cannot be captured by ordinary differential equations (ODES). In this session, we will br to:

  1. Create an IBM population with random movement*.
  2. Include Susceptible-Infected-Recovered disease dynamics
  3. Add transmission and recovery dynamics
  4. Run over multiple time steps and analyse transmission dynamics
  5. Create a population data.frame (=matrix) with one row per individual and one column per person characteristic
  6. Add a random walk (with borders)
  7. Combine everything into a loop to process multiple time steps

Pro tip! The best thing to do is to read each section and type (or copy and paste) the R commands (in grey boxes) into your own R session to check that you understand the code. Hands-on questions are indicated in red

Set working directory

The first step when creating a new R file is to specify a working directory. You need to tell the R session where you want it to work. To do this you use the command to “set the working directory” or setwd(). You can label this working directory as ‘home’ , and have other locations stored for plotting or for getting data. This location should be an existing folder on your computer. For example

home <- "/Users/Elanda/R/ModellingSIR" ## on a OS X
setwd(home)

A quick lesson about random numbers in R

Seeding

Random numbers in R (and other computer programs) are derived from a random number generator, which can be initiated with set.seed. Seeding the random stream is very useful to obtain stochastic results that are reproducible. As such, you can check whether model output has changed (or not) if you continue programming.Random.seed is an integer vector, containing the random number generator (RNG) state for random number generation in R. It can be saved and restored, but should not be altered by the user.RNGkind is a more friendly interface to query or set the kind of RNG in use.RNGversion can be used to set the random generators as they were in an earlier R version (for reproducibility).set.seed is the recommended way to specify seeds. For example, set the random seed to ‘2020’ and sample 3 numbers between 1 and 10, and sample 15 numbers with replacement.

set.seed(2020)
sample(x = 1:10,size = 3)
## [1] 7 6 8
sample(x = 1:10,size = 15,replace = T)
##  [1]  1  1  4 10  6  1  8  8 10  2  6  2  3  2  8

If we repeat the same operation withou without the set.seed we obtain different random numbers.

sample(x = 1:10,size = 3)
## [1] 8 4 2
sample(x = 1:10,size = 15,replace = T)
##  [1]  7  4  2  4  5  4  6  2 10  6  8  2  1  6  3

In R, we can generate random numbers using for example sample, runif, rnorm,rbinom. Please have a look at:

?sample
?runif
?rnorm
?rbinom
?set.seed

Sample

The samplefunction takes a sample of the specified size from the elements of x using either with or without replacement.

Example 1 Random Permutation: sample(150:-50, replace = FALSE) Generates numbers between the range specified.

Example 2 Bootstrap re-sampling: ( only length >1) sample(x, replace = TRUE)

Example 3 1000 Bernoulli trials sample(c(0,1), 1000, replace = TRUE)

Uniform distribution

The runif function for uniform distributions provide information about the uniform distribution on the interval from min to max. dunif gives the density, punif gives the distribution function qunif gives the quantile function and runif generates random deviates.

Example 1 dunif(x, min = 0, max = 1, log = FALSE)

Example 2 punif(q, min = 0, max = 1, lower.tail = TRUE, log.p = FALSE)

Example 3 qunif(p, min = 0, max = 1, lower.tail = TRUE, log.p = FALSE)

Example 4 runif(n, min = 0, max = 1)

Normal distribution

The ?rnorm function generates the; Density, distribution function, quantile function and random generation for the normal distribution with mean equal to mean and standard deviation equal to sd.The following: dnorm gives the density, pnorm gives the distribution function, qnorm gives the quantile function, and rnorm generates random deviates. The length of the result is determined by nfor rnorm, and is the maximum of the lengths of the numerical arguments for the other functions. The numerical arguments other than n are recycled to the length of the result. Only the first elements of the logical arguments are used.For sd = 0 this gives the limit as sd decreases to 0, a point mass at mu. sd < 0 is an error and returns NaN.

Example 1 : dnorm(x, mean = 0, sd = 1, log = FALSE)

Example 2 : pnorm(q, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)

Example 3 : qnorm(p, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)

Example 4 : rnorm(n, mean = 0, sd = 1)

Binomial Distribution

The rbinom function generates the Density, distribution function, quantile function and random generation for the binomial distribution with parameters size and prob. This is conventionally interpreted as the number of ‘successes’ in size trials.

Example 1 dbinom(x, size, prob, log = FALSE)

Example 2 pbinom(q, size, prob, lower.tail = TRUE, log.p = FALSE)

Example 3 qbinom(p, size, prob, lower.tail = TRUE, log.p = FALSE)

Example 4 rbinom(n, size, prob)

It is also possible to sample from a binomial distribution for different individuals at once.

# set number of samples, a probability vector and the number of trials to execute
num_samples <- 4
p_vector    <- c(0,0.5,0.9,1)
num_trials  <- 1
  
# sample from the binomial distribution, for each given probability
binom_sample <- rbinom(num_samples, num_trials, p_vector)

# if the sample is 1, we have a success
binom_sample
## [1] 0 1 0 1
binom_sample == 1
## [1] FALSE  TRUE FALSE  TRUE

Step 1: Setup the population

For this modelling exercise; we will use a matrix to store the population using one row per individual and one column per person characteristic. We will use a special matrix format, called data.frame, in which you can easily access columns using the column names. The following code illustrates the usefulness of the data.frame and how to access and modify elements.

Now, let us create a data.frame for a population of 300 individuals with a health state together with a x- and y coordinate. During the setup, we randomly locate each person in a 10x10 area.

pop_size    <- 300  # Population size 
area_size   <- 10   # 10x10 square
xy_coordinates <- seq(0,area_size,0.01)  # get sequence per 0.01

# setup population: one row per individual, one column per attribute, row index = id

pop_data     <- data.frame(health  = rep('S', length=pop_size),  # all individuals start in state 'S' (= susceptible)
                           x_coord = sample(x = xy_coordinates, size = pop_size,replace = T), # sample random x coordinate
                           y_coord = sample(x = xy_coordinates, size = pop_size,replace = T), # sample random y coordinate
                           stringsAsFactors    = FALSE)         # option to treat characters as 'strings' instead of 'factors'

dim(pop_data) # check the dimensions of the dataframe.
## [1] 300   3
head(pop_data) # show the first few rows
##   health x_coord y_coord
## 1      S    9.40    0.37
## 2      S    6.82    0.26
## 3      S    7.28    6.86
## 4      S    3.24    5.22
## 5      S    6.18    2.23
## 6      S    8.25    3.85

Step 2: Random Walk

We now want to simulate a random walk process with a given maximum velocity. The direction and distance should be randomized. As such, we set a maximum distance and create a function to apply the random walk during one time step. Random walk, in probability theory, a process for determining the probable location of a point subject to random motions, given the probabilities (the same at each step) of moving some distance in some direction.

Random walks are an example of Markov processes, in which future behaviour is independent of past history. It is sequence of possibly dependent random variables (x1, x2, x3,...)—identified by increasing values of a parameter, commonly time—with the property that any prediction of the next value of the sequence (xn), knowing the preceding states (x1, x2, …, xn − 1), may be based on the last state (xn − 1) alone. That is, the future value of such a variable is independent of its past history.

# set max velocity = maximum distance per time step 
max_velocity <- 1   # squares per day

random_walk <- function(pop_data, max_velocity){
  # step 1: move each person at random along the x and y axis
  pop_data$x_coord <- pop_data$x_coord + runif(n = pop_size,min = -max_velocity, max = max_velocity)
  pop_data$y_coord <- pop_data$y_coord + runif(n = pop_size,min = -max_velocity, max = max_velocity)
  
  # step 2: if an individual crosses a model boundary: set back at the border
  pop_data$x_coord[pop_data$x_coord > area_size] <- area_size
  pop_data$y_coord[pop_data$y_coord > area_size] <- area_size
  pop_data$x_coord[pop_data$x_coord < 0]         <- 0
  pop_data$y_coord[pop_data$y_coord < 0]         <- 0
  
  # return the adjusted population data.frame
  return(pop_data)
}

If you want to visualize the random walk, you can create a plot before and after applying the function. Please note that it is possible to highlight a one individual using points:

# plot the area and all indivdiuals, using color 1 (= black)
plot(x = pop_data$x_coord,
     y = pop_data$y_coord,
     col = 1)
# highlight one individual, using color 2 (= red)
points(x = pop_data$x_coord[1],
      y = pop_data$y_coord[1],
      pch = 20, # other symbol
      col = 4) # The original code was point() but its giving a dev.off() error

# highlight the selected individual again, using color 3 (= green)
# plot(x = pop_data$x_coord[1],
       #y = pop_data$y_coord[1],
       #pch = 20, # other symbol
       #col = 3)
# update population
pop_data <- random_walk(pop_data,max_velocity)
head(pop_data)
##   health  x_coord   y_coord
## 1      S 8.987709 0.0000000
## 2      S 6.149457 0.2199252
## 3      S 7.584245 6.3804491
## 4      S 2.912578 4.3621126
## 5      S 6.326365 3.1230200
## 6      S 7.863375 3.7542047

Social Contacts (Based on proximity)

We assume that susceptible individuals can acquire an infection if they are close to an infected individual. We define proximity here with a maximum distance in terms of max_contact_distance. Probability for a social contact, depends on the number people that are within the specified range and the population-based average number of social contacts individuals have per time step (=per day).

# set proximity parameter and the population-based average number of contacts per day
max_contact_distance <- 1
target_num_contacts_day <-  5

# calculate the distance matrix using the 'dist' function and store as matrix
# dist() function computes and returns the distance matrix computed by using the specified distance measure to compute the distances between the rows of a data matrix.
distance_matrix <- as.matrix(dist(pop_data[,c('x_coord','y_coord')],upper=T))

# create a matrix with Booleans whether individuals are close enough
eligible_contacts_matrix <- distance_matrix <= max_contact_distance

# set the diagonal to FALSE, to prevent self-contacts 
diag(eligible_contacts_matrix) <- FALSE

Let’s visualize this:

# which indididuals are close enough to our index person?
plot(x = pop_data$x_coord,
     y = pop_data$y_coord)
points(x = pop_data$x_coord[1],
       y = pop_data$y_coord[1],
       col = 3,
       pch=20)
points(x = pop_data$x_coord[eligible_contacts_matrix[1,]],
       y = pop_data$y_coord[eligible_contacts_matrix[1,]],
       pch = 20,
       col = 4)
legend('topright',
       c('index person',
       'eligible contacts'),
       col = c(3,4),
       pch = c(20,20))

Now, we will write a function to count the number of individuals that are close enough, and calculate the probability for a social contact with each of them. Please note that Boolean’s can be summed, using TRUE as 1 and FALSE as 0.

# Now, we will write a function to count the number of individuals that are close enough, and calculate the probability for a social contact with each of them.
get_contact_probability <- function(eligible_contacts_matrix,index_person){
  
  # select the eligible contacts from the index person
  eligible_contacts <- eligible_contacts_matrix[index_person,]
  
  # how many individuals can our index person contact this time step?
  num_eligible_contacts <- sum(eligible_contacts)
  
  # what is the contact probability per person?
  contact_probability <- target_num_contacts_day / num_eligible_contacts
  
  # limit the probability to 0.95 (this is an arbitrary choice)
  contact_probability[contact_probability > 0.95] <- 0.95
    
  # assign the probability for each individual that is close enough
  p_contact_vector <- eligible_contacts * contact_probability
  
  # return the probability vector
  return(p_contact_vector)
}

Does the contact_probability for individual 1 corresponds with the number of potential contacts in the plot above? What is the contact probability with each of them?

p_contact <- get_contact_probability(eligible_contacts_matrix,1)
table(p_contact)
## p_contact
##   0 0.5 
## 290  10

Step 2 Introduce infected individuals

Start with the introduction of infected cases into the population (seeding):

# set number of infected cases to start (or seed) the simulation
num_infected_seeds <- 10

# or select all individuals from row 1 till the specified number of cases.

pop_data$health[1:num_infected_seeds] <- 'I'

# it is also possible to sample random individuals to change their health state

# pop_data$health[sample(1:pop_size,size = num_infected_seeds)] <- 'I'

We can visualize the infected individuals like the following plot? (note: using pch=20 gives you full dots)

plot(x = pop_data$x_coord,
     y = pop_data$y_coord,
     col = 4,)

points(x = pop_data$x_coord[pop_data$health=="I"],
       y = pop_data$y_coord[pop_data$health=="I"],
       col = 3,
       pch=20)

legend('topright',
       c('Infected',
         'Susceptible'),
       col = c(3,4),
       pch = c(20,20))

### Transmission dynamics

To specify a transmission event upon contact, we define the transmission probability when two individuals meet and perform a Bernoulli experiment for each social contact between a infected and susceptible individual.

# disease parameters 
transmission_prob     <- 0.4  # transmission probability per social contact 

# identify all infected individuals
boolean_infected <- pop_data$health == 'I'   # = vector with Booleans (TRUE/FALSE)

ind_infected     <- which(boolean_infected)  # = vector with row indices
# get indices of infected invididuals
ind_infected 
##  [1]  1  2  3  4  5  6  7  8  9 10
num_infected     <- length(ind_infected)     # = single value: count data
num_infected
## [1] 10
# we specified the eligible contacts above
# eligible_contacts_matrix

# evaluate the contacts and transmission events for person 1, starting from the contact probability
p_contact_vector <- get_contact_probability(eligible_contacts_matrix,1)

# exclude contacts with non-susceptible individuals
p_contact_vector[pop_data$health != 'S'] <- 0

# sample from binomial distribution, given the contact and transmission probability
binom_infection_vector <-  rbinom(pop_size, size = 1, prob = p_contact_vector * transmission_prob) 

# update population matrix with new infections, if binomial sample == 1
pop_data$health[binom_infection_vector == 1] <- 'I'

# View(pop_data)

We can now visualize the newly infected individuals. (note: you might have different results since we are working with stochastic processes)

Step 3 Recovery

Each time step, we have to account for infected individual that recover. We can handle this using a recovery probability, based on an average number of days infected, and binomial sample.

# set disease parameter: from days infected to recovery rate and a probability
num_days_infected    <- 3
recovery_rate <- 1/num_days_infected
recovery_probability <- 1-exp(-recovery_rate)

# we identified the infected individuals above
##boolean_infected <- pop_data$health == 'I'  

# get vector with a probability to recover for each individual
#(if not infected, the Boolean is FALSE (=0), the probability is 0)
p_recover_vect <- boolean_infected * recovery_probability

# sample individuals that recover
binom_recover_vector <- rbinom(pop_size, size = 1, prob = p_recover_vect)

# update population data
pop_data$health[binom_recover_vector == 1] <- 'R'

We can now visualize the final health states.

plot(x = pop_data$x_coord[pop_data$health=="S"],
     y = pop_data$y_coord[pop_data$health=="S"],
     col = 2,)

points(x = pop_data$x_coord[pop_data$health=="I"],
       y = pop_data$y_coord[pop_data$health=="I"],
       col = 3,
       pch=20)

points(x = pop_data$x_coord[pop_data$health=="R"],
       y = pop_data$y_coord[pop_data$health=="R"],
       col = 4,
       pch=20)

legend('topright',
       c('Susceptible','Infected','Recovered'),
       col = c(2,3,4),
       pch = c(20,20,20))

#### Step 4 Add all parts into a loop…

We coded the population, random movement, social contacts, transmission and recovery. Now it is time to combine all these elements into a loop to iterate over different days.

# use iterator 'day_i' to process day 1,2,3 and 4
for(day_i in 1:4){
  # print the day index
  print(day_i)
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
# this is the same as
all_days <- 1:10
for(day_i in all_days){
  # print the day index
  print(day_i)
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9
## [1] 10

So, specify the number of days and run over each day using a for-loop, and compile the full transmission model. To keep track of the health states over time, you can use a matrix (population size x number of days) and store pop_data$health after each time step.

# set number of days
num_days <- 20

# create matrix to log the population health states over time
pop_data_log <- matrix(NA,nrow=pop_size,ncol=num_days)

for(day_i in 1:num_days){

  # update location
  pop_data <- random_walk(pop_data,max_velocity)

  # identify all infected individuals
  boolean_infected <- pop_data$health == 'I'   # = vector with Booleans TRUE/FALSE
  ind_infected     <- which(boolean_infected)  # = vector with indices
  num_infected     <- length(ind_infected)     # = single value: count data
  
  # evaluate the transmission events for each infected individual
  for(infected_i in ind_infected){
    p_contact_vector    <- get_contact_probability(eligible_contacts_matrix,infected_i)
  
  # exclude contacts with non-susceptible individuals
  p_contact_vector[pop_data$health != 'S'] <- 0
  
  # sample from binomial distribution, given the contact and transmission probability
  binom_infection_vector <-  rbinom(pop_size, size = 1, prob = p_contact_vector * transmission_prob) 
  
  # update population matrix with new infections, if binomial sample == 1
  pop_data$health[binom_infection_vector == 1] <- 'I'
  }
  
  # sample individuals that recover (from the originally infected individuals only!)
  p_recover_vect <- boolean_infected * recovery_probability
  binom_recover_vector <- rbinom(pop_size, size = 1, prob = p_recover_vect)
  pop_data$health[binom_recover_vector == 1] <- 'R'
  
  # log health states
  pop_data_log[,day_i] <- pop_data$health  
}

Plot the health states over time:

# count the number of S, I or R health states per column (= per time step)
num_susceptible_time <- colSums(pop_data_log == 'S')
num_infected_time <- colSums(pop_data_log == 'I')
num_recovered_time <- colSums(pop_data_log == 'R')

# plot over time
plot(num_susceptible_time,
     col=1,lwd=2,type='l',
     ylim=c(0,pop_size),
     xlab='Time',
     ylab='Individuals')
lines(num_infected_time,col=2,lwd=2)
lines(num_recovered_time,col=3,lwd=2)
legend('top',
        c('S','I','R'),
        col=1:3,
        ncol=3,
        lwd=2)

### Extending the SIR model

If you have time, you can now continue to build more complex versions of the SIR model to match the infectious disease you are interested in. For example, you could include a latent (or “Exposed”) stage or extend the model to include waning immunity, vaccination, maternal immunity, etc.