Inference of demography using R and ABC

The task we need to solve is: **Infer the expansion model parameters for each of the datasets and decide which is the “mild” and which is the “rapid” expansion.

First, we clean the memory in R and we load the necessary libraries.

## clean the memory
rm(list=ls())
## load the abc package
library(abc)

Next, we need to read in the datasets. You have two datasets, ms1 and ms2. One of them is coming from a mild expansion. The other one is coming from a more rapid expansion. The task we need to solve is: **Infer the expansion model parameters for each of the datasets and decide which is the “mild” and which is the “rapid” expansion.

## read in the observed summary statistics                                                                                                                               
obs.stat <- read.table("ms2.stat", header=TRUE)
## read in the simulation output                                                                                                                                         
sim.info <- read.table("sims.data", header=TRUE)
## parameters are the first two columns
sim.param <- sim.info[, 1:2]
## Get sumstat. We remove summary statistic 1, it's colinear with another one
sim.stat <- sim.info[, 4:ncol(sim.info)]
## also from the observed dataset
obs.stat <- obs.stat[,2:ncol(obs.stat)]

Let's see how many parameters and summary statistics we have

dim(obs.stat)

## [1] 1 9

dim(sim.stat)

## [1] 2000    9

dim(sim.param)

## [1] 2000    2

Summary statistics are correlated between themselves and with the parameters

plot(sim.info[1:100, 3:ncol(sim.info)], cex=0.6)
}

## Error: <text>:2:1: unexpected '}'
## 1: plot(sim.info[1:100, 3:ncol(sim.info)], cex=0.6)
## 2: }
##    ^

Summary statistics capture only some information that is related to our parameters. Also, not all summary statistics are equally informative. Some are more related to parameters (or some of them). Ideally we would like to use the most informative statistics. The rest are just 'producing noise'.

\( \color{red}{\text{Which statistics are the most informative}}? \)

#layout(matrix(1:(2*9), ncol=2, byrow=F), widths=3, heights=30)
par(mfrow=c(9,2), las=1)
par(mar=c(5,3,3,1))
for(i in 1:2){
  for(j in 1:ncol(sim.stat)){
    plot(sim.stat[1:200,j], sim.param[1:200,i], pch=19, col='blue', cex=0.6, xlab="", ylab="", main=paste(colnames(sim.stat)[j], "\nvs ", colnames(sim.param)[i]) )
  }
}

plot of chunk bunch_o_figs_svg

To answer this question probably, we could use mutual information analysis wikipedia link, or even regression analysis, but we will not do it in this course.

#ABC Analysis And now, we are ready for the ABC itself. ABC is using a *rejection and regression step, where the most similar datasets to the observation are used. Here, similarity means small Euclidean distance.

#calculate Euclidean distance for two vectors
a <- c(1,2, 3)
b <- c(5,3,-1)

## what is their Euclidean Distance?
dist(rbind(a, b))

##          a
## b 5.744563

# All defaults
include_graphics("images.png")

plot of chunk unnamed-chunk-6

## set the borders of the parameters (based on how we ran simulations)
logit.matrix <- matrix(c(100, 1500, 0.001,  1), ncol=2, byrow=TRUE)
res <- abc(target=obs.stat, param=sim.param,sumstat=sim.stat, tol=1, transf=c("logit"),logit.bounds=logit.matrix,  method="loclinear")

## Warning: All parameters are "logit" transformed.

Now, we will plot and see the results

The mutation rate theta

i <- 1
name.im=c("Population mutation rate (theta)", "Population size change")

plot(density(res$adj.values[,i]), type='l', xlim=c(logit.matrix[i,]), col='red', xlab=name.im[i], ylab="Density", main="")
points(density(sim.param[,i]), type='l', col='black')
points(density(res$unadj.values[,i]), type='l', col='blue')
legend("topright", legend=c("prior", "posterior"), col=c("blue", "red"), pch=19)

plot of chunk unnamed-chunk-8

The population size change

name.im=c("Population mutation rate (theta)", "Population size change")

i <- 2
plot(density(res$adj.values[,i]), type='l', xlim=c(logit.matrix[i,]), col='red', xlab=name.im[i], ylab="Density", main="")
points(density(sim.param[,i]), type='l', col='black')
points(density(res$unadj.values[,i]), type='l', col='blue')
legend("topright", legend=c("prior", "posterior"), col=c("blue", "red"), pch=19)

plot of chunk unnamed-chunk-9

Finally let's get the estimates

write.table(summary(res)[4,], file="", quote=F, row.names=T, col.names=F, sep="\t")

## Call: 
## abc(target = obs.stat, param = sim.param, sumstat = sim.stat, 
##     tol = 1, method = "loclinear", transf = c("logit"), logit.bounds = logit.matrix)
## Data:
##  abc.out$adj.values (2000 posterior samples)
## Weights:
##  abc.out$weights
## 
##                          p_theta p_pop_change_newpop
## Min.:                   100.2167              0.0033
## Weighted 2.5 % Perc.:   145.5486              0.0365
## Weighted Median:        410.3401              0.1099
## Weighted Mean:          481.5786              0.1545
## Weighted Mode:          329.9293              0.0750
## Weighted 97.5 % Perc.: 1166.9829              0.6017
## Max.:                  1497.7393              1.0000
## p_theta  481.578582481614
## p_pop_change_newpop  0.154483876252112