The task we need to solve is: **Infer the expansion model parameters for each of the datasets and decide which is the “mild” and which is the “rapid” expansion.
First, we clean the memory in R and we load the necessary libraries.
## clean the memory
rm(list=ls())
## load the abc package
library(abc)
Next, we need to read in the datasets. You have two datasets, ms1 and ms2. One of them is coming from a mild expansion. The other one is coming from a more rapid expansion. The task we need to solve is: **Infer the expansion model parameters for each of the datasets and decide which is the “mild” and which is the “rapid” expansion.
## read in the observed summary statistics
obs.stat <- read.table("ms2.stat", header=TRUE)
## read in the simulation output
sim.info <- read.table("sims.data", header=TRUE)
## parameters are the first two columns
sim.param <- sim.info[, 1:2]
## Get sumstat. We remove summary statistic 1, it's colinear with another one
sim.stat <- sim.info[, 4:ncol(sim.info)]
## also from the observed dataset
obs.stat <- obs.stat[,2:ncol(obs.stat)]
Let's see how many parameters and summary statistics we have
dim(obs.stat)
## [1] 1 9
dim(sim.stat)
## [1] 2000 9
dim(sim.param)
## [1] 2000 2
Summary statistics are correlated between themselves and with the parameters
plot(sim.info[1:100, 3:ncol(sim.info)], cex=0.6)
}
## Error: <text>:2:1: unexpected '}'
## 1: plot(sim.info[1:100, 3:ncol(sim.info)], cex=0.6)
## 2: }
## ^
Summary statistics capture only some information that is related to our parameters. Also, not all summary statistics are equally informative. Some are more related to parameters (or some of them). Ideally we would like to use the most informative statistics. The rest are just 'producing noise'.
\( \color{red}{\text{Which statistics are the most informative}}? \)
#layout(matrix(1:(2*9), ncol=2, byrow=F), widths=3, heights=30)
par(mfrow=c(9,2), las=1)
par(mar=c(5,3,3,1))
for(i in 1:2){
for(j in 1:ncol(sim.stat)){
plot(sim.stat[1:200,j], sim.param[1:200,i], pch=19, col='blue', cex=0.6, xlab="", ylab="", main=paste(colnames(sim.stat)[j], "\nvs ", colnames(sim.param)[i]) )
}
}
To answer this question probably, we could use mutual information analysis wikipedia link, or even regression analysis, but we will not do it in this course.
#ABC Analysis And now, we are ready for the ABC itself. ABC is using a *rejection and regression step, where the most similar datasets to the observation are used. Here, similarity means small Euclidean distance.
#calculate Euclidean distance for two vectors
a <- c(1,2, 3)
b <- c(5,3,-1)
## what is their Euclidean Distance?
dist(rbind(a, b))
## a
## b 5.744563
# All defaults
include_graphics("images.png")
## set the borders of the parameters (based on how we ran simulations)
logit.matrix <- matrix(c(100, 1500, 0.001, 1), ncol=2, byrow=TRUE)
res <- abc(target=obs.stat, param=sim.param,sumstat=sim.stat, tol=1, transf=c("logit"),logit.bounds=logit.matrix, method="loclinear")
## Warning: All parameters are "logit" transformed.
Now, we will plot and see the results
The mutation rate theta
i <- 1
name.im=c("Population mutation rate (theta)", "Population size change")
plot(density(res$adj.values[,i]), type='l', xlim=c(logit.matrix[i,]), col='red', xlab=name.im[i], ylab="Density", main="")
points(density(sim.param[,i]), type='l', col='black')
points(density(res$unadj.values[,i]), type='l', col='blue')
legend("topright", legend=c("prior", "posterior"), col=c("blue", "red"), pch=19)
The population size change
name.im=c("Population mutation rate (theta)", "Population size change")
i <- 2
plot(density(res$adj.values[,i]), type='l', xlim=c(logit.matrix[i,]), col='red', xlab=name.im[i], ylab="Density", main="")
points(density(sim.param[,i]), type='l', col='black')
points(density(res$unadj.values[,i]), type='l', col='blue')
legend("topright", legend=c("prior", "posterior"), col=c("blue", "red"), pch=19)
Finally let's get the estimates
write.table(summary(res)[4,], file="", quote=F, row.names=T, col.names=F, sep="\t")
## Call:
## abc(target = obs.stat, param = sim.param, sumstat = sim.stat,
## tol = 1, method = "loclinear", transf = c("logit"), logit.bounds = logit.matrix)
## Data:
## abc.out$adj.values (2000 posterior samples)
## Weights:
## abc.out$weights
##
## p_theta p_pop_change_newpop
## Min.: 100.2167 0.0033
## Weighted 2.5 % Perc.: 145.5486 0.0365
## Weighted Median: 410.3401 0.1099
## Weighted Mean: 481.5786 0.1545
## Weighted Mode: 329.9293 0.0750
## Weighted 97.5 % Perc.: 1166.9829 0.6017
## Max.: 1497.7393 1.0000
## p_theta 481.578582481614
## p_pop_change_newpop 0.154483876252112