To run this, you will need the R packages gsl
, cowplot
, tidyverse
, and viridis
installed, as well as the software ms installed. The original R markdown code can be found on github.
Here we introduce a simple model to develop an intuition about the likelihood of detecting a recent selective sweep resulting from a new beneficial mutation using reduced representation data. We make a number of simplifying assumptions, including that the population has experienced no recent demographic change, that both sequence tags and recombination occur uniformly along the genome, and that selection is on a novel beneficial mutation with additive effect that has recently swept to fixation. We also ignore correction for multiple testing.
This model is an update and correction to the one presented in Box 1 of (Tiffin and Ross-Ibarra 2014).
Set key parameters: sample size \(n\), recombination rate \(r\), mutation rate \(\mu\), effective population size \(N_e\), population mutation rate \(\theta=4N_e\mu\).
Here we assume a large sample size of \(n=100\), a recombination rate that varies from 0.1 to 10 cM/Mb (\(c=10^{-9}\) to \(10^{-7}\) per bp), a mutation rate of \(\mu=10^{-8}\) per bp, and an effective population size of \(N_e=150,000\). This gives a population mutation rate of \(\theta=0.006\) per bp, or \(\theta=1.8\) per paired-end 150bp sequence (300bp total). We also assume there have been 10 recent sweeps across the genome, and that these are far enough apart to be completely unlinked.
## Sample size
n=100
## Recombination
# centimorgans per megabase (cM/Mb)
cm=c(0.1,1,10)
# recombination per bp
c=(cm/100)/1E6
## Diversity
# mutation rate per bp
mu=1E-8
# effective population size Ne
Ne=450000
# population mutation rate per PE tag: theta = 4*Ne*mu*150*2 assuming a paired-end 150bp tag
theta=4*Ne*mu*300
## Sweeps
sweeps=10 #assume 10 sweeps
We then vary the density of sequencing tags per bp \(\delta\) and the selection coefficient \(s\) (and thus the population selection parameter of \(\alpha=2N_es\)). We assume selected mutations act additively.
## Sequence tags per bp.
## 50,000 tags in a 1Gb genome corresponds to 50 tags per Mb or a tag density of 50*10^-6
delta=c(1,2,5,10,20,50,100,200,500)*10^-6
## Selection
# selection coefficient s
s=1*10^((-10:-40)/10)
## Create data frame of values
dat1=expand.grid(s,delta,c)
colnames(dat1)=c("s","tag_density","c")
dat1<-mutate(dat1,alpha=2*Ne*s)
We model the reduction in heterozygosity at genetic distance \(r\) (the product of the recombination per base pair \(c\) and the bp distance) from a favorable allele that has recently swept to fixation. Let \(\psi=2\frac{r}{s}\), and \(\Gamma(.,.)\) is the incomplete gamma function. Then, following eq. 3 from (Kim and Stephan 2000) the reduction of diversity \(h\) at a distance \(r\) from the sweep is \(h=\psi\alpha^{-\psi} \Gamma(-\psi,\frac{1}{\alpha})\)
h<-function(r,alpha,Ne)
{
s=alpha/(2*Ne)
2*r/s*alpha^(-2*r/s)*gamma_inc(-2*r/s,1/alpha)
}
Here we use 10,000 coalescent simulations to get the neutral distribution of nucleotide diversity \(\pi\) across the genome, using the software ms
(Hudson 2002). (Note the path to ms
will need to be changed depending on installation.)
pi_<-as.numeric(system(paste('~/src/msdir/ms ', n, '10000 -t', theta, '| ~/src/msdir/sample_stats | cut -f 2'),intern=T))
qpi=quantile(pi_,0.05) # sample_stats column 2 is pi, this returns 5% quantile
This function finds the distance in bp from a sweep over which nucleotide diversity \(\pi\) is below the 5% tail of the genome-wide distribution. We use eq. 3 from (Kim and Stephan 2000) to estimate the reduction in diversity \(h\) then compare to the 5% tail from coalescent simulations.
d=1:15000*150 #bp distance vector
get_length<-function(d,c,alpha1,Ne,n,theta){
r2=d*c #genetic distance from sweep
div_pi=theta*h(r2,alpha1,Ne) # pi at distance from sweep assuming sweep finished yesterday (tau==0), modified from Kim and Stephan 2000
return(min(which(div_pi>qpi)))
}
Here we calculate the expected number of segregating sites \(S\) per tag, following Watterson (Watterson 1975) for comparison to graphs in the original paper.
S=theta*sum(1/(1:(n-1))) #From watterson's estimator
For each combination of tag density, recombination, and selection, find the distance \(D\) from the center of the sweep that has diversity \(\pi\) more than \(X\) standard deviations below the genome-wide mean.
min_rank=sapply(1:nrow(dat1), function(i) get_length(d,dat1[i,3],dat1[i,4],Ne,n,theta)) #which window of d has pi below just above 0.05 quantile from coalescent sims
D=d[min_rank]
Assuming a uniform distribution of tags, the probability of having \(m\geq1\) sequence tags within the distance \(D\) calculated above on either side of the sweep is Poisson with probability \(P(m\geq 1)=1-e^{-\delta*2*D}\)
p2sd=1-exp(-1*dat1$tag_density*2*D) #prob of distance to first tag being <= distance 2sd
If sweeps are independent, than we can calculate the expected number of sweeps detected simply as the product of the number of sweeps and the probability of detecting a single sweep.
pp=p2sd*sweeps #prob times sample size = expectation of binomial
dat2=data.frame(dat1$tag_density,dat1$s,pp,dat1$alpha,dat1$c)
#combine with param values
dat2=mutate(dat2,alpha=dat1.alpha,tag_density=dat1.tag_density,s=dat1.s,c=dat1.c)
The expected number of sweeps with a sequence tag sufficiently close to identify a reduction of diversity beyond genome-wide expectations.
#make labels
tag_labs<-c(1,5,20,100,500)
snp_labs=paste("(",unlist(round(tag_labs*S/1000,3)),")",sep="") # number of SNPs per Kb = tags/Mb * snps/tag * 1/1000
labs=paste(unlist(tag_labs),"\n",unlist(snp_labs))
facetlabs <- c("1e-09" = "0.1 cM/Mb", "1e-08" = "1 cM/Mb","1e-07" = "10 cM/Mb")
#plot
graph<-ggplot(data=dat2,aes(as.factor(tag_density),as.factor(s)))+
geom_tile(aes(fill=pp))+
scale_fill_viridis( breaks=c(0,2,4,6,8,10),limits=c(0,10))+
scale_y_discrete(labels=c(0.0001,0.001,0.01,0.1),breaks=c(0.0001,0.001,0.01,0.1)) +
labs(x = "sequence tags per Mb (SNPs per Kb)",y = "selection coefficient s", fill = "sweeps\ndetected\n") +
scale_x_discrete(labels=labs,breaks=tag_labs/1E6) +
facet_wrap(~c,labeller=labeller(c = facetlabs)) +
theme(legend.text = element_text(size=15),legend.key.size=unit(1,"cm"),axis.text = element_text(size=6),axis.title=element_text(size=15),strip.text.x = element_text(size = 15),strip.background = element_blank())
graph
Hudson, Richard R. 2002. “Generating Samples Under a Wright–Fisher Neutral Model of Genetic Variation.” Bioinformatics 18 (2). Oxford Univ Press: 337–38.
Kim, Yuseob, and Wolfgang Stephan. 2000. “Joint Effects of Genetic Hitchhiking and Background Selection on Neutral Variation.” Genetics 155 (3): 1415–27.
Tiffin, Peter, and Jeffrey Ross-Ibarra. 2014. “Advances and Limits of Using Population Genetics to Understand Local Adaptation.” Trends in Ecology & Evolution 29 (12). Elsevier: 673–80.
Watterson, GA. 1975. “On the Number of Segregating Sites in Genetical Models Without Recombination.” Theoretical Population Biology 7 (2): 256–76.