This exercise revolves around obtaining site frequency spectra from two different populations and see if we can find any signs of selection.
The input data is simulated bcf files and represents the chr20 in a European population. The simulation has been done using an error rate of 0.005.
Site frequency spectrum and VCF files
setwd("/home/wpsg/workshop_materials/a07_angsd/")
sherlock <-read.table("sherlock.readlength")
head(sherlock)
## V1 V2
## 1 284 105
## 2 268 109
## 3 430 110
## 4 553 111
## 5 962 112
## 6 692 113
holmes <-read.table("holmes.readlength")
head(holmes)
## V1 V2
## 1 3870 36
## 2 2897 37
## 3 3581 38
## 4 4265 39
## 5 5061 40
## 6 5448 41
par(mfrow=c(1,2))
plot(sherlock[,2],sherlock[,1],main="Sherlock",type='l',col=1,lwd=2)
plot(holmes[,2],holmes[,1],main="Holmes",type='l',col=2,lwd=2)
Sherlock - Modern 275 / 698751 = 0.0004 in library ’ ’
holmes - Acient 66286 / 297714 = 0.2226 in library ’ ’
#setwd("results_holmes.sorted.rmdup/Fragmisincorporation_plot.pdf")
#setwd("results_sherlock.sorted.rmdup/Length_plot.pdf")
p1 <- scan("POP1.sfs")
p1
## [1] 1.996576e+07 8.716037e+03 4.295267e+03 2.455837e+03 1.802022e+03
## [6] 1.268474e+03 1.083288e+03 8.343868e+02 7.995739e+02 6.654866e+02
## [11] 5.682040e+02 5.296620e+02 4.705308e+02 4.529351e+02 3.891028e+02
## [16] 4.580071e+02 3.530829e+02 3.468258e+02 3.184906e+02 2.700443e+02
## [21] 2.745658e+02 3.160453e+02 3.236641e+02 2.763849e+02 2.982774e+02
## [26] 3.357749e+02 3.039941e+02 2.991713e+02 3.450350e+02 3.940706e+02
## [31] 4.104080e+02 3.764334e+02 4.461091e+02 4.792181e+02 4.903710e+02
## [36] 4.670501e+02 5.402658e+02 6.010749e+02 5.877834e+02 5.956195e+02
## [41] 2.499699e+00
p2 <- scan("POP2.sfs")
p2
## [1] 1.995645e+07 1.026185e+04 5.313446e+03 3.409583e+03 2.607706e+03
## [6] 2.057982e+03 1.653452e+03 1.397400e+03 1.348372e+03 1.102803e+03
## [11] 9.907821e+02 9.310202e+02 8.954108e+02 7.658842e+02 6.877272e+02
## [16] 7.547342e+02 6.248718e+02 5.049871e+02 5.714480e+02 5.135836e+02
## [21] 4.638702e+02 4.619188e+02 4.407857e+02 4.254536e+02 4.329398e+02
## [26] 3.897811e+02 3.904429e+02 3.684628e+02 3.312423e+02 3.056305e+02
## [31] 3.392609e+02 3.352954e+02 3.466352e+02 3.455245e+02 2.540215e+02
## [36] 3.238312e+02 2.933949e+02 3.122009e+02 3.111675e+02 2.777854e+02
## [41] 4.148263e+00
barplot(p1)
barplot(p2)
barplot(p1[-1])
barplot(p2[-1])
Neutral selection
Why are we discarding the first bin? Should we discard other bins? Se remuve el primero por es (-1)
sum(p1[-1])
## [1] 34241.07
sum(p2[-1])
## [1] 43546.84
sum(p1[-1])/sum(p1)
## [1] 0.001712054
sum(p2[-1])/sum(p2)
## [1] 0.002177342
\({THETASTAT} do_stat POP1.thetas.idx -win 100000 -step 10000 -outnames POP1 /home/wpsg/software/angsd/misc/thetaStat do_stat POP1.thetas.idx -win 100000 -step 10000 -outnames POP1 Assuming binfile:POP1.thetas.gz and indexfile:POP1.thetas.idx Information from index file: 0 chr20 20000001 8 40 -r=(null) outnames=POP1 step: 10000 win: 100000 pc.chr=chr20 pc.nSites=20000001 firstpos=20000000 lastpos=40000000 Dumping file: "POP1.pestPG" (base) wpsg@ip-172-31-82-83:[~/workshop_materials/a07_angsd]\) ${THETASTAT} do_stat POP2.thetas.idx -win 100000 -step 10000 -outnames POP2 /home/wpsg/software/angsd/misc/thetaStat do_stat POP2.thetas.idx -win 100000 -step 10000 -outnames POP2 Assuming binfile:POP2.thetas.gz and indexfile:POP2.thetas.idx Information from index file: 0 chr20 20000001 8 40 -r=(null) outnames=POP2 step: 10000 win: 100000 pc.chr=chr20 pc.nSites=20000001 firstpos=20000000 lastpos=40000000 Dumping file: “POP2.pestPG”
p1<-read.table("POP1.pestPG",header=F)
colnames(p1)<-c("Index","Chr","WinCenter","tW","tP","tF","tH","tL","Tajima","fuf","fud","fayh","zeng","nSites")
p2<-read.table("POP2.pestPG",header=F)
colnames(p2)<-c("Index","Chr","WinCenter","tW","tP","tF","tH","tL","Tajima","fuf","fud","fayh","zeng","nSites")
plot(p1$WinCenter,p1$Tajima)
plot(p2$WinCenter,p2$Tajima)
par(mfrow=c(1,2))
#or on same plot
plot(p2$WinCenter/1e6,p2$Tajima,col='blue',lwd=2,type='l',ylim=range(c(p1$Tajima,p2$Tajima)),xlab="Position in MB",ylab="Tajimas D")
lines(p1$WinCenter/1e6,p1$Tajima,col='red',lwd=1)
#legend("bottomright",c("POP1","POP2"),fill=c("red","blue"))
PO1 –> represents the positive selection PO2 –> represents the neutral selection
From wikipedia: Tajima’s D is a population genetic test statistic created by and named after the Japanese researcher Fumio Tajima.[1] Tajima’s D is computed as the difference between two measures of genetic diversity: the mean number of pairwise differences and the number of segregating sites, each scaled so that they are expected to be the same in a neutrally evolving population of constant size.
The purpose of Tajima’s D test is to distinguish between a DNA sequence evolving randomly (“neutrally”) and one evolving under a non-random process, including directional selection or balancing selection, demographic expansion or contraction, genetic hitchhiking, or introgression. A randomly evolving DNA sequence contains mutations with no effect on the fitness and survival of an organism. The randomly evolving mutations are called “neutral”, while mutations under selection are “non-neutral”.
##run in R
#plot the results
nnorm <- function(x) x/sum(x)
#expected number of sites with 1:20 derived alleles
res <- rbind(
YRI=scan("yri.sfs")[-1],
JPI=scan("jpt.sfs")[-1],
CEU=scan("ceu.sfs")[-1]
)
colnames(res) <- 1:20
#density instead of expected counts
res <- t(apply(res,1,nnorm))
#plot the non-ancestral sites
barplot(res,beside=T,legend=c("YRI","JPT","CEU"),names=1:20,main="realSFS non ancestral sites")
#plot the polymorphic sites.
resPoly <- t(apply(res[,-20],1,nnorm))
barplot(resPoly,beside=T,legend=c("YRI","JPT","CEU"),names=1:19,main="realSFS polymorphic sites")
#due the very limited amount of sites
#downsample to 5 individuals (10 chromosome) and exclude fixed derived
downsampleSFS <- function(x,chr){ #x 1:2n , chr < 2n
n<-length(x)
mat <- sapply(1:chr,function(i) choose(1:n,i)*choose(n- (1:n),chr-i)/choose(n,chr))
nnorm( as.vector(t(mat) %*% x)[-chr] )
}
resDown <- t(apply(res,1,downsampleSFS,chr=10))
barplot(resDown,beside=T,legend=c("YRI","JPT","CEU"),names=1:9,main="realSFS downsampled polymorphic sites")
###
##run in R
## read sfs
yri<-scan("yri.sfs");
jpt<-scan("jpt.sfs");
ceu<-scan("ceu.sfs");
x<-ceu #change this one to try one of the other populations
x
## [1] 1.586598e+06 1.148677e+03 6.157464e+02 4.304823e+02 3.166643e+02
## [6] 2.419848e+02 3.855879e+02 2.532106e+02 2.814779e+02 1.927430e+02
## [11] 2.376304e+02 1.492029e+02 2.609984e+02 1.845124e+02 2.123801e+02
## [16] 1.158865e+02 9.259363e+01 1.050149e+02 1.318167e+02 1.500232e+02
## [21] 2.380234e+04
nSites<-sum(x) #Number of sites where we have data
nSeg<-sum(x[c(-1,-21)]) #Number of segregating sites
an <- function(n) sum(1/1:(n-1))
thetaW <- nSeg/an(20) # Wattersons Theta
thetaW / 2.5e-8 / nSites / 4 # effective population size
## [1] 9605.458
#
y<-jpt #change this one to try one of the other populations
y
## [1] 1.581476e+06 1.588907e+03 3.721260e+02 4.856632e+02 3.572770e+02
## [6] 2.537100e+02 2.280001e+02 3.716677e+02 1.118789e+02 2.138884e+02
## [11] 1.844605e+02 1.628813e+02 1.209103e+02 2.518323e+02 1.230493e+02
## [16] 1.563629e+02 1.555130e+02 2.050896e+02 6.918117e+01 1.701935e+02
## [21] 2.379962e+04
nSites<-sum(y) #Number of sites where we have data
nSeg<-sum(y[c(-1,-21)]) #Number of segregating sites
an <- function(n) sum(1/1:(n-1))
thetaW <- nSeg/an(20) # Wattersons Theta
thetaW / 2.5e-8 / nSites / 4 # effective population size
## [1] 9768.478
z<-yri #change this one to try one of the other populations
z
## [1] 1.576113e+06 2.616637e+03 1.172053e+03 7.363980e+02 4.747166e+02
## [6] 4.433358e+02 2.920960e+02 2.834904e+02 1.668485e+02 1.845609e+02
## [11] 2.117707e+02 1.876814e+02 8.068523e+01 1.440868e+02 1.553383e+02
## [16] 1.698846e+02 1.825336e+02 1.407843e+02 1.760556e+02 2.253410e+02
## [21] 2.346083e+04
nSites<-sum(z) #Number of sites where we have data
nSeg<-sum(z[c(-1,-21)]) #Number of segregating sites
an <- function(n) sum(1/1:(n-1))
thetaW <- nSeg/an(20) # Wattersons Theta
thetaW / 2.5e-8 / nSites / 4 # effective population size
## [1] 14104.36
The above example is for the African population. Try to run it for all three populations.
Which has the largest population size? CEU [1] 9737.956
Which has the largest variability (fraction of polymorphic/segregating sites)?
In order to estimate Fst between two populations we will need to estimate the 2-dimensional frequency spectrum from the site allele frequency likelihoods
setwd("/home/wpsg/workshop_materials/a07_angsd")
##run in R
yc<-scan("yri.ceu.ml")
yj<-scan("yri.jpt.ml")
jc<-scan("jpt.ceu.ml")
source("plot2dSFS.R")
plot2<-function(s,...){
dim(s)<-c(21,21)
s[1]<-NA
s[21,21]<-NA
s<-s/sum(s,na.rm=T)
pal <- color.palette(c("darkgreen","#00A600FF","yellow","#E9BD3AFF","orange","red4","darkred","black"), space="rgb")
pplot(s/sum(s,na.rm=T),pal=pal,...)
}
comp1<-plot2(yc,ylab="YRI",xlab="CEU")
comp1
#x11()
comp2<-plot2(yj,ylab="YRI",xlab="JPT")
comp2
#x11()
comp3<-plot2(jc,ylab="JPT",xlab="CEU")
comp3
##run in R
r<-read.delim("slidingwindowBackground",as.is=T,head=T)
names(r)[-c(1:4)] <- c("wFst_YRI_JPT","wFst_YRI_CEU","wFst_JPT_CEU","PBS_YRI","PBS_JPT","PBS_CEU")
head(r) #print the results to the screen
## region chr midPos Nsites
## 1 (92,49440)(15000000,15049999)(15000000,15050000) 1 15025000 49350
## 2 (9978,59403)(15010000,15059999)(15010000,15060000) 1 15035000 49427
## 3 (19953,69393)(15020000,15069999)(15020000,15070000) 1 15045000 49442
## 4 (29940,79367)(15030000,15079999)(15030000,15080000) 1 15055000 49429
## 5 (39626,89350)(15040000,15089999)(15040000,15090000) 1 15065000 49726
## 6 (49441,99278)(15050000,15099999)(15050000,15100000) 1 15075000 49839
## wFst_YRI_JPT wFst_YRI_CEU wFst_JPT_CEU PBS_YRI PBS_JPT PBS_CEU
## 1 0.087497 0.165127 0.027889 0.121877 -0.030313 0.058598
## 2 0.104206 0.176035 0.025780 0.138777 -0.028732 0.054850
## 3 0.105805 0.171058 0.018335 0.140466 -0.028635 0.047139
## 4 0.111492 0.172312 0.013218 0.147012 -0.028801 0.042107
## 5 0.114755 0.139088 0.016130 0.127696 -0.005805 0.022067
## 6 0.104556 0.100095 0.027878 0.093814 0.016622 0.011652
#plot the distribution of Fst
mmax<-max(c(r$wFst_YRI_JPT,r$wFst_YRI_CEU,r$wFst_JPT_CEU),na.rm=T)
par(mfcol=c(3,2))
hist(r$wFst_YRI_JPT,col="lavender",xlim=c(0,mmax),br=20)
hist(r$wFst_YRI_CEU,col="mistyrose",xlim=c(0,mmax),br=20)
hist(r$wFst_JPT_CEU,col="hotpink",xlim=c(0,mmax),br=20)
mmax<-max(c(r$PBS_CEU,r$PBS_YRI,r$PBS_JPT),na.rm=T)
#plot the distribution of PBS
mmax<-max(c(r$PBS_CEU,r$PBS_YRI,r$PBS_JPT),na.rm=T)
hist(r$PBS_YRI,col="lavender",xlim=c(0,mmax),br=20)
hist(r$PBS_CEU,col="mistyrose",xlim=c(0,mmax),br=20)
hist(r$PBS_JPT,col="hotpink",xlim=c(0,mmax),br=20)
Note the maximum observed values for both the pairwise fst and the PBS.
Let’s do the same for not so randomly selection 1Mb region of on chr 5.
Site frequency spectrum One of the most important aspect of data analysis for population genetics is the estimate of the Site Frequency Spectrum (SFS). SFS records the proportions of sites at different allele frequencies. It can be folded or unfolded, and the latter case implies the use of an outgroup species to define the ancestral state. SFS is informative on the demography of the population or on selective events (when estimated at a local scale).
Population genetic differentiation Here we are going to calculate allele frequency differentiation using the PBS (population branch statistic) metric. Again, we can achieve this by avoid genotype calling using ANGSD. From the sample allele frequencies likelihoods (.saf files) we can estimate PBS using the following pipeline.
#run in R
r<-read.delim("slidingwindowChr5",as.is=T,head=T)
names(r)[-c(1:4)] <- c("wFst_YRI_JPT","wFst_YRI_CEU","wFst_JPT_CEU","PBS_YRI","PBS_JPT","PBS_CEU")
par(mfrow=1:2)
plot(r$midPos,r$wFst_YRI_CEU,ylim=c(0,max(r$wFst_YRI_CEU)),type="b",pch=18,ylab="Fst",xlab="position on Chr 5")
points(r$midPos,r$wFst_YRI_JPT,col=2,type="b",pch=18)
points(r$midPos,r$wFst_JPT_CEU,col=3,type="b",pch=18)
legend("topleft",fill=1:3,c("YRI vs. CEU","YRI vs. JPT","JPT vs CEU"))
plot(r$midPos,r$PBS_YRI,ylim=c(0,max(r$PBS_CEU)),type="b",pch=18,ylab="PBS",xlab="position on Chr 5")
points(r$midPos,r$PBS_JPT,col=2,type="b",pch=18)
points(r$midPos,r$PBS_CEU,col=3,type="b",pch=18)
legend("topleft",fill=1:3,c("YRI","JPT","CEU"))
FST is directly related to the variance in allele frequency among populations and, conversely, to the degree of resemblance among individuals within populations. If FST is small, it means that the allele frequencies within each population are similar; if it is large, it means that the allele frequencies are different.
If natural selection favours one allele over others at a particular locus in some populations, the FST at that locus will be larger than at loci in which among-population differences are purely a result of genetic drift.
Why is there two peak for the Fst and only one for the PBS?
(Yi et al. 2010) developed the Population Branch Statistic (PBS), a summary statistic that utilizes pairwise FST values among three populations to quantify sequence differentiation along each branch of their corresponding three-population tree.
Genes with large PBS values on one branch represent loci that underwent population-specific sequence differentiation consistent with positive selection (Yi et al. 2010).
CEU is under positive selection
PBS has been applied to corroborate previously established targets of selection, including genes associated with skin pigmentation (Lamason et al. 2005) and dietary fat sources (Mathias et al. 2012), as well as to identify novel candidates for high-altitude adaptation in Tibetans (Yi et al. 2010).
In which of the populations are this loci under selection?
CEU
##run in R
#plot the results
nnorm <- function(x) x/sum(x)
getSFS<-function(x) table(factor(rowSums(read.table(x)[,-c(1:2)]),levels=1:20))
res <- rbind(
YRI=getSFS("yri.geno.gz"),
JPI=getSFS("jpt.geno.gz"),
CEU=getSFS("ceu.geno.gz")
)
colnames(res) <- 1:20
# density instead of expected counts
res <- t(apply(res,1,nnorm))
#plot the none ancestral sites
barplot(res,beside=T,legend=c("YRI","JPT","CEU"),names=1:20,main="SFS from called genotypes")
#plot the polymorphic sites.
resPoly <- t(apply(res[,-20],1,nnorm))
barplot(resPoly,beside=T,legend=c("YRI","JPT","CEU"),names=1:19,main="SFS from call\
ed genotypes")
#downsample to 5 individuals (10 chromosome) and exclude fixed derived
downsampleSFS <- function(x,chr){ #x 1:2n , chr < 2n
n<-length(x)
mat <- sapply(1:chr,function(i) choose(1:n,i)*choose(n- (1:n),chr-i)/choose(n,chr))
nnorm( as.vector(t(mat) %*% x)[-chr] )
}
resDown <- t(apply(res,1,downsampleSFS,chr=10))
barplot(resDown,beside=T,legend=c("YRI","JPT","CEU"),names=1:9)