We want to test the dependence of \(F_{ST}\) in our highland/lowland study on allele frequency \(p\). Fortunately Sho has done a great job documenting the data. From the github repo we have the file filtered_SNPs.zip.

Sho’s documentation reveals

1st: chromosome
2nd: SNP ID (start with S1-10_: GBS, others: 55-k SNP)
3rd: allele 1/2 in Mexico pops (only allele 1 is shown if monomorphic)
4th: Position
5th: Sample size in Mex low (-9 if monomorphic)
6th: Sample size in Mex high
7th: Freq allele 1 in Mex low
8th: Freq allele 2 in Mex low
9th: Freq allele 1 in Mex high
10th: Freq allele 2 in Mex high
11th: Fst between Mex low and high
12th: Fst P-value under Model I
13th: Fst P-value under Model II
14th: allele 1/2 in SA pops (only allele 1 is shown if monomorphic)
15th: Sample size in SA low
16th: Sample size in SA high
17th: Freq allele 1 in SA low
18th: Freq allele 2 in SA low
19th: Freq allele 1 in SA high
20th: Freq allele 2 in SA high
21th: Fst between SA low and high
22th: Fst P-value under Model I
23th: Fst P-value under Model III

Let’s first cut out the relevant columns, and load the data with column names. We also remove all loci not polymorphic in both South America and Mexico using grep to remove rows with -9.

#set dir to github repo
setwd("/Users/jri/projects/hilo_paper/hiloSNP/")
xtemp=pipe(paste('cat chr*out | cut -f 1,2,4,7-11,17-21 | grep -v "\\-9" | sed -e s/chr//g')) # need double \\ so R ignores the \ which grep uses to escape the -
x<-read.table(xtemp,colClasses=c("numeric","character",rep("numeric",11)))                
colnames(x)=c("chr","snpid","position","p1mexlow","p2mexlow","p1mexhi","p2mexhi","fstmex","p1samlow","p2samlow","p1samhi","p2samhi","fstsam")

Now we calculate some allele frequencies and the average (as a poor estimate of the ancestral). We then calculate MAF for comparison to \(F_{ST}\)

#Mexico
pmexlow=x$p1mexlow/(x$p1mexlow+x$p2mexlow) #freq allele 1
pmexhi=x$p1mexhi/(x$p1mexhi+x$p2mexhi) #freq allele 1
pmex=sapply(1:length(pmexlow),function(i){sum(pmexlow[i],pmexhi[i])/2}) #average
mafmex=sapply(pmex,function(a){ min(a,1-a)})

#South America
psamlow=x$p1samlow/(x$p1samlow+x$p2samlow) 
psamhi=x$p1samhi/(x$p1samhi+x$p2samhi) 
psam=sapply(1:length(psamlow),function(i){sum(psamlow[i],psamhi[i])/2})
mafsam=sapply(psam,function(a){ min(a,1-a)})

#MAF
pancestral=sapply(1:length(pmexlow),function(i){sum(pmex[i],psam[i])/2})
mafcestral=sapply(pancestral,function(a){ min(a,1-a)})

Are any of these things related?

pairs(~x$fstsam+mafsam+x$fstmex+mafmex,col=rgb(0,0,0,.25),pch=16)

plot of chunk unnamed-chunk-3

cor(mafmex,mafsam)
## [1] 0.6513

Wow, so no super strong relation between MAF in SAm and Mex. This suggests we don’t likely have a false positive problem (high \(F_{ST}\) due only to high MAF) but could have a false negative problem (high \(F_{ST}\) SNP in one population has too low a MAF to be significant in the other). We divide SNPs into frequency bins to solve.

#top 1% by bin in Mexico
mexout<-vector()
for(i in 1:20){
  test<-subset(x,mafmex<=(i*0.05) & mafmex>((i-1)*0.05))  
  mexout<-c(mexout,test$snpid[which(test$fstmex>=quantile(test$fstmex,.99))])
}
mexout=mexout[order(mexout)]
#same in S. America
samout<-vector()
for(i in 1:20){
 test<-subset(x,mafsam<=(i*0.05) & mafsam>((i-1)*0.05))  
  samout<-c(samout,test$snpid[which(test$fstsam>=quantile(test$fstsam,.99))])
}
samout=samout[order(samout)]

So now let’s look at the intersect of those two lists.

length(samout)
## [1] 858
length(mexout)
## [1] 503
length(x$fstmex)
## [1] 48321
length(samout)*length(mexout)/length(x$fstmex) #expected
## [1] 8.931
length(intersect(mexout,samout)) #observed
## [1] 13

We thus expect 9 SNPs to be shared, but we see 13. Not anything super exciting. If we drop to 5% tail we expect 133 and we see 171. While this is not quite the same as redoing the sims, it certainly suggests that we’re not missing much by using the \(F_{ST}\) approach we’ve used, and like in the paper we get better results just using top overall SNPs:

allfstsam=subset(x$snpid,x$fstsam>quantile(x$fstsam,0.99))
allfstmex=subset(x$snpid,x$fstmex>quantile(x$fstmex,0.99))
length(intersect(allfstmex,allfstsam))
## [1] 24
length(allfstsam)
## [1] 480
length(allfstmex)
## [1] 484
length(allfstmex)*length(allfstsam)/length(x$fstsam)
## [1] 4.808

Finally, let’s take a look at GU’s, or Sho’s 10kb windows. We’ll use another proxy as we don’t have in this file gene ID, but we do have chromosome and position. So we just ask how many of these new outlier SNPs have an outlier SNP in the other population within 10kb; this should be more permissive than Sho’s criteria.

dist<-vector()
for(i in mexout){
  dist<-c(dist,subset(x$snpid,x$chr==x$chr[which(x$snpid==i)] & x$position<(x$position[which(x$snpid==i)]+10000) & x$position>(x$position[which(x$snpid==i)]-10000) ))
}
dist=unique(dist[order(dist)])
obsGU=length(intersect(dist,samout))

Only 36. Not much to write home about. Just to be sure, we run a permutation test:

GU<-vector()
for(i in 1:20){
  print(paste("iteration",i))
  mysam=sample(x$snpid,length(mexout))
  dist<-vector()
  for(j in mysam){
    dist<-c(dist,subset(x$snpid,x$chr==x$chr[which(x$snpid==j)] & x$position<(x$position[which(x$snpid==j)]+10000) & x$position>(x$position[which(x$snpid==j)]-10000) ))
  }
  dist=unique(dist[order(dist)])
  GU[i]=length(intersect(dist,samout))
}
## [1] "iteration 1"
## [1] "iteration 2"
## [1] "iteration 3"
## [1] "iteration 4"
## [1] "iteration 5"
## [1] "iteration 6"
## [1] "iteration 7"
## [1] "iteration 8"
## [1] "iteration 9"
## [1] "iteration 10"
## [1] "iteration 11"
## [1] "iteration 12"
## [1] "iteration 13"
## [1] "iteration 14"
## [1] "iteration 15"
## [1] "iteration 16"
## [1] "iteration 17"
## [1] "iteration 18"
## [1] "iteration 19"
## [1] "iteration 20"
length(subset(GU,GU>obsGU))
## [1] 5

We already see in 20 (slow) permutations that there are samples with more shared genes with SAm than Mex. End of the line, I don’t think we learn anything different with allele frequency bins.