Outlier

BAYESCAN

BayeScan http://cmpg.unibe.ch/software/BayeScan/ is a command-line program that aims to identify putative candidate loci under natural selection from genetic data, using differences in allele frequencies among specified groups. You can set the groups by using your sampling locations or the genetic units when investigating population structure.

BayeScan is based on the Multinomial-Dirichlet model.

This program can define three categories of putative candidate loci:

under diversifying selection
under balancing selection
under neutrality

For each locus, BayeScan calculates a posterior probability (Posterior odds) - available through the parameter pr_odds - for the model including selection. These posterior probabilities indicate how much more likely the model with selection is compared to the neutral model. For instance, a pr_odds of 10 means that there is a 1 in 10 probability for a marker to be under selection. This number would be too high when considering a dataset with up to 10,000 markers.

In the context of multiple testing with a large number of markers (up to 10,000), run BAYESCAN with appropriate parameters as recommended in Whitlock and Lotterhos (2015) https://www.jstor.org/stable/10.1086/682949?seq=1.

To do so, you should consider the number of loci in your dataset. You can also consult the BayeScan exercise https://evomics.org/learning/population-and-speciation-genomics/2016-population-and-speciation-genomics/bayescan-exercise/ to learn more about how to interpret BayeScan files and outputs.

0. Download libraries

library(vcfR)
library(hierfstat)
library(adegenet)
library(ggplot2)
library(radiator)

1. Preparing GESTE (.geste) or BayeScan (.bsc) file

Most of the Next Generation Sequencing (NGS) projects generate a VCF (.vcf) or a PLINK file (.tped and .tfam) after aligning the sequences in STACKS http://catchenlab.life.illinois.edu/stacks/.

The first step is to prepare files in an appropriate .geste or .bsc format for BAYESCAN. Convert the .vcf file to a .geste or .bsc dataset with the function genomic_converter available in the elegant radiatorpackage in R. See Thierry Gosselin github page https://github.com/thierrygosselin for more details.

lobster <- genomic_converter(
  data = "00-Data/5136snps_717ind.vcf", strata = "00-Data/population_map_sea_cucumber.txt",
  output = c("bayescan"), filename="5136snps_717ind.geste")

If you cannot install radiator or you encounter a problem, you can use the vcfR library instead and convert your file using R.

vcf <- read.vcfR("00-Data/5136snps_717ind.vcf")

## Scanning file to determine attributes.
## File attributes:
##   meta lines: 9
##   header_line: 10
##   variant count: 5136
##   column count: 726
## 
Meta line 9 read in.
## All meta lines processed.
## gt matrix initialized.
## Character matrix gt created.
##   Character matrix gt rows: 5136
##   Character matrix gt cols: 726
##   skip: 0
##   nrows: 5136
##   row_num: 0
## 
Processed variant 1000
Processed variant 2000
Processed variant 3000
Processed variant 4000
Processed variant 5000
Processed variant: 5136
## All variants processed

Download the population map associated to your dataset.

pop_map <- read.table("00-Data/population_map_sea_cucumber.txt", header=TRUE, stringsAsFactors = TRUE)

Transform vcf to genind format.

genind <- vcfR2genind(vcf)

Add group information (here our groups are the sampling sites)

genind@pop <- pop_map$STRATA

Transform genind to hierfstat.

hierfstat <- genind2hierfstat(genind)

Write BayeScan file using the hierfstat library.

write.bayescan(hierfstat)

2. Run BayeScan in a terminal

BayeScan uses its own input file formats that have the extension .geste. To transform the vcf file (or other formats such as genepop and structure) into a .geste file, you can easily use PGDSpider2 http://www.cmpg.unibe.ch/software/PGDSpider/

To run BayeScan in bash, use the following command:

./BayeScan2.1_macos64bits -n 5000 -burn 50000 -pr_odds 10000

By running BayeScan, you should get three output files:

Verif.txt: Use this file to verify that BayeScan has read the input files properly
.sel: Record the parameters estimated by the model
_fst.txt: Use to identify outliers

**You will use the _fst.txt** file for the following analysis.

3. Use R to identify outliers from BayeScan analyses

Open the **BayeScan output file with the "_fst.txt" extension**.

bayescan=read.table("00-Data/output_fst.txt")

The first column of the bayescan dataframe is the SNP ID. The next three columns (prob, log10(P0), and qval) are related to the test of local adaptation considering the logarithm of the posterior odds - log10(PO) - and the q-value for the model with selection. The fifth column gives the size of the locus-specific effect (alpha parameter). The last one provides the locus-specific FST averaged over all populations.

Download the list of SNPs in the right order. The .geste format has SNPs in the same order as the vcf used to produce the .geste format. Therefore, you can use this command in bash to extract the third column containing the ID info of each SNPs in your vcf.

 grep -v "#" 5136snps_717ind.vcf | cut -f 3 > id-5136snps.txt

Then import the SNPs information.

SNPb=read.table("00-Data/id-5136snps.txt",header=FALSE)

Merge the names of the outliers with the results from the bayescan dataframe.

bayescan=cbind(SNPb, bayescan)

Rename columns.

colnames(bayescan)=c("SNP","PROB","LOG_PO","Q_VALUE","ALPHA","FST")

Write the results.

write.table(bayescan, "sea-cumcumber-bayescan-results.txt", quote=FALSE, sep="\t", row.names=FALSE)

Change the value of the Q_VALUE column: 0 == 0.0001.

attach(bayescan)
class(bayescan$Q_VALUE)

## [1] "numeric"

bayescan$Q_VALUE <- as.numeric(bayescan$Q_VALUE) 
bayescan[bayescan$Q_VALUE<=0.0001,"Q_VALUE"]=0.0001

Round the values.

bayescan$LOG_PO <- (round(bayescan$LOG_PO, 4)) 
bayescan$Q_VALUE <- (round(bayescan$Q_VALUE, 4)) 
bayescan$ALPHA <- (round(bayescan$ALPHA, 4)) 
bayescan$FST <- (round(bayescan$FST, 6))

Add a column for the type of selection grouping based on a Q-VALUE < 0.05. You can also choose a Q-VALUE < 0.01 if you want to be more conservative.

bayescan$SELECTION <- ifelse(bayescan$ALPHA>=0&bayescan$Q_VALUE<=0.05,"diversifying",ifelse(bayescan$ALPHA>=0&bayescan$Q_VALUE>0.05,"neutral","balancing")) 
bayescan$SELECTION<- factor(bayescan$SELECTION)
levels(bayescan$SELECTION)

## [1] "balancing"    "diversifying" "neutral"

Save the results of the SNPs potentially under positive (divergent) and balancing selection (qvalue < 0.05).

positive <- bayescan[bayescan$SELECTION=="diversifying",] 
neutral <- bayescan[bayescan$SELECTION=="neutral",] 
balancing <- bayescan[bayescan$SELECTION=="balancing",]

Check the number of SNPs belonging to each category.

xtabs(data=bayescan, ~SELECTION)

## SELECTION
##    balancing diversifying      neutral 
##          872           78         4186

Write the results of the SNPs potentially under selection (qvalue < 0.05).

write.table(neutral, "neutral.txt", row.names=F, quote=F)
write.table(balancing, "balancing.txt", row.names=F, quote=F) 
write.table(positive, "positive.txt", row.names=F, quote=F)

Transformation Log of the Q value in order to create the ggplot graph.

range(bayescan$Q_VALUE)

## [1] 0.0001 0.9847

bayescan$LOG10_Q <- -log10(bayescan$Q_VALUE)

4. Use ggplot to create a nice graph

Create a title for the ggplot graph.

x_title="Log(q-value)" 
y_title="Fst"

Make the ggplot graph.

ggplot(bayescan,aes(x=LOG10_Q,y=FST)) +
  geom_point(aes(fill=SELECTION), pch=21, size=2)+ 
  scale_fill_manual(name="Selection",values=c("white","red","orange"))+ 
  labs(x=x_title)+ 
  labs(y=y_title)+   
  theme_classic()

Save the file in a pdf format.

ggsave("bayescan_seacumcumber.pdf", dpi=600, width=5, height=5) 
dev.off()

## null device 
##           1

5. Use the function available in BayeScan

You can also simply use the function plot_R.r already available in BayeScan. First load the function in R.

source("00-Data/plot_R.r")

Make a nice graph using this plot_bayescan function.

plot_bayescan("00-Data/output_fst.txt")

## $outliers
##  [1]    6  231  294  384  385  446  451  561  562  854 1017 1220 1221 1222 1223
## [16] 1239 1240 1241 1242 1331 1333 1439 1488 1551 1571 1763 1773 1774 1863 2019
## [31] 2051 2120 2252 2349 2350 2419 2421 2422 2631 2747 2833 2835 2836 2837 2858
## [46] 3020 3021 3034 3046 3204 3316 3326 3327 3335 3825 3852 3890 3901 3903 3907
## [61] 3945 3962 4033 4034 4035 4036 4109 4154 4155 4272 4386 4387 4388 4404 5026
## [76] 5027 5068 5069
## 
## $nb_outliers
## [1] 78

PCADAPT

6. Choose the number K of Principal Components

Dowload libraries.

library(pcadapt)
library(vcfR)
library(ggplot2)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(viridis)

## Loading required package: viridisLite

Download data with the read.pcadapt function.

data <- read.pcadapt("00-Data/5136snps_717ind.vcf", type = "vcf")

## Warning in file2other(input, type, match.arg(type.out), match.arg(allele.sep)):
## Converter vcf to pcadapt is deprecated. Please use PLINK for conversion to bed
## (and QC).

## No variant got discarded.
## Summary:
## 
##  - input file:               00-Data/5136snps_717ind.vcf
##  - output file:              /var/folders/cl/whdtytz51tq2syrv9j3b663m0000gn/T//RtmpAQfhzi/file7161601f5ca.pcadapt
## 
##  - number of individuals detected:   717
##  - number of loci detected:      5136
## 
## 5136 lines detected.
## 717 columns detected.

Extract relevant information from vcfR.

data2 <- read.vcfR("00-Data/5136snps_717ind.vcf")

## Scanning file to determine attributes.
## File attributes:
##   meta lines: 9
##   header_line: 10
##   variant count: 5136
##   column count: 726
## 
Meta line 9 read in.
## All meta lines processed.
## gt matrix initialized.
## Character matrix gt created.
##   Character matrix gt rows: 5136
##   Character matrix gt cols: 726
##   skip: 0
##   nrows: 5136
##   row_num: 0
## 
Processed variant 1000
Processed variant 2000
Processed variant 3000
Processed variant 4000
Processed variant 5000
Processed variant: 5136
## All variants processed

data3 <- vcfR2genlight(data2)
snp <- as.data.frame(data3@loc.names)
ind <- as.data.frame(data3@ind.names)
colnames(ind) <-"INDIVIDUALS"

Add population map information.

pop_map <- read.table("00-Data/population_map_sea_cucumber.txt", header=TRUE)

Merge individuals and pop_map information.

ind_pop_map <- merge(ind, pop_map, by=c("INDIVIDUALS"))

First run pcadapt with a large number K of Principal Components, for instance K = 20.

data_pcadapt_trial <- pcadapt(data, K = 20, min.maf = 0.01, ploidy=2)

Make screeplot and score plot to determine ideal number of PC. The ‘scree plot’ displays in decreasing order the percentage of variance explained by each PC.

plot(data_pcadapt_trial, option = "screeplot", col="blue", snp.info = NULL, 
     plt.pkg = "ggplot")

Check the plot to select the optimal number K of Principal Components. First, check PC1 and PC2.

plot(data_pcadapt_trial, option = "scores", pop=pop_map$STRATA, 
     snp.info = NULL, plt.pkg = "ggplot")

## Warning: Use of `df$Pop` is discouraged. Use `Pop` instead.

Second, check PC3 and PC4.

plot(data_pcadapt_trial, option = "scores", pop=pop_map$STRATA, 
     i = 3, j = 4, snp.info = NULL, plt.pkg = "ggplot")

## Warning: Use of `df$Pop` is discouraged. Use `Pop` instead.

Third, check PC5 and PC6.

plot(data_pcadapt_trial, option = "scores", pop=pop_map$STRATA, 
     i = 5, j = 6, snp.info = NULL, plt.pkg = "ggplot")

## Warning: Use of `df$Pop` is discouraged. Use `Pop` instead.

7. Compute the statistic test

The test statistic for detecting outlier SNPs is the Mahalanobis distance, which is a multi-dimensional approach that measures how distant a point is from the mean.

For our dataset, we will run pcadapt with K = 2 (as no significant population structure was found in the other axes).

By default, the parameter min.maf is set to 5%; here we are changing it to 1%. p-values of SNPs with a minor allele frequency smaller than the threshold are not computed (NA is returned).

data_pcadapt <- pcadapt(data, K = 2, min.maf = 0.01)

Create a dataframe gathering pcadapt results and sampling sites.

pca_adapt_pop_map <-cbind(data_pcadapt$scores, ind_pop_map) 
colnames(pca_adapt_pop_map) <- c("PC1","PC2","IND","SITES")

Visualize the results according to the sampling sites.

ggplot(pca_adapt_pop_map, aes(x=PC1, y=PC2, fill= factor(SITES)))+
  geom_point(size=1.5, shape=21)+
  scale_fill_viridis(discrete = TRUE) +
  theme_classic()+
  xlab("PC1")+
  ylab("PC2")+ 
  theme_bw()

Save the graph.

ggsave("PCAdapt_SITES.pdf", width = 10, height=5, dpi=600)

Download admixture results.

admixture <- read.table("00-Data/Admixture_results_K2.txt", header=TRUE)

Merge results from pcadapt and ADMIXTURE. In doing so, you check the correspondence between the two analyses. For example, in a case where the population structure is high, outliers detected by pcadapt may be related to the clustering observed by the ADMIXTURE analysis.

pca_adapt_pop_map_admixture <- merge(pca_adapt_pop_map, admixture, by="IND")

Visualize the results according to ADMIXTURE.

ggplot(pca_adapt_pop_map_admixture, aes(x=PC1, y=PC2, fill = CLUSTER))+
  geom_point(pch=21, size=1.5)+
  scale_fill_viridis(discrete = TRUE)+
  theme_classic()+
  xlab("PC1")+
  ylab("PC2")+ 
  theme(legend.title = element_blank())

Save the graph.

ggsave("PCAdapt_ADMIXTURE.pdf", width = 5, height=5)

get the list of p-values.

snps_pvalues <- cbind(snp, data_pcadapt$pvalues)
snps_pvalues_no_na <- na.omit(snps_pvalues)
write.table(snps_pvalues, "All_Pvalues.txt", sep="\t", quote=FALSE)

Graphical tools: the user can also check the expected uniform distribution of the p-values using a Q-Q plot

plot(data_pcadapt, option = "qqplot", col, snp.info = NULL, plt.pkg = "ggplot")

Here the Q-Q plot confirms that most of the p-values follow the expected uniform distribution.

Check the distribution of these p-values.

hist(data_pcadapt$pvalues, xlab = "p-values", main = NULL, breaks = 50, col = "orange")

A histogram of p-values confirms that most of the p-values follow a uniform distribution. The excess of small P-values indicate the presence of outliers.

Visualize the distribution of p-values.

quantile(snps_pvalues_no_na$`data_pcadapt$pvalues`, probs = c(0.01, 0.99))

##           1%          99% 
## 4.731297e-70 9.924441e-01

Get only the markers showing extreme p-values: the top 1%.

top_1percent <- subset(snps_pvalues_no_na, snps_pvalues_no_na$`data_pcadapt$pvalues` <= 4.731297e-70)
colnames(top_1percent) <- c("LOCUS","PVALUE")

write.table(top_1percent, "42outliers_pcadapt.txt", sep="\t", quote=FALSE, row.names = FALSE)

VENN DIAGRAM

Download the VennDiagram library.

library(VennDiagram)

## Loading required package: grid

## Loading required package: futile.logger

Save bayescan outlier SNP names in a vector.

bayescan_outliers <- positive$SNP

Check the number of outliers found by BayeScan.

length(bayescan)

## [1] 8

Save pcadapt names in a vector.

pcadapt_outliers <- top_1percent$LOCUS

Check the number of outliers found by pcadapt.

length(pcadapt_outliers)

## [1] 42

Create a Venn diagram.

venn.diagram(
x = list(bayescan_outliers,pcadapt_outliers),
category.names = c("Bayescan" , "Pcadapt"),
filename = 'Venn_diagramm_outliers.png',
output=TRUE,
imagetype="png" ,
height = 400 , 
width = 400, 
resolution = 300,
compression = "lzw",
cat.cex = 0.6,
cat.pos = c(-5, 5))

## [1] 1

Venn diagram.

Find the common outliers to both BayeScan and pcadapt.

intersect(bayescan_outliers,pcadapt_outliers)

## [1] "8623_19"  "8623_21"  "15144_18" "19470_22" "41054_28" "43421_50" "52394_42"
## [8] "52572_42"

As you can see, subset of shared number of candidate is low, suggesting that genome scan methods are highly variable. This finding is in agreement with Dalongeville et al. 2018, who showed that the identify and number of the outliers are drastically different depending on the method used.

Find the unique outliers to BayeScan.

setdiff(bayescan_outliers,pcadapt_outliers)

##  [1] "101_21"   "2943_49"  "4052_58"  "5610_34"  "5610_31"  "6883_27" 
##  [7] "6973_31"  "13062_15" "18276_19" "18276_20" "18276_21" "18276_22"
## [13] "18502_26" "18502_25" "18502_24" "18502_23" "19470_27" "21034_31"
## [19] "21710_44" "22776_35" "23125_37" "25272_11" "25351_57" "25351_59"
## [25] "26371_22" "28612_55" "28950_32" "29747_27" "31093_58" "32489_65"
## [31] "32489_66" "33290_62" "33290_55" "33290_54" "35884_47" "37255_35"
## [37] "38179_60" "38179_50" "38179_49" "38179_46" "38655_15" "40782_45"
## [43] "40782_20" "41240_11" "44862_60" "45048_57" "45048_58" "45177_26"
## [49] "50798_43" "51085_60" "51635_56" "51843_37" "51843_53" "51866_26"
## [55] "53252_43" "53252_45" "53252_48" "53252_49" "53995_40" "54494_41"
## [61] "54494_42" "56284_43" "57735_27" "57735_28" "57735_29" "57938_62"
## [67] "65855_42" "65855_41" "66220_35" "66220_36"

length(setdiff(bayescan_outliers,pcadapt_outliers))

## [1] 70

Find the unique outliers to pcadapt.

setdiff(pcadapt_outliers,bayescan_outliers)

##  [1] "238_35"   "2934_18"  "5604_51"  "7556_42"  "7556_41"  "12422_17"
##  [7] "16344_18" "17733_30" "18608_12" "18608_13" "19212_48" "20389_38"
## [13] "20942_11" "21532_20" "21532_19" "21532_17" "22888_29" "25959_8" 
## [19] "34049_25" "34462_51" "54171_48" "57735_23" "59905_30" "61924_47"
## [25] "61924_48" "62207_58" "62207_60" "63768_37" "63768_35" "65102_24"
## [31] "66018_42" "66018_44" "66018_45" "66174_22"

length(setdiff(pcadapt_outliers,bayescan_outliers))

## [1] 34