OutFLANK: Finding FST outliers with an inferred neutral distribution

OutFLANK is an R package that implements the method developed by Whitlock and Lotterhos to use likelihood on a trimmed distribution of $F_{ST}$ values to infer the distribution of $F_{ST}$ for neutral markers. This distribution is then used to assign q-values to each locus to detect outliers that may be due to spatially heterogeneous selection.

If you use OutFLANK, please cite:

Whitlock, M. C., and K. E. Lotterhos. Reliable detection of loci responsible for local adaptation: Inference of a neutral model through trimming the distribution of FST. Accepted at The American Naturalist.

Please see this paper for a discussion of the limitations and strengths of this approach.

To load OutFLANK

OutFLANK is distributed through github. To install the OutFLANK package, first install the devtools package on your session of R, followed by the installation of OutFLANK from github.

### Install required package devtools if not installed
if (!("devtools" %in% installed.packages())){install.packages("devtools")}
library(devtools)

### Install required package qvalue if not installed
if (!("qvalue" %in% installed.packages())){
  source("http://bioconductor.org/biocLite.R")
  biocLite("qvalue")
  }

### Install OutFLANK if not installed
if (!("OutFLANK" %in% installed.packages())){install_github("whitlock/OutFLANK")}

if (!("vcfR" %in% installed.packages())){install.packages("vcfR")}

Next, load the packages into your R session. In later uses of the package, you can start with this library step. It will automatically load the qvalue library.

library(OutFLANK)

## Loading required package: qvalue

library(vcfR)

## 
##    *****       ***   vcfR   ***       *****
##    This is vcfR 1.5.0 
##      browseVignettes('vcfR') # Documentation
##      citation('vcfR') # Citation
##    *****       *****      *****       *****

Steps to running OutFLANK

Prepare a dataframe for the OutFLANK function
- From a formatted SNP dataset using MakeDiploidFSTMat()
- From your own data using functions provided with the package (see Section “For Advanced Users”)
Check for SNPs of low sample size or that have $F_{ST}$ values differentially affected by sample size correction
Run OutFLANK
Plot results, and re-run OutFLANK if necessary

Preparing input files for OutFLANK

The input for the OutFLANK() function is a data frame, with a row for each locus. OutFLANK requires that the input data include a large number of loci not strongly affected by spatially heterogeneous selection, although these need not be previously identified. It will work best with at least thousands of loci in the data set.

OutFLANK finds outliers for $F_{ST}$, but it needs to use estimates of FST that have not been corrected for sample size adjustments. This is because these sample size adjustments can sometimes cause the estimate of the FST of a locus to be negative, which is not a possible value of the chi-squared distribution used by OutFLANK. These uncorrected values are labeled with “NoCorr”, e.g. FSTNoCorr is the FST of a locus without this correction. (In order to properly average FST over loci, OutFLANK also needs the numerator and denominator of the FST calculation for FSTNoCorr; these are called T1NoCorr and T2NoCorr respectively, for more information see the section “For Advanced Users”).

Since only uncorrected values of Fst are used in the algorithm, loci with unusually low sample sizes relative to the rest of the loci could be called as outliers. The user should take care to remove them before analysis. For a discussion of this issue see the publication.

There are several required columns in the input data frame, which you can read about in the section “Running OutFLANK”. Some of the columns in the input are somewhat unusual, and functions to create the values in these columns are also included in the package. We anticipate that most people will be starting with SNP datasets and will need these variables to be calculated, as described below. You can, however, calculate these quantities yourself with functions provided in the package (see section “For advanced users”).

For diploid data, the appropriate input data frame can be calculated using the function:

MakeDiploidFSTMat(SNPmat,locusNames,popNames).

The output of this function is a data frame that can be given to the function OutFLANK() as the input parameter FSTDataFrame (see below). This function requires three data objects as input:

SNPmat is an array with a row for each individual in the data set and a column for each locus. This function assumes biallelic (i.e. SNP) data, and the value in each column is either 0, 1 or 2, showing the number of the focal alleles that the individual carries at that locus. (i.e., 2 means that the individual is a homozygote for the focal allele, 0, means that the individual has no copies of that focal allele, and 1 indicates a heterozygote.) For any locus that is unknown for an individual, SNPmat should have a 9 for that locus on that row. The function will take a few minutes to check that your data only contain 0, 1, 2, and 9, and it will produce an error if your data contains any other values.
locusNames is a character vector that gives a list of identifying names for each locus. (The length of the locusNames vector should be the same as the number of columns in SNPmat.)
popNames is a character vector that has an entry for each individual, in the same order as the rows in SNPmat, which gives the population that that individual came from. OutFLANK assumes that the individuals are grouped into relatively discrete populations, so there must be multiple individuals per population for the function to work properly.

Load example data

This dataset was simulated.

Set your working directory to the example set (this code will work if the example data is on your desktop):

setwd("~/Desktop/data/")

Load the data:

##############################
## Upload Compressed VCF file
##############################
vcf <- read.vcfR("../sim1a/vcf_sim1a_contest.vcf.gz", verbose=FALSE)
##############################


##############################
### Convert VCF format to SNP data format required by OutFLANK (Note that this is slow)
##############################
 convertVCFtoCount3 <- function(string){
    # This function assumes 0 for reference
    # and 1 for alternate allele
    a <- as.numeric(unlist(strsplit(string, split = c("[|///]"))))
    odd = seq(1, length(a), by=2)
    a[odd] + a[odd+1]
 }

  all.vcf.gen <- vcf@gt[,-1]
  system.time(gen_table <- matrix(convertVCFtoCount3(all.vcf.gen), ncol=ncol(all.vcf.gen)))

##    user  system elapsed 
##  10.708   0.456  11.368

  vcf@meta

##  [1] "##fileformat=VCFv4.2"                                                     
##  [2] "##fileDate=20170809"                                                      
##  [3] "##source=SLiM"                                                            
##  [4] "##INFO=<ID=MID,Number=1,Type=Integer,Description=\"Mutation ID in SLiM\">"
##  [5] "##INFO=<ID=S,Number=1,Type=Float,Description=\"Selection Coefficient\">"  
##  [6] "##INFO=<ID=DOM,Number=1,Type=Float,Description=\"Dominance\">"            
##  [7] "##INFO=<ID=PO,Number=1,Type=Integer,Description=\"Population of Origin\">"
##  [8] "##INFO=<ID=GO,Number=1,Type=Integer,Description=\"Generation of Origin\">"
##  [9] "##INFO=<ID=MT,Number=1,Type=Integer,Description=\"Mutation Type\">"       
## [10] "##INFO=<ID=AC,Number=1,Type=Integer,Description=\"Allele Count\">"        
## [11] "##INFO=<ID=DP,Number=1,Type=Integer,Description=\"Total Depth\">"         
## [12] "##INFO=<ID=MULTIALLELIC,Number=0,Type=Flag,Description=\"Multiallelic\">" 
## [13] "##FORMAT=<ID=GT,Number=1,Type=String,Description=\"Genotype\">"

  head(vcf@fix)

##      CHROM POS      ID REF ALT QUAL   FILTER INFO
## [1,] "2"   "55279"  NA "A" "T" "1000" "PASS" NA  
## [2,] "3"   "81812"  NA "A" "T" "1000" "PASS" NA  
## [3,] "5"   "174034" NA "A" "T" "1000" "PASS" NA  
## [4,] "5"   "190924" NA "A" "T" "1000" "PASS" NA  
## [5,] "2"   "47972"  NA "A" "T" "1000" "PASS" NA  
## [6,] "2"   "58099"  NA "A" "T" "1000" "PASS" NA

     # note that the loci are not ordered in this simulated data
     # note that every position is unique, so that sorting by position will give you the right order
  head(vcf@gt[,1:4])

##      FORMAT i0    i1    i2   
## [1,] "GT"   "0|0" "0|0" "0|0"
## [2,] "GT"   "0|0" "0|0" "0|0"
## [3,] "GT"   "0|0" "0|0" "0|0"
## [4,] "GT"   "0|0" "0|0" "0|0"
## [5,] "GT"   "0|0" "0|0" "0|0"
## [6,] "GT"   "0|0" "0|0" "1|0"

  ### The loci info 
    ### Let's make a character vector of locus names with Chromosome ID and Position:
    locinames <- paste(vcf@fix[,"CHROM"], vcf@fix[,"POS"], sep="_")
    head(locinames)

## [1] "2_55279"  "3_81812"  "5_174034" "5_190924" "2_47972"  "2_58099"

##############################

##############################
### SNP data for OutFLANK
##############################
  SNPdata <- t(gen_table)
    # gen_table has individuals in columns, but OutFLANK requires individuals in rows
  ### Show the first 5 SNPs in the first 20 individuals:
  head(SNPdata[,1:20])

##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
## [1,]    0    0    0    0    0    0    0    0    0     0     0     0     0
## [2,]    0    0    0    0    0    0    0    0    1     0     0     0     0
## [3,]    0    0    0    0    0    1    0    0    0     0     0     0     0
## [4,]    0    0    0    0    0    0    0    1    0     0     1     0     0
## [5,]    0    0    0    0    0    0    0    0    1     0     2     0     0
## [6,]    0    0    0    0    0    0    1    0    0     0     1     0     0
##      [,14] [,15] [,16] [,17] [,18] [,19] [,20]
## [1,]     0     0     0     0     0     1     2
## [2,]     0     0     0     2     0     1     0
## [3,]     0     0     1     0     0     1     2
## [4,]     0     0     1     1     0     0     1
## [5,]     1     0     0     0     0     1     1
## [6,]     1     0     0     1     0     2     0

##############################

##############################
### The individual data
##############################
  ind <- read.table("../sim1a/outputIndAll_pop.txt", header=TRUE)
  head(ind)

##   id          x        y phenotype1        envi phenotype2 pop
## 1  0 0.49716400 0.992154   1.019150  0.88809700   1.019150   1
## 2  1 0.63119700 0.070583  -0.535818  0.80400400  -0.535818   2
## 3  2 0.18481100 0.320225   1.177780  0.00734727   1.177780   3
## 4  3 0.19549000 0.322930  -0.361280  0.08361870  -0.361280   3
## 5  4 0.01481070 0.979297   0.261367 -0.30221600   0.261367   4
## 6  5 0.00369433 0.077454   1.138400 -1.26882000   1.138400   5

  # note that phenotype 1 and phenotype 2 are the same thing
   k <- max(ind$pop)
   plot(ind$x, ind$y, pch=ind$pop%%6, col=adjustcolor(ind$pop%%3+2, alpha=0.2))
   text(tapply(ind$x, ind$pop, mean), tapply(ind$y, ind$pop, mean), label=1:k)

  # check your sample sizes
     table(ind$pop)

## 
##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 
## 35 20 41 42 29 35 42 24 32 26 32 31 41 17 32 11 17 17 10 29 16 33 15 17 39 
## 26 27 28 29 30 31 32 33 34 35 36 37 38 39 
## 30 40 20 24 11 14 37 27 12 25 18 19 23 17

Calculate variables required for `OutFLANK()` function

Here are the columns required for the input data frame. They should have these names, but order does not matter:

$LocusName: a character string that uniquely names each locus.
$FST: $F_{ST}$ calculated for this locus by Weir and Cockerham (1984). (Kept here to report the unbiased $F_{ST}$ of the results)
$T1: The numerator of the estimator for $F_{ST}$ (necessary, with $T2, to calculate mean $F_{ST}$)
$T2: The denominator of the estimator of $F_{ST}$
$FSTNoCorr: $F_{ST}$ calculated for this locus without sample size correction. (See below for the functions used to calculate this and the next two columns)
$T1NoCorr: The numerator of the estimator for $F_{ST}$ without sample size correction (necessary, with $T2$, to find mean $F_{ST}$)
$T2NoCorr: The denominator of the estimator of $F_{ST}$ without sample size correction
$He: The heterozygosity of the locus (used to screen out low heterozygosity loci that have a different distribution)

We are going to use the MakeDiploidFSTMat() function to calculate these variables from the SNP data:

SNPdata has individuals in rows and loci in columns, with a count (0,1,2) of each reference allele (missing data is 9)
locusNames indicate the names of the loci in columns
popNames indicate the population origin of individuals in rows

 FstDataFrame <- MakeDiploidFSTMat(SNPdata,locinames,ind$pop)

## Calculating FSTs, may take a few minutes...

 head(FstDataFrame)

##   LocusName        He         FST           T1          T2  FSTNoCorr
## 1   2_55279 0.0019980 0.019790770 1.979198e-05 0.001000061 0.03858870
## 2   3_81812 0.0069755 0.001076022 3.754883e-06 0.003489596 0.02054829
## 3  5_174034 0.0148875 0.019617093 1.461786e-04 0.007451594 0.03829351
## 4  5_190924 0.0256620 0.038214677 4.911262e-04 0.012851769 0.05758598
## 5   2_47972 0.0059820 0.002318598 6.938843e-06 0.002992689 0.02175268
## 6   2_58099 0.0740355 0.014041215 5.202382e-04 0.037050793 0.03282613
##       T1NoCorr    T2NoCorr meanAlleleFreq
## 1 3.859380e-05 0.001000132         0.9990
## 2 7.171053e-05 0.003489854         0.9965
## 3 2.853679e-04 0.007452122         0.9925
## 4 7.401361e-04 0.012852714         0.9870
## 5 6.510381e-05 0.002992910         0.9970
## 6 1.216321e-03 0.037053434         0.9615

Evaluations prior to running OutFLANK

Check for loci with low sample sizes or unusual values of uncorrected $F_{ST}$

The procedure implemented in OutFLANK() will use a measure of $F_{ST}$ uncorrected for sample size, because correction for sample size results in negative values of FST that cannot be fit by a chi-square distribution. Assuming all loci have equal sample sizes, the ranks of the uncorrected FSTs ($FSTNoCorr) are equivalent to the ranks of the FSTs that have been corrected for sample size (FST).

Let’s plot the corrected vs. uncorrected FSTs:

plot(FstDataFrame$FST, FstDataFrame$FSTNoCorr, 
     xlim=c(-0.01,0.3), ylim=c(-0.01,0.3),
     pch=20)
abline(0,1)

Note how the distribution of $FSTNoCorr is shifted slightly higher than the corrected $FST distribution, but the ranks are the same. Sample size correction decreases the value of $F_{ST}$. In this dataset, no loci are missing any data, but this step is recommended for real data. Loci with unusual sample sizes will be outliers in this plot.

What if a locus has low sample size?

To demonstrate this problem, let’s severly decrease the sample size of the first locus (such that 200 randomly chosen individuals will be missing data for that locus).

SNPdata_missing <- SNPdata
missing <- sample(1:nrow(SNPdata_missing), 500, replace=FALSE)
SNPdata_missing[missing,1] <- 9
FstDataFrame_missing <- MakeDiploidFSTMat(SNPdata_missing,locinames,ind$pop)

## Calculating FSTs, may take a few minutes...

  plot(FstDataFrame_missing$FST, FstDataFrame_missing$FSTNoCorr, xlim=c(-0.01,0.3), ylim=c(-0.01,0.3), pch=20)
  ## Highlight the SNP that is missing a lot of data
  points(FstDataFrame_missing$FST[1], FstDataFrame_missing$FSTNoCorr[1], col="blue", pch=8, cex=1.3)
  abline(0,1)

Note how the SNP missing a lot of data has an elevated value of uncorrected $F_{ST}$ relative to corrected $F_{ST}$. SNPs like this should be removed before running OutFLANK.

Check if $F_{ST}$ distribution looks chi-squared distributed

Low frequency variants (rare alleles) have a statistical behavior that violate the assumptions. There is an option to remove these in the OutFLANK function.

  plot(FstDataFrame$He, FstDataFrame$FSTNoCorr, pch=20, col="grey")

    # Note the large FST values for loci with low heterozygosity (He < 0.1)

  hist(FstDataFrame$FSTNoCorr, breaks=seq(0,0.3, by=0.001))

    # Note how, when all loci are included, this doesn't look chi-square distributed 
  hist(FstDataFrame$FSTNoCorr[FstDataFrame$He>0.05], breaks=seq(0,0.3, by=0.001))

    # Removing low Heterozygosity variants results in a more chi-square looking FST distribution
  hist(FstDataFrame$FSTNoCorr[FstDataFrame$He>0.1], breaks=seq(0,0.3, by=0.001))

    # Removing low Heterozygosity variants results in a more chi-square looking FST distribution

Running OutFLANK

Once the appropriate input dataframe is available (see above), the procedure can be run with one command:

# Run outflank
out1 <- OutFLANK(FstDataFrame, NumberOfSamples=k)

# Visualize results
 OutFLANKResultsPlotter(out1, withOutliers = TRUE,
NoCorr = TRUE, Hmin = 0.1, binwidth = 0.001, Zoom =
FALSE, RightZoomFraction = 0.05, titletext = NULL)

  # wow that is a bad fit!
 
  hist(out1$results$pvaluesRightTail)

    # right-tailed (e.g. large FST) outliers will have small P-values near 0
    # note how the P-value distribution is not uniform with an inflated values near 0

OFoutput is the output of the function OutFLANK(). Run OutFLANK() first, storing the results to a variable.
withOutliers determines whether to include loci flagged by OutFLANK as outliers in the histogram
NoCorr = TRUE tells the function to plot FSTNoCorr instead of FST. This is recommended, because the distribution curve plotted will be predicted for FSTNoCorr, not FST itself.
Hmin tells the function the minimum expected heterozygosity that a locus should have before it is included in the plots.
binwidth describes the width of the histogram bins.
Zoom provides the option of “zooming in” on the right hand tail of the distribution. When Zoom = TRUE, only loci in the highest quantiles of FST will be plotted (with the proportion included determined by RightZoomFraction.)
titletext provides an opportunity to provide a title for the graph. Put the phrase you want as the title in “quotes”.

Improving fit of OutFLANK

Hints:

qthreshold is used to call outliers. The trim points must be located within any potential outliers. If qthreshold is set too high, OutFLANK will return an error.
The algorithm may fail if RightTrimFraction is set too high
Use the plotting function to check the fit of the inferred distribution to the right tail
OutFLANK will not fit the left tail of the FST distribution well

Inputs:

FstDataFrame is the input prepared above. (If your dataframe for OutFLANK is called “myFs”, then the argument would read FstDataFrame = myFs.)
LeftTrimFraction and RightTrimFraction are arguments, each set to 5% by default, to tell OutFLANK how many of the lowest and highest FST values to remove before estimating the shape of the FST distribution through likelihood. (This removes the loci most affected by selection, and the likelihood procedure in OutFLANK accounts for the fact that data below and above these thresholds have been removed.) In some cases, when there are potentially a large number of loci affected by spatially heterogeneous selection, it can be worth trying a higher RightTrimFraction.
Hmin is the threshold for expected heterozygosity. Loci with low He have distributions of FST very different from loci with higher He, and it is important to screen these out before proceeding. By default, OutFLANK removes from consideration all loci with expected heterozygosity less than 10%.
NumberOfSamples is the number of populations sampled. This is the only other argument without a default, aside from the data frame.
qthreshold sets the threshold for whether a locus is deemed an “outlier”, i.e. if the q-value for that locus is below this threshold. This is set to 5% by default (aiming at a 5% or lower false discovery rate.) These q-values are calculated based on the right-tail P-values for each locus.

Note that the OutFLANK maximum likelihood procedure calculates mean FST based on the trimmed data, but calculates the degrees of freedom based on the non-outliers. This means that none of the putative outliers can be inside of the trim points. So if you choose a qthreshold that is too large (such that outliers are within the trim points), the program will return an error.

  outlier = OutFLANK(FstDataFrame,NumberOfSamples = k, 
                     RightTrimFraction = 0.06, LeftTrimFraction = 0.35,
                     qthreshold = 0.05, Hmin = 0.1)
    # increasing the Right Trim Fraction doesn't help, but increasing the left trim fraction enables a better fit
  
OutFLANKResultsPlotter(outlier, withOutliers = TRUE,
NoCorr = TRUE, Hmin = 0.1, binwidth = 0.001, Zoom =
FALSE, RightZoomFraction = 0.05, titletext = NULL)

    OutFLANKResultsPlotter(outlier, withOutliers = TRUE,
NoCorr = TRUE, Hmin = 0.1, binwidth = 0.001, Zoom =
TRUE, RightZoomFraction = 0.1, titletext = NULL)

  hist(outlier$results$pvaluesRightTail)

    # note the excess of P-values near 1, indicating poor fit of the left tail

Values output by OutFLANK

OutFLANK returns a data frame with the following values:

$FSTbar: Mean FST of loci not flagged as outliers, using the sample-size correction (recommend reporting in publication)
$FSTNoCorrbar: Mean FST of loci not flagged as outliers, without using the sample-size correction
$dfInferred: Inferred df for the chi-square distribution from the OutFLANK procedure (recommend reporting in publication).
$numberLowFstOutliers: Number of loci flagged as having a significantly low FST (not reliable)
$numberHighFstOutliers: Number of loci flagged as having a significantly high FST.
$results: a data frame with a row for each locus. This data frame includes all the original columns in the input data set, and six new ones:
$results$indexOrder: the original row number of the locus in the input data set
$results$GoodH: Boolean variable which is TRUE if the expected heterozygosity is greater than the Hemin set by the input to OutFLANK
$results$OutlierFlag: TRUE if the method identifies the locus as an outlier, FALSE otherwise, based on the q-threshold
$results$q: the q-value for the test of neutrality for the locus, based on the right-tailed P-value.
$results$pvalues: the two-tailed P-value for the test of neutrality for the locus. Not reliable, because OutFLANK does not predict the left-tailed P-values well.
$results$pvaluesRightTail: the one-sided (right tail) P-value for the locus

Let’s look at these values for our data

  str(outlier)

## List of 6
##  $ FSTbar               : num 0.0174
##  $ FSTNoCorrbar         : num 0.0363
##  $ dfInferred           : num 31.2
##  $ numberLowFstOutliers : int 0
##  $ numberHighFstOutliers: int 132
##  $ results              :'data.frame':   5940 obs. of  15 variables:
##   ..$ LocusName       : Factor w/ 5879 levels "1_1006","1_10076",..: 1347 2528 4295 4703 1159 1433 1041 1101 243 1968 ...
##   ..$ He              : num [1:5940] 0.002 0.00698 0.01489 0.02566 0.00598 ...
##   ..$ FST             : num [1:5940] 0.01979 0.00108 0.01962 0.03821 0.00232 ...
##   ..$ T1              : num [1:5940] 1.98e-05 3.75e-06 1.46e-04 4.91e-04 6.94e-06 ...
##   ..$ T2              : num [1:5940] 0.001 0.00349 0.00745 0.01285 0.00299 ...
##   ..$ FSTNoCorr       : num [1:5940] 0.0386 0.0205 0.0383 0.0576 0.0218 ...
##   ..$ T1NoCorr        : num [1:5940] 3.86e-05 7.17e-05 2.85e-04 7.40e-04 6.51e-05 ...
##   ..$ T2NoCorr        : num [1:5940] 0.001 0.00349 0.00745 0.01285 0.00299 ...
##   ..$ meanAlleleFreq  : num [1:5940] 0.999 0.997 0.993 0.987 0.997 ...
##   ..$ indexOrder      : int [1:5940] 1 2 3 4 5 6 7 8 9 10 ...
##   ..$ GoodH           : Factor w/ 2 levels "goodH","lowH": 2 2 2 2 2 2 2 2 1 2 ...
##   ..$ qvalues         : num [1:5940] 1 1 1 0.419 1 ...
##   ..$ pvalues         : num [1:5940] 0.7463 0.0491 0.7697 0.0404 0.0753 ...
##   ..$ pvaluesRightTail: num [1:5940] 0.3731 0.9754 0.3849 0.0202 0.9623 ...
##   ..$ OutlierFlag     : logi [1:5940] FALSE FALSE FALSE FALSE FALSE FALSE ...

  # Number of loci in the input SNP data:
  nrow(FstDataFrame)

## [1] 5940

  # Number of loci in the OutFLANK output:
  nrow(outlier$results)

## [1] 5940

Note how the output has all the same loci as the input, as well as the additional columns from our simulations ($chrom, $cM, and $type) in the $results dataframe.

Outliers

sum(outlier$results$qvalues<0.01, na.rm=TRUE)

## [1] 95

plot(outlier$results$He, outlier$results$FST, pch=20, col="grey")
    points(outlier$results$He[outlier$results$qvalues<0.01], outlier$results$FST[outlier$results$qvalues<0.01], pch=21, col="blue")

### Note how OutFLANK identifies potential outliers at He < 0.1, even though
### these loci were excluded in the trimming algorithm

    # Create logical vector for top candidates
top_candidates <- outlier$results$qvalues<0.01 & outlier$results$He>0.1

plot(outlier$results$He, outlier$results$FST, pch=20, col="grey")
    points(outlier$results$He[top_candidates], outlier$results$FST[top_candidates], pch=21, col="blue")

# list top candidates
    topcan <- outlier$results[top_candidates,]
    topcan[order(topcan$LocusName),]

##      LocusName        He        FST          T1         T2  FSTNoCorr
## 133    1_21786 0.3151680 0.12469568 0.019732082 0.15824191 0.14423176
## 3380   1_21818 0.3059355 0.13337439 0.020492557 0.15364686 0.15321252
## 5088   1_21929 0.3663555 0.16380182 0.030165908 0.18416100 0.18467192
## 312    1_21930 0.3663555 0.16380182 0.030165908 0.18416100 0.18467192
## 2958   1_22917 0.1472000 0.06838425 0.005045476 0.07378126 0.08601349
## 2670   1_27702 0.4978875 0.06315184 0.015757746 0.24952156 0.08126972
## 1193   2_52204 0.1361955 0.06997157 0.004776758 0.06826713 0.08673426
## 1268   3_81284 0.2209955 0.06053192 0.006704046 0.11075223 0.08095607
## 2665   3_81391 0.3318000 0.06218771 0.010341402 0.16629335 0.08335737
## 244    3_81633 0.2012355 0.07550753 0.007618202 0.10089328 0.09568694
## 5239   3_81730 0.4200000 0.15223112 0.032130827 0.21106609 0.17511068
## 5404   3_81785 0.4143020 0.14520750 0.030226205 0.20815871 0.16793686
## 314    3_81872 0.2848320 0.07876296 0.011249249 0.14282411 0.10015539
## 3087   3_81917 0.2822000 0.07361961 0.010415971 0.14148365 0.09521314
## 4953   3_86144 0.2157420 0.06959366 0.007526062 0.10814292 0.08809216
##         T1NoCorr   T2NoCorr meanAlleleFreq indexOrder GoodH      qvalues
## 133  0.022825201 0.15825365         0.8040        133 goodH 2.592234e-10
## 3380 0.023542397 0.15365844         0.8115       3380 goodH 2.126799e-11
## 5088 0.034012060 0.18417560         0.7585       5088 goodH 0.000000e+00
## 312  0.034012060 0.18417560         0.2415        312 goodH 0.000000e+00
## 2958 0.006346608 0.07378620         0.9200       2958 goodH 2.308226e-03
## 2670 0.020279943 0.24953872         0.4675       2670 goodH 6.293398e-03
## 1193 0.005921476 0.06827147         0.9265       1193 goodH 1.933525e-03
## 1268 0.008966760 0.11076082         0.8735       1268 goodH 6.588663e-03
## 2665 0.013862889 0.16630671         0.2100       2665 goodH 4.044797e-03
## 244  0.009654908 0.10090101         0.8865        244 goodH 2.273914e-04
## 5239 0.036963137 0.21108442         0.7000       5239 goodH 0.000000e+00
## 5404 0.034960536 0.20817667         0.2930       5404 goodH 1.884207e-13
## 314  0.014305765 0.14283570         0.8280        314 goodH 7.109942e-05
## 3087 0.013472207 0.14149525         0.1700       3087 goodH 2.494138e-04
## 4953 0.009527212 0.10815051         0.8770       4953 goodH 1.406853e-03
##           pvalues pvaluesRightTail OutlierFlag
## 133  1.134648e-12     5.673240e-13        TRUE
## 3380 5.728751e-14     2.864375e-14        TRUE
## 5088 0.000000e+00     0.000000e+00        TRUE
## 312  0.000000e+00     0.000000e+00        TRUE
## 2958 5.207109e-05     2.603555e-05        TRUE
## 2670 1.801141e-04     9.005704e-05        TRUE
## 1193 4.296722e-05     2.148361e-05        TRUE
## 1268 1.952196e-04     9.760982e-05        TRUE
## 2665 1.048651e-04     5.243255e-05        TRUE
## 244  3.675012e-06     1.837506e-06        TRUE
## 5239 0.000000e+00     0.000000e+00        TRUE
## 5404 4.440892e-16     2.220446e-16        TRUE
## 314  1.029386e-06     5.146928e-07        TRUE
## 3087 4.198886e-06     2.099443e-06        TRUE
## 4953 2.984234e-05     1.492117e-05        TRUE

plot(vcf@fix[,"POS"], outlier$results$FST, 
     col=grey((as.numeric(vcf@fix[,"CHROM"])%%2+1)/3),
     pch=20
     ) 
  points(vcf@fix[top_candidates,"POS"], outlier$results$FST[top_candidates], 
         pch=21, cex=2, col=2)

  # clean up by removing low freq variants
  keep <- outlier$results$He>0.1 & !is.na(outlier$results$He)
plot(vcf@fix[keep,"POS"], outlier$results$FST[keep], 
     col=grey((as.numeric(vcf@fix[keep,"CHROM"])%%2+1)/3),
     pch=20
     ) 
  points(vcf@fix[top_candidates,"POS"], outlier$results$FST[top_candidates], 
         pch=21, cex=2, col=2)

Challenge questions:

Can you adjust the RightTrimFraction and the qthreshold in the OutFLANK procedure to improve the results? What happens if you increase the RightTrimFraction? Just the qthreshold? How high can you increase these values until the algorithm fails?

OutFLANK: Finding FST outliers with an inferred neutral distribution

Katie E. Lotterhos

To load OutFLANK

Steps to running OutFLANK

Preparing input files for OutFLANK

Load example data

Calculate variables required for `OutFLANK()` function

Evaluations prior to running OutFLANK

Check for loci with low sample sizes or unusual values of uncorrected \(F_{ST}\)

What if a locus has low sample size?

Check if \(F_{ST}\) distribution looks chi-squared distributed

Running OutFLANK

Improving fit of OutFLANK

Values output by OutFLANK

Outliers

Challenge questions:

OutFLANK: Finding FST outliers with an inferred neutral distribution

Katie E. Lotterhos

To load OutFLANK

Steps to running OutFLANK

Preparing input files for OutFLANK

Load example data

Calculate variables required for OutFLANK() function

Evaluations prior to running OutFLANK

Check for loci with low sample sizes or unusual values of uncorrected \(F_{ST}\)

What if a locus has low sample size?

Check if \(F_{ST}\) distribution looks chi-squared distributed

Running OutFLANK

Improving fit of OutFLANK

Values output by OutFLANK

Outliers

Challenge questions:

Calculate variables required for `OutFLANK()` function