OutFLANK is an R package that implements the method developed by Whitlock and Lotterhos to use likelihood on a trimmed distribution of \(F_{ST}\) values to infer the distribution of \(F_{ST}\) for neutral markers. This distribution is then used to assign q-values to each locus to detect outliers that may be due to spatially heterogeneous selection.
If you use OutFLANK, please cite:
Whitlock, M. C., and K. E. Lotterhos. Reliable detection of loci responsible for local adaptation: Inference of a neutral model through trimming the distribution of FST. Accepted at The American Naturalist.
Please see this paper for a discussion of the limitations and strengths of this approach.
OutFLANK is distributed through github. To install the OutFLANK package, first install the devtools package on your session of R, followed by the installation of OutFLANK from github.
### Install required package devtools if not installed
if (!("devtools" %in% installed.packages())){install.packages("devtools")}
library(devtools)
### Install required package qvalue if not installed
if (!("qvalue" %in% installed.packages())){
source("http://bioconductor.org/biocLite.R")
biocLite("qvalue")
}
### Install OutFLANK if not installed
if (!("OutFLANK" %in% installed.packages())){install_github("whitlock/OutFLANK")}
if (!("vcfR" %in% installed.packages())){install.packages("vcfR")}
Next, load the packages into your R session. In later uses of the package, you can start with this library step. It will automatically load the qvalue
library.
library(OutFLANK)
## Loading required package: qvalue
library(vcfR)
##
## ***** *** vcfR *** *****
## This is vcfR 1.5.0
## browseVignettes('vcfR') # Documentation
## citation('vcfR') # Citation
## ***** ***** ***** *****
MakeDiploidFSTMat()
The input for the OutFLANK()
function is a data frame, with a row for each locus. OutFLANK requires that the input data include a large number of loci not strongly affected by spatially heterogeneous selection, although these need not be previously identified. It will work best with at least thousands of loci in the data set.
OutFLANK finds outliers for \(F_{ST}\), but it needs to use estimates of FST that have not been corrected for sample size adjustments. This is because these sample size adjustments can sometimes cause the estimate of the FST of a locus to be negative, which is not a possible value of the chi-squared distribution used by OutFLANK. These uncorrected values are labeled with “NoCorr”, e.g. FSTNoCorr is the FST of a locus without this correction. (In order to properly average FST over loci, OutFLANK also needs the numerator and denominator of the FST calculation for FSTNoCorr; these are called T1NoCorr and T2NoCorr respectively, for more information see the section “For Advanced Users”).
Since only uncorrected values of Fst are used in the algorithm, loci with unusually low sample sizes relative to the rest of the loci could be called as outliers. The user should take care to remove them before analysis. For a discussion of this issue see the publication.
There are several required columns in the input data frame, which you can read about in the section “Running OutFLANK”. Some of the columns in the input are somewhat unusual, and functions to create the values in these columns are also included in the package. We anticipate that most people will be starting with SNP datasets and will need these variables to be calculated, as described below. You can, however, calculate these quantities yourself with functions provided in the package (see section “For advanced users”).
For diploid data, the appropriate input data frame can be calculated using the function:
MakeDiploidFSTMat(SNPmat,locusNames,popNames)
.
The output of this function is a data frame that can be given to the function OutFLANK()
as the input parameter FSTDataFrame
(see below). This function requires three data objects as input:
SNPmat
is an array with a row for each individual in the data set and a column for each locus. This function assumes biallelic (i.e. SNP) data, and the value in each column is either 0, 1 or 2, showing the number of the focal alleles that the individual carries at that locus. (i.e., 2 means that the individual is a homozygote for the focal allele, 0, means that the individual has no copies of that focal allele, and 1 indicates a heterozygote.) For any locus that is unknown for an individual, SNPmat should have a 9 for that locus on that row. The function will take a few minutes to check that your data only contain 0, 1, 2, and 9, and it will produce an error if your data contains any other values.
locusNames
is a character vector that gives a list of identifying names for each locus. (The length of the locusNames
vector should be the same as the number of columns in SNPmat
.)
popNames
is a character vector that has an entry for each individual, in the same order as the rows in SNPmat
, which gives the population that that individual came from. OutFLANK assumes that the individuals are grouped into relatively discrete populations, so there must be multiple individuals per population for the function to work properly.
This dataset was simulated.
Set your working directory to the example set (this code will work if the example data is on your desktop):
setwd("~/Desktop/data/")
Load the data:
##############################
## Upload Compressed VCF file
##############################
vcf <- read.vcfR("../sim1a/vcf_sim1a_contest.vcf.gz", verbose=FALSE)
##############################
##############################
### Convert VCF format to SNP data format required by OutFLANK (Note that this is slow)
##############################
convertVCFtoCount3 <- function(string){
# This function assumes 0 for reference
# and 1 for alternate allele
a <- as.numeric(unlist(strsplit(string, split = c("[|///]"))))
odd = seq(1, length(a), by=2)
a[odd] + a[odd+1]
}
all.vcf.gen <- vcf@gt[,-1]
system.time(gen_table <- matrix(convertVCFtoCount3(all.vcf.gen), ncol=ncol(all.vcf.gen)))
## user system elapsed
## 10.708 0.456 11.368
vcf@meta
## [1] "##fileformat=VCFv4.2"
## [2] "##fileDate=20170809"
## [3] "##source=SLiM"
## [4] "##INFO=<ID=MID,Number=1,Type=Integer,Description=\"Mutation ID in SLiM\">"
## [5] "##INFO=<ID=S,Number=1,Type=Float,Description=\"Selection Coefficient\">"
## [6] "##INFO=<ID=DOM,Number=1,Type=Float,Description=\"Dominance\">"
## [7] "##INFO=<ID=PO,Number=1,Type=Integer,Description=\"Population of Origin\">"
## [8] "##INFO=<ID=GO,Number=1,Type=Integer,Description=\"Generation of Origin\">"
## [9] "##INFO=<ID=MT,Number=1,Type=Integer,Description=\"Mutation Type\">"
## [10] "##INFO=<ID=AC,Number=1,Type=Integer,Description=\"Allele Count\">"
## [11] "##INFO=<ID=DP,Number=1,Type=Integer,Description=\"Total Depth\">"
## [12] "##INFO=<ID=MULTIALLELIC,Number=0,Type=Flag,Description=\"Multiallelic\">"
## [13] "##FORMAT=<ID=GT,Number=1,Type=String,Description=\"Genotype\">"
head(vcf@fix)
## CHROM POS ID REF ALT QUAL FILTER INFO
## [1,] "2" "55279" NA "A" "T" "1000" "PASS" NA
## [2,] "3" "81812" NA "A" "T" "1000" "PASS" NA
## [3,] "5" "174034" NA "A" "T" "1000" "PASS" NA
## [4,] "5" "190924" NA "A" "T" "1000" "PASS" NA
## [5,] "2" "47972" NA "A" "T" "1000" "PASS" NA
## [6,] "2" "58099" NA "A" "T" "1000" "PASS" NA
# note that the loci are not ordered in this simulated data
# note that every position is unique, so that sorting by position will give you the right order
head(vcf@gt[,1:4])
## FORMAT i0 i1 i2
## [1,] "GT" "0|0" "0|0" "0|0"
## [2,] "GT" "0|0" "0|0" "0|0"
## [3,] "GT" "0|0" "0|0" "0|0"
## [4,] "GT" "0|0" "0|0" "0|0"
## [5,] "GT" "0|0" "0|0" "0|0"
## [6,] "GT" "0|0" "0|0" "1|0"
### The loci info
### Let's make a character vector of locus names with Chromosome ID and Position:
locinames <- paste(vcf@fix[,"CHROM"], vcf@fix[,"POS"], sep="_")
head(locinames)
## [1] "2_55279" "3_81812" "5_174034" "5_190924" "2_47972" "2_58099"
##############################
##############################
### SNP data for OutFLANK
##############################
SNPdata <- t(gen_table)
# gen_table has individuals in columns, but OutFLANK requires individuals in rows
### Show the first 5 SNPs in the first 20 individuals:
head(SNPdata[,1:20])
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
## [1,] 0 0 0 0 0 0 0 0 0 0 0 0 0
## [2,] 0 0 0 0 0 0 0 0 1 0 0 0 0
## [3,] 0 0 0 0 0 1 0 0 0 0 0 0 0
## [4,] 0 0 0 0 0 0 0 1 0 0 1 0 0
## [5,] 0 0 0 0 0 0 0 0 1 0 2 0 0
## [6,] 0 0 0 0 0 0 1 0 0 0 1 0 0
## [,14] [,15] [,16] [,17] [,18] [,19] [,20]
## [1,] 0 0 0 0 0 1 2
## [2,] 0 0 0 2 0 1 0
## [3,] 0 0 1 0 0 1 2
## [4,] 0 0 1 1 0 0 1
## [5,] 1 0 0 0 0 1 1
## [6,] 1 0 0 1 0 2 0
##############################
##############################
### The individual data
##############################
ind <- read.table("../sim1a/outputIndAll_pop.txt", header=TRUE)
head(ind)
## id x y phenotype1 envi phenotype2 pop
## 1 0 0.49716400 0.992154 1.019150 0.88809700 1.019150 1
## 2 1 0.63119700 0.070583 -0.535818 0.80400400 -0.535818 2
## 3 2 0.18481100 0.320225 1.177780 0.00734727 1.177780 3
## 4 3 0.19549000 0.322930 -0.361280 0.08361870 -0.361280 3
## 5 4 0.01481070 0.979297 0.261367 -0.30221600 0.261367 4
## 6 5 0.00369433 0.077454 1.138400 -1.26882000 1.138400 5
# note that phenotype 1 and phenotype 2 are the same thing
k <- max(ind$pop)
plot(ind$x, ind$y, pch=ind$pop%%6, col=adjustcolor(ind$pop%%3+2, alpha=0.2))
text(tapply(ind$x, ind$pop, mean), tapply(ind$y, ind$pop, mean), label=1:k)
# check your sample sizes
table(ind$pop)
##
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
## 35 20 41 42 29 35 42 24 32 26 32 31 41 17 32 11 17 17 10 29 16 33 15 17 39
## 26 27 28 29 30 31 32 33 34 35 36 37 38 39
## 30 40 20 24 11 14 37 27 12 25 18 19 23 17
OutFLANK()
functionHere are the columns required for the input data frame. They should have these names, but order does not matter:
$LocusName
: a character string that uniquely names each locus.$FST
: \(F_{ST}\) calculated for this locus by Weir and Cockerham (1984). (Kept here to report the unbiased \(F_{ST}\) of the results)$T1
: The numerator of the estimator for \(F_{ST}\) (necessary, with $T2, to calculate mean \(F_{ST}\))$T2
: The denominator of the estimator of \(F_{ST}\)$FSTNoCorr
: \(F_{ST}\) calculated for this locus without sample size correction. (See below for the functions used to calculate this and the next two columns)$T1NoCorr
: The numerator of the estimator for \(F_{ST}\) without sample size correction (necessary, with \(T2\), to find mean \(F_{ST}\))$T2NoCorr
: The denominator of the estimator of \(F_{ST}\) without sample size correction$He
: The heterozygosity of the locus (used to screen out low heterozygosity loci that have a different distribution)We are going to use the MakeDiploidFSTMat()
function to calculate these variables from the SNP data:
FstDataFrame <- MakeDiploidFSTMat(SNPdata,locinames,ind$pop)
## Calculating FSTs, may take a few minutes...
head(FstDataFrame)
## LocusName He FST T1 T2 FSTNoCorr
## 1 2_55279 0.0019980 0.019790770 1.979198e-05 0.001000061 0.03858870
## 2 3_81812 0.0069755 0.001076022 3.754883e-06 0.003489596 0.02054829
## 3 5_174034 0.0148875 0.019617093 1.461786e-04 0.007451594 0.03829351
## 4 5_190924 0.0256620 0.038214677 4.911262e-04 0.012851769 0.05758598
## 5 2_47972 0.0059820 0.002318598 6.938843e-06 0.002992689 0.02175268
## 6 2_58099 0.0740355 0.014041215 5.202382e-04 0.037050793 0.03282613
## T1NoCorr T2NoCorr meanAlleleFreq
## 1 3.859380e-05 0.001000132 0.9990
## 2 7.171053e-05 0.003489854 0.9965
## 3 2.853679e-04 0.007452122 0.9925
## 4 7.401361e-04 0.012852714 0.9870
## 5 6.510381e-05 0.002992910 0.9970
## 6 1.216321e-03 0.037053434 0.9615
The procedure implemented in OutFLANK()
will use a measure of \(F_{ST}\) uncorrected for sample size, because correction for sample size results in negative values of FST that cannot be fit by a chi-square distribution. Assuming all loci have equal sample sizes, the ranks of the uncorrected FSTs ($FSTNoCorr
) are equivalent to the ranks of the FSTs that have been corrected for sample size (FST
).
Let’s plot the corrected vs. uncorrected FSTs:
plot(FstDataFrame$FST, FstDataFrame$FSTNoCorr,
xlim=c(-0.01,0.3), ylim=c(-0.01,0.3),
pch=20)
abline(0,1)
Note how the distribution of $FSTNoCorr
is shifted slightly higher than the corrected $FST
distribution, but the ranks are the same. Sample size correction decreases the value of \(F_{ST}\). In this dataset, no loci are missing any data, but this step is recommended for real data. Loci with unusual sample sizes will be outliers in this plot.
To demonstrate this problem, let’s severly decrease the sample size of the first locus (such that 200 randomly chosen individuals will be missing data for that locus).
SNPdata_missing <- SNPdata
missing <- sample(1:nrow(SNPdata_missing), 500, replace=FALSE)
SNPdata_missing[missing,1] <- 9
FstDataFrame_missing <- MakeDiploidFSTMat(SNPdata_missing,locinames,ind$pop)
## Calculating FSTs, may take a few minutes...
plot(FstDataFrame_missing$FST, FstDataFrame_missing$FSTNoCorr, xlim=c(-0.01,0.3), ylim=c(-0.01,0.3), pch=20)
## Highlight the SNP that is missing a lot of data
points(FstDataFrame_missing$FST[1], FstDataFrame_missing$FSTNoCorr[1], col="blue", pch=8, cex=1.3)
abline(0,1)
Note how the SNP missing a lot of data has an elevated value of uncorrected \(F_{ST}\) relative to corrected \(F_{ST}\). SNPs like this should be removed before running OutFLANK.
Low frequency variants (rare alleles) have a statistical behavior that violate the assumptions. There is an option to remove these in the OutFLANK function.
plot(FstDataFrame$He, FstDataFrame$FSTNoCorr, pch=20, col="grey")
# Note the large FST values for loci with low heterozygosity (He < 0.1)
hist(FstDataFrame$FSTNoCorr, breaks=seq(0,0.3, by=0.001))
# Note how, when all loci are included, this doesn't look chi-square distributed
hist(FstDataFrame$FSTNoCorr[FstDataFrame$He>0.05], breaks=seq(0,0.3, by=0.001))
# Removing low Heterozygosity variants results in a more chi-square looking FST distribution
hist(FstDataFrame$FSTNoCorr[FstDataFrame$He>0.1], breaks=seq(0,0.3, by=0.001))
# Removing low Heterozygosity variants results in a more chi-square looking FST distribution
Once the appropriate input dataframe is available (see above), the procedure can be run with one command:
# Run outflank
out1 <- OutFLANK(FstDataFrame, NumberOfSamples=k)
# Visualize results
OutFLANKResultsPlotter(out1, withOutliers = TRUE,
NoCorr = TRUE, Hmin = 0.1, binwidth = 0.001, Zoom =
FALSE, RightZoomFraction = 0.05, titletext = NULL)
# wow that is a bad fit!
hist(out1$results$pvaluesRightTail)
# right-tailed (e.g. large FST) outliers will have small P-values near 0
# note how the P-value distribution is not uniform with an inflated values near 0
OFoutput
is the output of the function OutFLANK()
. Run OutFLANK()
first, storing the results to a variable.withOutliers
determines whether to include loci flagged by OutFLANK as outliers in the histogramNoCorr = TRUE
tells the function to plot FSTNoCorr
instead of FST. This is recommended, because the distribution curve plotted will be predicted for FSTNoCorr
, not FST
itself.Hmin
tells the function the minimum expected heterozygosity that a locus should have before it is included in the plots.binwidth
describes the width of the histogram bins.Zoom
provides the option of “zooming in” on the right hand tail of the distribution. When Zoom = TRUE
, only loci in the highest quantiles of FST will be plotted (with the proportion included determined by RightZoomFraction.)titletext
provides an opportunity to provide a title for the graph. Put the phrase you want as the title in “quotes”.Hints:
Inputs:
FstDataFrame
is the input prepared above. (If your dataframe for OutFLANK is called “myFs
”, then the argument would read FstDataFrame = myFs
.)LeftTrimFraction
and RightTrimFraction
are arguments, each set to 5% by default, to tell OutFLANK how many of the lowest and highest FST values to remove before estimating the shape of the FST distribution through likelihood. (This removes the loci most affected by selection, and the likelihood procedure in OutFLANK accounts for the fact that data below and above these thresholds have been removed.) In some cases, when there are potentially a large number of loci affected by spatially heterogeneous selection, it can be worth trying a higher RightTrimFraction
.Hmin
is the threshold for expected heterozygosity. Loci with low He have distributions of FST very different from loci with higher He, and it is important to screen these out before proceeding. By default, OutFLANK removes from consideration all loci with expected heterozygosity less than 10%.NumberOfSamples
is the number of populations sampled. This is the only other argument without a default, aside from the data frame.qthreshold
sets the threshold for whether a locus is deemed an “outlier”, i.e. if the q-value for that locus is below this threshold. This is set to 5% by default (aiming at a 5% or lower false discovery rate.) These q-values are calculated based on the right-tail P-values for each locus.Note that the OutFLANK maximum likelihood procedure calculates mean FST based on the trimmed data, but calculates the degrees of freedom based on the non-outliers. This means that none of the putative outliers can be inside of the trim points. So if you choose a qthreshold
that is too large (such that outliers are within the trim points), the program will return an error.
outlier = OutFLANK(FstDataFrame,NumberOfSamples = k,
RightTrimFraction = 0.06, LeftTrimFraction = 0.35,
qthreshold = 0.05, Hmin = 0.1)
# increasing the Right Trim Fraction doesn't help, but increasing the left trim fraction enables a better fit
OutFLANKResultsPlotter(outlier, withOutliers = TRUE,
NoCorr = TRUE, Hmin = 0.1, binwidth = 0.001, Zoom =
FALSE, RightZoomFraction = 0.05, titletext = NULL)
OutFLANKResultsPlotter(outlier, withOutliers = TRUE,
NoCorr = TRUE, Hmin = 0.1, binwidth = 0.001, Zoom =
TRUE, RightZoomFraction = 0.1, titletext = NULL)
hist(outlier$results$pvaluesRightTail)
# note the excess of P-values near 1, indicating poor fit of the left tail
OutFLANK returns a data frame with the following values:
$FSTbar
: Mean FST of loci not flagged as outliers, using the sample-size correction (recommend reporting in publication)$FSTNoCorrbar
: Mean FST of loci not flagged as outliers, without using the sample-size correction$dfInferred
: Inferred df for the chi-square distribution from the OutFLANK procedure (recommend reporting in publication).$numberLowFstOutliers
: Number of loci flagged as having a significantly low FST (not reliable)$numberHighFstOutliers
: Number of loci flagged as having a significantly high FST.$results
: a data frame with a row for each locus. This data frame includes all the original columns in the input data set, and six new ones:$results$indexOrder
: the original row number of the locus in the input data set$results$GoodH
: Boolean variable which is TRUE if the expected heterozygosity is greater than the Hemin
set by the input to OutFLANK$results$OutlierFlag
: TRUE if the method identifies the locus as an outlier, FALSE otherwise, based on the q-threshold$results$q
: the q-value for the test of neutrality for the locus, based on the right-tailed P-value.$results$pvalues
: the two-tailed P-value for the test of neutrality for the locus. Not reliable, because OutFLANK does not predict the left-tailed P-values well.$results$pvaluesRightTail
: the one-sided (right tail) P-value for the locusLet’s look at these values for our data
str(outlier)
## List of 6
## $ FSTbar : num 0.0174
## $ FSTNoCorrbar : num 0.0363
## $ dfInferred : num 31.2
## $ numberLowFstOutliers : int 0
## $ numberHighFstOutliers: int 132
## $ results :'data.frame': 5940 obs. of 15 variables:
## ..$ LocusName : Factor w/ 5879 levels "1_1006","1_10076",..: 1347 2528 4295 4703 1159 1433 1041 1101 243 1968 ...
## ..$ He : num [1:5940] 0.002 0.00698 0.01489 0.02566 0.00598 ...
## ..$ FST : num [1:5940] 0.01979 0.00108 0.01962 0.03821 0.00232 ...
## ..$ T1 : num [1:5940] 1.98e-05 3.75e-06 1.46e-04 4.91e-04 6.94e-06 ...
## ..$ T2 : num [1:5940] 0.001 0.00349 0.00745 0.01285 0.00299 ...
## ..$ FSTNoCorr : num [1:5940] 0.0386 0.0205 0.0383 0.0576 0.0218 ...
## ..$ T1NoCorr : num [1:5940] 3.86e-05 7.17e-05 2.85e-04 7.40e-04 6.51e-05 ...
## ..$ T2NoCorr : num [1:5940] 0.001 0.00349 0.00745 0.01285 0.00299 ...
## ..$ meanAlleleFreq : num [1:5940] 0.999 0.997 0.993 0.987 0.997 ...
## ..$ indexOrder : int [1:5940] 1 2 3 4 5 6 7 8 9 10 ...
## ..$ GoodH : Factor w/ 2 levels "goodH","lowH": 2 2 2 2 2 2 2 2 1 2 ...
## ..$ qvalues : num [1:5940] 1 1 1 0.419 1 ...
## ..$ pvalues : num [1:5940] 0.7463 0.0491 0.7697 0.0404 0.0753 ...
## ..$ pvaluesRightTail: num [1:5940] 0.3731 0.9754 0.3849 0.0202 0.9623 ...
## ..$ OutlierFlag : logi [1:5940] FALSE FALSE FALSE FALSE FALSE FALSE ...
# Number of loci in the input SNP data:
nrow(FstDataFrame)
## [1] 5940
# Number of loci in the OutFLANK output:
nrow(outlier$results)
## [1] 5940
Note how the output has all the same loci as the input, as well as the additional columns from our simulations ($chrom
, $cM
, and $type
) in the $results
dataframe.
sum(outlier$results$qvalues<0.01, na.rm=TRUE)
## [1] 95
plot(outlier$results$He, outlier$results$FST, pch=20, col="grey")
points(outlier$results$He[outlier$results$qvalues<0.01], outlier$results$FST[outlier$results$qvalues<0.01], pch=21, col="blue")
### Note how OutFLANK identifies potential outliers at He < 0.1, even though
### these loci were excluded in the trimming algorithm
# Create logical vector for top candidates
top_candidates <- outlier$results$qvalues<0.01 & outlier$results$He>0.1
plot(outlier$results$He, outlier$results$FST, pch=20, col="grey")
points(outlier$results$He[top_candidates], outlier$results$FST[top_candidates], pch=21, col="blue")
# list top candidates
topcan <- outlier$results[top_candidates,]
topcan[order(topcan$LocusName),]
## LocusName He FST T1 T2 FSTNoCorr
## 133 1_21786 0.3151680 0.12469568 0.019732082 0.15824191 0.14423176
## 3380 1_21818 0.3059355 0.13337439 0.020492557 0.15364686 0.15321252
## 5088 1_21929 0.3663555 0.16380182 0.030165908 0.18416100 0.18467192
## 312 1_21930 0.3663555 0.16380182 0.030165908 0.18416100 0.18467192
## 2958 1_22917 0.1472000 0.06838425 0.005045476 0.07378126 0.08601349
## 2670 1_27702 0.4978875 0.06315184 0.015757746 0.24952156 0.08126972
## 1193 2_52204 0.1361955 0.06997157 0.004776758 0.06826713 0.08673426
## 1268 3_81284 0.2209955 0.06053192 0.006704046 0.11075223 0.08095607
## 2665 3_81391 0.3318000 0.06218771 0.010341402 0.16629335 0.08335737
## 244 3_81633 0.2012355 0.07550753 0.007618202 0.10089328 0.09568694
## 5239 3_81730 0.4200000 0.15223112 0.032130827 0.21106609 0.17511068
## 5404 3_81785 0.4143020 0.14520750 0.030226205 0.20815871 0.16793686
## 314 3_81872 0.2848320 0.07876296 0.011249249 0.14282411 0.10015539
## 3087 3_81917 0.2822000 0.07361961 0.010415971 0.14148365 0.09521314
## 4953 3_86144 0.2157420 0.06959366 0.007526062 0.10814292 0.08809216
## T1NoCorr T2NoCorr meanAlleleFreq indexOrder GoodH qvalues
## 133 0.022825201 0.15825365 0.8040 133 goodH 2.592234e-10
## 3380 0.023542397 0.15365844 0.8115 3380 goodH 2.126799e-11
## 5088 0.034012060 0.18417560 0.7585 5088 goodH 0.000000e+00
## 312 0.034012060 0.18417560 0.2415 312 goodH 0.000000e+00
## 2958 0.006346608 0.07378620 0.9200 2958 goodH 2.308226e-03
## 2670 0.020279943 0.24953872 0.4675 2670 goodH 6.293398e-03
## 1193 0.005921476 0.06827147 0.9265 1193 goodH 1.933525e-03
## 1268 0.008966760 0.11076082 0.8735 1268 goodH 6.588663e-03
## 2665 0.013862889 0.16630671 0.2100 2665 goodH 4.044797e-03
## 244 0.009654908 0.10090101 0.8865 244 goodH 2.273914e-04
## 5239 0.036963137 0.21108442 0.7000 5239 goodH 0.000000e+00
## 5404 0.034960536 0.20817667 0.2930 5404 goodH 1.884207e-13
## 314 0.014305765 0.14283570 0.8280 314 goodH 7.109942e-05
## 3087 0.013472207 0.14149525 0.1700 3087 goodH 2.494138e-04
## 4953 0.009527212 0.10815051 0.8770 4953 goodH 1.406853e-03
## pvalues pvaluesRightTail OutlierFlag
## 133 1.134648e-12 5.673240e-13 TRUE
## 3380 5.728751e-14 2.864375e-14 TRUE
## 5088 0.000000e+00 0.000000e+00 TRUE
## 312 0.000000e+00 0.000000e+00 TRUE
## 2958 5.207109e-05 2.603555e-05 TRUE
## 2670 1.801141e-04 9.005704e-05 TRUE
## 1193 4.296722e-05 2.148361e-05 TRUE
## 1268 1.952196e-04 9.760982e-05 TRUE
## 2665 1.048651e-04 5.243255e-05 TRUE
## 244 3.675012e-06 1.837506e-06 TRUE
## 5239 0.000000e+00 0.000000e+00 TRUE
## 5404 4.440892e-16 2.220446e-16 TRUE
## 314 1.029386e-06 5.146928e-07 TRUE
## 3087 4.198886e-06 2.099443e-06 TRUE
## 4953 2.984234e-05 1.492117e-05 TRUE
plot(vcf@fix[,"POS"], outlier$results$FST,
col=grey((as.numeric(vcf@fix[,"CHROM"])%%2+1)/3),
pch=20
)
points(vcf@fix[top_candidates,"POS"], outlier$results$FST[top_candidates],
pch=21, cex=2, col=2)
# clean up by removing low freq variants
keep <- outlier$results$He>0.1 & !is.na(outlier$results$He)
plot(vcf@fix[keep,"POS"], outlier$results$FST[keep],
col=grey((as.numeric(vcf@fix[keep,"CHROM"])%%2+1)/3),
pch=20
)
points(vcf@fix[top_candidates,"POS"], outlier$results$FST[top_candidates],
pch=21, cex=2, col=2)
Can you adjust the RightTrimFraction and the qthreshold in the OutFLANK procedure to improve the results? What happens if you increase the RightTrimFraction? Just the qthreshold? How high can you increase these values until the algorithm fails?