What is the site frequency spectrum (SFS)?

Build the derived SFS from genotypic data

What is the x-axis of the SFS? What is the range of possible values for the x-axis? (click on the triangle to see the answer) 1. X -> Frequency of derived alleles. The x-axis of the SFS contains the absolute frequency of the derived allele in the sample.

  1. Y-> SPNs counts. The y-axis of the SFS contains the number of sites in the genome with a given absolute frequency.

Multi-dimensional SFS (2D-SFS)

How many sites have 1 heterozygote individual in pop1 and 1 heterozygote individual in pop2?

check notebook…

How many sites have a derived allele frequency of 4 in pop1 and a derived allele frequency of 4 in pop2?

Identify the entry of the 2D-SFS matrix corresponding to the sites where all individuals in both populations are heterozygote.

Does the SFS contain information about demographic history?

# set the working directory 
setwd("~/workshop_materials/a09_fastsimcoal2")

# read the observed SFS
# 2D from two populations with different sample sizes
pop2d <- read.table("./FscInputFiles/TwoPops_2DSFS.obs", skip=1, 
stringsAsFactors = F, header=T, row.names = 1)
head(pop2d)
##          d0_0 d0_1 d0_2 d0_3 d0_4 d0_5 d0_6
## d1_0 19969298 8235 2294  945  368  152   49
## d1_1     4188 1615 1142  703  471  234  112
## d1_2     1204  853  815  674  532  370  221
## d1_3      415  429  539  556  535  471  363
## d1_4       94  199  296  390  543  695    0

The observed derived SFS is (here the identifiers di_j represent the pop i and SNP frequency j): (above table)

Have a look at the observed SFS file. Does the observed SFS includes the monomorphic sites?

The entry for the monomorphic sites is the one with a derived frequency of zero in pop1 and in pop2, i.e. the entry [1,1] corresponding to the 1st row and 1st column.

Input files 2.2: DEFINITION OF MODELS

We will consider a model without gene flow. That model and the parameters are defined in the files “2PopDiv_NoMig.est” and “2PopDiv_NoMig.tpl”. Each model has a corresponding TPL file. The parameters we want to infer are given a parameter name tag (e.g. \(NPOP\), \(TDIV\)).

In sum, there are two steps:

  1. Define the model with the TPL file, specifying the parameter tags for those parameters we want to infer. NOTE: different parameters need to be given different name tags, and name tags cannot be a sub-string another one (e.g. NPOP1, NPOP1_2 cannot be used because NPOP1 is a sub-string of NPOP1_2). For this reason, we usually use a “$” at the start and end of a tag, e.g. \(NPOP1\), \(NPOP1_2\).

  2. Define the search range for each parameter in the EST file.

TO DISCUSS Exercise 2.2.2 From looking at the TPL and EST file: – What is the mutation rate assumed?

A: The average mutation rate is shown in the last line of the TPL file.

//per Block:data type, number of loci, per generation recombination and mutation rates and optional parameters
FREQ  1   0  2.5e-8 OUTEXP

– Do we need a mutation rate estimate to be able to run fastsimcoal2?

if you do not have a mutation rate and/or you do not have the number of monomorphic sites, then you can still run fastsimcoal2 but in that case you will need to define a reference effective size or a reference time for an event. NO. Can it be any parameter.

– Can we estimate the mutation rate from the SFS using fastsimcoal2?

NO.

you can still run fastsimcoal2 but in that case you will need to define a reference effective size or a reference time for an event. All the other parameters would be estimated in relation to that reference parameter. For instance, you could define that the effective size of one of the populations is the reference NPOP1 and fixed to a value, e.g. 10,000, and then you can estimate the relative size of other populations (e.g., \(NPOP2\)/\(NPOP1\)), even if you do not have monomorphic sites and/or a mutation rate. Although in theory it is possible to estimate an average mutation rate, we do not recommend this as you would need to fix an effective size.

VERY IMPORTANT

When estimating parameters with the SFS you CANNOT estimate jointly the mutation rate and the other parameters. You either need to give a fixed mutation rate or fix one effective size as a reference. This is different from other methods, such as Approximate Bayesian Computation where you simulate data and you can jointly estimate both mutation rate and demographic history parameters.

Settings to run fastsimcoal2

To run fastsimcoal2 we need to specify the following options:

the number of coalescent simulations to approximate the expected SFS (-n option). This should be larger than 100,000. We recommend to use 200,000 to 1,000,000.

the number of optimization cycles (-L option). This should be at least 20, but values between 50 and 100 are recommended.

use option -M indicating that we will run the likelihood optimization to infer parameters based on the SFS.

use option -d for the derived SFS and -m for the MAF SFS.

use option -0 if you do not have monomorphic sites in your observed SFS. NOTE: in this case you need to fix a parameter (e.g. a split time or an effective size), and all parameters will be relative to that fixed parameter.

use option -q to have a quite mode, without printing lots of information.

specify option -C1 such that the likelihood is computed for all entries with at least 1 SNP. If you specify -Cx then all entries with less than x SNPs are pooled together. This option is useful when there are many entries in the observed SFS with few SNPs and with a limited number of SNPS to avoid overfitting.

specify options for multithread. -c1 -B1 for a single core, -c4 -B4 for 4 cores using 4 batches.

install.packages("plotrix")
## Installing package into '/home/wpsg/R/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
install.packages("diagram")
## Installing package into '/home/wpsg/R/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
install.packages("RColorBrewer")
## Installing package into '/home/wpsg/R/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
library("RColorBrewer")

Run fastsimcoal2 (a single run only to exemplify)

# load functions to run fastsimcoal2 and to process the output 
source("utilFscFunctions.r")
source("ParFileInterpreter_VS.r")

# create a list that saves all the required settings to run fastsimcoal2
settings <- list()
# path to fastsicoal2 executable file
# in this case the fastsimcoal2 executable is in the folder ./fastsimcoal2/fsc26_linux64/
settings$pathToFsc <- "./fastsimcoal2/fsc27_linux64/"
# executable filename
settings$Fsc <- "fsc2702"
# DEFINE OBSERVED SFS INPUT FILE
settings$obsfile <- "TwoPops_2DSFS.obs"
# path to input files (working directory)
# In this case, the input files are in the folder ./FscInputFiles/
settings$pathTo_InputFile <- "./FscInputFiles/"
# Tag for the end of the observed SFS file 
# this depends on whether it is a 1D SFS, 2D SFS or --multiSFS (check fastsimcoal2 manual)
settings$obsfileend <- "_jointDAFpop1_0"
# DEFINE THE MODEL WITH EST AND TPL FILES 
# tags to the TPL and EST files. Files should be "tag".tpl and "tag".est
# in this case files are 2PopDiv_NoMig.tpl and 2PopDiv_NoMig.est
settings$TPL_EST_file_tag <- "2PopDiv_NoMig"
# number of coalescent simulations
settings$n_coalsims <- 100000
# number of optimization cycles
settings$n_cycles <- 20

# get the command lines to copy files and run fastsimcoal2 
cmd <- get_fsc2_commandline(settings)
# print command lines to copy files
cmd$copyfiles
## [1] "cp ./fastsimcoal2/fsc27_linux64/fsc2702 ./FscInputFiles/;chmod +x ./FscInputFiles/fsc2702;cd ./FscInputFiles/;"
# rename observed file name
cmd$renameobs
## [1] "cp TwoPops_2DSFS.obs 2PopDiv_NoMig_jointDAFpop1_0.obs"
# print command line to run fastsimcoal2
 cmd$fsc2cmd
## [1] "./fsc2702 -t 2PopDiv_NoMig.tpl -e 2PopDiv_NoMig.est -L 20 -n 100000 -d -M -q -C1 -c4 -B4;cd .."

The likelihood is given in units of log10. Hence, the closer to zero the higher the likelihood. We can see that the likelihood increased from -356267 to -137360 in 20 cycles of the optimization. We can also see that after 1 iteration the parameters were (in same order as in EST file): NPOP1 24107, NPOP2 20448, NANC 12188, TDIV 1119 and at iteration 20 the values were NPOP1 20414, NPOP2 5155, NANC 9895, TDIV 1013.

How to interpret fastsimcoal2 results?

Fastsimcoal2 will output several files, which contain important information. The most relevant information for most users is:

*.bestlhoods : file with the Maximum likelihood parameter estimates and corresponding likelihood. This is what we want – parameter estimates!

*_DAFpop0.txt : file with the expected SFS obtained with the parameters that maximized the likelihood during optimization. This is needed to visually check the fit of the SFS. Or we can use this to recompute the likelihood of other observed SFS (e.g. without linked sites) given our model.

*.simparam : file with an example of the settings to run the simulations. It is useful to check this file when you have errors.

Obtain parameter estimates and maximum likelihood.Follow the code below that will read the output files from fastsimcoal2 to obtain the parameter estimates.

maxlhoodEstimates <- read.table(paste(settings$pathTo_InputFile,
settings$TPL_EST_file_tag, "/",
settings$TPL_EST_file_tag, ".bestlhoods",
sep=""), header=T, check.names = F)
# replace the X. at the start of each parameter name
paramnames <- colnames(maxlhoodEstimates) 
eval <- which(substr(paramnames,1,2)=="X.") # get params starting with X.
paramnames[eval] <- substr(paramnames[eval],start=3,  stop=nchar(paramnames[eval])) # get only the substring after X. for those parameteres
colnames(maxlhoodEstimates) <- paramnames # replace column names

# look at the parameter estimates
maxlhoodEstimates
##   $NPOP1$ $NPOP2$ $NANC$ $TDIV$ MaxEstLhood MaxObsLhood
## 1   20301    4880   9866    993   -137360.4   -137353.9
#View(maxlhoodEstimates)
  1. Each column corresponds to a parameter name tag and the corresponding parameter estimate. You have the result of 1 run.
  2. In the analysis of real data sets we need to run several independent runs (different starting values), and select the parameter estimates of the run attaining the maximum likelihood estimate (MaxEstLhood).

MaxObsLhood: is the maximum possible value for the likelihood if there was a perfect fit of the expected to the observed SFS, i.e. if the expected SFS was the relative observed SFS.

MaxEstLhood: is the maximum likelihood estimated according to the model parameters.

The better the fit, the smaller the difference between MaxObsLhood and MaxEstLhood

Follow the code below that will read the output files from fastsimcoal2 to obtain the observed and expected SFS.

# Fit of the model expected SFS to the observed SFS

# Read the observed SFS - SNP counts
obssfs <- read.table(paste(settings$pathTo_InputFile,
settings$TPL_EST_file_tag, settings$obsfileend, ".obs",
sep=""), skip=1, stringsAsFactors = F, header=T)

head(obssfs)
##          d0_0 d0_1 d0_2 d0_3 d0_4 d0_5 d0_6
## d1_0 19969298 8235 2294  945  368  152   49
## d1_1     4188 1615 1142  703  471  234  112
## d1_2     1204  853  815  674  532  370  221
## d1_3      415  429  539  556  535  471  363
## d1_4       94  199  296  390  543  695    0
# Read the expected SFS - PROBABILITIES
expsfs <- read.table(paste(settings$pathTo_InputFile,
settings$TPL_EST_file_tag, "/",
settings$TPL_EST_file_tag, settings$obsfileend, ".txt",
sep=""), header=T)

head(obssfs)
##          d0_0 d0_1 d0_2 d0_3 d0_4 d0_5 d0_6
## d1_0 19969298 8235 2294  945  368  152   49
## d1_1     4188 1615 1142  703  471  234  112
## d1_2     1204  853  815  674  532  370  221
## d1_3      415  429  539  556  535  471  363
## d1_4       94  199  296  390  543  695    0
# Plot the fit of the 2D SFS, including of the marginal 1D SFS
# the function plot2dSFS is defined in the utilfunctions.r
# you need to give as input the following arguments
#    obsSFS : matrix with observed SFS (counts)
#    expSFS : matrix with expected SFS (probabilities)
#    xtag : string with the label of x-axis
#    ytag : string with the label of y-axis
#    minentry : number with the minimum entry in the SFS (all entries with less than this are pooled together)

plot2dSFS(obsSFS=obssfs, expSFS=expsfs, xtag="Pop2", ytag="Pop1", minentry=1)

Note the difference between the observed SFS and the expected SFS. The observed SFS shows the number of sites with a given frequency (counts, which are integer values), whereas the expected SFS shows the probability of finding a site with a given absolute allele frequency in both populations (i.e. probabilities with values between 0 and 1.0).

# Draw a scenario of the model parameter estimates assuming a generation time of 1 year
# path to the maxL.par file that is a par file with the parameters that maximize the likelihood
layout(1)
path_to_maxL_file <- paste(settings$pathTo_InputFile, settings$TPL_EST_file_tag, "/",settings$TPL_EST_file_tag, "_maxL", sep="")
parFileInterpreter(args=path_to_maxL_file, pop.names=c("Pop1","Pop2"), gentime=1, printPDF=FALSE)
## [1] "./FscInputFiles/2PopDiv_NoMig/2PopDiv_NoMig_maxL"
## Loading required package: shape

TO DISCUSS – Is the method able to infer the correct parameters?

Yes, the size frequente spectrum contain information abput of the parameters

– Does the expected SFS fit the observed SFS? Yes.

– Does the SFS contain information about the demographic history of populations?

Yes, we were able to restimate

– Time estimates are given in generations. How to convert to years?

We multiple for the generation time of our species.

Application to real genome-wide SNP data from human populations

Apply fastsimcoal2 to infer parameters of a model with two populations

Have a look at the input files (obsSFS, EST and TPL files). Answer the following questions:

– Are monomorphics sites included in the observed SFS? If the monomorphic sites are not included we cannot use the mutation rate as a clock to get inferences.

– What is the sample size from each population? – What is the mutation rate? – How many historical events does the model include? – What are the parameter search ranges?

#EXE.

Use the command lines below for the model without gene flow. Note that we assume that the Maya population could have potentially a bottleneck associated with the split time, lasting for a fixed number of generations. Have a look at the TPL and EST files to see how the bottleneck was implemented.

# load functions to run fastsimcoal2 and to process the output 
source("utilFscFunctions.r")
source("ParFileInterpreter_VS.r")

# create a list that saves all the required settings to run fastsimcoal2
settings <- list()
# path to fastsicoal2 executable file
settings$pathToFsc <- "./fastsimcoal2/fsc27_linux64/"
# executable filename
settings$Fsc <- "fsc2702"
# DEFINE OBSERVED SFS INPUT FILE
settings$obsfile <- "Maya_San_DSFS.obs"
# path to input files (working directory)
# In this case, the input files are in the folder ./Henn_et_al_data/
settings$pathTo_InputFile <- "./Henn_et_al_data/"
# DEFINE THE MODEL WITH EST AND TPL FILES 
# tags to the TPL and EST files. Files should be "tag".tpl and "tag".est
settings$TPL_EST_file_tag <- "NoMig_San_Maya"
# Tag for the end of the observed SFS file 
# this depends on whether it is a 1D SFS, 2D SFS or --multiSFS (check fastsimcoal2 manual)
settings$obsfileend <- "_jointDAFpop1_0"
# number of coalescent simulations
settings$n_coalsims <- 100000
# number of optimization cycles
settings$n_cycles <- 20

# get the fastsimcoal2 command line
cmd <- get_fsc2_commandline(settings)

Copy the command lines to the server (Amazon instance) and run fastsimcoal2. If you type the following you should get the following command lines.

# print command lines to copy files
cmd$copyfiles
## [1] "cp ./fastsimcoal2/fsc27_linux64/fsc2702 ./Henn_et_al_data/;chmod +x ./Henn_et_al_data/fsc2702;cd ./Henn_et_al_data/;"
# rename observed file name
cmd$renameobs
## [1] "cp Maya_San_DSFS.obs NoMig_San_Maya_jointDAFpop1_0.obs"
# print command line to run fastsimcoal2
cmd$fsc2cmd
## [1] "./fsc2702 -t NoMig_San_Maya.tpl -e NoMig_San_Maya.est -L 20 -n 100000 -d -M -q -C1 -c4 -B4;cd .."
# read the file with the maximum likelihood estimates
maxlhoodEstimates <- read.table(paste(settings$pathTo_InputFile,
settings$TPL_EST_file_tag, "/",
settings$TPL_EST_file_tag, ".bestlhoods",
sep=""), header=T, check.names = F)
# replace the X. at the start of each parameter name
paramnames <- colnames(maxlhoodEstimates) 
eval <- which(substr(paramnames,1,2)=="X.") # get params starting with X.
paramnames[eval] <- substr(paramnames[eval],start=3,  stop=nchar(paramnames[eval])) # get only the substring after X. for those parameteres
colnames(maxlhoodEstimates) <- paramnames # replace column names
# print the estimates
maxlhoodEstimates
##   $NPOP1$ $NPOP2$ $NANC$ $NBOTP1$ $TDIV$ $TBOT_END$ MaxEstLhood MaxObsLhood
## 1    9024   34596  18410       68   3601       1226   -476756.5   -475819.9

Now, lets compare the fit of the expected SFS under the model to the observed data.

# Read the observed SFS
# 2D from two human populations: San and Maya
maya_san_2dsfs <- read.table(paste(settings$pathTo_InputFile,
settings$TPL_EST_file_tag, settings$obsfileend,".obs",sep=""), 
skip=1, stringsAsFactors = F, header=T, row.names = 1)
# Read the expected SFS
expectedSFS <- read.table(paste(settings$pathTo_InputFile,
settings$TPL_EST_file_tag, "/",
settings$TPL_EST_file_tag, settings$obsfileend, ".txt",
sep=""), header=T)

# Plot the SFS fit
plot2dSFS(maya_san_2dsfs, expectedSFS, xtag="San", ytag="Maya", minentry=1)    

Notes

plot_relDiff2dSFS(maya_san_2dsfs, expectedSFS, xtag="San", ytag="Maya", minentry=1)

# Draw a scenario of the model parameter estimates assuming a generation time of 29 years
# path to the maxL.par file that is a par file with the parameters that maximize the likelihood
layout(1)
path_to_maxL_file <- paste(settings$pathTo_InputFile, settings$TPL_EST_file_tag, "/",settings$TPL_EST_file_tag, "_maxL", sep="")
parFileInterpreter(args=path_to_maxL_file, pop.names=c("Maya","San"), gentime=29, printPDF=FALSE)
## [1] "./Henn_et_al_data/NoMig_San_Maya/NoMig_San_Maya_maxL"

NOTES

Two model one with migration and the other without

DISCUSSION – What are the estimated effective sizes?

for the maya people:10000 ind sand people :35 000 ind

– What are the estimated times of divergence?

– Does the expected SFS fit the observed SFS?

More less, for many entries it look ok, but there many places where the model not fit the data. But in the 2D-SFS it fits better

– How are the parameter values affected by adding migration to the model? Would you change your conclusions?

the migration did not affect the model

– What are the estimated migration rates?

0.12 0.5

– Is the model with migration better than a model without migration? with the likelihood we can know what was better. If the value is closer to 0 the model will be better.

– Are these models sufficient to explain the data?

ok, but in this cases there was many entries in which the model did not make match

– Here we analyzed data from exome sequencing. Can we trust these demographic estimates?

Selection!!! We were using a neutral model but selection effects were not taken under consideration.