The ecological Sequentially Markovian Coalescent (eSMC) is a novel method, described in the paper “Inference of past demography, dormancy and self-fertilization rates from whole genome sequence data” (T. P. Sellinger, Abu, Moest, & Tellier, 2020). The GitHub repository https://github.com/TPPSellinger/eSMC elaborates on the implementation of the model and provides the required eSMC package (T. Sellinger, 2020).
The model infers the population size over past generations. In contrast to most other currently used models that are based on Sequential Markovian Coalescence (SMC) theory, it runs on full genome polymorphism data. ESMC also runs on individuals. It analyzes two haplotypes at the time, comparing the sequence of the individual to be investigated with the one of the reference genome. The model allows to account for and to estimate self-fertilization rates or rates of seed or egg-banking. The latter can occur due to low germination rates or seed dormancy (T. P. Sellinger et al., 2020). This is a novelty compared to the previously published SMC based models.
Prior to running the analysis, information is needed regarding: (I) the individuals to run it on, (II) the chromosome number (scaffold number), (III) the recombination and the mutation rate per site per generation, and (IV) whether to account for the recombination, the selfing or the germination rate (T. P. Sellinger et al., 2020).
Many currently used models are based on the assumptions of the Wright-Fisher model. In the presence of self-fertilization and seed-dormancy these assumptions, for example the ones of sexual reproduction or non-overlapping generations are violated. To account for these two ecological and life history traits, eSMC uses two ratios, the molecular ratio of recombination (r) over mutation rate (µ): \(\frac{r}{µ}\) and the effective ratio of the two parameters: \(\frac{ρ}{θ}\). With the assumptions of the Wright-Fisher model, the ratios are equal. However, self-fertilization and seed-banking causes these ratios to differ. Accounting for this differentiation allows to the model reduce bias in its inference (T. P. Sellinger et al., 2020). However, self-fertilization and seed-dormancy affect these ratios and the effective population size in opposing ways. Self-fertilization is decreasing the effective population size while dormant seeds are increasing it. If used simultaneously, the individual signatures of the two systems cannot get fully disentangled with eSMC. Therefore, only one of the two is accounted for at the same time. (T. P. Sellinger et al., 2020).
The analyzed dataset contains genotypic information of Chenopodium pallidicaule, commonly called canihua. The Variant Call Format (VCF) file comprises genomic data of 86 plants and nine chromosomes, respectively. Chenopodium species are predominantly autogenous (Sauer J. D., 1993), and so is canihua (Simonds N. W., 1965). Therefore, while predicting its past population history, the model will adjust for self-fertilization.
In the following, the major steps to run eSMC are described: (I) software and software packages needed, (II) choosing a representative sample of the population, (III) creating the input files, (IV) running eSMC.
The analysis is conducted in R 3.6.3 (R Development Core Team, 2019), using the platform of RStudio (RStudio Team, 2020). The Packages needed are: parallel (R Core Team, 2019), readODS (Schutten, Chan, & Leeper T. J., 2018), stringdist (van der Loo, 2014), tidyverse (Wickham et al., 2019), dplyr (Wickham, François, Henry, & Müller, 2020), installr (Galili, 2019), devtools (Wickham, Hester, & Chang, 2020), eSMC (T. Sellinger, 2020) and vcfR (Knaus & Grünwald, 2017).
# Adjust the working directory
setwd("~/Term")
# packages needed to create the input files
library(vcfR)
library(parallel)
library(readODS) #For accession/sample names
library(stringdist)
library(tidyverse)
library(dplyr)
# additional packages needed to run eSMC
library("installr")
library(devtools)
# install eSMC
path="Term/eSMC_1.0.0.tar.gz"
devtools::install_local(path)
library(eSMC)
As an initial step, we choose a representable sample of individuals. Therefore, the results of a Principal Component Analysis (PCA) are used, which were obtained from Dr. Böndel (Böndel, 2020).
Figure 1. Principal component analysis (PCA). The PCA is revealing the population structure of Chenopodium pallidicaule plants in the dataset “canihua_3”. Outliers are excluded. Three subpopulations are observed.
Figure 1 revealed that there are three subpopulations. From each subpopulation a random sample of three individuals is chosen, resulting in a sample size of nine individuals.
###########
# Identify a Sample of Individuals for the eSMC Analysis
###########
# assign individuals of each subgroup into an individual data frame
aa <- TXT[c(1:3)]
group1 <- subset(aa, EV1 > -0.25 & EV1 < -0.15 & EV2 > -0.2 & EV2 < -0.1 , select=c(sample.id,EV1,EV2))
group2 <- subset(aa, EV1 > -0.1 & EV1 < -0.0 & EV2 > -0.2 & EV2 < 0.0 , select=c(sample.id,EV1,EV2))
group3 <- subset(aa, EV1 > -0.0 & EV1 < 0.2 & EV2 > -0.1 & EV2 < 0.1 , select=c(sample.id,EV1,EV2))
# randomly choose 3 individuals from each data frame
id1 <- sample(group1$sample.id, size=3, replace= FALSE, prob = NULL)
id2 <- sample(group2$sample.id, size=3, replace= FALSE, prob = NULL)
id3 <- sample(group3$sample.id, size=3, replace= FALSE, prob = NULL)
id1
id2
id3
Revealing to the subpopulations in Figure 1 from left to right, for the first small subpopulation, the individuals B26, B23 and B24 were sampled (id1). For the second subpopulation, the individuals B35, B33 and U8 were sampled (id2). In the third and biggest subpopulation on the right, the individuals Ch_pal_05, Ch_pal_39 and Ch_pal_49 were sampled (id3). Both, the sampled individuals and the chromosomes to be analyzed are assigned to a vector.
# assign a vector for chosen individuals and chromosomes
good_ind <- c("B26", "B23", "B24", "B35", "B33", "U8", "Ch_pal_05", "Ch_pal_39", "Ch_pal_49")
good_scaffs <- c("lcl|Cp1","lcl|Cp2","lcl|Cp3","lcl|Cp4","lcl|Cp5","lcl|Cp6","lcl|Cp7","lcl|Cp8","lcl|Cp9")
As eSMC runs on individuals, a separate input file is created for each individual and each chromosome, respectively. A specific format is required for the input files. They are created as a txt files with three columns, containing the base of the individual to be analyzed in the first column and the reference base in the second one. Their position on the chromosome is noted in the third column.
To create the input files, we first read in the dataset “canihua_3.vcf”. For each of the sampled individuals and chromosomes, in “good_ind” and “good_scaffs”, we pick the information from the fixed and the gt part of the VCF, as disclosed in the code below. Then we extract the column with the single nucleotide polymorphism (SNP) information, transform it into the desired format and add the respective SNP position. We exclude homozygous sites and sites with less than 100 SNPs. The output is written into a txt file.
The name of the input files contains the method name “eSMC”, it assigns an individual number, the chromosome number and the name of the individual in the dataset. The name for the first individual and the second chromosome would be eSMC1_2_B26.
##########
# Create Input Files for the Chosen Individuals
##########
only_var <- TRUE # true to only retain heterozygous positions
kickout_less100SNPs <- TRUE # true to not record contigs with < 100 SNPs
file <- read.vcfR("canihua_3.vcf",skip = 0, nrows=-1) # read in the vcf file
sample_names <-colnames(file@gt) # take column names of genotype part
# loop over individuals and chromosomes
for (ind in 1:length(good_ind)){ # ind as index, loop over individuals
for (scaff in 1:length(good_scaffs)){ # scaff as index, loop over chromosomes
# pick data from the respective chromosome (scaff) and individual (ind)
name_pick <- c(1,grep(pattern=good_ind[ind],x = sample_names)) # select individual
SNP_pick <- which(file@fix[,"CHROM"]==good_scaffs[scaff]) # choose chromosomes from fixed part of vcf
vcf_single <- file[SNP_pick,name_pick] # create a list with both
# merge fixed and gt part into one tidy frame
haptemp <- vcfR2tidy(vcf_single,single_frame = TRUE)
# transform into input file format
# split selected colum, if ther is a "/", and inserts it into a new dataframe
hapfile <- as.data.frame(strsplit(haptemp$dat$gt_GT_alleles,"/"))
colnames(hapfile) <- NULL # relabel row names
hapfile <- as.matrix(hapfile) # turn into a matrix
hapfile <- rbind(hapfile,haptemp$dat$POS) # add the allel's position
# remove homozygous sites
if (only_var){keep1 <- stringdist(hapfile[1,],hapfile[2,],"hamming")
hapfile <- hapfile[,as.logical(keep1)]}
# only use output with at least 100 SNP's/ heterozygous positions
if (kickout_less100SNPs){if (ncol(hapfile)<100){next}}
# write data into txt files
if (!only_var){ write.table(hapfile,file=paste0("eSMCa",ind, "_",scaff,".txt"),
quote = FALSE,row.names = FALSE,col.names = FALSE)}
else { write.table(hapfile,file=paste0("eSMC",ind, "_",scaff,"_",good_ind[ind],".txt"),
quote = FALSE,row.names = FALSE,col.names = FALSE)
}
}}
The input files are saved in the working directory.
Several parameters are adjusted before running eSMC. To start, the mutation rate (mu) has been set to 6.95 10-9 mutations per site per generation (Weng et al., 2019) and the recombination rate (r) has been set to 3.6 10-8 per site per generation (Salomé et al., 2012). The values are taken from Arabidopsis thaliana, because rates for species which are more closely related to canihua could not be found. Canihua is a predominantly self-fertilizing crop (Simonds N. W., 1965), therefore eSMC is used to account for self-fertilization. Hence, the parameters to estimate the effective recombination rate (ER) and germination rate (SB) are set to FALSE, while the estimation of the selfing rate (SF) is set to TRUE. Default settings for the boundaries are kept, as they cover a very broad and biologically relevant range.
##############
# Set Parameters for eSMC
##############
mu=6.95*10^(-9) # mutation rate per position per generation
r= 3.6*10^(-8) # recombination rate per position per generation
rho=r/mu # ratio recombination/mutation
# set one to true (rest to false)
ER=F # false to estimate recombination rate
SF=T # true to estimate selfing rate
SB=F # false to estimate germination rate
# set boundaries
BoxB=c(0.05,1) # min and max value of germination rate
Boxs=c(0,0.99) # min and max value of selfing rate
In the next step, we read in the created input files. Only the files with the individuals and chromosomes on which the analysis is run on are kept in the working directory. For this analysis, for all nine sampled individuals, only the files with chromosome one to three are kept, to reduce the computing time, but still get reliable results.
In the input files, the individuals were assigned numbers. The number of the analyzed individual is adjusted for the “individual_name” and the “result”. Here, the analysis will start with the first sampled individual. The results were put into lists and stored. Their plots are displayed in the results section.
##############
# Load Input Files
##############
#First adjust number of the individual in "individual_name" the resultnumber
individual_name <- paste("eSMC",1,sep="") # create the individual_name, use pattern of input file name
single_ind <- list.files(pattern = individual_name,full.names = TRUE) # load the files with this pattern
test_data <- vector("list",length(single_ind)) # create a vector, its lenth equals nr of txt files
for (j in 1:length(single_ind)){
input_name <- single_ind[j] # []=inex, takes the value in this position
test_data[[j]] <- as.matrix(read.table(input_name)) # [[]] line of the index
}
NC <- length(test_data) # create a vector with length of number of analyzed chromosomes
##############
# Run eSMC
##############
result1= eSMC(n=30,rho=rho,test_data,BoxB=BoxB,Boxs=Boxs,SB=SB,SF=SF,Rho=ER,Check=F,NC=NC)
##############
# Save the Results and Combine Them into Lists
##############
# save the respective result
save(result1,file="eSMC_r1_B26.RData") # filename includes the number and the name of the individual
#save(result2,file="eSMC_r2_B23.RData")
#save(result3,file="eSMC_r3_B24.RData")
#save(result4,file="eSMC_r4_B35.RData")
#save(result5,file="eSMC_r5_B33.RData")
#save(result6,file="eSMC_r6_U8.RData")
#save(result7,file="eSMC_r7_Ch_pal_05.RData")
#save(result8,file="eSMC_r8_Ch_pal_39.RData")
#save(result9,file="eSMC_r9_Ch_pal_49.RData")
# create result lists of all individuals and of each subpopulation
result_list =list(result1,result2,result3,result4,result5,result6,result7,result8,result9)
pop1_list =list(result1,result2,result3)
pop2_list =list(result4,result5,result6)
pop3_list =list(result7,result8,result9)
# save the result lists
save(result_list,file="result_list.RData")
save(pop1_list,file="pop1_list.RData")
save(pop2_list,file="pop2_list.RData")
save(pop3_list,file="pop3_list.RData")
Results are plotted with the following code.
Plot_esmc_results(result_list,mu,WP=F,LIST=T,x=c(10^1.2,10^3.6),y=c(2,4)) # plot all results
par(mfrow=c(1,3))
Plot_esmc_results(pop1_list,mu,WP=F,LIST=T,x=c(10^1.2,10^3.6),y=c(2,4))
Plot_esmc_results(pop2_list,mu,WP=F,LIST=T,x=c(10^1.2,10^3.6),y=c(2,4))
Plot_esmc_results(pop3_list,mu,WP=F,LIST=T,x=c(10^1.2,10^3.6),y=c(2,4))
The eSMC result estimates the past demographic history of the canihua population, while accounting for self-fertilization.
Figure 2. Estimated demographic history of Chenopodium pallidicaule. The graph displays nine sampled individuals of a Chenopodium pallidicaule population, using eSMC to account for selfing. Chromosome one to three are used for the analysis for each individual respectively. The Mutation rate is set to 6.95 10-9 per generation per site and the recombination rate is set to 3.6 10-8.
Figure 2 displays the population size (Xi) depending on the past generations. Each line represents an estimation based on one individual and three chromosomes. For the nine sampled haplotypes the graph infers a population history from around 30 to 3000 generations into the past. The population size decreases until around 150 generations ago. Up this time, the haplotypes appear to have the same population history, with low variability between the individuals. Starting from around 70 generations ago, two opposing tendencies are displayed, the shift to an increasing population size for three individuals and a continued decline for the remaining ones.
Figure 3. Demographic history of the three sampled individuals for each subpopulation of Chenopodium pallidicaule. The demographic history is inferred using eSMC and accounting for self-fertilization. Chromosome one to three are used for the analysis for each individual respectively. The Mutation rate is set to 6.95 10-9 per generation per site and the recombination rate is set to 3.6 10-8. The first subpopulation (left) displays the individuals B26, B23 and B24, reading the from top red line to bottom one. The second subpopulation (middle) shows U8, B33 and B35. The third subpopulation (right) shows Ch_pal_05, Ch_pal_49 and Ch_pal_39.
Figure 3 shows the eSMC results for each of the three subpopulations. In the first subpopulation (left) the population size is consistently decreasing. Whereas, the second subpopulation (middle) and the third subpopulation (right) show the very recent split into the two opposing directions. For these two subpopulations, the population history of the individuals B35, Ch_pal_49 and Ch_pal_39 shows a continuously decreasing trend, while population size for the populations of the individuals U8, B33 and Ch_pal_05 starts to increase again.
The algorithm inferred the self-fertilization rate (σ) of zero and germination rate (β) of one for all sampled individuals, which implies that there is no self-fertilization and no long-term seed-banking. With three individuals and three chromosomes per individual, 2.6 Mb of data was analyzed. The average running time amounted to 90 minutes.
ESMC only needs one or two individuals and scaffolds of minimally one-megabyte length to accurately predict the population history. These basic prerequisites were met by our dataset. As the quality of the result depends more on the accuracy of the sequences than on their amount, we cannot definitely say how good our predictions are (T. P. Sellinger et al., 2020).
Due to the continuous decline in population size, it is assumed that the population goes through a prolonged bottleneck. Starting from around 70 generations ago, some individuals of the second and the third subpopulation show diverging trends for their population size. This divergence could origin from beneficial mutations, which could have occurred simultaneously in both subgroups, or trace back to a common ancestor. The eSMC result on the demographic history of a population, and diverging trends can be used as base for further models and to increase selection gain. However, recombination events might be displayed with a certain delay and sudden or recent demographic events would eventually not be displayed accurately (T. P. Sellinger et al., 2020). It is important to note that the same number of individuals was sampled of each sub-population, even if the sub-populations’ size varies. Repeating the analysis with different individuals and using certain percentage of each population could further justify and clarify the observed tendencies. Increasing the number of hidden states and the number of included chromosomes could help to increase the precision of the result. However, this would increase the computation time.
Because mutation and recombination rate are held constant over the displayed time-period, only an average effect of real values of the selfing rates can be estimated within the time window. However, the estimated self-fertilization of 0 constitutes a contrast to literature, that describes canihua as a mainly self-fertilizing crop (Sauer J. D., 1993; Simonds N. W., 1965). That the model accounts only for homologous recombination events, but not for, for example chromosomatic mutations, could be one reason for an underestimated self-fertilization rate (T. P. Sellinger et al., 2020).
Similar to the PCA, the ALStructure analysis also suggested three subpopulations (Herbold, 2020). Results of other students, who also applied eSMC showed the same declining tendency for the population size (Damm, Herrmann, & Sanchez M. A., 2020). However, their time frame was different. One reason for this could be the different priors for the mutation and recombination rate (Mather, Traves, & Ho, 2020). Therefore, mutation and recombination rates of canihua would be needed to estimate the real time frame. These student’s data also showed a difference in more recent generations, where the population size either continued to decrease or split up with a different pattern (Damm et al., 2020).These divergences could arise due to different samples or different datasets.