The ecological Sequentially Markovian Coalescent (eSMC) eSMC is a novel method, described in the paper “Inference of past demography, dormancy and self-fertilization rates from whole genome sequence data” published in Pols Genetics in April, 2020 (Sellinger et al. 2020b). (source correct?). The execution of model is further described at and can be downloaded from the following GitHub repository: https://github.com/TPPSellinger/eSMC.
Infers population size and displays changes over past generations. In contrast to most other currently used models, that are based on Sequential Markovian Coalescence theory, it runs on full genome polymorphism data. (Ignore the next 3 lines, I need to read about it again before formulating) -(eg. As key requirement to generate a null model for selection tests ->???) - Of 2 haplotypes, a sequencing and one reference genome - The model is based on the coalescent theory using the probability of the event read more about it. It also allows to account for and estimate either self- fertilization or rates of seed or egg-banking, for example caused by low germination rates due to dormancy.
(## Basic principle of the model)
Prior to running the analysis several parameters need to be known: (I) the individuals to run it on, because the analysis runs on individual haplotypes (II) the chromosome number (scaffold number), the recombination and mutation rate per site per generation, and whether to account for and estimate the recombination, the selfing or the germination rate.
Many currently used models are based on the assumptions of the Wright- Fisher model. In the presence of self- fertilization and seed dormancy these assumptions are violated. To account for these ecological and life history traits, eSMC uses two ratios, the molecular ratio of recombination (r) over mutation (µ) rate \(r/µ\) and effective ratio of the two parameters \(ρ/θ\). With the assumption of the Wright- Fisher model, the ratios are equal, However, self- fertilization and seed- banking causes the ratios to differ. Accounting for this differentiation allows to the model to predict population size with a higher accuracy. Set priors biases?
However, self-fertilization and seed-dormancy effect these ratios and effective population size in opposing ways. Self-fertilization is decreasing while dormant seeds are increasing the effective population size. Therefore, only one of the options is analyzed at the time. If used simultaneously, the individual signatures of the two systems cannot get fully disentangled with eSMC. Therefore, only one is accounted for at the time (Sellinger et al. 2020b). The analyzed dataset contains genotypic information of Chenopodium pallidicaule commonly called canihua. The vcf file comprises genomic data of 86 plants and nine chromosomes, respectively. Chenopodium species are predominantly autogenous (Sauer J. D. 1993), and so is canihua (Simmonds N. W. 1965). Therefore, in the following analysis is set to adjust for self-fertilization. Need to adjust for self-fertilization if reveling the population size, eSMC – suitable method
In the following, the major steps to run eSMC are described: (I) software and software packages needed (II) choosing a representative sample of the population, based on the population structure (III) creating the input files, (IV) running eSMC.
The analysis is conducted in R 3.6.3 (R Development Core Team 2019), using the platform of RStudio (RStudio Team 2020). The Packages needed are: parallel (R Core Team 2019), readODS (Schutten et al. 2018), stringdist (van der Loo 2014), tidyverse (Wickham et al. 2019), dplyr (Wickham et al. 2020a), installr (Galili 2019), devtools (Wickham et al. 2020b), eSMC (Sellinger 2020a) and vcfR (Knaus and Grünwald 2017).
As an initial step, we choose a representable sample of individuals. Therefore, the results of a Principal Component Analysis (PCA) are used, which were obtained by Dr. Böndel (Böndel 2020).
########### !REMEMBER TO INCLUDE FIGURE LABEL
# Display PCA results
###########
# Adjust the working directory to read in the file with the PCA
TXT<-read.table("C:/Users/koche/Documents/Term/canihua_3/canihua_3_no_5.txt") # read in txt file of with PCA result
# Plot PCA result to get an overview over the population structure
plot(TXT$EV1,TXT$EV2,xlab="Ev1 (4.9 %)",ylab="Ev2 (3.4%)",
xlim=c(-0.4,0.16), #xlim=c(-0.4,0.45),
ylim=c(-0.16,0.35), #ylim=c(-0.9,0.35),
col="black", # this way we basically plot an "empty" graph to which we can then
# add the groups in different colors
pch=1,las=1)
– ## remember Figure - heading
The plot revealed that there are three subpopulations. From each subpopulation a random sample of three individuals is chosen, resulting in a sample size of nine individuals.
###########
# Identify a sample of individuals for the eSMC analysis
###########
# Assign individuals of each subgroup into an individual dataframe
aa <- TXT[c(1:3)]
group1 <- subset(aa, EV1 > -0.25 & EV1 < -0.15 & EV2 > -0.2 & EV2 < -0.1 , select=c(sample.id,EV1,EV2))
group2 <- subset(aa, EV1 > -0.1 & EV1 < -0.0 & EV2 > -0.2 & EV2 < 0.0 , select=c(sample.id,EV1,EV2))
group3 <- subset(aa, EV1 > -0.0 & EV1 < 0.2 & EV2 > -0.1 & EV2 < 0.1 , select=c(sample.id,EV1,EV2))
# Rrandomly choose 3 individuals from each data frame
id1 <- sample(group1$sample.id, size=3, replace= FALSE, prob = NULL)
id2 <- sample(group2$sample.id, size=3, replace= FALSE, prob = NULL)
id3 <- sample(group3$sample.id, size=3, replace= FALSE, prob = NULL)
id1
id2
id3
Both, the sampled individuals and the chromosomes to be analyzed are assigned a vector.
# Assign a vector for chosen individuals and chromosomes
good_ind <- c("B26", "B23", "B24", "B35", "B33", "U8", "Ch_pal_05", "Ch_pal_39", "Ch_pal_49")
good_scaffs <- c("lcl|Cp1","lcl|Cp2","lcl|Cp3","lcl|Cp4","lcl|Cp5","lcl|Cp6","lcl|Cp7","lcl|Cp8","lcl|Cp9")
As eSMC runs on individuals, a separate input file is needed for each individual and each chromosome, respectively. The input files are created to fit a specific format, a txt file with three columns, containing the base in the first and alternative/reference base in the second and their position on the chromosome in the third column. (? How to express -> allele, allele, segregation site??)
–?? reference and the alternative allele in the first 2 columns and the position in the third column
To create the input files, we first read in the dataset “canihua_3.vcf”. For each chosen individuals and chromosome, in “good_ind” and “good_scaffs, pick the information about both from the fixed and the gt part of the VCF, as described in the code above. Extract the information about the allele, the alternative allele. Bring it into the desired format and add the respective allele position. Exclude homozygous sites and sites with less then 100 snps. Write the output into a txt files.
The name of the input files contains the method name “eSMC”, the number of the individual, the chromosome number and the name of the individual in the dataset, for example: eSMC1_1_B26.
##########
# Create input files for the chosen individuals
##########
only_var <- TRUE # True to only retain heterozygous positions
kickout_less100SNPs <- TRUE # True to not record contigs with < 100 SNPs
file <- read.vcfR("canihua_3.vcf",skip = 0, nrows=-1) # Read in the vcf file
sample_names <-colnames(file@gt) # Take column names of genotype part
# Loop over individuals and chromosomes
for (ind in 1:length(good_ind)){ # ind as index, contains numbers from 1 to 9
for (scaff in 1:length(good_scaffs)){ # scaff as index, contains numbers from 1 to 9 -> can express it like that???
# Pick data from the respective chromosome (scaff) and individual (ind)
name_pick <- c(1,grep(pattern=good_ind[ind],x = sample_names)) # select individual
SNP_pick <- which(file@fix[,"CHROM"]==good_scaffs[scaff]) # choose chromosomes from fixed part of vcf
vcf_single <- file[SNP_pick,name_pick] # create a list with both
# merge fixed and gt part into one tidy frame
haptemp <- vcfR2tidy(vcf_single,single_frame = TRUE)
# Transform into input file format
# split selected colum, if ther is a "/", and inserts it into a new dataframe
hapfile <- as.data.frame(strsplit(haptemp$dat$gt_GT_alleles,"/"))
colnames(hapfile) <- NULL # relabel row names
hapfile <- as.matrix(hapfile) # turn into a matrix
hapfile <- rbind(hapfile,haptemp$dat$POS) # add the allel's position
# Remove homozygous sites
if (only_var){keep1 <- stringdist(hapfile[1,],hapfile[2,],"hamming")
hapfile <- hapfile[,as.logical(keep1)]}
# Only use output with at least 100 SNP's/ heterozygous positions
if (kickout_less100SNPs){if (ncol(hapfile)<100){next}}
# Write data into txt files
if (!only_var){ write.table(hapfile,file=paste0("eSMCa",ind, "_",scaff,".txt"),
quote = FALSE,row.names = FALSE,col.names = FALSE)}
else { write.table(hapfile,file=paste0("eSMC",ind, "_",scaff,"_",good_ind[ind],".txt"),
quote = FALSE,row.names = FALSE,col.names = FALSE)
}
}}
The input files are saved in the working directory.
Several parameters are adjusted before running eSMC. To start, the mutation rate (mu) has been set to 6.9510-9 mutations peer site per generation (Weng et al. 2019) and the recombination rate (r) has been set to 3.610-8 per site per generation (Salomé et al. 2012). The values are taken from Arabidopsis thaliana, because none were found for species which are more closely related to canihua. Canihua is a predominantly self-fertilizing crop (Simmonds N. W. 1965), therefore eSMC is used to account for self-fertilization. Hence, the parameters to estimate the effective recombination rate (ER) and germination rate (SB) are set to FALSE, while the estimation of the selfing rate (SF) is set to TRUE. Default settings for the boundaries are kept, as they cover a very broad and biologically relevant range.
In the next step, we read in the input files. Only the files with the individuals and chromosomes on which the analysis will be run on are kept in the working directory. For this analysis, the files of the nine sampled individuals in “good_ind”, for chromosome one to three are kept, respectively.
The code does not automatically loop over these individuals. In input files, the individuals are assigned numbers, which are used to pic the respective input files. Therefore, this number is adjusted for each individual to be analyzed. Here, the analysis will start with the first sampled individual.
The results were stored and included into a list. This is then plotted to display the results, if “LIST” is set to TRUE- adjust boundaries?.
For the eSMC describe any adjustments?
Parameters for eSMC - n, symbol nr - n nr of hidden states (coalescence times) – coalescence events - symbol nr – nr of symbols used for zipping process?? - test data- name – what did we adjust
Parameters for plotting - name, - set list – TRUE to plot results of multiple individuals into one plot - adjust boundaries
##############
# Set parameters for eSMC
##############
mu=6.95*10^(-9) # mutation rate per position per generation
r= 3.6*10^(-8) # recombination rate per position per generation
rho=r/mu # ratio recombination/mutation
# Set one to TRUE (rest to FALSE)
ER=F # false to estimate recombination rate
SF=T # true to estimate selfing rate
SB=F # false to estimate germination rate
# Set boundaries
BoxB=c(0.05,1) # min and max value of germination rate
Boxs=c(0,0.99) # min and max value of selfing rate
##############
# Load input files
##############
#First adjust number of the individual in "individual_name" the resultnumber
individual_name <- paste("eSMC",1,sep="") # Create the individual_name, use pattern of input file name
single_ind <- list.files(pattern = individual_name,full.names = TRUE) # load the files with this pattern
test_data <- vector("list",length(single_ind)) # create a vector, its lenth equals nr of txt files
for (j in 1:length(single_ind)){
input_name <- single_ind[j] # []=inex, takes the value in this position
test_data[[j]] <- as.matrix(read.table(input_name)) # [[]] line or this index
}
NC <- length(test_data) # create a vector with length of number of analyzed chromosomes
##############
# Run eSMC
##############
result1= eSMC(n=30,rho=rho,test_data,BoxB=BoxB,Boxs=Boxs,SB=SB,SF=SF,Rho=ER,Check=F,NC=NC)
# Save the respective result
save(result1,file="eSMC_r1_B26.RData")
#save(result2,file="eSMC_r2_B23.RData")
#save(result3,file="eSMC_r3_B24.RData")
#save(result4,file="eSMC_r4_B35.RData")
#save(result5,file="eSMC_r5_B33.RData")
#save(result6,file="eSMC_r6_U8.RData")
#save(result7,file="eSMC_r7_Ch_pal_05.RData")
#save(result8,file="eSMC_r8_Ch_pal_39.RData")
#save(result9,file="eSMC_r9_Ch_pal_49.RData")
# Create result lists of all individuals and of each subpopulation
result_list =list(eSMC_r1_B26,eSMC_r2_B23,eSMC_r3_B24,
eSMC_r4_B35,eSMC_r5_B33,eSMC_r6_U8,
eSMC_r7_Ch_pal_05,eSMC_r8_Ch_pal_39,eSMC_r9_Ch_pal_49)
pop1_list =list(eSMC_r1_B26,eSMC_r2_B23,eSMC_r3_B24)
pop2_list =list(eSMC_r4_B35,eSMC_r5_B33,eSMC_r6_U8)
pop3_list =list(eSMC_r7_Ch_pal_05,eSMC_r8_Ch_pal_39,eSMC_r9_Ch_pal_49)
# Save the result lists
save(result_list,file="result_list.RData")
save(pop1_list,file="pop1_list.RData")
save(pop2_list,file="pop2_list.RData")
save(pop3_list,file="pop3_list.RData")
How to decide which output should be printed and which one not? change output of RMC put 3 graphs into 1 line Labelling of the curves How to lable pictures Describe the pictures
The eSMC result estimates the canihua population’s past demographic history, while accounting for self-fertilization. The plot displays population size (Xi) depending on the past generations, both on a logarithmic scale. Each line represents the estimation based on one individual. For the nine sampled haplotypes it infers population history from around 30 to around 3000 generations ago. In general, the population size decreases. The haplotypes appear to have the same population history until around 140 generations ago. Within this time frame, there is low variability between the individuals. More recent generations, starting from 70 generations ago, diverge. There are two tendencies. The population of three individuals displays an increase in size. While the population of rest of the individuals continues to decline.
Figure 1. Estimated demographic history of Chenopodium pallidicaule. The graph displays nine sampled individuals of a Chenopodium pallidicaule population, using eSMC to account for selfing. Chromosome one to three are used for the analysis for each individual respectively. The Mutation rate is set to 6.9510-9 per generation per site and the recombination rate is set to 3.610-8.
load("~/Term/pop1_list.RData")
load("~/Term/pop2_list.RData")
load("~/Term/pop3_list.RData")
par(mfrow=c(1,3))
Plot_esmc_results(pop1_list,mu,WP=F,LIST=T,x=c(10^1.2,10^3.6),y=c(2,4))
Plot_esmc_results(pop2_list,mu,WP=F,LIST=T,x=c(10^1.2,10^3.6),y=c(2,4))
Plot_esmc_results(pop3_list,mu,WP=F,LIST=T,x=c(10^1.2,10^3.6),y=c(2,4))
**Figure 2. Estimated demographic history of Chenopodium pallidicaule*.** The graphs display the three sampled individuals of Chenopodium pallidicaule for each subpopulation, the first subpopulation (left), the second subpopulation (middle) and the third subpopulation (right). The demographic history is inferred using eSMC and accounting for self-fertilization. Chromosome one to three are used for the analysis for each individual respectively. The Mutation rate is set to 6.9510-9 per generation per site and the recombination rate is set to 3.610-8.
par(mfrow=c(1,1))
When separately looking at the three subpopulations, the results show that in the first subpopulation (left), the population size is consistently decreasing. Whereas, the second (middle) and the third (right) subpopulation show a very recent split into the two opposing directions, a further decrease and an increase in population size.
The algorithm also inferred a self-fertilization rate (σ) of zero and germination rate (β) of 1 for all sampled individuals, which implies that there no self-fertilization and no long term seed-banking. The estimated effective mutation rates varied from 3.3210-5 to 3.9910-5. The estimated effective recombination rates varied from 6.4110-6 to 7.4810-6.
With three input files, three chromosomes analyzed per individual, 2.6 Mb of data was analyzed. The average running time was 90 minutes.
Infers population history o The algorithm is run to infer β, σ and χt by fixing or simultaneously inferring the ratio p/θ given an initial value of this ratio o the algorithm infers σ, β, and χ ~ t~ using the ratio p/θ , which is inferred by the model and uses the initial ratio for r/µ as a prior, using a Bayesian approach. - What does the result – table tell me? o Parameters inferred are What is LH & Tc – LH – likelihood self-fertilization rates (σ) of 0, no self- fertilization – T – transition probability in continuous time - strasition occurs only if there is a recombination event germination rate (β) of 1, inplying a 100 percent germination (values for mutation rate and recombination rate? -> changed) o population size
About the results section - How can I best present the output? - • Record the exact codes you use to make it reproducible. - • If you get the results, look and think about them critically: - – Do they make sense? - – Do they match your expectations? - – If not, check your codes carefully and eventually run the analysis again - • Share your results with the others of your group, especially when they need them to do their analyses or to discuss their results. - • Prepare graphs/tables to best present your results.
How well the plotted results fit to the data – dep on the dataset T – time to the - desc in package - L =
For eSMC only needs one or two individuals and scaffolds of minimum 1 MB length to accurately predict population history. Both prerequisites are met by our dataset. As the quality of the result depends more on the accuracy of the sequence rather than on the amount, we cannot definitely say how good our predictions are.
Due to the continuous decline off the population size, it can be assumed, that the population goes through a prolonged bottleneck.
Starting from around 70 generations ago, some individuals of the second and the third subpopulation indicated a decreasing and some an increasing population size. This divergence could origin from beneficial mutations, which occurred simultaneously in bout subgroups. However, recombination events might be displayed with a certain delay. In this way the model can be used as base for further selection tests Can be used as base for further selection tests on these individuals
Population structure From our result, a population structure of 2 subpoulations would be expected instead of the three revealed by the PCA. However, results of an Alstructure analysis did also support the results obtained by the PCA.
Other eSMC results Results of other student showed a similar decline population structure, however the decrease continued until the last 15 generation until it started to diverge, or the tendency of the decline did not change at all.
Alexa’s
RMD Can’t load tidyverse& devtools doesn’t run through ### Evaluation criteria • The Rmarkdown file should be executable without errors when knitting it to pdf. All code documenting the analyses should be displayed in the pdf. Both, the RMarkdown file and the pdf (knitted from the respective Rmd) have to be submitted. Details about where the two files have to be uploaded will be communicated closer to the submission date. • The students should demonstrate in the term paper – that they understood what the method is about – that they carefully thought about the details of running the method (including choice of e.g. parameters, settings, data subsets, … ) – that they are able to discuss their own results in the context with other results • The analysis should be repeatable with the information given in the term paper. • The general appearance of the term paper will be taken into account for grading.
Method Description of the general method (+ relevant background information) • What does the analysis do? • What can the analysis tell you about the data set? - what do we want to do - Why
Specific parameter settings of the analysis of the data set ( Analysis should be repeatable with information given within this section)
description of results – With representative tables and/or graphs, well annotated (e.g. axis labels)
– Discussion of own results in context with the data set information, and with results of the other analyses of the group. ## Dictionary Convergence - the fact that two or more things, ideas, etc. become similar or come together: a convergence of interests/opinions/ideas The convergence of pop-cultural trends and technological progress gave us camera phones and the “selfie”.
Mark word put strg, enter -> displays [] – index [[]] – choose column/ write in column… $- also choose column
Rm() – removes from environment setwd(“~/Term”) #install.packages(“vctrs”) library(vcfR)
push in console hit esc – stop running process # x[2,] - choose 1rst column of x, when in bracket - obj good ind - chooses 1rst, 2nd… column paste: put all things behind each other ##Helpful code to remember for RM rMarkdown basics: https://rmarkdown.rstudio.com/authoring_basics.html xxx - italize \(xxx\) - inline equation
Include= True – show code, Eval= False - don’t run code Echo= false – show output only - Hide code - Same decreasing trend – - Inference are very recent, – recent events might not be that accurate very old recombination events can not be recovered MB -
Include which info from other students
Selfing rate – not comply with literature