Method

The ecological Sequentially Markovian Coalescent (eSMC) eSMC is a novel method, described in the paper “Inference of past demography, dormancy and self-fertilization rates from whole genome sequence data” published in Pols Genetics in April, 2020 (Sellinger et al. 2020b). (source correct?). The execution of model is further described at and can be downloaded from the following GitHub repository: https://github.com/TPPSellinger/eSMC.

Infers population size and displays changes over past generations. In contrast to most other currently used models, that are based on Sequential Markovian Coalescence theory, it runs on full genome polymorphism data. (Ignore the next 3 lines, I need to read about it again before formulating) -(eg. As key requirement to generate a null model for selection tests ->???) - Of 2 haplotypes, a sequencing and one reference genome - The model is based on the coalescent theory using the probability of the event  read more about it. It also allows to account for and estimate either self- fertilization or rates of seed or egg-banking, for example caused by low germination rates due to dormancy.

(## Basic principle of the model)

Prior to running the analysis several parameters need to be known: (I) the individuals to run it on, because the analysis runs on individual haplotypes (II) the chromosome number (scaffold number), the recombination and mutation rate per site per generation, and whether to account for and estimate the recombination, the selfing or the germination rate.

Many currently used models are based on the assumptions of the Wright- Fisher model. In the presence of self- fertilization and seed dormancy these assumptions are violated. To account for these ecological and life history traits, eSMC uses two ratios, the molecular ratio of recombination (r) over mutation (µ) rate \(r/µ\) and effective ratio of the two parameters \(ρ/θ\). With the assumption of the Wright- Fisher model, the ratios are equal, However, self- fertilization and seed- banking causes the ratios to differ. Accounting for this differentiation allows to the model to predict population size with a higher accuracy. Set priors  biases?

However, self-fertilization and seed-dormancy effect these ratios and effective population size in opposing ways. Self-fertilization is decreasing while dormant seeds are increasing the effective population size. Therefore, only one of the options is analyzed at the time. If used simultaneously, the individual signatures of the two systems cannot get fully disentangled with eSMC. Therefore, only one is accounted for at the time (Sellinger et al. 2020b). The analyzed dataset contains genotypic information of Chenopodium pallidicaule commonly called canihua. The vcf file comprises genomic data of 86 plants and nine chromosomes, respectively. Chenopodium species are predominantly autogenous (Sauer J. D. 1993), and so is canihua (Simmonds N. W. 1965). Therefore, in the following analysis is set to adjust for self-fertilization. Need to adjust for self-fertilization if reveling the population size, eSMC – suitable method

In the following, the major steps to run eSMC are described: (I) software and software packages needed (II) choosing a representative sample of the population, based on the population structure (III) creating the input files, (IV) running eSMC.

Software and Software Packages

The analysis is conducted in R 3.6.3 (R Development Core Team 2019), using the platform of RStudio (RStudio Team 2020). The Packages needed are: parallel (R Core Team 2019), readODS (Schutten et al. 2018), stringdist (van der Loo 2014), tidyverse (Wickham et al. 2019), dplyr (Wickham et al. 2020a), installr (Galili 2019), devtools (Wickham et al. 2020b), eSMC (Sellinger 2020a) and vcfR (Knaus and Grünwald 2017).

Sampling the Individuals

As an initial step, we choose a representable sample of individuals. Therefore, the results of a Principal Component Analysis (PCA) are used, which were obtained by Dr. Böndel (Böndel 2020).

###########  !REMEMBER TO INCLUDE FIGURE LABEL
# Display PCA results
###########

# Adjust the working directory to read in the file with the PCA
TXT<-read.table("C:/Users/koche/Documents/Term/canihua_3/canihua_3_no_5.txt") # read in txt file of with PCA result

# Plot PCA result to get an overview over the population structure
plot(TXT$EV1,TXT$EV2,xlab="Ev1 (4.9 %)",ylab="Ev2 (3.4%)",
    xlim=c(-0.4,0.16), #xlim=c(-0.4,0.45), 
    ylim=c(-0.16,0.35), #ylim=c(-0.9,0.35), 
     col="black", # this way we basically plot an "empty" graph to which we can then
     # add the groups in different colors
     pch=1,las=1)

– ## remember Figure - heading

The plot revealed that there are three subpopulations. From each subpopulation a random sample of three individuals is chosen, resulting in a sample size of nine individuals.

###########
# Identify a sample of individuals for the eSMC analysis
###########

# Assign individuals of each subgroup into an individual dataframe 
aa <- TXT[c(1:3)]
group1 <- subset(aa, EV1 > -0.25 & EV1 < -0.15 & EV2 > -0.2 & EV2 < -0.1 , select=c(sample.id,EV1,EV2)) 
group2 <- subset(aa, EV1 > -0.1  & EV1 < -0.0  & EV2 > -0.2 & EV2 <  0.0 , select=c(sample.id,EV1,EV2)) 
group3 <- subset(aa, EV1 > -0.0  & EV1 <  0.2  & EV2 > -0.1 & EV2 <  0.1 , select=c(sample.id,EV1,EV2)) 

# Rrandomly choose 3 individuals from each data frame
id1 <- sample(group1$sample.id, size=3, replace= FALSE, prob = NULL)
id2 <- sample(group2$sample.id, size=3, replace= FALSE, prob = NULL)
id3 <- sample(group3$sample.id, size=3, replace= FALSE, prob = NULL)
id1
id2
id3

Both, the sampled individuals and the chromosomes to be analyzed are assigned a vector.

# Assign a vector for chosen individuals and chromosomes
good_ind <- c("B26", "B23", "B24", "B35", "B33", "U8", "Ch_pal_05", "Ch_pal_39", "Ch_pal_49")
good_scaffs <- c("lcl|Cp1","lcl|Cp2","lcl|Cp3","lcl|Cp4","lcl|Cp5","lcl|Cp6","lcl|Cp7","lcl|Cp8","lcl|Cp9")

Creating the input files for the eSMC analysis

As eSMC runs on individuals, a separate input file is needed for each individual and each chromosome, respectively. The input files are created to fit a specific format, a txt file with three columns, containing the base in the first and alternative/reference base in the second and their position on the chromosome in the third column. (? How to express -> allele, allele, segregation site??)

–?? reference and the alternative allele in the first 2 columns and the position in the third column

To create the input files, we first read in the dataset “canihua_3.vcf”. For each chosen individuals and chromosome, in “good_ind” and “good_scaffs, pick the information about both from the fixed and the gt part of the VCF, as described in the code above. Extract the information about the allele, the alternative allele. Bring it into the desired format and add the respective allele position. Exclude homozygous sites and sites with less then 100 snps. Write the output into a txt files.

The name of the input files contains the method name “eSMC”, the number of the individual, the chromosome number and the name of the individual in the dataset, for example: eSMC1_1_B26.

##########
# Create input files for the chosen individuals 
##########

only_var <- TRUE            # True to only retain heterozygous positions
kickout_less100SNPs <- TRUE # True to not record contigs with < 100 SNPs

file <- read.vcfR("canihua_3.vcf",skip = 0, nrows=-1) # Read in the vcf file
sample_names <-colnames(file@gt) # Take column names of genotype part

# Loop over individuals and chromosomes
for (ind in 1:length(good_ind)){         # ind as index, contains numbers from 1 to 9
   for (scaff in 1:length(good_scaffs)){ # scaff as index, contains numbers from 1 to 9 -> can express it like that???
      
      # Pick data from the respective chromosome (scaff) and individual (ind) 
      name_pick <- c(1,grep(pattern=good_ind[ind],x = sample_names)) # select individual
      SNP_pick  <- which(file@fix[,"CHROM"]==good_scaffs[scaff])       # choose chromosomes from fixed part of vcf
      vcf_single <- file[SNP_pick,name_pick] # create a list with both 
     
      # merge fixed and gt part into one tidy frame
      haptemp <- vcfR2tidy(vcf_single,single_frame = TRUE)  
     
      # Transform into input file format
      # split selected colum, if ther is a "/", and inserts it into a new dataframe
      hapfile <- as.data.frame(strsplit(haptemp$dat$gt_GT_alleles,"/")) 
      colnames(hapfile) <- NULL # relabel row names
      hapfile <- as.matrix(hapfile) # turn into a matrix
      hapfile <- rbind(hapfile,haptemp$dat$POS) # add the allel's position
      
      # Remove homozygous sites
      if (only_var){keep1 <- stringdist(hapfile[1,],hapfile[2,],"hamming")  
                    hapfile <- hapfile[,as.logical(keep1)]}                 
      # Only use output with at least 100 SNP's/  heterozygous positions
      if (kickout_less100SNPs){if (ncol(hapfile)<100){next}}   
      
      # Write data into txt files
      if (!only_var){ write.table(hapfile,file=paste0("eSMCa",ind, "_",scaff,".txt"),  
                      quote = FALSE,row.names = FALSE,col.names = FALSE)} 
         else       { write.table(hapfile,file=paste0("eSMC",ind, "_",scaff,"_",good_ind[ind],".txt"),
                      quote = FALSE,row.names = FALSE,col.names = FALSE)              
                    }
   }}  

The input files are saved in the working directory.

Perform eSMC

Several parameters are adjusted before running eSMC. To start, the mutation rate (mu) has been set to 6.9510-9 mutations peer site per generation (Weng et al. 2019) and the recombination rate (r) has been set to 3.610-8 per site per generation (Salomé et al. 2012). The values are taken from Arabidopsis thaliana, because none were found for species which are more closely related to canihua. Canihua is a predominantly self-fertilizing crop (Simmonds N. W. 1965), therefore eSMC is used to account for self-fertilization. Hence, the parameters to estimate the effective recombination rate (ER) and germination rate (SB) are set to FALSE, while the estimation of the selfing rate (SF) is set to TRUE. Default settings for the boundaries are kept, as they cover a very broad and biologically relevant range.

In the next step, we read in the input files. Only the files with the individuals and chromosomes on which the analysis will be run on are kept in the working directory. For this analysis, the files of the nine sampled individuals in “good_ind”, for chromosome one to three are kept, respectively.

The code does not automatically loop over these individuals. In input files, the individuals are assigned numbers, which are used to pic the respective input files. Therefore, this number is adjusted for each individual to be analyzed. Here, the analysis will start with the first sampled individual.

The results were stored and included into a list. This is then plotted to display the results, if “LIST” is set to TRUE- adjust boundaries?.

For the eSMC  describe any adjustments?

Parameters for eSMC - n, symbol nr - n nr of hidden states (coalescence times) – coalescence events - symbol nr – nr of symbols used for zipping process?? - test data- name – what did we adjust

Parameters for plotting - name, - set list – TRUE to plot results of multiple individuals into one plot - adjust boundaries

##############
# Set parameters for eSMC
##############

mu=6.95*10^(-9) # mutation rate per position per generation 
r= 3.6*10^(-8)  # recombination rate per position per generation
rho=r/mu        # ratio recombination/mutation 

# Set one to TRUE (rest to FALSE) 
ER=F # false to estimate recombination rate
SF=T # true to estimate selfing rate
SB=F # false to estimate germination rate

# Set boundaries
BoxB=c(0.05,1) #  min and max value of germination rate 
Boxs=c(0,0.99) #  min and max value of selfing rate
##############
# Load input files
##############
 
#First adjust number of the individual in "individual_name" the resultnumber
individual_name <- paste("eSMC",1,sep="")  # Create the individual_name, use pattern of input file name
single_ind <- list.files(pattern = individual_name,full.names = TRUE) # load the files with this pattern 

test_data <- vector("list",length(single_ind)) # create a vector, its lenth equals nr of  txt files

for (j in 1:length(single_ind)){
    input_name <- single_ind[j]  # []=inex, takes the value in this position
    test_data[[j]] <- as.matrix(read.table(input_name))   # [[]] line or this index
    }
  
NC <- length(test_data) # create a vector with length of number of analyzed chromosomes

##############
# Run eSMC
##############
result1= eSMC(n=30,rho=rho,test_data,BoxB=BoxB,Boxs=Boxs,SB=SB,SF=SF,Rho=ER,Check=F,NC=NC) 

# Save the respective result
save(result1,file="eSMC_r1_B26.RData") 
#save(result2,file="eSMC_r2_B23.RData")
#save(result3,file="eSMC_r3_B24.RData")
#save(result4,file="eSMC_r4_B35.RData")
#save(result5,file="eSMC_r5_B33.RData")
#save(result6,file="eSMC_r6_U8.RData")
#save(result7,file="eSMC_r7_Ch_pal_05.RData")
#save(result8,file="eSMC_r8_Ch_pal_39.RData")
#save(result9,file="eSMC_r9_Ch_pal_49.RData")

# Create result lists of all individuals and of each subpopulation
result_list =list(eSMC_r1_B26,eSMC_r2_B23,eSMC_r3_B24,
                  eSMC_r4_B35,eSMC_r5_B33,eSMC_r6_U8,
                  eSMC_r7_Ch_pal_05,eSMC_r8_Ch_pal_39,eSMC_r9_Ch_pal_49)
pop1_list =list(eSMC_r1_B26,eSMC_r2_B23,eSMC_r3_B24)
pop2_list =list(eSMC_r4_B35,eSMC_r5_B33,eSMC_r6_U8)
pop3_list =list(eSMC_r7_Ch_pal_05,eSMC_r8_Ch_pal_39,eSMC_r9_Ch_pal_49)

# Save the result lists
save(result_list,file="result_list.RData")
save(pop1_list,file="pop1_list.RData")
save(pop2_list,file="pop2_list.RData")
save(pop3_list,file="pop3_list.RData")

Results

How to decide which output should be printed and which one not?  change output of RMC  put 3 graphs into 1 line  Labelling of the curves  How to lable pictures  Describe the pictures

The eSMC result estimates the canihua population’s past demographic history, while accounting for self-fertilization. The plot displays population size (Xi) depending on the past generations, both on a logarithmic scale. Each line represents the estimation based on one individual. For the nine sampled haplotypes it infers population history from around 30 to around 3000 generations ago. In general, the population size decreases. The haplotypes appear to have the same population history until around 140 generations ago. Within this time frame, there is low variability between the individuals. More recent generations, starting from 70 generations ago, diverge. There are two tendencies. The population of three individuals displays an increase in size. While the population of rest of the individuals continues to decline.

 **Figure 1. Estimated demographic history of *Chenopodium pallidicaule*.** The graph displays nine sampled individuals of a Chenopodium pallidicaule population, using eSMC to account for selfing. Chromosome one to three are used for the analysis for each individual respectively. The Mutation rate is set to 6.95*10^-9^ per generation per site and the recombination rate is set to 3.6*10^-8^.

Figure 1. Estimated demographic history of Chenopodium pallidicaule. The graph displays nine sampled individuals of a Chenopodium pallidicaule population, using eSMC to account for selfing. Chromosome one to three are used for the analysis for each individual respectively. The Mutation rate is set to 6.9510-9 per generation per site and the recombination rate is set to 3.610-8.

load("~/Term/pop1_list.RData")
load("~/Term/pop2_list.RData")
load("~/Term/pop3_list.RData")

par(mfrow=c(1,3))
Plot_esmc_results(pop1_list,mu,WP=F,LIST=T,x=c(10^1.2,10^3.6),y=c(2,4))
Plot_esmc_results(pop2_list,mu,WP=F,LIST=T,x=c(10^1.2,10^3.6),y=c(2,4))
Plot_esmc_results(pop3_list,mu,WP=F,LIST=T,x=c(10^1.2,10^3.6),y=c(2,4))
 ***Figure 2. Estimated demographic history of *Chenopodium pallidicaule*.** The graphs display the three sampled individuals of *Chenopodium pallidicaule* for each subpopulation, the first subpopulation (left), the second subpopulation (middle) and the third subpopulation (right). The demographic history is inferred using eSMC and accounting for self-fertilization. Chromosome one to three are used for the analysis for each individual respectively. The Mutation rate is set to 6.95*10^-9^ per generation per site and the recombination rate is set to 3.6*10^-8^.

**Figure 2. Estimated demographic history of Chenopodium pallidicaule*.** The graphs display the three sampled individuals of Chenopodium pallidicaule for each subpopulation, the first subpopulation (left), the second subpopulation (middle) and the third subpopulation (right). The demographic history is inferred using eSMC and accounting for self-fertilization. Chromosome one to three are used for the analysis for each individual respectively. The Mutation rate is set to 6.9510-9 per generation per site and the recombination rate is set to 3.610-8.

par(mfrow=c(1,1))

When separately looking at the three subpopulations, the results show that in the first subpopulation (left), the population size is consistently decreasing. Whereas, the second (middle) and the third (right) subpopulation show a very recent split into the two opposing directions, a further decrease and an increase in population size.

The algorithm also inferred a self-fertilization rate (σ) of zero and germination rate (β) of 1 for all sampled individuals, which implies that there no self-fertilization and no long term seed-banking. The estimated effective mutation rates varied from 3.3210-5 to 3.9910-5. The estimated effective recombination rates varied from 6.4110-6 to 7.4810-6.

With three input files, three chromosomes analyzed per individual, 2.6 Mb of data was analyzed. The average running time was 90 minutes.

notes

  • (size? Give sequence length? 501 bp, … 2 haplotypes?).
  • Every base pair coded by 2 bits, around 8.75 Mb per chromosome. Analyzing 3 chromosomes, 2 haplotypes
  • resulting in 26.25 Mb analyzed, with a running time of around 90 minutes.  Discuss – same as paper?

Infers population history o The algorithm is run to infer β, σ and χt by fixing or simultaneously inferring the ratio p/θ given an initial value of this ratio o the algorithm infers σ, β, and χ ~ t~ using the ratio p/θ , which is inferred by the model and uses the initial ratio for r/µ as a prior, using a Bayesian approach. - What does the result – table tell me? o Parameters inferred are  What is LH & Tc – LH – likelihood self-fertilization rates (σ) of 0, no self- fertilization  – T – transition probability in continuous time - strasition occurs only if there is a recombination event  germination rate (β) of 1, inplying a 100 percent germination  (values for mutation rate and recombination rate? -> changed) o population size

About the results section - How can I best present the output? - • Record the exact codes you use to make it reproducible. - • If you get the results, look and think about them critically: - – Do they make sense? - – Do they match your expectations? - – If not, check your codes carefully and eventually run the analysis again - • Share your results with the others of your group, especially when they need them to do their analyses or to discuss their results. - • Prepare graphs/tables to best present your results.

How well the plotted results fit to the data – dep on the dataset T – time to the - desc in package - L =

Discussion

For eSMC only needs one or two individuals and scaffolds of minimum 1 MB length to accurately predict population history. Both prerequisites are met by our dataset. As the quality of the result depends more on the accuracy of the sequence rather than on the amount, we cannot definitely say how good our predictions are.

Due to the continuous decline off the population size, it can be assumed, that the population goes through a prolonged bottleneck.

Starting from around 70 generations ago, some individuals of the second and the third subpopulation indicated a decreasing and some an increasing population size. This divergence could origin from beneficial mutations, which occurred simultaneously in bout subgroups. However, recombination events might be displayed with a certain delay. In this way the model can be used as base for further selection tests Can be used as base for further selection tests on these individuals

Population structure From our result, a population structure of 2 subpoulations would be expected instead of the three revealed by the PCA. However, results of an Alstructure analysis did also support the results obtained by the PCA.

Other eSMC results Results of other student showed a similar decline population structure, however the decrease continued until the last 15 generation until it started to diverge, or the tendency of the decline did not change at all.

Alexa’s

References

  1. … Publication bibliography Böndel, K. (2020): PCR Results. In University of Hohenheim. Galili, T. (2019): installr: Using R to Install Stuff on Windows OS (Such As: R, ‘Rtools’, ‘RStudio’, ‘Git’, and More!). Version 0.22.0. Available online at https://CRAN.R-project.org/package=installr. Knaus, B. J.; Grünwald, N. J. (2017): VCFR: a package to manipulate and visualize variant call format data in R. In Molecular ecology resources 17 (1), pp. 44–53. DOI: 10.1111/1755-0998.12549. R Core Team (2019): R: A language and environment for statistical computing. Version 3.6.2: R Foundation for Statistical Computing, Vienna, Austria. Available online at https://www.R-project.org/. R Development Core Team (2019): R: A language and environment for statistical computing: R Foundation for Statistical Computing, Vienna, Austria. Available online at http://www.R-project.org. RStudio Team (2020): RStudio: Integrated Development Environment for R: RStudio, PBC, Boston, MA. Available online at http://www.rstudio.com/. Sauer J. D. (1993): Historical Geography of Crop Plants. A Select Roster: CRC Press. Schutten, G.; Chan, C.; Leeper T. J. (2018): readODS: Read and Write ODS Files. Version 1.6.7. Available online at https://CRAN.R-project.org/package=readODS. Sellinger, T. (2020a): eSMC: Ecological Sequentially Markovian Coalescent. Version 1.0.0. Available online at https://github.com/TPPSellinger/eSMC. Sellinger, T. P.; Abu, A. D.; Moest, M.; Tellier, A. (2020b): Inference of past demography, dormancy and self-fertilization rates from whole genome sequence data. In PLoS genetics 16 (4), e1008698. DOI: 10.1371/journal.pgen.1008698. Simmonds N. W. (1965): The Grain Chenopods of the Tropical American Highlands. In Economic botany 19 (3), pp. 223–235. van der Loo, M. P. J. (2014): The stringdist package for approximate string matching. In The R Journal 6 (1), pp. 111–122. Available online at https://CRAN.R-project.org/package=stringdist. Wickham, H.; Averick, M.; Bryan, J.; Chang, W.; McGowan, L.; François, R. et al. (2019): Welcome to the Tidyverse. In Journal of Open Source Software 4 (43), p. 1686. DOI: 10.21105/joss.01686. Wickham, H.; François, R.; Henry, L.; Müller, K. (2020a): dplyr: A Grammar of Data Manipulation. Version 1.0.0. Available online at https://CRAN.R-project.org/package=dplyr. Wickham, H.; Hester, J.; Chang, W. (2020b): devtools: Tools to Make Developing R Packages Easier. Version 2.3.0. Available online at https://CRAN.R-project.org/package=devtools.

Notes

Problems

 RMD Can’t load tidyverse& devtools  doesn’t run through ### Evaluation criteria • The Rmarkdown file should be executable without errors when knitting it to pdf. All code documenting the analyses should be displayed in the pdf.  Both, the RMarkdown file and the pdf (knitted from the respective Rmd) have to be submitted. Details about where the two files have to be uploaded will be communicated closer to the submission date. • The students should demonstrate in the term paper – that they understood what the method is about – that they carefully thought about the details of running the method (including choice of e.g. parameters, settings, data subsets, … ) – that they are able to discuss their own results in the context with other results • The analysis should be repeatable with the information given in the term paper. • The general appearance of the term paper will be taken into account for grading.

Method Description of the general method (+ relevant background information) • What does the analysis do? • What can the analysis tell you about the data set? - what do we want to do - Why

Specific parameter settings of the analysis of the data set ( Analysis should be repeatable with information given within this section)

description of results – With representative tables and/or graphs, well annotated (e.g. axis labels)

– Discussion of own results in context with the data set information, and with results of the other analyses of the group. ## Dictionary Convergence - the fact that two or more things, ideas, etc. become similar or come together: a convergence of interests/opinions/ideas The convergence of pop-cultural trends and technological progress gave us camera phones and the “selfie”.

Helpful code to remember in R

Mark word put strg, enter -> displays [] – index [[]] – choose column/ write in column… $- also choose column

Rm() – removes from environment setwd(“~/Term”) #install.packages(“vctrs”) library(vcfR)
push in console hit esc – stop running process # x[2,] - choose 1rst column of x, when in bracket - obj good ind - chooses 1rst, 2nd… column paste: put all things behind each other ##Helpful code to remember for RM rMarkdown basics: https://rmarkdown.rstudio.com/authoring_basics.html xxx - italize \(xxx\) - inline equation

Meeting notes 2020.07.12

Include= True – show code, Eval= False - don’t run code Echo= false – show output only - Hide code - Same decreasing trend – - Inference are very recent, – recent events might not be that accurate very old recombination events can not be recovered MB -
Include which info from other students

Selfing rate – not comply with literature