Introduction

In this informal document I keep track of my attempts to analyze three genotyping-by-sequencing data sets. All contain white lupin samples. Two are “old”, meaning they were sequenced a few years ago (single end, 50 bases/read); they are named library 1 and library 2. The third one is “new”, done by Elshire labs, and considerably bigger (paired end, 100 bases/read).

The plan is to use the new library to build a mock reference genome, and then map all three data sets onto it.

Demultiplexing

Demultiplexing was done using axe-demux, as per Elshire’s suggestion, with the following command:

axe-demux -f $INFILE_R1 -r $INFILE_R2 -F $OUTFOLDER/ -R $OUTFOLDER/ -c -b $BARCODES -t $STATS_FILE -z 1 

The only meaningful detail is the use of the -c flag, which selects the combinatorial demultiplexing algorithm.
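
For reference, in combinatorial mode axe expects (if I read its docs correctly) a tab-separated barcode file with one line per sample: forward barcode, reverse barcode, sample name. A toy sketch with made-up barcodes and sample names:

# Toy barcode map for axe's combinatorial mode (all values are made up):
# forward barcode <TAB> reverse barcode <TAB> sample name.
printf 'ACGTACGT\tTGCATGCA\tsample_A\n'  > barcodes.tsv
printf 'ACGTACGT\tGGTTCCAA\tsample_B\n' >> barcodes.tsv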

Reference genome

I used dDocent to build the reference genome. As per the dDocent tutorial, there are three parameters that need to be optimized (similarity threshold, K1, and K2). This parameter tuning is usually done on a subset of the data.

Samples used for reference genome

The following samples were selected to capture the general variability of the dataset.

Plate.Name Well Sample.Name elshire_folder elshire_well sel_for_ref sel_for_test
40156-1 A1 LBI-1 7863 1A1 TRUE FALSE
40156-1 B1 LBI-2 7863 1B1 TRUE FALSE
40156-1 B12 3.2 7863 1B12 TRUE FALSE
40156-1 D7 2.2 7863 1D7 TRUE FALSE
40156-1 F2 1.3 7863 1F2 TRUE FALSE
40156-7 C4 2a 8039 1C4 TRUE FALSE
40156-7 G3 1a 8039 1G3 TRUE FALSE
40156-7 G4 3a 8039 1G4 TRUE FALSE
40156-8 B12 150/4 8039 2B12 TRUE FALSE
40156-8 C12 83A_476 8039 2C12 TRUE FALSE
40156-8 D12 P27255 8039 2D12 TRUE FALSE
40156-8 E12 UNICROP 8039 2E12 TRUE FALSE
40156-8 F12 TANJIL 8039 2F12 TRUE FALSE
40156-8 G12 P3 8039 2G12 TRUE FALSE

Optimizing parameter: similarity threshold

I used the dDocent ReferenceOpt.sh script with the following parameters (minimum and maximum K1, minimum and maximum K2, assembly type, number of processors):

ReferenceOpt.sh 4 8 4 8 PE 10

This led to the following distributions:

Zooming in a bit on the only configuration that is not completely flat:

There is no hard rule here: we need to select a point just before the number of contigs starts growing too much. The dDocent tutorial suggests looking for an inflection point. I decided to fix the similarity threshold at 0.9.

Optimizing parameters: K1 and K2

I used dDocent RefMapOpt.sh to test how well my samples map to reference genomes created with a set of different K1/K2 values. The script calls for 20 random samples, which should not be the same ones used to create the reference genome. It also requires them to be trimmed (by dDocent itself).

I picked the following random samples:

Plate.Name Well Sample.Name elshire_folder elshire_well sel_for_ref sel_for_test
40156-1 A5 1.52 7863 1A5 FALSE TRUE
40156-1 C2 LBI-10 7863 1C2 FALSE TRUE
40156-3 A1 5.45 7864 1A1 FALSE TRUE
40156-3 B1 5.46 7864 1B1 FALSE TRUE
40156-3 C1 5.49 7864 1C1 FALSE TRUE
40156-3 C9 7.25 7864 1C9 FALSE TRUE
40156-3 D9 7.26 7864 1D9 FALSE TRUE
40156-3 F7 7.1 7864 1F7 FALSE TRUE
40156-3 G5 6.54 7864 1G5 FALSE TRUE
40156-3 H9 7.35 7864 1H9 FALSE TRUE
40156-5 B11 12.38 7958 1B11 FALSE TRUE
40156-5 C5 11.35 7958 1C5 FALSE TRUE
40156-5 D6 11.47 7958 1D6 FALSE TRUE
40156-5 F10 12.34 7958 1F10 FALSE TRUE
40156-5 G5 11.39 7958 1G5 FALSE TRUE
40156-6 H4 13.78 7958 2H4 FALSE TRUE
40156-7 D10 14a 8039 1D10 FALSE TRUE
40156-7 H10 15a 8039 1H10 FALSE TRUE
40156-8 F11 107b 8039 2F11 FALSE TRUE
40156-8 G11 109c 8039 2G11 FALSE TRUE

Examining some test reference genomes, I decided to change the K1/K2 search space from 4-8 to 3-5 (the resulting contigs were very few with values in the 6-8 range). The command used is:

RefMapOpt.sh 3 5 3 5 0.9 PE 10 sample_test_list.txt

Where sample_test_list.txt is a file containing the list of the 20 selected samples. This is done to avoid dDocent randomly picking samples, which could include the samples used for generating the reference.

This script led to the following results:

Cov Non0Cov Contigs MeanContigsMapped K1 K2 SUM.Mapped SUM.Properly Mean.Mapped Mean.Properly MisMatched
36.6 55.6 27005 17635 3 3 19794748 0 989737 0 82344
91.5 137.8 10311 6847 3 4 18863212 0 943161 0 128252
197.6 277.5 2475 1774 3 5 9783240 0 489162 0 8285
44.8 67.2 21681 14345 4 3 19418915 0 970946 0 90542
110.1 165.2 7661 5102 4 4 16870459 0 843523 0 53796
265.8 384.4 1929 1341 4 5 10260895 0 513045 0 7689
56.1 84.0 17725 11781 5 3 19892680 0 994634 0 106100
139.7 211.3 5792 3835 5 4 16181814 0 809091 0 16544
444.5 662.5 1522 1027 5 5 13539429 0 676971 0 76032

Based on SUM.Mapped the best choices are:

  1. K1=5 K2=3
  2. K1=3 K2=3
  3. K1=4 K2=3
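
As a side note, this ranking can be read off the results table by sorting on the seventh column (SUM.Mapped). A one-liner sketch, assuming the table sits in a whitespace-delimited file with a header line (I call it mapping.results here, after the file RefMapOpt.sh writes; adjust to the actual name):

# Skip the header, sort by SUM.Mapped (column 7) in descending order, keep the top 3.
tail -n +2 mapping.results | sort -k7,7rn | head -3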

Test on old lupin libraries

I’ve demultiplexed the two old lupin libraries:

  • lib1 (2016-01): four plates (but plate 1 is broken) with World collection lupins
  • lib2 (2016-07): four plates (but plate 4 is broken) with Poland and World collection lupins

Demultiplexing

I’ve demultiplexed with axe. Both libraries are single end, so I had to update the demultiplexing command to:

axe-demux -f $INFILE_R1  -F $OUTFOLDER/ -p -b $BARCODES -t $STATS_FILE -z 1 |& tee $LOG_FILE

Please note that I had to add the -p (permissive) flag and remove the -c (combinatorial) flag.

Lib1

I selected the 20 samples with the most reads, namely:

lib1_E132_48a lib1_LA110_27d lib1_LA364_96a lib1_LA406_71a lib1_LA406_71b lib1_LA406_71d lib1_LA409_72b lib1_LA416_68d lib1_LA547_67c lib1_LA646_61d lib1_LA653_64a lib1_LA673_70b lib1_LA673_70c lib1_LA673_70d lib1_Egypte03_85c lib1_Egypte26_87d lib1_Egypte55_91a lib1_Egypte55_91c lib1_Egypte93_94d lib1_Ethiop98_84d
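
The selection itself is mechanical. A minimal sketch of the idea, assuming gzipped per-sample fastq files in the demultiplexing output folder (folder name and file suffix are illustrative):

# Count reads (4 fastq lines each) per sample and keep the 20 most sequenced.
cd lib1_demux/
for f in *.fq.gz; do
    printf '%d\t%s\n' $(( $(zcat "$f" | wc -l) / 4 )) "${f%.fq.gz}"
done | sort -rn | head -20 | cut -f2 > lib1_filelist.txt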

These files were trimmed using dDocent, and then the following command was executed:

RefMapOpt.sh 3 5 3 5 0.9 PE 10 lib1_filelist.txt

Cov Non0Cov Contigs MeanContigsMapped K1 K2 SUM.Mapped SUM.Properly Mean.Mapped Mean.Properly MisMatched
46.3 67.6 27005 18486 3 3 25031413 0 1251570 0 0
104.8 158.8 10311 6799 3 4 21604688 0 1080230 0 0
234.9 341.6 2475 1702 3 5 11632077 0 581604 0 0
57.7 84.7 21681 14756 4 3 25022462 0 1251120 0 0
139.9 215.5 7661 4970 4 4 21431670 0 1071580 0 0
336.9 512.1 1929 1269 4 5 13005788 0 650289 0 0
68.8 102.2 17725 11927 5 3 24399726 0 1219990 0 0
176.6 278.0 5792 3677 5 4 20458429 0 1022920 0 0
475.7 762.2 1522 950 5 5 14489011 0 724451 0 0

Based on SUM.Mapped the best choices are:

  1. K1=3 K2=3
  2. K1=4 K2=3
  3. K1=5 K2=3

Lib2

I selected the 20 samples with the most reads, namely:

"5_R1", "20_R1", "44_R1", "45_R1", "82_R1",
"86_R1", "92_R1", "102b_R1", "102d_R1", "103b_R1",
"105_R1", "111a_R1", "117b_R1", "118b_R1", "119_R1",
"154-1_R1", "155a_R1", "156a_R1", "156b_R1", "D_R1"

These files were trimmed using dDocent, and then the following command was executed:

RefMapOpt.sh 3 5 3 5 0.9 PE 10 lib2_filelist.txt

Cov Non0Cov Contigs MeanContigsMapped K1 K2 SUM.Mapped SUM.Properly Mean.Mapped Mean.Properly MisMatched
42.4 61.5 27005 18473 3 3 22876032 0 1143800 0 0
85.0 128.6 10311 6786 3 4 17531290 0 876564 0 0
179.2 262.0 2475 1689 3 5 8875172 0 443759 0 0
51.2 74.9 21681 14741 4 3 22195273 0 1109760 0 0
111.4 171.6 7661 4954 4 4 17067252 0 853363 0 0
238.3 363.7 1929 1261 4 5 9196928 0 459846 0 0
58.7 87.2 17725 11899 5 3 20826292 0 1041310 0 0
137.6 217.0 5792 3664 5 4 15944638 0 797232 0 0
331.4 533.1 1522 945 5 5 10093604 0 504680 0 0

Based on SUM.Mapped the best choices are:

  1. K1=3 K2=3
  2. K1=4 K2=3
  3. K1=5 K2=3

Final decision

I report here the ranking of different configurations on the three libraries:

              Elshire      Lib1        Lib2
Best          K1=5 K2=3    K1=3 K2=3   K1=3 K2=3
Second best   K1=3 K2=3    K1=4 K2=3   K1=4 K2=3
Third best    K1=4 K2=3    K1=5 K2=3   K1=5 K2=3

I thus selected K1=3 and K2=3 (along with similarity=0.9) as the configuration of choice. Rationale:

  • it’s the best choice in the two old libraries
  • it’s the second best choice in the new Elshire library, and only by a slight margin when compared to the best performing configuration (SUM.Mapped 19794748 vs. 19892680, a decrease of 0.5%)

SNP calling - Running dDocent

I plan to run dDocent twice: once on the new (Elshire) library, and once on a library obtained by joining lib1 and lib2. The reason for the separate calling is simple: the Elshire library is paired end, the others are single end, and the software cannot easily manage a mixture of different sequencing approaches.
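
Joining lib1 and lib2 is just a matter of pooling their demultiplexed fastq files into one working directory. A sketch, with illustrative folder names (the lib1_/lib2_ prefixes keep sample names unique across libraries, matching the naming used above):

# Pool the single-end fastq files from both old libraries into one folder.
mkdir -p lib1_lib2
for f in lib1_demux/*.fq.gz; do cp "$f" "lib1_lib2/lib1_$(basename "$f")"; done
for f in lib2_demux/*.fq.gz; do cp "$f" "lib1_lib2/lib2_$(basename "$f")"; done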

I used the following config file:

## Number of Processors
## 10
## Maximum Memory
## 30
## Trimming
## no
## Assembly?
## no
## Type_of_Assembly
## PE
## Clustering_Similarity%
## 0.9
## Minimum within individual coverage level to include a read for assembly (K1)
## 3
## Minimum number of individuals a read must be present in to include for assembly (K2)
## 3
## Mapping_Reads?
## no
## Mapping_Match_Value
## 1
## Mapping_MisMatch_Value
## 4
## Mapping_GapOpen_Penalty
## 6
## Calling_SNPs?
## yes
## Email
## foo.bar@baz.com

I’ve reported the assembly config params for completeness, but they are not used at this stage (I already have the reference genome). The mapping parameters are the BWA defaults.
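
In other words, a match value of 1, mismatch value of 4, and gap open penalty of 6 are what bwa mem uses by default anyway; made explicit (file names are illustrative):

# bwa mem with its default scoring spelled out: -A match, -B mismatch, -O gap open.
bwa mem -A 1 -B 4 -O 6 reference.fasta sample.R1.fq.gz sample.R2.fq.gz > sample.sam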

Running dDocent on Elshire library

dDocent runs without hiccups on the first two steps, namely trimming and mapping. However, I was not able to proceed to the actual SNP calling due to memory issues. In particular, one of dDocent’s steps is to create an interval file merging all the intervals encountered. This is done, in the dDocent main script, with the following bash function:

CreateIntervals()
{
  samtools merge -@$NUMProc -b bamlist.list -f cat-RRG.bam &>/dev/null
  samtools index cat-RRG.bam 
  wait
  bedtools merge -i cat-RRG.bam -bed >  mapped.bed
}

The first command (samtools merge) merges all the .bam files coming from the bwa alignment into a single one (named cat-RRG.bam). The resulting file is then indexed. Then there is a wait command (why?).

Finally, the bedtools call creates a merged interval file in .bed format. This command does not work on my system: I run out of memory.

Tentative fix #1: naive interval merging with bedops

I tried to use bedops suite instead of bedtools. My approach was:

bam2starch --max-mem 35G < cat-RRG.bam > cat-RRG.starch
bedops -m cat-RRG.starch > mapped.bed

The first command is simply a format conversion from .bam to .starch.

The second command does the same merge operation as bedtools, but it’s supposed to require less memory. However, it does so by using large amounts of disk space: for a ~65 GB .bam file my 250 GB of hard disk space was not enough, and the command crashed.

Tentative fix #2: interval merging with bedops, by contig

I’ve written a small Python script to avoid using so much memory and hard disk space altogether. Interval merging can safely be done by chromosome/contig, since intervals from different chromosomes are not going to be merged anyway.

The core of my script does two operations. First, for each contig:

# Pseudo bash code, it's actually all Python; $CONTIG loops over every contig name.
INFILE="path/to/cat-RRG.bam"
TMP_FILE="some_file_name_${CONTIG}"

samtools view "$INFILE" "$CONTIG" -b > "$TMP_FILE.bam"
samtools index "$TMP_FILE.bam"
bam2starch --max-mem 35G < "$TMP_FILE.bam" > "$TMP_FILE.starch"
rm "$TMP_FILE.bam"
rm "$TMP_FILE.bam.bai"

This code, for each contig, creates a .bam file, indexes it, converts it to .starch, and removes the leftover .bam and .bai files.

Then all .starch files are merged:

bedops -m *.starch > mapped.bed

Things are actually a little more complicated. It is not possible to run the bedops merge on all the .starch files together, since there are 37000+ contigs and bedops cannot open that many files at once (it crashes).

In my Python script the merging is therefore done in blocks of 100 contigs. The resulting 370+ .bed files are then merged again into a single mapped.bed file.
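
A bash sketch of the same two-level merge (my actual script is Python; the chunk naming is illustrative):

# First pass: merge the .starch files 100 at a time to stay under the open-file limit.
ls *.starch | split -l 100 - chunk_
for c in chunk_*; do
    bedops -m $(cat "$c") > "merged_$c.bed"
done
# Second pass: merge the per-chunk results into the final interval file.
bedops -m merged_chunk_*.bed > mapped.bed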

dDocent and externally merged interval files

Looking at the dDocent code, it’s quite easy to let it know that it should not do the interval merging. I need to:

  • put the created “mapped.bed” file in the working directory
  • remove all “*.cov.stats” files (which were empty anyway)

Unfortunately, it’s not enough. At line 337 dDocent has the following command:

cat namelist | parallel -j $FB2 "bedtools coverage -b {}-RG.bam -a mapped.bed -counts -sorted -g genome.file > {}.cov.stats"

The interesting part is the “-sorted” option for bedtools. This requires mapped.bed to be sorted by chromosome and position (it is). However, it generates the following error (one per sample file, I report just one):

Error: Sorted input specified, but the file mapped.bed has the following record with a different sort order than the genomeFile genome.file

This can be traced to a sorting problem:

  • genome.file (which comes directly from the passed reference.fasta) contains chromosomes in numerical order (e.g. dDocent_Contig_1, dDocent_Contig_3, dDocent_Contig_7, dDocent_Contig_9, dDocent_Contig_12, dDocent_Contig_15)
  • mapped.bed contains chromosomes in alphabetical order (e.g. dDocent_Contig_1, dDocent_Contig_10, dDocent_Contig_100, dDocent_Contig_1000, dDocent_Contig_10000, dDocent_Contig_10001, dDocent_Contig_10002…)

I thus need to sort mapped.bed by contig number. It’s easily done with a small Python script.
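
For the record, the same reordering can be sketched with GNU sort (my actual fix was the Python script):

# Version-sort the contig names (so _2 sorts before _10), then numeric-sort start positions.
sort -k1,1V -k2,2n mapped.bed > mapped.sorted.bed && mv mapped.sorted.bed mapped.bed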

Unfortunately, it’s not enough (again…). The error is the same as above. Taking a look at the various files:

(ddocent_env) nelson@CREA-ZA-PC$ head genome.file 
dDocent_Contig_1    419
dDocent_Contig_3    312
dDocent_Contig_7    336
dDocent_Contig_9    340
dDocent_Contig_12   334
dDocent_Contig_15   350
dDocent_Contig_16   344
dDocent_Contig_18   342
dDocent_Contig_20   322
dDocent_Contig_22   336

(ddocent_env) nelson@CREA-ZA-PC:$ head mapped.bed
dDocent_Contig_1    0   419
dDocent_Contig_3    0   312
dDocent_Contig_4    0   228
dDocent_Contig_6    0   190
dDocent_Contig_7    0   148
dDocent_Contig_7    173 336
dDocent_Contig_8    0   229
dDocent_Contig_9    0   165
dDocent_Contig_9    174 340
dDocent_Contig_10   0   240

The ordering is still different. In particular, mapped.bed is “correct” (my script reordered it by actual contig number, not by lexicographic order). The ordering in genome.file is not (e.g. dDocent_Contig_4 is found further down the file).

Final decision: I give up. Running the original dDocent script at this point is not worth my effort anymore. I created a local copy of the script and removed the “-sorted” option from the bedtools invocation (the final result is the same, the option is used only to save RAM). The script now runs and enters the FreeBayes phase.
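
Mechanically, the patch amounts to something like the following; note that I also drop the companion “-g genome.file” argument, since it only serves the -sorted mode (the exact command text may differ between dDocent versions):

# Local copy of dDocent with the sorted-input options stripped from bedtools coverage.
sed 's/ -sorted -g genome.file//' dDocent > dDocent_local
chmod +x dDocent_local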

TO BE UPDATED…