Introduction

In this informal document I keep track of my attempts to analyze three genotyping-by-sequencing data sets. All contain white lupin samples. Two are “old”, meaning they were sequenced a few years ago (single end, 50 bases/read); they are named library 1 and library 2. The third one is “new”, done by Elshire labs, and considerably bigger (paired end, 100 bases/read).

The plan is to use the new library to build a mock reference genome, and then map all three data sets onto it.

Demultiplexing

Demultiplexing was done using axe-demux, as per Elshire’s suggestion, with the following command:

axe-demux -f $INFILE_R1 -r $INFILE_R2 -F $OUTFOLDER/ -R $OUTFOLDER/ -c -b $BARCODES -t $STATS_FILE -z 1 

The only meaningful detail is the use of the -c flag, which selects the combinatorial demultiplexing algorithm.
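
For reference, in combinatorial mode axe expects (if I read its docs correctly) a tab-separated barcode file with one line per sample: forward barcode, reverse barcode, sample name. A toy sketch with made-up barcodes and sample names:

# Toy barcode map for axe's combinatorial mode (all values are made up):
# forward barcode <TAB> reverse barcode <TAB> sample name.
printf 'ACGTACGT\tTGCATGCA\tsample_A\n'  > barcodes.tsv
printf 'ACGTACGT\tGGTTCCAA\tsample_B\n' >> barcodes.tsv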

Reference genome

I used dDocent to build the reference genome. As per the dDocent tutorial, there are three parameters that need to be optimized (similarity threshold, K1, and K2). This parameter tuning is usually done on a subset of the data.

Samples used for reference genome

The following samples were selected to capture the general variability of the dataset.

Plate.Name Well Sample.Name elshire_folder elshire_well sel_for_ref sel_for_test
40156-1 A1 LBI-1 7863 1A1 TRUE FALSE
40156-1 B1 LBI-2 7863 1B1 TRUE FALSE
40156-1 B12 3.2 7863 1B12 TRUE FALSE
40156-1 D7 2.2 7863 1D7 TRUE FALSE
40156-1 F2 1.3 7863 1F2 TRUE FALSE
40156-7 C4 2a 8039 1C4 TRUE FALSE
40156-7 G3 1a 8039 1G3 TRUE FALSE
40156-7 G4 3a 8039 1G4 TRUE FALSE
40156-8 B12 150/4 8039 2B12 TRUE FALSE
40156-8 C12 83A_476 8039 2C12 TRUE FALSE
40156-8 D12 P27255 8039 2D12 TRUE FALSE
40156-8 E12 UNICROP 8039 2E12 TRUE FALSE
40156-8 F12 TANJIL 8039 2F12 TRUE FALSE
40156-8 G12 P3 8039 2G12 TRUE FALSE

Optimizing parameter: similarity threshold

I used the dDocent ReferenceOpt.sh script with the following parameters (minimum and maximum K1, minimum and maximum K2, assembly type, number of processors):

ReferenceOpt.sh 4 8 4 8 PE 10

This led to the following distributions:

Zooming in a bit on the only configuration that is not completely flat:

There is no hard rule here: we need to select a point just before the number of contigs starts growing too much. The dDocent tutorial suggests looking for an inflection point. I decided to fix the similarity threshold at 0.9.

Optimizing parameters: K1 and K2

I used dDocent RefMapOpt.sh to test how well my samples map to reference genomes created with a set of different K1/K2 values. The script calls for 20 random samples, which should not be the same ones used to create the reference genome. It also requires them to be trimmed (by dDocent itself).

I picked the following random samples:

Plate.Name Well Sample.Name elshire_folder elshire_well sel_for_ref sel_for_test
40156-1 A5 1.52 7863 1A5 FALSE TRUE
40156-1 C2 LBI-10 7863 1C2 FALSE TRUE
40156-3 A1 5.45 7864 1A1 FALSE TRUE
40156-3 B1 5.46 7864 1B1 FALSE TRUE
40156-3 C1 5.49 7864 1C1 FALSE TRUE
40156-3 C9 7.25 7864 1C9 FALSE TRUE
40156-3 D9 7.26 7864 1D9 FALSE TRUE
40156-3 F7 7.1 7864 1F7 FALSE TRUE
40156-3 G5 6.54 7864 1G5 FALSE TRUE
40156-3 H9 7.35 7864 1H9 FALSE TRUE
40156-5 B11 12.38 7958 1B11 FALSE TRUE
40156-5 C5 11.35 7958 1C5 FALSE TRUE
40156-5 D6 11.47 7958 1D6 FALSE TRUE
40156-5 F10 12.34 7958 1F10 FALSE TRUE
40156-5 G5 11.39 7958 1G5 FALSE TRUE
40156-6 H4 13.78 7958 2H4 FALSE TRUE
40156-7 D10 14a 8039 1D10 FALSE TRUE
40156-7 H10 15a 8039 1H10 FALSE TRUE
40156-8 F11 107b 8039 2F11 FALSE TRUE
40156-8 G11 109c 8039 2G11 FALSE TRUE

Examining some test reference genomes, I decided to change the K1/K2 search space from 4-8 to 3-5 (the resulting contigs were very few with values in the 6-8 range). The command used is:

RefMapOpt.sh 3 5 3 5 0.9 PE 10 sample_test_list.txt

Where sample_test_list.txt is a file containing the list of the 20 selected samples. This is done to avoid dDocent randomly picking samples, which could include the samples used for generating the reference.

This script led to the following results:

Cov Non0Cov Contigs MeanContigsMapped K1 K2 SUM.Mapped SUM.Properly Mean.Mapped Mean.Properly MisMatched
36.6 55.6 27005 17635 3 3 19794748 0 989737 0 82344
91.5 137.8 10311 6847 3 4 18863212 0 943161 0 128252
197.6 277.5 2475 1774 3 5 9783240 0 489162 0 8285
44.8 67.2 21681 14345 4 3 19418915 0 970946 0 90542
110.1 165.2 7661 5102 4 4 16870459 0 843523 0 53796
265.8 384.4 1929 1341 4 5 10260895 0 513045 0 7689
56.1 84.0 17725 11781 5 3 19892680 0 994634 0 106100
139.7 211.3 5792 3835 5 4 16181814 0 809091 0 16544
444.5 662.5 1522 1027 5 5 13539429 0 676971 0 76032

Based on SUM.Mapped the best choices are:

  1. K1=5 K2=3
  2. K1=3 K2=3
  3. K1=4 K2=3
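
As a side note, this ranking can be read off the results table by sorting on the seventh column (SUM.Mapped). A one-liner sketch, assuming the table sits in a whitespace-delimited file with a header line (I call it mapping.results here, after the file RefMapOpt.sh writes; adjust to the actual name):

# Skip the header, sort by SUM.Mapped (column 7) in descending order, keep the top 3.
tail -n +2 mapping.results | sort -k7,7rn | head -3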

Test on old lupin libraries

I’ve demultiplexed the two old lupin libraries:

  • lib1 (2016-01): four plates (but plate 1 is broken) with World collection lupins
  • lib2 (2016-07): four plates (but plate 4 is broken) with Poland and World collection lupins

Demultiplexing

I’ve demultiplexed with axe. Both libraries are single end, so I had to update the demultiplexing command to:

axe-demux -f $INFILE_R1  -F $OUTFOLDER/ -p -b $BARCODES -t $STATS_FILE -z 1 |& tee $LOG_FILE

Please note that I had to add the -p (permissive) flag and remove the -c (combinatorial) flag.

Lib1

I selected the 20 samples with the most reads, namely:

lib1_E132_48a lib1_LA110_27d lib1_LA364_96a lib1_LA406_71a lib1_LA406_71b lib1_LA406_71d lib1_LA409_72b lib1_LA416_68d lib1_LA547_67c lib1_LA646_61d lib1_LA653_64a lib1_LA673_70b lib1_LA673_70c lib1_LA673_70d lib1_Egypte03_85c lib1_Egypte26_87d lib1_Egypte55_91a lib1_Egypte55_91c lib1_Egypte93_94d lib1_Ethiop98_84d
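
The selection itself is mechanical. A minimal sketch of the idea, assuming gzipped per-sample fastq files in the demultiplexing output folder (folder name and file suffix are illustrative):

# Count reads (4 fastq lines each) per sample and keep the 20 most sequenced.
cd lib1_demux/
for f in *.fq.gz; do
    printf '%d\t%s\n' $(( $(zcat "$f" | wc -l) / 4 )) "${f%.fq.gz}"
done | sort -rn | head -20 | cut -f2 > lib1_filelist.txt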

These files were trimmed using dDocent, and then the following command was executed:

RefMapOpt.sh 3 5 3 5 0.9 PE 10 lib1_filelist.txt

Cov Non0Cov Contigs MeanContigsMapped K1 K2 SUM.Mapped SUM.Properly Mean.Mapped Mean.Properly MisMatched
46.3 67.6 27005 18486 3 3 25031413 0 1251570 0 0
104.8 158.8 10311 6799 3 4 21604688 0 1080230 0 0
234.9 341.6 2475 1702 3 5 11632077 0 581604 0 0
57.7 84.7 21681 14756 4 3 25022462 0 1251120 0 0
139.9 215.5 7661 4970 4 4 21431670 0 1071580 0 0
336.9 512.1 1929 1269 4 5 13005788 0 650289 0 0
68.8 102.2 17725 11927 5 3 24399726 0 1219990 0 0
176.6 278.0 5792 3677 5 4 20458429 0 1022920 0 0
475.7 762.2 1522 950 5 5 14489011 0 724451 0 0

Based on SUM.Mapped the best choices are:

  1. K1=3 K2=3
  2. K1=4 K2=3
  3. K1=5 K2=3

Lib2

I selected the 20 samples with the most reads, namely:

"5_R1", "20_R1", "44_R1", "45_R1", "82_R1",
"86_R1", "92_R1", "102b_R1", "102d_R1", "103b_R1",
"105_R1", "111a_R1", "117b_R1", "118b_R1", "119_R1",
"154-1_R1", "155a_R1", "156a_R1", "156b_R1", "D_R1"

These files were trimmed using dDocent, and then the following command was executed:

RefMapOpt.sh 3 5 3 5 0.9 PE 10 lib2_filelist.txt

Cov Non0Cov Contigs MeanContigsMapped K1 K2 SUM.Mapped SUM.Properly Mean.Mapped Mean.Properly MisMatched
42.4 61.5 27005 18473 3 3 22876032 0 1143800 0 0
85.0 128.6 10311 6786 3 4 17531290 0 876564 0 0
179.2 262.0 2475 1689 3 5 8875172 0 443759 0 0
51.2 74.9 21681 14741 4 3 22195273 0 1109760 0 0
111.4 171.6 7661 4954 4 4 17067252 0 853363 0 0
238.3 363.7 1929 1261 4 5 9196928 0 459846 0 0
58.7 87.2 17725 11899 5 3 20826292 0 1041310 0 0
137.6 217.0 5792 3664 5 4 15944638 0 797232 0 0
331.4 533.1 1522 945 5 5 10093604 0 504680 0 0

Based on SUM.Mapped the best choices are:

  1. K1=3 K2=3
  2. K1=4 K2=3
  3. K1=5 K2=3

Final decision

I report here the ranking of different configurations on the three libraries:

              Elshire      Lib1        Lib2
Best          K1=5 K2=3    K1=3 K2=3   K1=3 K2=3
Second best   K1=3 K2=3    K1=4 K2=3   K1=4 K2=3
Third best    K1=4 K2=3    K1=5 K2=3   K1=5 K2=3

I thus selected K1=3 and K2=3 (along with similarity=0.9) as the configuration of choice. Rationale:

  • it’s the best choice in the two old libraries
  • it’s the second best choice in the new Elshire library, and only by a slight margin when compared to the best performing configuration (SUM.Mapped 19794748 vs. 19892680, a decrease of 0.5%)

SNP calling - Running dDocent

I plan to run dDocent twice: once on the new (Elshire) library, and once on a library obtained by joining lib1 and lib2. The reason for the separate calling is simple: the Elshire library is paired end, the others are single end, and the software cannot easily manage a mixture of different sequencing approaches.
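
Joining lib1 and lib2 is just a matter of pooling their demultiplexed fastq files into one working directory. A sketch, with illustrative folder names (the lib1_/lib2_ prefixes keep sample names unique across libraries, matching the naming used above):

# Pool the single-end fastq files from both old libraries into one folder.
mkdir -p lib1_lib2
for f in lib1_demux/*.fq.gz; do cp "$f" "lib1_lib2/lib1_$(basename "$f")"; done
for f in lib2_demux/*.fq.gz; do cp "$f" "lib1_lib2/lib2_$(basename "$f")"; done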

I used the following config file:

## Number of Processors
## 10
## Maximum Memory
## 30
## Trimming
## no
## Assembly?
## no
## Type_of_Assembly
## PE
## Clustering_Similarity%
## 0.9
## Minimum within individual coverage level to include a read for assembly (K1)
## 3
## Minimum number of individuals a read must be present in to include for assembly (K2)
## 3
## Mapping_Reads?
## no
## Mapping_Match_Value
## 1
## Mapping_MisMatch_Value
## 4
## Mapping_GapOpen_Penalty
## 6
## Calling_SNPs?
## yes
## Email
## foo.bar@baz.com

I’ve reported the assembly config params for completeness, but they are not used at this stage (I already have the reference genome). The mapping parameters are the BWA defaults.
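
In other words, a match value of 1, mismatch value of 4, and gap open penalty of 6 are what bwa mem uses by default anyway; made explicit (file names are illustrative):

# bwa mem with its default scoring spelled out: -A match, -B mismatch, -O gap open.
bwa mem -A 1 -B 4 -O 6 reference.fasta sample.R1.fq.gz sample.R2.fq.gz > sample.sam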

Running dDocent on Elshire library

dDocent runs without hiccups on the first two steps, namely trimming and mapping. However, I was not able to proceed to the actual SNP calling due to memory issues. In particular, one of dDocent’s steps is to create an interval file merging all the intervals encountered. This is done, in the dDocent main script, with the following bash function:

CreateIntervals()
{
  samtools merge -@$NUMProc -b bamlist.list -f cat-RRG.bam &>/dev/null
  samtools index cat-RRG.bam 
  wait
  bedtools merge -i cat-RRG.bam -bed >  mapped.bed
}

The first command (samtools merge) merges all the .bam files coming from the bwa alignment into a single one (named cat-RRG.bam). The resulting file is then indexed. Then there is a wait command (why?).

Finally, the bedtools call creates a merged interval file in .bed format. This command does not work on my system: I run out of memory.

Tentative fix #1: naive interval merging with bedops

I tried to use bedops suite instead of bedtools. My approach was:

bam2starch --max-mem 35G < cat-RRG.bam > cat-RRG.starch
bedops -m cat-RRG.starch > mapped.bed

The first command is simply a format conversion from .bam to .starch.

The second command does the same merge operation as bedtools, but it’s supposed to require less memory. However, it does so by using large amounts of disk space: for a ~65 GB .bam file my 250 GB of hard disk space was not enough, and the command crashed.

Tentative fix #2: interval merging with bedops, by contig

I’ve written a small Python script to avoid using so much memory and hard disk space altogether. Interval merging can safely be done by chromosome/contig, since intervals from different chromosomes are not going to be merged anyway.

The core of my script does two operations. First, for each contig:

# Pseudo bash code, it's actually all Python; $CONTIG loops over every contig name.
INFILE="path/to/cat-RRG.bam"
TMP_FILE="some_file_name_${CONTIG}"

samtools view "$INFILE" "$CONTIG" -b > "$TMP_FILE.bam"
samtools index "$TMP_FILE.bam"
bam2starch --max-mem 35G < "$TMP_FILE.bam" > "$TMP_FILE.starch"
rm "$TMP_FILE.bam"
rm "$TMP_FILE.bam.bai"

This code, for each contig, creates a .bam file, indexes it, converts it to .starch, and removes the leftover .bam and .bai files.

Then all .starch files are merged:

bedops -m *.starch > mapped.bed

Things are actually a little more complicated. It is not possible to run the bedops merge on all the .starch files together, since there are 37000+ contigs and bedops cannot open that many files at once (it crashes).

In my Python script the merging is therefore done in blocks of 100 contigs. The resulting 370+ .bed files are then merged again into a single mapped.bed file.
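
A bash sketch of the same two-level merge (my actual script is Python; the chunk naming is illustrative):

# First pass: merge the .starch files 100 at a time to stay under the open-file limit.
ls *.starch | split -l 100 - chunk_
for c in chunk_*; do
    bedops -m $(cat "$c") > "merged_$c.bed"
done
# Second pass: merge the per-chunk results into the final interval file.
bedops -m merged_chunk_*.bed > mapped.bed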

dDocent and externally merged interval files

Looking at the dDocent code, it’s quite easy to let it know that it should not do the interval merging. I need to:

  • put the created “mapped.bed” file in the working directory
  • remove all “*.cov.stats” files (which were empty anyway)

Unfortunately, it’s not enough. At line 337 dDocent has the following command:

cat namelist | parallel -j $FB2 "bedtools coverage -b {}-RG.bam -a mapped.bed -counts -sorted -g genome.file > {}.cov.stats"

The interesting part is the “-sorted” option for bedtools. This requires mapped.bed to be sorted by chromosome and position (it is). However, it generates the following error (one per sample file, I report just one):

Error: Sorted input specified, but the file mapped.bed has the following record with a different sort order than the genomeFile genome.file

This can be traced to a sorting problem:

  • genome.file (which comes directly from the passed reference.fasta) contains chromosomes in numerical order (e.g. dDocent_Contig_1, dDocent_Contig_3, dDocent_Contig_7, dDocent_Contig_9, dDocent_Contig_12, dDocent_Contig_15)
  • mapped.bed contains chromosomes in alphabetical order (e.g. dDocent_Contig_1, dDocent_Contig_10, dDocent_Contig_100, dDocent_Contig_1000, dDocent_Contig_10000, dDocent_Contig_10001, dDocent_Contig_10002…)

I thus need to sort mapped.bed by contig number. It’s easily done with a small Python script.
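
For the record, the same reordering can be sketched with GNU sort (my actual fix was the Python script):

# Version-sort the contig names (so _2 sorts before _10), then numeric-sort start positions.
sort -k1,1V -k2,2n mapped.bed > mapped.sorted.bed && mv mapped.sorted.bed mapped.bed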

Unfortunately, it’s not enough (again…). The error is the same as above. Taking a look at the various files:

(ddocent_env) nelson@CREA-ZA-PC$ head genome.file 
dDocent_Contig_1    419
dDocent_Contig_3    312
dDocent_Contig_7    336
dDocent_Contig_9    340
dDocent_Contig_12   334
dDocent_Contig_15   350
dDocent_Contig_16   344
dDocent_Contig_18   342
dDocent_Contig_20   322
dDocent_Contig_22   336

(ddocent_env) nelson@CREA-ZA-PC:$ head mapped.bed
dDocent_Contig_1    0   419
dDocent_Contig_3    0   312
dDocent_Contig_4    0   228
dDocent_Contig_6    0   190
dDocent_Contig_7    0   148
dDocent_Contig_7    173 336
dDocent_Contig_8    0   229
dDocent_Contig_9    0   165
dDocent_Contig_9    174 340
dDocent_Contig_10   0   240

The ordering is still different. In particular, mapped.bed is “correct” (my script reordered it by actual contig number, not by lexicographic order). The ordering in genome.file is not (e.g. dDocent_Contig_4 is found further down the file).

Final decision: I give up. Running the original dDocent script at this point is not worth my effort anymore. I created a local copy of the script and removed the “-sorted” option from the bedtools invocation (the final result is the same, the option is used only to save RAM). The script now runs and enters the FreeBayes phase.
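
Mechanically, the patch amounts to something like the following; note that I also drop the companion “-g genome.file” argument, since it only serves the -sorted mode (the exact command text may differ between dDocent versions):

# Local copy of dDocent with the sorted-input options stripped from bedtools coverage.
sed 's/ -sorted -g genome.file//' dDocent > dDocent_local
chmod +x dDocent_local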

TO BE UPDATED…