In this informal document I keep track of my attempts to analyze three genotyping-by-sequencing data sets. All contain white lupin samples. Two are “old”, meaning they were sequences a few years ago (single end, 50 base/reads), named library 1 and library 2. The third one is “new”, done with Elshire labs, and certainly bigger (paired ends, 100 base/reads).
The plan is to use the new library for building a mock reference genome, and then map all data sets on it.
Demultiplexing was done using axe-demux, as per Elshire’s suggestion, using the following command:
axe-demux -f $INFILE_R1 -r $INFILE_R2 -F $OUTFOLDER/ -R $OUTFOLDER/ -c -b $BARCODES -t $STATS_FILE -z 1
The only meaningful detail is the use of -c parameter for combinatorial demultiplexing algorithm.
I used dDocent to build the reference genome. As per dDocent tutorial, there are three parameters that need to be optimized (similarity threshold, K1 and K2). This parameter tuning is usually done using a subset of the data.
The following samples were selected to capture the general variability of the dataset.
| Plate.Name | Well | Sample.Name | elshire_folder | elshire_well | sel_for_ref | sel_for_test |
|---|---|---|---|---|---|---|
| 40156-1 | A1 | LBI-1 | 7863 | 1A1 | TRUE | FALSE |
| 40156-1 | B1 | LBI-2 | 7863 | 1B1 | TRUE | FALSE |
| 40156-1 | B12 | 3.2 | 7863 | 1B12 | TRUE | FALSE |
| 40156-1 | D7 | 2.2 | 7863 | 1D7 | TRUE | FALSE |
| 40156-1 | F2 | 1.3 | 7863 | 1F2 | TRUE | FALSE |
| 40156-7 | C4 | 2a | 8039 | 1C4 | TRUE | FALSE |
| 40156-7 | G3 | 1a | 8039 | 1G3 | TRUE | FALSE |
| 40156-7 | G4 | 3a | 8039 | 1G4 | TRUE | FALSE |
| 40156-8 | B12 | 150/4 | 8039 | 2B12 | TRUE | FALSE |
| 40156-8 | C12 | 83A_476 | 8039 | 2C12 | TRUE | FALSE |
| 40156-8 | D12 | P27255 | 8039 | 2D12 | TRUE | FALSE |
| 40156-8 | E12 | UNICROP | 8039 | 2E12 | TRUE | FALSE |
| 40156-8 | F12 | TANJIL | 8039 | 2F12 | TRUE | FALSE |
| 40156-8 | G12 | P3 | 8039 | 2G12 | TRUE | FALSE |
I used the dDocente ReferenceOpt.sh script with the following parameters:
ReferenceOpt.sh 4 8 4 8 PE 10
This lead to the following distributions:
Zooming a bit on the only configuration not completely flat:
There is no hard rule here, we need to select a point which is just before the number of contigs grows too much. The dDocen tutorial succest to look for some inflection point. I decided to fix similarity threshold to: 0.9
I used dDocent RefMapOpt.sh to test how well my samples map to the reference genome created with a set of different K1/K2 values. The script calls for 20 random samples, which should not be the same used to create the reference genome. Also, it requires them to be trimmed (by dDocent itself).
I picked the following random samples:
| Plate.Name | Well | Sample.Name | elshire_folder | elshire_well | sel_for_ref | sel_for_test |
|---|---|---|---|---|---|---|
| 40156-1 | A5 | 1.52 | 7863 | 1A5 | FALSE | TRUE |
| 40156-1 | C2 | LBI-10 | 7863 | 1C2 | FALSE | TRUE |
| 40156-3 | A1 | 5.45 | 7864 | 1A1 | FALSE | TRUE |
| 40156-3 | B1 | 5.46 | 7864 | 1B1 | FALSE | TRUE |
| 40156-3 | C1 | 5.49 | 7864 | 1C1 | FALSE | TRUE |
| 40156-3 | C9 | 7.25 | 7864 | 1C9 | FALSE | TRUE |
| 40156-3 | D9 | 7.26 | 7864 | 1D9 | FALSE | TRUE |
| 40156-3 | F7 | 7.1 | 7864 | 1F7 | FALSE | TRUE |
| 40156-3 | G5 | 6.54 | 7864 | 1G5 | FALSE | TRUE |
| 40156-3 | H9 | 7.35 | 7864 | 1H9 | FALSE | TRUE |
| 40156-5 | B11 | 12.38 | 7958 | 1B11 | FALSE | TRUE |
| 40156-5 | C5 | 11.35 | 7958 | 1C5 | FALSE | TRUE |
| 40156-5 | D6 | 11.47 | 7958 | 1D6 | FALSE | TRUE |
| 40156-5 | F10 | 12.34 | 7958 | 1F10 | FALSE | TRUE |
| 40156-5 | G5 | 11.39 | 7958 | 1G5 | FALSE | TRUE |
| 40156-6 | H4 | 13.78 | 7958 | 2H4 | FALSE | TRUE |
| 40156-7 | D10 | 14a | 8039 | 1D10 | FALSE | TRUE |
| 40156-7 | H10 | 15a | 8039 | 1H10 | FALSE | TRUE |
| 40156-8 | F11 | 107b | 8039 | 2F11 | FALSE | TRUE |
| 40156-8 | G11 | 109c | 8039 | 2G11 | FALSE | TRUE |
Examining some test reference genome I decided to change the K1/K2 search space from 4-8 to 3-5 (the resulting contig were very few with values in 6-8 range). The used command is:
RefMapOpt.sh 3 5 3 5 0.9 PE 10 sample_test_list.txt
Where sample_test_list.txt is a file containing the list of 20 selected samples. This is done to avoid dDocent randomly picking samples, which would include the samples used for generating the reference.
This script lead to the following results:
| Cov | Non0Cov | Contigs | MeanContigsMapped | K1 | K2 | SUM.Mapped | SUM.Properly | Mean.Mapped | Mean.Properly | MisMatched |
|---|---|---|---|---|---|---|---|---|---|---|
| 36.6 | 55.6 | 27005 | 17635 | 3 | 3 | 19794748 | 0 | 989737 | 0 | 82344 |
| 91.5 | 137.8 | 10311 | 6847 | 3 | 4 | 18863212 | 0 | 943161 | 0 | 128252 |
| 197.6 | 277.5 | 2475 | 1774 | 3 | 5 | 9783240 | 0 | 489162 | 0 | 8285 |
| 44.8 | 67.2 | 21681 | 14345 | 4 | 3 | 19418915 | 0 | 970946 | 0 | 90542 |
| 110.1 | 165.2 | 7661 | 5102 | 4 | 4 | 16870459 | 0 | 843523 | 0 | 53796 |
| 265.8 | 384.4 | 1929 | 1341 | 4 | 5 | 10260895 | 0 | 513045 | 0 | 7689 |
| 56.1 | 84.0 | 17725 | 11781 | 5 | 3 | 19892680 | 0 | 994634 | 0 | 106100 |
| 139.7 | 211.3 | 5792 | 3835 | 5 | 4 | 16181814 | 0 | 809091 | 0 | 16544 |
| 444.5 | 662.5 | 1522 | 1027 | 5 | 5 | 13539429 | 0 | 676971 | 0 | 76032 |
Based on SUM.Mapped the best choices are:
I’ve demultiplexed the two old lupin libraries:
I’ve demultiplexed with axe. Both libraries are single end, so I had to update the demultiplexing command to:
axe-demux -f $INFILE_R1 -F $OUTFOLDER/ -p -b $BARCODES -t $STATS_FILE -z 1 |& tee $LOG_FILE
Please note that I had to add the -p (permissive) flag and remove the -c (combinatorial) flag.
I selected the 20 samples with most reads. That is, the following:
lib1_E132_48a lib1_LA110_27d lib1_LA364_96a lib1_LA406_71a lib1_LA406_71b lib1_LA406_71d lib1_LA409_72b lib1_LA416_68d lib1_LA547_67c lib1_LA646_61d lib1_LA653_64a lib1_LA673_70b lib1_LA673_70c lib1_LA673_70d lib1_Egypte03_85c lib1_Egypte26_87d lib1_Egypte55_91a lib1_Egypte55_91c lib1_Egypte93_94d lib1_Ethiop98_84d
These files were trimmed using dDocent and then the following command was executed
RefMapOpt.sh 3 5 3 5 0.9 PE 10 lib1_filelist.txt
| Cov | Non0Cov | Contigs | MeanContigsMapped | K1 | K2 | SUM.Mapped | SUM.Properly | Mean.Mapped | Mean.Properly | MisMatched |
|---|---|---|---|---|---|---|---|---|---|---|
| 46.3 | 67.6 | 27005 | 18486 | 3 | 3 | 25031413 | 0 | 1251570 | 0 | 0 |
| 104.8 | 158.8 | 10311 | 6799 | 3 | 4 | 21604688 | 0 | 1080230 | 0 | 0 |
| 234.9 | 341.6 | 2475 | 1702 | 3 | 5 | 11632077 | 0 | 581604 | 0 | 0 |
| 57.7 | 84.7 | 21681 | 14756 | 4 | 3 | 25022462 | 0 | 1251120 | 0 | 0 |
| 139.9 | 215.5 | 7661 | 4970 | 4 | 4 | 21431670 | 0 | 1071580 | 0 | 0 |
| 336.9 | 512.1 | 1929 | 1269 | 4 | 5 | 13005788 | 0 | 650289 | 0 | 0 |
| 68.8 | 102.2 | 17725 | 11927 | 5 | 3 | 24399726 | 0 | 1219990 | 0 | 0 |
| 176.6 | 278.0 | 5792 | 3677 | 5 | 4 | 20458429 | 0 | 1022920 | 0 | 0 |
| 475.7 | 762.2 | 1522 | 950 | 5 | 5 | 14489011 | 0 | 724451 | 0 | 0 |
Based on SUM.Mapped the best choices are:
I selected the 20 samples with most reads. That is, the following:
"5_R1", "20_R1", "44_R1", "45_R1", "82_R1", "86_R1", "92_R1", "102b_R1", "102d_R1", "103b_R1", "105_R1", "111a_R1", "117b_R1", "118b_R1", "119_R1" "154-1_R1", "155a_R1", "156a_R1", "156b_R1", "D_R1"
These files were trimmed using dDocent and then the following command was executed
RefMapOpt.sh 3 5 3 5 0.9 PE 10 lib2_filelist.txt
| Cov | Non0Cov | Contigs | MeanContigsMapped | K1 | K2 | SUM.Mapped | SUM.Properly | Mean.Mapped | Mean.Properly | MisMatched |
|---|---|---|---|---|---|---|---|---|---|---|
| 42.4 | 61.5 | 27005 | 18473 | 3 | 3 | 22876032 | 0 | 1143800 | 0 | 0 |
| 85.0 | 128.6 | 10311 | 6786 | 3 | 4 | 17531290 | 0 | 876564 | 0 | 0 |
| 179.2 | 262.0 | 2475 | 1689 | 3 | 5 | 8875172 | 0 | 443759 | 0 | 0 |
| 51.2 | 74.9 | 21681 | 14741 | 4 | 3 | 22195273 | 0 | 1109760 | 0 | 0 |
| 111.4 | 171.6 | 7661 | 4954 | 4 | 4 | 17067252 | 0 | 853363 | 0 | 0 |
| 238.3 | 363.7 | 1929 | 1261 | 4 | 5 | 9196928 | 0 | 459846 | 0 | 0 |
| 58.7 | 87.2 | 17725 | 11899 | 5 | 3 | 20826292 | 0 | 1041310 | 0 | 0 |
| 137.6 | 217.0 | 5792 | 3664 | 5 | 4 | 15944638 | 0 | 797232 | 0 | 0 |
| 331.4 | 533.1 | 1522 | 945 | 5 | 5 | 10093604 | 0 | 504680 | 0 | 0 |
Based on SUM.Mapped the best choices are:
I report here the ranking of different configurations on the three libraries:
Elshire Lib1 Lib2
Best K1=5 K2=3 K1=3 K2=3 K1=3 K2=3
Second best K1=3 K2=3 K1=4 K2=3 K1=4 K2=3
Third best K1=4 K2=3 K1=5 K2=3 K1=5 K2=3
I thus selected K1=3 and K2=3 (along with similarity=0.9) as the configuration of choice. Rationale:
I plan to run dDocent two times, once on the new (Elshire) library, and once on a library obtained joining lib1 and lib2. The reason for the separate calling is simple: Elshire library is double ended, the other are single ended. The software cannot easily manage the mixture of differente sequencing approach.
I used the following config file:
## Number of Processors
## 10
## Maximum Memory
## 30
## Trimming
## no
## Assembly?
## no
## Type_of_Assembly
## PE
## Clustering_Similarity%
## 0.9
## Minimum within individual coverage level to include a read for assembly (K1)
## 3
## Minimum number of individuals a read must be present in to include for assembly (K2)
## 3
## Mapping_Reads?
## no
## Mapping_Match_Value
## 1
## Mapping_MisMatch_Value
## 4
## Mapping_GapOpen_Penalty
## 6
## Calling_SNPs?
## yes
## Email
## foo.bar@baz.com
I’ve reported the assembly conf params for completeness, but they are not used at this stage (I already have the reference genome). The mapping parameters are BWA defaults.
dDocent runs without hiccups on the first two steps, namely trimming and mapping. However, I was not able to proceed to actual SNP calling due to memory issues. In particular, one of dDocent steps is to create an interval file merging all intervals encountered. This was done, in dDocent main script, with the following bash function:
CreateIntervals()
{
samtools merge -@$NUMProc -b bamlist.list -f cat-RRG.bam &>/dev/null
samtools index cat-RRG.bam
wait
bedtools merge -i cat-RRG.bam -bed > mapped.bed
}
The first command (samtools merge) merges all .bam files coming from bwa alignment into a single one (named cat-RRG.bam). The resulting file is then indexed. There is then a wait command (why?).
Finally, the bedtools call creates a merged interval file in .bed format. This command does not work on my system, I run out of memory.
I tried to use bedops suite instead of bedtools. My approach was:
bam2starch --max-mem 35G < cat-RRG.bam > cat-RRG.starch
bedops -m cat-RRG.starch > mapped.bed
The first command is simply a format conversion from .bam to .starch
The second command does the same merge operation as bedtools but it’s supposed to require less memory. However, it does so by using large amounts of disk space. In fact for a ~65GB bam file my 250GB of hard disk space was not enough and the command crashed.
I’ve written a small python script to avoid using so much memory and hard disk space all together. Interval merging can safely be done by chromosome/contig, since intervals from different chromosomes are not going to be merged anyway.
The core of my script does two operations. First, for each contig:
#pseudo bash code, it's actually all python
INFILE = "path/to/cat-RRG.bam"
TMP_FILE = "some_file_name_" + contigs[i]
samtools view $INFILE contig[i] -b > $TMP_FILE.bam
samtools index $TMP_FILE.bam
bam2starch --max-mem 35G < $TMP_FILE.bam > $TMP_FILE.starch
rm $TMP_FILE.bam
rm $TMP_FILE.bam.bai
This code, for each contig, creates a .bam, indexes it, convert it to .starch, and removes the remnants .bam and .bai
Then all .starch files are merged:
bedops -m *.starch > mapped.bed
Things are actually a little more complicated. It is not possible to run the bedops merge on all .starch files together, since there are 37000+ contigs (bedops cannot open that many files all together and crashes).
In my a python script the merging is done by blocks of 100 contigs. The resulting 370 .bed files are then merged again in a single mapped.bed file.
Looking at dDocent code it’s quite easy to let it know that it should not do the interval merging. I need to:
Unfortunately, it’s not enough. At line 337 dDocent has the following command:
cat namelist | parallel -j $FB2 "bedtools coverage -b {}-RG.bam -a mapped.bed -counts -sorted -g genome.file > {}.cov.stats"
The interesting part is the “-sorted” option for bedtools. This requires that mapped.bed to be sorted by chromosome and position (it is). However, it generates the following error (one per sample file, I report just one):
Error: Sorted input specified, but the file mapped.bed has the following record with a different sort order than the genomeFile genome.file
This can be traced to a sorting problem:
I thus need to sort mapped.bed by the number. It’s easily done with a small python script.
Unfortunately, it’s not enough (again…). The error is the same as above. Taking a look to the various files:
(ddocent_env) nelson@CREA-ZA-PC$ head genome.file dDocent_Contig_1 419 dDocent_Contig_3 312 dDocent_Contig_7 336 dDocent_Contig_9 340 dDocent_Contig_12 334 dDocent_Contig_15 350 dDocent_Contig_16 344 dDocent_Contig_18 342 dDocent_Contig_20 322 dDocent_Contig_22 336 (ddocent_env) nelson@CREA-ZA-PC:$ head mapped.bed dDocent_Contig_1 0 419 dDocent_Contig_3 0 312 dDocent_Contig_4 0 228 dDocent_Contig_6 0 190 dDocent_Contig_7 0 148 dDocent_Contig_7 173 336 dDocent_Contig_8 0 229 dDocent_Contig_9 0 165 dDocent_Contig_9 174 340 dDocent_Contig_10 0 240
The ordering is still different. In particular mapped.bed is “correct” (my script reordered it by actual number, not by lexycographical order). The ordering in genome.file is not (e.g. contig dDocent_Contig_4 is found down the line).
Final decision: I give up. Running the original dDocent script at this point is not worth my effort anymore. I created a local copy of the script and removed the “-sorted” option from the betools invocation (the final result is the same, the option is used only to save RAM space). The script now runs and enters the freeBayes phase.
TO BE UPDATED…