Reference and Resources

Nature Protocols volume 8, pages 1494-1512 (2013)

AWS instance: m4.xlarge

Directory contents

ls -l tuxedo2_HSB_Spombe

-rw-rw-r-- 1 ubuntu ubuntu 7830629 Mar 5 02:01 S_pombe_refTrans.fasta

-rw-rw-r-- 1 ubuntu ubuntu 175846179 Mar 5 02:00 Sp.ds.1M.left.fq

-rw-rw-r-- 1 ubuntu ubuntu 175846179 Mar 5 02:00 Sp.ds.1M.right.fq

-rw-rw-r-- 1 ubuntu ubuntu 175736042 Mar 5 02:00 Sp.hs.1M.left.fq

-rw-rw-r-- 1 ubuntu ubuntu 175736042 Mar 5 02:00 Sp.hs.1M.right.fq

-rw-rw-r-- 1 ubuntu ubuntu 175741215 Mar 5 02:00 Sp.log.1M.left.fq

-rw-rw-r-- 1 ubuntu ubuntu 175741215 Mar 5 02:00 Sp.log.1M.right.fq

-rw-rw-r-- 1 ubuntu ubuntu 175899533 Mar 5 02:00 Sp.plat.1M.left.fq

-rw-rw-r-- 1 ubuntu ubuntu 175899533 Mar 5 02:00 Sp.plat.1M.right.fq
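Before aligning, it can be worth confirming that the left and right files of each sample hold the same number of reads. The sketch below assumes plain four-line FASTQ records (no wrapped sequence lines); `count_reads` and `check_pairs` are helper names made up for this note, not part of any tool.

```shell
# Count reads in a FASTQ file, assuming exactly four lines per record.
count_reads() {
    local lines
    lines=$(wc -l < "$1")
    echo $(( lines / 4 ))
}

# Compare read counts of the left/right mates for every sample
# found in the current directory (files named Sp.<sample>.left.fq).
check_pairs() {
    local left right
    for left in Sp.*.left.fq; do
        [ -e "$left" ] || continue      # glob matched nothing
        right="${left%.left.fq}.right.fq"
        if [ "$(count_reads "$left")" -eq "$(count_reads "$right")" ]; then
            echo "OK   $left / $right"
        else
            echo "DIFF $left / $right"
        fi
    done
}
```

Running `check_pairs` inside tuxedo2_HSB_Spombe should print one line per sample; a `DIFF` line would point to a truncated or mismatched file.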

Build an index for S. pombe using HISAT2

tuxedo2_HSB_Spombe$ hisat2-build S_pombe_refTrans.fasta SpIndex

Settings:

Output files: "SpIndex.*.ht2"
Line rate: 6 (line is 64 bytes)
Lines per side: 1 (side is 64 bytes)
Offset rate: 4 (one in 16)
FTable chars: 10
Strings: unpacked
Local offset rate: 3 (one in 8)
Local fTable chars: 6
Local sequence length: 57344
Local sequence overlap between two consecutive indexes: 1024
Endianness: little
Actual local endianness: little
Sanity checking: disabled
Assertions: disabled
Random seed: 0
Sizeofs: void*:8, int:4, long:8, size_t:8

Input files DNA, FASTA:

S_pombe_refTrans.fasta

Reading reference sizes
Time reading reference sizes: 00:00:00
Calculating joined length
Writing header
Reserving space for joined string
Joining reference sequences
Time to join reference sequences: 00:00:01
Time to read SNPs and splice sites: 00:00:00
Using parameters --bmax 1352447 --dcv 1024
Doing ahead-of-time memory usage test
Passed! Constructing with these parameters: --bmax 1352447 --dcv 1024
Constructing suffix-array element generator
Building DifferenceCoverSample
Building sPrime
Building sPrimeOrder
V-Sorting samples
V-Sorting samples time: 00:00:00
Allocating rank array
Ranking v-sort output
Ranking v-sort output time: 00:00:00
Invoking Larsson-Sadakane on ranks
Invoking Larsson-Sadakane on ranks time: 00:00:00
Sanity-checking and returning

Building samples

Reserving space for 12 sample suffixes
Generating random suffixes
QSorting 12 sample offsets, eliminating duplicates
QSorting sample offsets, eliminating duplicates time: 00:00:00
Multikey QSorting 12 samples (Using difference cover)
Multikey QSorting samples time: 00:00:00
Calculating bucket sizes
Splitting and merging
Splitting and merging time: 00:00:00
Avg bucket size: 1.03043e+06 (target: 1352446)
Converting suffix-array elements to index image
Allocating ftab, absorbFtab
Entering GFM loop

Getting block 1 of 7

Reserving size (1352447) for bucket 1
Calculating Z arrays for bucket 1
Entering block accumulator loop for bucket 1:
bucket 1: 10% bucket 1: 20% bucket 1: 30% bucket 1: 40% bucket 1: 50% bucket 1: 60% bucket 1: 70% bucket 1: 80% bucket 1: 90% bucket 1: 100%
Sorting block of length 402623 for bucket 1 (Using difference cover)
Sorting block time: 00:00:00
Returning block of 402624 for bucket 1

Getting block 2 of 7

Reserving size (1352447) for bucket 2
Calculating Z arrays for bucket 2
Entering block accumulator loop for bucket 2:
bucket 2: 10% bucket 2: 20% bucket 2: 30% bucket 2: 40% bucket 2: 50% bucket 2: 60% bucket 2: 70% bucket 2: 80% bucket 2: 90% bucket 2: 100%
Sorting block of length 1252535 for bucket 2 (Using difference cover)
Sorting block time: 00:00:00
Returning block of 1252536 for bucket 2

Getting block 3 of 7

Reserving size (1352447) for bucket 3
Calculating Z arrays for bucket 3
Entering block accumulator loop for bucket 3:
bucket 3: 10% bucket 3: 20% bucket 3: 30% bucket 3: 40% bucket 3: 50% bucket 3: 60% bucket 3: 70% bucket 3: 80% bucket 3: 90% bucket 3: 100%
Sorting block of length 932862 for bucket 3 (Using difference cover)
Sorting block time: 00:00:00
Returning block of 932863 for bucket 3

Getting block 4 of 7

Reserving size (1352447) for bucket 4
Calculating Z arrays for bucket 4
Entering block accumulator loop for bucket 4:
bucket 4: 10% bucket 4: 20% bucket 4: 30% bucket 4: 40% bucket 4: 50% bucket 4: 60% bucket 4: 70% bucket 4: 80% bucket 4: 90% bucket 4: 100%
Sorting block of length 1218128 for bucket 4 (Using difference cover)
Sorting block time: 00:00:01
Returning block of 1218129 for bucket 4

Getting block 5 of 7

Reserving size (1352447) for bucket 5
Calculating Z arrays for bucket 5
Entering block accumulator loop for bucket 5:
bucket 5: 10% bucket 5: 20% bucket 5: 30% bucket 5: 40% bucket 5: 50% bucket 5: 60% bucket 5: 70% bucket 5: 80% bucket 5: 90% bucket 5: 100%
Sorting block of length 1338798 for bucket 5 (Using difference cover)
Sorting block time: 00:00:00
Returning block of 1338799 for bucket 5

Getting block 6 of 7

Reserving size (1352447) for bucket 6
Calculating Z arrays for bucket 6
Entering block accumulator loop for bucket 6:
bucket 6: 10% bucket 6: 20% bucket 6: 30% bucket 6: 40% bucket 6: 50% bucket 6: 60% bucket 6: 70% bucket 6: 80% bucket 6: 90% bucket 6: 100%
Sorting block of length 1040348 for bucket 6 (Using difference cover)
Sorting block time: 00:00:00
Returning block of 1040349 for bucket 6

Getting block 7 of 7

Reserving size (1352447) for bucket 7
Calculating Z arrays for bucket 7
Entering block accumulator loop for bucket 7:
bucket 7: 10% bucket 7: 20% bucket 7: 30% bucket 7: 40% bucket 7: 50% bucket 7: 60% bucket 7: 70% bucket 7: 80% bucket 7: 90% bucket 7: 100%
Sorting block of length 1027750 for bucket 7 (Using difference cover)
Sorting block time: 00:00:00
Returning block of 1027751 for bucket 7
Exited GFM loop
fchr[A]: 0
fchr[C]: 2160941
fchr[G]: 3570451
fchr[T]: 5019534
fchr[$]: 7213050
Exiting GFM::buildToDisk()
Returning from initFromVector

Wrote 7171170 bytes to primary GFM file: SpIndex.1.ht2

Wrote 1803268 bytes to secondary GFM file: SpIndex.2.ht2

Re-opening _in1 and _in2 as input streams
Returning from GFM constructor
Returning from initFromVector

Wrote 44894645 bytes to primary GFM file: SpIndex.5.ht2

Wrote 1809058 bytes to secondary GFM file: SpIndex.6.ht2

Re-opening _in5 and _in5 as input streams
Returning from HierEbwt constructor

Headers:
len: 7213050
gbwtLen: 7213051
nodes: 7213051
sz: 1803263
gbwtSz: 1803263
lineRate: 6
offRate: 4
offMask: 0xfffffff0
ftabChars: 10
eftabLen: 0
eftabSz: 0
ftabLen: 1048577
ftabSz: 4194308
offsLen: 450816
offsSz: 1803264
lineSz: 64
sideSz: 64
sideGbwtSz: 48
sideGbwtLen: 192
numSides: 37568
numLines: 37568
gbwtTotLen: 2404352
gbwtTotSz: 2404352
reverse: 0
linearFM: Yes
Total time for call to driver() for forward index: 00:00:10

mkdir SpHisat_index

mv SpIndex* SpHisat_index/

ls -l SpHisat_index/

-rw-rw-r-- 1 ubuntu ubuntu 7171170 Mar 5 02:38 SpIndex.1.ht2

-rw-rw-r-- 1 ubuntu ubuntu 1803268 Mar 5 02:38 SpIndex.2.ht2

-rw-rw-r-- 1 ubuntu ubuntu 46475 Mar 5 02:38 SpIndex.3.ht2

-rw-rw-r-- 1 ubuntu ubuntu 1803263 Mar 5 02:38 SpIndex.4.ht2

-rw-rw-r-- 1 ubuntu ubuntu 44894645 Mar 5 02:39 SpIndex.5.ht2

-rw-rw-r-- 1 ubuntu ubuntu 1809058 Mar 5 02:39 SpIndex.6.ht2

-rw-rw-r-- 1 ubuntu ubuntu 12 Mar 5 02:38 SpIndex.7.ht2

-rw-rw-r-- 1 ubuntu ubuntu 8 Mar 5 02:38 SpIndex.8.ht2
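The listing shows the eight `.ht2` files that make up the index. A small check like the one below can confirm they are all present and non-empty before starting an alignment. This is only a sketch: `check_index` is a name invented for this note, and it assumes a small index with the `.1.ht2` to `.8.ht2` suffixes seen above.

```shell
# Report whether all eight index files for a given basename exist
# and are non-empty. Returns 0 when complete, non-zero otherwise.
# Usage: check_index SpHisat_index/SpIndex
check_index() {
    local base="$1" n missing=0
    for n in 1 2 3 4 5 6 7 8; do
        if [ ! -s "${base}.${n}.ht2" ]; then
            echo "missing or empty: ${base}.${n}.ht2"
            missing=$(( missing + 1 ))
        fi
    done
    [ "$missing" -eq 0 ] && echo "index complete: ${base}"
    return "$missing"
}
```

The basename passed to `check_index` is the same value given to `hisat2 -x`.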

Align fastq files to the hisat2-build index using HISAT2

tuxedo2_HSB_Spombe$ hisat2 -p 8 --rna-strandness RF --dta -x SpHisat_index/SpIndex -1 Sp.ds.1M.left.fq -2 Sp.ds.1M.right.fq -S Sp_ds.sam --summary-file Sp_ds_alignStats.txt

1000000 reads; of these:

1000000 (100.00%) were paired; of these:

415768 (41.58%) aligned concordantly 0 times

571209 (57.12%) aligned concordantly exactly 1 time

13023 (1.30%) aligned concordantly >1 times

----

415768 pairs aligned concordantly 0 times; of these:

  1338 (0.32%) aligned discordantly 1 time
  
----

414430 pairs aligned 0 times concordantly or discordantly; of these:

  828860 mates make up the pairs; of these:
  
    590399 (71.23%) aligned 0 times
    
    233679 (28.19%) aligned exactly 1 time
    
    4782 (0.58%) aligned >1 times
    

70.48% overall alignment rate
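The overall rate can be reproduced from the counts in the summary: both mates of every concordantly or discordantly aligned pair count as aligned, the individually aligned mates are added, and the sum is divided by the total number of mates (2 x 1000000). A minimal check with the numbers above; the function name `overall_rate` is introduced here only for illustration.

```shell
# overall_rate PAIRS CONC_1 CONC_MULTI DISC MATES_1 MATES_MULTI
# Prints the overall alignment rate as a percentage with two decimals.
overall_rate() {
    awk -v pairs="$1" -v c1="$2" -v cm="$3" -v d="$4" -v m1="$5" -v mm="$6" \
        'BEGIN { printf "%.2f%%\n", 100 * (2 * (c1 + cm + d) + m1 + mm) / (2 * pairs) }'
}

# Counts taken from the Sp_ds summary above.
overall_rate 1000000 571209 13023 1338 233679 4782   # prints 70.48%
```

Here 2 x (571209 + 13023 + 1338) + 233679 + 4782 = 1409601 aligned mates out of 2000000.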

Repeat the runs using the remaining fastq samples/files
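Instead of retyping the command for each sample, the four runs could be driven by a loop. The sketch below reuses the options from the ds run and the sample names from the directory listing (ds, hs, log, plat). The `align_all` function and the `DRY_RUN` switch are assumptions added for this note; by default it only prints the commands so they can be checked first.

```shell
# Print (or run) the HISAT2 command for each sample.
# DRY_RUN=1 (the default) only echoes the commands;
# set DRY_RUN=0 to actually execute them.
align_all() {
    local s
    for s in ds hs log plat; do
        # Build the command as the positional parameters so that
        # it can be either printed or executed unchanged.
        set -- hisat2 -p 8 --rna-strandness RF --dta \
            -x SpHisat_index/SpIndex \
            -1 "Sp.${s}.1M.left.fq" -2 "Sp.${s}.1M.right.fq" \
            -S "Sp_${s}.sam" --summary-file "Sp_${s}_alignStats.txt"
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "$*"
        else
            "$@"
        fi
    done
}

align_all
```

Once the printed commands look right, `DRY_RUN=0 align_all` would run them one after another.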

mkdir Aligned_sam

mv *.sam Aligned_sam/

ls -l Aligned_sam/

-rw-rw-r-- 1 ubuntu ubuntu 542257875 Mar 5 02:50 Sp_ds.sam

-rw-rw-r-- 1 ubuntu ubuntu 543704127 Mar 5 02:49 Sp_hs.sam

-rw-rw-r-- 1 ubuntu ubuntu 557552191 Mar 5 02:52 Sp_log.sam

-rw-rw-r-- 1 ubuntu ubuntu 530192457 Mar 5 02:53 Sp_plat.sam

mkdir AlignedStats

mv *.txt AlignedStats/

ls -l AlignedStats/

-rw-rw-r-- 1 ubuntu ubuntu 619 Mar 5 02:50 Sp_ds_alignStats.txt

-rw-rw-r-- 1 ubuntu ubuntu 619 Mar 5 02:49 Sp_hs_alignStats.txt

-rw-rw-r-- 1 ubuntu ubuntu 619 Mar 5 02:52 Sp_log_alignStats.txt

-rw-rw-r-- 1 ubuntu ubuntu 619 Mar 5 02:53 Sp_plat_alignStats.txt

Sort the SAM files by chromosomal co-ordinates and convert them to BAM format

mkdir Aligned_bam

tuxedo2_HSB_Spombe$ samtools view -Su Aligned_sam/Sp_ds.sam | samtools sort -o Aligned_bam/Sp_ds.sorted.bam

samtools view -Su Aligned_sam/Sp_hs.sam | samtools sort -o Aligned_bam/Sp_hs.sorted.bam

samtools view -Su Aligned_sam/Sp_plat.sam | samtools sort -o Aligned_bam/Sp_plat.sorted.bam

samtools view -Su Aligned_sam/Sp_log.sam | samtools sort -o Aligned_bam/Sp_log.sorted.bam

ls -l Aligned_bam/

-rw-rw-r-- 1 ubuntu ubuntu 118530779 Mar 5 02:59 Sp_ds.sorted.bam

-rw-rw-r-- 1 ubuntu ubuntu 118401857 Mar 5 03:01 Sp_hs.sorted.bam

-rw-rw-r-- 1 ubuntu ubuntu 119007829 Mar 5 03:03 Sp_log.sorted.bam

-rw-rw-r-- 1 ubuntu ubuntu 117833483 Mar 5 03:02 Sp_plat.sorted.bam
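The four conversions above follow one pattern, so they could also be written as a loop over the sample names. This is a sketch using the same `samtools view -Su ... | samtools sort -o ...` pipeline as above; `sort_all` and the `DRY_RUN` switch are assumptions introduced for this note, and by default the function only prints what it would run.

```shell
# Print (or run) the SAM -> sorted BAM conversion for each sample.
# DRY_RUN=1 (the default) only echoes the pipelines;
# set DRY_RUN=0 to actually execute them.
sort_all() {
    local s
    for s in ds hs log plat; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "samtools view -Su Aligned_sam/Sp_${s}.sam | samtools sort -o Aligned_bam/Sp_${s}.sorted.bam"
        else
            mkdir -p Aligned_bam
            samtools view -Su "Aligned_sam/Sp_${s}.sam" \
                | samtools sort -o "Aligned_bam/Sp_${s}.sorted.bam"
        fi
    done
}

sort_all
```

With `DRY_RUN=0 sort_all` the output should match the Aligned_bam listing shown above.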

