To follow this tutorial, download Bowtie from SourceForge and unzip the archive into a folder of your choice. Then open a terminal and change to that folder, which contains the following executables and directories.
ls
## AUTHORS
## bowtie
## bowtie-align-l
## bowtie-align-l-debug
## bowtie-align-s
## bowtie-align-s-debug
## bowtie-build
## bowtie-buildc
## bowtie-build-l
## bowtie-build-l-debug
## bowtie-build-s
## bowtie-build-s-debug
## bowtie-inspect
## bowtie-inspect-l
## bowtie-inspect-l-debug
## bowtie-inspect-s
## bowtie-inspect-s-debug
## bowtie_practice.Rmd
## doc
## e_coli.1.ebwt
## e_coli.2.ebwt
## e_coli.3.ebwt
## e_coli.4.ebwt
## e_coli_index.1.ebwt
## e_coli_index.2.ebwt
## e_coli_index.3.ebwt
## e_coli_index.4.ebwt
## e_coli_index.rev.1.ebwt
## e_coli_index.rev.2.ebwt
## e_coli.rev.1.ebwt
## e_coli.rev.2.ebwt
## genomes
## indexes
## LICENSE
## make_s_cerevisiae.sh
## MANUAL
## MANUAL.markdown
## NEWS
## reads
## s_cerevisiae.1.ebwt
## s_cerevisiae.2.ebwt
## s_cerevisiae.3.ebwt
## s_cerevisiae.4.ebwt
## s_cerevisiae.ebwt.zip
## s_cerevisiae.rev.1.ebwt
## s_cerevisiae.rev.2.ebwt
## scripts
## SeqAn-1.1
## TUTORIAL
## VERSION
The first thing to do when using a Burrows-Wheeler based aligner such as Bowtie is to index the reference genome. For large genomes, such as those of mammals, this is a very time-consuming process, so pre-built indexes can be downloaded from the Bowtie website (right column, "Pre-built indexes" section) or from the iGenomes website.
Now, download the S. cerevisiae prebuilt reference and unzip it in our working directory.
wget ftp://ftp.ccb.jhu.edu/pub/data/bowtie_indexes/s_cerevisiae.ebwt.zip
unzip s_cerevisiae.ebwt.zip
To index a reference yourself, use the bowtie-build command.
./bowtie-build -h
## Usage: bowtie-build [options]* <reference_in> <ebwt_outfile_base>
## reference_in comma-separated list of files with ref sequences
## ebwt_outfile_base write Ebwt data to files with this dir/basename
## Options:
## -f reference files are Fasta (default)
## -c reference sequences given on cmd line (as <seq_in>)
## --large-index force generated index to be 'large', even if ref
## has fewer than 4 billion nucleotides
## -C/--color build a colorspace index
## -a/--noauto disable automatic -p/--bmax/--dcv memory-fitting
## -p/--packed use packed strings internally; slower, uses less mem
## --bmax <int> max bucket sz for blockwise suffix-array builder
## --bmaxdivn <int> max bucket sz as divisor of ref len (default: 4)
## --dcv <int> diff-cover period for blockwise (default: 1024)
## --nodc disable diff-cover (algorithm becomes quadratic)
## -r/--noref don't build .3/.4.ebwt (packed reference) portion
## -3/--justref just build .3/.4.ebwt (packed reference) portion
## -o/--offrate <int> SA is sampled every 2^offRate BWT chars (default: 5)
## -t/--ftabchars <int> # of chars consumed in initial lookup (default: 10)
## --threads <int> # of threads
## --ntoa convert Ns in reference to As
## --seed <int> seed for random number generator
## -q/--quiet verbose output (for debugging)
## -h/--help print detailed description of tool and its options
## --usage print this usage message
## --version print version information and quit
The bowtie-build command needs two parameters: <reference_in> is the reference genome as a FASTA file, and <ebwt_outfile_base> is the base name of the index files that are going to be generated, which will have the .ebwt extension.
./bowtie-build ./genomes/NC_008253.fna e_coli_index
## Settings:
## Output files: "e_coli_index.*.ebwt"
## Line rate: 6 (line is 64 bytes)
## Lines per side: 1 (side is 64 bytes)
## Offset rate: 5 (one in 32)
## FTable chars: 10
## Strings: unpacked
## Max bucket size: default
## Max bucket size, sqrt multiplier: default
## Max bucket size, len divisor: 4
## Difference-cover sample period: 1024
## Endianness: little
## Actual local endianness: little
## Sanity checking: disabled
## Assertions: disabled
## Random seed: 0
## Sizeofs: void*:8, int:4, long:8, size_t:8
## Input files DNA, FASTA:
## ./genomes/NC_008253.fna
## Reading reference sizes
## Time reading reference sizes: 00:00:00
## Calculating joined length
## Writing header
## Reserving space for joined string
## Joining reference sequences
## Time to join reference sequences: 00:00:00
## bmax according to bmaxDivN setting: 1234730
## Using parameters --bmax 926048 --dcv 1024
## Doing ahead-of-time memory usage test
## Passed! Constructing with these parameters: --bmax 926048 --dcv 1024
## Constructing suffix-array element generator
## Building DifferenceCoverSample
## Building sPrime
## Building sPrimeOrder
## V-Sorting samples
## V-Sorting samples time: 00:00:00
## Allocating rank array
## Ranking v-sort output
## Ranking v-sort output time: 00:00:00
## Invoking Larsson-Sadakane on ranks
## Invoking Larsson-Sadakane on ranks time: 00:00:00
## Sanity-checking and returning
## Building samples
## Reserving space for 12 sample suffixes
## Generating random suffixes
## QSorting 12 sample offsets, eliminating duplicates
## QSorting sample offsets, eliminating duplicates time: 00:00:00
## Multikey QSorting 12 samples
## (Using difference cover)
## Multikey QSorting samples time: 00:00:00
## Calculating bucket sizes
## Splitting and merging
## Splitting and merging time: 00:00:00
## Avg bucket size: 4.93892e+06 (target: 926047)
## Converting suffix-array elements to index image
## Allocating ftab, absorbFtab
## Entering Ebwt loop
## Getting block 1 of 1
## No samples; assembling all-inclusive block
## Sorting block of length 4938920 for bucket 1
## (Using difference cover)
## Sorting block time: 00:00:01
## Returning block of 4938921 for bucket 1
## Exited Ebwt loop
## fchr[A]: 0
## fchr[C]: 1222723
## fchr[G]: 2474304
## fchr[T]: 3717743
## fchr[$]: 4938920
## Exiting Ebwt::buildToDisk()
## Returning from initFromVector
## Wrote 5605733 bytes to primary EBWT file: e_coli_index.1.ebwt
## Wrote 617372 bytes to secondary EBWT file: e_coli_index.2.ebwt
## Re-opening _in1 and _in2 as input streams
## Returning from Ebwt constructor
## Headers:
## len: 4938920
## bwtLen: 4938921
## sz: 1234730
## bwtSz: 1234731
## lineRate: 6
## linesPerSide: 1
## offRate: 5
## offMask: 0xffffffe0
## isaRate: -1
## isaMask: 0xffffffff
## ftabChars: 10
## eftabLen: 20
## eftabSz: 80
## ftabLen: 1048577
## ftabSz: 4194308
## offsLen: 154342
## offsSz: 617368
## isaLen: 0
## isaSz: 0
## lineSz: 64
## sideSz: 64
## sideBwtSz: 56
## sideBwtLen: 224
## numSidePairs: 11025
## numSides: 22050
## numLines: 22050
## ebwtTotLen: 1411200
## ebwtTotSz: 1411200
## reverse: 0
## Total time for call to driver() for forward index: 00:00:02
## Reading reference sizes
## Time reading reference sizes: 00:00:00
## Calculating joined length
## Writing header
## Reserving space for joined string
## Joining reference sequences
## Time to join reference sequences: 00:00:00
## bmax according to bmaxDivN setting: 1234730
## Using parameters --bmax 926048 --dcv 1024
## Doing ahead-of-time memory usage test
## Passed! Constructing with these parameters: --bmax 926048 --dcv 1024
## Constructing suffix-array element generator
## Building DifferenceCoverSample
## Building sPrime
## Building sPrimeOrder
## V-Sorting samples
## V-Sorting samples time: 00:00:00
## Allocating rank array
## Ranking v-sort output
## Ranking v-sort output time: 00:00:00
## Invoking Larsson-Sadakane on ranks
## Invoking Larsson-Sadakane on ranks time: 00:00:00
## Sanity-checking and returning
## Building samples
## Reserving space for 12 sample suffixes
## Generating random suffixes
## QSorting 12 sample offsets, eliminating duplicates
## QSorting sample offsets, eliminating duplicates time: 00:00:00
## Multikey QSorting 12 samples
## (Using difference cover)
## Multikey QSorting samples time: 00:00:00
## Calculating bucket sizes
## Splitting and merging
## Splitting and merging time: 00:00:00
## Avg bucket size: 4.93892e+06 (target: 926047)
## Converting suffix-array elements to index image
## Allocating ftab, absorbFtab
## Entering Ebwt loop
## Getting block 1 of 1
## No samples; assembling all-inclusive block
## Sorting block of length 4938920 for bucket 1
## (Using difference cover)
## Sorting block time: 00:00:01
## Returning block of 4938921 for bucket 1
## Exited Ebwt loop
## fchr[A]: 0
## fchr[C]: 1222723
## fchr[G]: 2474304
## fchr[T]: 3717743
## fchr[$]: 4938920
## Exiting Ebwt::buildToDisk()
## Returning from initFromVector
## Wrote 5605733 bytes to primary EBWT file: e_coli_index.rev.1.ebwt
## Wrote 617372 bytes to secondary EBWT file: e_coli_index.rev.2.ebwt
## Re-opening _in1 and _in2 as input streams
## Returning from Ebwt constructor
## Headers:
## len: 4938920
## bwtLen: 4938921
## sz: 1234730
## bwtSz: 1234731
## lineRate: 6
## linesPerSide: 1
## offRate: 5
## offMask: 0xffffffe0
## isaRate: -1
## isaMask: 0xffffffff
## ftabChars: 10
## eftabLen: 20
## eftabSz: 80
## ftabLen: 1048577
## ftabSz: 4194308
## offsLen: 154342
## offsSz: 617368
## isaLen: 0
## isaSz: 0
## lineSz: 64
## sideSz: 64
## sideBwtSz: 56
## sideBwtLen: 224
## numSidePairs: 11025
## numSides: 22050
## numLines: 22050
## ebwtTotLen: 1411200
## ebwtTotSz: 1411200
## reverse: 0
## Total time for backward call to driver() for mirror index: 00:00:01
When the indexing ends, several new files are generated.
ls *.ebwt
## e_coli.1.ebwt
## e_coli.2.ebwt
## e_coli.3.ebwt
## e_coli.4.ebwt
## e_coli_index.1.ebwt
## e_coli_index.2.ebwt
## e_coli_index.3.ebwt
## e_coli_index.4.ebwt
## e_coli_index.rev.1.ebwt
## e_coli_index.rev.2.ebwt
## e_coli.rev.1.ebwt
## e_coli.rev.2.ebwt
## s_cerevisiae.1.ebwt
## s_cerevisiae.2.ebwt
## s_cerevisiae.3.ebwt
## s_cerevisiae.4.ebwt
## s_cerevisiae.rev.1.ebwt
## s_cerevisiae.rev.2.ebwt
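A quick way to confirm that an index is complete is to check for the six files that share its base name. The sketch below assumes a small index (not one built with --large-index) whose files follow the naming seen in the listing above. The helper names `expected_index_files` and `missing_index_files` are made up for illustration and are not part of Bowtie.

```python
from pathlib import Path

# Suffixes of a small Bowtie index, as seen in the `ls *.ebwt` listing above.
EBWT_SUFFIXES = (".1.ebwt", ".2.ebwt", ".3.ebwt", ".4.ebwt",
                 ".rev.1.ebwt", ".rev.2.ebwt")


def expected_index_files(basename):
    """Return the file names a small Bowtie index with this base name should have."""
    return [basename + suffix for suffix in EBWT_SUFFIXES]


def missing_index_files(basename, directory="."):
    """Return the expected index files that are not present in `directory`."""
    folder = Path(directory)
    return [name for name in expected_index_files(basename)
            if not (folder / name).is_file()]


if __name__ == "__main__":
    missing = missing_index_files("e_coli_index")
    print("index complete" if not missing else f"missing: {missing}")
```

Run from the Bowtie folder, this reports either that the index is complete or which files are absent.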
We can inspect the reference index using the bowtie-inspect command. We will use the -s option, which prints only a summary of the index. By default (without -s), bowtie-inspect prints the FASTA records of the indexed nucleotide sequences (see bowtie-inspect -h for other options).
./bowtie-inspect -s e_coli_index
## Colorspace 0
## SA-Sample 1 in 32
## FTab-Chars 10
## Sequence-1 gi|110640213|ref|NC_008253.1| Escherichia coli 536, complete genome 4938920
Once the reference has been indexed, we can align reads to the genome using the bowtie command. This command has a lot of options:
./bowtie -h
## usage: bowtie [-h] [-b | -i] [--verbose] [--debug] [--large-index]
## [--index INDEX]
##
## optional arguments:
## -h, --help show this help message and exit
## -b, --build
## -i, --inspect
## --verbose
## --debug
## --large-index
## --index INDEX
A handy option for experimenting with bowtie is -c, which lets you pass the sequence to align directly on the command line.
./bowtie -c e_coli_index ATAA
## 0 + gi|110640213|ref|NC_008253.1| 3074809 ATAA IIII 24193
## # reads processed: 1
## # reads with at least one reported alignment: 1 (100.00%)
## # reads that failed to align: 0 (0.00%)
## Reported 1 alignments to 1 output stream(s)
Let’s try a longer read, for example ATTGTAGTTCGAGTAAGTAATGTGGGTTTG:
./bowtie -c e_coli_index ATTGTAGTTCGAGTAAGTAATGTGGGTTTG
## # reads processed: 1
## # reads with at least one reported alignment: 0 (0.00%)
## # reads that failed to align: 1 (100.00%)
## No alignments
The output shows that the read fails to align to the E. coli index. Against the S. cerevisiae index, however, it aligns:
./bowtie -c s_cerevisiae ATTGTAGTTCGAGTAAGTAATGTGGGTTTG
## 0 + Scchr02 90972 ATTGTAGTTCGAGTAAGTAATGTGGGTTTG IIIIIIIIIIIIIIIIIIIIIIIIIIIIII 0
## # reads processed: 1
## # reads with at least one reported alignment: 1 (100.00%)
## # reads that failed to align: 0 (0.00%)
## Reported 1 alignments to 1 output stream(s)
The read aligns here because this sequence comes from S. cerevisiae.
Bowtie outputs one alignment per line. Its default output format consists of 8 tab-separated fields. The most important column is the fourth, which is the 0-based leftmost position of the alignment. This format is usually stored with the .map extension.
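As an illustration, one line of the default output can be split into named fields. This is a minimal sketch: the field names in `BowtieHit` are chosen here for readability and are not Bowtie terminology, and the sample line is the perfect-match S. cerevisiae alignment shown earlier.

```python
from typing import NamedTuple


class BowtieHit(NamedTuple):
    read_name: str
    strand: str
    reference: str
    offset: int          # 0-based leftmost position of the alignment
    sequence: str
    qualities: str
    other_hits: int      # column 7, kept as printed by Bowtie
    mismatches: str      # e.g. "28:T>A,29:G>A"; empty when there are none


def parse_hit(line):
    """Split one tab-separated line of Bowtie's default output into fields."""
    fields = line.rstrip("\n").split("\t")
    # The mismatch column may be missing or empty for perfect matches.
    fields += [""] * (8 - len(fields))
    name, strand, ref, offset, seq, quals, other, mms = fields[:8]
    return BowtieHit(name, strand, ref, int(offset), seq, quals, int(other), mms)


if __name__ == "__main__":
    line = ("0\t+\tScchr02\t90972\tATTGTAGTTCGAGTAAGTAATGTGGGTTTG\t"
            + "I" * 30 + "\t0\t")
    hit = parse_hit(line)
    # Print the reference, the 0-based offset and the equivalent 1-based position.
    print(hit.reference, hit.offset, hit.offset + 1)
```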
However, the most widely used alignment format is SAM/BAM. The -S or --sam option outputs the hits in SAM format. Note that SAM positions are 1-based, so the 0-based offset 90972 reported above becomes 90973.
./bowtie -cS s_cerevisiae ATTGTAGTTCGAGTAAGTAATGTGGGTTTG
## @HD VN:1.0 SO:unsorted
## @SQ SN:Scchr01 LN:230208
## @SQ SN:Scchr02 LN:813178
## @SQ SN:Scchr03 LN:316617
## @SQ SN:Scchr04 LN:1531917
## @SQ SN:Scchr05 LN:576869
## @SQ SN:Scchr06 LN:270148
## @SQ SN:Scchr07 LN:1090946
## @SQ SN:Scchr08 LN:562643
## @SQ SN:Scchr09 LN:439885
## @SQ SN:Scchr10 LN:745667
## @SQ SN:Scchr11 LN:666454
## @SQ SN:Scchr12 LN:1078175
## @SQ SN:Scchr13 LN:924429
## @SQ SN:Scchr14 LN:784333
## @SQ SN:Scchr15 LN:1091289
## @SQ SN:Scchr16 LN:948062
## @SQ SN:Scmito LN:85779
## @PG ID:Bowtie VN:1.2.1.1 CL:"bowtie-align --wrapper basic-0 -cS s_cerevisiae ATTGTAGTTCGAGTAAGTAATGTGGGTTTG"
## 0 0 Scchr02 90973 255 30M * 0 0 ATTGTAGTTCGAGTAAGTAATGTGGGTTTG IIIIIIIIIIIIIIIIIIIIIIIIIIIIII XA:i:0 MD:Z:30 NM:i:0 XM:i:2
## # reads processed: 1
## # reads with at least one reported alignment: 1 (100.00%)
## # reads that failed to align: 0 (0.00%)
## Reported 1 alignments to 1 output stream(s)
We can also redirect the output to a file:
./bowtie -cS s_cerevisiae ATTGTAGTTCGAGTAAGTAATGTGGGTTTG > alignment.sam
## # reads processed: 1
## # reads with at least one reported alignment: 1 (100.00%)
## # reads that failed to align: 0 (0.00%)
## Reported 1 alignments to 1 output stream(s)
To store as a more efficient BAM file, use samtools:
samtools view -Sb alignment.sam > alignment.bam
Let’s see the differences in size:
ls -lh alignment.*
## alignment.sam
NGS generates short reads (25-100 bp), so the probability that a read aligns at multiple positions increases. Short-read aligners have to be able either to report multiple alignments or to pick one of them heuristically. In Bowtie, the -k <int> option reports up to <int> valid alignments per read (the default is 1).
./bowtie -c -k 5 e_coli_index ATAA
## 0 + gi|110640213|ref|NC_008253.1| 3074809 ATAA IIII 24193
## 0 + gi|110640213|ref|NC_008253.1| 433792 ATAA IIII 24193
## 0 + gi|110640213|ref|NC_008253.1| 3665044 ATAA IIII 24193
## 0 + gi|110640213|ref|NC_008253.1| 1933628 ATAA IIII 24193
## 0 + gi|110640213|ref|NC_008253.1| 3294642 ATAA IIII 24193
## # reads processed: 5
## # reads with at least one reported alignment: 5 (100.00%)
## # reads that failed to align: 0 (0.00%)
## Reported 5 alignments to 1 output stream(s)
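To illustrate what reporting several alignments per read means, the sketch below lists up to k exact occurrences of a short read in a toy reference. It is a naive forward-strand string search, not the indexed search Bowtie performs, and both the reference string and the function name `find_hits` are made up for this example.

```python
def find_hits(read, reference, k=1):
    """Return up to k 0-based positions where `read` occurs exactly in `reference`."""
    hits = []
    start = reference.find(read)
    while start != -1 and len(hits) < k:
        hits.append(start)
        # Continue one base further so overlapping occurrences are also found.
        start = reference.find(read, start + 1)
    return hits


if __name__ == "__main__":
    toy_reference = "ATAACCATAAGGATAA"   # hypothetical sequence for illustration
    print(find_hits("ATAA", toy_reference, k=1))
    print(find_hits("ATAA", toy_reference, k=5))
```

With k=1 only the first occurrence is returned, while a larger k returns every occurrence up to that limit.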
Besides common biological variation with respect to the reference genome, reads coming from NGS usually contain sequencing errors. This means that the alignment may have mismatches. By default, Bowtie allows 2 mismatches. For example, if we change the last two nucleotides of the read we previously aligned to the S. cerevisiae genome, the alignment is the same:
./bowtie -c s_cerevisiae ATTGTAGTTCGAGTAAGTAATGTGGGTTTG
./bowtie -c s_cerevisiae ATTGTAGTTCGAGTAAGTAATGTGGGTTAA
## 0 + Scchr02 90972 ATTGTAGTTCGAGTAAGTAATGTGGGTTTG IIIIIIIIIIIIIIIIIIIIIIIIIIIIII 0
## # reads processed: 1
## # reads with at least one reported alignment: 1 (100.00%)
## # reads that failed to align: 0 (0.00%)
## Reported 1 alignments to 1 output stream(s)
## 0 + Scchr02 90972 ATTGTAGTTCGAGTAAGTAATGTGGGTTAA IIIIIIIIIIIIIIIIIIIIIIIIIIIIII 0 28:T>A,29:G>A
## # reads processed: 1
## # reads with at least one reported alignment: 1 (100.00%)
## # reads that failed to align: 0 (0.00%)
## Reported 1 alignments to 1 output stream(s)
However, if we introduce a third mismatch, the read can no longer be aligned:
./bowtie -c s_cerevisiae ATTGTAGTTCGAGTAAGTAATGTGGGTAAA
## # reads processed: 1
## # reads with at least one reported alignment: 0 (0.00%)
## # reads that failed to align: 1 (100.00%)
## No alignments
To align it, we need to raise the maximum number of mismatches to 3. This can be done using the -v option.
./bowtie -c -v3 s_cerevisiae ATTGTAGTTCGAGTAAGTAATGTGGGTAAA
## 0 + Scchr02 90972 ATTGTAGTTCGAGTAAGTAATGTGGGTAAA IIIIIIIIIIIIIIIIIIIIIIIIIIIIII 0 27:T>A,28:T>A,29:G>A
## # reads processed: 1
## # reads with at least one reported alignment: 1 (100.00%)
## # reads that failed to align: 0 (0.00%)
## Reported 1 alignments to 1 output stream(s)
The so-called -v alignment mode in Bowtie does not take the quality values into account. There is another, more complex, -n alignment mode (Bowtie's default) that will not be explained here.
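The acceptance rule of the -v mode can be illustrated by counting mismatches between a read and the reference stretch it is placed on. This is only a sketch of the criterion, assuming a forward-strand alignment of equal-length strings. It does not reproduce how Bowtie searches the index, and the function names are invented for the example.

```python
def mismatch_descriptors(read, reference):
    """List mismatches between a read and an equally long reference stretch.

    Descriptors use the 'offset:ref>read' form seen in the last column of
    Bowtie's default output, with 0-based offsets.
    """
    if len(read) != len(reference):
        raise ValueError("read and reference stretch must have the same length")
    return [f"{i}:{ref_base}>{read_base}"
            for i, (read_base, ref_base) in enumerate(zip(read, reference))
            if read_base != ref_base]


def passes_v_mode(read, reference, max_mismatches):
    """True if the read has at most `max_mismatches` mismatches (qualities ignored)."""
    return len(mismatch_descriptors(read, reference)) <= max_mismatches


if __name__ == "__main__":
    reference = "ATTGTAGTTCGAGTAAGTAATGTGGGTTTG"   # the perfectly matching read
    read = "ATTGTAGTTCGAGTAAGTAATGTGGGTAAA"        # three changed bases at the 3' end
    print(",".join(mismatch_descriptors(read, reference)))
    print(passes_v_mode(read, reference, 2), passes_v_mode(read, reference, 3))
```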
Although the -c option is quite useful for experimenting and learning, we usually don’t want to align single reads from the command line, but reads coming from big FASTQ files. In fact, FASTQ is Bowtie's default input format.
Bowtie comes with some example files in the reads directory:
ls ./reads/
## e_coli_10000snp.fa
## e_coli_10000snp.fq
## e_coli_1000_1.fa
## e_coli_1000_1.fq
## e_coli_1000_2.fa
## e_coli_1000_2.fq
## e_coli_1000.fa
## e_coli_1000.fq
## e_coli_1000_interleaved.fq
## e_coli_1000.raw
Let’s explore the files.
First, let’s look at reads in a multi-FASTA (.fa) file. As you can see, the reads have no quality values:
cat reads/e_coli_1000.fa | head
## >r0
## GAACGATACCCACCCAACTATCGCCATTCCAGCAT
## >r1
## CCGAACTGGATGTCTCATGGGATAAAAATCATCCG
## >r2
## TCAAAATTGTTATAGTATAACACTGTTGCTTTATG
## >r3
## AAAATTTGTGCCTGGATGGCCTGAGTACCNANTAC
## >r4
## GCAGAGCAGTTGCTAGAAANNNNNTTGAAGAGGTT
FASTQ files, however, store a quality value for every base. The most important thing to remember is that in a FASTQ file there are 4 lines per read:
cat reads/e_coli_1000.fq | head -n 8
## @r0
## GAACGATACCCACCCAACTATCGCCATTCCAGCAT
## +
## EDCCCBAAAA@@@@?>===<;;9:99987776554
## @r1
## CCGAACTGGATGTCTCATGGGATAAAAATCATCCG
## +
## EDCCCBAAAA@@@@?>===<;;9:99987776554
If you look at the options of bowtie, you’ll see that you can change the encoding of the qualities: the Sanger encoding, ASCII+33 (--phred33-quals), is the default, but other encodings such as --phred64-quals or --solexa-quals are also supported.
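The four-line layout and the ASCII offset can be made concrete with a small reader. This is a minimal sketch that assumes unwrapped records (exactly four lines per read, as in the example files). The function names are invented here, and the offset conversion applies to the Phred encodings only, since Solexa qualities use a different scale.

```python
import io


def read_fastq(handle):
    """Yield (name, sequence, qualities) from a 4-line-per-read FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        sequence = handle.readline().rstrip("\n")
        handle.readline()                      # the '+' separator line
        qualities = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(sequence) != len(qualities):
            raise ValueError(f"malformed FASTQ record: {header!r}")
        yield header[1:], sequence, qualities


def phred_scores(qualities, offset=33):
    """Convert a quality string to integer Phred scores (offset 33 or 64)."""
    return [ord(char) - offset for char in qualities]


if __name__ == "__main__":
    sample = io.StringIO(
        "@r0\n"
        "GAACGATACCCACCCAACTATCGCCATTCCAGCAT\n"
        "+\n"
        "EDCCCBAAAA@@@@?>===<;;9:99987776554\n"
    )
    for name, sequence, qualities in read_fastq(sample):
        scores = phred_scores(qualities)
        # Read name, read length, first and last quality score.
        print(name, len(sequence), scores[0], scores[-1])
```

For the read r0 above, the first quality character E decodes to 36 and the last character 4 decodes to 19 under the ASCII+33 encoding.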
If the reads come from a single-end experiment, they are stored in a single FASTQ file, here e_coli_1000.fq. To align these reads, the command is:
./bowtie -S e_coli reads/e_coli_1000.fq > single_end_alignment.sam
## # reads processed: 1000
## # reads with at least one reported alignment: 699 (69.90%)
## # reads that failed to align: 301 (30.10%)
## Reported 699 alignments to 1 output stream(s)
A lot of reads (about 30%) are not aligned. Let’s take a look at the generated SAM file:
cat single_end_alignment.sam | head
## @HD VN:1.0 SO:unsorted
## @SQ SN:gi|110640213|ref|NC_008253.1| LN:4938920
## @PG ID:Bowtie VN:1.2.1.1 CL:"bowtie-align --wrapper basic-0 -S e_coli reads/e_coli_1000.fq"
## r0 16 gi|110640213|ref|NC_008253.1| 3658050 255 35M * 0 0 ATGCTGGAATGGCGATAGTTGGGTGGGTATCGTTC 45567778999:9;;<===>?@@@@AAAABCCCDE XA:i:0 MD:Z:0G1T32 NM:i:2 XM:i:2
## r1 16 gi|110640213|ref|NC_008253.1| 1902086 255 35M * 0 0 CGGATGATTTTTATCCCATGAGACATCCAGTTCGG 45567778999:9;;<===>?@@@@AAAABCCCDE XA:i:0 MD:Z:35 NM:i:0 XM:i:2
## r2 16 gi|110640213|ref|NC_008253.1| 3989610 255 35M * 0 0 CATAAAGCAACAGTGTTATACTATAACAATTTTGA 45567778999:9;;<===>?@@@@AAAABCCCDE XA:i:0 MD:Z:35 NM:i:0 XM:i:2
## r3 4 * 0 0 * * 0 0 AAAATTTGTGCCTGGATGGCCTGAGTACCNANTAC EDCCCBAAAA@@@@?>===<;;9:99987776554 XM:i:0
## r4 4 * 0 0 * * 0 0 GCAGAGCAGTTGCTAGAAANNNNNTTGAAGAGGTT EDCCCBAAAA@@@@?>===<;;9:99987776554 XM:i:0
## r5 0 gi|110640213|ref|NC_008253.1| 4249842 255 35M * 0 0 CAGCATAAGTGGATATTCAAAGTTTTGCTGTTTTA EDCCCBAAAA@@@@?>===<;;9:99987776554 XA:i:0 MD:Z:35 NM:i:0 XM:i:2
## r6 4 * 0 0 * * 0 0 GGCAGTGATGCAACTGCCCGTTATCAACAGNCNCT EDCCCBAAAA@@@@?>===<;;9:99987776554 XM:i:0
cat single_end_alignment.sam | wc
## 1003 14112 143347
It has 1003 lines (3 header lines plus one line per read). This means that all the reads, aligned or not, are stored in the SAM file; see, for example, the r3 read. If we don’t want to store the unaligned reads, we can use the --no-unal option.
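The split between header lines, aligned reads and unaligned reads can be checked by looking at the FLAG column, where bit 0x4 marks an unmapped read. The sketch below assumes plain SAM text. The function name and the shortened sample records are made up for illustration.

```python
def summarize_sam(lines):
    """Count header lines, aligned reads and unaligned reads in SAM text."""
    counts = {"header": 0, "aligned": 0, "unaligned": 0}
    for line in lines:
        if not line.strip():
            continue
        if line.startswith("@"):
            counts["header"] += 1
            continue
        flag = int(line.split("\t")[1])      # column 2 is the bitwise FLAG
        if flag & 0x4:                       # 0x4: the read itself is unmapped
            counts["unaligned"] += 1
        else:
            counts["aligned"] += 1
    return counts


if __name__ == "__main__":
    # Shortened, hypothetical records modelled on the output above.
    sample = [
        "@HD\tVN:1.0\tSO:unsorted",
        "r1\t16\tref\t1902086\t255\t4M\t*\t0\t0\tACGT\tIIII",
        "r3\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    ]
    print(summarize_sam(sample))
```

To summarize a real file, pass an open file handle, for example `summarize_sam(open("single_end_alignment.sam"))`.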
However, reads often come from paired-end experiments. In that case we usually have two FASTQ files, with filenames ending in "_1.fq" and "_2.fq". To tell bowtie that the reads are paired, we have to use the -1 and -2 options.
./bowtie e_coli -1 reads/e_coli_1000_1.fq -2 reads/e_coli_1000_2.fq paired_end_alignment.map
## # reads processed: 2000
## # reads with at least one reported alignment: 2000 (100.00%)
## # reads that failed to align: 0 (0.00%)
## Reported 1000 paired-end alignments to 1 output stream(s)
or, with SAM output
./bowtie --sam e_coli -1 reads/e_coli_1000_1.fq -2 reads/e_coli_1000_2.fq paired_end_alignment.sam
## # reads processed: 2000
## # reads with at least one reported alignment: 2000 (100.00%)
## # reads that failed to align: 0 (0.00%)
## Reported 1000 paired-end alignments to 1 output stream(s)
If you look into the map and SAM files, you’ll see that they have 2000 and 2003 lines respectively, because each read alignment is written on its own line (the SAM file has 3 additional header lines). Keep in mind that consecutive lines are now related: the upstream ‘mate’ is always printed before the downstream one.
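In SAM output, the relationship between the two mates of a pair is encoded in the bitwise FLAG column (column 2). The sketch below decodes a flag value into the bit names defined by the SAM specification. The short labels and the function name `decode_flag` are chosen here for readability.

```python
# Bit values follow the SAM specification; the labels are informal.
FLAG_BITS = [
    (0x1, "paired"),
    (0x2, "proper_pair"),
    (0x4, "unmapped"),
    (0x8, "mate_unmapped"),
    (0x10, "reverse"),
    (0x20, "mate_reverse"),
    (0x40, "first_in_pair"),
    (0x80, "second_in_pair"),
    (0x100, "secondary"),
    (0x200, "qc_fail"),
    (0x400, "duplicate"),
    (0x800, "supplementary"),
]


def decode_flag(flag):
    """Return the labels of all bits set in a SAM FLAG value."""
    return [label for bit, label in FLAG_BITS if flag & bit]


if __name__ == "__main__":
    for flag in (99, 147, 77, 141):
        print(flag, decode_flag(flag))
```

For example, 99 decodes to a properly paired first mate whose partner lies on the reverse strand, and 147 to the corresponding second mate aligned on the reverse strand.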
The most important parameters in paired-end alignment are the minimum and maximum insert sizes allowed. The insert size is the sum of the lengths of both reads plus the inner (unsequenced) distance between them.
[Figure: insert size]
The minimum and maximum insert sizes allowed can be tuned with the -I and -X options (by default 0 and 250, respectively). If -I 100 is used and the reads are about 40 bp long, two reads from the same pair that align with an inner distance lower than 20 bp will not form a valid pair.
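The arithmetic behind these limits is simple enough to write down. This is a minimal sketch of the rule as described here, with both limits treated as inclusive. The function names are hypothetical, and the defaults mirror the -I and -X defaults mentioned above.

```python
def insert_size(read1_length, read2_length, inner_distance):
    """Insert size = both read lengths plus the unsequenced inner distance."""
    return read1_length + read2_length + inner_distance


def is_valid_pair(read1_length, read2_length, inner_distance,
                  min_insert=0, max_insert=250):
    """Check a pair against -I (min_insert) and -X (max_insert), both inclusive."""
    size = insert_size(read1_length, read2_length, inner_distance)
    return min_insert <= size <= max_insert


if __name__ == "__main__":
    # Two 40 bp reads with -I 100: an inner distance of 19 is too short, 20 is enough.
    print(is_valid_pair(40, 40, 19, min_insert=100))
    print(is_valid_pair(40, 40, 20, min_insert=100))
```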
For example, taking this alignment:
./bowtie --sam e_coli -1 reads/e_coli_1000_1.fq -2 reads/e_coli_1000_2.fq | tail
## # reads processed: 2000
## # reads with at least one reported alignment: 2000 (100.00%)
## # reads that failed to align: 0 (0.00%)
## Reported 1000 paired-end alignments to 1 output stream(s)
## r995 99 gi|110640213|ref|NC_008253.1| 2924988 255 35M = 2925181 225 AGTTCAAAGGTACCGGGTGTTGCGGGGATCGGACC EDCCCBAAAA@@@@?>===<;;9:99987776554 XA:i:0 MD:Z:35 NM:i:0 XM:i:2
## r995 147 gi|110640213|ref|NC_008253.1| 2925181 255 32M = 2924988 -225 TCGACGGCAATTTACAGCAATTGCGGTTGGTA 67778999:9;;<===>?@@@@AAAABCCCDE XA:i:0 MD:Z:32 NM:i:0 XM:i:2
## r996 163 gi|110640213|ref|NC_008253.1| 739590 255 32M = 739694 139 CGCACGCTGCGACGTATTATGGCGATGAATAT EDCCCBAAAA@@@@?>===<;;9:99987776 XA:i:0 MD:Z:32 NM:i:0 XM:i:2
## r996 83 gi|110640213|ref|NC_008253.1| 739694 255 35M = 739590 -139 GATGGATGAGCCGCTGGCTAACCCCGATGCGCTGA 45567778999:9;;<===>?@@@@AAAABCCCDE XA:i:0 MD:Z:35 NM:i:0 XM:i:2
## r997 99 gi|110640213|ref|NC_008253.1| 2926703 255 35M = 2926800 129 CTTACGATTTTTGAGAGCCAGCGCAACATGTTCAG EDCCCBAAAA@@@@?>===<;;9:99987776554 XA:i:0 MD:Z:35 NM:i:0 XM:i:2
## r997 147 gi|110640213|ref|NC_008253.1| 2926800 255 32M = 2926703 -129 GCAATCATGTAGTGAATCGCGGGGATCGGTCG 67778999:9;;<===>?@@@@AAAABCCCDE XA:i:0 MD:Z:32 NM:i:0 XM:i:2
## r998 99 gi|110640213|ref|NC_008253.1| 2300477 255 35M = 2300559 114 TGTAGGTCTGATAAGCATAGCGCATCAGGCAATTT EDCCCBAAAA@@@@?>===<;;9:99987776554 XA:i:0 MD:Z:35 NM:i:0 XM:i:2
## r998 147 gi|110640213|ref|NC_008253.1| 2300559 255 32M = 2300477 -114 GAGAAATTATGCGTTTTTTTCTACTATTTGTG 67778999:9;;<===>?@@@@AAAABCCCDE XA:i:0 MD:Z:32 NM:i:0 XM:i:2
## r999 163 gi|110640213|ref|NC_008253.1| 42356 255 32M = 42443 122 AACAGCAGAGTGTTACTACCGAGTACAGTCCA EDCCCBAAAA@@@@?>===<;;9:99987776 XA:i:0 MD:Z:32 NM:i:0 XM:i:2
## r999 83 gi|110640213|ref|NC_008253.1| 42443 255 35M = 42356 -122 ACGGTACGACCACGGGAGATGCGGGCGAGGAAGAT 45567778999:9;;<===>?@@@@AAAABCCCDE XA:i:0 MD:Z:35 NM:i:0 XM:i:2
Look at the first two alignments (read r995). Column 9 of the SAM file gives the observed template length (225 in this case). If you set a maximum insert size of 200, this pair of reads can no longer be aligned as a pair:
./bowtie --sam -X 200 e_coli -1 reads/e_coli_1000_1.fq -2 reads/e_coli_1000_2.fq | tail
## # reads processed: 1671
## # reads with at least one reported alignment: 1342 (80.31%)
## # reads that failed to align: 329 (19.69%)
## Reported 671 paired-end alignments to 1 output stream(s)
## r995 77 * 0 0 * * 0 0 AGTTCAAAGGTACCGGGTGTTGCGGGGATCGGACC EDCCCBAAAA@@@@?>===<;;9:99987776554 XM:i:0
## r995 141 * 0 0 * * 0 0 TACCAACCGCAATTGCTGTAAATTGCCGTCGA EDCCCBAAAA@@@@?>===<;;9:99987776 XM:i:0
## r996 163 gi|110640213|ref|NC_008253.1| 739590 255 32M = 739694 139 CGCACGCTGCGACGTATTATGGCGATGAATAT EDCCCBAAAA@@@@?>===<;;9:99987776 XA:i:0 MD:Z:32 NM:i:0 XM:i:2
## r996 83 gi|110640213|ref|NC_008253.1| 739694 255 35M = 739590 -139 GATGGATGAGCCGCTGGCTAACCCCGATGCGCTGA 45567778999:9;;<===>?@@@@AAAABCCCDE XA:i:0 MD:Z:35 NM:i:0 XM:i:2
## r997 99 gi|110640213|ref|NC_008253.1| 2926703 255 35M = 2926800 129 CTTACGATTTTTGAGAGCCAGCGCAACATGTTCAG EDCCCBAAAA@@@@?>===<;;9:99987776554 XA:i:0 MD:Z:35 NM:i:0 XM:i:2
## r997 147 gi|110640213|ref|NC_008253.1| 2926800 255 32M = 2926703 -129 GCAATCATGTAGTGAATCGCGGGGATCGGTCG 67778999:9;;<===>?@@@@AAAABCCCDE XA:i:0 MD:Z:32 NM:i:0 XM:i:2
## r998 99 gi|110640213|ref|NC_008253.1| 2300477 255 35M = 2300559 114 TGTAGGTCTGATAAGCATAGCGCATCAGGCAATTT EDCCCBAAAA@@@@?>===<;;9:99987776554 XA:i:0 MD:Z:35 NM:i:0 XM:i:2
## r998 147 gi|110640213|ref|NC_008253.1| 2300559 255 32M = 2300477 -114 GAGAAATTATGCGTTTTTTTCTACTATTTGTG 67778999:9;;<===>?@@@@AAAABCCCDE XA:i:0 MD:Z:32 NM:i:0 XM:i:2
## r999 163 gi|110640213|ref|NC_008253.1| 42356 255 32M = 42443 122 AACAGCAGAGTGTTACTACCGAGTACAGTCCA EDCCCBAAAA@@@@?>===<;;9:99987776 XA:i:0 MD:Z:32 NM:i:0 XM:i:2
## r999 83 gi|110640213|ref|NC_008253.1| 42443 255 35M = 42356 -122 ACGGTACGACCACGGGAGATGCGGGCGAGGAAGAT 45567778999:9;;<===>?@@@@AAAABCCCDE XA:i:0 MD:Z:35 NM:i:0 XM:i:2
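The template length in column 9 can be recomputed from the positions and CIGAR strings of the two mates, which is a useful sanity check when choosing -X. The sketch below assumes both mates map to the same reference and measures from the leftmost to the rightmost aligned base. The function names are invented for the example.

```python
import re


def reference_span(cigar):
    """Number of reference bases covered by a CIGAR string.

    Only the M, D, N, = and X operations consume reference bases.
    """
    return sum(int(count)
               for count, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)
               if op in "MDN=X")


def template_length(pos1, cigar1, pos2, cigar2):
    """Span from the leftmost to the rightmost aligned base (1-based positions)."""
    start = min(pos1, pos2)
    end = max(pos1 + reference_span(cigar1), pos2 + reference_span(cigar2)) - 1
    return end - start + 1


if __name__ == "__main__":
    # Positions and CIGAR strings of r995 and r996 from the SAM output above.
    print(template_length(2924988, "35M", 2925181, "32M"))
    print(template_length(739590, "32M", 739694, "35M"))
```

For r995 this gives 225, which exceeds the limit set by -X 200 and explains why that pair is reported as unaligned.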
Two other useful options are -t and -p. The first prints the execution times. The second sets the number of alignment threads to launch:
./bowtie -t -p 2 e_coli reads/e_coli_1000.fq > single_end_alignment.map
## Time loading forward index: 00:00:00
## Time loading mirror index: 00:00:00
## Seeded quality full-index search: 00:00:00
## # reads processed: 1000
## # reads with at least one reported alignment: 699 (69.90%)
## # reads that failed to align: 301 (30.10%)
## Reported 699 alignments to 1 output stream(s)
## Time searching: 00:00:00
## Overall time: 00:00:00