Short read aligners: Bowtie

Bowtie

To follow this tutorial, download Bowtie from SourceForge and unzip the archive into a folder of your choice. Then open a terminal and change to that folder, which contains the following executables and directories.

ls

## AUTHORS
## bowtie
## bowtie-align-l
## bowtie-align-l-debug
## bowtie-align-s
## bowtie-align-s-debug
## bowtie-build
## bowtie-buildc
## bowtie-build-l
## bowtie-build-l-debug
## bowtie-build-s
## bowtie-build-s-debug
## bowtie-inspect
## bowtie-inspect-l
## bowtie-inspect-l-debug
## bowtie-inspect-s
## bowtie-inspect-s-debug
## bowtie_practice.Rmd
## doc
## e_coli.1.ebwt
## e_coli.2.ebwt
## e_coli.3.ebwt
## e_coli.4.ebwt
## e_coli_index.1.ebwt
## e_coli_index.2.ebwt
## e_coli_index.3.ebwt
## e_coli_index.4.ebwt
## e_coli_index.rev.1.ebwt
## e_coli_index.rev.2.ebwt
## e_coli.rev.1.ebwt
## e_coli.rev.2.ebwt
## genomes
## indexes
## LICENSE
## make_s_cerevisiae.sh
## MANUAL
## MANUAL.markdown
## NEWS
## reads
## s_cerevisiae.1.ebwt
## s_cerevisiae.2.ebwt
## s_cerevisiae.3.ebwt
## s_cerevisiae.4.ebwt
## s_cerevisiae.ebwt.zip
## s_cerevisiae.rev.1.ebwt
## s_cerevisiae.rev.2.ebwt
## scripts
## SeqAn-1.1
## TUTORIAL
## VERSION

Indexing the reference

The first step when using a Burrows-Wheeler based aligner such as Bowtie is to index the reference genome. For large genomes, such as those of mammals, this is a very time-consuming process, so pre-built indexes can be downloaded from the Bowtie website (right column, Pre-built indexes section) or from the iGenomes website.
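To give an intuition of what a Burrows-Wheeler index allows, here is a minimal sketch in Python. It is a toy illustration and not Bowtie's implementation: it builds the transform by sorting all rotations (impractical for real genomes) and counts exact matches with backward search. The function names `bwt` and `count_matches` and the example sequence are assumptions made for this example.

```python
def bwt(text):
    """Burrows-Wheeler transform via sorted rotations (toy inputs only)."""
    text += "$"  # end marker; sorts before A, C, G and T
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def count_matches(bwt_text, pattern):
    """Count exact occurrences of pattern using backward search on the BWT."""
    # first[c]: row of the first sorted rotation that starts with character c
    first = {}
    total = 0
    for c in sorted(set(bwt_text)):
        first[c] = total
        total += bwt_text.count(c)

    lo, hi = 0, len(bwt_text)  # half-open range of rows matching the suffix so far
    for c in reversed(pattern):
        if c not in first:
            return 0
        lo = first[c] + bwt_text[:lo].count(c)
        hi = first[c] + bwt_text[:hi].count(c)
        if lo >= hi:
            return 0
    return hi - lo


genome = "GATTACATTA"  # hypothetical toy reference
index = bwt(genome)
print(index)
print(count_matches(index, "TTA"))
```

The search processes the pattern from its last character to its first and never scans the original text, which is why the index must be built before any read can be aligned.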

Now, download the S. cerevisiae prebuilt reference and unzip it in our working directory.

wget ftp://ftp.ccb.jhu.edu/pub/data/bowtie_indexes/s_cerevisiae.ebwt.zip
unzip s_cerevisiae.ebwt.zip

In order to index a reference, the bowtie-build command is used.

./bowtie-build -h

## Usage: bowtie-build [options]* <reference_in> <ebwt_outfile_base>
##     reference_in            comma-separated list of files with ref sequences
##     ebwt_outfile_base       write Ebwt data to files with this dir/basename
## Options:
##     -f                      reference files are Fasta (default)
##     -c                      reference sequences given on cmd line (as <seq_in>)
##     --large-index           force generated index to be 'large', even if ref
##                             has fewer than 4 billion nucleotides
##     -C/--color              build a colorspace index
##     -a/--noauto             disable automatic -p/--bmax/--dcv memory-fitting
##     -p/--packed             use packed strings internally; slower, uses less mem
##     --bmax <int>            max bucket sz for blockwise suffix-array builder
##     --bmaxdivn <int>        max bucket sz as divisor of ref len (default: 4)
##     --dcv <int>             diff-cover period for blockwise (default: 1024)
##     --nodc                  disable diff-cover (algorithm becomes quadratic)
##     -r/--noref              don't build .3/.4.ebwt (packed reference) portion
##     -3/--justref            just build .3/.4.ebwt (packed reference) portion
##     -o/--offrate <int>      SA is sampled every 2^offRate BWT chars (default: 5)
##     -t/--ftabchars <int>    # of chars consumed in initial lookup (default: 10)
##     --threads <int>         # of threads
##     --ntoa                  convert Ns in reference to As
##     --seed <int>            seed for random number generator
##     -q/--quiet              verbose output (for debugging)
##     -h/--help               print detailed description of tool and its options
##     --usage                 print this usage message
##     --version               print version information and quit

The bowtie-build command needs two parameters: <reference_in> is the reference genome as a FASTA file (or a comma-separated list of files), and <ebwt_outfile_base> is the base name of the index files that will be generated, which will have the .ebwt extension.

./bowtie-build ./genomes/NC_008253.fna e_coli_index

## Settings:
##   Output files: "e_coli_index.*.ebwt"
##   Line rate: 6 (line is 64 bytes)
##   Lines per side: 1 (side is 64 bytes)
##   Offset rate: 5 (one in 32)
##   FTable chars: 10
##   Strings: unpacked
##   Max bucket size: default
##   Max bucket size, sqrt multiplier: default
##   Max bucket size, len divisor: 4
##   Difference-cover sample period: 1024
##   Endianness: little
##   Actual local endianness: little
##   Sanity checking: disabled
##   Assertions: disabled
##   Random seed: 0
##   Sizeofs: void*:8, int:4, long:8, size_t:8
## Input files DNA, FASTA:
##   ./genomes/NC_008253.fna
## Reading reference sizes
##   Time reading reference sizes: 00:00:00
## Calculating joined length
## Writing header
## Reserving space for joined string
## Joining reference sequences
##   Time to join reference sequences: 00:00:00
## bmax according to bmaxDivN setting: 1234730
## Using parameters --bmax 926048 --dcv 1024
##   Doing ahead-of-time memory usage test
##   Passed!  Constructing with these parameters: --bmax 926048 --dcv 1024
## Constructing suffix-array element generator
## Building DifferenceCoverSample
##   Building sPrime
##   Building sPrimeOrder
##   V-Sorting samples
##   V-Sorting samples time: 00:00:00
##   Allocating rank array
##   Ranking v-sort output
##   Ranking v-sort output time: 00:00:00
##   Invoking Larsson-Sadakane on ranks
##   Invoking Larsson-Sadakane on ranks time: 00:00:00
##   Sanity-checking and returning
## Building samples
## Reserving space for 12 sample suffixes
## Generating random suffixes
## QSorting 12 sample offsets, eliminating duplicates
## QSorting sample offsets, eliminating duplicates time: 00:00:00
## Multikey QSorting 12 samples
##   (Using difference cover)
##   Multikey QSorting samples time: 00:00:00
## Calculating bucket sizes
## Splitting and merging
##   Splitting and merging time: 00:00:00
## Avg bucket size: 4.93892e+06 (target: 926047)
## Converting suffix-array elements to index image
## Allocating ftab, absorbFtab
## Entering Ebwt loop
## Getting block 1 of 1
##   No samples; assembling all-inclusive block
##   Sorting block of length 4938920 for bucket 1
##   (Using difference cover)
##   Sorting block time: 00:00:01
## Returning block of 4938921 for bucket 1
## Exited Ebwt loop
## fchr[A]: 0
## fchr[C]: 1222723
## fchr[G]: 2474304
## fchr[T]: 3717743
## fchr[$]: 4938920
## Exiting Ebwt::buildToDisk()
## Returning from initFromVector
## Wrote 5605733 bytes to primary EBWT file: e_coli_index.1.ebwt
## Wrote 617372 bytes to secondary EBWT file: e_coli_index.2.ebwt
## Re-opening _in1 and _in2 as input streams
## Returning from Ebwt constructor
## Headers:
##     len: 4938920
##     bwtLen: 4938921
##     sz: 1234730
##     bwtSz: 1234731
##     lineRate: 6
##     linesPerSide: 1
##     offRate: 5
##     offMask: 0xffffffe0
##     isaRate: -1
##     isaMask: 0xffffffff
##     ftabChars: 10
##     eftabLen: 20
##     eftabSz: 80
##     ftabLen: 1048577
##     ftabSz: 4194308
##     offsLen: 154342
##     offsSz: 617368
##     isaLen: 0
##     isaSz: 0
##     lineSz: 64
##     sideSz: 64
##     sideBwtSz: 56
##     sideBwtLen: 224
##     numSidePairs: 11025
##     numSides: 22050
##     numLines: 22050
##     ebwtTotLen: 1411200
##     ebwtTotSz: 1411200
##     reverse: 0
## Total time for call to driver() for forward index: 00:00:02
## Reading reference sizes
##   Time reading reference sizes: 00:00:00
## Calculating joined length
## Writing header
## Reserving space for joined string
## Joining reference sequences
##   Time to join reference sequences: 00:00:00
## bmax according to bmaxDivN setting: 1234730
## Using parameters --bmax 926048 --dcv 1024
##   Doing ahead-of-time memory usage test
##   Passed!  Constructing with these parameters: --bmax 926048 --dcv 1024
## Constructing suffix-array element generator
## Building DifferenceCoverSample
##   Building sPrime
##   Building sPrimeOrder
##   V-Sorting samples
##   V-Sorting samples time: 00:00:00
##   Allocating rank array
##   Ranking v-sort output
##   Ranking v-sort output time: 00:00:00
##   Invoking Larsson-Sadakane on ranks
##   Invoking Larsson-Sadakane on ranks time: 00:00:00
##   Sanity-checking and returning
## Building samples
## Reserving space for 12 sample suffixes
## Generating random suffixes
## QSorting 12 sample offsets, eliminating duplicates
## QSorting sample offsets, eliminating duplicates time: 00:00:00
## Multikey QSorting 12 samples
##   (Using difference cover)
##   Multikey QSorting samples time: 00:00:00
## Calculating bucket sizes
## Splitting and merging
##   Splitting and merging time: 00:00:00
## Avg bucket size: 4.93892e+06 (target: 926047)
## Converting suffix-array elements to index image
## Allocating ftab, absorbFtab
## Entering Ebwt loop
## Getting block 1 of 1
##   No samples; assembling all-inclusive block
##   Sorting block of length 4938920 for bucket 1
##   (Using difference cover)
##   Sorting block time: 00:00:01
## Returning block of 4938921 for bucket 1
## Exited Ebwt loop
## fchr[A]: 0
## fchr[C]: 1222723
## fchr[G]: 2474304
## fchr[T]: 3717743
## fchr[$]: 4938920
## Exiting Ebwt::buildToDisk()
## Returning from initFromVector
## Wrote 5605733 bytes to primary EBWT file: e_coli_index.rev.1.ebwt
## Wrote 617372 bytes to secondary EBWT file: e_coli_index.rev.2.ebwt
## Re-opening _in1 and _in2 as input streams
## Returning from Ebwt constructor
## Headers:
##     len: 4938920
##     bwtLen: 4938921
##     sz: 1234730
##     bwtSz: 1234731
##     lineRate: 6
##     linesPerSide: 1
##     offRate: 5
##     offMask: 0xffffffe0
##     isaRate: -1
##     isaMask: 0xffffffff
##     ftabChars: 10
##     eftabLen: 20
##     eftabSz: 80
##     ftabLen: 1048577
##     ftabSz: 4194308
##     offsLen: 154342
##     offsSz: 617368
##     isaLen: 0
##     isaSz: 0
##     lineSz: 64
##     sideSz: 64
##     sideBwtSz: 56
##     sideBwtLen: 224
##     numSidePairs: 11025
##     numSides: 22050
##     numLines: 22050
##     ebwtTotLen: 1411200
##     ebwtTotSz: 1411200
##     reverse: 0
## Total time for backward call to driver() for mirror index: 00:00:01

When the indexing ends, several new files are generated.

ls *.ebwt

## e_coli.1.ebwt
## e_coli.2.ebwt
## e_coli.3.ebwt
## e_coli.4.ebwt
## e_coli_index.1.ebwt
## e_coli_index.2.ebwt
## e_coli_index.3.ebwt
## e_coli_index.4.ebwt
## e_coli_index.rev.1.ebwt
## e_coli_index.rev.2.ebwt
## e_coli.rev.1.ebwt
## e_coli.rev.2.ebwt
## s_cerevisiae.1.ebwt
## s_cerevisiae.2.ebwt
## s_cerevisiae.3.ebwt
## s_cerevisiae.4.ebwt
## s_cerevisiae.rev.1.ebwt
## s_cerevisiae.rev.2.ebwt
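In the listing above, each index base name appears with six files. A small sketch like the following can check that an index is complete before aligning. The helper names `index_files` and `missing_index_files` are made up for this example, and the suffix list is taken from the listing above, so it assumes a regular (not "large") index.

```python
import os

# Suffixes observed in the directory listing for a regular Bowtie index.
INDEX_SUFFIXES = ["1", "2", "3", "4", "rev.1", "rev.2"]


def index_files(basename):
    """Return the file names expected for an index with this base name."""
    return [f"{basename}.{suffix}.ebwt" for suffix in INDEX_SUFFIXES]


def missing_index_files(basename, directory="."):
    """Return the expected index files that are not present in directory."""
    return [
        name
        for name in index_files(basename)
        if not os.path.exists(os.path.join(directory, name))
    ]


print(index_files("e_coli_index"))
```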

We can inspect the reference index using the bowtie-inspect command. We will use the -s option, which prints only a summary of the index. By default (without -s), bowtie-inspect prints the FASTA records of the indexed nucleotide sequences (see bowtie-inspect -h for other options).

./bowtie-inspect -s e_coli_index

## Colorspace   0
## SA-Sample    1 in 32
## FTab-Chars   10
## Sequence-1   gi|110640213|ref|NC_008253.1| Escherichia coli 536, complete genome 4938920

Aligning reads

Once the reference has been indexed, we can align the reads to the genome using the bowtie command. This command has a lot of options:

./bowtie -h

## usage: bowtie [-h] [-b | -i] [--verbose] [--debug] [--large-index]
##               [--index INDEX]
## 
## optional arguments:
##   -h, --help     show this help message and exit
##   -b, --build
##   -i, --inspect
##   --verbose
##   --debug
##   --large-index
##   --index INDEX

A convenient option for experimenting with bowtie is -c, which lets you pass the sequences to align directly on the command line.

./bowtie -c e_coli_index ATAA

## 0    +   gi|110640213|ref|NC_008253.1|   3074809 ATAA    IIII    24193   
## # reads processed: 1
## # reads with at least one reported alignment: 1 (100.00%)
## # reads that failed to align: 0 (0.00%)
## Reported 1 alignments to 1 output stream(s)

Let’s try a longer read, for example ATTGTAGTTCGAGTAAGTAATGTGGGTTTG:

./bowtie -c e_coli_index ATTGTAGTTCGAGTAAGTAATGTGGGTTTG

## # reads processed: 1
## # reads with at least one reported alignment: 0 (0.00%)
## # reads that failed to align: 1 (100.00%)
## No alignments

In the output we can see that the read fails to align against the E. coli index. Let’s align the same read against the S. cerevisiae index:

./bowtie -c  s_cerevisiae ATTGTAGTTCGAGTAAGTAATGTGGGTTTG

## 0    +   Scchr02 90972   ATTGTAGTTCGAGTAAGTAATGTGGGTTTG  IIIIIIIIIIIIIIIIIIIIIIIIIIIIII  0   
## # reads processed: 1
## # reads with at least one reported alignment: 1 (100.00%)
## # reads that failed to align: 0 (0.00%)
## Reported 1 alignments to 1 output stream(s)

This time the read aligns, because the sequence comes from S. cerevisiae.

Bowtie Output

Bowtie outputs one alignment per line. The default output of the aligner consists of 8 tab-separated fields. The most important column is the fourth, which is the 0-based leftmost position of the alignment. This format is usually stored with the .map extension.

However, the most widely used alignment format is SAM/BAM. The -S or --sam option outputs the hits in SAM format.

./bowtie -cS  s_cerevisiae ATTGTAGTTCGAGTAAGTAATGTGGGTTTG

## @HD  VN:1.0  SO:unsorted
## @SQ  SN:Scchr01  LN:230208
## @SQ  SN:Scchr02  LN:813178
## @SQ  SN:Scchr03  LN:316617
## @SQ  SN:Scchr04  LN:1531917
## @SQ  SN:Scchr05  LN:576869
## @SQ  SN:Scchr06  LN:270148
## @SQ  SN:Scchr07  LN:1090946
## @SQ  SN:Scchr08  LN:562643
## @SQ  SN:Scchr09  LN:439885
## @SQ  SN:Scchr10  LN:745667
## @SQ  SN:Scchr11  LN:666454
## @SQ  SN:Scchr12  LN:1078175
## @SQ  SN:Scchr13  LN:924429
## @SQ  SN:Scchr14  LN:784333
## @SQ  SN:Scchr15  LN:1091289
## @SQ  SN:Scchr16  LN:948062
## @SQ  SN:Scmito   LN:85779
## @PG  ID:Bowtie   VN:1.2.1.1  CL:"bowtie-align --wrapper basic-0 -cS s_cerevisiae ATTGTAGTTCGAGTAAGTAATGTGGGTTTG"
## 0    0   Scchr02 90973   255 30M *   0   0   ATTGTAGTTCGAGTAAGTAATGTGGGTTTG  IIIIIIIIIIIIIIIIIIIIIIIIIIIIII  XA:i:0  MD:Z:30 NM:i:0  XM:i:2
## # reads processed: 1
## # reads with at least one reported alignment: 1 (100.00%)
## # reads that failed to align: 0 (0.00%)
## Reported 1 alignments to 1 output stream(s)
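Comparing the default output shown earlier with the SAM record above, the position differs by one (90972 versus 90973): the default format uses a 0-based offset, while SAM positions are 1-based. The sketch below parses one line of the default format and converts the offset. The `BowtieHit` type and its field names are labels chosen for this example, not names used by Bowtie.

```python
from typing import NamedTuple


class BowtieHit(NamedTuple):
    read_name: str
    strand: str
    reference: str
    offset0: int      # 0-based leftmost position on the forward strand
    sequence: str
    qualities: str
    column7: str      # kept as text; see the Bowtie manual for its meaning
    mismatches: str   # empty when the alignment has no mismatches


def parse_hit(line):
    """Parse one tab-separated line of Bowtie's default output."""
    fields = line.rstrip("\n").split("\t")
    fields += [""] * (8 - len(fields))  # trailing empty columns may be missing
    name, strand, ref, offset, seq, quals, col7, mms = fields[:8]
    return BowtieHit(name, strand, ref, int(offset), seq, quals, col7, mms)


def sam_pos(hit):
    """Convert the 0-based offset to the 1-based POS used by SAM."""
    return hit.offset0 + 1


line = "0\t+\tScchr02\t90972\tATTGTAGTTCGAGTAAGTAATGTGGGTTTG\t" + "I" * 30 + "\t0\t"
hit = parse_hit(line)
print(hit.reference, hit.offset0, sam_pos(hit))
```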

We can also redirect the output to a file:

./bowtie -cS  s_cerevisiae ATTGTAGTTCGAGTAAGTAATGTGGGTTTG > alignment.sam

## # reads processed: 1
## # reads with at least one reported alignment: 1 (100.00%)
## # reads that failed to align: 0 (0.00%)
## Reported 1 alignments to 1 output stream(s)

To store the alignments as a more compact BAM file, use samtools:

samtools view -Sb alignment.sam > alignment.bam

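Let’s compare the file sizes. Note that ls prints sizes only with the -l option (ls -lh alignment.*); with -h alone, as in the command below, it lists just the file names: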

ls -h alignment.*

## alignment.sam

Reporting multiple positions

NGS generates short reads (25-100 bp), and the shorter a read is, the more likely it is to align at multiple positions. Short read aligners therefore have to either report multiple alignments or heuristically pick one of them. In Bowtie, the -k <int> option reports up to <int> valid alignments per read. By default k=1, so only one hit is returned.

./bowtie -c -k 5 e_coli_index ATAA

## 0    +   gi|110640213|ref|NC_008253.1|   3074809 ATAA    IIII    24193   
## 0    +   gi|110640213|ref|NC_008253.1|   433792  ATAA    IIII    24193   
## 0    +   gi|110640213|ref|NC_008253.1|   3665044 ATAA    IIII    24193   
## 0    +   gi|110640213|ref|NC_008253.1|   1933628 ATAA    IIII    24193   
## 0    +   gi|110640213|ref|NC_008253.1|   3294642 ATAA    IIII    24193   
## # reads processed: 5
## # reads with at least one reported alignment: 5 (100.00%)
## # reads that failed to align: 0 (0.00%)
## Reported 5 alignments to 1 output stream(s)
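A back-of-the-envelope calculation shows why read length matters so much here. The sketch below estimates how many exact matches a read of a given length would have purely by chance. It assumes a random genome with equally frequent, independent bases and looks at one strand only, which real genomes do not satisfy, so treat the numbers as orders of magnitude. The function name `expected_hits` is made up for this example.

```python
def expected_hits(genome_length, read_length):
    """Expected number of exact matches of a random read on one strand."""
    positions = genome_length - read_length + 1  # possible start positions
    return positions / 4 ** read_length          # each start matches with p = 4^-k


E_COLI_LENGTH = 4938920  # sequence length reported by bowtie-inspect above

for k in (4, 10, 20, 30):
    print(k, expected_hits(E_COLI_LENGTH, k))
```

Under these assumptions a 4-base read such as ATAA is expected to match at thousands of positions, while a 30-base read is expected to match essentially nowhere by chance alone.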

Mismatches

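Besides common biological variation with respect to the reference genome, reads coming from NGS usually contain sequencing errors. This means that alignments may have mismatches. By default, Bowtie runs in -n mode and allows up to 2 mismatches in the seed (the first 28 bases of the read), while the sum of the quality values at all mismatched positions may not exceed 70 (-e option). For example, if we change the last two nucleotides of the read we previously aligned to the S. cerevisiae genome, it still aligns to the same position: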

./bowtie -c  s_cerevisiae ATTGTAGTTCGAGTAAGTAATGTGGGTTTG 

./bowtie -c  s_cerevisiae ATTGTAGTTCGAGTAAGTAATGTGGGTTAA

## 0    +   Scchr02 90972   ATTGTAGTTCGAGTAAGTAATGTGGGTTTG  IIIIIIIIIIIIIIIIIIIIIIIIIIIIII  0   
## # reads processed: 1
## # reads with at least one reported alignment: 1 (100.00%)
## # reads that failed to align: 0 (0.00%)
## Reported 1 alignments to 1 output stream(s)
## 0    +   Scchr02 90972   ATTGTAGTTCGAGTAAGTAATGTGGGTTAA  IIIIIIIIIIIIIIIIIIIIIIIIIIIIII  0   28:T>A,29:G>A
## # reads processed: 1
## # reads with at least one reported alignment: 1 (100.00%)
## # reads that failed to align: 0 (0.00%)
## Reported 1 alignments to 1 output stream(s)

However, if we introduce a third mismatch, the read can no longer be aligned:

./bowtie -c  s_cerevisiae ATTGTAGTTCGAGTAAGTAATGTGGGTAAA

## # reads processed: 1
## # reads with at least one reported alignment: 0 (0.00%)
## # reads that failed to align: 1 (100.00%)
## No alignments

To align this read, we need to raise the maximum number of mismatches to 3. This can be done using the -v option.

./bowtie -c -v3  s_cerevisiae ATTGTAGTTCGAGTAAGTAATGTGGGTAAA

## 0    +   Scchr02 90972   ATTGTAGTTCGAGTAAGTAATGTGGGTAAA  IIIIIIIIIIIIIIIIIIIIIIIIIIIIII  0   27:T>A,28:T>A,29:G>A
## # reads processed: 1
## # reads with at least one reported alignment: 1 (100.00%)
## # reads that failed to align: 0 (0.00%)
## Reported 1 alignments to 1 output stream(s)

The -v alignment mode in Bowtie does not take the quality values into account. The default -n mode, which does use them, is more complex and will not be explained here.
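The last column of the outputs above lists the mismatches as offset:reference>read, for example 28:T>A. The following sketch reproduces that field for a forward-strand alignment by comparing the read with the reference bases it was aligned to. The function name `mismatch_descriptor` is made up for this example, and reverse-strand alignments are deliberately not handled.

```python
def mismatch_descriptor(reference, read):
    """Build a Bowtie-style mismatch field for a forward-strand alignment."""
    if len(reference) != len(read):
        raise ValueError("reference segment and read must have the same length")
    return ",".join(
        f"{offset}:{ref_base}>{read_base}"
        for offset, (ref_base, read_base) in enumerate(zip(reference, read))
        if ref_base != read_base
    )


reference = "ATTGTAGTTCGAGTAAGTAATGTGGGTTTG"  # segment of Scchr02 matched above
print(mismatch_descriptor(reference, "ATTGTAGTTCGAGTAAGTAATGTGGGTTAA"))
print(mismatch_descriptor(reference, "ATTGTAGTTCGAGTAAGTAATGTGGGTAAA"))
```

Counting the entries of the returned string gives the number of mismatches that has to fit within the -v limit.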

Using reads from files

Although the -c option is quite useful for experimenting and learning, we usually don’t want to align single reads from the command line, but reads coming from big FASTQ files. In fact, FASTQ is the default Bowtie input format.

Bowtie comes with some example files in the reads directory

ls ./reads/

## e_coli_10000snp.fa
## e_coli_10000snp.fq
## e_coli_1000_1.fa
## e_coli_1000_1.fq
## e_coli_1000_2.fa
## e_coli_1000_2.fq
## e_coli_1000.fa
## e_coli_1000.fq
## e_coli_1000_interleaved.fq
## e_coli_1000.raw

Let’s explore the files.

First, let’s look at reads in a multi-FASTA (.fa) file. As you can see, the reads have no quality values:

cat reads/e_coli_1000.fa | head

## >r0
## GAACGATACCCACCCAACTATCGCCATTCCAGCAT
## >r1
## CCGAACTGGATGTCTCATGGGATAAAAATCATCCG
## >r2
## TCAAAATTGTTATAGTATAACACTGTTGCTTTATG
## >r3
## AAAATTTGTGCCTGGATGGCCTGAGTACCNANTAC
## >r4
## GCAGAGCAGTTGCTAGAAANNNNNTTGAAGAGGTT

FASTQ files, however, include a quality value for each base. The most important thing to remember is that a FASTQ file has 4 lines per read:

@ followed by the read information
the sequence of the read
+ optionally followed by the read information again (usually just +)
the qualities, ASCII-encoded, one character per base

cat reads/e_coli_1000.fq | head -n 8

## @r0
## GAACGATACCCACCCAACTATCGCCATTCCAGCAT
## +
## EDCCCBAAAA@@@@?>===<;;9:99987776554
## @r1
## CCGAACTGGATGTCTCATGGGATAAAAATCATCCG
## +
## EDCCCBAAAA@@@@?>===<;;9:99987776554

If you look at the options of bowtie, you’ll see that you can change the encoding of the qualities: the Sanger encoding, ASCII+33 (--phred33-quals), is the default, but other encodings such as --phred64-quals or --solexa-quals are also supported.
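To make the four-line layout and the ASCII+33 encoding concrete, here is a minimal sketch that reads FASTQ records and decodes their qualities. The function names `read_fastq` and `decode_phred` are made up for this example. It assumes each record occupies exactly four lines and covers only Phred offsets (33 or 64); Solexa-scaled qualities use a different formula and are not handled.

```python
def read_fastq(lines):
    """Yield (name, sequence, qualities) tuples from FASTQ lines."""
    lines = [line.rstrip("\n") for line in lines]
    for i in range(0, len(lines) - 3, 4):
        header, sequence, plus, qualities = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record at line {i + 1}")
        yield header[1:], sequence, qualities


def decode_phred(qualities, offset=33):
    """Convert ASCII-encoded qualities to integer Phred scores."""
    return [ord(char) - offset for char in qualities]


def error_probability(phred):
    """Probability that a base call is wrong, given its Phred score."""
    return 10 ** (-phred / 10)


example = """@r0
GAACGATACCCACCCAACTATCGCCATTCCAGCAT
+
EDCCCBAAAA@@@@?>===<;;9:99987776554
@r1
CCGAACTGGATGTCTCATGGGATAAAAATCATCCG
+
EDCCCBAAAA@@@@?>===<;;9:99987776554
"""

for name, sequence, qualities in read_fastq(example.splitlines()):
    scores = decode_phred(qualities)
    print(name, len(sequence), scores[0], scores[-1])
```

In these example reads the scores decrease along the read, which mirrors the usual loss of quality towards the 3' end.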

Single-end experiment

If the reads come from a single-end experiment, they are stored in a single FASTQ file, here e_coli_1000.fq. To align these reads, the command is:

./bowtie -S e_coli reads/e_coli_1000.fq > single_end_alignment.sam

## # reads processed: 1000
## # reads with at least one reported alignment: 699 (69.90%)
## # reads that failed to align: 301 (30.10%)
## Reported 699 alignments to 1 output stream(s)

Many reads (about 30%) are not aligned. Let’s take a look at the generated SAM file:

cat single_end_alignment.sam | head

## @HD  VN:1.0  SO:unsorted
## @SQ  SN:gi|110640213|ref|NC_008253.1|    LN:4938920
## @PG  ID:Bowtie   VN:1.2.1.1  CL:"bowtie-align --wrapper basic-0 -S e_coli reads/e_coli_1000.fq"
## r0   16  gi|110640213|ref|NC_008253.1|   3658050 255 35M *   0   0   ATGCTGGAATGGCGATAGTTGGGTGGGTATCGTTC 45567778999:9;;<===>?@@@@AAAABCCCDE XA:i:0  MD:Z:0G1T32 NM:i:2  XM:i:2
## r1   16  gi|110640213|ref|NC_008253.1|   1902086 255 35M *   0   0   CGGATGATTTTTATCCCATGAGACATCCAGTTCGG 45567778999:9;;<===>?@@@@AAAABCCCDE XA:i:0  MD:Z:35 NM:i:0  XM:i:2
## r2   16  gi|110640213|ref|NC_008253.1|   3989610 255 35M *   0   0   CATAAAGCAACAGTGTTATACTATAACAATTTTGA 45567778999:9;;<===>?@@@@AAAABCCCDE XA:i:0  MD:Z:35 NM:i:0  XM:i:2
## r3   4   *   0   0   *   *   0   0   AAAATTTGTGCCTGGATGGCCTGAGTACCNANTAC EDCCCBAAAA@@@@?>===<;;9:99987776554 XM:i:0
## r4   4   *   0   0   *   *   0   0   GCAGAGCAGTTGCTAGAAANNNNNTTGAAGAGGTT EDCCCBAAAA@@@@?>===<;;9:99987776554 XM:i:0
## r5   0   gi|110640213|ref|NC_008253.1|   4249842 255 35M *   0   0   CAGCATAAGTGGATATTCAAAGTTTTGCTGTTTTA EDCCCBAAAA@@@@?>===<;;9:99987776554 XA:i:0  MD:Z:35 NM:i:0  XM:i:2
## r6   4   *   0   0   *   *   0   0   GGCAGTGATGCAACTGCCCGTTATCAACAGNCNCT EDCCCBAAAA@@@@?>===<;;9:99987776554 XM:i:0

cat single_end_alignment.sam | wc

##    1003   14112  143347

It has 1003 lines: 3 header lines plus one line per read. This means that all the reads, aligned or not, are stored in the SAM file. See, for example, the r3 read, whose flag (column 2) is 4, meaning unmapped. If we don’t want to store the unaligned reads, we can use the --no-unal option.
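The second column of each SAM record is a bitwise flag: 16 for the reads aligned to the reverse strand, 4 for the unaligned ones and 0 for plain forward alignments. The sketch below decodes a flag into the bits defined by the SAM specification. The short labels and the function name `decode_flag` are chosen for this example.

```python
# Bits defined by the SAM specification; the labels are informal.
SAM_FLAGS = [
    (0x1, "paired"),
    (0x2, "proper_pair"),
    (0x4, "unmapped"),
    (0x8, "mate_unmapped"),
    (0x10, "reverse_strand"),
    (0x20, "mate_reverse_strand"),
    (0x40, "first_in_pair"),
    (0x80, "second_in_pair"),
    (0x100, "secondary"),
    (0x200, "qc_fail"),
    (0x400, "duplicate"),
    (0x800, "supplementary"),
]


def decode_flag(flag):
    """Return the labels of all bits set in a SAM flag."""
    return [label for bit, label in SAM_FLAGS if flag & bit]


for flag in (0, 16, 4):
    print(flag, decode_flag(flag))
```

The same function also works for flags of paired-end records, where several bits are set at once.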

Paired-end experiments

Often, however, reads come from paired-end experiments. In that case we usually have two FASTQ files, with filenames ending in "_1.fq" and "_2.fq". To tell bowtie that the reads are paired, we have to use the -1 and -2 options.

./bowtie e_coli -1 reads/e_coli_1000_1.fq -2 reads/e_coli_1000_2.fq paired_end_alignment.map

## # reads processed: 2000
## # reads with at least one reported alignment: 2000 (100.00%)
## # reads that failed to align: 0 (0.00%)
## Reported 1000 paired-end alignments to 1 output stream(s)

or, with SAM output

./bowtie --sam e_coli -1 reads/e_coli_1000_1.fq -2 reads/e_coli_1000_2.fq paired_end_alignment.sam

## # reads processed: 2000
## # reads with at least one reported alignment: 2000 (100.00%)
## # reads that failed to align: 0 (0.00%)
## Reported 1000 paired-end alignments to 1 output stream(s)

If you look into the map and SAM files, you’ll see that they have 2000 and 2003 lines respectively, since each read alignment is written on one line (the SAM file has 3 additional header lines). Note that consecutive lines are now related: the two mates of a pair are printed one after the other, and the upstream mate is always printed before the downstream one.

The most important parameters in paired-end alignment are the minimum and maximum insert sizes allowed. The insert size is the sum of the lengths of both reads plus the inner (unsequenced) distance between them.

[Figure: insert size]

The minimum and maximum insert sizes allowed can be tuned with the -I and -X options (by default 0 and 250, respectively). If -I 100 is used and the reads are about 40 bp long, two reads from the same pair that align with an inner distance lower than 20 bp will not form a valid pair.

For example, take the following alignment:
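The arithmetic behind the -I example can be sketched as follows. The function names are made up for this example, the defaults mirror the -I and -X defaults mentioned above, and the bounds are treated as inclusive, which is an assumption worth checking against the Bowtie manual.

```python
def insert_size(read1_length, read2_length, inner_distance):
    """Insert size: both reads plus the unsequenced stretch between them."""
    return read1_length + read2_length + inner_distance


def is_valid_pair(insert, min_insert=0, max_insert=250):
    """Check an insert size against the -I (minimum) and -X (maximum) limits."""
    return min_insert <= insert <= max_insert


# Two 40 bp reads with -I 100: an inner distance of 19 bp is too short.
for inner in (19, 20):
    size = insert_size(40, 40, inner)
    print(inner, size, is_valid_pair(size, min_insert=100))
```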

./bowtie --sam e_coli -1 reads/e_coli_1000_1.fq -2 reads/e_coli_1000_2.fq | tail

## # reads processed: 2000
## # reads with at least one reported alignment: 2000 (100.00%)
## # reads that failed to align: 0 (0.00%)
## Reported 1000 paired-end alignments to 1 output stream(s)
## r995 99  gi|110640213|ref|NC_008253.1|   2924988 255 35M =   2925181 225 AGTTCAAAGGTACCGGGTGTTGCGGGGATCGGACC EDCCCBAAAA@@@@?>===<;;9:99987776554 XA:i:0  MD:Z:35 NM:i:0  XM:i:2
## r995 147 gi|110640213|ref|NC_008253.1|   2925181 255 32M =   2924988 -225    TCGACGGCAATTTACAGCAATTGCGGTTGGTA    67778999:9;;<===>?@@@@AAAABCCCDE    XA:i:0  MD:Z:32 NM:i:0  XM:i:2
## r996 163 gi|110640213|ref|NC_008253.1|   739590  255 32M =   739694  139 CGCACGCTGCGACGTATTATGGCGATGAATAT    EDCCCBAAAA@@@@?>===<;;9:99987776    XA:i:0  MD:Z:32 NM:i:0  XM:i:2
## r996 83  gi|110640213|ref|NC_008253.1|   739694  255 35M =   739590  -139    GATGGATGAGCCGCTGGCTAACCCCGATGCGCTGA 45567778999:9;;<===>?@@@@AAAABCCCDE XA:i:0  MD:Z:35 NM:i:0  XM:i:2
## r997 99  gi|110640213|ref|NC_008253.1|   2926703 255 35M =   2926800 129 CTTACGATTTTTGAGAGCCAGCGCAACATGTTCAG EDCCCBAAAA@@@@?>===<;;9:99987776554 XA:i:0  MD:Z:35 NM:i:0  XM:i:2
## r997 147 gi|110640213|ref|NC_008253.1|   2926800 255 32M =   2926703 -129    GCAATCATGTAGTGAATCGCGGGGATCGGTCG    67778999:9;;<===>?@@@@AAAABCCCDE    XA:i:0  MD:Z:32 NM:i:0  XM:i:2
## r998 99  gi|110640213|ref|NC_008253.1|   2300477 255 35M =   2300559 114 TGTAGGTCTGATAAGCATAGCGCATCAGGCAATTT EDCCCBAAAA@@@@?>===<;;9:99987776554 XA:i:0  MD:Z:35 NM:i:0  XM:i:2
## r998 147 gi|110640213|ref|NC_008253.1|   2300559 255 32M =   2300477 -114    GAGAAATTATGCGTTTTTTTCTACTATTTGTG    67778999:9;;<===>?@@@@AAAABCCCDE    XA:i:0  MD:Z:32 NM:i:0  XM:i:2
## r999 163 gi|110640213|ref|NC_008253.1|   42356   255 32M =   42443   122 AACAGCAGAGTGTTACTACCGAGTACAGTCCA    EDCCCBAAAA@@@@?>===<;;9:99987776    XA:i:0  MD:Z:32 NM:i:0  XM:i:2
## r999 83  gi|110640213|ref|NC_008253.1|   42443   255 35M =   42356   -122    ACGGTACGACCACGGGAGATGCGGGCGAGGAAGAT 45567778999:9;;<===>?@@@@AAAABCCCDE XA:i:0  MD:Z:35 NM:i:0  XM:i:2

Look at the first two alignments, which correspond to the pair r995. Column 9 in the SAM file gives the observed template length (in this case 225). If you set a maximum insert size of 200, this pair of reads can no longer be aligned as a valid pair:

./bowtie --sam -X 200 e_coli -1 reads/e_coli_1000_1.fq -2 reads/e_coli_1000_2.fq | tail

## # reads processed: 1671
## # reads with at least one reported alignment: 1342 (80.31%)
## # reads that failed to align: 329 (19.69%)
## Reported 671 paired-end alignments to 1 output stream(s)
## r995 77  *   0   0   *   *   0   0   AGTTCAAAGGTACCGGGTGTTGCGGGGATCGGACC EDCCCBAAAA@@@@?>===<;;9:99987776554 XM:i:0
## r995 141 *   0   0   *   *   0   0   TACCAACCGCAATTGCTGTAAATTGCCGTCGA    EDCCCBAAAA@@@@?>===<;;9:99987776    XM:i:0
## r996 163 gi|110640213|ref|NC_008253.1|   739590  255 32M =   739694  139 CGCACGCTGCGACGTATTATGGCGATGAATAT    EDCCCBAAAA@@@@?>===<;;9:99987776    XA:i:0  MD:Z:32 NM:i:0  XM:i:2
## r996 83  gi|110640213|ref|NC_008253.1|   739694  255 35M =   739590  -139    GATGGATGAGCCGCTGGCTAACCCCGATGCGCTGA 45567778999:9;;<===>?@@@@AAAABCCCDE XA:i:0  MD:Z:35 NM:i:0  XM:i:2
## r997 99  gi|110640213|ref|NC_008253.1|   2926703 255 35M =   2926800 129 CTTACGATTTTTGAGAGCCAGCGCAACATGTTCAG EDCCCBAAAA@@@@?>===<;;9:99987776554 XA:i:0  MD:Z:35 NM:i:0  XM:i:2
## r997 147 gi|110640213|ref|NC_008253.1|   2926800 255 32M =   2926703 -129    GCAATCATGTAGTGAATCGCGGGGATCGGTCG    67778999:9;;<===>?@@@@AAAABCCCDE    XA:i:0  MD:Z:32 NM:i:0  XM:i:2
## r998 99  gi|110640213|ref|NC_008253.1|   2300477 255 35M =   2300559 114 TGTAGGTCTGATAAGCATAGCGCATCAGGCAATTT EDCCCBAAAA@@@@?>===<;;9:99987776554 XA:i:0  MD:Z:35 NM:i:0  XM:i:2
## r998 147 gi|110640213|ref|NC_008253.1|   2300559 255 32M =   2300477 -114    GAGAAATTATGCGTTTTTTTCTACTATTTGTG    67778999:9;;<===>?@@@@AAAABCCCDE    XA:i:0  MD:Z:32 NM:i:0  XM:i:2
## r999 163 gi|110640213|ref|NC_008253.1|   42356   255 32M =   42443   122 AACAGCAGAGTGTTACTACCGAGTACAGTCCA    EDCCCBAAAA@@@@?>===<;;9:99987776    XA:i:0  MD:Z:32 NM:i:0  XM:i:2
## r999 83  gi|110640213|ref|NC_008253.1|   42443   255 35M =   42356   -122    ACGGTACGACCACGGGAGATGCGGGCGAGGAAGAT 45567778999:9;;<===>?@@@@AAAABCCCDE XA:i:0  MD:Z:35 NM:i:0  XM:i:2
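The template length in column 9 can be recomputed from the positions and CIGAR strings of the two mates: it spans from the leftmost aligned base of one mate to the rightmost aligned base of the other. The sketch below does this for the records shown above. The function names are made up for this example, and it returns the unsigned length, whereas SAM writes a negative value for the downstream mate.

```python
import re


def reference_span(cigar):
    """Number of reference bases covered by a CIGAR string."""
    operations = re.findall(r"(\d+)([MIDNSHP=X])", cigar)
    return sum(int(length) for length, op in operations if op in "MDN=X")


def template_length(pos, cigar, mate_pos, mate_cigar):
    """Unsigned template length from the 1-based positions of both mates."""
    start = min(pos, mate_pos)
    # End coordinates are exclusive: position plus number of covered bases.
    end = max(pos + reference_span(cigar), mate_pos + reference_span(mate_cigar))
    return end - start


# r995 from the first paired-end listing: 35M at 2924988, 32M at 2925181.
print(template_length(2924988, "35M", 2925181, "32M"))
```

A pair whose value exceeds the -X limit is rejected, which is why r995 is reported as unaligned once -X 200 is used.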

Some other options

Two other interesting options are -t and -p. The first one prints the execution times. The second one sets the number of alignment threads to launch:

./bowtie -t -p 2 e_coli reads/e_coli_1000.fq > single_end_alignment.map

## Time loading forward index: 00:00:00
## Time loading mirror index: 00:00:00
## Seeded quality full-index search: 00:00:00
## # reads processed: 1000
## # reads with at least one reported alignment: 699 (69.90%)
## # reads that failed to align: 301 (30.10%)
## Reported 699 alignments to 1 output stream(s)
## Time searching: 00:00:00
## Overall time: 00:00:00
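Every run in this tutorial ends with the same summary lines. When many samples are aligned, it can be handy to collect these numbers automatically. The sketch below extracts them from the captured text of a run. The function name `parse_summary` and the dictionary keys are made up for this example, and the patterns are based only on the summary lines shown in this tutorial.

```python
import re

SUMMARY_PATTERNS = {
    "processed": r"# reads processed: (\d+)",
    "aligned": r"# reads with at least one reported alignment: (\d+)",
    "failed": r"# reads that failed to align: (\d+)",
}


def parse_summary(text):
    """Extract the read counts from the summary Bowtie prints after a run."""
    counts = {}
    for key, pattern in SUMMARY_PATTERNS.items():
        match = re.search(pattern, text)
        if match:
            counts[key] = int(match.group(1))
    return counts


summary = """# reads processed: 1000
# reads with at least one reported alignment: 699 (69.90%)
# reads that failed to align: 301 (30.10%)
Reported 699 alignments to 1 output stream(s)
"""

counts = parse_summary(summary)
rate = 100 * counts["aligned"] / counts["processed"]
print(counts, round(rate, 2))
```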