Module 2: The Mapping Problem

Introduction

Read mapping, also called aligning, refers to the process of matching reads to a reference genome. Read mapping is done to find out where in the genome our short sequenced reads came from. Note that if we didn’t have to fragment our input DNA for sequencing this problem would not exist. There are many uses for mapped reads. Piles of mapped reads can be used for transcriptome/genome assemblies, gene feature annotation, and variant discovery. For our QTL-Seq study, we will be using the alignments to discover variants in our population. Although conceptually simple, read mapping remains a challenging task. It is important to keep in mind that any alignment produced is a hypothesis, the strength of which must be weighed with available data. The goal of this module is to introduce you to the challenges faced by read mapping software, common file types, and how to interpret/visualize alignment results.

Read Mapping 101

Reference genome

A reference genome is a linear, haploid representation of all the chromosomes created through deep sequencing of a single reference individual (usually arbitrarily chosen). The 30 bp genome from a one chromosome, imaginary organism “X” could be represented in fasta format like this:

>1

AACGTAACGTAACGTAACGTAACGTAACGT

It is important to realize the limitations of working with a reference genome:

A reference genome does not necessary represent the truth. Genomes need to be assembled from reads, often short reads, which is a process fraught with difficulty and uncertainty.
A reference genome is assembled from one individual. Different genomes from the same species, especially in plants, can have large structural re-arrangements and copy number variation. Therefore there is a lot of missing sequence in a single reference genome that may be important in other members of the species.
SNPs between the reference and sample sequence to be mapped may prevent accurate mapping.

To ameliorate the shortcomings of only having a reference genome from one individual. The concept of a pan-genome has been introduced. Pan-genomes are created from sequencing many different individuals.

A representation of a pan-genome for the species that our imaginary organism “X” comes from may look like this:

In the above figure, a walk along each path through the pan-genome gives a distinct genome present in the species. The core genome is represented in red, all paths go through this part of the pan-genome.Parts of the pan-genome that are only visited once are considered part of the dispensable genome and are marked blue.

Due to the nascent nature of the pan-genome concept, the cost of surveying enough diversity to make a pan-genome,and the complexity of assembling a pan-genome, not many pan-genomes exist. Therefore, community-wide standards for storing pan-genomic information and bioinformatic tools for working with them are underdeveloped or non-existent.

At the end of the day we are faced with mapping millions of short reads containing some sequencing errors to an imperfect, incomplete genome which may be missing whole chunks of DNA.

A very brief note on algorithms and their complexity

The newest mapping algorithms seek to decrease run time complexity. Run time complexity refers to how input scales with time to completion. For example, if the time for an algorithm to run scaled linearly with input it would be denoted in big-O notation, a notation commonly used in computer science to describe algorithms, as having $O(n)$ complexity. If an algorithm scaled quadratically with input it would be represented as $O(n^2)$ in big-O notation. Whenever you read a publication about a new bioinformatic tool pay attention to complexity! In general you want to avoid algorithms with $O(n!)$ complexity and find ones that can run in constant time $O(1)$ or at least run in $O(nlog(n))$ time.

Mapping algorithms

Why not just use BLAST to align our millions of reads to the genome? Simple, BLAST has a time and space complexity of $O(mn)$ (which is essentially $O(n^2)$) where $m$ is the length of the genome and $n$ is the length of the read to be mapped and that’s for every single read. Ouch! Modern aligners suitable for mapping millions of short reads use algorithms that have time complexity ~ $O(n)$ where $n$ is the length of the read. Most aligners achieve this lower complexity using a combination of the Burrows-Wheeler Transform (BWT) and some supplemental information to create an Full-text index in Minute space (FM index) for matching reads to genomic locations. The BWT for a string is just the string shuffled in a very specific way. For example the BWT for “SQUASH$” (where “$” is a special end of string character) is “HUSSA$Q”. The shuffled string actually contains more information than the original string. An explanation of how this special shuffled version of a string (which in practice would be each chromosome sequence shuffled) is created and how it helps map millions of reads super fast is not within the scope of this class. However if you are really interested, a full and long explanation can be found here. A more concise, conceptual overview of short read alignment can be found here.

Special considerations for RNA-seq

Unlike DNA-Seq, where reads come from contiguous regions of the genome, different parts of RNA-seq reads can map to different genomic neighborhoods. The reason for this is simple, RNA-seq reads are usually derived from mature mRNA created through splicing. This means that reads that map to exon-exon junctions have to be split across introns when mapping to a reference genome, as is depicted below where red reads are exon junction reads.

Splicing adds complexity to the read mapping problem. When mapping RNA-Seq reads it is important to use mapping software that is splice-aware. We will be using the splice mapper HISAT for mapping our reads. Other commonly used aligners include TopHat2,STAR, and GSNAP. Most of the newer RNA-seq aligners (HISAT and STAR) allow the use of a two-pass strategy to improve mapping accuracy. Imagine you have a read like the red one pictured below that is split into many little bits that map across the genome or a 75 bp read where 73 bp maps to one area and 2 bp maps elsewhere. Also imagine you have no genomic annotations, in other words you have no idea where genes are or what the gene model structure is. How much confidence would you have in these alignments? Probably not much. Now imagine you have all the alignment data available to you when mapping your problem read, as is depicted below with black reads. These extra reads provide evidence to support your alignment hypothesis. Not only that, but now you can annotate that area of the genome! It is expressed, so it is likely a gene (but what’s a gene anyway?), and the split alignments suggest a three exon gene model.

Two pass aligning works using the same logic we used in our thought experiment. Reads are aligned softly to identify putative splice junctions and these junctions are fed back into the aligner for a second round of alignment.

Paired-end Payoff

In the last module, paired-end sequencing was briefly discussed. Recall that in paired-end sequencing a bit from each end of the library insert DNA is read, so really the library is sequenced twice. This is why paired-end sequencing costs twice as much.

So why bother? One of the reasons is that having paired reads adds additional constraints to the mapping algorithm, which can be helpful in correctly placing reads. For example, in the above illustration we know that the average size of the library insert is 500 bp, so we expect that the two mate pair reads should map within 500 bp of each other. If one read mate maps perfectly to a region it can be used to anchor its partner unambiguously, as shown below.

Partners can either drag each other down or lift each other up in terms of mapping quality. Either way, paired-end reads give you a truer assessment of mapping quality.

Quality

Many mappers will give each alignment a mapping quality score (MAPQ). This score is based on how sure the aligner is that the alignment is real. Paired-end data is often taken into account when assigning a score; however, different aligners take different factors into consideration when generating a MAPQ score. So do not attempt to compare read alignments from different aligners based on their quality scores!

I am SAM. SAM I am.

SAM is short for sequence alignment/map format and is the main file produced by read mapping programs. A SAM file contains eleven mandatory fields. These fields contain information about where the read maps in the genome (genomic coordinates), which strand the read maps to (+ or -), how trustworthy the alignment is, and whether the read maps uniquely. The first few lines of a SAM file is pictured below.

One of the most informative fields in the SAM file is the CIGAR field, which contains the CIGAR string.

A CIGAR string could look something like this:

50M300N50M

The above CIGAR string indicates that 50 bp of the read matches to a region (50M) and then there is a 300 bp skip (300N) and then the last 50 bp of the read match the sequence (50M). The various CIGAR string symbols and their meaning are shown in the table below.

In addition to SAM files, it is common to see BAM files as alignment output. A BAM file is the binary version of a SAM file, which is more compressed and easier for computers to work with. However, BAM files are not human readable and if you open one up all you will see is incomprehensible gibberish. The SAMtools suite is often used for interconverting between SAM and BAM, and can preform a number of other useful utilities.

Mapping with HISAT

HISAT is an ultra fast aligner that achieves its speed through the use of a large global FM index and many smaller local FM indices. Basically, HISAT tries to match as much of each RNA-Seq read as possible with the global index and once a mismatch is located (suggesting splicing) it will severe the read and search the much smaller local FM indices to map across an intron. HISAT also does on-the-fly two-pass mapping. In other words, it uses earlier alignments to inform current and future alignments rather than mapping completely two times. Aligners often come with default settings for things such as allowed mismatch rate, minimal mappable part of the read, max intron size etc. . Default settings are usually okay for most situations; however, if strange trends are noticed in the mapping statistics or if you are working in a reference genome with unusual properties, it may be necessary to tune the settings to achieve optimal results. We are going to use HISAT with default settings.

We will explore mapping by mapping the read data to the tomato reference genome. The latest tomato reference genome can be found here along with many other tomato resources. For this exercise it will not be necessary to download the reference, as I have made it available to you through sharing my Galaxy history. Go to the top bar and select “Shared Data” and then select “Histories.”

Use the search bar to find “PLBRG 3250 Genomic Approaches: Module 2” history and click on it. Then select “import history”

You’re good to go now.

Galaxy does many helpful things for us under the hood. For example, if we provide Galaxy with a reference genome without any index it will build the index for us. Many model species have pre-built indices, however there are no pre-built HISAT indices for the version of the tomato genome that we are using.

Select the HISAT tool under the NGS: RNA Analysis tab and supply it with the raw fastq file, included in the shared history, called “raw_fastq”. We will specify that we are using unpaired reads and are supplying a genome from history. The rest we will leave at default. Also, set the BAM-to-SAM tool under NGS: SAMtools to run on the HISAT output followed by Select First under the Text Manipulation tab. You will need to turn in a document containing the first ten lines of your SAM file as proof you completed this activity.

Next we can look at some basic mapping statistics. The Stats tool under the NGS: Samtools tab was run on the BAM file produced by HISAT to create summary statistics. Please download “summary.txt” and “coverage.txt” from blackboard.

Using the summary.txt file answer the following questions:

What percent of the reads mapped to the genome?

Why would reads not map to the genome?

How many non-unique alignments are there (non-primary alignments)? What sorts of genomic features would you expect these reads to be derived from?

Open the coverage.txt file in Excel and make a graph of the coverage distribution (Hint: An X-Y scatter plot should be sufficient.Taking the log transform of skewed data can often aid in visualization). You will need to submit your coverage distribution

How would you describe the shape of the distribution? What are some possible explanations for why all the bases don’t have the same coverage? (Hint: Think about the nature of RNA and the process of making sequencing libraries)

SAM excercise: Download the file SAM.txt from Blackboard. This file is tab delimited and contains one entry, so you should be able to open it up in Excel. You will find section 1.4 of the SAM specification documentation useful in answering the following questions:

What is the sequence of the read?

Was the read uniquely mapped?

Where did the read map (starting position and chromosome)?

Copy and paste the CIGAR string and interpret it.

You finished Module 2! We will discuss the nuances of variant calling in the next module

Please email your answers to ch728@cornell.edu