The average sequencing run on one lane of a modern sequencer can produce ~30 gigabytes of data. That equates to roughly 200-400 million 75 to 300 bp sequences in a flat text file called a FASTQ file. That’s a lot! The sad truth is most of this data is untrustworthy;however, there is so much data that if you can sift through it you are almost guaranteed to find something useful. The goal of this module is to teach you how to sift, filter and proof sequencing data so that you can get the most out of it.
Understanding how biological samples are prepared for sequencing and how sequencing itself works will aid you in assessing and processing raw sequence data. There is a lot of terminology in the world of sequence data analysis. Most of these terms are related to sequencing methodology and library prep.
A DNA library is a collection of DNA fragments that are specially prepared for sequencing. We are not going to get into the gory details of how libraries are prepared. All you need to know is that the end result of library prep is a bunch of DNA fragments that are made up of DNA from the organism you’re working with flanked by artificial DNA, which is added by you during library prep, which allows your DNA to stick to a solid surface and be sequenced on a sequencing machine. These artificial pieces of DNA are called adapters. Several common adapter layouts are pictured below. The most basic adapter layout allows for single end sequencing. Elements 1 and 2 are the terminal parts of the adapter and are what bind to oligos, small pieces of DNA, present on a solid surface. These elements are also necessary for fragment amplification. Stepping toward the insert, the bit of DNA from our organism of interest, from element 1 we have a priming segment for read 1. The priming segment is recognized by a sequencing primer, which initiates the sequencing by synthesis process. Elements 1 and 2, as well as the RD1 sequencing primer, are required by most sequencing technologies. Our second adapter type is one that allows for paired end sequencing.This adapter is the same as the single end adapter, except it includes a second priming sequence for sequencing the other end of the read. This second priming sequence can also be used to read an index. An index allows for the pooling (mixing) of libraries from multiple samples together. Pooling enables the sequencing of multiple samples on the same lane of a sequencer, a practice often referred to as multiplexing. There is no identity crisis because fragments from different libraries have different index sequences in their adapter. The indices are used for demultiplexing, which is just the sorting out of reads based on their index sequence after sequencing. The index is read in a second sequencing step using an index sequencing primer. Alternatively, the index sequence can be included in-line as part of the first read. When an index is included in the first read, it is often referred to as a barcode. This is depicted in the third adapter layout diagram. Barcodes are read as the beginning part of the read and need to cut out after demultiplexing.
Quality control of a library before sequencing is always a good idea. QC is done by running a very small amount of library on an instrument called a bioanalyzer. This instrument runs the sample through a matrix and gives a high resolution separation of library fragments. The output of one of these QC runs is shown in the above figure. The gel read out is to the right and the graph on the left shows the amount of fragments of each size using fluorescence intensity. The bulk of the library fragments are ~200 bp in size, but the peak is broad not sharp, indicating a distribution of reads. During library preparation input DNA is fragmented into small pieces to facilitate sequencing; therefore, if there is no size selection prior to adding the adapters the insert size is variable. Size selection is usually done using SPRI beads or gel purification. For reasons that will be discussed in a later module, having a more consistent, known insert size can be very beneficial when combined with certain sequencing methods.
Our discussion of sequencing methodology will be Illumina-centric, as this is currently the most popular sequencing platform. Sequencing machines contain one or more flow cells and each of these flow cells contains one or more lanes. Further, each lane has hundreds of quadrants, called tiles, containing oligos that bind your library. Users are charged on a per lane basis. Illumina machines employ sequencing by synthesis technology. First, individual reads are amplified in tight clonal clusters, which is done to increase signal. This amplification uses the terminal elements of the adapter in a process referred to as bridge amplification.After amplification, a PCR-like reaction is used to incorporate fluorescent nucleotides one at a time.
For each sequencing cycle completed, an image like the one above is produced. Each colored dot corresponds to a clonal cluster derived from one DNA fragment. Illumina’s processing algorithm takes images like the one above and translates it into a base call for each read at each position, along with a corresponding measure of certainty in the base call.
Many issues can arise during the sequencing by synthesis process. A common issue is the tendency for some fragments in a clonal cluster to not incorporate a nucleotide toward the end of the process, when sequencing reagents become more limiting. This leads to ambiguity in base calls near the 3’ end of reads.
Understanding the concept of library complexity is crucial, as it is one of the most important factors in determining how much sequencing of your library you need to do. Complexity is related to the number of unique DNA fragments in your sample. If your library contains a bunch of fragments that are unique and are derived from many different parts of the genome then you have a complex library. On the other hand, if your library contains redundant fragments and/or fragments from few areas of the genome then you have a low complexity library. A low complexity library is not inherently bad and sometimes is very desirable. In fact many library preparation methods have been developed which purposely generate low complexity libraries, such as GBS, RAD-Seq, exome capture, amplicon sequencing and 3’RNA-seq. These types of libraries are often referred to as reduced representation libraries. Lower complexity means you need less sequencing per sample to capture all the information in a sample and facilitates multiplexing of samples. Multiplexing is generally desirable from both an experimental design stand point and in terms of cost efficiency. High complexity libraries are usually only necessary for assembling whole genomes and transcriptomes or if you want to interrogate all the genetic variation in a handful of individuals.
The data that you get from the sequencing facility will likely come in the form of a fastq file.This is what one entry in a fastq file looks like:
@HWI-ST863:195:D1P7AACXX:1:1101:1173:1855_forward/1 CACTGTACCGAGGTGGCTTACCACTGCTCCTGGTGCACCTACAACTCACTGGTATCAGCTTCGGTGTGTA +HWI-ST863:195:D1P7AACXX:1:1101:1173:1855 HIJJGHHIIGGDHCFHIIJGIGIIGIJJGGGFEFHEHGHIIIJJDGGIIJHEACEFFHDDEBD>C?;?B@
Fastq files are similar to fasta files, but are more information rich. Each fastq entry takes up exactly four lines. The first line is the header line and starts with “@”. The above header is in Casava 1.8 format and contains a lot of useful info. Let’s dissect it.
HWI-ST863 : 195 : D1P7AACXX : 1 : 1101 : 1173 : 1855 : forward/1
| Tag | Meaning |
|---|---|
| HWI-ST863 | Unique identifier for the sequencing machine |
| 195 | Run id |
| D1P7AACXX | Flowcell id |
| 1 | Flowcell lane |
| 1101 | Flowcell lane tile number |
| 1173 | X-coordinate of cluster within tile |
| 1855 | y-coordinate of cluster within tile |
| forward/1 | Read orientation |
If indexing is used, the header line will also contain the index sequence.
The line following the header is the sequence itself. The third line always starts with a “+” and usually has no other information. In some cases the information from the first line is repeated after the “+” on the third line. The fourth line is crucial and is what really differentiates a fastq file from a fasta file.
The fourth line contains the the Phred quality score for each base encoded as an ASCII character. There are different encodings of the Phred quality score, but most fastq files produced by newish Illumina sequencers use Sanger format,also called Phred+33 format, which runs from 0 (lowest quality) to 40 (highest quality). Most tools used for quality control of fastq data automatically recognize how your quality scores are encoded or at least will throw an error message at you if they don’t.
So what’s a Phred quality score?
\[Phred=-10\log_{10}(P)\]
Where
\[P=Pr(error\_in\_base\_call)\]
So if the probability of an error at a base is 0.001 then the Phred score would be 30.
\[Phred=10\log_{10}(0.001)=30\]
At this point you are probably wondering how the probability of an error at a base is calculated. This is done by Illumina’s proprietary algorithm, which nobody but Illumina knows. However, it is known that the quality scores produced by the sequencers are not always reliable and it can be worthwhile to recalibrate base quality scores, which can be done with this tool from the GATK. We, however, are not going to bother with recalibration.
So that’s it, all you need to know about fastq files.
It is important to assess the quality of your sequencing data before proceeding with further analysis. Tools, such as FastQC, are often used to visualize data quality. Common output from Fastqc modules are displayed below.
This module reports some basic summary statistics about your read data. It automatically detects the quality score encoding, number of fastq reads, read length and GC content.
This module displays the distribution of quality at each position of the read.
This module is not always displayed, as this information is not present in all fastq files.The y-axis gives the flowcell tile and the x-axis base position in the read. This graph allows us to see if any technical issues arose during sequencing in one of the flowcell tiles, which would manifest as a red splotches on the graph.
Are there a few sequences of low average quality and many of high average quality? Are there two distinct populations of sequences? Maybe there is a large population of sequences with poor average quality and a very small population with low average quality. This module was designed to give this information.
This module is a little more difficult to understand. Basically, different areas of the genome and different genomic features have different percentages of G/C bases. For example, genic regions (regions containing genes) of the genome tend to have lower GC content. Further, different plant and animal species have different average GC content. This module assumes that GC content is normally distributed with most reads having the genome wide average GC calculated from your data. Adapter contamination or contamination with DNA from other species can cause your data’s distribution to deviate from the theoretical distribution.
Technical issues during sequencing can cause a base’s identity in a read to be ambiguous. Ambiguous bases are represented with an “N” in accordance with IUPAC’s nucleic acid notation.
This module shows the length distribution of the reads.
In this plot the reads are binned into classes of duplication. Ideally, with most library prep methods,the plot should be skewed with most of the reads being unique. The percentage of reads left after removing duplicate reads is also displayed.
Adapter contamination is commonly the result of two phenomena. First, the presence of adapter dimers that aren’t cleaned prior to sequencing the library. The other is having inserts shorter than the sequencing read length. For example, if your insert is only 50bp and you are doing 75 bp sequencing, you will sequence 25 bp into the adapter. This is very common and stems from the fact that libraries contain a distribution of reads. Size selection is usually done to remove fragments that are too small or large for sequencing, but some inevitably make it to the sequencer.
K-mer enrichment refers to bias toward specific, smaller fragments of DNA in our read. In-line barcodes and adapter sequence will often set off the k-mer enrichment alarm. Another common cause of K-mer enrichment in RNA-seq libraries is the incomplete sampling of random primers, which are commonly used to synthesize cDNA
Read trimming, sometimes called quality trimming, refers to cutting out low quality regions from sequence reads and removing adapter contamination and/or other technical artifacts. Trimming is done for several reasons:
Contaminating adapter sequence or other artifacts are not biologically meaningful and therefore should be removed
Low quality regions, if they contain actual sequencing errors, can cause false positive SNPs to be detected
Low quality regions, if they contain actual sequencing errors, can cause alignment errors which can lead to false positive SNPs (or other false variants) and throw off gene expression quantification/other alignment based analyses
There are many tools that exist for read trimming! Many of these tools do dynamic trimming. In other words, each read is treated separately. For example, reads that are high quality along their entire length and contain no adapter contamination wouldn’t get trimmed at all. One approach to dynamic trimming is to look for spots in each read where the average quality falls below some threshold and then chop those regions out. This can be achieved using a sliding window algorithm, which looks like this:
In the above example the red bases represent low quality bases. If our window length was set to four, the third window would contain the lowest quality average and if it fell below our set threshold the sequence chunk in this window would be cut out.
After publishing, many groups will make their raw sequence data available on the Sequencing Read Archive (SRA), where it can be downloaded and analyzed by anybody. Making data available allows other people to compare data sets and possibly make new discoveries. Additionally, it provides data for bench marking new bioinformatics and statistical tools.
Galaxy has a convenient set of tools specifically for extracting different types of data from the SRA.Go to the tomato QTL-seq study here, which we will be analyzing further over the next couple of modules, and locate the accession code section. Here you should see two SRA accession numbers corresponding to the raw data generated in the study. Click on the first SRA number link.
This link should take you to an SRA page containing information about the data set. This includes some experimental context. For example, it tells you that the data are from 96 early flowering plants in an F2 between S. lycopersicum and S. galapagense (LA0530). However, some of the most useful information is located under the library information header.
Here you learn that the library was an RNA-seq library and was sequenced on an Illumina HiSeq 2500. Another key piece of information is the layout, which in this case is single. This means that single read end sequencing was done as opposed to paired-end sequencing. Knowing this information is critical for downstream analysis, as many processing steps treat pair-end data different from single read data.
Now let’s go to Galaxy and use the extract reads function from NCBI SRA Tools to load the data into your Galaxy history. Just type the accession number SRR3741284 into the SRR accession field and hit the execute button.
Before we continue let’s subset our data. We are doing this to make the next steps run faster and to make downloading and uploading files easier, but in real life you would carry out the analysis on the full data set. We can grab the first 100,000 fastq reads with the “Select first” tool under text manipulation tools by changing the default value of 10 to 400,000.
Next, go to Blackboard and download “raw_data.html.” This file was produced by running Fastqc on the data you downloaded from the SRA. Open this file (it should open in your internet browser) and answer the following questions (this reference may be helpful):
What trend do you see in the per base quality report? Do you expect to see this trend in raw fastq data?
Describe the distribution of the per sequence quality score.
Looking at the per base sequence content, overrepresented sequences, and adapter content, what is likely causing the Kmer module to fail?
Examine the duplicate sequence module. Based on this plot, is the library complex?
Considering all the information in the Fastqc report, what parts of your reads would you trim (tail, head, middle etc.)?
Locate the Trimmomatic tool under “NGS: QC and manipulation.” We will use this tool to trim our 100,000 reads and compare the QC results to the raw data QC. Remember our data is not paired end and we definitely want to perform the Illumina clip step. Also, we know that the data came from a single-ended run on an Illumina HiSeq 2500, so we will select “TruSeq3” for the adapter sequences to remove. Everything else will be kept at the default settings.
After trimming, run Fastqc on the trimmomatic trimmed reads. Select “FastQC” under NGS: QC and manipulation and give it your trimmed fastq file.Compare and contrast the fastqc reports from the untrimmed and trimmomatic trimmed reads.