The SRA Toolkit is a set of software tools used to interact with the Sequence Read Archive (SRA), which is a public repository that stores high-throughput sequencing data. Managed by the National Center for Biotechnology Information (NCBI), the SRA contains raw reads from various sequencing platforms such as Illumina, PacBio, and Oxford Nanopore, among others.
The SRA Toolkit provides command-line utilities to download, manipulate, and convert sequencing data from the SRA repository into various formats that are compatible with downstream analysis pipelines.
Main Tools in the SRA Toolkit:
- fastq-dump: This is one of the most commonly used commands. It converts SRA data into FASTQ format, which is widely used in bioinformatics for storing nucleotide sequences along with their quality scores. It is particularly useful for downstream analysis such as alignment, variant calling, or transcriptome assembly.
- prefetch: This tool downloads SRA files in their original format (.sra) without converting them, allowing for faster retrieval of large datasets. It is commonly used before running fastq-dump.
- fasterq-dump: This is an alternative to fastq-dump that is optimized for faster extraction of FASTQ files from .sra data, particularly for large datasets.
- vdb-config: This utility configures various aspects of the SRA Toolkit, such as cache directories, network settings, and file paths.
- srapath: This tool retrieves the URL or location of a file in the SRA archive, which can be useful for scripting downloads.
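Before using any of these tools, it can help to confirm which of them are actually installed. The following is a minimal sketch, not part of the toolkit itself: the function name missing_sra_tools is hypothetical, and it only checks that each command can be found on PATH.

```shell
# Print the name of every SRA Toolkit utility listed above that cannot
# be found on PATH. Prints nothing when all five are available.
missing_sra_tools() {
    for tool in prefetch fastq-dump fasterq-dump vdb-config srapath; do
        # `command -v` is the POSIX way to test whether a command exists.
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}
```

Calling missing_sra_tools with no arguments prints one missing tool per line, so empty output means the toolkit is ready to use.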
Typical Workflow
1. Search and Identify SRA Files: You would typically search for sequencing data on the NCBI SRA website or via tools like esearch from the Entrez command-line utilities.
2. Download Data: Use prefetch to download the raw .sra files, or use fastq-dump/fasterq-dump to directly retrieve FASTQ-formatted files.
3. Convert/Analyze Data: After downloading, you may need to convert the data into formats suitable for your analysis (e.g., FASTQ for alignment or further bioinformatics analysis).
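The search step can be scripted with Entrez Direct. The sketch below assumes esearch and efetch are installed, and that the run accession is the first column of the "runinfo" CSV that efetch returns, which is the usual layout. Both function names are hypothetical.

```shell
# Read an SRA "runinfo" CSV on stdin and print one run accession per line.
# Assumes the accession (SRR/ERR/DRR plus digits) is the first column;
# the header row and any other non-matching lines are dropped.
runs_from_runinfo() {
    cut -d, -f1 | grep -E '^[SED]RR[0-9]+$'
}

# Look up the runs matching an Entrez query string.
# Requires esearch and efetch (Entrez Direct) on PATH.
find_runs() {
    esearch -db sra -query "$1" | efetch -format runinfo | runs_from_runinfo
}
```

The output of find_runs, given a query such as a project accession, can then be fed one line at a time to prefetch.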
Use Cases
- Downloading and processing raw sequencing data for re-analysis.
- Verifying the quality of publicly available data before incorporating it into research projects.
- Benchmarking new computational methods using standard datasets.
The SRA Toolkit is essential for handling and converting large datasets from the SRA, allowing researchers to work with sequencing data efficiently.
Download and Process Single-End Data
To download and process single-end sequencing data
from the SRA using prefetch and fastq-dump
with compression, follow these steps:
1. Download the .sra file using
prefetch:
Consider an example SRA ID: SRR1234567
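A minimal sketch of this step, assuming the SRA Toolkit is installed and on PATH. The wrapper name download_run is hypothetical; the underlying command is simply prefetch followed by the run accession.

```shell
# Example accession used throughout this guide.
SRA_ID="SRR1234567"

# Download one run in its native .sra format.
download_run() {
    # Refuse to call prefetch without an accession.
    [ -n "$1" ] || { echo "usage: download_run <run accession>" >&2; return 2; }
    prefetch "$1"
}
```

Calling download_run "$SRA_ID" is equivalent to running prefetch SRR1234567.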
This will download the .sra file to the toolkit's default location.
Depending on the toolkit version and configuration, that is either a cache
directory (usually ~/ncbi/public/sra/) or the directory where you run the command.
2. Convert and compress the data using fastq-dump:
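A minimal sketch of this step, assuming the run has already been downloaded with prefetch. The wrapper name convert_single is hypothetical; the command it runs is fastq-dump --gzip followed by the run accession.

```shell
# Convert a downloaded single-end run to a gzip-compressed FASTQ file.
# For run SRR1234567 this is expected to write SRR1234567.fastq.gz
# into the current directory.
convert_single() {
    fastq-dump --gzip "$1"
}
```

Calling convert_single SRR1234567 is equivalent to running fastq-dump --gzip SRR1234567.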
Explanation of options:
- fastq-dump: Converts the .sra file into a FASTQ file (.fastq).
- --gzip: Compresses the resulting FASTQ file using gzip.
It will generate a file: SRR1234567.fastq.gz.
Final File:
After running these commands, the resulting compressed FASTQ file
will be SRR1234567.fastq.gz.
This file is now ready for downstream bioinformatics analysis.
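The same result can be sketched with fasterq-dump, the faster alternative listed among the main tools. Unlike fastq-dump, fasterq-dump has no --gzip option, so compression is a separate step. The function name convert_single_fast is hypothetical, and the sketch assumes a single-end run whose output is written as <run>.fastq in the current directory.

```shell
# Convert a single-end run with fasterq-dump, then compress the result.
# gzip replaces <run>.fastq with <run>.fastq.gz; it runs only if the
# conversion succeeded.
convert_single_fast() {
    fasterq-dump "$1" && gzip "$1.fastq"
}
```

Calling convert_single_fast SRR1234567 should leave SRR1234567.fastq.gz in the current directory.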
Download and Process Paired-End Data
Consider an example SRA ID: SRR1234567
To download and process paired-end sequencing data
from the SRA using prefetch and fastq-dump
with the compression option, follow the steps below:
1. Download the .sra file using
prefetch:
This will download the .sra file to the default location
(usually ~/ncbi/public/sra/).
2. Split and compress the paired-end data using
fastq-dump:
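Steps 1 and 2 can be sketched together as one function, assuming the SRA Toolkit is installed and on PATH. The name fetch_paired is hypothetical; the two commands it runs are prefetch and fastq-dump --gzip --split-files, each followed by the run accession.

```shell
# Download a paired-end run, then write forward and reverse reads to
# separate gzip-compressed FASTQ files (<run>_1.fastq.gz, <run>_2.fastq.gz).
fetch_paired() {
    # Stop if the download fails so fastq-dump is not run after a failed prefetch.
    prefetch "$1" || return 1
    fastq-dump --gzip --split-files "$1"
}
```

Calling fetch_paired SRR1234567 runs both steps for the example accession.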
Explanation of options:
- --gzip: Compresses the resulting FASTQ files using gzip.
- --split-files: Splits paired-end reads into two separate files: one for forward reads and one for reverse reads.
Output Files:
After running the command, you will get two compressed FASTQ files:
1. SRR1234567_1.fastq.gz (forward reads)
2. SRR1234567_2.fastq.gz (reverse reads)
These files are now compressed and ready for further analysis, such as alignment or quality control.
Using --split-3 for Mixed Paired-End and Single-End Data
The --split-3 option in fastq-dump is used
to handle mixed paired-end and single-end sequencing
data within a single .sra file. This option splits the data
into different FASTQ files based on how the sequencing reads are paired
or unpaired.
Use Case for --split-3:
Sometimes, an SRA run contains both paired-end reads and unpaired (or
orphan) reads. This can occur due to sequencing or sample preparation
issues, where some reads from a pair are missing their mate. The
--split-3 option handles this by producing separate FASTQ
files for:
- Paired-end reads
- Unpaired reads
Output:
When you run fastq-dump --split-3, the data will be
split into three possible FASTQ files:
- <SRA_ID>_1.fastq: Contains the forward reads of the paired-end data.
- <SRA_ID>_2.fastq: Contains the reverse reads of the paired-end data.
- <SRA_ID>.fastq: Contains the unpaired (or orphaned) reads, where one mate in the pair is missing.
Scenario:
- If the .sra file contains both paired-end and single-end reads (due to missing mates or some other reason), the --split-3 option will output separate files for each type.
- If there are no orphaned reads, only the _1.fastq and _2.fastq files will be created, and no <SRA_ID>.fastq file will appear.
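The two scenarios above can be checked directly after a conversion. The sketch below assumes fastq-dump writes uncompressed output into the current directory; the function name split_and_report is hypothetical.

```shell
# Run fastq-dump --split-3 on one run, then report which of the three
# possible output files were actually written to the current directory.
split_and_report() {
    run=$1
    fastq-dump --split-3 "$run" || return 1
    for f in "${run}_1.fastq" "${run}_2.fastq" "${run}.fastq"; do
        if [ -e "$f" ]; then
            echo "found: $f"
        else
            echo "missing: $f"
        fi
    done
}
```

For a run with no orphaned reads, the last line of the report should read "missing: <SRA_ID>.fastq".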
When to Use --split-3:
You should use --split-3 when you’re not sure if all
your reads are perfectly paired, or when you expect to have both
paired-end and unpaired reads in your data. This option ensures that any
unpaired reads are saved separately for further analysis.