The SRA Toolkit is a set of software tools used to interact with the Sequence Read Archive (SRA), which is a public repository that stores high-throughput sequencing data. Managed by the National Center for Biotechnology Information (NCBI), the SRA contains raw reads from various sequencing platforms such as Illumina, PacBio, and Oxford Nanopore, among others.

The SRA Toolkit provides command-line utilities to download, manipulate, and convert sequencing data from the SRA repository into various formats that are compatible with downstream analysis pipelines.

Main Tools in the SRA Toolkit:

  1. fastq-dump: This is one of the most commonly used commands. It converts SRA data into FASTQ format, which is widely used in bioinformatics for storing nucleotide sequences along with their quality scores. It’s particularly useful for downstream analysis such as alignment, variant calling, or transcriptome assembly.

  2. prefetch: This tool is used to download SRA files in their original format (.sra) without converting them, allowing for faster retrieval of large datasets. It is commonly used before running fastq-dump.

  3. fasterq-dump: This is a newer, multi-threaded alternative to fastq-dump that extracts FASTQ files from .sra data considerably faster, particularly for large datasets. Unlike fastq-dump, it has no --gzip option, so its output must be compressed in a separate step.

  4. vdb-config: This utility allows users to configure various aspects of the SRA Toolkit, such as setting up cache directories, network settings, or file paths.

  5. srapath: This tool retrieves the URL or location of a file in the SRA archive, which can be useful for scripting downloads.
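
Before using any of these utilities, it can help to confirm that they are actually on your PATH. The following is a minimal sketch, not part of the toolkit itself: missing_tools is a hypothetical helper name, and it relies only on the standard shell built-in command -v.

```shell
# Print the name of every tool given as an argument that is NOT found on PATH.
# missing_tools is a hypothetical helper, not an SRA Toolkit command.
missing_tools() {
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            printf '%s\n' "$tool"
        fi
    done
}

# Example usage (prints nothing when all five utilities are installed):
#   missing_tools prefetch fastq-dump fasterq-dump vdb-config srapath
```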

Typical Workflow

  1. Search and Identify SRA Files: You would typically search for sequencing data on the NCBI SRA website or via tools like esearch from the Entrez command-line utilities.

  2. Download Data: Use prefetch to download the raw .sra files or use fastq-dump/fasterq-dump to directly retrieve FASTQ-formatted files.

  3. Convert/Analyze Data: After downloading, you may need to convert the data into formats suitable for your analysis (e.g., FASTQ for alignment or further bioinformatics analysis).
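
The download and conversion steps above can be sketched as a small shell helper. This is an illustrative sketch under stated assumptions rather than an official recipe: run_step and sra_workflow are hypothetical names, and SRA_EXECUTE is an assumed environment variable introduced here. By default the commands are only printed, so nothing is downloaded until you opt in.

```shell
# Print a command, or run it when SRA_EXECUTE=1 is set in the environment.
# run_step and SRA_EXECUTE are hypothetical names used only in this sketch.
run_step() {
    if [ "${SRA_EXECUTE:-0}" = "1" ]; then
        "$@"
    else
        printf '%s\n' "$*"
    fi
}

# Download one run accession and convert it to FASTQ.
sra_workflow() {
    acc=$1
    # Step 1: download the run in its original SRA format.
    run_step prefetch "$acc" || return 1
    # Step 2: extract FASTQ from the downloaded run.
    run_step fasterq-dump "$acc"
}

# Example usage:
#   sra_workflow SRR1234567                  # dry run, prints both commands
#   SRA_EXECUTE=1 sra_workflow SRR1234567    # runs them (needs the toolkit and network access)
```

Keeping the dry run as the default makes it easy to review the exact commands for a long list of accessions before committing to large downloads.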

Use Cases

The SRA Toolkit is essential for handling and converting large datasets from the SRA, allowing researchers to work with sequencing data efficiently.


Download and Process Single-End Data

To download and process single-end sequencing data from the SRA using prefetch and fastq-dump with compression, follow these steps:

1. Download the .sra file using prefetch:

Consider an example SRA ID: SRR1234567

prefetch SRR1234567

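This will download the .sra file either into the default cache directory (usually ~/ncbi/public/sra/) or into a subdirectory of the current working directory named after the accession (for example, SRR1234567/SRR1234567.sra), depending on the toolkit version and configuration.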

2. Convert and compress the data using fastq-dump:

fastq-dump --gzip SRR1234567

Explanation of options:

  • fastq-dump: Converts the .sra file into a FASTQ file (.fastq).

  • --gzip: Compresses the resulting FASTQ file using gzip.

Final File:

After running these commands, the resulting compressed FASTQ file will be SRR1234567.fastq.gz.

This file is now ready for downstream bioinformatics analysis.
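
Before moving on, a quick sanity check of the output can catch truncated or malformed files. The sketch below inspects only the first record and assumes the standard four-line FASTQ layout (header, sequence, separator, quality). fastq_first_record_ok is a hypothetical helper name, not a toolkit command.

```shell
# Succeed only if the first record of a gzip-compressed FASTQ file has the
# expected four-line shape: '@' header, sequence, '+' separator, and a
# quality string of the same length as the sequence.
# fastq_first_record_ok is a hypothetical helper, not an SRA Toolkit command.
fastq_first_record_ok() {
    gzip -dc "$1" | head -n 4 | awk '
        NR == 1 && substr($0, 1, 1) != "@" { bad = 1 }
        NR == 2 { seqlen = length($0) }
        NR == 3 && substr($0, 1, 1) != "+" { bad = 1 }
        NR == 4 && length($0) != seqlen { bad = 1 }
        END { if (bad || NR != 4) exit 1 }
    '
}

# Example usage:
#   fastq_first_record_ok SRR1234567.fastq.gz && echo "first record looks fine"
```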


Download and Process Paired-End Data

Consider an example SRA ID: SRR1234567

To download and process paired-end sequencing data from the SRA using prefetch and fastq-dump with the compression option, follow the steps below:

1. Download the .sra file using prefetch:

prefetch SRR1234567

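This will download the .sra file to the default location (usually ~/ncbi/public/sra/) or, depending on the toolkit version and configuration, into a subdirectory of the current working directory named after the accession.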

2. Split and compress the paired-end data using fastq-dump:

fastq-dump --gzip --split-files SRR1234567

Explanation of options:

  • --gzip: Compresses the resulting FASTQ files using gzip.

  • --split-files: Splits paired-end reads into two separate files: one for forward reads and one for reverse reads.

Output Files:

After running the command, you will get two compressed FASTQ files:

  1. SRR1234567_1.fastq.gz (forward reads)
  2. SRR1234567_2.fastq.gz (reverse reads)

These files are now compressed and ready for further analysis, such as alignment or quality control.
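
For a cleanly paired run, the forward and reverse files are expected to contain the same number of reads, so comparing read counts is a cheap consistency check before alignment. The sketch below assumes four lines per FASTQ record. count_reads and mates_match are hypothetical helper names.

```shell
# Count the reads in a gzip-compressed FASTQ file (assumes four lines per read).
# count_reads and mates_match are hypothetical helpers, not toolkit commands.
count_reads() {
    gzip -dc "$1" | awk 'END { print NR / 4 }'
}

# Succeed only if both mate files contain the same number of reads.
mates_match() {
    [ "$(count_reads "$1")" = "$(count_reads "$2")" ]
}

# Example usage:
#   mates_match SRR1234567_1.fastq.gz SRR1234567_2.fastq.gz || echo "read counts differ"
```

A mismatch does not say which reads lack a mate; it only signals that the run is not cleanly paired and may need different handling.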


Using --split-3 for Mixed Paired-End and Single-End Data

The --split-3 option in fastq-dump is used to handle mixed paired-end and single-end sequencing data within a single .sra file. This option splits the data into different FASTQ files based on how the sequencing reads are paired or unpaired.

Use Case for --split-3:

Sometimes, an SRA run contains both paired-end reads and unpaired (or orphan) reads. This can occur due to sequencing or sample preparation issues, where some reads are missing their mate. The --split-3 option handles this by producing separate FASTQ files for:

  • Paired-end reads
  • Unpaired reads

Output:

When you run fastq-dump --split-3, the data will be split into up to three FASTQ files:

  1. <SRA_ID>_1.fastq: Contains the forward reads of the paired-end data.
  2. <SRA_ID>_2.fastq: Contains the reverse reads of the paired-end data.
  3. <SRA_ID>.fastq: Contains the unpaired (or orphaned) reads, where one mate in the pair is missing.

Example Command:

fastq-dump --split-3 SRR1234567

Scenario:

  • If the .sra file contains both paired-end and single-end reads (due to missing mates or some other reason), the --split-3 option will output separate files for each type.
  • If there are no orphaned reads, only the _1.fastq and _2.fastq files will be created, and no <SRA_ID>.fastq file will appear.
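
After a --split-3 run, the set of files that exist tells you what the run contained. The sketch below checks for the uncompressed file names listed above in a given directory. split3_layout and its output labels are hypothetical, chosen only for illustration.

```shell
# Report which layout the --split-3 output files suggest for an accession.
# Arguments: accession, then an optional directory (defaults to the current one).
# split3_layout and its labels are hypothetical, not part of the SRA Toolkit.
split3_layout() {
    acc=$1
    dir=${2:-.}
    if [ -f "$dir/${acc}_1.fastq" ] && [ -f "$dir/${acc}_2.fastq" ]; then
        if [ -f "$dir/$acc.fastq" ]; then
            echo "paired+unpaired"   # both mate files plus orphaned reads
        else
            echo "paired"            # mate files only, no orphaned reads
        fi
    elif [ -f "$dir/$acc.fastq" ]; then
        echo "single"                # only the unsuffixed file was written
    else
        echo "none"                  # no matching output files found
    fi
}

# Example usage:
#   split3_layout SRR1234567 .
```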

When to Use --split-3:

You should use --split-3 when you’re not sure if all your reads are perfectly paired, or when you expect to have both paired-end and unpaired reads in your data. This option ensures that any unpaired reads are saved separately for further analysis.