Overview

This document describes the metagenomic analysis pipeline by Edna Chiang.

This tutorial assumes you have a basic understanding of metagenomics and linux command line. It provides recommendations and guidance in choosing and using software to perform your own metagenomic analysis on paired-end data.

While many of the steps and concepts are translatable to single-end data, many of the software are incompatible with single-end data.

This guide doesn’t discuss appropriate experimental design, DNA preparation, or sequencing methods.

To learn more about metagenomics, there are plentiful reviews to read. Here are a few to consider:

To learn more about command line, there are many online and on-campus resources. Here are a few to consider:

1) Define your question

Before beginning any analysis, you must first define your question. How you analyze your data is completely dependent upon your question.
Some ideas to consider:

  • Do you want to identify specific taxa?
  • Do you want metagenome-assembled genomes?
  • In what types of genes are you interested?

2) Evaluate data quality

Use FastQC to evaluate the quality of your raw fastq files.

FastQC will output a .html file with multiple plots.

For additional resources in understanding FastQC outputs, refer to the following webpages:

/path/to/fastqc /path/to/fastq -o /path/to/output
  • /path/to/fastqc
    • Opens the fastqc software
    • Replace with your path to fastqc
  • /path/to/fastq
    • Specifies input fastq files
    • Replace with the path to your fastq input file(s)
    • If you want to include more than 1 fastq file, just separate the file paths by 1 space
  • -o /path/to/output
    • -o Specifies output folder
    • Replace /path/to/output with the path to your output folder

If you find that fastqc isn’t running and you get the following error:

Exception in thread "main" java.awt.HeadlessException:
No X11 DISPLAY variable was set, but this program performed an operation which requires it.

Try adding & (a space followed by an ampersand) at the very end of the command.

The output plots to which I pay most attention are:

  • Basic Statistics
  • Per base sequence quality
    • Sequence quality naturally decreases near the tail-end of the sequence
  • Over-represented sequences
    • These sequences are often adapters/primers
  • Adapter Content

3) Quality trim

Use Trimmomatic to remove adapters, primers, and over-represented sequences, and to trim poor quality basepairs.

java -jar /path/to/trimmomatic-0.36.jar PE -threads number -trimlog /path/to/trimmomatic/logfile /path/to/R1/fastq.gz /path/to/R2/fastq.gz -baseout /path/to/output.gz ILLUMINACLIP:/path/to/overrep_seqs.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
  • java -jar
    • Because Trimmomatic is written in java, you must open the software using java
  • /path/to/trimmomatic-0.36.jar
    • This tells the computer where to find the Trimmomatic software
    • Depending on what version of Trimmomatic you download, you may have a different name for the .jar software
    • Replace this with your path to the trimmomatic .jar file
  • PE
    • Indicates you’re inputting paired-end data
    • Trimmomatic also works with single-end data
  • -threads number
    • Sets the number of threads/processors you want to use
    • Replace number with a number (e.g. 4)
  • -trimlog /path/to/trimmomatic/logfile.txt
    • -trimlog specifies the location and name of your output log file
    • Trimmomatic creates a log file of all read trimmings, including:
      • read name
      • surviving sequence length
      • location of the first surviving base (amount trimmed from the start)
      • location of the last surviving base in the original read
      • amount trimmed from end
    • Replace /path/to/trimmomatic/logfile.txt with the path to your log file
  • /path/to/R1/fastq.gz
    • Specifies location of your forward read fastq file
    • Trimmomatic accepts both .fastq and .fastq.gz files
    • Replace with the path to your fastq or fastq.gz file
  • /path/to/R2/fastq.gz
    • Specifies location of your reverse read fastq file
    • Trimmomatic accepts both .fastq and .fastq.gz files
    • Replace with the path to your fastq or fastq.gz file
  • -baseout /path/to/output.gz
    • -baseout specifies the location and names of the output files:
    • output_1P.gz
      • Paired forward reads
    • output_2P.gz
      • Paired reverse reads
    • output_1U.gz
      • Unpaired forward reads
    • output_2U.gz
      • Unpaired reverse reads
    • output is the prefix of all output files
    • .gz outputs everything as gzipped fastq files
    • There should be the same number of reads in output_1P.gz and output_2P.gz
    • These are reads that have both a forward & reverse read (thus “paired”)
    • Replace /path/to/output.gz with the path to your output files
  • ILLUMINACLIP:/path/to/overrep_seqs.fa:2:30:10
    • ILLUMINACLIP: is the command to remove specific sequences
    • Trimmomatic comes with adapter files for commonly used Illumina adapters
    • You can also make your own custom adapter file. For example:
      • Sequences to be removed from forward reads must end with /1
      • Sequences to be removed from reverse reads must end with /2
    • Replace /path/to/overrep_seqs.fa with the path to Trimmomatic’s provided adapter files or your own custom file
    • :2:30:10 are the default parameters for adapter trimming. Refer to the Trimmomatic website for more info.
  • LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
    • There are the default parameters for trimming poor quality reads. Refer to the Trimmomatic website for more info.

Alternatives

I’ve also tested adapter removal with cutadapt and quality trimming with sickle.
I chose Trimmomatic over these two software simply because it’s more efficient to use one program instead of two.

4) Remove host contamination (optional)

If you study host-associated microbiomes, host DNA contamination can greatly hinder metagenomic assembly by increasing required time and computational power and by decreasing assembly quality.

If you have access to a genome of your host organism, I strongly recommend that you remove host contamination!

I follow this tutorial for my host contamination removal. Below is the long way to remove host contamination, but there is a much quicker way that’s described in the linked tutorial.

4A) Download host genome

The following is an example of how I download the genome for my host organism (13-lined ground squirrel, Ictidomys tridecemlineatus).

First, download the host genome:

curl -O ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/236/235/GCA_000236235.1_SpeTri2.0/GCA_000236235.1_SpeTri2.0_assembly_structure/Primary_Assembly/unplaced_scaffolds/FASTA/unplaced.scaf.fna.gz

Open the downloaded file:

gunzip unplaced.scaf.fna.gz

Rename the genome to something that makes sense:

mv unplaced.scaf.fna host.fna

4B) Build host genome database

Use bowtie2 to create a database from the host genome.

/path/to/bowtie2-build --seed number /path/to/host.fna /path/to/database
  • /path/to/bowtie2-build
    • Replace /path/to/bowtie2 with your path to bowtie2
    • -build because calls the build command within bowtie2
  • --seed number
    • Sets the seed so your work is reproducible
    • Replace number with a number (e.g. 4)
  • /path/to/host.fna
    • Replace with the path to your host genome
  • /path/to/database
    • Replace with the path to your output host database

4C) Map reads against host database

We’ll use bowtie2 again.

/path/to/bowtie2 --very-sensitive-local -p number --seed number -x /path/to/database -1 /path/to/R1/fastq.gz -2 /path/to/R2/fastq.gz -S /path/to/output.sam

Note: The output of this to terminal will include stats about % mapped. These stats are not saved in the output file and you may want these for your records.

  • path/to/bowtie2
    • Replace with your path to bowtie2
  • --very-sensitive-local
    • This does a local alignment that’s very sensitive when matching so you get more matched reads
    • This setting is slower than other settings. If you want speed, use a less sensitive local setting
    • I use very sensitive because my main concern is removing all the host contamination I can to improve the quality of my assembly downstream. For me, speed is a minor concern
    • If you’re mapping short reads to a big genome, do not do a global alignment!
  • -p number
    • -p Sets the number of processors
    • Replace number with a number (e.g. 4)
  • --seed number
    • --seed sets the seed so that your work is reproducible
    • Replace number with a number (e.g. 4)
  • -x /path/to/database
    • -x specifies your host database
    • Replace /path/to/database with the path to your host database
  • -1 /path/to/R1/fastq.gz
    • -1 specifies your forward read fastq file
    • Replace /path/to/R1/fastq.gz with the path to your forward read fastq file
    • If you ran Trimmomatic step, this would be output_1P.gz
  • -2 /path/to/R2/fastq.gz
    • -2 specifies your reverse read fastq file
    • Replace /path/to/R2/fastq.gz with the path to your reverse read fastq file
    • If you ran Trimmomatic, this would be output_2P.gz
  • -S /path/to/output.sam
    • -S specifies your output file, which is in SAM format (gibberish to the normal human eye)
    • Replace /path/to/output.sam with the path to your output SAM file

4D) Convert SAM file to BAM file

A SAM file is huuuuuge, so convert it into a BAM file to save room.

The remaining steps will use the program samtools.

/path/to/samtools view -@ numbers -bS /path/to/output.sam > /path/to/output.bam
  • /path/to/samtools view
    • Replace /path/to/samtools/ with your path to samtools
    • view calls the view command within samtools
  • -@ number
    • -@ sets number of threads/processors
    • Replace number with a number (e.g. 4)
  • -bS
    • Outputs a BAM file
  • /path/to/output.sam > /path/to/output.bam
    • Specifies input SAM file and the output BAM file
    • Replace /path/to/output.sam with the path to the output sam file you made in the previous step (4C)
    • Keep >
    • Replace /path/to/output.bam with the path to your output bam file

4E) Extract unmapped reads (non-host)

/path/to/samtools view -@ number -b -f 12 -F 256 /path/to/output.bam > /path/to/unmap.bam
  • /path/to/samtools view
    • Replace /path/to/samtools/ with your path to samtools
    • view calls the view command within samtools
  • -@ number
    • -@ sets number of threads/processors
    • Replace number with a number (e.g. 4)
  • -b
    • Outputs a BAM file
  • -f 12
    • Extracts only alignments with both reads unmapped
  • -F 256
    • Does not extract aligns that are not primary alignment
  • /path/to/output.bam > /path/to/unmap.bam
    • Specifies input BAM file and the output BAM file
    • Replace /path/to/output.bam with the path to the output BAM file you made in the previous step (4D)
    • Keep >
    • Replace /path/to/unmap.bam with the path to your output BAM file

4F) Sort BAM file to organize paired reads

/path/to/samtools sort -n /path/to/unmap.bam -o /path/to/unmap.sorted.bam
  • /path/to/samtools sort
    • Replace /path/to/samtools/ with your path to samtools
    • sort because calls the sort command within samtools
  • -n
    • Sorts by read name
  • /path/to/unmap.bam
    • Replace with the path to the unmapped bam file you made in the previous step (4D)
  • -o /path/to/unmap.sorted.bam
    • -o specifies output
    • Replace path/to/unmap.sorted.bam with the path to your output BAM file

4G) Convert BAM to fastq

/path/to/samtools bam2fq /path/to/unmap.sorted.bam > /path/to/nohost.fastq
  • /path/to/samtools bam2fq
    • Replace /path/to/samtools with your path to samtools
    • bam2fq calls the bam2fq command within samtools
  • /path/to/unmap.sorted.bam
    • Replace with the path to the unmapped and sorted BAM file you made in the previous step (4E)
  • /path/to/nohost.fastq
    • Replace with path to your output .fastq file
    • The output is an interleaved fastq file

Alternatives

  • You can use bowtie2 flag --un-conc to directly pull out unmapped reads if you don’t need the flexibility of the long samtools pipeline. Refer to this tutorial for directions.
  • The Hurwitz Lab has a nice script to remove host contamination.

5) Evaluate data quality

This is the same as Step 2.

Use FastQC to evaluate the quality of your raw fastq files.

FastQC will output a .html file with multiple plots.

For additional resources in understanding FastQC outputs, refer to the following webpages:

/path/to/fastqc /path/to/fastq -o /path/to/output
  • /path/to/fastqc
    • Replace with your path to fastqc
  • /path/to/fastq
    • Replace with your input fastq file(s)
    • To include more than 1 fastq file, just separate each file path by 1 space
  • -o /path/to/output
    • -o specifies output
    • Replace /path/to/output with the path to your output file(s)

If you find that fastqc isn’t running and you get the following error:

Exception in thread "main" java.awt.HeadlessException:
No X11 DISPLAY variable was set, but this program performed an operation which requires it.

Try adding & (space followed by an ampersand) at the very end of the command.

6) Taxonomic profile (optional)

This step gives you a quick look at what taxa comprise your metagenome. You can think of this as a preliminary option before binning.

My profiler of choice is Kaiju because it gave me the highest percentage of classified reads. However, it does require quite a bit of RAM. If you are limited by computational power, I suggest using Centrifuge. You can read more about other alternatives by scrolling down to the bottom of this step.

/path/to/kaiju -t /path/to/kaiju/database/nodes.dmp -f /path/to/kaiju/database/file.fmi -i /path/to/input/file.fastq -o /path/to/output/file/kaiju.output -z number -E 1e-05 -v
  • /path/to/kaiju
    • Replace with your path to kaiju
  • -t /path/to/kaiju/database/nodes.dmp
    • -t specifies the NCBI taxonomy nodes.dmp file
    • Replace /path/to/kaiju/database/nodes.dmp with your path to nodes.dmp
  • -f /path/to/kaiju/database/file.fmi
    • -f specifies the database file
    • Replace /path/to/kaiju/database/file.fmi with your path to the NCBI taxonomy .fmi file
  • -i /path/to/input/file.fastq
    • -i specifies input. This can be .fastq, .fasta, or .ffn formats.
    • Replace /path/to/input/file.fastq with the path to your input file
  • -o /path/to/output/file/kaiju.output
    • -o saves the kaiju output. If you exclude this flag, kaiju will print the output to the terminal rather than save it in a file.
    • Replace /path/to/output/file/kaiju.output with the path to your output file
  • -z number
    • -z sets the number of threads/processors
    • Replace number with a number (e.g. 4)
  • -E 1e-05
    • -E sets the E-value cutoff
    • This cutoff should always be ≤ 1e-05
  • -v
    • Verbose option that prints additional columns in the kaiju output

After you run Kaiju, you’ll want to convert the kaiju output file(s) into a summary table using kaiju’s kaiju2table function:

/path/to/kaiju2table -t /path/to/kaiju/database/nodes.dmp -n /path/to/kaiju/database/names.dmp -r taxonomic_level -o /path/to/output/table.tsv /path/to/output/file/kaiju.output -l taxonomic,levels,separated,by,commas
  • /path/to/kaiju2table
    • Replace with your path to kaiju2table
  • -t /path/to/kaiju/database/nodes.dmp
    • -t specifies the NCBI taxonomy nodes.dmp file
    • Replace /path/to/kaiju/database/nodes.dmp with your path to nodes.dmp
  • -n /path/to/kaiju/database/names.dmp
    • -n specifies the NCBI taxonomy names.dmp file
    • Replace /path/to/kaiju/database/names.dmp with your path to names.dmp
  • -r taxonomic_level
    • -r specifies the taxonomic level you want in the output table
    • Replace taxonomic_level with your desired taxonomic level
    • You can only make a table for one taxonomic level at a time. So if you want a table for multiple taxonomic levels, you have to separately run kaiju2table for each level.
  • -o /path/to/output/table.tsv
    • -o specifies your output file
    • Replace /path/to/output/table.tsv with the path to your output classification table
  • /path/to/output/file/kaiju.output
    • Replace with the path to the output file you created in the previous code chunk
  • -l taxonomic,levels,separated,by,commas
    • -l specifies the taxonomic levels to be printed in your output table
    • For example: -l superkingdom,phylum,class,order,family,genus,species

Alternatives

  • Kraken 2 and Bracken
    • Kraken 2 requires a lot of disk space and RAM to run. You need 100 GB of disk space to store the kraken 2 database and > 29 GB of RAM to run the program.
  • Centrifuge
    • The smaller, more efficient version of kraken
  • mOTU and fetchMG
    • Uses single copy marker genes
    • The output contains species-level classifications, but you have to figure out the higher level classifications (Phylum –> Family)
  • MetaPhlAn
    • Uses clade-specific marker genes

7) Assemble

There are new assemblers being developed all the time, with each assembler suited for different needs and data types.

I’ve tested 3 assemblers:

Here are the pros/cons of each assembler:

  • metaSPAdes
    • Pros:
      • Good documentation
      • Larger # of bp’s are incorporated into final assembly compared to IDBA_UD
      • Creates largest contigs
    • Cons:
      • Overall N50 is lower than than of IDBA_UD. However, if you remove shorter contigs (e.g. < 500 bp’s), the N50 will be larger than that of IDBA_UD
  • IDBA_UD
    • Pros:
      • Highest N50 compared to metaSPAdes and MEGAHIT when you keep all contigs and don’t use a read length cutoff
    • Cons:
      • Poor documentation
      • Inflexible input options
  • MEGAHIT
    • Pros:
      • Fastest (< 30 min)
      • Least computationally intensive
    • Cons:
      • Gave the poorest quality assemblies

Recommendations:

  • metaSPAdes = For those with beginner experience who do not lack time and/or computational resources.
  • IDBA_UD = For those with at least intermediate experience who do not lack time and/or computational resources.
  • MEGAHIT = For those with little experience, little time, and/or limited computational resources.

Ultimately, the “best” assembler will differ based on your data set. If you have the time and resources, try out a few. If you don’t have the time and resources, pick one and run with it.

7A) MetaSPAdes

metaSPAdes website

Interleaved

This is when you have 1 fastq file per metagenome.

/path/to/spades-3.12.0 --meta --pe1-12 /path/to/fastq -k 55,75,95 -t number_threads -m number_memory -o /path/to/output
  • /path/to/spades-3.12.0
    • Replace with your path to spades
    • The name of the spades program will change depending on what version you’re using
  • --meta
    • Indicates that you’re doing a metagenome assembly
    • This is the same thing as running the command metaspades.py
  • --pe1-12 /path/to/fastq
    • --pe1 indicates that you’re using paired-end reads
      • The 1 tells the program this is your first sample. This is only ever changed if you run normal SPAdes (not metaSPAdes).
    • -12 indicates that your input file is interleaved (1 input file)
    • Replace /path/to/fastq with the path to your input fastq file
  • -k 55,75,95
    • -k sets the kmer sizes used to assemble your metagenome.
    • I chose the combination 55, 75, 95 because I felt this gave the best compromise between speed and quality.
    • You can replace 55,75,95 with your preferred kmer sizes. Just make sure that you list more than one size because this will greatly improve the quality of your assembly.
  • -t number_threads
    • -t sets the number of threads/processors
    • Replace number_threads with a number (e.g. 4)
  • -m number_memory
    • -m specifies the amount of memory that metaSPAdes will use
    • If this is too low, you’ll run out of memory and receive an error.
    • Replace number_memory with a number (e.g. 4)
  • -o /path/to/output
    • -o specifies the output folder
    • Replace /path/to/output with the path to your output folder

File pairs

This is when you have 2 fastq files per metagenome.

/path/to/spades-3.12.0 --meta --pe1-1 /path/to/R1/fastq --pe1-2 /path/to/R2/fastq -k 55,75,95 -t number_threads -m number_memory -o /path/to/output
  • /path/to/spades-3.12.0
    • Replace with your path to spades
    • The name of the spades program will change depending on what version you’re using
  • --meta
    • Indicates that you’re doing a metagenome assembly
    • This is the same thing as running the command metaspades.py
  • --pe1-1 /path/to/R1/fastq
    • --pe1 Indicates that you’re using paired-end reads
      • The 1 tells the program this is your first sample. This is only ever changed if you run normal SPAdes (not metaSPAdes).
    • -1 specifies your forward read
    • Replace /path/to/R1/fastq with the path to your input forward read fastq file
  • --pe1-2 /path/to/R1/fastq
    • --pe1 indicates that you’re using paired-end reads
      • The 1 tells the program this is your first sample. This is only ever changed if you run normal SPAdes (not metaSPAdes).
    • -2 specifies your reverse read
    • Replace /path/to/R2/fastq with the path to your of your input reverse read fastq file
  • -k 55,75,95
    • -k sets the kmer sizes to uses to assemble your metagenome.
    • I chose the combination 55, 75, 95 because I felt this gave the best compromise between speed and quality.
    • You can replace 55,75,95 with your preferred kmer sizes. Just make sure that you list more than one size because this will greatly improve the quality of your assembly!
  • -t number_threads
    • -t sets the number of threads/processors
    • Replace number_threads with a number (e.g. 4)
  • -m number_memory
    • -m specifies the amount of memory that metaSPAdes will use
    • If this is too low, you’ll run out of memory and receive an error.
    • Replace number_memory with a number (e.g. 4)
  • -o /path/to/output
    • -o specifies the output folder
    • Replace /path/to/output with the path to your output folder

7B) IDBA_UD

IDBA_UD website

7B.1) Convert fastq to fasta

IDBA_UD accepts only 1 fasta file per metagenome.

Interleaved

This is when you have 1 fastq file per metagenome.

/path/to/idba/bin/fq2fa --paired --filter /path/to/fastq /path/to/output.fa
  • /path/to/idba/bin/fq2fa
    • Calls the fq2fa command within IDBA to convert fastq (.fastq or .fq) to fasta (.fasta or .fa)
    • Replace with your path to fq2fa
  • --paired
    • Indicates that your reads are paired-end in one file
  • --filter
    • This removes reads containing unknown bases. This is indicated as ‘N’ in the fastq file and occurs when the sequencing machine can’t confidently identify the base in this position
  • /path/to/fastq
    • Replace with the path to your input fastq file
  • /path/to/output.fa
    • Replace with the path to your output fasta file

File pairs

This is when you have 2 fastq files per metagenome.

/path/to/idba/bin/fq2fa --merge --filter path/to/R1/fastq /path/to/R2/fastq /path/to/output/fasta
  • /path/to/idba/bin/fq2fa
    • Calls the fq2fa command within IDBA to convert fastq (.fastq or .fq) to fasta (.fasta or .fa)
    • Replace with your path to fq2fa
  • --merge
    • Indicates that your reads are paired-end in two files
  • --filter
    • This removes reads containing unknown bases. This is indicated as ‘N’ in the fastq file and occurs when the sequencing machine can’t confidently identify the base in this position
  • /path/to/R1/fastq
    • Specifies input fastq file with forward reads
    • Replace with the path to your forward read fastq file
  • /path/to/R2/fastq
    • Specifies input fastq file with reverse reads
    • Replace with the path to your reverse read fastq file
  • /path/to/output.fa
    • Replace with the path to your output fasta file

7B.2) Assemble

/path/to/idba/bin/idba_ud -r /path/to/output.fa --mink 45 --maxk 125 --step 20 --num_threads number -o /path/to/output/folder
  • /path/to/idba/bin/idba_ud
    • Replace with your path to idba_ud
  • -r /path/to/output.fa
    • -r specifies the input fasta file
    • Replace /path/to/output.fa/ with the path to one of the fasta files you made in the previous step (7A.1)
  • --mink 45 --maxk 125 --step 20
    • --mink sets the minimum kmer size. --mink 45 sets the minimum kmer size to 45.
    • --maxk sets the maximum kmer size. --maxk 125 sets the maximum kmer size to 125.
    • --step sets the increment between your min and max kmers. The increment is used to assemble your metagenome repeatedly over several kmer sizes.
    • The numbers here set the kmer combo to:
      • 45, 65, 85, 105, 125
    • I chose this kmer combo due to the compromise between speed and assembly quality. This combo doesn’t give you the largest N50, but it does result in a assembly where a higher % of input read map to the assembly.
  • --num_threads number
    • --num_threads sets the number of threads/processors
    • Replace number with a number (e.g. 4)
  • -o /path/to/output/folder
    • -o specifies the output folder
    • Replace /path/to/output/folder with the path to your output folder

7C) MegaHit

MEGAHIT website

Interleaved

This is when you have 1 fastq file per metagenome.

/path/to/megahit --12 /path/to/fastq --k-list numbers -o /path/to/output
  • /path/to/megahit
    • Replace with your path to megahit
  • --12 /path/to/fastq
    • --12 indicates your input fastq is interleaved (one input file)
    • Replace /path/to/fastq with the path to your input fastq file
  • --k-list numbers
    • --k-list sets the kmers to use for assembly
    • Replace numbers with your desired kmer sizes. Be sure to use more than 1 kmer size! This will greatly improve the quality of your assembly
  • -o /path/to/output
    • -o indicates output
    • Replace /path/to/output with the path to your output

File Pairs

This is when you have 2 fastq files per metagenome.

/path/to/megahit --1 /path/to/R1/fastq --2 /path/to/R2/fastq --k-list number -o /path/to/output
  • /path/to/megahit
    • Replace with your path to megahit
  • --1 /path/to/R1/fastq
    • --1 specifies forward read
    • Replace /path/to/R1/fastq with the path to your input forward read fastq file
  • --2 /path/to/R2/fastq
    • --2 specifies reverse read
    • Replace /path/to/R2/fastq with the path to your input reverse read fastq file
  • --k-list number
    • --k-list sets the kmers to use for assembly
    • Replace numbers with your desired kmer sizes. Be sure to use more than 1 kmer size! This will greatly improve the quality of your assembly
  • -o /path/to/output
    • -o specifies output
    • Replace /path/to/output with the path to your output

8) Evaluate assembly (part 1)

Use metaquast to evaluate assembly quality.

The output includes an .html file that includes various statistics about your assembly.

The stats to which I like to pay particular attention include:

  • Number of Contigs
    • Smaller = more likely to have larger contigs
  • N50
    • The minimum contig length needed to cover 50% of the genome
    • Similar to median contig length, but weighted towards longer contigs
    • Larger is generally better
  • Total length
    • Longer = more bp’s incorporated into your assembly
path/to/metaquast.py -m min_contig_length -t number /path/to/assembly1/fasta /path/to/assembly2/fasta /path/to/assembly3/fasta -o /path/to/output
  • path/to/metaquast.py
    • Replace with your path to metaquast
  • -m min_contig_length
    • -m sets a cutoff for minimum contig length. Any contigs shorter than this cutoff will not be included in the metaquast output.
    • Replace min_contig_length with a number (e.g. 4)
  • -t number
    • -t sets the number of threads/processors
    • Replace number with a number (e.g. 4)
  • /path/to/assembly1/fasta /path/to/assembly2/fasta /path/to/assembly3/fasta
    • Replace with the paths to your assemblies
    • You can include more than 1 assembly. The output will include all of them in 1 document
  • -o /path/to/output
    • -o specifies output
    • Replace /path/to/output with the path to your output folder

9) Evaluate assembly (part 2)

In addition to the stats output by metaquast, another way to evaluate assembly quality is the % of input reads that map to the final assembly. Not all of the input reads will map to the assembly because there will inevitably be reads that are poor quality, artifacts, or too short (even after quality trimming).

If you plan on binning, you also have to format your assembly into a form that will be usable by binning programs (see step #10).

Use bowtie2 to calculate % of input reads that map to the final assembly.

The steps are very similar to step #4 (host contamination removal)

9A) Separate interleaved input (optional)

This step is only run if your input files were interleaved.
Use the function deinterleave_fastq.sh.

/path/to/deinterleave_fastq.sh < /path/to/input/fastq /path/to/input/R1/fastq /path/to/input/R2/fastq
  • /path/to/deinterleave_fastq.sh
    • Replace with your path to deinterleave_fastq.sh
  • Keep <
  • /path/to/input/fastq
    • Replace with the path to the interleaved fastq file you used as the input for your assembly
  • /path/to/input/R1/fastq
    • Replace with the path to your output forward read fastq file
  • /path/to/input/R2/fastq
    • Replace with the path to your output reverse read fastq file

9B) Build assembly database

Convert your assembly into a database.

/path/to/bowtie2-build –seed number /path/to/assembly/fasta /path/to/output/database
  • /path/to/bowtie2-build
    • Replace /path/to/bowtie2 with your path to bowtie2.
    • -build calls the build command within bowtie2
  • --seed number
    • Sets the seed so your work is reproducible
    • Replace number with a number (e.g. 4)
  • /path/to/assembly/fasta
    • Replace with the path to your assembly fasta output
  • /path/to/output/database
    • Replace with the path to your output database

9C) Map input reads to assembly

/path/to/bowtie2 --sensitive-local -p number_processors --seed number  -x /path/to/database -1 /path/to/R1/fastq -2 /path/to/R2/fastq -S /path/to/output.sam

Note: The output of this to terminal will include stats about % mapped. These stats are not saved in the output file, and you will want these for your records.

  • path/to/bowtie2
    • Replace with your path to bowtie2
  • --sensitive-local
    • This does a local alignment that’s sensitive when matching so you get more matched reads
    • This setting is slower than other settings. If you want speed, use a less sensitive local setting
    • I use very sensitive because my main concern is removing all the host contamination I can to improve the quality of my assembly downstream. For me, speed is a minor concern
    • If you’re mapping short reads to a big genome, do not do a global alignment!
  • -p number_processors
    • -p sets the number of processors
    • Replace number_processors with a number (e.g. 4)
  • --seed number
    • --seed sets the seed so that your work is reproducible.
    • Replace number with a number (e.g. 4)
  • -x /path/to/database
    • -x specifies your assembly database
    • Replace /path/to/database with the path to the database you created in the previous step (9B)
  • -1 /path/to/R1/fastq.gz
    • -1 specifies your input forward read fastq file
    • Replace /path/to/R1/fastq.gz with the path to your input forward read.
  • -2 /path/to/R2/fastq.gz
    • -2 specifies your input reverse read fastq file
    • Replace /path/to/R2/fastq.gz with the path to your input reverse read
  • -S /path/to/output.sam
    • -S specifies your output file, which is in SAM format (gibberish to the normal human eye)
    • Replace /path/to/R2/fastq.gz with the path to your output file

9D) Index your assembly

This step is for organizing your files to prepare for downstream processing.

/path/to/samtools faidx /path/to/assembly/fasta
  • /path/to/samtools faidx
    • Replace /path/to/samtools with your path to samtools
    • faidx calls the faidx command within samtools.
  • /path/to/assembly/fasta
    • Replace with path to your assembly output fasta file

The output is a .fai file in the same folder as your input file

9E) Create BAM file

A SAM file is huuuuuge, so convert it into a BAM file to save room.

The remaining steps will use the program samtools.

/path/to/samtools import /path/to/assembly/fasta.fai /path/to/output.sam /path/to/output.bam
  • /path/to/samtools import
    • Replace /path/to/samtools with your path to samtools
    • import calls the import command within samtools
  • /path/to/assembly.fasta.fai
    • Replace with the path to the .fai file you created in the previous step (9D)
    • This should be located in the same folder as your input file in step 9D (unless you explicitly specified a different output location)
  • /path/to/output.sam
    • Replace with path to the SAM file you created in step 9C
  • /path/to/output.bam
    • Replace with path to your output BAM file

9F) Sort BAM file

This step is for organizing your files to prepare for downstream processing.

/path/to/samtools sort -m number_memory -@ number_threads /path/to/output/bam -o /path/to/sorted.output.bam
  • /path/to/samtools sort
    • Replace /path/to/samtools with your path to samtools
    • sort calls the sort command within samtools
  • -m number_memory
    • -m sets the amount of memory per thread/processor
    • Replace number_memory with a number (e.g. 4)
  • -@ number_threads
    • -@ sets the number of threads/processors
    • Replace number_threads with a number (e.g. 4)
  • /path/to/output.bam
    • Replace with the path to the BAM file you created in the previous step (9E)
  • /path/to/sorted/output/bam
    • Replace with the path to your output file

9G) Index BAM file

This step is for organizing your files to prepare for downstream processing.

/path/to/samtools index -@ number /path/to/sorted/output/bam
  • /path/to/samtools index
    • Replace /path/to/samtools with your path to samtools
    • index calls the index command within samtools
  • -@ number
    • -@ sets the number of threads/processors
    • Replace number with a number (e.g. 4)
  • /path/to/sorted/output/bam
    • Replace with the path to the BAM file in made in the previous step (9F)

9H) Generate assembly stats

Time to calculate some stats about your assembly.

/path/to/samtools idxstats /path/to/sorted/output/bam > /path/to/idxstats.txt
  • /path/to/samtools idxstats
    • Replace /path/to/samtools with your path to samtools
    • idxstats calls the idxstats command within samtools
  • /path/to/sorted/output/bam
    • Replace with the path to the BAM file you made in the previous step (9G)
  • > /path/to/idxstats.txt
    • > saves the output in a file
    • Replace /path/to/idxstats.txt with the path to your output file

9I) Generate count table

Use get_count_table.py to generate more stats about your assembly.

python get_count_table.py /path/to/idxstats.txt > /path/to/counts.txt
  • python get_count_table.py
    • Runs the function
  • /path/to/idxstats.txt
    • Replace with path to the idxstats.txt file you created in the previous step (9H)
  • > /path/to/counts.txt
    • > saves the output in a file
    • Replace path/to/counts.txt with the path to your output file

10) Binning (optional)

This step takes your assembly and separates contigs into different “bins.” Each bin represents a different taxonomic group.

There are many software to accomplish this, including:

Here, I’ll only talk about MetaBAT because that’s the only program I’ve used so far.

Please refer to the MetaBAT Binning Tutorial for more information about this step.

10A) Summarize BAM depth

Use MetaBAT to generates more stats about your assembly contig depth, which we’ll use later in binning.

/path/to/metabat/jgi_summarize_bam_contig_depths --outputDepth /path/to/output/depth.txt /path/to/sorted/output/bam
  • path/to/metabat/jgi_summarize_bam_contig_depths
    • Replace with your path to jgi_summarize_bam_contig_depths
  • --outputDepth /path/to/output/depth.txt
    • --outputDepth calculates contig depths
    • Replace /path/to/output/depth.txt with the path to your output file
  • /path/to/sorted/output/bam
    • Replace with the path to the BAM file you made in step 9G

10B) Bin Contigs

Please refer to MetaBAT for parameters to set for your binning. There are lots of options and it’s up to you to select what’s best for your purposes. I don’t include those parameters in the code displayed below.

/path/to/metabat2 -i /path/to/assembly/fasta -a /path/to/output/depth.txt -o /path/to/output/folder -v --seed number
  • /path/to/metabat2
    • Replace with your path to metabat2
    • metabat2 may be different depending on what version you use
  • -i /path/to/assembly/fasta
    • -i specifies input file
    • Replace /path/to/assembly/fasta with the path to your assembly fasta file
  • -a /path/to/output/depth.txt
    • -a specifies a file that includes base coverage depth for each contig in your assembly
    • Replace /path/to/output/depth.txt with the path to the output file you made in the previous step (10A)
  • -o /path/to/output/folder
    • -o specifies output folder
    • Replace /path/to/output/folder with the path to your output folder
  • -v
    • Sets output to “verbose” so it prints out messages as it’s running and you can follow along with each step
  • --seed number
    • --seed sets the seed so that your work is reproducible
    • Replace number with a number (e.g. 4)

10C) Evaluate Bins

Use CheckM to evaluate the quality of your bins.

/path/to/checkm lineage_wf -t number -x fa -f /path/to/output.txt /path/to/bin/folder /path/to/output/folder
  • /path/to/checkm lineage_wf
    • Replace /path/to/checkm with your path to checkm
    • lineage_wf calls the lineage_wf command within checkm
  • -t number
    • -t sets the number of threads/processors
    • Replace number with a number (e.g. 4)
  • -x fa
    • -x specifies the form of your input files
    • fa indicates your input files are in .fasta form. You can change this depending on your input file format.
  • -f /path/to/output.txt
    • -f specifies output file
    • Replace /path/to/output.txt with the path to your output file
  • /path/to/bin/folder
    • Specifies location of your output bins
    • This is the same as step 10B’s /path/to/output/folder
    • Replace with the path to your output bins
  • /path/to/output/folder
    • Specifies location of CheckM output
    • Replace with path to your CheckM output

11) Predict ORFs

Use prodigal to predict open reading frames (ORFs).
The guide here on prodigal is not comprehensive, so please consult the prodigal website for information on additional parameters.

You can predict open reading frames in your assembly (output from step 7) or in your bins (output from step 10) depending on your goals.

path/to/prodigal -a path/to/output.faa -d path/to/output.ffn -i /path/to/assembly/fasta -o /path/to/output.gbk -p meta -s /path/to/output/genes.gff
  • path/to/prodigal
    • Replace with your path to prodigal
  • -a path/to/output.faa
    • -a specifies your output .faa file. This file contains amino acids
    • Replace path/to/output.faa with the path to your output file
  • -d path/to/output.ffn
    • -d specifies your output .ffn file. This file contains nucleotides
    • Replace path/to/output.ffn with the path to your output file
  • -i /path/to/assembly/fasta
    • -i specifies your input file
    • Replace /path/to/assembly/fasta with the path to your assembly fasta file
  • -o /path/to/output.gbk
    • -o specifies the gene coordinates output file. The default file format is .gbk (Genbank-like format), but there are also other formats.
    • Replace /path/to/output.gbk with the path to your output file
  • -p meta
    • ‘-p’ specifies prediction mode
    • meta indicates you’re analyzing metagenomes.
  • -s /path/to/output/genes.gff
    • -s specifies the GFF3 output file. This contains all potential genes with scores.
    • Replace /path/to/output/genes.gff with the path to your output file

Predict genes

The genes you want to target will completely depend on your question. Databases that I’ve used are:

COG

Clusters of Orthologous Groups (COG) website

You’ll use rpsblast to predict COGs. I encourage you to explore the rpsblast website to determine what parameters you’ll need because this will change depending on your goals and plans for downstream analyses. Below are the parameters and steps I use.

path/to/rpsblast -db /path/to/COG_database -query /path/to/prodigal.faa -out /path/to/output.txt -evalue 1e-05 -outfmt '6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qcovs' -num_threads number
  • path/to/rpsblast
    • Replace with your path to rpsblast
  • -db /path/to/COG_database
    • -db specifies the database
    • Replace /path/to/COG_database with your path to the COG database
  • -query /path/to/prodigal.faa
    • -query specifies input file.
    • Replace /path/to/prodigal.faa with the path to your prodigal .faa output file
  • -out /path/to/output.txt
    • -out specifies output file
    • Replace /path/to/output.txt with the path to your output file
  • -evalue 1e-05
    • -evalue sets the E-value cutoff
    • This should always be ≤ 1e-05
  • -outfmt '6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qcovs'
    • -outfmt specifies format/columns of output file
    • I use format 6 so I can assign COG categories to query protein sequences using cdd2cog.pl (next step)
    • You can replace '6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qcovs' with your desired output format
  • -num_threads number
    • -num_threads sets the number of threads/processors
    • Replace number with a number (e.g. 4)

After running rpsblast, you’ll want to assign COG categories to your query protein sequences. I use cdd2cog.pl.

Next, I prepare the data for stats by creating a table where rows = COG ID and columns = sample.
For reference, this is the script I use to create the table. This script isn’t intended to be downloaded and used because it’s written specifically for my sample nomenclature, but I’ve linked it here as reference if you write your own script.

To normalize your COG counts, you can normalize to universal single copy COGs. I’m working on a script to do this and will update this tutorial whenever I finish it.
Here are papers that have used this normalization method:

dbCAN

DataBase for Automated Carbohydrate-Active Enzyme Annotation (dbCAN) website

You can use the online dbCAN server or download dbCAN2.

I used an older iteration of dbCAN and describe my steps below.

You’ll use hmmer to predict carbohydrate-active enzymes (CAZymes). I encourage you to explore the hmmer website to determine what parameters you will need as this will change depending on your goals and plans for downstream analyses. Below are the parameters and steps I use.

/path/to/hmmer/hmmscan --cpu number --tblout /path/to/output.tbl --domtblout /path/to/output/domain.tbl -o /path/to/output.txt -E 1e-05 /path/to/dbCAN-HMMdb-V8.txt /path/to/input/prodigal.faa
  • /path/to/hmmer/hmmscan
    • Replace with your path to hmmscan
  • --cpu number
    • --cpu specifies number of processors
    • Replace number with a number (e.g. 4)
  • --tblout /path/to/output.tbl
    • --tblout specifies the output table. This summarizes the per-target output.
    • Replace /path/to/output.tbl with the path to your output table
  • --domtblout /path/to/output/domain.tbl
    • --domtblout specifies the output domain table. This summarizes the per-domain output.
    • Replace /path/to/output/domain.tbl with the path to your output table
  • -o /path/to/output.txt
    • -o specifies output file. This file begins with a header summarizing your parameters. The next section is a list of sequence top hits.
    • Replace /path/to/output.txt with the path to your output file
  • -E 1e-05
    • -E sets the E-value cutoff.
    • This should always be ≤ 1e-05
  • /path/to/dbCAN-HMMdb-V8.txt
    • Replace with your path to the dbCAN hmmer database. The file name will differ if you’re using a different version of dbCAN
  • /path/to/input/prodigal.faa
    • Specifies input .faa file (output from prodigal)
    • Replace /path/to/input/prodigal.faa with the path to your prodigal .faa file

To make sense of the dbCAN output, I parse it to create a table that lists each ORF with a predicted CAZyme and its top CAZyme hit, as well as a count of each CAZyme class. Here are some scripts for this (you only need to use one of them to create the tables):

To prepare the data for stats, I create two tables:

  1. A count table where rows = CAZyme families and columns = samples
  2. A classification table where rows = CAZyme families and columns = class-level and family-level CAZyme classifications. These are formatted so they can be analyzed in R with the phyloseq package.

For reference, this is the script I use to create the two tables. This script isn’t intended to be downloaded and used because it’s written specifically for my sample nomenclature, but I’ve linked it here as reference if you write your own script.

eCAMI

Simultaneous Classification and Motif Identification for enzyme/CAZyme annotation (eCAMI) website
This program classifies to carbohydrate-active enzymes (CAZymes) and enzyme commission number.
I use eCAMI primarily for enzyme commission number classification. For CAZyme classification, I prefer to use dbCAN because it’s older and more well known. eCAMI and dbCAN are both written by the Yin Lab.

Reconstruct metabolic pathways

Use KAAS - KEGG Atomatic Annotation Serve to predict metabolic pathways.

Parameters for your annotation will vary depending on your needs, so please refer to the KEGG Website.
Here are the parameters that I use:

  • Search program = GHOSTX
  • GENES data set = for Prokaryotes
  • Assignment method = SBH (single-directional best hit)

Estimate average genome size

MicrobeCensus website