This document describes the metagenomic analysis pipeline by Edna Chiang.
This tutorial assumes you have a basic understanding of metagenomics and linux command line. It provides recommendations and guidance in choosing and using software to perform your own metagenomic analysis on paired-end data.
While many of the steps and concepts are translatable to single-end data, many of the software are incompatible with single-end data.
This guide doesn’t discuss appropriate experimental design, DNA preparation, or sequencing methods.
To learn more about metagenomics, there are plentiful reviews to read. Here are a few to consider:
To learn more about command line, there are many online and on-campus resources. Here are a few to consider:
Before beginning any analysis, you must first define your question. How you analyze your data is completely dependent upon your question.
Some ideas to consider:
Use FastQC to evaluate the quality of your raw fastq files.
FastQC will output a .html file with multiple plots.
For additional resources in understanding FastQC outputs, refer to the following webpages:
/path/to/fastqc /path/to/fastq -o /path/to/output
/path/to/fastqc
/path/to/fastq
-o /path/to/output
-o Specifies output folder/path/to/output with the path to your output folderIf you find that fastqc isn’t running and you get the following error:
Exception in thread "main" java.awt.HeadlessException:
No X11 DISPLAY variable was set, but this program performed an operation which requires it.
Try adding & (a space followed by an ampersand) at the very end of the command.
The output plots to which I pay most attention are:
Use Trimmomatic to remove adapters, primers, and over-represented sequences, and to trim poor quality basepairs.
java -jar /path/to/trimmomatic-0.36.jar PE -threads number -trimlog /path/to/trimmomatic/logfile /path/to/R1/fastq.gz /path/to/R2/fastq.gz -baseout /path/to/output.gz ILLUMINACLIP:/path/to/overrep_seqs.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
java -jar
/path/to/trimmomatic-0.36.jar
PE
-threads number
number with a number (e.g. 4)-trimlog /path/to/trimmomatic/logfile.txt
-trimlog specifies the location and name of your output log file/path/to/trimmomatic/logfile.txt with the path to your log file/path/to/R1/fastq.gz
/path/to/R2/fastq.gz
-baseout /path/to/output.gz
-baseout specifies the location and names of the output files:output is the prefix of all output files.gz outputs everything as gzipped fastq filesoutput_1P.gz and output_2P.gz/path/to/output.gz with the path to your output filesILLUMINACLIP:/path/to/overrep_seqs.fa:2:30:10
ILLUMINACLIP: is the command to remove specific sequences/1/2/path/to/overrep_seqs.fa with the path to Trimmomatic’s provided adapter files or your own custom file:2:30:10 are the default parameters for adapter trimming. Refer to the Trimmomatic website for more info.LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
I’ve also tested adapter removal with cutadapt and quality trimming with sickle.
I chose Trimmomatic over these two software simply because it’s more efficient to use one program instead of two.
If you study host-associated microbiomes, host DNA contamination can greatly hinder metagenomic assembly by increasing required time and computational power and by decreasing assembly quality.
If you have access to a genome of your host organism, I strongly recommend that you remove host contamination!
I follow this tutorial for my host contamination removal. Below is the long way to remove host contamination, but there is a much quicker way that’s described in the linked tutorial.
The following is an example of how I download the genome for my host organism (13-lined ground squirrel, Ictidomys tridecemlineatus).
First, download the host genome:
curl -O ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/236/235/GCA_000236235.1_SpeTri2.0/GCA_000236235.1_SpeTri2.0_assembly_structure/Primary_Assembly/unplaced_scaffolds/FASTA/unplaced.scaf.fna.gz
Open the downloaded file:
gunzip unplaced.scaf.fna.gz
Rename the genome to something that makes sense:
mv unplaced.scaf.fna host.fna
Use bowtie2 to create a database from the host genome.
/path/to/bowtie2-build --seed number /path/to/host.fna /path/to/database
/path/to/bowtie2-build
/path/to/bowtie2 with your path to bowtie2-build because calls the build command within bowtie2--seed number
number with a number (e.g. 4)/path/to/host.fna
/path/to/database
We’ll use bowtie2 again.
/path/to/bowtie2 --very-sensitive-local -p number --seed number -x /path/to/database -1 /path/to/R1/fastq.gz -2 /path/to/R2/fastq.gz -S /path/to/output.sam
Note: The output of this to terminal will include stats about % mapped. These stats are not saved in the output file and you may want these for your records.
path/to/bowtie2
--very-sensitive-local
-p number
-p Sets the number of processorsnumber with a number (e.g. 4)--seed number
--seed sets the seed so that your work is reproduciblenumber with a number (e.g. 4)-x /path/to/database
-x specifies your host database/path/to/database with the path to your host database-1 /path/to/R1/fastq.gz
-1 specifies your forward read fastq file/path/to/R1/fastq.gz with the path to your forward read fastq fileoutput_1P.gz-2 /path/to/R2/fastq.gz
-2 specifies your reverse read fastq file/path/to/R2/fastq.gz with the path to your reverse read fastq fileoutput_2P.gz-S /path/to/output.sam
-S specifies your output file, which is in SAM format (gibberish to the normal human eye)/path/to/output.sam with the path to your output SAM fileA SAM file is huuuuuge, so convert it into a BAM file to save room.
The remaining steps will use the program samtools.
/path/to/samtools view -@ numbers -bS /path/to/output.sam > /path/to/output.bam
/path/to/samtools view
/path/to/samtools/ with your path to samtoolsview calls the view command within samtools-@ number
-@ sets number of threads/processorsnumber with a number (e.g. 4)-bS
/path/to/output.sam > /path/to/output.bam
/path/to/output.sam with the path to the output sam file you made in the previous step (4C)>/path/to/output.bam with the path to your output bam file/path/to/samtools view -@ number -b -f 12 -F 256 /path/to/output.bam > /path/to/unmap.bam
/path/to/samtools view
/path/to/samtools/ with your path to samtoolsview calls the view command within samtools-@ number
-@ sets number of threads/processorsnumber with a number (e.g. 4)-b
-f 12
-F 256
/path/to/output.bam > /path/to/unmap.bam
/path/to/output.bam with the path to the output BAM file you made in the previous step (4D)>/path/to/unmap.bam with the path to your output BAM file/path/to/samtools sort -n /path/to/unmap.bam -o /path/to/unmap.sorted.bam
/path/to/samtools sort
/path/to/samtools/ with your path to samtoolssort because calls the sort command within samtools-n
/path/to/unmap.bam
-o /path/to/unmap.sorted.bam
-o specifies outputpath/to/unmap.sorted.bam with the path to your output BAM file/path/to/samtools bam2fq /path/to/unmap.sorted.bam > /path/to/nohost.fastq
/path/to/samtools bam2fq
/path/to/samtools with your path to samtoolsbam2fq calls the bam2fq command within samtools/path/to/unmap.sorted.bam
/path/to/nohost.fastq
--un-conc to directly pull out unmapped reads if you don’t need the flexibility of the long samtools pipeline. Refer to this tutorial for directions.This is the same as Step 2.
Use FastQC to evaluate the quality of your raw fastq files.
FastQC will output a .html file with multiple plots.
For additional resources in understanding FastQC outputs, refer to the following webpages:
/path/to/fastqc /path/to/fastq -o /path/to/output
/path/to/fastqc
/path/to/fastq
-o /path/to/output
-o specifies output/path/to/output with the path to your output file(s)If you find that fastqc isn’t running and you get the following error:
Exception in thread "main" java.awt.HeadlessException:
No X11 DISPLAY variable was set, but this program performed an operation which requires it.
Try adding & (space followed by an ampersand) at the very end of the command.
This step gives you a quick look at what taxa comprise your metagenome. You can think of this as a preliminary option before binning.
My profiler of choice is Kaiju because it gave me the highest percentage of classified reads. However, it does require quite a bit of RAM. If you are limited by computational power, I suggest using Centrifuge. You can read more about other alternatives by scrolling down to the bottom of this step.
/path/to/kaiju -t /path/to/kaiju/database/nodes.dmp -f /path/to/kaiju/database/file.fmi -i /path/to/input/file.fastq -o /path/to/output/file/kaiju.output -z number -E 1e-05 -v
/path/to/kaiju
-t /path/to/kaiju/database/nodes.dmp
-t specifies the NCBI taxonomy nodes.dmp file/path/to/kaiju/database/nodes.dmp with your path to nodes.dmp-f /path/to/kaiju/database/file.fmi
-f specifies the database file/path/to/kaiju/database/file.fmi with your path to the NCBI taxonomy .fmi file-i /path/to/input/file.fastq
-i specifies input. This can be .fastq, .fasta, or .ffn formats./path/to/input/file.fastq with the path to your input file-o /path/to/output/file/kaiju.output
-o saves the kaiju output. If you exclude this flag, kaiju will print the output to the terminal rather than save it in a file./path/to/output/file/kaiju.output with the path to your output file-z number
-z sets the number of threads/processorsnumber with a number (e.g. 4)-E 1e-05
-E sets the E-value cutoff-v
After you run Kaiju, you’ll want to convert the kaiju output file(s) into a summary table using kaiju’s kaiju2table function:
/path/to/kaiju2table -t /path/to/kaiju/database/nodes.dmp -n /path/to/kaiju/database/names.dmp -r taxonomic_level -o /path/to/output/table.tsv /path/to/output/file/kaiju.output -l taxonomic,levels,separated,by,commas
/path/to/kaiju2table
kaiju2table-t /path/to/kaiju/database/nodes.dmp
-t specifies the NCBI taxonomy nodes.dmp file/path/to/kaiju/database/nodes.dmp with your path to nodes.dmp-n /path/to/kaiju/database/names.dmp
-n specifies the NCBI taxonomy names.dmp file/path/to/kaiju/database/names.dmp with your path to names.dmp-r taxonomic_level
-r specifies the taxonomic level you want in the output tabletaxonomic_level with your desired taxonomic levelkaiju2table for each level.-o /path/to/output/table.tsv
-o specifies your output file/path/to/output/table.tsv with the path to your output classification table/path/to/output/file/kaiju.output
-l taxonomic,levels,separated,by,commas
-l specifies the taxonomic levels to be printed in your output table-l superkingdom,phylum,class,order,family,genus,speciesThere are new assemblers being developed all the time, with each assembler suited for different needs and data types.
I’ve tested 3 assemblers:
Here are the pros/cons of each assembler:
Recommendations:
Ultimately, the “best” assembler will differ based on your data set. If you have the time and resources, try out a few. If you don’t have the time and resources, pick one and run with it.
This is when you have 1 fastq file per metagenome.
/path/to/spades-3.12.0 --meta --pe1-12 /path/to/fastq -k 55,75,95 -t number_threads -m number_memory -o /path/to/output
/path/to/spades-3.12.0
--meta
metaspades.py--pe1-12 /path/to/fastq
--pe1 indicates that you’re using paired-end reads
1 tells the program this is your first sample. This is only ever changed if you run normal SPAdes (not metaSPAdes).-12 indicates that your input file is interleaved (1 input file)/path/to/fastq with the path to your input fastq file-k 55,75,95
-k sets the kmer sizes used to assemble your metagenome.55,75,95 with your preferred kmer sizes. Just make sure that you list more than one size because this will greatly improve the quality of your assembly.-t number_threads
-t sets the number of threads/processorsnumber_threads with a number (e.g. 4)-m number_memory
-m specifies the amount of memory that metaSPAdes will usenumber_memory with a number (e.g. 4)-o /path/to/output
-o specifies the output folder/path/to/output with the path to your output folderThis is when you have 2 fastq files per metagenome.
/path/to/spades-3.12.0 --meta --pe1-1 /path/to/R1/fastq --pe1-2 /path/to/R2/fastq -k 55,75,95 -t number_threads -m number_memory -o /path/to/output
/path/to/spades-3.12.0
--meta
metaspades.py--pe1-1 /path/to/R1/fastq
--pe1 Indicates that you’re using paired-end reads
1 tells the program this is your first sample. This is only ever changed if you run normal SPAdes (not metaSPAdes).-1 specifies your forward read/path/to/R1/fastq with the path to your input forward read fastq file--pe1-2 /path/to/R1/fastq
--pe1 indicates that you’re using paired-end reads
1 tells the program this is your first sample. This is only ever changed if you run normal SPAdes (not metaSPAdes).-2 specifies your reverse read/path/to/R2/fastq with the path to your of your input reverse read fastq file-k 55,75,95
-k sets the kmer sizes to uses to assemble your metagenome.55,75,95 with your preferred kmer sizes. Just make sure that you list more than one size because this will greatly improve the quality of your assembly!-t number_threads
-t sets the number of threads/processorsnumber_threads with a number (e.g. 4)-m number_memory
-m specifies the amount of memory that metaSPAdes will usenumber_memory with a number (e.g. 4)-o /path/to/output
-o specifies the output folder/path/to/output with the path to your output folderIDBA_UD accepts only 1 fasta file per metagenome.
This is when you have 1 fastq file per metagenome.
/path/to/idba/bin/fq2fa --paired --filter /path/to/fastq /path/to/output.fa
/path/to/idba/bin/fq2fa
--paired
--filter
/path/to/fastq
/path/to/output.fa
This is when you have 2 fastq files per metagenome.
/path/to/idba/bin/fq2fa --merge --filter path/to/R1/fastq /path/to/R2/fastq /path/to/output/fasta
/path/to/idba/bin/fq2fa
--merge
--filter
/path/to/R1/fastq
/path/to/R2/fastq
/path/to/output.fa
/path/to/idba/bin/idba_ud -r /path/to/output.fa --mink 45 --maxk 125 --step 20 --num_threads number -o /path/to/output/folder
/path/to/idba/bin/idba_ud
-r /path/to/output.fa
-r specifies the input fasta file/path/to/output.fa/ with the path to one of the fasta files you made in the previous step (7A.1)--mink 45 --maxk 125 --step 20
--mink sets the minimum kmer size. --mink 45 sets the minimum kmer size to 45.--maxk sets the maximum kmer size. --maxk 125 sets the maximum kmer size to 125.--step sets the increment between your min and max kmers. The increment is used to assemble your metagenome repeatedly over several kmer sizes.--num_threads number
--num_threads sets the number of threads/processorsnumber with a number (e.g. 4)-o /path/to/output/folder
-o specifies the output folder/path/to/output/folder with the path to your output folderThis is when you have 1 fastq file per metagenome.
/path/to/megahit --12 /path/to/fastq --k-list numbers -o /path/to/output
/path/to/megahit
--12 /path/to/fastq
--12 indicates your input fastq is interleaved (one input file)/path/to/fastq with the path to your input fastq file--k-list numbers
--k-list sets the kmers to use for assemblynumbers with your desired kmer sizes. Be sure to use more than 1 kmer size! This will greatly improve the quality of your assembly-o /path/to/output
-o indicates output/path/to/output with the path to your outputThis is when you have 2 fastq files per metagenome.
/path/to/megahit --1 /path/to/R1/fastq --2 /path/to/R2/fastq --k-list number -o /path/to/output
/path/to/megahit
--1 /path/to/R1/fastq
--1 specifies forward read/path/to/R1/fastq with the path to your input forward read fastq file--2 /path/to/R2/fastq
--2 specifies reverse read/path/to/R2/fastq with the path to your input reverse read fastq file--k-list number
--k-list sets the kmers to use for assemblynumbers with your desired kmer sizes. Be sure to use more than 1 kmer size! This will greatly improve the quality of your assembly-o /path/to/output
-o specifies output/path/to/output with the path to your outputUse metaquast to evaluate assembly quality.
The output includes an .html file that includes various statistics about your assembly.
The stats to which I like to pay particular attention include:
path/to/metaquast.py -m min_contig_length -t number /path/to/assembly1/fasta /path/to/assembly2/fasta /path/to/assembly3/fasta -o /path/to/output
path/to/metaquast.py
-m min_contig_length
-m sets a cutoff for minimum contig length. Any contigs shorter than this cutoff will not be included in the metaquast output.min_contig_length with a number (e.g. 4)-t number
-t sets the number of threads/processorsnumber with a number (e.g. 4)/path/to/assembly1/fasta /path/to/assembly2/fasta /path/to/assembly3/fasta
-o /path/to/output
-o specifies output/path/to/output with the path to your output folderIn addition to the stats output by metaquast, another way to evaluate assembly quality is the % of input reads that map to the final assembly. Not all of the input reads will map to the assembly because there will inevitably be reads that are poor quality, artifacts, or too short (even after quality trimming).
If you plan on binning, you also have to format your assembly into a form that will be usable by binning programs (see step #10).
Use bowtie2 to calculate % of input reads that map to the final assembly.
The steps are very similar to step #4 (host contamination removal)
This step is only run if your input files were interleaved.
Use the function deinterleave_fastq.sh.
/path/to/deinterleave_fastq.sh < /path/to/input/fastq /path/to/input/R1/fastq /path/to/input/R2/fastq
/path/to/deinterleave_fastq.sh
</path/to/input/fastq
/path/to/input/R1/fastq
/path/to/input/R2/fastq
Convert your assembly into a database.
/path/to/bowtie2-build –seed number /path/to/assembly/fasta /path/to/output/database
/path/to/bowtie2-build
/path/to/bowtie2 with your path to bowtie2.-build calls the build command within bowtie2--seed number
number with a number (e.g. 4)/path/to/assembly/fasta
/path/to/output/database
/path/to/bowtie2 --sensitive-local -p number_processors --seed number -x /path/to/database -1 /path/to/R1/fastq -2 /path/to/R2/fastq -S /path/to/output.sam
Note: The output of this to terminal will include stats about % mapped. These stats are not saved in the output file, and you will want these for your records.
path/to/bowtie2
--sensitive-local
-p number_processors
-p sets the number of processors--seed number
--seed sets the seed so that your work is reproducible.number with a number (e.g. 4)-x /path/to/database
-x specifies your assembly database/path/to/database with the path to the database you created in the previous step (9B)-1 /path/to/R1/fastq.gz
-1 specifies your input forward read fastq file/path/to/R1/fastq.gz with the path to your input forward read.-2 /path/to/R2/fastq.gz
-2 specifies your input reverse read fastq file/path/to/R2/fastq.gz with the path to your input reverse read-S /path/to/output.sam
-S specifies your output file, which is in SAM format (gibberish to the normal human eye)/path/to/R2/fastq.gz with the path to your output fileThis step is for organizing your files to prepare for downstream processing.
/path/to/samtools faidx /path/to/assembly/fasta
/path/to/samtools faidx
/path/to/samtools with your path to samtoolsfaidx calls the faidx command within samtools./path/to/assembly/fasta
The output is a .fai file in the same folder as your input file
A SAM file is huuuuuge, so convert it into a BAM file to save room.
The remaining steps will use the program samtools.
/path/to/samtools import /path/to/assembly/fasta.fai /path/to/output.sam /path/to/output.bam
/path/to/samtools import
/path/to/samtools with your path to samtoolsimport calls the import command within samtools/path/to/assembly.fasta.fai
/path/to/output.sam
/path/to/output.bam
This step is for organizing your files to prepare for downstream processing.
/path/to/samtools sort -m number_memory -@ number_threads /path/to/output/bam -o /path/to/sorted.output.bam
/path/to/samtools sort
/path/to/samtools with your path to samtoolssort calls the sort command within samtools-m number_memory
-m sets the amount of memory per thread/processornumber_memory with a number (e.g. 4)-@ number_threads
-@ sets the number of threads/processorsnumber_threads with a number (e.g. 4)/path/to/output.bam
/path/to/sorted/output/bam
This step is for organizing your files to prepare for downstream processing.
/path/to/samtools index -@ number /path/to/sorted/output/bam
/path/to/samtools index
/path/to/samtools with your path to samtoolsindex calls the index command within samtools-@ number
-@ sets the number of threads/processorsnumber with a number (e.g. 4)/path/to/sorted/output/bam
Time to calculate some stats about your assembly.
/path/to/samtools idxstats /path/to/sorted/output/bam > /path/to/idxstats.txt
/path/to/samtools idxstats
/path/to/samtools with your path to samtoolsidxstats calls the idxstats command within samtools/path/to/sorted/output/bam
> /path/to/idxstats.txt
> saves the output in a file/path/to/idxstats.txt with the path to your output fileUse get_count_table.py to generate more stats about your assembly.
python get_count_table.py /path/to/idxstats.txt > /path/to/counts.txt
python get_count_table.py
/path/to/idxstats.txt
> /path/to/counts.txt
> saves the output in a filepath/to/counts.txt with the path to your output fileThis step takes your assembly and separates contigs into different “bins.” Each bin represents a different taxonomic group.
There are many software to accomplish this, including:
Here, I’ll only talk about MetaBAT because that’s the only program I’ve used so far.
Please refer to the MetaBAT Binning Tutorial for more information about this step.
Use MetaBAT to generates more stats about your assembly contig depth, which we’ll use later in binning.
/path/to/metabat/jgi_summarize_bam_contig_depths --outputDepth /path/to/output/depth.txt /path/to/sorted/output/bam
path/to/metabat/jgi_summarize_bam_contig_depths
--outputDepth /path/to/output/depth.txt
--outputDepth calculates contig depths/path/to/output/depth.txt with the path to your output file/path/to/sorted/output/bam
Please refer to MetaBAT for parameters to set for your binning. There are lots of options and it’s up to you to select what’s best for your purposes. I don’t include those parameters in the code displayed below.
/path/to/metabat2 -i /path/to/assembly/fasta -a /path/to/output/depth.txt -o /path/to/output/folder -v --seed number
/path/to/metabat2
metabat2 may be different depending on what version you use-i /path/to/assembly/fasta
-i specifies input file/path/to/assembly/fasta with the path to your assembly fasta file-a /path/to/output/depth.txt
-a specifies a file that includes base coverage depth for each contig in your assembly/path/to/output/depth.txt with the path to the output file you made in the previous step (10A)-o /path/to/output/folder
-o specifies output folder/path/to/output/folder with the path to your output folder-v
--seed number
--seed sets the seed so that your work is reproduciblenumber with a number (e.g. 4)Use CheckM to evaluate the quality of your bins.
/path/to/checkm lineage_wf -t number -x fa -f /path/to/output.txt /path/to/bin/folder /path/to/output/folder
/path/to/checkm lineage_wf
/path/to/checkm with your path to checkmlineage_wf calls the lineage_wf command within checkm-t number
-t sets the number of threads/processorsnumber with a number (e.g. 4)-x fa
-x specifies the form of your input filesfa indicates your input files are in .fasta form. You can change this depending on your input file format.-f /path/to/output.txt
-f specifies output file/path/to/output.txt with the path to your output file/path/to/bin/folder
/path/to/output/folder/path/to/output/folder
Use prodigal to predict open reading frames (ORFs).
The guide here on prodigal is not comprehensive, so please consult the prodigal website for information on additional parameters.
You can predict open reading frames in your assembly (output from step 7) or in your bins (output from step 10) depending on your goals.
path/to/prodigal -a path/to/output.faa -d path/to/output.ffn -i /path/to/assembly/fasta -o /path/to/output.gbk -p meta -s /path/to/output/genes.gff
path/to/prodigal
-a path/to/output.faa
-a specifies your output .faa file. This file contains amino acidspath/to/output.faa with the path to your output file-d path/to/output.ffn
-d specifies your output .ffn file. This file contains nucleotidespath/to/output.ffn with the path to your output file-i /path/to/assembly/fasta
-i specifies your input file/path/to/assembly/fasta with the path to your assembly fasta file-o /path/to/output.gbk
-o specifies the gene coordinates output file. The default file format is .gbk (Genbank-like format), but there are also other formats./path/to/output.gbk with the path to your output file-p meta
meta indicates you’re analyzing metagenomes.-s /path/to/output/genes.gff
-s specifies the GFF3 output file. This contains all potential genes with scores./path/to/output/genes.gff with the path to your output fileThe genes you want to target will completely depend on your question. Databases that I’ve used are:
Clusters of Orthologous Groups (COG) website
You’ll use rpsblast to predict COGs. I encourage you to explore the rpsblast website to determine what parameters you’ll need because this will change depending on your goals and plans for downstream analyses. Below are the parameters and steps I use.
path/to/rpsblast -db /path/to/COG_database -query /path/to/prodigal.faa -out /path/to/output.txt -evalue 1e-05 -outfmt '6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qcovs' -num_threads number
path/to/rpsblast
-db /path/to/COG_database
-db specifies the database/path/to/COG_database with your path to the COG database-query /path/to/prodigal.faa
-query specifies input file./path/to/prodigal.faa with the path to your prodigal .faa output file-out /path/to/output.txt
-out specifies output file/path/to/output.txt with the path to your output file-evalue 1e-05
-evalue sets the E-value cutoff-outfmt '6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qcovs'
-outfmt specifies format/columns of output file'6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qcovs' with your desired output format-num_threads number
-num_threads sets the number of threads/processorsnumber with a number (e.g. 4)After running rpsblast, you’ll want to assign COG categories to your query protein sequences. I use cdd2cog.pl.
Next, I prepare the data for stats by creating a table where rows = COG ID and columns = sample.
For reference, this is the script I use to create the table. This script isn’t intended to be downloaded and used because it’s written specifically for my sample nomenclature, but I’ve linked it here as reference if you write your own script.
To normalize your COG counts, you can normalize to universal single copy COGs. I’m working on a script to do this and will update this tutorial whenever I finish it.
Here are papers that have used this normalization method:
DataBase for Automated Carbohydrate-Active Enzyme Annotation (dbCAN) website
You can use the online dbCAN server or download dbCAN2.
I used an older iteration of dbCAN and describe my steps below.
You’ll use hmmer to predict carbohydrate-active enzymes (CAZymes). I encourage you to explore the hmmer website to determine what parameters you will need as this will change depending on your goals and plans for downstream analyses. Below are the parameters and steps I use.
/path/to/hmmer/hmmscan --cpu number --tblout /path/to/output.tbl --domtblout /path/to/output/domain.tbl -o /path/to/output.txt -E 1e-05 /path/to/dbCAN-HMMdb-V8.txt /path/to/input/prodigal.faa
/path/to/hmmer/hmmscan
--cpu number
--cpu specifies number of processorsnumber with a number (e.g. 4)--tblout /path/to/output.tbl
--tblout specifies the output table. This summarizes the per-target output./path/to/output.tbl with the path to your output table--domtblout /path/to/output/domain.tbl
--domtblout specifies the output domain table. This summarizes the per-domain output./path/to/output/domain.tbl with the path to your output table-o /path/to/output.txt
-o specifies output file. This file begins with a header summarizing your parameters. The next section is a list of sequence top hits./path/to/output.txt with the path to your output file-E 1e-05
-E sets the E-value cutoff./path/to/dbCAN-HMMdb-V8.txt
/path/to/input/prodigal.faa
/path/to/input/prodigal.faa with the path to your prodigal .faa fileTo make sense of the dbCAN output, I parse it to create a table that lists each ORF with a predicted CAZyme and its top CAZyme hit, as well as a count of each CAZyme class. Here are some scripts for this (you only need to use one of them to create the tables):
To prepare the data for stats, I create two tables:
For reference, this is the script I use to create the two tables. This script isn’t intended to be downloaded and used because it’s written specifically for my sample nomenclature, but I’ve linked it here as reference if you write your own script.
Simultaneous Classification and Motif Identification for enzyme/CAZyme annotation (eCAMI) website
This program classifies to carbohydrate-active enzymes (CAZymes) and enzyme commission number.
I use eCAMI primarily for enzyme commission number classification. For CAZyme classification, I prefer to use dbCAN because it’s older and more well known. eCAMI and dbCAN are both written by the Yin Lab.
Use KAAS - KEGG Atomatic Annotation Serve to predict metabolic pathways.
Parameters for your annotation will vary depending on your needs, so please refer to the KEGG Website.
Here are the parameters that I use: