QIIME2 crash course

Introduction

QIIME2 is a marker gene analysis platform for e.g. 16S, 18S, ITS surveys (although it’s being developed into a multi-omics platform right now as well).

It is an open-source plug-in system (all of the bioinformatics programs come in the form of plugins) - all the plugins are described at https://library.qiime2.org

You can use it on a variety of interfaces – e.g. on the command line, on a GUI like Galaxy, QIITA, python3 API (which you could use on Jupyter notebook).

Whenever you run an analysis with QIIME2 it always tracks data provenance (version, commands, etc.) so it’s easy to figure out anything wrong that’s going on.

QIIME2 has amazing documentation, tutorials, a great forum, and a youtube channel that make it pretty easy to learn how to use. The forum is a good place to debug problems but also have broader discussions about microbiome and bioinformatics research.

tutorials: https://docs.qiime2.org/2021.4/tutorials/

forum: https://forum.qiime2.org/

youtube: https://www.youtube.com/channel/UCgjZXjHOjV-KbkXsnuMKoUA

Installing QIIME2

The easiest way to install on the computing cluster is doing so within a conda environment.

#on ccv have to load most up to date anaconda for some reason
module load anaconda/3-5
wget https://data.qiime2.org/distro/core/qiime2-2021.4-py38-linux-conda.yml
conda env create -n qiime2-2021.4 --file qiime2-2021.4-py38-linux-conda.yml
rm qiime2-2021.4-py38-linux-conda.yml

Importing data into QIIME2

Data is imported into QIIME2 as “artifacts”, and all of these are defined as semantic types that describe what the data is – e.g. SampleData[SequenesWithQuality], SampleData[PairedEndSequencesWithQuality], FeatureTable[Frequency], Phylogeny, etc. - all of these could have different formats, e.g. fastq vs fasta+qual for SequencesWithQuality, which is useful as data formats change over time

QIIME artifacts are in .qza file format, which are just zip files (where the data and also data provenance are stores). You can visualize them by converting them to .qzv files and viewing of view.qiime2.org. And there are also commands to convert .qza files back to .fastq files as well.

Let’s start by importing some .fastq files from 18S rRNA gene amplicon sequencing. This data was downloaded from the NCBI SRA ((I could provide commands to to do that (which is a bit of a pain but I could provide commands to do it relatively simply).

This data is already demultiplexed, but if your data is still multiplexed and you have a file with barcodes corresponding to sample IDs, you can use q2-cutadapt to demultiplex: https://forum.qiime2.org/t/demultiplexing-and-trimming-adapters-from-reads-with-q2-cutadapt/2313 or also demux emp-paired if the data was generated with the EMP protocol: https://docs.qiime2.org/2020.11/plugins/available/demux/emp-paired/

When downloaded from SRA from a paired-end study you get 2 files per sample, read 1 and read 2. We want to import this in QIIME’s “Casava 1.8 paired-end demultiplexed fastq format” (https://docs.qiime2.org/2021.4/tutorials/importing/)

The files need to be in a specific format. From SRA they are SRR12345_R1.fastq.gz but we want SRR12345_00_L001_R1_001.fastq.gz so we need to rename them:

rename -v '_' '_00_L001_' * /
rename -v '.fastq.gz' '_001.fastq.gz' * /
rename -v '_1' '_R1' * /
rename -v '_2' '_R2' *

Now to import we run the following script. All of the .fastq files you are importing should be in the same directory, which has nothing else in it.

qiime tools import \
  --type 'SampleData[PairedEndSequencesWithQuality]' \
  --input-path path/ \
  --input-format CasavaOneEightSingleLanePerSampleDirFmt \
  --output-path work/demux-paired-end.qza

Subsequent processing will be done on this output .qza file so make sure you only import reads you want to process together – e.g. if you’re using DADA2 for denoising they should be from the same sequencing run.

Metadata

Sample metadata contains whatever you want it to contain, the only requirement is that the first column must be the sample ID and the header has to be one of several specific names (listed here https://docs.qiime2.org/2021.4/tutorials/metadata/).

These IDs have to be unique, cannot be empty, and can’t start with #. For the rest of the columns, there aren’t many rules (but there is more info in that tutorial), and missing data should just be left as blank cells. You can validate the metadata directly in Google Sheets with a google sheet add-on called Keemei, or QIIME will tell you if/why it’s invalid if you try to import it. Metadata should be in tab-separated format (.tsv or .txt). Importantly, the Sample ID in the first column of the metadata should correspond with the first identifier in the imported data file name. (e.g. SRR12345 in SRR12345_00_L001_R1_001.fastq.gz). The tutorial also shows you how to merge metadata across different sources, with the resulting file containing only the common cilumns across the different files. Metadata files are used when running and visualizing analyses.

Visualizing quality of reads

QIIME2 can provide us some summary statistics about the reads that we just imported. To visualize these run:

qiime demux summarize \
--i-data file.qza \
--o-visualization file.qzv

If you are running this on RStudio or a Jupyter notebook for example you could use:

qiime tools view file.qzv

to see the visualization directly on the document you are working on which is a lot more convenient.

Trimming off primers

QIIME2 has the plugin q2-cutadapt which among other things can trim off primer sequences. This study used 18S V4 primers from Stoeck et al. 2010.

In the following command, –p-front-f is the forward primer (5’-3’), –p-front-r the reverse. I’ve also included –p-adapter-f, which is the reverse complement of the reverse primer, and –p-adapter-r, which is the reverse complement of the forward primer. These would be present in the reads if the amplicon region is shorter than the read length - in this case both primers are typically present in each forward and reverse read and you should treat them like “linked adapters”, where the forward primer is linked with the reverse complement of the reverse primer, and vice versa.

–p-discard-untrimmed discards any untrimmed reads. In this command, reads will only be discarded if they don’t have either the normal primer or the linked adapter (just having one is ok)

–verbose gives you a detailed output of how many reads and base pairs were trimmed off in each sample (by default if this was run as a script on the ccv it would be stored in the slurm file)


qiime cutadapt trim-paired \
--i-demultiplexed-sequences demux-paired-end.qza \q
--p-cores 16 \
--p-front-f CCAGCASCYGCGGTAATTCC \
--p-adapter-f TYRATCAAGAACGAAAGT \
--p-front-r ACTTTCGTTCTTGATYRA \
--p-adapter-r GGAATTACCGCRGSTGCTGG \
--p-minimum-length 10 \
--p-discard-untrimmed \
--o-trimmed-sequences demux-paired-end-trimmed.qza \
--verbose

We can visualize how the reads look after trimming:

qiime demux summarize \
--i-data file.qza \
--o-visualization file.qzv

Next steps – choosing an approach – denoising to ASVs or OTU clustering

Within QIIME2 you denoise your sequences to obtain ASVs or you could also do de novo, closed-reference, or open-reference OTU clustering. For OTU clustering you could use q2-vsearch, which is the same as vsearch outside of QIIME2 (except potentially with some different optins, I’m not sure). Here is a tutorial for that: https://docs.qiime2.org/2021.4/tutorials/otu-clustering/

This tutorial is about denoising however. QIIME2 has both DADA2 and Deblur plugins

These papers compare the different methods nicely (for 16S data): https://peerj.com/articles/5364/ https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0227434

^^Prodan et al. 2020 above states that DADA2 is the best choice for studies requiring high resolution, like looking at strain-level differences, but that USEARCH-UNOISE3 (outside of qiime) has the best overall performance. I really like QIIME though for a lot of reasons (and also it seems like most people working with 18S use it), and based on other reading I’ve concluded that I think DADA2 is best in most cases, and Deblur is best if you are working with many different studies, or studies that have significantly different study designs.

DADA2

If running DADA2, all you need to do is trim then denoise so we can go straight to the command.

It’s important to take a look at the quality plots and choose a good place for DADA2 to truncate the reads. –p-trunc-len-f defines the length the forward reads will be trimmed and –p-trunc-len-r defines the length the reverse reads will be trimmed at. You want to trim off low-quality bases because DADA2 has a initial filtering step based on quality, so it will discard the whole read if the quality drops off too much at the end (which could otherwise be retained if truncated appropriately). But you also want to keep long enough reads for DADA2 to be able to merge. The default min overlap is 12 but can be decreased down to 4 with –p-min-overlap

Example code:

qiime dada2 denoise-paired \
--i-demultiplexed-seqs demux-paired-end-trimmed.qza \
--p-trunc-len-f 200 \
--p-trunc-len-r 220 \
--o-table DADA2_denoising_output/table.qzv \
--o-representative-sequences DADA2_denoising_output/rep-seqs.qza
--o-denoising-stats DADA2_denoising_output/DADA2-stats.qza

You can also trim at the 3’ end of both reads using –p-trim-left-f/r. There are some other parameters you can pay around with as well, like the method to detect chimeras, method used to pool samples for denoising, etc. It’s important to use these same parameters if you want to merge the results from multiple sequencing runs later.

This command will put 3 files in the directory DADA2_denoising_output: table.qzv (per-sample ASV count table), rep-seqs.qza (ASV IDs and their sequences), and DADA2-stats.qza.

If your reverse reads are super low quality you can run the following code on your paired-end reads and it will drop the reverse reads by default:

qiime dada2 denoise-single \
--i-demultiplexed-seqs demux-single-end-trimmed.qza \
--p-trunc-len 200 \
--output-dir DADA2_denoising_output \
--verbose

Or you could also run this just on the reverse reads by re-importing only those.

To generate summaries of these output that you can then visualize:

qiime feature-table summarize \
  --i-table table.qza \
  --o-visualization table.qzv \
  --m-sample-metadata-file sample-metadata.tsv

qiime feature-table tabulate-seqs \
  --i-data rep-seqs.qza \
  --o-visualization rep-seqs.qzv

qiime metadata tabulate \
  --m-input-file denoising-stats.qza \
  --o-visualization denoising-stats.qzv

You can also further filter out chimeric sequences after running DADA2 using q2-vsearch. Cori says this is an important step because in her experience DADA2 retains a lot of chimeras.

qiime vsearch uchime-denovo \
  --i-table table.qza \
  --i-sequences rep-seqs.qza \
  --output-dir uchime-dn-out
  
#visualize stats for chimeric sequences
qiime metadata tabulate \
  --m-input-file uchime-dn-out/stats.qza \
  --o-visualization uchime-dn-out/stats.qzv
  
#filter out table and sequences to exclude chimeras
qiime feature-table filter-features \
  --i-table table.qza \
  --m-metadata-file uchime-dn-out/chimeras.qza \
  --p-exclude-ids \
  --o-filtered-table uchime-dn-out/table-nonchimeric.qza
qiime feature-table filter-seqs \
  --i-data rep-seqs.qza \
  --m-metadata-file uchime-dn-out/chimeras.qza \
  --p-exclude-ids \
  --o-filtered-data uchime-dn-out/rep-seqs-nonchimeric.qza
qiime feature-table summarize \
  --i-table uchime-dn-out/table-nonchimeric.qza \
  --o-visualization uchime-dn-out/table-nonchimeric.qzv

Deblur

Deblur is usually only used with single-end reads, but you can also use paired-end reads if you merge them first.

Merging with q2-vsearch merge-pairs (which is the same as vsearch’s fastq_mergepairs):

qiime vsearch join-pairs \
--i-demultiplexed-seqs demux-paired-end-trimmed.qza \
--o-joined-sequences demux-joined.qza \
--p-maxdiffs 30 \
--p-allowmergestagger \
--p-minovlen 30 \
--verbose

#visualize reads
qiime demux summarize \
--i-data demux-joined.qza \
--o-visualization demux-joined.qzv

–verbose will print all the merging stats to the errorfile, which is helpful because it lists all the reasons why reads didn’t merge. You should play around with these paramters for your own data. Depending on the length of overlap in your reads you need to change –p-minovlen (minimum overlap length) and –p-maxdiffs (max amounts of bp pairwise differences allowed in the overlap). I have read that most of the problems with vsearch are with really short overlaps (like 10bp), so setting –p-minovlen to 50 is essentially the same as setting it to 200. And you want –p-max-diffs to be set to be around 20% of the overlap length. For discussion on this see: https://forum.qiime2.org/t/question-regarding-parameters-used-in-qiime-vsearch-join-pairs/12289

It’s important to do an initial quality control step before running Deblur. This isn’t really as important if you’re working with merged paired-end reads rather than single-end reads, because the merging step increases the quality of the overlap significantly.

Running deblur:

Deblur has two modes, denoise-16S and denoise-other. This is 18S data so we use denoise-other. Deblur initially compares the reads against a positive filtering database (specified with –i-reference-seqs - anything not at least 60% similar sequence identity against whole database is discarded)

qiime deblur denoise-other \
  --i-demultiplexed-seqs demux-joined-filtered.qza \
  --i-reference-seqs /gpfs/data/rbeinart/aschrece/pr2/pr2_4.12.0_fasta.qza \
  --p-trim-length 325 \
  --o-representative-sequences rep-seqs.qza \
  --o-table table.qza \
  --p-sample-stats \
  --p-jobs-to-start 10 \
  --o-stats deblur-stats.qza
  
#visualize the output

qiime deblur visualize-stats --i-deblur-stats deblur-stats.qza --o-visualization reports/deblur-stats.qzv

qiime feature-table summarize --i-table table.qza --o-visualization reports/table.qzv

qiime feature-table tabulate-seqs --i-data rep-seqs.qza --o-visualization reports/rep-seqs.qzv

The deblur-stats output gives detailed information on what was retained after each filtering step, which is really useful for finding out anything that went wrong. The other output files are the same as for DADA2

Merging results from different sequencing runs/studies

If you have more than one run being analyzed together you’ll have to merge the ASV tables and sequence files together (table.qza and rep-seqs.qza). Any ASVs that are exactly the same will be merged into one across runs. Even one nucleotide difference is enough for it to be considered a different ASV, which is why it’s so important to keep the upstream parameters exactly the same if you want to do analyses on the ASVs directly.

Code for merging these tables:

#merge count tables
qiime feature-table merge \
--i-tables pathtotable1/table.qza \
--i-tables pathtotable2/table.qza \
--i-table pathtotable3/table.qza \
--o-merged-table merged-table.qza

#merge rep seqs
qiime feature-table merge-seqs \
--i-data path1/rep-seqs.qza \
--i-data path2/rep-seqs.qza \
--i-data path3/work/rep-seqs.qza \
--o-merged-data merged-rep-seqs.qza

Assigning and visualizing taxonomy

QIIME2 assigns taxonomy is by using q2-feature-classifier. q2-feature-classifier has 3 different classification methods: two are alignment-based methods that find a consensus assignment across N top hits (classify-consensus-blast and classify-consensus-vsearch), and also machine-learning based classification methdos through classify-sklearn. All the methods are good but the QIIME developers have found that the classify-sklearn with a Naive Bayes classifier outperforms the other methods with 16S and ITS data. So I have been using that one, which requires a pre-trained classifier.

QIIME has pre-trained classifiers available for the SILVA and Greengenes 16S and 18S databases: https://docs.qiime2.org/2021.4/data-resources/. However if you want to use something else like PR2 or your own data you will need to pre-train the classifier yourself.

Using PR2 as an example:

#get fasta and taxonomy info
wget https://github.com/pr2database/pr2database/releases/download/v4.14.0/pr2_version_4.14.0_SSU_mothur.fasta.gz
wget https://github.com/pr2database/pr2database/releases/download/v4.14.0/pr2_version_4.14.0_SSU_mothur.tax.gz

#import as qiime artifacts
qiime tools import \
--input-path pr2_version_4.14.0_18S_mothur.fasta \
--output-path pr2_4.14.0_fasta.qza \
--type 'FeatureData[Sequence]'

qiime tools import \
--type 'FeatureData[Taxonomy]' \
--input-format HeaderlessTSVTaxonomyFormat \
--input-path pr2_version_4.14.0_18S_mothur.tax \
--output-path pr2_version_4.14.0_tax.qza

Most people suggest trimming the fasta file to the primer region that the reads themselves are trimmed to, but in some cases when your organisms of interest are underrepresented in the sequence database I (and others) think this is a bad idea. I’ve found that with shorter amplicons like 18S V9, trimming improves taxonomic assignment dramatically, but with longer amplicons like 18S V4, it makes no differnece or even hinders it (especially with the metopid symbiont dataset). I’ll show how to do it anyway:

qiime feature-classifier extract-reads \
  --i-sequences pr2_4.14.0_fasta.qza \
  --p-f-primer forwardprimer \
  --p-r-primer reverseprimer \
  --p-trunc-len \
  --p-min-length \
  --p-max-length \
  --o-reads rep-seqs.qza

If your reads are also truncated to a specific length you can use –p-trunc-len here.

train the classifier:

qiime feature-classifier fit-classifier-naive-bayes \
--i-reference-reads rep-seqs \
--i-reference-taxonomy pr2_version_4.14.0_tax.qza \
-o-classifier PR2-classifier.qza

Assign taxonomy to your ASVs:

qiime feature-classifier classify-sklearn \
  --i-classifier path/PR2-classifier.qza \
  --i-reads path/rep-seqs.qza \
  --o-classification taxonomy.qza
  
  #visualize in interactive table
  qiime metadata tabulate \
  --m-input-file taxonomy.qza \
  --o-visualization taxonomy.qzv

We don’t care at all about metazoans so it’s a good idea just to filter those right away. Filter out unwanted taxa:

#filter out metazoans
qiime taxa filter-table \
  --i-table table.qza \
  --i-taxonomy taxonomy.qza \
  --p-exclude Metazoa \
  --o-filtered-table table-nometazoa.qza
  
  #only include ciliates
  qiime taxa filter-table \
  --i-table table.qza \
  --i-taxonomy taxonomy.qza \
  --p-include Ciliophora \
  --o-filtered-table table-ciliophora.qza

If you’re doing downstream analysis on this taxonomic assignment, like differential abundance analysis of different taxa, or taxonomy-informed diversity analyses, you’ll want to collapse the feature table to merge all the features that share the same taxonomic assignment into a single feature.

#collapse to class level, for example (change with --p-level)
qiime taxa collapse \
--i-table table.qza \
--i-taxonomy taxonomy.qza \
--p-level 4 \
--o-collapsed-table class_table.qza

You can visualize the taxonomy as an interactive taxonomic barplot easily in QIIME2. You need to provide the metadata as an input (at least containing sampleIDs):

qiime taxa barplot \
--i-table table.qza \
--i-taxonomy taxonomy.qza \
--m-metadata-file metadata.txt \
--o-visualization taxa-barplot.qzv

Constructing a phylogeny

We can align and build a phylogeny of our ASVs within QIIME2 to assess their phylogenetic relationships and do phylogenetically-informed diversity analyses later on. The two ways to do this in QIIME are with q2-phylogeny and q2-fragment-insertion.

q2-phylogeny takes a de novo approach where the ASVs are aligned with MAFFT, a phylogeny is constructed using FastTree, RAxML, or IQ-TREE, and the phylogeny is rooted. There is a tutorial for these options here: https://docs.qiime2.org/2021.4/tutorials/phylogeny/

However, when working with short reads, constructing a de novo phylogeny can be problematic. q2-fragment-insertion provides a way to construct an insertion tree and is specifically designed for use with ASVs/sOTUs. I am currently working through this so can update this tutorial when I’m done. In the meantime here is the paper: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5904434/

At the end of the day you will get tree.qza as the output file.

Exporting data from QIIME2

There are a ton of diversity and statistical analyses that you can do directly in QIIME2, but I’ve decided to do those in R instead, because it provides a lot more flexibility. So here is how you export the relevant data into R (rep-seqs.qza, taxonomy.qza, table.qza, and tree.qza):

Importing data into R

https://github.com/biovcnet/topic-amplicons/blob/master/Lesson03a/analysis.md

Use qiime tools export to export count table, rep seqs fasta file, taxonomy file into R

export the count table and convert to tsv:

#export count table
qiime tools export \
--input-path table.qza \
--output-path export/table
#convert the resulting .biom file to .tsv
biom convert \
-i feature-table.biom \
-o table.tsv --to-tsv

#export rep seqs to a fasta file
qiime tools export \
--input-path representative_sequences.qza \
--output-path export/rep-seqs.fasta

#export taxonomy file to tsv file
qiime tools export \
--input-path taxonomy.qza \
--output-path export/taxonomy

From here I would import these into R, and the import them as phyloseq objects, but that is for another tutorial :)