This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

Online version available at http://rpubs.com/maddieSC/mothur_SOP_May_2018

Slides: https://drive.google.com/file/d/1AEd2rRCHDoxND4552SZDdSdGRAk_L9Ra/view?usp=sharing

Tips for this workshop

You can copy-paste from this document into the Terminal window. This will be faster and prevent typos. Depending on your computer and your ssh client, this may be done by

Ctrl+v
Apple+v
Right-click on your mouse
GREEN boxes contain code that you can copy and paste to run on your machine.
WHITE boxes contain sample output of this code, and nothing will happen if you try to copy it into your console.
ORANGE boxes contain commands that you should NOT execute, either because they are unnecessary for today, or because they take too long to do during this workshop and have been done for you.

Common issues

Issues during this workshop

If at any time during this workshop you fall behind, all of the intermediate files are available in /mnt/software/workshop/data/MOTHUR/miseq_sop2/. You will see that the files created by each mothur command are listed after that command under “Output File Names’. These are the files you need in order to skip that step.

For example, if you did not complete align.seqs and want to skip ahead, you would need to transfer the .align and .align.report files to your directory. To do this, you would use

mothur > system(cp /mnt/software/workshop/data/MOTHUR/miseq_sop2/example.trim.contigs.good.unique.align ~/miseq_sop)

mothur > system(cp /mnt/software/workshop/data/MOTHUR/miseq_sop2/example.trim.contigs.good.unique.align.report ~/miseq_sop)

Some of the larger files are also zipped.

example.trim.contigs.qual.zip
example.trim.contigs.good.unique.align.zip
example.trim.contigs.good.unique.good.align.zip
example.final.dist.zip

Therefore after transfer, you would also need to unzip them.

mothur > system(cp /mnt/software/workshop/data/MOTHUR/miseq_sop2/example.final.dist.zip ~/miseq_sop)

mothur > system(unzip ~/miseq_sop/example.final.dist.zip)

Common mistakes

Typos The absolute most common mistake when working in mothur is typos. The mothur file names get very long and somewhat repetitive so please double-check! This can even occur when copy-pasting from this document if you accidentally copy a leading or lagging space around the command.

fasta and count_table don’t match Another common error seen in mothur occurs when your .fasta and .count_table files do not match. This looks like

"[ERROR]: yourSequence is in fileA but not in fileB, please correct."

This often occurs because you forgot to include your count file in a command or accidentally included the wrong one. You can see this when mothur picks a file to use when you don’t include one.

...
Using /home/BIOTECH/student01/miseq_sop/example.trim.contigs.fasta as input file for the fasta parameter.
...

You should make sure the file mothur picked is the correct one. And if you did provide this file in the command, check for typos!

Another reason a mismatch can occur is a multi-processor function failing. One or more processors may fail but the overall function appears to complete as the other processors finish. Thus, a file can be incomplete which would cause a mismatch. Running out of RAM will cause this to happen. The only way to correct this is to re-run the failed process (usually align.seqs, pre.cluster, dist.seqs, or cluster.split) on fewer processors.

Introduction

Goal

The goal of this tutorial is to demonstrate standard operating procedures (SOP) for processing amplicon data that are generated using Illumina’s MiSeq platform with paired end reads. This SOP is a combination of the Schloss lab and Suen lab (UW-Madison) procedures for 2x250bp reads of the V4 region of the 16S rRNA gene. There are a number of steps, as noted in this SOP, which require modification if you are using a different amplicon and/or read length.

Sources and citation

This SOP is based on Pat Schloss’s MiSeq SOP with some notes and modifications by Dr. Kim Dill-McFarland and Madison Cox. The original SOP can be found at http://www.mothur.org/wiki/miseq_sop.

Published work using this SOP should cite https://www.ncbi.nlm.nih.gov/pubmed/23793624

— Kozich JJ, Westcott SL, Baxter NT, Highlander SK, Schloss PD. (2013): Development of a dual-index sequencing strategy and curation pipeline for analyzing amplicon sequence data on the MiSeq Illumina sequencing platform. Applied and Environmental Microbiology 79(17): 5112-20. doi: 10.1128/AEM.01043-13.

Programs

We will complete all sequence processing in the program mothur (https://www.mothur.org/wiki/Main_Page). Details on all commands used here as well as others can be found at http://www.mothur.org/wiki/Category:Commands.

We will use the command-line version of this software on a remote Linux computer. Windows PC users will need to download an ssh client for connecting to the remote computer. We recommend the free, personal edition of MobaXterm (http://mobaxterm.mobatek.net). This software is a combination of an enhanced terminal window and ssh client.

Data in this SOP

Data description

The Suen lab is interested in understanding how rumen (gut) microbes contribute to nutrition and milk production in dairy cows. In this example data set, we are looking at the liquid (L) and solid (S) contents from the rumen (largest stomach compartment) of a cow.

We will analyze these data in this tutorial using an OTU (Operational Taxonomic Unit) approach. This approach is the most commonly used in microbiota research. Other approaches such as phylotype and phylogenetic can be found in the online Schloss SOP.

The data in this SOP were generated using 16S V4 primers from Kozich et al. 2013. These primers allow you to sequence up to 1536 samples in parallel using only 80 primers (32+48) and obtain sequence reads of the V4 region of the 16S rRNA gene. Please consult the supplementary methods of that manuscript for more information. You can also see the latest Schloss wet-lab SOP for generating libraries similar to these (https://github.com/SchlossLab/MiSeq_WetLab_SOP).

Files

From the MiSeq Sequences come off the MiSeq as pairs of fastq.gz files with each pair representing the two paired reads per sample.

fastq files contain both the sequence data (fasta) and the quality score data (qual). gz is a form of compression to make the files smaller. If you aren’t getting these files off the sequencer, then you likely have the software parameters set incorrectly. The default MiSeq file names are sample.name_S##_L001_R#_001.fastq where

sample.name is the name you give the machine when running your library (i.e. pool of all samples together)
S## is the sample number given by the machine to ensure all file names are unique
L001 is lane 1 (MiSeqs only have one lane as opposed to HiSeqs)
R# is the forward (R1) or reverse (R2) read for each sample
001 is the replicate number, which will always be 001 as the MiSeq requires unique sample names (again, as opposed to the HiSeq)

For mothur For this tutorial, you will need our 2 example samples’ fastq.gz files as well as two databases for alignment and classification. All of these have been pre-downloaded onto sumo. However, if you need the databases in the future, they can be found through the mothur website. The database files include

mothur-formatted Silva sequences, reference alignment, and taxonomy
mothur-formatted GreenGenes sequences, reference alignment, and taxonomy

From mothur When mothur modifies your data, it creates a new file for each step. For example, if you start out with Example.fasta and run screen.seqs to remove poor-quality sequences, mothur will create Example.good.fasta. Thus, you can see all the steps you’ve run just by looking at the file names.

Note: It is generally easiest to use the “current” option for many of the commands since the file names get very long. Because this tutorial is meant to show people how to use mothur at a very nuts and bolts level, we will only selectively use the current option to demonstrate how it works. Generally, we will use the full file names for this tutorial.

Get Started

Download and run mothur on your own machine

Computational requirement

Make sure that your machine is powerful enough for the data set you wish to run. What we are doing today is a very, very small data set, which should not pose any issues on a resonably new standard laptop. However, if you were to run a full data set on a low-end personal computer, you would likely run out of memory and crash it. Not to mention that many steps in mothur processing are very time-intensive. If you can, find a way to access a large server to run full data sets. My laptop has 16Gb of RAM and 4 processors. With these specifications, running on 2 processors, a small data set of ~25 samples (V4) takes about a day. A large data set of 400+ samples may take weeks. A V3-4 data set of any reasonable size would likely crash my machine (see the “large distance matrix” section below for an explanation). So make sure that whatever machine you use is up to the task before you start.

Download the latest verison of mothur

The latest mothur release can be found here. Go to the github page and download the latest version. At time of writing, it was version 1.40. Make sure to keep track of what version of mothur you are running, as it is important to report in any manuscript using data processed in mothur. Download the zip file appropriate for your operating system and place it on your desktop. You may choose to move it later, but for today we will leave it there. Unzip the file to a folder on the desktop, and rename the folder “mothur-1.40.X”. Inside you will find the mothur executable (mothur.exe), as well as other executables which certain parts of the mothur workflow rely on. Congratulations, you have downloaded the program.

Connect to the BRC Linux Computer

We will remotely connect to the BRC Linux computer named sumo for these exercises.

Connect from a Mac

Open a Terminal (if you are not sure where that is simply search for the word Terminal with “Spotlight” tool i.e. the top right magnifying glass icon)
Within the Terminal, type the following command adapted with your own studentxx account (shown here for student01): ssh -Y student01@sumo.biotech.wisc.edu
The -Y option allows passing of any graphics from the BRC Linux computer. The XQartz program should be installed for this to work (See https://www.xquartz.org)
sumo is the name of the computer.

Connect from a Windows PC

There are 2 ways to connect to a remote computer:

Standard connection

Start MobaXterm
In the main window, type the ssh command containing your username and the computer name. For example for student01, you would write: ssh -Y student01@sumo.biotech.wisc.edu
The -Y option allows passing of any graphics from the BRC Linux computer.

GUI method connection

Start MobaXterm
In the QuickConnect text box (upper left), enter the name of the computer you want to connect to: e.g. @sumo.biotech.wisc.edu
Press the Return button
In the next window, click on the SSH icon (upper left). Click OK
At the “Login:” prompt that appears within the main window, enter the username provided e.g. student01
Enter the password provided (Note: nothing gets printed)
You may be asked if you want to save the password. Click NO
You should now be logged in the remote computer on the main window.

Pasting in MobaXterm

By default, the pasting utility in MobaXterm is Shift+Insert. You can change this by going into the “Settings” pulldown menu and opening “Keyboard Shortcuts”. Find “Paste” and use the “Edit keyboard shortcut” pulldown to select “Ctrl + V” to make it normal.

Home directory

Once you have connected to the sumo computer, you “land” within your “Home” directory. For student01, this would be:

/home/BIOTECH/student01

If at any point you get lost within the labyrinth of directories, simply type cd to come back.

If you would like to back up by a single directory, type cd ..

To know where your terminal is “looking” at any moment, you can invoke the print working directory command: pwd

`$` prompt

The $ in the Terminal is simply a way for the computer to notify you that it is ready to receive typed commands. ** Note: The $ is not typed when issuing commands. It is already there.**

For example, student01 would see the following Terminal prompt:

[student01@sumo ~]$

Project folder: Stay organized

It is a general good practice to create a project folder or working directory in order to separate files from various projects. We will use the name miseq_sop. Therefore, our first task is to create this directory within the most logical place: our “Home” directory.

First we can verify that we are indeed within the “Home” directory with the print working directory command pwd.

$ pwd

$ /home/BIOTECH/student01

Then, we create the directory with mkdir

Create a project folder

$ mkdir miseq_sop

We can verify that this directory exists by checking the content of our “Home” directory with the list command ls

$ ls

miseq_sop

Next, move into your miseq_sop folder. This will be done with the change directory command cd into miseq_sop:

Change directory

$ cd miseq_sop

We can check that this worked with the command print working directory pwd. For student01, this would look like:

$ pwd

$ /home/BIOTECH/student01/miseq_sop

Get data files

In this section, we will retrieve the data files for our 6 samples. Remember that there are R1 and R2 files so we will end up with 12 fastq.gz files.

Note: If you wanted to download files from the Internet, you could use the command web get wget followed by the URL, or web address of the file to download. For example: wget https://www.mothur.org/w/images/d/d6/MiSeqSOPData.zip

Sequence `fastq`

We will retrieve fastq files for our 2 samples (4 files). We will copy cp a zipped folder containing the files that has already been downloaded to the local computer located in the directory: /mnt/software/workshop/data/MOTHUR/. Then, you will unzip these files and move mv them into your miseq_sop folder.

Copy data files into home directory

$ cp /mnt/software/workshop/data/MOTHUR/fastq.zip ~/miseq_sop

Note: ~ is a representation of the home directory, /home/BIOTECH/student01 with YOUR STUDENT NUMBER

Nothing will be printed to the window so check that the file has been moved.

$ ls

fastq.zip

Now, unzip the folder to pull out all the fastq.gz files. Remember to change the student## to your number

Unzip data files

$ unzip /home/BIOTECH/student01/miseq_sop/fastq.zip

This will place all of the files in a folder named the same thing as the zipped object (i.e. fastq).

$ ls

fastq/
fastq.zip

So we need to move the fastq.gz files from fastq/ into miseq_sop/. We can move all of the files ending in fastq.gz at once with the wildcard *

$ mv /home/BIOTECH/student01/miseq_sop/fastq/*fastq.gz ~/miseq_sop

You can then check the content of the directory with the list ls command.

$ ls

fastq/
fastq.zip
L62.1_S149_L001_R1_001.fastq.gz
L62.1_S149_L001_R2_001.fastq.gz
S62.1_S27_L001_R1_001.fastq.gz
S62.1_S27_L001_R2_001.fastq.gz

Previous versions of mothur have allowed us to keep the .fastq files zipped before beginning. For some reason, this version keeps throwing an error at me. So we can unzip all of the fastq.gz files with the following command.

$ gunzip /home/BIOTECH/student01/miseq_sop/*.gz
$ ls

fastq/
fastq.zip
L62.1_S149_L001_R1_001.fastq
L62.1_S149_L001_R2_001.fastq
S62.1_S27_L001_R1_001.fastq
S62.1_S27_L001_R2_001.fastq

For these samples, the sample names are coded as sampleCow.day. For example, L62.1 is the liquid sample from cow #62 on day 1. These sequences are 250 bp and overlap in the V4 region of the 16S rRNA gene; this region is about 253 bp long.

Databases

We will now copy the Silva and GreenGenes database files into your home folders with cp. For alignment, we will be using a reduced version of Silva that only contains the V4 region of 16S since we know that is our amplicon region. A full Silva database is available at https://www.mothur.org/wiki/Silva_reference_files. Also, because silva.v4.v132.align is large, it is zipped and will need to be unzipped once in your folder.

All of these files are located in /mnt/software/mothur/reference_databases/

silva.v4.v132.align.zip
silva.nr_v132.tax
gg_13_8_99.fasta
gg_13_8_99.gg.tax

Retrieve databases

$ cp /mnt/software/mothur/reference_databases/silva.v4.v132.align.zip ~/miseq_sop
$ cp /mnt/software/mothur/reference_databases/silva.nr_v132.tax ~/miseq_sop
$ cp /mnt/software/mothur/reference_databases/gg_13_8_99.fasta ~/miseq_sop
$ cp /mnt/software/mothur/reference_databases/gg_13_8_99.gg.tax ~/miseq_sop

Unzip align file

$ unzip ~/miseq_sop/silva.v4.v132.align.zip

silva.v4.v132.align.zip was a single zipped file and therefore, was not placed in a new folder like the fastq.gz files. So, it does not need to be moved.

If you do a final check of your folder, you should have all of the following files.

$ ls

fastq/
fastq.zip 
gg_13_8_99.fasta
gg_13_8_99.gg.tax
L62.1_S149_L001_R1_001.fastq
L62.1_S149_L001_R2_001.fastq
S62.1_S27_L001_R1_001.fastq
S62.1_S27_L001_R2_001.fastq
silva.nr_v132.tax
silva.v4.v132.align
silva.v4.v132.align.zip

Begin in mothur

The software mothur is installed on the BRC Linux computer sumo and will be accessed via command-line on the Terminal. We can check that the computer knows about the software with the command which. This will echo back the location of the software:

$ which mothur

/mnt/software/mothur/default/mothur

We can also verify the version that is installed. It is important to note the version used as the calculations or command syntax may differ slightly with different versions.

We can check what version is installed with the following command that will print out the results on the screen:

$ mothur --version

Linux 64Bit Version
Mothur version=1.40.1
Release Date=4/16/2018

General mothur syntax

mothur automatically creates file names based on commands run with that file.

Sometimes new versions slightly change the output names which cause later commands to fail as they cannot find the correct file. For example, version 1.35 outputs the file ...precluster.uchime.accnos after chimera.uchime while version 1.36 outputs ...precluster.denovo.uchime.accnos because new options were added and it now specifies that you used the denovo method.

If you are having issues, the first thing to check is whether or not all the needed input files exist and are named exactly what they are in your command. If they are not, make the necessary changes to the next command and try re-running.

We will run these exercises on the sumo BRC Linux computer which as a total of 16 processors and 256 Gb (1/4 Tb) of RAM. These resources have to be shared among all the attendees in class. For example if there are 15 students, each could use 2 processors.

If you were to run on your desktop or laptop computer on your own the following recommendations from Kim would be useful:

— Before you begin, there are some things to consider. If this is your first time using mothur, I recommend running through this tutorial will only 2 samples so that you can get used to command line and identify any problem areas without having to wait for computationally intensive steps to run. This will also minimize RAM load so that you can practice on your own computer before moving to larger server usage (more about that below).

— Once you move up to a bigger data set, you need to consider the computer you’re running mothur on. Many commands can be run on multiple processors by adding processors = #. However, you can only use as many as your computer actually has. A good rule of thumb is to not run mothur using more than (all of your processors - 2). This leaves 1 processor for your OS and 1 for any other programs you may have open. Try to minimize other things open if possible. The other thing to consider is RAM. If you have very little RAM, only run mothur with 1 processor to minimize RAM load. This will make this very slow going and a laptop might still run out of RAM and crash. It’s impossible to say exactly how much RAM you will need for a given data set. So, when thinking about running a large data set consisting of an entire MiSeq run or more, consider getting access to 0.5TB+ RAM machines. Madison has these resources available through the Center for High Throughput Computing, http://chtc.cs.wisc.edu/.

Log files

Every data set is different, and every data set presents its own set of problems. Logfiles are automatically generated by mothur and help to determine where issues may have arisen if running in batch mode. They also help with recordkeeping and reproducibility. Your log file is a live document which is continually updated over the course of a session. If you close mothur and reopen, an new logfile will be created. The logs are only differentiated by a long string of numbers, like mothur.1490885401.logfile so it may be a good idea to rename logfiles to be more informative, such as indicating dates and data sets.

Start mothur via command line

We can now start the program. We will all call the same path to execute mothur in our own folders. Since we are each in our own miseq_sop folders, mothur will automatically save your files to your folder. Note that once this command is issued, we will be working within the mothur software as reminded by the new prompt mothur > which will only accept mothur line commands.

Start mothur

$ /mnt/software/mothur/default/mothur

mothur v.1.40.1
Last updated: 4/16/2018
by
Patrick D. Schloss

Department of Microbiology & Immunology

University of Michigan
http://www.mothur.org

When using, please cite:
Schloss, P.D., et al., Introducing mothur: Open-source, platform-independent, community-supported software for describing and comparing microbial communities. Appl Environ Microbiol, 2009. 75(23):7537-41.

Distributed under the GNU General Public License

Type 'help()' for information on the commands that are available

For questions and analysis support, please visit our forum at https://www.mothur.org/forum

Type 'quit()' to exit program

Interactive Mode


mothur >

If for some reason the default version of mothur is not v1.40.1, quit mothur with

mothur > quit

and then use the following to access the latest version

$ /mnt/software/mothur/versions/mothur_v.1.40.1/mothur

Basic sequence clean-up

Combine paired end reads

Paired-end runs on the Illumina MiSeq yield two reads for every sequence. One running in from the 5 prime end (read 1, R1) and one from the 3 prime end (read 2, R2). You need to combine these reads so that there is a single, longer read for every sequence (a contig). You do this by determining the overlapping regions and combining with the make.contigs command.

This command will: * extract the sequence and quality score data from your ‘fastq’ files * create the reverse complement of the reverse read and then * join the reads into contigs

We have a very simple algorithm to do this. First, we align the pairs of sequences. Next, we look across the alignment and identify any positions where the two reads disagree. If one sequence has a base and the other has a gap, the quality score of the base must be over 25 to be considered real. If both sequences have a base at that position, then we require one of the bases to have a quality score 6 or more points better than the other. If it is less than 6 points better, then we set the consensus base to an N.

You could run make.contigs on each sample individually and then combine the outputs. However, this is very time consuming. So, make.contigs can also take a tab-delimited file listing all the samples and their corresponding file names. We will have mothur make this file using make.file. Remember to change to your student number!

Importantly, the make.contigs command CAN accept the gzipped files which are generated by the MiSeq IF your computer is able to unzip them. So, if you are running mothur on OSX or Linux, you will have no issues as long as you let mothur know to look for the gzipped filetype. However, if you are running mothur on Windows, you will encounter all sorts of issues using gzipped files. You will have to use some sort of utility, like 7zip (http://www.7-zip.org/), to extract the fastq files prior to beginning. In this case, you would omit the “type=gz” in the following command, and mothur will look for the default “.fastq” filetype. Because sumo is a Linux machine, we will be able to use the gzipped files with no issue in thei workshop.

make.file

mothur > make.file(inputdir=/home/BIOTECH/student01/miseq_sop)

Setting input directory to: /home/BIOTECH/student01/miseq_sop/

Output File Names: 
/home/BIOTECH/student01/miseq_sop/stability.files

This will output a .files which is setup as

sample name -tab- R1file -tab- R2file

Let’s take a look at it.

mothur > system(head -5 stability.files)

L62.1   L62.1_S149_L001_R1_001.fastq    L62.1_S149_L001_R2_001.fastq
S62.1   S62.1_S27_L001_R1_001.fastq     S62.1_S27_L001_R2_001.fastq

Note: Notice that if you want to perform a terminal command outside of mothur while the program is running, you can use the system() command

Note: If you name two sets of files the same, they will be merged into one sample during make.contigs. This is useful if you sequenced a sample more than once.

Note: Your sample names cannot contain a dash (-) as mothur uses this symbol to denote lists in some commands.

The default file name of “stability” is not a useful name for us so we’re going to re-name this file so that all subsequent files are named correctly.

If you are working on your own machine, rather than on a server, you can make this change with a regular text editor. I like to use Excel.

Rename ‘.files’

mothur > system(cp stability.files example.files)

Note: If you were running on a Windows PC, you would use copy instead of cp since the system command calls upon the operating system of the computer. For this SOP, we are all running on Linux on sumo

Now, let’s begin our data clean-up by combining R1 and R2 with make.contigs. Please wait for instructions on how many processors to use as this will vary depending on the number of people attending.

make.contigs

mothur > make.contigs(file=example.files, processors=2)

>>>>>   Processing file pair /home/BIOTECH/student01/miseq_sop/L62.1_S149_L001_R1_001.fastq - /home/BIOTECH/student01/miseq_sop/L62.1_S149_L001_R2_001.fastq (files 1 of 2) <<<<<
Making contigs...

[truncated]

Done.

It took 13 secs to assemble 15331 reads.

>>>>>   Processing file pair /home/BIOTECH/student01/miseq_sop/S62.1_S27_L001_R1_001.fastq - /home/BIOTECH/student01/miseq_sop/S62.1_S27_L001_R2_001.fastq (files 2 of 2)        <<<<<
Making contigs...

[truncated]

Done. 

It took 115 secs to assemble 130107 reads.


Group count:
L62.1   15331
S62.1   130107

Total of all groups is 145438

It took 128 secs to process 145438 sequences.

Output File Names:
/home/BIOTECH/student01/miseq_sop/example.trim.contigs.fasta
/home/BIOTECH/student01/miseq_sop/example.trim.contigs.qual
/home/BIOTECH/student01/miseq_sop/example.scrap.contigs.fasta
/home/BIOTECH/student01/miseq_sop/example.scrap.contigs.qual
/home/BIOTECH/student01/miseq_sop/example.contigs.report
/home/BIOTECH/student01/miseq_sop/example.contigs.groups

This command will also produce several files that you will need down the road: example.trim.contigs.fasta and example.contigs.groups. These contain the sequence data and group identity for each sequence.

The example.contigs.report file will tell you something about the contig assembly for each read or why it failed. The fileexample.scrap.contigs.fasta contains sequences that fail to make contigs. The file example.contigs.groups contains a list of sequence names (given by the machine) and which sample they belong to.

We should now take a look at our data. This is done repeatedly in this SOP using ‘summary.seqs’

Note: mothur is smart enough to remember that we used “x”" number of processors in make.contigs and so it will use that throughout your current session unless you specify otherwise.

summary.seqs

mothur > summary.seqs(fasta=example.trim.contigs.fasta)

Using 2 processors.

                Start   End     NBases  Ambigs  Polymer NumSeqs
Minimum:        1       248     248     0       3       1
2.5%-tile:      1       252     252     0       3       3636
25%-tile:       1       253     253     0       3       36360
Median:         1       253     253     0       4       72720
75%-tile:       1       253     253     0       4       109079
97.5%-tile:     1       254     254     5       5       141803
Maximum:        1       495     495     81      126     145438
Mean:   1       252     252     0       3
# of Seqs:      145438

It took 2 secs to summarize 145438 sequences.

Output File Names:
 /home/BIOTECH/student01/miseq_sop/example.trim.contigs.summary

This output tells us that we have 145,438 sequences that are ~253 bases long.

Things to look out for

The longest read in the data set is 495 bp. Recall that the reads are supposed to be ~250 bp each. This read clearly didn’t assemble well (or at all).
~2.5% of our sequences had some ambiguous base calls (i.e. N).
~2.5% of our sequences have long homopolymers (i.e. AAAAAAAAA)

We’ll take care of these issues in the next step when we run screen.seqs.

Remove poor quality

This implementation of the command will remove sequences with any ambiguous bases, anything longer than 300 bp, and anything with a homopolymer longer than 8 bases.

screen.seqs

mothur > screen.seqs(fasta=example.trim.contigs.fasta, group=example.contigs.groups, maxambig=0, maxlength=300, maxhomop=8)

[truncated]

72669
72769

It took 2 secs to screen 145438 sequences, removed 19898.

/******************************************/
Running command: remove.seqs(accnos=/home/BIOTECH/student01/miseq_sop/example.trim.contigs.bad.accnos, group=/home/BIOTECH/student01/miseq_sop/example.contigs.groups)
Removed 19898 sequences from your group file.

Output File Names:
/home/BIOTECH/student01/miseq_sop/example.contigs.pick.groups

/******************************************/

Output File Names:
/home/BIOTECH/student01/miseq_sop/example.trim.contigs.bad.accnos
/home/BIOTECH/student01/miseq_sop/example.trim.contigs.good.fasta
/home/BIOTECH/student01/miseq_sop/example.contigs.good.groups


It took 3 secs to screen 145438 sequences.

Alternate command: There’s another way to run this using the output from summary.seqs. This may be faster because the summary.seqs output file already has the number of ambiguous bases and sequence length calculated for your sequences.

mothur > screen.seqs(fasta=example.trim.contigs.fasta, group=example.contigs.groups, summary=example.trim.contigs.summary, maxambig=0, maxlength=300, maxhomop=8)

Using “current”

Now that we’ve created some files in mothur, we can ask mothur what it knows about our current session.

get.current

mothur > get.current()

Current RAM usage: 0.0561943 Gigabytes. Total Ram: 354.979 Gigabytes.

Current files saved by mothur:
accnos=/home/BIOTECH/student01/miseq_sop/example.trim.contigs.bad.accnos
fasta=/home/BIOTECH/student01/miseq_sop/example.trim.contigs.good.fasta
group=/home/BIOTECH/student01/miseq_sop/example.contigs.good.groups
qfile=/home/BIOTECH/student01/miseq_sop/example.trim.contigs.qual
contigsreport=/home/BIOTECH/student01/miseq_sop/example.contigs.report
processors=2
summary=/home/BIOTECH/student01/miseq_sop/example.trim.contigs.summary
file=/home/BIOTECH/student01/miseq_sop/stability.files

Current input directory saved by mothur: /home/BIOTECH/student01/miseq_sop/

Current default directory saved by mothur: /mnt/software/mothur/default/

Current working directory: /home/BIOTECH/student01/miseq_sop/

Output File Names:
current_files.summary

What this means is that mothur remembers your latest fasta and group file as well as the number of processors you are running on. This is useful because you can shorten commands by telling mothur to use the current files. So all of the following are equivalent at this point in analysis and you would get the same output for each.

mothur > summary.seqs(fasta=example.trim.contigs.good.fasta)
mothur > summary.seqs(fasta=current)
mothur > summary.seqs()

Using 2 processors.

                Start   End     NBases  Ambigs  Polymer NumSeqs
Minimum:        1       250     250     0       3       1
2.5%-tile:      1       252     252     0       3       3139
25%-tile:       1       253     253     0       3       31386
Median:         1       253     253     0       4       62771
75%-tile:       1       253     253     0       4       94156
97.5%-tile:     1       254     254     0       5       122402
Maximum:        1       294     294     0       8       125540
Mean:   1       252     252     0       3
# of Seqs:      125540

It took 3 secs to summarize 125540 sequences.

Output File Names:
 /home/BIOTECH/student01/miseq_sop/example.trim.contigs.good.summary

Note: If you exit out of mothur and then re-start, it will not remember the current files from your last session.

For the purposes of this tutorial, we will write out the names of the files for clarity.

Unique sequences

At this point, our sequencing error rate has probably dropped more than an order of magnitude and we have 125,540 sequences We anticipate that many of our sequences are duplicates of each other. Because it’s computationally wasteful to process the same thing a bazillion times, we’ll select only unique sequences using the unique.seqs command. This way, the computer can analyze one sequence and then apply that outcome to every identical sequence in the data set.

unique.seqs

mothur > unique.seqs(fasta=example.trim.contigs.good.fasta)

[truncated]

125000  20397
125540  20469

Output File Names:
/home/BIOTECH/student01/miseq_sop/example.trim.contigs.good.names
/home/BIOTECH/student01/miseq_sop/example.trim.contigs.good.unique.fasta

If two sequences have the same identical sequence, then they’re considered duplicates and will get merged.

In the screen output there are two columns:

the first is the number of sequences characterized and
the second is the number of unique sequences remaining.

So after running unique.seqs we have gone from 125,540 total to 20,469 unique sequences. This will make our life much easier moving forward.

Simplify names and groups with counts

Another thing to do to make our lives easier is to simplify the names and group files. mothur carefully keeps track of the name (i.e. sequence ID representing a set of uniques, .names file) and group (i.e. sample name, .groups file) for every single sequence. To make later commands shorter, we combine the data in these files into a single “count” file (.count_table).

So we’ll run count.seqs to generate a table where the rows are the names of the unique sequences and the columns are the names of the groups. The table is then filled with the number of times each unique sequence shows up in each group.

count.seqs

mothur > count.seqs(name=example.trim.contigs.good.names, group=example.contigs.good.groups)

It took 1 secs to create a table for 125540 sequences.

Total number of sequences: 125540

Output File Names:
/home/BIOTECH/student01/miseq_sop/example.trim.contigs.good.count_table

This will generate a file called example.trim.contigs.good.count_table. In subsequent commands, we’ll use this file with the count= option.

Let’s look at another summary of our data using this count table.

summary.seqs with count_table

mothur > summary.seqs(count=example.trim.contigs.good.count_table)

Using /home/BIOTECH/student01/miseq_sop/example.trim.contigs.good.unique.fasta as input file for the fasta parameter.

Using 2 processors.

                Start   End     NBases  Ambigs  Polymer NumSeqs
Minimum:        1       250     250     0       3       1
2.5%-tile:      1       252     252     0       3       3139
25%-tile:       1       253     253     0       3       31386
Median:         1       253     253     0       4       62771
75%-tile:       1       253     253     0       4       94156
97.5%-tile:     1       254     254     0       5       122402
Maximum:        1       294     294     0       8       125540
Mean:   1       252     252     0       3
# of unique seqs:       20469
total # of seqs:        125540

It took 0 secs to summarize 125540 sequences.

Output File Names:
 /home/BIOTECH/student01/miseq_sop/example.trim.contigs.good.unique.summary

We see that now, by including a fasta and count_table, we see “# of unique seqs” and “total # of seqs”. We would get the same result with summary.seqs(name=example.trim.contigs.good.names, group=example.contigs.good.groups).

Alignment

Now we need to align our sequences to a reference alignment using align.seqs. Your sequences aren’t just random strings of nucleotides; they are part of a known gene (usually 16S or 18S). For ITS, SEE NOTE BELOW

We have two options for 16S alignment: Silva or GreenGenes. Only use Silva for alignment. GreenGenes contains incomplete (not full length) 16S sequences which make alignment difficult and error-prone. These databases are formatted specifically for mothur and can be found on the mothur wiki (Silva: https://www.mothur.org/wiki/Silva_reference_files, GreenGenes: https://www.mothur.org/wiki/Greengenes-formatted_databases).

We will also improve our alignment by including flip=T so that mothur checks both the forward and reverse complement of our sequences.

The first time you work with an amplicon (i.e. primer set), it’s a good idea to align to the full database to find out where your sequences align to. This is demonstrated below for V4 but should not be run during the workshop as it will take a long time and use a lot of RAM.

Instead, we will start aligning to the pre-made custom reduced alignment database you transferred earlier (silva.v4.v132.align).

align.seqs

mothur > align.seqs(fasta=example.trim.contigs.good.unique.fasta, reference=silva.v4.v132.align, flip=T)

Using 2 processors.

Reading in the silva.v4.v132.align template sequences...
    DONE.
It took 72 to read  213119 sequences.
Aligning sequences from example.trim.contigs.good.unique.fasta ...
100
100

[truncated]

10227
10242
It took 323 secs to align 20469 sequences.


Output File Names:
example.trim.contigs.good.unique.align
example.trim.contigs.good.unique.align.report

Note: For fungal sequencing, it is best to skip the alignment step here. The ITS is too variable in length to generate a good alignment. Instead, we do an internal “needleman” alignment later at the precluster step. So if you are working with a fungal data set, skip to precluster at this point.

———————————-DO NOT RUN the following block———————————-

Custom reduced database

Even with a reduced database, this may take several minutes to run. We will go over how this reduced database was created in the interim. First, we aligned to the full Silva database.

DO NOT RUN THIS

mothur > align.seqs(fasta=example.trim.contigs.good.unique.fasta, reference=silva.nr_v132.align, flip=T)

Reading in the /home/BIOTECH/student01/miseq_sop/silva.nr_v132.align template sequences...  DONE.
It took 506 to read  172418 sequences.
Aligning sequences from /home/BIOTECH/student01/miseq_sop/example.trim.contigs.good.unique.fasta ...
It took 363 secs to align 20469 sequences.


Output File Names: 
/home/BIOTECH/student01/miseq_sop/example.trim.contigs.good.unique.align
/home/BIOTECH/student01/miseq_sop/example.trim.contigs.good.unique.align.report

Then we summary.seqs to see where we aligned to:

DO NOT RUN THIS

mothur > summary.seqs(fasta=example.trim.contigs.good.unique.align, count=example.trim.contigs.good.count_table)

                 | Start   | End     | NBases  | Ambigs   | Polymer | NumSeqs
---------------- | ------- | ------- | ------- | -------- | ------- | -------
Minimum:         | 13125   | 23439   | 250     | 0        | 3       | 1
2.5%-tile:       | 13862   | 23444   | 252     | 0        | 3       | 3139
25%-tile:        | 13862   | 23444   | 253     | 0        | 3       | 31386
Median:          | 13862   | 23444   | 253     | 0        | 4       | 62771
75%-tile:        | 13862   | 23444   | 253     | 0        | 4       | 94156
97.5%-tile:      | 13862   | 23444   | 254     | 0        | 5       | 122402
Maximum:         | 13875   | 26147   | 294     | 0        | 8       | 125540
Mean:            | 13862   | 23444.1 | 252.889 | 0        | 3.82739 | 
# of Seqs:       | 20469
total # of seqs: | 125540

So what does this mean?

You’ll see that the bulk of the sequences start at position 13862 (Start column) and end at position 23444 (End column). But some sequences start or end at other places. These deviants are likely due to an insertion or deletion at the terminal ends of the alignments. Sometimes you’ll see sequences that start and end at the same position indicating a very poor alignment, which is generally due to non-specific amplification.

It is these Start and End values that we need to make a custom reduced database. These values are different for different amplicons/primers and sometimes for different versions/releases of a database. So, once you know them for your amplicon and a given version of Silva, you won’t have to do this step again.

We make a database customized to our region of interest using the pcr.seqs command. To run this command, you need to have the full reference database (i.e. silva.nr_v132.align) and know where in that alignment your amplicon starts and ends (from the above summary). To remove the leading and trailing dots, we will set keepdots to false.

DO NOT RUN THIS

mothur > pcr.seqs(fasta=silva.nr_v132.align, start=13862, end=23444, keepdots=F)

Output File Names: 
/home/BIOTECH/student01/miseq_sop/silva.nr_v132.pcr.align

We would then rename this file so we know which region we selected

DO NOT RUN THIS

mothur > system(mv silva.nr_v132.pcr.align silva.v4.v132.align)

Note: Windows users should use “rename” rather than “mv”" since the system command calls upon the operating system of the computer. For this SOP, we are all running on Linux on sumo

Now we have a customized reduced reference alignment to align our sequences to. The nice thing about this reference is that instead of being 50,000 columns wide, it is now 13,425 columns wide which will save our hard drive some space and improve the overall alignment quality and efficiency.

silva.v4.v132.align is the database we started an alignment to earlier. Once that is done, we summary.seqs again.

—————————————————————————–

Post-alignment summary

summary.seqs

mothur > summary.seqs(fasta=example.trim.contigs.good.unique.align, count=example.trim.contigs.good.count_table)

Using 2 processors.

                Start   End     NBases  Ambigs  Polymer NumSeqs
Minimum:        8       9577    231     0       3       1
2.5%-tile:      8       9582    251     0       3       3139
25%-tile:       8       9582    252     0       3       31386
Median:         8       9582    252     0       4       62771
75%-tile:       8       9582    252     0       4       94156
97.5%-tile:     8       9582    253     0       5       122402
Maximum:        13      9582    256     0       8       125540
Mean:   8       9581    251     0       3
# of unique seqs:       20469
total # of seqs:        125540

It took 5 secs to summarize 125540 sequences.

Output File Names:
 example.trim.contigs.good.unique.summary

Rescreen

To make sure that everything overlaps the same region we’ll re-run screen.seqs. Note that we need to include the count table in this command so that we can update the table for the sequences we’re removing.

We’re also using the summary file we just made so we don’t have to figure out all the start and stop positions again. To this end, we will specify the following within the command: start=8, end=9582. These numbers are very different from those using the full Silva database because the numbering starts from the 5 prime end and we’ve removed a bunch of bases from that end with pcr.seqs

screen.seqs post alignment

mothur > screen.seqs(fasta=example.trim.contigs.good.unique.align, count=example.trim.contigs.good.count_table, summary=example.trim.contigs.good.unique.summary, start=8, end=9582)

Using 2 processors.
1000
1000

[truncated]

10234
10235

It took 3 secs to screen 20469 sequences, removed 181.

/******************************************/
Running command: remove.seqs(accnos=example.trim.contigs.good.unique.bad.accnos, count=example.trim.contigs.good.count_table)
Removed 255 sequences from your count file.

Output File Names:
example.trim.contigs.good.pick.count_table

/******************************************/

Output File Names:
example.trim.contigs.good.unique.bad.accnos
example.trim.contigs.good.unique.good.summary
example.trim.contigs.good.unique.good.align
example.trim.contigs.good.good.count_table


It took 5 secs to screen 20469 sequences.

And we summarize to see how re-screening impacted the data.

summary.seqs

mothur > summary.seqs(fasta=example.trim.contigs.good.unique.good.align, count=example.trim.contigs.good.good.count_table)

Using 2 processors.

                Start   End     NBases  Ambigs  Polymer NumSeqs
Minimum:        8       9582    231     0       3       1
2.5%-tile:      8       9582    251     0       3       3133
25%-tile:       8       9582    252     0       3       31322
Median:         8       9582    252     0       4       62643
75%-tile:       8       9582    252     0       4       93964
97.5%-tile:     8       9582    253     0       5       122153
Maximum:        8       9582    256     0       8       125285
Mean:   8       9582    251     0       3
# of unique seqs:       20288
total # of seqs:        125285

It took 5 secs to summarize 125285 sequences.

Output File Names:
 example.trim.contigs.good.unique.good.summary

Filter

Now that we know our sequences overlap the same alignment coordinates, we also want to make sure they only overlap that region. We’ll filter the sequences to remove the overhangs at both ends with vertical=T. Since we’ve done paired-end sequencing, this shouldn’t be much of an issue.

Also, when aligning, mothur puts in a lot of spaces and dots to show insertions and deletions compared to the alignment database. Therefore, there are many columns in the alignment that only contain gap characters (i.e. -). This is confusing to later commands (and causes them to fail) so we remove them all with trump=..

filter.seqs

mothur > filter.seqs(fasta=example.trim.contigs.good.unique.good.align, vertical=T, trump=.)

Using 2 processors.
Creating Filter...
1000
1000

[truncated]

10144
10144
It took 6 secs to create filter for 20288 sequences.


Running Filter...
1000
1000

[truncated]

10144
10144
It took 3 secs to filter 20288 sequences.



Length of filtered alignment: 416
Number of columns removed: 9166
Length of the original alignment: 9582
Number of sequences used to construct filter: 20288

Output File Names:
example.filter
example.trim.contigs.good.unique.good.filter.fasta

This means that our initial alignment was 9582 columns wide and that we were able to remove 9166 terminal gap characters using trump=. and vertical=T. The final alignment length is 416 columns.

Because we’ve perhaps created some redundancy across our sequences by trimming the ends, we can re-run unique.seqs

unique.seqs again

mothur > unique.seqs(fasta=example.trim.contigs.good.unique.good.filter.fasta, count=example.trim.contigs.good.good.count_table)

1000    1000
2000    1998

[truncated]

20000   19925
20288   20210

Output File Names:
example.trim.contigs.good.unique.good.filter.count_table
example.trim.contigs.good.unique.good.filter.unique.fasta

This identified 78 duplicate sequences that we’ve now merged with previous unique sequences.

Pre-cluster

The next thing we want to do to further de-noise our sequences is to pre-cluster the sequences that are very similar. We’ll use the pre.cluster command allowing for up to 2 differences between sequences by adding diffs=2 to the command.

This command will split the sequences by group and then sort them by abundance and go from most abundant to least and identify sequences that are within 2 nt of each other. If they are, then they get merged.

We generally favor allowing 1 difference for every 100 bp of sequence, hence a 250bp amplicon is diffs=2:

pre.cluster

mothur > pre.cluster(fasta=example.trim.contigs.good.unique.good.filter.unique.fasta, count=example.trim.contigs.good.unique.good.filter.count_table, diffs=2)

Using 2 processors.

Processing group L62.1:
L62.1   0       2625    34

Processing group S62.1:
L62.1   100     1994    665
L62.1   200     1825    834

[truncated]

L62.1   2600    1310    1349
L62.1   2659    1309    1350
Total number of sequences before pre.cluster was 2659.
pre.cluster removed 1350 sequences.

It took 0 secs to cluster 1350 sequences.
S62.1   100     12301   6436
S62.1   200     10610   8127

[truncated]

S62.1   18700   5010    13727
S62.1   18737   5010    13727
Total number of sequences before pre.cluster was 18737.
pre.cluster removed 13727 sequences.

It took 10 secs to cluster 13727 sequences.
It took 12 secs to run pre.cluster.

Output File Names:
example.trim.contigs.good.unique.good.filter.unique.precluster.fasta
example.trim.contigs.good.unique.good.filter.unique.precluster.count_table
example.trim.contigs.good.unique.good.filter.unique.precluster.L62.1.map
example.trim.contigs.good.unique.good.filter.unique.precluster.S62.1.map

Note: For fungal data, you should include “align=needleman” here in the precluster command to do the internal alignment

summary.seqs

mothur > summary.seqs(fasta=example.trim.contigs.good.unique.good.filter.unique.precluster.fasta, count=example.trim.contigs.good.unique.good.filter.unique.precluster.count_table)

Using 2 processors.

                Start   End     NBases  Ambigs  Polymer NumSeqs
Minimum:        1       416     231     0       3       1
2.5%-tile:      1       416     251     0       3       3133
25%-tile:       1       416     252     0       3       31322
Median:         1       416     252     0       4       62643
75%-tile:       1       416     252     0       4       93964
97.5%-tile:     1       416     253     0       5       122153
Maximum:        1       416     256     0       8       125285
Mean:   1       416     251     0       3
# of unique seqs:       5620
total # of seqs:        125285

It took 0 secs to summarize 125285 sequences.

Output File Names:
 example.trim.contigs.good.unique.good.filter.unique.precluster.summary

Note that we’ve gone from 20,208 to 5,614 unique sequences but our total # of sequences has not changed. This is because no sequences were removed; they were just grouped similarly to unique.seqs but allowing up to 2 differences.

It is not uncommon to drastically reduce your unique sequence count with this command. The MiSeq is good but not perfect.

At this point we have removed as much sequencing error as we can, and it is time to turn our attention to removing chimeras.

Chimeras

Chimeras occur when the polymerase falls off one sequence and then reattaches to another during a single round of PCR. This results in a sequence that looks unique but is, in fact, just half of one and half of another e.g. not real.

We will remove chimeras with the UCHIME algorithm that is called within mothur using the chimera.uchime command.

Again, this command will split the data by sample and check for chimeras.

Our preferred way of doing this is to use the abundant sequences as our reference. In addition, if a sequence is flagged as chimeric in one sample, the default (dereplicate=F) is to remove it from all samples. Our experience suggests that this is a bit aggressive since we’ve seen rare sequences get flagged as chimeric when they’re the most abundant sequence in another sample.

chimera.uchime

mothur > chimera.uchime(fasta=example.trim.contigs.good.unique.good.filter.unique.precluster.fasta, count=example.trim.contigs.good.unique.good.filter.unique.precluster.count_table, dereplicate=t)

Using 2 processors.

uchime by Robert C. Edgar
http://drive5.com/uchime
This code is donated to the public domain.

Checking sequences from example.trim.contigs.good.unique.good.filter.unique.precluster.fasta ...
uchime v4.2.40
by Robert C. Edgar
http://drive5.com/uchime
This code is donated to the public domain.

uchime v4.2.40
by Robert C. Edgar
http://drive5.com/uchime
This code is donated to the public domain.

00:00  19Mb  100.0% Reading example.trim.contigs.good.unique.good.filter.unique.precluster.temp1.temp
00:00  19Mb 1309 sequences
00:00  20Mb  100.0% Reading example.trim.contigs.good.unique.good.filter.unique.precluster.temp1.temp
00:00  20Mb 5010 sequences
00:15 2.8Mb  100.0% 79/1308 chimeras found (6.0%)

It took 15 secs to check 1309 sequences from group L62.1.
02:44 5.2Mb  100.0% 1254/5009 chimeras found (25.0%)

It took 164 secs to check 5010 sequences from group S62.1.
It took 164 secs to check 6319 sequences.


Output File Names:
example.trim.contigs.good.unique.good.filter.unique.precluster.denovo.uchime.pick.count_table
example.trim.contigs.good.unique.good.filter.unique.precluster.denovo.uchime.chimeras
example.trim.contigs.good.unique.good.filter.unique.precluster.denovo.uchime.accnos

This command only identifies chimeras. We then need to remove them from the data using remove.seqs

remove.seqs

mothur > remove.seqs(fasta=example.trim.contigs.good.unique.good.filter.unique.precluster.fasta, count=example.trim.contigs.good.unique.good.filter.unique.precluster.count_table, accnos=example.trim.contigs.good.unique.good.filter.unique.precluster.denovo.uchime.accnos)

[NOTE]: The count file should contain only unique names, so mothur assumes your fasta, list and taxonomy files also contain only uniques.

Removed 1290 sequences from your fasta file.
Removed 3847 sequences from your count file.

Output File Names:
example.trim.contigs.good.unique.good.filter.unique.precluster.pick.fasta
example.trim.contigs.good.unique.good.filter.unique.precluster.pick.count_table

summary.seqs

mothur > summary.seqs(fasta=example.trim.contigs.good.unique.good.filter.unique.precluster.pick.fasta, count=example.trim.contigs.good.unique.good.filter.unique.precluster.pick.count_table)

Using 2 processors.

                Start   End     NBases  Ambigs  Polymer NumSeqs
Minimum:        1       416     241     0       3       1
2.5%-tile:      1       416     251     0       3       3036
25%-tile:       1       416     252     0       3       30360
Median:         1       416     252     0       4       60720
75%-tile:       1       416     252     0       4       91079
97.5%-tile:     1       416     253     0       5       118403
Maximum:        1       416     255     0       8       121438
Mean:   1       416     251     0       3
# of unique seqs:       4330
total # of seqs:        121438

It took 0 secs to summarize 121438 sequences.

Output File Names:
 example.trim.contigs.good.unique.good.filter.unique.precluster.pick.summary

Note that we went from 125,255 to 121,767 sequences for a reduction of 2.8%; this is a reasonable number of sequences to be flagged as chimeric. In the end, the percent chimeras is a result of the polymerase you used in PCR. A high-fidelity polymerase is less likely to fall off and create chimeras and fewer cycles also result in fewer chimeras.

Undesirables

As a final quality control step, we need to see if there are any “undesirables” in our data set. Even after all of this cleanup, you can still have “good” sequences that are actually contaminants.

Sometimes when we pick a primer set, they will amplify other stuff that gets to this point in the pipeline such as 18S rRNA gene fragments or 16S rRNA from Archaea, chloroplasts, and mitochondria.

There’s also just the random stuff that we want to get rid of. Now you may say, “But wait I want that stuff”. Fine. But, the primers we use, are only supposed to amplify members of the Bacteria and if they’re hitting Eukaryota or Archaea, then its a mistake. It is important to note that if you do keep these sequences, they do not represent the overall archaeal or eukaryotic community. It would be incorrect to use this data as a shortcut to not do an archaea or eukaryote specific MiSeq run.

You may also have other “undesirables” specific to your system of study. For example, the rumen is not home to any Cyanobacteria as it is a dark, anaerobic environment. However, we always get Cyanobacteria sequences back because unfortunately, chloroplasts (very abundant in the feed of herbivores!) look a lot like Cyanobacteria. So we remove them. Not every researcher does this but it is an accepted step as long as you note it in your methods section.

The first step is to classify your sequences so you know what to remove. Let’s go ahead and classify using the Bayesian classifier with the classify.seqs command. You can classify to either Silva or GreenGenes at this point and it will make very little difference. We will use Silva herefor bacterial sequences. We will also specific a cutoff of 80. This roughly means that we are 80%+ confident in our classification, a good thing if we’re going to remove groups. We want to be sure of what we’re removing.

classify.seqs

mothur > classify.seqs(fasta=example.trim.contigs.good.unique.good.filter.unique.precluster.pick.fasta, count=example.trim.contigs.good.unique.good.filter.unique.precluster.denovo.uchime.pick.count_table, reference=silva.v4.v132.align, taxonomy=silva.nr_v132.tax, cutoff=80)

Using 2 processors.
Generating search database...    DONE.
It took 78 seconds generate search database.

Reading in the silva.nr_v132.tax taxonomy...    DONE.
Calculating template taxonomy tree...     DONE.
Calculating template probabilities...     DONE.
It took 168 seconds get probabilities.
Classifying sequences from example.trim.contigs.good.unique.good.filter.unique.precluster.pick.fasta ...
100
100

[truncated]

500
500
[WARNING]: M03662_8_000000000-AK09K_1_2104_20318_19579 could not be classified. You can use the remove.lineage command with taxon=unknown; to remove such sequences.
600
600

[truncated]

2163
2167
It took 111 secs to classify 4330 sequences.


It took 111 secs to classify 4330 sequences.


It took 1 secs to create the summary file for 4330 sequences.


Output File Names:
example.trim.contigs.good.unique.good.filter.unique.precluster.pick.nr_v132.wang.taxonomy
example.trim.contigs.good.unique.good.filter.unique.precluster.pick.nr_v132.wang.tax.summary

Note: Silva and GreenGenes are the options provided, in correct format for use in alignment and classification in mothur, by Pat Schloss and the mothur developers. GreenGenes is appropriate for most well-characterized host-associated gut communities, but is poor for environmental data sets. If you are working with environmental data, it maybe a good idea to find or make an alternative database in order to get better classifications. See an example for lake systems created by the McMahon lab here on campus: https://github.com/McMahonLab/TaxAss/tree/master/FreshTrain-files.

Note: With most data sets, you will get warnings about things that could not be classified. These are labeled as domain ‘unknown’ if the classifier cannot classify the sequence to Bacteria, Archaea, or Eukaryota.

Now that everything is classified, we want to remove our undesirables. We do this with the remove.lineage command. We’ll just remove non-Bacteria domains and completely unclassifieds (“unknown”).

remove.lineage

mothur > remove.lineage(fasta=example.trim.contigs.good.unique.good.filter.unique.precluster.pick.fasta, count=example.trim.contigs.good.unique.good.filter.unique.precluster.denovo.uchime.pick.count_table, taxonomy=example.trim.contigs.good.unique.good.filter.unique.precluster.pick.nr_v132.wang.taxonomy, taxon=unknown;-Archaea;-Eukaryota;)

[NOTE]: The count file should contain only unique names, so mothur assumes your fasta, list and taxonomy files also contain only uniques.


Output File Names:
example.trim.contigs.good.unique.good.filter.unique.precluster.pick.nr_v132.wang.pick.taxonomy
example.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.fasta
example.trim.contigs.good.unique.good.filter.unique.precluster.denovo.uchime.pick.pick.count_table

summary.seqs

mothur > summary.seqs(fasta=example.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.fasta, count=example.trim.contigs.good.unique.good.filter.unique.precluster.denovo.uchime.pick.pick.count_table)

Using 2 processors.

                Start   End     NBases  Ambigs  Polymer NumSeqs
Minimum:        1       416     241     0       3       1
2.5%-tile:      1       416     251     0       3       3019
25%-tile:       1       416     252     0       3       30189
Median:         1       416     252     0       4       60378
75%-tile:       1       416     252     0       4       90566
97.5%-tile:     1       416     253     0       5       117736
Maximum:        1       416     255     0       8       120754
Mean:   1       416     251     0       3
# of unique seqs:       4330
total # of seqs:        120754

It took 0 secs to summarize 120754 sequences.

Output File Names:
 example.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.summary
total # of seqs: | 120538

At this point, we have cleaned-up our data as far as possible. Our file names are very long at this point, so let’s copy and rename them before moving onto some analysis.

Rename final files

mothur > system(cp example.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.fasta example.final.fasta)

mothur > system(cp example.trim.contigs.good.unique.good.filter.unique.precluster.denovo.uchime.pick.pick.count_table example.final.count_table)

Define OTUs

Now we can define operational taxonomic units (OTUs), the sequencing proxy for a microbial species. We do this by calculating distances between sequences (how difference they are from each other) with dist.seqs and then clustering these distances based on a difference cutoff with cluster.split. In general, cutoffs are

0.03 = 3% different or 97% similar ~ species
0.05 = 5% different or 95% similar ~ genus
0.1 = 10% different or 90% similar ~ family

Generate distance matrix

dist.seqs

mothur > dist.seqs(fasta=example.final.fasta)

Using 2 processors.

Sequence        Time    Num_Dists_Below_Cutoff
0       0       0
100     0       5050

[truncated]

4300    59      4563820
3000    61      4501500

It took 64 secs to find distances for 4330 sequences. 9372285 distances below cutoff 1.


Output File Names:
example.final.dist

Note: For fungal data, you should use pairwise.seqs instead of dist.seqs because your sequences are not aligned by align.seqs which is required by dist.seqs.

Note: There are a couple of tricks to decrease your distance matrix size for large data sets (I’ve seen V3-4 data get up to 1TB). You can set a “cutoff” in dist.seqs. This causes mothur to not save distance greater than the cutoff. For example, if you’re only going to look at genus and species level OTUs, you could set your cutoff to 0.1. You can also output in lower triangle form (output=lt), which only saves one distance for every pairwise comparison.

Cluster

For clustering, we have a number of options

nearest neighbor
furthest neighbor
average neighbor
opticlust

You can find full definitions of these methods here but the general view is that furthest is too strict (creates too many OTUs), nearest is too lenient (too few OTUs), and average is in the middle. Sort of a Goldilocks situation.

opticlust is a very new method developed by the Schloss group, and it appears to perform as good or better than average neighbor. For more information, see http://msphere.asm.org/content/2/2/e00073-17

We will use opti here, because it is the fastest and “best”. Ths output for opticlust is super long, so I am not going to include it below. But if it looks like chaos with lots of tabs and temp files, it is probably working fine.

cluster.split

mothur > cluster.split(column=example.final.dist, count=example.final.count_table, method=opti, cutoff=0.03)


[truncated]

Output File Names:
example.final.opti_mcc.list
example.final.opti_mcc.sensspec

Generate OTU matrix

Now that we have OTUs, we’d like to look at them, right? Well, distance and cluster files are not for human eyes. So we use make.shared to combine these data with our sample names to create a table we can understand. We will have mothur only give us the species-level OTUs label=0.03 but you could ask for any level that you like (as long as it’s below the cutoff you may have used in dist.seqs).

make.shared

mothur > make.shared(list=example.final.opti_mcc.list, count=example.final.count_table, label=0.03)

0.03

Output File Names:
example.final.opti_mcc.shared

Let’s look at it!!! This file has many, many columns and is difficult to look at in the Terminal. So, we will open the .shared file in Excel or a text editor. You’ll see that for each sample, we have how many times each OTU occurs in that sample.

label	Group	numOtus	Otu0001	Otu0002 …
0.03	L62.1	2562	1362	770 …
0.03	S62.1	2562	10015	9316 …

Classify OTUs

Now that you have all these OTUs, what are they? We find out by repeating classify.seqs on our final files and then grouping based on our OTUs with classify.otu. Here, we will use GreenGenes instead of Silva. Like classifying before, you can use either. In general, GreenGenes will give fewer unclassifieds and more classifications down to lower levels. But when you go to research those classifications, they could be from a clone library or other non-culturing technique so there is little you can say about that species. Silva is only full-length 16S so you have fewer sequences to compare to and get more unclassifieds. However, the classifications are more often to cultured species.

There are also specific taxonomic groups that are more or less represented in each database. This would be specific to your system and you’d have to classify to both to find out!

classify.seqs

mothur > classify.seqs(fasta=example.final.fasta, count=example.final.count_table, template=gg_13_8_99.fasta, taxonomy=gg_13_8_99.gg.tax, cutoff=80)

Using 2 processors.
Generating search database...    DONE.
It took 180 seconds generate search database.

Reading in the gg_13_8_99.gg.tax taxonomy...    DONE.
Calculating template taxonomy tree...     DONE.
Calculating template probabilities...     DONE.
It took 526 seconds get probabilities.
Classifying sequences from example.final.fasta ...
100
100

[truncated]

2167
2163
It took 99 secs to classify 4330 sequences.


It took 99 secs to classify 4330 sequences.


It took 0 secs to create the summary file for 4330 sequences.


Output File Names:
example.final.gg.wang.taxonomy
example.final.gg.wang.tax.summary

classify.otu

mothur > classify.otu(list=example.final.opti_mcc.list, taxonomy=example.final.gg.wang.taxonomy, count=example.final.count_table, label=0.03, cutoff=80, basis=otu, probs=F)

0.03    2562

Output File Names:
example.final.opti_mcc.0.03.cons.taxonomy
example.final.opti_mcc.0.03.cons.tax.summary

Looking at it.

mothur > system(head -5 example.final.opti_mcc.0.03.cons.taxonomy)

OTU     Size    Taxonomy
Otu0001 11377   k__Bacteria;p__Bacteroidetes;c__Bacteroidia;o__Bacteroidales;f__Prevotellaceae;g__Prevotella;s__ruminicola;
Otu0002 10086   k__Bacteria;p__Bacteroidetes;c__Bacteroidia;o__Bacteroidales;f__Prevotellaceae;g__Prevotella;s__ruminicola;
Otu0003 7550    k__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Aeromonadales;f__Succinivibrionaceae;f__Succinivibrionaceae_unclassified;f__Succinivibrionaceae_unclassified;
Otu0004 5270    k__Bacteria;p__Firmicutes;c__Clostridia;o__Clostridiales;f__Veillonellaceae;g__Succiniclasticum;g__Succiniclasticum_unclassified;

Now, the species level classifications are dicey. Because we don’t have full length 16S, we can’t say with certainty that OTU2 is Prevotella ruminicola. What we can say is that OTU2 is most closely related to Prevotella ruminicola, so if we were going to guess at OTU2’s function in the cow gut, data on Prevotella ruminicola would be a good place to start. And that’s not even getting into strain level diversity.

Coverage

How to you know you have enough sequences to see all/most of the diversity in your samples? Do you need to redo some samples? You can visualize this by rarefaction and calculate it with Good’s coverage.

Rarefaction

rarefaction.single

mothur > rarefaction.single(shared=example.final.opti_mcc.shared)

Processing group L62.1

0.03

Processing group S62.1

0.03

Output File Names:
example.final.opti_mcc.groups.rarefaction

This will produce a file called example.final.opti_mcc.groups.rarefaction. Looking at it…

mothur > system(head -15 example.final.opti_mcc.groups.rarefaction)

numsampled      0.03-L62.1      lci-L62.1       hci-L62.1       0.03-S62.1      lci-S62.1       hci-S62.1
1       1.0000  1.0000  1.0000  1.0000  1.0000  1.0000
100     68.2160 60.0000 77.0000 61.3720 52.0000 70.0000
200     118.8010        106.0000        130.0000        104.8190        93.0000 117.0000
300     161.3850        148.0000        174.0000        141.2600        128.0000        154.0000
400     198.5620        183.0000        215.0000        173.1300        158.0000        188.0000
500     231.6170        213.0000        248.0000        201.9080        184.0000        219.0000
600     261.3690        242.0000        279.0000        228.3600        211.0000        247.0000
700     288.4680        269.0000        307.0000        252.6850        234.0000        273.0000
800     313.3590        294.0000        332.0000        275.3760        255.0000        296.0000
900     336.4050        317.0000        356.0000        296.7510        276.0000        318.0000
1000    357.9280        339.0000        379.0000        316.5280        296.0000        338.0000
1100    378.2830        358.0000        399.0000        335.6720        314.0000        357.0000
1200    397.2660        377.0000        418.0000        353.6050        332.0000        375.0000
1300    415.2510        394.0000        437.0000        370.8000        349.0000        393.0000

it’s difficult to tell what’s going on. The best way to use this data is to graph it and look from the new unique sequences gained per ## of sequences added to level off. At a certain point, you need 1000s of sequences to get only 1 more unique sequence so it’s not worth it (and probably not even real)

For example (made in Excel)

However, this doesn’t numerically/statistically show you your coverage and is subject to axis-skewing. For example, our coverage looks pretty good in the above plot, but if we compress the x-axis, it doesn’t look nearly as good.

Thus, it is better to also calculate coverage with another method, like Good’s.

Good’s coverage

The metric is an estimate of the percent of diversity (species) captured in a sample by your current sequencing data. In general, we’re looking for 95% or better. We can also look at the number of sequence nseqs and number of unique OTUs sobs in each sample.

summary.single

mothur > summary.single(shared=example.final.opti_mcc.shared, label=0.03, calc=nseqs-sobs-coverage)

Processing group L62.1

0.03

Processing group S62.1

0.03

Output File Names:
example.final.opti_mcc.groups.summary

If we look at it,

mothur > system(head -3 example.final.opti_mcc.groups.summary)

label   group   nseqs   sobs    coverage
0.03    L62.1   12653.000000    930.000000      0.992176
0.03    S62.1   108101.000000   2290.000000     0.994422

we see that we have very good coverage of these samples, >99%

Normalization

There are two easy options for normalization in mothur. There is no “right” answer for normalization, and it is a rather hot topic in the field. You may find a number of other methods outside of mothur and even papers arguing against normalization at all.

In mothur, you can sub-sample down to your lowest number of sequences in a single sample with sub.sample.

sub.sample

mothur > sub.sample(shared=example.final.opti_mcc.shared)

Sampling 12653 from each group.
0.03

Output File Names:
example.final.opti_mcc.0.03.subsample.shared

Note: The default is to subsample to the number of sequences which is seen in the smallest sample. If we has a sample which did not sequence well and only had 10 reads, we would cut out all but 10 reads from every one of our samples using this method. In order to subsample to a specific number, 10000 for example, include “size=10000”.

Or you can calculate percent relative abundance (RA) (# OTU1 sequences/total # sequences in sample) and normalize to your lowest number of sequences in a single sample i.e. %RA * lowest # of sequences

normalize.shared

mothur > normalize.shared(shared=example.final.opti_mcc.shared, method=totalgroup)

Normalizing to 12653.
0.03

Output File Names:
example.final.opti_mcc.0.03.norm.shared

Note: The “method” parameter here is used if we simply want to normalize to the number of sequences which is seen in the smallest sample. If we has a sample which did not sequence well and only had 10 reads, we would cut out all but 10 reads from every one of our samples using this method. In order to normalize to a specific number,10000 for example, use “norm=10000” in place of “method=totalgroup”.

When we compare these two methods, we see some differences that can be easily viewed with summary.single

summary.single

mothur > summary.single(shared=example.final.opti_mcc.0.03.subsample.shared, label=0.03, calc=coverage-nseqs)

mothur > summary.single(shared=example.final.opti_mcc.0.03.norm.shared, label=0.03, calc=coverage-nseqs)

mothur > system(head -7 example.final.opti_mcc.0.03.subsample.groups.summary)

mothur > system(head -7 example.final.opti_mcc.0.03.norm.groups.summary)

label   group   coverage        nseqs
0.03    L62.1   0.992176        12653.000000
0.03    S62.1   0.962855        12653.000000

label   group   coverage        nseqs
0.03    L62.1   0.992176        12653.000000
0.03    S62.1   0.967848        12472.000000

We see that:

Sub-sampling takes all samples down to the exact same number of sequences while normalizing takes them near the same number. This occurs in normalization due to rounding, because .shared files only allow whole numbers. When normalizing to a high number, like >12000 here, this is not an issue as the difference is only ~1%. However, if normalizing to a lower number, this can cause the same problems with unequal sample coverage as we were trying to avoid by normalizing in the first place! In this case, I suggest normalizing outside of mothur in a program, like R, that would allow non-whole numbers in your OTU table.
Our coverage went down for the S62.1 sample. This is still above our goal of 95%, so no issues. However, if you were to normalize or sub-sample to too low of a number, you will have very poor coverage for more diverse samples. In this case, you should consider removing samples with the lowest number of sequences and normalizing/sub-sampling to a higher value.

Of note, the norm.shared table is the data type used for all R-based analyses in the R workshop.

Basic analysis

There are a number of functions in mothur to perform statistical and visual analysis of your microbiota data. However, mothur itself, cannot display figures and does not handle much manipulation of the data. So, most analyses should be done in a another program, like R, and this section is just to give you some tools to begin to explore your data in mothur.

We will focus on visuals here as statistical modeling is outside the scope of this tutorial and specific to a given data set. Please note that mothur’s figures are not the prettiest…

If you would like to explore statistical analyses in mothur, please see the commands page for:

Amova: Analysis of molecular variance
Anosim: Analysis of similarity
Cooccurence
Homova: Homogeneity of molecular variance
Kruskal.wallis
Otu.association
Parsimony
Unifrac, weighted and unweighted

Alpha-diversity

A lot of research is interested not in the specific microorganisms in a community but the overall diversity and richness of that community. This is within sample diversity. You can calculate all of the classic metrics in mothur. For a full list, see here.

The most important thing before calculating diversity is to normalize your data. No matter the actual community, a sample with 5000 sequences will appear more diverse than one with 3000 due to sequencing errors. This can be done using the norm or subsample data we made above or with the subsample=t modifier, which randomly subsamples a number of times and averages across those subsamples.

We will calculate alpha-diversity and richness using summary.single.

summary.single

mothur > summary.single(shared=example.final.opti_mcc.0.03.subsample.shared, label=0.03, calc=coverage-nseqs-bergerparker-chao-shannon)

mothur > summary.single(shared=example.final.opti_mcc.0.03.norm.shared, label=0.03, calc=coverage-nseqs-bergerparker-chao-shannon)

Output File Names: 
example.final.opti_mcc.0.03.subsample.groups.summary

Output File Names: 
example.final.opti_mcc.0.03.norm.groups.summary

Or using subsample=t

mothur > summary.single(shared=example.final.opti_mcc.shared, label=0.03, subsample=t, calc=coverage-nseqs-bergerparker-chao-shannon)

Output File Names: 
example.final.opti_mcc.groups.ave-std.summary
example.final.opti_mcc.groups.summary

When we compare alpha-diversity with the two normalization methods…

mothur > system(head -7 example.final.opti_mcc.0.03.subsample.groups.summary)

mothur > system(head -7 example.final.opti_mcc.0.03.norm.groups.summary)

mothur > system(head -7 example.final.opti_mcc.groups.summary)

label   group   coverage        nseqs   bergerparker    chao    chao_lci        chao_hci        shannon shannon_lci      shannon_hci
0.03    L62.1   0.992176        12653.000000    0.107642        957.100559      945.568189      977.175705       5.398214        5.364433        5.431995
0.03    S62.1   0.962855        12653.000000    0.097289        1789.945783     1644.576220     1976.065980      5.073862        5.036851        5.110874

label   group   coverage        nseqs   bergerparker    chao    chao_lci        chao_hci        shannon shannon_lci      shannon_hci
0.03    L62.1   0.992176        12653.000000    0.107642        957.100559      945.568189      977.175705       5.398214        5.364433        5.431995
0.03    S62.1   0.967848        12472.000000    0.093970        1510.869565     1411.207450     1640.074563      5.055335        5.018381        5.092289

label   group   coverage        nseqs   bergerparker    chao    chao_lci        chao_hci        shannon shannon_lci      shannon_hci
0.03    L62.1   0.992176        12653.000000    0.107642        957.100559      945.568189      977.175705       5.398214        5.364433        5.431995
0.03    S62.1   0.994422        108101.000000   0.092645        2823.832353     2723.090473     2948.007965      5.166903        5.153476        5.180330

we see that richness (chao) is more effected by normalization method than diversity (shannon).

Also, mothur gives us the low confidence interval (lci) and high confidence interval (hci) for our alpha-metrics. These could be used as error bars, if desired.

Beta-diversity

This is between sample diversity meaning every pairwise comparison of two samples has a unique value. You can visualize beta-diversity in a number of ways.

PCA

For principle component analysis (PCA), an overall microbial community is represented by a single point in an xy- or xyz-plane. Two points that are closer to each other indicate that the overall microbiota of those two samples are more similar than two points that are farther apart.

PCA fits axes to the data until 100% of the variation is explained. Thus, you do not specify the number of axes to calculate; you just plot xy or xyz after reviewing how much adding the z-axis adds to the representation of the data. PCA is calculated from the shared file.

This is not going to run properly because there is too little variation in our two-pont data set to properly calculate a PCA. So we are not going to run it because it takes forever to tell us it can’t complete. Theoretically, the output would be as shown below.

pca

mothur > pca(shared=example.final.opti_mcc.0.03.norm.shared, label=0.03)

Processing 0.03
Rsq 1 axis: 0
Rsq 2 axis: 0
Rsq 3 axis: 0

Output File Names: 
example.final.opti_mcc.0.03.norm.0.03.pca.axes
example.final.opti_mcc.0.03.norm.0.03.pca.loadings

The outputs are

axes: the coordinate values for each point (sample) up to N axes
loadings: the percent of variation in the data explained by each axis

Making the plot (Excel), we have only 2 samples in this tutorial so we only need 1 axis to explain and separate our data. This will not occur with full data sets.

(This plot was made from a previous version of this tutorial in which PCA did run properly.)

nMDS

nMDS is similar to PCA in that it assigns a single xy(z) point to each sample relative to all other samples. nMDS is more robust than PCA, however, because it permutes through calculations a number of times to find a best fit. This fit is constrained within the number of axes you specify.

mothur currently does not have a function for calculating nMDS for samples. The current nmds function calculates a point for each OTU in a plane. However, this method is something to consider when analyzing data outside of mothur.

Heatmaps of samples

You can visualize beta-diversity metrics by heatmaps with heatmap.sim. Each box is a pairwise comparison between two samples. mothur has several beta-diversity calculator options. Here, we will use the Bray-Curtis metric.

heatmap.sim

mothur > heatmap.sim(shared=example.final.opti_mcc.0.03.norm.shared, label=0.03, calc=braycurtis)

0.03

Output File Names:
example.final.opti_mcc.0.03.norm.0.03.braycurtis.heatmap.sim.svg

Trees of samples

You can easily make a tree where each node is an individual sample in mothur using tree.shared. We will run it with our normalized shared file and Bray-Curtis here.

tree.shared

mothur > tree.shared(shared=example.final.opti_mcc.0.03.norm.shared, label=0.03, calc=braycurtis)

Using 2 processors.
0.03
1
1

Output File Names:
example.final.opti_mcc.0.03.norm.braycurtis.0.03.tre

If we open it in a program such as FigTree:

OTU analyses

Trees of OTUs

mothur does not have an easy way to create a tree where the nodes are OTUs. In order to make a tree of your OTUs outside of mothur, you need a representative sequence for each OTU. Otherwise, you end up with many more branches then there are OTUs (i.e. species) in your data set. You then calculate a distance matrix of these representative OTU sequences and calculate a tree from this in R or another program. This sort of tree is used for UniFrac beta-diversity calculators.

We will not actually run this analysis in this workshop as it is RAM and time intensive. However, if we were to extract representative sequences, we would use get.oturep. The first option is to have mothur calculate the best possible representative sequence for each OTU. This is the sequence with smallest maximum distance to the other sequences in an OTU group.

DO NOT RUN THIS

mothur > get.oturep(column=example.final.dist, list=example.final.opti_mcc.list, fasta=example.final.fasta, count=example.final.count_table, label=0.03, method=distance, large=T)

Or you can have mothur just choose the most abundant unique sequence within each OTU group as the representative. This representative sequence may not be the best possible one for all uniques within an OTU, but this runs much faster and with much less RAM than the other method.

DO NOT RUN THIS

mothur > get.oturep(list=example.final.opti_mcc.list, fasta=example.final.fasta, count=example.final.count_table, label=0.03, method=abundance)

These representative sequences have long, horrible names given to them by the MiSeq. We rename them based on the OTU# using a custom python script, courtesy of Tony Neumann (Suen lab). I usually run this locally (not on a server) because it is quick and easy. So step one is to download the rep.fasta file generated above onto your local drive. Then, copy the below python code into a text file and save it as something useful, like “clean_repFasta.py”.

import sys
import re

#an empty list to hold the lines, as strings, of the input fasta file
info = []

#iterate through the input file and file up the list
with open(sys.argv[1], 'r') as fasta_in:
    for line in fasta_in:
        info.append(line.replace('\r', '').rstrip('\n'))

#an empty list to hold the OTU fasta seqs and their IDs as tuples
seqs = []

#iterate through the info list, replace the complex seq IDs with just the OTU ID, and fill up the seqs list
for index, item in enumerate(info):
    if index == 0:
        ID = re.search('Otu[0-9]*',item[1:]).group(0)
    else:
        if index % 2 == 0:
            ID = re.search('Otu[0-9]*',item[1:]).group(0)
        else:
            seqs.append((ID, item))

#write the contents of the seqs list to a new file called "clean_repFasta.fasta"
with open("clean_repFasta.fasta", 'w') as fasta_out:
    for item in seqs:
        fasta_out.write('>' + item[0] + '\n' + item[1] + '\n')

To run this, first dowload Python appropriate for your OS. Then, open the command prompt on your computer and navigate to the folder where your fasta file is saved. If my python script and my fasta file are both in a Desktop folder called “demo” on my Windows laptop, that would look like this:

C:\Users\Madison>cd Desktop/demo/
  
C:\Users\Madison\Desktop\demo>clean_repFasta.py example.final.opti_mcc.0.03.rep.fasta

This outputs a new fasta called clean_repFasta.fasta. Then, we would calculate the distance matrix with the new fasta. We definitely what to use the lower triangle lt output here so our matrix is small enough to read into R later on.

DO NOT RUN THIS

mothur > dist.seqs(fasta=clean_repFasta.fasta, output=lt)

You would then load this distance matrix into R or another program to calculate and visualize the tree.

Heatmaps of OTUs

Similar to the heatmap of samples, we can do a heatmap for OTUs. Instead of pairwise comparisons, each box represents the relative abundance of an OTU in a sample. You can scale the abundance to make differences and low abundance OTUs easier to see. We will use the default log10 scale and only visualize the 10 most abundant OTUs.

heatmap.bin

mothur > heatmap.bin(shared=example.final.opti_mcc.0.03.norm.shared, label=0.03, scale=log10, numotu=10)

0.03

Output File Names:
example.final.opti_mcc.0.03.norm.0.03.heatmap.bin.svg

Venns

Venn diagrams can show you which OTUs are shared by samples. mothur can do 2, 3, or 4 sample Venns.

Note: If you have more than 4 samples, you can use the permute parameter to have mothur create all Venn groups of 2, 3, or 4 between all of your samples.

venn

mothur > venn(shared=example.final.opti_mcc.0.03.norm.shared, label=0.03, calc=sharedsobs)

0.03

Output File Names:
example.final.opti_mcc.0.03.norm.0.03.sharedsobs.L62.1-S62.1.svg

So your distance matrix is too large…

Large data sets (many samples) and those with high error (V3-4) can create very large .dist files on the order of terabytes. This can prevent downstream analysis like clustering because the distance matrix is too large to read into your available RAM. And even when things do run, you may not want to wait days for a step to complete. If this happens, there are a number of steps you can take to reduce the distance matrix.

Solutions that do not remove sequences

dist.seqs

If you’re only going to look at genus (0.05) and/or species (0.03) OTUs in the end, set cutoff=0.05 for dist.seqs so it doesn’t save values larger than that.

You can also save the output in lower triangle format so that only one value is saved for every pairwise comparison between sequences. This is done with output=lt in dist.seqs. If you do this, the output will be example.final.phylip.dist and you will need to alter cluster.split to be phylip=example.final.phylip.dist instead of using column=.

cluster.split

Run cluster with large=true to tell mothur the distance matrix is too large to read into RAM all at once.

pre.cluster

If you have a longer amplicon, less overlap of the two sequences that are combined into contigs, and/or a high error rate in sequencing, it is worth increasing pre.cluster to diffs=3 or 4. This will group more sequences together and while it does not remove any data, you will end up with fewer uniques and therefore, a smaller distance matrix.

Solutions that do remove sequences

split.abund

This function can be used to remove very low abundance sequences. It creates a rare and abund data set. When cutoff=1, the abund data set includes all sequences that are present in the overall data set in 2 or more copies. The rare data set is often referred to as singletons. Removing singletons (or doubletons, etc) dramatically reduces your unique sequences without strongly impacting beta-diversity. However, alpha-diversity is effected by singletons so please be careful to compare values only across data sets that have been handled in a similar way. This command would be used as:

mothur > split.abund(fasta=example.final.fasta, count=example.final.count_table, cutoff=1)

Batch mode

Now that you know what you are doing, and why, how do you avoid ever having to type all of these ever again? mothur is able to accept batch files which contain a list of all of the commands you want to feed it. Here is a link to the Schloss lab SOP. The zip file with their data includes a batch file which you could adapt: https://www.mothur.org/wiki/miseq_sop

First, let’s get out of mothur…

mothur > quit()

… then, let’s copy over our mock batch file.

$ cp /mnt/software/workshop/data/MOTHUR/batch.txt ~/miseq_sop

Now, let’s look at the batch file. It contains three commands which should run fairly quickly: summary.seqs, screen.seqs, and unique.seqs.

$ head -5 batch.txt

summary.seqs(fasta=example.trim.contigs.fasta)
screen.seqs(fasta=example.trim.contigs.fasta, group=example.contigs.groups, maxambig=0, maxlength=300, maxhomop=8)
unique.seqs(fasta=example.trim.contigs.good.fasta)

We feed it to mothur using the filepath to the mothur executable followed by a space, then the name of our batch file.

$ /mnt/software/mothur/default/mothur batch.txt

You can see that mothur was opened, the commands were executed, then mothur was automatically closed. This is useful for large data sets, where certain commands (pre.cluster, dist.seqs) can take a VERY long time. This allows you to execute all the necessary commands without having to babysit the computer.

Note: Because we are in a new session of mothur, it has forgotten our preferences for multiple processors. You must indicate at the beginning of each session how many processors you want mothur to use. This is particularly important when you are using a batch file with a large data set and not paying as close attention to your data as it runs. Your first command in a batch file should indicate the number of processors you want to use.

Microbiota processing in mothur, BRC Workshop May 2018

Kim Dill-McFarland and Madison Cox

May 7, 2018

Tips for this workshop

Common issues

Issues during this workshop

Common mistakes

Introduction

Goal

Sources and citation

Programs

Data in this SOP

Data description

Files

Get Started

Download and run mothur on your own machine

Computational requirement

Download the latest verison of mothur

Connect to the BRC Linux Computer

Connect from a Mac

Connect from a Windows PC

Home directory

$ prompt

Project folder: Stay organized

Get data files

Sequence fastq

Databases

Begin in mothur

General mothur syntax

Log files

Start mothur via command line

Basic sequence clean-up

Combine paired end reads

Remove poor quality

Using “current”

Unique sequences

Simplify names and groups with counts

Alignment

Custom reduced database

Post-alignment summary

Rescreen

Filter

Pre-cluster

Chimeras

Undesirables

Define OTUs

Generate distance matrix

Cluster

Generate OTU matrix

Classify OTUs

Coverage

Rarefaction

Good’s coverage

Normalization

Basic analysis

Alpha-diversity

Beta-diversity

PCA

nMDS

Heatmaps of samples

Trees of samples

OTU analyses

Trees of OTUs

Heatmaps of OTUs

Venns

So your distance matrix is too large…

Solutions that do not remove sequences

Solutions that do remove sequences

Batch mode

`$` prompt

Sequence `fastq`