This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Online version available at http://rpubs.com/maddieSC/mothur_SOP_May_2018
Madison’s email: mcox4@wisc.edu
Slides: https://drive.google.com/file/d/1AEd2rRCHDoxND4552SZDdSdGRAk_L9Ra/view?usp=sharing
You can copy-paste from this document into the Terminal
window. This will be faster and prevent typos. Depending on your computer and your ssh client, this may be done by
If at any time during this workshop you fall behind, all of the intermediate files are available in /mnt/software/workshop/data/MOTHUR/miseq_sop2/
. You will see that the files created by each mothur command are listed after that command under “Output File Names’. These are the files you need in order to skip that step.
For example, if you did not complete align.seqs
and want to skip ahead, you would need to transfer the .align
and .align.report
files to your directory. To do this, you would use
mothur > system(cp /mnt/software/workshop/data/MOTHUR/miseq_sop2/example.trim.contigs.good.unique.align ~/miseq_sop)
mothur > system(cp /mnt/software/workshop/data/MOTHUR/miseq_sop2/example.trim.contigs.good.unique.align.report ~/miseq_sop)
Some of the larger files are also zipped.
example.trim.contigs.qual.zip
example.trim.contigs.good.unique.align.zip
example.trim.contigs.good.unique.good.align.zip
example.final.dist.zip
Therefore after transfer, you would also need to unzip them.
mothur > system(cp /mnt/software/workshop/data/MOTHUR/miseq_sop2/example.final.dist.zip ~/miseq_sop)
mothur > system(unzip ~/miseq_sop/example.final.dist.zip)
Typos The absolute most common mistake when working in mothur is typos. The mothur file names get very long and somewhat repetitive so please double-check! This can even occur when copy-pasting from this document if you accidentally copy a leading or lagging space around the command.
fasta and count_table don’t match Another common error seen in mothur occurs when your .fasta
and .count_table
files do not match. This looks like
"[ERROR]: yourSequence is in fileA but not in fileB, please correct."
This often occurs because you forgot to include your count file in a command or accidentally included the wrong one. You can see this when mothur picks a file to use when you don’t include one.
...
Using /home/BIOTECH/student01/miseq_sop/example.trim.contigs.fasta as input file for the fasta parameter.
...
You should make sure the file mothur picked is the correct one. And if you did provide this file in the command, check for typos!
Another reason a mismatch can occur is a multi-processor function failing. One or more processors may fail but the overall function appears to complete as the other processors finish. Thus, a file can be incomplete which would cause a mismatch. Running out of RAM will cause this to happen. The only way to correct this is to re-run the failed process (usually align.seqs
, pre.cluster
, dist.seqs
, or cluster.split
) on fewer processors.
The goal of this tutorial is to demonstrate standard operating procedures (SOP) for processing amplicon data that are generated using Illumina’s MiSeq platform with paired end reads. This SOP is a combination of the Schloss lab and Suen lab (UW-Madison) procedures for 2x250bp reads of the V4 region of the 16S rRNA gene. There are a number of steps, as noted in this SOP, which require modification if you are using a different amplicon and/or read length.
This SOP is based on Pat Schloss’s MiSeq SOP with some notes and modifications by Dr. Kim Dill-McFarland and Madison Cox. The original SOP can be found at http://www.mothur.org/wiki/miseq_sop.
Published work using this SOP should cite https://www.ncbi.nlm.nih.gov/pubmed/23793624
— Kozich JJ, Westcott SL, Baxter NT, Highlander SK, Schloss PD. (2013): Development of a dual-index sequencing strategy and curation pipeline for analyzing amplicon sequence data on the MiSeq Illumina sequencing platform. Applied and Environmental Microbiology 79(17): 5112-20. doi: 10.1128/AEM.01043-13.
We will complete all sequence processing in the program mothur (https://www.mothur.org/wiki/Main_Page). Details on all commands used here as well as others can be found at http://www.mothur.org/wiki/Category:Commands.
We will use the command-line version of this software on a remote Linux computer. Windows PC users will need to download an ssh client for connecting to the remote computer. We recommend the free, personal edition of MobaXterm (http://mobaxterm.mobatek.net). This software is a combination of an enhanced terminal window and ssh client.
The Suen lab is interested in understanding how rumen (gut) microbes contribute to nutrition and milk production in dairy cows. In this example data set, we are looking at the liquid (L) and solid (S) contents from the rumen (largest stomach compartment) of a cow.
We will analyze these data in this tutorial using an OTU (Operational Taxonomic Unit) approach. This approach is the most commonly used in microbiota research. Other approaches such as phylotype and phylogenetic can be found in the online Schloss SOP.
The data in this SOP were generated using 16S V4 primers from Kozich et al. 2013. These primers allow you to sequence up to 1536 samples in parallel using only 80 primers (32+48) and obtain sequence reads of the V4 region of the 16S rRNA gene. Please consult the supplementary methods of that manuscript for more information. You can also see the latest Schloss wet-lab SOP for generating libraries similar to these (https://github.com/SchlossLab/MiSeq_WetLab_SOP).
From the MiSeq Sequences come off the MiSeq as pairs of fastq.gz
files with each pair representing the two paired reads per sample.
fastq
files contain both the sequence data (fasta) and the quality score data (qual). gz is a form of compression to make the files smaller. If you aren’t getting these files off the sequencer, then you likely have the software parameters set incorrectly. The default MiSeq file names are sample.name_S##_L001_R#_001.fastq
where
sample.name
is the name you give the machine when running your library (i.e. pool of all samples together)S##
is the sample number given by the machine to ensure all file names are uniqueL001
is lane 1 (MiSeqs only have one lane as opposed to HiSeqs)R#
is the forward (R1
) or reverse (R2
) read for each sample001
is the replicate number, which will always be 001
as the MiSeq requires unique sample names (again, as opposed to the HiSeq)For mothur For this tutorial, you will need our 2 example samples’ fastq.gz
files as well as two databases for alignment and classification. All of these have been pre-downloaded onto sumo
. However, if you need the databases in the future, they can be found through the mothur website. The database files include
From mothur When mothur modifies your data, it creates a new file for each step. For example, if you start out with Example.fasta
and run screen.seqs to remove poor-quality sequences, mothur will create Example.good.fasta
. Thus, you can see all the steps you’ve run just by looking at the file names.
Note: It is generally easiest to use the “current” option for many of the commands since the file names get very long. Because this tutorial is meant to show people how to use mothur at a very nuts and bolts level, we will only selectively use the current option to demonstrate how it works. Generally, we will use the full file names for this tutorial.
Make sure that your machine is powerful enough for the data set you wish to run. What we are doing today is a very, very small data set, which should not pose any issues on a resonably new standard laptop. However, if you were to run a full data set on a low-end personal computer, you would likely run out of memory and crash it. Not to mention that many steps in mothur processing are very time-intensive. If you can, find a way to access a large server to run full data sets. My laptop has 16Gb of RAM and 4 processors. With these specifications, running on 2 processors, a small data set of ~25 samples (V4) takes about a day. A large data set of 400+ samples may take weeks. A V3-4 data set of any reasonable size would likely crash my machine (see the “large distance matrix” section below for an explanation). So make sure that whatever machine you use is up to the task before you start.
The latest mothur release can be found here. Go to the github page and download the latest version. At time of writing, it was version 1.40. Make sure to keep track of what version of mothur you are running, as it is important to report in any manuscript using data processed in mothur. Download the zip file appropriate for your operating system and place it on your desktop. You may choose to move it later, but for today we will leave it there. Unzip the file to a folder on the desktop, and rename the folder “mothur-1.40.X”. Inside you will find the mothur executable (mothur.exe), as well as other executables which certain parts of the mothur workflow rely on. Congratulations, you have downloaded the program.
We will remotely connect to the BRC Linux computer named sumo
for these exercises.
Terminal
(if you are not sure where that is simply search for the word Terminal with “Spotlight” tool i.e. the top right magnifying glass icon)student01
): ssh -Y student01@sumo.biotech.wisc.edu
-Y
option allows passing of any graphics from the BRC Linux computer. The XQartz program should be installed for this to work (See https://www.xquartz.org)sumo
is the name of the computer.There are 2 ways to connect to a remote computer:
Standard connection
student01
, you would write: ssh -Y student01@sumo.biotech.wisc.edu
-Y
option allows passing of any graphics from the BRC Linux computer.GUI method connection
@sumo.biotech.wisc.edu
Return
buttonSSH
icon (upper left). Click OKstudent01
Pasting in MobaXterm
By default, the pasting utility in MobaXterm is Shift+Insert. You can change this by going into the “Settings” pulldown menu and opening “Keyboard Shortcuts”. Find “Paste” and use the “Edit keyboard shortcut” pulldown to select “Ctrl + V” to make it normal.
Once you have connected to the sumo
computer, you “land” within your “Home” directory. For student01
, this would be:
/home/BIOTECH/student01
If at any point you get lost within the labyrinth of directories, simply type cd
to come back.
If you would like to back up by a single directory, type cd ..
To know where your terminal is “looking” at any moment, you can invoke the print working directory command: pwd
$
promptThe $
in the Terminal
is simply a way for the computer to notify you that it is ready to receive typed commands. ** Note: The $
is not typed when issuing commands. It is already there.**
For example, student01
would see the following Terminal
prompt:
[student01@sumo ~]$
It is a general good practice to create a project folder or working directory in order to separate files from various projects. We will use the name miseq_sop
. Therefore, our first task is to create this directory within the most logical place: our “Home” directory.
First we can verify that we are indeed within the “Home” directory with the print working directory command pwd
.
$ pwd
$ /home/BIOTECH/student01
Then, we create the directory with mkdir
Create a project folder
$ mkdir miseq_sop
We can verify that this directory exists by checking the content of our “Home” directory with the list command ls
$ ls
miseq_sop
Next, move into your miseq_sop folder. This will be done with the change directory command cd
into miseq_sop:
Change directory
$ cd miseq_sop
We can check that this worked with the command print working directory pwd
. For student01
, this would look like:
$ pwd
$ /home/BIOTECH/student01/miseq_sop
In this section, we will retrieve the data files for our 6 samples. Remember that there are R1
and R2
files so we will end up with 12 fastq.gz
files.
Note: If you wanted to download files from the Internet, you could use the command web get wget
followed by the URL, or web address of the file to download. For example: wget https://www.mothur.org/w/images/d/d6/MiSeqSOPData.zip
fastq
We will retrieve fastq files for our 2 samples (4 files). We will copy cp
a zipped folder containing the files that has already been downloaded to the local computer located in the directory: /mnt/software/workshop/data/MOTHUR/
. Then, you will unzip these files and move mv
them into your miseq_sop
folder.
Copy data files into home directory
$ cp /mnt/software/workshop/data/MOTHUR/fastq.zip ~/miseq_sop
Note: ~ is a representation of the home directory, /home/BIOTECH/student01
with YOUR STUDENT NUMBER
Nothing will be printed to the window so check that the file has been moved.
$ ls
fastq.zip
Now, unzip the folder to pull out all the fastq.gz
files. Remember to change the student## to your number
Unzip data files
$ unzip /home/BIOTECH/student01/miseq_sop/fastq.zip
This will place all of the files in a folder named the same thing as the zipped object (i.e. fastq
).
$ ls
fastq/
fastq.zip
So we need to move the fastq.gz
files from fastq/
into miseq_sop/
. We can move all of the files ending in fastq.gz
at once with the wildcard *
$ mv /home/BIOTECH/student01/miseq_sop/fastq/*fastq.gz ~/miseq_sop
You can then check the content of the directory with the list ls
command.
$ ls
fastq/
fastq.zip
L62.1_S149_L001_R1_001.fastq.gz
L62.1_S149_L001_R2_001.fastq.gz
S62.1_S27_L001_R1_001.fastq.gz
S62.1_S27_L001_R2_001.fastq.gz
Previous versions of mothur have allowed us to keep the .fastq files zipped before beginning. For some reason, this version keeps throwing an error at me. So we can unzip all of the fastq.gz files with the following command.
$ gunzip /home/BIOTECH/student01/miseq_sop/*.gz
$ ls
fastq/
fastq.zip
L62.1_S149_L001_R1_001.fastq
L62.1_S149_L001_R2_001.fastq
S62.1_S27_L001_R1_001.fastq
S62.1_S27_L001_R2_001.fastq
For these samples, the sample names are coded as sampleCow.day. For example, L62.1 is the liquid sample from cow #62 on day 1. These sequences are 250 bp and overlap in the V4 region of the 16S rRNA gene; this region is about 253 bp long.
We will now copy the Silva and GreenGenes database files into your home folders with cp
. For alignment, we will be using a reduced version of Silva that only contains the V4 region of 16S since we know that is our amplicon region. A full Silva database is available at https://www.mothur.org/wiki/Silva_reference_files. Also, because silva.v4.v132.align
is large, it is zipped and will need to be unzipped once in your folder.
All of these files are located in /mnt/software/mothur/reference_databases/
Retrieve databases
$ cp /mnt/software/mothur/reference_databases/silva.v4.v132.align.zip ~/miseq_sop
$ cp /mnt/software/mothur/reference_databases/silva.nr_v132.tax ~/miseq_sop
$ cp /mnt/software/mothur/reference_databases/gg_13_8_99.fasta ~/miseq_sop
$ cp /mnt/software/mothur/reference_databases/gg_13_8_99.gg.tax ~/miseq_sop
Unzip align file
$ unzip ~/miseq_sop/silva.v4.v132.align.zip
silva.v4.v132.align.zip
was a single zipped file and therefore, was not placed in a new folder like the fastq.gz
files. So, it does not need to be moved.
If you do a final check of your folder, you should have all of the following files.
$ ls
fastq/
fastq.zip
gg_13_8_99.fasta
gg_13_8_99.gg.tax
L62.1_S149_L001_R1_001.fastq
L62.1_S149_L001_R2_001.fastq
S62.1_S27_L001_R1_001.fastq
S62.1_S27_L001_R2_001.fastq
silva.nr_v132.tax
silva.v4.v132.align
silva.v4.v132.align.zip
The software mothur is installed on the BRC Linux computer sumo
and will be accessed via command-line on the Terminal. We can check that the computer knows about the software with the command which
. This will echo back the location of the software:
$ which mothur
/mnt/software/mothur/default/mothur
We can also verify the version that is installed. It is important to note the version used as the calculations or command syntax may differ slightly with different versions.
We can check what version is installed with the following command that will print out the results on the screen:
$ mothur --version
Linux 64Bit Version
Mothur version=1.40.1
Release Date=4/16/2018
mothur automatically creates file names based on commands run with that file.
Sometimes new versions slightly change the output names which cause later commands to fail as they cannot find the correct file. For example, version 1.35 outputs the file ...precluster.uchime.accnos
after chimera.uchime while version 1.36 outputs ...precluster.denovo.uchime.accnos
because new options were added and it now specifies that you used the denovo method.
If you are having issues, the first thing to check is whether or not all the needed input files exist and are named exactly what they are in your command. If they are not, make the necessary changes to the next command and try re-running.
We will run these exercises on the sumo
BRC Linux computer which as a total of 16 processors and 256 Gb (1/4 Tb) of RAM. These resources have to be shared among all the attendees in class. For example if there are 15 students, each could use 2 processors.
If you were to run on your desktop or laptop computer on your own the following recommendations from Kim would be useful:
— Before you begin, there are some things to consider. If this is your first time using mothur, I recommend running through this tutorial will only 2 samples so that you can get used to command line and identify any problem areas without having to wait for computationally intensive steps to run. This will also minimize RAM load so that you can practice on your own computer before moving to larger server usage (more about that below).
— Once you move up to a bigger data set, you need to consider the computer you’re running mothur on. Many commands can be run on multiple processors by adding processors = #. However, you can only use as many as your computer actually has. A good rule of thumb is to not run mothur using more than (all of your processors - 2). This leaves 1 processor for your OS and 1 for any other programs you may have open. Try to minimize other things open if possible. The other thing to consider is RAM. If you have very little RAM, only run mothur with 1 processor to minimize RAM load. This will make this very slow going and a laptop might still run out of RAM and crash. It’s impossible to say exactly how much RAM you will need for a given data set. So, when thinking about running a large data set consisting of an entire MiSeq run or more, consider getting access to 0.5TB+ RAM machines. Madison has these resources available through the Center for High Throughput Computing, http://chtc.cs.wisc.edu/.
Every data set is different, and every data set presents its own set of problems. Logfiles are automatically generated by mothur and help to determine where issues may have arisen if running in batch mode. They also help with recordkeeping and reproducibility. Your log file is a live document which is continually updated over the course of a session. If you close mothur and reopen, an new logfile will be created. The logs are only differentiated by a long string of numbers, like mothur.1490885401.logfile
so it may be a good idea to rename logfiles to be more informative, such as indicating dates and data sets.
We can now start the program. We will all call the same path to execute mothur in our own folders. Since we are each in our own miseq_sop
folders, mothur will automatically save your files to your folder. Note that once this command is issued, we will be working within the mothur software as reminded by the new prompt mothur >
which will only accept mothur line commands.
Start mothur
$ /mnt/software/mothur/default/mothur
mothur v.1.40.1
Last updated: 4/16/2018
by
Patrick D. Schloss
Department of Microbiology & Immunology
University of Michigan
http://www.mothur.org
When using, please cite:
Schloss, P.D., et al., Introducing mothur: Open-source, platform-independent, community-supported software for describing and comparing microbial communities. Appl Environ Microbiol, 2009. 75(23):7537-41.
Distributed under the GNU General Public License
Type 'help()' for information on the commands that are available
For questions and analysis support, please visit our forum at https://www.mothur.org/forum
Type 'quit()' to exit program
Interactive Mode
mothur >
If for some reason the default version of mothur is not v1.40.1, quit mothur with
mothur > quit
and then use the following to access the latest version
$ /mnt/software/mothur/versions/mothur_v.1.40.1/mothur
Paired-end runs on the Illumina MiSeq yield two reads for every sequence. One running in from the 5 prime end (read 1, R1
) and one from the 3 prime end (read 2, R2
). You need to combine these reads so that there is a single, longer read for every sequence (a contig). You do this by determining the overlapping regions and combining with the make.contigs
command.
This command will: * extract the sequence and quality score data from your ‘fastq’ files * create the reverse complement of the reverse read and then * join the reads into contigs
We have a very simple algorithm to do this. First, we align the pairs of sequences. Next, we look across the alignment and identify any positions where the two reads disagree. If one sequence has a base and the other has a gap, the quality score of the base must be over 25 to be considered real. If both sequences have a base at that position, then we require one of the bases to have a quality score 6 or more points better than the other. If it is less than 6 points better, then we set the consensus base to an N.
You could run make.contigs
on each sample individually and then combine the outputs. However, this is very time consuming. So, make.contigs
can also take a tab-delimited file listing all the samples and their corresponding file names. We will have mothur make this file using make.file
. Remember to change to your student number!
Importantly, the make.contigs command CAN accept the gzipped files which are generated by the MiSeq IF your computer is able to unzip them. So, if you are running mothur on OSX or Linux, you will have no issues as long as you let mothur know to look for the gzipped filetype. However, if you are running mothur on Windows, you will encounter all sorts of issues using gzipped files. You will have to use some sort of utility, like 7zip (http://www.7-zip.org/), to extract the fastq files prior to beginning. In this case, you would omit the “type=gz” in the following command, and mothur will look for the default “.fastq” filetype. Because sumo
is a Linux machine, we will be able to use the gzipped files with no issue in thei workshop.
make.file
mothur > make.file(inputdir=/home/BIOTECH/student01/miseq_sop)
Setting input directory to: /home/BIOTECH/student01/miseq_sop/
Output File Names:
/home/BIOTECH/student01/miseq_sop/stability.files
This will output a .files
which is setup as
sample name -tab- R1
file -tab- R2
file
Let’s take a look at it.
mothur > system(head -5 stability.files)
L62.1 L62.1_S149_L001_R1_001.fastq L62.1_S149_L001_R2_001.fastq
S62.1 S62.1_S27_L001_R1_001.fastq S62.1_S27_L001_R2_001.fastq
Note: Notice that if you want to perform a terminal command outside of mothur while the program is running, you can use the system()
command
Note: If you name two sets of files the same, they will be merged into one sample during make.contigs. This is useful if you sequenced a sample more than once.
Note: Your sample names cannot contain a dash (-) as mothur uses this symbol to denote lists in some commands.
The default file name of “stability” is not a useful name for us so we’re going to re-name this file so that all subsequent files are named correctly.
If you are working on your own machine, rather than on a server, you can make this change with a regular text editor. I like to use Excel.
Rename ‘.files’
mothur > system(cp stability.files example.files)
Note: If you were running on a Windows PC, you would use copy
instead of cp
since the system command calls upon the operating system of the computer. For this SOP, we are all running on Linux on sumo
Now, let’s begin our data clean-up by combining R1
and R2
with make.contigs
. Please wait for instructions on how many processors to use as this will vary depending on the number of people attending.
make.contigs
mothur > make.contigs(file=example.files, processors=2)
>>>>> Processing file pair /home/BIOTECH/student01/miseq_sop/L62.1_S149_L001_R1_001.fastq - /home/BIOTECH/student01/miseq_sop/L62.1_S149_L001_R2_001.fastq (files 1 of 2) <<<<<
Making contigs...
[truncated]
Done.
It took 13 secs to assemble 15331 reads.
>>>>> Processing file pair /home/BIOTECH/student01/miseq_sop/S62.1_S27_L001_R1_001.fastq - /home/BIOTECH/student01/miseq_sop/S62.1_S27_L001_R2_001.fastq (files 2 of 2) <<<<<
Making contigs...
[truncated]
Done.
It took 115 secs to assemble 130107 reads.
Group count:
L62.1 15331
S62.1 130107
Total of all groups is 145438
It took 128 secs to process 145438 sequences.
Output File Names:
/home/BIOTECH/student01/miseq_sop/example.trim.contigs.fasta
/home/BIOTECH/student01/miseq_sop/example.trim.contigs.qual
/home/BIOTECH/student01/miseq_sop/example.scrap.contigs.fasta
/home/BIOTECH/student01/miseq_sop/example.scrap.contigs.qual
/home/BIOTECH/student01/miseq_sop/example.contigs.report
/home/BIOTECH/student01/miseq_sop/example.contigs.groups
This command will also produce several files that you will need down the road: example.trim.contigs.fasta
and example.contigs.groups
. These contain the sequence data and group identity for each sequence.
The example.contigs.report
file will tell you something about the contig assembly for each read or why it failed. The fileexample.scrap.contigs.fasta
contains sequences that fail to make contigs. The file example.contigs.groups
contains a list of sequence names (given by the machine) and which sample they belong to.
We should now take a look at our data. This is done repeatedly in this SOP using ‘summary.seqs’
Note: mothur is smart enough to remember that we used “x”" number of processors in make.contigs
and so it will use that throughout your current session unless you specify otherwise.
summary.seqs
mothur > summary.seqs(fasta=example.trim.contigs.fasta)
Using 2 processors.
Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 248 248 0 3 1
2.5%-tile: 1 252 252 0 3 3636
25%-tile: 1 253 253 0 3 36360
Median: 1 253 253 0 4 72720
75%-tile: 1 253 253 0 4 109079
97.5%-tile: 1 254 254 5 5 141803
Maximum: 1 495 495 81 126 145438
Mean: 1 252 252 0 3
# of Seqs: 145438
It took 2 secs to summarize 145438 sequences.
Output File Names:
/home/BIOTECH/student01/miseq_sop/example.trim.contigs.summary
This output tells us that we have 145,438 sequences that are ~253 bases long.
Things to look out for
We’ll take care of these issues in the next step when we run screen.seqs
.
This implementation of the command will remove sequences with any ambiguous bases, anything longer than 300 bp, and anything with a homopolymer longer than 8 bases.
screen.seqs
mothur > screen.seqs(fasta=example.trim.contigs.fasta, group=example.contigs.groups, maxambig=0, maxlength=300, maxhomop=8)
[truncated]
72669
72769
It took 2 secs to screen 145438 sequences, removed 19898.
/******************************************/
Running command: remove.seqs(accnos=/home/BIOTECH/student01/miseq_sop/example.trim.contigs.bad.accnos, group=/home/BIOTECH/student01/miseq_sop/example.contigs.groups)
Removed 19898 sequences from your group file.
Output File Names:
/home/BIOTECH/student01/miseq_sop/example.contigs.pick.groups
/******************************************/
Output File Names:
/home/BIOTECH/student01/miseq_sop/example.trim.contigs.bad.accnos
/home/BIOTECH/student01/miseq_sop/example.trim.contigs.good.fasta
/home/BIOTECH/student01/miseq_sop/example.contigs.good.groups
It took 3 secs to screen 145438 sequences.
Alternate command: There’s another way to run this using the output from summary.seqs
. This may be faster because the summary.seqs
output file already has the number of ambiguous bases and sequence length calculated for your sequences.
mothur > screen.seqs(fasta=example.trim.contigs.fasta, group=example.contigs.groups, summary=example.trim.contigs.summary, maxambig=0, maxlength=300, maxhomop=8)
Now that we’ve created some files in mothur, we can ask mothur what it knows about our current session.
get.current
mothur > get.current()
Current RAM usage: 0.0561943 Gigabytes. Total Ram: 354.979 Gigabytes.
Current files saved by mothur:
accnos=/home/BIOTECH/student01/miseq_sop/example.trim.contigs.bad.accnos
fasta=/home/BIOTECH/student01/miseq_sop/example.trim.contigs.good.fasta
group=/home/BIOTECH/student01/miseq_sop/example.contigs.good.groups
qfile=/home/BIOTECH/student01/miseq_sop/example.trim.contigs.qual
contigsreport=/home/BIOTECH/student01/miseq_sop/example.contigs.report
processors=2
summary=/home/BIOTECH/student01/miseq_sop/example.trim.contigs.summary
file=/home/BIOTECH/student01/miseq_sop/stability.files
Current input directory saved by mothur: /home/BIOTECH/student01/miseq_sop/
Current default directory saved by mothur: /mnt/software/mothur/default/
Current working directory: /home/BIOTECH/student01/miseq_sop/
Output File Names:
current_files.summary
What this means is that mothur remembers your latest fasta
and group
file as well as the number of processors you are running on. This is useful because you can shorten commands by telling mothur to use the current files. So all of the following are equivalent at this point in analysis and you would get the same output for each.
mothur > summary.seqs(fasta=example.trim.contigs.good.fasta)
mothur > summary.seqs(fasta=current)
mothur > summary.seqs()
Using 2 processors.
Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 250 250 0 3 1
2.5%-tile: 1 252 252 0 3 3139
25%-tile: 1 253 253 0 3 31386
Median: 1 253 253 0 4 62771
75%-tile: 1 253 253 0 4 94156
97.5%-tile: 1 254 254 0 5 122402
Maximum: 1 294 294 0 8 125540
Mean: 1 252 252 0 3
# of Seqs: 125540
It took 3 secs to summarize 125540 sequences.
Output File Names:
/home/BIOTECH/student01/miseq_sop/example.trim.contigs.good.summary
Note: If you exit out of mothur and then re-start, it will not remember the current files from your last session.
For the purposes of this tutorial, we will write out the names of the files for clarity.
At this point, our sequencing error rate has probably dropped more than an order of magnitude and we have 125,540 sequences We anticipate that many of our sequences are duplicates of each other. Because it’s computationally wasteful to process the same thing a bazillion times, we’ll select only unique sequences using the unique.seqs
command. This way, the computer can analyze one sequence and then apply that outcome to every identical sequence in the data set.
unique.seqs
mothur > unique.seqs(fasta=example.trim.contigs.good.fasta)
[truncated]
125000 20397
125540 20469
Output File Names:
/home/BIOTECH/student01/miseq_sop/example.trim.contigs.good.names
/home/BIOTECH/student01/miseq_sop/example.trim.contigs.good.unique.fasta
If two sequences have the same identical sequence, then they’re considered duplicates and will get merged.
In the screen output there are two columns:
So after running unique.seqs we have gone from 125,540 total to 20,469 unique sequences. This will make our life much easier moving forward.
Another thing to do to make our lives easier is to simplify the names
and group
files. mothur carefully keeps track of the name (i.e. sequence ID representing a set of uniques, .names
file) and group (i.e. sample name, .groups
file) for every single sequence. To make later commands shorter, we combine the data in these files into a single “count” file (.count_table
).
So we’ll run count.seqs to
generate a table where the rows are the names of the unique sequences and the columns are the names of the groups. The table is then filled with the number of times each unique sequence shows up in each group.
count.seqs
mothur > count.seqs(name=example.trim.contigs.good.names, group=example.contigs.good.groups)
It took 1 secs to create a table for 125540 sequences.
Total number of sequences: 125540
Output File Names:
/home/BIOTECH/student01/miseq_sop/example.trim.contigs.good.count_table
This will generate a file called example.trim.contigs.good.count_table
. In subsequent commands, we’ll use this file with the count=
option.
Let’s look at another summary of our data using this count table.
summary.seqs
with count_table
mothur > summary.seqs(count=example.trim.contigs.good.count_table)
Using /home/BIOTECH/student01/miseq_sop/example.trim.contigs.good.unique.fasta as input file for the fasta parameter.
Using 2 processors.
Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 250 250 0 3 1
2.5%-tile: 1 252 252 0 3 3139
25%-tile: 1 253 253 0 3 31386
Median: 1 253 253 0 4 62771
75%-tile: 1 253 253 0 4 94156
97.5%-tile: 1 254 254 0 5 122402
Maximum: 1 294 294 0 8 125540
Mean: 1 252 252 0 3
# of unique seqs: 20469
total # of seqs: 125540
It took 0 secs to summarize 125540 sequences.
Output File Names:
/home/BIOTECH/student01/miseq_sop/example.trim.contigs.good.unique.summary
We see that now, by including a fasta and count_table, we see “# of unique seqs” and “total # of seqs”. We would get the same result with summary.seqs(name=example.trim.contigs.good.names, group=example.contigs.good.groups)
.
Now we need to align our sequences to a reference alignment using align.seqs
. Your sequences aren’t just random strings of nucleotides; they are part of a known gene (usually 16S or 18S). For ITS, SEE NOTE BELOW
We have two options for 16S alignment: Silva or GreenGenes. Only use Silva for alignment. GreenGenes contains incomplete (not full length) 16S sequences which make alignment difficult and error-prone. These databases are formatted specifically for mothur and can be found on the mothur wiki (Silva: https://www.mothur.org/wiki/Silva_reference_files, GreenGenes: https://www.mothur.org/wiki/Greengenes-formatted_databases).
We will also improve our alignment by including flip=T
so that mothur checks both the forward and reverse complement of our sequences.
The first time you work with an amplicon (i.e. primer set), it’s a good idea to align to the full database to find out where your sequences align to. This is demonstrated below for V4 but should not be run during the workshop as it will take a long time and use a lot of RAM.
Instead, we will start aligning to the pre-made custom reduced alignment database you transferred earlier (silva.v4.v132.align
).
align.seqs
mothur > align.seqs(fasta=example.trim.contigs.good.unique.fasta, reference=silva.v4.v132.align, flip=T)
Using 2 processors.
Reading in the silva.v4.v132.align template sequences...
DONE.
It took 72 to read 213119 sequences.
Aligning sequences from example.trim.contigs.good.unique.fasta ...
100
100
[truncated]
10227
10242
It took 323 secs to align 20469 sequences.
Output File Names:
example.trim.contigs.good.unique.align
example.trim.contigs.good.unique.align.report
Note: For fungal sequencing, it is best to skip the alignment step here. The ITS is too variable in length to generate a good alignment. Instead, we do an internal “needleman” alignment later at the precluster
step. So if you are working with a fungal data set, skip to precluster
at this point.
———————————-DO NOT RUN the following block———————————-
Even with a reduced database, this may take several minutes to run. We will go over how this reduced database was created in the interim. First, we aligned to the full Silva database.
DO NOT RUN THIS
mothur > align.seqs(fasta=example.trim.contigs.good.unique.fasta, reference=silva.nr_v132.align, flip=T)
Reading in the /home/BIOTECH/student01/miseq_sop/silva.nr_v132.align template sequences... DONE.
It took 506 to read 172418 sequences.
Aligning sequences from /home/BIOTECH/student01/miseq_sop/example.trim.contigs.good.unique.fasta ...
It took 363 secs to align 20469 sequences.
Output File Names:
/home/BIOTECH/student01/miseq_sop/example.trim.contigs.good.unique.align
/home/BIOTECH/student01/miseq_sop/example.trim.contigs.good.unique.align.report
Then we summary.seqs
to see where we aligned to:
DO NOT RUN THIS
mothur > summary.seqs(fasta=example.trim.contigs.good.unique.align, count=example.trim.contigs.good.count_table)
| Start | End | NBases | Ambigs | Polymer | NumSeqs
---------------- | ------- | ------- | ------- | -------- | ------- | -------
Minimum: | 13125 | 23439 | 250 | 0 | 3 | 1
2.5%-tile: | 13862 | 23444 | 252 | 0 | 3 | 3139
25%-tile: | 13862 | 23444 | 253 | 0 | 3 | 31386
Median: | 13862 | 23444 | 253 | 0 | 4 | 62771
75%-tile: | 13862 | 23444 | 253 | 0 | 4 | 94156
97.5%-tile: | 13862 | 23444 | 254 | 0 | 5 | 122402
Maximum: | 13875 | 26147 | 294 | 0 | 8 | 125540
Mean: | 13862 | 23444.1 | 252.889 | 0 | 3.82739 |
# of Seqs: | 20469
total # of seqs: | 125540
So what does this mean?
You’ll see that the bulk of the sequences start at position 13862 (Start column) and end at position 23444 (End column). But some sequences start or end at other places. These deviants are likely due to an insertion or deletion at the terminal ends of the alignments. Sometimes you’ll see sequences that start and end at the same position indicating a very poor alignment, which is generally due to non-specific amplification.
It is these Start and End values that we need to make a custom reduced database. These values are different for different amplicons/primers and sometimes for different versions/releases of a database. So, once you know them for your amplicon and a given version of Silva, you won’t have to do this step again.
We make a database customized to our region of interest using the pcr.seqs
command. To run this command, you need to have the full reference database (i.e. silva.nr_v132.align
) and know where in that alignment your amplicon starts and ends (from the above summary). To remove the leading and trailing dots, we will set keepdots
to false.
DO NOT RUN THIS
mothur > pcr.seqs(fasta=silva.nr_v132.align, start=13862, end=23444, keepdots=F)
Output File Names:
/home/BIOTECH/student01/miseq_sop/silva.nr_v132.pcr.align
We would then rename this file so we know which region we selected
DO NOT RUN THIS
mothur > system(mv silva.nr_v132.pcr.align silva.v4.v132.align)
Note: Windows users should use “rename” rather than “mv”" since the system command calls upon the operating system of the computer. For this SOP, we are all running on Linux on sumo
Now we have a customized reduced reference alignment to align our sequences to. The nice thing about this reference is that instead of being 50,000 columns wide, it is now 13,425 columns wide which will save our hard drive some space and improve the overall alignment quality and efficiency.
silva.v4.v132.align
is the database we started an alignment to earlier. Once that is done, we summary.seqs
again.
—————————————————————————–
summary.seqs
mothur > summary.seqs(fasta=example.trim.contigs.good.unique.align, count=example.trim.contigs.good.count_table)
Using 2 processors.
Start End NBases Ambigs Polymer NumSeqs
Minimum: 8 9577 231 0 3 1
2.5%-tile: 8 9582 251 0 3 3139
25%-tile: 8 9582 252 0 3 31386
Median: 8 9582 252 0 4 62771
75%-tile: 8 9582 252 0 4 94156
97.5%-tile: 8 9582 253 0 5 122402
Maximum: 13 9582 256 0 8 125540
Mean: 8 9581 251 0 3
# of unique seqs: 20469
total # of seqs: 125540
It took 5 secs to summarize 125540 sequences.
Output File Names:
example.trim.contigs.good.unique.summary
To make sure that everything overlaps the same region we’ll re-run screen.seqs
. Note that we need to include the count table in this command so that we can update the table for the sequences we’re removing.
We’re also using the summary file we just made so we don’t have to figure out all the start and stop positions again. To this end, we will specify the following within the command: start=8, end=9582
. These numbers are very different from those using the full Silva database because the numbering starts from the 5 prime end and we’ve removed a bunch of bases from that end with pcr.seqs
screen.seqs
post alignment
mothur > screen.seqs(fasta=example.trim.contigs.good.unique.align, count=example.trim.contigs.good.count_table, summary=example.trim.contigs.good.unique.summary, start=8, end=9582)
Using 2 processors.
1000
1000
[truncated]
10234
10235
It took 3 secs to screen 20469 sequences, removed 181.
/******************************************/
Running command: remove.seqs(accnos=example.trim.contigs.good.unique.bad.accnos, count=example.trim.contigs.good.count_table)
Removed 255 sequences from your count file.
Output File Names:
example.trim.contigs.good.pick.count_table
/******************************************/
Output File Names:
example.trim.contigs.good.unique.bad.accnos
example.trim.contigs.good.unique.good.summary
example.trim.contigs.good.unique.good.align
example.trim.contigs.good.good.count_table
It took 5 secs to screen 20469 sequences.
And we summarize to see how re-screening impacted the data.
summary.seqs
mothur > summary.seqs(fasta=example.trim.contigs.good.unique.good.align, count=example.trim.contigs.good.good.count_table)
Using 2 processors.
Start End NBases Ambigs Polymer NumSeqs
Minimum: 8 9582 231 0 3 1
2.5%-tile: 8 9582 251 0 3 3133
25%-tile: 8 9582 252 0 3 31322
Median: 8 9582 252 0 4 62643
75%-tile: 8 9582 252 0 4 93964
97.5%-tile: 8 9582 253 0 5 122153
Maximum: 8 9582 256 0 8 125285
Mean: 8 9582 251 0 3
# of unique seqs: 20288
total # of seqs: 125285
It took 5 secs to summarize 125285 sequences.
Output File Names:
example.trim.contigs.good.unique.good.summary
Now that we know our sequences overlap the same alignment coordinates, we also want to make sure they only overlap that region. We’ll filter the sequences to remove the overhangs at both ends with vertical=T
. Since we’ve done paired-end sequencing, this shouldn’t be much of an issue.
Also, when aligning, mothur puts in a lot of spaces and dots to show insertions and deletions compared to the alignment database. Therefore, there are many columns in the alignment that only contain gap characters (i.e. -). This is confusing to later commands (and causes them to fail) so we remove them all with trump=.
.
filter.seqs
mothur > filter.seqs(fasta=example.trim.contigs.good.unique.good.align, vertical=T, trump=.)
Using 2 processors.
Creating Filter...
1000
1000
[truncated]
10144
10144
It took 6 secs to create filter for 20288 sequences.
Running Filter...
1000
1000
[truncated]
10144
10144
It took 3 secs to filter 20288 sequences.
Length of filtered alignment: 416
Number of columns removed: 9166
Length of the original alignment: 9582
Number of sequences used to construct filter: 20288
Output File Names:
example.filter
example.trim.contigs.good.unique.good.filter.fasta
This means that our initial alignment was 9582 columns wide and that we were able to remove 9166 terminal gap characters using trump=.
and vertical=T
. The final alignment length is 416 columns.
Because we’ve perhaps created some redundancy across our sequences by trimming the ends, we can re-run unique.seqs
unique.seqs
again
mothur > unique.seqs(fasta=example.trim.contigs.good.unique.good.filter.fasta, count=example.trim.contigs.good.good.count_table)
1000 1000
2000 1998
[truncated]
20000 19925
20288 20210
Output File Names:
example.trim.contigs.good.unique.good.filter.count_table
example.trim.contigs.good.unique.good.filter.unique.fasta
This identified 78 duplicate sequences that we’ve now merged with previous unique sequences.
The next thing we want to do to further de-noise our sequences is to pre-cluster the sequences that are very similar. We’ll use the pre.cluster
command allowing for up to 2 differences between sequences by adding diffs=2
to the command.
This command will split the sequences by group and then sort them by abundance and go from most abundant to least and identify sequences that are within 2 nt of each other. If they are, then they get merged.
We generally favor allowing 1 difference for every 100 bp of sequence, hence a 250bp amplicon is diffs=2
:
pre.cluster
mothur > pre.cluster(fasta=example.trim.contigs.good.unique.good.filter.unique.fasta, count=example.trim.contigs.good.unique.good.filter.count_table, diffs=2)
Using 2 processors.
Processing group L62.1:
L62.1 0 2625 34
Processing group S62.1:
L62.1 100 1994 665
L62.1 200 1825 834
[truncated]
L62.1 2600 1310 1349
L62.1 2659 1309 1350
Total number of sequences before pre.cluster was 2659.
pre.cluster removed 1350 sequences.
It took 0 secs to cluster 1350 sequences.
S62.1 100 12301 6436
S62.1 200 10610 8127
[truncated]
S62.1 18700 5010 13727
S62.1 18737 5010 13727
Total number of sequences before pre.cluster was 18737.
pre.cluster removed 13727 sequences.
It took 10 secs to cluster 13727 sequences.
It took 12 secs to run pre.cluster.
Output File Names:
example.trim.contigs.good.unique.good.filter.unique.precluster.fasta
example.trim.contigs.good.unique.good.filter.unique.precluster.count_table
example.trim.contigs.good.unique.good.filter.unique.precluster.L62.1.map
example.trim.contigs.good.unique.good.filter.unique.precluster.S62.1.map
Note: For fungal data, you should include “align=needleman” here in the precluster command to do the internal alignment
summary.seqs
mothur > summary.seqs(fasta=example.trim.contigs.good.unique.good.filter.unique.precluster.fasta, count=example.trim.contigs.good.unique.good.filter.unique.precluster.count_table)
Using 2 processors.
Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 416 231 0 3 1
2.5%-tile: 1 416 251 0 3 3133
25%-tile: 1 416 252 0 3 31322
Median: 1 416 252 0 4 62643
75%-tile: 1 416 252 0 4 93964
97.5%-tile: 1 416 253 0 5 122153
Maximum: 1 416 256 0 8 125285
Mean: 1 416 251 0 3
# of unique seqs: 5620
total # of seqs: 125285
It took 0 secs to summarize 125285 sequences.
Output File Names:
example.trim.contigs.good.unique.good.filter.unique.precluster.summary
Note that we’ve gone from 20,208 to 5,614 unique sequences but our total # of sequences has not changed. This is because no sequences were removed; they were just grouped similarly to unique.seqs
but allowing up to 2 differences.
It is not uncommon to drastically reduce your unique sequence count with this command. The MiSeq is good but not perfect.
At this point we have removed as much sequencing error as we can, and it is time to turn our attention to removing chimeras.
Chimeras occur when the polymerase falls off one sequence and then reattaches to another during a single round of PCR. This results in a sequence that looks unique but is, in fact, just half of one and half of another e.g. not real.
We will remove chimeras with the UCHIME algorithm that is called within mothur using the chimera.uchime
command.
Again, this command will split the data by sample and check for chimeras.
Our preferred way of doing this is to use the abundant sequences as our reference. In addition, if a sequence is flagged as chimeric in one sample, the default (dereplicate=F
) is to remove it from all samples. Our experience suggests that this is a bit aggressive since we’ve seen rare sequences get flagged as chimeric when they’re the most abundant sequence in another sample.
chimera.uchime
mothur > chimera.uchime(fasta=example.trim.contigs.good.unique.good.filter.unique.precluster.fasta, count=example.trim.contigs.good.unique.good.filter.unique.precluster.count_table, dereplicate=t)
Using 2 processors.
uchime by Robert C. Edgar
http://drive5.com/uchime
This code is donated to the public domain.
Checking sequences from example.trim.contigs.good.unique.good.filter.unique.precluster.fasta ...
uchime v4.2.40
by Robert C. Edgar
http://drive5.com/uchime
This code is donated to the public domain.
uchime v4.2.40
by Robert C. Edgar
http://drive5.com/uchime
This code is donated to the public domain.
00:00 19Mb 100.0% Reading example.trim.contigs.good.unique.good.filter.unique.precluster.temp1.temp
00:00 19Mb 1309 sequences
00:00 20Mb 100.0% Reading example.trim.contigs.good.unique.good.filter.unique.precluster.temp1.temp
00:00 20Mb 5010 sequences
00:15 2.8Mb 100.0% 79/1308 chimeras found (6.0%)
It took 15 secs to check 1309 sequences from group L62.1.
02:44 5.2Mb 100.0% 1254/5009 chimeras found (25.0%)
It took 164 secs to check 5010 sequences from group S62.1.
It took 164 secs to check 6319 sequences.
Output File Names:
example.trim.contigs.good.unique.good.filter.unique.precluster.denovo.uchime.pick.count_table
example.trim.contigs.good.unique.good.filter.unique.precluster.denovo.uchime.chimeras
example.trim.contigs.good.unique.good.filter.unique.precluster.denovo.uchime.accnos
This command only identifies chimeras. We then need to remove them from the data using remove.seqs
remove.seqs
mothur > remove.seqs(fasta=example.trim.contigs.good.unique.good.filter.unique.precluster.fasta, count=example.trim.contigs.good.unique.good.filter.unique.precluster.count_table, accnos=example.trim.contigs.good.unique.good.filter.unique.precluster.denovo.uchime.accnos)
[NOTE]: The count file should contain only unique names, so mothur assumes your fasta, list and taxonomy files also contain only uniques.
Removed 1290 sequences from your fasta file.
Removed 3847 sequences from your count file.
Output File Names:
example.trim.contigs.good.unique.good.filter.unique.precluster.pick.fasta
example.trim.contigs.good.unique.good.filter.unique.precluster.pick.count_table
summary.seqs
mothur > summary.seqs(fasta=example.trim.contigs.good.unique.good.filter.unique.precluster.pick.fasta, count=example.trim.contigs.good.unique.good.filter.unique.precluster.pick.count_table)
Using 2 processors.
Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 416 241 0 3 1
2.5%-tile: 1 416 251 0 3 3036
25%-tile: 1 416 252 0 3 30360
Median: 1 416 252 0 4 60720
75%-tile: 1 416 252 0 4 91079
97.5%-tile: 1 416 253 0 5 118403
Maximum: 1 416 255 0 8 121438
Mean: 1 416 251 0 3
# of unique seqs: 4330
total # of seqs: 121438
It took 0 secs to summarize 121438 sequences.
Output File Names:
example.trim.contigs.good.unique.good.filter.unique.precluster.pick.summary
Note that we went from 125,255 to 121,767 sequences for a reduction of 2.8%; this is a reasonable number of sequences to be flagged as chimeric. In the end, the percent chimeras is a result of the polymerase you used in PCR. A high-fidelity polymerase is less likely to fall off and create chimeras and fewer cycles also result in fewer chimeras.
As a final quality control step, we need to see if there are any “undesirables” in our data set. Even after all of this cleanup, you can still have “good” sequences that are actually contaminants.
Sometimes when we pick a primer set, they will amplify other stuff that gets to this point in the pipeline such as 18S rRNA gene fragments or 16S rRNA from Archaea, chloroplasts, and mitochondria.
There’s also just the random stuff that we want to get rid of. Now you may say, “But wait I want that stuff”. Fine. But, the primers we use, are only supposed to amplify members of the Bacteria and if they’re hitting Eukaryota or Archaea, then its a mistake. It is important to note that if you do keep these sequences, they do not represent the overall archaeal or eukaryotic community. It would be incorrect to use this data as a shortcut to not do an archaea or eukaryote specific MiSeq run.
You may also have other “undesirables” specific to your system of study. For example, the rumen is not home to any Cyanobacteria as it is a dark, anaerobic environment. However, we always get Cyanobacteria sequences back because unfortunately, chloroplasts (very abundant in the feed of herbivores!) look a lot like Cyanobacteria. So we remove them. Not every researcher does this but it is an accepted step as long as you note it in your methods section.
The first step is to classify your sequences so you know what to remove. Let’s go ahead and classify using the Bayesian classifier with the classify.seqs command
. You can classify to either Silva or GreenGenes at this point and it will make very little difference. We will use Silva herefor bacterial sequences. We will also specific a cutoff of 80. This roughly means that we are 80%+ confident in our classification, a good thing if we’re going to remove groups. We want to be sure of what we’re removing.
classify.seqs
mothur > classify.seqs(fasta=example.trim.contigs.good.unique.good.filter.unique.precluster.pick.fasta, count=example.trim.contigs.good.unique.good.filter.unique.precluster.denovo.uchime.pick.count_table, reference=silva.v4.v132.align, taxonomy=silva.nr_v132.tax, cutoff=80)
Using 2 processors.
Generating search database... DONE.
It took 78 seconds generate search database.
Reading in the silva.nr_v132.tax taxonomy... DONE.
Calculating template taxonomy tree... DONE.
Calculating template probabilities... DONE.
It took 168 seconds get probabilities.
Classifying sequences from example.trim.contigs.good.unique.good.filter.unique.precluster.pick.fasta ...
100
100
[truncated]
500
500
[WARNING]: M03662_8_000000000-AK09K_1_2104_20318_19579 could not be classified. You can use the remove.lineage command with taxon=unknown; to remove such sequences.
600
600
[truncated]
2163
2167
It took 111 secs to classify 4330 sequences.
It took 111 secs to classify 4330 sequences.
It took 1 secs to create the summary file for 4330 sequences.
Output File Names:
example.trim.contigs.good.unique.good.filter.unique.precluster.pick.nr_v132.wang.taxonomy
example.trim.contigs.good.unique.good.filter.unique.precluster.pick.nr_v132.wang.tax.summary
Note: Silva and GreenGenes are the options provided, in correct format for use in alignment and classification in mothur, by Pat Schloss and the mothur developers. GreenGenes is appropriate for most well-characterized host-associated gut communities, but is poor for environmental data sets. If you are working with environmental data, it maybe a good idea to find or make an alternative database in order to get better classifications. See an example for lake systems created by the McMahon lab here on campus: https://github.com/McMahonLab/TaxAss/tree/master/FreshTrain-files.
Note: With most data sets, you will get warnings about things that could not be classified. These are labeled as domain ‘unknown’ if the classifier cannot classify the sequence to Bacteria, Archaea, or Eukaryota.
Now that everything is classified, we want to remove our undesirables. We do this with the remove.lineage
command. We’ll just remove non-Bacteria domains and completely unclassifieds (“unknown”).
remove.lineage
mothur > remove.lineage(fasta=example.trim.contigs.good.unique.good.filter.unique.precluster.pick.fasta, count=example.trim.contigs.good.unique.good.filter.unique.precluster.denovo.uchime.pick.count_table, taxonomy=example.trim.contigs.good.unique.good.filter.unique.precluster.pick.nr_v132.wang.taxonomy, taxon=unknown;-Archaea;-Eukaryota;)
[NOTE]: The count file should contain only unique names, so mothur assumes your fasta, list and taxonomy files also contain only uniques.
Output File Names:
example.trim.contigs.good.unique.good.filter.unique.precluster.pick.nr_v132.wang.pick.taxonomy
example.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.fasta
example.trim.contigs.good.unique.good.filter.unique.precluster.denovo.uchime.pick.pick.count_table
summary.seqs
mothur > summary.seqs(fasta=example.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.fasta, count=example.trim.contigs.good.unique.good.filter.unique.precluster.denovo.uchime.pick.pick.count_table)
Using 2 processors.
Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 416 241 0 3 1
2.5%-tile: 1 416 251 0 3 3019
25%-tile: 1 416 252 0 3 30189
Median: 1 416 252 0 4 60378
75%-tile: 1 416 252 0 4 90566
97.5%-tile: 1 416 253 0 5 117736
Maximum: 1 416 255 0 8 120754
Mean: 1 416 251 0 3
# of unique seqs: 4330
total # of seqs: 120754
It took 0 secs to summarize 120754 sequences.
Output File Names:
example.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.summary
total # of seqs: | 120538
At this point, we have cleaned-up our data as far as possible. Our file names are very long at this point, so let’s copy and rename them before moving onto some analysis.
Rename final files
mothur > system(cp example.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.fasta example.final.fasta)
mothur > system(cp example.trim.contigs.good.unique.good.filter.unique.precluster.denovo.uchime.pick.pick.count_table example.final.count_table)
Now we can define operational taxonomic units (OTUs), the sequencing proxy for a microbial species. We do this by calculating distances between sequences (how difference they are from each other) with dist.seqs
and then clustering these distances based on a difference cutoff with cluster.split
. In general, cutoffs are
dist.seqs
mothur > dist.seqs(fasta=example.final.fasta)
Using 2 processors.
Sequence Time Num_Dists_Below_Cutoff
0 0 0
100 0 5050
[truncated]
4300 59 4563820
3000 61 4501500
It took 64 secs to find distances for 4330 sequences. 9372285 distances below cutoff 1.
Output File Names:
example.final.dist
Note: For fungal data, you should use pairwise.seqs
instead of dist.seqs
because your sequences are not aligned by align.seqs
which is required by dist.seqs
.
Note: There are a couple of tricks to decrease your distance matrix size for large data sets (I’ve seen V3-4 data get up to 1TB). You can set a “cutoff” in dist.seqs. This causes mothur to not save distance greater than the cutoff. For example, if you’re only going to look at genus and species level OTUs, you could set your cutoff to 0.1. You can also output in lower triangle form (output=lt), which only saves one distance for every pairwise comparison.
For clustering, we have a number of options
You can find full definitions of these methods here but the general view is that furthest is too strict (creates too many OTUs), nearest is too lenient (too few OTUs), and average is in the middle. Sort of a Goldilocks situation.
opticlust is a very new method developed by the Schloss group, and it appears to perform as good or better than average neighbor. For more information, see http://msphere.asm.org/content/2/2/e00073-17
We will use opti here, because it is the fastest and “best”. Ths output for opticlust is super long, so I am not going to include it below. But if it looks like chaos with lots of tabs and temp files, it is probably working fine.
cluster.split
mothur > cluster.split(column=example.final.dist, count=example.final.count_table, method=opti, cutoff=0.03)
[truncated]
Output File Names:
example.final.opti_mcc.list
example.final.opti_mcc.sensspec
Now that we have OTUs, we’d like to look at them, right? Well, distance and cluster files are not for human eyes. So we use make.shared
to combine these data with our sample names to create a table we can understand. We will have mothur only give us the species-level OTUs label=0.03
but you could ask for any level that you like (as long as it’s below the cutoff you may have used in dist.seqs
).
make.shared
mothur > make.shared(list=example.final.opti_mcc.list, count=example.final.count_table, label=0.03)
0.03
Output File Names:
example.final.opti_mcc.shared
Let’s look at it!!! This file has many, many columns and is difficult to look at in the Terminal. So, we will open the .shared
file in Excel or a text editor. You’ll see that for each sample, we have how many times each OTU occurs in that sample.
label | Group | numOtus | Otu0001 | Otu0002 … |
---|---|---|---|---|
0.03 | L62.1 | 2562 | 1362 | 770 … |
0.03 | S62.1 | 2562 | 10015 | 9316 … |
Now that you have all these OTUs, what are they? We find out by repeating classify.seqs
on our final files and then grouping based on our OTUs with classify.otu
. Here, we will use GreenGenes instead of Silva. Like classifying before, you can use either. In general, GreenGenes will give fewer unclassifieds and more classifications down to lower levels. But when you go to research those classifications, they could be from a clone library or other non-culturing technique so there is little you can say about that species. Silva is only full-length 16S so you have fewer sequences to compare to and get more unclassifieds. However, the classifications are more often to cultured species.
There are also specific taxonomic groups that are more or less represented in each database. This would be specific to your system and you’d have to classify to both to find out!
classify.seqs
mothur > classify.seqs(fasta=example.final.fasta, count=example.final.count_table, template=gg_13_8_99.fasta, taxonomy=gg_13_8_99.gg.tax, cutoff=80)
Using 2 processors.
Generating search database... DONE.
It took 180 seconds generate search database.
Reading in the gg_13_8_99.gg.tax taxonomy... DONE.
Calculating template taxonomy tree... DONE.
Calculating template probabilities... DONE.
It took 526 seconds get probabilities.
Classifying sequences from example.final.fasta ...
100
100
[truncated]
2167
2163
It took 99 secs to classify 4330 sequences.
It took 99 secs to classify 4330 sequences.
It took 0 secs to create the summary file for 4330 sequences.
Output File Names:
example.final.gg.wang.taxonomy
example.final.gg.wang.tax.summary
classify.otu
mothur > classify.otu(list=example.final.opti_mcc.list, taxonomy=example.final.gg.wang.taxonomy, count=example.final.count_table, label=0.03, cutoff=80, basis=otu, probs=F)
0.03 2562
Output File Names:
example.final.opti_mcc.0.03.cons.taxonomy
example.final.opti_mcc.0.03.cons.tax.summary
Looking at it.
mothur > system(head -5 example.final.opti_mcc.0.03.cons.taxonomy)
OTU Size Taxonomy
Otu0001 11377 k__Bacteria;p__Bacteroidetes;c__Bacteroidia;o__Bacteroidales;f__Prevotellaceae;g__Prevotella;s__ruminicola;
Otu0002 10086 k__Bacteria;p__Bacteroidetes;c__Bacteroidia;o__Bacteroidales;f__Prevotellaceae;g__Prevotella;s__ruminicola;
Otu0003 7550 k__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Aeromonadales;f__Succinivibrionaceae;f__Succinivibrionaceae_unclassified;f__Succinivibrionaceae_unclassified;
Otu0004 5270 k__Bacteria;p__Firmicutes;c__Clostridia;o__Clostridiales;f__Veillonellaceae;g__Succiniclasticum;g__Succiniclasticum_unclassified;
Now, the species level classifications are dicey. Because we don’t have full length 16S, we can’t say with certainty that OTU2 is Prevotella ruminicola. What we can say is that OTU2 is most closely related to Prevotella ruminicola, so if we were going to guess at OTU2’s function in the cow gut, data on Prevotella ruminicola would be a good place to start. And that’s not even getting into strain level diversity.
How to you know you have enough sequences to see all/most of the diversity in your samples? Do you need to redo some samples? You can visualize this by rarefaction and calculate it with Good’s coverage.
rarefaction.single
mothur > rarefaction.single(shared=example.final.opti_mcc.shared)
Processing group L62.1
0.03
Processing group S62.1
0.03
Output File Names:
example.final.opti_mcc.groups.rarefaction
This will produce a file called example.final.opti_mcc.groups.rarefaction
. Looking at it…
mothur > system(head -15 example.final.opti_mcc.groups.rarefaction)
numsampled 0.03-L62.1 lci-L62.1 hci-L62.1 0.03-S62.1 lci-S62.1 hci-S62.1
1 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000
100 68.2160 60.0000 77.0000 61.3720 52.0000 70.0000
200 118.8010 106.0000 130.0000 104.8190 93.0000 117.0000
300 161.3850 148.0000 174.0000 141.2600 128.0000 154.0000
400 198.5620 183.0000 215.0000 173.1300 158.0000 188.0000
500 231.6170 213.0000 248.0000 201.9080 184.0000 219.0000
600 261.3690 242.0000 279.0000 228.3600 211.0000 247.0000
700 288.4680 269.0000 307.0000 252.6850 234.0000 273.0000
800 313.3590 294.0000 332.0000 275.3760 255.0000 296.0000
900 336.4050 317.0000 356.0000 296.7510 276.0000 318.0000
1000 357.9280 339.0000 379.0000 316.5280 296.0000 338.0000
1100 378.2830 358.0000 399.0000 335.6720 314.0000 357.0000
1200 397.2660 377.0000 418.0000 353.6050 332.0000 375.0000
1300 415.2510 394.0000 437.0000 370.8000 349.0000 393.0000
it’s difficult to tell what’s going on. The best way to use this data is to graph it and look from the new unique sequences gained per ## of sequences added to level off. At a certain point, you need 1000s of sequences to get only 1 more unique sequence so it’s not worth it (and probably not even real)
For example (made in Excel)
However, this doesn’t numerically/statistically show you your coverage and is subject to axis-skewing. For example, our coverage looks pretty good in the above plot, but if we compress the x-axis, it doesn’t look nearly as good.
Thus, it is better to also calculate coverage with another method, like Good’s.
The metric is an estimate of the percent of diversity (species) captured in a sample by your current sequencing data. In general, we’re looking for 95% or better. We can also look at the number of sequence nseqs
and number of unique OTUs sobs
in each sample.
summary.single
mothur > summary.single(shared=example.final.opti_mcc.shared, label=0.03, calc=nseqs-sobs-coverage)
Processing group L62.1
0.03
Processing group S62.1
0.03
Output File Names:
example.final.opti_mcc.groups.summary
If we look at it,
mothur > system(head -3 example.final.opti_mcc.groups.summary)
label group nseqs sobs coverage
0.03 L62.1 12653.000000 930.000000 0.992176
0.03 S62.1 108101.000000 2290.000000 0.994422
we see that we have very good coverage of these samples, >99%
There are two easy options for normalization in mothur. There is no “right” answer for normalization, and it is a rather hot topic in the field. You may find a number of other methods outside of mothur and even papers arguing against normalization at all.
In mothur, you can sub-sample down to your lowest number of sequences in a single sample with sub.sample
.
sub.sample
mothur > sub.sample(shared=example.final.opti_mcc.shared)
Sampling 12653 from each group.
0.03
Output File Names:
example.final.opti_mcc.0.03.subsample.shared
Note: The default is to subsample to the number of sequences which is seen in the smallest sample. If we has a sample which did not sequence well and only had 10 reads, we would cut out all but 10 reads from every one of our samples using this method. In order to subsample to a specific number, 10000 for example, include “size=10000”.
Or you can calculate percent relative abundance (RA) (# OTU1 sequences/total # sequences in sample) and normalize to your lowest number of sequences in a single sample i.e. %RA * lowest # of sequences
normalize.shared
mothur > normalize.shared(shared=example.final.opti_mcc.shared, method=totalgroup)
Normalizing to 12653.
0.03
Output File Names:
example.final.opti_mcc.0.03.norm.shared
Note: The “method” parameter here is used if we simply want to normalize to the number of sequences which is seen in the smallest sample. If we has a sample which did not sequence well and only had 10 reads, we would cut out all but 10 reads from every one of our samples using this method. In order to normalize to a specific number,10000 for example, use “norm=10000” in place of “method=totalgroup”.
When we compare these two methods, we see some differences that can be easily viewed with summary.single
summary.single
mothur > summary.single(shared=example.final.opti_mcc.0.03.subsample.shared, label=0.03, calc=coverage-nseqs)
mothur > summary.single(shared=example.final.opti_mcc.0.03.norm.shared, label=0.03, calc=coverage-nseqs)
mothur > system(head -7 example.final.opti_mcc.0.03.subsample.groups.summary)
mothur > system(head -7 example.final.opti_mcc.0.03.norm.groups.summary)
label group coverage nseqs
0.03 L62.1 0.992176 12653.000000
0.03 S62.1 0.962855 12653.000000
label group coverage nseqs
0.03 L62.1 0.992176 12653.000000
0.03 S62.1 0.967848 12472.000000
We see that:
.shared
files only allow whole numbers. When normalizing to a high number, like >12000 here, this is not an issue as the difference is only ~1%. However, if normalizing to a lower number, this can cause the same problems with unequal sample coverage as we were trying to avoid by normalizing in the first place! In this case, I suggest normalizing outside of mothur in a program, like R, that would allow non-whole numbers in your OTU table.S62.1
sample. This is still above our goal of 95%, so no issues. However, if you were to normalize or sub-sample to too low of a number, you will have very poor coverage for more diverse samples. In this case, you should consider removing samples with the lowest number of sequences and normalizing/sub-sampling to a higher value.Of note, the norm.shared table is the data type used for all R-based analyses in the R workshop.
There are a number of functions in mothur to perform statistical and visual analysis of your microbiota data. However, mothur itself, cannot display figures and does not handle much manipulation of the data. So, most analyses should be done in a another program, like R, and this section is just to give you some tools to begin to explore your data in mothur.
We will focus on visuals here as statistical modeling is outside the scope of this tutorial and specific to a given data set. Please note that mothur’s figures are not the prettiest…
If you would like to explore statistical analyses in mothur, please see the commands page for:
A lot of research is interested not in the specific microorganisms in a community but the overall diversity and richness of that community. This is within sample diversity. You can calculate all of the classic metrics in mothur. For a full list, see here.
The most important thing before calculating diversity is to normalize your data. No matter the actual community, a sample with 5000 sequences will appear more diverse than one with 3000 due to sequencing errors. This can be done using the norm or subsample data we made above or with the subsample=t
modifier, which randomly subsamples a number of times and averages across those subsamples.
We will calculate alpha-diversity and richness using summary.single
.
summary.single
mothur > summary.single(shared=example.final.opti_mcc.0.03.subsample.shared, label=0.03, calc=coverage-nseqs-bergerparker-chao-shannon)
mothur > summary.single(shared=example.final.opti_mcc.0.03.norm.shared, label=0.03, calc=coverage-nseqs-bergerparker-chao-shannon)
Output File Names:
example.final.opti_mcc.0.03.subsample.groups.summary
Output File Names:
example.final.opti_mcc.0.03.norm.groups.summary
Or using subsample=t
mothur > summary.single(shared=example.final.opti_mcc.shared, label=0.03, subsample=t, calc=coverage-nseqs-bergerparker-chao-shannon)
Output File Names:
example.final.opti_mcc.groups.ave-std.summary
example.final.opti_mcc.groups.summary
When we compare alpha-diversity with the two normalization methods…
mothur > system(head -7 example.final.opti_mcc.0.03.subsample.groups.summary)
mothur > system(head -7 example.final.opti_mcc.0.03.norm.groups.summary)
mothur > system(head -7 example.final.opti_mcc.groups.summary)
label group coverage nseqs bergerparker chao chao_lci chao_hci shannon shannon_lci shannon_hci
0.03 L62.1 0.992176 12653.000000 0.107642 957.100559 945.568189 977.175705 5.398214 5.364433 5.431995
0.03 S62.1 0.962855 12653.000000 0.097289 1789.945783 1644.576220 1976.065980 5.073862 5.036851 5.110874
label group coverage nseqs bergerparker chao chao_lci chao_hci shannon shannon_lci shannon_hci
0.03 L62.1 0.992176 12653.000000 0.107642 957.100559 945.568189 977.175705 5.398214 5.364433 5.431995
0.03 S62.1 0.967848 12472.000000 0.093970 1510.869565 1411.207450 1640.074563 5.055335 5.018381 5.092289
label group coverage nseqs bergerparker chao chao_lci chao_hci shannon shannon_lci shannon_hci
0.03 L62.1 0.992176 12653.000000 0.107642 957.100559 945.568189 977.175705 5.398214 5.364433 5.431995
0.03 S62.1 0.994422 108101.000000 0.092645 2823.832353 2723.090473 2948.007965 5.166903 5.153476 5.180330
we see that richness (chao) is more effected by normalization method than diversity (shannon).
Also, mothur gives us the low confidence interval (lci) and high confidence interval (hci) for our alpha-metrics. These could be used as error bars, if desired.
This is between sample diversity meaning every pairwise comparison of two samples has a unique value. You can visualize beta-diversity in a number of ways.
For principle component analysis (PCA), an overall microbial community is represented by a single point in an xy- or xyz-plane. Two points that are closer to each other indicate that the overall microbiota of those two samples are more similar than two points that are farther apart.
PCA fits axes to the data until 100% of the variation is explained. Thus, you do not specify the number of axes to calculate; you just plot xy or xyz after reviewing how much adding the z-axis adds to the representation of the data. PCA is calculated from the shared file.
This is not going to run properly because there is too little variation in our two-pont data set to properly calculate a PCA. So we are not going to run it because it takes forever to tell us it can’t complete. Theoretically, the output would be as shown below.
pca
mothur > pca(shared=example.final.opti_mcc.0.03.norm.shared, label=0.03)
Processing 0.03
Rsq 1 axis: 0
Rsq 2 axis: 0
Rsq 3 axis: 0
Output File Names:
example.final.opti_mcc.0.03.norm.0.03.pca.axes
example.final.opti_mcc.0.03.norm.0.03.pca.loadings
The outputs are
Making the plot (Excel), we have only 2 samples in this tutorial so we only need 1 axis to explain and separate our data. This will not occur with full data sets.
(This plot was made from a previous version of this tutorial in which PCA did run properly.)
nMDS is similar to PCA in that it assigns a single xy(z) point to each sample relative to all other samples. nMDS is more robust than PCA, however, because it permutes through calculations a number of times to find a best fit. This fit is constrained within the number of axes you specify.
mothur currently does not have a function for calculating nMDS for samples. The current nmds
function calculates a point for each OTU in a plane. However, this method is something to consider when analyzing data outside of mothur.
You can visualize beta-diversity metrics by heatmaps with heatmap.sim
. Each box is a pairwise comparison between two samples. mothur has several beta-diversity calculator options. Here, we will use the Bray-Curtis metric.
heatmap.sim
mothur > heatmap.sim(shared=example.final.opti_mcc.0.03.norm.shared, label=0.03, calc=braycurtis)
0.03
Output File Names:
example.final.opti_mcc.0.03.norm.0.03.braycurtis.heatmap.sim.svg
You can easily make a tree where each node is an individual sample in mothur using tree.shared
. We will run it with our normalized shared
file and Bray-Curtis here.
tree.shared
mothur > tree.shared(shared=example.final.opti_mcc.0.03.norm.shared, label=0.03, calc=braycurtis)
Using 2 processors.
0.03
1
1
Output File Names:
example.final.opti_mcc.0.03.norm.braycurtis.0.03.tre
If we open it in a program such as FigTree:
mothur does not have an easy way to create a tree where the nodes are OTUs. In order to make a tree of your OTUs outside of mothur, you need a representative sequence for each OTU. Otherwise, you end up with many more branches then there are OTUs (i.e. species) in your data set. You then calculate a distance matrix of these representative OTU sequences and calculate a tree from this in R or another program. This sort of tree is used for UniFrac beta-diversity calculators.
We will not actually run this analysis in this workshop as it is RAM and time intensive. However, if we were to extract representative sequences, we would use get.oturep
. The first option is to have mothur calculate the best possible representative sequence for each OTU. This is the sequence with smallest maximum distance to the other sequences in an OTU group.
DO NOT RUN THIS
mothur > get.oturep(column=example.final.dist, list=example.final.opti_mcc.list, fasta=example.final.fasta, count=example.final.count_table, label=0.03, method=distance, large=T)
Or you can have mothur just choose the most abundant unique sequence within each OTU group as the representative. This representative sequence may not be the best possible one for all uniques within an OTU, but this runs much faster and with much less RAM than the other method.
DO NOT RUN THIS
mothur > get.oturep(list=example.final.opti_mcc.list, fasta=example.final.fasta, count=example.final.count_table, label=0.03, method=abundance)
These representative sequences have long, horrible names given to them by the MiSeq. We rename them based on the OTU# using a custom python script, courtesy of Tony Neumann (Suen lab). I usually run this locally (not on a server) because it is quick and easy. So step one is to download the rep.fasta file generated above onto your local drive. Then, copy the below python code into a text file and save it as something useful, like “clean_repFasta.py”.
import sys
import re
#an empty list to hold the lines, as strings, of the input fasta file
info = []
#iterate through the input file and file up the list
with open(sys.argv[1], 'r') as fasta_in:
for line in fasta_in:
info.append(line.replace('\r', '').rstrip('\n'))
#an empty list to hold the OTU fasta seqs and their IDs as tuples
seqs = []
#iterate through the info list, replace the complex seq IDs with just the OTU ID, and fill up the seqs list
for index, item in enumerate(info):
if index == 0:
ID = re.search('Otu[0-9]*',item[1:]).group(0)
else:
if index % 2 == 0:
ID = re.search('Otu[0-9]*',item[1:]).group(0)
else:
seqs.append((ID, item))
#write the contents of the seqs list to a new file called "clean_repFasta.fasta"
with open("clean_repFasta.fasta", 'w') as fasta_out:
for item in seqs:
fasta_out.write('>' + item[0] + '\n' + item[1] + '\n')
To run this, first dowload Python appropriate for your OS. Then, open the command prompt on your computer and navigate to the folder where your fasta file is saved. If my python script and my fasta file are both in a Desktop folder called “demo” on my Windows laptop, that would look like this:
C:\Users\Madison>cd Desktop/demo/
C:\Users\Madison\Desktop\demo>clean_repFasta.py example.final.opti_mcc.0.03.rep.fasta
This outputs a new fasta called clean_repFasta.fasta
. Then, we would calculate the distance matrix with the new fasta. We definitely what to use the lower triangle lt
output here so our matrix is small enough to read into R later on.
DO NOT RUN THIS
mothur > dist.seqs(fasta=clean_repFasta.fasta, output=lt)
You would then load this distance matrix into R or another program to calculate and visualize the tree.
Similar to the heatmap of samples, we can do a heatmap for OTUs. Instead of pairwise comparisons, each box represents the relative abundance of an OTU in a sample. You can scale the abundance to make differences and low abundance OTUs easier to see. We will use the default log10 scale and only visualize the 10 most abundant OTUs.
heatmap.bin
mothur > heatmap.bin(shared=example.final.opti_mcc.0.03.norm.shared, label=0.03, scale=log10, numotu=10)
0.03
Output File Names:
example.final.opti_mcc.0.03.norm.0.03.heatmap.bin.svg
Venn diagrams can show you which OTUs are shared by samples. mothur can do 2, 3, or 4 sample Venns.
Note: If you have more than 4 samples, you can use the permute
parameter to have mothur create all Venn groups of 2, 3, or 4 between all of your samples.
venn
mothur > venn(shared=example.final.opti_mcc.0.03.norm.shared, label=0.03, calc=sharedsobs)
0.03
Output File Names:
example.final.opti_mcc.0.03.norm.0.03.sharedsobs.L62.1-S62.1.svg
Large data sets (many samples) and those with high error (V3-4) can create very large .dist
files on the order of terabytes. This can prevent downstream analysis like clustering because the distance matrix is too large to read into your available RAM. And even when things do run, you may not want to wait days for a step to complete. If this happens, there are a number of steps you can take to reduce the distance matrix.
dist.seqs
If you’re only going to look at genus (0.05) and/or species (0.03) OTUs in the end, set cutoff=0.05
for dist.seqs so it doesn’t save values larger than that.
You can also save the output in lower triangle format so that only one value is saved for every pairwise comparison between sequences. This is done with output=lt
in dist.seqs. If you do this, the output will be example.final.phylip.dist
and you will need to alter cluster.split to be phylip=example.final.phylip.dist
instead of using column=
.
cluster.split
Run cluster with large=true
to tell mothur the distance matrix is too large to read into RAM all at once.
pre.cluster
If you have a longer amplicon, less overlap of the two sequences that are combined into contigs, and/or a high error rate in sequencing, it is worth increasing pre.cluster to diffs=3 or 4
. This will group more sequences together and while it does not remove any data, you will end up with fewer uniques and therefore, a smaller distance matrix.
split.abund
This function can be used to remove very low abundance sequences. It creates a rare
and abund
data set. When cutoff=1
, the abund
data set includes all sequences that are present in the overall data set in 2 or more copies. The rare
data set is often referred to as singletons. Removing singletons (or doubletons, etc) dramatically reduces your unique sequences without strongly impacting beta-diversity. However, alpha-diversity is effected by singletons so please be careful to compare values only across data sets that have been handled in a similar way. This command would be used as:
mothur > split.abund(fasta=example.final.fasta, count=example.final.count_table, cutoff=1)
Now that you know what you are doing, and why, how do you avoid ever having to type all of these ever again? mothur is able to accept batch files which contain a list of all of the commands you want to feed it. Here is a link to the Schloss lab SOP. The zip file with their data includes a batch file which you could adapt: https://www.mothur.org/wiki/miseq_sop
First, let’s get out of mothur…
mothur > quit()
… then, let’s copy over our mock batch file.
$ cp /mnt/software/workshop/data/MOTHUR/batch.txt ~/miseq_sop
Now, let’s look at the batch file. It contains three commands which should run fairly quickly: summary.seqs, screen.seqs, and unique.seqs.
$ head -5 batch.txt
summary.seqs(fasta=example.trim.contigs.fasta)
screen.seqs(fasta=example.trim.contigs.fasta, group=example.contigs.groups, maxambig=0, maxlength=300, maxhomop=8)
unique.seqs(fasta=example.trim.contigs.good.fasta)
We feed it to mothur using the filepath to the mothur executable followed by a space, then the name of our batch file.
$ /mnt/software/mothur/default/mothur batch.txt
You can see that mothur was opened, the commands were executed, then mothur was automatically closed. This is useful for large data sets, where certain commands (pre.cluster, dist.seqs) can take a VERY long time. This allows you to execute all the necessary commands without having to babysit the computer.
Note: Because we are in a new session of mothur, it has forgotten our preferences for multiple processors. You must indicate at the beginning of each session how many processors you want mothur to use. This is particularly important when you are using a batch file with a large data set and not paying as close attention to your data as it runs. Your first command in a batch file should indicate the number of processors you want to use.