TransDecoder and Trinotate

This tutorial will cover transcriptome annotation with TransDecoder and Trinotate. Both packages are designed to work with transcript sequences generated by aligned RNA-seq reads to a genome/transcriptome (ie. Tophat or Cufflinks output) or transcripts generated by the de-novo transcriptome assembly program Trinity. In this tutorial we will be working with a subset of a de-novo transcriptome assembly of the wetland plant, Decodon verticillatus.

This tutorial will be done in command line. If you are unfamiliar with command line coding, here are some tips:

When you create a folder to house all the files for this tutorial make sure there are no spaces in the name. Spaces have meaning in the Linux coding language, so spaces in directory names will cause problems.
You can use tab to autocomplete directory names, variable names, etc. This will help you code much more efficiently!
The “ls” command prints out everything in your current working directory. Runnning this command often is a good way to make sure that you have the files you expect to have.
You can also run into problems if you are not in the correct working directory. Use “pwd” to check your current working directory and make sure you are where you want to be.

If you are using a Mac you can just use the pre-installed Terminal app. If you’re using a Mac you can also use the package manager Homebrewer to download certain packages for this tutorial.
Check out the Homebrewer website for more information

# copy and paste the following code into your terminal
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
# enter your login password to your computer when prompted
# nothing will show up, but you can just type your password and hit enter
# the script will print out progress reports as it installs

If you are using a PC, you can download MobaXterm to access the terminal.

Download data

Create a folder on your computer for this tutorial
Download the “Data” folder from OnQ and click to unzip it, then drag it into your tutorial folder
Name your working directory “TUTORIAL_HOME” and set the path to your working directory with the code below

# TUTORIAL_HOME=path/to/working/directory
# for example, I am setting my working directory to a folder in my documents directory
# replace the code below with the path to your tutorial folder
TUTORIAL_HOME=/Users/hanathompson/Documents/Courses/Biol830/tutorial

Navigate to your working directory

# change working directory with "cd"
cd $TUTORIAL_HOME
# make sure this worked by checking your present working directory with "pwd"
pwd
# the output should be whatever you have set as TUTORIAL_HOME

Transdecoder

TransDecoder identifies coding regions in transcript sequences by identifying open reading frames (ORFs). ORFs are nucleotide sequences that can be read into amino acid sequences. For more information, check out the TransDecoder github.

Download TransDecoder

TransDecoder can be downloaded from here
Click on Source code (tar.gz)
Move the tar.gz file from your downloads to your working directory

first define the path to your downloads folder as DOWNLOADS

# copy the code below but replace with the path to your downloads folder
DOWNLOADS=/Users/hanathompson/Downloads
# we can move files around using "mv"
# we can also use "mv" to rename it to something a little shorter
mv $DOWNLOADS/TransDecoder-TransDecoder-v5.5.0.tar.gz $TUTORIAL_HOME/TransDecoder.tar.gz

Check the contents of your working directory

# use "ls" to list the contents of a directory
ls

Data TransDecoder.tar.gz

# to look at the contents of Data 
ls Data

you should have the following files

Trinity.fasta       Trinity.fasta.gene_trans_map

Trinity.fasta is the output from the de-novo assembly program Trinity
Trinity takes RNA-seq reads and assembles them into a transcriptome
You can find more information about Trinity here
- Trinity.fasta files follow the following format:
- Line 1: sequence identification information
  - For example, take the sequence ID TRINITY_DN0_c0_g1_i2
  - This is Trinity read cluster TRINITY_DN0_c0, gene 1, isoform 2
- Line 2: nucleotide sequence
Trinity.fast.gene_trans_map contains just the gene/transcript identification information

Extract TransDecoder.tar.gz files

the tar command can be used to compress or extract files
the x parameter tells the command to extract the files
the f parameter is used to specify the filename

tar -xf TransDecoder.tar.gz
ls

We now have a new directory called TransDecoder-TransDecoder-v5.5.0

Data TransDecoder-TransDecoder-v5.5.0

Let’s rename it TransDecoder for simplicity

mv TransDecoder-TransDecoder-v5.5.0 TransDecoder
ls

Data TransDecoder

Run TransDecoder

Set TRANSDECODER_HOME pathname

TRANSDECODER_HOME=$TUTORIAL_HOME/TransDecoder

Navigate to TransDecoder directory and check that all the required files are there

# try using tab to complete $TRANSDECODER_HOME by typing the first few characters then hitting tab
cd $TRANSDECODER_HOME
ls

Changelog.txt       README.md       TransDecoder.wiki   util
LICENSE.txt     TransDecoder.LongOrfs   __testing
Makefile        TransDecoder.Predict    notes
PerlLib         TransDecoder.lrgTests   sample_data

Extract long open reading frames (ORFs)

We can extract ORFs using the TransDecoder.LongOrfs script
The only required parameter is -t, which is the fasta file containing the transcripts
There are several other optional parameters
-m is the minimum protein length (ie. number of amino acids). This is set to 100 by default, but you can raise or lower this value
-O can be used to define the output directory. By default, it will create an output directory in whatever directory you ran the script and name it “inputfilename.transdecoder_dir”
for a complete list of the parameters you can run “less TransDecoder.LongOrfs” to look at the script
- click “q” to exit the file viewer

$TRANSDECODER_HOME/TransDecoder.LongOrfs -t $TUTORIAL_HOME/Data/Trinity.fasta

Check output

ls

Changelog.txt                       TransDecoder.wiki
LICENSE.txt                       Trinity.fasta.transdecoder_dir
Makefile                            Trinity.fasta.transdecoder_dir.__checkpoints_longorfs
PerlLib                             __testing
README.md                           notes
TransDecoder.LongOrfs       pipeliner.4823.cmds
TransDecoder.Predict        sample_data
TransDecoder.lrgTests       util

We now have the output directory, Trinity.fasta.transdecoder_dir

Navigate to output directory

cd Trinity.fasta.transdecoder_dir
ls

base_freqs.dat      longest_orfs.cds    longest_orfs.gff3   longest_orfs.pep

base_freqs.dat is the number and frequency of each base, ACTG
longest_orfs.cds lists all ORFs found in the target transcripts
longest_orfs.gff3 contains information about the position of each ORF in the target transcripts
longest_orfs.pep lists all ORFs that met the minimum length criteria (100 in this case)

Predict coding regions

In this step, TransDecoder is identifying likely coding regions of the ORFs using the TransDecoder.Predict script
Again, the only required parameter is -t, identifying the path to the target transcripts
To look at the script and other parameters, you can run “less TransDecoder.Predict”
We are going to add one more parameter, –no_refine_starts to turn off start refinement
Start refinement identifies potential start codons for partial ORFs, however, it doesn’t work very well unless using a full dataset. Since the data provided is just a small subset of a full transcriptome assembly we will turn this off to make sure it runs correctly.

cd $TRANSDECODER_HOME
$TRANSDECODER_HOME/TransDecoder.Predict -t $TUTORIAL_HOME/Data/Trinity.fasta --no_refine_starts
ls

When you look at the files in TransDecoder you should now see several files with “Trinity.fasta.transdecoder” in the name, with file extensions .bed, .cds, .gff3, and .pep
- .pep: this file contains the peptide sequences of the identified ORFs
- .cds: this file contains the nucleotide sequences of the coding regions of your ORFs
- .gff3: this file contains information about the position of each ORF within the target transcripts
- .bed: this is a bed-formatted file that contains information about the ORF position and can be used for visualization with IGV or GenomeView

Trinotate

Trinotate is used for functional annotation of transcriptomes. It uses databases such as BLAST+ and HMMER/PFAM to search for homologous genes and annotate your transcriptome with likely gene functions. We will have to download some additional software for Trinotate to run.
The Trinotate github page can be accessed here

Download software

Download Trinotate here

Click on Source code (tar.gz)
Move the tar.gz file from Downloads to working directory the same as above
Unzip tar.gz file and create Trinotate directory

cd $TUTORIAL_HOME
mv $DOWNLOADS/Trinotate-Trinotate-v3.2.1.tar.gz $TUTORIAL_HOME/Trinotate.tar.gz
tar -xf Trinotate.tar.gz
mv Trinotate-Trinotate-v3.2.1 Trinotate
ls

now your working directory (TUTORIAL_HOME) should look like this

Data    TransDecoder    TransDecoder.tar.gz  Trinotate  Trinotate.tar.gz

Download SQLite, which is required for database integration, here

Click on download under latest release
Under the Source Code heading click on the link ending with .tar.gz
Move into working directory, unzip and create SQLite directory

cd $TUTORIAL_HOME
mv $DOWNLOADS/sqlite-autoconf-3340100.tar.gz $TUTORIAL_HOME/sqlite.tar.gz
tar -xf sqlite.tar.gz
mv sqlite-autoconf-3340100 SQLite

Download NCBI BLAST+: Blast database homology search here

find the .tar.gz file for your computer and click to download
macosx.tar.gz or windows64.tar.gz
move into working directory and unzip

cd $TUTORIAL_HOME
mv $DOWNLOADS/ncbi-blast-2.11.0+-x64-macosx.tar.gz $TUTORIAL_HOME 
tar -xf ncbi-blast-2.11.0+-x64-macosx.tar.gz

Mac users only: Download HMMER/PFAM Protein Domain identification with Homebrewer.

HMMER searches databases for sequence homologs
PC users skip to step 5

cd $TUTORIAL_HOME
brew install hmmer

You should get output that looks like this, telling you where hmmer has been installed on your computer

==> Pouring hmmer-3.3.2.catalina.bottle.tar.gz
🍺  /usr/local/Cellar/hmmer/3.3.2: 57 files, 12.7MB

We want to move hmmer from /usr/local/Cellar to TUTORIAL_HOME
- make sure to copy your pathname from your hmmer output because it might be slightly different

mv /usr/local/Cellar/hmmer $TUTORIAL_HOME
# set path to HMMER_HOME
HMMER_HOME=$TUTORIAL_HOME/hmmer/3.3.2

PC users: download HMMER/PFAM from here

click on Download source
move into working directory and unzip as before

Our TUTORIAL_HOME directory is looking a little messy, so let’s clean it up by putting all the gzipped files into a separate directory. It is useful to keep these so that you can restart the analysis at any point if something goes wrong.

# first we make a directory called "GZIP" with "mkdir"
mkdir GZIP
# then we can move anything ending with .tar.gz using the regular expression quantifier *
# * can replace anything - any character and any number of characters
# so this syntax means move any files ending with .tar.gz
mv $TUTORIAL_HOME/*.tar.gz $TUTORIAL_HOME/GZIP
ls

Now TUTORIAL_HOME should only contain

Data            SQLite          Trinotate       ncbi-blast-2.11.0+
GZIP            TransDecoder        hmmer

And if we check our GZIP directory, we’ll find all the zipped files

cd $TUTORIAL_HOME/GZIP
ls

Build sequence databases

Trinotate also requires the SwissProt and Pfam databses to run, and the Trinotate package contains scripts to download and build these databases.

Set path to TRINOTATE_HOME

TRINOTATE_HOME=$TUTORIAL_HOME/Trinotate
cd $TRINOTATE_HOME

Check if you have Wget installed

This is a package that allows you to download items from the internet
Check if you have it pre-installed by running:

wget

If you get this output

wget: missing URL

Then you have wget installed and can skip to step 3
If you get this output

command not found: wget

Then install Wget using Homebrew

brew install wget
wget

you should now get the output

wget: missing URL

Run Build_Trinotate_Boilerplate_SQLite_db.pl perl script

$TRINOTATE_HOME/admin/Build_Trinotate_Boilerplate_SQLite_db.pl  Trinotate
# this step will take about 20-30min so now is a good time to take a lunch break!
ls

You should now see the following files added to your Trinotate directory
- Trinotate.sqlite: this is a Trinotate boilerplate sqlite database - it contains protein sequence information from various databases
- uniprot_sprot.pep: this file contains protein sequences and will be used to look for matches within the BLAST database
- Pfam-A.hmm.gz: we will use HMMER on this file to use search the Pfam database for homologs

Prepare protein database for blast searches

Set path to NCBI_HOME

BLAST_HOME=$TUTORIAL_HOME/ncbi-blast-2.11.0+

use makeblastdb command to prepare protein databse
- -in tells makeblastdb where to look for protein sequence information
- -dbtype is the type of database. In this case it is protein sequences, so we write “prot”

$BLAST_HOME/bin/makeblastdb -in uniprot_sprot.pep -dbtype prot

You may get a pop-up error along the lines of ““makeblastdb” cannot be opened because the developer cannot be verified."
- If this happens, open a finder window and locate the makeblastdb exe file
- Right click on it and press open -> open with terminal
- It should open in a new terminal window, which you can close
- Then try running the command again (you may have to do this several times throughout the tutorial)

Uncompress and prepare Pfam database to use with hmmscan

gunzip unzips gzipped files (files with .gz extension)
the hmmpress command prepares an HMM database to do hmmscan (scan for sequence homologs)

cd $TRINOTATE_HOME
gunzip $TRINOTATE_HOME/Pfam-A.hmm.gz
$HMMER_HOME/bin/hmmpress $TRINOTATE_HOME/Pfam-A.hmm

Run sequence analysis

In this step we will use BLAST to look for homologies and identify proteins in our transcripts. We will run this on both the transcriptome assembly (Trinity.fasta) and on the likely ORFs predicted by TransDecoder (Trinity.fasta.transdecoder.pep)

Search trinity transcripts using blastx

there are 3 blast options, blastn, blastp, and blastx
blastn searches a nucleotide query (ie. you give it nucleotide sequences) against a nucleotide database
blastp searches a protein query (ie. you give it protein sequences) against a protein database
blastx searchs a nucleotide query against a protein database
to see all the parameters, enter $BLAST_HOME/bin/blastx -help
-query is the input file (Trinity.fasta)
-db is the database we want to search (uniprot_sprot.pep)
-max_target_seqs is the maximum number of aligned sequences to keep
- we’ve set this to 1, meaning that blast will return the first good match it finds in the database
- this will help reduce run time, however, this value should be chosen carefully (or not changed) because returning the first good match is not the same as returning the best match (see Shah et al. 2018 for more information)
-outfmt is a number between 0-18 that defines the output format
- -outfmt 6 is Tabular
- see blastx -help for the complete list
-out is our output pathname and filename

cd $TRINOTATE_HOME
# set path to BLAST_HOME
BLAST_HOME=$TUTORIAL_HOME/ncbi-blast-2.11.0+
# run blastx to blast Trinity transcripts against protein database
$BLAST_HOME/bin/blastx -query $TUTORIAL_HOME/Data/Trinity.fasta -db $TRINOTATE_HOME/uniprot_sprot.pep -max_target_seqs 1 -outfmt 6 -out $TRINOTATE_HOME/blastx.outfmt6
# take a look at the output file
less $TRINOTATE_HOME/blastx.outfmt6

Column 1: Query (unknown gene) sequence ID (from input fasta file)
Column 2: Subject (reference genome) sequence ID
Column 3: % of identical matches
Column 4: Alignment length (how much the sequences overlapped)
Column 5: Number of mismatches
Column 6: Number of gap openings
Column 7: Position in query where alignment starts
Column 8: Position in query where alignment ends
Column 9: Position in subject where alignment starts
Column 10: Position in subject where alignment ends
Column 11: Expect-value (measure of quality of match; smaller e-value means a better match)
- you can change this cutoff value (default is 10) using the -evalue parameter
Column 12: Bit-score (another measure of quality; larger bit-score means better sequence similarity)
press “q” to quit the file viewer

Search transdecoder-predicted proteins using blastp

we can use the same parameters as above and just change the input and output files
the output will be in the same format as above

$BLAST_HOME/bin/blastp -query $TRANSDECODER_HOME/Trinity.fasta.transdecoder.pep -db $TRINOTATE_HOME/uniprot_sprot.pep -max_target_seqs 1 -outfmt 6 -out $TRINOTATE_HOME/blastp.outfmt6
less $TRINOTATE_HOME/blastp.outfmt6

Run hmmer to identify protein domains

Now we will run hmmscan on the database we prepared earlier, and search the Pfam databse for sequence homologs
–domtblout saves the output in a tabular format that is easier to understand
we can then use > to put this output into a file that we’ll call pfam.log
make sure you run this all as one line

$HMMER_HOME/bin/hmmscan --domtblout TrinotatePFAM.out $TRINOTATE_HOME/Pfam-A.hmm $TRANSDECODER_HOME/Trinity.fasta.transdecoder.pep > $TRINOTATE_HOME/pfam.log

Import generated results into a Trinotate SQLite Database

Now we can take all the searches we have done against databases for sequence homologies and load them into a Trinotate SQLite Database, which we can then use to generate a Trinotate report

Load transcripts and coding regions

first we run Trinotate, then Trinotate.sqlite
init stands for initial import of transcriptome and protein data
–gene_trans_map: gene id information (Trinity.fasta.gene_trans_map)
–transcript_fasta: target transcipts (Trinity.fasta)
–transdecoder_pep: TransDecoder output containing peptide sequences of candidate ORFs (Trinity.fasta.transdecoder.pep)

$TRINOTATE_HOME/Trinotate $TRINOTATE_HOME/Trinotate.sqlite init --gene_trans_map $TUTORIAL_HOME/Data/Trinity.fasta.gene_trans_map --transcript_fasta $TUTORIAL_HOME/Data/Trinity.fasta --transdecoder_pep $TRANSDECODER_HOME/Trinity.fasta.transdecoder.pep

Load BLAST homologies

Homologies found within the BLAST database
Load protein hits (blastp)
Then load transcript hits (blastx)

$TRINOTATE_HOME/Trinotate $TRINOTATE_HOME/Trinotate.sqlite LOAD_swissprot_blastp $TRINOTATE_HOME/blastp.outfmt6
$TRINOTATE_HOME/Trinotate $TRINOTATE_HOME/Trinotate.sqlite LOAD_swissprot_blastx $TRINOTATE_HOME/blastx.outfmt6

Load Pfam domain entries

matches found within the Pfam database

$TRINOTATE_HOME/Trinotate.sqlite LOAD_pfam $TRINOTATE_HOME/TrinotatePFAM.out

Output Trinotate annotation report

if you run this code without the parameters it will just make an annotation report without the protein and transcript sequence data
–incl_pep: include protein sequence data
–incl_trans: include transcript sequence data

$TRINOTATE_HOME/Trinotate $TRINOTATE_HOME/Trinotate.sqlite report > $TRINOTATE_HOME/trinotate_annotation_report.xls

Now you have identified coding regions and annotated your transcriptome!
This data will be useful when doing analysis of single nucleotide polymorphisms (SNPs) or mutation load between populations because you can investigate the functional significance of mutations. This data can also be combined with a differential expression analysis, allowing you to investigate what kind of genes are up- or down-regulated under different conditions.