This tutorial will cover transcriptome annotation with TransDecoder and Trinotate. Both packages are designed to work with transcript sequences generated by aligned RNA-seq reads to a genome/transcriptome (ie. Tophat or Cufflinks output) or transcripts generated by the de-novo transcriptome assembly program Trinity. In this tutorial we will be working with a subset of a de-novo transcriptome assembly of the wetland plant, Decodon verticillatus.

This tutorial will be done in command line. If you are unfamiliar with command line coding, here are some tips:

  1. When you create a folder to house all the files for this tutorial make sure there are no spaces in the name. Spaces have meaning in the Linux coding language, so spaces in directory names will cause problems.
  2. You can use tab to autocomplete directory names, variable names, etc. This will help you code much more efficiently!
  3. The “ls” command prints out everything in your current working directory. Runnning this command often is a good way to make sure that you have the files you expect to have.
  4. You can also run into problems if you are not in the correct working directory. Use “pwd” to check your current working directory and make sure you are where you want to be.

If you are using a Mac you can just use the pre-installed Terminal app. If you’re using a Mac you can also use the package manager Homebrewer to download certain packages for this tutorial.
Check out the Homebrewer website for more information

# copy and paste the following code into your terminal
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
# enter your login password to your computer when prompted
# nothing will show up, but you can just type your password and hit enter
# the script will print out progress reports as it installs

If you are using a PC, you can download MobaXterm to access the terminal.

Download data

  1. Create a folder on your computer for this tutorial
  2. Download the “Data” folder from OnQ and click to unzip it, then drag it into your tutorial folder
  3. Name your working directory “TUTORIAL_HOME” and set the path to your working directory with the code below
# TUTORIAL_HOME=path/to/working/directory
# for example, I am setting my working directory to a folder in my documents directory
# replace the code below with the path to your tutorial folder
TUTORIAL_HOME=/Users/hanathompson/Documents/Courses/Biol830/tutorial
  1. Navigate to your working directory
# change working directory with "cd"
cd $TUTORIAL_HOME
# make sure this worked by checking your present working directory with "pwd"
pwd
# the output should be whatever you have set as TUTORIAL_HOME

Transdecoder

TransDecoder identifies coding regions in transcript sequences by identifying open reading frames (ORFs). ORFs are nucleotide sequences that can be read into amino acid sequences. For more information, check out the TransDecoder github.

Download TransDecoder

  1. TransDecoder can be downloaded from here
  2. Click on Source code (tar.gz)
  3. Move the tar.gz file from your downloads to your working directory
  • first define the path to your downloads folder as DOWNLOADS
# copy the code below but replace with the path to your downloads folder
DOWNLOADS=/Users/hanathompson/Downloads
# we can move files around using "mv"
# we can also use "mv" to rename it to something a little shorter
mv $DOWNLOADS/TransDecoder-TransDecoder-v5.5.0.tar.gz $TUTORIAL_HOME/TransDecoder.tar.gz
  1. Check the contents of your working directory
# use "ls" to list the contents of a directory
ls
Data TransDecoder.tar.gz
# to look at the contents of Data 
ls Data

you should have the following files

Trinity.fasta       Trinity.fasta.gene_trans_map
  • Trinity.fasta is the output from the de-novo assembly program Trinity
  • Trinity takes RNA-seq reads and assembles them into a transcriptome
  • You can find more information about Trinity here
    • Trinity.fasta files follow the following format:
    • Line 1: sequence identification information
      • For example, take the sequence ID TRINITY_DN0_c0_g1_i2
      • This is Trinity read cluster TRINITY_DN0_c0, gene 1, isoform 2
    • Line 2: nucleotide sequence
  • Trinity.fast.gene_trans_map contains just the gene/transcript identification information
  1. Extract TransDecoder.tar.gz files
  • the tar command can be used to compress or extract files
  • the x parameter tells the command to extract the files
  • the f parameter is used to specify the filename
tar -xf TransDecoder.tar.gz
ls

We now have a new directory called TransDecoder-TransDecoder-v5.5.0

Data TransDecoder-TransDecoder-v5.5.0 

Let’s rename it TransDecoder for simplicity

mv TransDecoder-TransDecoder-v5.5.0 TransDecoder
ls
Data TransDecoder

Run TransDecoder

  1. Set TRANSDECODER_HOME pathname
TRANSDECODER_HOME=$TUTORIAL_HOME/TransDecoder 
  1. Navigate to TransDecoder directory and check that all the required files are there
# try using tab to complete $TRANSDECODER_HOME by typing the first few characters then hitting tab
cd $TRANSDECODER_HOME
ls
Changelog.txt       README.md       TransDecoder.wiki   util
LICENSE.txt     TransDecoder.LongOrfs   __testing
Makefile        TransDecoder.Predict    notes
PerlLib         TransDecoder.lrgTests   sample_data
  1. Extract long open reading frames (ORFs)
  • We can extract ORFs using the TransDecoder.LongOrfs script
  • The only required parameter is -t, which is the fasta file containing the transcripts
  • There are several other optional parameters
  • -m is the minimum protein length (ie. number of amino acids). This is set to 100 by default, but you can raise or lower this value
  • -O can be used to define the output directory. By default, it will create an output directory in whatever directory you ran the script and name it “inputfilename.transdecoder_dir”
  • for a complete list of the parameters you can run “less TransDecoder.LongOrfs” to look at the script
    • click “q” to exit the file viewer
$TRANSDECODER_HOME/TransDecoder.LongOrfs -t $TUTORIAL_HOME/Data/Trinity.fasta
  • Check output
ls 
Changelog.txt                       TransDecoder.wiki
LICENSE.txt                       Trinity.fasta.transdecoder_dir
Makefile                            Trinity.fasta.transdecoder_dir.__checkpoints_longorfs
PerlLib                             __testing
README.md                           notes
TransDecoder.LongOrfs       pipeliner.4823.cmds
TransDecoder.Predict        sample_data
TransDecoder.lrgTests       util

We now have the output directory, Trinity.fasta.transdecoder_dir

  • Navigate to output directory
cd Trinity.fasta.transdecoder_dir
ls
base_freqs.dat      longest_orfs.cds    longest_orfs.gff3   longest_orfs.pep
  • base_freqs.dat is the number and frequency of each base, ACTG
  • longest_orfs.cds lists all ORFs found in the target transcripts
  • longest_orfs.gff3 contains information about the position of each ORF in the target transcripts
  • longest_orfs.pep lists all ORFs that met the minimum length criteria (100 in this case)
  1. Predict coding regions
  • In this step, TransDecoder is identifying likely coding regions of the ORFs using the TransDecoder.Predict script
  • Again, the only required parameter is -t, identifying the path to the target transcripts
  • To look at the script and other parameters, you can run “less TransDecoder.Predict”
  • We are going to add one more parameter, –no_refine_starts to turn off start refinement
  • Start refinement identifies potential start codons for partial ORFs, however, it doesn’t work very well unless using a full dataset. Since the data provided is just a small subset of a full transcriptome assembly we will turn this off to make sure it runs correctly.
cd $TRANSDECODER_HOME
$TRANSDECODER_HOME/TransDecoder.Predict -t $TUTORIAL_HOME/Data/Trinity.fasta --no_refine_starts
ls
  • When you look at the files in TransDecoder you should now see several files with “Trinity.fasta.transdecoder” in the name, with file extensions .bed, .cds, .gff3, and .pep
    • .pep: this file contains the peptide sequences of the identified ORFs
    • .cds: this file contains the nucleotide sequences of the coding regions of your ORFs
    • .gff3: this file contains information about the position of each ORF within the target transcripts
    • .bed: this is a bed-formatted file that contains information about the ORF position and can be used for visualization with IGV or GenomeView

Trinotate

Trinotate is used for functional annotation of transcriptomes. It uses databases such as BLAST+ and HMMER/PFAM to search for homologous genes and annotate your transcriptome with likely gene functions. We will have to download some additional software for Trinotate to run.
The Trinotate github page can be accessed here

Download software

  1. Download Trinotate here
  • Click on Source code (tar.gz)
  • Move the tar.gz file from Downloads to working directory the same as above
  • Unzip tar.gz file and create Trinotate directory
cd $TUTORIAL_HOME
mv $DOWNLOADS/Trinotate-Trinotate-v3.2.1.tar.gz $TUTORIAL_HOME/Trinotate.tar.gz
tar -xf Trinotate.tar.gz
mv Trinotate-Trinotate-v3.2.1 Trinotate
ls

now your working directory (TUTORIAL_HOME) should look like this

Data    TransDecoder    TransDecoder.tar.gz  Trinotate  Trinotate.tar.gz
  1. Download SQLite, which is required for database integration, here
  • Click on download under latest release
  • Under the Source Code heading click on the link ending with .tar.gz
  • Move into working directory, unzip and create SQLite directory
cd $TUTORIAL_HOME
mv $DOWNLOADS/sqlite-autoconf-3340100.tar.gz $TUTORIAL_HOME/sqlite.tar.gz
tar -xf sqlite.tar.gz
mv sqlite-autoconf-3340100 SQLite
  1. Download NCBI BLAST+: Blast database homology search here
  • find the .tar.gz file for your computer and click to download
  • macosx.tar.gz or windows64.tar.gz
  • move into working directory and unzip
cd $TUTORIAL_HOME
mv $DOWNLOADS/ncbi-blast-2.11.0+-x64-macosx.tar.gz $TUTORIAL_HOME 
tar -xf ncbi-blast-2.11.0+-x64-macosx.tar.gz 
  1. Mac users only: Download HMMER/PFAM Protein Domain identification with Homebrewer.
  • HMMER searches databases for sequence homologs
  • PC users skip to step 5
cd $TUTORIAL_HOME
brew install hmmer

You should get output that looks like this, telling you where hmmer has been installed on your computer

==> Pouring hmmer-3.3.2.catalina.bottle.tar.gz
🍺  /usr/local/Cellar/hmmer/3.3.2: 57 files, 12.7MB
  • We want to move hmmer from /usr/local/Cellar to TUTORIAL_HOME
    • make sure to copy your pathname from your hmmer output because it might be slightly different
mv /usr/local/Cellar/hmmer $TUTORIAL_HOME
# set path to HMMER_HOME
HMMER_HOME=$TUTORIAL_HOME/hmmer/3.3.2
  1. PC users: download HMMER/PFAM from here
  • click on Download source
  • move into working directory and unzip as before

Our TUTORIAL_HOME directory is looking a little messy, so let’s clean it up by putting all the gzipped files into a separate directory. It is useful to keep these so that you can restart the analysis at any point if something goes wrong.

# first we make a directory called "GZIP" with "mkdir"
mkdir GZIP
# then we can move anything ending with .tar.gz using the regular expression quantifier *
# * can replace anything - any character and any number of characters
# so this syntax means move any files ending with .tar.gz
mv $TUTORIAL_HOME/*.tar.gz $TUTORIAL_HOME/GZIP
ls

Now TUTORIAL_HOME should only contain

Data            SQLite          Trinotate       ncbi-blast-2.11.0+
GZIP            TransDecoder        hmmer

And if we check our GZIP directory, we’ll find all the zipped files

cd $TUTORIAL_HOME/GZIP
ls

Build sequence databases

Trinotate also requires the SwissProt and Pfam databses to run, and the Trinotate package contains scripts to download and build these databases.

  1. Set path to TRINOTATE_HOME
TRINOTATE_HOME=$TUTORIAL_HOME/Trinotate
cd $TRINOTATE_HOME
  1. Check if you have Wget installed
  • This is a package that allows you to download items from the internet
  • Check if you have it pre-installed by running:
wget

If you get this output

wget: missing URL

Then you have wget installed and can skip to step 3
If you get this output

command not found: wget

Then install Wget using Homebrew

brew install wget
wget

you should now get the output

wget: missing URL
  1. Run Build_Trinotate_Boilerplate_SQLite_db.pl perl script
$TRINOTATE_HOME/admin/Build_Trinotate_Boilerplate_SQLite_db.pl  Trinotate
# this step will take about 20-30min so now is a good time to take a lunch break!
ls
  • You should now see the following files added to your Trinotate directory
    • Trinotate.sqlite: this is a Trinotate boilerplate sqlite database - it contains protein sequence information from various databases
    • uniprot_sprot.pep: this file contains protein sequences and will be used to look for matches within the BLAST database
    • Pfam-A.hmm.gz: we will use HMMER on this file to use search the Pfam database for homologs
  1. Prepare protein database for blast searches
  • Set path to NCBI_HOME
BLAST_HOME=$TUTORIAL_HOME/ncbi-blast-2.11.0+
  • use makeblastdb command to prepare protein databse
    • -in tells makeblastdb where to look for protein sequence information
    • -dbtype is the type of database. In this case it is protein sequences, so we write “prot”
$BLAST_HOME/bin/makeblastdb -in uniprot_sprot.pep -dbtype prot
  • You may get a pop-up error along the lines of ““makeblastdb” cannot be opened because the developer cannot be verified."
    • If this happens, open a finder window and locate the makeblastdb exe file
    • Right click on it and press open -> open with terminal
    • It should open in a new terminal window, which you can close
    • Then try running the command again (you may have to do this several times throughout the tutorial)
  1. Uncompress and prepare Pfam database to use with hmmscan
  • gunzip unzips gzipped files (files with .gz extension)
  • the hmmpress command prepares an HMM database to do hmmscan (scan for sequence homologs)
cd $TRINOTATE_HOME
gunzip $TRINOTATE_HOME/Pfam-A.hmm.gz
$HMMER_HOME/bin/hmmpress $TRINOTATE_HOME/Pfam-A.hmm 

Run sequence analysis

In this step we will use BLAST to look for homologies and identify proteins in our transcripts. We will run this on both the transcriptome assembly (Trinity.fasta) and on the likely ORFs predicted by TransDecoder (Trinity.fasta.transdecoder.pep)

  1. Search trinity transcripts using blastx
  • there are 3 blast options, blastn, blastp, and blastx
  • blastn searches a nucleotide query (ie. you give it nucleotide sequences) against a nucleotide database
  • blastp searches a protein query (ie. you give it protein sequences) against a protein database
  • blastx searchs a nucleotide query against a protein database
  • to see all the parameters, enter $BLAST_HOME/bin/blastx -help
  • -query is the input file (Trinity.fasta)
  • -db is the database we want to search (uniprot_sprot.pep)
  • -max_target_seqs is the maximum number of aligned sequences to keep
    • we’ve set this to 1, meaning that blast will return the first good match it finds in the database
    • this will help reduce run time, however, this value should be chosen carefully (or not changed) because returning the first good match is not the same as returning the best match (see Shah et al. 2018 for more information)
  • -outfmt is a number between 0-18 that defines the output format
    • -outfmt 6 is Tabular
    • see blastx -help for the complete list
  • -out is our output pathname and filename
cd $TRINOTATE_HOME
# set path to BLAST_HOME
BLAST_HOME=$TUTORIAL_HOME/ncbi-blast-2.11.0+
# run blastx to blast Trinity transcripts against protein database
$BLAST_HOME/bin/blastx -query $TUTORIAL_HOME/Data/Trinity.fasta -db $TRINOTATE_HOME/uniprot_sprot.pep -max_target_seqs 1 -outfmt 6 -out $TRINOTATE_HOME/blastx.outfmt6
# take a look at the output file
less $TRINOTATE_HOME/blastx.outfmt6
  • Column 1: Query (unknown gene) sequence ID (from input fasta file)
  • Column 2: Subject (reference genome) sequence ID
  • Column 3: % of identical matches
  • Column 4: Alignment length (how much the sequences overlapped)
  • Column 5: Number of mismatches
  • Column 6: Number of gap openings
  • Column 7: Position in query where alignment starts
  • Column 8: Position in query where alignment ends
  • Column 9: Position in subject where alignment starts
  • Column 10: Position in subject where alignment ends
  • Column 11: Expect-value (measure of quality of match; smaller e-value means a better match)
    • you can change this cutoff value (default is 10) using the -evalue parameter
  • Column 12: Bit-score (another measure of quality; larger bit-score means better sequence similarity)
  • press “q” to quit the file viewer
  1. Search transdecoder-predicted proteins using blastp
  • we can use the same parameters as above and just change the input and output files
  • the output will be in the same format as above
$BLAST_HOME/bin/blastp -query $TRANSDECODER_HOME/Trinity.fasta.transdecoder.pep -db $TRINOTATE_HOME/uniprot_sprot.pep -max_target_seqs 1 -outfmt 6 -out $TRINOTATE_HOME/blastp.outfmt6
less $TRINOTATE_HOME/blastp.outfmt6
  1. Run hmmer to identify protein domains
  • Now we will run hmmscan on the database we prepared earlier, and search the Pfam databse for sequence homologs
  • –domtblout saves the output in a tabular format that is easier to understand
  • we can then use > to put this output into a file that we’ll call pfam.log
  • make sure you run this all as one line
$HMMER_HOME/bin/hmmscan --domtblout TrinotatePFAM.out $TRINOTATE_HOME/Pfam-A.hmm $TRANSDECODER_HOME/Trinity.fasta.transdecoder.pep > $TRINOTATE_HOME/pfam.log

Import generated results into a Trinotate SQLite Database

Now we can take all the searches we have done against databases for sequence homologies and load them into a Trinotate SQLite Database, which we can then use to generate a Trinotate report

  1. Load transcripts and coding regions
  • first we run Trinotate, then Trinotate.sqlite
  • init stands for initial import of transcriptome and protein data
  • –gene_trans_map: gene id information (Trinity.fasta.gene_trans_map)
  • –transcript_fasta: target transcipts (Trinity.fasta)
  • –transdecoder_pep: TransDecoder output containing peptide sequences of candidate ORFs (Trinity.fasta.transdecoder.pep)
$TRINOTATE_HOME/Trinotate $TRINOTATE_HOME/Trinotate.sqlite init --gene_trans_map $TUTORIAL_HOME/Data/Trinity.fasta.gene_trans_map --transcript_fasta $TUTORIAL_HOME/Data/Trinity.fasta --transdecoder_pep $TRANSDECODER_HOME/Trinity.fasta.transdecoder.pep
  1. Load BLAST homologies
  • Homologies found within the BLAST database
  • Load protein hits (blastp)
  • Then load transcript hits (blastx)
$TRINOTATE_HOME/Trinotate $TRINOTATE_HOME/Trinotate.sqlite LOAD_swissprot_blastp $TRINOTATE_HOME/blastp.outfmt6
$TRINOTATE_HOME/Trinotate $TRINOTATE_HOME/Trinotate.sqlite LOAD_swissprot_blastx $TRINOTATE_HOME/blastx.outfmt6
  1. Load Pfam domain entries
  • matches found within the Pfam database
$TRINOTATE_HOME/Trinotate.sqlite LOAD_pfam $TRINOTATE_HOME/TrinotatePFAM.out

Output Trinotate annotation report

$TRINOTATE_HOME/Trinotate $TRINOTATE_HOME/Trinotate.sqlite report > $TRINOTATE_HOME/trinotate_annotation_report.xls

Now you have identified coding regions and annotated your transcriptome!
This data will be useful when doing analysis of single nucleotide polymorphisms (SNPs) or mutation load between populations because you can investigate the functional significance of mutations. This data can also be combined with a differential expression analysis, allowing you to investigate what kind of genes are up- or down-regulated under different conditions.