University of Nevada, Reno Vintage Logo

HISAT2 Tutorial from Pertea, M. et al., Nat. Protoc. 2016


Welcome!

Welcome to a video-guided Rmd tutorial on using HISAT2, StringTie, and Ballgown. Here, we follow the protocol above as a beginner’s introduction to RNA-Seq analysis. The first two videos were meant to be one, but the recorder I was using had a 10-minute limit, so there is one 10-minute video and one 5-minute video, each showing steps not covered in the other. I believe the first video cuts off during the HISAT2 run; the second picks up there and continues with the SAMTools portion of the protocol. Three other videos go through StringTie, IGV, and working in R with Ballgown. Be sure to check them out, and use this Rmd and the ones that follow as references while watching. I hope you enjoy your beginner’s introduction to RNA-Seq!

Obtain Tools

For this tutorial, we will need several tools that you may or may not have. As I mention in the first installment of this series, installing Miniconda and then setting up Bioconda greatly reduces the stress of obtaining these tools, because it gives you the almighty conda install command. In the third installment, I walk through obtaining the bash script that downloads Miniconda and the code that sets up Bioconda. While generating this Rmd file, I realized these steps would have been better laid out at the start so, lucky you, I have compiled everything you need to obtain the required tools before you begin. After installing Miniconda and Bioconda, run the code below to obtain HISAT2, SAMTools, IGV, IGVTools, StringTie, Gffcompare, R and RStudio (if you don’t already have them), and Ballgown (if you use conda to install R/RStudio).
NOTE: Only install Ballgown using the conda command below if you install R/RStudio using conda. If you have already installed these tools from the command line using sudo apt-get install r-base and sudo apt-get install rstudio, or have compiled them from source code, install Ballgown from inside R/RStudio instead. Ballgown is a Bioconductor package, so install.packages('ballgown') on its own will not find it; run install.packages('BiocManager') followed by BiocManager::install('ballgown').
$ conda install hisat2
$ conda install samtools
$ conda install igv
$ conda install igvtools
$ conda install stringtie
$ conda install gffcompare
$ conda install r-base
$ conda install rstudio
$ conda install -c bioconda bioconductor-ballgown
Remember: only install R, RStudio, and Ballgown using conda if you have not already installed them by a different method.
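Before moving on, it can help to confirm that the installs actually landed on your PATH. The snippet below is a minimal sketch, not part of the original videos: missing_tools is a helper name made up for this example, and the tool names are assumed to match the commands used later in this tutorial.

```shell
#!/usr/bin/env bash
# Sketch: report which command-line tools are not yet on the PATH.

# Print the name of every argument that `command -v` cannot find.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Tool names as they are typed on the command line (an assumption;
# adjust the list if your installation names them differently).
missing=$(missing_tools hisat2 samtools stringtie gffcompare)

if [ -z "$missing" ]; then
    echo "All tools found."
else
    echo "Not found on PATH:"
    echo "$missing"
fi
```

If a tool shows up as missing right after a conda install, opening a new terminal or re-activating your conda environment is usually the first thing to try.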

Now that we have our tools, let’s begin!


Obtain Tutorial Files

Use the UNIX command wget to download the tutorial data from the FTP server that hosts it. First use the command cd [Options] [Directory] to change into your desired ~/working_directory, then download the file.
$ wget ftp://ftp.ccb.jhu.edu/pub/RNAseq_protocol/chrX_data.tar.gz

Extract Tutorial Files

Using the UNIX command tar xvzf, we can extract the .tar.gz file into our ~/working_directory. The option ‘x’ extracts the files, ‘v’ lists the processed files verbosely, ‘z’ filters the archive through gzip, and ‘f’ tells tar to use the archive file named next on the command line.
$ tar xvzf chrX_data.tar.gz
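To get a feel for the flags without touching the real data, here is a small sketch that builds a toy archive, lists it, and extracts it again. Everything in it (the toy_data directory, reads.txt) is invented for illustration. On the real download, tar tzf chrX_data.tar.gz would list the contents without extracting anything.

```shell
#!/usr/bin/env bash
# Sketch: round-trip a toy archive to see what each tar flag does.

demo_dir=$(mktemp -d)          # scratch directory, removed at the end
mkdir -p "$demo_dir/toy_data/samples"
echo "placeholder" > "$demo_dir/toy_data/samples/reads.txt"

# c = create, z = gzip, f = archive file name; -C changes directory first.
tar czf "$demo_dir/toy_data.tar.gz" -C "$demo_dir" toy_data

# t = list the contents without extracting anything.
listing=$(tar tzf "$demo_dir/toy_data.tar.gz")
echo "$listing"

# x = extract; -C chooses where the files land.
mkdir "$demo_dir/out"
tar xzf "$demo_dir/toy_data.tar.gz" -C "$demo_dir/out"
extracted=$(cat "$demo_dir/out/toy_data/samples/reads.txt")
echo "extracted file contains: $extracted"

rm -rf "$demo_dir"
```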

Align Reads Using HISAT2

Using HISAT2, we can align our sample .fastq.gz files (without the need to unzip them) to the indexed reference genome, which has already been prepared and is located in the chrX_data/indexes/ directory. Doing so generates the SAM (Sequence Alignment Map) files we will use in later steps. First we type hisat2 to name the command we are using. The options entered here are ‘-p 8’ to use 8 threads, ‘--dta’ to report alignments tailored for transcript assemblers such as StringTie, ‘-x’ to give the base name of the indexed reference genome, ‘-1’ and ‘-2’ to give the forward and reverse read files of a paired-end sample, and ‘-S’ to name the SAM file the output is written to.
$ hisat2 -p 8 --dta -x chrX_data/indexes/chrX_tran -1 chrX_data/samples/ERR188044_chrX_1.fastq.gz -2 chrX_data/samples/ERR188044_chrX_2.fastq.gz -S ERR188044_chrX.sam
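Here is a bash script for the above HISAT2 command, called hisat2.sh, that loops over all the samples and aligns them one after another. Note that the script uses 11 threads and expects chrX_data to be in your home directory; adjust both to suit your machine.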
#!/usr/bin/bash

#bash script for hisat2; align all .fastq.gz files to indexed reference genome to generate .sam files

SAMPLES="ERR188044 ERR188104 ERR188234 ERR188245 ERR188257 ERR188273 ERR188337 ERR188383 ERR188401 ERR188428 ERR188454 ERR204916"

for SAMPLE in $SAMPLES; do
    hisat2 -p 11 --dta -x ~/chrX_data/indexes/chrX_tran -1 ~/chrX_data/samples/${SAMPLE}_chrX_1.fastq.gz -2 ~/chrX_data/samples/${SAMPLE}_chrX_2.fastq.gz -S ${SAMPLE}_chrX.sam
done

#this works
Just for fun, I wrote a perl script to do the same thing.
#!/usr/bin/perl

#perl script for hisat2; align all .fastq.gz files to indexed reference genome to generate .sam files 

use warnings;

use strict;

my @samples = qw(ERR188044 ERR188104 ERR188234 ERR188245 ERR188257 ERR188273 ERR188337 ERR188383 ERR188401 ERR188428 ERR188454 ERR204916);

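my $data = "$ENV{HOME}/chrX_data";

foreach my $sample (@samples) {
    # The list form of system() bypasses the shell, so each option and its value
    # must be a separate element, and ~ is not expanded (use $ENV{HOME} instead).
    system("hisat2", "-p", "11", "--dta",
           "-x", "$data/indexes/chrX_tran",
           "-1", "$data/samples/${sample}_chrX_1.fastq.gz",
           "-2", "$data/samples/${sample}_chrX_2.fastq.gz",
           "-S", "${sample}_chrX.sam") == 0
        or warn "hisat2 failed for $sample\n";
}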

#perl works too
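
The scripts above hard-code the twelve sample names. As a hedged alternative, the sketch below derives the names from the forward-read files themselves and prints each hisat2 command instead of running it, so you can check the list first. list_samples and SAMPLE_DIR are names made up for this example, and the paths assume chrX_data sits in your home directory, as in hisat2.sh.

```shell
#!/usr/bin/env bash
# Sketch: build the sample list from the files instead of typing it out.

# Print one sample ID per line for every *_chrX_1.fastq.gz file in a directory.
list_samples() {
    for fwd in "$1"/*_chrX_1.fastq.gz; do
        [ -e "$fwd" ] || continue      # nothing matched: skip the literal pattern
        name=${fwd##*/}                # strip the directory part
        printf '%s\n' "${name%_chrX_1.fastq.gz}"
    done
}

# SAMPLE_DIR is an assumed location; point it at your own samples directory.
SAMPLE_DIR="${SAMPLE_DIR:-$HOME/chrX_data/samples}"

for SAMPLE in $(list_samples "$SAMPLE_DIR"); do
    # Dry run: echo the command so it can be inspected before anything is aligned.
    echo hisat2 -p 8 --dta -x "$HOME/chrX_data/indexes/chrX_tran" \
        -1 "$SAMPLE_DIR/${SAMPLE}_chrX_1.fastq.gz" \
        -2 "$SAMPLE_DIR/${SAMPLE}_chrX_2.fastq.gz" \
        -S "${SAMPLE}_chrX.sam"
done
```

Once the printed commands look right, removing the echo turns the dry run into the real alignment loop.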

Generate BAM Files Using SAMTools

To generate files that we can use with StringTie and view in IGV (Integrative Genomics Viewer), we need to convert our SAM files to BAM (Binary Alignment Map) files using SAMTools. A BAM file is the binary version of a SAM file; StringTie uses it to assemble aligned reads into transcripts, and it is also the preferred file format for viewing in IGV. We will use the samtools command with the following: ‘sort’ to sort the alignments by leftmost coordinate, ‘-@ 8’ to use 8 threads, ‘-o’ to name the output file [out.bam], and finally the [input.sam] file to be sorted.
$ samtools sort -@ 8 -o ERR188044_chrX.bam ERR188044_chrX.sam
We also have a bash script for the above command, called sort.sh, that loops over all our .sam files and sorts them one after another.
#!/usr/bin/bash

#bash script for samtools; convert .sam files to .bam files

SAMPLES="ERR188044 ERR188104 ERR188234 ERR188245 ERR188257 ERR188273 ERR188337 ERR188383 ERR188401 ERR188428 ERR188454 ERR204916"

for SAMPLE in $SAMPLES; do
    samtools sort -@ 11 -o ${SAMPLE}_chrX.bam ${SAMPLE}_chrX.sam
done

#this works
Another perl script.
#!/usr/bin/perl

#perl script for samtools; convert .sam files to .bam files 

use warnings;

use strict;

my @samples = qw(ERR188044 ERR188104 ERR188234 ERR188245 ERR188257 ERR188273 ERR188337 ERR188383 ERR188401 ERR188428 ERR188454 ERR204916);

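foreach my $sample (@samples) {
    # Each option and its value is a separate list element; written as one string,
    # "-o name.bam" would produce an output file whose name starts with a space.
    system("samtools", "sort", '-@', "11", "-o", "${sample}_chrX.bam", "${sample}_chrX.sam") == 0
        or warn "samtools sort failed for $sample\n";
}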

#perl works too
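
SAM files are plain text and can take up a lot of disk space once the sorted BAM files exist. As optional housekeeping, and not something shown in the videos, the sketch below deletes a SAM file only when a non-empty BAM with the matching name is present. remove_sam_if_sorted is a name made up for this example, and a non-empty file is only a rough check: it does not prove the BAM is complete.

```shell
#!/usr/bin/env bash
# Sketch: delete a SAM file only once its sorted BAM exists and is non-empty.

remove_sam_if_sorted() {
    sam="$1_chrX.sam"
    bam="$1_chrX.bam"
    if [ ! -f "$sam" ]; then
        echo "no $sam to remove"
    elif [ -s "$bam" ]; then           # -s: file exists and has a size above zero
        rm -- "$sam"
        echo "removed $sam"
    else
        echo "kept $sam: $bam is missing or empty"
    fi
}

SAMPLES="ERR188044 ERR188104"          # shortened list for illustration
for SAMPLE in $SAMPLES; do
    remove_sam_if_sorted "$SAMPLE"
done
```

Run it from the directory that holds the .sam and .bam files, and only after you are satisfied with the sorted output.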

Index BAM Files Using SAMTools

To view our BAM files in IGV, we need to index them, and for this we also use SAMTools. IGV won’t accept a .bam file without an accompanying .bam.bai file, so this step is essential. Using the samtools command with the ‘index’ option, we enter our [in.bam] files and receive [out.bam.bai] files. With these two files in hand, we can now view our data using IGV!
$ samtools index ERR188044_chrX.bam ERR188044_chrX.bam.bai
Again, we have a bash script, called index.sh, that loops over all our .bam files and generates a .bam.bai file for each in turn.
#!/usr/bin/bash

#bash script for samtools; index our .bam files to obtain .bam.bai files using samtools

SAMPLES="ERR188044 ERR188104 ERR188234 ERR188245 ERR188257 ERR188273 ERR188337 ERR188383 ERR188401 ERR188428 ERR188454 ERR204916"

for SAMPLE in $SAMPLES; do
    samtools index ${SAMPLE}_chrX.bam ${SAMPLE}_chrX.bam.bai
done

#this works
And yet another perl script.
#!/usr/bin/perl

#perl script for samtools; index our .bam files to obtain .bam.bai files using samtools

use warnings;

use strict;

my @samples = qw(ERR188044 ERR188104 ERR188234 ERR188245 ERR188257 ERR188273 ERR188337 ERR188383 ERR188401 ERR188428 ERR188454 ERR204916);

foreach(@samples){
    do {
        system("samtools", "index", "${_}_chrX.bam", "${_}_chrX.bam.bai");
    }
}

#perl works too
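
Since IGV needs both files side by side, a quick check that every BAM has its index can save a confusing error later. This is a minimal sketch under the naming used above (.bam plus .bam.bai in the same directory); missing_indexes is a helper name made up for this example.

```shell
#!/usr/bin/env bash
# Sketch: list BAM files that do not yet have a matching .bam.bai index.

missing_indexes() {
    for bam in "$1"/*.bam; do
        [ -e "$bam" ] || continue                  # no BAM files in this directory
        [ -e "$bam.bai" ] || printf '%s\n' "${bam##*/}"
    done
}

# Check the current directory; no output means every BAM has an index.
missing_indexes .
```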

Parting Words