Rmd for HISAT2 Tutorial, BIOL792, Spring 2019

HISAT2 Tutorial from Pertea, M. et al., Nat. Protoc. 2016

Link to YouTube Video 1: Beginner’s RNA-Seq Tutorial Part 1 - Dr. Pedro Miura’s BIOL792 Course, University of Nevada, Reno

Link to YouTube Video 2: Beginner’s RNA-Seq Tutorial Part 2 - Dr. Pedro Miura’s BIOL792 Course, University of Nevada, Reno

Welcome!

Welcome to a video-guided Rmd tutorial on using HISAT2, StringTie, and Ballgown. Here, we will follow the above protocol as a beginner’s introduction to RNA-Seq analysis. The first two videos were meant to be one, but the video recorder I was using had a 10-minute limit, so there is one 10-minute video and one 5-minute video, and each shows steps not seen in the other. I believe the first video cuts off during the HISAT2 run; the second video picks up there and continues with the SAMTools portion of this protocol. Three other videos go through StringTie, IGV, and working in R using Ballgown. Be sure to check them out, and use this Rmd and the ones that follow as references while watching the videos! I hope you enjoy your beginner’s introduction to RNA-Seq!

Obtain Tools

For this tutorial, we will need several tools that you may or may not have. As I mention in the first installment of this series, installing Miniconda and then setting up Bioconda greatly reduces the stress of downloading and installing these tools by giving you the almighty `conda install` command. In the third installment, I walk through obtaining the `bash` script that installs Miniconda and running the code that sets up Bioconda. As I wrote this Rmd file, I realized these steps would have been better laid out at the start so, lucky you, I have compiled everything you need to obtain the required tools before you begin. After installing Miniconda and setting up Bioconda, run the code below to obtain HISAT2, SAMTools, IGV, IGVTools, StringTie, Gffcompare, R and RStudio (if you don’t already have them), and Ballgown (if you use `conda` to install R/RStudio).

NOTE: Only install Ballgown using the `conda` command below if you install R/RStudio using `conda`. If you have already installed these tools from the command line using `sudo apt-get install r-base`, `sudo apt-get install rstudio`, or have compiled them from source code, install Ballgown from within R/RStudio instead. Ballgown is a Bioconductor package, so `install.packages('ballgown')` will not find it on CRAN; run `install.packages('BiocManager')` followed by `BiocManager::install('ballgown')`.

$ conda install hisat2
$ conda install samtools
$ conda install igv
$ conda install igvtools
$ conda install stringtie
$ conda install gffcompare
$ conda install r-base
$ conda install rstudio
$ conda install -c bioconda bioconductor-ballgown

Remember to install R, RStudio, and Ballgown using `conda` only if you have not already installed them using a different method.
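Before moving on, it can be worth confirming that the command-line tools actually ended up on your PATH. The following is only a minimal sketch and not part of the original protocol: `check_tools` is a hypothetical helper name, and the executable names passed to it are assumed to match what the `conda` packages install.

```shell
#!/usr/bin/env bash

# check_tools: hypothetical helper. Prints every command that cannot be found
# on PATH and returns non-zero if at least one is missing.
check_tools() {
    local missing=0
    local tool
    for tool in "$@"; do
        if ! command -v "$tool" > /dev/null 2>&1; then
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Executable names are assumptions; adjust them to the tools you installed.
check_tools hisat2 samtools stringtie gffcompare || echo "Install the missing tools before continuing."
```

If nothing is printed, every listed tool was found.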

Now that we have our tools, let’s begin!

Obtain Tutorial Files

Use the command `cd [Options] [Directory]` to change into your desired ~/working_directory, then use the `UNIX` command `wget` to pull the data we will be working with off the FTP server that hosts it.

$ wget ftp://ftp.ccb.jhu.edu/pub/RNAseq_protocol/chrX_data.tar.gz
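A large download can be cut short, so you may want to test the archive before extracting it. This is a minimal sketch under the assumption that `gzip` is installed; `archive_ok` is a hypothetical helper name, and `gzip -t` only checks that the compressed stream is readable, not that its contents are what you expect.

```shell
#!/usr/bin/env bash

# archive_ok: hypothetical helper. Succeeds only if the file is a readable gzip stream.
archive_ok() {
    gzip -t "$1" 2> /dev/null
}

# The file name matches the wget command above; the check is skipped
# if the archive has not been downloaded into the current directory yet.
if [ -f chrX_data.tar.gz ]; then
    if archive_ok chrX_data.tar.gz; then
        echo "chrX_data.tar.gz looks intact"
    else
        echo "chrX_data.tar.gz is damaged or incomplete; try downloading it again"
    fi
fi
```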

Extract Tutorial Files

Using the `UNIX` command `tar xvzf`, we can extract the .tar.gz file into our ~/working_directory. The command option ‘x’ extracts the files, ‘v’ lists the processed files verbosely, ‘z’ filters the archive through `gzip`, and ‘f’ tells `tar` to use the archive file named next.

$ tar xvzf chrX_data.tar.gz
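If you would like to try the `tar` options without touching the real data, the sketch below builds a tiny throwaway archive first. Everything named `demo_data` is made up for illustration. It also shows two related options that are not used in this tutorial: ‘t’ lists the contents instead of extracting them, and ‘-C’ picks the directory to work in.

```shell
#!/usr/bin/env bash

# Build a small throwaway archive in a temporary directory (names are made up).
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/demo_data/samples"
echo "placeholder" > "$demo_dir/demo_data/samples/reads.txt"
tar czf "$demo_dir/demo_data.tar.gz" -C "$demo_dir" demo_data
rm -r "$demo_dir/demo_data"

# 't' in place of 'x' lists the archive contents without extracting anything.
listing=$(tar tzf "$demo_dir/demo_data.tar.gz")
echo "$listing"

# 'x' extracts; '-C' extracts into the given directory instead of the current one.
tar xzf "$demo_dir/demo_data.tar.gz" -C "$demo_dir"
ls "$demo_dir/demo_data/samples"
```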

Align Reads Using HISAT2

Using HISAT2, we can align our sample .fastq.gz files (without the need to unzip them) to the indexed reference genome that has already been prepared and is located in the chrX_data/indexes/ directory. Doing so generates the SAM (Sequence Alignment Map) files we will use in later steps. First we type `hisat2` to denote the command we are using. The options entered here are: ‘-p 8’ to use 8 threads, ‘--dta’ to report alignments tailored for transcript assemblers such as StringTie, ‘-x’ to give the base name of the indexed reference genome, ‘-1’ and ‘-2’ to give the fwd and rev read files of a paired-end sample, and ‘-S’ to name the SAM file the alignments are written to.

$ hisat2 -p 8 --dta -x chrX_data/indexes/chrX_tran -1 chrX_data/samples/ERR188044_chrX_1.fastq.gz -2 chrX_data/samples/ERR188044_chrX_2.fastq.gz -S ERR188044_chrX.sam

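Here is a `bash` script for the above HISAT2 command, called hisat2.sh, that will run every sample’s .fastq.gz files for you, one sample after another. Note that it expects chrX_data to be in your home directory and uses 11 threads; adjust both to match your machine.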

#!/usr/bin/bash

#bash script for hisat2; align all .fastq.gz files to indexed reference genome to generate .sam files

SAMPLES="ERR188044 ERR188104 ERR188234 ERR188245 ERR188257 ERR188273 ERR188337 ERR188383 ERR188401 ERR188428 ERR188454 ERR204916"

for SAMPLE in $SAMPLES; do
    hisat2 -p 11 --dta -x ~/chrX_data/indexes/chrX_tran -1 ~/chrX_data/samples/${SAMPLE}_chrX_1.fastq.gz -2 ~/chrX_data/samples/${SAMPLE}_chrX_2.fastq.gz -S ${SAMPLE}_chrX.sam
done

#this works
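Before launching a long alignment run, it can help to print the commands first and check the paths by eye. The sketch below does only that and never calls HISAT2. `hisat2_cmd`, `DATA_DIR`, and `THREADS` are hypothetical names, and the directory layout is assumed to match the script above.

```shell
#!/usr/bin/env bash

# Assumed location and thread count; change these to match your own setup.
DATA_DIR="$HOME/chrX_data"
THREADS=8

# hisat2_cmd: hypothetical helper that prints the HISAT2 command for one sample ID.
hisat2_cmd() {
    local sample="$1"
    echo "hisat2 -p $THREADS --dta -x $DATA_DIR/indexes/chrX_tran" \
         "-1 $DATA_DIR/samples/${sample}_chrX_1.fastq.gz" \
         "-2 $DATA_DIR/samples/${sample}_chrX_2.fastq.gz" \
         "-S ${sample}_chrX.sam"
}

# Dry run: print the commands for two of the samples without executing anything.
for SAMPLE in ERR188044 ERR188104; do
    hisat2_cmd "$SAMPLE"
done
```

Once the printed paths look right, the same values can be used in the real loop.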

Just for fun, I wrote a `perl` script to do the same thing.

#!/usr/bin/perl

#perl script for hisat2; align all .fastq.gz files to indexed reference genome to generate .sam files 

use warnings;

use strict;

my @samples = qw(ERR188044 ERR188104 ERR188234 ERR188245 ERR188257 ERR188273 ERR188337 ERR188383 ERR188401 ERR188428 ERR188454 ERR204916);

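#list-form system() skips the shell: each option and its value must be a separate argument, and ~ is not expanded
my $data = "$ENV{HOME}/chrX_data";

foreach my $sample (@samples) {
    system("hisat2", "-p", "11", "--dta", "-x", "$data/indexes/chrX_tran",
           "-1", "$data/samples/${sample}_chrX_1.fastq.gz",
           "-2", "$data/samples/${sample}_chrX_2.fastq.gz",
           "-S", "${sample}_chrX.sam") == 0
        or die "hisat2 failed for $sample (exit status $?)\n";
}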

#perl works too

Generate BAM Files Using SAMTools

In order to generate files that we can use for StringTie and view in IGV (Integrative Genomics Viewer), we need to convert our SAM files to BAM (Binary Alignment Map) files using SAMTools. A BAM file is the binary version of a SAM file; StringTie uses it to assemble aligned reads into transcripts, and it is also the preferred file format for viewing in IGV. We will use the `samtools` command with the options: ‘sort’ to sort the alignments by the leftmost coordinates, ‘-@ 8’ to use 8 threads, and ‘-o’ to name the output file [out.bam] (SAMTools picks the BAM format from the .bam extension), and finally we enter our [input.sam] file.

$ samtools sort -@ 8 -o ERR188044_chrX.bam ERR188044_chrX.sam

We also have a `bash` script for the above command, called sort.sh, that will convert all our .sam files one after another.

#!/usr/bin/bash

#bash script for samtools; convert .sam files to .bam files

SAMPLES="ERR188044 ERR188104 ERR188234 ERR188245 ERR188257 ERR188273 ERR188337 ERR188383 ERR188401 ERR188428 ERR188454 ERR204916"

for SAMPLE in $SAMPLES; do
    samtools sort -@ 11 -o ${SAMPLE}_chrX.bam ${SAMPLE}_chrX.sam
done

#this works
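As an alternative to listing sample IDs by hand, the loop can pick up whatever .sam files are in the current directory and derive each output name from the input name. This is a minimal sketch under that assumption; `bam_name` is a hypothetical helper, and the thread count simply mirrors the single command above.

```shell
#!/usr/bin/env bash

# bam_name: hypothetical helper that swaps a trailing .sam for .bam.
bam_name() {
    echo "${1%.sam}.bam"
}

# Sort every SAM file in the current directory; nothing runs if there are none.
for SAM in *.sam; do
    [ -e "$SAM" ] || continue   # the pattern stays literal when no file matches
    samtools sort -@ 8 -o "$(bam_name "$SAM")" "$SAM"
done
```

The `${1%.sam}` expansion removes the shortest trailing match of `.sam`, so ERR188044_chrX.sam becomes ERR188044_chrX.bam.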

Another `perl` script.

#!/usr/bin/perl

#perl script for samtools; convert .sam files to .bam files 

use warnings;

use strict;

my @samples = qw(ERR188044 ERR188104 ERR188234 ERR188245 ERR188257 ERR188273 ERR188337 ERR188383 ERR188401 ERR188428 ERR188454 ERR204916);

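foreach my $sample (@samples) {
    #each option and its value is a separate argument because list-form system() does not use the shell
    system("samtools", "sort", "-@", "11", "-o", "${sample}_chrX.bam", "${sample}_chrX.sam") == 0
        or die "samtools sort failed for $sample (exit status $?)\n";
}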

#perl works too

Index BAM Files Using SAMTools

To view our BAM files in IGV, we need to index them, and for this we also use SAMTools. IGV won’t accept our .bam files without an accompanying .bam.bai file, so this step is essential. Using the `samtools` command with the ‘index’ option, we enter our [in.bam] files and receive [out.bam.bai] files. With these two files in hand, we can now view our data using IGV!

$ samtools index ERR188044_chrX.bam ERR188044_chrX.bam.bai

Again, we have a `bash` script, called index.sh, that will index all our .bam files one after another and generate the .bam.bai files.

#!/usr/bin/bash

#bash script for samtools; index our .bam files to obtain .bam.bai files using samtools

SAMPLES="ERR188044 ERR188104 ERR188234 ERR188245 ERR188257 ERR188273 ERR188337 ERR188383 ERR188401 ERR188428 ERR188454 ERR204916"

for SAMPLE in $SAMPLES; do
    samtools index ${SAMPLE}_chrX.bam ${SAMPLE}_chrX.bam.bai
done

#this works
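Since IGV needs the index next to each BAM file, a quick check for BAM files that still lack one can save some confusion later. The following is a minimal sketch; `missing_index` is a hypothetical helper, and it assumes the index is named by appending .bai to the BAM file name, as in the commands above.

```shell
#!/usr/bin/env bash

# missing_index: hypothetical helper. Lists BAM files in a directory
# (default: current directory) that have no .bam.bai file beside them.
missing_index() {
    local dir="${1:-.}"
    local bam
    for bam in "$dir"/*.bam; do
        [ -e "$bam" ] || continue          # no BAM files at all
        [ -e "$bam.bai" ] || echo "$bam"   # report BAM files without an index
    done
}

missing_index .
```

No output means every .bam file in the directory already has its index.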

And yet another `perl` script.

#!/usr/bin/perl

#perl script for samtools; index our .bam files to obtain .bam.bai files using samtools

use warnings;

use strict;

my @samples = qw(ERR188044 ERR188104 ERR188234 ERR188245 ERR188257 ERR188273 ERR188337 ERR188383 ERR188401 ERR188428 ERR188454 ERR204916);

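foreach my $sample (@samples) {
    #stop at the first sample that fails to index
    system("samtools", "index", "${sample}_chrX.bam", "${sample}_chrX.bam.bai") == 0
        or die "samtools index failed for $sample (exit status $?)\n";
}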

#perl works too

Parting Words

Hooray! You have made it through the first part of the HISAT2, StringTie, Ballgown tutorial! Next time we will add to this tutorial and continue with our indexed .bam files, viewing them in IGV and assembling our transcripts using StringTie. Here is the link to the next Rmd file, so that you don’t even have to search for it!

Alexander Selvey, University of Nevada, Reno

March 9th, 2019
