University of Nevada, Reno Vintage Logo

University of Nevada, Reno Vintage Logo

StringTie Tutorial from Pertea, M. et. al., Nat. Protoc. 2016


Welcome Back!

Welcome back to the third installment in this beginner’s series introduction to RNA-Seq. Here we have another Rmd for you to use as a guide while watching the next two installments in the series. The first installment, Video 3, goes over the usage of StringTie and a brief introduction to IGV. The second installation, Video 3.1, takes a more indepth look at the generation of TDF files and viewing multiple TDF files side-by-side in IGV. I have also included screenshot of color coded, male and female TDF files not included in Video 3.1 to display the differences in expression patters of the XIST (X-inactive specific transcript) gene. The XIST gene is only expressed in mammalian females and is very important in the process of X inactivation, or silencing of one of the two X chromosomes. I have included the code for Video 3.1 at the very bottom of this Rmd and have also labeled it as such. Video 3 begins with the installation of IGV and IGVTools, but I have removed the code for this as I included it in the previous HISAT2 Rmd and this reference guide begins at the opening of IGV and loading of .bam and .bam.bai files. Without further ado, let’s begin!

Video 3

Working in IGV

To begin, we will open IGV and load .bam and .bam.bai files to view. In order to do this we enter the command igv from the terminal and use the GUI to load our reference genome, choose the X chromosome to view, then enter XIST as our locus of interest.
$ igv

Using StringTie to Assemble Transcripts

Here, we will begin using StringTie to assemble our transcripts for each sample .bam file we created. We use the stringtie command with the options: ‘-p 11’ to set our threads to 11, ‘-G’ to designate our reference annotation file, ‘-o’ to designate out output file destination (can include full path/ here) and format as [output.gtf], ‘-l’ to set the prefix label for the outputs, and finally [input.bam] to denote our input files.
$ stringtie -p 11 -G chrX_data/genes/chrX.gtf -o ERR188044_chrX.gtf -l ERR188044 ERR188044_chrX.bam
We have written a nifty bash script to automate this step for us.
#!/usr/bin/bash

#bash script for stringtie; assemble transcripts using stringtie

SAMPLES="ERR188044 ERR188104 ERR188234 ERR188245 ERR188257 ERR188273 ERR188337 ERR188383 ERR188401 ERR188428 ERR188454 ERR204916"

for SAMPLE in $SAMPLES; do
    stringtie -p 11 -G ~/chrX_data/genes/chrX.gtf -o ${SAMPLE}_chrX.gtf -l ${SAMPLE} ${SAMPLE}_chrX.bam
done

#this works
A neat little perl script I wrote for the same task.
#!/usr/bin/perl

#perl script for stringtie; assemble transcripts using stringtie

use warnings;

use strict;

my @samples = qw(ERR188044 ERR188104 ERR188234 ERR188245 ERR188257 ERR188273 ERR188337 ERR188383 ERR188401 ERR188428 ERR188454 ERR204916);

foreach(@samples){
    do {
        system("stringtie", "-p 11", "-G ~/chrX_data/genes/chrX.gtf", "-o ${_}_chrX.gtf", "-l ${_}" "${_}_chrX.bam");
    }
}

#perl works too

Using StringTie to Merge Transcripts

To merge our transcripts, we utilize the stringtie command with the options: ‘–merge’ to use the merge function of stringtie, ‘-p 11’ to use 11 threads, ‘-G’ to designate our reference annotation file, ‘-o’ to assign our output file name and format to [output.gtf], and ‘chrX_data/mergelist.txt’ to provide the program with all of our inputs. Note: if you do not run the transcript assembly command/script in the same directory you are running this command, you will need to generate the mergelist.txt file manually, making sure to include the full path/ to each .gtf file.
stringtie --merge -p 11 -G chrX_data/genes/chrX.gtf -o stringtie_merged.gtf chrX_data/mergelist.txt

Using Gffcompare

We can then look at how the transcripts compare with the reference annotation (this is optional). Here we installed Gffcompare using conda install, then utilized gffcompare with options: ‘-r’ to include our reference annotation file, ‘-G’ to tell gffcompare to compare all the transcripts in our input file [stringtie_merged.gtf], and ‘-o’ to denote out output prefix.
conda install gffcompare
gffcompare -r chrX_data/genes/chrX.gtf -G -o merged stringtie_merged.gtf

Finally, we will use StringTie to estimate transcript abundances and create table counts for Ballgown (this is for when we do our differential expression analysis in R). Here we again use stringtie with options: ‘-e’ to limit the estimation and output of transcripts to only those that match the reference annotation, ‘-B’ turns our outputs into Ballgown input table files (.ctab) with our desired coverage data for our reference transcripts, ‘-p 11’ to denote the use of 11 threads, ‘-G’ to designate our reference annotation file, ‘-o’ to denote the creation of a new ballgown/ directory with all our individaul /[output.gtf] files contained within their own individual /[output_prefix]/ directories, and finally we enter our [input.bam] files.
stringtie -e -B -p 11 -G stringtie_merged.gtf -o ballgown/ERR188044/ERR188044_chrX.gtf ERR188044_chrX.bam
We have also written a nifty bash script to automate this step for us.
#!/usr/bin/bash

#bash script for stringtie; calculate abundances and generate ballgown dir containing tables for use in R

SAMPLES="ERR188044 ERR188104 ERR188234 ERR188245 ERR188257 ERR188273 ERR188337 ERR188383 ERR188401 ERR188428 ERR188454 ERR204916"

for SAMPLE in $SAMPLES; do
    stringtie -e -B -p 11 -G stringtie_merged.gtf -o ~/ballgown/${SAMPLE}/${SAMPLE}_chrX.gtf ${SAMPLE}_chrX.bam
done

#this works
And a perl script for those of us who operate better under pressure.
#!/usr/bin/perl

#perl script for stringtie; calculate abundances and generate ballgown/ dir containing tables for use in R

use warnings;

use strict;

my @samples = qw(ERR188044 ERR188104 ERR188234 ERR188245 ERR188257 ERR188273 ERR188337 ERR188383 ERR188401 ERR188428 ERR188454 ERR204916);

foreach(@samples){
    do {
        system("stringtie", "-e", "-B", "-p 11", "-G stringtie_merged.gtf", "-o ~/ballgown/${_}/${_}_chrX.gtf", "${_}_chrX.bam");
    }
}

#perl works too

Video 3.1

Generation of TDF Files

Let’s use IGVTools to generate TDF (Tiled Data File) files using the igvtools command in our terminal so that we may view all of our samples in IGV. Here we use the igvtools command with the following options: ‘count’ which computes coverage density for an alignment file and provides either .tdf or .wig output files, ‘-z’ is the max-zoom, ‘-w’ denotes window size, ‘-e’ which extends the read by the designated number in bp before counting, our [input.bam] file comes next, followed by our [output.tdf] file, and finally the path to our reference genome assembly.
$ igvtools count -z 5 -w 25 -e 250 ERR188044_chrX.bam ERR188044_chrX.tdf ~/igv/genomes/hg38.genome
We have generated another life saving bash script to automate this process for us!
#!/usr/bin/bash

#bash script for igvtools; generate .tdf files using igvtools

SAMPLES="ERR188044 ERR188104 ERR188234 ERR188245 ERR188257 ERR188273 ERR188337 ERR188383 ERR188401 ERR188428 ERR188454 ERR204916"

for SAMPLE in $SAMPLES; do
    igvtools count -z 5 -w 25 -e 250 ${SAMPLE}_chrX.bam ${SAMPLE}_chrX.tdf ~/igv/genomes/hg38.genome
done

#this works
Here, we are under pressure once again, creating perl scripts!
#!/usr/bin/perl

#perl script for igvtools; generate .tdf files using igvtools

use warnings;

use strict;

my @samples = qw(ERR188044 ERR188104 ERR188234 ERR188245 ERR188257 ERR188273 ERR188337 ERR188383 ERR188401 ERR188428 ERR188454 ERR204916);

foreach(@samples){
    do {
        system("igvtools", "count", "-z 5", "-w 25", "-e 250", "${_}_chrX.bam", "${_}_chrX.tdf", "~/igv/genomes/hg38.genome");
    }
}

#perl works too

Parting Words

So, I hear that you have completed the StringTie/IGV tutorial! Congratulations on all your hard work! With a little more effort, you will be analyzing RNA-Seq datasets like a pro in no time! Next time we will transition to R for our differential expression analysis using Ballgown. You can follow along using this Rmd for the Ballgown tutorial video!