University of Nevada, Reno Vintage Logo

University of Nevada, Reno Vintage Logo

Ballgown Tutorial from Pertea, M. et. al., Nat. Protoc. 2016


Welcome To the Final Installment!

Hello and welcome back to the final installment of the HISAT2, StringTie, Ballgown tutorial series! In this installment we will be working exclusively in R/RStudio to complete our task of elucidating the differential expression patterns of our RNA-Seq data. In the code below, I have tried to meticulously break down the complex R code and provide as much detail about each piece as possible. If you are much more adept within the realm of R code and have better explainations than the ones I give, please shoot me a comment on RPubs or YouTube and I will update this Rmd reference guide! With that being said, let’s get to work on some differential expression analysis in R using ballgown!

Open RStudio

Let’s begin working in R. You can work in either RStudio or terminal R. From the command line I enter rstudio and it opens the RStudio GUI.
$ rstudio

Install Required Packages

Install ballgown, RColorBrewer, genefilter, dplyr, and devtools in RStudio using the install.packages() function. Note: if you have already installed ballgown using conda install, skip the installation utilized here. Also, if using conda to install R and RStudio, you may also install these packages using conda install r-[package].
install.packages('ballgown')
install.packages('RColorBrewer')
install.packages('genefilter')
install.packages('dplyr')
install.packages('devtools')

Load Required Packages

Now that we have R (or Rstudio) loaded, we need to load the packages we will be using, using the load() function. If you do not currently have these packages, utilize the command above to install ballgown to install these packages before loading.
library(ballgown) #for differential expression analysis
library(RColorBrewer) #or RSkittleBrewer for plot coloring
library(genefilter) #calc of means and variances
library(dplyr) #sort and arrange results
library(devtools) #to ensure reproducibility and install relevant packages

Set Working Directory, Read-in Phenotype Data

Next we want to read in our phenotype data for our samples which are contained in a .csv file located in the chrX_data/ directory. To set your ~/working_directory, use the setwd() function and enter the path to your ~/working_directory.
setwd("/your/working_directory/path/")
After we have set our ~/working_directory, we will create a data frame called ‘pheno_data’ which we will assign our read-in .csv file to using the read.csv() function, entering the path to the file we would like to read-in.
pheno_data = read.csv("~/Desktop/biol792_miura/chrX_data/geuvadis_phenodata.csv")

Read-in Expression Data

Let’s read-in our expression data from the ballgown/ directory we generated using StringTie. In order to do this, we create another data frame called ‘bg_chrX’ to assign our data to and use the ballgown() package function with the options: ‘datadir’ to denote the directory/ our data directories are contained in, ‘samplePattern’ to denote the pattern of each of the directories/ our data are contained within, and ‘pData’ to utilize our phenotype data frame to categorize each of our samples.
bg_chrX = ballgown(dataDir = "ballgown", samplePattern = "ERR", pData=pheno_data)

Filtering

Now, let’s filter our data to remove low abundance genes. Following the processes above, we create another data frame called ‘bg_chrX_filt’ and use the subset() function with the options: ‘bg_chrX’ to denote the data frame we are to pull information from, ‘rowVars(texpr(bg_chrX)) >1’ to supply our new data frame with all rows within bg_chrX containing values >1, and ‘genomesubset’ to subset our data according to the genome. When using the texpr() function to extract our transcript-level expression measurements from our ballgown object, we need to set the ‘genomesubset’ to TRUE, which subsets the data within bg_chrX to specific parts of the genome.
bg_chrX_filt = subset(bg_chrX, "rowVars(texpr(bg_chrX)) >1", genomesubset=TRUE)

Transcript Identification

Here, we will identify transcripts that show statistically significant differences. Generating another data frame called ‘results_transcripts’, we use the stattest() function with the options: ‘bg_chrX_filt’ to denote the data frame we will extract data from, ‘feature’ to identify the genomic feature we are interested in, ‘covariate’ to represent our covariate of interest that corresponds to the name of a column of our pData data frame, ‘adjustvars’ to identify potential confounding variables corresponding again to the name of a column of our pData data frame, ‘getFC’ to acquire the fold changes, and ‘meas’ to denote our desired measurement outputs.
results_transcripts = stattest(bg_chrX_filt,feature="transcript", covariate="sex", adjustvars = c("population"), getFC=TRUE, meas="FPKM")

Gene Identification

Here, we will identify genes that show statistically significant differences. Again, we create a new data frame called ‘results_genes’. We then use the stattest() function with the options: ‘bg_chrX_filt’ to denote the data frame we will extract information from, ‘feature’ to identify the genomic feature we are interested in, ‘covariate’ to represent our covariate of interest that corresponds to the name of a column of our pData data frame, ‘adjustvars’ to identify potential confounding variables corresponding again to the name of a column of our pData data frame, ‘getFC’ to acquire the fold changes, and ‘meas’ to denote our desired measurement outputs.
results_genes = stattest(bg_chrX_filt, feature="gene", covariate="sex", adjustvars = c("population"), getFC=TRUE, meas="FPKM")

Adding Gene Names & IDs

This one speaks for itself. We will use this chunk of code to add gene names and IDs to our ‘results_transcripts’ dataframe. Working with the ‘results_transcripts’ data frame, we utilize the data.frame() function with the options: ‘geneNames=ballgown::geneNames(bg_chrX_filt)’ which accesses the geneNames function within the ballgown package and applies the exported variables within the ballgown object to the desired data frame, ‘geneIDs=ballgown::geneIDs(bg_chrX_filt)’ which performs as described above for geneNames, and ‘results_transcripts’ to denote the output data frame of the operation.
results_transcripts = data.frame(geneNames=ballgown::geneNames(bg_chrX_filt), geneIDs=ballgown::geneIDs(bg_chrX_filt), results_transcripts)

Arrange Data Based on P-Values

Arranging our transcripts from smallest to largest p-value. Using the arrange() function with the options: ‘results_transcripts’ or ‘results_genes’ to denote the data frame we will be extracting information from and ‘pval’ to denote how we want our data arranged.
results_transcripts = arrange(results_transcripts,pval)
Arranging our genes from smallest to largest p-value.
results_genes = arrange(results_genes,pval)

Write Results to CSV Files

The most important things I have learned in gradutate school: backup your data and save as often as possible! Here, we use the write.csv() function to save our generated data frames for later usage or analysis using the options: ‘results_transcripts’ or ‘results_genes’ to denote the data frame we want to save, “[file_name.csv]” to denote the name of the [output.csv] file, and ‘row.names’ to omit the row names in the [output.csv] file.
write.csv(results_transcripts, "chrX_transcript_results.csv", row.names=FALSE)
write.csv(results_genes, "chrX_gene_results.csv", row.names=FALSE)

Identification of Transcripts and Genes with Q-Values < 0.05.

Table of transcripts with q-values < 0.05. Here, we use the knitr package to generate a nice, clean table using the kable() function. For our table, we use the subset() function to sort our ‘results_transcripts’ data frame according to q-values that are less then 0.05 by entering ‘results_transcripts\(qval<0.05'. The '{\)}’ operator is used to extract the information from the data frame that you desire.
kable(subset(results_transcripts,results_transcripts$qval<0.05))
geneNames geneIDs feature id fc pval qval
. MSTRG.523 transcript 1658 0.0307972 0.0000000 0.0000002
XIST MSTRG.523 transcript 1657 0.0030162 0.0000000 0.0000002
. MSTRG.523 transcript 1656 0.0162247 0.0000000 0.0000004
. MSTRG.523 transcript 1659 0.0281950 0.0000001 0.0000362
TSIX MSTRG.522 transcript 1655 0.0790506 0.0000021 0.0009483
. MSTRG.605 transcript 1850 7.4083691 0.0000125 0.0046280
. MSTRG.765 transcript 2340 0.2761499 0.0000266 0.0084645
. MSTRG.613 transcript 1854 9.1289128 0.0000479 0.0133586
. MSTRG.138 transcript 426 3.1697627 0.0000581 0.0144054
KDM6A MSTRG.255 transcript 739 0.0538054 0.0001138 0.0253680
PNPLA4 MSTRG.64 transcript 190 0.5911889 0.0002026 0.0410783
RPS4X MSTRG.511 transcript 1612 0.5975862 0.0002342 0.0435228
Table of genes with q-values < 0.05. Here, we use the knitr package to generate a nice, clean table using the kable() function. For our table, we use the subset() function to sort our ‘results_genes’ data frame according to q-values that are less then 0.05 by entering ‘results_genes\(qval<0.05'. The '{\)}’ operator is used to extract the information from the data frame that you desire.
kable(subset(results_genes,results_genes$qval<0.05))
feature id fc pval qval
gene MSTRG.523 0.0023874 0.0000000 0.0000000
gene MSTRG.138 3.1100716 0.0000051 0.0024287
gene MSTRG.522 0.0827306 0.0000072 0.0024287
gene MSTRG.605 7.3080984 0.0000113 0.0028715
gene MSTRG.765 0.2773329 0.0000287 0.0058227
gene MSTRG.613 9.0353161 0.0000513 0.0086579
gene MSTRG.511 0.6378401 0.0000679 0.0098262
gene MSTRG.373 0.6094302 0.0001080 0.0136804
gene MSTRG.64 0.6014479 0.0001645 0.0185181
gene MSTRG.615 7.8812467 0.0003779 0.0382853
gene MSTRG.228 1.4106273 0.0004329 0.0398617

Optional Step: Adding a Little Color

Using different R packages, you can add whatever colors you want to your plots to make them stand out. I am using RColorBrewer, but the protocol uses RSkittleBrewer; either will work fine.

Code for RColorBrewer. Again, we will generate another data frame called ‘colors’ and using the brewer.pal() function, we will assign the palette we desire. Here, I used the options ‘5’ for the number of colors I want to add and ‘RdBu’ to denote the palette I wish to choose from. From here, we use the palette() function to choose the data frame ‘colors’, which contains our color choices, and apply the colors we have chosen to our plots we will generate.
colors = brewer.pal(5,"RdBu")
palette(colors)
Code for RSkittleBrewer. This code is essentially identical to the one above, but uses the data frame ‘tropical’ with the c(), or concatenate, function to designate the colors desired. Again, we generate a new data frame and then use the palette() function to apply our desired colors to our plots.
tropical = c('darkorange', 'dodgerblue', 'hotpink', 'limegreen', 'yellow')
palette(tropical)

Gene Distribution and Abundance Boxplot

Here, we examine the distribution of gene abundances in FPKM (Fragments Per Kilobase of transcript per Million mapped reads) across all samples colored by sex. Here, we generate a data frame called ‘fpkm’ and extract our transcript-level measurements from ‘bg_chrX’ and designate the units we want using ‘meas’. The second part of the code is a log transformation to normalize our expression data. We add 1 to each value because log2(0) is undefined.

Setup our data.
fpkm = texpr(bg_chrX,meas="FPKM")
fpkm = log2(fpkm+1)
Plot our data. Below, the par() function is used to set the boundaries of the plot, adjusting these values allows you to make sure the axis titles, main title, and sample names fit within the boundaries of the plot. The ‘mar’ option stands for margins and the following concatenated code includes the (bottom, left, top, right) boundaries of the plot. The code below uses the boxplot() function to plot our data, using the options: ‘fpkm’ to denote our data frame and x-axis sample labels, ‘col’ designates the colors that will be used to color the plot, the as.numeric() function coerces the generation of numeric objects (in this case we are extracting sex from pheno_data and coloring our plot by sex), ‘las’ orients the x-axis tick mark labels (0=parallel, 1=horizontal, 2=perpendicular, 3=vertical), and ‘ylab’ is used to set the y-axis label.
par(mar=c(9,9,4,2)) 
boxplot(fpkm, col=as.numeric(pheno_data$sex), las=2, ylab='log2(FPKM+1)')


Individual Transcript Plot

Now, we will make boxplots for individual transcripts across all samples, but first we will target our gene of interest (GOI) using it’s ID#. Here, we use the ballgown package to call our function ‘::’ and extract our object of interest using the transcriptNames() function on our data frame. Here, we extract object ID [10] from our ‘bg_chrX’ data frame. We do the exact same thing for the second part of the code, only using the geneNames() function.

Return the transcript name followed by the gene name of our GOI using it’s ID#.
ballgown::transcriptNames(bg_chrX)[10]
##          10 
## "NM_012227"
ballgown::geneNames(bg_chrX)[10]
##       10 
## "GTPBP6"
Plot our data for our gene:transcript. Here we plot our gene/transcript we extracted above. The par() function is used to set the boundaries of the plot, adjusting these values allows you to make sure the axis titles, main title, and sample names fit within the boundaries of the plot. The ‘mar’ option stands for margins and the following concatenated code includes the (bottom, left, top, right) boundaries of the plot. The plot() function below generates a box plot with options: ‘fpkm[10,] ~ pheno_data$sex’ plots our transcript distributions by sex, ‘border’ colors the polygon borders,
par(mar=c(5,7,5,5))
plot(fpkm[10,] ~ pheno_data$sex, border=c(1,2), cex.main=1.5, main=paste(ballgown::geneNames(bg_chrX)[10],' : ', ballgown::transcriptNames(bg_chrX)[10]), pch=19, cex.axis=1.5, cex.lab=1.5, xlab="Sex", ylab='log2(FPKM+1)')
points(fpkm[10,] ~ jitter(as.numeric(pheno_data$sex)), col=as.numeric(pheno_data$sex))


Isoform Expression Plots

Plot structure and expression levels of all transcripts that share the same gene locus within a single sample.

Using the plotTranscripts function, we will plot the different isoforms of the XIST gene. We extract our desired data from our ballgown geneIDs object as before using ‘ballgown::geneIDs(bg_chrX)[1657]’, where 1657 is the ID#. bg_chrX is the data frame we are working in. ‘main’ is used to generate a title for our plot and ‘sample’ is used to entitle denote which sample we want to pull data from.
plotTranscripts(ballgown::geneIDs(bg_chrX)[1657], bg_chrX, main=c('Gene XIST in sample ERR188234'), sample=c('ERR188234'))


Average Expression Levels

Plot average expression levels for all transcripts of a single gene across all samples.

Here, we use the plotMeans function to plot the average expression level for the transcripts corresponding to the ‘MSTRG.56’ gene within our ‘bg_chrX_filt’ data frame. We use ‘groupvar’ to separate our samples based on sex and we use the ‘legend’ function to include, or exclude in our case, a legend.
plotMeans('MSTRG.56', bg_chrX_filt, groupvar="sex", legend=FALSE)


Farewell Message

You have now completed the HISAT, StringTie, Ballgown protocol published by Pertea et. al. 2016! Now that you have finished the protocol, you can retool this code to fit your own data! I will be doing this myself and will definitely share my journey with you in future videos. Thank you, thank you, thank you for venturing on this learning experience with me. I hope to hear from you in the comments section of this RPub and my YouTube videos! I am looking forward to getting feedback on my own data exploration adventures, as well!