Rmd for Ballgown Tutorial, BIOL792, Spring 2019

Ballgown Tutorial from Pertea, M. et. al., Nat. Protoc. 2016

Link to YouTube Video: Beginner’s RNA-Seq Tutorial Part 4 - Dr. Pedro Miura’s BIOL792 Course, University of Nevada, Reno

Welcome To the Final Installment!

Hello and welcome back to the final installment of the HISAT2, StringTie, Ballgown tutorial series! In this installment we will be working exclusively in R/RStudio to complete our task of elucidating the differential expression patterns of our RNA-Seq data. In the code below, I have tried to meticulously break down the complex `R` code and provide as much detail about each piece as possible. If you are much more adept within the realm of `R` code and have better explainations than the ones I give, please shoot me a comment on RPubs or YouTube and I will update this Rmd reference guide! With that being said, let’s get to work on some differential expression analysis in R using `ballgown`!

Open RStudio

Let’s begin working in R. You can work in either RStudio or terminal R. From the command line I enter `rstudio` and it opens the RStudio GUI.

$ rstudio

Install Required Packages

Install `ballgown`, `RColorBrewer`, `genefilter`, `dplyr`, and `devtools` in `RStudio` using the `install.packages()` function. Note: if you have already installed `ballgown` using `conda install`, skip the installation utilized here. Also, if using `conda` to install R and RStudio, you may also install these packages using `conda install r-[package]`.

install.packages('ballgown')
install.packages('RColorBrewer')
install.packages('genefilter')
install.packages('dplyr')
install.packages('devtools')

Load Required Packages

Now that we have R (or Rstudio) loaded, we need to load the packages we will be using, using the `load()` function. If you do not currently have these packages, utilize the command above to install ballgown to install these packages before loading.

library(ballgown) #for differential expression analysis
library(RColorBrewer) #or RSkittleBrewer for plot coloring
library(genefilter) #calc of means and variances
library(dplyr) #sort and arrange results
library(devtools) #to ensure reproducibility and install relevant packages

Set Working Directory, Read-in Phenotype Data

Next we want to read in our phenotype data for our samples which are contained in a .csv file located in the chrX_data/ directory. To set your ~/working_directory, use the `setwd()` function and enter the path to your ~/working_directory.

setwd("/your/working_directory/path/")

After we have set our ~/working_directory, we will create a data frame called ‘pheno_data’ which we will assign our read-in .csv file to using the `read.csv()` function, entering the path to the file we would like to read-in.

pheno_data = read.csv("~/Desktop/biol792_miura/chrX_data/geuvadis_phenodata.csv")

Read-in Expression Data

Let’s read-in our expression data from the ballgown/ directory we generated using StringTie. In order to do this, we create another data frame called ‘bg_chrX’ to assign our data to and use the `ballgown()` package function with the options: ‘datadir’ to denote the directory/ our data directories are contained in, ‘samplePattern’ to denote the pattern of each of the directories/ our data are contained within, and ‘pData’ to utilize our phenotype data frame to categorize each of our samples.

bg_chrX = ballgown(dataDir = "ballgown", samplePattern = "ERR", pData=pheno_data)

Filtering

Now, let’s filter our data to remove low abundance genes. Following the processes above, we create another data frame called ‘bg_chrX_filt’ and use the `subset()` function with the options: ‘bg_chrX’ to denote the data frame we are to pull information from, ‘rowVars(texpr(bg_chrX)) >1’ to supply our new data frame with all rows within bg_chrX containing values >1, and ‘genomesubset’ to subset our data according to the genome. When using the `texpr()` function to extract our transcript-level expression measurements from our `ballgown` object, we need to set the ‘genomesubset’ to TRUE, which subsets the data within bg_chrX to specific parts of the genome.

bg_chrX_filt = subset(bg_chrX, "rowVars(texpr(bg_chrX)) >1", genomesubset=TRUE)

Transcript Identification

Here, we will identify transcripts that show statistically significant differences. Generating another data frame called ‘results_transcripts’, we use the `stattest()` function with the options: ‘bg_chrX_filt’ to denote the data frame we will extract data from, ‘feature’ to identify the genomic feature we are interested in, ‘covariate’ to represent our covariate of interest that corresponds to the name of a column of our pData data frame, ‘adjustvars’ to identify potential confounding variables corresponding again to the name of a column of our pData data frame, ‘getFC’ to acquire the fold changes, and ‘meas’ to denote our desired measurement outputs.

results_transcripts = stattest(bg_chrX_filt,feature="transcript", covariate="sex", adjustvars = c("population"), getFC=TRUE, meas="FPKM")

Gene Identification

Here, we will identify genes that show statistically significant differences. Again, we create a new data frame called ‘results_genes’. We then use the `stattest()` function with the options: ‘bg_chrX_filt’ to denote the data frame we will extract information from, ‘feature’ to identify the genomic feature we are interested in, ‘covariate’ to represent our covariate of interest that corresponds to the name of a column of our pData data frame, ‘adjustvars’ to identify potential confounding variables corresponding again to the name of a column of our pData data frame, ‘getFC’ to acquire the fold changes, and ‘meas’ to denote our desired measurement outputs.

results_genes = stattest(bg_chrX_filt, feature="gene", covariate="sex", adjustvars = c("population"), getFC=TRUE, meas="FPKM")

Adding Gene Names & IDs

This one speaks for itself. We will use this chunk of code to add gene names and IDs to our ‘results_transcripts’ dataframe. Working with the ‘results_transcripts’ data frame, we utilize the `data.frame()` function with the options: ‘geneNames=ballgown::geneNames(bg_chrX_filt)’ which accesses the geneNames function within the `ballgown` package and applies the exported variables within the `ballgown` object to the desired data frame, ‘geneIDs=ballgown::geneIDs(bg_chrX_filt)’ which performs as described above for geneNames, and ‘results_transcripts’ to denote the output data frame of the operation.

results_transcripts = data.frame(geneNames=ballgown::geneNames(bg_chrX_filt), geneIDs=ballgown::geneIDs(bg_chrX_filt), results_transcripts)

Arrange Data Based on P-Values

Arranging our transcripts from smallest to largest p-value. Using the `arrange()` function with the options: ‘results_transcripts’ or ‘results_genes’ to denote the data frame we will be extracting information from and ‘pval’ to denote how we want our data arranged.

results_transcripts = arrange(results_transcripts,pval)

Arranging our genes from smallest to largest p-value.

results_genes = arrange(results_genes,pval)

Write Results to CSV Files

The most important things I have learned in gradutate school: backup your data and save as often as possible! Here, we use the `write.csv()` function to save our generated data frames for later usage or analysis using the options: ‘results_transcripts’ or ‘results_genes’ to denote the data frame we want to save, “[file_name.csv]” to denote the name of the [output.csv] file, and ‘row.names’ to omit the row names in the [output.csv] file.

write.csv(results_transcripts, "chrX_transcript_results.csv", row.names=FALSE)
write.csv(results_genes, "chrX_gene_results.csv", row.names=FALSE)

Identification of Transcripts and Genes with Q-Values < 0.05.

Table of transcripts with q-values < 0.05. Here, we use the `knitr` package to generate a nice, clean table using the `kable()` function. For our table, we use the `subset()` function to sort our ‘results_transcripts’ data frame according to q-values that are less then 0.05 by entering ‘results_transcripts$qval<0.05'. The '{$}’ operator is used to extract the information from the data frame that you desire.

kable(subset(results_transcripts,results_transcripts$qval<0.05))

geneNames	geneIDs	feature	id	fc	pval	qval
.	MSTRG.523	transcript	1658	0.0307972	0.0000000	0.0000002
XIST	MSTRG.523	transcript	1657	0.0030162	0.0000000	0.0000002
.	MSTRG.523	transcript	1656	0.0162247	0.0000000	0.0000004
.	MSTRG.523	transcript	1659	0.0281950	0.0000001	0.0000362
TSIX	MSTRG.522	transcript	1655	0.0790506	0.0000021	0.0009483
.	MSTRG.605	transcript	1850	7.4083691	0.0000125	0.0046280
.	MSTRG.765	transcript	2340	0.2761499	0.0000266	0.0084645
.	MSTRG.613	transcript	1854	9.1289128	0.0000479	0.0133586
.	MSTRG.138	transcript	426	3.1697627	0.0000581	0.0144054
KDM6A	MSTRG.255	transcript	739	0.0538054	0.0001138	0.0253680
PNPLA4	MSTRG.64	transcript	190	0.5911889	0.0002026	0.0410783
RPS4X	MSTRG.511	transcript	1612	0.5975862	0.0002342	0.0435228

Table of genes with q-values < 0.05. Here, we use the `knitr` package to generate a nice, clean table using the `kable()` function. For our table, we use the `subset()` function to sort our ‘results_genes’ data frame according to q-values that are less then 0.05 by entering ‘results_genes$qval<0.05'. The '{$}’ operator is used to extract the information from the data frame that you desire.

kable(subset(results_genes,results_genes$qval<0.05))

feature	id	fc	pval	qval
gene	MSTRG.523	0.0023874	0.0000000	0.0000000
gene	MSTRG.138	3.1100716	0.0000051	0.0024287
gene	MSTRG.522	0.0827306	0.0000072	0.0024287
gene	MSTRG.605	7.3080984	0.0000113	0.0028715
gene	MSTRG.765	0.2773329	0.0000287	0.0058227
gene	MSTRG.613	9.0353161	0.0000513	0.0086579
gene	MSTRG.511	0.6378401	0.0000679	0.0098262
gene	MSTRG.373	0.6094302	0.0001080	0.0136804
gene	MSTRG.64	0.6014479	0.0001645	0.0185181
gene	MSTRG.615	7.8812467	0.0003779	0.0382853
gene	MSTRG.228	1.4106273	0.0004329	0.0398617

Optional Step: Adding a Little Color

Using different R packages, you can add whatever colors you want to your plots to make them stand out. I am using `RColorBrewer`, but the protocol uses `RSkittleBrewer`; either will work fine.

Code for `RColorBrewer`. Again, we will generate another data frame called ‘colors’ and using the `brewer.pal()` function, we will assign the palette we desire. Here, I used the options ‘5’ for the number of colors I want to add and ‘RdBu’ to denote the palette I wish to choose from. From here, we use the `palette()` function to choose the data frame ‘colors’, which contains our color choices, and apply the colors we have chosen to our plots we will generate.

colors = brewer.pal(5,"RdBu")
palette(colors)

Code for `RSkittleBrewer`. This code is essentially identical to the one above, but uses the data frame ‘tropical’ with the `c()`, or concatenate, function to designate the colors desired. Again, we generate a new data frame and then use the `palette()` function to apply our desired colors to our plots.

tropical = c('darkorange', 'dodgerblue', 'hotpink', 'limegreen', 'yellow')
palette(tropical)

Gene Distribution and Abundance Boxplot

Here, we examine the distribution of gene abundances in FPKM (Fragments Per Kilobase of transcript per Million mapped reads) across all samples colored by sex. Here, we generate a data frame called ‘fpkm’ and extract our transcript-level measurements from ‘bg_chrX’ and designate the units we want using ‘meas’. The second part of the code is a log transformation to normalize our expression data. We add 1 to each value because log2(0) is undefined.

Setup our data.

fpkm = texpr(bg_chrX,meas="FPKM")
fpkm = log2(fpkm+1)

Plot our data. Below, the `par()` function is used to set the boundaries of the plot, adjusting these values allows you to make sure the axis titles, main title, and sample names fit within the boundaries of the plot. The ‘mar’ option stands for margins and the following concatenated code includes the (bottom, left, top, right) boundaries of the plot. The code below uses the `boxplot()` function to plot our data, using the options: ‘fpkm’ to denote our data frame and x-axis sample labels, ‘col’ designates the colors that will be used to color the plot, the `as.numeric()` function coerces the generation of numeric objects (in this case we are extracting sex from pheno_data and coloring our plot by sex), ‘las’ orients the x-axis tick mark labels (0=parallel, 1=horizontal, 2=perpendicular, 3=vertical), and ‘ylab’ is used to set the y-axis label.

par(mar=c(9,9,4,2)) 
boxplot(fpkm, col=as.numeric(pheno_data$sex), las=2, ylab='log2(FPKM+1)')

Individual Transcript Plot

Now, we will make boxplots for individual transcripts across all samples, but first we will target our gene of interest (GOI) using it’s ID#. Here, we use the `ballgown` package to call our function ‘::’ and extract our object of interest using the `transcriptNames()` function on our data frame. Here, we extract object ID [10] from our ‘bg_chrX’ data frame. We do the exact same thing for the second part of the code, only using the `geneNames()` function.

Return the transcript name followed by the gene name of our GOI using it’s ID#.

ballgown::transcriptNames(bg_chrX)[10]

##          10 
## "NM_012227"

ballgown::geneNames(bg_chrX)[10]

##       10 
## "GTPBP6"

Plot our data for our gene:transcript. Here we plot our gene/transcript we extracted above. The `par()` function is used to set the boundaries of the plot, adjusting these values allows you to make sure the axis titles, main title, and sample names fit within the boundaries of the plot. The ‘mar’ option stands for margins and the following concatenated code includes the (bottom, left, top, right) boundaries of the plot. The `plot()` function below generates a box plot with options: ‘fpkm[10,] ~ pheno_data$sex’ plots our transcript distributions by sex, ‘border’ colors the polygon borders,

par(mar=c(5,7,5,5))
plot(fpkm[10,] ~ pheno_data$sex, border=c(1,2), cex.main=1.5, main=paste(ballgown::geneNames(bg_chrX)[10],' : ', ballgown::transcriptNames(bg_chrX)[10]), pch=19, cex.axis=1.5, cex.lab=1.5, xlab="Sex", ylab='log2(FPKM+1)')
points(fpkm[10,] ~ jitter(as.numeric(pheno_data$sex)), col=as.numeric(pheno_data$sex))

Isoform Expression Plots

Plot structure and expression levels of all transcripts that share the same gene locus within a single sample.

Using the `plotTranscripts` function, we will plot the different isoforms of the XIST gene. We extract our desired data from our `ballgown` geneIDs object as before using ‘ballgown::geneIDs(bg_chrX)[1657]’, where 1657 is the ID#. bg_chrX is the data frame we are working in. ‘main’ is used to generate a title for our plot and ‘sample’ is used to entitle denote which sample we want to pull data from.

plotTranscripts(ballgown::geneIDs(bg_chrX)[1657], bg_chrX, main=c('Gene XIST in sample ERR188234'), sample=c('ERR188234'))

Average Expression Levels

Plot average expression levels for all transcripts of a single gene across all samples.

Here, we use the `plotMeans` function to plot the average expression level for the transcripts corresponding to the ‘MSTRG.56’ gene within our ‘bg_chrX_filt’ data frame. We use ‘groupvar’ to separate our samples based on sex and we use the ‘legend’ function to include, or exclude in our case, a legend.

plotMeans('MSTRG.56', bg_chrX_filt, groupvar="sex", legend=FALSE)

Rmd for Ballgown Tutorial, BIOL792, Spring 2019

Alexander Selvey, University of Nevada, Reno

March 8th, 2019

Ballgown Tutorial from Pertea, M. et. al., Nat. Protoc. 2016

Link to YouTube Video: Beginner’s RNA-Seq Tutorial Part 4 - Dr. Pedro Miura’s BIOL792 Course, University of Nevada, Reno

Welcome To the Final Installment!

Open RStudio

Let’s begin working in R. You can work in either RStudio or terminal R. From the command line I enter `rstudio` and it opens the RStudio GUI.

Install Required Packages

Load Required Packages

Now that we have R (or Rstudio) loaded, we need to load the packages we will be using, using the `load()` function. If you do not currently have these packages, utilize the command above to install ballgown to install these packages before loading.

Set Working Directory, Read-in Phenotype Data

Next we want to read in our phenotype data for our samples which are contained in a .csv file located in the chrX_data/ directory. To set your ~/working_directory, use the `setwd()` function and enter the path to your ~/working_directory.

After we have set our ~/working_directory, we will create a data frame called ‘pheno_data’ which we will assign our read-in .csv file to using the `read.csv()` function, entering the path to the file we would like to read-in.

Read-in Expression Data

Filtering

Transcript Identification

Gene Identification

Adding Gene Names & IDs

Arrange Data Based on P-Values

Arranging our transcripts from smallest to largest p-value. Using the `arrange()` function with the options: ‘results_transcripts’ or ‘results_genes’ to denote the data frame we will be extracting information from and ‘pval’ to denote how we want our data arranged.

Arranging our genes from smallest to largest p-value.

Write Results to CSV Files

Identification of Transcripts and Genes with Q-Values < 0.05.

Optional Step: Adding a Little Color

Using different R packages, you can add whatever colors you want to your plots to make them stand out. I am using `RColorBrewer`, but the protocol uses `RSkittleBrewer`; either will work fine.

Gene Distribution and Abundance Boxplot

Setup our data.

Individual Transcript Plot

Return the transcript name followed by the gene name of our GOI using it’s ID#.

Isoform Expression Plots

Average Expression Levels

Plot average expression levels for all transcripts of a single gene across all samples.

Farewell Message

Rmd for Ballgown Tutorial, BIOL792, Spring 2019

Alexander Selvey, University of Nevada, Reno

March 8th, 2019

Ballgown Tutorial from Pertea, M. et. al., Nat. Protoc. 2016

Link to YouTube Video: Beginner’s RNA-Seq Tutorial Part 4 - Dr. Pedro Miura’s BIOL792 Course, University of Nevada, Reno

Welcome To the Final Installment!

Open RStudio

Let’s begin working in R. You can work in either RStudio or terminal R. From the command line I enter rstudio and it opens the RStudio GUI.

Install Required Packages

Load Required Packages

Now that we have R (or Rstudio) loaded, we need to load the packages we will be using, using the load() function. If you do not currently have these packages, utilize the command above to install ballgown to install these packages before loading.

Set Working Directory, Read-in Phenotype Data

Next we want to read in our phenotype data for our samples which are contained in a .csv file located in the chrX_data/ directory. To set your ~/working_directory, use the setwd() function and enter the path to your ~/working_directory.

After we have set our ~/working_directory, we will create a data frame called ‘pheno_data’ which we will assign our read-in .csv file to using the read.csv() function, entering the path to the file we would like to read-in.

Read-in Expression Data

Filtering

Transcript Identification

Gene Identification

Adding Gene Names & IDs

Arrange Data Based on P-Values

Arranging our transcripts from smallest to largest p-value. Using the arrange() function with the options: ‘results_transcripts’ or ‘results_genes’ to denote the data frame we will be extracting information from and ‘pval’ to denote how we want our data arranged.

Arranging our genes from smallest to largest p-value.

Write Results to CSV Files

Identification of Transcripts and Genes with Q-Values < 0.05.

Optional Step: Adding a Little Color

Using different R packages, you can add whatever colors you want to your plots to make them stand out. I am using RColorBrewer, but the protocol uses RSkittleBrewer; either will work fine.

Gene Distribution and Abundance Boxplot

Setup our data.

Individual Transcript Plot

Return the transcript name followed by the gene name of our GOI using it’s ID#.

Isoform Expression Plots

Plot structure and expression levels of all transcripts that share the same gene locus within a single sample.

Average Expression Levels

Plot average expression levels for all transcripts of a single gene across all samples.

Farewell Message

Let’s begin working in R. You can work in either RStudio or terminal R. From the command line I enter `rstudio` and it opens the RStudio GUI.

Now that we have R (or Rstudio) loaded, we need to load the packages we will be using, using the `load()` function. If you do not currently have these packages, utilize the command above to install ballgown to install these packages before loading.

Next we want to read in our phenotype data for our samples which are contained in a .csv file located in the chrX_data/ directory. To set your ~/working_directory, use the `setwd()` function and enter the path to your ~/working_directory.

After we have set our ~/working_directory, we will create a data frame called ‘pheno_data’ which we will assign our read-in .csv file to using the `read.csv()` function, entering the path to the file we would like to read-in.

Arranging our transcripts from smallest to largest p-value. Using the `arrange()` function with the options: ‘results_transcripts’ or ‘results_genes’ to denote the data frame we will be extracting information from and ‘pval’ to denote how we want our data arranged.

Using different R packages, you can add whatever colors you want to your plots to make them stand out. I am using `RColorBrewer`, but the protocol uses `RSkittleBrewer`; either will work fine.