University of Nevada, Reno Vintage Logo
R code and provide as much detail about each piece as possible. If you are much more adept within the realm of R code and have better explainations than the ones I give, please shoot me a comment on RPubs or YouTube and I will update this Rmd reference guide! With that being said, let’s get to work on some differential expression analysis in R using ballgown!rstudio and it opens the RStudio GUI.$ rstudio
ballgown, RColorBrewer, genefilter, dplyr, and devtools in RStudio using the install.packages() function. Note: if you have already installed ballgown using conda install, skip the installation utilized here. Also, if using conda to install R and RStudio, you may also install these packages using conda install r-[package].install.packages('ballgown')
install.packages('RColorBrewer')
install.packages('genefilter')
install.packages('dplyr')
install.packages('devtools')
load() function. If you do not currently have these packages, utilize the command above to install ballgown to install these packages before loading.library(ballgown) #for differential expression analysis
library(RColorBrewer) #or RSkittleBrewer for plot coloring
library(genefilter) #calc of means and variances
library(dplyr) #sort and arrange results
library(devtools) #to ensure reproducibility and install relevant packages
setwd() function and enter the path to your ~/working_directory.setwd("/your/working_directory/path/")
read.csv() function, entering the path to the file we would like to read-in.pheno_data = read.csv("~/Desktop/biol792_miura/chrX_data/geuvadis_phenodata.csv")
ballgown() package function with the options: ‘datadir’ to denote the directory/ our data directories are contained in, ‘samplePattern’ to denote the pattern of each of the directories/ our data are contained within, and ‘pData’ to utilize our phenotype data frame to categorize each of our samples.bg_chrX = ballgown(dataDir = "ballgown", samplePattern = "ERR", pData=pheno_data)
subset() function with the options: ‘bg_chrX’ to denote the data frame we are to pull information from, ‘rowVars(texpr(bg_chrX)) >1’ to supply our new data frame with all rows within bg_chrX containing values >1, and ‘genomesubset’ to subset our data according to the genome. When using the texpr() function to extract our transcript-level expression measurements from our ballgown object, we need to set the ‘genomesubset’ to TRUE, which subsets the data within bg_chrX to specific parts of the genome.bg_chrX_filt = subset(bg_chrX, "rowVars(texpr(bg_chrX)) >1", genomesubset=TRUE)
stattest() function with the options: ‘bg_chrX_filt’ to denote the data frame we will extract data from, ‘feature’ to identify the genomic feature we are interested in, ‘covariate’ to represent our covariate of interest that corresponds to the name of a column of our pData data frame, ‘adjustvars’ to identify potential confounding variables corresponding again to the name of a column of our pData data frame, ‘getFC’ to acquire the fold changes, and ‘meas’ to denote our desired measurement outputs.results_transcripts = stattest(bg_chrX_filt,feature="transcript", covariate="sex", adjustvars = c("population"), getFC=TRUE, meas="FPKM")
stattest() function with the options: ‘bg_chrX_filt’ to denote the data frame we will extract information from, ‘feature’ to identify the genomic feature we are interested in, ‘covariate’ to represent our covariate of interest that corresponds to the name of a column of our pData data frame, ‘adjustvars’ to identify potential confounding variables corresponding again to the name of a column of our pData data frame, ‘getFC’ to acquire the fold changes, and ‘meas’ to denote our desired measurement outputs.results_genes = stattest(bg_chrX_filt, feature="gene", covariate="sex", adjustvars = c("population"), getFC=TRUE, meas="FPKM")
data.frame() function with the options: ‘geneNames=ballgown::geneNames(bg_chrX_filt)’ which accesses the geneNames function within the ballgown package and applies the exported variables within the ballgown object to the desired data frame, ‘geneIDs=ballgown::geneIDs(bg_chrX_filt)’ which performs as described above for geneNames, and ‘results_transcripts’ to denote the output data frame of the operation.results_transcripts = data.frame(geneNames=ballgown::geneNames(bg_chrX_filt), geneIDs=ballgown::geneIDs(bg_chrX_filt), results_transcripts)
arrange() function with the options: ‘results_transcripts’ or ‘results_genes’ to denote the data frame we will be extracting information from and ‘pval’ to denote how we want our data arranged.results_transcripts = arrange(results_transcripts,pval)
results_genes = arrange(results_genes,pval)
write.csv() function to save our generated data frames for later usage or analysis using the options: ‘results_transcripts’ or ‘results_genes’ to denote the data frame we want to save, “[file_name.csv]” to denote the name of the [output.csv] file, and ‘row.names’ to omit the row names in the [output.csv] file.write.csv(results_transcripts, "chrX_transcript_results.csv", row.names=FALSE)
write.csv(results_genes, "chrX_gene_results.csv", row.names=FALSE)
knitr package to generate a nice, clean table using the kable() function. For our table, we use the subset() function to sort our ‘results_transcripts’ data frame according to q-values that are less then 0.05 by entering ‘results_transcripts\(qval<0.05'. The '{\)}’ operator is used to extract the information from the data frame that you desire.kable(subset(results_transcripts,results_transcripts$qval<0.05))
| geneNames | geneIDs | feature | id | fc | pval | qval |
|---|---|---|---|---|---|---|
| . | MSTRG.523 | transcript | 1658 | 0.0307972 | 0.0000000 | 0.0000002 |
| XIST | MSTRG.523 | transcript | 1657 | 0.0030162 | 0.0000000 | 0.0000002 |
| . | MSTRG.523 | transcript | 1656 | 0.0162247 | 0.0000000 | 0.0000004 |
| . | MSTRG.523 | transcript | 1659 | 0.0281950 | 0.0000001 | 0.0000362 |
| TSIX | MSTRG.522 | transcript | 1655 | 0.0790506 | 0.0000021 | 0.0009483 |
| . | MSTRG.605 | transcript | 1850 | 7.4083691 | 0.0000125 | 0.0046280 |
| . | MSTRG.765 | transcript | 2340 | 0.2761499 | 0.0000266 | 0.0084645 |
| . | MSTRG.613 | transcript | 1854 | 9.1289128 | 0.0000479 | 0.0133586 |
| . | MSTRG.138 | transcript | 426 | 3.1697627 | 0.0000581 | 0.0144054 |
| KDM6A | MSTRG.255 | transcript | 739 | 0.0538054 | 0.0001138 | 0.0253680 |
| PNPLA4 | MSTRG.64 | transcript | 190 | 0.5911889 | 0.0002026 | 0.0410783 |
| RPS4X | MSTRG.511 | transcript | 1612 | 0.5975862 | 0.0002342 | 0.0435228 |
knitr package to generate a nice, clean table using the kable() function. For our table, we use the subset() function to sort our ‘results_genes’ data frame according to q-values that are less then 0.05 by entering ‘results_genes\(qval<0.05'. The '{\)}’ operator is used to extract the information from the data frame that you desire.kable(subset(results_genes,results_genes$qval<0.05))
| feature | id | fc | pval | qval |
|---|---|---|---|---|
| gene | MSTRG.523 | 0.0023874 | 0.0000000 | 0.0000000 |
| gene | MSTRG.138 | 3.1100716 | 0.0000051 | 0.0024287 |
| gene | MSTRG.522 | 0.0827306 | 0.0000072 | 0.0024287 |
| gene | MSTRG.605 | 7.3080984 | 0.0000113 | 0.0028715 |
| gene | MSTRG.765 | 0.2773329 | 0.0000287 | 0.0058227 |
| gene | MSTRG.613 | 9.0353161 | 0.0000513 | 0.0086579 |
| gene | MSTRG.511 | 0.6378401 | 0.0000679 | 0.0098262 |
| gene | MSTRG.373 | 0.6094302 | 0.0001080 | 0.0136804 |
| gene | MSTRG.64 | 0.6014479 | 0.0001645 | 0.0185181 |
| gene | MSTRG.615 | 7.8812467 | 0.0003779 | 0.0382853 |
| gene | MSTRG.228 | 1.4106273 | 0.0004329 | 0.0398617 |
RColorBrewer, but the protocol uses RSkittleBrewer; either will work fine.RColorBrewer. Again, we will generate another data frame called ‘colors’ and using the brewer.pal() function, we will assign the palette we desire. Here, I used the options ‘5’ for the number of colors I want to add and ‘RdBu’ to denote the palette I wish to choose from. From here, we use the palette() function to choose the data frame ‘colors’, which contains our color choices, and apply the colors we have chosen to our plots we will generate.colors = brewer.pal(5,"RdBu")
palette(colors)
RSkittleBrewer. This code is essentially identical to the one above, but uses the data frame ‘tropical’ with the c(), or concatenate, function to designate the colors desired. Again, we generate a new data frame and then use the palette() function to apply our desired colors to our plots.tropical = c('darkorange', 'dodgerblue', 'hotpink', 'limegreen', 'yellow')
palette(tropical)
fpkm = texpr(bg_chrX,meas="FPKM")
fpkm = log2(fpkm+1)
par() function is used to set the boundaries of the plot, adjusting these values allows you to make sure the axis titles, main title, and sample names fit within the boundaries of the plot. The ‘mar’ option stands for margins and the following concatenated code includes the (bottom, left, top, right) boundaries of the plot. The code below uses the boxplot() function to plot our data, using the options: ‘fpkm’ to denote our data frame and x-axis sample labels, ‘col’ designates the colors that will be used to color the plot, the as.numeric() function coerces the generation of numeric objects (in this case we are extracting sex from pheno_data and coloring our plot by sex), ‘las’ orients the x-axis tick mark labels (0=parallel, 1=horizontal, 2=perpendicular, 3=vertical), and ‘ylab’ is used to set the y-axis label.par(mar=c(9,9,4,2))
boxplot(fpkm, col=as.numeric(pheno_data$sex), las=2, ylab='log2(FPKM+1)')
ballgown package to call our function ‘::’ and extract our object of interest using the transcriptNames() function on our data frame. Here, we extract object ID [10] from our ‘bg_chrX’ data frame. We do the exact same thing for the second part of the code, only using the geneNames() function.ballgown::transcriptNames(bg_chrX)[10]
## 10
## "NM_012227"
ballgown::geneNames(bg_chrX)[10]
## 10
## "GTPBP6"
par() function is used to set the boundaries of the plot, adjusting these values allows you to make sure the axis titles, main title, and sample names fit within the boundaries of the plot. The ‘mar’ option stands for margins and the following concatenated code includes the (bottom, left, top, right) boundaries of the plot. The plot() function below generates a box plot with options: ‘fpkm[10,] ~ pheno_data$sex’ plots our transcript distributions by sex, ‘border’ colors the polygon borders,par(mar=c(5,7,5,5))
plot(fpkm[10,] ~ pheno_data$sex, border=c(1,2), cex.main=1.5, main=paste(ballgown::geneNames(bg_chrX)[10],' : ', ballgown::transcriptNames(bg_chrX)[10]), pch=19, cex.axis=1.5, cex.lab=1.5, xlab="Sex", ylab='log2(FPKM+1)')
points(fpkm[10,] ~ jitter(as.numeric(pheno_data$sex)), col=as.numeric(pheno_data$sex))
plotMeans function to plot the average expression level for the transcripts corresponding to the ‘MSTRG.56’ gene within our ‘bg_chrX_filt’ data frame. We use ‘groupvar’ to separate our samples based on sex and we use the ‘legend’ function to include, or exclude in our case, a legend.plotMeans('MSTRG.56', bg_chrX_filt, groupvar="sex", legend=FALSE)