The point

I get asked all the time, “hey, could you make a boxplot of the expression values of SomeGene for me?” The plotCounts() function in DESeq will do this, but only one gene at a time. This shows you how to do this for any arbitrary number of genes in your data (or all of them!). You can plot a small number in a single plot with facets, or you could plot a large number in a text-searchable multi-page PDF.

Setup

Load the packages you’ll use.

# CRAN
library(dplyr)
library(tidyr)
library(ggplot2)
theme_set(theme_bw(base_size=14) + theme(strip.background = element_blank()))

# Bioconductor
library(DESeq2)
library(airway)

Load the data from the airway package, and add a made up “group” variable. We’ll use this downstream when we map aesthetics onto a boxplot. Create a DESeqDataSet and run the DESeq pipeline on it.

# Load the data
data(airway)

# Make up a fake "group" variable
colData(airway)$fake <- factor(rep(c("groupA", "groupB"), each=4))

# Make a dataset and run the pipeline
dds <- DESeqDataSet(airway, design=~cell+dex)
dds <- DESeq(dds)

Inspect the colData.

colData(dds)
## DataFrame with 8 rows and 11 columns
##            SampleName     cell      dex    albut        Run avgLength
##              <factor> <factor> <factor> <factor>   <factor> <integer>
## SRR1039508 GSM1275862   N61311    untrt    untrt SRR1039508       126
## SRR1039509 GSM1275863   N61311      trt    untrt SRR1039509       126
## SRR1039512 GSM1275866  N052611    untrt    untrt SRR1039512       126
## SRR1039513 GSM1275867  N052611      trt    untrt SRR1039513        87
## SRR1039516 GSM1275870  N080611    untrt    untrt SRR1039516       120
## SRR1039517 GSM1275871  N080611      trt    untrt SRR1039517       126
## SRR1039520 GSM1275874  N061011    untrt    untrt SRR1039520       101
## SRR1039521 GSM1275875  N061011      trt    untrt SRR1039521        98
##            Experiment    Sample    BioSample     fake sizeFactor
##              <factor>  <factor>     <factor> <factor>  <numeric>
## SRR1039508  SRX384345 SRS508568 SAMN02422669   groupA  1.0236476
## SRR1039509  SRX384346 SRS508567 SAMN02422675   groupA  0.8961667
## SRR1039512  SRX384349 SRS508571 SAMN02422678   groupA  1.1794861
## SRR1039513  SRX384350 SRS508572 SAMN02422670   groupA  0.6700538
## SRR1039516  SRX384353 SRS508575 SAMN02422682   groupB  1.1776714
## SRR1039517  SRX384354 SRS508576 SAMN02422673   groupB  1.3990365
## SRR1039520  SRX384357 SRS508579 SAMN02422683   groupB  0.9207787
## SRR1039521  SRX384358 SRS508580 SAMN02422677   groupB  0.9445141

Get results

Get the results for the dex treated vs untreated samples and arrange by p-value.

res <- results(dds, tidy=TRUE, contrast=c("dex", "trt", "untrt")) %>%
  arrange(padj, pvalue) %>%
  tbl_df()
res
## Source: local data frame [64,102 x 7]
## 
##                row   baseMean log2FoldChange      lfcSE     stat
##              (chr)      (dbl)          (dbl)      (dbl)    (dbl)
## 1  ENSG00000152583   997.4398       4.313962 0.17213733 25.06116
## 2  ENSG00000165995   495.0929       3.186823 0.12815654 24.86664
## 3  ENSG00000101347 12703.3871       3.618734 0.14894336 24.29604
## 4  ENSG00000120129  3409.0294       2.871488 0.11824908 24.28338
## 5  ENSG00000189221  2341.7673       3.230395 0.13667447 23.63569
## 6  ENSG00000211445 12285.6151       3.553360 0.15798211 22.49217
## 7  ENSG00000157214  3009.2632       1.948723 0.08867432 21.97618
## 8  ENSG00000162614  5393.1017       2.003487 0.09269629 21.61345
## 9  ENSG00000125148  3656.2528       2.167122 0.10354724 20.92882
## 10 ENSG00000154734 30315.1355       2.286778 0.11308412 20.22192
## ..             ...        ...            ...        ...      ...
## Variables not shown: pvalue (dbl), padj (dbl)

Now, define some genes of interest. These must be the names of the genes as in the count data. That is, your genes of interest must be %in% names(dds). These are the genes you’ll plot. You could do as many as you like. Since you’ve arranged by p-value, let’s take the top few results (9 in this case, so we make a square faceted plot).

# Define the genes of interest.
goi <- res$row[1:9]
stopifnot(all(goi %in% names(dds)))
goi
## [1] "ENSG00000152583" "ENSG00000165995" "ENSG00000101347" "ENSG00000120129"
## [5] "ENSG00000189221" "ENSG00000211445" "ENSG00000157214" "ENSG00000162614"
## [9] "ENSG00000125148"

Join & tidy

Create a tidy/transposed count matrix. Here’s a line-by-line explanation:

  1. Here, you’ll take the counts() of the dds object, normalizing it, without outlier replacement. You’ll add a half count to it, because the next thing you’ll do is log2() it, and you don’t want any -Infs. So now you have a log-transformed normalized count matrix. Now, transpose that matrix so the sample names are the row.names.
  2. You’ll now merge that thing with the colData(dds), where the sample names are also the row.names. You’ll now have a really wide data.frame with a single row for each sample, followed by all the colData, followed by a column for each of the genes of interest.
  3. Gather that up again so you now have one row per sample per gene, with all the sample info, and that gene’s expression.
tcounts <- t(log2((counts(dds[goi, ], normalized=TRUE, replaced=FALSE)+.5))) %>%
  merge(colData(dds), ., by="row.names") %>%
  gather(gene, expression, (ncol(.)-length(goi)+1):ncol(.))

Take a look.

tcounts %>% 
  select(Row.names, dex, fake, gene, expression) %>% 
  head %>% 
  knitr::kable()
Row.names dex fake gene expression
SRR1039508 untrt groupA ENSG00000152583 5.932338
SRR1039509 trt groupA ENSG00000152583 11.152831
SRR1039512 untrt groupA ENSG00000152583 6.399767
SRR1039513 trt groupA ENSG00000152583 10.772820
SRR1039516 untrt groupB ENSG00000152583 6.416389
SRR1039517 trt groupB ENSG00000152583 10.425984

Single faceted plot

Now, create a single faceted plot. You don’t have to add a fill aesthetic - I only did so as an example, using the fake variable I made up.

ggplot(tcounts, aes(dex, expression, fill=fake)) + 
  geom_boxplot() + 
  facet_wrap(~gene, scales="free_y") + 
  labs(x="Dexamethasone treatment", 
       y="Expression (log normalized counts)", 
       fill="(Some made \nup variable)", 
       title="Top Results")

Multi-page “catalog”

Or in a loop, create a multi-page PDF (text searchable) with one page per gene.

pdf("multi-ggplot2-catalog.pdf")
for (i in goi) {
  p <- ggplot(filter(tcounts, gene==i), aes(dex, expression, fill=fake)) + geom_boxplot() + ggtitle(i)
  print(p)
}
dev.off()

Improvements