I get asked all the time, “hey, could you make a boxplot of the expression values of SomeGene for me?” The plotCounts()
function in DESeq will do this, but only one gene at a time. This shows you how to do this for any arbitrary number of genes in your data (or all of them!). You can plot a small number in a single plot with facets, or you could plot a large number in a text-searchable multi-page PDF.
Load the packages you’ll use.
# CRAN
library(dplyr)
library(tidyr)
library(ggplot2)
theme_set(theme_bw(base_size=14) + theme(strip.background = element_blank()))
# Bioconductor
library(DESeq2)
library(airway)
Load the data from the airway package, and add a made up “group” variable. We’ll use this downstream when we map aesthetics onto a boxplot. Create a DESeqDataSet
and run the DESeq pipeline on it.
# Load the data
data(airway)
# Make up a fake "group" variable
colData(airway)$fake <- factor(rep(c("groupA", "groupB"), each=4))
# Make a dataset and run the pipeline
dds <- DESeqDataSet(airway, design=~cell+dex)
dds <- DESeq(dds)
Inspect the colData
.
colData(dds)
## DataFrame with 8 rows and 11 columns
## SampleName cell dex albut Run avgLength
## <factor> <factor> <factor> <factor> <factor> <integer>
## SRR1039508 GSM1275862 N61311 untrt untrt SRR1039508 126
## SRR1039509 GSM1275863 N61311 trt untrt SRR1039509 126
## SRR1039512 GSM1275866 N052611 untrt untrt SRR1039512 126
## SRR1039513 GSM1275867 N052611 trt untrt SRR1039513 87
## SRR1039516 GSM1275870 N080611 untrt untrt SRR1039516 120
## SRR1039517 GSM1275871 N080611 trt untrt SRR1039517 126
## SRR1039520 GSM1275874 N061011 untrt untrt SRR1039520 101
## SRR1039521 GSM1275875 N061011 trt untrt SRR1039521 98
## Experiment Sample BioSample fake sizeFactor
## <factor> <factor> <factor> <factor> <numeric>
## SRR1039508 SRX384345 SRS508568 SAMN02422669 groupA 1.0236476
## SRR1039509 SRX384346 SRS508567 SAMN02422675 groupA 0.8961667
## SRR1039512 SRX384349 SRS508571 SAMN02422678 groupA 1.1794861
## SRR1039513 SRX384350 SRS508572 SAMN02422670 groupA 0.6700538
## SRR1039516 SRX384353 SRS508575 SAMN02422682 groupB 1.1776714
## SRR1039517 SRX384354 SRS508576 SAMN02422673 groupB 1.3990365
## SRR1039520 SRX384357 SRS508579 SAMN02422683 groupB 0.9207787
## SRR1039521 SRX384358 SRS508580 SAMN02422677 groupB 0.9445141
Get the results for the dex treated vs untreated samples and arrange by p-value.
res <- results(dds, tidy=TRUE, contrast=c("dex", "trt", "untrt")) %>%
arrange(padj, pvalue) %>%
tbl_df()
res
## Source: local data frame [64,102 x 7]
##
## row baseMean log2FoldChange lfcSE stat
## (chr) (dbl) (dbl) (dbl) (dbl)
## 1 ENSG00000152583 997.4398 4.313962 0.17213733 25.06116
## 2 ENSG00000165995 495.0929 3.186823 0.12815654 24.86664
## 3 ENSG00000101347 12703.3871 3.618734 0.14894336 24.29604
## 4 ENSG00000120129 3409.0294 2.871488 0.11824908 24.28338
## 5 ENSG00000189221 2341.7673 3.230395 0.13667447 23.63569
## 6 ENSG00000211445 12285.6151 3.553360 0.15798211 22.49217
## 7 ENSG00000157214 3009.2632 1.948723 0.08867432 21.97618
## 8 ENSG00000162614 5393.1017 2.003487 0.09269629 21.61345
## 9 ENSG00000125148 3656.2528 2.167122 0.10354724 20.92882
## 10 ENSG00000154734 30315.1355 2.286778 0.11308412 20.22192
## .. ... ... ... ... ...
## Variables not shown: pvalue (dbl), padj (dbl)
Now, define some genes of interest. These must be the names of the genes as in the count data. That is, your genes of interest must be %in% names(dds)
. These are the genes you’ll plot. You could do as many as you like. Since you’ve arranged by p-value, let’s take the top few results (9 in this case, so we make a square faceted plot).
# Define the genes of interest.
goi <- res$row[1:9]
stopifnot(all(goi %in% names(dds)))
goi
## [1] "ENSG00000152583" "ENSG00000165995" "ENSG00000101347" "ENSG00000120129"
## [5] "ENSG00000189221" "ENSG00000211445" "ENSG00000157214" "ENSG00000162614"
## [9] "ENSG00000125148"
Create a tidy/transposed count matrix. Here’s a line-by-line explanation:
counts()
of the dds object, normalizing it, without outlier replacement. You’ll add a half count to it, because the next thing you’ll do is log2()
it, and you don’t want any -Inf
s. So now you have a log-transformed normalized count matrix. Now, transpose that matrix so the sample names are the row.names
.colData(dds)
, where the sample names are also the row.names
. You’ll now have a really wide data.frame with a single row for each sample, followed by all the colData, followed by a column for each of the genes of interest.tcounts <- t(log2((counts(dds[goi, ], normalized=TRUE, replaced=FALSE)+.5))) %>%
merge(colData(dds), ., by="row.names") %>%
gather(gene, expression, (ncol(.)-length(goi)+1):ncol(.))
Take a look.
tcounts %>%
select(Row.names, dex, fake, gene, expression) %>%
head %>%
knitr::kable()
Row.names | dex | fake | gene | expression |
---|---|---|---|---|
SRR1039508 | untrt | groupA | ENSG00000152583 | 5.932338 |
SRR1039509 | trt | groupA | ENSG00000152583 | 11.152831 |
SRR1039512 | untrt | groupA | ENSG00000152583 | 6.399767 |
SRR1039513 | trt | groupA | ENSG00000152583 | 10.772820 |
SRR1039516 | untrt | groupB | ENSG00000152583 | 6.416389 |
SRR1039517 | trt | groupB | ENSG00000152583 | 10.425984 |
Now, create a single faceted plot. You don’t have to add a fill aesthetic - I only did so as an example, using the fake variable I made up.
ggplot(tcounts, aes(dex, expression, fill=fake)) +
geom_boxplot() +
facet_wrap(~gene, scales="free_y") +
labs(x="Dexamethasone treatment",
y="Expression (log normalized counts)",
fill="(Some made \nup variable)",
title="Top Results")
Or in a loop, create a multi-page PDF (text searchable) with one page per gene.
pdf("multi-ggplot2-catalog.pdf")
for (i in goi) {
p <- ggplot(filter(tcounts, gene==i), aes(dex, expression, fill=fake)) + geom_boxplot() + ggtitle(i)
print(p)
}
dev.off()