Homework 02

In this assignment you will work with data originally published in “A comprehensive comparison of RNA-Seq-based transcriptome analysis from reads to differential gene expression and cross-comparison with microarrays: a case study in Saccharomyces cerevisiae.” by Nookaew et al (Nucleic Acids Res. 2012 Nov 1;40(20):10084-97. PMID 22965124). The article is available here: doi: 10.1093/nar/gks804.

The authors used two different platforms – microarrays and RNA-Seq – to obtain gene expression data for yeast grown in two conditions: batch medium and chemostat. They then compared the results of differential expression analysis (DEA) across platforms and under various combinations of aligners and methods for identifying differential expression. We will do the same. Some of the DEA methods covered in this class were included in their analysis, edgeR and DESeq, while others were not, such as limma and, in particular, the voom function.

We will work with their data obtained from the NCBI SRA and GEO repositories. Because of some differences in pre-processing your results will differ from theirs.

Evaluation

The assignment has 25 points and represents 25% of your overall course grade. Overall presentation and mechanics is worth 5 points, with full points awarded if it's exceptional. The remaining 20 points are spread among the specific questions, as detailed below. We will try to give partial credit when/if we can figure out what you're trying to do.

Each student must submit their own work: no group efforts. It's fine to talk to fellow students and have discussion on the Google Group. If someone is really helpful, give them some credit. You have definitely crossed the line if there's any copying and pasting of code or wording.

Late policy

Late submission will cost 20% (5 points) per day or fraction thereof.

If you expect that you will be late on the deadline, contact us before missing it, and submit whatever partial work you have.

How to submit the work

See the homework submission instructions.

Coaching

Some helpful tips can be found in this coaching document.

# Load required R packages:
library(hexbin)
library(gplots)
library(RColorBrewer)
library(limma)
library(reshape2)
library(lattice)
library(yeast2.db)
library(edgeR)
library(DESeq)
library(VennDiagram)
library(ggplot2)
# Gray Scale from `RColorBrewer`
Grays <- colorRampPalette(rev(brewer.pal(n = 9, "Greys")))

Your mission

Q1) Microarray Analysis

The six samples in this study, 3 replicates for each of the two conditions, were analysed on the Affymetrix Yeast Genome Array 2.0 platform. We have already downloaded the raw CEL files from GEO and normalised them. The normalized data is saved in the file GSE37599-data.tsv.

a) (1pt) Load Microarray Data

Load the normalized data.

array.dat <- read.table("GSE37599-data.tsv", header = TRUE, row.names = 1)

What are dimensions of the dataset?

ncol(array.dat)
## [1] 6
nrow(array.dat)
## [1] 10928

In addition to reporting number of rows and columns, make it clear what rows and columns represent and how you're interpreting column names.

colnames(array.dat)
## [1] "b1" "b2" "b3" "c1" "c2" "c3"
rownames(array.dat)[1:10]
##  [1] "1769308_at"   "1769309_at"   "1769310_at"   "1769311_at"  
##  [5] "1769312_at"   "1769313_at"   "1769314_at"   "1769315_at"  
##  [9] "1769316_s_at" "1769317_at"

b) (2pt) Identify Sample Swap

The labels on two of the samples have been swapped, that is one of the batch samples has been labelled as chemostat and vice-versa. Produce the plots described below and explain how they allow you to identify the swapped samples.

i. (High volume) scatter plot matrix.

hexplom(array.dat, main = "High Volume Scatterplot Matrix")

plot of chunk unnamed-chunk-6

ii. A heatmap of the first 100 genes (you can try more but it gets slow).

heatmap.2(t(as.matrix(array.dat[c(1:100), ])), trace = "none", col = Grays, 
    cexCol = 1, cexRow = 1, labCol = FALSE, xlab = "Probes", main = "Heatmap of the first 100 probes")

plot of chunk unnamed-chunk-7

iii. Compute the Pearson correlation of the samples and plot the results using a heatmap.

heatmap.2(cor(array.dat, method = "pearson"), symm = T, trace = "none", col = Grays, 
    cexCol = 1, cexRow = 1, dendrogram = "row", main = "Heatmap of Pearson correlation")

plot of chunk unnamed-chunk-8

iv. Scatterplot the six data samples with respect to the first two principal components and label the samples.

xyplot(PC2 ~ PC1, data.frame(prcomp(t(array.dat))$x), panel = function(x, y, 
    ...) {
    panel.xyplot(x, y, ...)
    ltext(x = x, y = y, labels = rownames(t(array.dat)), pos = 1, offset = 1)
}, main = "First Two Principal Components")

plot of chunk unnamed-chunk-9

c) (2pt) Microarray Differential Expression

Fix the label swap identified in question 1b. We want to swap b1 <–> c2. Revisit one or more elements of question 1b to sanity check before proceeding.

array.dat.cor <- array.dat[, c("c2", "b2", "b3", "c1", "b1", "c3")]
# Check the column names to confirm the colume swap
colnames(array.dat.cor)
## [1] "c2" "b2" "b3" "c1" "b1" "c3"
# Rename the column to the correct names
colnames(array.dat.cor) <- c("b1", "b2", "b3", "c1", "c2", "c3")
# double check to make sure the names are correct
colnames(array.dat.cor)
## [1] "b1" "b2" "b3" "c1" "c2" "c3"
# Use high volume scatter plot matrix and Pearson correlation heatmap to do
# a sanity check
hexplom(array.dat.cor, main = "High Volume Scatterplot Matrix after swapping b1 and c2")

plot of chunk unnamed-chunk-10

heatmap.2(cor(array.dat.cor, method = "pearson"), symm = T, trace = "none", 
    col = Grays, cexCol = 1, cexRow = 1, dendrogram = "row", main = "correlation after swapping b1 and c2")

plot of chunk unnamed-chunk-10

Now use this data to do a differential expression analysis with limma.

array.des <- data.frame(Sample = c("b1", "b2", "b3", "c1", "c2", "c3"), Condition = c("Batch", 
    "Batch", "Batch", "Chemostat", "Chemostat", "Chemostat"))
array.des.mat <- model.matrix(~Condition, array.des)
arrayFit <- lmFit(array.dat.cor, array.des.mat)
arrayEbFit <- eBayes(arrayFit)
array.Hit <- topTable(arrayEbFit, coef = "ConditionChemostat", adjust.method = "BH", 
    number = Inf)

Package these results in a data frame with six columns:

gene.ids <- unlist(mget(rownames(array.Hit), yeast2ORF))
gene.ids <- as.matrix(gene.ids)

# # Sanity check before assmble the data frame tail(gene.ids) str(gene.ids,
# max.level=1) nrow(array.Hit) nrow(gene.ids) identical(rownames(array.Hit),
# rownames(gene.ids))

array.DEA.NA <- data.frame(probe.id = rownames(array.Hit), gene.id = gene.ids, 
    p.value = array.Hit$P.Value, q.value = array.Hit$adj.P.Val, log.fc = array.Hit$logFC, 
    test.stat = array.Hit$t, row.names = NULL)
tail(array.DEA.NA)
##         probe.id gene.id p.value q.value     log.fc  test.stat
## 10923 1777665_at YOR368W  0.9975  0.9980  2.171e-04  3.190e-03
## 10924 1779984_at    <NA>  0.9979  0.9983 -4.025e-04 -2.730e-03
## 10925 1769963_at    <NA>  0.9981  0.9984  3.349e-04  2.419e-03
## 10926 1770373_at    <NA>  0.9982  0.9984  4.174e-04  2.317e-03
## 10927 1776546_at YDL037C  1.0000  1.0000  7.109e-07  7.469e-06
## 10928 1769489_at    <NA>  1.0000  1.0000  3.207e-15  3.864e-14
## Check the resulting data frame identical(as.factor(rownames(array.Hit)),
## array.DEA.NA$probe.id) identical(array.DEA.NA$probe.id,
## as.factor(rownames(gene.ids))) tail(array.DEA.NA)

Remove any rows with probes which don't map to genes. You'll be able to find these because they will have NA as their gene id. Work with this data.frame to answer the questions below.

array.results <- array.DEA.NA[complete.cases(array.DEA.NA[, "gene.id"]), ]
rownames(array.results) <- NULL
tail(array.results)
##        probe.id gene.id p.value q.value     log.fc  test.stat
## 5700 1769541_at YPR042C  0.9883  0.9915 -1.261e-03 -1.520e-02
## 5701 1776332_at YLR403W  0.9913  0.9937  8.169e-04  1.125e-02
## 5702 1777350_at YAL060W  0.9944  0.9956  7.173e-04  7.241e-03
## 5703 1771787_at YBR165W  0.9962  0.9970 -3.353e-04 -4.910e-03
## 5704 1777665_at YOR368W  0.9975  0.9980  2.171e-04  3.190e-03
## 5705 1776546_at YDL037C  1.0000  1.0000  7.109e-07  7.469e-06

i. How many probes did we start with and how many remain after removing probes without gene ids?

ii. Illustrate the differential expression between the batch and the chemostat samples for the top hit (i.e., probe with the lowest p- or q-value).

Hit12 <- cbind(as.data.frame(t(array.dat.cor))[, array.results$probe.id[1], 
    drop = FALSE], array.des)
Hit12 <- melt(Hit12, id.vars = c("Sample", "Condition"), variable.name = "Probe", 
    value.name = "Expression")
stripplot(Expression ~ Condition | Probe, Hit12, jitter.data = TRUE, auto.key = TRUE, 
    type = c("p", "a"), grid = TRUE)

plot of chunk unnamed-chunk-14

iii. How many probes are identified as differentially expressed at a false discovery rate (FDR) of 1e-5 (note: this is a FDR cutoff used in the original paper)?

nrow(subset(array.results, subset = q.value <= 1e-05))
## [1] 725

iv. Save your results for later with write.table().

write.table(array.results, file = "GSE37599-table.tsv", sep = "\t", quote = FALSE, 
    row.names = TRUE, col.names = NA)

Q2) RNA-Seq Analysis

We have aligned the RNA-Seq library using the Stampy aligner and generated count data. The data file is available as stampy.counts.tsv. In this question you will use this data to do a differential expression analysis using different packages from Bioconductor.

a) (1pt) Load RNA Count Data and Sanity Check

Load the count data using read.table; you will need to pass the arguments header=TRUE and row.names=1.

rna.dat <- read.table("stampy.counts.tsv", header = TRUE, row.names = 1)

i) What are dimensions of the dataset?

ncol(rna.dat)
## [1] 6
nrow(rna.dat)
## [1] 6542

In addition to reporting number of rows and columns, make it clear what rows and columns represent. What is the difference between the rows of this dataset versus rows of the array data in question 1a?

colnames(rna.dat)
## [1] "b1" "b2" "b3" "c1" "c2" "c3"
rownames(rna.dat)[1:10]
##  [1] "15S_rRNA" "21S_rRNA" "HRA1"     "ICR1"     "LSR1"     "NME1"    
##  [7] "Q0045"    "Q0050"    "Q0055"    "Q0065"

ii) Do a sanity check to make sure there is no sample swap by plotting a heatmap of the sample correlations.

heatmap.2(cor(rna.dat, method = "pearson"), symm = T, trace = "none", col = Grays, 
    cexCol = 1, cexRow = 1, dendrogram = "row", main = "Pearson correlation of RNA Seq data")

plot of chunk unnamed-chunk-20

b) (2pt) edgeR Differential Expression Analysis

Now you will use edgeR to identify differentially expressed genes between the batch medium vs. chemostat conditions.

i) Recall that edgeR needs to estimate the dispersion parameter in the negative binomial model using an empirical Bayes method. Estimate the dispersion parameters using estimateGLMCommonDisp, estimateGLMTrendedDisp and estimateGLMTagwiseDisp. Plot the tagwise dispersion against log2-CPM (counts per million).

colnames(rna.dat)
## [1] "b1" "b2" "b3" "c1" "c2" "c3"
(rna.group <- factor(c(rep("1", 3), rep("2", 3))))
## [1] 1 1 1 2 2 2
## Levels: 1 2
dge.glm <- DGEList(counts = rna.dat, group = rna.group)
dge.glm[["samples"]]
##    group lib.size norm.factors
## b1     1  3420851            1
## b2     1  4487585            1
## b3     1  3415378            1
## c1     2  2529571            1
## c2     2  2990956            1
## c3     2  3695672            1
(design <- model.matrix(~rna.group))
##   (Intercept) rna.group2
## 1           1          0
## 2           1          0
## 3           1          0
## 4           1          1
## 5           1          1
## 6           1          1
## attr(,"assign")
## [1] 0 1
## attr(,"contrasts")
## attr(,"contrasts")$rna.group
## [1] "contr.treatment"
dge.glm.com.disp <- estimateGLMCommonDisp(dge.glm, design, verbose = TRUE)
## Disp = 0.00551 , BCV = 0.0742
dge.glm.trend.disp <- estimateGLMTrendedDisp(dge.glm.com.disp, design)
dge.glm.tag.disp <- estimateGLMTagwiseDisp(dge.glm.trend.disp, design)
plotBCV(dge.glm.tag.disp, main = "Biological Coefficient of Variation")

plot of chunk unnamed-chunk-21

ii) Use the glm functionality of edgeR, i.e. use the glmFit function, to identify differentially expressed genes between conditions.

fit <- glmFit(dge.glm.tag.disp, design)
colnames(coef(fit))
## [1] "(Intercept)" "rna.group2"
lrt <- glmLRT(fit, coef = "rna.group2")
topTags(lrt, n = 30, adjust.method = "BH")
## Coefficient:  rna.group2 
##          logFC logCPM   LR     PValue        FDR
## YIL057C 10.084  8.267 2191  0.000e+00  0.000e+00
## YMR175W  9.426  8.001 1877  0.000e+00  0.000e+00
## YJR095W  9.419  7.811 2247  0.000e+00  0.000e+00
## YMR107W  9.073  8.781 2627  0.000e+00  0.000e+00
## YKL217W  8.573 11.425 1689  0.000e+00  0.000e+00
## YPL201C  7.654  7.105 1951  0.000e+00  0.000e+00
## YGL205W  7.070  9.108 2691  0.000e+00  0.000e+00
## YCR010C  7.013  8.150 1809  0.000e+00  0.000e+00
## YAL054C  6.893 10.583 2542  0.000e+00  0.000e+00
## YBR067C  6.468  9.973 1657  0.000e+00  0.000e+00
## YIL160C  6.126  8.099 2224  0.000e+00  0.000e+00
## YDL085W  5.682  7.558 2049  0.000e+00  0.000e+00
## YLR312C  5.541  7.098 1537  0.000e+00  0.000e+00
## YLR174W  5.459  8.907 1683  0.000e+00  0.000e+00
## YDR256C  5.383  9.180 2305  0.000e+00  0.000e+00
## YKR009C  5.334  9.286 2095  0.000e+00  0.000e+00
## YGR067C  5.160  7.498 1539  0.000e+00  0.000e+00
## YKL187C  4.723  8.749 1630  0.000e+00  0.000e+00
## YDR384C  4.580  9.368 1897  0.000e+00  0.000e+00
## YPR001W  4.552  7.815 1532  0.000e+00  0.000e+00
## YFL030W  4.961  8.487 1482 4.941e-324 1.541e-321
## YOL126C  5.200  7.790 1471 8.251e-322 2.454e-319
## YML042W  5.176  8.624 1441 3.115e-315 8.860e-313
## YGL121C  5.677  6.643 1433 1.688e-313 4.601e-311
## YAL062W  3.983  7.253 1431 4.456e-313 1.166e-310
## YDR345C -3.925  9.956 1430 8.085e-313 2.034e-310
## YPL147W  4.109  8.039 1426 4.391e-312 1.064e-309
## YGR289C  4.183  8.113 1417 3.631e-310 8.484e-308
## YJL116C  4.390  8.286 1402 8.510e-307 1.920e-304
## YIL136W  5.058  9.838 1400 2.354e-306 5.132e-304
tt.glm <- topTags(lrt, n = Inf, adjust.method = "BH")

Package these results in a data.frame called 'edger.results' with five columns:

edger.results <- data.frame(gene.id = rownames(tt.glm$table), p.value = tt.glm$table$PValue, 
    q.value = tt.glm$table$FDR, log.fc = tt.glm$table$logFC, test.stat = tt.glm$table$LR, 
    row.names = NULL)
head(edger.results)
##   gene.id p.value q.value log.fc test.stat
## 1 YIL057C       0       0 10.084      2191
## 2 YMR175W       0       0  9.426      1877
## 3 YJR095W       0       0  9.419      2247
## 4 YMR107W       0       0  9.073      2627
## 5 YKL217W       0       0  8.573      1689
## 6 YPL201C       0       0  7.654      1951

Save your results for later with write.table() in file called stampy.edger.results.tsv.

write.table(edger.results, file = "stampy.edger.results.tsv", sep = "\t", quote = FALSE, 
    row.names = TRUE, col.names = NA)

iii) How many genes are differentially expressed between conditions at a false discovery rate (FDR) of 1e-5?

nrow(tt.glm$table[tt.glm$table$FDR < 1e-05, ])
## [1] 2669
nrow(edger.results[edger.results$q.value < 1e-05, ])
## [1] 2669

iv) How many genes are differentially over-expressed in chemostat compared to batch medium samples at a false discovery rate (FDR) of 1e-5?

summary(de.glm <- decideTestsDGE(lrt, p = 1e-05, adjust = "BH"))
##    [,1]
## -1 1154
## 0  3873
## 1  1515

c) (2pt) DESeq Differential Expression Analysis

Now you will use DESeq to identify differentially expressed genes between the batch medium vs. chemostat conditions.

i) DESeq also needs to estimate the dispersion. Use estimateSizeFactors and estimateDispersions to normalize the data. Plot the estimated dispersions against the mean normalized counts.

deSeqDat <- newCountDataSet(rna.dat, rna.group)
# head(counts(deSeqDat))
deSeqDat <- estimateSizeFactors(deSeqDat)
sizeFactors(deSeqDat)
##     b1     b2     b3     c1     c2     c3 
## 1.0211 1.2962 0.9538 0.7866 0.9004 1.1317
deSeqDat <- estimateDispersions(deSeqDat)
plotDispEsts(deSeqDat)

plot of chunk unnamed-chunk-27

ii) Use the negative binomial test of DESeq, i.e. use the nbinomTest function, to identify differentially expressed genes between conditions. Note that the output of this function does not return results ordered by p-values or logged fold-changes. You can manually reorder the results if you want (not required for this homework).

DESeq.nbT <- nbinomTest(deSeqDat, levels(rna.group)[1], levels(rna.group)[2])

Package these results in a data.frame called 'deseq.results' with four columns:

DESeq.results <- data.frame(gene.id = DESeq.nbT$id, p.value = DESeq.nbT$pval, 
    q.value = DESeq.nbT$padj, log.fc = DESeq.nbT$log2FoldChange, row.names = NULL)
head(DESeq.results)
##    gene.id  p.value  q.value  log.fc
## 1 15S_rRNA 0.130017 0.189902  1.0173
## 2 21S_rRNA 0.047610 0.077711  1.3540
## 3     HRA1 0.017147 0.030716 -0.8235
## 4     ICR1 0.068405 0.107600  0.3793
## 5     LSR1 0.010074 0.018766  0.8122
## 6     NME1 0.003772 0.007596 -1.4870

Save your results for later with write.table() in file called stampy.deseq.results.tsv.

write.table(DESeq.results, file = "stampy.deseq.results.tsv", sep = "\t", quote = FALSE, 
    row.names = TRUE, col.names = NA)

iii) How many genes are differentially expressed between conditions at a false discovery rate (FDR) of 1e-5?

nrow(DESeq.results[DESeq.results$q.value < 1e-05, ])
## [1] 2198

iv) How many differentially expressed genes are identified by both 'edgeR' and 'DESeq'?

length(intersect(edger.results[edger.results$q.value < 1e-05, "gene.id"], DESeq.results[DESeq.results$q.value < 
    1e-05, "gene.id"]))
## [1] 2176

d) (2pt) voom Differential Expression Analysis

Now you will use voom+limma to identify differentially expressed genes between the batch medium vs. chemostat conditions.

i) voom normalizes the counts before it converts counts to log2-cpm. Use calcNormFactors to normalize counts.

norm.factor <- calcNormFactors(rna.dat)

ii) Use voom' to convert count data into logged CPM data and then use 'limma' to identify differentially expressed genes between conditions.

dat.voomed <- voom(rna.dat, design, plot = TRUE, lib.size = colSums(rna.dat) * 
    norm.factor)

plot of chunk unnamed-chunk-34

fit <- lmFit(dat.voomed, design)
fit <- eBayes(fit)
voom.tt <- topTable(fit, coef = "rna.group2", adjust.method = "BH", number = Inf)

Package these results in a data.frame called 'voom.limma.results' with five columns:

voom.limma.results <- data.frame(gene.id = rownames(voom.tt), p.value = voom.tt$P.Value, 
    q.value = voom.tt$adj.P.Val, log.fc = voom.tt$logFC, test.stat = voom.tt$t, 
    row.names = NULL)
head(voom.limma.results)
##   gene.id   p.value   q.value log.fc test.stat
## 1 YAL054C 3.504e-18 2.292e-14  6.822     67.85
## 2 YDR384C 2.665e-17 7.179e-14  4.509     58.15
## 3 YDR345C 4.390e-17 7.179e-14 -3.991    -55.99
## 4 YDR256C 3.515e-17 7.179e-14  5.311     56.94
## 5 YKR009C 8.939e-17 1.170e-13  5.264     53.04
## 6 YIL155C 3.128e-16 2.926e-13  4.798     48.22

Save your results for later with write.table() in file called stampy.limma.results.tsv.

write.table(voom.limma.results, file = "stampy.limma.results.tsv", sep = "\t", 
    quote = FALSE, row.names = TRUE, col.names = NA)

iii) How many genes are differentially expressed between conditions at a false discovery rate (FDR) of 1e-5?

nrow(voom.limma.results[voom.limma.results$q.value < 1e-05, ])
## [1] 1794

iv) What fraction of the genes identified using voom+limma are also found by edger and DESeq methods? For example if the DE analysis using voom+limma found 1000 genes and both edgeR and DESeq found 500 of these, the fraction of genes found would be \( \frac{500}{1000}=0.5 \).

edger.de.genes <- as.character(edger.results[edger.results$q.value < 1e-05, 
    "gene.id"])
DESeq.de.genes <- as.character(DESeq.results[DESeq.results$q.value < 1e-05, 
    "gene.id"])
voom.limma.de.genes <- as.character(voom.limma.results[voom.limma.results$q.value < 
    1e-05, "gene.id"])

(edfer.DESeq.fraction <- round(length(intersect(intersect(voom.limma.de.genes, 
    edger.de.genes), intersect(voom.limma.de.genes, DESeq.de.genes)))/length(voom.limma.de.genes), 
    digits = 4))
## [1] 0.9989

e) (3pt) Comparison of Differential Expression Analyses

Now that we have the results of the differential expression analysis performed by three popular methods, we are going to compare and illustrate the results.

i) In previous questions, we noticed that different methods identified different differentially expressed genes. Create a Venn diagram showing all genes identified as differentially expressed by edgeR, DESeq, and voom+limma. Check your if your answers to questions 2c-iv, and 2d-iv are correct.

plot.new()
grid.draw(venn.diagram(list(edgeR = edger.de.genes, DESeq = DESeq.de.genes, 
    voom.limma = voom.limma.de.genes), filename = NULL, fill = c("blue", "orange", 
    "green")))

plot of chunk unnamed-chunk-39

ii) Using the function plotSmear function from edgeR, you can look at a scatterplot of observed differential expression (y-axis) against overall abundance (x-axis), both axes logarithmically transformed – to check that putative DE genes seem plausible. Create a smear plot. Within this plot, identify the set of genes which are differentially expressed at an FDR of 1e-5 using all three methods (i.e., the q-values estimated by edgeR, DESeq, and voom are below 1e-5). Explain how you interpret this plot. Do you see reasonable results?

DEall <- intersect(intersect(edger.results[edger.results$q.value < 1e-05, "gene.id"], 
    DESeq.results[DESeq.results$q.value < 1e-05, "gene.id"]), voom.limma.results[voom.limma.results$q.value < 
    1e-05, "gene.id"])
plotSmear(lrt, de.tags = DEall, main = "Smear plot of differentially expressed genes")
abline(h = c(-0.5, 0.5), col = "blue")

plot of chunk unnamed-chunk-40

iii) There are two genes identified by edgeR and voom+limma but not by DESeq. Illustrate the logged counts of them. Compare the (log) counts of these two genes with those of two genes identified by the three methods (see example below)

(featureCounts <- rna.dat[setdiff(voom.limma.results[voom.limma.results$q.value < 
    1e-05, "gene.id"], DEall), ])
##           b1   b2   b3  c1  c2  c3
## YMR058W 1752 4275 1787 497 245 263
## YPL271W  181  161  130 364 293 351
featureDat <- data.frame(gene.id = factor(rep(rownames(featureCounts), ncol(featureCounts))), 
    cond = factor(rep(c("batch", "batch", "batch", "chemostat", "chemostat", 
        "chemostat"), each = nrow(featureCounts))), log.count = log2(unlist(featureCounts)))
stripplot(gene.id ~ log.count, featureDat, groups = cond, auto.key = TRUE, jitter = TRUE, 
    main = "genes identified by DESeq only")

plot of chunk unnamed-chunk-41

plot of chunk two-de-gene-demo

Q3) Compare DEA results between RNA-Seq and array

In question 1, you performed a DEA of array data using limma. In question 2a, you performed a DEA of RNA-Seq data using edgeR, among other methods. In this question you will compare the results of those two analyses.

i) Use a Venn diagram to display the overlap and non-overlap of the genes identified as differentially expressed at an FDR of 1e-5 by these analyses.

Note that the number of probes that you identified as differentially expressed at an FDR of 1e-5 in question 1c(iii) may be different from the number of genes identified as differentially expressed at the same FDR! Why? The use of the argument force.unique of venn.diagram() is crucial here.

edger.de.genes <- as.character(edger.results[edger.results$q.value < 1e-05, 
    "gene.id"])
array.de.genes <- as.character(array.results[array.results$q.value < 1e-05, 
    "gene.id"])
plot.new()
grid.draw(venn.diagram(list(RNA_Seq = edger.de.genes, MicroArray = array.de.genes), 
    filename = NULL, fill = c("red", "blue"), force.unique = TRUE))

plot of chunk unnamed-chunk-42

ii) As expected, more genes were identified as differentially expressed using RNA-Seq data. In this question, you will examine the difference between the q-values from both analyses by overlaying density plots of the q-values from each analysis.

de.merged.ture <- merge(array.results, edger.results, all = TRUE, by = "gene.id", 
    suffixes = c(".array", ".rnaseq"))
de.merged.false <- merge(array.results, edger.results, all = FALSE, by = "gene.id", 
    suffixes = c(".array", ".rnaseq"))
densityplot(~q.value.array + q.value.rnaseq, de.merged.ture, plot.points = FALSE, 
    auto.key = list(text = c("MicroArray", "RNA-Seq")), main = "Density plot of q-values (Either)", 
    xlab = "q-value", n = 200, bw = 0.02)

plot of chunk unnamed-chunk-43

densityplot(~q.value.array + q.value.rnaseq, de.merged.false, plot.points = FALSE, 
    auto.key = list(text = c("MicroArray", "RNA-Seq")), main = "Density plot of q-values (Both)", 
    xlab = "q-value", n = 200, bw = 0.02)

plot of chunk unnamed-chunk-43

Make some observations about the strengths of these two platforms.

iii) We provide a data set with array expression and count data for 5 interesting genes; below is also code to load it and a figure depicting it. Consult the DEA results from your previous analyses for these genes. For each gene, state its status with respect to these analyses, i.e. where it falls in those Venn diagrams. Comment on the results and plots.

plot of chunk stripplot-five-interesting-genes

interestinggenes <- c("YGL209W", "YCL042W", "YBL025W", "YDR384C", "YDR345C")

both <- intersect(edger.results[edger.results$q.value < 1e-05, "gene.id"], array.results[array.results$q.value < 
    1e-05, "gene.id"])
rnaseqOnly <- setdiff(edger.results[edger.results$q.value < 1e-05, "gene.id"], 
    both)
arrayOnly <- setdiff(array.results[array.results$q.value < 1e-05, "gene.id"], 
    both)
(subset(de.merged.false, gene.id %in% interestinggenes, select = grep("gene.id|q.value", 
    colnames(de.merged.false))))
##      gene.id q.value.array q.value.rnaseq
## 120  YBL025W     2.104e-01      6.416e-01
## 526  YCL042W     2.851e-01      2.975e-16
## 1186 YDR345C     4.155e-08     2.034e-310
## 1226 YDR384C     1.012e-07      0.000e+00
## 1939 YGL209W     1.912e-07      4.861e-01

(intersect(interestinggenes, both))
## [1] "YDR384C" "YDR345C"
(intersect(interestinggenes, rnaseqOnly))
## [1] "YCL042W"
(intersect(interestinggenes, arrayOnly))
## [1] "YGL209W"
(setdiff(interestinggenes, c(both, rnaseqOnly, arrayOnly)))
## [1] "YBL025W"