In this assignment you will work with data originally published in “A comprehensive comparison of RNA-Seq-based transcriptome analysis from reads to differential gene expression and cross-comparison with microarrays: a case study in Saccharomyces cerevisiae.” by Nookaew et al (Nucleic Acids Res. 2012 Nov 1;40(20):10084-97. PMID 22965124). The article is available here: doi: 10.1093/nar/gks804.
The authors used two different platforms – microarrays and RNA-Seq – to obtain gene expression data for yeast grown in two conditions: batch medium and chemostat. They then compared the results of differential expression analysis (DEA) across platforms and under various combinations of aligners and methods for identifying differential expression. We will do the same. Some of the DEA methods covered in this class were included in their analysis, edgeR and DESeq, while others were not, such as limma and, in particular, the voom function.
We will work with their data obtained from the NCBI SRA and GEO repositories. Because of some differences in pre-processing your results will differ from theirs.
The assignment has 25 points and represents 25% of your overall course grade. Overall presentation and mechanics is worth 5 points, with full points awarded if it's exceptional. The remaining 20 points are spread among the specific questions, as detailed below. We will try to give partial credit when/if we can figure out what you're trying to do.
Each student must submit their own work: no group efforts. It's fine to talk to fellow students and have discussion on the Google Group. If someone is really helpful, give them some credit. You have definitely crossed the line if there's any copying and pasting of code or wording.
Late submission will cost 20% (5 points) per day or fraction thereof.
If you expect that you will be late on the deadline, contact us before missing it, and submit whatever partial work you have.
See the homework submission instructions.
Some helpful tips can be found in this coaching document.
# Load required R packages:
library(hexbin)
library(gplots)
library(RColorBrewer)
library(limma)
library(reshape2)
library(lattice)
library(yeast2.db)
library(edgeR)
library(DESeq)
library(VennDiagram)
library(ggplot2)
# Gray Scale from `RColorBrewer`
Grays <- colorRampPalette(rev(brewer.pal(n = 9, "Greys")))
The six samples in this study, 3 replicates for each of the two conditions, were analysed on the Affymetrix Yeast Genome Array 2.0 platform. We have already downloaded the raw CEL files from GEO and normalised them. The normalized data is saved in the file GSE37599-data.tsv.
array.dat <- read.table("GSE37599-data.tsv", header = TRUE, row.names = 1)
ncol(array.dat)
## [1] 6
nrow(array.dat)
## [1] 10928
colnames(array.dat)
## [1] "b1" "b2" "b3" "c1" "c2" "c3"
rownames(array.dat)[1:10]
## [1] "1769308_at" "1769309_at" "1769310_at" "1769311_at"
## [5] "1769312_at" "1769313_at" "1769314_at" "1769315_at"
## [9] "1769316_s_at" "1769317_at"
The labels on two of the samples have been swapped, that is one of the batch samples has been labelled as chemostat and vice-versa. Produce the plots described below and explain how they allow you to identify the swapped samples.
hexplom(array.dat, main = "High Volume Scatterplot Matrix")
heatmap.2(t(as.matrix(array.dat[c(1:100), ])), trace = "none", col = Grays,
cexCol = 1, cexRow = 1, labCol = FALSE, xlab = "Probes", main = "Heatmap of the first 100 probes")
heatmap.2(cor(array.dat, method = "pearson"), symm = T, trace = "none", col = Grays,
cexCol = 1, cexRow = 1, dendrogram = "row", main = "Heatmap of Pearson correlation")
xyplot(PC2 ~ PC1, data.frame(prcomp(t(array.dat))$x), panel = function(x, y,
...) {
panel.xyplot(x, y, ...)
ltext(x = x, y = y, labels = rownames(t(array.dat)), pos = 1, offset = 1)
}, main = "First Two Principal Components")
array.dat.cor <- array.dat[, c("c2", "b2", "b3", "c1", "b1", "c3")]
# Check the column names to confirm the colume swap
colnames(array.dat.cor)
## [1] "c2" "b2" "b3" "c1" "b1" "c3"
# Rename the column to the correct names
colnames(array.dat.cor) <- c("b1", "b2", "b3", "c1", "c2", "c3")
# double check to make sure the names are correct
colnames(array.dat.cor)
## [1] "b1" "b2" "b3" "c1" "c2" "c3"
# Use high volume scatter plot matrix and Pearson correlation heatmap to do
# a sanity check
hexplom(array.dat.cor, main = "High Volume Scatterplot Matrix after swapping b1 and c2")
heatmap.2(cor(array.dat.cor, method = "pearson"), symm = T, trace = "none",
col = Grays, cexCol = 1, cexRow = 1, dendrogram = "row", main = "correlation after swapping b1 and c2")
limma.array.des <- data.frame(Sample = c("b1", "b2", "b3", "c1", "c2", "c3"), Condition = c("Batch",
"Batch", "Batch", "Chemostat", "Chemostat", "Chemostat"))
array.des.mat <- model.matrix(~Condition, array.des)
arrayFit <- lmFit(array.dat.cor, array.des.mat)
arrayEbFit <- eBayes(arrayFit)
array.Hit <- topTable(arrayEbFit, coef = "ConditionChemostat", adjust.method = "BH",
number = Inf)
probe.id - The array probe id.
gene.id - The id of the gene which the probe overlaps (see below).
p.value - The raw p-value for the probe.
q.value - The BH corrected p-value, aka the q-value.
log.fc - The log fold change which is the column called “logFC” in the limma results table.
test.stat - The test statistics which for limma is the moderated t statistic. This is the column called “t” in the limma results table.
gene.ids <- unlist(mget(rownames(array.Hit), yeast2ORF))
gene.ids <- as.matrix(gene.ids)
# # Sanity check before assmble the data frame tail(gene.ids) str(gene.ids,
# max.level=1) nrow(array.Hit) nrow(gene.ids) identical(rownames(array.Hit),
# rownames(gene.ids))
array.DEA.NA <- data.frame(probe.id = rownames(array.Hit), gene.id = gene.ids,
p.value = array.Hit$P.Value, q.value = array.Hit$adj.P.Val, log.fc = array.Hit$logFC,
test.stat = array.Hit$t, row.names = NULL)
tail(array.DEA.NA)
## probe.id gene.id p.value q.value log.fc test.stat
## 10923 1777665_at YOR368W 0.9975 0.9980 2.171e-04 3.190e-03
## 10924 1779984_at <NA> 0.9979 0.9983 -4.025e-04 -2.730e-03
## 10925 1769963_at <NA> 0.9981 0.9984 3.349e-04 2.419e-03
## 10926 1770373_at <NA> 0.9982 0.9984 4.174e-04 2.317e-03
## 10927 1776546_at YDL037C 1.0000 1.0000 7.109e-07 7.469e-06
## 10928 1769489_at <NA> 1.0000 1.0000 3.207e-15 3.864e-14
## Check the resulting data frame identical(as.factor(rownames(array.Hit)),
## array.DEA.NA$probe.id) identical(array.DEA.NA$probe.id,
## as.factor(rownames(gene.ids))) tail(array.DEA.NA)
NA as their gene id. Work with this data.frame to answer the questions below.array.results <- array.DEA.NA[complete.cases(array.DEA.NA[, "gene.id"]), ]
rownames(array.results) <- NULL
tail(array.results)
## probe.id gene.id p.value q.value log.fc test.stat
## 5700 1769541_at YPR042C 0.9883 0.9915 -1.261e-03 -1.520e-02
## 5701 1776332_at YLR403W 0.9913 0.9937 8.169e-04 1.125e-02
## 5702 1777350_at YAL060W 0.9944 0.9956 7.173e-04 7.241e-03
## 5703 1771787_at YBR165W 0.9962 0.9970 -3.353e-04 -4.910e-03
## 5704 1777665_at YOR368W 0.9975 0.9980 2.171e-04 3.190e-03
## 5705 1776546_at YDL037C 1.0000 1.0000 7.109e-07 7.469e-06
Hit12 <- cbind(as.data.frame(t(array.dat.cor))[, array.results$probe.id[1],
drop = FALSE], array.des)
Hit12 <- melt(Hit12, id.vars = c("Sample", "Condition"), variable.name = "Probe",
value.name = "Expression")
stripplot(Expression ~ Condition | Probe, Hit12, jitter.data = TRUE, auto.key = TRUE,
type = c("p", "a"), grid = TRUE)
nrow(subset(array.results, subset = q.value <= 1e-05))
## [1] 725
write.table().write.table(array.results, file = "GSE37599-table.tsv", sep = "\t", quote = FALSE,
row.names = TRUE, col.names = NA)
We have aligned the RNA-Seq library using the Stampy aligner and generated count data. The data file is available as stampy.counts.tsv. In this question you will use this data to do a differential expression analysis using different packages from Bioconductor.
Load the count data using read.table; you will need to pass the arguments header=TRUE and row.names=1.
rna.dat <- read.table("stampy.counts.tsv", header = TRUE, row.names = 1)
ncol(rna.dat)
## [1] 6
nrow(rna.dat)
## [1] 6542
colnames(rna.dat)
## [1] "b1" "b2" "b3" "c1" "c2" "c3"
rownames(rna.dat)[1:10]
## [1] "15S_rRNA" "21S_rRNA" "HRA1" "ICR1" "LSR1" "NME1"
## [7] "Q0045" "Q0050" "Q0055" "Q0065"
The columes represents different sampels; there are two groups of samples - batch culture and chemostat culture, and each group contains 3 replicates
The row represents different transcripts that are aligned to the reference genome - there are total 6542 transcripts that are aligned.
In the mircroarray dataset, the rows represent probes on the chip - not the actual gene.
heatmap.2(cor(rna.dat, method = "pearson"), symm = T, trace = "none", col = Grays,
cexCol = 1, cexRow = 1, dendrogram = "row", main = "Pearson correlation of RNA Seq data")
edgeR Differential Expression AnalysisNow you will use edgeR to identify differentially expressed genes between the batch medium vs. chemostat conditions.
edgeR needs to estimate the dispersion parameter in the negative binomial model using an empirical Bayes method. Estimate the dispersion parameters using estimateGLMCommonDisp, estimateGLMTrendedDisp and estimateGLMTagwiseDisp. Plot the tagwise dispersion against log2-CPM (counts per million).colnames(rna.dat)
## [1] "b1" "b2" "b3" "c1" "c2" "c3"
(rna.group <- factor(c(rep("1", 3), rep("2", 3))))
## [1] 1 1 1 2 2 2
## Levels: 1 2
dge.glm <- DGEList(counts = rna.dat, group = rna.group)
dge.glm[["samples"]]
## group lib.size norm.factors
## b1 1 3420851 1
## b2 1 4487585 1
## b3 1 3415378 1
## c1 2 2529571 1
## c2 2 2990956 1
## c3 2 3695672 1
(design <- model.matrix(~rna.group))
## (Intercept) rna.group2
## 1 1 0
## 2 1 0
## 3 1 0
## 4 1 1
## 5 1 1
## 6 1 1
## attr(,"assign")
## [1] 0 1
## attr(,"contrasts")
## attr(,"contrasts")$rna.group
## [1] "contr.treatment"
dge.glm.com.disp <- estimateGLMCommonDisp(dge.glm, design, verbose = TRUE)
## Disp = 0.00551 , BCV = 0.0742
dge.glm.trend.disp <- estimateGLMTrendedDisp(dge.glm.com.disp, design)
dge.glm.tag.disp <- estimateGLMTagwiseDisp(dge.glm.trend.disp, design)
plotBCV(dge.glm.tag.disp, main = "Biological Coefficient of Variation")
edgeR, i.e. use the glmFit function, to identify differentially expressed genes between conditions.fit <- glmFit(dge.glm.tag.disp, design)
colnames(coef(fit))
## [1] "(Intercept)" "rna.group2"
lrt <- glmLRT(fit, coef = "rna.group2")
topTags(lrt, n = 30, adjust.method = "BH")
## Coefficient: rna.group2
## logFC logCPM LR PValue FDR
## YIL057C 10.084 8.267 2191 0.000e+00 0.000e+00
## YMR175W 9.426 8.001 1877 0.000e+00 0.000e+00
## YJR095W 9.419 7.811 2247 0.000e+00 0.000e+00
## YMR107W 9.073 8.781 2627 0.000e+00 0.000e+00
## YKL217W 8.573 11.425 1689 0.000e+00 0.000e+00
## YPL201C 7.654 7.105 1951 0.000e+00 0.000e+00
## YGL205W 7.070 9.108 2691 0.000e+00 0.000e+00
## YCR010C 7.013 8.150 1809 0.000e+00 0.000e+00
## YAL054C 6.893 10.583 2542 0.000e+00 0.000e+00
## YBR067C 6.468 9.973 1657 0.000e+00 0.000e+00
## YIL160C 6.126 8.099 2224 0.000e+00 0.000e+00
## YDL085W 5.682 7.558 2049 0.000e+00 0.000e+00
## YLR312C 5.541 7.098 1537 0.000e+00 0.000e+00
## YLR174W 5.459 8.907 1683 0.000e+00 0.000e+00
## YDR256C 5.383 9.180 2305 0.000e+00 0.000e+00
## YKR009C 5.334 9.286 2095 0.000e+00 0.000e+00
## YGR067C 5.160 7.498 1539 0.000e+00 0.000e+00
## YKL187C 4.723 8.749 1630 0.000e+00 0.000e+00
## YDR384C 4.580 9.368 1897 0.000e+00 0.000e+00
## YPR001W 4.552 7.815 1532 0.000e+00 0.000e+00
## YFL030W 4.961 8.487 1482 4.941e-324 1.541e-321
## YOL126C 5.200 7.790 1471 8.251e-322 2.454e-319
## YML042W 5.176 8.624 1441 3.115e-315 8.860e-313
## YGL121C 5.677 6.643 1433 1.688e-313 4.601e-311
## YAL062W 3.983 7.253 1431 4.456e-313 1.166e-310
## YDR345C -3.925 9.956 1430 8.085e-313 2.034e-310
## YPL147W 4.109 8.039 1426 4.391e-312 1.064e-309
## YGR289C 4.183 8.113 1417 3.631e-310 8.484e-308
## YJL116C 4.390 8.286 1402 8.510e-307 1.920e-304
## YIL136W 5.058 9.838 1400 2.354e-306 5.132e-304
tt.glm <- topTags(lrt, n = Inf, adjust.method = "BH")
gene.id - The id of the gene which reads were aligned to.
p.value - The raw p-value for the gene.
q.value - The BH corrected p-value, aka the q-value.
log.fc - The log fold change which is the column called “logFC” in the edgeR results table.
test.stat - The test statistic, which for edgeR is a likelihood ratio. This is the column called “LR” in the edgeR results table.
edger.results <- data.frame(gene.id = rownames(tt.glm$table), p.value = tt.glm$table$PValue,
q.value = tt.glm$table$FDR, log.fc = tt.glm$table$logFC, test.stat = tt.glm$table$LR,
row.names = NULL)
head(edger.results)
## gene.id p.value q.value log.fc test.stat
## 1 YIL057C 0 0 10.084 2191
## 2 YMR175W 0 0 9.426 1877
## 3 YJR095W 0 0 9.419 2247
## 4 YMR107W 0 0 9.073 2627
## 5 YKL217W 0 0 8.573 1689
## 6 YPL201C 0 0 7.654 1951
write.table() in file called stampy.edger.results.tsv.write.table(edger.results, file = "stampy.edger.results.tsv", sep = "\t", quote = FALSE,
row.names = TRUE, col.names = NA)
nrow(tt.glm$table[tt.glm$table$FDR < 1e-05, ])
## [1] 2669
nrow(edger.results[edger.results$q.value < 1e-05, ])
## [1] 2669
summary(de.glm <- decideTestsDGE(lrt, p = 1e-05, adjust = "BH"))
## [,1]
## -1 1154
## 0 3873
## 1 1515
DESeq Differential Expression AnalysisNow you will use DESeq to identify differentially expressed genes between the batch medium vs. chemostat conditions.
DESeq also needs to estimate the dispersion. Use estimateSizeFactors and estimateDispersions to normalize the data. Plot the estimated dispersions against the mean normalized counts.deSeqDat <- newCountDataSet(rna.dat, rna.group)
# head(counts(deSeqDat))
deSeqDat <- estimateSizeFactors(deSeqDat)
sizeFactors(deSeqDat)
## b1 b2 b3 c1 c2 c3
## 1.0211 1.2962 0.9538 0.7866 0.9004 1.1317
deSeqDat <- estimateDispersions(deSeqDat)
plotDispEsts(deSeqDat)
DESeq, i.e. use the nbinomTest function, to identify differentially expressed genes between conditions. Note that the output of this function does not return results ordered by p-values or logged fold-changes. You can manually reorder the results if you want (not required for this homework).DESeq.nbT <- nbinomTest(deSeqDat, levels(rna.group)[1], levels(rna.group)[2])
gene.id - The id of the gene which reads were aligned to.
p.value - The raw p-value for the gene.
q.value - The BH corrected p-value, aka the q-value.
log.fc - The log fold change which is the column called “logFC” in the edgeR results table.
DESeq.results <- data.frame(gene.id = DESeq.nbT$id, p.value = DESeq.nbT$pval,
q.value = DESeq.nbT$padj, log.fc = DESeq.nbT$log2FoldChange, row.names = NULL)
head(DESeq.results)
## gene.id p.value q.value log.fc
## 1 15S_rRNA 0.130017 0.189902 1.0173
## 2 21S_rRNA 0.047610 0.077711 1.3540
## 3 HRA1 0.017147 0.030716 -0.8235
## 4 ICR1 0.068405 0.107600 0.3793
## 5 LSR1 0.010074 0.018766 0.8122
## 6 NME1 0.003772 0.007596 -1.4870
write.table() in file called stampy.deseq.results.tsv.write.table(DESeq.results, file = "stampy.deseq.results.tsv", sep = "\t", quote = FALSE,
row.names = TRUE, col.names = NA)
nrow(DESeq.results[DESeq.results$q.value < 1e-05, ])
## [1] 2198
length(intersect(edger.results[edger.results$q.value < 1e-05, "gene.id"], DESeq.results[DESeq.results$q.value <
1e-05, "gene.id"]))
## [1] 2176
voom Differential Expression AnalysisNow you will use voom+limma to identify differentially expressed genes between the batch medium vs. chemostat conditions.
voom normalizes the counts before it converts counts to log2-cpm. Use calcNormFactors to normalize counts.norm.factor <- calcNormFactors(rna.dat)
dat.voomed <- voom(rna.dat, design, plot = TRUE, lib.size = colSums(rna.dat) *
norm.factor)
fit <- lmFit(dat.voomed, design)
fit <- eBayes(fit)
voom.tt <- topTable(fit, coef = "rna.group2", adjust.method = "BH", number = Inf)
gene.id - The id of the gene which reads were aligned to.
p.value - The raw p-value for the gene.
q.value - The BH corrected p-value, aka the q-value.
log.fc - The log fold change which is the column called “logFC” in the limma results table.
test.stat - The test statistic, which is the column called “t”.
voom.limma.results <- data.frame(gene.id = rownames(voom.tt), p.value = voom.tt$P.Value,
q.value = voom.tt$adj.P.Val, log.fc = voom.tt$logFC, test.stat = voom.tt$t,
row.names = NULL)
head(voom.limma.results)
## gene.id p.value q.value log.fc test.stat
## 1 YAL054C 3.504e-18 2.292e-14 6.822 67.85
## 2 YDR384C 2.665e-17 7.179e-14 4.509 58.15
## 3 YDR345C 4.390e-17 7.179e-14 -3.991 -55.99
## 4 YDR256C 3.515e-17 7.179e-14 5.311 56.94
## 5 YKR009C 8.939e-17 1.170e-13 5.264 53.04
## 6 YIL155C 3.128e-16 2.926e-13 4.798 48.22
write.table() in file called stampy.limma.results.tsv.write.table(voom.limma.results, file = "stampy.limma.results.tsv", sep = "\t",
quote = FALSE, row.names = TRUE, col.names = NA)
nrow(voom.limma.results[voom.limma.results$q.value < 1e-05, ])
## [1] 1794
voom+limma are also found by edger and DESeq methods? For example if the DE analysis using voom+limma found 1000 genes and both edgeR and DESeq found 500 of these, the fraction of genes found would be \( \frac{500}{1000}=0.5 \).edger.de.genes <- as.character(edger.results[edger.results$q.value < 1e-05,
"gene.id"])
DESeq.de.genes <- as.character(DESeq.results[DESeq.results$q.value < 1e-05,
"gene.id"])
voom.limma.de.genes <- as.character(voom.limma.results[voom.limma.results$q.value <
1e-05, "gene.id"])
(edfer.DESeq.fraction <- round(length(intersect(intersect(voom.limma.de.genes,
edger.de.genes), intersect(voom.limma.de.genes, DESeq.de.genes)))/length(voom.limma.de.genes),
digits = 4))
## [1] 0.9989
voom+limma are also found by edger and DESeq methodsNow that we have the results of the differential expression analysis performed by three popular methods, we are going to compare and illustrate the results.
edgeR, DESeq, and voom+limma. Check your if your answers to questions 2c-iv, and 2d-iv are correct.plot.new()
grid.draw(venn.diagram(list(edgeR = edger.de.genes, DESeq = DESeq.de.genes,
voom.limma = voom.limma.de.genes), filename = NULL, fill = c("blue", "orange",
"green")))
plotSmear function from edgeR, you can look at a scatterplot of observed differential expression (y-axis) against overall abundance (x-axis), both axes logarithmically transformed – to check that putative DE genes seem plausible. Create a smear plot. Within this plot, identify the set of genes which are differentially expressed at an FDR of 1e-5 using all three methods (i.e., the q-values estimated by edgeR, DESeq, and voom are below 1e-5). Explain how you interpret this plot. Do you see reasonable results?DEall <- intersect(intersect(edger.results[edger.results$q.value < 1e-05, "gene.id"],
DESeq.results[DESeq.results$q.value < 1e-05, "gene.id"]), voom.limma.results[voom.limma.results$q.value <
1e-05, "gene.id"])
plotSmear(lrt, de.tags = DEall, main = "Smear plot of differentially expressed genes")
abline(h = c(-0.5, 0.5), col = "blue")
edgeR and voom+limma but not by DESeq. Illustrate the logged counts of them. Compare the (log) counts of these two genes with those of two genes identified by the three methods (see example below)(featureCounts <- rna.dat[setdiff(voom.limma.results[voom.limma.results$q.value <
1e-05, "gene.id"], DEall), ])
## b1 b2 b3 c1 c2 c3
## YMR058W 1752 4275 1787 497 245 263
## YPL271W 181 161 130 364 293 351
featureDat <- data.frame(gene.id = factor(rep(rownames(featureCounts), ncol(featureCounts))),
cond = factor(rep(c("batch", "batch", "batch", "chemostat", "chemostat",
"chemostat"), each = nrow(featureCounts))), log.count = log2(unlist(featureCounts)))
stripplot(gene.id ~ log.count, featureDat, groups = cond, auto.key = TRUE, jitter = TRUE,
main = "genes identified by DESeq only")
In question 1, you performed a DEA of array data using limma. In question 2a, you performed a DEA of RNA-Seq data using edgeR, among other methods. In this question you will compare the results of those two analyses.
Note that the number of probes that you identified as differentially expressed at an FDR of 1e-5 in question 1c(iii) may be different from the number of genes identified as differentially expressed at the same FDR! Why? The use of the argument
force.uniqueofvenn.diagram()is crucial here.
edger.de.genes <- as.character(edger.results[edger.results$q.value < 1e-05,
"gene.id"])
array.de.genes <- as.character(array.results[array.results$q.value < 1e-05,
"gene.id"])
plot.new()
grid.draw(venn.diagram(list(RNA_Seq = edger.de.genes, MicroArray = array.de.genes),
filename = NULL, fill = c("red", "blue"), force.unique = TRUE))
de.merged.ture <- merge(array.results, edger.results, all = TRUE, by = "gene.id",
suffixes = c(".array", ".rnaseq"))
de.merged.false <- merge(array.results, edger.results, all = FALSE, by = "gene.id",
suffixes = c(".array", ".rnaseq"))
densityplot(~q.value.array + q.value.rnaseq, de.merged.ture, plot.points = FALSE,
auto.key = list(text = c("MicroArray", "RNA-Seq")), main = "Density plot of q-values (Either)",
xlab = "q-value", n = 200, bw = 0.02)
densityplot(~q.value.array + q.value.rnaseq, de.merged.false, plot.points = FALSE,
auto.key = list(text = c("MicroArray", "RNA-Seq")), main = "Density plot of q-values (Both)",
xlab = "q-value", n = 200, bw = 0.02)
interestinggenes <- c("YGL209W", "YCL042W", "YBL025W", "YDR384C", "YDR345C")
both <- intersect(edger.results[edger.results$q.value < 1e-05, "gene.id"], array.results[array.results$q.value <
1e-05, "gene.id"])
rnaseqOnly <- setdiff(edger.results[edger.results$q.value < 1e-05, "gene.id"],
both)
arrayOnly <- setdiff(array.results[array.results$q.value < 1e-05, "gene.id"],
both)
(subset(de.merged.false, gene.id %in% interestinggenes, select = grep("gene.id|q.value",
colnames(de.merged.false))))
## gene.id q.value.array q.value.rnaseq
## 120 YBL025W 2.104e-01 6.416e-01
## 526 YCL042W 2.851e-01 2.975e-16
## 1186 YDR345C 4.155e-08 2.034e-310
## 1226 YDR384C 1.012e-07 0.000e+00
## 1939 YGL209W 1.912e-07 4.861e-01
(intersect(interestinggenes, both))
## [1] "YDR384C" "YDR345C"
(intersect(interestinggenes, rnaseqOnly))
## [1] "YCL042W"
(intersect(interestinggenes, arrayOnly))
## [1] "YGL209W"
(setdiff(interestinggenes, c(both, rnaseqOnly, arrayOnly)))
## [1] "YBL025W"
YDR384C and YDR345C fall in the overlaping part of the Microarry-RNASeq Venn diagram. And they are well seperated from the plots and q.value for both of them are less than 1e-5.
YCL042Wfalls in the RNASeq only part of the Microarry-RNASeq Venn diagram. The dots are all located very close to each other on the array panle for YCL042W but relatively well seperated on the RNASeq panle. The q.value is only less than 1e-5 for RNASeq.
YGL209Wfalls in the Microarray only part of the Microarry-RNASeq Venn diagram. The dots are all located very close to each other on the RNASeq panle for YGL209W but relatively well seperated on the array panle. The q.value is only less than 1e-5 for Microarry.
The q.value for YBL025W is higher than 1e-5 for both Microarry and RNASeq. And we can clearly see that are no expression difference from the polt for YBL025W.