library(edgeR)
library(hexbin)
library(lattice)
library(latticeExtra)
library(limma)
library(reshape2)
library(VennDiagram)
library(yeast2.db)
data <- t(read.table('array.tsv', header=TRUE, row.names=1))
design <- data.frame(sample=rownames(data))
data.tall <- melt(cbind(data, design), id.vars='sample',
variable.name='probe', value.name='expression')
pairs(t(data), pch='.', main='Scatterplot matrix of samples')
hexplom(t(data), main='Hexbin matrix of samples')
B1 is similar to C1 and C3. C2 is similar to B2 and B3.
heatmap(data[,1:100], main='Heatmap of the first 100 probes')
B2, B3 and C2 cluster together. C1, C3 and B1 cluster together.
heatmap(cor(t(data)), main='Heatmap of the sample correlation')
B2, B3 and C2 cluster together. C1, C3 and B1 cluster together.
xyplot(PC2 ~ PC1, data.frame(prcomp(data)$x),
panel=function(x, y, ...) {
panel.xyplot(x, y, ...)
ltext(x=x, y=y, labels=rownames(data), pos=1, offset=1)
})
B2, B3 and C2 have similar values of PC1. C1, C3 and B1 have similar values of PC1.
I believe samples B1 and C2 are swapped.
rownames(data) <- c('C2', 'B2', 'B3', 'C1', 'B1', 'C3')
design <- data.frame(
sample=rownames(data),
group=factor(ifelse(
substring(rownames(data), 1, 1) == 'B', 'batch', 'chemostat')))
data.tall <- melt(cbind(data, design),
id.vars=c('sample', 'group'),
variable.name='probe', value.name='expression')
heatmap(cor(t(data)), main='Sample correlation after swapping B1 and C2')
fit <- eBayes(lmFit(t(data), model.matrix(~group, design)))
tt <- topTable(fit, coef='groupchemostat', number=Inf, sort.by='none')
de <- with(tt, data.frame(
probe.id=ID,
gene.id=unlist(mget(colnames(data), yeast2ORF)),
p.value=P.Value,
q.value=adj.P.Val,
log.fc=logFC,
test.stat=t))
de <- na.omit(de)
ncol(data)
## [1] 10928
nrow(de)
## [1] 5705
ncol(data) - nrow(de)
## [1] 5223
topProbeID <- de[which.min(de$q.value),'probe.id']
stripplot(expression ~ group, data.tall,
subset=data.tall$probe==topProbeID,
main=paste('Strip plot of the top hit:', topProbeID))
sum(de$q.value < 1e-5)
## [1] 725
There are 725 probes identified as differentially expressed at a false discovery rate (FDR) of 1e-5.
write.table(de, 'limma.tsv', row.names=TRUE, col.names=NA)
rm(list=ls())
data <- read.table('stampy.counts.tsv', header=TRUE, row.names=1)
design <- data.frame(
sample=colnames(data),
group=factor(ifelse(
substring(colnames(data), 1, 1) == 'b', 'batch', 'chemostat')))
The dimensions of the dataset are (6542, 6). There are 6542 rows, which represent genes, and 6 columns, which represent samples. I'm interpreting the column names as the names of the samples. The rows of the array data represent probes, whereas the rows of this dataset represent genes.
heatmap(cor(data), main='Heatmap of the sample correlation')
Samples B1, B2 and B3 cluster together. Samples C1, C2 and C3 cluster together. There is no sample swap.
dge.glm <- DGEList(counts=data, group=design$group)
## Calculating library sizes from column totals.
mm <- model.matrix(~group, design)
dge.glm.com.disp <- estimateGLMCommonDisp(dge.glm, mm, verbose=TRUE)
## Disp = 0.00551 , BCV = 0.0742
dge.glm.trend.disp <- estimateGLMTrendedDisp(dge.glm.com.disp, mm)
## Loading required package: splines
dge.glm.tag.disp <- estimateGLMTagwiseDisp(dge.glm.trend.disp, mm)
plotBCV(dge.glm.tag.disp, main='Biological Coefficient of Variation')
fit <- glmFit(dge.glm.tag.disp, mm)
lrt <- glmLRT(fit, coef='groupchemostat')
tt.glm <- topTags(lrt, n=Inf)
de.edger <- with(tt.glm[[1]], data.frame(
gene.id=row.names(tt.glm),
p.value=PValue,
q.value=FDR,
log.fc=logFC,
test.stat=LR))
write.table(de.edger, 'stampy.edger.results.tsv', row.names=TRUE, col.names=NA)
plotSmear(lrt, de.tags=subset(de.edger, q.value < 1e-5)$gene.id,
main='Smear plot of differentially expressed genes')
This smear plot shows the log fold-change of the gene on the y-axis and the log counts-per-million on the x-axis. Genes with higher expression are found on the right, and genes with a larger fold-change are found at the top. Genes with differential expression should be found at the top and bottom of the smear plot but absent from the left where there is an insufficient number reads to determine with statistical significance whether the gene is differentially expressed. We do in fact see this organisation of the genes in the smear plot.
de.limma <- read.table('limma.tsv', header=TRUE, row.names=1,
as.is=c('probe.id', 'gene.id'))
plot.new()
grid.draw(venn.diagram(list(
Array = subset(de.limma, q.value < 1e-5)$gene.id,
RNASeq = subset(de.edger, q.value < 1e-5)$gene.id),
filename=NULL, fill=c('red', 'blue')))
rm(list=ls())
de.edger.lowdepth <- read.table('stampy.edger.low_depth.results.tsv',
header=TRUE, row.names=1, as.is=c('gene.id'))
de.limma <- read.table('limma.tsv', header=TRUE, row.names=1,
as.is=c('probe.id', 'gene.id'))
de.edger <- read.table('stampy.edger.results.tsv',
header=TRUE, row.names=1, as.is=c('gene.id'))
de.limma.voom <- read.table('stampy.limma.results.tsv',
header=TRUE, row.names=1, as.is=c('gene.id'))
sum(de.edger$q.value < 1e-5)
## [1] 2629
sum(de.edger.lowdepth$q.value < 1e-5)
## [1] 494
plot.new()
grid.draw(venn.diagram(list(
HighDepth = subset(de.edger, q.value < 1e-5)$gene.id,
LowDepth = subset(de.edger.lowdepth, q.value < 1e-5)$gene.id),
filename=NULL, fill=c('red', 'blue')))
de.edger.merged <- merge(de.edger, de.edger.lowdepth,
by='gene.id', suffixes=c('.high', '.low'))
xyplot(test.stat.low ~ test.stat.high, de.edger.merged,
scales=list(log=TRUE),
pch=20,
main='Scatter plot of low-depth vs. high-depth test statistic',
xlab='High-depth test statistic',
ylab='Low-depth test statistic')
hexbinplot(test.stat.low ~ test.stat.high, de.edger.merged,
scales=list(log=TRUE),
main='Hexbin plot of low-depth vs. high-depth test statistic',
xlab='High-depth test statistic',
ylab='Low-depth test statistic')
densityplot(~ p.value.high + p.value.low, de.edger.merged,
plot.points=FALSE, auto.key=list(text=c('High-depth', 'Low-depth')),
main='Density plot of P-values',
xlab='P-value')
ecdfplot(~ p.value.high + p.value.low, de.edger.merged,
plot.points=FALSE, auto.key=list(text=c('High-depth', 'Low-depth')),
main='Empirical CDF of P-values',
xlab='P-value')
The density plot and empirical CDF of p-values shows that the high-depth data has a much stronger peak of small p-values that more easily separates the differentially-expressed genes from the uniform distribution of p-values representing uninteresting genes. As a result, the high-depth data is able to identify genes as differentially expressed that are not found to be statistically significant with the low-depth data.
rm(list=ls())
liA <- read.table('limma.tsv', header=TRUE, row.names=1,
as.is=c('probe.id', 'gene.id'))
edR <- read.table('stampy.edger.results.tsv',
header=TRUE, row.names=1, as.is=c('gene.id'))
liV <- read.table('stampy.limma.results.tsv',
header=TRUE, row.names=1, as.is=c('gene.id'))
sum(liA$q.value < 1e-5)
## [1] 725
sum(edR$q.value < 1e-5)
## [1] 2629
sum(liV$q.value < 1e-5)
## [1] 1794
liA.hits <- subset(liA, q.value < 1e-5)$gene.id
edR.fraction <- length(intersect(subset(edR, q.value < 1e-5)$gene.id, liA.hits)) / length(liA.hits)
liV.fraction <- length(intersect(subset(liV, q.value < 1e-5)$gene.id, liA.hits)) / length(liA.hits)
plot.new()
grid.draw(venn.diagram(list(
liA = subset(liA, q.value < 1e-5)$gene.id,
edR = subset(edR, q.value < 1e-5)$gene.id,
liV = subset(liV, q.value < 1e-5)$gene.id),
filename=NULL, fill=c('red', 'blue', 'green')))
In broad strokes these results looks quite reasonable. The negative results show a clump of data points where it is difficult to see a separation of the two groups. The positive results show large absolute changes in the expression of the gene that clearly separate the two groups.
merged <- merge(liA, merge(edR, liV, suffixes=c('.edR', '.liV'), by='gene.id'),
suffixes=c('.liA', ''), by='gene.id')
subset(merged,
gene.id %in% c('YNL095C', 'YGL209W', 'YPL171C', 'YDL218W', 'YLR303W', 'YPR160W'),
select=grep('gene.id|q.value', colnames(merged)))
## gene.id q.value q.value.edR q.value.liV
## 838 YDL218W 2.232e-06 9.634e-79 3.651e-05
## 1939 YGL209W 1.912e-07 5.211e-01 7.590e-01
## 3703 YLR303W 6.298e-01 2.058e-05 4.910e-06
## 4387 YNL095C 4.439e-04 3.572e-04 5.439e-04
## 5354 YPL171C 1.547e-04 4.418e-65 1.602e-05
## 5592 YPR160W 1.132e-05 5.188e-287 5.213e-11
At a finer level, there are some surprising results.
YNL095C appears differentially expressed and upon inspection is found to have significant q-values (q<1e-3) for all three analyses, but larger than our threshold of 1e-5.
YGL209W is differentially expressed in the array data (q<1e-6) but is not differentially expressed in the RNA-seq data (q>0.5). I'm curious to know the root cause of this discrepancy, and whether it's due to a limitation of one of the two platforms.
YPL171C and YDL218W both appear differentially expressed in the RNA-seq data and are found to be differentially expressed by edR but not by liV. The liV q-value is significant (q<1e-4) for both genes but is above our threshold of q<1e-5.
YLR303W appears differentially expressed in the RNA-seq data and is found to be differentially expressed by liV but not by edR. The edR q-value is significant (q<1e-4) but is above our threshold of q<1e-5. It does not appear to be differentially expressed in the array data (q>0.5). As with YGL209W, I'm curious to know the root cause of this discrepancy.
YPR160W appears quite clearly differentially expressed in the array data, but is not found to be differentially expressed by liA. The liA result is significant (q<1e-4), but is above our threshold of q<1e-5.
I feel these results show that our chosen threshold of q<1e-5 is perhaps too stringent.
Based on the Venn diagram of differentially expressed genes, edR appears to be the most sensitive (powerful) method. It identifies many genes (789) as differentially expressed that are not found by either of the other two methods, whereas liV identifies only one single gene as differentially expressed that is not found by edR. liA is appears to be the least sensitive method, which may have more to do with the platform than the analysis.
RNA-seq appears to be the more sensitive platform.
With only a couple notable exceptions, the expression values from the array data of part ii appear at a glance quite similar to the counts from the RNA-seq data. Array data appears to measure expression somewhat less accurately and with more technical variability, which results in a larger variance of the measurements, smaller test statistics, larger p-values and, ultimately, fewer genes identified as being statistically significant in their differential expression.
The array data has limited dynamic range. At low expression the signal is lost in the background noise. At high expression the photosensor saturates and the measurement attains an artificial maximum, much like staring directly at a bright light source saturates our eyes.
RNA-seq suffers less from these two issues. The background noise floor, which is largely represented by mismapped reads, is lower than array data. Perhaps more importantly, the digital count data of RNA-seq reads is not prone to saturation. The reads from highly expressed genes can by accurately counted. These highly-expressed genes do however suck up reads from the sequencing run at the expense of the genes with lower expression.