STAT 540 Homework 2

Shaun Jackman

2013-03-18

library(edgeR)
library(hexbin)
library(lattice)
library(latticeExtra)
library(limma)
library(reshape2)
library(VennDiagram)
library(yeast2.db)

Q1) Array Analysis

a) (1pt) Load Array Data

data <- t(read.table('array.tsv', header=TRUE, row.names=1))
design <- data.frame(sample=rownames(data))
data.tall <- melt(cbind(data, design), id.vars='sample',
    variable.name='probe', value.name='expression')

b) (2pt) Identify Sample Swap

i. (High volume) scatter plot matrix.

pairs(t(data), pch='.', main='Scatterplot matrix of samples')

plot of chunk 1bi

hexplom(t(data), main='Hexbin matrix of samples')

plot of chunk 1bi

B1 is similar to C1 and C3. C2 is similar to B2 and B3.

ii. A heatmap of the first 100 genes.

heatmap(data[,1:100], main='Heatmap of the first 100 probes')

plot of chunk 1bii

B2, B3 and C2 cluster together. C1, C3 and B1 cluster together.

iii. Compute the Pearson correlation of the samples and plot the results using a heatmap.

heatmap(cor(t(data)), main='Heatmap of the sample correlation')

plot of chunk 1biii

B2, B3 and C2 cluster together. C1, C3 and B1 cluster together.

iv. Scatterplot the six data samples with respect to the first two principal components and label the samples.

xyplot(PC2 ~ PC1, data.frame(prcomp(data)$x),
    panel=function(x, y, ...) {
        panel.xyplot(x, y, ...)
        ltext(x=x, y=y, labels=rownames(data), pos=1, offset=1)
    })

plot of chunk 1biv

B2, B3 and C2 have similar values of PC1. C1, C3 and B1 have similar values of PC1.

I believe samples B1 and C2 are swapped.

c) (2pt) Differential Expression With Array

Fix the label swap from the last problem.

rownames(data) <- c('C2', 'B2', 'B3', 'C1', 'B1', 'C3')
design <- data.frame(
    sample=rownames(data),
    group=factor(ifelse(
        substring(rownames(data), 1, 1) == 'B', 'batch', 'chemostat')))
data.tall <- melt(cbind(data, design),
    id.vars=c('sample', 'group'),
    variable.name='probe', value.name='expression')

Revisit one or more elements of question 1b to sanity check before proceeding.

heatmap(cor(t(data)), main='Sample correlation after swapping B1 and C2')

plot of chunk 1c1

Now use this data to do a differential expression analysis with limma.

fit <- eBayes(lmFit(t(data), model.matrix(~group, design)))
tt <- topTable(fit, coef='groupchemostat', number=Inf, sort.by='none')
de <- with(tt, data.frame(
    probe.id=ID,
    gene.id=unlist(mget(colnames(data), yeast2ORF)),
    p.value=P.Value,
    q.value=adj.P.Val,
    log.fc=logFC,
    test.stat=t))

Remove any rows with probes which don't map to genes.

de <- na.omit(de)

i. How many probes did we start with and how many remain after removing probes without gene ids?

ncol(data)

## [1] 10928

nrow(de)

## [1] 5705

ncol(data) - nrow(de)

## [1] 5223

We started with 10928 probes.
We have 5705 probes after removing probes without gene IDs.
We removed 5223 probes without gene IDs.

ii. Make a stripplot of the top hit (lowest p- or q-value), making the batch vs. chemostat distinction clear.

topProbeID <- de[which.min(de$q.value),'probe.id']
stripplot(expression ~ group, data.tall,
    subset=data.tall$probe==topProbeID,
    main=paste('Strip plot of the top hit:', topProbeID))

plot of chunk 1cii

iii. How many probes are identified as differentially expressed at a false discovery rate (FDR) of 1e-5 (note: this is a FDR cutoff used in the original paper)?

sum(de$q.value < 1e-5)

## [1] 725

There are 725 probes identified as differentially expressed at a false discovery rate (FDR) of 1e-5.

iv. Save your results for later with write.table().

write.table(de, 'limma.tsv', row.names=TRUE, col.names=NA)

Q2) RNA-Seq Analysis

a) (1pt) Load RNA Count Data and Sanity Check

rm(list=ls())
data <- read.table('stampy.counts.tsv', header=TRUE, row.names=1)
design <- data.frame(
    sample=colnames(data),
    group=factor(ifelse(
        substring(colnames(data), 1, 1) == 'b', 'batch', 'chemostat')))

i. What are dimensions of the dataset? In addition to reporting number of rows and columns, make it clear what rows and columns represent and how you're interpreting column names. What is the difference between the rows of this dataset versus the array data?

The dimensions of the dataset are (6542, 6). There are 6542 rows, which represent genes, and 6 columns, which represent samples. I'm interpreting the column names as the names of the samples. The rows of the array data represent probes, whereas the rows of this dataset represent genes.

ii. Do a sanity check to make sure there is no sample swap by plotting a heatmap of the sample correlations.

heatmap(cor(data), main='Heatmap of the sample correlation')

plot of chunk 2aii

Samples B1, B2 and B3 cluster together. Samples C1, C2 and C3 cluster together. There is no sample swap.

b) (4pt) edgeR Differential Expression Analysis

dge.glm <- DGEList(counts=data, group=design$group)

## Calculating library sizes from column totals.

mm <- model.matrix(~group, design)
dge.glm.com.disp <- estimateGLMCommonDisp(dge.glm, mm, verbose=TRUE)

## Disp = 0.00551 , BCV = 0.0742

dge.glm.trend.disp <- estimateGLMTrendedDisp(dge.glm.com.disp, mm)

## Loading required package: splines

dge.glm.tag.disp <- estimateGLMTagwiseDisp(dge.glm.trend.disp, mm)
plotBCV(dge.glm.tag.disp, main='Biological Coefficient of Variation')

plot of chunk 2b0

fit <- glmFit(dge.glm.tag.disp, mm)
lrt <- glmLRT(fit, coef='groupchemostat')
tt.glm <- topTags(lrt, n=Inf)
de.edger <- with(tt.glm[[1]], data.frame(
    gene.id=row.names(tt.glm),
    p.value=PValue,
    q.value=FDR,
    log.fc=logFC,
    test.stat=LR))

Save your results for later with write.table() in file called stampy.edger.results.tsv

write.table(de.edger, 'stampy.edger.results.tsv', row.names=TRUE, col.names=NA)

i. Create a smear plot, show all genes which are called differentially expressed at an FDR of 1e-5 in red. Explain how you interpret this plot. Where should the differentially expressed genes be? Is that what you see?

plotSmear(lrt, de.tags=subset(de.edger, q.value < 1e-5)$gene.id,
    main='Smear plot of differentially expressed genes')

plot of chunk 2bi

This smear plot shows the log fold-change of the gene on the y-axis and the log counts-per-million on the x-axis. Genes with higher expression are found on the right, and genes with a larger fold-change are found at the top. Genes with differential expression should be found at the top and bottom of the smear plot but absent from the left where there is an insufficient number reads to determine with statistical significance whether the gene is differentially expressed. We do in fact see this organisation of the genes in the smear plot.

ii. Load your results from Q1c) and create a Venn diagram showing all genes identified as differentially expressed in the array data and RNA-Seq data at an FDR of 1e-5.

de.limma <- read.table('limma.tsv', header=TRUE, row.names=1,
    as.is=c('probe.id', 'gene.id'))
plot.new()
grid.draw(venn.diagram(list(
    Array = subset(de.limma, q.value < 1e-5)$gene.id,
    RNASeq = subset(de.edger, q.value < 1e-5)$gene.id),
    filename=NULL, fill=c('red', 'blue')))

plot of chunk 2bii

Q3) Load additional pre-computed DEA results

a) DEA results from “low depth” count data

rm(list=ls())
de.edger.lowdepth <- read.table('stampy.edger.low_depth.results.tsv',
    header=TRUE, row.names=1, as.is=c('gene.id'))

b) DEA results from full depth RNA-Seq or array data

de.limma <- read.table('limma.tsv', header=TRUE, row.names=1,
    as.is=c('probe.id', 'gene.id'))
de.edger <- read.table('stampy.edger.results.tsv',
    header=TRUE, row.names=1, as.is=c('gene.id'))
de.limma.voom <- read.table('stampy.limma.results.tsv',
    header=TRUE, row.names=1, as.is=c('gene.id'))

Q4) Differential Expression Comparison

a) (4pt) Sequencing Depth Comparison

i. Identify all genes deemed DE in each set of results at a FDR of 1e-5. How many genes are identified by each method?

sum(de.edger$q.value < 1e-5)

## [1] 2629

sum(de.edger.lowdepth$q.value < 1e-5)

## [1] 494

The high-depth data identifies 2629 differentially expressed genes.
The low-depth data identifies 494 differentially expressed genes.

Display the (non)overlap of these DE sets with a Venn diagram.

plot.new()
grid.draw(venn.diagram(list(
    HighDepth = subset(de.edger, q.value < 1e-5)$gene.id,
    LowDepth = subset(de.edger.lowdepth, q.value < 1e-5)$gene.id),
    filename=NULL, fill=c('red', 'blue')))

plot of chunk 4ai2

ii. Make a scatter plot of the underlying DE test statistics. Based on the above, state some observations about the impact of sequencing depth.

de.edger.merged <- merge(de.edger, de.edger.lowdepth,
        by='gene.id', suffixes=c('.high', '.low'))
xyplot(test.stat.low ~ test.stat.high, de.edger.merged,
    scales=list(log=TRUE),
    pch=20,
    main='Scatter plot of low-depth vs. high-depth test statistic',
    xlab='High-depth test statistic',
    ylab='Low-depth test statistic')

plot of chunk 4aii

hexbinplot(test.stat.low ~ test.stat.high, de.edger.merged,
    scales=list(log=TRUE),
    main='Hexbin plot of low-depth vs. high-depth test statistic',
    xlab='High-depth test statistic',
    ylab='Low-depth test statistic')

plot of chunk 4aii

densityplot(~ p.value.high + p.value.low, de.edger.merged,
    plot.points=FALSE, auto.key=list(text=c('High-depth', 'Low-depth')),
    main='Density plot of P-values',
    xlab='P-value')

plot of chunk 4aii

ecdfplot(~ p.value.high + p.value.low, de.edger.merged,
    plot.points=FALSE, auto.key=list(text=c('High-depth', 'Low-depth')),
    main='Empirical CDF of P-values',
    xlab='P-value')

plot of chunk 4aii

The density plot and empirical CDF of p-values shows that the high-depth data has a much stronger peak of small p-values that more easily separates the differentially-expressed genes from the uniform distribution of p-values representing uninteresting genes. As a result, the high-depth data is able to identify genes as differentially expressed that are not found to be statistically significant with the low-depth data.

b) (6pt) Method Comparison

rm(list=ls())
liA <- read.table('limma.tsv', header=TRUE, row.names=1,
    as.is=c('probe.id', 'gene.id'))
edR <- read.table('stampy.edger.results.tsv',
    header=TRUE, row.names=1, as.is=c('gene.id'))
liV <- read.table('stampy.limma.results.tsv',
    header=TRUE, row.names=1, as.is=c('gene.id'))

Identify all genes deemed DE in each set of results at a FDR of 1e-5. How many genes are identified by each method?

sum(liA$q.value < 1e-5)

## [1] 725

sum(edR$q.value < 1e-5)

## [1] 2629

sum(liV$q.value < 1e-5)

## [1] 1794

725 genes are identified as differentially expressed by liA.
2629 genes are identified as differentially expressed by edR.
1794 genes are identified as differentially expressed by liV.

What fraction of the genes identified using the array dataset are also found by each of the methods used to analyse the RNASeq dataset?

liA.hits <- subset(liA, q.value < 1e-5)$gene.id
edR.fraction <- length(intersect(subset(edR, q.value < 1e-5)$gene.id, liA.hits)) / length(liA.hits)
liV.fraction <- length(intersect(subset(liV, q.value < 1e-5)$gene.id, liA.hits)) / length(liA.hits)

edR found 0.9517 of the genes identified using the array dataset.
liV found 0.8869 of the genes identified using the array dataset.

Display the (non)overlap of these three DE sets with a Venn diagram.

plot.new()
grid.draw(venn.diagram(list(
        liA = subset(liA, q.value < 1e-5)$gene.id,
        edR = subset(edR, q.value < 1e-5)$gene.id,
        liV = subset(liV, q.value < 1e-5)$gene.id),
    filename=NULL, fill=c('red', 'blue', 'green')))

plot of chunk 4bi3

ii. Make some observations on these figures. Do you see what you expect, at least in broad strokes? How about at a more micro-level, e.g. for an individual gene that's DE only according to one analysis?

In broad strokes these results looks quite reasonable. The negative results show a clump of data points where it is difficult to see a separation of the two groups. The positive results show large absolute changes in the expression of the gene that clearly separate the two groups.

merged <- merge(liA, merge(edR, liV, suffixes=c('.edR', '.liV'), by='gene.id'),
    suffixes=c('.liA', ''), by='gene.id')
subset(merged,
    gene.id %in% c('YNL095C', 'YGL209W', 'YPL171C', 'YDL218W', 'YLR303W', 'YPR160W'),
    select=grep('gene.id|q.value', colnames(merged)))

##      gene.id   q.value q.value.edR q.value.liV
## 838  YDL218W 2.232e-06   9.634e-79   3.651e-05
## 1939 YGL209W 1.912e-07   5.211e-01   7.590e-01
## 3703 YLR303W 6.298e-01   2.058e-05   4.910e-06
## 4387 YNL095C 4.439e-04   3.572e-04   5.439e-04
## 5354 YPL171C 1.547e-04   4.418e-65   1.602e-05
## 5592 YPR160W 1.132e-05  5.188e-287   5.213e-11

At a finer level, there are some surprising results.

YNL095C appears differentially expressed and upon inspection is found to have significant q-values (q<1e-3) for all three analyses, but larger than our threshold of 1e-5.

YGL209W is differentially expressed in the array data (q<1e-6) but is not differentially expressed in the RNA-seq data (q>0.5). I'm curious to know the root cause of this discrepancy, and whether it's due to a limitation of one of the two platforms.

YPL171C and YDL218W both appear differentially expressed in the RNA-seq data and are found to be differentially expressed by edR but not by liV. The liV q-value is significant (q<1e-4) for both genes but is above our threshold of q<1e-5.

YLR303W appears differentially expressed in the RNA-seq data and is found to be differentially expressed by liV but not by edR. The edR q-value is significant (q<1e-4) but is above our threshold of q<1e-5. It does not appear to be differentially expressed in the array data (q>0.5). As with YGL209W, I'm curious to know the root cause of this discrepancy.

YPR160W appears quite clearly differentially expressed in the array data, but is not found to be differentially expressed by liA. The liA result is significant (q<1e-4), but is above our threshold of q<1e-5.

I feel these results show that our chosen threshold of q<1e-5 is perhaps too stringent.

iii. Based on all of the above, can you make any conjectures about the sensitivity (aka power) or specificity of the two RNASeq analysis methods? If you are basing your conjecture on (non)overlap with the predictions from the array dataset, what assumptions are you making about the DE genes from the array dataset?

Based on the Venn diagram of differentially expressed genes, edR appears to be the most sensitive (powerful) method. It identifies many genes (789) as differentially expressed that are not found by either of the other two methods, whereas liV identifies only one single gene as differentially expressed that is not found by edR. liA is appears to be the least sensitive method, which may have more to do with the platform than the analysis.

iv. Which platform – array or RNASeq – do you think is more sensitive? In part ii) how do the expression values from the array data compare to the counts from the RNASeq data? Both platforms can find highly expressed genes, but which can measure the abundance of RNA more accurately i.e. which has the better dynamic range?

RNA-seq appears to be the more sensitive platform.

With only a couple notable exceptions, the expression values from the array data of part ii appear at a glance quite similar to the counts from the RNA-seq data. Array data appears to measure expression somewhat less accurately and with more technical variability, which results in a larger variance of the measurements, smaller test statistics, larger p-values and, ultimately, fewer genes identified as being statistically significant in their differential expression.

The array data has limited dynamic range. At low expression the signal is lost in the background noise. At high expression the photosensor saturates and the measurement attains an artificial maximum, much like staring directly at a bright light source saturates our eyes.

RNA-seq suffers less from these two issues. The background noise floor, which is largely represented by mismapped reads, is lower than array data. Perhaps more importantly, the digital count data of RNA-seq reads is not prone to saturation. The reads from highly expressed genes can by accurately counted. These highly-expressed genes do however suck up reads from the sequencing run at the expense of the genes with lower expression.