STAT540-2014 - Homework02 - ACTON, ERICA

Question 1: Microarray Analysis

Q1a: Load Microarray Data

Load the normalized data. What are the dimensions of the dataset? In addition to reporting number of rows and columns, make it clear what rows and columns represent and how you're interpreting the column names.

Load the data.

mDat <- read.table("GSE37599-data.tsv", header = TRUE, row.names = 1)
str(mDat)

## 'data.frame':    10928 obs. of  6 variables:
##  $ b1: num  11.15 2.16 1.49 9.01 6.95 ...
##  $ b2: num  6.8 3.18 1.43 9.46 6.9 ...
##  $ b3: num  6.71 3.13 1.82 9.23 6.96 ...
##  $ c1: num  10.95 2.5 1.46 8.97 6.85 ...
##  $ c2: num  6.7 3.05 2.08 9.28 6.9 ...
##  $ c3: num  11.07 2.44 1.62 9 6.89 ...

boxplot(mDat)

The dimensions of the dataset.

dim(mDat)

## [1] 10928     6

# number of rows - the probes mapped to genes
nrow(mDat)

## [1] 10928

# number of columns - samples; b1-b3= yeast replicates grown in batch
# medium, and c1-c3= yeast replicates grown in chemostat conditions
ncol(mDat)

## [1] 6

colnames(mDat)

## [1] "b1" "b2" "b3" "c1" "c2" "c3"

The rows represent the probes mapped to genes, columns represent samples - b1-b3= yeast replicates grown in batch medium, and c1-c3= yeast replicates grown in chemostat conditions.

Q1b: Identify Sample Swap

The labels on two of the samples have been swapped, that is one of the batch samples has been labelled as chemostat and vice-versa. Produce the plots described below and explain how they allow you to identify the swapped samples.

i. (High volume) scatter plot matrix.

splom(mDat, panel = panel.smoothScatter, raster = TRUE, main = "High Volume Scatter Plot Matrix")

## KernSmooth 2.23 loaded
## Copyright M. P. Wand 1997-2009
## (loaded the KernSmooth namespace)

Explanation: In the scatter plot matrix, samples which have similar expression would be expected to have a linear relationship. Since b1 has a large scatter when plotted against b2 b3, and c2, but has low scatter with c1 and c3 - knowing that a sample has been swapped - it looks as if b1 is more similar to c1 and c3. Similarly c2 has a larger scatter with c1, c3, and b1 and has less variance when compared to b2 and b3.

ii. A heatmap of the first 100 genes.

sub100g <- as.matrix(mDat[1:100, ])
head(sub100g)

##                b1    b2    b3     c1    c2     c3
## 1769308_at 11.146 6.796 6.708 10.946 6.699 11.071
## 1769309_at  2.164 3.177 3.129  2.498 3.047  2.444
## 1769310_at  1.488 1.427 1.816  1.462 2.079  1.623
## 1769311_at  9.006 9.462 9.235  8.972 9.277  9.005
## 1769312_at  6.946 6.896 6.955  6.851 6.900  6.893
## 1769313_at  7.815 6.600 6.534  7.770 6.564  7.852

gnbucolorFun <- colorRampPalette(brewer.pal(n = 9, "GnBu"))
heatmap(sub100g, Rowv = NA, Colv = NA, scale = "none", main = "Original Data: Heatmap (1st 100 Genes)", 
    margins = c(5, 8), col = gnbucolorFun(256))

plot of chunk unnamed-chunk-5

Explanation: It would be expected that replicates of samples grown under similar conditions would cluster similarly according to gene expression. In this heatmap, clearly b1, c1, and c3 cluster together, while b2, b3, and c2 cluster similarly. From this visualization it can be inferred that labels for b1 and c2 have been swapped.

iii. Compute the Pearson correlation of the samples and plot the results using a heatmap.

heatmap(cor(mDat, method = "pearson"), scale = "none", main = "Original Data: Pearson Correlation")

Explanation: From the pearson correlation heatmap, it can be seen that c3, b1, and c3 cluster together while b2, b3, and c2 cluster together. Again, since we expect samples grown under similar replicate conditions to cluster together it can be inferred that the labels for b1 and c2 have been swapped.

iv. Scatterplot the six data samples with respect to the first two principle components and label the samples. Hint: if using the base graphics function plot(), a subsequent call to text() function can be used to label each point with the corresponding sample name.

Principle component analysis and scatterplot.

pcs <- prcomp(mDat)
plot(pcs$rotation[, "PC1"], pcs$rotation[, "PC2"], ylim = c(-0.5, 0.7), main = "Original Data: P1 vs P2", 
    xlab = "PC1", ylab = "PC2")
text(pcs$rotation[, "PC1"], pcs$rotation[, "PC2"], labels = row.names(pcs$rotation), 
    offset = 0.5, pos = 3)

plot of chunk unnamed-chunk-7

Explanation: By principal component analysis, b1, c1 and c3 cluster together, while c2, b2, b3 group together. b1 is clustering with the chemostat samples and c2 is clustering with the batch samples.

Q1c: Microarray Differential Expression

Fix the label swap identified in question 1b. We want to swap b1<–>c2. Revisit one or more elements of question 1b to sanity check before proceeding.

Fix labels.

# Original labels.
colnames(mDat)

## [1] "b1" "b2" "b3" "c1" "c2" "c3"

# New swapped labels.
colnames(mDat) <- c("c2", "b2", "b3", "c1", "b1", "c3")

Check: Remake heatmap of 100 genes.

sub100g2 <- as.matrix(mDat[1:100, ])
heatmap(sub100g2, Rowv = NA, Colv = NA, scale = "none", margins = c(5, 8), main = "Swapped Labels: Heatmap (1st 100 Genes)", 
    col = gnbucolorFun(256))

plot of chunk unnamed-chunk-9

Explanation: Now we can see that the chemostat samples would cluster together and the batch media samples would cluster together, as expected.

Check: Remake Pearson correlation heatmap.

heatmap(cor(mDat, method = "pearson"), scale = "none", main = "Swapped Labels: Pearson Correlation")

plot of chunk unnamed-chunk-10

Explanation: From the Pearson Correlation heatmap the chemostat samples now cluster together and the batch media samples cluster together.

Now use this data to do a differential expression analysis with limma.

Make a design matrix.

des <- data.frame(group = c(1, 0, 0, 1, 0, 1))
mDes <- model.matrix(~group, des)

Perform differential analysis with limma. Get results in toptable.

mFit <- lmFit(mDat, mDes)
mEbFit <- eBayes(mFit)
m.tt <- topTable(mEbFit, coef = 2, n = Inf)

Package these results in a data frame with six columns: (probe.id, gene.id, p.value, q.value, log.fc, test.stat) The gene id can be retrieved using the yeast2.db package from Bioconductor. The yeast2ORF object available after loading yeas2.db contains the mapping between probe IDs and yeast gene ids. Assuming you have a character vector of probes called prob.ids, the gene IDs can be retrieved using gene.ids <- unlist(mget(probe.ids, yeast2ORF)).

Retrieve gene ids.

probe.ids <- rownames(m.tt)
gene.ids <- unlist(mget(probe.ids, yeast2ORF))
# Add gene ids to toptable.
idm.tt <- cbind.data.frame(m.tt, gene.ids)
# Rename gene.ids retrieved to 'gene.id''
colnames(idm.tt)

## [1] "logFC"     "AveExpr"   "t"         "P.Value"   "adj.P.Val" "B"        
## [7] "gene.ids"

colnames(idm.tt) <- c("logFC", "AveExpr", "t", "P.Value", "adj.P.Val", "B", 
    "gene.id")
# Verify labelling.
head(idm.tt)

##             logFC AveExpr      t   P.Value adj.P.Val     B gene.id
## 1772391_at 10.111   7.044 108.93 9.478e-14 1.036e-09 20.96 YIL057C
## 1774122_at  8.474   6.852  89.87 4.277e-13 2.337e-09 20.05 YOR388C
## 1774070_at  8.991   8.379  80.70 9.932e-13 3.618e-09 19.47 YMR303C
## 1776153_at  7.304   5.428  72.70 2.248e-12 5.717e-09 18.87 YMR206W
## 1774072_at  8.578   6.887  71.31 2.616e-12 5.717e-09 18.75 YJR095W
## 1776304_at  8.081   6.730  69.10 3.348e-12 5.881e-09 18.56 YDR256C

Package results into a dataframe.

m.results <- data.frame(probe.id = row.names(idm.tt), gene.id = idm.tt$gene.id, 
    p.value = idm.tt$P.Value, q.value = idm.tt$adj.P.Val, log.fc = idm.tt$logFC, 
    test.stat = idm.tt$t)

Remove any rows with probes which don't map to genes. You'll be able to find these because they will have NA as their gene id. Work with this data to answer the questions below.

Remove probes which do not map to genes (gene.id=NA).

mkeepDat <- na.omit(m.results)

How many probes did we start with and how many remain after removing probes without gene ids?

Number of probes we started with:

nrow(mDat)

## [1] 10928

Number of probes remaining after removing probes without gene ids:

nrow(mkeepDat)

## [1] 5705

ii. Illustrate the differential expression between the batch and the chemostat samples for the top hit (i.e., probe with the lowest p-or q-value).

topHit <- c(mkeepDat[1, ]$probe.id)
expTopHit <- mDat[topHit, ]
topHitDat <- data.frame(t(expTopHit))
colnames(topHitDat) <- "gExp"
topHitDat$group <- c("chemostat", "batch", "batch", "chemostat", "batch", "chemostat")
stripplot(group ~ gExp, topHitDat, group = group, jitter.data = TRUE, type = c("p", 
    "a"), main = "Differential Expression of Top Hit")

plot of chunk unnamed-chunk-18

iii.How many probes are identified as differentially expressed at a false discovery rate (FDR) of 1e-5 (note: this is a FDR cutoff used in the original paper)?

micro.results <- subset(mkeepDat, q.value < 1e-05)
nrow(micro.results)

## [1] 725

iv. Save your results for later with write.table(). When using write.table to save the data, you need to pass the arguments row.names=TRUE, col.names=NA to make sure the headers are written correctly.

write.table(micro.results, file = "micro.results.tsv", row.names = T, col.names = NA)

Question 2: RNA-Seq Analysis

Q2a: Load RNA Count Data and Sanity Check

Load the count data using read.table(); you will need to pass the arguments header=TRUE and row.names=1. Load data.

rsDat <- read.table("stampy.counts.tsv", header = TRUE, row.names = 1)
str(rsDat)

## 'data.frame':    6542 obs. of  6 variables:
##  $ b1: int  11 40 31 63 112 17 0 4 5 0 ...
##  $ b2: int  3 26 52 87 106 21 0 2 8 1 ...
##  $ b3: int  5 10 40 53 60 15 0 2 1 0 ...
##  $ c1: int  13 71 12 51 139 7 0 11 5 1 ...
##  $ c2: int  5 41 18 78 142 6 0 0 5 0 ...
##  $ c3: int  17 48 32 100 127 2 2 3 2 1 ...

i. What are the dimensions of the dataset? In addition to reporting the numbers of rows and columns, make it clear what the rows and columns represent. What is the difference between the rows of this dataset versus rows of the array data in question 1a?

dim(rsDat)

## [1] 6542    6

nrow(rsDat)  #the mRNAs sequenced in the samples

## [1] 6542

ncol(rsDat)  # the columns are the samples which were replicates grown in batch medium or chemostat condisions

## [1] 6

colnames(rsDat)

## [1] "b1" "b2" "b3" "c1" "c2" "c3"

Explanation: The probes (rows) of the microarray have been preselected, designed to capture certain genes, while RNA-Seq data (rows) capture all mRNAs present in the sample.

ii. Do a sanity check to make sure there is no sample swap by plotting a heatmap of the sample correlations.

heatmap(cor(rsDat, method = "pearson"), scale = "none", main = "Pearson Correlation of RNA-Seq Data")

plot of chunk unnamed-chunk-23

Comment: Chemostat condition samples and batch medium samples have clustered together as expected.

Q2b: edgeR Differential Expression Analysis

Now you will use edgeR to identify differentially expressed genes between the batch medium vs. chemostat conditions. i. Recall that edgeR needs to estimate the dispersion parameter in the negative binomial model using an empirical Bayes method. Estimate the dispersion parameters using estimateGLMCommonDisp, estimateGLMTrendedDisp and estimateGLMTagwiseDisp. Plot the tagwise dispersion against log2-CPM (counts per million). Unlike in seminar 7, you need to pass the design matrix to estimateGLMTrendedDisp.

Create groups according to sample conditions.
Group 1= batch medium. Group 2= chemostat.

group <- factor(c(rep("1", 3), rep("2", 3)))
group

## [1] 1 1 1 2 2 2
## Levels: 1 2

Create DGEList object of count data (matrix) and samples (data frame).

dgeDat <- DGEList(counts = rsDat, group = group)

Create design matrix.

dgeDes <- model.matrix(~group)
dgeDes

##   (Intercept) group2
## 1           1      0
## 2           1      0
## 3           1      0
## 4           1      1
## 5           1      1
## 6           1      1
## attr(,"assign")
## [1] 0 1
## attr(,"contrasts")
## attr(,"contrasts")$group
## [1] "contr.treatment"

Calculate the estimated common, trended, and tagwise dispersions.

dgeDat.com.disp <- estimateGLMCommonDisp(dgeDat, dgeDes, verbose = TRUE)

## Disp = 0.00551 , BCV = 0.0742

dgeDat.trend.disp <- estimateGLMTrendedDisp(dgeDat.com.disp, dgeDes)

## Loading required package: splines

dgeDat.tag.disp <- estimateGLMTagwiseDisp(dgeDat.trend.disp, dgeDes)

Plot the tagwise dispersion agains log2-CPM.

plotBCV(dgeDat.tag.disp, main = "Biological Coefficient of Variation of Count Data")

plot of chunk unnamed-chunk-28

ii. Use the glm functionality of edgeR, ie. use the glmFit function, to identify differentially expressed genes between conditions.

fit <- glmFit(dgeDat.tag.disp, dgeDes)
lrt <- glmLRT(fit, coef = 2)
topTags(lrt)

## Coefficient:  group2 
##          logFC logCPM   LR PValue FDR
## YIL057C 10.084  8.267 2191      0   0
## YMR175W  9.426  8.001 1877      0   0
## YJR095W  9.419  7.811 2247      0   0
## YMR107W  9.073  8.781 2627      0   0
## YKL217W  8.573 11.425 1689      0   0
## YPL201C  7.654  7.105 1951      0   0
## YGL205W  7.070  9.108 2691      0   0
## YCR010C  7.013  8.150 1809      0   0
## YAL054C  6.893 10.583 2542      0   0
## YBR067C  6.468  9.973 1657      0   0

ttedge <- data.frame(topTags(lrt, n = Inf))

Package these results in a data.frame called 'edger.results' with five columns: (gene.id, p.value, q.value, log.fc, test.stat)

edger.results <- data.frame(gene.id = row.names(ttedge), p.value = ttedge$PValue, 
    q.value = ttedge$FDR, log.fc = ttedge$logFC, test.stat = ttedge$LR)

Save your results for later with write.table() in a file called stampy.edger.results.tsv.

write.table(edger.results, file = "stampy.edger.results.tsv", row.names = T, 
    col.names = NA)

iii. How many genes are differentially expressed between conditions at a false discovery rate (FDR) of 1e-5?

edgeRhits <- subset(edger.results, q.value < 1e-05)
nrow(edgeRhits)

## [1] 2669

# Save for later.
write.table(edgeRhits, file = "stampy.edger.FDR.results.tsv", row.names = T, 
    col.names = NA)

iv. How many genes are differentially over-expressed between conditions at a false discovery rate (FDR) of 1e-5?

summary(de.edgeRhits <- decideTestsDGE(lrt, p = 1e-05, adjust = "BH"))

##    [,1]
## -1 1154
## 0  3873
## 1  1515

Explanation: 1515 genes are differentially over-expressed in Group 2 (chemostat condition) relative to Group 1 (batch medium). [1154 were under-expressed in the chemostat condition relative to the batch medium, and 3873 showed no significant difference in expression for the selected cutoff.]

Q2c: DESeq Differential Expression Analysis

Now you will use DESeq to identify differentially expressed genes between the batch medium vs. chemostat conditions. i. DESeq also needs to estimate the dispersion. Use estimateSizeFactors and estimateDispersions to normalize the data. Plot the estimated dispersions against the mean normalized counts.

Get count and group information.

deSeqDat <- newCountDataSet(rsDat, group)
head(counts(deSeqDat))

##           b1  b2 b3  c1  c2  c3
## 15S_rRNA  11   3  5  13   5  17
## 21S_rRNA  40  26 10  71  41  48
## HRA1      31  52 40  12  18  32
## ICR1      63  87 53  51  78 100
## LSR1     112 106 60 139 142 127
## NME1      17  21 15   7   6   2

Estimate size factors to account for differences in coverage, and estimate the variance.

deSeqDat <- estimateSizeFactors(deSeqDat)
sizeFactors(deSeqDat)

##     b1     b2     b3     c1     c2     c3 
## 1.0211 1.2962 0.9538 0.7866 0.9004 1.1317

Estimate the dispersions.

deSeqDat <- estimateDispersions(deSeqDat)

Plot dispersions against the mean normalized counts.

plotDispEsts(deSeqDat, main = "DESeq: Estimated Dispersion vs Mean Normalized Counts")

plot of chunk unnamed-chunk-37

ii. Use the negative binomial test of DESeq, i.e. use the nbionomTest() function, to identify differentially expressed genes between conditions. Note that the output of this function does not return results ordered by p-values or logged fold-changes.

results <- nbinomTest(deSeqDat, levels(group)[1], levels(group)[2])
str(results)

## 'data.frame':    6542 obs. of  8 variables:
##  $ id            : chr  "15S_rRNA" "21S_rRNA" "HRA1" "ICR1" ...
##  $ baseMean      : num  9.24 41.32 29.32 70.7 116.84 ...
##  $ baseMeanA     : num  6.11 23.24 37.47 61.46 84.79 ...
##  $ baseMeanB     : num  12.4 59.4 21.2 79.9 148.9 ...
##  $ foldChange    : num  2.024 2.556 0.565 1.301 1.756 ...
##  $ log2FoldChange: num  1.017 1.354 -0.823 0.379 0.812 ...
##  $ pval          : num  0.13 0.0476 0.0171 0.0684 0.0101 ...
##  $ padj          : num  0.1899 0.0777 0.0307 0.1076 0.0188 ...

Results: 6542 genes are differentially expressed between the two conditions.

Package these results in a data.frame called 'deseq.results' with four columns: (gene.id, p.value, q.value, log.fc)

deseq.results <- data.frame(gene.id = results$id, p.value = results$pval, q.value = results$padj, 
    log.fc = results$log2FoldChange)

Save your results for later with write.table() in a file called stampy.deseq.results.tsv.

write.table(deseq.results, file = "stampy.deseq.results.tsv", row.names = T, 
    col.names = NA)

iii. How many genes are differentially expressed between conditions at a false discovery rate (FDR) of 1e-5?

nrow(subset(deseq.results, q.value < 1e-05))

## [1] 2198

# Save for later.
deSeqHits <- subset(deseq.results, q.value < 1e-05)
write.table(deSeqHits, file = "stampy.deseq.FDR.results.tsv", row.names = T, 
    col.names = NA)

iv. How many differentially expressed genes are identified by both 'edgeR' and 'DESeq'? The function intersect - which finds the elements common to two sets (vectors)- will be helpful.

The number of DE genes found by DESeq:

deSeqresults <- c(as.character(deSeqHits$gene.id))
length(deSeqresults)

## [1] 2198

The number of DE genes found by edgeR:

edgeRresults <- c(as.character(edgeRhits$gene.id))
length(edgeRresults)

## [1] 2669

The number of DE genes common to the two sets (vectors):

comboDeSeqEdgeR <- intersect(edgeRresults, deSeqresults)
length(comboDeSeqEdgeR)

## [1] 2176

Q2d: Voom Differential Expression Analysis

Now you will use voom+limma to identify differentially expressed genes between the batch medium vs. chemostat conditions. i. voom normalizes the counts before it converts to log2-cpm.

Use calcNormFactors to normalize counts.

normFact <- calcNormFactors(rsDat)

ii .Use 'voom' to convert count data into logged CPM data and then use 'limma' to identify differentially expressed genes between conditions.

Convert count data into logged CPM data.

voomDat <- voom(rsDat, dgeDes, plot = TRUE, lib.size = colSums(rsDat) * normFact)

plot of chunk unnamed-chunk-46

voomDat

## An object of class "EList"
## $E
##             b1      b2     b3    c1    c2    c3
## 15S_rRNA 1.747 -0.3262 0.7594 2.362 0.867 2.207
## 21S_rRNA 3.563  2.5943 1.6923 4.767 3.783 3.678
## HRA1     3.201  3.5806 3.6398 2.251 2.617 3.100
## ICR1     4.212  4.3176 4.0414 4.294 4.702 4.729
## LSR1     5.037  4.6011 4.2188 5.731 5.562 5.072
## 6537 more rows ...
## 
## $weights
##        [,1]  [,2]   [,3]   [,4]   [,5]   [,6]
## [1,]  3.114  3.66  3.011  4.323  4.807  5.820
## [2,]  8.734 10.96  8.323 17.578 20.006 24.738
## [3,] 15.237 19.24 14.494  7.050  7.978  9.816
## [4,] 24.255 30.44 23.093 24.267 27.559 33.987
## [5,] 31.861 40.13 30.343 42.736 48.619 59.388
## 6537 more rows ...
## 
## $design
##   (Intercept) group2
## 1           1      0
## 2           1      0
## 3           1      0
## 4           1      1
## 5           1      1
## 6           1      1
## attr(,"assign")
## [1] 0 1
## attr(,"contrasts")
## attr(,"contrasts")$group
## [1] "contr.treatment"
## 
## 
## $targets
##    lib.size
## b1  3426306
## b2  4388115
## b3  3249102
## c1  2625992
## c2  3015589
## c3  3789682

Compute linear model. Find differentially expressed genes between the conditions.

voomFit <- lmFit(voomDat, dgeDes)
voomFit <- eBayes(voomFit)
voom.results <- topTable(voomFit, coef = 2, n = Inf)
nrow(voom.results)

## [1] 6542

Results: 6542 genes were found to be differentially expressed between the conditions.

Package these results in a data.frame called 'voom.limma.results' with five columns: (gene.id, p.value, q.value, log.fc, test.stat)

voom.limma.results <- data.frame(gene.id = row.names(voom.results), p.value = voom.results$P.Value, 
    q.value = voom.results$adj.P.Val, log.fc = voom.results$logFC, test.stat = voom.results$t)

Save your results for later with write.table() in a file called stampy.limma.results.tsv.

write.table(voom.limma.results, file = "stampy.limma.results.tsv", row.names = T, 
    col.names = NA)

iii. How many genes are differentially expressed between conditions at a false discovery rate (FDR) of 1e-5?

nrow(subset(voom.limma.results, q.value < 1e-05))

## [1] 1794

# Save for later.
voomHits <- subset(voom.limma.results, q.value < 1e-05)
write.table(voomHits, file = "stampy.limma.FDR.results.tsv", row.names = T, 
    col.names = NA)

iv. What fraction of the genes identified using voom+limma are also found by edger and DESeq methods? For example if the DE analysis using voom+limma found 1000 genes and both edgeR and DESeq found 500 of these, the fraction of genes found would be 500/1000=0.5.

The number of DE genes found by voom+limma:

voomresults <- c(as.character(voomHits$gene.id))
length(voomresults)

## [1] 1794

The number of genes identified with voom+limma also found by edgeR and DESeq methods:

length(intersect(voomresults, comboDeSeqEdgeR))

## [1] 1792

The fraction of genes identified with voom+limma also found by edgeR and DESeq methods:

(Fraction <- length(intersect(voomresults, comboDeSeqEdgeR))/length(voomresults))

## [1] 0.9989

Q2e: Comparison of Differential Expression Analyses

Now that we have the results of the differential expression analysis performed by three popular methods, we are going to compare and illustrate results. i. In previous questions, we noticed that the different methods identified different differentially expressed genes. Create a Venn diagram showing all genes identified as differentially expressed by edgeR, DESeq, and voom+limma. Check your answers to questions 2c-iv and 2d-iv are correct.

Plot a venn diagram of all genes identified as differentially expressed by edgeR, DESeq, and voom+limma.

de.genes <- list(DESeq = deSeqresults, edgeR = edgeRresults, voom.limma = voomresults)
plot.new()

venn.plot <- venn.diagram(de.genes, filename = NULL, fill = c("red", "blue", 
    "green"), main = "RNA-Seq Differential Expression Analysis by 3 Methods")
grid.draw(venn.plot)

plot of chunk unnamed-chunk-54

Check: The number of DE genes identified by both 'edgeR' and 'DESeq' in 2c-iv was 2176, which can be seen at the intersect of 'edgeR' and 'DESeq' in the venn diagram (1792+384=2176). From 2d-iv, the fraction of the genes identified using voom+limma also found by edger and DESeq methods was calculated by dividing the intersect of all three methods (1792 in the venn diagram) by the number of results found by voom+limma (1792 + the intersect of voom+edgeR (2) + the intersect of voom+DESeq (0)= 1794 from the venn diagram). This fraction is 0.9988852. Therefore, the previous answers match the venn diagram.

ii. Using the function plotSmear from edgeR you can look at a scatterplot of the observed differential expression (y-axis) against overall abundance (x-axis), both axes logarithmically transformed - to check that putative DE genes seem plausible. Create a smear plot. Within this plot, identify the set of genes which are differentially expressed at an FDR of 1e-5 using all three methods (i.e., the q-values estimated by edgeR, DESeq, and voom are below 1e-5). Explain how you interpret this plot. Do you see reasonable results? Use the output of DGEList as the object of plotSmear. Use de.tags to highlight genes selected by all methods.

Create a scatterplot from edgeR data, highlighting the putative common DE genes.

common <- intersect(voomresults, comboDeSeqEdgeR)
tags <- common
plotSmear(dgeDat, de.tags = tags, main = "Scatterplot Highlighting Putative DE Genes")
abline(h = c(0), col = "blue")

plot of chunk unnamed-chunk-55

Interpretation: On a scatterplot of differential abundance versus overall abundance, the highlighted red points are the genes differentially expressed at an FDR of 1e-5 by all three methods. These results appear reasonable since the highlighted genes are underexpressed or overexpressed, but not at zero (hence, the subset of interest is in fact differentially expressed). The greater number of counts per million, the greater the statistical significance when log fold changes occur. The log fold change tends to zero as the log counts per million increases.

iii. There are two genes identified by edgeR and voom+limma but not by DESeq. Illustrate the logged counts of them. Compare the (log) counts of these two genes with those of two genes identified by the three methods.

Find the two genes identified by edgeR and voom+limma but not by DESeq.

# Genes identified by edgeR + voom.
edgevoom <- intersect(voomresults, edgeRresults)
# Finding the difference between genes identified by edgeR+voom, and the
# genes identified by all 3 analysis methods.
lucky2 <- setdiff(edgevoom, common)

Compare the (log) counts of these two genes with those of two genes identified by the three methods.

# Create a design matrix.
designDiff <- c("batch", "batch", "batch", "chemostat", "chemostat", "chemostat")

# Get data from the two genes identified by edgeR+voom but not by DESeq.
luckyCounts <- rsDat[lucky2, ]
str(luckyCounts)

## 'data.frame':    2 obs. of  6 variables:
##  $ b1: int  1752 181
##  $ b2: int  4275 161
##  $ b3: int  1787 130
##  $ c1: int  497 364
##  $ c2: int  245 293
##  $ c3: int  263 351

# Make a dataframe.
luckyDat <- data.frame(gene.id = factor(rep(rownames(luckyCounts), ncol(luckyCounts))), 
    cond = factor(rep(designDiff, each = nrow(luckyCounts))), log.count = log2(unlist(luckyCounts)))
# Create a plot.
luckyplot <- stripplot(gene.id ~ log.count, luckyDat, group = cond, auto.key = TRUE, 
    jitter = TRUE, main = "DE Genes Identified by voom+edgeR\n but not DESeq")


# Get data from the two genes identified by all 3 methods.
commonDat <- data.frame(common)
set.seed(8)
common2 <- sample(1:nrow(commonDat), 2)
common2genes <- c(common[common2[1]], common[common2[2]])
common2Counts <- rsDat[common2genes, ]
# Make a dataframe.
common2Dat <- data.frame(gene.id = factor(rep(rownames(common2Counts), ncol(common2Counts))), 
    cond = factor(rep(designDiff, each = nrow(common2Counts))), log.count = log2(unlist(common2Counts)))
# Create a plot.
commonplot <- stripplot(gene.id ~ log.count, common2Dat, group = cond, auto.key = TRUE, 
    jitter = TRUE, main = "DE Genes Identified by All Methods")

# Make plots side by side for comparison of log counts.
grid.arrange(luckyplot, commonplot, ncol = 2)

plot of chunk unnamed-chunk-57

Explain: The DE genes identified by voom and edgeR but not by DESeq appears to accept a gene as differentially expressed with a smaller variance of log counts between conditions (i.e. YPL271W) than the genes identified as differentially expressed additionally by DESeq, which may require a larger dispersion to be counted as a hit. However, it is difficult to tell with this small sample size. The different algorithms will discriminate hit genes slightly differently.

Question 3: Compare DEA results between RNA-Seq and array

In question 1, you performed a DEA of array data using limma. In question 2a, you performed a DEA of RNA-Seq data using edgeR, among other methods. In this question you will compare the results of those two analyses. i. Use a Venn diagram to display the overlap and non-overlap of the genes identified as differentially expressed at an FDR of 1e-5 by these analyses.

The number of DE genes found by limma (microarray platform):

mHitGenes <- c(as.character(micro.results$gene.id))
length(mHitGenes)

## [1] 725

The number of DE genes found by both edgeR and limma(microarray platform) analyses:

maEdgeR <- intersect(edgeRresults, mHitGenes)
length(maEdgeR)

## [1] 692

Plot a venn diagram to display the overlap and non-overlap of the genes identified as differentially expressed at an FDR of 1e-5 by these analyses.

DEmethodgenes <- list(EdgeR = edgeRresults, Microarray = mHitGenes)
plot.new()
venn.plot2 <- venn.diagram(DEmethodgenes, filename = NULL, fill = c("red", "blue"), 
    main = "RNA-Seq (edgeR) and Microarray (limma) DE genes", force.unique = T)

grid.draw(venn.plot2)

plot of chunk unnamed-chunk-60

Note that the number of probes you identified as differentially expressed at an FDR of 1e-5 in question 1c(iii) may be different from the number of genes identified as differentially expressed at the same FDR! Why? The use of the argument force.unique of venn.diagram()is crucial here.

Explain: The number of microarray probes identified as differentially expressed at an FDR in 1c(iii) was 725, which is different than the 723 shown in this venn diagram. Microarrays can have more than one probe mapping to each gene, which would each show up as significant. Using force.unique has plotted only results with unique gene ids, such that each significant gene is counted once, despite the number of probes mapped to it.

ii. As expected, more genes were identified as differentially expressed using RNA-Seq data. In this question, you will examine the difference between the q-values from both analyses by overlaying density plots of the q-values from each analysis.
To respond to this question, make two plots. One plot that includes the densities of q-values of the genes analyzed by both platforms (i.e., genes shared by both frames), and another plot that includes the densities of q-values of ALL genes analyzed by at least one of the platforms.

Plot the densities of q-values of genes SHARED by both RNA-Seq (by edgeR) and array.

# Intersect of microarray and edgeR results.
maEdgeR <- intersect(edger.results$gene.id, mkeepDat$gene.id)
row.names(edger.results) <- edger.results$gene.id
# Get q-values.
fedgeR <- c(edger.results[maEdgeR, ])
fma <- subset(mkeepDat, maEdgeR %in% mkeepDat$gene.id)

plot(density(fma$q.value), main = "Q-value Density Plot: Shared edgeR + Array Genes", 
    xlab = "q.values", col = "blue", sub = "edgeR (blue), array (red)")
lines(density(fedgeR$q.value), col = "red")

plot of chunk unnamed-chunk-61

Plot the densities of ALL q-values analyzed by both platforms (pre-FDR cut-off):

plot.new()
plot(density(mkeepDat$q.value), main = "Q-value Density Plot: ALL edgeR + Array Genes", 
    xlab = "q.values", sub = "edgeR (blue), array (red)", col = "blue")
lines(density(edger.results$q.value), lty = 4, col = "red")

plot of chunk unnamed-chunk-62

Make some observations about the strengths of these two platforms. More genes were found to be differentially expressed at significantly low q-values for RNA-Seq overall, than with the microarray data. Microarrays have a set number of probes on the array of predetermined composition, with multiple probes present for a single gene. Microarrays are cheaper, but cannot be used for discovery studies since one can only detect what one already knows is present. Microarrays also have a significant background signal that must be corrected for, such that low intensity signals from a small number of molecules bound may be lost in the background while beyond a certain threshold increases in intensity are no longer proportional to the number of molecules bound. RNA-Seq is then more sensitive and more expensive than microarrays, and allows for a more absolute (rather than relative) quantification of mRNA expression. In the plot of genes shared between microarray and RNA-Seq, the latter had more low q-values, demonstrating this sensitivity. RNA-Seq can also (hypothetically) detect all mRNAs present in a sample, however coverage biases based on the length of a transcipt, the instrument platform used, and GC content may affect sensitivity.

iii. We provide a data set with array expression and count data for 5 interesting genes; below is also code to lead it and a figure depicting it. Consult the DEA results from your previous analyses for these genes. For each gene, state its status with respect to these analyses, ie. where it falls in those Venn diagrams. Comment on the results and plots.

jDat <- dget("featGenesData-q3-DPUT.txt")
str(jDat)

## 'data.frame':    30 obs. of  5 variables:
##  $ gene.id  : Factor w/ 5 levels "YDR345C","YDR384C",..: 1 1 1 1 1 1 2 2 2 2 ...
##  $ cond     : Factor w/ 2 levels "batch","chemostat": 1 1 1 2 2 2 1 1 1 2 ...
##  $ log.count: num  12.59 13.07 12.63 8.33 8.43 ...
##  $ arrayExp : num  11.59 11.62 11.81 7.53 7.73 ...
##  $ probe.id : Factor w/ 5 levels "1770682_at","1778111_at",..: 1 1 1 1 1 1 2 2 2 2 ...


logplot <- stripplot(gene.id ~ log.count, jDat, group = cond, auto.key = TRUE, 
    main = "Gene Expression of \n Five Interesting Genes \n (Log Count)")

arrayExpplot <- stripplot(gene.id ~ arrayExp, jDat, group = cond, auto.key = TRUE, 
    main = "Gene Expression Data of \n Five Interesting Genes \n (Array Expression)")

grid.arrange(logplot, arrayExpplot, ncol = 2)

plot of chunk unnamed-chunk-63

Explanation:

# For gene YGL209W:
("YGL209W" %in% deSeqresults)

## [1] FALSE

("YGL209W" %in% edgeRresults)

## [1] FALSE

("YGL209W" %in% voomresults)

## [1] FALSE

("YGL209W" %in% mHitGenes)

## [1] TRUE


# YGL209W was not differentially expressed at an FDR below 1e-5 for analysis
# by edgeR, DESeq, or voom, and is therefore not present in the venn diagram
# of RNA-Seq DE analysis.  YGL209W was differentially expressed in the
# microarray data when analyzed by limma, and is present in the venn diagram
# showing microarray DE genes, in the section that does not overlap with
# genes found to be DE by edgeR.

# For gene YCL042W:
("YCL042W" %in% common)

## [1] FALSE

("YCL042W" %in% deSeqresults)

## [1] FALSE

("YCL042W" %in% edgeRresults)

## [1] TRUE

("YCL042W" %in% voomresults)

## [1] FALSE

("YCL042W" %in% mHitGenes)

## [1] FALSE


# YCL042W was only differentially expressed at an FDR below 1e-5 by analysis
# with edgeR, and does not overlap with DESeq, or voom analysis, in the venn
# diagram comparing those 3 analysis methods, nor with the microarray data
# plotted in the second venn diagram.

# For gene YBL025W:
("YBL025W" %in% deSeqresults)

## [1] FALSE

("YBL025W" %in% edgeRresults)

## [1] FALSE

("YBL025W" %in% voomresults)

## [1] FALSE

("YBL025W" %in% mHitGenes)

## [1] FALSE


# YBL025W was not differentially expressed at an FDR below 1e-5 for any
# methods of analysis on either platform. This gene would not represent a
# gene in the venn diagrams.

# For gene YDR384C:
("YDR384C" %in% common)

## [1] TRUE

("YDR384C" %in% maEdgeR)

## [1] TRUE


# YDR384C was found to be differentially expressed at an FDR below 1e-5 by
# all methods of analysis (edgeR, DESeq, voom, limma(array)) and on both
# platforms (RNA-Seq and array).  It would be found in the intersect of all
# 3 RNA-Seq analysis methods on the first venn diagram, as well as in the
# intersect of the edgeR and array venn diagram.

# For gene YDR345C:
("YDR345C" %in% common)

## [1] TRUE

("YDR345C" %in% maEdgeR)

## [1] TRUE


# YDR345C was found to be differentially expressed at an FDR below 1e-5 by
# all methods of analysis (edgeR, DESeq, voom, limma(array)) and on both
# platforms (RNA-Seq and array). It would be found in the intersect of all 3
# RNA-Seq analysis methods on the first venn diagram, as well as in the
# intersect of the edgeR and array venn diagram.