Homework 02

In this assignment you will work with data originally published in “A comprehensive comparison of RNA-Seq-based transcriptome analysis from reads to differential gene expression and cross-comparison with microarrays: a case study in Saccharomyces cerevisiae.” by Nookaew et al (Nucleic Acids Res. 2012 Nov 1;40(20):10084-97. PMID 22965124). The article is available here: doi: 10.1093/nar/gks804.

The authors used two different platforms – microarrays and RNA-Seq – to obtain gene expression data for yeast grown in two conditions: batch medium and chemostat. They then compared the results of differential expression analysis (DEA) across platforms and under various combinations of aligners and methods for identifying differential expression. We will do the same. Some of the DEA methods covered in this class were included in their analysis, edgeR and DESeq, while others were not, such as limma and, in particular, the voom function.

We will work with their data obtained from the NCBI SRA and GEO repositories. Because of some differences in pre-processing your results will differ from theirs.

Evaluation

The assignment has 25 points and represents 25% of your overall course grade. Overall presentation and mechanics is worth 5 points, with full points awarded if it's exceptional. The remaining 20 points are spread among the specific questions, as detailed below. We will try to give partial credit when/if we can figure out what you're trying to do.

Each student must submit their own work: no group efforts. It's fine to talk to fellow students and have discussion on the Google Group. If someone is really helpful, give them some credit. You have definitely crossed the line if there's any copying and pasting of code or wording.

Late policy

Late submission will cost 20% (5 points) per day or fraction thereof.

If you expect that you will be late on the deadline, contact us before missing it, and submit whatever partial work you have.

How to submit the work

See the homework submission instructions.

Coaching

Some helpful tips can be found in this coaching document.

# Load required R packages:
library(hexbin)
library(gplots)
library(RColorBrewer)
library(limma)
library(reshape2)
library(lattice)
library(yeast2.db)
library(edgeR)
library(DESeq)
library(VennDiagram)
library(ggplot2)

# Gray Scale from `RColorBrewer`
Grays <- colorRampPalette(rev(brewer.pal(n = 9, "Greys")))

Your mission

Q1) Microarray Analysis

The six samples in this study, 3 replicates for each of the two conditions, were analysed on the Affymetrix Yeast Genome Array 2.0 platform. We have already downloaded the raw CEL files from GEO and normalised them. The normalized data is saved in the file GSE37599-data.tsv.

a) (1pt) Load Microarray Data

Load the normalized data.

array.dat <- read.table("GSE37599-data.tsv", header = TRUE, row.names = 1)

What are dimensions of the dataset?

ncol(array.dat)

## [1] 6

nrow(array.dat)

## [1] 10928

There are 6 columes and 10928 rows of the dataset

In addition to reporting number of rows and columns, make it clear what rows and columns represent and how you're interpreting column names.

colnames(array.dat)

## [1] "b1" "b2" "b3" "c1" "c2" "c3"

rownames(array.dat)[1:10]

##  [1] "1769308_at"   "1769309_at"   "1769310_at"   "1769311_at"  
##  [5] "1769312_at"   "1769313_at"   "1769314_at"   "1769315_at"  
##  [9] "1769316_s_at" "1769317_at"

The columes represents different sampels; there are two groups of samples - batch culture and chemostat culture, and each group contains 3 replicates The row represents different probes on the mircoarray - there are total 10928 probes

b) (2pt) Identify Sample Swap

The labels on two of the samples have been swapped, that is one of the batch samples has been labelled as chemostat and vice-versa. Produce the plots described below and explain how they allow you to identify the swapped samples.

i. (High volume) scatter plot matrix.

hexplom(array.dat, main = "High Volume Scatterplot Matrix")

plot of chunk unnamed-chunk-6

b1 is similar to c1 and c3. c2 is similar to b2 and b3. Looks like b1 and c2 are swapped

ii. A heatmap of the first 100 genes (you can try more but it gets slow).

heatmap.2(t(as.matrix(array.dat[c(1:100), ])), trace = "none", col = Grays, 
    cexCol = 1, cexRow = 1, labCol = FALSE, xlab = "Probes", main = "Heatmap of the first 100 probes")

plot of chunk unnamed-chunk-7

b1, b3 and b1 cluster together. b2, b3 and C2 cluster together. Looks like b1 and c2 are swapped

iii. Compute the Pearson correlation of the samples and plot the results using a heatmap.

heatmap.2(cor(array.dat, method = "pearson"), symm = T, trace = "none", col = Grays, 
    cexCol = 1, cexRow = 1, dendrogram = "row", main = "Heatmap of Pearson correlation")

plot of chunk unnamed-chunk-8

b1, b3 and b1 cluster together. b2, b3 and C2 cluster together. Looks like b1 and c2 are swapped

iv. Scatterplot the six data samples with respect to the first two principal components and label the samples.

xyplot(PC2 ~ PC1, data.frame(prcomp(t(array.dat))$x), panel = function(x, y, 
    ...) {
    panel.xyplot(x, y, ...)
    ltext(x = x, y = y, labels = rownames(t(array.dat)), pos = 1, offset = 1)
}, main = "First Two Principal Components")

plot of chunk unnamed-chunk-9

c3 , b1 and c1 have similar values of PC1. b3, b2 and C2 have similar values of PC1. Looks like b1 and c2 are swapped

c) (2pt) Microarray Differential Expression

Fix the label swap identified in question 1b. We want to swap b1 <–> c2. Revisit one or more elements of question 1b to sanity check before proceeding.

array.dat.cor <- array.dat[, c("c2", "b2", "b3", "c1", "b1", "c3")]
# Check the column names to confirm the colume swap
colnames(array.dat.cor)

## [1] "c2" "b2" "b3" "c1" "b1" "c3"

# Rename the column to the correct names
colnames(array.dat.cor) <- c("b1", "b2", "b3", "c1", "c2", "c3")
# double check to make sure the names are correct
colnames(array.dat.cor)

## [1] "b1" "b2" "b3" "c1" "c2" "c3"

# Use high volume scatter plot matrix and Pearson correlation heatmap to do
# a sanity check
hexplom(array.dat.cor, main = "High Volume Scatterplot Matrix after swapping b1 and c2")

plot of chunk unnamed-chunk-10

heatmap.2(cor(array.dat.cor, method = "pearson"), symm = T, trace = "none", 
    col = Grays, cexCol = 1, cexRow = 1, dendrogram = "row", main = "correlation after swapping b1 and c2")

plot of chunk unnamed-chunk-10

Everything looks fine now

Now use this data to do a differential expression analysis with `limma`.

array.des <- data.frame(Sample = c("b1", "b2", "b3", "c1", "c2", "c3"), Condition = c("Batch", 
    "Batch", "Batch", "Chemostat", "Chemostat", "Chemostat"))
array.des.mat <- model.matrix(~Condition, array.des)
arrayFit <- lmFit(array.dat.cor, array.des.mat)
arrayEbFit <- eBayes(arrayFit)
array.Hit <- topTable(arrayEbFit, coef = "ConditionChemostat", adjust.method = "BH", 
    number = Inf)

Package these results in a data frame with six columns:

probe.id - The array probe id.
gene.id - The id of the gene which the probe overlaps (see below).
p.value - The raw p-value for the probe.
q.value - The BH corrected p-value, aka the q-value.
log.fc - The log fold change which is the column called “logFC” in the limma results table.
test.stat - The test statistics which for limma is the moderated t statistic. This is the column called “t” in the limma results table.

gene.ids <- unlist(mget(rownames(array.Hit), yeast2ORF))
gene.ids <- as.matrix(gene.ids)

# # Sanity check before assmble the data frame tail(gene.ids) str(gene.ids,
# max.level=1) nrow(array.Hit) nrow(gene.ids) identical(rownames(array.Hit),
# rownames(gene.ids))

array.DEA.NA <- data.frame(probe.id = rownames(array.Hit), gene.id = gene.ids, 
    p.value = array.Hit$P.Value, q.value = array.Hit$adj.P.Val, log.fc = array.Hit$logFC, 
    test.stat = array.Hit$t, row.names = NULL)
tail(array.DEA.NA)

##         probe.id gene.id p.value q.value     log.fc  test.stat
## 10923 1777665_at YOR368W  0.9975  0.9980  2.171e-04  3.190e-03
## 10924 1779984_at    <NA>  0.9979  0.9983 -4.025e-04 -2.730e-03
## 10925 1769963_at    <NA>  0.9981  0.9984  3.349e-04  2.419e-03
## 10926 1770373_at    <NA>  0.9982  0.9984  4.174e-04  2.317e-03
## 10927 1776546_at YDL037C  1.0000  1.0000  7.109e-07  7.469e-06
## 10928 1769489_at    <NA>  1.0000  1.0000  3.207e-15  3.864e-14

## Check the resulting data frame identical(as.factor(rownames(array.Hit)),
## array.DEA.NA$probe.id) identical(array.DEA.NA$probe.id,
## as.factor(rownames(gene.ids))) tail(array.DEA.NA)

Remove any rows with probes which don't map to genes. You'll be able to find these because they will have `NA` as their gene id. Work with this data.frame to answer the questions below.

array.results <- array.DEA.NA[complete.cases(array.DEA.NA[, "gene.id"]), ]
rownames(array.results) <- NULL
tail(array.results)

##        probe.id gene.id p.value q.value     log.fc  test.stat
## 5700 1769541_at YPR042C  0.9883  0.9915 -1.261e-03 -1.520e-02
## 5701 1776332_at YLR403W  0.9913  0.9937  8.169e-04  1.125e-02
## 5702 1777350_at YAL060W  0.9944  0.9956  7.173e-04  7.241e-03
## 5703 1771787_at YBR165W  0.9962  0.9970 -3.353e-04 -4.910e-03
## 5704 1777665_at YOR368W  0.9975  0.9980  2.171e-04  3.190e-03
## 5705 1776546_at YDL037C  1.0000  1.0000  7.109e-07  7.469e-06

i. How many probes did we start with and how many remain after removing probes without gene ids?

We started with 10928 probes and 5705 probes remain after removing probes without gene ids

ii. Illustrate the differential expression between the batch and the chemostat samples for the top hit (i.e., probe with the lowest p- or q-value).

The top hit is:

Hit12 <- cbind(as.data.frame(t(array.dat.cor))[, array.results$probe.id[1], 
    drop = FALSE], array.des)
Hit12 <- melt(Hit12, id.vars = c("Sample", "Condition"), variable.name = "Probe", 
    value.name = "Expression")
stripplot(Expression ~ Condition | Probe, Hit12, jitter.data = TRUE, auto.key = TRUE, 
    type = c("p", "a"), grid = TRUE)

plot of chunk unnamed-chunk-14

iii. How many probes are identified as differentially expressed at a false discovery rate (FDR) of 1e-5 (note: this is a FDR cutoff used in the original paper)?

nrow(subset(array.results, subset = q.value <= 1e-05))

## [1] 725

There are 725 probes are identified as differentially expressed at a false discovery rate (FDR) of 1e-5

iv. Save your results for later with `write.table()`.

write.table(array.results, file = "GSE37599-table.tsv", sep = "\t", quote = FALSE, 
    row.names = TRUE, col.names = NA)

Q2) RNA-Seq Analysis

We have aligned the RNA-Seq library using the Stampy aligner and generated count data. The data file is available as stampy.counts.tsv. In this question you will use this data to do a differential expression analysis using different packages from Bioconductor.

a) (1pt) Load RNA Count Data and Sanity Check

Load the count data using read.table; you will need to pass the arguments header=TRUE and row.names=1.

rna.dat <- read.table("stampy.counts.tsv", header = TRUE, row.names = 1)

i) What are dimensions of the dataset?

ncol(rna.dat)

## [1] 6

nrow(rna.dat)

## [1] 6542

There are 6 columes and 6542 rows of the aligned RNA-Seq dataset.

In addition to reporting number of rows and columns, make it clear what rows and columns represent. What is the difference between the rows of this dataset versus rows of the array data in question 1a?

colnames(rna.dat)

## [1] "b1" "b2" "b3" "c1" "c2" "c3"

rownames(rna.dat)[1:10]

##  [1] "15S_rRNA" "21S_rRNA" "HRA1"     "ICR1"     "LSR1"     "NME1"    
##  [7] "Q0045"    "Q0050"    "Q0055"    "Q0065"

The columes represents different sampels; there are two groups of samples - batch culture and chemostat culture, and each group contains 3 replicates
The row represents different transcripts that are aligned to the reference genome - there are total 6542 transcripts that are aligned.
In the mircroarray dataset, the rows represent probes on the chip - not the actual gene.

ii) Do a sanity check to make sure there is no sample swap by plotting a heatmap of the sample correlations.

heatmap.2(cor(rna.dat, method = "pearson"), symm = T, trace = "none", col = Grays, 
    cexCol = 1, cexRow = 1, dendrogram = "row", main = "Pearson correlation of RNA Seq data")

plot of chunk unnamed-chunk-20

There seems to be no sample swap.

b) (2pt) `edgeR` Differential Expression Analysis

Now you will use edgeR to identify differentially expressed genes between the batch medium vs. chemostat conditions.

i) Recall that `edgeR` needs to estimate the dispersion parameter in the negative binomial model using an empirical Bayes method. Estimate the dispersion parameters using `estimateGLMCommonDisp`, `estimateGLMTrendedDisp` and `estimateGLMTagwiseDisp`. Plot the tagwise dispersion against log2-CPM (counts per million).

colnames(rna.dat)

## [1] "b1" "b2" "b3" "c1" "c2" "c3"

(rna.group <- factor(c(rep("1", 3), rep("2", 3))))

## [1] 1 1 1 2 2 2
## Levels: 1 2

dge.glm <- DGEList(counts = rna.dat, group = rna.group)
dge.glm[["samples"]]

##    group lib.size norm.factors
## b1     1  3420851            1
## b2     1  4487585            1
## b3     1  3415378            1
## c1     2  2529571            1
## c2     2  2990956            1
## c3     2  3695672            1

(design <- model.matrix(~rna.group))

##   (Intercept) rna.group2
## 1           1          0
## 2           1          0
## 3           1          0
## 4           1          1
## 5           1          1
## 6           1          1
## attr(,"assign")
## [1] 0 1
## attr(,"contrasts")
## attr(,"contrasts")$rna.group
## [1] "contr.treatment"

dge.glm.com.disp <- estimateGLMCommonDisp(dge.glm, design, verbose = TRUE)

## Disp = 0.00551 , BCV = 0.0742

dge.glm.trend.disp <- estimateGLMTrendedDisp(dge.glm.com.disp, design)
dge.glm.tag.disp <- estimateGLMTagwiseDisp(dge.glm.trend.disp, design)
plotBCV(dge.glm.tag.disp, main = "Biological Coefficient of Variation")

plot of chunk unnamed-chunk-21

ii) Use the glm functionality of `edgeR`, i.e. use the `glmFit` function, to identify differentially expressed genes between conditions.

fit <- glmFit(dge.glm.tag.disp, design)
colnames(coef(fit))

## [1] "(Intercept)" "rna.group2"

lrt <- glmLRT(fit, coef = "rna.group2")
topTags(lrt, n = 30, adjust.method = "BH")

## Coefficient:  rna.group2 
##          logFC logCPM   LR     PValue        FDR
## YIL057C 10.084  8.267 2191  0.000e+00  0.000e+00
## YMR175W  9.426  8.001 1877  0.000e+00  0.000e+00
## YJR095W  9.419  7.811 2247  0.000e+00  0.000e+00
## YMR107W  9.073  8.781 2627  0.000e+00  0.000e+00
## YKL217W  8.573 11.425 1689  0.000e+00  0.000e+00
## YPL201C  7.654  7.105 1951  0.000e+00  0.000e+00
## YGL205W  7.070  9.108 2691  0.000e+00  0.000e+00
## YCR010C  7.013  8.150 1809  0.000e+00  0.000e+00
## YAL054C  6.893 10.583 2542  0.000e+00  0.000e+00
## YBR067C  6.468  9.973 1657  0.000e+00  0.000e+00
## YIL160C  6.126  8.099 2224  0.000e+00  0.000e+00
## YDL085W  5.682  7.558 2049  0.000e+00  0.000e+00
## YLR312C  5.541  7.098 1537  0.000e+00  0.000e+00
## YLR174W  5.459  8.907 1683  0.000e+00  0.000e+00
## YDR256C  5.383  9.180 2305  0.000e+00  0.000e+00
## YKR009C  5.334  9.286 2095  0.000e+00  0.000e+00
## YGR067C  5.160  7.498 1539  0.000e+00  0.000e+00
## YKL187C  4.723  8.749 1630  0.000e+00  0.000e+00
## YDR384C  4.580  9.368 1897  0.000e+00  0.000e+00
## YPR001W  4.552  7.815 1532  0.000e+00  0.000e+00
## YFL030W  4.961  8.487 1482 4.941e-324 1.541e-321
## YOL126C  5.200  7.790 1471 8.251e-322 2.454e-319
## YML042W  5.176  8.624 1441 3.115e-315 8.860e-313
## YGL121C  5.677  6.643 1433 1.688e-313 4.601e-311
## YAL062W  3.983  7.253 1431 4.456e-313 1.166e-310
## YDR345C -3.925  9.956 1430 8.085e-313 2.034e-310
## YPL147W  4.109  8.039 1426 4.391e-312 1.064e-309
## YGR289C  4.183  8.113 1417 3.631e-310 8.484e-308
## YJL116C  4.390  8.286 1402 8.510e-307 1.920e-304
## YIL136W  5.058  9.838 1400 2.354e-306 5.132e-304

tt.glm <- topTags(lrt, n = Inf, adjust.method = "BH")

Package these results in a data.frame called 'edger.results' with five columns:

gene.id - The id of the gene which reads were aligned to.
p.value - The raw p-value for the gene.
q.value - The BH corrected p-value, aka the q-value.
log.fc - The log fold change which is the column called “logFC” in the edgeR results table.
test.stat - The test statistic, which for edgeR is a likelihood ratio. This is the column called “LR” in the edgeR results table.

edger.results <- data.frame(gene.id = rownames(tt.glm$table), p.value = tt.glm$table$PValue, 
    q.value = tt.glm$table$FDR, log.fc = tt.glm$table$logFC, test.stat = tt.glm$table$LR, 
    row.names = NULL)
head(edger.results)

##   gene.id p.value q.value log.fc test.stat
## 1 YIL057C       0       0 10.084      2191
## 2 YMR175W       0       0  9.426      1877
## 3 YJR095W       0       0  9.419      2247
## 4 YMR107W       0       0  9.073      2627
## 5 YKL217W       0       0  8.573      1689
## 6 YPL201C       0       0  7.654      1951

Save your results for later with `write.table()` in file called `stampy.edger.results.tsv`.

write.table(edger.results, file = "stampy.edger.results.tsv", sep = "\t", quote = FALSE, 
    row.names = TRUE, col.names = NA)

iii) How many genes are differentially expressed between conditions at a false discovery rate (FDR) of 1e-5?

nrow(tt.glm$table[tt.glm$table$FDR < 1e-05, ])

## [1] 2669

nrow(edger.results[edger.results$q.value < 1e-05, ])

## [1] 2669

There are 2669 genes differentially expressed between conditions at a false discovery rate (FDR) of 1e-5.

iv) How many genes are differentially over-expressed in chemostat compared to batch medium samples at a false discovery rate (FDR) of 1e-5?

summary(de.glm <- decideTestsDGE(lrt, p = 1e-05, adjust = "BH"))

##    [,1]
## -1 1154
## 0  3873
## 1  1515

There are 1515 genes differentially over-expressed in chemostat compared to batch medium samples at a false discovery rate (FDR) of 1e-5.

c) (2pt) `DESeq` Differential Expression Analysis

Now you will use DESeq to identify differentially expressed genes between the batch medium vs. chemostat conditions.

i) `DESeq` also needs to estimate the dispersion. Use `estimateSizeFactors` and `estimateDispersions` to normalize the data. Plot the estimated dispersions against the mean normalized counts.

deSeqDat <- newCountDataSet(rna.dat, rna.group)
# head(counts(deSeqDat))
deSeqDat <- estimateSizeFactors(deSeqDat)
sizeFactors(deSeqDat)

##     b1     b2     b3     c1     c2     c3 
## 1.0211 1.2962 0.9538 0.7866 0.9004 1.1317

deSeqDat <- estimateDispersions(deSeqDat)
plotDispEsts(deSeqDat)

plot of chunk unnamed-chunk-27

ii) Use the negative binomial test of `DESeq`, i.e. use the `nbinomTest` function, to identify differentially expressed genes between conditions. Note that the output of this function does not return results ordered by p-values or logged fold-changes. You can manually reorder the results if you want (not required for this homework).

DESeq.nbT <- nbinomTest(deSeqDat, levels(rna.group)[1], levels(rna.group)[2])

Package these results in a data.frame called 'deseq.results' with four columns:

gene.id - The id of the gene which reads were aligned to.
p.value - The raw p-value for the gene.
q.value - The BH corrected p-value, aka the q-value.
log.fc - The log fold change which is the column called “logFC” in the edgeR results table.

DESeq.results <- data.frame(gene.id = DESeq.nbT$id, p.value = DESeq.nbT$pval, 
    q.value = DESeq.nbT$padj, log.fc = DESeq.nbT$log2FoldChange, row.names = NULL)
head(DESeq.results)

##    gene.id  p.value  q.value  log.fc
## 1 15S_rRNA 0.130017 0.189902  1.0173
## 2 21S_rRNA 0.047610 0.077711  1.3540
## 3     HRA1 0.017147 0.030716 -0.8235
## 4     ICR1 0.068405 0.107600  0.3793
## 5     LSR1 0.010074 0.018766  0.8122
## 6     NME1 0.003772 0.007596 -1.4870

Save your results for later with `write.table()` in file called `stampy.deseq.results.tsv`.

write.table(DESeq.results, file = "stampy.deseq.results.tsv", sep = "\t", quote = FALSE, 
    row.names = TRUE, col.names = NA)

iii) How many genes are differentially expressed between conditions at a false discovery rate (FDR) of 1e-5?

nrow(DESeq.results[DESeq.results$q.value < 1e-05, ])

## [1] 2198

There are 2198 genes differentially expressed between conditions at a false discovery rate (FDR) of 1e-5.

iv) How many differentially expressed genes are identified by both 'edgeR' and 'DESeq'?

length(intersect(edger.results[edger.results$q.value < 1e-05, "gene.id"], DESeq.results[DESeq.results$q.value < 
    1e-05, "gene.id"]))

## [1] 2176

There are 2176 differentially expressed genes are identified by both 'edgeR' and 'DESeq'

d) (2pt) `voom` Differential Expression Analysis

Now you will use voom+limma to identify differentially expressed genes between the batch medium vs. chemostat conditions.

i) `voom` normalizes the counts before it converts counts to log2-cpm. Use `calcNormFactors` to normalize counts.

norm.factor <- calcNormFactors(rna.dat)

ii) Use voom' to convert count data into logged CPM data and then use 'limma' to identify differentially expressed genes between conditions.

dat.voomed <- voom(rna.dat, design, plot = TRUE, lib.size = colSums(rna.dat) * 
    norm.factor)

plot of chunk unnamed-chunk-34

fit <- lmFit(dat.voomed, design)
fit <- eBayes(fit)
voom.tt <- topTable(fit, coef = "rna.group2", adjust.method = "BH", number = Inf)

Package these results in a data.frame called 'voom.limma.results' with five columns:

gene.id - The id of the gene which reads were aligned to.
p.value - The raw p-value for the gene.
q.value - The BH corrected p-value, aka the q-value.
log.fc - The log fold change which is the column called “logFC” in the limma results table.
test.stat - The test statistic, which is the column called “t”.

voom.limma.results <- data.frame(gene.id = rownames(voom.tt), p.value = voom.tt$P.Value, 
    q.value = voom.tt$adj.P.Val, log.fc = voom.tt$logFC, test.stat = voom.tt$t, 
    row.names = NULL)
head(voom.limma.results)

##   gene.id   p.value   q.value log.fc test.stat
## 1 YAL054C 3.504e-18 2.292e-14  6.822     67.85
## 2 YDR384C 2.665e-17 7.179e-14  4.509     58.15
## 3 YDR345C 4.390e-17 7.179e-14 -3.991    -55.99
## 4 YDR256C 3.515e-17 7.179e-14  5.311     56.94
## 5 YKR009C 8.939e-17 1.170e-13  5.264     53.04
## 6 YIL155C 3.128e-16 2.926e-13  4.798     48.22

Save your results for later with `write.table()` in file called `stampy.limma.results.tsv`.

write.table(voom.limma.results, file = "stampy.limma.results.tsv", sep = "\t", 
    quote = FALSE, row.names = TRUE, col.names = NA)

iii) How many genes are differentially expressed between conditions at a false discovery rate (FDR) of 1e-5?

nrow(voom.limma.results[voom.limma.results$q.value < 1e-05, ])

## [1] 1794

There are 1794 genes differentially expressed between conditions at a false discovery rate (FDR) of 1e-5.

iv) What fraction of the genes identified using `voom+limma` are also found by `edger` and `DESeq` methods? For example if the DE analysis using `voom+limma` found 1000 genes and both `edgeR` and `DESeq` found 500 of these, the fraction of genes found would be \( \frac{500}{1000}=0.5 \).

edger.de.genes <- as.character(edger.results[edger.results$q.value < 1e-05, 
    "gene.id"])
DESeq.de.genes <- as.character(DESeq.results[DESeq.results$q.value < 1e-05, 
    "gene.id"])
voom.limma.de.genes <- as.character(voom.limma.results[voom.limma.results$q.value < 
    1e-05, "gene.id"])

(edfer.DESeq.fraction <- round(length(intersect(intersect(voom.limma.de.genes, 
    edger.de.genes), intersect(voom.limma.de.genes, DESeq.de.genes)))/length(voom.limma.de.genes), 
    digits = 4))

## [1] 0.9989

0.9989 of the genes identified using voom+limma are also found by edger and DESeq methods

e) (3pt) Comparison of Differential Expression Analyses

Now that we have the results of the differential expression analysis performed by three popular methods, we are going to compare and illustrate the results.

i) In previous questions, we noticed that different methods identified different differentially expressed genes. Create a Venn diagram showing all genes identified as differentially expressed by `edgeR`, `DESeq`, and `voom+limma`. Check your if your answers to questions 2c-iv, and 2d-iv are correct.

plot.new()
grid.draw(venn.diagram(list(edgeR = edger.de.genes, DESeq = DESeq.de.genes, 
    voom.limma = voom.limma.de.genes), filename = NULL, fill = c("blue", "orange", 
    "green")))

plot of chunk unnamed-chunk-39

ii) Using the function `plotSmear` function from `edgeR`, you can look at a scatterplot of observed differential expression (y-axis) against overall abundance (x-axis), both axes logarithmically transformed – to check that putative DE genes seem plausible. Create a smear plot. Within this plot, identify the set of genes which are differentially expressed at an FDR of 1e-5 using all three methods (i.e., the q-values estimated by `edgeR`, `DESeq`, and `voom` are below 1e-5). Explain how you interpret this plot. Do you see reasonable results?

DEall <- intersect(intersect(edger.results[edger.results$q.value < 1e-05, "gene.id"], 
    DESeq.results[DESeq.results$q.value < 1e-05, "gene.id"]), voom.limma.results[voom.limma.results$q.value < 
    1e-05, "gene.id"])
plotSmear(lrt, de.tags = DEall, main = "Smear plot of differentially expressed genes")
abline(h = c(-0.5, 0.5), col = "blue")

plot of chunk unnamed-chunk-40

For a smear plot, the top and bottom part should show genes with differential expression while the left part is left out because the number of reads are not sufficient enough to comfirm statistical significance for gene differential expression, as what we see in the above smear plot. And most of gene shown differential expression have at least 1.5 folder of change in expression (marked by the blue lines).

iii) There are two genes identified by `edgeR` and `voom+limma` but not by `DESeq`. Illustrate the logged counts of them. Compare the (log) counts of these two genes with those of two genes identified by the three methods (see example below)

(featureCounts <- rna.dat[setdiff(voom.limma.results[voom.limma.results$q.value < 
    1e-05, "gene.id"], DEall), ])

##           b1   b2   b3  c1  c2  c3
## YMR058W 1752 4275 1787 497 245 263
## YPL271W  181  161  130 364 293 351

featureDat <- data.frame(gene.id = factor(rep(rownames(featureCounts), ncol(featureCounts))), 
    cond = factor(rep(c("batch", "batch", "batch", "chemostat", "chemostat", 
        "chemostat"), each = nrow(featureCounts))), log.count = log2(unlist(featureCounts)))
stripplot(gene.id ~ log.count, featureDat, groups = cond, auto.key = TRUE, jitter = TRUE, 
    main = "genes identified by DESeq only")

plot of chunk unnamed-chunk-41

plot of chunk two-de-gene-demo

Compared to the two genes that are identified by all three methods(bottom plot), the two genes that are identified only by DEseq have either larger varience (YMR058W) or small difference in sample mean (YPL271W) (top plot), which makes it harder to give a statistical significance.

Q3) Compare DEA results between RNA-Seq and array

In question 1, you performed a DEA of array data using limma. In question 2a, you performed a DEA of RNA-Seq data using edgeR, among other methods. In this question you will compare the results of those two analyses.

i) Use a Venn diagram to display the overlap and non-overlap of the genes identified as differentially expressed at an FDR of 1e-5 by these analyses.

Note that the number of probes that you identified as differentially expressed at an FDR of 1e-5 in question 1c(iii) may be different from the number of genes identified as differentially expressed at the same FDR! Why? The use of the argument force.unique of venn.diagram() is crucial here.

Multiple probes can be mapped to the same gene thus the number of probes can be more than the number of genes.

edger.de.genes <- as.character(edger.results[edger.results$q.value < 1e-05, 
    "gene.id"])
array.de.genes <- as.character(array.results[array.results$q.value < 1e-05, 
    "gene.id"])
plot.new()
grid.draw(venn.diagram(list(RNA_Seq = edger.de.genes, MicroArray = array.de.genes), 
    filename = NULL, fill = c("red", "blue"), force.unique = TRUE))

plot of chunk unnamed-chunk-42

ii) As expected, more genes were identified as differentially expressed using RNA-Seq data. In this question, you will examine the difference between the q-values from both analyses by overlaying density plots of the q-values from each analysis.

To respond to this question, make two plots. One plot that includes the densities of q-values of the genes analyzed by both platforms (i.e., genes shared by both data frames), and another plot that includes the densities of q-values of ALL genes analyzed by at least one of the platforms.

de.merged.ture <- merge(array.results, edger.results, all = TRUE, by = "gene.id", 
    suffixes = c(".array", ".rnaseq"))
de.merged.false <- merge(array.results, edger.results, all = FALSE, by = "gene.id", 
    suffixes = c(".array", ".rnaseq"))
densityplot(~q.value.array + q.value.rnaseq, de.merged.ture, plot.points = FALSE, 
    auto.key = list(text = c("MicroArray", "RNA-Seq")), main = "Density plot of q-values (Either)", 
    xlab = "q-value", n = 200, bw = 0.02)

plot of chunk unnamed-chunk-43

densityplot(~q.value.array + q.value.rnaseq, de.merged.false, plot.points = FALSE, 
    auto.key = list(text = c("MicroArray", "RNA-Seq")), main = "Density plot of q-values (Both)", 
    xlab = "q-value", n = 200, bw = 0.02)

plot of chunk unnamed-chunk-43

Make some observations about the strengths of these two platforms.

The peak of RNASeq in both polts are closer to 0 which indicates that RNASeq is more sensitive in destinguish gene expression difference.

iii) We provide a data set with array expression and count data for 5 interesting genes; below is also code to load it and a figure depicting it. Consult the DEA results from your previous analyses for these genes. For each gene, state its status with respect to these analyses, i.e. where it falls in those Venn diagrams. Comment on the results and plots.

plot of chunk stripplot-five-interesting-genes

interestinggenes <- c("YGL209W", "YCL042W", "YBL025W", "YDR384C", "YDR345C")

both <- intersect(edger.results[edger.results$q.value < 1e-05, "gene.id"], array.results[array.results$q.value < 
    1e-05, "gene.id"])
rnaseqOnly <- setdiff(edger.results[edger.results$q.value < 1e-05, "gene.id"], 
    both)
arrayOnly <- setdiff(array.results[array.results$q.value < 1e-05, "gene.id"], 
    both)
(subset(de.merged.false, gene.id %in% interestinggenes, select = grep("gene.id|q.value", 
    colnames(de.merged.false))))

##      gene.id q.value.array q.value.rnaseq
## 120  YBL025W     2.104e-01      6.416e-01
## 526  YCL042W     2.851e-01      2.975e-16
## 1186 YDR345C     4.155e-08     2.034e-310
## 1226 YDR384C     1.012e-07      0.000e+00
## 1939 YGL209W     1.912e-07      4.861e-01


(intersect(interestinggenes, both))

## [1] "YDR384C" "YDR345C"

(intersect(interestinggenes, rnaseqOnly))

## [1] "YCL042W"

(intersect(interestinggenes, arrayOnly))

## [1] "YGL209W"

(setdiff(interestinggenes, c(both, rnaseqOnly, arrayOnly)))

## [1] "YBL025W"

YDR384C and YDR345C fall in the overlaping part of the Microarry-RNASeq Venn diagram. And they are well seperated from the plots and q.value for both of them are less than 1e-5.
YCL042Wfalls in the RNASeq only part of the Microarry-RNASeq Venn diagram. The dots are all located very close to each other on the array panle for YCL042W but relatively well seperated on the RNASeq panle. The q.value is only less than 1e-5 for RNASeq.
YGL209Wfalls in the Microarray only part of the Microarry-RNASeq Venn diagram. The dots are all located very close to each other on the RNASeq panle for YGL209W but relatively well seperated on the array panle. The q.value is only less than 1e-5 for Microarry.
The q.value for YBL025W is higher than 1e-5 for both Microarry and RNASeq. And we can clearly see that are no expression difference from the polt for YBL025W.

Homework 02

Evaluation

Late policy

How to submit the work

Coaching

Your mission

Q1) Microarray Analysis

a) (1pt) Load Microarray Data

Load the normalized data.

What are dimensions of the dataset?

In addition to reporting number of rows and columns, make it clear what rows and columns represent and how you're interpreting column names.

b) (2pt) Identify Sample Swap

i. (High volume) scatter plot matrix.

ii. A heatmap of the first 100 genes (you can try more but it gets slow).

iii. Compute the Pearson correlation of the samples and plot the results using a heatmap.

iv. Scatterplot the six data samples with respect to the first two principal components and label the samples.

c) (2pt) Microarray Differential Expression

Fix the label swap identified in question 1b. We want to swap b1 <–> c2. Revisit one or more elements of question 1b to sanity check before proceeding.

Now use this data to do a differential expression analysis with limma.

Package these results in a data frame with six columns:

Remove any rows with probes which don't map to genes. You'll be able to find these because they will have NA as their gene id. Work with this data.frame to answer the questions below.

i. How many probes did we start with and how many remain after removing probes without gene ids?

ii. Illustrate the differential expression between the batch and the chemostat samples for the top hit (i.e., probe with the lowest p- or q-value).

iii. How many probes are identified as differentially expressed at a false discovery rate (FDR) of 1e-5 (note: this is a FDR cutoff used in the original paper)?

iv. Save your results for later with write.table().

Q2) RNA-Seq Analysis

a) (1pt) Load RNA Count Data and Sanity Check

i) What are dimensions of the dataset?

In addition to reporting number of rows and columns, make it clear what rows and columns represent. What is the difference between the rows of this dataset versus rows of the array data in question 1a?

ii) Do a sanity check to make sure there is no sample swap by plotting a heatmap of the sample correlations.

b) (2pt) edgeR Differential Expression Analysis

ii) Use the glm functionality of edgeR, i.e. use the glmFit function, to identify differentially expressed genes between conditions.

Package these results in a data.frame called 'edger.results' with five columns:

Save your results for later with write.table() in file called stampy.edger.results.tsv.

iii) How many genes are differentially expressed between conditions at a false discovery rate (FDR) of 1e-5?

iv) How many genes are differentially over-expressed in chemostat compared to batch medium samples at a false discovery rate (FDR) of 1e-5?

c) (2pt) DESeq Differential Expression Analysis

i) DESeq also needs to estimate the dispersion. Use estimateSizeFactors and estimateDispersions to normalize the data. Plot the estimated dispersions against the mean normalized counts.

Package these results in a data.frame called 'deseq.results' with four columns:

Save your results for later with write.table() in file called stampy.deseq.results.tsv.

iii) How many genes are differentially expressed between conditions at a false discovery rate (FDR) of 1e-5?

iv) How many differentially expressed genes are identified by both 'edgeR' and 'DESeq'?

d) (2pt) voom Differential Expression Analysis

i) voom normalizes the counts before it converts counts to log2-cpm. Use calcNormFactors to normalize counts.

ii) Use voom' to convert count data into logged CPM data and then use 'limma' to identify differentially expressed genes between conditions.

Package these results in a data.frame called 'voom.limma.results' with five columns:

Save your results for later with write.table() in file called stampy.limma.results.tsv.

iii) How many genes are differentially expressed between conditions at a false discovery rate (FDR) of 1e-5?

iv) What fraction of the genes identified using voom+limma are also found by edger and DESeq methods? For example if the DE analysis using voom+limma found 1000 genes and both edgeR and DESeq found 500 of these, the fraction of genes found would be \( \frac{500}{1000}=0.5 \).

e) (3pt) Comparison of Differential Expression Analyses

i) In previous questions, we noticed that different methods identified different differentially expressed genes. Create a Venn diagram showing all genes identified as differentially expressed by edgeR, DESeq, and voom+limma. Check your if your answers to questions 2c-iv, and 2d-iv are correct.

iii) There are two genes identified by edgeR and voom+limma but not by DESeq. Illustrate the logged counts of them. Compare the (log) counts of these two genes with those of two genes identified by the three methods (see example below)

Q3) Compare DEA results between RNA-Seq and array

i) Use a Venn diagram to display the overlap and non-overlap of the genes identified as differentially expressed at an FDR of 1e-5 by these analyses.

ii) As expected, more genes were identified as differentially expressed using RNA-Seq data. In this question, you will examine the difference between the q-values from both analyses by overlaying density plots of the q-values from each analysis.

Make some observations about the strengths of these two platforms.

Now use this data to do a differential expression analysis with `limma`.

Remove any rows with probes which don't map to genes. You'll be able to find these because they will have `NA` as their gene id. Work with this data.frame to answer the questions below.

iv. Save your results for later with `write.table()`.

b) (2pt) `edgeR` Differential Expression Analysis

ii) Use the glm functionality of `edgeR`, i.e. use the `glmFit` function, to identify differentially expressed genes between conditions.

Save your results for later with `write.table()` in file called `stampy.edger.results.tsv`.

c) (2pt) `DESeq` Differential Expression Analysis

i) `DESeq` also needs to estimate the dispersion. Use `estimateSizeFactors` and `estimateDispersions` to normalize the data. Plot the estimated dispersions against the mean normalized counts.

Save your results for later with `write.table()` in file called `stampy.deseq.results.tsv`.

d) (2pt) `voom` Differential Expression Analysis

i) `voom` normalizes the counts before it converts counts to log2-cpm. Use `calcNormFactors` to normalize counts.

Save your results for later with `write.table()` in file called `stampy.limma.results.tsv`.

iv) What fraction of the genes identified using `voom+limma` are also found by `edger` and `DESeq` methods? For example if the DE analysis using `voom+limma` found 1000 genes and both `edgeR` and `DESeq` found 500 of these, the fraction of genes found would be \( \frac{500}{1000}=0.5 \).

iii) There are two genes identified by `edgeR` and `voom+limma` but not by `DESeq`. Illustrate the logged counts of them. Compare the (log) counts of these two genes with those of two genes identified by the three methods (see example below)