Rebecca Johnston 26/02/2014
In this seminar we will use a table of read counts for genomic features, such as genes or exons, derived from RNA-seq alignment BAM file to perform differential expression analysis. We will use edgeR, DESeq and voom + limma packages for this purpose. We will use the data from this publication. The mRNA from 21 mice belonging to two different strains have been sequenced; the count table lists the number of reads aligned to the annotated mouse genes in the Ensembl database. Our goal is to identify those genes that are differentially expressed between the two strains.
library(edgeR) # Diff expr analysis of RNA-seq data
library(DESeq) # Diff expr analysis of RNA-seq data
library(limma) # Fit linear models
dat <- read.table("bottomly_count_table.tsv", header = TRUE, row.names = 1)
str(dat, max.level = 0)
## 'data.frame': 36536 obs. of 21 variables:
des <- read.table("bottomly_phenodata.tsv", header = TRUE, row.names = 1)
str(des)
## 'data.frame': 21 obs. of 4 variables:
## $ num.tech.reps : int 1 1 1 1 1 1 1 1 1 1 ...
## $ strain : Factor w/ 2 levels "C57BL/6J","DBA/2J": 1 1 1 1 1 1 1 1 1 1 ...
## $ experiment.number: int 6 7 6 7 6 7 6 4 4 4 ...
## $ lane.number : int 1 1 2 2 3 3 5 6 7 8 ...
edgeR implements statistical methods based on the negative binomial distribution for count data. The first version of edgeR implemented exact statistical methods for comparison of multiple groups but is only applicable to experiments with one factor. This sometimes is referred to as classic edgeR. An addition to the classic version is an implementation of generalized linear models (glms) used for analysis of multifactor experiments where the response variables might not follow normal distribution. This sometimes is referred to as glm edgeR. Similar to limma, both versions of edgeR use empirical Bayes methods to estimate gene-specific variation. The classic edgeR uses quantile-adjusted conditional maximum likelihood (qCML) to estimate the disperison while the glm edgeR uses Cox-Reid profile-adjusted likelihood (CR) for dispersion estimation.
edgeR takes in as an argument a table of integer counts, with rows corresponding to genes and columns to samples.
GLM edgeR: GLM approach allows for comparison of multiple groups and/or factor levels. A design matrix can be created in the same way done for limma. Similarly, contrasts matrices can be used to make comparisons. Most of the glm functions have similar names to the classic version with an addition of 'glm'.
The first step is to create a 'group' object describing which group each sample belongs to:
with(des, table(strain))
## strain
## C57BL/6J DBA/2J
## 10 11
group <- factor(c(rep("1", 10), rep("2", 11)))
group
## [1] 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
## Levels: 1 2
DGEList produces an object of type DGEList with can be manipulated in a similar way to any other list object in R:
dgeGlm <- DGEList(counts = dat, group = group)
str(dgeGlm)
## Formal class 'DGEList' [package "edgeR"] with 1 slots
## ..@ .Data:List of 2
## .. ..$ : int [1:36536, 1:21] 369 0 0 0 0 0 21 15 517 0 ...
## .. .. ..- attr(*, "dimnames")=List of 2
## .. .. .. ..$ : chr [1:36536] "ENSMUSG00000000001" "ENSMUSG00000000003" "ENSMUSG00000000028" "ENSMUSG00000000031" ...
## .. .. .. ..$ : chr [1:21] "SRX033480" "SRX033488" "SRX033481" "SRX033489" ...
## .. ..$ :'data.frame': 21 obs. of 3 variables:
## .. .. ..$ group : Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
## .. .. ..$ lib.size : num [1:21] 3040296 6303665 2717092 6545795 3016179 ...
## .. .. ..$ norm.factors: num [1:21] 1 1 1 1 1 1 1 1 1 1 ...
names(dgeGlm)
## [1] "counts" "samples"
dgeGlm[["samples"]]
## group lib.size norm.factors
## SRX033480 1 3040296 1
## SRX033488 1 6303665 1
## SRX033481 1 2717092 1
## SRX033489 1 6545795 1
## SRX033482 1 3016179 1
## SRX033490 1 7097379 1
## SRX033483 1 3707895 1
## SRX033476 1 5165144 1
## SRX033478 1 4953201 1
## SRX033479 1 4192872 1
## SRX033472 2 3970729 1
## SRX033473 2 4733003 1
## SRX033474 2 3702051 1
## SRX033475 2 3569483 1
## SRX033491 2 7276198 1
## SRX033484 2 4422272 1
## SRX033492 2 7115851 1
## SRX033485 2 3467730 1
## SRX033493 2 7339817 1
## SRX033486 2 3879114 1
## SRX033494 2 6771680 1
nrow(dgeGlm[[1]])
## [1] 36536
ncol(dgeGlm[[1]])
## [1] 21
This DGELIst object has two components: one is a matrix called 'counts' storing the count data, the other is a data.frame called 'samples' storing information for samples. Optionally, you can also provide an annotation file for the genes which will be stored in the data.frame 'genes'. The data.frame 'samples', contains the samples IDs, group information and library sizes (or equally library sequence depth). You can either supply the library size info or it can be calculated from the sums of counts in each column.
design <- model.matrix(~group)
design
## (Intercept) group2
## 1 1 0
## 2 1 0
## 3 1 0
## 4 1 0
## 5 1 0
## 6 1 0
## 7 1 0
## 8 1 0
## 9 1 0
## 10 1 0
## 11 1 1
## 12 1 1
## 13 1 1
## 14 1 1
## 15 1 1
## 16 1 1
## 17 1 1
## 18 1 1
## 19 1 1
## 20 1 1
## 21 1 1
## attr(,"assign")
## [1] 0 1
## attr(,"contrasts")
## attr(,"contrasts")$group
## [1] "contr.treatment"
dgeGlmComDisp <- estimateGLMCommonDisp(dgeGlm, design, verbose = TRUE)
## Disp = 0.03893 , BCV = 0.1973
dgeGlmTrendDisp <- estimateGLMTrendedDisp(dgeGlmComDisp, design)
## Loading required package: splines
dgeGlmTagDisp <- estimateGLMTagwiseDisp(dgeGlmTrendDisp, design)
# plot the tagwise dispersion against log2-CPM (counts per million)
plotBCV(dgeGlmTagDisp)
# GLM fit
fit <- glmFit(dgeGlmTagDisp, design)
# Coefficients
colnames(coef(fit))
## [1] "(Intercept)" "group2"
lrt <- glmLRT(fit, coef = 2)
# topTags is analogous to topTable?
topTags(lrt)
## Coefficient: group2
## logFC logCPM LR PValue FDR
## ENSMUSG00000020912 -5.187 3.155 401.8 2.238e-89 8.177e-85
## ENSMUSG00000050141 -5.363 2.319 311.6 9.787e-70 1.788e-65
## ENSMUSG00000035775 -4.543 2.674 298.0 9.042e-67 1.101e-62
## ENSMUSG00000015484 -1.968 4.307 282.5 2.113e-63 1.930e-59
## ENSMUSG00000024248 -3.152 3.463 277.0 3.337e-62 2.438e-58
## ENSMUSG00000030532 1.547 5.572 269.4 1.568e-60 9.548e-57
## ENSMUSG00000054354 -6.283 1.940 249.3 3.657e-56 1.909e-52
## ENSMUSG00000023236 1.426 7.067 245.0 3.133e-55 1.431e-51
## ENSMUSG00000050824 3.705 3.365 229.7 7.099e-52 2.882e-48
## ENSMUSG00000015852 -2.362 3.138 214.8 1.210e-48 4.049e-45
ttGlm <- topTags(lrt, n = Inf)
class(ttGlm)
## [1] "TopTags"
## attr(,"package")
## [1] "edgeR"
nrow(ttGlm$table[ttGlm$table$FDR < 0.01, ])
## [1] 600
hits <- rownames(ttGlm$table[ttGlm$table$FDR < 1e-50, ])
cpm(dgeGlmTagDisp)[hits, ]
## SRX033480 SRX033488 SRX033481 SRX033489 SRX033482
## ENSMUSG00000020912 20.39 12.691 15.83 14.819 19.230
## ENSMUSG00000050141 14.14 10.153 10.67 6.264 9.946
## ENSMUSG00000035775 14.47 10.629 19.51 11.305 12.930
## ENSMUSG00000015484 26.97 33.949 32.76 36.206 31.497
## ENSMUSG00000024248 14.80 26.017 17.30 23.527 18.567
## ENSMUSG00000030532 27.96 24.906 21.35 25.360 18.567
## ENSMUSG00000054354 10.85 7.615 11.41 8.708 8.289
## ENSMUSG00000023236 65.45 78.684 72.50 72.413 76.587
## SRX033490 SRX033483 SRX033476 SRX033478 SRX033479
## ENSMUSG00000020912 17.330 26.430 13.359 23.621 14.310
## ENSMUSG00000050141 7.890 8.361 6.583 11.911 12.402
## ENSMUSG00000035775 12.540 12.136 6.389 13.527 11.210
## ENSMUSG00000015484 33.393 30.745 34.655 26.851 31.482
## ENSMUSG00000024248 25.784 18.879 24.781 14.334 14.310
## ENSMUSG00000030532 26.770 26.430 19.167 23.217 20.034
## ENSMUSG00000054354 7.608 7.012 3.291 4.442 5.963
## ENSMUSG00000023236 80.875 83.336 55.178 57.740 66.303
## SRX033472 SRX033473 SRX033474 SRX033475 SRX033491
## ENSMUSG00000020912 0.5037 0.4226 1.0805 0.2802 1.0995
## ENSMUSG00000050141 0.2518 0.0000 0.0000 0.2802 0.8246
## ENSMUSG00000035775 0.7555 0.0000 0.2701 0.2802 0.6872
## ENSMUSG00000015484 7.0516 4.8595 9.1841 7.8443 8.5209
## ENSMUSG00000024248 2.2666 1.2677 1.8908 3.3618 3.1610
## ENSMUSG00000030532 69.7605 70.5683 63.2082 63.0343 73.8023
## ENSMUSG00000054354 0.0000 0.0000 0.0000 0.0000 0.5497
## ENSMUSG00000023236 195.4301 183.6044 176.1186 156.3252 206.9762
## SRX033484 SRX033492 SRX033485 SRX033493 SRX033486
## ENSMUSG00000020912 0.0000 0.7027 0.8651 0.1362 0.0000
## ENSMUSG00000050141 0.0000 0.1405 0.0000 0.1362 0.2578
## ENSMUSG00000035775 0.6784 0.1405 0.5767 0.1362 1.8045
## ENSMUSG00000015484 9.4974 8.4319 8.9396 6.6759 10.3116
## ENSMUSG00000024248 1.5829 1.6864 1.7302 1.7712 2.5779
## ENSMUSG00000030532 70.0997 70.6873 62.5770 78.6123 59.2919
## ENSMUSG00000054354 0.0000 0.0000 0.0000 0.0000 0.0000
## ENSMUSG00000023236 177.9628 192.8090 207.0519 201.6399 204.6859
## SRX033494
## ENSMUSG00000020912 0.0000
## ENSMUSG00000050141 0.1477
## ENSMUSG00000035775 0.5907
## ENSMUSG00000015484 8.8604
## ENSMUSG00000024248 3.1012
## ENSMUSG00000030532 71.3265
## ENSMUSG00000054354 0.0000
## ENSMUSG00000023236 192.7144
summary(deGlm <- decideTestsDGE(lrt, p = 0.05, adjust = "BH"))
## [,1]
## -1 451
## 0 35660
## 1 425
There are 451 genes that are under-expressed in group 2 (DBA/2J) compared to group 1 (C57BL/6J), 3.566 × 104 genes show no differences in expression, while 425 genes are over-expressed.
# plotting the tagwise log fold changes against log-cpm
tagsGlm <- rownames(dgeGlmTagDisp)[as.logical(deGlm)]
plotSmear(lrt, de.tags = tagsGlm)
abline(h = c(-2, 2), col = "blue")
Redo the above analysis but first filter the data and remove any gene that has: 1. count equal to zero across all samples 2. count equal to zero in at least one sample in each genotype group.
1. Filter = Count equal to zero across all samples
Trial: Since the head(dat) contains a row with sum equal to zero, use this mini set of rows as a trial.
(miniDat <- head(dat))
## SRX033480 SRX033488 SRX033481 SRX033489 SRX033482
## ENSMUSG00000000001 369 744 287 769 348
## ENSMUSG00000000003 0 0 0 0 0
## ENSMUSG00000000028 0 1 0 1 1
## ENSMUSG00000000031 0 0 0 0 0
## ENSMUSG00000000037 0 1 1 5 0
## ENSMUSG00000000049 0 1 0 1 0
## SRX033490 SRX033483 SRX033476 SRX033478 SRX033479
## ENSMUSG00000000001 803 433 469 585 321
## ENSMUSG00000000003 0 0 0 0 0
## ENSMUSG00000000028 1 0 7 6 1
## ENSMUSG00000000031 0 0 0 0 0
## ENSMUSG00000000037 4 0 0 0 0
## ENSMUSG00000000049 0 0 0 0 0
## SRX033472 SRX033473 SRX033474 SRX033475 SRX033491
## ENSMUSG00000000001 301 461 309 374 781
## ENSMUSG00000000003 0 0 0 0 0
## ENSMUSG00000000028 1 1 1 1 1
## ENSMUSG00000000031 0 0 0 0 0
## ENSMUSG00000000037 4 1 1 0 1
## ENSMUSG00000000049 0 0 0 0 0
## SRX033484 SRX033492 SRX033485 SRX033493 SRX033486
## ENSMUSG00000000001 555 820 294 758 419
## ENSMUSG00000000003 0 0 0 0 0
## ENSMUSG00000000028 2 1 1 4 1
## ENSMUSG00000000031 0 0 0 0 0
## ENSMUSG00000000037 2 1 1 1 1
## ENSMUSG00000000049 0 0 0 0 0
## SRX033494
## ENSMUSG00000000001 857
## ENSMUSG00000000003 0
## ENSMUSG00000000028 5
## ENSMUSG00000000031 0
## ENSMUSG00000000037 2
## ENSMUSG00000000049 0
# Keep the rows where rowSums != 0
(miniDat <- miniDat[rowSums(miniDat) != 0, ])
## SRX033480 SRX033488 SRX033481 SRX033489 SRX033482
## ENSMUSG00000000001 369 744 287 769 348
## ENSMUSG00000000028 0 1 0 1 1
## ENSMUSG00000000037 0 1 1 5 0
## ENSMUSG00000000049 0 1 0 1 0
## SRX033490 SRX033483 SRX033476 SRX033478 SRX033479
## ENSMUSG00000000001 803 433 469 585 321
## ENSMUSG00000000028 1 0 7 6 1
## ENSMUSG00000000037 4 0 0 0 0
## ENSMUSG00000000049 0 0 0 0 0
## SRX033472 SRX033473 SRX033474 SRX033475 SRX033491
## ENSMUSG00000000001 301 461 309 374 781
## ENSMUSG00000000028 1 1 1 1 1
## ENSMUSG00000000037 4 1 1 0 1
## ENSMUSG00000000049 0 0 0 0 0
## SRX033484 SRX033492 SRX033485 SRX033493 SRX033486
## ENSMUSG00000000001 555 820 294 758 419
## ENSMUSG00000000028 2 1 1 4 1
## ENSMUSG00000000037 2 1 1 1 1
## ENSMUSG00000000049 0 0 0 0 0
## SRX033494
## ENSMUSG00000000001 857
## ENSMUSG00000000028 5
## ENSMUSG00000000037 2
## ENSMUSG00000000049 0
It works! So let's apply filter 1 to dat:
datFilt1 <- dat[rowSums(dat) != 0, ]
dim(datFilt1)
## [1] 13932 21
nrow(dat) - nrow(datFilt1)
## [1] 22604
Therefore 22604 genes were removed using the filter, so we are left with 13932 genes.
Let's continue with egdeR analysis using the same group and design:
dgeGlmFilt1 <- DGEList(counts = datFilt1, group = group)
dgeGlmComDispFilt1 <- estimateGLMCommonDisp(dgeGlmFilt1, design, verbose = TRUE)
## Disp = 0.03893 , BCV = 0.1973
dgeGlmTrendDispFilt1 <- estimateGLMTrendedDisp(dgeGlmComDispFilt1, design)
dgeGlmTagDispFilt1 <- estimateGLMTagwiseDisp(dgeGlmTrendDispFilt1, design)
plotBCV(dgeGlmTagDispFilt1)
fitFilt1 <- glmFit(dgeGlmTagDispFilt1, design)
colnames(coef(fitFilt1))
## [1] "(Intercept)" "group2"
lrtFilt1 <- glmLRT(fitFilt1, coef = 2)
topTags(lrtFilt1)
## Coefficient: group2
## logFC logCPM LR PValue FDR
## ENSMUSG00000020912 -5.187 3.155 398.0 1.521e-88 2.119e-84
## ENSMUSG00000050141 -5.363 2.319 313.3 4.148e-70 2.889e-66
## ENSMUSG00000035775 -4.543 2.674 296.8 1.624e-66 7.541e-63
## ENSMUSG00000015484 -1.968 4.307 280.6 5.544e-63 1.931e-59
## ENSMUSG00000024248 -3.152 3.463 275.7 6.402e-62 1.784e-58
## ENSMUSG00000030532 1.547 5.572 271.0 7.004e-61 1.626e-57
## ENSMUSG00000054354 -6.282 1.940 250.8 1.745e-56 3.473e-53
## ENSMUSG00000023236 1.426 7.067 247.8 7.808e-56 1.360e-52
## ENSMUSG00000050824 3.705 3.365 227.9 1.699e-51 2.631e-48
## ENSMUSG00000074252 -4.872 1.610 217.5 3.104e-49 4.324e-46
ttGlmFilt1 <- topTags(lrtFilt1, n = Inf)
class(ttGlmFilt1)
## [1] "TopTags"
## attr(,"package")
## [1] "edgeR"
nrow(ttGlmFilt1$table[ttGlm$table$FDR < 0.01, ])
## [1] 600
hitsFilt1 <- rownames(ttGlmFilt1$table[ttGlmFilt1$table$FDR < 1e-50, ])
cpm(dgeGlmTagDispFilt1)[hitsFilt1, ]
## SRX033480 SRX033488 SRX033481 SRX033489 SRX033482
## ENSMUSG00000020912 20.39 12.691 15.83 14.819 19.230
## ENSMUSG00000050141 14.14 10.153 10.67 6.264 9.946
## ENSMUSG00000035775 14.47 10.629 19.51 11.305 12.930
## ENSMUSG00000015484 26.97 33.949 32.76 36.206 31.497
## ENSMUSG00000024248 14.80 26.017 17.30 23.527 18.567
## ENSMUSG00000030532 27.96 24.906 21.35 25.360 18.567
## ENSMUSG00000054354 10.85 7.615 11.41 8.708 8.289
## ENSMUSG00000023236 65.45 78.684 72.50 72.413 76.587
## SRX033490 SRX033483 SRX033476 SRX033478 SRX033479
## ENSMUSG00000020912 17.330 26.430 13.359 23.621 14.310
## ENSMUSG00000050141 7.890 8.361 6.583 11.911 12.402
## ENSMUSG00000035775 12.540 12.136 6.389 13.527 11.210
## ENSMUSG00000015484 33.393 30.745 34.655 26.851 31.482
## ENSMUSG00000024248 25.784 18.879 24.781 14.334 14.310
## ENSMUSG00000030532 26.770 26.430 19.167 23.217 20.034
## ENSMUSG00000054354 7.608 7.012 3.291 4.442 5.963
## ENSMUSG00000023236 80.875 83.336 55.178 57.740 66.303
## SRX033472 SRX033473 SRX033474 SRX033475 SRX033491
## ENSMUSG00000020912 0.5037 0.4226 1.0805 0.2802 1.0995
## ENSMUSG00000050141 0.2518 0.0000 0.0000 0.2802 0.8246
## ENSMUSG00000035775 0.7555 0.0000 0.2701 0.2802 0.6872
## ENSMUSG00000015484 7.0516 4.8595 9.1841 7.8443 8.5209
## ENSMUSG00000024248 2.2666 1.2677 1.8908 3.3618 3.1610
## ENSMUSG00000030532 69.7605 70.5683 63.2082 63.0343 73.8023
## ENSMUSG00000054354 0.0000 0.0000 0.0000 0.0000 0.5497
## ENSMUSG00000023236 195.4301 183.6044 176.1186 156.3252 206.9762
## SRX033484 SRX033492 SRX033485 SRX033493 SRX033486
## ENSMUSG00000020912 0.0000 0.7027 0.8651 0.1362 0.0000
## ENSMUSG00000050141 0.0000 0.1405 0.0000 0.1362 0.2578
## ENSMUSG00000035775 0.6784 0.1405 0.5767 0.1362 1.8045
## ENSMUSG00000015484 9.4974 8.4319 8.9396 6.6759 10.3116
## ENSMUSG00000024248 1.5829 1.6864 1.7302 1.7712 2.5779
## ENSMUSG00000030532 70.0997 70.6873 62.5770 78.6123 59.2919
## ENSMUSG00000054354 0.0000 0.0000 0.0000 0.0000 0.0000
## ENSMUSG00000023236 177.9628 192.8090 207.0519 201.6399 204.6859
## SRX033494
## ENSMUSG00000020912 0.0000
## ENSMUSG00000050141 0.1477
## ENSMUSG00000035775 0.5907
## ENSMUSG00000015484 8.8604
## ENSMUSG00000024248 3.1012
## ENSMUSG00000030532 71.3265
## ENSMUSG00000054354 0.0000
## ENSMUSG00000023236 192.7144
(summaryFilt1 <- summary(deGlmFilt1 <- decideTestsDGE(lrtFilt1, p = 0.05, adjust = "BH")))
## [,1]
## -1 633
## 0 12680
## 1 619
After applying filter 1, there are 633 genes that are under-expressed in group 2 (DBA/2J) compared to group 1 (C57BL/6J), 1.268 × 104 genes show no differences in expression, while 619 genes are over-expressed. Therefore we obtain more hits with filter 1 than without it!
# plotting the tagwise log fold changes against log-cpm
tagsGlmFilt1 <- rownames(dgeGlmTagDispFilt1)[as.logical(deGlmFilt1)]
plotSmear(lrtFilt1, de.tags = tagsGlmFilt1)
abline(h = c(-2, 2), col = "blue")
2. Filter = Count equal to zero in at least one sample in each genotype group
Again, I will first test the filter on a subset of dat. I will use head(dat) as it contains rows with sum equal to zero, and rows where 1 or more samples have counts equal to zero.
Therefore, if 1 or more samples has zero reads, and we multiply all reads from each sample together (row multiplication), we will obtain 0 and can remove these rows. Not sure what function to use for this, apply?
(miniDat <- head(dat))
## SRX033480 SRX033488 SRX033481 SRX033489 SRX033482
## ENSMUSG00000000001 369 744 287 769 348
## ENSMUSG00000000003 0 0 0 0 0
## ENSMUSG00000000028 0 1 0 1 1
## ENSMUSG00000000031 0 0 0 0 0
## ENSMUSG00000000037 0 1 1 5 0
## ENSMUSG00000000049 0 1 0 1 0
## SRX033490 SRX033483 SRX033476 SRX033478 SRX033479
## ENSMUSG00000000001 803 433 469 585 321
## ENSMUSG00000000003 0 0 0 0 0
## ENSMUSG00000000028 1 0 7 6 1
## ENSMUSG00000000031 0 0 0 0 0
## ENSMUSG00000000037 4 0 0 0 0
## ENSMUSG00000000049 0 0 0 0 0
## SRX033472 SRX033473 SRX033474 SRX033475 SRX033491
## ENSMUSG00000000001 301 461 309 374 781
## ENSMUSG00000000003 0 0 0 0 0
## ENSMUSG00000000028 1 1 1 1 1
## ENSMUSG00000000031 0 0 0 0 0
## ENSMUSG00000000037 4 1 1 0 1
## ENSMUSG00000000049 0 0 0 0 0
## SRX033484 SRX033492 SRX033485 SRX033493 SRX033486
## ENSMUSG00000000001 555 820 294 758 419
## ENSMUSG00000000003 0 0 0 0 0
## ENSMUSG00000000028 2 1 1 4 1
## ENSMUSG00000000031 0 0 0 0 0
## ENSMUSG00000000037 2 1 1 1 1
## ENSMUSG00000000049 0 0 0 0 0
## SRX033494
## ENSMUSG00000000001 857
## ENSMUSG00000000003 0
## ENSMUSG00000000028 5
## ENSMUSG00000000031 0
## ENSMUSG00000000037 2
## ENSMUSG00000000049 0
# Use apply function with 'MARGIN = 1' to indicate by row, 'prod' = product
apply(miniDat, 1, prod)
## ENSMUSG00000000001 ENSMUSG00000000003 ENSMUSG00000000028
## 2.945e+56 0.000e+00 0.000e+00
## ENSMUSG00000000031 ENSMUSG00000000037 ENSMUSG00000000049
## 0.000e+00 0.000e+00 0.000e+00
# Use this together with indexing:
miniDat[apply(miniDat, 1, prod) != 0, ]
## SRX033480 SRX033488 SRX033481 SRX033489 SRX033482
## ENSMUSG00000000001 369 744 287 769 348
## SRX033490 SRX033483 SRX033476 SRX033478 SRX033479
## ENSMUSG00000000001 803 433 469 585 321
## SRX033472 SRX033473 SRX033474 SRX033475 SRX033491
## ENSMUSG00000000001 301 461 309 374 781
## SRX033484 SRX033492 SRX033485 SRX033493 SRX033486
## ENSMUSG00000000001 555 820 294 758 419
## SRX033494
## ENSMUSG00000000001 857
It works! So let's apply filter 2 to dat:
datFilt2 <- dat[apply(dat, 1, prod) != 0, ]
dim(datFilt2)
## [1] 9196 21
nrow(dat) - nrow(datFilt2)
## [1] 27340
Therefore 27340 genes were removed using the filter, and we are left with 9196 genes.
Let's continue with egdeR analysis using the same group and design:
dgeGlmFilt2 <- DGEList(counts = datFilt2, group = group)
dgeGlmComDispFilt2 <- estimateGLMCommonDisp(dgeGlmFilt2, design, verbose = TRUE)
## Disp = 0.03704 , BCV = 0.1925
dgeGlmTrendDispFilt2 <- estimateGLMTrendedDisp(dgeGlmComDispFilt2, design)
dgeGlmTagDispFilt2 <- estimateGLMTagwiseDisp(dgeGlmTrendDispFilt2, design)
plotBCV(dgeGlmTagDispFilt2)
fitFilt2 <- glmFit(dgeGlmTagDispFilt2, design)
colnames(coef(fitFilt2))
## [1] "(Intercept)" "group2"
lrtFilt2 <- glmLRT(fitFilt2, coef = 2)
topTags(lrtFilt2)
## Coefficient: group2
## logFC logCPM LR PValue FDR
## ENSMUSG00000015484 -1.968 4.309 288.3 1.165e-64 1.071e-60
## ENSMUSG00000024248 -3.153 3.466 280.1 7.101e-63 3.265e-59
## ENSMUSG00000030532 1.546 5.574 261.6 7.742e-59 2.373e-55
## ENSMUSG00000023236 1.426 7.069 255.5 1.609e-57 3.699e-54
## ENSMUSG00000050824 3.704 3.367 233.3 1.109e-52 2.039e-49
## ENSMUSG00000015852 -2.362 3.141 217.7 2.860e-49 4.384e-46
## ENSMUSG00000028393 1.802 6.113 212.1 4.789e-48 6.291e-45
## ENSMUSG00000072572 -2.010 3.709 188.1 8.150e-43 9.368e-40
## ENSMUSG00000066800 -2.321 3.520 182.9 1.147e-41 1.172e-38
## ENSMUSG00000074064 1.392 6.755 175.6 4.387e-40 4.034e-37
ttGlmFilt2 <- topTags(lrtFilt2, n = Inf)
class(ttGlmFilt2)
## [1] "TopTags"
## attr(,"package")
## [1] "edgeR"
nrow(ttGlmFilt2$table[ttGlm$table$FDR < 0.01, ])
## [1] 600
hitsFilt2 <- rownames(ttGlmFilt2$table[ttGlmFilt2$table$FDR < 1e-50, ])
cpm(dgeGlmTagDispFilt2)[hitsFilt2, ]
## SRX033480 SRX033488 SRX033481 SRX033489 SRX033482
## ENSMUSG00000015484 27.01 34.03 32.80 36.27 31.54
## ENSMUSG00000024248 14.82 26.08 17.32 23.57 18.59
## ENSMUSG00000030532 28.00 24.97 21.38 25.41 18.59
## ENSMUSG00000023236 65.55 78.88 72.61 72.54 76.70
## SRX033490 SRX033483 SRX033476 SRX033478 SRX033479
## ENSMUSG00000015484 33.46 30.79 34.73 26.95 31.53
## ENSMUSG00000024248 25.83 18.91 24.83 14.39 14.33
## ENSMUSG00000030532 26.82 26.47 19.21 23.31 20.07
## ENSMUSG00000023236 81.03 83.45 55.29 57.96 66.41
## SRX033472 SRX033473 SRX033474 SRX033475 SRX033491
## ENSMUSG00000015484 7.063 4.867 9.198 7.856 8.534
## ENSMUSG00000024248 2.270 1.270 1.894 3.367 3.166
## ENSMUSG00000030532 69.873 70.673 63.304 63.126 73.915
## ENSMUSG00000023236 195.745 183.877 176.384 156.553 207.292
## SRX033484 SRX033492 SRX033485 SRX033493 SRX033486
## ENSMUSG00000015484 9.515 8.444 8.955 6.689 10.327
## ENSMUSG00000024248 1.586 1.689 1.733 1.775 2.582
## ENSMUSG00000030532 70.228 70.792 62.683 78.764 59.380
## ENSMUSG00000023236 178.288 193.095 207.404 202.029 204.990
## SRX033494
## ENSMUSG00000015484 8.879
## ENSMUSG00000024248 3.108
## ENSMUSG00000030532 71.480
## ENSMUSG00000023236 193.128
(summaryFilt2 <- summary(deGlmFilt2 <- decideTestsDGE(lrtFilt2, p = 0.05, adjust = "BH")))
## [,1]
## -1 503
## 0 8183
## 1 510
After applying filter 2, there are 503 genes that are under-expressed in group 2 (DBA/2J) compared to group 1 (C57BL/6J), 8183 genes show no differences in expression, while 510 genes are over-expressed.
# plotting the tagwise log fold changes against log-cpm
tagsGlmFilt2 <- rownames(dgeGlmTagDispFilt2)[as.logical(deGlmFilt2)]
plotSmear(lrtFilt2, de.tags = tagsGlmFilt2)
abline(h = c(-2, 2), col = "blue")
We will try the differential expression analysis of the same dataset using DESeq.
# read in same count table data and grouping information
deSeqDat <- newCountDataSet(dat, group)
head(counts(deSeqDat))
## SRX033480 SRX033488 SRX033481 SRX033489 SRX033482
## ENSMUSG00000000001 369 744 287 769 348
## ENSMUSG00000000003 0 0 0 0 0
## ENSMUSG00000000028 0 1 0 1 1
## ENSMUSG00000000031 0 0 0 0 0
## ENSMUSG00000000037 0 1 1 5 0
## ENSMUSG00000000049 0 1 0 1 0
## SRX033490 SRX033483 SRX033476 SRX033478 SRX033479
## ENSMUSG00000000001 803 433 469 585 321
## ENSMUSG00000000003 0 0 0 0 0
## ENSMUSG00000000028 1 0 7 6 1
## ENSMUSG00000000031 0 0 0 0 0
## ENSMUSG00000000037 4 0 0 0 0
## ENSMUSG00000000049 0 0 0 0 0
## SRX033472 SRX033473 SRX033474 SRX033475 SRX033491
## ENSMUSG00000000001 301 461 309 374 781
## ENSMUSG00000000003 0 0 0 0 0
## ENSMUSG00000000028 1 1 1 1 1
## ENSMUSG00000000031 0 0 0 0 0
## ENSMUSG00000000037 4 1 1 0 1
## ENSMUSG00000000049 0 0 0 0 0
## SRX033484 SRX033492 SRX033485 SRX033493 SRX033486
## ENSMUSG00000000001 555 820 294 758 419
## ENSMUSG00000000003 0 0 0 0 0
## ENSMUSG00000000028 2 1 1 4 1
## ENSMUSG00000000031 0 0 0 0 0
## ENSMUSG00000000037 2 1 1 1 1
## ENSMUSG00000000049 0 0 0 0 0
## SRX033494
## ENSMUSG00000000001 857
## ENSMUSG00000000003 0
## ENSMUSG00000000028 5
## ENSMUSG00000000031 0
## ENSMUSG00000000037 2
## ENSMUSG00000000049 0
Next, we estimate the size factors to account for differences in library coverage and estimate the variance:
deSeqDat <- estimateSizeFactors(deSeqDat)
sizeFactors(deSeqDat)
## SRX033480 SRX033488 SRX033481 SRX033489 SRX033482 SRX033490 SRX033483
## 0.6439 1.3454 0.5785 1.4295 0.6355 1.5240 0.7933
## SRX033476 SRX033478 SRX033479 SRX033472 SRX033473 SRX033474 SRX033475
## 1.1272 1.0772 0.8984 0.8886 1.0255 0.7987 0.7796
## SRX033491 SRX033484 SRX033492 SRX033485 SRX033493 SRX033486 SRX033494
## 1.6162 0.9882 1.5720 0.7558 1.5922 0.8264 1.4715
deSeqDat <- estimateDispersions(deSeqDat)
# plotting the estimated dispersions against the mean normalized counts
plotDispEsts(deSeqDat)
Next, we will fit the model and examine the results:
# This will take ~ 1 min
results <- nbinomTest(deSeqDat, levels(group)[1], levels(group)[2])
str(results)
## 'data.frame': 36536 obs. of 8 variables:
## $ id : chr "ENSMUSG00000000001" "ENSMUSG00000000003" "ENSMUSG00000000028" "ENSMUSG00000000031" ...
## $ baseMean : num 489.18 0 1.57 0 1.1 ...
## $ baseMeanA : num 509.685 0 1.657 0 0.859 ...
## $ baseMeanB : num 470.53 0 1.49 0 1.32 ...
## $ foldChange : num 0.923 NaN 0.898 NaN 1.537 ...
## $ log2FoldChange: num -0.115 NaN -0.156 NaN 0.62 ...
## $ pval : num 0.498 NA 0.829 NA 0.968 ...
## $ padj : num 1 NA 1 NA 1 ...
plotMA(results)
normFactor <- calcNormFactors(dat)
datVoom <- voom(dat, design, plot = TRUE, lib.size = colSums(dat) * normFactor)
# datVoom
fit <- lmFit(datVoom, design)
fit <- eBayes(fit)
topTable(fit)
## X.Intercept. group2 AveExpr F P.Value
## ENSMUSG00000025867 12.477 -0.013979 12.471 154234 2.584e-55
## ENSMUSG00000022892 12.268 -0.086479 12.223 138459 1.103e-54
## ENSMUSG00000037852 12.321 -0.237470 12.194 134826 1.576e-54
## ENSMUSG00000042700 10.716 -0.048029 10.692 133973 1.717e-54
## ENSMUSG00000029461 10.300 0.020715 10.311 121795 6.182e-54
## ENSMUSG00000020658 9.610 -0.019628 9.601 119935 7.604e-54
## ENSMUSG00000060261 9.469 -0.015743 9.461 117841 9.635e-54
## ENSMUSG00000032549 11.904 0.003545 11.905 117324 1.022e-53
## ENSMUSG00000024462 10.227 -0.138929 10.153 116767 1.090e-53
## ENSMUSG00000030102 12.085 -0.026149 12.073 112155 1.874e-53
## adj.P.Val
## ENSMUSG00000025867 9.442e-51
## ENSMUSG00000022892 1.568e-50
## ENSMUSG00000037852 1.568e-50
## ENSMUSG00000042700 1.568e-50
## ENSMUSG00000029461 4.424e-50
## ENSMUSG00000020658 4.424e-50
## ENSMUSG00000060261 4.424e-50
## ENSMUSG00000032549 4.424e-50
## ENSMUSG00000024462 4.424e-50
## ENSMUSG00000030102 6.417e-50
Choose a specific threshold for the adjusted p value, find the genes identified as differentially expressed using each of edgeR, DESeq and voom + limma. Compare the number of genes in these 3 lists, and draw a venn digram demonstrating the overlap (if any!).
For another day :)