STAT540 Seminar 07: RNA-seq Differential Expression Analysis

Rebecca Johnston 26/02/2014

Load required packages and RNA-seq data
edgeR
DESeq
Voom & limma
Take Home Problem

In this seminar we will use a table of read counts for genomic features, such as genes or exons, derived from RNA-seq alignment BAM file to perform differential expression analysis. We will use edgeR, DESeq and voom + limma packages for this purpose. We will use the data from this publication. The mRNA from 21 mice belonging to two different strains have been sequenced; the count table lists the number of reads aligned to the annotated mouse genes in the Ensembl database. Our goal is to identify those genes that are differentially expressed between the two strains.

Load required packages and RNA-seq data

library(edgeR)  # Diff expr analysis of RNA-seq data
library(DESeq)  # Diff expr analysis of RNA-seq data
library(limma)  # Fit linear models

dat <- read.table("bottomly_count_table.tsv", header = TRUE, row.names = 1)
str(dat, max.level = 0)

## 'data.frame':    36536 obs. of  21 variables:

des <- read.table("bottomly_phenodata.tsv", header = TRUE, row.names = 1)
str(des)

## 'data.frame':    21 obs. of  4 variables:
##  $ num.tech.reps    : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ strain           : Factor w/ 2 levels "C57BL/6J","DBA/2J": 1 1 1 1 1 1 1 1 1 1 ...
##  $ experiment.number: int  6 7 6 7 6 7 6 4 4 4 ...
##  $ lane.number      : int  1 1 2 2 3 3 5 6 7 8 ...

edgeR

edgeR implements statistical methods based on the negative binomial distribution for count data. The first version of edgeR implemented exact statistical methods for comparison of multiple groups but is only applicable to experiments with one factor. This sometimes is referred to as classic edgeR. An addition to the classic version is an implementation of generalized linear models (glms) used for analysis of multifactor experiments where the response variables might not follow normal distribution. This sometimes is referred to as glm edgeR. Similar to limma, both versions of edgeR use empirical Bayes methods to estimate gene-specific variation. The classic edgeR uses quantile-adjusted conditional maximum likelihood (qCML) to estimate the disperison while the glm edgeR uses Cox-Reid profile-adjusted likelihood (CR) for dispersion estimation.

edgeR takes in as an argument a table of integer counts, with rows corresponding to genes and columns to samples.

GLM edgeR: GLM approach allows for comparison of multiple groups and/or factor levels. A design matrix can be created in the same way done for limma. Similarly, contrasts matrices can be used to make comparisons. Most of the glm functions have similar names to the classic version with an addition of 'glm'.

The first step is to create a 'group' object describing which group each sample belongs to:

with(des, table(strain))

## strain
## C57BL/6J   DBA/2J 
##       10       11

group <- factor(c(rep("1", 10), rep("2", 11)))
group

##  [1] 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
## Levels: 1 2

DGEList produces an object of type DGEList with can be manipulated in a similar way to any other list object in R:

dgeGlm <- DGEList(counts = dat, group = group)
str(dgeGlm)

## Formal class 'DGEList' [package "edgeR"] with 1 slots
##   ..@ .Data:List of 2
##   .. ..$ : int [1:36536, 1:21] 369 0 0 0 0 0 21 15 517 0 ...
##   .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. .. ..$ : chr [1:36536] "ENSMUSG00000000001" "ENSMUSG00000000003" "ENSMUSG00000000028" "ENSMUSG00000000031" ...
##   .. .. .. ..$ : chr [1:21] "SRX033480" "SRX033488" "SRX033481" "SRX033489" ...
##   .. ..$ :'data.frame':  21 obs. of  3 variables:
##   .. .. ..$ group       : Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
##   .. .. ..$ lib.size    : num [1:21] 3040296 6303665 2717092 6545795 3016179 ...
##   .. .. ..$ norm.factors: num [1:21] 1 1 1 1 1 1 1 1 1 1 ...

names(dgeGlm)

## [1] "counts"  "samples"

dgeGlm[["samples"]]

##           group lib.size norm.factors
## SRX033480     1  3040296            1
## SRX033488     1  6303665            1
## SRX033481     1  2717092            1
## SRX033489     1  6545795            1
## SRX033482     1  3016179            1
## SRX033490     1  7097379            1
## SRX033483     1  3707895            1
## SRX033476     1  5165144            1
## SRX033478     1  4953201            1
## SRX033479     1  4192872            1
## SRX033472     2  3970729            1
## SRX033473     2  4733003            1
## SRX033474     2  3702051            1
## SRX033475     2  3569483            1
## SRX033491     2  7276198            1
## SRX033484     2  4422272            1
## SRX033492     2  7115851            1
## SRX033485     2  3467730            1
## SRX033493     2  7339817            1
## SRX033486     2  3879114            1
## SRX033494     2  6771680            1

nrow(dgeGlm[[1]])

## [1] 36536

ncol(dgeGlm[[1]])

## [1] 21

This DGELIst object has two components: one is a matrix called 'counts' storing the count data, the other is a data.frame called 'samples' storing information for samples. Optionally, you can also provide an annotation file for the genes which will be stored in the data.frame 'genes'. The data.frame 'samples', contains the samples IDs, group information and library sizes (or equally library sequence depth). You can either supply the library size info or it can be calculated from the sums of counts in each column.

design <- model.matrix(~group)
design

##    (Intercept) group2
## 1            1      0
## 2            1      0
## 3            1      0
## 4            1      0
## 5            1      0
## 6            1      0
## 7            1      0
## 8            1      0
## 9            1      0
## 10           1      0
## 11           1      1
## 12           1      1
## 13           1      1
## 14           1      1
## 15           1      1
## 16           1      1
## 17           1      1
## 18           1      1
## 19           1      1
## 20           1      1
## 21           1      1
## attr(,"assign")
## [1] 0 1
## attr(,"contrasts")
## attr(,"contrasts")$group
## [1] "contr.treatment"

dgeGlmComDisp <- estimateGLMCommonDisp(dgeGlm, design, verbose = TRUE)

## Disp = 0.03893 , BCV = 0.1973

dgeGlmTrendDisp <- estimateGLMTrendedDisp(dgeGlmComDisp, design)

## Loading required package: splines

dgeGlmTagDisp <- estimateGLMTagwiseDisp(dgeGlmTrendDisp, design)

# plot the tagwise dispersion against log2-CPM (counts per million)
plotBCV(dgeGlmTagDisp)

plot of chunk unnamed-chunk-6

# GLM fit
fit <- glmFit(dgeGlmTagDisp, design)

# Coefficients
colnames(coef(fit))

## [1] "(Intercept)" "group2"

lrt <- glmLRT(fit, coef = 2)

# topTags is analogous to topTable?
topTags(lrt)

## Coefficient:  group2 
##                     logFC logCPM    LR    PValue       FDR
## ENSMUSG00000020912 -5.187  3.155 401.8 2.238e-89 8.177e-85
## ENSMUSG00000050141 -5.363  2.319 311.6 9.787e-70 1.788e-65
## ENSMUSG00000035775 -4.543  2.674 298.0 9.042e-67 1.101e-62
## ENSMUSG00000015484 -1.968  4.307 282.5 2.113e-63 1.930e-59
## ENSMUSG00000024248 -3.152  3.463 277.0 3.337e-62 2.438e-58
## ENSMUSG00000030532  1.547  5.572 269.4 1.568e-60 9.548e-57
## ENSMUSG00000054354 -6.283  1.940 249.3 3.657e-56 1.909e-52
## ENSMUSG00000023236  1.426  7.067 245.0 3.133e-55 1.431e-51
## ENSMUSG00000050824  3.705  3.365 229.7 7.099e-52 2.882e-48
## ENSMUSG00000015852 -2.362  3.138 214.8 1.210e-48 4.049e-45

ttGlm <- topTags(lrt, n = Inf)
class(ttGlm)

## [1] "TopTags"
## attr(,"package")
## [1] "edgeR"

nrow(ttGlm$table[ttGlm$table$FDR < 0.01, ])

## [1] 600

hits <- rownames(ttGlm$table[ttGlm$table$FDR < 1e-50, ])
cpm(dgeGlmTagDisp)[hits, ]

##                    SRX033480 SRX033488 SRX033481 SRX033489 SRX033482
## ENSMUSG00000020912     20.39    12.691     15.83    14.819    19.230
## ENSMUSG00000050141     14.14    10.153     10.67     6.264     9.946
## ENSMUSG00000035775     14.47    10.629     19.51    11.305    12.930
## ENSMUSG00000015484     26.97    33.949     32.76    36.206    31.497
## ENSMUSG00000024248     14.80    26.017     17.30    23.527    18.567
## ENSMUSG00000030532     27.96    24.906     21.35    25.360    18.567
## ENSMUSG00000054354     10.85     7.615     11.41     8.708     8.289
## ENSMUSG00000023236     65.45    78.684     72.50    72.413    76.587
##                    SRX033490 SRX033483 SRX033476 SRX033478 SRX033479
## ENSMUSG00000020912    17.330    26.430    13.359    23.621    14.310
## ENSMUSG00000050141     7.890     8.361     6.583    11.911    12.402
## ENSMUSG00000035775    12.540    12.136     6.389    13.527    11.210
## ENSMUSG00000015484    33.393    30.745    34.655    26.851    31.482
## ENSMUSG00000024248    25.784    18.879    24.781    14.334    14.310
## ENSMUSG00000030532    26.770    26.430    19.167    23.217    20.034
## ENSMUSG00000054354     7.608     7.012     3.291     4.442     5.963
## ENSMUSG00000023236    80.875    83.336    55.178    57.740    66.303
##                    SRX033472 SRX033473 SRX033474 SRX033475 SRX033491
## ENSMUSG00000020912    0.5037    0.4226    1.0805    0.2802    1.0995
## ENSMUSG00000050141    0.2518    0.0000    0.0000    0.2802    0.8246
## ENSMUSG00000035775    0.7555    0.0000    0.2701    0.2802    0.6872
## ENSMUSG00000015484    7.0516    4.8595    9.1841    7.8443    8.5209
## ENSMUSG00000024248    2.2666    1.2677    1.8908    3.3618    3.1610
## ENSMUSG00000030532   69.7605   70.5683   63.2082   63.0343   73.8023
## ENSMUSG00000054354    0.0000    0.0000    0.0000    0.0000    0.5497
## ENSMUSG00000023236  195.4301  183.6044  176.1186  156.3252  206.9762
##                    SRX033484 SRX033492 SRX033485 SRX033493 SRX033486
## ENSMUSG00000020912    0.0000    0.7027    0.8651    0.1362    0.0000
## ENSMUSG00000050141    0.0000    0.1405    0.0000    0.1362    0.2578
## ENSMUSG00000035775    0.6784    0.1405    0.5767    0.1362    1.8045
## ENSMUSG00000015484    9.4974    8.4319    8.9396    6.6759   10.3116
## ENSMUSG00000024248    1.5829    1.6864    1.7302    1.7712    2.5779
## ENSMUSG00000030532   70.0997   70.6873   62.5770   78.6123   59.2919
## ENSMUSG00000054354    0.0000    0.0000    0.0000    0.0000    0.0000
## ENSMUSG00000023236  177.9628  192.8090  207.0519  201.6399  204.6859
##                    SRX033494
## ENSMUSG00000020912    0.0000
## ENSMUSG00000050141    0.1477
## ENSMUSG00000035775    0.5907
## ENSMUSG00000015484    8.8604
## ENSMUSG00000024248    3.1012
## ENSMUSG00000030532   71.3265
## ENSMUSG00000054354    0.0000
## ENSMUSG00000023236  192.7144

summary(deGlm <- decideTestsDGE(lrt, p = 0.05, adjust = "BH"))

##    [,1] 
## -1   451
## 0  35660
## 1    425

There are 451 genes that are under-expressed in group 2 (DBA/2J) compared to group 1 (C57BL/6J), 3.566 × 10⁴ genes show no differences in expression, while 425 genes are over-expressed.

# plotting the tagwise log fold changes against log-cpm
tagsGlm <- rownames(dgeGlmTagDisp)[as.logical(deGlm)]
plotSmear(lrt, de.tags = tagsGlm)
abline(h = c(-2, 2), col = "blue")

plot of chunk unnamed-chunk-7

Mini exercise:

Redo the above analysis but first filter the data and remove any gene that has: 1. count equal to zero across all samples 2. count equal to zero in at least one sample in each genotype group.

1. Filter = Count equal to zero across all samples

Trial: Since the head(dat) contains a row with sum equal to zero, use this mini set of rows as a trial.

(miniDat <- head(dat))

##                    SRX033480 SRX033488 SRX033481 SRX033489 SRX033482
## ENSMUSG00000000001       369       744       287       769       348
## ENSMUSG00000000003         0         0         0         0         0
## ENSMUSG00000000028         0         1         0         1         1
## ENSMUSG00000000031         0         0         0         0         0
## ENSMUSG00000000037         0         1         1         5         0
## ENSMUSG00000000049         0         1         0         1         0
##                    SRX033490 SRX033483 SRX033476 SRX033478 SRX033479
## ENSMUSG00000000001       803       433       469       585       321
## ENSMUSG00000000003         0         0         0         0         0
## ENSMUSG00000000028         1         0         7         6         1
## ENSMUSG00000000031         0         0         0         0         0
## ENSMUSG00000000037         4         0         0         0         0
## ENSMUSG00000000049         0         0         0         0         0
##                    SRX033472 SRX033473 SRX033474 SRX033475 SRX033491
## ENSMUSG00000000001       301       461       309       374       781
## ENSMUSG00000000003         0         0         0         0         0
## ENSMUSG00000000028         1         1         1         1         1
## ENSMUSG00000000031         0         0         0         0         0
## ENSMUSG00000000037         4         1         1         0         1
## ENSMUSG00000000049         0         0         0         0         0
##                    SRX033484 SRX033492 SRX033485 SRX033493 SRX033486
## ENSMUSG00000000001       555       820       294       758       419
## ENSMUSG00000000003         0         0         0         0         0
## ENSMUSG00000000028         2         1         1         4         1
## ENSMUSG00000000031         0         0         0         0         0
## ENSMUSG00000000037         2         1         1         1         1
## ENSMUSG00000000049         0         0         0         0         0
##                    SRX033494
## ENSMUSG00000000001       857
## ENSMUSG00000000003         0
## ENSMUSG00000000028         5
## ENSMUSG00000000031         0
## ENSMUSG00000000037         2
## ENSMUSG00000000049         0

# Keep the rows where rowSums != 0
(miniDat <- miniDat[rowSums(miniDat) != 0, ])

##                    SRX033480 SRX033488 SRX033481 SRX033489 SRX033482
## ENSMUSG00000000001       369       744       287       769       348
## ENSMUSG00000000028         0         1         0         1         1
## ENSMUSG00000000037         0         1         1         5         0
## ENSMUSG00000000049         0         1         0         1         0
##                    SRX033490 SRX033483 SRX033476 SRX033478 SRX033479
## ENSMUSG00000000001       803       433       469       585       321
## ENSMUSG00000000028         1         0         7         6         1
## ENSMUSG00000000037         4         0         0         0         0
## ENSMUSG00000000049         0         0         0         0         0
##                    SRX033472 SRX033473 SRX033474 SRX033475 SRX033491
## ENSMUSG00000000001       301       461       309       374       781
## ENSMUSG00000000028         1         1         1         1         1
## ENSMUSG00000000037         4         1         1         0         1
## ENSMUSG00000000049         0         0         0         0         0
##                    SRX033484 SRX033492 SRX033485 SRX033493 SRX033486
## ENSMUSG00000000001       555       820       294       758       419
## ENSMUSG00000000028         2         1         1         4         1
## ENSMUSG00000000037         2         1         1         1         1
## ENSMUSG00000000049         0         0         0         0         0
##                    SRX033494
## ENSMUSG00000000001       857
## ENSMUSG00000000028         5
## ENSMUSG00000000037         2
## ENSMUSG00000000049         0

It works! So let's apply filter 1 to dat:

datFilt1 <- dat[rowSums(dat) != 0, ]
dim(datFilt1)

## [1] 13932    21

nrow(dat) - nrow(datFilt1)

## [1] 22604

Therefore 22604 genes were removed using the filter, so we are left with 13932 genes.

Let's continue with egdeR analysis using the same group and design:

dgeGlmFilt1 <- DGEList(counts = datFilt1, group = group)

dgeGlmComDispFilt1 <- estimateGLMCommonDisp(dgeGlmFilt1, design, verbose = TRUE)

## Disp = 0.03893 , BCV = 0.1973

dgeGlmTrendDispFilt1 <- estimateGLMTrendedDisp(dgeGlmComDispFilt1, design)
dgeGlmTagDispFilt1 <- estimateGLMTagwiseDisp(dgeGlmTrendDispFilt1, design)

plotBCV(dgeGlmTagDispFilt1)

plot of chunk unnamed-chunk-12

fitFilt1 <- glmFit(dgeGlmTagDispFilt1, design)
colnames(coef(fitFilt1))

## [1] "(Intercept)" "group2"

lrtFilt1 <- glmLRT(fitFilt1, coef = 2)
topTags(lrtFilt1)

## Coefficient:  group2 
##                     logFC logCPM    LR    PValue       FDR
## ENSMUSG00000020912 -5.187  3.155 398.0 1.521e-88 2.119e-84
## ENSMUSG00000050141 -5.363  2.319 313.3 4.148e-70 2.889e-66
## ENSMUSG00000035775 -4.543  2.674 296.8 1.624e-66 7.541e-63
## ENSMUSG00000015484 -1.968  4.307 280.6 5.544e-63 1.931e-59
## ENSMUSG00000024248 -3.152  3.463 275.7 6.402e-62 1.784e-58
## ENSMUSG00000030532  1.547  5.572 271.0 7.004e-61 1.626e-57
## ENSMUSG00000054354 -6.282  1.940 250.8 1.745e-56 3.473e-53
## ENSMUSG00000023236  1.426  7.067 247.8 7.808e-56 1.360e-52
## ENSMUSG00000050824  3.705  3.365 227.9 1.699e-51 2.631e-48
## ENSMUSG00000074252 -4.872  1.610 217.5 3.104e-49 4.324e-46

ttGlmFilt1 <- topTags(lrtFilt1, n = Inf)
class(ttGlmFilt1)

## [1] "TopTags"
## attr(,"package")
## [1] "edgeR"

nrow(ttGlmFilt1$table[ttGlm$table$FDR < 0.01, ])

## [1] 600

hitsFilt1 <- rownames(ttGlmFilt1$table[ttGlmFilt1$table$FDR < 1e-50, ])
cpm(dgeGlmTagDispFilt1)[hitsFilt1, ]

##                    SRX033480 SRX033488 SRX033481 SRX033489 SRX033482
## ENSMUSG00000020912     20.39    12.691     15.83    14.819    19.230
## ENSMUSG00000050141     14.14    10.153     10.67     6.264     9.946
## ENSMUSG00000035775     14.47    10.629     19.51    11.305    12.930
## ENSMUSG00000015484     26.97    33.949     32.76    36.206    31.497
## ENSMUSG00000024248     14.80    26.017     17.30    23.527    18.567
## ENSMUSG00000030532     27.96    24.906     21.35    25.360    18.567
## ENSMUSG00000054354     10.85     7.615     11.41     8.708     8.289
## ENSMUSG00000023236     65.45    78.684     72.50    72.413    76.587
##                    SRX033490 SRX033483 SRX033476 SRX033478 SRX033479
## ENSMUSG00000020912    17.330    26.430    13.359    23.621    14.310
## ENSMUSG00000050141     7.890     8.361     6.583    11.911    12.402
## ENSMUSG00000035775    12.540    12.136     6.389    13.527    11.210
## ENSMUSG00000015484    33.393    30.745    34.655    26.851    31.482
## ENSMUSG00000024248    25.784    18.879    24.781    14.334    14.310
## ENSMUSG00000030532    26.770    26.430    19.167    23.217    20.034
## ENSMUSG00000054354     7.608     7.012     3.291     4.442     5.963
## ENSMUSG00000023236    80.875    83.336    55.178    57.740    66.303
##                    SRX033472 SRX033473 SRX033474 SRX033475 SRX033491
## ENSMUSG00000020912    0.5037    0.4226    1.0805    0.2802    1.0995
## ENSMUSG00000050141    0.2518    0.0000    0.0000    0.2802    0.8246
## ENSMUSG00000035775    0.7555    0.0000    0.2701    0.2802    0.6872
## ENSMUSG00000015484    7.0516    4.8595    9.1841    7.8443    8.5209
## ENSMUSG00000024248    2.2666    1.2677    1.8908    3.3618    3.1610
## ENSMUSG00000030532   69.7605   70.5683   63.2082   63.0343   73.8023
## ENSMUSG00000054354    0.0000    0.0000    0.0000    0.0000    0.5497
## ENSMUSG00000023236  195.4301  183.6044  176.1186  156.3252  206.9762
##                    SRX033484 SRX033492 SRX033485 SRX033493 SRX033486
## ENSMUSG00000020912    0.0000    0.7027    0.8651    0.1362    0.0000
## ENSMUSG00000050141    0.0000    0.1405    0.0000    0.1362    0.2578
## ENSMUSG00000035775    0.6784    0.1405    0.5767    0.1362    1.8045
## ENSMUSG00000015484    9.4974    8.4319    8.9396    6.6759   10.3116
## ENSMUSG00000024248    1.5829    1.6864    1.7302    1.7712    2.5779
## ENSMUSG00000030532   70.0997   70.6873   62.5770   78.6123   59.2919
## ENSMUSG00000054354    0.0000    0.0000    0.0000    0.0000    0.0000
## ENSMUSG00000023236  177.9628  192.8090  207.0519  201.6399  204.6859
##                    SRX033494
## ENSMUSG00000020912    0.0000
## ENSMUSG00000050141    0.1477
## ENSMUSG00000035775    0.5907
## ENSMUSG00000015484    8.8604
## ENSMUSG00000024248    3.1012
## ENSMUSG00000030532   71.3265
## ENSMUSG00000054354    0.0000
## ENSMUSG00000023236  192.7144

(summaryFilt1 <- summary(deGlmFilt1 <- decideTestsDGE(lrtFilt1, p = 0.05, adjust = "BH")))

##    [,1] 
## -1   633
## 0  12680
## 1    619

After applying filter 1, there are 633 genes that are under-expressed in group 2 (DBA/2J) compared to group 1 (C57BL/6J), 1.268 × 10⁴ genes show no differences in expression, while 619 genes are over-expressed. Therefore we obtain more hits with filter 1 than without it!

# plotting the tagwise log fold changes against log-cpm
tagsGlmFilt1 <- rownames(dgeGlmTagDispFilt1)[as.logical(deGlmFilt1)]
plotSmear(lrtFilt1, de.tags = tagsGlmFilt1)
abline(h = c(-2, 2), col = "blue")

plot of chunk unnamed-chunk-13

2. Filter = Count equal to zero in at least one sample in each genotype group

Again, I will first test the filter on a subset of dat. I will use head(dat) as it contains rows with sum equal to zero, and rows where 1 or more samples have counts equal to zero.

Therefore, if 1 or more samples has zero reads, and we multiply all reads from each sample together (row multiplication), we will obtain 0 and can remove these rows. Not sure what function to use for this, apply?

(miniDat <- head(dat))

##                    SRX033480 SRX033488 SRX033481 SRX033489 SRX033482
## ENSMUSG00000000001       369       744       287       769       348
## ENSMUSG00000000003         0         0         0         0         0
## ENSMUSG00000000028         0         1         0         1         1
## ENSMUSG00000000031         0         0         0         0         0
## ENSMUSG00000000037         0         1         1         5         0
## ENSMUSG00000000049         0         1         0         1         0
##                    SRX033490 SRX033483 SRX033476 SRX033478 SRX033479
## ENSMUSG00000000001       803       433       469       585       321
## ENSMUSG00000000003         0         0         0         0         0
## ENSMUSG00000000028         1         0         7         6         1
## ENSMUSG00000000031         0         0         0         0         0
## ENSMUSG00000000037         4         0         0         0         0
## ENSMUSG00000000049         0         0         0         0         0
##                    SRX033472 SRX033473 SRX033474 SRX033475 SRX033491
## ENSMUSG00000000001       301       461       309       374       781
## ENSMUSG00000000003         0         0         0         0         0
## ENSMUSG00000000028         1         1         1         1         1
## ENSMUSG00000000031         0         0         0         0         0
## ENSMUSG00000000037         4         1         1         0         1
## ENSMUSG00000000049         0         0         0         0         0
##                    SRX033484 SRX033492 SRX033485 SRX033493 SRX033486
## ENSMUSG00000000001       555       820       294       758       419
## ENSMUSG00000000003         0         0         0         0         0
## ENSMUSG00000000028         2         1         1         4         1
## ENSMUSG00000000031         0         0         0         0         0
## ENSMUSG00000000037         2         1         1         1         1
## ENSMUSG00000000049         0         0         0         0         0
##                    SRX033494
## ENSMUSG00000000001       857
## ENSMUSG00000000003         0
## ENSMUSG00000000028         5
## ENSMUSG00000000031         0
## ENSMUSG00000000037         2
## ENSMUSG00000000049         0

# Use apply function with 'MARGIN = 1' to indicate by row, 'prod' = product
apply(miniDat, 1, prod)

## ENSMUSG00000000001 ENSMUSG00000000003 ENSMUSG00000000028 
##          2.945e+56          0.000e+00          0.000e+00 
## ENSMUSG00000000031 ENSMUSG00000000037 ENSMUSG00000000049 
##          0.000e+00          0.000e+00          0.000e+00

# Use this together with indexing:
miniDat[apply(miniDat, 1, prod) != 0, ]

##                    SRX033480 SRX033488 SRX033481 SRX033489 SRX033482
## ENSMUSG00000000001       369       744       287       769       348
##                    SRX033490 SRX033483 SRX033476 SRX033478 SRX033479
## ENSMUSG00000000001       803       433       469       585       321
##                    SRX033472 SRX033473 SRX033474 SRX033475 SRX033491
## ENSMUSG00000000001       301       461       309       374       781
##                    SRX033484 SRX033492 SRX033485 SRX033493 SRX033486
## ENSMUSG00000000001       555       820       294       758       419
##                    SRX033494
## ENSMUSG00000000001       857

It works! So let's apply filter 2 to dat:

datFilt2 <- dat[apply(dat, 1, prod) != 0, ]
dim(datFilt2)

## [1] 9196   21

nrow(dat) - nrow(datFilt2)

## [1] 27340

Therefore 27340 genes were removed using the filter, and we are left with 9196 genes.

Let's continue with egdeR analysis using the same group and design:

dgeGlmFilt2 <- DGEList(counts = datFilt2, group = group)

dgeGlmComDispFilt2 <- estimateGLMCommonDisp(dgeGlmFilt2, design, verbose = TRUE)

## Disp = 0.03704 , BCV = 0.1925

dgeGlmTrendDispFilt2 <- estimateGLMTrendedDisp(dgeGlmComDispFilt2, design)
dgeGlmTagDispFilt2 <- estimateGLMTagwiseDisp(dgeGlmTrendDispFilt2, design)
plotBCV(dgeGlmTagDispFilt2)

plot of chunk unnamed-chunk-18

fitFilt2 <- glmFit(dgeGlmTagDispFilt2, design)
colnames(coef(fitFilt2))

## [1] "(Intercept)" "group2"

lrtFilt2 <- glmLRT(fitFilt2, coef = 2)
topTags(lrtFilt2)

## Coefficient:  group2 
##                     logFC logCPM    LR    PValue       FDR
## ENSMUSG00000015484 -1.968  4.309 288.3 1.165e-64 1.071e-60
## ENSMUSG00000024248 -3.153  3.466 280.1 7.101e-63 3.265e-59
## ENSMUSG00000030532  1.546  5.574 261.6 7.742e-59 2.373e-55
## ENSMUSG00000023236  1.426  7.069 255.5 1.609e-57 3.699e-54
## ENSMUSG00000050824  3.704  3.367 233.3 1.109e-52 2.039e-49
## ENSMUSG00000015852 -2.362  3.141 217.7 2.860e-49 4.384e-46
## ENSMUSG00000028393  1.802  6.113 212.1 4.789e-48 6.291e-45
## ENSMUSG00000072572 -2.010  3.709 188.1 8.150e-43 9.368e-40
## ENSMUSG00000066800 -2.321  3.520 182.9 1.147e-41 1.172e-38
## ENSMUSG00000074064  1.392  6.755 175.6 4.387e-40 4.034e-37

ttGlmFilt2 <- topTags(lrtFilt2, n = Inf)
class(ttGlmFilt2)

## [1] "TopTags"
## attr(,"package")
## [1] "edgeR"

nrow(ttGlmFilt2$table[ttGlm$table$FDR < 0.01, ])

## [1] 600

hitsFilt2 <- rownames(ttGlmFilt2$table[ttGlmFilt2$table$FDR < 1e-50, ])
cpm(dgeGlmTagDispFilt2)[hitsFilt2, ]

##                    SRX033480 SRX033488 SRX033481 SRX033489 SRX033482
## ENSMUSG00000015484     27.01     34.03     32.80     36.27     31.54
## ENSMUSG00000024248     14.82     26.08     17.32     23.57     18.59
## ENSMUSG00000030532     28.00     24.97     21.38     25.41     18.59
## ENSMUSG00000023236     65.55     78.88     72.61     72.54     76.70
##                    SRX033490 SRX033483 SRX033476 SRX033478 SRX033479
## ENSMUSG00000015484     33.46     30.79     34.73     26.95     31.53
## ENSMUSG00000024248     25.83     18.91     24.83     14.39     14.33
## ENSMUSG00000030532     26.82     26.47     19.21     23.31     20.07
## ENSMUSG00000023236     81.03     83.45     55.29     57.96     66.41
##                    SRX033472 SRX033473 SRX033474 SRX033475 SRX033491
## ENSMUSG00000015484     7.063     4.867     9.198     7.856     8.534
## ENSMUSG00000024248     2.270     1.270     1.894     3.367     3.166
## ENSMUSG00000030532    69.873    70.673    63.304    63.126    73.915
## ENSMUSG00000023236   195.745   183.877   176.384   156.553   207.292
##                    SRX033484 SRX033492 SRX033485 SRX033493 SRX033486
## ENSMUSG00000015484     9.515     8.444     8.955     6.689    10.327
## ENSMUSG00000024248     1.586     1.689     1.733     1.775     2.582
## ENSMUSG00000030532    70.228    70.792    62.683    78.764    59.380
## ENSMUSG00000023236   178.288   193.095   207.404   202.029   204.990
##                    SRX033494
## ENSMUSG00000015484     8.879
## ENSMUSG00000024248     3.108
## ENSMUSG00000030532    71.480
## ENSMUSG00000023236   193.128

(summaryFilt2 <- summary(deGlmFilt2 <- decideTestsDGE(lrtFilt2, p = 0.05, adjust = "BH")))

##    [,1]
## -1  503
## 0  8183
## 1   510

After applying filter 2, there are 503 genes that are under-expressed in group 2 (DBA/2J) compared to group 1 (C57BL/6J), 8183 genes show no differences in expression, while 510 genes are over-expressed.

# plotting the tagwise log fold changes against log-cpm
tagsGlmFilt2 <- rownames(dgeGlmTagDispFilt2)[as.logical(deGlmFilt2)]
plotSmear(lrtFilt2, de.tags = tagsGlmFilt2)
abline(h = c(-2, 2), col = "blue")

plot of chunk unnamed-chunk-19

DESeq

We will try the differential expression analysis of the same dataset using DESeq.

# read in same count table data and grouping information
deSeqDat <- newCountDataSet(dat, group)
head(counts(deSeqDat))

##                    SRX033480 SRX033488 SRX033481 SRX033489 SRX033482
## ENSMUSG00000000001       369       744       287       769       348
## ENSMUSG00000000003         0         0         0         0         0
## ENSMUSG00000000028         0         1         0         1         1
## ENSMUSG00000000031         0         0         0         0         0
## ENSMUSG00000000037         0         1         1         5         0
## ENSMUSG00000000049         0         1         0         1         0
##                    SRX033490 SRX033483 SRX033476 SRX033478 SRX033479
## ENSMUSG00000000001       803       433       469       585       321
## ENSMUSG00000000003         0         0         0         0         0
## ENSMUSG00000000028         1         0         7         6         1
## ENSMUSG00000000031         0         0         0         0         0
## ENSMUSG00000000037         4         0         0         0         0
## ENSMUSG00000000049         0         0         0         0         0
##                    SRX033472 SRX033473 SRX033474 SRX033475 SRX033491
## ENSMUSG00000000001       301       461       309       374       781
## ENSMUSG00000000003         0         0         0         0         0
## ENSMUSG00000000028         1         1         1         1         1
## ENSMUSG00000000031         0         0         0         0         0
## ENSMUSG00000000037         4         1         1         0         1
## ENSMUSG00000000049         0         0         0         0         0
##                    SRX033484 SRX033492 SRX033485 SRX033493 SRX033486
## ENSMUSG00000000001       555       820       294       758       419
## ENSMUSG00000000003         0         0         0         0         0
## ENSMUSG00000000028         2         1         1         4         1
## ENSMUSG00000000031         0         0         0         0         0
## ENSMUSG00000000037         2         1         1         1         1
## ENSMUSG00000000049         0         0         0         0         0
##                    SRX033494
## ENSMUSG00000000001       857
## ENSMUSG00000000003         0
## ENSMUSG00000000028         5
## ENSMUSG00000000031         0
## ENSMUSG00000000037         2
## ENSMUSG00000000049         0

Next, we estimate the size factors to account for differences in library coverage and estimate the variance:

deSeqDat <- estimateSizeFactors(deSeqDat)
sizeFactors(deSeqDat)

## SRX033480 SRX033488 SRX033481 SRX033489 SRX033482 SRX033490 SRX033483 
##    0.6439    1.3454    0.5785    1.4295    0.6355    1.5240    0.7933 
## SRX033476 SRX033478 SRX033479 SRX033472 SRX033473 SRX033474 SRX033475 
##    1.1272    1.0772    0.8984    0.8886    1.0255    0.7987    0.7796 
## SRX033491 SRX033484 SRX033492 SRX033485 SRX033493 SRX033486 SRX033494 
##    1.6162    0.9882    1.5720    0.7558    1.5922    0.8264    1.4715

deSeqDat <- estimateDispersions(deSeqDat)
# plotting the estimated dispersions against the mean normalized counts
plotDispEsts(deSeqDat)

plot of chunk unnamed-chunk-21

Next, we will fit the model and examine the results:

# This will take ~ 1 min
results <- nbinomTest(deSeqDat, levels(group)[1], levels(group)[2])
str(results)

## 'data.frame':    36536 obs. of  8 variables:
##  $ id            : chr  "ENSMUSG00000000001" "ENSMUSG00000000003" "ENSMUSG00000000028" "ENSMUSG00000000031" ...
##  $ baseMean      : num  489.18 0 1.57 0 1.1 ...
##  $ baseMeanA     : num  509.685 0 1.657 0 0.859 ...
##  $ baseMeanB     : num  470.53 0 1.49 0 1.32 ...
##  $ foldChange    : num  0.923 NaN 0.898 NaN 1.537 ...
##  $ log2FoldChange: num  -0.115 NaN -0.156 NaN 0.62 ...
##  $ pval          : num  0.498 NA 0.829 NA 0.968 ...
##  $ padj          : num  1 NA 1 NA 1 ...

plotMA(results)

plot of chunk unnamed-chunk-22

Voom & limma

normFactor <- calcNormFactors(dat)
datVoom <- voom(dat, design, plot = TRUE, lib.size = colSums(dat) * normFactor)

plot of chunk unnamed-chunk-23

# datVoom
fit <- lmFit(datVoom, design)
fit <- eBayes(fit)
topTable(fit)

##                    X.Intercept.    group2 AveExpr      F   P.Value
## ENSMUSG00000025867       12.477 -0.013979  12.471 154234 2.584e-55
## ENSMUSG00000022892       12.268 -0.086479  12.223 138459 1.103e-54
## ENSMUSG00000037852       12.321 -0.237470  12.194 134826 1.576e-54
## ENSMUSG00000042700       10.716 -0.048029  10.692 133973 1.717e-54
## ENSMUSG00000029461       10.300  0.020715  10.311 121795 6.182e-54
## ENSMUSG00000020658        9.610 -0.019628   9.601 119935 7.604e-54
## ENSMUSG00000060261        9.469 -0.015743   9.461 117841 9.635e-54
## ENSMUSG00000032549       11.904  0.003545  11.905 117324 1.022e-53
## ENSMUSG00000024462       10.227 -0.138929  10.153 116767 1.090e-53
## ENSMUSG00000030102       12.085 -0.026149  12.073 112155 1.874e-53
##                    adj.P.Val
## ENSMUSG00000025867 9.442e-51
## ENSMUSG00000022892 1.568e-50
## ENSMUSG00000037852 1.568e-50
## ENSMUSG00000042700 1.568e-50
## ENSMUSG00000029461 4.424e-50
## ENSMUSG00000020658 4.424e-50
## ENSMUSG00000060261 4.424e-50
## ENSMUSG00000032549 4.424e-50
## ENSMUSG00000024462 4.424e-50
## ENSMUSG00000030102 6.417e-50

Take Home Problem

Choose a specific threshold for the adjusted p value, find the genes identified as differentially expressed using each of edgeR, DESeq and voom + limma. Compare the number of genes in these 3 lists, and draw a venn digram demonstrating the overlap (if any!).

For another day :)