Update. 11/13/17 Thanks to Mike Love for pointing out an error. This is just my understanding. There maybe other errors.*

To allow complex study designs in iDEP(http://ge-lab.org/idep/), I tried to understand how factoral models are built and the desired results are extracted from DESeq2. The following is based on the help document from the resutls( ) function in DESeq2, plus some of Mike Love’s answers to questions from users.

An important point I want to make is the interpretation of results is tricky when the study design involve multiple factors. See figure above for an example involving 3 genotypes under 3 conditions. Similar to regression analysis in R, the reference levels for categorical factors forms the foundation of our intereptation. Yet, by default, they are determined alphabetically.

Before runing DESeq2, it is essential to choose appropriate reference levels for each factors. This can be done by the relevel( ) function in R. Reference level is the baseline level of a factor that forms the basis of meaningful comparisons. In a wildtype vs. mutant experiment, “wild-type” is the reference level. In treated vs. untreated, the reference level is obviously untreated. More details in Exmple 3.

iDEP provides a GUI to DESeq2 for most experimental desings. It also provides convienent interface for exploratory data anlysis and pathway analysis. Try it at http://ge-lab.org/idep/

Example 1: two-group comparison

First make some example data.

library(DESeq2)
dds <- makeExampleDESeqDataSet(n=10000,m=6)
assay(dds)[ 1:10,]
##        sample1 sample2 sample3 sample4 sample5 sample6
## gene1        6       4      11       1       2      13
## gene2        9      12      23      13      14      28
## gene3       58     121     173     178     118      97
## gene4        0       4       0       3       8       3
## gene5       27       3       6       9       8      12
## gene6       48       8      35      38      21      13
## gene7       36      50      61      52      44      22
## gene8        6       8      16      14      18      19
## gene9      214     266     419     198     157     166
## gene10      20      12      16      12      16       2

This is a very simple experiment design with two conditions.

colData(dds)
## DataFrame with 6 rows and 1 column
##         condition
##          <factor>
## sample1         A
## sample2         A
## sample3         A
## sample4         B
## sample5         B
## sample6         B
dds <- DESeq(dds)
resultsNames(dds)
## [1] "Intercept"        "condition_B_vs_A"

This shows the results available. Note that by default, R will choose a reference level for factors based on alphabetical order. Here A is the referece level. Fold change is defined as B compaired with A. To change reference levels, try the relevel( ) function.

res <- results(dds, contrast=c("condition","B","A"))
res <- res[order(res$padj),]
library(knitr)
kable(res[1:5,-(3:4)])
baseMean log2FoldChange pvalue padj
gene9056 360.168909 -2.045379 0.0000000 0.0001366
gene3087 43.897516 -2.203303 0.0000173 0.0858143
gene3763 72.409877 -1.834787 0.0000434 0.1434712
gene2054 322.494963 1.537408 0.0000681 0.1689463
gene4617 6.227415 6.125238 0.0002019 0.4008408

If we want to use B as control and define fold change using B as baseline. Then we can do this:

res <- results(dds, contrast=c("condition","A","B"))
ix = which.min(res$padj)
res <- res[order(res$padj),]
kable(res[1:5,-(3:4)])
baseMean log2FoldChange pvalue padj
gene9056 360.168909 2.045379 0.0000000 0.0001366
gene3087 43.897516 2.203303 0.0000173 0.0858143
gene3763 72.409877 1.834787 0.0000434 0.1434712
gene2054 322.494963 -1.537408 0.0000681 0.1689463
gene4617 6.227415 -6.125238 0.0002019 0.4008408

As you can see, the fold-change are completely opposite direction. Here we show the most significant gene.

barplot(assay(dds)[ix,],las=2, main=rownames(dds)[ ix  ]  )

Example 2: Multiple groups

Suppose we have three groups A, B, and C.

dds <- makeExampleDESeqDataSet(n=100,m=6)
dds$condition <- factor( c( "A","A","B","B","C","C") )
dds <- DESeq(dds)
res = results(dds, contrast=c("condition","C","A"))
res <- res[order(res$padj),]
kable(res[1:5,-(3:4)])
baseMean log2FoldChange pvalue padj
gene2 3.634986 -5.101773 0.0348679 0.5515088
gene20 4.678176 -4.490982 0.0445664 0.5515088
gene34 56.068672 -1.462155 0.0167820 0.5515088
gene35 537.847175 -1.177240 0.0087913 0.5515088
gene41 93.967810 1.064734 0.0412034 0.5515088

Example 3: two conditions, two genotypes, with an interaction term

Here we have two genotypes, wild-type (WT), and mutant (MU). Two conditions, control (Ctrl) and treated (Trt). We are interested in the responses of both wild-type and mutant to treatment. We are also interested in the differences in response between genotypes, which is captured by the interaction term in linear models.

First, we construct example data. Note that we changed sample names from “sample1” to “Wt_Ctrl_1”, according to the two factors.

dds <- makeExampleDESeqDataSet(n=10000,m=12)
dds$condition <- factor( c( rep("Ctrl",6), rep("Trt",6) ) )
dds$genotype <- factor(rep(rep(c("WT","MU"),each=3),2))
colnames(dds) <- paste(as.character( dds$genotype),as.character( dds$condition),rownames(colData(dds)),  sep="_"  )
colnames(dds) = gsub("sample","",colnames(dds))
kable(assay(dds)[1:5,])
WT_Ctrl_1 WT_Ctrl_2 WT_Ctrl_3 MU_Ctrl_4 MU_Ctrl_5 MU_Ctrl_6 WT_Trt_7 WT_Trt_8 WT_Trt_9 MU_Trt_10 MU_Trt_11 MU_Trt_12
gene1 81 47 135 72 81 80 76 57 52 49 64 57
gene2 70 77 94 54 54 36 51 66 57 51 47 67
gene3 16 37 8 28 4 36 14 10 7 11 40 15
gene4 2 5 1 3 9 2 1 5 0 0 14 8
gene5 0 0 1 0 5 0 0 0 0 0 0 0
kable( colData(dds))
condition genotype
WT_Ctrl_1 Ctrl WT
WT_Ctrl_2 Ctrl WT
WT_Ctrl_3 Ctrl WT
MU_Ctrl_4 Ctrl MU
MU_Ctrl_5 Ctrl MU
MU_Ctrl_6 Ctrl MU
WT_Trt_7 Trt WT
WT_Trt_8 Trt WT
WT_Trt_9 Trt WT
MU_Trt_10 Trt MU
MU_Trt_11 Trt MU
MU_Trt_12 Trt MU

Check reference levels:

dds$condition
##  [1] Ctrl Ctrl Ctrl Ctrl Ctrl Ctrl Trt  Trt  Trt  Trt  Trt  Trt 
## Levels: Ctrl Trt

As you could see, “Ctrl” apeared first in the 2nd line, indicating it is the reference level for factor condition, as we can expect based on alphabetical order. This is what we want and we do not need to do anything.

dds$genotype
##  [1] WT WT WT MU MU MU WT WT WT MU MU MU
## Levels: MU WT

But “Mu” is the reference level for genotype, which is will give us results difficult to interpret. We need to change it.

dds$genotype = relevel( dds$genotype, "WT")
dds$genotype
##  [1] WT WT WT MU MU MU WT WT WT MU MU MU
## Levels: WT MU

Set up the model, and run DESeq2:

design(dds) <- ~ genotype + condition + genotype:condition
dds <- DESeq(dds) 
resultsNames(dds)
## [1] "Intercept"               "genotype_MU_vs_WT"      
## [3] "condition_Trt_vs_Ctrl"   "genotypeMU.conditionTrt"

Below, we are going to use the combination of the different results (“genotype_MU_vs_WT”, “condition_Trt_vs_Ctrl”, “genotypeMU.conditionTrt” ) to derive biologically meaningful comparisons.

The effect of treatment in wild-type (the main effect).

res = results(dds, contrast=c("condition","Trt","Ctrl"))
ix = which.min(res$padj) # most significant
res <- res[order(res$padj),] # sort
kable(res[1:5,-(3:4)])
baseMean log2FoldChange pvalue padj
gene275 25.38836 2.607896 0.0000588 0.2684851
gene2744 101.02954 -1.670837 0.0001099 0.2684851
gene5441 58.67469 1.758921 0.0001261 0.2684851
gene7021 74.29359 1.610853 0.0001345 0.2684851
gene7795 326.43308 -1.624323 0.0000407 0.2684851

This is for WT, treated compared with untreated. Note that WT is not mentioned, because it is the reference level. In other words, this is the difference between samples No. 7-9, compared with samples No. 1-3.

Here we show the most significant gene.

barplot(assay(dds)[ix,],las=2, main=rownames(dds)[ ix  ]  )

The effect of treatment in mutant

This is, by definition, the main effect plus the interaction term (the extra condition effect in genotype Mutant compared to genotype WT).

res <- results(dds, list( c("condition_Trt_vs_Ctrl","genotypeMU.conditionTrt") ))
ix = which.min(res$padj) # most significant
res <- res[order(res$padj),] # sort
kable(res[1:5,-(3:4)])
baseMean log2FoldChange pvalue padj
gene5102 18.69017 4.757057 1.60e-06 0.0156910
gene7367 170.69259 1.481016 1.19e-05 0.0396834
gene8351 27.77227 3.033622 9.80e-06 0.0396834
gene8034 53.34342 -1.841023 1.62e-05 0.0403724
gene7272 92.82351 -1.414967 6.70e-05 0.1125808
Note that i t has to be list( c(“condit ion_Trt_vs_ Ctrl“,”genotypeMU.conditionTrt“) ). If list( ) is ommitted, the results would be drastically different. Perhaps only the first coefficient is used.

This measures the effect of treatment in mutant. In other words, samples No. 10-12 compared with samples No. 4-6.

Here we show the most significant gene, which is downregulated expressed in samples 10-12, than samples 4-6 , as expected.

barplot(assay(dds)[ix,],las=2, main=rownames(dds)[ ix  ]  )

What is the difference between mutant and wild-type without treatment?

As Ctrl is the reference level, we can just retrieve the “genotype_MU_vs_WT”.

res = results(dds, contrast=c("genotype","MU","WT"))
ix = which.min(res$padj) # most significant
res <- res[order(res$padj),] # sort
kable(res[1:5,-(3:4)])
baseMean log2FoldChange pvalue padj
gene5102 18.69017 -4.714519 0.0000020 0.0195594
gene2289 45.04711 1.763615 0.0000333 0.1218120
gene3388 122.85582 -1.529323 0.0000366 0.1218120
gene915 39.03250 2.104003 0.0001133 0.2374845
gene8351 27.77227 -2.653182 0.0001190 0.2374845

In other words, this is the samples No.4-6 compared with No. 1-3. Here we show the most significant gene.

barplot(assay(dds)[ix,],las=2, main=rownames(dds)[ ix  ]  )