Example 1: two-group comparison
Example 2: Multiple groups
Example 3: two conditions, two genotypes, with an interaction term
Example 4: Two conditionss, three genotpes with interaction terms
Example 5: Three conditions, three genotypes with interaction terms

Update. 11/13/17 Thanks to Mike Love for pointing out an error. This is just my understanding. There maybe other errors.*

To allow complex study designs in iDEP(http://ge-lab.org/idep/), I tried to understand how factoral models are built and the desired results are extracted from DESeq2. The following is based on the help document from the resutls( ) function in DESeq2, plus some of Mike Love’s answers to questions from users.

An important point I want to make is the interpretation of results is tricky when the study design involve multiple factors. See figure above for an example involving 3 genotypes under 3 conditions. Similar to regression analysis in R, the reference levels for categorical factors forms the foundation of our intereptation. Yet, by default, they are determined alphabetically.

Before runing DESeq2, it is essential to choose appropriate reference levels for each factors. This can be done by the relevel( ) function in R. Reference level is the baseline level of a factor that forms the basis of meaningful comparisons. In a wildtype vs. mutant experiment, “wild-type” is the reference level. In treated vs. untreated, the reference level is obviously untreated. More details in Exmple 3.

iDEP provides a GUI to DESeq2 for most experimental desings. It also provides convienent interface for exploratory data anlysis and pathway analysis. Try it at http://ge-lab.org/idep/

Example 1: two-group comparison

First make some example data.

library(DESeq2)
dds <- makeExampleDESeqDataSet(n=10000,m=6)
assay(dds)[ 1:10,]

##        sample1 sample2 sample3 sample4 sample5 sample6
## gene1        6       4      11       1       2      13
## gene2        9      12      23      13      14      28
## gene3       58     121     173     178     118      97
## gene4        0       4       0       3       8       3
## gene5       27       3       6       9       8      12
## gene6       48       8      35      38      21      13
## gene7       36      50      61      52      44      22
## gene8        6       8      16      14      18      19
## gene9      214     266     419     198     157     166
## gene10      20      12      16      12      16       2

This is a very simple experiment design with two conditions.

colData(dds)

## DataFrame with 6 rows and 1 column
##         condition
##          <factor>
## sample1         A
## sample2         A
## sample3         A
## sample4         B
## sample5         B
## sample6         B

dds <- DESeq(dds)
resultsNames(dds)

## [1] "Intercept"        "condition_B_vs_A"

This shows the results available. Note that by default, R will choose a reference level for factors based on alphabetical order. Here A is the referece level. Fold change is defined as B compaired with A. To change reference levels, try the relevel( ) function.

res <- results(dds, contrast=c("condition","B","A"))
res <- res[order(res$padj),]
library(knitr)
kable(res[1:5,-(3:4)])

	baseMean	log2FoldChange	pvalue	padj
gene9056	360.168909	-2.045379	0.0000000	0.0001366
gene3087	43.897516	-2.203303	0.0000173	0.0858143
gene3763	72.409877	-1.834787	0.0000434	0.1434712
gene2054	322.494963	1.537408	0.0000681	0.1689463
gene4617	6.227415	6.125238	0.0002019	0.4008408

If we want to use B as control and define fold change using B as baseline. Then we can do this:

res <- results(dds, contrast=c("condition","A","B"))
ix = which.min(res$padj)
res <- res[order(res$padj),]
kable(res[1:5,-(3:4)])

	baseMean	log2FoldChange	pvalue	padj
gene9056	360.168909	2.045379	0.0000000	0.0001366
gene3087	43.897516	2.203303	0.0000173	0.0858143
gene3763	72.409877	1.834787	0.0000434	0.1434712
gene2054	322.494963	-1.537408	0.0000681	0.1689463
gene4617	6.227415	-6.125238	0.0002019	0.4008408

As you can see, the fold-change are completely opposite direction. Here we show the most significant gene.

barplot(assay(dds)[ix,],las=2, main=rownames(dds)[ ix  ]  )

Example 2: Multiple groups

Suppose we have three groups A, B, and C.

dds <- makeExampleDESeqDataSet(n=100,m=6)
dds$condition <- factor( c( "A","A","B","B","C","C") )
dds <- DESeq(dds)
res = results(dds, contrast=c("condition","C","A"))
res <- res[order(res$padj),]
kable(res[1:5,-(3:4)])

	baseMean	log2FoldChange	pvalue	padj
gene2	3.634986	-5.101773	0.0348679	0.5515088
gene20	4.678176	-4.490982	0.0445664	0.5515088
gene34	56.068672	-1.462155	0.0167820	0.5515088
gene35	537.847175	-1.177240	0.0087913	0.5515088
gene41	93.967810	1.064734	0.0412034	0.5515088

Example 3: two conditions, two genotypes, with an interaction term

Here we have two genotypes, wild-type (WT), and mutant (MU). Two conditions, control (Ctrl) and treated (Trt). We are interested in the responses of both wild-type and mutant to treatment. We are also interested in the differences in response between genotypes, which is captured by the interaction term in linear models.

First, we construct example data. Note that we changed sample names from “sample1” to “Wt_Ctrl_1”, according to the two factors.

dds <- makeExampleDESeqDataSet(n=10000,m=12)
dds$condition <- factor( c( rep("Ctrl",6), rep("Trt",6) ) )
dds$genotype <- factor(rep(rep(c("WT","MU"),each=3),2))
colnames(dds) <- paste(as.character( dds$genotype),as.character( dds$condition),rownames(colData(dds)),  sep="_"  )
colnames(dds) = gsub("sample","",colnames(dds))
kable(assay(dds)[1:5,])

	WT_Ctrl_1	WT_Ctrl_2	WT_Ctrl_3	MU_Ctrl_4	MU_Ctrl_5	MU_Ctrl_6	WT_Trt_7	WT_Trt_8	WT_Trt_9	MU_Trt_10	MU_Trt_11	MU_Trt_12
gene1	81	47	135	72	81	80	76	57	52	49	64	57
gene2	70	77	94	54	54	36	51	66	57	51	47	67
gene3	16	37	8	28	4	36	14	10	7	11	40	15
gene4	2	5	1	3	9	2	1	5	0	0	14	8
gene5	0	0	1	0	5	0	0	0	0	0	0	0

kable( colData(dds))

	condition	genotype
WT_Ctrl_1	Ctrl	WT
WT_Ctrl_2	Ctrl	WT
WT_Ctrl_3	Ctrl	WT
MU_Ctrl_4	Ctrl	MU
MU_Ctrl_5	Ctrl	MU
MU_Ctrl_6	Ctrl	MU
WT_Trt_7	Trt	WT
WT_Trt_8	Trt	WT
WT_Trt_9	Trt	WT
MU_Trt_10	Trt	MU
MU_Trt_11	Trt	MU
MU_Trt_12	Trt	MU

Check reference levels:

dds$condition

##  [1] Ctrl Ctrl Ctrl Ctrl Ctrl Ctrl Trt  Trt  Trt  Trt  Trt  Trt 
## Levels: Ctrl Trt

As you could see, “Ctrl” apeared first in the 2nd line, indicating it is the reference level for factor condition, as we can expect based on alphabetical order. This is what we want and we do not need to do anything.

dds$genotype

##  [1] WT WT WT MU MU MU WT WT WT MU MU MU
## Levels: MU WT

But “Mu” is the reference level for genotype, which is will give us results difficult to interpret. We need to change it.

dds$genotype = relevel( dds$genotype, "WT")
dds$genotype

##  [1] WT WT WT MU MU MU WT WT WT MU MU MU
## Levels: WT MU

Set up the model, and run DESeq2:

design(dds) <- ~ genotype + condition + genotype:condition
dds <- DESeq(dds) 
resultsNames(dds)

## [1] "Intercept"               "genotype_MU_vs_WT"      
## [3] "condition_Trt_vs_Ctrl"   "genotypeMU.conditionTrt"

Below, we are going to use the combination of the different results (“genotype_MU_vs_WT”, “condition_Trt_vs_Ctrl”, “genotypeMU.conditionTrt” ) to derive biologically meaningful comparisons.

The effect of treatment in wild-type (the main effect).

res = results(dds, contrast=c("condition","Trt","Ctrl"))
ix = which.min(res$padj) # most significant
res <- res[order(res$padj),] # sort
kable(res[1:5,-(3:4)])

	baseMean	log2FoldChange	pvalue	padj
gene275	25.38836	2.607896	0.0000588	0.2684851
gene2744	101.02954	-1.670837	0.0001099	0.2684851
gene5441	58.67469	1.758921	0.0001261	0.2684851
gene7021	74.29359	1.610853	0.0001345	0.2684851
gene7795	326.43308	-1.624323	0.0000407	0.2684851

This is for WT, treated compared with untreated. Note that WT is not mentioned, because it is the reference level. In other words, this is the difference between samples No. 7-9, compared with samples No. 1-3.

Here we show the most significant gene.

barplot(assay(dds)[ix,],las=2, main=rownames(dds)[ ix  ]  )

The effect of treatment in mutant

This is, by definition, the main effect plus the interaction term (the extra condition effect in genotype Mutant compared to genotype WT).

res <- results(dds, list( c("condition_Trt_vs_Ctrl","genotypeMU.conditionTrt") ))
ix = which.min(res$padj) # most significant
res <- res[order(res$padj),] # sort
kable(res[1:5,-(3:4)])

	baseMean	log2FoldChange	pvalue	padj
gene5102	18.69017	4.757057	1.60e-06	0.0156910
gene7367	170.69259	1.481016	1.19e-05	0.0396834
gene8351	27.77227	3.033622	9.80e-06	0.0396834
gene8034	53.34342	-1.841023	1.62e-05	0.0403724
gene7272	92.82351	-1.414967	6.70e-05	0.1125808
Note that i	t has to be	list( c(“condit	ion_Trt_vs_	Ctrl“,”genotypeMU.conditionTrt“) ). If list( ) is ommitted, the results would be drastically different. Perhaps only the first coefficient is used.

This measures the effect of treatment in mutant. In other words, samples No. 10-12 compared with samples No. 4-6.

Here we show the most significant gene, which is downregulated expressed in samples 10-12, than samples 4-6 , as expected.

barplot(assay(dds)[ix,],las=2, main=rownames(dds)[ ix  ]  )

What is the difference between mutant and wild-type without treatment?

As Ctrl is the reference level, we can just retrieve the “genotype_MU_vs_WT”.

res = results(dds, contrast=c("genotype","MU","WT"))
ix = which.min(res$padj) # most significant
res <- res[order(res$padj),] # sort
kable(res[1:5,-(3:4)])

	baseMean	log2FoldChange	pvalue	padj
gene5102	18.69017	-4.714519	0.0000020	0.0195594
gene2289	45.04711	1.763615	0.0000333	0.1218120
gene3388	122.85582	-1.529323	0.0000366	0.1218120
gene915	39.03250	2.104003	0.0001133	0.2374845
gene8351	27.77227	-2.653182	0.0001190	0.2374845

In other words, this is the samples No.4-6 compared with No. 1-3. Here we show the most significant gene.

barplot(assay(dds)[ix,],las=2, main=rownames(dds)[ ix  ]  )

With treatment, what is the difference between mutant and wild-type?

res = results(dds, list( c("genotype_MU_vs_WT","genotypeMU.conditionTrt") ))
ix = which.min(res$padj) # most significant
res <- res[order(res$padj),] # sort
kable(res[1:5,-(3:4)])

	baseMean	log2FoldChange	pvalue	padj
gene4127	50.03677	1.925094	0.0000091	0.0910878
gene3879	34.67704	-2.133203	0.0000283	0.0940387
gene4799	24.11059	-2.459345	0.0000197	0.0940387
gene5441	58.67469	-1.744769	0.0001418	0.2357640
gene5926	161.16737	1.524163	0.0001339	0.2357640

This gives us the difference between genotype MU and WT, under condition Trt. In other words, this is the sampless No. 10-12 compared with samples 7-9.

Here we show the most significant gene.

barplot(assay(dds)[ix,],las=2, main=rownames(dds)[ ix  ]  )

The different response in genotypes (interaction term)

Is the effect of treatment different across genotypes? This is the interaction term.

res = results(dds, name="genotypeMU.conditionTrt")
ix = which.min(res$padj) # most significant
res <- res[order(res$padj),] # sort
kable(res[1:5,-(3:4)])

	baseMean	log2FoldChange	pvalue	padj
gene5102	18.69017	7.395771	0.0000000	0.0000341
gene5441	58.67469	-3.297673	0.0000004	0.0019104
gene8034	53.34342	-2.564030	0.0000205	0.0682558
gene8858	14.68650	4.703936	0.0000297	0.0742066
gene4799	24.11059	-2.990101	0.0001234	0.2463639
Here we sho	w the most	significant gene.

barplot(assay(dds)[ix,],las=2, main=rownames(dds)[ ix  ]  )

In wild-type, this gene is upregulated by treatment, while for the mutant, it is downregulated.

Example 4: Two conditionss, three genotpes with interaction terms

dds <- makeExampleDESeqDataSet(n=10000,m=18)
dds$genotype <- factor(rep(rep(c("I","II","III"),each=3),2))
colnames(dds) <- paste(rownames(colData(dds)), as.character( dds$condition), as.character( dds$genotype),sep="_"  )
colnames(dds) = gsub("sample","S",colnames(dds))
design(dds) <- ~ genotype + condition + genotype:condition
dds <- DESeq(dds)
kable( colData(dds))

	condition	genotype	sizeFactor
S1_A_I	A	I	1.053964
S2_A_I	A	I	1.058935
S3_A_I	A	I	1.061629
S4_A_II	A	II	1.047347
S5_A_II	A	II	1.059813
S6_A_II	A	II	1.049639
S7_A_III	A	III	1.060271
S8_A_III	A	III	1.057331
S9_A_III	A	III	1.052053
S10_B_I	B	I	1.053989
S11_B_I	B	I	1.044408
S12_B_I	B	I	1.042376
S13_B_II	B	II	1.042354
S14_B_II	B	II	1.050094
S15_B_II	B	II	1.045868
S16_B_III	B	III	1.050772
S17_B_III	B	III	1.047934
S18_B_III	B	III	1.036983

resultsNames(dds)

## [1] "Intercept"              "genotype_II_vs_I"      
## [3] "genotype_III_vs_I"      "condition_B_vs_A"      
## [5] "genotypeII.conditionB"  "genotypeIII.conditionB"

The condition effect for genotype I (the main effect)

res = results(dds, contrast=c("condition","B","A"))
ix = which.min(res$padj) # most significant
res <- res[order(res$padj),] # sort
kable(res[1:5,-(3:4)])

	baseMean	log2FoldChange	pvalue	padj
gene9348	48.01943	1.915941	0.0000155	0.1546421
gene928	60.03093	-1.562622	0.0001151	0.5748195
gene931	439.23135	-1.101161	0.0004187	0.6065829
gene4483	16.83153	2.957899	0.0005486	0.6065829
gene4612	29.09139	-1.742999	0.0006893	0.6065829

barplot(assay(dds)[ix,],las=2, main=rownames(dds)[ ix  ]  )

The condition effect for genotype III.

This is the main effect plus the interaction term (the extra condition effect in genotype III compared to genotype I).

res = results(dds, contrast=list( c("condition_B_vs_A","genotypeIII.conditionB") ))
ix = which.min(res$padj) # most significant
res <- res[order(res$padj),] # sort
kable(res[1:5,-(3:4)])

	baseMean	log2FoldChange	pvalue	padj
gene316	63.666427	-1.724235	0.0000294	0.2936298
gene252	5.751827	-6.012712	0.0001418	0.2977684
gene2068	16.851863	-2.725616	0.0001422	0.2977684
gene6828	19.669827	2.840392	0.0000635	0.2977684
gene7420	85.226424	-1.510240	0.0001788	0.2977684

barplot(assay(dds)[ix,],las=2, main=rownames(dds)[ ix  ]  )

The condition effect for genotype II.

This is the main effect plus the interaction term (the extra condition effect in genotype II compared to genotype I).

res = results(dds, contrast=list( c("condition_B_vs_A","genotypeII.conditionB") ))
ix = which.min(res$padj) # most significant
res <- res[order(res$padj),] # sort
kable(res[1:5,-(3:4)])

	baseMean	log2FoldChange	pvalue	padj
gene980	71.929692	-1.711667	0.0005822	0.6482827
gene1177	109.057633	-1.281177	0.0002111	0.6482827
gene5118	3.487054	-5.757413	0.0005110	0.6482827
gene5570	23.786589	2.329791	0.0004286	0.6482827
gene6519	94.137270	1.422193	0.0001399	0.6482827

barplot(assay(dds)[ix,],las=2, main=rownames(dds)[ ix  ]  )

This is equivalent to using numeric vector referencing to each of the results.

[1] “Intercept” “genotype_II_vs_I” “genotype_III_vs_I”
[4] “condition_B_vs_A” “genotypeII.conditionB” “genotypeIII.conditionB”

res = results(dds, contrast= c(0,0,0,1,1,0))
ix = which.min(res$padj) # most significant
res <- res[order(res$padj),] # sort
kable(res[1:5,-(3:4)])

	baseMean	log2FoldChange	pvalue	padj
gene980	71.929692	-1.711667	0.0005822	0.6482827
gene1177	109.057633	-1.281177	0.0002111	0.6482827
gene5118	3.487054	-5.757413	0.0005110	0.6482827
gene5570	23.786589	2.329791	0.0004286	0.6482827
gene6519	94.137270	1.422193	0.0001399	0.6482827

barplot(assay(dds)[ix,],las=2, main=rownames(dds)[ ix  ]  )

The effect of III vs. II, under condition A.

This is the effect of “genotype_III_vs_I” minus “genotype_II_vs_I” ?

res = results(dds, contrast= c(0,-1,1,0,0,0))
ix = which.min(res$padj) # most significant
res <- res[order(res$padj),] # sort
kable(res[1:5,-(3:4)])

	baseMean	log2FoldChange	pvalue	padj
gene7875	172.84101	1.316870	0.0000355	0.3542127
gene316	63.66643	1.399247	0.0006044	0.5256431
gene557	17.92683	-1.980097	0.0008418	0.5256431
gene1243	51.92628	-1.844387	0.0004523	0.5256431
gene1702	55.34857	-1.631481	0.0006698	0.5256431

barplot(assay(dds)[ix,],las=2, main=rownames(dds)[ ix  ]  )

The interaction term for condition effect in genotype III vs genotype I.

This tests if the condition effect is different in III compared to I

res = results(dds, name="genotypeIII.conditionB")
ix = which.min(res$padj) # most significant
res <- res[order(res$padj),] # sort
kable(res[1:5,-(3:4)])

	baseMean	log2FoldChange	pvalue	padj
gene9348	48.019429	-2.714891	0.0000104	0.1042060
gene6251	14.477069	-4.780646	0.0000479	0.2395063
gene839	161.393935	-1.662756	0.0000886	0.2950890
gene9383	9.149222	-6.436304	0.0001560	0.3896116
gene931	439.231351	1.562204	0.0003859	0.4506581

barplot(assay(dds)[ix,],las=2, main=rownames(dds)[ ix  ]  )

The interaction term for condition effect in genotype III vs genotype II.

This tests if the condition effect is different in III compared to II

res = results(dds, contrast=list("genotypeIII.conditionB", "genotypeII.conditionB"))
ix = which.min(res$padj) # most significant
res <- res[order(res$padj),] # sort
kable(res[1:5,-(3:4)])

	baseMean	log2FoldChange	pvalue	padj
gene5761	27.797507	4.231182	0.0000012	0.0116502
gene6598	100.143591	2.317486	0.0000214	0.1068060
gene3491	325.224707	1.942243	0.0001168	0.2853653
gene4635	8.411707	5.206116	0.0000963	0.2853653
gene5118	3.487054	8.088922	0.0002454	0.2853653

barplot(assay(dds)[ix,],las=2, main=rownames(dds)[ ix  ]  )

Example 5: Three conditions, three genotypes with interaction terms

For a more complex example, we now have a 3x3 factorial design. It is convenent to use a diagram to represent study design and the coefficients generated by DESeq2. Each connections in the diagram below represents a coefficient for a contrast and we can use the combination of these to define desired contrasts by travelling through the graph/network.

dds <- makeExampleDESeqDataSet(n=10000,m=18)
dds$genotype <- factor(rep(rep(c("I","II","III"),each=3),2))
dds$condition <-factor(rep(c("A","B","C"), 6)  )
colnames(dds) <- paste(as.character( dds$genotype), as.character( dds$condition),rownames(colData(dds)), sep="_"  )
colnames(dds) = gsub("sample","",colnames(dds))
design(dds) <- ~ genotype + condition + genotype:condition
dds <- DESeq(dds)
kable( colData(dds))

	condition	genotype	sizeFactor
I_A_1	A	I	1.055853
I_B_2	B	I	1.037228
I_C_3	C	I	1.048146
II_A_4	A	II	1.048901
II_B_5	B	II	1.062858
II_C_6	C	II	1.036711
III_A_7	A	III	1.052625
III_B_8	B	III	1.033361
III_C_9	C	III	1.040028
I_A_10	A	I	1.054525
I_B_11	B	I	1.045331
I_C_12	C	I	1.036150
II_A_13	A	II	1.052502
II_B_14	B	II	1.047103
II_C_15	C	II	1.063802
III_A_16	A	III	1.053762
III_B_17	B	III	1.041255
III_C_18	C	III	1.048660

resultsNames(dds)

## [1] "Intercept"              "genotype_II_vs_I"      
## [3] "genotype_III_vs_I"      "condition_B_vs_A"      
## [5] "condition_C_vs_A"       "genotypeII.conditionB" 
## [7] "genotypeIII.conditionB" "genotypeII.conditionC" 
## [9] "genotypeIII.conditionC"

To derive desired contrasts, we can travel on the network of factor levels, using the given 9 coefficients as bridges.For example, we are interested in the difference between genotpes III vs. II, under condition C. This is not readily available. But we can travel from C to III, then III to I, I to II, and then II to C. Namely the path is C–III–I–II–C. In terms of coefficients: (III-I) - (II-I) + III.C - II.C= “genotype_III_vs_I” - “genotype_II_vs_I” + “genotypeIII.conditionC” - “genotypeII.conditionC” For such a complex contrast, we can use the numerical combination using a vector: c(0,-1,1,0,0,0,0,-1,1 ) These numbers are applied to the coefficients in the resultsNames. To double check we order them:

cbind(resultsNames(dds),c(0,-1,1,0,0,0,0,-1,1 ) )

##       [,1]                     [,2]
##  [1,] "Intercept"              "0" 
##  [2,] "genotype_II_vs_I"       "-1"
##  [3,] "genotype_III_vs_I"      "1" 
##  [4,] "condition_B_vs_A"       "0" 
##  [5,] "condition_C_vs_A"       "0" 
##  [6,] "genotypeII.conditionB"  "0" 
##  [7,] "genotypeIII.conditionB" "0" 
##  [8,] "genotypeII.conditionC"  "-1"
##  [9,] "genotypeIII.conditionC" "1"

Now we can extract the contrast. ## The effect of III vs. II, under condition C.

res = results(dds, contrast= c(0,-1,1,0,0,0,0,0,1 ))
ix = which.min(res$padj) # most significant
res <- res[order(res$padj),] # sort
kable(res[1:5,-(3:4)])

	baseMean	log2FoldChange	pvalue	padj
gene793	50.368566	3.737810	3.36e-05	0.1460342
gene1161	827.243573	1.844200	4.39e-05	0.1460342
gene3314	23.430794	4.580687	1.57e-05	0.1460342
gene4445	51.054925	-3.096770	6.96e-05	0.1736826
gene1555	5.073803	10.771498	9.70e-05	0.1937376

barplot(assay(dds)[ix,],las=2, main=rownames(dds)[ ix  ]  )

How about many factors, each with many levels? It’s going to become increasingly complex. Drawing a network graph like the one above can help you find the right combination.

Please let me know if you have any comment.

DESeq2 experimental design and interpretation

by Steven Xijin Ge, South Dakota State University http://ge-lab.org