When testing iDEP, I and other users noticed that sometimes when using limma to identify differentially expressed genes, up-regulated genes are labelled as down-regulated, and vice versa. I digged into this a bit more and found an issue related to design matrices for limma. The limma documentation is not very clear on this, so I will document it here.
We will use part of the demo data for iDEP, available on GitHub.
library(edgeR)
## Loading required package: limma
library(limma)
library(knitr)
fileURL <- "https://raw.githubusercontent.com/iDEP-SDSU/idep/master/shinyapps/idep/BcellGSE71176_p53.csv"
x <- read.csv(fileURL, row.names = 1)
x <- x[ , 1:8]
colnames(x) <- gsub("p53_","",colnames(x))
kable( head( x ) )
mock_1 | mock_2 | mock_3 | mock_4 | IR_1 | IR_2 | IR_3 | IR_4 | |
---|---|---|---|---|---|---|---|---|
ENSMUSG00000028995 | 3210 | 2553 | 2096 | 3111 | 1054 | 663 | 1012 | 1027 |
ENSMUSG00000045007 | 122 | 97 | 53 | 78 | 22 | 10 | 15 | 22 |
ENSMUSG00000062785 | 183 | 182 | 143 | 232 | 530 | 384 | 590 | 655 |
ENSMUSG00000038268 | 1117 | 843 | 669 | 783 | 526 | 369 | 654 | 765 |
ENSMUSG00000054589 | 85 | 46 | 36 | 45 | 693 | 394 | 865 | 1120 |
ENSMUSG00000102752 | 52 | 59 | 45 | 75 | 21 | 16 | 20 | 11 |
# Correct orientation |
groups = gsub("_.*", "", colnames( x ))
groups <- factor(groups, levels = c("mock","IR") )
design <- model.matrix( ~ 0 + groups )
colnames(design) <- c("mock","IR")
design
## mock IR
## 1 1 0
## 2 1 0
## 3 1 0
## 4 1 0
## 5 0 1
## 6 0 1
## 7 0 1
## 8 0 1
## attr(,"assign")
## [1] 1 1
## attr(,"contrasts")
## attr(,"contrasts")$groups
## [1] "contr.treatment"
dge <- DGEList(counts=x);
dge <- calcNormFactors(dge, method = "TMM") # normalization
v <- voom(dge, design)
fit <- lmFit(v, design)
cont.matrix <- makeContrasts(contrasts = "IR-mock", levels=design)
fit2 <- contrasts.fit(fit, cont.matrix)
fit2 <- eBayes(fit2, trend=FALSE)
# calls differential gene expression 1 for up, -1 for down
results <- decideTests(fit2, p.value = .1, lfc= log2(2) )
topGenes =topTable(fit2, number = 1e12,sort.by="M" )
head(topGenes)
## logFC AveExpr t P.Value adj.P.Val
## ENSMUSG00000026831 9.892108 0.68261871 6.389315 9.581983e-05 5.409615e-04
## ENSMUSG00000000308 8.584652 0.03622513 8.150802 1.296283e-05 1.148136e-04
## ENSMUSG00000041801 8.556836 4.89825205 14.668745 6.909906e-08 3.507048e-06
## ENSMUSG00000026700 8.436601 3.83021718 11.872409 4.763704e-07 1.127824e-05
## ENSMUSG00000037188 7.932190 1.99753040 17.198558 1.581832e-08 1.522821e-06
## ENSMUSG00000009628 7.877015 4.17154790 11.394278 6.898578e-07 1.447440e-05
## B
## ENSMUSG00000026831 1.239444
## ENSMUSG00000000308 2.533939
## ENSMUSG00000041801 8.619459
## ENSMUSG00000026700 6.654007
## ENSMUSG00000037188 7.715722
## ENSMUSG00000009628 6.501137
This is correct, as we can see from the most upregulated genes.
ix <- match(row.names(topGenes)[1], row.names(x) )
top1 <- ( as.vector( t( x[ix,]) ) )
names( top1 ) = colnames(x)
barplot( top1 , las=3)
If the levels in the factor does not align correctly with the column names of the design matrix, then we will get a incorrect design matrix which could lead to up-regulated genes are makred as down-regulated ones.
groups = gsub("_.*", "", colnames( x ))
groups <- factor(groups, levels = c("IR","mock") )
design <- model.matrix( ~ 0 + groups )
colnames(design) <- c("mock","IR")
design
## mock IR
## 1 0 1
## 2 0 1
## 3 0 1
## 4 0 1
## 5 1 0
## 6 1 0
## 7 1 0
## 8 1 0
## attr(,"assign")
## [1] 1 1
## attr(,"contrasts")
## attr(,"contrasts")$groups
## [1] "contr.treatment"
dge <- DGEList(counts=x);
dge <- calcNormFactors(dge, method = "TMM") # normalization
v <- voom(dge, design)
fit <- lmFit(v, design)
cont.matrix <- makeContrasts(contrasts = "IR-mock", levels=design)
fit2 <- contrasts.fit(fit, cont.matrix)
fit2 <- eBayes(fit2, trend=FALSE)
# calls differential gene expression 1 for up, -1 for down
results <- decideTests(fit2, p.value = .1, lfc= log2(2) )
topGenes =topTable(fit2, number = 1e12,sort.by="M" )
head(topGenes)
## logFC AveExpr t P.Value
## ENSMUSG00000026831 -9.892108 0.68261871 -6.389315 9.581983e-05
## ENSMUSG00000000308 -8.584652 0.03622513 -8.150802 1.296283e-05
## ENSMUSG00000041801 -8.556836 4.89825205 -14.668745 6.909906e-08
## ENSMUSG00000026700 -8.436601 3.83021718 -11.872409 4.763704e-07
## ENSMUSG00000037188 -7.932190 1.99753040 -17.198558 1.581832e-08
## ENSMUSG00000009628 -7.877015 4.17154790 -11.394278 6.898578e-07
## adj.P.Val B
## ENSMUSG00000026831 5.409615e-04 1.239444
## ENSMUSG00000000308 1.148136e-04 2.533939
## ENSMUSG00000041801 3.507048e-06 8.619459
## ENSMUSG00000026700 1.127824e-05 6.654007
## ENSMUSG00000037188 1.522821e-06 7.715722
## ENSMUSG00000009628 1.447440e-05 6.501137
The “levels =” parameters sets the reference levels and the order of the levels. This is used in creating the design matrix. So as long as the orders are the same, we will get correct answers. Notice now that it is “IR” is the reference level for groups, and the column names are reversed accordingly. This will give us correct results.
groups = gsub("_.*", "", colnames( x ))
groups <- factor(groups, levels = c("IR","mock") )
design <- model.matrix( ~ 0 + groups )
colnames(design) <- c("IR","mock")
design
## IR mock
## 1 0 1
## 2 0 1
## 3 0 1
## 4 0 1
## 5 1 0
## 6 1 0
## 7 1 0
## 8 1 0
## attr(,"assign")
## [1] 1 1
## attr(,"contrasts")
## attr(,"contrasts")$groups
## [1] "contr.treatment"
Perhaps a more straightforward way is not setting the levels and column names directly, but use whatever was generated.
groups = gsub("_.*", "", colnames( x ))
groups <- factor(groups )
design <- model.matrix( ~ 0 + groups )
design
## groupsIR groupsmock
## 1 0 1
## 2 0 1
## 3 0 1
## 4 0 1
## 5 1 0
## 6 1 0
## 7 1 0
## 8 1 0
## attr(,"assign")
## [1] 1 1
## attr(,"contrasts")
## attr(,"contrasts")$groups
## [1] "contr.treatment"
Then, the only thing we need to do is to remove the “groups” from the column names.
colnames(design) <- gsub("^groups","",colnames(design))
design
## IR mock
## 1 0 1
## 2 0 1
## 3 0 1
## 4 0 1
## 5 1 0
## 6 1 0
## 7 1 0
## 8 1 0
## attr(,"assign")
## [1] 1 1
## attr(,"contrasts")
## attr(,"contrasts")$groups
## [1] "contr.treatment"
This works because both the levels and the colnames are auto-generated. We just have to remove the name of the factor.
The new iDEP v 0.82, this issue has been resolved. If you used iDEP, please check your results carefully if you used limma to analyze two groups (control vs. experiment). As the factor levels are assigned as reference level based on alphbetical order, there is a 50% chance that the up- and down-regulated genes are reversed. If you have multiple groups, it is does not affect your results. If you used DESeq2, it should be fine.