Overview

The R package limmaDE2 contains a set of functions designed to make running differential expression analyses in LIMMA more user friendly. There are several functions of this package:
  • pipeLIMMA Run analyses of differnetial expression in LIMMA.
  • makeBinarySig Find and count the number of significantly differentially expressed genes
  • pqHists Plot the distribution of P and FDR-corrected P (Q) values
  • voom2PCA Conduct principal component analyses on voom (or otherwise) normalized expression matrices
  • volcanoPlot Generate log2 foldchange vs. P-value volcano plots
  • volcanoPair Compare two sets of log2 fold changes, colored by p-values
  • pipeDESeq2 Runs an analysis of differential expression similar to that of pipeLIMMA, except through the DESeq2 package.

1. Installation

The limmaDE2 packages is not on CRAN and can only be installed from github. Make sure the devtools package is installed on your system. The limmaDE2 package also requires several other R packages to be installed. These packages have functions upon which limmaDE2 depends:

Essential Packages:
  • limma contains all of the basic statistical elements that are called here.
  • qvalue fdr-corrected p-values
  • edgeR Differential expression package used for some transformations
Data for this vignette comes from:
  • SimSeq permits simulation of RNA-seq data
Other useful packages:
  • plyr summarize and manipulate lists and dataframes
  • vegan some distance-based methods
  • Heatplus heatmap plotting
  • topGO gene ontology (GO) analysis
  • ggplot2 routines for complex plots
  • hexbin methods to simplify data in plots
To install limmaDE2:
library(devtools)
install_github("jtlovell/limmaDE2")
Load all packages that will be needed for the tutorial:
library("limmaDE2")
library("ggplot2")
library("reshape2")
library("SimSeq")
library("DESeq2")

2. Importing data and checking

For limmaDE2 to work, two matrices are needed:
  • counts: Raw transcript abundance counts
  • info: Experimental design information These two matrices must match exactly, where each row in the info matrix corresponds to each column in counts. For best performance, the name of each gene should be stored in the rownames of the counts matrix.
data(kidney) # from simseq
counts<-kidney$counts
counts<-counts[sample(1:nrow(counts),1000),]
info<-with(kidney, 
           data.frame(id = paste(replic, treatment, sep = "_"), 
                      rep=replic, 
                      Treatment=ifelse(treatment == "Tumor","tumor","cntr"), 
                      stringsAsFactors=F))
colnames(counts)<-info$id
Add in another category that gives us a factorial design
group.a<-c("4619", "4712", "4863", "4865", "5452", "5453", "5454", "5455",
          "5456", "5457","5458", "5461", "5462", "5463", "5465", "5467",
          "5468", "5469", "5470", "5549","5552", "5580", "5641", "5672",
          "5689", "5701", "5703", "5706", "5989", "6088")
info$group<-ifelse(info$rep %in% group.a, "a","b")
with(info, table(group, Treatment))
##      Treatment
## group cntr tumor
##     a   30    30
##     b   42    42
info$trt.grp<-with(info, paste(Treatment, group, sep="_"))
The top of the example experimental design dataset
head(info)
##               id  rep Treatment group trt.grp
## 1 3358_Non-Tumor 3358      cntr     b  cntr_b
## 2     3358_Tumor 3358     tumor     b tumor_b
## 3 3387_Non-Tumor 3387      cntr     b  cntr_b
## 4     3387_Tumor 3387     tumor     b tumor_b
## 5 4619_Non-Tumor 4619      cntr     a  cntr_a
## 6     4619_Tumor 4619     tumor     a tumor_a
What the example counts dataset looks like
counts[1:10,1:3]
##                3358_Non-Tumor 3358_Tumor 3387_Non-Tumor
## ZEB1|6935                1248       3517           1388
## ZNF841|284371             274        951            194
## OSBPL2|9885              2182       3161           2107
## DAOA|267012                 0          0              0
## RAB5B|5869               8937      13844           9327
## PFDN4|5203                515        925            375
## TMEM223|79064            1094       1267            962
## CDC42EP3|10602           1418       2801           1799
## MRPL50|54534             1062       1519           1120
## IL11RA|3590               384        656            291

In all statistical analyses, it is important to set the levels of the experimental factors. This is esspecially true in linear modelling, such as limma, where levels are tested against the base level. Here, we set the “Non-Tumor” treatment and “a” group as the base.

info$Treatment <- factor(info$Treatment,
                           levels = c("cntr", "tumor"))
info$group <- factor(info$group,
                           levels = c("a", "b"))

3. Basic analysis of differential expression

To test the plasticity (differential expression) of all genes between the tumor and control tissue:
stats <- pipeLIMMA(counts = counts, 
                   info = info, 
                   block = NULL, 
                   formula = "~ Treatment")
## calculating normalization factors ... 
## running voom normalization correcting for quality weights ... 
## fitting linear model ... 
## processing statistics and calculating q-values ...
lmStats<-stats$stats
voom<-stats$voom$E

pipeLIMMA returns three elements: stats, voom and fstats. These are the statistical output of the limma functions: eBayes, voom and topTableF. Inspect the top few rows and columns of each.

stats: eBayes linear model statistics
gene sigma s2.post Amean Fstat Fpvalue
ZEB1|6935 ZEB1|6935 1.2591387 1.5671488 9.929414 64.363972 0.0000000
ZNF841|284371 ZNF841|284371 0.8035096 0.6461980 7.196959 40.413552 0.0000000
OSBPL2|9885 OSBPL2|9885 0.6142314 0.3832333 10.090835 195.217234 0.0000000
DAOA|267012 DAOA|267012 0.3963376 0.1674536 -2.379793 2.425542 0.1215533
RAB5B|5869 RAB5B|5869 0.3284865 0.1192599 12.201714 78.891302 0.0000000
PFDN4|5203 PFDN4|5203 0.7327044 0.5396079 8.042116 38.676736 0.0000000
voom: normalized expression matrix
3358_Non-Tumor 3358_Tumor 3387_Non-Tumor 3387_Tumor 4619_Non-Tumor 4619_Tumor
ZEB1|6935 9.377552 10.108863 9.495664 10.461569 9.648007 10.657851
ZNF841|284371 7.192234 8.222588 6.659979 7.756724 6.922638 7.720599
OSBPL2|9885 10.183338 9.954922 10.097669 9.677473 9.827493 10.834705
DAOA|267012 -1.908428 -2.671471 -1.943648 -2.623737 -1.448049 -2.142038
RAB5B|5869 12.217228 12.085554 12.243627 11.880647 12.106659 12.274430
PFDN4|5203 8.101401 8.182618 7.609021 7.937551 8.439171 8.270532
fstats: F-statistics for each factor in the model, in this case, just treatment
gene Treatment_F Treatment_P.Value Treatment_Q.Value
AAGAB|79719 337.7770122 0.0000000 0.0000000
ABCB10|23456 49.1438369 0.0000000 0.0000000
ABCB5|340273 5.8306633 0.0169960 0.0239718
ABCC1|4363 153.8390560 0.0000000 0.0000000
ABCC6|368 0.0717106 0.7892431 0.8119785
ABHD3|171586 150.6163678 0.0000000 0.0000000

4. Differential expression using a blocking variable

Lets say that here, there is some sort of experimental blocking. In this case, we employ a routine in limma that calculates the duplicateCorrelation among blocks, then uses the blocking variable in the linear model fit.
info$block <- rep(1:2,each=nrow(info)/2)
stats.block <- pipeLIMMA(counts = counts, 
                   info = info, 
                   block = info$block,
                   formula = "~ Treatment")
## calculating normalization factors ... 
## running voom normalization correcting for quality weights ... 
## calculating duplicate correlation among replicates ... 
## fitting linear model ... 
## processing statistics and calculating q-values ...

5. Differential expression in a factorial experiment

stats.factorial <- pipeLIMMA(counts = counts, 
                   info = info, 
                   block = NULL, 
                   formula = "~ Treatment + group + Treatment*group")
## calculating normalization factors ... 
## running voom normalization correcting for quality weights ... 
## fitting linear model ... 
## processing statistics and calculating q-values ...

6. Calculating differential expression using specific contrasts

Sometimes, it may make more sense to use specific contrasts to test for differential expression. For example, let’s say we are interested in how the tumors and non-tumor tissues differ for each of the two groups.

First, we need to construct a design matrix.
design <- with(info, model.matrix(~ 0 + trt.grp))
colnames(design)<-gsub("trt.grp","",colnames(design))
head(design)
##   cntr_a cntr_b tumor_a tumor_b
## 1      0      1       0       0
## 2      0      0       0       1
## 3      0      1       0       0
## 4      0      0       0       1
## 5      1      0       0       0
## 6      0      0       1       0
Then we construct a contrast matrix.
contrast.matrix <- makeContrasts(
  tumor_a - cntr_a , 
  tumor_b - cntr_b,
  levels = design)
head(contrast.matrix)
##          Contrasts
## Levels    tumor_a - cntr_a tumor_b - cntr_b
##   cntr_a                -1                0
##   cntr_b                 0               -1
##   tumor_a                1                0
##   tumor_b                0                1
Finally, we fit the model with a contrast matrix as the design, overriding the formula argument.
stats <- pipeLIMMA(counts = counts, 
                   info = info, 
                   block = NULL, 
                   design = design, 
                   contrast.matrix = contrast.matrix)
## calculating normalization factors ... 
## running voom normalization correcting for quality weights ... 
## fitting model to contrast matrix ... 
## processing statistics and calculating q-values ...
stats.contrasts <- stats$stats

6. Count and plot the number of significantly differentially expressed genes

The function makeBinarySig looks for a provided string (e.g. “Q.Value”) and outputs a matrix with whether or not those columns have values <= the provided alpha

sigs <- makeBinarySig(stats.contrasts, what = "Qvalue", alpha = 0.05)
## ebayesQvalue_tumor_a...cntr_a    705 
## ebayesQvalue_tumor_b...cntr_b    679 
Make a venn diagram of number of differentially expressed genes among the experimental factors
counts2Venn(x=sigs, cols=c(1,2), names=c("in.grpA","in.grpB"),
   colors=c("blue","red"),type="limma", legx=-3.3,legy=-3)

Make a euler diagram of number of differentially expressed genes among the experimental factors
counts2Venn(x=sigs, cols=c(1,2), names=c("in.grpA","in.grpB"),
   colors=c("blue","red"),type="Euler", legx=-3.3,legy=-3)

7. Make a volcano plot of the results

Volcano plots are a good way to look at the differences between two experimental levels. Here, we compare the extent of differential expression between the “high” treatment to the “low” treatment.

with(lmStats, volcanoPlot(pval = ebayesPvalue_Treatmenttumor,
                          lfc = Treatmenttumor_logFC,
                          sig = ebayesQvalue_Treatmenttumor,
                          main = "no tumor vs. tumor Volcano Plot", 
                          xlab = "tumor - no tumor Log2 Fold Change",
                          bty = "n", legpos = "top", leginset = c(0,-.1)))

##       sig
## updown FALSE TRUE
##   down    90  355
##   up     178  377

8. Using two contrasts, make a pairwise volcano plot

Custom colors
sigs <- data.frame(makeBinarySig(stats.contrasts, what = "Qvalue", alpha = 0.05))
## ebayesQvalue_tumor_a...cntr_a    705 
## ebayesQvalue_tumor_b...cntr_b    679 
names(sigs)<-c("sig.a","sig.b")
sigs$sign.a<-sign(stats.contrasts$tumor_a...cntr_a_logFC)
sigs$sign.b<-sign(stats.contrasts$tumor_b...cntr_b_logFC)

cols<-with(sigs, ifelse(sig.a + sig.b == 0,  
                        "grey",
                        ifelse(sig.a + sig.b == 2 & sign.a*sign.b == 1, 
                               "pink",
                               ifelse(sig.a + sig.b == 2 & sign.a*sign.b == -1,
                                      "cornflowerblue",
                                      ifelse(sig.a == 1, "darkblue", "darkred")))))
Make the plot
with(stats.contrasts, volcanoPair(lfc1 = tumor_a...cntr_a_logFC,
                                  lfc2 = tumor_b...cntr_b_logFC,
                                  pt.col = cols, pt.pch = 19, pt.cex=.5,
                                  xlab = "Tumor - control LFC (group A)",
                                  ylab = "Tumor - control LFC (group B)"))

## pt.col
## cornflowerblue       darkblue        darkred           grey           pink 
##             32            167            141            154            506

Principal component analysis of gene expression

It is often a good idea to get an idea of how the individuals (libraries) are structured - how similar are they to eachother. Principal component analyses allow for this kind of inference.

pca12 <- counts2PCA(counts=voom, info = info, ids = info$id, pcas2return = 6)

pca12.var <- pca12[[2]]
pca12 <- pca12[[1]]
gcols <- c("darkblue", "blue", "gold", "green", "pink", "red")
ggplot(pca12, aes(x = PC1, y = PC2, col = group, shape = Treatment)) +
  geom_vline(xintercept = 0, lty = 2)+
  geom_hline(yintercept = 0, lty = 2)+
  geom_point() +
  theme_bw() +
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        strip.background = element_blank()) +
  labs(x = paste("PCA1 (", pca12.var[1],"%)", sep = ""),
       y = paste("PCA2 (", pca12.var[2],"%)", sep = ""),
       title = "PCA analysis of voom-normalized expression")

For more information, visit lovelleeb.weebly.com/analytics