limma and limma + voom pipelineslimma and RNAseq data using edgeR + limma pipelines.
This document can be reproduced by using the R code and the data (for illustrating purposes and for the exercises) that are available here: https://github.com/isglobal-brge/short_course_omic_R/
Transcriptomic data analysis can be performed using data obtained from two infrastructures: microarrays and RNA-seq. This lectures describe the main steps to perform differential expression analysis (DEA) using both technologies. This is a brief introduction that covers the main aspects of DEA and outlines the most common types of analyses in this type of studies. Of course, other packages can be used for that purposes (i.e DESeq2 for RNAseq) but our proposed approaches can be enough for an introductory course.
A basic protocol for a DNA microarray is as follows:
If we are trying to calculate relative expression between two samples, each labeled with a different dye (See next figure, red for experiment, green for control), the resulting image is analyzed by calculating the ratio of the two dyes. If a gene is over-expressed in the experimental sample, then more of that sample cDNA than control cDNA will hybridize to the spot representing that expressed gene. In turn, the spot will fluoresce red with greater intensity than it will fluoresce green. The red-to-green fluorescence ratio thus indicates which gene is up or downregulated in the appropriate sample.
Microarray Experiment
RNAseq data is te counterpart of microarrays using next generation sequencing technology. Data analyst gets a read count table from raw fastq reads obtained from an Illumina sequencing run. Opposite to microarrays, where data are intensity values, the transcription profile from RNAseq experiment of a particular gene follows from counting the number of times the transcripts of the genes were mapped by sequenced reads. The summarized RNA-seq data is known as count data. Figure 1 describes the main steps that are taken to obtain the count table that measures the transcription of active genes in biological samples.
Figure 1: RNA-seq scheme to get count data
Both microarray and RNAseq data can be encapsulated in a ExpressionSet or a SummarizedExperiment. Then different Bioconductor packages can be used to analyze transcription data using these objects as input.
NOTE: In this course we are not covering epigenetic data analyses, but the methodologies used to analyze microarray data also apply to methylation data. For those how are not familiar with Bioconductor, we strongly recommend to use MEAL package that will be later described given its ease of use.
limma Bioconductor packagelimma is a widely used BioC package for analyzing microarray data. LIMMA stands for “linear models for microarray data”. Perhaps unsurprisingly, limma contains functionality for fitting a broad class of statistical models called “linear models”. Examples of such models include linear regression and analysis of variance. While most of the functionality of limma has been developed for microarray data, the model fitting routines of limma are useful for many types of data, and is not limited to microarrays. For example, as we have previously pointed out, DNA methylation data is also used to be analyzed using this package.
The objective of an DEA is frequently to discover which features (genes) are different between groups or stated differently: to discover which genes are differentially expressed between cases and controls. Of course, more complicated designs are also used; sometimes more than two groups are considered and sometimes there are additional important covariates such as age or sex that should be taken into account (adjusted models).
Broadly speaking, how samples are distributed between groups determines the design of the study. In addition to the design, there is one or more question(s) of interest(s) such as the difference between two groups. Such questions are usually formalized as contrasts; an example of a contrast is indeed the difference between two groups.
This can be formalized as
\[ Y = \beta_0 + \beta_1 X_1 + \epsilon \]
In this equation of a linear model, Y is the response variable. It must be a continuous variable. In the context of DEA, it is a relative measure of mRNA expression level for one gene. \(X_1\) is an explanatory variable, which can be continuous or discrete, for example, control group versus treatment, or mutant versus wild type. \(\beta_1\) quantifies the effect of the explanatory variable on the response variable. Furthermore, we can add additional explanatory variables to the equation for more complicated experimental designs. Lastly, models the random noise in the measurements.
In R, you specify a linear model with the function lm. This uses R’s formula
syntax. On the left is the object that contains the response variable, and to
the right of the tilde are the objects that contain the explanatory variables.
lm(y ~ x1)
A second explanatory variable can be added with a plus sign.
\[ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \epsilon \]
lm(y ~ x1 + x2)
The design matrix is then, how samples are distributed accross x1 and x2. Let us illustrate this in a simple simulated case when DEA aims to find genes that are differentially expressed between cases and control (variable x1)
set.seed(1234) # to guarantee reproducibility
x1 <- as.factor(sample(c("case", "control"), 10, replace=TRUE))
x1
[1] control control control control case control case case case control
Levels: case control
model.matrix( ~ x1)
(Intercept) x1control
1 1 1
2 1 1
3 1 1
4 1 1
5 1 0
6 1 1
7 1 0
8 1 0
9 1 0
10 1 1
attr(,"assign")
[1] 0 1
attr(,"contrasts")
attr(,"contrasts")$x1
[1] "contr.treatment"
The data structure represented by an ExpressionSet or a SummarizedExperiment contains features in the rows of the experimental data and variables in columns.
A number of different packages allows us to fit common types of models to this data structure
Simplifying, limma is useful for continuous data such as microarray data and edgeR / DESeq / DESeq2 are useful for count data such as high-throughput sequencing (RNA-seq). Though will see that count data can also be transformed into continuous data and, hence, limma can also be used in that case.
In addition to the distributional assumptions, all of these packages uses something called empirical Bayes techniques to borrow information across features. As stated above, usually the number of samples is small and the number of features is large. It has been shown time and time again that you can get better results by borrowing information across features, for example by modeling a mean-variance relationship. This can be done in many ways and often depends on the data characteristics of a specific type of data. For example both edgeR and DESeq(2) are very popular in the analysis of RNA-seq data and all three packages uses models based on the negative binomial distribution. For a statistical point of view, a main difference between these packages (models) is how they borrow information across genes.
Let us illustrate how to perform DEA using vdx dataset which is available at Bioconductor through the package breastCancerVDX. This package can be installed along with limma using:
BiocManager::install(c("limma", "breastCancerVDX"))
After that, the packages can be loaded into R by:
library(limma)
library(breastCancerVDX)
The vdx dataset from the breastCancerVDX package is an ExpressionSet of women diagnosed with breast cancer. It contains information about 22,000 genes and several variables including estrogen receptor (ER) status.
data(vdx)
vdx
ExpressionSet (storageMode: lockedEnvironment)
assayData: 22283 features, 344 samples
element names: exprs
protocolData: none
phenoData
sampleNames: VDX_3 VDX_5 ... VDX_2038 (344 total)
varLabels: samplename dataset ... e.os (21 total)
varMetadata: labelDescription
featureData
featureNames: 1007_s_at 1053_at ... AFFX-TrpnX-M_at (22283 total)
fvarLabels: probe Gene.title ... GO.Component.1 (22 total)
fvarMetadata: labelDescription
experimentData: use 'experimentData(object)'
pubMedIds: 17420468
Annotation: hgu133a
table(vdx$er)
0 1
135 209
The code 0 means negative and 1 positive estrogen receptor. Now we do a standard limma model fit
design <- model.matrix(~ vdx$er)
fit <- lmFit(vdx, design)
fit <- eBayes(fit)
topTable(fit)
probe Gene.title Gene.symbol
205225_at 205225_at estrogen receptor 1 ESR1
209603_at 209603_at GATA binding protein 3 GATA3
209604_s_at 209604_s_at GATA binding protein 3 GATA3
212956_at 212956_at TBC1 domain family, member 9 (with GRAM domain) TBC1D9
202088_at 202088_at solute carrier family 39 (zinc transporter), member 6 SLC39A6
212496_s_at 212496_s_at lysine (K)-specific demethylase 4B KDM4B
215867_x_at 215867_x_at carbonic anhydrase XII CA12
209602_s_at 209602_s_at GATA binding protein 3 GATA3
212195_at 212195_at interleukin 6 signal transducer (gp130, oncostatin M receptor) IL6ST
218195_at 218195_at chromosome 6 open reading frame 211 C6orf211
Gene.ID EntrezGene.ID UniGene.title UniGene.symbol UniGene.ID
205225_at 2099 2099 <NA> <NA> <NA>
209603_at 2625 2625 <NA> <NA> <NA>
209604_s_at 2625 2625 <NA> <NA> <NA>
212956_at 23158 23158 <NA> <NA> <NA>
202088_at 25800 25800 <NA> <NA> <NA>
212496_s_at 23030 23030 <NA> <NA> <NA>
215867_x_at 771 771 <NA> <NA> <NA>
209602_s_at 2625 2625 <NA> <NA> <NA>
212195_at 3572 3572 <NA> <NA> <NA>
218195_at 79624 79624 <NA> <NA> <NA>
Nucleotide.Title
205225_at Homo sapiens estrogen receptor 1 (ESR1), transcript variant 1, mRNA
209603_at wh43d10.x1 NCI_CGAP_Kid11 Homo sapiens cDNA clone IMAGE:2383507 3- similar to gb:X58072_rna1 TRANS-ACTING T-CELL SPECIFIC TRANSCRIPTION FACTOR (HUMAN);, mRNA sequence
209604_s_at Homo sapiens GATA binding protein 3, mRNA (cDNA clone MGC:2346 IMAGE:3504200), complete cds
212956_at qp61g12.x1 NCI_CGAP_Co8 Homo sapiens cDNA clone IMAGE:1927558 3-, mRNA sequence
202088_at ts65a01.x1 NCI_CGAP_Kid8 Homo sapiens cDNA clone IMAGE:2233416 3-, mRNA sequence
212496_s_at 601112063F1 NIH_MGC_16 Homo sapiens cDNA clone IMAGE:3352887 5-, mRNA sequence
215867_x_at Homo sapiens mRNA; cDNA DKFZp564D066 (from clone DKFZp564D066)
209602_s_at wh43d10.x1 NCI_CGAP_Kid11 Homo sapiens cDNA clone IMAGE:2383507 3- similar to gb:X58072_rna1 TRANS-ACTING T-CELL SPECIFIC TRANSCRIPTION FACTOR (HUMAN);, mRNA sequence
212195_at Homo sapiens mRNA; cDNA DKFZp564F053 (from clone DKFZp564F053)
218195_at Homo sapiens chromosome 6 open reading frame 211 (C6orf211), mRNA
GI GenBank.Accession Platform_CLONEID Platform_ORF Platform_SPOTID
205225_at 170295798 NM_000125 NA NA <NA>
209603_at 5361632 AI796169 NA NA <NA>
209604_s_at 33871929 BC003070 NA NA <NA>
212956_at 4085300 AI348094 NA NA <NA>
202088_at 4686779 AI635449 NA NA <NA>
212496_s_at 9127367 BE256900 NA NA <NA>
215867_x_at 4884095 AL050025 NA NA <NA>
209602_s_at 5361632 AI796169 NA NA <NA>
212195_at 4500013 AL049265 NA NA <NA>
218195_at 13375745 NM_024573 NA NA <NA>
Chromosome.location Chromosome.annotation
205225_at 6q25.1 Chromosome 6, NC_000006.11 (152011631..152424409)
209603_at 10p15 Chromosome 10, NC_000010.10 (8096667..8117164)
209604_s_at 10p15 Chromosome 10, NC_000010.10 (8096667..8117164)
212956_at 4q31.21 Chromosome 4, NC_000004.11 (141541936..141677471, complement)
202088_at 18q12.2 Chromosome 18, NC_000018.9 (33688494..33709357, complement)
212496_s_at 19p13.3 Chromosome 19, NC_000019.9 (4969124..5153609)
215867_x_at 15q22 Chromosome 15, NC_000015.9 (63615730..63674075, complement)
209602_s_at 10p15 Chromosome 10, NC_000010.10 (8096667..8117164)
212195_at 5q11 Chromosome 5, NC_000005.9 (55236694..55290763, complement)
218195_at 6q25.1 Chromosome 6, NC_000006.11 (151773422..151791232)
GO.Function
205225_at estrogen receptor activity///estrogen response element binding///hormone binding///metal ion binding///nitric-oxide synthase regulator activity///protein binding///protein complex binding///sequence-specific DNA binding///steroid binding///steroid hormone receptor activity///transcription factor activity///transcription factor binding///zinc ion binding
209603_at DNA binding///metal ion binding///sequence-specific DNA binding///transcription factor activity///transcription factor binding///zinc ion binding
209604_s_at DNA binding///metal ion binding///sequence-specific DNA binding///transcription factor activity///transcription factor binding///zinc ion binding
212956_at GTPase activator activity///Rab GTPase activator activity///calcium ion binding
202088_at metal ion transmembrane transporter activity
212496_s_at metal ion binding///nucleic acid binding///oxidoreductase activity///oxidoreductase activity, acting on single donors with incorporation of molecular oxygen, incorporation of two atoms of oxygen///protein binding///zinc ion binding
215867_x_at carbonate dehydratase activity///lyase activity///metal ion binding///zinc ion binding
209602_s_at DNA binding///metal ion binding///sequence-specific DNA binding///transcription factor activity///transcription factor binding///zinc ion binding
212195_at ciliary neurotrophic factor receptor activity///ciliary neurotrophic factor receptor binding///growth factor binding///contributes_to growth factor binding///interleukin-11 receptor activity///interleukin-27 receptor activity///contributes_to interleukin-6 binding///contributes_to interleukin-6 receptor activity///contributes_to interleukin-6 receptor activity///contributes_to interleukin-6 receptor binding///contributes_to leukemia inhibitory factor receptor activity///contributes_to oncostatin-M receptor activity///protein binding///protein homodimerization activity///receptor activity
218195_at protein binding
GO.Process
205225_at androgen metabolic process///antral ovarian follicle growth///epithelial cell development///epithelial cell proliferation involved in mammary gland duct elongation///estrogen receptor signaling pathway///male gonad development///mammary gland alveolus development///mammary gland branching involved in pregnancy///neuroprotection///osteoblast development///positive regulation of fibroblast proliferation///positive regulation of retinoic acid receptor signaling pathway///positive regulation of survival gene product expression///prostate epithelial cord arborization involved in prostate glandular acinus morphogenesis///prostate epithelial cord elongation///regulation of branching involved in prostate gland morphogenesis///regulation of transcription, DNA-dependent///response to estradiol stimulus///signal transduction///transcription, DNA-dependent///uterus development///vagina development
209603_at anatomical structure morphogenesis///defense response///regulation of transcription, DNA-dependent///response to estrogen stimulus///transcription from RNA polymerase II promoter
209604_s_at anatomical structure morphogenesis///defense response///regulation of transcription, DNA-dependent///response to estrogen stimulus///transcription from RNA polymerase II promoter
212956_at regulation of Rab GTPase activity
202088_at ion transport///transmembrane transport///zinc ion transport
212496_s_at chromatin modification///oxidation reduction///regulation of transcription
215867_x_at one-carbon metabolic process
209602_s_at anatomical structure morphogenesis///defense response///regulation of transcription, DNA-dependent///response to estrogen stimulus///transcription from RNA polymerase II promoter
212195_at JAK-STAT cascade///ciliary neurotrophic factor-mediated signaling pathway///cytokine-mediated signaling pathway///glycogen metabolic process///interleukin-27-mediated signaling pathway///leukemia inhibitory factor signaling pathway///negative regulation of interleukin-6-mediated signaling pathway///positive regulation of T cell proliferation///positive regulation of acute inflammatory response///positive regulation of adaptive immune response///positive regulation of anti-apoptosis///positive regulation of cardiac muscle hypertrophy///positive regulation of cell proliferation///positive regulation of cell proliferation///positive regulation of osteoblast differentiation///positive regulation of smooth muscle cell migration///positive regulation of tyrosine phosphorylation of Stat1 protein///positive regulation of tyrosine phosphorylation of Stat3 protein///positive regulation vascular endothelial growth factor production///reduction of cytosolic calcium ion concentration///regulation of Notch signaling pathway///response to cytokine stimulus///response to nutrient///signal transduction///triglyceride mobilization
218195_at <NA>
GO.Component
205225_at T-tubule///chromatin remodeling complex///cytoplasm///membrane///neuron projection///nucleus///perikaryon///plasma membrane///terminal button
209603_at nucleus
209604_s_at nucleus
212956_at intracellular
202088_at endoplasmic reticulum///integral to membrane///plasma membrane
212496_s_at nucleus
215867_x_at integral to membrane///membrane
209602_s_at nucleus
212195_at ciliary neurotrophic factor receptor complex///external side of plasma membrane///extracellular region///extracellular space///integral to membrane///interleukin-6 receptor complex///oncostatin-M receptor complex///plasma membrane
218195_at <NA>
GO.Function.1
205225_at GO:0030284///GO:0034056///GO:0042562///GO:0046872///GO:0030235///GO:0005515///GO:0032403///GO:0043565///GO:0005496///GO:0003707///GO:0003700///GO:0008134///GO:0008270
209603_at GO:0003677///GO:0046872///GO:0043565///GO:0003700///GO:0008134///GO:0008270
209604_s_at GO:0003677///GO:0046872///GO:0043565///GO:0003700///GO:0008134///GO:0008270
212956_at GO:0005096///GO:0005097///GO:0005509
202088_at GO:0046873
212496_s_at GO:0046872///GO:0003676///GO:0016491///GO:0016702///GO:0005515///GO:0008270
215867_x_at GO:0004089///GO:0016829///GO:0046872///GO:0008270
209602_s_at GO:0003677///GO:0046872///GO:0043565///GO:0003700///GO:0008134///GO:0008270
212195_at GO:0004897///GO:0005127///GO:0019838///contributes_to GO:0019838///GO:0004921///GO:0045509///contributes_to GO:0019981///contributes_to GO:0004915///contributes_to GO:0004915///contributes_to GO:0005138///contributes_to GO:0004923///contributes_to GO:0004924///GO:0005515///GO:0042803///GO:0004872
218195_at GO:0005515
GO.Process.1
205225_at GO:0008209///GO:0001547///GO:0002064///GO:0060750///GO:0030520///GO:0008584///GO:0060749///GO:0060745///GO:0043526///GO:0002076///GO:0048146///GO:0048386///GO:0045885///GO:0060527///GO:0060523///GO:0060687///GO:0006355///GO:0032355///GO:0007165///GO:0006351///GO:0060065///GO:0060068
209603_at GO:0009653///GO:0006952///GO:0006355///GO:0043627///GO:0006366
209604_s_at GO:0009653///GO:0006952///GO:0006355///GO:0043627///GO:0006366
212956_at GO:0032313
202088_at GO:0006811///GO:0055085///GO:0006829
212496_s_at GO:0016568///GO:0055114///GO:0045449
215867_x_at GO:0006730
209602_s_at GO:0009653///GO:0006952///GO:0006355///GO:0043627///GO:0006366
212195_at GO:0007259///GO:0070120///GO:0019221///GO:0005977///GO:0070106///GO:0048861///GO:0070104///GO:0042102///GO:0002675///GO:0002821///GO:0045768///GO:0010613///GO:0008284///GO:0008284///GO:0045669///GO:0014911///GO:0042511///GO:0042517///GO:0010575///GO:0051481///GO:0008593///GO:0034097///GO:0007584///GO:0007165///GO:0006642
218195_at <NA>
GO.Component.1
205225_at GO:0030315///GO:0016585///GO:0005737///GO:0016020///GO:0043005///GO:0005634///GO:0043204///GO:0005886///GO:0043195
209603_at GO:0005634
209604_s_at GO:0005634
212956_at GO:0005622
202088_at GO:0005783///GO:0016021///GO:0005886
212496_s_at GO:0005634
215867_x_at GO:0016021///GO:0016020
209602_s_at GO:0005634
212195_at GO:0070110///GO:0009897///GO:0005576///GO:0005615///GO:0016021///GO:0005896///GO:0005900///GO:0005886
218195_at <NA>
logFC AveExpr t P.Value adj.P.Val B
205225_at 3.762901 11.377735 22.68392 2.001001e-70 4.458832e-66 149.19866
209603_at 3.052348 9.941990 18.98154 1.486522e-55 1.656209e-51 115.46414
209604_s_at 2.431309 13.185334 17.59968 5.839050e-50 4.337052e-46 102.75707
212956_at 2.157435 11.702942 17.48711 1.665700e-49 9.279201e-46 101.72268
202088_at 1.719680 13.119496 17.30104 9.412084e-49 4.194589e-45 100.01376
212496_s_at 1.459843 10.703942 16.85070 6.188671e-47 2.298369e-43 95.88265
215867_x_at 2.246120 11.450485 16.79123 1.074845e-46 3.421537e-43 95.33780
209602_s_at 2.921505 9.547850 16.43202 3.004184e-45 8.367780e-42 92.05058
212195_at 1.381778 11.737839 16.31864 8.581176e-45 2.124604e-41 91.01458
218195_at 1.738740 9.479901 16.27378 1.299472e-44 2.895613e-41 90.60496
This is the common limma workflow. First, the comparison of interest (and the design of the experiment) is defined through a so-called “design matrix”. This matrix basically encompasses everything we know about the design; in this case there are two groups (we have more to say on the design below). Next, the model is fitted. This is followed by borrowing strength across genes using a so-called empirical Bayes procedure. Because this design only has two groups there is only one possible comparison to make: which genes differs between the two groups. This question is examined by the topTable() function which lists the top differentially expressed genes. In a more complicated design, the topTable() function would need to be told which comparison of interest to summarize.
An important part of the output is logFC which is the log fold-change. To interpret the sign of this quantity you need to know if this is 1-0 (e.g. ER+ vs ER-) (in which case positive values are up-regulated in ER+) or the reverse. In this case this is determined by the reference level which is the first level of the factor.
head(as.factor(vdx$er))
[1] 0 1 0 0 0 1
Levels: 0 1
we see the reference level is 0 so positive values means it is down-regulated in ER-. You can change the reference level of a factor using the relevel() function. You can also confirm this by computing the logFC by hand, which is useful to know. Let’s compute the fold-change of the top differentially expressed gene:
topTable(fit, n = 1)
probe Gene.title Gene.symbol Gene.ID EntrezGene.ID UniGene.title
205225_at 205225_at estrogen receptor 1 ESR1 2099 2099 <NA>
UniGene.symbol UniGene.ID
205225_at <NA> <NA>
Nucleotide.Title GI
205225_at Homo sapiens estrogen receptor 1 (ESR1), transcript variant 1, mRNA 170295798
GenBank.Accession Platform_CLONEID Platform_ORF Platform_SPOTID Chromosome.location
205225_at NM_000125 NA NA <NA> 6q25.1
Chromosome.annotation
205225_at Chromosome 6, NC_000006.11 (152011631..152424409)
GO.Function
205225_at estrogen receptor activity///estrogen response element binding///hormone binding///metal ion binding///nitric-oxide synthase regulator activity///protein binding///protein complex binding///sequence-specific DNA binding///steroid binding///steroid hormone receptor activity///transcription factor activity///transcription factor binding///zinc ion binding
GO.Process
205225_at androgen metabolic process///antral ovarian follicle growth///epithelial cell development///epithelial cell proliferation involved in mammary gland duct elongation///estrogen receptor signaling pathway///male gonad development///mammary gland alveolus development///mammary gland branching involved in pregnancy///neuroprotection///osteoblast development///positive regulation of fibroblast proliferation///positive regulation of retinoic acid receptor signaling pathway///positive regulation of survival gene product expression///prostate epithelial cord arborization involved in prostate glandular acinus morphogenesis///prostate epithelial cord elongation///regulation of branching involved in prostate gland morphogenesis///regulation of transcription, DNA-dependent///response to estradiol stimulus///signal transduction///transcription, DNA-dependent///uterus development///vagina development
GO.Component
205225_at T-tubule///chromatin remodeling complex///cytoplasm///membrane///neuron projection///nucleus///perikaryon///plasma membrane///terminal button
GO.Function.1
205225_at GO:0030284///GO:0034056///GO:0042562///GO:0046872///GO:0030235///GO:0005515///GO:0032403///GO:0043565///GO:0005496///GO:0003707///GO:0003700///GO:0008134///GO:0008270
GO.Process.1
205225_at GO:0008209///GO:0001547///GO:0002064///GO:0060750///GO:0030520///GO:0008584///GO:0060749///GO:0060745///GO:0043526///GO:0002076///GO:0048146///GO:0048386///GO:0045885///GO:0060527///GO:0060523///GO:0060687///GO:0006355///GO:0032355///GO:0007165///GO:0006351///GO:0060065///GO:0060068
GO.Component.1
205225_at GO:0030315///GO:0016585///GO:0005737///GO:0016020///GO:0043005///GO:0005634///GO:0043204///GO:0005886///GO:0043195
logFC AveExpr t P.Value adj.P.Val B
205225_at 3.762901 11.37774 22.68392 2.001001e-70 4.458832e-66 149.1987
genename <- rownames(topTable(fit, n=1))
typeMean <- tapply(exprs(vdx)[genename,], vdx$er, mean)
typeMean
0 1
9.091554 12.854455
typeMean["1"] - typeMean["0"]
1
3.762901
confirming the statement. It is sometimes useful to check things by hand to make sure you have the right interpretation. Finally, note that limma doesn’t do anything different from a difference of means when it computes logFC; all the statistical improvements centers on computing better t-statistics and p-values.
The reader who has some experience with statistics will note that all we are doing is comparing two groups; this is the same setup as the classic t-statistic. What we are computing here is indeed a t-statistic, but one where the variance estimation (the denominator of the t-statistics) is moderated by borrowing strength across genes (this is what eBayes() does); this is called a moderated t-statistic.
The output from topTable() includes
logFC: the log fold-change between cases and controls.t: the t-statistic used to assess differential expression.P.Value: the p-value for differential expression; this value is not adjusted for multiple testing.adj.P.Val: the p-value adjusted for multiple testing. Different adjustment methods are available, the default is Benjamini-Horchberg.How to setup and interpret a design matrix for more complicated designs is beyond the scope of this introduction. The limma User’s Guide is extremely helpful here. Also, note that setting up a design matrix for an experiment is a standard task in statistics (and requires very little knowledge about genomics), so other sources of help is a local, friendly statistician or text books on basic statistics.
rmabiocLite('CellMix', siteRepos = 'http://web.cbio.uct.ac.za/~renaud/CRAN', type='both')Inflation (produced by batch effects or uncontrolled variables) is normally tested by using QQ-plot
qqt(fit$t, df=fit$df.prior+fit$df.residual,
pch=16,cex=0.2)
abline(0,1, col="red", lwd=2)
Surrogate variable analysis (SVA) can be used to correct for uncontrolled variables. The sva package can be used to this end. This procedure is implemented in the MEAL package that requires less R code. Note that fNames argument allows to add annotation to the results by passing any of the existing variables in the feature data (fData) of the ExpressionSet object
library(MEAL)
# shows available annotated variabels
names(fData(vdx))
[1] "probe" "Gene.title" "Gene.symbol" "Gene.ID"
[5] "EntrezGene.ID" "UniGene.title" "UniGene.symbol" "UniGene.ID"
[9] "Nucleotide.Title" "GI" "GenBank.Accession" "Platform_CLONEID"
[13] "Platform_ORF" "Platform_SPOTID" "Chromosome.location" "Chromosome.annotation"
[17] "GO.Function" "GO.Process" "GO.Component" "GO.Function.1"
[21] "GO.Process.1" "GO.Component.1"
# DEA
ans <- runPipeline(vdx, variable = "er")
ans
Object of class 'ResultSet'
. created with: runPipeline
. sva: no
. #results: 1 ( error: 0 )
. featureData: 22283 probes x 22 variables
# DEA + SVA
ans.sva <- runPipeline(vdx, variable = "er", sva=TRUE)
Number of significant surrogate variables is: 25
Iteration (out of 5 ):1 2 3 4 5
ans.sva
Object of class 'ResultSet'
. created with: runPipeline
. sva: yes
. #results: 1 ( error: 0 )
. featureData: 22283 probes x 22 variables
fit.meal <- getProbeResults(ans, coef=2,
fNames=c("probe", "Gene.symbol", "Chromosome.location"))
head(fit.meal)
logFC CI.L CI.R AveExpr t P.Value adj.P.Val B
205225_at 3.762901 3.436632 4.089169 11.37774 22.68392 2.001001e-70 4.458832e-66 149.19866
209603_at 3.052348 2.736067 3.368629 9.94199 18.98154 1.486522e-55 1.656209e-51 115.46414
209604_s_at 2.431309 2.159599 2.703020 13.18533 17.59968 5.839050e-50 4.337052e-46 102.75707
212956_at 2.157435 1.914779 2.400091 11.70294 17.48711 1.665700e-49 9.279201e-46 101.72268
202088_at 1.719680 1.524180 1.915180 13.11950 17.30104 9.412084e-49 4.194589e-45 100.01376
212496_s_at 1.459843 1.289447 1.630239 10.70394 16.85070 6.188671e-47 2.298369e-43 95.88265
SE probe Gene.symbol Chromosome.location
205225_at 0.05434399 205225_at ESR1 6q25.1
209603_at 0.08505077 209603_at GATA3 10p15
209604_s_at 0.06235462 209604_s_at GATA3 10p15
212956_at 0.04006799 212956_at TBC1D9 4q31.21
202088_at 0.09917443 202088_at SLC39A6 18q12.2
212496_s_at 0.05482533 212496_s_at KDM4B 19p13.3
The QQ-plot is then produced by
plot(ans, type="QQ")
plot(ans.sva, type="QQ")
There is still inflation, that may be due to cell-types differences or any other technical variable, but using SVA you can see that inflation is dramatically reduced.
Volcano plot can be use to visualize the genes that are differentially expressed at a given fold-change and p-value. The function volcanoplot from limma package can be used to this end.
volcanoplot(fit, coef=2, names = fit$genes$Gene.symbol,
highlight = 5)
We can also use the EnhancedVolcano Bioconductor package to get a very nice visualization.
library(EnhancedVolcano)
toptable <- topTable(fit, n = Inf)
names(toptable)
[1] "probe" "Gene.title" "Gene.symbol" "Gene.ID"
[5] "EntrezGene.ID" "UniGene.title" "UniGene.symbol" "UniGene.ID"
[9] "Nucleotide.Title" "GI" "GenBank.Accession" "Platform_CLONEID"
[13] "Platform_ORF" "Platform_SPOTID" "Chromosome.location" "Chromosome.annotation"
[17] "GO.Function" "GO.Process" "GO.Component" "GO.Function.1"
[21] "GO.Process.1" "GO.Component.1" "logFC" "AveExpr"
[25] "t" "P.Value" "adj.P.Val" "B"
EnhancedVolcano(toptable,
lab = toptable$Gene.symbol,
x = 'logFC',
y = 'P.Value')
Another state-of-the-art plot is MA-plot. It is used to check whether data are comparable among groups (i.e. normalization worked properly). This can be created using a function available in limma.
limma::plotMA(fit, coef=2, main="ER+ vs ER- comparison")
abline(h=0, col="red", lwd=2)
limma + voom pipeline for RNAseq data analysisLet us illustrate how to analyze RNAseq cound data using limma + voom pipeline that was proposed here. To this end, we use count data from lymphoblastoid cell lines from 69 unrelated Nigerian individuals, described in Pickrell, et al. 2010. This data are available through the Bioconductor’s tweeDEseqCountData package.
library(tweeDEseqCountData)
data(pickrell)
pickrell.eset
ExpressionSet (storageMode: lockedEnvironment)
assayData: 52580 features, 69 samples
element names: exprs
protocolData: none
phenoData
sampleNames: NA18486 NA18498 ... NA19257 (69 total)
varLabels: num.tech.reps population study gender
varMetadata: labelDescription
featureData
featureNames: ENSG00000000003 ENSG00000000005 ... LRG_99 (52580 total)
fvarLabels: gene
fvarMetadata: labelDescription
experimentData: use 'experimentData(object)'
Annotation:
We need to create a DGElist object and calculate the normalization factors (NOTE: it does not normalize the data just calculate some factors to be used downstream)
library(edgeR)
dge <- DGEList(pickrell.eset, group = pickrell.eset$gender)
dge <- calcNormFactors(dge)
Genes that do not have a worthwhile number of reads in any sample should be filtered out of the downstream analyses. There are several reasons for this. From a biological point of view, genes that not expressed at a biologically meaningful level in any condition are not of interest and are therefore best ignored. From a statistical point of view, removing low count genes allows the mean-variance relationship in the data to be estimated with greater reliability and also reduces the number of statistical tests that need to be carried out in downstream analyses looking at differential expression.
The filterByExpr function in the edgeR package provides an automatic way to filter genes, while keeping as many genes as possible with worthwhile counts.
keep.exprs <- filterByExpr(dge)
dge.filt <- dge[keep.exprs,]
dim(dge.filt)
[1] 6255 69
For differential expression , gene expression is rarely considered at the level of raw counts since libraries sequenced at a greater depth will result in higher counts. Rather, it is common practice to transform raw counts onto a scale that accounts for such library size differences. Popular transformations include counts per million (CPM), log2-counts per million (log-CPM), reads per kilobase of transcript per million (RPKM), and fragments per kilobase of transcript per million (FPKM). Here raw counts are converted log-CPM values using the voom function in limma package.
mm <- model.matrix( ~ group, data=dge.filt$samples)
v <- voom(dge.filt, design = mm, plot = TRUE)
What is voom doing?
limma along with the log2 CPMs.The above is a “good” voom plot. If your voom plot looks like the below (obtained from original data), you might want to filter more:
mp <- voom(dge, design = mm, plot = T)
Now we are ready to run DEA using the same functions that the ones for the microarray data
fit <- lmFit(v, mm)
fit <- eBayes(fit)
topTable(fit)
logFC AveExpr t P.Value adj.P.Val B
ENSG00000129824 9.2050125 1.95469533 46.488159 2.704379e-55 1.691589e-51 96.2169855
ENSG00000099749 6.1629712 0.40983758 38.613259 9.839513e-50 3.077308e-46 87.1119440
ENSG00000198692 5.2160492 -0.03456801 27.814122 3.722546e-40 7.761508e-37 71.7012366
ENSG00000154620 5.0785792 0.48128404 26.392727 1.151368e-38 1.800451e-35 69.6465663
ENSG00000006757 -0.9323601 5.30180979 -9.012108 2.048633e-13 2.562840e-10 20.2011440
ENSG00000130021 -0.8655133 2.13108815 -4.351579 4.401395e-05 4.588455e-02 1.8595831
ENSG00000185753 -0.5594470 3.68469795 -4.169233 8.434265e-05 6.781208e-02 1.0305360
ENSG00000086712 -0.4136276 5.08445215 -4.161316 8.673008e-05 6.781208e-02 0.8063345
ENSG00000177606 -0.5958202 8.24149085 -3.978701 1.638281e-04 1.138605e-01 -0.0609869
ENSG00000123689 -1.4795363 5.16064334 -3.868186 2.389355e-04 1.494542e-01 -0.1294323
EXERCISE: leukemiasEset provides transcriptomic data for different types of leukemias from a microarray experiment. It can be loaded into R by data(leukemiasEset) after loading the package. The variable LeukemiaType encodes several types of leukemia as well as control samples (NoL). Perform a complete differential expression analysis comparing CML versus NoL.
The description of the microarry experiment has been brougth from bitesizebio. Some parts of this material has been obtained from Kasper D. Hansen material of the course Bioconductor for Genomic Data Science
sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19042)
Matrix products: default
locale:
[1] LC_COLLATE=Spanish_Spain.1252 LC_CTYPE=Spanish_Spain.1252 LC_MONETARY=Spanish_Spain.1252
[4] LC_NUMERIC=C LC_TIME=Spanish_Spain.1252
attached base packages:
[1] parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] edgeR_3.30.3 tweeDEseqCountData_1.26.0 EnhancedVolcano_1.6.0
[4] ggrepel_0.8.2 ggplot2_3.3.3 MEAL_1.22.0
[7] MultiDataSet_1.19.1 Biobase_2.48.0 BiocGenerics_0.34.0
[10] breastCancerVDX_1.26.0 limma_3.44.3 BiocStyle_2.16.0
loaded via a namespace (and not attached):
[1] qqman_0.1.4 BiocFileCache_1.12.1 plyr_1.8.6
[4] splines_4.0.2 BiocParallel_1.22.0 GenomeInfoDb_1.24.2
[7] sva_3.36.0 digest_0.6.27 foreach_1.5.0
[10] htmltools_0.5.1.1 magick_2.4.0 fansi_0.4.2
[13] magrittr_2.0.1 memoise_1.1.0 cluster_2.1.0
[16] Biostrings_2.56.0 readr_1.3.1 annotate_1.66.0
[19] matrixStats_0.57.0 askpass_1.1 siggenes_1.62.0
[22] prettyunits_1.1.1 colorspace_2.0-0 blob_1.2.1
[25] rappdirs_0.3.1 xfun_0.16 dplyr_1.0.5
[28] crayon_1.4.1 RCurl_1.98-1.2 genefilter_1.70.0
[31] GEOquery_2.56.0 survival_3.2-3 iterators_1.0.12
[34] glue_1.4.2 gtable_0.3.0 zlibbioc_1.34.0
[37] XVector_0.28.0 DelayedArray_0.14.1 Rhdf5lib_1.10.1
[40] HDF5Array_1.16.1 scales_1.1.1 DBI_1.1.0
[43] rngtools_1.5 Rcpp_1.0.6 xtable_1.8-4
[46] progress_1.2.2 bumphunter_1.30.0 bit_4.0.4
[49] mclust_5.4.6 preprocessCore_1.50.0 stats4_4.0.2
[52] httr_1.4.2 RColorBrewer_1.1-2 calibrate_1.7.7
[55] ellipsis_0.3.1 pkgconfig_2.0.3 reshape_0.8.8
[58] XML_3.99-0.5 farver_2.1.0 dbplyr_1.4.4
[61] locfit_1.5-9.4 utf8_1.2.1 tidyselect_1.1.0
[64] labeling_0.4.2 rlang_0.4.10 AnnotationDbi_1.50.3
[67] munsell_0.5.0 tools_4.0.2 generics_0.0.2
[70] RSQLite_2.2.0 evaluate_0.14 stringr_1.4.0
[73] yaml_2.2.1 knitr_1.29 bit64_4.0.4
[76] beanplot_1.2 scrime_1.3.5 purrr_0.3.4
[79] nlme_3.1-149 doRNG_1.8.2 nor1mix_1.3-0
[82] xml2_1.3.2 biomaRt_2.44.1 compiler_4.0.2
[85] curl_4.3 tibble_3.1.0 stringi_1.4.6
[88] highr_0.8 GenomicFeatures_1.40.1 minfi_1.34.0
[91] lattice_0.20-41 Matrix_1.2-18 vegan_2.5-6
[94] permute_0.9-5 multtest_2.44.0 vctrs_0.3.7
[97] pillar_1.6.0 lifecycle_1.0.0 BiocManager_1.30.10
[100] data.table_1.13.0 bitops_1.0-6 rtracklayer_1.48.0
[103] GenomicRanges_1.40.0 R6_2.4.1 bookdown_0.20
[106] IRanges_2.22.2 codetools_0.2-16 MASS_7.3-52
[109] assertthat_0.2.1 rhdf5_2.32.2 SummarizedExperiment_1.18.2
[112] openssl_1.4.2 withr_2.4.1 GenomicAlignments_1.24.0
[115] Rsamtools_2.4.0 S4Vectors_0.26.1 GenomeInfoDbData_1.2.3
[118] mgcv_1.8-33 hms_0.5.3 quadprog_1.5-8
[121] grid_4.0.2 tidyr_1.1.2 base64_2.0
[124] rmarkdown_2.7 DelayedMatrixStats_1.10.1 illuminaio_0.30.0