Contents

1 Gettig started

This document can be reproduced by using the R code and the data (for illustrating purposes and for the exercises) that are available here: https://github.com/isglobal-brge/short_course_omic_R/

2 Introduction

Transcriptomic data analysis can be performed using data obtained from two infrastructures: microarrays and RNA-seq. This lectures describe the main steps to perform differential expression analysis (DEA) using both technologies. This is a brief introduction that covers the main aspects of DEA and outlines the most common types of analyses in this type of studies. Of course, other packages can be used for that purposes (i.e DESeq2 for RNAseq) but our proposed approaches can be enough for an introductory course.

2.1 Microarray workflow

A basic protocol for a DNA microarray is as follows:

  • Isolate and purify mRNA from samples of interest. Since we are interested in comparing gene expression, one sample usually serves as control, and another sample would be the experiment (healthy vs. disease, etc)
  • Reverse transcribe and label the mRNA. In order to detect the transcripts by hybridization, they need to be labeled, and because starting material maybe limited, an amplification step is also used. Labeling usually involves performing a reverse transcription (RT) reaction to produce a complementary DNA strand (cDNA) and incorporating a florescent dye that has been linked to a DNA nucleotide, producing a fluorescent cDNA strand. Disease and healthy samples can be labeled with different dyes and cohybridized onto the same microarray in the following step. Some protocols do not label the cDNA but use a second step of amplification, where the cDNA from RT step serves as a template to produce a labeled cRNA strand.
  • Hybridize the labeled target to the microarray. This step involves placing labeled cDNAs onto a DNA microarray where it will hybridize to their synthetic complementary DNA probes attached on the microarray. A series of washes are used to remove non-bound sequences.
  • Scan the microarray and quantitate the signal. The fluorescent tags on bound cDNA are excited by a laser and the fluorescently labeled target sequences that bind to a probe generate a signal. The total strength of the signal depends upon the amount of target sample binding to the probes present on that spot. Thus, the amount of target sequence bound to each probe correlates to the expression level of various genes expressed in the sample. The signals are detected, quantified, and used to create a digital image of the array.

If we are trying to calculate relative expression between two samples, each labeled with a different dye (See next figure, red for experiment, green for control), the resulting image is analyzed by calculating the ratio of the two dyes. If a gene is over-expressed in the experimental sample, then more of that sample cDNA than control cDNA will hybridize to the spot representing that expressed gene. In turn, the spot will fluoresce red with greater intensity than it will fluoresce green. The red-to-green fluorescence ratio thus indicates which gene is up or downregulated in the appropriate sample.

Microarray Experiment

2.2 RNAseq workflow

RNAseq data is te counterpart of microarrays using next generation sequencing technology. Data analyst gets a read count table from raw fastq reads obtained from an Illumina sequencing run. Opposite to microarrays, where data are intensity values, the transcription profile from RNAseq experiment of a particular gene follows from counting the number of times the transcripts of the genes were mapped by sequenced reads. The summarized RNA-seq data is known as count data. Figure 1 describes the main steps that are taken to obtain the count table that measures the transcription of active genes in biological samples.

RNA-seq scheme to get count data.

Figure 1: RNA-seq scheme to get count data

Both microarray and RNAseq data can be encapsulated in a ExpressionSet or a SummarizedExperiment. Then different Bioconductor packages can be used to analyze transcription data using these objects as input.

NOTE: In this course we are not covering epigenetic data analyses, but the methodologies used to analyze microarray data also apply to methylation data. For those how are not familiar with Bioconductor, we strongly recommend to use MEAL package that will be later described given its ease of use.

3 limma Bioconductor package

limma is a widely used BioC package for analyzing microarray data. LIMMA stands for “linear models for microarray data”. Perhaps unsurprisingly, limma contains functionality for fitting a broad class of statistical models called “linear models”. Examples of such models include linear regression and analysis of variance. While most of the functionality of limma has been developed for microarray data, the model fitting routines of limma are useful for many types of data, and is not limited to microarrays. For example, as we have previously pointed out, DNA methylation data is also used to be analyzed using this package.

3.1 Model design

The objective of an DEA is frequently to discover which features (genes) are different between groups or stated differently: to discover which genes are differentially expressed between cases and controls. Of course, more complicated designs are also used; sometimes more than two groups are considered and sometimes there are additional important covariates such as age or sex that should be taken into account (adjusted models).

Broadly speaking, how samples are distributed between groups determines the design of the study. In addition to the design, there is one or more question(s) of interest(s) such as the difference between two groups. Such questions are usually formalized as contrasts; an example of a contrast is indeed the difference between two groups.

This can be formalized as

\[ Y = \beta_0 + \beta_1 X_1 + \epsilon \]

In this equation of a linear model, Y is the response variable. It must be a continuous variable. In the context of DEA, it is a relative measure of mRNA expression level for one gene. \(X_1\) is an explanatory variable, which can be continuous or discrete, for example, control group versus treatment, or mutant versus wild type. \(\beta_1\) quantifies the effect of the explanatory variable on the response variable. Furthermore, we can add additional explanatory variables to the equation for more complicated experimental designs. Lastly, models the random noise in the measurements.

In R, you specify a linear model with the function lm. This uses R’s formula syntax. On the left is the object that contains the response variable, and to the right of the tilde are the objects that contain the explanatory variables.

lm(y ~ x1)

A second explanatory variable can be added with a plus sign.

\[ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \epsilon \]

lm(y ~ x1 + x2)

The design matrix is then, how samples are distributed accross x1 and x2. Let us illustrate this in a simple simulated case when DEA aims to find genes that are differentially expressed between cases and control (variable x1)

set.seed(1234) # to guarantee reproducibility
x1 <- as.factor(sample(c("case", "control"), 10, replace=TRUE))
x1
 [1] control control control control case    control case    case    case    control
Levels: case control
model.matrix( ~ x1)
   (Intercept) x1control
1            1         1
2            1         1
3            1         1
4            1         1
5            1         0
6            1         1
7            1         0
8            1         0
9            1         0
10           1         1
attr(,"assign")
[1] 0 1
attr(,"contrasts")
attr(,"contrasts")$x1
[1] "contr.treatment"

The data structure represented by an ExpressionSet or a SummarizedExperiment contains features in the rows of the experimental data and variables in columns.

A number of different packages allows us to fit common types of models to this data structure

  • limma fits a so-called linear model; examples of linear models are (1) linear regression, (2) multiple linear regression and (3) analysis of variance.
  • edgeR, DESeq and DESeq2 fits generalized linear models, specifically models based on the negative binomial distribution that will be illustrated in a later session.

Simplifying, limma is useful for continuous data such as microarray data and edgeR / DESeq / DESeq2 are useful for count data such as high-throughput sequencing (RNA-seq). Though will see that count data can also be transformed into continuous data and, hence, limma can also be used in that case.

In addition to the distributional assumptions, all of these packages uses something called empirical Bayes techniques to borrow information across features. As stated above, usually the number of samples is small and the number of features is large. It has been shown time and time again that you can get better results by borrowing information across features, for example by modeling a mean-variance relationship. This can be done in many ways and often depends on the data characteristics of a specific type of data. For example both edgeR and DESeq(2) are very popular in the analysis of RNA-seq data and all three packages uses models based on the negative binomial distribution. For a statistical point of view, a main difference between these packages (models) is how they borrow information across genes.

3.2 Two group comparison

Let us illustrate how to perform DEA using vdx dataset which is available at Bioconductor through the package breastCancerVDX. This package can be installed along with limma using:

BiocManager::install(c("limma", "breastCancerVDX"))

After that, the packages can be loaded into R by:

library(limma)
library(breastCancerVDX)

The vdx dataset from the breastCancerVDX package is an ExpressionSet of women diagnosed with breast cancer. It contains information about 22,000 genes and several variables including estrogen receptor (ER) status.

data(vdx)
vdx
ExpressionSet (storageMode: lockedEnvironment)
assayData: 22283 features, 344 samples 
  element names: exprs 
protocolData: none
phenoData
  sampleNames: VDX_3 VDX_5 ... VDX_2038 (344 total)
  varLabels: samplename dataset ... e.os (21 total)
  varMetadata: labelDescription
featureData
  featureNames: 1007_s_at 1053_at ... AFFX-TrpnX-M_at (22283 total)
  fvarLabels: probe Gene.title ... GO.Component.1 (22 total)
  fvarMetadata: labelDescription
experimentData: use 'experimentData(object)'
  pubMedIds: 17420468 
Annotation: hgu133a 
table(vdx$er)

  0   1 
135 209 

The code 0 means negative and 1 positive estrogen receptor. Now we do a standard limma model fit

design <- model.matrix(~ vdx$er)
fit <- lmFit(vdx, design)
fit <- eBayes(fit)
topTable(fit)
                  probe                                                     Gene.title Gene.symbol
205225_at     205225_at                                            estrogen receptor 1        ESR1
209603_at     209603_at                                         GATA binding protein 3       GATA3
209604_s_at 209604_s_at                                         GATA binding protein 3       GATA3
212956_at     212956_at                TBC1 domain family, member 9 (with GRAM domain)      TBC1D9
202088_at     202088_at          solute carrier family 39 (zinc transporter), member 6     SLC39A6
212496_s_at 212496_s_at                             lysine (K)-specific demethylase 4B       KDM4B
215867_x_at 215867_x_at                                         carbonic anhydrase XII        CA12
209602_s_at 209602_s_at                                         GATA binding protein 3       GATA3
212195_at     212195_at interleukin 6 signal transducer (gp130, oncostatin M receptor)       IL6ST
218195_at     218195_at                            chromosome 6 open reading frame 211    C6orf211
            Gene.ID EntrezGene.ID UniGene.title UniGene.symbol UniGene.ID
205225_at      2099          2099          <NA>           <NA>       <NA>
209603_at      2625          2625          <NA>           <NA>       <NA>
209604_s_at    2625          2625          <NA>           <NA>       <NA>
212956_at     23158         23158          <NA>           <NA>       <NA>
202088_at     25800         25800          <NA>           <NA>       <NA>
212496_s_at   23030         23030          <NA>           <NA>       <NA>
215867_x_at     771           771          <NA>           <NA>       <NA>
209602_s_at    2625          2625          <NA>           <NA>       <NA>
212195_at      3572          3572          <NA>           <NA>       <NA>
218195_at     79624         79624          <NA>           <NA>       <NA>
                                                                                                                                                                  Nucleotide.Title
205225_at                                                                                                      Homo sapiens estrogen receptor 1 (ESR1), transcript variant 1, mRNA
209603_at   wh43d10.x1 NCI_CGAP_Kid11 Homo sapiens cDNA clone IMAGE:2383507 3- similar to gb:X58072_rna1 TRANS-ACTING T-CELL SPECIFIC TRANSCRIPTION FACTOR (HUMAN);, mRNA sequence
209604_s_at                                                                            Homo sapiens GATA binding protein 3, mRNA (cDNA clone MGC:2346 IMAGE:3504200), complete cds
212956_at                                                                                          qp61g12.x1 NCI_CGAP_Co8 Homo sapiens cDNA clone IMAGE:1927558 3-, mRNA sequence
202088_at                                                                                         ts65a01.x1 NCI_CGAP_Kid8 Homo sapiens cDNA clone IMAGE:2233416 3-, mRNA sequence
212496_s_at                                                                                         601112063F1 NIH_MGC_16 Homo sapiens cDNA clone IMAGE:3352887 5-, mRNA sequence
215867_x_at                                                                                                         Homo sapiens mRNA; cDNA DKFZp564D066 (from clone DKFZp564D066)
209602_s_at wh43d10.x1 NCI_CGAP_Kid11 Homo sapiens cDNA clone IMAGE:2383507 3- similar to gb:X58072_rna1 TRANS-ACTING T-CELL SPECIFIC TRANSCRIPTION FACTOR (HUMAN);, mRNA sequence
212195_at                                                                                                           Homo sapiens mRNA; cDNA DKFZp564F053 (from clone DKFZp564F053)
218195_at                                                                                                        Homo sapiens chromosome 6 open reading frame 211 (C6orf211), mRNA
                   GI GenBank.Accession Platform_CLONEID Platform_ORF Platform_SPOTID
205225_at   170295798         NM_000125               NA           NA            <NA>
209603_at     5361632          AI796169               NA           NA            <NA>
209604_s_at  33871929          BC003070               NA           NA            <NA>
212956_at     4085300          AI348094               NA           NA            <NA>
202088_at     4686779          AI635449               NA           NA            <NA>
212496_s_at   9127367          BE256900               NA           NA            <NA>
215867_x_at   4884095          AL050025               NA           NA            <NA>
209602_s_at   5361632          AI796169               NA           NA            <NA>
212195_at     4500013          AL049265               NA           NA            <NA>
218195_at    13375745         NM_024573               NA           NA            <NA>
            Chromosome.location                                         Chromosome.annotation
205225_at                6q25.1             Chromosome 6, NC_000006.11 (152011631..152424409)
209603_at                 10p15                Chromosome 10, NC_000010.10 (8096667..8117164)
209604_s_at               10p15                Chromosome 10, NC_000010.10 (8096667..8117164)
212956_at               4q31.21 Chromosome 4, NC_000004.11 (141541936..141677471, complement)
202088_at               18q12.2   Chromosome 18, NC_000018.9 (33688494..33709357, complement)
212496_s_at             19p13.3                 Chromosome 19, NC_000019.9 (4969124..5153609)
215867_x_at               15q22   Chromosome 15, NC_000015.9 (63615730..63674075, complement)
209602_s_at               10p15                Chromosome 10, NC_000010.10 (8096667..8117164)
212195_at                  5q11    Chromosome 5, NC_000005.9 (55236694..55290763, complement)
218195_at                6q25.1             Chromosome 6, NC_000006.11 (151773422..151791232)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   GO.Function
205225_at                                                                                                                                                                                                                                                  estrogen receptor activity///estrogen response element binding///hormone binding///metal ion binding///nitric-oxide synthase regulator activity///protein binding///protein complex binding///sequence-specific DNA binding///steroid binding///steroid hormone receptor activity///transcription factor activity///transcription factor binding///zinc ion binding
209603_at                                                                                                                                                                                                                                                                                                                                                                                                                                                                    DNA binding///metal ion binding///sequence-specific DNA binding///transcription factor activity///transcription factor binding///zinc ion binding
209604_s_at                                                                                                                                                                                                                                                                                                                                                                                                                                                                  DNA binding///metal ion binding///sequence-specific DNA binding///transcription factor activity///transcription factor binding///zinc ion binding
212956_at                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      GTPase activator activity///Rab GTPase activator activity///calcium ion binding
202088_at                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         metal ion transmembrane transporter activity
212496_s_at                                                                                                                                                                                                                                                                                                                                                                            metal ion binding///nucleic acid binding///oxidoreductase activity///oxidoreductase activity, acting on single donors with incorporation of molecular oxygen, incorporation of two atoms of oxygen///protein binding///zinc ion binding
215867_x_at                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             carbonate dehydratase activity///lyase activity///metal ion binding///zinc ion binding
209602_s_at                                                                                                                                                                                                                                                                                                                                                                                                                                                                  DNA binding///metal ion binding///sequence-specific DNA binding///transcription factor activity///transcription factor binding///zinc ion binding
212195_at   ciliary neurotrophic factor receptor activity///ciliary neurotrophic factor receptor binding///growth factor binding///contributes_to growth factor binding///interleukin-11 receptor activity///interleukin-27 receptor activity///contributes_to interleukin-6 binding///contributes_to interleukin-6 receptor activity///contributes_to interleukin-6 receptor activity///contributes_to interleukin-6 receptor binding///contributes_to leukemia inhibitory factor receptor activity///contributes_to oncostatin-M receptor activity///protein binding///protein homodimerization activity///receptor activity
218195_at                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      protein binding
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            GO.Process
205225_at                                                                                                                                                                                                                                           androgen metabolic process///antral ovarian follicle growth///epithelial cell development///epithelial cell proliferation involved in mammary gland duct elongation///estrogen receptor signaling pathway///male gonad development///mammary gland alveolus development///mammary gland branching involved in pregnancy///neuroprotection///osteoblast development///positive regulation of fibroblast proliferation///positive regulation of retinoic acid receptor signaling pathway///positive regulation of survival gene product expression///prostate epithelial cord arborization involved in prostate glandular acinus morphogenesis///prostate epithelial cord elongation///regulation of branching involved in prostate gland morphogenesis///regulation of transcription, DNA-dependent///response to estradiol stimulus///signal transduction///transcription, DNA-dependent///uterus development///vagina development
209603_at                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           anatomical structure morphogenesis///defense response///regulation of transcription, DNA-dependent///response to estrogen stimulus///transcription from RNA polymerase II promoter
209604_s_at                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         anatomical structure morphogenesis///defense response///regulation of transcription, DNA-dependent///response to estrogen stimulus///transcription from RNA polymerase II promoter
212956_at                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            regulation of Rab GTPase activity
202088_at                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 ion transport///transmembrane transport///zinc ion transport
212496_s_at                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 chromatin modification///oxidation reduction///regulation of transcription
215867_x_at                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               one-carbon metabolic process
209602_s_at                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         anatomical structure morphogenesis///defense response///regulation of transcription, DNA-dependent///response to estrogen stimulus///transcription from RNA polymerase II promoter
212195_at   JAK-STAT cascade///ciliary neurotrophic factor-mediated signaling pathway///cytokine-mediated signaling pathway///glycogen metabolic process///interleukin-27-mediated signaling pathway///leukemia inhibitory factor signaling pathway///negative regulation of interleukin-6-mediated signaling pathway///positive regulation of T cell proliferation///positive regulation of acute inflammatory response///positive regulation of adaptive immune response///positive regulation of anti-apoptosis///positive regulation of cardiac muscle hypertrophy///positive regulation of cell proliferation///positive regulation of cell proliferation///positive regulation of osteoblast differentiation///positive regulation of smooth muscle cell migration///positive regulation of tyrosine phosphorylation of Stat1 protein///positive regulation of tyrosine phosphorylation of Stat3 protein///positive regulation vascular endothelial growth factor production///reduction of cytosolic calcium ion concentration///regulation of Notch signaling pathway///response to cytokine stimulus///response to nutrient///signal transduction///triglyceride mobilization
218195_at                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         <NA>
                                                                                                                                                                                                                                      GO.Component
205225_at                                                                                            T-tubule///chromatin remodeling complex///cytoplasm///membrane///neuron projection///nucleus///perikaryon///plasma membrane///terminal button
209603_at                                                                                                                                                                                                                                  nucleus
209604_s_at                                                                                                                                                                                                                                nucleus
212956_at                                                                                                                                                                                                                            intracellular
202088_at                                                                                                                                                                           endoplasmic reticulum///integral to membrane///plasma membrane
212496_s_at                                                                                                                                                                                                                                nucleus
215867_x_at                                                                                                                                                                                                        integral to membrane///membrane
209602_s_at                                                                                                                                                                                                                                nucleus
212195_at   ciliary neurotrophic factor receptor complex///external side of plasma membrane///extracellular region///extracellular space///integral to membrane///interleukin-6 receptor complex///oncostatin-M receptor complex///plasma membrane
218195_at                                                                                                                                                                                                                                     <NA>
                                                                                                                                                                                                                                                                                                        GO.Function.1
205225_at                                                                                                                                      GO:0030284///GO:0034056///GO:0042562///GO:0046872///GO:0030235///GO:0005515///GO:0032403///GO:0043565///GO:0005496///GO:0003707///GO:0003700///GO:0008134///GO:0008270
209603_at                                                                                                                                                                                                                                 GO:0003677///GO:0046872///GO:0043565///GO:0003700///GO:0008134///GO:0008270
209604_s_at                                                                                                                                                                                                                               GO:0003677///GO:0046872///GO:0043565///GO:0003700///GO:0008134///GO:0008270
212956_at                                                                                                                                                                                                                                                                        GO:0005096///GO:0005097///GO:0005509
202088_at                                                                                                                                                                                                                                                                                                  GO:0046873
212496_s_at                                                                                                                                                                                                                               GO:0046872///GO:0003676///GO:0016491///GO:0016702///GO:0005515///GO:0008270
215867_x_at                                                                                                                                                                                                                                                         GO:0004089///GO:0016829///GO:0046872///GO:0008270
209602_s_at                                                                                                                                                                                                                               GO:0003677///GO:0046872///GO:0043565///GO:0003700///GO:0008134///GO:0008270
212195_at   GO:0004897///GO:0005127///GO:0019838///contributes_to GO:0019838///GO:0004921///GO:0045509///contributes_to GO:0019981///contributes_to GO:0004915///contributes_to GO:0004915///contributes_to GO:0005138///contributes_to GO:0004923///contributes_to GO:0004924///GO:0005515///GO:0042803///GO:0004872
218195_at                                                                                                                                                                                                                                                                                                  GO:0005515
                                                                                                                                                                                                                                                                                                                                  GO.Process.1
205225_at                                          GO:0008209///GO:0001547///GO:0002064///GO:0060750///GO:0030520///GO:0008584///GO:0060749///GO:0060745///GO:0043526///GO:0002076///GO:0048146///GO:0048386///GO:0045885///GO:0060527///GO:0060523///GO:0060687///GO:0006355///GO:0032355///GO:0007165///GO:0006351///GO:0060065///GO:0060068
209603_at                                                                                                                                                                                                                                                                       GO:0009653///GO:0006952///GO:0006355///GO:0043627///GO:0006366
209604_s_at                                                                                                                                                                                                                                                                     GO:0009653///GO:0006952///GO:0006355///GO:0043627///GO:0006366
212956_at                                                                                                                                                                                                                                                                                                                           GO:0032313
202088_at                                                                                                                                                                                                                                                                                                 GO:0006811///GO:0055085///GO:0006829
212496_s_at                                                                                                                                                                                                                                                                                               GO:0016568///GO:0055114///GO:0045449
215867_x_at                                                                                                                                                                                                                                                                                                                         GO:0006730
209602_s_at                                                                                                                                                                                                                                                                     GO:0009653///GO:0006952///GO:0006355///GO:0043627///GO:0006366
212195_at   GO:0007259///GO:0070120///GO:0019221///GO:0005977///GO:0070106///GO:0048861///GO:0070104///GO:0042102///GO:0002675///GO:0002821///GO:0045768///GO:0010613///GO:0008284///GO:0008284///GO:0045669///GO:0014911///GO:0042511///GO:0042517///GO:0010575///GO:0051481///GO:0008593///GO:0034097///GO:0007584///GO:0007165///GO:0006642
218195_at                                                                                                                                                                                                                                                                                                                                 <NA>
                                                                                                                GO.Component.1
205225_at   GO:0030315///GO:0016585///GO:0005737///GO:0016020///GO:0043005///GO:0005634///GO:0043204///GO:0005886///GO:0043195
209603_at                                                                                                           GO:0005634
209604_s_at                                                                                                         GO:0005634
212956_at                                                                                                           GO:0005622
202088_at                                                                                 GO:0005783///GO:0016021///GO:0005886
212496_s_at                                                                                                         GO:0005634
215867_x_at                                                                                            GO:0016021///GO:0016020
209602_s_at                                                                                                         GO:0005634
212195_at                GO:0070110///GO:0009897///GO:0005576///GO:0005615///GO:0016021///GO:0005896///GO:0005900///GO:0005886
218195_at                                                                                                                 <NA>
               logFC   AveExpr        t      P.Value    adj.P.Val         B
205225_at   3.762901 11.377735 22.68392 2.001001e-70 4.458832e-66 149.19866
209603_at   3.052348  9.941990 18.98154 1.486522e-55 1.656209e-51 115.46414
209604_s_at 2.431309 13.185334 17.59968 5.839050e-50 4.337052e-46 102.75707
212956_at   2.157435 11.702942 17.48711 1.665700e-49 9.279201e-46 101.72268
202088_at   1.719680 13.119496 17.30104 9.412084e-49 4.194589e-45 100.01376
212496_s_at 1.459843 10.703942 16.85070 6.188671e-47 2.298369e-43  95.88265
215867_x_at 2.246120 11.450485 16.79123 1.074845e-46 3.421537e-43  95.33780
209602_s_at 2.921505  9.547850 16.43202 3.004184e-45 8.367780e-42  92.05058
212195_at   1.381778 11.737839 16.31864 8.581176e-45 2.124604e-41  91.01458
218195_at   1.738740  9.479901 16.27378 1.299472e-44 2.895613e-41  90.60496

This is the common limma workflow. First, the comparison of interest (and the design of the experiment) is defined through a so-called “design matrix”. This matrix basically encompasses everything we know about the design; in this case there are two groups (we have more to say on the design below). Next, the model is fitted. This is followed by borrowing strength across genes using a so-called empirical Bayes procedure. Because this design only has two groups there is only one possible comparison to make: which genes differs between the two groups. This question is examined by the topTable() function which lists the top differentially expressed genes. In a more complicated design, the topTable() function would need to be told which comparison of interest to summarize.

An important part of the output is logFC which is the log fold-change. To interpret the sign of this quantity you need to know if this is 1-0 (e.g. ER+ vs ER-) (in which case positive values are up-regulated in ER+) or the reverse. In this case this is determined by the reference level which is the first level of the factor.

head(as.factor(vdx$er))
[1] 0 1 0 0 0 1
Levels: 0 1

we see the reference level is 0 so positive values means it is down-regulated in ER-. You can change the reference level of a factor using the relevel() function. You can also confirm this by computing the logFC by hand, which is useful to know. Let’s compute the fold-change of the top differentially expressed gene:

topTable(fit, n = 1)
              probe          Gene.title Gene.symbol Gene.ID EntrezGene.ID UniGene.title
205225_at 205225_at estrogen receptor 1        ESR1    2099          2099          <NA>
          UniGene.symbol UniGene.ID
205225_at           <NA>       <NA>
                                                             Nucleotide.Title        GI
205225_at Homo sapiens estrogen receptor 1 (ESR1), transcript variant 1, mRNA 170295798
          GenBank.Accession Platform_CLONEID Platform_ORF Platform_SPOTID Chromosome.location
205225_at         NM_000125               NA           NA            <NA>              6q25.1
                                      Chromosome.annotation
205225_at Chromosome 6, NC_000006.11 (152011631..152424409)
                                                                                                                                                                                                                                                                                                                                                                  GO.Function
205225_at estrogen receptor activity///estrogen response element binding///hormone binding///metal ion binding///nitric-oxide synthase regulator activity///protein binding///protein complex binding///sequence-specific DNA binding///steroid binding///steroid hormone receptor activity///transcription factor activity///transcription factor binding///zinc ion binding
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  GO.Process
205225_at androgen metabolic process///antral ovarian follicle growth///epithelial cell development///epithelial cell proliferation involved in mammary gland duct elongation///estrogen receptor signaling pathway///male gonad development///mammary gland alveolus development///mammary gland branching involved in pregnancy///neuroprotection///osteoblast development///positive regulation of fibroblast proliferation///positive regulation of retinoic acid receptor signaling pathway///positive regulation of survival gene product expression///prostate epithelial cord arborization involved in prostate glandular acinus morphogenesis///prostate epithelial cord elongation///regulation of branching involved in prostate gland morphogenesis///regulation of transcription, DNA-dependent///response to estradiol stimulus///signal transduction///transcription, DNA-dependent///uterus development///vagina development
                                                                                                                                           GO.Component
205225_at T-tubule///chromatin remodeling complex///cytoplasm///membrane///neuron projection///nucleus///perikaryon///plasma membrane///terminal button
                                                                                                                                                                   GO.Function.1
205225_at GO:0030284///GO:0034056///GO:0042562///GO:0046872///GO:0030235///GO:0005515///GO:0032403///GO:0043565///GO:0005496///GO:0003707///GO:0003700///GO:0008134///GO:0008270
                                                                                                                                                                                                                                                                                         GO.Process.1
205225_at GO:0008209///GO:0001547///GO:0002064///GO:0060750///GO:0030520///GO:0008584///GO:0060749///GO:0060745///GO:0043526///GO:0002076///GO:0048146///GO:0048386///GO:0045885///GO:0060527///GO:0060523///GO:0060687///GO:0006355///GO:0032355///GO:0007165///GO:0006351///GO:0060065///GO:0060068
                                                                                                              GO.Component.1
205225_at GO:0030315///GO:0016585///GO:0005737///GO:0016020///GO:0043005///GO:0005634///GO:0043204///GO:0005886///GO:0043195
             logFC  AveExpr        t      P.Value    adj.P.Val        B
205225_at 3.762901 11.37774 22.68392 2.001001e-70 4.458832e-66 149.1987
genename <- rownames(topTable(fit, n=1))
typeMean <- tapply(exprs(vdx)[genename,], vdx$er, mean)
typeMean
        0         1 
 9.091554 12.854455 
typeMean["1"] - typeMean["0"]
       1 
3.762901 

confirming the statement. It is sometimes useful to check things by hand to make sure you have the right interpretation. Finally, note that limma doesn’t do anything different from a difference of means when it computes logFC; all the statistical improvements centers on computing better t-statistics and p-values.

The reader who has some experience with statistics will note that all we are doing is comparing two groups; this is the same setup as the classic t-statistic. What we are computing here is indeed a t-statistic, but one where the variance estimation (the denominator of the t-statistics) is moderated by borrowing strength across genes (this is what eBayes() does); this is called a moderated t-statistic.

The output from topTable() includes

  • logFC: the log fold-change between cases and controls.
  • t: the t-statistic used to assess differential expression.
  • P.Value: the p-value for differential expression; this value is not adjusted for multiple testing.
  • adj.P.Val: the p-value adjusted for multiple testing. Different adjustment methods are available, the default is Benjamini-Horchberg.

How to setup and interpret a design matrix for more complicated designs is beyond the scope of this introduction. The limma User’s Guide is extremely helpful here. Also, note that setting up a design matrix for an experiment is a standard task in statistics (and requires very little knowledge about genomics), so other sources of help is a local, friendly statistician or text books on basic statistics.

3.3 Further issues

  • Data normalization oligo function rma
  • Surrogate variable analysis sva
  • Cell-type estimation biocLite('CellMix', siteRepos = 'http://web.cbio.uct.ac.za/~renaud/CRAN', type='both')

Inflation (produced by batch effects or uncontrolled variables) is normally tested by using QQ-plot

qqt(fit$t, df=fit$df.prior+fit$df.residual,
    pch=16,cex=0.2)
abline(0,1, col="red", lwd=2)

Surrogate variable analysis (SVA) can be used to correct for uncontrolled variables. The sva package can be used to this end. This procedure is implemented in the MEAL package that requires less R code. Note that fNames argument allows to add annotation to the results by passing any of the existing variables in the feature data (fData) of the ExpressionSet object

library(MEAL)

# shows available annotated variabels
names(fData(vdx))
 [1] "probe"                 "Gene.title"            "Gene.symbol"           "Gene.ID"              
 [5] "EntrezGene.ID"         "UniGene.title"         "UniGene.symbol"        "UniGene.ID"           
 [9] "Nucleotide.Title"      "GI"                    "GenBank.Accession"     "Platform_CLONEID"     
[13] "Platform_ORF"          "Platform_SPOTID"       "Chromosome.location"   "Chromosome.annotation"
[17] "GO.Function"           "GO.Process"            "GO.Component"          "GO.Function.1"        
[21] "GO.Process.1"          "GO.Component.1"       
# DEA
ans <- runPipeline(vdx, variable = "er")
ans
Object of class 'ResultSet'
 . created with: runPipeline 
 . sva:  no 
 . #results: 1 ( error: 0 )
 . featureData: 22283 probes x 22 variables
# DEA + SVA
ans.sva <- runPipeline(vdx, variable = "er", sva=TRUE)
Number of significant surrogate variables is:  25 
Iteration (out of 5 ):1  2  3  4  5  
ans.sva
Object of class 'ResultSet'
 . created with: runPipeline 
 . sva:  yes 
 . #results: 1 ( error: 0 )
 . featureData: 22283 probes x 22 variables
fit.meal <- getProbeResults(ans, coef=2, 
                       fNames=c("probe", "Gene.symbol", "Chromosome.location"))
head(fit.meal)
               logFC     CI.L     CI.R  AveExpr        t      P.Value    adj.P.Val         B
205225_at   3.762901 3.436632 4.089169 11.37774 22.68392 2.001001e-70 4.458832e-66 149.19866
209603_at   3.052348 2.736067 3.368629  9.94199 18.98154 1.486522e-55 1.656209e-51 115.46414
209604_s_at 2.431309 2.159599 2.703020 13.18533 17.59968 5.839050e-50 4.337052e-46 102.75707
212956_at   2.157435 1.914779 2.400091 11.70294 17.48711 1.665700e-49 9.279201e-46 101.72268
202088_at   1.719680 1.524180 1.915180 13.11950 17.30104 9.412084e-49 4.194589e-45 100.01376
212496_s_at 1.459843 1.289447 1.630239 10.70394 16.85070 6.188671e-47 2.298369e-43  95.88265
                    SE       probe Gene.symbol Chromosome.location
205225_at   0.05434399   205225_at        ESR1              6q25.1
209603_at   0.08505077   209603_at       GATA3               10p15
209604_s_at 0.06235462 209604_s_at       GATA3               10p15
212956_at   0.04006799   212956_at      TBC1D9             4q31.21
202088_at   0.09917443   202088_at     SLC39A6             18q12.2
212496_s_at 0.05482533 212496_s_at       KDM4B             19p13.3

The QQ-plot is then produced by

plot(ans, type="QQ")

plot(ans.sva, type="QQ")

There is still inflation, that may be due to cell-types differences or any other technical variable, but using SVA you can see that inflation is dramatically reduced.

4 Data visualization

Volcano plot can be use to visualize the genes that are differentially expressed at a given fold-change and p-value. The function volcanoplot from limma package can be used to this end.

volcanoplot(fit, coef=2, names = fit$genes$Gene.symbol, 
            highlight = 5)

We can also use the EnhancedVolcano Bioconductor package to get a very nice visualization.

library(EnhancedVolcano)
toptable <- topTable(fit, n = Inf)
names(toptable)
 [1] "probe"                 "Gene.title"            "Gene.symbol"           "Gene.ID"              
 [5] "EntrezGene.ID"         "UniGene.title"         "UniGene.symbol"        "UniGene.ID"           
 [9] "Nucleotide.Title"      "GI"                    "GenBank.Accession"     "Platform_CLONEID"     
[13] "Platform_ORF"          "Platform_SPOTID"       "Chromosome.location"   "Chromosome.annotation"
[17] "GO.Function"           "GO.Process"            "GO.Component"          "GO.Function.1"        
[21] "GO.Process.1"          "GO.Component.1"        "logFC"                 "AveExpr"              
[25] "t"                     "P.Value"               "adj.P.Val"             "B"                    
EnhancedVolcano(toptable,
    lab = toptable$Gene.symbol,
    x = 'logFC',
    y = 'P.Value')

Another state-of-the-art plot is MA-plot. It is used to check whether data are comparable among groups (i.e. normalization worked properly). This can be created using a function available in limma.

limma::plotMA(fit, coef=2, main="ER+ vs ER- comparison")
abline(h=0, col="red", lwd=2)

5 limma + voom pipeline for RNAseq data analysis

Let us illustrate how to analyze RNAseq cound data using limma + voom pipeline that was proposed here. To this end, we use count data from lymphoblastoid cell lines from 69 unrelated Nigerian individuals, described in Pickrell, et al. 2010. This data are available through the Bioconductor’s tweeDEseqCountData package.

library(tweeDEseqCountData)
data(pickrell)
pickrell.eset
ExpressionSet (storageMode: lockedEnvironment)
assayData: 52580 features, 69 samples 
  element names: exprs 
protocolData: none
phenoData
  sampleNames: NA18486 NA18498 ... NA19257 (69 total)
  varLabels: num.tech.reps population study gender
  varMetadata: labelDescription
featureData
  featureNames: ENSG00000000003 ENSG00000000005 ... LRG_99 (52580 total)
  fvarLabels: gene
  fvarMetadata: labelDescription
experimentData: use 'experimentData(object)'
Annotation:  

We need to create a DGElist object and calculate the normalization factors (NOTE: it does not normalize the data just calculate some factors to be used downstream)

library(edgeR)
dge <- DGEList(pickrell.eset, group = pickrell.eset$gender)
dge <- calcNormFactors(dge)

Genes that do not have a worthwhile number of reads in any sample should be filtered out of the downstream analyses. There are several reasons for this. From a biological point of view, genes that not expressed at a biologically meaningful level in any condition are not of interest and are therefore best ignored. From a statistical point of view, removing low count genes allows the mean-variance relationship in the data to be estimated with greater reliability and also reduces the number of statistical tests that need to be carried out in downstream analyses looking at differential expression.

The filterByExpr function in the edgeR package provides an automatic way to filter genes, while keeping as many genes as possible with worthwhile counts.

keep.exprs <- filterByExpr(dge)
dge.filt <- dge[keep.exprs,]
dim(dge.filt)
[1] 6255   69

For differential expression , gene expression is rarely considered at the level of raw counts since libraries sequenced at a greater depth will result in higher counts. Rather, it is common practice to transform raw counts onto a scale that accounts for such library size differences. Popular transformations include counts per million (CPM), log2-counts per million (log-CPM), reads per kilobase of transcript per million (RPKM), and fragments per kilobase of transcript per million (FPKM). Here raw counts are converted log-CPM values using the voom function in limma package.

mm <- model.matrix( ~ group, data=dge.filt$samples)
v <- voom(dge.filt, design = mm, plot = TRUE)

What is voom doing?

The above is a “good” voom plot. If your voom plot looks like the below (obtained from original data), you might want to filter more:

mp <- voom(dge, design = mm, plot = T)

Now we are ready to run DEA using the same functions that the ones for the microarray data

fit <- lmFit(v, mm)
fit <- eBayes(fit)
topTable(fit)
                     logFC     AveExpr         t      P.Value    adj.P.Val          B
ENSG00000129824  9.2050125  1.95469533 46.488159 2.704379e-55 1.691589e-51 96.2169855
ENSG00000099749  6.1629712  0.40983758 38.613259 9.839513e-50 3.077308e-46 87.1119440
ENSG00000198692  5.2160492 -0.03456801 27.814122 3.722546e-40 7.761508e-37 71.7012366
ENSG00000154620  5.0785792  0.48128404 26.392727 1.151368e-38 1.800451e-35 69.6465663
ENSG00000006757 -0.9323601  5.30180979 -9.012108 2.048633e-13 2.562840e-10 20.2011440
ENSG00000130021 -0.8655133  2.13108815 -4.351579 4.401395e-05 4.588455e-02  1.8595831
ENSG00000185753 -0.5594470  3.68469795 -4.169233 8.434265e-05 6.781208e-02  1.0305360
ENSG00000086712 -0.4136276  5.08445215 -4.161316 8.673008e-05 6.781208e-02  0.8063345
ENSG00000177606 -0.5958202  8.24149085 -3.978701 1.638281e-04 1.138605e-01 -0.0609869
ENSG00000123689 -1.4795363  5.16064334 -3.868186 2.389355e-04 1.494542e-01 -0.1294323

EXERCISE: leukemiasEset provides transcriptomic data for different types of leukemias from a microarray experiment. It can be loaded into R by data(leukemiasEset) after loading the package. The variable LeukemiaType encodes several types of leukemia as well as control samples (NoL). Perform a complete differential expression analysis comparing CML versus NoL.


6 Acknowledgments

The description of the microarry experiment has been brougth from bitesizebio. Some parts of this material has been obtained from Kasper D. Hansen material of the course Bioconductor for Genomic Data Science

7 Session info

sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19042)

Matrix products: default

locale:
[1] LC_COLLATE=Spanish_Spain.1252  LC_CTYPE=Spanish_Spain.1252    LC_MONETARY=Spanish_Spain.1252
[4] LC_NUMERIC=C                   LC_TIME=Spanish_Spain.1252    

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] edgeR_3.30.3              tweeDEseqCountData_1.26.0 EnhancedVolcano_1.6.0    
 [4] ggrepel_0.8.2             ggplot2_3.3.3             MEAL_1.22.0              
 [7] MultiDataSet_1.19.1       Biobase_2.48.0            BiocGenerics_0.34.0      
[10] breastCancerVDX_1.26.0    limma_3.44.3              BiocStyle_2.16.0         

loaded via a namespace (and not attached):
  [1] qqman_0.1.4                 BiocFileCache_1.12.1        plyr_1.8.6                 
  [4] splines_4.0.2               BiocParallel_1.22.0         GenomeInfoDb_1.24.2        
  [7] sva_3.36.0                  digest_0.6.27               foreach_1.5.0              
 [10] htmltools_0.5.1.1           magick_2.4.0                fansi_0.4.2                
 [13] magrittr_2.0.1              memoise_1.1.0               cluster_2.1.0              
 [16] Biostrings_2.56.0           readr_1.3.1                 annotate_1.66.0            
 [19] matrixStats_0.57.0          askpass_1.1                 siggenes_1.62.0            
 [22] prettyunits_1.1.1           colorspace_2.0-0            blob_1.2.1                 
 [25] rappdirs_0.3.1              xfun_0.16                   dplyr_1.0.5                
 [28] crayon_1.4.1                RCurl_1.98-1.2              genefilter_1.70.0          
 [31] GEOquery_2.56.0             survival_3.2-3              iterators_1.0.12           
 [34] glue_1.4.2                  gtable_0.3.0                zlibbioc_1.34.0            
 [37] XVector_0.28.0              DelayedArray_0.14.1         Rhdf5lib_1.10.1            
 [40] HDF5Array_1.16.1            scales_1.1.1                DBI_1.1.0                  
 [43] rngtools_1.5                Rcpp_1.0.6                  xtable_1.8-4               
 [46] progress_1.2.2              bumphunter_1.30.0           bit_4.0.4                  
 [49] mclust_5.4.6                preprocessCore_1.50.0       stats4_4.0.2               
 [52] httr_1.4.2                  RColorBrewer_1.1-2          calibrate_1.7.7            
 [55] ellipsis_0.3.1              pkgconfig_2.0.3             reshape_0.8.8              
 [58] XML_3.99-0.5                farver_2.1.0                dbplyr_1.4.4               
 [61] locfit_1.5-9.4              utf8_1.2.1                  tidyselect_1.1.0           
 [64] labeling_0.4.2              rlang_0.4.10                AnnotationDbi_1.50.3       
 [67] munsell_0.5.0               tools_4.0.2                 generics_0.0.2             
 [70] RSQLite_2.2.0               evaluate_0.14               stringr_1.4.0              
 [73] yaml_2.2.1                  knitr_1.29                  bit64_4.0.4                
 [76] beanplot_1.2                scrime_1.3.5                purrr_0.3.4                
 [79] nlme_3.1-149                doRNG_1.8.2                 nor1mix_1.3-0              
 [82] xml2_1.3.2                  biomaRt_2.44.1              compiler_4.0.2             
 [85] curl_4.3                    tibble_3.1.0                stringi_1.4.6              
 [88] highr_0.8                   GenomicFeatures_1.40.1      minfi_1.34.0               
 [91] lattice_0.20-41             Matrix_1.2-18               vegan_2.5-6                
 [94] permute_0.9-5               multtest_2.44.0             vctrs_0.3.7                
 [97] pillar_1.6.0                lifecycle_1.0.0             BiocManager_1.30.10        
[100] data.table_1.13.0           bitops_1.0-6                rtracklayer_1.48.0         
[103] GenomicRanges_1.40.0        R6_2.4.1                    bookdown_0.20              
[106] IRanges_2.22.2              codetools_0.2-16            MASS_7.3-52                
[109] assertthat_0.2.1            rhdf5_2.32.2                SummarizedExperiment_1.18.2
[112] openssl_1.4.2               withr_2.4.1                 GenomicAlignments_1.24.0   
[115] Rsamtools_2.4.0             S4Vectors_0.26.1            GenomeInfoDbData_1.2.3     
[118] mgcv_1.8-33                 hms_0.5.3                   quadprog_1.5-8             
[121] grid_4.0.2                  tidyr_1.1.2                 base64_2.0                 
[124] rmarkdown_2.7               DelayedMatrixStats_1.10.1   illuminaio_0.30.0