Differential Expression in RNA-Seq Analysis

Getting edgeR

If you have not installed edgeR into R you will can do that with biocLite. I am also installing baySeq for a data set it provides and gplots for plotting..

source("http://www.bioconductor.org/biocLite.R")
biocLite("edgeR")
biocLite("gplots")
biocLite("baySeq")

Load the libraries.

library( edgeR )

## Warning: package 'edgeR' was built under R version 3.1.2

## Loading required package: limma

## Warning: package 'limma' was built under R version 3.1.3

library( baySeq )

## Loading required package: GenomicRanges

## Warning: package 'GenomicRanges' was built under R version 3.1.2

## Loading required package: BiocGenerics

## Warning: package 'BiocGenerics' was built under R version 3.1.2

## Loading required package: parallel
## 
## Attaching package: 'BiocGenerics'
## 
## The following objects are masked from 'package:parallel':
## 
##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
##     clusterExport, clusterMap, parApply, parCapply, parLapply,
##     parLapplyLB, parRapply, parSapply, parSapplyLB
## 
## The following object is masked from 'package:limma':
## 
##     plotMA
## 
## The following object is masked from 'package:stats':
## 
##     xtabs
## 
## The following objects are masked from 'package:base':
## 
##     anyDuplicated, append, as.data.frame, as.vector, cbind,
##     colnames, do.call, duplicated, eval, evalq, Filter, Find, get,
##     intersect, is.unsorted, lapply, Map, mapply, match, mget,
##     order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
##     rbind, Reduce, rep.int, rownames, sapply, setdiff, sort,
##     table, tapply, union, unique, unlist, unsplit
## 
## Loading required package: S4Vectors
## Loading required package: stats4
## Loading required package: IRanges

## Warning: package 'IRanges' was built under R version 3.1.2

## Loading required package: GenomeInfoDb

## Warning: package 'GenomeInfoDb' was built under R version 3.1.3

## Loading required package: abind

## Warning: package 'abind' was built under R version 3.1.3

library( gplots )

## Warning: package 'gplots' was built under R version 3.1.3

## 
## Attaching package: 'gplots'
## 
## The following object is masked from 'package:IRanges':
## 
##     space
## 
## The following object is masked from 'package:stats':
## 
##     lowess

Load your data

Load your expression/counts matrix from a file using it’s absolute path. Loading a matrix into R is not a trivial task. Parameters are available for the read.table function to help load your matrix depending on it’s format (for example header = TRUE ). Type “?read.table” into your console for more information.

Here in this tutorial we are going to use data from the bayseq package.

# Load data from computer.
mtrx_counts <- as.matrix( read.table( "/path/to/file.txt" ) )

# Load bayseq data
data( mobData )
data( mobAnnotation )

Differential Expression (DE)

When comparing an experiment with two groups ( control and test ) we can contrast one against another using the method exactTest( DGEList ) to obtain differentially expressed genes. This is like using a t-test but with assumptions more appropriate for this kind of data.

When we have more than 2 groups we can use a linear model instead of a t-test. Specifically, edgeR uses a generalized linear model ( GLM ) which is similar to a linear model but does not assume the response (RNA-Seq counts) is normally distributed. The response can take on many shapes as long as we understand how to model the shape of the data. This includes count data for which we can use assumptions of the negative binomial distribution. We will focus on how to use edgeR in the setting of a GLM.

To use a GLM, we have to indicate how samples are grouped. This is done with the design matrix. To use a GLM with edgeR we will need to make both a DGEList AND a design matrix.

Design Matrices: Factors

First let’s make a design matrix.

Design matrices help GLMs understand how to compare groups of your data. It answers the question, “How did you design your study? What are the groups of your observations?”

Let us image we have a study which has 6 samples, 3 chorts of 2 individuals. If we gave everyone the label of thier group we could have group1, group1, group2, group2, group3, group3. Next, I am just going to use a factor to represent the groups with 1 or 2 or 3. Factors are how R represents categorical data.

group <- factor( c( 1, 1, 2, 2, 3, 3 ) )
group

## [1] 1 1 2 2 3 3
## Levels: 1 2 3

Notice the output. The top line are the specifically labeled individuals but the bottom are the groups.

Also note, make sure your factors are in the order of your samples in the matrix_counts (counts matrix). If your individuals were actually in the order group2, group1, group3, group1, group2, group3 then one would use this factor to represent them.

new_group <- factor( c( 2, 1, 3, 1, 2, 3 ))
new_group

## [1] 2 1 3 1 2 3
## Levels: 1 2 3

Here we see the order of the individuals are preserved and the groups are the same as before. If you have a specific control / normal group, often in more traditional settings the control / normal group is indicated by 1. The methods shown below do not require this.

Design Matrices: Model Matrix

The function model.matrix() is used to make a design matrix from a factor. This method does all the work for you, let’s see it in action!

design <- model.matrix( ~0+group )
design

##   group1 group2 group3
## 1      1      0      0
## 2      1      0      0
## 3      0      1      0
## 4      0      1      0
## 5      0      0      1
## 6      0      0      1
## attr(,"assign")
## [1] 1 1 1
## attr(,"contrasts")
## attr(,"contrasts")$group
## [1] "contr.treatment"

The result of the function is a design matrix; 3 columns for your 3 different groups in your factor (also known as factor levels) and 6 rows, one for each individual in your study (in order of the original factor from top to bottom). It is always a good idea to take a peek at the deisgn matrix. If yours is too big to display on the console you can view the top of the matrix with

head( design )

##   group1 group2 group3
## 1      1      0      0
## 2      1      0      0
## 3      0      1      0
## 4      0      1      0
## 5      0      0      1
## 6      0      0      1

or check the dimensions using dim(). Note the output is (# rows, # columns).

dim( design )

## [1] 6 3

Now back to the design matrix.

design

##   group1 group2 group3
## 1      1      0      0
## 2      1      0      0
## 3      0      1      0
## 4      0      1      0
## 5      0      0      1
## 6      0      0      1
## attr(,"assign")
## [1] 1 1 1
## attr(,"contrasts")
## attr(,"contrasts")$group
## [1] "contr.treatment"

Let’s focus on the columns of the design matrix. Each column is a labelled group matching your factor groups (alphanumeric order from the left). The individuals with a 1 in the first column and no other 1 in any other column are the individuals in the group 1 (1). Here, they are the first two rows which match the first two individuals in our original factor. The individuals with the 1 in the second column (group2) are the individuals in the group 2 (2). This would be individuals (rows) 2 and 3. The individuals with a 1 in the column group3 are the group 3 (3) individuals. In the original call we used ~0+groups, this tells the model matrix function how to format your matrix. There are other ways of making model matrices, this is the most intuitive for using edgeR.

Making Groups for the MobData

I am now making the design matrix for our test data.

group <- factor( c("MM","MM", "WM", "WM", "WW", "WW") )
# MM="triple mutant shoot grafted onto triple mutant root"
# WM="wild-type shoot grafted onto triple mutant root"
# WW="wild-type shoot grafted onto wild-type root"
design <- model.matrix( ~0+group )
design

##   groupMM groupWM groupWW
## 1       1       0       0
## 2       1       0       0
## 3       0       1       0
## 4       0       1       0
## 5       0       0       1
## 6       0       0       1
## attr(,"assign")
## [1] 1 1 1
## attr(,"contrasts")
## attr(,"contrasts")$group
## [1] "contr.treatment"

Quick Filtering, Normalization, and Making a DGEList object

There are different ways of filtering data. Genes that have very few counts throughout samples are often removed so that more active genes are the focus of the study. Additionally, these genes with minimal counts would be difficult to distinguish from sequencing noise.

Here I will quickly remove genes that do not have less than 100 counts per million in atleast 2 samples. This works for this tutorial but is not a hard number to use in all data sets.

As well, trimmed mean of M-values (TMM) normalization is used so comparisons between samples can be made. This is a very popular normalization in RNA-Seq analysis.

While we perform the filtering and normalization we are going to create a DGEList object, edgeR uses this object when working with the counts. The DGEList needs the counts as a matrix (already read in), your samples labelled by group in the order found in the matrix (already made), and the library size (we will calculate).

# Starting count dimensions
dim( mobData )

## [1] 3000    6

# Filter. Atleast 100 cpm in atleast 2 samples or discard
mobData <- mobData[ rowSums(cpm(mobData)>100) >= 2, ]
# Reduces the data set to 786 genes by 6 samples
dim( mobData )

## [1] 786   6

# Make the DGEList
cur_DGELIST <- DGEList( counts=mobData, group=group, lib.size=colSums( mobData ) )
# Use TMM normalization
cur_DGELIST <- calcNormFactors( cur_DGELIST )

Visualizing the data

After filtering and normalization, one often wants to see how samples relate to each other in a general/global way. For RNA-Seq data, there are several popular methods including PCA and MDS. Here we perform MDS, coloring the samples according to their groups.

plotMDS( cur_DGELIST, method="bcv", col=as.numeric( cur_DGELIST$samples$group ) )
legend( "topleft", as.character( unique( cur_DGELIST$samples$group )), col=1:3, pch=20)

Estimating Your Dispersions for GLM

With the DGEList object, edgeR can estimate the dispersion of the data to best model it. EdgeR’s underlying negative binomial distribution requires this estimate of dispersion as part of the model. For the GLM, this estimate is done in three steps.

cur_DGELIST <- estimateGLMCommonDisp( cur_DGELIST, design )
cur_DGELIST <- estimateGLMTrendedDisp( cur_DGELIST, design )

## Loading required package: splines

cur_DGELIST <- estimateGLMTagwiseDisp( cur_DGELIST, design )
# plotBCV( cur_DGELIST )

Fitting and Making Comparisons

Now we can use edgeR as a GLM. First fit the data to the count model before making contrasts of interest.

fit <- glmFit( cur_DGELIST, design )

At this point we have fit the data but not tested differences in groups.

With this study, we have 3 different tests between groups we could be interested in including:

group WildWild vs group WildMut
group WildWild vs group MutMut
group WildMut vs group MutMut

We can use the glmLRT() function with the makeContrasts() function to make any of those DE tests or contrasts.

# Contrast group WW vs WM ( with WW as the reference of comparison )
contrast_WW_v_WM <- glmLRT( fit, contrast=makeContrasts( groupWM-groupWW, levels=design ) )

# Contrast group WW vs MM( with WW as the reference of comparison )
contrast_WW_v_MM <- glmLRT( fit, contrast=makeContrasts( groupMM-groupWW, levels=design ) )
  
# Contrast group WM vs mm ( with WM as the reference of comparison )
contrast_WM_v_MM <- glmLRT( fit, contrast=makeContrasts( groupMM-groupWM, levels=design ) )

When asking for your comparison, make sure the reference of the comparison is the group that is subtracted from the other.

Once your test is made you can access the top expressed genes by using the toptags() function. The parameter n can be used to access the top “n” genes in the contrast. Here we show the top 10 genes in each of the 3 contrasts based on the BH FDR (p value modified to control for multiple comparisons).

topTags( contrast_WW_v_WM, n = 10 )

## Coefficient:  1*groupWM -1*groupWW 
##         logFC   logCPM       LR       PValue          FDR
## 63  -5.390741 8.309367 82.28798 1.176355e-19 9.246150e-17
## 161 -6.289658 8.704473 72.80038 1.434489e-17 4.498413e-15
## 465 -5.248451 9.205747 72.44566 1.716952e-17 4.498413e-15
## 132 -8.146962 6.815074 69.74464 6.750099e-17 1.326394e-14
## 317 -5.820571 6.994167 66.78017 3.035361e-16 4.771588e-14
## 770 -3.536811 8.944139 63.15214 1.913412e-15 2.506569e-13
## 469 -4.272702 8.207683 46.26267 1.034158e-11 1.161212e-09
## 467 -4.810553 6.240189 44.85914 2.117299e-11 2.080247e-09
## 446 -3.971621 7.636249 43.98871 3.302754e-11 2.884406e-09
## 284 -3.756665 6.669350 42.34942 7.633898e-11 6.000244e-09

topTags( contrast_WW_v_MM, n = 10 )

## Coefficient:  1*groupMM -1*groupWW 
##          logFC   logCPM        LR       PValue          FDR
## 20  -10.300083 9.542239 113.86640 1.393904e-26 1.095609e-23
## 272   5.529678 6.907979 106.25704 6.476543e-25 2.545281e-22
## 294  -9.505889 8.788746  99.79196 1.692764e-23 4.435042e-21
## 508   6.668367 7.452800  98.13219 3.913655e-23 7.690331e-21
## 357  -9.100453 8.588710  87.98196 6.605421e-21 1.038372e-18
## 322   4.232260 9.660051  84.06002 4.799802e-20 6.287741e-18
## 320   3.940997 7.644758  75.41259 3.819473e-18 4.288722e-16
## 361  -8.749389 8.127434  72.29768 1.850651e-17 1.818265e-15
## 198  -8.960148 8.210388  70.83841 3.877141e-17 3.386037e-15
## 775   3.730386 6.742448  68.64416 1.179337e-16 9.269586e-15

topTags( contrast_WM_v_MM, n = 10 )

## Coefficient:  1*groupMM -1*groupWM 
##          logFC   logCPM        LR       PValue          FDR
## 20  -10.731434 9.542239 127.96094 1.144736e-29 8.997626e-27
## 161   7.534973 8.704473 120.67621 4.498784e-28 1.768022e-25
## 357  -9.913269 8.588710 117.44232 2.296829e-27 6.017693e-25
## 294  -9.972623 8.788746 115.82009 5.204318e-27 1.022649e-24
## 322   4.842254 9.660051  91.80901 9.545870e-22 1.500611e-19
## 361  -9.337122 8.127434  89.14848 3.662643e-21 4.798062e-19
## 320   5.477270 7.644758  80.66844 2.669516e-19 2.997485e-17
## 445  -9.004428 7.739482  79.34965 5.203537e-19 5.047468e-17
## 198  -9.314755 8.210388  79.14224 5.779544e-19 5.047468e-17
## 335  -8.840161 7.428282  74.79878 5.212181e-18 4.096774e-16

Even better, one can access the top genes based on a particular measure, say BH FDR < 0.05 using the decideTestsDGE function. Here we look at the top genes in contrast_WM_v_MM with a BH FDR < 0.05 and plot them with a “Smear” plot, equivalent to a microarray MA plot.

dt_significant <- decideTestsDGE( contrast_WM_v_MM, adjust.method="BH", p.value=0.05)
vctr_names_sig <- rownames( cur_DGELIST )[ as.logical( dt_significant )]
plotSmear( contrast_WM_v_MM, de.tags = vctr_names_sig )
abline( h = c( -2, 2 ), col = "blue")

HCL of Top Differentially Expressed Genes

Another view of the data is using a heatmap that has been heirarchically clustered. This gives you a global view of the samples and how they relate (in the sample dendrogram) but also highlights genes that may be driving the differences between sample groups. To keep this figure small for the tutorial, I am going to look at the top 30 genes from each contrast. One could do other things like look at the all the genes that have a BH FDR < 0.05 in atleast one contrast. This would be more reasonable in a study.

vctr_names_top <- rownames( topTags( contrast_WW_v_WM, n = 30 ) )
vctr_names_top <- c( vctr_names_top, rownames( topTags( contrast_WW_v_MM, n = 30 ) ) )
vctr_names_top <- unique( c( vctr_names_top, rownames( topTags( contrast_WM_v_MM, n = 30 ) ) ) )

vctr_sig <- as.logical( decideTestsDGE( contrast_WW_v_WM, adjust.method="BH", p.value=0.000005) )
vctr_sig <- vctr_sig | as.logical( decideTestsDGE( contrast_WW_v_MM, adjust.method="BH", p.value=0.000005) )
vctr_sig <- vctr_sig | as.logical( decideTestsDGE( contrast_WM_v_MM, adjust.method="BH", p.value=0.000005) )
vctr_names_hcl <- rownames( cur_DGELIST )[ vctr_sig ]
length( vctr_names_hcl )

## [1] 128

head( vctr_names_hcl )

## [1] "5"  "17" "19" "20" "48" "55"

Now we perform hierarchical clustering to visualize samples and genes as groups. Functions for heatmaps and clustering are very flexible allowing many distance metrics. This is a simple start using defaults. If interested in advanced plotting, please look in the references for a suggested script which makes very flexible and beautiful heatmaps.

# Matrix of values
mtrx_significant <- cur_DGELIST$counts[ vctr_names_top, ]
# Colors for the groups
vctr_colors = as.factor( c( "black", "red", "green" ))
vctr_colors

## [1] black red   green
## Levels: black green red

vctr_sample_colors <- as.character( vctr_colors[ as.numeric( cur_DGELIST$samples$group ) ] )
vctr_sample_colors

## [1] "black" "black" "red"   "red"   "green" "green"

# Heatmap
heatmap.2( log2( mtrx_significant + 1 ), ColSideColors=vctr_sample_colors, key=TRUE, trace="none", col=heat.colors(200), scale="row" )

More Complicated Comparisons

What if I have more than 3 groups? I have 4 groups!

No problem! Make a design matrix based your sample order and groups and continue as above including contrasts involving your group4 as interested.

# Make groups
group_with_4 <- factor( c( 4, 2, 1, 3, 1, 4, 2, 3 ))
# Make design matrix
design_with_4 <- model.matrix( ~0+group_with_4 )
# Estimate dispersion
cur_DGELIST <- estimateGLMCommonDisp( cur_DGELIST, design_with_4 )
cur_DGELIST <- estimateGLMTrendedDisp( cur_DGELIST, design_with_4 )
cur_DGELIST <- estimateGLMTagwiseDisp( cur_DDGELIST, design_with_4 )
# Fit counts to model
fit_with_4 <- glmFit( cur_DGELIST, design_with_4 )
# Make contrasts
contrast_1_v_3 <- glmLRT( fit_with_4, contrast=makeContrasts( group3-group1, levels=design_with_4 ) )
contrast_2_v_4 <- glmLRT( fit_with_4, contrast=makeContrasts( group4-group2, levels=design_with_4 ) )
# See the top genes
topTags( contrast_1_v_3 )
topTags( contrast_2_v_4 )

What if I want to combine some of my groups in some comparisons and not others?

No problem! Let’s say you have 4 groups and you want to compare: 1. 2 vs 1 (reference) 2. 3 and 4 combined vs 2 (reference) 3. 1 and 2 combined vs 3 and 4 combined (reference). You can do it!

To combine groups, we use the following notation ( group + group ) / 2, this gives you the group’s average affect. Repectively, the above contrasts would be:

# Contrast 2 v 1
contrast_2_v_1 <- glmLRT( fit_with_4, contrast=makeContrasts( group2-group1, levels=design_with_4 ) )
# Contrast ( 3 and 4 ) v 2
contrast_3and4_v_2 <- glmLRT( fit_with_4, contrast=makeContrasts( ( group3 + group4 )/2 - group2, levels=design_with_4 ) )
# Contrast ( 1 and 2 ) v ( 3 and 4 )
contrast_1and2_v_3and4 <- glmLRT( fit_with_4, contrast=makeContrasts( ( group1 + group2 )/2-( group3 + group4 )/2, levels=design_with_4 ) )

What about differences between differences?

Sure, complicated but sure :-)

Here is how:

makeContrasts( (group1 - group2) - (group3 - group4), levels=design_with_4 )

Although complex at first, this can be useful. Imagine if you had two groups at two different time points and wanted to know how they differed over time. You contrast the difference in how each group changed over time.

makeContrasts(( group2_minute1 - group2_minute0) - ( group1_minute1 - group1_minute0 ), levels=design_with_4 )

Here we have how did each group differ in change from minute one to minute two.

Aren’t GLM’s awesome :-)

What if I want interaction effects?

No problem! You can do it, but that is beyond the scope of this tutorial. You’ll need to build your design matrix slightly different and ask for the contrast in a different way.

References

This tutorial relied on the following resources-

edgeR Manual

http://www.bioconductor.org/packages/release/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf

edgeR Tutorial

https://web.stanford.edu/class/bios221/labs/rnaseq/lab_4_rnaseq.html

Beautiful Heatmap Plots

https://github.com/sriesenfeld/CompoHeatMap

Session Info

Incase it is needed, this is the session info that built this document.

sessionInfo()

## R version 3.1.1 (2014-07-10)
## Platform: x86_64-apple-darwin13.1.0 (64-bit)
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
##  [1] splines   stats4    parallel  stats     graphics  grDevices utils    
##  [8] datasets  methods   base     
## 
## other attached packages:
##  [1] gplots_2.17.0        baySeq_2.0.50        abind_1.4-3         
##  [4] GenomicRanges_1.18.4 GenomeInfoDb_1.2.5   IRanges_2.0.1       
##  [7] S4Vectors_0.4.0      BiocGenerics_0.12.1  edgeR_3.8.6         
## [10] limma_3.22.7        
## 
## loaded via a namespace (and not attached):
##  [1] bitops_1.0-6       caTools_1.17.1     digest_0.6.8      
##  [4] evaluate_0.7       formatR_1.2        gdata_2.16.1      
##  [7] gtools_3.4.2       htmltools_0.2.6    KernSmooth_2.23-14
## [10] knitr_1.10         magrittr_1.5       rmarkdown_0.5.1   
## [13] stringi_0.4-1      stringr_1.0.0      tools_3.1.1       
## [16] XVector_0.6.0      yaml_2.1.13