If you have not installed edgeR into R you will can do that with biocLite. I am also installing baySeq for a data set it provides and gplots for plotting..
source("http://www.bioconductor.org/biocLite.R")
biocLite("edgeR")
biocLite("gplots")
biocLite("baySeq")
Load the libraries.
library( edgeR )
## Warning: package 'edgeR' was built under R version 3.1.2
## Loading required package: limma
## Warning: package 'limma' was built under R version 3.1.3
library( baySeq )
## Loading required package: GenomicRanges
## Warning: package 'GenomicRanges' was built under R version 3.1.2
## Loading required package: BiocGenerics
## Warning: package 'BiocGenerics' was built under R version 3.1.2
## Loading required package: parallel
##
## Attaching package: 'BiocGenerics'
##
## The following objects are masked from 'package:parallel':
##
## clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
## clusterExport, clusterMap, parApply, parCapply, parLapply,
## parLapplyLB, parRapply, parSapply, parSapplyLB
##
## The following object is masked from 'package:limma':
##
## plotMA
##
## The following object is masked from 'package:stats':
##
## xtabs
##
## The following objects are masked from 'package:base':
##
## anyDuplicated, append, as.data.frame, as.vector, cbind,
## colnames, do.call, duplicated, eval, evalq, Filter, Find, get,
## intersect, is.unsorted, lapply, Map, mapply, match, mget,
## order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
## rbind, Reduce, rep.int, rownames, sapply, setdiff, sort,
## table, tapply, union, unique, unlist, unsplit
##
## Loading required package: S4Vectors
## Loading required package: stats4
## Loading required package: IRanges
## Warning: package 'IRanges' was built under R version 3.1.2
## Loading required package: GenomeInfoDb
## Warning: package 'GenomeInfoDb' was built under R version 3.1.3
## Loading required package: abind
## Warning: package 'abind' was built under R version 3.1.3
library( gplots )
## Warning: package 'gplots' was built under R version 3.1.3
##
## Attaching package: 'gplots'
##
## The following object is masked from 'package:IRanges':
##
## space
##
## The following object is masked from 'package:stats':
##
## lowess
Load your expression/counts matrix from a file using it’s absolute path. Loading a matrix into R is not a trivial task. Parameters are available for the read.table function to help load your matrix depending on it’s format (for example header = TRUE ). Type “?read.table” into your console for more information.
Here in this tutorial we are going to use data from the bayseq package.
# Load data from computer.
mtrx_counts <- as.matrix( read.table( "/path/to/file.txt" ) )
# Load bayseq data
data( mobData )
data( mobAnnotation )
When comparing an experiment with two groups ( control and test ) we can contrast one against another using the method exactTest( DGEList ) to obtain differentially expressed genes. This is like using a t-test but with assumptions more appropriate for this kind of data.
When we have more than 2 groups we can use a linear model instead of a t-test. Specifically, edgeR uses a generalized linear model ( GLM ) which is similar to a linear model but does not assume the response (RNA-Seq counts) is normally distributed. The response can take on many shapes as long as we understand how to model the shape of the data. This includes count data for which we can use assumptions of the negative binomial distribution. We will focus on how to use edgeR in the setting of a GLM.
To use a GLM, we have to indicate how samples are grouped. This is done with the design matrix. To use a GLM with edgeR we will need to make both a DGEList AND a design matrix.
First let’s make a design matrix.
Design matrices help GLMs understand how to compare groups of your data. It answers the question, “How did you design your study? What are the groups of your observations?”
Let us image we have a study which has 6 samples, 3 chorts of 2 individuals. If we gave everyone the label of thier group we could have group1, group1, group2, group2, group3, group3. Next, I am just going to use a factor to represent the groups with 1 or 2 or 3. Factors are how R represents categorical data.
group <- factor( c( 1, 1, 2, 2, 3, 3 ) )
group
## [1] 1 1 2 2 3 3
## Levels: 1 2 3
Notice the output. The top line are the specifically labeled individuals but the bottom are the groups.
Also note, make sure your factors are in the order of your samples in the matrix_counts (counts matrix). If your individuals were actually in the order group2, group1, group3, group1, group2, group3 then one would use this factor to represent them.
new_group <- factor( c( 2, 1, 3, 1, 2, 3 ))
new_group
## [1] 2 1 3 1 2 3
## Levels: 1 2 3
Here we see the order of the individuals are preserved and the groups are the same as before. If you have a specific control / normal group, often in more traditional settings the control / normal group is indicated by 1. The methods shown below do not require this.
The function model.matrix() is used to make a design matrix from a factor. This method does all the work for you, let’s see it in action!
design <- model.matrix( ~0+group )
design
## group1 group2 group3
## 1 1 0 0
## 2 1 0 0
## 3 0 1 0
## 4 0 1 0
## 5 0 0 1
## 6 0 0 1
## attr(,"assign")
## [1] 1 1 1
## attr(,"contrasts")
## attr(,"contrasts")$group
## [1] "contr.treatment"
The result of the function is a design matrix; 3 columns for your 3 different groups in your factor (also known as factor levels) and 6 rows, one for each individual in your study (in order of the original factor from top to bottom). It is always a good idea to take a peek at the deisgn matrix. If yours is too big to display on the console you can view the top of the matrix with
head( design )
## group1 group2 group3
## 1 1 0 0
## 2 1 0 0
## 3 0 1 0
## 4 0 1 0
## 5 0 0 1
## 6 0 0 1
or check the dimensions using dim(). Note the output is (# rows, # columns).
dim( design )
## [1] 6 3
Now back to the design matrix.
design
## group1 group2 group3
## 1 1 0 0
## 2 1 0 0
## 3 0 1 0
## 4 0 1 0
## 5 0 0 1
## 6 0 0 1
## attr(,"assign")
## [1] 1 1 1
## attr(,"contrasts")
## attr(,"contrasts")$group
## [1] "contr.treatment"
Let’s focus on the columns of the design matrix. Each column is a labelled group matching your factor groups (alphanumeric order from the left). The individuals with a 1 in the first column and no other 1 in any other column are the individuals in the group 1 (1). Here, they are the first two rows which match the first two individuals in our original factor. The individuals with the 1 in the second column (group2) are the individuals in the group 2 (2). This would be individuals (rows) 2 and 3. The individuals with a 1 in the column group3 are the group 3 (3) individuals. In the original call we used ~0+groups, this tells the model matrix function how to format your matrix. There are other ways of making model matrices, this is the most intuitive for using edgeR.
I am now making the design matrix for our test data.
group <- factor( c("MM","MM", "WM", "WM", "WW", "WW") )
# MM="triple mutant shoot grafted onto triple mutant root"
# WM="wild-type shoot grafted onto triple mutant root"
# WW="wild-type shoot grafted onto wild-type root"
design <- model.matrix( ~0+group )
design
## groupMM groupWM groupWW
## 1 1 0 0
## 2 1 0 0
## 3 0 1 0
## 4 0 1 0
## 5 0 0 1
## 6 0 0 1
## attr(,"assign")
## [1] 1 1 1
## attr(,"contrasts")
## attr(,"contrasts")$group
## [1] "contr.treatment"
There are different ways of filtering data. Genes that have very few counts throughout samples are often removed so that more active genes are the focus of the study. Additionally, these genes with minimal counts would be difficult to distinguish from sequencing noise.
Here I will quickly remove genes that do not have less than 100 counts per million in atleast 2 samples. This works for this tutorial but is not a hard number to use in all data sets.
As well, trimmed mean of M-values (TMM) normalization is used so comparisons between samples can be made. This is a very popular normalization in RNA-Seq analysis.
While we perform the filtering and normalization we are going to create a DGEList object, edgeR uses this object when working with the counts. The DGEList needs the counts as a matrix (already read in), your samples labelled by group in the order found in the matrix (already made), and the library size (we will calculate).
# Starting count dimensions
dim( mobData )
## [1] 3000 6
# Filter. Atleast 100 cpm in atleast 2 samples or discard
mobData <- mobData[ rowSums(cpm(mobData)>100) >= 2, ]
# Reduces the data set to 786 genes by 6 samples
dim( mobData )
## [1] 786 6
# Make the DGEList
cur_DGELIST <- DGEList( counts=mobData, group=group, lib.size=colSums( mobData ) )
# Use TMM normalization
cur_DGELIST <- calcNormFactors( cur_DGELIST )
After filtering and normalization, one often wants to see how samples relate to each other in a general/global way. For RNA-Seq data, there are several popular methods including PCA and MDS. Here we perform MDS, coloring the samples according to their groups.
plotMDS( cur_DGELIST, method="bcv", col=as.numeric( cur_DGELIST$samples$group ) )
legend( "topleft", as.character( unique( cur_DGELIST$samples$group )), col=1:3, pch=20)
With the DGEList object, edgeR can estimate the dispersion of the data to best model it. EdgeR’s underlying negative binomial distribution requires this estimate of dispersion as part of the model. For the GLM, this estimate is done in three steps.
cur_DGELIST <- estimateGLMCommonDisp( cur_DGELIST, design )
cur_DGELIST <- estimateGLMTrendedDisp( cur_DGELIST, design )
## Loading required package: splines
cur_DGELIST <- estimateGLMTagwiseDisp( cur_DGELIST, design )
# plotBCV( cur_DGELIST )
Now we can use edgeR as a GLM. First fit the data to the count model before making contrasts of interest.
fit <- glmFit( cur_DGELIST, design )
At this point we have fit the data but not tested differences in groups.
With this study, we have 3 different tests between groups we could be interested in including:
We can use the glmLRT() function with the makeContrasts() function to make any of those DE tests or contrasts.
# Contrast group WW vs WM ( with WW as the reference of comparison )
contrast_WW_v_WM <- glmLRT( fit, contrast=makeContrasts( groupWM-groupWW, levels=design ) )
# Contrast group WW vs MM( with WW as the reference of comparison )
contrast_WW_v_MM <- glmLRT( fit, contrast=makeContrasts( groupMM-groupWW, levels=design ) )
# Contrast group WM vs mm ( with WM as the reference of comparison )
contrast_WM_v_MM <- glmLRT( fit, contrast=makeContrasts( groupMM-groupWM, levels=design ) )
When asking for your comparison, make sure the reference of the comparison is the group that is subtracted from the other.
Once your test is made you can access the top expressed genes by using the toptags() function. The parameter n can be used to access the top “n” genes in the contrast. Here we show the top 10 genes in each of the 3 contrasts based on the BH FDR (p value modified to control for multiple comparisons).
topTags( contrast_WW_v_WM, n = 10 )
## Coefficient: 1*groupWM -1*groupWW
## logFC logCPM LR PValue FDR
## 63 -5.390741 8.309367 82.28798 1.176355e-19 9.246150e-17
## 161 -6.289658 8.704473 72.80038 1.434489e-17 4.498413e-15
## 465 -5.248451 9.205747 72.44566 1.716952e-17 4.498413e-15
## 132 -8.146962 6.815074 69.74464 6.750099e-17 1.326394e-14
## 317 -5.820571 6.994167 66.78017 3.035361e-16 4.771588e-14
## 770 -3.536811 8.944139 63.15214 1.913412e-15 2.506569e-13
## 469 -4.272702 8.207683 46.26267 1.034158e-11 1.161212e-09
## 467 -4.810553 6.240189 44.85914 2.117299e-11 2.080247e-09
## 446 -3.971621 7.636249 43.98871 3.302754e-11 2.884406e-09
## 284 -3.756665 6.669350 42.34942 7.633898e-11 6.000244e-09
topTags( contrast_WW_v_MM, n = 10 )
## Coefficient: 1*groupMM -1*groupWW
## logFC logCPM LR PValue FDR
## 20 -10.300083 9.542239 113.86640 1.393904e-26 1.095609e-23
## 272 5.529678 6.907979 106.25704 6.476543e-25 2.545281e-22
## 294 -9.505889 8.788746 99.79196 1.692764e-23 4.435042e-21
## 508 6.668367 7.452800 98.13219 3.913655e-23 7.690331e-21
## 357 -9.100453 8.588710 87.98196 6.605421e-21 1.038372e-18
## 322 4.232260 9.660051 84.06002 4.799802e-20 6.287741e-18
## 320 3.940997 7.644758 75.41259 3.819473e-18 4.288722e-16
## 361 -8.749389 8.127434 72.29768 1.850651e-17 1.818265e-15
## 198 -8.960148 8.210388 70.83841 3.877141e-17 3.386037e-15
## 775 3.730386 6.742448 68.64416 1.179337e-16 9.269586e-15
topTags( contrast_WM_v_MM, n = 10 )
## Coefficient: 1*groupMM -1*groupWM
## logFC logCPM LR PValue FDR
## 20 -10.731434 9.542239 127.96094 1.144736e-29 8.997626e-27
## 161 7.534973 8.704473 120.67621 4.498784e-28 1.768022e-25
## 357 -9.913269 8.588710 117.44232 2.296829e-27 6.017693e-25
## 294 -9.972623 8.788746 115.82009 5.204318e-27 1.022649e-24
## 322 4.842254 9.660051 91.80901 9.545870e-22 1.500611e-19
## 361 -9.337122 8.127434 89.14848 3.662643e-21 4.798062e-19
## 320 5.477270 7.644758 80.66844 2.669516e-19 2.997485e-17
## 445 -9.004428 7.739482 79.34965 5.203537e-19 5.047468e-17
## 198 -9.314755 8.210388 79.14224 5.779544e-19 5.047468e-17
## 335 -8.840161 7.428282 74.79878 5.212181e-18 4.096774e-16
Even better, one can access the top genes based on a particular measure, say BH FDR < 0.05 using the decideTestsDGE function. Here we look at the top genes in contrast_WM_v_MM with a BH FDR < 0.05 and plot them with a “Smear” plot, equivalent to a microarray MA plot.
dt_significant <- decideTestsDGE( contrast_WM_v_MM, adjust.method="BH", p.value=0.05)
vctr_names_sig <- rownames( cur_DGELIST )[ as.logical( dt_significant )]
plotSmear( contrast_WM_v_MM, de.tags = vctr_names_sig )
abline( h = c( -2, 2 ), col = "blue")
Another view of the data is using a heatmap that has been heirarchically clustered. This gives you a global view of the samples and how they relate (in the sample dendrogram) but also highlights genes that may be driving the differences between sample groups. To keep this figure small for the tutorial, I am going to look at the top 30 genes from each contrast. One could do other things like look at the all the genes that have a BH FDR < 0.05 in atleast one contrast. This would be more reasonable in a study.
vctr_names_top <- rownames( topTags( contrast_WW_v_WM, n = 30 ) )
vctr_names_top <- c( vctr_names_top, rownames( topTags( contrast_WW_v_MM, n = 30 ) ) )
vctr_names_top <- unique( c( vctr_names_top, rownames( topTags( contrast_WM_v_MM, n = 30 ) ) ) )
vctr_sig <- as.logical( decideTestsDGE( contrast_WW_v_WM, adjust.method="BH", p.value=0.000005) )
vctr_sig <- vctr_sig | as.logical( decideTestsDGE( contrast_WW_v_MM, adjust.method="BH", p.value=0.000005) )
vctr_sig <- vctr_sig | as.logical( decideTestsDGE( contrast_WM_v_MM, adjust.method="BH", p.value=0.000005) )
vctr_names_hcl <- rownames( cur_DGELIST )[ vctr_sig ]
length( vctr_names_hcl )
## [1] 128
head( vctr_names_hcl )
## [1] "5" "17" "19" "20" "48" "55"
Now we perform hierarchical clustering to visualize samples and genes as groups. Functions for heatmaps and clustering are very flexible allowing many distance metrics. This is a simple start using defaults. If interested in advanced plotting, please look in the references for a suggested script which makes very flexible and beautiful heatmaps.
# Matrix of values
mtrx_significant <- cur_DGELIST$counts[ vctr_names_top, ]
# Colors for the groups
vctr_colors = as.factor( c( "black", "red", "green" ))
vctr_colors
## [1] black red green
## Levels: black green red
vctr_sample_colors <- as.character( vctr_colors[ as.numeric( cur_DGELIST$samples$group ) ] )
vctr_sample_colors
## [1] "black" "black" "red" "red" "green" "green"
# Heatmap
heatmap.2( log2( mtrx_significant + 1 ), ColSideColors=vctr_sample_colors, key=TRUE, trace="none", col=heat.colors(200), scale="row" )
What if I have more than 3 groups? I have 4 groups!
No problem! Make a design matrix based your sample order and groups and continue as above including contrasts involving your group4 as interested.
# Make groups
group_with_4 <- factor( c( 4, 2, 1, 3, 1, 4, 2, 3 ))
# Make design matrix
design_with_4 <- model.matrix( ~0+group_with_4 )
# Estimate dispersion
cur_DGELIST <- estimateGLMCommonDisp( cur_DGELIST, design_with_4 )
cur_DGELIST <- estimateGLMTrendedDisp( cur_DGELIST, design_with_4 )
cur_DGELIST <- estimateGLMTagwiseDisp( cur_DDGELIST, design_with_4 )
# Fit counts to model
fit_with_4 <- glmFit( cur_DGELIST, design_with_4 )
# Make contrasts
contrast_1_v_3 <- glmLRT( fit_with_4, contrast=makeContrasts( group3-group1, levels=design_with_4 ) )
contrast_2_v_4 <- glmLRT( fit_with_4, contrast=makeContrasts( group4-group2, levels=design_with_4 ) )
# See the top genes
topTags( contrast_1_v_3 )
topTags( contrast_2_v_4 )
What if I want to combine some of my groups in some comparisons and not others?
No problem! Let’s say you have 4 groups and you want to compare: 1. 2 vs 1 (reference) 2. 3 and 4 combined vs 2 (reference) 3. 1 and 2 combined vs 3 and 4 combined (reference). You can do it!
To combine groups, we use the following notation ( group + group ) / 2, this gives you the group’s average affect. Repectively, the above contrasts would be:
# Contrast 2 v 1
contrast_2_v_1 <- glmLRT( fit_with_4, contrast=makeContrasts( group2-group1, levels=design_with_4 ) )
# Contrast ( 3 and 4 ) v 2
contrast_3and4_v_2 <- glmLRT( fit_with_4, contrast=makeContrasts( ( group3 + group4 )/2 - group2, levels=design_with_4 ) )
# Contrast ( 1 and 2 ) v ( 3 and 4 )
contrast_1and2_v_3and4 <- glmLRT( fit_with_4, contrast=makeContrasts( ( group1 + group2 )/2-( group3 + group4 )/2, levels=design_with_4 ) )
What about differences between differences?
Sure, complicated but sure :-)
Here is how:
makeContrasts( (group1 - group2) - (group3 - group4), levels=design_with_4 )
Although complex at first, this can be useful. Imagine if you had two groups at two different time points and wanted to know how they differed over time. You contrast the difference in how each group changed over time.
makeContrasts(( group2_minute1 - group2_minute0) - ( group1_minute1 - group1_minute0 ), levels=design_with_4 )
Here we have how did each group differ in change from minute one to minute two.
Aren’t GLM’s awesome :-)
What if I want interaction effects?
No problem! You can do it, but that is beyond the scope of this tutorial. You’ll need to build your design matrix slightly different and ask for the contrast in a different way.
This tutorial relied on the following resources-
edgeR Manual
http://www.bioconductor.org/packages/release/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf
edgeR Tutorial
https://web.stanford.edu/class/bios221/labs/rnaseq/lab_4_rnaseq.html
Beautiful Heatmap Plots
Incase it is needed, this is the session info that built this document.
sessionInfo()
## R version 3.1.1 (2014-07-10)
## Platform: x86_64-apple-darwin13.1.0 (64-bit)
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] splines stats4 parallel stats graphics grDevices utils
## [8] datasets methods base
##
## other attached packages:
## [1] gplots_2.17.0 baySeq_2.0.50 abind_1.4-3
## [4] GenomicRanges_1.18.4 GenomeInfoDb_1.2.5 IRanges_2.0.1
## [7] S4Vectors_0.4.0 BiocGenerics_0.12.1 edgeR_3.8.6
## [10] limma_3.22.7
##
## loaded via a namespace (and not attached):
## [1] bitops_1.0-6 caTools_1.17.1 digest_0.6.8
## [4] evaluate_0.7 formatR_1.2 gdata_2.16.1
## [7] gtools_3.4.2 htmltools_0.2.6 KernSmooth_2.23-14
## [10] knitr_1.10 magrittr_1.5 rmarkdown_0.5.1
## [13] stringi_0.4-1 stringr_1.0.0 tools_3.1.1
## [16] XVector_0.6.0 yaml_2.1.13