DEGSeq

This is a simple tutorial for DEGSEq for BIO389D. DEGseq is an R package to identify differentially expressed genes from RNA-Seq data.

For installation see:

https://bioconductor.org/packages/release/bioc/html/DEGseq.html.

Installation

To install this package, start R and enter:

## try http:// if https:// URLs are not supported
#source("https://bioconductor.org/biocLite.R")
#biocLite("DEGseq")

Let’s start with a simple data. This is the first 5000 lines of the supplementary file for Marioni et al. (2008) and we have count of reads mapped to genes for 14 lanes (7 lanes for kidney and 7 lanes for liver) of human.

library(DEGseq)
geneExpFile <- system.file("extdata", "GeneExpExample5000.txt", package="DEGseq")
geneExpMatrix1 <- readGeneExp(file=geneExpFile, geneCol=1, valCol=c(7,9,12,15,18))
geneExpMatrix2 <- readGeneExp(file=geneExpFile, geneCol=1, valCol=c(8,10,11,13,16))
head(geneExpMatrix1)
##                 EnsemblGeneID     R1L1Kidney R1L3Kidney R1L7Kidney
## ENSG00000146556 "ENSG00000146556" "0"        "0"        "0"       
## ENSG00000197194 "ENSG00000197194" "0"        "0"        "0"       
## ENSG00000197490 "ENSG00000197490" "0"        "0"        "0"       
## ENSG00000205292 "ENSG00000205292" "0"        "0"        "0"       
## ENSG00000177693 "ENSG00000177693" "0"        "0"        "0"       
## ENSG00000209338 "ENSG00000209338" "0"        "0"        "0"       
##                 R2L2Kidney R2L6Kidney
## ENSG00000146556 "0"        "0"       
## ENSG00000197194 "0"        "0"       
## ENSG00000197490 "0"        "0"       
## ENSG00000205292 "0"        "0"       
## ENSG00000177693 "0"        "0"       
## ENSG00000209338 "0"        "0"
head(geneExpMatrix2)
##                 EnsemblGeneID     R1L2Liver R1L4Liver R1L6Liver R1L8Liver
## ENSG00000146556 "ENSG00000146556" "0"       "0"       "0"       "0"      
## ENSG00000197194 "ENSG00000197194" "0"       "0"       "0"       "0"      
## ENSG00000197490 "ENSG00000197490" "0"       "0"       "0"       "0"      
## ENSG00000205292 "ENSG00000205292" "0"       "0"       "0"       "0"      
## ENSG00000177693 "ENSG00000177693" "0"       "0"       "0"       "0"      
## ENSG00000209338 "ENSG00000209338" "0"       "0"       "0"       "0"      
##                 R2L3Liver
## ENSG00000146556 "0"      
## ENSG00000197194 "0"      
## ENSG00000197490 "0"      
## ENSG00000205292 "0"      
## ENSG00000177693 "0"      
## ENSG00000209338 "0"

For understanding of experimental design, biological and technical replicates you can look into this tutorial.

http://www.nathalievilla.org/doc/pdf/tutorial-rnaseq.pdf

Boxplot

Let’s boxplot on each replicates of two tissue to see the distribution of count data in log2 scale

library(reshape2)
library(ggplot2)
geneExpMatrixMerge<-merge(geneExpMatrix1,geneExpMatrix2,by="EnsemblGeneID")
df <- melt(geneExpMatrixMerge,id.vars = "EnsemblGeneID",variable.name = "Samples",value.name = "count")
df[, "count"] <- as.numeric(df[, "count"])
head(df)
##     EnsemblGeneID    Samples count
## 1 ENSG00000000457 R1L1Kidney    22
## 2 ENSG00000000460 R1L1Kidney     9
## 3 ENSG00000000938 R1L1Kidney    14
## 4 ENSG00000000971 R1L1Kidney    28
## 5 ENSG00000001460 R1L1Kidney    24
## 6 ENSG00000001461 R1L1Kidney    23
tail(df)
##         EnsemblGeneID   Samples count
## 49995 ENSG00000216157 R2L3Liver     0
## 49996 ENSG00000216161 R2L3Liver     0
## 49997 ENSG00000216162 R2L3Liver     0
## 49998 ENSG00000216186 R2L3Liver     0
## 49999 ENSG00000216191 R2L3Liver     0
## 50000 ENSG00000216193 R2L3Liver     0
df$Condition <- substr(df$Samples, 5, length(df$Samples))
head(df)
##     EnsemblGeneID    Samples count Condition
## 1 ENSG00000000457 R1L1Kidney    22    Kidney
## 2 ENSG00000000460 R1L1Kidney     9    Kidney
## 3 ENSG00000000938 R1L1Kidney    14    Kidney
## 4 ENSG00000000971 R1L1Kidney    28    Kidney
## 5 ENSG00000001460 R1L1Kidney    24    Kidney
## 6 ENSG00000001461 R1L1Kidney    23    Kidney
ggplot(df, aes(x = Samples, y = log2(count), fill = Condition)) + geom_boxplot() + xlab("") +ylab("log2(count)") + scale_fill_manual(values = c("#619CFF", "#F564E3"))

Assumption

RNA sequencing can be modeled as a random sampling process, in which each read is sampled independently and uniformly from every possible nucleotide in the sample.

Model

Under this assumption the number of reads coming from a gene (or transcript isoform) follows a binomial distribution (and could be approximated by a Poisson distribution) (Wang et al,2010).

MA-plot-based method

MA plot is basically log fold change vs. log avegare of a gene for two different samples/replicates.

Let C1 and C2 denote the counts of reads mapped to a specific gene obtained from two samples, with Ci ∼ binomial (ni, pi), i = 1, 2, where ni denotes the total number of mapped reads and pi the probability of a read coming from that gene.

  M = log2C1- log2C2, and A = (log2C1 + log2C2)/2.
  

Differentially Expressed Gene Analysis

Step 1:

We will fit a simple random sampling model to get the DEGs and will see the effect. The red line of the MA-plot indicates 4-fold local Standard Deviation of M conditioned on A for two samples.

Method : MARS (MA-plot-based method with Random Sampling model )

layout(matrix(c(1,2,3,4,5,6), 3, 2, byrow=TRUE))
par(mar=c(2, 2, 2, 2))
DEGexp(geneExpMatrix1=geneExpMatrix1, geneCol1=1, expCol1=c(2,3,4,5,6), groupLabel1="kidney", geneExpMatrix2=geneExpMatrix2, geneCol2=1, expCol2=c(2,3,4,5,6), groupLabel2="liver", method="MARS")
## Please wait...
## gene id column in geneExpMatrix1 for sample1:  1 
## expression value column(s) in geneExpMatrix1: 2 3 4 5 6 
## total number of reads uniquely mapped to genome obtained from sample1: 345504 354981 334557 366041 372895 
## gene id column in geneExpMatrix2 for sample2:  1 
## expression value column(s) in geneExpMatrix2: 2 3 4 5 6 
## total number of reads uniquely mapped to genome obtained from sample2: 274430 274486 264999 255041 284205 
## 
## method to identify differentially expressed genes:  MARS 
## pValue threshold: 0.001 
## output directory: none 
## 
## The outputDir is not specified! 
## Please wait ...
## Identifying differentially expressed genes ...
## Please wait patiently ...
## output ...
## The results can be observed in directory:  none

Step 2:

Now we will check how the technical replicates go with each other. Can two technical replicates alone explain the random sampling model? Is there any background noice for two technical replicates?

Method : CTR (Check whether the variation between Technical Replicates)

Let’s try it for Liver ..

layout(matrix(c(1,2,3,4,5,6), 3, 2, byrow=TRUE))
par(mar=c(2, 2, 2, 2))
DEGexp(geneExpMatrix1=geneExpMatrix2, expCol1=2, groupLabel1="LiverR1L2",geneExpMatrix2=geneExpMatrix2, expCol2=3, groupLabel2="LiverR1L4", method="CTR")
## Please wait...
## gene id column in geneExpMatrix1 for sample1:  1 
## expression value column(s) in geneExpMatrix1: 2 
## total number of reads uniquely mapped to genome obtained from sample1: 274430 
## gene id column in geneExpMatrix2 for sample2:  1 
## expression value column(s) in geneExpMatrix2: 3 
## total number of reads uniquely mapped to genome obtained from sample2: 274486 
## 
## method to identify differentially expressed genes:  CTR 
## pValue threshold: 0.001 
## output directory: none 
## 
## The outputDir is not specified! 
## Please wait ...
## Identifying differentially expressed genes ...
## Please wait patiently ...
## output ...
## The results can be observed in directory:  none

Task 1:

Can we do it for 1st and 3nd expression column of Kidney data?

## Please wait...
## gene id column in geneExpMatrix1 for sample1:  1 
## expression value column(s) in geneExpMatrix1: 2 
## total number of reads uniquely mapped to genome obtained from sample1: 345504 
## gene id column in geneExpMatrix2 for sample2:  1 
## expression value column(s) in geneExpMatrix2: 3 
## total number of reads uniquely mapped to genome obtained from sample2: 354981 
## 
## method to identify differentially expressed genes:  CTR 
## pValue threshold: 0.001 
## output directory: none 
## 
## The outputDir is not specified! 
## Please wait ...
## Identifying differentially expressed genes ...
## Please wait patiently ...
## output ...
## The results can be observed in directory:  none

For the 2nd case, especially for higher fold, the blue line goes beyond the red one, it can be inferred that the two replicates could not be fully explained by the random sampling model. So, we need to adopt different method to normalize this variance between technical replicates.

Step 3A:

When we need to normalize technical replicates ..

Method : MATR (MA-plot-based method with Technical Replicates)

layout(matrix(c(1,2,3,4,5,6), 3, 2, byrow=TRUE))
par(mar=c(2, 2, 2, 2))
DEGexp(geneExpMatrix1=geneExpMatrix1, expCol1=2, groupLabel1="kidneyR1L1",geneExpMatrix2=geneExpMatrix2, expCol2=2, groupLabel2="liverR1L2",replicateExpMatrix1=geneExpMatrix1, expColR1=2, replicateLabel1="kidneyR1L1",replicateExpMatrix2=geneExpMatrix1, expColR2=3, replicateLabel2="kidneyR1L3", method="MATR")
## Please wait...
## gene id column in geneExpMatrix1 for sample1:  1 
## expression value column(s) in geneExpMatrix1: 2 
## total number of reads uniquely mapped to genome obtained from sample1: 345504 
## gene id column in geneExpMatrix2 for sample2:  1 
## expression value column(s) in geneExpMatrix2: 2 
## total number of reads uniquely mapped to genome obtained from sample2: 274430 
## 
## gene id column in replicateExpMatrix1:  1 
## expression value column(s) in replicateExpMatrix1: 2 
## total number of reads uniquely mapped to genome obtained from replicate1: 345504 
## 
## gene id column in replicateExpMatrix2:  1 
## expression value column(s) in replicateExpMatrix2: 3 
## total number of reads uniquely mapped to genome obtained from replicate2: 354981 
## 
## method to identify differentially expressed genes:  MATR 
## pValue threshold: 0.001 
## output directory: none 
## 
## The outputDir is not specified! 
## Please wait ...
## Identifying differentially expressed genes ...
## Please wait patiently ...

## output ...
## The results can be observed in directory:  none

Step 3B:

Assuming that the technical replicates don’t have noice.

Method : LRT (Likelihood Ratio Test)

outputDir <- file.path(getwd(), "DEGexpExample")
layout(matrix(c(1,2,3,4,5,6), 3, 2, byrow=TRUE))
par(mar=c(2, 2, 2, 2))
DEGexp(geneExpMatrix1=geneExpMatrix1, geneCol1=1, expCol1=c(2,3,4,5,6), groupLabel1="kidney", geneExpMatrix2=geneExpMatrix2, geneCol2=1, expCol2=c(2,3,4,5,6), groupLabel2="liver", method="LRT",outputDir = outputDir)
## Please wait...
## gene id column in geneExpMatrix1 for sample1:  1 
## expression value column(s) in geneExpMatrix1: 2 3 4 5 6 
## total number of reads uniquely mapped to genome obtained from sample1: 345504 354981 334557 366041 372895 
## gene id column in geneExpMatrix2 for sample2:  1 
## expression value column(s) in geneExpMatrix2: 2 3 4 5 6 
## total number of reads uniquely mapped to genome obtained from sample2: 274430 274486 264999 255041 284205 
## 
## method to identify differentially expressed genes:  LRT 
## pValue threshold: 0.001 
## output directory: /home/tahia/data/UT/ddRAD/riceddRAD/n_25_c3/centi_2.5/GATK/rec_freq_006/Win_sld/DEGexpExample 
## 
## Please wait ...
## Identifying differentially expressed genes ...
## Please wait patiently ...
## output ...
## 
## Done ...
## The results can be observed in directory:  /home/tahia/data/UT/ddRAD/riceddRAD/n_25_c3/centi_2.5/GATK/rec_freq_006/Win_sld/DEGexpExample

Working with output DEGs

If we mention output directory for DEGexp function, then it will write the results in that directory and we can use them to get DEGs in different fold range or direction (up/downregulated). The result is actually a list of all genes with their test statistics. The last column indicate it’s significance: TRUE/FALSE (p-value < 0.001).

result<-read.table(file=file.path(outputDir,"output_score.txt"),header = T,sep = "\t")
colnames(result)<-c("GeneNames", "value1", "value2", "log2_Fold_change", "log2_Fold_change_normalized","pval","qval1","qval2","Signature")
head(result)
##         GeneNames value1 value2 log2_Fold_change
## 1 ENSG00000209350     27    161       -2.5760294
## 2 ENSG00000212679    620    746       -0.2669074
## 3 ENSG00000212678  64591  44870        0.5255820
## 4 ENSG00000188157   2108    259        3.0248509
## 5 ENSG00000008130    425    836       -0.9760401
## 6 ENSG00000078369   4258   1661        1.3581239
##   log2_Fold_change_normalized pval qval1 qval2 Signature
## 1                  -2.9666840    0     0     0      TRUE
## 2                  -0.6575620    0     0     0      TRUE
## 3                   0.1349274    0     0     0      TRUE
## 4                   2.6341962    0     0     0      TRUE
## 5                  -1.3666947    0     0     0      TRUE
## 6                   0.9674693    0     0     0      TRUE

Let’s count the number of significant DEGs:

length(which(result$Signature=="TRUE"))
## [1] 1670

Now pull out only the significant genes from the result:

result_sig<-result[which(result$Signature=="TRUE"),]

Let’s make a list of significant genes that are upregulated for Kidney compare to Liver:

result_sig_up_kidney<-result[which(result$Signature=="TRUE" & result$log2_Fold_change_normalized > 0),]
dim(result_sig_up_kidney)
## [1] 1097    9

Task 2:

Now do it for the genes that are downregulated for Kidney compare to Liver. Count the number of genes in this list.

Task 3:

Now make two lists of top 50 up and top 50 down regulated genes based on their log2_Fold_chnge_normalized. You can use order function to do that. Remember we need a descending order for upregulated genes.

Task 4

For p-value correction use p.adjust function with “fdr” method and plot both the values to see the difference.

HeatMap

HeatMap is a graphical representation ofa matrix in color.Let’s combine both the list by row and plot a heatmap.So here the two columns are fold normalized values of Kidney and Liver samples and the rows are individual DEGs.

Here we will use a red/yellow palette where red indicate down regulated and yellow indicates up regulated.

result_sig_down_kidney<-result[which(result$Signature=="TRUE" & result$log2_Fold_change_normalized < 0),]
result_sig_up_kidney_sorted<-result_sig_up_kidney[order(-result_sig_up_kidney$log2_Fold_change_normalized),]
result_sig_down_kidney_sorted<-result_sig_down_kidney[order(result_sig_down_kidney$log2_Fold_change_normalized),]
result_top100<-rbind(  result_sig_up_kidney_sorted[1:50,],result_sig_down_kidney_sorted[1:50,])
cols<-colorRampPalette(c("red","white","yellow"))(256)

heatmap(as.matrix(result_top100[,c(2,3)]),cexRow=0.3,cexCol = 1,labRow = result_top100$GeneNames,keep.dendro = F,Colv = NA,labCol = c("Kidney","Liver"),col=cols)

So here we can see some gene clusters depending on their expression pattern.

Task 4

Let’s fetch the raw count data from the package direcoty. you can also see the file with text editor. Here you will find 14 samples, 7 from each tissue.

geneExpFile <- system.file("extdata", "GeneExpExample5000.txt", package="DEGseq")
geneExpMat<-read.table(file=geneExpFile,header = T,sep = "\t",na.strings="")
colnames(geneExpMat)<-c("EnsemblGeneID", "Chr", "GeneStart", "GeneEnd", "Status",   "ExternalID", "KidneyR1L1","LiverR1L2","KidneyR1L3", "LiverR1L4", "LiverR1L6", "KidneyR1L7", "LiverR1L8", "LiverR2L1", "KidneyR2L2", "LiverR2L3","KidneyR2L4", "KidneyR2L6", "LiverR2L7", "KidneyR2L8")

Now can we pull out the data for those 100 selected genes from this file? Can we make a heatmap for this 100 genes ploting all 14 samples? Use the same colour palette for it and this time we also want a dendogram for x-axis.

Why the colors are different here?