This is a simple tutorial for DEGSEq for BIO389D. DEGseq is an R package to identify differentially expressed genes from RNA-Seq data.
For installation see:
https://bioconductor.org/packages/release/bioc/html/DEGseq.html.
To install this package, start R and enter:
## try http:// if https:// URLs are not supported
#source("https://bioconductor.org/biocLite.R")
#biocLite("DEGseq")
Let’s start with a simple data. This is the first 5000 lines of the supplementary file for Marioni et al. (2008) and we have count of reads mapped to genes for 14 lanes (7 lanes for kidney and 7 lanes for liver) of human.
library(DEGseq)
geneExpFile <- system.file("extdata", "GeneExpExample5000.txt", package="DEGseq")
geneExpMatrix1 <- readGeneExp(file=geneExpFile, geneCol=1, valCol=c(7,9,12,15,18))
geneExpMatrix2 <- readGeneExp(file=geneExpFile, geneCol=1, valCol=c(8,10,11,13,16))
head(geneExpMatrix1)
## EnsemblGeneID R1L1Kidney R1L3Kidney R1L7Kidney
## ENSG00000146556 "ENSG00000146556" "0" "0" "0"
## ENSG00000197194 "ENSG00000197194" "0" "0" "0"
## ENSG00000197490 "ENSG00000197490" "0" "0" "0"
## ENSG00000205292 "ENSG00000205292" "0" "0" "0"
## ENSG00000177693 "ENSG00000177693" "0" "0" "0"
## ENSG00000209338 "ENSG00000209338" "0" "0" "0"
## R2L2Kidney R2L6Kidney
## ENSG00000146556 "0" "0"
## ENSG00000197194 "0" "0"
## ENSG00000197490 "0" "0"
## ENSG00000205292 "0" "0"
## ENSG00000177693 "0" "0"
## ENSG00000209338 "0" "0"
head(geneExpMatrix2)
## EnsemblGeneID R1L2Liver R1L4Liver R1L6Liver R1L8Liver
## ENSG00000146556 "ENSG00000146556" "0" "0" "0" "0"
## ENSG00000197194 "ENSG00000197194" "0" "0" "0" "0"
## ENSG00000197490 "ENSG00000197490" "0" "0" "0" "0"
## ENSG00000205292 "ENSG00000205292" "0" "0" "0" "0"
## ENSG00000177693 "ENSG00000177693" "0" "0" "0" "0"
## ENSG00000209338 "ENSG00000209338" "0" "0" "0" "0"
## R2L3Liver
## ENSG00000146556 "0"
## ENSG00000197194 "0"
## ENSG00000197490 "0"
## ENSG00000205292 "0"
## ENSG00000177693 "0"
## ENSG00000209338 "0"
For understanding of experimental design, biological and technical replicates you can look into this tutorial.
Let’s boxplot on each replicates of two tissue to see the distribution of count data in log2 scale
library(reshape2)
library(ggplot2)
geneExpMatrixMerge<-merge(geneExpMatrix1,geneExpMatrix2,by="EnsemblGeneID")
df <- melt(geneExpMatrixMerge,id.vars = "EnsemblGeneID",variable.name = "Samples",value.name = "count")
df[, "count"] <- as.numeric(df[, "count"])
head(df)
## EnsemblGeneID Samples count
## 1 ENSG00000000457 R1L1Kidney 22
## 2 ENSG00000000460 R1L1Kidney 9
## 3 ENSG00000000938 R1L1Kidney 14
## 4 ENSG00000000971 R1L1Kidney 28
## 5 ENSG00000001460 R1L1Kidney 24
## 6 ENSG00000001461 R1L1Kidney 23
tail(df)
## EnsemblGeneID Samples count
## 49995 ENSG00000216157 R2L3Liver 0
## 49996 ENSG00000216161 R2L3Liver 0
## 49997 ENSG00000216162 R2L3Liver 0
## 49998 ENSG00000216186 R2L3Liver 0
## 49999 ENSG00000216191 R2L3Liver 0
## 50000 ENSG00000216193 R2L3Liver 0
df$Condition <- substr(df$Samples, 5, length(df$Samples))
head(df)
## EnsemblGeneID Samples count Condition
## 1 ENSG00000000457 R1L1Kidney 22 Kidney
## 2 ENSG00000000460 R1L1Kidney 9 Kidney
## 3 ENSG00000000938 R1L1Kidney 14 Kidney
## 4 ENSG00000000971 R1L1Kidney 28 Kidney
## 5 ENSG00000001460 R1L1Kidney 24 Kidney
## 6 ENSG00000001461 R1L1Kidney 23 Kidney
ggplot(df, aes(x = Samples, y = log2(count), fill = Condition)) + geom_boxplot() + xlab("") +ylab("log2(count)") + scale_fill_manual(values = c("#619CFF", "#F564E3"))
RNA sequencing can be modeled as a random sampling process, in which each read is sampled independently and uniformly from every possible nucleotide in the sample.
Under this assumption the number of reads coming from a gene (or transcript isoform) follows a binomial distribution (and could be approximated by a Poisson distribution) (Wang et al,2010).
MA plot is basically log fold change vs. log avegare of a gene for two different samples/replicates.
Let C1 and C2 denote the counts of reads mapped to a specific gene obtained from two samples, with Ci ∼ binomial (ni, pi), i = 1, 2, where ni denotes the total number of mapped reads and pi the probability of a read coming from that gene.
M = log2C1- log2C2, and A = (log2C1 + log2C2)/2.
We will fit a simple random sampling model to get the DEGs and will see the effect. The red line of the MA-plot indicates 4-fold local Standard Deviation of M conditioned on A for two samples.
layout(matrix(c(1,2,3,4,5,6), 3, 2, byrow=TRUE))
par(mar=c(2, 2, 2, 2))
DEGexp(geneExpMatrix1=geneExpMatrix1, geneCol1=1, expCol1=c(2,3,4,5,6), groupLabel1="kidney", geneExpMatrix2=geneExpMatrix2, geneCol2=1, expCol2=c(2,3,4,5,6), groupLabel2="liver", method="MARS")
## Please wait...
## gene id column in geneExpMatrix1 for sample1: 1
## expression value column(s) in geneExpMatrix1: 2 3 4 5 6
## total number of reads uniquely mapped to genome obtained from sample1: 345504 354981 334557 366041 372895
## gene id column in geneExpMatrix2 for sample2: 1
## expression value column(s) in geneExpMatrix2: 2 3 4 5 6
## total number of reads uniquely mapped to genome obtained from sample2: 274430 274486 264999 255041 284205
##
## method to identify differentially expressed genes: MARS
## pValue threshold: 0.001
## output directory: none
##
## The outputDir is not specified!
## Please wait ...
## Identifying differentially expressed genes ...
## Please wait patiently ...
## output ...
## The results can be observed in directory: none
Now we will check how the technical replicates go with each other. Can two technical replicates alone explain the random sampling model? Is there any background noice for two technical replicates?
Let’s try it for Liver ..
layout(matrix(c(1,2,3,4,5,6), 3, 2, byrow=TRUE))
par(mar=c(2, 2, 2, 2))
DEGexp(geneExpMatrix1=geneExpMatrix2, expCol1=2, groupLabel1="LiverR1L2",geneExpMatrix2=geneExpMatrix2, expCol2=3, groupLabel2="LiverR1L4", method="CTR")
## Please wait...
## gene id column in geneExpMatrix1 for sample1: 1
## expression value column(s) in geneExpMatrix1: 2
## total number of reads uniquely mapped to genome obtained from sample1: 274430
## gene id column in geneExpMatrix2 for sample2: 1
## expression value column(s) in geneExpMatrix2: 3
## total number of reads uniquely mapped to genome obtained from sample2: 274486
##
## method to identify differentially expressed genes: CTR
## pValue threshold: 0.001
## output directory: none
##
## The outputDir is not specified!
## Please wait ...
## Identifying differentially expressed genes ...
## Please wait patiently ...
## output ...
## The results can be observed in directory: none
Can we do it for 1st and 3nd expression column of Kidney data?
## Please wait...
## gene id column in geneExpMatrix1 for sample1: 1
## expression value column(s) in geneExpMatrix1: 2
## total number of reads uniquely mapped to genome obtained from sample1: 345504
## gene id column in geneExpMatrix2 for sample2: 1
## expression value column(s) in geneExpMatrix2: 3
## total number of reads uniquely mapped to genome obtained from sample2: 354981
##
## method to identify differentially expressed genes: CTR
## pValue threshold: 0.001
## output directory: none
##
## The outputDir is not specified!
## Please wait ...
## Identifying differentially expressed genes ...
## Please wait patiently ...
## output ...
## The results can be observed in directory: none
For the 2nd case, especially for higher fold, the blue line goes beyond the red one, it can be inferred that the two replicates could not be fully explained by the random sampling model. So, we need to adopt different method to normalize this variance between technical replicates.
When we need to normalize technical replicates ..
layout(matrix(c(1,2,3,4,5,6), 3, 2, byrow=TRUE))
par(mar=c(2, 2, 2, 2))
DEGexp(geneExpMatrix1=geneExpMatrix1, expCol1=2, groupLabel1="kidneyR1L1",geneExpMatrix2=geneExpMatrix2, expCol2=2, groupLabel2="liverR1L2",replicateExpMatrix1=geneExpMatrix1, expColR1=2, replicateLabel1="kidneyR1L1",replicateExpMatrix2=geneExpMatrix1, expColR2=3, replicateLabel2="kidneyR1L3", method="MATR")
## Please wait...
## gene id column in geneExpMatrix1 for sample1: 1
## expression value column(s) in geneExpMatrix1: 2
## total number of reads uniquely mapped to genome obtained from sample1: 345504
## gene id column in geneExpMatrix2 for sample2: 1
## expression value column(s) in geneExpMatrix2: 2
## total number of reads uniquely mapped to genome obtained from sample2: 274430
##
## gene id column in replicateExpMatrix1: 1
## expression value column(s) in replicateExpMatrix1: 2
## total number of reads uniquely mapped to genome obtained from replicate1: 345504
##
## gene id column in replicateExpMatrix2: 1
## expression value column(s) in replicateExpMatrix2: 3
## total number of reads uniquely mapped to genome obtained from replicate2: 354981
##
## method to identify differentially expressed genes: MATR
## pValue threshold: 0.001
## output directory: none
##
## The outputDir is not specified!
## Please wait ...
## Identifying differentially expressed genes ...
## Please wait patiently ...
## output ...
## The results can be observed in directory: none
Assuming that the technical replicates don’t have noice.
outputDir <- file.path(getwd(), "DEGexpExample")
layout(matrix(c(1,2,3,4,5,6), 3, 2, byrow=TRUE))
par(mar=c(2, 2, 2, 2))
DEGexp(geneExpMatrix1=geneExpMatrix1, geneCol1=1, expCol1=c(2,3,4,5,6), groupLabel1="kidney", geneExpMatrix2=geneExpMatrix2, geneCol2=1, expCol2=c(2,3,4,5,6), groupLabel2="liver", method="LRT",outputDir = outputDir)
## Please wait...
## gene id column in geneExpMatrix1 for sample1: 1
## expression value column(s) in geneExpMatrix1: 2 3 4 5 6
## total number of reads uniquely mapped to genome obtained from sample1: 345504 354981 334557 366041 372895
## gene id column in geneExpMatrix2 for sample2: 1
## expression value column(s) in geneExpMatrix2: 2 3 4 5 6
## total number of reads uniquely mapped to genome obtained from sample2: 274430 274486 264999 255041 284205
##
## method to identify differentially expressed genes: LRT
## pValue threshold: 0.001
## output directory: /home/tahia/data/UT/ddRAD/riceddRAD/n_25_c3/centi_2.5/GATK/rec_freq_006/Win_sld/DEGexpExample
##
## Please wait ...
## Identifying differentially expressed genes ...
## Please wait patiently ...
## output ...
##
## Done ...
## The results can be observed in directory: /home/tahia/data/UT/ddRAD/riceddRAD/n_25_c3/centi_2.5/GATK/rec_freq_006/Win_sld/DEGexpExample
If we mention output directory for DEGexp function, then it will write the results in that directory and we can use them to get DEGs in different fold range or direction (up/downregulated). The result is actually a list of all genes with their test statistics. The last column indicate it’s significance: TRUE/FALSE (p-value < 0.001).
result<-read.table(file=file.path(outputDir,"output_score.txt"),header = T,sep = "\t")
colnames(result)<-c("GeneNames", "value1", "value2", "log2_Fold_change", "log2_Fold_change_normalized","pval","qval1","qval2","Signature")
head(result)
## GeneNames value1 value2 log2_Fold_change
## 1 ENSG00000209350 27 161 -2.5760294
## 2 ENSG00000212679 620 746 -0.2669074
## 3 ENSG00000212678 64591 44870 0.5255820
## 4 ENSG00000188157 2108 259 3.0248509
## 5 ENSG00000008130 425 836 -0.9760401
## 6 ENSG00000078369 4258 1661 1.3581239
## log2_Fold_change_normalized pval qval1 qval2 Signature
## 1 -2.9666840 0 0 0 TRUE
## 2 -0.6575620 0 0 0 TRUE
## 3 0.1349274 0 0 0 TRUE
## 4 2.6341962 0 0 0 TRUE
## 5 -1.3666947 0 0 0 TRUE
## 6 0.9674693 0 0 0 TRUE
Let’s count the number of significant DEGs:
length(which(result$Signature=="TRUE"))
## [1] 1670
Now pull out only the significant genes from the result:
result_sig<-result[which(result$Signature=="TRUE"),]
Let’s make a list of significant genes that are upregulated for Kidney compare to Liver:
result_sig_up_kidney<-result[which(result$Signature=="TRUE" & result$log2_Fold_change_normalized > 0),]
dim(result_sig_up_kidney)
## [1] 1097 9
Now do it for the genes that are downregulated for Kidney compare to Liver. Count the number of genes in this list.
Now make two lists of top 50 up and top 50 down regulated genes based on their log2_Fold_chnge_normalized. You can use order function to do that. Remember we need a descending order for upregulated genes.
For p-value correction use p.adjust function with “fdr” method and plot both the values to see the difference.
HeatMap is a graphical representation ofa matrix in color.Let’s combine both the list by row and plot a heatmap.So here the two columns are fold normalized values of Kidney and Liver samples and the rows are individual DEGs.
Here we will use a red/yellow palette where red indicate down regulated and yellow indicates up regulated.
result_sig_down_kidney<-result[which(result$Signature=="TRUE" & result$log2_Fold_change_normalized < 0),]
result_sig_up_kidney_sorted<-result_sig_up_kidney[order(-result_sig_up_kidney$log2_Fold_change_normalized),]
result_sig_down_kidney_sorted<-result_sig_down_kidney[order(result_sig_down_kidney$log2_Fold_change_normalized),]
result_top100<-rbind( result_sig_up_kidney_sorted[1:50,],result_sig_down_kidney_sorted[1:50,])
cols<-colorRampPalette(c("red","white","yellow"))(256)
heatmap(as.matrix(result_top100[,c(2,3)]),cexRow=0.3,cexCol = 1,labRow = result_top100$GeneNames,keep.dendro = F,Colv = NA,labCol = c("Kidney","Liver"),col=cols)
So here we can see some gene clusters depending on their expression pattern.
Let’s fetch the raw count data from the package direcoty. you can also see the file with text editor. Here you will find 14 samples, 7 from each tissue.
geneExpFile <- system.file("extdata", "GeneExpExample5000.txt", package="DEGseq")
geneExpMat<-read.table(file=geneExpFile,header = T,sep = "\t",na.strings="")
colnames(geneExpMat)<-c("EnsemblGeneID", "Chr", "GeneStart", "GeneEnd", "Status", "ExternalID", "KidneyR1L1","LiverR1L2","KidneyR1L3", "LiverR1L4", "LiverR1L6", "KidneyR1L7", "LiverR1L8", "LiverR2L1", "KidneyR2L2", "LiverR2L3","KidneyR2L4", "KidneyR2L6", "LiverR2L7", "KidneyR2L8")
Now can we pull out the data for those 100 selected genes from this file? Can we make a heatmap for this 100 genes ploting all 14 samples? Use the same colour palette for it and this time we also want a dendogram for x-axis.
Why the colors are different here?