In the study of human or model organism gene expression data, one will often be looking to find which genes are differentially expressed between two or more conditions.
If there is a particular single gene or short list of genes of interest, this data will be simple enough to work with.
Often though, you will have too many genes to look at to be able to just review them all manually.
So by finding what broader categories (like pathways) that these genes fit into, the data becomes much easier to interpret.
MSigDB, the Molecular Signatures Database, has organized a number of sets of genes to use for this type of analysis.
The goal for our larger project is to explore how these gene sets and genes are connected using network analysis.
But first, we will want to create a binary gene set x gene matrix, where 0 is absent and 1 is present, for any given gene to gene set.
We will use the GO (gene ontology), MF (molecular function) gene sets here.
First we need to download the data. The data is available for free from the website.
Unfortunately, there are no links that work with download.file or similar commands like wget/curl.
But you can paste the following link into a browser to activate download:
Once you have done that and the file is in the current working directory, read the file into R.
Read in the data, making each gene set be an item in a list, and that item being a character vector.
mf <- readLines("c5.mf.v7.0.symbols.gmt")
mf <- strsplit(mf,"\t")
Look at the first two items in the newly formed list.
mf[[1]]
## [1] "GO_CYCLIC_NUCLEOTIDE_PHOSPHODIESTERASE_ACTIVITY"
## [2] "http://www.gsea-msigdb.org/gsea/msigdb/cards/GO_CYCLIC_NUCLEOTIDE_PHOSPHODIESTERASE_ACTIVITY"
## [3] "PDE10A"
## [4] "CNP"
## [5] "PDE7B"
## [6] "PDE11A"
## [7] "PDE1A"
## [8] "PDE1C"
## [9] "PDE2A"
## [10] "PDE3A"
## [11] "PDE3B"
## [12] "PDE4A"
## [13] "PDE4B"
## [14] "AC005759.1"
## [15] "PDE4D"
## [16] "PDE6A"
## [17] "PDE6C"
## [18] "PDE6D"
## [19] "AC139530.1"
## [20] "AC020613.1"
## [21] "PDE7A"
## [22] "PDE8A"
## [23] "PDE9A"
## [24] "PDE1B"
## [25] "PDE6B"
## [26] "PRKAR1A"
## [27] "PRKAR1B"
## [28] "PRKAR2A"
## [29] "PRKAR2B"
## [30] "PDE8B"
## [31] "PDE5A"
mf[[2]]
## [1] "GO_PROTEIN_SERINE_THREONINE_TYROSINE_KINASE_ACTIVITY"
## [2] "http://www.gsea-msigdb.org/gsea/msigdb/cards/GO_PROTEIN_SERINE_THREONINE_TYROSINE_KINASE_ACTIVITY"
## [3] "TNK2"
## [4] "TESK2"
## [5] "CLK1"
## [6] "CLK2"
## [7] "CLK3"
## [8] "LRRK2"
## [9] "MAPK14"
## [10] "DYRK1A"
## [11] "AKT1"
## [12] "DSTYK"
## [13] "MAP3K9"
## [14] "PAK3"
## [15] "PRKAA2"
## [16] "PRKACA"
## [17] "PRKCG"
## [18] "MAPK1"
## [19] "MAPK3"
## [20] "MAPK9"
## [21] "MAPK10"
## [22] "MAP2K1"
## [23] "MAP2K2"
## [24] "MAP2K3"
## [25] "MAP2K5"
## [26] "MAP2K6"
## [27] "MAP2K7"
## [28] "CLK4"
## [29] "RPS6KA1"
## [30] "RPS6KA2"
## [31] "RPS6KB1"
## [32] "MAP2K4"
## [33] "SGK1"
## [34] "AURKA"
## [35] "AURKC"
## [36] "TESK1"
## [37] "TTK"
## [38] "MAPKAPK3"
## [39] "DYRK3"
## [40] "DYRK2"
## [41] "MAPKAPK5"
## [42] "DYRK4"
## [43] "DYRK1B"
## [44] "AURKB"
## [45] "ACVR2B"
Looks like the first item in every vector is the name of the gene set.
The second item is a web link, which we do not need.
Then items 3 and onward are the gene names.
Name each item in the list by the gene set name, then remove the first two values in the vector.
The following code is borrowed from the qusage package (http://www.bioconductor.org/packages/release/bioc/html/qusage.html).
names(mf) = sapply(mf, "[", 1)
mf = lapply(mf, "[", -1:-2)
How many genes are found in each gene set?
gene_set_sizes <- sapply(mf,length)
print("Range of gene set sizes:")
## [1] "Range of gene set sizes:"
range(gene_set_sizes)
## [1] 5 1922
hist(gene_set_sizes,xlab="Gene set size",ylab="# gene sets",main="",labels=TRUE,breaks=seq(from=0,to=2000,by=50))
hist(gene_set_sizes[gene_set_sizes < 50],xlab="Gene set size",ylab="# gene sets",labels=TRUE,main="Show only gene sets with size < 50")
Typically in analyses that use these kinds of databases, very small gene sets are not of interest.
So, let’s remove gene sets with fewer than 10 genes.
mf <- mf[gene_set_sizes >= 10]
Another question is how many gene sets each gene is found in.
gene_sets_per_gene <- data.frame(table(unlist(mf)))
gene_sets_per_gene$Var1 <- as.vector(gene_sets_per_gene$Var1)
print("Range of gene sets per gene:")
## [1] "Range of gene sets per gene:"
range(gene_sets_per_gene$Freq)
## [1] 1 55
hist(gene_sets_per_gene$Freq,xlab="# gene sets",ylab="# genes",main="",labels=TRUE)
print("Number of genes in the dataset:")
## [1] "Number of genes in the dataset:"
nrow(gene_sets_per_gene)
## [1] 15594
print("Number of these found in only one gene set:")
## [1] "Number of these found in only one gene set:"
length(which(gene_sets_per_gene$Freq == 1))
## [1] 1851
Even the most ubiquitous genes are only found in a relatively small subset of gene sets.
On the flip side, though, there are a large number of genes found in only one gene set.
These are not going to be interesting for a network analysis and should be removed.
genes_in_two_or_more_sets <- gene_sets_per_gene$Var1[gene_sets_per_gene$Freq > 1]
mf <- lapply(mf,function(x){x[x %in% genes_in_two_or_more_sets]})
Ready to create binary matrix.
For this, we will use mtabulate from qdapTools.
No need to download the package, just going to use the code for this one function.
Taken from here: https://github.com/trinker/qdapTools/blob/master/R/mtabulate.R
This will make each gene set be a row and each gene be a column.
mtabulate <- function(vects) {
lev <- sort(unique(unlist(vects)))
dat <- do.call(rbind, lapply(vects, function(x, lev){
tabulate(factor(x, levels = lev, ordered = TRUE),
nbins = length(lev))}, lev = lev))
colnames(dat) <- sort(lev)
data.frame(dat, check.names = FALSE)
}
binary_matrix <- mtabulate(mf)
Output this binary matrix to a CSV file.
write.table(data.frame(Gene_set = rownames(binary_matrix),binary_matrix,check.names=FALSE),
file="GO_MF_binary_matrix.csv",
row.names=FALSE,col.names=TRUE,quote=TRUE,sep=",")