Data processing

Download and read in data

First we need to download the data. The data is available for free from the website.

Unfortunately, there are no links that work with download.file or similar commands like wget/curl.

But you can paste the following link into a browser to activate download:

http://software.broadinstitute.org/gsea/msigdb/download_file.jsp?filePath=/resources/msigdb/7.0/c5.mf.v7.0.symbols.gmt

Once you have done that and the file is in the current working directory, read the file into R.

Read in the data, making each gene set be an item in a list, and that item being a character vector.

mf <- readLines("c5.mf.v7.0.symbols.gmt")
mf <- strsplit(mf,"\t")

Look at the first two items in the newly formed list.

mf[[1]]

##  [1] "GO_CYCLIC_NUCLEOTIDE_PHOSPHODIESTERASE_ACTIVITY"                                             
##  [2] "http://www.gsea-msigdb.org/gsea/msigdb/cards/GO_CYCLIC_NUCLEOTIDE_PHOSPHODIESTERASE_ACTIVITY"
##  [3] "PDE10A"                                                                                      
##  [4] "CNP"                                                                                         
##  [5] "PDE7B"                                                                                       
##  [6] "PDE11A"                                                                                      
##  [7] "PDE1A"                                                                                       
##  [8] "PDE1C"                                                                                       
##  [9] "PDE2A"                                                                                       
## [10] "PDE3A"                                                                                       
## [11] "PDE3B"                                                                                       
## [12] "PDE4A"                                                                                       
## [13] "PDE4B"                                                                                       
## [14] "AC005759.1"                                                                                  
## [15] "PDE4D"                                                                                       
## [16] "PDE6A"                                                                                       
## [17] "PDE6C"                                                                                       
## [18] "PDE6D"                                                                                       
## [19] "AC139530.1"                                                                                  
## [20] "AC020613.1"                                                                                  
## [21] "PDE7A"                                                                                       
## [22] "PDE8A"                                                                                       
## [23] "PDE9A"                                                                                       
## [24] "PDE1B"                                                                                       
## [25] "PDE6B"                                                                                       
## [26] "PRKAR1A"                                                                                     
## [27] "PRKAR1B"                                                                                     
## [28] "PRKAR2A"                                                                                     
## [29] "PRKAR2B"                                                                                     
## [30] "PDE8B"                                                                                       
## [31] "PDE5A"

mf[[2]]

##  [1] "GO_PROTEIN_SERINE_THREONINE_TYROSINE_KINASE_ACTIVITY"                                             
##  [2] "http://www.gsea-msigdb.org/gsea/msigdb/cards/GO_PROTEIN_SERINE_THREONINE_TYROSINE_KINASE_ACTIVITY"
##  [3] "TNK2"                                                                                             
##  [4] "TESK2"                                                                                            
##  [5] "CLK1"                                                                                             
##  [6] "CLK2"                                                                                             
##  [7] "CLK3"                                                                                             
##  [8] "LRRK2"                                                                                            
##  [9] "MAPK14"                                                                                           
## [10] "DYRK1A"                                                                                           
## [11] "AKT1"                                                                                             
## [12] "DSTYK"                                                                                            
## [13] "MAP3K9"                                                                                           
## [14] "PAK3"                                                                                             
## [15] "PRKAA2"                                                                                           
## [16] "PRKACA"                                                                                           
## [17] "PRKCG"                                                                                            
## [18] "MAPK1"                                                                                            
## [19] "MAPK3"                                                                                            
## [20] "MAPK9"                                                                                            
## [21] "MAPK10"                                                                                           
## [22] "MAP2K1"                                                                                           
## [23] "MAP2K2"                                                                                           
## [24] "MAP2K3"                                                                                           
## [25] "MAP2K5"                                                                                           
## [26] "MAP2K6"                                                                                           
## [27] "MAP2K7"                                                                                           
## [28] "CLK4"                                                                                             
## [29] "RPS6KA1"                                                                                          
## [30] "RPS6KA2"                                                                                          
## [31] "RPS6KB1"                                                                                          
## [32] "MAP2K4"                                                                                           
## [33] "SGK1"                                                                                             
## [34] "AURKA"                                                                                            
## [35] "AURKC"                                                                                            
## [36] "TESK1"                                                                                            
## [37] "TTK"                                                                                              
## [38] "MAPKAPK3"                                                                                         
## [39] "DYRK3"                                                                                            
## [40] "DYRK2"                                                                                            
## [41] "MAPKAPK5"                                                                                         
## [42] "DYRK4"                                                                                            
## [43] "DYRK1B"                                                                                           
## [44] "AURKB"                                                                                            
## [45] "ACVR2B"

Looks like the first item in every vector is the name of the gene set.

The second item is a web link, which we do not need.

Then items 3 and onward are the gene names.

Name each item in the list by the gene set name, then remove the first two values in the vector.

The following code is borrowed from the qusage package (http://www.bioconductor.org/packages/release/bioc/html/qusage.html).

names(mf) = sapply(mf, "[", 1)
mf = lapply(mf, "[", -1:-2)

How many genes are found in each gene set?

gene_set_sizes <- sapply(mf,length)
print("Range of gene set sizes:")

## [1] "Range of gene set sizes:"

range(gene_set_sizes)

## [1]    5 1922

hist(gene_set_sizes,xlab="Gene set size",ylab="# gene sets",main="",labels=TRUE,breaks=seq(from=0,to=2000,by=50))

hist(gene_set_sizes[gene_set_sizes < 50],xlab="Gene set size",ylab="# gene sets",labels=TRUE,main="Show only gene sets with size < 50")

Typically in analyses that use these kinds of databases, very small gene sets are not of interest.

So, let’s remove gene sets with fewer than 10 genes.

mf <- mf[gene_set_sizes >= 10]

Another question is how many gene sets each gene is found in.

gene_sets_per_gene <- data.frame(table(unlist(mf)))
gene_sets_per_gene$Var1 <- as.vector(gene_sets_per_gene$Var1)
print("Range of gene sets per gene:")

## [1] "Range of gene sets per gene:"

range(gene_sets_per_gene$Freq)

## [1]  1 55

hist(gene_sets_per_gene$Freq,xlab="# gene sets",ylab="# genes",main="",labels=TRUE)

print("Number of genes in the dataset:")

## [1] "Number of genes in the dataset:"

nrow(gene_sets_per_gene)

## [1] 15594

print("Number of these found in only one gene set:")

## [1] "Number of these found in only one gene set:"

length(which(gene_sets_per_gene$Freq == 1))

## [1] 1851

Even the most ubiquitous genes are only found in a relatively small subset of gene sets.

On the flip side, though, there are a large number of genes found in only one gene set.

These are not going to be interesting for a network analysis and should be removed.

genes_in_two_or_more_sets <- gene_sets_per_gene$Var1[gene_sets_per_gene$Freq > 1]
mf <- lapply(mf,function(x){x[x %in% genes_in_two_or_more_sets]})

Ready to create binary matrix.

For this, we will use mtabulate from qdapTools.

No need to download the package, just going to use the code for this one function.

Taken from here: https://github.com/trinker/qdapTools/blob/master/R/mtabulate.R

This will make each gene set be a row and each gene be a column.

mtabulate <- function(vects) { 
    lev <- sort(unique(unlist(vects)))
    dat <- do.call(rbind, lapply(vects, function(x, lev){ 
        tabulate(factor(x, levels = lev, ordered = TRUE),
        nbins = length(lev))}, lev = lev))
    colnames(dat) <- sort(lev) 
    data.frame(dat, check.names = FALSE)
}

binary_matrix <- mtabulate(mf)

Output this binary matrix to a CSV file.

write.table(data.frame(Gene_set = rownames(binary_matrix),binary_matrix,check.names=FALSE),
    file="GO_MF_binary_matrix.csv",
    row.names=FALSE,col.names=TRUE,quote=TRUE,sep=",")

Converting MSigDB gene sets into a binary gene set x gene matrix

Baron Curtin, Heather Geiger

Introduction

Data processing

Download and read in data