In this document, I describe the clustering of mutation data using the BBMMclusterEM package.

1 Dependencies

In order to use their codes, we need to load some R libraries. First, install RBGL package in R from Bioconductor:

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("RBGL")

Note that install.packages("RBGL") will return the error message: RBGL is not available for this version of R; See here.

Next, install three additional packages (BiDAG, pcalg and clue) as follows.

install.packages("https://cran.r-project.org/src/contrib/BiDAG_2.1.1.tar.gz", repos=NULL, type="source")
install.packages("https://cran.r-project.org/src/contrib/pcalg_2.7-6.tar.gz", repos=NULL, type="source")
install.packages("https://cran.r-project.org/src/contrib/clue_0.3-61.tar.gz", repos=NULL, type="source")

Caution for OSX: The official R binary installer for macOS expects the gfortran compiler to be at /usr/local/gfortran. Therefore, trying to install the RBGL package may fail if your gfortran binary is located elsewhere. If it does, then do the following.

  1. Create a file ~/.R/Makevars (if it does not exist yet)

  2. Add the following to ~/.R/Makevars

FC      = /usr/local/opt/gcc/bin/gfortran  # Use `which gfortran` to find out the directory
F77     = /usr/local/opt/gcc/bin/gfortran
FLIBS   = -L/usr/local/opt/gcc/lib
  1. Restart R

  2. Proceed with BiocManager::install("RBGL")

2 Supervised Clustering using BiDAG

2.1 Bayesian Network

A Bayesian network (BN) is a probabilistic graphical model, which represents conditional independence relationships between a set of random variables by a directed acyclic graph (DAG), together with a collection of conditional probability tables. In a DAG, each node corresponds to an observable and each directed edge represents a statistical dependence. In addition, each node is associated with a conditional probability distribution of the corresponding observable which depends on its parents in the DAG.

BiDAG is a software to discover causal structure from observations, based on the Markov chain Monte Carlo (MCMC) method. Kuipers et al. (2018) used BiDAG for learning structures of BNs that characterize mutation profiles across 22 different cancer types and novel subtypes. The dataset included mutational profiles of N = 8198 tumor samples. For n = 201 significantly mutated genes, a Bayesian network-based clustering approach was used to define clusters of tumor samples, such that a Bayesian network represented each cluster center. Structure learning was performed in two steps. In the first step, the function iterativeMCMC from BiDAG was used to optimize the search space. In the second step, sampling was performed with the partitionMCMC function from BiDAG on the optimized search space.

In general, network analysis of cancer require two resources. The first is a list of oncogenic alterations that affect protein-coding genes. The second is a database of gene sets and their interactions. The gene-set databases are lists of genes that have been grouped according to common biological properties. While pathway databases represent biological processes as series of biochemical reactions and other physical events such as phosphorylation, subcellular localization and conformational changes, network databases use a simpler data model that treats biological processes as source-target pairs, i.e., an edge list. The epidermal growth factor (EGF) pathway shown in Figure 1 illustrates the essential difference between pathway and network interaction databases. In the most sophisticated approach, both types of interaction databases are used.

Figure 1: Pathway vs. Network. (a) In the pathway representation, heterogeneous nodes and edges correspond to genes, proteins, small molecules and their regulatory and catalytic relationships. Nodes do not interact directly but participate in reaction events designated by white squares. (b) In the network representation, all nodes correspond to the same type of biological entity (gene products). Edges derived from curated pathways are shown as arrows. Additional gene-gene interactions derived from gene coexpression and physical protein interactions are shown as light lines. Green nodes are proteins participating in curated reactions or interactions, blue nodes are complexes and gray nodes are uncurated proteins participating in physical interactions.

2.2 Example: BN Clustering

## This is an example taken from:
## https://github.com/cbg-ethz/pancancer-clustering/tree/master/BNclusteringUpdate
library(BiDAG)
library(pcalg)
library(RBGL)
library(clue)

Suppose we have three different Bayesian networks (BNs) of 5 genes, as shown below.

They are generated using BBMMclusterEM library:

source('./generatebinaryBN.R')   ## BBMMclusterEM programs
## Generate network data sets for a demonstration purpose
#number of nodes in BNs
n<-5
#set seed for reproducibility
sseed<-111
#generate 3 networks
BNs<-list()  ## named list
BNs[[1]]<-generatebinaryBN(n,ii=sseed,baseline=c(0.3,0.5))   ## BBMMcluster EM function
BNs[[2]]<-generatebinaryBN(n,ii=sseed+1,baseline=c(0.3,0.5))  ## baseline: min/max deviation
BNs[[3]]<-generatebinaryBN(n, ii=sseed+2,baseline=c(0.3,0.5))
##plot BNs
#par(mfrow=c(1,3))  # multiplot
#plot(BNs[[1]]$DAG, attrs=list(node=list(fontsize=8, fixedsize=TRUE, height=0.3,width=0.3)))
#plot(BNs[[2]]$DAG, attrs=list(node=list(fontsize=8, fixedsize=TRUE, height=0.3,width=0.3)))
#plot(BNs[[3]]$DAG, attrs=list(node=list(fontsize=8, fixedsize=TRUE, height=0.3,width=0.3)))

Each BN has 8 attributes:

names(BNs[[1]])
## [1] "DAG"     "adj"     "parlist" "np"      "fp"      "ord"     "map"    
## [8] "skel"

Let’s Take a look at them for the first network:

BNs[[1]]$DAG  ## directed acyclic graphs (DAC)
## A graphNEL graph with directed edges
## Number of Nodes = 5 
## Number of Edges = 4
BNs[[1]]$adj  ## adjacency matrix associated with DAC
##   1 2 3 4 5
## 1 0 1 0 1 0
## 2 0 0 1 0 0
## 3 0 0 0 0 0
## 4 0 0 0 0 0
## 5 0 0 1 0 0
BNs[[1]]$parlist  ## row indices of 1's in each column of the adjacency matrix
## [[1]]
## named integer(0)
## 
## [[2]]
## 1 
## 1 
## 
## [[3]]
## 2 5 
## 2 5 
## 
## [[4]]
## 1 
## 1 
## 
## [[5]]
## named integer(0)
BNs[[1]]$np  ## number of 1's in each column of the adjacency matrix
## [1] 0 1 2 1 0
BNs[[1]]$fp  ## probability for 2^np edges
## [[1]]
## [1] 0.3449688
## 
## [[2]]
## [1] 0.09292749 0.58239418
## 
## [[3]]
## [1] 0.05789584 0.79504576 0.41511347 0.95000000
## 
## [[4]]
## [1] 0.0522375 0.5205145
## 
## [[5]]
## [1] 0.3886588
BNs[[1]]$ord  ## topological order
## [1] 5 1 4 2 3
BNs[[1]]$map  
## $partable
## $partable[[1]]
##   Var1
## 1    0
## 2    1
## 
## $partable[[2]]
##   Var1 Var2
## 1    0    0
## 2    1    0
## 3    0    1
## 4    1    1
## 
## 
## $index
## $index[[1]]
## [1] 0 1
## 
## $index[[2]]
## [1] 0 2 1 3
BNs[[1]]$skel  ## upper triangular matrix
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    0    1    0    1    0
## [2,]    0    0    1    0    0
## [3,]    0    0    0    0    1
## [4,]    0    0    0    0    0
## [5,]    0    0    0    0    0

Next, we generate a random sampling of data to cluster.

ss1<-300 #number of samples in the first cluster
ss2<-200 #number of samples in the second cluster
ss3<-100 #number of samples in the third cluster
ss<-ss1+ss2+ss3 #total number of samples

# generate random data for each network
set.seed(sseed)
bindata1<-generatebinaryBN.data(n,BNs[[1]],ss1)  ## BBMMclusterEM function
bindata2<-generatebinaryBN.data(n,BNs[[2]],ss2)
bindata3<-generatebinaryBN.data(n,BNs[[3]],ss3)

#join all data in a single table
Datafull<-rbind(bindata1,bindata2,bindata3) 
head(Datafull)
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    1    1    1    0
## [2,]    0    0    0    0    0
## [3,]    0    0    0    0    0
## [4,]    0    0    0    1    0
## [5,]    0    0    1    0    0
## [6,]    0    0    0    0    0
dim(Datafull)
## [1] 600   5

We can think of Datafull table as the yes/no table of mutations for each gene.

2.2.1 Estimation-Maximization (EM) algorithm for Clustering

The R script is lengthy, so I omit it in this section.

2.2.2 2D Visualization

Calculating the Jansen-Shannon divergence between the probability vectors of different samples provides a distance matrix, which is projected onto 2D plane using cmdscale command in R.

bestseed<-EMseeds[1]
matsize<-nrow(Datafull)
divergy<-matrix(0,matsize,matsize)
for(k1 in 1:(matsize-1)){
  
  # take the log scores each sample against the cluster
  P<-scoresprogress[[which(EMseeds==bestseed)]][k1,] 
  
  P<-exp(P-max(P)) # turn to scores
  P<-P/sum(P) # renormalise
  for(k2 in (k1+1):matsize){
    
    # take the log scores of graphs against another
    Q<-scoresprogress[[which(EMseeds==bestseed)]][k2,]  
    
    Q<-exp(Q-max(Q)) # turn to scores
    Q<-Q/sum(Q) # renormalise
    M<-(P+Q)/2 # find the average
    
    # Jenson-Shannon divergence
    JS<- (P%*%log(P/M) + Q%*%log(Q/M))/2 
    
    divergy[k1,k2]<-JS
    divergy[k2,k1]<-JS
  }
}
reducey<-cmdscale(divergy,k=2)
#plotting the result
cols<-c("#e7298a", "#377eb8","#984ea3")
relab<-relabs[[which(EMseeds==bestseed)]]
bestmemb<-reassignsamples(probs[[which(EMseeds==bestseed)]],nrow(Datafull))
plot(reducey[which(bestmemb==relab[1]),1],reducey[which(bestmemb==relab[1]),2],
     col=cols[1],xlim=c(-0.43,1),ylim=c(-0.45,0.20),pch=3,axes=F,bty="n",
     xlab="",ylab="",asp=1,main="Cluster assignments: 2D projection",cex=0.5)
lines(reducey[which(bestmemb==relab[2]),1],reducey[which(bestmemb==relab[2]),2],
      col=cols[2],type="p",pch=3,cex=0.5)
lines(reducey[which(bestmemb==relab[3]),1],reducey[which(bestmemb==relab[3]),2],
      col=cols[3],type="p",pch=3,cex=0.5)
legend(0.45,0.25,legend=c("cluster 1","cluster 2","cluster3"),
       col=cols,pch=3,bty="n")

3 Cytoscape

Cytoscape is a tool for visualizing a network (i.e., a group of entities that are causally connected). At its most basic, Cytoscape wants a spreadsheet containing two columns; the objects in the first column should be connected in some way to the objects in the second column. This spreadsheet is called an edge list. Each row in the edge list describes a connection between two entities. A tutorial on creating an edge list using Excel worksheets is given in here.

4 Alternative Clustering Methods