1 Introduction

The network analysis of biological data has increased in recent years, due to the capacity of this approach to analyze and represent complex information in a simple way, information that nowadays is growing and that covers different levels of biological resolution (protein-protein interaction, signaling interaction networks, gene regulatory network, among others). Currently, one of the most used and informative representations of biological data are co-expression networks. In this approach a network is created based on data obtained from experimental expression measures, taking into account the existence of particular patterns or relations between expression profiles among different genes, proteins or RNA fragments involved in a specific phenotype.

In most cases, the idea behind creating a co-expression network is to visualize the relationships among diferent genes, proteins, specific DNA or RNA fragments or any other kind of molecular entities, that are identified by a specific ID. For this reason, it is very useful to keep the information of the corresponding ID corresponding to each one of the probesets in the microarray. This kind of information will be used when you need to switch from a matrix of probeset-samples to one of genes (or another ID)-samples before the construction of the co-expression network.

2 Dataset

DCGL includes three simulated datasets, each having a total of 1000 genes. We also used a simulated expression dataset that we created for this purpose. We will combine both datasets when showing our differents results.

# Simulated expression data
n <- 200
m <- 20
# The vector with treatment samples and control samples
 t <- c(rep(0,10),rep(1,10))
#  Calculating the expression values normalized
mat <- as.matrix(rexp(n, rate = 1))
norm <- t(apply(mat, 1, function(nm) rnorm(m, mean=nm, sd=1)))
# Calculating the coefficient of variation to case samples
case <- cofVar(expData = norm,complete = FALSE,treatment = t,type = "case")
head(case)

#case[1:5, 1:5]

library(coexnet)
library(DCGL)
dataC <- read.table("//Users/mikelapika/Downloads/dataC.txt", header = T)
head(dataC)

#dataC[1:5, 1:5]

The first ten samples (Sample1 to Sample10) belong to one condition and the remaining ten samples belong to another condition. So we firstly divide dataC to two parts corresponding to condition A (exprs.1) and condition B (exprs.2) respectively.

3 Link filtering

 exprs.1 <- dataC[, 1:10]
 exprs.2 <- dataC[, 11:20]

The number of genes in dataC is moderate (1000), so the most time consuming link filtration function (qLinkfilter) can be run. Through link filtering, correlation pairs with q values lower than qth(0.25) in either condition are kept whereas the other pairs are set to 0. Links$rth contain the two correlation thresholds for both conditions; Both link corelation filtered keep the filtered correlation matrices for condition A and B.

Links <- qLinkfilter(exprs.1, exprs.2, 0.25)

names(Links)

## [1] "rth.1"          "rth.2"          "cor.filtered.1" "cor.filtered.2"
## [5] "genes"

Links$rth.1

## [1] 0.5864986

Links$rth.2

## [1] 0.6609703

L1 <- Links$cor.filtered.1
L2 <- Links$cor.filtered.2

L1[1:5]

## EG10007,EG10006 EG10008,EG10006 EG10012,EG10006 EG10020,EG10006 
##       0.0000000       0.0000000       0.0000000       0.6280721 
## EG10022,EG10006 
##       0.0000000

L2[1:5]

## EG10007,EG10006 EG10008,EG10006 EG10012,EG10006 EG10020,EG10006 
##       0.0000000       0.0000000       0.0000000       0.1632318 
## EG10022,EG10006 
##       0.0000000

4 Genes Symbol

library(coexnet)
library(affy)
getInfo(GSE = "GSE8216", GPL = "GPL2025",directory = ".")
affy <- getAffy(GSE = "GSE1234",directory = system.file("extdata",package = "coexnet"))
gene_table <- geneSymbol(GPL = "GPL2025",directory = system.file("extdata",package = "coexnet"))
head(gene_table)

# The created table have NA and empty IDs information.We can delete this unuseful information. Deletion of IDs with NA information
gene_na <- na.omit(gene_table)
# Deletion of empty IDs
final_table <- gene_na[gene_na$ID != "",]
head(final_table)

5 Coefficient of Variation

In some cases, the co-expression network is built from two or more microarrays studies, in this sense, it is necessary to define wich one of these studies accounts for more source of background noise and will probably have a negative impact on the results

library(coexnet)
library(affydata)

##      Package    LibPath                                   Item      
## [1,] "affydata" "/Users/mikelapika/Library/R/3.5/library" "Dilution"
##      Title                        
## [1,] "AffyBatch instance Dilution"

6 DCp for identifying DCGs

data(tf2target)
expGenes<-rownames(exprs)
DCp.res <- DCp(exprs.1,exprs.2,link.method='qth',cutoff=0.15,N=0)
DCp.res[1:5, ]

The result display explained us that the first column gives the dC value for every gene, the second column gives the length of its Differential Coexpression Profiles or degree in the coexpression networks.

7 DCe for identifying DCGs and DCLs

DCe.res<-DCe(exprs.1,exprs.2,link.method='qth',cutoff=0.25,nbins=10,p=0.1)
#DCe.res[1:10, ]
DCe.res$DCGs[1:5, ]

##         All.links DC.links DCL_same DCL_diff DCL_switch             p
## EG13414       588      339       36       64        239 2.399990e-144
## EG10170       595      330       20       60        250 9.569629e-134
## EG10624       583      326       15       67        244 1.496361e-133
## EG10441       573      322       37       53        232 8.816070e-133
## EG10350       603      331       19       59        253 2.329297e-132
##                     q
## EG13414 2.399990e-141
## EG10170 4.784815e-131
## EG10624 4.987869e-131
## EG10441 2.204018e-130
## EG10350 4.658594e-130

DCe.res$DCLs[1:5, ]

The Result is a list with four compenents, one for DCGs and another three for different types of DCLs.

8 Combine two Differential Co-expression Analysis results

DCsum.res<-DCsum(DCp.res,DCe.res,DCpcutoff=0.25,DCecutoff=0.4)
DCsum.res$DCGs[1:3,]

DCsum.res$DCLs[1:3,]

#DRplot.res<-DRplot(DCsum.res$DCGs,DCsum.res$DCLs,tf2target,expGenes,type='TF_bridged_DCL',vsize=5,asize=0.25,lcex=0.3,ewidth=1,figname=c('TF2target_DCL.pdf','TF_bridged_DCL.pdf'))

The DCGs table include seven columns all links, DCLs, same signed DCLs, differently signed DCLs, switched links, the p value, and the FWER value.

9 WGCNA, ASC and LRC for identifying DCGs

WGCNA.res <- WGCNA(exprs.1, exprs.2, power = 12, variant = "WGCNA")
WGCNA.res[1:10]

##     EG10006     EG10007     EG10008     EG10012     EG10020     EG10022 
## 0.130436661 0.002313100 0.070444785 0.003444917 0.021990313 0.012220093 
##     EG10023     EG10024     EG10025     EG10026 
## 0.009754864 0.058267889 0.139482501 0.003448272

The other two methods ‘ASC’ and ‘LRC’ also requires setting the link-filtering method, but they exert a hard thresholding strategy and basically compares the numbers of connected edges.

ASC.res <- ASC(exprs.1,exprs.2,link.method='qth',cutoff=0.25)
ASC.res[1:10]

## EG10006 EG10007 EG10008 EG10012 EG10020 EG10022 EG10023 EG10024 EG10025 
##   193.0   162.5   157.0    75.0   172.5    10.5    52.0   111.0    92.5 
## EG10026 
##    97.0

LRC.res <- LRC(exprs.1, exprs.2, link.method = "qth", cutoff = 0.25)
LRC.res[1:10]

##     EG10006     EG10007     EG10008     EG10012     EG10020     EG10022 
## 0.867762025 0.077915080 0.454258372 0.050654764 0.187339802 0.041392685 
##     EG10023     EG10024     EG10025     EG10026 
## 0.115393419 0.106268337 0.004196115 0.414973348

data(exprs)
exprs.1<-exprs[1:100,1:16]
exprs.2<-exprs[1:100,17:63]
expGenes<-rownames(exprs)

## Sort out differentially regulated genes and differentially regulated links

data(tf2target) 

DCp.res <- DCp(exprs.1,exprs.2,link.method='qth',cutoff=0.25,N=0)
DCe.res<-DCe(exprs.1,exprs.2,link.method='qth',cutoff=0.25,nbins=10,p=0.1)
DCsum.res<-DCsum(DCp.res,DCe.res,DCpcutoff=0.25,DCecutoff=0.4)
DRsort.res<-DRsort(DCsum.res$DCGs,DCsum.res$DCLs,tf2target,expGenes)
## or
DRsort.res<-DRsort(DCe.res$DCGs,DCe.res$DCLs,tf2target,expGenes)

10 Plot differentially regulated links

DRplot.res<-DRplot(DCsum.res$DCGs,DCsum.res$DCLs,tf2target,expGenes,type='TF_bridged_DCL',intgenelist=NULL,vsize=5,asize=0.25,lcex=0.3,ewidth=1,figname=c('TF2target_DCL.pdf','TF_bridged_DCL.pdf'))

## The graph of TF_bridged_DCL.pdf has been completed and saved in your working directory.

#DRplot.res
#include_graphics('/Users/mikelapika/Downloads/TF_bridged_DCL.pdf')
#knitr::include_graphics('/Users/mikelapika/Downloads/TF_bridged_DCL.pdf')
#knitr::include_graphics('/Users/mikelapika/Downloads/TF2target_DCL.pdf')

library(imager)
imq <- load.image("/Users/mikelapika/Desktop/TF_bridged_DCL1.jpg")
plot(imq,axes=FALSE)

## two types of Differentially Regulated Links which were plotted

DRplot.res$TF2target_DCL[1:3,]

DRplot.res$TF_bridged_DCL[1:3,]

11 Plot sub-network by predefined gene(s)

data(intgenelist)
DRplot.int.res<-DRplot(DCsum.res$DCGs,DCsum.res$DCLs,tf2target,expGenes,type='TF_bridged_DCL2',intgenelist=intgenelist,vsize=5,asize=0.25,lcex=0.3,ewidth=1,figname=c('TF2target_DCL2.pdf','TF_bridged_DCL2.pdf'))
#DRplot.int.res

DRplot.int.res$TF2target_DCL[1:3,]

DRplot.int.res$TF_bridged_DCL[1:3,]

library(jpeg)
library(imager)
#knitr::include_graphics('/Users/mikelapika/Downloads/TF_bridged_DCL2.pdf')
#knitr::include_graphics('/Users/mikelapika/Downloads/TF2target_DCL2.pdf')
im <- load.image("/Users/mikelapika/Desktop/TF_bridged_DCL2.jpg")
plot(im,axes=FALSE)

# Calculating the coefficient of variation to whole matrix
complete <- cofVar(dataC)
head(complete)

The decision of discarding a microarray study from our analysis, based on the result of the coefficient of variation analysis, depends on the data and the criteria of the researcher to filter the studies (the selection of a threshold value), there is no a Gold Standard to discard a study, so it is advisable to calculate the coefficient of variation of all samples at the same time to compare and to determine which one shows more variation.

# Creating the boxplot to coefficient of variation results
boxplot(complete$cv)

# Extracting the number of atipic data
length(boxplot.stats(complete$cv)$out)

## [1] 14

12 Differential Expression

When expression data of gene, proteins, or another kind of ID are used to build a co-expression network, in most cases it is convenient to asses differences in the expression value of each of them.

There are several methodologies to identify differentially expressed genes/IDs, some methods are more predictive than others, so depending of the method, it is possible obtain genes/IDs that clearly differentiate from others in their expression values or it is also possible to obtain genes/IDs whose expression value are slightly different from others, but that given the criteria of the method used, they are considered as differentialy expressed.

This function considers two ways to calculate the differentially expressed genes/IDs.

# Running the function using the two approaches
sam <- difExprs(expData = dataC,treatment = t,fdr = 0.2,DifferentialMethod = "sam")

## Achieved FDR: 0.301096159763063

head(sam)

This function identifies the genes/IDs differentially expressed taking into account the expected FDR, so independently of the method used (sam or acde), the number of genes/IDs identified as differentially expressed will be guided by the FDR expected by the user, thus increasing the predictive power in the final results.

13 Find The Threshold

Once you have the final expression matrix, it is used as basis to obtain the co-expression network. There are two methods widely used to obtain it, both of them are related to the definition of correlation value between all the genes/IDs creating a square matrix. On one hand you can calculate the Pearson Correlation Coefficient, this method calculates the correlation between each genes/IDs expression values, as result, the square matrix will have values between zero and one, given that for the future construction of the co-expression network it is necessary to use the absolute value of the results. On the other hand, the Mutual Information approach is based on the entropy of the data and in a simmilar manner a square matrix is created, but, in this case, it is necessary to perform an additional transformation of the results in order to obtain a square matrix with values between zero and one.

systematicLinkfilter <-function(exprs) {
        N<- nrow(exprs);
        exprs=as.matrix(exprs);
        r=cor(t(exprs),method="pearson",use="pairwise.complete.obs");
        CC0<- vector();
        CCmean<- vector();
        for(l in 1:100) {
            Rth<- 0.01*l;
            K <- rep(0, dim(r)[1]);
            names(K) <- rownames(r);
            CC <- K;
            for( i in 1:dim(r)[1]){
                neighbor <- rownames(r)[i];
                for( j in 1:dim(r)[2] ){
                    if( r[i,j] >= Rth ){
                        K[i] = K[i]+1;
                        neighbor <- c(neighbor, colnames(r)[j]);
                    }
                }
                if( length(neighbor) > 2 ){
                    for( m in 2:length(neighbor)){
                        for( n in 2:length(neighbor) ){
                            if( r[neighbor[m], neighbor[n]] >= Rth){
                                CC[i] = CC[i] + 1;
                            }
                        }
                    }
                    CC[i] = CC[i]/(K[i]*(K[i]-1));
                }
            }
            K2<- 0;
            KK<- 0;
            CCC<- 0;
            for(j in 1:100) {
                K2<- K[j]*K[j]+K2;
                KK<- KK+K[j]; 
                CCC<- CC[i]+CCC;
            }  
            CCCmean<- CCC/N;
            K2mean<- K2/N;
            Kmean<- KK/N;
            C0<- (K2mean-Kmean)^2/(Kmean^3)/N;
            CC0<- rbind(CC0,C0);
            CCmean<- rbind(CCmean,CCCmean);
        } 
        CC_C0<- CCmean-CC0;    
        cor <- seq(0.01,1,0.01)
        #plot(x,CC_C0,xlab="correlation threshold",ylab="C-C0");
        #lines(x,CC_C0);
        C_r <- cbind(cor,CC_C0)
        colnames(C_r)=c("cor","C_C0")
        rownames(C_r)=c(1:nrow(C_r))
        return(C_r)
    }

The function outputs a table of ‘correlation threshold’ vs. ‘clustering coeficient’.

exprs <- case[1:100, ]
C_r <- systematicLinkfilter(exprs)
plot(C_r,type="l")

plot(C_r[, 1], C_r[, 2], xlab = "correlation threshold", ylab = "C-C0")
lines(C_r[, 1], C_r[, 2])

A curve of ‘correlation threshold’ vs. ‘clustering coecient’ is plotted, which may assist the users to determine correlation threshold.

# Loading data
#pathfile <- system.file("extdata","expression_example.txt",package = "coexnet")
#data <- read.table(pathfile,stringsAsFactors = FALSE)
# Finding threshold value
cor_pearson <- findThreshold(expData = dataC,method = "correlation")
cor_pearson

## [1] 0.91

This function computes a threshold value using a novel method based on two Biological Systems approaches. First, each possible threshold value, from 0.01 to 0.99 with an increase of 0.01 is examined. Each of this values is then analyzed using the Pearson Correlation Coefficient. In most cases, the threshold value is selected by the researcher without any biological assumption. Here, we present this novel methodology to select a threshold value under a network biology assumption.

14 CreateNet

The last step in the construction of a co-expression network is the creation of a data structure that stores the information necessary to create a network graph. Once you had defined a threshold value, the last step is to transform the expression matrix in a adjacency matrix using the correlation or the mutual information method to obtain the values of relationship between the diferent genes or proteins (or another kind of ID).

This function takes the expression matrix and creates the correlation matrix using Pearson Correlation Coefficient or Mutual Informtion. After that, it creates the adjacency matrix using the threshold value given by the user and finally creates the network from the adjacency matrix as an igraph object to be analyzed using the igraph R package or any other tool that recognizes this type of object.

# Loading data
pathfile <- system.file("extdata","expression_example.txt",package = "coexnet")
data <- read.table(pathfile,stringsAsFactors = FALSE)
# Building the network
cor_pearson <- createNet(expData = dataC,threshold = 0.99,method = "correlation")
plot(cor_pearson)

In the process of construction of the adjacency matrix, sometimes a gene/ID can not be related to another one and in the process passing from a matrix to an edge list, this gene/ID will be deleted.

# load("//Users/mikelapika/Downloads/GSCA.RData")
# dc.M <- singleDC(data = LungCancer3$data$Michigan, group = c(86, 10),GSdefList = LungCancer3$info$GSdef, nperm = 3)
# str(dc.M)
# 
# GS <- LungCancer3$info$GSdef
#  dc.M <- singleDC(data = LungCancer3$data$Michigan, group = c(86, 10), GSdefList = GS, nperm = 3, permDI = TRUE)
# str(dc.M)
# 
# load("//Users/mikelapika/Downloads/GSCA.RData")
# head(GSCA)

Identifying Coexpressed Genes

Wei Wang, Mike Lapika, Adane Bewset

University of Southern Maine