WGCNA package have been widely used to create co-expression networks, grouping genes with similar expression pattern in clusters and relating these cluster with phenotypic characterics. Here I show how I adapted the script from the WGCNA page to my data, comment some functions and give some tips.
The first step running any script, not only WGCNA, is to define the working directory. Then, we need to load the packages you will use for the analysis. For WGCNA, you need to define that strings will not be considered like factors and also enable multithread to allow the algorithm run faster.
#Setting directory
#setwd(seu_diretorio)
setwd("/media/natalia/3C8E068E2A640DD5/WGCNA/Spodoptera")
#Loading WGCNA
library(WGCNA)
#Setting string not as factor
options(stringsAsFactors = FALSE)
#Enable multithread
enableWGCNAThreads()
## Allowing parallel execution with up to 7 working processes.
It is necessary to read and prepare the expression data. Expression data table usually comes in a format as the first column contains the name of the genes and the first line the name of the conditions/treatment. WGCNA works with a format as the genes are organized in the columns and the treatments in the lines. So we need to get the transpose of the data. We will also remove the column with the gene names and use it to put name in the columns of the data frame. Same for rownames that will be the condition names.
#Reading the raw data (rows are the sample and columns the genes)
expressiondata = read.csv("expressiondataEx.csv", sep = ";")
#Create a new format expression data - remove gene name column
expression = as.data.frame(expressiondata[, -c(1)])
expression = t(expression)
#Column 1 - gene names
colnames(expression) = expressiondata$EST
rownames(expression) = names(expressiondata)[-c(1)]
When working with correlations missing values are not interesting because they add noise to the calculations. So, it is important to remove genes with missing values. If the number of genes with missing values is too high, it is important to use some technique to correct this problem. That is not the case here, so we will just remove them.
Some samples do not follow the pattern of the most of the samples. They are considered outliers and need to be removed.
#Group data in a dendogram to check outliers
sampleTree = hclust(dist(expression), method = "average")
#Plot a sample tree: Open the output in a 12:9 inchs size window
#dev.off()
#sizeGrWindow(12,9)
#If you want to save this plot in a pdf file, do not comment the line below:
#pdf(file = "/media/natalia/3C8E068E2A640DD5/WGCNA/Spodoptera/sampleClustering.pdf", width = 12, height = 9);
par(cex = 0.6)
par(mar = c(0,4,2,0))
plot(sampleTree, main = "Sample clustering to detect outliers", sub="", xlab="", cex.lab = 1.5,
cex.axis = 1.5, cex.main = 2)
#Plot a line showing the cut-off
abline(h = 31000, col = "red") #This value of 31000 was chosen based on my data, you need to check the best value to your data
#Determine clusters below the line
#help("cutreeStatic")
clust = cutreeStatic(sampleTree, cutHeight = 31000, minSize = 10)
#table(clust)
#clust
#Cluster 1 contains the samples we want to keep.
keepSamples = (clust==1)
expression = expression[keepSamples, ]
#dim(expression0)
nGenes = ncol(expression)
nSamples = nrow(expression)
As I said before, WGCNA also allows to relate expression data with phenotypic data throught the cluster association analysis. So, here, we will also read and prepare our phenotypic data.
#Read phenotypic data
traitData = read.csv("phenoTraits.csv", sep = ";")
#dim(traitData)
#names(traitData)
Samples = rownames(expression)
traitRows = match(Samples, traitData$EST)
datTraits = traitData[traitRows, -1]
rownames(datTraits) = traitData[traitRows, 1]
collectGarbage()
It is interesting to observe how these samples are organized.
#Regrouping samples
sampleTree2 = hclust(dist(expression), method = "average")
#Converting phenotypic characters in a color representation: white means low value, red means high value
#and gray missing value
traitColors = numbers2colors(datTraits, signed = FALSE)
#Plot a sample dendogram with the colors below
plotDendroAndColors(sampleTree2, traitColors,
groupLabels = names(datTraits),
main = "Sample dendrogram and trait heatmap")
To create the network, the correlation values are corrected based on a correction factor. This factor enhances the differences between strong and weak correlations. Weak values become closer to zero. This power value must produce a graph similar to a scale-free network. We can also look at the mean connectivity graphic to help.
#Plotting the results
#sizeGrWindow(9, 5)
par(mfrow = c(1,2))
cex1 = 0.9
#Index the scale free topology adjust as a function of the power soft thresholding.
plot(sft$fitIndices[,1], -sign(sft$fitIndices[,3])*sft$fitIndices[,2],
xlab="Soft Threshold (power)",ylab="Scale Free Topology Model Fit,signed R^2",type="n",
main = paste("Scale independence"))
text(sft$fitIndices[,1], -sign(sft$fitIndices[,3])*sft$fitIndices[,2],
labels=powers,cex=cex1,col="red")
#This line corresponds to use a cut-off R² of h
abline(h=0.90,col="red")
#Connectivity mean as a function of soft power thresholding
plot(sft$fitIndices[,1], sft$fitIndices[,5],
xlab="Soft Threshold (power)",ylab="Mean Connectivity", type="n",
main = paste("Mean connectivity"))
text(sft$fitIndices[,1], sft$fitIndices[,5], labels=powers, cex=cex1,col="red")
#This line corresponds to use a cut-off R² of h
abline(h=0.90,col="red")
The correlations are calculated and stored in a adjacendy matrix. In this point, we can choose which method we will use to calculate the correlations (Pearson, Spearman or Biweight Midcorrelation). The values of correlations are elevated to the normalization power factor. We can also set the type of the network as “unsigned”, “signed”, “signed hybrid”, “distance”. This part can require a high memory capacity. For big data sets there is another way of creating these matrix, using only one function. It will be showed in the next part.
softPower = 9 #Chosen in the graphs before
adjacency = adjacency(expression, power = softPower, type = "unsigned") #Calculating the adjacency matrix
#help(adjacency )
The adjacency matrix is transformed in a Topological Overlapping Matrix (TOM) considering the number of neighbors the genes share. From this matrix, it is calculated de Dissimilarity Matrix.
#Transforming the adjacency matrix in a topological overlap
TOM = TOMsimilarity(adjacency) #Calculating the topological overlap matrix
## ..connectivity..
## ..matrix multiplication (system BLAS)..
## ..normalization..
## ..done.
dissTOM = 1-TOM ##Calculating the dissimilarity
From the Dissimilarity Matrix, we can group the genes with more similar pattern of expression in clusters. We have already used the function hclust above to group the samples. Now we are going to use it on our Dissimilarity Matrix in order to group the genes in clusters. For more details about this function you can call the help(hclust).
Here, the Dissimilarity Matrix and the dendogram will be used to convert the clusters in colors. Then, we will plot the dendogram with the colors below.
Some modules are very similar, so they can be merged in a unique cluster based on a cut-off.
#Clustering the modules
#Calculating eigengenes
MEList = moduleEigengenes(expression, colors = dynamicColors)
MEs = MEList$eigengenes
#Calculating the module dissimilarity eigengenes
MEDiss = 1-cor(MEs)
#Clustering the eigengenes modules
METree = hclust(as.dist(MEDiss), method = "average")
#Plotting the result
#sizeGrWindow(7, 6)
plot(METree, main = "Clustering of module eigengenes",
xlab = "", sub = "")
#Grouping the clusters from a cut-off
MEDissThres = 0.25
#Plotting a cut-off line
abline(h=MEDissThres, col = "red")
#Grouping module colors
mergedColors = merge$colors
#Eigengenes of new grouped modules
mergedMEs = merge$newMEs
#getwd()
#sizeGrWindow(12, 9)
#pdf(file = "/media/natalia/3C8E068E2A640DD5/WGCNA/Spodoptera/geneDendro-3.pdf", wi = 9, he = 6)
plotDendroAndColors(geneTree, cbind(dynamicColors, mergedColors),
c("Dynamic Tree Cut", "Merged dynamic"),
dendroLabels = FALSE, hang = 0.03,
addGuide = TRUE, guideHang = 0.05)
#dev.off()
#Renaming the module colors
moduleColors = mergedColors
#Building numeric labels corresponding to the colors
colorOrder = c("grey", standardColors(50))
moduleLabels = match(moduleColors, colorOrder)-1
MEs = mergedMEs
#dim(dissTOM)
#TOMplot(dissTOM , geneTree,dynamicColors, terrainColors=TRUE)
Below it is an option to work with big amount of data. It allows to create the network and clustering it using only one function:
#Dealing with a big data set: Constructing a block-wise network and detecting modules
#bwnet = blockwiseModules(expression0, maxBlockSize = 5000,
# power = 10, TOMType = "unsigned", minModuleSize = 100,
# reassignThreshold = 0, mergeCutHeight = 0.25,
# numericLabels = TRUE,
# saveTOMs = TRUE,
# saveTOMFileBase = "SpodopteraTOM-blockwise",
# verbose = 3)
#Loading the results pf single-block analysis
#load(file = "Spodoptera-02-networkConstruction-auto.RData")
# Re-labeling blockwise modules
#bwLabels = matchLabels(bwnet$colors, moduleLabels)
#Converting the labels in colors to plot
#bwModuleColors = labels2colors(bwLabels)
#Opening graphic window
#sizeGrWindow(6,6)
#Plot the dendogram and the modules colors below to the block 1
#(Replacing the number of the block you want to plot in bwnet$dendrograms[[X]]
#bwModuleColors[bwnet$blockGenes[[X]]])
#plotDendroAndColors(bwnet$dendrograms[[1]], bwModuleColors[bwnet$blockGenes[[1]]],
# "Module colors", main = "Gene dendrogram and module colors in #block 1",
# dendroLabels = FALSE, hang = 0.03,
# addGuide = TRUE, guideHang = 0.05)
#Plotting more than one together
#sizeGrWindow(12,9)
#plotDendroAndColors(geneTree,
# cbind(moduleColors, bwModuleColors),
# c("Single block", "2 blocks"),
# main = "Single block gene dendrogram and module colors",
# dendroLabels = FALSE, hang = 0.03,
# addGuide = TRUE, guideHang = 0.05)
#singleBlockMEs = moduleEigengenes(expression0, moduleColors)$eigengenes
#blockwiseMEs = moduleEigengenes(expression0, bwModuleColors)$eigengenes
#single2blockwise = match(names(singleBlockMEs), names(blockwiseMEs))
#signif(diag(cor(blockwiseMEs[, single2blockwise], singleBlockMEs)), 3)
#dev.off()
We will relate the clusters identified with the phenotypic measures. It is done through the mean of the correlation value between each gene of the module and the phenotypic measure.
#Relating modules to characteristics and identifying important genes
#Defining the number of genes and samples
nGenes = ncol(expression)
nSamples = nrow(expression)
#Recalculating MEs with label colors
MEs0 = moduleEigengenes(expression, moduleColors)$eigengenes
MEs = orderMEs(MEs0)
moduleTraitCor = cor(MEs, datTraits, use = "p")
moduleTraitPvalue = corPvalueStudent(moduleTraitCor, nSamples)
#sizeGrWindow(8,4)
#Displaying correlations and its p-values
textMatrix = paste(signif(moduleTraitCor, 2), "\n(",
signif(moduleTraitPvalue, 1), ")", sep = "")
dim(textMatrix) = dim(moduleTraitCor)
par(mar = c(6, 8.5, 3, 3))
#Displaying the correlation values in a heatmap plot
labeledHeatmap(Matrix = moduleTraitCor,
xLabels = names(datTraits),
yLabels = names(MEs),
ySymbols = names(MEs),
colorLabels = FALSE,
colors = blueWhiteRed(50),
textMatrix = textMatrix,
setStdMargins = FALSE,
cex.text = 0.5,
zlim = c(-1,1),
main = paste("Module-trait relationships"))
Below, we will plot a graph to chek how the genes of a specific module are related with a specific characteristic looking at the distribution of the correlation values between the genes of the module and the character.
#Defining the variable Peso10dias containing the column Peso10dias of datTrait
Peso10dias = as.data.frame(datTraits$Peso10dias)
names(Peso10dias) = "Peso10d"
#names (colors) of the modules
modNames = substring(names(MEs), 3)
geneModuleMembership = as.data.frame(cor(expression, MEs, use = "p"))
MMPvalue = as.data.frame(corPvalueStudent(as.matrix(geneModuleMembership), nSamples))
names(geneModuleMembership) = paste("MM", modNames, sep="")
names(MMPvalue) = paste("p.MM", modNames, sep="")
geneTraitSignificance = as.data.frame(cor(expression, Peso10dias, use = "p"))
GSPvalue = as.data.frame(corPvalueStudent(as.matrix(geneTraitSignificance), nSamples))
names(geneTraitSignificance) = paste("GS.", names(Peso10dias), sep="")
names(GSPvalue) = paste("p.GS.", names(Peso10dias), sep="")
module = "pink" #########################putting the color below the plot
column = match(module, modNames)
moduleGenes = moduleColors==module
#sizeGrWindow(7, 7)
par(mfrow = c(1,1))
verboseScatterplot(abs(geneModuleMembership[moduleGenes, column]),
abs(geneTraitSignificance[moduleGenes, 1]),
xlab = paste("Module Membership in", module, "module"),
ylab = "Gene significance for Peso 10 dias",
main = paste("Module membership vs. gene significance\n"),
cex.main = 1.2, cex.lab = 1.2, cex.axis = 1.2, col = module)
The following code allows us to access the genes inside a module, order them by importance order and stores this information in a csv file.
#Display the gene names inside the module
#colnames(expression0)[moduleColors=="pink"]
#Identifying most important genes for one determined characteristic inside of the cluster
geneInfo0 = data.frame(EST = colnames(expression),
moduleColor = moduleColors,
geneTraitSignificance,
GSPvalue)
modOrder = order(-abs(cor(MEs, Peso10dias, use = "p")))
for (mod in 1:ncol(geneModuleMembership))
{
oldNames = names(geneInfo0)
geneInfo0 = data.frame(geneInfo0, geneModuleMembership[, modOrder[mod]],
MMPvalue[, modOrder[mod]]);
names(geneInfo0) = c(oldNames, paste("MM.", modNames[modOrder[mod]], sep=""),
paste("p.MM.", modNames[modOrder[mod]], sep=""))
}
geneOrder = order(geneInfo0$moduleColor, -abs(geneInfo0$GS.Peso10d))
geneInfo = geneInfo0[geneOrder, ]
#if you want to write the information in a csv file, just uncomment line below
#write.csv(geneInfo, file = "geneInfo.csv")
Last step is to export and save the network. Then you can import it in a software for network visualization as Cytoscape, for example.
#Exporting the network to a cytoscape format
#Recalculating topological overlap, if necessary
#TOM = TOMsimilarityFromExpr(expression0, power = 10);
#Select the modules
#modules = c("brown", "red"); #chose modules that u want to export
#Select the gene modules
genes = colnames(expression)
#if you want export specific colors, substitute the second modulecolors by above modules
inModule = is.finite(match(moduleColors, moduleColors))
modGenes = genes[inModule]
#Select the corresponding topologic overlap
modTOM = TOM[inModule, inModule]
dimnames(modTOM) = list(modGenes, modGenes)
modTOMSignificantes = which(modTOM>0.4)
#####warnings()
#Organize the genes by importance inside the module
genes = colnames(expression)
#sum(is.na(genes))
#It must return 0.
#Create the dataframe since the beginning
geneInfo0 = data.frame(ESTs = genes,
moduleColor = moduleColors,
geneTraitSignificance,
GSPvalue)
#Order the modules by the significance by a character Ex: peso10days
modOrder = order(-abs(cor(MEs, Peso10dias, use = "p")))
#Add information of the members of the module in the chosen order
for (mod in 1:ncol(geneModuleMembership))
{
oldNames = names(geneInfo0)
geneInfo0 = data.frame(geneInfo0, geneModuleMembership[, modOrder[mod]],
MMPvalue[, modOrder[mod]]);
names(geneInfo0) = c(oldNames, paste("MM.", modNames[modOrder[mod]], sep=""),
paste("p.MM.", modNames[modOrder[mod]], sep=""))
}
#Order the genes of geneinfo variable first by the color of the module, then by geneTraitSignificance
geneOrder = order(geneInfo0$moduleColor, -abs(geneInfo0$GS.Peso10d))
geneInfo = geneInfo0[geneOrder, ]
#write the file with the ordered values
write.csv(geneInfo, file = "geneInfo.csv")
#Export the network in list files os n edges that cytoscape can read
cyt = exportNetworkToCytoscape(modTOM,
edgeFile = "CytoscapeEdgeFile.txt",
nodeFile = "CytoscapeNodeFile.txt",
weighted = TRUE,
threshold = 0.4,
nodeNames = modGenes,
nodeAttr = moduleColors[inModule])