Chapter 7: Unraveling Transcriptomic Complexity with Weighted Gene Co-expression Network Analysis (WGCNA)

##Introduction The advent of high-throughput technologies like RNA-sequencing and microarrays has revolutionized biology, providing us with unprecedented volumes of gene expression data. A typical transcriptomic study measures the activity of tens of thousands of genes across multiple samples. While this data holds immense potential, its sheer dimensionality presents a significant analytical challenge. A common first step is to identify individual genes that are differentially expressed between experimental conditions. This approach is powerful but inherently limited; it overlooks the coordinated action of genes and fails to capture the intricate regulatory architecture of the cell.

Biological processes are rarely driven by single genes acting in isolation. Instead, they emerge from the complex interplay of numerous molecules working in concert. To understand these systems-level properties, we must move beyond single-gene analyses and adopt methods that can model the relationships between genes. Network biology provides a powerful framework for this task, representing genes as nodes and the relationships between them as edges.

Weighted Gene Co-expression Network Analysis (WGCNA) is a widely used systems biology method for understanding the relationships between genes in high-dimensional expression datasets. It is an unsupervised technique, meaning it does not require prior knowledge of gene function to find patterns. WGCNA begins with the premise that genes with similar expression patterns across a set of samples are likely to be functionally related—a concept often termed “guilt-by-association.” By analyzing these co-expression patterns, WGCNA identifies clusters, or “modules,” of highly correlated genes. These modules often correspond to known biological pathways or cellular components. The true power of WGCNA lies in its ability to then correlate these modules with external sample traits, such as clinical outcomes, disease progression, or physiological measurements, thereby linking networks of genes to observable phenotypes.

This chapter will guide you through the theoretical concepts, practical steps, and interpretative logic of WGCNA, demonstrating how it can be used to reduce the complexity of transcriptomic data and uncover biologically meaningful insights.

Core Concepts of WGCNA

At its heart, WGCNA is a multi-step process that transforms a massive gene expression matrix into a structured network of co-expressed gene modules. This process is built on several key statistical and conceptual foundations.

1. From Co-expression to a Network: The Adjacency Matrix

The first step in WGCNA is to quantify how similarly each pair of genes is expressed across the samples. This is typically done by calculating the Pearson correlation coefficient for every gene pair, resulting in a large correlation matrix.

WGCNA employs a more elegant solution called soft-thresholding. It transforms the correlation matrix into an adjacency matrix (which represents the network) by raising the absolute value of the correlation to a power, β:

Adjacency = |correlation|^β

The parameter β (the soft-thresholding power) is the key to this step. The value of β is chosen to achieve a scale-free topology. Many biological networks are scale-free, meaning they have many nodes with few connections and a few “hub” nodes with many connections. WGCNA selects the lowest β value that produces a network that approximates this structure.

2. Reinforcing Connections: The Topological Overlap Measure (TOM)

To build more robust modules, WGCNA calculates the Topological Overlap Measure (TOM). For any two genes, it considers not only their direct adjacency but also the strength of their shared connections with all other genes in the network. If two genes are strongly connected to the same group of “neighbor” genes, their topological overlap is high. This makes TOM a highly robust measure of true biological proximity.

3. Identifying Modules: Clustering and Dynamic Tree Cutting

With the robust TOM matrix in hand, the next goal is to identify modules of densely interconnected genes. WGCNA achieves this through hierarchical clustering using a dissimilarity measure of 1 - TOM. The result is a gene dendrogram, or tree, where distinct branches represent potential gene modules. To define the final modules, WGCNA employs a Dynamic Tree Cutting algorithm.

Relating Modules to External Information

The Module Eigengene (ME)

To relate an entire module to an external trait, WGCNA summarizes the expression pattern of all the genes within that module by calculating the Module Eigengene (ME). The ME is the first principal component of the module’s expression matrix and represents the dominant trend of gene expression for that module.

Correlating Modules with Traits

Once the MEs are calculated, they are correlated with the external traits of interest (e.g., disease status, age). This produces a module-trait relationship heatmap, providing a global overview of the biological significance of each module.

Digging Deeper: Gene Significance and Module Membership

WGCNA defines two key metrics to identify important genes within a module:

Gene Significance (GS): The correlation between a gene’s expression and a specific external trait.

Module Membership (kME): The correlation between a gene’s expression and its module eigengene.

Genes with both high GS and high kME are central to their module and strongly associated with the biological trait, making them top candidates for further validation.

A Practical Walkthrough with R Code

The following snippets use the WGCNA package in R.

Data Preparation

The expression data should be a matrix where rows are samples and columns are genes.

# Load the WGCNA package
# Ensure you have installed it first: install.packages("WGCNA")
library(WGCNA)

# The following setting is important, do not omit.
options(stringsAsFactors = FALSE);

# Load expression and trait data from RData files
# For this example, we will simulate data as we don't have the file.
# In a real scenario, you would use:
# lnames = load(file = "MyData-01-dataInput.RData");

# Simulate data for demonstration
nSamples = 100
nGenes = 5000
datExpr = as.data.frame(matrix(rnorm(nGenes * nSamples), nSamples, nGenes))
names(datExpr) = paste0("Gene", 1:nGenes)

# Simulate traits
trait1 = rnorm(nSamples)
trait2 = rbinom(nSamples, 1, 0.5)
datTraits = as.data.frame(cbind(trait1, trait2))

Step 1: Network Construction & Soft-Thresholding Select the soft-thresholding power (β) to achieve a scale-free topology.

# Choose a set of soft-thresholding powers
powers = c(c(1:10), seq(from = 12, to=20, by=2))

# Call the network topology analysis function
sft = pickSoftThreshold(datExpr, powerVector = powers, verbose = 5)

## pickSoftThreshold: will use block size 5000.
##  pickSoftThreshold: calculating connectivity for given powers...
##    ..working on genes 1 through 5000 of 5000

##    Power SFT.R.sq   slope truncated.R.sq  mean.k. median.k.   max.k.
## 1      1 0.117000  30.200          0.957 4.02e+02  4.02e+02 4.16e+02
## 2      2 0.151000  17.600          0.937 5.05e+01  5.05e+01 5.36e+01
## 3      3 0.009830   2.930          0.977 8.04e+00  8.04e+00 8.89e+00
## 4      4 0.001710  -0.859          0.993 1.50e+00  1.50e+00 1.74e+00
## 5      5 0.000723  -0.385          0.994 3.16e-01  3.15e-01 3.87e-01
## 6      6 0.185000  -4.920          0.982 7.30e-02  7.26e-02 1.01e-01
## 7      7 0.368000 -12.600          0.403 1.83e-02  1.81e-02 3.25e-02
## 8      8 0.492000 -13.000          0.398 4.89e-03  4.78e-03 1.21e-02
## 9      9 0.555000 -11.800          0.430 1.38e-03  1.33e-03 5.11e-03
## 10    10 0.593000  -9.690          0.477 4.13e-04  3.88e-04 2.38e-03
## 11    12 0.581000  -7.430          0.461 4.19e-05  3.65e-05 5.88e-04
## 12    14 0.542000  -5.870          0.415 4.94e-06  3.81e-06 1.57e-04
## 13    16 0.541000  -4.800          0.410 6.60e-07  4.27e-07 4.31e-05
## 14    18 0.519000  -4.120          0.396 9.86e-08  5.04e-08 1.20e-05
## 15    20 0.518000  -3.640          0.389 1.62e-08  6.12e-09 3.36e-06

# Plot the results:
par(mfrow = c(1,2));
cex1 = 0.9;
plot(sft$fitIndices[,1], -sign(sft$fitIndices[,3])*sft$fitIndices[,2],
     xlab="Soft Threshold (power)",ylab="Scale Free Topology Model Fit,signed R^2",type="n",
     main = paste("Scale independence"));
text(sft$fitIndices[,1], -sign(sft$fitIndices[,3])*sft$fitIndices[,2],
     labels=powers,cex=cex1,col="red");
abline(h=0.90,col="red") # We aim for a fit of ~0.9

# Mean connectivity plot
plot(sft$fitIndices[,1], sft$fitIndices[,5],
     xlab="Soft Threshold (power)",ylab="Mean Connectivity", type="n",
     main = paste("Mean connectivity"))
text(sft$fitIndices[,1], sft$fitIndices[,5], labels=powers, cex=cex1,col="red")

# Choose the power that is closest to the desired R^2 value (e.g., 0.90)
softPower = 6;

Step 2 & 3: Module Detection in One Step The blockwiseModules function performs network construction, TOM calculation, clustering, and module detection.

# One-step network construction and module detection
net = blockwiseModules(datExpr, power = softPower,
                       TOMType = "unsigned", minModuleSize = 30,
                       reassignThreshold = 0, mergeCutHeight = 0.25,
                       numericLabels = TRUE, pamRespectsDendro = FALSE,
                       saveTOMs = TRUE,
                       saveTOMFileBase = "MyTOM",
                       verbose = 3)

##  Calculating module eigengenes block-wise from all genes
##    Flagging genes and samples with too many missing values...
##     ..step 1
##  ..Working on block 1 .
##     TOM calculation: adjacency..
##     ..will not use multithreading.
##      Fraction of slow calculations: 0.000000
##     ..connectivity..
##     ..matrix multiplication (system BLAS)..
##     ..normalization..
##     ..done.
##    ..saving TOM for block 1 into file MyTOM-block.1.RData
##  ....clustering..
##  ....detecting modules..
##  No modules detected in block 1
##  ..merging modules that are too close..
##      mergeCloseModules: Merging modules whose distance is less than 0.25

# The 'net$colors' variable contains the module assignment for each gene.
# 0 (grey) is reserved for genes that did not belong to any module.
table(net$colors)

## 
##    0 
## 5000

Step 4: Relating Modules to Traits Correlate module eigengenes (MEs) with clinical traits.

# Get module colors and MEs
moduleColors = labels2colors(net$colors)
MEs = net$MEs

# Correlate eigengenes with external traits
moduleTraitCor = cor(MEs, datTraits, use = "p");
moduleTraitPvalue = corPvalueStudent(moduleTraitCor, nSamples);

# Display the correlations and their p-values in a heatmap
textMatrix = paste(signif(moduleTraitCor, 2), "\n(",
                   signif(moduleTraitPvalue, 1), ")", sep = "");
dim(textMatrix) = dim(moduleTraitCor)

par(mar = c(6, 8.5, 3, 3));
labeledHeatmap(Matrix = moduleTraitCor,
               xLabels = names(datTraits),
               yLabels = names(MEs),
               ySymbols = names(MEs),
               colorLabels = FALSE,
               colors = blueWhiteRed(50),
               textMatrix = textMatrix,
               setStdMargins = FALSE,
               cex.text = 0.5,
               zlim = c(-1,1),
               main = paste("Module-trait relationships"))

Expanding Beyond Transcriptomics: WGCNA for Proteomics Data

WGCNA’s principles can be applied to other high-dimensional data, like proteomics from platforms such as Olink. The workflow remains largely the same, using protein abundance levels as input.

Example R Workflow for Olink Proteomics Data

# We will simulate Olink-style data for this example
# In a real scenario, you would load your data:
# lnames = load(file = "OlinkProteomics-Input.RData");

# Simulate protein data
nProteins = 92 # Typical size of an Olink panel
datProt = as.data.frame(matrix(rnorm(nProteins * nSamples), nSamples, nProteins))
names(datProt) = paste0("Prot", 1:nProteins)

# --- Step 1: Select Soft-Thresholding Power ---
powers = c(c(1:10), seq(from = 12, to=30, by=2))
sft_prot = pickSoftThreshold(datProt, powerVector = powers, networkType = "signed", verbose = 5)

## pickSoftThreshold: will use block size 92.
##  pickSoftThreshold: calculating connectivity for given powers...
##    ..working on genes 1 through 92 of 92
##    Power SFT.R.sq  slope truncated.R.sq  mean.k. median.k.   max.k.
## 1      1  0.59700 36.100        0.75200 4.55e+01  4.55e+01 4.63e+01
## 2      2  0.45000 17.300        0.50800 2.30e+01  2.30e+01 2.38e+01
## 3      3  0.43200 10.700        0.55300 1.17e+01  1.17e+01 1.24e+01
## 4      4  0.15900  4.930        0.37400 6.03e+00  6.06e+00 6.54e+00
## 5      5  0.10400  2.620        0.60800 3.13e+00  3.15e+00 3.49e+00
## 6      6  0.03510  1.380        0.79600 1.64e+00  1.66e+00 1.89e+00
## 7      7  0.00127 -0.216        0.74100 8.70e-01  8.81e-01 1.03e+00
## 8      8  0.01630 -0.717        0.66800 4.64e-01  4.72e-01 5.73e-01
## 9      9  0.05010 -1.270        0.55100 2.50e-01  2.55e-01 3.21e-01
## 10    10  0.11900 -1.700        0.58100 1.36e-01  1.37e-01 1.82e-01
## 11    12  0.28500 -2.310        0.68400 4.10e-02  4.07e-02 6.05e-02
## 12    14  0.42000 -2.050        0.81900 1.28e-02  1.24e-02 2.09e-02
## 13    16  0.53700 -2.020        0.79800 4.11e-03  3.91e-03 7.51e-03
## 14    18  0.67100 -2.080        0.85400 1.36e-03  1.26e-03 2.78e-03
## 15    20  0.11800 -4.380       -0.12500 4.62e-04  4.16e-04 1.05e-03
## 16    22  0.72000 -1.980        0.66200 1.61e-04  1.42e-04 4.19e-04
## 17    24  0.82900 -1.940        0.80200 5.76e-05  4.96e-05 1.79e-04
## 18    26  0.21100 -4.040       -0.00733 2.11e-05  1.75e-05 7.87e-05
## 19    28  0.27300 -4.260        0.12800 7.89e-06  6.16e-06 3.51e-05
## 20    30  0.31900 -5.590        0.18200 3.01e-06  2.17e-06 1.59e-05

softPower_prot = 8; # This is an example, choose based on your data's fit

# --- Step 2 & 3: One-Step Module Detection ---
# Here we use a signed network, often preferred for protein data.
net_prot = blockwiseModules(datProt, power = softPower_prot,
                       networkType = "signed",
                       minModuleSize = 10, # Module size can be smaller for proteomics
                       reassignThreshold = 0, mergeCutHeight = 0.2,
                       numericLabels = TRUE,
                       saveTOMs = TRUE,
                       saveTOMFileBase = "OlinkTOM",
                       verbose = 3)

##  Calculating module eigengenes block-wise from all genes
##    Flagging genes and samples with too many missing values...
##     ..step 1
##  ..Working on block 1 .
##     TOM calculation: adjacency..
##     ..will not use multithreading.
##      Fraction of slow calculations: 0.000000
##     ..connectivity..
##     ..matrix multiplication (system BLAS)..
##     ..normalization..
##     ..done.
##    ..saving TOM for block 1 into file OlinkTOM-block.1.RData
##  ....clustering..
##  ....detecting modules..
##  ....calculating module eigengenes..
##  ....checking kME in modules..
##  No significant modules detected in block 1
##  ..merging modules that are too close..
##      mergeCloseModules: Merging modules whose distance is less than 0.2

# --- Step 4: Relate Protein Modules to Traits ---
MEs_prot = net_prot$MEs;
moduleTraitCor_prot = cor(MEs_prot, datTraits, use = "p");
moduleTraitPvalue_prot = corPvalueStudent(moduleTraitCor_prot, nSamples);

# Visualize the module-trait relationships using a heatmap
par(mar = c(6, 8.5, 3, 3));
labeledHeatmap(Matrix = moduleTraitCor_prot,
               xLabels = names(datTraits),
               yLabels = names(MEs_prot),
               main = "Protein Module-Trait Relationships")

Downstream Analysis and Interpretation

The output of a WGCNA analysis is the starting point for further biological inquiry. Key downstream steps include:

Identifying Hub Members: Genes or proteins with the highest kME values within significant modules are considered hubs.

Functional Enrichment Analysis: Use tools like Gene Ontology, KEGG, Reactome, or STRING-DB (for proteins) to understand what a module is “doing”.

Network Visualization: Export the network to software like Cytoscape to explore connections and identify hubs visually.

Conclusion

Weighted Co-expression Network Analysis provides a powerful, systems-level approach to analyzing complex high-dimensional omics data. By reducing complexity from thousands of individual features to a few dozen modules and linking these modules to phenotypes, WGCNA bridges the gap between molecular data and observable biology, making it an indispensable tool for researchers.