This is with the honeybee (Apis mellifera) normalized expression data.

1. Format the data for analysis.

## [1] 10315     7

## [1] "gene_id"                 "GSM1103900_worker1_FPKM"
## [3] "GSM1103901_worker2_FPKM" "GSM1103902_worker3_FPKM"
## [5] "GSM1930795_queen1_FPKM"  "GSM1930796_queen2_FPKM" 
## [7] "GSM1930797_queen3_FPKM"

## 'data.frame':    10315 obs. of  7 variables:
##  $ gene_id                : chr  "gene0" "gene1" "gene10" "gene100" ...
##  $ GSM1103900_worker1_FPKM: num  0.386 0.401 97.442 3.81 6.828 ...
##  $ GSM1103901_worker2_FPKM: num  0.218 0.408 98.583 3.862 5.837 ...
##  $ GSM1103902_worker3_FPKM: num  0.375 0.31 96.015 3.73 6.819 ...
##  $ GSM1930795_queen1_FPKM : num  0.338 0.631 111.864 1.788 2.026 ...
##  $ GSM1930796_queen2_FPKM : num  0.545 0.831 99.23 1.625 4.054 ...
##  $ GSM1930797_queen3_FPKM : num  0.488 0.669 113.127 2.849 2.116 ...

Let’s remove any genes due to missing values (and print the gene ID):

##  Flagging genes and samples with too many missing values...
##   ..step 1
##   ..Excluding 284 genes from the calculation due to too many missing samples or zero variance.
##   ..step 2

Detect any outliers with hierarchical clustering:

Good, as expected the caste phenotypes cluster together. The guide has you go through and determine where to trim samples based on outliers from clustering, but we have so few samples we are not going to do this step.

Create the traits data:

##                accessions queens workers
## 1 GSM1103900_worker1_FPKM      0       1
## 2 GSM1103901_worker2_FPKM      0       1
## 3 GSM1103902_worker3_FPKM      0       1
## 4  GSM1930795_queen1_FPKM      1       0
## 5  GSM1930796_queen2_FPKM      1       0
## 6  GSM1930797_queen3_FPKM      1       0

N.B.: The heatmap they make in the example is great, but not relevant here so we are not doing it.

2. Automatic Construction of the Gene Network and Modules

## [1] "bee1t"  "traits"

Find the Soft-Thresholding Power Needed:

## pickSoftThreshold: will use block size 4337.
##  pickSoftThreshold: calculating connectivity for given powers...
##    ..working on genes 1 through 4337 of 10315

## Warning: executing %dopar% sequentially: no parallel backend registered

## Warning in eval(xpr, envir = envir): Some correlations are NA in block 1 :
## 4337 .

##    ..working on genes 4338 through 8674 of 10315

## Warning in eval(xpr, envir = envir): Some correlations are NA in block 4338 :
## 8674 .

##    ..working on genes 8675 through 10315 of 10315

## Warning in eval(xpr, envir = envir): Some correlations are NA in block 8675 :
## 10315 .

##    Power SFT.R.sq     slope truncated.R.sq mean.k. median.k. max.k.
## 1      1 2.86e-01  1.360000          0.370    6130      6970   7750
## 2      2 4.38e-01  0.334000          0.640    4660      5450   6680
## 3      3 1.88e-01  0.250000          0.561    3810      4460   5990
## 4      4 3.53e-02  0.099800          0.737    3240      3740   5480
## 5      5 2.39e-06 -0.000798          0.895    2830      3200   5080
## 6      6 1.98e-02 -0.071200          0.948    2510      2770   4750
## 7      7 6.84e-02 -0.128000          0.968    2260      2420   4480
## 8      8 1.27e-01 -0.172000          0.976    2050      2120   4240
## 9      9 1.94e-01 -0.211000          0.984    1870      1880   4030
## 10    10 2.66e-01 -0.246000          0.984    1730      1680   3850
## 11    12 3.85e-01 -0.297000          0.987    1490      1360   3540
## 12    14 4.98e-01 -0.346000          0.981    1310      1110   3280
## 13    16 5.94e-01 -0.383000          0.977    1160       922   3060
## 14    18 6.63e-01 -0.414000          0.970    1040       775   2870
## 15    20 7.23e-01 -0.445000          0.965     943       663   2700
## 16    22 7.79e-01 -0.472000          0.962     859       572   2560
## 17    24 8.10e-01 -0.495000          0.960     788       496   2430
## 18    26 8.40e-01 -0.517000          0.956     726       436   2310
## 19    28 8.64e-01 -0.537000          0.952     672       385   2200
## 20    30 8.83e-01 -0.557000          0.949     624       341   2110
## 21    32 8.99e-01 -0.576000          0.950     582       303   2020
## 22    34 9.09e-01 -0.590000          0.947     545       272   1940
## 23    36 9.23e-01 -0.607000          0.948     511       246   1860

This is the analysis of the network topology using various soft-thresholding powers. The left panel is shows the scale-free fit index as a function of the soft-thresholding power. Note that I had to increase the default range of soft thresholding values from 20 to 36. Our minimum soft thresholding value will be 30 because this corresponds with where the scale-free topology fit index curve flattens upon reaching an {R^2} of 0.90. Henceforth, we will use a soft thresholding power of 30 for honeybee.

The right panel displays the mean connectivity as a function of the soft-thresholding power. Mean connectivity doesn’t even drop below 1000 (!) until a soft-thresholding of 20, which further supports that we need to bump up our soft-thresholding power.

Run the Automatic Construction process with the Soft-Thresholding Power of 30:

##  Calculating module eigengenes block-wise from all genes
##    Flagging genes and samples with too many missing values...
##     ..step 1
##   ..Excluding 284 genes from the calculation due to too many missing samples or zero variance.
##     ..step 2
##  ..Working on block 1 .
##     TOM calculation: adjacency..
##     ..will not use multithreading.
##      Fraction of slow calculations: 0.000000
##     ..connectivity..
##     ..matrix multiplication (system BLAS)..
##     ..normalization..
##     ..done.
##    ..saving TOM for block 1 into file AmelliferaTOM-block.1.RData
##  ....clustering..
##  ....detecting modules..
##  ....calculating module eigengenes..
##  ....checking kME in modules..
##      ..removing 1 genes from module 10 because their KME is too low.
##  ..merging modules that are too close..
##      mergeCloseModules: Merging modules whose distance is less than 0.25
##        Calculating new MEs...

## [1] "System Run Time:"

## Time difference of 6.323888 mins

Notice that compared to the default settings I set the maxBlockSize to 20000 to make sure that all of the genes could be included in a block (we have so many more genes than the example run by the Vignette for WGCNA). Note that I am running 64GB of RAM, though. I was able to run a maxBlockSize of 30000 without issue. It yieled the same number of blocks as below, so I reset it to 20000.

What is the summary statistics on the modules?

## 
##    0    1    2    3    4    5    6    7    8    9   10   11   12   13   14 
##  306 3635 3565  646  600  333  260  192  155  153  141  124  102   59   44

## [1] "Average:"

## [1] 714.9286

## [1] "Range:"

## [1] 44

## [1] 3635

14 total modules, ranging from 44 to 3635 genes. There were 306 outside of all modules (label of 0). The average was 714.93 genes per module.

3. Visualize the modules:

Clustering dendrogram of gene swith assigned module colors, with dissimilarity based on topological overlap.

How many modules are there?

Let’s also verify that the counts match the modules seen above.

##  [1] "brown"       "turquoise"   "yellow"      "blue"        "magenta"    
##  [6] "black"       "tan"         "grey"        "pink"        "red"        
## [11] "green"       "purple"      "greenyellow" "salmon"      "cyan"

## mergedColors
##        cyan      salmon         tan greenyellow      purple     magenta 
##          44          59         102         124         141         153 
##        pink       black         red        grey       green      yellow 
##         155         192         260         306         333         600 
##       brown        blue   turquoise 
##         646        3565        3635

## 
##   14   13   12   11   10    9    8    7    6    0    5    4    3    2    1 
##   44   59  102  124  141  153  155  192  260  306  333  600  646 3565 3635

Save the module assignments in case you need them for later use:

4. Quantifying module–trait associations:

Define numbers of genes and samples

## [1] 10315

## [1] 6

Display correlations and their p-values:

Gene relationship to trait and important modules: Gene Significance and Module Membership

Intramodular analysis: identifying genes with high GS and MM

The most clearly worker module is blue, whereas the most strongly queen module is turquoise. We will use those here for starters. A scatterplot of Gene Significance (GS) for Queendom vs. Module Membership (MM) in the turquoise module. There is a STRONG POSITIVE and significant correlation between GS and MM in this module. This isn’t surprising given that I chose it.

Combined Gene Significance and Correlation Plots

Let’s do a combined plot of the most highly correlated modules (R^2 >= 0.5) for Queens:

## Here are the gene counts:
## Turquoise N = 3635
## Yellow N = 600
## Pink N = 155
## Purple N = 141

Let’s do a combined plot of the most highly correlated modules (R^2 >= 0.5) for Workers:

## Here are the gene counts:
## Blue N = 3565
## Green N = 333
## Tan N = 102
## Greenyellow N = 124

5. Get the Gene Identities in the Modules and Relate Back to the Genome for Subsequent Analyses

What are the genes in the top modules for Queens?