Install and load requisite libraries.
## [1] 10315 7
## [1] "gene_id" "GSM1103900_worker1_FPKM"
## [3] "GSM1103901_worker2_FPKM" "GSM1103902_worker3_FPKM"
## [5] "GSM1930795_queen1_FPKM" "GSM1930796_queen2_FPKM"
## [7] "GSM1930797_queen3_FPKM"
## 'data.frame': 10315 obs. of 7 variables:
## $ gene_id : chr "gene0" "gene1" "gene10" "gene100" ...
## $ GSM1103900_worker1_FPKM: num 0.386 0.401 97.442 3.81 6.828 ...
## $ GSM1103901_worker2_FPKM: num 0.218 0.408 98.583 3.862 5.837 ...
## $ GSM1103902_worker3_FPKM: num 0.375 0.31 96.015 3.73 6.819 ...
## $ GSM1930795_queen1_FPKM : num 0.338 0.631 111.864 1.788 2.026 ...
## $ GSM1930796_queen2_FPKM : num 0.545 0.831 99.23 1.625 4.054 ...
## $ GSM1930797_queen3_FPKM : num 0.488 0.669 113.127 2.849 2.116 ...
## Flagging genes and samples with too many missing values...
## ..step 1
## ..Excluding 284 genes from the calculation due to too many missing samples or zero variance.
## ..step 2
Good, as expected the caste phenotypes cluster together. The guide has you go through and determine where to trim samples based on outliers from clustering, but we have so few samples we are not going to do this step.
## accessions queens workers
## 1 GSM1103900_worker1_FPKM 0 1
## 2 GSM1103901_worker2_FPKM 0 1
## 3 GSM1103902_worker3_FPKM 0 1
## 4 GSM1930795_queen1_FPKM 1 0
## 5 GSM1930796_queen2_FPKM 1 0
## 6 GSM1930797_queen3_FPKM 1 0
N.B.: The heatmap they make in the example is great, but not relevant here so we are not doing it.
## [1] "bee1t" "traits"
## pickSoftThreshold: will use block size 4337.
## pickSoftThreshold: calculating connectivity for given powers...
## ..working on genes 1 through 4337 of 10315
## Warning: executing %dopar% sequentially: no parallel backend registered
## Warning in eval(xpr, envir = envir): Some correlations are NA in block 1 :
## 4337 .
## ..working on genes 4338 through 8674 of 10315
## Warning in eval(xpr, envir = envir): Some correlations are NA in block 4338 :
## 8674 .
## ..working on genes 8675 through 10315 of 10315
## Warning in eval(xpr, envir = envir): Some correlations are NA in block 8675 :
## 10315 .
## Power SFT.R.sq slope truncated.R.sq mean.k. median.k. max.k.
## 1 1 2.86e-01 1.360000 0.370 6130 6970 7750
## 2 2 4.38e-01 0.334000 0.640 4660 5450 6680
## 3 3 1.88e-01 0.250000 0.561 3810 4460 5990
## 4 4 3.53e-02 0.099800 0.737 3240 3740 5480
## 5 5 2.39e-06 -0.000798 0.895 2830 3200 5080
## 6 6 1.98e-02 -0.071200 0.948 2510 2770 4750
## 7 7 6.84e-02 -0.128000 0.968 2260 2420 4480
## 8 8 1.27e-01 -0.172000 0.976 2050 2120 4240
## 9 9 1.94e-01 -0.211000 0.984 1870 1880 4030
## 10 10 2.66e-01 -0.246000 0.984 1730 1680 3850
## 11 12 3.85e-01 -0.297000 0.987 1490 1360 3540
## 12 14 4.98e-01 -0.346000 0.981 1310 1110 3280
## 13 16 5.94e-01 -0.383000 0.977 1160 922 3060
## 14 18 6.63e-01 -0.414000 0.970 1040 775 2870
## 15 20 7.23e-01 -0.445000 0.965 943 663 2700
## 16 22 7.79e-01 -0.472000 0.962 859 572 2560
## 17 24 8.10e-01 -0.495000 0.960 788 496 2430
## 18 26 8.40e-01 -0.517000 0.956 726 436 2310
## 19 28 8.64e-01 -0.537000 0.952 672 385 2200
## 20 30 8.83e-01 -0.557000 0.949 624 341 2110
## 21 32 8.99e-01 -0.576000 0.950 582 303 2020
## 22 34 9.09e-01 -0.590000 0.947 545 272 1940
## 23 36 9.23e-01 -0.607000 0.948 511 246 1860
This is the analysis of the network topology using various soft-thresholding powers. The left panel is shows the scale-free fit index as a function of the soft-thresholding power. Note that I had to increase the default range of soft thresholding values from 20 to 36. Our minimum soft thresholding value will be 30 because this corresponds with where the scale-free topology fit index curve flattens upon reaching an {R^2} of 0.90. Henceforth, we will use a soft thresholding power of 30 for honeybee.
The right panel displays the mean connectivity as a function of the soft-thresholding power. Mean connectivity doesn’t even drop below 1000 (!) until a soft-thresholding of 20, which further supports that we need to bump up our soft-thresholding power.
## Calculating module eigengenes block-wise from all genes
## Flagging genes and samples with too many missing values...
## ..step 1
## ..Excluding 284 genes from the calculation due to too many missing samples or zero variance.
## ..step 2
## ..Working on block 1 .
## TOM calculation: adjacency..
## ..will not use multithreading.
## Fraction of slow calculations: 0.000000
## ..connectivity..
## ..matrix multiplication (system BLAS)..
## ..normalization..
## ..done.
## ..saving TOM for block 1 into file AmelliferaTOM-block.1.RData
## ....clustering..
## ....detecting modules..
## ....calculating module eigengenes..
## ....checking kME in modules..
## ..removing 1 genes from module 10 because their KME is too low.
## ..merging modules that are too close..
## mergeCloseModules: Merging modules whose distance is less than 0.25
## Calculating new MEs...
## [1] "System Run Time:"
## Time difference of 6.323888 mins
Notice that compared to the default settings I set the maxBlockSize to 20000 to make sure that all of the genes could be included in a block (we have so many more genes than the example run by the Vignette for WGCNA). Note that I am running 64GB of RAM, though. I was able to run a maxBlockSize of 30000 without issue. It yieled the same number of blocks as below, so I reset it to 20000.
##
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
## 306 3635 3565 646 600 333 260 192 155 153 141 124 102 59 44
## [1] "Average:"
## [1] 714.9286
## [1] "Range:"
## [1] 44
## [1] 3635
14 total modules, ranging from 44 to 3635 genes. There were 306 outside of all modules (label of 0). The average was 714.93 genes per module.
Clustering dendrogram of gene swith assigned module colors, with dissimilarity based on topological overlap.
Let’s also verify that the counts match the modules seen above.
## [1] "brown" "turquoise" "yellow" "blue" "magenta"
## [6] "black" "tan" "grey" "pink" "red"
## [11] "green" "purple" "greenyellow" "salmon" "cyan"
## mergedColors
## cyan salmon tan greenyellow purple magenta
## 44 59 102 124 141 153
## pink black red grey green yellow
## 155 192 260 306 333 600
## brown blue turquoise
## 646 3565 3635
##
## 14 13 12 11 10 9 8 7 6 0 5 4 3 2 1
## 44 59 102 124 141 153 155 192 260 306 333 600 646 3565 3635
Save the module assignments in case you need them for later use:
## [1] 10315
## [1] 6
The most clearly worker module is blue, whereas the most strongly queen module is turquoise. We will use those here for starters. A scatterplot of Gene Significance (GS) for Queendom vs. Module Membership (MM) in the turquoise module. There is a STRONG POSITIVE and significant correlation between GS and MM in this module. This isn’t surprising given that I chose it.
Let’s do a combined plot of the most highly correlated modules (R^2 >= 0.5) for Queens:
## Here are the gene counts:
## Turquoise N = 3635
## Yellow N = 600
## Pink N = 155
## Purple N = 141
Let’s do a combined plot of the most highly correlated modules (R^2 >= 0.5) for Workers:
## Here are the gene counts:
## Blue N = 3565
## Green N = 333
## Tan N = 102
## Greenyellow N = 124
What are the genes in the top modules for Queens?