diveRsity v1.8.0 update: fastDivPart example

Kevin Keenan 2014

Introduction

As of 05/02/14, version 1.8.0 of the diveRsity is available for download from github as follows:

library(devtools)
install_github("diveRsity", "kkeenan02")

This version contains a couple of major changes, most notably, the deprecation of divPart, the original function used to calculate diversity and differentiation statistics. This function has been completely replaced by fastDivPart, which has much more compact and efficient code, as well as more sensible data output structures.

This web page will demonstrate some downstream uses of a newly returned data structure, pairwise locus parameter estimates, from fastDivPart. Specifically, we will see how locus pairwise differences can be visualised using ggplot2.

Using fastDivPart to visualise locus pairwise differentiation

For this demonstration we will used Test_data, which is distributed with diveRsity. These data can be loaded into the R session as follows:

library(diveRsity)
data(Test_data)

We can do a couple of preliminary inspections of the data using readGenepop, before calculating differentiation statistics.

myData <- readGenepop(infile = Test_data)
# how many population samples do we have?
myData$npops
[1] 6
# how many loci have been scored?
myData$nloci
[1] 37

Let’s have a look at some plots of the allele frequency differences across samples.

# load ggplot2 for plotting
library(ggplot2)
library(reshape2)
# extract allele frequencies for locus 1
loc1 <- myData$allele_freq[[1]]
# add pop names to data
colnames(loc1) <- myData$pop_names
# organise the data for plotting with ggplot
loc1 <- melt(loc1)
loc1
   Var1  Var2   value
1   258 pop1, 0.00000
2   264 pop1, 0.00000
3   280 pop1, 0.28261
4   365 pop1, 0.01087
5   369 pop1, 0.29348
6   397 pop1, 0.00000
7   539 pop1, 0.41304
8   258 pop2, 0.00000
9   264 pop2, 0.00000
10  280 pop2, 0.25000
11  365 pop2, 0.05000
12  369 pop2, 0.18750
13  397 pop2, 0.02500
14  539 pop2, 0.48750
15  258 pop3, 0.00000
16  264 pop3, 0.00000
17  280 pop3, 0.28049
18  365 pop3, 0.04878
19  369 pop3, 0.30488
20  397 pop3, 0.02439
21  539 pop3, 0.34146
22  258 pop4, 0.00000
23  264 pop4, 0.00000
24  280 pop4, 0.14423
25  365 pop4, 0.05769
26  369 pop4, 0.38462
27  397 pop4, 0.00000
28  539 pop4, 0.41346
29  258 pop5, 0.01220
30  264 pop5, 0.01220
31  280 pop5, 0.08537
32  365 pop5, 0.06098
33  369 pop5, 0.50000
34  397 pop5, 0.00000
35  539 pop5, 0.32927
36  258 pop6, 0.00000
37  264 pop6, 0.00000
38  280 pop6, 0.18293
39  365 pop6, 0.03659
40  369 pop6, 0.48780
41  397 pop6, 0.00000
42  539 pop6, 0.29268
# plot the frequencies as stacked bar charts

ggplot(loc1, aes(x = factor(Var2), y = value, fill = factor(Var1))) + 
  geom_bar(stat = "identity")

plot of chunk unnamed-chunk-6

# repeat for locus 2
loc2 <- myData$allele_freq[[2]]
# add pop names to data
colnames(loc2) <- myData$pop_names
# organise the data for plotting with ggplot
loc2 <- melt(loc2)
loc2
   Var1  Var2   value
1   162 pop1, 0.03191
2   164 pop1, 0.44681
3   167 pop1, 0.52128
4   162 pop2, 0.00000
5   164 pop2, 0.52381
6   167 pop2, 0.47619
7   162 pop3, 0.00000
8   164 pop3, 0.48780
9   167 pop3, 0.51220
10  162 pop4, 0.00000
11  164 pop4, 0.46078
12  167 pop4, 0.53922
13  162 pop5, 0.00000
14  164 pop5, 0.42683
15  167 pop5, 0.57317
16  162 pop6, 0.00000
17  164 pop6, 0.41463
18  167 pop6, 0.58537
# plot the frequencies as stacked bar charts

ggplot(loc2, aes(x = factor(Var2), y = value, fill = factor(Var1))) + 
  geom_bar(stat = "identity")

plot of chunk unnamed-chunk-6

Now we can calculate our pairwise locus statistics:

system.time({
myStats <- fastDivPart(infile = Test_data, WC_Fst = TRUE, bs_pairwise = TRUE, 
                       pairwise = TRUE, bootstraps = 1000, parallel = TRUE)
})
   user  system elapsed 
  9.780   0.962 578.406 

Let's inspect the output of fastDivPart.

names(myStats)
[1] "standard"         "estimate"         "pairwise"         "meanPairwise"     "bs_pairwise"      "bs_pairwise_loci"

Each element in myStats contains information about a range of statistics, often in different formats. For example myStats$estimate is a data.frame containing locus and global differentiation/diversity partitioning statistics, as well as some additional information.

myStats$estimate
        Harmonic_N H_st_est D_st_est G_st_est G_hed_st_est D_Jost_est  Fst_WC  Fit_WC
Locus1       43.12   0.0482   0.0160   0.0234       0.0799     0.0578  0.0260  0.0506
Locus2       43.52  -0.0038  -0.0019  -0.0038      -0.0084    -0.0046 -0.0042 -0.0649
Locus3       43.64   0.3610   0.0566   0.0629       0.4688     0.4332  0.0753  0.0896
Locus4       43.45   0.0700   0.0225   0.0321       0.1134     0.0840  0.0361  0.0449
Locus5       42.77   0.1998   0.0177   0.0191       0.2542     0.2397  0.0224  0.0646
Locus6       43.45   0.1201   0.0228   0.0274       0.1675     0.1441  0.0303  0.2179
Locus7       43.45   0.0382   0.0142   0.0221       0.0670     0.0459  0.0261  0.0318
Locus8       43.26   0.0224   0.0168   0.0632       0.0884     0.0268  0.0696  0.2463
Locus9       43.07   0.2703   0.0153   0.0160       0.3352     0.3244  0.0191  0.0483
Locus10      43.25   0.3237   0.0439   0.0483       0.4181     0.3885  0.0570  0.0890
Locus11      41.95   0.0339   0.0075   0.0095       0.0499     0.0407  0.0104  0.0579
Locus12      41.52   0.3824   0.0330   0.0349       0.4777     0.4589  0.0410  0.0630
Locus13      41.10   0.3815   0.1037   0.1247       0.5254     0.4578  0.1481  0.1301
Locus14      41.05   0.1793   0.0231   0.0258       0.2355     0.2152  0.0298  0.0674
Locus15      41.84   0.0617   0.0279   0.0485       0.1189     0.0740  0.0543  0.0957
Locus16      41.51   0.4234   0.0754   0.0840       0.5494     0.5080  0.0990  0.0954
Locus17      41.84   0.0483   0.0176   0.0270       0.0833     0.0579  0.0315  0.0260
Locus18      41.66   0.3062   0.0211   0.0222       0.3815     0.3675  0.0250  0.0661
Locus19      41.77   0.3337   0.0285   0.0302       0.4186     0.4005  0.0349  0.0740
Locus20      41.28   0.2971   0.0265   0.0283       0.3747     0.3565  0.0332  0.0659
Locus21      41.95   0.0831   0.0256   0.0357       0.1319     0.0997  0.0400  0.1491
Locus22      42.43   0.1773   0.0220   0.0245       0.2320     0.2127  0.0279  0.0758
Locus23      41.70   0.2854   0.0771   0.0955       0.4054     0.3425  0.1130  0.1223
Locus24      42.13   0.1879   0.0440   0.0543       0.2675     0.2254  0.0637  0.1493
Locus25      42.86   0.1413   0.0480   0.0678       0.2259     0.1696  0.0817  0.0252
Locus26      42.25   0.0568   0.0264   0.0470       0.1120     0.0681  0.0542  0.0742
Locus27      41.36   0.1340   0.0238   0.0281       0.1844     0.1608  0.0309  0.2046
Locus28      42.30   0.0468   0.0245   0.0489       0.1023     0.0561  0.0550  0.1426
Locus29      42.69   0.2060   0.0232   0.0255       0.2664     0.2472  0.0286  0.0815
Locus30      40.27   0.1275   0.0282   0.0349       0.1826     0.1531  0.0368  0.3774
Locus31      42.00   0.3393   0.0303   0.0322       0.4263     0.4072  0.0371  0.0685
Locus32      41.22   0.6250   0.0235   0.0238       0.7560     0.7500  0.0271  0.0923
Locus33      41.48   0.3389   0.0243   0.0255       0.4218     0.4067  0.0293  0.0675
Locus34      40.48   0.2607   0.0676   0.0836       0.3703     0.3128  0.0950  0.2614
Locus35      42.69   0.0395   0.0182   0.0326       0.0785     0.0474  0.0368  0.0687
Locus36      42.35   0.1478   0.0305   0.0370       0.2079     0.1774  0.0435  0.0621
Locus37      42.12   0.0281   0.0170   0.0413       0.0736     0.0337  0.0511  0.0965
Global          NA       NA       NA   0.0397       0.1806     0.1462  0.0461  0.0985

The newest output structure from fastDivPart is bs_pairwise_loci (or pw_locus if pairwise = TRUE and bs_pairwise = FALSE). This element of myStats is a list of either three lists (when WC_Fst = FALSE; default) or four lists (when WC_Fst = TRUE). Each of these ¾ lists contains dataframes for each locus at each of the statistics, \( G_{ST} \) (Nei & Chesser, 1983), \( G'_{ST} \) (Hedrick, 2005), \( D_{Jost} \) (Jost, 2008), and/or \( \theta \) (Weir & Cockerham, 1984). Let's explore the structure of this element a bit more.

# what elements are in `myStats$bs_pairwise_loci`?
names(myStats$bs_pairwise_loci)
[1] "gstEst"    "gstEstHed" "djostEst"  "thetaWC"  

We can see, as expected, there are four elements (since WC_Fst = TRUE). Each of these elements will have identical structures. So let's look closer at thetaWC.

# to save typing, we can assign thetaWC to a variable name
locFst <- myStats$bs_pairwise_loci$thetaWC
# names of each element in the list
names(locFst)
 [1] "Locus1"  "Locus2"  "Locus3"  "Locus4"  "Locus5"  "Locus6"  "Locus7"  "Locus8"  "Locus9"  "Locus10" "Locus11"
[12] "Locus12" "Locus13" "Locus14" "Locus15" "Locus16" "Locus17" "Locus18" "Locus19" "Locus20" "Locus21" "Locus22"
[23] "Locus23" "Locus24" "Locus25" "Locus26" "Locus27" "Locus28" "Locus29" "Locus30" "Locus31" "Locus32" "Locus33"
[34] "Locus34" "Locus35" "Locus36" "Locus37"

The names of each element is determined by the locus names in our input file, so if we know loci well, and want to inspect one in particular, we just need to know its name.

# look at locus 1
locFst$Locus1
                   actual     lower   upper
pop1, vs. pop2,  0.001996 -0.019811 0.04195
pop1, vs. pop3, -0.006733 -0.020736 0.02917
pop1, vs. pop4,  0.010357 -0.015804 0.05780
pop1, vs. pop5,  0.054282  0.002620 0.12810
pop1, vs. pop6,  0.032709 -0.014660 0.11357
pop2, vs. pop3,  0.011998 -0.017740 0.06429
pop2, vs. pop4,  0.026874 -0.012541 0.08832
pop2, vs. pop5,  0.089735  0.016991 0.17745
pop2, vs. pop6,  0.076395  0.002188 0.17600
pop3, vs. pop4,  0.010266 -0.017157 0.05835
pop3, vs. pop5,  0.043663 -0.002548 0.11648
pop3, vs. pop6,  0.019755 -0.021314 0.08440
pop4, vs. pop5,  0.008034 -0.018876 0.06095
pop4, vs. pop6,  0.008256 -0.021760 0.06780
pop5, vs. pop6, -0.003864 -0.022863 0.03001

We can see that locFst$Locus1 is a dataframe, with three columns (“actual”, “lower”, “upper”), and in this case, 15 rows (1 for each possible pairwise comparisons). If we were interested in visually inspecting how this locus was behaving among pairwise comparisons, we might like to generate a plot of the actual \( F_{ST} \) values as well as their 95% CIs (bias corrected). We could do this as follows:

df <- locFst$Locus1
ggplot(df, aes(x = rownames(df), y = actual))+
  geom_abline(aes(intercept = mean(df$actual, na.rm = TRUE), slope = 0), 
              colour = "red", lwd = 2) +
  geom_point()+
  theme(axis.text.x=element_text(angle=-90))+
  geom_errorbar(aes(ymin=lower, ymax = upper), width = 0.3)

plot of chunk unnamed-chunk-12

While the plot, in this instance, is quite easy to interpret, in many cases, where more than 10 population samples are screened, the plot will be quite crowded.

References

Contacts

For any further questions on the use of diveRsity, or to report a bug, please send an email to kkeenan qub.ac.uk

Reproducibility

R version 3.0.2 (2013-09-25)
Platform: x86_64-pc-linux-gnu (64-bit)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] reshape2_1.2.2  ggplot2_0.9.3.1 diveRsity_1.8.0 plotrix_3.5-2   knitr_1.5      

loaded via a namespace (and not attached):
 [1] bitops_1.0-6       caTools_1.16       cluster_1.14.4     colorspace_1.2-4   dichromat_2.0-0    digest_0.6.4      
 [7] ellipse_0.3-8      evaluate_0.5.1     formatR_0.10       Formula_1.1-1      grid_3.0.2         gtable_0.1.2      
[13] Hmisc_3.13-0       httpuv_1.2.1       igraph_0.6.6       jpeg_0.1-6         labeling_0.2       lattice_0.20-24   
[19] lavaan_0.5-15      MASS_7.3-29        mnormt_1.4-7       munsell_0.4.2      pbivnorm_0.5-1     plyr_1.8          
[25] png_0.1-7          proto_0.3-10       psych_1.3.12       qgraph_1.2.3       quadprog_1.5-5     RColorBrewer_1.0-5
[31] RJSONIO_1.0-3      scales_0.2.3       sem_3.1-3          shiny_0.8.0        splines_3.0.2      stats4_3.0.2      
[37] stringr_0.6.2      survival_2.37-4    tools_3.0.2        xtable_1.7-1