The problem of bootstrapping bias for measurements of genetic differentiation

A new version of diveRsity is now available for download from ‘CRAN’. It can also be downloaded using the method described here. While most of the updates in this version are minor, some are more substantial. One of these is the inclusion of an automated bias correction for confidence intervals (CIs) estimated for pairwise differentiation. Below is a coded demonstration of the problem of bias when estimating confidence intervals for diversity/differentiation statistics from bootstrapped data.

The code

Load diveRsity and calculate pairwise differentiation, including \(95\%\) Confidence Intervals (using 1000 bootstrap iterations).

# load diveRsity
library("diveRsity")
# run fastDivPart on "Test_data" (1000 bootstraps)
data(Test_data)
system.time({
  pwDiff <- fastDivPart(Test_data, pairwise = TRUE, bs_pairwise = TRUE, 
                        fst = TRUE, boots = 1000, para = TRUE)
})

Since Test_data contains genotype data for six population samples, fastDivPart will calculate confidence intervals for 15 separate pairwise estimates of \(G_{ST}\) (Nei & Chesser, 1983), \(G^{'}_{ST}\) (Hedrick, 2005), \(D_{Jost}\) (Jost, 2008) and \(\theta\) (Weir & Cockerham, 1984). We can plot these point estimates along with CIs as follows (N.B. by setting the argument plot = TRUE in the original call, these results will be plotted automatically). #### Plotting the results

# View the pairwise results data
names(pwDiff)

[1] "standard"         "estimate"         "pairwise"         "meanPairwise"     "bs_pairwise"      "bs_pairwise_loci"

We can see that the pairwise bootstrap results are located at bs_pairwise. This object within pwDiff contains separate data tables for each of the four statistics listed above. All data tables have the same structure, so we can just look at one, \(D_{Jost}\).

pwDiff$bs_pairwise$djostEst

                     actual       mean     BC_mean Lower_95%CI Upper_95%CI BC_Lower_95%CI BC_Upper_95%CI
pop1, vs. pop2, 0.002727782 0.03228548 0.002727782  0.01796663  0.05034280    -0.01159106     0.02078511
pop1, vs. pop3, 0.180215774 0.21006514 0.180215774  0.18209947  0.24098631     0.15225010     0.21113695
pop1, vs. pop4, 0.148375499 0.17729039 0.148375499  0.15185972  0.20480094     0.12294482     0.17588604
pop1, vs. pop5, 0.252671904 0.27519907 0.252671904  0.24301773  0.30823364     0.22049056     0.28570647
pop1, vs. pop6, 0.149363694 0.18665998 0.149363694  0.15732519  0.21846185     0.12002891     0.18116556
pop2, vs. pop3, 0.157886422 0.18867971 0.157886422  0.16233699  0.22042989     0.13154370     0.18963661
pop2, vs. pop4, 0.132705322 0.16173993 0.132705322  0.13683289  0.18905418     0.10779828     0.16001958
pop2, vs. pop5, 0.226246698 0.25305945 0.226246698  0.22088704  0.28865219     0.19407429     0.26183944
pop2, vs. pop6, 0.156131517 0.19055254 0.156131517  0.16554041  0.21920270     0.13111939     0.18478168
pop3, vs. pop4, 0.010193798 0.03936538 0.010193798  0.02351001  0.05940221    -0.00566157     0.03023063
pop3, vs. pop5, 0.151361439 0.18106670 0.151361439  0.15001803  0.21572389     0.12031277     0.18601863
pop3, vs. pop6, 0.066347110 0.10008670 0.066347110  0.07703212  0.12616454     0.04329253     0.09242494
pop4, vs. pop5, 0.127186619 0.15365900 0.127186619  0.12449995  0.18556646     0.09802757     0.15909407
pop4, vs. pop6, 0.058582929 0.08975116 0.058582929  0.06807984  0.11910892     0.03691161     0.08794069
pop5, vs. pop6, 0.119937693 0.15263143 0.119937693  0.12346839  0.18641813     0.09077465     0.15372439

We can see that the output for each of the four statistics estimates contains seven data columns. We are interested in comparing the difference between the standard CIs and the bias corrected CIs.

# plot using ggplot2
library(ggplot2)
# create a djost dataframe from our pwDiff results to reduce typing
djost <- as.data.frame(pwDiff$bs_pairwise$djostEst)
# The column names cause problems, so we can change them as follows:
colnames(djost) <- c("actual", "mean", "bcMean", "bucLow", "bucUp", "bcLow", 
                     "bcUp")
# plot the point estimates for each of your four pw comparisons
# with uncorrected and corrected 95% CIs
CI_type <- as.factor(c(rep("uc", nrow(djost)), rep("bc", nrow(djost))))
ggplot(djost, aes(x = c(1:nrow(djost),1:nrow(djost) + 0.5), 
                  y = c(actual, actual), colour = CI_type)) +
  geom_errorbar(aes(ymin = c(bucLow, bcLow), ymax = c(bucUp, bcUp)), 
                width = 0.3) +
  geom_point() +
  ylab(expression(D["Jost"])) +
  xlab("Pairwise Comparison")

A figure demonstrating the difference in bias corrected and uncorrected \(95\% CIs\) derived from 1000 bootstrap iterations. Blue lines represent the uncorrected CI, while pink lines represent the bias corrected CIs.

Of particular note is the fact that often the uncorrected CIs don’t even encompass the sample estimates. This occurs as a result of the known upward bias associated with bootstrapping these types of statistics. Another important point of notes is the increased risk of type I error seen for two of the pairwise comparisons when using uncorrected CIs. In these comparisons, the uncorrected CIs do not overlap \(0\), suggesting differentiation between the two populations is statistically significant. However, following bias correction, these pairwise differentiation estimates are no longer significantly different from \(0\) (i.e. they overlap \(0\)).

From this example, it is clear that correcting for bias in bootstrapped differentiation estimates is important. In line with this, diveRsity now provides automatic bias correction in the fastDivPart function.

N.B. divPart does not have this capability since it is no longer supported.

References

Hedrick, P. (2005) A standardized genetic diﬀerentiation measure. Evolution, 59, 1633–1638
Jost, L. (2008) GST and its relatives do not measure diﬀerentiation. Molecular Ecology, 17, 4015–4026
Jost, L. (2008) GST and its relatives do not measure diﬀerentiation. Molecular Ecology, 17, 4015–4026
Keenan, K., McGinnity, P., Cross, T. F., Crozier, W. W., Prodöhl, P. A. (2013), diveRsity: An R package for the estimation and exploration of population genetics parameters and their associated errors. Methods in Ecology and Evolution, 4: 782–788. doi: 10.1111/2041-210X.12067
Weir, B. & Cockerham, C. (1984) Estimating F-statistics for the analysis of population structure. Evolution, 38, 1358–1370