Testing the effect of hypervariable loci on population clustering

Kevin Keenan

The question

Do hypervariable loci make it harder to detect population structure using clustering methods?

The data

The simulated data follow exactly from an answer posted by Craig Echt in response to this question.

Confirming Craig's \( F_{st} \) and \( D_{Jost} \) values

In his answer, Craig provides the \( F_{st} \) and \( D_{Jost} \) values, estimated from his simulated data. I have replicated this using the diveRsity package in R.

The actual input parameters for easypop are:

Low diveRsity
   --------------------------------------------------------------------------------
   EASYPOP (v. 2.0.1)
   Author: F. Balloux
   Available at: http://www.unil.ch/izea/softwares/easypop.html
   --------------------------------------------------------------------------------

   Ploidy level ? (0=haplo-diploid; 1=haploid; 2=diploid)
   2
   Two sexes?:y/n
   y
   Mating system?
   1= Random mating
   2= Polygyny
   3= Monogyny
   1
   Number of populations ?
   2
   Same number of individuals in each population ?:y/n
   y
   Number of females in each population ?
   500
   Number of males in each population?
   500
   Same migration scheme over all simulation? (y/n)
   y
   proportion of female migration?(between 0 and 1)
   0
   proportion of male migration?(between 0 and 1)
   0
   Number of loci ?
   20
   Free recombination between loci?:y/n
   y
   Do all loci have the same mutation scheme?:y/n
   y
   Mutation rate ?(between 0 and 1, e.g. 0.0001) 
   0.01
   Mutation model
   1= Kam,(same probability to mutate to any allelic state)
   2= Ssm, (single step mutation model)
   3= Mixed model of Ssm with a proportion of Kam mutation events)
   4= Mixed model of Ssm with a proportion of double step mutation events)
   2
   Number of possible allelic states ?(below 1000) 
   3
   Variability of the initial population
   1= Maximal, (randomly assigned alleles)
   2= Minimal, (all individuals start with the same allele)
   2
   Number of generations ?
   1000
   Do you want the complete dataset in the '.dat' and '.gen' result files ?:y/n
   y
   Do you want a file giving the complete pedigree of the simulation ?:y/n
   (Please notice that this file can be very huge and will slow down simulations)
   n
   Name of the file?
   rg-3allele-
High diversity
   -----------------------------------------------------------------------------
   EASYPOP (v. 2.0.1)
   Author: F. Balloux
   Available at: http://www.unil.ch/izea/softwares/easypop.html
   -----------------------------------------------------------------------------

   Ploidy level ? (0=haplo-diploid; 1=haploid; 2=diploid)
   2
   Two sexes?:y/n
   y
   Mating system?
   1= Random mating
   2= Polygyny
   3= Monogyny
   1
   Number of populations ?
   2
   Same number of individuals in each population ?:y/n
   y
   Number of females in each population ?
   500
   Number of males in each population?
   500
   Same migration scheme over all simulation? (y/n)
   y
   proportion of female migration?(between 0 and 1)
   0
   proportion of male migration?(between 0 and 1)
   0
   Number of loci ?
   20
   Free recombination between loci?:y/n
   y
   Do all loci have the same mutation scheme?:y/n
   y
   Mutation rate ?(between 0 and 1, e.g. 0.0001) 
   0.01
   Mutation model
   1= Kam,(same probability to mutate to any allelic state)
   2= Ssm, (single step mutation model)
   3= Mixed model of Ssm with a proportion of Kam mutation events)
   4= Mixed model of Ssm with a proportion of double step mutation events)
   2
   Number of possible allelic states ?(below 1000) 
   100
   Variability of the initial population
   1= Maximal, (randomly assigned alleles)
   2= Minimal, (all individuals start with the same allele)
   2
   Number of generations ?
   1000
   Do you want the complete dataset in the '.dat' and '.gen' result files ?:y/n
   y
   Do you want a file giving the complete pedigree of the simulation ?:y/n
   (Please notice that this file can be very huge and will slow down simulations)
   n
   Name of the file?
   rg-100allele-

Analysis

load diveRsity (must be installed)

library(diveRsity)

Read both files (containing 1000 ind for 2 pops) and check the number of alleles

lowDiv <- readGenepop("n1000-3alleles.gen", gp = 2)
highDiv <- readGenepop("n1000-100alleles.gen", gp = 3)

Check population sizes

lowDiv$pop_sizes
[1] 1000 1000
highDiv$pop_sizes
[1] 1000 1000

Check the number of alleles per locus across populations

sapply(lowDiv$all_alleles, length)
 [1] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
sapply(highDiv$all_alleles, length)
 [1] 16 16 14 15 16 18 19 16 17 13 14 15 16 16 17 16 17 13 14 17

Despite the max number of alleles in the high diversity simulation being set to 100, most loci maintain only a fraction of this number. Most alleles are lost by drift. The difference in the number of alleles is still sufficient to test the respective discriminatory power of each set of loci.

Now lets look at the number of unique genotypes per populations

apply(do.call("rbind", lowDiv$pop_list), 2, function(x) {
    length(unique(x))
})
 [1] 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9
apply(do.call("rbind", highDiv$pop_list), 2, function(x) {
    length(unique(x))
})
 [1] 168 180 186 186 171 253 245 166 201 124 165 194 186 197 215 177 203
[18] 106 146 147

Clearly we can see that there are many more possible genotypes in the high diveRsity populations

True diversity values (calculated from populations)

First, lets calculate \( F_{st} \) (Weir & Cockerham, 1984) and \( D_{Jost} \) (Jost, 2008) from the base populations.

lowPop <- fastDivPart("n1000-3alleles.gen", gp = 2, WC_Fst = TRUE)
highPop <- fastDivPart("n1000-100alleles.gen", gp = 3, WC_Fst = TRUE)

Lets look at the locus and global values returned

lowPop$estimate[, c("D_Jost_est", "Fst_WC")]
       D_Jost_est Fst_WC
loc-1      0.0621 0.0316
loc-2      0.0059 0.0027
loc-3      0.0113 0.0051
loc-4      0.0510 0.0274
loc-5      0.0214 0.0120
loc-6      0.0359 0.0175
loc-7      0.0776 0.0398
loc-8      0.0285 0.0160
loc-9      0.0213 0.0118
loc-10     0.0303 0.0155
loc-11     0.0236 0.0124
loc-12     0.0065 0.0028
loc-13     0.0336 0.0168
loc-14     0.1083 0.0562
loc-15     0.1572 0.0843
loc-16     0.0359 0.0174
loc-17     0.1030 0.0525
loc-18     0.1738 0.0962
loc-19     0.0371 0.0190
loc-20     0.0611 0.0317
Global     0.0307 0.0288
highPop$estimate[, c("D_Jost_est", "Fst_WC")]
       D_Jost_est Fst_WC
loc-1      0.2370 0.0338
loc-2      0.2374 0.0287
loc-3      0.2924 0.0338
loc-4      0.3991 0.0548
loc-5      0.2509 0.0293
loc-6      0.3438 0.0402
loc-7      0.6610 0.0808
loc-8      0.8003 0.1163
loc-9      0.3611 0.0479
loc-10     0.2457 0.0404
loc-11     0.3135 0.0434
loc-12     0.2399 0.0298
loc-13     0.3578 0.0515
loc-14     0.5259 0.0791
loc-15     0.2084 0.0273
loc-16     0.4886 0.0637
loc-17     0.5528 0.0795
loc-18     0.1340 0.0214
loc-19     0.2885 0.0492
loc-20     0.5235 0.0852
Global     0.3092 0.0523

Repeat the above steps for the samples

Now lets see how the sample numbers compare to the base population numbers and to Craig's.

lowDivSamp <- readGenepop("n30-3alleles.gen", gp = 2)
highDivSamp <- readGenepop("n30-100alleles.gen", gp = 3)

sample sizes

lowDivSamp$pop_sizes
[1] 30 30
highDivSamp$pop_sizes
[1] 30 30

Number of alleles per locus across samples

sapply(lowDivSamp$all_alleles, length)
 [1] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
sapply(highDivSamp$all_alleles, length)
 [1] 14 12 13 15 12 17 17 15 15 11 12 14 16 14 15 13 15 10 12 13

Number of unique genotypes across samples per locus

apply(do.call("rbind", lowDivSamp$pop_list), 2, function(x) {
    length(unique(x))
})
 [1] 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9
apply(do.call("rbind", highDivSamp$pop_list), 2, function(x) {
    length(unique(x))
})
 [1] 40 47 44 43 47 46 49 49 47 36 45 45 48 45 50 45 45 36 38 41

The clear distinction here is the massively reduced number of unique genotypes in the samples.

Sample estimates

lowSamp <- fastDivPart("n30-3alleles.gen", gp = 2, WC_Fst = TRUE)
highSamp <- fastDivPart("n30-100alleles.gen", gp = 3, WC_Fst = TRUE)
lowSamp$estimate[, c("D_Jost_est", "Fst_WC")]
       D_Jost_est  Fst_WC
loc-1      0.0409  0.0056
loc-2      0.0191 -0.0049
loc-3     -0.0024 -0.0146
loc-4      0.0786  0.0323
loc-5      0.0034 -0.0153
loc-6      0.0073 -0.0101
loc-7      0.0929  0.0341
loc-8      0.0581  0.0146
loc-9     -0.0228 -0.0281
loc-10    -0.0062 -0.0188
loc-11    -0.0125 -0.0236
loc-12     0.0221 -0.0046
loc-13     0.1089  0.0383
loc-14     0.2455  0.1283
loc-15     0.0708  0.0225
loc-16    -0.0235 -0.0271
loc-17     0.1789  0.0779
loc-18     0.1504  0.0645
loc-19     0.1364  0.0611
loc-20     0.2352  0.1171
Global     0.0281  0.0235
highSamp$estimate[, c("D_Jost_est", "Fst_WC")]
       D_Jost_est Fst_WC
loc-1      0.1035 0.0156
loc-2      0.2325 0.0277
loc-3      0.3206 0.0356
loc-4      0.3016 0.0355
loc-5      0.0724 0.0081
loc-6      0.3426 0.0383
loc-7      0.6150 0.0678
loc-8      0.7769 0.0976
loc-9      0.3960 0.0604
loc-10     0.1414 0.0258
loc-11     0.3812 0.0504
loc-12     0.1475 0.0168
loc-13     0.3541 0.0521
loc-14     0.5783 0.0855
loc-15     0.2653 0.0351
loc-16     0.4950 0.0551
loc-17     0.5363 0.0776
loc-18     0.0314 0.0066
loc-19     0.2107 0.0374
loc-20     0.4352 0.0676
Global     0.2513 0.0454

We can check that the sample estimate 95% confidence intervals contain the population values for \( F_{st} \) and \( D_{Jost} \). (Warning: takes about 4 mins using 4 CPU, due to bootstrapping)

lowSampCI <- fastDivPart("n30-3alleles.gen", gp = 2, WC_Fst = TRUE, bs_pairwise = TRUE, 
    bootstraps = 1000, parallel = TRUE)
highSampCI <- fastDivPart("n30-100alleles.gen", gp = 3, WC_Fst = TRUE, bs_pairwise = TRUE, 
    bootstraps = 1000, parallel = TRUE)

low diversity CI's

\( D_{Jost} \)
lowSampCI$bs_pairwise$djostEst[, c("BC_Lower_95%CI", "BC_Upper_95%CI")]
BC_Lower_95%CI BC_Upper_95%CI 
      0.007361       0.056221 
\( F_{st} \)
lowSampCI$bs_pairwise$thetaWC[, c("BC_Lower_95%CI", "BC_Upper_95%CI")]
BC_Lower_95%CI BC_Upper_95%CI 
       0.02015        0.05964 

High diveRsity CI's

\( D_{Jost} \)
highSampCI$bs_pairwise$djostEst[, c("BC_Lower_95%CI", "BC_Upper_95%CI")]
BC_Lower_95%CI BC_Upper_95%CI 
        0.2006         0.3124 
\( F_{st} \)
highSampCI$bs_pairwise$thetaWC[, c("BC_Lower_95%CI", "BC_Upper_95%CI")]
BC_Lower_95%CI BC_Upper_95%CI 
       0.03770        0.05669 

All good! All population parameters fall within the sample 95% confidence intervals and the values appear to reflect the same pattern observed by Craig.

Clustering

Now what about clustering? For simplicity, I will use the multivariate clustering methods provided in the adegenet package. Namely, k-means is used to identify cluster assignment, which makes no assumptions about the underlying evolutionary model responsible for structure within the data.

library(adegenet)
Loading required package: ade4
   ==========================
    adegenet 1.3-9.2 is loaded
   ==========================

 - to start, type '?adegenet'
 - to browse adegenet website, type 'adegenetWeb()'
 - to post questions/comments: adegenet-forum@lists.r-forge.r-project.org
adLow <- read.genepop("n30-3alleles.gen")

 Converting data from a Genepop .gen file to a genind object... 


File description:  2    20  3   2                                                                    

...done.
adHigh <- read.genepop("n30-100alleles.gen")

 Converting data from a Genepop .gen file to a genind object... 


File description:  2  20    100 3                                                                    

...done.
adLow

   #####################
   ### Genind object ### 
   #####################
- genotypes of individuals - 

S4 class:  genind
@call: read.genepop(file = "n30-3alleles.gen")

@tab:  60 x 60 matrix of genotypes

@ind.names: vector of  60 individual names
@loc.names: vector of  20 locus names
@loc.nall: number of alleles per locus
@loc.fac: locus factor for the  60 columns of @tab
@all.names: list of  20 components yielding allele names for each locus
@ploidy:  2
@type:  codom

Optionnal contents: 
@pop:  factor giving the population of each individual
@pop.names:  factor giving the population of each individual

@other: - empty -
adHigh

   #####################
   ### Genind object ### 
   #####################
- genotypes of individuals - 

S4 class:  genind
@call: read.genepop(file = "n30-100alleles.gen")

@tab:  60 x 275 matrix of genotypes

@ind.names: vector of  60 individual names
@loc.names: vector of  20 locus names
@loc.nall: number of alleles per locus
@loc.fac: locus factor for the  275 columns of @tab
@all.names: list of  20 components yielding allele names for each locus
@ploidy:  2
@type:  codom

Optionnal contents: 
@pop:  factor giving the population of each individual
@pop.names:  factor giving the population of each individual

@other: - empty -

Fix the data

Because EASYPOP writes all individual names as a number corresponding to its population assignment, we need to convert this to unique individual id's.

pop1Names <- paste("pop1.", 1:30, sep = "")
pop2Names <- paste("pop2.", 1:30, sep = "")
adLow@ind.names <- c(pop1Names, pop2Names)
adHigh@ind.names <- c(pop1Names, pop2Names)
adLow@ind.names
 [1] "pop1.1"  "pop1.2"  "pop1.3"  "pop1.4"  "pop1.5"  "pop1.6"  "pop1.7" 
 [8] "pop1.8"  "pop1.9"  "pop1.10" "pop1.11" "pop1.12" "pop1.13" "pop1.14"
[15] "pop1.15" "pop1.16" "pop1.17" "pop1.18" "pop1.19" "pop1.20" "pop1.21"
[22] "pop1.22" "pop1.23" "pop1.24" "pop1.25" "pop1.26" "pop1.27" "pop1.28"
[29] "pop1.29" "pop1.30" "pop2.1"  "pop2.2"  "pop2.3"  "pop2.4"  "pop2.5" 
[36] "pop2.6"  "pop2.7"  "pop2.8"  "pop2.9"  "pop2.10" "pop2.11" "pop2.12"
[43] "pop2.13" "pop2.14" "pop2.15" "pop2.16" "pop2.17" "pop2.18" "pop2.19"
[50] "pop2.20" "pop2.21" "pop2.22" "pop2.23" "pop2.24" "pop2.25" "pop2.26"
[57] "pop2.27" "pop2.28" "pop2.29" "pop2.30"
adHigh@ind.names
 [1] "pop1.1"  "pop1.2"  "pop1.3"  "pop1.4"  "pop1.5"  "pop1.6"  "pop1.7" 
 [8] "pop1.8"  "pop1.9"  "pop1.10" "pop1.11" "pop1.12" "pop1.13" "pop1.14"
[15] "pop1.15" "pop1.16" "pop1.17" "pop1.18" "pop1.19" "pop1.20" "pop1.21"
[22] "pop1.22" "pop1.23" "pop1.24" "pop1.25" "pop1.26" "pop1.27" "pop1.28"
[29] "pop1.29" "pop1.30" "pop2.1"  "pop2.2"  "pop2.3"  "pop2.4"  "pop2.5" 
[36] "pop2.6"  "pop2.7"  "pop2.8"  "pop2.9"  "pop2.10" "pop2.11" "pop2.12"
[43] "pop2.13" "pop2.14" "pop2.15" "pop2.16" "pop2.17" "pop2.18" "pop2.19"
[50] "pop2.20" "pop2.21" "pop2.22" "pop2.23" "pop2.24" "pop2.25" "pop2.26"
[57] "pop2.27" "pop2.28" "pop2.29" "pop2.30"

Running the clustering function

We use the function find.clusters with a max number of clusters set to 10

Low diversity

The method runs a PCA on the data then we must choose the number of principal components to retain for the cluster analysis (30 seem to explain most of the variation in the low diversity example and 50 for the high). When we choose the number of PC axes, we choose the number of clusters from the plot of BIC (Bayesian Information Criterion) against cluster. Generally the 'elbow' in the plot (i.e. the point at which BIC starts to increase after reaching its minimum. In this instance, 2 appears to be the most suitable for both the low diversity and high examples.

Low diversity clustering

find.clusters(adLow, max.n.clust = 10)

PCA plot

lowpc

BIC plot

lowbic

lowGrps <- find.clusters(adLow, max.n.clust = 10, n.pca = 30, n.clust = 2)

High diversity clustering

find.clusters(adHigh, max.n.clust = 10)

PCA plot

highpc

BIC plot

highbic

highGrps <- find.clusters(adHigh, max.n.clust = 10, n.pca = 50, n.clust = 2)

Clustering accuracy

Let's see how well the clustering method has done when compared to the 'true' population of origin of the samples (which is known in this case.)

low diversity

table.value(table(adLow@pop, lowGrps$grp), col.lab = paste("inferred", 1:2), 
    row.lab = paste("True", 1:2))

plot of chunk unnamed-chunk-11

High diversity

table.value(table(adHigh@pop, highGrps$grp), col.lab = paste("inferred", 1:2), 
    row.lab = paste("True", 1:2))

plot of chunk unnamed-chunk-12

The really interesting thing here is that the high diversity samples perform way better than the low. There are no mis-assigned individuals.

N.B. Inferred cluster assignment is arbitrary.

Now I wonder if this is because, as shown above, our actual number of alleles in the populations is much lower than 100?

Just for fun, I have re-simulated the data, this time setting the number of possible allelic states to 999 (the max allowed) and started each population with maximal diversity. Below are the clustering results for the sample data (n=30 again) from this simulation.

Actual input parameters for easypop

   -----------------------------------------------------------------------------
   EASYPOP (v. 2.0.1)
   Author: F. Balloux
   Available at: http://www.unil.ch/izea/softwares/easypop.html
   -----------------------------------------------------------------------------

   Ploidy level ? (0=haplo-diploid; 1=haploid; 2=diploid)
   2
   Two sexes?:y/n
   y
   Mating system?
   1= Random mating
   2= Polygyny
   3= Monogyny
   1
   Number of populations ?
   2
   Same number of individuals in each population ?:y/n
   y
   Number of females in each population ?
   500
   Number of males in each population?
   500
   Same migration scheme over all simulation? (y/n)
   y
   proportion of female migration?(between 0 and 1)
   0
   proportion of male migration?(between 0 and 1)
   0
   Number of loci ?
   20
   Free recombination between loci?:y/n
   y
   Do all loci have the same mutation scheme?:y/n
   y
   Mutation rate ?(between 0 and 1, e.g. 0.0001) 
   0.01
   Mutation model
   1= Kam,(same probability to mutate to any allelic state)
   2= Ssm, (single step mutation model)
   3= Mixed model of Ssm with a proportion of Kam mutation events)
   4= Mixed model of Ssm with a proportion of double step mutation events)
   2
   Number of possible allelic states ?(below 1000) 
   999
   Variability of the initial population
   1= Maximal, (randomly assigned alleles)
   2= Minimal, (all individuals start with the same allele)
   1
   Number of generations ?
   1000
   Do you want the complete dataset in the '.dat' and '.gen' result files ?:y/n
   y
   Do you want a file giving the complete pedigree of the simulation ?:y/n
   (Please notice that this file can be very huge and will slow down simulations)
   n
   Name of the file?
   n1000-maxAlleles-
Explore the population data

How many alleles are there in the population and the samples, per locus?

Populations
maxDiv <- readGenepop("n1000-maxAlleles.gen", gp = 3)
sapply(maxDiv$all_alleles, length)
 [1] 67 72 53 55 63 56 66 48 53 69 65 57 62 64 59 65 57 62 52 56

Unique genotypes across populations

apply(do.call("rbind", maxDiv$pop_list), 2, function(x) {
    length(unique(x))
})
 [1] 832 834 583 695 690 688 703 546 579 846 809 706 782 700 690 696 641
[18] 741 519 697
Samples
maxDivSamp <- readGenepop("n30-maxAlleles.gen", gp = 3)
sapply(maxDivSamp$all_alleles, length)
 [1] 39 40 34 39 38 34 37 30 32 44 43 37 38 33 35 36 38 40 30 36

Unique genotypes across populations

apply(do.call("rbind", maxDivSamp$pop_list), 2, function(x) {
    length(unique(x))
})
 [1] 58 59 55 58 57 57 54 52 56 55 58 58 55 55 54 53 54 54 49 58

Clustering

adMax <- read.genepop("n30-maxAlleles.gen")

 Converting data from a Genepop .gen file to a genind object... 


File description:  2                                                                                 

...done.
adMax

   #####################
   ### Genind object ### 
   #####################
- genotypes of individuals - 

S4 class:  genind
@call: read.genepop(file = "n30-maxAlleles.gen")

@tab:  60 x 733 matrix of genotypes

@ind.names: vector of  60 individual names
@loc.names: vector of  20 locus names
@loc.nall: number of alleles per locus
@loc.fac: locus factor for the  733 columns of @tab
@all.names: list of  20 components yielding allele names for each locus
@ploidy:  2
@type:  codom

Optionnal contents: 
@pop:  factor giving the population of each individual
@pop.names:  factor giving the population of each individual

@other: - empty -
adMax@ind.names <- c(pop1Names, pop2Names)

In this instance, almost all PCs (60) have to be retained to explain as much variation as possible. Again the elbow of the 'BIC vs Number of Clusters' graph, suggests that \( k = 2 \).

maxGrps <- find.clusters(adMax, max.n.clust = 10, n.pca = 60, n.clust = 2)

Clustering accuracy

table.value(table(adMax@pop, maxGrps$grp), col.lab = paste("inferred", 1:2), 
    row.lab = paste("True", 1:2))

plot of chunk unnamed-chunk-21

Again the clustering is 100% accurate.

Discussion

It seems that both levels of diversity perform well. Low diversity samples have lower accuracy, but the experiment may be biased in that the mutation rate is high (0.01), while there is also a strong constraint on the number of possible alleles. This results in a large number of homoplasious mutations at these loci, leading to a masking of divergence due to drift (i.e. alleles are constantly being re-introduced leading to more similar frequencies among populations.).

On the other hand, mutations at the high diversity loci contribute to the dissimilarity of populations. Many alleles will be private to single populations, and I suspect that the increased accuracy of high diversity loci when using the clustering method used here, is due to these private alleles.

It would be interesting to see if the same patterns are observed for STRUCTURE, but alas, lunch is over and I have to get back to real work.

Conclusions

It appears from these simulations that having many alleles provides much more clustering power (at least in the case of this multivariate approach).