Ranking tetraploid loci on the basis of their informativeness for the inference of ancestry

Kevin Keenan 2014

Introduction

This document demonstrates how the polyIn function in the diveRsity package can be used to rank polyploid loci on the basis of their informativeness for the inference of ancestry. The application of this statistic is considered at both the global level, where the most informative loci across all population samples are assessed, and at the pairwise level, where a consensus panel of markers is selected on the basis of their informativeness across all pairwise comparisons.

Methods

# load diveRsity
library(diveRsity)
## Loading required package: plotrix

The function of interest for these analyses is polyIn. To find out more about how this function can be used, type the following into the R console.

?polyIn

Input format

Before we can use the polyIn function, we need to make sure that our data are in the correct format. Below is an example of a truncated input file containing three population samples of ten individuals each, scored for ten tetraploid loci.

pop-test
snp1    snp2    snp3    snp4    snp5    snp6    snp7    snp8    snp9    snp10
pop
pop1_1  AABA    AABA    AABA    AABA    AABA    AABA    AABA    AABA    AABA    AABA
pop1_2  AAAA    AAAA    AAAA    AAAA    AAAA    AAAA    AAAA    AAAA    AAAA    AAAA
pop1_3  AABB    AABB    AABB    AABB    AABB    AABB    AABB    AABB    AABB    AABB
pop1_4  BBBB    BBBB    BBBB    BBBB    BBBB    BBBB    BBBB    BBBB    BBBB    BBBB
pop1_5  AABA    AABA    AABA    AABA    AABA    AABA    AABA    AABA    AABA    AABA
pop1_6  ABAB    ABAB    ABAB    ABAB    ABAB    ABAB    ABAB    ABAB    ABAB    ABAB
pop1_7  BBAA    BBAA    BBAA    BBAA    BBAA    BBAA    BBAA    BBAA    BBAA    BBAA
pop1_8  AABB    AABB    AABB    AABB    AABB    AABB    AABB    AABB    AABB    AABB
pop1_9  ABAA    ABAA    ABAA    -9      ABAA    ABAA    ABAA    ABAA    ABAA    ABAA
pop1_10 AAAA    AAAA    AAAA    AAAA    AAAA    AAAA    AAAA    AAAA    AAAA    AAAA
pop
pop2_1  AABB    AABB    AABB    AABB    AABB    AABB    AABB    AABB    AABB    AABB
pop2_2  AABB    AABB    AABB    AABB    AABB    AABB    AABB    AABB    AABB    AABB
pop2_3  BBBB    BBBB    BBBB    BBBB    BBBB    BBBB    BBBB    BBBB    BBBB    BBBB
pop2_4  BBBA    BBBA    BBBA    BBBA    BBBA    BBBA    BBBA    BBBA    BBBA    BBBA
pop2_5  BBBA    BBBA    BBBA    BBBA    BBBA    BBBA    BBBA    BBBA    BBBA    BBBA
pop2_6  BABB    BABB    BABB    BABB    BABB    BABB    BABB    BABB    BABB    BABB
pop2_7  ABBB    ABBB    ABBB    ABBB    ABBB    ABBB    ABBB    ABBB    ABBB    ABBB
pop2_8  AAAB    AAAB    AAAB    AAAB    AAAB    AAAB    AAAB    AAAB    AAAB    AAAB
pop2_9  ABAB    ABAB    ABAB    ABAB    ABAB    ABAB    ABAB    ABAB    ABAB    ABAB
pop2_10 BBAA    BBAA    BBAA    BBAA    BBAA    BBAA    BBAA    BBAA    BBAA    BBAA
pop
pop2_1  AABB    AABB    AABB    AABB    AABB    AABB    AABB    AABB    AABB    AABB
pop2_2  AABB    AABB    AABB    AABB    AABB    AABB    AABB    AABB    AABB    AABB
pop2_3  BBBB    BBBB    BBBB    BBBB    BBBB    BBBB    BBBB    BBBB    BBBB    BBBB
pop2_4  BBBA    BBBA    BBBA    BBBA    BBBA    BBBA    BBBA    BBBA    BBBA    BBBA
pop2_5  BBBA    BBBA    BBBA    BBBA    BBBA    BBBA    BBBA    BBBA    BBBA    BBBA
pop2_6  BABB    BABB    BABB    BABB    BABB    BABB    BABB    BABB    BABB    BABB
pop2_7  ABBB    ABBB    ABBB    ABBB    ABBB    ABBB    ABBB    ABBB    ABBB    ABBB
pop2_8  AAAB    AAAB    AAAB    AAAB    AAAB    AAAB    AAAB    AAAB    AAAB    AAAB
pop2_9  ABAB    ABAB    ABAB    ABAB    ABAB    ABAB    ABAB    ABAB    ABAB    ABAB
pop2_10 BBAA    BBAA    BBAA    BBAA    BBAA    BBAA    BBAA    BBAA    BBAA    BBAA

Firstly, we can see that the first line in the file is a description line. This can be what ever the user chooses. Next, we can see that the second line is a tab seperated row of locus names. In the third line we have the pop seperator tag. This tag preceeds the first individual in each population sample and is used to mark the beginning of a new sample. Each individual's genotypes are listed in a single row, with each locus genotype seperated by whitespace. Missing data are indicated by “-9”.

In this example we will used a made up data set, named “big_tetra.gen”, containing 10 population samples of 20 individuals each, genotyped at 5000 tetraploid loci.

Global \( I_{N} \)

Firstly, we will calculate \( I_{N} \) across all population samples. This method will generally be quicker than calculating the statistic across all pairwise comparisons.

system.time({
    globeIn <- polyIn(infile = "big_tetra.gen", pairwise = FALSE, parallel = TRUE)
})
   user  system elapsed 
   3.82    1.34   38.74 

Importantly, we can see that the calculations for this analysis takes around 40 seconds. While this may seem low, especially for 5000 loci, computation will increase markedly when calculating pairwise statistics, as we will see. An important part of this analysis is the post-processing of our results. We might like to firstly visualise the distribution of \( I_{n} \) values across loci

library(ggplot2)
# plot the density distribution
qplot(globeIn) + geom_density()
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

plot of chunk plts1


# plot the ordered In values
qplot() + geom_point(aes(x = 1:length(globeIn), y = sort(globeIn)))

plot of chunk plts1

We can see clearly her that we have some very informative SNPs, relative to the majority. We can identify the 100 most informative SNPs as follows:

srtIn <- sort(globeIn)
informSNPs <- names(srtIn)[5000:4900]
informSNPs
  [1] "SNP-2384" "SNP-449"  "SNP-365"  "SNP-1517" "SNP-4672" "SNP-4526"
  [7] "SNP-1816" "SNP-660"  "SNP-3442" "SNP-231"  "SNP-3291" "SNP-3992"
 [13] "SNP-1215" "SNP-1960" "SNP-764"  "SNP-4571" "SNP-173"  "SNP-656" 
 [19] "SNP-2292" "SNP-3444" "SNP-4853" "SNP-3505" "SNP-891"  "SNP-3591"
 [25] "SNP-3258" "SNP-4896" "SNP-3693" "SNP-3620" "SNP-2495" "SNP-4808"
 [31] "SNP-3464" "SNP-3446" "SNP-2582" "SNP-462"  "SNP-4252" "SNP-3973"
 [37] "SNP-634"  "SNP-1092" "SNP-2378" "SNP-1671" "SNP-1990" "SNP-4042"
 [43] "SNP-3451" "SNP-4795" "SNP-3087" "SNP-1342" "SNP-554"  "SNP-2414"
 [49] "SNP-3186" "SNP-3220" "SNP-3132" "SNP-2088" "SNP-4276" "SNP-479" 
 [55] "SNP-1192" "SNP-2696" "SNP-3050" "SNP-4639" "SNP-2122" "SNP-3612"
 [61] "SNP-608"  "SNP-4817" "SNP-4507" "SNP-3435" "SNP-1497" "SNP-183" 
 [67] "SNP-2645" "SNP-3378" "SNP-1856" "SNP-563"  "SNP-1203" "SNP-2377"
 [73] "SNP-1106" "SNP-4211" "SNP-4411" "SNP-1799" "SNP-3999" "SNP-3429"
 [79] "SNP-1190" "SNP-2612" "SNP-4203" "SNP-2053" "SNP-2656" "SNP-4501"
 [85] "SNP-2324" "SNP-186"  "SNP-4384" "SNP-3284" "SNP-2217" "SNP-4527"
 [91] "SNP-4266" "SNP-4626" "SNP-90"   "SNP-4447" "SNP-633"  "SNP-1965"
 [97] "SNP-4948" "SNP-813"  "SNP-1040" "SNP-1521" "SNP-4064"

We can check to see that the method is working correctly. The first most informative SNP should have the maximum \( I_{N} \) value.

globeIn[informSNPs[1]] == max(globeIn)
SNP-2384 
    TRUE 

So we can see how informative SNP identification might works for global \( I_{N} \), now what about pairwise \( I_{N} \).

Pairwise informativeness

In this example we will see how we can run the function polyIn for pairwise comparisons, and then I will demonstrate a simplistic method for selecting a consensus panel of loci. Again we will be using the “big_tetra.gen” dataset. There are 10 population samples in this file, thus we will have \( \frac{1}{2}N \times (N-1) \) possible comparisons:

(0.5 * 10) * (10 - 1)
[1] 45

To run polyIn for the calculation of pairwise \( I_{n} \), the following command can be used:

# run
system.time({
    pwIn <- polyIn(infile = "big_tetra.gen", pairwise = TRUE, parallel = TRUE)
})
   user  system elapsed 
   4.16    2.00   59.33 

Here we can see that the analysis takes slightly longer to run, but not too much. It is important to remember that the number of pairwise comparisons increases drastically as our number of sampled populations increases. For example, when there are 100 population samples, the number of pairwise comparisons is \( 4950 \), but for 200 samples it increases to \( 19900 \).

Now that we have generated our results we should explore the output format.

typeof(pwIn)
[1] "list"
length(pwIn)
[1] 5000
round(pwIn[[1]], 3)
       [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]  [,8] [,9] [,10]
 [1,]    NA    NA    NA    NA    NA    NA    NA    NA   NA    NA
 [2,] 0.011    NA    NA    NA    NA    NA    NA    NA   NA    NA
 [3,] 0.006 0.001    NA    NA    NA    NA    NA    NA   NA    NA
 [4,] 0.000 0.008 0.004    NA    NA    NA    NA    NA   NA    NA
 [5,] 0.004 0.002 0.000 0.002    NA    NA    NA    NA   NA    NA
 [6,] 0.000 0.008 0.004 0.000 0.002    NA    NA    NA   NA    NA
 [7,] 0.003 0.003 0.001 0.001 0.000 0.001    NA    NA   NA    NA
 [8,] 0.003 0.003 0.001 0.001 0.000 0.001 0.000    NA   NA    NA
 [9,] 0.008 0.000 0.000 0.005 0.001 0.005 0.001 0.001   NA    NA
[10,] 0.011 0.000 0.001 0.008 0.002 0.008 0.003 0.003    0    NA
names(pwIn)[1:10]
 [1] "SNP-1"  "SNP-2"  "SNP-3"  "SNP-4"  "SNP-5"  "SNP-6"  "SNP-7" 
 [8] "SNP-8"  "SNP-9"  "SNP-10"

We can see that the function has returned a pairwise matrix for each locus. Now we want to figure out how we might rank loci on the basis of their general informativeness across pairwise comparisons. One simplistic way we might consider doing this would be to rank all loci with each pairwise comparison and then take the mean of locus rank across comparisons, the lowest of which may indicate the most generally informative locus. There are some problems with this method such as large variance in informativeness across pairwise comparisons, but we will ignore these for demonstration purposes.

The most efficient way I can think of for working with loci would be to arrange comparisons into a large matrix with columns corresponding to pairwise comparisons and rows corresponding to loci.

pwMat <- t(sapply(pwIn, function(x) {
    return(as.vector(x[lower.tri(x)]))
}))
dim(pwMat)
[1] 5000   45

Now we can rank loci within each pairwise comparison

pwRank <- apply(pwMat, 2, order)
# calculate the mean rank score per locus
mnRank <- rowMeans(pwRank)
names(mnRank) <- names(pwIn)

Now we can extract the most generally informative SNPs

srtPWSNPs <- sort(mnRank)
informPWSNPs <- names(srtPWSNPs)[5000:4900]
informPWSNPs
  [1] "SNP-295"  "SNP-294"  "SNP-293"  "SNP-292"  "SNP-291"  "SNP-297" 
  [7] "SNP-290"  "SNP-296"  "SNP-289"  "SNP-288"  "SNP-287"  "SNP-286" 
 [13] "SNP-280"  "SNP-279"  "SNP-285"  "SNP-278"  "SNP-284"  "SNP-277" 
 [19] "SNP-283"  "SNP-276"  "SNP-282"  "SNP-275"  "SNP-281"  "SNP-274" 
 [25] "SNP-266"  "SNP-273"  "SNP-265"  "SNP-272"  "SNP-298"  "SNP-264" 
 [31] "SNP-271"  "SNP-263"  "SNP-270"  "SNP-262"  "SNP-269"  "SNP-261" 
 [37] "SNP-268"  "SNP-260"  "SNP-267"  "SNP-259"  "SNP-258"  "SNP-257" 
 [43] "SNP-256"  "SNP-255"  "SNP-254"  "SNP-253"  "SNP-252"  "SNP-251" 
 [49] "SNP-250"  "SNP-445"  "SNP-249"  "SNP-248"  "SNP-444"  "SNP-247" 
 [55] "SNP-443"  "SNP-246"  "SNP-245"  "SNP-442"  "SNP-244"  "SNP-299" 
 [61] "SNP-449"  "SNP-243"  "SNP-441"  "SNP-448"  "SNP-242"  "SNP-308" 
 [67] "SNP-241"  "SNP-440"  "SNP-307"  "SNP-447"  "SNP-240"  "SNP-239" 
 [73] "SNP-306"  "SNP-439"  "SNP-311"  "SNP-450"  "SNP-238"  "SNP-446" 
 [79] "SNP-300"  "SNP-305"  "SNP-237"  "SNP-438"  "SNP-310"  "SNP-304" 
 [85] "SNP-236"  "SNP-309"  "SNP-235"  "SNP-452"  "SNP-303"  "SNP-437" 
 [91] "SNP-234"  "SNP-302"  "SNP-233"  "SNP-436"  "SNP-451"  "SNP-1134"
 [97] "SNP-232"  "SNP-301"  "SNP-231"  "SNP-1143" "SNP-1146"

Although theses data are purely random, if we wanted to inspect the behaviour of loci in one pairwise comparison relative to another we could simply plot a correlation of them, for example:

library(ggplot2)
pw1v2 <- data.frame(pw1 = pwMat[,1], 
                    pw2 = pwMat[,2])
ggplot(pw1v2, aes(x = pw1, y = pw2)) +
  geom_point() +
  stat_smooth(method = "lm", col = "red")

plot of chunk unnamed-chunk-8

It might also be possible to derive a more meaningful measure of consensus informativeness by using only loci which show a high correlation in \( I_{N} \) values across all pairwise comparisons.

References

Reproducibility

R Under development (unstable) (2014-01-19 r64835)
Platform: x86_64-w64-mingw32/x64 (64-bit)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] ggplot2_0.9.3.1 diveRsity_1.8.6 plotrix_3.5-2   knitr_1.5.1    

loaded via a namespace (and not attached):
 [1] bitops_1.0-6       caTools_1.16       cluster_1.14.4    
 [4] colorspace_1.2-4   dichromat_2.0-0    digest_0.6.4      
 [7] ellipse_0.3-8      evaluate_0.5.1     formatR_0.10      
[10] Formula_1.1-1      grid_3.1.0         gtable_0.1.2      
[13] Hmisc_3.13-0       httpuv_1.2.1       igraph_0.6.6      
[16] jpeg_0.1-6         labeling_0.2       lattice_0.20-24   
[19] lavaan_0.5-15      MASS_7.3-29        mnormt_1.4-7      
[22] munsell_0.4.2      pbivnorm_0.5-1     plyr_1.8          
[25] png_0.1-7          proto_0.3-10       psych_1.3.12      
[28] qgraph_1.2.3       quadprog_1.5-5     RColorBrewer_1.0-5
[31] reshape2_1.2.2     RJSONIO_1.0-3      scales_0.2.3      
[34] sem_3.1-3          shiny_0.8.0        splines_3.1.0     
[37] stats4_3.1.0       stringr_0.6.2      survival_2.37-6   
[40] tools_3.1.0        xtable_1.7-1