This document demonstrates how the polyIn function in the diveRsity package can be used to rank polyploid loci on the basis of their informativeness for the inference of ancestry. The application of this statistic is considered at both the global level, where the most informative loci across all population samples are assessed, and at the pairwise level, where a consensus panel of markers is selected on the basis of their informativeness across all pairwise comparisons.
# load diveRsity
library(diveRsity)
## Loading required package: plotrix
The function of interest for these analyses is polyIn. To find out more about how this function can be used, type the following into the R console.
?polyIn
Before we can use the polyIn function, we need to make sure that our data are in the correct format. Below is an example of a truncated input file containing three population samples of ten individuals each, scored for ten tetraploid loci.
pop-test
snp1 snp2 snp3 snp4 snp5 snp6 snp7 snp8 snp9 snp10
pop
pop1_1 AABA AABA AABA AABA AABA AABA AABA AABA AABA AABA
pop1_2 AAAA AAAA AAAA AAAA AAAA AAAA AAAA AAAA AAAA AAAA
pop1_3 AABB AABB AABB AABB AABB AABB AABB AABB AABB AABB
pop1_4 BBBB BBBB BBBB BBBB BBBB BBBB BBBB BBBB BBBB BBBB
pop1_5 AABA AABA AABA AABA AABA AABA AABA AABA AABA AABA
pop1_6 ABAB ABAB ABAB ABAB ABAB ABAB ABAB ABAB ABAB ABAB
pop1_7 BBAA BBAA BBAA BBAA BBAA BBAA BBAA BBAA BBAA BBAA
pop1_8 AABB AABB AABB AABB AABB AABB AABB AABB AABB AABB
pop1_9 ABAA ABAA ABAA -9 ABAA ABAA ABAA ABAA ABAA ABAA
pop1_10 AAAA AAAA AAAA AAAA AAAA AAAA AAAA AAAA AAAA AAAA
pop
pop2_1 AABB AABB AABB AABB AABB AABB AABB AABB AABB AABB
pop2_2 AABB AABB AABB AABB AABB AABB AABB AABB AABB AABB
pop2_3 BBBB BBBB BBBB BBBB BBBB BBBB BBBB BBBB BBBB BBBB
pop2_4 BBBA BBBA BBBA BBBA BBBA BBBA BBBA BBBA BBBA BBBA
pop2_5 BBBA BBBA BBBA BBBA BBBA BBBA BBBA BBBA BBBA BBBA
pop2_6 BABB BABB BABB BABB BABB BABB BABB BABB BABB BABB
pop2_7 ABBB ABBB ABBB ABBB ABBB ABBB ABBB ABBB ABBB ABBB
pop2_8 AAAB AAAB AAAB AAAB AAAB AAAB AAAB AAAB AAAB AAAB
pop2_9 ABAB ABAB ABAB ABAB ABAB ABAB ABAB ABAB ABAB ABAB
pop2_10 BBAA BBAA BBAA BBAA BBAA BBAA BBAA BBAA BBAA BBAA
pop
pop2_1 AABB AABB AABB AABB AABB AABB AABB AABB AABB AABB
pop2_2 AABB AABB AABB AABB AABB AABB AABB AABB AABB AABB
pop2_3 BBBB BBBB BBBB BBBB BBBB BBBB BBBB BBBB BBBB BBBB
pop2_4 BBBA BBBA BBBA BBBA BBBA BBBA BBBA BBBA BBBA BBBA
pop2_5 BBBA BBBA BBBA BBBA BBBA BBBA BBBA BBBA BBBA BBBA
pop2_6 BABB BABB BABB BABB BABB BABB BABB BABB BABB BABB
pop2_7 ABBB ABBB ABBB ABBB ABBB ABBB ABBB ABBB ABBB ABBB
pop2_8 AAAB AAAB AAAB AAAB AAAB AAAB AAAB AAAB AAAB AAAB
pop2_9 ABAB ABAB ABAB ABAB ABAB ABAB ABAB ABAB ABAB ABAB
pop2_10 BBAA BBAA BBAA BBAA BBAA BBAA BBAA BBAA BBAA BBAA
Firstly, we can see that the first line in the file is a description line. This can be what ever the user chooses. Next, we can see that the second line is a tab seperated row of locus names. In the third line we have the pop seperator tag. This tag preceeds the first individual in each population sample and is used to mark the beginning of a new sample. Each individual's genotypes are listed in a single row, with each locus genotype seperated by whitespace. Missing data are indicated by “-9”.
In this example we will used a made up data set, named “big_tetra.gen”, containing 10 population samples of 20 individuals each, genotyped at 5000 tetraploid loci.
Firstly, we will calculate \( I_{N} \) across all population samples. This method will generally be quicker than calculating the statistic across all pairwise comparisons.
system.time({
globeIn <- polyIn(infile = "big_tetra.gen", pairwise = FALSE, parallel = TRUE)
})
user system elapsed
3.82 1.34 38.74
Importantly, we can see that the calculations for this analysis takes around 40 seconds. While this may seem low, especially for 5000 loci, computation will increase markedly when calculating pairwise statistics, as we will see. An important part of this analysis is the post-processing of our results. We might like to firstly visualise the distribution of \( I_{n} \) values across loci
library(ggplot2)
# plot the density distribution
qplot(globeIn) + geom_density()
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
# plot the ordered In values
qplot() + geom_point(aes(x = 1:length(globeIn), y = sort(globeIn)))
We can see clearly her that we have some very informative SNPs, relative to the majority. We can identify the 100 most informative SNPs as follows:
srtIn <- sort(globeIn)
informSNPs <- names(srtIn)[5000:4900]
informSNPs
[1] "SNP-2384" "SNP-449" "SNP-365" "SNP-1517" "SNP-4672" "SNP-4526"
[7] "SNP-1816" "SNP-660" "SNP-3442" "SNP-231" "SNP-3291" "SNP-3992"
[13] "SNP-1215" "SNP-1960" "SNP-764" "SNP-4571" "SNP-173" "SNP-656"
[19] "SNP-2292" "SNP-3444" "SNP-4853" "SNP-3505" "SNP-891" "SNP-3591"
[25] "SNP-3258" "SNP-4896" "SNP-3693" "SNP-3620" "SNP-2495" "SNP-4808"
[31] "SNP-3464" "SNP-3446" "SNP-2582" "SNP-462" "SNP-4252" "SNP-3973"
[37] "SNP-634" "SNP-1092" "SNP-2378" "SNP-1671" "SNP-1990" "SNP-4042"
[43] "SNP-3451" "SNP-4795" "SNP-3087" "SNP-1342" "SNP-554" "SNP-2414"
[49] "SNP-3186" "SNP-3220" "SNP-3132" "SNP-2088" "SNP-4276" "SNP-479"
[55] "SNP-1192" "SNP-2696" "SNP-3050" "SNP-4639" "SNP-2122" "SNP-3612"
[61] "SNP-608" "SNP-4817" "SNP-4507" "SNP-3435" "SNP-1497" "SNP-183"
[67] "SNP-2645" "SNP-3378" "SNP-1856" "SNP-563" "SNP-1203" "SNP-2377"
[73] "SNP-1106" "SNP-4211" "SNP-4411" "SNP-1799" "SNP-3999" "SNP-3429"
[79] "SNP-1190" "SNP-2612" "SNP-4203" "SNP-2053" "SNP-2656" "SNP-4501"
[85] "SNP-2324" "SNP-186" "SNP-4384" "SNP-3284" "SNP-2217" "SNP-4527"
[91] "SNP-4266" "SNP-4626" "SNP-90" "SNP-4447" "SNP-633" "SNP-1965"
[97] "SNP-4948" "SNP-813" "SNP-1040" "SNP-1521" "SNP-4064"
We can check to see that the method is working correctly. The first most informative SNP should have the maximum \( I_{N} \) value.
globeIn[informSNPs[1]] == max(globeIn)
SNP-2384
TRUE
So we can see how informative SNP identification might works for global \( I_{N} \), now what about pairwise \( I_{N} \).
In this example we will see how we can run the function polyIn for pairwise comparisons, and then I will demonstrate a simplistic method for selecting a consensus panel of loci.
Again we will be using the “big_tetra.gen” dataset. There are 10 population samples in this file, thus we will have \( \frac{1}{2}N \times (N-1) \) possible comparisons:
(0.5 * 10) * (10 - 1)
[1] 45
To run polyIn for the calculation of pairwise \( I_{n} \), the following command can be used:
# run
system.time({
pwIn <- polyIn(infile = "big_tetra.gen", pairwise = TRUE, parallel = TRUE)
})
user system elapsed
4.16 2.00 59.33
Here we can see that the analysis takes slightly longer to run, but not too much. It is important to remember that the number of pairwise comparisons increases drastically as our number of sampled populations increases. For example, when there are 100 population samples, the number of pairwise comparisons is \( 4950 \), but for 200 samples it increases to \( 19900 \).
Now that we have generated our results we should explore the output format.
typeof(pwIn)
[1] "list"
length(pwIn)
[1] 5000
round(pwIn[[1]], 3)
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] NA NA NA NA NA NA NA NA NA NA
[2,] 0.011 NA NA NA NA NA NA NA NA NA
[3,] 0.006 0.001 NA NA NA NA NA NA NA NA
[4,] 0.000 0.008 0.004 NA NA NA NA NA NA NA
[5,] 0.004 0.002 0.000 0.002 NA NA NA NA NA NA
[6,] 0.000 0.008 0.004 0.000 0.002 NA NA NA NA NA
[7,] 0.003 0.003 0.001 0.001 0.000 0.001 NA NA NA NA
[8,] 0.003 0.003 0.001 0.001 0.000 0.001 0.000 NA NA NA
[9,] 0.008 0.000 0.000 0.005 0.001 0.005 0.001 0.001 NA NA
[10,] 0.011 0.000 0.001 0.008 0.002 0.008 0.003 0.003 0 NA
names(pwIn)[1:10]
[1] "SNP-1" "SNP-2" "SNP-3" "SNP-4" "SNP-5" "SNP-6" "SNP-7"
[8] "SNP-8" "SNP-9" "SNP-10"
We can see that the function has returned a pairwise matrix for each locus. Now we want to figure out how we might rank loci on the basis of their general informativeness across pairwise comparisons. One simplistic way we might consider doing this would be to rank all loci with each pairwise comparison and then take the mean of locus rank across comparisons, the lowest of which may indicate the most generally informative locus. There are some problems with this method such as large variance in informativeness across pairwise comparisons, but we will ignore these for demonstration purposes.
The most efficient way I can think of for working with loci would be to arrange comparisons into a large matrix with columns corresponding to pairwise comparisons and rows corresponding to loci.
pwMat <- t(sapply(pwIn, function(x) {
return(as.vector(x[lower.tri(x)]))
}))
dim(pwMat)
[1] 5000 45
Now we can rank loci within each pairwise comparison
pwRank <- apply(pwMat, 2, order)
# calculate the mean rank score per locus
mnRank <- rowMeans(pwRank)
names(mnRank) <- names(pwIn)
Now we can extract the most generally informative SNPs
srtPWSNPs <- sort(mnRank)
informPWSNPs <- names(srtPWSNPs)[5000:4900]
informPWSNPs
[1] "SNP-295" "SNP-294" "SNP-293" "SNP-292" "SNP-291" "SNP-297"
[7] "SNP-290" "SNP-296" "SNP-289" "SNP-288" "SNP-287" "SNP-286"
[13] "SNP-280" "SNP-279" "SNP-285" "SNP-278" "SNP-284" "SNP-277"
[19] "SNP-283" "SNP-276" "SNP-282" "SNP-275" "SNP-281" "SNP-274"
[25] "SNP-266" "SNP-273" "SNP-265" "SNP-272" "SNP-298" "SNP-264"
[31] "SNP-271" "SNP-263" "SNP-270" "SNP-262" "SNP-269" "SNP-261"
[37] "SNP-268" "SNP-260" "SNP-267" "SNP-259" "SNP-258" "SNP-257"
[43] "SNP-256" "SNP-255" "SNP-254" "SNP-253" "SNP-252" "SNP-251"
[49] "SNP-250" "SNP-445" "SNP-249" "SNP-248" "SNP-444" "SNP-247"
[55] "SNP-443" "SNP-246" "SNP-245" "SNP-442" "SNP-244" "SNP-299"
[61] "SNP-449" "SNP-243" "SNP-441" "SNP-448" "SNP-242" "SNP-308"
[67] "SNP-241" "SNP-440" "SNP-307" "SNP-447" "SNP-240" "SNP-239"
[73] "SNP-306" "SNP-439" "SNP-311" "SNP-450" "SNP-238" "SNP-446"
[79] "SNP-300" "SNP-305" "SNP-237" "SNP-438" "SNP-310" "SNP-304"
[85] "SNP-236" "SNP-309" "SNP-235" "SNP-452" "SNP-303" "SNP-437"
[91] "SNP-234" "SNP-302" "SNP-233" "SNP-436" "SNP-451" "SNP-1134"
[97] "SNP-232" "SNP-301" "SNP-231" "SNP-1143" "SNP-1146"
Although theses data are purely random, if we wanted to inspect the behaviour of loci in one pairwise comparison relative to another we could simply plot a correlation of them, for example:
library(ggplot2)
pw1v2 <- data.frame(pw1 = pwMat[,1],
pw2 = pwMat[,2])
ggplot(pw1v2, aes(x = pw1, y = pw2)) +
geom_point() +
stat_smooth(method = "lm", col = "red")
It might also be possible to derive a more meaningful measure of consensus informativeness by using only loci which show a high correlation in \( I_{N} \) values across all pairwise comparisons.
Keenan, K., McGinnity, P., Cross, T.F., Crozier, W.W., & Prodöhl, P.A., (2013), diveRsity: An R package for the estimation of population genetics parameters and their associated errors, Methods in Ecology and Evolution, doi: 10.1111/2041-210X.12067
Rosenberg, N., Li, L., Ward, R., and Pritchard, J., (2003) 'Informativeness of genetic markers for inference of of ancestry.', American Journal of Human Genetics, vol. 73, no. 6 pp. 1402-22.
R Under development (unstable) (2014-01-19 r64835)
Platform: x86_64-w64-mingw32/x64 (64-bit)
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] ggplot2_0.9.3.1 diveRsity_1.8.6 plotrix_3.5-2 knitr_1.5.1
loaded via a namespace (and not attached):
[1] bitops_1.0-6 caTools_1.16 cluster_1.14.4
[4] colorspace_1.2-4 dichromat_2.0-0 digest_0.6.4
[7] ellipse_0.3-8 evaluate_0.5.1 formatR_0.10
[10] Formula_1.1-1 grid_3.1.0 gtable_0.1.2
[13] Hmisc_3.13-0 httpuv_1.2.1 igraph_0.6.6
[16] jpeg_0.1-6 labeling_0.2 lattice_0.20-24
[19] lavaan_0.5-15 MASS_7.3-29 mnormt_1.4-7
[22] munsell_0.4.2 pbivnorm_0.5-1 plyr_1.8
[25] png_0.1-7 proto_0.3-10 psych_1.3.12
[28] qgraph_1.2.3 quadprog_1.5-5 RColorBrewer_1.0-5
[31] reshape2_1.2.2 RJSONIO_1.0-3 scales_0.2.3
[34] sem_3.1-3 shiny_0.8.0 splines_3.1.0
[37] stats4_3.1.0 stringr_0.6.2 survival_2.37-6
[40] tools_3.1.0 xtable_1.7-1