I’m currently testing this format of report. This article is going to be updated as soon as I have new pieces of analysis.
In this article I investigate the use of SNP markers for population distinctness analysis. The general scenario is variety discrimination: a lab is given a number of samples belonging to two populations, A and B (usually plants of commercial interest), and needs to tell whether they are significantly different. To do so, one or more SNP markers can be used to genotype the populations. The result is bound to carry a certain degree of uncertainty. For reasons linked to my job, the examples focus on tetraploid individuals, but all considerations can easily be extended to diploids.
I modeled the results of a SNP marker genotyping on a single individual using a binomial distribution with parameters:
The following assumptions describe the two studied populations, A and B:
The whole analysis revolves around measuring the statistical power to detect a difference between \(p_A\) and \(p_B\), balancing the tradeoff between false positives (the populations are marked as different even if \(p_A = p_B\)) and false negatives (the populations are marked as equal even if \(p_A \ne p_B\)).
Given this model, once \(p_A\) and \(p_B\) are fixed, it is possible to simulate any number of candidate populations.
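As a minimal sketch of that simulation step, the snippet below draws genotypes for two populations. It assumes the binomial parameters are the ploidy (`ploidy = 4` for tetraploids) and the allele frequency `p`, so that each genotype is the number of reference alleles carried by one individual. The function name `simulate_population` and the values 0.6, 0.4 and 20 are illustrative choices, not the cases studied in this article.

```python
import random

def simulate_population(p, num, ploidy=4, rng=None):
    """Draw `num` genotypes; each one is the count of reference
    alleles (0..ploidy) in a single individual.

    Assumption: every allele copy is an independent Bernoulli(p) draw,
    so the genotype follows a Binomial(ploidy, p) distribution.
    """
    rng = rng or random.Random()
    return [sum(rng.random() < p for _ in range(ploidy)) for _ in range(num)]

# Illustrative values only: p_A = 0.6, p_B = 0.4, 20 individuals each.
rng = random.Random(42)  # fixed seed, for reproducibility
pop_a = simulate_population(0.6, num=20, rng=rng)
pop_b = simulate_population(0.4, num=20, rng=rng)
```

Only the standard library is used here to keep the example self-contained; a vectorised binomial sampler would be a natural replacement for large simulations.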
The lab is given a number \(NUM\) of biological samples for each population, each representing a different individual. We considered the following cases:
Without loss of generality, we’ll assume that population A has a higher allele frequency than population B. The difference between the two allele frequencies is called DELTA. We considered the following cases:
The true values of \(p_A\) and \(p_B\) are unknowable. After the wet-lab routine, the lab will have the genotypes encoded as:
We can compute the following useful parameters:
\[S_a = \sum_{i=1}^{NUM}x_{ai}\]
where \(S_a\) is the total number of copies of the reference allele found in population A.
\[\hat{p}_a = \frac{S_a}{NUM \cdot n}\]
where \(\hat{p}_a\) is the observed frequency of the reference allele in population A.
Similarly, \(S_b\) and \(\hat{p}_b\) can be computed in the same way for population B.
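The two formulas above translate directly into code. This is a small sketch in which `allele_summary` is a hypothetical helper name, `genotypes` holds the per-individual reference allele counts \(x_{ai}\), and `ploidy` plays the role of \(n\); the five genotypes are made-up values for illustration.

```python
def allele_summary(genotypes, ploidy=4):
    """Return (S, p_hat) for one population.

    S     : total count of reference alleles over all individuals
    p_hat : S / (NUM * n), the observed reference allele frequency
    """
    total = sum(genotypes)
    p_hat = total / (len(genotypes) * ploidy)
    return total, p_hat

# Five tetraploid individuals (made-up genotypes).
x_a = [4, 3, 2, 4, 1]
s_a, p_hat_a = allele_summary(x_a)
print(s_a, p_hat_a)  # 14 0.7
```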
The following statistical tests are considered:
In all cases we fixed a significance level \(\alpha = 0.05\). This is expected to correspond to the false positive rate, meaning that about 5% of population pairs that actually share the same frequency (\(p_A = p_B\)) will be wrongly classified as different.
The contingency tables are built as follows:
| | Counts of reference allele | Counts of alternative allele | Row total |
|---|---|---|---|
| Population A | \(S_a\) | \(NUM \cdot n - S_a\) | \(NUM \cdot n\) |
| Population B | \(S_b\) | \(NUM \cdot n - S_b\) | \(NUM \cdot n\) |
| Column total | \(S_a + S_b\) | \(2 \cdot NUM \cdot n - (S_a + S_b)\) | \(2 \cdot NUM \cdot n\) |
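To make the table concrete, the sketch below builds it from \(S_a\), \(S_b\), \(NUM\) and \(n\), and then runs a test on it. The choice of test is an assumption made purely for illustration: a Pearson chi-square test with one degree of freedom and no continuity correction, which is not necessarily one of the tests used in this analysis. The function names and the input counts are hypothetical.

```python
import math

def contingency_table(s_a, s_b, num, ploidy=4):
    """2x2 table: rows = populations, columns = (reference, alternative)."""
    total = num * ploidy  # alleles genotyped per population (NUM * n)
    return [[s_a, total - s_a],
            [s_b, total - s_b]]

def chi_square_2x2(table):
    """Pearson chi-square on a 2x2 table (illustrative stand-in test).

    Returns (statistic, p_value). With 1 degree of freedom, the
    chi-square survival function equals erfc(sqrt(x / 2)).
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    margins = (a + b, c + d, a + c, b + d)
    if 0 in margins:
        # Degenerate table: one allele was never observed, no evidence of difference.
        return 0.0, 1.0
    chi2 = n * (a * d - b * c) ** 2 / math.prod(margins)
    return chi2, math.erfc(math.sqrt(chi2 / 2))

# Made-up counts: 20 tetraploid individuals per population.
table = contingency_table(s_a=50, s_b=30, num=20)
chi2, p_value = chi_square_2x2(table)
print(table)            # [[50, 30], [30, 50]]
print(chi2)             # 10.0
print(p_value < 0.05)   # True
```

With these counts the two populations would be declared different at the 0.05 level. For small counts an exact test would usually be a safer choice than the chi-square approximation.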
This is a simplified version of the workflow I followed for the analysis: