THIS IS A WORK IN PROGRESS

I’m currently testing this report format. This article will be updated as soon as I have new pieces of analysis.

Biological framework

In this article I investigate the use of SNP markers for population distinctness analysis. The general scenario is variety discrimination: a lab is given a number of samples belonging to two populations, A and B (usually plants of commercial interest), and needs to tell whether they are significantly different. To do so, one or more SNP markers can be used to genotype the populations. The result is bound to carry a certain degree of uncertainty. For reasons linked to my job, the examples focus on tetraploid individuals, but all considerations can easily be extended to diploids.

Statistical framework

I modeled the results of a SNP marker genotyping on a single individual using a binomial distribution with parameters:

The following assumptions describe the two studied populations, A and B:

The whole analysis revolves around measuring the statistical ability to detect a difference between \(p_A\) and \(p_B\), balancing the tradeoff between false positives (the populations are marked as different even if \(p_A = p_B\)) and false negatives (the populations are marked as equal even if \(p_A \ne p_B\)).

Given this model, once \(p_A\) and \(p_B\) are fixed, it is possible to simulate any number of candidate populations.
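The simulation step can be sketched in a few lines of Python. This is only an illustration, not the code used for the analysis: the function name `simulate_population`, the seed and the example frequencies are made up, and I assume that each genotype is the number of reference alleles out of `ploidy` copies (4 for tetraploids).

```python
import random

def simulate_population(p, num, ploidy=4, rng=None):
    """Draw `num` genotypes; each one is the number of reference
    alleles (0..ploidy) carried by a single individual."""
    if rng is None:
        rng = random.Random()
    # A Binomial(ploidy, p) draw is the sum of `ploidy` Bernoulli(p) trials.
    return [sum(rng.random() < p for _ in range(ploidy)) for _ in range(num)]

rng = random.Random(42)  # fixed seed, only for reproducibility
pop_a = simulate_population(0.6, num=20, rng=rng)  # hypothetical p_A
pop_b = simulate_population(0.4, num=20, rng=rng)  # hypothetical p_B
print(pop_a)
print(pop_b)
```

Each call returns one simulated population, so the same function can be called repeatedly to generate as many candidate populations as needed.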

Considered scenarios

The lab is given a number \(NUM\) of biological samples for each population, each representing a different individual. We considered the following cases:

Without loss of generality, we’ll assume that population A has a higher allele frequency than population B. The difference between the two allele frequencies is called \(DELTA\). We considered the following cases:

Available data and notation

The true values of \(p_A\) and \(p_B\) are unknowable. After the wet lab routine the lab will have the genotypes encoded as:

We can compute the following useful parameters:

\[S_a = \sum_{i=1}^{NUM}x_{ai}\]

Where \(S_a\) is the total number of copies of the reference allele found in population A.

\[\hat{p}_a = \frac{S_a}{NUM \cdot n}\]

Where \(\hat{p}_a\) is the observed frequency of the reference allele in population A.

Likewise, \(S_b\) and \(\hat{p}_b\) can be computed in the same way for population B.
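As a small worked example, the two formulas translate directly into Python. The genotype values below are made up, the name `allele_summary` is hypothetical, and I assume that \(n\) is the ploidy (4 for tetraploids) and that each genotype is stored as a count of reference alleles.

```python
def allele_summary(genotypes, ploidy=4):
    """Return (S, p_hat) for a list of per-individual reference allele counts."""
    num = len(genotypes)          # NUM: number of sampled individuals
    s = sum(genotypes)            # S: total copies of the reference allele
    p_hat = s / (num * ploidy)    # observed allele frequency
    return s, p_hat

genotypes_a = [4, 3, 2, 4, 3]     # made-up genotypes for 5 tetraploid individuals
s_a, p_hat_a = allele_summary(genotypes_a)
print(s_a, p_hat_a)  # 16 0.8
```

Here 5 individuals with 4 allele copies each give 20 copies in total, 16 of which are the reference allele.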

Considered tests

The following statistical tests are considered:

In all cases we fixed a significance level \(\alpha = 0.05\). This is expected to correspond to the false positive rate, meaning that about 5% of the population pairs that actually have the same frequency (\(p_a = p_b\)) will be wrongly classified as different.

The contingency tables are built as follows:

| | Counts of reference allele | Counts of alternative allele | Row total |
|---|---|---|---|
| Population A | \(S_a\) | \(NUM \cdot n - S_a\) | \(NUM \cdot n\) |
| Population B | \(S_b\) | \(NUM \cdot n - S_b\) | \(NUM \cdot n\) |
| Column total | \(S_a + S_b\) | \(2 \cdot NUM \cdot n - (S_a + S_b)\) | \(2 \cdot NUM \cdot n\) |
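To make the table concrete, here is a minimal Python sketch that builds it from \(S_a\) and \(S_b\) and runs a test on it. Pearson's chi-squared test without continuity correction is used purely as an illustration of a test that works on a 2x2 table; treating it as one of the tests considered here is my assumption. The function names and the example counts are hypothetical, and \(n\) is assumed to be the ploidy.

```python
import math

def contingency_table(s_a, s_b, num, ploidy=4):
    """Rows: populations A and B. Columns: reference / alternative allele counts."""
    total = num * ploidy  # allele copies observed per population
    return [[s_a, total - s_a], [s_b, total - s_b]]

def chi2_test_2x2(table):
    """Pearson chi-squared test on a 2x2 table, no continuity correction."""
    (a, b), (c, d) = table
    n = a + b + c + d
    margins = (a + b, c + d, a + c, b + d)
    if 0 in margins:
        return 0.0, 1.0  # degenerate table: no evidence of a difference
    chi2 = n * (a * d - b * c) ** 2 / math.prod(margins)
    # With 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

table = contingency_table(s_a=50, s_b=30, num=20)  # made-up counts
chi2, p_value = chi2_test_2x2(table)
print(table, chi2, p_value)
print("different" if p_value < 0.05 else "equal")
```

With these made-up counts the table is `[[50, 30], [30, 50]]` and the statistic is 10, well beyond the 5% threshold, so the two populations would be classified as different.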

Workflow

This is a simplified workflow that I followed for the analysis:

  1. select a set of conditions, i.e. values for \(NUM\), \(p_a\), \(DELTA\)
  2. generate simulated genotypes by sampling from binomial distributions
  3. apply the statistical test
  4. check the test result against the truth. A test scores one error if \(p_a \ne p_b\) but the populations are classified as equal (false negative), or if \(p_a = p_b\) but they are classified as different (false positive).
  5. repeat steps 2-4 many times (depending on computational power) to ensure statistical stability
  6. repeat from step 1 with different initial conditions
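The steps above can be sketched as a small Python simulation. This is a simplified illustration under several assumptions, not the actual analysis code: the test is an uncorrected Pearson chi-squared test on the 2x2 allele count table (a stand-in chosen for brevity), the values of \(NUM\), \(p_a\) and \(DELTA\) are made up, and all function names are hypothetical.

```python
import math
import random

ALPHA = 0.05   # significance level
PLOIDY = 4     # tetraploid individuals

def simulate_allele_count(p, num, rng):
    """Step 2: total reference alleles over `num` Binomial(PLOIDY, p) genotypes."""
    return sum(rng.random() < p for _ in range(num * PLOIDY))

def chi2_p_value(s_a, s_b, total):
    """Step 3: uncorrected Pearson chi-squared p-value for the 2x2 table."""
    a, b, c, d = s_a, total - s_a, s_b, total - s_b
    margins = (a + b, c + d, a + c, b + d)
    if 0 in margins:
        return 1.0  # degenerate table: populations look identical
    chi2 = (a + b + c + d) * (a * d - b * c) ** 2 / math.prod(margins)
    return math.erfc(math.sqrt(chi2 / 2))  # 1 degree of freedom

def error_rate(num, p_a, delta, repeats, rng):
    """Steps 2-5: fraction of repetitions in which the test is wrong."""
    p_b = p_a - delta  # population A has the higher frequency
    total = num * PLOIDY
    errors = 0
    for _ in range(repeats):
        s_a = simulate_allele_count(p_a, num, rng)
        s_b = simulate_allele_count(p_b, num, rng)
        called_different = chi2_p_value(s_a, s_b, total) < ALPHA
        truly_different = delta != 0
        if called_different != truly_different:  # step 4: +1 error
            errors += 1
    return errors / repeats

rng = random.Random(0)  # fixed seed, only for reproducibility
# Steps 1 and 6: loop over hypothetical conditions.
for num in (10, 50):
    for delta in (0.0, 0.1, 0.3):
        rate = error_rate(num, p_a=0.6, delta=delta, repeats=1000, rng=rng)
        print(f"NUM={num} DELTA={delta} error rate={rate:.3f}")
```

When `delta` is 0 the reported error rate is the false positive rate and should stay close to the chosen significance level; for any other value it is the false negative rate, which is expected to shrink as \(NUM\) or \(DELTA\) grows.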