Population distinctness analysis using SNP markers

THIS IS A WORK IN PROGRESS
Biological framework
Statistical framework
Considered scenarios
Available data and notation
Considered tests
Workflow

THIS IS A WORK IN PROGRESS

I’m currently testing this report format. This article will be updated as new pieces of analysis become available.

Biological framework

In this article I investigate the use of SNP markers for population distinctness analysis. The general scenario is variety discrimination: a lab is given a number of samples belonging to two populations, A and B (usually plants of commercial interest), and needs to tell whether they are significantly different. To do so, one or more SNP markers can be used to genotype the populations. The result is bound to carry a certain degree of uncertainty. For reasons linked to my job, the examples focus on tetraploid individuals, but all considerations can easily be extended to diploids.

Statistical framework

I modeled the result of genotyping a single individual with one SNP marker using a binomial distribution with parameters:

- number of trials $n \in \mathbb{N}_0$ = ploidy of the species
- success probability in each trial $p \in [0,1]$ = frequency of the measured allele in the population.
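
As a quick check of the model, the binomial probability mass function gives the probability of observing $x$ copies of the measured allele in one individual. The worked value below uses $n = 4$ and an illustrative $p = 0.5$, chosen only as an example:

\[P(X = x) = \binom{n}{x} p^x (1-p)^{n-x}, \qquad x \in \{0, 1, \dotsc, n\}\]

\[P(X = 2) = \binom{4}{2} \cdot 0.5^2 \cdot 0.5^2 = 6 \cdot 0.0625 = 0.375\]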

The following assumptions describe the two studied populations, A and B:

- $n = 4$ (both populations are tetraploid)
- $p_A, p_B$: allele frequencies for the considered SNP. If $p_A = p_B$ the two populations do not differ. The true values cannot be known by the lab.

The whole analysis revolves around measuring the statistical ability to detect a difference between $p_A$ and $p_B$, balancing the tradeoff between false positives (the populations are marked as different even if $p_A = p_B$) and false negatives (the populations are marked as equal even if $p_A \ne p_B$).

Given the model above, once $p_A$ and $p_B$ are fixed, it is possible to simulate any number of candidate populations.
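
A minimal sketch of such a simulation, written in Python with the standard library only (the analysis itself refers to R functions, so this is an illustration rather than the code actually used). The function name `simulate_population`, the seed, and the example frequencies are assumptions made for the example.

```python
import random

def simulate_population(p, num, ploidy=4, rng=None):
    """Draw `num` genotypes, each a Binomial(ploidy, p) count of the measured allele."""
    rng = rng or random.Random()
    # One Bernoulli draw per allele copy; the sum over `ploidy` copies is binomial.
    return [sum(rng.random() < p for _ in range(ploidy)) for _ in range(num)]

rng = random.Random(42)  # fixed seed, used here only for reproducibility
pop_a = simulate_population(0.5, 40, rng=rng)  # hypothetical p_A = 0.5, NUM = 40
pop_b = simulate_population(0.4, 40, rng=rng)  # hypothetical p_B = 0.4, NUM = 40
print(pop_a[:10])
print(pop_b[:10])
```

Each element of the returned list is the genotype of one individual, i.e. an integer between 0 and the ploidy.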

Considered scenarios

The lab is given a number $NUM$ of biological samples for each population, each representing a different individual. We considered the following cases:

$NUM \in [40, 50, 60, 80, 100]$

Without loss of generality, we’ll assume that the allele frequency of population A is at least as high as that of population B. The difference between the two allele frequencies is called DELTA. We considered the following cases:

- $p_A \in [0.5, 0.4, 0.3, 0.2, 0.1]$
- $DELTA \in [0.00, 0.02, 0.04, 0.06, 0.08, 0.10, 0.12, 0.14, 0.16, 0.18, 0.20]$
- $p_B = p_A - DELTA$ (if it goes below zero the case is discarded)
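
The grid of scenarios described above can be enumerated as in the following sketch. Frequencies are stored as integer percentages so that the check on negative $p_B$ is not affected by floating-point rounding; that choice, like the names `build_scenarios` and the dictionary keys, is an assumption of this example.

```python
from itertools import product

NUM_VALUES = [40, 50, 60, 80, 100]
P_A_PERCENT = [50, 40, 30, 20, 10]       # p_A expressed in percent
DELTA_PERCENT = list(range(0, 21, 2))    # DELTA in percent: 0, 2, ..., 20

def build_scenarios():
    """Return every (NUM, p_A, DELTA, p_B) combination with p_B >= 0."""
    scenarios = []
    for num, pa, delta in product(NUM_VALUES, P_A_PERCENT, DELTA_PERCENT):
        pb = pa - delta
        if pb < 0:
            continue  # discard cases where p_B would go below zero
        scenarios.append(
            {"NUM": num, "p_A": pa / 100, "DELTA": delta / 100, "p_B": pb / 100}
        )
    return scenarios

scenarios = build_scenarios()
print(len(scenarios))  # 250
```

Only $p_A = 0.1$ loses cases (those with $DELTA > 0.10$), which leaves 50 frequency combinations for each of the 5 values of $NUM$.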

Available data and notation

The true values of $p_A$ and $p_B$ cannot be known. After the wet-lab routine the lab will have the genotypes encoded as:

- $X_A = \{x_{A1}, x_{A2}, \dotsc, x_{A\,NUM}\}$ for population A
- $X_B = \{x_{B1}, x_{B2}, \dotsc, x_{B\,NUM}\}$ for population B
- with $x \in [0, 1, 2, \dotsc , n]$
- for tetraploids: $x \in [0, 1, 2, 3, 4]$

We can compute the following useful parameters:

\[S_a = \sum_{i=1}^{NUM}x_{Ai}\]

Where $S_a$ is the total number of copies of the reference allele found in population A

\[\hat{p}_a = \frac{S_a}{NUM \cdot n}\]

Where $\hat{p}_a$ is the observed frequency of the reference allele in population A.

Similarly, $S_b$ and $\hat{p}_b$ can be computed in the same way for population B.
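
As a small worked example, the two quantities can be computed as follows. The genotype list is made up for illustration (five tetraploid individuals), and `allele_summary` is a hypothetical helper name.

```python
def allele_summary(genotypes, ploidy=4):
    """Return (S, p_hat): total reference-allele count and its observed frequency."""
    s = sum(genotypes)
    p_hat = s / (len(genotypes) * ploidy)
    return s, p_hat

x_a = [2, 3, 1, 4, 2]  # hypothetical genotypes, NUM = 5, n = 4
s_a, p_hat_a = allele_summary(x_a)
print(s_a, p_hat_a)  # 12 0.6
```

Here $S_a = 12$ out of $NUM \cdot n = 20$ allele copies, so $\hat{p}_a = 0.6$.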

Considered tests

The following statistical tests are considered:

- Fisher's exact test, implemented using the R function fisher.test
- Barnard's exact test, implemented using the R function barnard.test

In all cases we fixed a significance level $\alpha = 0.05$. This is expected to correspond to the false positive rate, meaning that about 5% of population pairs that actually have the same frequency ($p_a = p_b$) will be wrongly classified as different.

The contingency tables are built as follows:

| | Counts of reference allele | Counts of alternative allele | Row total |
|---|---|---|---|
| Population A | $S_a$ | $NUM \cdot n - S_a$ | $NUM \cdot n$ |
| Population B | $S_b$ | $NUM \cdot n - S_b$ | $NUM \cdot n$ |
| Column total | $S_a + S_b$ | $2 \cdot NUM \cdot n - (S_a + S_b)$ | $2 \cdot NUM \cdot n$ |
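
To make the table concrete, the sketch below builds it from $S_a$, $S_b$ and $NUM$ and computes a two-sided Fisher exact p-value directly from the hypergeometric distribution. It is a hand-written stand-in in Python, not the R `fisher.test` function named above; the two-sided rule used here (summing all tables whose probability does not exceed that of the observed one) and the example counts are assumptions of this illustration.

```python
from math import comb

def contingency_table(s_a, s_b, num, ploidy=4):
    """2x2 table: rows are populations A and B, columns are reference/alternative counts."""
    total = num * ploidy
    return [[s_a, total - s_a], [s_b, total - s_b]]

def fisher_exact_two_sided(table):
    """Two-sided Fisher exact p-value for a 2x2 table of non-negative integers."""
    (a, b), (c, d) = table
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)
    # Range of the top-left cell compatible with the fixed margins.
    lo, hi = max(0, col1 - row2), min(row1, col1)
    probs = [comb(row1, k) * comb(row2, col1 - k) / denom for k in range(lo, hi + 1)]
    p_obs = probs[a - lo]
    # Sum the probabilities of all tables at least as unlikely as the observed one.
    return min(1.0, sum(p for p in probs if p <= p_obs * (1 + 1e-7)))

table = contingency_table(s_a=96, s_b=70, num=40)  # hypothetical counts
print(table)  # [[96, 64], [70, 90]]
print(fisher_exact_two_sided(table))
```

The populations would be classified as different when the returned p-value is below the chosen significance level.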

Workflow

This is a simplified workflow that I followed for the analysis:

1. select a set of conditions, i.e. values for $NUM$, $p_a$, $DELTA$
2. generate simulated genotypes by sampling from binomial distributions
3. apply the statistical test
4. verify the correctness of the test result. A test scores one error if $p_a \ne p_b$ but the populations are classified as equal, or if $p_a = p_b$ but they are classified as different.
5. repeat steps 2-4 many times (depending on computational power) to ensure statistical stability
6. repeat from step 1 with different initial conditions
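
The workflow above can be sketched end to end as follows. This is a simplified Python illustration using only the standard library and a hand-written Fisher exact test; it is not the R code behind the analysis. The function names, the default of 200 repetitions, and the seed are all assumptions.

```python
import random
from math import comb

def fisher_p(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    probs = [comb(row1, k) * comb(row2, col1 - k) / denom for k in range(lo, hi + 1)]
    p_obs = probs[a - lo]
    return min(1.0, sum(p for p in probs if p <= p_obs * (1 + 1e-7)))

def simulate_sum(p, num, ploidy, rng):
    """Generate `num` Binomial(ploidy, p) genotypes and return their sum S."""
    genotypes = [sum(rng.random() < p for _ in range(ploidy)) for _ in range(num)]
    return sum(genotypes)

def error_rate(num, p_a, delta, reps=200, ploidy=4, alpha=0.05, seed=0):
    """Fraction of repetitions in which the test gives the wrong classification."""
    rng = random.Random(seed)
    p_b = p_a - delta
    total = num * ploidy
    errors = 0
    for _ in range(reps):
        # Generate the two simulated populations.
        s_a = simulate_sum(p_a, num, ploidy, rng)
        s_b = simulate_sum(p_b, num, ploidy, rng)
        # Apply the test to the 2x2 table of allele counts.
        called_different = fisher_p(s_a, total - s_a, s_b, total - s_b) < alpha
        # Count an error when the call disagrees with the truth.
        if called_different != (delta > 0):
            errors += 1
    return errors / reps

# Different initial conditions: here only DELTA varies, for brevity.
for delta in (0.0, 0.1, 0.2):
    print(delta, error_rate(num=40, p_a=0.5, delta=delta))
```

With $DELTA = 0$ the returned value estimates the false positive rate; with $DELTA > 0$ it estimates the false negative rate for that scenario.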
