Coverage calculation.

Theory

Ewens & Grant (2005) describe the coverage of a genome as a Poisson distribution. The probability of observing \(x\) reads at a given site is given by:

\[P(x) = \frac{\lambda^x e^{-\lambda}}{x!}\] Where:

\(x\) number of short reads covering a given site on the genome
\(\lambda = \frac{LN}{G}\) sequencing redundancy

and:

\(G\) genome length (in bp)
\(L\) short read average length
\(N\) number of short read sequenced

For the Bzea population

The NAM v5 B73 genome is is 2.2 GB

G = 2.2e9

The mean length of the reads is 150 bp

L = 150

The coverage I have for Bzea is 0.8

lambda = 0.8

Then the number of sequenced reads I got is:

N = G*lambda/L
N/1e6

## [1] 11.73333

each sample has 12 million reads

12e6*150/2.2e9

## [1] 0.8181818

If 1400 samples were sequenced

n=1400
total_reads= 1400*12e6

if a lane has 4e9 reads

total_reads/4e9

## [1] 4.2

4 lanes where sequenced

Fraction of the genome covered

p=1-exp(-lambda)
p

## [1] 0.550671

Expected number of samples with at least 1 read at each position

round(n*p,0)

## [1] 771

standard deviation

sqrt(n*p*(1-p))

## [1] 18.61197

Expected number of missing genotypes per position:

n - qbinom(0.50,n,p)

## [1] 629

Missing data

Expected fraction of missing data per position

1-qbinom(0.50,n,p)/n

## [1] 0.4492857

Parameters

We assume a cohort size of 1400 individuals.
We want to estimate the probability that a site has at least 98% missing data.

n <- 1400                # number of individuals
p_missing <- 0.98        # missing data proportion threshold
k_missing <- ceiling(n * p_missing)  # number of missing genotypes threshold

n

## [1] 1400

p_missing

## [1] 0.98

k_missing

## [1] 1372

Probability calculation

We assume each genotype is missing with probability p0 (unknown true missingness rate).
For illustration, let’s assume random missingness with p0 = 0.5.

We can model the number of missing genotypes at a site as a Binomial(n, p0).

p0 <- 0.5

# Probability of having >= k_missing missing genotypes
prob_high_missing <- pbinom(k_missing - 1, size = n, prob = p0, lower.tail = FALSE)
prob_high_missing

## [1] 0

Dispersion

Dispersion for the Binomial is: \[ \frac{\text{Variance}}{\text{Mean}} = \frac{n p (1-p)}{n p} = 1 - p \]

dispersion <- (n * p0 * (1 - p0)) / (n * p0)
dispersion

## [1] 0.5

Notes

If p0 is estimated from the real data (mean missingness rate across sites), this probability will be more realistic.
For overdispersed data (extra variability beyond binomial), a Beta-Binomial model may be more appropriate. ```

References

Ewens, W. J., & Grant, G. (2005). Statistical methods in bioinformatics: An introduction (2nd ed.). Springer. https://doi.org/10.1007/b137845