Ewens & Grant (2005) describe the coverage of a genome as a Poisson distribution. The probability of observing \(x\) reads at a given site is given by:
\[P(x) = \frac{\lambda^x e^{-\lambda}}{x!}\] Where:
\(x\) number of short reads covering a given site on the genome
\(\lambda = \frac{LN}{G}\) sequencing redundancy
and:
\(G\) genome length (in bp)
\(L\) short read average length
\(N\) number of short read sequenced
The NAM v5 B73 genome is is 2.2 GB
G = 2.2e9
The mean length of the reads is 150 bp
L = 150
The coverage I have for Bzea is 0.8
lambda = 0.8
Then the number of sequenced reads I got is:
N = G*lambda/L
N/1e6
## [1] 11.73333
each sample has 12 million reads
12e6*150/2.2e9
## [1] 0.8181818
If 1400 samples were sequenced
n=1400
total_reads= 1400*12e6
if a lane has 4e9 reads
total_reads/4e9
## [1] 4.2
4 lanes where sequenced
Fraction of the genome covered
p=1-exp(-lambda)
p
## [1] 0.550671
Expected number of samples with at least 1 read at each position
round(n*p,0)
## [1] 771
standard deviation
sqrt(n*p*(1-p))
## [1] 18.61197
Expected number of missing genotypes per position:
n - qbinom(0.50,n,p)
## [1] 629
Expected fraction of missing data per position
1-qbinom(0.50,n,p)/n
## [1] 0.4492857
We assume a cohort size of 1400 individuals.
We want to estimate the probability that a site has at
least 98% missing data.
n <- 1400 # number of individuals
p_missing <- 0.98 # missing data proportion threshold
k_missing <- ceiling(n * p_missing) # number of missing genotypes threshold
n
## [1] 1400
p_missing
## [1] 0.98
k_missing
## [1] 1372
We assume each genotype is missing with probability p0
(unknown true missingness rate).
For illustration, let’s assume random missingness with
p0 = 0.5
.
We can model the number of missing genotypes at a site as a Binomial(n, p0).
p0 <- 0.5
# Probability of having >= k_missing missing genotypes
prob_high_missing <- pbinom(k_missing - 1, size = n, prob = p0, lower.tail = FALSE)
prob_high_missing
## [1] 0
Dispersion for the Binomial is: \[ \frac{\text{Variance}}{\text{Mean}} = \frac{n p (1-p)}{n p} = 1 - p \]
dispersion <- (n * p0 * (1 - p0)) / (n * p0)
dispersion
## [1] 0.5
p0
is estimated from the real data (mean missingness
rate across sites), this probability will be more realistic.Ewens, W. J., & Grant, G. (2005). Statistical methods in bioinformatics: An introduction (2nd ed.). Springer. https://doi.org/10.1007/b137845