This analysis examines genome coverage and the distribution of
missing read data in the BZea maize–teosinte population
without calling genotypes or relying on VCF files. Instead, we analyzed
the BAM alignments directly, prior to any genotype
calling. We used GATK’s CollectAllelicCounts
tool on the
BAM files, restricting read counting to sites corresponding to ~27
million known teosinte variants (widesweq_ref panel). For each site and
sample, the tool counted the number of reads supporting each allele, and
these counts were then aggregated into 1 Mb bins for the initial file in
the workflow.
By aggregating the per-bin read counts across the genome, we estimated coverage (λ) and compared the observed fraction of missing genotype calls against the expected missing fraction based on the Poisson distribution model of genome coverage.
Modeling missingness in sequencing data often involves considering both stochastic sampling effects and systematic biases. A foundational concept is the Poisson coverage model, where the probability of observing zero reads at a site decreases exponentially with increasing coverage. This model reflects the stochastic nature of read sampling in sequencing experiments.
Additionally, certain genomic regions inherently exhibit higher levels of missingness due to factors like low-complexity regions / repetitive sequences, or structural variations. This leads to an irreducible baseline missingness, which remains even at high coverage depths with small reads. These two facts are the base for the exponential-with-floor model when applied to the BZea BAM data, providing a parsimonious and mechanistically interpretable summary of missingness across samples.
Under the Poisson model, if sequencing reads are randomly and independently distributed across the genome, the probability of missing a site is: \[ P_{\text{missing}} = e^{-\lambda} \] where:
This theoretical expectation serves as a baseline to evaluate whether missingness arises purely from random sequencing variation or if additional biases (e.g., mapping issues) are contributing.
We model per-sample missingness as \[ P_{\text{missing}} = \pi + (1 - \pi) e^{-k \lambda} \] where:
\(\lambda\) is the mean coverage.
\(\pi\) (floor parameter):
fraction of sites that remain uncalled even at high coverage with short
reads. This baseline missingness likely corresponds to repetitive DNA,
structural variants, or highly polymorphic regions where short reads
fail to map or are discarded by mapping filters. For example
picard
CollectWgsMetrics
reports a
“PCT_EXC_TOTAL” statistic, which can be interpreted as a floor
parameter. For data on the alux project I have a nominal
coverage \(3.2x\) I got a mean \(\text{PCT_EXC_TOTAL} = 0.45 \sim
\pi\).
\(k\) (decay rate): how much the amount of missing data decays with with increasing coverage relative to the Poisson model. I interpret the product \(k\lambda\) as an effective coverage (maybe associated the coverage of the informative variants).
# Load aggregated 1Mb bin allele count data.
# The file is named all_samples_bin_genotypes.tsv, but the genotype calls are irrelevant
# for the analysis, we only are interested the depth and missingness statistics.
base_dir <- "/Volumes/BZea/bzeaseq/ancestry"
table_file <- list.files(base_dir, pattern = "all_samples_bin_genotypes.tsv",
full.names = TRUE)
df <- read_tsv(table_file, comment = "#")
## Rows: 3057288 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (9): SAMPLE, CONTIG, GENOTYPE, Kraw, Kasin, Kgmm, Kraw_HMM, Kasin_HMM, K...
## dbl (8): BIN_POS, VARIANT_COUNT, INFORMATIVE_VARIANT_COUNT, DEPTH_SUM, ALT_C...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Compute per-sample coverage and missingness statistics
per_sample <- df %>%
group_by(SAMPLE) %>%
summarize(
VARIANT_COUNT = sum(VARIANT_COUNT, na.rm = TRUE),
INFORMATIVE_VARIANT_COUNT = sum(INFORMATIVE_VARIANT_COUNT, na.rm = TRUE),
DEPTH_SUM = sum(DEPTH_SUM, na.rm = TRUE),
.groups = 'drop'
) %>%
mutate(
lambda = DEPTH_SUM / VARIANT_COUNT,
missing_obs = 1 - INFORMATIVE_VARIANT_COUNT / VARIANT_COUNT,
missing_poisson = exp(-lambda)
)
# Overall population statistics
overall_stats <- per_sample %>%
summarize(
VARIANT_COUNT = sum(VARIANT_COUNT, na.rm = TRUE),
INFORMATIVE_VARIANT_COUNT = sum(INFORMATIVE_VARIANT_COUNT, na.rm = TRUE),
DEPTH_SUM = sum(DEPTH_SUM, na.rm = TRUE)
) %>%
mutate(
lambda = DEPTH_SUM / VARIANT_COUNT,
missing_obs = 1 - INFORMATIVE_VARIANT_COUNT / VARIANT_COUNT,
missing_poisson = exp(-lambda)
)
overall_stats
## # A tibble: 1 × 6
## VARIANT_COUNT INFORMATIVE_VARIANT_COUNT DEPTH_SUM lambda missing_obs
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 39583816218 13065064063 23361989704 0.590 0.670
## # ℹ 1 more variable: missing_poisson <dbl>
p_hist <- per_sample %>%
ggplot(aes(x = lambda)) +
xlab("Coverage") +
ylab("Sample Count") +
geom_histogram(aes(fill = after_stat(x))) +
geom_vline(xintercept = 0.59, color = 'red') +
scale_fill_viridis_c() +
ggpubr::theme_classic2(base_size = 8) +
theme(legend.position = "none")
print(p_hist)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The coverage distribution shows mean coverage of 0.59 with substantial variation across samples.
# Fit linear regression to log-transformed missingness
fit_linear <- lm(log(missing_obs) ~ lambda, data = per_sample)
per_sample$missing_exp_k <- exp(predict(fit_linear))
# Extract decay rate
k_est <- -coef(fit_linear)[2]
cat("Estimated k:", round(k_est, 3), "\n")
## Estimated k: 0.581
cat("Effective coverage (k*lambda) range:",
paste0(round(k_est * range(per_sample$lambda), 3), collapse = " - "), "\n")
## Effective coverage (k*lambda) range: 0 - 0.846
# Prepare data for nonlinear fitting
dat <- per_sample %>%
filter(is.finite(lambda), is.finite(missing_obs),
missing_obs > 0, missing_obs < 1)
# Initial parameter estimates
fit_missing <- lm(missing_obs ~ missing_poisson, data = per_sample)
pi_est <- coef(fit_missing)[1]
# Fit exponential-with-floor model
fit_exp_floor <- nlsLM(missing_obs ~ pi + (1 - pi) * exp(-k * lambda),
data = dat,
start = list(pi = pi_est, k = k_est),
lower = c(0, 0),
upper = c(1, 10))
# Extract fitted parameters
model_params <- coef(fit_exp_floor)
pi <- model_params[1]
k <- model_params[2]
cat("Floor parameter (π):", round(pi, 3), "\n")
## Floor parameter (π): 0.346
cat("Decay rate (k):", round(k, 3), "\n")
## Decay rate (k): 1.256
# Add model predictions
per_sample$missing_exp_floor <- predict(fit_exp_floor, newdata = per_sample)
decay_plot <- per_sample %>%
select(-missing_obs) %>%
ungroup() %>%
tidyr::pivot_longer(cols = starts_with("missing_"),
names_to = "model", values_to = "pred",
names_prefix = "missing_") %>%
ggplot(aes(x = lambda, y = pred, col = model, group = model)) +
geom_point(data = per_sample, aes(x = lambda, y = missing_obs),
inherit.aes = FALSE, alpha = 0.3) +
geom_line(size = 1) +
xlab("Coverage") +
ylab("Missing Fraction") +
geom_point(x = 0.590, y = 0.670, color = "red") +
ggpubr::theme_classic2(base_size = 16) +
theme(legend.position = c(0.2, 0.25),
legend.box = "horizontal")
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: A numeric `legend.position` argument in `theme()` was deprecated in ggplot2
## 3.5.0.
## ℹ Please use the `legend.position.inside` argument of `theme()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
print(decay_plot)
p_scatter_poisson <- per_sample %>%
ggplot(aes(x = missing_poisson, y = missing_obs, col = lambda)) +
geom_point() +
geom_abline(slope = 1, intercept = 0, color = 'red') +
scale_color_viridis_c(breaks = legend_breaks, labels = legend_labels,
limits = c(0, 1.5)) +
guides(col = guide_colorbar(title = "Coverage")) +
geom_point(x = 0.550, y = 0.670, color = "red") +
scale_x_continuous(limits = xy_limits, breaks = (2:10)/10) +
scale_y_continuous(limits = xy_limits, breaks = (2:10)/10) +
coord_fixed(ratio = 1) +
xlab("Expected Missing (Poisson)") +
ylab("Observed Missing") +
ggpubr::theme_classic2(base_size = 16) +
theme(legend.position = "top")
inlay_grob <- cowplot::as_grob(p_hist)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
final_plot_poisson <- cowplot::ggdraw() +
cowplot::draw_plot(p_scatter_poisson) +
cowplot::draw_plot(inlay_grob, x = 0.55, y = 0.16, width = 0.2, height = 0.25)
print(final_plot_poisson)
p_scatter_floor <- per_sample %>%
ggplot(aes(x = missing_exp_floor, y = missing_obs, col = lambda)) +
geom_point() +
geom_vline(xintercept = pi, linetype = "dashed") +
geom_abline(slope = 1, intercept = 0, color = 'red') +
scale_color_viridis_c(breaks = legend_breaks, labels = legend_labels,
limits = c(0, 1.5)) +
guides(col = guide_colorbar(title = "Coverage")) +
geom_point(x = 0.658, y = 0.670, color = "red") +
scale_x_continuous(limits = xy_limits, breaks = (2:10)/10) +
scale_y_continuous(limits = xy_limits, breaks = (2:10)/10) +
coord_fixed(ratio = 1) +
xlab("Expected Missing (Floor Model)") +
ylab("Observed Missing") +
ggpubr::theme_classic2(base_size = 16) +
theme(legend.position = "top")
final_plot_floor <- cowplot::ggdraw() +
cowplot::draw_plot(p_scatter_floor) +
cowplot::draw_plot(inlay_grob, x = 0.55, y = 0.16, width = 0.2, height = 0.25)
print(final_plot_floor)
# Update overall stats with floor model predictions
overall_stats <- overall_stats %>%
mutate(
missing_exp_floor = predict(fit_exp_floor,
newdata = data.frame(lambda = lambda))
)
# Display key results
cat("Population-level Statistics:\n")
## Population-level Statistics:
cat("Mean coverage (λ):", round(overall_stats$lambda, 3), "\n")
## Mean coverage (λ): 0.59
cat("Observed missingness:", round(overall_stats$missing_obs, 3), "\n")
## Observed missingness: 0.67
cat("Poisson expectation:", round(overall_stats$missing_poisson, 3), "\n")
## Poisson expectation: 0.554
cat("Floor model expectation:", round(overall_stats$missing_exp_floor, 3), "\n\n")
## Floor model expectation: 0.658
cat("Model Parameters:\n")
## Model Parameters:
cat("Floor parameter (π):", round(pi, 3), "\n")
## Floor parameter (π): 0.346
cat("Decay rate (k):", round(k, 3), "\n")
## Decay rate (k): 1.256
The exponential-with-floor model substantially improves fit over the simple Poisson model. The floor parameter π = 0.346 indicates that approximately 34.6% of sites remain uncallable even at high coverage, likely due to repetitive sequences, structural variants, or mapping artifacts. The decay rate k = 1.256 suggests slower-than-Poisson decay in missingness with increasing coverage.
Li, H., Ruan, J., & Durbin, R. (2008). Mapping short DNA sequencing reads and calling variants using mapping quality scores. Bioinformatics, 24(3), 223–230.
Nielsen, R., Paul, J. S., Albrechtsen, A., & Song, Y. S. (2011). Genotype and SNP calling from next-generation sequencing data. Nature Reviews Genetics, 12(6), 443–451.