BAM Alignemnt Genome Coverage and Missing Data Analysis in the BZea Population

1 Introduction

This analysis examines genome coverage and the distribution of missing read data in the BZea maize–teosinte population without calling genotypes or relying on VCF files. Instead, we analyzed the BAM alignments directly, prior to any genotype calling. We used GATK’s CollectAllelicCounts tool on the BAM files, restricting read counting to sites corresponding to ~27 million known teosinte variants (widesweq_ref panel). For each site and sample, the tool counted the number of reads supporting each allele, and these counts were then aggregated into 1 Mb bins for the initial file in the workflow.

By aggregating the per-bin read counts across the genome, we estimated coverage (λ) and compared the observed fraction of missing genotype calls against the expected missing fraction based on the Poisson distribution model of genome coverage.

1.1 Theoretical background

Modeling missingness in sequencing data often involves considering both stochastic sampling effects and systematic biases. A foundational concept is the Poisson coverage model, where the probability of observing zero reads at a site decreases exponentially with increasing coverage. This model reflects the stochastic nature of read sampling in sequencing experiments.

Additionally, certain genomic regions inherently exhibit higher levels of missingness due to factors like low-complexity regions / repetitive sequences, or structural variations. This leads to an irreducible baseline missingness, which remains even at high coverage depths with small reads. These two facts are the base for the exponential-with-floor model when applied to the BZea BAM data, providing a parsimonious and mechanistically interpretable summary of missingness across samples.

1.2 Models of Missingness

1.2.1 Poisson model

Under the Poisson model, if sequencing reads are randomly and independently distributed across the genome, the probability of missing a site is: \[ P_{\text{missing}} = e^{-\lambda} \] where:

\(\lambda\) = mean coverage (average reads per variant site)
\(P_{\text{missing}}\) = probability of no reads covering a site

This theoretical expectation serves as a baseline to evaluate whether missingness arises purely from random sequencing variation or if additional biases (e.g., mapping issues) are contributing.

1.2.2 Exponential-with-floor model

We model per-sample missingness as \[ P_{\text{missing}} = \pi + (1 - \pi) e^{-k \lambda} \] where:

\(\lambda\) is the mean coverage.
\(\pi\) (floor parameter): fraction of sites that remain uncalled even at high coverage with short reads. This baseline missingness likely corresponds to repetitive DNA, structural variants, or highly polymorphic regions where short reads fail to map or are discarded by mapping filters. For example picard CollectWgsMetrics reports a “PCT_EXC_TOTAL” statistic, which can be interpreted as a floor parameter. For data on the alux project I have a nominal coverage \(3.2x\) I got a mean \(\text{PCT_EXC_TOTAL} = 0.45 \sim \pi\).
\(k\) (decay rate): how much the amount of missing data decays with with increasing coverage relative to the Poisson model. I interpret the product \(k\lambda\) as an effective coverage (maybe associated the coverage of the informative variants).

2 Data Loading and Processing

# Load aggregated 1Mb bin allele count data.
# The file is named all_samples_bin_genotypes.tsv, but the genotype calls are irrelevant
# for the analysis, we only are interested the depth and missingness statistics.
base_dir <- "/Volumes/BZea/bzeaseq/ancestry"
table_file <- list.files(base_dir, pattern = "all_samples_bin_genotypes.tsv", 
                        full.names = TRUE)
df <- read_tsv(table_file, comment = "#")

## Rows: 3057288 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (9): SAMPLE, CONTIG, GENOTYPE, Kraw, Kasin, Kgmm, Kraw_HMM, Kasin_HMM, K...
## dbl (8): BIN_POS, VARIANT_COUNT, INFORMATIVE_VARIANT_COUNT, DEPTH_SUM, ALT_C...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# Compute per-sample coverage and missingness statistics
per_sample <- df %>%
  group_by(SAMPLE) %>%
  summarize(
    VARIANT_COUNT = sum(VARIANT_COUNT, na.rm = TRUE),
    INFORMATIVE_VARIANT_COUNT = sum(INFORMATIVE_VARIANT_COUNT, na.rm = TRUE),
    DEPTH_SUM = sum(DEPTH_SUM, na.rm = TRUE),
    .groups = 'drop'
  ) %>%
  mutate(
    lambda = DEPTH_SUM / VARIANT_COUNT,
    missing_obs = 1 - INFORMATIVE_VARIANT_COUNT / VARIANT_COUNT,
    missing_poisson = exp(-lambda)
  )

# Overall population statistics
overall_stats <- per_sample %>%
  summarize(
    VARIANT_COUNT = sum(VARIANT_COUNT, na.rm = TRUE),
    INFORMATIVE_VARIANT_COUNT = sum(INFORMATIVE_VARIANT_COUNT, na.rm = TRUE),
    DEPTH_SUM = sum(DEPTH_SUM, na.rm = TRUE)
  ) %>%
  mutate(
    lambda = DEPTH_SUM / VARIANT_COUNT,
    missing_obs = 1 - INFORMATIVE_VARIANT_COUNT / VARIANT_COUNT,
    missing_poisson = exp(-lambda)
  )

overall_stats

## # A tibble: 1 × 6
##   VARIANT_COUNT INFORMATIVE_VARIANT_COUNT   DEPTH_SUM lambda missing_obs
##           <dbl>                     <dbl>       <dbl>  <dbl>       <dbl>
## 1   39583816218               13065064063 23361989704  0.590       0.670
## # ℹ 1 more variable: missing_poisson <dbl>

3 Coverage Distribution Analysis

p_hist <- per_sample %>%
  ggplot(aes(x = lambda)) + 
  xlab("Coverage") +
  ylab("Sample Count") +
  geom_histogram(aes(fill = after_stat(x))) +
  geom_vline(xintercept = 0.59, color = 'red') +
  scale_fill_viridis_c() +
  ggpubr::theme_classic2(base_size = 8) +
  theme(legend.position = "none")

print(p_hist)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The coverage distribution shows mean coverage of 0.59 with substantial variation across samples.

4 Model Fitting

4.1 Simple Exponential Model

# Fit linear regression to log-transformed missingness
fit_linear <- lm(log(missing_obs) ~ lambda, data = per_sample)
per_sample$missing_exp_k <- exp(predict(fit_linear))

# Extract decay rate
k_est <- -coef(fit_linear)[2]
cat("Estimated k:", round(k_est, 3), "\n")

## Estimated k: 0.581

cat("Effective coverage (k*lambda) range:",
    paste0(round(k_est * range(per_sample$lambda), 3), collapse = " - "), "\n")

## Effective coverage (k*lambda) range: 0 - 0.846

4.2 Exponential-with-Floor Model

# Prepare data for nonlinear fitting
dat <- per_sample %>%
  filter(is.finite(lambda), is.finite(missing_obs),
         missing_obs > 0, missing_obs < 1)

# Initial parameter estimates
fit_missing <- lm(missing_obs ~ missing_poisson, data = per_sample) 
pi_est <- coef(fit_missing)[1]

# Fit exponential-with-floor model
fit_exp_floor <- nlsLM(missing_obs ~ pi + (1 - pi) * exp(-k * lambda),
                      data = dat,
                      start = list(pi = pi_est, k = k_est),
                      lower = c(0, 0),
                      upper = c(1, 10))

# Extract fitted parameters
model_params <- coef(fit_exp_floor)
pi <- model_params[1]
k <- model_params[2]

cat("Floor parameter (π):", round(pi, 3), "\n")

## Floor parameter (π): 0.346

cat("Decay rate (k):", round(k, 3), "\n")

## Decay rate (k): 1.256

# Add model predictions
per_sample$missing_exp_floor <- predict(fit_exp_floor, newdata = per_sample)

5 Model Comparison

decay_plot <- per_sample %>%
  select(-missing_obs) %>% 
  ungroup() %>%
  tidyr::pivot_longer(cols = starts_with("missing_"),
                     names_to = "model", values_to = "pred",
                     names_prefix = "missing_") %>%
  ggplot(aes(x = lambda, y = pred, col = model, group = model)) +
  geom_point(data = per_sample, aes(x = lambda, y = missing_obs), 
            inherit.aes = FALSE, alpha = 0.3) +
  geom_line(size = 1) +
  xlab("Coverage") +
  ylab("Missing Fraction") +
  geom_point(x = 0.590, y = 0.670, color = "red") +
  ggpubr::theme_classic2(base_size = 16) +
  theme(legend.position = c(0.2, 0.25), 
        legend.box = "horizontal")

## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

## Warning: A numeric `legend.position` argument in `theme()` was deprecated in ggplot2
## 3.5.0.
## ℹ Please use the `legend.position.inside` argument of `theme()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

print(decay_plot)

6 Model Validation

6.1 Poisson Model Assessment

p_scatter_poisson <- per_sample %>%
  ggplot(aes(x = missing_poisson, y = missing_obs, col = lambda)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0, color = 'red') +
  scale_color_viridis_c(breaks = legend_breaks, labels = legend_labels, 
                        limits = c(0, 1.5)) +
  guides(col = guide_colorbar(title = "Coverage")) +
  geom_point(x = 0.550, y = 0.670, color = "red") +
  scale_x_continuous(limits = xy_limits, breaks = (2:10)/10) + 
  scale_y_continuous(limits = xy_limits, breaks = (2:10)/10) + 
  coord_fixed(ratio = 1) +
  xlab("Expected Missing (Poisson)") +
  ylab("Observed Missing") +
  ggpubr::theme_classic2(base_size = 16) +
  theme(legend.position = "top")

inlay_grob <- cowplot::as_grob(p_hist)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

final_plot_poisson <- cowplot::ggdraw() +
  cowplot::draw_plot(p_scatter_poisson) +
  cowplot::draw_plot(inlay_grob, x = 0.55, y = 0.16, width = 0.2, height = 0.25)

print(final_plot_poisson)

6.2 Exponential-with-Floor Model Assessment

p_scatter_floor <- per_sample %>%
  ggplot(aes(x = missing_exp_floor, y = missing_obs, col = lambda)) +
  geom_point() +
  geom_vline(xintercept = pi, linetype = "dashed") +
  geom_abline(slope = 1, intercept = 0, color = 'red') +
  scale_color_viridis_c(breaks = legend_breaks, labels = legend_labels, 
                        limits = c(0, 1.5)) +
  guides(col = guide_colorbar(title = "Coverage")) +
  geom_point(x = 0.658, y = 0.670, color = "red") +
  scale_x_continuous(limits = xy_limits, breaks = (2:10)/10) + 
  scale_y_continuous(limits = xy_limits, breaks = (2:10)/10) + 
  coord_fixed(ratio = 1) +
  xlab("Expected Missing (Floor Model)") +
  ylab("Observed Missing") +
  ggpubr::theme_classic2(base_size = 16) +
  theme(legend.position = "top")

final_plot_floor <- cowplot::ggdraw() +
  cowplot::draw_plot(p_scatter_floor) +
  cowplot::draw_plot(inlay_grob, x = 0.55, y = 0.16, width = 0.2, height = 0.25)

print(final_plot_floor)

7 Results Summary

# Update overall stats with floor model predictions
overall_stats <- overall_stats %>%
  mutate(
    missing_exp_floor = predict(fit_exp_floor, 
                               newdata = data.frame(lambda = lambda))
  )

# Display key results
cat("Population-level Statistics:\n")

## Population-level Statistics:

cat("Mean coverage (λ):", round(overall_stats$lambda, 3), "\n")

## Mean coverage (λ): 0.59

cat("Observed missingness:", round(overall_stats$missing_obs, 3), "\n")

## Observed missingness: 0.67

cat("Poisson expectation:", round(overall_stats$missing_poisson, 3), "\n")

## Poisson expectation: 0.554

cat("Floor model expectation:", round(overall_stats$missing_exp_floor, 3), "\n\n")

## Floor model expectation: 0.658

cat("Model Parameters:\n")

## Model Parameters:

cat("Floor parameter (π):", round(pi, 3), "\n")

## Floor parameter (π): 0.346

cat("Decay rate (k):", round(k, 3), "\n")

## Decay rate (k): 1.256

The exponential-with-floor model substantially improves fit over the simple Poisson model. The floor parameter π = 0.346 indicates that approximately 34.6% of sites remain uncallable even at high coverage, likely due to repetitive sequences, structural variants, or mapping artifacts. The decay rate k = 1.256 suggests slower-than-Poisson decay in missingness with increasing coverage.

8 References

Li, H., Ruan, J., & Durbin, R. (2008). Mapping short DNA sequencing reads and calling variants using mapping quality scores. Bioinformatics, 24(3), 223–230.

Nielsen, R., Paul, J. S., Albrechtsen, A., & Song, Y. S. (2011). Genotype and SNP calling from next-generation sequencing data. Nature Reviews Genetics, 12(6), 443–451.