This document contains analyses and figures for our study on the mode of genetic inheritance in Varroa destructor.
The full paper by Eliash et al. 2025, “Varroa mites escape the evolutionary trap of haplodiploidy”, can be found in here.

The document is composed of two main sections:

Supplementary figures
Main figures

Note, that apart from Figure S1 (Karyotyping of varroa males and females), all the rest of the analysis and figures are focused on the Genomic Pedigree study.

Supplementry figures

Figure S1: Karyotyping of varroa males and females

dat <- read_csv("/Users/nuriteliash/Documents/GitHub/varroa-karyotyping/data/karyo_varroa_sasha.csv", col_types = "cccnnn")

dat_means <- dat %>% group_by(ID) %>% 
    reframe(
        chroms = sum(cell_count * chromo_number) / sum(cell_count), 
        count = sum(cell_count), 
        sd = sqrt(sum(cell_count * (chromo_number - chroms)^2) / sum(cell_count)), 
        sem = sd / sqrt(count), 
        ci_lower = chroms - qt(0.975, df = count - 1) * sem, 
        ci_upper = chroms + qt(0.975, df = count - 1) * sem, 
        sex = Sex[1])

dat_means %>%
ggplot(aes(x = sex, y = chroms, color = sex, size = count)) + 
  geom_pointrange(aes(ymin = ci_lower, ymax = ci_upper), size = 1, width = 0.05, position=position_jitter(width=0.2)) + 
  scale_color_manual(values = c("Male" = "blue", "Female" = "#E31C79")) +  # Custom colors
  scale_size_continuous(range = c(3, 10)) +  # Custom size range
  geom_hline(yintercept = 7, color = "blue") + 
  geom_hline(yintercept = 14, color = "#E31C79") + 
  theme_bw() + 
  ylab("Chromosome count") + 
  geom_signif(comparisons = list(c("Male", "Female")), map_signif_level = TRUE, test = "wilcox.test", color = "black") + 
  scale_y_continuous(breaks = scales::breaks_pretty()) + 
  theme(legend.position = "bottom") + 
  guides(color = "none") +  
  labs(size = "Number of cells examined per individual") + 
  xlab("Sex")

#ggsave("karyptype.pdf", heigh=5, width=5)

cat(“*Figure S1.** Chromosome count analysis in male and female Varroa destructor mites. Points show mean chromosome counts for individual mites, with error bars representing 95% confidence intervals. Point size indicates the number of cells examined per individual. Blue and pink horizontal lines represent the expected haploid (n=7) and diploid (2n=14) chromosome numbers, respectively. Statistical significance between male and female counts is indicated by the horizontal bar.”)

Do males have significantly less chromosomes than the diploid number of 14?

filter(dat_means, sex == "Male") %>% pull(chroms) %>% t.test(., mu = 14)

## 
##  One Sample t-test
## 
## data:  .
## t = -6.5908, df = 7, p-value = 0.0003069
## alternative hypothesis: true mean is not equal to 14
## 95 percent confidence interval:
##   6.926898 10.662108
## sample estimates:
## mean of x 
##  8.794503

Genomic Pedigree study

This analysis is part of a comprehensive study investigating the genetic inheritance patterns in Varroa destructor mites. We analyzed genomic data from 223 mite samples across 30 families, collected from three Apis mellifera colonies. Each family consists of three generations: F0 foundress adult female mite, F1 male and female adult offspring mites, and their F2 offspring, composed of immature mites (nymphs and eggs) and at least one adult male.

The analysis should be read in conjunction with the detailed methodology presented in Site Quality Analysis.

Data Availability

Raw sequencing data: BioProject PRJNA794941
Analysis code: GitHub Repository
Filtered datasets: Available upon request

Data Sources

Raw fastq reads: BioProject PRJNA794941
Code repository: GitHub

Required Files

Before running this analysis, ensure you have:
1. The VCF file processed with the specified parameters
2. The filtered sites CSV file generated from site quality exploration
3. All required R packages installed

Data Processing

VCF File Processing

The initial variant calling file was filtered using VCFtools with stringent quality control parameters to ensure high-confidence variant calls. The filtering criteria were:

vcftools --vcf snponly_freebayes.vcf \
  --chr NW_019211454.1 \
  --chr NW_019211455.1 \
  --chr NW_019211456.1 \
  --chr NW_019211457.1 \
  --chr NW_019211458.1 \
  --chr NW_019211459.1 \
  --chr NW_019211460.1 \
  --max-alleles 2 \
  --minQ 15000 \
  --minDP 16 \
  --maxDP 40 \
  --max-missing 0.5 \
  --maf 0.2 \
  --recode \
  --recode-INFO-all \
  --out Q15000BIALLDP16HDP40mis.5maf.2Chr7

VCF file filtering rationale

Chromosomes: Selected the 7 largest chromosomes (NW_019211454.1-460.1)
Alleles: Limited to biallelic sites to focus on clear binary inheritance patterns
Quality: Minimum Phred score of 15000 ensures high base quality
Depth: Range of 16-40 balances between coverage requirements and potential artifacts
Missing Data: Maximum 50% missing data per site maintains statistical power
Minor Allele Frequency: Minimum 0.2 reduces impact of sequencing errors

Quality Control Metrics

We implemented three key metrics to assess genotype quality:

Variant Quality Distribution (see Figure S6)
Variant Allele Frequency (VAF)
- Calculated as: VAF = AO/DP
- AO: Alternate allele observations
- DP: Total read depth
- Provides measure of allelic balance
Quality Allele Frequency (QAF)
- Calculated as: QAF = QA/(QA+QO)
- QA: Quality scores for alternate allele
- QO: Quality scores for reference allele
- Weights allele calls by base quality

Figure S6: Variant Quality Distribution

## Summary of variant quality scores:

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##     40.48  33240.80  48198.10  52338.46  69672.30 191031.00

## 
## **Figure S6.** Distribution of variant quality scores across sites that passed initial filtering (depth ≥16, ≤40; maximum 50% missing data). The red dashed line indicates the quality threshold (15000) used for final filtering. %.1f%% of variants have quality scores above this threshold. Higher quality scores indicate greater confidence in variant calls. 96.46279

Genotype Classification Thresholds, VAF and QAF

Based on empirical distribution analysis (see the following figures S7-9), we established the following thresholds:

Genotype	VAF Range	QAF Range	Interpretation
AA (Ref)	< 0.01	< 0.01	Strong reference support
AB (Het)	0.2 - 0.8	0.2 - 0.8	Balanced allele representation
BB (Alt)	> 0.9	> 0.9	Strong alternate support

These thresholds were validated using known inheritance patterns in our pedigree structure.

The site quality filtering criteria were developed based on Variant Allele Frequency (VAF) and Genotype Quality Allele Frequency (QAF) analysis. For detailed methodology and validation of these thresholds, see Site Quality Analysis.

We applied two additional filtering parameters to account for biases in alternate allele representation and to ensure high-confidence genotype calls based on allele frequency and quality.
Variant allele frequency (VAF) was calculated as VAF = AO/DP, where AO (Alternate Observations) represents the number of reads supporting the alternate allele, and DP (Depth) is the total read depth at the site.
Similarly, Quality Allele Frequency (QAF) was computed as QAF = QA/QA+QO, where QA (Quality of Alternate Allele) is the sum of Phred-scaled base quality scores for reads supporting the alternate allele, and QO (Quality of Reference Allele) is the sum of Phred-scaled base quality scores for reads supporting the reference allele.

Figure S7-9 : Picking the thresholds for site quality: Distribution of VAF and QAF in F2 Offspring of homozygous parents

Figure S7, Quality Allele Frequency (QAF) and Variant Allele Frequency (VAF) in F2 females of AA parents.

# Read the f2_homo_parents data (sanity check data) from CSV: F2 females at the filtered positions, on chrom 1, where F1 male and female are homozygous (AA)
# the file was created using the ~/Documents/GitHub/varroa-pedigree-study/R_scripts/After_revision/site_quality_exploration.Rmd
f2_AA_parents <- read.csv("/Users/nuriteliash/Documents/GitHub/varroa-pedigree-study/data/f2_AA_parents.csv")

## 
## **Figure S1.** Distribution of Quality Allele Frequency (QAF) and Variant Allele Frequency (VAF) values in F2 females from crosses with homozygous AA parents. **A-B.** QAF and VAF distributions shown on logarithmic scales. **C-D.** Same distributions shown on linear scales. In all plots, shaded regions indicate the thresholds for genotype classification: AA (yellow, < 0.01). Vertical dashed lines mark the threshold boundaries. The plots show the distribution of genotype calls, with AA genotypes predominantly falling below the 0.01 threshold range.

Figure S8, Quality Allele Frequency (QAF) and Variant Allele Frequency (VAF) in F2 females of BB parents.

# Read the f2_homo_parents data (sanity check data) from CSV: F2 females at the filtered positions, on chrom 1, where F1 male and female are homozygous (AA)
# the file was created using the ~/Documents/GitHub/varroa-pedigree-study/R_scripts/After_revision/site_quality_exploration.Rmd
f2_BB_parents <- read.csv("/Users/nuriteliash/Documents/GitHub/varroa-pedigree-study/data/f2_BB_parents.csv")

## 
## **Figure S2.** Distribution of Quality Allele Frequency (QAF) and Variant Allele Frequency (VAF) values in F2 females from crosses with homozygous BB parents. **A-B.** QAF and VAF distributions shown on logarithmic scales. **C-D.** Same distributions shown on linear scales. In all plots, shaded regions indicate the thresholds for genotype classification: BB (blue, > 0.9). Vertical dashed lines mark the threshold boundaries. The plots show the distribution of genotype calls, with BB genotypes predominantly falling above the 0.9 threshold range.

Figure S9, Quality Allele Frequency (QAF) and Variant Allele Frequency (VAF) in F2 females of AA and AB parents.

# Read the f2_homo_parents data (sanity check data) from CSV: F2 females at the filtered positions, on chrom 1, where F1 male and female are homozygous (AA)
# the file was created using the ~/Documents/GitHub/varroa-pedigree-study/R_scripts/After_revision/site_quality_exploration.Rmd
f2_AB_combined <- read.csv("/Users/nuriteliash/Documents/GitHub/varroa-pedigree-study/data/f2_AB_combined.csv")

# QAF log-scale plot
QAF_AB_gt_log <- ggplot(f2_AB_combined, aes(x = QAF, fill = GT)) +
  # First set up the basic plot structure and coordinates
  scale_x_continuous(breaks = seq(0, 1, by = 0.1), limits = c(-0.05, 1.05)) +
  scale_y_continuous(
    trans = "log10",
    breaks = 10^(0:5),
    labels = scales::trans_format("log10", scales::math_format(10^.x))
  ) +
  coord_cartesian(ylim = c(2, max(table(f2_AB_combined$QAF)) * 1.1)) +
  # Add rectangles with explicit y-limits matching the coordinate system
  annotate("rect", xmin = 0.2, xmax = 0.8, ymin = 1, ymax = max(table(f2_AB_combined$QAF)) * 1.1, alpha = 0.2, fill = "#66b032") +
  # Add the data layers
  geom_histogram(binwidth = 0.02, alpha = 1, position = "identity") +
  geom_vline(xintercept = c(0.2, 0.8), linetype = "dashed", color = "gray", alpha = 1) +
  # Theme and labels
  theme_classic() +
  labs(x = "Quality Allele Frequency (QAF)",
       y = "Count (log scale)",
       title = "A") +
  scale_fill_manual(
    values = c("AA" = "#ffbf00", "AB" = "#66b032", "BB" = "#1982c4"),
    name = "Genotype"
  ) +
  theme(
    legend.position = "none",
    axis.text.x = element_text(size = 10, angle = 45, hjust = 1, vjust = 1),
    axis.title.x = element_text(margin = margin(t = 15)),
    plot.margin = margin(10, 10, 30, 10),
    plot.title = element_text(face = "bold", hjust = -0.2)
  )

# VAF log-scale plot
VAF_AB_gt_log <- ggplot(f2_AB_combined, aes(x = VAF, fill = GT)) +
  # First set up the basic plot structure and coordinates
  scale_x_continuous(breaks = seq(0, 1, by = 0.1), limits = c(-0.05, 1.05)) +
  scale_y_continuous(
    trans = "log10",
    breaks = 10^(0:5),
    labels = scales::trans_format("log10", scales::math_format(10^.x))
  ) +
  coord_cartesian(ylim = c(2, max(table(f2_AB_combined$VAF)) * 1.1)) +
  # Add rectangles with explicit y-limits matching the coordinate system
  annotate("rect", xmin = 0.2, xmax = 0.8, ymin = 1, ymax = max(table(f2_AB_combined$VAF)) * 1.1, alpha = 0.2, fill = "#66b032") +
  # Add the data layers
  geom_histogram(binwidth = 0.02, alpha = 1, position = "identity") +
  geom_vline(xintercept = c(0.2, 0.8), linetype = "dashed", color = "gray", alpha = 1) +
  # Theme and labels
  theme_classic() +
  labs(x = "Variant Allele Frequency (VAF)",
       y = "Count (log scale)",
       title = "B") +
  scale_fill_manual(
    values = c("AA" = "#ffbf00", "AB" = "#66b032", "BB" = "#1982c4"),
    name = "Genotype"
  ) +
  theme(
    legend.position = "right",
    axis.text.x = element_text(size = 10, angle = 45, hjust = 1, vjust = 1),
    axis.title.x = element_text(margin = margin(t = 15)),
    plot.margin = margin(10, 10, 30, 10),
    plot.title = element_text(face = "bold", hjust = -0.2)
  )

# QAF non-log scale plot
QAF_AB_gt <- ggplot(f2_AB_combined, aes(x = QAF, fill = GT)) +
  # First set up the basic plot structure and coordinates
  scale_x_continuous(breaks = seq(0, 1, by = 0.1), limits = c(-0.05, 1.05)) +
  # Add rectangles with explicit y-limits matching the coordinate system
  annotate("rect", xmin = 0.2, xmax = 0.8, ymin = 0, ymax = max(table(f2_AB_combined$QAF)) * 1.1, alpha = 0.2, fill = "#66b032") +
  # Add the data layers
  geom_histogram(binwidth = 0.02, alpha = 1, position = "identity") +
  geom_vline(xintercept = c(0.2, 0.8), linetype = "dashed", color = "gray", alpha = 1) +
  # Theme and labels
  theme_classic() +
  labs(x = "Quality Allele Frequency (QAF)",
       y = "Count",
       title = "C") +
  scale_fill_manual(
    values = c("AA" = "#ffbf00", "AB" = "#66b032", "BB" = "#1982c4"),
    name = "Genotype"
  ) +
  theme(
    legend.position = "none",
    axis.text.x = element_text(size = 10, angle = 45, hjust = 1, vjust = 1),
    axis.title.x = element_text(margin = margin(t = 15)),
    plot.margin = margin(10, 10, 30, 10),
    plot.title = element_text(face = "bold", hjust = -0.2)
  )

# VAF non-log scale plot
VAF_AB_gt <- ggplot(f2_AB_combined, aes(x = VAF, fill = GT)) +
  # First set up the basic plot structure and coordinates
  scale_x_continuous(breaks = seq(0, 1, by = 0.1), limits = c(-0.05, 1.05)) +
  # Add rectangles with explicit y-limits matching the coordinate system
  annotate("rect", xmin = 0.2, xmax = 0.8, ymin = 0, ymax = max(table(f2_AB_combined$VAF)) * 1.1, alpha = 0.2, fill = "#66b032") +
  # Add the data layers
  geom_histogram(binwidth = 0.02, alpha = 1, position = "identity") +
  geom_vline(xintercept = c(0.2, 0.8), linetype = "dashed", color = "gray", alpha = 1) +
  # Theme and labels
  theme_classic() +
  labs(x = "Variant Allele Frequency (VAF)",
       y = "Count",
       title = "D") +
  scale_fill_manual(
    values = c("AA" = "#ffbf00", "AB" = "#66b032", "BB" = "#1982c4"),
    name = "Genotype"
  ) +
  theme(
    legend.position = "right",
    axis.text.x = element_text(size = 10, angle = 45, hjust = 1, vjust = 1),
    axis.title.x = element_text(margin = margin(t = 15)),
    plot.margin = margin(10, 10, 30, 10),
    plot.title = element_text(face = "bold", hjust = -0.2)
  )

# Combine plots with shared legend
combined_plot <- (QAF_AB_gt_log + VAF_AB_gt_log) / (QAF_AB_gt + VAF_AB_gt) + plot_layout(ncol = 2, guides = "collect")

# Display the combined plot
combined_plot

# Add unified caption below the plot
cat("\n**Figure S3.** Distribution of Quality Allele Frequency (QAF) and Variant Allele Frequency (VAF) values in F2 females from crosses with heterozygous AB parents. **A-B.** QAF and VAF distributions shown on logarithmic scales. **C-D.** Same distributions shown on linear scales. In all plots, shaded regions indicate the thresholds for genotype classification: AB (green, 0.2-0.8). Vertical dashed lines mark the threshold boundaries. The plots show the distribution of genotype calls, with AB genotypes predominantly falling within the 0.2-0.8 threshold range.")

## 
## **Figure S3.** Distribution of Quality Allele Frequency (QAF) and Variant Allele Frequency (VAF) values in F2 females from crosses with heterozygous AB parents. **A-B.** QAF and VAF distributions shown on logarithmic scales. **C-D.** Same distributions shown on linear scales. In all plots, shaded regions indicate the thresholds for genotype classification: AB (green, 0.2-0.8). Vertical dashed lines mark the threshold boundaries. The plots show the distribution of genotype calls, with AB genotypes predominantly falling within the 0.2-0.8 threshold range.

Based on the distribution of sites’ VAF and QAF values, the following filtering criteria were applied:
- Homozygous Reference (AA): VAF < 0.01 and QAF < 0.01, sites exhibit low alternate allele support
- Heterozygous (AB): 0.2 < VAF < 0.8 and 0.2 < QAF < 0.8, sites maintain a balanced allele distribution
- Homozygous Alternate (BB): VAF > 0.9 and QAF > 0.9, sites demonstrate high alternate allele support

This filtering approach enhances the accuracy of genotype classification while reducing the inclusion of low-confidence variant calls, by minimizing genotyping errors due to sequencing artifacts or depth-related biases.

set the thresholds for QAF and VAF

all_param <- readRDS("/Users/nuriteliash/Documents/GitHub/varroa-pedigree-study/data/all_param.rds")

# Create the filtered sites dataset that will be used in further analsyis
filtered_sites <- all_param %>%
  filter(
    (GT == "AA" & QAF < 0.01 & VAF < 0.01) |   # Homozygous Reference (AA)
    (GT == "AB" & QAF > 0.2 & QAF < 0.8 & VAF > 0.2 & VAF < 0.8) |  # Heterozygous (AB)
    (GT == "BB" & QAF > 0.9 & VAF > 0.9)    # Homozygous Alternate (BB)
  ) %>%
  select(family, Sample, site, QAF, VAF, GT)   %>%
   mutate(family = as.character(family),
Sample = as.character(Sample), # Extract Chromosome from the site column
chrom = str_split(site, pattern = "_", simplify = TRUE)[, 2]) %>% 
rename(gt = GT) # Keep only necessary columns

Main figures

Figure 1A: Crossing F1 mother AB with F1 father AA, one family

~/Documents/GitHub/varroa-pedigree-study/R_scripts/After_revision/site_quality_exploration.Rmd

# Load the fam2_selected data, for one family (478) on the largest chromosome, NW_019211454
fam_478_F0AA_F1AB <- read.csv("/Users/nuriteliash/Documents/GitHub/varroa-pedigree-study/data/fam_478_F0AA_F1AB.csv")

# Create a numeric x-axis for spacing (convert 'member' to factor and assign spacing)
fam_id <- 478

fam_478_F0AA_F1AB %>%
  ggplot(aes(x = member_num, y = pos / 1e6, color = GT)) +  # Convert pos to Mb
  geom_segment(aes(x = member_num - 0.3, xend = member_num + 0.3, 
                   y = pos / 1e6, yend = pos / 1e6), size = 1.5) +  # Apply same conversion
  scale_x_continuous(breaks = unique(fam_478_F0AA_F1AB$member_num), labels = unique(fam_478_F0AA_F1AB$member)) +  # Restore original labels
  scale_color_manual(
    values = c("AA" = "#ffbf00", "AB" = "#66b032", "BB" = "#1982c4"),  # Updated colors for renamed genotypes
    name = "Genotype",
    breaks = c("AA", "AB", "BB")  # Ensures BB appears in the legend even if absent
  ) +
  theme_classic() +
  labs(
    title = paste("Genotype Flow Across Family Members\nFamily ID:", fam_id),
    x = "Family Member",
    y = "Genomic Position (Mb)"
  ) +
  theme(
    axis.text.x = element_text(size = 10, angle = 45, hjust = 1, vjust = 1),  # Adjusted text size and angle
    axis.title.x = element_text(margin = margin(t = 15)),  # Add margin above x-axis title
    panel.grid.major = element_blank(), 
    legend.position = "right",  # Move legend to the right
    legend.box.margin = margin(0, 0, 0, 10),  # Add margin to separate legend from plot
    plot.margin = margin(10, 20, 30, 10)  # Increased bottom margin to accommodate rotated labels
  )

# Add caption below the plot
cat("\n**Figure 1A.** Genotype flow visualization across a single family (Family ID: 478). Each horizontal line represents a genomic position on chromosome NW_019211454.1, with colors indicating genotypes (yellow = AA, green = AB, blue = BB). Family members are arranged in generational order, showing the inheritance pattern from parents to offspring. The y-axis shows the genomic position in megabases (Mb).")

## 
## **Figure 1A.** Genotype flow visualization across a single family (Family ID: 478). Each horizontal line represents a genomic position on chromosome NW_019211454.1, with colors indicating genotypes (yellow = AA, green = AB, blue = BB). Family members are arranged in generational order, showing the inheritance pattern from parents to offspring. The y-axis shows the genomic position in megabases (Mb).

Figure 1C: Crossing F1 mother AB with F1 father AA, all families

# Set the families (include only families with at least one adult F2)
family <- grep("grndat|grnson", gt$ID, value=TRUE) %>%
  str_extract("[^_]+") %>%
  unique()

# Define a list to store all data frames per family
obs <- list()

for (fam in family) {
    obs[[fam]] <- table %>%
      dplyr::select(starts_with(fam)) %>%
    # Add site and chromosome information
      rownames_to_column("site") %>%
    mutate(chrom = str_split(site, pattern = "_", simplify = TRUE)[, 2]) %>%
    # Then filter for the genotype patterns we want
      dplyr::filter_at(vars(matches("_fnd")), all_vars(. == "AA")) %>%  # F0 Female must be AA
      dplyr::filter_at(vars(matches("_son")), all_vars(. == "AA")) %>%  # Sons must be AA
      dplyr::filter_at(vars(matches("_dat")), all_vars(. == "AB")) %>%  # Daughters must be AB
    # Select only the grandchildren columns and keep site/chrom info
    dplyr::select(site, chrom, contains("grn")) %>%
    # Reshape the data
      tidyr::pivot_longer(cols = -c(site, chrom)) %>%
    dplyr::rename(Sample = name, gt = value) %>%
    # filter for sites 
    left_join(filtered_sites, by = c("site", "chrom", "Sample", "gt")) %>%
    na.omit() %>% # Count genotypes
    dplyr::count(Sample, gt, chrom, .drop = FALSE) %>%
      dplyr::filter(gt %in% c("AA", "BB", "AB")) %>%
      mutate(n = as.numeric(n)) %>%
    group_by(Sample) %>%
      mutate(total = as.numeric(sum(n))) %>%
      mutate(sex = case_when(
      grepl("son", Sample) ~ "Male",
      grepl("dat", Sample) ~ "Female"
      )) %>%
    # Add family column
      mutate(family = fam)
}

# Bind all families together into a final data frame containing all observed counts
observed <- do.call("rbind", obs) %>% 
  mutate(Sample = as.character(Sample))

# Create samples dataframe
samples <- data.frame(
  Sample = rep(unique(observed$Sample), each = 3), 
  gt = rep(c("AA", "AB", "BB"))
) %>%   
  mutate(family = str_extract(Sample, "^[0-9]+"))

# Join and create final dataset
samples_obs_filtered_ABxAA <- left_join(samples, observed, by = c("family", "Sample", "gt")) %>% 
  mutate(sex = case_when(
    grepl("son", Sample) ~ "Male",
    grepl("dat", Sample) ~ "Female"
  )) %>%
  group_by(Sample, chrom) %>%
    mutate(total = as.numeric(sum(n))) %>%
    dplyr::mutate(prop = n/total) %>%
    na.omit()

### Summary of sites filtered in each sample
# Create a table showing sites filtered per sample
sites_per_sample <- table %>%
  rownames_to_column("site") %>%
  tidyr::pivot_longer(cols = -site, names_to = "Sample", values_to = "gt") %>%
  # Filter out NA values before counting
  filter(!is.na(gt)) %>%
  group_by(Sample) %>%
  summarise(
    total_sites = n(),
    .groups = 'drop'
  )

filtered_sites_per_sample <- filtered_sites %>%
  group_by(Sample) %>%
  summarise(
    filtered_sites = n(),
    .groups = 'drop'
  )

# Join and calculate differences
sites_comparison <- sites_per_sample %>%
  left_join(filtered_sites_per_sample, by = "Sample") %>%
  mutate(
    filtered_sites = replace_na(filtered_sites, 0),
    sites_removed = total_sites - filtered_sites,
    percent_removed = round((sites_removed / total_sites) * 100, 2)
  ) %>%
  arrange(desc(sites_removed))

# Display the table
sites_comparison %>%
  select(Sample, total_sites, filtered_sites, sites_removed, percent_removed) %>%
  rename(
    "Sample" = Sample,
    "Total Sites" = total_sites,
    "Sites Kept" = filtered_sites,
    "Sites Removed" = sites_removed,
    "% Removed" = percent_removed
  ) %>%
  knitr::kable(
    digits = 2,
    caption = "Number of sites filtered per sample",
    format = "html"
  ) %>%
  kableExtra::kable_styling(
    bootstrap_options = "striped",
    full_width = TRUE,
    position = "left",
    fixed_thead = TRUE
  ) %>%
  kableExtra::scroll_box(
    width = "100%", 
    height = "300px"
  )

Number of sites filtered per sample
Sample	Total Sites	Sites Kept	Sites Removed	% Removed
240_241a_grndat	27635	25552	2083	7.54
302_302a_son	23714	21809	1905	8.03
400_400a_son	29567	27804	1763	5.96
338_338_fnd	30664	29027	1637	5.34
426_427a_grndat	30732	29197	1535	4.99
412_412a_son	29902	28446	1456	4.87
548_549_dat	18715	17279	1436	7.67
458_458_fnd	29867	28468	1399	4.68
412_412_fnd	27910	26559	1351	4.84
400_401a_grnson	29747	28407	1340	4.50
302_303b_grnson	28367	27044	1323	4.66
412_413_dat	24250	22953	1297	5.35
400_401_dat	29889	28595	1294	4.33
458_459_dat	30157	28864	1293	4.29
476_477b_grnson	28391	27102	1289	4.54
534_535_2a_grndat	20940	19683	1257	6.00
478_479-1a_grnson	28898	27738	1160	4.01
300_300_fnd	22043	20903	1140	5.17
596_596a_son	28716	27597	1119	3.90
548_548_fnd	21544	20461	1083	5.03
458_459a_grnson	28336	27310	1026	3.62
266_267a_grndat	16208	15192	1016	6.27
302_302_fnd	18528	17523	1005	5.42
300_300a_son	27961	26967	994	3.55
534_535_2c_grnson	17322	16338	984	5.68
110_111c_grnson	30064	29096	968	3.22
426_427_dat	29893	28930	963	3.22
284_285b_grndat	30316	29360	956	3.15
478_478a_son	28243	27290	953	3.37
476_477d_grndat	28763	27820	943	3.28
110_111_dat	20695	19766	929	4.49
266_267_dat	26325	25404	921	3.50
534_534a_son	18886	17968	918	4.86
498_499c_grndat	27543	26628	915	3.32
57_58_dat	30612	29703	909	2.97
564_565-1_dat	12733	11826	907	7.12
534_535_2_dat	13200	12303	897	6.80
57_58a_grndat	28188	27303	885	3.14
400_400_fnd	27343	26461	882	3.23
338_339a_grndat	30321	29459	862	2.84
596_597_dat	26693	25846	847	3.17
63_63a_son	29482	28639	843	2.86
476_476a_son	24736	23894	842	3.40
476_477a_grndat	25948	25115	833	3.21
426_427b_grnson	16382	15572	810	4.94
520_520a_son	14326	13520	806	5.63
548_548a_son	11629	10825	804	6.91
63_64_dat	30582	29786	796	2.60
240_240_fnd	30066	29276	790	2.63
498_499a_grnson	27913	27126	787	2.82
43_44_dat	28703	27928	775	2.70
110_111d_grndat	26068	25302	766	2.94
133_134_dat	24195	23445	750	3.10
478_478_fnd	25274	24527	747	2.96
284_284a_son	28711	27966	745	2.59
46_46a_son	22390	21653	737	3.29
338_339_dat	30047	29311	736	2.45
338_339b_grnson	10129	9397	732	7.23
520_520_fnd	29042	28319	723	2.49
177_178_dat	29476	28755	721	2.45
600_600a_son	14625	13907	718	4.91
478_479-1b_grndat	14808	14095	713	4.81
564_565-1b_grnson	25552	24841	711	2.78
240_240a_son	23131	22422	709	3.07
110_110a_son	9180	8472	708	7.71
534_534_fnd	19995	19290	705	3.53
187_187_fnd	20910	20206	704	3.37
478_479-1_dat	27355	26658	697	2.55
600_601_dat	21128	20443	685	3.24
43_44a_grndat	28741	28059	682	2.37
476_477c_grndat	27817	27140	677	2.43
133_133_fnd	25282	24609	673	2.66
502_503f_grndat	24641	23968	673	2.73
564_564a_son	11455	10782	673	5.88
284_285_dat	29352	28681	671	2.29
498_498a_son	6923	6254	669	9.66
266_267b_grndat	29943	29277	666	2.22
600_600_fnd	13659	12997	662	4.85
338_338a_son	15304	14649	655	4.28
46_47_dat	29051	28412	639	2.20
476_476_fnd	27912	27275	637	2.28
46_47d_grnson	22004	21372	632	2.87
133_133a_son	28760	28139	621	2.16
498_499b_grndat	10473	9853	620	5.92
476_477_dat	15022	14411	611	4.07
284_285a_grnson	18988	18381	607	3.20
322_323a_grndat	15980	15374	606	3.79
564_564_fnd	10253	9653	600	5.85
48_48a_son	17227	16636	591	3.43
187_188_dat	30316	29733	583	1.92
458_458a_son	7966	7387	579	7.27
520_521_dat	15022	14448	574	3.82
177_177a_son	30198	29626	572	1.89
48_49_dat	28312	27742	570	2.01
302_303_dat	25317	24761	556	2.20
426_426_fnd	12548	12000	548	4.37
240_241_dat	26002	25459	543	2.09
133_134a_grnson	8917	8380	537	6.02
322_322a_son	17800	17268	532	2.99
63_63_fnd	20105	19586	519	2.58
600_601a_grnson	12873	12395	478	3.71
43_44b_grnson	11657	11180	477	4.09
502_503_dat	11568	11100	468	4.05
57_58b_grnson	9562	9096	466	4.87
177_177_fnd	29519	29056	463	1.57
46_47c_grndat	27492	27034	458	1.67
498_499_dat	20238	19780	458	2.26
502_503b_grnson	8277	7830	447	5.40
43_43a_son	9442	8998	444	4.70
412_413a_grnson	8076	7649	427	5.29
57_57_fnd	8619	8235	384	4.46
240_241b_grndat	5219	4852	367	7.03
502_503a_grndat	6622	6266	356	5.38
498_498_fnd	3780	3426	354	9.37
57_57a_son	5418	5071	347	6.40
596_596_fnd	5432	5092	340	6.26
48_48_fnd	6036	5717	319	5.28
300_301a_grnson	3879	3565	314	8.09
46_46_fnd	9008	8709	299	3.32
426_426a_son	5075	4787	288	5.67
284_284_fnd	21473	21195	278	1.29
502_502a_son	6414	6144	270	4.21
110_110_fnd	2163	1904	259	11.97
187_187a_son	5883	5634	249	4.23
502_502_fnd	4941	4698	243	4.92
322_323_dat	5555	5381	174	3.13
187_188a_grnson	1898	1726	172	9.06
596_597a_grnson	3572	3422	150	4.20
43_43_fnd	2310	2161	149	6.45
240_241c_grnson	3751	3605	146	3.89
300_301_dat	2112	1989	123	5.82
322_323b_grnson	6074	5953	121	1.99
266_266_fnd	1816	1731	85	4.68
177_178b_grnson	1422	1349	73	5.13
322_322_fnd	2236	2164	72	3.22
266_266a_son	1440	1372	68	4.72
177_178a_grndat	1364	1302	62	4.55

# Set minimum sites threshold
site_threshold = as.numeric(10)

# Create summary statistics for filtered data
summary_df_filtered <- samples_obs_filtered_ABxAA %>%
  filter(total >= site_threshold) %>%
  group_by(sex) %>%
  summarise(
    num_samples = n_distinct(Sample),  # Changed to Sample with capital S
    total_sites = sum(total),
    .groups = "drop"
  )

# Create summary statistics for filtered data by chromosome
summary_df_filtered_by_chrom <- samples_obs_filtered_ABxAA %>%
  filter(total >= site_threshold) %>%
  group_by(sex, chrom) %>%
  summarise(
    num_samples = n_distinct(Sample),  # Changed to Sample with capital S
    total_sites = sum(total),
    .groups = "drop"
  )

# Pulled data with totals across all chromosomes
ggplot(samples_obs_filtered_ABxAA %>% filter(total >= site_threshold), 
        aes(fill = gt, y = prop, x = factor(sex, levels = c("Female", "Male")))) +
  geom_bar(position = "fill", stat = "identity", width = 0.8) +
  ylab("Genotype Proportion") +
  labs(title = "Offspring Genotype, Crossing (AB) Male x (AA) Female
  Filtered sites", 
        fill = "Genotype") +
  theme_classic() +
  theme(
    axis.title.x = element_blank(),
    axis.text.x = element_blank(),
    axis.ticks.x = element_blank(),
    axis.text.y = element_text(size = 14),
    strip.text.x = element_text(size = 12),
    panel.spacing = unit(1, "lines"),
    plot.margin = margin(t = 10, r = 10, b = 0, l = 10)
  ) +
  scale_fill_manual(values = c("#ffbf00", "#66b032", "#1982c4")) +
  # Add sample count labels
  geom_text(data = summary_df_filtered,
            aes(x = sex, 
                y = -0.03,
                label = paste0(num_samples, " ", sex, "s")),
            inherit.aes = FALSE,
            size = 2.8,
            vjust = 1) +
  # Add sites count labels
  geom_text(data = summary_df_filtered,
            aes(x = sex, 
                y = -0.18,
                label = paste0(format(total_sites, big.mark = ","), " sites")),
            inherit.aes = FALSE,
            size = 2.8,
            vjust = 0.5)

# clean plot without labels , for the MS
ggplot(samples_obs_filtered_ABxAA %>% filter(total >= 10), 
       aes(fill = gt, y = prop, x = factor(sex, levels = c("Female", "Male")))) +
  geom_bar(position = "fill", stat = "identity", width = 0.7) +
  ylab("Genotype
Proportion") +
  labs(title = "Offspring Genotype, Crossing (AA) Male x (AB) Female
 Filtered sites", 
       fill = "Genotype") +
  theme_classic() +
  theme(
    axis.title.x = element_blank(),
    axis.text.x = element_blank(),
    axis.ticks.x = element_blank(),
    axis.text.y = element_text(size = 14),
    axis.title.y = element_text(size = 16),
    strip.text.x = element_text(size = 12),
    panel.spacing = unit(1, "lines"),
    plot.margin = margin(t = 10, r = 10, b = 0, l = 10)
  ) +
  scale_fill_manual(values = c("#ffbf00", "#66b032", "#1982c4"))# +

  #facet_wrap(~family)

Figure 1B: Crossing F1 mother AA with F1 father AB, one family

# Load the fam2_selected data, for one family (534) on the largest chromosome, NW_019211454
fam_534_F0AA_F1AA <- read.csv("/Users/nuriteliash/Documents/GitHub/varroa-pedigree-study/data/fam_534_F0AA_F1AA.csv")

# Create a numeric x-axis for spacing (convert 'member' to factor and assign spacing)
fam_id <- 534

fam_534_F0AA_F1AA %>%
  ggplot(aes(x = member_num, y = pos / 1e6, color = GT)) +  # Convert pos to Mb
  geom_segment(aes(x = member_num - 0.3, xend = member_num + 0.3, 
                   y = pos / 1e6, yend = pos / 1e6), size = 1.5) +  # Apply same conversion
  scale_x_continuous(breaks = unique(fam_534_F0AA_F1AA$member_num), labels = unique(fam_534_F0AA_F1AA$member)) +  # Restore original labels
  scale_color_manual(
    values = c("AA" = "#ffbf00", "AB" = "#66b032", "BB" = "#1982c4"),  # Updated colors for renamed genotypes
    name = "Genotype",
    breaks = c("AA", "AB", "BB")  # Ensures BB appears in the legend even if absent
  ) +
  theme_classic() +
  labs(
    title = paste("Genotype Flow Across Family Members\nFamily ID:", fam_id),
    x = "Family Member",
    y = "Genomic Position (Mb)"
  ) +
  theme(
    axis.text.x = element_text(size = 10, angle = 45, hjust = 1, vjust = 1),  # Adjusted text size and angle
    axis.title.x = element_text(margin = margin(t = 15)),  # Add margin above x-axis title
    panel.grid.major = element_blank(), 
    legend.position = "right",  # Move legend to the right
    legend.box.margin = margin(0, 0, 0, 10),  # Add margin to separate legend from plot
    plot.margin = margin(10, 20, 30, 10)  # Increased bottom margin to accommodate rotated labels
  )

# Add caption below the plot
cat("\n**Figure 1A.** Genotype flow visualization across a single family (Family ID: 534). Each horizontal line represents a genomic position on chromosome NW_019211454.1, with colors indicating genotypes (yellow = AA, green = AB, blue = BB). Family members are arranged in generational order, showing the inheritance pattern from parents to offspring. The y-axis shows the genomic position in megabases (Mb).")

## 
## **Figure 1A.** Genotype flow visualization across a single family (Family ID: 534). Each horizontal line represents a genomic position on chromosome NW_019211454.1, with colors indicating genotypes (yellow = AA, green = AB, blue = BB). Family members are arranged in generational order, showing the inheritance pattern from parents to offspring. The y-axis shows the genomic position in megabases (Mb).

Figure 1D: Crossing F1 mother AA with F1 father AB, all families

# Set the families (include only families with at least one adult F2)
family <- grep("grndat|grnson", gt$ID, value=TRUE) %>%
  str_extract("[^_]+") %>%
  unique()

# Define a list to store all data frames per family
obs <- list()

for (fam in family) {
obs[[fam]] <- table %>%
  dplyr::select(starts_with(fam)) %>%
    # Add site and chromosome information
    rownames_to_column("site") %>%
    mutate(chrom = str_split(site, pattern = "_", simplify = TRUE)[, 2]) %>%
    # Then filter for the genotype patterns we want
    dplyr::filter_at(vars(matches("_fnd")), all_vars(. == "AB")) %>%  # F0 Female must be AB
    dplyr::filter_at(vars(matches("_son")), all_vars(. == "AB")) %>%  # Sons must be AB
    dplyr::filter_at(vars(matches("_dat")), all_vars(. == "AA")) %>%  # Daughters must be AA
    # Select only the grandchildren columns and keep site/chrom info
    dplyr::select(site, chrom, contains("grn")) %>%
    # Reshape the data
    tidyr::pivot_longer(cols = -c(site, chrom)) %>%
    dplyr::rename(Sample = name, gt = value) %>%
    # filter for sites 
    left_join(filtered_sites, by = c("site", "chrom", "Sample", "gt")) %>%
    na.omit() %>% # Count genotypes
    dplyr::count(Sample, gt, chrom, .drop = FALSE) %>%
  dplyr::filter(gt %in% c("AA", "BB", "AB")) %>%
  mutate(n = as.numeric(n)) %>%
    group_by(Sample) %>%
  mutate(total = as.numeric(sum(n))) %>%
  mutate(sex = case_when(
      grepl("son", Sample) ~ "Male",
      grepl("dat", Sample) ~ "Female"
    )) %>%
    # Add family column
    mutate(family = fam)
}

# Bind all families together into a final data frame containing all observed counts
observed <- do.call("rbind", obs) %>% 
  mutate(Sample = as.character(Sample))

# Create samples dataframe
samples <- data.frame(
  Sample = rep(unique(observed$Sample), each = 3), 
  gt = rep(c("AA", "AB", "BB"))
) %>%   
  mutate(family = str_extract(Sample, "^[0-9]+"))

# Join and create final dataset
samples_obs_filtered_AAxAB <- left_join(samples, observed, by = c("family", "Sample", "gt")) %>% 
  mutate(sex = case_when(
    grepl("son", Sample) ~ "Male",
    grepl("dat", Sample) ~ "Female"
  )) %>%
  group_by(Sample, chrom) %>%
    mutate(total = as.numeric(sum(n))) %>%
    dplyr::mutate(prop = n/total) %>%
  na.omit()

### Summary of sites filtered in each sample
# Create a table showing sites filtered per sample
sites_per_sample <- table %>%
  rownames_to_column("site") %>%
  tidyr::pivot_longer(cols = -site, names_to = "Sample", values_to = "gt") %>%
  # Filter out NA values before counting
  filter(!is.na(gt)) %>%
  group_by(Sample) %>%
  summarise(
    total_sites = n(),
    .groups = 'drop'
  )

filtered_sites_per_sample <- filtered_sites %>%
  group_by(Sample) %>%
  summarise(
    filtered_sites = n(),
    .groups = 'drop'
  )

# Join and calculate differences
sites_comparison <- sites_per_sample %>%
  left_join(filtered_sites_per_sample, by = "Sample") %>%
  mutate(
    filtered_sites = replace_na(filtered_sites, 0),
    sites_removed = total_sites - filtered_sites,
    percent_removed = round((sites_removed / total_sites) * 100, 2)
  ) %>%
  arrange(desc(sites_removed))

# Display the table
sites_comparison %>%
  select(Sample, total_sites, filtered_sites, sites_removed, percent_removed) %>%
  rename(
    "Sample" = Sample,
    "Total Sites" = total_sites,
    "Sites Kept" = filtered_sites,
    "Sites Removed" = sites_removed,
    "% Removed" = percent_removed
  ) %>%
  knitr::kable(
    digits = 2,
    caption = "Number of sites filtered per sample",
    format = "html"
  ) %>%
  kableExtra::kable_styling(
    bootstrap_options = "striped",
    full_width = TRUE,
    position = "left",
    fixed_thead = TRUE
  ) %>%
  kableExtra::scroll_box(
    width = "100%", 
    height = "300px"
  )

Number of sites filtered per sample
Sample	Total Sites	Sites Kept	Sites Removed	% Removed
240_241a_grndat	27635	25552	2083	7.54
302_302a_son	23714	21809	1905	8.03
400_400a_son	29567	27804	1763	5.96
338_338_fnd	30664	29027	1637	5.34
426_427a_grndat	30732	29197	1535	4.99
412_412a_son	29902	28446	1456	4.87
548_549_dat	18715	17279	1436	7.67
458_458_fnd	29867	28468	1399	4.68
412_412_fnd	27910	26559	1351	4.84
400_401a_grnson	29747	28407	1340	4.50
302_303b_grnson	28367	27044	1323	4.66
412_413_dat	24250	22953	1297	5.35
400_401_dat	29889	28595	1294	4.33
458_459_dat	30157	28864	1293	4.29
476_477b_grnson	28391	27102	1289	4.54
534_535_2a_grndat	20940	19683	1257	6.00
478_479-1a_grnson	28898	27738	1160	4.01
300_300_fnd	22043	20903	1140	5.17
596_596a_son	28716	27597	1119	3.90
548_548_fnd	21544	20461	1083	5.03
458_459a_grnson	28336	27310	1026	3.62
266_267a_grndat	16208	15192	1016	6.27
302_302_fnd	18528	17523	1005	5.42
300_300a_son	27961	26967	994	3.55
534_535_2c_grnson	17322	16338	984	5.68
110_111c_grnson	30064	29096	968	3.22
426_427_dat	29893	28930	963	3.22
284_285b_grndat	30316	29360	956	3.15
478_478a_son	28243	27290	953	3.37
476_477d_grndat	28763	27820	943	3.28
110_111_dat	20695	19766	929	4.49
266_267_dat	26325	25404	921	3.50
534_534a_son	18886	17968	918	4.86
498_499c_grndat	27543	26628	915	3.32
57_58_dat	30612	29703	909	2.97
564_565-1_dat	12733	11826	907	7.12
534_535_2_dat	13200	12303	897	6.80
57_58a_grndat	28188	27303	885	3.14
400_400_fnd	27343	26461	882	3.23
338_339a_grndat	30321	29459	862	2.84
596_597_dat	26693	25846	847	3.17
63_63a_son	29482	28639	843	2.86
476_476a_son	24736	23894	842	3.40
476_477a_grndat	25948	25115	833	3.21
426_427b_grnson	16382	15572	810	4.94
520_520a_son	14326	13520	806	5.63
548_548a_son	11629	10825	804	6.91
63_64_dat	30582	29786	796	2.60
240_240_fnd	30066	29276	790	2.63
498_499a_grnson	27913	27126	787	2.82
43_44_dat	28703	27928	775	2.70
110_111d_grndat	26068	25302	766	2.94
133_134_dat	24195	23445	750	3.10
478_478_fnd	25274	24527	747	2.96
284_284a_son	28711	27966	745	2.59
46_46a_son	22390	21653	737	3.29
338_339_dat	30047	29311	736	2.45
338_339b_grnson	10129	9397	732	7.23
520_520_fnd	29042	28319	723	2.49
177_178_dat	29476	28755	721	2.45
600_600a_son	14625	13907	718	4.91
478_479-1b_grndat	14808	14095	713	4.81
564_565-1b_grnson	25552	24841	711	2.78
240_240a_son	23131	22422	709	3.07
110_110a_son	9180	8472	708	7.71
534_534_fnd	19995	19290	705	3.53
187_187_fnd	20910	20206	704	3.37
478_479-1_dat	27355	26658	697	2.55
600_601_dat	21128	20443	685	3.24
43_44a_grndat	28741	28059	682	2.37
476_477c_grndat	27817	27140	677	2.43
133_133_fnd	25282	24609	673	2.66
502_503f_grndat	24641	23968	673	2.73
564_564a_son	11455	10782	673	5.88
284_285_dat	29352	28681	671	2.29
498_498a_son	6923	6254	669	9.66
266_267b_grndat	29943	29277	666	2.22
600_600_fnd	13659	12997	662	4.85
338_338a_son	15304	14649	655	4.28
46_47_dat	29051	28412	639	2.20
476_476_fnd	27912	27275	637	2.28
46_47d_grnson	22004	21372	632	2.87
133_133a_son	28760	28139	621	2.16
498_499b_grndat	10473	9853	620	5.92
476_477_dat	15022	14411	611	4.07
284_285a_grnson	18988	18381	607	3.20
322_323a_grndat	15980	15374	606	3.79
564_564_fnd	10253	9653	600	5.85
48_48a_son	17227	16636	591	3.43
187_188_dat	30316	29733	583	1.92
458_458a_son	7966	7387	579	7.27
520_521_dat	15022	14448	574	3.82
177_177a_son	30198	29626	572	1.89
48_49_dat	28312	27742	570	2.01
302_303_dat	25317	24761	556	2.20
426_426_fnd	12548	12000	548	4.37
240_241_dat	26002	25459	543	2.09
133_134a_grnson	8917	8380	537	6.02
322_322a_son	17800	17268	532	2.99
63_63_fnd	20105	19586	519	2.58
600_601a_grnson	12873	12395	478	3.71
43_44b_grnson	11657	11180	477	4.09
502_503_dat	11568	11100	468	4.05
57_58b_grnson	9562	9096	466	4.87
177_177_fnd	29519	29056	463	1.57
46_47c_grndat	27492	27034	458	1.67
498_499_dat	20238	19780	458	2.26
502_503b_grnson	8277	7830	447	5.40
43_43a_son	9442	8998	444	4.70
412_413a_grnson	8076	7649	427	5.29
57_57_fnd	8619	8235	384	4.46
240_241b_grndat	5219	4852	367	7.03
502_503a_grndat	6622	6266	356	5.38
498_498_fnd	3780	3426	354	9.37
57_57a_son	5418	5071	347	6.40
596_596_fnd	5432	5092	340	6.26
48_48_fnd	6036	5717	319	5.28
300_301a_grnson	3879	3565	314	8.09
46_46_fnd	9008	8709	299	3.32
426_426a_son	5075	4787	288	5.67
284_284_fnd	21473	21195	278	1.29
502_502a_son	6414	6144	270	4.21
110_110_fnd	2163	1904	259	11.97
187_187a_son	5883	5634	249	4.23
502_502_fnd	4941	4698	243	4.92
322_323_dat	5555	5381	174	3.13
187_188a_grnson	1898	1726	172	9.06
596_597a_grnson	3572	3422	150	4.20
43_43_fnd	2310	2161	149	6.45
240_241c_grnson	3751	3605	146	3.89
300_301_dat	2112	1989	123	5.82
322_323b_grnson	6074	5953	121	1.99
266_266_fnd	1816	1731	85	4.68
177_178b_grnson	1422	1349	73	5.13
322_322_fnd	2236	2164	72	3.22
266_266a_son	1440	1372	68	4.72
177_178a_grndat	1364	1302	62	4.55

# Set minimum sites threshold
site_threshold = as.numeric(10)

# Create summary dataframe for labels, summing across all chromosomes (for pulled data only)
summary_df_filtered <- samples_obs_filtered_AAxAB %>%
   filter(total >= site_threshold) %>%
  group_by(sex) %>%  # Remove chromosome grouping
  summarise(
    num_samples = n_distinct(Sample),  # Count unique samples
    total_sites = sum(total, na.rm = TRUE),  # Sum all sites
    .groups = 'drop'
  )

# Create per-chromosome summary for faceted plot
summary_df_filtered_by_chrom <- samples_obs_filtered_AAxAB %>%
  group_by(chrom, sex) %>%
  summarise(
    num_samples = n_distinct(Sample),
    total_sites = sum(total, na.rm = TRUE),
    .groups = 'drop'
  )

# Plot 1: Pulled data with totals across all chromosomes
ggplot(samples_obs_filtered_AAxAB %>% filter(total >= site_threshold), 
       aes(fill = gt, y = prop, x = factor(sex, levels = c("Female", "Male")))) +
  geom_bar(position = "fill", stat = "identity", width = 0.5) +
  ylab("Genotype Proportion") +
  labs(title = "Offspring Genotype, Crossing (AB) Male x (AA) Female
 Filtered sites", 
       fill = "Genotype") +
  theme_classic() +
  theme(
    axis.title.x = element_blank(),
    axis.text.x = element_blank(),
    axis.ticks.x = element_blank(),
    axis.text.y = element_text(size = 12),
    strip.text.x = element_text(size = 12),
    panel.spacing = unit(1, "lines"),
    plot.margin = margin(t = 10, r = 10, b = 60, l = 10)
  ) +
  scale_fill_manual(values = c("#ffbf00", "#66b032", "#1982c4")) +
  # Add sample count labels (only for pulled data)
  geom_text(data = summary_df_filtered,
            aes(x = sex, 
                y = -0.03,
                label = paste0(num_samples, " ", sex, "s")),
            inherit.aes = FALSE,
            size = 2.8,
            vjust = 1) +
  # Add total sites label with comma formatting (only for pulled data)
  geom_text(data = summary_df_filtered,
            aes(x = sex, 
                y = -0.18,
                label = paste0(format(total_sites, big.mark = ","), " sites")),
            inherit.aes = FALSE,
            size = 2.8,
            vjust = 0.5)

# clean plot without labels , for the MS
ggplot(samples_obs_filtered_AAxAB %>% filter(total >= site_threshold), 
       aes(fill = gt, y = prop, x = factor(sex, levels = c("Female", "Male")))) +
  geom_bar(position = "fill", stat = "identity", width = 0.7) +
  ylab("Genotype
Proportion") +
  labs(title = "Offspring Genotype, Crossing (AB) Male x (AA) Female
 Filtered sites", 
       fill = "Genotype") +
  theme_classic() +
  theme(
    axis.title.x = element_blank(),
    axis.text.x = element_blank(),
    axis.ticks.x = element_blank(),
    axis.text.y = element_text(size = 14),
    axis.title.y = element_text(size = 16),
    strip.text.x = element_text(size = 12),
    panel.spacing = unit(1, "lines"),
    plot.margin = margin(t = 10, r = 10, b = 0, l = 10)
  ) +
  scale_fill_manual(values = c("#ffbf00", "#66b032", "#1982c4"))# +

  #facet_wrap(~family)

Figure 2: RNAseq of males and females

rna_dat <- read_csv("/Users/nuriteliash/Documents/GitHub/Varroa-RNAseq-Zheng-2025/data/rna-seq.csv")

rna_dat_summary <- rna_dat %>%
  group_by(family, sex) %>%
  summarize(
    het_prop = sum(het) / sum(het + homo),  # Proportion of heterozygous
    count = sum(het + homo),                # Total trials
    sem = sqrt((het_prop * (1 - het_prop)) / count),  # Standard error of the mean
    ci_lower = het_prop - qt(0.975, df = count - 1) * sem,  # Lower confidence interval
    ci_upper = het_prop + qt(0.975, df = count - 1) * sem   # Upper confidence interval
  )

# Plot the data
rna_dat_summary %>%
  ggplot(aes(x = sex, y = het_prop, colour = sex, group = family)) +
  geom_line(position = position_dodge(width = 0.5), colour = "black") +  
  geom_pointrange(aes(ymin = ci_lower, ymax = ci_upper), size = 1, position = position_dodge(width = 0.5)) +
  theme_bw() +
  ylab("Proportion of heterozygous sites") +
  xlab("Sex") +
  theme(legend.position = "bottom") +
  scale_y_continuous(labels = scales::percent_format())  + 
  scale_color_manual(values = c("Male" = "blue", "Female" = "#E31C79")) + # Custom colors
  theme(legend.position = "none")

cat("\n**Figure 2.** RNA-seq analysis of heterozygous site proportions in male and female mites. Each point represents the mean proportion of heterozygous sites per individual mite, with error bars showing 95% confidence intervals. Points are connected by lines to show family relatedness of mother and son. The y-axis shows the proportion of heterozygous sites as a percentage of total sites analyzed.")

## 
## **Figure 2.** RNA-seq analysis of heterozygous site proportions in male and female mites. Each point represents the mean proportion of heterozygous sites per individual mite, with error bars showing 95% confidence intervals. Points are connected by lines to show family relatedness of mother and son. The y-axis shows the proportion of heterozygous sites as a percentage of total sites analyzed.

Do heterozygocity proportion differ significantly between male and female mites?

# Reshape to wide format
df_wide <- rna_dat_summary %>%
  select(family, sex, het_prop) %>%
  pivot_wider(names_from = sex, values_from = het_prop)

# Paired t-test
t.test(df_wide$Female, df_wide$Male, paired = TRUE)

## 
##  Paired t-test
## 
## data:  df_wide$Female and df_wide$Male
## t = 0.45559, df = 4, p-value = 0.6723
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
##  -0.1116696  0.1555112
## sample estimates:
## mean difference 
##      0.02192078

Main and supplementary figures and analysis for Varroa Destructor Inheritance Study

Nurit Eliash and Sasha Mikheyev

03 August, 2025