AIM: Estimate varroa recombintaion rate using whole genome pedigree data

The recombination frequency of Varroa destructor, a parasitic mite of honeybees was estimated for male and female adult mites. We used two analysis methods: manual calculation of exact recombination frequency, and computational estimation using a linkage mapping software, Lep-MAP3 (Rastas 2017). For both analyses we used as input a VCF file containing only the ‘Informative sites’. Informative sites are sites that are heterozygotic in the F1 female, and homozygotic for one allele in the F1 male, and his mother (F0 female). Only for these sites we can phase (determine the allele parental origin) the F2 generation genotypes, and follow the inheritance of specific sites through the generations (Fig __).
All biosamples are available in Sequence Read Archive (SRA) under the accession PRJNA794941

load libraries

library("tidyverse")
library("plyr")
library("dplyr")
library("ggplot2")
library("scales")
library("ggpubr")
library("gridExtra")
library("grid")
library("GGally")
library("vcfR") # for extracting genotype data from a vcf file
library("data.table")
library("stringr")
library("janitor")
knitr::opts_chunk$set(echo = TRUE)

Load Variant Call Format (VCF) file.

Extract genotypes for each site and individual. The metadata for all samples can be found in here.

vcf <- read.vcfR("/Users/nuriteliash/Documents/GitHub/varroa-linkage-map/data/vcf_filter/Q40BIALLDP16HDP40mis.5Chr7/Q40BIALLDP16HDP40mis.5Chr7.recode.vcf", verbose = FALSE )
vcf
## ***** Object of Class vcfR *****
## 223 samples
## 7 CHROMs
## 35,169 variants
## Object size: 293.5 Mb
## 0 percent missing data
## *****        *****         *****
# extract the genotype for each site in each individual
gt <- extract.gt(vcf, element = "GT") 
gt <- as.data.frame(t(gt)) %>%
  rownames_to_column("ID")
#clean the ID names
gt$ID <- sub("_[^_]+$", "", gt$ID)

We used the VCF file in two methods to estimate varroa mite recombination frequency:
- Lep-MAP3, a linkage mapping software (Rastas 2017). script available in lep-MAP3-varroa.Rmd - Manual estimation of the recombination rate


Here we show how to calculate the recombination rate from a VCF file:

Manual estimation of the recombination rate

See flowchart in Fig 1:

Count the sites with each genotype in each family, per chromosome.
The counts must be per chromosome, as recombination can occur only between sites on the same chromosome.

Pooled data (cross 0/0 x 0/1)

table <- gt %>% 
  t() %>%
  as.data.frame() %>%
  row_to_names(row_number = 1)

# set chromosome variable 
chromosomes = c("NW_019211454.1","NW_019211455.1","NW_019211456.1", "NW_019211457.1","NW_019211458.1","NW_019211459.1","NW_019211460.1")
  
# define a list to put all the data frames in
chr_list <- list()

# make a list with dataframes - each containing 1 chromosome
for (chr in chromosomes) {
  chr_list[[chr]] <- table %>%
  rownames_to_column("site") %>%
  dplyr::filter(stringr::str_detect(site,chr)) 
      }

# set a vector of all 30 families:
family = str_extract(gt$ID, "[^_]+") %>% unique()

# or, include only families with at least one adult F2
#family = grep("grndat|grnson",gt$ID, value=TRUE) %>%
#  str_extract("[^_]+")  %>%
#  unique()

# make a function to apply:
fun <- function(df) {
  df %>%
  dplyr::select(starts_with(fam)) %>%
  dplyr::filter_at(vars(matches("_son")), all_vars(. == "0/0")) %>%
  dplyr::filter_at(vars(matches("_dat")), all_vars(. == "0/1")) %>% 
  dplyr::filter_at(vars(matches("_fnd")), all_vars(. == "0/0")) %>%
  dplyr::select(contains("grn")) %>% 
  tidyr::pivot_longer(everything())  %>% 
  #replace_na(list(value="1/1")) %>%
  dplyr::rename(sample = name, gt = value) %>%
  tidyr::complete(sample, gt, fill = list(obs = 0)) %>%
  dplyr::count(sample, gt, .drop = FALSE) %>%
  dplyr::filter(gt %in% c("0/0", "1/1", "0/1")) %>%
  mutate(n = as.numeric(n)) %>%
  group_by(sample) %>%
  mutate(total = as.numeric(sum(n))) %>%
  dplyr::rename(obs = n) %>%
  mutate(sex = case_when(
    grepl("son", sample) ~ "male",
    grepl("dat", sample) ~ "female"))
    }

# make an empty list
obs <- list() 

# apply the function for each of the chromosome, per family 
for (fam in family) {
 obs[[fam]] <- lapply(chr_list, fun)
}

# bind all families together, to a final data frame containing all observed counts
#observed <- do.call("rbind", obs)

visualize the sites, for one family (240), on the first chromosome (NW_019211454.1):

table %>%
  rownames_to_column("site") %>%
  dplyr::filter(stringr::str_detect(site,"NW_019211454.1")) %>%
    dplyr::select(starts_with(c("site", "110"))) %>%
    dplyr::filter_at(vars(matches("_son")), all_vars(. == "0/0")) %>%
    dplyr::filter_at(vars(matches("_dat")), all_vars(. == "0/1")) %>%  
  dplyr::filter_at(vars(matches("_fnd")), all_vars(. == "0/0"))
##                      site 110_111_dat 110_111c_grnson 110_111a_grn 110_111b_grn
## 1 NW_019211454.1_41033969         0/1             0/0         <NA>          0/0
## 2 NW_019211454.1_69404882         0/1            <NA>          0/1         <NA>
##   110_111e_grn 110_110_fnd 110_111d_grndat 110_110a_son 110_111f_grn
## 1          0/0         0/0             0/0          0/0          0/0
## 2         <NA>         0/0             0/1          0/0         <NA>

Estimate recombintaion frequency

Next, calculate the recombination frequency, that is, the number of sites pairs, that are recombinant among the all possible pairs in each chromosome.
we do that by determining the ‘recombinant’ and ‘parental’ types of sites combinations, for males (parthenogenetically produced) and females (sexually produced) separately.
For the 0/0 x 0/1 cross, in both males and females F2, parental type will have the same genotype in a pair of sites, while a recombinant type will have different genotype in a pair:
F2 females:
- Parental types: 0/1;0/1 and 0/0;0/0
- Recombinant types: 0/1;0/0

F2 males:
- Parental types: 0/1;0/1, 0/0;0/0 and 1/1;1/1
- Recombinant types: 0/1;0/0, 0/1;1/1 and 1/1;0/0 (depends on the type of parthenogenesis)

so all we need to do in order to count recombination events in a chromosome, is to count the pairs of same and different genotypes, and divide by the length of the chromosome in bp.

Calculations:
- Sum of unique pairs = n(n-1)/2 x n = number of genotyped sites (example$total).
- Male recombinant pairs = count of 0/1 sites x count of 0/0+1/1 sites.
- Female recombinant pairs = count of 0/1 sites x count of 0/0 sites.
- Recombination frequency = recombinant pairs / sum of unique pairs.
- Normalized recombination freq = freq/chromosome length (bp)

Pooled data (cross 1/1 x 0/1)

Count the sites with each genotype in each family, per chromosome.
The counts must be per chromosome, as recombination can occur only between sites on the same chromosome.

table <- gt %>% 
  t() %>%
  as.data.frame() %>%
  row_to_names(row_number = 1)

# set chromosome variable 
chromosomes = c("NW_019211454.1","NW_019211455.1","NW_019211456.1", "NW_019211457.1","NW_019211458.1","NW_019211459.1","NW_019211460.1")
  
# define a list to put all the data frames in
chr_list <- list()

# make a list with dataframes - each containing 1 chromosome
for (chr in chromosomes) {
  chr_list[[chr]] <- table %>%
  rownames_to_column("site") %>%
  dplyr::filter(stringr::str_detect(site,chr)) 
      }

# set a vector of all 30 families:
family = str_extract(gt$ID, "[^_]+") %>% unique()

# or, include only families with at least one adult F2
#family = grep("grndat|grnson",gt$ID, value=TRUE) %>%
#  str_extract("[^_]+")  %>%
#  unique()


# make a function to apply:
fun <- function(df) {
  df %>%
  dplyr::select(starts_with(fam)) %>%
  dplyr::filter_at(vars(matches("_son")), all_vars(. == "1/1")) %>%
  dplyr::filter_at(vars(matches("_dat")), all_vars(. == "0/1")) %>% 
  dplyr::filter_at(vars(matches("_fnd")), all_vars(. == "1/1")) %>%
  dplyr::select(contains("grn")) %>% 
  tidyr::pivot_longer(everything())  %>% 
  #replace_na(list(value="1/1")) %>%
  dplyr::rename(sample = name, gt = value) %>%
  tidyr::complete(sample, gt, fill = list(obs = 0)) %>%
  dplyr::count(sample, gt, .drop = FALSE) %>%
  dplyr::filter(gt %in% c("0/0", "1/1", "0/1")) %>%
  mutate(n = as.numeric(n)) %>%
  group_by(sample) %>%
  mutate(total = as.numeric(sum(n))) %>%
  dplyr::rename(obs = n) %>%
  mutate(sex = case_when(
    grepl("son", sample) ~ "male",
    grepl("dat", sample) ~ "female"))
    }

# make an empty list
obs <- list() 

# apply the function for each of the chromosome, per family 
for (fam in family) {
 obs[[fam]] <- lapply(chr_list, fun)
}

# bind all families together, to a final data frame containing all observed counts
#observed <- do.call("rbind", obs)

Estimate recombintaion frequency

Next, calculate the recombination frequency, that is, the number of sites pairs, that are recombinant among the all possible pairs in each chromosome.
we do that by determining the ‘recombinant’ and ‘parental’ types of sites combinations, for males (parthenogenetically produced) and females (sexually produced) separately.
For the 0/0 x 0/1 cross, in both males and females F2, parental type will have the same genotype in a pair of sites, while a recombinant type will have different genotype in a pair:
F2 females:
- Parental types: 0/1;0/1 and 0/0;0/0
- Recombinant types: 0/1;0/0

F2 males:
- Parental types: 0/1;0/1, 0/0;0/0 and 1/1;1/1
- Recombinant types: 0/1;0/0, 0/1;1/1 and 1/1;0/0 (depends on the type of parthenogenesis)

so all we need to do in order to count recombination events in a chromosome, is to count the pairs of same and different genotypes, and divide by the length of the chromosome in bp.

Calculations:
- Sum of unique pairs = n(n-1)/2 x n = number of genotyped sites (example$total).
- Male recombinant pairs = count of 0/1 sites x count of 0/0+1/1 sites.
- Female recombinant pairs = count of 0/1 sites x count of 0/0 sites.
- Recombination frequency = recombinant pairs / sum of unique pairs.
- Normalized recombination freq = freq/chromosome length (bp)