The recombination frequency of Varroa destructor, a
parasitic mite of honeybees was estimated for male and female adult
mites. We used two analysis methods: manual calculation of exact
recombination frequency, and computational estimation using a linkage
mapping software, Lep-MAP3 (Rastas 2017). For both analyses we used as
input a VCF file containing only the ‘Informative sites’. Informative
sites are sites that are heterozygotic in the F1 female, and homozygotic
for one allele in the F1 male, and his mother (F0 female). Only for
these sites we can phase (determine the allele parental origin) the F2
generation genotypes, and follow the inheritance of specific sites
through the generations (Fig __).
All biosamples are available in Sequence Read Archive (SRA) under the
accession PRJNA794941
library("tidyverse")
library("plyr")
library("dplyr")
library("ggplot2")
library("scales")
library("ggpubr")
library("gridExtra")
library("grid")
library("GGally")
library("vcfR") # for extracting genotype data from a vcf file
library("data.table")
library("stringr")
library("janitor")
knitr::opts_chunk$set(echo = TRUE)
Extract genotypes for each site and individual. The metadata for all samples can be found in here.
vcf <- read.vcfR("/Users/nuriteliash/Documents/GitHub/varroa-linkage-map/data/vcf_filter/Q40BIALLDP16HDP40mis.5Chr7/Q40BIALLDP16HDP40mis.5Chr7.recode.vcf", verbose = FALSE )
vcf
## ***** Object of Class vcfR *****
## 223 samples
## 7 CHROMs
## 35,169 variants
## Object size: 293.5 Mb
## 0 percent missing data
## ***** ***** *****
# extract the genotype for each site in each individual
gt <- extract.gt(vcf, element = "GT")
gt <- as.data.frame(t(gt)) %>%
rownames_to_column("ID")
#clean the ID names
gt$ID <- sub("_[^_]+$", "", gt$ID)
We used the VCF file in two methods to estimate varroa mite
recombination frequency:
- Lep-MAP3, a linkage mapping software (Rastas 2017).
script available in lep-MAP3-varroa.Rmd
- Manual estimation of the recombination rate
Here we show how to calculate the recombination rate from a VCF file:
See flowchart in Fig 1:
Count the sites with each genotype in each family, per
chromosome.
The counts must be per chromosome, as recombination can occur only
between sites on the same chromosome.
table <- gt %>%
t() %>%
as.data.frame() %>%
row_to_names(row_number = 1)
# set chromosome variable
chromosomes = c("NW_019211454.1","NW_019211455.1","NW_019211456.1", "NW_019211457.1","NW_019211458.1","NW_019211459.1","NW_019211460.1")
# define a list to put all the data frames in
chr_list <- list()
# make a list with dataframes - each containing 1 chromosome
for (chr in chromosomes) {
chr_list[[chr]] <- table %>%
rownames_to_column("site") %>%
dplyr::filter(stringr::str_detect(site,chr))
}
# set a vector of all 30 families:
family = str_extract(gt$ID, "[^_]+") %>% unique()
# or, include only families with at least one adult F2
#family = grep("grndat|grnson",gt$ID, value=TRUE) %>%
# str_extract("[^_]+") %>%
# unique()
# make a function to apply:
fun <- function(df) {
df %>%
dplyr::select(starts_with(fam)) %>%
dplyr::filter_at(vars(matches("_son")), all_vars(. == "0/0")) %>%
dplyr::filter_at(vars(matches("_dat")), all_vars(. == "0/1")) %>%
dplyr::filter_at(vars(matches("_fnd")), all_vars(. == "0/0")) %>%
dplyr::select(contains("grn")) %>%
tidyr::pivot_longer(everything()) %>%
#replace_na(list(value="1/1")) %>%
dplyr::rename(sample = name, gt = value) %>%
tidyr::complete(sample, gt, fill = list(obs = 0)) %>%
dplyr::count(sample, gt, .drop = FALSE) %>%
dplyr::filter(gt %in% c("0/0", "1/1", "0/1")) %>%
mutate(n = as.numeric(n)) %>%
group_by(sample) %>%
mutate(total = as.numeric(sum(n))) %>%
dplyr::rename(obs = n) %>%
mutate(sex = case_when(
grepl("son", sample) ~ "male",
grepl("dat", sample) ~ "female"))
}
# make an empty list
obs <- list()
# apply the function for each of the chromosome, per family
for (fam in family) {
obs[[fam]] <- lapply(chr_list, fun)
}
# bind all families together, to a final data frame containing all observed counts
#observed <- do.call("rbind", obs)
visualize the sites, for one family (240), on the first chromosome (NW_019211454.1):
table %>%
rownames_to_column("site") %>%
dplyr::filter(stringr::str_detect(site,"NW_019211454.1")) %>%
dplyr::select(starts_with(c("site", "110"))) %>%
dplyr::filter_at(vars(matches("_son")), all_vars(. == "0/0")) %>%
dplyr::filter_at(vars(matches("_dat")), all_vars(. == "0/1")) %>%
dplyr::filter_at(vars(matches("_fnd")), all_vars(. == "0/0"))
## site 110_111_dat 110_111c_grnson 110_111a_grn 110_111b_grn
## 1 NW_019211454.1_41033969 0/1 0/0 <NA> 0/0
## 2 NW_019211454.1_69404882 0/1 <NA> 0/1 <NA>
## 110_111e_grn 110_110_fnd 110_111d_grndat 110_110a_son 110_111f_grn
## 1 0/0 0/0 0/0 0/0 0/0
## 2 <NA> 0/0 0/1 0/0 <NA>
Next, calculate the recombination frequency, that is, the number of
sites pairs, that are recombinant among the all possible pairs in each
chromosome.
we do that by determining the ‘recombinant’ and ‘parental’ types of
sites combinations, for males (parthenogenetically produced) and females
(sexually produced) separately.
For the 0/0 x 0/1 cross, in both males and females F2, parental type
will have the same genotype in a pair of sites, while a recombinant type
will have different genotype in a pair:
F2 females:
- Parental types: 0/1;0/1 and 0/0;0/0
- Recombinant types: 0/1;0/0
F2 males:
- Parental types: 0/1;0/1, 0/0;0/0 and 1/1;1/1
- Recombinant types: 0/1;0/0, 0/1;1/1 and 1/1;0/0 (depends on the type
of parthenogenesis)
so all we need to do in order to count recombination events in a chromosome, is to count the pairs of same and different genotypes, and divide by the length of the chromosome in bp.
Calculations:
- Sum of unique pairs = n(n-1)/2 x n = number
of genotyped sites (example$total).
- Male recombinant pairs = count of 0/1 sites x
count of 0/0+1/1 sites.
- Female recombinant pairs = count of 0/1 sites x
count of 0/0 sites.
- Recombination frequency = recombinant pairs / sum
of unique pairs.
- Normalized recombination freq = freq/chromosome
length (bp)
Count the sites with each genotype in each family, per
chromosome.
The counts must be per chromosome, as recombination can occur only
between sites on the same chromosome.
table <- gt %>%
t() %>%
as.data.frame() %>%
row_to_names(row_number = 1)
# set chromosome variable
chromosomes = c("NW_019211454.1","NW_019211455.1","NW_019211456.1", "NW_019211457.1","NW_019211458.1","NW_019211459.1","NW_019211460.1")
# define a list to put all the data frames in
chr_list <- list()
# make a list with dataframes - each containing 1 chromosome
for (chr in chromosomes) {
chr_list[[chr]] <- table %>%
rownames_to_column("site") %>%
dplyr::filter(stringr::str_detect(site,chr))
}
# set a vector of all 30 families:
family = str_extract(gt$ID, "[^_]+") %>% unique()
# or, include only families with at least one adult F2
#family = grep("grndat|grnson",gt$ID, value=TRUE) %>%
# str_extract("[^_]+") %>%
# unique()
# make a function to apply:
fun <- function(df) {
df %>%
dplyr::select(starts_with(fam)) %>%
dplyr::filter_at(vars(matches("_son")), all_vars(. == "1/1")) %>%
dplyr::filter_at(vars(matches("_dat")), all_vars(. == "0/1")) %>%
dplyr::filter_at(vars(matches("_fnd")), all_vars(. == "1/1")) %>%
dplyr::select(contains("grn")) %>%
tidyr::pivot_longer(everything()) %>%
#replace_na(list(value="1/1")) %>%
dplyr::rename(sample = name, gt = value) %>%
tidyr::complete(sample, gt, fill = list(obs = 0)) %>%
dplyr::count(sample, gt, .drop = FALSE) %>%
dplyr::filter(gt %in% c("0/0", "1/1", "0/1")) %>%
mutate(n = as.numeric(n)) %>%
group_by(sample) %>%
mutate(total = as.numeric(sum(n))) %>%
dplyr::rename(obs = n) %>%
mutate(sex = case_when(
grepl("son", sample) ~ "male",
grepl("dat", sample) ~ "female"))
}
# make an empty list
obs <- list()
# apply the function for each of the chromosome, per family
for (fam in family) {
obs[[fam]] <- lapply(chr_list, fun)
}
# bind all families together, to a final data frame containing all observed counts
#observed <- do.call("rbind", obs)
Next, calculate the recombination frequency, that is, the number of
sites pairs, that are recombinant among the all possible pairs in each
chromosome.
we do that by determining the ‘recombinant’ and ‘parental’ types of
sites combinations, for males (parthenogenetically produced) and females
(sexually produced) separately.
For the 0/0 x 0/1 cross, in both males and females F2, parental type
will have the same genotype in a pair of sites, while a recombinant type
will have different genotype in a pair:
F2 females:
- Parental types: 0/1;0/1 and 0/0;0/0
- Recombinant types: 0/1;0/0
F2 males:
- Parental types: 0/1;0/1, 0/0;0/0 and 1/1;1/1
- Recombinant types: 0/1;0/0, 0/1;1/1 and 1/1;0/0 (depends on the type
of parthenogenesis)
so all we need to do in order to count recombination events in a chromosome, is to count the pairs of same and different genotypes, and divide by the length of the chromosome in bp.
Calculations:
- Sum of unique pairs = n(n-1)/2 x n = number
of genotyped sites (example$total).
- Male recombinant pairs = count of 0/1 sites x
count of 0/0+1/1 sites.
- Female recombinant pairs = count of 0/1 sites x
count of 0/0 sites.
- Recombination frequency = recombinant pairs / sum
of unique pairs.
- Normalized recombination freq = freq/chromosome
length (bp)