Building an integrated genetic linkage map of autotetraploid alfalfa populations using the MAPpoly2
1 Introduction
In this tutorial, we will explore MAPpoly2, an R package under development for constructing genetic maps in autopolyploid species. MAPpoly2 is an extension of MAPpoly, designed to handle a wider range of ploidy levels and to provide a more user-friendly experience with additional graphical functions and potential integration with R Shiny. Our focus will be on constructing integrated maps for multiple full-sib families of alfalfa, an important autotetraploid crop. Let’s begin by installing MAPpoly2 from GitHub.
2 Installing MAPpoly2
MAPpoly2 is currently available in its development version on GitHub.
To install it, we first need the devtools
package:
Windows users should install the latest version of Rtools.
Install MAPpoly2 using the following command:
Then, load MAPpoly2:
3 About the alfalfa populations
We are going to use two datasets distributed along with MAPpoly2:
alfa_f1 This dataset comprises 184 offspring from a cross between two tetraploid alfalfa parents,
I195
andJ432
, which are resistant and susceptible to Aphanomyces euteiches, respectively. The biparental population was presented in Zhao et al., (2023) and genotyped with the alfalfa DArTag panel described in the same publication.alfa_bc This dataset comprises 93 offspring from a cross between two tetraploid alfalfa parents,
I195
andF1.85.209
, the latter being derived from alfa_f1 cross.
(#fig:alfalfa_pub)Alfalfa publication
4 Loading datasets
MAPpoly2 supports CSV file format for input datasets. Here, we will demonstrate how to read and prepare your data in CSV format suitable for MAPpoly2. The first line in the CSV file should contain headers, and subsequent lines will have marker data including dosage, chromosome number, genome position, and allele information.
For example, here’s how your dataset might look:
(#fig:data set_example)Example of CSV data set
To load the dataset, we utilize the read_geno_csv
function. This function requires specifying the ploidy level of the parents, which could be 2, 4, 6, or a combination thereof. If the names of the parents are provided, they will override any parent names present in the corresponding columns of the CSV file.
alfalfa.f1 <- read_geno_csv(file.in = "I195_x_J432.csv",
ploidy.p1 = 4,
ploidy.p2 = 4,
name.p1 = "I195",
name.p2 = "J432")
#> Reading the following data:
#> Ploidy level of I195: 4
#> Ploidy level of J432: 4
#> No. individuals: 184
#> No. markers: 3419
#> No. informative markers: 3419 (100%)
#> ---
#> This dataset contains chromosome information.
#> This dataset contains position information.
#> --> Filtering non-conforming markers.
#> --> Filtering markers with redundant information.
#> ----------------------------------
Let’s take a look at the dataset and its graphical representation:
alfalfa.f1
#> Ploidy level of I195: 4
#> Ploidy level of J432: 4
#> No. individuals: 184
#> No. unique markers: 2795
#> Percentage of redundant: 18.3%
#> Percentage of missing: 6.6%
#> Chromosome info: available
#> Genome position: available
#> Recombination farction: unavailable
plot(alfalfa.f1)
For the purposes of this tutorial, we have introduced the read_geno_csv
function
to provide an understanding of how datasets can be loaded into MAPpoly2. However,
you can use the alfa_f1
dataset that comes bundled with the MAPpoly package.
5 Dataset quality control
Quality control is a fundamental part of genetic mapping, ensuring the reliability and accuracy of the resulting genetic maps. In this section, we focus on two critical aspects of quality control: addressing missing data and identifying and removing non-F1 individuals.
5.1 Addressing Missing Data
Screening out missing data under specified thresholds involves setting thresholds
for the acceptable amount of missing data and then filtering out markers and individuals
that exceed these limits. In MAPpoly2, you can use the filter_data
function to
apply these thresholds. Here is an example of how to screen out missing data based
on predetermined thresholds:
alfalfa.f1
#> Ploidy level of I195: 4
#> Ploidy level of J432: 4
#> No. individuals: 184
#> No. unique markers: 2795
#> Percentage of redundant: 18.3%
#> Percentage of missing: 6.6%
#> Chromosome info: available
#> Genome position: available
#> Recombination farction: unavailable
#> Thresholds
#> missing mrk: 0.2
#> missing ind: 0.1
#> chi-square pval: 1.79e-05
#> non full-sib: -
#> LOD phase: 0
#> LOD rf: 0
#> rf: 0.5
#> prob.lower: 0
#> prob.upper: 1
#> Screened mrk: 2039
#> Screened ind: 165
5.2 Identifying and Removing Non-F1 Individuals
Non-F1 individuals, or those not derived from the specified parent cross, can significantly affect the mapping process. Their presence may introduce erroneous genetic variation that does not represent the true recombination patterns of the population under study. This can lead to incorrect marker ordering, inflated map distances, and even the formation of artificial linkage groups.
(#fig:fint ind2)screening individuals using filter_individuals
alfalfa.f1
#> Ploidy level of I195: 4
#> Ploidy level of J432: 4
#> No. individuals: 184
#> No. unique markers: 2795
#> Percentage of redundant: 18.3%
#> Percentage of missing: 6.6%
#> Chromosome info: available
#> Genome position: available
#> Recombination farction: unavailable
#> Thresholds
#> missing mrk: 0.2
#> missing ind: 0.1
#> chi-square pval: 1.79e-05
#> non full-sib: 0
#> LOD phase:
#> LOD rf:
#> rf:
#> prob.lower:
#> prob.upper:
#> Screened mrk: 2039
#> Screened ind: 165
5.3 Assessing Marker Density
Evaluating marker density in our genetic map helps reveal gaps which could indicate missing data, phasing errors, or natural recombination variations, thereby guiding us in enhancing the map’s accuracy and resolution.
6 Estimating Pairwise Recombination Frequency
Once we have selected our markers, the next step is to compute pairwise recombination
frequencies. For this, we use the pairwise_rf
function in MAPpoly2,
which offers three modes of calculation:
mrk.scope = “all” calculates recombination fractions between all marker pairs across the entire dataset. This method is computationally intensive due to the sheer volume of pairwise comparisons involved.
mrk.scope = “per.chrom” limits the calculation to pairwise comparisons within each chromosome. This approach is significantly less intensive as it reduces the number of comparisons, focusing only on marker pairs on the same chromosome.
mrk.scope = “chrom”: In this mode, you can specify certain chromosomes for which to calculate pairwise recombination frequencies. It requires you to provide the chromosome numbers and is less intensive than the ‘all’ mode but usually more specific than the ‘per.chrom’ mode.
To manage computation efficiently, parallel processing can be utilized by specifying the number of cores your machine will allocate for this task. Ensure your machine has sufficient RAM, and remember to leave one core free for system operations.
t1 <- system.time(alfalfa.f1.all <- pairwise_rf(alfalfa.f1, mrk.scope = "all", ncpus = 8))
plot(alfalfa.f1.all)
t2 <- system.time(alfalfa.f1.chr <- pairwise_rf(alfalfa.f1, mrk.scope = "per.chrom", ncpus = 8))
#> --> Chr_1
#> --> Chr_2
#> --> Chr_3
#> --> Chr_4
#> --> Chr_5
#> --> Chr_6
#> --> Chr_7
#> --> Chr_8
plot(alfalfa.f1.chr)
rbind(t1, t2)
#> user.self sys.self elapsed user.child sys.child
#> t1 389.117 1.047 54.815 0 0
#> t2 50.952 0.252 8.074 0 0
The Figure bellow illustrates the computational time and memory usage associated with calculating pairwise recombination frequencies using two different mrk.scope settings: ‘all’ and ‘per.chrom’. The left graph shows the time in minutes, and the right graph shows memory usage in gigabytes, both plotted against the increasing number of markers.
In the left graph, the ‘all’ setting (red line) shows a steep increase in time as the number of markers increases, highlighting its computational intensity. In contrast, the ‘per.chrom’ setting (cyan line) demonstrates a much more gradual increase, reflecting its lower computational demand.
Similarly, the right graph shows that memory usage for the ‘all’ setting rises sharply with the number of markers, which can be demanding on computational resources. The ‘per.chrom’ setting, however, maintains a relatively flat line, indicating minimal impact on memory usage regardless of marker numbers.
The simulation results showcased in this Figure were obtained using an Intel(R) Xeon(R) Gold 6226R CPU running at 2.90GHz, combined with 376 GB of RAM and 32 processing cores. All 32 cores were utilized for these simulations.
7 Screening based on recombination frequency
The function rf_filter
targets markers that are unlikely to be linked, as well
as those that might falsely appear to be highly linked across the entire genome,
potentially indicating false positives. The filtering is governed by specific
thresholds:
thresh.LOD.ph
: Sets the minimum LOD score for the linkage phase configuration, ensuring that only recombination fractions with LOD scores above this threshold, relative to the second most likely phase configuration, are selected.thresh.LOD.rf
: Defines the LOD score cut-off when compared with unlinked markers, allowing for the differentiation between truly linked markers and those that are likely unlinked due to random chance.thresh.rf
: Represents the recombination fraction threshold, beyond which markers are considered too distantly linked to be included in the analysis.
The function tallies and creates a distribution of values that meet these filtering criteria. It then eliminates the tails of this distribution as specified by the probs
argument, typically removing markers that do not exhibit linkage (lower tail) or exhibit implausible linkage across the genome (upper tail) to the analyzed marker set.
alfalfa.f1.all <- rf_filter(alfalfa.f1.all,
thresh.LOD.ph = 5,
thresh.LOD.rf = 5,
thresh.rf = 0.15,
probs = c(0.025, 0.975))
alfalfa.f1.all
#> Ploidy level of I195: 4
#> Ploidy level of J432: 4
#> No. individuals: 184
#> No. unique markers: 2795
#> Percentage of redundant: 18.3%
#> Percentage of missing: 6.6%
#> Chromosome info: available
#> Genome position: available
#> Recombination farction: available
#> Marker scope: all (Chr_1, Chr_2, Chr_3, Chr_4, Chr_5, Chr_6, Chr_7, Chr_8)
#> Thresholds
#> missing mrk: 0.2
#> missing ind: 0.1
#> chi-square pval: 1.79e-05
#> non full-sib: 0
#> LOD phase: 5
#> LOD rf: 5
#> rf: 0.15
#> prob.lower: 0.025
#> prob.upper: 0.975
#> Screened mrk: 1942
#> Screened ind: 165
plot(alfalfa.f1.all)
8 Grouping
Grouping markers is essential for organizing them into linkage groups, which are clusters of markers that are inherited together due to their proximity on a chromosome. The group function in MAPpoly2 uses two primary sources of information to assemble these groups:
Chromosome Assignment
: When available, chromosome assignment provides a pre-determined categorization of markers based on genomic data.Linkage Information
: In the absence of chromosome assignments, or to supplement them, linkage information is derived from the recombination fractions between markers. The group function applies UPGMA (Unweighted Pair Group Method with Arithmetic Mean) clustering on the matrix of recombination fractions to infer the linkage groups.
The group function is designed to be interactive, allowing users to adjust the number of expected groups and inspect the resultant partitioned dendrogram. If chromosome information is available, the function offers a comparison between the UPGMA-derived groups and the chromosome assignments. Users can inspect the results in a tabular format, which combines the groupings from both sources for examination. In the subsequent make_sequence function
, we will show how these two sources of information—chromosome assignment and linkage information—are integrated to construct linkage groups.
g <- group(x = alfalfa.f1.all, expected.groups = 8, comp.mat = TRUE, inter = FALSE)
g
#> - Number of linkage groups: 8
#> - Markers per linkage groups:
#> group n_mrk
#> ----------------
#> 1 283
#> 2 272
#> 3 225
#> 4 323
#> 5 287
#> 6 232
#> 7 214
#> 8 106
#> ----------------
#>
#> Chr_1 Chr_2 Chr_3 Chr_4 Chr_5 Chr_6 Chr_7 Chr_8
#> ----------------------------------------------------------
#> 1 268 1 3 7 1 1 . 2
#> 2 1 1 3 4 260 2 . 1
#> 3 4 4 1 1 1 . 211 3
#> 4 2 5 3 305 4 1 3 .
#> 5 1 . 1 5 4 2 1 273
#> 6 1 2 219 2 . . 6 2
#> 7 . 208 1 1 2 . 2 .
#> 8 . 1 . 2 1 100 . 2
#> ----------------------------------------------------------
plot(g)
9 Creating an working sequence from a grouped object
The make_sequence
function compiles a working sequence of markers from grouped
objects. This function takes a grouped object as input, along with two lists: lg
and ch
. The lg
list corresponds to linkage groups derived from UPGMA clustering,
and the ch
list corresponds to chromosomes. These lists should be of equal length,
with the lg
list representing the rows and the ch
list representing the columns
as observed in the tabular group output. The intersection of these two lists results
in the assignment of markers to their respective linkage groups and chromosomes.
If no values are provided, the function defaults to using the UPGMA clustering alone to assemble the linkage groups. If only the chromosome list is given, the function relies solely on genomic information for the assembly of linkage groups.
In our analysis, we utilize a combination of both sources of information. By intersecting the UPGMA clustering (rows) with the chromosome assignments (columns), we select the combination with the highest counts. This intersection method ensures that the working sequence leverages the most consistent and prevalent groupings across both linkage and chromosomal information, thereby enhancing the accuracy of the marker sequence.
s <- make_sequence(g,
lg = list(1, 7, 6, 4, 2, 8, 3, 5),
ch = list(1, 2, 3, 4, 5, 6, 7, 8))
print(s, type = "mds")
#> ---------- MDS --------------------------------------------------------------------------
#> p1 p2 p1p2
#> -----------------------------------------------------------------------------------------
#> Ch n.mrk ord phase map (cM) haplo phase map (cM) haplo phase map (cM) haplo
#> -----------------------------------------------------------------------------------------
#> lg1 1 268 N 0/0% . N 0/0% . N 0/0% . N
#> lg2 2 208 N 0/0% . N 0/0% . N 0/0% . N
#> lg3 3 219 N 0/0% . N 0/0% . N 0/0% . N
#> lg4 4 305 N 0/0% . N 0/0% . N 0/0% . N
#> lg5 5 260 N 0/0% . N 0/0% . N 0/0% . N
#> lg6 6 100 N 0/0% . N 0/0% . N 0/0% . N
#> lg7 7 211 N 0/0% . N 0/0% . N 0/0% . N
#> lg8 8 273 N 0/0% . N 0/0% . N 0/0% . N
#> -----------------------------------------------------------------------------------------
print(s, type = "genome")
#> ---------- Genome -----------------------------------------------------------------------
#> p1 p2 p1p2
#> -----------------------------------------------------------------------------------------
#> Ch n.mrk ord phase map (cM) haplo phase map (cM) haplo phase map (cM) haplo
#> -----------------------------------------------------------------------------------------
#> lg1 1 268 N 0/0% . N 0/0% . N 0/0% . N
#> lg2 2 208 N 0/0% . N 0/0% . N 0/0% . N
#> lg3 3 219 N 0/0% . N 0/0% . N 0/0% . N
#> lg4 4 305 N 0/0% . N 0/0% . N 0/0% . N
#> lg5 5 260 N 0/0% . N 0/0% . N 0/0% . N
#> lg6 6 100 N 0/0% . N 0/0% . N 0/0% . N
#> lg7 7 211 N 0/0% . N 0/0% . N 0/0% . N
#> lg8 8 273 N 0/0% . N 0/0% . N 0/0% . N
#> -----------------------------------------------------------------------------------------
The output from the make_sequence
creates an object of class mappoly2.sequence
which is displayed in a tabular form and provides a summary of mapping results for
each linkage group (lg) within a genome.
Here’s a breakdown of the information presented in the output:
Ch
(Chromosome): This indicates the chromosome number or the linkage group identifier.n.mrk
(Number of Markers): The number of markers present within each linkage group.ord
(Order): A ‘Y’ (Yes) or ‘N’ (No) indicating whether the markers’ order within the linkage group is established.phase
(Phase Configuration): : Displays the number of phase configurations determined from the pairwise recombination fractions. The accompanying percentage indicates the proportion of markers with established phases for parent 1 (p1), parent 2 (p2), and both parents (p1p2).
map
(cM) (Map Length in Centimorgans): Indicates the genetic length of the linkage group. The ‘.’ implies that the mapping step has not been performed yet.haplo
(Haplotypes Computed): A ‘Y’ (Yes) or ‘N’ (No) indicating whether haplotype probabilities have been computed for the linkage group.
The initial output from make_seq
sets the foundation for the mapping process.
The values, especially in the ord
, phase
, and haplo
columns, will be updated
as you proceed through the mapping pipeline with subsequent
functions.
10 Ordering
After creating the working sequences, the next step is to determine the order of the markers within each group.
MAPpoly2 offers two approaches for ordering markers:
MDS-based Ordering ("mds")
: This method uses multi-dimensional scaling (MDS) to order markers based on their pairwise recombination fractions. It’s particularly useful when chromosome assignment information is not available, as MDS relies solely on linkage information.Genome-based Ordering ("genome")
: When chromosome assignment data is available, genome-based ordering utilizes this additional layer of genomic information to refine the order of markers.
The order_sequence
function in MAPpoly2 is designed to handle both of these
ordering methods. It takes a sequence object, which contains the groups of markers
and their associated data, and arranges the markers within each group according
to the selected ordering method. In the near future, we plan to introduce a
custom ordering option, which will allow users to specify their own order for
the markers.
From here onwards, let’s apply both methods and contruct the maps using the resulting orders.
s <- order_sequence(s, type = "mds")
#> --> 1
#> --> 2
#> --> 3
#> --> 4
#> --> 5
#> --> 6
#> --> 7
#> --> 8
print(s, type = "mds")
#> ---------- MDS --------------------------------------------------------------------------
#> p1 p2 p1p2
#> -----------------------------------------------------------------------------------------
#> Ch n.mrk ord phase map (cM) haplo phase map (cM) haplo phase map (cM) haplo
#> -----------------------------------------------------------------------------------------
#> lg1 1 268 Y 0/0% . N 0/0% . N 0/0% . N
#> lg2 2 208 Y 0/0% . N 0/0% . N 0/0% . N
#> lg3 3 219 Y 0/0% . N 0/0% . N 0/0% . N
#> lg4 4 305 Y 0/0% . N 0/0% . N 0/0% . N
#> lg5 5 260 Y 0/0% . N 0/0% . N 0/0% . N
#> lg6 6 100 Y 0/0% . N 0/0% . N 0/0% . N
#> lg7 7 211 Y 0/0% . N 0/0% . N 0/0% . N
#> lg8 8 273 Y 0/0% . N 0/0% . N 0/0% . N
#> -----------------------------------------------------------------------------------------
s <- order_sequence(s, type = "genome")
#> --> 1
#> --> 2
#> --> 3
#> --> 4
#> --> 5
#> --> 6
#> --> 7
#> --> 8
print(s, type = "genome")
#> ---------- Genome -----------------------------------------------------------------------
#> p1 p2 p1p2
#> -----------------------------------------------------------------------------------------
#> Ch n.mrk ord phase map (cM) haplo phase map (cM) haplo phase map (cM) haplo
#> -----------------------------------------------------------------------------------------
#> lg1 1 268 Y 0/0% . N 0/0% . N 0/0% . N
#> lg2 2 208 Y 0/0% . N 0/0% . N 0/0% . N
#> lg3 3 219 Y 0/0% . N 0/0% . N 0/0% . N
#> lg4 4 305 Y 0/0% . N 0/0% . N 0/0% . N
#> lg5 5 260 Y 0/0% . N 0/0% . N 0/0% . N
#> lg6 6 100 Y 0/0% . N 0/0% . N 0/0% . N
#> lg7 7 211 Y 0/0% . N 0/0% . N 0/0% . N
#> lg8 8 273 Y 0/0% . N 0/0% . N 0/0% . N
#> -----------------------------------------------------------------------------------------
Now, let us compare the resulting recombination fraction matrices for both methods and
In our analysis, both the MDS-based and Genome-based ordering matrices display monotonically increasing patterns. Such monotonicity suggests that, as we move along the axes of the matrices, the genetic distance between markers increases steadily. The similar patterns observed in both matrices imply that the linkage data align well with the physical locations of the markers on the chromosomes. The scatter plots illustrate the alignment of markers between the MDS-based and Genome-based ordering methods. Ideally, we anticipate a 45-degree slope in these plots, which would signify a perfect concordance between the marker orders derived from both approaches. Any significant deviations from this diagonal line may indicate discrepancies, potentially arising from noise within the linkage data or inaccuracies in chromosomal assignments.
The forthcoming capability to include a custom order will complement these methods by accommodating additional data sources or expert insights into the map construction process. This feature will be particularly valuable in instances where the existing data may be incomplete or when we use a reference genome that does not necessarily align with the parents in the population but maintains a high degree of local synteny.
11 Screening based on recombination frequency per group
Now let’s apply the rf_filter
function once more, but this time, we will consider
only recombination fraction matrices within linkage groups. Usually, this step
can be skipped, but we will include it in the tutorial for the sake of completeness.
#### RF-based filter per groups ####
s <- rf_filter(s, type = "mds", probs = c(0.025, 0.975), diag.markers = 50)
s <- rf_filter(s, type = "genome", probs = c(0.025, 0.975), diag.markers = 50)
mappoly2:::plot_rf_matrix(s, type = "genome", fact = 2)
12 Pairwise rf-based phasing
Phasing markers is among the most complex tasks in mapping within a polyploid context.
The pairwise_phasing
function utilizes pairwise recombination information along
with their associated statistics in a branch-and-bound algorithm. This approach
seeks to identify a set of linkage phase configurations that will subsequently
be evaluated using a multilocus-based likelihood method.
The arguments for the pairwise_phasing function are as follows:
thresh.LOD.ph
This threshold sets the minimum LOD score for the linkage phase configuration when compared to alternative configurations. A phase configuration is considered reliable if its LOD score is at leastthresh.LOD.ph
units higher than any other configuration.thresh.LOD.rf
This argument specifies the LOD score threshold for considering recombination fractions as linked. Recombination fractions are deemed significant if their LOD scores exceed this threshold, suggesting true linkage rather than random association.thresh.rf
This argument sets the recombination fraction threshold. A recombination fraction must be less than or equal tothresh.rf
to be considered for phasing, reflecting the maximum expected value for independent assortment.max.search.expansion.p1
andmax.search.expansion.p2
These arguments define the maximum number of phase configurations to explore for parents 1 and 2. It limits the search space to a manageable number of possibilities, ensuring the algorithm remains computationally feasible.
s <- pairwise_phasing(s, type = "mds",
thresh.LOD.ph = 3,
thresh.LOD.rf = 3,
thresh.rf = 0.5,
max.search.expansion.p1 = 10,
max.search.expansion.p2 = 10)
#> --> 1
#> Phasing parent I195
#> Phasing parent J432
#> 250 phased markers out of 255: (98%)
#> --> 2
#> Phasing parent I195
#> Phasing parent J432
#> 196 phased markers out of 197: (99.5%)
#> --> 3
#> Phasing parent I195
#> Phasing parent J432
#> 197 phased markers out of 207: (95.2%)
#> --> 4
#> Phasing parent I195
#> Phasing parent J432
#> 284 phased markers out of 292: (97.3%)
#> --> 5
#> Phasing parent I195
#> Phasing parent J432
#> 247 phased markers out of 252: (98%)
#> --> 6
#> Phasing parent I195
#> Phasing parent J432
#> 92 phased markers out of 95: (96.8%)
#> --> 7
#> Phasing parent I195
#> Phasing parent J432
#> 197 phased markers out of 199: (99%)
#> --> 8
#> Phasing parent I195
#> Phasing parent J432
#> 256 phased markers out of 259: (98.8%)
print(s, type = "mds")
#> ---------- MDS --------------------------------------------------------------------------------
#> p1 p2 p1p2
#> -----------------------------------------------------------------------------------------------
#> Ch n.mrk ord phase map (cM) haplo phase map (cM) haplo phase map (cM) haplo
#> -----------------------------------------------------------------------------------------------
#> lg1 1 255 Y 1/18% . N 1/33.3% . N 1/98% . N
#> lg2 2 197 Y 1/23.9% . N 1/26.9% . N 1/99.5% . N
#> lg3 3 207 Y 1/22.2% . N 1/28% . N 1/95.2% . N
#> lg4 4 292 Y 1/22.3% . N 1/33.9% . N 1/97.3% . N
#> lg5 5 252 Y 1/22.2% . N 1/26.6% . N 1/98% . N
#> lg6 6 95 Y 1/18.9% . N 1/37.9% . N 1/96.8% . N
#> lg7 7 199 Y 1/13.6% . N 1/37.7% . N 1/99% . N
#> lg8 8 259 Y 1/19.7% . N 1/26.6% . N 1/98.8% . N
#> -----------------------------------------------------------------------------------------------
s <- pairwise_phasing(s,
type = "genome",
thresh.LOD.ph = 3,
thresh.LOD.rf = 3,
thresh.rf = 0.5,
max.search.expansion.p1 = 10,
max.search.expansion.p2 = 10)
#> --> 1
#> Phasing parent I195
#> Phasing parent J432
#> 250 phased markers out of 255: (98%)
#> --> 2
#> Phasing parent I195
#> Phasing parent J432
#> 196 phased markers out of 197: (99.5%)
#> --> 3
#> Phasing parent I195
#> Phasing parent J432
#> 197 phased markers out of 207: (95.2%)
#> --> 4
#> Phasing parent I195
#> Phasing parent J432
#> 284 phased markers out of 292: (97.3%)
#> --> 5
#> Phasing parent I195
#> Phasing parent J432
#> 239 phased markers out of 252: (94.8%)
#> --> 6
#> Phasing parent I195
#> Phasing parent J432
#> 92 phased markers out of 95: (96.8%)
#> --> 7
#> Phasing parent I195
#> Phasing parent J432
#> 197 phased markers out of 199: (99%)
#> --> 8
#> Phasing parent I195
#> Phasing parent J432
#> 256 phased markers out of 259: (98.8%)
print(s, type = "genome")
#> ---------- Genome -----------------------------------------------------------------------------
#> p1 p2 p1p2
#> -----------------------------------------------------------------------------------------------
#> Ch n.mrk ord phase map (cM) haplo phase map (cM) haplo phase map (cM) haplo
#> -----------------------------------------------------------------------------------------------
#> lg1 1 255 Y 1/18% . N 1/33.3% . N 1/98% . N
#> lg2 2 197 Y 1/23.9% . N 1/26.9% . N 1/99.5% . N
#> lg3 3 207 Y 1/22.2% . N 1/28% . N 1/95.2% . N
#> lg4 4 292 Y 1/22.3% . N 1/33.9% . N 1/97.3% . N
#> lg5 5 252 Y 1/20.6% . N 1/26.6% . N 1/94.8% . N
#> lg6 6 95 Y 1/18.9% . N 1/37.9% . N 1/96.8% . N
#> lg7 7 199 Y 1/13.6% . N 1/37.7% . N 1/99% . N
#> lg8 8 259 Y 1/19.7% . N 1/26.6% . N 1/98.8% . N
#> -----------------------------------------------------------------------------------------------
The pairwise_phasing function in MAPpoly2 has successfully completed the phasing of markers for both the Multi-Dimensional Scaling (MDS) and Genome-based ordering methods. The output for each method updates the sequence table, providing the overview of the phasing results for each linkage group (lg) across different parents and their combined configurations (p1, p2, and p1p2).
Here are some insights from the output:
Consistency Across Methods: Both the MDS-based and Genome-based approaches yield consistent results, as evidenced by the identical numbers and percentages in the
phase
columns for each linkage group.Phasing Efficiency: The high percentages in the
phase
columns, especially for the combined parent configuration (p1p2), indicate a successful and efficient phasing process. The majority of markers have been phased with confidence, laying a strong foundation for the subsequent mapping steps. This fact is mostly due to the high quality of the dataset used in this analysis.Further Actions: Since the mapping (indicated by map (cM)) and haplotype computation are pending (as shown by ‘.’ and ‘N’ respectively), the next steps in the mapping process will involve calculating these elements for each linkage group.
13 Multipoint estimation
Now, let’s reconstruct the genetic map for the linkage phases estimated by pairwise_phasing. In this specific case, we have only one possible linkage phase configuration per linkage group, which means the program does not need to compare multiple linkage phase likelihoods to select the optimal one. However, if multiple potential configurations were present in a different dataset, the program would perform this comparison and retain only the most likely linkage phase.
One might wonder why it’s necessary to re-estimate the map using the multilocus approach. This step is crucial because it utilizes the entire chromosomal structure in the estimation. If there are any misplaced markers or incorrect linkage phase assortments, these will be highlighted in the resulting map as an inflation of the expected map length. Furthermore, the Hidden Markov Model (HMM) approach is essential as it addresses the missing data inherent in polyploid cases, preparing the dataset for subsequent computation of the homolog composition in offspring individuals.
There are two primary strategies for reconstructing the multilocus map:
Separate Parental Maps and Integration: This approach involves constructing individual maps for each parent, merging them, and then adding unmapped markers. The procedure of constructing separate maps for each parent, inspired by the methodology of the
polymapR
package. As an addition, we include multilocus estimations along the way. The strategy begins with the functionmapping
by selecting markers informative to a specific parent using argumentsparent = "p1"
orparent = "p2"
. Technically, this strategy confines the number of hidden states in the Markov model to the number of gametes a single parent can produce (e.g., 2 in diploids, 6 in tetraploids, 20 in hexaploids), as opposed to the combined gametes of both parents (e.g., 4 in diploids, 36 in tetraploids, 400 in hexaploids). Once individual maps are constructed and evaluated using diagnostic tools, they are integrated into a full HMM model using themerge_single_parent_maps
function. To incorporate unmapped markers, posterior information from the pre-computed map is computed withcalc_haplotypes
, and the unmapped markers are inserted using theaugment_phased_map
function. This process avoids the computational intensity of constructing the map for each marker insertion, performing it only once after all markers have been appropriately phased.Joint Parental Approach: This more straightforward method constructs the map using both parents simultaneously. It generally doesn’t require merging maps or adding unmapped markers, though it is possible to do so using the
augment_phased_map
function if needed. For this tutorial, we will employ the first strategy, as it involves more coding steps and thus provides a comprehensive learning experience. However, with the dataset we’re using, one could easily adopt the second strategy by settingparent = "p1p2"
and proceeding directly to the Recomputing Haplotype Probabilities section.
13.1 Mapping Parent 1
s <- mapping(s, type = "mds", parent = "p1", ncpus = 8)
#> Multi-locus map estimation
s <- mapping(s, type = "genome", parent = "p1", ncpus = 8)
#> Multi-locus map estimation
print(s, type = "mds")
#> ---------- MDS --------------------------------------------------------------------------------
#> p1 p2 p1p2
#> -----------------------------------------------------------------------------------------------
#> Ch n.mrk ord phase map (cM) haplo phase map (cM) haplo phase map (cM) haplo
#> -----------------------------------------------------------------------------------------------
#> lg1 1 255 Y 1/18% 83 N 1/33.3% . N 1/98% . N
#> lg2 2 197 Y 1/23.9% 118 N 1/26.9% . N 1/99.5% . N
#> lg3 3 207 Y 1/22.2% 103 N 1/28% . N 1/95.2% . N
#> lg4 4 292 Y 1/22.3% 102 N 1/33.9% . N 1/97.3% . N
#> lg5 5 252 Y 1/22.2% 82 N 1/26.6% . N 1/98% . N
#> lg6 6 95 Y 1/18.9% 67 N 1/37.9% . N 1/96.8% . N
#> lg7 7 199 Y 1/13.6% 76 N 1/37.7% . N 1/99% . N
#> lg8 8 259 Y 1/19.7% 101 N 1/26.6% . N 1/98.8% . N
#> -----------------------------------------------------------------------------------------------
print(s, type = "genome")
#> ---------- Genome -----------------------------------------------------------------------------
#> p1 p2 p1p2
#> -----------------------------------------------------------------------------------------------
#> Ch n.mrk ord phase map (cM) haplo phase map (cM) haplo phase map (cM) haplo
#> -----------------------------------------------------------------------------------------------
#> lg1 1 255 Y 1/18% 71 N 1/33.3% . N 1/98% . N
#> lg2 2 197 Y 1/23.9% 102 N 1/26.9% . N 1/99.5% . N
#> lg3 3 207 Y 1/22.2% 99 N 1/28% . N 1/95.2% . N
#> lg4 4 292 Y 1/22.3% 85 N 1/33.9% . N 1/97.3% . N
#> lg5 5 252 Y 1/20.6% 118 N 1/26.6% . N 1/94.8% . N
#> lg6 6 95 Y 1/18.9% 70 N 1/37.9% . N 1/96.8% . N
#> lg7 7 199 Y 1/13.6% 78 N 1/37.7% . N 1/99% . N
#> lg8 8 259 Y 1/19.7% 103 N 1/26.6% . N 1/98.8% . N
#> -----------------------------------------------------------------------------------------------
plot_map(s, lg = 1, type = "mds", parent = "p1")
13.2 Mapping Parent 2
s <- mapping(s, type = "mds", parent = "p2", ncpus = 8)
#> Multi-locus map estimation
s <- mapping(s, type = "genome", parent = "p2", ncpus = 8)
#> Multi-locus map estimation
print(s, type = "mds")
#> ---------- MDS --------------------------------------------------------------------------------
#> p1 p2 p1p2
#> -----------------------------------------------------------------------------------------------
#> Ch n.mrk ord phase map (cM) haplo phase map (cM) haplo phase map (cM) haplo
#> -----------------------------------------------------------------------------------------------
#> lg1 1 255 Y 1/18% 83 N 1/33.3% 110 N 1/98% . N
#> lg2 2 197 Y 1/23.9% 118 N 1/26.9% 82 N 1/99.5% . N
#> lg3 3 207 Y 1/22.2% 103 N 1/28% 120 N 1/95.2% . N
#> lg4 4 292 Y 1/22.3% 102 N 1/33.9% 220 N 1/97.3% . N
#> lg5 5 252 Y 1/22.2% 82 N 1/26.6% 114 N 1/98% . N
#> lg6 6 95 Y 1/18.9% 67 N 1/37.9% 121 N 1/96.8% . N
#> lg7 7 199 Y 1/13.6% 76 N 1/37.7% 106 N 1/99% . N
#> lg8 8 259 Y 1/19.7% 101 N 1/26.6% 140 N 1/98.8% . N
#> -----------------------------------------------------------------------------------------------
print(s, type = "genome")
#> ---------- Genome -----------------------------------------------------------------------------
#> p1 p2 p1p2
#> -----------------------------------------------------------------------------------------------
#> Ch n.mrk ord phase map (cM) haplo phase map (cM) haplo phase map (cM) haplo
#> -----------------------------------------------------------------------------------------------
#> lg1 1 255 Y 1/18% 71 N 1/33.3% 95 N 1/98% . N
#> lg2 2 197 Y 1/23.9% 102 N 1/26.9% 89 N 1/99.5% . N
#> lg3 3 207 Y 1/22.2% 99 N 1/28% 110 N 1/95.2% . N
#> lg4 4 292 Y 1/22.3% 85 N 1/33.9% 193 N 1/97.3% . N
#> lg5 5 252 Y 1/20.6% 118 N 1/26.6% 97 N 1/94.8% . N
#> lg6 6 95 Y 1/18.9% 70 N 1/37.9% 121 N 1/96.8% . N
#> lg7 7 199 Y 1/13.6% 78 N 1/37.7% 98 N 1/99% . N
#> lg8 8 259 Y 1/19.7% 103 N 1/26.6% 149 N 1/98.8% . N
#> -----------------------------------------------------------------------------------------------
plot_map(s, lg = 1, type = "mds", parent = "p2")
13.3 Merging Maps
s <- merge_single_parent_maps(s, type = "genome", ncpus = 8, error = 0.05)
#> Multi-locus map estimation
s <- merge_single_parent_maps(s, type = "mds", ncpus = 8, error = 0.05)
#> Multi-locus map estimation
plot_map_list(s, type = "mds", parent = "p1p2")
map_summary(s)
#> ---------- MDS --- I195 x J432 -----------------------------------------------------------------------------------
#> LG Chrom Map_length_(cM) Markers/cM Simplex_P1 Simplex_P2 Double-simplex Multiplex Total Max_gap
#> ------------------------------------------------------------------------------------------------------------------
#> lg1 1 60.1 2.18 27 59 . 45 131 4.1
#> lg2 2 62.9 1.59 33 41 . 26 100 4.6
#> lg3 3 67.7 1.536 31 44 . 29 104 6.7
#> lg4 4 66.2 2.477 51 76 . 37 164 4.7
#> lg5 5 55.6 2.212 45 49 . 29 123 4.3
#> lg6 6 74.2 0.728 13 28 . 13 54 15
#> lg7 7 74 1.378 22 57 . 23 102 10
#> lg8 8 71.6 1.676 40 50 . 30 120 11
#> ------------------------------------------------------------------------------------------------------------------
#> Total 532.3 2 262 404 . 232 898 15
#> ------------------------------------------------------------------------------------------------------------------
#>
#> ---------- Genome --- I195 x J432 --------------------------------------------------------------------------------
#> LG Chrom Map_length_(cM) Markers/cM Simplex_P1 Simplex_P2 Double-simplex Multiplex Total Max_gap
#> ------------------------------------------------------------------------------------------------------------------
#> lg1 1 56 2.339 27 59 . 45 131 3.5
#> lg2 2 63.6 1.572 33 41 . 26 100 5.9
#> lg3 3 65.8 1.581 31 44 . 29 104 6.3
#> lg4 4 65.9 2.489 51 76 . 37 164 4
#> lg5 5 54.6 2.179 42 49 . 28 119 3.4
#> lg6 6 68.2 0.792 13 28 . 13 54 13
#> lg7 7 61.7 1.653 22 57 . 23 102 6.5
#> lg8 8 60.1 1.997 40 50 . 30 120 5.9
#> ------------------------------------------------------------------------------------------------------------------
#> Total 495.9 2 259 404 . 231 894 13
#> ------------------------------------------------------------------------------------------------------------------
plot_map(s, lg = 1, type = "mds", parent = "p1p2")
s <- calc_haplotypes(s, type = "mds", ncpus = 8)
s <- calc_haplotypes(s, type = "genome", ncpus = 8)
Merging maps resulted in a genetic map with a total length of 532.3 cM (MDS) and 495.9 cM (Genome) across all linkage groups, with a total of 898 (MDS) and 894 (Genome) markers. The average markers per centimorgan were around 2, and the maximum gap ranged from 15 cM (MDS) to 13 cM (Genome), indicating a well-distributed marker set.
13.4 Augmenting Merged Maps
s <- augment_phased_map(s, type = "mds", ncpus = 8)
#>
#> Reestimating multilocus map ...
#> Multi-locus map estimation
s <- augment_phased_map(s, type = "genome", ncpus = 8)
#>
#> Reestimating multilocus map ...
#> Multi-locus map estimation
plot_map_list(s, type = "mds", parent = "p1p2")
map_summary(s)
#> ---------- MDS --- I195 x J432 -----------------------------------------------------------------------------------
#> LG Chrom Map_length_(cM) Markers/cM Simplex_P1 Simplex_P2 Double-simplex Multiplex Total Max_gap
#> ------------------------------------------------------------------------------------------------------------------
#> lg1 1 65.1 3.656 27 59 54 98 238 3.5
#> lg2 2 68.4 2.822 33 41 51 68 193 5.7
#> lg3 3 72 2.736 31 45 66 55 197 6
#> lg4 4 71.7 4.003 51 76 47 113 287 4.1
#> lg5 5 61.9 3.91 45 49 58 90 242 4
#> lg6 6 77 1.104 13 28 18 26 85 16.3
#> lg7 7 80 2.462 22 57 52 66 197 10.8
#> lg8 8 76.3 3.368 40 50 76 91 257 10.9
#> ------------------------------------------------------------------------------------------------------------------
#> Total 572.4 3 262 405 422 607 1696 16.3
#> ------------------------------------------------------------------------------------------------------------------
#>
#> ---------- Genome --- I195 x J432 --------------------------------------------------------------------------------
#> LG Chrom Map_length_(cM) Markers/cM Simplex_P1 Simplex_P2 Double-simplex Multiplex Total Max_gap
#> ------------------------------------------------------------------------------------------------------------------
#> lg1 1 64.1 3.744 27 59 56 98 240 2.3
#> lg2 2 67.4 2.893 33 41 53 68 195 4.3
#> lg3 3 69.1 2.822 31 45 65 54 195 5.6
#> lg4 4 69.4 4.15 51 76 48 113 288 2.6
#> lg5 5 62.3 3.852 46 49 59 86 240 2.6
#> lg6 6 69.6 1.149 13 28 18 21 80 13.6
#> lg7 7 66.3 2.911 22 57 50 64 193 6
#> lg8 8 63.7 4.003 40 50 75 90 255 3
#> ------------------------------------------------------------------------------------------------------------------
#> Total 531.9 3 263 405 424 594 1686 13.6
#> ------------------------------------------------------------------------------------------------------------------
plot_map(s, lg = 1, type = "mds", parent = "p1p2")
Post augmentation, however, the map length slightly increased to 572.4 cM (MDS) and 531.9 cM (Genome), and the total number of markers significantly rose to 1696 (MDS) and 1686 (Genome). This increase was primarily due to adding double-simplex and multiplex markers, enhancing the map’s resolution and potential accuracy. Despite the increased number of markers, the overall map structure remained consistent, though the maximum gap size marginally increased, notably in the MDS map, where it reached 16.3 cM.
13.5 Comparing MDS and Genome-based orders
Now, let us compare the MDS and Genome-ordered maps generated in our analysis.
For an effective comparison of these orders, it’s crucial to employ a well-defined
objective metric. Here, we’ll focus on two primary aspects: the multilocus likelihood
and the characteristics of the map, specifically its length, resulting recombination
frequency matrices, and the gaps between markers.
To proceed with this comparison, we utilize the function compare_order.
This
function identifies markers common to both maps and recalculates the Hidden Markov Model
(HMM)-based likelihood for each linkage group. By assessing both the (log)likelihood
values and the structural aspects of the maps, we can determine which ordering method
is more consistent and informative for our dataset.
maps.comp <- compare_order(s)
#> Conf. 1 :
#> Done with map estimation
#> Conf. 1 :
#> Done with map estimation
#> Conf. 1 :
#> Done with map estimation
#> Conf. 1 :
#> Done with map estimation
#> Conf. 1 :
#> Done with map estimation
#> Conf. 1 :
#> Done with map estimation
#> Conf. 1 :
#> Done with map estimation
#> Conf. 1 :
#> Done with map estimation
#> Conf. 1 :
#> Done with map estimation
#> Conf. 1 :
#> Done with map estimation
#> Conf. 1 :
#> Done with map estimation
#> Conf. 1 :
#> Done with map estimation
#> Conf. 1 :
#> Done with map estimation
#> Conf. 1 :
#> Done with map estimation
#> Conf. 1 :
#> Done with map estimation
#> Conf. 1 :
#> Done with map estimation
maps.comp
#> LG n.mrk map length max_gap ave_dist loglike
#> -------------------------------------------------------------------------------------------
#> mds lg1 235 65.19 3.46 0.28 -7012.14
#> genome lg1 235 64.41 2.33 0.28 -3678.8 *
#> mds lg2 193 67.79 5.68 0.35 -5788.22
#> genome lg2 193 67.83 4.28 0.35 -4547.39 *
#> mds lg3 195 71.83 6 0.37 -5769.63
#> genome lg3 195 69.08 5.55 0.36 -3836.94 *
#> mds lg4 286 71.97 4.06 0.25 -9119.55
#> genome lg4 286 69.83 2.63 0.25 -5591.6 *
#> mds lg5 235 62.01 3.98 0.26 -6322.78
#> genome lg5 235 62.07 2.6 0.27 -4019.52 *
#> mds lg6 80 76.6 16.32 0.97 -3042.48
#> genome lg6 80 69.95 13.64 0.89 -2493.8 *
#> mds lg7 193 78.7 10.82 0.41 -5237.83
#> genome lg7 193 66.21 5.99 0.34 -4448.84 *
#> mds lg8 255 76.58 10.94 0.3 -9588.02
#> genome lg8 255 64.04 2.99 0.25 -7041.11 *
#> -------------------------------------------------------------------------------------------
plot(maps.comp)
After evaluating the metrics mentioned earlier, we conclude that the Genome order
provided superior results for our dataset. However, it’s noteworthy that the MDS
order also presents a strong case. Based on the map’s characteristics, such as
its length, marker distribution, and gaps, the MDS order demonstrates a commendable
performance and could be effectively utilized in subsequent analyses. This
observation highlights the viability of the MDS order in scenarios where genome
order data might not be readily available or when a more linkage-focused approach
is preferred.
14 Computing Haplotype Probabilities
Now, let us compute the haplotype probabilities of the offsprings of the cross
between parents I195
and J432
using function calc_haplotypes
:
s <- calc_haplotypes(s, type = "mds", ncpus = 8)
s <- calc_haplotypes(s, type = "genome", ncpus = 8)
plot_haplotypes(s, lg = 1, ind = "F1.85.31")
The plot depicts the haplotype probabilities for individual F1.85.31 across linkage group 1. The graphical representation shows the probabilities of inheriting specific homologs from each parent, with each horizontal bar corresponding to one of the four homologs (h1 to h4). In the case of parent I195, the probabilities for each of the four homologs are clearly differentiated, with a crossover event indicated around the 30 cM position between homologs h1 and h2. For parent J432, depicted in shades of red, crossover events are suggested around the 45 and 50 cM positions, likely involving homologs h1 and h3, and h2 and h4.
These results are extremely important for analyzing the genetic structure of offspring and identifying areas of significant linkage or recombination. They are fundamental to further analysis such as QTL mapping or marker-assisted selection, as they provide critical insights into the genetic makeup of the population under study
15 Final remarks about F1 map
This concludes the construction of the map construction for the I195
x J432
cross.
Here are the final results:
map_summary(I195xJ432_map, type = "genome")
#> ---------- Genome --- I195 x J432 --------------------------------------------------------------------------------
#> LG Chrom Map_length_(cM) Markers/cM Simplex_P1 Simplex_P2 Double-simplex Multiplex Total Max_gap
#> ------------------------------------------------------------------------------------------------------------------
#> lg1 1 64.1 3.744 27 59 56 98 240 2.3
#> lg2 2 67.4 2.893 33 41 53 68 195 4.3
#> lg3 3 69.1 2.822 31 45 65 54 195 5.6
#> lg4 4 69.4 4.15 51 76 48 113 288 2.6
#> lg5 5 62.3 3.852 46 49 59 86 240 2.6
#> lg6 6 69.6 1.149 13 28 18 21 80 13.6
#> lg7 7 66.3 2.911 22 57 50 64 193 6
#> lg8 8 63.7 4.003 40 50 75 90 255 3
#> ------------------------------------------------------------------------------------------------------------------
#> Total 531.9 3 263 405 424 594 1686 13.6
#> ------------------------------------------------------------------------------------------------------------------
plot_map_list(I195xJ432_map, type = "genome", col = mp_pal(8))
16 Loading BC map
In this section, we will proceed to load a pre-constructed BC map. Although it’s possible to generate the BC map using a process similar to the one we employed for the F1 map, please note that the BC population size is smaller. Therefore, it’s advisable to experiment with various threshold values throughout this tutorial and fine-tune them for different linkage groups as needed to achieve optimal results.
Here is the script to load the BC map:
download.file("https://github.com/mmollina/mappoly2_vignettes/raw/main/I195_x_F1-85-209_map.rda",
destfile = "temp_file.rda")
load("temp_file.rda")
I195xF1_85_209_map <- mapping(I195xF1_85_209_map, type = "genome", error = 0.05, ncpus = 8)
#> Multi-locus map estimation
print(I195xF1_85_209_map, type = "genome")
#> ---------- Genome --------------------------------------------------------------------------------
#> p1 p2 p1p2
#> --------------------------------------------------------------------------------------------------
#> Ch n.mrk ord phase map (cM) haplo phase map (cM) haplo phase map (cM) haplo
#> --------------------------------------------------------------------------------------------------
#> lg1 1 192 Y 1/5.7% . N 1/16.1% . N 1/92.7% 220 Y
#> lg2 2 170 Y 12/4.7% . N 12/12.4% . N 12/78.2% 196 Y
#> lg3 3 160 Y 4/8.1% . N 4/16.2% . N 4/86.2% 165 Y
#> lg4 4 250 Y 12/6% . N 12/15.2% . N 12/91.2% 194 Y
#> lg5 5 224 Y 8/11.2% . N 8/12.5% . N 8/86.6% 175 Y
#> lg6 6 59 Y 8/8.5% . N 8/30.5% . N 8/61% 70 Y
#> lg7 7 207 Y 24/14.5% . N 24/21.7% . N 24/79.7% 190 Y
#> lg8 8 207 Y 48/15% . N 48/11.6% . N 48/81.2% 314 Y
#> --------------------------------------------------------------------------------------------------
map_summary(I195xF1_85_209_map, type = "genome")
#> ---------- Genome --- I195 x F1.85.209 ---------------------------------------------------------------------------
#> LG Chrom Map_length_(cM) Markers/cM Simplex_P1 Simplex_P2 Double-simplex Multiplex Total Max_gap
#> ------------------------------------------------------------------------------------------------------------------
#> lg1 1 219.7 0.824 9 28 51 93 181 33.3
#> lg2 2 195.3 0.671 8 19 47 57 131 33.6
#> lg3 3 164.9 0.837 12 25 59 42 138 14.1
#> lg4 4 193.1 1.201 11 39 88 94 232 10.4
#> lg5 5 175.1 1.102 21 28 81 63 193 38.5
#> lg6 6 70.6 0.467 5 17 6 5 33 30.3
#> lg7 7 189 0.884 14 46 44 63 167 18.6
#> lg8 8 314 0.525 16 23 46 80 165 24.1
#> ------------------------------------------------------------------------------------------------------------------
#> Total 1521.7 1 96 225 422 497 1240 38.5
#> ------------------------------------------------------------------------------------------------------------------
plot_map_list(I195xF1_85_209_map, type = "genome", col = mp_pal(8))
17 Preparing maps to integrate
Now that we have both maps at hand, our next step is to construct the consensus map. To ensure a consistent homolog correspondence for shared parents across populations, we need to prepare the data. This involves assembling a list containing the maps of the different populations. Once that’s done, we can visually inspect the maps together using the plot_multi_map
function, and then employ the prepare_to_integrate
function to assess and establish homolog correspondence. Here’s how we go about it:
Examining the dendrogram, we notice a pronounced collinearity among certain homologs from parent I195
across different populations. For instance, homolog h1
in population 1 (I195 x J432) corresponds to homolog h3
in population 2 (I195 x F1.85.209), while h2
corresponds to h1
, h3
to h2
, and h4
remains consistent across both populations. The Venn diagram corroborates this genetic linkage, showcasing that 929 markers are common between the two populations.
18 Estimating consensus map
The next step is to estimate the consensus map considering all populations. The process of estimating a multi-population genetic map using MAPpoly2 involves several steps:
Marker Assessment: Initially, individual markers are assessed within each population (Pop 1, Pop 2, Pop 3, etc.). This assessment includes determining the marker’s quality, its segregation pattern, and allelic dosages.
Single Map Construction: For each population, a single genetic map is constructed based on the markers. We did that in the previous sections
Markovian Connection: Markers from different populations are connected using the Markovian properties of HMM. This means that the state of a marker (i.e., its genotype on the genetic map) in one population can influence the state of the same marker in another population, accounting for crossover and recombination events.
Posterior Computation for Missing Data: HMMs can handle missing data by computing the posterior probabilities of missing marker states based on the observed data. This step fills in gaps and ensures a more complete and accurate representation of the genetic linkage.
Consensus Map Reconstruction: Finally, the individual maps from each population are integrated to reconstruct the final consensus map. This map combines information from all populations and provides a comprehensive overview of the genetic architecture across the different populations.
The animation might illustrate these steps dynamically, showing how individual markers (represented by colored dots) are first organized within their respective populations, then connected across populations, with the missing data being interpolated, and ultimately combined into a single, coherent genetic map that represents the consensus of all the populations analyzed.
In MAPpoly, the function estimate_consensus_map
streamlines the last three steps of the HMM-based map estimation process. This function incorporates the ability to account for genotype errors, which permits the posterior computations to consider states that would otherwise be deemed impossible. This feature enables the HMM to correct potential genotyping errors by allowing transitions to states that could not be reached without the consideration of such errors.
consensus.map <- estimate_consensus_map(prep.maps, ncpus = 8, err = 0.05)
consensus.map
#> Ploidy of founders: 4 4 4
#> Total No. individuals: 259
#> Total No. markers 1997
#> Haplotype probability computed: No
#>
#> Number of individuals per crosse:
#> J432 F1.85.209
#> ----------------------
#> I195 165 94
#> ----------------------
#>
#> Consensus Map:
#> ---------------
#> LG Map_length_.cM. Markers.cM Total.mrk Max_gap
#> ----------------------------------------------------------
#> lg1 99.86 2.764 276 8.14
#> lg2 96.84 2.396 232 6.11
#> lg3 91.25 2.532 231 4.09
#> lg4 103.34 3.309 342 3.92
#> lg5 93.22 3.122 291 2.77
#> lg6 66.7 1.304 87 6.12
#> lg7 88.94 2.743 244 2.79
#> lg8 99.06 2.968 294 3.64
#> ----------------------------------------------------------
plot(consensus.map)
The consensus map encompasses a total of 259 individuals and 1,997 markers.
This represents a significant increase in marker density compared to the 1,686
markers for the F1 map and the 1,240 markers for the BC map. The map spans genetic
distances ranging from 103.34 cM to 66.7 cM across various linkage groups. This
range can be attributed to the contribution of the more informative cross
(I195 x J432) , which compensates for the less informative population (I195 x F1.85.209),
thereby enhancing the overall robustness and resolution of the consensus map.
This interplay between the populations ensures that the less represented genomic
regions in the BC population are adequately covered, thanks to the richer genetic
information provided by the F1 population.
19 Computing consensus haplotype probability
Just as we did with the F1 map construction, let’s compute the haplotype
probabilities using the consensus map by employing the function
calc_consensus_haplo
:
consensus.map <- calc_consensus_haplo(consensus.map, ncpus = 8)
plot_consensus_haplo(consensus.map, lg = 1, ind = "F1.85.70")
The results are quite similar, but we now have different colors depending on the parent that transmitted the homolog. Currently, this color representation is limited to 10 different founders. After this step, one can save the consensus map to be used in QTL analysis with QTLpoly.