Building an integrated genetic linkage map of autotetraploid alfalfa populations using the MAPpoly2

1 Introduction

In this tutorial, we will explore MAPpoly2, an R package under development for constructing genetic maps in autopolyploid species. MAPpoly2 is an extension of MAPpoly, designed to handle a wider range of ploidy levels and to provide a more user-friendly experience with additional graphical functions and potential integration with R Shiny. Our focus will be on constructing integrated maps for multiple full-sib families of alfalfa, an important autotetraploid crop. Let’s begin by installing MAPpoly2 from GitHub.

2 Installing MAPpoly2

MAPpoly2 is currently available in its development version on GitHub. To install it, we first need the devtools package:

install.packages("devtools")

Windows users should install the latest version of Rtools.

Install MAPpoly2 using the following command:

devtools::install_github("mmollina/mappoly2", dependencies=TRUE)

Then, load MAPpoly2:

library(mappoly2)

3 About the alfalfa populations

We are going to use two datasets distributed along with MAPpoly2:

  • alfa_f1 This dataset comprises 184 offspring from a cross between two tetraploid alfalfa parents, I195 and J432, which are resistant and susceptible to Aphanomyces euteiches, respectively. The biparental population was presented in Zhao et al., (2023) and genotyped with the alfalfa DArTag panel described in the same publication.

  • alfa_bc This dataset comprises 93 offspring from a cross between two tetraploid alfalfa parents, I195 and F1.85.209, the latter being derived from alfa_f1 cross.

Alfalfa publication

(#fig:alfalfa_pub)Alfalfa publication

4 Loading datasets

MAPpoly2 supports CSV file format for input datasets. Here, we will demonstrate how to read and prepare your data in CSV format suitable for MAPpoly2. The first line in the CSV file should contain headers, and subsequent lines will have marker data including dosage, chromosome number, genome position, and allele information.

For example, here’s how your dataset might look:

Example of CSV data set

(#fig:data set_example)Example of CSV data set

To load the dataset, we utilize the read_geno_csv function. This function requires specifying the ploidy level of the parents, which could be 2, 4, 6, or a combination thereof. If the names of the parents are provided, they will override any parent names present in the corresponding columns of the CSV file.

alfalfa.f1 <- read_geno_csv(file.in = "I195_x_J432.csv",
                            ploidy.p1 = 4,
                            ploidy.p2 = 4,
                            name.p1 = "I195",
                            name.p2 = "J432")
#> Reading the following data:
#>      Ploidy level of I195:     4
#>      Ploidy level of J432:     4
#>      No. individuals:          184
#>      No. markers:              3419
#>      No. informative markers:  3419 (100%)
#>      ---
#>      This dataset contains chromosome information.
#>      This dataset contains position information.
#>       -->  Filtering non-conforming markers.
#>       -->  Filtering markers with redundant information.
#>      ----------------------------------

Let’s take a look at the dataset and its graphical representation:

alfalfa.f1
#>      Ploidy level of I195:     4
#>      Ploidy level of J432:     4
#>      No. individuals:          184
#>      No. unique markers:       2795
#>      Percentage of redundant:  18.3%
#>      Percentage of missing:    6.6%
#>      Chromosome info:          available
#>      Genome position:          available
#>      Recombination farction:   unavailable
plot(alfalfa.f1)

For the purposes of this tutorial, we have introduced the read_geno_csv function to provide an understanding of how datasets can be loaded into MAPpoly2. However, you can use the alfa_f1 dataset that comes bundled with the MAPpoly package.

alfalfa.f1 <- alfa_f1

5 Dataset quality control

Quality control is a fundamental part of genetic mapping, ensuring the reliability and accuracy of the resulting genetic maps. In this section, we focus on two critical aspects of quality control: addressing missing data and identifying and removing non-F1 individuals.

5.1 Addressing Missing Data

Screening out missing data under specified thresholds involves setting thresholds for the acceptable amount of missing data and then filtering out markers and individuals that exceed these limits. In MAPpoly2, you can use the filter_data function to apply these thresholds. Here is an example of how to screen out missing data based on predetermined thresholds:

alfalfa.f1 <- filter_data(alfalfa.f1, mrk.thresh = 0.2, ind.thresh = 0.1)

plot(alfalfa.f1)

alfalfa.f1
#>      Ploidy level of I195:     4
#>      Ploidy level of J432:     4
#>      No. individuals:          184
#>      No. unique markers:       2795
#>      Percentage of redundant:  18.3%
#>      Percentage of missing:    6.6%
#>      Chromosome info:          available
#>      Genome position:          available
#>      Recombination farction:   unavailable
#>     Thresholds           
#>         missing mrk:       0.2
#>         missing ind:       0.1
#>         chi-square pval:   1.79e-05
#>         non full-sib:      -
#>         LOD phase:         0
#>         LOD rf:            0
#>         rf:                0.5
#>         prob.lower:        0
#>         prob.upper:        1
#>      Screened mrk:         2039
#>      Screened ind:         165

5.2 Identifying and Removing Non-F1 Individuals

Non-F1 individuals, or those not derived from the specified parent cross, can significantly affect the mapping process. Their presence may introduce erroneous genetic variation that does not represent the true recombination patterns of the population under study. This can lead to incorrect marker ordering, inflated map distances, and even the formation of artificial linkage groups.

alfalfa.f1 <- filter_individuals(alfalfa.f1)
screening individuals using filter_individuals

(#fig:fint ind2)screening individuals using filter_individuals

alfalfa.f1
#>      Ploidy level of I195:     4
#>      Ploidy level of J432:     4
#>      No. individuals:          184
#>      No. unique markers:       2795
#>      Percentage of redundant:  18.3%
#>      Percentage of missing:    6.6%
#>      Chromosome info:          available
#>      Genome position:          available
#>      Recombination farction:   unavailable
#>     Thresholds           
#>         missing mrk:       0.2
#>         missing ind:       0.1
#>         chi-square pval:   1.79e-05
#>         non full-sib:      0
#>         LOD phase:        
#>         LOD rf:           
#>         rf:               
#>         prob.lower:       
#>         prob.upper:       
#>      Screened mrk:         2039
#>      Screened ind:         165

5.3 Assessing Marker Density

Evaluating marker density in our genetic map helps reveal gaps which could indicate missing data, phasing errors, or natural recombination variations, thereby guiding us in enhancing the map’s accuracy and resolution.

plot(alfalfa.f1, type = "density")
#>  Marker density plotting.

6 Estimating Pairwise Recombination Frequency

Once we have selected our markers, the next step is to compute pairwise recombination frequencies. For this, we use the pairwise_rf function in MAPpoly2, which offers three modes of calculation:

  • mrk.scope = “all” calculates recombination fractions between all marker pairs across the entire dataset. This method is computationally intensive due to the sheer volume of pairwise comparisons involved.

  • mrk.scope = “per.chrom” limits the calculation to pairwise comparisons within each chromosome. This approach is significantly less intensive as it reduces the number of comparisons, focusing only on marker pairs on the same chromosome.

  • mrk.scope = “chrom”: In this mode, you can specify certain chromosomes for which to calculate pairwise recombination frequencies. It requires you to provide the chromosome numbers and is less intensive than the ‘all’ mode but usually more specific than the ‘per.chrom’ mode.

To manage computation efficiently, parallel processing can be utilized by specifying the number of cores your machine will allocate for this task. Ensure your machine has sufficient RAM, and remember to leave one core free for system operations.

t1 <- system.time(alfalfa.f1.all <- pairwise_rf(alfalfa.f1, mrk.scope = "all", ncpus = 8))
plot(alfalfa.f1.all)

t2 <- system.time(alfalfa.f1.chr <- pairwise_rf(alfalfa.f1, mrk.scope = "per.chrom", ncpus = 8))
#>   --> Chr_1
#>   --> Chr_2
#>   --> Chr_3
#>   --> Chr_4
#>   --> Chr_5
#>   --> Chr_6
#>   --> Chr_7
#>   --> Chr_8
plot(alfalfa.f1.chr)

rbind(t1, t2)
#>    user.self sys.self elapsed user.child sys.child
#> t1   389.117    1.047  54.815          0         0
#> t2    50.952    0.252   8.074          0         0

The Figure bellow illustrates the computational time and memory usage associated with calculating pairwise recombination frequencies using two different mrk.scope settings: ‘all’ and ‘per.chrom’. The left graph shows the time in minutes, and the right graph shows memory usage in gigabytes, both plotted against the increasing number of markers.

In the left graph, the ‘all’ setting (red line) shows a steep increase in time as the number of markers increases, highlighting its computational intensity. In contrast, the ‘per.chrom’ setting (cyan line) demonstrates a much more gradual increase, reflecting its lower computational demand.

Similarly, the right graph shows that memory usage for the ‘all’ setting rises sharply with the number of markers, which can be demanding on computational resources. The ‘per.chrom’ setting, however, maintains a relatively flat line, indicating minimal impact on memory usage regardless of marker numbers.

The simulation results showcased in this Figure were obtained using an Intel(R) Xeon(R) Gold 6226R CPU running at 2.90GHz, combined with 376 GB of RAM and 32 processing cores. All 32 cores were utilized for these simulations.

7 Screening based on recombination frequency

The function rf_filter targets markers that are unlikely to be linked, as well as those that might falsely appear to be highly linked across the entire genome, potentially indicating false positives. The filtering is governed by specific thresholds:

  • thresh.LOD.ph: Sets the minimum LOD score for the linkage phase configuration, ensuring that only recombination fractions with LOD scores above this threshold, relative to the second most likely phase configuration, are selected.
  • thresh.LOD.rf: Defines the LOD score cut-off when compared with unlinked markers, allowing for the differentiation between truly linked markers and those that are likely unlinked due to random chance.
  • thresh.rf: Represents the recombination fraction threshold, beyond which markers are considered too distantly linked to be included in the analysis.

The function tallies and creates a distribution of values that meet these filtering criteria. It then eliminates the tails of this distribution as specified by the probs argument, typically removing markers that do not exhibit linkage (lower tail) or exhibit implausible linkage across the genome (upper tail) to the analyzed marker set.

alfalfa.f1.all <- rf_filter(alfalfa.f1.all, 
                            thresh.LOD.ph = 5, 
                            thresh.LOD.rf = 5, 
                            thresh.rf = 0.15, 
                            probs = c(0.025, 0.975))

alfalfa.f1.all
#>      Ploidy level of I195:     4
#>      Ploidy level of J432:     4
#>      No. individuals:          184
#>      No. unique markers:       2795
#>      Percentage of redundant:  18.3%
#>      Percentage of missing:    6.6%
#>      Chromosome info:          available
#>      Genome position:          available
#>      Recombination farction:   available
#>      Marker scope:             all (Chr_1, Chr_2, Chr_3, Chr_4, Chr_5, Chr_6, Chr_7, Chr_8)
#>     Thresholds           
#>         missing mrk:       0.2
#>         missing ind:       0.1
#>         chi-square pval:   1.79e-05
#>         non full-sib:      0
#>         LOD phase:         5
#>         LOD rf:            5
#>         rf:                0.15
#>         prob.lower:        0.025
#>         prob.upper:        0.975
#>      Screened mrk:         1942
#>      Screened ind:         165
plot(alfalfa.f1.all)

8 Grouping

Grouping markers is essential for organizing them into linkage groups, which are clusters of markers that are inherited together due to their proximity on a chromosome. The group function in MAPpoly2 uses two primary sources of information to assemble these groups:

  • Chromosome Assignment: When available, chromosome assignment provides a pre-determined categorization of markers based on genomic data.

  • Linkage Information: In the absence of chromosome assignments, or to supplement them, linkage information is derived from the recombination fractions between markers. The group function applies UPGMA (Unweighted Pair Group Method with Arithmetic Mean) clustering on the matrix of recombination fractions to infer the linkage groups.

The group function is designed to be interactive, allowing users to adjust the number of expected groups and inspect the resultant partitioned dendrogram. If chromosome information is available, the function offers a comparison between the UPGMA-derived groups and the chromosome assignments. Users can inspect the results in a tabular format, which combines the groupings from both sources for examination. In the subsequent make_sequence function, we will show how these two sources of information—chromosome assignment and linkage information—are integrated to construct linkage groups.

g <- group(x = alfalfa.f1.all, expected.groups = 8, comp.mat = TRUE, inter = FALSE)
g
#>        - Number of linkage groups:   8 
#>        - Markers per linkage groups: 
#>            group  n_mrk  
#>         ---------------- 
#>            1      283    
#>            2      272    
#>            3      225    
#>            4      323    
#>            5      287    
#>            6      232    
#>            7      214    
#>            8      106    
#>         ---------------- 
#> 
#>            Chr_1  Chr_2  Chr_3  Chr_4  Chr_5  Chr_6  Chr_7  Chr_8  
#>         ---------------------------------------------------------- 
#>         1  268    1      3      7      1      1      .      2      
#>         2  1      1      3      4      260    2      .      1      
#>         3  4      4      1      1      1      .      211    3      
#>         4  2      5      3      305    4      1      3      .      
#>         5  1      .      1      5      4      2      1      273    
#>         6  1      2      219    2      .      .      6      2      
#>         7  .      208    1      1      2      .      2      .      
#>         8  .      1      .      2      1      100    .      2      
#>         ----------------------------------------------------------
plot(g)

9 Creating an working sequence from a grouped object

The make_sequence function compiles a working sequence of markers from grouped objects. This function takes a grouped object as input, along with two lists: lg and ch. The lg list corresponds to linkage groups derived from UPGMA clustering, and the ch list corresponds to chromosomes. These lists should be of equal length, with the lg list representing the rows and the ch list representing the columns as observed in the tabular group output. The intersection of these two lists results in the assignment of markers to their respective linkage groups and chromosomes.

If no values are provided, the function defaults to using the UPGMA clustering alone to assemble the linkage groups. If only the chromosome list is given, the function relies solely on genomic information for the assembly of linkage groups.

In our analysis, we utilize a combination of both sources of information. By intersecting the UPGMA clustering (rows) with the chromosome assignments (columns), we select the combination with the highest counts. This intersection method ensures that the working sequence leverages the most consistent and prevalent groupings across both linkage and chromosomal information, thereby enhancing the accuracy of the marker sequence.

s <- make_sequence(g, 
                   lg = list(1, 7, 6, 4, 2, 8, 3, 5),
                   ch = list(1, 2, 3, 4, 5, 6, 7, 8))
print(s, type = "mds")
#> ---------- MDS -------------------------------------------------------------------------- 
#>                      p1                     p2                     p1p2                   
#> ----------------------------------------------------------------------------------------- 
#>      Ch  n.mrk  ord  phase  map (cM) haplo  phase  map (cM) haplo  phase  map (cM) haplo  
#> ----------------------------------------------------------------------------------------- 
#> lg1  1   268    N    0/0%   .        N      0/0%   .        N      0/0%   .        N      
#> lg2  2   208    N    0/0%   .        N      0/0%   .        N      0/0%   .        N      
#> lg3  3   219    N    0/0%   .        N      0/0%   .        N      0/0%   .        N      
#> lg4  4   305    N    0/0%   .        N      0/0%   .        N      0/0%   .        N      
#> lg5  5   260    N    0/0%   .        N      0/0%   .        N      0/0%   .        N      
#> lg6  6   100    N    0/0%   .        N      0/0%   .        N      0/0%   .        N      
#> lg7  7   211    N    0/0%   .        N      0/0%   .        N      0/0%   .        N      
#> lg8  8   273    N    0/0%   .        N      0/0%   .        N      0/0%   .        N      
#> -----------------------------------------------------------------------------------------
print(s, type = "genome")
#> ---------- Genome ----------------------------------------------------------------------- 
#>                      p1                     p2                     p1p2                   
#> ----------------------------------------------------------------------------------------- 
#>      Ch  n.mrk  ord  phase  map (cM) haplo  phase  map (cM) haplo  phase  map (cM) haplo  
#> ----------------------------------------------------------------------------------------- 
#> lg1  1   268    N    0/0%   .        N      0/0%   .        N      0/0%   .        N      
#> lg2  2   208    N    0/0%   .        N      0/0%   .        N      0/0%   .        N      
#> lg3  3   219    N    0/0%   .        N      0/0%   .        N      0/0%   .        N      
#> lg4  4   305    N    0/0%   .        N      0/0%   .        N      0/0%   .        N      
#> lg5  5   260    N    0/0%   .        N      0/0%   .        N      0/0%   .        N      
#> lg6  6   100    N    0/0%   .        N      0/0%   .        N      0/0%   .        N      
#> lg7  7   211    N    0/0%   .        N      0/0%   .        N      0/0%   .        N      
#> lg8  8   273    N    0/0%   .        N      0/0%   .        N      0/0%   .        N      
#> -----------------------------------------------------------------------------------------

The output from the make_sequence creates an object of class mappoly2.sequence which is displayed in a tabular form and provides a summary of mapping results for each linkage group (lg) within a genome.

Here’s a breakdown of the information presented in the output:

  • Ch (Chromosome): This indicates the chromosome number or the linkage group identifier.
  • n.mrk (Number of Markers): The number of markers present within each linkage group.
  • ord (Order): A ‘Y’ (Yes) or ‘N’ (No) indicating whether the markers’ order within the linkage group is established.
  • phase (Phase Configuration): : Displays the number of phase configurations determined from the pairwise recombination fractions. The accompanying percentage indicates the proportion of markers with established phases for parent 1 (p1), parent 2 (p2), and both parents (p1p2).
  • map (cM) (Map Length in Centimorgans): Indicates the genetic length of the linkage group. The ‘.’ implies that the mapping step has not been performed yet.
  • haplo (Haplotypes Computed): A ‘Y’ (Yes) or ‘N’ (No) indicating whether haplotype probabilities have been computed for the linkage group.

The initial output from make_seq sets the foundation for the mapping process. The values, especially in the ord, phase, and haplo columns, will be updated as you proceed through the mapping pipeline with subsequent functions.

10 Ordering

After creating the working sequences, the next step is to determine the order of the markers within each group.

MAPpoly2 offers two approaches for ordering markers:

  • MDS-based Ordering ("mds"): This method uses multi-dimensional scaling (MDS) to order markers based on their pairwise recombination fractions. It’s particularly useful when chromosome assignment information is not available, as MDS relies solely on linkage information.

  • Genome-based Ordering ("genome"): When chromosome assignment data is available, genome-based ordering utilizes this additional layer of genomic information to refine the order of markers.

The order_sequence function in MAPpoly2 is designed to handle both of these ordering methods. It takes a sequence object, which contains the groups of markers and their associated data, and arranges the markers within each group according to the selected ordering method. In the near future, we plan to introduce a custom ordering option, which will allow users to specify their own order for the markers.

From here onwards, let’s apply both methods and contruct the maps using the resulting orders.

s <- order_sequence(s, type = "mds")
#>   --> 1
#>   --> 2
#>   --> 3
#>   --> 4
#>   --> 5
#>   --> 6
#>   --> 7
#>   --> 8
print(s, type = "mds")
#> ---------- MDS -------------------------------------------------------------------------- 
#>                      p1                     p2                     p1p2                   
#> ----------------------------------------------------------------------------------------- 
#>      Ch  n.mrk  ord  phase  map (cM) haplo  phase  map (cM) haplo  phase  map (cM) haplo  
#> ----------------------------------------------------------------------------------------- 
#> lg1  1   268    Y    0/0%   .        N      0/0%   .        N      0/0%   .        N      
#> lg2  2   208    Y    0/0%   .        N      0/0%   .        N      0/0%   .        N      
#> lg3  3   219    Y    0/0%   .        N      0/0%   .        N      0/0%   .        N      
#> lg4  4   305    Y    0/0%   .        N      0/0%   .        N      0/0%   .        N      
#> lg5  5   260    Y    0/0%   .        N      0/0%   .        N      0/0%   .        N      
#> lg6  6   100    Y    0/0%   .        N      0/0%   .        N      0/0%   .        N      
#> lg7  7   211    Y    0/0%   .        N      0/0%   .        N      0/0%   .        N      
#> lg8  8   273    Y    0/0%   .        N      0/0%   .        N      0/0%   .        N      
#> -----------------------------------------------------------------------------------------
s <- order_sequence(s, type = "genome")
#>   --> 1
#>   --> 2
#>   --> 3
#>   --> 4
#>   --> 5
#>   --> 6
#>   --> 7
#>   --> 8
print(s, type = "genome")
#> ---------- Genome ----------------------------------------------------------------------- 
#>                      p1                     p2                     p1p2                   
#> ----------------------------------------------------------------------------------------- 
#>      Ch  n.mrk  ord  phase  map (cM) haplo  phase  map (cM) haplo  phase  map (cM) haplo  
#> ----------------------------------------------------------------------------------------- 
#> lg1  1   268    Y    0/0%   .        N      0/0%   .        N      0/0%   .        N      
#> lg2  2   208    Y    0/0%   .        N      0/0%   .        N      0/0%   .        N      
#> lg3  3   219    Y    0/0%   .        N      0/0%   .        N      0/0%   .        N      
#> lg4  4   305    Y    0/0%   .        N      0/0%   .        N      0/0%   .        N      
#> lg5  5   260    Y    0/0%   .        N      0/0%   .        N      0/0%   .        N      
#> lg6  6   100    Y    0/0%   .        N      0/0%   .        N      0/0%   .        N      
#> lg7  7   211    Y    0/0%   .        N      0/0%   .        N      0/0%   .        N      
#> lg8  8   273    Y    0/0%   .        N      0/0%   .        N      0/0%   .        N      
#> -----------------------------------------------------------------------------------------

Now, let us compare the resulting recombination fraction matrices for both methods and

plot_rf_matrix(s, type = "mds", fact = 2)

plot_rf_matrix(s, type = "genome", fact = 2)

plot_mds_vs_genome(s)

In our analysis, both the MDS-based and Genome-based ordering matrices display monotonically increasing patterns. Such monotonicity suggests that, as we move along the axes of the matrices, the genetic distance between markers increases steadily. The similar patterns observed in both matrices imply that the linkage data align well with the physical locations of the markers on the chromosomes. The scatter plots illustrate the alignment of markers between the MDS-based and Genome-based ordering methods. Ideally, we anticipate a 45-degree slope in these plots, which would signify a perfect concordance between the marker orders derived from both approaches. Any significant deviations from this diagonal line may indicate discrepancies, potentially arising from noise within the linkage data or inaccuracies in chromosomal assignments.

The forthcoming capability to include a custom order will complement these methods by accommodating additional data sources or expert insights into the map construction process. This feature will be particularly valuable in instances where the existing data may be incomplete or when we use a reference genome that does not necessarily align with the parents in the population but maintains a high degree of local synteny.

11 Screening based on recombination frequency per group

Now let’s apply the rf_filter function once more, but this time, we will consider only recombination fraction matrices within linkage groups. Usually, this step can be skipped, but we will include it in the tutorial for the sake of completeness.

#### RF-based filter per groups ####
s <- rf_filter(s, type = "mds", probs = c(0.025, 0.975), diag.markers = 50)
s <- rf_filter(s, type = "genome", probs = c(0.025, 0.975), diag.markers = 50)
mappoly2:::plot_rf_matrix(s, type = "genome", fact = 2)

12 Pairwise rf-based phasing

Phasing markers is among the most complex tasks in mapping within a polyploid context. The pairwise_phasing function utilizes pairwise recombination information along with their associated statistics in a branch-and-bound algorithm. This approach seeks to identify a set of linkage phase configurations that will subsequently be evaluated using a multilocus-based likelihood method.

The arguments for the pairwise_phasing function are as follows:

  • thresh.LOD.ph This threshold sets the minimum LOD score for the linkage phase configuration when compared to alternative configurations. A phase configuration is considered reliable if its LOD score is at least thresh.LOD.ph units higher than any other configuration.

  • thresh.LOD.rf This argument specifies the LOD score threshold for considering recombination fractions as linked. Recombination fractions are deemed significant if their LOD scores exceed this threshold, suggesting true linkage rather than random association.

  • thresh.rf This argument sets the recombination fraction threshold. A recombination fraction must be less than or equal to thresh.rf to be considered for phasing, reflecting the maximum expected value for independent assortment.

  • max.search.expansion.p1 and max.search.expansion.p2 These arguments define the maximum number of phase configurations to explore for parents 1 and 2. It limits the search space to a manageable number of possibilities, ensuring the algorithm remains computationally feasible.

s <- pairwise_phasing(s, type = "mds",
                      thresh.LOD.ph = 3, 
                      thresh.LOD.rf = 3, 
                      thresh.rf = 0.5, 
                      max.search.expansion.p1 = 10, 
                      max.search.expansion.p2 = 10)
#>   --> 1 
#> Phasing parent I195 
#> Phasing parent J432 
#> 250 phased markers out of 255: (98%)
#>   --> 2 
#> Phasing parent I195 
#> Phasing parent J432 
#> 196 phased markers out of 197: (99.5%)
#>   --> 3 
#> Phasing parent I195 
#> Phasing parent J432 
#> 197 phased markers out of 207: (95.2%)
#>   --> 4 
#> Phasing parent I195 
#> Phasing parent J432 
#> 284 phased markers out of 292: (97.3%)
#>   --> 5 
#> Phasing parent I195 
#> Phasing parent J432 
#> 247 phased markers out of 252: (98%)
#>   --> 6 
#> Phasing parent I195 
#> Phasing parent J432 
#> 92 phased markers out of 95: (96.8%)
#>   --> 7 
#> Phasing parent I195 
#> Phasing parent J432 
#> 197 phased markers out of 199: (99%)
#>   --> 8 
#> Phasing parent I195 
#> Phasing parent J432 
#> 256 phased markers out of 259: (98.8%)
print(s, type = "mds")
#> ---------- MDS -------------------------------------------------------------------------------- 
#>                      p1                       p2                       p1p2                     
#> ----------------------------------------------------------------------------------------------- 
#>      Ch  n.mrk  ord  phase    map (cM) haplo  phase    map (cM) haplo  phase    map (cM) haplo  
#> ----------------------------------------------------------------------------------------------- 
#> lg1  1   255    Y    1/18%    .        N      1/33.3%  .        N      1/98%    .        N      
#> lg2  2   197    Y    1/23.9%  .        N      1/26.9%  .        N      1/99.5%  .        N      
#> lg3  3   207    Y    1/22.2%  .        N      1/28%    .        N      1/95.2%  .        N      
#> lg4  4   292    Y    1/22.3%  .        N      1/33.9%  .        N      1/97.3%  .        N      
#> lg5  5   252    Y    1/22.2%  .        N      1/26.6%  .        N      1/98%    .        N      
#> lg6  6   95     Y    1/18.9%  .        N      1/37.9%  .        N      1/96.8%  .        N      
#> lg7  7   199    Y    1/13.6%  .        N      1/37.7%  .        N      1/99%    .        N      
#> lg8  8   259    Y    1/19.7%  .        N      1/26.6%  .        N      1/98.8%  .        N      
#> -----------------------------------------------------------------------------------------------
s <- pairwise_phasing(s, 
                      type = "genome",
                      thresh.LOD.ph = 3, 
                      thresh.LOD.rf = 3, 
                      thresh.rf = 0.5, 
                      max.search.expansion.p1 = 10, 
                      max.search.expansion.p2 = 10)
#>   --> 1 
#> Phasing parent I195 
#> Phasing parent J432 
#> 250 phased markers out of 255: (98%)
#>   --> 2 
#> Phasing parent I195 
#> Phasing parent J432 
#> 196 phased markers out of 197: (99.5%)
#>   --> 3 
#> Phasing parent I195 
#> Phasing parent J432 
#> 197 phased markers out of 207: (95.2%)
#>   --> 4 
#> Phasing parent I195 
#> Phasing parent J432 
#> 284 phased markers out of 292: (97.3%)
#>   --> 5 
#> Phasing parent I195 
#> Phasing parent J432 
#> 239 phased markers out of 252: (94.8%)
#>   --> 6 
#> Phasing parent I195 
#> Phasing parent J432 
#> 92 phased markers out of 95: (96.8%)
#>   --> 7 
#> Phasing parent I195 
#> Phasing parent J432 
#> 197 phased markers out of 199: (99%)
#>   --> 8 
#> Phasing parent I195 
#> Phasing parent J432 
#> 256 phased markers out of 259: (98.8%)
print(s, type = "genome")
#> ---------- Genome ----------------------------------------------------------------------------- 
#>                      p1                       p2                       p1p2                     
#> ----------------------------------------------------------------------------------------------- 
#>      Ch  n.mrk  ord  phase    map (cM) haplo  phase    map (cM) haplo  phase    map (cM) haplo  
#> ----------------------------------------------------------------------------------------------- 
#> lg1  1   255    Y    1/18%    .        N      1/33.3%  .        N      1/98%    .        N      
#> lg2  2   197    Y    1/23.9%  .        N      1/26.9%  .        N      1/99.5%  .        N      
#> lg3  3   207    Y    1/22.2%  .        N      1/28%    .        N      1/95.2%  .        N      
#> lg4  4   292    Y    1/22.3%  .        N      1/33.9%  .        N      1/97.3%  .        N      
#> lg5  5   252    Y    1/20.6%  .        N      1/26.6%  .        N      1/94.8%  .        N      
#> lg6  6   95     Y    1/18.9%  .        N      1/37.9%  .        N      1/96.8%  .        N      
#> lg7  7   199    Y    1/13.6%  .        N      1/37.7%  .        N      1/99%    .        N      
#> lg8  8   259    Y    1/19.7%  .        N      1/26.6%  .        N      1/98.8%  .        N      
#> -----------------------------------------------------------------------------------------------

The pairwise_phasing function in MAPpoly2 has successfully completed the phasing of markers for both the Multi-Dimensional Scaling (MDS) and Genome-based ordering methods. The output for each method updates the sequence table, providing the overview of the phasing results for each linkage group (lg) across different parents and their combined configurations (p1, p2, and p1p2).

Here are some insights from the output:

  • Consistency Across Methods: Both the MDS-based and Genome-based approaches yield consistent results, as evidenced by the identical numbers and percentages in the phase columns for each linkage group.

  • Phasing Efficiency: The high percentages in the phase columns, especially for the combined parent configuration (p1p2), indicate a successful and efficient phasing process. The majority of markers have been phased with confidence, laying a strong foundation for the subsequent mapping steps. This fact is mostly due to the high quality of the dataset used in this analysis.

  • Further Actions: Since the mapping (indicated by map (cM)) and haplotype computation are pending (as shown by ‘.’ and ‘N’ respectively), the next steps in the mapping process will involve calculating these elements for each linkage group.

13 Multipoint estimation

Now, let’s reconstruct the genetic map for the linkage phases estimated by pairwise_phasing. In this specific case, we have only one possible linkage phase configuration per linkage group, which means the program does not need to compare multiple linkage phase likelihoods to select the optimal one. However, if multiple potential configurations were present in a different dataset, the program would perform this comparison and retain only the most likely linkage phase.

One might wonder why it’s necessary to re-estimate the map using the multilocus approach. This step is crucial because it utilizes the entire chromosomal structure in the estimation. If there are any misplaced markers or incorrect linkage phase assortments, these will be highlighted in the resulting map as an inflation of the expected map length. Furthermore, the Hidden Markov Model (HMM) approach is essential as it addresses the missing data inherent in polyploid cases, preparing the dataset for subsequent computation of the homolog composition in offspring individuals.

There are two primary strategies for reconstructing the multilocus map:

  • Separate Parental Maps and Integration: This approach involves constructing individual maps for each parent, merging them, and then adding unmapped markers. The procedure of constructing separate maps for each parent, inspired by the methodology of the polymapR package. As an addition, we include multilocus estimations along the way. The strategy begins with the function mapping by selecting markers informative to a specific parent using argumentsparent = "p1" or parent = "p2". Technically, this strategy confines the number of hidden states in the Markov model to the number of gametes a single parent can produce (e.g., 2 in diploids, 6 in tetraploids, 20 in hexaploids), as opposed to the combined gametes of both parents (e.g., 4 in diploids, 36 in tetraploids, 400 in hexaploids). Once individual maps are constructed and evaluated using diagnostic tools, they are integrated into a full HMM model using the merge_single_parent_maps function. To incorporate unmapped markers, posterior information from the pre-computed map is computed with calc_haplotypes, and the unmapped markers are inserted using the augment_phased_map function. This process avoids the computational intensity of constructing the map for each marker insertion, performing it only once after all markers have been appropriately phased.

  • Joint Parental Approach: This more straightforward method constructs the map using both parents simultaneously. It generally doesn’t require merging maps or adding unmapped markers, though it is possible to do so using the augment_phased_map function if needed. For this tutorial, we will employ the first strategy, as it involves more coding steps and thus provides a comprehensive learning experience. However, with the dataset we’re using, one could easily adopt the second strategy by setting parent = "p1p2" and proceeding directly to the Recomputing Haplotype Probabilities section.

13.1 Mapping Parent 1

s <- mapping(s, type = "mds", parent = "p1", ncpus = 8)
#> Multi-locus map estimation
s <- mapping(s, type = "genome", parent = "p1", ncpus = 8)
#> Multi-locus map estimation
print(s, type = "mds")
#> ---------- MDS -------------------------------------------------------------------------------- 
#>                      p1                       p2                       p1p2                     
#> ----------------------------------------------------------------------------------------------- 
#>      Ch  n.mrk  ord  phase    map (cM) haplo  phase    map (cM) haplo  phase    map (cM) haplo  
#> ----------------------------------------------------------------------------------------------- 
#> lg1  1   255    Y    1/18%     83      N      1/33.3%  .        N      1/98%    .        N      
#> lg2  2   197    Y    1/23.9%  118      N      1/26.9%  .        N      1/99.5%  .        N      
#> lg3  3   207    Y    1/22.2%  103      N      1/28%    .        N      1/95.2%  .        N      
#> lg4  4   292    Y    1/22.3%  102      N      1/33.9%  .        N      1/97.3%  .        N      
#> lg5  5   252    Y    1/22.2%   82      N      1/26.6%  .        N      1/98%    .        N      
#> lg6  6   95     Y    1/18.9%   67      N      1/37.9%  .        N      1/96.8%  .        N      
#> lg7  7   199    Y    1/13.6%   76      N      1/37.7%  .        N      1/99%    .        N      
#> lg8  8   259    Y    1/19.7%  101      N      1/26.6%  .        N      1/98.8%  .        N      
#> -----------------------------------------------------------------------------------------------
print(s, type = "genome")
#> ---------- Genome ----------------------------------------------------------------------------- 
#>                      p1                       p2                       p1p2                     
#> ----------------------------------------------------------------------------------------------- 
#>      Ch  n.mrk  ord  phase    map (cM) haplo  phase    map (cM) haplo  phase    map (cM) haplo  
#> ----------------------------------------------------------------------------------------------- 
#> lg1  1   255    Y    1/18%     71      N      1/33.3%  .        N      1/98%    .        N      
#> lg2  2   197    Y    1/23.9%  102      N      1/26.9%  .        N      1/99.5%  .        N      
#> lg3  3   207    Y    1/22.2%   99      N      1/28%    .        N      1/95.2%  .        N      
#> lg4  4   292    Y    1/22.3%   85      N      1/33.9%  .        N      1/97.3%  .        N      
#> lg5  5   252    Y    1/20.6%  118      N      1/26.6%  .        N      1/94.8%  .        N      
#> lg6  6   95     Y    1/18.9%   70      N      1/37.9%  .        N      1/96.8%  .        N      
#> lg7  7   199    Y    1/13.6%   78      N      1/37.7%  .        N      1/99%    .        N      
#> lg8  8   259    Y    1/19.7%  103      N      1/26.6%  .        N      1/98.8%  .        N      
#> -----------------------------------------------------------------------------------------------
plot_map(s, lg = 1, type = "mds", parent = "p1")

plot_map(s, lg = 1, type = "genome", parent = "p1")

13.2 Mapping Parent 2

s <- mapping(s, type = "mds", parent = "p2", ncpus = 8)
#> Multi-locus map estimation
s <- mapping(s, type = "genome", parent = "p2", ncpus = 8)
#> Multi-locus map estimation
print(s, type = "mds")
#> ---------- MDS -------------------------------------------------------------------------------- 
#>                      p1                       p2                       p1p2                     
#> ----------------------------------------------------------------------------------------------- 
#>      Ch  n.mrk  ord  phase    map (cM) haplo  phase    map (cM) haplo  phase    map (cM) haplo  
#> ----------------------------------------------------------------------------------------------- 
#> lg1  1   255    Y    1/18%     83      N      1/33.3%  110      N      1/98%    .        N      
#> lg2  2   197    Y    1/23.9%  118      N      1/26.9%   82      N      1/99.5%  .        N      
#> lg3  3   207    Y    1/22.2%  103      N      1/28%    120      N      1/95.2%  .        N      
#> lg4  4   292    Y    1/22.3%  102      N      1/33.9%  220      N      1/97.3%  .        N      
#> lg5  5   252    Y    1/22.2%   82      N      1/26.6%  114      N      1/98%    .        N      
#> lg6  6   95     Y    1/18.9%   67      N      1/37.9%  121      N      1/96.8%  .        N      
#> lg7  7   199    Y    1/13.6%   76      N      1/37.7%  106      N      1/99%    .        N      
#> lg8  8   259    Y    1/19.7%  101      N      1/26.6%  140      N      1/98.8%  .        N      
#> -----------------------------------------------------------------------------------------------
print(s, type = "genome")
#> ---------- Genome ----------------------------------------------------------------------------- 
#>                      p1                       p2                       p1p2                     
#> ----------------------------------------------------------------------------------------------- 
#>      Ch  n.mrk  ord  phase    map (cM) haplo  phase    map (cM) haplo  phase    map (cM) haplo  
#> ----------------------------------------------------------------------------------------------- 
#> lg1  1   255    Y    1/18%     71      N      1/33.3%   95      N      1/98%    .        N      
#> lg2  2   197    Y    1/23.9%  102      N      1/26.9%   89      N      1/99.5%  .        N      
#> lg3  3   207    Y    1/22.2%   99      N      1/28%    110      N      1/95.2%  .        N      
#> lg4  4   292    Y    1/22.3%   85      N      1/33.9%  193      N      1/97.3%  .        N      
#> lg5  5   252    Y    1/20.6%  118      N      1/26.6%   97      N      1/94.8%  .        N      
#> lg6  6   95     Y    1/18.9%   70      N      1/37.9%  121      N      1/96.8%  .        N      
#> lg7  7   199    Y    1/13.6%   78      N      1/37.7%   98      N      1/99%    .        N      
#> lg8  8   259    Y    1/19.7%  103      N      1/26.6%  149      N      1/98.8%  .        N      
#> -----------------------------------------------------------------------------------------------
plot_map(s, lg = 1, type = "mds", parent = "p2")

plot_map(s, lg = 1, type = "genome", parent = "p2")

13.3 Merging Maps

s <- merge_single_parent_maps(s, type = "genome", ncpus = 8, error = 0.05)
#> Multi-locus map estimation
s <- merge_single_parent_maps(s, type = "mds", ncpus = 8, error = 0.05)
#> Multi-locus map estimation
plot_map_list(s, type = "mds", parent = "p1p2")

plot_map_list(s, type = "genome", parent = "p1p2")

map_summary(s)
#> ---------- MDS --- I195 x J432 ----------------------------------------------------------------------------------- 
#>      LG     Chrom  Map_length_(cM)  Markers/cM  Simplex_P1  Simplex_P2  Double-simplex  Multiplex  Total  Max_gap  
#> ------------------------------------------------------------------------------------------------------------------ 
#>      lg1    1      60.1             2.18        27          59          .               45         131    4.1      
#>      lg2    2      62.9             1.59        33          41          .               26         100    4.6      
#>      lg3    3      67.7             1.536       31          44          .               29         104    6.7      
#>      lg4    4      66.2             2.477       51          76          .               37         164    4.7      
#>      lg5    5      55.6             2.212       45          49          .               29         123    4.3      
#>      lg6    6      74.2             0.728       13          28          .               13         54     15       
#>      lg7    7      74               1.378       22          57          .               23         102    10       
#>      lg8    8      71.6             1.676       40          50          .               30         120    11       
#> ------------------------------------------------------------------------------------------------------------------ 
#>      Total         532.3            2           262         404         .               232        898    15       
#> ------------------------------------------------------------------------------------------------------------------ 
#> 
#> ---------- Genome --- I195 x J432 -------------------------------------------------------------------------------- 
#>      LG     Chrom  Map_length_(cM)  Markers/cM  Simplex_P1  Simplex_P2  Double-simplex  Multiplex  Total  Max_gap  
#> ------------------------------------------------------------------------------------------------------------------ 
#>      lg1    1      56               2.339       27          59          .               45         131    3.5      
#>      lg2    2      63.6             1.572       33          41          .               26         100    5.9      
#>      lg3    3      65.8             1.581       31          44          .               29         104    6.3      
#>      lg4    4      65.9             2.489       51          76          .               37         164    4        
#>      lg5    5      54.6             2.179       42          49          .               28         119    3.4      
#>      lg6    6      68.2             0.792       13          28          .               13         54     13       
#>      lg7    7      61.7             1.653       22          57          .               23         102    6.5      
#>      lg8    8      60.1             1.997       40          50          .               30         120    5.9      
#> ------------------------------------------------------------------------------------------------------------------ 
#>      Total         495.9            2           259         404         .               231        894    13       
#> ------------------------------------------------------------------------------------------------------------------
plot_map(s, lg = 1, type = "mds", parent = "p1p2")

plot_map(s, lg = 1, type = "genome", parent = "p1p2")

s <- calc_haplotypes(s, type = "mds", ncpus = 8)
s <- calc_haplotypes(s, type = "genome", ncpus = 8)

Merging maps resulted in a genetic map with a total length of 532.3 cM (MDS) and 495.9 cM (Genome) across all linkage groups, with a total of 898 (MDS) and 894 (Genome) markers. The average markers per centimorgan were around 2, and the maximum gap ranged from 15 cM (MDS) to 13 cM (Genome), indicating a well-distributed marker set.

13.4 Augmenting Merged Maps

s <- augment_phased_map(s, type = "mds", ncpus = 8)
#> 
#> Reestimating multilocus map ...
#> Multi-locus map estimation
s <- augment_phased_map(s, type = "genome", ncpus = 8)
#> 
#> Reestimating multilocus map ...
#> Multi-locus map estimation
plot_map_list(s, type = "mds", parent = "p1p2")

plot_map_list(s, type = "genome", parent = "p1p2")

map_summary(s)
#> ---------- MDS --- I195 x J432 ----------------------------------------------------------------------------------- 
#>      LG     Chrom  Map_length_(cM)  Markers/cM  Simplex_P1  Simplex_P2  Double-simplex  Multiplex  Total  Max_gap  
#> ------------------------------------------------------------------------------------------------------------------ 
#>      lg1    1      65.1             3.656       27          59          54              98         238    3.5      
#>      lg2    2      68.4             2.822       33          41          51              68         193    5.7      
#>      lg3    3      72               2.736       31          45          66              55         197    6        
#>      lg4    4      71.7             4.003       51          76          47              113        287    4.1      
#>      lg5    5      61.9             3.91        45          49          58              90         242    4        
#>      lg6    6      77               1.104       13          28          18              26         85     16.3     
#>      lg7    7      80               2.462       22          57          52              66         197    10.8     
#>      lg8    8      76.3             3.368       40          50          76              91         257    10.9     
#> ------------------------------------------------------------------------------------------------------------------ 
#>      Total         572.4            3           262         405         422             607        1696   16.3     
#> ------------------------------------------------------------------------------------------------------------------ 
#> 
#> ---------- Genome --- I195 x J432 -------------------------------------------------------------------------------- 
#>      LG     Chrom  Map_length_(cM)  Markers/cM  Simplex_P1  Simplex_P2  Double-simplex  Multiplex  Total  Max_gap  
#> ------------------------------------------------------------------------------------------------------------------ 
#>      lg1    1      64.1             3.744       27          59          56              98         240    2.3      
#>      lg2    2      67.4             2.893       33          41          53              68         195    4.3      
#>      lg3    3      69.1             2.822       31          45          65              54         195    5.6      
#>      lg4    4      69.4             4.15        51          76          48              113        288    2.6      
#>      lg5    5      62.3             3.852       46          49          59              86         240    2.6      
#>      lg6    6      69.6             1.149       13          28          18              21         80     13.6     
#>      lg7    7      66.3             2.911       22          57          50              64         193    6        
#>      lg8    8      63.7             4.003       40          50          75              90         255    3        
#> ------------------------------------------------------------------------------------------------------------------ 
#>      Total         531.9            3           263         405         424             594        1686   13.6     
#> ------------------------------------------------------------------------------------------------------------------
plot_map(s, lg = 1, type = "mds", parent = "p1p2")

plot_map(s, lg = 1, type = "genome", parent = "p1p2")

Post augmentation, however, the map length slightly increased to 572.4 cM (MDS) and 531.9 cM (Genome), and the total number of markers significantly rose to 1696 (MDS) and 1686 (Genome). This increase was primarily due to adding double-simplex and multiplex markers, enhancing the map’s resolution and potential accuracy. Despite the increased number of markers, the overall map structure remained consistent, though the maximum gap size marginally increased, notably in the MDS map, where it reached 16.3 cM.

13.5 Comparing MDS and Genome-based orders

Now, let us compare the MDS and Genome-ordered maps generated in our analysis. For an effective comparison of these orders, it’s crucial to employ a well-defined objective metric. Here, we’ll focus on two primary aspects: the multilocus likelihood and the characteristics of the map, specifically its length, resulting recombination frequency matrices, and the gaps between markers. To proceed with this comparison, we utilize the function compare_order. This function identifies markers common to both maps and recalculates the Hidden Markov Model (HMM)-based likelihood for each linkage group. By assessing both the (log)likelihood values and the structural aspects of the maps, we can determine which ordering method is more consistent and informative for our dataset.

maps.comp <- compare_order(s)
#>    Conf. 1 :   
#> Done with map estimation
#>    Conf. 1 :   
#> Done with map estimation
#>    Conf. 1 :   
#> Done with map estimation
#>    Conf. 1 :   
#> Done with map estimation
#>    Conf. 1 :   
#> Done with map estimation
#>    Conf. 1 :   
#> Done with map estimation
#>    Conf. 1 :   
#> Done with map estimation
#>    Conf. 1 :   
#> Done with map estimation
#>    Conf. 1 :   
#> Done with map estimation
#>    Conf. 1 :   
#> Done with map estimation
#>    Conf. 1 :   
#> Done with map estimation
#>    Conf. 1 :   
#> Done with map estimation
#>    Conf. 1 :   
#> Done with map estimation
#>    Conf. 1 :   
#> Done with map estimation
#>    Conf. 1 :   
#> Done with map estimation
#>    Conf. 1 :   
#> Done with map estimation
maps.comp
#>              LG          n.mrk       map length  max_gap     ave_dist    loglike                 
#>      ------------------------------------------------------------------------------------------- 
#>      mds     lg1         235         65.19       3.46        0.28        -7012.14                
#>      genome  lg1         235         64.41       2.33        0.28        -3678.8     *           
#>      mds     lg2         193         67.79       5.68        0.35        -5788.22                
#>      genome  lg2         193         67.83       4.28        0.35        -4547.39    *           
#>      mds     lg3         195         71.83       6           0.37        -5769.63                
#>      genome  lg3         195         69.08       5.55        0.36        -3836.94    *           
#>      mds     lg4         286         71.97       4.06        0.25        -9119.55                
#>      genome  lg4         286         69.83       2.63        0.25        -5591.6     *           
#>      mds     lg5         235         62.01       3.98        0.26        -6322.78                
#>      genome  lg5         235         62.07       2.6         0.27        -4019.52    *           
#>      mds     lg6         80          76.6        16.32       0.97        -3042.48                
#>      genome  lg6         80          69.95       13.64       0.89        -2493.8     *           
#>      mds     lg7         193         78.7        10.82       0.41        -5237.83                
#>      genome  lg7         193         66.21       5.99        0.34        -4448.84    *           
#>      mds     lg8         255         76.58       10.94       0.3         -9588.02                
#>      genome  lg8         255         64.04       2.99        0.25        -7041.11    *           
#>      -------------------------------------------------------------------------------------------
plot(maps.comp)

After evaluating the metrics mentioned earlier, we conclude that the Genome order provided superior results for our dataset. However, it’s noteworthy that the MDS order also presents a strong case. Based on the map’s characteristics, such as its length, marker distribution, and gaps, the MDS order demonstrates a commendable performance and could be effectively utilized in subsequent analyses. This observation highlights the viability of the MDS order in scenarios where genome order data might not be readily available or when a more linkage-focused approach is preferred.

14 Computing Haplotype Probabilities

Now, let us compute the haplotype probabilities of the offsprings of the cross between parents I195 and J432 using function calc_haplotypes:

s <- calc_haplotypes(s, type = "mds", ncpus = 8)
s <- calc_haplotypes(s, type = "genome", ncpus = 8)
plot_haplotypes(s, lg = 1, ind = "F1.85.31")

I195xJ432_map <- s
save(I195xJ432_map, file = "I195_x_J432_map.rda")

The plot depicts the haplotype probabilities for individual F1.85.31 across linkage group 1. The graphical representation shows the probabilities of inheriting specific homologs from each parent, with each horizontal bar corresponding to one of the four homologs (h1 to h4). In the case of parent I195, the probabilities for each of the four homologs are clearly differentiated, with a crossover event indicated around the 30 cM position between homologs h1 and h2. For parent J432, depicted in shades of red, crossover events are suggested around the 45 and 50 cM positions, likely involving homologs h1 and h3, and h2 and h4.

These results are extremely important for analyzing the genetic structure of offspring and identifying areas of significant linkage or recombination. They are fundamental to further analysis such as QTL mapping or marker-assisted selection, as they provide critical insights into the genetic makeup of the population under study

15 Final remarks about F1 map

This concludes the construction of the map construction for the I195 x J432 cross. Here are the final results:

map_summary(I195xJ432_map, type = "genome")
#> ---------- Genome --- I195 x J432 -------------------------------------------------------------------------------- 
#>      LG     Chrom  Map_length_(cM)  Markers/cM  Simplex_P1  Simplex_P2  Double-simplex  Multiplex  Total  Max_gap  
#> ------------------------------------------------------------------------------------------------------------------ 
#>      lg1    1      64.1             3.744       27          59          56              98         240    2.3      
#>      lg2    2      67.4             2.893       33          41          53              68         195    4.3      
#>      lg3    3      69.1             2.822       31          45          65              54         195    5.6      
#>      lg4    4      69.4             4.15        51          76          48              113        288    2.6      
#>      lg5    5      62.3             3.852       46          49          59              86         240    2.6      
#>      lg6    6      69.6             1.149       13          28          18              21         80     13.6     
#>      lg7    7      66.3             2.911       22          57          50              64         193    6        
#>      lg8    8      63.7             4.003       40          50          75              90         255    3        
#> ------------------------------------------------------------------------------------------------------------------ 
#>      Total         531.9            3           263         405         424             594        1686   13.6     
#> ------------------------------------------------------------------------------------------------------------------
plot_map_list(I195xJ432_map, type = "genome", col = mp_pal(8))

16 Loading BC map

In this section, we will proceed to load a pre-constructed BC map. Although it’s possible to generate the BC map using a process similar to the one we employed for the F1 map, please note that the BC population size is smaller. Therefore, it’s advisable to experiment with various threshold values throughout this tutorial and fine-tune them for different linkage groups as needed to achieve optimal results.

Here is the script to load the BC map:

download.file("https://github.com/mmollina/mappoly2_vignettes/raw/main/I195_x_F1-85-209_map.rda", 
              destfile = "temp_file.rda")
load("temp_file.rda")
I195xF1_85_209_map <- mapping(I195xF1_85_209_map, type = "genome", error = 0.05, ncpus = 8)
#> Multi-locus map estimation
print(I195xF1_85_209_map, type = "genome")
#> ---------- Genome -------------------------------------------------------------------------------- 
#>                      p1                        p2                        p1p2                      
#> -------------------------------------------------------------------------------------------------- 
#>      Ch  n.mrk  ord  phase     map (cM) haplo  phase     map (cM) haplo  phase     map (cM) haplo  
#> -------------------------------------------------------------------------------------------------- 
#> lg1  1   192    Y    1/5.7%    .        N      1/16.1%   .        N      1/92.7%   220      Y      
#> lg2  2   170    Y    12/4.7%   .        N      12/12.4%  .        N      12/78.2%  196      Y      
#> lg3  3   160    Y    4/8.1%    .        N      4/16.2%   .        N      4/86.2%   165      Y      
#> lg4  4   250    Y    12/6%     .        N      12/15.2%  .        N      12/91.2%  194      Y      
#> lg5  5   224    Y    8/11.2%   .        N      8/12.5%   .        N      8/86.6%   175      Y      
#> lg6  6   59     Y    8/8.5%    .        N      8/30.5%   .        N      8/61%      70      Y      
#> lg7  7   207    Y    24/14.5%  .        N      24/21.7%  .        N      24/79.7%  190      Y      
#> lg8  8   207    Y    48/15%    .        N      48/11.6%  .        N      48/81.2%  314      Y      
#> --------------------------------------------------------------------------------------------------
map_summary(I195xF1_85_209_map, type = "genome")
#> ---------- Genome --- I195 x F1.85.209 --------------------------------------------------------------------------- 
#>      LG     Chrom  Map_length_(cM)  Markers/cM  Simplex_P1  Simplex_P2  Double-simplex  Multiplex  Total  Max_gap  
#> ------------------------------------------------------------------------------------------------------------------ 
#>      lg1    1      219.7            0.824       9           28          51              93         181    33.3     
#>      lg2    2      195.3            0.671       8           19          47              57         131    33.6     
#>      lg3    3      164.9            0.837       12          25          59              42         138    14.1     
#>      lg4    4      193.1            1.201       11          39          88              94         232    10.4     
#>      lg5    5      175.1            1.102       21          28          81              63         193    38.5     
#>      lg6    6      70.6             0.467       5           17          6               5          33     30.3     
#>      lg7    7      189              0.884       14          46          44              63         167    18.6     
#>      lg8    8      314              0.525       16          23          46              80         165    24.1     
#> ------------------------------------------------------------------------------------------------------------------ 
#>      Total         1521.7           1           96          225         422             497        1240   38.5     
#> ------------------------------------------------------------------------------------------------------------------
plot_map_list(I195xF1_85_209_map, type = "genome", col = mp_pal(8))

17 Preparing maps to integrate

Now that we have both maps at hand, our next step is to construct the consensus map. To ensure a consistent homolog correspondence for shared parents across populations, we need to prepare the data. This involves assembling a list containing the maps of the different populations. Once that’s done, we can visually inspect the maps together using the plot_multi_map function, and then employ the prepare_to_integrate function to assess and establish homolog correspondence. Here’s how we go about it:

MAPs <- list(I195xJ432 = I195xJ432_map, 
             I195xF1_85_209 = I195xF1_85_209_map)
plot_multi_map(MAPs)

prep.maps <- prepare_to_integrate(MAPs)
plot(prep.maps)

plot_shared_markers(prep.maps)

Examining the dendrogram, we notice a pronounced collinearity among certain homologs from parent I195 across different populations. For instance, homolog h1 in population 1 (I195 x J432) corresponds to homolog h3 in population 2 (I195 x F1.85.209), while h2 corresponds to h1, h3 to h2, and h4 remains consistent across both populations. The Venn diagram corroborates this genetic linkage, showcasing that 929 markers are common between the two populations.

18 Estimating consensus map

The next step is to estimate the consensus map considering all populations. The process of estimating a multi-population genetic map using MAPpoly2 involves several steps:

  • Marker Assessment: Initially, individual markers are assessed within each population (Pop 1, Pop 2, Pop 3, etc.). This assessment includes determining the marker’s quality, its segregation pattern, and allelic dosages.

  • Single Map Construction: For each population, a single genetic map is constructed based on the markers. We did that in the previous sections

  • Markovian Connection: Markers from different populations are connected using the Markovian properties of HMM. This means that the state of a marker (i.e., its genotype on the genetic map) in one population can influence the state of the same marker in another population, accounting for crossover and recombination events.

  • Posterior Computation for Missing Data: HMMs can handle missing data by computing the posterior probabilities of missing marker states based on the observed data. This step fills in gaps and ensures a more complete and accurate representation of the genetic linkage.

  • Consensus Map Reconstruction: Finally, the individual maps from each population are integrated to reconstruct the final consensus map. This map combines information from all populations and provides a comprehensive overview of the genetic architecture across the different populations.

The animation might illustrate these steps dynamically, showing how individual markers (represented by colored dots) are first organized within their respective populations, then connected across populations, with the missing data being interpolated, and ultimately combined into a single, coherent genetic map that represents the consensus of all the populations analyzed.

In MAPpoly, the function estimate_consensus_map streamlines the last three steps of the HMM-based map estimation process. This function incorporates the ability to account for genotype errors, which permits the posterior computations to consider states that would otherwise be deemed impossible. This feature enables the HMM to correct potential genotyping errors by allowing transitions to states that could not be reached without the consideration of such errors.

consensus.map <- estimate_consensus_map(prep.maps, ncpus = 8, err = 0.05)
consensus.map
#> Ploidy of founders:              4 4 4 
#> Total No. individuals:           259 
#> Total No. markers                1997 
#> Haplotype probability computed:  No 
#> 
#> Number of individuals per crosse:
#>       J432  F1.85.209  
#> ---------------------- 
#> I195  165   94         
#> ---------------------- 
#> 
#> Consensus Map:
#> ---------------
#>      LG   Map_length_.cM.  Markers.cM  Total.mrk  Max_gap  
#> ---------------------------------------------------------- 
#>      lg1  99.86            2.764       276        8.14     
#>      lg2  96.84            2.396       232        6.11     
#>      lg3  91.25            2.532       231        4.09     
#>      lg4  103.34           3.309       342        3.92     
#>      lg5  93.22            3.122       291        2.77     
#>      lg6  66.7             1.304       87         6.12     
#>      lg7  88.94            2.743       244        2.79     
#>      lg8  99.06            2.968       294        3.64     
#> ----------------------------------------------------------
plot(consensus.map)

plot(consensus.map, only.consensus = TRUE, col = mp_pal(8))

The consensus map encompasses a total of 259 individuals and 1,997 markers. This represents a significant increase in marker density compared to the 1,686 markers for the F1 map and the 1,240 markers for the BC map. The map spans genetic distances ranging from 103.34 cM to 66.7 cM across various linkage groups. This range can be attributed to the contribution of the more informative cross (I195 x J432) , which compensates for the less informative population (I195 x F1.85.209), thereby enhancing the overall robustness and resolution of the consensus map. This interplay between the populations ensures that the less represented genomic regions in the BC population are adequately covered, thanks to the richer genetic information provided by the F1 population.

19 Computing consensus haplotype probability

Just as we did with the F1 map construction, let’s compute the haplotype probabilities using the consensus map by employing the function calc_consensus_haplo:

consensus.map <- calc_consensus_haplo(consensus.map, ncpus = 8)
plot_consensus_haplo(consensus.map, lg = 1, ind = "F1.85.70")

plot_consensus_haplo(consensus.map, lg = 1, ind = "AphBC.21")

The results are quite similar, but we now have different colors depending on the parent that transmitted the homolog. Currently, this color representation is limited to 10 different founders. After this step, one can save the consensus map to be used in QTL analysis with QTLpoly.