QTL Mapping

A goal for plant breeders is to identify specific regions of the genome influencing the traits that they measure or observe for the populations that they grow. Quantitative Trait Loci (QTL) mapping is an appraoch to accomplish this goals. QTL are intervals across the genome which are linked to DNA markers, typically single nucleotide polymorphisms (SNPs) for our G2F dataset. The statistical procedures to conduct QTL mapping are beyond the scope of this lesson, but there are several open-source software that handle this process. We will be using the qtl2 package to conduct the QTL analysis.

Cross Types

Plant breeders manipulate the genetic makeup of plants by making controlled crosses to achieve specific combinations of genotypes for their breeding purpose. Some motives for making controlled crosses include: creating genetic variation (increases genetic diversity), creating hybrids (take advantage of heterosis), introgressing genes (transfer favorable genes from donor parent).

Some common crosses and their uses are listed below:

kable(cross_types, align = "l", caption = "Plant Breeding Cross Types")
Plant Breeding Cross Types
Cross.Type Motive Description Genetic.Uses
Selfing Homozygosity, trait fixation Within same plant Pure lines, genetic mapping
Backcross Gene introgression F1 x recurrent parent Gene transfer, pyramiding
Single Cross Combine parental traits Inbred line x inbred line Hybrid vigor, combining ability
Double Cross Expand genetic base Single cross x single cross Hybrid production
Three-way Cross Add third parent’s traits Single cross x inbred line Hybrid production
Poly Cross Population improvement Random crosses among parents Combining ability, perennial breeding
Top/Test Cross Parental evaluation Inbred x tester Combining ability assessment
Interspecific/Intergeneric Gene transfer Across species/genera Novel traits, genetic studies
Bulk Cross Diversity + selection Bulk crossing multiple parents Population genetics, adaptation

Doubled Haploid Cross Type

The 2020-2021 G2F dataset uses a 6-way doubled haploid cross. This cross starts with 6 parents (A, B, C, D, E, F) which are initially crossed to form 2-way crosses (AxB, CxD, ExF). An intermediate cross follows to form 4-way crosses ((AxB)x(CxD),(ExF)). The final cross is the 6-way cross (((AxB)x(CxD))x(ExF)) to form the doubled haploid F1 generation.

QTL Scan

We will now prepare the grain yield BLUP dataset to format them as inputs for qtl2 genome scan. The phenotype files are separated according to the Tester_Environment combination.

#load libraries
pacman::p_load(readr, nlme, sjstats, lme4, RColorBrewer, dplyr, ggplot2, tidyverse, ggrepel, caret, MuMIn)

git_path <- "https://raw.githubusercontent.com/fatmaoz25/ReactionNormGenomicPrediction/refs/heads/main"
df_file <- "Yield.wide.na.csv"
df_path <- file.path(git_path, df_file)
df <- read.csv(df_path)

# Make sure string columns are not factors
df$Inbred <- as.character(df$Inbred)

# Get all Tester_Env column names (excluding 'Inbred')
tester_env_cols <- setdiff(names(df), "Inbred")

for (col in tester_env_cols) {
  output <- df[, c("Inbred", col)]
  colnames(output) <- c("Inbred", "Value")
  # make a valid R object name (replace . with _)
  obj_name <- paste0("df_", gsub("\\.", "_", col))
  assign(obj_name, output)
}

NOTE: The code only depicts the QTL scan process for one Tester_Env combination (PHZ51_TXH1.2020). This would ideally be run for each Tester_Env separately by changing the phenotype data file in the JSON file. You may do this by running a forloop that goes through each Tester_Env and saves all of the QTL results into one dataframe at the end.

library(qtl2)
## Warning: package 'qtl2' was built under R version 4.5.3
library(ggplot2)
library(dplyr)
library(tidyr)
library(stringr)

write.csv(df_PHZ51_TXH1_2020, "C:\\Users\\fatma\\Documents\\SURGe\\QTL_Analysis\\PHZ51_TXH1_2020.csv", row.names=FALSE)

cross <- read_cross2("C:\\Users\\fatma\\Documents\\SURGe\\QTL_Analysis\\cm.mbp.ss.100k.sites.json")  # Replace with your actual file name

# STEP 2: Calculate genotype probabilities
pr <- calc_genoprob(cross, error_prob=0.002, quiet=FALSE)
## Chr 1
## Chr 2
## Chr 3
## Chr 4
## Chr 5
## Chr 6
## Chr 7
## Chr 8
## Chr 9
## Chr 10
# STEP 3: Compute kinship matrix (optional but recommended to account for family structure)
kinship <- calc_kinship(pr, "loco", quiet = FALSE)
##  - converting to allele probs
##  - Chr 1
##  - Chr 2
##  - Chr 3
##  - Chr 4
##  - Chr 5
##  - Chr 6
##  - Chr 7
##  - Chr 8
##  - Chr 9
##  - Chr 10
##  - Chr 1
##  - Chr 2
##  - Chr 3
##  - Chr 4
##  - Chr 5
##  - Chr 6
##  - Chr 7
##  - Chr 8
##  - Chr 9
##  - Chr 10
# STEP 4: Run the genome scan
scan1_output <- scan1(pr, cross$pheno, kinship = kinship)

# STEP 4a: Establish significance threshold with permutation test
perm_result <- scan1perm(genoprobs = pr, pheno = cross$pheno, n_perm = 1000, cores = 1) # cores = 0 arg uses maximum available cores

# STEP 4b: View the thresholds (this will be used in find_peaks below)
summary(perm_result)
## LOD thresholds (1000 permutations)
##      Value
## 0.05  6.24
# STEP 5: Find peaks above the LOD threshold
peaks <- find_peaks(scan1_output, cross$gmap, threshold = summary(perm_result), drop = 5) # drop = 5 was copied from the Michel 2022 methods
print(peaks)
##   lodindex lodcolumn chr      pos      lod     ci_lo    ci_hi
## 1        1     Value   4 8.174928 6.323298 0.7066019 170.8496
head(scan1_output)
##               Value
## S1_39898  0.9202589
## S1_195098 0.9008219
## S1_206785 0.9011153
## S1_328512 1.0748975
## S1_557669 1.0784846
## S1_682496 0.8803549

print(peaks) gives us the results for any significant regions found in the QTL scan. scan1_output contains the full genome scan results.

# Prepare phenotype info
pheno <- colnames(cross$pheno)

# Add SNP info (marker name, chromosome, and position)
scan_df <- as.data.frame(scan1_output)
scan_df$marker <- rownames(scan1_output)

# Split marker name (e.g., "S1_39898") into chr and pos
scan_df <- scan_df %>%
  separate(marker, into = c("chr", "pos"), sep = "_") %>%
  mutate(chr = str_remove(chr, "S"), pos = as.numeric(pos))

# Convert wide to long format
scan_long <- scan_df %>%
  pivot_longer(cols = all_of(pheno), names_to = "phenotype", values_to = "LOD")

head(scan_long)
## # A tibble: 6 × 4
##   chr      pos phenotype   LOD
##   <chr>  <dbl> <chr>     <dbl>
## 1 1      39898 Value     0.920
## 2 1     195098 Value     0.901
## 3 1     206785 Value     0.901
## 4 1     328512 Value     1.07 
## 5 1     557669 Value     1.08 
## 6 1     682496 Value     0.880
offsets <- scan_long %>%
  group_by(chr) %>%
  summarise(chr_len = max(pos, na.rm = TRUE)) %>%
  arrange(as.numeric(chr)) %>%
  mutate(offset_val = lag(cumsum(chr_len), default = 0))

scan_long <- scan_long %>%
  left_join(offsets, by = "chr")
str(scan_long)
## tibble [100,000 × 6] (S3: tbl_df/tbl/data.frame)
##  $ chr       : chr [1:100000] "1" "1" "1" "1" ...
##  $ pos       : num [1:100000] 39898 195098 206785 328512 557669 ...
##  $ phenotype : chr [1:100000] "Value" "Value" "Value" "Value" ...
##  $ LOD       : num [1:100000] 0.92 0.901 0.901 1.075 1.078 ...
##  $ chr_len   : num [1:100000] 3.08e+08 3.08e+08 3.08e+08 3.08e+08 3.08e+08 ...
##  $ offset_val: num [1:100000] 0 0 0 0 0 0 0 0 0 0 ...
scan_long <- scan_long %>%
  mutate(
    genome_pos = as.numeric(pos) + as.numeric(offset_val)
  )

# *** CREATE chrom_boundaries ***
chrom_boundaries <- offsets$offset_val[-1]

# Set LOD threshold
perm_sum <- summary(perm_result)
# For example, extract the 0.05 threshold:
thresh <- perm_sum["0.05", "Value"]

Manhattan Plots

Manhattan plots are a way to visualize the results of the genome scan. The x-axis is the genomic position along a set of chromosomes, while the y-axis is the logarithm-of-odds (LOD) value. The LOD value is a test statistic that quantifies whether a SNP marker and trait locus are linked and inherited together within pedigrees/families. It tracks whether or not certain SNPs co-segregate with a phenotype across generations. High LOD values indicate strong evidence of linkage between the trait loci and SNP marker, but it does not guarantee that a gene is causal for a specific phenotype.

# plot
ggplot(scan_long, aes(x = genome_pos, y = LOD, color = as.factor(chr))) +
  geom_point(size = 0.75) +
  geom_vline(xintercept = chrom_boundaries, linetype = "dashed", color = "grey70") +
  geom_hline(yintercept = thresh, color = "red", linetype = "dashed", size = 1) +
  labs(title = "PHZ51 TXH1.2020 QTL Results",
       x = "Genomic Position",
       y = "LOD Score",
       color = "Chromosome") +
  theme_bw() +
  theme(legend.position = "none")
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once per session.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.