Week 3

QTL Mapping

A goal for plant breeders is to identify specific regions of the genome influencing the traits that they measure or observe for the populations that they grow. Quantitative Trait Loci (QTL) mapping is an appraoch to accomplish this goals. QTL are intervals across the genome which are linked to DNA markers, typically single nucleotide polymorphisms (SNPs) for our G2F dataset. The statistical procedures to conduct QTL mapping are beyond the scope of this lesson, but there are several open-source software that handle this process. We will be using the qtl2 package to conduct the QTL analysis.

Cross Types

Plant breeders manipulate the genetic makeup of plants by making controlled crosses to achieve specific combinations of genotypes for their breeding purpose. Some motives for making controlled crosses include: creating genetic variation (increases genetic diversity), creating hybrids (take advantage of heterosis), introgressing genes (transfer favorable genes from donor parent).

Some common crosses and their uses are listed below:

kable(cross_types, align = "l", caption = "Plant Breeding Cross Types")

Plant Breeding Cross Types
Cross.Type	Motive	Description	Genetic.Uses
Selfing	Homozygosity, trait fixation	Within same plant	Pure lines, genetic mapping
Backcross	Gene introgression	F1 x recurrent parent	Gene transfer, pyramiding
Single Cross	Combine parental traits	Inbred line x inbred line	Hybrid vigor, combining ability
Double Cross	Expand genetic base	Single cross x single cross	Hybrid production
Three-way Cross	Add third parent’s traits	Single cross x inbred line	Hybrid production
Poly Cross	Population improvement	Random crosses among parents	Combining ability, perennial breeding
Top/Test Cross	Parental evaluation	Inbred x tester	Combining ability assessment
Interspecific/Intergeneric	Gene transfer	Across species/genera	Novel traits, genetic studies
Bulk Cross	Diversity + selection	Bulk crossing multiple parents	Population genetics, adaptation

Doubled Haploid Cross Type

The 2020-2021 G2F dataset uses a 6-way doubled haploid cross. This cross starts with 6 parents (A, B, C, D, E, F) which are initially crossed to form 2-way crosses (AxB, CxD, ExF). An intermediate cross follows to form 4-way crosses ((AxB)x(CxD),(ExF)). The final cross is the 6-way cross (((AxB)x(CxD))x(ExF)) to form the doubled haploid F1 generation.

QTL Scan

We will now prepare the grain yield BLUP dataset to format them as inputs for qtl2 genome scan. The phenotype files are separated according to the Tester_Environment combination.

#load libraries
pacman::p_load(readr, nlme, sjstats, lme4, RColorBrewer, dplyr, ggplot2, tidyverse, ggrepel, caret, MuMIn)

git_path <- "https://raw.githubusercontent.com/fatmaoz25/ReactionNormGenomicPrediction/refs/heads/main"
df_file <- "Yield.wide.na.csv"
df_path <- file.path(git_path, df_file)
df <- read.csv(df_path)

# Make sure string columns are not factors
df$Inbred <- as.character(df$Inbred)

# Get all Tester_Env column names (excluding 'Inbred')
tester_env_cols <- setdiff(names(df), "Inbred")

for (col in tester_env_cols) {
  output <- df[, c("Inbred", col)]
  colnames(output) <- c("Inbred", "Value")
  # make a valid R object name (replace . with _)
  obj_name <- paste0("df_", gsub("\\.", "_", col))
  assign(obj_name, output)
}

NOTE: The code only depicts the QTL scan process for one Tester_Env combination (PHZ51_TXH1.2020). This would ideally be run for each Tester_Env separately by changing the phenotype data file in the JSON file. You may do this by running a forloop that goes through each Tester_Env and saves all of the QTL results into one dataframe at the end.

library(qtl2)

## Warning: package 'qtl2' was built under R version 4.5.3

library(ggplot2)
library(dplyr)
library(tidyr)
library(stringr)

write.csv(df_PHZ51_TXH1_2020, "C:\\Users\\fatma\\Documents\\SURGe\\QTL_Analysis\\PHZ51_TXH1_2020.csv", row.names=FALSE)

cross <- read_cross2("C:\\Users\\fatma\\Documents\\SURGe\\QTL_Analysis\\cm.mbp.ss.100k.sites.json")  # Replace with your actual file name

# STEP 2: Calculate genotype probabilities
pr <- calc_genoprob(cross, error_prob=0.002, quiet=FALSE)

## Chr 1

## Chr 2

## Chr 3

## Chr 4

## Chr 5

## Chr 6

## Chr 7

## Chr 8

## Chr 9

## Chr 10

# STEP 3: Compute kinship matrix (optional but recommended to account for family structure)
kinship <- calc_kinship(pr, "loco", quiet = FALSE)

##  - converting to allele probs

##  - Chr 1

##  - Chr 2

##  - Chr 3

##  - Chr 4

##  - Chr 5

##  - Chr 6

##  - Chr 7

##  - Chr 8

##  - Chr 9

##  - Chr 10

##  - Chr 1

##  - Chr 2

##  - Chr 3

##  - Chr 4

##  - Chr 5

##  - Chr 6

##  - Chr 7

##  - Chr 8

##  - Chr 9

##  - Chr 10

# STEP 4: Run the genome scan
scan1_output <- scan1(pr, cross$pheno, kinship = kinship)

# STEP 4a: Establish significance threshold with permutation test
perm_result <- scan1perm(genoprobs = pr, pheno = cross$pheno, n_perm = 1000, cores = 1) # cores = 0 arg uses maximum available cores

# STEP 4b: View the thresholds (this will be used in find_peaks below)
summary(perm_result)

## LOD thresholds (1000 permutations)
##      Value
## 0.05  6.24

# STEP 5: Find peaks above the LOD threshold
peaks <- find_peaks(scan1_output, cross$gmap, threshold = summary(perm_result), drop = 5) # drop = 5 was copied from the Michel 2022 methods
print(peaks)

##   lodindex lodcolumn chr      pos      lod     ci_lo    ci_hi
## 1        1     Value   4 8.174928 6.323298 0.7066019 170.8496

head(scan1_output)

##               Value
## S1_39898  0.9202589
## S1_195098 0.9008219
## S1_206785 0.9011153
## S1_328512 1.0748975
## S1_557669 1.0784846
## S1_682496 0.8803549

print(peaks) gives us the results for any significant regions found in the QTL scan. scan1_output contains the full genome scan results.

# Prepare phenotype info
pheno <- colnames(cross$pheno)

# Add SNP info (marker name, chromosome, and position)
scan_df <- as.data.frame(scan1_output)
scan_df$marker <- rownames(scan1_output)

# Split marker name (e.g., "S1_39898") into chr and pos
scan_df <- scan_df %>%
  separate(marker, into = c("chr", "pos"), sep = "_") %>%
  mutate(chr = str_remove(chr, "S"), pos = as.numeric(pos))

# Convert wide to long format
scan_long <- scan_df %>%
  pivot_longer(cols = all_of(pheno), names_to = "phenotype", values_to = "LOD")

head(scan_long)

## # A tibble: 6 × 4
##   chr      pos phenotype   LOD
##   <chr>  <dbl> <chr>     <dbl>
## 1 1      39898 Value     0.920
## 2 1     195098 Value     0.901
## 3 1     206785 Value     0.901
## 4 1     328512 Value     1.07 
## 5 1     557669 Value     1.08 
## 6 1     682496 Value     0.880

offsets <- scan_long %>%
  group_by(chr) %>%
  summarise(chr_len = max(pos, na.rm = TRUE)) %>%
  arrange(as.numeric(chr)) %>%
  mutate(offset_val = lag(cumsum(chr_len), default = 0))

scan_long <- scan_long %>%
  left_join(offsets, by = "chr")
str(scan_long)

## tibble [100,000 × 6] (S3: tbl_df/tbl/data.frame)
##  $ chr       : chr [1:100000] "1" "1" "1" "1" ...
##  $ pos       : num [1:100000] 39898 195098 206785 328512 557669 ...
##  $ phenotype : chr [1:100000] "Value" "Value" "Value" "Value" ...
##  $ LOD       : num [1:100000] 0.92 0.901 0.901 1.075 1.078 ...
##  $ chr_len   : num [1:100000] 3.08e+08 3.08e+08 3.08e+08 3.08e+08 3.08e+08 ...
##  $ offset_val: num [1:100000] 0 0 0 0 0 0 0 0 0 0 ...

scan_long <- scan_long %>%
  mutate(
    genome_pos = as.numeric(pos) + as.numeric(offset_val)
  )

# *** CREATE chrom_boundaries ***
chrom_boundaries <- offsets$offset_val[-1]

# Set LOD threshold
perm_sum <- summary(perm_result)
# For example, extract the 0.05 threshold:
thresh <- perm_sum["0.05", "Value"]

Manhattan Plots

Manhattan plots are a way to visualize the results of the genome scan. The x-axis is the genomic position along a set of chromosomes, while the y-axis is the logarithm-of-odds (LOD) value. The LOD value is a test statistic that quantifies whether a SNP marker and trait locus are linked and inherited together within pedigrees/families. It tracks whether or not certain SNPs co-segregate with a phenotype across generations. High LOD values indicate strong evidence of linkage between the trait loci and SNP marker, but it does not guarantee that a gene is causal for a specific phenotype.

# plot
ggplot(scan_long, aes(x = genome_pos, y = LOD, color = as.factor(chr))) +
  geom_point(size = 0.75) +
  geom_vline(xintercept = chrom_boundaries, linetype = "dashed", color = "grey70") +
  geom_hline(yintercept = thresh, color = "red", linetype = "dashed", size = 1) +
  labs(title = "PHZ51 TXH1.2020 QTL Results",
       x = "Genomic Position",
       y = "LOD Score",
       color = "Chromosome") +
  theme_bw() +
  theme(legend.position = "none")

## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once per session.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.