A goal for plant breeders is to identify specific regions of the
genome influencing the traits that they measure or observe for the
populations that they grow. Quantitative Trait Loci (QTL) mapping is an
appraoch to accomplish this goals. QTL are intervals across the genome
which are linked to DNA markers, typically single nucleotide
polymorphisms (SNPs) for our G2F dataset. The statistical procedures to
conduct QTL mapping are beyond the scope of this lesson, but there are
several open-source software that handle this process. We will be using
the qtl2 package to conduct the QTL analysis.
Plant breeders manipulate the genetic makeup of plants by making controlled crosses to achieve specific combinations of genotypes for their breeding purpose. Some motives for making controlled crosses include: creating genetic variation (increases genetic diversity), creating hybrids (take advantage of heterosis), introgressing genes (transfer favorable genes from donor parent).
Some common crosses and their uses are listed below:
kable(cross_types, align = "l", caption = "Plant Breeding Cross Types")
| Cross.Type | Motive | Description | Genetic.Uses |
|---|---|---|---|
| Selfing | Homozygosity, trait fixation | Within same plant | Pure lines, genetic mapping |
| Backcross | Gene introgression | F1 x recurrent parent | Gene transfer, pyramiding |
| Single Cross | Combine parental traits | Inbred line x inbred line | Hybrid vigor, combining ability |
| Double Cross | Expand genetic base | Single cross x single cross | Hybrid production |
| Three-way Cross | Add third parent’s traits | Single cross x inbred line | Hybrid production |
| Poly Cross | Population improvement | Random crosses among parents | Combining ability, perennial breeding |
| Top/Test Cross | Parental evaluation | Inbred x tester | Combining ability assessment |
| Interspecific/Intergeneric | Gene transfer | Across species/genera | Novel traits, genetic studies |
| Bulk Cross | Diversity + selection | Bulk crossing multiple parents | Population genetics, adaptation |
The 2020-2021 G2F dataset uses a 6-way doubled haploid cross. This cross starts with 6 parents (A, B, C, D, E, F) which are initially crossed to form 2-way crosses (AxB, CxD, ExF). An intermediate cross follows to form 4-way crosses ((AxB)x(CxD),(ExF)). The final cross is the 6-way cross (((AxB)x(CxD))x(ExF)) to form the doubled haploid F1 generation.
We will now prepare the grain yield BLUP dataset to format them as
inputs for qtl2 genome scan. The phenotype files are
separated according to the Tester_Environment combination.
#load libraries
pacman::p_load(readr, nlme, sjstats, lme4, RColorBrewer, dplyr, ggplot2, tidyverse, ggrepel, caret, MuMIn)
git_path <- "https://raw.githubusercontent.com/fatmaoz25/ReactionNormGenomicPrediction/refs/heads/main"
df_file <- "Yield.wide.na.csv"
df_path <- file.path(git_path, df_file)
df <- read.csv(df_path)
# Make sure string columns are not factors
df$Inbred <- as.character(df$Inbred)
# Get all Tester_Env column names (excluding 'Inbred')
tester_env_cols <- setdiff(names(df), "Inbred")
for (col in tester_env_cols) {
output <- df[, c("Inbred", col)]
colnames(output) <- c("Inbred", "Value")
# make a valid R object name (replace . with _)
obj_name <- paste0("df_", gsub("\\.", "_", col))
assign(obj_name, output)
}
NOTE: The code only depicts the QTL scan process for one Tester_Env combination (PHZ51_TXH1.2020). This would ideally be run for each Tester_Env separately by changing the phenotype data file in the JSON file. You may do this by running a forloop that goes through each Tester_Env and saves all of the QTL results into one dataframe at the end.
library(qtl2)
## Warning: package 'qtl2' was built under R version 4.5.3
library(ggplot2)
library(dplyr)
library(tidyr)
library(stringr)
write.csv(df_PHZ51_TXH1_2020, "C:\\Users\\fatma\\Documents\\SURGe\\QTL_Analysis\\PHZ51_TXH1_2020.csv", row.names=FALSE)
cross <- read_cross2("C:\\Users\\fatma\\Documents\\SURGe\\QTL_Analysis\\cm.mbp.ss.100k.sites.json") # Replace with your actual file name
# STEP 2: Calculate genotype probabilities
pr <- calc_genoprob(cross, error_prob=0.002, quiet=FALSE)
## Chr 1
## Chr 2
## Chr 3
## Chr 4
## Chr 5
## Chr 6
## Chr 7
## Chr 8
## Chr 9
## Chr 10
# STEP 3: Compute kinship matrix (optional but recommended to account for family structure)
kinship <- calc_kinship(pr, "loco", quiet = FALSE)
## - converting to allele probs
## - Chr 1
## - Chr 2
## - Chr 3
## - Chr 4
## - Chr 5
## - Chr 6
## - Chr 7
## - Chr 8
## - Chr 9
## - Chr 10
## - Chr 1
## - Chr 2
## - Chr 3
## - Chr 4
## - Chr 5
## - Chr 6
## - Chr 7
## - Chr 8
## - Chr 9
## - Chr 10
# STEP 4: Run the genome scan
scan1_output <- scan1(pr, cross$pheno, kinship = kinship)
# STEP 4a: Establish significance threshold with permutation test
perm_result <- scan1perm(genoprobs = pr, pheno = cross$pheno, n_perm = 1000, cores = 1) # cores = 0 arg uses maximum available cores
# STEP 4b: View the thresholds (this will be used in find_peaks below)
summary(perm_result)
## LOD thresholds (1000 permutations)
## Value
## 0.05 6.24
# STEP 5: Find peaks above the LOD threshold
peaks <- find_peaks(scan1_output, cross$gmap, threshold = summary(perm_result), drop = 5) # drop = 5 was copied from the Michel 2022 methods
print(peaks)
## lodindex lodcolumn chr pos lod ci_lo ci_hi
## 1 1 Value 4 8.174928 6.323298 0.7066019 170.8496
head(scan1_output)
## Value
## S1_39898 0.9202589
## S1_195098 0.9008219
## S1_206785 0.9011153
## S1_328512 1.0748975
## S1_557669 1.0784846
## S1_682496 0.8803549
print(peaks) gives us the results for any significant
regions found in the QTL scan. scan1_output contains the
full genome scan results.
# Prepare phenotype info
pheno <- colnames(cross$pheno)
# Add SNP info (marker name, chromosome, and position)
scan_df <- as.data.frame(scan1_output)
scan_df$marker <- rownames(scan1_output)
# Split marker name (e.g., "S1_39898") into chr and pos
scan_df <- scan_df %>%
separate(marker, into = c("chr", "pos"), sep = "_") %>%
mutate(chr = str_remove(chr, "S"), pos = as.numeric(pos))
# Convert wide to long format
scan_long <- scan_df %>%
pivot_longer(cols = all_of(pheno), names_to = "phenotype", values_to = "LOD")
head(scan_long)
## # A tibble: 6 × 4
## chr pos phenotype LOD
## <chr> <dbl> <chr> <dbl>
## 1 1 39898 Value 0.920
## 2 1 195098 Value 0.901
## 3 1 206785 Value 0.901
## 4 1 328512 Value 1.07
## 5 1 557669 Value 1.08
## 6 1 682496 Value 0.880
offsets <- scan_long %>%
group_by(chr) %>%
summarise(chr_len = max(pos, na.rm = TRUE)) %>%
arrange(as.numeric(chr)) %>%
mutate(offset_val = lag(cumsum(chr_len), default = 0))
scan_long <- scan_long %>%
left_join(offsets, by = "chr")
str(scan_long)
## tibble [100,000 × 6] (S3: tbl_df/tbl/data.frame)
## $ chr : chr [1:100000] "1" "1" "1" "1" ...
## $ pos : num [1:100000] 39898 195098 206785 328512 557669 ...
## $ phenotype : chr [1:100000] "Value" "Value" "Value" "Value" ...
## $ LOD : num [1:100000] 0.92 0.901 0.901 1.075 1.078 ...
## $ chr_len : num [1:100000] 3.08e+08 3.08e+08 3.08e+08 3.08e+08 3.08e+08 ...
## $ offset_val: num [1:100000] 0 0 0 0 0 0 0 0 0 0 ...
scan_long <- scan_long %>%
mutate(
genome_pos = as.numeric(pos) + as.numeric(offset_val)
)
# *** CREATE chrom_boundaries ***
chrom_boundaries <- offsets$offset_val[-1]
# Set LOD threshold
perm_sum <- summary(perm_result)
# For example, extract the 0.05 threshold:
thresh <- perm_sum["0.05", "Value"]
Manhattan plots are a way to visualize the results of the genome scan. The x-axis is the genomic position along a set of chromosomes, while the y-axis is the logarithm-of-odds (LOD) value. The LOD value is a test statistic that quantifies whether a SNP marker and trait locus are linked and inherited together within pedigrees/families. It tracks whether or not certain SNPs co-segregate with a phenotype across generations. High LOD values indicate strong evidence of linkage between the trait loci and SNP marker, but it does not guarantee that a gene is causal for a specific phenotype.
# plot
ggplot(scan_long, aes(x = genome_pos, y = LOD, color = as.factor(chr))) +
geom_point(size = 0.75) +
geom_vline(xintercept = chrom_boundaries, linetype = "dashed", color = "grey70") +
geom_hline(yintercept = thresh, color = "red", linetype = "dashed", size = 1) +
labs(title = "PHZ51 TXH1.2020 QTL Results",
x = "Genomic Position",
y = "LOD Score",
color = "Chromosome") +
theme_bw() +
theme(legend.position = "none")
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once per session.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.