Genetic architecture, heritability, and genomic prediction of yield, seed protein, and oil content in soybean using multi-environment field data

Authors

Ayodele Amodu

Orrie Feitsma

Abstract

This section will summarize the study’s objectives, data sources, analytical framework, key findings, and overall significance. A full abstract will be added after the main analyses and results are finalized.

Introduction

Improving yield, seed protein, and oil content remain a central challenge in soybean breeding because these traits are economically critical yet genetically complex. Yield is highly polygenic and strongly influenced by environmental variation Sharma et al. (2024), while protein and oil content often exhibit antagonistic genetic correlations Diers et al. (2018), making simultaneous improvement difficult. Breeders must also contend with genotype × environment interactions (G×E), which cause genotypes to perform inconsistently across locations and years, reducing the reliability of selection decisions Akram et al. (2016). Traditional phenotypic selection is further constrained by the cost and time required to generate multi-environment field data, as well as the limited ability to dissect the underlying genetic architecture of these traits.

Although phenotypic selection alone is time consuming and not maximally effective, accurate phenotyping is crucial for understanding the heritability and variance components of traits and for obtaining accurate results from genomic inferences and predictions. Heritability refers to the ability of a trait to be passed down from parent to progeny or the portion of total phenotypic variance that can be attributed to genotype. An accurate understanding of heritability is crucial for estimating a population’s response to selection and the potential genetic gain that can be expected Holland, Nyquist, and Cervantes-Martínez (2003). A higher heritability suggests that selection is simpler, and the trait is likely only controlled by one or a few genes, while low heritability is typically associated with complex, polygenic traits. The variance components involved in phenotypic variance and heritability, such as dominance and additive variance, can further be parsed apart, and linear mixed models are typically used for this.

Advances in high-density SNP genotyping (e.g., SoySNP50K) and mixed model statistical frameworks now enable more precise estimation of genetic effects, partitioning of G×E, and prediction of breeding values using genome-wide markers (Ravelombola et al., 2021; Zhang et al., 2016). The genomic data acquired by SNP genotyping can be used for various downstream applications that allow breeders to make both large-scale inferences and powerful predictions. For example, SNP data and phenotypic data can be collected from a diverse population and analyzed in a genome-wide association study (GWAS), which allows researchers to identify SNPs in regions of the genome that are significantly associated with a trait of interest Clauw et al. (2025). Genotypic and phenotypic data can also be used for genomic selection (GS), which involves building and training a model to predict breeding values of individuals based on a score at each SNP Escamilla et al. (2025).

By integrating multi-environment phenotypic data with genomic information, this project aims to characterize the genetic architecture of yield, protein, and oil content, quantify their heritability, and evaluate the potential of genomic prediction to accelerate selection for improved seed quality and productivity in soybean. We will infer variance components and heritability using phenotypic data, identify SNPs associated with oil, protein, and yield using GWAS, and predict the performance of progeny from a subset of lines in this data using a GS model.

Materials and Methods

Data sources and type

Genotypic data. Imputed genotypes of the 5,176 Nested Association Mapping (NAM) recombinant inbred lines (RILs), derived from 40 NAM parental genotypes and obtained using the SoySNP50K and SoyNAM6K assays, were used in this project. The SNP arrays were aligned to the Williams 82 reference genome assembly (Wm82.a2). The imputed genotype dataset is publicly available at SoyBase.

Phenotypic and metadata. Phenotypic data and associated metadata were obtained from the USDA Soybean Nested Association Mapping (SoyNAM) project, which provides large‑scale field evaluations of soybean lines across multiple locations and years to study complex agronomic and seed composition traits. The SoyNAM project resources are available at SoyBase Legacy.

Data wrangling

Phenotypic data were imported from the raw text file and cleaned prior to downstream analyses. Empty columns created by trailing delimiters were removed, and key agronomic traits (yield, protein, and oil) were converted to numeric format to ensure compatibility with statistical modeling. Basic structural checks were performed to confirm correct variable types and naming. Missing data patterns were quantified to assess trait completeness across environments, and descriptive summaries were generated to characterize the distribution and scale of each trait. These steps ensured that the phenotype dataset was properly formatted, internally consistent, and ready for exploratory analyses and mixed‑model variance component estimation.

# Load phenotype data
data_dir <- "C:/Users/great/OneDrive/Documents/GWAS_Project/rawData"
pheno_file <- file.path(data_dir, "phenotypic_data.txt")

pheno_raw <- read.table(
  pheno_file,
  header = TRUE,
  sep = "\t",
  fill = TRUE,
  quote = "",
  comment.char = "",
  stringsAsFactors = FALSE,
  check.names = FALSE
)

# Remove empty columns created by trailing tabs
pheno_raw <- pheno_raw[, colnames(pheno_raw) != ""]

# Convert key trait columns to numeric
pheno_raw$`Yld (kg/ha)` <- as.numeric(pheno_raw$`Yld (kg/ha)`)
pheno_raw$Protein       <- as.numeric(pheno_raw$Protein)
pheno_raw$Oil           <- as.numeric(pheno_raw$Oil)

# Check structure of key variables
str(pheno_raw[, c("Corrected Strain", "Loc", "Year", "Env",
                  "Yld (kg/ha)", "Protein", "Oil")])

'data.frame':   73294 obs. of  7 variables:
 $ Corrected Strain: chr  "IA3023" "LD04-11056" "LD04-13265" "U06-100052" ...
 $ Loc             : chr  "IA Beavis" "IA Beavis" "IA Beavis" "IA Beavis" ...
 $ Year            : int  2012 2012 2012 2012 2012 2012 2012 2012 2012 2012 ...
 $ Env             : chr  "2012_IA" "2012_IA" "2012_IA" "2012_IA" ...
 $ Yld (kg/ha)     : num  2159 3463 3167 2492 2502 ...
 $ Protein         : num  32.1 31.9 32.8 35.4 32.1 ...
 $ Oil             : num  19.9 20.3 19.1 18.5 20.1 ...

# Count missing values for each trait
colSums(is.na(pheno_raw[, c("Yld (kg/ha)", "Protein", "Oil")]))

Yld (kg/ha)     Protein         Oil 
       6610       20590       20590

# Descriptive summary of traits
summary(pheno_raw[, c("Yld (kg/ha)", "Protein", "Oil")])

  Yld (kg/ha)        Protein           Oil       
 Min.   : 322.8   Min.   :25.59   Min.   :12.99  
 1st Qu.:2896.3   1st Qu.:32.71   1st Qu.:19.03  
 Median :3564.3   Median :33.76   Median :19.66  
 Mean   :3707.7   Mean   :33.76   Mean   :19.72  
 3rd Qu.:4519.2   3rd Qu.:34.80   3rd Qu.:20.35  
 Max.   :7735.0   Max.   :43.46   Max.   :23.44  
 NA's   :6610     NA's   :20590   NA's   :20590

# Count non-missing observations for each trait
colSums(!is.na(pheno_raw[, c("Yld (kg/ha)", "Protein", "Oil")]))

Yld (kg/ha)     Protein         Oil 
      66684       52704       52704

Summary of the cleaned phenotype dataset

After initial cleaning, the phenotype dataset showed substantial differences in data availability across traits. Yield had 6,610 missing observations, while protein and oil each had 20,590 missing values. Correspondingly, yield retained 66,684 non‑missing observations, whereas protein and oil each had 52,704 observations, indicating that seed composition traits were measured together and less frequently than yield. Descriptive summaries of the three traits confirmed that all values were numeric and within expected agronomic ranges, and no malformed or empty columns remained after preprocessing. This cleaned dataset formed the basis for subsequent exploratory analyses and mixed‑model heritability estimation.

Data exploration

Exploratory analyses were conducted to characterize the distribution of observations across years, locations, and environments, and to evaluate how yield, protein, and oil varied across these factors. Counts of observations by year and location were used to assess data balance, and a single problematic record with a missing location was removed. The number of unique environments and strains was confirmed, and trait means were summarized across environments. Boxplots were generated to visualize trait variation and missingness patterns, revealing strong environmental influence on yield and more moderate variation for protein and oil, with several environments lacking seed composition measurements.

library(ggplot2)

# Observations by year
table(pheno_raw$Year)


 2011  2012  2013 
 6119 34069 33105

# Observations by location
table(pheno_raw$Loc)


                IA Beavis     IL Diers    IN Rainey KS Schapaugh     Michigan 
           1        12248        15308        12248         7660          939 
  MO_Shannon    NE Specht      OH Mian    OH_McHale 
        1929        15307         3810         3844

# Remove rows with missing or blank location
pheno_raw <- pheno_raw[!(pheno_raw$Loc == "" | is.na(pheno_raw$Loc)), ]
dim(pheno_raw)

[1] 73293    28

# Unique environments and strains
c(
  n_env = length(unique(pheno_raw$Env)),
  n_strain = length(unique(pheno_raw$`Corrected Strain`))
)

   n_env n_strain 
      19     5219

# Environment distribution
table(pheno_raw$Env)


    2011_IL     2011_NE     2012_IA     2012_IL     2012_IN     2012_KS 
       3060        3059        6124        6124        6124        3808 
    2012_MI     2012_MO     2012_NE   2012_OHmc 2012_OHmian     2013_IA 
        939         982        6124        1923        1921        6124 
    2013_IL     2013_IN     2013_KS     2013_MO     2013_NE   2013_OHmc 
       6124        6124        3852         947        6124        1921 
2013_OHmian 
       1889

# Yield variation across environments
aggregate(`Yld (kg/ha)` ~ Env, data = pheno_raw, mean, na.rm = TRUE)

           Env Yld (kg/ha)
1      2011_IL    2821.804
2      2011_NE    5107.103
3      2012_IA    2809.378
4      2012_IL    3422.185
5      2012_IN    4274.668
6      2012_KS    3904.774
7      2012_MI    2374.817
8      2012_MO    3446.245
9      2012_NE    4802.989
10   2012_OHmc    3414.574
11 2012_OHmian    2842.603
12     2013_IA    2893.854
13     2013_IL    3150.394
14     2013_IN    5103.074
15     2013_KS    2815.179
16     2013_MO    4130.061
17   2013_OHmc    3676.813
18 2013_OHmian    4499.398

# Yield boxplot
yield_plot <- pheno_raw[!is.na(pheno_raw$`Yld (kg/ha)`), ]
env_order <- with(
  aggregate(`Yld (kg/ha)` ~ Env, data = yield_plot, mean),
  Env[order(`Yld (kg/ha)`, decreasing = TRUE)]
)
yield_plot$Env <- factor(yield_plot$Env, levels = env_order)

ggplot(yield_plot, aes(x = Env, y = `Yld (kg/ha)`, fill = Env)) +
  geom_boxplot(outlier.size = 0.5) +
  stat_summary(fun = mean, geom = "point", shape = 18, size = 2) +
  stat_summary(fun.data = mean_se, geom = "errorbar", width = 0.2) +
  labs(
    title = "Yield Variation Across Environments",
    x = "Environment",
    y = "Yield (kg/ha)"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = "none")

# Protein variation across environments
aggregate(Protein ~ Env, data = pheno_raw, mean, na.rm = TRUE)

       Env  Protein
1  2011_IL 32.68626
2  2011_NE 33.84564
3  2012_IA 33.04600
4  2012_IL 32.92777
5  2012_IN 34.58357
6  2012_NE 34.71142
7  2013_IA 33.19734
8  2013_IL 32.52968
9  2013_IN 34.18506
10 2013_NE 35.15831

# Protein boxplot
all_env <- sort(unique(pheno_raw$Env))
protein_plot <- pheno_raw[!is.na(pheno_raw$Protein), ]
protein_plot$Env <- factor(protein_plot$Env, levels = all_env)

ggplot(protein_plot, aes(x = Env, y = Protein, fill = Env)) +
  geom_boxplot(outlier.size = 0.5) +
  stat_summary(fun = mean, geom = "point", shape = 18, size = 2) +
  stat_summary(fun.data = mean_se, geom = "errorbar", width = 0.2) +
  scale_x_discrete(drop = FALSE) +
  labs(
    title = "Protein Variation Across Environments",
    x = "Environment",
    y = "Protein (%)"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = "none")

# Oil variation across environments
aggregate(Oil ~ Env, data = pheno_raw, mean, na.rm = TRUE)

       Env      Oil
1  2011_IL 20.46163
2  2011_NE 19.25276
3  2012_IA 19.17852
4  2012_IL 20.13120
5  2012_IN 19.31964
6  2012_NE 18.89248
7  2013_IA 19.20158
8  2013_IL 21.12186
9  2013_IN 19.89991
10 2013_NE 19.67468

# Oil boxplot
oil_plot <- pheno_raw[!is.na(pheno_raw$Oil), ]
oil_plot$Env <- factor(oil_plot$Env, levels = all_env)

ggplot(oil_plot, aes(x = Env, y = Oil, fill = Env)) +
  geom_boxplot(outlier.size = 0.5) +
  stat_summary(fun = mean, geom = "point", shape = 18, size = 2) +
  stat_summary(fun.data = mean_se, geom = "errorbar", width = 0.2) +
  scale_x_discrete(drop = FALSE) +
  labs(
    title = "Oil Variation Across Environments",
    x = "Environment",
    y = "Oil (%)"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = "none")

Summary of exploratory findings

Observations were unevenly distributed across years, with 2012 and 2013 contributing the majority of measurements. Locations were also unbalanced, with Michigan having notably few observations. After removing one record with a missing location, the dataset contained 73,293 rows and 28 columns. Nineteen environments and 5,219 strains were represented. Yield showed strong environmental influence, while protein and oil exhibited more moderate variation and were missing entirely in several environments (e.g., 2012_KS, 2012_MI). These patterns highlight the unbalanced structure of the dataset and the differing environmental sensitivity of the three traits.

Variance decomposition and heritability

Variance components were estimated for yield, protein, and oil using linear mixed models with random effects for genotype, environment, and genotype‑by‑environment interaction (G×E). For each trait, the fitted model partitioned the total phenotypic variance into contributions from these components and the residual error. Proportional variance contributions were computed to compare the relative influence of genotype, environment, and G×E across traits. Broad‑sense heritability across environments was then estimated using variance components adjusted for the number of environments and total observations contributing to each trait.

library(lme4)

# Yield model and raw variance components
yield_dat <- pheno_raw[!is.na(pheno_raw$`Yld (kg/ha)`), ]

mod_yield <- lmer(
  `Yld (kg/ha)` ~ (1|`Corrected Strain`) + (1|Env) + (1|`Corrected Strain`:Env),
  data = yield_dat
)

VarCorr(mod_yield)

 Groups                 Name        Std.Dev.
 `Corrected Strain`:Env (Intercept) 273.59  
 Corrected Strain       (Intercept) 309.21  
 Env                    (Intercept) 839.15  
 Residual                           549.75

# Proportion of variance for any trait
get_varprop <- function(trait_name) {
  
  dat <- pheno_raw[!is.na(pheno_raw[[trait_name]]), ]
  
  form <- as.formula(
    paste0("`", trait_name, "` ~ (1|`Corrected Strain`) + (1|Env) + (1|`Corrected Strain`:Env)")
  )
  
  mod <- lmer(form, data = dat)
  
  vc <- as.data.frame(VarCorr(mod))[, c("grp", "vcov")]
  
  vc$grp[vc$grp == "Corrected Strain"] <- "Genotype"
  vc$grp[vc$grp == "Env"] <- "Environment"
  vc$grp[vc$grp == "Corrected Strain:Env"] <- "GxE"
  vc$grp[vc$grp == "Residual"] <- "Residual"
  
  vc$Proportion <- vc$vcov / sum(vc$vcov)
  vc$Trait <- trait_name
  
  vc[, c("Trait", "grp", "vcov", "Proportion")]
}

varprop_yield   <- get_varprop("Yld (kg/ha)")
varprop_protein <- get_varprop("Protein")
varprop_oil     <- get_varprop("Oil")

varprop_yield

        Trait                    grp      vcov Proportion
1 Yld (kg/ha) `Corrected Strain`:Env  74853.98 0.06360458
2 Yld (kg/ha)               Genotype  95608.90 0.08124035
3 Yld (kg/ha)            Environment 704172.74 0.59834638
4 Yld (kg/ha)               Residual 302229.10 0.25680870

varprop_protein

    Trait                    grp      vcov Proportion
1 Protein `Corrected Strain`:Env 0.2120286 0.09487143
2 Protein               Genotype 0.6712800 0.30036177
3 Protein            Environment 0.8806187 0.39402958
4 Protein               Residual 0.4709777 0.21073722

varprop_oil

  Trait                    grp       vcov Proportion
1   Oil `Corrected Strain`:Env 0.08137358 0.08362463
2   Oil               Genotype 0.25383072 0.26085247
3   Oil            Environment 0.48439465 0.49779453
4   Oil               Residual 0.15348255 0.15772836

# Rename columns for heritability function
pheno <- pheno_raw
names(pheno)[names(pheno) == "Corrected Strain"] <- "Strain"
names(pheno)[names(pheno) == "Yld (kg/ha)"]      <- "Yld"

# Broad-sense heritability across environments
compute_H2 <- function(trait_name) {
  
  dat <- pheno[!is.na(pheno[[trait_name]]), ]
  
  n_env <- length(unique(dat$Env))
  n_obs <- nrow(dat)
  
  form <- as.formula(
    paste0(trait_name, " ~ (1|Strain) + (1|Env) + (1|Strain:Env)")
  )
  
  mod <- lmer(form, data = dat)
  
  vc <- as.data.frame(VarCorr(mod))
  
  sigma_g  <- vc$vcov[vc$grp == "Strain"]
  sigma_ge <- vc$vcov[vc$grp == "Strain:Env"]
  sigma_e  <- vc$vcov[vc$grp == "Residual"]
  
  H2 <- sigma_g / (sigma_g + (sigma_ge / n_env) + (sigma_e / n_obs))
  
  list(
    Trait = trait_name,
    H2 = H2,
    sigma_g = sigma_g,
    sigma_ge = sigma_ge,
    sigma_e = sigma_e,
    n_env = n_env,
    n_obs = n_obs
  )
}

H2_yield   <- compute_H2("Yld")
H2_protein <- compute_H2("Protein")
H2_oil     <- compute_H2("Oil")

H2_yield

$Trait
[1] "Yld"

$H2
[1] 0.958274

$sigma_g
[1] 95608.9

$sigma_ge
[1] 74853.98

$sigma_e
[1] 302229.1

$n_env
[1] 18

$n_obs
[1] 66684

H2_protein

$Trait
[1] "Protein"

$H2
[1] 0.9693689

$sigma_g
[1] 0.67128

$sigma_ge
[1] 0.2120286

$sigma_e
[1] 0.4709777

$n_env
[1] 10

$n_obs
[1] 52704

H2_oil

$Trait
[1] "Oil"

$H2
[1] 0.9689268

$sigma_g
[1] 0.2538307

$sigma_ge
[1] 0.08137358

$sigma_e
[1] 0.1534826

$n_env
[1] 10

$n_obs
[1] 52704

Summary of variance and heritability results

Variance decomposition showed that environment contributed the largest share of variance for yield, consistent with its strong environmental sensitivity. Protein and oil displayed smaller environmental variance and relatively larger genetic components, reflecting their greater stability across environments. G×E contributed moderately to all traits. Broad‑sense heritability estimates were high for all three traits, with H² ≈ 0.96 for yield and ≈ 0.97 for both protein and oil, indicating that despite environmental effects, the genetic signal remains strong and suitable for downstream GWAS and genomic prediction.

References

Akram, S., B. N. Hussain, M. A. Al Bari, D. J. Burritt, and M. A. Hossain. 2016. “Genetic Variability and Association Analysis of Soybean (Glycine Max (l.) Merrill) for Yield and Yield Attributing Traits.” Plant Gene and Trait 7.

Clauw, Pieter, Thomas James Ellis, Hai-Jun Liu, and Eriko Sasaki. 2025. “Beyond the Standard GWAS—a Guide for Plant Biologists.” Plant and Cell Physiology 66 (4): 431–43.

Diers, B. W., J. Specht, K. M. Rainey, P. Cregan, Q. Song, V. Ramasubramanian, G. Graef, et al. 2018. “Genetic Architecture of Soybean Yield and Agronomic Traits.” G3: Genes, Genomes, Genetics 8 (10): 3367–75.

Escamilla, D. M., D. Li, K. L. Negus, K. L. Kappelmann, A. Kusmec, A. E. Vanous, P. S. Schnable, X. Li, and J. Yu. 2025. “Genomic Selection: Essence, Applications, and Prospects.” Plant Genome 2 (2): e70053. https://doi.org/10.1002/tpg2.70053.

Holland, J. B., W. E. Nyquist, and C. T. Cervantes-Martínez. 2003. “Estimating and Interpreting Heritability for Plant Breeding: An Update.” In Plant Breeding Reviews, edited by J. Janick. Vol. 22. 2.

Sharma, A., K. Saxena, N. K. Panwar, and K. P. Singh. 2024. “Assessment of Polygenic Traits Association with Yield in Soybean Genotypes.” Journal of Advances in Biology & Biotechnology 27 (6): 693–701.