Genetic architecture, heritability, and genomic prediction of yield, seed protein, and oil content in soybean using multi-environment field data
Authors
Ayodele Amodu
Orrie Feitsma
Abstract
This section will summarize the study’s objectives, data sources, analytical framework, key findings, and overall significance. A full abstract will be added after the main analyses and results are finalized.
Introduction
Improving yield, seed protein, and oil content remain a central challenge in soybean breeding because these traits are economically critical yet genetically complex. Yield is highly polygenic and strongly influenced by environmental variation Sharma et al. (2024), while protein and oil content often exhibit antagonistic genetic correlations Diers et al. (2018), making simultaneous improvement difficult. Breeders must also contend with genotype × environment interactions (G×E), which cause genotypes to perform inconsistently across locations and years, reducing the reliability of selection decisions Akram et al. (2016). Traditional phenotypic selection is further constrained by the cost and time required to generate multi-environment field data, as well as the limited ability to dissect the underlying genetic architecture of these traits.
Although phenotypic selection alone is time consuming and not maximally effective, accurate phenotyping is crucial for understanding the heritability and variance components of traits and for obtaining accurate results from genomic inferences and predictions. Heritability refers to the ability of a trait to be passed down from parent to progeny or the portion of total phenotypic variance that can be attributed to genotype. An accurate understanding of heritability is crucial for estimating a population’s response to selection and the potential genetic gain that can be expected Holland, Nyquist, and Cervantes-Martínez (2003). A higher heritability suggests that selection is simpler, and the trait is likely only controlled by one or a few genes, while low heritability is typically associated with complex, polygenic traits. The variance components involved in phenotypic variance and heritability, such as dominance and additive variance, can further be parsed apart, and linear mixed models are typically used for this.
Advances in high-density SNP genotyping (e.g., SoySNP50K) and mixed model statistical frameworks now enable more precise estimation of genetic effects, partitioning of G×E, and prediction of breeding values using genome-wide markers (Ravelombola et al., 2021; Zhang et al., 2016). The genomic data acquired by SNP genotyping can be used for various downstream applications that allow breeders to make both large-scale inferences and powerful predictions. For example, SNP data and phenotypic data can be collected from a diverse population and analyzed in a genome-wide association study (GWAS), which allows researchers to identify SNPs in regions of the genome that are significantly associated with a trait of interest Clauw et al. (2025). Genotypic and phenotypic data can also be used for genomic selection (GS), which involves building and training a model to predict breeding values of individuals based on a score at each SNP Escamilla et al. (2025).
By integrating multi-environment phenotypic data with genomic information, this project aims to characterize the genetic architecture of yield, protein, and oil content, quantify their heritability, and evaluate the potential of genomic prediction to accelerate selection for improved seed quality and productivity in soybean. We will infer variance components and heritability using phenotypic data, identify SNPs associated with oil, protein, and yield using GWAS, and predict the performance of progeny from a subset of lines in this data using a GS model.
Materials and Methods
Data sources and type
Genotypic data. Imputed genotypes of the 5,176 Nested Association Mapping (NAM) recombinant inbred lines (RILs), derived from 40 NAM parental genotypes and obtained using the SoySNP50K and SoyNAM6K assays, were used in this project. The SNP arrays were aligned to the Williams 82 reference genome assembly (Wm82.a2). The imputed genotype dataset is publicly available at SoyBase.
Phenotypic and metadata. Phenotypic data and associated metadata were obtained from the USDA Soybean Nested Association Mapping (SoyNAM) project, which provides large‑scale field evaluations of soybean lines across multiple locations and years to study complex agronomic and seed composition traits. The SoyNAM project resources are available at SoyBase Legacy.
Data wrangling
Phenotypic data were imported from the raw text file and cleaned prior to downstream analyses. Empty columns created by trailing delimiters were removed, and key agronomic traits (yield, protein, and oil) were converted to numeric format to ensure compatibility with statistical modeling. Basic structural checks were performed to confirm correct variable types and naming. Missing data patterns were quantified to assess trait completeness across environments, and descriptive summaries were generated to characterize the distribution and scale of each trait. These steps ensured that the phenotype dataset was properly formatted, internally consistent, and ready for exploratory analyses and mixed‑model variance component estimation.
# Count missing values for each traitcolSums(is.na(pheno_raw[, c("Yld (kg/ha)", "Protein", "Oil")]))
Yld (kg/ha) Protein Oil
6610 20590 20590
# Descriptive summary of traitssummary(pheno_raw[, c("Yld (kg/ha)", "Protein", "Oil")])
Yld (kg/ha) Protein Oil
Min. : 322.8 Min. :25.59 Min. :12.99
1st Qu.:2896.3 1st Qu.:32.71 1st Qu.:19.03
Median :3564.3 Median :33.76 Median :19.66
Mean :3707.7 Mean :33.76 Mean :19.72
3rd Qu.:4519.2 3rd Qu.:34.80 3rd Qu.:20.35
Max. :7735.0 Max. :43.46 Max. :23.44
NA's :6610 NA's :20590 NA's :20590
# Count non-missing observations for each traitcolSums(!is.na(pheno_raw[, c("Yld (kg/ha)", "Protein", "Oil")]))
Yld (kg/ha) Protein Oil
66684 52704 52704
Summary of the cleaned phenotype dataset
After initial cleaning, the phenotype dataset showed substantial differences in data availability across traits. Yield had 6,610 missing observations, while protein and oil each had 20,590 missing values. Correspondingly, yield retained 66,684 non‑missing observations, whereas protein and oil each had 52,704 observations, indicating that seed composition traits were measured together and less frequently than yield. Descriptive summaries of the three traits confirmed that all values were numeric and within expected agronomic ranges, and no malformed or empty columns remained after preprocessing. This cleaned dataset formed the basis for subsequent exploratory analyses and mixed‑model heritability estimation.
Data exploration
Exploratory analyses were conducted to characterize the distribution of observations across years, locations, and environments, and to evaluate how yield, protein, and oil varied across these factors. Counts of observations by year and location were used to assess data balance, and a single problematic record with a missing location was removed. The number of unique environments and strains was confirmed, and trait means were summarized across environments. Boxplots were generated to visualize trait variation and missingness patterns, revealing strong environmental influence on yield and more moderate variation for protein and oil, with several environments lacking seed composition measurements.
library(ggplot2)# Observations by yeartable(pheno_raw$Year)
2011 2012 2013
6119 34069 33105
# Observations by locationtable(pheno_raw$Loc)
IA Beavis IL Diers IN Rainey KS Schapaugh Michigan
1 12248 15308 12248 7660 939
MO_Shannon NE Specht OH Mian OH_McHale
1929 15307 3810 3844
# Remove rows with missing or blank locationpheno_raw <- pheno_raw[!(pheno_raw$Loc ==""|is.na(pheno_raw$Loc)), ]dim(pheno_raw)
[1] 73293 28
# Unique environments and strainsc(n_env =length(unique(pheno_raw$Env)),n_strain =length(unique(pheno_raw$`Corrected Strain`)))
Observations were unevenly distributed across years, with 2012 and 2013 contributing the majority of measurements. Locations were also unbalanced, with Michigan having notably few observations. After removing one record with a missing location, the dataset contained 73,293 rows and 28 columns. Nineteen environments and 5,219 strains were represented. Yield showed strong environmental influence, while protein and oil exhibited more moderate variation and were missing entirely in several environments (e.g., 2012_KS, 2012_MI). These patterns highlight the unbalanced structure of the dataset and the differing environmental sensitivity of the three traits.
Variance decomposition and heritability
Variance components were estimated for yield, protein, and oil using linear mixed models with random effects for genotype, environment, and genotype‑by‑environment interaction (G×E). For each trait, the fitted model partitioned the total phenotypic variance into contributions from these components and the residual error. Proportional variance contributions were computed to compare the relative influence of genotype, environment, and G×E across traits. Broad‑sense heritability across environments was then estimated using variance components adjusted for the number of environments and total observations contributing to each trait.
library(lme4)# Yield model and raw variance componentsyield_dat <- pheno_raw[!is.na(pheno_raw$`Yld (kg/ha)`), ]mod_yield <-lmer(`Yld (kg/ha)`~ (1|`Corrected Strain`) + (1|Env) + (1|`Corrected Strain`:Env),data = yield_dat)VarCorr(mod_yield)
Groups Name Std.Dev.
`Corrected Strain`:Env (Intercept) 273.59
Corrected Strain (Intercept) 309.21
Env (Intercept) 839.15
Residual 549.75
# Proportion of variance for any traitget_varprop <-function(trait_name) { dat <- pheno_raw[!is.na(pheno_raw[[trait_name]]), ] form <-as.formula(paste0("`", trait_name, "` ~ (1|`Corrected Strain`) + (1|Env) + (1|`Corrected Strain`:Env)") ) mod <-lmer(form, data = dat) vc <-as.data.frame(VarCorr(mod))[, c("grp", "vcov")] vc$grp[vc$grp =="Corrected Strain"] <-"Genotype" vc$grp[vc$grp =="Env"] <-"Environment" vc$grp[vc$grp =="Corrected Strain:Env"] <-"GxE" vc$grp[vc$grp =="Residual"] <-"Residual" vc$Proportion <- vc$vcov /sum(vc$vcov) vc$Trait <- trait_name vc[, c("Trait", "grp", "vcov", "Proportion")]}varprop_yield <-get_varprop("Yld (kg/ha)")varprop_protein <-get_varprop("Protein")varprop_oil <-get_varprop("Oil")varprop_yield
Variance decomposition showed that environment contributed the largest share of variance for yield, consistent with its strong environmental sensitivity. Protein and oil displayed smaller environmental variance and relatively larger genetic components, reflecting their greater stability across environments. G×E contributed moderately to all traits. Broad‑sense heritability estimates were high for all three traits, with H² ≈ 0.96 for yield and ≈ 0.97 for both protein and oil, indicating that despite environmental effects, the genetic signal remains strong and suitable for downstream GWAS and genomic prediction.
References
Akram, S., B. N. Hussain, M. A. Al Bari, D. J. Burritt, and M. A. Hossain. 2016. “Genetic Variability and Association Analysis of Soybean (Glycine Max (l.) Merrill) for Yield and Yield Attributing Traits.”Plant Gene and Trait 7.
Clauw, Pieter, Thomas James Ellis, Hai-Jun Liu, and Eriko Sasaki. 2025. “Beyond the Standard GWAS—a Guide for Plant Biologists.”Plant and Cell Physiology 66 (4): 431–43.
Diers, B. W., J. Specht, K. M. Rainey, P. Cregan, Q. Song, V. Ramasubramanian, G. Graef, et al. 2018. “Genetic Architecture of Soybean Yield and Agronomic Traits.”G3: Genes, Genomes, Genetics 8 (10): 3367–75.
Escamilla, D. M., D. Li, K. L. Negus, K. L. Kappelmann, A. Kusmec, A. E. Vanous, P. S. Schnable, X. Li, and J. Yu. 2025. “Genomic Selection: Essence, Applications, and Prospects.”Plant Genome 2 (2): e70053. https://doi.org/10.1002/tpg2.70053.
Holland, J. B., W. E. Nyquist, and C. T. Cervantes-Martínez. 2003. “Estimating and Interpreting Heritability for Plant Breeding: An Update.” In Plant Breeding Reviews, edited by J. Janick. Vol. 22. 2.
Sharma, A., K. Saxena, N. K. Panwar, and K. P. Singh. 2024. “Assessment of Polygenic Traits Association with Yield in Soybean Genotypes.”Journal of Advances in Biology & Biotechnology 27 (6): 693–701.