Mixed Linear Models (MLMs)

Plant breeders are interested in the breeding value for genotypes/hybrids for traits of interest. However, plant breeding data is oftentimes unbalanced and contains variables with hierarchical structures since these studies span multiple environments. For example, in Week 1, we saw that not all of the Testers were present in each environment (unbalanced data/missing data), and we assume that Testers may be grouped according to their effects (PHZ51 ~ low yielding and PHK76 ~ moderate yielding). Plant breeding data is also unique (in comparison to human data) in that we can observe genetically identical individuals planted repeatedly across different environments (replicates). To handle this type of data structure, we use mixed linear models (MLMs) to provide estimates of breeding values to select for superior-performing genotypes.

MLMs are comprised of fixed effects and random effects. The specifics of how fixed and random effects are derived are beyond the scope of this lesson, but it is beneficial to be able to recognize and differentiate them. They are different ways to handle variability across different grouping levels or factors of the dataset.

Fixed Effects

Fixed effects are assumed to be constant and non-random; we use them when we are interested in the specific groups used in the study (i.e. effects of specific genotypes on grain yield in our G2F dataset - these are referred to as Best Linear Unbiased Estimates or BLUEs).

Random Effects

Random effects assume that the groups in our study are drawn from a random, normally distributed population of all possible groups. Rather than being interested in the specific effects of these groups, we are interested in their variation (i.e. variation of genotypes’ performance for grain yield from the overall average grain yield - these are referred to as Best Linear Unbiased Predictors or BLUPs).

The following equation shows the random effects model that was used to dissect the variance components and to calculate the best linear unbiased predictor (BLUP) values for grain yield for the 2020-2021 G2F dataset:

\[Grain \ Yield_{ijklm}=\mu+Genotype_{i}+Environment_{j}+GE_{ij}+Range_{k}+Row_{l}+Rep_{(j)m}+\varepsilon_{ijklm}\]

\(\mu\) is the mean grain yield of the population, \(Genotype_{i}\) is the effect of the genotype, \(Environment_{j}\) is the effect of the environment, \(GE_{ij}\) is the effect of GxE interactions, \(Range_{k}\) is the effect of the range, \(Row_{l}\) is the effect of the row, \(Rep_{(j)m}\) is the effect of the replicate, and \(\varepsilon_{ijklm}\) is the residual error not explained by the model. Since all the predictors are treated as random, they are assumed to be normally distributed

First, import the G2F 2020 and 2021 datasets and combine them:

#load libraries
pacman::p_load(readr, nlme, sjstats, lme4, RColorBrewer, dplyr, ggplot2, tidyverse, ggrepel, caret, MuMIn)

library(nlme)
library(sjstats)
library(lme4)
library("RColorBrewer")
library(dplyr)
library(ggplot2)
library(tidyverse)
library(ggrepel)
library(dplyr)
library(tidytext)
## Warning: package 'tidytext' was built under R version 4.5.3
#refer to data set in github repository
git_path <- "https://raw.githubusercontent.com/fatmaoz25/ReactionNormGenomicPrediction/refs/heads/main"
df20_file <- "g2f_2020_phenotypic_clean_data.csv"
df20_path <- file.path(git_path, df20_file)
df20 <- read.csv(df20_path)


df20 <- df20[df20$Field.Location %in% c("DEH1","GAH1","INH1","MNH1","NEH1","NEH2","NEH3","OHH1","TXH1","TXH2","WIH1","WIH2"),] #environments for 2020

df21_file <- "g2f_2021_phenotypic_clean_data.csv"
df21_path <- file.path(git_path, df21_file)
df21 <- read.csv(df21_path)
df21 <- df21[df21$Field.Location %in% c( "DEH1", "GAH1", "ILH1", "MNH1", "NCH1", "NEH1", "SCH1", "TXH2", "TXH3", "WIH1", "WIH2", "WIH3","IAH1", "IAH2", "IAH3", "IAH4", "MIH1"),] #environments for 2021

df <- rbind(df20,df21)
df$Env <- paste(df$Field.Location,df$Year, sep = ".")
df <- df[!df$Replicate %in% c(3,4),]
colnames(df)
##  [1] "Year"                                 
##  [2] "Field.Location"                       
##  [3] "State"                                
##  [4] "City"                                 
##  [5] "Plot.length..center.center.in.feet."  
##  [6] "Plot.area..ft2."                      
##  [7] "Alley.length..in.inches."             
##  [8] "Row.spacing..in.inches."              
##  [9] "Rows.per.plot"                        
## [10] "X..Seed.per.plot"                     
## [11] "Experiment"                           
## [12] "Source"                               
## [13] "Pedigree"                             
## [14] "Family"                               
## [15] "Tester"                               
## [16] "Replicate"                            
## [17] "Block"                                
## [18] "Plot"                                 
## [19] "Plot_ID"                              
## [20] "Range"                                
## [21] "Pass"                                 
## [22] "Date.Plot.Planted..MM.DD.YY."         
## [23] "Date.Plot.Harvested..MM.DD.YY."       
## [24] "Anthesis..MM.DD.YY."                  
## [25] "Silking..MM.DD.YY."                   
## [26] "Anthesis..days."                      
## [27] "Silking..days."                       
## [28] "Plant.Height..cm."                    
## [29] "Ear.Height..cm."                      
## [30] "Stand.Count....of.plants."            
## [31] "Root.Lodging....of.plants."           
## [32] "Stalk.Lodging....of.plants."          
## [33] "Grain.Moisture...."                   
## [34] "Test.Weight..lbs."                    
## [35] "Plot.Weight..lbs."                    
## [36] "Grain.Yield..bu.A."                   
## [37] "Plot.Discarded..enter..yes..or.blank."
## [38] "Comments"                             
## [39] "Filler"                               
## [40] "Snap....of.plants."                   
## [41] "Env"

Analysis of Variance (ANOVA)

Next, we have to run an analysis of variance (ANOVA) to partition the variance components for grain yield in order to obtain heritability estimates. Heritability reveals how well a phenotype is at indicating the breeding value of a genotype that has multiple replicates in multi-environment trials (MET). It is calculated as the ratio between the genotypic variance and the overall phenotypic variance, thus we can estimate how much of the phenotypic variance is attributed to genetic influences. We will be using the lme4 package to conduct ANOVA.

*note: grain yield is converted from bushels/acre to tons/hect. for the model.

df$Range <- as.factor(df$Range)
df$Pass <- as.factor(df$Pass)
df$Replicate  <- as.factor(df$Replicate )
df$Pedigree <- as.factor(df$Pedigree)
df$Env <- as.factor(df$Env)
df$Grain.Yield..bu.A. <- as.numeric(df$Grain.Yield..bu.A.)

anova<- lmer(Grain.Yield..bu.A.*.0673 ~
               (1|Pedigree)+
               (1|Env)+
               (1|Pedigree:Env)+
               (1|Range)+
               (1|Pass)+
               (1|Replicate:Env), df )
summary(anova)
## Linear mixed model fit by REML ['lmerMod']
## Formula: Grain.Yield..bu.A. * 0.0673 ~ (1 | Pedigree) + (1 | Env) + (1 |  
##     Pedigree:Env) + (1 | Range) + (1 | Pass) + (1 | Replicate:Env)
##    Data: df
## 
## REML criterion at convergence: 104045.5
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -6.5521 -0.4731  0.0265  0.5070  5.3271 
## 
## Random effects:
##  Groups        Name        Variance Std.Dev.
##  Pedigree:Env  (Intercept) 1.27451  1.1289  
##  Pedigree      (Intercept) 1.30753  1.1435  
##  Pass          (Intercept) 1.91912  1.3853  
##  Replicate:Env (Intercept) 0.16515  0.4064  
##  Range         (Intercept) 0.03614  0.1901  
##  Env           (Intercept) 3.78035  1.9443  
##  Residual                  2.69978  1.6431  
## Number of obs: 24171, groups:  
## Pedigree:Env, 16680; Pedigree, 1215; Pass, 178; Replicate:Env, 58; Range, 50; Env, 29
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  10.2983     0.3876   26.57

Next, we have to compute some metrics to assess the performance of the ANOVA model. The coefficient of variation (cv) shows the relative variability of the datapoints about the mean. The cv is .8%, indicating that the data has tight clustering and low dispersion.

The root mean square error (RMSE) tells us how close the predicted values are to the actual values. In this case, the computed rmse is 1.38, which means the model has good to moderate fit.

The \(R^2\) value of 0.75 indicates the fraction of the variance explained by the fixed and random effects, which is moderately high.

## [1] "Coefficient of Variation 0.00888620130300381"
## [1] "rmse 1.37666553946181"
## [1] "R 0.758572987963674"

To determine the heritability for grain yield, the variance components due to each factor must be extracted to plug-in them into the following equation:

\[Heritability=\frac{\sigma_{G}}{ \sigma_{G}+\frac{\sigma_{GE}}{n_{E}}+\frac{\sigma_\varepsilon}{n_{E}*n_{rep}}}\] where \(\sigma_{G}\) is the genotype effects’ variance, \(\sigma_{GE}\) is the GxE effects’ variance, \(\sigma_{\varepsilon}\) is the residual error variance, \(n_{E}\) is the number of environments, and \(n_{rep}\) is the number of reps.

The heritability of 85% is sufficient for a trait as complex as grain yield. This indicates that the expressed phenotypes for each genotype are heritable between the two reps and across different environments.

VC<-as.data.frame(print(VarCorr(anova ), comp=c("Variance")))
##  Groups        Name        Variance
##  Pedigree:Env  (Intercept) 1.274513
##  Pedigree      (Intercept) 1.307525
##  Pass          (Intercept) 1.919123
##  Replicate:Env (Intercept) 0.165155
##  Range         (Intercept) 0.036137
##  Env           (Intercept) 3.780349
##  Residual                  2.699776
VC$Percent<-round(VC$vcov/sum(VC$vcov)*100,2)
      
heritability <- VC[2,6] / (VC[2,6] +  VC[1,6]/length(unique(df$Env)) +  (VC[7,6]/length(unique(df$Env))*2))

VC[,7] <-rmse
VC[,8] <-heritability
VC[,9] <-R
names(VC)[7:9] <- c("Rmse","Heritability","R")
head(VC)
##             grp        var1 var2       vcov     sdcor Percent     Rmse
## 1  Pedigree:Env (Intercept) <NA> 1.27451264 1.1289432   11.40 1.376666
## 2      Pedigree (Intercept) <NA> 1.30752544 1.1434708   11.69 1.376666
## 3          Pass (Intercept) <NA> 1.91912294 1.3853241   17.16 1.376666
## 4 Replicate:Env (Intercept) <NA> 0.16515498 0.4063926    1.48 1.376666
## 5         Range (Intercept) <NA> 0.03613693 0.1900972    0.32 1.376666
## 6           Env (Intercept) <NA> 3.78034869 1.9443119   33.81 1.376666
##   Heritability        R
## 1    0.8503098 0.758573
## 2    0.8503098 0.758573
## 3    0.8503098 0.758573
## 4    0.8503098 0.758573
## 5    0.8503098 0.758573
## 6    0.8503098 0.758573
head(VC)
##             grp        var1 var2       vcov     sdcor Percent     Rmse
## 1  Pedigree:Env (Intercept) <NA> 1.27451264 1.1289432   11.40 1.376666
## 2      Pedigree (Intercept) <NA> 1.30752544 1.1434708   11.69 1.376666
## 3          Pass (Intercept) <NA> 1.91912294 1.3853241   17.16 1.376666
## 4 Replicate:Env (Intercept) <NA> 0.16515498 0.4063926    1.48 1.376666
## 5         Range (Intercept) <NA> 0.03613693 0.1900972    0.32 1.376666
## 6           Env (Intercept) <NA> 3.78034869 1.9443119   33.81 1.376666
##   Heritability        R
## 1    0.8503098 0.758573
## 2    0.8503098 0.758573
## 3    0.8503098 0.758573
## 4    0.8503098 0.758573
## 5    0.8503098 0.758573
## 6    0.8503098 0.758573
dim(VC)
## [1] 7 9

Now, let’s plot the results of the variance components to determine the percent each factor contributes to the variation seen in grain yield:

VC$Trait <- "Grain yield (t/ha)"
names(VC)[8] <- "Repeatability"

VC$grp <- gsub("Pedigree", "g", VC$grp)
VC$grp <- gsub("Env", "E", VC$grp)
VC$grp <- gsub("Replicate", "Rep", VC$grp)
VC$grp <- gsub("Pass", "Row", VC$grp)


VC$grp <- factor(VC$grp, levels = c("g", "E", "g:E", "Range", "Row", "Rep:E", "Residual"))




p1 <-  ggplot(VC, aes(x=Trait , y=Percent/100)) +
  geom_col(mapping=aes(fill=grp), width= 0.9) +
  geom_point(aes(y = Repeatability, shape="Repeatability"), fill="white",  color="black",size=4, alpha=1) +
  geom_point(aes(y = R, shape="R-squared"), fill="white",  color="black",size=4, alpha=1) +
  #  scale_x_discrete("")+
  scale_y_continuous("Explained percent variation, repeatability and R-squared") +
  #  scale_fill_manual(values = c("#B15928","#A6CEE3","#6A3D9A","#CAB2D6","#FF7F00","#FDBF6F","#E31A1C","#FB9A99","#33A02C" ,"#B2DF8A","#1F78B4" ))+
  #scale_fill_brewer(palette = "Paired")+
  #  scale_colour_manual(values =c("black","black"))+
  scale_shape_manual(values=c(21,24))+
  #  scale_fill_viridis_d()+
  theme_bw() +
  scale_x_discrete("")+
  theme(legend.position= c(0.53,0.3),
        legend.background =  element_rect(fill = alpha("gray80", 0.7) , linetype="solid"),
        legend.key.width = unit(0.3, 'cm'),
        legend.key.height = unit(0.1, 'cm'))+
  #      legend.key.width = unit(0.3, 'cm'),
  #      legend.key.height = unit(0.3, 'cm'))+
  #      facet_grid(~Type, scales = "free", space = "free")+
  #  labs(title = "A-) Multispectral HTP platform")+
  guides(fill=guide_legend(title="Variance component", nrow=6, byrow=TRUE))
#  ggtitle("A) Variance component")


p1

The grain yield breeding values for each genotype were calculated by adding the predicted GxE effects for an individual (using coef()) with the conditional mode of an individual’s environmental and genotype effect (using ranef()).

E <- ranef(anova)$'Env'
G <- ranef(anova)$'Pedigree'
GE <- coef(anova)$'Pedigree:Env'
head(E)
##           (Intercept)
## DEH1.2020  -1.2274882
## DEH1.2021   1.1569459
## GAH1.2020  -2.0455058
## GAH1.2021  -0.1697844
## IAH1.2021  -0.7298402
## IAH2.2021   1.1206849
dim(E)
## [1] 29  1
head(G)
##                  (Intercept)
## 2369/LH123HT       1.4257757
## 6046-BECKS         1.5006263
## ARPA W22 (X17EA)  -4.9711441
## B14A/H95          -0.2823990
## B14A/MO17         -0.1026897
## B14A/OH43          0.0660036
dim(E)
## [1] 29  1
head(GE)
##                        (Intercept)
## 2369/LH123HT:DEH1.2020    10.69428
## 2369/LH123HT:DEH1.2021    10.70033
## 2369/LH123HT:GAH1.2020    11.44766
## 2369/LH123HT:GAH1.2021    10.33503
## 2369/LH123HT:IAH1.2021    11.93277
## 2369/LH123HT:IAH2.2021    11.88455
dim(GE)
## [1] 16680     1

After calculating the breeding values, they are stored in a separate data frame where ONLY the genotypes with a Wisconsin stiff-stalk inbred parent are kept.

##       Yield                 interaction          Pedigree       Env Tester
## 1  9.850044 W10004_0006/PHK76:DEH1.2020 W10004_0006/PHK76 DEH1.2020  PHK76
## 2 11.844513 W10004_0006/PHK76:DEH1.2021 W10004_0006/PHK76 DEH1.2021  PHK76
## 3  9.213949 W10004_0006/PHK76:IAH1.2021 W10004_0006/PHK76 IAH1.2021  PHK76
## 4 10.960722 W10004_0006/PHK76:IAH2.2021 W10004_0006/PHK76 IAH2.2021  PHK76
## 5 12.002941 W10004_0006/PHK76:IAH3.2021 W10004_0006/PHK76 IAH3.2021  PHK76
## 6 10.751848 W10004_0006/PHK76:IAH4.2021 W10004_0006/PHK76 IAH4.2021  PHK76
##        Inbred Location Year
## 1 W10004_0006     DEH1 2020
## 2 W10004_0006     DEH1 2021
## 3 W10004_0006     IAH1 2021
## 4 W10004_0006     IAH2 2021
## 5 W10004_0006     IAH3 2021
## 6 W10004_0006     IAH4 2021
## [1] 15490     8

Now, let’s plot the results of the predicted breeding values to observe the genotype variance effects for grain yield across environments:

library(nlme)
library(sjstats)
library(lme4)
library("RColorBrewer")
library(dplyr)
library(ggplot2)
library(tidyverse)
library(ggrepel)
library(dplyr)
library(tidytext)

Yield <- GE
p <- ggplot(Yield, aes(x= reorder_within(Env, -Yield, Tester), y=Yield,fill=Tester)) +
  geom_boxplot(position = position_dodge2(preserve = "single"), size=0.2, outlier.size = 0.3)+
  # stat_summary(fun.y=mean, geom="point", aes(fill=CRI))+
  # geom_point(aes(x= reorder_within(Env, -Mean, Tester), y=Mean,  group = interaction(Env,CRI,Tester) ))+
  #facet_grid(~Tester, scales = "free_x")+
  scale_x_reordered("Environment") +
  scale_y_continuous("Grain yield (t/ha)")+
  theme_bw()+
  theme(legend.position = c(0.5,0.2),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        legend.title=element_text(size=8),
        axis.text.y = element_text(size=8),
        axis.text.x = element_text(size=8, angle = 45, hjust = 1),
        legend.background = element_rect(fill="lightblue",
                                         size=0.5,
                                         linetype="solid"),
        #        legend.key.size = unit(0.2, 'cm'), #change legend key size
        #        legend.key.height = unit(0.5, 'cm'), #change legend key height
        #        legend.key.width = unit(0.5, 'cm'), #change legend key width
        #        legend.text = element_text(size=8)
  )+
  ggtitle("B) Genotypic effect of hybrid for grain yield (t/ha)")
## Warning: The `size` argument of `element_rect()` is deprecated as of ggplot2 3.4.0.
## ℹ Please use the `linewidth` argument instead.
## This warning is displayed once per session.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
p