This demonstration shows how to estimate predictive ability and accuracy of a GS model. Using 5-fold cross validation. Note that k-fold cross validation generally does not give realistic results and whenever possible, a more realistic validation scheme should be used. Although I generally do not recommend k-fold cross validation, once you learn how to implement this method, you should be able to implement other validation schemes without much trouble.

Load data and packages

First we load the RData file. This contains the objects bv, M, ped and pheno. bv is a matrix of true breeding values for 530 individuals. M is a matrix of genotypic data for 530 individuals, ped is a data frame of the pedigrees for all 530 individuals. pheno is a data frame of the phenotypic data for 250 genotypes. The 250 genotypes that were phenotyped are lines, whereas the 280 genotypes that were not phenotyped are the plants used in the line development process.

Next, we load the library ‘rrBLUP’ we will use the mixed.solve function from this package.

load('Data250Lines.RData')
library(lme4)
## Loading required package: Matrix
library(rrBLUP)
library(arm)
## Loading required package: MASS
## 
## arm (Version 1.10-1, built: 2018-4-12)
## Working directory is /Users/jrut/Documents/Teaching/CPSC554/2022/Class 19

Prepare the phenotypic data

First use str() to check the structure of the phenotypic data.

str(pheno)
## 'data.frame':    1000 obs. of  5 variables:
##  $ phenoGID: int  281 282 283 284 285 286 287 288 289 290 ...
##  $ env     : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ error   : num  6 6 6 6 6 6 6 6 6 6 ...
##  $ pValue  : num  -4.445 1.959 0.554 1.211 2.596 ...
##  $ se      : num  2.45 2.45 2.45 2.45 2.45 ...

Notice that all the variables are being considered as integers or numeric. The phenoGID, and env variables are factors thus we need to change them to factors.

pheno$env<- as.factor(pheno$env)#change env to a factor
pheno$phenoGID<- as.factor(pheno$phenoGID)#change phenoGID factor

Create the validation vector

The validation vector should be the best estimate of the true breeding value or true genetic value without use of a relationship matrix. In our case the validation vector is the BLUPs of the genotype effects from a model that assumes individuals are unrelated. We can estimate these BLUPs and reliabilities easily using the lme4 and arm packages.

First we fit the model. For our data, env is a fixed effect and our random effect is phenoGID. (It must be random in order to obtain BLUPs of the genotype effects). We also need to use weights to account for the heterogenous error variances in our dataset. The weights are the reciprocal of the known variances.

mod_iid<- lmer(pheno$pValue~1+env+(1|phenoGID), weights= 1/pheno$se^2, data=pheno)

From this model we obtain the BLUPs using the ranef() function

blup_iid<- ranef(mod_iid)$phenoGID

Get the reliability of the validation BLUPs

To get the reliability of the BLUPs that we are using for validation, we need to get the Prediction Error Variances of the BLUPs and the genetic variance.

First we get the standard errors of the blups using the se.ranef() function. Then we obtain the Prediction Error Variances (PEVs) by squaring the standard errors.

se.blup_iid<- se.ranef(mod_iid)$phenoGID
pev<- se.blup_iid^2

Next we need to get the variance component table

vcomp<- data.frame(summary(mod_iid)$varcor)
vcomp
##        grp        var1 var2     vcov    sdcor
## 1 phenoGID (Intercept) <NA> 2.899115 1.702679
## 2 Residual        <NA> <NA> 1.282364 1.132415

Then we need to get the genetic variance from this table

Vg<- vcomp[vcomp$grp=='phenoGID','vcov'] #genetic variance

Now we can calculate the reliability

rel<- 1-pev/Vg
head(rel)
##     (Intercept)
## 281   0.4850622
## 282   0.4850622
## 283   0.4850622
## 284   0.4850622
## 285   0.4850622
## 286   0.4850622

Notice that we get a reliability for each BLUP

Set up the GS model

The first step is to create the relationship matrix. The genotypes being used for model training and validation need to be included. Any genotypes that have not been phenotyped are not going to be helpful for validation, thus we remove them from the marker data before calculating the genomic relationship matrix

Msub<- M[levels(pheno$phenoGID),] #markers on phenotyped lines
A<- A.mat(Msub)

Next we create the response vector: y, and the design matrices X and Z. Recall that we need to pre-multiply these by the inverse of the square root of the residual covariance matrix to account for heterogeneous error variance. This is equivalent to dividing y, X, and Z by the vector of the standard errors of y.

y<- pheno$pValue 
X<- model.matrix(pValue~1+env, pheno)
Z<- model.matrix(pValue~0+phenoGID, pheno) 
y2<- y/pheno$se
X2<- X/pheno$se
Z2<- Z/pheno$se

Now we are prepared to fit the model. Here is the model with all the data included:

mod<- mixed.solve(y= y2, Z=Z2, K=A, X=X2)

From this model we can get the GEBVs, but unless the data are simulated, we cannot use these results to determine the accuracy of this model. We must conduct model validation to estimate accuracy.

Estimate accuracy using cross-validation

With cross-validation, we mask different groups of genotypes, referred to as folds, and fit the model.

We first need to assign individuals to folds.

ngeno<- ncol(Z2) #number of genotypes
folds<- rep(1:5, ngeno/5)
folds_random<- sample(folds)
folds_random
##   [1] 2 5 5 4 5 3 1 5 1 3 3 2 3 4 1 4 2 5 5 4 3 4 3 3 5 1 4 5 1 2 2 4 4 4 4 2 4
##  [38] 4 4 3 4 3 2 1 3 5 2 1 3 5 2 2 2 3 3 3 4 2 5 3 1 5 3 4 4 5 3 2 4 3 4 1 2 5
##  [75] 1 3 3 2 3 5 2 3 3 1 3 4 2 4 2 1 3 5 4 5 1 3 5 2 2 2 3 2 5 3 2 4 3 4 5 3 2
## [112] 1 3 1 5 4 2 2 2 5 5 1 3 5 3 1 3 1 5 1 5 3 5 2 5 1 1 2 1 1 1 1 1 3 3 3 2 2
## [149] 5 3 2 5 1 1 4 2 2 2 3 2 2 1 1 5 4 1 5 2 4 2 4 5 3 2 4 4 2 1 4 1 5 4 4 3 4
## [186] 1 2 1 1 1 1 3 2 2 4 4 3 5 3 1 4 1 2 4 5 5 3 5 3 5 5 1 4 2 1 2 2 4 1 4 4 5
## [223] 3 1 4 1 1 5 5 1 4 5 3 5 3 2 5 4 4 5 5 1 5 4 2 1 5 4 4 1

Look at what individuals belong to fold 1

fold1<- levels(pheno$phenoGID)[which(folds_random==1)]
fold1
##  [1] "287" "289" "295" "306" "309" "324" "328" "341" "352" "355" "364" "370"
## [13] "375" "392" "394" "402" "406" "408" "410" "416" "417" "419" "420" "421"
## [25] "422" "423" "433" "434" "442" "443" "446" "458" "460" "466" "468" "469"
## [37] "470" "471" "480" "482" "492" "495" "499" "504" "506" "507" "510" "522"
## [49] "526" "530"

Mask phenotypic data on individuals belonging to fold 1

y2na<- y2
y2na[pheno$phenoGID %in% fold1]<- NA
mod_minus1<- mixed.solve(y= y2na, Z=Z2, K=A, X=X2)

Get the GEBVs from the model excluding individuals in fold 1

gebv<- mod_minus1$u

Now from this vector of GEBVs, get only those pertaining to fold 1. These are the individuals for which we predicted breeding value without having their own phenotype data in the model.

gebv_fold1<- gebv[names(gebv) %in% fold1]

Next we need to correlate this to the corresponding blups in our validation vector. This is the predictive ability of our GS model

pa<- cor(blup_iid[fold1,], gebv_fold1) #predictive ability

To get the prediction accuracy, we need to divide the predictive ability by the square root of the reliability of the BLUPs used to validate the model.

GSacc<- pa/mean(sqrt(rel[fold1,]))

Now we have some estimate of the accuracy. We need to repeat this 4 more times, each time removing a different fold. The average GS accuracy across all folds is the 5-fold cross validation accuracy

5-fold cross validation in a loop

Now that I have explained how to estimate GS accuracy from cross validation. I will complete the whole process using a loop.

accvec<- c()
for(i in 1:5){
  foldi<- levels(pheno$phenoGID)[which(folds_random==i)]
  y2na<- y2
  y2na[pheno$phenoGID %in% foldi]<- NA
  mod_minusi<- mixed.solve(y= y2na, Z=Z2, K=A, X=X2)
  gebv<- mod_minusi$u
  gebv_foldi<- gebv[names(gebv) %in% foldi]
  pa<- cor(blup_iid[foldi,], gebv_foldi) 
  GSacc<- pa/mean(sqrt(rel[foldi,]))
  accvec<- append(accvec, GSacc)
}
accvec
## [1] 0.6060973 0.6944989 0.5180162 0.3211657 0.6346998

The final GS accuracy estimate is the average across the 5 folds

mean(accvec)
## [1] 0.5548956