Introduction

Best linear unbiased prediction (BLUP) was first derived by Charles (Chuck) Henderson while he was a graduate student at Iowa State College (now University). The challenge for animal breeders was that we were always interested in predicting the ‘breeding value’ (random variable) for animals. This was in contrast to many fields where the only interest is in some fixed quantity such as a treatment effect. Little attention was paid to mixed models until Henderson came along and derived BLUP.


What do you mean Best, Linear, Unbiased, and Prediction?

Best: Meaning the prediction error variance \(var(a-\hat{a})\) (the variance of the difference between the true and predicted values) is minimized, or equivalently the correlation between the true and predicted values is maximized.

Linear: Meaning the predicted EBVs are linear functions of the data (\(L'y\)).

Unbiased: Meaning the expected value of the predicted EBV and the true EBV are the same (\(E(\hat{a}) = E(a)\)).

Prediction: Meaning it involves a prediction of the true breeding value (random effect).

For more on this, please see the links from Gota and the exchange between Gianola and Gelman.


Notation

Please note, vectors and matrices are given in bold text. If you are not familiar with matrix algebra, please consult a textbook or the appendix of Mrode’s book.

Example

We will use the following example from Mrode’s book (chapter 3).

Calf   Sex      WWG
4      male     4.5
5      female   2.9
6      female   3.9
7      male     3.5
8      male     5.0

And the corresponding pedigree would be

Calf   Sire   Dam
1      0      0
2      0      0
3      0      0
4      1      0
5      3      2
6      1      2
7      4      5
8      3      6


Model

The simplest animal model is in the following form:

\[ \boldsymbol{y} = \boldsymbol{Xb} + \boldsymbol{Za} + \boldsymbol{e} \]

Where:

\(\boldsymbol{y}\) = vector of phenotypes

\(\boldsymbol{b}\) = vector of fixed effects (sometimes shown as \(\boldsymbol{\beta}\))

\(\boldsymbol{a}\) = vector of random breeding values (sometimes shown as \(u\))

\(\boldsymbol{e}\) = residuals from the model

\(\boldsymbol{X}\),\(\boldsymbol{Z}\) = incidence matrices for fixed and random (animal) effects, respectively

In this example the full-rank X is of order (5, 2): we fit each sex as its own column and drop the intercept term (the column of all 1’s). The Z matrix is of order (5, 8). Notice that both matrices have 5 rows because there are 5 observations in y. Z has 8 columns because we predict EBVs for all animals in the pedigree (including ancestors without records) through the A matrix, which is (8, 8) in this example. We simply place a 1 where the observation corresponds to that animal in the pedigree (column), so summing down a column of Z gives the number of observations for that animal. Here the first three animals in the pedigree have no records, so the first three columns of Z are all 0’s, and the remaining columns (if the animals are ordered) form an identity matrix of order (5, 5). The y vector is of order (5, 1), the b vector is (2, 1), and the a vector is (8, 1), one element per animal in the pedigree. Finally, the e vector is (5, 1), one residual per observation; it picks up everything the model fails to explain as a single ‘leftover’ effect.
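Written out explicitly for this example (and matching the X and Z printed by the R code further down, with rows in record order for animals 4 through 8 and the male column of X first), the incidence matrices are:

\[ \boldsymbol{X} = \left [ \begin{array}{cc} 1 & 0 \\ 0 & 1 \\ 0 & 1 \\ 1 & 0 \\ 1 & 0 \end{array} \right ], \quad \boldsymbol{Z} = \left [ \begin{array}{cccccccc} 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 \end{array} \right ] \]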

See Mrode’s Chapter 3 for information on this model if you are not familiar.

The variances are as follows:

\[ var(\boldsymbol{b}) = 0 \] because b contains fixed effects, which have no variance; they are treated as constants that would not change each time we ran the ‘experiment’.

Random effects, on the other hand, have variances (think of sampling their values from distributions with those variances).

\[ var(\boldsymbol{a}) = \boldsymbol{G} = \boldsymbol{A}\sigma^{2}_{a} \] \[ var(\boldsymbol{e}) = \boldsymbol{R} = \boldsymbol{I}\sigma^{2}_{e} \] \[ var(\boldsymbol{y}) = \boldsymbol{ZGZ'} + \boldsymbol{R} \]

The distributions of each random variable are as follows:

\[ \boldsymbol{a} \sim N(\boldsymbol{0},\boldsymbol{A}\sigma^{2}_{a}) \] \[ \boldsymbol{e} \sim N(\boldsymbol{0},\boldsymbol{I}\sigma^{2}_{e}) \]

This may be very abstract if you are just starting to learn linear models and/or BLUP. We will revisit this at the end and show you what it actually looks like. The variance of a vector is actually a matrix. Therefore, each breeding value has its own variance (proportional to the diagonals of A) and its own covariance with every other breeding value. This is what sets the animal model apart from many other statistical models: a structured covariance matrix for the random effects.


Mixed Model Equations

And the corresponding mixed model equations (MME) are typically shown as:

\[ \left [ \begin{array}{cl} \boldsymbol{X'X} & \boldsymbol{X'Z} \\ \boldsymbol{Z'X} & \boldsymbol{Z'Z} + \boldsymbol{A}^{-1} \lambda \\ \end{array} \right ] \left [ \begin{array}{c} \boldsymbol{b} \\ \boldsymbol{a} \end{array} \right ] = \left [ \begin{array}{c} \boldsymbol{X'y} \\ \boldsymbol{Z'y} \end{array} \right ] \]

where:

\(\lambda\) = \(\frac{\sigma^{2}_{e}}{\sigma^{2}_{a}}\)

and the rest has been described above.
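
One practical note: solving the MME gives the same \(\hat{\boldsymbol{a}}\) as Henderson's direct BLUP expression below, but without having to form and invert the (potentially huge) \(var(\boldsymbol{y}) = \boldsymbol{ZGZ'} + \boldsymbol{R}\) matrix, which is why the MME are used in practice:

\[ \hat{\boldsymbol{a}} = \boldsymbol{GZ'}(\boldsymbol{ZGZ'} + \boldsymbol{R})^{-1}(\boldsymbol{y} - \boldsymbol{X\hat{b}}) \]

where \(\hat{\boldsymbol{b}}\) is the generalized least squares estimate of \(\boldsymbol{b}\).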


Example of BLUP

Examples are the best way to learn animal breeding. Often things are taught theory first, followed by examples. Mrode’s book is set up well because the theory and equations are immediately followed by worked exercises. These help students learn: if you don’t know what X actually is, you cannot follow the theory being presented. Theory papers, in contrast, are written with the picture of the matrices already in the author's head, so they are very hard to read without that background and are often poorly defined for newcomers. This is mostly because theory is written for other theorists, where things don’t need to be explained in detail.

We will solve an example from Mrode’s book, Linear Models for the Prediction of Animal Breeding Values. This should not be viewed as a replacement for the book as the book has much more valuable information available for students.

The example dataset is as follows.

#----------------------------------------#
# Data
#----------------------------------------#

# given variance components (page 37): sigma^2_e and sigma^2_a
  sigmaE <- 40    # residual variance
  sigmaA <- 20    # additive genetic variance
  alpha  <- sigmaE / sigmaA
  print(alpha)
## [1] 2
# Data (page 37)
  calf   <- c(4:8)
  sex    <- c("male", "female", "female", "male", "male")
  WWG    <- c(4.5, 2.9, 3.9, 3.5, 5.0)

# Dataframe
  data.3.1 <- data.frame(calf, sex, WWG)
  rm(list=(c("calf", "sex", "WWG")))
  
# print data
  print(data.3.1)
##   calf    sex WWG
## 1    4   male 4.5
## 2    5 female 2.9
## 3    6 female 3.9
## 4    7   male 3.5
## 5    8   male 5.0

The pedigree must be set up from the data. This step may be difficult for new students, but it is not the topic of this tutorial.

#----------------------------------------#
# Pedigree
#----------------------------------------#

# set up pedigree
  Calf    <- c(1:8)
  Sire    <- c(0, 0, 0, 1, 3, 1, 4, 3)
  Dam     <- c(0, 0, 0, 0, 2, 2, 5, 6)
  ped.3.1 <- data.frame(Calf, Sire, Dam)
  rm(list=c("Calf","Sire","Dam"))
  
  print(ped.3.1)
##   Calf Sire Dam
## 1    1    0   0
## 2    2    0   0
## 3    3    0   0
## 4    4    1   0
## 5    5    3   2
## 6    6    1   2
## 7    7    4   5
## 8    8    3   6

Now we need to create A (numerator relationship matrix).
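
The createA() function used below is not a base R function; it presumably comes from the author's own helper code. If you do not have it, here is a minimal sketch using the tabular (recursive) method, assuming the pedigree is sorted so parents appear before their offspring and 0 denotes an unknown parent:

# minimal sketch of createA() using the tabular method
# (assumes parents precede offspring and 0 = unknown parent)
createA <- function(ped) {
  n <- nrow(ped)
  A <- matrix(0, nrow=n, ncol=n)
  for (i in 1:n) {
    s <- ped$Sire[i]
    d <- ped$Dam[i]
    # diagonal: 1 + half the relationship between the two parents (inbreeding)
    A[i, i] <- if (s > 0 && d > 0) 1 + 0.5 * A[s, d] else 1
    if (i > 1) {
      for (j in 1:(i - 1)) {
        # off-diagonal: average relationship of j with i's parents
        a_js <- if (s > 0) A[j, s] else 0
        a_jd <- if (d > 0) A[j, d] else 0
        A[i, j] <- A[j, i] <- 0.5 * (a_js + a_jd)
      }
    }
  }
  return(A)
}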

# Create A (the Matrix package is used for sparse matrices throughout)
library(Matrix)
A <- Matrix(createA(ped.3.1))
##   Time it took (sec)            :0.000040
A <- as(A, "sparseMatrix")
print(A)
## 8 x 8 sparse Matrix of class "dsCMatrix"
##                                                
## [1,] 1.00 .    .    0.500 .     0.50 0.25 0.250
## [2,] .    1.00 .    .     0.500 0.50 0.25 0.250
## [3,] .    .    1.00 .     0.500 .    0.25 0.500
## [4,] 0.50 .    .    1.000 .     0.25 0.50 0.125
## [5,] .    0.50 0.50 .     1.000 0.25 0.50 0.375
## [6,] 0.50 0.50 .    0.250 0.250 1.00 0.25 0.500
## [7,] 0.25 0.25 0.25 0.500 0.500 0.25 1.00 0.250
## [8,] 0.25 0.25 0.50 0.125 0.375 0.50 0.25 1.000

Next, invert the A matrix. The mixed model equations do not require A itself; they require the inverse of A.

# inverse of A
Ainv <- solve(A)
print(Ainv)
## 8 x 8 sparse Matrix of class "dgCMatrix"
##                                                     
## [1,]  1.8333333  0.5  .   -0.6666667  .   -1.0  .  .
## [2,]  0.5000000  2.0  0.5  .         -1.0 -1.0  .  .
## [3,]  .          0.5  2.0  .         -1.0  0.5  . -1
## [4,] -0.6666667  .    .    1.8333333  0.5  .   -1  .
## [5,]  .         -1.0 -1.0  0.5000000  2.5  .   -1  .
## [6,] -1.0000000 -1.0  0.5  .          .    2.5  . -1
## [7,]  .          .    .   -1.0000000 -1.0  .    2  .
## [8,]  .          .   -1.0  .          .   -1.0  .  2

Let’s now create the individual matrices for the animal model.

# set response variable
  y <- Matrix(data.3.1$WWG, ncol=1)
  print(y)
## 5 x 1 Matrix of class "dgeMatrix"
##      [,1]
## [1,]  4.5
## [2,]  2.9
## [3,]  3.9
## [4,]  3.5
## [5,]  5.0
  data.3.1$sex <- factor(data.3.1$sex, levels=c("male", "female"))

# set up X
  X <- Matrix(model.matrix(~ sex - 1, data=data.3.1))
  print(X)
## 5 x 2 Matrix of class "dgeMatrix"
##   sexmale sexfemale
## 1       1         0
## 2       0         1
## 3       0         1
## 4       1         0
## 5       1         0
# set up Z (n by n*) n* = number of animals in whole pedigree (8)
  Z <- Matrix(0, nrow=nrow(data.3.1), ncol=nrow(ped.3.1))
  
# change row/column names
  colnames(Z) <- ped.3.1$Calf
  rownames(Z) <- data.3.1$calf

# fill in Z
  for (i in 1:nrow(data.3.1)) {
    index = data.3.1[i, "calf" ]
    Z[i, index] = 1
  }
  
# print Z
  print(Z)
## 5 x 8 sparse Matrix of class "dgCMatrix"
##   1 2 3 4 5 6 7 8
## 4 . . . 1 . . . .
## 5 . . . . 1 . . .
## 6 . . . . . 1 . .
## 7 . . . . . . 1 .
## 8 . . . . . . . 1

The following function can be used for any basic animal model MME.

# function to set up and solve the MME (returns LHS, RHS, and solutions)
  basicMME <- function(X, Z, A, y, alpha) {
    
    # Calculate blocks of LHS
    XpX = crossprod(X)                           # same as t(X) %*% X
    ZpZ = crossprod(Z) + (solve(A) * alpha)      # same as (t(Z) %*% Z) + (solve(A)*alpha)
    XpZ = crossprod(X, Z)                        # same as t(X) %*% Z
    
    # Paste top and bottom together for LHS
    toprow    = cbind(XpX,    XpZ)
    bottomrow = cbind(t(XpZ), ZpZ)
    
    # Put top and bottom together for left hand side (LHS)
    LHS = rbind(toprow, bottomrow)
    
    # Elements for RHS
    XpY = crossprod(X, y)
    ZpY = crossprod(Z, y)
    
    # Calculate right hand side (RHS)
    RHS = rbind(XpY, ZpY) 
    
    # calculate solutions by direct inversion of LHS
    solutions = solve(LHS) %*% RHS
    
    # Return LHS, RHS, and solutions
    return(list(LHS=LHS, RHS=RHS, solutions=solutions))
    
  }
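
For a tiny example like this, directly inverting the LHS is fine. In practice, national evaluations with millions of animals never form the inverse explicitly; the MME are solved iteratively (for example with Gauss-Seidel or preconditioned conjugate gradient methods), and only the required elements of the inverse are approximated when accuracies are needed.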

Finally we can use the MME solver function to solve the MME for both the estimated fixed effects and the predicted random effects. This function returns the left hand side (LHS), the right hand side (RHS), and the vector of solutions.

output <- basicMME(X, Z, A, y, alpha)

Print the LHS!

# print the LHS
LHS <- output[[1]]
print(LHS)
## 10 x 10 sparse Matrix of class "dgCMatrix"
##    [[ suppressing 10 column names 'sexmale', 'sexfemale', '1' ... ]]
##                                                    
## sexmale   3 .  .         .  .  1.000000  .  .  1  1
## sexfemale . 2  .         .  .  .         1  1  .  .
## 1         . .  3.666667  1  . -1.333333  . -2  .  .
## 2         . .  1.000000  4  1  .        -2 -2  .  .
## 3         . .  .         1  4  .        -2  1  . -2
## 4         1 . -1.333333  .  .  4.666667  1  . -2  .
## 5         . 1  .        -2 -2  1.000000  6  . -2  .
## 6         . 1 -2.000000 -2  1  .         .  6  . -2
## 7         1 .  .         .  . -2.000000 -2  .  5  .
## 8         1 .  .         . -2  .         . -2  .  5

Print the inverse LHS!

IMPORTANT: Together with the RHS, the inverse of the LHS shows you the weights in the linear combination that produces each solution.

# print the inverse of LHS
LHSinv <- solve(LHS)
print(round(LHSinv, 4))
## 10 x 10 sparse Matrix of class "dgCMatrix"
##                                                                      
##  [1,]  0.5956  0.1573 -0.1641 -0.0836 -0.1306 -0.2646 -0.1483 -0.1663
##  [2,]  0.1573  0.8025 -0.1329 -0.2413 -0.1120 -0.0873 -0.2989 -0.3060
##  [3,] -0.1641 -0.1329  0.4711  0.0069  0.0326  0.2195  0.0450  0.2208
##  [4,] -0.0836 -0.2413  0.0069  0.4921 -0.0103  0.0204  0.2373  0.2452
##  [5,] -0.1306 -0.1120  0.0326 -0.0103  0.4565  0.0481  0.2013  0.0226
##  [6,] -0.2646 -0.0873  0.2195  0.0204  0.0481  0.4277  0.0470  0.1276
##  [7,] -0.1483 -0.2989  0.0450  0.2373  0.2013  0.0470  0.4281  0.1697
##  [8,] -0.1663 -0.3060  0.2208  0.2452  0.0226  0.1276  0.1697  0.4423
##  [9,] -0.2842 -0.1859  0.1386  0.1198  0.1259  0.2428  0.2197  0.1522
## [10,] -0.2379 -0.1986  0.1342  0.1107  0.2177  0.1232  0.1781  0.2192
##                      
##  [1,] -0.2842 -0.2379
##  [2,] -0.1859 -0.1986
##  [3,]  0.1386  0.1342
##  [4,]  0.1198  0.1107
##  [5,]  0.1259  0.2177
##  [6,]  0.2428  0.1232
##  [7,]  0.2197  0.1781
##  [8,]  0.1522  0.2192
##  [9,]  0.4419  0.1681
## [10,]  0.1681  0.4224

Print the RHS!

# print the RHS
RHS <- output[[2]]
print(RHS)
## 10 x 1 Matrix of class "dgeMatrix"
##           [,1]
## sexmale   13.0
## sexfemale  6.8
## 1          0.0
## 2          0.0
## 3          0.0
## 4          4.5
## 5          2.9
## 6          3.9
## 7          3.5
## 8          5.0

Print the solutions! (\(\boldsymbol{b}\) and \(\boldsymbol{a}\))

# print the solutions
solutions <- output[[3]]
rownames(solutions) <- rownames(RHS)
print(solutions)
## 10 x 1 Matrix of class "dgeMatrix"
##                   [,1]
## sexmale    4.358502330
## sexfemale  3.404430006
## 1          0.098444576
## 2         -0.018770099
## 3         -0.041084203
## 4         -0.008663123
## 5         -0.185732099
## 6          0.176872088
## 7         -0.249458555
## 8          0.182614688

Using the inverse of the LHS and the RHS, we can see the linear combination that produces the EBV of animal 1 (the third equation in the system, after the two sex effects).

\[EBV_{1} = -0.1641(13.0) - 0.1329(6.8) + 0.4711(0) + 0.0069(0) + 0.0326(0) + \] \[0.2195(4.5) + 0.0450(2.9) + 0.2208(3.9) + 0.1386(3.5) + 0.1342(5.0) \approx 0.0984\]

You can now see that the EBV for animal 1 is a linear combination: the corresponding row of the inverse LHS multiplied element-wise by the RHS and summed.
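
We can verify this directly in R with the LHSinv and RHS objects created above (row 3 of the inverse LHS corresponds to animal 1):

# row 3 of the inverse LHS times the RHS reproduces animal 1's EBV
sum(LHSinv[3, ] * as.vector(RHS))
# should return approximately 0.0984, matching the solution for animal 1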

Where are the highest weights? The highest weight (0.4711) is on animal 1’s own record, but it does not have one, so that term contributes 0. You can see a high 0.2195 weight on animal 4 (its offspring), a 0.2208 weight on animal 6 (also an offspring), and weights of approximately 0.13 on animals 7 and 8, its grand-offspring. Notice that the weights are proportional to the A matrix. Other weights are much smaller, but notice they are not zero. This is an advantage of GBLUP, as the genomic G matrix allows elements that are ‘unrelated’ in A to become more or less related in G, altering these weights. The RHS does not change in GBLUP; the weights on each phenotype change. I believe it’s still unclear whether the main advantage is changing the relationships among full/half sibs or among more distant relatives; likely it’s a combination of both.

Variances

Now back to the variances of \(\boldsymbol{a}\), \(\boldsymbol{e}\), and \(\boldsymbol{y}\).

\[ var(\boldsymbol{a}) = \boldsymbol{G} = \boldsymbol{A}\sigma^{2}_{a} \]

# variance of a
G <- A*sigmaA

# print G
print(G)
## 8 x 8 sparse Matrix of class "dsCMatrix"
##                                   
## [1,] 20  .  . 10.0  .   10  5  5.0
## [2,]  . 20  .  .   10.0 10  5  5.0
## [3,]  .  . 20  .   10.0  .  5 10.0
## [4,] 10  .  . 20.0  .    5 10  2.5
## [5,]  . 10 10  .   20.0  5 10  7.5
## [6,] 10 10  .  5.0  5.0 20  5 10.0
## [7,]  5  5  5 10.0 10.0  5 20  5.0
## [8,]  5  5 10  2.5  7.5 10  5 20.0

\[ var(\boldsymbol{e}) = \boldsymbol{R} = \boldsymbol{I}\sigma^{2}_{e} \]

# variance of e
R <- Matrix(diag(nrow(data.3.1))*sigmaE)

# print R
print(R)
##      [,1] [,2] [,3] [,4] [,5]
## [1,]   40    .    .    .    .
## [2,]    .   40    .    .    .
## [3,]    .    .   40    .    .
## [4,]    .    .    .   40    .
## [5,]    .    .    .    .   40

\[ var(\boldsymbol{y}) = \boldsymbol{V} = \boldsymbol{ZGZ'} + \boldsymbol{R} \]

# variance of y
V <- Z %*% G %*% t(Z) + R

# print V
print(V)
## 5 x 5 sparse Matrix of class "dgCMatrix"
##      4    5  6  7    8
## 4 60.0  .    5 10  2.5
## 5  .   60.0  5 10  7.5
## 6  5.0  5.0 60  5 10.0
## 7 10.0 10.0  5 60  5.0
## 8  2.5  7.5 10  5 60.0

Accuracy of EBVs

The accuracy is the correlation between the true and predicted breeding values. This is an easy concept from a population standpoint, but the accuracy for one individual animal is a more difficult concept. In a conversation with Rohan Fernando, he explained that you can think of generating a vector of true breeding values (TBVs) from the (co)variance matrix, the A matrix times the additive genetic variance (\(\boldsymbol{A}\sigma^{2}_{a}\)). This results in a new TBV each round of sampling (as in the frequentist idea of repeated sampling). So each round the TBV changes, and the accuracy reflects how closely the EBV from the animal model predicts the TBV across those rounds.

To actually calculate the accuracy we need the complete inverse of the LHS matrix.

\[ \left [ \begin{array}{cc} \boldsymbol{C_{11}} & \boldsymbol{C_{12}} \\ \boldsymbol{C_{21}} & \boldsymbol{C_{22}} \\ \end{array} \right ] \]

The \(\boldsymbol{C_{11}}\) block refers to the \(\boldsymbol{X'X}\) part of the MME, \(\boldsymbol{C_{12}}\) to the \(\boldsymbol{X'Z}\) part, and \(\boldsymbol{C_{22}}\) to the \(\boldsymbol{Z'Z} + \boldsymbol{A}^{-1}\lambda\) part.

And the inverse of this matrix can be represented:

\[ \left [ \begin{array}{cc} \boldsymbol{C^{11}} & \boldsymbol{C^{12}} \\ \boldsymbol{C^{21}} & \boldsymbol{C^{22}} \\ \end{array} \right ] \]

This is the entire LHS inverted and then partitioned into submatrices.

The prediction error variance (PEV) is:

\[PEV = var(a - \hat{a}) = \boldsymbol{C^{22}}\sigma^{2}_{e}\]

This means that the variance of the difference between the true and estimated breeding values is equal to the ‘EBV part’ of the inverted LHS of the MME times \(\sigma^{2}_{e}\).

For an individual:

\[PEV_{i} = C^{22}_{ii}\sigma^{2}_{e} = (1 - r^{2}_{i})\sigma^{2}_{a}\]

This is the fraction of the additive variance that is missed by the prediction. If \(r^{2}_{i}\) is equal to 1, the PEV for animal i is 0, because \((1 - r^{2}_{i})\) goes to 0 as more information accumulates.
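
Rearranging this gives the formula used in the code below, where \(d_{i}\) is the i-th diagonal element of \(\boldsymbol{C^{22}}\) (the diagonal of the inverse LHS for that animal):

\[ r^{2}_{i} = 1 - \frac{PEV_{i}}{\sigma^{2}_{a}} = 1 - \frac{d_{i}\,\sigma^{2}_{e}}{\sigma^{2}_{a}} = 1 - d_{i}\,\alpha \]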

# Take the inverse of the LHS (we will multiply by sigma^2_e in the next step)
solve(LHS)
## 10 x 10 sparse Matrix of class "dgCMatrix"
##                                                                    
##  [1,]  0.59556097  0.15730213 -0.164119413 -0.083624565 -0.13059083
##  [2,]  0.15730213  0.80245865 -0.132863260 -0.241250738 -0.11196840
##  [3,] -0.16411941 -0.13286326  0.471094211  0.006928037  0.03264668
##  [4,] -0.08362457 -0.24125074  0.006928037  0.492095721 -0.01030797
##  [5,] -0.13059083 -0.11196840  0.032646682 -0.010307967  0.45645878
##  [6,] -0.26455749 -0.08730803  0.219543709  0.020390333  0.04812709
##  [7,] -0.14827804 -0.29891465  0.044952254  0.237345770  0.20132326
##  [8,] -0.16632621 -0.30600266  0.220774267  0.245155707  0.02261354
##  [9,] -0.28424641 -0.18594950  0.138622268  0.119819354  0.12589831
## [10,] -0.23787901 -0.19864885  0.134192262  0.110664009  0.21774710
##                                                                
##  [1,] -0.26455749 -0.14827804 -0.16632621 -0.2842464 -0.2378790
##  [2,] -0.08730803 -0.29891465 -0.30600266 -0.1859495 -0.1986488
##  [3,]  0.21954371  0.04495225  0.22077427  0.1386223  0.1341923
##  [4,]  0.02039033  0.23734577  0.24515571  0.1198194  0.1106640
##  [5,]  0.04812709  0.20132326  0.02261354  0.1258983  0.2177471
##  [6,]  0.42768015  0.04704420  0.12757186  0.2428012  0.1231911
##  [7,]  0.04704420  0.42810675  0.16972255  0.2197160  0.1780739
##  [8,]  0.12757186  0.16972255  0.44228277  0.1521830  0.2192238
##  [9,]  0.24280124  0.21971599  0.15218301  0.4418562  0.1680818
## [10,]  0.12319108  0.17807393  0.21922376  0.1680818  0.4223641

Then we can multiply this by \(\sigma^{2}_{e}\) to get the PEVs of the EBVs (on the diagonals).

solve(LHS)*sigmaE
## 10 x 10 sparse Matrix of class "dgCMatrix"
##                                                                         
##  [1,]  23.822439   6.292085 -6.5647765 -3.3449826 -5.2236333 -10.5822997
##  [2,]   6.292085  32.098346 -5.3145304 -9.6500295 -4.4787360  -3.4923213
##  [3,]  -6.564777  -5.314530 18.8437685  0.2771215  1.3058673   8.7817484
##  [4,]  -3.344983  -9.650030  0.2771215 19.6838288 -0.4123187   0.8156133
##  [5,]  -5.223633  -4.478736  1.3058673 -0.4123187 18.2583514   1.9250837
##  [6,] -10.582300  -3.492321  8.7817484  0.8156133  1.9250837  17.1072061
##  [7,]  -5.931122 -11.956586  1.7980902  9.4938308  8.0529304   1.8817681
##  [8,]  -6.653049 -12.240106  8.8309707  9.8062283  0.9045416   5.1028746
##  [9,] -11.369856  -7.437980  5.5448907  4.7927742  5.0359323   9.7120496
## [10,]  -9.515160  -7.945954  5.3676905  4.4265603  8.7098838   4.9276432
##                                                  
##  [1,]  -5.931122  -6.6530485 -11.369856 -9.515160
##  [2,] -11.956586 -12.2401063  -7.437980 -7.945954
##  [3,]   1.798090   8.8309707   5.544891  5.367690
##  [4,]   9.493831   9.8062283   4.792774  4.426560
##  [5,]   8.052930   0.9045416   5.035932  8.709884
##  [6,]   1.881768   5.1028746   9.712050  4.927643
##  [7,]  17.124270   6.7889020   8.788639  7.122957
##  [8,]   6.788902  17.6913106   6.087320  8.768951
##  [9,]   8.788639   6.0873203  17.674247  6.723272
## [10,]   7.122957   8.7689506   6.723272 16.894566

To get the accuracies we need to extract either the PEVs or the diagonals of the inverse of the LHS.

# create data frame
data.acc <- data.frame(Animal = 1:8)

# create column with diagonals of the inverse LHS (animal equations only)
data.acc$DiagLHS = diag(solve(LHS))[3:10]

# calculate reliability from the diagonals of the inverse LHS (alpha = sigmaE / sigmaA)
data.acc$r2 = 1 - data.acc$DiagLHS*alpha

# calculate the accuracy from reliability (r^2)
data.acc$r = sqrt(data.acc$r2)

# calculate Standard Error of Prediction (SEP) (using diagonals and the variance of e)
data.acc$SEP = sqrt(data.acc$DiagLHS*sigmaE)

# now the variance (square the standard error of prediction (SEP))
data.acc$PEV = data.acc$SEP^2

# print
print(data.acc)
##   Animal   DiagLHS         r2         r      SEP      PEV
## 1      1 0.4710942 0.05781158 0.2404404 4.340941 18.84377
## 2      2 0.4920957 0.01580856 0.1257321 4.436646 19.68383
## 3      3 0.4564588 0.08708243 0.2950973 4.272979 18.25835
## 4      4 0.4276802 0.14463969 0.3803153 4.136086 17.10721
## 5      5 0.4281067 0.14378651 0.3791919 4.138148 17.12427
## 6      6 0.4422828 0.11543447 0.3397565 4.206104 17.69131
## 7      7 0.4418562 0.11628766 0.3410098 4.204075 17.67425
## 8      8 0.4223641 0.15527171 0.3940453 4.110300 16.89457
# histograms (dplyr, tidyr, forcats, and ggplot2 are needed here)
library(dplyr); library(tidyr); library(forcats); library(ggplot2)
data.acc %>%
  gather(Type, Value, DiagLHS:PEV) %>%
  mutate(Type = fct_relevel(Type, "DiagLHS", "r2", "r", "SEP", "PEV")) %>%
ggplot(., aes(x=Value)) +
  geom_histogram(bins=5, color="white", fill="dodgerblue3") +
  facet_wrap(~ Type, scales="free") +
  theme_bw()

Of course they are all highly correlated, but reporting \(r^2\) instead of accuracy gives more credibility to sires with very high accuracy (typically AI studs). The beef and dairy industries use BIF accuracy and \(r^2\), respectively, which favors AI studs so that farmers use their bulls. Otherwise, farmers would realize that testing their own bulls already gives fairly high accuracy and might not buy AI semen. You could argue that this is a good thing, as using AI ties together relationships between contemporary groups and is good for the industry. Others may think this is not right.

Relationship between BIF/reliability and true accuracy

#==============================================================================#
# Plot BIF vs True accuracy
#==============================================================================#

# BIF accuracy = 1 - sqrt(1 - r^2)

# Accuracy = sqrt(1 - (1-BIF)^2)

#==============================================================================#
# Setup
#==============================================================================#

# my ggplot2 theme
  my_gg_theme <- theme_light() + 
                 theme(text = element_text(size=16), 
                 plot.title = element_text(hjust=0.5),
                 panel.grid = element_blank())

#==============================================================================#
# Code
#==============================================================================#
  
# vector of accuracies
  Acc    <- seq(0,1,by=0.01)
  r2     <- Acc^2
  BIFacc <- 1 - sqrt(1 - Acc^2)
  
# combine into dataset
  data <- data.frame(Seq = seq(0,1,by=0.01), Acc = Acc, r2 = r2, BIFAcc = BIFacc)

# melt
  data.gather <- gather(data, Type, Value, Acc:BIFAcc) %>%
    mutate(Type = fct_relevel(Type, "Acc", "r2", "BIFAcc"))

# plot
  print(ggplot(data.gather, aes(x=Seq, y=Value, color=Type, shape=Type, group=Type)) +
    geom_line() +
    geom_point() +
    scale_color_manual("Accuracy", values=c("dodgerblue3", "blue", "magenta"),
        labels = c("True Acc", "Reliability", "BIF Acc")) +
    scale_shape_manual("Accuracy", values=c(16,17,15),
        labels = c("True Acc", "Reliability", "BIF Acc")) +
    ylab("Reported Accuracy") +
    xlab("True Accuracy") +
    my_gg_theme)

We can now clearly see that BIF accuracy is the most stringent. Basically this means that a bull has to have a true accuracy near 0.99 before the reported BIF accuracy catches up to it and he looks ‘proven’. This is because AI stud owners know that testing your own animals on highly heritable traits already gives a high accuracy. For instance, a single own record on a trait with high heritability (0.5) gives an accuracy of \(\sqrt{0.5}=0.707\). With accuracies like that, farmers might just test their own bulls and decide that is good enough.

Should we care about the accuracy?

What do you do if you have a young bull with a slightly higher EBV, but a much lower accuracy than an older bull that is ‘proven’? For some this may be a very difficult question, and it is something industries have to deal with on a regular basis. The answer given to me by Rohan Fernando (passed down from Chuck Henderson himself) was that we should always take the animal with the higher EBV, no matter the accuracy. Even though it has a larger variance, in animal breeding we need to make progress, so having the chance at more extreme animals to move the population is of interest. Also, my opinion is that, on average, the next generation should be expected to have a higher mean genetic value than the previous generation. Given that generation interval is a big part of animal breeding, it also makes sense to take the younger animal.

Illustration through simulation

One can use the parameters given (or assumed) here to simulate samples!

You have to imagine that the fixed part (sex) does not change over replicates, but the random effects do. Each round a different vector of true breeding values (TBVs) is sampled. You can think of the pedigree (A) as fixed, and we then generate new samples based on the mean (0) and variance (\(\boldsymbol{A}\sigma^{2}_{a}\)).

# load MASS for mvrnorm function
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
# we can generate TBVs using the mean (0) and variance (A*sigma2_a)
mu = rep(0, 8)
Sigma = A*sigmaA

# create 10,000 samples of TBVs (random effects change with each sampling)
TBV.samples <- mvrnorm(n=10000, mu=mu, Sigma=Sigma)

# row/col names
rownames(TBV.samples) <- 1:nrow(TBV.samples)
colnames(TBV.samples) <- 1:8

# now look at the covariances between them!
round(cov(TBV.samples), 3)
##        1      2      3      4      5      6      7      8
## 1 20.527  0.460 -0.364 10.243  0.054 10.632  4.887  5.226
## 2  0.460 20.046 -0.199  0.206  9.853 10.308  4.809  4.948
## 3 -0.364 -0.199 20.601 -0.311 10.099  0.042  4.931 10.107
## 4 10.243  0.206 -0.311 20.268 -0.099  5.355 10.170  2.429
## 5  0.054  9.853 10.099 -0.099 19.866  5.091  9.849  7.463
## 6 10.632 10.308  0.042  5.355  5.091 20.893  5.049 10.464
## 7  4.887  4.809  4.931 10.170  9.849  5.049 20.027  4.963
## 8  5.226  4.948 10.107  2.429  7.463 10.464  4.963 20.238

Does this (co)variance matrix look familiar? If not, take a look again at this matrix:

# multiply A and sigma^2_a
A * sigmaA
## 8 x 8 sparse Matrix of class "dsCMatrix"
##                                   
## [1,] 20  .  . 10.0  .   10  5  5.0
## [2,]  . 20  .  .   10.0 10  5  5.0
## [3,]  .  . 20  .   10.0  .  5 10.0
## [4,] 10  .  . 20.0  .    5 10  2.5
## [5,]  . 10 10  .   20.0  5 10  7.5
## [6,] 10 10  .  5.0  5.0 20  5 10.0
## [7,]  5  5  5 10.0 10.0  5 20  5.0
## [8,]  5  5 10  2.5  7.5 10  5 20.0
# now look at the correlations between them!
round(cor(TBV.samples), 3)
##        1      2      3      4      5     6     7     8
## 1  1.000  0.023 -0.018  0.502  0.003 0.513 0.241 0.256
## 2  0.023  1.000 -0.010  0.010  0.494 0.504 0.240 0.246
## 3 -0.018 -0.010  1.000 -0.015  0.499 0.002 0.243 0.495
## 4  0.502  0.010 -0.015  1.000 -0.005 0.260 0.505 0.120
## 5  0.003  0.494  0.499 -0.005  1.000 0.250 0.494 0.372
## 6  0.513  0.504  0.002  0.260  0.250 1.000 0.247 0.509
## 7  0.241  0.240  0.243  0.505  0.494 0.247 1.000 0.247
## 8  0.256  0.246  0.495  0.120  0.372 0.509 0.247 1.000

Does this look familiar? It should look very similar to the A matrix. It is actually an estimate of the A matrix rescaled to a correlation matrix.
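
You can check this directly by rescaling A to a correlation matrix:

# rescale A to a correlation matrix; since all diagonals of A are 1 here,
# this is identical to A itself in this example
round(cov2cor(as.matrix(A)), 3)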

Now we need to generate phenotypes!

# sex
sex <- c("male", "female", "female", "male", "male")

# Let us assume that the TRUE fixed effects are really 4.4 and 3.4
sex.solutions <- Matrix(c(4.4, 3.4), ncol=1, nrow=2) 

# Get fixed part of Model
fixed <- X %*% sex.solutions

# replicate
fixed <- Matrix(rep(fixed, 10000), nrow=5, ncol=10000)

# Get random part of the model (Z is (5,8) and t(TBV.samples) is (8,10000),
# so the result will be (5,10000))
random_a <- Z %*% t(TBV.samples)

# so to generate y we also need the residuals
random_e <- t(mvrnorm(n=10000, mu=rep(0,5), Sigma=diag(5)*sigmaE))

# finally we can generate y
random_y <- data.frame(as.matrix(t(fixed + random_a + random_e)))

# cov
cov(random_y)
##            X4         X5        X6        X7        X8
## X4 60.1153201  0.2317527  4.482185 11.319483  1.966944
## X5  0.2317527 59.8267585  5.642137  9.995456  8.102444
## X6  4.4821846  5.6421367 61.211254  5.732583  8.779809
## X7 11.3194826  9.9954562  5.732583 60.691265  5.318896
## X8  1.9669445  8.1024445  8.779809  5.318896 59.644530

This looks a lot like the variance of our y (\(\boldsymbol{ZGZ'} + \boldsymbol{R}\)).

# variance for y
Z%*%G%*%t(Z) + R
## 5 x 5 sparse Matrix of class "dgCMatrix"
##      4    5  6  7    8
## 4 60.0  .    5 10  2.5
## 5  .   60.0  5 10  7.5
## 6  5.0  5.0 60  5 10.0
## 7 10.0 10.0  5 60  5.0
## 8  2.5  7.5 10  5 60.0

Now let’s look at the means for the males.

# males are columns 1, 4, 5 (animals 4, 7, 8); females are columns 2, 3 (animals 5, 6)
mean(random_y[, 1])
## [1] 4.322791
mean(random_y[, 4])
## [1] 4.407806
mean(random_y[, 5])
## [1] 4.319868

Now the means of females.

# females
mean(random_y[, 2])
## [1] 3.380125
mean(random_y[, 3])
## [1] 3.373854

Hopefully these look familiar: they are close to the fixed effects (4.4 and 3.4) we assumed when we simulated the data.
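
As one last (optional) sketch, tying this back to the accuracy section: using the objects created above, we can solve the MME for every simulated replicate and correlate the resulting EBVs with the TBVs. Across replicates, the per-animal correlations should approach the theoretical accuracies in data.acc$r.

# Solve the MME for all 10,000 simulated replicates at once
Ymat    <- t(as.matrix(random_y))                          # 5 x 10000 matrix of simulated records
RHS.all <- rbind(crossprod(X, Ymat), crossprod(Z, Ymat))   # 10 x 10000 right hand sides
sol.all <- LHSinv %*% RHS.all                              # 10 x 10000 solutions

# EBVs are rows 3-10 of the solutions (rows 1-2 are the sex effects)
EBV.samples <- t(as.matrix(sol.all[3:10, ]))               # 10000 x 8

# empirical accuracy per animal vs the theoretical accuracy from the inverse LHS
round(diag(cor(EBV.samples, TBV.samples)), 3)
round(data.acc$r, 3)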

Mistakes, Errors, Comments

If you see anything wrong or have comments (even stuff to add), please let me know ASAP and I can try to integrate this. Email is below.

 


A work by Austin Putz

putz.austin@gmail.com